Merge tag 'drm-msm-fixes-2020-02-16' of https://gitlab.freedesktop.org/drm/msm into...

author Dave Airlie <airlied@redhat.com>

Thu, 20 Feb 2020 01:01:28 +0000 (11:01 +1000)

committer Dave Airlie <airlied@redhat.com>

Thu, 20 Feb 2020 01:01:33 +0000 (11:01 +1000)
author Dave Airlie <airlied@redhat.com>
Thu, 20 Feb 2020 01:01:28 +0000 (11:01 +1000)
committer Dave Airlie <airlied@redhat.com>
Thu, 20 Feb 2020 01:01:33 +0000 (11:01 +1000)
diff --git a/Documentation/devicetree/bindings/input/ilitek,ili2xxx.txt b/Documentation/devicetree/bindings/input/ilitek,ili2xxx.txt

index dc194b2..cdcaa3f 100644 (file)
--- a/Documentation/devicetree/bindings/input/ilitek,ili2xxx.txt
+++ b/Documentation/devicetree/bindings/input/ilitek,ili2xxx.txt
@@ -1,9 +1,10 @@
-Ilitek ILI210x/ILI2117/ILI251x touchscreen controller
+Ilitek ILI210x/ILI2117/ILI2120/ILI251x touchscreen controller
  
  Required properties:
  - compatible:
      ilitek,ili210x for ILI210x
      ilitek,ili2117 for ILI2117
+    ilitek,ili2120 for ILI2120
      ilitek,ili251x for ILI251x
  
  - reg: The I2C address of the device
diff --git a/Documentation/driver-api/ipmb.rst b/Documentation/driver-api/ipmb.rst

index 3ec3bae..209c49e 100644 (file)
--- a/Documentation/driver-api/ipmb.rst
+++ b/Documentation/driver-api/ipmb.rst
@@ -71,9 +71,13 @@ b) Example for device tree::
              ipmb@10 {
                      compatible = "ipmb-dev";
                      reg = <0x10>;
+                    i2c-protocol;
              };
       };
  
+If xmit of data to be done using raw i2c block vs smbus
+then "i2c-protocol" needs to be defined as above.
+
  2) Manually from Linux::
  
       modprobe ipmb-dev-int
diff --git a/Documentation/virt/guest-halt-polling.rst b/Documentation/virt/guest-halt-polling.rst

new file mode 100644 (file)

index 0000000..b4e7479
--- /dev/null
+++ b/Documentation/virt/guest-halt-polling.rst
@@ -0,0 +1,84 @@
+==================
+Guest halt polling
+==================
+
+The cpuidle_haltpoll driver, with the haltpoll governor, allows
+the guest vcpus to poll for a specified amount of time before
+halting.
+
+This provides the following benefits to host side polling:
+
+       1) The POLL flag is set while polling is performed, which allows
+          a remote vCPU to avoid sending an IPI (and the associated
+          cost of handling the IPI) when performing a wakeup.
+
+       2) The VM-exit cost can be avoided.
+
+The downside of guest side polling is that polling is performed
+even with other runnable tasks in the host.
+
+The basic logic as follows: A global value, guest_halt_poll_ns,
+is configured by the user, indicating the maximum amount of
+time polling is allowed. This value is fixed.
+
+Each vcpu has an adjustable guest_halt_poll_ns
+("per-cpu guest_halt_poll_ns"), which is adjusted by the algorithm
+in response to events (explained below).
+
+Module Parameters
+=================
+
+The haltpoll governor has 5 tunable module parameters:
+
+1) guest_halt_poll_ns:
+
+Maximum amount of time, in nanoseconds, that polling is
+performed before halting.
+
+Default: 200000
+
+2) guest_halt_poll_shrink:
+
+Division factor used to shrink per-cpu guest_halt_poll_ns when
+wakeup event occurs after the global guest_halt_poll_ns.
+
+Default: 2
+
+3) guest_halt_poll_grow:
+
+Multiplication factor used to grow per-cpu guest_halt_poll_ns
+when event occurs after per-cpu guest_halt_poll_ns
+but before global guest_halt_poll_ns.
+
+Default: 2
+
+4) guest_halt_poll_grow_start:
+
+The per-cpu guest_halt_poll_ns eventually reaches zero
+in case of an idle system. This value sets the initial
+per-cpu guest_halt_poll_ns when growing. This can
+be increased from 10000, to avoid misses during the initial
+growth stage:
+
+10k, 20k, 40k, ... (example assumes guest_halt_poll_grow=2).
+
+Default: 50000
+
+5) guest_halt_poll_allow_shrink:
+
+Bool parameter which allows shrinking. Set to N
+to avoid it (per-cpu guest_halt_poll_ns will remain
+high once achieves global guest_halt_poll_ns value).
+
+Default: Y
+
+The module parameters can be set from the debugfs files in::
+
+       /sys/module/haltpoll/parameters/
+
+Further Notes
+=============
+
+- Care should be taken when setting the guest_halt_poll_ns parameter as a
+  large value has the potential to drive the cpu usage to 100% on a machine
+  which would be almost entirely idle otherwise.
diff --git a/Documentation/virt/index.rst b/Documentation/virt/index.rst

index 062ffb5..de1ab81 100644 (file)
--- a/Documentation/virt/index.rst
+++ b/Documentation/virt/index.rst
@@ -8,7 +8,9 @@ Linux Virtualization Support
     :maxdepth: 2
  
     kvm/index
+   uml/user_mode_linux
     paravirt_ops
+   guest-halt-polling
  
  .. only:: html and subproject
  
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst

new file mode 100644 (file)

index 0000000..97a72a5
--- /dev/null
+++ b/Documentation/virt/kvm/api.rst
@@ -0,0 +1,6026 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===================================================================
+The Definitive KVM (Kernel-based Virtual Machine) API Documentation
+===================================================================
+
+1. General description
+======================
+
+The kvm API is a set of ioctls that are issued to control various aspects
+of a virtual machine.  The ioctls belong to the following classes:
+
+ - System ioctls: These query and set global attributes which affect the
+   whole kvm subsystem.  In addition a system ioctl is used to create
+   virtual machines.
+
+ - VM ioctls: These query and set attributes that affect an entire virtual
+   machine, for example memory layout.  In addition a VM ioctl is used to
+   create virtual cpus (vcpus) and devices.
+
+   VM ioctls must be issued from the same process (address space) that was
+   used to create the VM.
+
+ - vcpu ioctls: These query and set attributes that control the operation
+   of a single virtual cpu.
+
+   vcpu ioctls should be issued from the same thread that was used to create
+   the vcpu, except for asynchronous vcpu ioctl that are marked as such in
+   the documentation.  Otherwise, the first ioctl after switching threads
+   could see a performance impact.
+
+ - device ioctls: These query and set attributes that control the operation
+   of a single device.
+
+   device ioctls must be issued from the same process (address space) that
+   was used to create the VM.
+
+2. File descriptors
+===================
+
+The kvm API is centered around file descriptors.  An initial
+open("/dev/kvm") obtains a handle to the kvm subsystem; this handle
+can be used to issue system ioctls.  A KVM_CREATE_VM ioctl on this
+handle will create a VM file descriptor which can be used to issue VM
+ioctls.  A KVM_CREATE_VCPU or KVM_CREATE_DEVICE ioctl on a VM fd will
+create a virtual cpu or device and return a file descriptor pointing to
+the new resource.  Finally, ioctls on a vcpu or device fd can be used
+to control the vcpu or device.  For vcpus, this includes the important
+task of actually running guest code.
+
+In general file descriptors can be migrated among processes by means
+of fork() and the SCM_RIGHTS facility of unix domain socket.  These
+kinds of tricks are explicitly not supported by kvm.  While they will
+not cause harm to the host, their actual behavior is not guaranteed by
+the API.  See "General description" for details on the ioctl usage
+model that is supported by KVM.
+
+It is important to note that althought VM ioctls may only be issued from
+the process that created the VM, a VM's lifecycle is associated with its
+file descriptor, not its creator (process).  In other words, the VM and
+its resources, *including the associated address space*, are not freed
+until the last reference to the VM's file descriptor has been released.
+For example, if fork() is issued after ioctl(KVM_CREATE_VM), the VM will
+not be freed until both the parent (original) process and its child have
+put their references to the VM's file descriptor.
+
+Because a VM's resources are not freed until the last reference to its
+file descriptor is released, creating additional references to a VM via
+via fork(), dup(), etc... without careful consideration is strongly
+discouraged and may have unwanted side effects, e.g. memory allocated
+by and on behalf of the VM's process may not be freed/unaccounted when
+the VM is shut down.
+
+
+3. Extensions
+=============
+
+As of Linux 2.6.22, the KVM ABI has been stabilized: no backward
+incompatible change are allowed.  However, there is an extension
+facility that allows backward-compatible extensions to the API to be
+queried and used.
+
+The extension mechanism is not based on the Linux version number.
+Instead, kvm defines extension identifiers and a facility to query
+whether a particular extension identifier is available.  If it is, a
+set of ioctls is available for application use.
+
+
+4. API description
+==================
+
+This section describes ioctls that can be used to control kvm guests.
+For each ioctl, the following information is provided along with a
+description:
+
+  Capability:
+      which KVM extension provides this ioctl.  Can be 'basic',
+      which means that is will be provided by any kernel that supports
+      API version 12 (see section 4.1), a KVM_CAP_xyz constant, which
+      means availability needs to be checked with KVM_CHECK_EXTENSION
+      (see section 4.4), or 'none' which means that while not all kernels
+      support this ioctl, there's no capability bit to check its
+      availability: for kernels that don't support the ioctl,
+      the ioctl returns -ENOTTY.
+
+  Architectures:
+      which instruction set architectures provide this ioctl.
+      x86 includes both i386 and x86_64.
+
+  Type:
+      system, vm, or vcpu.
+
+  Parameters:
+      what parameters are accepted by the ioctl.
+
+  Returns:
+      the return value.  General error numbers (EBADF, ENOMEM, EINVAL)
+      are not detailed, but errors with specific meanings are.
+
+
+4.1 KVM_GET_API_VERSION
+-----------------------
+
+:Capability: basic
+:Architectures: all
+:Type: system ioctl
+:Parameters: none
+:Returns: the constant KVM_API_VERSION (=12)
+
+This identifies the API version as the stable kvm API. It is not
+expected that this number will change.  However, Linux 2.6.20 and
+2.6.21 report earlier versions; these are not documented and not
+supported.  Applications should refuse to run if KVM_GET_API_VERSION
+returns a value other than 12.  If this check passes, all ioctls
+described as 'basic' will be available.
+
+
+4.2 KVM_CREATE_VM
+-----------------
+
+:Capability: basic
+:Architectures: all
+:Type: system ioctl
+:Parameters: machine type identifier (KVM_VM_*)
+:Returns: a VM fd that can be used to control the new virtual machine.
+
+The new VM has no virtual cpus and no memory.
+You probably want to use 0 as machine type.
+
+In order to create user controlled virtual machines on S390, check
+KVM_CAP_S390_UCONTROL and use the flag KVM_VM_S390_UCONTROL as
+privileged user (CAP_SYS_ADMIN).
+
+To use hardware assisted virtualization on MIPS (VZ ASE) rather than
+the default trap & emulate implementation (which changes the virtual
+memory layout to fit in user mode), check KVM_CAP_MIPS_VZ and use the
+flag KVM_VM_MIPS_VZ.
+
+
+On arm64, the physical address size for a VM (IPA Size limit) is limited
+to 40bits by default. The limit can be configured if the host supports the
+extension KVM_CAP_ARM_VM_IPA_SIZE. When supported, use
+KVM_VM_TYPE_ARM_IPA_SIZE(IPA_Bits) to set the size in the machine type
+identifier, where IPA_Bits is the maximum width of any physical
+address used by the VM. The IPA_Bits is encoded in bits[7-0] of the
+machine type identifier.
+
+e.g, to configure a guest to use 48bit physical address size::
+
+    vm_fd = ioctl(dev_fd, KVM_CREATE_VM, KVM_VM_TYPE_ARM_IPA_SIZE(48));
+
+The requested size (IPA_Bits) must be:
+
+ ==   =========================================================
+  0   Implies default size, 40bits (for backward compatibility)
+  N   Implies N bits, where N is a positive integer such that,
+      32 <= N <= Host_IPA_Limit
+ ==   =========================================================
+
+Host_IPA_Limit is the maximum possible value for IPA_Bits on the host and
+is dependent on the CPU capability and the kernel configuration. The limit can
+be retrieved using KVM_CAP_ARM_VM_IPA_SIZE of the KVM_CHECK_EXTENSION
+ioctl() at run-time.
+
+Please note that configuring the IPA size does not affect the capability
+exposed by the guest CPUs in ID_AA64MMFR0_EL1[PARange]. It only affects
+size of the address translated by the stage2 level (guest physical to
+host physical address translations).
+
+
+4.3 KVM_GET_MSR_INDEX_LIST, KVM_GET_MSR_FEATURE_INDEX_LIST
+----------------------------------------------------------
+
+:Capability: basic, KVM_CAP_GET_MSR_FEATURES for KVM_GET_MSR_FEATURE_INDEX_LIST
+:Architectures: x86
+:Type: system ioctl
+:Parameters: struct kvm_msr_list (in/out)
+:Returns: 0 on success; -1 on error
+
+Errors:
+
+  ======     ============================================================
+  EFAULT     the msr index list cannot be read from or written to
+  E2BIG      the msr index list is to be to fit in the array specified by
+             the user.
+  ======     ============================================================
+
+::
+
+  struct kvm_msr_list {
+       __u32 nmsrs; /* number of msrs in entries */
+       __u32 indices[0];
+  };
+
+The user fills in the size of the indices array in nmsrs, and in return
+kvm adjusts nmsrs to reflect the actual number of msrs and fills in the
+indices array with their numbers.
+
+KVM_GET_MSR_INDEX_LIST returns the guest msrs that are supported.  The list
+varies by kvm version and host processor, but does not change otherwise.
+
+Note: if kvm indicates supports MCE (KVM_CAP_MCE), then the MCE bank MSRs are
+not returned in the MSR list, as different vcpus can have a different number
+of banks, as set via the KVM_X86_SETUP_MCE ioctl.
+
+KVM_GET_MSR_FEATURE_INDEX_LIST returns the list of MSRs that can be passed
+to the KVM_GET_MSRS system ioctl.  This lets userspace probe host capabilities
+and processor features that are exposed via MSRs (e.g., VMX capabilities).
+This list also varies by kvm version and host processor, but does not change
+otherwise.
+
+
+4.4 KVM_CHECK_EXTENSION
+-----------------------
+
+:Capability: basic, KVM_CAP_CHECK_EXTENSION_VM for vm ioctl
+:Architectures: all
+:Type: system ioctl, vm ioctl
+:Parameters: extension identifier (KVM_CAP_*)
+:Returns: 0 if unsupported; 1 (or some other positive integer) if supported
+
+The API allows the application to query about extensions to the core
+kvm API.  Userspace passes an extension identifier (an integer) and
+receives an integer that describes the extension availability.
+Generally 0 means no and 1 means yes, but some extensions may report
+additional information in the integer return value.
+
+Based on their initialization different VMs may have different capabilities.
+It is thus encouraged to use the vm ioctl to query for capabilities (available
+with KVM_CAP_CHECK_EXTENSION_VM on the vm fd)
+
+4.5 KVM_GET_VCPU_MMAP_SIZE
+--------------------------
+
+:Capability: basic
+:Architectures: all
+:Type: system ioctl
+:Parameters: none
+:Returns: size of vcpu mmap area, in bytes
+
+The KVM_RUN ioctl (cf.) communicates with userspace via a shared
+memory region.  This ioctl returns the size of that region.  See the
+KVM_RUN documentation for details.
+
+
+4.6 KVM_SET_MEMORY_REGION
+-------------------------
+
+:Capability: basic
+:Architectures: all
+:Type: vm ioctl
+:Parameters: struct kvm_memory_region (in)
+:Returns: 0 on success, -1 on error
+
+This ioctl is obsolete and has been removed.
+
+
+4.7 KVM_CREATE_VCPU
+-------------------
+
+:Capability: basic
+:Architectures: all
+:Type: vm ioctl
+:Parameters: vcpu id (apic id on x86)
+:Returns: vcpu fd on success, -1 on error
+
+This API adds a vcpu to a virtual machine. No more than max_vcpus may be added.
+The vcpu id is an integer in the range [0, max_vcpu_id).
+
+The recommended max_vcpus value can be retrieved using the KVM_CAP_NR_VCPUS of
+the KVM_CHECK_EXTENSION ioctl() at run-time.
+The maximum possible value for max_vcpus can be retrieved using the
+KVM_CAP_MAX_VCPUS of the KVM_CHECK_EXTENSION ioctl() at run-time.
+
+If the KVM_CAP_NR_VCPUS does not exist, you should assume that max_vcpus is 4
+cpus max.
+If the KVM_CAP_MAX_VCPUS does not exist, you should assume that max_vcpus is
+same as the value returned from KVM_CAP_NR_VCPUS.
+
+The maximum possible value for max_vcpu_id can be retrieved using the
+KVM_CAP_MAX_VCPU_ID of the KVM_CHECK_EXTENSION ioctl() at run-time.
+
+If the KVM_CAP_MAX_VCPU_ID does not exist, you should assume that max_vcpu_id
+is the same as the value returned from KVM_CAP_MAX_VCPUS.
+
+On powerpc using book3s_hv mode, the vcpus are mapped onto virtual
+threads in one or more virtual CPU cores.  (This is because the
+hardware requires all the hardware threads in a CPU core to be in the
+same partition.)  The KVM_CAP_PPC_SMT capability indicates the number
+of vcpus per virtual core (vcore).  The vcore id is obtained by
+dividing the vcpu id by the number of vcpus per vcore.  The vcpus in a
+given vcore will always be in the same physical core as each other
+(though that might be a different physical core from time to time).
+Userspace can control the threading (SMT) mode of the guest by its
+allocation of vcpu ids.  For example, if userspace wants
+single-threaded guest vcpus, it should make all vcpu ids be a multiple
+of the number of vcpus per vcore.
+
+For virtual cpus that have been created with S390 user controlled virtual
+machines, the resulting vcpu fd can be memory mapped at page offset
+KVM_S390_SIE_PAGE_OFFSET in order to obtain a memory map of the virtual
+cpu's hardware control block.
+
+
+4.8 KVM_GET_DIRTY_LOG (vm ioctl)
+--------------------------------
+
+:Capability: basic
+:Architectures: all
+:Type: vm ioctl
+:Parameters: struct kvm_dirty_log (in/out)
+:Returns: 0 on success, -1 on error
+
+::
+
+  /* for KVM_GET_DIRTY_LOG */
+  struct kvm_dirty_log {
+       __u32 slot;
+       __u32 padding;
+       union {
+               void __user *dirty_bitmap; /* one bit per page */
+               __u64 padding;
+       };
+  };
+
+Given a memory slot, return a bitmap containing any pages dirtied
+since the last call to this ioctl.  Bit 0 is the first page in the
+memory slot.  Ensure the entire structure is cleared to avoid padding
+issues.
+
+If KVM_CAP_MULTI_ADDRESS_SPACE is available, bits 16-31 specifies
+the address space for which you want to return the dirty bitmap.
+They must be less than the value that KVM_CHECK_EXTENSION returns for
+the KVM_CAP_MULTI_ADDRESS_SPACE capability.
+
+The bits in the dirty bitmap are cleared before the ioctl returns, unless
+KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2 is enabled.  For more information,
+see the description of the capability.
+
+4.9 KVM_SET_MEMORY_ALIAS
+------------------------
+
+:Capability: basic
+:Architectures: x86
+:Type: vm ioctl
+:Parameters: struct kvm_memory_alias (in)
+:Returns: 0 (success), -1 (error)
+
+This ioctl is obsolete and has been removed.
+
+
+4.10 KVM_RUN
+------------
+
+:Capability: basic
+:Architectures: all
+:Type: vcpu ioctl
+:Parameters: none
+:Returns: 0 on success, -1 on error
+
+Errors:
+
+  =====      =============================
+  EINTR      an unmasked signal is pending
+  =====      =============================
+
+This ioctl is used to run a guest virtual cpu.  While there are no
+explicit parameters, there is an implicit parameter block that can be
+obtained by mmap()ing the vcpu fd at offset 0, with the size given by
+KVM_GET_VCPU_MMAP_SIZE.  The parameter block is formatted as a 'struct
+kvm_run' (see below).
+
+
+4.11 KVM_GET_REGS
+-----------------
+
+:Capability: basic
+:Architectures: all except ARM, arm64
+:Type: vcpu ioctl
+:Parameters: struct kvm_regs (out)
+:Returns: 0 on success, -1 on error
+
+Reads the general purpose registers from the vcpu.
+
+::
+
+  /* x86 */
+  struct kvm_regs {
+       /* out (KVM_GET_REGS) / in (KVM_SET_REGS) */
+       __u64 rax, rbx, rcx, rdx;
+       __u64 rsi, rdi, rsp, rbp;
+       __u64 r8,  r9,  r10, r11;
+       __u64 r12, r13, r14, r15;
+       __u64 rip, rflags;
+  };
+
+  /* mips */
+  struct kvm_regs {
+       /* out (KVM_GET_REGS) / in (KVM_SET_REGS) */
+       __u64 gpr[32];
+       __u64 hi;
+       __u64 lo;
+       __u64 pc;
+  };
+
+
+4.12 KVM_SET_REGS
+-----------------
+
+:Capability: basic
+:Architectures: all except ARM, arm64
+:Type: vcpu ioctl
+:Parameters: struct kvm_regs (in)
+:Returns: 0 on success, -1 on error
+
+Writes the general purpose registers into the vcpu.
+
+See KVM_GET_REGS for the data structure.
+
+
+4.13 KVM_GET_SREGS
+------------------
+
+:Capability: basic
+:Architectures: x86, ppc
+:Type: vcpu ioctl
+:Parameters: struct kvm_sregs (out)
+:Returns: 0 on success, -1 on error
+
+Reads special registers from the vcpu.
+
+::
+
+  /* x86 */
+  struct kvm_sregs {
+       struct kvm_segment cs, ds, es, fs, gs, ss;
+       struct kvm_segment tr, ldt;
+       struct kvm_dtable gdt, idt;
+       __u64 cr0, cr2, cr3, cr4, cr8;
+       __u64 efer;
+       __u64 apic_base;
+       __u64 interrupt_bitmap[(KVM_NR_INTERRUPTS + 63) / 64];
+  };
+
+  /* ppc -- see arch/powerpc/include/uapi/asm/kvm.h */
+
+interrupt_bitmap is a bitmap of pending external interrupts.  At most
+one bit may be set.  This interrupt has been acknowledged by the APIC
+but not yet injected into the cpu core.
+
+
+4.14 KVM_SET_SREGS
+------------------
+
+:Capability: basic
+:Architectures: x86, ppc
+:Type: vcpu ioctl
+:Parameters: struct kvm_sregs (in)
+:Returns: 0 on success, -1 on error
+
+Writes special registers into the vcpu.  See KVM_GET_SREGS for the
+data structures.
+
+
+4.15 KVM_TRANSLATE
+------------------
+
+:Capability: basic
+:Architectures: x86
+:Type: vcpu ioctl
+:Parameters: struct kvm_translation (in/out)
+:Returns: 0 on success, -1 on error
+
+Translates a virtual address according to the vcpu's current address
+translation mode.
+
+::
+
+  struct kvm_translation {
+       /* in */
+       __u64 linear_address;
+
+       /* out */
+       __u64 physical_address;
+       __u8  valid;
+       __u8  writeable;
+       __u8  usermode;
+       __u8  pad[5];
+  };
+
+
+4.16 KVM_INTERRUPT
+------------------
+
+:Capability: basic
+:Architectures: x86, ppc, mips
+:Type: vcpu ioctl
+:Parameters: struct kvm_interrupt (in)
+:Returns: 0 on success, negative on failure.
+
+Queues a hardware interrupt vector to be injected.
+
+::
+
+  /* for KVM_INTERRUPT */
+  struct kvm_interrupt {
+       /* in */
+       __u32 irq;
+  };
+
+X86:
+^^^^
+
+:Returns:
+
+       ========= ===================================
+         0       on success,
+        -EEXIST  if an interrupt is already enqueued
+        -EINVAL  the the irq number is invalid
+        -ENXIO   if the PIC is in the kernel
+        -EFAULT  if the pointer is invalid
+       ========= ===================================
+
+Note 'irq' is an interrupt vector, not an interrupt pin or line. This
+ioctl is useful if the in-kernel PIC is not used.
+
+PPC:
+^^^^
+
+Queues an external interrupt to be injected. This ioctl is overleaded
+with 3 different irq values:
+
+a) KVM_INTERRUPT_SET
+
+   This injects an edge type external interrupt into the guest once it's ready
+   to receive interrupts. When injected, the interrupt is done.
+
+b) KVM_INTERRUPT_UNSET
+
+   This unsets any pending interrupt.
+
+   Only available with KVM_CAP_PPC_UNSET_IRQ.
+
+c) KVM_INTERRUPT_SET_LEVEL
+
+   This injects a level type external interrupt into the guest context. The
+   interrupt stays pending until a specific ioctl with KVM_INTERRUPT_UNSET
+   is triggered.
+
+   Only available with KVM_CAP_PPC_IRQ_LEVEL.
+
+Note that any value for 'irq' other than the ones stated above is invalid
+and incurs unexpected behavior.
+
+This is an asynchronous vcpu ioctl and can be invoked from any thread.
+
+MIPS:
+^^^^^
+
+Queues an external interrupt to be injected into the virtual CPU. A negative
+interrupt number dequeues the interrupt.
+
+This is an asynchronous vcpu ioctl and can be invoked from any thread.
+
+
+4.17 KVM_DEBUG_GUEST
+--------------------
+
+:Capability: basic
+:Architectures: none
+:Type: vcpu ioctl
+:Parameters: none)
+:Returns: -1 on error
+
+Support for this has been removed.  Use KVM_SET_GUEST_DEBUG instead.
+
+
+4.18 KVM_GET_MSRS
+-----------------
+
+:Capability: basic (vcpu), KVM_CAP_GET_MSR_FEATURES (system)
+:Architectures: x86
+:Type: system ioctl, vcpu ioctl
+:Parameters: struct kvm_msrs (in/out)
+:Returns: number of msrs successfully returned;
+          -1 on error
+
+When used as a system ioctl:
+Reads the values of MSR-based features that are available for the VM.  This
+is similar to KVM_GET_SUPPORTED_CPUID, but it returns MSR indices and values.
+The list of msr-based features can be obtained using KVM_GET_MSR_FEATURE_INDEX_LIST
+in a system ioctl.
+
+When used as a vcpu ioctl:
+Reads model-specific registers from the vcpu.  Supported msr indices can
+be obtained using KVM_GET_MSR_INDEX_LIST in a system ioctl.
+
+::
+
+  struct kvm_msrs {
+       __u32 nmsrs; /* number of msrs in entries */
+       __u32 pad;
+
+       struct kvm_msr_entry entries[0];
+  };
+
+  struct kvm_msr_entry {
+       __u32 index;
+       __u32 reserved;
+       __u64 data;
+  };
+
+Application code should set the 'nmsrs' member (which indicates the
+size of the entries array) and the 'index' member of each array entry.
+kvm will fill in the 'data' member.
+
+
+4.19 KVM_SET_MSRS
+-----------------
+
+:Capability: basic
+:Architectures: x86
+:Type: vcpu ioctl
+:Parameters: struct kvm_msrs (in)
+:Returns: number of msrs successfully set (see below), -1 on error
+
+Writes model-specific registers to the vcpu.  See KVM_GET_MSRS for the
+data structures.
+
+Application code should set the 'nmsrs' member (which indicates the
+size of the entries array), and the 'index' and 'data' members of each
+array entry.
+
+It tries to set the MSRs in array entries[] one by one. If setting an MSR
+fails, e.g., due to setting reserved bits, the MSR isn't supported/emulated
+by KVM, etc..., it stops processing the MSR list and returns the number of
+MSRs that have been set successfully.
+
+
+4.20 KVM_SET_CPUID
+------------------
+
+:Capability: basic
+:Architectures: x86
+:Type: vcpu ioctl
+:Parameters: struct kvm_cpuid (in)
+:Returns: 0 on success, -1 on error
+
+Defines the vcpu responses to the cpuid instruction.  Applications
+should use the KVM_SET_CPUID2 ioctl if available.
+
+::
+
+  struct kvm_cpuid_entry {
+       __u32 function;
+       __u32 eax;
+       __u32 ebx;
+       __u32 ecx;
+       __u32 edx;
+       __u32 padding;
+  };
+
+  /* for KVM_SET_CPUID */
+  struct kvm_cpuid {
+       __u32 nent;
+       __u32 padding;
+       struct kvm_cpuid_entry entries[0];
+  };
+
+
+4.21 KVM_SET_SIGNAL_MASK
+------------------------
+
+:Capability: basic
+:Architectures: all
+:Type: vcpu ioctl
+:Parameters: struct kvm_signal_mask (in)
+:Returns: 0 on success, -1 on error
+
+Defines which signals are blocked during execution of KVM_RUN.  This
+signal mask temporarily overrides the threads signal mask.  Any
+unblocked signal received (except SIGKILL and SIGSTOP, which retain
+their traditional behaviour) will cause KVM_RUN to return with -EINTR.
+
+Note the signal will only be delivered if not blocked by the original
+signal mask.
+
+::
+
+  /* for KVM_SET_SIGNAL_MASK */
+  struct kvm_signal_mask {
+       __u32 len;
+       __u8  sigset[0];
+  };
+
+
+4.22 KVM_GET_FPU
+----------------
+
+:Capability: basic
+:Architectures: x86
+:Type: vcpu ioctl
+:Parameters: struct kvm_fpu (out)
+:Returns: 0 on success, -1 on error
+
+Reads the floating point state from the vcpu.
+
+::
+
+  /* for KVM_GET_FPU and KVM_SET_FPU */
+  struct kvm_fpu {
+       __u8  fpr[8][16];
+       __u16 fcw;
+       __u16 fsw;
+       __u8  ftwx;  /* in fxsave format */
+       __u8  pad1;
+       __u16 last_opcode;
+       __u64 last_ip;
+       __u64 last_dp;
+       __u8  xmm[16][16];
+       __u32 mxcsr;
+       __u32 pad2;
+  };
+
+
+4.23 KVM_SET_FPU
+----------------
+
+:Capability: basic
+:Architectures: x86
+:Type: vcpu ioctl
+:Parameters: struct kvm_fpu (in)
+:Returns: 0 on success, -1 on error
+
+Writes the floating point state to the vcpu.
+
+::
+
+  /* for KVM_GET_FPU and KVM_SET_FPU */
+  struct kvm_fpu {
+       __u8  fpr[8][16];
+       __u16 fcw;
+       __u16 fsw;
+       __u8  ftwx;  /* in fxsave format */
+       __u8  pad1;
+       __u16 last_opcode;
+       __u64 last_ip;
+       __u64 last_dp;
+       __u8  xmm[16][16];
+       __u32 mxcsr;
+       __u32 pad2;
+  };
+
+
+4.24 KVM_CREATE_IRQCHIP
+-----------------------
+
+:Capability: KVM_CAP_IRQCHIP, KVM_CAP_S390_IRQCHIP (s390)
+:Architectures: x86, ARM, arm64, s390
+:Type: vm ioctl
+:Parameters: none
+:Returns: 0 on success, -1 on error
+
+Creates an interrupt controller model in the kernel.
+On x86, creates a virtual ioapic, a virtual PIC (two PICs, nested), and sets up
+future vcpus to have a local APIC.  IRQ routing for GSIs 0-15 is set to both
+PIC and IOAPIC; GSI 16-23 only go to the IOAPIC.
+On ARM/arm64, a GICv2 is created. Any other GIC versions require the usage of
+KVM_CREATE_DEVICE, which also supports creating a GICv2.  Using
+KVM_CREATE_DEVICE is preferred over KVM_CREATE_IRQCHIP for GICv2.
+On s390, a dummy irq routing table is created.
+
+Note that on s390 the KVM_CAP_S390_IRQCHIP vm capability needs to be enabled
+before KVM_CREATE_IRQCHIP can be used.
+
+
+4.25 KVM_IRQ_LINE
+-----------------
+
+:Capability: KVM_CAP_IRQCHIP
+:Architectures: x86, arm, arm64
+:Type: vm ioctl
+:Parameters: struct kvm_irq_level
+:Returns: 0 on success, -1 on error
+
+Sets the level of a GSI input to the interrupt controller model in the kernel.
+On some architectures it is required that an interrupt controller model has
+been previously created with KVM_CREATE_IRQCHIP.  Note that edge-triggered
+interrupts require the level to be set to 1 and then back to 0.
+
+On real hardware, interrupt pins can be active-low or active-high.  This
+does not matter for the level field of struct kvm_irq_level: 1 always
+means active (asserted), 0 means inactive (deasserted).
+
+x86 allows the operating system to program the interrupt polarity
+(active-low/active-high) for level-triggered interrupts, and KVM used
+to consider the polarity.  However, due to bitrot in the handling of
+active-low interrupts, the above convention is now valid on x86 too.
+This is signaled by KVM_CAP_X86_IOAPIC_POLARITY_IGNORED.  Userspace
+should not present interrupts to the guest as active-low unless this
+capability is present (or unless it is not using the in-kernel irqchip,
+of course).
+
+
+ARM/arm64 can signal an interrupt either at the CPU level, or at the
+in-kernel irqchip (GIC), and for in-kernel irqchip can tell the GIC to
+use PPIs designated for specific cpus.  The irq field is interpreted
+like this::
+
+  bits:  |  31 ... 28  | 27 ... 24 | 23  ... 16 | 15 ... 0 |
+  field: | vcpu2_index | irq_type  | vcpu_index |  irq_id  |
+
+The irq_type field has the following values:
+
+- irq_type[0]:
+              out-of-kernel GIC: irq_id 0 is IRQ, irq_id 1 is FIQ
+- irq_type[1]:
+              in-kernel GIC: SPI, irq_id between 32 and 1019 (incl.)
+               (the vcpu_index field is ignored)
+- irq_type[2]:
+              in-kernel GIC: PPI, irq_id between 16 and 31 (incl.)
+
+(The irq_id field thus corresponds nicely to the IRQ ID in the ARM GIC specs)
+
+In both cases, level is used to assert/deassert the line.
+
+When KVM_CAP_ARM_IRQ_LINE_LAYOUT_2 is supported, the target vcpu is
+identified as (256 * vcpu2_index + vcpu_index). Otherwise, vcpu2_index
+must be zero.
+
+Note that on arm/arm64, the KVM_CAP_IRQCHIP capability only conditions
+injection of interrupts for the in-kernel irqchip. KVM_IRQ_LINE can always
+be used for a userspace interrupt controller.
+
+::
+
+  struct kvm_irq_level {
+       union {
+               __u32 irq;     /* GSI */
+               __s32 status;  /* not used for KVM_IRQ_LEVEL */
+       };
+       __u32 level;           /* 0 or 1 */
+  };
+
+
+4.26 KVM_GET_IRQCHIP
+--------------------
+
+:Capability: KVM_CAP_IRQCHIP
+:Architectures: x86
+:Type: vm ioctl
+:Parameters: struct kvm_irqchip (in/out)
+:Returns: 0 on success, -1 on error
+
+Reads the state of a kernel interrupt controller created with
+KVM_CREATE_IRQCHIP into a buffer provided by the caller.
+
+::
+
+  struct kvm_irqchip {
+       __u32 chip_id;  /* 0 = PIC1, 1 = PIC2, 2 = IOAPIC */
+       __u32 pad;
+        union {
+               char dummy[512];  /* reserving space */
+               struct kvm_pic_state pic;
+               struct kvm_ioapic_state ioapic;
+       } chip;
+  };
+
+
+4.27 KVM_SET_IRQCHIP
+--------------------
+
+:Capability: KVM_CAP_IRQCHIP
+:Architectures: x86
+:Type: vm ioctl
+:Parameters: struct kvm_irqchip (in)
+:Returns: 0 on success, -1 on error
+
+Sets the state of a kernel interrupt controller created with
+KVM_CREATE_IRQCHIP from a buffer provided by the caller.
+
+::
+
+  struct kvm_irqchip {
+       __u32 chip_id;  /* 0 = PIC1, 1 = PIC2, 2 = IOAPIC */
+       __u32 pad;
+        union {
+               char dummy[512];  /* reserving space */
+               struct kvm_pic_state pic;
+               struct kvm_ioapic_state ioapic;
+       } chip;
+  };
+
+
+4.28 KVM_XEN_HVM_CONFIG
+-----------------------
+
+:Capability: KVM_CAP_XEN_HVM
+:Architectures: x86
+:Type: vm ioctl
+:Parameters: struct kvm_xen_hvm_config (in)
+:Returns: 0 on success, -1 on error
+
+Sets the MSR that the Xen HVM guest uses to initialize its hypercall
+page, and provides the starting address and size of the hypercall
+blobs in userspace.  When the guest writes the MSR, kvm copies one
+page of a blob (32- or 64-bit, depending on the vcpu mode) to guest
+memory.
+
+::
+
+  struct kvm_xen_hvm_config {
+       __u32 flags;
+       __u32 msr;
+       __u64 blob_addr_32;
+       __u64 blob_addr_64;
+       __u8 blob_size_32;
+       __u8 blob_size_64;
+       __u8 pad2[30];
+  };
+
+
+4.29 KVM_GET_CLOCK
+------------------
+
+:Capability: KVM_CAP_ADJUST_CLOCK
+:Architectures: x86
+:Type: vm ioctl
+:Parameters: struct kvm_clock_data (out)
+:Returns: 0 on success, -1 on error
+
+Gets the current timestamp of kvmclock as seen by the current guest. In
+conjunction with KVM_SET_CLOCK, it is used to ensure monotonicity on scenarios
+such as migration.
+
+When KVM_CAP_ADJUST_CLOCK is passed to KVM_CHECK_EXTENSION, it returns the
+set of bits that KVM can return in struct kvm_clock_data's flag member.
+
+The only flag defined now is KVM_CLOCK_TSC_STABLE.  If set, the returned
+value is the exact kvmclock value seen by all VCPUs at the instant
+when KVM_GET_CLOCK was called.  If clear, the returned value is simply
+CLOCK_MONOTONIC plus a constant offset; the offset can be modified
+with KVM_SET_CLOCK.  KVM will try to make all VCPUs follow this clock,
+but the exact value read by each VCPU could differ, because the host
+TSC is not stable.
+
+::
+
+  struct kvm_clock_data {
+       __u64 clock;  /* kvmclock current value */
+       __u32 flags;
+       __u32 pad[9];
+  };
+
+
+4.30 KVM_SET_CLOCK
+------------------
+
+:Capability: KVM_CAP_ADJUST_CLOCK
+:Architectures: x86
+:Type: vm ioctl
+:Parameters: struct kvm_clock_data (in)
+:Returns: 0 on success, -1 on error
+
+Sets the current timestamp of kvmclock to the value specified in its parameter.
+In conjunction with KVM_GET_CLOCK, it is used to ensure monotonicity on scenarios
+such as migration.
+
+::
+
+  struct kvm_clock_data {
+       __u64 clock;  /* kvmclock current value */
+       __u32 flags;
+       __u32 pad[9];
+  };
+
+
+4.31 KVM_GET_VCPU_EVENTS
+------------------------
+
+:Capability: KVM_CAP_VCPU_EVENTS
+:Extended by: KVM_CAP_INTR_SHADOW
+:Architectures: x86, arm, arm64
+:Type: vcpu ioctl
+:Parameters: struct kvm_vcpu_event (out)
+:Returns: 0 on success, -1 on error
+
+X86:
+^^^^
+
+Gets currently pending exceptions, interrupts, and NMIs as well as related
+states of the vcpu.
+
+::
+
+  struct kvm_vcpu_events {
+       struct {
+               __u8 injected;
+               __u8 nr;
+               __u8 has_error_code;
+               __u8 pending;
+               __u32 error_code;
+       } exception;
+       struct {
+               __u8 injected;
+               __u8 nr;
+               __u8 soft;
+               __u8 shadow;
+       } interrupt;
+       struct {
+               __u8 injected;
+               __u8 pending;
+               __u8 masked;
+               __u8 pad;
+       } nmi;
+       __u32 sipi_vector;
+       __u32 flags;
+       struct {
+               __u8 smm;
+               __u8 pending;
+               __u8 smm_inside_nmi;
+               __u8 latched_init;
+       } smi;
+       __u8 reserved[27];
+       __u8 exception_has_payload;
+       __u64 exception_payload;
+  };
+
+The following bits are defined in the flags field:
+
+- KVM_VCPUEVENT_VALID_SHADOW may be set to signal that
+  interrupt.shadow contains a valid state.
+
+- KVM_VCPUEVENT_VALID_SMM may be set to signal that smi contains a
+  valid state.
+
+- KVM_VCPUEVENT_VALID_PAYLOAD may be set to signal that the
+  exception_has_payload, exception_payload, and exception.pending
+  fields contain a valid state. This bit will be set whenever
+  KVM_CAP_EXCEPTION_PAYLOAD is enabled.
+
+ARM/ARM64:
+^^^^^^^^^^
+
+If the guest accesses a device that is being emulated by the host kernel in
+such a way that a real device would generate a physical SError, KVM may make
+a virtual SError pending for that VCPU. This system error interrupt remains
+pending until the guest takes the exception by unmasking PSTATE.A.
+
+Running the VCPU may cause it to take a pending SError, or make an access that
+causes an SError to become pending. The event's description is only valid while
+the VPCU is not running.
+
+This API provides a way to read and write the pending 'event' state that is not
+visible to the guest. To save, restore or migrate a VCPU the struct representing
+the state can be read then written using this GET/SET API, along with the other
+guest-visible registers. It is not possible to 'cancel' an SError that has been
+made pending.
+
+A device being emulated in user-space may also wish to generate an SError. To do
+this the events structure can be populated by user-space. The current state
+should be read first, to ensure no existing SError is pending. If an existing
+SError is pending, the architecture's 'Multiple SError interrupts' rules should
+be followed. (2.5.3 of DDI0587.a "ARM Reliability, Availability, and
+Serviceability (RAS) Specification").
+
+SError exceptions always have an ESR value. Some CPUs have the ability to
+specify what the virtual SError's ESR value should be. These systems will
+advertise KVM_CAP_ARM_INJECT_SERROR_ESR. In this case exception.has_esr will
+always have a non-zero value when read, and the agent making an SError pending
+should specify the ISS field in the lower 24 bits of exception.serror_esr. If
+the system supports KVM_CAP_ARM_INJECT_SERROR_ESR, but user-space sets the events
+with exception.has_esr as zero, KVM will choose an ESR.
+
+Specifying exception.has_esr on a system that does not support it will return
+-EINVAL. Setting anything other than the lower 24bits of exception.serror_esr
+will return -EINVAL.
+
+It is not possible to read back a pending external abort (injected via
+KVM_SET_VCPU_EVENTS or otherwise) because such an exception is always delivered
+directly to the virtual CPU).
+
+::
+
+  struct kvm_vcpu_events {
+       struct {
+               __u8 serror_pending;
+               __u8 serror_has_esr;
+               __u8 ext_dabt_pending;
+               /* Align it to 8 bytes */
+               __u8 pad[5];
+               __u64 serror_esr;
+       } exception;
+       __u32 reserved[12];
+  };
+
+4.32 KVM_SET_VCPU_EVENTS
+------------------------
+
+:Capability: KVM_CAP_VCPU_EVENTS
+:Extended by: KVM_CAP_INTR_SHADOW
+:Architectures: x86, arm, arm64
+:Type: vcpu ioctl
+:Parameters: struct kvm_vcpu_event (in)
+:Returns: 0 on success, -1 on error
+
+X86:
+^^^^
+
+Set pending exceptions, interrupts, and NMIs as well as related states of the
+vcpu.
+
+See KVM_GET_VCPU_EVENTS for the data structure.
+
+Fields that may be modified asynchronously by running VCPUs can be excluded
+from the update. These fields are nmi.pending, sipi_vector, smi.smm,
+smi.pending. Keep the corresponding bits in the flags field cleared to
+suppress overwriting the current in-kernel state. The bits are:
+
+===============================  ==================================
+KVM_VCPUEVENT_VALID_NMI_PENDING  transfer nmi.pending to the kernel
+KVM_VCPUEVENT_VALID_SIPI_VECTOR  transfer sipi_vector
+KVM_VCPUEVENT_VALID_SMM          transfer the smi sub-struct.
+===============================  ==================================
+
+If KVM_CAP_INTR_SHADOW is available, KVM_VCPUEVENT_VALID_SHADOW can be set in
+the flags field to signal that interrupt.shadow contains a valid state and
+shall be written into the VCPU.
+
+KVM_VCPUEVENT_VALID_SMM can only be set if KVM_CAP_X86_SMM is available.
+
+If KVM_CAP_EXCEPTION_PAYLOAD is enabled, KVM_VCPUEVENT_VALID_PAYLOAD
+can be set in the flags field to signal that the
+exception_has_payload, exception_payload, and exception.pending fields
+contain a valid state and shall be written into the VCPU.
+
+ARM/ARM64:
+^^^^^^^^^^
+
+User space may need to inject several types of events to the guest.
+
+Set the pending SError exception state for this VCPU. It is not possible to
+'cancel' an Serror that has been made pending.
+
+If the guest performed an access to I/O memory which could not be handled by
+userspace, for example because of missing instruction syndrome decode
+information or because there is no device mapped at the accessed IPA, then
+userspace can ask the kernel to inject an external abort using the address
+from the exiting fault on the VCPU. It is a programming error to set
+ext_dabt_pending after an exit which was not either KVM_EXIT_MMIO or
+KVM_EXIT_ARM_NISV. This feature is only available if the system supports
+KVM_CAP_ARM_INJECT_EXT_DABT. This is a helper which provides commonality in
+how userspace reports accesses for the above cases to guests, across different
+userspace implementations. Nevertheless, userspace can still emulate all Arm
+exceptions by manipulating individual registers using the KVM_SET_ONE_REG API.
+
+See KVM_GET_VCPU_EVENTS for the data structure.
+
+
+4.33 KVM_GET_DEBUGREGS
+----------------------
+
+:Capability: KVM_CAP_DEBUGREGS
+:Architectures: x86
+:Type: vm ioctl
+:Parameters: struct kvm_debugregs (out)
+:Returns: 0 on success, -1 on error
+
+Reads debug registers from the vcpu.
+
+::
+
+  struct kvm_debugregs {
+       __u64 db[4];
+       __u64 dr6;
+       __u64 dr7;
+       __u64 flags;
+       __u64 reserved[9];
+  };
+
+
+4.34 KVM_SET_DEBUGREGS
+----------------------
+
+:Capability: KVM_CAP_DEBUGREGS
+:Architectures: x86
+:Type: vm ioctl
+:Parameters: struct kvm_debugregs (in)
+:Returns: 0 on success, -1 on error
+
+Writes debug registers into the vcpu.
+
+See KVM_GET_DEBUGREGS for the data structure. The flags field is unused
+yet and must be cleared on entry.
+
+
+4.35 KVM_SET_USER_MEMORY_REGION
+-------------------------------
+
+:Capability: KVM_CAP_USER_MEMORY
+:Architectures: all
+:Type: vm ioctl
+:Parameters: struct kvm_userspace_memory_region (in)
+:Returns: 0 on success, -1 on error
+
+::
+
+  struct kvm_userspace_memory_region {
+       __u32 slot;
+       __u32 flags;
+       __u64 guest_phys_addr;
+       __u64 memory_size; /* bytes */
+       __u64 userspace_addr; /* start of the userspace allocated memory */
+  };
+
+  /* for kvm_memory_region::flags */
+  #define KVM_MEM_LOG_DIRTY_PAGES      (1UL << 0)
+  #define KVM_MEM_READONLY     (1UL << 1)
+
+This ioctl allows the user to create, modify or delete a guest physical
+memory slot.  Bits 0-15 of "slot" specify the slot id and this value
+should be less than the maximum number of user memory slots supported per
+VM.  The maximum allowed slots can be queried using KVM_CAP_NR_MEMSLOTS.
+Slots may not overlap in guest physical address space.
+
+If KVM_CAP_MULTI_ADDRESS_SPACE is available, bits 16-31 of "slot"
+specifies the address space which is being modified.  They must be
+less than the value that KVM_CHECK_EXTENSION returns for the
+KVM_CAP_MULTI_ADDRESS_SPACE capability.  Slots in separate address spaces
+are unrelated; the restriction on overlapping slots only applies within
+each address space.
+
+Deleting a slot is done by passing zero for memory_size.  When changing
+an existing slot, it may be moved in the guest physical memory space,
+or its flags may be modified, but it may not be resized.
+
+Memory for the region is taken starting at the address denoted by the
+field userspace_addr, which must point at user addressable memory for
+the entire memory slot size.  Any object may back this memory, including
+anonymous memory, ordinary files, and hugetlbfs.
+
+It is recommended that the lower 21 bits of guest_phys_addr and userspace_addr
+be identical.  This allows large pages in the guest to be backed by large
+pages in the host.
+
+The flags field supports two flags: KVM_MEM_LOG_DIRTY_PAGES and
+KVM_MEM_READONLY.  The former can be set to instruct KVM to keep track of
+writes to memory within the slot.  See KVM_GET_DIRTY_LOG ioctl to know how to
+use it.  The latter can be set, if KVM_CAP_READONLY_MEM capability allows it,
+to make a new slot read-only.  In this case, writes to this memory will be
+posted to userspace as KVM_EXIT_MMIO exits.
+
+When the KVM_CAP_SYNC_MMU capability is available, changes in the backing of
+the memory region are automatically reflected into the guest.  For example, an
+mmap() that affects the region will be made visible immediately.  Another
+example is madvise(MADV_DROP).
+
+It is recommended to use this API instead of the KVM_SET_MEMORY_REGION ioctl.
+The KVM_SET_MEMORY_REGION does not allow fine grained control over memory
+allocation and is deprecated.
+
+
+4.36 KVM_SET_TSS_ADDR
+---------------------
+
+:Capability: KVM_CAP_SET_TSS_ADDR
+:Architectures: x86
+:Type: vm ioctl
+:Parameters: unsigned long tss_address (in)
+:Returns: 0 on success, -1 on error
+
+This ioctl defines the physical address of a three-page region in the guest
+physical address space.  The region must be within the first 4GB of the
+guest physical address space and must not conflict with any memory slot
+or any mmio address.  The guest may malfunction if it accesses this memory
+region.
+
+This ioctl is required on Intel-based hosts.  This is needed on Intel hardware
+because of a quirk in the virtualization implementation (see the internals
+documentation when it pops into existence).
+
+
+4.37 KVM_ENABLE_CAP
+-------------------
+
+:Capability: KVM_CAP_ENABLE_CAP
+:Architectures: mips, ppc, s390
+:Type: vcpu ioctl
+:Parameters: struct kvm_enable_cap (in)
+:Returns: 0 on success; -1 on error
+
+:Capability: KVM_CAP_ENABLE_CAP_VM
+:Architectures: all
+:Type: vcpu ioctl
+:Parameters: struct kvm_enable_cap (in)
+:Returns: 0 on success; -1 on error
+
+.. note::
+
+   Not all extensions are enabled by default. Using this ioctl the application
+   can enable an extension, making it available to the guest.
+
+On systems that do not support this ioctl, it always fails. On systems that
+do support it, it only works for extensions that are supported for enablement.
+
+To check if a capability can be enabled, the KVM_CHECK_EXTENSION ioctl should
+be used.
+
+::
+
+  struct kvm_enable_cap {
+       /* in */
+       __u32 cap;
+
+The capability that is supposed to get enabled.
+
+::
+
+       __u32 flags;
+
+A bitfield indicating future enhancements. Has to be 0 for now.
+
+::
+
+       __u64 args[4];
+
+Arguments for enabling a feature. If a feature needs initial values to
+function properly, this is the place to put them.
+
+::
+
+       __u8  pad[64];
+  };
+
+The vcpu ioctl should be used for vcpu-specific capabilities, the vm ioctl
+for vm-wide capabilities.
+
+4.38 KVM_GET_MP_STATE
+---------------------
+
+:Capability: KVM_CAP_MP_STATE
+:Architectures: x86, s390, arm, arm64
+:Type: vcpu ioctl
+:Parameters: struct kvm_mp_state (out)
+:Returns: 0 on success; -1 on error
+
+::
+
+  struct kvm_mp_state {
+       __u32 mp_state;
+  };
+
+Returns the vcpu's current "multiprocessing state" (though also valid on
+uniprocessor guests).
+
+Possible values are:
+
+   ==========================    ===============================================
+   KVM_MP_STATE_RUNNABLE         the vcpu is currently running [x86,arm/arm64]
+   KVM_MP_STATE_UNINITIALIZED    the vcpu is an application processor (AP)
+                                 which has not yet received an INIT signal [x86]
+   KVM_MP_STATE_INIT_RECEIVED    the vcpu has received an INIT signal, and is
+                                 now ready for a SIPI [x86]
+   KVM_MP_STATE_HALTED           the vcpu has executed a HLT instruction and
+                                 is waiting for an interrupt [x86]
+   KVM_MP_STATE_SIPI_RECEIVED    the vcpu has just received a SIPI (vector
+                                 accessible via KVM_GET_VCPU_EVENTS) [x86]
+   KVM_MP_STATE_STOPPED          the vcpu is stopped [s390,arm/arm64]
+   KVM_MP_STATE_CHECK_STOP       the vcpu is in a special error state [s390]
+   KVM_MP_STATE_OPERATING        the vcpu is operating (running or halted)
+                                 [s390]
+   KVM_MP_STATE_LOAD             the vcpu is in a special load/startup state
+                                 [s390]
+   ==========================    ===============================================
+
+On x86, this ioctl is only useful after KVM_CREATE_IRQCHIP. Without an
+in-kernel irqchip, the multiprocessing state must be maintained by userspace on
+these architectures.
+
+For arm/arm64:
+^^^^^^^^^^^^^^
+
+The only states that are valid are KVM_MP_STATE_STOPPED and
+KVM_MP_STATE_RUNNABLE which reflect if the vcpu is paused or not.
+
+4.39 KVM_SET_MP_STATE
+---------------------
+
+:Capability: KVM_CAP_MP_STATE
+:Architectures: x86, s390, arm, arm64
+:Type: vcpu ioctl
+:Parameters: struct kvm_mp_state (in)
+:Returns: 0 on success; -1 on error
+
+Sets the vcpu's current "multiprocessing state"; see KVM_GET_MP_STATE for
+arguments.
+
+On x86, this ioctl is only useful after KVM_CREATE_IRQCHIP. Without an
+in-kernel irqchip, the multiprocessing state must be maintained by userspace on
+these architectures.
+
+For arm/arm64:
+^^^^^^^^^^^^^^
+
+The only states that are valid are KVM_MP_STATE_STOPPED and
+KVM_MP_STATE_RUNNABLE which reflect if the vcpu should be paused or not.
+
+4.40 KVM_SET_IDENTITY_MAP_ADDR
+------------------------------
+
+:Capability: KVM_CAP_SET_IDENTITY_MAP_ADDR
+:Architectures: x86
+:Type: vm ioctl
+:Parameters: unsigned long identity (in)
+:Returns: 0 on success, -1 on error
+
+This ioctl defines the physical address of a one-page region in the guest
+physical address space.  The region must be within the first 4GB of the
+guest physical address space and must not conflict with any memory slot
+or any mmio address.  The guest may malfunction if it accesses this memory
+region.
+
+Setting the address to 0 will result in resetting the address to its default
+(0xfffbc000).
+
+This ioctl is required on Intel-based hosts.  This is needed on Intel hardware
+because of a quirk in the virtualization implementation (see the internals
+documentation when it pops into existence).
+
+Fails if any VCPU has already been created.
+
+4.41 KVM_SET_BOOT_CPU_ID
+------------------------
+
+:Capability: KVM_CAP_SET_BOOT_CPU_ID
+:Architectures: x86
+:Type: vm ioctl
+:Parameters: unsigned long vcpu_id
+:Returns: 0 on success, -1 on error
+
+Define which vcpu is the Bootstrap Processor (BSP).  Values are the same
+as the vcpu id in KVM_CREATE_VCPU.  If this ioctl is not called, the default
+is vcpu 0.
+
+
+4.42 KVM_GET_XSAVE
+------------------
+
+:Capability: KVM_CAP_XSAVE
+:Architectures: x86
+:Type: vcpu ioctl
+:Parameters: struct kvm_xsave (out)
+:Returns: 0 on success, -1 on error
+
+
+::
+
+  struct kvm_xsave {
+       __u32 region[1024];
+  };
+
+This ioctl would copy current vcpu's xsave struct to the userspace.
+
+
+4.43 KVM_SET_XSAVE
+------------------
+
+:Capability: KVM_CAP_XSAVE
+:Architectures: x86
+:Type: vcpu ioctl
+:Parameters: struct kvm_xsave (in)
+:Returns: 0 on success, -1 on error
+
+::
+
+
+  struct kvm_xsave {
+       __u32 region[1024];
+  };
+
+This ioctl would copy userspace's xsave struct to the kernel.
+
+
+4.44 KVM_GET_XCRS
+-----------------
+
+:Capability: KVM_CAP_XCRS
+:Architectures: x86
+:Type: vcpu ioctl
+:Parameters: struct kvm_xcrs (out)
+:Returns: 0 on success, -1 on error
+
+::
+
+  struct kvm_xcr {
+       __u32 xcr;
+       __u32 reserved;
+       __u64 value;
+  };
+
+  struct kvm_xcrs {
+       __u32 nr_xcrs;
+       __u32 flags;
+       struct kvm_xcr xcrs[KVM_MAX_XCRS];
+       __u64 padding[16];
+  };
+
+This ioctl would copy current vcpu's xcrs to the userspace.
+
+
+4.45 KVM_SET_XCRS
+-----------------
+
+:Capability: KVM_CAP_XCRS
+:Architectures: x86
+:Type: vcpu ioctl
+:Parameters: struct kvm_xcrs (in)
+:Returns: 0 on success, -1 on error
+
+::
+
+  struct kvm_xcr {
+       __u32 xcr;
+       __u32 reserved;
+       __u64 value;
+  };
+
+  struct kvm_xcrs {
+       __u32 nr_xcrs;
+       __u32 flags;
+       struct kvm_xcr xcrs[KVM_MAX_XCRS];
+       __u64 padding[16];
+  };
+
+This ioctl would set vcpu's xcr to the value userspace specified.
+
+
+4.46 KVM_GET_SUPPORTED_CPUID
+----------------------------
+
+:Capability: KVM_CAP_EXT_CPUID
+:Architectures: x86
+:Type: system ioctl
+:Parameters: struct kvm_cpuid2 (in/out)
+:Returns: 0 on success, -1 on error
+
+::
+
+  struct kvm_cpuid2 {
+       __u32 nent;
+       __u32 padding;
+       struct kvm_cpuid_entry2 entries[0];
+  };
+
+  #define KVM_CPUID_FLAG_SIGNIFCANT_INDEX              BIT(0)
+  #define KVM_CPUID_FLAG_STATEFUL_FUNC         BIT(1)
+  #define KVM_CPUID_FLAG_STATE_READ_NEXT               BIT(2)
+
+  struct kvm_cpuid_entry2 {
+       __u32 function;
+       __u32 index;
+       __u32 flags;
+       __u32 eax;
+       __u32 ebx;
+       __u32 ecx;
+       __u32 edx;
+       __u32 padding[3];
+  };
+
+This ioctl returns x86 cpuid features which are supported by both the
+hardware and kvm in its default configuration.  Userspace can use the
+information returned by this ioctl to construct cpuid information (for
+KVM_SET_CPUID2) that is consistent with hardware, kernel, and
+userspace capabilities, and with user requirements (for example, the
+user may wish to constrain cpuid to emulate older hardware, or for
+feature consistency across a cluster).
+
+Note that certain capabilities, such as KVM_CAP_X86_DISABLE_EXITS, may
+expose cpuid features (e.g. MONITOR) which are not supported by kvm in
+its default configuration. If userspace enables such capabilities, it
+is responsible for modifying the results of this ioctl appropriately.
+
+Userspace invokes KVM_GET_SUPPORTED_CPUID by passing a kvm_cpuid2 structure
+with the 'nent' field indicating the number of entries in the variable-size
+array 'entries'.  If the number of entries is too low to describe the cpu
+capabilities, an error (E2BIG) is returned.  If the number is too high,
+the 'nent' field is adjusted and an error (ENOMEM) is returned.  If the
+number is just right, the 'nent' field is adjusted to the number of valid
+entries in the 'entries' array, which is then filled.
+
+The entries returned are the host cpuid as returned by the cpuid instruction,
+with unknown or unsupported features masked out.  Some features (for example,
+x2apic), may not be present in the host cpu, but are exposed by kvm if it can
+emulate them efficiently. The fields in each entry are defined as follows:
+
+  function:
+         the eax value used to obtain the entry
+
+  index:
+         the ecx value used to obtain the entry (for entries that are
+         affected by ecx)
+
+  flags:
+     an OR of zero or more of the following:
+
+        KVM_CPUID_FLAG_SIGNIFCANT_INDEX:
+           if the index field is valid
+        KVM_CPUID_FLAG_STATEFUL_FUNC:
+           if cpuid for this function returns different values for successive
+           invocations; there will be several entries with the same function,
+           all with this flag set
+        KVM_CPUID_FLAG_STATE_READ_NEXT:
+           for KVM_CPUID_FLAG_STATEFUL_FUNC entries, set if this entry is
+           the first entry to be read by a cpu
+
+   eax, ebx, ecx, edx:
+         the values returned by the cpuid instruction for
+         this function/index combination
+
+The TSC deadline timer feature (CPUID leaf 1, ecx[24]) is always returned
+as false, since the feature depends on KVM_CREATE_IRQCHIP for local APIC
+support.  Instead it is reported via::
+
+  ioctl(KVM_CHECK_EXTENSION, KVM_CAP_TSC_DEADLINE_TIMER)
+
+if that returns true and you use KVM_CREATE_IRQCHIP, or if you emulate the
+feature in userspace, then you can enable the feature for KVM_SET_CPUID2.
+
+
+4.47 KVM_PPC_GET_PVINFO
+-----------------------
+
+:Capability: KVM_CAP_PPC_GET_PVINFO
+:Architectures: ppc
+:Type: vm ioctl
+:Parameters: struct kvm_ppc_pvinfo (out)
+:Returns: 0 on success, !0 on error
+
+::
+
+  struct kvm_ppc_pvinfo {
+       __u32 flags;
+       __u32 hcall[4];
+       __u8  pad[108];
+  };
+
+This ioctl fetches PV specific information that need to be passed to the guest
+using the device tree or other means from vm context.
+
+The hcall array defines 4 instructions that make up a hypercall.
+
+If any additional field gets added to this structure later on, a bit for that
+additional piece of information will be set in the flags bitmap.
+
+The flags bitmap is defined as::
+
+   /* the host supports the ePAPR idle hcall
+   #define KVM_PPC_PVINFO_FLAGS_EV_IDLE   (1<<0)
+
+4.52 KVM_SET_GSI_ROUTING
+------------------------
+
+:Capability: KVM_CAP_IRQ_ROUTING
+:Architectures: x86 s390 arm arm64
+:Type: vm ioctl
+:Parameters: struct kvm_irq_routing (in)
+:Returns: 0 on success, -1 on error
+
+Sets the GSI routing table entries, overwriting any previously set entries.
+
+On arm/arm64, GSI routing has the following limitation:
+
+- GSI routing does not apply to KVM_IRQ_LINE but only to KVM_IRQFD.
+
+::
+
+  struct kvm_irq_routing {
+       __u32 nr;
+       __u32 flags;
+       struct kvm_irq_routing_entry entries[0];
+  };
+
+No flags are specified so far, the corresponding field must be set to zero.
+
+::
+
+  struct kvm_irq_routing_entry {
+       __u32 gsi;
+       __u32 type;
+       __u32 flags;
+       __u32 pad;
+       union {
+               struct kvm_irq_routing_irqchip irqchip;
+               struct kvm_irq_routing_msi msi;
+               struct kvm_irq_routing_s390_adapter adapter;
+               struct kvm_irq_routing_hv_sint hv_sint;
+               __u32 pad[8];
+       } u;
+  };
+
+  /* gsi routing entry types */
+  #define KVM_IRQ_ROUTING_IRQCHIP 1
+  #define KVM_IRQ_ROUTING_MSI 2
+  #define KVM_IRQ_ROUTING_S390_ADAPTER 3
+  #define KVM_IRQ_ROUTING_HV_SINT 4
+
+flags:
+
+- KVM_MSI_VALID_DEVID: used along with KVM_IRQ_ROUTING_MSI routing entry
+  type, specifies that the devid field contains a valid value.  The per-VM
+  KVM_CAP_MSI_DEVID capability advertises the requirement to provide
+  the device ID.  If this capability is not available, userspace should
+  never set the KVM_MSI_VALID_DEVID flag as the ioctl might fail.
+- zero otherwise
+
+::
+
+  struct kvm_irq_routing_irqchip {
+       __u32 irqchip;
+       __u32 pin;
+  };
+
+  struct kvm_irq_routing_msi {
+       __u32 address_lo;
+       __u32 address_hi;
+       __u32 data;
+       union {
+               __u32 pad;
+               __u32 devid;
+       };
+  };
+
+If KVM_MSI_VALID_DEVID is set, devid contains a unique device identifier
+for the device that wrote the MSI message.  For PCI, this is usually a
+BFD identifier in the lower 16 bits.
+
+On x86, address_hi is ignored unless the KVM_X2APIC_API_USE_32BIT_IDS
+feature of KVM_CAP_X2APIC_API capability is enabled.  If it is enabled,
+address_hi bits 31-8 provide bits 31-8 of the destination id.  Bits 7-0 of
+address_hi must be zero.
+
+::
+
+  struct kvm_irq_routing_s390_adapter {
+       __u64 ind_addr;
+       __u64 summary_addr;
+       __u64 ind_offset;
+       __u32 summary_offset;
+       __u32 adapter_id;
+  };
+
+  struct kvm_irq_routing_hv_sint {
+       __u32 vcpu;
+       __u32 sint;
+  };
+
+
+4.55 KVM_SET_TSC_KHZ
+--------------------
+
+:Capability: KVM_CAP_TSC_CONTROL
+:Architectures: x86
+:Type: vcpu ioctl
+:Parameters: virtual tsc_khz
+:Returns: 0 on success, -1 on error
+
+Specifies the tsc frequency for the virtual machine. The unit of the
+frequency is KHz.
+
+
+4.56 KVM_GET_TSC_KHZ
+--------------------
+
+:Capability: KVM_CAP_GET_TSC_KHZ
+:Architectures: x86
+:Type: vcpu ioctl
+:Parameters: none
+:Returns: virtual tsc-khz on success, negative value on error
+
+Returns the tsc frequency of the guest. The unit of the return value is
+KHz. If the host has unstable tsc this ioctl returns -EIO instead as an
+error.
+
+
+4.57 KVM_GET_LAPIC
+------------------
+
+:Capability: KVM_CAP_IRQCHIP
+:Architectures: x86
+:Type: vcpu ioctl
+:Parameters: struct kvm_lapic_state (out)
+:Returns: 0 on success, -1 on error
+
+::
+
+  #define KVM_APIC_REG_SIZE 0x400
+  struct kvm_lapic_state {
+       char regs[KVM_APIC_REG_SIZE];
+  };
+
+Reads the Local APIC registers and copies them into the input argument.  The
+data format and layout are the same as documented in the architecture manual.
+
+If KVM_X2APIC_API_USE_32BIT_IDS feature of KVM_CAP_X2APIC_API is
+enabled, then the format of APIC_ID register depends on the APIC mode
+(reported by MSR_IA32_APICBASE) of its VCPU.  x2APIC stores APIC ID in
+the APIC_ID register (bytes 32-35).  xAPIC only allows an 8-bit APIC ID
+which is stored in bits 31-24 of the APIC register, or equivalently in
+byte 35 of struct kvm_lapic_state's regs field.  KVM_GET_LAPIC must then
+be called after MSR_IA32_APICBASE has been set with KVM_SET_MSR.
+
+If KVM_X2APIC_API_USE_32BIT_IDS feature is disabled, struct kvm_lapic_state
+always uses xAPIC format.
+
+
+4.58 KVM_SET_LAPIC
+------------------
+
+:Capability: KVM_CAP_IRQCHIP
+:Architectures: x86
+:Type: vcpu ioctl
+:Parameters: struct kvm_lapic_state (in)
+:Returns: 0 on success, -1 on error
+
+::
+
+  #define KVM_APIC_REG_SIZE 0x400
+  struct kvm_lapic_state {
+       char regs[KVM_APIC_REG_SIZE];
+  };
+
+Copies the input argument into the Local APIC registers.  The data format
+and layout are the same as documented in the architecture manual.
+
+The format of the APIC ID register (bytes 32-35 of struct kvm_lapic_state's
+regs field) depends on the state of the KVM_CAP_X2APIC_API capability.
+See the note in KVM_GET_LAPIC.
+
+
+4.59 KVM_IOEVENTFD
+------------------
+
+:Capability: KVM_CAP_IOEVENTFD
+:Architectures: all
+:Type: vm ioctl
+:Parameters: struct kvm_ioeventfd (in)
+:Returns: 0 on success, !0 on error
+
+This ioctl attaches or detaches an ioeventfd to a legal pio/mmio address
+within the guest.  A guest write in the registered address will signal the
+provided event instead of triggering an exit.
+
+::
+
+  struct kvm_ioeventfd {
+       __u64 datamatch;
+       __u64 addr;        /* legal pio/mmio address */
+       __u32 len;         /* 0, 1, 2, 4, or 8 bytes    */
+       __s32 fd;
+       __u32 flags;
+       __u8  pad[36];
+  };
+
+For the special case of virtio-ccw devices on s390, the ioevent is matched
+to a subchannel/virtqueue tuple instead.
+
+The following flags are defined::
+
+  #define KVM_IOEVENTFD_FLAG_DATAMATCH (1 << kvm_ioeventfd_flag_nr_datamatch)
+  #define KVM_IOEVENTFD_FLAG_PIO       (1 << kvm_ioeventfd_flag_nr_pio)
+  #define KVM_IOEVENTFD_FLAG_DEASSIGN  (1 << kvm_ioeventfd_flag_nr_deassign)
+  #define KVM_IOEVENTFD_FLAG_VIRTIO_CCW_NOTIFY \
+       (1 << kvm_ioeventfd_flag_nr_virtio_ccw_notify)
+
+If datamatch flag is set, the event will be signaled only if the written value
+to the registered address is equal to datamatch in struct kvm_ioeventfd.
+
+For virtio-ccw devices, addr contains the subchannel id and datamatch the
+virtqueue index.
+
+With KVM_CAP_IOEVENTFD_ANY_LENGTH, a zero length ioeventfd is allowed, and
+the kernel will ignore the length of guest write and may get a faster vmexit.
+The speedup may only apply to specific architectures, but the ioeventfd will
+work anyway.
+
+4.60 KVM_DIRTY_TLB
+------------------
+
+:Capability: KVM_CAP_SW_TLB
+:Architectures: ppc
+:Type: vcpu ioctl
+:Parameters: struct kvm_dirty_tlb (in)
+:Returns: 0 on success, -1 on error
+
+::
+
+  struct kvm_dirty_tlb {
+       __u64 bitmap;
+       __u32 num_dirty;
+  };
+
+This must be called whenever userspace has changed an entry in the shared
+TLB, prior to calling KVM_RUN on the associated vcpu.
+
+The "bitmap" field is the userspace address of an array.  This array
+consists of a number of bits, equal to the total number of TLB entries as
+determined by the last successful call to KVM_CONFIG_TLB, rounded up to the
+nearest multiple of 64.
+
+Each bit corresponds to one TLB entry, ordered the same as in the shared TLB
+array.
+
+The array is little-endian: the bit 0 is the least significant bit of the
+first byte, bit 8 is the least significant bit of the second byte, etc.
+This avoids any complications with differing word sizes.
+
+The "num_dirty" field is a performance hint for KVM to determine whether it
+should skip processing the bitmap and just invalidate everything.  It must
+be set to the number of set bits in the bitmap.
+
+
+4.62 KVM_CREATE_SPAPR_TCE
+-------------------------
+
+:Capability: KVM_CAP_SPAPR_TCE
+:Architectures: powerpc
+:Type: vm ioctl
+:Parameters: struct kvm_create_spapr_tce (in)
+:Returns: file descriptor for manipulating the created TCE table
+
+This creates a virtual TCE (translation control entry) table, which
+is an IOMMU for PAPR-style virtual I/O.  It is used to translate
+logical addresses used in virtual I/O into guest physical addresses,
+and provides a scatter/gather capability for PAPR virtual I/O.
+
+::
+
+  /* for KVM_CAP_SPAPR_TCE */
+  struct kvm_create_spapr_tce {
+       __u64 liobn;
+       __u32 window_size;
+  };
+
+The liobn field gives the logical IO bus number for which to create a
+TCE table.  The window_size field specifies the size of the DMA window
+which this TCE table will translate - the table will contain one 64
+bit TCE entry for every 4kiB of the DMA window.
+
+When the guest issues an H_PUT_TCE hcall on a liobn for which a TCE
+table has been created using this ioctl(), the kernel will handle it
+in real mode, updating the TCE table.  H_PUT_TCE calls for other
+liobns will cause a vm exit and must be handled by userspace.
+
+The return value is a file descriptor which can be passed to mmap(2)
+to map the created TCE table into userspace.  This lets userspace read
+the entries written by kernel-handled H_PUT_TCE calls, and also lets
+userspace update the TCE table directly which is useful in some
+circumstances.
+
+
+4.63 KVM_ALLOCATE_RMA
+---------------------
+
+:Capability: KVM_CAP_PPC_RMA
+:Architectures: powerpc
+:Type: vm ioctl
+:Parameters: struct kvm_allocate_rma (out)
+:Returns: file descriptor for mapping the allocated RMA
+
+This allocates a Real Mode Area (RMA) from the pool allocated at boot
+time by the kernel.  An RMA is a physically-contiguous, aligned region
+of memory used on older POWER processors to provide the memory which
+will be accessed by real-mode (MMU off) accesses in a KVM guest.
+POWER processors support a set of sizes for the RMA that usually
+includes 64MB, 128MB, 256MB and some larger powers of two.
+
+::
+
+  /* for KVM_ALLOCATE_RMA */
+  struct kvm_allocate_rma {
+       __u64 rma_size;
+  };
+
+The return value is a file descriptor which can be passed to mmap(2)
+to map the allocated RMA into userspace.  The mapped area can then be
+passed to the KVM_SET_USER_MEMORY_REGION ioctl to establish it as the
+RMA for a virtual machine.  The size of the RMA in bytes (which is
+fixed at host kernel boot time) is returned in the rma_size field of
+the argument structure.
+
+The KVM_CAP_PPC_RMA capability is 1 or 2 if the KVM_ALLOCATE_RMA ioctl
+is supported; 2 if the processor requires all virtual machines to have
+an RMA, or 1 if the processor can use an RMA but doesn't require it,
+because it supports the Virtual RMA (VRMA) facility.
+
+
+4.64 KVM_NMI
+------------
+
+:Capability: KVM_CAP_USER_NMI
+:Architectures: x86
+:Type: vcpu ioctl
+:Parameters: none
+:Returns: 0 on success, -1 on error
+
+Queues an NMI on the thread's vcpu.  Note this is well defined only
+when KVM_CREATE_IRQCHIP has not been called, since this is an interface
+between the virtual cpu core and virtual local APIC.  After KVM_CREATE_IRQCHIP
+has been called, this interface is completely emulated within the kernel.
+
+To use this to emulate the LINT1 input with KVM_CREATE_IRQCHIP, use the
+following algorithm:
+
+  - pause the vcpu
+  - read the local APIC's state (KVM_GET_LAPIC)
+  - check whether changing LINT1 will queue an NMI (see the LVT entry for LINT1)
+  - if so, issue KVM_NMI
+  - resume the vcpu
+
+Some guests configure the LINT1 NMI input to cause a panic, aiding in
+debugging.
+
+
+4.65 KVM_S390_UCAS_MAP
+----------------------
+
+:Capability: KVM_CAP_S390_UCONTROL
+:Architectures: s390
+:Type: vcpu ioctl
+:Parameters: struct kvm_s390_ucas_mapping (in)
+:Returns: 0 in case of success
+
+The parameter is defined like this::
+
+       struct kvm_s390_ucas_mapping {
+               __u64 user_addr;
+               __u64 vcpu_addr;
+               __u64 length;
+       };
+
+This ioctl maps the memory at "user_addr" with the length "length" to
+the vcpu's address space starting at "vcpu_addr". All parameters need to
+be aligned by 1 megabyte.
+
+
+4.66 KVM_S390_UCAS_UNMAP
+------------------------
+
+:Capability: KVM_CAP_S390_UCONTROL
+:Architectures: s390
+:Type: vcpu ioctl
+:Parameters: struct kvm_s390_ucas_mapping (in)
+:Returns: 0 in case of success
+
+The parameter is defined like this::
+
+       struct kvm_s390_ucas_mapping {
+               __u64 user_addr;
+               __u64 vcpu_addr;
+               __u64 length;
+       };
+
+This ioctl unmaps the memory in the vcpu's address space starting at
+"vcpu_addr" with the length "length". The field "user_addr" is ignored.
+All parameters need to be aligned by 1 megabyte.
+
+
+4.67 KVM_S390_VCPU_FAULT
+------------------------
+
+:Capability: KVM_CAP_S390_UCONTROL
+:Architectures: s390
+:Type: vcpu ioctl
+:Parameters: vcpu absolute address (in)
+:Returns: 0 in case of success
+
+This call creates a page table entry on the virtual cpu's address space
+(for user controlled virtual machines) or the virtual machine's address
+space (for regular virtual machines). This only works for minor faults,
+thus it's recommended to access subject memory page via the user page
+table upfront. This is useful to handle validity intercepts for user
+controlled virtual machines to fault in the virtual cpu's lowcore pages
+prior to calling the KVM_RUN ioctl.
+
+
+4.68 KVM_SET_ONE_REG
+--------------------
+
+:Capability: KVM_CAP_ONE_REG
+:Architectures: all
+:Type: vcpu ioctl
+:Parameters: struct kvm_one_reg (in)
+:Returns: 0 on success, negative value on failure
+
+Errors:
+
+  ======   ============================================================
+  ENOENT   no such register
+  EINVAL   invalid register ID, or no such register
+  EPERM    (arm64) register access not allowed before vcpu finalization
+  ======   ============================================================
+
+(These error codes are indicative only: do not rely on a specific error
+code being returned in a specific situation.)
+
+::
+
+  struct kvm_one_reg {
+       __u64 id;
+       __u64 addr;
+ };
+
+Using this ioctl, a single vcpu register can be set to a specific value
+defined by user space with the passed in struct kvm_one_reg, where id
+refers to the register identifier as described below and addr is a pointer
+to a variable with the respective size. There can be architecture agnostic
+and architecture specific registers. Each have their own range of operation
+and their own constants and width. To keep track of the implemented
+registers, find a list below:
+
+  ======= =============================== ============
+  Arch              Register              Width (bits)
+  ======= =============================== ============
+  PPC     KVM_REG_PPC_HIOR                64
+  PPC     KVM_REG_PPC_IAC1                64
+  PPC     KVM_REG_PPC_IAC2                64
+  PPC     KVM_REG_PPC_IAC3                64
+  PPC     KVM_REG_PPC_IAC4                64
+  PPC     KVM_REG_PPC_DAC1                64
+  PPC     KVM_REG_PPC_DAC2                64
+  PPC     KVM_REG_PPC_DABR                64
+  PPC     KVM_REG_PPC_DSCR                64
+  PPC     KVM_REG_PPC_PURR                64
+  PPC     KVM_REG_PPC_SPURR               64
+  PPC     KVM_REG_PPC_DAR                 64
+  PPC     KVM_REG_PPC_DSISR               32
+  PPC     KVM_REG_PPC_AMR                 64
+  PPC     KVM_REG_PPC_UAMOR               64
+  PPC     KVM_REG_PPC_MMCR0               64
+  PPC     KVM_REG_PPC_MMCR1               64
+  PPC     KVM_REG_PPC_MMCRA               64
+  PPC     KVM_REG_PPC_MMCR2               64
+  PPC     KVM_REG_PPC_MMCRS               64
+  PPC     KVM_REG_PPC_SIAR                64
+  PPC     KVM_REG_PPC_SDAR                64
+  PPC     KVM_REG_PPC_SIER                64
+  PPC     KVM_REG_PPC_PMC1                32
+  PPC     KVM_REG_PPC_PMC2                32
+  PPC     KVM_REG_PPC_PMC3                32
+  PPC     KVM_REG_PPC_PMC4                32
+  PPC     KVM_REG_PPC_PMC5                32
+  PPC     KVM_REG_PPC_PMC6                32
+  PPC     KVM_REG_PPC_PMC7                32
+  PPC     KVM_REG_PPC_PMC8                32
+  PPC     KVM_REG_PPC_FPR0                64
+  ...
+  PPC     KVM_REG_PPC_FPR31               64
+  PPC     KVM_REG_PPC_VR0                 128
+  ...
+  PPC     KVM_REG_PPC_VR31                128
+  PPC     KVM_REG_PPC_VSR0                128
+  ...
+  PPC     KVM_REG_PPC_VSR31               128
+  PPC     KVM_REG_PPC_FPSCR               64
+  PPC     KVM_REG_PPC_VSCR                32
+  PPC     KVM_REG_PPC_VPA_ADDR            64
+  PPC     KVM_REG_PPC_VPA_SLB             128
+  PPC     KVM_REG_PPC_VPA_DTL             128
+  PPC     KVM_REG_PPC_EPCR                32
+  PPC     KVM_REG_PPC_EPR                 32
+  PPC     KVM_REG_PPC_TCR                 32
+  PPC     KVM_REG_PPC_TSR                 32
+  PPC     KVM_REG_PPC_OR_TSR              32
+  PPC     KVM_REG_PPC_CLEAR_TSR           32
+  PPC     KVM_REG_PPC_MAS0                32
+  PPC     KVM_REG_PPC_MAS1                32
+  PPC     KVM_REG_PPC_MAS2                64
+  PPC     KVM_REG_PPC_MAS7_3              64
+  PPC     KVM_REG_PPC_MAS4                32
+  PPC     KVM_REG_PPC_MAS6                32
+  PPC     KVM_REG_PPC_MMUCFG              32
+  PPC     KVM_REG_PPC_TLB0CFG             32
+  PPC     KVM_REG_PPC_TLB1CFG             32
+  PPC     KVM_REG_PPC_TLB2CFG             32
+  PPC     KVM_REG_PPC_TLB3CFG             32
+  PPC     KVM_REG_PPC_TLB0PS              32
+  PPC     KVM_REG_PPC_TLB1PS              32
+  PPC     KVM_REG_PPC_TLB2PS              32
+  PPC     KVM_REG_PPC_TLB3PS              32
+  PPC     KVM_REG_PPC_EPTCFG              32
+  PPC     KVM_REG_PPC_ICP_STATE           64
+  PPC     KVM_REG_PPC_VP_STATE            128
+  PPC     KVM_REG_PPC_TB_OFFSET           64
+  PPC     KVM_REG_PPC_SPMC1               32
+  PPC     KVM_REG_PPC_SPMC2               32
+  PPC     KVM_REG_PPC_IAMR                64
+  PPC     KVM_REG_PPC_TFHAR               64
+  PPC     KVM_REG_PPC_TFIAR               64
+  PPC     KVM_REG_PPC_TEXASR              64
+  PPC     KVM_REG_PPC_FSCR                64
+  PPC     KVM_REG_PPC_PSPB                32
+  PPC     KVM_REG_PPC_EBBHR               64
+  PPC     KVM_REG_PPC_EBBRR               64
+  PPC     KVM_REG_PPC_BESCR               64
+  PPC     KVM_REG_PPC_TAR                 64
+  PPC     KVM_REG_PPC_DPDES               64
+  PPC     KVM_REG_PPC_DAWR                64
+  PPC     KVM_REG_PPC_DAWRX               64
+  PPC     KVM_REG_PPC_CIABR               64
+  PPC     KVM_REG_PPC_IC                  64
+  PPC     KVM_REG_PPC_VTB                 64
+  PPC     KVM_REG_PPC_CSIGR               64
+  PPC     KVM_REG_PPC_TACR                64
+  PPC     KVM_REG_PPC_TCSCR               64
+  PPC     KVM_REG_PPC_PID                 64
+  PPC     KVM_REG_PPC_ACOP                64
+  PPC     KVM_REG_PPC_VRSAVE              32
+  PPC     KVM_REG_PPC_LPCR                32
+  PPC     KVM_REG_PPC_LPCR_64             64
+  PPC     KVM_REG_PPC_PPR                 64
+  PPC     KVM_REG_PPC_ARCH_COMPAT         32
+  PPC     KVM_REG_PPC_DABRX               32
+  PPC     KVM_REG_PPC_WORT                64
+  PPC    KVM_REG_PPC_SPRG9               64
+  PPC    KVM_REG_PPC_DBSR                32
+  PPC     KVM_REG_PPC_TIDR                64
+  PPC     KVM_REG_PPC_PSSCR               64
+  PPC     KVM_REG_PPC_DEC_EXPIRY          64
+  PPC     KVM_REG_PPC_PTCR                64
+  PPC     KVM_REG_PPC_TM_GPR0             64
+  ...
+  PPC     KVM_REG_PPC_TM_GPR31            64
+  PPC     KVM_REG_PPC_TM_VSR0             128
+  ...
+  PPC     KVM_REG_PPC_TM_VSR63            128
+  PPC     KVM_REG_PPC_TM_CR               64
+  PPC     KVM_REG_PPC_TM_LR               64
+  PPC     KVM_REG_PPC_TM_CTR              64
+  PPC     KVM_REG_PPC_TM_FPSCR            64
+  PPC     KVM_REG_PPC_TM_AMR              64
+  PPC     KVM_REG_PPC_TM_PPR              64
+  PPC     KVM_REG_PPC_TM_VRSAVE           64
+  PPC     KVM_REG_PPC_TM_VSCR             32
+  PPC     KVM_REG_PPC_TM_DSCR             64
+  PPC     KVM_REG_PPC_TM_TAR              64
+  PPC     KVM_REG_PPC_TM_XER              64
+
+  MIPS    KVM_REG_MIPS_R0                 64
+  ...
+  MIPS    KVM_REG_MIPS_R31                64
+  MIPS    KVM_REG_MIPS_HI                 64
+  MIPS    KVM_REG_MIPS_LO                 64
+  MIPS    KVM_REG_MIPS_PC                 64
+  MIPS    KVM_REG_MIPS_CP0_INDEX          32
+  MIPS    KVM_REG_MIPS_CP0_ENTRYLO0       64
+  MIPS    KVM_REG_MIPS_CP0_ENTRYLO1       64
+  MIPS    KVM_REG_MIPS_CP0_CONTEXT        64
+  MIPS    KVM_REG_MIPS_CP0_CONTEXTCONFIG  32
+  MIPS    KVM_REG_MIPS_CP0_USERLOCAL      64
+  MIPS    KVM_REG_MIPS_CP0_XCONTEXTCONFIG 64
+  MIPS    KVM_REG_MIPS_CP0_PAGEMASK       32
+  MIPS    KVM_REG_MIPS_CP0_PAGEGRAIN      32
+  MIPS    KVM_REG_MIPS_CP0_SEGCTL0        64
+  MIPS    KVM_REG_MIPS_CP0_SEGCTL1        64
+  MIPS    KVM_REG_MIPS_CP0_SEGCTL2        64
+  MIPS    KVM_REG_MIPS_CP0_PWBASE         64
+  MIPS    KVM_REG_MIPS_CP0_PWFIELD        64
+  MIPS    KVM_REG_MIPS_CP0_PWSIZE         64
+  MIPS    KVM_REG_MIPS_CP0_WIRED          32
+  MIPS    KVM_REG_MIPS_CP0_PWCTL          32
+  MIPS    KVM_REG_MIPS_CP0_HWRENA         32
+  MIPS    KVM_REG_MIPS_CP0_BADVADDR       64
+  MIPS    KVM_REG_MIPS_CP0_BADINSTR       32
+  MIPS    KVM_REG_MIPS_CP0_BADINSTRP      32
+  MIPS    KVM_REG_MIPS_CP0_COUNT          32
+  MIPS    KVM_REG_MIPS_CP0_ENTRYHI        64
+  MIPS    KVM_REG_MIPS_CP0_COMPARE        32
+  MIPS    KVM_REG_MIPS_CP0_STATUS         32
+  MIPS    KVM_REG_MIPS_CP0_INTCTL         32
+  MIPS    KVM_REG_MIPS_CP0_CAUSE          32
+  MIPS    KVM_REG_MIPS_CP0_EPC            64
+  MIPS    KVM_REG_MIPS_CP0_PRID           32
+  MIPS    KVM_REG_MIPS_CP0_EBASE          64
+  MIPS    KVM_REG_MIPS_CP0_CONFIG         32
+  MIPS    KVM_REG_MIPS_CP0_CONFIG1        32
+  MIPS    KVM_REG_MIPS_CP0_CONFIG2        32
+  MIPS    KVM_REG_MIPS_CP0_CONFIG3        32
+  MIPS    KVM_REG_MIPS_CP0_CONFIG4        32
+  MIPS    KVM_REG_MIPS_CP0_CONFIG5        32
+  MIPS    KVM_REG_MIPS_CP0_CONFIG7        32
+  MIPS    KVM_REG_MIPS_CP0_XCONTEXT       64
+  MIPS    KVM_REG_MIPS_CP0_ERROREPC       64
+  MIPS    KVM_REG_MIPS_CP0_KSCRATCH1      64
+  MIPS    KVM_REG_MIPS_CP0_KSCRATCH2      64
+  MIPS    KVM_REG_MIPS_CP0_KSCRATCH3      64
+  MIPS    KVM_REG_MIPS_CP0_KSCRATCH4      64
+  MIPS    KVM_REG_MIPS_CP0_KSCRATCH5      64
+  MIPS    KVM_REG_MIPS_CP0_KSCRATCH6      64
+  MIPS    KVM_REG_MIPS_CP0_MAAR(0..63)    64
+  MIPS    KVM_REG_MIPS_COUNT_CTL          64
+  MIPS    KVM_REG_MIPS_COUNT_RESUME       64
+  MIPS    KVM_REG_MIPS_COUNT_HZ           64
+  MIPS    KVM_REG_MIPS_FPR_32(0..31)      32
+  MIPS    KVM_REG_MIPS_FPR_64(0..31)      64
+  MIPS    KVM_REG_MIPS_VEC_128(0..31)     128
+  MIPS    KVM_REG_MIPS_FCR_IR             32
+  MIPS    KVM_REG_MIPS_FCR_CSR            32
+  MIPS    KVM_REG_MIPS_MSA_IR             32
+  MIPS    KVM_REG_MIPS_MSA_CSR            32
+  ======= =============================== ============
+
+ARM registers are mapped using the lower 32 bits.  The upper 16 of that
+is the register group type, or coprocessor number:
+
+ARM core registers have the following id bit patterns::
+
+  0x4020 0000 0010 <index into the kvm_regs struct:16>
+
+ARM 32-bit CP15 registers have the following id bit patterns::
+
+  0x4020 0000 000F <zero:1> <crn:4> <crm:4> <opc1:4> <opc2:3>
+
+ARM 64-bit CP15 registers have the following id bit patterns::
+
+  0x4030 0000 000F <zero:1> <zero:4> <crm:4> <opc1:4> <zero:3>
+
+ARM CCSIDR registers are demultiplexed by CSSELR value::
+
+  0x4020 0000 0011 00 <csselr:8>
+
+ARM 32-bit VFP control registers have the following id bit patterns::
+
+  0x4020 0000 0012 1 <regno:12>
+
+ARM 64-bit FP registers have the following id bit patterns::
+
+  0x4030 0000 0012 0 <regno:12>
+
+ARM firmware pseudo-registers have the following bit pattern::
+
+  0x4030 0000 0014 <regno:16>
+
+
+arm64 registers are mapped using the lower 32 bits. The upper 16 of
+that is the register group type, or coprocessor number:
+
+arm64 core/FP-SIMD registers have the following id bit patterns. Note
+that the size of the access is variable, as the kvm_regs structure
+contains elements ranging from 32 to 128 bits. The index is a 32bit
+value in the kvm_regs structure seen as a 32bit array::
+
+  0x60x0 0000 0010 <index into the kvm_regs struct:16>
+
+Specifically:
+
+======================= ========= ===== =======================================
+    Encoding            Register  Bits  kvm_regs member
+======================= ========= ===== =======================================
+  0x6030 0000 0010 0000 X0          64  regs.regs[0]
+  0x6030 0000 0010 0002 X1          64  regs.regs[1]
+  ...
+  0x6030 0000 0010 003c X30         64  regs.regs[30]
+  0x6030 0000 0010 003e SP          64  regs.sp
+  0x6030 0000 0010 0040 PC          64  regs.pc
+  0x6030 0000 0010 0042 PSTATE      64  regs.pstate
+  0x6030 0000 0010 0044 SP_EL1      64  sp_el1
+  0x6030 0000 0010 0046 ELR_EL1     64  elr_el1
+  0x6030 0000 0010 0048 SPSR_EL1    64  spsr[KVM_SPSR_EL1] (alias SPSR_SVC)
+  0x6030 0000 0010 004a SPSR_ABT    64  spsr[KVM_SPSR_ABT]
+  0x6030 0000 0010 004c SPSR_UND    64  spsr[KVM_SPSR_UND]
+  0x6030 0000 0010 004e SPSR_IRQ    64  spsr[KVM_SPSR_IRQ]
+  0x6060 0000 0010 0050 SPSR_FIQ    64  spsr[KVM_SPSR_FIQ]
+  0x6040 0000 0010 0054 V0         128  fp_regs.vregs[0]    [1]_
+  0x6040 0000 0010 0058 V1         128  fp_regs.vregs[1]    [1]_
+  ...
+  0x6040 0000 0010 00d0 V31        128  fp_regs.vregs[31]   [1]_
+  0x6020 0000 0010 00d4 FPSR        32  fp_regs.fpsr
+  0x6020 0000 0010 00d5 FPCR        32  fp_regs.fpcr
+======================= ========= ===== =======================================
+
+.. [1] These encodings are not accepted for SVE-enabled vcpus.  See
+       KVM_ARM_VCPU_INIT.
+
+       The equivalent register content can be accessed via bits [127:0] of
+       the corresponding SVE Zn registers instead for vcpus that have SVE
+       enabled (see below).
+
+arm64 CCSIDR registers are demultiplexed by CSSELR value::
+
+  0x6020 0000 0011 00 <csselr:8>
+
+arm64 system registers have the following id bit patterns::
+
+  0x6030 0000 0013 <op0:2> <op1:3> <crn:4> <crm:4> <op2:3>
+
+.. warning::
+
+     Two system register IDs do not follow the specified pattern.  These
+     are KVM_REG_ARM_TIMER_CVAL and KVM_REG_ARM_TIMER_CNT, which map to
+     system registers CNTV_CVAL_EL0 and CNTVCT_EL0 respectively.  These
+     two had their values accidentally swapped, which means TIMER_CVAL is
+     derived from the register encoding for CNTVCT_EL0 and TIMER_CNT is
+     derived from the register encoding for CNTV_CVAL_EL0.  As this is
+     API, it must remain this way.
+
+arm64 firmware pseudo-registers have the following bit pattern::
+
+  0x6030 0000 0014 <regno:16>
+
+arm64 SVE registers have the following bit patterns::
+
+  0x6080 0000 0015 00 <n:5> <slice:5>   Zn bits[2048*slice + 2047 : 2048*slice]
+  0x6050 0000 0015 04 <n:4> <slice:5>   Pn bits[256*slice + 255 : 256*slice]
+  0x6050 0000 0015 060 <slice:5>        FFR bits[256*slice + 255 : 256*slice]
+  0x6060 0000 0015 ffff                 KVM_REG_ARM64_SVE_VLS pseudo-register
+
+Access to register IDs where 2048 * slice >= 128 * max_vq will fail with
+ENOENT.  max_vq is the vcpu's maximum supported vector length in 128-bit
+quadwords: see [2]_ below.
+
+These registers are only accessible on vcpus for which SVE is enabled.
+See KVM_ARM_VCPU_INIT for details.
+
+In addition, except for KVM_REG_ARM64_SVE_VLS, these registers are not
+accessible until the vcpu's SVE configuration has been finalized
+using KVM_ARM_VCPU_FINALIZE(KVM_ARM_VCPU_SVE).  See KVM_ARM_VCPU_INIT
+and KVM_ARM_VCPU_FINALIZE for more information about this procedure.
+
+KVM_REG_ARM64_SVE_VLS is a pseudo-register that allows the set of vector
+lengths supported by the vcpu to be discovered and configured by
+userspace.  When transferred to or from user memory via KVM_GET_ONE_REG
+or KVM_SET_ONE_REG, the value of this register is of type
+__u64[KVM_ARM64_SVE_VLS_WORDS], and encodes the set of vector lengths as
+follows::
+
+  __u64 vector_lengths[KVM_ARM64_SVE_VLS_WORDS];
+
+  if (vq >= SVE_VQ_MIN && vq <= SVE_VQ_MAX &&
+      ((vector_lengths[(vq - KVM_ARM64_SVE_VQ_MIN) / 64] >>
+               ((vq - KVM_ARM64_SVE_VQ_MIN) % 64)) & 1))
+       /* Vector length vq * 16 bytes supported */
+  else
+       /* Vector length vq * 16 bytes not supported */
+
+.. [2] The maximum value vq for which the above condition is true is
+       max_vq.  This is the maximum vector length available to the guest on
+       this vcpu, and determines which register slices are visible through
+       this ioctl interface.
+
+(See Documentation/arm64/sve.rst for an explanation of the "vq"
+nomenclature.)
+
+KVM_REG_ARM64_SVE_VLS is only accessible after KVM_ARM_VCPU_INIT.
+KVM_ARM_VCPU_INIT initialises it to the best set of vector lengths that
+the host supports.
+
+Userspace may subsequently modify it if desired until the vcpu's SVE
+configuration is finalized using KVM_ARM_VCPU_FINALIZE(KVM_ARM_VCPU_SVE).
+
+Apart from simply removing all vector lengths from the host set that
+exceed some value, support for arbitrarily chosen sets of vector lengths
+is hardware-dependent and may not be available.  Attempting to configure
+an invalid set of vector lengths via KVM_SET_ONE_REG will fail with
+EINVAL.
+
+After the vcpu's SVE configuration is finalized, further attempts to
+write this register will fail with EPERM.
+
+
+MIPS registers are mapped using the lower 32 bits.  The upper 16 of that is
+the register group type:
+
+MIPS core registers (see above) have the following id bit patterns::
+
+  0x7030 0000 0000 <reg:16>
+
+MIPS CP0 registers (see KVM_REG_MIPS_CP0_* above) have the following id bit
+patterns depending on whether they're 32-bit or 64-bit registers::
+
+  0x7020 0000 0001 00 <reg:5> <sel:3>   (32-bit)
+  0x7030 0000 0001 00 <reg:5> <sel:3>   (64-bit)
+
+Note: KVM_REG_MIPS_CP0_ENTRYLO0 and KVM_REG_MIPS_CP0_ENTRYLO1 are the MIPS64
+versions of the EntryLo registers regardless of the word size of the host
+hardware, host kernel, guest, and whether XPA is present in the guest, i.e.
+with the RI and XI bits (if they exist) in bits 63 and 62 respectively, and
+the PFNX field starting at bit 30.
+
+MIPS MAARs (see KVM_REG_MIPS_CP0_MAAR(*) above) have the following id bit
+patterns::
+
+  0x7030 0000 0001 01 <reg:8>
+
+MIPS KVM control registers (see above) have the following id bit patterns::
+
+  0x7030 0000 0002 <reg:16>
+
+MIPS FPU registers (see KVM_REG_MIPS_FPR_{32,64}() above) have the following
+id bit patterns depending on the size of the register being accessed. They are
+always accessed according to the current guest FPU mode (Status.FR and
+Config5.FRE), i.e. as the guest would see them, and they become unpredictable
+if the guest FPU mode is changed. MIPS SIMD Architecture (MSA) vector
+registers (see KVM_REG_MIPS_VEC_128() above) have similar patterns as they
+overlap the FPU registers::
+
+  0x7020 0000 0003 00 <0:3> <reg:5> (32-bit FPU registers)
+  0x7030 0000 0003 00 <0:3> <reg:5> (64-bit FPU registers)
+  0x7040 0000 0003 00 <0:3> <reg:5> (128-bit MSA vector registers)
+
+MIPS FPU control registers (see KVM_REG_MIPS_FCR_{IR,CSR} above) have the
+following id bit patterns::
+
+  0x7020 0000 0003 01 <0:3> <reg:5>
+
+MIPS MSA control registers (see KVM_REG_MIPS_MSA_{IR,CSR} above) have the
+following id bit patterns::
+
+  0x7020 0000 0003 02 <0:3> <reg:5>
+
+
+4.69 KVM_GET_ONE_REG
+--------------------
+
+:Capability: KVM_CAP_ONE_REG
+:Architectures: all
+:Type: vcpu ioctl
+:Parameters: struct kvm_one_reg (in and out)
+:Returns: 0 on success, negative value on failure
+
+Errors include:
+
+  ======== ============================================================
+  ENOENT   no such register
+  EINVAL   invalid register ID, or no such register
+  EPERM    (arm64) register access not allowed before vcpu finalization
+  ======== ============================================================
+
+(These error codes are indicative only: do not rely on a specific error
+code being returned in a specific situation.)
+
+This ioctl allows to receive the value of a single register implemented
+in a vcpu. The register to read is indicated by the "id" field of the
+kvm_one_reg struct passed in. On success, the register value can be found
+at the memory location pointed to by "addr".
+
+The list of registers accessible using this interface is identical to the
+list in 4.68.
+
+
+4.70 KVM_KVMCLOCK_CTRL
+----------------------
+
+:Capability: KVM_CAP_KVMCLOCK_CTRL
+:Architectures: Any that implement pvclocks (currently x86 only)
+:Type: vcpu ioctl
+:Parameters: None
+:Returns: 0 on success, -1 on error
+
+This signals to the host kernel that the specified guest is being paused by
+userspace.  The host will set a flag in the pvclock structure that is checked
+from the soft lockup watchdog.  The flag is part of the pvclock structure that
+is shared between guest and host, specifically the second bit of the flags
+field of the pvclock_vcpu_time_info structure.  It will be set exclusively by
+the host and read/cleared exclusively by the guest.  The guest operation of
+checking and clearing the flag must an atomic operation so
+load-link/store-conditional, or equivalent must be used.  There are two cases
+where the guest will clear the flag: when the soft lockup watchdog timer resets
+itself or when a soft lockup is detected.  This ioctl can be called any time
+after pausing the vcpu, but before it is resumed.
+
+
+4.71 KVM_SIGNAL_MSI
+-------------------
+
+:Capability: KVM_CAP_SIGNAL_MSI
+:Architectures: x86 arm arm64
+:Type: vm ioctl
+:Parameters: struct kvm_msi (in)
+:Returns: >0 on delivery, 0 if guest blocked the MSI, and -1 on error
+
+Directly inject a MSI message. Only valid with in-kernel irqchip that handles
+MSI messages.
+
+::
+
+  struct kvm_msi {
+       __u32 address_lo;
+       __u32 address_hi;
+       __u32 data;
+       __u32 flags;
+       __u32 devid;
+       __u8  pad[12];
+  };
+
+flags:
+  KVM_MSI_VALID_DEVID: devid contains a valid value.  The per-VM
+  KVM_CAP_MSI_DEVID capability advertises the requirement to provide
+  the device ID.  If this capability is not available, userspace
+  should never set the KVM_MSI_VALID_DEVID flag as the ioctl might fail.
+
+If KVM_MSI_VALID_DEVID is set, devid contains a unique device identifier
+for the device that wrote the MSI message.  For PCI, this is usually a
+BFD identifier in the lower 16 bits.
+
+On x86, address_hi is ignored unless the KVM_X2APIC_API_USE_32BIT_IDS
+feature of KVM_CAP_X2APIC_API capability is enabled.  If it is enabled,
+address_hi bits 31-8 provide bits 31-8 of the destination id.  Bits 7-0 of
+address_hi must be zero.
+
+
+4.71 KVM_CREATE_PIT2
+--------------------
+
+:Capability: KVM_CAP_PIT2
+:Architectures: x86
+:Type: vm ioctl
+:Parameters: struct kvm_pit_config (in)
+:Returns: 0 on success, -1 on error
+
+Creates an in-kernel device model for the i8254 PIT. This call is only valid
+after enabling in-kernel irqchip support via KVM_CREATE_IRQCHIP. The following
+parameters have to be passed::
+
+  struct kvm_pit_config {
+       __u32 flags;
+       __u32 pad[15];
+  };
+
+Valid flags are::
+
+  #define KVM_PIT_SPEAKER_DUMMY     1 /* emulate speaker port stub */
+
+PIT timer interrupts may use a per-VM kernel thread for injection. If it
+exists, this thread will have a name of the following pattern::
+
+  kvm-pit/<owner-process-pid>
+
+When running a guest with elevated priorities, the scheduling parameters of
+this thread may have to be adjusted accordingly.
+
+This IOCTL replaces the obsolete KVM_CREATE_PIT.
+
+
+4.72 KVM_GET_PIT2
+-----------------
+
+:Capability: KVM_CAP_PIT_STATE2
+:Architectures: x86
+:Type: vm ioctl
+:Parameters: struct kvm_pit_state2 (out)
+:Returns: 0 on success, -1 on error
+
+Retrieves the state of the in-kernel PIT model. Only valid after
+KVM_CREATE_PIT2. The state is returned in the following structure::
+
+  struct kvm_pit_state2 {
+       struct kvm_pit_channel_state channels[3];
+       __u32 flags;
+       __u32 reserved[9];
+  };
+
+Valid flags are::
+
+  /* disable PIT in HPET legacy mode */
+  #define KVM_PIT_FLAGS_HPET_LEGACY  0x00000001
+
+This IOCTL replaces the obsolete KVM_GET_PIT.
+
+
+4.73 KVM_SET_PIT2
+-----------------
+
+:Capability: KVM_CAP_PIT_STATE2
+:Architectures: x86
+:Type: vm ioctl
+:Parameters: struct kvm_pit_state2 (in)
+:Returns: 0 on success, -1 on error
+
+Sets the state of the in-kernel PIT model. Only valid after KVM_CREATE_PIT2.
+See KVM_GET_PIT2 for details on struct kvm_pit_state2.
+
+This IOCTL replaces the obsolete KVM_SET_PIT.
+
+
+4.74 KVM_PPC_GET_SMMU_INFO
+--------------------------
+
+:Capability: KVM_CAP_PPC_GET_SMMU_INFO
+:Architectures: powerpc
+:Type: vm ioctl
+:Parameters: None
+:Returns: 0 on success, -1 on error
+
+This populates and returns a structure describing the features of
+the "Server" class MMU emulation supported by KVM.
+This can in turn be used by userspace to generate the appropriate
+device-tree properties for the guest operating system.
+
+The structure contains some global information, followed by an
+array of supported segment page sizes::
+
+      struct kvm_ppc_smmu_info {
+            __u64 flags;
+            __u32 slb_size;
+            __u32 pad;
+            struct kvm_ppc_one_seg_page_size sps[KVM_PPC_PAGE_SIZES_MAX_SZ];
+      };
+
+The supported flags are:
+
+    - KVM_PPC_PAGE_SIZES_REAL:
+        When that flag is set, guest page sizes must "fit" the backing
+        store page sizes. When not set, any page size in the list can
+        be used regardless of how they are backed by userspace.
+
+    - KVM_PPC_1T_SEGMENTS
+        The emulated MMU supports 1T segments in addition to the
+        standard 256M ones.
+
+    - KVM_PPC_NO_HASH
+       This flag indicates that HPT guests are not supported by KVM,
+       thus all guests must use radix MMU mode.
+
+The "slb_size" field indicates how many SLB entries are supported
+
+The "sps" array contains 8 entries indicating the supported base
+page sizes for a segment in increasing order. Each entry is defined
+as follow::
+
+   struct kvm_ppc_one_seg_page_size {
+       __u32 page_shift;       /* Base page shift of segment (or 0) */
+       __u32 slb_enc;          /* SLB encoding for BookS */
+       struct kvm_ppc_one_page_size enc[KVM_PPC_PAGE_SIZES_MAX_SZ];
+   };
+
+An entry with a "page_shift" of 0 is unused. Because the array is
+organized in increasing order, a lookup can stop when encoutering
+such an entry.
+
+The "slb_enc" field provides the encoding to use in the SLB for the
+page size. The bits are in positions such as the value can directly
+be OR'ed into the "vsid" argument of the slbmte instruction.
+
+The "enc" array is a list which for each of those segment base page
+size provides the list of supported actual page sizes (which can be
+only larger or equal to the base page size), along with the
+corresponding encoding in the hash PTE. Similarly, the array is
+8 entries sorted by increasing sizes and an entry with a "0" shift
+is an empty entry and a terminator::
+
+   struct kvm_ppc_one_page_size {
+       __u32 page_shift;       /* Page shift (or 0) */
+       __u32 pte_enc;          /* Encoding in the HPTE (>>12) */
+   };
+
+The "pte_enc" field provides a value that can OR'ed into the hash
+PTE's RPN field (ie, it needs to be shifted left by 12 to OR it
+into the hash PTE second double word).
+
+4.75 KVM_IRQFD
+--------------
+
+:Capability: KVM_CAP_IRQFD
+:Architectures: x86 s390 arm arm64
+:Type: vm ioctl
+:Parameters: struct kvm_irqfd (in)
+:Returns: 0 on success, -1 on error
+
+Allows setting an eventfd to directly trigger a guest interrupt.
+kvm_irqfd.fd specifies the file descriptor to use as the eventfd and
+kvm_irqfd.gsi specifies the irqchip pin toggled by this event.  When
+an event is triggered on the eventfd, an interrupt is injected into
+the guest using the specified gsi pin.  The irqfd is removed using
+the KVM_IRQFD_FLAG_DEASSIGN flag, specifying both kvm_irqfd.fd
+and kvm_irqfd.gsi.
+
+With KVM_CAP_IRQFD_RESAMPLE, KVM_IRQFD supports a de-assert and notify
+mechanism allowing emulation of level-triggered, irqfd-based
+interrupts.  When KVM_IRQFD_FLAG_RESAMPLE is set the user must pass an
+additional eventfd in the kvm_irqfd.resamplefd field.  When operating
+in resample mode, posting of an interrupt through kvm_irq.fd asserts
+the specified gsi in the irqchip.  When the irqchip is resampled, such
+as from an EOI, the gsi is de-asserted and the user is notified via
+kvm_irqfd.resamplefd.  It is the user's responsibility to re-queue
+the interrupt if the device making use of it still requires service.
+Note that closing the resamplefd is not sufficient to disable the
+irqfd.  The KVM_IRQFD_FLAG_RESAMPLE is only necessary on assignment
+and need not be specified with KVM_IRQFD_FLAG_DEASSIGN.
+
+On arm/arm64, gsi routing being supported, the following can happen:
+
+- in case no routing entry is associated to this gsi, injection fails
+- in case the gsi is associated to an irqchip routing entry,
+  irqchip.pin + 32 corresponds to the injected SPI ID.
+- in case the gsi is associated to an MSI routing entry, the MSI
+  message and device ID are translated into an LPI (support restricted
+  to GICv3 ITS in-kernel emulation).
+
+4.76 KVM_PPC_ALLOCATE_HTAB
+--------------------------
+
+:Capability: KVM_CAP_PPC_ALLOC_HTAB
+:Architectures: powerpc
+:Type: vm ioctl
+:Parameters: Pointer to u32 containing hash table order (in/out)
+:Returns: 0 on success, -1 on error
+
+This requests the host kernel to allocate an MMU hash table for a
+guest using the PAPR paravirtualization interface.  This only does
+anything if the kernel is configured to use the Book 3S HV style of
+virtualization.  Otherwise the capability doesn't exist and the ioctl
+returns an ENOTTY error.  The rest of this description assumes Book 3S
+HV.
+
+There must be no vcpus running when this ioctl is called; if there
+are, it will do nothing and return an EBUSY error.
+
+The parameter is a pointer to a 32-bit unsigned integer variable
+containing the order (log base 2) of the desired size of the hash
+table, which must be between 18 and 46.  On successful return from the
+ioctl, the value will not be changed by the kernel.
+
+If no hash table has been allocated when any vcpu is asked to run
+(with the KVM_RUN ioctl), the host kernel will allocate a
+default-sized hash table (16 MB).
+
+If this ioctl is called when a hash table has already been allocated,
+with a different order from the existing hash table, the existing hash
+table will be freed and a new one allocated.  If this is ioctl is
+called when a hash table has already been allocated of the same order
+as specified, the kernel will clear out the existing hash table (zero
+all HPTEs).  In either case, if the guest is using the virtualized
+real-mode area (VRMA) facility, the kernel will re-create the VMRA
+HPTEs on the next KVM_RUN of any vcpu.
+
+4.77 KVM_S390_INTERRUPT
+-----------------------
+
+:Capability: basic
+:Architectures: s390
+:Type: vm ioctl, vcpu ioctl
+:Parameters: struct kvm_s390_interrupt (in)
+:Returns: 0 on success, -1 on error
+
+Allows to inject an interrupt to the guest. Interrupts can be floating
+(vm ioctl) or per cpu (vcpu ioctl), depending on the interrupt type.
+
+Interrupt parameters are passed via kvm_s390_interrupt::
+
+  struct kvm_s390_interrupt {
+       __u32 type;
+       __u32 parm;
+       __u64 parm64;
+  };
+
+type can be one of the following:
+
+KVM_S390_SIGP_STOP (vcpu)
+    - sigp stop; optional flags in parm
+KVM_S390_PROGRAM_INT (vcpu)
+    - program check; code in parm
+KVM_S390_SIGP_SET_PREFIX (vcpu)
+    - sigp set prefix; prefix address in parm
+KVM_S390_RESTART (vcpu)
+    - restart
+KVM_S390_INT_CLOCK_COMP (vcpu)
+    - clock comparator interrupt
+KVM_S390_INT_CPU_TIMER (vcpu)
+    - CPU timer interrupt
+KVM_S390_INT_VIRTIO (vm)
+    - virtio external interrupt; external interrupt
+      parameters in parm and parm64
+KVM_S390_INT_SERVICE (vm)
+    - sclp external interrupt; sclp parameter in parm
+KVM_S390_INT_EMERGENCY (vcpu)
+    - sigp emergency; source cpu in parm
+KVM_S390_INT_EXTERNAL_CALL (vcpu)
+    - sigp external call; source cpu in parm
+KVM_S390_INT_IO(ai,cssid,ssid,schid) (vm)
+    - compound value to indicate an
+      I/O interrupt (ai - adapter interrupt; cssid,ssid,schid - subchannel);
+      I/O interruption parameters in parm (subchannel) and parm64 (intparm,
+      interruption subclass)
+KVM_S390_MCHK (vm, vcpu)
+    - machine check interrupt; cr 14 bits in parm, machine check interrupt
+      code in parm64 (note that machine checks needing further payload are not
+      supported by this ioctl)
+
+This is an asynchronous vcpu ioctl and can be invoked from any thread.
+
+4.78 KVM_PPC_GET_HTAB_FD
+------------------------
+
+:Capability: KVM_CAP_PPC_HTAB_FD
+:Architectures: powerpc
+:Type: vm ioctl
+:Parameters: Pointer to struct kvm_get_htab_fd (in)
+:Returns: file descriptor number (>= 0) on success, -1 on error
+
+This returns a file descriptor that can be used either to read out the
+entries in the guest's hashed page table (HPT), or to write entries to
+initialize the HPT.  The returned fd can only be written to if the
+KVM_GET_HTAB_WRITE bit is set in the flags field of the argument, and
+can only be read if that bit is clear.  The argument struct looks like
+this::
+
+  /* For KVM_PPC_GET_HTAB_FD */
+  struct kvm_get_htab_fd {
+       __u64   flags;
+       __u64   start_index;
+       __u64   reserved[2];
+  };
+
+  /* Values for kvm_get_htab_fd.flags */
+  #define KVM_GET_HTAB_BOLTED_ONLY     ((__u64)0x1)
+  #define KVM_GET_HTAB_WRITE           ((__u64)0x2)
+
+The 'start_index' field gives the index in the HPT of the entry at
+which to start reading.  It is ignored when writing.
+
+Reads on the fd will initially supply information about all
+"interesting" HPT entries.  Interesting entries are those with the
+bolted bit set, if the KVM_GET_HTAB_BOLTED_ONLY bit is set, otherwise
+all entries.  When the end of the HPT is reached, the read() will
+return.  If read() is called again on the fd, it will start again from
+the beginning of the HPT, but will only return HPT entries that have
+changed since they were last read.
+
+Data read or written is structured as a header (8 bytes) followed by a
+series of valid HPT entries (16 bytes) each.  The header indicates how
+many valid HPT entries there are and how many invalid entries follow
+the valid entries.  The invalid entries are not represented explicitly
+in the stream.  The header format is::
+
+  struct kvm_get_htab_header {
+       __u32   index;
+       __u16   n_valid;
+       __u16   n_invalid;
+  };
+
+Writes to the fd create HPT entries starting at the index given in the
+header; first 'n_valid' valid entries with contents from the data
+written, then 'n_invalid' invalid entries, invalidating any previously
+valid entries found.
+
+4.79 KVM_CREATE_DEVICE
+----------------------
+
+:Capability: KVM_CAP_DEVICE_CTRL
+:Type: vm ioctl
+:Parameters: struct kvm_create_device (in/out)
+:Returns: 0 on success, -1 on error
+
+Errors:
+
+  ======  =======================================================
+  ENODEV  The device type is unknown or unsupported
+  EEXIST  Device already created, and this type of device may not
+          be instantiated multiple times
+  ======  =======================================================
+
+  Other error conditions may be defined by individual device types or
+  have their standard meanings.
+
+Creates an emulated device in the kernel.  The file descriptor returned
+in fd can be used with KVM_SET/GET/HAS_DEVICE_ATTR.
+
+If the KVM_CREATE_DEVICE_TEST flag is set, only test whether the
+device type is supported (not necessarily whether it can be created
+in the current vm).
+
+Individual devices should not define flags.  Attributes should be used
+for specifying any behavior that is not implied by the device type
+number.
+
+::
+
+  struct kvm_create_device {
+       __u32   type;   /* in: KVM_DEV_TYPE_xxx */
+       __u32   fd;     /* out: device handle */
+       __u32   flags;  /* in: KVM_CREATE_DEVICE_xxx */
+  };
+
+4.80 KVM_SET_DEVICE_ATTR/KVM_GET_DEVICE_ATTR
+--------------------------------------------
+
+:Capability: KVM_CAP_DEVICE_CTRL, KVM_CAP_VM_ATTRIBUTES for vm device,
+             KVM_CAP_VCPU_ATTRIBUTES for vcpu device
+:Type: device ioctl, vm ioctl, vcpu ioctl
+:Parameters: struct kvm_device_attr
+:Returns: 0 on success, -1 on error
+
+Errors:
+
+  =====   =============================================================
+  ENXIO   The group or attribute is unknown/unsupported for this device
+          or hardware support is missing.
+  EPERM   The attribute cannot (currently) be accessed this way
+          (e.g. read-only attribute, or attribute that only makes
+          sense when the device is in a different state)
+  =====   =============================================================
+
+  Other error conditions may be defined by individual device types.
+
+Gets/sets a specified piece of device configuration and/or state.  The
+semantics are device-specific.  See individual device documentation in
+the "devices" directory.  As with ONE_REG, the size of the data
+transferred is defined by the particular attribute.
+
+::
+
+  struct kvm_device_attr {
+       __u32   flags;          /* no flags currently defined */
+       __u32   group;          /* device-defined */
+       __u64   attr;           /* group-defined */
+       __u64   addr;           /* userspace address of attr data */
+  };
+
+4.81 KVM_HAS_DEVICE_ATTR
+------------------------
+
+:Capability: KVM_CAP_DEVICE_CTRL, KVM_CAP_VM_ATTRIBUTES for vm device,
+            KVM_CAP_VCPU_ATTRIBUTES for vcpu device
+:Type: device ioctl, vm ioctl, vcpu ioctl
+:Parameters: struct kvm_device_attr
+:Returns: 0 on success, -1 on error
+
+Errors:
+
+  =====   =============================================================
+  ENXIO   The group or attribute is unknown/unsupported for this device
+          or hardware support is missing.
+  =====   =============================================================
+
+Tests whether a device supports a particular attribute.  A successful
+return indicates the attribute is implemented.  It does not necessarily
+indicate that the attribute can be read or written in the device's
+current state.  "addr" is ignored.
+
+4.82 KVM_ARM_VCPU_INIT
+----------------------
+
+:Capability: basic
+:Architectures: arm, arm64
+:Type: vcpu ioctl
+:Parameters: struct kvm_vcpu_init (in)
+:Returns: 0 on success; -1 on error
+
+Errors:
+
+  ======     =================================================================
+  EINVAL     the target is unknown, or the combination of features is invalid.
+  ENOENT     a features bit specified is unknown.
+  ======     =================================================================
+
+This tells KVM what type of CPU to present to the guest, and what
+optional features it should have.  This will cause a reset of the cpu
+registers to their initial values.  If this is not called, KVM_RUN will
+return ENOEXEC for that vcpu.
+
+Note that because some registers reflect machine topology, all vcpus
+should be created before this ioctl is invoked.
+
+Userspace can call this function multiple times for a given vcpu, including
+after the vcpu has been run. This will reset the vcpu to its initial
+state. All calls to this function after the initial call must use the same
+target and same set of feature flags, otherwise EINVAL will be returned.
+
+Possible features:
+
+       - KVM_ARM_VCPU_POWER_OFF: Starts the CPU in a power-off state.
+         Depends on KVM_CAP_ARM_PSCI.  If not set, the CPU will be powered on
+         and execute guest code when KVM_RUN is called.
+       - KVM_ARM_VCPU_EL1_32BIT: Starts the CPU in a 32bit mode.
+         Depends on KVM_CAP_ARM_EL1_32BIT (arm64 only).
+       - KVM_ARM_VCPU_PSCI_0_2: Emulate PSCI v0.2 (or a future revision
+          backward compatible with v0.2) for the CPU.
+         Depends on KVM_CAP_ARM_PSCI_0_2.
+       - KVM_ARM_VCPU_PMU_V3: Emulate PMUv3 for the CPU.
+         Depends on KVM_CAP_ARM_PMU_V3.
+
+       - KVM_ARM_VCPU_PTRAUTH_ADDRESS: Enables Address Pointer authentication
+         for arm64 only.
+         Depends on KVM_CAP_ARM_PTRAUTH_ADDRESS.
+         If KVM_CAP_ARM_PTRAUTH_ADDRESS and KVM_CAP_ARM_PTRAUTH_GENERIC are
+         both present, then both KVM_ARM_VCPU_PTRAUTH_ADDRESS and
+         KVM_ARM_VCPU_PTRAUTH_GENERIC must be requested or neither must be
+         requested.
+
+       - KVM_ARM_VCPU_PTRAUTH_GENERIC: Enables Generic Pointer authentication
+         for arm64 only.
+         Depends on KVM_CAP_ARM_PTRAUTH_GENERIC.
+         If KVM_CAP_ARM_PTRAUTH_ADDRESS and KVM_CAP_ARM_PTRAUTH_GENERIC are
+         both present, then both KVM_ARM_VCPU_PTRAUTH_ADDRESS and
+         KVM_ARM_VCPU_PTRAUTH_GENERIC must be requested or neither must be
+         requested.
+
+       - KVM_ARM_VCPU_SVE: Enables SVE for the CPU (arm64 only).
+         Depends on KVM_CAP_ARM_SVE.
+         Requires KVM_ARM_VCPU_FINALIZE(KVM_ARM_VCPU_SVE):
+
+          * After KVM_ARM_VCPU_INIT:
+
+             - KVM_REG_ARM64_SVE_VLS may be read using KVM_GET_ONE_REG: the
+               initial value of this pseudo-register indicates the best set of
+               vector lengths possible for a vcpu on this host.
+
+          * Before KVM_ARM_VCPU_FINALIZE(KVM_ARM_VCPU_SVE):
+
+             - KVM_RUN and KVM_GET_REG_LIST are not available;
+
+             - KVM_GET_ONE_REG and KVM_SET_ONE_REG cannot be used to access
+               the scalable archietctural SVE registers
+               KVM_REG_ARM64_SVE_ZREG(), KVM_REG_ARM64_SVE_PREG() or
+               KVM_REG_ARM64_SVE_FFR;
+
+             - KVM_REG_ARM64_SVE_VLS may optionally be written using
+               KVM_SET_ONE_REG, to modify the set of vector lengths available
+               for the vcpu.
+
+          * After KVM_ARM_VCPU_FINALIZE(KVM_ARM_VCPU_SVE):
+
+             - the KVM_REG_ARM64_SVE_VLS pseudo-register is immutable, and can
+               no longer be written using KVM_SET_ONE_REG.
+
+4.83 KVM_ARM_PREFERRED_TARGET
+-----------------------------
+
+:Capability: basic
+:Architectures: arm, arm64
+:Type: vm ioctl
+:Parameters: struct struct kvm_vcpu_init (out)
+:Returns: 0 on success; -1 on error
+
+Errors:
+
+  ======     ==========================================
+  ENODEV     no preferred target available for the host
+  ======     ==========================================
+
+This queries KVM for preferred CPU target type which can be emulated
+by KVM on underlying host.
+
+The ioctl returns struct kvm_vcpu_init instance containing information
+about preferred CPU target type and recommended features for it.  The
+kvm_vcpu_init->features bitmap returned will have feature bits set if
+the preferred target recommends setting these features, but this is
+not mandatory.
+
+The information returned by this ioctl can be used to prepare an instance
+of struct kvm_vcpu_init for KVM_ARM_VCPU_INIT ioctl which will result in
+in VCPU matching underlying host.
+
+
+4.84 KVM_GET_REG_LIST
+---------------------
+
+:Capability: basic
+:Architectures: arm, arm64, mips
+:Type: vcpu ioctl
+:Parameters: struct kvm_reg_list (in/out)
+:Returns: 0 on success; -1 on error
+
+Errors:
+
+  =====      ==============================================================
+  E2BIG      the reg index list is too big to fit in the array specified by
+             the user (the number required will be written into n).
+  =====      ==============================================================
+
+::
+
+  struct kvm_reg_list {
+       __u64 n; /* number of registers in reg[] */
+       __u64 reg[0];
+  };
+
+This ioctl returns the guest registers that are supported for the
+KVM_GET_ONE_REG/KVM_SET_ONE_REG calls.
+
+
+4.85 KVM_ARM_SET_DEVICE_ADDR (deprecated)
+-----------------------------------------
+
+:Capability: KVM_CAP_ARM_SET_DEVICE_ADDR
+:Architectures: arm, arm64
+:Type: vm ioctl
+:Parameters: struct kvm_arm_device_address (in)
+:Returns: 0 on success, -1 on error
+
+Errors:
+
+  ======  ============================================
+  ENODEV  The device id is unknown
+  ENXIO   Device not supported on current system
+  EEXIST  Address already set
+  E2BIG   Address outside guest physical address space
+  EBUSY   Address overlaps with other device range
+  ======  ============================================
+
+::
+
+  struct kvm_arm_device_addr {
+       __u64 id;
+       __u64 addr;
+  };
+
+Specify a device address in the guest's physical address space where guests
+can access emulated or directly exposed devices, which the host kernel needs
+to know about. The id field is an architecture specific identifier for a
+specific device.
+
+ARM/arm64 divides the id field into two parts, a device id and an
+address type id specific to the individual device::
+
+  bits:  | 63        ...       32 | 31    ...    16 | 15    ...    0 |
+  field: |        0x00000000      |     device id   |  addr type id  |
+
+ARM/arm64 currently only require this when using the in-kernel GIC
+support for the hardware VGIC features, using KVM_ARM_DEVICE_VGIC_V2
+as the device id.  When setting the base address for the guest's
+mapping of the VGIC virtual CPU and distributor interface, the ioctl
+must be called after calling KVM_CREATE_IRQCHIP, but before calling
+KVM_RUN on any of the VCPUs.  Calling this ioctl twice for any of the
+base addresses will return -EEXIST.
+
+Note, this IOCTL is deprecated and the more flexible SET/GET_DEVICE_ATTR API
+should be used instead.
+
+
+4.86 KVM_PPC_RTAS_DEFINE_TOKEN
+------------------------------
+
+:Capability: KVM_CAP_PPC_RTAS
+:Architectures: ppc
+:Type: vm ioctl
+:Parameters: struct kvm_rtas_token_args
+:Returns: 0 on success, -1 on error
+
+Defines a token value for a RTAS (Run Time Abstraction Services)
+service in order to allow it to be handled in the kernel.  The
+argument struct gives the name of the service, which must be the name
+of a service that has a kernel-side implementation.  If the token
+value is non-zero, it will be associated with that service, and
+subsequent RTAS calls by the guest specifying that token will be
+handled by the kernel.  If the token value is 0, then any token
+associated with the service will be forgotten, and subsequent RTAS
+calls by the guest for that service will be passed to userspace to be
+handled.
+
+4.87 KVM_SET_GUEST_DEBUG
+------------------------
+
+:Capability: KVM_CAP_SET_GUEST_DEBUG
+:Architectures: x86, s390, ppc, arm64
+:Type: vcpu ioctl
+:Parameters: struct kvm_guest_debug (in)
+:Returns: 0 on success; -1 on error
+
+::
+
+  struct kvm_guest_debug {
+       __u32 control;
+       __u32 pad;
+       struct kvm_guest_debug_arch arch;
+  };
+
+Set up the processor specific debug registers and configure vcpu for
+handling guest debug events. There are two parts to the structure, the
+first a control bitfield indicates the type of debug events to handle
+when running. Common control bits are:
+
+  - KVM_GUESTDBG_ENABLE:        guest debugging is enabled
+  - KVM_GUESTDBG_SINGLESTEP:    the next run should single-step
+
+The top 16 bits of the control field are architecture specific control
+flags which can include the following:
+
+  - KVM_GUESTDBG_USE_SW_BP:     using software breakpoints [x86, arm64]
+  - KVM_GUESTDBG_USE_HW_BP:     using hardware breakpoints [x86, s390, arm64]
+  - KVM_GUESTDBG_INJECT_DB:     inject DB type exception [x86]
+  - KVM_GUESTDBG_INJECT_BP:     inject BP type exception [x86]
+  - KVM_GUESTDBG_EXIT_PENDING:  trigger an immediate guest exit [s390]
+
+For example KVM_GUESTDBG_USE_SW_BP indicates that software breakpoints
+are enabled in memory so we need to ensure breakpoint exceptions are
+correctly trapped and the KVM run loop exits at the breakpoint and not
+running off into the normal guest vector. For KVM_GUESTDBG_USE_HW_BP
+we need to ensure the guest vCPUs architecture specific registers are
+updated to the correct (supplied) values.
+
+The second part of the structure is architecture specific and
+typically contains a set of debug registers.
+
+For arm64 the number of debug registers is implementation defined and
+can be determined by querying the KVM_CAP_GUEST_DEBUG_HW_BPS and
+KVM_CAP_GUEST_DEBUG_HW_WPS capabilities which return a positive number
+indicating the number of supported registers.
+
+For ppc, the KVM_CAP_PPC_GUEST_DEBUG_SSTEP capability indicates whether
+the single-step debug event (KVM_GUESTDBG_SINGLESTEP) is supported.
+
+When debug events exit the main run loop with the reason
+KVM_EXIT_DEBUG with the kvm_debug_exit_arch part of the kvm_run
+structure containing architecture specific debug information.
+
+4.88 KVM_GET_EMULATED_CPUID
+---------------------------
+
+:Capability: KVM_CAP_EXT_EMUL_CPUID
+:Architectures: x86
+:Type: system ioctl
+:Parameters: struct kvm_cpuid2 (in/out)
+:Returns: 0 on success, -1 on error
+
+::
+
+  struct kvm_cpuid2 {
+       __u32 nent;
+       __u32 flags;
+       struct kvm_cpuid_entry2 entries[0];
+  };
+
+The member 'flags' is used for passing flags from userspace.
+
+::
+
+  #define KVM_CPUID_FLAG_SIGNIFCANT_INDEX              BIT(0)
+  #define KVM_CPUID_FLAG_STATEFUL_FUNC         BIT(1)
+  #define KVM_CPUID_FLAG_STATE_READ_NEXT               BIT(2)
+
+  struct kvm_cpuid_entry2 {
+       __u32 function;
+       __u32 index;
+       __u32 flags;
+       __u32 eax;
+       __u32 ebx;
+       __u32 ecx;
+       __u32 edx;
+       __u32 padding[3];
+  };
+
+This ioctl returns x86 cpuid features which are emulated by
+kvm.Userspace can use the information returned by this ioctl to query
+which features are emulated by kvm instead of being present natively.
+
+Userspace invokes KVM_GET_EMULATED_CPUID by passing a kvm_cpuid2
+structure with the 'nent' field indicating the number of entries in
+the variable-size array 'entries'. If the number of entries is too low
+to describe the cpu capabilities, an error (E2BIG) is returned. If the
+number is too high, the 'nent' field is adjusted and an error (ENOMEM)
+is returned. If the number is just right, the 'nent' field is adjusted
+to the number of valid entries in the 'entries' array, which is then
+filled.
+
+The entries returned are the set CPUID bits of the respective features
+which kvm emulates, as returned by the CPUID instruction, with unknown
+or unsupported feature bits cleared.
+
+Features like x2apic, for example, may not be present in the host cpu
+but are exposed by kvm in KVM_GET_SUPPORTED_CPUID because they can be
+emulated efficiently and thus not included here.
+
+The fields in each entry are defined as follows:
+
+  function:
+        the eax value used to obtain the entry
+  index:
+        the ecx value used to obtain the entry (for entries that are
+         affected by ecx)
+  flags:
+    an OR of zero or more of the following:
+
+        KVM_CPUID_FLAG_SIGNIFCANT_INDEX:
+           if the index field is valid
+        KVM_CPUID_FLAG_STATEFUL_FUNC:
+           if cpuid for this function returns different values for successive
+           invocations; there will be several entries with the same function,
+           all with this flag set
+        KVM_CPUID_FLAG_STATE_READ_NEXT:
+           for KVM_CPUID_FLAG_STATEFUL_FUNC entries, set if this entry is
+           the first entry to be read by a cpu
+
+   eax, ebx, ecx, edx:
+
+         the values returned by the cpuid instruction for
+         this function/index combination
+
+4.89 KVM_S390_MEM_OP
+--------------------
+
+:Capability: KVM_CAP_S390_MEM_OP
+:Architectures: s390
+:Type: vcpu ioctl
+:Parameters: struct kvm_s390_mem_op (in)
+:Returns: = 0 on success,
+          < 0 on generic error (e.g. -EFAULT or -ENOMEM),
+          > 0 if an exception occurred while walking the page tables
+
+Read or write data from/to the logical (virtual) memory of a VCPU.
+
+Parameters are specified via the following structure::
+
+  struct kvm_s390_mem_op {
+       __u64 gaddr;            /* the guest address */
+       __u64 flags;            /* flags */
+       __u32 size;             /* amount of bytes */
+       __u32 op;               /* type of operation */
+       __u64 buf;              /* buffer in userspace */
+       __u8 ar;                /* the access register number */
+       __u8 reserved[31];      /* should be set to 0 */
+  };
+
+The type of operation is specified in the "op" field. It is either
+KVM_S390_MEMOP_LOGICAL_READ for reading from logical memory space or
+KVM_S390_MEMOP_LOGICAL_WRITE for writing to logical memory space. The
+KVM_S390_MEMOP_F_CHECK_ONLY flag can be set in the "flags" field to check
+whether the corresponding memory access would create an access exception
+(without touching the data in the memory at the destination). In case an
+access exception occurred while walking the MMU tables of the guest, the
+ioctl returns a positive error number to indicate the type of exception.
+This exception is also raised directly at the corresponding VCPU if the
+flag KVM_S390_MEMOP_F_INJECT_EXCEPTION is set in the "flags" field.
+
+The start address of the memory region has to be specified in the "gaddr"
+field, and the length of the region in the "size" field (which must not
+be 0). The maximum value for "size" can be obtained by checking the
+KVM_CAP_S390_MEM_OP capability. "buf" is the buffer supplied by the
+userspace application where the read data should be written to for
+KVM_S390_MEMOP_LOGICAL_READ, or where the data that should be written is
+stored for a KVM_S390_MEMOP_LOGICAL_WRITE. When KVM_S390_MEMOP_F_CHECK_ONLY
+is specified, "buf" is unused and can be NULL. "ar" designates the access
+register number to be used; the valid range is 0..15.
+
+The "reserved" field is meant for future extensions. It is not used by
+KVM with the currently defined set of flags.
+
+4.90 KVM_S390_GET_SKEYS
+-----------------------
+
+:Capability: KVM_CAP_S390_SKEYS
+:Architectures: s390
+:Type: vm ioctl
+:Parameters: struct kvm_s390_skeys
+:Returns: 0 on success, KVM_S390_GET_KEYS_NONE if guest is not using storage
+          keys, negative value on error
+
+This ioctl is used to get guest storage key values on the s390
+architecture. The ioctl takes parameters via the kvm_s390_skeys struct::
+
+  struct kvm_s390_skeys {
+       __u64 start_gfn;
+       __u64 count;
+       __u64 skeydata_addr;
+       __u32 flags;
+       __u32 reserved[9];
+  };
+
+The start_gfn field is the number of the first guest frame whose storage keys
+you want to get.
+
+The count field is the number of consecutive frames (starting from start_gfn)
+whose storage keys to get. The count field must be at least 1 and the maximum
+allowed value is defined as KVM_S390_SKEYS_ALLOC_MAX. Values outside this range
+will cause the ioctl to return -EINVAL.
+
+The skeydata_addr field is the address to a buffer large enough to hold count
+bytes. This buffer will be filled with storage key data by the ioctl.
+
+4.91 KVM_S390_SET_SKEYS
+-----------------------
+
+:Capability: KVM_CAP_S390_SKEYS
+:Architectures: s390
+:Type: vm ioctl
+:Parameters: struct kvm_s390_skeys
+:Returns: 0 on success, negative value on error
+
+This ioctl is used to set guest storage key values on the s390
+architecture. The ioctl takes parameters via the kvm_s390_skeys struct.
+See section on KVM_S390_GET_SKEYS for struct definition.
+
+The start_gfn field is the number of the first guest frame whose storage keys
+you want to set.
+
+The count field is the number of consecutive frames (starting from start_gfn)
+whose storage keys to get. The count field must be at least 1 and the maximum
+allowed value is defined as KVM_S390_SKEYS_ALLOC_MAX. Values outside this range
+will cause the ioctl to return -EINVAL.
+
+The skeydata_addr field is the address to a buffer containing count bytes of
+storage keys. Each byte in the buffer will be set as the storage key for a
+single frame starting at start_gfn for count frames.
+
+Note: If any architecturally invalid key value is found in the given data then
+the ioctl will return -EINVAL.
+
+4.92 KVM_S390_IRQ
+-----------------
+
+:Capability: KVM_CAP_S390_INJECT_IRQ
+:Architectures: s390
+:Type: vcpu ioctl
+:Parameters: struct kvm_s390_irq (in)
+:Returns: 0 on success, -1 on error
+
+Errors:
+
+
+  ======  =================================================================
+  EINVAL  interrupt type is invalid
+          type is KVM_S390_SIGP_STOP and flag parameter is invalid value,
+          type is KVM_S390_INT_EXTERNAL_CALL and code is bigger
+          than the maximum of VCPUs
+  EBUSY   type is KVM_S390_SIGP_SET_PREFIX and vcpu is not stopped,
+          type is KVM_S390_SIGP_STOP and a stop irq is already pending,
+          type is KVM_S390_INT_EXTERNAL_CALL and an external call interrupt
+          is already pending
+  ======  =================================================================
+
+Allows to inject an interrupt to the guest.
+
+Using struct kvm_s390_irq as a parameter allows
+to inject additional payload which is not
+possible via KVM_S390_INTERRUPT.
+
+Interrupt parameters are passed via kvm_s390_irq::
+
+  struct kvm_s390_irq {
+       __u64 type;
+       union {
+               struct kvm_s390_io_info io;
+               struct kvm_s390_ext_info ext;
+               struct kvm_s390_pgm_info pgm;
+               struct kvm_s390_emerg_info emerg;
+               struct kvm_s390_extcall_info extcall;
+               struct kvm_s390_prefix_info prefix;
+               struct kvm_s390_stop_info stop;
+               struct kvm_s390_mchk_info mchk;
+               char reserved[64];
+       } u;
+  };
+
+type can be one of the following:
+
+- KVM_S390_SIGP_STOP - sigp stop; parameter in .stop
+- KVM_S390_PROGRAM_INT - program check; parameters in .pgm
+- KVM_S390_SIGP_SET_PREFIX - sigp set prefix; parameters in .prefix
+- KVM_S390_RESTART - restart; no parameters
+- KVM_S390_INT_CLOCK_COMP - clock comparator interrupt; no parameters
+- KVM_S390_INT_CPU_TIMER - CPU timer interrupt; no parameters
+- KVM_S390_INT_EMERGENCY - sigp emergency; parameters in .emerg
+- KVM_S390_INT_EXTERNAL_CALL - sigp external call; parameters in .extcall
+- KVM_S390_MCHK - machine check interrupt; parameters in .mchk
+
+This is an asynchronous vcpu ioctl and can be invoked from any thread.
+
+4.94 KVM_S390_GET_IRQ_STATE
+---------------------------
+
+:Capability: KVM_CAP_S390_IRQ_STATE
+:Architectures: s390
+:Type: vcpu ioctl
+:Parameters: struct kvm_s390_irq_state (out)
+:Returns: >= number of bytes copied into buffer,
+          -EINVAL if buffer size is 0,
+          -ENOBUFS if buffer size is too small to fit all pending interrupts,
+          -EFAULT if the buffer address was invalid
+
+This ioctl allows userspace to retrieve the complete state of all currently
+pending interrupts in a single buffer. Use cases include migration
+and introspection. The parameter structure contains the address of a
+userspace buffer and its length::
+
+  struct kvm_s390_irq_state {
+       __u64 buf;
+       __u32 flags;        /* will stay unused for compatibility reasons */
+       __u32 len;
+       __u32 reserved[4];  /* will stay unused for compatibility reasons */
+  };
+
+Userspace passes in the above struct and for each pending interrupt a
+struct kvm_s390_irq is copied to the provided buffer.
+
+The structure contains a flags and a reserved field for future extensions. As
+the kernel never checked for flags == 0 and QEMU never pre-zeroed flags and
+reserved, these fields can not be used in the future without breaking
+compatibility.
+
+If -ENOBUFS is returned the buffer provided was too small and userspace
+may retry with a bigger buffer.
+
+4.95 KVM_S390_SET_IRQ_STATE
+---------------------------
+
+:Capability: KVM_CAP_S390_IRQ_STATE
+:Architectures: s390
+:Type: vcpu ioctl
+:Parameters: struct kvm_s390_irq_state (in)
+:Returns: 0 on success,
+          -EFAULT if the buffer address was invalid,
+          -EINVAL for an invalid buffer length (see below),
+          -EBUSY if there were already interrupts pending,
+          errors occurring when actually injecting the
+          interrupt. See KVM_S390_IRQ.
+
+This ioctl allows userspace to set the complete state of all cpu-local
+interrupts currently pending for the vcpu. It is intended for restoring
+interrupt state after a migration. The input parameter is a userspace buffer
+containing a struct kvm_s390_irq_state::
+
+  struct kvm_s390_irq_state {
+       __u64 buf;
+       __u32 flags;        /* will stay unused for compatibility reasons */
+       __u32 len;
+       __u32 reserved[4];  /* will stay unused for compatibility reasons */
+  };
+
+The restrictions for flags and reserved apply as well.
+(see KVM_S390_GET_IRQ_STATE)
+
+The userspace memory referenced by buf contains a struct kvm_s390_irq
+for each interrupt to be injected into the guest.
+If one of the interrupts could not be injected for some reason the
+ioctl aborts.
+
+len must be a multiple of sizeof(struct kvm_s390_irq). It must be > 0
+and it must not exceed (max_vcpus + 32) * sizeof(struct kvm_s390_irq),
+which is the maximum number of possibly pending cpu-local interrupts.
+
+4.96 KVM_SMI
+------------
+
+:Capability: KVM_CAP_X86_SMM
+:Architectures: x86
+:Type: vcpu ioctl
+:Parameters: none
+:Returns: 0 on success, -1 on error
+
+Queues an SMI on the thread's vcpu.
+
+4.97 KVM_CAP_PPC_MULTITCE
+-------------------------
+
+:Capability: KVM_CAP_PPC_MULTITCE
+:Architectures: ppc
+:Type: vm
+
+This capability means the kernel is capable of handling hypercalls
+H_PUT_TCE_INDIRECT and H_STUFF_TCE without passing those into the user
+space. This significantly accelerates DMA operations for PPC KVM guests.
+User space should expect that its handlers for these hypercalls
+are not going to be called if user space previously registered LIOBN
+in KVM (via KVM_CREATE_SPAPR_TCE or similar calls).
+
+In order to enable H_PUT_TCE_INDIRECT and H_STUFF_TCE use in the guest,
+user space might have to advertise it for the guest. For example,
+IBM pSeries (sPAPR) guest starts using them if "hcall-multi-tce" is
+present in the "ibm,hypertas-functions" device-tree property.
+
+The hypercalls mentioned above may or may not be processed successfully
+in the kernel based fast path. If they can not be handled by the kernel,
+they will get passed on to user space. So user space still has to have
+an implementation for these despite the in kernel acceleration.
+
+This capability is always enabled.
+
+4.98 KVM_CREATE_SPAPR_TCE_64
+----------------------------
+
+:Capability: KVM_CAP_SPAPR_TCE_64
+:Architectures: powerpc
+:Type: vm ioctl
+:Parameters: struct kvm_create_spapr_tce_64 (in)
+:Returns: file descriptor for manipulating the created TCE table
+
+This is an extension for KVM_CAP_SPAPR_TCE which only supports 32bit
+windows, described in 4.62 KVM_CREATE_SPAPR_TCE
+
+This capability uses extended struct in ioctl interface::
+
+  /* for KVM_CAP_SPAPR_TCE_64 */
+  struct kvm_create_spapr_tce_64 {
+       __u64 liobn;
+       __u32 page_shift;
+       __u32 flags;
+       __u64 offset;   /* in pages */
+       __u64 size;     /* in pages */
+  };
+
+The aim of extension is to support an additional bigger DMA window with
+a variable page size.
+KVM_CREATE_SPAPR_TCE_64 receives a 64bit window size, an IOMMU page shift and
+a bus offset of the corresponding DMA window, @size and @offset are numbers
+of IOMMU pages.
+
+@flags are not used at the moment.
+
+The rest of functionality is identical to KVM_CREATE_SPAPR_TCE.
+
+4.99 KVM_REINJECT_CONTROL
+-------------------------
+
+:Capability: KVM_CAP_REINJECT_CONTROL
+:Architectures: x86
+:Type: vm ioctl
+:Parameters: struct kvm_reinject_control (in)
+:Returns: 0 on success,
+         -EFAULT if struct kvm_reinject_control cannot be read,
+         -ENXIO if KVM_CREATE_PIT or KVM_CREATE_PIT2 didn't succeed earlier.
+
+i8254 (PIT) has two modes, reinject and !reinject.  The default is reinject,
+where KVM queues elapsed i8254 ticks and monitors completion of interrupt from
+vector(s) that i8254 injects.  Reinject mode dequeues a tick and injects its
+interrupt whenever there isn't a pending interrupt from i8254.
+!reinject mode injects an interrupt as soon as a tick arrives.
+
+::
+
+  struct kvm_reinject_control {
+       __u8 pit_reinject;
+       __u8 reserved[31];
+  };
+
+pit_reinject = 0 (!reinject mode) is recommended, unless running an old
+operating system that uses the PIT for timing (e.g. Linux 2.4.x).
+
+4.100 KVM_PPC_CONFIGURE_V3_MMU
+------------------------------
+
+:Capability: KVM_CAP_PPC_RADIX_MMU or KVM_CAP_PPC_HASH_MMU_V3
+:Architectures: ppc
+:Type: vm ioctl
+:Parameters: struct kvm_ppc_mmuv3_cfg (in)
+:Returns: 0 on success,
+         -EFAULT if struct kvm_ppc_mmuv3_cfg cannot be read,
+         -EINVAL if the configuration is invalid
+
+This ioctl controls whether the guest will use radix or HPT (hashed
+page table) translation, and sets the pointer to the process table for
+the guest.
+
+::
+
+  struct kvm_ppc_mmuv3_cfg {
+       __u64   flags;
+       __u64   process_table;
+  };
+
+There are two bits that can be set in flags; KVM_PPC_MMUV3_RADIX and
+KVM_PPC_MMUV3_GTSE.  KVM_PPC_MMUV3_RADIX, if set, configures the guest
+to use radix tree translation, and if clear, to use HPT translation.
+KVM_PPC_MMUV3_GTSE, if set and if KVM permits it, configures the guest
+to be able to use the global TLB and SLB invalidation instructions;
+if clear, the guest may not use these instructions.
+
+The process_table field specifies the address and size of the guest
+process table, which is in the guest's space.  This field is formatted
+as the second doubleword of the partition table entry, as defined in
+the Power ISA V3.00, Book III section 5.7.6.1.
+
+4.101 KVM_PPC_GET_RMMU_INFO
+---------------------------
+
+:Capability: KVM_CAP_PPC_RADIX_MMU
+:Architectures: ppc
+:Type: vm ioctl
+:Parameters: struct kvm_ppc_rmmu_info (out)
+:Returns: 0 on success,
+        -EFAULT if struct kvm_ppc_rmmu_info cannot be written,
+        -EINVAL if no useful information can be returned
+
+This ioctl returns a structure containing two things: (a) a list
+containing supported radix tree geometries, and (b) a list that maps
+page sizes to put in the "AP" (actual page size) field for the tlbie
+(TLB invalidate entry) instruction.
+
+::
+
+  struct kvm_ppc_rmmu_info {
+       struct kvm_ppc_radix_geom {
+               __u8    page_shift;
+               __u8    level_bits[4];
+               __u8    pad[3];
+       }       geometries[8];
+       __u32   ap_encodings[8];
+  };
+
+The geometries[] field gives up to 8 supported geometries for the
+radix page table, in terms of the log base 2 of the smallest page
+size, and the number of bits indexed at each level of the tree, from
+the PTE level up to the PGD level in that order.  Any unused entries
+will have 0 in the page_shift field.
+
+The ap_encodings gives the supported page sizes and their AP field
+encodings, encoded with the AP value in the top 3 bits and the log
+base 2 of the page size in the bottom 6 bits.
+
+4.102 KVM_PPC_RESIZE_HPT_PREPARE
+--------------------------------
+
+:Capability: KVM_CAP_SPAPR_RESIZE_HPT
+:Architectures: powerpc
+:Type: vm ioctl
+:Parameters: struct kvm_ppc_resize_hpt (in)
+:Returns: 0 on successful completion,
+        >0 if a new HPT is being prepared, the value is an estimated
+         number of milliseconds until preparation is complete,
+         -EFAULT if struct kvm_reinject_control cannot be read,
+        -EINVAL if the supplied shift or flags are invalid,
+        -ENOMEM if unable to allocate the new HPT,
+        -ENOSPC if there was a hash collision
+
+::
+
+  struct kvm_ppc_rmmu_info {
+       struct kvm_ppc_radix_geom {
+               __u8    page_shift;
+               __u8    level_bits[4];
+               __u8    pad[3];
+       }       geometries[8];
+       __u32   ap_encodings[8];
+  };
+
+The geometries[] field gives up to 8 supported geometries for the
+radix page table, in terms of the log base 2 of the smallest page
+size, and the number of bits indexed at each level of the tree, from
+the PTE level up to the PGD level in that order.  Any unused entries
+will have 0 in the page_shift field.
+
+The ap_encodings gives the supported page sizes and their AP field
+encodings, encoded with the AP value in the top 3 bits and the log
+base 2 of the page size in the bottom 6 bits.
+
+4.102 KVM_PPC_RESIZE_HPT_PREPARE
+--------------------------------
+
+:Capability: KVM_CAP_SPAPR_RESIZE_HPT
+:Architectures: powerpc
+:Type: vm ioctl
+:Parameters: struct kvm_ppc_resize_hpt (in)
+:Returns: 0 on successful completion,
+        >0 if a new HPT is being prepared, the value is an estimated
+         number of milliseconds until preparation is complete,
+         -EFAULT if struct kvm_reinject_control cannot be read,
+        -EINVAL if the supplied shift or flags are invalid,when moving existing
+         HPT entries to the new HPT,
+        -EIO on other error conditions
+
+Used to implement the PAPR extension for runtime resizing of a guest's
+Hashed Page Table (HPT).  Specifically this starts, stops or monitors
+the preparation of a new potential HPT for the guest, essentially
+implementing the H_RESIZE_HPT_PREPARE hypercall.
+
+If called with shift > 0 when there is no pending HPT for the guest,
+this begins preparation of a new pending HPT of size 2^(shift) bytes.
+It then returns a positive integer with the estimated number of
+milliseconds until preparation is complete.
+
+If called when there is a pending HPT whose size does not match that
+requested in the parameters, discards the existing pending HPT and
+creates a new one as above.
+
+If called when there is a pending HPT of the size requested, will:
+
+  * If preparation of the pending HPT is already complete, return 0
+  * If preparation of the pending HPT has failed, return an error
+    code, then discard the pending HPT.
+  * If preparation of the pending HPT is still in progress, return an
+    estimated number of milliseconds until preparation is complete.
+
+If called with shift == 0, discards any currently pending HPT and
+returns 0 (i.e. cancels any in-progress preparation).
+
+flags is reserved for future expansion, currently setting any bits in
+flags will result in an -EINVAL.
+
+Normally this will be called repeatedly with the same parameters until
+it returns <= 0.  The first call will initiate preparation, subsequent
+ones will monitor preparation until it completes or fails.
+
+::
+
+  struct kvm_ppc_resize_hpt {
+       __u64 flags;
+       __u32 shift;
+       __u32 pad;
+  };
+
+4.103 KVM_PPC_RESIZE_HPT_COMMIT
+-------------------------------
+
+:Capability: KVM_CAP_SPAPR_RESIZE_HPT
+:Architectures: powerpc
+:Type: vm ioctl
+:Parameters: struct kvm_ppc_resize_hpt (in)
+:Returns: 0 on successful completion,
+         -EFAULT if struct kvm_reinject_control cannot be read,
+        -EINVAL if the supplied shift or flags are invalid,
+        -ENXIO is there is no pending HPT, or the pending HPT doesn't
+         have the requested size,
+        -EBUSY if the pending HPT is not fully prepared,
+        -ENOSPC if there was a hash collision when moving existing
+         HPT entries to the new HPT,
+        -EIO on other error conditions
+
+Used to implement the PAPR extension for runtime resizing of a guest's
+Hashed Page Table (HPT).  Specifically this requests that the guest be
+transferred to working with the new HPT, essentially implementing the
+H_RESIZE_HPT_COMMIT hypercall.
+
+This should only be called after KVM_PPC_RESIZE_HPT_PREPARE has
+returned 0 with the same parameters.  In other cases
+KVM_PPC_RESIZE_HPT_COMMIT will return an error (usually -ENXIO or
+-EBUSY, though others may be possible if the preparation was started,
+but failed).
+
+This will have undefined effects on the guest if it has not already
+placed itself in a quiescent state where no vcpu will make MMU enabled
+memory accesses.
+
+On succsful completion, the pending HPT will become the guest's active
+HPT and the previous HPT will be discarded.
+
+On failure, the guest will still be operating on its previous HPT.
+
+::
+
+  struct kvm_ppc_resize_hpt {
+       __u64 flags;
+       __u32 shift;
+       __u32 pad;
+  };
+
+4.104 KVM_X86_GET_MCE_CAP_SUPPORTED
+-----------------------------------
+
+:Capability: KVM_CAP_MCE
+:Architectures: x86
+:Type: system ioctl
+:Parameters: u64 mce_cap (out)
+:Returns: 0 on success, -1 on error
+
+Returns supported MCE capabilities. The u64 mce_cap parameter
+has the same format as the MSR_IA32_MCG_CAP register. Supported
+capabilities will have the corresponding bits set.
+
+4.105 KVM_X86_SETUP_MCE
+-----------------------
+
+:Capability: KVM_CAP_MCE
+:Architectures: x86
+:Type: vcpu ioctl
+:Parameters: u64 mcg_cap (in)
+:Returns: 0 on success,
+         -EFAULT if u64 mcg_cap cannot be read,
+         -EINVAL if the requested number of banks is invalid,
+         -EINVAL if requested MCE capability is not supported.
+
+Initializes MCE support for use. The u64 mcg_cap parameter
+has the same format as the MSR_IA32_MCG_CAP register and
+specifies which capabilities should be enabled. The maximum
+supported number of error-reporting banks can be retrieved when
+checking for KVM_CAP_MCE. The supported capabilities can be
+retrieved with KVM_X86_GET_MCE_CAP_SUPPORTED.
+
+4.106 KVM_X86_SET_MCE
+---------------------
+
+:Capability: KVM_CAP_MCE
+:Architectures: x86
+:Type: vcpu ioctl
+:Parameters: struct kvm_x86_mce (in)
+:Returns: 0 on success,
+         -EFAULT if struct kvm_x86_mce cannot be read,
+         -EINVAL if the bank number is invalid,
+         -EINVAL if VAL bit is not set in status field.
+
+Inject a machine check error (MCE) into the guest. The input
+parameter is::
+
+  struct kvm_x86_mce {
+       __u64 status;
+       __u64 addr;
+       __u64 misc;
+       __u64 mcg_status;
+       __u8 bank;
+       __u8 pad1[7];
+       __u64 pad2[3];
+  };
+
+If the MCE being reported is an uncorrected error, KVM will
+inject it as an MCE exception into the guest. If the guest
+MCG_STATUS register reports that an MCE is in progress, KVM
+causes an KVM_EXIT_SHUTDOWN vmexit.
+
+Otherwise, if the MCE is a corrected error, KVM will just
+store it in the corresponding bank (provided this bank is
+not holding a previously reported uncorrected error).
+
+4.107 KVM_S390_GET_CMMA_BITS
+----------------------------
+
+:Capability: KVM_CAP_S390_CMMA_MIGRATION
+:Architectures: s390
+:Type: vm ioctl
+:Parameters: struct kvm_s390_cmma_log (in, out)
+:Returns: 0 on success, a negative value on error
+
+This ioctl is used to get the values of the CMMA bits on the s390
+architecture. It is meant to be used in two scenarios:
+
+- During live migration to save the CMMA values. Live migration needs
+  to be enabled via the KVM_REQ_START_MIGRATION VM property.
+- To non-destructively peek at the CMMA values, with the flag
+  KVM_S390_CMMA_PEEK set.
+
+The ioctl takes parameters via the kvm_s390_cmma_log struct. The desired
+values are written to a buffer whose location is indicated via the "values"
+member in the kvm_s390_cmma_log struct.  The values in the input struct are
+also updated as needed.
+
+Each CMMA value takes up one byte.
+
+::
+
+  struct kvm_s390_cmma_log {
+       __u64 start_gfn;
+       __u32 count;
+       __u32 flags;
+       union {
+               __u64 remaining;
+               __u64 mask;
+       };
+       __u64 values;
+  };
+
+start_gfn is the number of the first guest frame whose CMMA values are
+to be retrieved,
+
+count is the length of the buffer in bytes,
+
+values points to the buffer where the result will be written to.
+
+If count is greater than KVM_S390_SKEYS_MAX, then it is considered to be
+KVM_S390_SKEYS_MAX. KVM_S390_SKEYS_MAX is re-used for consistency with
+other ioctls.
+
+The result is written in the buffer pointed to by the field values, and
+the values of the input parameter are updated as follows.
+
+Depending on the flags, different actions are performed. The only
+supported flag so far is KVM_S390_CMMA_PEEK.
+
+The default behaviour if KVM_S390_CMMA_PEEK is not set is:
+start_gfn will indicate the first page frame whose CMMA bits were dirty.
+It is not necessarily the same as the one passed as input, as clean pages
+are skipped.
+
+count will indicate the number of bytes actually written in the buffer.
+It can (and very often will) be smaller than the input value, since the
+buffer is only filled until 16 bytes of clean values are found (which
+are then not copied in the buffer). Since a CMMA migration block needs
+the base address and the length, for a total of 16 bytes, we will send
+back some clean data if there is some dirty data afterwards, as long as
+the size of the clean data does not exceed the size of the header. This
+allows to minimize the amount of data to be saved or transferred over
+the network at the expense of more roundtrips to userspace. The next
+invocation of the ioctl will skip over all the clean values, saving
+potentially more than just the 16 bytes we found.
+
+If KVM_S390_CMMA_PEEK is set:
+the existing storage attributes are read even when not in migration
+mode, and no other action is performed;
+
+the output start_gfn will be equal to the input start_gfn,
+
+the output count will be equal to the input count, except if the end of
+memory has been reached.
+
+In both cases:
+the field "remaining" will indicate the total number of dirty CMMA values
+still remaining, or 0 if KVM_S390_CMMA_PEEK is set and migration mode is
+not enabled.
+
+mask is unused.
+
+values points to the userspace buffer where the result will be stored.
+
+This ioctl can fail with -ENOMEM if not enough memory can be allocated to
+complete the task, with -ENXIO if CMMA is not enabled, with -EINVAL if
+KVM_S390_CMMA_PEEK is not set but migration mode was not enabled, with
+-EFAULT if the userspace address is invalid or if no page table is
+present for the addresses (e.g. when using hugepages).
+
+4.108 KVM_S390_SET_CMMA_BITS
+----------------------------
+
+:Capability: KVM_CAP_S390_CMMA_MIGRATION
+:Architectures: s390
+:Type: vm ioctl
+:Parameters: struct kvm_s390_cmma_log (in)
+:Returns: 0 on success, a negative value on error
+
+This ioctl is used to set the values of the CMMA bits on the s390
+architecture. It is meant to be used during live migration to restore
+the CMMA values, but there are no restrictions on its use.
+The ioctl takes parameters via the kvm_s390_cmma_values struct.
+Each CMMA value takes up one byte.
+
+::
+
+  struct kvm_s390_cmma_log {
+       __u64 start_gfn;
+       __u32 count;
+       __u32 flags;
+       union {
+               __u64 remaining;
+               __u64 mask;
+       };
+       __u64 values;
+  };
+
+start_gfn indicates the starting guest frame number,
+
+count indicates how many values are to be considered in the buffer,
+
+flags is not used and must be 0.
+
+mask indicates which PGSTE bits are to be considered.
+
+remaining is not used.
+
+values points to the buffer in userspace where to store the values.
+
+This ioctl can fail with -ENOMEM if not enough memory can be allocated to
+complete the task, with -ENXIO if CMMA is not enabled, with -EINVAL if
+the count field is too large (e.g. more than KVM_S390_CMMA_SIZE_MAX) or
+if the flags field was not 0, with -EFAULT if the userspace address is
+invalid, if invalid pages are written to (e.g. after the end of memory)
+or if no page table is present for the addresses (e.g. when using
+hugepages).
+
+4.109 KVM_PPC_GET_CPU_CHAR
+--------------------------
+
+:Capability: KVM_CAP_PPC_GET_CPU_CHAR
+:Architectures: powerpc
+:Type: vm ioctl
+:Parameters: struct kvm_ppc_cpu_char (out)
+:Returns: 0 on successful completion,
+        -EFAULT if struct kvm_ppc_cpu_char cannot be written
+
+This ioctl gives userspace information about certain characteristics
+of the CPU relating to speculative execution of instructions and
+possible information leakage resulting from speculative execution (see
+CVE-2017-5715, CVE-2017-5753 and CVE-2017-5754).  The information is
+returned in struct kvm_ppc_cpu_char, which looks like this::
+
+  struct kvm_ppc_cpu_char {
+       __u64   character;              /* characteristics of the CPU */
+       __u64   behaviour;              /* recommended software behaviour */
+       __u64   character_mask;         /* valid bits in character */
+       __u64   behaviour_mask;         /* valid bits in behaviour */
+  };
+
+For extensibility, the character_mask and behaviour_mask fields
+indicate which bits of character and behaviour have been filled in by
+the kernel.  If the set of defined bits is extended in future then
+userspace will be able to tell whether it is running on a kernel that
+knows about the new bits.
+
+The character field describes attributes of the CPU which can help
+with preventing inadvertent information disclosure - specifically,
+whether there is an instruction to flash-invalidate the L1 data cache
+(ori 30,30,0 or mtspr SPRN_TRIG2,rN), whether the L1 data cache is set
+to a mode where entries can only be used by the thread that created
+them, whether the bcctr[l] instruction prevents speculation, and
+whether a speculation barrier instruction (ori 31,31,0) is provided.
+
+The behaviour field describes actions that software should take to
+prevent inadvertent information disclosure, and thus describes which
+vulnerabilities the hardware is subject to; specifically whether the
+L1 data cache should be flushed when returning to user mode from the
+kernel, and whether a speculation barrier should be placed between an
+array bounds check and the array access.
+
+These fields use the same bit definitions as the new
+H_GET_CPU_CHARACTERISTICS hypercall.
+
+4.110 KVM_MEMORY_ENCRYPT_OP
+---------------------------
+
+:Capability: basic
+:Architectures: x86
+:Type: system
+:Parameters: an opaque platform specific structure (in/out)
+:Returns: 0 on success; -1 on error
+
+If the platform supports creating encrypted VMs then this ioctl can be used
+for issuing platform-specific memory encryption commands to manage those
+encrypted VMs.
+
+Currently, this ioctl is used for issuing Secure Encrypted Virtualization
+(SEV) commands on AMD Processors. The SEV commands are defined in
+Documentation/virt/kvm/amd-memory-encryption.rst.
+
+4.111 KVM_MEMORY_ENCRYPT_REG_REGION
+-----------------------------------
+
+:Capability: basic
+:Architectures: x86
+:Type: system
+:Parameters: struct kvm_enc_region (in)
+:Returns: 0 on success; -1 on error
+
+This ioctl can be used to register a guest memory region which may
+contain encrypted data (e.g. guest RAM, SMRAM etc).
+
+It is used in the SEV-enabled guest. When encryption is enabled, a guest
+memory region may contain encrypted data. The SEV memory encryption
+engine uses a tweak such that two identical plaintext pages, each at
+different locations will have differing ciphertexts. So swapping or
+moving ciphertext of those pages will not result in plaintext being
+swapped. So relocating (or migrating) physical backing pages for the SEV
+guest will require some additional steps.
+
+Note: The current SEV key management spec does not provide commands to
+swap or migrate (move) ciphertext pages. Hence, for now we pin the guest
+memory region registered with the ioctl.
+
+4.112 KVM_MEMORY_ENCRYPT_UNREG_REGION
+-------------------------------------
+
+:Capability: basic
+:Architectures: x86
+:Type: system
+:Parameters: struct kvm_enc_region (in)
+:Returns: 0 on success; -1 on error
+
+This ioctl can be used to unregister the guest memory region registered
+with KVM_MEMORY_ENCRYPT_REG_REGION ioctl above.
+
+4.113 KVM_HYPERV_EVENTFD
+------------------------
+
+:Capability: KVM_CAP_HYPERV_EVENTFD
+:Architectures: x86
+:Type: vm ioctl
+:Parameters: struct kvm_hyperv_eventfd (in)
+
+This ioctl (un)registers an eventfd to receive notifications from the guest on
+the specified Hyper-V connection id through the SIGNAL_EVENT hypercall, without
+causing a user exit.  SIGNAL_EVENT hypercall with non-zero event flag number
+(bits 24-31) still triggers a KVM_EXIT_HYPERV_HCALL user exit.
+
+::
+
+  struct kvm_hyperv_eventfd {
+       __u32 conn_id;
+       __s32 fd;
+       __u32 flags;
+       __u32 padding[3];
+  };
+
+The conn_id field should fit within 24 bits::
+
+  #define KVM_HYPERV_CONN_ID_MASK              0x00ffffff
+
+The acceptable values for the flags field are::
+
+  #define KVM_HYPERV_EVENTFD_DEASSIGN  (1 << 0)
+
+:Returns: 0 on success,
+         -EINVAL if conn_id or flags is outside the allowed range,
+         -ENOENT on deassign if the conn_id isn't registered,
+         -EEXIST on assign if the conn_id is already registered
+
+4.114 KVM_GET_NESTED_STATE
+--------------------------
+
+:Capability: KVM_CAP_NESTED_STATE
+:Architectures: x86
+:Type: vcpu ioctl
+:Parameters: struct kvm_nested_state (in/out)
+:Returns: 0 on success, -1 on error
+
+Errors:
+
+  =====      =============================================================
+  E2BIG      the total state size exceeds the value of 'size' specified by
+             the user; the size required will be written into size.
+  =====      =============================================================
+
+::
+
+  struct kvm_nested_state {
+       __u16 flags;
+       __u16 format;
+       __u32 size;
+
+       union {
+               struct kvm_vmx_nested_state_hdr vmx;
+               struct kvm_svm_nested_state_hdr svm;
+
+               /* Pad the header to 128 bytes.  */
+               __u8 pad[120];
+       } hdr;
+
+       union {
+               struct kvm_vmx_nested_state_data vmx[0];
+               struct kvm_svm_nested_state_data svm[0];
+       } data;
+  };
+
+  #define KVM_STATE_NESTED_GUEST_MODE          0x00000001
+  #define KVM_STATE_NESTED_RUN_PENDING         0x00000002
+  #define KVM_STATE_NESTED_EVMCS               0x00000004
+
+  #define KVM_STATE_NESTED_FORMAT_VMX          0
+  #define KVM_STATE_NESTED_FORMAT_SVM          1
+
+  #define KVM_STATE_NESTED_VMX_VMCS_SIZE       0x1000
+
+  #define KVM_STATE_NESTED_VMX_SMM_GUEST_MODE  0x00000001
+  #define KVM_STATE_NESTED_VMX_SMM_VMXON       0x00000002
+
+  struct kvm_vmx_nested_state_hdr {
+       __u64 vmxon_pa;
+       __u64 vmcs12_pa;
+
+       struct {
+               __u16 flags;
+       } smm;
+  };
+
+  struct kvm_vmx_nested_state_data {
+       __u8 vmcs12[KVM_STATE_NESTED_VMX_VMCS_SIZE];
+       __u8 shadow_vmcs12[KVM_STATE_NESTED_VMX_VMCS_SIZE];
+  };
+
+This ioctl copies the vcpu's nested virtualization state from the kernel to
+userspace.
+
+The maximum size of the state can be retrieved by passing KVM_CAP_NESTED_STATE
+to the KVM_CHECK_EXTENSION ioctl().
+
+4.115 KVM_SET_NESTED_STATE
+--------------------------
+
+:Capability: KVM_CAP_NESTED_STATE
+:Architectures: x86
+:Type: vcpu ioctl
+:Parameters: struct kvm_nested_state (in)
+:Returns: 0 on success, -1 on error
+
+This copies the vcpu's kvm_nested_state struct from userspace to the kernel.
+For the definition of struct kvm_nested_state, see KVM_GET_NESTED_STATE.
+
+4.116 KVM_(UN)REGISTER_COALESCED_MMIO
+-------------------------------------
+
+:Capability: KVM_CAP_COALESCED_MMIO (for coalesced mmio)
+            KVM_CAP_COALESCED_PIO (for coalesced pio)
+:Architectures: all
+:Type: vm ioctl
+:Parameters: struct kvm_coalesced_mmio_zone
+:Returns: 0 on success, < 0 on error
+
+Coalesced I/O is a performance optimization that defers hardware
+register write emulation so that userspace exits are avoided.  It is
+typically used to reduce the overhead of emulating frequently accessed
+hardware registers.
+
+When a hardware register is configured for coalesced I/O, write accesses
+do not exit to userspace and their value is recorded in a ring buffer
+that is shared between kernel and userspace.
+
+Coalesced I/O is used if one or more write accesses to a hardware
+register can be deferred until a read or a write to another hardware
+register on the same device.  This last access will cause a vmexit and
+userspace will process accesses from the ring buffer before emulating
+it. That will avoid exiting to userspace on repeated writes.
+
+Coalesced pio is based on coalesced mmio. There is little difference
+between coalesced mmio and pio except that coalesced pio records accesses
+to I/O ports.
+
+4.117 KVM_CLEAR_DIRTY_LOG (vm ioctl)
+------------------------------------
+
+:Capability: KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2
+:Architectures: x86, arm, arm64, mips
+:Type: vm ioctl
+:Parameters: struct kvm_dirty_log (in)
+:Returns: 0 on success, -1 on error
+
+::
+
+  /* for KVM_CLEAR_DIRTY_LOG */
+  struct kvm_clear_dirty_log {
+       __u32 slot;
+       __u32 num_pages;
+       __u64 first_page;
+       union {
+               void __user *dirty_bitmap; /* one bit per page */
+               __u64 padding;
+       };
+  };
+
+The ioctl clears the dirty status of pages in a memory slot, according to
+the bitmap that is passed in struct kvm_clear_dirty_log's dirty_bitmap
+field.  Bit 0 of the bitmap corresponds to page "first_page" in the
+memory slot, and num_pages is the size in bits of the input bitmap.
+first_page must be a multiple of 64; num_pages must also be a multiple of
+64 unless first_page + num_pages is the size of the memory slot.  For each
+bit that is set in the input bitmap, the corresponding page is marked "clean"
+in KVM's dirty bitmap, and dirty tracking is re-enabled for that page
+(for example via write-protection, or by clearing the dirty bit in
+a page table entry).
+
+If KVM_CAP_MULTI_ADDRESS_SPACE is available, bits 16-31 specifies
+the address space for which you want to return the dirty bitmap.
+They must be less than the value that KVM_CHECK_EXTENSION returns for
+the KVM_CAP_MULTI_ADDRESS_SPACE capability.
+
+This ioctl is mostly useful when KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2
+is enabled; for more information, see the description of the capability.
+However, it can always be used as long as KVM_CHECK_EXTENSION confirms
+that KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2 is present.
+
+4.118 KVM_GET_SUPPORTED_HV_CPUID
+--------------------------------
+
+:Capability: KVM_CAP_HYPERV_CPUID
+:Architectures: x86
+:Type: vcpu ioctl
+:Parameters: struct kvm_cpuid2 (in/out)
+:Returns: 0 on success, -1 on error
+
+::
+
+  struct kvm_cpuid2 {
+       __u32 nent;
+       __u32 padding;
+       struct kvm_cpuid_entry2 entries[0];
+  };
+
+  struct kvm_cpuid_entry2 {
+       __u32 function;
+       __u32 index;
+       __u32 flags;
+       __u32 eax;
+       __u32 ebx;
+       __u32 ecx;
+       __u32 edx;
+       __u32 padding[3];
+  };
+
+This ioctl returns x86 cpuid features leaves related to Hyper-V emulation in
+KVM.  Userspace can use the information returned by this ioctl to construct
+cpuid information presented to guests consuming Hyper-V enlightenments (e.g.
+Windows or Hyper-V guests).
+
+CPUID feature leaves returned by this ioctl are defined by Hyper-V Top Level
+Functional Specification (TLFS). These leaves can't be obtained with
+KVM_GET_SUPPORTED_CPUID ioctl because some of them intersect with KVM feature
+leaves (0x40000000, 0x40000001).
+
+Currently, the following list of CPUID leaves are returned:
+ - HYPERV_CPUID_VENDOR_AND_MAX_FUNCTIONS
+ - HYPERV_CPUID_INTERFACE
+ - HYPERV_CPUID_VERSION
+ - HYPERV_CPUID_FEATURES
+ - HYPERV_CPUID_ENLIGHTMENT_INFO
+ - HYPERV_CPUID_IMPLEMENT_LIMITS
+ - HYPERV_CPUID_NESTED_FEATURES
+
+HYPERV_CPUID_NESTED_FEATURES leaf is only exposed when Enlightened VMCS was
+enabled on the corresponding vCPU (KVM_CAP_HYPERV_ENLIGHTENED_VMCS).
+
+Userspace invokes KVM_GET_SUPPORTED_CPUID by passing a kvm_cpuid2 structure
+with the 'nent' field indicating the number of entries in the variable-size
+array 'entries'.  If the number of entries is too low to describe all Hyper-V
+feature leaves, an error (E2BIG) is returned. If the number is more or equal
+to the number of Hyper-V feature leaves, the 'nent' field is adjusted to the
+number of valid entries in the 'entries' array, which is then filled.
+
+'index' and 'flags' fields in 'struct kvm_cpuid_entry2' are currently reserved,
+userspace should not expect to get any particular value there.
+
+4.119 KVM_ARM_VCPU_FINALIZE
+---------------------------
+
+:Architectures: arm, arm64
+:Type: vcpu ioctl
+:Parameters: int feature (in)
+:Returns: 0 on success, -1 on error
+
+Errors:
+
+  ======     ==============================================================
+  EPERM      feature not enabled, needs configuration, or already finalized
+  EINVAL     feature unknown or not present
+  ======     ==============================================================
+
+Recognised values for feature:
+
+  =====      ===========================================
+  arm64      KVM_ARM_VCPU_SVE (requires KVM_CAP_ARM_SVE)
+  =====      ===========================================
+
+Finalizes the configuration of the specified vcpu feature.
+
+The vcpu must already have been initialised, enabling the affected feature, by
+means of a successful KVM_ARM_VCPU_INIT call with the appropriate flag set in
+features[].
+
+For affected vcpu features, this is a mandatory step that must be performed
+before the vcpu is fully usable.
+
+Between KVM_ARM_VCPU_INIT and KVM_ARM_VCPU_FINALIZE, the feature may be
+configured by use of ioctls such as KVM_SET_ONE_REG.  The exact configuration
+that should be performaned and how to do it are feature-dependent.
+
+Other calls that depend on a particular feature being finalized, such as
+KVM_RUN, KVM_GET_REG_LIST, KVM_GET_ONE_REG and KVM_SET_ONE_REG, will fail with
+-EPERM unless the feature has already been finalized by means of a
+KVM_ARM_VCPU_FINALIZE call.
+
+See KVM_ARM_VCPU_INIT for details of vcpu features that require finalization
+using this ioctl.
+
+4.120 KVM_SET_PMU_EVENT_FILTER
+------------------------------
+
+:Capability: KVM_CAP_PMU_EVENT_FILTER
+:Architectures: x86
+:Type: vm ioctl
+:Parameters: struct kvm_pmu_event_filter (in)
+:Returns: 0 on success, -1 on error
+
+::
+
+  struct kvm_pmu_event_filter {
+       __u32 action;
+       __u32 nevents;
+       __u32 fixed_counter_bitmap;
+       __u32 flags;
+       __u32 pad[4];
+       __u64 events[0];
+  };
+
+This ioctl restricts the set of PMU events that the guest can program.
+The argument holds a list of events which will be allowed or denied.
+The eventsel+umask of each event the guest attempts to program is compared
+against the events field to determine whether the guest should have access.
+The events field only controls general purpose counters; fixed purpose
+counters are controlled by the fixed_counter_bitmap.
+
+No flags are defined yet, the field must be zero.
+
+Valid values for 'action'::
+
+  #define KVM_PMU_EVENT_ALLOW 0
+  #define KVM_PMU_EVENT_DENY 1
+
+4.121 KVM_PPC_SVM_OFF
+---------------------
+
+:Capability: basic
+:Architectures: powerpc
+:Type: vm ioctl
+:Parameters: none
+:Returns: 0 on successful completion,
+
+Errors:
+
+  ======     ================================================================
+  EINVAL     if ultravisor failed to terminate the secure guest
+  ENOMEM     if hypervisor failed to allocate new radix page tables for guest
+  ======     ================================================================
+
+This ioctl is used to turn off the secure mode of the guest or transition
+the guest from secure mode to normal mode. This is invoked when the guest
+is reset. This has no effect if called for a normal guest.
+
+This ioctl issues an ultravisor call to terminate the secure guest,
+unpins the VPA pages and releases all the device pages that are used to
+track the secure pages by hypervisor.
+
+4.122 KVM_S390_NORMAL_RESET
+
+Capability: KVM_CAP_S390_VCPU_RESETS
+Architectures: s390
+Type: vcpu ioctl
+Parameters: none
+Returns: 0
+
+This ioctl resets VCPU registers and control structures according to
+the cpu reset definition in the POP (Principles Of Operation).
+
+4.123 KVM_S390_INITIAL_RESET
+
+Capability: none
+Architectures: s390
+Type: vcpu ioctl
+Parameters: none
+Returns: 0
+
+This ioctl resets VCPU registers and control structures according to
+the initial cpu reset definition in the POP. However, the cpu is not
+put into ESA mode. This reset is a superset of the normal reset.
+
+4.124 KVM_S390_CLEAR_RESET
+
+Capability: KVM_CAP_S390_VCPU_RESETS
+Architectures: s390
+Type: vcpu ioctl
+Parameters: none
+Returns: 0
+
+This ioctl resets VCPU registers and control structures according to
+the clear cpu reset definition in the POP. However, the cpu is not put
+into ESA mode. This reset is a superset of the initial reset.
+
+
+5. The kvm_run structure
+========================
+
+Application code obtains a pointer to the kvm_run structure by
+mmap()ing a vcpu fd.  From that point, application code can control
+execution by changing fields in kvm_run prior to calling the KVM_RUN
+ioctl, and obtain information about the reason KVM_RUN returned by
+looking up structure members.
+
+::
+
+  struct kvm_run {
+       /* in */
+       __u8 request_interrupt_window;
+
+Request that KVM_RUN return when it becomes possible to inject external
+interrupts into the guest.  Useful in conjunction with KVM_INTERRUPT.
+
+::
+
+       __u8 immediate_exit;
+
+This field is polled once when KVM_RUN starts; if non-zero, KVM_RUN
+exits immediately, returning -EINTR.  In the common scenario where a
+signal is used to "kick" a VCPU out of KVM_RUN, this field can be used
+to avoid usage of KVM_SET_SIGNAL_MASK, which has worse scalability.
+Rather than blocking the signal outside KVM_RUN, userspace can set up
+a signal handler that sets run->immediate_exit to a non-zero value.
+
+This field is ignored if KVM_CAP_IMMEDIATE_EXIT is not available.
+
+::
+
+       __u8 padding1[6];
+
+       /* out */
+       __u32 exit_reason;
+
+When KVM_RUN has returned successfully (return value 0), this informs
+application code why KVM_RUN has returned.  Allowable values for this
+field are detailed below.
+
+::
+
+       __u8 ready_for_interrupt_injection;
+
+If request_interrupt_window has been specified, this field indicates
+an interrupt can be injected now with KVM_INTERRUPT.
+
+::
+
+       __u8 if_flag;
+
+The value of the current interrupt flag.  Only valid if in-kernel
+local APIC is not used.
+
+::
+
+       __u16 flags;
+
+More architecture-specific flags detailing state of the VCPU that may
+affect the device's behavior.  The only currently defined flag is
+KVM_RUN_X86_SMM, which is valid on x86 machines and is set if the
+VCPU is in system management mode.
+
+::
+
+       /* in (pre_kvm_run), out (post_kvm_run) */
+       __u64 cr8;
+
+The value of the cr8 register.  Only valid if in-kernel local APIC is
+not used.  Both input and output.
+
+::
+
+       __u64 apic_base;
+
+The value of the APIC BASE msr.  Only valid if in-kernel local
+APIC is not used.  Both input and output.
+
+::
+
+       union {
+               /* KVM_EXIT_UNKNOWN */
+               struct {
+                       __u64 hardware_exit_reason;
+               } hw;
+
+If exit_reason is KVM_EXIT_UNKNOWN, the vcpu has exited due to unknown
+reasons.  Further architecture-specific information is available in
+hardware_exit_reason.
+
+::
+
+               /* KVM_EXIT_FAIL_ENTRY */
+               struct {
+                       __u64 hardware_entry_failure_reason;
+               } fail_entry;
+
+If exit_reason is KVM_EXIT_FAIL_ENTRY, the vcpu could not be run due
+to unknown reasons.  Further architecture-specific information is
+available in hardware_entry_failure_reason.
+
+::
+
+               /* KVM_EXIT_EXCEPTION */
+               struct {
+                       __u32 exception;
+                       __u32 error_code;
+               } ex;
+
+Unused.
+
+::
+
+               /* KVM_EXIT_IO */
+               struct {
+  #define KVM_EXIT_IO_IN  0
+  #define KVM_EXIT_IO_OUT 1
+                       __u8 direction;
+                       __u8 size; /* bytes */
+                       __u16 port;
+                       __u32 count;
+                       __u64 data_offset; /* relative to kvm_run start */
+               } io;
+
+If exit_reason is KVM_EXIT_IO, then the vcpu has
+executed a port I/O instruction which could not be satisfied by kvm.
+data_offset describes where the data is located (KVM_EXIT_IO_OUT) or
+where kvm expects application code to place the data for the next
+KVM_RUN invocation (KVM_EXIT_IO_IN).  Data format is a packed array.
+
+::
+
+               /* KVM_EXIT_DEBUG */
+               struct {
+                       struct kvm_debug_exit_arch arch;
+               } debug;
+
+If the exit_reason is KVM_EXIT_DEBUG, then a vcpu is processing a debug event
+for which architecture specific information is returned.
+
+::
+
+               /* KVM_EXIT_MMIO */
+               struct {
+                       __u64 phys_addr;
+                       __u8  data[8];
+                       __u32 len;
+                       __u8  is_write;
+               } mmio;
+
+If exit_reason is KVM_EXIT_MMIO, then the vcpu has
+executed a memory-mapped I/O instruction which could not be satisfied
+by kvm.  The 'data' member contains the written data if 'is_write' is
+true, and should be filled by application code otherwise.
+
+The 'data' member contains, in its first 'len' bytes, the value as it would
+appear if the VCPU performed a load or store of the appropriate width directly
+to the byte array.
+
+.. note::
+
+      For KVM_EXIT_IO, KVM_EXIT_MMIO, KVM_EXIT_OSI, KVM_EXIT_PAPR and
+      KVM_EXIT_EPR the corresponding
+
+operations are complete (and guest state is consistent) only after userspace
+has re-entered the kernel with KVM_RUN.  The kernel side will first finish
+incomplete operations and then check for pending signals.  Userspace
+can re-enter the guest with an unmasked signal pending to complete
+pending operations.
+
+::
+
+               /* KVM_EXIT_HYPERCALL */
+               struct {
+                       __u64 nr;
+                       __u64 args[6];
+                       __u64 ret;
+                       __u32 longmode;
+                       __u32 pad;
+               } hypercall;
+
+Unused.  This was once used for 'hypercall to userspace'.  To implement
+such functionality, use KVM_EXIT_IO (x86) or KVM_EXIT_MMIO (all except s390).
+
+.. note:: KVM_EXIT_IO is significantly faster than KVM_EXIT_MMIO.
+
+::
+
+               /* KVM_EXIT_TPR_ACCESS */
+               struct {
+                       __u64 rip;
+                       __u32 is_write;
+                       __u32 pad;
+               } tpr_access;
+
+To be documented (KVM_TPR_ACCESS_REPORTING).
+
+::
+
+               /* KVM_EXIT_S390_SIEIC */
+               struct {
+                       __u8 icptcode;
+                       __u64 mask; /* psw upper half */
+                       __u64 addr; /* psw lower half */
+                       __u16 ipa;
+                       __u32 ipb;
+               } s390_sieic;
+
+s390 specific.
+
+::
+
+               /* KVM_EXIT_S390_RESET */
+  #define KVM_S390_RESET_POR       1
+  #define KVM_S390_RESET_CLEAR     2
+  #define KVM_S390_RESET_SUBSYSTEM 4
+  #define KVM_S390_RESET_CPU_INIT  8
+  #define KVM_S390_RESET_IPL       16
+               __u64 s390_reset_flags;
+
+s390 specific.
+
+::
+
+               /* KVM_EXIT_S390_UCONTROL */
+               struct {
+                       __u64 trans_exc_code;
+                       __u32 pgm_code;
+               } s390_ucontrol;
+
+s390 specific. A page fault has occurred for a user controlled virtual
+machine (KVM_VM_S390_UNCONTROL) on it's host page table that cannot be
+resolved by the kernel.
+The program code and the translation exception code that were placed
+in the cpu's lowcore are presented here as defined by the z Architecture
+Principles of Operation Book in the Chapter for Dynamic Address Translation
+(DAT)
+
+::
+
+               /* KVM_EXIT_DCR */
+               struct {
+                       __u32 dcrn;
+                       __u32 data;
+                       __u8  is_write;
+               } dcr;
+
+Deprecated - was used for 440 KVM.
+
+::
+
+               /* KVM_EXIT_OSI */
+               struct {
+                       __u64 gprs[32];
+               } osi;
+
+MOL uses a special hypercall interface it calls 'OSI'. To enable it, we catch
+hypercalls and exit with this exit struct that contains all the guest gprs.
+
+If exit_reason is KVM_EXIT_OSI, then the vcpu has triggered such a hypercall.
+Userspace can now handle the hypercall and when it's done modify the gprs as
+necessary. Upon guest entry all guest GPRs will then be replaced by the values
+in this struct.
+
+::
+
+               /* KVM_EXIT_PAPR_HCALL */
+               struct {
+                       __u64 nr;
+                       __u64 ret;
+                       __u64 args[9];
+               } papr_hcall;
+
+This is used on 64-bit PowerPC when emulating a pSeries partition,
+e.g. with the 'pseries' machine type in qemu.  It occurs when the
+guest does a hypercall using the 'sc 1' instruction.  The 'nr' field
+contains the hypercall number (from the guest R3), and 'args' contains
+the arguments (from the guest R4 - R12).  Userspace should put the
+return code in 'ret' and any extra returned values in args[].
+The possible hypercalls are defined in the Power Architecture Platform
+Requirements (PAPR) document available from www.power.org (free
+developer registration required to access it).
+
+::
+
+               /* KVM_EXIT_S390_TSCH */
+               struct {
+                       __u16 subchannel_id;
+                       __u16 subchannel_nr;
+                       __u32 io_int_parm;
+                       __u32 io_int_word;
+                       __u32 ipb;
+                       __u8 dequeued;
+               } s390_tsch;
+
+s390 specific. This exit occurs when KVM_CAP_S390_CSS_SUPPORT has been enabled
+and TEST SUBCHANNEL was intercepted. If dequeued is set, a pending I/O
+interrupt for the target subchannel has been dequeued and subchannel_id,
+subchannel_nr, io_int_parm and io_int_word contain the parameters for that
+interrupt. ipb is needed for instruction parameter decoding.
+
+::
+
+               /* KVM_EXIT_EPR */
+               struct {
+                       __u32 epr;
+               } epr;
+
+On FSL BookE PowerPC chips, the interrupt controller has a fast patch
+interrupt acknowledge path to the core. When the core successfully
+delivers an interrupt, it automatically populates the EPR register with
+the interrupt vector number and acknowledges the interrupt inside
+the interrupt controller.
+
+In case the interrupt controller lives in user space, we need to do
+the interrupt acknowledge cycle through it to fetch the next to be
+delivered interrupt vector using this exit.
+
+It gets triggered whenever both KVM_CAP_PPC_EPR are enabled and an
+external interrupt has just been delivered into the guest. User space
+should put the acknowledged interrupt vector into the 'epr' field.
+
+::
+
+               /* KVM_EXIT_SYSTEM_EVENT */
+               struct {
+  #define KVM_SYSTEM_EVENT_SHUTDOWN       1
+  #define KVM_SYSTEM_EVENT_RESET          2
+  #define KVM_SYSTEM_EVENT_CRASH          3
+                       __u32 type;
+                       __u64 flags;
+               } system_event;
+
+If exit_reason is KVM_EXIT_SYSTEM_EVENT then the vcpu has triggered
+a system-level event using some architecture specific mechanism (hypercall
+or some special instruction). In case of ARM/ARM64, this is triggered using
+HVC instruction based PSCI call from the vcpu. The 'type' field describes
+the system-level event type. The 'flags' field describes architecture
+specific flags for the system-level event.
+
+Valid values for 'type' are:
+
+ - KVM_SYSTEM_EVENT_SHUTDOWN -- the guest has requested a shutdown of the
+   VM. Userspace is not obliged to honour this, and if it does honour
+   this does not need to destroy the VM synchronously (ie it may call
+   KVM_RUN again before shutdown finally occurs).
+ - KVM_SYSTEM_EVENT_RESET -- the guest has requested a reset of the VM.
+   As with SHUTDOWN, userspace can choose to ignore the request, or
+   to schedule the reset to occur in the future and may call KVM_RUN again.
+ - KVM_SYSTEM_EVENT_CRASH -- the guest crash occurred and the guest
+   has requested a crash condition maintenance. Userspace can choose
+   to ignore the request, or to gather VM memory core dump and/or
+   reset/shutdown of the VM.
+
+::
+
+               /* KVM_EXIT_IOAPIC_EOI */
+               struct {
+                       __u8 vector;
+               } eoi;
+
+Indicates that the VCPU's in-kernel local APIC received an EOI for a
+level-triggered IOAPIC interrupt.  This exit only triggers when the
+IOAPIC is implemented in userspace (i.e. KVM_CAP_SPLIT_IRQCHIP is enabled);
+the userspace IOAPIC should process the EOI and retrigger the interrupt if
+it is still asserted.  Vector is the LAPIC interrupt vector for which the
+EOI was received.
+
+::
+
+               struct kvm_hyperv_exit {
+  #define KVM_EXIT_HYPERV_SYNIC          1
+  #define KVM_EXIT_HYPERV_HCALL          2
+                       __u32 type;
+                       union {
+                               struct {
+                                       __u32 msr;
+                                       __u64 control;
+                                       __u64 evt_page;
+                                       __u64 msg_page;
+                               } synic;
+                               struct {
+                                       __u64 input;
+                                       __u64 result;
+                                       __u64 params[2];
+                               } hcall;
+                       } u;
+               };
+               /* KVM_EXIT_HYPERV */
+                struct kvm_hyperv_exit hyperv;
+
+Indicates that the VCPU exits into userspace to process some tasks
+related to Hyper-V emulation.
+
+Valid values for 'type' are:
+
+       - KVM_EXIT_HYPERV_SYNIC -- synchronously notify user-space about
+
+Hyper-V SynIC state change. Notification is used to remap SynIC
+event/message pages and to enable/disable SynIC messages/events processing
+in userspace.
+
+::
+
+               /* KVM_EXIT_ARM_NISV */
+               struct {
+                       __u64 esr_iss;
+                       __u64 fault_ipa;
+               } arm_nisv;
+
+Used on arm and arm64 systems. If a guest accesses memory not in a memslot,
+KVM will typically return to userspace and ask it to do MMIO emulation on its
+behalf. However, for certain classes of instructions, no instruction decode
+(direction, length of memory access) is provided, and fetching and decoding
+the instruction from the VM is overly complicated to live in the kernel.
+
+Historically, when this situation occurred, KVM would print a warning and kill
+the VM. KVM assumed that if the guest accessed non-memslot memory, it was
+trying to do I/O, which just couldn't be emulated, and the warning message was
+phrased accordingly. However, what happened more often was that a guest bug
+caused access outside the guest memory areas which should lead to a more
+meaningful warning message and an external abort in the guest, if the access
+did not fall within an I/O window.
+
+Userspace implementations can query for KVM_CAP_ARM_NISV_TO_USER, and enable
+this capability at VM creation. Once this is done, these types of errors will
+instead return to userspace with KVM_EXIT_ARM_NISV, with the valid bits from
+the HSR (arm) and ESR_EL2 (arm64) in the esr_iss field, and the faulting IPA
+in the fault_ipa field. Userspace can either fix up the access if it's
+actually an I/O access by decoding the instruction from guest memory (if it's
+very brave) and continue executing the guest, or it can decide to suspend,
+dump, or restart the guest.
+
+Note that KVM does not skip the faulting instruction as it does for
+KVM_EXIT_MMIO, but userspace has to emulate any change to the processing state
+if it decides to decode and emulate the instruction.
+
+::
+
+               /* Fix the size of the union. */
+               char padding[256];
+       };
+
+       /*
+        * shared registers between kvm and userspace.
+        * kvm_valid_regs specifies the register classes set by the host
+        * kvm_dirty_regs specified the register classes dirtied by userspace
+        * struct kvm_sync_regs is architecture specific, as well as the
+        * bits for kvm_valid_regs and kvm_dirty_regs
+        */
+       __u64 kvm_valid_regs;
+       __u64 kvm_dirty_regs;
+       union {
+               struct kvm_sync_regs regs;
+               char padding[SYNC_REGS_SIZE_BYTES];
+       } s;
+
+If KVM_CAP_SYNC_REGS is defined, these fields allow userspace to access
+certain guest registers without having to call SET/GET_*REGS. Thus we can
+avoid some system call overhead if userspace has to handle the exit.
+Userspace can query the validity of the structure by checking
+kvm_valid_regs for specific bits. These bits are architecture specific
+and usually define the validity of a groups of registers. (e.g. one bit
+for general purpose registers)
+
+Please note that the kernel is allowed to use the kvm_run structure as the
+primary storage for certain register types. Therefore, the kernel may use the
+values in kvm_run even if the corresponding bit in kvm_dirty_regs is not set.
+
+::
+
+  };
+
+
+
+6. Capabilities that can be enabled on vCPUs
+============================================
+
+There are certain capabilities that change the behavior of the virtual CPU or
+the virtual machine when enabled. To enable them, please see section 4.37.
+Below you can find a list of capabilities and what their effect on the vCPU or
+the virtual machine is when enabling them.
+
+The following information is provided along with the description:
+
+  Architectures:
+      which instruction set architectures provide this ioctl.
+      x86 includes both i386 and x86_64.
+
+  Target:
+      whether this is a per-vcpu or per-vm capability.
+
+  Parameters:
+      what parameters are accepted by the capability.
+
+  Returns:
+      the return value.  General error numbers (EBADF, ENOMEM, EINVAL)
+      are not detailed, but errors with specific meanings are.
+
+
+6.1 KVM_CAP_PPC_OSI
+-------------------
+
+:Architectures: ppc
+:Target: vcpu
+:Parameters: none
+:Returns: 0 on success; -1 on error
+
+This capability enables interception of OSI hypercalls that otherwise would
+be treated as normal system calls to be injected into the guest. OSI hypercalls
+were invented by Mac-on-Linux to have a standardized communication mechanism
+between the guest and the host.
+
+When this capability is enabled, KVM_EXIT_OSI can occur.
+
+
+6.2 KVM_CAP_PPC_PAPR
+--------------------
+
+:Architectures: ppc
+:Target: vcpu
+:Parameters: none
+:Returns: 0 on success; -1 on error
+
+This capability enables interception of PAPR hypercalls. PAPR hypercalls are
+done using the hypercall instruction "sc 1".
+
+It also sets the guest privilege level to "supervisor" mode. Usually the guest
+runs in "hypervisor" privilege mode with a few missing features.
+
+In addition to the above, it changes the semantics of SDR1. In this mode, the
+HTAB address part of SDR1 contains an HVA instead of a GPA, as PAPR keeps the
+HTAB invisible to the guest.
+
+When this capability is enabled, KVM_EXIT_PAPR_HCALL can occur.
+
+
+6.3 KVM_CAP_SW_TLB
+------------------
+
+:Architectures: ppc
+:Target: vcpu
+:Parameters: args[0] is the address of a struct kvm_config_tlb
+:Returns: 0 on success; -1 on error
+
+::
+
+  struct kvm_config_tlb {
+       __u64 params;
+       __u64 array;
+       __u32 mmu_type;
+       __u32 array_len;
+  };
+
+Configures the virtual CPU's TLB array, establishing a shared memory area
+between userspace and KVM.  The "params" and "array" fields are userspace
+addresses of mmu-type-specific data structures.  The "array_len" field is an
+safety mechanism, and should be set to the size in bytes of the memory that
+userspace has reserved for the array.  It must be at least the size dictated
+by "mmu_type" and "params".
+
+While KVM_RUN is active, the shared region is under control of KVM.  Its
+contents are undefined, and any modification by userspace results in
+boundedly undefined behavior.
+
+On return from KVM_RUN, the shared region will reflect the current state of
+the guest's TLB.  If userspace makes any changes, it must call KVM_DIRTY_TLB
+to tell KVM which entries have been changed, prior to calling KVM_RUN again
+on this vcpu.
+
+For mmu types KVM_MMU_FSL_BOOKE_NOHV and KVM_MMU_FSL_BOOKE_HV:
+
+ - The "params" field is of type "struct kvm_book3e_206_tlb_params".
+ - The "array" field points to an array of type "struct
+   kvm_book3e_206_tlb_entry".
+ - The array consists of all entries in the first TLB, followed by all
+   entries in the second TLB.
+ - Within a TLB, entries are ordered first by increasing set number.  Within a
+   set, entries are ordered by way (increasing ESEL).
+ - The hash for determining set number in TLB0 is: (MAS2 >> 12) & (num_sets - 1)
+   where "num_sets" is the tlb_sizes[] value divided by the tlb_ways[] value.
+ - The tsize field of mas1 shall be set to 4K on TLB0, even though the
+   hardware ignores this value for TLB0.
+
+6.4 KVM_CAP_S390_CSS_SUPPORT
+----------------------------
+
+:Architectures: s390
+:Target: vcpu
+:Parameters: none
+:Returns: 0 on success; -1 on error
+
+This capability enables support for handling of channel I/O instructions.
+
+TEST PENDING INTERRUPTION and the interrupt portion of TEST SUBCHANNEL are
+handled in-kernel, while the other I/O instructions are passed to userspace.
+
+When this capability is enabled, KVM_EXIT_S390_TSCH will occur on TEST
+SUBCHANNEL intercepts.
+
+Note that even though this capability is enabled per-vcpu, the complete
+virtual machine is affected.
+
+6.5 KVM_CAP_PPC_EPR
+-------------------
+
+:Architectures: ppc
+:Target: vcpu
+:Parameters: args[0] defines whether the proxy facility is active
+:Returns: 0 on success; -1 on error
+
+This capability enables or disables the delivery of interrupts through the
+external proxy facility.
+
+When enabled (args[0] != 0), every time the guest gets an external interrupt
+delivered, it automatically exits into user space with a KVM_EXIT_EPR exit
+to receive the topmost interrupt vector.
+
+When disabled (args[0] == 0), behavior is as if this facility is unsupported.
+
+When this capability is enabled, KVM_EXIT_EPR can occur.
+
+6.6 KVM_CAP_IRQ_MPIC
+--------------------
+
+:Architectures: ppc
+:Parameters: args[0] is the MPIC device fd;
+             args[1] is the MPIC CPU number for this vcpu
+
+This capability connects the vcpu to an in-kernel MPIC device.
+
+6.7 KVM_CAP_IRQ_XICS
+--------------------
+
+:Architectures: ppc
+:Target: vcpu
+:Parameters: args[0] is the XICS device fd;
+             args[1] is the XICS CPU number (server ID) for this vcpu
+
+This capability connects the vcpu to an in-kernel XICS device.
+
+6.8 KVM_CAP_S390_IRQCHIP
+------------------------
+
+:Architectures: s390
+:Target: vm
+:Parameters: none
+
+This capability enables the in-kernel irqchip for s390. Please refer to
+"4.24 KVM_CREATE_IRQCHIP" for details.
+
+6.9 KVM_CAP_MIPS_FPU
+--------------------
+
+:Architectures: mips
+:Target: vcpu
+:Parameters: args[0] is reserved for future use (should be 0).
+
+This capability allows the use of the host Floating Point Unit by the guest. It
+allows the Config1.FP bit to be set to enable the FPU in the guest. Once this is
+done the ``KVM_REG_MIPS_FPR_*`` and ``KVM_REG_MIPS_FCR_*`` registers can be
+accessed (depending on the current guest FPU register mode), and the Status.FR,
+Config5.FRE bits are accessible via the KVM API and also from the guest,
+depending on them being supported by the FPU.
+
+6.10 KVM_CAP_MIPS_MSA
+---------------------
+
+:Architectures: mips
+:Target: vcpu
+:Parameters: args[0] is reserved for future use (should be 0).
+
+This capability allows the use of the MIPS SIMD Architecture (MSA) by the guest.
+It allows the Config3.MSAP bit to be set to enable the use of MSA by the guest.
+Once this is done the ``KVM_REG_MIPS_VEC_*`` and ``KVM_REG_MIPS_MSA_*``
+registers can be accessed, and the Config5.MSAEn bit is accessible via the
+KVM API and also from the guest.
+
+6.74 KVM_CAP_SYNC_REGS
+----------------------
+
+:Architectures: s390, x86
+:Target: s390: always enabled, x86: vcpu
+:Parameters: none
+:Returns: x86: KVM_CHECK_EXTENSION returns a bit-array indicating which register
+          sets are supported
+          (bitfields defined in arch/x86/include/uapi/asm/kvm.h).
+
+As described above in the kvm_sync_regs struct info in section 5 (kvm_run):
+KVM_CAP_SYNC_REGS "allow[s] userspace to access certain guest registers
+without having to call SET/GET_*REGS". This reduces overhead by eliminating
+repeated ioctl calls for setting and/or getting register values. This is
+particularly important when userspace is making synchronous guest state
+modifications, e.g. when emulating and/or intercepting instructions in
+userspace.
+
+For s390 specifics, please refer to the source code.
+
+For x86:
+
+- the register sets to be copied out to kvm_run are selectable
+  by userspace (rather that all sets being copied out for every exit).
+- vcpu_events are available in addition to regs and sregs.
+
+For x86, the 'kvm_valid_regs' field of struct kvm_run is overloaded to
+function as an input bit-array field set by userspace to indicate the
+specific register sets to be copied out on the next exit.
+
+To indicate when userspace has modified values that should be copied into
+the vCPU, the all architecture bitarray field, 'kvm_dirty_regs' must be set.
+This is done using the same bitflags as for the 'kvm_valid_regs' field.
+If the dirty bit is not set, then the register set values will not be copied
+into the vCPU even if they've been modified.
+
+Unused bitfields in the bitarrays must be set to zero.
+
+::
+
+  struct kvm_sync_regs {
+        struct kvm_regs regs;
+        struct kvm_sregs sregs;
+        struct kvm_vcpu_events events;
+  };
+
+6.75 KVM_CAP_PPC_IRQ_XIVE
+-------------------------
+
+:Architectures: ppc
+:Target: vcpu
+:Parameters: args[0] is the XIVE device fd;
+             args[1] is the XIVE CPU number (server ID) for this vcpu
+
+This capability connects the vcpu to an in-kernel XIVE device.
+
+7. Capabilities that can be enabled on VMs
+==========================================
+
+There are certain capabilities that change the behavior of the virtual
+machine when enabled. To enable them, please see section 4.37. Below
+you can find a list of capabilities and what their effect on the VM
+is when enabling them.
+
+The following information is provided along with the description:
+
+  Architectures:
+      which instruction set architectures provide this ioctl.
+      x86 includes both i386 and x86_64.
+
+  Parameters:
+      what parameters are accepted by the capability.
+
+  Returns:
+      the return value.  General error numbers (EBADF, ENOMEM, EINVAL)
+      are not detailed, but errors with specific meanings are.
+
+
+7.1 KVM_CAP_PPC_ENABLE_HCALL
+----------------------------
+
+:Architectures: ppc
+:Parameters: args[0] is the sPAPR hcall number;
+            args[1] is 0 to disable, 1 to enable in-kernel handling
+
+This capability controls whether individual sPAPR hypercalls (hcalls)
+get handled by the kernel or not.  Enabling or disabling in-kernel
+handling of an hcall is effective across the VM.  On creation, an
+initial set of hcalls are enabled for in-kernel handling, which
+consists of those hcalls for which in-kernel handlers were implemented
+before this capability was implemented.  If disabled, the kernel will
+not to attempt to handle the hcall, but will always exit to userspace
+to handle it.  Note that it may not make sense to enable some and
+disable others of a group of related hcalls, but KVM does not prevent
+userspace from doing that.
+
+If the hcall number specified is not one that has an in-kernel
+implementation, the KVM_ENABLE_CAP ioctl will fail with an EINVAL
+error.
+
+7.2 KVM_CAP_S390_USER_SIGP
+--------------------------
+
+:Architectures: s390
+:Parameters: none
+
+This capability controls which SIGP orders will be handled completely in user
+space. With this capability enabled, all fast orders will be handled completely
+in the kernel:
+
+- SENSE
+- SENSE RUNNING
+- EXTERNAL CALL
+- EMERGENCY SIGNAL
+- CONDITIONAL EMERGENCY SIGNAL
+
+All other orders will be handled completely in user space.
+
+Only privileged operation exceptions will be checked for in the kernel (or even
+in the hardware prior to interception). If this capability is not enabled, the
+old way of handling SIGP orders is used (partially in kernel and user space).
+
+7.3 KVM_CAP_S390_VECTOR_REGISTERS
+---------------------------------
+
+:Architectures: s390
+:Parameters: none
+:Returns: 0 on success, negative value on error
+
+Allows use of the vector registers introduced with z13 processor, and
+provides for the synchronization between host and user space.  Will
+return -EINVAL if the machine does not support vectors.
+
+7.4 KVM_CAP_S390_USER_STSI
+--------------------------
+
+:Architectures: s390
+:Parameters: none
+
+This capability allows post-handlers for the STSI instruction. After
+initial handling in the kernel, KVM exits to user space with
+KVM_EXIT_S390_STSI to allow user space to insert further data.
+
+Before exiting to userspace, kvm handlers should fill in s390_stsi field of
+vcpu->run::
+
+  struct {
+       __u64 addr;
+       __u8 ar;
+       __u8 reserved;
+       __u8 fc;
+       __u8 sel1;
+       __u16 sel2;
+  } s390_stsi;
+
+  @addr - guest address of STSI SYSIB
+  @fc   - function code
+  @sel1 - selector 1
+  @sel2 - selector 2
+  @ar   - access register number
+
+KVM handlers should exit to userspace with rc = -EREMOTE.
+
+7.5 KVM_CAP_SPLIT_IRQCHIP
+-------------------------
+
+:Architectures: x86
+:Parameters: args[0] - number of routes reserved for userspace IOAPICs
+:Returns: 0 on success, -1 on error
+
+Create a local apic for each processor in the kernel. This can be used
+instead of KVM_CREATE_IRQCHIP if the userspace VMM wishes to emulate the
+IOAPIC and PIC (and also the PIT, even though this has to be enabled
+separately).
+
+This capability also enables in kernel routing of interrupt requests;
+when KVM_CAP_SPLIT_IRQCHIP only routes of KVM_IRQ_ROUTING_MSI type are
+used in the IRQ routing table.  The first args[0] MSI routes are reserved
+for the IOAPIC pins.  Whenever the LAPIC receives an EOI for these routes,
+a KVM_EXIT_IOAPIC_EOI vmexit will be reported to userspace.
+
+Fails if VCPU has already been created, or if the irqchip is already in the
+kernel (i.e. KVM_CREATE_IRQCHIP has already been called).
+
+7.6 KVM_CAP_S390_RI
+-------------------
+
+:Architectures: s390
+:Parameters: none
+
+Allows use of runtime-instrumentation introduced with zEC12 processor.
+Will return -EINVAL if the machine does not support runtime-instrumentation.
+Will return -EBUSY if a VCPU has already been created.
+
+7.7 KVM_CAP_X2APIC_API
+----------------------
+
+:Architectures: x86
+:Parameters: args[0] - features that should be enabled
+:Returns: 0 on success, -EINVAL when args[0] contains invalid features
+
+Valid feature flags in args[0] are::
+
+  #define KVM_X2APIC_API_USE_32BIT_IDS            (1ULL << 0)
+  #define KVM_X2APIC_API_DISABLE_BROADCAST_QUIRK  (1ULL << 1)
+
+Enabling KVM_X2APIC_API_USE_32BIT_IDS changes the behavior of
+KVM_SET_GSI_ROUTING, KVM_SIGNAL_MSI, KVM_SET_LAPIC, and KVM_GET_LAPIC,
+allowing the use of 32-bit APIC IDs.  See KVM_CAP_X2APIC_API in their
+respective sections.
+
+KVM_X2APIC_API_DISABLE_BROADCAST_QUIRK must be enabled for x2APIC to work
+in logical mode or with more than 255 VCPUs.  Otherwise, KVM treats 0xff
+as a broadcast even in x2APIC mode in order to support physical x2APIC
+without interrupt remapping.  This is undesirable in logical mode,
+where 0xff represents CPUs 0-7 in cluster 0.
+
+7.8 KVM_CAP_S390_USER_INSTR0
+----------------------------
+
+:Architectures: s390
+:Parameters: none
+
+With this capability enabled, all illegal instructions 0x0000 (2 bytes) will
+be intercepted and forwarded to user space. User space can use this
+mechanism e.g. to realize 2-byte software breakpoints. The kernel will
+not inject an operating exception for these instructions, user space has
+to take care of that.
+
+This capability can be enabled dynamically even if VCPUs were already
+created and are running.
+
+7.9 KVM_CAP_S390_GS
+-------------------
+
+:Architectures: s390
+:Parameters: none
+:Returns: 0 on success; -EINVAL if the machine does not support
+          guarded storage; -EBUSY if a VCPU has already been created.
+
+Allows use of guarded storage for the KVM guest.
+
+7.10 KVM_CAP_S390_AIS
+---------------------
+
+:Architectures: s390
+:Parameters: none
+
+Allow use of adapter-interruption suppression.
+:Returns: 0 on success; -EBUSY if a VCPU has already been created.
+
+7.11 KVM_CAP_PPC_SMT
+--------------------
+
+:Architectures: ppc
+:Parameters: vsmt_mode, flags
+
+Enabling this capability on a VM provides userspace with a way to set
+the desired virtual SMT mode (i.e. the number of virtual CPUs per
+virtual core).  The virtual SMT mode, vsmt_mode, must be a power of 2
+between 1 and 8.  On POWER8, vsmt_mode must also be no greater than
+the number of threads per subcore for the host.  Currently flags must
+be 0.  A successful call to enable this capability will result in
+vsmt_mode being returned when the KVM_CAP_PPC_SMT capability is
+subsequently queried for the VM.  This capability is only supported by
+HV KVM, and can only be set before any VCPUs have been created.
+The KVM_CAP_PPC_SMT_POSSIBLE capability indicates which virtual SMT
+modes are available.
+
+7.12 KVM_CAP_PPC_FWNMI
+----------------------
+
+:Architectures: ppc
+:Parameters: none
+
+With this capability a machine check exception in the guest address
+space will cause KVM to exit the guest with NMI exit reason. This
+enables QEMU to build error log and branch to guest kernel registered
+machine check handling routine. Without this capability KVM will
+branch to guests' 0x200 interrupt vector.
+
+7.13 KVM_CAP_X86_DISABLE_EXITS
+------------------------------
+
+:Architectures: x86
+:Parameters: args[0] defines which exits are disabled
+:Returns: 0 on success, -EINVAL when args[0] contains invalid exits
+
+Valid bits in args[0] are::
+
+  #define KVM_X86_DISABLE_EXITS_MWAIT            (1 << 0)
+  #define KVM_X86_DISABLE_EXITS_HLT              (1 << 1)
+  #define KVM_X86_DISABLE_EXITS_PAUSE            (1 << 2)
+  #define KVM_X86_DISABLE_EXITS_CSTATE           (1 << 3)
+
+Enabling this capability on a VM provides userspace with a way to no
+longer intercept some instructions for improved latency in some
+workloads, and is suggested when vCPUs are associated to dedicated
+physical CPUs.  More bits can be added in the future; userspace can
+just pass the KVM_CHECK_EXTENSION result to KVM_ENABLE_CAP to disable
+all such vmexits.
+
+Do not enable KVM_FEATURE_PV_UNHALT if you disable HLT exits.
+
+7.14 KVM_CAP_S390_HPAGE_1M
+--------------------------
+
+:Architectures: s390
+:Parameters: none
+:Returns: 0 on success, -EINVAL if hpage module parameter was not set
+         or cmma is enabled, or the VM has the KVM_VM_S390_UCONTROL
+         flag set
+
+With this capability the KVM support for memory backing with 1m pages
+through hugetlbfs can be enabled for a VM. After the capability is
+enabled, cmma can't be enabled anymore and pfmfi and the storage key
+interpretation are disabled. If cmma has already been enabled or the
+hpage module parameter is not set to 1, -EINVAL is returned.
+
+While it is generally possible to create a huge page backed VM without
+this capability, the VM will not be able to run.
+
+7.15 KVM_CAP_MSR_PLATFORM_INFO
+------------------------------
+
+:Architectures: x86
+:Parameters: args[0] whether feature should be enabled or not
+
+With this capability, a guest may read the MSR_PLATFORM_INFO MSR. Otherwise,
+a #GP would be raised when the guest tries to access. Currently, this
+capability does not enable write permissions of this MSR for the guest.
+
+7.16 KVM_CAP_PPC_NESTED_HV
+--------------------------
+
+:Architectures: ppc
+:Parameters: none
+:Returns: 0 on success, -EINVAL when the implementation doesn't support
+         nested-HV virtualization.
+
+HV-KVM on POWER9 and later systems allows for "nested-HV"
+virtualization, which provides a way for a guest VM to run guests that
+can run using the CPU's supervisor mode (privileged non-hypervisor
+state).  Enabling this capability on a VM depends on the CPU having
+the necessary functionality and on the facility being enabled with a
+kvm-hv module parameter.
+
+7.17 KVM_CAP_EXCEPTION_PAYLOAD
+------------------------------
+
+:Architectures: x86
+:Parameters: args[0] whether feature should be enabled or not
+
+With this capability enabled, CR2 will not be modified prior to the
+emulated VM-exit when L1 intercepts a #PF exception that occurs in
+L2. Similarly, for kvm-intel only, DR6 will not be modified prior to
+the emulated VM-exit when L1 intercepts a #DB exception that occurs in
+L2. As a result, when KVM_GET_VCPU_EVENTS reports a pending #PF (or
+#DB) exception for L2, exception.has_payload will be set and the
+faulting address (or the new DR6 bits*) will be reported in the
+exception_payload field. Similarly, when userspace injects a #PF (or
+#DB) into L2 using KVM_SET_VCPU_EVENTS, it is expected to set
+exception.has_payload and to put the faulting address - or the new DR6
+bits\ [#]_ - in the exception_payload field.
+
+This capability also enables exception.pending in struct
+kvm_vcpu_events, which allows userspace to distinguish between pending
+and injected exceptions.
+
+
+.. [#] For the new DR6 bits, note that bit 16 is set iff the #DB exception
+       will clear DR6.RTM.
+
+7.18 KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2
+
+:Architectures: x86, arm, arm64, mips
+:Parameters: args[0] whether feature should be enabled or not
+
+With this capability enabled, KVM_GET_DIRTY_LOG will not automatically
+clear and write-protect all pages that are returned as dirty.
+Rather, userspace will have to do this operation separately using
+KVM_CLEAR_DIRTY_LOG.
+
+At the cost of a slightly more complicated operation, this provides better
+scalability and responsiveness for two reasons.  First,
+KVM_CLEAR_DIRTY_LOG ioctl can operate on a 64-page granularity rather
+than requiring to sync a full memslot; this ensures that KVM does not
+take spinlocks for an extended period of time.  Second, in some cases a
+large amount of time can pass between a call to KVM_GET_DIRTY_LOG and
+userspace actually using the data in the page.  Pages can be modified
+during this time, which is inefficint for both the guest and userspace:
+the guest will incur a higher penalty due to write protection faults,
+while userspace can see false reports of dirty pages.  Manual reprotection
+helps reducing this time, improving guest performance and reducing the
+number of dirty log false positives.
+
+KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2 was previously available under the name
+KVM_CAP_MANUAL_DIRTY_LOG_PROTECT, but the implementation had bugs that make
+it hard or impossible to use it correctly.  The availability of
+KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2 signals that those bugs are fixed.
+Userspace should not try to use KVM_CAP_MANUAL_DIRTY_LOG_PROTECT.
+
+8. Other capabilities.
+======================
+
+This section lists capabilities that give information about other
+features of the KVM implementation.
+
+8.1 KVM_CAP_PPC_HWRNG
+---------------------
+
+:Architectures: ppc
+
+This capability, if KVM_CHECK_EXTENSION indicates that it is
+available, means that that the kernel has an implementation of the
+H_RANDOM hypercall backed by a hardware random-number generator.
+If present, the kernel H_RANDOM handler can be enabled for guest use
+with the KVM_CAP_PPC_ENABLE_HCALL capability.
+
+8.2 KVM_CAP_HYPERV_SYNIC
+------------------------
+
+:Architectures: x86
+
+This capability, if KVM_CHECK_EXTENSION indicates that it is
+available, means that that the kernel has an implementation of the
+Hyper-V Synthetic interrupt controller(SynIC). Hyper-V SynIC is
+used to support Windows Hyper-V based guest paravirt drivers(VMBus).
+
+In order to use SynIC, it has to be activated by setting this
+capability via KVM_ENABLE_CAP ioctl on the vcpu fd. Note that this
+will disable the use of APIC hardware virtualization even if supported
+by the CPU, as it's incompatible with SynIC auto-EOI behavior.
+
+8.3 KVM_CAP_PPC_RADIX_MMU
+-------------------------
+
+:Architectures: ppc
+
+This capability, if KVM_CHECK_EXTENSION indicates that it is
+available, means that that the kernel can support guests using the
+radix MMU defined in Power ISA V3.00 (as implemented in the POWER9
+processor).
+
+8.4 KVM_CAP_PPC_HASH_MMU_V3
+---------------------------
+
+:Architectures: ppc
+
+This capability, if KVM_CHECK_EXTENSION indicates that it is
+available, means that that the kernel can support guests using the
+hashed page table MMU defined in Power ISA V3.00 (as implemented in
+the POWER9 processor), including in-memory segment tables.
+
+8.5 KVM_CAP_MIPS_VZ
+-------------------
+
+:Architectures: mips
+
+This capability, if KVM_CHECK_EXTENSION on the main kvm handle indicates that
+it is available, means that full hardware assisted virtualization capabilities
+of the hardware are available for use through KVM. An appropriate
+KVM_VM_MIPS_* type must be passed to KVM_CREATE_VM to create a VM which
+utilises it.
+
+If KVM_CHECK_EXTENSION on a kvm VM handle indicates that this capability is
+available, it means that the VM is using full hardware assisted virtualization
+capabilities of the hardware. This is useful to check after creating a VM with
+KVM_VM_MIPS_DEFAULT.
+
+The value returned by KVM_CHECK_EXTENSION should be compared against known
+values (see below). All other values are reserved. This is to allow for the
+possibility of other hardware assisted virtualization implementations which
+may be incompatible with the MIPS VZ ASE.
+
+==  ==========================================================================
+ 0  The trap & emulate implementation is in use to run guest code in user
+    mode. Guest virtual memory segments are rearranged to fit the guest in the
+    user mode address space.
+
+ 1  The MIPS VZ ASE is in use, providing full hardware assisted
+    virtualization, including standard guest virtual memory segments.
+==  ==========================================================================
+
+8.6 KVM_CAP_MIPS_TE
+-------------------
+
+:Architectures: mips
+
+This capability, if KVM_CHECK_EXTENSION on the main kvm handle indicates that
+it is available, means that the trap & emulate implementation is available to
+run guest code in user mode, even if KVM_CAP_MIPS_VZ indicates that hardware
+assisted virtualisation is also available. KVM_VM_MIPS_TE (0) must be passed
+to KVM_CREATE_VM to create a VM which utilises it.
+
+If KVM_CHECK_EXTENSION on a kvm VM handle indicates that this capability is
+available, it means that the VM is using trap & emulate.
+
+8.7 KVM_CAP_MIPS_64BIT
+----------------------
+
+:Architectures: mips
+
+This capability indicates the supported architecture type of the guest, i.e. the
+supported register and address width.
+
+The values returned when this capability is checked by KVM_CHECK_EXTENSION on a
+kvm VM handle correspond roughly to the CP0_Config.AT register field, and should
+be checked specifically against known values (see below). All other values are
+reserved.
+
+==  ========================================================================
+ 0  MIPS32 or microMIPS32.
+    Both registers and addresses are 32-bits wide.
+    It will only be possible to run 32-bit guest code.
+
+ 1  MIPS64 or microMIPS64 with access only to 32-bit compatibility segments.
+    Registers are 64-bits wide, but addresses are 32-bits wide.
+    64-bit guest code may run but cannot access MIPS64 memory segments.
+    It will also be possible to run 32-bit guest code.
+
+ 2  MIPS64 or microMIPS64 with access to all address segments.
+    Both registers and addresses are 64-bits wide.
+    It will be possible to run 64-bit or 32-bit guest code.
+==  ========================================================================
+
+8.9 KVM_CAP_ARM_USER_IRQ
+------------------------
+
+:Architectures: arm, arm64
+
+This capability, if KVM_CHECK_EXTENSION indicates that it is available, means
+that if userspace creates a VM without an in-kernel interrupt controller, it
+will be notified of changes to the output level of in-kernel emulated devices,
+which can generate virtual interrupts, presented to the VM.
+For such VMs, on every return to userspace, the kernel
+updates the vcpu's run->s.regs.device_irq_level field to represent the actual
+output level of the device.
+
+Whenever kvm detects a change in the device output level, kvm guarantees at
+least one return to userspace before running the VM.  This exit could either
+be a KVM_EXIT_INTR or any other exit event, like KVM_EXIT_MMIO. This way,
+userspace can always sample the device output level and re-compute the state of
+the userspace interrupt controller.  Userspace should always check the state
+of run->s.regs.device_irq_level on every kvm exit.
+The value in run->s.regs.device_irq_level can represent both level and edge
+triggered interrupt signals, depending on the device.  Edge triggered interrupt
+signals will exit to userspace with the bit in run->s.regs.device_irq_level
+set exactly once per edge signal.
+
+The field run->s.regs.device_irq_level is available independent of
+run->kvm_valid_regs or run->kvm_dirty_regs bits.
+
+If KVM_CAP_ARM_USER_IRQ is supported, the KVM_CHECK_EXTENSION ioctl returns a
+number larger than 0 indicating the version of this capability is implemented
+and thereby which bits in in run->s.regs.device_irq_level can signal values.
+
+Currently the following bits are defined for the device_irq_level bitmap::
+
+  KVM_CAP_ARM_USER_IRQ >= 1:
+
+    KVM_ARM_DEV_EL1_VTIMER -  EL1 virtual timer
+    KVM_ARM_DEV_EL1_PTIMER -  EL1 physical timer
+    KVM_ARM_DEV_PMU        -  ARM PMU overflow interrupt signal
+
+Future versions of kvm may implement additional events. These will get
+indicated by returning a higher number from KVM_CHECK_EXTENSION and will be
+listed above.
+
+8.10 KVM_CAP_PPC_SMT_POSSIBLE
+-----------------------------
+
+:Architectures: ppc
+
+Querying this capability returns a bitmap indicating the possible
+virtual SMT modes that can be set using KVM_CAP_PPC_SMT.  If bit N
+(counting from the right) is set, then a virtual SMT mode of 2^N is
+available.
+
+8.11 KVM_CAP_HYPERV_SYNIC2
+--------------------------
+
+:Architectures: x86
+
+This capability enables a newer version of Hyper-V Synthetic interrupt
+controller (SynIC).  The only difference with KVM_CAP_HYPERV_SYNIC is that KVM
+doesn't clear SynIC message and event flags pages when they are enabled by
+writing to the respective MSRs.
+
+8.12 KVM_CAP_HYPERV_VP_INDEX
+----------------------------
+
+:Architectures: x86
+
+This capability indicates that userspace can load HV_X64_MSR_VP_INDEX msr.  Its
+value is used to denote the target vcpu for a SynIC interrupt.  For
+compatibilty, KVM initializes this msr to KVM's internal vcpu index.  When this
+capability is absent, userspace can still query this msr's value.
+
+8.13 KVM_CAP_S390_AIS_MIGRATION
+-------------------------------
+
+:Architectures: s390
+:Parameters: none
+
+This capability indicates if the flic device will be able to get/set the
+AIS states for migration via the KVM_DEV_FLIC_AISM_ALL attribute and allows
+to discover this without having to create a flic device.
+
+8.14 KVM_CAP_S390_PSW
+---------------------
+
+:Architectures: s390
+
+This capability indicates that the PSW is exposed via the kvm_run structure.
+
+8.15 KVM_CAP_S390_GMAP
+----------------------
+
+:Architectures: s390
+
+This capability indicates that the user space memory used as guest mapping can
+be anywhere in the user memory address space, as long as the memory slots are
+aligned and sized to a segment (1MB) boundary.
+
+8.16 KVM_CAP_S390_COW
+---------------------
+
+:Architectures: s390
+
+This capability indicates that the user space memory used as guest mapping can
+use copy-on-write semantics as well as dirty pages tracking via read-only page
+tables.
+
+8.17 KVM_CAP_S390_BPB
+---------------------
+
+:Architectures: s390
+
+This capability indicates that kvm will implement the interfaces to handle
+reset, migration and nested KVM for branch prediction blocking. The stfle
+facility 82 should not be provided to the guest without this capability.
+
+8.18 KVM_CAP_HYPERV_TLBFLUSH
+----------------------------
+
+:Architectures: x86
+
+This capability indicates that KVM supports paravirtualized Hyper-V TLB Flush
+hypercalls:
+HvFlushVirtualAddressSpace, HvFlushVirtualAddressSpaceEx,
+HvFlushVirtualAddressList, HvFlushVirtualAddressListEx.
+
+8.19 KVM_CAP_ARM_INJECT_SERROR_ESR
+----------------------------------
+
+:Architectures: arm, arm64
+
+This capability indicates that userspace can specify (via the
+KVM_SET_VCPU_EVENTS ioctl) the syndrome value reported to the guest when it
+takes a virtual SError interrupt exception.
+If KVM advertises this capability, userspace can only specify the ISS field for
+the ESR syndrome. Other parts of the ESR, such as the EC are generated by the
+CPU when the exception is taken. If this virtual SError is taken to EL1 using
+AArch64, this value will be reported in the ISS field of ESR_ELx.
+
+See KVM_CAP_VCPU_EVENTS for more details.
+
+8.20 KVM_CAP_HYPERV_SEND_IPI
+----------------------------
+
+:Architectures: x86
+
+This capability indicates that KVM supports paravirtualized Hyper-V IPI send
+hypercalls:
+HvCallSendSyntheticClusterIpi, HvCallSendSyntheticClusterIpiEx.
+
+8.21 KVM_CAP_HYPERV_DIRECT_TLBFLUSH
+-----------------------------------
+
+:Architecture: x86
+
+This capability indicates that KVM running on top of Hyper-V hypervisor
+enables Direct TLB flush for its guests meaning that TLB flush
+hypercalls are handled by Level 0 hypervisor (Hyper-V) bypassing KVM.
+Due to the different ABI for hypercall parameters between Hyper-V and
+KVM, enabling this capability effectively disables all hypercall
+handling by KVM (as some KVM hypercall may be mistakenly treated as TLB
+flush hypercalls by Hyper-V) so userspace should disable KVM identification
+in CPUID and only exposes Hyper-V identification. In this case, guest
+thinks it's running on Hyper-V and only use Hyper-V hypercalls.
+
+8.22 KVM_CAP_S390_VCPU_RESETS
+
+Architectures: s390
+
+This capability indicates that the KVM_S390_NORMAL_RESET and
+KVM_S390_CLEAR_RESET ioctls are available.
diff --git a/Documentation/virt/kvm/api.txt b/Documentation/virt/kvm/api.txt

deleted file mode 100644 (file)

index c6e1ce5..0000000
--- a/Documentation/virt/kvm/api.txt
+++ /dev/null
@@ -1,5450 +0,0 @@
-The Definitive KVM (Kernel-based Virtual Machine) API Documentation
-===================================================================
-
-1. General description
-----------------------
-
-The kvm API is a set of ioctls that are issued to control various aspects
-of a virtual machine.  The ioctls belong to the following classes:
-
- - System ioctls: These query and set global attributes which affect the
-   whole kvm subsystem.  In addition a system ioctl is used to create
-   virtual machines.
-
- - VM ioctls: These query and set attributes that affect an entire virtual
-   machine, for example memory layout.  In addition a VM ioctl is used to
-   create virtual cpus (vcpus) and devices.
-
-   VM ioctls must be issued from the same process (address space) that was
-   used to create the VM.
-
- - vcpu ioctls: These query and set attributes that control the operation
-   of a single virtual cpu.
-
-   vcpu ioctls should be issued from the same thread that was used to create
-   the vcpu, except for asynchronous vcpu ioctl that are marked as such in
-   the documentation.  Otherwise, the first ioctl after switching threads
-   could see a performance impact.
-
- - device ioctls: These query and set attributes that control the operation
-   of a single device.
-
-   device ioctls must be issued from the same process (address space) that
-   was used to create the VM.
-
-2. File descriptors
--------------------
-
-The kvm API is centered around file descriptors.  An initial
-open("/dev/kvm") obtains a handle to the kvm subsystem; this handle
-can be used to issue system ioctls.  A KVM_CREATE_VM ioctl on this
-handle will create a VM file descriptor which can be used to issue VM
-ioctls.  A KVM_CREATE_VCPU or KVM_CREATE_DEVICE ioctl on a VM fd will
-create a virtual cpu or device and return a file descriptor pointing to
-the new resource.  Finally, ioctls on a vcpu or device fd can be used
-to control the vcpu or device.  For vcpus, this includes the important
-task of actually running guest code.
-
-In general file descriptors can be migrated among processes by means
-of fork() and the SCM_RIGHTS facility of unix domain socket.  These
-kinds of tricks are explicitly not supported by kvm.  While they will
-not cause harm to the host, their actual behavior is not guaranteed by
-the API.  See "General description" for details on the ioctl usage
-model that is supported by KVM.
-
-It is important to note that althought VM ioctls may only be issued from
-the process that created the VM, a VM's lifecycle is associated with its
-file descriptor, not its creator (process).  In other words, the VM and
-its resources, *including the associated address space*, are not freed
-until the last reference to the VM's file descriptor has been released.
-For example, if fork() is issued after ioctl(KVM_CREATE_VM), the VM will
-not be freed until both the parent (original) process and its child have
-put their references to the VM's file descriptor.
-
-Because a VM's resources are not freed until the last reference to its
-file descriptor is released, creating additional references to a VM via
-via fork(), dup(), etc... without careful consideration is strongly
-discouraged and may have unwanted side effects, e.g. memory allocated
-by and on behalf of the VM's process may not be freed/unaccounted when
-the VM is shut down.
-
-
-3. Extensions
--------------
-
-As of Linux 2.6.22, the KVM ABI has been stabilized: no backward
-incompatible change are allowed.  However, there is an extension
-facility that allows backward-compatible extensions to the API to be
-queried and used.
-
-The extension mechanism is not based on the Linux version number.
-Instead, kvm defines extension identifiers and a facility to query
-whether a particular extension identifier is available.  If it is, a
-set of ioctls is available for application use.
-
-
-4. API description
-------------------
-
-This section describes ioctls that can be used to control kvm guests.
-For each ioctl, the following information is provided along with a
-description:
-
-  Capability: which KVM extension provides this ioctl.  Can be 'basic',
-      which means that is will be provided by any kernel that supports
-      API version 12 (see section 4.1), a KVM_CAP_xyz constant, which
-      means availability needs to be checked with KVM_CHECK_EXTENSION
-      (see section 4.4), or 'none' which means that while not all kernels
-      support this ioctl, there's no capability bit to check its
-      availability: for kernels that don't support the ioctl,
-      the ioctl returns -ENOTTY.
-
-  Architectures: which instruction set architectures provide this ioctl.
-      x86 includes both i386 and x86_64.
-
-  Type: system, vm, or vcpu.
-
-  Parameters: what parameters are accepted by the ioctl.
-
-  Returns: the return value.  General error numbers (EBADF, ENOMEM, EINVAL)
-      are not detailed, but errors with specific meanings are.
-
-
-4.1 KVM_GET_API_VERSION
-
-Capability: basic
-Architectures: all
-Type: system ioctl
-Parameters: none
-Returns: the constant KVM_API_VERSION (=12)
-
-This identifies the API version as the stable kvm API. It is not
-expected that this number will change.  However, Linux 2.6.20 and
-2.6.21 report earlier versions; these are not documented and not
-supported.  Applications should refuse to run if KVM_GET_API_VERSION
-returns a value other than 12.  If this check passes, all ioctls
-described as 'basic' will be available.
-
-
-4.2 KVM_CREATE_VM
-
-Capability: basic
-Architectures: all
-Type: system ioctl
-Parameters: machine type identifier (KVM_VM_*)
-Returns: a VM fd that can be used to control the new virtual machine.
-
-The new VM has no virtual cpus and no memory.
-You probably want to use 0 as machine type.
-
-In order to create user controlled virtual machines on S390, check
-KVM_CAP_S390_UCONTROL and use the flag KVM_VM_S390_UCONTROL as
-privileged user (CAP_SYS_ADMIN).
-
-To use hardware assisted virtualization on MIPS (VZ ASE) rather than
-the default trap & emulate implementation (which changes the virtual
-memory layout to fit in user mode), check KVM_CAP_MIPS_VZ and use the
-flag KVM_VM_MIPS_VZ.
-
-
-On arm64, the physical address size for a VM (IPA Size limit) is limited
-to 40bits by default. The limit can be configured if the host supports the
-extension KVM_CAP_ARM_VM_IPA_SIZE. When supported, use
-KVM_VM_TYPE_ARM_IPA_SIZE(IPA_Bits) to set the size in the machine type
-identifier, where IPA_Bits is the maximum width of any physical
-address used by the VM. The IPA_Bits is encoded in bits[7-0] of the
-machine type identifier.
-
-e.g, to configure a guest to use 48bit physical address size :
-
-    vm_fd = ioctl(dev_fd, KVM_CREATE_VM, KVM_VM_TYPE_ARM_IPA_SIZE(48));
-
-The requested size (IPA_Bits) must be :
-  0 - Implies default size, 40bits (for backward compatibility)
-
-  or
-
-  N - Implies N bits, where N is a positive integer such that,
-      32 <= N <= Host_IPA_Limit
-
-Host_IPA_Limit is the maximum possible value for IPA_Bits on the host and
-is dependent on the CPU capability and the kernel configuration. The limit can
-be retrieved using KVM_CAP_ARM_VM_IPA_SIZE of the KVM_CHECK_EXTENSION
-ioctl() at run-time.
-
-Please note that configuring the IPA size does not affect the capability
-exposed by the guest CPUs in ID_AA64MMFR0_EL1[PARange]. It only affects
-size of the address translated by the stage2 level (guest physical to
-host physical address translations).
-
-
-4.3 KVM_GET_MSR_INDEX_LIST, KVM_GET_MSR_FEATURE_INDEX_LIST
-
-Capability: basic, KVM_CAP_GET_MSR_FEATURES for KVM_GET_MSR_FEATURE_INDEX_LIST
-Architectures: x86
-Type: system ioctl
-Parameters: struct kvm_msr_list (in/out)
-Returns: 0 on success; -1 on error
-Errors:
-  EFAULT:    the msr index list cannot be read from or written to
-  E2BIG:     the msr index list is to be to fit in the array specified by
-             the user.
-
-struct kvm_msr_list {
-       __u32 nmsrs; /* number of msrs in entries */
-       __u32 indices[0];
-};
-
-The user fills in the size of the indices array in nmsrs, and in return
-kvm adjusts nmsrs to reflect the actual number of msrs and fills in the
-indices array with their numbers.
-
-KVM_GET_MSR_INDEX_LIST returns the guest msrs that are supported.  The list
-varies by kvm version and host processor, but does not change otherwise.
-
-Note: if kvm indicates supports MCE (KVM_CAP_MCE), then the MCE bank MSRs are
-not returned in the MSR list, as different vcpus can have a different number
-of banks, as set via the KVM_X86_SETUP_MCE ioctl.
-
-KVM_GET_MSR_FEATURE_INDEX_LIST returns the list of MSRs that can be passed
-to the KVM_GET_MSRS system ioctl.  This lets userspace probe host capabilities
-and processor features that are exposed via MSRs (e.g., VMX capabilities).
-This list also varies by kvm version and host processor, but does not change
-otherwise.
-
-
-4.4 KVM_CHECK_EXTENSION
-
-Capability: basic, KVM_CAP_CHECK_EXTENSION_VM for vm ioctl
-Architectures: all
-Type: system ioctl, vm ioctl
-Parameters: extension identifier (KVM_CAP_*)
-Returns: 0 if unsupported; 1 (or some other positive integer) if supported
-
-The API allows the application to query about extensions to the core
-kvm API.  Userspace passes an extension identifier (an integer) and
-receives an integer that describes the extension availability.
-Generally 0 means no and 1 means yes, but some extensions may report
-additional information in the integer return value.
-
-Based on their initialization different VMs may have different capabilities.
-It is thus encouraged to use the vm ioctl to query for capabilities (available
-with KVM_CAP_CHECK_EXTENSION_VM on the vm fd)
-
-4.5 KVM_GET_VCPU_MMAP_SIZE
-
-Capability: basic
-Architectures: all
-Type: system ioctl
-Parameters: none
-Returns: size of vcpu mmap area, in bytes
-
-The KVM_RUN ioctl (cf.) communicates with userspace via a shared
-memory region.  This ioctl returns the size of that region.  See the
-KVM_RUN documentation for details.
-
-
-4.6 KVM_SET_MEMORY_REGION
-
-Capability: basic
-Architectures: all
-Type: vm ioctl
-Parameters: struct kvm_memory_region (in)
-Returns: 0 on success, -1 on error
-
-This ioctl is obsolete and has been removed.
-
-
-4.7 KVM_CREATE_VCPU
-
-Capability: basic
-Architectures: all
-Type: vm ioctl
-Parameters: vcpu id (apic id on x86)
-Returns: vcpu fd on success, -1 on error
-
-This API adds a vcpu to a virtual machine. No more than max_vcpus may be added.
-The vcpu id is an integer in the range [0, max_vcpu_id).
-
-The recommended max_vcpus value can be retrieved using the KVM_CAP_NR_VCPUS of
-the KVM_CHECK_EXTENSION ioctl() at run-time.
-The maximum possible value for max_vcpus can be retrieved using the
-KVM_CAP_MAX_VCPUS of the KVM_CHECK_EXTENSION ioctl() at run-time.
-
-If the KVM_CAP_NR_VCPUS does not exist, you should assume that max_vcpus is 4
-cpus max.
-If the KVM_CAP_MAX_VCPUS does not exist, you should assume that max_vcpus is
-same as the value returned from KVM_CAP_NR_VCPUS.
-
-The maximum possible value for max_vcpu_id can be retrieved using the
-KVM_CAP_MAX_VCPU_ID of the KVM_CHECK_EXTENSION ioctl() at run-time.
-
-If the KVM_CAP_MAX_VCPU_ID does not exist, you should assume that max_vcpu_id
-is the same as the value returned from KVM_CAP_MAX_VCPUS.
-
-On powerpc using book3s_hv mode, the vcpus are mapped onto virtual
-threads in one or more virtual CPU cores.  (This is because the
-hardware requires all the hardware threads in a CPU core to be in the
-same partition.)  The KVM_CAP_PPC_SMT capability indicates the number
-of vcpus per virtual core (vcore).  The vcore id is obtained by
-dividing the vcpu id by the number of vcpus per vcore.  The vcpus in a
-given vcore will always be in the same physical core as each other
-(though that might be a different physical core from time to time).
-Userspace can control the threading (SMT) mode of the guest by its
-allocation of vcpu ids.  For example, if userspace wants
-single-threaded guest vcpus, it should make all vcpu ids be a multiple
-of the number of vcpus per vcore.
-
-For virtual cpus that have been created with S390 user controlled virtual
-machines, the resulting vcpu fd can be memory mapped at page offset
-KVM_S390_SIE_PAGE_OFFSET in order to obtain a memory map of the virtual
-cpu's hardware control block.
-
-
-4.8 KVM_GET_DIRTY_LOG (vm ioctl)
-
-Capability: basic
-Architectures: all
-Type: vm ioctl
-Parameters: struct kvm_dirty_log (in/out)
-Returns: 0 on success, -1 on error
-
-/* for KVM_GET_DIRTY_LOG */
-struct kvm_dirty_log {
-       __u32 slot;
-       __u32 padding;
-       union {
-               void __user *dirty_bitmap; /* one bit per page */
-               __u64 padding;
-       };
-};
-
-Given a memory slot, return a bitmap containing any pages dirtied
-since the last call to this ioctl.  Bit 0 is the first page in the
-memory slot.  Ensure the entire structure is cleared to avoid padding
-issues.
-
-If KVM_CAP_MULTI_ADDRESS_SPACE is available, bits 16-31 specifies
-the address space for which you want to return the dirty bitmap.
-They must be less than the value that KVM_CHECK_EXTENSION returns for
-the KVM_CAP_MULTI_ADDRESS_SPACE capability.
-
-The bits in the dirty bitmap are cleared before the ioctl returns, unless
-KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2 is enabled.  For more information,
-see the description of the capability.
-
-4.9 KVM_SET_MEMORY_ALIAS
-
-Capability: basic
-Architectures: x86
-Type: vm ioctl
-Parameters: struct kvm_memory_alias (in)
-Returns: 0 (success), -1 (error)
-
-This ioctl is obsolete and has been removed.
-
-
-4.10 KVM_RUN
-
-Capability: basic
-Architectures: all
-Type: vcpu ioctl
-Parameters: none
-Returns: 0 on success, -1 on error
-Errors:
-  EINTR:     an unmasked signal is pending
-
-This ioctl is used to run a guest virtual cpu.  While there are no
-explicit parameters, there is an implicit parameter block that can be
-obtained by mmap()ing the vcpu fd at offset 0, with the size given by
-KVM_GET_VCPU_MMAP_SIZE.  The parameter block is formatted as a 'struct
-kvm_run' (see below).
-
-
-4.11 KVM_GET_REGS
-
-Capability: basic
-Architectures: all except ARM, arm64
-Type: vcpu ioctl
-Parameters: struct kvm_regs (out)
-Returns: 0 on success, -1 on error
-
-Reads the general purpose registers from the vcpu.
-
-/* x86 */
-struct kvm_regs {
-       /* out (KVM_GET_REGS) / in (KVM_SET_REGS) */
-       __u64 rax, rbx, rcx, rdx;
-       __u64 rsi, rdi, rsp, rbp;
-       __u64 r8,  r9,  r10, r11;
-       __u64 r12, r13, r14, r15;
-       __u64 rip, rflags;
-};
-
-/* mips */
-struct kvm_regs {
-       /* out (KVM_GET_REGS) / in (KVM_SET_REGS) */
-       __u64 gpr[32];
-       __u64 hi;
-       __u64 lo;
-       __u64 pc;
-};
-
-
-4.12 KVM_SET_REGS
-
-Capability: basic
-Architectures: all except ARM, arm64
-Type: vcpu ioctl
-Parameters: struct kvm_regs (in)
-Returns: 0 on success, -1 on error
-
-Writes the general purpose registers into the vcpu.
-
-See KVM_GET_REGS for the data structure.
-
-
-4.13 KVM_GET_SREGS
-
-Capability: basic
-Architectures: x86, ppc
-Type: vcpu ioctl
-Parameters: struct kvm_sregs (out)
-Returns: 0 on success, -1 on error
-
-Reads special registers from the vcpu.
-
-/* x86 */
-struct kvm_sregs {
-       struct kvm_segment cs, ds, es, fs, gs, ss;
-       struct kvm_segment tr, ldt;
-       struct kvm_dtable gdt, idt;
-       __u64 cr0, cr2, cr3, cr4, cr8;
-       __u64 efer;
-       __u64 apic_base;
-       __u64 interrupt_bitmap[(KVM_NR_INTERRUPTS + 63) / 64];
-};
-
-/* ppc -- see arch/powerpc/include/uapi/asm/kvm.h */
-
-interrupt_bitmap is a bitmap of pending external interrupts.  At most
-one bit may be set.  This interrupt has been acknowledged by the APIC
-but not yet injected into the cpu core.
-
-
-4.14 KVM_SET_SREGS
-
-Capability: basic
-Architectures: x86, ppc
-Type: vcpu ioctl
-Parameters: struct kvm_sregs (in)
-Returns: 0 on success, -1 on error
-
-Writes special registers into the vcpu.  See KVM_GET_SREGS for the
-data structures.
-
-
-4.15 KVM_TRANSLATE
-
-Capability: basic
-Architectures: x86
-Type: vcpu ioctl
-Parameters: struct kvm_translation (in/out)
-Returns: 0 on success, -1 on error
-
-Translates a virtual address according to the vcpu's current address
-translation mode.
-
-struct kvm_translation {
-       /* in */
-       __u64 linear_address;
-
-       /* out */
-       __u64 physical_address;
-       __u8  valid;
-       __u8  writeable;
-       __u8  usermode;
-       __u8  pad[5];
-};
-
-
-4.16 KVM_INTERRUPT
-
-Capability: basic
-Architectures: x86, ppc, mips
-Type: vcpu ioctl
-Parameters: struct kvm_interrupt (in)
-Returns: 0 on success, negative on failure.
-
-Queues a hardware interrupt vector to be injected.
-
-/* for KVM_INTERRUPT */
-struct kvm_interrupt {
-       /* in */
-       __u32 irq;
-};
-
-X86:
-
-Returns: 0 on success,
-        -EEXIST if an interrupt is already enqueued
-        -EINVAL the the irq number is invalid
-        -ENXIO if the PIC is in the kernel
-        -EFAULT if the pointer is invalid
-
-Note 'irq' is an interrupt vector, not an interrupt pin or line. This
-ioctl is useful if the in-kernel PIC is not used.
-
-PPC:
-
-Queues an external interrupt to be injected. This ioctl is overleaded
-with 3 different irq values:
-
-a) KVM_INTERRUPT_SET
-
-  This injects an edge type external interrupt into the guest once it's ready
-  to receive interrupts. When injected, the interrupt is done.
-
-b) KVM_INTERRUPT_UNSET
-
-  This unsets any pending interrupt.
-
-  Only available with KVM_CAP_PPC_UNSET_IRQ.
-
-c) KVM_INTERRUPT_SET_LEVEL
-
-  This injects a level type external interrupt into the guest context. The
-  interrupt stays pending until a specific ioctl with KVM_INTERRUPT_UNSET
-  is triggered.
-
-  Only available with KVM_CAP_PPC_IRQ_LEVEL.
-
-Note that any value for 'irq' other than the ones stated above is invalid
-and incurs unexpected behavior.
-
-This is an asynchronous vcpu ioctl and can be invoked from any thread.
-
-MIPS:
-
-Queues an external interrupt to be injected into the virtual CPU. A negative
-interrupt number dequeues the interrupt.
-
-This is an asynchronous vcpu ioctl and can be invoked from any thread.
-
-
-4.17 KVM_DEBUG_GUEST
-
-Capability: basic
-Architectures: none
-Type: vcpu ioctl
-Parameters: none)
-Returns: -1 on error
-
-Support for this has been removed.  Use KVM_SET_GUEST_DEBUG instead.
-
-
-4.18 KVM_GET_MSRS
-
-Capability: basic (vcpu), KVM_CAP_GET_MSR_FEATURES (system)
-Architectures: x86
-Type: system ioctl, vcpu ioctl
-Parameters: struct kvm_msrs (in/out)
-Returns: number of msrs successfully returned;
-        -1 on error
-
-When used as a system ioctl:
-Reads the values of MSR-based features that are available for the VM.  This
-is similar to KVM_GET_SUPPORTED_CPUID, but it returns MSR indices and values.
-The list of msr-based features can be obtained using KVM_GET_MSR_FEATURE_INDEX_LIST
-in a system ioctl.
-
-When used as a vcpu ioctl:
-Reads model-specific registers from the vcpu.  Supported msr indices can
-be obtained using KVM_GET_MSR_INDEX_LIST in a system ioctl.
-
-struct kvm_msrs {
-       __u32 nmsrs; /* number of msrs in entries */
-       __u32 pad;
-
-       struct kvm_msr_entry entries[0];
-};
-
-struct kvm_msr_entry {
-       __u32 index;
-       __u32 reserved;
-       __u64 data;
-};
-
-Application code should set the 'nmsrs' member (which indicates the
-size of the entries array) and the 'index' member of each array entry.
-kvm will fill in the 'data' member.
-
-
-4.19 KVM_SET_MSRS
-
-Capability: basic
-Architectures: x86
-Type: vcpu ioctl
-Parameters: struct kvm_msrs (in)
-Returns: number of msrs successfully set (see below), -1 on error
-
-Writes model-specific registers to the vcpu.  See KVM_GET_MSRS for the
-data structures.
-
-Application code should set the 'nmsrs' member (which indicates the
-size of the entries array), and the 'index' and 'data' members of each
-array entry.
-
-It tries to set the MSRs in array entries[] one by one. If setting an MSR
-fails, e.g., due to setting reserved bits, the MSR isn't supported/emulated
-by KVM, etc..., it stops processing the MSR list and returns the number of
-MSRs that have been set successfully.
-
-
-4.20 KVM_SET_CPUID
-
-Capability: basic
-Architectures: x86
-Type: vcpu ioctl
-Parameters: struct kvm_cpuid (in)
-Returns: 0 on success, -1 on error
-
-Defines the vcpu responses to the cpuid instruction.  Applications
-should use the KVM_SET_CPUID2 ioctl if available.
-
-
-struct kvm_cpuid_entry {
-       __u32 function;
-       __u32 eax;
-       __u32 ebx;
-       __u32 ecx;
-       __u32 edx;
-       __u32 padding;
-};
-
-/* for KVM_SET_CPUID */
-struct kvm_cpuid {
-       __u32 nent;
-       __u32 padding;
-       struct kvm_cpuid_entry entries[0];
-};
-
-
-4.21 KVM_SET_SIGNAL_MASK
-
-Capability: basic
-Architectures: all
-Type: vcpu ioctl
-Parameters: struct kvm_signal_mask (in)
-Returns: 0 on success, -1 on error
-
-Defines which signals are blocked during execution of KVM_RUN.  This
-signal mask temporarily overrides the threads signal mask.  Any
-unblocked signal received (except SIGKILL and SIGSTOP, which retain
-their traditional behaviour) will cause KVM_RUN to return with -EINTR.
-
-Note the signal will only be delivered if not blocked by the original
-signal mask.
-
-/* for KVM_SET_SIGNAL_MASK */
-struct kvm_signal_mask {
-       __u32 len;
-       __u8  sigset[0];
-};
-
-
-4.22 KVM_GET_FPU
-
-Capability: basic
-Architectures: x86
-Type: vcpu ioctl
-Parameters: struct kvm_fpu (out)
-Returns: 0 on success, -1 on error
-
-Reads the floating point state from the vcpu.
-
-/* for KVM_GET_FPU and KVM_SET_FPU */
-struct kvm_fpu {
-       __u8  fpr[8][16];
-       __u16 fcw;
-       __u16 fsw;
-       __u8  ftwx;  /* in fxsave format */
-       __u8  pad1;
-       __u16 last_opcode;
-       __u64 last_ip;
-       __u64 last_dp;
-       __u8  xmm[16][16];
-       __u32 mxcsr;
-       __u32 pad2;
-};
-
-
-4.23 KVM_SET_FPU
-
-Capability: basic
-Architectures: x86
-Type: vcpu ioctl
-Parameters: struct kvm_fpu (in)
-Returns: 0 on success, -1 on error
-
-Writes the floating point state to the vcpu.
-
-/* for KVM_GET_FPU and KVM_SET_FPU */
-struct kvm_fpu {
-       __u8  fpr[8][16];
-       __u16 fcw;
-       __u16 fsw;
-       __u8  ftwx;  /* in fxsave format */
-       __u8  pad1;
-       __u16 last_opcode;
-       __u64 last_ip;
-       __u64 last_dp;
-       __u8  xmm[16][16];
-       __u32 mxcsr;
-       __u32 pad2;
-};
-
-
-4.24 KVM_CREATE_IRQCHIP
-
-Capability: KVM_CAP_IRQCHIP, KVM_CAP_S390_IRQCHIP (s390)
-Architectures: x86, ARM, arm64, s390
-Type: vm ioctl
-Parameters: none
-Returns: 0 on success, -1 on error
-
-Creates an interrupt controller model in the kernel.
-On x86, creates a virtual ioapic, a virtual PIC (two PICs, nested), and sets up
-future vcpus to have a local APIC.  IRQ routing for GSIs 0-15 is set to both
-PIC and IOAPIC; GSI 16-23 only go to the IOAPIC.
-On ARM/arm64, a GICv2 is created. Any other GIC versions require the usage of
-KVM_CREATE_DEVICE, which also supports creating a GICv2.  Using
-KVM_CREATE_DEVICE is preferred over KVM_CREATE_IRQCHIP for GICv2.
-On s390, a dummy irq routing table is created.
-
-Note that on s390 the KVM_CAP_S390_IRQCHIP vm capability needs to be enabled
-before KVM_CREATE_IRQCHIP can be used.
-
-
-4.25 KVM_IRQ_LINE
-
-Capability: KVM_CAP_IRQCHIP
-Architectures: x86, arm, arm64
-Type: vm ioctl
-Parameters: struct kvm_irq_level
-Returns: 0 on success, -1 on error
-
-Sets the level of a GSI input to the interrupt controller model in the kernel.
-On some architectures it is required that an interrupt controller model has
-been previously created with KVM_CREATE_IRQCHIP.  Note that edge-triggered
-interrupts require the level to be set to 1 and then back to 0.
-
-On real hardware, interrupt pins can be active-low or active-high.  This
-does not matter for the level field of struct kvm_irq_level: 1 always
-means active (asserted), 0 means inactive (deasserted).
-
-x86 allows the operating system to program the interrupt polarity
-(active-low/active-high) for level-triggered interrupts, and KVM used
-to consider the polarity.  However, due to bitrot in the handling of
-active-low interrupts, the above convention is now valid on x86 too.
-This is signaled by KVM_CAP_X86_IOAPIC_POLARITY_IGNORED.  Userspace
-should not present interrupts to the guest as active-low unless this
-capability is present (or unless it is not using the in-kernel irqchip,
-of course).
-
-
-ARM/arm64 can signal an interrupt either at the CPU level, or at the
-in-kernel irqchip (GIC), and for in-kernel irqchip can tell the GIC to
-use PPIs designated for specific cpus.  The irq field is interpreted
-like this:
-
-  bits:  |  31 ... 28  | 27 ... 24 | 23  ... 16 | 15 ... 0 |
-  field: | vcpu2_index | irq_type  | vcpu_index |  irq_id  |
-
-The irq_type field has the following values:
-- irq_type[0]: out-of-kernel GIC: irq_id 0 is IRQ, irq_id 1 is FIQ
-- irq_type[1]: in-kernel GIC: SPI, irq_id between 32 and 1019 (incl.)
-               (the vcpu_index field is ignored)
-- irq_type[2]: in-kernel GIC: PPI, irq_id between 16 and 31 (incl.)
-
-(The irq_id field thus corresponds nicely to the IRQ ID in the ARM GIC specs)
-
-In both cases, level is used to assert/deassert the line.
-
-When KVM_CAP_ARM_IRQ_LINE_LAYOUT_2 is supported, the target vcpu is
-identified as (256 * vcpu2_index + vcpu_index). Otherwise, vcpu2_index
-must be zero.
-
-Note that on arm/arm64, the KVM_CAP_IRQCHIP capability only conditions
-injection of interrupts for the in-kernel irqchip. KVM_IRQ_LINE can always
-be used for a userspace interrupt controller.
-
-struct kvm_irq_level {
-       union {
-               __u32 irq;     /* GSI */
-               __s32 status;  /* not used for KVM_IRQ_LEVEL */
-       };
-       __u32 level;           /* 0 or 1 */
-};
-
-
-4.26 KVM_GET_IRQCHIP
-
-Capability: KVM_CAP_IRQCHIP
-Architectures: x86
-Type: vm ioctl
-Parameters: struct kvm_irqchip (in/out)
-Returns: 0 on success, -1 on error
-
-Reads the state of a kernel interrupt controller created with
-KVM_CREATE_IRQCHIP into a buffer provided by the caller.
-
-struct kvm_irqchip {
-       __u32 chip_id;  /* 0 = PIC1, 1 = PIC2, 2 = IOAPIC */
-       __u32 pad;
-        union {
-               char dummy[512];  /* reserving space */
-               struct kvm_pic_state pic;
-               struct kvm_ioapic_state ioapic;
-       } chip;
-};
-
-
-4.27 KVM_SET_IRQCHIP
-
-Capability: KVM_CAP_IRQCHIP
-Architectures: x86
-Type: vm ioctl
-Parameters: struct kvm_irqchip (in)
-Returns: 0 on success, -1 on error
-
-Sets the state of a kernel interrupt controller created with
-KVM_CREATE_IRQCHIP from a buffer provided by the caller.
-
-struct kvm_irqchip {
-       __u32 chip_id;  /* 0 = PIC1, 1 = PIC2, 2 = IOAPIC */
-       __u32 pad;
-        union {
-               char dummy[512];  /* reserving space */
-               struct kvm_pic_state pic;
-               struct kvm_ioapic_state ioapic;
-       } chip;
-};
-
-
-4.28 KVM_XEN_HVM_CONFIG
-
-Capability: KVM_CAP_XEN_HVM
-Architectures: x86
-Type: vm ioctl
-Parameters: struct kvm_xen_hvm_config (in)
-Returns: 0 on success, -1 on error
-
-Sets the MSR that the Xen HVM guest uses to initialize its hypercall
-page, and provides the starting address and size of the hypercall
-blobs in userspace.  When the guest writes the MSR, kvm copies one
-page of a blob (32- or 64-bit, depending on the vcpu mode) to guest
-memory.
-
-struct kvm_xen_hvm_config {
-       __u32 flags;
-       __u32 msr;
-       __u64 blob_addr_32;
-       __u64 blob_addr_64;
-       __u8 blob_size_32;
-       __u8 blob_size_64;
-       __u8 pad2[30];
-};
-
-
-4.29 KVM_GET_CLOCK
-
-Capability: KVM_CAP_ADJUST_CLOCK
-Architectures: x86
-Type: vm ioctl
-Parameters: struct kvm_clock_data (out)
-Returns: 0 on success, -1 on error
-
-Gets the current timestamp of kvmclock as seen by the current guest. In
-conjunction with KVM_SET_CLOCK, it is used to ensure monotonicity on scenarios
-such as migration.
-
-When KVM_CAP_ADJUST_CLOCK is passed to KVM_CHECK_EXTENSION, it returns the
-set of bits that KVM can return in struct kvm_clock_data's flag member.
-
-The only flag defined now is KVM_CLOCK_TSC_STABLE.  If set, the returned
-value is the exact kvmclock value seen by all VCPUs at the instant
-when KVM_GET_CLOCK was called.  If clear, the returned value is simply
-CLOCK_MONOTONIC plus a constant offset; the offset can be modified
-with KVM_SET_CLOCK.  KVM will try to make all VCPUs follow this clock,
-but the exact value read by each VCPU could differ, because the host
-TSC is not stable.
-
-struct kvm_clock_data {
-       __u64 clock;  /* kvmclock current value */
-       __u32 flags;
-       __u32 pad[9];
-};
-
-
-4.30 KVM_SET_CLOCK
-
-Capability: KVM_CAP_ADJUST_CLOCK
-Architectures: x86
-Type: vm ioctl
-Parameters: struct kvm_clock_data (in)
-Returns: 0 on success, -1 on error
-
-Sets the current timestamp of kvmclock to the value specified in its parameter.
-In conjunction with KVM_GET_CLOCK, it is used to ensure monotonicity on scenarios
-such as migration.
-
-struct kvm_clock_data {
-       __u64 clock;  /* kvmclock current value */
-       __u32 flags;
-       __u32 pad[9];
-};
-
-
-4.31 KVM_GET_VCPU_EVENTS
-
-Capability: KVM_CAP_VCPU_EVENTS
-Extended by: KVM_CAP_INTR_SHADOW
-Architectures: x86, arm, arm64
-Type: vcpu ioctl
-Parameters: struct kvm_vcpu_event (out)
-Returns: 0 on success, -1 on error
-
-X86:
-
-Gets currently pending exceptions, interrupts, and NMIs as well as related
-states of the vcpu.
-
-struct kvm_vcpu_events {
-       struct {
-               __u8 injected;
-               __u8 nr;
-               __u8 has_error_code;
-               __u8 pending;
-               __u32 error_code;
-       } exception;
-       struct {
-               __u8 injected;
-               __u8 nr;
-               __u8 soft;
-               __u8 shadow;
-       } interrupt;
-       struct {
-               __u8 injected;
-               __u8 pending;
-               __u8 masked;
-               __u8 pad;
-       } nmi;
-       __u32 sipi_vector;
-       __u32 flags;
-       struct {
-               __u8 smm;
-               __u8 pending;
-               __u8 smm_inside_nmi;
-               __u8 latched_init;
-       } smi;
-       __u8 reserved[27];
-       __u8 exception_has_payload;
-       __u64 exception_payload;
-};
-
-The following bits are defined in the flags field:
-
-- KVM_VCPUEVENT_VALID_SHADOW may be set to signal that
-  interrupt.shadow contains a valid state.
-
-- KVM_VCPUEVENT_VALID_SMM may be set to signal that smi contains a
-  valid state.
-
-- KVM_VCPUEVENT_VALID_PAYLOAD may be set to signal that the
-  exception_has_payload, exception_payload, and exception.pending
-  fields contain a valid state. This bit will be set whenever
-  KVM_CAP_EXCEPTION_PAYLOAD is enabled.
-
-ARM/ARM64:
-
-If the guest accesses a device that is being emulated by the host kernel in
-such a way that a real device would generate a physical SError, KVM may make
-a virtual SError pending for that VCPU. This system error interrupt remains
-pending until the guest takes the exception by unmasking PSTATE.A.
-
-Running the VCPU may cause it to take a pending SError, or make an access that
-causes an SError to become pending. The event's description is only valid while
-the VPCU is not running.
-
-This API provides a way to read and write the pending 'event' state that is not
-visible to the guest. To save, restore or migrate a VCPU the struct representing
-the state can be read then written using this GET/SET API, along with the other
-guest-visible registers. It is not possible to 'cancel' an SError that has been
-made pending.
-
-A device being emulated in user-space may also wish to generate an SError. To do
-this the events structure can be populated by user-space. The current state
-should be read first, to ensure no existing SError is pending. If an existing
-SError is pending, the architecture's 'Multiple SError interrupts' rules should
-be followed. (2.5.3 of DDI0587.a "ARM Reliability, Availability, and
-Serviceability (RAS) Specification").
-
-SError exceptions always have an ESR value. Some CPUs have the ability to
-specify what the virtual SError's ESR value should be. These systems will
-advertise KVM_CAP_ARM_INJECT_SERROR_ESR. In this case exception.has_esr will
-always have a non-zero value when read, and the agent making an SError pending
-should specify the ISS field in the lower 24 bits of exception.serror_esr. If
-the system supports KVM_CAP_ARM_INJECT_SERROR_ESR, but user-space sets the events
-with exception.has_esr as zero, KVM will choose an ESR.
-
-Specifying exception.has_esr on a system that does not support it will return
--EINVAL. Setting anything other than the lower 24bits of exception.serror_esr
-will return -EINVAL.
-
-It is not possible to read back a pending external abort (injected via
-KVM_SET_VCPU_EVENTS or otherwise) because such an exception is always delivered
-directly to the virtual CPU).
-
-
-struct kvm_vcpu_events {
-       struct {
-               __u8 serror_pending;
-               __u8 serror_has_esr;
-               __u8 ext_dabt_pending;
-               /* Align it to 8 bytes */
-               __u8 pad[5];
-               __u64 serror_esr;
-       } exception;
-       __u32 reserved[12];
-};
-
-4.32 KVM_SET_VCPU_EVENTS
-
-Capability: KVM_CAP_VCPU_EVENTS
-Extended by: KVM_CAP_INTR_SHADOW
-Architectures: x86, arm, arm64
-Type: vcpu ioctl
-Parameters: struct kvm_vcpu_event (in)
-Returns: 0 on success, -1 on error
-
-X86:
-
-Set pending exceptions, interrupts, and NMIs as well as related states of the
-vcpu.
-
-See KVM_GET_VCPU_EVENTS for the data structure.
-
-Fields that may be modified asynchronously by running VCPUs can be excluded
-from the update. These fields are nmi.pending, sipi_vector, smi.smm,
-smi.pending. Keep the corresponding bits in the flags field cleared to
-suppress overwriting the current in-kernel state. The bits are:
-
-KVM_VCPUEVENT_VALID_NMI_PENDING - transfer nmi.pending to the kernel
-KVM_VCPUEVENT_VALID_SIPI_VECTOR - transfer sipi_vector
-KVM_VCPUEVENT_VALID_SMM         - transfer the smi sub-struct.
-
-If KVM_CAP_INTR_SHADOW is available, KVM_VCPUEVENT_VALID_SHADOW can be set in
-the flags field to signal that interrupt.shadow contains a valid state and
-shall be written into the VCPU.
-
-KVM_VCPUEVENT_VALID_SMM can only be set if KVM_CAP_X86_SMM is available.
-
-If KVM_CAP_EXCEPTION_PAYLOAD is enabled, KVM_VCPUEVENT_VALID_PAYLOAD
-can be set in the flags field to signal that the
-exception_has_payload, exception_payload, and exception.pending fields
-contain a valid state and shall be written into the VCPU.
-
-ARM/ARM64:
-
-User space may need to inject several types of events to the guest.
-
-Set the pending SError exception state for this VCPU. It is not possible to
-'cancel' an Serror that has been made pending.
-
-If the guest performed an access to I/O memory which could not be handled by
-userspace, for example because of missing instruction syndrome decode
-information or because there is no device mapped at the accessed IPA, then
-userspace can ask the kernel to inject an external abort using the address
-from the exiting fault on the VCPU. It is a programming error to set
-ext_dabt_pending after an exit which was not either KVM_EXIT_MMIO or
-KVM_EXIT_ARM_NISV. This feature is only available if the system supports
-KVM_CAP_ARM_INJECT_EXT_DABT. This is a helper which provides commonality in
-how userspace reports accesses for the above cases to guests, across different
-userspace implementations. Nevertheless, userspace can still emulate all Arm
-exceptions by manipulating individual registers using the KVM_SET_ONE_REG API.
-
-See KVM_GET_VCPU_EVENTS for the data structure.
-
-
-4.33 KVM_GET_DEBUGREGS
-
-Capability: KVM_CAP_DEBUGREGS
-Architectures: x86
-Type: vm ioctl
-Parameters: struct kvm_debugregs (out)
-Returns: 0 on success, -1 on error
-
-Reads debug registers from the vcpu.
-
-struct kvm_debugregs {
-       __u64 db[4];
-       __u64 dr6;
-       __u64 dr7;
-       __u64 flags;
-       __u64 reserved[9];
-};
-
-
-4.34 KVM_SET_DEBUGREGS
-
-Capability: KVM_CAP_DEBUGREGS
-Architectures: x86
-Type: vm ioctl
-Parameters: struct kvm_debugregs (in)
-Returns: 0 on success, -1 on error
-
-Writes debug registers into the vcpu.
-
-See KVM_GET_DEBUGREGS for the data structure. The flags field is unused
-yet and must be cleared on entry.
-
-
-4.35 KVM_SET_USER_MEMORY_REGION
-
-Capability: KVM_CAP_USER_MEMORY
-Architectures: all
-Type: vm ioctl
-Parameters: struct kvm_userspace_memory_region (in)
-Returns: 0 on success, -1 on error
-
-struct kvm_userspace_memory_region {
-       __u32 slot;
-       __u32 flags;
-       __u64 guest_phys_addr;
-       __u64 memory_size; /* bytes */
-       __u64 userspace_addr; /* start of the userspace allocated memory */
-};
-
-/* for kvm_memory_region::flags */
-#define KVM_MEM_LOG_DIRTY_PAGES        (1UL << 0)
-#define KVM_MEM_READONLY       (1UL << 1)
-
-This ioctl allows the user to create, modify or delete a guest physical
-memory slot.  Bits 0-15 of "slot" specify the slot id and this value
-should be less than the maximum number of user memory slots supported per
-VM.  The maximum allowed slots can be queried using KVM_CAP_NR_MEMSLOTS.
-Slots may not overlap in guest physical address space.
-
-If KVM_CAP_MULTI_ADDRESS_SPACE is available, bits 16-31 of "slot"
-specifies the address space which is being modified.  They must be
-less than the value that KVM_CHECK_EXTENSION returns for the
-KVM_CAP_MULTI_ADDRESS_SPACE capability.  Slots in separate address spaces
-are unrelated; the restriction on overlapping slots only applies within
-each address space.
-
-Deleting a slot is done by passing zero for memory_size.  When changing
-an existing slot, it may be moved in the guest physical memory space,
-or its flags may be modified, but it may not be resized.
-
-Memory for the region is taken starting at the address denoted by the
-field userspace_addr, which must point at user addressable memory for
-the entire memory slot size.  Any object may back this memory, including
-anonymous memory, ordinary files, and hugetlbfs.
-
-It is recommended that the lower 21 bits of guest_phys_addr and userspace_addr
-be identical.  This allows large pages in the guest to be backed by large
-pages in the host.
-
-The flags field supports two flags: KVM_MEM_LOG_DIRTY_PAGES and
-KVM_MEM_READONLY.  The former can be set to instruct KVM to keep track of
-writes to memory within the slot.  See KVM_GET_DIRTY_LOG ioctl to know how to
-use it.  The latter can be set, if KVM_CAP_READONLY_MEM capability allows it,
-to make a new slot read-only.  In this case, writes to this memory will be
-posted to userspace as KVM_EXIT_MMIO exits.
-
-When the KVM_CAP_SYNC_MMU capability is available, changes in the backing of
-the memory region are automatically reflected into the guest.  For example, an
-mmap() that affects the region will be made visible immediately.  Another
-example is madvise(MADV_DROP).
-
-It is recommended to use this API instead of the KVM_SET_MEMORY_REGION ioctl.
-The KVM_SET_MEMORY_REGION does not allow fine grained control over memory
-allocation and is deprecated.
-
-
-4.36 KVM_SET_TSS_ADDR
-
-Capability: KVM_CAP_SET_TSS_ADDR
-Architectures: x86
-Type: vm ioctl
-Parameters: unsigned long tss_address (in)
-Returns: 0 on success, -1 on error
-
-This ioctl defines the physical address of a three-page region in the guest
-physical address space.  The region must be within the first 4GB of the
-guest physical address space and must not conflict with any memory slot
-or any mmio address.  The guest may malfunction if it accesses this memory
-region.
-
-This ioctl is required on Intel-based hosts.  This is needed on Intel hardware
-because of a quirk in the virtualization implementation (see the internals
-documentation when it pops into existence).
-
-
-4.37 KVM_ENABLE_CAP
-
-Capability: KVM_CAP_ENABLE_CAP
-Architectures: mips, ppc, s390
-Type: vcpu ioctl
-Parameters: struct kvm_enable_cap (in)
-Returns: 0 on success; -1 on error
-
-Capability: KVM_CAP_ENABLE_CAP_VM
-Architectures: all
-Type: vcpu ioctl
-Parameters: struct kvm_enable_cap (in)
-Returns: 0 on success; -1 on error
-
-+Not all extensions are enabled by default. Using this ioctl the application
-can enable an extension, making it available to the guest.
-
-On systems that do not support this ioctl, it always fails. On systems that
-do support it, it only works for extensions that are supported for enablement.
-
-To check if a capability can be enabled, the KVM_CHECK_EXTENSION ioctl should
-be used.
-
-struct kvm_enable_cap {
-       /* in */
-       __u32 cap;
-
-The capability that is supposed to get enabled.
-
-       __u32 flags;
-
-A bitfield indicating future enhancements. Has to be 0 for now.
-
-       __u64 args[4];
-
-Arguments for enabling a feature. If a feature needs initial values to
-function properly, this is the place to put them.
-
-       __u8  pad[64];
-};
-
-The vcpu ioctl should be used for vcpu-specific capabilities, the vm ioctl
-for vm-wide capabilities.
-
-4.38 KVM_GET_MP_STATE
-
-Capability: KVM_CAP_MP_STATE
-Architectures: x86, s390, arm, arm64
-Type: vcpu ioctl
-Parameters: struct kvm_mp_state (out)
-Returns: 0 on success; -1 on error
-
-struct kvm_mp_state {
-       __u32 mp_state;
-};
-
-Returns the vcpu's current "multiprocessing state" (though also valid on
-uniprocessor guests).
-
-Possible values are:
-
- - KVM_MP_STATE_RUNNABLE:        the vcpu is currently running [x86,arm/arm64]
- - KVM_MP_STATE_UNINITIALIZED:   the vcpu is an application processor (AP)
-                                 which has not yet received an INIT signal [x86]
- - KVM_MP_STATE_INIT_RECEIVED:   the vcpu has received an INIT signal, and is
-                                 now ready for a SIPI [x86]
- - KVM_MP_STATE_HALTED:          the vcpu has executed a HLT instruction and
-                                 is waiting for an interrupt [x86]
- - KVM_MP_STATE_SIPI_RECEIVED:   the vcpu has just received a SIPI (vector
-                                 accessible via KVM_GET_VCPU_EVENTS) [x86]
- - KVM_MP_STATE_STOPPED:         the vcpu is stopped [s390,arm/arm64]
- - KVM_MP_STATE_CHECK_STOP:      the vcpu is in a special error state [s390]
- - KVM_MP_STATE_OPERATING:       the vcpu is operating (running or halted)
-                                 [s390]
- - KVM_MP_STATE_LOAD:            the vcpu is in a special load/startup state
-                                 [s390]
-
-On x86, this ioctl is only useful after KVM_CREATE_IRQCHIP. Without an
-in-kernel irqchip, the multiprocessing state must be maintained by userspace on
-these architectures.
-
-For arm/arm64:
-
-The only states that are valid are KVM_MP_STATE_STOPPED and
-KVM_MP_STATE_RUNNABLE which reflect if the vcpu is paused or not.
-
-4.39 KVM_SET_MP_STATE
-
-Capability: KVM_CAP_MP_STATE
-Architectures: x86, s390, arm, arm64
-Type: vcpu ioctl
-Parameters: struct kvm_mp_state (in)
-Returns: 0 on success; -1 on error
-
-Sets the vcpu's current "multiprocessing state"; see KVM_GET_MP_STATE for
-arguments.
-
-On x86, this ioctl is only useful after KVM_CREATE_IRQCHIP. Without an
-in-kernel irqchip, the multiprocessing state must be maintained by userspace on
-these architectures.
-
-For arm/arm64:
-
-The only states that are valid are KVM_MP_STATE_STOPPED and
-KVM_MP_STATE_RUNNABLE which reflect if the vcpu should be paused or not.
-
-4.40 KVM_SET_IDENTITY_MAP_ADDR
-
-Capability: KVM_CAP_SET_IDENTITY_MAP_ADDR
-Architectures: x86
-Type: vm ioctl
-Parameters: unsigned long identity (in)
-Returns: 0 on success, -1 on error
-
-This ioctl defines the physical address of a one-page region in the guest
-physical address space.  The region must be within the first 4GB of the
-guest physical address space and must not conflict with any memory slot
-or any mmio address.  The guest may malfunction if it accesses this memory
-region.
-
-Setting the address to 0 will result in resetting the address to its default
-(0xfffbc000).
-
-This ioctl is required on Intel-based hosts.  This is needed on Intel hardware
-because of a quirk in the virtualization implementation (see the internals
-documentation when it pops into existence).
-
-Fails if any VCPU has already been created.
-
-4.41 KVM_SET_BOOT_CPU_ID
-
-Capability: KVM_CAP_SET_BOOT_CPU_ID
-Architectures: x86
-Type: vm ioctl
-Parameters: unsigned long vcpu_id
-Returns: 0 on success, -1 on error
-
-Define which vcpu is the Bootstrap Processor (BSP).  Values are the same
-as the vcpu id in KVM_CREATE_VCPU.  If this ioctl is not called, the default
-is vcpu 0.
-
-
-4.42 KVM_GET_XSAVE
-
-Capability: KVM_CAP_XSAVE
-Architectures: x86
-Type: vcpu ioctl
-Parameters: struct kvm_xsave (out)
-Returns: 0 on success, -1 on error
-
-struct kvm_xsave {
-       __u32 region[1024];
-};
-
-This ioctl would copy current vcpu's xsave struct to the userspace.
-
-
-4.43 KVM_SET_XSAVE
-
-Capability: KVM_CAP_XSAVE
-Architectures: x86
-Type: vcpu ioctl
-Parameters: struct kvm_xsave (in)
-Returns: 0 on success, -1 on error
-
-struct kvm_xsave {
-       __u32 region[1024];
-};
-
-This ioctl would copy userspace's xsave struct to the kernel.
-
-
-4.44 KVM_GET_XCRS
-
-Capability: KVM_CAP_XCRS
-Architectures: x86
-Type: vcpu ioctl
-Parameters: struct kvm_xcrs (out)
-Returns: 0 on success, -1 on error
-
-struct kvm_xcr {
-       __u32 xcr;
-       __u32 reserved;
-       __u64 value;
-};
-
-struct kvm_xcrs {
-       __u32 nr_xcrs;
-       __u32 flags;
-       struct kvm_xcr xcrs[KVM_MAX_XCRS];
-       __u64 padding[16];
-};
-
-This ioctl would copy current vcpu's xcrs to the userspace.
-
-
-4.45 KVM_SET_XCRS
-
-Capability: KVM_CAP_XCRS
-Architectures: x86
-Type: vcpu ioctl
-Parameters: struct kvm_xcrs (in)
-Returns: 0 on success, -1 on error
-
-struct kvm_xcr {
-       __u32 xcr;
-       __u32 reserved;
-       __u64 value;
-};
-
-struct kvm_xcrs {
-       __u32 nr_xcrs;
-       __u32 flags;
-       struct kvm_xcr xcrs[KVM_MAX_XCRS];
-       __u64 padding[16];
-};
-
-This ioctl would set vcpu's xcr to the value userspace specified.
-
-
-4.46 KVM_GET_SUPPORTED_CPUID
-
-Capability: KVM_CAP_EXT_CPUID
-Architectures: x86
-Type: system ioctl
-Parameters: struct kvm_cpuid2 (in/out)
-Returns: 0 on success, -1 on error
-
-struct kvm_cpuid2 {
-       __u32 nent;
-       __u32 padding;
-       struct kvm_cpuid_entry2 entries[0];
-};
-
-#define KVM_CPUID_FLAG_SIGNIFCANT_INDEX                BIT(0)
-#define KVM_CPUID_FLAG_STATEFUL_FUNC           BIT(1)
-#define KVM_CPUID_FLAG_STATE_READ_NEXT         BIT(2)
-
-struct kvm_cpuid_entry2 {
-       __u32 function;
-       __u32 index;
-       __u32 flags;
-       __u32 eax;
-       __u32 ebx;
-       __u32 ecx;
-       __u32 edx;
-       __u32 padding[3];
-};
-
-This ioctl returns x86 cpuid features which are supported by both the
-hardware and kvm in its default configuration.  Userspace can use the
-information returned by this ioctl to construct cpuid information (for
-KVM_SET_CPUID2) that is consistent with hardware, kernel, and
-userspace capabilities, and with user requirements (for example, the
-user may wish to constrain cpuid to emulate older hardware, or for
-feature consistency across a cluster).
-
-Note that certain capabilities, such as KVM_CAP_X86_DISABLE_EXITS, may
-expose cpuid features (e.g. MONITOR) which are not supported by kvm in
-its default configuration. If userspace enables such capabilities, it
-is responsible for modifying the results of this ioctl appropriately.
-
-Userspace invokes KVM_GET_SUPPORTED_CPUID by passing a kvm_cpuid2 structure
-with the 'nent' field indicating the number of entries in the variable-size
-array 'entries'.  If the number of entries is too low to describe the cpu
-capabilities, an error (E2BIG) is returned.  If the number is too high,
-the 'nent' field is adjusted and an error (ENOMEM) is returned.  If the
-number is just right, the 'nent' field is adjusted to the number of valid
-entries in the 'entries' array, which is then filled.
-
-The entries returned are the host cpuid as returned by the cpuid instruction,
-with unknown or unsupported features masked out.  Some features (for example,
-x2apic), may not be present in the host cpu, but are exposed by kvm if it can
-emulate them efficiently. The fields in each entry are defined as follows:
-
-  function: the eax value used to obtain the entry
-  index: the ecx value used to obtain the entry (for entries that are
-         affected by ecx)
-  flags: an OR of zero or more of the following:
-        KVM_CPUID_FLAG_SIGNIFCANT_INDEX:
-           if the index field is valid
-        KVM_CPUID_FLAG_STATEFUL_FUNC:
-           if cpuid for this function returns different values for successive
-           invocations; there will be several entries with the same function,
-           all with this flag set
-        KVM_CPUID_FLAG_STATE_READ_NEXT:
-           for KVM_CPUID_FLAG_STATEFUL_FUNC entries, set if this entry is
-           the first entry to be read by a cpu
-   eax, ebx, ecx, edx: the values returned by the cpuid instruction for
-         this function/index combination
-
-The TSC deadline timer feature (CPUID leaf 1, ecx[24]) is always returned
-as false, since the feature depends on KVM_CREATE_IRQCHIP for local APIC
-support.  Instead it is reported via
-
-  ioctl(KVM_CHECK_EXTENSION, KVM_CAP_TSC_DEADLINE_TIMER)
-
-if that returns true and you use KVM_CREATE_IRQCHIP, or if you emulate the
-feature in userspace, then you can enable the feature for KVM_SET_CPUID2.
-
-
-4.47 KVM_PPC_GET_PVINFO
-
-Capability: KVM_CAP_PPC_GET_PVINFO
-Architectures: ppc
-Type: vm ioctl
-Parameters: struct kvm_ppc_pvinfo (out)
-Returns: 0 on success, !0 on error
-
-struct kvm_ppc_pvinfo {
-       __u32 flags;
-       __u32 hcall[4];
-       __u8  pad[108];
-};
-
-This ioctl fetches PV specific information that need to be passed to the guest
-using the device tree or other means from vm context.
-
-The hcall array defines 4 instructions that make up a hypercall.
-
-If any additional field gets added to this structure later on, a bit for that
-additional piece of information will be set in the flags bitmap.
-
-The flags bitmap is defined as:
-
-   /* the host supports the ePAPR idle hcall
-   #define KVM_PPC_PVINFO_FLAGS_EV_IDLE   (1<<0)
-
-4.52 KVM_SET_GSI_ROUTING
-
-Capability: KVM_CAP_IRQ_ROUTING
-Architectures: x86 s390 arm arm64
-Type: vm ioctl
-Parameters: struct kvm_irq_routing (in)
-Returns: 0 on success, -1 on error
-
-Sets the GSI routing table entries, overwriting any previously set entries.
-
-On arm/arm64, GSI routing has the following limitation:
-- GSI routing does not apply to KVM_IRQ_LINE but only to KVM_IRQFD.
-
-struct kvm_irq_routing {
-       __u32 nr;
-       __u32 flags;
-       struct kvm_irq_routing_entry entries[0];
-};
-
-No flags are specified so far, the corresponding field must be set to zero.
-
-struct kvm_irq_routing_entry {
-       __u32 gsi;
-       __u32 type;
-       __u32 flags;
-       __u32 pad;
-       union {
-               struct kvm_irq_routing_irqchip irqchip;
-               struct kvm_irq_routing_msi msi;
-               struct kvm_irq_routing_s390_adapter adapter;
-               struct kvm_irq_routing_hv_sint hv_sint;
-               __u32 pad[8];
-       } u;
-};
-
-/* gsi routing entry types */
-#define KVM_IRQ_ROUTING_IRQCHIP 1
-#define KVM_IRQ_ROUTING_MSI 2
-#define KVM_IRQ_ROUTING_S390_ADAPTER 3
-#define KVM_IRQ_ROUTING_HV_SINT 4
-
-flags:
-- KVM_MSI_VALID_DEVID: used along with KVM_IRQ_ROUTING_MSI routing entry
-  type, specifies that the devid field contains a valid value.  The per-VM
-  KVM_CAP_MSI_DEVID capability advertises the requirement to provide
-  the device ID.  If this capability is not available, userspace should
-  never set the KVM_MSI_VALID_DEVID flag as the ioctl might fail.
-- zero otherwise
-
-struct kvm_irq_routing_irqchip {
-       __u32 irqchip;
-       __u32 pin;
-};
-
-struct kvm_irq_routing_msi {
-       __u32 address_lo;
-       __u32 address_hi;
-       __u32 data;
-       union {
-               __u32 pad;
-               __u32 devid;
-       };
-};
-
-If KVM_MSI_VALID_DEVID is set, devid contains a unique device identifier
-for the device that wrote the MSI message.  For PCI, this is usually a
-BFD identifier in the lower 16 bits.
-
-On x86, address_hi is ignored unless the KVM_X2APIC_API_USE_32BIT_IDS
-feature of KVM_CAP_X2APIC_API capability is enabled.  If it is enabled,
-address_hi bits 31-8 provide bits 31-8 of the destination id.  Bits 7-0 of
-address_hi must be zero.
-
-struct kvm_irq_routing_s390_adapter {
-       __u64 ind_addr;
-       __u64 summary_addr;
-       __u64 ind_offset;
-       __u32 summary_offset;
-       __u32 adapter_id;
-};
-
-struct kvm_irq_routing_hv_sint {
-       __u32 vcpu;
-       __u32 sint;
-};
-
-
-4.55 KVM_SET_TSC_KHZ
-
-Capability: KVM_CAP_TSC_CONTROL
-Architectures: x86
-Type: vcpu ioctl
-Parameters: virtual tsc_khz
-Returns: 0 on success, -1 on error
-
-Specifies the tsc frequency for the virtual machine. The unit of the
-frequency is KHz.
-
-
-4.56 KVM_GET_TSC_KHZ
-
-Capability: KVM_CAP_GET_TSC_KHZ
-Architectures: x86
-Type: vcpu ioctl
-Parameters: none
-Returns: virtual tsc-khz on success, negative value on error
-
-Returns the tsc frequency of the guest. The unit of the return value is
-KHz. If the host has unstable tsc this ioctl returns -EIO instead as an
-error.
-
-
-4.57 KVM_GET_LAPIC
-
-Capability: KVM_CAP_IRQCHIP
-Architectures: x86
-Type: vcpu ioctl
-Parameters: struct kvm_lapic_state (out)
-Returns: 0 on success, -1 on error
-
-#define KVM_APIC_REG_SIZE 0x400
-struct kvm_lapic_state {
-       char regs[KVM_APIC_REG_SIZE];
-};
-
-Reads the Local APIC registers and copies them into the input argument.  The
-data format and layout are the same as documented in the architecture manual.
-
-If KVM_X2APIC_API_USE_32BIT_IDS feature of KVM_CAP_X2APIC_API is
-enabled, then the format of APIC_ID register depends on the APIC mode
-(reported by MSR_IA32_APICBASE) of its VCPU.  x2APIC stores APIC ID in
-the APIC_ID register (bytes 32-35).  xAPIC only allows an 8-bit APIC ID
-which is stored in bits 31-24 of the APIC register, or equivalently in
-byte 35 of struct kvm_lapic_state's regs field.  KVM_GET_LAPIC must then
-be called after MSR_IA32_APICBASE has been set with KVM_SET_MSR.
-
-If KVM_X2APIC_API_USE_32BIT_IDS feature is disabled, struct kvm_lapic_state
-always uses xAPIC format.
-
-
-4.58 KVM_SET_LAPIC
-
-Capability: KVM_CAP_IRQCHIP
-Architectures: x86
-Type: vcpu ioctl
-Parameters: struct kvm_lapic_state (in)
-Returns: 0 on success, -1 on error
-
-#define KVM_APIC_REG_SIZE 0x400
-struct kvm_lapic_state {
-       char regs[KVM_APIC_REG_SIZE];
-};
-
-Copies the input argument into the Local APIC registers.  The data format
-and layout are the same as documented in the architecture manual.
-
-The format of the APIC ID register (bytes 32-35 of struct kvm_lapic_state's
-regs field) depends on the state of the KVM_CAP_X2APIC_API capability.
-See the note in KVM_GET_LAPIC.
-
-
-4.59 KVM_IOEVENTFD
-
-Capability: KVM_CAP_IOEVENTFD
-Architectures: all
-Type: vm ioctl
-Parameters: struct kvm_ioeventfd (in)
-Returns: 0 on success, !0 on error
-
-This ioctl attaches or detaches an ioeventfd to a legal pio/mmio address
-within the guest.  A guest write in the registered address will signal the
-provided event instead of triggering an exit.
-
-struct kvm_ioeventfd {
-       __u64 datamatch;
-       __u64 addr;        /* legal pio/mmio address */
-       __u32 len;         /* 0, 1, 2, 4, or 8 bytes    */
-       __s32 fd;
-       __u32 flags;
-       __u8  pad[36];
-};
-
-For the special case of virtio-ccw devices on s390, the ioevent is matched
-to a subchannel/virtqueue tuple instead.
-
-The following flags are defined:
-
-#define KVM_IOEVENTFD_FLAG_DATAMATCH (1 << kvm_ioeventfd_flag_nr_datamatch)
-#define KVM_IOEVENTFD_FLAG_PIO       (1 << kvm_ioeventfd_flag_nr_pio)
-#define KVM_IOEVENTFD_FLAG_DEASSIGN  (1 << kvm_ioeventfd_flag_nr_deassign)
-#define KVM_IOEVENTFD_FLAG_VIRTIO_CCW_NOTIFY \
-       (1 << kvm_ioeventfd_flag_nr_virtio_ccw_notify)
-
-If datamatch flag is set, the event will be signaled only if the written value
-to the registered address is equal to datamatch in struct kvm_ioeventfd.
-
-For virtio-ccw devices, addr contains the subchannel id and datamatch the
-virtqueue index.
-
-With KVM_CAP_IOEVENTFD_ANY_LENGTH, a zero length ioeventfd is allowed, and
-the kernel will ignore the length of guest write and may get a faster vmexit.
-The speedup may only apply to specific architectures, but the ioeventfd will
-work anyway.
-
-4.60 KVM_DIRTY_TLB
-
-Capability: KVM_CAP_SW_TLB
-Architectures: ppc
-Type: vcpu ioctl
-Parameters: struct kvm_dirty_tlb (in)
-Returns: 0 on success, -1 on error
-
-struct kvm_dirty_tlb {
-       __u64 bitmap;
-       __u32 num_dirty;
-};
-
-This must be called whenever userspace has changed an entry in the shared
-TLB, prior to calling KVM_RUN on the associated vcpu.
-
-The "bitmap" field is the userspace address of an array.  This array
-consists of a number of bits, equal to the total number of TLB entries as
-determined by the last successful call to KVM_CONFIG_TLB, rounded up to the
-nearest multiple of 64.
-
-Each bit corresponds to one TLB entry, ordered the same as in the shared TLB
-array.
-
-The array is little-endian: the bit 0 is the least significant bit of the
-first byte, bit 8 is the least significant bit of the second byte, etc.
-This avoids any complications with differing word sizes.
-
-The "num_dirty" field is a performance hint for KVM to determine whether it
-should skip processing the bitmap and just invalidate everything.  It must
-be set to the number of set bits in the bitmap.
-
-
-4.62 KVM_CREATE_SPAPR_TCE
-
-Capability: KVM_CAP_SPAPR_TCE
-Architectures: powerpc
-Type: vm ioctl
-Parameters: struct kvm_create_spapr_tce (in)
-Returns: file descriptor for manipulating the created TCE table
-
-This creates a virtual TCE (translation control entry) table, which
-is an IOMMU for PAPR-style virtual I/O.  It is used to translate
-logical addresses used in virtual I/O into guest physical addresses,
-and provides a scatter/gather capability for PAPR virtual I/O.
-
-/* for KVM_CAP_SPAPR_TCE */
-struct kvm_create_spapr_tce {
-       __u64 liobn;
-       __u32 window_size;
-};
-
-The liobn field gives the logical IO bus number for which to create a
-TCE table.  The window_size field specifies the size of the DMA window
-which this TCE table will translate - the table will contain one 64
-bit TCE entry for every 4kiB of the DMA window.
-
-When the guest issues an H_PUT_TCE hcall on a liobn for which a TCE
-table has been created using this ioctl(), the kernel will handle it
-in real mode, updating the TCE table.  H_PUT_TCE calls for other
-liobns will cause a vm exit and must be handled by userspace.
-
-The return value is a file descriptor which can be passed to mmap(2)
-to map the created TCE table into userspace.  This lets userspace read
-the entries written by kernel-handled H_PUT_TCE calls, and also lets
-userspace update the TCE table directly which is useful in some
-circumstances.
-
-
-4.63 KVM_ALLOCATE_RMA
-
-Capability: KVM_CAP_PPC_RMA
-Architectures: powerpc
-Type: vm ioctl
-Parameters: struct kvm_allocate_rma (out)
-Returns: file descriptor for mapping the allocated RMA
-
-This allocates a Real Mode Area (RMA) from the pool allocated at boot
-time by the kernel.  An RMA is a physically-contiguous, aligned region
-of memory used on older POWER processors to provide the memory which
-will be accessed by real-mode (MMU off) accesses in a KVM guest.
-POWER processors support a set of sizes for the RMA that usually
-includes 64MB, 128MB, 256MB and some larger powers of two.
-
-/* for KVM_ALLOCATE_RMA */
-struct kvm_allocate_rma {
-       __u64 rma_size;
-};
-
-The return value is a file descriptor which can be passed to mmap(2)
-to map the allocated RMA into userspace.  The mapped area can then be
-passed to the KVM_SET_USER_MEMORY_REGION ioctl to establish it as the
-RMA for a virtual machine.  The size of the RMA in bytes (which is
-fixed at host kernel boot time) is returned in the rma_size field of
-the argument structure.
-
-The KVM_CAP_PPC_RMA capability is 1 or 2 if the KVM_ALLOCATE_RMA ioctl
-is supported; 2 if the processor requires all virtual machines to have
-an RMA, or 1 if the processor can use an RMA but doesn't require it,
-because it supports the Virtual RMA (VRMA) facility.
-
-
-4.64 KVM_NMI
-
-Capability: KVM_CAP_USER_NMI
-Architectures: x86
-Type: vcpu ioctl
-Parameters: none
-Returns: 0 on success, -1 on error
-
-Queues an NMI on the thread's vcpu.  Note this is well defined only
-when KVM_CREATE_IRQCHIP has not been called, since this is an interface
-between the virtual cpu core and virtual local APIC.  After KVM_CREATE_IRQCHIP
-has been called, this interface is completely emulated within the kernel.
-
-To use this to emulate the LINT1 input with KVM_CREATE_IRQCHIP, use the
-following algorithm:
-
-  - pause the vcpu
-  - read the local APIC's state (KVM_GET_LAPIC)
-  - check whether changing LINT1 will queue an NMI (see the LVT entry for LINT1)
-  - if so, issue KVM_NMI
-  - resume the vcpu
-
-Some guests configure the LINT1 NMI input to cause a panic, aiding in
-debugging.
-
-
-4.65 KVM_S390_UCAS_MAP
-
-Capability: KVM_CAP_S390_UCONTROL
-Architectures: s390
-Type: vcpu ioctl
-Parameters: struct kvm_s390_ucas_mapping (in)
-Returns: 0 in case of success
-
-The parameter is defined like this:
-       struct kvm_s390_ucas_mapping {
-               __u64 user_addr;
-               __u64 vcpu_addr;
-               __u64 length;
-       };
-
-This ioctl maps the memory at "user_addr" with the length "length" to
-the vcpu's address space starting at "vcpu_addr". All parameters need to
-be aligned by 1 megabyte.
-
-
-4.66 KVM_S390_UCAS_UNMAP
-
-Capability: KVM_CAP_S390_UCONTROL
-Architectures: s390
-Type: vcpu ioctl
-Parameters: struct kvm_s390_ucas_mapping (in)
-Returns: 0 in case of success
-
-The parameter is defined like this:
-       struct kvm_s390_ucas_mapping {
-               __u64 user_addr;
-               __u64 vcpu_addr;
-               __u64 length;
-       };
-
-This ioctl unmaps the memory in the vcpu's address space starting at
-"vcpu_addr" with the length "length". The field "user_addr" is ignored.
-All parameters need to be aligned by 1 megabyte.
-
-
-4.67 KVM_S390_VCPU_FAULT
-
-Capability: KVM_CAP_S390_UCONTROL
-Architectures: s390
-Type: vcpu ioctl
-Parameters: vcpu absolute address (in)
-Returns: 0 in case of success
-
-This call creates a page table entry on the virtual cpu's address space
-(for user controlled virtual machines) or the virtual machine's address
-space (for regular virtual machines). This only works for minor faults,
-thus it's recommended to access subject memory page via the user page
-table upfront. This is useful to handle validity intercepts for user
-controlled virtual machines to fault in the virtual cpu's lowcore pages
-prior to calling the KVM_RUN ioctl.
-
-
-4.68 KVM_SET_ONE_REG
-
-Capability: KVM_CAP_ONE_REG
-Architectures: all
-Type: vcpu ioctl
-Parameters: struct kvm_one_reg (in)
-Returns: 0 on success, negative value on failure
-Errors:
-  ENOENT:   no such register
-  EINVAL:   invalid register ID, or no such register
-  EPERM:    (arm64) register access not allowed before vcpu finalization
-(These error codes are indicative only: do not rely on a specific error
-code being returned in a specific situation.)
-
-struct kvm_one_reg {
-       __u64 id;
-       __u64 addr;
-};
-
-Using this ioctl, a single vcpu register can be set to a specific value
-defined by user space with the passed in struct kvm_one_reg, where id
-refers to the register identifier as described below and addr is a pointer
-to a variable with the respective size. There can be architecture agnostic
-and architecture specific registers. Each have their own range of operation
-and their own constants and width. To keep track of the implemented
-registers, find a list below:
-
-  Arch  |           Register            | Width (bits)
-        |                               |
-  PPC   | KVM_REG_PPC_HIOR              | 64
-  PPC   | KVM_REG_PPC_IAC1              | 64
-  PPC   | KVM_REG_PPC_IAC2              | 64
-  PPC   | KVM_REG_PPC_IAC3              | 64
-  PPC   | KVM_REG_PPC_IAC4              | 64
-  PPC   | KVM_REG_PPC_DAC1              | 64
-  PPC   | KVM_REG_PPC_DAC2              | 64
-  PPC   | KVM_REG_PPC_DABR              | 64
-  PPC   | KVM_REG_PPC_DSCR              | 64
-  PPC   | KVM_REG_PPC_PURR              | 64
-  PPC   | KVM_REG_PPC_SPURR             | 64
-  PPC   | KVM_REG_PPC_DAR               | 64
-  PPC   | KVM_REG_PPC_DSISR             | 32
-  PPC   | KVM_REG_PPC_AMR               | 64
-  PPC   | KVM_REG_PPC_UAMOR             | 64
-  PPC   | KVM_REG_PPC_MMCR0             | 64
-  PPC   | KVM_REG_PPC_MMCR1             | 64
-  PPC   | KVM_REG_PPC_MMCRA             | 64
-  PPC   | KVM_REG_PPC_MMCR2             | 64
-  PPC   | KVM_REG_PPC_MMCRS             | 64
-  PPC   | KVM_REG_PPC_SIAR              | 64
-  PPC   | KVM_REG_PPC_SDAR              | 64
-  PPC   | KVM_REG_PPC_SIER              | 64
-  PPC   | KVM_REG_PPC_PMC1              | 32
-  PPC   | KVM_REG_PPC_PMC2              | 32
-  PPC   | KVM_REG_PPC_PMC3              | 32
-  PPC   | KVM_REG_PPC_PMC4              | 32
-  PPC   | KVM_REG_PPC_PMC5              | 32
-  PPC   | KVM_REG_PPC_PMC6              | 32
-  PPC   | KVM_REG_PPC_PMC7              | 32
-  PPC   | KVM_REG_PPC_PMC8              | 32
-  PPC   | KVM_REG_PPC_FPR0              | 64
-          ...
-  PPC   | KVM_REG_PPC_FPR31             | 64
-  PPC   | KVM_REG_PPC_VR0               | 128
-          ...
-  PPC   | KVM_REG_PPC_VR31              | 128
-  PPC   | KVM_REG_PPC_VSR0              | 128
-          ...
-  PPC   | KVM_REG_PPC_VSR31             | 128
-  PPC   | KVM_REG_PPC_FPSCR             | 64
-  PPC   | KVM_REG_PPC_VSCR              | 32
-  PPC   | KVM_REG_PPC_VPA_ADDR          | 64
-  PPC   | KVM_REG_PPC_VPA_SLB           | 128
-  PPC   | KVM_REG_PPC_VPA_DTL           | 128
-  PPC   | KVM_REG_PPC_EPCR              | 32
-  PPC   | KVM_REG_PPC_EPR               | 32
-  PPC   | KVM_REG_PPC_TCR               | 32
-  PPC   | KVM_REG_PPC_TSR               | 32
-  PPC   | KVM_REG_PPC_OR_TSR            | 32
-  PPC   | KVM_REG_PPC_CLEAR_TSR         | 32
-  PPC   | KVM_REG_PPC_MAS0              | 32
-  PPC   | KVM_REG_PPC_MAS1              | 32
-  PPC   | KVM_REG_PPC_MAS2              | 64
-  PPC   | KVM_REG_PPC_MAS7_3            | 64
-  PPC   | KVM_REG_PPC_MAS4              | 32
-  PPC   | KVM_REG_PPC_MAS6              | 32
-  PPC   | KVM_REG_PPC_MMUCFG            | 32
-  PPC   | KVM_REG_PPC_TLB0CFG           | 32
-  PPC   | KVM_REG_PPC_TLB1CFG           | 32
-  PPC   | KVM_REG_PPC_TLB2CFG           | 32
-  PPC   | KVM_REG_PPC_TLB3CFG           | 32
-  PPC   | KVM_REG_PPC_TLB0PS            | 32
-  PPC   | KVM_REG_PPC_TLB1PS            | 32
-  PPC   | KVM_REG_PPC_TLB2PS            | 32
-  PPC   | KVM_REG_PPC_TLB3PS            | 32
-  PPC   | KVM_REG_PPC_EPTCFG            | 32
-  PPC   | KVM_REG_PPC_ICP_STATE         | 64
-  PPC   | KVM_REG_PPC_VP_STATE          | 128
-  PPC   | KVM_REG_PPC_TB_OFFSET         | 64
-  PPC   | KVM_REG_PPC_SPMC1             | 32
-  PPC   | KVM_REG_PPC_SPMC2             | 32
-  PPC   | KVM_REG_PPC_IAMR              | 64
-  PPC   | KVM_REG_PPC_TFHAR             | 64
-  PPC   | KVM_REG_PPC_TFIAR             | 64
-  PPC   | KVM_REG_PPC_TEXASR            | 64
-  PPC   | KVM_REG_PPC_FSCR              | 64
-  PPC   | KVM_REG_PPC_PSPB              | 32
-  PPC   | KVM_REG_PPC_EBBHR             | 64
-  PPC   | KVM_REG_PPC_EBBRR             | 64
-  PPC   | KVM_REG_PPC_BESCR             | 64
-  PPC   | KVM_REG_PPC_TAR               | 64
-  PPC   | KVM_REG_PPC_DPDES             | 64
-  PPC   | KVM_REG_PPC_DAWR              | 64
-  PPC   | KVM_REG_PPC_DAWRX             | 64
-  PPC   | KVM_REG_PPC_CIABR             | 64
-  PPC   | KVM_REG_PPC_IC                | 64
-  PPC   | KVM_REG_PPC_VTB               | 64
-  PPC   | KVM_REG_PPC_CSIGR             | 64
-  PPC   | KVM_REG_PPC_TACR              | 64
-  PPC   | KVM_REG_PPC_TCSCR             | 64
-  PPC   | KVM_REG_PPC_PID               | 64
-  PPC   | KVM_REG_PPC_ACOP              | 64
-  PPC   | KVM_REG_PPC_VRSAVE            | 32
-  PPC   | KVM_REG_PPC_LPCR              | 32
-  PPC   | KVM_REG_PPC_LPCR_64           | 64
-  PPC   | KVM_REG_PPC_PPR               | 64
-  PPC   | KVM_REG_PPC_ARCH_COMPAT       | 32
-  PPC   | KVM_REG_PPC_DABRX             | 32
-  PPC   | KVM_REG_PPC_WORT              | 64
-  PPC  | KVM_REG_PPC_SPRG9             | 64
-  PPC  | KVM_REG_PPC_DBSR              | 32
-  PPC   | KVM_REG_PPC_TIDR              | 64
-  PPC   | KVM_REG_PPC_PSSCR             | 64
-  PPC   | KVM_REG_PPC_DEC_EXPIRY        | 64
-  PPC   | KVM_REG_PPC_PTCR              | 64
-  PPC   | KVM_REG_PPC_TM_GPR0           | 64
-          ...
-  PPC   | KVM_REG_PPC_TM_GPR31          | 64
-  PPC   | KVM_REG_PPC_TM_VSR0           | 128
-          ...
-  PPC   | KVM_REG_PPC_TM_VSR63          | 128
-  PPC   | KVM_REG_PPC_TM_CR             | 64
-  PPC   | KVM_REG_PPC_TM_LR             | 64
-  PPC   | KVM_REG_PPC_TM_CTR            | 64
-  PPC   | KVM_REG_PPC_TM_FPSCR          | 64
-  PPC   | KVM_REG_PPC_TM_AMR            | 64
-  PPC   | KVM_REG_PPC_TM_PPR            | 64
-  PPC   | KVM_REG_PPC_TM_VRSAVE         | 64
-  PPC   | KVM_REG_PPC_TM_VSCR           | 32
-  PPC   | KVM_REG_PPC_TM_DSCR           | 64
-  PPC   | KVM_REG_PPC_TM_TAR            | 64
-  PPC   | KVM_REG_PPC_TM_XER            | 64
-        |                               |
-  MIPS  | KVM_REG_MIPS_R0               | 64
-          ...
-  MIPS  | KVM_REG_MIPS_R31              | 64
-  MIPS  | KVM_REG_MIPS_HI               | 64
-  MIPS  | KVM_REG_MIPS_LO               | 64
-  MIPS  | KVM_REG_MIPS_PC               | 64
-  MIPS  | KVM_REG_MIPS_CP0_INDEX        | 32
-  MIPS  | KVM_REG_MIPS_CP0_ENTRYLO0     | 64
-  MIPS  | KVM_REG_MIPS_CP0_ENTRYLO1     | 64
-  MIPS  | KVM_REG_MIPS_CP0_CONTEXT      | 64
-  MIPS  | KVM_REG_MIPS_CP0_CONTEXTCONFIG| 32
-  MIPS  | KVM_REG_MIPS_CP0_USERLOCAL    | 64
-  MIPS  | KVM_REG_MIPS_CP0_XCONTEXTCONFIG| 64
-  MIPS  | KVM_REG_MIPS_CP0_PAGEMASK     | 32
-  MIPS  | KVM_REG_MIPS_CP0_PAGEGRAIN    | 32
-  MIPS  | KVM_REG_MIPS_CP0_SEGCTL0      | 64
-  MIPS  | KVM_REG_MIPS_CP0_SEGCTL1      | 64
-  MIPS  | KVM_REG_MIPS_CP0_SEGCTL2      | 64
-  MIPS  | KVM_REG_MIPS_CP0_PWBASE       | 64
-  MIPS  | KVM_REG_MIPS_CP0_PWFIELD      | 64
-  MIPS  | KVM_REG_MIPS_CP0_PWSIZE       | 64
-  MIPS  | KVM_REG_MIPS_CP0_WIRED        | 32
-  MIPS  | KVM_REG_MIPS_CP0_PWCTL        | 32
-  MIPS  | KVM_REG_MIPS_CP0_HWRENA       | 32
-  MIPS  | KVM_REG_MIPS_CP0_BADVADDR     | 64
-  MIPS  | KVM_REG_MIPS_CP0_BADINSTR     | 32
-  MIPS  | KVM_REG_MIPS_CP0_BADINSTRP    | 32
-  MIPS  | KVM_REG_MIPS_CP0_COUNT        | 32
-  MIPS  | KVM_REG_MIPS_CP0_ENTRYHI      | 64
-  MIPS  | KVM_REG_MIPS_CP0_COMPARE      | 32
-  MIPS  | KVM_REG_MIPS_CP0_STATUS       | 32
-  MIPS  | KVM_REG_MIPS_CP0_INTCTL       | 32
-  MIPS  | KVM_REG_MIPS_CP0_CAUSE        | 32
-  MIPS  | KVM_REG_MIPS_CP0_EPC          | 64
-  MIPS  | KVM_REG_MIPS_CP0_PRID         | 32
-  MIPS  | KVM_REG_MIPS_CP0_EBASE        | 64
-  MIPS  | KVM_REG_MIPS_CP0_CONFIG       | 32
-  MIPS  | KVM_REG_MIPS_CP0_CONFIG1      | 32
-  MIPS  | KVM_REG_MIPS_CP0_CONFIG2      | 32
-  MIPS  | KVM_REG_MIPS_CP0_CONFIG3      | 32
-  MIPS  | KVM_REG_MIPS_CP0_CONFIG4      | 32
-  MIPS  | KVM_REG_MIPS_CP0_CONFIG5      | 32
-  MIPS  | KVM_REG_MIPS_CP0_CONFIG7      | 32
-  MIPS  | KVM_REG_MIPS_CP0_XCONTEXT     | 64
-  MIPS  | KVM_REG_MIPS_CP0_ERROREPC     | 64
-  MIPS  | KVM_REG_MIPS_CP0_KSCRATCH1    | 64
-  MIPS  | KVM_REG_MIPS_CP0_KSCRATCH2    | 64
-  MIPS  | KVM_REG_MIPS_CP0_KSCRATCH3    | 64
-  MIPS  | KVM_REG_MIPS_CP0_KSCRATCH4    | 64
-  MIPS  | KVM_REG_MIPS_CP0_KSCRATCH5    | 64
-  MIPS  | KVM_REG_MIPS_CP0_KSCRATCH6    | 64
-  MIPS  | KVM_REG_MIPS_CP0_MAAR(0..63)  | 64
-  MIPS  | KVM_REG_MIPS_COUNT_CTL        | 64
-  MIPS  | KVM_REG_MIPS_COUNT_RESUME     | 64
-  MIPS  | KVM_REG_MIPS_COUNT_HZ         | 64
-  MIPS  | KVM_REG_MIPS_FPR_32(0..31)    | 32
-  MIPS  | KVM_REG_MIPS_FPR_64(0..31)    | 64
-  MIPS  | KVM_REG_MIPS_VEC_128(0..31)   | 128
-  MIPS  | KVM_REG_MIPS_FCR_IR           | 32
-  MIPS  | KVM_REG_MIPS_FCR_CSR          | 32
-  MIPS  | KVM_REG_MIPS_MSA_IR           | 32
-  MIPS  | KVM_REG_MIPS_MSA_CSR          | 32
-
-ARM registers are mapped using the lower 32 bits.  The upper 16 of that
-is the register group type, or coprocessor number:
-
-ARM core registers have the following id bit patterns:
-  0x4020 0000 0010 <index into the kvm_regs struct:16>
-
-ARM 32-bit CP15 registers have the following id bit patterns:
-  0x4020 0000 000F <zero:1> <crn:4> <crm:4> <opc1:4> <opc2:3>
-
-ARM 64-bit CP15 registers have the following id bit patterns:
-  0x4030 0000 000F <zero:1> <zero:4> <crm:4> <opc1:4> <zero:3>
-
-ARM CCSIDR registers are demultiplexed by CSSELR value:
-  0x4020 0000 0011 00 <csselr:8>
-
-ARM 32-bit VFP control registers have the following id bit patterns:
-  0x4020 0000 0012 1 <regno:12>
-
-ARM 64-bit FP registers have the following id bit patterns:
-  0x4030 0000 0012 0 <regno:12>
-
-ARM firmware pseudo-registers have the following bit pattern:
-  0x4030 0000 0014 <regno:16>
-
-
-arm64 registers are mapped using the lower 32 bits. The upper 16 of
-that is the register group type, or coprocessor number:
-
-arm64 core/FP-SIMD registers have the following id bit patterns. Note
-that the size of the access is variable, as the kvm_regs structure
-contains elements ranging from 32 to 128 bits. The index is a 32bit
-value in the kvm_regs structure seen as a 32bit array.
-  0x60x0 0000 0010 <index into the kvm_regs struct:16>
-
-Specifically:
-    Encoding            Register  Bits  kvm_regs member
-----------------------------------------------------------------
-  0x6030 0000 0010 0000 X0          64  regs.regs[0]
-  0x6030 0000 0010 0002 X1          64  regs.regs[1]
-    ...
-  0x6030 0000 0010 003c X30         64  regs.regs[30]
-  0x6030 0000 0010 003e SP          64  regs.sp
-  0x6030 0000 0010 0040 PC          64  regs.pc
-  0x6030 0000 0010 0042 PSTATE      64  regs.pstate
-  0x6030 0000 0010 0044 SP_EL1      64  sp_el1
-  0x6030 0000 0010 0046 ELR_EL1     64  elr_el1
-  0x6030 0000 0010 0048 SPSR_EL1    64  spsr[KVM_SPSR_EL1] (alias SPSR_SVC)
-  0x6030 0000 0010 004a SPSR_ABT    64  spsr[KVM_SPSR_ABT]
-  0x6030 0000 0010 004c SPSR_UND    64  spsr[KVM_SPSR_UND]
-  0x6030 0000 0010 004e SPSR_IRQ    64  spsr[KVM_SPSR_IRQ]
-  0x6060 0000 0010 0050 SPSR_FIQ    64  spsr[KVM_SPSR_FIQ]
-  0x6040 0000 0010 0054 V0         128  fp_regs.vregs[0]    (*)
-  0x6040 0000 0010 0058 V1         128  fp_regs.vregs[1]    (*)
-    ...
-  0x6040 0000 0010 00d0 V31        128  fp_regs.vregs[31]   (*)
-  0x6020 0000 0010 00d4 FPSR        32  fp_regs.fpsr
-  0x6020 0000 0010 00d5 FPCR        32  fp_regs.fpcr
-
-(*) These encodings are not accepted for SVE-enabled vcpus.  See
-    KVM_ARM_VCPU_INIT.
-
-    The equivalent register content can be accessed via bits [127:0] of
-    the corresponding SVE Zn registers instead for vcpus that have SVE
-    enabled (see below).
-
-arm64 CCSIDR registers are demultiplexed by CSSELR value:
-  0x6020 0000 0011 00 <csselr:8>
-
-arm64 system registers have the following id bit patterns:
-  0x6030 0000 0013 <op0:2> <op1:3> <crn:4> <crm:4> <op2:3>
-
-WARNING:
-     Two system register IDs do not follow the specified pattern.  These
-     are KVM_REG_ARM_TIMER_CVAL and KVM_REG_ARM_TIMER_CNT, which map to
-     system registers CNTV_CVAL_EL0 and CNTVCT_EL0 respectively.  These
-     two had their values accidentally swapped, which means TIMER_CVAL is
-     derived from the register encoding for CNTVCT_EL0 and TIMER_CNT is
-     derived from the register encoding for CNTV_CVAL_EL0.  As this is
-     API, it must remain this way.
-
-arm64 firmware pseudo-registers have the following bit pattern:
-  0x6030 0000 0014 <regno:16>
-
-arm64 SVE registers have the following bit patterns:
-  0x6080 0000 0015 00 <n:5> <slice:5>   Zn bits[2048*slice + 2047 : 2048*slice]
-  0x6050 0000 0015 04 <n:4> <slice:5>   Pn bits[256*slice + 255 : 256*slice]
-  0x6050 0000 0015 060 <slice:5>        FFR bits[256*slice + 255 : 256*slice]
-  0x6060 0000 0015 ffff                 KVM_REG_ARM64_SVE_VLS pseudo-register
-
-Access to register IDs where 2048 * slice >= 128 * max_vq will fail with
-ENOENT.  max_vq is the vcpu's maximum supported vector length in 128-bit
-quadwords: see (**) below.
-
-These registers are only accessible on vcpus for which SVE is enabled.
-See KVM_ARM_VCPU_INIT for details.
-
-In addition, except for KVM_REG_ARM64_SVE_VLS, these registers are not
-accessible until the vcpu's SVE configuration has been finalized
-using KVM_ARM_VCPU_FINALIZE(KVM_ARM_VCPU_SVE).  See KVM_ARM_VCPU_INIT
-and KVM_ARM_VCPU_FINALIZE for more information about this procedure.
-
-KVM_REG_ARM64_SVE_VLS is a pseudo-register that allows the set of vector
-lengths supported by the vcpu to be discovered and configured by
-userspace.  When transferred to or from user memory via KVM_GET_ONE_REG
-or KVM_SET_ONE_REG, the value of this register is of type
-__u64[KVM_ARM64_SVE_VLS_WORDS], and encodes the set of vector lengths as
-follows:
-
-__u64 vector_lengths[KVM_ARM64_SVE_VLS_WORDS];
-
-if (vq >= SVE_VQ_MIN && vq <= SVE_VQ_MAX &&
-    ((vector_lengths[(vq - KVM_ARM64_SVE_VQ_MIN) / 64] >>
-               ((vq - KVM_ARM64_SVE_VQ_MIN) % 64)) & 1))
-       /* Vector length vq * 16 bytes supported */
-else
-       /* Vector length vq * 16 bytes not supported */
-
-(**) The maximum value vq for which the above condition is true is
-max_vq.  This is the maximum vector length available to the guest on
-this vcpu, and determines which register slices are visible through
-this ioctl interface.
-
-(See Documentation/arm64/sve.rst for an explanation of the "vq"
-nomenclature.)
-
-KVM_REG_ARM64_SVE_VLS is only accessible after KVM_ARM_VCPU_INIT.
-KVM_ARM_VCPU_INIT initialises it to the best set of vector lengths that
-the host supports.
-
-Userspace may subsequently modify it if desired until the vcpu's SVE
-configuration is finalized using KVM_ARM_VCPU_FINALIZE(KVM_ARM_VCPU_SVE).
-
-Apart from simply removing all vector lengths from the host set that
-exceed some value, support for arbitrarily chosen sets of vector lengths
-is hardware-dependent and may not be available.  Attempting to configure
-an invalid set of vector lengths via KVM_SET_ONE_REG will fail with
-EINVAL.
-
-After the vcpu's SVE configuration is finalized, further attempts to
-write this register will fail with EPERM.
-
-
-MIPS registers are mapped using the lower 32 bits.  The upper 16 of that is
-the register group type:
-
-MIPS core registers (see above) have the following id bit patterns:
-  0x7030 0000 0000 <reg:16>
-
-MIPS CP0 registers (see KVM_REG_MIPS_CP0_* above) have the following id bit
-patterns depending on whether they're 32-bit or 64-bit registers:
-  0x7020 0000 0001 00 <reg:5> <sel:3>   (32-bit)
-  0x7030 0000 0001 00 <reg:5> <sel:3>   (64-bit)
-
-Note: KVM_REG_MIPS_CP0_ENTRYLO0 and KVM_REG_MIPS_CP0_ENTRYLO1 are the MIPS64
-versions of the EntryLo registers regardless of the word size of the host
-hardware, host kernel, guest, and whether XPA is present in the guest, i.e.
-with the RI and XI bits (if they exist) in bits 63 and 62 respectively, and
-the PFNX field starting at bit 30.
-
-MIPS MAARs (see KVM_REG_MIPS_CP0_MAAR(*) above) have the following id bit
-patterns:
-  0x7030 0000 0001 01 <reg:8>
-
-MIPS KVM control registers (see above) have the following id bit patterns:
-  0x7030 0000 0002 <reg:16>
-
-MIPS FPU registers (see KVM_REG_MIPS_FPR_{32,64}() above) have the following
-id bit patterns depending on the size of the register being accessed. They are
-always accessed according to the current guest FPU mode (Status.FR and
-Config5.FRE), i.e. as the guest would see them, and they become unpredictable
-if the guest FPU mode is changed. MIPS SIMD Architecture (MSA) vector
-registers (see KVM_REG_MIPS_VEC_128() above) have similar patterns as they
-overlap the FPU registers:
-  0x7020 0000 0003 00 <0:3> <reg:5> (32-bit FPU registers)
-  0x7030 0000 0003 00 <0:3> <reg:5> (64-bit FPU registers)
-  0x7040 0000 0003 00 <0:3> <reg:5> (128-bit MSA vector registers)
-
-MIPS FPU control registers (see KVM_REG_MIPS_FCR_{IR,CSR} above) have the
-following id bit patterns:
-  0x7020 0000 0003 01 <0:3> <reg:5>
-
-MIPS MSA control registers (see KVM_REG_MIPS_MSA_{IR,CSR} above) have the
-following id bit patterns:
-  0x7020 0000 0003 02 <0:3> <reg:5>
-
-
-4.69 KVM_GET_ONE_REG
-
-Capability: KVM_CAP_ONE_REG
-Architectures: all
-Type: vcpu ioctl
-Parameters: struct kvm_one_reg (in and out)
-Returns: 0 on success, negative value on failure
-Errors include:
-  ENOENT:   no such register
-  EINVAL:   invalid register ID, or no such register
-  EPERM:    (arm64) register access not allowed before vcpu finalization
-(These error codes are indicative only: do not rely on a specific error
-code being returned in a specific situation.)
-
-This ioctl allows to receive the value of a single register implemented
-in a vcpu. The register to read is indicated by the "id" field of the
-kvm_one_reg struct passed in. On success, the register value can be found
-at the memory location pointed to by "addr".
-
-The list of registers accessible using this interface is identical to the
-list in 4.68.
-
-
-4.70 KVM_KVMCLOCK_CTRL
-
-Capability: KVM_CAP_KVMCLOCK_CTRL
-Architectures: Any that implement pvclocks (currently x86 only)
-Type: vcpu ioctl
-Parameters: None
-Returns: 0 on success, -1 on error
-
-This signals to the host kernel that the specified guest is being paused by
-userspace.  The host will set a flag in the pvclock structure that is checked
-from the soft lockup watchdog.  The flag is part of the pvclock structure that
-is shared between guest and host, specifically the second bit of the flags
-field of the pvclock_vcpu_time_info structure.  It will be set exclusively by
-the host and read/cleared exclusively by the guest.  The guest operation of
-checking and clearing the flag must an atomic operation so
-load-link/store-conditional, or equivalent must be used.  There are two cases
-where the guest will clear the flag: when the soft lockup watchdog timer resets
-itself or when a soft lockup is detected.  This ioctl can be called any time
-after pausing the vcpu, but before it is resumed.
-
-
-4.71 KVM_SIGNAL_MSI
-
-Capability: KVM_CAP_SIGNAL_MSI
-Architectures: x86 arm arm64
-Type: vm ioctl
-Parameters: struct kvm_msi (in)
-Returns: >0 on delivery, 0 if guest blocked the MSI, and -1 on error
-
-Directly inject a MSI message. Only valid with in-kernel irqchip that handles
-MSI messages.
-
-struct kvm_msi {
-       __u32 address_lo;
-       __u32 address_hi;
-       __u32 data;
-       __u32 flags;
-       __u32 devid;
-       __u8  pad[12];
-};
-
-flags: KVM_MSI_VALID_DEVID: devid contains a valid value.  The per-VM
-  KVM_CAP_MSI_DEVID capability advertises the requirement to provide
-  the device ID.  If this capability is not available, userspace
-  should never set the KVM_MSI_VALID_DEVID flag as the ioctl might fail.
-
-If KVM_MSI_VALID_DEVID is set, devid contains a unique device identifier
-for the device that wrote the MSI message.  For PCI, this is usually a
-BFD identifier in the lower 16 bits.
-
-On x86, address_hi is ignored unless the KVM_X2APIC_API_USE_32BIT_IDS
-feature of KVM_CAP_X2APIC_API capability is enabled.  If it is enabled,
-address_hi bits 31-8 provide bits 31-8 of the destination id.  Bits 7-0 of
-address_hi must be zero.
-
-
-4.71 KVM_CREATE_PIT2
-
-Capability: KVM_CAP_PIT2
-Architectures: x86
-Type: vm ioctl
-Parameters: struct kvm_pit_config (in)
-Returns: 0 on success, -1 on error
-
-Creates an in-kernel device model for the i8254 PIT. This call is only valid
-after enabling in-kernel irqchip support via KVM_CREATE_IRQCHIP. The following
-parameters have to be passed:
-
-struct kvm_pit_config {
-       __u32 flags;
-       __u32 pad[15];
-};
-
-Valid flags are:
-
-#define KVM_PIT_SPEAKER_DUMMY     1 /* emulate speaker port stub */
-
-PIT timer interrupts may use a per-VM kernel thread for injection. If it
-exists, this thread will have a name of the following pattern:
-
-kvm-pit/<owner-process-pid>
-
-When running a guest with elevated priorities, the scheduling parameters of
-this thread may have to be adjusted accordingly.
-
-This IOCTL replaces the obsolete KVM_CREATE_PIT.
-
-
-4.72 KVM_GET_PIT2
-
-Capability: KVM_CAP_PIT_STATE2
-Architectures: x86
-Type: vm ioctl
-Parameters: struct kvm_pit_state2 (out)
-Returns: 0 on success, -1 on error
-
-Retrieves the state of the in-kernel PIT model. Only valid after
-KVM_CREATE_PIT2. The state is returned in the following structure:
-
-struct kvm_pit_state2 {
-       struct kvm_pit_channel_state channels[3];
-       __u32 flags;
-       __u32 reserved[9];
-};
-
-Valid flags are:
-
-/* disable PIT in HPET legacy mode */
-#define KVM_PIT_FLAGS_HPET_LEGACY  0x00000001
-
-This IOCTL replaces the obsolete KVM_GET_PIT.
-
-
-4.73 KVM_SET_PIT2
-
-Capability: KVM_CAP_PIT_STATE2
-Architectures: x86
-Type: vm ioctl
-Parameters: struct kvm_pit_state2 (in)
-Returns: 0 on success, -1 on error
-
-Sets the state of the in-kernel PIT model. Only valid after KVM_CREATE_PIT2.
-See KVM_GET_PIT2 for details on struct kvm_pit_state2.
-
-This IOCTL replaces the obsolete KVM_SET_PIT.
-
-
-4.74 KVM_PPC_GET_SMMU_INFO
-
-Capability: KVM_CAP_PPC_GET_SMMU_INFO
-Architectures: powerpc
-Type: vm ioctl
-Parameters: None
-Returns: 0 on success, -1 on error
-
-This populates and returns a structure describing the features of
-the "Server" class MMU emulation supported by KVM.
-This can in turn be used by userspace to generate the appropriate
-device-tree properties for the guest operating system.
-
-The structure contains some global information, followed by an
-array of supported segment page sizes:
-
-      struct kvm_ppc_smmu_info {
-            __u64 flags;
-            __u32 slb_size;
-            __u32 pad;
-            struct kvm_ppc_one_seg_page_size sps[KVM_PPC_PAGE_SIZES_MAX_SZ];
-      };
-
-The supported flags are:
-
-    - KVM_PPC_PAGE_SIZES_REAL:
-        When that flag is set, guest page sizes must "fit" the backing
-        store page sizes. When not set, any page size in the list can
-        be used regardless of how they are backed by userspace.
-
-    - KVM_PPC_1T_SEGMENTS
-        The emulated MMU supports 1T segments in addition to the
-        standard 256M ones.
-
-    - KVM_PPC_NO_HASH
-       This flag indicates that HPT guests are not supported by KVM,
-       thus all guests must use radix MMU mode.
-
-The "slb_size" field indicates how many SLB entries are supported
-
-The "sps" array contains 8 entries indicating the supported base
-page sizes for a segment in increasing order. Each entry is defined
-as follow:
-
-   struct kvm_ppc_one_seg_page_size {
-       __u32 page_shift;       /* Base page shift of segment (or 0) */
-       __u32 slb_enc;          /* SLB encoding for BookS */
-       struct kvm_ppc_one_page_size enc[KVM_PPC_PAGE_SIZES_MAX_SZ];
-   };
-
-An entry with a "page_shift" of 0 is unused. Because the array is
-organized in increasing order, a lookup can stop when encoutering
-such an entry.
-
-The "slb_enc" field provides the encoding to use in the SLB for the
-page size. The bits are in positions such as the value can directly
-be OR'ed into the "vsid" argument of the slbmte instruction.
-
-The "enc" array is a list which for each of those segment base page
-size provides the list of supported actual page sizes (which can be
-only larger or equal to the base page size), along with the
-corresponding encoding in the hash PTE. Similarly, the array is
-8 entries sorted by increasing sizes and an entry with a "0" shift
-is an empty entry and a terminator:
-
-   struct kvm_ppc_one_page_size {
-       __u32 page_shift;       /* Page shift (or 0) */
-       __u32 pte_enc;          /* Encoding in the HPTE (>>12) */
-   };
-
-The "pte_enc" field provides a value that can OR'ed into the hash
-PTE's RPN field (ie, it needs to be shifted left by 12 to OR it
-into the hash PTE second double word).
-
-4.75 KVM_IRQFD
-
-Capability: KVM_CAP_IRQFD
-Architectures: x86 s390 arm arm64
-Type: vm ioctl
-Parameters: struct kvm_irqfd (in)
-Returns: 0 on success, -1 on error
-
-Allows setting an eventfd to directly trigger a guest interrupt.
-kvm_irqfd.fd specifies the file descriptor to use as the eventfd and
-kvm_irqfd.gsi specifies the irqchip pin toggled by this event.  When
-an event is triggered on the eventfd, an interrupt is injected into
-the guest using the specified gsi pin.  The irqfd is removed using
-the KVM_IRQFD_FLAG_DEASSIGN flag, specifying both kvm_irqfd.fd
-and kvm_irqfd.gsi.
-
-With KVM_CAP_IRQFD_RESAMPLE, KVM_IRQFD supports a de-assert and notify
-mechanism allowing emulation of level-triggered, irqfd-based
-interrupts.  When KVM_IRQFD_FLAG_RESAMPLE is set the user must pass an
-additional eventfd in the kvm_irqfd.resamplefd field.  When operating
-in resample mode, posting of an interrupt through kvm_irq.fd asserts
-the specified gsi in the irqchip.  When the irqchip is resampled, such
-as from an EOI, the gsi is de-asserted and the user is notified via
-kvm_irqfd.resamplefd.  It is the user's responsibility to re-queue
-the interrupt if the device making use of it still requires service.
-Note that closing the resamplefd is not sufficient to disable the
-irqfd.  The KVM_IRQFD_FLAG_RESAMPLE is only necessary on assignment
-and need not be specified with KVM_IRQFD_FLAG_DEASSIGN.
-
-On arm/arm64, gsi routing being supported, the following can happen:
-- in case no routing entry is associated to this gsi, injection fails
-- in case the gsi is associated to an irqchip routing entry,
-  irqchip.pin + 32 corresponds to the injected SPI ID.
-- in case the gsi is associated to an MSI routing entry, the MSI
-  message and device ID are translated into an LPI (support restricted
-  to GICv3 ITS in-kernel emulation).
-
-4.76 KVM_PPC_ALLOCATE_HTAB
-
-Capability: KVM_CAP_PPC_ALLOC_HTAB
-Architectures: powerpc
-Type: vm ioctl
-Parameters: Pointer to u32 containing hash table order (in/out)
-Returns: 0 on success, -1 on error
-
-This requests the host kernel to allocate an MMU hash table for a
-guest using the PAPR paravirtualization interface.  This only does
-anything if the kernel is configured to use the Book 3S HV style of
-virtualization.  Otherwise the capability doesn't exist and the ioctl
-returns an ENOTTY error.  The rest of this description assumes Book 3S
-HV.
-
-There must be no vcpus running when this ioctl is called; if there
-are, it will do nothing and return an EBUSY error.
-
-The parameter is a pointer to a 32-bit unsigned integer variable
-containing the order (log base 2) of the desired size of the hash
-table, which must be between 18 and 46.  On successful return from the
-ioctl, the value will not be changed by the kernel.
-
-If no hash table has been allocated when any vcpu is asked to run
-(with the KVM_RUN ioctl), the host kernel will allocate a
-default-sized hash table (16 MB).
-
-If this ioctl is called when a hash table has already been allocated,
-with a different order from the existing hash table, the existing hash
-table will be freed and a new one allocated.  If this is ioctl is
-called when a hash table has already been allocated of the same order
-as specified, the kernel will clear out the existing hash table (zero
-all HPTEs).  In either case, if the guest is using the virtualized
-real-mode area (VRMA) facility, the kernel will re-create the VMRA
-HPTEs on the next KVM_RUN of any vcpu.
-
-4.77 KVM_S390_INTERRUPT
-
-Capability: basic
-Architectures: s390
-Type: vm ioctl, vcpu ioctl
-Parameters: struct kvm_s390_interrupt (in)
-Returns: 0 on success, -1 on error
-
-Allows to inject an interrupt to the guest. Interrupts can be floating
-(vm ioctl) or per cpu (vcpu ioctl), depending on the interrupt type.
-
-Interrupt parameters are passed via kvm_s390_interrupt:
-
-struct kvm_s390_interrupt {
-       __u32 type;
-       __u32 parm;
-       __u64 parm64;
-};
-
-type can be one of the following:
-
-KVM_S390_SIGP_STOP (vcpu) - sigp stop; optional flags in parm
-KVM_S390_PROGRAM_INT (vcpu) - program check; code in parm
-KVM_S390_SIGP_SET_PREFIX (vcpu) - sigp set prefix; prefix address in parm
-KVM_S390_RESTART (vcpu) - restart
-KVM_S390_INT_CLOCK_COMP (vcpu) - clock comparator interrupt
-KVM_S390_INT_CPU_TIMER (vcpu) - CPU timer interrupt
-KVM_S390_INT_VIRTIO (vm) - virtio external interrupt; external interrupt
-                          parameters in parm and parm64
-KVM_S390_INT_SERVICE (vm) - sclp external interrupt; sclp parameter in parm
-KVM_S390_INT_EMERGENCY (vcpu) - sigp emergency; source cpu in parm
-KVM_S390_INT_EXTERNAL_CALL (vcpu) - sigp external call; source cpu in parm
-KVM_S390_INT_IO(ai,cssid,ssid,schid) (vm) - compound value to indicate an
-    I/O interrupt (ai - adapter interrupt; cssid,ssid,schid - subchannel);
-    I/O interruption parameters in parm (subchannel) and parm64 (intparm,
-    interruption subclass)
-KVM_S390_MCHK (vm, vcpu) - machine check interrupt; cr 14 bits in parm,
-                           machine check interrupt code in parm64 (note that
-                           machine checks needing further payload are not
-                           supported by this ioctl)
-
-This is an asynchronous vcpu ioctl and can be invoked from any thread.
-
-4.78 KVM_PPC_GET_HTAB_FD
-
-Capability: KVM_CAP_PPC_HTAB_FD
-Architectures: powerpc
-Type: vm ioctl
-Parameters: Pointer to struct kvm_get_htab_fd (in)
-Returns: file descriptor number (>= 0) on success, -1 on error
-
-This returns a file descriptor that can be used either to read out the
-entries in the guest's hashed page table (HPT), or to write entries to
-initialize the HPT.  The returned fd can only be written to if the
-KVM_GET_HTAB_WRITE bit is set in the flags field of the argument, and
-can only be read if that bit is clear.  The argument struct looks like
-this:
-
-/* For KVM_PPC_GET_HTAB_FD */
-struct kvm_get_htab_fd {
-       __u64   flags;
-       __u64   start_index;
-       __u64   reserved[2];
-};
-
-/* Values for kvm_get_htab_fd.flags */
-#define KVM_GET_HTAB_BOLTED_ONLY       ((__u64)0x1)
-#define KVM_GET_HTAB_WRITE             ((__u64)0x2)
-
-The `start_index' field gives the index in the HPT of the entry at
-which to start reading.  It is ignored when writing.
-
-Reads on the fd will initially supply information about all
-"interesting" HPT entries.  Interesting entries are those with the
-bolted bit set, if the KVM_GET_HTAB_BOLTED_ONLY bit is set, otherwise
-all entries.  When the end of the HPT is reached, the read() will
-return.  If read() is called again on the fd, it will start again from
-the beginning of the HPT, but will only return HPT entries that have
-changed since they were last read.
-
-Data read or written is structured as a header (8 bytes) followed by a
-series of valid HPT entries (16 bytes) each.  The header indicates how
-many valid HPT entries there are and how many invalid entries follow
-the valid entries.  The invalid entries are not represented explicitly
-in the stream.  The header format is:
-
-struct kvm_get_htab_header {
-       __u32   index;
-       __u16   n_valid;
-       __u16   n_invalid;
-};
-
-Writes to the fd create HPT entries starting at the index given in the
-header; first `n_valid' valid entries with contents from the data
-written, then `n_invalid' invalid entries, invalidating any previously
-valid entries found.
-
-4.79 KVM_CREATE_DEVICE
-
-Capability: KVM_CAP_DEVICE_CTRL
-Type: vm ioctl
-Parameters: struct kvm_create_device (in/out)
-Returns: 0 on success, -1 on error
-Errors:
-  ENODEV: The device type is unknown or unsupported
-  EEXIST: Device already created, and this type of device may not
-          be instantiated multiple times
-
-  Other error conditions may be defined by individual device types or
-  have their standard meanings.
-
-Creates an emulated device in the kernel.  The file descriptor returned
-in fd can be used with KVM_SET/GET/HAS_DEVICE_ATTR.
-
-If the KVM_CREATE_DEVICE_TEST flag is set, only test whether the
-device type is supported (not necessarily whether it can be created
-in the current vm).
-
-Individual devices should not define flags.  Attributes should be used
-for specifying any behavior that is not implied by the device type
-number.
-
-struct kvm_create_device {
-       __u32   type;   /* in: KVM_DEV_TYPE_xxx */
-       __u32   fd;     /* out: device handle */
-       __u32   flags;  /* in: KVM_CREATE_DEVICE_xxx */
-};
-
-4.80 KVM_SET_DEVICE_ATTR/KVM_GET_DEVICE_ATTR
-
-Capability: KVM_CAP_DEVICE_CTRL, KVM_CAP_VM_ATTRIBUTES for vm device,
-  KVM_CAP_VCPU_ATTRIBUTES for vcpu device
-Type: device ioctl, vm ioctl, vcpu ioctl
-Parameters: struct kvm_device_attr
-Returns: 0 on success, -1 on error
-Errors:
-  ENXIO:  The group or attribute is unknown/unsupported for this device
-          or hardware support is missing.
-  EPERM:  The attribute cannot (currently) be accessed this way
-          (e.g. read-only attribute, or attribute that only makes
-          sense when the device is in a different state)
-
-  Other error conditions may be defined by individual device types.
-
-Gets/sets a specified piece of device configuration and/or state.  The
-semantics are device-specific.  See individual device documentation in
-the "devices" directory.  As with ONE_REG, the size of the data
-transferred is defined by the particular attribute.
-
-struct kvm_device_attr {
-       __u32   flags;          /* no flags currently defined */
-       __u32   group;          /* device-defined */
-       __u64   attr;           /* group-defined */
-       __u64   addr;           /* userspace address of attr data */
-};
-
-4.81 KVM_HAS_DEVICE_ATTR
-
-Capability: KVM_CAP_DEVICE_CTRL, KVM_CAP_VM_ATTRIBUTES for vm device,
-  KVM_CAP_VCPU_ATTRIBUTES for vcpu device
-Type: device ioctl, vm ioctl, vcpu ioctl
-Parameters: struct kvm_device_attr
-Returns: 0 on success, -1 on error
-Errors:
-  ENXIO:  The group or attribute is unknown/unsupported for this device
-          or hardware support is missing.
-
-Tests whether a device supports a particular attribute.  A successful
-return indicates the attribute is implemented.  It does not necessarily
-indicate that the attribute can be read or written in the device's
-current state.  "addr" is ignored.
-
-4.82 KVM_ARM_VCPU_INIT
-
-Capability: basic
-Architectures: arm, arm64
-Type: vcpu ioctl
-Parameters: struct kvm_vcpu_init (in)
-Returns: 0 on success; -1 on error
-Errors:
-  EINVAL:    the target is unknown, or the combination of features is invalid.
-  ENOENT:    a features bit specified is unknown.
-
-This tells KVM what type of CPU to present to the guest, and what
-optional features it should have.  This will cause a reset of the cpu
-registers to their initial values.  If this is not called, KVM_RUN will
-return ENOEXEC for that vcpu.
-
-Note that because some registers reflect machine topology, all vcpus
-should be created before this ioctl is invoked.
-
-Userspace can call this function multiple times for a given vcpu, including
-after the vcpu has been run. This will reset the vcpu to its initial
-state. All calls to this function after the initial call must use the same
-target and same set of feature flags, otherwise EINVAL will be returned.
-
-Possible features:
-       - KVM_ARM_VCPU_POWER_OFF: Starts the CPU in a power-off state.
-         Depends on KVM_CAP_ARM_PSCI.  If not set, the CPU will be powered on
-         and execute guest code when KVM_RUN is called.
-       - KVM_ARM_VCPU_EL1_32BIT: Starts the CPU in a 32bit mode.
-         Depends on KVM_CAP_ARM_EL1_32BIT (arm64 only).
-       - KVM_ARM_VCPU_PSCI_0_2: Emulate PSCI v0.2 (or a future revision
-          backward compatible with v0.2) for the CPU.
-         Depends on KVM_CAP_ARM_PSCI_0_2.
-       - KVM_ARM_VCPU_PMU_V3: Emulate PMUv3 for the CPU.
-         Depends on KVM_CAP_ARM_PMU_V3.
-
-       - KVM_ARM_VCPU_PTRAUTH_ADDRESS: Enables Address Pointer authentication
-         for arm64 only.
-         Depends on KVM_CAP_ARM_PTRAUTH_ADDRESS.
-         If KVM_CAP_ARM_PTRAUTH_ADDRESS and KVM_CAP_ARM_PTRAUTH_GENERIC are
-         both present, then both KVM_ARM_VCPU_PTRAUTH_ADDRESS and
-         KVM_ARM_VCPU_PTRAUTH_GENERIC must be requested or neither must be
-         requested.
-
-       - KVM_ARM_VCPU_PTRAUTH_GENERIC: Enables Generic Pointer authentication
-         for arm64 only.
-         Depends on KVM_CAP_ARM_PTRAUTH_GENERIC.
-         If KVM_CAP_ARM_PTRAUTH_ADDRESS and KVM_CAP_ARM_PTRAUTH_GENERIC are
-         both present, then both KVM_ARM_VCPU_PTRAUTH_ADDRESS and
-         KVM_ARM_VCPU_PTRAUTH_GENERIC must be requested or neither must be
-         requested.
-
-       - KVM_ARM_VCPU_SVE: Enables SVE for the CPU (arm64 only).
-         Depends on KVM_CAP_ARM_SVE.
-         Requires KVM_ARM_VCPU_FINALIZE(KVM_ARM_VCPU_SVE):
-
-          * After KVM_ARM_VCPU_INIT:
-
-             - KVM_REG_ARM64_SVE_VLS may be read using KVM_GET_ONE_REG: the
-               initial value of this pseudo-register indicates the best set of
-               vector lengths possible for a vcpu on this host.
-
-          * Before KVM_ARM_VCPU_FINALIZE(KVM_ARM_VCPU_SVE):
-
-             - KVM_RUN and KVM_GET_REG_LIST are not available;
-
-             - KVM_GET_ONE_REG and KVM_SET_ONE_REG cannot be used to access
-               the scalable archietctural SVE registers
-               KVM_REG_ARM64_SVE_ZREG(), KVM_REG_ARM64_SVE_PREG() or
-               KVM_REG_ARM64_SVE_FFR;
-
-             - KVM_REG_ARM64_SVE_VLS may optionally be written using
-               KVM_SET_ONE_REG, to modify the set of vector lengths available
-               for the vcpu.
-
-          * After KVM_ARM_VCPU_FINALIZE(KVM_ARM_VCPU_SVE):
-
-             - the KVM_REG_ARM64_SVE_VLS pseudo-register is immutable, and can
-               no longer be written using KVM_SET_ONE_REG.
-
-4.83 KVM_ARM_PREFERRED_TARGET
-
-Capability: basic
-Architectures: arm, arm64
-Type: vm ioctl
-Parameters: struct struct kvm_vcpu_init (out)
-Returns: 0 on success; -1 on error
-Errors:
-  ENODEV:    no preferred target available for the host
-
-This queries KVM for preferred CPU target type which can be emulated
-by KVM on underlying host.
-
-The ioctl returns struct kvm_vcpu_init instance containing information
-about preferred CPU target type and recommended features for it.  The
-kvm_vcpu_init->features bitmap returned will have feature bits set if
-the preferred target recommends setting these features, but this is
-not mandatory.
-
-The information returned by this ioctl can be used to prepare an instance
-of struct kvm_vcpu_init for KVM_ARM_VCPU_INIT ioctl which will result in
-in VCPU matching underlying host.
-
-
-4.84 KVM_GET_REG_LIST
-
-Capability: basic
-Architectures: arm, arm64, mips
-Type: vcpu ioctl
-Parameters: struct kvm_reg_list (in/out)
-Returns: 0 on success; -1 on error
-Errors:
-  E2BIG:     the reg index list is too big to fit in the array specified by
-             the user (the number required will be written into n).
-
-struct kvm_reg_list {
-       __u64 n; /* number of registers in reg[] */
-       __u64 reg[0];
-};
-
-This ioctl returns the guest registers that are supported for the
-KVM_GET_ONE_REG/KVM_SET_ONE_REG calls.
-
-
-4.85 KVM_ARM_SET_DEVICE_ADDR (deprecated)
-
-Capability: KVM_CAP_ARM_SET_DEVICE_ADDR
-Architectures: arm, arm64
-Type: vm ioctl
-Parameters: struct kvm_arm_device_address (in)
-Returns: 0 on success, -1 on error
-Errors:
-  ENODEV: The device id is unknown
-  ENXIO:  Device not supported on current system
-  EEXIST: Address already set
-  E2BIG:  Address outside guest physical address space
-  EBUSY:  Address overlaps with other device range
-
-struct kvm_arm_device_addr {
-       __u64 id;
-       __u64 addr;
-};
-
-Specify a device address in the guest's physical address space where guests
-can access emulated or directly exposed devices, which the host kernel needs
-to know about. The id field is an architecture specific identifier for a
-specific device.
-
-ARM/arm64 divides the id field into two parts, a device id and an
-address type id specific to the individual device.
-
-  bits:  | 63        ...       32 | 31    ...    16 | 15    ...    0 |
-  field: |        0x00000000      |     device id   |  addr type id  |
-
-ARM/arm64 currently only require this when using the in-kernel GIC
-support for the hardware VGIC features, using KVM_ARM_DEVICE_VGIC_V2
-as the device id.  When setting the base address for the guest's
-mapping of the VGIC virtual CPU and distributor interface, the ioctl
-must be called after calling KVM_CREATE_IRQCHIP, but before calling
-KVM_RUN on any of the VCPUs.  Calling this ioctl twice for any of the
-base addresses will return -EEXIST.
-
-Note, this IOCTL is deprecated and the more flexible SET/GET_DEVICE_ATTR API
-should be used instead.
-
-
-4.86 KVM_PPC_RTAS_DEFINE_TOKEN
-
-Capability: KVM_CAP_PPC_RTAS
-Architectures: ppc
-Type: vm ioctl
-Parameters: struct kvm_rtas_token_args
-Returns: 0 on success, -1 on error
-
-Defines a token value for a RTAS (Run Time Abstraction Services)
-service in order to allow it to be handled in the kernel.  The
-argument struct gives the name of the service, which must be the name
-of a service that has a kernel-side implementation.  If the token
-value is non-zero, it will be associated with that service, and
-subsequent RTAS calls by the guest specifying that token will be
-handled by the kernel.  If the token value is 0, then any token
-associated with the service will be forgotten, and subsequent RTAS
-calls by the guest for that service will be passed to userspace to be
-handled.
-
-4.87 KVM_SET_GUEST_DEBUG
-
-Capability: KVM_CAP_SET_GUEST_DEBUG
-Architectures: x86, s390, ppc, arm64
-Type: vcpu ioctl
-Parameters: struct kvm_guest_debug (in)
-Returns: 0 on success; -1 on error
-
-struct kvm_guest_debug {
-       __u32 control;
-       __u32 pad;
-       struct kvm_guest_debug_arch arch;
-};
-
-Set up the processor specific debug registers and configure vcpu for
-handling guest debug events. There are two parts to the structure, the
-first a control bitfield indicates the type of debug events to handle
-when running. Common control bits are:
-
-  - KVM_GUESTDBG_ENABLE:        guest debugging is enabled
-  - KVM_GUESTDBG_SINGLESTEP:    the next run should single-step
-
-The top 16 bits of the control field are architecture specific control
-flags which can include the following:
-
-  - KVM_GUESTDBG_USE_SW_BP:     using software breakpoints [x86, arm64]
-  - KVM_GUESTDBG_USE_HW_BP:     using hardware breakpoints [x86, s390, arm64]
-  - KVM_GUESTDBG_INJECT_DB:     inject DB type exception [x86]
-  - KVM_GUESTDBG_INJECT_BP:     inject BP type exception [x86]
-  - KVM_GUESTDBG_EXIT_PENDING:  trigger an immediate guest exit [s390]
-
-For example KVM_GUESTDBG_USE_SW_BP indicates that software breakpoints
-are enabled in memory so we need to ensure breakpoint exceptions are
-correctly trapped and the KVM run loop exits at the breakpoint and not
-running off into the normal guest vector. For KVM_GUESTDBG_USE_HW_BP
-we need to ensure the guest vCPUs architecture specific registers are
-updated to the correct (supplied) values.
-
-The second part of the structure is architecture specific and
-typically contains a set of debug registers.
-
-For arm64 the number of debug registers is implementation defined and
-can be determined by querying the KVM_CAP_GUEST_DEBUG_HW_BPS and
-KVM_CAP_GUEST_DEBUG_HW_WPS capabilities which return a positive number
-indicating the number of supported registers.
-
-For ppc, the KVM_CAP_PPC_GUEST_DEBUG_SSTEP capability indicates whether
-the single-step debug event (KVM_GUESTDBG_SINGLESTEP) is supported.
-
-When debug events exit the main run loop with the reason
-KVM_EXIT_DEBUG with the kvm_debug_exit_arch part of the kvm_run
-structure containing architecture specific debug information.
-
-4.88 KVM_GET_EMULATED_CPUID
-
-Capability: KVM_CAP_EXT_EMUL_CPUID
-Architectures: x86
-Type: system ioctl
-Parameters: struct kvm_cpuid2 (in/out)
-Returns: 0 on success, -1 on error
-
-struct kvm_cpuid2 {
-       __u32 nent;
-       __u32 flags;
-       struct kvm_cpuid_entry2 entries[0];
-};
-
-The member 'flags' is used for passing flags from userspace.
-
-#define KVM_CPUID_FLAG_SIGNIFCANT_INDEX                BIT(0)
-#define KVM_CPUID_FLAG_STATEFUL_FUNC           BIT(1)
-#define KVM_CPUID_FLAG_STATE_READ_NEXT         BIT(2)
-
-struct kvm_cpuid_entry2 {
-       __u32 function;
-       __u32 index;
-       __u32 flags;
-       __u32 eax;
-       __u32 ebx;
-       __u32 ecx;
-       __u32 edx;
-       __u32 padding[3];
-};
-
-This ioctl returns x86 cpuid features which are emulated by
-kvm.Userspace can use the information returned by this ioctl to query
-which features are emulated by kvm instead of being present natively.
-
-Userspace invokes KVM_GET_EMULATED_CPUID by passing a kvm_cpuid2
-structure with the 'nent' field indicating the number of entries in
-the variable-size array 'entries'. If the number of entries is too low
-to describe the cpu capabilities, an error (E2BIG) is returned. If the
-number is too high, the 'nent' field is adjusted and an error (ENOMEM)
-is returned. If the number is just right, the 'nent' field is adjusted
-to the number of valid entries in the 'entries' array, which is then
-filled.
-
-The entries returned are the set CPUID bits of the respective features
-which kvm emulates, as returned by the CPUID instruction, with unknown
-or unsupported feature bits cleared.
-
-Features like x2apic, for example, may not be present in the host cpu
-but are exposed by kvm in KVM_GET_SUPPORTED_CPUID because they can be
-emulated efficiently and thus not included here.
-
-The fields in each entry are defined as follows:
-
-  function: the eax value used to obtain the entry
-  index: the ecx value used to obtain the entry (for entries that are
-         affected by ecx)
-  flags: an OR of zero or more of the following:
-        KVM_CPUID_FLAG_SIGNIFCANT_INDEX:
-           if the index field is valid
-        KVM_CPUID_FLAG_STATEFUL_FUNC:
-           if cpuid for this function returns different values for successive
-           invocations; there will be several entries with the same function,
-           all with this flag set
-        KVM_CPUID_FLAG_STATE_READ_NEXT:
-           for KVM_CPUID_FLAG_STATEFUL_FUNC entries, set if this entry is
-           the first entry to be read by a cpu
-   eax, ebx, ecx, edx: the values returned by the cpuid instruction for
-         this function/index combination
-
-4.89 KVM_S390_MEM_OP
-
-Capability: KVM_CAP_S390_MEM_OP
-Architectures: s390
-Type: vcpu ioctl
-Parameters: struct kvm_s390_mem_op (in)
-Returns: = 0 on success,
-         < 0 on generic error (e.g. -EFAULT or -ENOMEM),
-         > 0 if an exception occurred while walking the page tables
-
-Read or write data from/to the logical (virtual) memory of a VCPU.
-
-Parameters are specified via the following structure:
-
-struct kvm_s390_mem_op {
-       __u64 gaddr;            /* the guest address */
-       __u64 flags;            /* flags */
-       __u32 size;             /* amount of bytes */
-       __u32 op;               /* type of operation */
-       __u64 buf;              /* buffer in userspace */
-       __u8 ar;                /* the access register number */
-       __u8 reserved[31];      /* should be set to 0 */
-};
-
-The type of operation is specified in the "op" field. It is either
-KVM_S390_MEMOP_LOGICAL_READ for reading from logical memory space or
-KVM_S390_MEMOP_LOGICAL_WRITE for writing to logical memory space. The
-KVM_S390_MEMOP_F_CHECK_ONLY flag can be set in the "flags" field to check
-whether the corresponding memory access would create an access exception
-(without touching the data in the memory at the destination). In case an
-access exception occurred while walking the MMU tables of the guest, the
-ioctl returns a positive error number to indicate the type of exception.
-This exception is also raised directly at the corresponding VCPU if the
-flag KVM_S390_MEMOP_F_INJECT_EXCEPTION is set in the "flags" field.
-
-The start address of the memory region has to be specified in the "gaddr"
-field, and the length of the region in the "size" field (which must not
-be 0). The maximum value for "size" can be obtained by checking the
-KVM_CAP_S390_MEM_OP capability. "buf" is the buffer supplied by the
-userspace application where the read data should be written to for
-KVM_S390_MEMOP_LOGICAL_READ, or where the data that should be written is
-stored for a KVM_S390_MEMOP_LOGICAL_WRITE. When KVM_S390_MEMOP_F_CHECK_ONLY
-is specified, "buf" is unused and can be NULL. "ar" designates the access
-register number to be used; the valid range is 0..15.
-
-The "reserved" field is meant for future extensions. It is not used by
-KVM with the currently defined set of flags.
-
-4.90 KVM_S390_GET_SKEYS
-
-Capability: KVM_CAP_S390_SKEYS
-Architectures: s390
-Type: vm ioctl
-Parameters: struct kvm_s390_skeys
-Returns: 0 on success, KVM_S390_GET_KEYS_NONE if guest is not using storage
-         keys, negative value on error
-
-This ioctl is used to get guest storage key values on the s390
-architecture. The ioctl takes parameters via the kvm_s390_skeys struct.
-
-struct kvm_s390_skeys {
-       __u64 start_gfn;
-       __u64 count;
-       __u64 skeydata_addr;
-       __u32 flags;
-       __u32 reserved[9];
-};
-
-The start_gfn field is the number of the first guest frame whose storage keys
-you want to get.
-
-The count field is the number of consecutive frames (starting from start_gfn)
-whose storage keys to get. The count field must be at least 1 and the maximum
-allowed value is defined as KVM_S390_SKEYS_ALLOC_MAX. Values outside this range
-will cause the ioctl to return -EINVAL.
-
-The skeydata_addr field is the address to a buffer large enough to hold count
-bytes. This buffer will be filled with storage key data by the ioctl.
-
-4.91 KVM_S390_SET_SKEYS
-
-Capability: KVM_CAP_S390_SKEYS
-Architectures: s390
-Type: vm ioctl
-Parameters: struct kvm_s390_skeys
-Returns: 0 on success, negative value on error
-
-This ioctl is used to set guest storage key values on the s390
-architecture. The ioctl takes parameters via the kvm_s390_skeys struct.
-See section on KVM_S390_GET_SKEYS for struct definition.
-
-The start_gfn field is the number of the first guest frame whose storage keys
-you want to set.
-
-The count field is the number of consecutive frames (starting from start_gfn)
-whose storage keys to get. The count field must be at least 1 and the maximum
-allowed value is defined as KVM_S390_SKEYS_ALLOC_MAX. Values outside this range
-will cause the ioctl to return -EINVAL.
-
-The skeydata_addr field is the address to a buffer containing count bytes of
-storage keys. Each byte in the buffer will be set as the storage key for a
-single frame starting at start_gfn for count frames.
-
-Note: If any architecturally invalid key value is found in the given data then
-the ioctl will return -EINVAL.
-
-4.92 KVM_S390_IRQ
-
-Capability: KVM_CAP_S390_INJECT_IRQ
-Architectures: s390
-Type: vcpu ioctl
-Parameters: struct kvm_s390_irq (in)
-Returns: 0 on success, -1 on error
-Errors:
-  EINVAL: interrupt type is invalid
-          type is KVM_S390_SIGP_STOP and flag parameter is invalid value
-          type is KVM_S390_INT_EXTERNAL_CALL and code is bigger
-            than the maximum of VCPUs
-  EBUSY:  type is KVM_S390_SIGP_SET_PREFIX and vcpu is not stopped
-          type is KVM_S390_SIGP_STOP and a stop irq is already pending
-          type is KVM_S390_INT_EXTERNAL_CALL and an external call interrupt
-            is already pending
-
-Allows to inject an interrupt to the guest.
-
-Using struct kvm_s390_irq as a parameter allows
-to inject additional payload which is not
-possible via KVM_S390_INTERRUPT.
-
-Interrupt parameters are passed via kvm_s390_irq:
-
-struct kvm_s390_irq {
-       __u64 type;
-       union {
-               struct kvm_s390_io_info io;
-               struct kvm_s390_ext_info ext;
-               struct kvm_s390_pgm_info pgm;
-               struct kvm_s390_emerg_info emerg;
-               struct kvm_s390_extcall_info extcall;
-               struct kvm_s390_prefix_info prefix;
-               struct kvm_s390_stop_info stop;
-               struct kvm_s390_mchk_info mchk;
-               char reserved[64];
-       } u;
-};
-
-type can be one of the following:
-
-KVM_S390_SIGP_STOP - sigp stop; parameter in .stop
-KVM_S390_PROGRAM_INT - program check; parameters in .pgm
-KVM_S390_SIGP_SET_PREFIX - sigp set prefix; parameters in .prefix
-KVM_S390_RESTART - restart; no parameters
-KVM_S390_INT_CLOCK_COMP - clock comparator interrupt; no parameters
-KVM_S390_INT_CPU_TIMER - CPU timer interrupt; no parameters
-KVM_S390_INT_EMERGENCY - sigp emergency; parameters in .emerg
-KVM_S390_INT_EXTERNAL_CALL - sigp external call; parameters in .extcall
-KVM_S390_MCHK - machine check interrupt; parameters in .mchk
-
-This is an asynchronous vcpu ioctl and can be invoked from any thread.
-
-4.94 KVM_S390_GET_IRQ_STATE
-
-Capability: KVM_CAP_S390_IRQ_STATE
-Architectures: s390
-Type: vcpu ioctl
-Parameters: struct kvm_s390_irq_state (out)
-Returns: >= number of bytes copied into buffer,
-         -EINVAL if buffer size is 0,
-         -ENOBUFS if buffer size is too small to fit all pending interrupts,
-         -EFAULT if the buffer address was invalid
-
-This ioctl allows userspace to retrieve the complete state of all currently
-pending interrupts in a single buffer. Use cases include migration
-and introspection. The parameter structure contains the address of a
-userspace buffer and its length:
-
-struct kvm_s390_irq_state {
-       __u64 buf;
-       __u32 flags;        /* will stay unused for compatibility reasons */
-       __u32 len;
-       __u32 reserved[4];  /* will stay unused for compatibility reasons */
-};
-
-Userspace passes in the above struct and for each pending interrupt a
-struct kvm_s390_irq is copied to the provided buffer.
-
-The structure contains a flags and a reserved field for future extensions. As
-the kernel never checked for flags == 0 and QEMU never pre-zeroed flags and
-reserved, these fields can not be used in the future without breaking
-compatibility.
-
-If -ENOBUFS is returned the buffer provided was too small and userspace
-may retry with a bigger buffer.
-
-4.95 KVM_S390_SET_IRQ_STATE
-
-Capability: KVM_CAP_S390_IRQ_STATE
-Architectures: s390
-Type: vcpu ioctl
-Parameters: struct kvm_s390_irq_state (in)
-Returns: 0 on success,
-         -EFAULT if the buffer address was invalid,
-         -EINVAL for an invalid buffer length (see below),
-         -EBUSY if there were already interrupts pending,
-         errors occurring when actually injecting the
-          interrupt. See KVM_S390_IRQ.
-
-This ioctl allows userspace to set the complete state of all cpu-local
-interrupts currently pending for the vcpu. It is intended for restoring
-interrupt state after a migration. The input parameter is a userspace buffer
-containing a struct kvm_s390_irq_state:
-
-struct kvm_s390_irq_state {
-       __u64 buf;
-       __u32 flags;        /* will stay unused for compatibility reasons */
-       __u32 len;
-       __u32 reserved[4];  /* will stay unused for compatibility reasons */
-};
-
-The restrictions for flags and reserved apply as well.
-(see KVM_S390_GET_IRQ_STATE)
-
-The userspace memory referenced by buf contains a struct kvm_s390_irq
-for each interrupt to be injected into the guest.
-If one of the interrupts could not be injected for some reason the
-ioctl aborts.
-
-len must be a multiple of sizeof(struct kvm_s390_irq). It must be > 0
-and it must not exceed (max_vcpus + 32) * sizeof(struct kvm_s390_irq),
-which is the maximum number of possibly pending cpu-local interrupts.
-
-4.96 KVM_SMI
-
-Capability: KVM_CAP_X86_SMM
-Architectures: x86
-Type: vcpu ioctl
-Parameters: none
-Returns: 0 on success, -1 on error
-
-Queues an SMI on the thread's vcpu.
-
-4.97 KVM_CAP_PPC_MULTITCE
-
-Capability: KVM_CAP_PPC_MULTITCE
-Architectures: ppc
-Type: vm
-
-This capability means the kernel is capable of handling hypercalls
-H_PUT_TCE_INDIRECT and H_STUFF_TCE without passing those into the user
-space. This significantly accelerates DMA operations for PPC KVM guests.
-User space should expect that its handlers for these hypercalls
-are not going to be called if user space previously registered LIOBN
-in KVM (via KVM_CREATE_SPAPR_TCE or similar calls).
-
-In order to enable H_PUT_TCE_INDIRECT and H_STUFF_TCE use in the guest,
-user space might have to advertise it for the guest. For example,
-IBM pSeries (sPAPR) guest starts using them if "hcall-multi-tce" is
-present in the "ibm,hypertas-functions" device-tree property.
-
-The hypercalls mentioned above may or may not be processed successfully
-in the kernel based fast path. If they can not be handled by the kernel,
-they will get passed on to user space. So user space still has to have
-an implementation for these despite the in kernel acceleration.
-
-This capability is always enabled.
-
-4.98 KVM_CREATE_SPAPR_TCE_64
-
-Capability: KVM_CAP_SPAPR_TCE_64
-Architectures: powerpc
-Type: vm ioctl
-Parameters: struct kvm_create_spapr_tce_64 (in)
-Returns: file descriptor for manipulating the created TCE table
-
-This is an extension for KVM_CAP_SPAPR_TCE which only supports 32bit
-windows, described in 4.62 KVM_CREATE_SPAPR_TCE
-
-This capability uses extended struct in ioctl interface:
-
-/* for KVM_CAP_SPAPR_TCE_64 */
-struct kvm_create_spapr_tce_64 {
-       __u64 liobn;
-       __u32 page_shift;
-       __u32 flags;
-       __u64 offset;   /* in pages */
-       __u64 size;     /* in pages */
-};
-
-The aim of extension is to support an additional bigger DMA window with
-a variable page size.
-KVM_CREATE_SPAPR_TCE_64 receives a 64bit window size, an IOMMU page shift and
-a bus offset of the corresponding DMA window, @size and @offset are numbers
-of IOMMU pages.
-
-@flags are not used at the moment.
-
-The rest of functionality is identical to KVM_CREATE_SPAPR_TCE.
-
-4.99 KVM_REINJECT_CONTROL
-
-Capability: KVM_CAP_REINJECT_CONTROL
-Architectures: x86
-Type: vm ioctl
-Parameters: struct kvm_reinject_control (in)
-Returns: 0 on success,
-         -EFAULT if struct kvm_reinject_control cannot be read,
-         -ENXIO if KVM_CREATE_PIT or KVM_CREATE_PIT2 didn't succeed earlier.
-
-i8254 (PIT) has two modes, reinject and !reinject.  The default is reinject,
-where KVM queues elapsed i8254 ticks and monitors completion of interrupt from
-vector(s) that i8254 injects.  Reinject mode dequeues a tick and injects its
-interrupt whenever there isn't a pending interrupt from i8254.
-!reinject mode injects an interrupt as soon as a tick arrives.
-
-struct kvm_reinject_control {
-       __u8 pit_reinject;
-       __u8 reserved[31];
-};
-
-pit_reinject = 0 (!reinject mode) is recommended, unless running an old
-operating system that uses the PIT for timing (e.g. Linux 2.4.x).
-
-4.100 KVM_PPC_CONFIGURE_V3_MMU
-
-Capability: KVM_CAP_PPC_RADIX_MMU or KVM_CAP_PPC_HASH_MMU_V3
-Architectures: ppc
-Type: vm ioctl
-Parameters: struct kvm_ppc_mmuv3_cfg (in)
-Returns: 0 on success,
-         -EFAULT if struct kvm_ppc_mmuv3_cfg cannot be read,
-         -EINVAL if the configuration is invalid
-
-This ioctl controls whether the guest will use radix or HPT (hashed
-page table) translation, and sets the pointer to the process table for
-the guest.
-
-struct kvm_ppc_mmuv3_cfg {
-       __u64   flags;
-       __u64   process_table;
-};
-
-There are two bits that can be set in flags; KVM_PPC_MMUV3_RADIX and
-KVM_PPC_MMUV3_GTSE.  KVM_PPC_MMUV3_RADIX, if set, configures the guest
-to use radix tree translation, and if clear, to use HPT translation.
-KVM_PPC_MMUV3_GTSE, if set and if KVM permits it, configures the guest
-to be able to use the global TLB and SLB invalidation instructions;
-if clear, the guest may not use these instructions.
-
-The process_table field specifies the address and size of the guest
-process table, which is in the guest's space.  This field is formatted
-as the second doubleword of the partition table entry, as defined in
-the Power ISA V3.00, Book III section 5.7.6.1.
-
-4.101 KVM_PPC_GET_RMMU_INFO
-
-Capability: KVM_CAP_PPC_RADIX_MMU
-Architectures: ppc
-Type: vm ioctl
-Parameters: struct kvm_ppc_rmmu_info (out)
-Returns: 0 on success,
-        -EFAULT if struct kvm_ppc_rmmu_info cannot be written,
-        -EINVAL if no useful information can be returned
-
-This ioctl returns a structure containing two things: (a) a list
-containing supported radix tree geometries, and (b) a list that maps
-page sizes to put in the "AP" (actual page size) field for the tlbie
-(TLB invalidate entry) instruction.
-
-struct kvm_ppc_rmmu_info {
-       struct kvm_ppc_radix_geom {
-               __u8    page_shift;
-               __u8    level_bits[4];
-               __u8    pad[3];
-       }       geometries[8];
-       __u32   ap_encodings[8];
-};
-
-The geometries[] field gives up to 8 supported geometries for the
-radix page table, in terms of the log base 2 of the smallest page
-size, and the number of bits indexed at each level of the tree, from
-the PTE level up to the PGD level in that order.  Any unused entries
-will have 0 in the page_shift field.
-
-The ap_encodings gives the supported page sizes and their AP field
-encodings, encoded with the AP value in the top 3 bits and the log
-base 2 of the page size in the bottom 6 bits.
-
-4.102 KVM_PPC_RESIZE_HPT_PREPARE
-
-Capability: KVM_CAP_SPAPR_RESIZE_HPT
-Architectures: powerpc
-Type: vm ioctl
-Parameters: struct kvm_ppc_resize_hpt (in)
-Returns: 0 on successful completion,
-        >0 if a new HPT is being prepared, the value is an estimated
-             number of milliseconds until preparation is complete
-         -EFAULT if struct kvm_reinject_control cannot be read,
-        -EINVAL if the supplied shift or flags are invalid
-        -ENOMEM if unable to allocate the new HPT
-        -ENOSPC if there was a hash collision when moving existing
-                  HPT entries to the new HPT
-        -EIO on other error conditions
-
-Used to implement the PAPR extension for runtime resizing of a guest's
-Hashed Page Table (HPT).  Specifically this starts, stops or monitors
-the preparation of a new potential HPT for the guest, essentially
-implementing the H_RESIZE_HPT_PREPARE hypercall.
-
-If called with shift > 0 when there is no pending HPT for the guest,
-this begins preparation of a new pending HPT of size 2^(shift) bytes.
-It then returns a positive integer with the estimated number of
-milliseconds until preparation is complete.
-
-If called when there is a pending HPT whose size does not match that
-requested in the parameters, discards the existing pending HPT and
-creates a new one as above.
-
-If called when there is a pending HPT of the size requested, will:
-  * If preparation of the pending HPT is already complete, return 0
-  * If preparation of the pending HPT has failed, return an error
-    code, then discard the pending HPT.
-  * If preparation of the pending HPT is still in progress, return an
-    estimated number of milliseconds until preparation is complete.
-
-If called with shift == 0, discards any currently pending HPT and
-returns 0 (i.e. cancels any in-progress preparation).
-
-flags is reserved for future expansion, currently setting any bits in
-flags will result in an -EINVAL.
-
-Normally this will be called repeatedly with the same parameters until
-it returns <= 0.  The first call will initiate preparation, subsequent
-ones will monitor preparation until it completes or fails.
-
-struct kvm_ppc_resize_hpt {
-       __u64 flags;
-       __u32 shift;
-       __u32 pad;
-};
-
-4.103 KVM_PPC_RESIZE_HPT_COMMIT
-
-Capability: KVM_CAP_SPAPR_RESIZE_HPT
-Architectures: powerpc
-Type: vm ioctl
-Parameters: struct kvm_ppc_resize_hpt (in)
-Returns: 0 on successful completion,
-         -EFAULT if struct kvm_reinject_control cannot be read,
-        -EINVAL if the supplied shift or flags are invalid
-        -ENXIO is there is no pending HPT, or the pending HPT doesn't
-                 have the requested size
-        -EBUSY if the pending HPT is not fully prepared
-        -ENOSPC if there was a hash collision when moving existing
-                  HPT entries to the new HPT
-        -EIO on other error conditions
-
-Used to implement the PAPR extension for runtime resizing of a guest's
-Hashed Page Table (HPT).  Specifically this requests that the guest be
-transferred to working with the new HPT, essentially implementing the
-H_RESIZE_HPT_COMMIT hypercall.
-
-This should only be called after KVM_PPC_RESIZE_HPT_PREPARE has
-returned 0 with the same parameters.  In other cases
-KVM_PPC_RESIZE_HPT_COMMIT will return an error (usually -ENXIO or
--EBUSY, though others may be possible if the preparation was started,
-but failed).
-
-This will have undefined effects on the guest if it has not already
-placed itself in a quiescent state where no vcpu will make MMU enabled
-memory accesses.
-
-On succsful completion, the pending HPT will become the guest's active
-HPT and the previous HPT will be discarded.
-
-On failure, the guest will still be operating on its previous HPT.
-
-struct kvm_ppc_resize_hpt {
-       __u64 flags;
-       __u32 shift;
-       __u32 pad;
-};
-
-4.104 KVM_X86_GET_MCE_CAP_SUPPORTED
-
-Capability: KVM_CAP_MCE
-Architectures: x86
-Type: system ioctl
-Parameters: u64 mce_cap (out)
-Returns: 0 on success, -1 on error
-
-Returns supported MCE capabilities. The u64 mce_cap parameter
-has the same format as the MSR_IA32_MCG_CAP register. Supported
-capabilities will have the corresponding bits set.
-
-4.105 KVM_X86_SETUP_MCE
-
-Capability: KVM_CAP_MCE
-Architectures: x86
-Type: vcpu ioctl
-Parameters: u64 mcg_cap (in)
-Returns: 0 on success,
-         -EFAULT if u64 mcg_cap cannot be read,
-         -EINVAL if the requested number of banks is invalid,
-         -EINVAL if requested MCE capability is not supported.
-
-Initializes MCE support for use. The u64 mcg_cap parameter
-has the same format as the MSR_IA32_MCG_CAP register and
-specifies which capabilities should be enabled. The maximum
-supported number of error-reporting banks can be retrieved when
-checking for KVM_CAP_MCE. The supported capabilities can be
-retrieved with KVM_X86_GET_MCE_CAP_SUPPORTED.
-
-4.106 KVM_X86_SET_MCE
-
-Capability: KVM_CAP_MCE
-Architectures: x86
-Type: vcpu ioctl
-Parameters: struct kvm_x86_mce (in)
-Returns: 0 on success,
-         -EFAULT if struct kvm_x86_mce cannot be read,
-         -EINVAL if the bank number is invalid,
-         -EINVAL if VAL bit is not set in status field.
-
-Inject a machine check error (MCE) into the guest. The input
-parameter is:
-
-struct kvm_x86_mce {
-       __u64 status;
-       __u64 addr;
-       __u64 misc;
-       __u64 mcg_status;
-       __u8 bank;
-       __u8 pad1[7];
-       __u64 pad2[3];
-};
-
-If the MCE being reported is an uncorrected error, KVM will
-inject it as an MCE exception into the guest. If the guest
-MCG_STATUS register reports that an MCE is in progress, KVM
-causes an KVM_EXIT_SHUTDOWN vmexit.
-
-Otherwise, if the MCE is a corrected error, KVM will just
-store it in the corresponding bank (provided this bank is
-not holding a previously reported uncorrected error).
-
-4.107 KVM_S390_GET_CMMA_BITS
-
-Capability: KVM_CAP_S390_CMMA_MIGRATION
-Architectures: s390
-Type: vm ioctl
-Parameters: struct kvm_s390_cmma_log (in, out)
-Returns: 0 on success, a negative value on error
-
-This ioctl is used to get the values of the CMMA bits on the s390
-architecture. It is meant to be used in two scenarios:
-- During live migration to save the CMMA values. Live migration needs
-  to be enabled via the KVM_REQ_START_MIGRATION VM property.
-- To non-destructively peek at the CMMA values, with the flag
-  KVM_S390_CMMA_PEEK set.
-
-The ioctl takes parameters via the kvm_s390_cmma_log struct. The desired
-values are written to a buffer whose location is indicated via the "values"
-member in the kvm_s390_cmma_log struct.  The values in the input struct are
-also updated as needed.
-Each CMMA value takes up one byte.
-
-struct kvm_s390_cmma_log {
-       __u64 start_gfn;
-       __u32 count;
-       __u32 flags;
-       union {
-               __u64 remaining;
-               __u64 mask;
-       };
-       __u64 values;
-};
-
-start_gfn is the number of the first guest frame whose CMMA values are
-to be retrieved,
-
-count is the length of the buffer in bytes,
-
-values points to the buffer where the result will be written to.
-
-If count is greater than KVM_S390_SKEYS_MAX, then it is considered to be
-KVM_S390_SKEYS_MAX. KVM_S390_SKEYS_MAX is re-used for consistency with
-other ioctls.
-
-The result is written in the buffer pointed to by the field values, and
-the values of the input parameter are updated as follows.
-
-Depending on the flags, different actions are performed. The only
-supported flag so far is KVM_S390_CMMA_PEEK.
-
-The default behaviour if KVM_S390_CMMA_PEEK is not set is:
-start_gfn will indicate the first page frame whose CMMA bits were dirty.
-It is not necessarily the same as the one passed as input, as clean pages
-are skipped.
-
-count will indicate the number of bytes actually written in the buffer.
-It can (and very often will) be smaller than the input value, since the
-buffer is only filled until 16 bytes of clean values are found (which
-are then not copied in the buffer). Since a CMMA migration block needs
-the base address and the length, for a total of 16 bytes, we will send
-back some clean data if there is some dirty data afterwards, as long as
-the size of the clean data does not exceed the size of the header. This
-allows to minimize the amount of data to be saved or transferred over
-the network at the expense of more roundtrips to userspace. The next
-invocation of the ioctl will skip over all the clean values, saving
-potentially more than just the 16 bytes we found.
-
-If KVM_S390_CMMA_PEEK is set:
-the existing storage attributes are read even when not in migration
-mode, and no other action is performed;
-
-the output start_gfn will be equal to the input start_gfn,
-
-the output count will be equal to the input count, except if the end of
-memory has been reached.
-
-In both cases:
-the field "remaining" will indicate the total number of dirty CMMA values
-still remaining, or 0 if KVM_S390_CMMA_PEEK is set and migration mode is
-not enabled.
-
-mask is unused.
-
-values points to the userspace buffer where the result will be stored.
-
-This ioctl can fail with -ENOMEM if not enough memory can be allocated to
-complete the task, with -ENXIO if CMMA is not enabled, with -EINVAL if
-KVM_S390_CMMA_PEEK is not set but migration mode was not enabled, with
--EFAULT if the userspace address is invalid or if no page table is
-present for the addresses (e.g. when using hugepages).
-
-4.108 KVM_S390_SET_CMMA_BITS
-
-Capability: KVM_CAP_S390_CMMA_MIGRATION
-Architectures: s390
-Type: vm ioctl
-Parameters: struct kvm_s390_cmma_log (in)
-Returns: 0 on success, a negative value on error
-
-This ioctl is used to set the values of the CMMA bits on the s390
-architecture. It is meant to be used during live migration to restore
-the CMMA values, but there are no restrictions on its use.
-The ioctl takes parameters via the kvm_s390_cmma_values struct.
-Each CMMA value takes up one byte.
-
-struct kvm_s390_cmma_log {
-       __u64 start_gfn;
-       __u32 count;
-       __u32 flags;
-       union {
-               __u64 remaining;
-               __u64 mask;
-       };
-       __u64 values;
-};
-
-start_gfn indicates the starting guest frame number,
-
-count indicates how many values are to be considered in the buffer,
-
-flags is not used and must be 0.
-
-mask indicates which PGSTE bits are to be considered.
-
-remaining is not used.
-
-values points to the buffer in userspace where to store the values.
-
-This ioctl can fail with -ENOMEM if not enough memory can be allocated to
-complete the task, with -ENXIO if CMMA is not enabled, with -EINVAL if
-the count field is too large (e.g. more than KVM_S390_CMMA_SIZE_MAX) or
-if the flags field was not 0, with -EFAULT if the userspace address is
-invalid, if invalid pages are written to (e.g. after the end of memory)
-or if no page table is present for the addresses (e.g. when using
-hugepages).
-
-4.109 KVM_PPC_GET_CPU_CHAR
-
-Capability: KVM_CAP_PPC_GET_CPU_CHAR
-Architectures: powerpc
-Type: vm ioctl
-Parameters: struct kvm_ppc_cpu_char (out)
-Returns: 0 on successful completion
-        -EFAULT if struct kvm_ppc_cpu_char cannot be written
-
-This ioctl gives userspace information about certain characteristics
-of the CPU relating to speculative execution of instructions and
-possible information leakage resulting from speculative execution (see
-CVE-2017-5715, CVE-2017-5753 and CVE-2017-5754).  The information is
-returned in struct kvm_ppc_cpu_char, which looks like this:
-
-struct kvm_ppc_cpu_char {
-       __u64   character;              /* characteristics of the CPU */
-       __u64   behaviour;              /* recommended software behaviour */
-       __u64   character_mask;         /* valid bits in character */
-       __u64   behaviour_mask;         /* valid bits in behaviour */
-};
-
-For extensibility, the character_mask and behaviour_mask fields
-indicate which bits of character and behaviour have been filled in by
-the kernel.  If the set of defined bits is extended in future then
-userspace will be able to tell whether it is running on a kernel that
-knows about the new bits.
-
-The character field describes attributes of the CPU which can help
-with preventing inadvertent information disclosure - specifically,
-whether there is an instruction to flash-invalidate the L1 data cache
-(ori 30,30,0 or mtspr SPRN_TRIG2,rN), whether the L1 data cache is set
-to a mode where entries can only be used by the thread that created
-them, whether the bcctr[l] instruction prevents speculation, and
-whether a speculation barrier instruction (ori 31,31,0) is provided.
-
-The behaviour field describes actions that software should take to
-prevent inadvertent information disclosure, and thus describes which
-vulnerabilities the hardware is subject to; specifically whether the
-L1 data cache should be flushed when returning to user mode from the
-kernel, and whether a speculation barrier should be placed between an
-array bounds check and the array access.
-
-These fields use the same bit definitions as the new
-H_GET_CPU_CHARACTERISTICS hypercall.
-
-4.110 KVM_MEMORY_ENCRYPT_OP
-
-Capability: basic
-Architectures: x86
-Type: system
-Parameters: an opaque platform specific structure (in/out)
-Returns: 0 on success; -1 on error
-
-If the platform supports creating encrypted VMs then this ioctl can be used
-for issuing platform-specific memory encryption commands to manage those
-encrypted VMs.
-
-Currently, this ioctl is used for issuing Secure Encrypted Virtualization
-(SEV) commands on AMD Processors. The SEV commands are defined in
-Documentation/virt/kvm/amd-memory-encryption.rst.
-
-4.111 KVM_MEMORY_ENCRYPT_REG_REGION
-
-Capability: basic
-Architectures: x86
-Type: system
-Parameters: struct kvm_enc_region (in)
-Returns: 0 on success; -1 on error
-
-This ioctl can be used to register a guest memory region which may
-contain encrypted data (e.g. guest RAM, SMRAM etc).
-
-It is used in the SEV-enabled guest. When encryption is enabled, a guest
-memory region may contain encrypted data. The SEV memory encryption
-engine uses a tweak such that two identical plaintext pages, each at
-different locations will have differing ciphertexts. So swapping or
-moving ciphertext of those pages will not result in plaintext being
-swapped. So relocating (or migrating) physical backing pages for the SEV
-guest will require some additional steps.
-
-Note: The current SEV key management spec does not provide commands to
-swap or migrate (move) ciphertext pages. Hence, for now we pin the guest
-memory region registered with the ioctl.
-
-4.112 KVM_MEMORY_ENCRYPT_UNREG_REGION
-
-Capability: basic
-Architectures: x86
-Type: system
-Parameters: struct kvm_enc_region (in)
-Returns: 0 on success; -1 on error
-
-This ioctl can be used to unregister the guest memory region registered
-with KVM_MEMORY_ENCRYPT_REG_REGION ioctl above.
-
-4.113 KVM_HYPERV_EVENTFD
-
-Capability: KVM_CAP_HYPERV_EVENTFD
-Architectures: x86
-Type: vm ioctl
-Parameters: struct kvm_hyperv_eventfd (in)
-
-This ioctl (un)registers an eventfd to receive notifications from the guest on
-the specified Hyper-V connection id through the SIGNAL_EVENT hypercall, without
-causing a user exit.  SIGNAL_EVENT hypercall with non-zero event flag number
-(bits 24-31) still triggers a KVM_EXIT_HYPERV_HCALL user exit.
-
-struct kvm_hyperv_eventfd {
-       __u32 conn_id;
-       __s32 fd;
-       __u32 flags;
-       __u32 padding[3];
-};
-
-The conn_id field should fit within 24 bits:
-
-#define KVM_HYPERV_CONN_ID_MASK                0x00ffffff
-
-The acceptable values for the flags field are:
-
-#define KVM_HYPERV_EVENTFD_DEASSIGN    (1 << 0)
-
-Returns: 0 on success,
-       -EINVAL if conn_id or flags is outside the allowed range
-       -ENOENT on deassign if the conn_id isn't registered
-       -EEXIST on assign if the conn_id is already registered
-
-4.114 KVM_GET_NESTED_STATE
-
-Capability: KVM_CAP_NESTED_STATE
-Architectures: x86
-Type: vcpu ioctl
-Parameters: struct kvm_nested_state (in/out)
-Returns: 0 on success, -1 on error
-Errors:
-  E2BIG:     the total state size exceeds the value of 'size' specified by
-             the user; the size required will be written into size.
-
-struct kvm_nested_state {
-       __u16 flags;
-       __u16 format;
-       __u32 size;
-
-       union {
-               struct kvm_vmx_nested_state_hdr vmx;
-               struct kvm_svm_nested_state_hdr svm;
-
-               /* Pad the header to 128 bytes.  */
-               __u8 pad[120];
-       } hdr;
-
-       union {
-               struct kvm_vmx_nested_state_data vmx[0];
-               struct kvm_svm_nested_state_data svm[0];
-       } data;
-};
-
-#define KVM_STATE_NESTED_GUEST_MODE    0x00000001
-#define KVM_STATE_NESTED_RUN_PENDING   0x00000002
-#define KVM_STATE_NESTED_EVMCS         0x00000004
-
-#define KVM_STATE_NESTED_FORMAT_VMX            0
-#define KVM_STATE_NESTED_FORMAT_SVM            1
-
-#define KVM_STATE_NESTED_VMX_VMCS_SIZE         0x1000
-
-#define KVM_STATE_NESTED_VMX_SMM_GUEST_MODE    0x00000001
-#define KVM_STATE_NESTED_VMX_SMM_VMXON         0x00000002
-
-struct kvm_vmx_nested_state_hdr {
-       __u64 vmxon_pa;
-       __u64 vmcs12_pa;
-
-       struct {
-               __u16 flags;
-       } smm;
-};
-
-struct kvm_vmx_nested_state_data {
-       __u8 vmcs12[KVM_STATE_NESTED_VMX_VMCS_SIZE];
-       __u8 shadow_vmcs12[KVM_STATE_NESTED_VMX_VMCS_SIZE];
-};
-
-This ioctl copies the vcpu's nested virtualization state from the kernel to
-userspace.
-
-The maximum size of the state can be retrieved by passing KVM_CAP_NESTED_STATE
-to the KVM_CHECK_EXTENSION ioctl().
-
-4.115 KVM_SET_NESTED_STATE
-
-Capability: KVM_CAP_NESTED_STATE
-Architectures: x86
-Type: vcpu ioctl
-Parameters: struct kvm_nested_state (in)
-Returns: 0 on success, -1 on error
-
-This copies the vcpu's kvm_nested_state struct from userspace to the kernel.
-For the definition of struct kvm_nested_state, see KVM_GET_NESTED_STATE.
-
-4.116 KVM_(UN)REGISTER_COALESCED_MMIO
-
-Capability: KVM_CAP_COALESCED_MMIO (for coalesced mmio)
-           KVM_CAP_COALESCED_PIO (for coalesced pio)
-Architectures: all
-Type: vm ioctl
-Parameters: struct kvm_coalesced_mmio_zone
-Returns: 0 on success, < 0 on error
-
-Coalesced I/O is a performance optimization that defers hardware
-register write emulation so that userspace exits are avoided.  It is
-typically used to reduce the overhead of emulating frequently accessed
-hardware registers.
-
-When a hardware register is configured for coalesced I/O, write accesses
-do not exit to userspace and their value is recorded in a ring buffer
-that is shared between kernel and userspace.
-
-Coalesced I/O is used if one or more write accesses to a hardware
-register can be deferred until a read or a write to another hardware
-register on the same device.  This last access will cause a vmexit and
-userspace will process accesses from the ring buffer before emulating
-it. That will avoid exiting to userspace on repeated writes.
-
-Coalesced pio is based on coalesced mmio. There is little difference
-between coalesced mmio and pio except that coalesced pio records accesses
-to I/O ports.
-
-4.117 KVM_CLEAR_DIRTY_LOG (vm ioctl)
-
-Capability: KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2
-Architectures: x86, arm, arm64, mips
-Type: vm ioctl
-Parameters: struct kvm_dirty_log (in)
-Returns: 0 on success, -1 on error
-
-/* for KVM_CLEAR_DIRTY_LOG */
-struct kvm_clear_dirty_log {
-       __u32 slot;
-       __u32 num_pages;
-       __u64 first_page;
-       union {
-               void __user *dirty_bitmap; /* one bit per page */
-               __u64 padding;
-       };
-};
-
-The ioctl clears the dirty status of pages in a memory slot, according to
-the bitmap that is passed in struct kvm_clear_dirty_log's dirty_bitmap
-field.  Bit 0 of the bitmap corresponds to page "first_page" in the
-memory slot, and num_pages is the size in bits of the input bitmap.
-first_page must be a multiple of 64; num_pages must also be a multiple of
-64 unless first_page + num_pages is the size of the memory slot.  For each
-bit that is set in the input bitmap, the corresponding page is marked "clean"
-in KVM's dirty bitmap, and dirty tracking is re-enabled for that page
-(for example via write-protection, or by clearing the dirty bit in
-a page table entry).
-
-If KVM_CAP_MULTI_ADDRESS_SPACE is available, bits 16-31 specifies
-the address space for which you want to return the dirty bitmap.
-They must be less than the value that KVM_CHECK_EXTENSION returns for
-the KVM_CAP_MULTI_ADDRESS_SPACE capability.
-
-This ioctl is mostly useful when KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2
-is enabled; for more information, see the description of the capability.
-However, it can always be used as long as KVM_CHECK_EXTENSION confirms
-that KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2 is present.
-
-4.118 KVM_GET_SUPPORTED_HV_CPUID
-
-Capability: KVM_CAP_HYPERV_CPUID
-Architectures: x86
-Type: vcpu ioctl
-Parameters: struct kvm_cpuid2 (in/out)
-Returns: 0 on success, -1 on error
-
-struct kvm_cpuid2 {
-       __u32 nent;
-       __u32 padding;
-       struct kvm_cpuid_entry2 entries[0];
-};
-
-struct kvm_cpuid_entry2 {
-       __u32 function;
-       __u32 index;
-       __u32 flags;
-       __u32 eax;
-       __u32 ebx;
-       __u32 ecx;
-       __u32 edx;
-       __u32 padding[3];
-};
-
-This ioctl returns x86 cpuid features leaves related to Hyper-V emulation in
-KVM.  Userspace can use the information returned by this ioctl to construct
-cpuid information presented to guests consuming Hyper-V enlightenments (e.g.
-Windows or Hyper-V guests).
-
-CPUID feature leaves returned by this ioctl are defined by Hyper-V Top Level
-Functional Specification (TLFS). These leaves can't be obtained with
-KVM_GET_SUPPORTED_CPUID ioctl because some of them intersect with KVM feature
-leaves (0x40000000, 0x40000001).
-
-Currently, the following list of CPUID leaves are returned:
- HYPERV_CPUID_VENDOR_AND_MAX_FUNCTIONS
- HYPERV_CPUID_INTERFACE
- HYPERV_CPUID_VERSION
- HYPERV_CPUID_FEATURES
- HYPERV_CPUID_ENLIGHTMENT_INFO
- HYPERV_CPUID_IMPLEMENT_LIMITS
- HYPERV_CPUID_NESTED_FEATURES
-
-HYPERV_CPUID_NESTED_FEATURES leaf is only exposed when Enlightened VMCS was
-enabled on the corresponding vCPU (KVM_CAP_HYPERV_ENLIGHTENED_VMCS).
-
-Userspace invokes KVM_GET_SUPPORTED_CPUID by passing a kvm_cpuid2 structure
-with the 'nent' field indicating the number of entries in the variable-size
-array 'entries'.  If the number of entries is too low to describe all Hyper-V
-feature leaves, an error (E2BIG) is returned. If the number is more or equal
-to the number of Hyper-V feature leaves, the 'nent' field is adjusted to the
-number of valid entries in the 'entries' array, which is then filled.
-
-'index' and 'flags' fields in 'struct kvm_cpuid_entry2' are currently reserved,
-userspace should not expect to get any particular value there.
-
-4.119 KVM_ARM_VCPU_FINALIZE
-
-Architectures: arm, arm64
-Type: vcpu ioctl
-Parameters: int feature (in)
-Returns: 0 on success, -1 on error
-Errors:
-  EPERM:     feature not enabled, needs configuration, or already finalized
-  EINVAL:    feature unknown or not present
-
-Recognised values for feature:
-  arm64      KVM_ARM_VCPU_SVE (requires KVM_CAP_ARM_SVE)
-
-Finalizes the configuration of the specified vcpu feature.
-
-The vcpu must already have been initialised, enabling the affected feature, by
-means of a successful KVM_ARM_VCPU_INIT call with the appropriate flag set in
-features[].
-
-For affected vcpu features, this is a mandatory step that must be performed
-before the vcpu is fully usable.
-
-Between KVM_ARM_VCPU_INIT and KVM_ARM_VCPU_FINALIZE, the feature may be
-configured by use of ioctls such as KVM_SET_ONE_REG.  The exact configuration
-that should be performaned and how to do it are feature-dependent.
-
-Other calls that depend on a particular feature being finalized, such as
-KVM_RUN, KVM_GET_REG_LIST, KVM_GET_ONE_REG and KVM_SET_ONE_REG, will fail with
--EPERM unless the feature has already been finalized by means of a
-KVM_ARM_VCPU_FINALIZE call.
-
-See KVM_ARM_VCPU_INIT for details of vcpu features that require finalization
-using this ioctl.
-
-4.120 KVM_SET_PMU_EVENT_FILTER
-
-Capability: KVM_CAP_PMU_EVENT_FILTER
-Architectures: x86
-Type: vm ioctl
-Parameters: struct kvm_pmu_event_filter (in)
-Returns: 0 on success, -1 on error
-
-struct kvm_pmu_event_filter {
-       __u32 action;
-       __u32 nevents;
-       __u32 fixed_counter_bitmap;
-       __u32 flags;
-       __u32 pad[4];
-       __u64 events[0];
-};
-
-This ioctl restricts the set of PMU events that the guest can program.
-The argument holds a list of events which will be allowed or denied.
-The eventsel+umask of each event the guest attempts to program is compared
-against the events field to determine whether the guest should have access.
-The events field only controls general purpose counters; fixed purpose
-counters are controlled by the fixed_counter_bitmap.
-
-No flags are defined yet, the field must be zero.
-
-Valid values for 'action':
-#define KVM_PMU_EVENT_ALLOW 0
-#define KVM_PMU_EVENT_DENY 1
-
-4.121 KVM_PPC_SVM_OFF
-
-Capability: basic
-Architectures: powerpc
-Type: vm ioctl
-Parameters: none
-Returns: 0 on successful completion,
-Errors:
-  EINVAL:    if ultravisor failed to terminate the secure guest
-  ENOMEM:    if hypervisor failed to allocate new radix page tables for guest
-
-This ioctl is used to turn off the secure mode of the guest or transition
-the guest from secure mode to normal mode. This is invoked when the guest
-is reset. This has no effect if called for a normal guest.
-
-This ioctl issues an ultravisor call to terminate the secure guest,
-unpins the VPA pages and releases all the device pages that are used to
-track the secure pages by hypervisor.
-
-4.122 KVM_S390_NORMAL_RESET
-
-Capability: KVM_CAP_S390_VCPU_RESETS
-Architectures: s390
-Type: vcpu ioctl
-Parameters: none
-Returns: 0
-
-This ioctl resets VCPU registers and control structures according to
-the cpu reset definition in the POP (Principles Of Operation).
-
-4.123 KVM_S390_INITIAL_RESET
-
-Capability: none
-Architectures: s390
-Type: vcpu ioctl
-Parameters: none
-Returns: 0
-
-This ioctl resets VCPU registers and control structures according to
-the initial cpu reset definition in the POP. However, the cpu is not
-put into ESA mode. This reset is a superset of the normal reset.
-
-4.124 KVM_S390_CLEAR_RESET
-
-Capability: KVM_CAP_S390_VCPU_RESETS
-Architectures: s390
-Type: vcpu ioctl
-Parameters: none
-Returns: 0
-
-This ioctl resets VCPU registers and control structures according to
-the clear cpu reset definition in the POP. However, the cpu is not put
-into ESA mode. This reset is a superset of the initial reset.
-
-
-5. The kvm_run structure
-------------------------
-
-Application code obtains a pointer to the kvm_run structure by
-mmap()ing a vcpu fd.  From that point, application code can control
-execution by changing fields in kvm_run prior to calling the KVM_RUN
-ioctl, and obtain information about the reason KVM_RUN returned by
-looking up structure members.
-
-struct kvm_run {
-       /* in */
-       __u8 request_interrupt_window;
-
-Request that KVM_RUN return when it becomes possible to inject external
-interrupts into the guest.  Useful in conjunction with KVM_INTERRUPT.
-
-       __u8 immediate_exit;
-
-This field is polled once when KVM_RUN starts; if non-zero, KVM_RUN
-exits immediately, returning -EINTR.  In the common scenario where a
-signal is used to "kick" a VCPU out of KVM_RUN, this field can be used
-to avoid usage of KVM_SET_SIGNAL_MASK, which has worse scalability.
-Rather than blocking the signal outside KVM_RUN, userspace can set up
-a signal handler that sets run->immediate_exit to a non-zero value.
-
-This field is ignored if KVM_CAP_IMMEDIATE_EXIT is not available.
-
-       __u8 padding1[6];
-
-       /* out */
-       __u32 exit_reason;
-
-When KVM_RUN has returned successfully (return value 0), this informs
-application code why KVM_RUN has returned.  Allowable values for this
-field are detailed below.
-
-       __u8 ready_for_interrupt_injection;
-
-If request_interrupt_window has been specified, this field indicates
-an interrupt can be injected now with KVM_INTERRUPT.
-
-       __u8 if_flag;
-
-The value of the current interrupt flag.  Only valid if in-kernel
-local APIC is not used.
-
-       __u16 flags;
-
-More architecture-specific flags detailing state of the VCPU that may
-affect the device's behavior.  The only currently defined flag is
-KVM_RUN_X86_SMM, which is valid on x86 machines and is set if the
-VCPU is in system management mode.
-
-       /* in (pre_kvm_run), out (post_kvm_run) */
-       __u64 cr8;
-
-The value of the cr8 register.  Only valid if in-kernel local APIC is
-not used.  Both input and output.
-
-       __u64 apic_base;
-
-The value of the APIC BASE msr.  Only valid if in-kernel local
-APIC is not used.  Both input and output.
-
-       union {
-               /* KVM_EXIT_UNKNOWN */
-               struct {
-                       __u64 hardware_exit_reason;
-               } hw;
-
-If exit_reason is KVM_EXIT_UNKNOWN, the vcpu has exited due to unknown
-reasons.  Further architecture-specific information is available in
-hardware_exit_reason.
-
-               /* KVM_EXIT_FAIL_ENTRY */
-               struct {
-                       __u64 hardware_entry_failure_reason;
-               } fail_entry;
-
-If exit_reason is KVM_EXIT_FAIL_ENTRY, the vcpu could not be run due
-to unknown reasons.  Further architecture-specific information is
-available in hardware_entry_failure_reason.
-
-               /* KVM_EXIT_EXCEPTION */
-               struct {
-                       __u32 exception;
-                       __u32 error_code;
-               } ex;
-
-Unused.
-
-               /* KVM_EXIT_IO */
-               struct {
-#define KVM_EXIT_IO_IN  0
-#define KVM_EXIT_IO_OUT 1
-                       __u8 direction;
-                       __u8 size; /* bytes */
-                       __u16 port;
-                       __u32 count;
-                       __u64 data_offset; /* relative to kvm_run start */
-               } io;
-
-If exit_reason is KVM_EXIT_IO, then the vcpu has
-executed a port I/O instruction which could not be satisfied by kvm.
-data_offset describes where the data is located (KVM_EXIT_IO_OUT) or
-where kvm expects application code to place the data for the next
-KVM_RUN invocation (KVM_EXIT_IO_IN).  Data format is a packed array.
-
-               /* KVM_EXIT_DEBUG */
-               struct {
-                       struct kvm_debug_exit_arch arch;
-               } debug;
-
-If the exit_reason is KVM_EXIT_DEBUG, then a vcpu is processing a debug event
-for which architecture specific information is returned.
-
-               /* KVM_EXIT_MMIO */
-               struct {
-                       __u64 phys_addr;
-                       __u8  data[8];
-                       __u32 len;
-                       __u8  is_write;
-               } mmio;
-
-If exit_reason is KVM_EXIT_MMIO, then the vcpu has
-executed a memory-mapped I/O instruction which could not be satisfied
-by kvm.  The 'data' member contains the written data if 'is_write' is
-true, and should be filled by application code otherwise.
-
-The 'data' member contains, in its first 'len' bytes, the value as it would
-appear if the VCPU performed a load or store of the appropriate width directly
-to the byte array.
-
-NOTE: For KVM_EXIT_IO, KVM_EXIT_MMIO, KVM_EXIT_OSI, KVM_EXIT_PAPR and
-      KVM_EXIT_EPR the corresponding
-operations are complete (and guest state is consistent) only after userspace
-has re-entered the kernel with KVM_RUN.  The kernel side will first finish
-incomplete operations and then check for pending signals.  Userspace
-can re-enter the guest with an unmasked signal pending to complete
-pending operations.
-
-               /* KVM_EXIT_HYPERCALL */
-               struct {
-                       __u64 nr;
-                       __u64 args[6];
-                       __u64 ret;
-                       __u32 longmode;
-                       __u32 pad;
-               } hypercall;
-
-Unused.  This was once used for 'hypercall to userspace'.  To implement
-such functionality, use KVM_EXIT_IO (x86) or KVM_EXIT_MMIO (all except s390).
-Note KVM_EXIT_IO is significantly faster than KVM_EXIT_MMIO.
-
-               /* KVM_EXIT_TPR_ACCESS */
-               struct {
-                       __u64 rip;
-                       __u32 is_write;
-                       __u32 pad;
-               } tpr_access;
-
-To be documented (KVM_TPR_ACCESS_REPORTING).
-
-               /* KVM_EXIT_S390_SIEIC */
-               struct {
-                       __u8 icptcode;
-                       __u64 mask; /* psw upper half */
-                       __u64 addr; /* psw lower half */
-                       __u16 ipa;
-                       __u32 ipb;
-               } s390_sieic;
-
-s390 specific.
-
-               /* KVM_EXIT_S390_RESET */
-#define KVM_S390_RESET_POR       1
-#define KVM_S390_RESET_CLEAR     2
-#define KVM_S390_RESET_SUBSYSTEM 4
-#define KVM_S390_RESET_CPU_INIT  8
-#define KVM_S390_RESET_IPL       16
-               __u64 s390_reset_flags;
-
-s390 specific.
-
-               /* KVM_EXIT_S390_UCONTROL */
-               struct {
-                       __u64 trans_exc_code;
-                       __u32 pgm_code;
-               } s390_ucontrol;
-
-s390 specific. A page fault has occurred for a user controlled virtual
-machine (KVM_VM_S390_UNCONTROL) on it's host page table that cannot be
-resolved by the kernel.
-The program code and the translation exception code that were placed
-in the cpu's lowcore are presented here as defined by the z Architecture
-Principles of Operation Book in the Chapter for Dynamic Address Translation
-(DAT)
-
-               /* KVM_EXIT_DCR */
-               struct {
-                       __u32 dcrn;
-                       __u32 data;
-                       __u8  is_write;
-               } dcr;
-
-Deprecated - was used for 440 KVM.
-
-               /* KVM_EXIT_OSI */
-               struct {
-                       __u64 gprs[32];
-               } osi;
-
-MOL uses a special hypercall interface it calls 'OSI'. To enable it, we catch
-hypercalls and exit with this exit struct that contains all the guest gprs.
-
-If exit_reason is KVM_EXIT_OSI, then the vcpu has triggered such a hypercall.
-Userspace can now handle the hypercall and when it's done modify the gprs as
-necessary. Upon guest entry all guest GPRs will then be replaced by the values
-in this struct.
-
-               /* KVM_EXIT_PAPR_HCALL */
-               struct {
-                       __u64 nr;
-                       __u64 ret;
-                       __u64 args[9];
-               } papr_hcall;
-
-This is used on 64-bit PowerPC when emulating a pSeries partition,
-e.g. with the 'pseries' machine type in qemu.  It occurs when the
-guest does a hypercall using the 'sc 1' instruction.  The 'nr' field
-contains the hypercall number (from the guest R3), and 'args' contains
-the arguments (from the guest R4 - R12).  Userspace should put the
-return code in 'ret' and any extra returned values in args[].
-The possible hypercalls are defined in the Power Architecture Platform
-Requirements (PAPR) document available from www.power.org (free
-developer registration required to access it).
-
-               /* KVM_EXIT_S390_TSCH */
-               struct {
-                       __u16 subchannel_id;
-                       __u16 subchannel_nr;
-                       __u32 io_int_parm;
-                       __u32 io_int_word;
-                       __u32 ipb;
-                       __u8 dequeued;
-               } s390_tsch;
-
-s390 specific. This exit occurs when KVM_CAP_S390_CSS_SUPPORT has been enabled
-and TEST SUBCHANNEL was intercepted. If dequeued is set, a pending I/O
-interrupt for the target subchannel has been dequeued and subchannel_id,
-subchannel_nr, io_int_parm and io_int_word contain the parameters for that
-interrupt. ipb is needed for instruction parameter decoding.
-
-               /* KVM_EXIT_EPR */
-               struct {
-                       __u32 epr;
-               } epr;
-
-On FSL BookE PowerPC chips, the interrupt controller has a fast patch
-interrupt acknowledge path to the core. When the core successfully
-delivers an interrupt, it automatically populates the EPR register with
-the interrupt vector number and acknowledges the interrupt inside
-the interrupt controller.
-
-In case the interrupt controller lives in user space, we need to do
-the interrupt acknowledge cycle through it to fetch the next to be
-delivered interrupt vector using this exit.
-
-It gets triggered whenever both KVM_CAP_PPC_EPR are enabled and an
-external interrupt has just been delivered into the guest. User space
-should put the acknowledged interrupt vector into the 'epr' field.
-
-               /* KVM_EXIT_SYSTEM_EVENT */
-               struct {
-#define KVM_SYSTEM_EVENT_SHUTDOWN       1
-#define KVM_SYSTEM_EVENT_RESET          2
-#define KVM_SYSTEM_EVENT_CRASH          3
-                       __u32 type;
-                       __u64 flags;
-               } system_event;
-
-If exit_reason is KVM_EXIT_SYSTEM_EVENT then the vcpu has triggered
-a system-level event using some architecture specific mechanism (hypercall
-or some special instruction). In case of ARM/ARM64, this is triggered using
-HVC instruction based PSCI call from the vcpu. The 'type' field describes
-the system-level event type. The 'flags' field describes architecture
-specific flags for the system-level event.
-
-Valid values for 'type' are:
-  KVM_SYSTEM_EVENT_SHUTDOWN -- the guest has requested a shutdown of the
-   VM. Userspace is not obliged to honour this, and if it does honour
-   this does not need to destroy the VM synchronously (ie it may call
-   KVM_RUN again before shutdown finally occurs).
-  KVM_SYSTEM_EVENT_RESET -- the guest has requested a reset of the VM.
-   As with SHUTDOWN, userspace can choose to ignore the request, or
-   to schedule the reset to occur in the future and may call KVM_RUN again.
-  KVM_SYSTEM_EVENT_CRASH -- the guest crash occurred and the guest
-   has requested a crash condition maintenance. Userspace can choose
-   to ignore the request, or to gather VM memory core dump and/or
-   reset/shutdown of the VM.
-
-               /* KVM_EXIT_IOAPIC_EOI */
-               struct {
-                       __u8 vector;
-               } eoi;
-
-Indicates that the VCPU's in-kernel local APIC received an EOI for a
-level-triggered IOAPIC interrupt.  This exit only triggers when the
-IOAPIC is implemented in userspace (i.e. KVM_CAP_SPLIT_IRQCHIP is enabled);
-the userspace IOAPIC should process the EOI and retrigger the interrupt if
-it is still asserted.  Vector is the LAPIC interrupt vector for which the
-EOI was received.
-
-               struct kvm_hyperv_exit {
-#define KVM_EXIT_HYPERV_SYNIC          1
-#define KVM_EXIT_HYPERV_HCALL          2
-                       __u32 type;
-                       union {
-                               struct {
-                                       __u32 msr;
-                                       __u64 control;
-                                       __u64 evt_page;
-                                       __u64 msg_page;
-                               } synic;
-                               struct {
-                                       __u64 input;
-                                       __u64 result;
-                                       __u64 params[2];
-                               } hcall;
-                       } u;
-               };
-               /* KVM_EXIT_HYPERV */
-                struct kvm_hyperv_exit hyperv;
-Indicates that the VCPU exits into userspace to process some tasks
-related to Hyper-V emulation.
-Valid values for 'type' are:
-       KVM_EXIT_HYPERV_SYNIC -- synchronously notify user-space about
-Hyper-V SynIC state change. Notification is used to remap SynIC
-event/message pages and to enable/disable SynIC messages/events processing
-in userspace.
-
-               /* KVM_EXIT_ARM_NISV */
-               struct {
-                       __u64 esr_iss;
-                       __u64 fault_ipa;
-               } arm_nisv;
-
-Used on arm and arm64 systems. If a guest accesses memory not in a memslot,
-KVM will typically return to userspace and ask it to do MMIO emulation on its
-behalf. However, for certain classes of instructions, no instruction decode
-(direction, length of memory access) is provided, and fetching and decoding
-the instruction from the VM is overly complicated to live in the kernel.
-
-Historically, when this situation occurred, KVM would print a warning and kill
-the VM. KVM assumed that if the guest accessed non-memslot memory, it was
-trying to do I/O, which just couldn't be emulated, and the warning message was
-phrased accordingly. However, what happened more often was that a guest bug
-caused access outside the guest memory areas which should lead to a more
-meaningful warning message and an external abort in the guest, if the access
-did not fall within an I/O window.
-
-Userspace implementations can query for KVM_CAP_ARM_NISV_TO_USER, and enable
-this capability at VM creation. Once this is done, these types of errors will
-instead return to userspace with KVM_EXIT_ARM_NISV, with the valid bits from
-the HSR (arm) and ESR_EL2 (arm64) in the esr_iss field, and the faulting IPA
-in the fault_ipa field. Userspace can either fix up the access if it's
-actually an I/O access by decoding the instruction from guest memory (if it's
-very brave) and continue executing the guest, or it can decide to suspend,
-dump, or restart the guest.
-
-Note that KVM does not skip the faulting instruction as it does for
-KVM_EXIT_MMIO, but userspace has to emulate any change to the processing state
-if it decides to decode and emulate the instruction.
-
-               /* Fix the size of the union. */
-               char padding[256];
-       };
-
-       /*
-        * shared registers between kvm and userspace.
-        * kvm_valid_regs specifies the register classes set by the host
-        * kvm_dirty_regs specified the register classes dirtied by userspace
-        * struct kvm_sync_regs is architecture specific, as well as the
-        * bits for kvm_valid_regs and kvm_dirty_regs
-        */
-       __u64 kvm_valid_regs;
-       __u64 kvm_dirty_regs;
-       union {
-               struct kvm_sync_regs regs;
-               char padding[SYNC_REGS_SIZE_BYTES];
-       } s;
-
-If KVM_CAP_SYNC_REGS is defined, these fields allow userspace to access
-certain guest registers without having to call SET/GET_*REGS. Thus we can
-avoid some system call overhead if userspace has to handle the exit.
-Userspace can query the validity of the structure by checking
-kvm_valid_regs for specific bits. These bits are architecture specific
-and usually define the validity of a groups of registers. (e.g. one bit
- for general purpose registers)
-
-Please note that the kernel is allowed to use the kvm_run structure as the
-primary storage for certain register types. Therefore, the kernel may use the
-values in kvm_run even if the corresponding bit in kvm_dirty_regs is not set.
-
-};
-
-
-
-6. Capabilities that can be enabled on vCPUs
---------------------------------------------
-
-There are certain capabilities that change the behavior of the virtual CPU or
-the virtual machine when enabled. To enable them, please see section 4.37.
-Below you can find a list of capabilities and what their effect on the vCPU or
-the virtual machine is when enabling them.
-
-The following information is provided along with the description:
-
-  Architectures: which instruction set architectures provide this ioctl.
-      x86 includes both i386 and x86_64.
-
-  Target: whether this is a per-vcpu or per-vm capability.
-
-  Parameters: what parameters are accepted by the capability.
-
-  Returns: the return value.  General error numbers (EBADF, ENOMEM, EINVAL)
-      are not detailed, but errors with specific meanings are.
-
-
-6.1 KVM_CAP_PPC_OSI
-
-Architectures: ppc
-Target: vcpu
-Parameters: none
-Returns: 0 on success; -1 on error
-
-This capability enables interception of OSI hypercalls that otherwise would
-be treated as normal system calls to be injected into the guest. OSI hypercalls
-were invented by Mac-on-Linux to have a standardized communication mechanism
-between the guest and the host.
-
-When this capability is enabled, KVM_EXIT_OSI can occur.
-
-
-6.2 KVM_CAP_PPC_PAPR
-
-Architectures: ppc
-Target: vcpu
-Parameters: none
-Returns: 0 on success; -1 on error
-
-This capability enables interception of PAPR hypercalls. PAPR hypercalls are
-done using the hypercall instruction "sc 1".
-
-It also sets the guest privilege level to "supervisor" mode. Usually the guest
-runs in "hypervisor" privilege mode with a few missing features.
-
-In addition to the above, it changes the semantics of SDR1. In this mode, the
-HTAB address part of SDR1 contains an HVA instead of a GPA, as PAPR keeps the
-HTAB invisible to the guest.
-
-When this capability is enabled, KVM_EXIT_PAPR_HCALL can occur.
-
-
-6.3 KVM_CAP_SW_TLB
-
-Architectures: ppc
-Target: vcpu
-Parameters: args[0] is the address of a struct kvm_config_tlb
-Returns: 0 on success; -1 on error
-
-struct kvm_config_tlb {
-       __u64 params;
-       __u64 array;
-       __u32 mmu_type;
-       __u32 array_len;
-};
-
-Configures the virtual CPU's TLB array, establishing a shared memory area
-between userspace and KVM.  The "params" and "array" fields are userspace
-addresses of mmu-type-specific data structures.  The "array_len" field is an
-safety mechanism, and should be set to the size in bytes of the memory that
-userspace has reserved for the array.  It must be at least the size dictated
-by "mmu_type" and "params".
-
-While KVM_RUN is active, the shared region is under control of KVM.  Its
-contents are undefined, and any modification by userspace results in
-boundedly undefined behavior.
-
-On return from KVM_RUN, the shared region will reflect the current state of
-the guest's TLB.  If userspace makes any changes, it must call KVM_DIRTY_TLB
-to tell KVM which entries have been changed, prior to calling KVM_RUN again
-on this vcpu.
-
-For mmu types KVM_MMU_FSL_BOOKE_NOHV and KVM_MMU_FSL_BOOKE_HV:
- - The "params" field is of type "struct kvm_book3e_206_tlb_params".
- - The "array" field points to an array of type "struct
-   kvm_book3e_206_tlb_entry".
- - The array consists of all entries in the first TLB, followed by all
-   entries in the second TLB.
- - Within a TLB, entries are ordered first by increasing set number.  Within a
-   set, entries are ordered by way (increasing ESEL).
- - The hash for determining set number in TLB0 is: (MAS2 >> 12) & (num_sets - 1)
-   where "num_sets" is the tlb_sizes[] value divided by the tlb_ways[] value.
- - The tsize field of mas1 shall be set to 4K on TLB0, even though the
-   hardware ignores this value for TLB0.
-
-6.4 KVM_CAP_S390_CSS_SUPPORT
-
-Architectures: s390
-Target: vcpu
-Parameters: none
-Returns: 0 on success; -1 on error
-
-This capability enables support for handling of channel I/O instructions.
-
-TEST PENDING INTERRUPTION and the interrupt portion of TEST SUBCHANNEL are
-handled in-kernel, while the other I/O instructions are passed to userspace.
-
-When this capability is enabled, KVM_EXIT_S390_TSCH will occur on TEST
-SUBCHANNEL intercepts.
-
-Note that even though this capability is enabled per-vcpu, the complete
-virtual machine is affected.
-
-6.5 KVM_CAP_PPC_EPR
-
-Architectures: ppc
-Target: vcpu
-Parameters: args[0] defines whether the proxy facility is active
-Returns: 0 on success; -1 on error
-
-This capability enables or disables the delivery of interrupts through the
-external proxy facility.
-
-When enabled (args[0] != 0), every time the guest gets an external interrupt
-delivered, it automatically exits into user space with a KVM_EXIT_EPR exit
-to receive the topmost interrupt vector.
-
-When disabled (args[0] == 0), behavior is as if this facility is unsupported.
-
-When this capability is enabled, KVM_EXIT_EPR can occur.
-
-6.6 KVM_CAP_IRQ_MPIC
-
-Architectures: ppc
-Parameters: args[0] is the MPIC device fd
-            args[1] is the MPIC CPU number for this vcpu
-
-This capability connects the vcpu to an in-kernel MPIC device.
-
-6.7 KVM_CAP_IRQ_XICS
-
-Architectures: ppc
-Target: vcpu
-Parameters: args[0] is the XICS device fd
-            args[1] is the XICS CPU number (server ID) for this vcpu
-
-This capability connects the vcpu to an in-kernel XICS device.
-
-6.8 KVM_CAP_S390_IRQCHIP
-
-Architectures: s390
-Target: vm
-Parameters: none
-
-This capability enables the in-kernel irqchip for s390. Please refer to
-"4.24 KVM_CREATE_IRQCHIP" for details.
-
-6.9 KVM_CAP_MIPS_FPU
-
-Architectures: mips
-Target: vcpu
-Parameters: args[0] is reserved for future use (should be 0).
-
-This capability allows the use of the host Floating Point Unit by the guest. It
-allows the Config1.FP bit to be set to enable the FPU in the guest. Once this is
-done the KVM_REG_MIPS_FPR_* and KVM_REG_MIPS_FCR_* registers can be accessed
-(depending on the current guest FPU register mode), and the Status.FR,
-Config5.FRE bits are accessible via the KVM API and also from the guest,
-depending on them being supported by the FPU.
-
-6.10 KVM_CAP_MIPS_MSA
-
-Architectures: mips
-Target: vcpu
-Parameters: args[0] is reserved for future use (should be 0).
-
-This capability allows the use of the MIPS SIMD Architecture (MSA) by the guest.
-It allows the Config3.MSAP bit to be set to enable the use of MSA by the guest.
-Once this is done the KVM_REG_MIPS_VEC_* and KVM_REG_MIPS_MSA_* registers can be
-accessed, and the Config5.MSAEn bit is accessible via the KVM API and also from
-the guest.
-
-6.74 KVM_CAP_SYNC_REGS
-Architectures: s390, x86
-Target: s390: always enabled, x86: vcpu
-Parameters: none
-Returns: x86: KVM_CHECK_EXTENSION returns a bit-array indicating which register
-sets are supported (bitfields defined in arch/x86/include/uapi/asm/kvm.h).
-
-As described above in the kvm_sync_regs struct info in section 5 (kvm_run):
-KVM_CAP_SYNC_REGS "allow[s] userspace to access certain guest registers
-without having to call SET/GET_*REGS". This reduces overhead by eliminating
-repeated ioctl calls for setting and/or getting register values. This is
-particularly important when userspace is making synchronous guest state
-modifications, e.g. when emulating and/or intercepting instructions in
-userspace.
-
-For s390 specifics, please refer to the source code.
-
-For x86:
-- the register sets to be copied out to kvm_run are selectable
-  by userspace (rather that all sets being copied out for every exit).
-- vcpu_events are available in addition to regs and sregs.
-
-For x86, the 'kvm_valid_regs' field of struct kvm_run is overloaded to
-function as an input bit-array field set by userspace to indicate the
-specific register sets to be copied out on the next exit.
-
-To indicate when userspace has modified values that should be copied into
-the vCPU, the all architecture bitarray field, 'kvm_dirty_regs' must be set.
-This is done using the same bitflags as for the 'kvm_valid_regs' field.
-If the dirty bit is not set, then the register set values will not be copied
-into the vCPU even if they've been modified.
-
-Unused bitfields in the bitarrays must be set to zero.
-
-struct kvm_sync_regs {
-        struct kvm_regs regs;
-        struct kvm_sregs sregs;
-        struct kvm_vcpu_events events;
-};
-
-6.75 KVM_CAP_PPC_IRQ_XIVE
-
-Architectures: ppc
-Target: vcpu
-Parameters: args[0] is the XIVE device fd
-            args[1] is the XIVE CPU number (server ID) for this vcpu
-
-This capability connects the vcpu to an in-kernel XIVE device.
-
-7. Capabilities that can be enabled on VMs
-------------------------------------------
-
-There are certain capabilities that change the behavior of the virtual
-machine when enabled. To enable them, please see section 4.37. Below
-you can find a list of capabilities and what their effect on the VM
-is when enabling them.
-
-The following information is provided along with the description:
-
-  Architectures: which instruction set architectures provide this ioctl.
-      x86 includes both i386 and x86_64.
-
-  Parameters: what parameters are accepted by the capability.
-
-  Returns: the return value.  General error numbers (EBADF, ENOMEM, EINVAL)
-      are not detailed, but errors with specific meanings are.
-
-
-7.1 KVM_CAP_PPC_ENABLE_HCALL
-
-Architectures: ppc
-Parameters: args[0] is the sPAPR hcall number
-           args[1] is 0 to disable, 1 to enable in-kernel handling
-
-This capability controls whether individual sPAPR hypercalls (hcalls)
-get handled by the kernel or not.  Enabling or disabling in-kernel
-handling of an hcall is effective across the VM.  On creation, an
-initial set of hcalls are enabled for in-kernel handling, which
-consists of those hcalls for which in-kernel handlers were implemented
-before this capability was implemented.  If disabled, the kernel will
-not to attempt to handle the hcall, but will always exit to userspace
-to handle it.  Note that it may not make sense to enable some and
-disable others of a group of related hcalls, but KVM does not prevent
-userspace from doing that.
-
-If the hcall number specified is not one that has an in-kernel
-implementation, the KVM_ENABLE_CAP ioctl will fail with an EINVAL
-error.
-
-7.2 KVM_CAP_S390_USER_SIGP
-
-Architectures: s390
-Parameters: none
-
-This capability controls which SIGP orders will be handled completely in user
-space. With this capability enabled, all fast orders will be handled completely
-in the kernel:
-- SENSE
-- SENSE RUNNING
-- EXTERNAL CALL
-- EMERGENCY SIGNAL
-- CONDITIONAL EMERGENCY SIGNAL
-
-All other orders will be handled completely in user space.
-
-Only privileged operation exceptions will be checked for in the kernel (or even
-in the hardware prior to interception). If this capability is not enabled, the
-old way of handling SIGP orders is used (partially in kernel and user space).
-
-7.3 KVM_CAP_S390_VECTOR_REGISTERS
-
-Architectures: s390
-Parameters: none
-Returns: 0 on success, negative value on error
-
-Allows use of the vector registers introduced with z13 processor, and
-provides for the synchronization between host and user space.  Will
-return -EINVAL if the machine does not support vectors.
-
-7.4 KVM_CAP_S390_USER_STSI
-
-Architectures: s390
-Parameters: none
-
-This capability allows post-handlers for the STSI instruction. After
-initial handling in the kernel, KVM exits to user space with
-KVM_EXIT_S390_STSI to allow user space to insert further data.
-
-Before exiting to userspace, kvm handlers should fill in s390_stsi field of
-vcpu->run:
-struct {
-       __u64 addr;
-       __u8 ar;
-       __u8 reserved;
-       __u8 fc;
-       __u8 sel1;
-       __u16 sel2;
-} s390_stsi;
-
-@addr - guest address of STSI SYSIB
-@fc   - function code
-@sel1 - selector 1
-@sel2 - selector 2
-@ar   - access register number
-
-KVM handlers should exit to userspace with rc = -EREMOTE.
-
-7.5 KVM_CAP_SPLIT_IRQCHIP
-
-Architectures: x86
-Parameters: args[0] - number of routes reserved for userspace IOAPICs
-Returns: 0 on success, -1 on error
-
-Create a local apic for each processor in the kernel. This can be used
-instead of KVM_CREATE_IRQCHIP if the userspace VMM wishes to emulate the
-IOAPIC and PIC (and also the PIT, even though this has to be enabled
-separately).
-
-This capability also enables in kernel routing of interrupt requests;
-when KVM_CAP_SPLIT_IRQCHIP only routes of KVM_IRQ_ROUTING_MSI type are
-used in the IRQ routing table.  The first args[0] MSI routes are reserved
-for the IOAPIC pins.  Whenever the LAPIC receives an EOI for these routes,
-a KVM_EXIT_IOAPIC_EOI vmexit will be reported to userspace.
-
-Fails if VCPU has already been created, or if the irqchip is already in the
-kernel (i.e. KVM_CREATE_IRQCHIP has already been called).
-
-7.6 KVM_CAP_S390_RI
-
-Architectures: s390
-Parameters: none
-
-Allows use of runtime-instrumentation introduced with zEC12 processor.
-Will return -EINVAL if the machine does not support runtime-instrumentation.
-Will return -EBUSY if a VCPU has already been created.
-
-7.7 KVM_CAP_X2APIC_API
-
-Architectures: x86
-Parameters: args[0] - features that should be enabled
-Returns: 0 on success, -EINVAL when args[0] contains invalid features
-
-Valid feature flags in args[0] are
-
-#define KVM_X2APIC_API_USE_32BIT_IDS            (1ULL << 0)
-#define KVM_X2APIC_API_DISABLE_BROADCAST_QUIRK  (1ULL << 1)
-
-Enabling KVM_X2APIC_API_USE_32BIT_IDS changes the behavior of
-KVM_SET_GSI_ROUTING, KVM_SIGNAL_MSI, KVM_SET_LAPIC, and KVM_GET_LAPIC,
-allowing the use of 32-bit APIC IDs.  See KVM_CAP_X2APIC_API in their
-respective sections.
-
-KVM_X2APIC_API_DISABLE_BROADCAST_QUIRK must be enabled for x2APIC to work
-in logical mode or with more than 255 VCPUs.  Otherwise, KVM treats 0xff
-as a broadcast even in x2APIC mode in order to support physical x2APIC
-without interrupt remapping.  This is undesirable in logical mode,
-where 0xff represents CPUs 0-7 in cluster 0.
-
-7.8 KVM_CAP_S390_USER_INSTR0
-
-Architectures: s390
-Parameters: none
-
-With this capability enabled, all illegal instructions 0x0000 (2 bytes) will
-be intercepted and forwarded to user space. User space can use this
-mechanism e.g. to realize 2-byte software breakpoints. The kernel will
-not inject an operating exception for these instructions, user space has
-to take care of that.
-
-This capability can be enabled dynamically even if VCPUs were already
-created and are running.
-
-7.9 KVM_CAP_S390_GS
-
-Architectures: s390
-Parameters: none
-Returns: 0 on success; -EINVAL if the machine does not support
-        guarded storage; -EBUSY if a VCPU has already been created.
-
-Allows use of guarded storage for the KVM guest.
-
-7.10 KVM_CAP_S390_AIS
-
-Architectures: s390
-Parameters: none
-
-Allow use of adapter-interruption suppression.
-Returns: 0 on success; -EBUSY if a VCPU has already been created.
-
-7.11 KVM_CAP_PPC_SMT
-
-Architectures: ppc
-Parameters: vsmt_mode, flags
-
-Enabling this capability on a VM provides userspace with a way to set
-the desired virtual SMT mode (i.e. the number of virtual CPUs per
-virtual core).  The virtual SMT mode, vsmt_mode, must be a power of 2
-between 1 and 8.  On POWER8, vsmt_mode must also be no greater than
-the number of threads per subcore for the host.  Currently flags must
-be 0.  A successful call to enable this capability will result in
-vsmt_mode being returned when the KVM_CAP_PPC_SMT capability is
-subsequently queried for the VM.  This capability is only supported by
-HV KVM, and can only be set before any VCPUs have been created.
-The KVM_CAP_PPC_SMT_POSSIBLE capability indicates which virtual SMT
-modes are available.
-
-7.12 KVM_CAP_PPC_FWNMI
-
-Architectures: ppc
-Parameters: none
-
-With this capability a machine check exception in the guest address
-space will cause KVM to exit the guest with NMI exit reason. This
-enables QEMU to build error log and branch to guest kernel registered
-machine check handling routine. Without this capability KVM will
-branch to guests' 0x200 interrupt vector.
-
-7.13 KVM_CAP_X86_DISABLE_EXITS
-
-Architectures: x86
-Parameters: args[0] defines which exits are disabled
-Returns: 0 on success, -EINVAL when args[0] contains invalid exits
-
-Valid bits in args[0] are
-
-#define KVM_X86_DISABLE_EXITS_MWAIT            (1 << 0)
-#define KVM_X86_DISABLE_EXITS_HLT              (1 << 1)
-#define KVM_X86_DISABLE_EXITS_PAUSE            (1 << 2)
-#define KVM_X86_DISABLE_EXITS_CSTATE           (1 << 3)
-
-Enabling this capability on a VM provides userspace with a way to no
-longer intercept some instructions for improved latency in some
-workloads, and is suggested when vCPUs are associated to dedicated
-physical CPUs.  More bits can be added in the future; userspace can
-just pass the KVM_CHECK_EXTENSION result to KVM_ENABLE_CAP to disable
-all such vmexits.
-
-Do not enable KVM_FEATURE_PV_UNHALT if you disable HLT exits.
-
-7.14 KVM_CAP_S390_HPAGE_1M
-
-Architectures: s390
-Parameters: none
-Returns: 0 on success, -EINVAL if hpage module parameter was not set
-        or cmma is enabled, or the VM has the KVM_VM_S390_UCONTROL
-        flag set
-
-With this capability the KVM support for memory backing with 1m pages
-through hugetlbfs can be enabled for a VM. After the capability is
-enabled, cmma can't be enabled anymore and pfmfi and the storage key
-interpretation are disabled. If cmma has already been enabled or the
-hpage module parameter is not set to 1, -EINVAL is returned.
-
-While it is generally possible to create a huge page backed VM without
-this capability, the VM will not be able to run.
-
-7.15 KVM_CAP_MSR_PLATFORM_INFO
-
-Architectures: x86
-Parameters: args[0] whether feature should be enabled or not
-
-With this capability, a guest may read the MSR_PLATFORM_INFO MSR. Otherwise,
-a #GP would be raised when the guest tries to access. Currently, this
-capability does not enable write permissions of this MSR for the guest.
-
-7.16 KVM_CAP_PPC_NESTED_HV
-
-Architectures: ppc
-Parameters: none
-Returns: 0 on success, -EINVAL when the implementation doesn't support
-        nested-HV virtualization.
-
-HV-KVM on POWER9 and later systems allows for "nested-HV"
-virtualization, which provides a way for a guest VM to run guests that
-can run using the CPU's supervisor mode (privileged non-hypervisor
-state).  Enabling this capability on a VM depends on the CPU having
-the necessary functionality and on the facility being enabled with a
-kvm-hv module parameter.
-
-7.17 KVM_CAP_EXCEPTION_PAYLOAD
-
-Architectures: x86
-Parameters: args[0] whether feature should be enabled or not
-
-With this capability enabled, CR2 will not be modified prior to the
-emulated VM-exit when L1 intercepts a #PF exception that occurs in
-L2. Similarly, for kvm-intel only, DR6 will not be modified prior to
-the emulated VM-exit when L1 intercepts a #DB exception that occurs in
-L2. As a result, when KVM_GET_VCPU_EVENTS reports a pending #PF (or
-#DB) exception for L2, exception.has_payload will be set and the
-faulting address (or the new DR6 bits*) will be reported in the
-exception_payload field. Similarly, when userspace injects a #PF (or
-#DB) into L2 using KVM_SET_VCPU_EVENTS, it is expected to set
-exception.has_payload and to put the faulting address (or the new DR6
-bits*) in the exception_payload field.
-
-This capability also enables exception.pending in struct
-kvm_vcpu_events, which allows userspace to distinguish between pending
-and injected exceptions.
-
-
-* For the new DR6 bits, note that bit 16 is set iff the #DB exception
-  will clear DR6.RTM.
-
-7.18 KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2
-
-Architectures: x86, arm, arm64, mips
-Parameters: args[0] whether feature should be enabled or not
-
-With this capability enabled, KVM_GET_DIRTY_LOG will not automatically
-clear and write-protect all pages that are returned as dirty.
-Rather, userspace will have to do this operation separately using
-KVM_CLEAR_DIRTY_LOG.
-
-At the cost of a slightly more complicated operation, this provides better
-scalability and responsiveness for two reasons.  First,
-KVM_CLEAR_DIRTY_LOG ioctl can operate on a 64-page granularity rather
-than requiring to sync a full memslot; this ensures that KVM does not
-take spinlocks for an extended period of time.  Second, in some cases a
-large amount of time can pass between a call to KVM_GET_DIRTY_LOG and
-userspace actually using the data in the page.  Pages can be modified
-during this time, which is inefficint for both the guest and userspace:
-the guest will incur a higher penalty due to write protection faults,
-while userspace can see false reports of dirty pages.  Manual reprotection
-helps reducing this time, improving guest performance and reducing the
-number of dirty log false positives.
-
-KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2 was previously available under the name
-KVM_CAP_MANUAL_DIRTY_LOG_PROTECT, but the implementation had bugs that make
-it hard or impossible to use it correctly.  The availability of
-KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2 signals that those bugs are fixed.
-Userspace should not try to use KVM_CAP_MANUAL_DIRTY_LOG_PROTECT.
-
-8. Other capabilities.
-----------------------
-
-This section lists capabilities that give information about other
-features of the KVM implementation.
-
-8.1 KVM_CAP_PPC_HWRNG
-
-Architectures: ppc
-
-This capability, if KVM_CHECK_EXTENSION indicates that it is
-available, means that that the kernel has an implementation of the
-H_RANDOM hypercall backed by a hardware random-number generator.
-If present, the kernel H_RANDOM handler can be enabled for guest use
-with the KVM_CAP_PPC_ENABLE_HCALL capability.
-
-8.2 KVM_CAP_HYPERV_SYNIC
-
-Architectures: x86
-This capability, if KVM_CHECK_EXTENSION indicates that it is
-available, means that that the kernel has an implementation of the
-Hyper-V Synthetic interrupt controller(SynIC). Hyper-V SynIC is
-used to support Windows Hyper-V based guest paravirt drivers(VMBus).
-
-In order to use SynIC, it has to be activated by setting this
-capability via KVM_ENABLE_CAP ioctl on the vcpu fd. Note that this
-will disable the use of APIC hardware virtualization even if supported
-by the CPU, as it's incompatible with SynIC auto-EOI behavior.
-
-8.3 KVM_CAP_PPC_RADIX_MMU
-
-Architectures: ppc
-
-This capability, if KVM_CHECK_EXTENSION indicates that it is
-available, means that that the kernel can support guests using the
-radix MMU defined in Power ISA V3.00 (as implemented in the POWER9
-processor).
-
-8.4 KVM_CAP_PPC_HASH_MMU_V3
-
-Architectures: ppc
-
-This capability, if KVM_CHECK_EXTENSION indicates that it is
-available, means that that the kernel can support guests using the
-hashed page table MMU defined in Power ISA V3.00 (as implemented in
-the POWER9 processor), including in-memory segment tables.
-
-8.5 KVM_CAP_MIPS_VZ
-
-Architectures: mips
-
-This capability, if KVM_CHECK_EXTENSION on the main kvm handle indicates that
-it is available, means that full hardware assisted virtualization capabilities
-of the hardware are available for use through KVM. An appropriate
-KVM_VM_MIPS_* type must be passed to KVM_CREATE_VM to create a VM which
-utilises it.
-
-If KVM_CHECK_EXTENSION on a kvm VM handle indicates that this capability is
-available, it means that the VM is using full hardware assisted virtualization
-capabilities of the hardware. This is useful to check after creating a VM with
-KVM_VM_MIPS_DEFAULT.
-
-The value returned by KVM_CHECK_EXTENSION should be compared against known
-values (see below). All other values are reserved. This is to allow for the
-possibility of other hardware assisted virtualization implementations which
-may be incompatible with the MIPS VZ ASE.
-
- 0: The trap & emulate implementation is in use to run guest code in user
-    mode. Guest virtual memory segments are rearranged to fit the guest in the
-    user mode address space.
-
- 1: The MIPS VZ ASE is in use, providing full hardware assisted
-    virtualization, including standard guest virtual memory segments.
-
-8.6 KVM_CAP_MIPS_TE
-
-Architectures: mips
-
-This capability, if KVM_CHECK_EXTENSION on the main kvm handle indicates that
-it is available, means that the trap & emulate implementation is available to
-run guest code in user mode, even if KVM_CAP_MIPS_VZ indicates that hardware
-assisted virtualisation is also available. KVM_VM_MIPS_TE (0) must be passed
-to KVM_CREATE_VM to create a VM which utilises it.
-
-If KVM_CHECK_EXTENSION on a kvm VM handle indicates that this capability is
-available, it means that the VM is using trap & emulate.
-
-8.7 KVM_CAP_MIPS_64BIT
-
-Architectures: mips
-
-This capability indicates the supported architecture type of the guest, i.e. the
-supported register and address width.
-
-The values returned when this capability is checked by KVM_CHECK_EXTENSION on a
-kvm VM handle correspond roughly to the CP0_Config.AT register field, and should
-be checked specifically against known values (see below). All other values are
-reserved.
-
- 0: MIPS32 or microMIPS32.
-    Both registers and addresses are 32-bits wide.
-    It will only be possible to run 32-bit guest code.
-
- 1: MIPS64 or microMIPS64 with access only to 32-bit compatibility segments.
-    Registers are 64-bits wide, but addresses are 32-bits wide.
-    64-bit guest code may run but cannot access MIPS64 memory segments.
-    It will also be possible to run 32-bit guest code.
-
- 2: MIPS64 or microMIPS64 with access to all address segments.
-    Both registers and addresses are 64-bits wide.
-    It will be possible to run 64-bit or 32-bit guest code.
-
-8.9 KVM_CAP_ARM_USER_IRQ
-
-Architectures: arm, arm64
-This capability, if KVM_CHECK_EXTENSION indicates that it is available, means
-that if userspace creates a VM without an in-kernel interrupt controller, it
-will be notified of changes to the output level of in-kernel emulated devices,
-which can generate virtual interrupts, presented to the VM.
-For such VMs, on every return to userspace, the kernel
-updates the vcpu's run->s.regs.device_irq_level field to represent the actual
-output level of the device.
-
-Whenever kvm detects a change in the device output level, kvm guarantees at
-least one return to userspace before running the VM.  This exit could either
-be a KVM_EXIT_INTR or any other exit event, like KVM_EXIT_MMIO. This way,
-userspace can always sample the device output level and re-compute the state of
-the userspace interrupt controller.  Userspace should always check the state
-of run->s.regs.device_irq_level on every kvm exit.
-The value in run->s.regs.device_irq_level can represent both level and edge
-triggered interrupt signals, depending on the device.  Edge triggered interrupt
-signals will exit to userspace with the bit in run->s.regs.device_irq_level
-set exactly once per edge signal.
-
-The field run->s.regs.device_irq_level is available independent of
-run->kvm_valid_regs or run->kvm_dirty_regs bits.
-
-If KVM_CAP_ARM_USER_IRQ is supported, the KVM_CHECK_EXTENSION ioctl returns a
-number larger than 0 indicating the version of this capability is implemented
-and thereby which bits in in run->s.regs.device_irq_level can signal values.
-
-Currently the following bits are defined for the device_irq_level bitmap:
-
-  KVM_CAP_ARM_USER_IRQ >= 1:
-
-    KVM_ARM_DEV_EL1_VTIMER -  EL1 virtual timer
-    KVM_ARM_DEV_EL1_PTIMER -  EL1 physical timer
-    KVM_ARM_DEV_PMU        -  ARM PMU overflow interrupt signal
-
-Future versions of kvm may implement additional events. These will get
-indicated by returning a higher number from KVM_CHECK_EXTENSION and will be
-listed above.
-
-8.10 KVM_CAP_PPC_SMT_POSSIBLE
-
-Architectures: ppc
-
-Querying this capability returns a bitmap indicating the possible
-virtual SMT modes that can be set using KVM_CAP_PPC_SMT.  If bit N
-(counting from the right) is set, then a virtual SMT mode of 2^N is
-available.
-
-8.11 KVM_CAP_HYPERV_SYNIC2
-
-Architectures: x86
-
-This capability enables a newer version of Hyper-V Synthetic interrupt
-controller (SynIC).  The only difference with KVM_CAP_HYPERV_SYNIC is that KVM
-doesn't clear SynIC message and event flags pages when they are enabled by
-writing to the respective MSRs.
-
-8.12 KVM_CAP_HYPERV_VP_INDEX
-
-Architectures: x86
-
-This capability indicates that userspace can load HV_X64_MSR_VP_INDEX msr.  Its
-value is used to denote the target vcpu for a SynIC interrupt.  For
-compatibilty, KVM initializes this msr to KVM's internal vcpu index.  When this
-capability is absent, userspace can still query this msr's value.
-
-8.13 KVM_CAP_S390_AIS_MIGRATION
-
-Architectures: s390
-Parameters: none
-
-This capability indicates if the flic device will be able to get/set the
-AIS states for migration via the KVM_DEV_FLIC_AISM_ALL attribute and allows
-to discover this without having to create a flic device.
-
-8.14 KVM_CAP_S390_PSW
-
-Architectures: s390
-
-This capability indicates that the PSW is exposed via the kvm_run structure.
-
-8.15 KVM_CAP_S390_GMAP
-
-Architectures: s390
-
-This capability indicates that the user space memory used as guest mapping can
-be anywhere in the user memory address space, as long as the memory slots are
-aligned and sized to a segment (1MB) boundary.
-
-8.16 KVM_CAP_S390_COW
-
-Architectures: s390
-
-This capability indicates that the user space memory used as guest mapping can
-use copy-on-write semantics as well as dirty pages tracking via read-only page
-tables.
-
-8.17 KVM_CAP_S390_BPB
-
-Architectures: s390
-
-This capability indicates that kvm will implement the interfaces to handle
-reset, migration and nested KVM for branch prediction blocking. The stfle
-facility 82 should not be provided to the guest without this capability.
-
-8.18 KVM_CAP_HYPERV_TLBFLUSH
-
-Architectures: x86
-
-This capability indicates that KVM supports paravirtualized Hyper-V TLB Flush
-hypercalls:
-HvFlushVirtualAddressSpace, HvFlushVirtualAddressSpaceEx,
-HvFlushVirtualAddressList, HvFlushVirtualAddressListEx.
-
-8.19 KVM_CAP_ARM_INJECT_SERROR_ESR
-
-Architectures: arm, arm64
-
-This capability indicates that userspace can specify (via the
-KVM_SET_VCPU_EVENTS ioctl) the syndrome value reported to the guest when it
-takes a virtual SError interrupt exception.
-If KVM advertises this capability, userspace can only specify the ISS field for
-the ESR syndrome. Other parts of the ESR, such as the EC are generated by the
-CPU when the exception is taken. If this virtual SError is taken to EL1 using
-AArch64, this value will be reported in the ISS field of ESR_ELx.
-
-See KVM_CAP_VCPU_EVENTS for more details.
-8.20 KVM_CAP_HYPERV_SEND_IPI
-
-Architectures: x86
-
-This capability indicates that KVM supports paravirtualized Hyper-V IPI send
-hypercalls:
-HvCallSendSyntheticClusterIpi, HvCallSendSyntheticClusterIpiEx.
-8.21 KVM_CAP_HYPERV_DIRECT_TLBFLUSH
-
-Architecture: x86
-
-This capability indicates that KVM running on top of Hyper-V hypervisor
-enables Direct TLB flush for its guests meaning that TLB flush
-hypercalls are handled by Level 0 hypervisor (Hyper-V) bypassing KVM.
-Due to the different ABI for hypercall parameters between Hyper-V and
-KVM, enabling this capability effectively disables all hypercall
-handling by KVM (as some KVM hypercall may be mistakenly treated as TLB
-flush hypercalls by Hyper-V) so userspace should disable KVM identification
-in CPUID and only exposes Hyper-V identification. In this case, guest
-thinks it's running on Hyper-V and only use Hyper-V hypercalls.
-
-8.22 KVM_CAP_S390_VCPU_RESETS
-
-Architectures: s390
-
-This capability indicates that the KVM_S390_NORMAL_RESET and
-KVM_S390_CLEAR_RESET ioctls are available.
diff --git a/Documentation/virt/kvm/arm/hyp-abi.rst b/Documentation/virt/kvm/arm/hyp-abi.rst

new file mode 100644 (file)

index 0000000..d1fc27d
--- /dev/null
+++ b/Documentation/virt/kvm/arm/hyp-abi.rst
@@ -0,0 +1,63 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=======================================
+Internal ABI between the kernel and HYP
+=======================================
+
+This file documents the interaction between the Linux kernel and the
+hypervisor layer when running Linux as a hypervisor (for example
+KVM). It doesn't cover the interaction of the kernel with the
+hypervisor when running as a guest (under Xen, KVM or any other
+hypervisor), or any hypervisor-specific interaction when the kernel is
+used as a host.
+
+On arm and arm64 (without VHE), the kernel doesn't run in hypervisor
+mode, but still needs to interact with it, allowing a built-in
+hypervisor to be either installed or torn down.
+
+In order to achieve this, the kernel must be booted at HYP (arm) or
+EL2 (arm64), allowing it to install a set of stubs before dropping to
+SVC/EL1. These stubs are accessible by using a 'hvc #0' instruction,
+and only act on individual CPUs.
+
+Unless specified otherwise, any built-in hypervisor must implement
+these functions (see arch/arm{,64}/include/asm/virt.h):
+
+* ::
+
+    r0/x0 = HVC_SET_VECTORS
+    r1/x1 = vectors
+
+  Set HVBAR/VBAR_EL2 to 'vectors' to enable a hypervisor. 'vectors'
+  must be a physical address, and respect the alignment requirements
+  of the architecture. Only implemented by the initial stubs, not by
+  Linux hypervisors.
+
+* ::
+
+    r0/x0 = HVC_RESET_VECTORS
+
+  Turn HYP/EL2 MMU off, and reset HVBAR/VBAR_EL2 to the initials
+  stubs' exception vector value. This effectively disables an existing
+  hypervisor.
+
+* ::
+
+    r0/x0 = HVC_SOFT_RESTART
+    r1/x1 = restart address
+    x2 = x0's value when entering the next payload (arm64)
+    x3 = x1's value when entering the next payload (arm64)
+    x4 = x2's value when entering the next payload (arm64)
+
+  Mask all exceptions, disable the MMU, move the arguments into place
+  (arm64 only), and jump to the restart address while at HYP/EL2. This
+  hypercall is not expected to return to its caller.
+
+Any other value of r0/x0 triggers a hypervisor-specific handling,
+which is not documented here.
+
+The return value of a stub hypercall is held by r0/x0, and is 0 on
+success, and HVC_STUB_ERR on error. A stub hypercall is allowed to
+clobber any of the caller-saved registers (x0-x18 on arm64, r0-r3 and
+ip on arm). It is thus recommended to use a function call to perform
+the hypercall.
diff --git a/Documentation/virt/kvm/arm/hyp-abi.txt b/Documentation/virt/kvm/arm/hyp-abi.txt

deleted file mode 100644 (file)

index a20a0be..0000000
--- a/Documentation/virt/kvm/arm/hyp-abi.txt
+++ /dev/null
@@ -1,53 +0,0 @@
-* Internal ABI between the kernel and HYP
-
-This file documents the interaction between the Linux kernel and the
-hypervisor layer when running Linux as a hypervisor (for example
-KVM). It doesn't cover the interaction of the kernel with the
-hypervisor when running as a guest (under Xen, KVM or any other
-hypervisor), or any hypervisor-specific interaction when the kernel is
-used as a host.
-
-On arm and arm64 (without VHE), the kernel doesn't run in hypervisor
-mode, but still needs to interact with it, allowing a built-in
-hypervisor to be either installed or torn down.
-
-In order to achieve this, the kernel must be booted at HYP (arm) or
-EL2 (arm64), allowing it to install a set of stubs before dropping to
-SVC/EL1. These stubs are accessible by using a 'hvc #0' instruction,
-and only act on individual CPUs.
-
-Unless specified otherwise, any built-in hypervisor must implement
-these functions (see arch/arm{,64}/include/asm/virt.h):
-
-* r0/x0 = HVC_SET_VECTORS
-  r1/x1 = vectors
-
-  Set HVBAR/VBAR_EL2 to 'vectors' to enable a hypervisor. 'vectors'
-  must be a physical address, and respect the alignment requirements
-  of the architecture. Only implemented by the initial stubs, not by
-  Linux hypervisors.
-
-* r0/x0 = HVC_RESET_VECTORS
-
-  Turn HYP/EL2 MMU off, and reset HVBAR/VBAR_EL2 to the initials
-  stubs' exception vector value. This effectively disables an existing
-  hypervisor.
-
-* r0/x0 = HVC_SOFT_RESTART
-  r1/x1 = restart address
-  x2 = x0's value when entering the next payload (arm64)
-  x3 = x1's value when entering the next payload (arm64)
-  x4 = x2's value when entering the next payload (arm64)
-
-  Mask all exceptions, disable the MMU, move the arguments into place
-  (arm64 only), and jump to the restart address while at HYP/EL2. This
-  hypercall is not expected to return to its caller.
-
-Any other value of r0/x0 triggers a hypervisor-specific handling,
-which is not documented here.
-
-The return value of a stub hypercall is held by r0/x0, and is 0 on
-success, and HVC_STUB_ERR on error. A stub hypercall is allowed to
-clobber any of the caller-saved registers (x0-x18 on arm64, r0-r3 and
-ip on arm). It is thus recommended to use a function call to perform
-the hypercall.
diff --git a/Documentation/virt/kvm/arm/index.rst b/Documentation/virt/kvm/arm/index.rst

new file mode 100644 (file)

index 0000000..3e2b2ab
--- /dev/null
+++ b/Documentation/virt/kvm/arm/index.rst
@@ -0,0 +1,12 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===
+ARM
+===
+
+.. toctree::
+   :maxdepth: 2
+
+   hyp-abi
+   psci
+   pvtime
diff --git a/Documentation/virt/kvm/arm/psci.rst b/Documentation/virt/kvm/arm/psci.rst

new file mode 100644 (file)

index 0000000..d52c2e8
--- /dev/null
+++ b/Documentation/virt/kvm/arm/psci.rst
@@ -0,0 +1,77 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=========================================
+Power State Coordination Interface (PSCI)
+=========================================
+
+KVM implements the PSCI (Power State Coordination Interface)
+specification in order to provide services such as CPU on/off, reset
+and power-off to the guest.
+
+The PSCI specification is regularly updated to provide new features,
+and KVM implements these updates if they make sense from a virtualization
+point of view.
+
+This means that a guest booted on two different versions of KVM can
+observe two different "firmware" revisions. This could cause issues if
+a given guest is tied to a particular PSCI revision (unlikely), or if
+a migration causes a different PSCI version to be exposed out of the
+blue to an unsuspecting guest.
+
+In order to remedy this situation, KVM exposes a set of "firmware
+pseudo-registers" that can be manipulated using the GET/SET_ONE_REG
+interface. These registers can be saved/restored by userspace, and set
+to a convenient value if required.
+
+The following register is defined:
+
+* KVM_REG_ARM_PSCI_VERSION:
+
+  - Only valid if the vcpu has the KVM_ARM_VCPU_PSCI_0_2 feature set
+    (and thus has already been initialized)
+  - Returns the current PSCI version on GET_ONE_REG (defaulting to the
+    highest PSCI version implemented by KVM and compatible with v0.2)
+  - Allows any PSCI version implemented by KVM and compatible with
+    v0.2 to be set with SET_ONE_REG
+  - Affects the whole VM (even if the register view is per-vcpu)
+
+* KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_1:
+    Holds the state of the firmware support to mitigate CVE-2017-5715, as
+    offered by KVM to the guest via a HVC call. The workaround is described
+    under SMCCC_ARCH_WORKAROUND_1 in [1].
+
+  Accepted values are:
+
+    KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_1_NOT_AVAIL:
+      KVM does not offer
+      firmware support for the workaround. The mitigation status for the
+      guest is unknown.
+    KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_1_AVAIL:
+      The workaround HVC call is
+      available to the guest and required for the mitigation.
+    KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_1_NOT_REQUIRED:
+      The workaround HVC call
+      is available to the guest, but it is not needed on this VCPU.
+
+* KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2:
+    Holds the state of the firmware support to mitigate CVE-2018-3639, as
+    offered by KVM to the guest via a HVC call. The workaround is described
+    under SMCCC_ARCH_WORKAROUND_2 in [1]_.
+
+  Accepted values are:
+
+    KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_NOT_AVAIL:
+      A workaround is not
+      available. KVM does not offer firmware support for the workaround.
+    KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_UNKNOWN:
+      The workaround state is
+      unknown. KVM does not offer firmware support for the workaround.
+    KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_AVAIL:
+      The workaround is available,
+      and can be disabled by a vCPU. If
+      KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_ENABLED is set, it is active for
+      this vCPU.
+    KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_NOT_REQUIRED:
+      The workaround is always active on this vCPU or it is not needed.
+
+.. [1] https://developer.arm.com/-/media/developer/pdf/ARM_DEN_0070A_Firmware_interfaces_for_mitigating_CVE-2017-5715.pdf
diff --git a/Documentation/virt/kvm/arm/psci.txt b/Documentation/virt/kvm/arm/psci.txt

deleted file mode 100644 (file)

index 559586f..0000000
--- a/Documentation/virt/kvm/arm/psci.txt
+++ /dev/null
@@ -1,61 +0,0 @@
-KVM implements the PSCI (Power State Coordination Interface)
-specification in order to provide services such as CPU on/off, reset
-and power-off to the guest.
-
-The PSCI specification is regularly updated to provide new features,
-and KVM implements these updates if they make sense from a virtualization
-point of view.
-
-This means that a guest booted on two different versions of KVM can
-observe two different "firmware" revisions. This could cause issues if
-a given guest is tied to a particular PSCI revision (unlikely), or if
-a migration causes a different PSCI version to be exposed out of the
-blue to an unsuspecting guest.
-
-In order to remedy this situation, KVM exposes a set of "firmware
-pseudo-registers" that can be manipulated using the GET/SET_ONE_REG
-interface. These registers can be saved/restored by userspace, and set
-to a convenient value if required.
-
-The following register is defined:
-
-* KVM_REG_ARM_PSCI_VERSION:
-
-  - Only valid if the vcpu has the KVM_ARM_VCPU_PSCI_0_2 feature set
-    (and thus has already been initialized)
-  - Returns the current PSCI version on GET_ONE_REG (defaulting to the
-    highest PSCI version implemented by KVM and compatible with v0.2)
-  - Allows any PSCI version implemented by KVM and compatible with
-    v0.2 to be set with SET_ONE_REG
-  - Affects the whole VM (even if the register view is per-vcpu)
-
-* KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_1:
-  Holds the state of the firmware support to mitigate CVE-2017-5715, as
-  offered by KVM to the guest via a HVC call. The workaround is described
-  under SMCCC_ARCH_WORKAROUND_1 in [1].
-  Accepted values are:
-    KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_1_NOT_AVAIL: KVM does not offer
-      firmware support for the workaround. The mitigation status for the
-      guest is unknown.
-    KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_1_AVAIL: The workaround HVC call is
-      available to the guest and required for the mitigation.
-    KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_1_NOT_REQUIRED: The workaround HVC call
-      is available to the guest, but it is not needed on this VCPU.
-
-* KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2:
-  Holds the state of the firmware support to mitigate CVE-2018-3639, as
-  offered by KVM to the guest via a HVC call. The workaround is described
-  under SMCCC_ARCH_WORKAROUND_2 in [1].
-  Accepted values are:
-    KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_NOT_AVAIL: A workaround is not
-      available. KVM does not offer firmware support for the workaround.
-    KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_UNKNOWN: The workaround state is
-      unknown. KVM does not offer firmware support for the workaround.
-    KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_AVAIL: The workaround is available,
-      and can be disabled by a vCPU. If
-      KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_ENABLED is set, it is active for
-      this vCPU.
-    KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_NOT_REQUIRED: The workaround is
-      always active on this vCPU or it is not needed.
-
-[1] https://developer.arm.com/-/media/developer/pdf/ARM_DEN_0070A_Firmware_interfaces_for_mitigating_CVE-2017-5715.pdf
diff --git a/Documentation/virt/kvm/devices/arm-vgic-its.rst b/Documentation/virt/kvm/devices/arm-vgic-its.rst

new file mode 100644 (file)

index 0000000..6c304fd
--- /dev/null
+++ b/Documentation/virt/kvm/devices/arm-vgic-its.rst
@@ -0,0 +1,209 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===============================================
+ARM Virtual Interrupt Translation Service (ITS)
+===============================================
+
+Device types supported:
+  KVM_DEV_TYPE_ARM_VGIC_ITS    ARM Interrupt Translation Service Controller
+
+The ITS allows MSI(-X) interrupts to be injected into guests. This extension is
+optional.  Creating a virtual ITS controller also requires a host GICv3 (see
+arm-vgic-v3.txt), but does not depend on having physical ITS controllers.
+
+There can be multiple ITS controllers per guest, each of them has to have
+a separate, non-overlapping MMIO region.
+
+
+Groups
+======
+
+KVM_DEV_ARM_VGIC_GRP_ADDR
+-------------------------
+
+  Attributes:
+    KVM_VGIC_ITS_ADDR_TYPE (rw, 64-bit)
+      Base address in the guest physical address space of the GICv3 ITS
+      control register frame.
+      This address needs to be 64K aligned and the region covers 128K.
+
+  Errors:
+
+    =======  =================================================
+    -E2BIG   Address outside of addressable IPA range
+    -EINVAL  Incorrectly aligned address
+    -EEXIST  Address already configured
+    -EFAULT  Invalid user pointer for attr->addr.
+    -ENODEV  Incorrect attribute or the ITS is not supported.
+    =======  =================================================
+
+
+KVM_DEV_ARM_VGIC_GRP_CTRL
+-------------------------
+
+  Attributes:
+    KVM_DEV_ARM_VGIC_CTRL_INIT
+      request the initialization of the ITS, no additional parameter in
+      kvm_device_attr.addr.
+
+    KVM_DEV_ARM_ITS_CTRL_RESET
+      reset the ITS, no additional parameter in kvm_device_attr.addr.
+      See "ITS Reset State" section.
+
+    KVM_DEV_ARM_ITS_SAVE_TABLES
+      save the ITS table data into guest RAM, at the location provisioned
+      by the guest in corresponding registers/table entries.
+
+      The layout of the tables in guest memory defines an ABI. The entries
+      are laid out in little endian format as described in the last paragraph.
+
+    KVM_DEV_ARM_ITS_RESTORE_TABLES
+      restore the ITS tables from guest RAM to ITS internal structures.
+
+      The GICV3 must be restored before the ITS and all ITS registers but
+      the GITS_CTLR must be restored before restoring the ITS tables.
+
+      The GITS_IIDR read-only register must also be restored before
+      calling KVM_DEV_ARM_ITS_RESTORE_TABLES as the IIDR revision field
+      encodes the ABI revision.
+
+      The expected ordering when restoring the GICv3/ITS is described in section
+      "ITS Restore Sequence".
+
+  Errors:
+
+    =======  ==========================================================
+     -ENXIO  ITS not properly configured as required prior to setting
+             this attribute
+    -ENOMEM  Memory shortage when allocating ITS internal data
+    -EINVAL  Inconsistent restored data
+    -EFAULT  Invalid guest ram access
+    -EBUSY   One or more VCPUS are running
+    -EACCES  The virtual ITS is backed by a physical GICv4 ITS, and the
+            state is not available
+    =======  ==========================================================
+
+KVM_DEV_ARM_VGIC_GRP_ITS_REGS
+-----------------------------
+
+  Attributes:
+      The attr field of kvm_device_attr encodes the offset of the
+      ITS register, relative to the ITS control frame base address
+      (ITS_base).
+
+      kvm_device_attr.addr points to a __u64 value whatever the width
+      of the addressed register (32/64 bits). 64 bit registers can only
+      be accessed with full length.
+
+      Writes to read-only registers are ignored by the kernel except for:
+
+      - GITS_CREADR. It must be restored otherwise commands in the queue
+        will be re-executed after restoring CWRITER. GITS_CREADR must be
+        restored before restoring the GITS_CTLR which is likely to enable the
+        ITS. Also it must be restored after GITS_CBASER since a write to
+        GITS_CBASER resets GITS_CREADR.
+      - GITS_IIDR. The Revision field encodes the table layout ABI revision.
+        In the future we might implement direct injection of virtual LPIs.
+        This will require an upgrade of the table layout and an evolution of
+        the ABI. GITS_IIDR must be restored before calling
+        KVM_DEV_ARM_ITS_RESTORE_TABLES.
+
+      For other registers, getting or setting a register has the same
+      effect as reading/writing the register on real hardware.
+
+  Errors:
+
+    =======  ====================================================
+    -ENXIO   Offset does not correspond to any supported register
+    -EFAULT  Invalid user pointer for attr->addr
+    -EINVAL  Offset is not 64-bit aligned
+    -EBUSY   one or more VCPUS are running
+    =======  ====================================================
+
+ITS Restore Sequence:
+---------------------
+
+The following ordering must be followed when restoring the GIC and the ITS:
+
+a) restore all guest memory and create vcpus
+b) restore all redistributors
+c) provide the ITS base address
+   (KVM_DEV_ARM_VGIC_GRP_ADDR)
+d) restore the ITS in the following order:
+
+     1. Restore GITS_CBASER
+     2. Restore all other ``GITS_`` registers, except GITS_CTLR!
+     3. Load the ITS table data (KVM_DEV_ARM_ITS_RESTORE_TABLES)
+     4. Restore GITS_CTLR
+
+Then vcpus can be started.
+
+ITS Table ABI REV0:
+-------------------
+
+ Revision 0 of the ABI only supports the features of a virtual GICv3, and does
+ not support a virtual GICv4 with support for direct injection of virtual
+ interrupts for nested hypervisors.
+
+ The device table and ITT are indexed by the DeviceID and EventID,
+ respectively. The collection table is not indexed by CollectionID, and the
+ entries in the collection are listed in no particular order.
+ All entries are 8 bytes.
+
+ Device Table Entry (DTE)::
+
+   bits:     | 63| 62 ... 49 | 48 ... 5 | 4 ... 0 |
+   values:   | V |   next    | ITT_addr |  Size   |
+
+ where:
+
+ - V indicates whether the entry is valid. If not, other fields
+   are not meaningful.
+ - next: equals to 0 if this entry is the last one; otherwise it
+   corresponds to the DeviceID offset to the next DTE, capped by
+   2^14 -1.
+ - ITT_addr matches bits [51:8] of the ITT address (256 Byte aligned).
+ - Size specifies the supported number of bits for the EventID,
+   minus one
+
+ Collection Table Entry (CTE)::
+
+   bits:     | 63| 62 ..  52  | 51 ... 16 | 15  ...   0 |
+   values:   | V |    RES0    |  RDBase   |    ICID     |
+
+ where:
+
+ - V indicates whether the entry is valid. If not, other fields are
+   not meaningful.
+ - RES0: reserved field with Should-Be-Zero-or-Preserved behavior.
+ - RDBase is the PE number (GICR_TYPER.Processor_Number semantic),
+ - ICID is the collection ID
+
+ Interrupt Translation Entry (ITE)::
+
+   bits:     | 63 ... 48 | 47 ... 16 | 15 ... 0 |
+   values:   |    next   |   pINTID  |  ICID    |
+
+ where:
+
+ - next: equals to 0 if this entry is the last one; otherwise it corresponds
+   to the EventID offset to the next ITE capped by 2^16 -1.
+ - pINTID is the physical LPI ID; if zero, it means the entry is not valid
+   and other fields are not meaningful.
+ - ICID is the collection ID
+
+ITS Reset State:
+----------------
+
+RESET returns the ITS to the same state that it was when first created and
+initialized. When the RESET command returns, the following things are
+guaranteed:
+
+- The ITS is not enabled and quiescent
+  GITS_CTLR.Enabled = 0 .Quiescent=1
+- There is no internally cached state
+- No collection or device table are used
+  GITS_BASER<n>.Valid = 0
+- GITS_CBASER = 0, GITS_CREADR = 0, GITS_CWRITER = 0
+- The ABI version is unchanged and remains the one set when the ITS
+  device was first created.
diff --git a/Documentation/virt/kvm/devices/arm-vgic-its.txt b/Documentation/virt/kvm/devices/arm-vgic-its.txt

deleted file mode 100644 (file)

index eeaa95b..0000000
--- a/Documentation/virt/kvm/devices/arm-vgic-its.txt
+++ /dev/null
@@ -1,181 +0,0 @@
-ARM Virtual Interrupt Translation Service (ITS)
-===============================================
-
-Device types supported:
-  KVM_DEV_TYPE_ARM_VGIC_ITS    ARM Interrupt Translation Service Controller
-
-The ITS allows MSI(-X) interrupts to be injected into guests. This extension is
-optional.  Creating a virtual ITS controller also requires a host GICv3 (see
-arm-vgic-v3.txt), but does not depend on having physical ITS controllers.
-
-There can be multiple ITS controllers per guest, each of them has to have
-a separate, non-overlapping MMIO region.
-
-
-Groups:
-  KVM_DEV_ARM_VGIC_GRP_ADDR
-  Attributes:
-    KVM_VGIC_ITS_ADDR_TYPE (rw, 64-bit)
-      Base address in the guest physical address space of the GICv3 ITS
-      control register frame.
-      This address needs to be 64K aligned and the region covers 128K.
-  Errors:
-    -E2BIG:  Address outside of addressable IPA range
-    -EINVAL: Incorrectly aligned address
-    -EEXIST: Address already configured
-    -EFAULT: Invalid user pointer for attr->addr.
-    -ENODEV: Incorrect attribute or the ITS is not supported.
-
-
-  KVM_DEV_ARM_VGIC_GRP_CTRL
-  Attributes:
-    KVM_DEV_ARM_VGIC_CTRL_INIT
-      request the initialization of the ITS, no additional parameter in
-      kvm_device_attr.addr.
-
-    KVM_DEV_ARM_ITS_CTRL_RESET
-      reset the ITS, no additional parameter in kvm_device_attr.addr.
-      See "ITS Reset State" section.
-
-    KVM_DEV_ARM_ITS_SAVE_TABLES
-      save the ITS table data into guest RAM, at the location provisioned
-      by the guest in corresponding registers/table entries.
-
-      The layout of the tables in guest memory defines an ABI. The entries
-      are laid out in little endian format as described in the last paragraph.
-
-    KVM_DEV_ARM_ITS_RESTORE_TABLES
-      restore the ITS tables from guest RAM to ITS internal structures.
-
-      The GICV3 must be restored before the ITS and all ITS registers but
-      the GITS_CTLR must be restored before restoring the ITS tables.
-
-      The GITS_IIDR read-only register must also be restored before
-      calling KVM_DEV_ARM_ITS_RESTORE_TABLES as the IIDR revision field
-      encodes the ABI revision.
-
-      The expected ordering when restoring the GICv3/ITS is described in section
-      "ITS Restore Sequence".
-
-  Errors:
-    -ENXIO:  ITS not properly configured as required prior to setting
-             this attribute
-    -ENOMEM: Memory shortage when allocating ITS internal data
-    -EINVAL: Inconsistent restored data
-    -EFAULT: Invalid guest ram access
-    -EBUSY:  One or more VCPUS are running
-    -EACCES: The virtual ITS is backed by a physical GICv4 ITS, and the
-            state is not available
-
-  KVM_DEV_ARM_VGIC_GRP_ITS_REGS
-  Attributes:
-      The attr field of kvm_device_attr encodes the offset of the
-      ITS register, relative to the ITS control frame base address
-      (ITS_base).
-
-      kvm_device_attr.addr points to a __u64 value whatever the width
-      of the addressed register (32/64 bits). 64 bit registers can only
-      be accessed with full length.
-
-      Writes to read-only registers are ignored by the kernel except for:
-      - GITS_CREADR. It must be restored otherwise commands in the queue
-        will be re-executed after restoring CWRITER. GITS_CREADR must be
-        restored before restoring the GITS_CTLR which is likely to enable the
-        ITS. Also it must be restored after GITS_CBASER since a write to
-        GITS_CBASER resets GITS_CREADR.
-      - GITS_IIDR. The Revision field encodes the table layout ABI revision.
-        In the future we might implement direct injection of virtual LPIs.
-        This will require an upgrade of the table layout and an evolution of
-        the ABI. GITS_IIDR must be restored before calling
-        KVM_DEV_ARM_ITS_RESTORE_TABLES.
-
-      For other registers, getting or setting a register has the same
-      effect as reading/writing the register on real hardware.
-  Errors:
-    -ENXIO: Offset does not correspond to any supported register
-    -EFAULT: Invalid user pointer for attr->addr
-    -EINVAL: Offset is not 64-bit aligned
-    -EBUSY: one or more VCPUS are running
-
- ITS Restore Sequence:
- -------------------------
-
-The following ordering must be followed when restoring the GIC and the ITS:
-a) restore all guest memory and create vcpus
-b) restore all redistributors
-c) provide the ITS base address
-   (KVM_DEV_ARM_VGIC_GRP_ADDR)
-d) restore the ITS in the following order:
-   1. Restore GITS_CBASER
-   2. Restore all other GITS_ registers, except GITS_CTLR!
-   3. Load the ITS table data (KVM_DEV_ARM_ITS_RESTORE_TABLES)
-   4. Restore GITS_CTLR
-
-Then vcpus can be started.
-
- ITS Table ABI REV0:
- -------------------
-
- Revision 0 of the ABI only supports the features of a virtual GICv3, and does
- not support a virtual GICv4 with support for direct injection of virtual
- interrupts for nested hypervisors.
-
- The device table and ITT are indexed by the DeviceID and EventID,
- respectively. The collection table is not indexed by CollectionID, and the
- entries in the collection are listed in no particular order.
- All entries are 8 bytes.
-
- Device Table Entry (DTE):
-
- bits:     | 63| 62 ... 49 | 48 ... 5 | 4 ... 0 |
- values:   | V |   next    | ITT_addr |  Size   |
-
- where;
- - V indicates whether the entry is valid. If not, other fields
-   are not meaningful.
- - next: equals to 0 if this entry is the last one; otherwise it
-   corresponds to the DeviceID offset to the next DTE, capped by
-   2^14 -1.
- - ITT_addr matches bits [51:8] of the ITT address (256 Byte aligned).
- - Size specifies the supported number of bits for the EventID,
-   minus one
-
- Collection Table Entry (CTE):
-
- bits:     | 63| 62 ..  52  | 51 ... 16 | 15  ...   0 |
- values:   | V |    RES0    |  RDBase   |    ICID     |
-
- where:
- - V indicates whether the entry is valid. If not, other fields are
-   not meaningful.
- - RES0: reserved field with Should-Be-Zero-or-Preserved behavior.
- - RDBase is the PE number (GICR_TYPER.Processor_Number semantic),
- - ICID is the collection ID
-
- Interrupt Translation Entry (ITE):
-
- bits:     | 63 ... 48 | 47 ... 16 | 15 ... 0 |
- values:   |    next   |   pINTID  |  ICID    |
-
- where:
- - next: equals to 0 if this entry is the last one; otherwise it corresponds
-   to the EventID offset to the next ITE capped by 2^16 -1.
- - pINTID is the physical LPI ID; if zero, it means the entry is not valid
-   and other fields are not meaningful.
- - ICID is the collection ID
-
- ITS Reset State:
- ----------------
-
-RESET returns the ITS to the same state that it was when first created and
-initialized. When the RESET command returns, the following things are
-guaranteed:
-
-- The ITS is not enabled and quiescent
-  GITS_CTLR.Enabled = 0 .Quiescent=1
-- There is no internally cached state
-- No collection or device table are used
-  GITS_BASER<n>.Valid = 0
-- GITS_CBASER = 0, GITS_CREADR = 0, GITS_CWRITER = 0
-- The ABI version is unchanged and remains the one set when the ITS
-  device was first created.
diff --git a/Documentation/virt/kvm/devices/arm-vgic-v3.rst b/Documentation/virt/kvm/devices/arm-vgic-v3.rst

new file mode 100644 (file)

index 0000000..5dd3bff
--- /dev/null
+++ b/Documentation/virt/kvm/devices/arm-vgic-v3.rst
@@ -0,0 +1,291 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==============================================================
+ARM Virtual Generic Interrupt Controller v3 and later (VGICv3)
+==============================================================
+
+
+Device types supported:
+  - KVM_DEV_TYPE_ARM_VGIC_V3     ARM Generic Interrupt Controller v3.0
+
+Only one VGIC instance may be instantiated through this API.  The created VGIC
+will act as the VM interrupt controller, requiring emulated user-space devices
+to inject interrupts to the VGIC instead of directly to CPUs.  It is not
+possible to create both a GICv3 and GICv2 on the same VM.
+
+Creating a guest GICv3 device requires a host GICv3 as well.
+
+
+Groups:
+  KVM_DEV_ARM_VGIC_GRP_ADDR
+   Attributes:
+
+    KVM_VGIC_V3_ADDR_TYPE_DIST (rw, 64-bit)
+      Base address in the guest physical address space of the GICv3 distributor
+      register mappings. Only valid for KVM_DEV_TYPE_ARM_VGIC_V3.
+      This address needs to be 64K aligned and the region covers 64 KByte.
+
+    KVM_VGIC_V3_ADDR_TYPE_REDIST (rw, 64-bit)
+      Base address in the guest physical address space of the GICv3
+      redistributor register mappings. There are two 64K pages for each
+      VCPU and all of the redistributor pages are contiguous.
+      Only valid for KVM_DEV_TYPE_ARM_VGIC_V3.
+      This address needs to be 64K aligned.
+
+    KVM_VGIC_V3_ADDR_TYPE_REDIST_REGION (rw, 64-bit)
+      The attribute data pointed to by kvm_device_attr.addr is a __u64 value::
+
+        bits:     | 63   ....  52  |  51   ....   16 | 15 - 12  |11 - 0
+        values:   |     count      |       base      |  flags   | index
+
+      - index encodes the unique redistributor region index
+      - flags: reserved for future use, currently 0
+      - base field encodes bits [51:16] of the guest physical base address
+        of the first redistributor in the region.
+      - count encodes the number of redistributors in the region. Must be
+        greater than 0.
+
+      There are two 64K pages for each redistributor in the region and
+      redistributors are laid out contiguously within the region. Regions
+      are filled with redistributors in the index order. The sum of all
+      region count fields must be greater than or equal to the number of
+      VCPUs. Redistributor regions must be registered in the incremental
+      index order, starting from index 0.
+
+      The characteristics of a specific redistributor region can be read
+      by presetting the index field in the attr data.
+      Only valid for KVM_DEV_TYPE_ARM_VGIC_V3.
+
+  It is invalid to mix calls with KVM_VGIC_V3_ADDR_TYPE_REDIST and
+  KVM_VGIC_V3_ADDR_TYPE_REDIST_REGION attributes.
+
+  Errors:
+
+    =======  =============================================================
+    -E2BIG   Address outside of addressable IPA range
+    -EINVAL  Incorrectly aligned address, bad redistributor region
+             count/index, mixed redistributor region attribute usage
+    -EEXIST  Address already configured
+    -ENOENT  Attempt to read the characteristics of a non existing
+             redistributor region
+    -ENXIO   The group or attribute is unknown/unsupported for this device
+             or hardware support is missing.
+    -EFAULT  Invalid user pointer for attr->addr.
+    =======  =============================================================
+
+
+  KVM_DEV_ARM_VGIC_GRP_DIST_REGS, KVM_DEV_ARM_VGIC_GRP_REDIST_REGS
+   Attributes:
+
+    The attr field of kvm_device_attr encodes two values::
+
+      bits:     | 63   ....  32  |  31   ....    0 |
+      values:   |      mpidr     |      offset     |
+
+    All distributor regs are (rw, 32-bit) and kvm_device_attr.addr points to a
+    __u32 value.  64-bit registers must be accessed by separately accessing the
+    lower and higher word.
+
+    Writes to read-only registers are ignored by the kernel.
+
+    KVM_DEV_ARM_VGIC_GRP_DIST_REGS accesses the main distributor registers.
+    KVM_DEV_ARM_VGIC_GRP_REDIST_REGS accesses the redistributor of the CPU
+    specified by the mpidr.
+
+    The offset is relative to the "[Re]Distributor base address" as defined
+    in the GICv3/4 specs.  Getting or setting such a register has the same
+    effect as reading or writing the register on real hardware, except for the
+    following registers: GICD_STATUSR, GICR_STATUSR, GICD_ISPENDR,
+    GICR_ISPENDR0, GICD_ICPENDR, and GICR_ICPENDR0.  These registers behave
+    differently when accessed via this interface compared to their
+    architecturally defined behavior to allow software a full view of the
+    VGIC's internal state.
+
+    The mpidr field is used to specify which
+    redistributor is accessed.  The mpidr is ignored for the distributor.
+
+    The mpidr encoding is based on the affinity information in the
+    architecture defined MPIDR, and the field is encoded as follows::
+
+      | 63 .... 56 | 55 .... 48 | 47 .... 40 | 39 .... 32 |
+      |    Aff3    |    Aff2    |    Aff1    |    Aff0    |
+
+    Note that distributor fields are not banked, but return the same value
+    regardless of the mpidr used to access the register.
+
+    GICD_IIDR.Revision is updated when the KVM implementation is changed in a
+    way directly observable by the guest or userspace.  Userspace should read
+    GICD_IIDR from KVM and write back the read value to confirm its expected
+    behavior is aligned with the KVM implementation.  Userspace should set
+    GICD_IIDR before setting any other registers to ensure the expected
+    behavior.
+
+
+    The GICD_STATUSR and GICR_STATUSR registers are architecturally defined such
+    that a write of a clear bit has no effect, whereas a write with a set bit
+    clears that value.  To allow userspace to freely set the values of these two
+    registers, setting the attributes with the register offsets for these two
+    registers simply sets the non-reserved bits to the value written.
+
+
+    Accesses (reads and writes) to the GICD_ISPENDR register region and
+    GICR_ISPENDR0 registers get/set the value of the latched pending state for
+    the interrupts.
+
+    This is identical to the value returned by a guest read from ISPENDR for an
+    edge triggered interrupt, but may differ for level triggered interrupts.
+    For edge triggered interrupts, once an interrupt becomes pending (whether
+    because of an edge detected on the input line or because of a guest write
+    to ISPENDR) this state is "latched", and only cleared when either the
+    interrupt is activated or when the guest writes to ICPENDR. A level
+    triggered interrupt may be pending either because the level input is held
+    high by a device, or because of a guest write to the ISPENDR register. Only
+    ISPENDR writes are latched; if the device lowers the line level then the
+    interrupt is no longer pending unless the guest also wrote to ISPENDR, and
+    conversely writes to ICPENDR or activations of the interrupt do not clear
+    the pending status if the line level is still being held high.  (These
+    rules are documented in the GICv3 specification descriptions of the ICPENDR
+    and ISPENDR registers.) For a level triggered interrupt the value accessed
+    here is that of the latch which is set by ISPENDR and cleared by ICPENDR or
+    interrupt activation, whereas the value returned by a guest read from
+    ISPENDR is the logical OR of the latch value and the input line level.
+
+    Raw access to the latch state is provided to userspace so that it can save
+    and restore the entire GIC internal state (which is defined by the
+    combination of the current input line level and the latch state, and cannot
+    be deduced from purely the line level and the value of the ISPENDR
+    registers).
+
+    Accesses to GICD_ICPENDR register region and GICR_ICPENDR0 registers have
+    RAZ/WI semantics, meaning that reads always return 0 and writes are always
+    ignored.
+
+  Errors:
+
+    ======  =====================================================
+    -ENXIO  Getting or setting this register is not yet supported
+    -EBUSY  One or more VCPUs are running
+    ======  =====================================================
+
+
+  KVM_DEV_ARM_VGIC_GRP_CPU_SYSREGS
+   Attributes:
+
+    The attr field of kvm_device_attr encodes two values::
+
+      bits:     | 63      ....       32 | 31  ....  16 | 15  ....  0 |
+      values:   |         mpidr         |      RES     |    instr    |
+
+    The mpidr field encodes the CPU ID based on the affinity information in the
+    architecture defined MPIDR, and the field is encoded as follows::
+
+      | 63 .... 56 | 55 .... 48 | 47 .... 40 | 39 .... 32 |
+      |    Aff3    |    Aff2    |    Aff1    |    Aff0    |
+
+    The instr field encodes the system register to access based on the fields
+    defined in the A64 instruction set encoding for system register access
+    (RES means the bits are reserved for future use and should be zero)::
+
+      | 15 ... 14 | 13 ... 11 | 10 ... 7 | 6 ... 3 | 2 ... 0 |
+      |   Op 0    |    Op1    |    CRn   |   CRm   |   Op2   |
+
+    All system regs accessed through this API are (rw, 64-bit) and
+    kvm_device_attr.addr points to a __u64 value.
+
+    KVM_DEV_ARM_VGIC_GRP_CPU_SYSREGS accesses the CPU interface registers for the
+    CPU specified by the mpidr field.
+
+    CPU interface registers access is not implemented for AArch32 mode.
+    Error -ENXIO is returned when accessed in AArch32 mode.
+
+  Errors:
+
+    =======  =====================================================
+    -ENXIO   Getting or setting this register is not yet supported
+    -EBUSY   VCPU is running
+    -EINVAL  Invalid mpidr or register value supplied
+    =======  =====================================================
+
+
+  KVM_DEV_ARM_VGIC_GRP_NR_IRQS
+   Attributes:
+
+    A value describing the number of interrupts (SGI, PPI and SPI) for
+    this GIC instance, ranging from 64 to 1024, in increments of 32.
+
+    kvm_device_attr.addr points to a __u32 value.
+
+  Errors:
+
+    =======  ======================================
+    -EINVAL  Value set is out of the expected range
+    -EBUSY   Value has already be set.
+    =======  ======================================
+
+
+  KVM_DEV_ARM_VGIC_GRP_CTRL
+   Attributes:
+
+    KVM_DEV_ARM_VGIC_CTRL_INIT
+      request the initialization of the VGIC, no additional parameter in
+      kvm_device_attr.addr.
+    KVM_DEV_ARM_VGIC_SAVE_PENDING_TABLES
+      save all LPI pending bits into guest RAM pending tables.
+
+      The first kB of the pending table is not altered by this operation.
+
+  Errors:
+
+    =======  ========================================================
+    -ENXIO   VGIC not properly configured as required prior to calling
+             this attribute
+    -ENODEV  no online VCPU
+    -ENOMEM  memory shortage when allocating vgic internal data
+    -EFAULT  Invalid guest ram access
+    -EBUSY   One or more VCPUS are running
+    =======  ========================================================
+
+
+  KVM_DEV_ARM_VGIC_GRP_LEVEL_INFO
+   Attributes:
+
+    The attr field of kvm_device_attr encodes the following values::
+
+      bits:     | 63      ....       32 | 31   ....    10 | 9  ....  0 |
+      values:   |         mpidr         |      info       |   vINTID   |
+
+    The vINTID specifies which set of IRQs is reported on.
+
+    The info field specifies which information userspace wants to get or set
+    using this interface.  Currently we support the following info values:
+
+      VGIC_LEVEL_INFO_LINE_LEVEL:
+       Get/Set the input level of the IRQ line for a set of 32 contiguously
+       numbered interrupts.
+
+       vINTID must be a multiple of 32.
+
+       kvm_device_attr.addr points to a __u32 value which will contain a
+       bitmap where a set bit means the interrupt level is asserted.
+
+       Bit[n] indicates the status for interrupt vINTID + n.
+
+    SGIs and any interrupt with a higher ID than the number of interrupts
+    supported, will be RAZ/WI.  LPIs are always edge-triggered and are
+    therefore not supported by this interface.
+
+    PPIs are reported per VCPU as specified in the mpidr field, and SPIs are
+    reported with the same value regardless of the mpidr specified.
+
+    The mpidr field encodes the CPU ID based on the affinity information in the
+    architecture defined MPIDR, and the field is encoded as follows::
+
+      | 63 .... 56 | 55 .... 48 | 47 .... 40 | 39 .... 32 |
+      |    Aff3    |    Aff2    |    Aff1    |    Aff0    |
+
+  Errors:
+
+    =======  =============================================
+    -EINVAL  vINTID is not multiple of 32 or info field is
+            not VGIC_LEVEL_INFO_LINE_LEVEL
+    =======  =============================================
diff --git a/Documentation/virt/kvm/devices/arm-vgic-v3.txt b/Documentation/virt/kvm/devices/arm-vgic-v3.txt

deleted file mode 100644 (file)

index ff290b4..0000000
--- a/Documentation/virt/kvm/devices/arm-vgic-v3.txt
+++ /dev/null
@@ -1,251 +0,0 @@
-ARM Virtual Generic Interrupt Controller v3 and later (VGICv3)
-==============================================================
-
-
-Device types supported:
-  KVM_DEV_TYPE_ARM_VGIC_V3     ARM Generic Interrupt Controller v3.0
-
-Only one VGIC instance may be instantiated through this API.  The created VGIC
-will act as the VM interrupt controller, requiring emulated user-space devices
-to inject interrupts to the VGIC instead of directly to CPUs.  It is not
-possible to create both a GICv3 and GICv2 on the same VM.
-
-Creating a guest GICv3 device requires a host GICv3 as well.
-
-
-Groups:
-  KVM_DEV_ARM_VGIC_GRP_ADDR
-  Attributes:
-    KVM_VGIC_V3_ADDR_TYPE_DIST (rw, 64-bit)
-      Base address in the guest physical address space of the GICv3 distributor
-      register mappings. Only valid for KVM_DEV_TYPE_ARM_VGIC_V3.
-      This address needs to be 64K aligned and the region covers 64 KByte.
-
-    KVM_VGIC_V3_ADDR_TYPE_REDIST (rw, 64-bit)
-      Base address in the guest physical address space of the GICv3
-      redistributor register mappings. There are two 64K pages for each
-      VCPU and all of the redistributor pages are contiguous.
-      Only valid for KVM_DEV_TYPE_ARM_VGIC_V3.
-      This address needs to be 64K aligned.
-
-    KVM_VGIC_V3_ADDR_TYPE_REDIST_REGION (rw, 64-bit)
-      The attribute data pointed to by kvm_device_attr.addr is a __u64 value:
-      bits:     | 63   ....  52  |  51   ....   16 | 15 - 12  |11 - 0
-      values:   |     count      |       base      |  flags   | index
-      - index encodes the unique redistributor region index
-      - flags: reserved for future use, currently 0
-      - base field encodes bits [51:16] of the guest physical base address
-        of the first redistributor in the region.
-      - count encodes the number of redistributors in the region. Must be
-        greater than 0.
-      There are two 64K pages for each redistributor in the region and
-      redistributors are laid out contiguously within the region. Regions
-      are filled with redistributors in the index order. The sum of all
-      region count fields must be greater than or equal to the number of
-      VCPUs. Redistributor regions must be registered in the incremental
-      index order, starting from index 0.
-      The characteristics of a specific redistributor region can be read
-      by presetting the index field in the attr data.
-      Only valid for KVM_DEV_TYPE_ARM_VGIC_V3.
-
-  It is invalid to mix calls with KVM_VGIC_V3_ADDR_TYPE_REDIST and
-  KVM_VGIC_V3_ADDR_TYPE_REDIST_REGION attributes.
-
-  Errors:
-    -E2BIG:  Address outside of addressable IPA range
-    -EINVAL: Incorrectly aligned address, bad redistributor region
-             count/index, mixed redistributor region attribute usage
-    -EEXIST: Address already configured
-    -ENOENT: Attempt to read the characteristics of a non existing
-             redistributor region
-    -ENXIO:  The group or attribute is unknown/unsupported for this device
-             or hardware support is missing.
-    -EFAULT: Invalid user pointer for attr->addr.
-
-
-  KVM_DEV_ARM_VGIC_GRP_DIST_REGS
-  KVM_DEV_ARM_VGIC_GRP_REDIST_REGS
-  Attributes:
-    The attr field of kvm_device_attr encodes two values:
-    bits:     | 63   ....  32  |  31   ....    0 |
-    values:   |      mpidr     |      offset     |
-
-    All distributor regs are (rw, 32-bit) and kvm_device_attr.addr points to a
-    __u32 value.  64-bit registers must be accessed by separately accessing the
-    lower and higher word.
-
-    Writes to read-only registers are ignored by the kernel.
-
-    KVM_DEV_ARM_VGIC_GRP_DIST_REGS accesses the main distributor registers.
-    KVM_DEV_ARM_VGIC_GRP_REDIST_REGS accesses the redistributor of the CPU
-    specified by the mpidr.
-
-    The offset is relative to the "[Re]Distributor base address" as defined
-    in the GICv3/4 specs.  Getting or setting such a register has the same
-    effect as reading or writing the register on real hardware, except for the
-    following registers: GICD_STATUSR, GICR_STATUSR, GICD_ISPENDR,
-    GICR_ISPENDR0, GICD_ICPENDR, and GICR_ICPENDR0.  These registers behave
-    differently when accessed via this interface compared to their
-    architecturally defined behavior to allow software a full view of the
-    VGIC's internal state.
-
-    The mpidr field is used to specify which
-    redistributor is accessed.  The mpidr is ignored for the distributor.
-
-    The mpidr encoding is based on the affinity information in the
-    architecture defined MPIDR, and the field is encoded as follows:
-      | 63 .... 56 | 55 .... 48 | 47 .... 40 | 39 .... 32 |
-      |    Aff3    |    Aff2    |    Aff1    |    Aff0    |
-
-    Note that distributor fields are not banked, but return the same value
-    regardless of the mpidr used to access the register.
-
-    GICD_IIDR.Revision is updated when the KVM implementation is changed in a
-    way directly observable by the guest or userspace.  Userspace should read
-    GICD_IIDR from KVM and write back the read value to confirm its expected
-    behavior is aligned with the KVM implementation.  Userspace should set
-    GICD_IIDR before setting any other registers to ensure the expected
-    behavior.
-
-
-    The GICD_STATUSR and GICR_STATUSR registers are architecturally defined such
-    that a write of a clear bit has no effect, whereas a write with a set bit
-    clears that value.  To allow userspace to freely set the values of these two
-    registers, setting the attributes with the register offsets for these two
-    registers simply sets the non-reserved bits to the value written.
-
-
-    Accesses (reads and writes) to the GICD_ISPENDR register region and
-    GICR_ISPENDR0 registers get/set the value of the latched pending state for
-    the interrupts.
-
-    This is identical to the value returned by a guest read from ISPENDR for an
-    edge triggered interrupt, but may differ for level triggered interrupts.
-    For edge triggered interrupts, once an interrupt becomes pending (whether
-    because of an edge detected on the input line or because of a guest write
-    to ISPENDR) this state is "latched", and only cleared when either the
-    interrupt is activated or when the guest writes to ICPENDR. A level
-    triggered interrupt may be pending either because the level input is held
-    high by a device, or because of a guest write to the ISPENDR register. Only
-    ISPENDR writes are latched; if the device lowers the line level then the
-    interrupt is no longer pending unless the guest also wrote to ISPENDR, and
-    conversely writes to ICPENDR or activations of the interrupt do not clear
-    the pending status if the line level is still being held high.  (These
-    rules are documented in the GICv3 specification descriptions of the ICPENDR
-    and ISPENDR registers.) For a level triggered interrupt the value accessed
-    here is that of the latch which is set by ISPENDR and cleared by ICPENDR or
-    interrupt activation, whereas the value returned by a guest read from
-    ISPENDR is the logical OR of the latch value and the input line level.
-
-    Raw access to the latch state is provided to userspace so that it can save
-    and restore the entire GIC internal state (which is defined by the
-    combination of the current input line level and the latch state, and cannot
-    be deduced from purely the line level and the value of the ISPENDR
-    registers).
-
-    Accesses to GICD_ICPENDR register region and GICR_ICPENDR0 registers have
-    RAZ/WI semantics, meaning that reads always return 0 and writes are always
-    ignored.
-
-  Errors:
-    -ENXIO: Getting or setting this register is not yet supported
-    -EBUSY: One or more VCPUs are running
-
-
-  KVM_DEV_ARM_VGIC_GRP_CPU_SYSREGS
-  Attributes:
-    The attr field of kvm_device_attr encodes two values:
-    bits:     | 63      ....       32 | 31  ....  16 | 15  ....  0 |
-    values:   |         mpidr         |      RES     |    instr    |
-
-    The mpidr field encodes the CPU ID based on the affinity information in the
-    architecture defined MPIDR, and the field is encoded as follows:
-      | 63 .... 56 | 55 .... 48 | 47 .... 40 | 39 .... 32 |
-      |    Aff3    |    Aff2    |    Aff1    |    Aff0    |
-
-    The instr field encodes the system register to access based on the fields
-    defined in the A64 instruction set encoding for system register access
-    (RES means the bits are reserved for future use and should be zero):
-
-      | 15 ... 14 | 13 ... 11 | 10 ... 7 | 6 ... 3 | 2 ... 0 |
-      |   Op 0    |    Op1    |    CRn   |   CRm   |   Op2   |
-
-    All system regs accessed through this API are (rw, 64-bit) and
-    kvm_device_attr.addr points to a __u64 value.
-
-    KVM_DEV_ARM_VGIC_GRP_CPU_SYSREGS accesses the CPU interface registers for the
-    CPU specified by the mpidr field.
-
-    CPU interface registers access is not implemented for AArch32 mode.
-    Error -ENXIO is returned when accessed in AArch32 mode.
-  Errors:
-    -ENXIO: Getting or setting this register is not yet supported
-    -EBUSY: VCPU is running
-    -EINVAL: Invalid mpidr or register value supplied
-
-
-  KVM_DEV_ARM_VGIC_GRP_NR_IRQS
-  Attributes:
-    A value describing the number of interrupts (SGI, PPI and SPI) for
-    this GIC instance, ranging from 64 to 1024, in increments of 32.
-
-    kvm_device_attr.addr points to a __u32 value.
-
-  Errors:
-    -EINVAL: Value set is out of the expected range
-    -EBUSY: Value has already be set.
-
-
-  KVM_DEV_ARM_VGIC_GRP_CTRL
-  Attributes:
-    KVM_DEV_ARM_VGIC_CTRL_INIT
-      request the initialization of the VGIC, no additional parameter in
-      kvm_device_attr.addr.
-    KVM_DEV_ARM_VGIC_SAVE_PENDING_TABLES
-      save all LPI pending bits into guest RAM pending tables.
-
-      The first kB of the pending table is not altered by this operation.
-  Errors:
-    -ENXIO: VGIC not properly configured as required prior to calling
-     this attribute
-    -ENODEV: no online VCPU
-    -ENOMEM: memory shortage when allocating vgic internal data
-    -EFAULT: Invalid guest ram access
-    -EBUSY:  One or more VCPUS are running
-
-
-  KVM_DEV_ARM_VGIC_GRP_LEVEL_INFO
-  Attributes:
-    The attr field of kvm_device_attr encodes the following values:
-    bits:     | 63      ....       32 | 31   ....    10 | 9  ....  0 |
-    values:   |         mpidr         |      info       |   vINTID   |
-
-    The vINTID specifies which set of IRQs is reported on.
-
-    The info field specifies which information userspace wants to get or set
-    using this interface.  Currently we support the following info values:
-
-      VGIC_LEVEL_INFO_LINE_LEVEL:
-       Get/Set the input level of the IRQ line for a set of 32 contiguously
-       numbered interrupts.
-       vINTID must be a multiple of 32.
-
-       kvm_device_attr.addr points to a __u32 value which will contain a
-       bitmap where a set bit means the interrupt level is asserted.
-
-       Bit[n] indicates the status for interrupt vINTID + n.
-
-    SGIs and any interrupt with a higher ID than the number of interrupts
-    supported, will be RAZ/WI.  LPIs are always edge-triggered and are
-    therefore not supported by this interface.
-
-    PPIs are reported per VCPU as specified in the mpidr field, and SPIs are
-    reported with the same value regardless of the mpidr specified.
-
-    The mpidr field encodes the CPU ID based on the affinity information in the
-    architecture defined MPIDR, and the field is encoded as follows:
-      | 63 .... 56 | 55 .... 48 | 47 .... 40 | 39 .... 32 |
-      |    Aff3    |    Aff2    |    Aff1    |    Aff0    |
-  Errors:
-    -EINVAL: vINTID is not multiple of 32 or
-     info field is not VGIC_LEVEL_INFO_LINE_LEVEL
diff --git a/Documentation/virt/kvm/devices/arm-vgic.rst b/Documentation/virt/kvm/devices/arm-vgic.rst

new file mode 100644 (file)

index 0000000..40bdeea
--- /dev/null
+++ b/Documentation/virt/kvm/devices/arm-vgic.rst
@@ -0,0 +1,156 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==================================================
+ARM Virtual Generic Interrupt Controller v2 (VGIC)
+==================================================
+
+Device types supported:
+
+  - KVM_DEV_TYPE_ARM_VGIC_V2     ARM Generic Interrupt Controller v2.0
+
+Only one VGIC instance may be instantiated through either this API or the
+legacy KVM_CREATE_IRQCHIP API.  The created VGIC will act as the VM interrupt
+controller, requiring emulated user-space devices to inject interrupts to the
+VGIC instead of directly to CPUs.
+
+GICv3 implementations with hardware compatibility support allow creating a
+guest GICv2 through this interface.  For information on creating a guest GICv3
+device and guest ITS devices, see arm-vgic-v3.txt.  It is not possible to
+create both a GICv3 and GICv2 device on the same VM.
+
+
+Groups:
+  KVM_DEV_ARM_VGIC_GRP_ADDR
+   Attributes:
+
+    KVM_VGIC_V2_ADDR_TYPE_DIST (rw, 64-bit)
+      Base address in the guest physical address space of the GIC distributor
+      register mappings. Only valid for KVM_DEV_TYPE_ARM_VGIC_V2.
+      This address needs to be 4K aligned and the region covers 4 KByte.
+
+    KVM_VGIC_V2_ADDR_TYPE_CPU (rw, 64-bit)
+      Base address in the guest physical address space of the GIC virtual cpu
+      interface register mappings. Only valid for KVM_DEV_TYPE_ARM_VGIC_V2.
+      This address needs to be 4K aligned and the region covers 4 KByte.
+
+  Errors:
+
+    =======  =============================================================
+    -E2BIG   Address outside of addressable IPA range
+    -EINVAL  Incorrectly aligned address
+    -EEXIST  Address already configured
+    -ENXIO   The group or attribute is unknown/unsupported for this device
+             or hardware support is missing.
+    -EFAULT  Invalid user pointer for attr->addr.
+    =======  =============================================================
+
+  KVM_DEV_ARM_VGIC_GRP_DIST_REGS
+   Attributes:
+
+    The attr field of kvm_device_attr encodes two values::
+
+      bits:     | 63   ....  40 | 39 ..  32  |  31   ....    0 |
+      values:   |    reserved   | vcpu_index |      offset     |
+
+    All distributor regs are (rw, 32-bit)
+
+    The offset is relative to the "Distributor base address" as defined in the
+    GICv2 specs.  Getting or setting such a register has the same effect as
+    reading or writing the register on the actual hardware from the cpu whose
+    index is specified with the vcpu_index field.  Note that most distributor
+    fields are not banked, but return the same value regardless of the
+    vcpu_index used to access the register.
+
+    GICD_IIDR.Revision is updated when the KVM implementation of an emulated
+    GICv2 is changed in a way directly observable by the guest or userspace.
+    Userspace should read GICD_IIDR from KVM and write back the read value to
+    confirm its expected behavior is aligned with the KVM implementation.
+    Userspace should set GICD_IIDR before setting any other registers (both
+    KVM_DEV_ARM_VGIC_GRP_DIST_REGS and KVM_DEV_ARM_VGIC_GRP_CPU_REGS) to ensure
+    the expected behavior. Unless GICD_IIDR has been set from userspace, writes
+    to the interrupt group registers (GICD_IGROUPR) are ignored.
+
+  Errors:
+
+    =======  =====================================================
+    -ENXIO   Getting or setting this register is not yet supported
+    -EBUSY   One or more VCPUs are running
+    -EINVAL  Invalid vcpu_index supplied
+    =======  =====================================================
+
+  KVM_DEV_ARM_VGIC_GRP_CPU_REGS
+   Attributes:
+
+    The attr field of kvm_device_attr encodes two values::
+
+      bits:     | 63   ....  40 | 39 ..  32  |  31   ....    0 |
+      values:   |    reserved   | vcpu_index |      offset     |
+
+    All CPU interface regs are (rw, 32-bit)
+
+    The offset specifies the offset from the "CPU interface base address" as
+    defined in the GICv2 specs.  Getting or setting such a register has the
+    same effect as reading or writing the register on the actual hardware.
+
+    The Active Priorities Registers APRn are implementation defined, so we set a
+    fixed format for our implementation that fits with the model of a "GICv2
+    implementation without the security extensions" which we present to the
+    guest.  This interface always exposes four register APR[0-3] describing the
+    maximum possible 128 preemption levels.  The semantics of the register
+    indicate if any interrupts in a given preemption level are in the active
+    state by setting the corresponding bit.
+
+    Thus, preemption level X has one or more active interrupts if and only if:
+
+      APRn[X mod 32] == 0b1,  where n = X / 32
+
+    Bits for undefined preemption levels are RAZ/WI.
+
+    Note that this differs from a CPU's view of the APRs on hardware in which
+    a GIC without the security extensions expose group 0 and group 1 active
+    priorities in separate register groups, whereas we show a combined view
+    similar to GICv2's GICH_APR.
+
+    For historical reasons and to provide ABI compatibility with userspace we
+    export the GICC_PMR register in the format of the GICH_VMCR.VMPriMask
+    field in the lower 5 bits of a word, meaning that userspace must always
+    use the lower 5 bits to communicate with the KVM device and must shift the
+    value left by 3 places to obtain the actual priority mask level.
+
+  Errors:
+
+    =======  =====================================================
+    -ENXIO   Getting or setting this register is not yet supported
+    -EBUSY   One or more VCPUs are running
+    -EINVAL  Invalid vcpu_index supplied
+    =======  =====================================================
+
+  KVM_DEV_ARM_VGIC_GRP_NR_IRQS
+   Attributes:
+
+    A value describing the number of interrupts (SGI, PPI and SPI) for
+    this GIC instance, ranging from 64 to 1024, in increments of 32.
+
+  Errors:
+
+    =======  =============================================================
+    -EINVAL  Value set is out of the expected range
+    -EBUSY   Value has already be set, or GIC has already been initialized
+             with default values.
+    =======  =============================================================
+
+  KVM_DEV_ARM_VGIC_GRP_CTRL
+   Attributes:
+
+    KVM_DEV_ARM_VGIC_CTRL_INIT
+      request the initialization of the VGIC or ITS, no additional parameter
+      in kvm_device_attr.addr.
+
+  Errors:
+
+    =======  =========================================================
+    -ENXIO   VGIC not properly configured as required prior to calling
+             this attribute
+    -ENODEV  no online VCPU
+    -ENOMEM  memory shortage when allocating vgic internal data
+    =======  =========================================================
diff --git a/Documentation/virt/kvm/devices/arm-vgic.txt b/Documentation/virt/kvm/devices/arm-vgic.txt

deleted file mode 100644 (file)

index 97b6518..0000000
--- a/Documentation/virt/kvm/devices/arm-vgic.txt
+++ /dev/null
@@ -1,127 +0,0 @@
-ARM Virtual Generic Interrupt Controller v2 (VGIC)
-==================================================
-
-Device types supported:
-  KVM_DEV_TYPE_ARM_VGIC_V2     ARM Generic Interrupt Controller v2.0
-
-Only one VGIC instance may be instantiated through either this API or the
-legacy KVM_CREATE_IRQCHIP API.  The created VGIC will act as the VM interrupt
-controller, requiring emulated user-space devices to inject interrupts to the
-VGIC instead of directly to CPUs.
-
-GICv3 implementations with hardware compatibility support allow creating a
-guest GICv2 through this interface.  For information on creating a guest GICv3
-device and guest ITS devices, see arm-vgic-v3.txt.  It is not possible to
-create both a GICv3 and GICv2 device on the same VM.
-
-
-Groups:
-  KVM_DEV_ARM_VGIC_GRP_ADDR
-  Attributes:
-    KVM_VGIC_V2_ADDR_TYPE_DIST (rw, 64-bit)
-      Base address in the guest physical address space of the GIC distributor
-      register mappings. Only valid for KVM_DEV_TYPE_ARM_VGIC_V2.
-      This address needs to be 4K aligned and the region covers 4 KByte.
-
-    KVM_VGIC_V2_ADDR_TYPE_CPU (rw, 64-bit)
-      Base address in the guest physical address space of the GIC virtual cpu
-      interface register mappings. Only valid for KVM_DEV_TYPE_ARM_VGIC_V2.
-      This address needs to be 4K aligned and the region covers 4 KByte.
-  Errors:
-    -E2BIG:  Address outside of addressable IPA range
-    -EINVAL: Incorrectly aligned address
-    -EEXIST: Address already configured
-    -ENXIO:  The group or attribute is unknown/unsupported for this device
-             or hardware support is missing.
-    -EFAULT: Invalid user pointer for attr->addr.
-
-  KVM_DEV_ARM_VGIC_GRP_DIST_REGS
-  Attributes:
-    The attr field of kvm_device_attr encodes two values:
-    bits:     | 63   ....  40 | 39 ..  32  |  31   ....    0 |
-    values:   |    reserved   | vcpu_index |      offset     |
-
-    All distributor regs are (rw, 32-bit)
-
-    The offset is relative to the "Distributor base address" as defined in the
-    GICv2 specs.  Getting or setting such a register has the same effect as
-    reading or writing the register on the actual hardware from the cpu whose
-    index is specified with the vcpu_index field.  Note that most distributor
-    fields are not banked, but return the same value regardless of the
-    vcpu_index used to access the register.
-
-    GICD_IIDR.Revision is updated when the KVM implementation of an emulated
-    GICv2 is changed in a way directly observable by the guest or userspace.
-    Userspace should read GICD_IIDR from KVM and write back the read value to
-    confirm its expected behavior is aligned with the KVM implementation.
-    Userspace should set GICD_IIDR before setting any other registers (both
-    KVM_DEV_ARM_VGIC_GRP_DIST_REGS and KVM_DEV_ARM_VGIC_GRP_CPU_REGS) to ensure
-    the expected behavior. Unless GICD_IIDR has been set from userspace, writes
-    to the interrupt group registers (GICD_IGROUPR) are ignored.
-  Errors:
-    -ENXIO: Getting or setting this register is not yet supported
-    -EBUSY: One or more VCPUs are running
-    -EINVAL: Invalid vcpu_index supplied
-
-  KVM_DEV_ARM_VGIC_GRP_CPU_REGS
-  Attributes:
-    The attr field of kvm_device_attr encodes two values:
-    bits:     | 63   ....  40 | 39 ..  32  |  31   ....    0 |
-    values:   |    reserved   | vcpu_index |      offset     |
-
-    All CPU interface regs are (rw, 32-bit)
-
-    The offset specifies the offset from the "CPU interface base address" as
-    defined in the GICv2 specs.  Getting or setting such a register has the
-    same effect as reading or writing the register on the actual hardware.
-
-    The Active Priorities Registers APRn are implementation defined, so we set a
-    fixed format for our implementation that fits with the model of a "GICv2
-    implementation without the security extensions" which we present to the
-    guest.  This interface always exposes four register APR[0-3] describing the
-    maximum possible 128 preemption levels.  The semantics of the register
-    indicate if any interrupts in a given preemption level are in the active
-    state by setting the corresponding bit.
-
-    Thus, preemption level X has one or more active interrupts if and only if:
-
-      APRn[X mod 32] == 0b1,  where n = X / 32
-
-    Bits for undefined preemption levels are RAZ/WI.
-
-    Note that this differs from a CPU's view of the APRs on hardware in which
-    a GIC without the security extensions expose group 0 and group 1 active
-    priorities in separate register groups, whereas we show a combined view
-    similar to GICv2's GICH_APR.
-
-    For historical reasons and to provide ABI compatibility with userspace we
-    export the GICC_PMR register in the format of the GICH_VMCR.VMPriMask
-    field in the lower 5 bits of a word, meaning that userspace must always
-    use the lower 5 bits to communicate with the KVM device and must shift the
-    value left by 3 places to obtain the actual priority mask level.
-
-  Errors:
-    -ENXIO: Getting or setting this register is not yet supported
-    -EBUSY: One or more VCPUs are running
-    -EINVAL: Invalid vcpu_index supplied
-
-  KVM_DEV_ARM_VGIC_GRP_NR_IRQS
-  Attributes:
-    A value describing the number of interrupts (SGI, PPI and SPI) for
-    this GIC instance, ranging from 64 to 1024, in increments of 32.
-
-  Errors:
-    -EINVAL: Value set is out of the expected range
-    -EBUSY: Value has already be set, or GIC has already been initialized
-            with default values.
-
-  KVM_DEV_ARM_VGIC_GRP_CTRL
-  Attributes:
-    KVM_DEV_ARM_VGIC_CTRL_INIT
-      request the initialization of the VGIC or ITS, no additional parameter
-      in kvm_device_attr.addr.
-  Errors:
-    -ENXIO: VGIC not properly configured as required prior to calling
-     this attribute
-    -ENODEV: no online VCPU
-    -ENOMEM: memory shortage when allocating vgic internal data
diff --git a/Documentation/virt/kvm/devices/index.rst b/Documentation/virt/kvm/devices/index.rst

new file mode 100644 (file)

index 0000000..192cda7
--- /dev/null
+++ b/Documentation/virt/kvm/devices/index.rst
@@ -0,0 +1,19 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=======
+Devices
+=======
+
+.. toctree::
+   :maxdepth: 2
+
+   arm-vgic-its
+   arm-vgic
+   arm-vgic-v3
+   mpic
+   s390_flic
+   vcpu
+   vfio
+   vm
+   xics
+   xive
diff --git a/Documentation/virt/kvm/devices/mpic.rst b/Documentation/virt/kvm/devices/mpic.rst

new file mode 100644 (file)

index 0000000..55cefe0
--- /dev/null
+++ b/Documentation/virt/kvm/devices/mpic.rst
@@ -0,0 +1,58 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=========================
+MPIC interrupt controller
+=========================
+
+Device types supported:
+
+  - KVM_DEV_TYPE_FSL_MPIC_20     Freescale MPIC v2.0
+  - KVM_DEV_TYPE_FSL_MPIC_42     Freescale MPIC v4.2
+
+Only one MPIC instance, of any type, may be instantiated.  The created
+MPIC will act as the system interrupt controller, connecting to each
+vcpu's interrupt inputs.
+
+Groups:
+  KVM_DEV_MPIC_GRP_MISC
+   Attributes:
+
+    KVM_DEV_MPIC_BASE_ADDR (rw, 64-bit)
+      Base address of the 256 KiB MPIC register space.  Must be
+      naturally aligned.  A value of zero disables the mapping.
+      Reset value is zero.
+
+  KVM_DEV_MPIC_GRP_REGISTER (rw, 32-bit)
+    Access an MPIC register, as if the access were made from the guest.
+    "attr" is the byte offset into the MPIC register space.  Accesses
+    must be 4-byte aligned.
+
+    MSIs may be signaled by using this attribute group to write
+    to the relevant MSIIR.
+
+  KVM_DEV_MPIC_GRP_IRQ_ACTIVE (rw, 32-bit)
+    IRQ input line for each standard openpic source.  0 is inactive and 1
+    is active, regardless of interrupt sense.
+
+    For edge-triggered interrupts:  Writing 1 is considered an activating
+    edge, and writing 0 is ignored.  Reading returns 1 if a previously
+    signaled edge has not been acknowledged, and 0 otherwise.
+
+    "attr" is the IRQ number.  IRQ numbers for standard sources are the
+    byte offset of the relevant IVPR from EIVPR0, divided by 32.
+
+IRQ Routing:
+
+  The MPIC emulation supports IRQ routing. Only a single MPIC device can
+  be instantiated. Once that device has been created, it's available as
+  irqchip id 0.
+
+  This irqchip 0 has 256 interrupt pins, which expose the interrupts in
+  the main array of interrupt sources (a.k.a. "SRC" interrupts).
+
+  The numbering is the same as the MPIC device tree binding -- based on
+  the register offset from the beginning of the sources array, without
+  regard to any subdivisions in chip documentation such as "internal"
+  or "external" interrupts.
+
+  Access to non-SRC interrupts is not implemented through IRQ routing mechanisms.
diff --git a/Documentation/virt/kvm/devices/mpic.txt b/Documentation/virt/kvm/devices/mpic.txt

deleted file mode 100644 (file)

index 8257397..0000000
--- a/Documentation/virt/kvm/devices/mpic.txt
+++ /dev/null
@@ -1,53 +0,0 @@
-MPIC interrupt controller
-=========================
-
-Device types supported:
-  KVM_DEV_TYPE_FSL_MPIC_20     Freescale MPIC v2.0
-  KVM_DEV_TYPE_FSL_MPIC_42     Freescale MPIC v4.2
-
-Only one MPIC instance, of any type, may be instantiated.  The created
-MPIC will act as the system interrupt controller, connecting to each
-vcpu's interrupt inputs.
-
-Groups:
-  KVM_DEV_MPIC_GRP_MISC
-  Attributes:
-    KVM_DEV_MPIC_BASE_ADDR (rw, 64-bit)
-      Base address of the 256 KiB MPIC register space.  Must be
-      naturally aligned.  A value of zero disables the mapping.
-      Reset value is zero.
-
-  KVM_DEV_MPIC_GRP_REGISTER (rw, 32-bit)
-    Access an MPIC register, as if the access were made from the guest.
-    "attr" is the byte offset into the MPIC register space.  Accesses
-    must be 4-byte aligned.
-
-    MSIs may be signaled by using this attribute group to write
-    to the relevant MSIIR.
-
-  KVM_DEV_MPIC_GRP_IRQ_ACTIVE (rw, 32-bit)
-    IRQ input line for each standard openpic source.  0 is inactive and 1
-    is active, regardless of interrupt sense.
-
-    For edge-triggered interrupts:  Writing 1 is considered an activating
-    edge, and writing 0 is ignored.  Reading returns 1 if a previously
-    signaled edge has not been acknowledged, and 0 otherwise.
-
-    "attr" is the IRQ number.  IRQ numbers for standard sources are the
-    byte offset of the relevant IVPR from EIVPR0, divided by 32.
-
-IRQ Routing:
-
-  The MPIC emulation supports IRQ routing. Only a single MPIC device can
-  be instantiated. Once that device has been created, it's available as
-  irqchip id 0.
-
-  This irqchip 0 has 256 interrupt pins, which expose the interrupts in
-  the main array of interrupt sources (a.k.a. "SRC" interrupts).
-
-  The numbering is the same as the MPIC device tree binding -- based on
-  the register offset from the beginning of the sources array, without
-  regard to any subdivisions in chip documentation such as "internal"
-  or "external" interrupts.
-
-  Access to non-SRC interrupts is not implemented through IRQ routing mechanisms.
diff --git a/Documentation/virt/kvm/devices/s390_flic.rst b/Documentation/virt/kvm/devices/s390_flic.rst

new file mode 100644 (file)

index 0000000..954190d
--- /dev/null
+++ b/Documentation/virt/kvm/devices/s390_flic.rst
@@ -0,0 +1,173 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+====================================
+FLIC (floating interrupt controller)
+====================================
+
+FLIC handles floating (non per-cpu) interrupts, i.e. I/O, service and some
+machine check interruptions. All interrupts are stored in a per-vm list of
+pending interrupts. FLIC performs operations on this list.
+
+Only one FLIC instance may be instantiated.
+
+FLIC provides support to
+- add interrupts (KVM_DEV_FLIC_ENQUEUE)
+- inspect currently pending interrupts (KVM_FLIC_GET_ALL_IRQS)
+- purge all pending floating interrupts (KVM_DEV_FLIC_CLEAR_IRQS)
+- purge one pending floating I/O interrupt (KVM_DEV_FLIC_CLEAR_IO_IRQ)
+- enable/disable for the guest transparent async page faults
+- register and modify adapter interrupt sources (KVM_DEV_FLIC_ADAPTER_*)
+- modify AIS (adapter-interruption-suppression) mode state (KVM_DEV_FLIC_AISM)
+- inject adapter interrupts on a specified adapter (KVM_DEV_FLIC_AIRQ_INJECT)
+- get/set all AIS mode states (KVM_DEV_FLIC_AISM_ALL)
+
+Groups:
+  KVM_DEV_FLIC_ENQUEUE
+    Passes a buffer and length into the kernel which are then injected into
+    the list of pending interrupts.
+    attr->addr contains the pointer to the buffer and attr->attr contains
+    the length of the buffer.
+    The format of the data structure kvm_s390_irq as it is copied from userspace
+    is defined in usr/include/linux/kvm.h.
+
+  KVM_DEV_FLIC_GET_ALL_IRQS
+    Copies all floating interrupts into a buffer provided by userspace.
+    When the buffer is too small it returns -ENOMEM, which is the indication
+    for userspace to try again with a bigger buffer.
+
+    -ENOBUFS is returned when the allocation of a kernelspace buffer has
+    failed.
+
+    -EFAULT is returned when copying data to userspace failed.
+    All interrupts remain pending, i.e. are not deleted from the list of
+    currently pending interrupts.
+    attr->addr contains the userspace address of the buffer into which all
+    interrupt data will be copied.
+    attr->attr contains the size of the buffer in bytes.
+
+  KVM_DEV_FLIC_CLEAR_IRQS
+    Simply deletes all elements from the list of currently pending floating
+    interrupts.  No interrupts are injected into the guest.
+
+  KVM_DEV_FLIC_CLEAR_IO_IRQ
+    Deletes one (if any) I/O interrupt for a subchannel identified by the
+    subsystem identification word passed via the buffer specified by
+    attr->addr (address) and attr->attr (length).
+
+  KVM_DEV_FLIC_APF_ENABLE
+    Enables async page faults for the guest. So in case of a major page fault
+    the host is allowed to handle this async and continues the guest.
+
+  KVM_DEV_FLIC_APF_DISABLE_WAIT
+    Disables async page faults for the guest and waits until already pending
+    async page faults are done. This is necessary to trigger a completion interrupt
+    for every init interrupt before migrating the interrupt list.
+
+  KVM_DEV_FLIC_ADAPTER_REGISTER
+    Register an I/O adapter interrupt source. Takes a kvm_s390_io_adapter
+    describing the adapter to register::
+
+       struct kvm_s390_io_adapter {
+               __u32 id;
+               __u8 isc;
+               __u8 maskable;
+               __u8 swap;
+               __u8 flags;
+       };
+
+   id contains the unique id for the adapter, isc the I/O interruption subclass
+   to use, maskable whether this adapter may be masked (interrupts turned off),
+   swap whether the indicators need to be byte swapped, and flags contains
+   further characteristics of the adapter.
+
+   Currently defined values for 'flags' are:
+
+   - KVM_S390_ADAPTER_SUPPRESSIBLE: adapter is subject to AIS
+     (adapter-interrupt-suppression) facility. This flag only has an effect if
+     the AIS capability is enabled.
+
+   Unknown flag values are ignored.
+
+
+  KVM_DEV_FLIC_ADAPTER_MODIFY
+    Modifies attributes of an existing I/O adapter interrupt source. Takes
+    a kvm_s390_io_adapter_req specifying the adapter and the operation::
+
+       struct kvm_s390_io_adapter_req {
+               __u32 id;
+               __u8 type;
+               __u8 mask;
+               __u16 pad0;
+               __u64 addr;
+       };
+
+    id specifies the adapter and type the operation. The supported operations
+    are:
+
+    KVM_S390_IO_ADAPTER_MASK
+      mask or unmask the adapter, as specified in mask
+
+    KVM_S390_IO_ADAPTER_MAP
+      perform a gmap translation for the guest address provided in addr,
+      pin a userspace page for the translated address and add it to the
+      list of mappings
+
+      .. note:: A new mapping will be created unconditionally; therefore,
+               the calling code should avoid making duplicate mappings.
+
+    KVM_S390_IO_ADAPTER_UNMAP
+      release a userspace page for the translated address specified in addr
+      from the list of mappings
+
+  KVM_DEV_FLIC_AISM
+    modify the adapter-interruption-suppression mode for a given isc if the
+    AIS capability is enabled. Takes a kvm_s390_ais_req describing::
+
+       struct kvm_s390_ais_req {
+               __u8 isc;
+               __u16 mode;
+       };
+
+    isc contains the target I/O interruption subclass, mode the target
+    adapter-interruption-suppression mode. The following modes are
+    currently supported:
+
+    - KVM_S390_AIS_MODE_ALL: ALL-Interruptions Mode, i.e. airq injection
+      is always allowed;
+    - KVM_S390_AIS_MODE_SINGLE: SINGLE-Interruption Mode, i.e. airq
+      injection is only allowed once and the following adapter interrupts
+      will be suppressed until the mode is set again to ALL-Interruptions
+      or SINGLE-Interruption mode.
+
+  KVM_DEV_FLIC_AIRQ_INJECT
+    Inject adapter interrupts on a specified adapter.
+    attr->attr contains the unique id for the adapter, which allows for
+    adapter-specific checks and actions.
+    For adapters subject to AIS, handle the airq injection suppression for
+    an isc according to the adapter-interruption-suppression mode on condition
+    that the AIS capability is enabled.
+
+  KVM_DEV_FLIC_AISM_ALL
+    Gets or sets the adapter-interruption-suppression mode for all ISCs. Takes
+    a kvm_s390_ais_all describing::
+
+       struct kvm_s390_ais_all {
+              __u8 simm; /* Single-Interruption-Mode mask */
+              __u8 nimm; /* No-Interruption-Mode mask *
+       };
+
+    simm contains Single-Interruption-Mode mask for all ISCs, nimm contains
+    No-Interruption-Mode mask for all ISCs. Each bit in simm and nimm corresponds
+    to an ISC (MSB0 bit 0 to ISC 0 and so on). The combination of simm bit and
+    nimm bit presents AIS mode for a ISC.
+
+    KVM_DEV_FLIC_AISM_ALL is indicated by KVM_CAP_S390_AIS_MIGRATION.
+
+Note: The KVM_SET_DEVICE_ATTR/KVM_GET_DEVICE_ATTR device ioctls executed on
+FLIC with an unknown group or attribute gives the error code EINVAL (instead of
+ENXIO, as specified in the API documentation). It is not possible to conclude
+that a FLIC operation is unavailable based on the error code resulting from a
+usage attempt.
+
+.. note:: The KVM_DEV_FLIC_CLEAR_IO_IRQ ioctl will return EINVAL in case a
+         zero schid is specified.
diff --git a/Documentation/virt/kvm/devices/s390_flic.txt b/Documentation/virt/kvm/devices/s390_flic.txt

deleted file mode 100644 (file)

index a4e20a0..0000000
--- a/Documentation/virt/kvm/devices/s390_flic.txt
+++ /dev/null
@@ -1,163 +0,0 @@
-FLIC (floating interrupt controller)
-====================================
-
-FLIC handles floating (non per-cpu) interrupts, i.e. I/O, service and some
-machine check interruptions. All interrupts are stored in a per-vm list of
-pending interrupts. FLIC performs operations on this list.
-
-Only one FLIC instance may be instantiated.
-
-FLIC provides support to
-- add interrupts (KVM_DEV_FLIC_ENQUEUE)
-- inspect currently pending interrupts (KVM_FLIC_GET_ALL_IRQS)
-- purge all pending floating interrupts (KVM_DEV_FLIC_CLEAR_IRQS)
-- purge one pending floating I/O interrupt (KVM_DEV_FLIC_CLEAR_IO_IRQ)
-- enable/disable for the guest transparent async page faults
-- register and modify adapter interrupt sources (KVM_DEV_FLIC_ADAPTER_*)
-- modify AIS (adapter-interruption-suppression) mode state (KVM_DEV_FLIC_AISM)
-- inject adapter interrupts on a specified adapter (KVM_DEV_FLIC_AIRQ_INJECT)
-- get/set all AIS mode states (KVM_DEV_FLIC_AISM_ALL)
-
-Groups:
-  KVM_DEV_FLIC_ENQUEUE
-    Passes a buffer and length into the kernel which are then injected into
-    the list of pending interrupts.
-    attr->addr contains the pointer to the buffer and attr->attr contains
-    the length of the buffer.
-    The format of the data structure kvm_s390_irq as it is copied from userspace
-    is defined in usr/include/linux/kvm.h.
-
-  KVM_DEV_FLIC_GET_ALL_IRQS
-    Copies all floating interrupts into a buffer provided by userspace.
-    When the buffer is too small it returns -ENOMEM, which is the indication
-    for userspace to try again with a bigger buffer.
-    -ENOBUFS is returned when the allocation of a kernelspace buffer has
-    failed.
-    -EFAULT is returned when copying data to userspace failed.
-    All interrupts remain pending, i.e. are not deleted from the list of
-    currently pending interrupts.
-    attr->addr contains the userspace address of the buffer into which all
-    interrupt data will be copied.
-    attr->attr contains the size of the buffer in bytes.
-
-  KVM_DEV_FLIC_CLEAR_IRQS
-    Simply deletes all elements from the list of currently pending floating
-    interrupts.  No interrupts are injected into the guest.
-
-  KVM_DEV_FLIC_CLEAR_IO_IRQ
-    Deletes one (if any) I/O interrupt for a subchannel identified by the
-    subsystem identification word passed via the buffer specified by
-    attr->addr (address) and attr->attr (length).
-
-  KVM_DEV_FLIC_APF_ENABLE
-    Enables async page faults for the guest. So in case of a major page fault
-    the host is allowed to handle this async and continues the guest.
-
-  KVM_DEV_FLIC_APF_DISABLE_WAIT
-    Disables async page faults for the guest and waits until already pending
-    async page faults are done. This is necessary to trigger a completion interrupt
-    for every init interrupt before migrating the interrupt list.
-
-  KVM_DEV_FLIC_ADAPTER_REGISTER
-    Register an I/O adapter interrupt source. Takes a kvm_s390_io_adapter
-    describing the adapter to register:
-
-struct kvm_s390_io_adapter {
-       __u32 id;
-       __u8 isc;
-       __u8 maskable;
-       __u8 swap;
-       __u8 flags;
-};
-
-   id contains the unique id for the adapter, isc the I/O interruption subclass
-   to use, maskable whether this adapter may be masked (interrupts turned off),
-   swap whether the indicators need to be byte swapped, and flags contains
-   further characteristics of the adapter.
-   Currently defined values for 'flags' are:
-   - KVM_S390_ADAPTER_SUPPRESSIBLE: adapter is subject to AIS
-     (adapter-interrupt-suppression) facility. This flag only has an effect if
-     the AIS capability is enabled.
-   Unknown flag values are ignored.
-
-
-  KVM_DEV_FLIC_ADAPTER_MODIFY
-    Modifies attributes of an existing I/O adapter interrupt source. Takes
-    a kvm_s390_io_adapter_req specifying the adapter and the operation:
-
-struct kvm_s390_io_adapter_req {
-       __u32 id;
-       __u8 type;
-       __u8 mask;
-       __u16 pad0;
-       __u64 addr;
-};
-
-    id specifies the adapter and type the operation. The supported operations
-    are:
-
-    KVM_S390_IO_ADAPTER_MASK
-      mask or unmask the adapter, as specified in mask
-
-    KVM_S390_IO_ADAPTER_MAP
-      perform a gmap translation for the guest address provided in addr,
-      pin a userspace page for the translated address and add it to the
-      list of mappings
-      Note: A new mapping will be created unconditionally; therefore,
-            the calling code should avoid making duplicate mappings.
-
-    KVM_S390_IO_ADAPTER_UNMAP
-      release a userspace page for the translated address specified in addr
-      from the list of mappings
-
-  KVM_DEV_FLIC_AISM
-    modify the adapter-interruption-suppression mode for a given isc if the
-    AIS capability is enabled. Takes a kvm_s390_ais_req describing:
-
-struct kvm_s390_ais_req {
-       __u8 isc;
-       __u16 mode;
-};
-
-    isc contains the target I/O interruption subclass, mode the target
-    adapter-interruption-suppression mode. The following modes are
-    currently supported:
-    - KVM_S390_AIS_MODE_ALL: ALL-Interruptions Mode, i.e. airq injection
-      is always allowed;
-    - KVM_S390_AIS_MODE_SINGLE: SINGLE-Interruption Mode, i.e. airq
-      injection is only allowed once and the following adapter interrupts
-      will be suppressed until the mode is set again to ALL-Interruptions
-      or SINGLE-Interruption mode.
-
-  KVM_DEV_FLIC_AIRQ_INJECT
-    Inject adapter interrupts on a specified adapter.
-    attr->attr contains the unique id for the adapter, which allows for
-    adapter-specific checks and actions.
-    For adapters subject to AIS, handle the airq injection suppression for
-    an isc according to the adapter-interruption-suppression mode on condition
-    that the AIS capability is enabled.
-
-  KVM_DEV_FLIC_AISM_ALL
-    Gets or sets the adapter-interruption-suppression mode for all ISCs. Takes
-    a kvm_s390_ais_all describing:
-
-struct kvm_s390_ais_all {
-       __u8 simm; /* Single-Interruption-Mode mask */
-       __u8 nimm; /* No-Interruption-Mode mask *
-};
-
-    simm contains Single-Interruption-Mode mask for all ISCs, nimm contains
-    No-Interruption-Mode mask for all ISCs. Each bit in simm and nimm corresponds
-    to an ISC (MSB0 bit 0 to ISC 0 and so on). The combination of simm bit and
-    nimm bit presents AIS mode for a ISC.
-
-    KVM_DEV_FLIC_AISM_ALL is indicated by KVM_CAP_S390_AIS_MIGRATION.
-
-Note: The KVM_SET_DEVICE_ATTR/KVM_GET_DEVICE_ATTR device ioctls executed on
-FLIC with an unknown group or attribute gives the error code EINVAL (instead of
-ENXIO, as specified in the API documentation). It is not possible to conclude
-that a FLIC operation is unavailable based on the error code resulting from a
-usage attempt.
-
-Note: The KVM_DEV_FLIC_CLEAR_IO_IRQ ioctl will return EINVAL in case a zero
-schid is specified.
diff --git a/Documentation/virt/kvm/devices/vcpu.rst b/Documentation/virt/kvm/devices/vcpu.rst

new file mode 100644 (file)

index 0000000..9963e68
--- /dev/null
+++ b/Documentation/virt/kvm/devices/vcpu.rst
@@ -0,0 +1,114 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+======================
+Generic vcpu interface
+======================
+
+The virtual cpu "device" also accepts the ioctls KVM_SET_DEVICE_ATTR,
+KVM_GET_DEVICE_ATTR, and KVM_HAS_DEVICE_ATTR. The interface uses the same struct
+kvm_device_attr as other devices, but targets VCPU-wide settings and controls.
+
+The groups and attributes per virtual cpu, if any, are architecture specific.
+
+1. GROUP: KVM_ARM_VCPU_PMU_V3_CTRL
+==================================
+
+:Architectures: ARM64
+
+1.1. ATTRIBUTE: KVM_ARM_VCPU_PMU_V3_IRQ
+---------------------------------------
+
+:Parameters: in kvm_device_attr.addr the address for PMU overflow interrupt is a
+            pointer to an int
+
+Returns:
+
+        =======  ========================================================
+        -EBUSY   The PMU overflow interrupt is already set
+        -ENXIO   The overflow interrupt not set when attempting to get it
+        -ENODEV  PMUv3 not supported
+        -EINVAL  Invalid PMU overflow interrupt number supplied or
+                 trying to set the IRQ number without using an in-kernel
+                 irqchip.
+        =======  ========================================================
+
+A value describing the PMUv3 (Performance Monitor Unit v3) overflow interrupt
+number for this vcpu. This interrupt could be a PPI or SPI, but the interrupt
+type must be same for each vcpu. As a PPI, the interrupt number is the same for
+all vcpus, while as an SPI it must be a separate number per vcpu.
+
+1.2 ATTRIBUTE: KVM_ARM_VCPU_PMU_V3_INIT
+---------------------------------------
+
+:Parameters: no additional parameter in kvm_device_attr.addr
+
+Returns:
+
+        =======  ======================================================
+        -ENODEV  PMUv3 not supported or GIC not initialized
+        -ENXIO   PMUv3 not properly configured or in-kernel irqchip not
+                 configured as required prior to calling this attribute
+        -EBUSY   PMUv3 already initialized
+        =======  ======================================================
+
+Request the initialization of the PMUv3.  If using the PMUv3 with an in-kernel
+virtual GIC implementation, this must be done after initializing the in-kernel
+irqchip.
+
+
+2. GROUP: KVM_ARM_VCPU_TIMER_CTRL
+=================================
+
+:Architectures: ARM, ARM64
+
+2.1. ATTRIBUTES: KVM_ARM_VCPU_TIMER_IRQ_VTIMER, KVM_ARM_VCPU_TIMER_IRQ_PTIMER
+-----------------------------------------------------------------------------
+
+:Parameters: in kvm_device_attr.addr the address for the timer interrupt is a
+            pointer to an int
+
+Returns:
+
+        =======  =================================
+        -EINVAL  Invalid timer interrupt number
+        -EBUSY   One or more VCPUs has already run
+        =======  =================================
+
+A value describing the architected timer interrupt number when connected to an
+in-kernel virtual GIC.  These must be a PPI (16 <= intid < 32).  Setting the
+attribute overrides the default values (see below).
+
+=============================  ==========================================
+KVM_ARM_VCPU_TIMER_IRQ_VTIMER  The EL1 virtual timer intid (default: 27)
+KVM_ARM_VCPU_TIMER_IRQ_PTIMER  The EL1 physical timer intid (default: 30)
+=============================  ==========================================
+
+Setting the same PPI for different timers will prevent the VCPUs from running.
+Setting the interrupt number on a VCPU configures all VCPUs created at that
+time to use the number provided for a given timer, overwriting any previously
+configured values on other VCPUs.  Userspace should configure the interrupt
+numbers on at least one VCPU after creating all VCPUs and before running any
+VCPUs.
+
+3. GROUP: KVM_ARM_VCPU_PVTIME_CTRL
+==================================
+
+:Architectures: ARM64
+
+3.1 ATTRIBUTE: KVM_ARM_VCPU_PVTIME_IPA
+--------------------------------------
+
+:Parameters: 64-bit base address
+
+Returns:
+
+        =======  ======================================
+        -ENXIO   Stolen time not implemented
+        -EEXIST  Base address already set for this VCPU
+        -EINVAL  Base address not 64 byte aligned
+        =======  ======================================
+
+Specifies the base address of the stolen time structure for this VCPU. The
+base address must be 64 byte aligned and exist within a valid guest memory
+region. See Documentation/virt/kvm/arm/pvtime.txt for more information
+including the layout of the stolen time structure.
diff --git a/Documentation/virt/kvm/devices/vcpu.txt b/Documentation/virt/kvm/devices/vcpu.txt

deleted file mode 100644 (file)

index 6f3bd64..0000000
--- a/Documentation/virt/kvm/devices/vcpu.txt
+++ /dev/null
@@ -1,76 +0,0 @@
-Generic vcpu interface
-====================================
-
-The virtual cpu "device" also accepts the ioctls KVM_SET_DEVICE_ATTR,
-KVM_GET_DEVICE_ATTR, and KVM_HAS_DEVICE_ATTR. The interface uses the same struct
-kvm_device_attr as other devices, but targets VCPU-wide settings and controls.
-
-The groups and attributes per virtual cpu, if any, are architecture specific.
-
-1. GROUP: KVM_ARM_VCPU_PMU_V3_CTRL
-Architectures: ARM64
-
-1.1. ATTRIBUTE: KVM_ARM_VCPU_PMU_V3_IRQ
-Parameters: in kvm_device_attr.addr the address for PMU overflow interrupt is a
-            pointer to an int
-Returns: -EBUSY: The PMU overflow interrupt is already set
-         -ENXIO: The overflow interrupt not set when attempting to get it
-         -ENODEV: PMUv3 not supported
-         -EINVAL: Invalid PMU overflow interrupt number supplied or
-                  trying to set the IRQ number without using an in-kernel
-                  irqchip.
-
-A value describing the PMUv3 (Performance Monitor Unit v3) overflow interrupt
-number for this vcpu. This interrupt could be a PPI or SPI, but the interrupt
-type must be same for each vcpu. As a PPI, the interrupt number is the same for
-all vcpus, while as an SPI it must be a separate number per vcpu.
-
-1.2 ATTRIBUTE: KVM_ARM_VCPU_PMU_V3_INIT
-Parameters: no additional parameter in kvm_device_attr.addr
-Returns: -ENODEV: PMUv3 not supported or GIC not initialized
-         -ENXIO: PMUv3 not properly configured or in-kernel irqchip not
-                 configured as required prior to calling this attribute
-         -EBUSY: PMUv3 already initialized
-
-Request the initialization of the PMUv3.  If using the PMUv3 with an in-kernel
-virtual GIC implementation, this must be done after initializing the in-kernel
-irqchip.
-
-
-2. GROUP: KVM_ARM_VCPU_TIMER_CTRL
-Architectures: ARM,ARM64
-
-2.1. ATTRIBUTE: KVM_ARM_VCPU_TIMER_IRQ_VTIMER
-2.2. ATTRIBUTE: KVM_ARM_VCPU_TIMER_IRQ_PTIMER
-Parameters: in kvm_device_attr.addr the address for the timer interrupt is a
-            pointer to an int
-Returns: -EINVAL: Invalid timer interrupt number
-         -EBUSY:  One or more VCPUs has already run
-
-A value describing the architected timer interrupt number when connected to an
-in-kernel virtual GIC.  These must be a PPI (16 <= intid < 32).  Setting the
-attribute overrides the default values (see below).
-
-KVM_ARM_VCPU_TIMER_IRQ_VTIMER: The EL1 virtual timer intid (default: 27)
-KVM_ARM_VCPU_TIMER_IRQ_PTIMER: The EL1 physical timer intid (default: 30)
-
-Setting the same PPI for different timers will prevent the VCPUs from running.
-Setting the interrupt number on a VCPU configures all VCPUs created at that
-time to use the number provided for a given timer, overwriting any previously
-configured values on other VCPUs.  Userspace should configure the interrupt
-numbers on at least one VCPU after creating all VCPUs and before running any
-VCPUs.
-
-3. GROUP: KVM_ARM_VCPU_PVTIME_CTRL
-Architectures: ARM64
-
-3.1 ATTRIBUTE: KVM_ARM_VCPU_PVTIME_IPA
-Parameters: 64-bit base address
-Returns: -ENXIO:  Stolen time not implemented
-         -EEXIST: Base address already set for this VCPU
-         -EINVAL: Base address not 64 byte aligned
-
-Specifies the base address of the stolen time structure for this VCPU. The
-base address must be 64 byte aligned and exist within a valid guest memory
-region. See Documentation/virt/kvm/arm/pvtime.txt for more information
-including the layout of the stolen time structure.
diff --git a/Documentation/virt/kvm/devices/vfio.rst b/Documentation/virt/kvm/devices/vfio.rst

new file mode 100644 (file)

index 0000000..2d20dc5
--- /dev/null
+++ b/Documentation/virt/kvm/devices/vfio.rst
@@ -0,0 +1,41 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===================
+VFIO virtual device
+===================
+
+Device types supported:
+
+  - KVM_DEV_TYPE_VFIO
+
+Only one VFIO instance may be created per VM.  The created device
+tracks VFIO groups in use by the VM and features of those groups
+important to the correctness and acceleration of the VM.  As groups
+are enabled and disabled for use by the VM, KVM should be updated
+about their presence.  When registered with KVM, a reference to the
+VFIO-group is held by KVM.
+
+Groups:
+  KVM_DEV_VFIO_GROUP
+
+KVM_DEV_VFIO_GROUP attributes:
+  KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking
+       kvm_device_attr.addr points to an int32_t file descriptor
+       for the VFIO group.
+  KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device tracking
+       kvm_device_attr.addr points to an int32_t file descriptor
+       for the VFIO group.
+  KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: attaches a guest visible TCE table
+       allocated by sPAPR KVM.
+       kvm_device_attr.addr points to a struct::
+
+               struct kvm_vfio_spapr_tce {
+                       __s32   groupfd;
+                       __s32   tablefd;
+               };
+
+       where:
+
+       - @groupfd is a file descriptor for a VFIO group;
+       - @tablefd is a file descriptor for a TCE table allocated via
+         KVM_CREATE_SPAPR_TCE.
diff --git a/Documentation/virt/kvm/devices/vfio.txt b/Documentation/virt/kvm/devices/vfio.txt

deleted file mode 100644 (file)

index 528c77c..0000000
--- a/Documentation/virt/kvm/devices/vfio.txt
+++ /dev/null
@@ -1,36 +0,0 @@
-VFIO virtual device
-===================
-
-Device types supported:
-  KVM_DEV_TYPE_VFIO
-
-Only one VFIO instance may be created per VM.  The created device
-tracks VFIO groups in use by the VM and features of those groups
-important to the correctness and acceleration of the VM.  As groups
-are enabled and disabled for use by the VM, KVM should be updated
-about their presence.  When registered with KVM, a reference to the
-VFIO-group is held by KVM.
-
-Groups:
-  KVM_DEV_VFIO_GROUP
-
-KVM_DEV_VFIO_GROUP attributes:
-  KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking
-       kvm_device_attr.addr points to an int32_t file descriptor
-       for the VFIO group.
-  KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device tracking
-       kvm_device_attr.addr points to an int32_t file descriptor
-       for the VFIO group.
-  KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: attaches a guest visible TCE table
-       allocated by sPAPR KVM.
-       kvm_device_attr.addr points to a struct:
-
-       struct kvm_vfio_spapr_tce {
-               __s32   groupfd;
-               __s32   tablefd;
-       };
-
-       where
-       @groupfd is a file descriptor for a VFIO group;
-       @tablefd is a file descriptor for a TCE table allocated via
-               KVM_CREATE_SPAPR_TCE.
diff --git a/Documentation/virt/kvm/devices/vm.rst b/Documentation/virt/kvm/devices/vm.rst

new file mode 100644 (file)

index 0000000..0aa5b1c
--- /dev/null
+++ b/Documentation/virt/kvm/devices/vm.rst
@@ -0,0 +1,316 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+====================
+Generic vm interface
+====================
+
+The virtual machine "device" also accepts the ioctls KVM_SET_DEVICE_ATTR,
+KVM_GET_DEVICE_ATTR, and KVM_HAS_DEVICE_ATTR. The interface uses the same
+struct kvm_device_attr as other devices, but targets VM-wide settings
+and controls.
+
+The groups and attributes per virtual machine, if any, are architecture
+specific.
+
+1. GROUP: KVM_S390_VM_MEM_CTRL
+==============================
+
+:Architectures: s390
+
+1.1. ATTRIBUTE: KVM_S390_VM_MEM_ENABLE_CMMA
+-------------------------------------------
+
+:Parameters: none
+:Returns: -EBUSY if a vcpu is already defined, otherwise 0
+
+Enables Collaborative Memory Management Assist (CMMA) for the virtual machine.
+
+1.2. ATTRIBUTE: KVM_S390_VM_MEM_CLR_CMMA
+----------------------------------------
+
+:Parameters: none
+:Returns: -EINVAL if CMMA was not enabled;
+         0 otherwise
+
+Clear the CMMA status for all guest pages, so any pages the guest marked
+as unused are again used any may not be reclaimed by the host.
+
+1.3. ATTRIBUTE KVM_S390_VM_MEM_LIMIT_SIZE
+-----------------------------------------
+
+:Parameters: in attr->addr the address for the new limit of guest memory
+:Returns: -EFAULT if the given address is not accessible;
+         -EINVAL if the virtual machine is of type UCONTROL;
+         -E2BIG if the given guest memory is to big for that machine;
+         -EBUSY if a vcpu is already defined;
+         -ENOMEM if not enough memory is available for a new shadow guest mapping;
+         0 otherwise.
+
+Allows userspace to query the actual limit and set a new limit for
+the maximum guest memory size. The limit will be rounded up to
+2048 MB, 4096 GB, 8192 TB respectively, as this limit is governed by
+the number of page table levels. In the case that there is no limit we will set
+the limit to KVM_S390_NO_MEM_LIMIT (U64_MAX).
+
+2. GROUP: KVM_S390_VM_CPU_MODEL
+===============================
+
+:Architectures: s390
+
+2.1. ATTRIBUTE: KVM_S390_VM_CPU_MACHINE (r/o)
+---------------------------------------------
+
+Allows user space to retrieve machine and kvm specific cpu related information::
+
+  struct kvm_s390_vm_cpu_machine {
+       __u64 cpuid;           # CPUID of host
+       __u32 ibc;             # IBC level range offered by host
+       __u8  pad[4];
+       __u64 fac_mask[256];   # set of cpu facilities enabled by KVM
+       __u64 fac_list[256];   # set of cpu facilities offered by host
+  }
+
+:Parameters: address of buffer to store the machine related cpu data
+            of type struct kvm_s390_vm_cpu_machine*
+:Returns:   -EFAULT if the given address is not accessible from kernel space;
+           -ENOMEM if not enough memory is available to process the ioctl;
+           0 in case of success.
+
+2.2. ATTRIBUTE: KVM_S390_VM_CPU_PROCESSOR (r/w)
+===============================================
+
+Allows user space to retrieve or request to change cpu related information for a vcpu::
+
+  struct kvm_s390_vm_cpu_processor {
+       __u64 cpuid;           # CPUID currently (to be) used by this vcpu
+       __u16 ibc;             # IBC level currently (to be) used by this vcpu
+       __u8  pad[6];
+       __u64 fac_list[256];   # set of cpu facilities currently (to be) used
+                             # by this vcpu
+  }
+
+KVM does not enforce or limit the cpu model data in any form. Take the information
+retrieved by means of KVM_S390_VM_CPU_MACHINE as hint for reasonable configuration
+setups. Instruction interceptions triggered by additionally set facility bits that
+are not handled by KVM need to by imlemented in the VM driver code.
+
+:Parameters: address of buffer to store/set the processor related cpu
+            data of type struct kvm_s390_vm_cpu_processor*.
+:Returns:  -EBUSY in case 1 or more vcpus are already activated (only in write case);
+          -EFAULT if the given address is not accessible from kernel space;
+          -ENOMEM if not enough memory is available to process the ioctl;
+          0 in case of success.
+
+.. _KVM_S390_VM_CPU_MACHINE_FEAT:
+
+2.3. ATTRIBUTE: KVM_S390_VM_CPU_MACHINE_FEAT (r/o)
+--------------------------------------------------
+
+Allows user space to retrieve available cpu features. A feature is available if
+provided by the hardware and supported by kvm. In theory, cpu features could
+even be completely emulated by kvm.
+
+::
+
+  struct kvm_s390_vm_cpu_feat {
+       __u64 feat[16]; # Bitmap (1 = feature available), MSB 0 bit numbering
+  };
+
+:Parameters: address of a buffer to load the feature list from.
+:Returns:  -EFAULT if the given address is not accessible from kernel space;
+          0 in case of success.
+
+2.4. ATTRIBUTE: KVM_S390_VM_CPU_PROCESSOR_FEAT (r/w)
+----------------------------------------------------
+
+Allows user space to retrieve or change enabled cpu features for all VCPUs of a
+VM. Features that are not available cannot be enabled.
+
+See :ref:`KVM_S390_VM_CPU_MACHINE_FEAT` for
+a description of the parameter struct.
+
+:Parameters: address of a buffer to store/load the feature list from.
+:Returns:   -EFAULT if the given address is not accessible from kernel space;
+           -EINVAL if a cpu feature that is not available is to be enabled;
+           -EBUSY if at least one VCPU has already been defined;
+           0 in case of success.
+
+.. _KVM_S390_VM_CPU_MACHINE_SUBFUNC:
+
+2.5. ATTRIBUTE: KVM_S390_VM_CPU_MACHINE_SUBFUNC (r/o)
+-----------------------------------------------------
+
+Allows user space to retrieve available cpu subfunctions without any filtering
+done by a set IBC. These subfunctions are indicated to the guest VCPU via
+query or "test bit" subfunctions and used e.g. by cpacf functions, plo and ptff.
+
+A subfunction block is only valid if KVM_S390_VM_CPU_MACHINE contains the
+STFL(E) bit introducing the affected instruction. If the affected instruction
+indicates subfunctions via a "query subfunction", the response block is
+contained in the returned struct. If the affected instruction
+indicates subfunctions via a "test bit" mechanism, the subfunction codes are
+contained in the returned struct in MSB 0 bit numbering.
+
+::
+
+  struct kvm_s390_vm_cpu_subfunc {
+       u8 plo[32];           # always valid (ESA/390 feature)
+       u8 ptff[16];          # valid with TOD-clock steering
+       u8 kmac[16];          # valid with Message-Security-Assist
+       u8 kmc[16];           # valid with Message-Security-Assist
+       u8 km[16];            # valid with Message-Security-Assist
+       u8 kimd[16];          # valid with Message-Security-Assist
+       u8 klmd[16];          # valid with Message-Security-Assist
+       u8 pckmo[16];         # valid with Message-Security-Assist-Extension 3
+       u8 kmctr[16];         # valid with Message-Security-Assist-Extension 4
+       u8 kmf[16];           # valid with Message-Security-Assist-Extension 4
+       u8 kmo[16];           # valid with Message-Security-Assist-Extension 4
+       u8 pcc[16];           # valid with Message-Security-Assist-Extension 4
+       u8 ppno[16];          # valid with Message-Security-Assist-Extension 5
+       u8 kma[16];           # valid with Message-Security-Assist-Extension 8
+       u8 kdsa[16];          # valid with Message-Security-Assist-Extension 9
+       u8 reserved[1792];    # reserved for future instructions
+  };
+
+:Parameters: address of a buffer to load the subfunction blocks from.
+:Returns:   -EFAULT if the given address is not accessible from kernel space;
+           0 in case of success.
+
+2.6. ATTRIBUTE: KVM_S390_VM_CPU_PROCESSOR_SUBFUNC (r/w)
+-------------------------------------------------------
+
+Allows user space to retrieve or change cpu subfunctions to be indicated for
+all VCPUs of a VM. This attribute will only be available if kernel and
+hardware support are in place.
+
+The kernel uses the configured subfunction blocks for indication to
+the guest. A subfunction block will only be used if the associated STFL(E) bit
+has not been disabled by user space (so the instruction to be queried is
+actually available for the guest).
+
+As long as no data has been written, a read will fail. The IBC will be used
+to determine available subfunctions in this case, this will guarantee backward
+compatibility.
+
+See :ref:`KVM_S390_VM_CPU_MACHINE_SUBFUNC` for a
+description of the parameter struct.
+
+:Parameters: address of a buffer to store/load the subfunction blocks from.
+:Returns:   -EFAULT if the given address is not accessible from kernel space;
+           -EINVAL when reading, if there was no write yet;
+           -EBUSY if at least one VCPU has already been defined;
+           0 in case of success.
+
+3. GROUP: KVM_S390_VM_TOD
+=========================
+
+:Architectures: s390
+
+3.1. ATTRIBUTE: KVM_S390_VM_TOD_HIGH
+------------------------------------
+
+Allows user space to set/get the TOD clock extension (u8) (superseded by
+KVM_S390_VM_TOD_EXT).
+
+:Parameters: address of a buffer in user space to store the data (u8) to
+:Returns:   -EFAULT if the given address is not accessible from kernel space;
+           -EINVAL if setting the TOD clock extension to != 0 is not supported
+
+3.2. ATTRIBUTE: KVM_S390_VM_TOD_LOW
+-----------------------------------
+
+Allows user space to set/get bits 0-63 of the TOD clock register as defined in
+the POP (u64).
+
+:Parameters: address of a buffer in user space to store the data (u64) to
+:Returns:    -EFAULT if the given address is not accessible from kernel space
+
+3.3. ATTRIBUTE: KVM_S390_VM_TOD_EXT
+-----------------------------------
+
+Allows user space to set/get bits 0-63 of the TOD clock register as defined in
+the POP (u64). If the guest CPU model supports the TOD clock extension (u8), it
+also allows user space to get/set it. If the guest CPU model does not support
+it, it is stored as 0 and not allowed to be set to a value != 0.
+
+:Parameters: address of a buffer in user space to store the data
+            (kvm_s390_vm_tod_clock) to
+:Returns:   -EFAULT if the given address is not accessible from kernel space;
+           -EINVAL if setting the TOD clock extension to != 0 is not supported
+
+4. GROUP: KVM_S390_VM_CRYPTO
+============================
+
+:Architectures: s390
+
+4.1. ATTRIBUTE: KVM_S390_VM_CRYPTO_ENABLE_AES_KW (w/o)
+------------------------------------------------------
+
+Allows user space to enable aes key wrapping, including generating a new
+wrapping key.
+
+:Parameters: none
+:Returns:    0
+
+4.2. ATTRIBUTE: KVM_S390_VM_CRYPTO_ENABLE_DEA_KW (w/o)
+------------------------------------------------------
+
+Allows user space to enable dea key wrapping, including generating a new
+wrapping key.
+
+:Parameters: none
+:Returns:    0
+
+4.3. ATTRIBUTE: KVM_S390_VM_CRYPTO_DISABLE_AES_KW (w/o)
+-------------------------------------------------------
+
+Allows user space to disable aes key wrapping, clearing the wrapping key.
+
+:Parameters: none
+:Returns:    0
+
+4.4. ATTRIBUTE: KVM_S390_VM_CRYPTO_DISABLE_DEA_KW (w/o)
+-------------------------------------------------------
+
+Allows user space to disable dea key wrapping, clearing the wrapping key.
+
+:Parameters: none
+:Returns:    0
+
+5. GROUP: KVM_S390_VM_MIGRATION
+===============================
+
+:Architectures: s390
+
+5.1. ATTRIBUTE: KVM_S390_VM_MIGRATION_STOP (w/o)
+------------------------------------------------
+
+Allows userspace to stop migration mode, needed for PGSTE migration.
+Setting this attribute when migration mode is not active will have no
+effects.
+
+:Parameters: none
+:Returns:    0
+
+5.2. ATTRIBUTE: KVM_S390_VM_MIGRATION_START (w/o)
+-------------------------------------------------
+
+Allows userspace to start migration mode, needed for PGSTE migration.
+Setting this attribute when migration mode is already active will have
+no effects.
+
+:Parameters: none
+:Returns:   -ENOMEM if there is not enough free memory to start migration mode;
+           -EINVAL if the state of the VM is invalid (e.g. no memory defined);
+           0 in case of success.
+
+5.3. ATTRIBUTE: KVM_S390_VM_MIGRATION_STATUS (r/o)
+--------------------------------------------------
+
+Allows userspace to query the status of migration mode.
+
+:Parameters: address of a buffer in user space to store the data (u64) to;
+            the data itself is either 0 if migration mode is disabled or 1
+            if it is enabled
+:Returns:   -EFAULT if the given address is not accessible from kernel space;
+           0 in case of success.
diff --git a/Documentation/virt/kvm/devices/vm.txt b/Documentation/virt/kvm/devices/vm.txt

deleted file mode 100644 (file)

index 4ffb82b..0000000
--- a/Documentation/virt/kvm/devices/vm.txt
+++ /dev/null
@@ -1,270 +0,0 @@
-Generic vm interface
-====================================
-
-The virtual machine "device" also accepts the ioctls KVM_SET_DEVICE_ATTR,
-KVM_GET_DEVICE_ATTR, and KVM_HAS_DEVICE_ATTR. The interface uses the same
-struct kvm_device_attr as other devices, but targets VM-wide settings
-and controls.
-
-The groups and attributes per virtual machine, if any, are architecture
-specific.
-
-1. GROUP: KVM_S390_VM_MEM_CTRL
-Architectures: s390
-
-1.1. ATTRIBUTE: KVM_S390_VM_MEM_ENABLE_CMMA
-Parameters: none
-Returns: -EBUSY if a vcpu is already defined, otherwise 0
-
-Enables Collaborative Memory Management Assist (CMMA) for the virtual machine.
-
-1.2. ATTRIBUTE: KVM_S390_VM_MEM_CLR_CMMA
-Parameters: none
-Returns: -EINVAL if CMMA was not enabled
-         0 otherwise
-
-Clear the CMMA status for all guest pages, so any pages the guest marked
-as unused are again used any may not be reclaimed by the host.
-
-1.3. ATTRIBUTE KVM_S390_VM_MEM_LIMIT_SIZE
-Parameters: in attr->addr the address for the new limit of guest memory
-Returns: -EFAULT if the given address is not accessible
-         -EINVAL if the virtual machine is of type UCONTROL
-         -E2BIG if the given guest memory is to big for that machine
-         -EBUSY if a vcpu is already defined
-         -ENOMEM if not enough memory is available for a new shadow guest mapping
-          0 otherwise
-
-Allows userspace to query the actual limit and set a new limit for
-the maximum guest memory size. The limit will be rounded up to
-2048 MB, 4096 GB, 8192 TB respectively, as this limit is governed by
-the number of page table levels. In the case that there is no limit we will set
-the limit to KVM_S390_NO_MEM_LIMIT (U64_MAX).
-
-2. GROUP: KVM_S390_VM_CPU_MODEL
-Architectures: s390
-
-2.1. ATTRIBUTE: KVM_S390_VM_CPU_MACHINE (r/o)
-
-Allows user space to retrieve machine and kvm specific cpu related information:
-
-struct kvm_s390_vm_cpu_machine {
-       __u64 cpuid;           # CPUID of host
-       __u32 ibc;             # IBC level range offered by host
-       __u8  pad[4];
-       __u64 fac_mask[256];   # set of cpu facilities enabled by KVM
-       __u64 fac_list[256];   # set of cpu facilities offered by host
-}
-
-Parameters: address of buffer to store the machine related cpu data
-            of type struct kvm_s390_vm_cpu_machine*
-Returns:    -EFAULT if the given address is not accessible from kernel space
-           -ENOMEM if not enough memory is available to process the ioctl
-           0 in case of success
-
-2.2. ATTRIBUTE: KVM_S390_VM_CPU_PROCESSOR (r/w)
-
-Allows user space to retrieve or request to change cpu related information for a vcpu:
-
-struct kvm_s390_vm_cpu_processor {
-       __u64 cpuid;           # CPUID currently (to be) used by this vcpu
-       __u16 ibc;             # IBC level currently (to be) used by this vcpu
-       __u8  pad[6];
-       __u64 fac_list[256];   # set of cpu facilities currently (to be) used
-                              # by this vcpu
-}
-
-KVM does not enforce or limit the cpu model data in any form. Take the information
-retrieved by means of KVM_S390_VM_CPU_MACHINE as hint for reasonable configuration
-setups. Instruction interceptions triggered by additionally set facility bits that
-are not handled by KVM need to by imlemented in the VM driver code.
-
-Parameters: address of buffer to store/set the processor related cpu
-           data of type struct kvm_s390_vm_cpu_processor*.
-Returns:    -EBUSY in case 1 or more vcpus are already activated (only in write case)
-           -EFAULT if the given address is not accessible from kernel space
-           -ENOMEM if not enough memory is available to process the ioctl
-           0 in case of success
-
-2.3. ATTRIBUTE: KVM_S390_VM_CPU_MACHINE_FEAT (r/o)
-
-Allows user space to retrieve available cpu features. A feature is available if
-provided by the hardware and supported by kvm. In theory, cpu features could
-even be completely emulated by kvm.
-
-struct kvm_s390_vm_cpu_feat {
-        __u64 feat[16]; # Bitmap (1 = feature available), MSB 0 bit numbering
-};
-
-Parameters: address of a buffer to load the feature list from.
-Returns:    -EFAULT if the given address is not accessible from kernel space.
-           0 in case of success.
-
-2.4. ATTRIBUTE: KVM_S390_VM_CPU_PROCESSOR_FEAT (r/w)
-
-Allows user space to retrieve or change enabled cpu features for all VCPUs of a
-VM. Features that are not available cannot be enabled.
-
-See 2.3. for a description of the parameter struct.
-
-Parameters: address of a buffer to store/load the feature list from.
-Returns:    -EFAULT if the given address is not accessible from kernel space.
-           -EINVAL if a cpu feature that is not available is to be enabled.
-           -EBUSY if at least one VCPU has already been defined.
-           0 in case of success.
-
-2.5. ATTRIBUTE: KVM_S390_VM_CPU_MACHINE_SUBFUNC (r/o)
-
-Allows user space to retrieve available cpu subfunctions without any filtering
-done by a set IBC. These subfunctions are indicated to the guest VCPU via
-query or "test bit" subfunctions and used e.g. by cpacf functions, plo and ptff.
-
-A subfunction block is only valid if KVM_S390_VM_CPU_MACHINE contains the
-STFL(E) bit introducing the affected instruction. If the affected instruction
-indicates subfunctions via a "query subfunction", the response block is
-contained in the returned struct. If the affected instruction
-indicates subfunctions via a "test bit" mechanism, the subfunction codes are
-contained in the returned struct in MSB 0 bit numbering.
-
-struct kvm_s390_vm_cpu_subfunc {
-       u8 plo[32];           # always valid (ESA/390 feature)
-       u8 ptff[16];          # valid with TOD-clock steering
-       u8 kmac[16];          # valid with Message-Security-Assist
-       u8 kmc[16];           # valid with Message-Security-Assist
-       u8 km[16];            # valid with Message-Security-Assist
-       u8 kimd[16];          # valid with Message-Security-Assist
-       u8 klmd[16];          # valid with Message-Security-Assist
-       u8 pckmo[16];         # valid with Message-Security-Assist-Extension 3
-       u8 kmctr[16];         # valid with Message-Security-Assist-Extension 4
-       u8 kmf[16];           # valid with Message-Security-Assist-Extension 4
-       u8 kmo[16];           # valid with Message-Security-Assist-Extension 4
-       u8 pcc[16];           # valid with Message-Security-Assist-Extension 4
-       u8 ppno[16];          # valid with Message-Security-Assist-Extension 5
-       u8 kma[16];           # valid with Message-Security-Assist-Extension 8
-       u8 kdsa[16];          # valid with Message-Security-Assist-Extension 9
-       u8 reserved[1792];    # reserved for future instructions
-};
-
-Parameters: address of a buffer to load the subfunction blocks from.
-Returns:    -EFAULT if the given address is not accessible from kernel space.
-           0 in case of success.
-
-2.6. ATTRIBUTE: KVM_S390_VM_CPU_PROCESSOR_SUBFUNC (r/w)
-
-Allows user space to retrieve or change cpu subfunctions to be indicated for
-all VCPUs of a VM. This attribute will only be available if kernel and
-hardware support are in place.
-
-The kernel uses the configured subfunction blocks for indication to
-the guest. A subfunction block will only be used if the associated STFL(E) bit
-has not been disabled by user space (so the instruction to be queried is
-actually available for the guest).
-
-As long as no data has been written, a read will fail. The IBC will be used
-to determine available subfunctions in this case, this will guarantee backward
-compatibility.
-
-See 2.5. for a description of the parameter struct.
-
-Parameters: address of a buffer to store/load the subfunction blocks from.
-Returns:    -EFAULT if the given address is not accessible from kernel space.
-           -EINVAL when reading, if there was no write yet.
-           -EBUSY if at least one VCPU has already been defined.
-           0 in case of success.
-
-3. GROUP: KVM_S390_VM_TOD
-Architectures: s390
-
-3.1. ATTRIBUTE: KVM_S390_VM_TOD_HIGH
-
-Allows user space to set/get the TOD clock extension (u8) (superseded by
-KVM_S390_VM_TOD_EXT).
-
-Parameters: address of a buffer in user space to store the data (u8) to
-Returns:    -EFAULT if the given address is not accessible from kernel space
-           -EINVAL if setting the TOD clock extension to != 0 is not supported
-
-3.2. ATTRIBUTE: KVM_S390_VM_TOD_LOW
-
-Allows user space to set/get bits 0-63 of the TOD clock register as defined in
-the POP (u64).
-
-Parameters: address of a buffer in user space to store the data (u64) to
-Returns:    -EFAULT if the given address is not accessible from kernel space
-
-3.3. ATTRIBUTE: KVM_S390_VM_TOD_EXT
-Allows user space to set/get bits 0-63 of the TOD clock register as defined in
-the POP (u64). If the guest CPU model supports the TOD clock extension (u8), it
-also allows user space to get/set it. If the guest CPU model does not support
-it, it is stored as 0 and not allowed to be set to a value != 0.
-
-Parameters: address of a buffer in user space to store the data
-            (kvm_s390_vm_tod_clock) to
-Returns:    -EFAULT if the given address is not accessible from kernel space
-           -EINVAL if setting the TOD clock extension to != 0 is not supported
-
-4. GROUP: KVM_S390_VM_CRYPTO
-Architectures: s390
-
-4.1. ATTRIBUTE: KVM_S390_VM_CRYPTO_ENABLE_AES_KW (w/o)
-
-Allows user space to enable aes key wrapping, including generating a new
-wrapping key.
-
-Parameters: none
-Returns:    0
-
-4.2. ATTRIBUTE: KVM_S390_VM_CRYPTO_ENABLE_DEA_KW (w/o)
-
-Allows user space to enable dea key wrapping, including generating a new
-wrapping key.
-
-Parameters: none
-Returns:    0
-
-4.3. ATTRIBUTE: KVM_S390_VM_CRYPTO_DISABLE_AES_KW (w/o)
-
-Allows user space to disable aes key wrapping, clearing the wrapping key.
-
-Parameters: none
-Returns:    0
-
-4.4. ATTRIBUTE: KVM_S390_VM_CRYPTO_DISABLE_DEA_KW (w/o)
-
-Allows user space to disable dea key wrapping, clearing the wrapping key.
-
-Parameters: none
-Returns:    0
-
-5. GROUP: KVM_S390_VM_MIGRATION
-Architectures: s390
-
-5.1. ATTRIBUTE: KVM_S390_VM_MIGRATION_STOP (w/o)
-
-Allows userspace to stop migration mode, needed for PGSTE migration.
-Setting this attribute when migration mode is not active will have no
-effects.
-
-Parameters: none
-Returns:    0
-
-5.2. ATTRIBUTE: KVM_S390_VM_MIGRATION_START (w/o)
-
-Allows userspace to start migration mode, needed for PGSTE migration.
-Setting this attribute when migration mode is already active will have
-no effects.
-
-Parameters: none
-Returns:    -ENOMEM if there is not enough free memory to start migration mode
-           -EINVAL if the state of the VM is invalid (e.g. no memory defined)
-           0 in case of success.
-
-5.3. ATTRIBUTE: KVM_S390_VM_MIGRATION_STATUS (r/o)
-
-Allows userspace to query the status of migration mode.
-
-Parameters: address of a buffer in user space to store the data (u64) to;
-           the data itself is either 0 if migration mode is disabled or 1
-           if it is enabled
-Returns:    -EFAULT if the given address is not accessible from kernel space
-           0 in case of success.
diff --git a/Documentation/virt/kvm/devices/xics.rst b/Documentation/virt/kvm/devices/xics.rst

new file mode 100644 (file)

index 0000000..2d6927e
--- /dev/null
+++ b/Documentation/virt/kvm/devices/xics.rst
@@ -0,0 +1,92 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=========================
+XICS interrupt controller
+=========================
+
+Device type supported: KVM_DEV_TYPE_XICS
+
+Groups:
+  1. KVM_DEV_XICS_GRP_SOURCES
+       Attributes:
+
+         One per interrupt source, indexed by the source number.
+  2. KVM_DEV_XICS_GRP_CTRL
+       Attributes:
+
+         2.1 KVM_DEV_XICS_NR_SERVERS (write only)
+
+  The kvm_device_attr.addr points to a __u32 value which is the number of
+  interrupt server numbers (ie, highest possible vcpu id plus one).
+
+  Errors:
+
+    =======  ==========================================
+    -EINVAL  Value greater than KVM_MAX_VCPU_ID.
+    -EFAULT  Invalid user pointer for attr->addr.
+    -EBUSY   A vcpu is already connected to the device.
+    =======  ==========================================
+
+This device emulates the XICS (eXternal Interrupt Controller
+Specification) defined in PAPR.  The XICS has a set of interrupt
+sources, each identified by a 20-bit source number, and a set of
+Interrupt Control Presentation (ICP) entities, also called "servers",
+each associated with a virtual CPU.
+
+The ICP entities are created by enabling the KVM_CAP_IRQ_ARCH
+capability for each vcpu, specifying KVM_CAP_IRQ_XICS in args[0] and
+the interrupt server number (i.e. the vcpu number from the XICS's
+point of view) in args[1] of the kvm_enable_cap struct.  Each ICP has
+64 bits of state which can be read and written using the
+KVM_GET_ONE_REG and KVM_SET_ONE_REG ioctls on the vcpu.  The 64 bit
+state word has the following bitfields, starting at the
+least-significant end of the word:
+
+* Unused, 16 bits
+
+* Pending interrupt priority, 8 bits
+  Zero is the highest priority, 255 means no interrupt is pending.
+
+* Pending IPI (inter-processor interrupt) priority, 8 bits
+  Zero is the highest priority, 255 means no IPI is pending.
+
+* Pending interrupt source number, 24 bits
+  Zero means no interrupt pending, 2 means an IPI is pending
+
+* Current processor priority, 8 bits
+  Zero is the highest priority, meaning no interrupts can be
+  delivered, and 255 is the lowest priority.
+
+Each source has 64 bits of state that can be read and written using
+the KVM_GET_DEVICE_ATTR and KVM_SET_DEVICE_ATTR ioctls, specifying the
+KVM_DEV_XICS_GRP_SOURCES attribute group, with the attribute number being
+the interrupt source number.  The 64 bit state word has the following
+bitfields, starting from the least-significant end of the word:
+
+* Destination (server number), 32 bits
+
+  This specifies where the interrupt should be sent, and is the
+  interrupt server number specified for the destination vcpu.
+
+* Priority, 8 bits
+
+  This is the priority specified for this interrupt source, where 0 is
+  the highest priority and 255 is the lowest.  An interrupt with a
+  priority of 255 will never be delivered.
+
+* Level sensitive flag, 1 bit
+
+  This bit is 1 for a level-sensitive interrupt source, or 0 for
+  edge-sensitive (or MSI).
+
+* Masked flag, 1 bit
+
+  This bit is set to 1 if the interrupt is masked (cannot be delivered
+  regardless of its priority), for example by the ibm,int-off RTAS
+  call, or 0 if it is not masked.
+
+* Pending flag, 1 bit
+
+  This bit is 1 if the source has a pending interrupt, otherwise 0.
+
+Only one XICS instance may be created per VM.
diff --git a/Documentation/virt/kvm/devices/xics.txt b/Documentation/virt/kvm/devices/xics.txt

deleted file mode 100644 (file)

index 423332d..0000000
--- a/Documentation/virt/kvm/devices/xics.txt
+++ /dev/null
@@ -1,76 +0,0 @@
-XICS interrupt controller
-
-Device type supported: KVM_DEV_TYPE_XICS
-
-Groups:
-  1. KVM_DEV_XICS_GRP_SOURCES
-  Attributes: One per interrupt source, indexed by the source number.
-
-  2. KVM_DEV_XICS_GRP_CTRL
-  Attributes:
-    2.1 KVM_DEV_XICS_NR_SERVERS (write only)
-  The kvm_device_attr.addr points to a __u32 value which is the number of
-  interrupt server numbers (ie, highest possible vcpu id plus one).
-  Errors:
-    -EINVAL: Value greater than KVM_MAX_VCPU_ID.
-    -EFAULT: Invalid user pointer for attr->addr.
-    -EBUSY:  A vcpu is already connected to the device.
-
-This device emulates the XICS (eXternal Interrupt Controller
-Specification) defined in PAPR.  The XICS has a set of interrupt
-sources, each identified by a 20-bit source number, and a set of
-Interrupt Control Presentation (ICP) entities, also called "servers",
-each associated with a virtual CPU.
-
-The ICP entities are created by enabling the KVM_CAP_IRQ_ARCH
-capability for each vcpu, specifying KVM_CAP_IRQ_XICS in args[0] and
-the interrupt server number (i.e. the vcpu number from the XICS's
-point of view) in args[1] of the kvm_enable_cap struct.  Each ICP has
-64 bits of state which can be read and written using the
-KVM_GET_ONE_REG and KVM_SET_ONE_REG ioctls on the vcpu.  The 64 bit
-state word has the following bitfields, starting at the
-least-significant end of the word:
-
-* Unused, 16 bits
-
-* Pending interrupt priority, 8 bits
-  Zero is the highest priority, 255 means no interrupt is pending.
-
-* Pending IPI (inter-processor interrupt) priority, 8 bits
-  Zero is the highest priority, 255 means no IPI is pending.
-
-* Pending interrupt source number, 24 bits
-  Zero means no interrupt pending, 2 means an IPI is pending
-
-* Current processor priority, 8 bits
-  Zero is the highest priority, meaning no interrupts can be
-  delivered, and 255 is the lowest priority.
-
-Each source has 64 bits of state that can be read and written using
-the KVM_GET_DEVICE_ATTR and KVM_SET_DEVICE_ATTR ioctls, specifying the
-KVM_DEV_XICS_GRP_SOURCES attribute group, with the attribute number being
-the interrupt source number.  The 64 bit state word has the following
-bitfields, starting from the least-significant end of the word:
-
-* Destination (server number), 32 bits
-  This specifies where the interrupt should be sent, and is the
-  interrupt server number specified for the destination vcpu.
-
-* Priority, 8 bits
-  This is the priority specified for this interrupt source, where 0 is
-  the highest priority and 255 is the lowest.  An interrupt with a
-  priority of 255 will never be delivered.
-
-* Level sensitive flag, 1 bit
-  This bit is 1 for a level-sensitive interrupt source, or 0 for
-  edge-sensitive (or MSI).
-
-* Masked flag, 1 bit
-  This bit is set to 1 if the interrupt is masked (cannot be delivered
-  regardless of its priority), for example by the ibm,int-off RTAS
-  call, or 0 if it is not masked.
-
-* Pending flag, 1 bit
-  This bit is 1 if the source has a pending interrupt, otherwise 0.
-
-Only one XICS instance may be created per VM.
diff --git a/Documentation/virt/kvm/devices/xive.rst b/Documentation/virt/kvm/devices/xive.rst

new file mode 100644 (file)

index 0000000..8bdf3dc
--- /dev/null
+++ b/Documentation/virt/kvm/devices/xive.rst
@@ -0,0 +1,247 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===========================================================
+POWER9 eXternal Interrupt Virtualization Engine (XIVE Gen1)
+===========================================================
+
+Device types supported:
+  - KVM_DEV_TYPE_XIVE     POWER9 XIVE Interrupt Controller generation 1
+
+This device acts as a VM interrupt controller. It provides the KVM
+interface to configure the interrupt sources of a VM in the underlying
+POWER9 XIVE interrupt controller.
+
+Only one XIVE instance may be instantiated. A guest XIVE device
+requires a POWER9 host and the guest OS should have support for the
+XIVE native exploitation interrupt mode. If not, it should run using
+the legacy interrupt mode, referred as XICS (POWER7/8).
+
+* Device Mappings
+
+  The KVM device exposes different MMIO ranges of the XIVE HW which
+  are required for interrupt management. These are exposed to the
+  guest in VMAs populated with a custom VM fault handler.
+
+  1. Thread Interrupt Management Area (TIMA)
+
+  Each thread has an associated Thread Interrupt Management context
+  composed of a set of registers. These registers let the thread
+  handle priority management and interrupt acknowledgment. The most
+  important are :
+
+      - Interrupt Pending Buffer     (IPB)
+      - Current Processor Priority   (CPPR)
+      - Notification Source Register (NSR)
+
+  They are exposed to software in four different pages each proposing
+  a view with a different privilege. The first page is for the
+  physical thread context and the second for the hypervisor. Only the
+  third (operating system) and the fourth (user level) are exposed the
+  guest.
+
+  2. Event State Buffer (ESB)
+
+  Each source is associated with an Event State Buffer (ESB) with
+  either a pair of even/odd pair of pages which provides commands to
+  manage the source: to trigger, to EOI, to turn off the source for
+  instance.
+
+  3. Device pass-through
+
+  When a device is passed-through into the guest, the source
+  interrupts are from a different HW controller (PHB4) and the ESB
+  pages exposed to the guest should accommadate this change.
+
+  The passthru_irq helpers, kvmppc_xive_set_mapped() and
+  kvmppc_xive_clr_mapped() are called when the device HW irqs are
+  mapped into or unmapped from the guest IRQ number space. The KVM
+  device extends these helpers to clear the ESB pages of the guest IRQ
+  number being mapped and then lets the VM fault handler repopulate.
+  The handler will insert the ESB page corresponding to the HW
+  interrupt of the device being passed-through or the initial IPI ESB
+  page if the device has being removed.
+
+  The ESB remapping is fully transparent to the guest and the OS
+  device driver. All handling is done within VFIO and the above
+  helpers in KVM-PPC.
+
+* Groups:
+
+1. KVM_DEV_XIVE_GRP_CTRL
+     Provides global controls on the device
+
+  Attributes:
+    1.1 KVM_DEV_XIVE_RESET (write only)
+    Resets the interrupt controller configuration for sources and event
+    queues. To be used by kexec and kdump.
+
+    Errors: none
+
+    1.2 KVM_DEV_XIVE_EQ_SYNC (write only)
+    Sync all the sources and queues and mark the EQ pages dirty. This
+    to make sure that a consistent memory state is captured when
+    migrating the VM.
+
+    Errors: none
+
+    1.3 KVM_DEV_XIVE_NR_SERVERS (write only)
+    The kvm_device_attr.addr points to a __u32 value which is the number of
+    interrupt server numbers (ie, highest possible vcpu id plus one).
+
+    Errors:
+
+      =======  ==========================================
+      -EINVAL  Value greater than KVM_MAX_VCPU_ID.
+      -EFAULT  Invalid user pointer for attr->addr.
+      -EBUSY   A vCPU is already connected to the device.
+      =======  ==========================================
+
+2. KVM_DEV_XIVE_GRP_SOURCE (write only)
+     Initializes a new source in the XIVE device and mask it.
+
+  Attributes:
+    Interrupt source number  (64-bit)
+
+  The kvm_device_attr.addr points to a __u64 value::
+
+    bits:     | 63   ....  2 |   1   |   0
+    values:   |    unused    | level | type
+
+  - type:  0:MSI 1:LSI
+  - level: assertion level in case of an LSI.
+
+  Errors:
+
+    =======  ==========================================
+    -E2BIG   Interrupt source number is out of range
+    -ENOMEM  Could not create a new source block
+    -EFAULT  Invalid user pointer for attr->addr.
+    -ENXIO   Could not allocate underlying HW interrupt
+    =======  ==========================================
+
+3. KVM_DEV_XIVE_GRP_SOURCE_CONFIG (write only)
+     Configures source targeting
+
+  Attributes:
+    Interrupt source number  (64-bit)
+
+  The kvm_device_attr.addr points to a __u64 value::
+
+    bits:     | 63   ....  33 |  32  | 31 .. 3 |  2 .. 0
+    values:   |    eisn       | mask |  server | priority
+
+  - priority: 0-7 interrupt priority level
+  - server: CPU number chosen to handle the interrupt
+  - mask: mask flag (unused)
+  - eisn: Effective Interrupt Source Number
+
+  Errors:
+
+    =======  =======================================================
+    -ENOENT  Unknown source number
+    -EINVAL  Not initialized source number
+    -EINVAL  Invalid priority
+    -EINVAL  Invalid CPU number.
+    -EFAULT  Invalid user pointer for attr->addr.
+    -ENXIO   CPU event queues not configured or configuration of the
+            underlying HW interrupt failed
+    -EBUSY   No CPU available to serve interrupt
+    =======  =======================================================
+
+4. KVM_DEV_XIVE_GRP_EQ_CONFIG (read-write)
+     Configures an event queue of a CPU
+
+  Attributes:
+    EQ descriptor identifier (64-bit)
+
+  The EQ descriptor identifier is a tuple (server, priority)::
+
+    bits:     | 63   ....  32 | 31 .. 3 |  2 .. 0
+    values:   |    unused     |  server | priority
+
+  The kvm_device_attr.addr points to::
+
+    struct kvm_ppc_xive_eq {
+       __u32 flags;
+       __u32 qshift;
+       __u64 qaddr;
+       __u32 qtoggle;
+       __u32 qindex;
+       __u8  pad[40];
+    };
+
+  - flags: queue flags
+      KVM_XIVE_EQ_ALWAYS_NOTIFY (required)
+       forces notification without using the coalescing mechanism
+       provided by the XIVE END ESBs.
+  - qshift: queue size (power of 2)
+  - qaddr: real address of queue
+  - qtoggle: current queue toggle bit
+  - qindex: current queue index
+  - pad: reserved for future use
+
+  Errors:
+
+    =======  =========================================
+    -ENOENT  Invalid CPU number
+    -EINVAL  Invalid priority
+    -EINVAL  Invalid flags
+    -EINVAL  Invalid queue size
+    -EINVAL  Invalid queue address
+    -EFAULT  Invalid user pointer for attr->addr.
+    -EIO     Configuration of the underlying HW failed
+    =======  =========================================
+
+5. KVM_DEV_XIVE_GRP_SOURCE_SYNC (write only)
+     Synchronize the source to flush event notifications
+
+  Attributes:
+    Interrupt source number  (64-bit)
+
+  Errors:
+
+    =======  =============================
+    -ENOENT  Unknown source number
+    -EINVAL  Not initialized source number
+    =======  =============================
+
+* VCPU state
+
+  The XIVE IC maintains VP interrupt state in an internal structure
+  called the NVT. When a VP is not dispatched on a HW processor
+  thread, this structure can be updated by HW if the VP is the target
+  of an event notification.
+
+  It is important for migration to capture the cached IPB from the NVT
+  as it synthesizes the priorities of the pending interrupts. We
+  capture a bit more to report debug information.
+
+  KVM_REG_PPC_VP_STATE (2 * 64bits)::
+
+    bits:     |  63  ....  32  |  31  ....  0  |
+    values:   |   TIMA word0   |   TIMA word1  |
+    bits:     | 127       ..........       64  |
+    values:   |            unused              |
+
+* Migration:
+
+  Saving the state of a VM using the XIVE native exploitation mode
+  should follow a specific sequence. When the VM is stopped :
+
+  1. Mask all sources (PQ=01) to stop the flow of events.
+
+  2. Sync the XIVE device with the KVM control KVM_DEV_XIVE_EQ_SYNC to
+  flush any in-flight event notification and to stabilize the EQs. At
+  this stage, the EQ pages are marked dirty to make sure they are
+  transferred in the migration sequence.
+
+  3. Capture the state of the source targeting, the EQs configuration
+  and the state of thread interrupt context registers.
+
+  Restore is similar:
+
+  1. Restore the EQ configuration. As targeting depends on it.
+  2. Restore targeting
+  3. Restore the thread interrupt contexts
+  4. Restore the source states
+  5. Let the vCPU run
diff --git a/Documentation/virt/kvm/devices/xive.txt b/Documentation/virt/kvm/devices/xive.txt

deleted file mode 100644 (file)

index f5d1d6b..0000000
--- a/Documentation/virt/kvm/devices/xive.txt
+++ /dev/null
@@ -1,205 +0,0 @@
-POWER9 eXternal Interrupt Virtualization Engine (XIVE Gen1)
-==========================================================
-
-Device types supported:
-  KVM_DEV_TYPE_XIVE     POWER9 XIVE Interrupt Controller generation 1
-
-This device acts as a VM interrupt controller. It provides the KVM
-interface to configure the interrupt sources of a VM in the underlying
-POWER9 XIVE interrupt controller.
-
-Only one XIVE instance may be instantiated. A guest XIVE device
-requires a POWER9 host and the guest OS should have support for the
-XIVE native exploitation interrupt mode. If not, it should run using
-the legacy interrupt mode, referred as XICS (POWER7/8).
-
-* Device Mappings
-
-  The KVM device exposes different MMIO ranges of the XIVE HW which
-  are required for interrupt management. These are exposed to the
-  guest in VMAs populated with a custom VM fault handler.
-
-  1. Thread Interrupt Management Area (TIMA)
-
-  Each thread has an associated Thread Interrupt Management context
-  composed of a set of registers. These registers let the thread
-  handle priority management and interrupt acknowledgment. The most
-  important are :
-
-      - Interrupt Pending Buffer     (IPB)
-      - Current Processor Priority   (CPPR)
-      - Notification Source Register (NSR)
-
-  They are exposed to software in four different pages each proposing
-  a view with a different privilege. The first page is for the
-  physical thread context and the second for the hypervisor. Only the
-  third (operating system) and the fourth (user level) are exposed the
-  guest.
-
-  2. Event State Buffer (ESB)
-
-  Each source is associated with an Event State Buffer (ESB) with
-  either a pair of even/odd pair of pages which provides commands to
-  manage the source: to trigger, to EOI, to turn off the source for
-  instance.
-
-  3. Device pass-through
-
-  When a device is passed-through into the guest, the source
-  interrupts are from a different HW controller (PHB4) and the ESB
-  pages exposed to the guest should accommadate this change.
-
-  The passthru_irq helpers, kvmppc_xive_set_mapped() and
-  kvmppc_xive_clr_mapped() are called when the device HW irqs are
-  mapped into or unmapped from the guest IRQ number space. The KVM
-  device extends these helpers to clear the ESB pages of the guest IRQ
-  number being mapped and then lets the VM fault handler repopulate.
-  The handler will insert the ESB page corresponding to the HW
-  interrupt of the device being passed-through or the initial IPI ESB
-  page if the device has being removed.
-
-  The ESB remapping is fully transparent to the guest and the OS
-  device driver. All handling is done within VFIO and the above
-  helpers in KVM-PPC.
-
-* Groups:
-
-  1. KVM_DEV_XIVE_GRP_CTRL
-  Provides global controls on the device
-  Attributes:
-    1.1 KVM_DEV_XIVE_RESET (write only)
-    Resets the interrupt controller configuration for sources and event
-    queues. To be used by kexec and kdump.
-    Errors: none
-
-    1.2 KVM_DEV_XIVE_EQ_SYNC (write only)
-    Sync all the sources and queues and mark the EQ pages dirty. This
-    to make sure that a consistent memory state is captured when
-    migrating the VM.
-    Errors: none
-
-    1.3 KVM_DEV_XIVE_NR_SERVERS (write only)
-    The kvm_device_attr.addr points to a __u32 value which is the number of
-    interrupt server numbers (ie, highest possible vcpu id plus one).
-    Errors:
-      -EINVAL: Value greater than KVM_MAX_VCPU_ID.
-      -EFAULT: Invalid user pointer for attr->addr.
-      -EBUSY:  A vCPU is already connected to the device.
-
-  2. KVM_DEV_XIVE_GRP_SOURCE (write only)
-  Initializes a new source in the XIVE device and mask it.
-  Attributes:
-    Interrupt source number  (64-bit)
-  The kvm_device_attr.addr points to a __u64 value:
-  bits:     | 63   ....  2 |   1   |   0
-  values:   |    unused    | level | type
-  - type:  0:MSI 1:LSI
-  - level: assertion level in case of an LSI.
-  Errors:
-    -E2BIG:  Interrupt source number is out of range
-    -ENOMEM: Could not create a new source block
-    -EFAULT: Invalid user pointer for attr->addr.
-    -ENXIO:  Could not allocate underlying HW interrupt
-
-  3. KVM_DEV_XIVE_GRP_SOURCE_CONFIG (write only)
-  Configures source targeting
-  Attributes:
-    Interrupt source number  (64-bit)
-  The kvm_device_attr.addr points to a __u64 value:
-  bits:     | 63   ....  33 |  32  | 31 .. 3 |  2 .. 0
-  values:   |    eisn       | mask |  server | priority
-  - priority: 0-7 interrupt priority level
-  - server: CPU number chosen to handle the interrupt
-  - mask: mask flag (unused)
-  - eisn: Effective Interrupt Source Number
-  Errors:
-    -ENOENT: Unknown source number
-    -EINVAL: Not initialized source number
-    -EINVAL: Invalid priority
-    -EINVAL: Invalid CPU number.
-    -EFAULT: Invalid user pointer for attr->addr.
-    -ENXIO:  CPU event queues not configured or configuration of the
-             underlying HW interrupt failed
-    -EBUSY:  No CPU available to serve interrupt
-
-  4. KVM_DEV_XIVE_GRP_EQ_CONFIG (read-write)
-  Configures an event queue of a CPU
-  Attributes:
-    EQ descriptor identifier (64-bit)
-  The EQ descriptor identifier is a tuple (server, priority) :
-  bits:     | 63   ....  32 | 31 .. 3 |  2 .. 0
-  values:   |    unused     |  server | priority
-  The kvm_device_attr.addr points to :
-    struct kvm_ppc_xive_eq {
-       __u32 flags;
-       __u32 qshift;
-       __u64 qaddr;
-       __u32 qtoggle;
-       __u32 qindex;
-       __u8  pad[40];
-    };
-  - flags: queue flags
-    KVM_XIVE_EQ_ALWAYS_NOTIFY (required)
-       forces notification without using the coalescing mechanism
-       provided by the XIVE END ESBs.
-  - qshift: queue size (power of 2)
-  - qaddr: real address of queue
-  - qtoggle: current queue toggle bit
-  - qindex: current queue index
-  - pad: reserved for future use
-  Errors:
-    -ENOENT: Invalid CPU number
-    -EINVAL: Invalid priority
-    -EINVAL: Invalid flags
-    -EINVAL: Invalid queue size
-    -EINVAL: Invalid queue address
-    -EFAULT: Invalid user pointer for attr->addr.
-    -EIO:    Configuration of the underlying HW failed
-
-  5. KVM_DEV_XIVE_GRP_SOURCE_SYNC (write only)
-  Synchronize the source to flush event notifications
-  Attributes:
-    Interrupt source number  (64-bit)
-  Errors:
-    -ENOENT: Unknown source number
-    -EINVAL: Not initialized source number
-
-* VCPU state
-
-  The XIVE IC maintains VP interrupt state in an internal structure
-  called the NVT. When a VP is not dispatched on a HW processor
-  thread, this structure can be updated by HW if the VP is the target
-  of an event notification.
-
-  It is important for migration to capture the cached IPB from the NVT
-  as it synthesizes the priorities of the pending interrupts. We
-  capture a bit more to report debug information.
-
-  KVM_REG_PPC_VP_STATE (2 * 64bits)
-  bits:     |  63  ....  32  |  31  ....  0  |
-  values:   |   TIMA word0   |   TIMA word1  |
-  bits:     | 127       ..........       64  |
-  values:   |            unused              |
-
-* Migration:
-
-  Saving the state of a VM using the XIVE native exploitation mode
-  should follow a specific sequence. When the VM is stopped :
-
-  1. Mask all sources (PQ=01) to stop the flow of events.
-
-  2. Sync the XIVE device with the KVM control KVM_DEV_XIVE_EQ_SYNC to
-  flush any in-flight event notification and to stabilize the EQs. At
-  this stage, the EQ pages are marked dirty to make sure they are
-  transferred in the migration sequence.
-
-  3. Capture the state of the source targeting, the EQs configuration
-  and the state of thread interrupt context registers.
-
-  Restore is similar :
-
-  1. Restore the EQ configuration. As targeting depends on it.
-  2. Restore targeting
-  3. Restore the thread interrupt contexts
-  4. Restore the source states
-  5. Let the vCPU run
diff --git a/Documentation/virt/kvm/halt-polling.rst b/Documentation/virt/kvm/halt-polling.rst

new file mode 100644 (file)

index 0000000..4922e4a
--- /dev/null
+++ b/Documentation/virt/kvm/halt-polling.rst
@@ -0,0 +1,140 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===========================
+The KVM halt polling system
+===========================
+
+The KVM halt polling system provides a feature within KVM whereby the latency
+of a guest can, under some circumstances, be reduced by polling in the host
+for some time period after the guest has elected to no longer run by cedeing.
+That is, when a guest vcpu has ceded, or in the case of powerpc when all of the
+vcpus of a single vcore have ceded, the host kernel polls for wakeup conditions
+before giving up the cpu to the scheduler in order to let something else run.
+
+Polling provides a latency advantage in cases where the guest can be run again
+very quickly by at least saving us a trip through the scheduler, normally on
+the order of a few micro-seconds, although performance benefits are workload
+dependant. In the event that no wakeup source arrives during the polling
+interval or some other task on the runqueue is runnable the scheduler is
+invoked. Thus halt polling is especially useful on workloads with very short
+wakeup periods where the time spent halt polling is minimised and the time
+savings of not invoking the scheduler are distinguishable.
+
+The generic halt polling code is implemented in:
+
+       virt/kvm/kvm_main.c: kvm_vcpu_block()
+
+The powerpc kvm-hv specific case is implemented in:
+
+       arch/powerpc/kvm/book3s_hv.c: kvmppc_vcore_blocked()
+
+Halt Polling Interval
+=====================
+
+The maximum time for which to poll before invoking the scheduler, referred to
+as the halt polling interval, is increased and decreased based on the perceived
+effectiveness of the polling in an attempt to limit pointless polling.
+This value is stored in either the vcpu struct:
+
+       kvm_vcpu->halt_poll_ns
+
+or in the case of powerpc kvm-hv, in the vcore struct:
+
+       kvmppc_vcore->halt_poll_ns
+
+Thus this is a per vcpu (or vcore) value.
+
+During polling if a wakeup source is received within the halt polling interval,
+the interval is left unchanged. In the event that a wakeup source isn't
+received during the polling interval (and thus schedule is invoked) there are
+two options, either the polling interval and total block time[0] were less than
+the global max polling interval (see module params below), or the total block
+time was greater than the global max polling interval.
+
+In the event that both the polling interval and total block time were less than
+the global max polling interval then the polling interval can be increased in
+the hope that next time during the longer polling interval the wake up source
+will be received while the host is polling and the latency benefits will be
+received. The polling interval is grown in the function grow_halt_poll_ns() and
+is multiplied by the module parameters halt_poll_ns_grow and
+halt_poll_ns_grow_start.
+
+In the event that the total block time was greater than the global max polling
+interval then the host will never poll for long enough (limited by the global
+max) to wakeup during the polling interval so it may as well be shrunk in order
+to avoid pointless polling. The polling interval is shrunk in the function
+shrink_halt_poll_ns() and is divided by the module parameter
+halt_poll_ns_shrink, or set to 0 iff halt_poll_ns_shrink == 0.
+
+It is worth noting that this adjustment process attempts to hone in on some
+steady state polling interval but will only really do a good job for wakeups
+which come at an approximately constant rate, otherwise there will be constant
+adjustment of the polling interval.
+
+[0] total block time:
+                     the time between when the halt polling function is
+                     invoked and a wakeup source received (irrespective of
+                     whether the scheduler is invoked within that function).
+
+Module Parameters
+=================
+
+The kvm module has 3 tuneable module parameters to adjust the global max
+polling interval as well as the rate at which the polling interval is grown and
+shrunk. These variables are defined in include/linux/kvm_host.h and as module
+parameters in virt/kvm/kvm_main.c, or arch/powerpc/kvm/book3s_hv.c in the
+powerpc kvm-hv case.
+
++-----------------------+---------------------------+-------------------------+
+|Module Parameter      |   Description             |        Default Value    |
++-----------------------+---------------------------+-------------------------+
+|halt_poll_ns          | The global max polling    | KVM_HALT_POLL_NS_DEFAULT|
+|                      | interval which defines    |                         |
+|                      | the ceiling value of the  |                         |
+|                      | polling interval for      | (per arch value)        |
+|                      | each vcpu.                |                         |
++-----------------------+---------------------------+-------------------------+
+|halt_poll_ns_grow     | The value by which the    | 2                       |
+|                      | halt polling interval is  |                         |
+|                      | multiplied in the         |                         |
+|                      | grow_halt_poll_ns()       |                         |
+|                      | function.                 |                         |
++-----------------------+---------------------------+-------------------------+
+|halt_poll_ns_grow_start| The initial value to grow | 10000                  |
+|                      | to from zero in the       |                         |
+|                      | grow_halt_poll_ns()       |                         |
+|                      | function.                 |                         |
++-----------------------+---------------------------+-------------------------+
+|halt_poll_ns_shrink   | The value by which the    | 0                       |
+|                      | halt polling interval is  |                         |
+|                      | divided in the            |                         |
+|                      | shrink_halt_poll_ns()     |                         |
+|                      | function.                 |                         |
++-----------------------+---------------------------+-------------------------+
+
+These module parameters can be set from the debugfs files in:
+
+       /sys/module/kvm/parameters/
+
+Note: that these module parameters are system wide values and are not able to
+      be tuned on a per vm basis.
+
+Further Notes
+=============
+
+- Care should be taken when setting the halt_poll_ns module parameter as a large value
+  has the potential to drive the cpu usage to 100% on a machine which would be almost
+  entirely idle otherwise. This is because even if a guest has wakeups during which very
+  little work is done and which are quite far apart, if the period is shorter than the
+  global max polling interval (halt_poll_ns) then the host will always poll for the
+  entire block time and thus cpu utilisation will go to 100%.
+
+- Halt polling essentially presents a trade off between power usage and latency and
+  the module parameters should be used to tune the affinity for this. Idle cpu time is
+  essentially converted to host kernel time with the aim of decreasing latency when
+  entering the guest.
+
+- Halt polling will only be conducted by the host when no other tasks are runnable on
+  that cpu, otherwise the polling will cease immediately and schedule will be invoked to
+  allow that other task to run. Thus this doesn't allow a guest to denial of service the
+  cpu.
diff --git a/Documentation/virt/kvm/halt-polling.txt b/Documentation/virt/kvm/halt-polling.txt

deleted file mode 100644 (file)

index 4f791b1..0000000
--- a/Documentation/virt/kvm/halt-polling.txt
+++ /dev/null
@@ -1,136 +0,0 @@
-The KVM halt polling system
-===========================
-
-The KVM halt polling system provides a feature within KVM whereby the latency
-of a guest can, under some circumstances, be reduced by polling in the host
-for some time period after the guest has elected to no longer run by cedeing.
-That is, when a guest vcpu has ceded, or in the case of powerpc when all of the
-vcpus of a single vcore have ceded, the host kernel polls for wakeup conditions
-before giving up the cpu to the scheduler in order to let something else run.
-
-Polling provides a latency advantage in cases where the guest can be run again
-very quickly by at least saving us a trip through the scheduler, normally on
-the order of a few micro-seconds, although performance benefits are workload
-dependant. In the event that no wakeup source arrives during the polling
-interval or some other task on the runqueue is runnable the scheduler is
-invoked. Thus halt polling is especially useful on workloads with very short
-wakeup periods where the time spent halt polling is minimised and the time
-savings of not invoking the scheduler are distinguishable.
-
-The generic halt polling code is implemented in:
-
-       virt/kvm/kvm_main.c: kvm_vcpu_block()
-
-The powerpc kvm-hv specific case is implemented in:
-
-       arch/powerpc/kvm/book3s_hv.c: kvmppc_vcore_blocked()
-
-Halt Polling Interval
-=====================
-
-The maximum time for which to poll before invoking the scheduler, referred to
-as the halt polling interval, is increased and decreased based on the perceived
-effectiveness of the polling in an attempt to limit pointless polling.
-This value is stored in either the vcpu struct:
-
-       kvm_vcpu->halt_poll_ns
-
-or in the case of powerpc kvm-hv, in the vcore struct:
-
-       kvmppc_vcore->halt_poll_ns
-
-Thus this is a per vcpu (or vcore) value.
-
-During polling if a wakeup source is received within the halt polling interval,
-the interval is left unchanged. In the event that a wakeup source isn't
-received during the polling interval (and thus schedule is invoked) there are
-two options, either the polling interval and total block time[0] were less than
-the global max polling interval (see module params below), or the total block
-time was greater than the global max polling interval.
-
-In the event that both the polling interval and total block time were less than
-the global max polling interval then the polling interval can be increased in
-the hope that next time during the longer polling interval the wake up source
-will be received while the host is polling and the latency benefits will be
-received. The polling interval is grown in the function grow_halt_poll_ns() and
-is multiplied by the module parameters halt_poll_ns_grow and
-halt_poll_ns_grow_start.
-
-In the event that the total block time was greater than the global max polling
-interval then the host will never poll for long enough (limited by the global
-max) to wakeup during the polling interval so it may as well be shrunk in order
-to avoid pointless polling. The polling interval is shrunk in the function
-shrink_halt_poll_ns() and is divided by the module parameter
-halt_poll_ns_shrink, or set to 0 iff halt_poll_ns_shrink == 0.
-
-It is worth noting that this adjustment process attempts to hone in on some
-steady state polling interval but will only really do a good job for wakeups
-which come at an approximately constant rate, otherwise there will be constant
-adjustment of the polling interval.
-
-[0] total block time: the time between when the halt polling function is
-                     invoked and a wakeup source received (irrespective of
-                     whether the scheduler is invoked within that function).
-
-Module Parameters
-=================
-
-The kvm module has 3 tuneable module parameters to adjust the global max
-polling interval as well as the rate at which the polling interval is grown and
-shrunk. These variables are defined in include/linux/kvm_host.h and as module
-parameters in virt/kvm/kvm_main.c, or arch/powerpc/kvm/book3s_hv.c in the
-powerpc kvm-hv case.
-
-Module Parameter       |   Description             |        Default Value
---------------------------------------------------------------------------------
-halt_poll_ns           | The global max polling    | KVM_HALT_POLL_NS_DEFAULT
-                       | interval which defines    |
-                       | the ceiling value of the  |
-                       | polling interval for      | (per arch value)
-                       | each vcpu.                |
---------------------------------------------------------------------------------
-halt_poll_ns_grow      | The value by which the    | 2
-                       | halt polling interval is  |
-                       | multiplied in the         |
-                       | grow_halt_poll_ns()       |
-                       | function.                 |
---------------------------------------------------------------------------------
-halt_poll_ns_grow_start | The initial value to grow | 10000
-                       | to from zero in the       |
-                       | grow_halt_poll_ns()       |
-                       | function.                 |
---------------------------------------------------------------------------------
-halt_poll_ns_shrink    | The value by which the    | 0
-                       | halt polling interval is  |
-                       | divided in the            |
-                       | shrink_halt_poll_ns()     |
-                       | function.                 |
---------------------------------------------------------------------------------
-
-These module parameters can be set from the debugfs files in:
-
-       /sys/module/kvm/parameters/
-
-Note: that these module parameters are system wide values and are not able to
-      be tuned on a per vm basis.
-
-Further Notes
-=============
-
-- Care should be taken when setting the halt_poll_ns module parameter as a
-large value has the potential to drive the cpu usage to 100% on a machine which
-would be almost entirely idle otherwise. This is because even if a guest has
-wakeups during which very little work is done and which are quite far apart, if
-the period is shorter than the global max polling interval (halt_poll_ns) then
-the host will always poll for the entire block time and thus cpu utilisation
-will go to 100%.
-
-- Halt polling essentially presents a trade off between power usage and latency
-and the module parameters should be used to tune the affinity for this. Idle
-cpu time is essentially converted to host kernel time with the aim of decreasing
-latency when entering the guest.
-
-- Halt polling will only be conducted by the host when no other tasks are
-runnable on that cpu, otherwise the polling will cease immediately and
-schedule will be invoked to allow that other task to run. Thus this doesn't
-allow a guest to denial of service the cpu.
diff --git a/Documentation/virt/kvm/hypercalls.rst b/Documentation/virt/kvm/hypercalls.rst

new file mode 100644 (file)

index 0000000..dbaf207
--- /dev/null
+++ b/Documentation/virt/kvm/hypercalls.rst
@@ -0,0 +1,171 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===================
+Linux KVM Hypercall
+===================
+
+X86:
+ KVM Hypercalls have a three-byte sequence of either the vmcall or the vmmcall
+ instruction. The hypervisor can replace it with instructions that are
+ guaranteed to be supported.
+
+ Up to four arguments may be passed in rbx, rcx, rdx, and rsi respectively.
+ The hypercall number should be placed in rax and the return value will be
+ placed in rax.  No other registers will be clobbered unless explicitly stated
+ by the particular hypercall.
+
+S390:
+  R2-R7 are used for parameters 1-6. In addition, R1 is used for hypercall
+  number. The return value is written to R2.
+
+  S390 uses diagnose instruction as hypercall (0x500) along with hypercall
+  number in R1.
+
+  For further information on the S390 diagnose call as supported by KVM,
+  refer to Documentation/virt/kvm/s390-diag.txt.
+
+PowerPC:
+  It uses R3-R10 and hypercall number in R11. R4-R11 are used as output registers.
+  Return value is placed in R3.
+
+  KVM hypercalls uses 4 byte opcode, that are patched with 'hypercall-instructions'
+  property inside the device tree's /hypervisor node.
+  For more information refer to Documentation/virt/kvm/ppc-pv.txt
+
+MIPS:
+  KVM hypercalls use the HYPCALL instruction with code 0 and the hypercall
+  number in $2 (v0). Up to four arguments may be placed in $4-$7 (a0-a3) and
+  the return value is placed in $2 (v0).
+
+KVM Hypercalls Documentation
+============================
+
+The template for each hypercall is:
+1. Hypercall name.
+2. Architecture(s)
+3. Status (deprecated, obsolete, active)
+4. Purpose
+
+1. KVM_HC_VAPIC_POLL_IRQ
+------------------------
+
+:Architecture: x86
+:Status: active
+:Purpose: Trigger guest exit so that the host can check for pending
+          interrupts on reentry.
+
+2. KVM_HC_MMU_OP
+----------------
+
+:Architecture: x86
+:Status: deprecated.
+:Purpose: Support MMU operations such as writing to PTE,
+          flushing TLB, release PT.
+
+3. KVM_HC_FEATURES
+------------------
+
+:Architecture: PPC
+:Status: active
+:Purpose: Expose hypercall availability to the guest. On x86 platforms, cpuid
+          used to enumerate which hypercalls are available. On PPC, either
+         device tree based lookup ( which is also what EPAPR dictates)
+         OR KVM specific enumeration mechanism (which is this hypercall)
+         can be used.
+
+4. KVM_HC_PPC_MAP_MAGIC_PAGE
+----------------------------
+
+:Architecture: PPC
+:Status: active
+:Purpose: To enable communication between the hypervisor and guest there is a
+         shared page that contains parts of supervisor visible register state.
+         The guest can map this shared page to access its supervisor register
+         through memory using this hypercall.
+
+5. KVM_HC_KICK_CPU
+------------------
+
+:Architecture: x86
+:Status: active
+:Purpose: Hypercall used to wakeup a vcpu from HLT state
+:Usage example:
+  A vcpu of a paravirtualized guest that is busywaiting in guest
+  kernel mode for an event to occur (ex: a spinlock to become available) can
+  execute HLT instruction once it has busy-waited for more than a threshold
+  time-interval. Execution of HLT instruction would cause the hypervisor to put
+  the vcpu to sleep until occurrence of an appropriate event. Another vcpu of the
+  same guest can wakeup the sleeping vcpu by issuing KVM_HC_KICK_CPU hypercall,
+  specifying APIC ID (a1) of the vcpu to be woken up. An additional argument (a0)
+  is used in the hypercall for future use.
+
+
+6. KVM_HC_CLOCK_PAIRING
+-----------------------
+:Architecture: x86
+:Status: active
+:Purpose: Hypercall used to synchronize host and guest clocks.
+
+Usage:
+
+a0: guest physical address where host copies
+"struct kvm_clock_offset" structure.
+
+a1: clock_type, ATM only KVM_CLOCK_PAIRING_WALLCLOCK (0)
+is supported (corresponding to the host's CLOCK_REALTIME clock).
+
+       ::
+
+               struct kvm_clock_pairing {
+                       __s64 sec;
+                       __s64 nsec;
+                       __u64 tsc;
+                       __u32 flags;
+                       __u32 pad[9];
+               };
+
+       Where:
+               * sec: seconds from clock_type clock.
+               * nsec: nanoseconds from clock_type clock.
+               * tsc: guest TSC value used to calculate sec/nsec pair
+               * flags: flags, unused (0) at the moment.
+
+The hypercall lets a guest compute a precise timestamp across
+host and guest.  The guest can use the returned TSC value to
+compute the CLOCK_REALTIME for its clock, at the same instant.
+
+Returns KVM_EOPNOTSUPP if the host does not use TSC clocksource,
+or if clock type is different than KVM_CLOCK_PAIRING_WALLCLOCK.
+
+6. KVM_HC_SEND_IPI
+------------------
+
+:Architecture: x86
+:Status: active
+:Purpose: Send IPIs to multiple vCPUs.
+
+- a0: lower part of the bitmap of destination APIC IDs
+- a1: higher part of the bitmap of destination APIC IDs
+- a2: the lowest APIC ID in bitmap
+- a3: APIC ICR
+
+The hypercall lets a guest send multicast IPIs, with at most 128
+128 destinations per hypercall in 64-bit mode and 64 vCPUs per
+hypercall in 32-bit mode.  The destinations are represented by a
+bitmap contained in the first two arguments (a0 and a1). Bit 0 of
+a0 corresponds to the APIC ID in the third argument (a2), bit 1
+corresponds to the APIC ID a2+1, and so on.
+
+Returns the number of CPUs to which the IPIs were delivered successfully.
+
+7. KVM_HC_SCHED_YIELD
+---------------------
+
+:Architecture: x86
+:Status: active
+:Purpose: Hypercall used to yield if the IPI target vCPU is preempted
+
+a0: destination APIC ID
+
+:Usage example: When sending a call-function IPI-many to vCPUs, yield if
+               any of the IPI target vCPUs was preempted.
diff --git a/Documentation/virt/kvm/hypercalls.txt b/Documentation/virt/kvm/hypercalls.txt

deleted file mode 100644 (file)

index 5f6d291..0000000
--- a/Documentation/virt/kvm/hypercalls.txt
+++ /dev/null
@@ -1,154 +0,0 @@
-Linux KVM Hypercall:
-===================
-X86:
- KVM Hypercalls have a three-byte sequence of either the vmcall or the vmmcall
- instruction. The hypervisor can replace it with instructions that are
- guaranteed to be supported.
-
- Up to four arguments may be passed in rbx, rcx, rdx, and rsi respectively.
- The hypercall number should be placed in rax and the return value will be
- placed in rax.  No other registers will be clobbered unless explicitly stated
- by the particular hypercall.
-
-S390:
-  R2-R7 are used for parameters 1-6. In addition, R1 is used for hypercall
-  number. The return value is written to R2.
-
-  S390 uses diagnose instruction as hypercall (0x500) along with hypercall
-  number in R1.
-
-  For further information on the S390 diagnose call as supported by KVM,
-  refer to Documentation/virt/kvm/s390-diag.txt.
-
- PowerPC:
-  It uses R3-R10 and hypercall number in R11. R4-R11 are used as output registers.
-  Return value is placed in R3.
-
-  KVM hypercalls uses 4 byte opcode, that are patched with 'hypercall-instructions'
-  property inside the device tree's /hypervisor node.
-  For more information refer to Documentation/virt/kvm/ppc-pv.txt
-
-MIPS:
-  KVM hypercalls use the HYPCALL instruction with code 0 and the hypercall
-  number in $2 (v0). Up to four arguments may be placed in $4-$7 (a0-a3) and
-  the return value is placed in $2 (v0).
-
-KVM Hypercalls Documentation
-===========================
-The template for each hypercall is:
-1. Hypercall name.
-2. Architecture(s)
-3. Status (deprecated, obsolete, active)
-4. Purpose
-
-1. KVM_HC_VAPIC_POLL_IRQ
-------------------------
-Architecture: x86
-Status: active
-Purpose: Trigger guest exit so that the host can check for pending
-interrupts on reentry.
-
-2. KVM_HC_MMU_OP
-------------------------
-Architecture: x86
-Status: deprecated.
-Purpose: Support MMU operations such as writing to PTE,
-flushing TLB, release PT.
-
-3. KVM_HC_FEATURES
-------------------------
-Architecture: PPC
-Status: active
-Purpose: Expose hypercall availability to the guest. On x86 platforms, cpuid
-used to enumerate which hypercalls are available. On PPC, either device tree
-based lookup ( which is also what EPAPR dictates) OR KVM specific enumeration
-mechanism (which is this hypercall) can be used.
-
-4. KVM_HC_PPC_MAP_MAGIC_PAGE
-------------------------
-Architecture: PPC
-Status: active
-Purpose: To enable communication between the hypervisor and guest there is a
-shared page that contains parts of supervisor visible register state.
-The guest can map this shared page to access its supervisor register through
-memory using this hypercall.
-
-5. KVM_HC_KICK_CPU
-------------------------
-Architecture: x86
-Status: active
-Purpose: Hypercall used to wakeup a vcpu from HLT state
-Usage example : A vcpu of a paravirtualized guest that is busywaiting in guest
-kernel mode for an event to occur (ex: a spinlock to become available) can
-execute HLT instruction once it has busy-waited for more than a threshold
-time-interval. Execution of HLT instruction would cause the hypervisor to put
-the vcpu to sleep until occurrence of an appropriate event. Another vcpu of the
-same guest can wakeup the sleeping vcpu by issuing KVM_HC_KICK_CPU hypercall,
-specifying APIC ID (a1) of the vcpu to be woken up. An additional argument (a0)
-is used in the hypercall for future use.
-
-
-6. KVM_HC_CLOCK_PAIRING
-------------------------
-Architecture: x86
-Status: active
-Purpose: Hypercall used to synchronize host and guest clocks.
-Usage:
-
-a0: guest physical address where host copies
-"struct kvm_clock_offset" structure.
-
-a1: clock_type, ATM only KVM_CLOCK_PAIRING_WALLCLOCK (0)
-is supported (corresponding to the host's CLOCK_REALTIME clock).
-
-               struct kvm_clock_pairing {
-                       __s64 sec;
-                       __s64 nsec;
-                       __u64 tsc;
-                       __u32 flags;
-                       __u32 pad[9];
-               };
-
-       Where:
-               * sec: seconds from clock_type clock.
-               * nsec: nanoseconds from clock_type clock.
-               * tsc: guest TSC value used to calculate sec/nsec pair
-               * flags: flags, unused (0) at the moment.
-
-The hypercall lets a guest compute a precise timestamp across
-host and guest.  The guest can use the returned TSC value to
-compute the CLOCK_REALTIME for its clock, at the same instant.
-
-Returns KVM_EOPNOTSUPP if the host does not use TSC clocksource,
-or if clock type is different than KVM_CLOCK_PAIRING_WALLCLOCK.
-
-6. KVM_HC_SEND_IPI
-------------------------
-Architecture: x86
-Status: active
-Purpose: Send IPIs to multiple vCPUs.
-
-a0: lower part of the bitmap of destination APIC IDs
-a1: higher part of the bitmap of destination APIC IDs
-a2: the lowest APIC ID in bitmap
-a3: APIC ICR
-
-The hypercall lets a guest send multicast IPIs, with at most 128
-128 destinations per hypercall in 64-bit mode and 64 vCPUs per
-hypercall in 32-bit mode.  The destinations are represented by a
-bitmap contained in the first two arguments (a0 and a1). Bit 0 of
-a0 corresponds to the APIC ID in the third argument (a2), bit 1
-corresponds to the APIC ID a2+1, and so on.
-
-Returns the number of CPUs to which the IPIs were delivered successfully.
-
-7. KVM_HC_SCHED_YIELD
-------------------------
-Architecture: x86
-Status: active
-Purpose: Hypercall used to yield if the IPI target vCPU is preempted
-
-a0: destination APIC ID
-
-Usage example: When sending a call-function IPI-many to vCPUs, yield if
-any of the IPI target vCPUs was preempted.
diff --git a/Documentation/virt/kvm/index.rst b/Documentation/virt/kvm/index.rst

index ada224a..774deae 100644 (file)
--- a/Documentation/virt/kvm/index.rst
+++ b/Documentation/virt/kvm/index.rst
@@ -7,6 +7,22 @@ KVM
  .. toctree::
     :maxdepth: 2
  
+   api
     amd-memory-encryption
     cpuid
+   halt-polling
+   hypercalls
+   locking
+   mmu
+   msr
+   nested-vmx
+   ppc-pv
+   s390-diag
+   timekeeping
     vcpu-requests
+
+   review-checklist
+
+   arm/index
+
+   devices/index
diff --git a/Documentation/virt/kvm/locking.rst b/Documentation/virt/kvm/locking.rst

new file mode 100644 (file)

index 0000000..c02291b
--- /dev/null
+++ b/Documentation/virt/kvm/locking.rst
@@ -0,0 +1,243 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=================
+KVM Lock Overview
+=================
+
+1. Acquisition Orders
+---------------------
+
+The acquisition orders for mutexes are as follows:
+
+- kvm->lock is taken outside vcpu->mutex
+
+- kvm->lock is taken outside kvm->slots_lock and kvm->irq_lock
+
+- kvm->slots_lock is taken outside kvm->irq_lock, though acquiring
+  them together is quite rare.
+
+On x86, vcpu->mutex is taken outside kvm->arch.hyperv.hv_lock.
+
+Everything else is a leaf: no other lock is taken inside the critical
+sections.
+
+2. Exception
+------------
+
+Fast page fault:
+
+Fast page fault is the fast path which fixes the guest page fault out of
+the mmu-lock on x86. Currently, the page fault can be fast in one of the
+following two cases:
+
+1. Access Tracking: The SPTE is not present, but it is marked for access
+   tracking i.e. the SPTE_SPECIAL_MASK is set. That means we need to
+   restore the saved R/X bits. This is described in more detail later below.
+
+2. Write-Protection: The SPTE is present and the fault is
+   caused by write-protect. That means we just need to change the W bit of
+   the spte.
+
+What we use to avoid all the race is the SPTE_HOST_WRITEABLE bit and
+SPTE_MMU_WRITEABLE bit on the spte:
+
+- SPTE_HOST_WRITEABLE means the gfn is writable on host.
+- SPTE_MMU_WRITEABLE means the gfn is writable on mmu. The bit is set when
+  the gfn is writable on guest mmu and it is not write-protected by shadow
+  page write-protection.
+
+On fast page fault path, we will use cmpxchg to atomically set the spte W
+bit if spte.SPTE_HOST_WRITEABLE = 1 and spte.SPTE_WRITE_PROTECT = 1, or
+restore the saved R/X bits if VMX_EPT_TRACK_ACCESS mask is set, or both. This
+is safe because whenever changing these bits can be detected by cmpxchg.
+
+But we need carefully check these cases:
+
+1) The mapping from gfn to pfn
+
+The mapping from gfn to pfn may be changed since we can only ensure the pfn
+is not changed during cmpxchg. This is a ABA problem, for example, below case
+will happen:
+
++------------------------------------------------------------------------+
+| At the beginning::                                                     |
+|                                                                        |
+|      gpte = gfn1                                                      |
+|      gfn1 is mapped to pfn1 on host                                   |
+|      spte is the shadow page table entry corresponding with gpte and  |
+|      spte = pfn1                                                      |
++------------------------------------------------------------------------+
+| On fast page fault path:                                               |
++------------------------------------+-----------------------------------+
+| CPU 0:                             | CPU 1:                            |
++------------------------------------+-----------------------------------+
+| ::                                 |                                   |
+|                                    |                                   |
+|   old_spte = *spte;                |                                   |
++------------------------------------+-----------------------------------+
+|                                    | pfn1 is swapped out::             |
+|                                    |                                   |
+|                                    |    spte = 0;                      |
+|                                    |                                   |
+|                                    | pfn1 is re-alloced for gfn2.      |
+|                                    |                                   |
+|                                    | gpte is changed to point to       |
+|                                    | gfn2 by the guest::               |
+|                                    |                                   |
+|                                    |    spte = pfn1;                   |
++------------------------------------+-----------------------------------+
+| ::                                                                     |
+|                                                                        |
+|   if (cmpxchg(spte, old_spte, old_spte+W)                              |
+|      mark_page_dirty(vcpu->kvm, gfn1)                                 |
+|            OOPS!!!                                                     |
++------------------------------------------------------------------------+
+
+We dirty-log for gfn1, that means gfn2 is lost in dirty-bitmap.
+
+For direct sp, we can easily avoid it since the spte of direct sp is fixed
+to gfn. For indirect sp, before we do cmpxchg, we call gfn_to_pfn_atomic()
+to pin gfn to pfn, because after gfn_to_pfn_atomic():
+
+- We have held the refcount of pfn that means the pfn can not be freed and
+  be reused for another gfn.
+- The pfn is writable that means it can not be shared between different gfns
+  by KSM.
+
+Then, we can ensure the dirty bitmaps is correctly set for a gfn.
+
+Currently, to simplify the whole things, we disable fast page fault for
+indirect shadow page.
+
+2) Dirty bit tracking
+
+In the origin code, the spte can be fast updated (non-atomically) if the
+spte is read-only and the Accessed bit has already been set since the
+Accessed bit and Dirty bit can not be lost.
+
+But it is not true after fast page fault since the spte can be marked
+writable between reading spte and updating spte. Like below case:
+
++------------------------------------------------------------------------+
+| At the beginning::                                                     |
+|                                                                        |
+|      spte.W = 0                                                       |
+|      spte.Accessed = 1                                                |
++------------------------------------+-----------------------------------+
+| CPU 0:                             | CPU 1:                            |
++------------------------------------+-----------------------------------+
+| In mmu_spte_clear_track_bits()::   |                                   |
+|                                    |                                   |
+|  old_spte = *spte;                 |                                   |
+|                                    |                                   |
+|                                    |                                   |
+|  /* 'if' condition is satisfied. */|                                   |
+|  if (old_spte.Accessed == 1 &&     |                                   |
+|       old_spte.W == 0)             |                                   |
+|     spte = 0ull;                   |                                   |
++------------------------------------+-----------------------------------+
+|                                    | on fast page fault path::         |
+|                                    |                                   |
+|                                    |    spte.W = 1                     |
+|                                    |                                   |
+|                                    | memory write on the spte::        |
+|                                    |                                   |
+|                                    |    spte.Dirty = 1                 |
++------------------------------------+-----------------------------------+
+|  ::                                |                                   |
+|                                    |                                   |
+|   else                             |                                   |
+|     old_spte = xchg(spte, 0ull)    |                                   |
+|   if (old_spte.Accessed == 1)      |                                   |
+|     kvm_set_pfn_accessed(spte.pfn);|                                   |
+|   if (old_spte.Dirty == 1)         |                                   |
+|     kvm_set_pfn_dirty(spte.pfn);   |                                   |
+|     OOPS!!!                        |                                   |
++------------------------------------+-----------------------------------+
+
+The Dirty bit is lost in this case.
+
+In order to avoid this kind of issue, we always treat the spte as "volatile"
+if it can be updated out of mmu-lock, see spte_has_volatile_bits(), it means,
+the spte is always atomically updated in this case.
+
+3) flush tlbs due to spte updated
+
+If the spte is updated from writable to readonly, we should flush all TLBs,
+otherwise rmap_write_protect will find a read-only spte, even though the
+writable spte might be cached on a CPU's TLB.
+
+As mentioned before, the spte can be updated to writable out of mmu-lock on
+fast page fault path, in order to easily audit the path, we see if TLBs need
+be flushed caused by this reason in mmu_spte_update() since this is a common
+function to update spte (present -> present).
+
+Since the spte is "volatile" if it can be updated out of mmu-lock, we always
+atomically update the spte, the race caused by fast page fault can be avoided,
+See the comments in spte_has_volatile_bits() and mmu_spte_update().
+
+Lockless Access Tracking:
+
+This is used for Intel CPUs that are using EPT but do not support the EPT A/D
+bits. In this case, when the KVM MMU notifier is called to track accesses to a
+page (via kvm_mmu_notifier_clear_flush_young), it marks the PTE as not-present
+by clearing the RWX bits in the PTE and storing the original R & X bits in
+some unused/ignored bits. In addition, the SPTE_SPECIAL_MASK is also set on the
+PTE (using the ignored bit 62). When the VM tries to access the page later on,
+a fault is generated and the fast page fault mechanism described above is used
+to atomically restore the PTE to a Present state. The W bit is not saved when
+the PTE is marked for access tracking and during restoration to the Present
+state, the W bit is set depending on whether or not it was a write access. If
+it wasn't, then the W bit will remain clear until a write access happens, at
+which time it will be set using the Dirty tracking mechanism described above.
+
+3. Reference
+------------
+
+:Name:         kvm_lock
+:Type:         mutex
+:Arch:         any
+:Protects:     - vm_list
+
+:Name:         kvm_count_lock
+:Type:         raw_spinlock_t
+:Arch:         any
+:Protects:     - hardware virtualization enable/disable
+:Comment:      'raw' because hardware enabling/disabling must be atomic /wrt
+               migration.
+
+:Name:         kvm_arch::tsc_write_lock
+:Type:         raw_spinlock
+:Arch:         x86
+:Protects:     - kvm_arch::{last_tsc_write,last_tsc_nsec,last_tsc_offset}
+               - tsc offset in vmcb
+:Comment:      'raw' because updating the tsc offsets must not be preempted.
+
+:Name:         kvm->mmu_lock
+:Type:         spinlock_t
+:Arch:         any
+:Protects:     -shadow page/shadow tlb entry
+:Comment:      it is a spinlock since it is used in mmu notifier.
+
+:Name:         kvm->srcu
+:Type:         srcu lock
+:Arch:         any
+:Protects:     - kvm->memslots
+               - kvm->buses
+:Comment:      The srcu read lock must be held while accessing memslots (e.g.
+               when using gfn_to_* functions) and while accessing in-kernel
+               MMIO/PIO address->device structure mapping (kvm->buses).
+               The srcu index can be stored in kvm_vcpu->srcu_idx per vcpu
+               if it is needed by multiple functions.
+
+:Name:         blocked_vcpu_on_cpu_lock
+:Type:         spinlock_t
+:Arch:         x86
+:Protects:     blocked_vcpu_on_cpu
+:Comment:      This is a per-CPU lock and it is used for VT-d posted-interrupts.
+               When VT-d posted-interrupts is supported and the VM has assigned
+               devices, we put the blocked vCPU on the list blocked_vcpu_on_cpu
+               protected by blocked_vcpu_on_cpu_lock, when VT-d hardware issues
+               wakeup notification event since external interrupts from the
+               assigned devices happens, we will find the vCPU on the list to
+               wakeup.
diff --git a/Documentation/virt/kvm/locking.txt b/Documentation/virt/kvm/locking.txt

deleted file mode 100644 (file)

index 635cd6e..0000000
--- a/Documentation/virt/kvm/locking.txt
+++ /dev/null
@@ -1,215 +0,0 @@
-KVM Lock Overview
-=================
-
-1. Acquisition Orders
----------------------
-
-The acquisition orders for mutexes are as follows:
-
-- kvm->lock is taken outside vcpu->mutex
-
-- kvm->lock is taken outside kvm->slots_lock and kvm->irq_lock
-
-- kvm->slots_lock is taken outside kvm->irq_lock, though acquiring
-  them together is quite rare.
-
-On x86, vcpu->mutex is taken outside kvm->arch.hyperv.hv_lock.
-
-Everything else is a leaf: no other lock is taken inside the critical
-sections.
-
-2: Exception
-------------
-
-Fast page fault:
-
-Fast page fault is the fast path which fixes the guest page fault out of
-the mmu-lock on x86. Currently, the page fault can be fast in one of the
-following two cases:
-
-1. Access Tracking: The SPTE is not present, but it is marked for access
-tracking i.e. the SPTE_SPECIAL_MASK is set. That means we need to
-restore the saved R/X bits. This is described in more detail later below.
-
-2. Write-Protection: The SPTE is present and the fault is
-caused by write-protect. That means we just need to change the W bit of the 
-spte.
-
-What we use to avoid all the race is the SPTE_HOST_WRITEABLE bit and
-SPTE_MMU_WRITEABLE bit on the spte:
-- SPTE_HOST_WRITEABLE means the gfn is writable on host.
-- SPTE_MMU_WRITEABLE means the gfn is writable on mmu. The bit is set when
-  the gfn is writable on guest mmu and it is not write-protected by shadow
-  page write-protection.
-
-On fast page fault path, we will use cmpxchg to atomically set the spte W
-bit if spte.SPTE_HOST_WRITEABLE = 1 and spte.SPTE_WRITE_PROTECT = 1, or 
-restore the saved R/X bits if VMX_EPT_TRACK_ACCESS mask is set, or both. This
-is safe because whenever changing these bits can be detected by cmpxchg.
-
-But we need carefully check these cases:
-1): The mapping from gfn to pfn
-The mapping from gfn to pfn may be changed since we can only ensure the pfn
-is not changed during cmpxchg. This is a ABA problem, for example, below case
-will happen:
-
-At the beginning:
-gpte = gfn1
-gfn1 is mapped to pfn1 on host
-spte is the shadow page table entry corresponding with gpte and
-spte = pfn1
-
-   VCPU 0                           VCPU0
-on fast page fault path:
-
-   old_spte = *spte;
-                                 pfn1 is swapped out:
-                                    spte = 0;
-
-                                 pfn1 is re-alloced for gfn2.
-
-                                 gpte is changed to point to
-                                 gfn2 by the guest:
-                                    spte = pfn1;
-
-   if (cmpxchg(spte, old_spte, old_spte+W)
-       mark_page_dirty(vcpu->kvm, gfn1)
-             OOPS!!!
-
-We dirty-log for gfn1, that means gfn2 is lost in dirty-bitmap.
-
-For direct sp, we can easily avoid it since the spte of direct sp is fixed
-to gfn. For indirect sp, before we do cmpxchg, we call gfn_to_pfn_atomic()
-to pin gfn to pfn, because after gfn_to_pfn_atomic():
-- We have held the refcount of pfn that means the pfn can not be freed and
-  be reused for another gfn.
-- The pfn is writable that means it can not be shared between different gfns
-  by KSM.
-
-Then, we can ensure the dirty bitmaps is correctly set for a gfn.
-
-Currently, to simplify the whole things, we disable fast page fault for
-indirect shadow page.
-
-2): Dirty bit tracking
-In the origin code, the spte can be fast updated (non-atomically) if the
-spte is read-only and the Accessed bit has already been set since the
-Accessed bit and Dirty bit can not be lost.
-
-But it is not true after fast page fault since the spte can be marked
-writable between reading spte and updating spte. Like below case:
-
-At the beginning:
-spte.W = 0
-spte.Accessed = 1
-
-   VCPU 0                                       VCPU0
-In mmu_spte_clear_track_bits():
-
-   old_spte = *spte;
-
-   /* 'if' condition is satisfied. */
-   if (old_spte.Accessed == 1 &&
-        old_spte.W == 0)
-      spte = 0ull;
-                                         on fast page fault path:
-                                             spte.W = 1
-                                         memory write on the spte:
-                                             spte.Dirty = 1
-
-
-   else
-      old_spte = xchg(spte, 0ull)
-
-
-   if (old_spte.Accessed == 1)
-      kvm_set_pfn_accessed(spte.pfn);
-   if (old_spte.Dirty == 1)
-      kvm_set_pfn_dirty(spte.pfn);
-      OOPS!!!
-
-The Dirty bit is lost in this case.
-
-In order to avoid this kind of issue, we always treat the spte as "volatile"
-if it can be updated out of mmu-lock, see spte_has_volatile_bits(), it means,
-the spte is always atomically updated in this case.
-
-3): flush tlbs due to spte updated
-If the spte is updated from writable to readonly, we should flush all TLBs,
-otherwise rmap_write_protect will find a read-only spte, even though the
-writable spte might be cached on a CPU's TLB.
-
-As mentioned before, the spte can be updated to writable out of mmu-lock on
-fast page fault path, in order to easily audit the path, we see if TLBs need
-be flushed caused by this reason in mmu_spte_update() since this is a common
-function to update spte (present -> present).
-
-Since the spte is "volatile" if it can be updated out of mmu-lock, we always
-atomically update the spte, the race caused by fast page fault can be avoided,
-See the comments in spte_has_volatile_bits() and mmu_spte_update().
-
-Lockless Access Tracking:
-
-This is used for Intel CPUs that are using EPT but do not support the EPT A/D
-bits. In this case, when the KVM MMU notifier is called to track accesses to a
-page (via kvm_mmu_notifier_clear_flush_young), it marks the PTE as not-present
-by clearing the RWX bits in the PTE and storing the original R & X bits in
-some unused/ignored bits. In addition, the SPTE_SPECIAL_MASK is also set on the
-PTE (using the ignored bit 62). When the VM tries to access the page later on,
-a fault is generated and the fast page fault mechanism described above is used
-to atomically restore the PTE to a Present state. The W bit is not saved when
-the PTE is marked for access tracking and during restoration to the Present
-state, the W bit is set depending on whether or not it was a write access. If
-it wasn't, then the W bit will remain clear until a write access happens, at 
-which time it will be set using the Dirty tracking mechanism described above.
-
-3. Reference
-------------
-
-Name:          kvm_lock
-Type:          mutex
-Arch:          any
-Protects:      - vm_list
-
-Name:          kvm_count_lock
-Type:          raw_spinlock_t
-Arch:          any
-Protects:      - hardware virtualization enable/disable
-Comment:       'raw' because hardware enabling/disabling must be atomic /wrt
-               migration.
-
-Name:          kvm_arch::tsc_write_lock
-Type:          raw_spinlock
-Arch:          x86
-Protects:      - kvm_arch::{last_tsc_write,last_tsc_nsec,last_tsc_offset}
-               - tsc offset in vmcb
-Comment:       'raw' because updating the tsc offsets must not be preempted.
-
-Name:          kvm->mmu_lock
-Type:          spinlock_t
-Arch:          any
-Protects:      -shadow page/shadow tlb entry
-Comment:       it is a spinlock since it is used in mmu notifier.
-
-Name:          kvm->srcu
-Type:          srcu lock
-Arch:          any
-Protects:      - kvm->memslots
-               - kvm->buses
-Comment:       The srcu read lock must be held while accessing memslots (e.g.
-               when using gfn_to_* functions) and while accessing in-kernel
-               MMIO/PIO address->device structure mapping (kvm->buses).
-               The srcu index can be stored in kvm_vcpu->srcu_idx per vcpu
-               if it is needed by multiple functions.
-
-Name:          blocked_vcpu_on_cpu_lock
-Type:          spinlock_t
-Arch:          x86
-Protects:      blocked_vcpu_on_cpu
-Comment:       This is a per-CPU lock and it is used for VT-d posted-interrupts.
-               When VT-d posted-interrupts is supported and the VM has assigned
-               devices, we put the blocked vCPU on the list blocked_vcpu_on_cpu
-               protected by blocked_vcpu_on_cpu_lock, when VT-d hardware issues
-               wakeup notification event since external interrupts from the
-               assigned devices happens, we will find the vCPU on the list to
-               wakeup.
diff --git a/Documentation/virt/kvm/mmu.rst b/Documentation/virt/kvm/mmu.rst

new file mode 100644 (file)

index 0000000..6098188
--- /dev/null
+++ b/Documentation/virt/kvm/mmu.rst
@@ -0,0 +1,483 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+======================
+The x86 kvm shadow mmu
+======================
+
+The mmu (in arch/x86/kvm, files mmu.[ch] and paging_tmpl.h) is responsible
+for presenting a standard x86 mmu to the guest, while translating guest
+physical addresses to host physical addresses.
+
+The mmu code attempts to satisfy the following requirements:
+
+- correctness:
+              the guest should not be able to determine that it is running
+               on an emulated mmu except for timing (we attempt to comply
+               with the specification, not emulate the characteristics of
+               a particular implementation such as tlb size)
+- security:
+              the guest must not be able to touch host memory not assigned
+               to it
+- performance:
+               minimize the performance penalty imposed by the mmu
+- scaling:
+               need to scale to large memory and large vcpu guests
+- hardware:
+               support the full range of x86 virtualization hardware
+- integration:
+               Linux memory management code must be in control of guest memory
+               so that swapping, page migration, page merging, transparent
+               hugepages, and similar features work without change
+- dirty tracking:
+               report writes to guest memory to enable live migration
+               and framebuffer-based displays
+- footprint:
+               keep the amount of pinned kernel memory low (most memory
+               should be shrinkable)
+- reliability:
+               avoid multipage or GFP_ATOMIC allocations
+
+Acronyms
+========
+
+====  ====================================================================
+pfn   host page frame number
+hpa   host physical address
+hva   host virtual address
+gfn   guest frame number
+gpa   guest physical address
+gva   guest virtual address
+ngpa  nested guest physical address
+ngva  nested guest virtual address
+pte   page table entry (used also to refer generically to paging structure
+      entries)
+gpte  guest pte (referring to gfns)
+spte  shadow pte (referring to pfns)
+tdp   two dimensional paging (vendor neutral term for NPT and EPT)
+====  ====================================================================
+
+Virtual and real hardware supported
+===================================
+
+The mmu supports first-generation mmu hardware, which allows an atomic switch
+of the current paging mode and cr3 during guest entry, as well as
+two-dimensional paging (AMD's NPT and Intel's EPT).  The emulated hardware
+it exposes is the traditional 2/3/4 level x86 mmu, with support for global
+pages, pae, pse, pse36, cr0.wp, and 1GB pages. Emulated hardware also
+able to expose NPT capable hardware on NPT capable hosts.
+
+Translation
+===========
+
+The primary job of the mmu is to program the processor's mmu to translate
+addresses for the guest.  Different translations are required at different
+times:
+
+- when guest paging is disabled, we translate guest physical addresses to
+  host physical addresses (gpa->hpa)
+- when guest paging is enabled, we translate guest virtual addresses, to
+  guest physical addresses, to host physical addresses (gva->gpa->hpa)
+- when the guest launches a guest of its own, we translate nested guest
+  virtual addresses, to nested guest physical addresses, to guest physical
+  addresses, to host physical addresses (ngva->ngpa->gpa->hpa)
+
+The primary challenge is to encode between 1 and 3 translations into hardware
+that support only 1 (traditional) and 2 (tdp) translations.  When the
+number of required translations matches the hardware, the mmu operates in
+direct mode; otherwise it operates in shadow mode (see below).
+
+Memory
+======
+
+Guest memory (gpa) is part of the user address space of the process that is
+using kvm.  Userspace defines the translation between guest addresses and user
+addresses (gpa->hva); note that two gpas may alias to the same hva, but not
+vice versa.
+
+These hvas may be backed using any method available to the host: anonymous
+memory, file backed memory, and device memory.  Memory might be paged by the
+host at any time.
+
+Events
+======
+
+The mmu is driven by events, some from the guest, some from the host.
+
+Guest generated events:
+
+- writes to control registers (especially cr3)
+- invlpg/invlpga instruction execution
+- access to missing or protected translations
+
+Host generated events:
+
+- changes in the gpa->hpa translation (either through gpa->hva changes or
+  through hva->hpa changes)
+- memory pressure (the shrinker)
+
+Shadow pages
+============
+
+The principal data structure is the shadow page, 'struct kvm_mmu_page'.  A
+shadow page contains 512 sptes, which can be either leaf or nonleaf sptes.  A
+shadow page may contain a mix of leaf and nonleaf sptes.
+
+A nonleaf spte allows the hardware mmu to reach the leaf pages and
+is not related to a translation directly.  It points to other shadow pages.
+
+A leaf spte corresponds to either one or two translations encoded into
+one paging structure entry.  These are always the lowest level of the
+translation stack, with optional higher level translations left to NPT/EPT.
+Leaf ptes point at guest pages.
+
+The following table shows translations encoded by leaf ptes, with higher-level
+translations in parentheses:
+
+ Non-nested guests::
+
+  nonpaging:     gpa->hpa
+  paging:        gva->gpa->hpa
+  paging, tdp:   (gva->)gpa->hpa
+
+ Nested guests::
+
+  non-tdp:       ngva->gpa->hpa  (*)
+  tdp:           (ngva->)ngpa->gpa->hpa
+
+  (*) the guest hypervisor will encode the ngva->gpa translation into its page
+      tables if npt is not present
+
+Shadow pages contain the following information:
+  role.level:
+    The level in the shadow paging hierarchy that this shadow page belongs to.
+    1=4k sptes, 2=2M sptes, 3=1G sptes, etc.
+  role.direct:
+    If set, leaf sptes reachable from this page are for a linear range.
+    Examples include real mode translation, large guest pages backed by small
+    host pages, and gpa->hpa translations when NPT or EPT is active.
+    The linear range starts at (gfn << PAGE_SHIFT) and its size is determined
+    by role.level (2MB for first level, 1GB for second level, 0.5TB for third
+    level, 256TB for fourth level)
+    If clear, this page corresponds to a guest page table denoted by the gfn
+    field.
+  role.quadrant:
+    When role.gpte_is_8_bytes=0, the guest uses 32-bit gptes while the host uses 64-bit
+    sptes.  That means a guest page table contains more ptes than the host,
+    so multiple shadow pages are needed to shadow one guest page.
+    For first-level shadow pages, role.quadrant can be 0 or 1 and denotes the
+    first or second 512-gpte block in the guest page table.  For second-level
+    page tables, each 32-bit gpte is converted to two 64-bit sptes
+    (since each first-level guest page is shadowed by two first-level
+    shadow pages) so role.quadrant takes values in the range 0..3.  Each
+    quadrant maps 1GB virtual address space.
+  role.access:
+    Inherited guest access permissions in the form uwx.  Note execute
+    permission is positive, not negative.
+  role.invalid:
+    The page is invalid and should not be used.  It is a root page that is
+    currently pinned (by a cpu hardware register pointing to it); once it is
+    unpinned it will be destroyed.
+  role.gpte_is_8_bytes:
+    Reflects the size of the guest PTE for which the page is valid, i.e. '1'
+    if 64-bit gptes are in use, '0' if 32-bit gptes are in use.
+  role.nxe:
+    Contains the value of efer.nxe for which the page is valid.
+  role.cr0_wp:
+    Contains the value of cr0.wp for which the page is valid.
+  role.smep_andnot_wp:
+    Contains the value of cr4.smep && !cr0.wp for which the page is valid
+    (pages for which this is true are different from other pages; see the
+    treatment of cr0.wp=0 below).
+  role.smap_andnot_wp:
+    Contains the value of cr4.smap && !cr0.wp for which the page is valid
+    (pages for which this is true are different from other pages; see the
+    treatment of cr0.wp=0 below).
+  role.ept_sp:
+    This is a virtual flag to denote a shadowed nested EPT page.  ept_sp
+    is true if "cr0_wp && smap_andnot_wp", an otherwise invalid combination.
+  role.smm:
+    Is 1 if the page is valid in system management mode.  This field
+    determines which of the kvm_memslots array was used to build this
+    shadow page; it is also used to go back from a struct kvm_mmu_page
+    to a memslot, through the kvm_memslots_for_spte_role macro and
+    __gfn_to_memslot.
+  role.ad_disabled:
+    Is 1 if the MMU instance cannot use A/D bits.  EPT did not have A/D
+    bits before Haswell; shadow EPT page tables also cannot use A/D bits
+    if the L1 hypervisor does not enable them.
+  gfn:
+    Either the guest page table containing the translations shadowed by this
+    page, or the base page frame for linear translations.  See role.direct.
+  spt:
+    A pageful of 64-bit sptes containing the translations for this page.
+    Accessed by both kvm and hardware.
+    The page pointed to by spt will have its page->private pointing back
+    at the shadow page structure.
+    sptes in spt point either at guest pages, or at lower-level shadow pages.
+    Specifically, if sp1 and sp2 are shadow pages, then sp1->spt[n] may point
+    at __pa(sp2->spt).  sp2 will point back at sp1 through parent_pte.
+    The spt array forms a DAG structure with the shadow page as a node, and
+    guest pages as leaves.
+  gfns:
+    An array of 512 guest frame numbers, one for each present pte.  Used to
+    perform a reverse map from a pte to a gfn. When role.direct is set, any
+    element of this array can be calculated from the gfn field when used, in
+    this case, the array of gfns is not allocated. See role.direct and gfn.
+  root_count:
+    A counter keeping track of how many hardware registers (guest cr3 or
+    pdptrs) are now pointing at the page.  While this counter is nonzero, the
+    page cannot be destroyed.  See role.invalid.
+  parent_ptes:
+    The reverse mapping for the pte/ptes pointing at this page's spt. If
+    parent_ptes bit 0 is zero, only one spte points at this page and
+    parent_ptes points at this single spte, otherwise, there exists multiple
+    sptes pointing at this page and (parent_ptes & ~0x1) points at a data
+    structure with a list of parent sptes.
+  unsync:
+    If true, then the translations in this page may not match the guest's
+    translation.  This is equivalent to the state of the tlb when a pte is
+    changed but before the tlb entry is flushed.  Accordingly, unsync ptes
+    are synchronized when the guest executes invlpg or flushes its tlb by
+    other means.  Valid for leaf pages.
+  unsync_children:
+    How many sptes in the page point at pages that are unsync (or have
+    unsynchronized children).
+  unsync_child_bitmap:
+    A bitmap indicating which sptes in spt point (directly or indirectly) at
+    pages that may be unsynchronized.  Used to quickly locate all unsychronized
+    pages reachable from a given page.
+  clear_spte_count:
+    Only present on 32-bit hosts, where a 64-bit spte cannot be written
+    atomically.  The reader uses this while running out of the MMU lock
+    to detect in-progress updates and retry them until the writer has
+    finished the write.
+  write_flooding_count:
+    A guest may write to a page table many times, causing a lot of
+    emulations if the page needs to be write-protected (see "Synchronized
+    and unsynchronized pages" below).  Leaf pages can be unsynchronized
+    so that they do not trigger frequent emulation, but this is not
+    possible for non-leafs.  This field counts the number of emulations
+    since the last time the page table was actually used; if emulation
+    is triggered too frequently on this page, KVM will unmap the page
+    to avoid emulation in the future.
+
+Reverse map
+===========
+
+The mmu maintains a reverse mapping whereby all ptes mapping a page can be
+reached given its gfn.  This is used, for example, when swapping out a page.
+
+Synchronized and unsynchronized pages
+=====================================
+
+The guest uses two events to synchronize its tlb and page tables: tlb flushes
+and page invalidations (invlpg).
+
+A tlb flush means that we need to synchronize all sptes reachable from the
+guest's cr3.  This is expensive, so we keep all guest page tables write
+protected, and synchronize sptes to gptes when a gpte is written.
+
+A special case is when a guest page table is reachable from the current
+guest cr3.  In this case, the guest is obliged to issue an invlpg instruction
+before using the translation.  We take advantage of that by removing write
+protection from the guest page, and allowing the guest to modify it freely.
+We synchronize modified gptes when the guest invokes invlpg.  This reduces
+the amount of emulation we have to do when the guest modifies multiple gptes,
+or when the a guest page is no longer used as a page table and is used for
+random guest data.
+
+As a side effect we have to resynchronize all reachable unsynchronized shadow
+pages on a tlb flush.
+
+
+Reaction to events
+==================
+
+- guest page fault (or npt page fault, or ept violation)
+
+This is the most complicated event.  The cause of a page fault can be:
+
+  - a true guest fault (the guest translation won't allow the access) (*)
+  - access to a missing translation
+  - access to a protected translation
+    - when logging dirty pages, memory is write protected
+    - synchronized shadow pages are write protected (*)
+  - access to untranslatable memory (mmio)
+
+  (*) not applicable in direct mode
+
+Handling a page fault is performed as follows:
+
+ - if the RSV bit of the error code is set, the page fault is caused by guest
+   accessing MMIO and cached MMIO information is available.
+
+   - walk shadow page table
+   - check for valid generation number in the spte (see "Fast invalidation of
+     MMIO sptes" below)
+   - cache the information to vcpu->arch.mmio_gva, vcpu->arch.mmio_access and
+     vcpu->arch.mmio_gfn, and call the emulator
+
+ - If both P bit and R/W bit of error code are set, this could possibly
+   be handled as a "fast page fault" (fixed without taking the MMU lock).  See
+   the description in Documentation/virt/kvm/locking.txt.
+
+ - if needed, walk the guest page tables to determine the guest translation
+   (gva->gpa or ngpa->gpa)
+
+   - if permissions are insufficient, reflect the fault back to the guest
+
+ - determine the host page
+
+   - if this is an mmio request, there is no host page; cache the info to
+     vcpu->arch.mmio_gva, vcpu->arch.mmio_access and vcpu->arch.mmio_gfn
+
+ - walk the shadow page table to find the spte for the translation,
+   instantiating missing intermediate page tables as necessary
+
+   - If this is an mmio request, cache the mmio info to the spte and set some
+     reserved bit on the spte (see callers of kvm_mmu_set_mmio_spte_mask)
+
+ - try to unsynchronize the page
+
+   - if successful, we can let the guest continue and modify the gpte
+
+ - emulate the instruction
+
+   - if failed, unshadow the page and let the guest continue
+
+ - update any translations that were modified by the instruction
+
+invlpg handling:
+
+  - walk the shadow page hierarchy and drop affected translations
+  - try to reinstantiate the indicated translation in the hope that the
+    guest will use it in the near future
+
+Guest control register updates:
+
+- mov to cr3
+
+  - look up new shadow roots
+  - synchronize newly reachable shadow pages
+
+- mov to cr0/cr4/efer
+
+  - set up mmu context for new paging mode
+  - look up new shadow roots
+  - synchronize newly reachable shadow pages
+
+Host translation updates:
+
+  - mmu notifier called with updated hva
+  - look up affected sptes through reverse map
+  - drop (or update) translations
+
+Emulating cr0.wp
+================
+
+If tdp is not enabled, the host must keep cr0.wp=1 so page write protection
+works for the guest kernel, not guest guest userspace.  When the guest
+cr0.wp=1, this does not present a problem.  However when the guest cr0.wp=0,
+we cannot map the permissions for gpte.u=1, gpte.w=0 to any spte (the
+semantics require allowing any guest kernel access plus user read access).
+
+We handle this by mapping the permissions to two possible sptes, depending
+on fault type:
+
+- kernel write fault: spte.u=0, spte.w=1 (allows full kernel access,
+  disallows user access)
+- read fault: spte.u=1, spte.w=0 (allows full read access, disallows kernel
+  write access)
+
+(user write faults generate a #PF)
+
+In the first case there are two additional complications:
+
+- if CR4.SMEP is enabled: since we've turned the page into a kernel page,
+  the kernel may now execute it.  We handle this by also setting spte.nx.
+  If we get a user fetch or read fault, we'll change spte.u=1 and
+  spte.nx=gpte.nx back.  For this to work, KVM forces EFER.NX to 1 when
+  shadow paging is in use.
+- if CR4.SMAP is disabled: since the page has been changed to a kernel
+  page, it can not be reused when CR4.SMAP is enabled. We set
+  CR4.SMAP && !CR0.WP into shadow page's role to avoid this case. Note,
+  here we do not care the case that CR4.SMAP is enabled since KVM will
+  directly inject #PF to guest due to failed permission check.
+
+To prevent an spte that was converted into a kernel page with cr0.wp=0
+from being written by the kernel after cr0.wp has changed to 1, we make
+the value of cr0.wp part of the page role.  This means that an spte created
+with one value of cr0.wp cannot be used when cr0.wp has a different value -
+it will simply be missed by the shadow page lookup code.  A similar issue
+exists when an spte created with cr0.wp=0 and cr4.smep=0 is used after
+changing cr4.smep to 1.  To avoid this, the value of !cr0.wp && cr4.smep
+is also made a part of the page role.
+
+Large pages
+===========
+
+The mmu supports all combinations of large and small guest and host pages.
+Supported page sizes include 4k, 2M, 4M, and 1G.  4M pages are treated as
+two separate 2M pages, on both guest and host, since the mmu always uses PAE
+paging.
+
+To instantiate a large spte, four constraints must be satisfied:
+
+- the spte must point to a large host page
+- the guest pte must be a large pte of at least equivalent size (if tdp is
+  enabled, there is no guest pte and this condition is satisfied)
+- if the spte will be writeable, the large page frame may not overlap any
+  write-protected pages
+- the guest page must be wholly contained by a single memory slot
+
+To check the last two conditions, the mmu maintains a ->disallow_lpage set of
+arrays for each memory slot and large page size.  Every write protected page
+causes its disallow_lpage to be incremented, thus preventing instantiation of
+a large spte.  The frames at the end of an unaligned memory slot have
+artificially inflated ->disallow_lpages so they can never be instantiated.
+
+Fast invalidation of MMIO sptes
+===============================
+
+As mentioned in "Reaction to events" above, kvm will cache MMIO
+information in leaf sptes.  When a new memslot is added or an existing
+memslot is changed, this information may become stale and needs to be
+invalidated.  This also needs to hold the MMU lock while walking all
+shadow pages, and is made more scalable with a similar technique.
+
+MMIO sptes have a few spare bits, which are used to store a
+generation number.  The global generation number is stored in
+kvm_memslots(kvm)->generation, and increased whenever guest memory info
+changes.
+
+When KVM finds an MMIO spte, it checks the generation number of the spte.
+If the generation number of the spte does not equal the global generation
+number, it will ignore the cached MMIO information and handle the page
+fault through the slow path.
+
+Since only 19 bits are used to store generation-number on mmio spte, all
+pages are zapped when there is an overflow.
+
+Unfortunately, a single memory access might access kvm_memslots(kvm) multiple
+times, the last one happening when the generation number is retrieved and
+stored into the MMIO spte.  Thus, the MMIO spte might be created based on
+out-of-date information, but with an up-to-date generation number.
+
+To avoid this, the generation number is incremented again after synchronize_srcu
+returns; thus, bit 63 of kvm_memslots(kvm)->generation set to 1 only during a
+memslot update, while some SRCU readers might be using the old copy.  We do not
+want to use an MMIO sptes created with an odd generation number, and we can do
+this without losing a bit in the MMIO spte.  The "update in-progress" bit of the
+generation is not stored in MMIO spte, and is so is implicitly zero when the
+generation is extracted out of the spte.  If KVM is unlucky and creates an MMIO
+spte while an update is in-progress, the next access to the spte will always be
+a cache miss.  For example, a subsequent access during the update window will
+miss due to the in-progress flag diverging, while an access after the update
+window closes will have a higher generation number (as compared to the spte).
+
+
+Further reading
+===============
+
+- NPT presentation from KVM Forum 2008
+  http://www.linux-kvm.org/images/c/c8/KvmForum2008%24kdf2008_21.pdf
diff --git a/Documentation/virt/kvm/mmu.txt b/Documentation/virt/kvm/mmu.txt

deleted file mode 100644 (file)

index dadb29e..0000000
--- a/Documentation/virt/kvm/mmu.txt
+++ /dev/null
@@ -1,449 +0,0 @@
-The x86 kvm shadow mmu
-======================
-
-The mmu (in arch/x86/kvm, files mmu.[ch] and paging_tmpl.h) is responsible
-for presenting a standard x86 mmu to the guest, while translating guest
-physical addresses to host physical addresses.
-
-The mmu code attempts to satisfy the following requirements:
-
-- correctness: the guest should not be able to determine that it is running
-               on an emulated mmu except for timing (we attempt to comply
-               with the specification, not emulate the characteristics of
-               a particular implementation such as tlb size)
-- security:    the guest must not be able to touch host memory not assigned
-               to it
-- performance: minimize the performance penalty imposed by the mmu
-- scaling:     need to scale to large memory and large vcpu guests
-- hardware:    support the full range of x86 virtualization hardware
-- integration: Linux memory management code must be in control of guest memory
-               so that swapping, page migration, page merging, transparent
-               hugepages, and similar features work without change
-- dirty tracking: report writes to guest memory to enable live migration
-               and framebuffer-based displays
-- footprint:   keep the amount of pinned kernel memory low (most memory
-               should be shrinkable)
-- reliability:  avoid multipage or GFP_ATOMIC allocations
-
-Acronyms
-========
-
-pfn   host page frame number
-hpa   host physical address
-hva   host virtual address
-gfn   guest frame number
-gpa   guest physical address
-gva   guest virtual address
-ngpa  nested guest physical address
-ngva  nested guest virtual address
-pte   page table entry (used also to refer generically to paging structure
-      entries)
-gpte  guest pte (referring to gfns)
-spte  shadow pte (referring to pfns)
-tdp   two dimensional paging (vendor neutral term for NPT and EPT)
-
-Virtual and real hardware supported
-===================================
-
-The mmu supports first-generation mmu hardware, which allows an atomic switch
-of the current paging mode and cr3 during guest entry, as well as
-two-dimensional paging (AMD's NPT and Intel's EPT).  The emulated hardware
-it exposes is the traditional 2/3/4 level x86 mmu, with support for global
-pages, pae, pse, pse36, cr0.wp, and 1GB pages. Emulated hardware also
-able to expose NPT capable hardware on NPT capable hosts.
-
-Translation
-===========
-
-The primary job of the mmu is to program the processor's mmu to translate
-addresses for the guest.  Different translations are required at different
-times:
-
-- when guest paging is disabled, we translate guest physical addresses to
-  host physical addresses (gpa->hpa)
-- when guest paging is enabled, we translate guest virtual addresses, to
-  guest physical addresses, to host physical addresses (gva->gpa->hpa)
-- when the guest launches a guest of its own, we translate nested guest
-  virtual addresses, to nested guest physical addresses, to guest physical
-  addresses, to host physical addresses (ngva->ngpa->gpa->hpa)
-
-The primary challenge is to encode between 1 and 3 translations into hardware
-that support only 1 (traditional) and 2 (tdp) translations.  When the
-number of required translations matches the hardware, the mmu operates in
-direct mode; otherwise it operates in shadow mode (see below).
-
-Memory
-======
-
-Guest memory (gpa) is part of the user address space of the process that is
-using kvm.  Userspace defines the translation between guest addresses and user
-addresses (gpa->hva); note that two gpas may alias to the same hva, but not
-vice versa.
-
-These hvas may be backed using any method available to the host: anonymous
-memory, file backed memory, and device memory.  Memory might be paged by the
-host at any time.
-
-Events
-======
-
-The mmu is driven by events, some from the guest, some from the host.
-
-Guest generated events:
-- writes to control registers (especially cr3)
-- invlpg/invlpga instruction execution
-- access to missing or protected translations
-
-Host generated events:
-- changes in the gpa->hpa translation (either through gpa->hva changes or
-  through hva->hpa changes)
-- memory pressure (the shrinker)
-
-Shadow pages
-============
-
-The principal data structure is the shadow page, 'struct kvm_mmu_page'.  A
-shadow page contains 512 sptes, which can be either leaf or nonleaf sptes.  A
-shadow page may contain a mix of leaf and nonleaf sptes.
-
-A nonleaf spte allows the hardware mmu to reach the leaf pages and
-is not related to a translation directly.  It points to other shadow pages.
-
-A leaf spte corresponds to either one or two translations encoded into
-one paging structure entry.  These are always the lowest level of the
-translation stack, with optional higher level translations left to NPT/EPT.
-Leaf ptes point at guest pages.
-
-The following table shows translations encoded by leaf ptes, with higher-level
-translations in parentheses:
-
- Non-nested guests:
-  nonpaging:     gpa->hpa
-  paging:        gva->gpa->hpa
-  paging, tdp:   (gva->)gpa->hpa
- Nested guests:
-  non-tdp:       ngva->gpa->hpa  (*)
-  tdp:           (ngva->)ngpa->gpa->hpa
-
-(*) the guest hypervisor will encode the ngva->gpa translation into its page
-    tables if npt is not present
-
-Shadow pages contain the following information:
-  role.level:
-    The level in the shadow paging hierarchy that this shadow page belongs to.
-    1=4k sptes, 2=2M sptes, 3=1G sptes, etc.
-  role.direct:
-    If set, leaf sptes reachable from this page are for a linear range.
-    Examples include real mode translation, large guest pages backed by small
-    host pages, and gpa->hpa translations when NPT or EPT is active.
-    The linear range starts at (gfn << PAGE_SHIFT) and its size is determined
-    by role.level (2MB for first level, 1GB for second level, 0.5TB for third
-    level, 256TB for fourth level)
-    If clear, this page corresponds to a guest page table denoted by the gfn
-    field.
-  role.quadrant:
-    When role.gpte_is_8_bytes=0, the guest uses 32-bit gptes while the host uses 64-bit
-    sptes.  That means a guest page table contains more ptes than the host,
-    so multiple shadow pages are needed to shadow one guest page.
-    For first-level shadow pages, role.quadrant can be 0 or 1 and denotes the
-    first or second 512-gpte block in the guest page table.  For second-level
-    page tables, each 32-bit gpte is converted to two 64-bit sptes
-    (since each first-level guest page is shadowed by two first-level
-    shadow pages) so role.quadrant takes values in the range 0..3.  Each
-    quadrant maps 1GB virtual address space.
-  role.access:
-    Inherited guest access permissions in the form uwx.  Note execute
-    permission is positive, not negative.
-  role.invalid:
-    The page is invalid and should not be used.  It is a root page that is
-    currently pinned (by a cpu hardware register pointing to it); once it is
-    unpinned it will be destroyed.
-  role.gpte_is_8_bytes:
-    Reflects the size of the guest PTE for which the page is valid, i.e. '1'
-    if 64-bit gptes are in use, '0' if 32-bit gptes are in use.
-  role.nxe:
-    Contains the value of efer.nxe for which the page is valid.
-  role.cr0_wp:
-    Contains the value of cr0.wp for which the page is valid.
-  role.smep_andnot_wp:
-    Contains the value of cr4.smep && !cr0.wp for which the page is valid
-    (pages for which this is true are different from other pages; see the
-    treatment of cr0.wp=0 below).
-  role.smap_andnot_wp:
-    Contains the value of cr4.smap && !cr0.wp for which the page is valid
-    (pages for which this is true are different from other pages; see the
-    treatment of cr0.wp=0 below).
-  role.ept_sp:
-    This is a virtual flag to denote a shadowed nested EPT page.  ept_sp
-    is true if "cr0_wp && smap_andnot_wp", an otherwise invalid combination.
-  role.smm:
-    Is 1 if the page is valid in system management mode.  This field
-    determines which of the kvm_memslots array was used to build this
-    shadow page; it is also used to go back from a struct kvm_mmu_page
-    to a memslot, through the kvm_memslots_for_spte_role macro and
-    __gfn_to_memslot.
-  role.ad_disabled:
-    Is 1 if the MMU instance cannot use A/D bits.  EPT did not have A/D
-    bits before Haswell; shadow EPT page tables also cannot use A/D bits
-    if the L1 hypervisor does not enable them.
-  gfn:
-    Either the guest page table containing the translations shadowed by this
-    page, or the base page frame for linear translations.  See role.direct.
-  spt:
-    A pageful of 64-bit sptes containing the translations for this page.
-    Accessed by both kvm and hardware.
-    The page pointed to by spt will have its page->private pointing back
-    at the shadow page structure.
-    sptes in spt point either at guest pages, or at lower-level shadow pages.
-    Specifically, if sp1 and sp2 are shadow pages, then sp1->spt[n] may point
-    at __pa(sp2->spt).  sp2 will point back at sp1 through parent_pte.
-    The spt array forms a DAG structure with the shadow page as a node, and
-    guest pages as leaves.
-  gfns:
-    An array of 512 guest frame numbers, one for each present pte.  Used to
-    perform a reverse map from a pte to a gfn. When role.direct is set, any
-    element of this array can be calculated from the gfn field when used, in
-    this case, the array of gfns is not allocated. See role.direct and gfn.
-  root_count:
-    A counter keeping track of how many hardware registers (guest cr3 or
-    pdptrs) are now pointing at the page.  While this counter is nonzero, the
-    page cannot be destroyed.  See role.invalid.
-  parent_ptes:
-    The reverse mapping for the pte/ptes pointing at this page's spt. If
-    parent_ptes bit 0 is zero, only one spte points at this page and
-    parent_ptes points at this single spte, otherwise, there exists multiple
-    sptes pointing at this page and (parent_ptes & ~0x1) points at a data
-    structure with a list of parent sptes.
-  unsync:
-    If true, then the translations in this page may not match the guest's
-    translation.  This is equivalent to the state of the tlb when a pte is
-    changed but before the tlb entry is flushed.  Accordingly, unsync ptes
-    are synchronized when the guest executes invlpg or flushes its tlb by
-    other means.  Valid for leaf pages.
-  unsync_children:
-    How many sptes in the page point at pages that are unsync (or have
-    unsynchronized children).
-  unsync_child_bitmap:
-    A bitmap indicating which sptes in spt point (directly or indirectly) at
-    pages that may be unsynchronized.  Used to quickly locate all unsychronized
-    pages reachable from a given page.
-  clear_spte_count:
-    Only present on 32-bit hosts, where a 64-bit spte cannot be written
-    atomically.  The reader uses this while running out of the MMU lock
-    to detect in-progress updates and retry them until the writer has
-    finished the write.
-  write_flooding_count:
-    A guest may write to a page table many times, causing a lot of
-    emulations if the page needs to be write-protected (see "Synchronized
-    and unsynchronized pages" below).  Leaf pages can be unsynchronized
-    so that they do not trigger frequent emulation, but this is not
-    possible for non-leafs.  This field counts the number of emulations
-    since the last time the page table was actually used; if emulation
-    is triggered too frequently on this page, KVM will unmap the page
-    to avoid emulation in the future.
-
-Reverse map
-===========
-
-The mmu maintains a reverse mapping whereby all ptes mapping a page can be
-reached given its gfn.  This is used, for example, when swapping out a page.
-
-Synchronized and unsynchronized pages
-=====================================
-
-The guest uses two events to synchronize its tlb and page tables: tlb flushes
-and page invalidations (invlpg).
-
-A tlb flush means that we need to synchronize all sptes reachable from the
-guest's cr3.  This is expensive, so we keep all guest page tables write
-protected, and synchronize sptes to gptes when a gpte is written.
-
-A special case is when a guest page table is reachable from the current
-guest cr3.  In this case, the guest is obliged to issue an invlpg instruction
-before using the translation.  We take advantage of that by removing write
-protection from the guest page, and allowing the guest to modify it freely.
-We synchronize modified gptes when the guest invokes invlpg.  This reduces
-the amount of emulation we have to do when the guest modifies multiple gptes,
-or when the a guest page is no longer used as a page table and is used for
-random guest data.
-
-As a side effect we have to resynchronize all reachable unsynchronized shadow
-pages on a tlb flush.
-
-
-Reaction to events
-==================
-
-- guest page fault (or npt page fault, or ept violation)
-
-This is the most complicated event.  The cause of a page fault can be:
-
-  - a true guest fault (the guest translation won't allow the access) (*)
-  - access to a missing translation
-  - access to a protected translation
-    - when logging dirty pages, memory is write protected
-    - synchronized shadow pages are write protected (*)
-  - access to untranslatable memory (mmio)
-
-  (*) not applicable in direct mode
-
-Handling a page fault is performed as follows:
-
- - if the RSV bit of the error code is set, the page fault is caused by guest
-   accessing MMIO and cached MMIO information is available.
-   - walk shadow page table
-   - check for valid generation number in the spte (see "Fast invalidation of
-     MMIO sptes" below)
-   - cache the information to vcpu->arch.mmio_gva, vcpu->arch.mmio_access and
-     vcpu->arch.mmio_gfn, and call the emulator
- - If both P bit and R/W bit of error code are set, this could possibly
-   be handled as a "fast page fault" (fixed without taking the MMU lock).  See
-   the description in Documentation/virt/kvm/locking.txt.
- - if needed, walk the guest page tables to determine the guest translation
-   (gva->gpa or ngpa->gpa)
-   - if permissions are insufficient, reflect the fault back to the guest
- - determine the host page
-   - if this is an mmio request, there is no host page; cache the info to
-     vcpu->arch.mmio_gva, vcpu->arch.mmio_access and vcpu->arch.mmio_gfn
- - walk the shadow page table to find the spte for the translation,
-   instantiating missing intermediate page tables as necessary
-   - If this is an mmio request, cache the mmio info to the spte and set some
-     reserved bit on the spte (see callers of kvm_mmu_set_mmio_spte_mask)
- - try to unsynchronize the page
-   - if successful, we can let the guest continue and modify the gpte
- - emulate the instruction
-   - if failed, unshadow the page and let the guest continue
- - update any translations that were modified by the instruction
-
-invlpg handling:
-
-  - walk the shadow page hierarchy and drop affected translations
-  - try to reinstantiate the indicated translation in the hope that the
-    guest will use it in the near future
-
-Guest control register updates:
-
-- mov to cr3
-  - look up new shadow roots
-  - synchronize newly reachable shadow pages
-
-- mov to cr0/cr4/efer
-  - set up mmu context for new paging mode
-  - look up new shadow roots
-  - synchronize newly reachable shadow pages
-
-Host translation updates:
-
-  - mmu notifier called with updated hva
-  - look up affected sptes through reverse map
-  - drop (or update) translations
-
-Emulating cr0.wp
-================
-
-If tdp is not enabled, the host must keep cr0.wp=1 so page write protection
-works for the guest kernel, not guest guest userspace.  When the guest
-cr0.wp=1, this does not present a problem.  However when the guest cr0.wp=0,
-we cannot map the permissions for gpte.u=1, gpte.w=0 to any spte (the
-semantics require allowing any guest kernel access plus user read access).
-
-We handle this by mapping the permissions to two possible sptes, depending
-on fault type:
-
-- kernel write fault: spte.u=0, spte.w=1 (allows full kernel access,
-  disallows user access)
-- read fault: spte.u=1, spte.w=0 (allows full read access, disallows kernel
-  write access)
-
-(user write faults generate a #PF)
-
-In the first case there are two additional complications:
-- if CR4.SMEP is enabled: since we've turned the page into a kernel page,
-  the kernel may now execute it.  We handle this by also setting spte.nx.
-  If we get a user fetch or read fault, we'll change spte.u=1 and
-  spte.nx=gpte.nx back.  For this to work, KVM forces EFER.NX to 1 when
-  shadow paging is in use.
-- if CR4.SMAP is disabled: since the page has been changed to a kernel
-  page, it can not be reused when CR4.SMAP is enabled. We set
-  CR4.SMAP && !CR0.WP into shadow page's role to avoid this case. Note,
-  here we do not care the case that CR4.SMAP is enabled since KVM will
-  directly inject #PF to guest due to failed permission check.
-
-To prevent an spte that was converted into a kernel page with cr0.wp=0
-from being written by the kernel after cr0.wp has changed to 1, we make
-the value of cr0.wp part of the page role.  This means that an spte created
-with one value of cr0.wp cannot be used when cr0.wp has a different value -
-it will simply be missed by the shadow page lookup code.  A similar issue
-exists when an spte created with cr0.wp=0 and cr4.smep=0 is used after
-changing cr4.smep to 1.  To avoid this, the value of !cr0.wp && cr4.smep
-is also made a part of the page role.
-
-Large pages
-===========
-
-The mmu supports all combinations of large and small guest and host pages.
-Supported page sizes include 4k, 2M, 4M, and 1G.  4M pages are treated as
-two separate 2M pages, on both guest and host, since the mmu always uses PAE
-paging.
-
-To instantiate a large spte, four constraints must be satisfied:
-
-- the spte must point to a large host page
-- the guest pte must be a large pte of at least equivalent size (if tdp is
-  enabled, there is no guest pte and this condition is satisfied)
-- if the spte will be writeable, the large page frame may not overlap any
-  write-protected pages
-- the guest page must be wholly contained by a single memory slot
-
-To check the last two conditions, the mmu maintains a ->disallow_lpage set of
-arrays for each memory slot and large page size.  Every write protected page
-causes its disallow_lpage to be incremented, thus preventing instantiation of
-a large spte.  The frames at the end of an unaligned memory slot have
-artificially inflated ->disallow_lpages so they can never be instantiated.
-
-Fast invalidation of MMIO sptes
-===============================
-
-As mentioned in "Reaction to events" above, kvm will cache MMIO
-information in leaf sptes.  When a new memslot is added or an existing
-memslot is changed, this information may become stale and needs to be
-invalidated.  This also needs to hold the MMU lock while walking all
-shadow pages, and is made more scalable with a similar technique.
-
-MMIO sptes have a few spare bits, which are used to store a
-generation number.  The global generation number is stored in
-kvm_memslots(kvm)->generation, and increased whenever guest memory info
-changes.
-
-When KVM finds an MMIO spte, it checks the generation number of the spte.
-If the generation number of the spte does not equal the global generation
-number, it will ignore the cached MMIO information and handle the page
-fault through the slow path.
-
-Since only 19 bits are used to store generation-number on mmio spte, all
-pages are zapped when there is an overflow.
-
-Unfortunately, a single memory access might access kvm_memslots(kvm) multiple
-times, the last one happening when the generation number is retrieved and
-stored into the MMIO spte.  Thus, the MMIO spte might be created based on
-out-of-date information, but with an up-to-date generation number.
-
-To avoid this, the generation number is incremented again after synchronize_srcu
-returns; thus, bit 63 of kvm_memslots(kvm)->generation set to 1 only during a
-memslot update, while some SRCU readers might be using the old copy.  We do not
-want to use an MMIO sptes created with an odd generation number, and we can do
-this without losing a bit in the MMIO spte.  The "update in-progress" bit of the
-generation is not stored in MMIO spte, and is so is implicitly zero when the
-generation is extracted out of the spte.  If KVM is unlucky and creates an MMIO
-spte while an update is in-progress, the next access to the spte will always be
-a cache miss.  For example, a subsequent access during the update window will
-miss due to the in-progress flag diverging, while an access after the update
-window closes will have a higher generation number (as compared to the spte).
-
-
-Further reading
-===============
-
-- NPT presentation from KVM Forum 2008
-  http://www.linux-kvm.org/images/c/c8/KvmForum2008%24kdf2008_21.pdf
-
diff --git a/Documentation/virt/kvm/msr.rst b/Documentation/virt/kvm/msr.rst

new file mode 100644 (file)

index 0000000..3389203
--- /dev/null
+++ b/Documentation/virt/kvm/msr.rst
@@ -0,0 +1,321 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=================
+KVM-specific MSRs
+=================
+
+:Author: Glauber Costa <glommer@redhat.com>, Red Hat Inc, 2010
+
+KVM makes use of some custom MSRs to service some requests.
+
+Custom MSRs have a range reserved for them, that goes from
+0x4b564d00 to 0x4b564dff. There are MSRs outside this area,
+but they are deprecated and their use is discouraged.
+
+Custom MSR list
+---------------
+
+The current supported Custom MSR list is:
+
+MSR_KVM_WALL_CLOCK_NEW:
+       0x4b564d00
+
+data:
+       4-byte alignment physical address of a memory area which must be
+       in guest RAM. This memory is expected to hold a copy of the following
+       structure::
+
+        struct pvclock_wall_clock {
+               u32   version;
+               u32   sec;
+               u32   nsec;
+         } __attribute__((__packed__));
+
+       whose data will be filled in by the hypervisor. The hypervisor is only
+       guaranteed to update this data at the moment of MSR write.
+       Users that want to reliably query this information more than once have
+       to write more than once to this MSR. Fields have the following meanings:
+
+       version:
+               guest has to check version before and after grabbing
+               time information and check that they are both equal and even.
+               An odd version indicates an in-progress update.
+
+       sec:
+                number of seconds for wallclock at time of boot.
+
+       nsec:
+                number of nanoseconds for wallclock at time of boot.
+
+       In order to get the current wallclock time, the system_time from
+       MSR_KVM_SYSTEM_TIME_NEW needs to be added.
+
+       Note that although MSRs are per-CPU entities, the effect of this
+       particular MSR is global.
+
+       Availability of this MSR must be checked via bit 3 in 0x4000001 cpuid
+       leaf prior to usage.
+
+MSR_KVM_SYSTEM_TIME_NEW:
+       0x4b564d01
+
+data:
+       4-byte aligned physical address of a memory area which must be in
+       guest RAM, plus an enable bit in bit 0. This memory is expected to hold
+       a copy of the following structure::
+
+         struct pvclock_vcpu_time_info {
+               u32   version;
+               u32   pad0;
+               u64   tsc_timestamp;
+               u64   system_time;
+               u32   tsc_to_system_mul;
+               s8    tsc_shift;
+               u8    flags;
+               u8    pad[2];
+         } __attribute__((__packed__)); /* 32 bytes */
+
+       whose data will be filled in by the hypervisor periodically. Only one
+       write, or registration, is needed for each VCPU. The interval between
+       updates of this structure is arbitrary and implementation-dependent.
+       The hypervisor may update this structure at any time it sees fit until
+       anything with bit0 == 0 is written to it.
+
+       Fields have the following meanings:
+
+       version:
+               guest has to check version before and after grabbing
+               time information and check that they are both equal and even.
+               An odd version indicates an in-progress update.
+
+       tsc_timestamp:
+               the tsc value at the current VCPU at the time
+               of the update of this structure. Guests can subtract this value
+               from current tsc to derive a notion of elapsed time since the
+               structure update.
+
+       system_time:
+               a host notion of monotonic time, including sleep
+               time at the time this structure was last updated. Unit is
+               nanoseconds.
+
+       tsc_to_system_mul:
+               multiplier to be used when converting
+               tsc-related quantity to nanoseconds
+
+       tsc_shift:
+               shift to be used when converting tsc-related
+               quantity to nanoseconds. This shift will ensure that
+               multiplication with tsc_to_system_mul does not overflow.
+               A positive value denotes a left shift, a negative value
+               a right shift.
+
+               The conversion from tsc to nanoseconds involves an additional
+               right shift by 32 bits. With this information, guests can
+               derive per-CPU time by doing::
+
+                       time = (current_tsc - tsc_timestamp)
+                       if (tsc_shift >= 0)
+                               time <<= tsc_shift;
+                       else
+                               time >>= -tsc_shift;
+                       time = (time * tsc_to_system_mul) >> 32
+                       time = time + system_time
+
+       flags:
+               bits in this field indicate extended capabilities
+               coordinated between the guest and the hypervisor. Availability
+               of specific flags has to be checked in 0x40000001 cpuid leaf.
+               Current flags are:
+
+
+               +-----------+--------------+----------------------------------+
+               | flag bit  | cpuid bit    | meaning                          |
+               +-----------+--------------+----------------------------------+
+               |           |              | time measures taken across       |
+               |    0      |      24      | multiple cpus are guaranteed to  |
+               |           |              | be monotonic                     |
+               +-----------+--------------+----------------------------------+
+               |           |              | guest vcpu has been paused by    |
+               |    1      |     N/A      | the host                         |
+               |           |              | See 4.70 in api.txt              |
+               +-----------+--------------+----------------------------------+
+
+       Availability of this MSR must be checked via bit 3 in 0x4000001 cpuid
+       leaf prior to usage.
+
+
+MSR_KVM_WALL_CLOCK:
+       0x11
+
+data and functioning:
+       same as MSR_KVM_WALL_CLOCK_NEW. Use that instead.
+
+       This MSR falls outside the reserved KVM range and may be removed in the
+       future. Its usage is deprecated.
+
+       Availability of this MSR must be checked via bit 0 in 0x4000001 cpuid
+       leaf prior to usage.
+
+MSR_KVM_SYSTEM_TIME:
+       0x12
+
+data and functioning:
+       same as MSR_KVM_SYSTEM_TIME_NEW. Use that instead.
+
+       This MSR falls outside the reserved KVM range and may be removed in the
+       future. Its usage is deprecated.
+
+       Availability of this MSR must be checked via bit 0 in 0x4000001 cpuid
+       leaf prior to usage.
+
+       The suggested algorithm for detecting kvmclock presence is then::
+
+               if (!kvm_para_available())    /* refer to cpuid.txt */
+                       return NON_PRESENT;
+
+               flags = cpuid_eax(0x40000001);
+               if (flags & 3) {
+                       msr_kvm_system_time = MSR_KVM_SYSTEM_TIME_NEW;
+                       msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK_NEW;
+                       return PRESENT;
+               } else if (flags & 0) {
+                       msr_kvm_system_time = MSR_KVM_SYSTEM_TIME;
+                       msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK;
+                       return PRESENT;
+               } else
+                       return NON_PRESENT;
+
+MSR_KVM_ASYNC_PF_EN:
+       0x4b564d02
+
+data:
+       Bits 63-6 hold 64-byte aligned physical address of a
+       64 byte memory area which must be in guest RAM and must be
+       zeroed. Bits 5-3 are reserved and should be zero. Bit 0 is 1
+       when asynchronous page faults are enabled on the vcpu 0 when
+       disabled. Bit 1 is 1 if asynchronous page faults can be injected
+       when vcpu is in cpl == 0. Bit 2 is 1 if asynchronous page faults
+       are delivered to L1 as #PF vmexits.  Bit 2 can be set only if
+       KVM_FEATURE_ASYNC_PF_VMEXIT is present in CPUID.
+
+       First 4 byte of 64 byte memory location will be written to by
+       the hypervisor at the time of asynchronous page fault (APF)
+       injection to indicate type of asynchronous page fault. Value
+       of 1 means that the page referred to by the page fault is not
+       present. Value 2 means that the page is now available. Disabling
+       interrupt inhibits APFs. Guest must not enable interrupt
+       before the reason is read, or it may be overwritten by another
+       APF. Since APF uses the same exception vector as regular page
+       fault guest must reset the reason to 0 before it does
+       something that can generate normal page fault.  If during page
+       fault APF reason is 0 it means that this is regular page
+       fault.
+
+       During delivery of type 1 APF cr2 contains a token that will
+       be used to notify a guest when missing page becomes
+       available. When page becomes available type 2 APF is sent with
+       cr2 set to the token associated with the page. There is special
+       kind of token 0xffffffff which tells vcpu that it should wake
+       up all processes waiting for APFs and no individual type 2 APFs
+       will be sent.
+
+       If APF is disabled while there are outstanding APFs, they will
+       not be delivered.
+
+       Currently type 2 APF will be always delivered on the same vcpu as
+       type 1 was, but guest should not rely on that.
+
+MSR_KVM_STEAL_TIME:
+       0x4b564d03
+
+data:
+       64-byte alignment physical address of a memory area which must be
+       in guest RAM, plus an enable bit in bit 0. This memory is expected to
+       hold a copy of the following structure::
+
+         struct kvm_steal_time {
+               __u64 steal;
+               __u32 version;
+               __u32 flags;
+               __u8  preempted;
+               __u8  u8_pad[3];
+               __u32 pad[11];
+         }
+
+       whose data will be filled in by the hypervisor periodically. Only one
+       write, or registration, is needed for each VCPU. The interval between
+       updates of this structure is arbitrary and implementation-dependent.
+       The hypervisor may update this structure at any time it sees fit until
+       anything with bit0 == 0 is written to it. Guest is required to make sure
+       this structure is initialized to zero.
+
+       Fields have the following meanings:
+
+       version:
+               a sequence counter. In other words, guest has to check
+               this field before and after grabbing time information and make
+               sure they are both equal and even. An odd version indicates an
+               in-progress update.
+
+       flags:
+               At this point, always zero. May be used to indicate
+               changes in this structure in the future.
+
+       steal:
+               the amount of time in which this vCPU did not run, in
+               nanoseconds. Time during which the vcpu is idle, will not be
+               reported as steal time.
+
+       preempted:
+               indicate the vCPU who owns this struct is running or
+               not. Non-zero values mean the vCPU has been preempted. Zero
+               means the vCPU is not preempted. NOTE, it is always zero if the
+               the hypervisor doesn't support this field.
+
+MSR_KVM_EOI_EN:
+       0x4b564d04
+
+data:
+       Bit 0 is 1 when PV end of interrupt is enabled on the vcpu; 0
+       when disabled.  Bit 1 is reserved and must be zero.  When PV end of
+       interrupt is enabled (bit 0 set), bits 63-2 hold a 4-byte aligned
+       physical address of a 4 byte memory area which must be in guest RAM and
+       must be zeroed.
+
+       The first, least significant bit of 4 byte memory location will be
+       written to by the hypervisor, typically at the time of interrupt
+       injection.  Value of 1 means that guest can skip writing EOI to the apic
+       (using MSR or MMIO write); instead, it is sufficient to signal
+       EOI by clearing the bit in guest memory - this location will
+       later be polled by the hypervisor.
+       Value of 0 means that the EOI write is required.
+
+       It is always safe for the guest to ignore the optimization and perform
+       the APIC EOI write anyway.
+
+       Hypervisor is guaranteed to only modify this least
+       significant bit while in the current VCPU context, this means that
+       guest does not need to use either lock prefix or memory ordering
+       primitives to synchronise with the hypervisor.
+
+       However, hypervisor can set and clear this memory bit at any time:
+       therefore to make sure hypervisor does not interrupt the
+       guest and clear the least significant bit in the memory area
+       in the window between guest testing it to detect
+       whether it can skip EOI apic write and between guest
+       clearing it to signal EOI to the hypervisor,
+       guest must both read the least significant bit in the memory area and
+       clear it using a single CPU instruction, such as test and clear, or
+       compare and exchange.
+
+MSR_KVM_POLL_CONTROL:
+       0x4b564d05
+
+       Control host-side polling.
+
+data:
+       Bit 0 enables (1) or disables (0) host-side HLT polling logic.
+
+       KVM guests can request the host not to poll on HLT, for example if
+       they are performing polling themselves.
diff --git a/Documentation/virt/kvm/msr.txt b/Documentation/virt/kvm/msr.txt

deleted file mode 100644 (file)

index df1f433..0000000
--- a/Documentation/virt/kvm/msr.txt
+++ /dev/null
@@ -1,284 +0,0 @@
-KVM-specific MSRs.
-Glauber Costa <glommer@redhat.com>, Red Hat Inc, 2010
-=====================================================
-
-KVM makes use of some custom MSRs to service some requests.
-
-Custom MSRs have a range reserved for them, that goes from
-0x4b564d00 to 0x4b564dff. There are MSRs outside this area,
-but they are deprecated and their use is discouraged.
-
-Custom MSR list
---------
-
-The current supported Custom MSR list is:
-
-MSR_KVM_WALL_CLOCK_NEW:   0x4b564d00
-
-       data: 4-byte alignment physical address of a memory area which must be
-       in guest RAM. This memory is expected to hold a copy of the following
-       structure:
-
-       struct pvclock_wall_clock {
-               u32   version;
-               u32   sec;
-               u32   nsec;
-       } __attribute__((__packed__));
-
-       whose data will be filled in by the hypervisor. The hypervisor is only
-       guaranteed to update this data at the moment of MSR write.
-       Users that want to reliably query this information more than once have
-       to write more than once to this MSR. Fields have the following meanings:
-
-               version: guest has to check version before and after grabbing
-               time information and check that they are both equal and even.
-               An odd version indicates an in-progress update.
-
-               sec: number of seconds for wallclock at time of boot.
-
-               nsec: number of nanoseconds for wallclock at time of boot.
-
-       In order to get the current wallclock time, the system_time from
-       MSR_KVM_SYSTEM_TIME_NEW needs to be added.
-
-       Note that although MSRs are per-CPU entities, the effect of this
-       particular MSR is global.
-
-       Availability of this MSR must be checked via bit 3 in 0x4000001 cpuid
-       leaf prior to usage.
-
-MSR_KVM_SYSTEM_TIME_NEW:  0x4b564d01
-
-       data: 4-byte aligned physical address of a memory area which must be in
-       guest RAM, plus an enable bit in bit 0. This memory is expected to hold
-       a copy of the following structure:
-
-       struct pvclock_vcpu_time_info {
-               u32   version;
-               u32   pad0;
-               u64   tsc_timestamp;
-               u64   system_time;
-               u32   tsc_to_system_mul;
-               s8    tsc_shift;
-               u8    flags;
-               u8    pad[2];
-       } __attribute__((__packed__)); /* 32 bytes */
-
-       whose data will be filled in by the hypervisor periodically. Only one
-       write, or registration, is needed for each VCPU. The interval between
-       updates of this structure is arbitrary and implementation-dependent.
-       The hypervisor may update this structure at any time it sees fit until
-       anything with bit0 == 0 is written to it.
-
-       Fields have the following meanings:
-
-               version: guest has to check version before and after grabbing
-               time information and check that they are both equal and even.
-               An odd version indicates an in-progress update.
-
-               tsc_timestamp: the tsc value at the current VCPU at the time
-               of the update of this structure. Guests can subtract this value
-               from current tsc to derive a notion of elapsed time since the
-               structure update.
-
-               system_time: a host notion of monotonic time, including sleep
-               time at the time this structure was last updated. Unit is
-               nanoseconds.
-
-               tsc_to_system_mul: multiplier to be used when converting
-               tsc-related quantity to nanoseconds
-
-               tsc_shift: shift to be used when converting tsc-related
-               quantity to nanoseconds. This shift will ensure that
-               multiplication with tsc_to_system_mul does not overflow.
-               A positive value denotes a left shift, a negative value
-               a right shift.
-
-               The conversion from tsc to nanoseconds involves an additional
-               right shift by 32 bits. With this information, guests can
-               derive per-CPU time by doing:
-
-                       time = (current_tsc - tsc_timestamp)
-                       if (tsc_shift >= 0)
-                               time <<= tsc_shift;
-                       else
-                               time >>= -tsc_shift;
-                       time = (time * tsc_to_system_mul) >> 32
-                       time = time + system_time
-
-               flags: bits in this field indicate extended capabilities
-               coordinated between the guest and the hypervisor. Availability
-               of specific flags has to be checked in 0x40000001 cpuid leaf.
-               Current flags are:
-
-                flag bit   | cpuid bit    | meaning
-               -------------------------------------------------------------
-                           |              | time measures taken across
-                    0      |      24      | multiple cpus are guaranteed to
-                           |              | be monotonic
-               -------------------------------------------------------------
-                           |              | guest vcpu has been paused by
-                    1      |     N/A      | the host
-                           |              | See 4.70 in api.txt
-               -------------------------------------------------------------
-
-       Availability of this MSR must be checked via bit 3 in 0x4000001 cpuid
-       leaf prior to usage.
-
-
-MSR_KVM_WALL_CLOCK:  0x11
-
-       data and functioning: same as MSR_KVM_WALL_CLOCK_NEW. Use that instead.
-
-       This MSR falls outside the reserved KVM range and may be removed in the
-       future. Its usage is deprecated.
-
-       Availability of this MSR must be checked via bit 0 in 0x4000001 cpuid
-       leaf prior to usage.
-
-MSR_KVM_SYSTEM_TIME: 0x12
-
-       data and functioning: same as MSR_KVM_SYSTEM_TIME_NEW. Use that instead.
-
-       This MSR falls outside the reserved KVM range and may be removed in the
-       future. Its usage is deprecated.
-
-       Availability of this MSR must be checked via bit 0 in 0x4000001 cpuid
-       leaf prior to usage.
-
-       The suggested algorithm for detecting kvmclock presence is then:
-
-               if (!kvm_para_available())    /* refer to cpuid.txt */
-                       return NON_PRESENT;
-
-               flags = cpuid_eax(0x40000001);
-               if (flags & 3) {
-                       msr_kvm_system_time = MSR_KVM_SYSTEM_TIME_NEW;
-                       msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK_NEW;
-                       return PRESENT;
-               } else if (flags & 0) {
-                       msr_kvm_system_time = MSR_KVM_SYSTEM_TIME;
-                       msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK;
-                       return PRESENT;
-               } else
-                       return NON_PRESENT;
-
-MSR_KVM_ASYNC_PF_EN: 0x4b564d02
-       data: Bits 63-6 hold 64-byte aligned physical address of a
-       64 byte memory area which must be in guest RAM and must be
-       zeroed. Bits 5-3 are reserved and should be zero. Bit 0 is 1
-       when asynchronous page faults are enabled on the vcpu 0 when
-       disabled. Bit 1 is 1 if asynchronous page faults can be injected
-       when vcpu is in cpl == 0. Bit 2 is 1 if asynchronous page faults
-       are delivered to L1 as #PF vmexits.  Bit 2 can be set only if
-       KVM_FEATURE_ASYNC_PF_VMEXIT is present in CPUID.
-
-       First 4 byte of 64 byte memory location will be written to by
-       the hypervisor at the time of asynchronous page fault (APF)
-       injection to indicate type of asynchronous page fault. Value
-       of 1 means that the page referred to by the page fault is not
-       present. Value 2 means that the page is now available. Disabling
-       interrupt inhibits APFs. Guest must not enable interrupt
-       before the reason is read, or it may be overwritten by another
-       APF. Since APF uses the same exception vector as regular page
-       fault guest must reset the reason to 0 before it does
-       something that can generate normal page fault.  If during page
-       fault APF reason is 0 it means that this is regular page
-       fault.
-
-       During delivery of type 1 APF cr2 contains a token that will
-       be used to notify a guest when missing page becomes
-       available. When page becomes available type 2 APF is sent with
-       cr2 set to the token associated with the page. There is special
-       kind of token 0xffffffff which tells vcpu that it should wake
-       up all processes waiting for APFs and no individual type 2 APFs
-       will be sent.
-
-       If APF is disabled while there are outstanding APFs, they will
-       not be delivered.
-
-       Currently type 2 APF will be always delivered on the same vcpu as
-       type 1 was, but guest should not rely on that.
-
-MSR_KVM_STEAL_TIME: 0x4b564d03
-
-       data: 64-byte alignment physical address of a memory area which must be
-       in guest RAM, plus an enable bit in bit 0. This memory is expected to
-       hold a copy of the following structure:
-
-       struct kvm_steal_time {
-               __u64 steal;
-               __u32 version;
-               __u32 flags;
-               __u8  preempted;
-               __u8  u8_pad[3];
-               __u32 pad[11];
-       }
-
-       whose data will be filled in by the hypervisor periodically. Only one
-       write, or registration, is needed for each VCPU. The interval between
-       updates of this structure is arbitrary and implementation-dependent.
-       The hypervisor may update this structure at any time it sees fit until
-       anything with bit0 == 0 is written to it. Guest is required to make sure
-       this structure is initialized to zero.
-
-       Fields have the following meanings:
-
-               version: a sequence counter. In other words, guest has to check
-               this field before and after grabbing time information and make
-               sure they are both equal and even. An odd version indicates an
-               in-progress update.
-
-               flags: At this point, always zero. May be used to indicate
-               changes in this structure in the future.
-
-               steal: the amount of time in which this vCPU did not run, in
-               nanoseconds. Time during which the vcpu is idle, will not be
-               reported as steal time.
-
-               preempted: indicate the vCPU who owns this struct is running or
-               not. Non-zero values mean the vCPU has been preempted. Zero
-               means the vCPU is not preempted. NOTE, it is always zero if the
-               the hypervisor doesn't support this field.
-
-MSR_KVM_EOI_EN: 0x4b564d04
-       data: Bit 0 is 1 when PV end of interrupt is enabled on the vcpu; 0
-       when disabled.  Bit 1 is reserved and must be zero.  When PV end of
-       interrupt is enabled (bit 0 set), bits 63-2 hold a 4-byte aligned
-       physical address of a 4 byte memory area which must be in guest RAM and
-       must be zeroed.
-
-       The first, least significant bit of 4 byte memory location will be
-       written to by the hypervisor, typically at the time of interrupt
-       injection.  Value of 1 means that guest can skip writing EOI to the apic
-       (using MSR or MMIO write); instead, it is sufficient to signal
-       EOI by clearing the bit in guest memory - this location will
-       later be polled by the hypervisor.
-       Value of 0 means that the EOI write is required.
-
-       It is always safe for the guest to ignore the optimization and perform
-       the APIC EOI write anyway.
-
-       Hypervisor is guaranteed to only modify this least
-       significant bit while in the current VCPU context, this means that
-       guest does not need to use either lock prefix or memory ordering
-       primitives to synchronise with the hypervisor.
-
-       However, hypervisor can set and clear this memory bit at any time:
-       therefore to make sure hypervisor does not interrupt the
-       guest and clear the least significant bit in the memory area
-       in the window between guest testing it to detect
-       whether it can skip EOI apic write and between guest
-       clearing it to signal EOI to the hypervisor,
-       guest must both read the least significant bit in the memory area and
-       clear it using a single CPU instruction, such as test and clear, or
-       compare and exchange.
-
-MSR_KVM_POLL_CONTROL: 0x4b564d05
-       Control host-side polling.
-
-       data: Bit 0 enables (1) or disables (0) host-side HLT polling logic.
-
-       KVM guests can request the host not to poll on HLT, for example if
-       they are performing polling themselves.
-
diff --git a/Documentation/virt/kvm/nested-vmx.rst b/Documentation/virt/kvm/nested-vmx.rst

new file mode 100644 (file)

index 0000000..592b0ab
--- /dev/null
+++ b/Documentation/virt/kvm/nested-vmx.rst
@@ -0,0 +1,245 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==========
+Nested VMX
+==========
+
+Overview
+---------
+
+On Intel processors, KVM uses Intel's VMX (Virtual-Machine eXtensions)
+to easily and efficiently run guest operating systems. Normally, these guests
+*cannot* themselves be hypervisors running their own guests, because in VMX,
+guests cannot use VMX instructions.
+
+The "Nested VMX" feature adds this missing capability - of running guest
+hypervisors (which use VMX) with their own nested guests. It does so by
+allowing a guest to use VMX instructions, and correctly and efficiently
+emulating them using the single level of VMX available in the hardware.
+
+We describe in much greater detail the theory behind the nested VMX feature,
+its implementation and its performance characteristics, in the OSDI 2010 paper
+"The Turtles Project: Design and Implementation of Nested Virtualization",
+available at:
+
+       http://www.usenix.org/events/osdi10/tech/full_papers/Ben-Yehuda.pdf
+
+
+Terminology
+-----------
+
+Single-level virtualization has two levels - the host (KVM) and the guests.
+In nested virtualization, we have three levels: The host (KVM), which we call
+L0, the guest hypervisor, which we call L1, and its nested guest, which we
+call L2.
+
+
+Running nested VMX
+------------------
+
+The nested VMX feature is disabled by default. It can be enabled by giving
+the "nested=1" option to the kvm-intel module.
+
+No modifications are required to user space (qemu). However, qemu's default
+emulated CPU type (qemu64) does not list the "VMX" CPU feature, so it must be
+explicitly enabled, by giving qemu one of the following options:
+
+     - cpu host              (emulated CPU has all features of the real CPU)
+
+     - cpu qemu64,+vmx       (add just the vmx feature to a named CPU type)
+
+
+ABIs
+----
+
+Nested VMX aims to present a standard and (eventually) fully-functional VMX
+implementation for the a guest hypervisor to use. As such, the official
+specification of the ABI that it provides is Intel's VMX specification,
+namely volume 3B of their "Intel 64 and IA-32 Architectures Software
+Developer's Manual". Not all of VMX's features are currently fully supported,
+but the goal is to eventually support them all, starting with the VMX features
+which are used in practice by popular hypervisors (KVM and others).
+
+As a VMX implementation, nested VMX presents a VMCS structure to L1.
+As mandated by the spec, other than the two fields revision_id and abort,
+this structure is *opaque* to its user, who is not supposed to know or care
+about its internal structure. Rather, the structure is accessed through the
+VMREAD and VMWRITE instructions.
+Still, for debugging purposes, KVM developers might be interested to know the
+internals of this structure; This is struct vmcs12 from arch/x86/kvm/vmx.c.
+
+The name "vmcs12" refers to the VMCS that L1 builds for L2. In the code we
+also have "vmcs01", the VMCS that L0 built for L1, and "vmcs02" is the VMCS
+which L0 builds to actually run L2 - how this is done is explained in the
+aforementioned paper.
+
+For convenience, we repeat the content of struct vmcs12 here. If the internals
+of this structure changes, this can break live migration across KVM versions.
+VMCS12_REVISION (from vmx.c) should be changed if struct vmcs12 or its inner
+struct shadow_vmcs is ever changed.
+
+::
+
+       typedef u64 natural_width;
+       struct __packed vmcs12 {
+               /* According to the Intel spec, a VMCS region must start with
+                * these two user-visible fields */
+               u32 revision_id;
+               u32 abort;
+
+               u32 launch_state; /* set to 0 by VMCLEAR, to 1 by VMLAUNCH */
+               u32 padding[7]; /* room for future expansion */
+
+               u64 io_bitmap_a;
+               u64 io_bitmap_b;
+               u64 msr_bitmap;
+               u64 vm_exit_msr_store_addr;
+               u64 vm_exit_msr_load_addr;
+               u64 vm_entry_msr_load_addr;
+               u64 tsc_offset;
+               u64 virtual_apic_page_addr;
+               u64 apic_access_addr;
+               u64 ept_pointer;
+               u64 guest_physical_address;
+               u64 vmcs_link_pointer;
+               u64 guest_ia32_debugctl;
+               u64 guest_ia32_pat;
+               u64 guest_ia32_efer;
+               u64 guest_pdptr0;
+               u64 guest_pdptr1;
+               u64 guest_pdptr2;
+               u64 guest_pdptr3;
+               u64 host_ia32_pat;
+               u64 host_ia32_efer;
+               u64 padding64[8]; /* room for future expansion */
+               natural_width cr0_guest_host_mask;
+               natural_width cr4_guest_host_mask;
+               natural_width cr0_read_shadow;
+               natural_width cr4_read_shadow;
+               natural_width cr3_target_value0;
+               natural_width cr3_target_value1;
+               natural_width cr3_target_value2;
+               natural_width cr3_target_value3;
+               natural_width exit_qualification;
+               natural_width guest_linear_address;
+               natural_width guest_cr0;
+               natural_width guest_cr3;
+               natural_width guest_cr4;
+               natural_width guest_es_base;
+               natural_width guest_cs_base;
+               natural_width guest_ss_base;
+               natural_width guest_ds_base;
+               natural_width guest_fs_base;
+               natural_width guest_gs_base;
+               natural_width guest_ldtr_base;
+               natural_width guest_tr_base;
+               natural_width guest_gdtr_base;
+               natural_width guest_idtr_base;
+               natural_width guest_dr7;
+               natural_width guest_rsp;
+               natural_width guest_rip;
+               natural_width guest_rflags;
+               natural_width guest_pending_dbg_exceptions;
+               natural_width guest_sysenter_esp;
+               natural_width guest_sysenter_eip;
+               natural_width host_cr0;
+               natural_width host_cr3;
+               natural_width host_cr4;
+               natural_width host_fs_base;
+               natural_width host_gs_base;
+               natural_width host_tr_base;
+               natural_width host_gdtr_base;
+               natural_width host_idtr_base;
+               natural_width host_ia32_sysenter_esp;
+               natural_width host_ia32_sysenter_eip;
+               natural_width host_rsp;
+               natural_width host_rip;
+               natural_width paddingl[8]; /* room for future expansion */
+               u32 pin_based_vm_exec_control;
+               u32 cpu_based_vm_exec_control;
+               u32 exception_bitmap;
+               u32 page_fault_error_code_mask;
+               u32 page_fault_error_code_match;
+               u32 cr3_target_count;
+               u32 vm_exit_controls;
+               u32 vm_exit_msr_store_count;
+               u32 vm_exit_msr_load_count;
+               u32 vm_entry_controls;
+               u32 vm_entry_msr_load_count;
+               u32 vm_entry_intr_info_field;
+               u32 vm_entry_exception_error_code;
+               u32 vm_entry_instruction_len;
+               u32 tpr_threshold;
+               u32 secondary_vm_exec_control;
+               u32 vm_instruction_error;
+               u32 vm_exit_reason;
+               u32 vm_exit_intr_info;
+               u32 vm_exit_intr_error_code;
+               u32 idt_vectoring_info_field;
+               u32 idt_vectoring_error_code;
+               u32 vm_exit_instruction_len;
+               u32 vmx_instruction_info;
+               u32 guest_es_limit;
+               u32 guest_cs_limit;
+               u32 guest_ss_limit;
+               u32 guest_ds_limit;
+               u32 guest_fs_limit;
+               u32 guest_gs_limit;
+               u32 guest_ldtr_limit;
+               u32 guest_tr_limit;
+               u32 guest_gdtr_limit;
+               u32 guest_idtr_limit;
+               u32 guest_es_ar_bytes;
+               u32 guest_cs_ar_bytes;
+               u32 guest_ss_ar_bytes;
+               u32 guest_ds_ar_bytes;
+               u32 guest_fs_ar_bytes;
+               u32 guest_gs_ar_bytes;
+               u32 guest_ldtr_ar_bytes;
+               u32 guest_tr_ar_bytes;
+               u32 guest_interruptibility_info;
+               u32 guest_activity_state;
+               u32 guest_sysenter_cs;
+               u32 host_ia32_sysenter_cs;
+               u32 padding32[8]; /* room for future expansion */
+               u16 virtual_processor_id;
+               u16 guest_es_selector;
+               u16 guest_cs_selector;
+               u16 guest_ss_selector;
+               u16 guest_ds_selector;
+               u16 guest_fs_selector;
+               u16 guest_gs_selector;
+               u16 guest_ldtr_selector;
+               u16 guest_tr_selector;
+               u16 host_es_selector;
+               u16 host_cs_selector;
+               u16 host_ss_selector;
+               u16 host_ds_selector;
+               u16 host_fs_selector;
+               u16 host_gs_selector;
+               u16 host_tr_selector;
+       };
+
+
+Authors
+-------
+
+These patches were written by:
+    - Abel Gordon, abelg <at> il.ibm.com
+    - Nadav Har'El, nyh <at> il.ibm.com
+    - Orit Wasserman, oritw <at> il.ibm.com
+    - Ben-Ami Yassor, benami <at> il.ibm.com
+    - Muli Ben-Yehuda, muli <at> il.ibm.com
+
+With contributions by:
+    - Anthony Liguori, aliguori <at> us.ibm.com
+    - Mike Day, mdday <at> us.ibm.com
+    - Michael Factor, factor <at> il.ibm.com
+    - Zvi Dubitzky, dubi <at> il.ibm.com
+
+And valuable reviews by:
+    - Avi Kivity, avi <at> redhat.com
+    - Gleb Natapov, gleb <at> redhat.com
+    - Marcelo Tosatti, mtosatti <at> redhat.com
+    - Kevin Tian, kevin.tian <at> intel.com
+    - and others.
diff --git a/Documentation/virt/kvm/nested-vmx.txt b/Documentation/virt/kvm/nested-vmx.txt

deleted file mode 100644 (file)

index 97eb135..0000000
--- a/Documentation/virt/kvm/nested-vmx.txt
+++ /dev/null
@@ -1,240 +0,0 @@
-Nested VMX
-==========
-
-Overview
----------
-
-On Intel processors, KVM uses Intel's VMX (Virtual-Machine eXtensions)
-to easily and efficiently run guest operating systems. Normally, these guests
-*cannot* themselves be hypervisors running their own guests, because in VMX,
-guests cannot use VMX instructions.
-
-The "Nested VMX" feature adds this missing capability - of running guest
-hypervisors (which use VMX) with their own nested guests. It does so by
-allowing a guest to use VMX instructions, and correctly and efficiently
-emulating them using the single level of VMX available in the hardware.
-
-We describe in much greater detail the theory behind the nested VMX feature,
-its implementation and its performance characteristics, in the OSDI 2010 paper
-"The Turtles Project: Design and Implementation of Nested Virtualization",
-available at:
-
-       http://www.usenix.org/events/osdi10/tech/full_papers/Ben-Yehuda.pdf
-
-
-Terminology
------------
-
-Single-level virtualization has two levels - the host (KVM) and the guests.
-In nested virtualization, we have three levels: The host (KVM), which we call
-L0, the guest hypervisor, which we call L1, and its nested guest, which we
-call L2.
-
-
-Running nested VMX
-------------------
-
-The nested VMX feature is disabled by default. It can be enabled by giving
-the "nested=1" option to the kvm-intel module.
-
-No modifications are required to user space (qemu). However, qemu's default
-emulated CPU type (qemu64) does not list the "VMX" CPU feature, so it must be
-explicitly enabled, by giving qemu one of the following options:
-
-     -cpu host              (emulated CPU has all features of the real CPU)
-
-     -cpu qemu64,+vmx       (add just the vmx feature to a named CPU type)
-
-
-ABIs
-----
-
-Nested VMX aims to present a standard and (eventually) fully-functional VMX
-implementation for the a guest hypervisor to use. As such, the official
-specification of the ABI that it provides is Intel's VMX specification,
-namely volume 3B of their "Intel 64 and IA-32 Architectures Software
-Developer's Manual". Not all of VMX's features are currently fully supported,
-but the goal is to eventually support them all, starting with the VMX features
-which are used in practice by popular hypervisors (KVM and others).
-
-As a VMX implementation, nested VMX presents a VMCS structure to L1.
-As mandated by the spec, other than the two fields revision_id and abort,
-this structure is *opaque* to its user, who is not supposed to know or care
-about its internal structure. Rather, the structure is accessed through the
-VMREAD and VMWRITE instructions.
-Still, for debugging purposes, KVM developers might be interested to know the
-internals of this structure; This is struct vmcs12 from arch/x86/kvm/vmx.c.
-
-The name "vmcs12" refers to the VMCS that L1 builds for L2. In the code we
-also have "vmcs01", the VMCS that L0 built for L1, and "vmcs02" is the VMCS
-which L0 builds to actually run L2 - how this is done is explained in the
-aforementioned paper.
-
-For convenience, we repeat the content of struct vmcs12 here. If the internals
-of this structure changes, this can break live migration across KVM versions.
-VMCS12_REVISION (from vmx.c) should be changed if struct vmcs12 or its inner
-struct shadow_vmcs is ever changed.
-
-       typedef u64 natural_width;
-       struct __packed vmcs12 {
-               /* According to the Intel spec, a VMCS region must start with
-                * these two user-visible fields */
-               u32 revision_id;
-               u32 abort;
-
-               u32 launch_state; /* set to 0 by VMCLEAR, to 1 by VMLAUNCH */
-               u32 padding[7]; /* room for future expansion */
-
-               u64 io_bitmap_a;
-               u64 io_bitmap_b;
-               u64 msr_bitmap;
-               u64 vm_exit_msr_store_addr;
-               u64 vm_exit_msr_load_addr;
-               u64 vm_entry_msr_load_addr;
-               u64 tsc_offset;
-               u64 virtual_apic_page_addr;
-               u64 apic_access_addr;
-               u64 ept_pointer;
-               u64 guest_physical_address;
-               u64 vmcs_link_pointer;
-               u64 guest_ia32_debugctl;
-               u64 guest_ia32_pat;
-               u64 guest_ia32_efer;
-               u64 guest_pdptr0;
-               u64 guest_pdptr1;
-               u64 guest_pdptr2;
-               u64 guest_pdptr3;
-               u64 host_ia32_pat;
-               u64 host_ia32_efer;
-               u64 padding64[8]; /* room for future expansion */
-               natural_width cr0_guest_host_mask;
-               natural_width cr4_guest_host_mask;
-               natural_width cr0_read_shadow;
-               natural_width cr4_read_shadow;
-               natural_width cr3_target_value0;
-               natural_width cr3_target_value1;
-               natural_width cr3_target_value2;
-               natural_width cr3_target_value3;
-               natural_width exit_qualification;
-               natural_width guest_linear_address;
-               natural_width guest_cr0;
-               natural_width guest_cr3;
-               natural_width guest_cr4;
-               natural_width guest_es_base;
-               natural_width guest_cs_base;
-               natural_width guest_ss_base;
-               natural_width guest_ds_base;
-               natural_width guest_fs_base;
-               natural_width guest_gs_base;
-               natural_width guest_ldtr_base;
-               natural_width guest_tr_base;
-               natural_width guest_gdtr_base;
-               natural_width guest_idtr_base;
-               natural_width guest_dr7;
-               natural_width guest_rsp;
-               natural_width guest_rip;
-               natural_width guest_rflags;
-               natural_width guest_pending_dbg_exceptions;
-               natural_width guest_sysenter_esp;
-               natural_width guest_sysenter_eip;
-               natural_width host_cr0;
-               natural_width host_cr3;
-               natural_width host_cr4;
-               natural_width host_fs_base;
-               natural_width host_gs_base;
-               natural_width host_tr_base;
-               natural_width host_gdtr_base;
-               natural_width host_idtr_base;
-               natural_width host_ia32_sysenter_esp;
-               natural_width host_ia32_sysenter_eip;
-               natural_width host_rsp;
-               natural_width host_rip;
-               natural_width paddingl[8]; /* room for future expansion */
-               u32 pin_based_vm_exec_control;
-               u32 cpu_based_vm_exec_control;
-               u32 exception_bitmap;
-               u32 page_fault_error_code_mask;
-               u32 page_fault_error_code_match;
-               u32 cr3_target_count;
-               u32 vm_exit_controls;
-               u32 vm_exit_msr_store_count;
-               u32 vm_exit_msr_load_count;
-               u32 vm_entry_controls;
-               u32 vm_entry_msr_load_count;
-               u32 vm_entry_intr_info_field;
-               u32 vm_entry_exception_error_code;
-               u32 vm_entry_instruction_len;
-               u32 tpr_threshold;
-               u32 secondary_vm_exec_control;
-               u32 vm_instruction_error;
-               u32 vm_exit_reason;
-               u32 vm_exit_intr_info;
-               u32 vm_exit_intr_error_code;
-               u32 idt_vectoring_info_field;
-               u32 idt_vectoring_error_code;
-               u32 vm_exit_instruction_len;
-               u32 vmx_instruction_info;
-               u32 guest_es_limit;
-               u32 guest_cs_limit;
-               u32 guest_ss_limit;
-               u32 guest_ds_limit;
-               u32 guest_fs_limit;
-               u32 guest_gs_limit;
-               u32 guest_ldtr_limit;
-               u32 guest_tr_limit;
-               u32 guest_gdtr_limit;
-               u32 guest_idtr_limit;
-               u32 guest_es_ar_bytes;
-               u32 guest_cs_ar_bytes;
-               u32 guest_ss_ar_bytes;
-               u32 guest_ds_ar_bytes;
-               u32 guest_fs_ar_bytes;
-               u32 guest_gs_ar_bytes;
-               u32 guest_ldtr_ar_bytes;
-               u32 guest_tr_ar_bytes;
-               u32 guest_interruptibility_info;
-               u32 guest_activity_state;
-               u32 guest_sysenter_cs;
-               u32 host_ia32_sysenter_cs;
-               u32 padding32[8]; /* room for future expansion */
-               u16 virtual_processor_id;
-               u16 guest_es_selector;
-               u16 guest_cs_selector;
-               u16 guest_ss_selector;
-               u16 guest_ds_selector;
-               u16 guest_fs_selector;
-               u16 guest_gs_selector;
-               u16 guest_ldtr_selector;
-               u16 guest_tr_selector;
-               u16 host_es_selector;
-               u16 host_cs_selector;
-               u16 host_ss_selector;
-               u16 host_ds_selector;
-               u16 host_fs_selector;
-               u16 host_gs_selector;
-               u16 host_tr_selector;
-       };
-
-
-Authors
--------
-
-These patches were written by:
-     Abel Gordon, abelg <at> il.ibm.com
-     Nadav Har'El, nyh <at> il.ibm.com
-     Orit Wasserman, oritw <at> il.ibm.com
-     Ben-Ami Yassor, benami <at> il.ibm.com
-     Muli Ben-Yehuda, muli <at> il.ibm.com
-
-With contributions by:
-     Anthony Liguori, aliguori <at> us.ibm.com
-     Mike Day, mdday <at> us.ibm.com
-     Michael Factor, factor <at> il.ibm.com
-     Zvi Dubitzky, dubi <at> il.ibm.com
-
-And valuable reviews by:
-     Avi Kivity, avi <at> redhat.com
-     Gleb Natapov, gleb <at> redhat.com
-     Marcelo Tosatti, mtosatti <at> redhat.com
-     Kevin Tian, kevin.tian <at> intel.com
-     and others.
diff --git a/Documentation/virt/kvm/ppc-pv.rst b/Documentation/virt/kvm/ppc-pv.rst

new file mode 100644 (file)

index 0000000..5fdb907
--- /dev/null
+++ b/Documentation/virt/kvm/ppc-pv.rst
@@ -0,0 +1,222 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=================================
+The PPC KVM paravirtual interface
+=================================
+
+The basic execution principle by which KVM on PowerPC works is to run all kernel
+space code in PR=1 which is user space. This way we trap all privileged
+instructions and can emulate them accordingly.
+
+Unfortunately that is also the downfall. There are quite some privileged
+instructions that needlessly return us to the hypervisor even though they
+could be handled differently.
+
+This is what the PPC PV interface helps with. It takes privileged instructions
+and transforms them into unprivileged ones with some help from the hypervisor.
+This cuts down virtualization costs by about 50% on some of my benchmarks.
+
+The code for that interface can be found in arch/powerpc/kernel/kvm*
+
+Querying for existence
+======================
+
+To find out if we're running on KVM or not, we leverage the device tree. When
+Linux is running on KVM, a node /hypervisor exists. That node contains a
+compatible property with the value "linux,kvm".
+
+Once you determined you're running under a PV capable KVM, you can now use
+hypercalls as described below.
+
+KVM hypercalls
+==============
+
+Inside the device tree's /hypervisor node there's a property called
+'hypercall-instructions'. This property contains at most 4 opcodes that make
+up the hypercall. To call a hypercall, just call these instructions.
+
+The parameters are as follows:
+
+        ========       ================        ================
+       Register        IN                      OUT
+        ========       ================        ================
+       r0              -                       volatile
+       r3              1st parameter           Return code
+       r4              2nd parameter           1st output value
+       r5              3rd parameter           2nd output value
+       r6              4th parameter           3rd output value
+       r7              5th parameter           4th output value
+       r8              6th parameter           5th output value
+       r9              7th parameter           6th output value
+       r10             8th parameter           7th output value
+       r11             hypercall number        8th output value
+       r12             -                       volatile
+        ========       ================        ================
+
+Hypercall definitions are shared in generic code, so the same hypercall numbers
+apply for x86 and powerpc alike with the exception that each KVM hypercall
+also needs to be ORed with the KVM vendor code which is (42 << 16).
+
+Return codes can be as follows:
+
+       ====            =========================
+       Code            Meaning
+       ====            =========================
+       0               Success
+       12              Hypercall not implemented
+       <0              Error
+       ====            =========================
+
+The magic page
+==============
+
+To enable communication between the hypervisor and guest there is a new shared
+page that contains parts of supervisor visible register state. The guest can
+map this shared page using the KVM hypercall KVM_HC_PPC_MAP_MAGIC_PAGE.
+
+With this hypercall issued the guest always gets the magic page mapped at the
+desired location. The first parameter indicates the effective address when the
+MMU is enabled. The second parameter indicates the address in real mode, if
+applicable to the target. For now, we always map the page to -4096. This way we
+can access it using absolute load and store functions. The following
+instruction reads the first field of the magic page::
+
+       ld      rX, -4096(0)
+
+The interface is designed to be extensible should there be need later to add
+additional registers to the magic page. If you add fields to the magic page,
+also define a new hypercall feature to indicate that the host can give you more
+registers. Only if the host supports the additional features, make use of them.
+
+The magic page layout is described by struct kvm_vcpu_arch_shared
+in arch/powerpc/include/asm/kvm_para.h.
+
+Magic page features
+===================
+
+When mapping the magic page using the KVM hypercall KVM_HC_PPC_MAP_MAGIC_PAGE,
+a second return value is passed to the guest. This second return value contains
+a bitmap of available features inside the magic page.
+
+The following enhancements to the magic page are currently available:
+
+  ============================  =======================================
+  KVM_MAGIC_FEAT_SR            Maps SR registers r/w in the magic page
+  KVM_MAGIC_FEAT_MAS0_TO_SPRG7 Maps MASn, ESR, PIR and high SPRGs
+  ============================  =======================================
+
+For enhanced features in the magic page, please check for the existence of the
+feature before using them!
+
+Magic page flags
+================
+
+In addition to features that indicate whether a host is capable of a particular
+feature we also have a channel for a guest to tell the guest whether it's capable
+of something. This is what we call "flags".
+
+Flags are passed to the host in the low 12 bits of the Effective Address.
+
+The following flags are currently available for a guest to expose:
+
+  MAGIC_PAGE_FLAG_NOT_MAPPED_NX Guest handles NX bits correctly wrt magic page
+
+MSR bits
+========
+
+The MSR contains bits that require hypervisor intervention and bits that do
+not require direct hypervisor intervention because they only get interpreted
+when entering the guest or don't have any impact on the hypervisor's behavior.
+
+The following bits are safe to be set inside the guest:
+
+  - MSR_EE
+  - MSR_RI
+
+If any other bit changes in the MSR, please still use mtmsr(d).
+
+Patched instructions
+====================
+
+The "ld" and "std" instructions are transformed to "lwz" and "stw" instructions
+respectively on 32 bit systems with an added offset of 4 to accommodate for big
+endianness.
+
+The following is a list of mapping the Linux kernel performs when running as
+guest. Implementing any of those mappings is optional, as the instruction traps
+also act on the shared page. So calling privileged instructions still works as
+before.
+
+======================= ================================
+From                   To
+======================= ================================
+mfmsr  rX              ld      rX, magic_page->msr
+mfsprg rX, 0           ld      rX, magic_page->sprg0
+mfsprg rX, 1           ld      rX, magic_page->sprg1
+mfsprg rX, 2           ld      rX, magic_page->sprg2
+mfsprg rX, 3           ld      rX, magic_page->sprg3
+mfsrr0 rX              ld      rX, magic_page->srr0
+mfsrr1 rX              ld      rX, magic_page->srr1
+mfdar  rX              ld      rX, magic_page->dar
+mfdsisr        rX              lwz     rX, magic_page->dsisr
+
+mtmsr  rX              std     rX, magic_page->msr
+mtsprg 0, rX           std     rX, magic_page->sprg0
+mtsprg 1, rX           std     rX, magic_page->sprg1
+mtsprg 2, rX           std     rX, magic_page->sprg2
+mtsprg 3, rX           std     rX, magic_page->sprg3
+mtsrr0 rX              std     rX, magic_page->srr0
+mtsrr1 rX              std     rX, magic_page->srr1
+mtdar  rX              std     rX, magic_page->dar
+mtdsisr        rX              stw     rX, magic_page->dsisr
+
+tlbsync                        nop
+
+mtmsrd rX, 0           b       <special mtmsr section>
+mtmsr  rX              b       <special mtmsr section>
+
+mtmsrd rX, 1           b       <special mtmsrd section>
+
+[Book3S only]
+mtsrin rX, rY          b       <special mtsrin section>
+
+[BookE only]
+wrteei [0|1]           b       <special wrteei section>
+======================= ================================
+
+Some instructions require more logic to determine what's going on than a load
+or store instruction can deliver. To enable patching of those, we keep some
+RAM around where we can live translate instructions to. What happens is the
+following:
+
+       1) copy emulation code to memory
+       2) patch that code to fit the emulated instruction
+       3) patch that code to return to the original pc + 4
+       4) patch the original instruction to branch to the new code
+
+That way we can inject an arbitrary amount of code as replacement for a single
+instruction. This allows us to check for pending interrupts when setting EE=1
+for example.
+
+Hypercall ABIs in KVM on PowerPC
+=================================
+
+1) KVM hypercalls (ePAPR)
+
+These are ePAPR compliant hypercall implementation (mentioned above). Even
+generic hypercalls are implemented here, like the ePAPR idle hcall. These are
+available on all targets.
+
+2) PAPR hypercalls
+
+PAPR hypercalls are needed to run server PowerPC PAPR guests (-M pseries in QEMU).
+These are the same hypercalls that pHyp, the POWER hypervisor implements. Some of
+them are handled in the kernel, some are handled in user space. This is only
+available on book3s_64.
+
+3) OSI hypercalls
+
+Mac-on-Linux is another user of KVM on PowerPC, which has its own hypercall (long
+before KVM). This is supported to maintain compatibility. All these hypercalls get
+forwarded to user space. This is only useful on book3s_32, but can be used with
+book3s_64 as well.
diff --git a/Documentation/virt/kvm/ppc-pv.txt b/Documentation/virt/kvm/ppc-pv.txt

deleted file mode 100644 (file)

index e26115c..0000000
--- a/Documentation/virt/kvm/ppc-pv.txt
+++ /dev/null
@@ -1,212 +0,0 @@
-The PPC KVM paravirtual interface
-=================================
-
-The basic execution principle by which KVM on PowerPC works is to run all kernel
-space code in PR=1 which is user space. This way we trap all privileged
-instructions and can emulate them accordingly.
-
-Unfortunately that is also the downfall. There are quite some privileged
-instructions that needlessly return us to the hypervisor even though they
-could be handled differently.
-
-This is what the PPC PV interface helps with. It takes privileged instructions
-and transforms them into unprivileged ones with some help from the hypervisor.
-This cuts down virtualization costs by about 50% on some of my benchmarks.
-
-The code for that interface can be found in arch/powerpc/kernel/kvm*
-
-Querying for existence
-======================
-
-To find out if we're running on KVM or not, we leverage the device tree. When
-Linux is running on KVM, a node /hypervisor exists. That node contains a
-compatible property with the value "linux,kvm".
-
-Once you determined you're running under a PV capable KVM, you can now use
-hypercalls as described below.
-
-KVM hypercalls
-==============
-
-Inside the device tree's /hypervisor node there's a property called
-'hypercall-instructions'. This property contains at most 4 opcodes that make
-up the hypercall. To call a hypercall, just call these instructions.
-
-The parameters are as follows:
-
-       Register        IN                      OUT
-
-       r0              -                       volatile
-       r3              1st parameter           Return code
-       r4              2nd parameter           1st output value
-       r5              3rd parameter           2nd output value
-       r6              4th parameter           3rd output value
-       r7              5th parameter           4th output value
-       r8              6th parameter           5th output value
-       r9              7th parameter           6th output value
-       r10             8th parameter           7th output value
-       r11             hypercall number        8th output value
-       r12             -                       volatile
-
-Hypercall definitions are shared in generic code, so the same hypercall numbers
-apply for x86 and powerpc alike with the exception that each KVM hypercall
-also needs to be ORed with the KVM vendor code which is (42 << 16).
-
-Return codes can be as follows:
-
-       Code            Meaning
-
-       0               Success
-       12              Hypercall not implemented
-       <0              Error
-
-The magic page
-==============
-
-To enable communication between the hypervisor and guest there is a new shared
-page that contains parts of supervisor visible register state. The guest can
-map this shared page using the KVM hypercall KVM_HC_PPC_MAP_MAGIC_PAGE.
-
-With this hypercall issued the guest always gets the magic page mapped at the
-desired location. The first parameter indicates the effective address when the
-MMU is enabled. The second parameter indicates the address in real mode, if
-applicable to the target. For now, we always map the page to -4096. This way we
-can access it using absolute load and store functions. The following
-instruction reads the first field of the magic page:
-
-       ld      rX, -4096(0)
-
-The interface is designed to be extensible should there be need later to add
-additional registers to the magic page. If you add fields to the magic page,
-also define a new hypercall feature to indicate that the host can give you more
-registers. Only if the host supports the additional features, make use of them.
-
-The magic page layout is described by struct kvm_vcpu_arch_shared
-in arch/powerpc/include/asm/kvm_para.h.
-
-Magic page features
-===================
-
-When mapping the magic page using the KVM hypercall KVM_HC_PPC_MAP_MAGIC_PAGE,
-a second return value is passed to the guest. This second return value contains
-a bitmap of available features inside the magic page.
-
-The following enhancements to the magic page are currently available:
-
-  KVM_MAGIC_FEAT_SR            Maps SR registers r/w in the magic page
-  KVM_MAGIC_FEAT_MAS0_TO_SPRG7 Maps MASn, ESR, PIR and high SPRGs
-
-For enhanced features in the magic page, please check for the existence of the
-feature before using them!
-
-Magic page flags
-================
-
-In addition to features that indicate whether a host is capable of a particular
-feature we also have a channel for a guest to tell the guest whether it's capable
-of something. This is what we call "flags".
-
-Flags are passed to the host in the low 12 bits of the Effective Address.
-
-The following flags are currently available for a guest to expose:
-
-  MAGIC_PAGE_FLAG_NOT_MAPPED_NX Guest handles NX bits correctly wrt magic page
-
-MSR bits
-========
-
-The MSR contains bits that require hypervisor intervention and bits that do
-not require direct hypervisor intervention because they only get interpreted
-when entering the guest or don't have any impact on the hypervisor's behavior.
-
-The following bits are safe to be set inside the guest:
-
-  MSR_EE
-  MSR_RI
-
-If any other bit changes in the MSR, please still use mtmsr(d).
-
-Patched instructions
-====================
-
-The "ld" and "std" instructions are transformed to "lwz" and "stw" instructions
-respectively on 32 bit systems with an added offset of 4 to accommodate for big
-endianness.
-
-The following is a list of mapping the Linux kernel performs when running as
-guest. Implementing any of those mappings is optional, as the instruction traps
-also act on the shared page. So calling privileged instructions still works as
-before.
-
-From                   To
-====                   ==
-
-mfmsr  rX              ld      rX, magic_page->msr
-mfsprg rX, 0           ld      rX, magic_page->sprg0
-mfsprg rX, 1           ld      rX, magic_page->sprg1
-mfsprg rX, 2           ld      rX, magic_page->sprg2
-mfsprg rX, 3           ld      rX, magic_page->sprg3
-mfsrr0 rX              ld      rX, magic_page->srr0
-mfsrr1 rX              ld      rX, magic_page->srr1
-mfdar  rX              ld      rX, magic_page->dar
-mfdsisr        rX              lwz     rX, magic_page->dsisr
-
-mtmsr  rX              std     rX, magic_page->msr
-mtsprg 0, rX           std     rX, magic_page->sprg0
-mtsprg 1, rX           std     rX, magic_page->sprg1
-mtsprg 2, rX           std     rX, magic_page->sprg2
-mtsprg 3, rX           std     rX, magic_page->sprg3
-mtsrr0 rX              std     rX, magic_page->srr0
-mtsrr1 rX              std     rX, magic_page->srr1
-mtdar  rX              std     rX, magic_page->dar
-mtdsisr        rX              stw     rX, magic_page->dsisr
-
-tlbsync                        nop
-
-mtmsrd rX, 0           b       <special mtmsr section>
-mtmsr  rX              b       <special mtmsr section>
-
-mtmsrd rX, 1           b       <special mtmsrd section>
-
-[Book3S only]
-mtsrin rX, rY          b       <special mtsrin section>
-
-[BookE only]
-wrteei [0|1]           b       <special wrteei section>
-
-
-Some instructions require more logic to determine what's going on than a load
-or store instruction can deliver. To enable patching of those, we keep some
-RAM around where we can live translate instructions to. What happens is the
-following:
-
-       1) copy emulation code to memory
-       2) patch that code to fit the emulated instruction
-       3) patch that code to return to the original pc + 4
-       4) patch the original instruction to branch to the new code
-
-That way we can inject an arbitrary amount of code as replacement for a single
-instruction. This allows us to check for pending interrupts when setting EE=1
-for example.
-
-Hypercall ABIs in KVM on PowerPC
-=================================
-1) KVM hypercalls (ePAPR)
-
-These are ePAPR compliant hypercall implementation (mentioned above). Even
-generic hypercalls are implemented here, like the ePAPR idle hcall. These are
-available on all targets.
-
-2) PAPR hypercalls
-
-PAPR hypercalls are needed to run server PowerPC PAPR guests (-M pseries in QEMU).
-These are the same hypercalls that pHyp, the POWER hypervisor implements. Some of
-them are handled in the kernel, some are handled in user space. This is only
-available on book3s_64.
-
-3) OSI hypercalls
-
-Mac-on-Linux is another user of KVM on PowerPC, which has its own hypercall (long
-before KVM). This is supported to maintain compatibility. All these hypercalls get
-forwarded to user space. This is only useful on book3s_32, but can be used with
-book3s_64 as well.
diff --git a/Documentation/virt/kvm/review-checklist.rst b/Documentation/virt/kvm/review-checklist.rst

new file mode 100644 (file)

index 0000000..1f86a9d
--- /dev/null
+++ b/Documentation/virt/kvm/review-checklist.rst
@@ -0,0 +1,41 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+================================
+Review checklist for kvm patches
+================================
+
+1.  The patch must follow Documentation/process/coding-style.rst and
+    Documentation/process/submitting-patches.rst.
+
+2.  Patches should be against kvm.git master branch.
+
+3.  If the patch introduces or modifies a new userspace API:
+    - the API must be documented in Documentation/virt/kvm/api.txt
+    - the API must be discoverable using KVM_CHECK_EXTENSION
+
+4.  New state must include support for save/restore.
+
+5.  New features must default to off (userspace should explicitly request them).
+    Performance improvements can and should default to on.
+
+6.  New cpu features should be exposed via KVM_GET_SUPPORTED_CPUID2
+
+7.  Emulator changes should be accompanied by unit tests for qemu-kvm.git
+    kvm/test directory.
+
+8.  Changes should be vendor neutral when possible.  Changes to common code
+    are better than duplicating changes to vendor code.
+
+9.  Similarly, prefer changes to arch independent code than to arch dependent
+    code.
+
+10. User/kernel interfaces and guest/host interfaces must be 64-bit clean
+    (all variables and sizes naturally aligned on 64-bit; use specific types
+    only - u64 rather than ulong).
+
+11. New guest visible features must either be documented in a hardware manual
+    or be accompanied by documentation.
+
+12. Features must be robust against reset and kexec - for example, shared
+    host/guest memory must be unshared to prevent the host from writing to
+    guest memory that the guest has not reserved for this purpose.
diff --git a/Documentation/virt/kvm/review-checklist.txt b/Documentation/virt/kvm/review-checklist.txt

deleted file mode 100644 (file)

index 499af49..0000000
--- a/Documentation/virt/kvm/review-checklist.txt
+++ /dev/null
@@ -1,38 +0,0 @@
-Review checklist for kvm patches
-================================
-
-1.  The patch must follow Documentation/process/coding-style.rst and
-    Documentation/process/submitting-patches.rst.
-
-2.  Patches should be against kvm.git master branch.
-
-3.  If the patch introduces or modifies a new userspace API:
-    - the API must be documented in Documentation/virt/kvm/api.txt
-    - the API must be discoverable using KVM_CHECK_EXTENSION
-
-4.  New state must include support for save/restore.
-
-5.  New features must default to off (userspace should explicitly request them).
-    Performance improvements can and should default to on.
-
-6.  New cpu features should be exposed via KVM_GET_SUPPORTED_CPUID2
-
-7.  Emulator changes should be accompanied by unit tests for qemu-kvm.git
-    kvm/test directory.
-
-8.  Changes should be vendor neutral when possible.  Changes to common code
-    are better than duplicating changes to vendor code.
-
-9.  Similarly, prefer changes to arch independent code than to arch dependent
-    code.
-
-10. User/kernel interfaces and guest/host interfaces must be 64-bit clean
-    (all variables and sizes naturally aligned on 64-bit; use specific types
-    only - u64 rather than ulong).
-
-11. New guest visible features must either be documented in a hardware manual
-    or be accompanied by documentation.
-
-12. Features must be robust against reset and kexec - for example, shared
-    host/guest memory must be unshared to prevent the host from writing to
-    guest memory that the guest has not reserved for this purpose.
diff --git a/Documentation/virt/kvm/s390-diag.rst b/Documentation/virt/kvm/s390-diag.rst

new file mode 100644 (file)

index 0000000..eaac486
--- /dev/null
+++ b/Documentation/virt/kvm/s390-diag.rst
@@ -0,0 +1,86 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=============================
+The s390 DIAGNOSE call on KVM
+=============================
+
+KVM on s390 supports the DIAGNOSE call for making hypercalls, both for
+native hypercalls and for selected hypercalls found on other s390
+hypervisors.
+
+Note that bits are numbered as by the usual s390 convention (most significant
+bit on the left).
+
+
+General remarks
+---------------
+
+DIAGNOSE calls by the guest cause a mandatory intercept. This implies
+all supported DIAGNOSE calls need to be handled by either KVM or its
+userspace.
+
+All DIAGNOSE calls supported by KVM use the RS-a format::
+
+  --------------------------------------
+  |  '83'  | R1 | R3 | B2 |     D2     |
+  --------------------------------------
+  0        8    12   16   20           31
+
+The second-operand address (obtained by the base/displacement calculation)
+is not used to address data. Instead, bits 48-63 of this address specify
+the function code, and bits 0-47 are ignored.
+
+The supported DIAGNOSE function codes vary by the userspace used. For
+DIAGNOSE function codes not specific to KVM, please refer to the
+documentation for the s390 hypervisors defining them.
+
+
+DIAGNOSE function code 'X'500' - KVM virtio functions
+-----------------------------------------------------
+
+If the function code specifies 0x500, various virtio-related functions
+are performed.
+
+General register 1 contains the virtio subfunction code. Supported
+virtio subfunctions depend on KVM's userspace. Generally, userspace
+provides either s390-virtio (subcodes 0-2) or virtio-ccw (subcode 3).
+
+Upon completion of the DIAGNOSE instruction, general register 2 contains
+the function's return code, which is either a return code or a subcode
+specific value.
+
+Subcode 0 - s390-virtio notification and early console printk
+    Handled by userspace.
+
+Subcode 1 - s390-virtio reset
+    Handled by userspace.
+
+Subcode 2 - s390-virtio set status
+    Handled by userspace.
+
+Subcode 3 - virtio-ccw notification
+    Handled by either userspace or KVM (ioeventfd case).
+
+    General register 2 contains a subchannel-identification word denoting
+    the subchannel of the virtio-ccw proxy device to be notified.
+
+    General register 3 contains the number of the virtqueue to be notified.
+
+    General register 4 contains a 64bit identifier for KVM usage (the
+    kvm_io_bus cookie). If general register 4 does not contain a valid
+    identifier, it is ignored.
+
+    After completion of the DIAGNOSE call, general register 2 may contain
+    a 64bit identifier (in the kvm_io_bus cookie case), or a negative
+    error value, if an internal error occurred.
+
+    See also the virtio standard for a discussion of this hypercall.
+
+
+DIAGNOSE function code 'X'501 - KVM breakpoint
+----------------------------------------------
+
+If the function code specifies 0x501, breakpoint functions may be performed.
+This function code is handled by userspace.
+
+This diagnose function code has no subfunctions and uses no parameters.
diff --git a/Documentation/virt/kvm/s390-diag.txt b/Documentation/virt/kvm/s390-diag.txt

deleted file mode 100644 (file)

index 7c52e5f..0000000
--- a/Documentation/virt/kvm/s390-diag.txt
+++ /dev/null
@@ -1,83 +0,0 @@
-The s390 DIAGNOSE call on KVM
-=============================
-
-KVM on s390 supports the DIAGNOSE call for making hypercalls, both for
-native hypercalls and for selected hypercalls found on other s390
-hypervisors.
-
-Note that bits are numbered as by the usual s390 convention (most significant
-bit on the left).
-
-
-General remarks
----------------
-
-DIAGNOSE calls by the guest cause a mandatory intercept. This implies
-all supported DIAGNOSE calls need to be handled by either KVM or its
-userspace.
-
-All DIAGNOSE calls supported by KVM use the RS-a format:
-
---------------------------------------
-|  '83'  | R1 | R3 | B2 |     D2     |
---------------------------------------
-0        8    12   16   20           31
-
-The second-operand address (obtained by the base/displacement calculation)
-is not used to address data. Instead, bits 48-63 of this address specify
-the function code, and bits 0-47 are ignored.
-
-The supported DIAGNOSE function codes vary by the userspace used. For
-DIAGNOSE function codes not specific to KVM, please refer to the
-documentation for the s390 hypervisors defining them.
-
-
-DIAGNOSE function code 'X'500' - KVM virtio functions
------------------------------------------------------
-
-If the function code specifies 0x500, various virtio-related functions
-are performed.
-
-General register 1 contains the virtio subfunction code. Supported
-virtio subfunctions depend on KVM's userspace. Generally, userspace
-provides either s390-virtio (subcodes 0-2) or virtio-ccw (subcode 3).
-
-Upon completion of the DIAGNOSE instruction, general register 2 contains
-the function's return code, which is either a return code or a subcode
-specific value.
-
-Subcode 0 - s390-virtio notification and early console printk
-    Handled by userspace.
-
-Subcode 1 - s390-virtio reset
-    Handled by userspace.
-
-Subcode 2 - s390-virtio set status
-    Handled by userspace.
-
-Subcode 3 - virtio-ccw notification
-    Handled by either userspace or KVM (ioeventfd case).
-
-    General register 2 contains a subchannel-identification word denoting
-    the subchannel of the virtio-ccw proxy device to be notified.
-
-    General register 3 contains the number of the virtqueue to be notified.
-
-    General register 4 contains a 64bit identifier for KVM usage (the
-    kvm_io_bus cookie). If general register 4 does not contain a valid
-    identifier, it is ignored.
-
-    After completion of the DIAGNOSE call, general register 2 may contain
-    a 64bit identifier (in the kvm_io_bus cookie case), or a negative
-    error value, if an internal error occurred.
-
-    See also the virtio standard for a discussion of this hypercall.
-
-
-DIAGNOSE function code 'X'501 - KVM breakpoint
-----------------------------------------------
-
-If the function code specifies 0x501, breakpoint functions may be performed.
-This function code is handled by userspace.
-
-This diagnose function code has no subfunctions and uses no parameters.
diff --git a/Documentation/virt/kvm/timekeeping.rst b/Documentation/virt/kvm/timekeeping.rst

new file mode 100644 (file)

index 0000000..21ae7ef
--- /dev/null
+++ b/Documentation/virt/kvm/timekeeping.rst
@@ -0,0 +1,645 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+======================================================
+Timekeeping Virtualization for X86-Based Architectures
+======================================================
+
+:Author: Zachary Amsden <zamsden@redhat.com>
+:Copyright: (c) 2010, Red Hat.  All rights reserved.
+
+.. Contents
+
+   1) Overview
+   2) Timing Devices
+   3) TSC Hardware
+   4) Virtualization Problems
+
+1. Overview
+===========
+
+One of the most complicated parts of the X86 platform, and specifically,
+the virtualization of this platform is the plethora of timing devices available
+and the complexity of emulating those devices.  In addition, virtualization of
+time introduces a new set of challenges because it introduces a multiplexed
+division of time beyond the control of the guest CPU.
+
+First, we will describe the various timekeeping hardware available, then
+present some of the problems which arise and solutions available, giving
+specific recommendations for certain classes of KVM guests.
+
+The purpose of this document is to collect data and information relevant to
+timekeeping which may be difficult to find elsewhere, specifically,
+information relevant to KVM and hardware-based virtualization.
+
+2. Timing Devices
+=================
+
+First we discuss the basic hardware devices available.  TSC and the related
+KVM clock are special enough to warrant a full exposition and are described in
+the following section.
+
+2.1. i8254 - PIT
+----------------
+
+One of the first timer devices available is the programmable interrupt timer,
+or PIT.  The PIT has a fixed frequency 1.193182 MHz base clock and three
+channels which can be programmed to deliver periodic or one-shot interrupts.
+These three channels can be configured in different modes and have individual
+counters.  Channel 1 and 2 were not available for general use in the original
+IBM PC, and historically were connected to control RAM refresh and the PC
+speaker.  Now the PIT is typically integrated as part of an emulated chipset
+and a separate physical PIT is not used.
+
+The PIT uses I/O ports 0x40 - 0x43.  Access to the 16-bit counters is done
+using single or multiple byte access to the I/O ports.  There are 6 modes
+available, but not all modes are available to all timers, as only timer 2
+has a connected gate input, required for modes 1 and 5.  The gate line is
+controlled by port 61h, bit 0, as illustrated in the following diagram::
+
+  --------------             ----------------
+  |            |           |                |
+  |  1.1932 MHz|---------->| CLOCK      OUT | ---------> IRQ 0
+  |    Clock   |   |       |                |
+  --------------   |    +->| GATE  TIMER 0  |
+                   |        ----------------
+                   |
+                   |        ----------------
+                   |       |                |
+                   |------>| CLOCK      OUT | ---------> 66.3 KHZ DRAM
+                   |       |                |            (aka /dev/null)
+                   |    +->| GATE  TIMER 1  |
+                   |        ----------------
+                   |
+                   |        ----------------
+                   |       |                |
+                   |------>| CLOCK      OUT | ---------> Port 61h, bit 5
+                           |                |      |
+  Port 61h, bit 0 -------->| GATE  TIMER 2  |       \_.----   ____
+                            ----------------         _|    )--|LPF|---Speaker
+                                                    / *----   \___/
+  Port 61h, bit 1 ---------------------------------/
+
+The timer modes are now described.
+
+Mode 0: Single Timeout.
+ This is a one-shot software timeout that counts down
+ when the gate is high (always true for timers 0 and 1).  When the count
+ reaches zero, the output goes high.
+
+Mode 1: Triggered One-shot.
+ The output is initially set high.  When the gate
+ line is set high, a countdown is initiated (which does not stop if the gate is
+ lowered), during which the output is set low.  When the count reaches zero,
+ the output goes high.
+
+Mode 2: Rate Generator.
+ The output is initially set high.  When the countdown
+ reaches 1, the output goes low for one count and then returns high.  The value
+ is reloaded and the countdown automatically resumes.  If the gate line goes
+ low, the count is halted.  If the output is low when the gate is lowered, the
+ output automatically goes high (this only affects timer 2).
+
+Mode 3: Square Wave.
+ This generates a high / low square wave.  The count
+ determines the length of the pulse, which alternates between high and low
+ when zero is reached.  The count only proceeds when gate is high and is
+ automatically reloaded on reaching zero.  The count is decremented twice at
+ each clock to generate a full high / low cycle at the full periodic rate.
+ If the count is even, the clock remains high for N/2 counts and low for N/2
+ counts; if the clock is odd, the clock is high for (N+1)/2 counts and low
+ for (N-1)/2 counts.  Only even values are latched by the counter, so odd
+ values are not observed when reading.  This is the intended mode for timer 2,
+ which generates sine-like tones by low-pass filtering the square wave output.
+
+Mode 4: Software Strobe.
+ After programming this mode and loading the counter,
+ the output remains high until the counter reaches zero.  Then the output
+ goes low for 1 clock cycle and returns high.  The counter is not reloaded.
+ Counting only occurs when gate is high.
+
+Mode 5: Hardware Strobe.
+ After programming and loading the counter, the
+ output remains high.  When the gate is raised, a countdown is initiated
+ (which does not stop if the gate is lowered).  When the counter reaches zero,
+ the output goes low for 1 clock cycle and then returns high.  The counter is
+ not reloaded.
+
+In addition to normal binary counting, the PIT supports BCD counting.  The
+command port, 0x43 is used to set the counter and mode for each of the three
+timers.
+
+PIT commands, issued to port 0x43, using the following bit encoding::
+
+  Bit 7-4: Command (See table below)
+  Bit 3-1: Mode (000 = Mode 0, 101 = Mode 5, 11X = undefined)
+  Bit 0  : Binary (0) / BCD (1)
+
+Command table::
+
+  0000 - Latch Timer 0 count for port 0x40
+       sample and hold the count to be read in port 0x40;
+       additional commands ignored until counter is read;
+       mode bits ignored.
+
+  0001 - Set Timer 0 LSB mode for port 0x40
+       set timer to read LSB only and force MSB to zero;
+       mode bits set timer mode
+
+  0010 - Set Timer 0 MSB mode for port 0x40
+       set timer to read MSB only and force LSB to zero;
+       mode bits set timer mode
+
+  0011 - Set Timer 0 16-bit mode for port 0x40
+       set timer to read / write LSB first, then MSB;
+       mode bits set timer mode
+
+  0100 - Latch Timer 1 count for port 0x41 - as described above
+  0101 - Set Timer 1 LSB mode for port 0x41 - as described above
+  0110 - Set Timer 1 MSB mode for port 0x41 - as described above
+  0111 - Set Timer 1 16-bit mode for port 0x41 - as described above
+
+  1000 - Latch Timer 2 count for port 0x42 - as described above
+  1001 - Set Timer 2 LSB mode for port 0x42 - as described above
+  1010 - Set Timer 2 MSB mode for port 0x42 - as described above
+  1011 - Set Timer 2 16-bit mode for port 0x42 as described above
+
+  1101 - General counter latch
+       Latch combination of counters into corresponding ports
+       Bit 3 = Counter 2
+       Bit 2 = Counter 1
+       Bit 1 = Counter 0
+       Bit 0 = Unused
+
+  1110 - Latch timer status
+       Latch combination of counter mode into corresponding ports
+       Bit 3 = Counter 2
+       Bit 2 = Counter 1
+       Bit 1 = Counter 0
+
+       The output of ports 0x40-0x42 following this command will be:
+
+       Bit 7 = Output pin
+       Bit 6 = Count loaded (0 if timer has expired)
+       Bit 5-4 = Read / Write mode
+           01 = MSB only
+           10 = LSB only
+           11 = LSB / MSB (16-bit)
+       Bit 3-1 = Mode
+       Bit 0 = Binary (0) / BCD mode (1)
+
+2.2. RTC
+--------
+
+The second device which was available in the original PC was the MC146818 real
+time clock.  The original device is now obsolete, and usually emulated by the
+system chipset, sometimes by an HPET and some frankenstein IRQ routing.
+
+The RTC is accessed through CMOS variables, which uses an index register to
+control which bytes are read.  Since there is only one index register, read
+of the CMOS and read of the RTC require lock protection (in addition, it is
+dangerous to allow userspace utilities such as hwclock to have direct RTC
+access, as they could corrupt kernel reads and writes of CMOS memory).
+
+The RTC generates an interrupt which is usually routed to IRQ 8.  The interrupt
+can function as a periodic timer, an additional once a day alarm, and can issue
+interrupts after an update of the CMOS registers by the MC146818 is complete.
+The type of interrupt is signalled in the RTC status registers.
+
+The RTC will update the current time fields by battery power even while the
+system is off.  The current time fields should not be read while an update is
+in progress, as indicated in the status register.
+
+The clock uses a 32.768kHz crystal, so bits 6-4 of register A should be
+programmed to a 32kHz divider if the RTC is to count seconds.
+
+This is the RAM map originally used for the RTC/CMOS::
+
+  Location    Size    Description
+  ------------------------------------------
+  00h         byte    Current second (BCD)
+  01h         byte    Seconds alarm (BCD)
+  02h         byte    Current minute (BCD)
+  03h         byte    Minutes alarm (BCD)
+  04h         byte    Current hour (BCD)
+  05h         byte    Hours alarm (BCD)
+  06h         byte    Current day of week (BCD)
+  07h         byte    Current day of month (BCD)
+  08h         byte    Current month (BCD)
+  09h         byte    Current year (BCD)
+  0Ah         byte    Register A
+                       bit 7   = Update in progress
+                       bit 6-4 = Divider for clock
+                                  000 = 4.194 MHz
+                                  001 = 1.049 MHz
+                                  010 = 32 kHz
+                                  10X = test modes
+                                  110 = reset / disable
+                                  111 = reset / disable
+                       bit 3-0 = Rate selection for periodic interrupt
+                                  000 = periodic timer disabled
+                                  001 = 3.90625 uS
+                                  010 = 7.8125 uS
+                                  011 = .122070 mS
+                                  100 = .244141 mS
+                                     ...
+                                 1101 = 125 mS
+                                 1110 = 250 mS
+                                 1111 = 500 mS
+  0Bh         byte    Register B
+                       bit 7   = Run (0) / Halt (1)
+                       bit 6   = Periodic interrupt enable
+                       bit 5   = Alarm interrupt enable
+                       bit 4   = Update-ended interrupt enable
+                       bit 3   = Square wave interrupt enable
+                       bit 2   = BCD calendar (0) / Binary (1)
+                       bit 1   = 12-hour mode (0) / 24-hour mode (1)
+                       bit 0   = 0 (DST off) / 1 (DST enabled)
+  OCh         byte    Register C (read only)
+                       bit 7   = interrupt request flag (IRQF)
+                       bit 6   = periodic interrupt flag (PF)
+                       bit 5   = alarm interrupt flag (AF)
+                       bit 4   = update interrupt flag (UF)
+                       bit 3-0 = reserved
+  ODh         byte    Register D (read only)
+                       bit 7   = RTC has power
+                       bit 6-0 = reserved
+  32h         byte    Current century BCD (*)
+  (*) location vendor specific and now determined from ACPI global tables
+
+2.3. APIC
+---------
+
+On Pentium and later processors, an on-board timer is available to each CPU
+as part of the Advanced Programmable Interrupt Controller.  The APIC is
+accessed through memory-mapped registers and provides interrupt service to each
+CPU, used for IPIs and local timer interrupts.
+
+Although in theory the APIC is a safe and stable source for local interrupts,
+in practice, many bugs and glitches have occurred due to the special nature of
+the APIC CPU-local memory-mapped hardware.  Beware that CPU errata may affect
+the use of the APIC and that workarounds may be required.  In addition, some of
+these workarounds pose unique constraints for virtualization - requiring either
+extra overhead incurred from extra reads of memory-mapped I/O or additional
+functionality that may be more computationally expensive to implement.
+
+Since the APIC is documented quite well in the Intel and AMD manuals, we will
+avoid repetition of the detail here.  It should be pointed out that the APIC
+timer is programmed through the LVT (local vector timer) register, is capable
+of one-shot or periodic operation, and is based on the bus clock divided down
+by the programmable divider register.
+
+2.4. HPET
+---------
+
+HPET is quite complex, and was originally intended to replace the PIT / RTC
+support of the X86 PC.  It remains to be seen whether that will be the case, as
+the de facto standard of PC hardware is to emulate these older devices.  Some
+systems designated as legacy free may support only the HPET as a hardware timer
+device.
+
+The HPET spec is rather loose and vague, requiring at least 3 hardware timers,
+but allowing implementation freedom to support many more.  It also imposes no
+fixed rate on the timer frequency, but does impose some extremal values on
+frequency, error and slew.
+
+In general, the HPET is recommended as a high precision (compared to PIT /RTC)
+time source which is independent of local variation (as there is only one HPET
+in any given system).  The HPET is also memory-mapped, and its presence is
+indicated through ACPI tables by the BIOS.
+
+Detailed specification of the HPET is beyond the current scope of this
+document, as it is also very well documented elsewhere.
+
+2.5. Offboard Timers
+--------------------
+
+Several cards, both proprietary (watchdog boards) and commonplace (e1000) have
+timing chips built into the cards which may have registers which are accessible
+to kernel or user drivers.  To the author's knowledge, using these to generate
+a clocksource for a Linux or other kernel has not yet been attempted and is in
+general frowned upon as not playing by the agreed rules of the game.  Such a
+timer device would require additional support to be virtualized properly and is
+not considered important at this time as no known operating system does this.
+
+3. TSC Hardware
+===============
+
+The TSC or time stamp counter is relatively simple in theory; it counts
+instruction cycles issued by the processor, which can be used as a measure of
+time.  In practice, due to a number of problems, it is the most complicated
+timekeeping device to use.
+
+The TSC is represented internally as a 64-bit MSR which can be read with the
+RDMSR, RDTSC, or RDTSCP (when available) instructions.  In the past, hardware
+limitations made it possible to write the TSC, but generally on old hardware it
+was only possible to write the low 32-bits of the 64-bit counter, and the upper
+32-bits of the counter were cleared.  Now, however, on Intel processors family
+0Fh, for models 3, 4 and 6, and family 06h, models e and f, this restriction
+has been lifted and all 64-bits are writable.  On AMD systems, the ability to
+write the TSC MSR is not an architectural guarantee.
+
+The TSC is accessible from CPL-0 and conditionally, for CPL > 0 software by
+means of the CR4.TSD bit, which when enabled, disables CPL > 0 TSC access.
+
+Some vendors have implemented an additional instruction, RDTSCP, which returns
+atomically not just the TSC, but an indicator which corresponds to the
+processor number.  This can be used to index into an array of TSC variables to
+determine offset information in SMP systems where TSCs are not synchronized.
+The presence of this instruction must be determined by consulting CPUID feature
+bits.
+
+Both VMX and SVM provide extension fields in the virtualization hardware which
+allows the guest visible TSC to be offset by a constant.  Newer implementations
+promise to allow the TSC to additionally be scaled, but this hardware is not
+yet widely available.
+
+3.1. TSC synchronization
+------------------------
+
+The TSC is a CPU-local clock in most implementations.  This means, on SMP
+platforms, the TSCs of different CPUs may start at different times depending
+on when the CPUs are powered on.  Generally, CPUs on the same die will share
+the same clock, however, this is not always the case.
+
+The BIOS may attempt to resynchronize the TSCs during the poweron process and
+the operating system or other system software may attempt to do this as well.
+Several hardware limitations make the problem worse - if it is not possible to
+write the full 64-bits of the TSC, it may be impossible to match the TSC in
+newly arriving CPUs to that of the rest of the system, resulting in
+unsynchronized TSCs.  This may be done by BIOS or system software, but in
+practice, getting a perfectly synchronized TSC will not be possible unless all
+values are read from the same clock, which generally only is possible on single
+socket systems or those with special hardware support.
+
+3.2. TSC and CPU hotplug
+------------------------
+
+As touched on already, CPUs which arrive later than the boot time of the system
+may not have a TSC value that is synchronized with the rest of the system.
+Either system software, BIOS, or SMM code may actually try to establish the TSC
+to a value matching the rest of the system, but a perfect match is usually not
+a guarantee.  This can have the effect of bringing a system from a state where
+TSC is synchronized back to a state where TSC synchronization flaws, however
+small, may be exposed to the OS and any virtualization environment.
+
+3.3. TSC and multi-socket / NUMA
+--------------------------------
+
+Multi-socket systems, especially large multi-socket systems are likely to have
+individual clocksources rather than a single, universally distributed clock.
+Since these clocks are driven by different crystals, they will not have
+perfectly matched frequency, and temperature and electrical variations will
+cause the CPU clocks, and thus the TSCs to drift over time.  Depending on the
+exact clock and bus design, the drift may or may not be fixed in absolute
+error, and may accumulate over time.
+
+In addition, very large systems may deliberately slew the clocks of individual
+cores.  This technique, known as spread-spectrum clocking, reduces EMI at the
+clock frequency and harmonics of it, which may be required to pass FCC
+standards for telecommunications and computer equipment.
+
+It is recommended not to trust the TSCs to remain synchronized on NUMA or
+multiple socket systems for these reasons.
+
+3.4. TSC and C-states
+---------------------
+
+C-states, or idling states of the processor, especially C1E and deeper sleep
+states may be problematic for TSC as well.  The TSC may stop advancing in such
+a state, resulting in a TSC which is behind that of other CPUs when execution
+is resumed.  Such CPUs must be detected and flagged by the operating system
+based on CPU and chipset identifications.
+
+The TSC in such a case may be corrected by catching it up to a known external
+clocksource.
+
+3.5. TSC frequency change / P-states
+------------------------------------
+
+To make things slightly more interesting, some CPUs may change frequency.  They
+may or may not run the TSC at the same rate, and because the frequency change
+may be staggered or slewed, at some points in time, the TSC rate may not be
+known other than falling within a range of values.  In this case, the TSC will
+not be a stable time source, and must be calibrated against a known, stable,
+external clock to be a usable source of time.
+
+Whether the TSC runs at a constant rate or scales with the P-state is model
+dependent and must be determined by inspecting CPUID, chipset or vendor
+specific MSR fields.
+
+In addition, some vendors have known bugs where the P-state is actually
+compensated for properly during normal operation, but when the processor is
+inactive, the P-state may be raised temporarily to service cache misses from
+other processors.  In such cases, the TSC on halted CPUs could advance faster
+than that of non-halted processors.  AMD Turion processors are known to have
+this problem.
+
+3.6. TSC and STPCLK / T-states
+------------------------------
+
+External signals given to the processor may also have the effect of stopping
+the TSC.  This is typically done for thermal emergency power control to prevent
+an overheating condition, and typically, there is no way to detect that this
+condition has happened.
+
+3.7. TSC virtualization - VMX
+-----------------------------
+
+VMX provides conditional trapping of RDTSC, RDMSR, WRMSR and RDTSCP
+instructions, which is enough for full virtualization of TSC in any manner.  In
+addition, VMX allows passing through the host TSC plus an additional TSC_OFFSET
+field specified in the VMCS.  Special instructions must be used to read and
+write the VMCS field.
+
+3.8. TSC virtualization - SVM
+-----------------------------
+
+SVM provides conditional trapping of RDTSC, RDMSR, WRMSR and RDTSCP
+instructions, which is enough for full virtualization of TSC in any manner.  In
+addition, SVM allows passing through the host TSC plus an additional offset
+field specified in the SVM control block.
+
+3.9. TSC feature bits in Linux
+------------------------------
+
+In summary, there is no way to guarantee the TSC remains in perfect
+synchronization unless it is explicitly guaranteed by the architecture.  Even
+if so, the TSCs in multi-sockets or NUMA systems may still run independently
+despite being locally consistent.
+
+The following feature bits are used by Linux to signal various TSC attributes,
+but they can only be taken to be meaningful for UP or single node systems.
+
+=========================      =======================================
+X86_FEATURE_TSC                        The TSC is available in hardware
+X86_FEATURE_RDTSCP             The RDTSCP instruction is available
+X86_FEATURE_CONSTANT_TSC       The TSC rate is unchanged with P-states
+X86_FEATURE_NONSTOP_TSC                The TSC does not stop in C-states
+X86_FEATURE_TSC_RELIABLE       TSC sync checks are skipped (VMware)
+=========================      =======================================
+
+4. Virtualization Problems
+==========================
+
+Timekeeping is especially problematic for virtualization because a number of
+challenges arise.  The most obvious problem is that time is now shared between
+the host and, potentially, a number of virtual machines.  Thus the virtual
+operating system does not run with 100% usage of the CPU, despite the fact that
+it may very well make that assumption.  It may expect it to remain true to very
+exacting bounds when interrupt sources are disabled, but in reality only its
+virtual interrupt sources are disabled, and the machine may still be preempted
+at any time.  This causes problems as the passage of real time, the injection
+of machine interrupts and the associated clock sources are no longer completely
+synchronized with real time.
+
+This same problem can occur on native hardware to a degree, as SMM mode may
+steal cycles from the naturally on X86 systems when SMM mode is used by the
+BIOS, but not in such an extreme fashion.  However, the fact that SMM mode may
+cause similar problems to virtualization makes it a good justification for
+solving many of these problems on bare metal.
+
+4.1. Interrupt clocking
+-----------------------
+
+One of the most immediate problems that occurs with legacy operating systems
+is that the system timekeeping routines are often designed to keep track of
+time by counting periodic interrupts.  These interrupts may come from the PIT
+or the RTC, but the problem is the same: the host virtualization engine may not
+be able to deliver the proper number of interrupts per second, and so guest
+time may fall behind.  This is especially problematic if a high interrupt rate
+is selected, such as 1000 HZ, which is unfortunately the default for many Linux
+guests.
+
+There are three approaches to solving this problem; first, it may be possible
+to simply ignore it.  Guests which have a separate time source for tracking
+'wall clock' or 'real time' may not need any adjustment of their interrupts to
+maintain proper time.  If this is not sufficient, it may be necessary to inject
+additional interrupts into the guest in order to increase the effective
+interrupt rate.  This approach leads to complications in extreme conditions,
+where host load or guest lag is too much to compensate for, and thus another
+solution to the problem has risen: the guest may need to become aware of lost
+ticks and compensate for them internally.  Although promising in theory, the
+implementation of this policy in Linux has been extremely error prone, and a
+number of buggy variants of lost tick compensation are distributed across
+commonly used Linux systems.
+
+Windows uses periodic RTC clocking as a means of keeping time internally, and
+thus requires interrupt slewing to keep proper time.  It does use a low enough
+rate (ed: is it 18.2 Hz?) however that it has not yet been a problem in
+practice.
+
+4.2. TSC sampling and serialization
+-----------------------------------
+
+As the highest precision time source available, the cycle counter of the CPU
+has aroused much interest from developers.  As explained above, this timer has
+many problems unique to its nature as a local, potentially unstable and
+potentially unsynchronized source.  One issue which is not unique to the TSC,
+but is highlighted because of its very precise nature is sampling delay.  By
+definition, the counter, once read is already old.  However, it is also
+possible for the counter to be read ahead of the actual use of the result.
+This is a consequence of the superscalar execution of the instruction stream,
+which may execute instructions out of order.  Such execution is called
+non-serialized.  Forcing serialized execution is necessary for precise
+measurement with the TSC, and requires a serializing instruction, such as CPUID
+or an MSR read.
+
+Since CPUID may actually be virtualized by a trap and emulate mechanism, this
+serialization can pose a performance issue for hardware virtualization.  An
+accurate time stamp counter reading may therefore not always be available, and
+it may be necessary for an implementation to guard against "backwards" reads of
+the TSC as seen from other CPUs, even in an otherwise perfectly synchronized
+system.
+
+4.3. Timespec aliasing
+----------------------
+
+Additionally, this lack of serialization from the TSC poses another challenge
+when using results of the TSC when measured against another time source.  As
+the TSC is much higher precision, many possible values of the TSC may be read
+while another clock is still expressing the same value.
+
+That is, you may read (T,T+10) while external clock C maintains the same value.
+Due to non-serialized reads, you may actually end up with a range which
+fluctuates - from (T-1.. T+10).  Thus, any time calculated from a TSC, but
+calibrated against an external value may have a range of valid values.
+Re-calibrating this computation may actually cause time, as computed after the
+calibration, to go backwards, compared with time computed before the
+calibration.
+
+This problem is particularly pronounced with an internal time source in Linux,
+the kernel time, which is expressed in the theoretically high resolution
+timespec - but which advances in much larger granularity intervals, sometimes
+at the rate of jiffies, and possibly in catchup modes, at a much larger step.
+
+This aliasing requires care in the computation and recalibration of kvmclock
+and any other values derived from TSC computation (such as TSC virtualization
+itself).
+
+4.4. Migration
+--------------
+
+Migration of a virtual machine raises problems for timekeeping in two ways.
+First, the migration itself may take time, during which interrupts cannot be
+delivered, and after which, the guest time may need to be caught up.  NTP may
+be able to help to some degree here, as the clock correction required is
+typically small enough to fall in the NTP-correctable window.
+
+An additional concern is that timers based off the TSC (or HPET, if the raw bus
+clock is exposed) may now be running at different rates, requiring compensation
+in some way in the hypervisor by virtualizing these timers.  In addition,
+migrating to a faster machine may preclude the use of a passthrough TSC, as a
+faster clock cannot be made visible to a guest without the potential of time
+advancing faster than usual.  A slower clock is less of a problem, as it can
+always be caught up to the original rate.  KVM clock avoids these problems by
+simply storing multipliers and offsets against the TSC for the guest to convert
+back into nanosecond resolution values.
+
+4.5. Scheduling
+---------------
+
+Since scheduling may be based on precise timing and firing of interrupts, the
+scheduling algorithms of an operating system may be adversely affected by
+virtualization.  In theory, the effect is random and should be universally
+distributed, but in contrived as well as real scenarios (guest device access,
+causes of virtualization exits, possible context switch), this may not always
+be the case.  The effect of this has not been well studied.
+
+In an attempt to work around this, several implementations have provided a
+paravirtualized scheduler clock, which reveals the true amount of CPU time for
+which a virtual machine has been running.
+
+4.6. Watchdogs
+--------------
+
+Watchdog timers, such as the lock detector in Linux may fire accidentally when
+running under hardware virtualization due to timer interrupts being delayed or
+misinterpretation of the passage of real time.  Usually, these warnings are
+spurious and can be ignored, but in some circumstances it may be necessary to
+disable such detection.
+
+4.7. Delays and precision timing
+--------------------------------
+
+Precise timing and delays may not be possible in a virtualized system.  This
+can happen if the system is controlling physical hardware, or issues delays to
+compensate for slower I/O to and from devices.  The first issue is not solvable
+in general for a virtualized system; hardware control software can't be
+adequately virtualized without a full real-time operating system, which would
+require an RT aware virtualization platform.
+
+The second issue may cause performance problems, but this is unlikely to be a
+significant issue.  In many cases these delays may be eliminated through
+configuration or paravirtualization.
+
+4.8. Covert channels and leaks
+------------------------------
+
+In addition to the above problems, time information will inevitably leak to the
+guest about the host in anything but a perfect implementation of virtualized
+time.  This may allow the guest to infer the presence of a hypervisor (as in a
+red-pill type detection), and it may allow information to leak between guests
+by using CPU utilization itself as a signalling channel.  Preventing such
+problems would require completely isolated virtual time which may not track
+real time any longer.  This may be useful in certain security or QA contexts,
+but in general isn't recommended for real-world deployment scenarios.
diff --git a/Documentation/virt/kvm/timekeeping.txt b/Documentation/virt/kvm/timekeeping.txt

deleted file mode 100644 (file)

index 76808a1..0000000
--- a/Documentation/virt/kvm/timekeeping.txt
+++ /dev/null
@@ -1,612 +0,0 @@
-
-       Timekeeping Virtualization for X86-Based Architectures
-
-       Zachary Amsden <zamsden@redhat.com>
-       Copyright (c) 2010, Red Hat.  All rights reserved.
-
-1) Overview
-2) Timing Devices
-3) TSC Hardware
-4) Virtualization Problems
-
-=========================================================================
-
-1) Overview
-
-One of the most complicated parts of the X86 platform, and specifically,
-the virtualization of this platform is the plethora of timing devices available
-and the complexity of emulating those devices.  In addition, virtualization of
-time introduces a new set of challenges because it introduces a multiplexed
-division of time beyond the control of the guest CPU.
-
-First, we will describe the various timekeeping hardware available, then
-present some of the problems which arise and solutions available, giving
-specific recommendations for certain classes of KVM guests.
-
-The purpose of this document is to collect data and information relevant to
-timekeeping which may be difficult to find elsewhere, specifically,
-information relevant to KVM and hardware-based virtualization.
-
-=========================================================================
-
-2) Timing Devices
-
-First we discuss the basic hardware devices available.  TSC and the related
-KVM clock are special enough to warrant a full exposition and are described in
-the following section.
-
-2.1) i8254 - PIT
-
-One of the first timer devices available is the programmable interrupt timer,
-or PIT.  The PIT has a fixed frequency 1.193182 MHz base clock and three
-channels which can be programmed to deliver periodic or one-shot interrupts.
-These three channels can be configured in different modes and have individual
-counters.  Channel 1 and 2 were not available for general use in the original
-IBM PC, and historically were connected to control RAM refresh and the PC
-speaker.  Now the PIT is typically integrated as part of an emulated chipset
-and a separate physical PIT is not used.
-
-The PIT uses I/O ports 0x40 - 0x43.  Access to the 16-bit counters is done
-using single or multiple byte access to the I/O ports.  There are 6 modes
-available, but not all modes are available to all timers, as only timer 2
-has a connected gate input, required for modes 1 and 5.  The gate line is
-controlled by port 61h, bit 0, as illustrated in the following diagram.
-
- --------------             ----------------
-|              |           |                |
-|  1.1932 MHz  |---------->| CLOCK      OUT | ---------> IRQ 0
-|    Clock     |   |       |                |
- --------------    |    +->| GATE  TIMER 0  |
-                   |        ----------------
-                   |
-                   |        ----------------
-                   |       |                |
-                   |------>| CLOCK      OUT | ---------> 66.3 KHZ DRAM
-                   |       |                |            (aka /dev/null)
-                   |    +->| GATE  TIMER 1  |
-                   |        ----------------
-                   |
-                   |        ----------------
-                   |       |                |
-                   |------>| CLOCK      OUT | ---------> Port 61h, bit 5
-                           |                |      |
-Port 61h, bit 0 ---------->| GATE  TIMER 2  |       \_.----   ____
-                            ----------------         _|    )--|LPF|---Speaker
-                                                    / *----   \___/
-Port 61h, bit 1 -----------------------------------/
-
-The timer modes are now described.
-
-Mode 0: Single Timeout.   This is a one-shot software timeout that counts down
- when the gate is high (always true for timers 0 and 1).  When the count
- reaches zero, the output goes high.
-
-Mode 1: Triggered One-shot.  The output is initially set high.  When the gate
- line is set high, a countdown is initiated (which does not stop if the gate is
- lowered), during which the output is set low.  When the count reaches zero,
- the output goes high.
-
-Mode 2: Rate Generator.  The output is initially set high.  When the countdown
- reaches 1, the output goes low for one count and then returns high.  The value
- is reloaded and the countdown automatically resumes.  If the gate line goes
- low, the count is halted.  If the output is low when the gate is lowered, the
- output automatically goes high (this only affects timer 2).
-
-Mode 3: Square Wave.   This generates a high / low square wave.  The count
- determines the length of the pulse, which alternates between high and low
- when zero is reached.  The count only proceeds when gate is high and is
- automatically reloaded on reaching zero.  The count is decremented twice at
- each clock to generate a full high / low cycle at the full periodic rate.
- If the count is even, the clock remains high for N/2 counts and low for N/2
- counts; if the clock is odd, the clock is high for (N+1)/2 counts and low
- for (N-1)/2 counts.  Only even values are latched by the counter, so odd
- values are not observed when reading.  This is the intended mode for timer 2,
- which generates sine-like tones by low-pass filtering the square wave output.
-
-Mode 4: Software Strobe.  After programming this mode and loading the counter,
- the output remains high until the counter reaches zero.  Then the output
- goes low for 1 clock cycle and returns high.  The counter is not reloaded.
- Counting only occurs when gate is high.
-
-Mode 5: Hardware Strobe.  After programming and loading the counter, the
- output remains high.  When the gate is raised, a countdown is initiated
- (which does not stop if the gate is lowered).  When the counter reaches zero,
- the output goes low for 1 clock cycle and then returns high.  The counter is
- not reloaded.
-
-In addition to normal binary counting, the PIT supports BCD counting.  The
-command port, 0x43 is used to set the counter and mode for each of the three
-timers.
-
-PIT commands, issued to port 0x43, using the following bit encoding:
-
-Bit 7-4: Command (See table below)
-Bit 3-1: Mode (000 = Mode 0, 101 = Mode 5, 11X = undefined)
-Bit 0  : Binary (0) / BCD (1)
-
-Command table:
-
-0000 - Latch Timer 0 count for port 0x40
-       sample and hold the count to be read in port 0x40;
-       additional commands ignored until counter is read;
-       mode bits ignored.
-
-0001 - Set Timer 0 LSB mode for port 0x40
-       set timer to read LSB only and force MSB to zero;
-       mode bits set timer mode
-
-0010 - Set Timer 0 MSB mode for port 0x40
-       set timer to read MSB only and force LSB to zero;
-       mode bits set timer mode
-
-0011 - Set Timer 0 16-bit mode for port 0x40
-       set timer to read / write LSB first, then MSB;
-       mode bits set timer mode
-
-0100 - Latch Timer 1 count for port 0x41 - as described above
-0101 - Set Timer 1 LSB mode for port 0x41 - as described above
-0110 - Set Timer 1 MSB mode for port 0x41 - as described above
-0111 - Set Timer 1 16-bit mode for port 0x41 - as described above
-
-1000 - Latch Timer 2 count for port 0x42 - as described above
-1001 - Set Timer 2 LSB mode for port 0x42 - as described above
-1010 - Set Timer 2 MSB mode for port 0x42 - as described above
-1011 - Set Timer 2 16-bit mode for port 0x42 as described above
-
-1101 - General counter latch
-       Latch combination of counters into corresponding ports
-       Bit 3 = Counter 2
-       Bit 2 = Counter 1
-       Bit 1 = Counter 0
-       Bit 0 = Unused
-
-1110 - Latch timer status
-       Latch combination of counter mode into corresponding ports
-       Bit 3 = Counter 2
-       Bit 2 = Counter 1
-       Bit 1 = Counter 0
-
-       The output of ports 0x40-0x42 following this command will be:
-
-       Bit 7 = Output pin
-       Bit 6 = Count loaded (0 if timer has expired)
-       Bit 5-4 = Read / Write mode
-           01 = MSB only
-           10 = LSB only
-           11 = LSB / MSB (16-bit)
-       Bit 3-1 = Mode
-       Bit 0 = Binary (0) / BCD mode (1)
-
-2.2) RTC
-
-The second device which was available in the original PC was the MC146818 real
-time clock.  The original device is now obsolete, and usually emulated by the
-system chipset, sometimes by an HPET and some frankenstein IRQ routing.
-
-The RTC is accessed through CMOS variables, which uses an index register to
-control which bytes are read.  Since there is only one index register, read
-of the CMOS and read of the RTC require lock protection (in addition, it is
-dangerous to allow userspace utilities such as hwclock to have direct RTC
-access, as they could corrupt kernel reads and writes of CMOS memory).
-
-The RTC generates an interrupt which is usually routed to IRQ 8.  The interrupt
-can function as a periodic timer, an additional once a day alarm, and can issue
-interrupts after an update of the CMOS registers by the MC146818 is complete.
-The type of interrupt is signalled in the RTC status registers.
-
-The RTC will update the current time fields by battery power even while the
-system is off.  The current time fields should not be read while an update is
-in progress, as indicated in the status register.
-
-The clock uses a 32.768kHz crystal, so bits 6-4 of register A should be
-programmed to a 32kHz divider if the RTC is to count seconds.
-
-This is the RAM map originally used for the RTC/CMOS:
-
-Location    Size    Description
-------------------------------------------
-00h         byte    Current second (BCD)
-01h         byte    Seconds alarm (BCD)
-02h         byte    Current minute (BCD)
-03h         byte    Minutes alarm (BCD)
-04h         byte    Current hour (BCD)
-05h         byte    Hours alarm (BCD)
-06h         byte    Current day of week (BCD)
-07h         byte    Current day of month (BCD)
-08h         byte    Current month (BCD)
-09h         byte    Current year (BCD)
-0Ah         byte    Register A
-                       bit 7   = Update in progress
-                       bit 6-4 = Divider for clock
-                                  000 = 4.194 MHz
-                                  001 = 1.049 MHz
-                                  010 = 32 kHz
-                                  10X = test modes
-                                  110 = reset / disable
-                                  111 = reset / disable
-                       bit 3-0 = Rate selection for periodic interrupt
-                                  000 = periodic timer disabled
-                                  001 = 3.90625 uS
-                                  010 = 7.8125 uS
-                                  011 = .122070 mS
-                                  100 = .244141 mS
-                                     ...
-                                 1101 = 125 mS
-                                 1110 = 250 mS
-                                 1111 = 500 mS
-0Bh         byte    Register B
-                       bit 7   = Run (0) / Halt (1)
-                       bit 6   = Periodic interrupt enable
-                       bit 5   = Alarm interrupt enable
-                       bit 4   = Update-ended interrupt enable
-                       bit 3   = Square wave interrupt enable
-                       bit 2   = BCD calendar (0) / Binary (1)
-                       bit 1   = 12-hour mode (0) / 24-hour mode (1)
-                       bit 0   = 0 (DST off) / 1 (DST enabled)
-OCh         byte    Register C (read only)
-                       bit 7   = interrupt request flag (IRQF)
-                       bit 6   = periodic interrupt flag (PF)
-                       bit 5   = alarm interrupt flag (AF)
-                       bit 4   = update interrupt flag (UF)
-                       bit 3-0 = reserved
-ODh         byte    Register D (read only)
-                       bit 7   = RTC has power
-                       bit 6-0 = reserved
-32h         byte    Current century BCD (*)
-  (*) location vendor specific and now determined from ACPI global tables
-
-2.3) APIC
-
-On Pentium and later processors, an on-board timer is available to each CPU
-as part of the Advanced Programmable Interrupt Controller.  The APIC is
-accessed through memory-mapped registers and provides interrupt service to each
-CPU, used for IPIs and local timer interrupts.
-
-Although in theory the APIC is a safe and stable source for local interrupts,
-in practice, many bugs and glitches have occurred due to the special nature of
-the APIC CPU-local memory-mapped hardware.  Beware that CPU errata may affect
-the use of the APIC and that workarounds may be required.  In addition, some of
-these workarounds pose unique constraints for virtualization - requiring either
-extra overhead incurred from extra reads of memory-mapped I/O or additional
-functionality that may be more computationally expensive to implement.
-
-Since the APIC is documented quite well in the Intel and AMD manuals, we will
-avoid repetition of the detail here.  It should be pointed out that the APIC
-timer is programmed through the LVT (local vector timer) register, is capable
-of one-shot or periodic operation, and is based on the bus clock divided down
-by the programmable divider register.
-
-2.4) HPET
-
-HPET is quite complex, and was originally intended to replace the PIT / RTC
-support of the X86 PC.  It remains to be seen whether that will be the case, as
-the de facto standard of PC hardware is to emulate these older devices.  Some
-systems designated as legacy free may support only the HPET as a hardware timer
-device.
-
-The HPET spec is rather loose and vague, requiring at least 3 hardware timers,
-but allowing implementation freedom to support many more.  It also imposes no
-fixed rate on the timer frequency, but does impose some extremal values on
-frequency, error and slew.
-
-In general, the HPET is recommended as a high precision (compared to PIT /RTC)
-time source which is independent of local variation (as there is only one HPET
-in any given system).  The HPET is also memory-mapped, and its presence is
-indicated through ACPI tables by the BIOS.
-
-Detailed specification of the HPET is beyond the current scope of this
-document, as it is also very well documented elsewhere.
-
-2.5) Offboard Timers
-
-Several cards, both proprietary (watchdog boards) and commonplace (e1000) have
-timing chips built into the cards which may have registers which are accessible
-to kernel or user drivers.  To the author's knowledge, using these to generate
-a clocksource for a Linux or other kernel has not yet been attempted and is in
-general frowned upon as not playing by the agreed rules of the game.  Such a
-timer device would require additional support to be virtualized properly and is
-not considered important at this time as no known operating system does this.
-
-=========================================================================
-
-3) TSC Hardware
-
-The TSC or time stamp counter is relatively simple in theory; it counts
-instruction cycles issued by the processor, which can be used as a measure of
-time.  In practice, due to a number of problems, it is the most complicated
-timekeeping device to use.
-
-The TSC is represented internally as a 64-bit MSR which can be read with the
-RDMSR, RDTSC, or RDTSCP (when available) instructions.  In the past, hardware
-limitations made it possible to write the TSC, but generally on old hardware it
-was only possible to write the low 32-bits of the 64-bit counter, and the upper
-32-bits of the counter were cleared.  Now, however, on Intel processors family
-0Fh, for models 3, 4 and 6, and family 06h, models e and f, this restriction
-has been lifted and all 64-bits are writable.  On AMD systems, the ability to
-write the TSC MSR is not an architectural guarantee.
-
-The TSC is accessible from CPL-0 and conditionally, for CPL > 0 software by
-means of the CR4.TSD bit, which when enabled, disables CPL > 0 TSC access.
-
-Some vendors have implemented an additional instruction, RDTSCP, which returns
-atomically not just the TSC, but an indicator which corresponds to the
-processor number.  This can be used to index into an array of TSC variables to
-determine offset information in SMP systems where TSCs are not synchronized.
-The presence of this instruction must be determined by consulting CPUID feature
-bits.
-
-Both VMX and SVM provide extension fields in the virtualization hardware which
-allows the guest visible TSC to be offset by a constant.  Newer implementations
-promise to allow the TSC to additionally be scaled, but this hardware is not
-yet widely available.
-
-3.1) TSC synchronization
-
-The TSC is a CPU-local clock in most implementations.  This means, on SMP
-platforms, the TSCs of different CPUs may start at different times depending
-on when the CPUs are powered on.  Generally, CPUs on the same die will share
-the same clock, however, this is not always the case.
-
-The BIOS may attempt to resynchronize the TSCs during the poweron process and
-the operating system or other system software may attempt to do this as well.
-Several hardware limitations make the problem worse - if it is not possible to
-write the full 64-bits of the TSC, it may be impossible to match the TSC in
-newly arriving CPUs to that of the rest of the system, resulting in
-unsynchronized TSCs.  This may be done by BIOS or system software, but in
-practice, getting a perfectly synchronized TSC will not be possible unless all
-values are read from the same clock, which generally only is possible on single
-socket systems or those with special hardware support.
-
-3.2) TSC and CPU hotplug
-
-As touched on already, CPUs which arrive later than the boot time of the system
-may not have a TSC value that is synchronized with the rest of the system.
-Either system software, BIOS, or SMM code may actually try to establish the TSC
-to a value matching the rest of the system, but a perfect match is usually not
-a guarantee.  This can have the effect of bringing a system from a state where
-TSC is synchronized back to a state where TSC synchronization flaws, however
-small, may be exposed to the OS and any virtualization environment.
-
-3.3) TSC and multi-socket / NUMA
-
-Multi-socket systems, especially large multi-socket systems are likely to have
-individual clocksources rather than a single, universally distributed clock.
-Since these clocks are driven by different crystals, they will not have
-perfectly matched frequency, and temperature and electrical variations will
-cause the CPU clocks, and thus the TSCs to drift over time.  Depending on the
-exact clock and bus design, the drift may or may not be fixed in absolute
-error, and may accumulate over time.
-
-In addition, very large systems may deliberately slew the clocks of individual
-cores.  This technique, known as spread-spectrum clocking, reduces EMI at the
-clock frequency and harmonics of it, which may be required to pass FCC
-standards for telecommunications and computer equipment.
-
-It is recommended not to trust the TSCs to remain synchronized on NUMA or
-multiple socket systems for these reasons.
-
-3.4) TSC and C-states
-
-C-states, or idling states of the processor, especially C1E and deeper sleep
-states may be problematic for TSC as well.  The TSC may stop advancing in such
-a state, resulting in a TSC which is behind that of other CPUs when execution
-is resumed.  Such CPUs must be detected and flagged by the operating system
-based on CPU and chipset identifications.
-
-The TSC in such a case may be corrected by catching it up to a known external
-clocksource.
-
-3.5) TSC frequency change / P-states
-
-To make things slightly more interesting, some CPUs may change frequency.  They
-may or may not run the TSC at the same rate, and because the frequency change
-may be staggered or slewed, at some points in time, the TSC rate may not be
-known other than falling within a range of values.  In this case, the TSC will
-not be a stable time source, and must be calibrated against a known, stable,
-external clock to be a usable source of time.
-
-Whether the TSC runs at a constant rate or scales with the P-state is model
-dependent and must be determined by inspecting CPUID, chipset or vendor
-specific MSR fields.
-
-In addition, some vendors have known bugs where the P-state is actually
-compensated for properly during normal operation, but when the processor is
-inactive, the P-state may be raised temporarily to service cache misses from
-other processors.  In such cases, the TSC on halted CPUs could advance faster
-than that of non-halted processors.  AMD Turion processors are known to have
-this problem.
-
-3.6) TSC and STPCLK / T-states
-
-External signals given to the processor may also have the effect of stopping
-the TSC.  This is typically done for thermal emergency power control to prevent
-an overheating condition, and typically, there is no way to detect that this
-condition has happened.
-
-3.7) TSC virtualization - VMX
-
-VMX provides conditional trapping of RDTSC, RDMSR, WRMSR and RDTSCP
-instructions, which is enough for full virtualization of TSC in any manner.  In
-addition, VMX allows passing through the host TSC plus an additional TSC_OFFSET
-field specified in the VMCS.  Special instructions must be used to read and
-write the VMCS field.
-
-3.8) TSC virtualization - SVM
-
-SVM provides conditional trapping of RDTSC, RDMSR, WRMSR and RDTSCP
-instructions, which is enough for full virtualization of TSC in any manner.  In
-addition, SVM allows passing through the host TSC plus an additional offset
-field specified in the SVM control block.
-
-3.9) TSC feature bits in Linux
-
-In summary, there is no way to guarantee the TSC remains in perfect
-synchronization unless it is explicitly guaranteed by the architecture.  Even
-if so, the TSCs in multi-sockets or NUMA systems may still run independently
-despite being locally consistent.
-
-The following feature bits are used by Linux to signal various TSC attributes,
-but they can only be taken to be meaningful for UP or single node systems.
-
-X86_FEATURE_TSC                : The TSC is available in hardware
-X86_FEATURE_RDTSCP             : The RDTSCP instruction is available
-X86_FEATURE_CONSTANT_TSC       : The TSC rate is unchanged with P-states
-X86_FEATURE_NONSTOP_TSC                : The TSC does not stop in C-states
-X86_FEATURE_TSC_RELIABLE       : TSC sync checks are skipped (VMware)
-
-4) Virtualization Problems
-
-Timekeeping is especially problematic for virtualization because a number of
-challenges arise.  The most obvious problem is that time is now shared between
-the host and, potentially, a number of virtual machines.  Thus the virtual
-operating system does not run with 100% usage of the CPU, despite the fact that
-it may very well make that assumption.  It may expect it to remain true to very
-exacting bounds when interrupt sources are disabled, but in reality only its
-virtual interrupt sources are disabled, and the machine may still be preempted
-at any time.  This causes problems as the passage of real time, the injection
-of machine interrupts and the associated clock sources are no longer completely
-synchronized with real time.
-
-This same problem can occur on native hardware to a degree, as SMM mode may
-steal cycles from the naturally on X86 systems when SMM mode is used by the
-BIOS, but not in such an extreme fashion.  However, the fact that SMM mode may
-cause similar problems to virtualization makes it a good justification for
-solving many of these problems on bare metal.
-
-4.1) Interrupt clocking
-
-One of the most immediate problems that occurs with legacy operating systems
-is that the system timekeeping routines are often designed to keep track of
-time by counting periodic interrupts.  These interrupts may come from the PIT
-or the RTC, but the problem is the same: the host virtualization engine may not
-be able to deliver the proper number of interrupts per second, and so guest
-time may fall behind.  This is especially problematic if a high interrupt rate
-is selected, such as 1000 HZ, which is unfortunately the default for many Linux
-guests.
-
-There are three approaches to solving this problem; first, it may be possible
-to simply ignore it.  Guests which have a separate time source for tracking
-'wall clock' or 'real time' may not need any adjustment of their interrupts to
-maintain proper time.  If this is not sufficient, it may be necessary to inject
-additional interrupts into the guest in order to increase the effective
-interrupt rate.  This approach leads to complications in extreme conditions,
-where host load or guest lag is too much to compensate for, and thus another
-solution to the problem has risen: the guest may need to become aware of lost
-ticks and compensate for them internally.  Although promising in theory, the
-implementation of this policy in Linux has been extremely error prone, and a
-number of buggy variants of lost tick compensation are distributed across
-commonly used Linux systems.
-
-Windows uses periodic RTC clocking as a means of keeping time internally, and
-thus requires interrupt slewing to keep proper time.  It does use a low enough
-rate (ed: is it 18.2 Hz?) however that it has not yet been a problem in
-practice.
-
-4.2) TSC sampling and serialization
-
-As the highest precision time source available, the cycle counter of the CPU
-has aroused much interest from developers.  As explained above, this timer has
-many problems unique to its nature as a local, potentially unstable and
-potentially unsynchronized source.  One issue which is not unique to the TSC,
-but is highlighted because of its very precise nature is sampling delay.  By
-definition, the counter, once read is already old.  However, it is also
-possible for the counter to be read ahead of the actual use of the result.
-This is a consequence of the superscalar execution of the instruction stream,
-which may execute instructions out of order.  Such execution is called
-non-serialized.  Forcing serialized execution is necessary for precise
-measurement with the TSC, and requires a serializing instruction, such as CPUID
-or an MSR read.
-
-Since CPUID may actually be virtualized by a trap and emulate mechanism, this
-serialization can pose a performance issue for hardware virtualization.  An
-accurate time stamp counter reading may therefore not always be available, and
-it may be necessary for an implementation to guard against "backwards" reads of
-the TSC as seen from other CPUs, even in an otherwise perfectly synchronized
-system.
-
-4.3) Timespec aliasing
-
-Additionally, this lack of serialization from the TSC poses another challenge
-when using results of the TSC when measured against another time source.  As
-the TSC is much higher precision, many possible values of the TSC may be read
-while another clock is still expressing the same value.
-
-That is, you may read (T,T+10) while external clock C maintains the same value.
-Due to non-serialized reads, you may actually end up with a range which
-fluctuates - from (T-1.. T+10).  Thus, any time calculated from a TSC, but
-calibrated against an external value may have a range of valid values.
-Re-calibrating this computation may actually cause time, as computed after the
-calibration, to go backwards, compared with time computed before the
-calibration.
-
-This problem is particularly pronounced with an internal time source in Linux,
-the kernel time, which is expressed in the theoretically high resolution
-timespec - but which advances in much larger granularity intervals, sometimes
-at the rate of jiffies, and possibly in catchup modes, at a much larger step.
-
-This aliasing requires care in the computation and recalibration of kvmclock
-and any other values derived from TSC computation (such as TSC virtualization
-itself).
-
-4.4) Migration
-
-Migration of a virtual machine raises problems for timekeeping in two ways.
-First, the migration itself may take time, during which interrupts cannot be
-delivered, and after which, the guest time may need to be caught up.  NTP may
-be able to help to some degree here, as the clock correction required is
-typically small enough to fall in the NTP-correctable window.
-
-An additional concern is that timers based off the TSC (or HPET, if the raw bus
-clock is exposed) may now be running at different rates, requiring compensation
-in some way in the hypervisor by virtualizing these timers.  In addition,
-migrating to a faster machine may preclude the use of a passthrough TSC, as a
-faster clock cannot be made visible to a guest without the potential of time
-advancing faster than usual.  A slower clock is less of a problem, as it can
-always be caught up to the original rate.  KVM clock avoids these problems by
-simply storing multipliers and offsets against the TSC for the guest to convert
-back into nanosecond resolution values.
-
-4.5) Scheduling
-
-Since scheduling may be based on precise timing and firing of interrupts, the
-scheduling algorithms of an operating system may be adversely affected by
-virtualization.  In theory, the effect is random and should be universally
-distributed, but in contrived as well as real scenarios (guest device access,
-causes of virtualization exits, possible context switch), this may not always
-be the case.  The effect of this has not been well studied.
-
-In an attempt to work around this, several implementations have provided a
-paravirtualized scheduler clock, which reveals the true amount of CPU time for
-which a virtual machine has been running.
-
-4.6) Watchdogs
-
-Watchdog timers, such as the lock detector in Linux may fire accidentally when
-running under hardware virtualization due to timer interrupts being delayed or
-misinterpretation of the passage of real time.  Usually, these warnings are
-spurious and can be ignored, but in some circumstances it may be necessary to
-disable such detection.
-
-4.7) Delays and precision timing
-
-Precise timing and delays may not be possible in a virtualized system.  This
-can happen if the system is controlling physical hardware, or issues delays to
-compensate for slower I/O to and from devices.  The first issue is not solvable
-in general for a virtualized system; hardware control software can't be
-adequately virtualized without a full real-time operating system, which would
-require an RT aware virtualization platform.
-
-The second issue may cause performance problems, but this is unlikely to be a
-significant issue.  In many cases these delays may be eliminated through
-configuration or paravirtualization.
-
-4.8) Covert channels and leaks
-
-In addition to the above problems, time information will inevitably leak to the
-guest about the host in anything but a perfect implementation of virtualized
-time.  This may allow the guest to infer the presence of a hypervisor (as in a
-red-pill type detection), and it may allow information to leak between guests
-by using CPU utilization itself as a signalling channel.  Preventing such
-problems would require completely isolated virtual time which may not track
-real time any longer.  This may be useful in certain security or QA contexts,
-but in general isn't recommended for real-world deployment scenarios.
diff --git a/Documentation/virt/uml/UserModeLinux-HOWTO.txt b/Documentation/virt/uml/UserModeLinux-HOWTO.txt

deleted file mode 100644 (file)

index 87b80f5..0000000
--- a/Documentation/virt/uml/UserModeLinux-HOWTO.txt
+++ /dev/null
@@ -1,4589 +0,0 @@
-  User Mode Linux HOWTO
-  User Mode Linux Core Team
-  Mon Nov 18 14:16:16 EST 2002
-
-  This document describes the use and abuse of Jeff Dike's User Mode
-  Linux: a port of the Linux kernel as a normal Intel Linux process.
-  ______________________________________________________________________
-
-  Table of Contents
-
-  1. Introduction
-
-     1.1 How is User Mode Linux Different?
-     1.2 Why Would I Want User Mode Linux?
-
-  2. Compiling the kernel and modules
-
-     2.1 Compiling the kernel
-     2.2 Compiling and installing kernel modules
-     2.3 Compiling and installing uml_utilities
-
-  3. Running UML and logging in
-
-     3.1 Running UML
-     3.2 Logging in
-     3.3 Examples
-
-  4. UML on 2G/2G hosts
-
-     4.1 Introduction
-     4.2 The problem
-     4.3 The solution
-
-  5. Setting up serial lines and consoles
-
-     5.1 Specifying the device
-     5.2 Specifying the channel
-     5.3 Examples
-
-  6. Setting up the network
-
-     6.1 General setup
-     6.2 Userspace daemons
-     6.3 Specifying ethernet addresses
-     6.4 UML interface setup
-     6.5 Multicast
-     6.6 TUN/TAP with the uml_net helper
-     6.7 TUN/TAP with a preconfigured tap device
-     6.8 Ethertap
-     6.9 The switch daemon
-     6.10 Slip
-     6.11 Slirp
-     6.12 pcap
-     6.13 Setting up the host yourself
-
-  7. Sharing Filesystems between Virtual Machines
-
-     7.1 A warning
-     7.2 Using layered block devices
-     7.3 Note!
-     7.4 Another warning
-     7.5 uml_moo : Merging a COW file with its backing file
-
-  8. Creating filesystems
-
-     8.1 Create the filesystem file
-     8.2 Assign the file to a UML device
-     8.3 Creating and mounting the filesystem
-
-  9. Host file access
-
-     9.1 Using hostfs
-     9.2 hostfs as the root filesystem
-     9.3 Building hostfs
-
-  10. The Management Console
-     10.1 version
-     10.2 halt and reboot
-     10.3 config
-     10.4 remove
-     10.5 sysrq
-     10.6 help
-     10.7 cad
-     10.8 stop
-     10.9 go
-
-  11. Kernel debugging
-
-     11.1 Starting the kernel under gdb
-     11.2 Examining sleeping processes
-     11.3 Running ddd on UML
-     11.4 Debugging modules
-     11.5 Attaching gdb to the kernel
-     11.6 Using alternate debuggers
-
-  12. Kernel debugging examples
-
-     12.1 The case of the hung fsck
-     12.2 Episode 2: The case of the hung fsck
-
-  13. What to do when UML doesn't work
-
-     13.1 Strange compilation errors when you build from source
-     13.2 (obsolete)
-     13.3 A variety of panics and hangs with /tmp on a reiserfs  filesystem
-     13.4 The compile fails with errors about conflicting types for 'open', 'dup', and 'waitpid'
-     13.5 UML doesn't work when /tmp is an NFS filesystem
-     13.6 UML hangs on boot when compiled with gprof support
-     13.7 syslogd dies with a SIGTERM on startup
-     13.8 TUN/TAP networking doesn't work on a 2.4 host
-     13.9 You can network to the host but not to other machines on the net
-     13.10 I have no root and I want to scream
-     13.11 UML build conflict between ptrace.h and ucontext.h
-     13.12 The UML BogoMips is exactly half the host's BogoMips
-     13.13 When you run UML, it immediately segfaults
-     13.14 xterms appear, then immediately disappear
-     13.15 Any other panic, hang, or strange behavior
-
-  14. Diagnosing Problems
-
-     14.1 Case 1 : Normal kernel panics
-     14.2 Case 2 : Tracing thread panics
-     14.3 Case 3 : Tracing thread panics caused by other threads
-     14.4 Case 4 : Hangs
-
-  15. Thanks
-
-     15.1 Code and Documentation
-     15.2 Flushing out bugs
-     15.3 Buglets and clean-ups
-     15.4 Case Studies
-     15.5 Other contributions
-
-
-  ______________________________________________________________________
-
-  1.  Introduction
-
-  Welcome to User Mode Linux.  It's going to be fun.
-
-
-
-  1.1.  How is User Mode Linux Different?
-
-  Normally, the Linux Kernel talks straight to your hardware (video
-  card, keyboard, hard drives, etc), and any programs which run ask the
-  kernel to operate the hardware, like so:
-
-
-
-         +-----------+-----------+----+
-         | Process 1 | Process 2 | ...|
-         +-----------+-----------+----+
-         |       Linux Kernel         |
-         +----------------------------+
-         |         Hardware           |
-         +----------------------------+
-
-
-
-
-  The User Mode Linux Kernel is different; instead of talking to the
-  hardware, it talks to a `real' Linux kernel (called the `host kernel'
-  from now on), like any other program.  Programs can then run inside
-  User-Mode Linux as if they were running under a normal kernel, like
-  so:
-
-
-
-                     +----------------+
-                     | Process 2 | ...|
-         +-----------+----------------+
-         | Process 1 | User-Mode Linux|
-         +----------------------------+
-         |       Linux Kernel         |
-         +----------------------------+
-         |         Hardware           |
-         +----------------------------+
-
-
-
-
-
-  1.2.  Why Would I Want User Mode Linux?
-
-
-  1. If User Mode Linux crashes, your host kernel is still fine.
-
-  2. You can run a usermode kernel as a non-root user.
-
-  3. You can debug the User Mode Linux like any normal process.
-
-  4. You can run gprof (profiling) and gcov (coverage testing).
-
-  5. You can play with your kernel without breaking things.
-
-  6. You can use it as a sandbox for testing new apps.
-
-  7. You can try new development kernels safely.
-
-  8. You can run different distributions simultaneously.
-
-  9. It's extremely fun.
-
-
-
-
-
-  2.  Compiling the kernel and modules
-
-
-
-
-  2.1.  Compiling the kernel
-
-
-  Compiling the user mode kernel is just like compiling any other
-  kernel.  Let's go through the steps, using 2.4.0-prerelease (current
-  as of this writing) as an example:
-
-
-  1. Download the latest UML patch from
-
-     the download page <http://user-mode-linux.sourceforge.net/
-
-     In this example, the file is uml-patch-2.4.0-prerelease.bz2.
-
-
-  2. Download the matching kernel from your favourite kernel mirror,
-     such as:
-
-     ftp://ftp.ca.kernel.org/pub/kernel/v2.4/linux-2.4.0-prerelease.tar.bz2
-     <ftp://ftp.ca.kernel.org/pub/kernel/v2.4/linux-2.4.0-prerelease.tar.bz2>
-     .
-
-
-  3. Make a directory and unpack the kernel into it.
-
-
-
-       host%
-       mkdir ~/uml
-
-
-
-
-
-
-       host%
-       cd ~/uml
-
-
-
-
-
-
-       host%
-       tar -xzvf linux-2.4.0-prerelease.tar.bz2
-
-
-
-
-
-
-  4. Apply the patch using
-
-
-
-       host%
-       cd ~/uml/linux
-
-
-
-       host%
-       bzcat uml-patch-2.4.0-prerelease.bz2 | patch -p1
-
-
-
-
-
-
-  5. Run your favorite config; `make xconfig ARCH=um' is the most
-     convenient.  `make config ARCH=um' and 'make menuconfig ARCH=um'
-     will work as well.  The defaults will give you a useful kernel.  If
-     you want to change something, go ahead, it probably won't hurt
-     anything.
-
-
-     Note:  If the host is configured with a 2G/2G address space split
-     rather than the usual 3G/1G split, then the packaged UML binaries
-     will not run.  They will immediately segfault.  See ``UML on 2G/2G
-     hosts''  for the scoop on running UML on your system.
-
-
-
-  6. Finish with `make linux ARCH=um': the result is a file called
-     `linux' in the top directory of your source tree.
-
-  Make sure that you don't build this kernel in /usr/src/linux.  On some
-  distributions, /usr/include/asm is a link into this pool.  The user-
-  mode build changes the other end of that link, and things that include
-  <asm/anything.h> stop compiling.
-
-  The sources are also available from cvs at the project's cvs page,
-  which has directions on getting the sources. You can also browse the
-  CVS pool from there.
-
-  If you get the CVS sources, you will have to check them out into an
-  empty directory. You will then have to copy each file into the
-  corresponding directory in the appropriate kernel pool.
-
-  If you don't have the latest kernel pool, you can get the
-  corresponding user-mode sources with
-
-
-       host% cvs co -r v_2_3_x linux
-
-
-
-
-  where 'x' is the version in your pool. Note that you will not get the
-  bug fixes and enhancements that have gone into subsequent releases.
-
-
-  2.2.  Compiling and installing kernel modules
-
-  UML modules are built in the same way as the native kernel (with the
-  exception of the 'ARCH=um' that you always need for UML):
-
-
-       host% make modules ARCH=um
-
-
-
-
-  Any modules that you want to load into this kernel need to be built in
-  the user-mode pool.  Modules from the native kernel won't work.
-
-  You can install them by using ftp or something to copy them into the
-  virtual machine and dropping them into /lib/modules/`uname -r`.
-
-  You can also get the kernel build process to install them as follows:
-
-  1. with the kernel not booted, mount the root filesystem in the top
-     level of the kernel pool:
-
-
-       host% mount root_fs mnt -o loop
-
-
-
-
-
-
-  2. run
-
-
-       host%
-       make modules_install INSTALL_MOD_PATH=`pwd`/mnt ARCH=um
-
-
-
-
-
-
-  3. unmount the filesystem
-
-
-       host% umount mnt
-
-
-
-
-
-
-  4. boot the kernel on it
-
-
-  When the system is booted, you can use insmod as usual to get the
-  modules into the kernel.  A number of things have been loaded into UML
-  as modules, especially filesystems and network protocols and filters,
-  so most symbols which need to be exported probably already are.
-  However, if you do find symbols that need exporting, let  us
-  <http://user-mode-linux.sourceforge.net/>  know, and
-  they'll be "taken care of".
-
-
-
-  2.3.  Compiling and installing uml_utilities
-
-  Many features of the UML kernel require a user-space helper program,
-  so a uml_utilities package is distributed separately from the kernel
-  patch which provides these helpers. Included within this is:
-
-  o  port-helper - Used by consoles which connect to xterms or ports
-
-  o  tunctl - Configuration tool to create and delete tap devices
-
-  o  uml_net - Setuid binary for automatic tap device configuration
-
-  o  uml_switch - User-space virtual switch required for daemon
-     transport
-
-     The uml_utilities tree is compiled with:
-
-
-       host#
-       make && make install
-
-
-
-
-  Note that UML kernel patches may require a specific version of the
-  uml_utilities distribution. If you don't keep up with the mailing
-  lists, ensure that you have the latest release of uml_utilities if you
-  are experiencing problems with your UML kernel, particularly when
-  dealing with consoles or command-line switches to the helper programs
-
-
-
-
-
-
-
-
-  3.  Running UML and logging in
-
-
-
-  3.1.  Running UML
-
-  It runs on 2.2.15 or later, and all 2.4 kernels.
-
-
-  Booting UML is straightforward.  Simply run 'linux': it will try to
-  mount the file `root_fs' in the current directory.  You do not need to
-  run it as root.  If your root filesystem is not named `root_fs', then
-  you need to put a `ubd0=root_fs_whatever' switch on the linux command
-  line.
-
-
-  You will need a filesystem to boot UML from.  There are a number
-  available for download from  here  <http://user-mode-
-  linux.sourceforge.net/> .  There are also  several tools
-  <http://user-mode-linux.sourceforge.net/>  which can be
-  used to generate UML-compatible filesystem images from media.
-  The kernel will boot up and present you with a login prompt.
-
-
-  Note:  If the host is configured with a 2G/2G address space split
-  rather than the usual 3G/1G split, then the packaged UML binaries will
-  not run.  They will immediately segfault.  See ``UML on 2G/2G hosts''
-  for the scoop on running UML on your system.
-
-
-
-  3.2.  Logging in
-
-
-
-  The prepackaged filesystems have a root account with password 'root'
-  and a user account with password 'user'.  The login banner will
-  generally tell you how to log in.  So, you log in and you will find
-  yourself inside a little virtual machine. Our filesystems have a
-  variety of commands and utilities installed (and it is fairly easy to
-  add more), so you will have a lot of tools with which to poke around
-  the system.
-
-  There are a couple of other ways to log in:
-
-  o  On a virtual console
-
-
-
-     Each virtual console that is configured (i.e. the device exists in
-     /dev and /etc/inittab runs a getty on it) will come up in its own
-     xterm.  If you get tired of the xterms, read ``Setting up serial
-     lines and consoles''  to see how to attach the consoles to
-     something else, like host ptys.
-
-
-
-  o  Over the serial line
-
-
-     In the boot output, find a line that looks like:
-
-
-
-       serial line 0 assigned pty /dev/ptyp1
-
-
-
-
-  Attach your favorite terminal program to the corresponding tty.  I.e.
-  for minicom, the command would be
-
-
-       host% minicom -o -p /dev/ttyp1
-
-
-
-
-
-
-  o  Over the net
-
-
-     If the network is running, then you can telnet to the virtual
-     machine and log in to it.  See ``Setting up the network''  to learn
-     about setting up a virtual network.
-
-  When you're done using it, run halt, and the kernel will bring itself
-  down and the process will exit.
-
-
-  3.3.  Examples
-
-  Here are some examples of UML in action:
-
-  o  A login session <http://user-mode-linux.sourceforge.net/login.html>
-
-  o  A virtual network <http://user-mode-linux.sourceforge.net/net.html>
-
-
-
-
-
-
-
-  4.  UML on 2G/2G hosts
-
-
-
-
-  4.1.  Introduction
-
-
-  Most Linux machines are configured so that the kernel occupies the
-  upper 1G (0xc0000000 - 0xffffffff) of the 4G address space and
-  processes use the lower 3G (0x00000000 - 0xbfffffff).  However, some
-  machine are configured with a 2G/2G split, with the kernel occupying
-  the upper 2G (0x80000000 - 0xffffffff) and processes using the lower
-  2G (0x00000000 - 0x7fffffff).
-
-
-
-
-  4.2.  The problem
-
-
-  The prebuilt UML binaries on this site will not run on 2G/2G hosts
-  because UML occupies the upper .5G of the 3G process address space
-  (0xa0000000 - 0xbfffffff).  Obviously, on 2G/2G hosts, this is right
-  in the middle of the kernel address space, so UML won't even load - it
-  will immediately segfault.
-
-
-
-
-  4.3.  The solution
-
-
-  The fix for this is to rebuild UML from source after enabling
-  CONFIG_HOST_2G_2G (under 'General Setup').  This will cause UML to
-  load itself in the top .5G of that smaller process address space,
-  where it will run fine.  See ``Compiling the kernel and modules''  if
-  you need help building UML from source.
-
-
-
-
-
-
-
-
-
-
-  5.  Setting up serial lines and consoles
-
-
-  It is possible to attach UML serial lines and consoles to many types
-  of host I/O channels by specifying them on the command line.
-
-
-  You can attach them to host ptys, ttys, file descriptors, and ports.
-  This allows you to do things like
-
-  o  have a UML console appear on an unused host console,
-
-  o  hook two virtual machines together by having one attach to a pty
-     and having the other attach to the corresponding tty
-
-  o  make a virtual machine accessible from the net by attaching a
-     console to a port on the host.
-
-
-  The general format of the command line option is device=channel.
-
-
-
-  5.1.  Specifying the device
-
-  Devices are specified with "con" or "ssl" (console or serial line,
-  respectively), optionally with a device number if you are talking
-  about a specific device.
-
-
-  Using just "con" or "ssl" describes all of the consoles or serial
-  lines.  If you want to talk about console #3 or serial line #10, they
-  would be "con3" and "ssl10", respectively.
-
-
-  A specific device name will override a less general "con=" or "ssl=".
-  So, for example, you can assign a pty to each of the serial lines
-  except for the first two like this:
-
-
-        ssl=pty ssl0=tty:/dev/tty0 ssl1=tty:/dev/tty1
-
-
-
-
-  The specificity of the device name is all that matters; order on the
-  command line is irrelevant.
-
-
-
-  5.2.  Specifying the channel
-
-  There are a number of different types of channels to attach a UML
-  device to, each with a different way of specifying exactly what to
-  attach to.
-
-  o  pseudo-terminals - device=pty pts terminals - device=pts
-
-
-     This will cause UML to allocate a free host pseudo-terminal for the
-     device.  The terminal that it got will be announced in the boot
-     log.  You access it by attaching a terminal program to the
-     corresponding tty:
-
-  o  screen /dev/pts/n
-
-  o  screen /dev/ttyxx
-
-  o  minicom -o -p /dev/ttyxx - minicom seems not able to handle pts
-     devices
-
-  o  kermit - start it up, 'open' the device, then 'connect'
-
-
-
-
-
-  o  terminals - device=tty:tty device file
-
-
-     This will make UML attach the device to the specified tty (i.e
-
-
-        con1=tty:/dev/tty3
-
-
-
-
-  will attach UML's console 1 to the host's /dev/tty3).  If the tty that
-  you specify is the slave end of a tty/pty pair, something else must
-  have already opened the corresponding pty in order for this to work.
-
-
-
-
-
-  o  xterms - device=xterm
-
-
-     UML will run an xterm and the device will be attached to it.
-
-
-
-
-
-  o  Port - device=port:port number
-
-
-     This will attach the UML devices to the specified host port.
-     Attaching console 1 to the host's port 9000 would be done like
-     this:
-
-
-        con1=port:9000
-
-
-
-
-  Attaching all the serial lines to that port would be done similarly:
-
-
-        ssl=port:9000
-
-
-
-
-  You access these devices by telnetting to that port.  Each active tel-
-  net session gets a different device.  If there are more telnets to a
-  port than UML devices attached to it, then the extra telnet sessions
-  will block until an existing telnet detaches, or until another device
-  becomes active (i.e. by being activated in /etc/inittab).
-
-  This channel has the advantage that you can both attach multiple UML
-  devices to it and know how to access them without reading the UML boot
-  log.  It is also unique in allowing access to a UML from remote
-  machines without requiring that the UML be networked.  This could be
-  useful in allowing public access to UMLs because they would be
-  accessible from the net, but wouldn't need any kind of network
-  filtering or access control because they would have no network access.
-
-
-  If you attach the main console to a portal, then the UML boot will
-  appear to hang.  In reality, it's waiting for a telnet to connect, at
-  which point the boot will proceed.
-
-
-
-
-
-  o  already-existing file descriptors - device=file descriptor
-
-
-     If you set up a file descriptor on the UML command line, you can
-     attach a UML device to it.  This is most commonly used to put the
-     main console back on stdin and stdout after assigning all the other
-     consoles to something else:
-
-
-        con0=fd:0,fd:1 con=pts
-
-
-
-
-
-
-
-
-  o  Nothing - device=null
-
-
-     This allows the device to be opened, in contrast to 'none', but
-     reads will block, and writes will succeed and the data will be
-     thrown out.
-
-
-
-
-
-  o  None - device=none
-
-
-     This causes the device to disappear.
-
-
-
-  You can also specify different input and output channels for a device
-  by putting a comma between them:
-
-
-        ssl3=tty:/dev/tty2,xterm
-
-
-
-
-  will cause serial line 3 to accept input on the host's /dev/tty2 and
-  display output on an xterm.  That's a silly example - the most common
-  use of this syntax is to reattach the main console to stdin and stdout
-  as shown above.
-
-
-  If you decide to move the main console away from stdin/stdout, the
-  initial boot output will appear in the terminal that you're running
-  UML in.  However, once the console driver has been officially
-  initialized, then the boot output will start appearing wherever you
-  specified that console 0 should be.  That device will receive all
-  subsequent output.
-
-
-
-  5.3.  Examples
-
-  There are a number of interesting things you can do with this
-  capability.
-
-
-  First, this is how you get rid of those bleeding console xterms by
-  attaching them to host ptys:
-
-
-        con=pty con0=fd:0,fd:1
-
-
-
-
-  This will make a UML console take over an unused host virtual console,
-  so that when you switch to it, you will see the UML login prompt
-  rather than the host login prompt:
-
-
-        con1=tty:/dev/tty6
-
-
-
-
-  You can attach two virtual machines together with what amounts to a
-  serial line as follows:
-
-  Run one UML with a serial line attached to a pty -
-
-
-        ssl1=pty
-
-
-
-
-  Look at the boot log to see what pty it got (this example will assume
-  that it got /dev/ptyp1).
-
-  Boot the other UML with a serial line attached to the corresponding
-  tty -
-
-
-        ssl1=tty:/dev/ttyp1
-
-
-
-
-  Log in, make sure that it has no getty on that serial line, attach a
-  terminal program like minicom to it, and you should see the login
-  prompt of the other virtual machine.
-
-
-  6.  Setting up the network
-
-
-
-  This page describes how to set up the various transports and to
-  provide a UML instance with network access to the host, other machines
-  on the local net, and the rest of the net.
-
-
-  As of 2.4.5, UML networking has been completely redone to make it much
-  easier to set up, fix bugs, and add new features.
-
-
-  There is a new helper, uml_net, which does the host setup that
-  requires root privileges.
-
-
-  There are currently five transport types available for a UML virtual
-  machine to exchange packets with other hosts:
-
-  o  ethertap
-
-  o  TUN/TAP
-
-  o  Multicast
-
-  o  a switch daemon
-
-  o  slip
-
-  o  slirp
-
-  o  pcap
-
-     The TUN/TAP, ethertap, slip, and slirp transports allow a UML
-     instance to exchange packets with the host.  They may be directed
-     to the host or the host may just act as a router to provide access
-     to other physical or virtual machines.
-
-
-  The pcap transport is a synthetic read-only interface, using the
-  libpcap binary to collect packets from interfaces on the host and
-  filter them.  This is useful for building preconfigured traffic
-  monitors or sniffers.
-
-
-  The daemon and multicast transports provide a completely virtual
-  network to other virtual machines.  This network is completely
-  disconnected from the physical network unless one of the virtual
-  machines on it is acting as a gateway.
-
-
-  With so many host transports, which one should you use?  Here's when
-  you should use each one:
-
-  o  ethertap - if you want access to the host networking and it is
-     running 2.2
-
-  o  TUN/TAP - if you want access to the host networking and it is
-     running 2.4.  Also, the TUN/TAP transport is able to use a
-     preconfigured device, allowing it to avoid using the setuid uml_net
-     helper, which is a security advantage.
-
-  o  Multicast - if you want a purely virtual network and you don't want
-     to set up anything but the UML
-
-  o  a switch daemon - if you want a purely virtual network and you
-     don't mind running the daemon in order to get somewhat better
-     performance
-
-  o  slip - there is no particular reason to run the slip backend unless
-     ethertap and TUN/TAP are just not available for some reason
-
-  o  slirp - if you don't have root access on the host to setup
-     networking, or if you don't want to allocate an IP to your UML
-
-  o  pcap - not much use for actual network connectivity, but great for
-     monitoring traffic on the host
-
-     Ethertap is available on 2.4 and works fine.  TUN/TAP is preferred
-     to it because it has better performance and ethertap is officially
-     considered obsolete in 2.4.  Also, the root helper only needs to
-     run occasionally for TUN/TAP, rather than handling every packet, as
-     it does with ethertap.  This is a slight security advantage since
-     it provides fewer opportunities for a nasty UML user to somehow
-     exploit the helper's root privileges.
-
-
-  6.1.  General setup
-
-  First, you must have the virtual network enabled in your UML.  If are
-  running a prebuilt kernel from this site, everything is already
-  enabled.  If you build the kernel yourself, under the "Network device
-  support" menu, enable "Network device support", and then the three
-  transports.
-
-
-  The next step is to provide a network device to the virtual machine.
-  This is done by describing it on the kernel command line.
-
-  The general format is
-
-
-       eth <n> = <transport> , <transport args>
-
-
-
-
-  For example, a virtual ethernet device may be attached to a host
-  ethertap device as follows:
-
-
-       eth0=ethertap,tap0,fe:fd:0:0:0:1,192.168.0.254
-
-
-
-
-  This sets up eth0 inside the virtual machine to attach itself to the
-  host /dev/tap0, assigns it an ethernet address, and assigns the host
-  tap0 interface an IP address.
-
-
-
-  Note that the IP address you assign to the host end of the tap device
-  must be different than the IP you assign to the eth device inside UML.
-  If you are short on IPs and don't want to consume two per UML, then
-  you can reuse the host's eth IP address for the host ends of the tap
-  devices.  Internally, the UMLs must still get unique IPs for their eth
-  devices.  You can also give the UMLs non-routable IPs (192.168.x.x or
-  10.x.x.x) and have the host masquerade them.  This will let outgoing
-  connections work, but incoming connections won't without more work,
-  such as port forwarding from the host.
-  Also note that when you configure the host side of an interface, it is
-  only acting as a gateway.  It will respond to pings sent to it
-  locally, but is not useful to do that since it's a host interface.
-  You are not talking to the UML when you ping that interface and get a
-  response.
-
-
-  You can also add devices to a UML and remove them at runtime.  See the
-  ``The Management Console''  page for details.
-
-
-  The sections below describe this in more detail.
-
-
-  Once you've decided how you're going to set up the devices, you boot
-  UML, log in, configure the UML side of the devices, and set up routes
-  to the outside world.  At that point, you will be able to talk to any
-  other machines, physical or virtual, on the net.
-
-
-  If ifconfig inside UML fails and the network refuses to come up, run
-  tell you what went wrong.
-
-
-
-  6.2.  Userspace daemons
-
-  You will likely need the setuid helper, or the switch daemon, or both.
-  They are both installed with the RPM and deb, so if you've installed
-  either, you can skip the rest of this section.
-
-
-  If not, then you need to check them out of CVS, build them, and
-  install them.  The helper is uml_net, in CVS /tools/uml_net, and the
-  daemon is uml_switch, in CVS /tools/uml_router.  They are both built
-  with a plain 'make'.  Both need to be installed in a directory that's
-  in your path - /usr/bin is recommend.  On top of that, uml_net needs
-  to be setuid root.
-
-
-
-  6.3.  Specifying ethernet addresses
-
-  Below, you will see that the TUN/TAP, ethertap, and daemon interfaces
-  allow you to specify hardware addresses for the virtual ethernet
-  devices.  This is generally not necessary.  If you don't have a
-  specific reason to do it, you probably shouldn't.  If one is not
-  specified on the command line, the driver will assign one based on the
-  device IP address.  It will provide the address fe:fd:nn:nn:nn:nn
-  where nn.nn.nn.nn is the device IP address.  This is nearly always
-  sufficient to guarantee a unique hardware address for the device.  A
-  couple of exceptions are:
-
-  o  Another set of virtual ethernet devices are on the same network and
-     they are assigned hardware addresses using a different scheme which
-     may conflict with the UML IP address-based scheme
-
-  o  You aren't going to use the device for IP networking, so you don't
-     assign the device an IP address
-
-     If you let the driver provide the hardware address, you should make
-     sure that the device IP address is known before the interface is
-     brought up.  So, inside UML, this will guarantee that:
-
-
-
-  UML#
-  ifconfig eth0 192.168.0.250 up
-
-
-
-
-  If you decide to assign the hardware address yourself, make sure that
-  the first byte of the address is even.  Addresses with an odd first
-  byte are broadcast addresses, which you don't want assigned to a
-  device.
-
-
-
-  6.4.  UML interface setup
-
-  Once the network devices have been described on the command line, you
-  should boot UML and log in.
-
-
-  The first thing to do is bring the interface up:
-
-
-       UML# ifconfig ethn ip-address up
-
-
-
-
-  You should be able to ping the host at this point.
-
-
-  To reach the rest of the world, you should set a default route to the
-  host:
-
-
-       UML# route add default gw host ip
-
-
-
-
-  Again, with host ip of 192.168.0.4:
-
-
-       UML# route add default gw 192.168.0.4
-
-
-
-
-  This page used to recommend setting a network route to your local net.
-  This is wrong, because it will cause UML to try to figure out hardware
-  addresses of the local machines by arping on the interface to the
-  host.  Since that interface is basically a single strand of ethernet
-  with two nodes on it (UML and the host) and arp requests don't cross
-  networks, they will fail to elicit any responses.  So, what you want
-  is for UML to just blindly throw all packets at the host and let it
-  figure out what to do with them, which is what leaving out the network
-  route and adding the default route does.
-
-
-  Note: If you can't communicate with other hosts on your physical
-  ethernet, it's probably because of a network route that's
-  automatically set up.  If you run 'route -n' and see a route that
-  looks like this:
-
-
-
-
-  Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
-  192.168.0.0     0.0.0.0         255.255.255.0   U     0      0      0   eth0
-
-
-
-
-  with a mask that's not 255.255.255.255, then replace it with a route
-  to your host:
-
-
-       UML#
-       route del -net 192.168.0.0 dev eth0 netmask 255.255.255.0
-
-
-
-
-
-
-       UML#
-       route add -host 192.168.0.4 dev eth0
-
-
-
-
-  This, plus the default route to the host, will allow UML to exchange
-  packets with any machine on your ethernet.
-
-
-
-  6.5.  Multicast
-
-  The simplest way to set up a virtual network between multiple UMLs is
-  to use the mcast transport.  This was written by Harald Welte and is
-  present in UML version 2.4.5-5um and later.  Your system must have
-  multicast enabled in the kernel and there must be a multicast-capable
-  network device on the host.  Normally, this is eth0, but if there is
-  no ethernet card on the host, then you will likely get strange error
-  messages when you bring the device up inside UML.
-
-
-  To use it, run two UMLs with
-
-
-        eth0=mcast
-
-
-
-
-  on their command lines.  Log in, configure the ethernet device in each
-  machine with different IP addresses:
-
-
-       UML1# ifconfig eth0 192.168.0.254
-
-
-
-
-
-
-       UML2# ifconfig eth0 192.168.0.253
-
-
-
-
-  and they should be able to talk to each other.
-
-  The full set of command line options for this transport are
-
-
-
-       ethn=mcast,ethernet address,multicast
-       address,multicast port,ttl
-
-
-
-
-  Harald's original README is here <http://user-mode-linux.source-
-  forge.net/>  and explains these in detail, as well as
-  some other issues.
-
-  There is also a related point-to-point only "ucast" transport.
-  This is useful when your network does not support multicast, and
-  all network connections are simple point to point links.
-
-  The full set of command line options for this transport are
-
-
-       ethn=ucast,ethernet address,remote address,listen port,remote port
-
-
-
-
-  6.6.  TUN/TAP with the uml_net helper
-
-  TUN/TAP is the preferred mechanism on 2.4 to exchange packets with the
-  host.  The TUN/TAP backend has been in UML since 2.4.9-3um.
-
-
-  The easiest way to get up and running is to let the setuid uml_net
-  helper do the host setup for you.  This involves insmod-ing the tun.o
-  module if necessary, configuring the device, and setting up IP
-  forwarding, routing, and proxy arp.  If you are new to UML networking,
-  do this first.  If you're concerned about the security implications of
-  the setuid helper, use it to get up and running, then read the next
-  section to see how to have UML use a preconfigured tap device, which
-  avoids the use of uml_net.
-
-
-  If you specify an IP address for the host side of the device, the
-  uml_net helper will do all necessary setup on the host - the only
-  requirement is that TUN/TAP be available, either built in to the host
-  kernel or as the tun.o module.
-
-  The format of the command line switch to attach a device to a TUN/TAP
-  device is
-
-
-       eth <n> =tuntap,,, <IP address>
-
-
-
-
-  For example, this argument will attach the UML's eth0 to the next
-  available tap device and assign an ethernet address to it based on its
-  IP address
-
-
-       eth0=tuntap,,,192.168.0.254
-
-
-
-
-
-
-  Note that the IP address that must be used for the eth device inside
-  UML is fixed by the routing and proxy arp that is set up on the
-  TUN/TAP device on the host.  You can use a different one, but it won't
-  work because reply packets won't reach the UML.  This is a feature.
-  It prevents a nasty UML user from doing things like setting the UML IP
-  to the same as the network's nameserver or mail server.
-
-
-  There are a couple potential problems with running the TUN/TAP
-  transport on a 2.4 host kernel
-
-  o  TUN/TAP seems not to work on 2.4.3 and earlier.  Upgrade the host
-     kernel or use the ethertap transport.
-
-  o  With an upgraded kernel, TUN/TAP may fail with
-
-
-       File descriptor in bad state
-
-
-
-
-  This is due to a header mismatch between the upgraded kernel and the
-  kernel that was originally installed on the machine.  The fix is to
-  make sure that /usr/src/linux points to the headers for the running
-  kernel.
-
-  These were pointed out by Tim Robinson <timro at trkr dot net> in
-  <http://www.geocrawler.com/> name="this uml-
-  user post"> .
-
-
-
-  6.7.  TUN/TAP with a preconfigured tap device
-
-  If you prefer not to have UML use uml_net (which is somewhat
-  insecure), with UML 2.4.17-11, you can set up a TUN/TAP device
-  beforehand.  The setup needs to be done as root, but once that's done,
-  there is no need for root assistance.  Setting up the device is done
-  as follows:
-
-  o  Create the device with tunctl (available from the UML utilities
-     tarball)
-
-
-
-
-       host#  tunctl -u uid
-
-
-
-
-  where uid is the user id or username that UML will be run as.  This
-  will tell you what device was created.
-
-  o  Configure the device IP (change IP addresses and device name to
-     suit)
-
-
-
-
-       host#  ifconfig tap0 192.168.0.254 up
-
-
-
-
-
-  o  Set up routing and arping if desired - this is my recipe, there are
-     other ways of doing the same thing
-
-
-       host#
-       bash -c 'echo 1 > /proc/sys/net/ipv4/ip_forward'
-
-       host#
-       route add -host 192.168.0.253 dev tap0
-
-
-
-
-
-
-       host#
-       bash -c 'echo 1 > /proc/sys/net/ipv4/conf/tap0/proxy_arp'
-
-
-
-
-
-
-       host#
-       arp -Ds 192.168.0.253 eth0 pub
-
-
-
-
-  Note that this must be done every time the host boots - this configu-
-  ration is not stored across host reboots.  So, it's probably a good
-  idea to stick it in an rc file.  An even better idea would be a little
-  utility which reads the information from a config file and sets up
-  devices at boot time.
-
-  o  Rather than using up two IPs and ARPing for one of them, you can
-     also provide direct access to your LAN by the UML by using a
-     bridge.
-
-
-       host#
-       brctl addbr br0
-
-
-
-
-
-
-       host#
-       ifconfig eth0 0.0.0.0 promisc up
-
-
-
-
-
-
-       host#
-       ifconfig tap0 0.0.0.0 promisc up
-
-
-
-
-
-
-       host#
-       ifconfig br0 192.168.0.1 netmask 255.255.255.0 up
-
-
-
-
-
-
-
-  host#
-  brctl stp br0 off
-
-
-
-
-
-
-       host#
-       brctl setfd br0 1
-
-
-
-
-
-
-       host#
-       brctl sethello br0 1
-
-
-
-
-
-
-       host#
-       brctl addif br0 eth0
-
-
-
-
-
-
-       host#
-       brctl addif br0 tap0
-
-
-
-
-  Note that 'br0' should be setup using ifconfig with the existing IP
-  address of eth0, as eth0 no longer has its own IP.
-
-  o
-
-
-     Also, the /dev/net/tun device must be writable by the user running
-     UML in order for the UML to use the device that's been configured
-     for it.  The simplest thing to do is
-
-
-       host#  chmod 666 /dev/net/tun
-
-
-
-
-  Making it world-writable looks bad, but it seems not to be
-  exploitable as a security hole.  However, it does allow anyone to cre-
-  ate useless tap devices (useless because they can't configure them),
-  which is a DOS attack.  A somewhat more secure alternative would to be
-  to create a group containing all the users who have preconfigured tap
-  devices and chgrp /dev/net/tun to that group with mode 664 or 660.
-
-
-  o  Once the device is set up, run UML with 'eth0=tuntap,device name'
-     (i.e. 'eth0=tuntap,tap0') on the command line (or do it with the
-     mconsole config command).
-
-  o  Bring the eth device up in UML and you're in business.
-
-     If you don't want that tap device any more, you can make it non-
-     persistent with
-
-
-       host#  tunctl -d tap device
-
-
-
-
-  Finally, tunctl has a -b (for brief mode) switch which causes it to
-  output only the name of the tap device it created.  This makes it
-  suitable for capture by a script:
-
-
-       host#  TAP=`tunctl -u 1000 -b`
-
-
-
-
-
-
-  6.8.  Ethertap
-
-  Ethertap is the general mechanism on 2.2 for userspace processes to
-  exchange packets with the kernel.
-
-
-
-  To use this transport, you need to describe the virtual network device
-  on the UML command line.  The general format for this is
-
-
-       eth <n> =ethertap, <device> , <ethernet address> , <tap IP address>
-
-
-
-
-  So, the previous example
-
-
-       eth0=ethertap,tap0,fe:fd:0:0:0:1,192.168.0.254
-
-
-
-
-  attaches the UML eth0 device to the host /dev/tap0, assigns it the
-  ethernet address fe:fd:0:0:0:1, and assigns the IP address
-  192.168.0.254 to the tap device.
-
-
-
-  The tap device is mandatory, but the others are optional.  If the
-  ethernet address is omitted, one will be assigned to it.
-
-
-  The presence of the tap IP address will cause the helper to run and do
-  whatever host setup is needed to allow the virtual machine to
-  communicate with the outside world.  If you're not sure you know what
-  you're doing, this is the way to go.
-
-
-  If it is absent, then you must configure the tap device and whatever
-  arping and routing you will need on the host.  However, even in this
-  case, the uml_net helper still needs to be in your path and it must be
-  setuid root if you're not running UML as root.  This is because the
-  tap device doesn't support SIGIO, which UML needs in order to use
-  something as a source of input.  So, the helper is used as a
-  convenient asynchronous IO thread.
-
-  If you're using the uml_net helper, you can ignore the following host
-  setup - uml_net will do it for you.  You just need to make sure you
-  have ethertap available, either built in to the host kernel or
-  available as a module.
-
-
-  If you want to set things up yourself, you need to make sure that the
-  appropriate /dev entry exists.  If it doesn't, become root and create
-  it as follows:
-
-
-       mknod /dev/tap <minor>  c 36  <minor>  + 16
-
-
-
-
-  For example, this is how to create /dev/tap0:
-
-
-       mknod /dev/tap0 c 36 0 + 16
-
-
-
-
-  You also need to make sure that the host kernel has ethertap support.
-  If ethertap is enabled as a module, you apparently need to insmod
-  ethertap once for each ethertap device you want to enable.  So,
-
-
-       host#
-       insmod ethertap
-
-
-
-
-  will give you the tap0 interface.  To get the tap1 interface, you need
-  to run
-
-
-       host#
-       insmod ethertap unit=1 -o ethertap1
-
-
-
-
-
-
-
-  6.9.  The switch daemon
-
-  Note: This is the daemon formerly known as uml_router, but which was
-  renamed so the network weenies of the world would stop growling at me.
-
-
-  The switch daemon, uml_switch, provides a mechanism for creating a
-  totally virtual network.  By default, it provides no connection to the
-  host network (but see -tap, below).
-
-
-  The first thing you need to do is run the daemon.  Running it with no
-  arguments will make it listen on a default pair of unix domain
-  sockets.
-
-
-  If you want it to listen on a different pair of sockets, use
-
-
-        -unix control socket data socket
-
-
-
-
-
-  If you want it to act as a hub rather than a switch, use
-
-
-        -hub
-
-
-
-
-
-  If you want the switch to be connected to host networking (allowing
-  the umls to get access to the outside world through the host), use
-
-
-        -tap tap0
-
-
-
-
-
-  Note that the tap device must be preconfigured (see "TUN/TAP with a
-  preconfigured tap device", above).  If you're using a different tap
-  device than tap0, specify that instead of tap0.
-
-
-  uml_switch can be backgrounded as follows
-
-
-       host%
-       uml_switch [ options ] < /dev/null > /dev/null
-
-
-
-
-  The reason it doesn't background by default is that it listens to
-  stdin for EOF.  When it sees that, it exits.
-
-
-  The general format of the kernel command line switch is
-
-
-
-       ethn=daemon,ethernet address,socket
-       type,control socket,data socket
-
-
-
-
-  You can leave off everything except the 'daemon'.  You only need to
-  specify the ethernet address if the one that will be assigned to it
-  isn't acceptable for some reason.  The rest of the arguments describe
-  how to communicate with the daemon.  You should only specify them if
-  you told the daemon to use different sockets than the default.  So, if
-  you ran the daemon with no arguments, running the UML on the same
-  machine with
-       eth0=daemon
-
-
-
-
-  will cause the eth0 driver to attach itself to the daemon correctly.
-
-
-
-  6.10.  Slip
-
-  Slip is another, less general, mechanism for a process to communicate
-  with the host networking.  In contrast to the ethertap interface,
-  which exchanges ethernet frames with the host and can be used to
-  transport any higher-level protocol, it can only be used to transport
-  IP.
-
-
-  The general format of the command line switch is
-
-
-
-       ethn=slip,slip IP
-
-
-
-
-  The slip IP argument is the IP address that will be assigned to the
-  host end of the slip device.  If it is specified, the helper will run
-  and will set up the host so that the virtual machine can reach it and
-  the rest of the network.
-
-
-  There are some oddities with this interface that you should be aware
-  of.  You should only specify one slip device on a given virtual
-  machine, and its name inside UML will be 'umn', not 'eth0' or whatever
-  you specified on the command line.  These problems will be fixed at
-  some point.
-
-
-
-  6.11.  Slirp
-
-  slirp uses an external program, usually /usr/bin/slirp, to provide IP
-  only networking connectivity through the host. This is similar to IP
-  masquerading with a firewall, although the translation is performed in
-  user-space, rather than by the kernel.  As slirp does not set up any
-  interfaces on the host, or changes routing, slirp does not require
-  root access or setuid binaries on the host.
-
-
-  The general format of the command line switch for slirp is:
-
-
-
-       ethn=slirp,ethernet address,slirp path
-
-
-
-
-  The ethernet address is optional, as UML will set up the interface
-  with an ethernet address based upon the initial IP address of the
-  interface.  The slirp path is generally /usr/bin/slirp, although it
-  will depend on distribution.
-
-
-  The slirp program can have a number of options passed to the command
-  line and we can't add them to the UML command line, as they will be
-  parsed incorrectly.  Instead, a wrapper shell script can be written or
-  the options inserted into the  /.slirprc file.  More information on
-  all of the slirp options can be found in its man pages.
-
-
-  The eth0 interface on UML should be set up with the IP 10.2.0.15,
-  although you can use anything as long as it is not used by a network
-  you will be connecting to. The default route on UML should be set to
-  use
-
-
-       UML#
-       route add default dev eth0
-
-
-
-
-  slirp provides a number of useful IP addresses which can be used by
-  UML, such as 10.0.2.3 which is an alias for the DNS server specified
-  in /etc/resolv.conf on the host or the IP given in the 'dns' option
-  for slirp.
-
-
-  Even with a baudrate setting higher than 115200, the slirp connection
-  is limited to 115200. If you need it to go faster, the slirp binary
-  needs to be compiled with FULL_BOLT defined in config.h.
-
-
-
-  6.12.  pcap
-
-  The pcap transport is attached to a UML ethernet device on the command
-  line or with uml_mconsole with the following syntax:
-
-
-
-       ethn=pcap,host interface,filter
-       expression,option1,option2
-
-
-
-
-  The expression and options are optional.
-
-
-  The interface is whatever network device on the host you want to
-  sniff.  The expression is a pcap filter expression, which is also what
-  tcpdump uses, so if you know how to specify tcpdump filters, you will
-  use the same expressions here.  The options are up to two of
-  'promisc', control whether pcap puts the host interface into
-  promiscuous mode. 'optimize' and 'nooptimize' control whether the pcap
-  expression optimizer is used.
-
-
-  Example:
-
-
-
-       eth0=pcap,eth0,tcp
-
-       eth1=pcap,eth0,!tcp
-
-
-
-  will cause the UML eth0 to emit all tcp packets on the host eth0 and
-  the UML eth1 to emit all non-tcp packets on the host eth0.
-
-
-
-  6.13.  Setting up the host yourself
-
-  If you don't specify an address for the host side of the ethertap or
-  slip device, UML won't do any setup on the host.  So this is what is
-  needed to get things working (the examples use a host-side IP of
-  192.168.0.251 and a UML-side IP of 192.168.0.250 - adjust to suit your
-  own network):
-
-  o  The device needs to be configured with its IP address.  Tap devices
-     are also configured with an mtu of 1484.  Slip devices are
-     configured with a point-to-point address pointing at the UML ip
-     address.
-
-
-       host#  ifconfig tap0 arp mtu 1484 192.168.0.251 up
-
-
-
-
-
-
-       host#
-       ifconfig sl0 192.168.0.251 pointopoint 192.168.0.250 up
-
-
-
-
-
-  o  If a tap device is being set up, a route is set to the UML IP.
-
-
-       UML# route add -host 192.168.0.250 gw 192.168.0.251
-
-
-
-
-
-  o  To allow other hosts on your network to see the virtual machine,
-     proxy arp is set up for it.
-
-
-       host#  arp -Ds 192.168.0.250 eth0 pub
-
-
-
-
-
-  o  Finally, the host is set up to route packets.
-
-
-       host#  echo 1 > /proc/sys/net/ipv4/ip_forward
-
-
-
-
-
-
-
-
-
-
-  7.  Sharing Filesystems between Virtual Machines
-
-
-
-
-  7.1.  A warning
-
-  Don't attempt to share filesystems simply by booting two UMLs from the
-  same file.  That's the same thing as booting two physical machines
-  from a shared disk.  It will result in filesystem corruption.
-
-
-
-  7.2.  Using layered block devices
-
-  The way to share a filesystem between two virtual machines is to use
-  the copy-on-write (COW) layering capability of the ubd block driver.
-  As of 2.4.6-2um, the driver supports layering a read-write private
-  device over a read-only shared device.  A machine's writes are stored
-  in the private device, while reads come from either device - the
-  private one if the requested block is valid in it, the shared one if
-  not.  Using this scheme, the majority of data which is unchanged is
-  shared between an arbitrary number of virtual machines, each of which
-  has a much smaller file containing the changes that it has made.  With
-  a large number of UMLs booting from a large root filesystem, this
-  leads to a huge disk space saving.  It will also help performance,
-  since the host will be able to cache the shared data using a much
-  smaller amount of memory, so UML disk requests will be served from the
-  host's memory rather than its disks.
-
-
-
-
-  To add a copy-on-write layer to an existing block device file, simply
-  add the name of the COW file to the appropriate ubd switch:
-
-
-        ubd0=root_fs_cow,root_fs_debian_22
-
-
-
-
-  where 'root_fs_cow' is the private COW file and 'root_fs_debian_22' is
-  the existing shared filesystem.  The COW file need not exist.  If it
-  doesn't, the driver will create and initialize it.  Once the COW file
-  has been initialized, it can be used on its own on the command line:
-
-
-        ubd0=root_fs_cow
-
-
-
-
-  The name of the backing file is stored in the COW file header, so it
-  would be redundant to continue specifying it on the command line.
-
-
-
-  7.3.  Note!
-
-  When checking the size of the COW file in order to see the gobs of
-  space that you're saving, make sure you use 'ls -ls' to see the actual
-  disk consumption rather than the length of the file.  The COW file is
-  sparse, so the length will be very different from the disk usage.
-  Here is a 'ls -l' of a COW file and backing file from one boot and
-  shutdown:
-       host% ls -l cow.debian debian2.2
-       -rw-r--r--    1 jdike    jdike    492504064 Aug  6 21:16 cow.debian
-       -rwxrw-rw-    1 jdike    jdike    537919488 Aug  6 20:42 debian2.2
-
-
-
-
-  Doesn't look like much saved space, does it?  Well, here's 'ls -ls':
-
-
-       host% ls -ls cow.debian debian2.2
-          880 -rw-r--r--    1 jdike    jdike    492504064 Aug  6 21:16 cow.debian
-       525832 -rwxrw-rw-    1 jdike    jdike    537919488 Aug  6 20:42 debian2.2
-
-
-
-
-  Now, you can see that the COW file has less than a meg of disk, rather
-  than 492 meg.
-
-
-
-  7.4.  Another warning
-
-  Once a filesystem is being used as a readonly backing file for a COW
-  file, do not boot directly from it or modify it in any way.  Doing so
-  will invalidate any COW files that are using it.  The mtime and size
-  of the backing file are stored in the COW file header at its creation,
-  and they must continue to match.  If they don't, the driver will
-  refuse to use the COW file.
-
-
-
-
-  If you attempt to evade this restriction by changing either the
-  backing file or the COW header by hand, you will get a corrupted
-  filesystem.
-
-
-
-
-  Among other things, this means that upgrading the distribution in a
-  backing file and expecting that all of the COW files using it will see
-  the upgrade will not work.
-
-
-
-
-  7.5.  uml_moo : Merging a COW file with its backing file
-
-  Depending on how you use UML and COW devices, it may be advisable to
-  merge the changes in the COW file into the backing file every once in
-  a while.
-
-
-
-
-  The utility that does this is uml_moo.  Its usage is
-
-
-       host% uml_moo COW file new backing file
-
-
-
-
-  There's no need to specify the backing file since that information is
-  already in the COW file header.  If you're paranoid, boot the new
-  merged file, and if you're happy with it, move it over the old backing
-  file.
-
-
-
-
-  uml_moo creates a new backing file by default as a safety measure.  It
-  also has a destructive merge option which will merge the COW file
-  directly into its current backing file.  This is really only usable
-  when the backing file only has one COW file associated with it.  If
-  there are multiple COWs associated with a backing file, a -d merge of
-  one of them will invalidate all of the others.  However, it is
-  convenient if you're short of disk space, and it should also be
-  noticeably faster than a non-destructive merge.
-
-
-
-
-  uml_moo is installed with the UML deb and RPM.  If you didn't install
-  UML from one of those packages, you can also get it from the UML
-  utilities <http://user-mode-linux.sourceforge.net/
-  utilities>  tar file in tools/moo.
-
-
-
-
-
-
-
-
-  8.  Creating filesystems
-
-
-  You may want to create and mount new UML filesystems, either because
-  your root filesystem isn't large enough or because you want to use a
-  filesystem other than ext2.
-
-
-  This was written on the occasion of reiserfs being included in the
-  2.4.1 kernel pool, and therefore the 2.4.1 UML, so the examples will
-  talk about reiserfs.  This information is generic, and the examples
-  should be easy to translate to the filesystem of your choice.
-
-
-  8.1.  Create the filesystem file
-
-  dd is your friend.  All you need to do is tell dd to create an empty
-  file of the appropriate size.  I usually make it sparse to save time
-  and to avoid allocating disk space until it's actually used.  For
-  example, the following command will create a sparse 100 meg file full
-  of zeroes.
-
-
-       host%
-       dd if=/dev/zero of=new_filesystem seek=100 count=1 bs=1M
-
-
-
-
-
-
-  8.2.  Assign the file to a UML device
-
-  Add an argument like the following to the UML command line:
-
-  ubd4=new_filesystem
-
-
-
-
-  making sure that you use an unassigned ubd device number.
-
-
-
-  8.3.  Creating and mounting the filesystem
-
-  Make sure that the filesystem is available, either by being built into
-  the kernel, or available as a module, then boot up UML and log in.  If
-  the root filesystem doesn't have the filesystem utilities (mkfs, fsck,
-  etc), then get them into UML by way of the net or hostfs.
-
-
-  Make the new filesystem on the device assigned to the new file:
-
-
-       host#  mkreiserfs /dev/ubd/4
-
-
-       <----------- MKREISERFSv2 ----------->
-
-       ReiserFS version 3.6.25
-       Block size 4096 bytes
-       Block count 25856
-       Used blocks 8212
-               Journal - 8192 blocks (18-8209), journal header is in block 8210
-               Bitmaps: 17
-               Root block 8211
-       Hash function "r5"
-       ATTENTION: ALL DATA WILL BE LOST ON '/dev/ubd/4'! (y/n)y
-       journal size 8192 (from 18)
-       Initializing journal - 0%....20%....40%....60%....80%....100%
-       Syncing..done.
-
-
-
-
-  Now, mount it:
-
-
-       UML#
-       mount /dev/ubd/4 /mnt
-
-
-
-
-  and you're in business.
-
-
-
-
-
-
-
-
-
-  9.  Host file access
-
-
-  If you want to access files on the host machine from inside UML, you
-  can treat it as a separate machine and either nfs mount directories
-  from the host or copy files into the virtual machine with scp or rcp.
-  However, since UML is running on the host, it can access those
-  files just like any other process and make them available inside the
-  virtual machine without needing to use the network.
-
-
-  This is now possible with the hostfs virtual filesystem.  With it, you
-  can mount a host directory into the UML filesystem and access the
-  files contained in it just as you would on the host.
-
-
-  9.1.  Using hostfs
-
-  To begin with, make sure that hostfs is available inside the virtual
-  machine with
-
-
-       UML# cat /proc/filesystems
-
-
-
-  .  hostfs should be listed.  If it's not, either rebuild the kernel
-  with hostfs configured into it or make sure that hostfs is built as a
-  module and available inside the virtual machine, and insmod it.
-
-
-  Now all you need to do is run mount:
-
-
-       UML# mount none /mnt/host -t hostfs
-
-
-
-
-  will mount the host's / on the virtual machine's /mnt/host.
-
-
-  If you don't want to mount the host root directory, then you can
-  specify a subdirectory to mount with the -o switch to mount:
-
-
-       UML# mount none /mnt/home -t hostfs -o /home
-
-
-
-
-  will mount the hosts's /home on the virtual machine's /mnt/home.
-
-
-
-  9.2.  hostfs as the root filesystem
-
-  It's possible to boot from a directory hierarchy on the host using
-  hostfs rather than using the standard filesystem in a file.
-
-  To start, you need that hierarchy.  The easiest way is to loop mount
-  an existing root_fs file:
-
-
-       host#  mount root_fs uml_root_dir -o loop
-
-
-
-
-  You need to change the filesystem type of / in etc/fstab to be
-  'hostfs', so that line looks like this:
-
-  /dev/ubd/0       /        hostfs      defaults          1   1
-
-
-
-
-  Then you need to chown to yourself all the files in that directory
-  that are owned by root.  This worked for me:
-
-
-       host#  find . -uid 0 -exec chown jdike {} \;
-
-
-
-
-  Next, make sure that your UML kernel has hostfs compiled in, not as a
-  module.  Then run UML with the boot device pointing at that directory:
-
-
-        ubd0=/path/to/uml/root/directory
-
-
-
-
-  UML should then boot as it does normally.
-
-
-  9.3.  Building hostfs
-
-  If you need to build hostfs because it's not in your kernel, you have
-  two choices:
-
-
-
-  o  Compiling hostfs into the kernel:
-
-
-     Reconfigure the kernel and set the 'Host filesystem' option under
-
-
-  o  Compiling hostfs as a module:
-
-
-     Reconfigure the kernel and set the 'Host filesystem' option under
-     be in arch/um/fs/hostfs/hostfs.o.  Install that in
-     /lib/modules/`uname -r`/fs in the virtual machine, boot it up, and
-
-
-       UML# insmod hostfs
-
-
-
-
-
-
-
-
-
-
-
-
-  10.  The Management Console
-
-
-
-  The UML management console is a low-level interface to the kernel,
-  somewhat like the i386 SysRq interface.  Since there is a full-blown
-  operating system under UML, there is much greater flexibility possible
-  than with the SysRq mechanism.
-
-
-  There are a number of things you can do with the mconsole interface:
-
-  o  get the kernel version
-
-  o  add and remove devices
-
-  o  halt or reboot the machine
-
-  o  Send SysRq commands
-
-  o  Pause and resume the UML
-
-
-  You need the mconsole client (uml_mconsole) which is present in CVS
-  (/tools/mconsole) in 2.4.5-9um and later, and will be in the RPM in
-  2.4.6.
-
-
-  You also need CONFIG_MCONSOLE (under 'General Setup') enabled in UML.
-  When you boot UML, you'll see a line like:
-
-
-       mconsole initialized on /home/jdike/.uml/umlNJ32yL/mconsole
-
-
-
-
-  If you specify a unique machine id one the UML command line, i.e.
-
-
-        umid=debian
-
-
-
-
-  you'll see this
-
-
-       mconsole initialized on /home/jdike/.uml/debian/mconsole
-
-
-
-
-  That file is the socket that uml_mconsole will use to communicate with
-  UML.  Run it with either the umid or the full path as its argument:
-
-
-       host% uml_mconsole debian
-
-
-
-
-  or
-
-
-       host% uml_mconsole /home/jdike/.uml/debian/mconsole
-
-
-
-
-  You'll get a prompt, at which you can run one of these commands:
-
-  o  version
-
-  o  halt
-
-  o  reboot
-
-  o  config
-
-  o  remove
-
-  o  sysrq
-
-  o  help
-
-  o  cad
-
-  o  stop
-
-  o  go
-
-
-  10.1.  version
-
-  This takes no arguments.  It prints the UML version.
-
-
-       (mconsole)  version
-       OK Linux usermode 2.4.5-9um #1 Wed Jun 20 22:47:08 EDT 2001 i686
-
-
-
-
-  There are a couple actual uses for this.  It's a simple no-op which
-  can be used to check that a UML is running.  It's also a way of
-  sending an interrupt to the UML.  This is sometimes useful on SMP
-  hosts, where there's a bug which causes signals to UML to be lost,
-  often causing it to appear to hang.  Sending such a UML the mconsole
-  version command is a good way to 'wake it up' before networking has
-  been enabled, as it does not do anything to the function of the UML.
-
-
-
-  10.2.  halt and reboot
-
-  These take no arguments.  They shut the machine down immediately, with
-  no syncing of disks and no clean shutdown of userspace.  So, they are
-  pretty close to crashing the machine.
-
-
-       (mconsole)  halt
-       OK
-
-
-
-
-
-
-  10.3.  config
-
-  "config" adds a new device to the virtual machine.  Currently the ubd
-  and network drivers support this.  It takes one argument, which is the
-  device to add, with the same syntax as the kernel command line.
-
-
-
-
-  (mconsole)
-  config ubd3=/home/jdike/incoming/roots/root_fs_debian22
-
-  OK
-  (mconsole)  config eth1=mcast
-  OK
-
-
-
-
-
-
-  10.4.  remove
-
-  "remove" deletes a device from the system.  Its argument is just the
-  name of the device to be removed. The device must be idle in whatever
-  sense the driver considers necessary.  In the case of the ubd driver,
-  the removed block device must not be mounted, swapped on, or otherwise
-  open, and in the case of the network driver, the device must be down.
-
-
-       (mconsole)  remove ubd3
-       OK
-       (mconsole)  remove eth1
-       OK
-
-
-
-
-
-
-  10.5.  sysrq
-
-  This takes one argument, which is a single letter.  It calls the
-  generic kernel's SysRq driver, which does whatever is called for by
-  that argument.  See the SysRq documentation in
-  Documentation/admin-guide/sysrq.rst in your favorite kernel tree to
-  see what letters are valid and what they do.
-
-
-
-  10.6.  help
-
-  "help" returns a string listing the valid commands and what each one
-  does.
-
-
-
-  10.7.  cad
-
-  This invokes the Ctl-Alt-Del action on init.  What exactly this ends
-  up doing is up to /etc/inittab.  Normally, it reboots the machine.
-  With UML, this is usually not desired, so if a halt would be better,
-  then find the section of inittab that looks like this
-
-
-       # What to do when CTRL-ALT-DEL is pressed.
-       ca:12345:ctrlaltdel:/sbin/shutdown -t1 -a -r now
-
-
-
-
-  and change the command to halt.
-
-
-
-  10.8.  stop
-
-  This puts the UML in a loop reading mconsole requests until a 'go'
-  mconsole command is received. This is very useful for making backups
-  of UML filesystems, as the UML can be stopped, then synced via 'sysrq
-  s', so that everything is written to the filesystem. You can then copy
-  the filesystem and then send the UML 'go' via mconsole.
-
-
-  Note that a UML running with more than one CPU will have problems
-  after you send the 'stop' command, as only one CPU will be held in a
-  mconsole loop and all others will continue as normal.  This is a bug,
-  and will be fixed.
-
-
-
-  10.9.  go
-
-  This resumes a UML after being paused by a 'stop' command. Note that
-  when the UML has resumed, TCP connections may have timed out and if
-  the UML is paused for a long period of time, crond might go a little
-  crazy, running all the jobs it didn't do earlier.
-
-
-
-
-
-
-
-
-  11.  Kernel debugging
-
-
-  Note: The interface that makes debugging, as described here, possible
-  is present in 2.4.0-test6 kernels and later.
-
-
-  Since the user-mode kernel runs as a normal Linux process, it is
-  possible to debug it with gdb almost like any other process.  It is
-  slightly different because the kernel's threads are already being
-  ptraced for system call interception, so gdb can't ptrace them.
-  However, a mechanism has been added to work around that problem.
-
-
-  In order to debug the kernel, you need build it from source.  See
-  ``Compiling the kernel and modules''  for information on doing that.
-  Make sure that you enable CONFIG_DEBUGSYM and CONFIG_PT_PROXY during
-  the config.  These will compile the kernel with -g, and enable the
-  ptrace proxy so that gdb works with UML, respectively.
-
-
-
-
-  11.1.  Starting the kernel under gdb
-
-  You can have the kernel running under the control of gdb from the
-  beginning by putting 'debug' on the command line.  You will get an
-  xterm with gdb running inside it.  The kernel will send some commands
-  to gdb which will leave it stopped at the beginning of start_kernel.
-  At this point, you can get things going with 'next', 'step', or
-  'cont'.
-
-
-  There is a transcript of a debugging session  here <debug-
-  session.html> , with breakpoints being set in the scheduler and in an
-  interrupt handler.
-  11.2.  Examining sleeping processes
-
-  Not every bug is evident in the currently running process.  Sometimes,
-  processes hang in the kernel when they shouldn't because they've
-  deadlocked on a semaphore or something similar.  In this case, when
-  you ^C gdb and get a backtrace, you will see the idle thread, which
-  isn't very relevant.
-
-
-  What you want is the stack of whatever process is sleeping when it
-  shouldn't be.  You need to figure out which process that is, which is
-  generally fairly easy.  Then you need to get its host process id,
-  which you can do either by looking at ps on the host or at
-  task.thread.extern_pid in gdb.
-
-
-  Now what you do is this:
-
-  o  detach from the current thread
-
-
-       (UML gdb)  det
-
-
-
-
-
-  o  attach to the thread you are interested in
-
-
-       (UML gdb)  att <host pid>
-
-
-
-
-
-  o  look at its stack and anything else of interest
-
-
-       (UML gdb)  bt
-
-
-
-
-  Note that you can't do anything at this point that requires that a
-  process execute, e.g. calling a function
-
-  o  when you're done looking at that process, reattach to the current
-     thread and continue it
-
-
-       (UML gdb)
-       att 1
-
-
-
-
-
-
-       (UML gdb)
-       c
-
-
-
-
-  Here, specifying any pid which is not the process id of a UML thread
-  will cause gdb to reattach to the current thread.  I commonly use 1,
-  but any other invalid pid would work.
-
-
-
-  11.3.  Running ddd on UML
-
-  ddd works on UML, but requires a special kludge.  The process goes
-  like this:
-
-  o  Start ddd
-
-
-       host% ddd linux
-
-
-
-
-
-  o  With ps, get the pid of the gdb that ddd started.  You can ask the
-     gdb to tell you, but for some reason that confuses things and
-     causes a hang.
-
-  o  run UML with 'debug=parent gdb-pid=<pid>' added to the command line
-     - it will just sit there after you hit return
-
-  o  type 'att 1' to the ddd gdb and you will see something like
-
-
-       0xa013dc51 in __kill ()
-
-
-       (gdb)
-
-
-
-
-
-  o  At this point, type 'c', UML will boot up, and you can use ddd just
-     as you do on any other process.
-
-
-
-  11.4.  Debugging modules
-
-  gdb has support for debugging code which is dynamically loaded into
-  the process.  This support is what is needed to debug kernel modules
-  under UML.
-
-
-  Using that support is somewhat complicated.  You have to tell gdb what
-  object file you just loaded into UML and where in memory it is.  Then,
-  it can read the symbol table, and figure out where all the symbols are
-  from the load address that you provided.  It gets more interesting
-  when you load the module again (i.e. after an rmmod).  You have to
-  tell gdb to forget about all its symbols, including the main UML ones
-  for some reason, then load then all back in again.
-
-
-  There's an easy way and a hard way to do this.  The easy way is to use
-  the umlgdb expect script written by Chandan Kudige.  It basically
-  automates the process for you.
-
-
-  First, you must tell it where your modules are.  There is a list in
-  the script that looks like this:
-       set MODULE_PATHS {
-       "fat" "/usr/src/uml/linux-2.4.18/fs/fat/fat.o"
-       "isofs" "/usr/src/uml/linux-2.4.18/fs/isofs/isofs.o"
-       "minix" "/usr/src/uml/linux-2.4.18/fs/minix/minix.o"
-       }
-
-
-
-
-  You change that to list the names and paths of the modules that you
-  are going to debug.  Then you run it from the toplevel directory of
-  your UML pool and it basically tells you what to do:
-
-
-
-
-                   ******** GDB pid is 21903 ********
-       Start UML as: ./linux <kernel switches> debug gdb-pid=21903
-
-
-
-       GNU gdb 5.0rh-5 Red Hat Linux 7.1
-       Copyright 2001 Free Software Foundation, Inc.
-       GDB is free software, covered by the GNU General Public License, and you are
-       welcome to change it and/or distribute copies of it under certain conditions.
-       Type "show copying" to see the conditions.
-       There is absolutely no warranty for GDB.  Type "show warranty" for details.
-       This GDB was configured as "i386-redhat-linux"...
-       (gdb) b sys_init_module
-       Breakpoint 1 at 0xa0011923: file module.c, line 349.
-       (gdb) att 1
-
-
-
-
-  After you run UML and it sits there doing nothing, you hit return at
-  the 'att 1' and continue it:
-
-
-       Attaching to program: /home/jdike/linux/2.4/um/./linux, process 1
-       0xa00f4221 in __kill ()
-       (UML gdb)  c
-       Continuing.
-
-
-
-
-  At this point, you debug normally.  When you insmod something, the
-  expect magic will kick in and you'll see something like:
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-   *** Module hostfs loaded ***
-  Breakpoint 1, sys_init_module (name_user=0x805abb0 "hostfs",
-      mod_user=0x8070e00) at module.c:349
-  349             char *name, *n_name, *name_tmp = NULL;
-  (UML gdb)  finish
-  Run till exit from #0  sys_init_module (name_user=0x805abb0 "hostfs",
-      mod_user=0x8070e00) at module.c:349
-  0xa00e2e23 in execute_syscall (r=0xa8140284) at syscall_kern.c:411
-  411             else res = EXECUTE_SYSCALL(syscall, regs);
-  Value returned is $1 = 0
-  (UML gdb)
-  p/x (int)module_list + module_list->size_of_struct
-
-  $2 = 0xa9021054
-  (UML gdb)  symbol-file ./linux
-  Load new symbol table from "./linux"? (y or n) y
-  Reading symbols from ./linux...
-  done.
-  (UML gdb)
-  add-symbol-file /home/jdike/linux/2.4/um/arch/um/fs/hostfs/hostfs.o 0xa9021054
-
-  add symbol table from file "/home/jdike/linux/2.4/um/arch/um/fs/hostfs/hostfs.o" at
-          .text_addr = 0xa9021054
-   (y or n) y
-
-  Reading symbols from /home/jdike/linux/2.4/um/arch/um/fs/hostfs/hostfs.o...
-  done.
-  (UML gdb)  p *module_list
-  $1 = {size_of_struct = 84, next = 0xa0178720, name = 0xa9022de0 "hostfs",
-    size = 9016, uc = {usecount = {counter = 0}, pad = 0}, flags = 1,
-    nsyms = 57, ndeps = 0, syms = 0xa9023170, deps = 0x0, refs = 0x0,
-    init = 0xa90221f0 <init_hostfs>, cleanup = 0xa902222c <exit_hostfs>,
-    ex_table_start = 0x0, ex_table_end = 0x0, persist_start = 0x0,
-    persist_end = 0x0, can_unload = 0, runsize = 0, kallsyms_start = 0x0,
-    kallsyms_end = 0x0,
-    archdata_start = 0x1b855 <Address 0x1b855 out of bounds>,
-    archdata_end = 0xe5890000 <Address 0xe5890000 out of bounds>,
-    kernel_data = 0xf689c35d <Address 0xf689c35d out of bounds>}
-  >> Finished loading symbols for hostfs ...
-
-
-
-
-  That's the easy way.  It's highly recommended.  The hard way is
-  described below in case you're interested in what's going on.
-
-
-  Boot the kernel under the debugger and load the module with insmod or
-  modprobe.  With gdb, do:
-
-
-       (UML gdb)  p module_list
-
-
-
-
-  This is a list of modules that have been loaded into the kernel, with
-  the most recently loaded module first.  Normally, the module you want
-  is at module_list.  If it's not, walk down the next links, looking at
-  the name fields until find the module you want to debug.  Take the
-  address of that structure, and add module.size_of_struct (which in
-  2.4.10 kernels is 96 (0x60)) to it.  Gdb can make this hard addition
-  for you :-):
-
-
-
-  (UML gdb)
-  printf "%#x\n", (int)module_list module_list->size_of_struct
-
-
-
-
-  The offset from the module start occasionally changes (before 2.4.0,
-  it was module.size_of_struct + 4), so it's a good idea to check the
-  init and cleanup addresses once in a while, as describe below.  Now
-  do:
-
-
-       (UML gdb)
-       add-symbol-file /path/to/module/on/host that_address
-
-
-
-
-  Tell gdb you really want to do it, and you're in business.
-
-
-  If there's any doubt that you got the offset right, like breakpoints
-  appear not to work, or they're appearing in the wrong place, you can
-  check it by looking at the module structure.  The init and cleanup
-  fields should look like:
-
-
-       init = 0x588066b0 <init_hostfs>, cleanup = 0x588066c0 <exit_hostfs>
-
-
-
-
-  with no offsets on the symbol names.  If the names are right, but they
-  are offset, then the offset tells you how much you need to add to the
-  address you gave to add-symbol-file.
-
-
-  When you want to load in a new version of the module, you need to get
-  gdb to forget about the old one.  The only way I've found to do that
-  is to tell gdb to forget about all symbols that it knows about:
-
-
-       (UML gdb)  symbol-file
-
-
-
-
-  Then reload the symbols from the kernel binary:
-
-
-       (UML gdb)  symbol-file /path/to/kernel
-
-
-
-
-  and repeat the process above.  You'll also need to re-enable break-
-  points.  They were disabled when you dumped all the symbols because
-  gdb couldn't figure out where they should go.
-
-
-
-  11.5.  Attaching gdb to the kernel
-
-  If you don't have the kernel running under gdb, you can attach gdb to
-  it later by sending the tracing thread a SIGUSR1.  The first line of
-  the console output identifies its pid:
-       tracing thread pid = 20093
-
-
-
-
-  When you send it the signal:
-
-
-       host% kill -USR1 20093
-
-
-
-
-  you will get an xterm with gdb running in it.
-
-
-  If you have the mconsole compiled into UML, then the mconsole client
-  can be used to start gdb:
-
-
-       (mconsole)  (mconsole) config gdb=xterm
-
-
-
-
-  will fire up an xterm with gdb running in it.
-
-
-
-  11.6.  Using alternate debuggers
-
-  UML has support for attaching to an already running debugger rather
-  than starting gdb itself.  This is present in CVS as of 17 Apr 2001.
-  I sent it to Alan for inclusion in the ac tree, and it will be in my
-  2.4.4 release.
-
-
-  This is useful when gdb is a subprocess of some UI, such as emacs or
-  ddd.  It can also be used to run debuggers other than gdb on UML.
-  Below is an example of using strace as an alternate debugger.
-
-
-  To do this, you need to get the pid of the debugger and pass it in
-  with the
-
-
-  If you are using gdb under some UI, then tell it to 'att 1', and
-  you'll find yourself attached to UML.
-
-
-  If you are using something other than gdb as your debugger, then
-  you'll need to get it to do the equivalent of 'att 1' if it doesn't do
-  it automatically.
-
-
-  An example of an alternate debugger is strace.  You can strace the
-  actual kernel as follows:
-
-  o  Run the following in a shell
-
-
-       host%
-       sh -c 'echo pid=$$; echo -n hit return; read x; exec strace -p 1 -o strace.out'
-
-
-
-  o  Run UML with 'debug' and 'gdb-pid=<pid>' with the pid printed out
-     by the previous command
-
-  o  Hit return in the shell, and UML will start running, and strace
-     output will start accumulating in the output file.
-
-     Note that this is different from running
-
-
-       host% strace ./linux
-
-
-
-
-  That will strace only the main UML thread, the tracing thread, which
-  doesn't do any of the actual kernel work.  It just oversees the vir-
-  tual machine.  In contrast, using strace as described above will show
-  you the low-level activity of the virtual machine.
-
-
-
-
-
-  12.  Kernel debugging examples
-
-  12.1.  The case of the hung fsck
-
-  When booting up the kernel, fsck failed, and dropped me into a shell
-  to fix things up.  I ran fsck -y, which hung:
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-  Setting hostname uml                    [ OK ]
-  Checking root filesystem
-  /dev/fhd0 was not cleanly unmounted, check forced.
-  Error reading block 86894 (Attempt to read block from filesystem resulted in short read) while reading indirect blocks of inode 19780.
-
-  /dev/fhd0: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
-          (i.e., without -a or -p options)
-  [ FAILED ]
-
-  *** An error occurred during the file system check.
-  *** Dropping you to a shell; the system will reboot
-  *** when you leave the shell.
-  Give root password for maintenance
-  (or type Control-D for normal startup):
-
-  [root@uml /root]# fsck -y /dev/fhd0
-  fsck -y /dev/fhd0
-  Parallelizing fsck version 1.14 (9-Jan-1999)
-  e2fsck 1.14, 9-Jan-1999 for EXT2 FS 0.5b, 95/08/09
-  /dev/fhd0 contains a file system with errors, check forced.
-  Pass 1: Checking inodes, blocks, and sizes
-  Error reading block 86894 (Attempt to read block from filesystem resulted in short read) while reading indirect blocks of inode 19780.  Ignore error? yes
-
-  Inode 19780, i_blocks is 1548, should be 540.  Fix? yes
-
-  Pass 2: Checking directory structure
-  Error reading block 49405 (Attempt to read block from filesystem resulted in short read).  Ignore error? yes
-
-  Directory inode 11858, block 0, offset 0: directory corrupted
-  Salvage? yes
-
-  Missing '.' in directory inode 11858.
-  Fix? yes
-
-  Missing '..' in directory inode 11858.
-  Fix? yes
-
-
-
-
-
-  The standard drill in this sort of situation is to fire up gdb on the
-  signal thread, which, in this case, was pid 1935.  In another window,
-  I run gdb and attach pid 1935.
-
-
-
-
-       ~/linux/2.3.26/um 1016: gdb linux
-       GNU gdb 4.17.0.11 with Linux support
-       Copyright 1998 Free Software Foundation, Inc.
-       GDB is free software, covered by the GNU General Public License, and you are
-       welcome to change it and/or distribute copies of it under certain conditions.
-       Type "show copying" to see the conditions.
-       There is absolutely no warranty for GDB.  Type "show warranty" for details.
-       This GDB was configured as "i386-redhat-linux"...
-
-       (gdb) att 1935
-       Attaching to program `/home/dike/linux/2.3.26/um/linux', Pid 1935
-       0x100756d9 in __wait4 ()
-
-
-
-
-
-
-  Let's see what's currently running:
-
-
-
-       (gdb) p current_task.pid
-       $1 = 0
-
-
-
-
-
-  It's the idle thread, which means that fsck went to sleep for some
-  reason and never woke up.
-
-
-  Let's guess that the last process in the process list is fsck:
-
-
-
-       (gdb) p current_task.prev_task.comm
-       $13 = "fsck.ext2\000\000\000\000\000\000"
-
-
-
-
-
-  It is, so let's see what it thinks it's up to:
-
-
-
-       (gdb) p current_task.prev_task.thread
-       $14 = {extern_pid = 1980, tracing = 0, want_tracing = 0, forking = 0,
-         kernel_stack_page = 0, signal_stack = 1342627840, syscall = {id = 4, args = {
-             3, 134973440, 1024, 0, 1024}, have_result = 0, result = 50590720},
-         request = {op = 2, u = {exec = {ip = 1350467584, sp = 2952789424}, fork = {
-               regs = {1350467584, 2952789424, 0 <repeats 15 times>}, sigstack = 0,
-               pid = 0}, switch_to = 0x507e8000, thread = {proc = 0x507e8000,
-               arg = 0xaffffdb0, flags = 0, new_pid = 0}, input_request = {
-               op = 1350467584, fd = -1342177872, proc = 0, pid = 0}}}}
-
-
-
-
-
-  The interesting things here are the fact that its .thread.syscall.id
-  is __NR_write (see the big switch in arch/um/kernel/syscall_kern.c or
-  the defines in include/asm-um/arch/unistd.h), and that it never
-  returned.  Also, its .request.op is OP_SWITCH (see
-  arch/um/include/user_util.h).  These mean that it went into a write,
-  and, for some reason, called schedule().
-
-
-  The fact that it never returned from write means that its stack should
-  be fairly interesting.  Its pid is 1980 (.thread.extern_pid).  That
-  process is being ptraced by the signal thread, so it must be detached
-  before gdb can attach it:
-
-
-
-
-
-
-
-
-
-
-  (gdb) call detach(1980)
-
-  Program received signal SIGSEGV, Segmentation fault.
-  <function called from gdb>
-  The program being debugged stopped while in a function called from GDB.
-  When the function (detach) is done executing, GDB will silently
-  stop (instead of continuing to evaluate the expression containing
-  the function call).
-  (gdb) call detach(1980)
-  $15 = 0
-
-
-
-
-
-  The first detach segfaults for some reason, and the second one
-  succeeds.
-
-
-  Now I detach from the signal thread, attach to the fsck thread, and
-  look at its stack:
-
-
-       (gdb) det
-       Detaching from program: /home/dike/linux/2.3.26/um/linux Pid 1935
-       (gdb) att 1980
-       Attaching to program `/home/dike/linux/2.3.26/um/linux', Pid 1980
-       0x10070451 in __kill ()
-       (gdb) bt
-       #0  0x10070451 in __kill ()
-       #1  0x10068ccd in usr1_pid (pid=1980) at process.c:30
-       #2  0x1006a03f in _switch_to (prev=0x50072000, next=0x507e8000)
-           at process_kern.c:156
-       #3  0x1006a052 in switch_to (prev=0x50072000, next=0x507e8000, last=0x50072000)
-           at process_kern.c:161
-       #4  0x10001d12 in schedule () at core.c:777
-       #5  0x1006a744 in __down (sem=0x507d241c) at semaphore.c:71
-       #6  0x1006aa10 in __down_failed () at semaphore.c:157
-       #7  0x1006c5d8 in segv_handler (sc=0x5006e940) at trap_user.c:174
-       #8  0x1006c5ec in kern_segv_handler (sig=11) at trap_user.c:182
-       #9  <signal handler called>
-       #10 0x10155404 in errno ()
-       #11 0x1006c0aa in segv (address=1342179328, is_write=2) at trap_kern.c:50
-       #12 0x1006c5d8 in segv_handler (sc=0x5006eaf8) at trap_user.c:174
-       #13 0x1006c5ec in kern_segv_handler (sig=11) at trap_user.c:182
-       #14 <signal handler called>
-       #15 0xc0fd in ?? ()
-       #16 0x10016647 in sys_write (fd=3,
-           buf=0x80b8800 <Address 0x80b8800 out of bounds>, count=1024)
-           at read_write.c:159
-       #17 0x1006d5b3 in execute_syscall (syscall=4, args=0x5006ef08)
-           at syscall_kern.c:254
-       #18 0x1006af87 in really_do_syscall (sig=12) at syscall_user.c:35
-       #19 <signal handler called>
-       #20 0x400dc8b0 in ?? ()
-
-
-
-
-
-  The interesting things here are :
-
-  o  There are two segfaults on this stack (frames 9 and 14)
-
-  o  The first faulting address (frame 11) is 0x50000800
-
-  (gdb) p (void *)1342179328
-  $16 = (void *) 0x50000800
-
-
-
-
-
-  The initial faulting address is interesting because it is on the idle
-  thread's stack.  I had been seeing the idle thread segfault for no
-  apparent reason, and the cause looked like stack corruption.  In hopes
-  of catching the culprit in the act, I had turned off all protections
-  to that stack while the idle thread wasn't running.  This apparently
-  tripped that trap.
-
-
-  However, the more immediate problem is that second segfault and I'm
-  going to concentrate on that.  First, I want to see where the fault
-  happened, so I have to go look at the sigcontent struct in frame 8:
-
-
-
-       (gdb) up
-       #1  0x10068ccd in usr1_pid (pid=1980) at process.c:30
-       30        kill(pid, SIGUSR1);
-       (gdb)
-       #2  0x1006a03f in _switch_to (prev=0x50072000, next=0x507e8000)
-           at process_kern.c:156
-       156       usr1_pid(getpid());
-       (gdb)
-       #3  0x1006a052 in switch_to (prev=0x50072000, next=0x507e8000, last=0x50072000)
-           at process_kern.c:161
-       161       _switch_to(prev, next);
-       (gdb)
-       #4  0x10001d12 in schedule () at core.c:777
-       777             switch_to(prev, next, prev);
-       (gdb)
-       #5  0x1006a744 in __down (sem=0x507d241c) at semaphore.c:71
-       71                      schedule();
-       (gdb)
-       #6  0x1006aa10 in __down_failed () at semaphore.c:157
-       157     }
-       (gdb)
-       #7  0x1006c5d8 in segv_handler (sc=0x5006e940) at trap_user.c:174
-       174       segv(sc->cr2, sc->err & 2);
-       (gdb)
-       #8  0x1006c5ec in kern_segv_handler (sig=11) at trap_user.c:182
-       182       segv_handler(sc);
-       (gdb) p *sc
-       Cannot access memory at address 0x0.
-
-
-
-
-  That's not very useful, so I'll try a more manual method:
-
-
-       (gdb) p *((struct sigcontext *) (&sig + 1))
-       $19 = {gs = 0, __gsh = 0, fs = 0, __fsh = 0, es = 43, __esh = 0, ds = 43,
-         __dsh = 0, edi = 1342179328, esi = 1350378548, ebp = 1342630440,
-         esp = 1342630420, ebx = 1348150624, edx = 1280, ecx = 0, eax = 0,
-         trapno = 14, err = 4, eip = 268480945, cs = 35, __csh = 0, eflags = 66118,
-         esp_at_signal = 1342630420, ss = 43, __ssh = 0, fpstate = 0x0, oldmask = 0,
-         cr2 = 1280}
-
-
-
-  The ip is in handle_mm_fault:
-
-
-       (gdb) p (void *)268480945
-       $20 = (void *) 0x1000b1b1
-       (gdb) i sym $20
-       handle_mm_fault + 57 in section .text
-
-
-
-
-
-  Specifically, it's in pte_alloc:
-
-
-       (gdb) i line *$20
-       Line 124 of "/home/dike/linux/2.3.26/um/include/asm/pgalloc.h"
-          starts at address 0x1000b1b1 <handle_mm_fault+57>
-          and ends at 0x1000b1b7 <handle_mm_fault+63>.
-
-
-
-
-
-  To find where in handle_mm_fault this is, I'll jump forward in the
-  code until I see an address in that procedure:
-
-
-
-       (gdb) i line *0x1000b1c0
-       Line 126 of "/home/dike/linux/2.3.26/um/include/asm/pgalloc.h"
-          starts at address 0x1000b1b7 <handle_mm_fault+63>
-          and ends at 0x1000b1c3 <handle_mm_fault+75>.
-       (gdb) i line *0x1000b1d0
-       Line 131 of "/home/dike/linux/2.3.26/um/include/asm/pgalloc.h"
-          starts at address 0x1000b1d0 <handle_mm_fault+88>
-          and ends at 0x1000b1da <handle_mm_fault+98>.
-       (gdb) i line *0x1000b1e0
-       Line 61 of "/home/dike/linux/2.3.26/um/include/asm/pgalloc.h"
-          starts at address 0x1000b1da <handle_mm_fault+98>
-          and ends at 0x1000b1e1 <handle_mm_fault+105>.
-       (gdb) i line *0x1000b1f0
-       Line 134 of "/home/dike/linux/2.3.26/um/include/asm/pgalloc.h"
-          starts at address 0x1000b1f0 <handle_mm_fault+120>
-          and ends at 0x1000b200 <handle_mm_fault+136>.
-       (gdb) i line *0x1000b200
-       Line 135 of "/home/dike/linux/2.3.26/um/include/asm/pgalloc.h"
-          starts at address 0x1000b200 <handle_mm_fault+136>
-          and ends at 0x1000b208 <handle_mm_fault+144>.
-       (gdb) i line *0x1000b210
-       Line 139 of "/home/dike/linux/2.3.26/um/include/asm/pgalloc.h"
-          starts at address 0x1000b210 <handle_mm_fault+152>
-          and ends at 0x1000b219 <handle_mm_fault+161>.
-       (gdb) i line *0x1000b220
-       Line 1168 of "memory.c" starts at address 0x1000b21e <handle_mm_fault+166>
-          and ends at 0x1000b222 <handle_mm_fault+170>.
-
-
-
-
-
-  Something is apparently wrong with the page tables or vma_structs, so
-  lets go back to frame 11 and have a look at them:
-
-
-
-  #11 0x1006c0aa in segv (address=1342179328, is_write=2) at trap_kern.c:50
-  50        handle_mm_fault(current, vma, address, is_write);
-  (gdb) call pgd_offset_proc(vma->vm_mm, address)
-  $22 = (pgd_t *) 0x80a548c
-
-
-
-
-
-  That's pretty bogus.  Page tables aren't supposed to be in process
-  text or data areas.  Let's see what's in the vma:
-
-
-       (gdb) p *vma
-       $23 = {vm_mm = 0x507d2434, vm_start = 0, vm_end = 134512640,
-         vm_next = 0x80a4f8c, vm_page_prot = {pgprot = 0}, vm_flags = 31200,
-         vm_avl_height = 2058, vm_avl_left = 0x80a8c94, vm_avl_right = 0x80d1000,
-         vm_next_share = 0xaffffdb0, vm_pprev_share = 0xaffffe63,
-         vm_ops = 0xaffffe7a, vm_pgoff = 2952789626, vm_file = 0xafffffec,
-         vm_private_data = 0x62}
-       (gdb) p *vma.vm_mm
-       $24 = {mmap = 0x507d2434, mmap_avl = 0x0, mmap_cache = 0x8048000,
-         pgd = 0x80a4f8c, mm_users = {counter = 0}, mm_count = {counter = 134904288},
-         map_count = 134909076, mmap_sem = {count = {counter = 135073792},
-           sleepers = -1342177872, wait = {lock = <optimized out or zero length>,
-             task_list = {next = 0xaffffe63, prev = 0xaffffe7a},
-             __magic = -1342177670, __creator = -1342177300}, __magic = 98},
-         page_table_lock = {}, context = 138, start_code = 0, end_code = 0,
-         start_data = 0, end_data = 0, start_brk = 0, brk = 0, start_stack = 0,
-         arg_start = 0, arg_end = 0, env_start = 0, env_end = 0, rss = 1350381536,
-         total_vm = 0, locked_vm = 0, def_flags = 0, cpu_vm_mask = 0, swap_cnt = 0,
-         swap_address = 0, segments = 0x0}
-
-
-
-
-
-  This also pretty bogus.  With all of the 0x80xxxxx and 0xaffffxxx
-  addresses, this is looking like a stack was plonked down on top of
-  these structures.  Maybe it's a stack overflow from the next page:
-
-
-
-       (gdb) p vma
-       $25 = (struct vm_area_struct *) 0x507d2434
-
-
-
-
-
-  That's towards the lower quarter of the page, so that would have to
-  have been pretty heavy stack overflow:
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-  (gdb) x/100x $25
-  0x507d2434:     0x507d2434      0x00000000      0x08048000      0x080a4f8c
-  0x507d2444:     0x00000000      0x080a79e0      0x080a8c94      0x080d1000
-  0x507d2454:     0xaffffdb0      0xaffffe63      0xaffffe7a      0xaffffe7a
-  0x507d2464:     0xafffffec      0x00000062      0x0000008a      0x00000000
-  0x507d2474:     0x00000000      0x00000000      0x00000000      0x00000000
-  0x507d2484:     0x00000000      0x00000000      0x00000000      0x00000000
-  0x507d2494:     0x00000000      0x00000000      0x507d2fe0      0x00000000
-  0x507d24a4:     0x00000000      0x00000000      0x00000000      0x00000000
-  0x507d24b4:     0x00000000      0x00000000      0x00000000      0x00000000
-  0x507d24c4:     0x00000000      0x00000000      0x00000000      0x00000000
-  0x507d24d4:     0x00000000      0x00000000      0x00000000      0x00000000
-  0x507d24e4:     0x00000000      0x00000000      0x00000000      0x00000000
-  0x507d24f4:     0x00000000      0x00000000      0x00000000      0x00000000
-  0x507d2504:     0x00000000      0x00000000      0x00000000      0x00000000
-  0x507d2514:     0x00000000      0x00000000      0x00000000      0x00000000
-  0x507d2524:     0x00000000      0x00000000      0x00000000      0x00000000
-  0x507d2534:     0x00000000      0x00000000      0x507d25dc      0x00000000
-  0x507d2544:     0x00000000      0x00000000      0x00000000      0x00000000
-  0x507d2554:     0x00000000      0x00000000      0x00000000      0x00000000
-  0x507d2564:     0x00000000      0x00000000      0x00000000      0x00000000
-  0x507d2574:     0x00000000      0x00000000      0x00000000      0x00000000
-  0x507d2584:     0x00000000      0x00000000      0x00000000      0x00000000
-  0x507d2594:     0x00000000      0x00000000      0x00000000      0x00000000
-  0x507d25a4:     0x00000000      0x00000000      0x00000000      0x00000000
-  0x507d25b4:     0x00000000      0x00000000      0x00000000      0x00000000
-
-
-
-
-
-  It's not stack overflow.  The only "stack-like" piece of this data is
-  the vma_struct itself.
-
-
-  At this point, I don't see any avenues to pursue, so I just have to
-  admit that I have no idea what's going on.  What I will do, though, is
-  stick a trap on the segfault handler which will stop if it sees any
-  writes to the idle thread's stack.  That was the thing that happened
-  first, and it may be that if I can catch it immediately, what's going
-  on will be somewhat clearer.
-
-
-  12.2.  Episode 2: The case of the hung fsck
-
-  After setting a trap in the SEGV handler for accesses to the signal
-  thread's stack, I reran the kernel.
-
-
-  fsck hung again, this time by hitting the trap:
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-  Setting hostname uml                            [ OK ]
-  Checking root filesystem
-  /dev/fhd0 contains a file system with errors, check forced.
-  Error reading block 86894 (Attempt to read block from filesystem resulted in short read) while reading indirect blocks of inode 19780.
-
-  /dev/fhd0: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
-          (i.e., without -a or -p options)
-  [ FAILED ]
-
-  *** An error occurred during the file system check.
-  *** Dropping you to a shell; the system will reboot
-  *** when you leave the shell.
-  Give root password for maintenance
-  (or type Control-D for normal startup):
-
-  [root@uml /root]# fsck -y /dev/fhd0
-  fsck -y /dev/fhd0
-  Parallelizing fsck version 1.14 (9-Jan-1999)
-  e2fsck 1.14, 9-Jan-1999 for EXT2 FS 0.5b, 95/08/09
-  /dev/fhd0 contains a file system with errors, check forced.
-  Pass 1: Checking inodes, blocks, and sizes
-  Error reading block 86894 (Attempt to read block from filesystem resulted in short read) while reading indirect blocks of inode 19780.  Ignore error? yes
-
-  Pass 2: Checking directory structure
-  Error reading block 49405 (Attempt to read block from filesystem resulted in short read).  Ignore error? yes
-
-  Directory inode 11858, block 0, offset 0: directory corrupted
-  Salvage? yes
-
-  Missing '.' in directory inode 11858.
-  Fix? yes
-
-  Missing '..' in directory inode 11858.
-  Fix? yes
-
-  Untested (4127) [100fe44c]: trap_kern.c line 31
-
-
-
-
-
-  I need to get the signal thread to detach from pid 4127 so that I can
-  attach to it with gdb.  This is done by sending it a SIGUSR1, which is
-  caught by the signal thread, which detaches the process:
-
-
-       kill -USR1 4127
-
-
-
-
-
-  Now I can run gdb on it:
-
-
-
-
-
-
-
-
-
-
-
-
-
-  ~/linux/2.3.26/um 1034: gdb linux
-  GNU gdb 4.17.0.11 with Linux support
-  Copyright 1998 Free Software Foundation, Inc.
-  GDB is free software, covered by the GNU General Public License, and you are
-  welcome to change it and/or distribute copies of it under certain conditions.
-  Type "show copying" to see the conditions.
-  There is absolutely no warranty for GDB.  Type "show warranty" for details.
-  This GDB was configured as "i386-redhat-linux"...
-  (gdb) att 4127
-  Attaching to program `/home/dike/linux/2.3.26/um/linux', Pid 4127
-  0x10075891 in __libc_nanosleep ()
-
-
-
-
-
-  The backtrace shows that it was in a write and that the fault address
-  (address in frame 3) is 0x50000800, which is right in the middle of
-  the signal thread's stack page:
-
-
-       (gdb) bt
-       #0  0x10075891 in __libc_nanosleep ()
-       #1  0x1007584d in __sleep (seconds=1000000)
-           at ../sysdeps/unix/sysv/linux/sleep.c:78
-       #2  0x1006ce9a in stop () at user_util.c:191
-       #3  0x1006bf88 in segv (address=1342179328, is_write=2) at trap_kern.c:31
-       #4  0x1006c628 in segv_handler (sc=0x5006eaf8) at trap_user.c:174
-       #5  0x1006c63c in kern_segv_handler (sig=11) at trap_user.c:182
-       #6  <signal handler called>
-       #7  0xc0fd in ?? ()
-       #8  0x10016647 in sys_write (fd=3, buf=0x80b8800 "R.", count=1024)
-           at read_write.c:159
-       #9  0x1006d603 in execute_syscall (syscall=4, args=0x5006ef08)
-           at syscall_kern.c:254
-       #10 0x1006af87 in really_do_syscall (sig=12) at syscall_user.c:35
-       #11 <signal handler called>
-       #12 0x400dc8b0 in ?? ()
-       #13 <signal handler called>
-       #14 0x400dc8b0 in ?? ()
-       #15 0x80545fd in ?? ()
-       #16 0x804daae in ?? ()
-       #17 0x8054334 in ?? ()
-       #18 0x804d23e in ?? ()
-       #19 0x8049632 in ?? ()
-       #20 0x80491d2 in ?? ()
-       #21 0x80596b5 in ?? ()
-       (gdb) p (void *)1342179328
-       $3 = (void *) 0x50000800
-
-
-
-
-
-  Going up the stack to the segv_handler frame and looking at where in
-  the code the access happened shows that it happened near line 110 of
-  block_dev.c:
-
-
-
-
-
-
-
-
-
-  (gdb) up
-  #1  0x1007584d in __sleep (seconds=1000000)
-      at ../sysdeps/unix/sysv/linux/sleep.c:78
-  ../sysdeps/unix/sysv/linux/sleep.c:78: No such file or directory.
-  (gdb)
-  #2  0x1006ce9a in stop () at user_util.c:191
-  191       while(1) sleep(1000000);
-  (gdb)
-  #3  0x1006bf88 in segv (address=1342179328, is_write=2) at trap_kern.c:31
-  31          KERN_UNTESTED();
-  (gdb)
-  #4  0x1006c628 in segv_handler (sc=0x5006eaf8) at trap_user.c:174
-  174       segv(sc->cr2, sc->err & 2);
-  (gdb) p *sc
-  $1 = {gs = 0, __gsh = 0, fs = 0, __fsh = 0, es = 43, __esh = 0, ds = 43,
-    __dsh = 0, edi = 1342179328, esi = 134973440, ebp = 1342631484,
-    esp = 1342630864, ebx = 256, edx = 0, ecx = 256, eax = 1024, trapno = 14,
-    err = 6, eip = 268550834, cs = 35, __csh = 0, eflags = 66070,
-    esp_at_signal = 1342630864, ss = 43, __ssh = 0, fpstate = 0x0, oldmask = 0,
-    cr2 = 1342179328}
-  (gdb) p (void *)268550834
-  $2 = (void *) 0x1001c2b2
-  (gdb) i sym $2
-  block_write + 1090 in section .text
-  (gdb) i line *$2
-  Line 209 of "/home/dike/linux/2.3.26/um/include/asm/arch/string.h"
-     starts at address 0x1001c2a1 <block_write+1073>
-     and ends at 0x1001c2bf <block_write+1103>.
-  (gdb) i line *0x1001c2c0
-  Line 110 of "block_dev.c" starts at address 0x1001c2bf <block_write+1103>
-     and ends at 0x1001c2e3 <block_write+1139>.
-
-
-
-
-
-  Looking at the source shows that the fault happened during a call to
-  copy_from_user to copy the data into the kernel:
-
-
-       107             count -= chars;
-       108             copy_from_user(p,buf,chars);
-       109             p += chars;
-       110             buf += chars;
-
-
-
-
-
-  p is the pointer which must contain 0x50000800, since buf contains
-  0x80b8800 (frame 8 above).  It is defined as:
-
-
-                       p = offset + bh->b_data;
-
-
-
-
-
-  I need to figure out what bh is, and it just so happens that bh is
-  passed as an argument to mark_buffer_uptodate and mark_buffer_dirty a
-  few lines later, so I do a little disassembly:
-
-
-
-
-  (gdb) disas 0x1001c2bf 0x1001c2e0
-  Dump of assembler code from 0x1001c2bf to 0x1001c2d0:
-  0x1001c2bf <block_write+1103>:  addl   %eax,0xc(%ebp)
-  0x1001c2c2 <block_write+1106>:  movl   0xfffffdd4(%ebp),%edx
-  0x1001c2c8 <block_write+1112>:  btsl   $0x0,0x18(%edx)
-  0x1001c2cd <block_write+1117>:  btsl   $0x1,0x18(%edx)
-  0x1001c2d2 <block_write+1122>:  sbbl   %ecx,%ecx
-  0x1001c2d4 <block_write+1124>:  testl  %ecx,%ecx
-  0x1001c2d6 <block_write+1126>:  jne    0x1001c2e3 <block_write+1139>
-  0x1001c2d8 <block_write+1128>:  pushl  $0x0
-  0x1001c2da <block_write+1130>:  pushl  %edx
-  0x1001c2db <block_write+1131>:  call   0x1001819c <__mark_buffer_dirty>
-  End of assembler dump.
-
-
-
-
-
-  At that point, bh is in %edx (address 0x1001c2da), which is calculated
-  at 0x1001c2c2 as %ebp + 0xfffffdd4, so I figure exactly what that is,
-  taking %ebp from the sigcontext_struct above:
-
-
-       (gdb) p (void *)1342631484
-       $5 = (void *) 0x5006ee3c
-       (gdb) p 0x5006ee3c+0xfffffdd4
-       $6 = 1342630928
-       (gdb) p (void *)$6
-       $7 = (void *) 0x5006ec10
-       (gdb) p *((void **)$7)
-       $8 = (void *) 0x50100200
-
-
-
-
-
-  Now, I look at the structure to see what's in it, and particularly,
-  what its b_data field contains:
-
-
-       (gdb) p *((struct buffer_head *)0x50100200)
-       $13 = {b_next = 0x50289380, b_blocknr = 49405, b_size = 1024, b_list = 0,
-         b_dev = 15872, b_count = {counter = 1}, b_rdev = 15872, b_state = 24,
-         b_flushtime = 0, b_next_free = 0x501001a0, b_prev_free = 0x50100260,
-         b_this_page = 0x501001a0, b_reqnext = 0x0, b_pprev = 0x507fcf58,
-         b_data = 0x50000800 "", b_page = 0x50004000,
-         b_end_io = 0x10017f60 <end_buffer_io_sync>, b_dev_id = 0x0,
-         b_rsector = 98810, b_wait = {lock = <optimized out or zero length>,
-           task_list = {next = 0x50100248, prev = 0x50100248}, __magic = 1343226448,
-           __creator = 0}, b_kiobuf = 0x0}
-
-
-
-
-
-  The b_data field is indeed 0x50000800, so the question becomes how
-  that happened.  The rest of the structure looks fine, so this probably
-  is not a case of data corruption.  It happened on purpose somehow.
-
-
-  The b_page field is a pointer to the page_struct representing the
-  0x50000000 page.  Looking at it shows the kernel's idea of the state
-  of that page:
-
-
-
-  (gdb) p *$13.b_page
-  $17 = {list = {next = 0x50004a5c, prev = 0x100c5174}, mapping = 0x0,
-    index = 0, next_hash = 0x0, count = {counter = 1}, flags = 132, lru = {
-      next = 0x50008460, prev = 0x50019350}, wait = {
-      lock = <optimized out or zero length>, task_list = {next = 0x50004024,
-        prev = 0x50004024}, __magic = 1342193708, __creator = 0},
-    pprev_hash = 0x0, buffers = 0x501002c0, virtual = 1342177280,
-    zone = 0x100c5160}
-
-
-
-
-
-  Some sanity-checking: the virtual field shows the "virtual" address of
-  this page, which in this kernel is the same as its "physical" address,
-  and the page_struct itself should be mem_map[0], since it represents
-  the first page of memory:
-
-
-
-       (gdb) p (void *)1342177280
-       $18 = (void *) 0x50000000
-       (gdb) p mem_map
-       $19 = (mem_map_t *) 0x50004000
-
-
-
-
-
-  These check out fine.
-
-
-  Now to check out the page_struct itself.  In particular, the flags
-  field shows whether the page is considered free or not:
-
-
-       (gdb) p (void *)132
-       $21 = (void *) 0x84
-
-
-
-
-
-  The "reserved" bit is the high bit, which is definitely not set, so
-  the kernel considers the signal stack page to be free and available to
-  be used.
-
-
-  At this point, I jump to conclusions and start looking at my early
-  boot code, because that's where that page is supposed to be reserved.
-
-
-  In my setup_arch procedure, I have the following code which looks just
-  fine:
-
-
-
-       bootmap_size = init_bootmem(start_pfn, end_pfn - start_pfn);
-       free_bootmem(__pa(low_physmem) + bootmap_size, high_physmem - low_physmem);
-
-
-
-
-
-  Two stack pages have already been allocated, and low_physmem points to
-  the third page, which is the beginning of free memory.
-  The init_bootmem call declares the entire memory to the boot memory
-  manager, which marks it all reserved.  The free_bootmem call frees up
-  all of it, except for the first two pages.  This looks correct to me.
-
-
-  So, I decide to see init_bootmem run and make sure that it is marking
-  those first two pages as reserved.  I never get that far.
-
-
-  Stepping into init_bootmem, and looking at bootmem_map before looking
-  at what it contains shows the following:
-
-
-
-       (gdb) p bootmem_map
-       $3 = (void *) 0x50000000
-
-
-
-
-
-  Aha!  The light dawns.  That first page is doing double duty as a
-  stack and as the boot memory map.  The last thing that the boot memory
-  manager does is to free the pages used by its memory map, so this page
-  is getting freed even its marked as reserved.
-
-
-  The fix was to initialize the boot memory manager before allocating
-  those two stack pages, and then allocate them through the boot memory
-  manager.  After doing this, and fixing a couple of subsequent buglets,
-  the stack corruption problem disappeared.
-
-
-
-
-
-  13.  What to do when UML doesn't work
-
-
-
-
-  13.1.  Strange compilation errors when you build from source
-
-  As of test11, it is necessary to have "ARCH=um" in the environment or
-  on the make command line for all steps in building UML, including
-  clean, distclean, or mrproper, config, menuconfig, or xconfig, dep,
-  and linux.  If you forget for any of them, the i386 build seems to
-  contaminate the UML build.  If this happens, start from scratch with
-
-
-       host%
-       make mrproper ARCH=um
-
-
-
-
-  and repeat the build process with ARCH=um on all the steps.
-
-
-  See ``Compiling the kernel and modules''  for more details.
-
-
-  Another cause of strange compilation errors is building UML in
-  /usr/src/linux.  If you do this, the first thing you need to do is
-  clean up the mess you made.  The /usr/src/linux/asm link will now
-  point to /usr/src/linux/asm-um.  Make it point back to
-  /usr/src/linux/asm-i386.  Then, move your UML pool someplace else and
-  build it there.  Also see below, where a more specific set of symptoms
-  is described.
-
-
-
-  13.3.  A variety of panics and hangs with /tmp on a reiserfs  filesys-
-  tem
-
-  I saw this on reiserfs 3.5.21 and it seems to be fixed in 3.5.27.
-  Panics preceded by
-
-
-       Detaching pid nnnn
-
-
-
-  are diagnostic of this problem.  This is a reiserfs bug which causes a
-  thread to occasionally read stale data from a mmapped page shared with
-  another thread.  The fix is to upgrade the filesystem or to have /tmp
-  be an ext2 filesystem.
-
-
-
-  13.4.  The compile fails with errors about conflicting types for
-  'open', 'dup', and 'waitpid'
-
-  This happens when you build in /usr/src/linux.  The UML build makes
-  the include/asm link point to include/asm-um.  /usr/include/asm points
-  to /usr/src/linux/include/asm, so when that link gets moved, files
-  which need to include the asm-i386 versions of headers get the
-  incompatible asm-um versions.  The fix is to move the include/asm link
-  back to include/asm-i386 and to do UML builds someplace else.
-
-
-
-  13.5.  UML doesn't work when /tmp is an NFS filesystem
-
-  This seems to be a similar situation with the ReiserFS problem above.
-  Some versions of NFS seems not to handle mmap correctly, which UML
-  depends on.  The workaround is have /tmp be a non-NFS directory.
-
-
-  13.6.  UML hangs on boot when compiled with gprof support
-
-  If you build UML with gprof support and, early in the boot, it does
-  this
-
-
-       kernel BUG at page_alloc.c:100!
-
-
-
-
-  you have a buggy gcc.  You can work around the problem by removing
-  UM_FASTCALL from CFLAGS in arch/um/Makefile-i386.  This will open up
-  another bug, but that one is fairly hard to reproduce.
-
-
-
-  13.7.  syslogd dies with a SIGTERM on startup
-
-  The exact boot error depends on the distribution that you're booting,
-  but Debian produces this:
-
-
-       /etc/rc2.d/S10sysklogd: line 49:    93 Terminated
-       start-stop-daemon --start --quiet --exec /sbin/syslogd -- $SYSLOGD
-
-
-
-
-  This is a syslogd bug.  There's a race between a parent process
-  installing a signal handler and its child sending the signal.  See
-  this uml-devel post <http://www.geocrawler.com/lists/3/Source-
-  Forge/709/0/6612801>  for the details.
-
-
-
-  13.8.  TUN/TAP networking doesn't work on a 2.4 host
-
-  There are a couple of problems which were
-  <http://www.geocrawler.com/lists/3/SourceForge/597/0/> name="pointed
-  out">  by Tim Robinson <timro at trkr dot net>
-
-  o  It doesn't work on hosts running 2.4.7 (or thereabouts) or earlier.
-     The fix is to upgrade to something more recent and then read the
-     next item.
-
-  o  If you see
-
-
-       File descriptor in bad state
-
-
-
-  when you bring up the device inside UML, you have a header mismatch
-  between the original kernel and the upgraded one.  Make /usr/src/linux
-  point at the new headers.  This will only be a problem if you build
-  uml_net yourself.
-
-
-
-  13.9.  You can network to the host but not to other machines on the
-  net
-
-  If you can connect to the host, and the host can connect to UML, but
-  you cannot connect to any other machines, then you may need to enable
-  IP Masquerading on the host.  Usually this is only experienced when
-  using private IP addresses (192.168.x.x or 10.x.x.x) for host/UML
-  networking, rather than the public address space that your host is
-  connected to.  UML does not enable IP Masquerading, so you will need
-  to create a static rule to enable it:
-
-
-       host%
-       iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
-
-
-
-
-  Replace eth0 with the interface that you use to talk to the rest of
-  the world.
-
-
-  Documentation on IP Masquerading, and SNAT, can be found at
-  www.netfilter.org  <http://www.netfilter.org> .
-
-
-  If you can reach the local net, but not the outside Internet, then
-  that is usually a routing problem.  The UML needs a default route:
-
-
-       UML#
-       route add default gw gateway IP
-
-
-
-
-  The gateway IP can be any machine on the local net that knows how to
-  reach the outside world.  Usually, this is the host or the local net-
-  work's gateway.
-
-
-  Occasionally, we hear from someone who can reach some machines, but
-  not others on the same net, or who can reach some ports on other
-  machines, but not others.  These are usually caused by strange
-  firewalling somewhere between the UML and the other box.  You track
-  this down by running tcpdump on every interface the packets travel
-  over and see where they disappear.  When you find a machine that takes
-  the packets in, but does not send them onward, that's the culprit.
-
-
-
-  13.10.  I have no root and I want to scream
-
-  Thanks to Birgit Wahlich for telling me about this strange one.  It
-  turns out that there's a limit of six environment variables on the
-  kernel command line.  When that limit is reached or exceeded, argument
-  processing stops, which means that the 'root=' argument that UML
-  usually adds is not seen.  So, the filesystem has no idea what the
-  root device is, so it panics.
-
-
-  The fix is to put less stuff on the command line.  Glomming all your
-  setup variables into one is probably the best way to go.
-
-
-
-  13.11.  UML build conflict between ptrace.h and ucontext.h
-
-  On some older systems, /usr/include/asm/ptrace.h and
-  /usr/include/sys/ucontext.h define the same names.  So, when they're
-  included together, the defines from one completely mess up the parsing
-  of the other, producing errors like:
-       /usr/include/sys/ucontext.h:47: parse error before
-       `10'
-
-
-
-
-  plus a pile of warnings.
-
-
-  This is a libc botch, which has since been fixed, and I don't see any
-  way around it besides upgrading.
-
-
-
-  13.12.  The UML BogoMips is exactly half the host's BogoMips
-
-  On i386 kernels, there are two ways of running the loop that is used
-  to calculate the BogoMips rating, using the TSC if it's there or using
-  a one-instruction loop.  The TSC produces twice the BogoMips as the
-  loop.  UML uses the loop, since it has nothing resembling a TSC, and
-  will get almost exactly the same BogoMips as a host using the loop.
-  However, on a host with a TSC, its BogoMips will be double the loop
-  BogoMips, and therefore double the UML BogoMips.
-
-
-
-  13.13.  When you run UML, it immediately segfaults
-
-  If the host is configured with the 2G/2G address space split, that's
-  why.  See ``UML on 2G/2G hosts''  for the details on getting UML to
-  run on your host.
-
-
-
-  13.14.  xterms appear, then immediately disappear
-
-  If you're running an up to date kernel with an old release of
-  uml_utilities, the port-helper program will not work properly, so
-  xterms will exit straight after they appear. The solution is to
-  upgrade to the latest release of uml_utilities.  Usually this problem
-  occurs when you have installed a packaged release of UML then compiled
-  your own development kernel without upgrading the uml_utilities from
-  the source distribution.
-
-
-
-  13.15.  Any other panic, hang, or strange behavior
-
-  If you're seeing truly strange behavior, such as hangs or panics that
-  happen in random places, or you try running the debugger to see what's
-  happening and it acts strangely, then it could be a problem in the
-  host kernel.  If you're not running a stock Linus or -ac kernel, then
-  try that.  An early version of the preemption patch and a 2.4.10 SuSE
-  kernel have caused very strange problems in UML.
-
-
-  Otherwise, let me know about it.  Send a message to one of the UML
-  mailing lists - either the developer list - user-mode-linux-devel at
-  lists dot sourceforge dot net (subscription info) or the user list -
-  user-mode-linux-user at lists dot sourceforge do net (subscription
-  info), whichever you prefer.  Don't assume that everyone knows about
-  it and that a fix is imminent.
-
-
-  If you want to be super-helpful, read ``Diagnosing Problems'' and
-  follow the instructions contained therein.
-  14.  Diagnosing Problems
-
-
-  If you get UML to crash, hang, or otherwise misbehave, you should
-  report this on one of the project mailing lists, either the developer
-  list - user-mode-linux-devel at lists dot sourceforge dot net
-  (subscription info) or the user list - user-mode-linux-user at lists
-  dot sourceforge dot net (subscription info).  When you do, it is
-  likely that I will want more information.  So, it would be helpful to
-  read the stuff below, do whatever is applicable in your case, and
-  report the results to the list.
-
-
-  For any diagnosis, you're going to need to build a debugging kernel.
-  The binaries from this site aren't debuggable.  If you haven't done
-  this before, read about ``Compiling the kernel and modules''  and
-  ``Kernel debugging''  UML first.
-
-
-  14.1.  Case 1 : Normal kernel panics
-
-  The most common case is for a normal thread to panic.  To debug this,
-  you will need to run it under the debugger (add 'debug' to the command
-  line).  An xterm will start up with gdb running inside it.  Continue
-  it when it stops in start_kernel and make it crash.  Now ^C gdb and
-
-
-  If the panic was a "Kernel mode fault", then there will be a segv
-  frame on the stack and I'm going to want some more information.  The
-  stack might look something like this:
-
-
-       (UML gdb)  backtrace
-       #0  0x1009bf76 in __sigprocmask (how=1, set=0x5f347940, oset=0x0)
-           at ../sysdeps/unix/sysv/linux/sigprocmask.c:49
-       #1  0x10091411 in change_sig (signal=10, on=1) at process.c:218
-       #2  0x10094785 in timer_handler (sig=26) at time_kern.c:32
-       #3  0x1009bf38 in __restore ()
-           at ../sysdeps/unix/sysv/linux/i386/sigaction.c:125
-       #4  0x1009534c in segv (address=8, ip=268849158, is_write=2, is_user=0)
-           at trap_kern.c:66
-       #5  0x10095c04 in segv_handler (sig=11) at trap_user.c:285
-       #6  0x1009bf38 in __restore ()
-
-
-
-
-  I'm going to want to see the symbol and line information for the value
-  of ip in the segv frame.  In this case, you would do the following:
-
-
-       (UML gdb)  i sym 268849158
-
-
-
-
-  and
-
-
-       (UML gdb)  i line *268849158
-
-
-
-
-  The reason for this is the __restore frame right above the segv_han-
-  dler frame is hiding the frame that actually segfaulted.  So, I have
-  to get that information from the faulting ip.
-
-
-  14.2.  Case 2 : Tracing thread panics
-
-  The less common and more painful case is when the tracing thread
-  panics.  In this case, the kernel debugger will be useless because it
-  needs a healthy tracing thread in order to work.  The first thing to
-  do is get a backtrace from the tracing thread.  This is done by
-  figuring out what its pid is, firing up gdb, and attaching it to that
-  pid.  You can figure out the tracing thread pid by looking at the
-  first line of the console output, which will look like this:
-
-
-       tracing thread pid = 15851
-
-
-
-
-  or by running ps on the host and finding the line that looks like
-  this:
-
-
-       jdike 15851 4.5 0.4 132568 1104 pts/0 S 21:34 0:05 ./linux [(tracing thread)]
-
-
-
-
-  If the panic was 'segfault in signals', then follow the instructions
-  above for collecting information about the location of the seg fault.
-
-
-  If the tracing thread flaked out all by itself, then send that
-  backtrace in and wait for our crack debugging team to fix the problem.
-
-
-  14.3.  Case 3 : Tracing thread panics caused by other threads
-
-  However, there are cases where the misbehavior of another thread
-  caused the problem.  The most common panic of this type is:
-
-
-       wait_for_stop failed to wait for  <pid>  to stop with  <signal number>
-
-
-
-
-  In this case, you'll need to get a backtrace from the process men-
-  tioned in the panic, which is complicated by the fact that the kernel
-  debugger is defunct and without some fancy footwork, another gdb can't
-  attach to it.  So, this is how the fancy footwork goes:
-
-  In a shell:
-
-
-       host% kill -STOP pid
-
-
-
-
-  Run gdb on the tracing thread as described in case 2 and do:
-
-
-       (host gdb)  call detach(pid)
-
-
-  If you get a segfault, do it again.  It always works the second time.
-
-  Detach from the tracing thread and attach to that other thread:
-
-
-       (host gdb)  detach
-
-
-
-
-
-
-       (host gdb)  attach pid
-
-
-
-
-  If gdb hangs when attaching to that process, go back to a shell and
-  do:
-
-
-       host%
-       kill -CONT pid
-
-
-
-
-  And then get the backtrace:
-
-
-       (host gdb)  backtrace
-
-
-
-
-
-  14.4.  Case 4 : Hangs
-
-  Hangs seem to be fairly rare, but they sometimes happen.  When a hang
-  happens, we need a backtrace from the offending process.  Run the
-  kernel debugger as described in case 1 and get a backtrace.  If the
-  current process is not the idle thread, then send in the backtrace.
-  You can tell that it's the idle thread if the stack looks like this:
-
-
-       #0  0x100b1401 in __libc_nanosleep ()
-       #1  0x100a2885 in idle_sleep (secs=10) at time.c:122
-       #2  0x100a546f in do_idle () at process_kern.c:445
-       #3  0x100a5508 in cpu_idle () at process_kern.c:471
-       #4  0x100ec18f in start_kernel () at init/main.c:592
-       #5  0x100a3e10 in start_kernel_proc (unused=0x0) at um_arch.c:71
-       #6  0x100a383f in signal_tramp (arg=0x100a3dd8) at trap_user.c:50
-
-
-
-
-  If this is the case, then some other process is at fault, and went to
-  sleep when it shouldn't have.  Run ps on the host and figure out which
-  process should not have gone to sleep and stayed asleep.  Then attach
-  to it with gdb and get a backtrace as described in case 3.
-
-
-
-
-
-
-  15.  Thanks
-
-
-  A number of people have helped this project in various ways, and this
-  page gives recognition where recognition is due.
-
-
-  If you're listed here and you would prefer a real link on your name,
-  or no link at all, instead of the despammed email address pseudo-link,
-  let me know.
-
-
-  If you're not listed here and you think maybe you should be, please
-  let me know that as well.  I try to get everyone, but sometimes my
-  bookkeeping lapses and I forget about contributions.
-
-
-  15.1.  Code and Documentation
-
-  Rusty Russell <rusty at linuxcare.com.au>  -
-
-  o  wrote the  HOWTO <http://user-mode-
-     linux.sourceforge.net/UserModeLinux-HOWTO.html>
-
-  o  prodded me into making this project official and putting it on
-     SourceForge
-
-  o  came up with the way cool UML logo <http://user-mode-
-     linux.sourceforge.net/uml-small.png>
-
-  o  redid the config process
-
-
-  Peter Moulder <reiter at netspace.net.au>  - Fixed my config and build
-  processes, and added some useful code to the block driver
-
-
-  Bill Stearns <wstearns at pobox.com>  -
-
-  o  HOWTO updates
-
-  o  lots of bug reports
-
-  o  lots of testing
-
-  o  dedicated a box (uml.ists.dartmouth.edu) to support UML development
-
-  o  wrote the mkrootfs script, which allows bootable filesystems of
-     RPM-based distributions to be cranked out
-
-  o  cranked out a large number of filesystems with said script
-
-
-  Jim Leu <jleu at mindspring.com>  - Wrote the virtual ethernet driver
-  and associated usermode tools
-
-  Lars Brinkhoff <http://lars.nocrew.org/>  - Contributed the ptrace
-  proxy from his own  project <http://a386.nocrew.org/> to allow easier
-  kernel debugging
-
-
-  Andrea Arcangeli <andrea at suse.de>  - Redid some of the early boot
-  code so that it would work on machines with Large File Support
-
-
-  Chris Emerson <http://www.chiark.greenend.org.uk/~cemerson/>  - Did
-  the first UML port to Linux/ppc
-
-
-  Harald Welte <laforge at gnumonks.org>  - Wrote the multicast
-  transport for the network driver
-
-
-  Jorgen Cederlof - Added special file support to hostfs
-
-
-  Greg Lonnon  <glonnon at ridgerun dot com>  - Changed the ubd driver
-  to allow it to layer a COW file on a shared read-only filesystem and
-  wrote the iomem emulation support
-
-
-  Henrik Nordstrom <http://hem.passagen.se/hno/>  - Provided a variety
-  of patches, fixes, and clues
-
-
-  Lennert Buytenhek - Contributed various patches, a rewrite of the
-  network driver, the first implementation of the mconsole driver, and
-  did the bulk of the work needed to get SMP working again.
-
-
-  Yon Uriarte - Fixed the TUN/TAP network backend while I slept.
-
-
-  Adam Heath - Made a bunch of nice cleanups to the initialization code,
-  plus various other small patches.
-
-
-  Matt Zimmerman - Matt volunteered to be the UML Debian maintainer and
-  is doing a real nice job of it.  He also noticed and fixed a number of
-  actually and potentially exploitable security holes in uml_net.  Plus
-  the occasional patch.  I like patches.
-
-
-  James McMechan - James seems to have taken over maintenance of the ubd
-  driver and is doing a nice job of it.
-
-
-  Chandan Kudige - wrote the umlgdb script which automates the reloading
-  of module symbols.
-
-
-  Steve Schmidtke - wrote the UML slirp transport and hostaudio drivers,
-  enabling UML processes to access audio devices on the host. He also
-  submitted patches for the slip transport and lots of other things.
-
-
-  David Coulson <http://davidcoulson.net>  -
-
-  o  Set up the usermodelinux.org <http://usermodelinux.org>  site,
-     which is a great way of keeping the UML user community on top of
-     UML goings-on.
-
-  o  Site documentation and updates
-
-  o  Nifty little UML management daemon  UMLd
-     <http://uml.openconsultancy.com/umld/>
-
-  o  Lots of testing and bug reports
-
-
-
-
-  15.2.  Flushing out bugs
-
-
-
-  o  Yuri Pudgorodsky
-
-  o  Gerald Britton
-
-  o  Ian Wehrman
-
-  o  Gord Lamb
-
-  o  Eugene Koontz
-
-  o  John H. Hartman
-
-  o  Anders Karlsson
-
-  o  Daniel Phillips
-
-  o  John Fremlin
-
-  o  Rainer Burgstaller
-
-  o  James Stevenson
-
-  o  Matt Clay
-
-  o  Cliff Jefferies
-
-  o  Geoff Hoff
-
-  o  Lennert Buytenhek
-
-  o  Al Viro
-
-  o  Frank Klingenhoefer
-
-  o  Livio Baldini Soares
-
-  o  Jon Burgess
-
-  o  Petru Paler
-
-  o  Paul
-
-  o  Chris Reahard
-
-  o  Sverker Nilsson
-
-  o  Gong Su
-
-  o  johan verrept
-
-  o  Bjorn Eriksson
-
-  o  Lorenzo Allegrucci
-
-  o  Muli Ben-Yehuda
-
-  o  David Mansfield
-
-  o  Howard Goff
-
-  o  Mike Anderson
-
-  o  John Byrne
-
-  o  Sapan J. Batia
-
-  o  Iris Huang
-
-  o  Jan Hudec
-
-  o  Voluspa
-
-
-
-
-  15.3.  Buglets and clean-ups
-
-
-
-  o  Dave Zarzycki
-
-  o  Adam Lazur
-
-  o  Boria Feigin
-
-  o  Brian J. Murrell
-
-  o  JS
-
-  o  Roman Zippel
-
-  o  Wil Cooley
-
-  o  Ayelet Shemesh
-
-  o  Will Dyson
-
-  o  Sverker Nilsson
-
-  o  dvorak
-
-  o  v.naga srinivas
-
-  o  Shlomi Fish
-
-  o  Roger Binns
-
-  o  johan verrept
-
-  o  MrChuoi
-
-  o  Peter Cleve
-
-  o  Vincent Guffens
-
-  o  Nathan Scott
-
-  o  Patrick Caulfield
-
-  o  jbearce
-
-  o  Catalin Marinas
-
-  o  Shane Spencer
-
-  o  Zou Min
-
-
-  o  Ryan Boder
-
-  o  Lorenzo Colitti
-
-  o  Gwendal Grignou
-
-  o  Andre' Breiler
-
-  o  Tsutomu Yasuda
-
-
-
-  15.4.  Case Studies
-
-
-  o  Jon Wright
-
-  o  William McEwan
-
-  o  Michael Richardson
-
-
-
-  15.5.  Other contributions
-
-
-  Bill Carr <Bill.Carr at compaq.com>  made the Red Hat mkrootfs script
-  work with RH 6.2.
-
-  Michael Jennings <mikejen at hevanet.com>  sent in some material which
-  is now gracing the top of the  index  page <http://user-mode-
-  linux.sourceforge.net/>  of this site.
-
-  SGI <http://www.sgi.com>  (and more specifically Ralf Baechle <ralf at
-  uni-koblenz.de> ) gave me an account on oss.sgi.com
-  <http://www.oss.sgi.com> .  The bandwidth there made it possible to
-  produce most of the filesystems available on the project download
-  page.
-
-  Laurent Bonnaud <Laurent.Bonnaud at inpg.fr>  took the old grotty
-  Debian filesystem that I've been distributing and updated it to 2.2.
-  It is now available by itself here.
-
-  Rik van Riel gave me some ftp space on ftp.nl.linux.org so I can make
-  releases even when Sourceforge is broken.
-
-  Rodrigo de Castro looked at my broken pte code and told me what was
-  wrong with it, letting me fix a long-standing (several weeks) and
-  serious set of bugs.
-
-  Chris Reahard built a specialized root filesystem for running a DNS
-  server jailed inside UML.  It's available from the download
-  <http://user-mode-linux.sourceforge.net/dl-sf.html>  page in the Jail
-  Filesystems section.
-
-
-
-
-
-
-
-
-
-
-
-
diff --git a/Documentation/virt/uml/user_mode_linux.rst b/Documentation/virt/uml/user_mode_linux.rst

new file mode 100644 (file)

index 0000000..de0f0b2
--- /dev/null
+++ b/Documentation/virt/uml/user_mode_linux.rst
@@ -0,0 +1,4403 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================
+User Mode Linux HOWTO
+=====================
+
+:Author:  User Mode Linux Core Team
+:Last-updated: Sat Jan 25 16:07:55 CET 2020
+
+This document describes the use and abuse of Jeff Dike's User Mode
+Linux: a port of the Linux kernel as a normal Intel Linux process.
+
+
+.. Table of Contents
+
+  1. Introduction
+
+     1.1 How is User Mode Linux Different?
+     1.2 Why Would I Want User Mode Linux?
+
+  2. Compiling the kernel and modules
+
+     2.1 Compiling the kernel
+     2.2 Compiling and installing kernel modules
+     2.3 Compiling and installing uml_utilities
+
+  3. Running UML and logging in
+
+     3.1 Running UML
+     3.2 Logging in
+     3.3 Examples
+
+  4. UML on 2G/2G hosts
+
+     4.1 Introduction
+     4.2 The problem
+     4.3 The solution
+
+  5. Setting up serial lines and consoles
+
+     5.1 Specifying the device
+     5.2 Specifying the channel
+     5.3 Examples
+
+  6. Setting up the network
+
+     6.1 General setup
+     6.2 Userspace daemons
+     6.3 Specifying ethernet addresses
+     6.4 UML interface setup
+     6.5 Multicast
+     6.6 TUN/TAP with the uml_net helper
+     6.7 TUN/TAP with a preconfigured tap device
+     6.8 Ethertap
+     6.9 The switch daemon
+     6.10 Slip
+     6.11 Slirp
+     6.12 pcap
+     6.13 Setting up the host yourself
+
+  7. Sharing Filesystems between Virtual Machines
+
+     7.1 A warning
+     7.2 Using layered block devices
+     7.3 Note!
+     7.4 Another warning
+     7.5 uml_moo : Merging a COW file with its backing file
+
+  8. Creating filesystems
+
+     8.1 Create the filesystem file
+     8.2 Assign the file to a UML device
+     8.3 Creating and mounting the filesystem
+
+  9. Host file access
+
+     9.1 Using hostfs
+     9.2 hostfs as the root filesystem
+     9.3 Building hostfs
+
+  10. The Management Console
+     10.1 version
+     10.2 halt and reboot
+     10.3 config
+     10.4 remove
+     10.5 sysrq
+     10.6 help
+     10.7 cad
+     10.8 stop
+     10.9 go
+
+  11. Kernel debugging
+
+     11.1 Starting the kernel under gdb
+     11.2 Examining sleeping processes
+     11.3 Running ddd on UML
+     11.4 Debugging modules
+     11.5 Attaching gdb to the kernel
+     11.6 Using alternate debuggers
+
+  12. Kernel debugging examples
+
+     12.1 The case of the hung fsck
+     12.2 Episode 2: The case of the hung fsck
+
+  13. What to do when UML doesn't work
+
+     13.1 Strange compilation errors when you build from source
+     13.2 (obsolete)
+     13.3 A variety of panics and hangs with /tmp on a reiserfs  filesystem
+     13.4 The compile fails with errors about conflicting types for 'open', 'dup', and 'waitpid'
+     13.5 UML doesn't work when /tmp is an NFS filesystem
+     13.6 UML hangs on boot when compiled with gprof support
+     13.7 syslogd dies with a SIGTERM on startup
+     13.8 TUN/TAP networking doesn't work on a 2.4 host
+     13.9 You can network to the host but not to other machines on the net
+     13.10 I have no root and I want to scream
+     13.11 UML build conflict between ptrace.h and ucontext.h
+     13.12 The UML BogoMips is exactly half the host's BogoMips
+     13.13 When you run UML, it immediately segfaults
+     13.14 xterms appear, then immediately disappear
+     13.15 Any other panic, hang, or strange behavior
+
+  14. Diagnosing Problems
+
+     14.1 Case 1 : Normal kernel panics
+     14.2 Case 2 : Tracing thread panics
+     14.3 Case 3 : Tracing thread panics caused by other threads
+     14.4 Case 4 : Hangs
+
+  15. Thanks
+
+     15.1 Code and Documentation
+     15.2 Flushing out bugs
+     15.3 Buglets and clean-ups
+     15.4 Case Studies
+     15.5 Other contributions
+
+
+1.  Introduction
+================
+
+  Welcome to User Mode Linux.  It's going to be fun.
+
+
+
+1.1.  How is User Mode Linux Different?
+---------------------------------------
+
+  Normally, the Linux Kernel talks straight to your hardware (video
+  card, keyboard, hard drives, etc), and any programs which run ask the
+  kernel to operate the hardware, like so::
+
+
+
+         +-----------+-----------+----+
+         | Process 1 | Process 2 | ...|
+         +-----------+-----------+----+
+         |       Linux Kernel         |
+         +----------------------------+
+         |         Hardware           |
+         +----------------------------+
+
+
+
+
+  The User Mode Linux Kernel is different; instead of talking to the
+  hardware, it talks to a `real` Linux kernel (called the `host kernel`
+  from now on), like any other program.  Programs can then run inside
+  User-Mode Linux as if they were running under a normal kernel, like
+  so::
+
+
+
+                     +----------------+
+                     | Process 2 | ...|
+         +-----------+----------------+
+         | Process 1 | User-Mode Linux|
+         +----------------------------+
+         |       Linux Kernel         |
+         +----------------------------+
+         |         Hardware           |
+         +----------------------------+
+
+
+
+
+
+1.2.  Why Would I Want User Mode Linux?
+---------------------------------------
+
+
+  1. If User Mode Linux crashes, your host kernel is still fine.
+
+  2. You can run a usermode kernel as a non-root user.
+
+  3. You can debug the User Mode Linux like any normal process.
+
+  4. You can run gprof (profiling) and gcov (coverage testing).
+
+  5. You can play with your kernel without breaking things.
+
+  6. You can use it as a sandbox for testing new apps.
+
+  7. You can try new development kernels safely.
+
+  8. You can run different distributions simultaneously.
+
+  9. It's extremely fun.
+
+
+
+.. _Compiling_the_kernel_and_modules:
+
+2.  Compiling the kernel and modules
+====================================
+
+
+
+
+2.1.  Compiling the kernel
+--------------------------
+
+
+  Compiling the user mode kernel is just like compiling any other
+  kernel.
+
+
+  1. Download the latest kernel from your favourite kernel mirror,
+     such as:
+
+     https://mirrors.edge.kernel.org/pub/linux/kernel/v5.x/linux-5.4.14.tar.xz
+
+  2. Make a directory and unpack the kernel into it::
+
+       host%
+       mkdir ~/uml
+
+       host%
+       cd ~/uml
+
+       host%
+       tar xvf linux-5.4.14.tar.xz
+
+
+  3. Run your favorite config; ``make xconfig ARCH=um`` is the most
+     convenient.  ``make config ARCH=um`` and ``make menuconfig ARCH=um``
+     will work as well.  The defaults will give you a useful kernel.  If
+     you want to change something, go ahead, it probably won't hurt
+     anything.
+
+
+     Note:  If the host is configured with a 2G/2G address space split
+     rather than the usual 3G/1G split, then the packaged UML binaries
+     will not run.  They will immediately segfault.  See
+     :ref:`UML_on_2G/2G_hosts`  for the scoop on running UML on your system.
+
+
+
+  4. Finish with ``make linux ARCH=um``: the result is a file called
+     ``linux`` in the top directory of your source tree.
+
+
+2.2.  Compiling and installing kernel modules
+---------------------------------------------
+
+  UML modules are built in the same way as the native kernel (with the
+  exception of the 'ARCH=um' that you always need for UML)::
+
+
+       host% make modules ARCH=um
+
+
+
+
+  Any modules that you want to load into this kernel need to be built in
+  the user-mode pool.  Modules from the native kernel won't work.
+
+  You can install them by using ftp or something to copy them into the
+  virtual machine and dropping them into ``/lib/modules/$(uname -r)``.
+
+  You can also get the kernel build process to install them as follows:
+
+  1. with the kernel not booted, mount the root filesystem in the top
+     level of the kernel pool::
+
+
+       host% mount root_fs mnt -o loop
+
+
+
+
+
+
+  2. run::
+
+
+       host%
+       make modules_install INSTALL_MOD_PATH=`pwd`/mnt ARCH=um
+
+
+
+
+
+
+  3. unmount the filesystem::
+
+
+       host% umount mnt
+
+
+
+
+
+
+  4. boot the kernel on it
+
+
+  When the system is booted, you can use insmod as usual to get the
+  modules into the kernel.  A number of things have been loaded into UML
+  as modules, especially filesystems and network protocols and filters,
+  so most symbols which need to be exported probably already are.
+  However, if you do find symbols that need exporting, let  us
+  know at http://user-mode-linux.sourceforge.net/, and
+  they'll be "taken care of".
+
+
+
+2.3.  Compiling and installing uml_utilities
+--------------------------------------------
+
+  Many features of the UML kernel require a user-space helper program,
+  so a uml_utilities package is distributed separately from the kernel
+  patch which provides these helpers. Included within this is:
+
+  -  port-helper - Used by consoles which connect to xterms or ports
+
+  -  tunctl - Configuration tool to create and delete tap devices
+
+  -  uml_net - Setuid binary for automatic tap device configuration
+
+  -  uml_switch - User-space virtual switch required for daemon
+     transport
+
+     The uml_utilities tree is compiled with::
+
+
+       host#
+       make && make install
+
+
+
+
+  Note that UML kernel patches may require a specific version of the
+  uml_utilities distribution. If you don't keep up with the mailing
+  lists, ensure that you have the latest release of uml_utilities if you
+  are experiencing problems with your UML kernel, particularly when
+  dealing with consoles or command-line switches to the helper programs
+
+
+
+
+
+
+
+
+3.  Running UML and logging in
+==============================
+
+
+
+3.1.  Running UML
+-----------------
+
+  It runs on 2.2.15 or later, and all kernel versions since 2.4.
+
+
+  Booting UML is straightforward.  Simply run 'linux': it will try to
+  mount the file ``root_fs`` in the current directory.  You do not need to
+  run it as root.  If your root filesystem is not named ``root_fs``, then
+  you need to put a ``ubd0=root_fs_whatever`` switch on the linux command
+  line.
+
+
+  You will need a filesystem to boot UML from.  There are a number
+  available for download from http://user-mode-linux.sourceforge.net.
+  There are also  several tools at
+  http://user-mode-linux.sourceforge.net/  which can be
+  used to generate UML-compatible filesystem images from media.
+  The kernel will boot up and present you with a login prompt.
+
+
+Note:
+  If the host is configured with a 2G/2G address space split
+  rather than the usual 3G/1G split, then the packaged UML binaries will
+  not run.  They will immediately segfault.  See :ref:`UML_on_2G/2G_hosts`
+  for the scoop on running UML on your system.
+
+
+
+3.2.  Logging in
+----------------
+
+
+
+  The prepackaged filesystems have a root account with password 'root'
+  and a user account with password 'user'.  The login banner will
+  generally tell you how to log in.  So, you log in and you will find
+  yourself inside a little virtual machine. Our filesystems have a
+  variety of commands and utilities installed (and it is fairly easy to
+  add more), so you will have a lot of tools with which to poke around
+  the system.
+
+  There are a couple of other ways to log in:
+
+  -  On a virtual console
+
+
+
+     Each virtual console that is configured (i.e. the device exists in
+     /dev and /etc/inittab runs a getty on it) will come up in its own
+     xterm.  If you get tired of the xterms, read
+     :ref:`setting_up_serial_lines_and_consoles` to see how to attach
+     the consoles to something else, like host ptys.
+
+
+
+  -  Over the serial line
+
+
+     In the boot output, find a line that looks like::
+
+
+
+       serial line 0 assigned pty /dev/ptyp1
+
+
+
+
+  Attach your favorite terminal program to the corresponding tty.  I.e.
+  for minicom, the command would be::
+
+
+       host% minicom -o -p /dev/ttyp1
+
+
+
+
+
+
+  -  Over the net
+
+
+     If the network is running, then you can telnet to the virtual
+     machine and log in to it.  See :ref:`Setting_up_the_network`  to learn
+     about setting up a virtual network.
+
+  When you're done using it, run halt, and the kernel will bring itself
+  down and the process will exit.
+
+
+3.3.  Examples
+--------------
+
+  Here are some examples of UML in action:
+
+  -  A login session http://user-mode-linux.sourceforge.net/old/login.html
+
+  -  A virtual network http://user-mode-linux.sourceforge.net/old/net.html
+
+
+
+
+
+.. _UML_on_2G/2G_hosts:
+
+4.  UML on 2G/2G hosts
+======================
+
+
+
+
+4.1.  Introduction
+------------------
+
+
+  Most Linux machines are configured so that the kernel occupies the
+  upper 1G (0xc0000000 - 0xffffffff) of the 4G address space and
+  processes use the lower 3G (0x00000000 - 0xbfffffff).  However, some
+  machine are configured with a 2G/2G split, with the kernel occupying
+  the upper 2G (0x80000000 - 0xffffffff) and processes using the lower
+  2G (0x00000000 - 0x7fffffff).
+
+
+
+
+4.2.  The problem
+-----------------
+
+
+  The prebuilt UML binaries on this site will not run on 2G/2G hosts
+  because UML occupies the upper .5G of the 3G process address space
+  (0xa0000000 - 0xbfffffff).  Obviously, on 2G/2G hosts, this is right
+  in the middle of the kernel address space, so UML won't even load - it
+  will immediately segfault.
+
+
+
+
+4.3.  The solution
+------------------
+
+
+  The fix for this is to rebuild UML from source after enabling
+  CONFIG_HOST_2G_2G (under 'General Setup').  This will cause UML to
+  load itself in the top .5G of that smaller process address space,
+  where it will run fine.  See :ref:`Compiling_the_kernel_and_modules`  if
+  you need help building UML from source.
+
+
+
+
+
+
+
+.. _setting_up_serial_lines_and_consoles:
+
+
+5.  Setting up serial lines and consoles
+========================================
+
+
+  It is possible to attach UML serial lines and consoles to many types
+  of host I/O channels by specifying them on the command line.
+
+
+  You can attach them to host ptys, ttys, file descriptors, and ports.
+  This allows you to do things like:
+
+  -  have a UML console appear on an unused host console,
+
+  -  hook two virtual machines together by having one attach to a pty
+     and having the other attach to the corresponding tty
+
+  -  make a virtual machine accessible from the net by attaching a
+     console to a port on the host.
+
+
+  The general format of the command line option is ``device=channel``.
+
+
+
+5.1.  Specifying the device
+---------------------------
+
+  Devices are specified with "con" or "ssl" (console or serial line,
+  respectively), optionally with a device number if you are talking
+  about a specific device.
+
+
+  Using just "con" or "ssl" describes all of the consoles or serial
+  lines.  If you want to talk about console #3 or serial line #10, they
+  would be "con3" and "ssl10", respectively.
+
+
+  A specific device name will override a less general "con=" or "ssl=".
+  So, for example, you can assign a pty to each of the serial lines
+  except for the first two like this::
+
+
+        ssl=pty ssl0=tty:/dev/tty0 ssl1=tty:/dev/tty1
+
+
+
+
+  The specificity of the device name is all that matters; order on the
+  command line is irrelevant.
+
+
+
+5.2.  Specifying the channel
+----------------------------
+
+  There are a number of different types of channels to attach a UML
+  device to, each with a different way of specifying exactly what to
+  attach to.
+
+  -  pseudo-terminals - device=pty pts terminals - device=pts
+
+
+     This will cause UML to allocate a free host pseudo-terminal for the
+     device.  The terminal that it got will be announced in the boot
+     log.  You access it by attaching a terminal program to the
+     corresponding tty:
+
+  -  screen /dev/pts/n
+
+  -  screen /dev/ttyxx
+
+  -  minicom -o -p /dev/ttyxx - minicom seems not able to handle pts
+     devices
+
+  -  kermit - start it up, 'open' the device, then 'connect'
+
+
+
+
+
+  -  terminals - device=tty:tty device file
+
+
+     This will make UML attach the device to the specified tty (i.e::
+
+
+        con1=tty:/dev/tty3
+
+
+
+
+  will attach UML's console 1 to the host's /dev/tty3).  If the tty that
+  you specify is the slave end of a tty/pty pair, something else must
+  have already opened the corresponding pty in order for this to work.
+
+
+
+
+
+  -  xterms - device=xterm
+
+
+     UML will run an xterm and the device will be attached to it.
+
+
+
+
+
+  -  Port - device=port:port number
+
+
+     This will attach the UML devices to the specified host port.
+     Attaching console 1 to the host's port 9000 would be done like
+     this::
+
+
+        con1=port:9000
+
+
+
+
+  Attaching all the serial lines to that port would be done similarly::
+
+
+        ssl=port:9000
+
+
+
+
+  You access these devices by telnetting to that port.  Each active
+  telnet session gets a different device.  If there are more telnets to a
+  port than UML devices attached to it, then the extra telnet sessions
+  will block until an existing telnet detaches, or until another device
+  becomes active (i.e. by being activated in /etc/inittab).
+
+  This channel has the advantage that you can both attach multiple UML
+  devices to it and know how to access them without reading the UML boot
+  log.  It is also unique in allowing access to a UML from remote
+  machines without requiring that the UML be networked.  This could be
+  useful in allowing public access to UMLs because they would be
+  accessible from the net, but wouldn't need any kind of network
+  filtering or access control because they would have no network access.
+
+
+  If you attach the main console to a portal, then the UML boot will
+  appear to hang.  In reality, it's waiting for a telnet to connect, at
+  which point the boot will proceed.
+
+
+
+
+
+  -  already-existing file descriptors - device=file descriptor
+
+
+     If you set up a file descriptor on the UML command line, you can
+     attach a UML device to it.  This is most commonly used to put the
+     main console back on stdin and stdout after assigning all the other
+     consoles to something else::
+
+
+        con0=fd:0,fd:1 con=pts
+
+
+
+
+
+
+
+
+  -  Nothing - device=null
+
+
+     This allows the device to be opened, in contrast to 'none', but
+     reads will block, and writes will succeed and the data will be
+     thrown out.
+
+
+
+
+
+  -  None - device=none
+
+
+     This causes the device to disappear.
+
+
+
+  You can also specify different input and output channels for a device
+  by putting a comma between them::
+
+
+        ssl3=tty:/dev/tty2,xterm
+
+
+
+
+  will cause serial line 3 to accept input on the host's /dev/tty2 and
+  display output on an xterm.  That's a silly example - the most common
+  use of this syntax is to reattach the main console to stdin and stdout
+  as shown above.
+
+
+  If you decide to move the main console away from stdin/stdout, the
+  initial boot output will appear in the terminal that you're running
+  UML in.  However, once the console driver has been officially
+  initialized, then the boot output will start appearing wherever you
+  specified that console 0 should be.  That device will receive all
+  subsequent output.
+
+
+
+5.3.  Examples
+--------------
+
+  There are a number of interesting things you can do with this
+  capability.
+
+
+  First, this is how you get rid of those bleeding console xterms by
+  attaching them to host ptys::
+
+
+        con=pty con0=fd:0,fd:1
+
+
+
+
+  This will make a UML console take over an unused host virtual console,
+  so that when you switch to it, you will see the UML login prompt
+  rather than the host login prompt::
+
+
+        con1=tty:/dev/tty6
+
+
+
+
+  You can attach two virtual machines together with what amounts to a
+  serial line as follows:
+
+  Run one UML with a serial line attached to a pty::
+
+
+        ssl1=pty
+
+
+
+
+  Look at the boot log to see what pty it got (this example will assume
+  that it got /dev/ptyp1).
+
+  Boot the other UML with a serial line attached to the corresponding
+  tty::
+
+
+        ssl1=tty:/dev/ttyp1
+
+
+
+
+  Log in, make sure that it has no getty on that serial line, attach a
+  terminal program like minicom to it, and you should see the login
+  prompt of the other virtual machine.
+
+
+.. _setting_up_the_network:
+
+6.  Setting up the network
+==========================
+
+
+
+  This page describes how to set up the various transports and to
+  provide a UML instance with network access to the host, other machines
+  on the local net, and the rest of the net.
+
+
+  As of 2.4.5, UML networking has been completely redone to make it much
+  easier to set up, fix bugs, and add new features.
+
+
+  There is a new helper, uml_net, which does the host setup that
+  requires root privileges.
+
+
+  There are currently five transport types available for a UML virtual
+  machine to exchange packets with other hosts:
+
+  -  ethertap
+
+  -  TUN/TAP
+
+  -  Multicast
+
+  -  a switch daemon
+
+  -  slip
+
+  -  slirp
+
+  -  pcap
+
+     The TUN/TAP, ethertap, slip, and slirp transports allow a UML
+     instance to exchange packets with the host.  They may be directed
+     to the host or the host may just act as a router to provide access
+     to other physical or virtual machines.
+
+
+  The pcap transport is a synthetic read-only interface, using the
+  libpcap binary to collect packets from interfaces on the host and
+  filter them.  This is useful for building preconfigured traffic
+  monitors or sniffers.
+
+
+  The daemon and multicast transports provide a completely virtual
+  network to other virtual machines.  This network is completely
+  disconnected from the physical network unless one of the virtual
+  machines on it is acting as a gateway.
+
+
+  With so many host transports, which one should you use?  Here's when
+  you should use each one:
+
+  -  ethertap - if you want access to the host networking and it is
+     running 2.2
+
+  -  TUN/TAP - if you want access to the host networking and it is
+     running 2.4.  Also, the TUN/TAP transport is able to use a
+     preconfigured device, allowing it to avoid using the setuid uml_net
+     helper, which is a security advantage.
+
+  -  Multicast - if you want a purely virtual network and you don't want
+     to set up anything but the UML
+
+  -  a switch daemon - if you want a purely virtual network and you
+     don't mind running the daemon in order to get somewhat better
+     performance
+
+  -  slip - there is no particular reason to run the slip backend unless
+     ethertap and TUN/TAP are just not available for some reason
+
+  -  slirp - if you don't have root access on the host to setup
+     networking, or if you don't want to allocate an IP to your UML
+
+  -  pcap - not much use for actual network connectivity, but great for
+     monitoring traffic on the host
+
+     Ethertap is available on 2.4 and works fine.  TUN/TAP is preferred
+     to it because it has better performance and ethertap is officially
+     considered obsolete in 2.4.  Also, the root helper only needs to
+     run occasionally for TUN/TAP, rather than handling every packet, as
+     it does with ethertap.  This is a slight security advantage since
+     it provides fewer opportunities for a nasty UML user to somehow
+     exploit the helper's root privileges.
+
+
+6.1.  General setup
+-------------------
+
+  First, you must have the virtual network enabled in your UML.  If are
+  running a prebuilt kernel from this site, everything is already
+  enabled.  If you build the kernel yourself, under the "Network device
+  support" menu, enable "Network device support", and then the three
+  transports.
+
+
+  The next step is to provide a network device to the virtual machine.
+  This is done by describing it on the kernel command line.
+
+  The general format is::
+
+
+       eth <n> = <transport> , <transport args>
+
+
+
+
+  For example, a virtual ethernet device may be attached to a host
+  ethertap device as follows::
+
+
+       eth0=ethertap,tap0,fe:fd:0:0:0:1,192.168.0.254
+
+
+
+
+  This sets up eth0 inside the virtual machine to attach itself to the
+  host /dev/tap0, assigns it an ethernet address, and assigns the host
+  tap0 interface an IP address.
+
+
+
+  Note that the IP address you assign to the host end of the tap device
+  must be different than the IP you assign to the eth device inside UML.
+  If you are short on IPs and don't want to consume two per UML, then
+  you can reuse the host's eth IP address for the host ends of the tap
+  devices.  Internally, the UMLs must still get unique IPs for their eth
+  devices.  You can also give the UMLs non-routable IPs (192.168.x.x or
+  10.x.x.x) and have the host masquerade them.  This will let outgoing
+  connections work, but incoming connections won't without more work,
+  such as port forwarding from the host.
+  Also note that when you configure the host side of an interface, it is
+  only acting as a gateway.  It will respond to pings sent to it
+  locally, but is not useful to do that since it's a host interface.
+  You are not talking to the UML when you ping that interface and get a
+  response.
+
+
+  You can also add devices to a UML and remove them at runtime.  See the
+  :ref:`The_Management_Console`  page for details.
+
+
+  The sections below describe this in more detail.
+
+
+  Once you've decided how you're going to set up the devices, you boot
+  UML, log in, configure the UML side of the devices, and set up routes
+  to the outside world.  At that point, you will be able to talk to any
+  other machines, physical or virtual, on the net.
+
+
+  If ifconfig inside UML fails and the network refuses to come up, run
+  tell you what went wrong.
+
+
+
+6.2.  Userspace daemons
+-----------------------
+
+  You will likely need the setuid helper, or the switch daemon, or both.
+  They are both installed with the RPM and deb, so if you've installed
+  either, you can skip the rest of this section.
+
+
+  If not, then you need to check them out of CVS, build them, and
+  install them.  The helper is uml_net, in CVS /tools/uml_net, and the
+  daemon is uml_switch, in CVS /tools/uml_router.  They are both built
+  with a plain 'make'.  Both need to be installed in a directory that's
+  in your path - /usr/bin is recommend.  On top of that, uml_net needs
+  to be setuid root.
+
+
+
+6.3.  Specifying ethernet addresses
+-----------------------------------
+
+  Below, you will see that the TUN/TAP, ethertap, and daemon interfaces
+  allow you to specify hardware addresses for the virtual ethernet
+  devices.  This is generally not necessary.  If you don't have a
+  specific reason to do it, you probably shouldn't.  If one is not
+  specified on the command line, the driver will assign one based on the
+  device IP address.  It will provide the address fe:fd:nn:nn:nn:nn
+  where nn.nn.nn.nn is the device IP address.  This is nearly always
+  sufficient to guarantee a unique hardware address for the device.  A
+  couple of exceptions are:
+
+  -  Another set of virtual ethernet devices are on the same network and
+     they are assigned hardware addresses using a different scheme which
+     may conflict with the UML IP address-based scheme
+
+  -  You aren't going to use the device for IP networking, so you don't
+     assign the device an IP address
+
+     If you let the driver provide the hardware address, you should make
+     sure that the device IP address is known before the interface is
+     brought up.  So, inside UML, this will guarantee that::
+
+
+
+         UML#
+         ifconfig eth0 192.168.0.250 up
+
+
+
+
+  If you decide to assign the hardware address yourself, make sure that
+  the first byte of the address is even.  Addresses with an odd first
+  byte are broadcast addresses, which you don't want assigned to a
+  device.
+
+
+
+6.4.  UML interface setup
+-------------------------
+
+  Once the network devices have been described on the command line, you
+  should boot UML and log in.
+
+
+  The first thing to do is bring the interface up::
+
+
+       UML# ifconfig ethn ip-address up
+
+
+
+
+  You should be able to ping the host at this point.
+
+
+  To reach the rest of the world, you should set a default route to the
+  host::
+
+
+       UML# route add default gw host ip
+
+
+
+
+  Again, with host ip of 192.168.0.4::
+
+
+       UML# route add default gw 192.168.0.4
+
+
+
+
+  This page used to recommend setting a network route to your local net.
+  This is wrong, because it will cause UML to try to figure out hardware
+  addresses of the local machines by arping on the interface to the
+  host.  Since that interface is basically a single strand of ethernet
+  with two nodes on it (UML and the host) and arp requests don't cross
+  networks, they will fail to elicit any responses.  So, what you want
+  is for UML to just blindly throw all packets at the host and let it
+  figure out what to do with them, which is what leaving out the network
+  route and adding the default route does.
+
+
+  Note: If you can't communicate with other hosts on your physical
+  ethernet, it's probably because of a network route that's
+  automatically set up.  If you run 'route -n' and see a route that
+  looks like this::
+
+
+
+
+    Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
+    192.168.0.0     0.0.0.0         255.255.255.0   U     0      0      0   eth0
+
+
+
+
+  with a mask that's not 255.255.255.255, then replace it with a route
+  to your host::
+
+
+       UML#
+       route del -net 192.168.0.0 dev eth0 netmask 255.255.255.0
+
+
+       UML#
+       route add -host 192.168.0.4 dev eth0
+
+
+
+
+  This, plus the default route to the host, will allow UML to exchange
+  packets with any machine on your ethernet.
+
+
+
+6.5.  Multicast
+---------------
+
+  The simplest way to set up a virtual network between multiple UMLs is
+  to use the mcast transport.  This was written by Harald Welte and is
+  present in UML version 2.4.5-5um and later.  Your system must have
+  multicast enabled in the kernel and there must be a multicast-capable
+  network device on the host.  Normally, this is eth0, but if there is
+  no ethernet card on the host, then you will likely get strange error
+  messages when you bring the device up inside UML.
+
+
+  To use it, run two UMLs with::
+
+
+        eth0=mcast
+
+
+
+
+  on their command lines.  Log in, configure the ethernet device in each
+  machine with different IP addresses::
+
+
+       UML1# ifconfig eth0 192.168.0.254
+
+
+       UML2# ifconfig eth0 192.168.0.253
+
+
+
+
+  and they should be able to talk to each other.
+
+  The full set of command line options for this transport are::
+
+
+
+       ethn=mcast,ethernet address,multicast
+       address,multicast port,ttl
+
+
+
+  There is also a related point-to-point only "ucast" transport.
+  This is useful when your network does not support multicast, and
+  all network connections are simple point to point links.
+
+  The full set of command line options for this transport are::
+
+
+       ethn=ucast,ethernet address,remote address,listen port,remote port
+
+
+
+
+6.6.  TUN/TAP with the uml_net helper
+-------------------------------------
+
+  TUN/TAP is the preferred mechanism on 2.4 to exchange packets with the
+  host.  The TUN/TAP backend has been in UML since 2.4.9-3um.
+
+
+  The easiest way to get up and running is to let the setuid uml_net
+  helper do the host setup for you.  This involves insmod-ing the tun.o
+  module if necessary, configuring the device, and setting up IP
+  forwarding, routing, and proxy arp.  If you are new to UML networking,
+  do this first.  If you're concerned about the security implications of
+  the setuid helper, use it to get up and running, then read the next
+  section to see how to have UML use a preconfigured tap device, which
+  avoids the use of uml_net.
+
+
+  If you specify an IP address for the host side of the device, the
+  uml_net helper will do all necessary setup on the host - the only
+  requirement is that TUN/TAP be available, either built in to the host
+  kernel or as the tun.o module.
+
+  The format of the command line switch to attach a device to a TUN/TAP
+  device is::
+
+
+       eth <n> =tuntap,,, <IP address>
+
+
+
+
+  For example, this argument will attach the UML's eth0 to the next
+  available tap device and assign an ethernet address to it based on its
+  IP address::
+
+
+       eth0=tuntap,,,192.168.0.254
+
+
+
+
+
+
+  Note that the IP address that must be used for the eth device inside
+  UML is fixed by the routing and proxy arp that is set up on the
+  TUN/TAP device on the host.  You can use a different one, but it won't
+  work because reply packets won't reach the UML.  This is a feature.
+  It prevents a nasty UML user from doing things like setting the UML IP
+  to the same as the network's nameserver or mail server.
+
+
+  There are a couple potential problems with running the TUN/TAP
+  transport on a 2.4 host kernel
+
+  -  TUN/TAP seems not to work on 2.4.3 and earlier.  Upgrade the host
+     kernel or use the ethertap transport.
+
+  -  With an upgraded kernel, TUN/TAP may fail with::
+
+
+       File descriptor in bad state
+
+
+
+
+  This is due to a header mismatch between the upgraded kernel and the
+  kernel that was originally installed on the machine.  The fix is to
+  make sure that /usr/src/linux points to the headers for the running
+  kernel.
+
+  These were pointed out by Tim Robinson <timro at trkr dot net> in the past.
+
+
+
+6.7.  TUN/TAP with a preconfigured tap device
+---------------------------------------------
+
+  If you prefer not to have UML use uml_net (which is somewhat
+  insecure), with UML 2.4.17-11, you can set up a TUN/TAP device
+  beforehand.  The setup needs to be done as root, but once that's done,
+  there is no need for root assistance.  Setting up the device is done
+  as follows:
+
+  -  Create the device with tunctl (available from the UML utilities
+     tarball)::
+
+
+
+
+       host#  tunctl -u uid
+
+
+
+
+  where uid is the user id or username that UML will be run as.  This
+  will tell you what device was created.
+
+  -  Configure the device IP (change IP addresses and device name to
+     suit)::
+
+
+
+
+       host#  ifconfig tap0 192.168.0.254 up
+
+
+
+
+
+  -  Set up routing and arping if desired - this is my recipe, there are
+     other ways of doing the same thing::
+
+
+       host#
+       bash -c 'echo 1 > /proc/sys/net/ipv4/ip_forward'
+
+       host#
+       route add -host 192.168.0.253 dev tap0
+
+       host#
+       bash -c 'echo 1 > /proc/sys/net/ipv4/conf/tap0/proxy_arp'
+
+       host#
+       arp -Ds 192.168.0.253 eth0 pub
+
+
+
+
+  Note that this must be done every time the host boots - this configu-
+  ration is not stored across host reboots.  So, it's probably a good
+  idea to stick it in an rc file.  An even better idea would be a little
+  utility which reads the information from a config file and sets up
+  devices at boot time.
+
+  -  Rather than using up two IPs and ARPing for one of them, you can
+     also provide direct access to your LAN by the UML by using a
+     bridge::
+
+
+       host#
+       brctl addbr br0
+
+
+       host#
+       ifconfig eth0 0.0.0.0 promisc up
+
+
+       host#
+       ifconfig tap0 0.0.0.0 promisc up
+
+
+       host#
+       ifconfig br0 192.168.0.1 netmask 255.255.255.0 up
+
+
+       host#
+       brctl stp br0 off
+
+
+       host#
+       brctl setfd br0 1
+
+
+       host#
+       brctl sethello br0 1
+
+
+       host#
+       brctl addif br0 eth0
+
+
+       host#
+       brctl addif br0 tap0
+
+
+
+
+  Note that 'br0' should be setup using ifconfig with the existing IP
+  address of eth0, as eth0 no longer has its own IP.
+
+  -
+
+
+     Also, the /dev/net/tun device must be writable by the user running
+     UML in order for the UML to use the device that's been configured
+     for it.  The simplest thing to do is::
+
+
+       host#  chmod 666 /dev/net/tun
+
+
+
+
+  Making it world-writable looks bad, but it seems not to be
+  exploitable as a security hole.  However, it does allow anyone to cre-
+  ate useless tap devices (useless because they can't configure them),
+  which is a DOS attack.  A somewhat more secure alternative would to be
+  to create a group containing all the users who have preconfigured tap
+  devices and chgrp /dev/net/tun to that group with mode 664 or 660.
+
+
+  -  Once the device is set up, run UML with 'eth0=tuntap,device name'
+     (i.e. 'eth0=tuntap,tap0') on the command line (or do it with the
+     mconsole config command).
+
+  -  Bring the eth device up in UML and you're in business.
+
+     If you don't want that tap device any more, you can make it non-
+     persistent with::
+
+
+       host#  tunctl -d tap device
+
+
+
+
+  Finally, tunctl has a -b (for brief mode) switch which causes it to
+  output only the name of the tap device it created.  This makes it
+  suitable for capture by a script::
+
+
+       host#  TAP=`tunctl -u 1000 -b`
+
+
+
+
+
+
+6.8.  Ethertap
+--------------
+
+  Ethertap is the general mechanism on 2.2 for userspace processes to
+  exchange packets with the kernel.
+
+
+
+  To use this transport, you need to describe the virtual network device
+  on the UML command line.  The general format for this is::
+
+
+       eth <n> =ethertap, <device> , <ethernet address> , <tap IP address>
+
+
+
+
+  So, the previous example::
+
+
+       eth0=ethertap,tap0,fe:fd:0:0:0:1,192.168.0.254
+
+
+
+
+  attaches the UML eth0 device to the host /dev/tap0, assigns it the
+  ethernet address fe:fd:0:0:0:1, and assigns the IP address
+  192.168.0.254 to the tap device.
+
+
+
+  The tap device is mandatory, but the others are optional.  If the
+  ethernet address is omitted, one will be assigned to it.
+
+
+  The presence of the tap IP address will cause the helper to run and do
+  whatever host setup is needed to allow the virtual machine to
+  communicate with the outside world.  If you're not sure you know what
+  you're doing, this is the way to go.
+
+
+  If it is absent, then you must configure the tap device and whatever
+  arping and routing you will need on the host.  However, even in this
+  case, the uml_net helper still needs to be in your path and it must be
+  setuid root if you're not running UML as root.  This is because the
+  tap device doesn't support SIGIO, which UML needs in order to use
+  something as a source of input.  So, the helper is used as a
+  convenient asynchronous IO thread.
+
+  If you're using the uml_net helper, you can ignore the following host
+  setup - uml_net will do it for you.  You just need to make sure you
+  have ethertap available, either built in to the host kernel or
+  available as a module.
+
+
+  If you want to set things up yourself, you need to make sure that the
+  appropriate /dev entry exists.  If it doesn't, become root and create
+  it as follows::
+
+
+       mknod /dev/tap <minor>  c 36  <minor>  + 16
+
+
+
+
+  For example, this is how to create /dev/tap0::
+
+
+       mknod /dev/tap0 c 36 0 + 16
+
+
+
+
+  You also need to make sure that the host kernel has ethertap support.
+  If ethertap is enabled as a module, you apparently need to insmod
+  ethertap once for each ethertap device you want to enable.  So,::
+
+
+       host#
+       insmod ethertap
+
+
+
+
+  will give you the tap0 interface.  To get the tap1 interface, you need
+  to run::
+
+
+       host#
+       insmod ethertap unit=1 -o ethertap1
+
+
+
+
+
+
+
+6.9.  The switch daemon
+-----------------------
+
+  Note: This is the daemon formerly known as uml_router, but which was
+  renamed so the network weenies of the world would stop growling at me.
+
+
+  The switch daemon, uml_switch, provides a mechanism for creating a
+  totally virtual network.  By default, it provides no connection to the
+  host network (but see -tap, below).
+
+
+  The first thing you need to do is run the daemon.  Running it with no
+  arguments will make it listen on a default pair of unix domain
+  sockets.
+
+
+  If you want it to listen on a different pair of sockets, use::
+
+
+        -unix control socket data socket
+
+
+
+
+
+  If you want it to act as a hub rather than a switch, use::
+
+
+        -hub
+
+
+
+
+
+  If you want the switch to be connected to host networking (allowing
+  the umls to get access to the outside world through the host), use::
+
+
+        -tap tap0
+
+
+
+
+
+  Note that the tap device must be preconfigured (see "TUN/TAP with a
+  preconfigured tap device", above).  If you're using a different tap
+  device than tap0, specify that instead of tap0.
+
+
+  uml_switch can be backgrounded as follows::
+
+
+       host%
+       uml_switch [ options ] < /dev/null > /dev/null
+
+
+
+
+  The reason it doesn't background by default is that it listens to
+  stdin for EOF.  When it sees that, it exits.
+
+
+  The general format of the kernel command line switch is::
+
+
+
+       ethn=daemon,ethernet address,socket
+       type,control socket,data socket
+
+
+
+
+  You can leave off everything except the 'daemon'.  You only need to
+  specify the ethernet address if the one that will be assigned to it
+  isn't acceptable for some reason.  The rest of the arguments describe
+  how to communicate with the daemon.  You should only specify them if
+  you told the daemon to use different sockets than the default.  So, if
+  you ran the daemon with no arguments, running the UML on the same
+  machine with::
+
+       eth0=daemon
+
+
+
+
+  will cause the eth0 driver to attach itself to the daemon correctly.
+
+
+
+6.10.  Slip
+-----------
+
+  Slip is another, less general, mechanism for a process to communicate
+  with the host networking.  In contrast to the ethertap interface,
+  which exchanges ethernet frames with the host and can be used to
+  transport any higher-level protocol, it can only be used to transport
+  IP.
+
+
+  The general format of the command line switch is::
+
+
+
+       ethn=slip,slip IP
+
+
+
+
+  The slip IP argument is the IP address that will be assigned to the
+  host end of the slip device.  If it is specified, the helper will run
+  and will set up the host so that the virtual machine can reach it and
+  the rest of the network.
+
+
+  There are some oddities with this interface that you should be aware
+  of.  You should only specify one slip device on a given virtual
+  machine, and its name inside UML will be 'umn', not 'eth0' or whatever
+  you specified on the command line.  These problems will be fixed at
+  some point.
+
+
+
+6.11.  Slirp
+------------
+
+  slirp uses an external program, usually /usr/bin/slirp, to provide IP
+  only networking connectivity through the host. This is similar to IP
+  masquerading with a firewall, although the translation is performed in
+  user-space, rather than by the kernel.  As slirp does not set up any
+  interfaces on the host, or changes routing, slirp does not require
+  root access or setuid binaries on the host.
+
+
+  The general format of the command line switch for slirp is::
+
+
+
+       ethn=slirp,ethernet address,slirp path
+
+
+
+
+  The ethernet address is optional, as UML will set up the interface
+  with an ethernet address based upon the initial IP address of the
+  interface.  The slirp path is generally /usr/bin/slirp, although it
+  will depend on distribution.
+
+
+  The slirp program can have a number of options passed to the command
+  line and we can't add them to the UML command line, as they will be
+  parsed incorrectly.  Instead, a wrapper shell script can be written or
+  the options inserted into the  /.slirprc file.  More information on
+  all of the slirp options can be found in its man pages.
+
+
+  The eth0 interface on UML should be set up with the IP 10.2.0.15,
+  although you can use anything as long as it is not used by a network
+  you will be connecting to. The default route on UML should be set to
+  use::
+
+
+       UML#
+       route add default dev eth0
+
+
+
+
+  slirp provides a number of useful IP addresses which can be used by
+  UML, such as 10.0.2.3 which is an alias for the DNS server specified
+  in /etc/resolv.conf on the host or the IP given in the 'dns' option
+  for slirp.
+
+
+  Even with a baudrate setting higher than 115200, the slirp connection
+  is limited to 115200. If you need it to go faster, the slirp binary
+  needs to be compiled with FULL_BOLT defined in config.h.
+
+
+
+6.12.  pcap
+-----------
+
+  The pcap transport is attached to a UML ethernet device on the command
+  line or with uml_mconsole with the following syntax::
+
+
+
+       ethn=pcap,host interface,filter
+       expression,option1,option2
+
+
+
+
+  The expression and options are optional.
+
+
+  The interface is whatever network device on the host you want to
+  sniff.  The expression is a pcap filter expression, which is also what
+  tcpdump uses, so if you know how to specify tcpdump filters, you will
+  use the same expressions here.  The options are up to two of
+  'promisc', control whether pcap puts the host interface into
+  promiscuous mode. 'optimize' and 'nooptimize' control whether the pcap
+  expression optimizer is used.
+
+
+  Example::
+
+
+
+       eth0=pcap,eth0,tcp
+
+       eth1=pcap,eth0,!tcp
+
+
+
+  will cause the UML eth0 to emit all tcp packets on the host eth0 and
+  the UML eth1 to emit all non-tcp packets on the host eth0.
+
+
+
+6.13.  Setting up the host yourself
+-----------------------------------
+
+  If you don't specify an address for the host side of the ethertap or
+  slip device, UML won't do any setup on the host.  So this is what is
+  needed to get things working (the examples use a host-side IP of
+  192.168.0.251 and a UML-side IP of 192.168.0.250 - adjust to suit your
+  own network):
+
+  -  The device needs to be configured with its IP address.  Tap devices
+     are also configured with an mtu of 1484.  Slip devices are
+     configured with a point-to-point address pointing at the UML ip
+     address::
+
+
+       host#  ifconfig tap0 arp mtu 1484 192.168.0.251 up
+
+
+       host#
+       ifconfig sl0 192.168.0.251 pointopoint 192.168.0.250 up
+
+
+
+
+
+  -  If a tap device is being set up, a route is set to the UML IP::
+
+
+       UML# route add -host 192.168.0.250 gw 192.168.0.251
+
+
+
+
+
+  -  To allow other hosts on your network to see the virtual machine,
+     proxy arp is set up for it::
+
+
+       host#  arp -Ds 192.168.0.250 eth0 pub
+
+
+
+
+
+  -  Finally, the host is set up to route packets::
+
+
+       host#  echo 1 > /proc/sys/net/ipv4/ip_forward
+
+
+
+
+
+
+
+
+
+
+7.  Sharing Filesystems between Virtual Machines
+================================================
+
+
+
+
+7.1.  A warning
+---------------
+
+  Don't attempt to share filesystems simply by booting two UMLs from the
+  same file.  That's the same thing as booting two physical machines
+  from a shared disk.  It will result in filesystem corruption.
+
+
+
+7.2.  Using layered block devices
+---------------------------------
+
+  The way to share a filesystem between two virtual machines is to use
+  the copy-on-write (COW) layering capability of the ubd block driver.
+  As of 2.4.6-2um, the driver supports layering a read-write private
+  device over a read-only shared device.  A machine's writes are stored
+  in the private device, while reads come from either device - the
+  private one if the requested block is valid in it, the shared one if
+  not.  Using this scheme, the majority of data which is unchanged is
+  shared between an arbitrary number of virtual machines, each of which
+  has a much smaller file containing the changes that it has made.  With
+  a large number of UMLs booting from a large root filesystem, this
+  leads to a huge disk space saving.  It will also help performance,
+  since the host will be able to cache the shared data using a much
+  smaller amount of memory, so UML disk requests will be served from the
+  host's memory rather than its disks.
+
+
+
+
+  To add a copy-on-write layer to an existing block device file, simply
+  add the name of the COW file to the appropriate ubd switch::
+
+
+        ubd0=root_fs_cow,root_fs_debian_22
+
+
+
+
+  where 'root_fs_cow' is the private COW file and 'root_fs_debian_22' is
+  the existing shared filesystem.  The COW file need not exist.  If it
+  doesn't, the driver will create and initialize it.  Once the COW file
+  has been initialized, it can be used on its own on the command line::
+
+
+        ubd0=root_fs_cow
+
+
+
+
+  The name of the backing file is stored in the COW file header, so it
+  would be redundant to continue specifying it on the command line.
+
+
+
+7.3.  Note!
+-----------
+
+  When checking the size of the COW file in order to see the gobs of
+  space that you're saving, make sure you use 'ls -ls' to see the actual
+  disk consumption rather than the length of the file.  The COW file is
+  sparse, so the length will be very different from the disk usage.
+  Here is a 'ls -l' of a COW file and backing file from one boot and
+  shutdown::
+
+       host% ls -l cow.debian debian2.2
+       -rw-r--r--    1 jdike    jdike    492504064 Aug  6 21:16 cow.debian
+       -rwxrw-rw-    1 jdike    jdike    537919488 Aug  6 20:42 debian2.2
+
+
+
+
+  Doesn't look like much saved space, does it?  Well, here's 'ls -ls'::
+
+
+       host% ls -ls cow.debian debian2.2
+          880 -rw-r--r--    1 jdike    jdike    492504064 Aug  6 21:16 cow.debian
+       525832 -rwxrw-rw-    1 jdike    jdike    537919488 Aug  6 20:42 debian2.2
+
+
+
+
+  Now, you can see that the COW file has less than a meg of disk, rather
+  than 492 meg.
+
+
+
+7.4.  Another warning
+---------------------
+
+  Once a filesystem is being used as a readonly backing file for a COW
+  file, do not boot directly from it or modify it in any way.  Doing so
+  will invalidate any COW files that are using it.  The mtime and size
+  of the backing file are stored in the COW file header at its creation,
+  and they must continue to match.  If they don't, the driver will
+  refuse to use the COW file.
+
+
+
+
+  If you attempt to evade this restriction by changing either the
+  backing file or the COW header by hand, you will get a corrupted
+  filesystem.
+
+
+
+
+  Among other things, this means that upgrading the distribution in a
+  backing file and expecting that all of the COW files using it will see
+  the upgrade will not work.
+
+
+
+
+7.5.  uml_moo : Merging a COW file with its backing file
+--------------------------------------------------------
+
+  Depending on how you use UML and COW devices, it may be advisable to
+  merge the changes in the COW file into the backing file every once in
+  a while.
+
+
+
+
+  The utility that does this is uml_moo.  Its usage is::
+
+
+       host% uml_moo COW file new backing file
+
+
+
+
+  There's no need to specify the backing file since that information is
+  already in the COW file header.  If you're paranoid, boot the new
+  merged file, and if you're happy with it, move it over the old backing
+  file.
+
+
+
+
+  uml_moo creates a new backing file by default as a safety measure.  It
+  also has a destructive merge option which will merge the COW file
+  directly into its current backing file.  This is really only usable
+  when the backing file only has one COW file associated with it.  If
+  there are multiple COWs associated with a backing file, a -d merge of
+  one of them will invalidate all of the others.  However, it is
+  convenient if you're short of disk space, and it should also be
+  noticeably faster than a non-destructive merge.
+
+
+
+
+  uml_moo is installed with the UML deb and RPM.  If you didn't install
+  UML from one of those packages, you can also get it from the UML
+  utilities http://user-mode-linux.sourceforge.net/utilities tar file
+  in tools/moo.
+
+
+
+
+
+
+
+
+8.  Creating filesystems
+========================
+
+
+  You may want to create and mount new UML filesystems, either because
+  your root filesystem isn't large enough or because you want to use a
+  filesystem other than ext2.
+
+
+  This was written on the occasion of reiserfs being included in the
+  2.4.1 kernel pool, and therefore the 2.4.1 UML, so the examples will
+  talk about reiserfs.  This information is generic, and the examples
+  should be easy to translate to the filesystem of your choice.
+
+
+8.1.  Create the filesystem file
+================================
+
+  dd is your friend.  All you need to do is tell dd to create an empty
+  file of the appropriate size.  I usually make it sparse to save time
+  and to avoid allocating disk space until it's actually used.  For
+  example, the following command will create a sparse 100 meg file full
+  of zeroes::
+
+
+       host%
+       dd if=/dev/zero of=new_filesystem seek=100 count=1 bs=1M
+
+
+
+
+
+
+  8.2.  Assign the file to a UML device
+
+  Add an argument like the following to the UML command line::
+
+       ubd4=new_filesystem
+
+
+
+
+  making sure that you use an unassigned ubd device number.
+
+
+
+  8.3.  Creating and mounting the filesystem
+
+  Make sure that the filesystem is available, either by being built into
+  the kernel, or available as a module, then boot up UML and log in.  If
+  the root filesystem doesn't have the filesystem utilities (mkfs, fsck,
+  etc), then get them into UML by way of the net or hostfs.
+
+
+  Make the new filesystem on the device assigned to the new file::
+
+
+       host#  mkreiserfs /dev/ubd/4
+
+
+       <----------- MKREISERFSv2 ----------->
+
+       ReiserFS version 3.6.25
+       Block size 4096 bytes
+       Block count 25856
+       Used blocks 8212
+               Journal - 8192 blocks (18-8209), journal header is in block 8210
+               Bitmaps: 17
+               Root block 8211
+       Hash function "r5"
+       ATTENTION: ALL DATA WILL BE LOST ON '/dev/ubd/4'! (y/n)y
+       journal size 8192 (from 18)
+       Initializing journal - 0%....20%....40%....60%....80%....100%
+       Syncing..done.
+
+
+
+
+  Now, mount it::
+
+
+       UML#
+       mount /dev/ubd/4 /mnt
+
+
+
+
+  and you're in business.
+
+
+
+
+
+
+
+
+
+9.  Host file access
+====================
+
+
+  If you want to access files on the host machine from inside UML, you
+  can treat it as a separate machine and either nfs mount directories
+  from the host or copy files into the virtual machine with scp or rcp.
+  However, since UML is running on the host, it can access those
+  files just like any other process and make them available inside the
+  virtual machine without needing to use the network.
+
+
+  This is now possible with the hostfs virtual filesystem.  With it, you
+  can mount a host directory into the UML filesystem and access the
+  files contained in it just as you would on the host.
+
+
+9.1.  Using hostfs
+------------------
+
+  To begin with, make sure that hostfs is available inside the virtual
+  machine with::
+
+
+       UML# cat /proc/filesystems
+
+
+
+  .  hostfs should be listed.  If it's not, either rebuild the kernel
+  with hostfs configured into it or make sure that hostfs is built as a
+  module and available inside the virtual machine, and insmod it.
+
+
+  Now all you need to do is run mount::
+
+
+       UML# mount none /mnt/host -t hostfs
+
+
+
+
+  will mount the host's / on the virtual machine's /mnt/host.
+
+
+  If you don't want to mount the host root directory, then you can
+  specify a subdirectory to mount with the -o switch to mount::
+
+
+       UML# mount none /mnt/home -t hostfs -o /home
+
+
+
+
+  will mount the hosts's /home on the virtual machine's /mnt/home.
+
+
+
+9.2.  hostfs as the root filesystem
+-----------------------------------
+
+  It's possible to boot from a directory hierarchy on the host using
+  hostfs rather than using the standard filesystem in a file.
+
+  To start, you need that hierarchy.  The easiest way is to loop mount
+  an existing root_fs file::
+
+
+       host#  mount root_fs uml_root_dir -o loop
+
+
+
+
+  You need to change the filesystem type of / in etc/fstab to be
+  'hostfs', so that line looks like this::
+
+    /dev/ubd/0       /        hostfs      defaults          1   1
+
+
+
+
+  Then you need to chown to yourself all the files in that directory
+  that are owned by root.  This worked for me::
+
+
+       host#  find . -uid 0 -exec chown jdike {} \;
+
+
+
+
+  Next, make sure that your UML kernel has hostfs compiled in, not as a
+  module.  Then run UML with the boot device pointing at that directory::
+
+
+        ubd0=/path/to/uml/root/directory
+
+
+
+
+  UML should then boot as it does normally.
+
+
+9.3.  Building hostfs
+---------------------
+
+  If you need to build hostfs because it's not in your kernel, you have
+  two choices:
+
+
+
+  -  Compiling hostfs into the kernel:
+
+
+     Reconfigure the kernel and set the 'Host filesystem' option under
+
+
+  -  Compiling hostfs as a module:
+
+
+     Reconfigure the kernel and set the 'Host filesystem' option under
+     be in arch/um/fs/hostfs/hostfs.o.  Install that in
+     ``/lib/modules/$(uname -r)/fs`` in the virtual machine, boot it up, and::
+
+
+       UML# insmod hostfs
+
+
+.. _The_Management_Console:
+
+10.  The Management Console
+===========================
+
+
+
+  The UML management console is a low-level interface to the kernel,
+  somewhat like the i386 SysRq interface.  Since there is a full-blown
+  operating system under UML, there is much greater flexibility possible
+  than with the SysRq mechanism.
+
+
+  There are a number of things you can do with the mconsole interface:
+
+  -  get the kernel version
+
+  -  add and remove devices
+
+  -  halt or reboot the machine
+
+  -  Send SysRq commands
+
+  -  Pause and resume the UML
+
+
+  You need the mconsole client (uml_mconsole) which is present in CVS
+  (/tools/mconsole) in 2.4.5-9um and later, and will be in the RPM in
+  2.4.6.
+
+
+  You also need CONFIG_MCONSOLE (under 'General Setup') enabled in UML.
+  When you boot UML, you'll see a line like::
+
+
+       mconsole initialized on /home/jdike/.uml/umlNJ32yL/mconsole
+
+
+
+
+  If you specify a unique machine id one the UML command line, i.e.::
+
+
+        umid=debian
+
+
+
+
+  you'll see this::
+
+
+       mconsole initialized on /home/jdike/.uml/debian/mconsole
+
+
+
+
+  That file is the socket that uml_mconsole will use to communicate with
+  UML.  Run it with either the umid or the full path as its argument::
+
+
+       host% uml_mconsole debian
+
+
+
+
+  or::
+
+
+       host% uml_mconsole /home/jdike/.uml/debian/mconsole
+
+
+
+
+  You'll get a prompt, at which you can run one of these commands:
+
+  -  version
+
+  -  halt
+
+  -  reboot
+
+  -  config
+
+  -  remove
+
+  -  sysrq
+
+  -  help
+
+  -  cad
+
+  -  stop
+
+  -  go
+
+
+10.1.  version
+--------------
+
+  This takes no arguments.  It prints the UML version::
+
+
+       (mconsole)  version
+       OK Linux usermode 2.4.5-9um #1 Wed Jun 20 22:47:08 EDT 2001 i686
+
+
+
+
+  There are a couple actual uses for this.  It's a simple no-op which
+  can be used to check that a UML is running.  It's also a way of
+  sending an interrupt to the UML.  This is sometimes useful on SMP
+  hosts, where there's a bug which causes signals to UML to be lost,
+  often causing it to appear to hang.  Sending such a UML the mconsole
+  version command is a good way to 'wake it up' before networking has
+  been enabled, as it does not do anything to the function of the UML.
+
+
+
+10.2.  halt and reboot
+----------------------
+
+  These take no arguments.  They shut the machine down immediately, with
+  no syncing of disks and no clean shutdown of userspace.  So, they are
+  pretty close to crashing the machine::
+
+
+       (mconsole)  halt
+       OK
+
+
+
+
+
+
+10.3.  config
+-------------
+
+  "config" adds a new device to the virtual machine.  Currently the ubd
+  and network drivers support this.  It takes one argument, which is the
+  device to add, with the same syntax as the kernel command line::
+
+
+
+
+       (mconsole)
+       config ubd3=/home/jdike/incoming/roots/root_fs_debian22
+
+       OK
+       (mconsole)  config eth1=mcast
+       OK
+
+
+
+
+
+
+10.4.  remove
+-------------
+
+  "remove" deletes a device from the system.  Its argument is just the
+  name of the device to be removed. The device must be idle in whatever
+  sense the driver considers necessary.  In the case of the ubd driver,
+  the removed block device must not be mounted, swapped on, or otherwise
+  open, and in the case of the network driver, the device must be down::
+
+
+       (mconsole)  remove ubd3
+       OK
+       (mconsole)  remove eth1
+       OK
+
+
+
+
+
+
+10.5.  sysrq
+------------
+
+  This takes one argument, which is a single letter.  It calls the
+  generic kernel's SysRq driver, which does whatever is called for by
+  that argument.  See the SysRq documentation in
+  Documentation/admin-guide/sysrq.rst in your favorite kernel tree to
+  see what letters are valid and what they do.
+
+
+
+10.6.  help
+-----------
+
+  "help" returns a string listing the valid commands and what each one
+  does.
+
+
+
+10.7.  cad
+----------
+
+  This invokes the Ctl-Alt-Del action on init.  What exactly this ends
+  up doing is up to /etc/inittab.  Normally, it reboots the machine.
+  With UML, this is usually not desired, so if a halt would be better,
+  then find the section of inittab that looks like this::
+
+
+       # What to do when CTRL-ALT-DEL is pressed.
+       ca:12345:ctrlaltdel:/sbin/shutdown -t1 -a -r now
+
+
+
+
+  and change the command to halt.
+
+
+
+10.8.  stop
+-----------
+
+  This puts the UML in a loop reading mconsole requests until a 'go'
+  mconsole command is received. This is very useful for making backups
+  of UML filesystems, as the UML can be stopped, then synced via 'sysrq
+  s', so that everything is written to the filesystem. You can then copy
+  the filesystem and then send the UML 'go' via mconsole.
+
+
+  Note that a UML running with more than one CPU will have problems
+  after you send the 'stop' command, as only one CPU will be held in a
+  mconsole loop and all others will continue as normal.  This is a bug,
+  and will be fixed.
+
+
+
+10.9.  go
+---------
+
+  This resumes a UML after being paused by a 'stop' command. Note that
+  when the UML has resumed, TCP connections may have timed out and if
+  the UML is paused for a long period of time, crond might go a little
+  crazy, running all the jobs it didn't do earlier.
+
+
+
+
+
+
+.. _Kernel_debugging:
+
+11.  Kernel debugging
+=====================
+
+
+  Note: The interface that makes debugging, as described here, possible
+  is present in 2.4.0-test6 kernels and later.
+
+
+  Since the user-mode kernel runs as a normal Linux process, it is
+  possible to debug it with gdb almost like any other process.  It is
+  slightly different because the kernel's threads are already being
+  ptraced for system call interception, so gdb can't ptrace them.
+  However, a mechanism has been added to work around that problem.
+
+
+  In order to debug the kernel, you need build it from source.  See
+  :ref:`Compiling_the_kernel_and_modules`  for information on doing that.
+  Make sure that you enable CONFIG_DEBUGSYM and CONFIG_PT_PROXY during
+  the config.  These will compile the kernel with ``-g``, and enable the
+  ptrace proxy so that gdb works with UML, respectively.
+
+
+
+
+11.1.  Starting the kernel under gdb
+------------------------------------
+
+  You can have the kernel running under the control of gdb from the
+  beginning by putting 'debug' on the command line.  You will get an
+  xterm with gdb running inside it.  The kernel will send some commands
+  to gdb which will leave it stopped at the beginning of start_kernel.
+  At this point, you can get things going with 'next', 'step', or
+  'cont'.
+
+
+  There is a transcript of a debugging session  here <debug-
+  session.html> , with breakpoints being set in the scheduler and in an
+  interrupt handler.
+
+
+11.2.  Examining sleeping processes
+-----------------------------------
+
+
+  Not every bug is evident in the currently running process.  Sometimes,
+  processes hang in the kernel when they shouldn't because they've
+  deadlocked on a semaphore or something similar.  In this case, when
+  you ^C gdb and get a backtrace, you will see the idle thread, which
+  isn't very relevant.
+
+
+  What you want is the stack of whatever process is sleeping when it
+  shouldn't be.  You need to figure out which process that is, which is
+  generally fairly easy.  Then you need to get its host process id,
+  which you can do either by looking at ps on the host or at
+  task.thread.extern_pid in gdb.
+
+
+  Now what you do is this:
+
+  -  detach from the current thread::
+
+
+       (UML gdb)  det
+
+
+
+
+
+  -  attach to the thread you are interested in::
+
+
+       (UML gdb)  att <host pid>
+
+
+
+
+
+  -  look at its stack and anything else of interest::
+
+
+       (UML gdb)  bt
+
+
+
+
+  Note that you can't do anything at this point that requires that a
+  process execute, e.g. calling a function
+
+  -  when you're done looking at that process, reattach to the current
+     thread and continue it::
+
+
+       (UML gdb)
+       att 1
+
+
+       (UML gdb)
+       c
+
+
+
+
+  Here, specifying any pid which is not the process id of a UML thread
+  will cause gdb to reattach to the current thread.  I commonly use 1,
+  but any other invalid pid would work.
+
+
+
+11.3.  Running ddd on UML
+-------------------------
+
+  ddd works on UML, but requires a special kludge.  The process goes
+  like this:
+
+  -  Start ddd::
+
+
+       host% ddd linux
+
+
+
+
+
+  -  With ps, get the pid of the gdb that ddd started.  You can ask the
+     gdb to tell you, but for some reason that confuses things and
+     causes a hang.
+
+  -  run UML with 'debug=parent gdb-pid=<pid>' added to the command line
+     - it will just sit there after you hit return
+
+  -  type 'att 1' to the ddd gdb and you will see something like::
+
+
+       0xa013dc51 in __kill ()
+
+
+       (gdb)
+
+
+
+
+
+  -  At this point, type 'c', UML will boot up, and you can use ddd just
+     as you do on any other process.
+
+
+
+11.4.  Debugging modules
+------------------------
+
+
+  gdb has support for debugging code which is dynamically loaded into
+  the process.  This support is what is needed to debug kernel modules
+  under UML.
+
+
+  Using that support is somewhat complicated.  You have to tell gdb what
+  object file you just loaded into UML and where in memory it is.  Then,
+  it can read the symbol table, and figure out where all the symbols are
+  from the load address that you provided.  It gets more interesting
+  when you load the module again (i.e. after an rmmod).  You have to
+  tell gdb to forget about all its symbols, including the main UML ones
+  for some reason, then load then all back in again.
+
+
+  There's an easy way and a hard way to do this.  The easy way is to use
+  the umlgdb expect script written by Chandan Kudige.  It basically
+  automates the process for you.
+
+
+  First, you must tell it where your modules are.  There is a list in
+  the script that looks like this::
+
+       set MODULE_PATHS {
+       "fat" "/usr/src/uml/linux-2.4.18/fs/fat/fat.o"
+       "isofs" "/usr/src/uml/linux-2.4.18/fs/isofs/isofs.o"
+       "minix" "/usr/src/uml/linux-2.4.18/fs/minix/minix.o"
+       }
+
+
+
+
+  You change that to list the names and paths of the modules that you
+  are going to debug.  Then you run it from the toplevel directory of
+  your UML pool and it basically tells you what to do::
+
+
+                   ******** GDB pid is 21903 ********
+       Start UML as: ./linux <kernel switches> debug gdb-pid=21903
+
+
+
+       GNU gdb 5.0rh-5 Red Hat Linux 7.1
+       Copyright 2001 Free Software Foundation, Inc.
+       GDB is free software, covered by the GNU General Public License, and you are
+       welcome to change it and/or distribute copies of it under certain conditions.
+       Type "show copying" to see the conditions.
+       There is absolutely no warranty for GDB.  Type "show warranty" for details.
+       This GDB was configured as "i386-redhat-linux"...
+       (gdb) b sys_init_module
+       Breakpoint 1 at 0xa0011923: file module.c, line 349.
+       (gdb) att 1
+
+
+
+
+  After you run UML and it sits there doing nothing, you hit return at
+  the 'att 1' and continue it::
+
+
+       Attaching to program: /home/jdike/linux/2.4/um/./linux, process 1
+       0xa00f4221 in __kill ()
+       (UML gdb)  c
+       Continuing.
+
+
+
+
+  At this point, you debug normally.  When you insmod something, the
+  expect magic will kick in and you'll see something like::
+
+
+     *** Module hostfs loaded ***
+    Breakpoint 1, sys_init_module (name_user=0x805abb0 "hostfs",
+        mod_user=0x8070e00) at module.c:349
+    349             char *name, *n_name, *name_tmp = NULL;
+    (UML gdb)  finish
+    Run till exit from #0  sys_init_module (name_user=0x805abb0 "hostfs",
+        mod_user=0x8070e00) at module.c:349
+    0xa00e2e23 in execute_syscall (r=0xa8140284) at syscall_kern.c:411
+    411             else res = EXECUTE_SYSCALL(syscall, regs);
+    Value returned is $1 = 0
+    (UML gdb)
+    p/x (int)module_list + module_list->size_of_struct
+
+    $2 = 0xa9021054
+    (UML gdb)  symbol-file ./linux
+    Load new symbol table from "./linux"? (y or n) y
+    Reading symbols from ./linux...
+    done.
+    (UML gdb)
+    add-symbol-file /home/jdike/linux/2.4/um/arch/um/fs/hostfs/hostfs.o 0xa9021054
+
+    add symbol table from file "/home/jdike/linux/2.4/um/arch/um/fs/hostfs/hostfs.o" at
+            .text_addr = 0xa9021054
+     (y or n) y
+
+    Reading symbols from /home/jdike/linux/2.4/um/arch/um/fs/hostfs/hostfs.o...
+    done.
+    (UML gdb)  p *module_list
+    $1 = {size_of_struct = 84, next = 0xa0178720, name = 0xa9022de0 "hostfs",
+      size = 9016, uc = {usecount = {counter = 0}, pad = 0}, flags = 1,
+      nsyms = 57, ndeps = 0, syms = 0xa9023170, deps = 0x0, refs = 0x0,
+      init = 0xa90221f0 <init_hostfs>, cleanup = 0xa902222c <exit_hostfs>,
+      ex_table_start = 0x0, ex_table_end = 0x0, persist_start = 0x0,
+      persist_end = 0x0, can_unload = 0, runsize = 0, kallsyms_start = 0x0,
+      kallsyms_end = 0x0,
+      archdata_start = 0x1b855 <Address 0x1b855 out of bounds>,
+      archdata_end = 0xe5890000 <Address 0xe5890000 out of bounds>,
+      kernel_data = 0xf689c35d <Address 0xf689c35d out of bounds>}
+    >> Finished loading symbols for hostfs ...
+
+
+
+
+  That's the easy way.  It's highly recommended.  The hard way is
+  described below in case you're interested in what's going on.
+
+
+  Boot the kernel under the debugger and load the module with insmod or
+  modprobe.  With gdb, do::
+
+
+       (UML gdb)  p module_list
+
+
+
+
+  This is a list of modules that have been loaded into the kernel, with
+  the most recently loaded module first.  Normally, the module you want
+  is at module_list.  If it's not, walk down the next links, looking at
+  the name fields until find the module you want to debug.  Take the
+  address of that structure, and add module.size_of_struct (which in
+  2.4.10 kernels is 96 (0x60)) to it.  Gdb can make this hard addition
+  for you :-)::
+
+
+
+       (UML gdb)
+       printf "%#x\n", (int)module_list module_list->size_of_struct
+
+
+
+
+  The offset from the module start occasionally changes (before 2.4.0,
+  it was module.size_of_struct + 4), so it's a good idea to check the
+  init and cleanup addresses once in a while, as describe below.  Now
+  do::
+
+
+       (UML gdb)
+       add-symbol-file /path/to/module/on/host that_address
+
+
+
+
+  Tell gdb you really want to do it, and you're in business.
+
+
+  If there's any doubt that you got the offset right, like breakpoints
+  appear not to work, or they're appearing in the wrong place, you can
+  check it by looking at the module structure.  The init and cleanup
+  fields should look like::
+
+
+       init = 0x588066b0 <init_hostfs>, cleanup = 0x588066c0 <exit_hostfs>
+
+
+
+
+  with no offsets on the symbol names.  If the names are right, but they
+  are offset, then the offset tells you how much you need to add to the
+  address you gave to add-symbol-file.
+
+
+  When you want to load in a new version of the module, you need to get
+  gdb to forget about the old one.  The only way I've found to do that
+  is to tell gdb to forget about all symbols that it knows about::
+
+
+       (UML gdb)  symbol-file
+
+
+
+
+  Then reload the symbols from the kernel binary::
+
+
+       (UML gdb)  symbol-file /path/to/kernel
+
+
+
+
+  and repeat the process above.  You'll also need to re-enable break-
+  points.  They were disabled when you dumped all the symbols because
+  gdb couldn't figure out where they should go.
+
+
+
+11.5.  Attaching gdb to the kernel
+----------------------------------
+
+  If you don't have the kernel running under gdb, you can attach gdb to
+  it later by sending the tracing thread a SIGUSR1.  The first line of
+  the console output identifies its pid::
+
+       tracing thread pid = 20093
+
+
+
+
+  When you send it the signal::
+
+
+       host% kill -USR1 20093
+
+
+
+
+  you will get an xterm with gdb running in it.
+
+
+  If you have the mconsole compiled into UML, then the mconsole client
+  can be used to start gdb::
+
+
+       (mconsole)  (mconsole) config gdb=xterm
+
+
+
+
+  will fire up an xterm with gdb running in it.
+
+
+
+11.6.  Using alternate debuggers
+--------------------------------
+
+  UML has support for attaching to an already running debugger rather
+  than starting gdb itself.  This is present in CVS as of 17 Apr 2001.
+  I sent it to Alan for inclusion in the ac tree, and it will be in my
+  2.4.4 release.
+
+
+  This is useful when gdb is a subprocess of some UI, such as emacs or
+  ddd.  It can also be used to run debuggers other than gdb on UML.
+  Below is an example of using strace as an alternate debugger.
+
+
+  To do this, you need to get the pid of the debugger and pass it in
+  with the
+
+
+  If you are using gdb under some UI, then tell it to 'att 1', and
+  you'll find yourself attached to UML.
+
+
+  If you are using something other than gdb as your debugger, then
+  you'll need to get it to do the equivalent of 'att 1' if it doesn't do
+  it automatically.
+
+
+  An example of an alternate debugger is strace.  You can strace the
+  actual kernel as follows:
+
+  -  Run the following in a shell::
+
+
+       host%
+       sh -c 'echo pid=$$; echo -n hit return; read x; exec strace -p 1 -o strace.out'
+
+
+
+  -  Run UML with 'debug' and 'gdb-pid=<pid>' with the pid printed out
+     by the previous command
+
+  -  Hit return in the shell, and UML will start running, and strace
+     output will start accumulating in the output file.
+
+     Note that this is different from running::
+
+
+       host% strace ./linux
+
+
+
+
+  That will strace only the main UML thread, the tracing thread, which
+  doesn't do any of the actual kernel work.  It just oversees the vir-
+  tual machine.  In contrast, using strace as described above will show
+  you the low-level activity of the virtual machine.
+
+
+
+
+
+12.  Kernel debugging examples
+==============================
+
+12.1.  The case of the hung fsck
+--------------------------------
+
+  When booting up the kernel, fsck failed, and dropped me into a shell
+  to fix things up.  I ran fsck -y, which hung::
+
+
+    Setting hostname uml                    [ OK ]
+    Checking root filesystem
+    /dev/fhd0 was not cleanly unmounted, check forced.
+    Error reading block 86894 (Attempt to read block from filesystem resulted in short read) while reading indirect blocks of inode 19780.
+
+    /dev/fhd0: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
+           (i.e., without -a or -p options)
+    [ FAILED ]
+
+    *** An error occurred during the file system check.
+    *** Dropping you to a shell; the system will reboot
+    *** when you leave the shell.
+    Give root password for maintenance
+    (or type Control-D for normal startup):
+
+    [root@uml /root]# fsck -y /dev/fhd0
+    fsck -y /dev/fhd0
+    Parallelizing fsck version 1.14 (9-Jan-1999)
+    e2fsck 1.14, 9-Jan-1999 for EXT2 FS 0.5b, 95/08/09
+    /dev/fhd0 contains a file system with errors, check forced.
+    Pass 1: Checking inodes, blocks, and sizes
+    Error reading block 86894 (Attempt to read block from filesystem resulted in short read) while reading indirect blocks of inode 19780.  Ignore error? yes
+
+    Inode 19780, i_blocks is 1548, should be 540.  Fix? yes
+
+    Pass 2: Checking directory structure
+    Error reading block 49405 (Attempt to read block from filesystem resulted in short read).  Ignore error? yes
+
+    Directory inode 11858, block 0, offset 0: directory corrupted
+    Salvage? yes
+
+    Missing '.' in directory inode 11858.
+    Fix? yes
+
+    Missing '..' in directory inode 11858.
+    Fix? yes
+
+
+  The standard drill in this sort of situation is to fire up gdb on the
+  signal thread, which, in this case, was pid 1935.  In another window,
+  I run gdb and attach pid 1935::
+
+
+       ~/linux/2.3.26/um 1016: gdb linux
+       GNU gdb 4.17.0.11 with Linux support
+       Copyright 1998 Free Software Foundation, Inc.
+       GDB is free software, covered by the GNU General Public License, and you are
+       welcome to change it and/or distribute copies of it under certain conditions.
+       Type "show copying" to see the conditions.
+       There is absolutely no warranty for GDB.  Type "show warranty" for details.
+       This GDB was configured as "i386-redhat-linux"...
+
+       (gdb) att 1935
+       Attaching to program `/home/dike/linux/2.3.26/um/linux', Pid 1935
+       0x100756d9 in __wait4 ()
+
+
+  Let's see what's currently running::
+
+
+
+       (gdb) p current_task.pid
+       $1 = 0
+
+
+
+
+
+  It's the idle thread, which means that fsck went to sleep for some
+  reason and never woke up.
+
+
+  Let's guess that the last process in the process list is fsck::
+
+
+
+       (gdb) p current_task.prev_task.comm
+       $13 = "fsck.ext2\000\000\000\000\000\000"
+
+
+
+
+
+  It is, so let's see what it thinks it's up to::
+
+
+
+       (gdb) p current_task.prev_task.thread
+       $14 = {extern_pid = 1980, tracing = 0, want_tracing = 0, forking = 0,
+         kernel_stack_page = 0, signal_stack = 1342627840, syscall = {id = 4, args = {
+             3, 134973440, 1024, 0, 1024}, have_result = 0, result = 50590720},
+         request = {op = 2, u = {exec = {ip = 1350467584, sp = 2952789424}, fork = {
+               regs = {1350467584, 2952789424, 0 <repeats 15 times>}, sigstack = 0,
+               pid = 0}, switch_to = 0x507e8000, thread = {proc = 0x507e8000,
+               arg = 0xaffffdb0, flags = 0, new_pid = 0}, input_request = {
+               op = 1350467584, fd = -1342177872, proc = 0, pid = 0}}}}
+
+
+
+  The interesting things here are the fact that its .thread.syscall.id
+  is __NR_write (see the big switch in arch/um/kernel/syscall_kern.c or
+  the defines in include/asm-um/arch/unistd.h), and that it never
+  returned.  Also, its .request.op is OP_SWITCH (see
+  arch/um/include/user_util.h).  These mean that it went into a write,
+  and, for some reason, called schedule().
+
+
+  The fact that it never returned from write means that its stack should
+  be fairly interesting.  Its pid is 1980 (.thread.extern_pid).  That
+  process is being ptraced by the signal thread, so it must be detached
+  before gdb can attach it::
+
+
+
+    (gdb) call detach(1980)
+
+    Program received signal SIGSEGV, Segmentation fault.
+    <function called from gdb>
+    The program being debugged stopped while in a function called from GDB.
+    When the function (detach) is done executing, GDB will silently
+    stop (instead of continuing to evaluate the expression containing
+    the function call).
+    (gdb) call detach(1980)
+    $15 = 0
+
+
+  The first detach segfaults for some reason, and the second one
+  succeeds.
+
+
+  Now I detach from the signal thread, attach to the fsck thread, and
+  look at its stack::
+
+
+       (gdb) det
+       Detaching from program: /home/dike/linux/2.3.26/um/linux Pid 1935
+       (gdb) att 1980
+       Attaching to program `/home/dike/linux/2.3.26/um/linux', Pid 1980
+       0x10070451 in __kill ()
+       (gdb) bt
+       #0  0x10070451 in __kill ()
+       #1  0x10068ccd in usr1_pid (pid=1980) at process.c:30
+       #2  0x1006a03f in _switch_to (prev=0x50072000, next=0x507e8000)
+           at process_kern.c:156
+       #3  0x1006a052 in switch_to (prev=0x50072000, next=0x507e8000, last=0x50072000)
+           at process_kern.c:161
+       #4  0x10001d12 in schedule () at core.c:777
+       #5  0x1006a744 in __down (sem=0x507d241c) at semaphore.c:71
+       #6  0x1006aa10 in __down_failed () at semaphore.c:157
+       #7  0x1006c5d8 in segv_handler (sc=0x5006e940) at trap_user.c:174
+       #8  0x1006c5ec in kern_segv_handler (sig=11) at trap_user.c:182
+       #9  <signal handler called>
+       #10 0x10155404 in errno ()
+       #11 0x1006c0aa in segv (address=1342179328, is_write=2) at trap_kern.c:50
+       #12 0x1006c5d8 in segv_handler (sc=0x5006eaf8) at trap_user.c:174
+       #13 0x1006c5ec in kern_segv_handler (sig=11) at trap_user.c:182
+       #14 <signal handler called>
+       #15 0xc0fd in ?? ()
+       #16 0x10016647 in sys_write (fd=3,
+           buf=0x80b8800 <Address 0x80b8800 out of bounds>, count=1024)
+           at read_write.c:159
+       #17 0x1006d5b3 in execute_syscall (syscall=4, args=0x5006ef08)
+           at syscall_kern.c:254
+       #18 0x1006af87 in really_do_syscall (sig=12) at syscall_user.c:35
+       #19 <signal handler called>
+       #20 0x400dc8b0 in ?? ()
+
+
+
+
+
+  The interesting things here are:
+
+  -  There are two segfaults on this stack (frames 9 and 14)
+
+  -  The first faulting address (frame 11) is 0x50000800::
+
+       (gdb) p (void *)1342179328
+       $16 = (void *) 0x50000800
+
+
+
+
+
+  The initial faulting address is interesting because it is on the idle
+  thread's stack.  I had been seeing the idle thread segfault for no
+  apparent reason, and the cause looked like stack corruption.  In hopes
+  of catching the culprit in the act, I had turned off all protections
+  to that stack while the idle thread wasn't running.  This apparently
+  tripped that trap.
+
+
+  However, the more immediate problem is that second segfault and I'm
+  going to concentrate on that.  First, I want to see where the fault
+  happened, so I have to go look at the sigcontent struct in frame 8::
+
+
+
+       (gdb) up
+       #1  0x10068ccd in usr1_pid (pid=1980) at process.c:30
+       30        kill(pid, SIGUSR1);
+       (gdb)
+       #2  0x1006a03f in _switch_to (prev=0x50072000, next=0x507e8000)
+           at process_kern.c:156
+       156       usr1_pid(getpid());
+       (gdb)
+       #3  0x1006a052 in switch_to (prev=0x50072000, next=0x507e8000, last=0x50072000)
+           at process_kern.c:161
+       161       _switch_to(prev, next);
+       (gdb)
+       #4  0x10001d12 in schedule () at core.c:777
+       777             switch_to(prev, next, prev);
+       (gdb)
+       #5  0x1006a744 in __down (sem=0x507d241c) at semaphore.c:71
+       71                      schedule();
+       (gdb)
+       #6  0x1006aa10 in __down_failed () at semaphore.c:157
+       157     }
+       (gdb)
+       #7  0x1006c5d8 in segv_handler (sc=0x5006e940) at trap_user.c:174
+       174       segv(sc->cr2, sc->err & 2);
+       (gdb)
+       #8  0x1006c5ec in kern_segv_handler (sig=11) at trap_user.c:182
+       182       segv_handler(sc);
+       (gdb) p *sc
+       Cannot access memory at address 0x0.
+
+
+
+
+  That's not very useful, so I'll try a more manual method::
+
+
+       (gdb) p *((struct sigcontext *) (&sig + 1))
+       $19 = {gs = 0, __gsh = 0, fs = 0, __fsh = 0, es = 43, __esh = 0, ds = 43,
+         __dsh = 0, edi = 1342179328, esi = 1350378548, ebp = 1342630440,
+         esp = 1342630420, ebx = 1348150624, edx = 1280, ecx = 0, eax = 0,
+         trapno = 14, err = 4, eip = 268480945, cs = 35, __csh = 0, eflags = 66118,
+         esp_at_signal = 1342630420, ss = 43, __ssh = 0, fpstate = 0x0, oldmask = 0,
+         cr2 = 1280}
+
+
+
+  The ip is in handle_mm_fault::
+
+
+       (gdb) p (void *)268480945
+       $20 = (void *) 0x1000b1b1
+       (gdb) i sym $20
+       handle_mm_fault + 57 in section .text
+
+
+
+
+
+  Specifically, it's in pte_alloc::
+
+
+       (gdb) i line *$20
+       Line 124 of "/home/dike/linux/2.3.26/um/include/asm/pgalloc.h"
+          starts at address 0x1000b1b1 <handle_mm_fault+57>
+          and ends at 0x1000b1b7 <handle_mm_fault+63>.
+
+
+
+
+
+  To find where in handle_mm_fault this is, I'll jump forward in the
+  code until I see an address in that procedure::
+
+
+
+       (gdb) i line *0x1000b1c0
+       Line 126 of "/home/dike/linux/2.3.26/um/include/asm/pgalloc.h"
+          starts at address 0x1000b1b7 <handle_mm_fault+63>
+          and ends at 0x1000b1c3 <handle_mm_fault+75>.
+       (gdb) i line *0x1000b1d0
+       Line 131 of "/home/dike/linux/2.3.26/um/include/asm/pgalloc.h"
+          starts at address 0x1000b1d0 <handle_mm_fault+88>
+          and ends at 0x1000b1da <handle_mm_fault+98>.
+       (gdb) i line *0x1000b1e0
+       Line 61 of "/home/dike/linux/2.3.26/um/include/asm/pgalloc.h"
+          starts at address 0x1000b1da <handle_mm_fault+98>
+          and ends at 0x1000b1e1 <handle_mm_fault+105>.
+       (gdb) i line *0x1000b1f0
+       Line 134 of "/home/dike/linux/2.3.26/um/include/asm/pgalloc.h"
+          starts at address 0x1000b1f0 <handle_mm_fault+120>
+          and ends at 0x1000b200 <handle_mm_fault+136>.
+       (gdb) i line *0x1000b200
+       Line 135 of "/home/dike/linux/2.3.26/um/include/asm/pgalloc.h"
+          starts at address 0x1000b200 <handle_mm_fault+136>
+          and ends at 0x1000b208 <handle_mm_fault+144>.
+       (gdb) i line *0x1000b210
+       Line 139 of "/home/dike/linux/2.3.26/um/include/asm/pgalloc.h"
+          starts at address 0x1000b210 <handle_mm_fault+152>
+          and ends at 0x1000b219 <handle_mm_fault+161>.
+       (gdb) i line *0x1000b220
+       Line 1168 of "memory.c" starts at address 0x1000b21e <handle_mm_fault+166>
+          and ends at 0x1000b222 <handle_mm_fault+170>.
+
+
+
+
+
+  Something is apparently wrong with the page tables or vma_structs, so
+  lets go back to frame 11 and have a look at them::
+
+
+
+    #11 0x1006c0aa in segv (address=1342179328, is_write=2) at trap_kern.c:50
+    50        handle_mm_fault(current, vma, address, is_write);
+    (gdb) call pgd_offset_proc(vma->vm_mm, address)
+    $22 = (pgd_t *) 0x80a548c
+
+
+
+
+
+  That's pretty bogus.  Page tables aren't supposed to be in process
+  text or data areas.  Let's see what's in the vma::
+
+
+       (gdb) p *vma
+       $23 = {vm_mm = 0x507d2434, vm_start = 0, vm_end = 134512640,
+         vm_next = 0x80a4f8c, vm_page_prot = {pgprot = 0}, vm_flags = 31200,
+         vm_avl_height = 2058, vm_avl_left = 0x80a8c94, vm_avl_right = 0x80d1000,
+         vm_next_share = 0xaffffdb0, vm_pprev_share = 0xaffffe63,
+         vm_ops = 0xaffffe7a, vm_pgoff = 2952789626, vm_file = 0xafffffec,
+         vm_private_data = 0x62}
+       (gdb) p *vma.vm_mm
+       $24 = {mmap = 0x507d2434, mmap_avl = 0x0, mmap_cache = 0x8048000,
+         pgd = 0x80a4f8c, mm_users = {counter = 0}, mm_count = {counter = 134904288},
+         map_count = 134909076, mmap_sem = {count = {counter = 135073792},
+           sleepers = -1342177872, wait = {lock = <optimized out or zero length>,
+             task_list = {next = 0xaffffe63, prev = 0xaffffe7a},
+             __magic = -1342177670, __creator = -1342177300}, __magic = 98},
+         page_table_lock = {}, context = 138, start_code = 0, end_code = 0,
+         start_data = 0, end_data = 0, start_brk = 0, brk = 0, start_stack = 0,
+         arg_start = 0, arg_end = 0, env_start = 0, env_end = 0, rss = 1350381536,
+         total_vm = 0, locked_vm = 0, def_flags = 0, cpu_vm_mask = 0, swap_cnt = 0,
+         swap_address = 0, segments = 0x0}
+
+
+
+  This also pretty bogus.  With all of the 0x80xxxxx and 0xaffffxxx
+  addresses, this is looking like a stack was plonked down on top of
+  these structures.  Maybe it's a stack overflow from the next page::
+
+
+       (gdb) p vma
+       $25 = (struct vm_area_struct *) 0x507d2434
+
+
+
+  That's towards the lower quarter of the page, so that would have to
+  have been pretty heavy stack overflow::
+
+
+    (gdb) x/100x $25
+    0x507d2434:     0x507d2434      0x00000000      0x08048000      0x080a4f8c
+    0x507d2444:     0x00000000      0x080a79e0      0x080a8c94      0x080d1000
+    0x507d2454:     0xaffffdb0      0xaffffe63      0xaffffe7a      0xaffffe7a
+    0x507d2464:     0xafffffec      0x00000062      0x0000008a      0x00000000
+    0x507d2474:     0x00000000      0x00000000      0x00000000      0x00000000
+    0x507d2484:     0x00000000      0x00000000      0x00000000      0x00000000
+    0x507d2494:     0x00000000      0x00000000      0x507d2fe0      0x00000000
+    0x507d24a4:     0x00000000      0x00000000      0x00000000      0x00000000
+    0x507d24b4:     0x00000000      0x00000000      0x00000000      0x00000000
+    0x507d24c4:     0x00000000      0x00000000      0x00000000      0x00000000
+    0x507d24d4:     0x00000000      0x00000000      0x00000000      0x00000000
+    0x507d24e4:     0x00000000      0x00000000      0x00000000      0x00000000
+    0x507d24f4:     0x00000000      0x00000000      0x00000000      0x00000000
+    0x507d2504:     0x00000000      0x00000000      0x00000000      0x00000000
+    0x507d2514:     0x00000000      0x00000000      0x00000000      0x00000000
+    0x507d2524:     0x00000000      0x00000000      0x00000000      0x00000000
+    0x507d2534:     0x00000000      0x00000000      0x507d25dc      0x00000000
+    0x507d2544:     0x00000000      0x00000000      0x00000000      0x00000000
+    0x507d2554:     0x00000000      0x00000000      0x00000000      0x00000000
+    0x507d2564:     0x00000000      0x00000000      0x00000000      0x00000000
+    0x507d2574:     0x00000000      0x00000000      0x00000000      0x00000000
+    0x507d2584:     0x00000000      0x00000000      0x00000000      0x00000000
+    0x507d2594:     0x00000000      0x00000000      0x00000000      0x00000000
+    0x507d25a4:     0x00000000      0x00000000      0x00000000      0x00000000
+    0x507d25b4:     0x00000000      0x00000000      0x00000000      0x00000000
+
+
+
+  It's not stack overflow.  The only "stack-like" piece of this data is
+  the vma_struct itself.
+
+
+  At this point, I don't see any avenues to pursue, so I just have to
+  admit that I have no idea what's going on.  What I will do, though, is
+  stick a trap on the segfault handler which will stop if it sees any
+  writes to the idle thread's stack.  That was the thing that happened
+  first, and it may be that if I can catch it immediately, what's going
+  on will be somewhat clearer.
+
+
+12.2.  Episode 2: The case of the hung fsck
+-------------------------------------------
+
+  After setting a trap in the SEGV handler for accesses to the signal
+  thread's stack, I reran the kernel.
+
+
+  fsck hung again, this time by hitting the trap::
+
+
+
+    Setting hostname uml                            [ OK ]
+    Checking root filesystem
+    /dev/fhd0 contains a file system with errors, check forced.
+    Error reading block 86894 (Attempt to read block from filesystem resulted in short read) while reading indirect blocks of inode 19780.
+
+    /dev/fhd0: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
+           (i.e., without -a or -p options)
+    [ FAILED ]
+
+    *** An error occurred during the file system check.
+    *** Dropping you to a shell; the system will reboot
+    *** when you leave the shell.
+    Give root password for maintenance
+    (or type Control-D for normal startup):
+
+    [root@uml /root]# fsck -y /dev/fhd0
+    fsck -y /dev/fhd0
+    Parallelizing fsck version 1.14 (9-Jan-1999)
+    e2fsck 1.14, 9-Jan-1999 for EXT2 FS 0.5b, 95/08/09
+    /dev/fhd0 contains a file system with errors, check forced.
+    Pass 1: Checking inodes, blocks, and sizes
+    Error reading block 86894 (Attempt to read block from filesystem resulted in short read) while reading indirect blocks of inode 19780.  Ignore error? yes
+
+    Pass 2: Checking directory structure
+    Error reading block 49405 (Attempt to read block from filesystem resulted in short read).  Ignore error? yes
+
+    Directory inode 11858, block 0, offset 0: directory corrupted
+    Salvage? yes
+
+    Missing '.' in directory inode 11858.
+    Fix? yes
+
+    Missing '..' in directory inode 11858.
+    Fix? yes
+
+    Untested (4127) [100fe44c]: trap_kern.c line 31
+
+
+
+
+
+  I need to get the signal thread to detach from pid 4127 so that I can
+  attach to it with gdb.  This is done by sending it a SIGUSR1, which is
+  caught by the signal thread, which detaches the process::
+
+
+       kill -USR1 4127
+
+
+
+
+
+  Now I can run gdb on it::
+
+
+    ~/linux/2.3.26/um 1034: gdb linux
+    GNU gdb 4.17.0.11 with Linux support
+    Copyright 1998 Free Software Foundation, Inc.
+    GDB is free software, covered by the GNU General Public License, and you are
+    welcome to change it and/or distribute copies of it under certain conditions.
+    Type "show copying" to see the conditions.
+    There is absolutely no warranty for GDB.  Type "show warranty" for details.
+    This GDB was configured as "i386-redhat-linux"...
+    (gdb) att 4127
+    Attaching to program `/home/dike/linux/2.3.26/um/linux', Pid 4127
+    0x10075891 in __libc_nanosleep ()
+
+
+
+
+
+  The backtrace shows that it was in a write and that the fault address
+  (address in frame 3) is 0x50000800, which is right in the middle of
+  the signal thread's stack page::
+
+
+       (gdb) bt
+       #0  0x10075891 in __libc_nanosleep ()
+       #1  0x1007584d in __sleep (seconds=1000000)
+           at ../sysdeps/unix/sysv/linux/sleep.c:78
+       #2  0x1006ce9a in stop () at user_util.c:191
+       #3  0x1006bf88 in segv (address=1342179328, is_write=2) at trap_kern.c:31
+       #4  0x1006c628 in segv_handler (sc=0x5006eaf8) at trap_user.c:174
+       #5  0x1006c63c in kern_segv_handler (sig=11) at trap_user.c:182
+       #6  <signal handler called>
+       #7  0xc0fd in ?? ()
+       #8  0x10016647 in sys_write (fd=3, buf=0x80b8800 "R.", count=1024)
+           at read_write.c:159
+       #9  0x1006d603 in execute_syscall (syscall=4, args=0x5006ef08)
+           at syscall_kern.c:254
+       #10 0x1006af87 in really_do_syscall (sig=12) at syscall_user.c:35
+       #11 <signal handler called>
+       #12 0x400dc8b0 in ?? ()
+       #13 <signal handler called>
+       #14 0x400dc8b0 in ?? ()
+       #15 0x80545fd in ?? ()
+       #16 0x804daae in ?? ()
+       #17 0x8054334 in ?? ()
+       #18 0x804d23e in ?? ()
+       #19 0x8049632 in ?? ()
+       #20 0x80491d2 in ?? ()
+       #21 0x80596b5 in ?? ()
+       (gdb) p (void *)1342179328
+       $3 = (void *) 0x50000800
+
+
+
+  Going up the stack to the segv_handler frame and looking at where in
+  the code the access happened shows that it happened near line 110 of
+  block_dev.c::
+
+
+
+    (gdb) up
+    #1  0x1007584d in __sleep (seconds=1000000)
+       at ../sysdeps/unix/sysv/linux/sleep.c:78
+    ../sysdeps/unix/sysv/linux/sleep.c:78: No such file or directory.
+    (gdb)
+    #2  0x1006ce9a in stop () at user_util.c:191
+    191       while(1) sleep(1000000);
+    (gdb)
+    #3  0x1006bf88 in segv (address=1342179328, is_write=2) at trap_kern.c:31
+    31          KERN_UNTESTED();
+    (gdb)
+    #4  0x1006c628 in segv_handler (sc=0x5006eaf8) at trap_user.c:174
+    174       segv(sc->cr2, sc->err & 2);
+    (gdb) p *sc
+    $1 = {gs = 0, __gsh = 0, fs = 0, __fsh = 0, es = 43, __esh = 0, ds = 43,
+       __dsh = 0, edi = 1342179328, esi = 134973440, ebp = 1342631484,
+       esp = 1342630864, ebx = 256, edx = 0, ecx = 256, eax = 1024, trapno = 14,
+       err = 6, eip = 268550834, cs = 35, __csh = 0, eflags = 66070,
+       esp_at_signal = 1342630864, ss = 43, __ssh = 0, fpstate = 0x0, oldmask = 0,
+       cr2 = 1342179328}
+    (gdb) p (void *)268550834
+    $2 = (void *) 0x1001c2b2
+    (gdb) i sym $2
+    block_write + 1090 in section .text
+    (gdb) i line *$2
+    Line 209 of "/home/dike/linux/2.3.26/um/include/asm/arch/string.h"
+       starts at address 0x1001c2a1 <block_write+1073>
+       and ends at 0x1001c2bf <block_write+1103>.
+    (gdb) i line *0x1001c2c0
+    Line 110 of "block_dev.c" starts at address 0x1001c2bf <block_write+1103>
+       and ends at 0x1001c2e3 <block_write+1139>.
+
+
+
+  Looking at the source shows that the fault happened during a call to
+  copy_from_user to copy the data into the kernel::
+
+
+       107             count -= chars;
+       108             copy_from_user(p,buf,chars);
+       109             p += chars;
+       110             buf += chars;
+
+
+
+  p is the pointer which must contain 0x50000800, since buf contains
+  0x80b8800 (frame 8 above).  It is defined as::
+
+
+                       p = offset + bh->b_data;
+
+
+
+
+
+  I need to figure out what bh is, and it just so happens that bh is
+  passed as an argument to mark_buffer_uptodate and mark_buffer_dirty a
+  few lines later, so I do a little disassembly::
+
+
+    (gdb) disas 0x1001c2bf 0x1001c2e0
+    Dump of assembler code from 0x1001c2bf to 0x1001c2d0:
+    0x1001c2bf <block_write+1103>:  addl   %eax,0xc(%ebp)
+    0x1001c2c2 <block_write+1106>:  movl   0xfffffdd4(%ebp),%edx
+    0x1001c2c8 <block_write+1112>:  btsl   $0x0,0x18(%edx)
+    0x1001c2cd <block_write+1117>:  btsl   $0x1,0x18(%edx)
+    0x1001c2d2 <block_write+1122>:  sbbl   %ecx,%ecx
+    0x1001c2d4 <block_write+1124>:  testl  %ecx,%ecx
+    0x1001c2d6 <block_write+1126>:  jne    0x1001c2e3 <block_write+1139>
+    0x1001c2d8 <block_write+1128>:  pushl  $0x0
+    0x1001c2da <block_write+1130>:  pushl  %edx
+    0x1001c2db <block_write+1131>:  call   0x1001819c <__mark_buffer_dirty>
+    End of assembler dump.
+
+
+
+
+
+  At that point, bh is in %edx (address 0x1001c2da), which is calculated
+  at 0x1001c2c2 as %ebp + 0xfffffdd4, so I figure exactly what that is,
+  taking %ebp from the sigcontext_struct above::
+
+
+       (gdb) p (void *)1342631484
+       $5 = (void *) 0x5006ee3c
+       (gdb) p 0x5006ee3c+0xfffffdd4
+       $6 = 1342630928
+       (gdb) p (void *)$6
+       $7 = (void *) 0x5006ec10
+       (gdb) p *((void **)$7)
+       $8 = (void *) 0x50100200
+
+
+
+
+
+  Now, I look at the structure to see what's in it, and particularly,
+  what its b_data field contains::
+
+
+       (gdb) p *((struct buffer_head *)0x50100200)
+       $13 = {b_next = 0x50289380, b_blocknr = 49405, b_size = 1024, b_list = 0,
+         b_dev = 15872, b_count = {counter = 1}, b_rdev = 15872, b_state = 24,
+         b_flushtime = 0, b_next_free = 0x501001a0, b_prev_free = 0x50100260,
+         b_this_page = 0x501001a0, b_reqnext = 0x0, b_pprev = 0x507fcf58,
+         b_data = 0x50000800 "", b_page = 0x50004000,
+         b_end_io = 0x10017f60 <end_buffer_io_sync>, b_dev_id = 0x0,
+         b_rsector = 98810, b_wait = {lock = <optimized out or zero length>,
+           task_list = {next = 0x50100248, prev = 0x50100248}, __magic = 1343226448,
+           __creator = 0}, b_kiobuf = 0x0}
+
+
+
+
+
+  The b_data field is indeed 0x50000800, so the question becomes how
+  that happened.  The rest of the structure looks fine, so this probably
+  is not a case of data corruption.  It happened on purpose somehow.
+
+
+  The b_page field is a pointer to the page_struct representing the
+  0x50000000 page.  Looking at it shows the kernel's idea of the state
+  of that page::
+
+
+
+    (gdb) p *$13.b_page
+    $17 = {list = {next = 0x50004a5c, prev = 0x100c5174}, mapping = 0x0,
+       index = 0, next_hash = 0x0, count = {counter = 1}, flags = 132, lru = {
+       next = 0x50008460, prev = 0x50019350}, wait = {
+       lock = <optimized out or zero length>, task_list = {next = 0x50004024,
+           prev = 0x50004024}, __magic = 1342193708, __creator = 0},
+       pprev_hash = 0x0, buffers = 0x501002c0, virtual = 1342177280,
+       zone = 0x100c5160}
+
+
+
+
+
+  Some sanity-checking: the virtual field shows the "virtual" address of
+  this page, which in this kernel is the same as its "physical" address,
+  and the page_struct itself should be mem_map[0], since it represents
+  the first page of memory::
+
+
+
+       (gdb) p (void *)1342177280
+       $18 = (void *) 0x50000000
+       (gdb) p mem_map
+       $19 = (mem_map_t *) 0x50004000
+
+
+
+
+
+  These check out fine.
+
+
+  Now to check out the page_struct itself.  In particular, the flags
+  field shows whether the page is considered free or not::
+
+
+       (gdb) p (void *)132
+       $21 = (void *) 0x84
+
+
+
+
+
+  The "reserved" bit is the high bit, which is definitely not set, so
+  the kernel considers the signal stack page to be free and available to
+  be used.
+
+
+  At this point, I jump to conclusions and start looking at my early
+  boot code, because that's where that page is supposed to be reserved.
+
+
+  In my setup_arch procedure, I have the following code which looks just
+  fine::
+
+
+
+       bootmap_size = init_bootmem(start_pfn, end_pfn - start_pfn);
+       free_bootmem(__pa(low_physmem) + bootmap_size, high_physmem - low_physmem);
+
+
+
+
+
+  Two stack pages have already been allocated, and low_physmem points to
+  the third page, which is the beginning of free memory.
+  The init_bootmem call declares the entire memory to the boot memory
+  manager, which marks it all reserved.  The free_bootmem call frees up
+  all of it, except for the first two pages.  This looks correct to me.
+
+
+  So, I decide to see init_bootmem run and make sure that it is marking
+  those first two pages as reserved.  I never get that far.
+
+
+  Stepping into init_bootmem, and looking at bootmem_map before looking
+  at what it contains shows the following::
+
+
+
+       (gdb) p bootmem_map
+       $3 = (void *) 0x50000000
+
+
+
+
+
+  Aha!  The light dawns.  That first page is doing double duty as a
+  stack and as the boot memory map.  The last thing that the boot memory
+  manager does is to free the pages used by its memory map, so this page
+  is getting freed even its marked as reserved.
+
+
+  The fix was to initialize the boot memory manager before allocating
+  those two stack pages, and then allocate them through the boot memory
+  manager.  After doing this, and fixing a couple of subsequent buglets,
+  the stack corruption problem disappeared.
+
+
+
+
+
+13.  What to do when UML doesn't work
+=====================================
+
+
+
+
+13.1.  Strange compilation errors when you build from source
+------------------------------------------------------------
+
+  As of test11, it is necessary to have "ARCH=um" in the environment or
+  on the make command line for all steps in building UML, including
+  clean, distclean, or mrproper, config, menuconfig, or xconfig, dep,
+  and linux.  If you forget for any of them, the i386 build seems to
+  contaminate the UML build.  If this happens, start from scratch with::
+
+
+       host%
+       make mrproper ARCH=um
+
+
+
+
+  and repeat the build process with ARCH=um on all the steps.
+
+
+  See :ref:`Compiling_the_kernel_and_modules`  for more details.
+
+
+  Another cause of strange compilation errors is building UML in
+  /usr/src/linux.  If you do this, the first thing you need to do is
+  clean up the mess you made.  The /usr/src/linux/asm link will now
+  point to /usr/src/linux/asm-um.  Make it point back to
+  /usr/src/linux/asm-i386.  Then, move your UML pool someplace else and
+  build it there.  Also see below, where a more specific set of symptoms
+  is described.
+
+
+
+13.3.  A variety of panics and hangs with /tmp on a reiserfs filesystem
+-----------------------------------------------------------------------
+
+  I saw this on reiserfs 3.5.21 and it seems to be fixed in 3.5.27.
+  Panics preceded by::
+
+
+       Detaching pid nnnn
+
+
+
+  are diagnostic of this problem.  This is a reiserfs bug which causes a
+  thread to occasionally read stale data from a mmapped page shared with
+  another thread.  The fix is to upgrade the filesystem or to have /tmp
+  be an ext2 filesystem.
+
+
+
+  13.4.  The compile fails with errors about conflicting types for
+  'open', 'dup', and 'waitpid'
+
+  This happens when you build in /usr/src/linux.  The UML build makes
+  the include/asm link point to include/asm-um.  /usr/include/asm points
+  to /usr/src/linux/include/asm, so when that link gets moved, files
+  which need to include the asm-i386 versions of headers get the
+  incompatible asm-um versions.  The fix is to move the include/asm link
+  back to include/asm-i386 and to do UML builds someplace else.
+
+
+
+13.5.  UML doesn't work when /tmp is an NFS filesystem
+------------------------------------------------------
+
+  This seems to be a similar situation with the ReiserFS problem above.
+  Some versions of NFS seems not to handle mmap correctly, which UML
+  depends on.  The workaround is have /tmp be a non-NFS directory.
+
+
+13.6.  UML hangs on boot when compiled with gprof support
+---------------------------------------------------------
+
+  If you build UML with gprof support and, early in the boot, it does
+  this::
+
+
+       kernel BUG at page_alloc.c:100!
+
+
+
+
+  you have a buggy gcc.  You can work around the problem by removing
+  UM_FASTCALL from CFLAGS in arch/um/Makefile-i386.  This will open up
+  another bug, but that one is fairly hard to reproduce.
+
+
+
+13.7.  syslogd dies with a SIGTERM on startup
+---------------------------------------------
+
+  The exact boot error depends on the distribution that you're booting,
+  but Debian produces this::
+
+
+       /etc/rc2.d/S10sysklogd: line 49:    93 Terminated
+       start-stop-daemon --start --quiet --exec /sbin/syslogd -- $SYSLOGD
+
+
+
+
+  This is a syslogd bug.  There's a race between a parent process
+  installing a signal handler and its child sending the signal.
+
+
+
+13.8.  TUN/TAP networking doesn't work on a 2.4 host
+----------------------------------------------------
+
+  There are a couple of problems which were reported by
+  Tim Robinson <timro at trkr dot net>
+
+  -  It doesn't work on hosts running 2.4.7 (or thereabouts) or earlier.
+     The fix is to upgrade to something more recent and then read the
+     next item.
+
+  -  If you see::
+
+
+       File descriptor in bad state
+
+
+
+  when you bring up the device inside UML, you have a header mismatch
+  between the original kernel and the upgraded one.  Make /usr/src/linux
+  point at the new headers.  This will only be a problem if you build
+  uml_net yourself.
+
+
+
+13.9.  You can network to the host but not to other machines on the net
+=======================================================================
+
+  If you can connect to the host, and the host can connect to UML, but
+  you cannot connect to any other machines, then you may need to enable
+  IP Masquerading on the host.  Usually this is only experienced when
+  using private IP addresses (192.168.x.x or 10.x.x.x) for host/UML
+  networking, rather than the public address space that your host is
+  connected to.  UML does not enable IP Masquerading, so you will need
+  to create a static rule to enable it::
+
+
+       host%
+       iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
+
+
+
+
+  Replace eth0 with the interface that you use to talk to the rest of
+  the world.
+
+
+  Documentation on IP Masquerading, and SNAT, can be found at
+  http://www.netfilter.org.
+
+
+  If you can reach the local net, but not the outside Internet, then
+  that is usually a routing problem.  The UML needs a default route::
+
+
+       UML#
+       route add default gw gateway IP
+
+
+
+
+  The gateway IP can be any machine on the local net that knows how to
+  reach the outside world.  Usually, this is the host or the local net-
+  work's gateway.
+
+
+  Occasionally, we hear from someone who can reach some machines, but
+  not others on the same net, or who can reach some ports on other
+  machines, but not others.  These are usually caused by strange
+  firewalling somewhere between the UML and the other box.  You track
+  this down by running tcpdump on every interface the packets travel
+  over and see where they disappear.  When you find a machine that takes
+  the packets in, but does not send them onward, that's the culprit.
+
+
+
+13.10.  I have no root and I want to scream
+===========================================
+
+  Thanks to Birgit Wahlich for telling me about this strange one.  It
+  turns out that there's a limit of six environment variables on the
+  kernel command line.  When that limit is reached or exceeded, argument
+  processing stops, which means that the 'root=' argument that UML
+  usually adds is not seen.  So, the filesystem has no idea what the
+  root device is, so it panics.
+
+
+  The fix is to put less stuff on the command line.  Glomming all your
+  setup variables into one is probably the best way to go.
+
+
+
+13.11.  UML build conflict between ptrace.h and ucontext.h
+==========================================================
+
+  On some older systems, /usr/include/asm/ptrace.h and
+  /usr/include/sys/ucontext.h define the same names.  So, when they're
+  included together, the defines from one completely mess up the parsing
+  of the other, producing errors like::
+
+       /usr/include/sys/ucontext.h:47: parse error before
+       `10`
+
+
+
+
+  plus a pile of warnings.
+
+
+  This is a libc botch, which has since been fixed, and I don't see any
+  way around it besides upgrading.
+
+
+
+13.12.  The UML BogoMips is exactly half the host's BogoMips
+------------------------------------------------------------
+
+  On i386 kernels, there are two ways of running the loop that is used
+  to calculate the BogoMips rating, using the TSC if it's there or using
+  a one-instruction loop.  The TSC produces twice the BogoMips as the
+  loop.  UML uses the loop, since it has nothing resembling a TSC, and
+  will get almost exactly the same BogoMips as a host using the loop.
+  However, on a host with a TSC, its BogoMips will be double the loop
+  BogoMips, and therefore double the UML BogoMips.
+
+
+
+13.13.  When you run UML, it immediately segfaults
+--------------------------------------------------
+
+  If the host is configured with the 2G/2G address space split, that's
+  why.  See ref:`UML_on_2G/2G_hosts`  for the details on getting UML to
+  run on your host.
+
+
+
+13.14.  xterms appear, then immediately disappear
+-------------------------------------------------
+
+  If you're running an up to date kernel with an old release of
+  uml_utilities, the port-helper program will not work properly, so
+  xterms will exit straight after they appear. The solution is to
+  upgrade to the latest release of uml_utilities.  Usually this problem
+  occurs when you have installed a packaged release of UML then compiled
+  your own development kernel without upgrading the uml_utilities from
+  the source distribution.
+
+
+
+13.15.  Any other panic, hang, or strange behavior
+--------------------------------------------------
+
+  If you're seeing truly strange behavior, such as hangs or panics that
+  happen in random places, or you try running the debugger to see what's
+  happening and it acts strangely, then it could be a problem in the
+  host kernel.  If you're not running a stock Linus or -ac kernel, then
+  try that.  An early version of the preemption patch and a 2.4.10 SuSE
+  kernel have caused very strange problems in UML.
+
+
+  Otherwise, let me know about it.  Send a message to one of the UML
+  mailing lists - either the developer list - user-mode-linux-devel at
+  lists dot sourceforge dot net (subscription info) or the user list -
+  user-mode-linux-user at lists dot sourceforge do net (subscription
+  info), whichever you prefer.  Don't assume that everyone knows about
+  it and that a fix is imminent.
+
+
+  If you want to be super-helpful, read :ref:`Diagnosing_Problems` and
+  follow the instructions contained therein.
+
+.. _Diagnosing_Problems:
+
+14.  Diagnosing Problems
+========================
+
+
+  If you get UML to crash, hang, or otherwise misbehave, you should
+  report this on one of the project mailing lists, either the developer
+  list - user-mode-linux-devel at lists dot sourceforge dot net
+  (subscription info) or the user list - user-mode-linux-user at lists
+  dot sourceforge dot net (subscription info).  When you do, it is
+  likely that I will want more information.  So, it would be helpful to
+  read the stuff below, do whatever is applicable in your case, and
+  report the results to the list.
+
+
+  For any diagnosis, you're going to need to build a debugging kernel.
+  The binaries from this site aren't debuggable.  If you haven't done
+  this before, read about :ref:`Compiling_the_kernel_and_modules`  and
+  :ref:`Kernel_debugging` UML first.
+
+
+14.1.  Case 1 : Normal kernel panics
+------------------------------------
+
+  The most common case is for a normal thread to panic.  To debug this,
+  you will need to run it under the debugger (add 'debug' to the command
+  line).  An xterm will start up with gdb running inside it.  Continue
+  it when it stops in start_kernel and make it crash.  Now ``^C gdb`` and
+
+
+  If the panic was a "Kernel mode fault", then there will be a segv
+  frame on the stack and I'm going to want some more information.  The
+  stack might look something like this::
+
+
+       (UML gdb)  backtrace
+       #0  0x1009bf76 in __sigprocmask (how=1, set=0x5f347940, oset=0x0)
+           at ../sysdeps/unix/sysv/linux/sigprocmask.c:49
+       #1  0x10091411 in change_sig (signal=10, on=1) at process.c:218
+       #2  0x10094785 in timer_handler (sig=26) at time_kern.c:32
+       #3  0x1009bf38 in __restore ()
+           at ../sysdeps/unix/sysv/linux/i386/sigaction.c:125
+       #4  0x1009534c in segv (address=8, ip=268849158, is_write=2, is_user=0)
+           at trap_kern.c:66
+       #5  0x10095c04 in segv_handler (sig=11) at trap_user.c:285
+       #6  0x1009bf38 in __restore ()
+
+
+
+
+  I'm going to want to see the symbol and line information for the value
+  of ip in the segv frame.  In this case, you would do the following::
+
+
+       (UML gdb)  i sym 268849158
+
+
+
+
+  and::
+
+
+       (UML gdb)  i line *268849158
+
+
+
+
+  The reason for this is the __restore frame right above the segv_han-
+  dler frame is hiding the frame that actually segfaulted.  So, I have
+  to get that information from the faulting ip.
+
+
+14.2.  Case 2 : Tracing thread panics
+-------------------------------------
+
+  The less common and more painful case is when the tracing thread
+  panics.  In this case, the kernel debugger will be useless because it
+  needs a healthy tracing thread in order to work.  The first thing to
+  do is get a backtrace from the tracing thread.  This is done by
+  figuring out what its pid is, firing up gdb, and attaching it to that
+  pid.  You can figure out the tracing thread pid by looking at the
+  first line of the console output, which will look like this::
+
+
+       tracing thread pid = 15851
+
+
+
+
+  or by running ps on the host and finding the line that looks like
+  this::
+
+
+       jdike 15851 4.5 0.4 132568 1104 pts/0 S 21:34 0:05 ./linux [(tracing thread)]
+
+
+
+
+  If the panic was 'segfault in signals', then follow the instructions
+  above for collecting information about the location of the seg fault.
+
+
+  If the tracing thread flaked out all by itself, then send that
+  backtrace in and wait for our crack debugging team to fix the problem.
+
+
+  14.3.  Case 3 : Tracing thread panics caused by other threads
+
+  However, there are cases where the misbehavior of another thread
+  caused the problem.  The most common panic of this type is::
+
+
+       wait_for_stop failed to wait for  <pid>  to stop with  <signal number>
+
+
+
+
+  In this case, you'll need to get a backtrace from the process men-
+  tioned in the panic, which is complicated by the fact that the kernel
+  debugger is defunct and without some fancy footwork, another gdb can't
+  attach to it.  So, this is how the fancy footwork goes:
+
+  In a shell::
+
+
+       host% kill -STOP pid
+
+
+
+
+  Run gdb on the tracing thread as described in case 2 and do::
+
+
+       (host gdb)  call detach(pid)
+
+
+  If you get a segfault, do it again.  It always works the second time.
+
+  Detach from the tracing thread and attach to that other thread::
+
+
+       (host gdb)  detach
+
+
+
+
+
+
+       (host gdb)  attach pid
+
+
+
+
+  If gdb hangs when attaching to that process, go back to a shell and
+  do::
+
+
+       host%
+       kill -CONT pid
+
+
+
+
+  And then get the backtrace::
+
+
+       (host gdb)  backtrace
+
+
+
+
+
+14.4.  Case 4 : Hangs
+---------------------
+
+  Hangs seem to be fairly rare, but they sometimes happen.  When a hang
+  happens, we need a backtrace from the offending process.  Run the
+  kernel debugger as described in case 1 and get a backtrace.  If the
+  current process is not the idle thread, then send in the backtrace.
+  You can tell that it's the idle thread if the stack looks like this::
+
+
+       #0  0x100b1401 in __libc_nanosleep ()
+       #1  0x100a2885 in idle_sleep (secs=10) at time.c:122
+       #2  0x100a546f in do_idle () at process_kern.c:445
+       #3  0x100a5508 in cpu_idle () at process_kern.c:471
+       #4  0x100ec18f in start_kernel () at init/main.c:592
+       #5  0x100a3e10 in start_kernel_proc (unused=0x0) at um_arch.c:71
+       #6  0x100a383f in signal_tramp (arg=0x100a3dd8) at trap_user.c:50
+
+
+
+
+  If this is the case, then some other process is at fault, and went to
+  sleep when it shouldn't have.  Run ps on the host and figure out which
+  process should not have gone to sleep and stayed asleep.  Then attach
+  to it with gdb and get a backtrace as described in case 3.
+
+
+
+
+
+
+15.  Thanks
+===========
+
+
+  A number of people have helped this project in various ways, and this
+  page gives recognition where recognition is due.
+
+
+  If you're listed here and you would prefer a real link on your name,
+  or no link at all, instead of the despammed email address pseudo-link,
+  let me know.
+
+
+  If you're not listed here and you think maybe you should be, please
+  let me know that as well.  I try to get everyone, but sometimes my
+  bookkeeping lapses and I forget about contributions.
+
+
+15.1.  Code and Documentation
+-----------------------------
+
+  Rusty Russell <rusty at linuxcare.com.au>  -
+
+  -  wrote the  HOWTO
+     http://user-mode-linux.sourceforge.net/old/UserModeLinux-HOWTO.html
+
+  -  prodded me into making this project official and putting it on
+     SourceForge
+
+  -  came up with the way cool UML logo
+     http://user-mode-linux.sourceforge.net/uml-small.png
+
+  -  redid the config process
+
+
+  Peter Moulder <reiter at netspace.net.au>  - Fixed my config and build
+  processes, and added some useful code to the block driver
+
+
+  Bill Stearns <wstearns at pobox.com>  -
+
+  -  HOWTO updates
+
+  -  lots of bug reports
+
+  -  lots of testing
+
+  -  dedicated a box (uml.ists.dartmouth.edu) to support UML development
+
+  -  wrote the mkrootfs script, which allows bootable filesystems of
+     RPM-based distributions to be cranked out
+
+  -  cranked out a large number of filesystems with said script
+
+
+  Jim Leu <jleu at mindspring.com>  - Wrote the virtual ethernet driver
+  and associated usermode tools
+
+  Lars Brinkhoff http://lars.nocrew.org/  - Contributed the ptrace
+  proxy from his own  project to allow easier kernel debugging
+
+
+  Andrea Arcangeli <andrea at suse.de>  - Redid some of the early boot
+  code so that it would work on machines with Large File Support
+
+
+  Chris Emerson - Did the first UML port to Linux/ppc
+
+
+  Harald Welte <laforge at gnumonks.org>  - Wrote the multicast
+  transport for the network driver
+
+
+  Jorgen Cederlof - Added special file support to hostfs
+
+
+  Greg Lonnon  <glonnon at ridgerun dot com>  - Changed the ubd driver
+  to allow it to layer a COW file on a shared read-only filesystem and
+  wrote the iomem emulation support
+
+
+  Henrik Nordstrom http://hem.passagen.se/hno/  - Provided a variety
+  of patches, fixes, and clues
+
+
+  Lennert Buytenhek - Contributed various patches, a rewrite of the
+  network driver, the first implementation of the mconsole driver, and
+  did the bulk of the work needed to get SMP working again.
+
+
+  Yon Uriarte - Fixed the TUN/TAP network backend while I slept.
+
+
+  Adam Heath - Made a bunch of nice cleanups to the initialization code,
+  plus various other small patches.
+
+
+  Matt Zimmerman - Matt volunteered to be the UML Debian maintainer and
+  is doing a real nice job of it.  He also noticed and fixed a number of
+  actually and potentially exploitable security holes in uml_net.  Plus
+  the occasional patch.  I like patches.
+
+
+  James McMechan - James seems to have taken over maintenance of the ubd
+  driver and is doing a nice job of it.
+
+
+  Chandan Kudige - wrote the umlgdb script which automates the reloading
+  of module symbols.
+
+
+  Steve Schmidtke - wrote the UML slirp transport and hostaudio drivers,
+  enabling UML processes to access audio devices on the host. He also
+  submitted patches for the slip transport and lots of other things.
+
+
+  David Coulson http://davidcoulson.net  -
+
+  -  Set up the http://usermodelinux.org  site,
+     which is a great way of keeping the UML user community on top of
+     UML goings-on.
+
+  -  Site documentation and updates
+
+  -  Nifty little UML management daemon  UMLd
+
+  -  Lots of testing and bug reports
+
+
+
+
+15.2.  Flushing out bugs
+------------------------
+
+
+
+  -  Yuri Pudgorodsky
+
+  -  Gerald Britton
+
+  -  Ian Wehrman
+
+  -  Gord Lamb
+
+  -  Eugene Koontz
+
+  -  John H. Hartman
+
+  -  Anders Karlsson
+
+  -  Daniel Phillips
+
+  -  John Fremlin
+
+  -  Rainer Burgstaller
+
+  -  James Stevenson
+
+  -  Matt Clay
+
+  -  Cliff Jefferies
+
+  -  Geoff Hoff
+
+  -  Lennert Buytenhek
+
+  -  Al Viro
+
+  -  Frank Klingenhoefer
+
+  -  Livio Baldini Soares
+
+  -  Jon Burgess
+
+  -  Petru Paler
+
+  -  Paul
+
+  -  Chris Reahard
+
+  -  Sverker Nilsson
+
+  -  Gong Su
+
+  -  johan verrept
+
+  -  Bjorn Eriksson
+
+  -  Lorenzo Allegrucci
+
+  -  Muli Ben-Yehuda
+
+  -  David Mansfield
+
+  -  Howard Goff
+
+  -  Mike Anderson
+
+  -  John Byrne
+
+  -  Sapan J. Batia
+
+  -  Iris Huang
+
+  -  Jan Hudec
+
+  -  Voluspa
+
+
+
+
+15.3.  Buglets and clean-ups
+----------------------------
+
+
+
+  -  Dave Zarzycki
+
+  -  Adam Lazur
+
+  -  Boria Feigin
+
+  -  Brian J. Murrell
+
+  -  JS
+
+  -  Roman Zippel
+
+  -  Wil Cooley
+
+  -  Ayelet Shemesh
+
+  -  Will Dyson
+
+  -  Sverker Nilsson
+
+  -  dvorak
+
+  -  v.naga srinivas
+
+  -  Shlomi Fish
+
+  -  Roger Binns
+
+  -  johan verrept
+
+  -  MrChuoi
+
+  -  Peter Cleve
+
+  -  Vincent Guffens
+
+  -  Nathan Scott
+
+  -  Patrick Caulfield
+
+  -  jbearce
+
+  -  Catalin Marinas
+
+  -  Shane Spencer
+
+  -  Zou Min
+
+
+  -  Ryan Boder
+
+  -  Lorenzo Colitti
+
+  -  Gwendal Grignou
+
+  -  Andre' Breiler
+
+  -  Tsutomu Yasuda
+
+
+
+15.4.  Case Studies
+-------------------
+
+
+  -  Jon Wright
+
+  -  William McEwan
+
+  -  Michael Richardson
+
+
+
+15.5.  Other contributions
+--------------------------
+
+
+  Bill Carr <Bill.Carr at compaq.com>  made the Red Hat mkrootfs script
+  work with RH 6.2.
+
+  Michael Jennings <mikejen at hevanet.com>  sent in some material which
+  is now gracing the top of the  index  page
+  http://user-mode-linux.sourceforge.net/  of this site.
+
+  SGI (and more specifically Ralf Baechle <ralf at
+  uni-koblenz.de> ) gave me an account on oss.sgi.com.
+  The bandwidth there made it possible to
+  produce most of the filesystems available on the project download
+  page.
+
+  Laurent Bonnaud <Laurent.Bonnaud at inpg.fr>  took the old grotty
+  Debian filesystem that I've been distributing and updated it to 2.2.
+  It is now available by itself here.
+
+  Rik van Riel gave me some ftp space on ftp.nl.linux.org so I can make
+  releases even when Sourceforge is broken.
+
+  Rodrigo de Castro looked at my broken pte code and told me what was
+  wrong with it, letting me fix a long-standing (several weeks) and
+  serious set of bugs.
+
+  Chris Reahard built a specialized root filesystem for running a DNS
+  server jailed inside UML.  It's available from the download
+  http://user-mode-linux.sourceforge.net/old/dl-sf.html  page in the Jail
+  Filesystems section.
diff --git a/Documentation/virtual/guest-halt-polling.txt b/Documentation/virtual/guest-halt-polling.txt

deleted file mode 100644 (file)

index b3a2a29..0000000
--- a/Documentation/virtual/guest-halt-polling.txt
+++ /dev/null
@@ -1,78 +0,0 @@
-Guest halt polling
-==================
-
-The cpuidle_haltpoll driver, with the haltpoll governor, allows
-the guest vcpus to poll for a specified amount of time before
-halting.
-This provides the following benefits to host side polling:
-
-       1) The POLL flag is set while polling is performed, which allows
-          a remote vCPU to avoid sending an IPI (and the associated
-          cost of handling the IPI) when performing a wakeup.
-
-       2) The VM-exit cost can be avoided.
-
-The downside of guest side polling is that polling is performed
-even with other runnable tasks in the host.
-
-The basic logic as follows: A global value, guest_halt_poll_ns,
-is configured by the user, indicating the maximum amount of
-time polling is allowed. This value is fixed.
-
-Each vcpu has an adjustable guest_halt_poll_ns
-("per-cpu guest_halt_poll_ns"), which is adjusted by the algorithm
-in response to events (explained below).
-
-Module Parameters
-=================
-
-The haltpoll governor has 5 tunable module parameters:
-
-1) guest_halt_poll_ns:
-Maximum amount of time, in nanoseconds, that polling is
-performed before halting.
-
-Default: 200000
-
-2) guest_halt_poll_shrink:
-Division factor used to shrink per-cpu guest_halt_poll_ns when
-wakeup event occurs after the global guest_halt_poll_ns.
-
-Default: 2
-
-3) guest_halt_poll_grow:
-Multiplication factor used to grow per-cpu guest_halt_poll_ns
-when event occurs after per-cpu guest_halt_poll_ns
-but before global guest_halt_poll_ns.
-
-Default: 2
-
-4) guest_halt_poll_grow_start:
-The per-cpu guest_halt_poll_ns eventually reaches zero
-in case of an idle system. This value sets the initial
-per-cpu guest_halt_poll_ns when growing. This can
-be increased from 10000, to avoid misses during the initial
-growth stage:
-
-10k, 20k, 40k, ... (example assumes guest_halt_poll_grow=2).
-
-Default: 50000
-
-5) guest_halt_poll_allow_shrink:
-
-Bool parameter which allows shrinking. Set to N
-to avoid it (per-cpu guest_halt_poll_ns will remain
-high once achieves global guest_halt_poll_ns value).
-
-Default: Y
-
-The module parameters can be set from the debugfs files in:
-
-       /sys/module/haltpoll/parameters/
-
-Further Notes
-=============
-
-- Care should be taken when setting the guest_halt_poll_ns parameter as a
-large value has the potential to drive the cpu usage to 100% on a machine which
-would be almost entirely idle otherwise.
diff --git a/MAINTAINERS b/MAINTAINERS

index 38fe2f3..a0d8649 100644 (file)
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -2796,11 +2796,11 @@ F:      drivers/block/aoe/
  
  ATHEROS 71XX/9XXX GPIO DRIVER
  M:     Alban Bedel <albeu@free.fr>
+S:     Maintained
  W:     https://github.com/AlbanBedel/linux
  T:     git git://github.com/AlbanBedel/linux
-S:     Maintained
-F:     drivers/gpio/gpio-ath79.c
  F:     Documentation/devicetree/bindings/gpio/gpio-ath79.txt
+F:     drivers/gpio/gpio-ath79.c
  
  ATHEROS 71XX/9XXX USB PHY DRIVER
  M:     Alban Bedel <albeu@free.fr>
@@ -3422,8 +3422,8 @@ BROADCOM BRCMSTB GPIO DRIVER
  M:     Gregory Fong <gregory.0xf0@gmail.com>
  L:     bcm-kernel-feedback-list@broadcom.com
  S:     Supported
-F:     drivers/gpio/gpio-brcmstb.c
  F:     Documentation/devicetree/bindings/gpio/brcm,brcmstb-gpio.txt
+F:     drivers/gpio/gpio-brcmstb.c
  
  BROADCOM BRCMSTB I2C DRIVER
  M:     Kamal Dasu <kdasu.kdev@gmail.com>
@@ -3481,8 +3481,8 @@ BROADCOM KONA GPIO DRIVER
  M:     Ray Jui <rjui@broadcom.com>
  L:     bcm-kernel-feedback-list@broadcom.com
  S:     Supported
-F:     drivers/gpio/gpio-bcm-kona.c
  F:     Documentation/devicetree/bindings/gpio/brcm,kona-gpio.txt
+F:     drivers/gpio/gpio-bcm-kona.c
  
  BROADCOM NETXTREME-E ROCE DRIVER
  M:     Selvin Xavier <selvin.xavier@broadcom.com>
@@ -3597,8 +3597,8 @@ F:        sound/pci/bt87x.c
  
  BT8XXGPIO DRIVER
  M:     Michael Buesch <m@bues.ch>
-W:     http://bu3sch.de/btgpio.php
  S:     Maintained
+W:     http://bu3sch.de/btgpio.php
  F:     drivers/gpio/gpio-bt8xx.c
  
  BTRFS FILE SYSTEM
@@ -7143,18 +7143,18 @@ GPIO SUBSYSTEM
  M:     Linus Walleij <linus.walleij@linaro.org>
  M:     Bartosz Golaszewski <bgolaszewski@baylibre.com>
  L:     linux-gpio@vger.kernel.org
-T:     git git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio.git
  S:     Maintained
+T:     git git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio.git
+F:     Documentation/ABI/obsolete/sysfs-gpio
+F:     Documentation/ABI/testing/gpio-cdev
+F:     Documentation/admin-guide/gpio/
  F:     Documentation/devicetree/bindings/gpio/
  F:     Documentation/driver-api/gpio/
-F:     Documentation/admin-guide/gpio/
-F:     Documentation/ABI/testing/gpio-cdev
-F:     Documentation/ABI/obsolete/sysfs-gpio
  F:     drivers/gpio/
+F:     include/asm-generic/gpio.h
  F:     include/linux/gpio/
  F:     include/linux/gpio.h
  F:     include/linux/of_gpio.h
-F:     include/asm-generic/gpio.h
  F:     include/uapi/linux/gpio.h
  F:     tools/gpio/
  
@@ -8055,8 +8055,8 @@ F:        drivers/scsi/ips.*
  ICH LPC AND GPIO DRIVER
  M:     Peter Tyser <ptyser@xes-inc.com>
  S:     Maintained
-F:     drivers/mfd/lpc_ich.c
  F:     drivers/gpio/gpio-ich.c
+F:     drivers/mfd/lpc_ich.c
  
  ICY I2C DRIVER
  M:     Max Staudt <max@enpas.org>
@@ -16075,8 +16075,8 @@ F:      Documentation/devicetree/bindings/reset/snps,axs10x-reset.txt
  SYNOPSYS CREG GPIO DRIVER
  M:     Eugeniy Paltsev <Eugeniy.Paltsev@synopsys.com>
  S:     Maintained
-F:     drivers/gpio/gpio-creg-snps.c
  F:     Documentation/devicetree/bindings/gpio/snps,creg-gpio.txt
+F:     drivers/gpio/gpio-creg-snps.c
  
  SYNOPSYS DESIGNWARE 8250 UART DRIVER
  R:     Andy Shevchenko <andriy.shevchenko@linux.intel.com>
@@ -16087,8 +16087,8 @@ SYNOPSYS DESIGNWARE APB GPIO DRIVER
  M:     Hoan Tran <hoan@os.amperecomputing.com>
  L:     linux-gpio@vger.kernel.org
  S:     Maintained
-F:     drivers/gpio/gpio-dwapb.c
  F:     Documentation/devicetree/bindings/gpio/snps-dwapb-gpio.txt
+F:     drivers/gpio/gpio-dwapb.c
  
  SYNOPSYS DESIGNWARE AXI DMAC DRIVER
  M:     Eugeniy Paltsev <Eugeniy.Paltsev@synopsys.com>
@@ -18414,8 +18414,8 @@ M:      Nandor Han <nandor.han@ge.com>
  M:     Semi Malinen <semi.malinen@ge.com>
  L:     linux-gpio@vger.kernel.org
  S:     Maintained
-F:     drivers/gpio/gpio-xra1403.c
  F:     Documentation/devicetree/bindings/gpio/gpio-xra1403.txt
+F:     drivers/gpio/gpio-xra1403.c
  
  XTENSA XTFPGA PLATFORM SUPPORT
  M:     Max Filippov <jcmvbkbc@gmail.com>
diff --git a/Makefile b/Makefile

index 84b7184..aab38cb 100644 (file)
--- a/Makefile
+++ b/Makefile
@@ -2,7 +2,7 @@
  VERSION = 5
  PATCHLEVEL = 6
  SUBLEVEL = 0
-EXTRAVERSION = -rc1
+EXTRAVERSION = -rc2
  NAME = Kleptomaniac Octopus
  
  # *DOCUMENTATION*
diff --git a/arch/arm/boot/dts/stih410-b2260.dts b/arch/arm/boot/dts/stih410-b2260.dts

index 4fbd8e9..e2bb597 100644 (file)
--- a/arch/arm/boot/dts/stih410-b2260.dts
+++ b/arch/arm/boot/dts/stih410-b2260.dts
@@ -178,9 +178,6 @@
                         phy-mode = "rgmii";
                         pinctrl-0 = <&pinctrl_rgmii1 &pinctrl_rgmii1_mdio_1>;
  
-                       snps,phy-bus-name = "stmmac";
-                       snps,phy-bus-id = <0>;
-                       snps,phy-addr = <0>;
                         snps,reset-gpio = <&pio0 7 0>;
                         snps,reset-active-low;
                         snps,reset-delays-us = <0 10000 1000000>;
diff --git a/arch/arm/boot/dts/stihxxx-b2120.dtsi b/arch/arm/boot/dts/stihxxx-b2120.dtsi

index 60e1104..d051f08 100644 (file)
--- a/arch/arm/boot/dts/stihxxx-b2120.dtsi
+++ b/arch/arm/boot/dts/stihxxx-b2120.dtsi
@@ -46,7 +46,7 @@
                         /* DAC */
                         format = "i2s";
                         mclk-fs = <256>;
-                       frame-inversion = <1>;
+                       frame-inversion;
                         cpu {
                                 sound-dai = <&sti_uni_player2>;
                         };
diff --git a/arch/arm/configs/am200epdkit_defconfig b/arch/arm/configs/am200epdkit_defconfig

index 622436f..f56ac39 100644 (file)
--- a/arch/arm/configs/am200epdkit_defconfig
+++ b/arch/arm/configs/am200epdkit_defconfig
@@ -11,8 +11,6 @@ CONFIG_SLAB=y
  CONFIG_MODULES=y
  CONFIG_MODULE_UNLOAD=y
  # CONFIG_BLK_DEV_BSG is not set
-# CONFIG_IOSCHED_DEADLINE is not set
-# CONFIG_IOSCHED_CFQ is not set
  CONFIG_ARCH_PXA=y
  CONFIG_ARCH_GUMSTIX=y
  CONFIG_PCCARD=y
diff --git a/arch/arm/configs/axm55xx_defconfig b/arch/arm/configs/axm55xx_defconfig

index f53634a..6ea7daf 100644 (file)
--- a/arch/arm/configs/axm55xx_defconfig
+++ b/arch/arm/configs/axm55xx_defconfig
@@ -25,7 +25,6 @@ CONFIG_EMBEDDED=y
  CONFIG_PROFILING=y
  CONFIG_MODULES=y
  CONFIG_MODULE_UNLOAD=y
-# CONFIG_IOSCHED_DEADLINE is not set
  CONFIG_ARCH_AXXIA=y
  CONFIG_GPIO_PCA953X=y
  CONFIG_ARM_LPAE=y
diff --git a/arch/arm/configs/clps711x_defconfig b/arch/arm/configs/clps711x_defconfig

index c255dab..63a153f 100644 (file)
--- a/arch/arm/configs/clps711x_defconfig
+++ b/arch/arm/configs/clps711x_defconfig
@@ -7,7 +7,6 @@ CONFIG_EMBEDDED=y
  CONFIG_SLOB=y
  CONFIG_JUMP_LABEL=y
  CONFIG_PARTITION_ADVANCED=y
-# CONFIG_IOSCHED_CFQ is not set
  CONFIG_ARCH_CLPS711X=y
  CONFIG_ARCH_AUTCPU12=y
  CONFIG_ARCH_CDB89712=y
diff --git a/arch/arm/configs/cns3420vb_defconfig b/arch/arm/configs/cns3420vb_defconfig

index 89df0a5..66a80b4 100644 (file)
--- a/arch/arm/configs/cns3420vb_defconfig
+++ b/arch/arm/configs/cns3420vb_defconfig
@@ -17,7 +17,7 @@ CONFIG_MODULE_UNLOAD=y
  CONFIG_MODULE_FORCE_UNLOAD=y
  CONFIG_MODVERSIONS=y
  # CONFIG_BLK_DEV_BSG is not set
-CONFIG_IOSCHED_CFQ=m
+CONFIG_IOSCHED_BFQ=m
  CONFIG_ARCH_MULTI_V6=y
  #CONFIG_ARCH_MULTI_V7 is not set
  CONFIG_ARCH_CNS3XXX=y
diff --git a/arch/arm/configs/colibri_pxa300_defconfig b/arch/arm/configs/colibri_pxa300_defconfig

index 446134c..0dae3b1 100644 (file)
--- a/arch/arm/configs/colibri_pxa300_defconfig
+++ b/arch/arm/configs/colibri_pxa300_defconfig
@@ -43,7 +43,6 @@ CONFIG_USB_ANNOUNCE_NEW_DEVICES=y
  CONFIG_USB_MON=y
  CONFIG_USB_STORAGE=y
  CONFIG_MMC=y
-# CONFIG_MMC_BLOCK_BOUNCE is not set
  CONFIG_MMC_PXA=y
  CONFIG_EXT3_FS=y
  CONFIG_NFS_FS=y
diff --git a/arch/arm/configs/collie_defconfig b/arch/arm/configs/collie_defconfig

index e6df11e..36384fd 100644 (file)
--- a/arch/arm/configs/collie_defconfig
+++ b/arch/arm/configs/collie_defconfig
@@ -7,8 +7,6 @@ CONFIG_EXPERT=y
  # CONFIG_BASE_FULL is not set
  # CONFIG_EPOLL is not set
  CONFIG_SLOB=y
-# CONFIG_IOSCHED_DEADLINE is not set
-# CONFIG_IOSCHED_CFQ is not set
  CONFIG_ARCH_SA1100=y
  CONFIG_SA1100_COLLIE=y
  CONFIG_PCCARD=y
diff --git a/arch/arm/configs/davinci_all_defconfig b/arch/arm/configs/davinci_all_defconfig

index 231f897..b5ba8d7 100644 (file)
--- a/arch/arm/configs/davinci_all_defconfig
+++ b/arch/arm/configs/davinci_all_defconfig
@@ -15,8 +15,6 @@ CONFIG_MODULE_UNLOAD=y
  CONFIG_MODULE_FORCE_UNLOAD=y
  CONFIG_MODVERSIONS=y
  CONFIG_PARTITION_ADVANCED=y
-# CONFIG_IOSCHED_DEADLINE is not set
-# CONFIG_IOSCHED_CFQ is not set
  CONFIG_ARCH_MULTIPLATFORM=y
  CONFIG_ARCH_MULTI_V7=n
  CONFIG_ARCH_MULTI_V5=y
diff --git a/arch/arm/configs/efm32_defconfig b/arch/arm/configs/efm32_defconfig

index 10ea925..46213f0 100644 (file)
--- a/arch/arm/configs/efm32_defconfig
+++ b/arch/arm/configs/efm32_defconfig
@@ -12,8 +12,6 @@ CONFIG_EMBEDDED=y
  # CONFIG_VM_EVENT_COUNTERS is not set
  # CONFIG_SLUB_DEBUG is not set
  # CONFIG_BLK_DEV_BSG is not set
-# CONFIG_IOSCHED_DEADLINE is not set
-# CONFIG_IOSCHED_CFQ is not set
  # CONFIG_MMU is not set
  CONFIG_ARM_SINGLE_ARMV7M=y
  CONFIG_ARCH_EFM32=y
diff --git a/arch/arm/configs/ep93xx_defconfig b/arch/arm/configs/ep93xx_defconfig

index ef2d2a8..cd16fb6 100644 (file)
--- a/arch/arm/configs/ep93xx_defconfig
+++ b/arch/arm/configs/ep93xx_defconfig
@@ -11,7 +11,6 @@ CONFIG_MODULE_UNLOAD=y
  CONFIG_MODULE_FORCE_UNLOAD=y
  # CONFIG_BLK_DEV_BSG is not set
  CONFIG_PARTITION_ADVANCED=y
-# CONFIG_IOSCHED_CFQ is not set
  CONFIG_ARCH_EP93XX=y
  CONFIG_CRUNCH=y
  CONFIG_MACH_ADSSPHERE=y
diff --git a/arch/arm/configs/eseries_pxa_defconfig b/arch/arm/configs/eseries_pxa_defconfig

index 56452fa..046f4dc 100644 (file)
--- a/arch/arm/configs/eseries_pxa_defconfig
+++ b/arch/arm/configs/eseries_pxa_defconfig
@@ -9,8 +9,6 @@ CONFIG_MODULES=y
  CONFIG_MODULE_UNLOAD=y
  CONFIG_MODULE_FORCE_UNLOAD=y
  # CONFIG_BLK_DEV_BSG is not set
-# CONFIG_IOSCHED_DEADLINE is not set
-# CONFIG_IOSCHED_CFQ is not set
  CONFIG_ARCH_PXA=y
  CONFIG_ARCH_PXA_ESERIES=y
  # CONFIG_ARM_THUMB is not set
diff --git a/arch/arm/configs/ezx_defconfig b/arch/arm/configs/ezx_defconfig

index 4e28771..bd7b7f9 100644 (file)
--- a/arch/arm/configs/ezx_defconfig
+++ b/arch/arm/configs/ezx_defconfig
@@ -14,7 +14,6 @@ CONFIG_MODULE_UNLOAD=y
  CONFIG_MODULE_FORCE_UNLOAD=y
  CONFIG_MODVERSIONS=y
  # CONFIG_BLK_DEV_BSG is not set
-# CONFIG_IOSCHED_CFQ is not set
  CONFIG_ARCH_PXA=y
  CONFIG_PXA_EZX=y
  CONFIG_NO_HZ=y
diff --git a/arch/arm/configs/h3600_defconfig b/arch/arm/configs/h3600_defconfig

index 4d91e41..c02b3e4 100644 (file)
--- a/arch/arm/configs/h3600_defconfig
+++ b/arch/arm/configs/h3600_defconfig
@@ -5,8 +5,6 @@ CONFIG_LOG_BUF_SHIFT=14
  CONFIG_BLK_DEV_INITRD=y
  CONFIG_MODULES=y
  # CONFIG_BLK_DEV_BSG is not set
-# CONFIG_IOSCHED_DEADLINE is not set
-# CONFIG_IOSCHED_CFQ is not set
  CONFIG_ARCH_SA1100=y
  CONFIG_SA1100_H3600=y
  CONFIG_PCCARD=y
diff --git a/arch/arm/configs/h5000_defconfig b/arch/arm/configs/h5000_defconfig

index 3946c60..f5a338f 100644 (file)
--- a/arch/arm/configs/h5000_defconfig
+++ b/arch/arm/configs/h5000_defconfig
@@ -10,7 +10,6 @@ CONFIG_MODULES=y
  CONFIG_MODULE_UNLOAD=y
  CONFIG_MODULE_FORCE_UNLOAD=y
  # CONFIG_BLK_DEV_BSG is not set
-# CONFIG_IOSCHED_CFQ is not set
  CONFIG_ARCH_PXA=y
  CONFIG_MACH_H5000=y
  CONFIG_AEABI=y
diff --git a/arch/arm/configs/imote2_defconfig b/arch/arm/configs/imote2_defconfig

index 770469f..05c5515 100644 (file)
--- a/arch/arm/configs/imote2_defconfig
+++ b/arch/arm/configs/imote2_defconfig
@@ -13,7 +13,6 @@ CONFIG_MODULE_UNLOAD=y
  CONFIG_MODULE_FORCE_UNLOAD=y
  CONFIG_MODVERSIONS=y
  # CONFIG_BLK_DEV_BSG is not set
-# CONFIG_IOSCHED_CFQ is not set
  CONFIG_ARCH_PXA=y
  CONFIG_MACH_INTELMOTE2=y
  CONFIG_NO_HZ=y
diff --git a/arch/arm/configs/imx_v4_v5_defconfig b/arch/arm/configs/imx_v4_v5_defconfig

index 2b2d617..3df90fc 100644 (file)
--- a/arch/arm/configs/imx_v4_v5_defconfig
+++ b/arch/arm/configs/imx_v4_v5_defconfig
@@ -32,8 +32,6 @@ CONFIG_KPROBES=y
  CONFIG_MODULES=y
  CONFIG_MODULE_UNLOAD=y
  # CONFIG_BLK_DEV_BSG is not set
-# CONFIG_IOSCHED_DEADLINE is not set
-# CONFIG_IOSCHED_CFQ is not set
  CONFIG_NET=y
  CONFIG_PACKET=y
  CONFIG_UNIX=y
diff --git a/arch/arm/configs/lpc18xx_defconfig b/arch/arm/configs/lpc18xx_defconfig

index e518168..be882ea 100644 (file)
--- a/arch/arm/configs/lpc18xx_defconfig
+++ b/arch/arm/configs/lpc18xx_defconfig
@@ -1,4 +1,3 @@
-CONFIG_CROSS_COMPILE="arm-linux-gnueabihf-"
  CONFIG_HIGH_RES_TIMERS=y
  CONFIG_PREEMPT=y
  CONFIG_BLK_DEV_INITRD=y
@@ -28,10 +27,7 @@ CONFIG_FLASH_SIZE=0x00080000
  CONFIG_ZBOOT_ROM_TEXT=0x0
  CONFIG_ZBOOT_ROM_BSS=0x0
  CONFIG_ARM_APPENDED_DTB=y
-# CONFIG_LBDAF is not set
  # CONFIG_BLK_DEV_BSG is not set
-# CONFIG_IOSCHED_DEADLINE is not set
-# CONFIG_IOSCHED_CFQ is not set
  CONFIG_BINFMT_FLAT=y
  CONFIG_BINFMT_ZFLAT=y
  CONFIG_BINFMT_SHARED_FLAT=y
diff --git a/arch/arm/configs/magician_defconfig b/arch/arm/configs/magician_defconfig

index e6486c9..d2e684f 100644 (file)
--- a/arch/arm/configs/magician_defconfig
+++ b/arch/arm/configs/magician_defconfig
@@ -9,8 +9,6 @@ CONFIG_SLAB=y
  CONFIG_MODULES=y
  CONFIG_MODULE_UNLOAD=y
  # CONFIG_BLK_DEV_BSG is not set
-# CONFIG_IOSCHED_DEADLINE is not set
-# CONFIG_IOSCHED_CFQ is not set
  CONFIG_ARCH_PXA=y
  CONFIG_MACH_H4700=y
  CONFIG_MACH_MAGICIAN=y
diff --git a/arch/arm/configs/moxart_defconfig b/arch/arm/configs/moxart_defconfig

index 45d2719..6834e97 100644 (file)
--- a/arch/arm/configs/moxart_defconfig
+++ b/arch/arm/configs/moxart_defconfig
@@ -15,7 +15,6 @@ CONFIG_EMBEDDED=y
  # CONFIG_SLUB_DEBUG is not set
  # CONFIG_COMPAT_BRK is not set
  # CONFIG_BLK_DEV_BSG is not set
-# CONFIG_IOSCHED_DEADLINE is not set
  CONFIG_ARCH_MULTI_V4=y
  # CONFIG_ARCH_MULTI_V7 is not set
  CONFIG_ARCH_MOXART=y
diff --git a/arch/arm/configs/mxs_defconfig b/arch/arm/configs/mxs_defconfig

index 2773899..a9c6f32 100644 (file)
--- a/arch/arm/configs/mxs_defconfig
+++ b/arch/arm/configs/mxs_defconfig
@@ -25,8 +25,6 @@ CONFIG_MODULE_UNLOAD=y
  CONFIG_MODULE_FORCE_UNLOAD=y
  CONFIG_MODVERSIONS=y
  CONFIG_BLK_DEV_INTEGRITY=y
-# CONFIG_IOSCHED_DEADLINE is not set
-# CONFIG_IOSCHED_CFQ is not set
  CONFIG_NET=y
  CONFIG_PACKET=y
  CONFIG_UNIX=y
diff --git a/arch/arm/configs/omap1_defconfig b/arch/arm/configs/omap1_defconfig

index 0c43c58..3b6e745 100644 (file)
--- a/arch/arm/configs/omap1_defconfig
+++ b/arch/arm/configs/omap1_defconfig
@@ -18,8 +18,6 @@ CONFIG_MODULES=y
  CONFIG_MODULE_UNLOAD=y
  CONFIG_MODULE_FORCE_UNLOAD=y
  # CONFIG_BLK_DEV_BSG is not set
-# CONFIG_IOSCHED_DEADLINE is not set
-# CONFIG_IOSCHED_CFQ is not set
  CONFIG_ARCH_OMAP=y
  CONFIG_ARCH_OMAP1=y
  CONFIG_OMAP_RESET_CLOCKS=y
diff --git a/arch/arm/configs/palmz72_defconfig b/arch/arm/configs/palmz72_defconfig

index 4a3fd82..b47c8ab 100644 (file)
--- a/arch/arm/configs/palmz72_defconfig
+++ b/arch/arm/configs/palmz72_defconfig
@@ -7,8 +7,6 @@ CONFIG_SLAB=y
  CONFIG_MODULES=y
  CONFIG_MODULE_UNLOAD=y
  # CONFIG_BLK_DEV_BSG is not set
-# CONFIG_IOSCHED_DEADLINE is not set
-# CONFIG_IOSCHED_CFQ is not set
  CONFIG_ARCH_PXA=y
  CONFIG_ARCH_PXA_PALM=y
  # CONFIG_MACH_PALMTX is not set
diff --git a/arch/arm/configs/pcm027_defconfig b/arch/arm/configs/pcm027_defconfig

index a8c5322..e97a158 100644 (file)
--- a/arch/arm/configs/pcm027_defconfig
+++ b/arch/arm/configs/pcm027_defconfig
@@ -13,8 +13,6 @@ CONFIG_MODULES=y
  CONFIG_MODULE_UNLOAD=y
  CONFIG_MODULE_FORCE_UNLOAD=y
  # CONFIG_BLK_DEV_BSG is not set
-# CONFIG_IOSCHED_DEADLINE is not set
-# CONFIG_IOSCHED_CFQ is not set
  CONFIG_ARCH_PXA=y
  CONFIG_MACH_PCM027=y
  CONFIG_MACH_PCM990_BASEBOARD=y
diff --git a/arch/arm/configs/pleb_defconfig b/arch/arm/configs/pleb_defconfig

index f0541b0..2170148 100644 (file)
--- a/arch/arm/configs/pleb_defconfig
+++ b/arch/arm/configs/pleb_defconfig
@@ -6,8 +6,6 @@ CONFIG_EXPERT=y
  # CONFIG_HOTPLUG is not set
  # CONFIG_SHMEM is not set
  CONFIG_MODULES=y
-# CONFIG_IOSCHED_DEADLINE is not set
-# CONFIG_IOSCHED_CFQ is not set
  CONFIG_ARCH_SA1100=y
  CONFIG_SA1100_PLEB=y
  CONFIG_ZBOOT_ROM_TEXT=0x0
diff --git a/arch/arm/configs/realview_defconfig b/arch/arm/configs/realview_defconfig

index 8a056cc..70e2c74 100644 (file)
--- a/arch/arm/configs/realview_defconfig
+++ b/arch/arm/configs/realview_defconfig
@@ -8,7 +8,6 @@ CONFIG_SLAB=y
  CONFIG_MODULES=y
  CONFIG_MODULE_UNLOAD=y
  # CONFIG_BLK_DEV_BSG is not set
-# CONFIG_IOSCHED_CFQ is not set
  CONFIG_ARCH_MULTI_V6=y
  CONFIG_ARCH_REALVIEW=y
  CONFIG_MACH_REALVIEW_EB=y
diff --git a/arch/arm/configs/sama5_defconfig b/arch/arm/configs/sama5_defconfig

index 27f6135..bab7861 100644 (file)
--- a/arch/arm/configs/sama5_defconfig
+++ b/arch/arm/configs/sama5_defconfig
@@ -14,8 +14,6 @@ CONFIG_MODULE_FORCE_LOAD=y
  CONFIG_MODULE_UNLOAD=y
  CONFIG_MODULE_FORCE_UNLOAD=y
  # CONFIG_BLK_DEV_BSG is not set
-# CONFIG_IOSCHED_DEADLINE is not set
-# CONFIG_IOSCHED_CFQ is not set
  CONFIG_ARCH_AT91=y
  CONFIG_SOC_SAMA5D2=y
  CONFIG_SOC_SAMA5D3=y
@@ -182,7 +180,6 @@ CONFIG_USB_GADGET=y
  CONFIG_USB_ATMEL_USBA=y
  CONFIG_USB_G_SERIAL=y
  CONFIG_MMC=y
-# CONFIG_MMC_BLOCK_BOUNCE is not set
  CONFIG_MMC_SDHCI=y
  CONFIG_MMC_SDHCI_PLTFM=y
  CONFIG_MMC_SDHCI_OF_AT91=y
diff --git a/arch/arm/configs/stm32_defconfig b/arch/arm/configs/stm32_defconfig

index 152321d..551db32 100644 (file)
--- a/arch/arm/configs/stm32_defconfig
+++ b/arch/arm/configs/stm32_defconfig
@@ -14,8 +14,6 @@ CONFIG_EMBEDDED=y
  # CONFIG_VM_EVENT_COUNTERS is not set
  # CONFIG_SLUB_DEBUG is not set
  # CONFIG_BLK_DEV_BSG is not set
-# CONFIG_IOSCHED_DEADLINE is not set
-# CONFIG_IOSCHED_CFQ is not set
  # CONFIG_MMU is not set
  CONFIG_ARCH_STM32=y
  CONFIG_CPU_V7M_NUM_IRQ=240
diff --git a/arch/arm/configs/sunxi_defconfig b/arch/arm/configs/sunxi_defconfig

index 3f5d727..e9fb573 100644 (file)
--- a/arch/arm/configs/sunxi_defconfig
+++ b/arch/arm/configs/sunxi_defconfig
@@ -85,6 +85,7 @@ CONFIG_BATTERY_AXP20X=y
  CONFIG_AXP20X_POWER=y
  CONFIG_THERMAL=y
  CONFIG_CPU_THERMAL=y
+CONFIG_SUN8I_THERMAL=y
  CONFIG_WATCHDOG=y
  CONFIG_SUNXI_WATCHDOG=y
  CONFIG_MFD_AC100=y
diff --git a/arch/arm/configs/u300_defconfig b/arch/arm/configs/u300_defconfig

index 8223397..543f073 100644 (file)
--- a/arch/arm/configs/u300_defconfig
+++ b/arch/arm/configs/u300_defconfig
@@ -11,7 +11,6 @@ CONFIG_MODULES=y
  CONFIG_MODULE_UNLOAD=y
  # CONFIG_BLK_DEV_BSG is not set
  CONFIG_PARTITION_ADVANCED=y
-# CONFIG_IOSCHED_CFQ is not set
  # CONFIG_ARCH_MULTI_V7 is not set
  CONFIG_ARCH_U300=y
  CONFIG_MACH_U300_SPIDUMMY=y
@@ -46,7 +45,6 @@ CONFIG_FB=y
  CONFIG_BACKLIGHT_CLASS_DEVICE=y
  # CONFIG_USB_SUPPORT is not set
  CONFIG_MMC=y
-# CONFIG_MMC_BLOCK_BOUNCE is not set
  CONFIG_MMC_ARMMMCI=y
  CONFIG_RTC_CLASS=y
  # CONFIG_RTC_HCTOSYS is not set
diff --git a/arch/arm/configs/vexpress_defconfig b/arch/arm/configs/vexpress_defconfig

index 2575355..c01baf7 100644 (file)
--- a/arch/arm/configs/vexpress_defconfig
+++ b/arch/arm/configs/vexpress_defconfig
@@ -15,8 +15,6 @@ CONFIG_OPROFILE=y
  CONFIG_MODULES=y
  CONFIG_MODULE_UNLOAD=y
  # CONFIG_BLK_DEV_BSG is not set
-# CONFIG_IOSCHED_DEADLINE is not set
-# CONFIG_IOSCHED_CFQ is not set
  CONFIG_ARCH_VEXPRESS=y
  CONFIG_ARCH_VEXPRESS_DCSCB=y
  CONFIG_ARCH_VEXPRESS_TC2_PM=y
diff --git a/arch/arm/configs/viper_defconfig b/arch/arm/configs/viper_defconfig

index 2ff1616..989599c 100644 (file)
--- a/arch/arm/configs/viper_defconfig
+++ b/arch/arm/configs/viper_defconfig
@@ -9,7 +9,6 @@ CONFIG_SLAB=y
  CONFIG_MODULES=y
  CONFIG_MODULE_UNLOAD=y
  # CONFIG_BLK_DEV_BSG is not set
-# CONFIG_IOSCHED_CFQ is not set
  CONFIG_ARCH_PXA=y
  CONFIG_ARCH_VIPER=y
  CONFIG_IWMMXT=y
diff --git a/arch/arm/configs/zeus_defconfig b/arch/arm/configs/zeus_defconfig

index aa3023c..d3b98c4 100644 (file)
--- a/arch/arm/configs/zeus_defconfig
+++ b/arch/arm/configs/zeus_defconfig
@@ -4,7 +4,6 @@ CONFIG_LOG_BUF_SHIFT=13
  CONFIG_MODULES=y
  CONFIG_MODULE_UNLOAD=y
  # CONFIG_BLK_DEV_BSG is not set
-# CONFIG_IOSCHED_CFQ is not set
  CONFIG_ARCH_PXA=y
  CONFIG_MACH_ARCOM_ZEUS=y
  CONFIG_PCCARD=m
@@ -137,7 +136,6 @@ CONFIG_USB_MASS_STORAGE=m
  CONFIG_USB_G_SERIAL=m
  CONFIG_USB_G_PRINTER=m
  CONFIG_MMC=y
-# CONFIG_MMC_BLOCK_BOUNCE is not set
  CONFIG_MMC_PXA=y
  CONFIG_NEW_LEDS=y
  CONFIG_LEDS_CLASS=m
diff --git a/arch/arm/configs/zx_defconfig b/arch/arm/configs/zx_defconfig

index 4d2ef78..a046a49 100644 (file)
--- a/arch/arm/configs/zx_defconfig
+++ b/arch/arm/configs/zx_defconfig
@@ -16,7 +16,6 @@ CONFIG_EMBEDDED=y
  CONFIG_PERF_EVENTS=y
  CONFIG_SLAB=y
  # CONFIG_BLK_DEV_BSG is not set
-# CONFIG_IOSCHED_CFQ is not set
  CONFIG_ARCH_ZX=y
  CONFIG_SOC_ZX296702=y
  # CONFIG_SWP_EMULATE is not set
diff --git a/arch/arm/kernel/ftrace.c b/arch/arm/kernel/ftrace.c

index 2a5ff69..10499d4 100644 (file)
--- a/arch/arm/kernel/ftrace.c
+++ b/arch/arm/kernel/ftrace.c
@@ -78,13 +78,10 @@ static int ftrace_modify_code(unsigned long pc, unsigned long old,
  {
         unsigned long replaced;
  
-       if (IS_ENABLED(CONFIG_THUMB2_KERNEL)) {
+       if (IS_ENABLED(CONFIG_THUMB2_KERNEL))
                 old = __opcode_to_mem_thumb32(old);
-               new = __opcode_to_mem_thumb32(new);
-       } else {
+       else
                 old = __opcode_to_mem_arm(old);
-               new = __opcode_to_mem_arm(new);
-       }
  
         if (validate) {
                 if (probe_kernel_read(&replaced, (void *)pc, MCOUNT_INSN_SIZE))
diff --git a/arch/arm/kernel/patch.c b/arch/arm/kernel/patch.c

index d0a05a3..e9e828b 100644 (file)
--- a/arch/arm/kernel/patch.c
+++ b/arch/arm/kernel/patch.c
@@ -16,10 +16,10 @@ struct patch {
         unsigned int insn;
  };
  
+#ifdef CONFIG_MMU
  static DEFINE_RAW_SPINLOCK(patch_lock);
  
  static void __kprobes *patch_map(void *addr, int fixmap, unsigned long *flags)
-       __acquires(&patch_lock)
  {
         unsigned int uintaddr = (uintptr_t) addr;
         bool module = !core_kernel_text(uintaddr);
@@ -34,8 +34,6 @@ static void __kprobes *patch_map(void *addr, int fixmap, unsigned long *flags)
  
         if (flags)
                 raw_spin_lock_irqsave(&patch_lock, *flags);
-       else
-               __acquire(&patch_lock);
  
         set_fixmap(fixmap, page_to_phys(page));
  
@@ -43,15 +41,19 @@ static void __kprobes *patch_map(void *addr, int fixmap, unsigned long *flags)
  }
  
  static void __kprobes patch_unmap(int fixmap, unsigned long *flags)
-       __releases(&patch_lock)
  {
         clear_fixmap(fixmap);
  
         if (flags)
                 raw_spin_unlock_irqrestore(&patch_lock, *flags);
-       else
-               __release(&patch_lock);
  }
+#else
+static void __kprobes *patch_map(void *addr, int fixmap, unsigned long *flags)
+{
+       return addr;
+}
+static void __kprobes patch_unmap(int fixmap, unsigned long *flags) { }
+#endif
  
  void __kprobes __patch_text_real(void *addr, unsigned int insn, bool remap)
  {
@@ -64,8 +66,6 @@ void __kprobes __patch_text_real(void *addr, unsigned int insn, bool remap)
  
         if (remap)
                 waddr = patch_map(addr, FIX_TEXT_POKE0, &flags);
-       else
-               __acquire(&patch_lock);
  
         if (thumb2 && __opcode_is_thumb16(insn)) {
                 *(u16 *)waddr = __opcode_to_mem_thumb16(insn);
@@ -102,8 +102,7 @@ void __kprobes __patch_text_real(void *addr, unsigned int insn, bool remap)
         if (waddr != addr) {
                 flush_kernel_vmap_range(waddr, twopage ? size / 2 : size);
                 patch_unmap(FIX_TEXT_POKE0, &flags);
-       } else
-               __release(&patch_lock);
+       }
  
         flush_icache_range((uintptr_t)(addr),
                            (uintptr_t)(addr) + size);
diff --git a/arch/arm/mach-npcm/Kconfig b/arch/arm/mach-npcm/Kconfig

index 880bc2a..7f7002d 100644 (file)
--- a/arch/arm/mach-npcm/Kconfig
+++ b/arch/arm/mach-npcm/Kconfig
@@ -11,7 +11,7 @@ config ARCH_NPCM7XX
         depends on ARCH_MULTI_V7
         select PINCTRL_NPCM7XX
         select NPCM7XX_TIMER
-       select ARCH_REQUIRE_GPIOLIB
+       select GPIOLIB
         select CACHE_L2X0
         select ARM_GIC
         select HAVE_ARM_TWD if SMP
diff --git a/arch/arm64/boot/dts/arm/fvp-base-revc.dts b/arch/arm64/boot/dts/arm/fvp-base-revc.dts

index 62ab0d5..335fff7 100644 (file)
--- a/arch/arm64/boot/dts/arm/fvp-base-revc.dts
+++ b/arch/arm64/boot/dts/arm/fvp-base-revc.dts
@@ -161,10 +161,10 @@
                 bus-range = <0x0 0x1>;
                 reg = <0x0 0x40000000 0x0 0x10000000>;
                 ranges = <0x2000000 0x0 0x50000000 0x0 0x50000000 0x0 0x10000000>;
-               interrupt-map = <0 0 0 1 &gic GIC_SPI 168 IRQ_TYPE_LEVEL_HIGH>,
-                               <0 0 0 2 &gic GIC_SPI 169 IRQ_TYPE_LEVEL_HIGH>,
-                               <0 0 0 3 &gic GIC_SPI 170 IRQ_TYPE_LEVEL_HIGH>,
-                               <0 0 0 4 &gic GIC_SPI 171 IRQ_TYPE_LEVEL_HIGH>;
+               interrupt-map = <0 0 0 1 &gic 0 0 GIC_SPI 168 IRQ_TYPE_LEVEL_HIGH>,
+                               <0 0 0 2 &gic 0 0 GIC_SPI 169 IRQ_TYPE_LEVEL_HIGH>,
+                               <0 0 0 3 &gic 0 0 GIC_SPI 170 IRQ_TYPE_LEVEL_HIGH>,
+                               <0 0 0 4 &gic 0 0 GIC_SPI 171 IRQ_TYPE_LEVEL_HIGH>;
                 interrupt-map-mask = <0x0 0x0 0x0 0x7>;
                 msi-map = <0x0 &its 0x0 0x10000>;
                 iommu-map = <0x0 &smmu 0x0 0x10000>;
diff --git a/arch/arm64/configs/defconfig b/arch/arm64/configs/defconfig

index 0f21288..905109f 100644 (file)
--- a/arch/arm64/configs/defconfig
+++ b/arch/arm64/configs/defconfig
@@ -452,6 +452,7 @@ CONFIG_THERMAL_GOV_POWER_ALLOCATOR=y
  CONFIG_CPU_THERMAL=y
  CONFIG_THERMAL_EMULATION=y
  CONFIG_QORIQ_THERMAL=m
+CONFIG_SUN8I_THERMAL=y
  CONFIG_ROCKCHIP_THERMAL=m
  CONFIG_RCAR_THERMAL=y
  CONFIG_RCAR_GEN3_THERMAL=y
@@ -547,6 +548,7 @@ CONFIG_ROCKCHIP_DW_MIPI_DSI=y
  CONFIG_ROCKCHIP_INNO_HDMI=y
  CONFIG_DRM_RCAR_DU=m
  CONFIG_DRM_SUN4I=m
+CONFIG_DRM_SUN6I_DSI=m
  CONFIG_DRM_SUN8I_DW_HDMI=m
  CONFIG_DRM_SUN8I_MIXER=m
  CONFIG_DRM_MSM=m
@@ -681,7 +683,7 @@ CONFIG_RTC_DRV_SNVS=m
  CONFIG_RTC_DRV_IMX_SC=m
  CONFIG_RTC_DRV_XGENE=y
  CONFIG_DMADEVICES=y
-CONFIG_DMA_BCM2835=m
+CONFIG_DMA_BCM2835=y
  CONFIG_DMA_SUN6I=m
  CONFIG_FSL_EDMA=y
  CONFIG_IMX_SDMA=y
diff --git a/arch/arm64/include/asm/exception.h b/arch/arm64/include/asm/exception.h

index b87c6e2..7a6e81c 100644 (file)
--- a/arch/arm64/include/asm/exception.h
+++ b/arch/arm64/include/asm/exception.h
@@ -33,7 +33,6 @@ static inline u32 disr_to_esr(u64 disr)
  
  asmlinkage void enter_from_user_mode(void);
  void do_mem_abort(unsigned long addr, unsigned int esr, struct pt_regs *regs);
-void do_sp_pc_abort(unsigned long addr, unsigned int esr, struct pt_regs *regs);
  void do_undefinstr(struct pt_regs *regs);
  asmlinkage void bad_mode(struct pt_regs *regs, int reason, unsigned int esr);
  void do_debug_exception(unsigned long addr_if_watchpoint, unsigned int esr,
@@ -47,7 +46,4 @@ void bad_el0_sync(struct pt_regs *regs, int reason, unsigned int esr);
  void do_cp15instr(unsigned int esr, struct pt_regs *regs);
  void do_el0_svc(struct pt_regs *regs);
  void do_el0_svc_compat(struct pt_regs *regs);
-void do_el0_ia_bp_hardening(unsigned long addr,  unsigned int esr,
-                           struct pt_regs *regs);
-
  #endif /* __ASM_EXCEPTION_H */
diff --git a/arch/arm64/include/asm/spinlock.h b/arch/arm64/include/asm/spinlock.h

index 102404d..9083d69 100644 (file)
--- a/arch/arm64/include/asm/spinlock.h
+++ b/arch/arm64/include/asm/spinlock.h
@@ -18,6 +18,10 @@
   * See:
   * https://lore.kernel.org/lkml/20200110100612.GC2827@hirez.programming.kicks-ass.net
   */
-#define vcpu_is_preempted(cpu) false
+#define vcpu_is_preempted vcpu_is_preempted
+static inline bool vcpu_is_preempted(int cpu)
+{
+       return false;
+}
  
  #endif /* __ASM_SPINLOCK_H */
diff --git a/arch/arm64/kernel/kaslr.c b/arch/arm64/kernel/kaslr.c

index 53b8a4e..91a8310 100644 (file)
--- a/arch/arm64/kernel/kaslr.c
+++ b/arch/arm64/kernel/kaslr.c
@@ -11,6 +11,7 @@
  #include <linux/sched.h>
  #include <linux/types.h>
  
+#include <asm/archrandom.h>
  #include <asm/cacheflush.h>
  #include <asm/fixmap.h>
  #include <asm/kernel-pgtable.h>
diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c

index bbb0f0c..0062605 100644 (file)
--- a/arch/arm64/kernel/process.c
+++ b/arch/arm64/kernel/process.c
@@ -466,6 +466,13 @@ static void ssbs_thread_switch(struct task_struct *next)
         if (unlikely(next->flags & PF_KTHREAD))
                 return;
  
+       /*
+        * If all CPUs implement the SSBS extension, then we just need to
+        * context-switch the PSTATE field.
+        */
+       if (cpu_have_feature(cpu_feature(SSBS)))
+               return;
+
         /* If the mitigation is enabled, then we leave SSBS clear. */
         if ((arm64_get_ssbd_state() == ARM64_SSBD_FORCE_ENABLE) ||
             test_tsk_thread_flag(next, TIF_SSBD))
@@ -608,8 +615,6 @@ long get_tagged_addr_ctrl(void)
   * only prevents the tagged address ABI enabling via prctl() and does not
   * disable it for tasks that already opted in to the relaxed ABI.
   */
-static int zero;
-static int one = 1;
  
  static struct ctl_table tagged_addr_sysctl_table[] = {
         {
@@ -618,8 +623,8 @@ static struct ctl_table tagged_addr_sysctl_table[] = {
                 .data           = &tagged_addr_disabled,
                 .maxlen         = sizeof(int),
                 .proc_handler   = proc_dointvec_minmax,
-               .extra1         = &zero,
-               .extra2         = &one,
+               .extra1         = SYSCTL_ZERO,
+               .extra2         = SYSCTL_ONE,
         },
         { }
  };
diff --git a/arch/arm64/kernel/time.c b/arch/arm64/kernel/time.c

index 73f06d4..eebbc8d 100644 (file)
--- a/arch/arm64/kernel/time.c
+++ b/arch/arm64/kernel/time.c
@@ -23,7 +23,7 @@
  #include <linux/irq.h>
  #include <linux/delay.h>
  #include <linux/clocksource.h>
-#include <linux/clk-provider.h>
+#include <linux/of_clk.h>
  #include <linux/acpi.h>
  
  #include <clocksource/arm_arch_timer.h>
diff --git a/arch/s390/boot/uv.c b/arch/s390/boot/uv.c

index ed007f4..3f50115 100644 (file)
--- a/arch/s390/boot/uv.c
+++ b/arch/s390/boot/uv.c
@@ -15,7 +15,8 @@ void uv_query_info(void)
         if (!test_facility(158))
                 return;
  
-       if (uv_call(0, (uint64_t)&uvcb))
+       /* rc==0x100 means that there is additional data we do not process */
+       if (uv_call(0, (uint64_t)&uvcb) && uvcb.header.rc != 0x100)
                 return;
  
         if (test_bit_inv(BIT_UVC_CMD_SET_SHARED_ACCESS, (unsigned long *)uvcb.inst_calls_list) &&
diff --git a/arch/s390/include/asm/timex.h b/arch/s390/include/asm/timex.h

index 670f14a..6bf3a45 100644 (file)
--- a/arch/s390/include/asm/timex.h
+++ b/arch/s390/include/asm/timex.h
@@ -155,7 +155,7 @@ static inline void get_tod_clock_ext(char *clk)
  
  static inline unsigned long long get_tod_clock(void)
  {
-       unsigned char clk[STORE_CLOCK_EXT_SIZE];
+       char clk[STORE_CLOCK_EXT_SIZE];
  
         get_tod_clock_ext(clk);
         return *((unsigned long long *)&clk[1]);
diff --git a/arch/x86/events/amd/core.c b/arch/x86/events/amd/core.c

index 1f22b6b..39eb276 100644 (file)
--- a/arch/x86/events/amd/core.c
+++ b/arch/x86/events/amd/core.c
@@ -250,6 +250,7 @@ static const u64 amd_f17h_perfmon_event_map[PERF_COUNT_HW_MAX] =
         [PERF_COUNT_HW_CPU_CYCLES]              = 0x0076,
         [PERF_COUNT_HW_INSTRUCTIONS]            = 0x00c0,
         [PERF_COUNT_HW_CACHE_REFERENCES]        = 0xff60,
+       [PERF_COUNT_HW_CACHE_MISSES]            = 0x0964,
         [PERF_COUNT_HW_BRANCH_INSTRUCTIONS]     = 0x00c2,
         [PERF_COUNT_HW_BRANCH_MISSES]           = 0x00c3,
         [PERF_COUNT_HW_STALLED_CYCLES_FRONTEND] = 0x0287,
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c

index 3be51aa..dff6623 100644 (file)
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -4765,6 +4765,7 @@ __init int intel_pmu_init(void)
                 break;
  
         case INTEL_FAM6_ATOM_TREMONT_D:
+       case INTEL_FAM6_ATOM_TREMONT:
                 x86_pmu.late_ack = true;
                 memcpy(hw_cache_event_ids, glp_hw_cache_event_ids,
                        sizeof(hw_cache_event_ids));
diff --git a/arch/x86/events/intel/cstate.c b/arch/x86/events/intel/cstate.c

index e1daf41..4814c96 100644 (file)
--- a/arch/x86/events/intel/cstate.c
+++ b/arch/x86/events/intel/cstate.c
@@ -40,17 +40,18 @@
   * Model specific counters:
   *     MSR_CORE_C1_RES: CORE C1 Residency Counter
   *                      perf code: 0x00
- *                      Available model: SLM,AMT,GLM,CNL
+ *                      Available model: SLM,AMT,GLM,CNL,TNT
   *                      Scope: Core (each processor core has a MSR)
   *     MSR_CORE_C3_RESIDENCY: CORE C3 Residency Counter
   *                            perf code: 0x01
   *                            Available model: NHM,WSM,SNB,IVB,HSW,BDW,SKL,GLM,
- *                                             CNL,KBL,CML
+ *                                             CNL,KBL,CML,TNT
   *                            Scope: Core
   *     MSR_CORE_C6_RESIDENCY: CORE C6 Residency Counter
   *                            perf code: 0x02
   *                            Available model: SLM,AMT,NHM,WSM,SNB,IVB,HSW,BDW,
- *                                             SKL,KNL,GLM,CNL,KBL,CML,ICL,TGL
+ *                                             SKL,KNL,GLM,CNL,KBL,CML,ICL,TGL,
+ *                                             TNT
   *                            Scope: Core
   *     MSR_CORE_C7_RESIDENCY: CORE C7 Residency Counter
   *                            perf code: 0x03
@@ -60,17 +61,18 @@
   *     MSR_PKG_C2_RESIDENCY:  Package C2 Residency Counter.
   *                            perf code: 0x00
   *                            Available model: SNB,IVB,HSW,BDW,SKL,KNL,GLM,CNL,
- *                                             KBL,CML,ICL,TGL
+ *                                             KBL,CML,ICL,TGL,TNT
   *                            Scope: Package (physical package)
   *     MSR_PKG_C3_RESIDENCY:  Package C3 Residency Counter.
   *                            perf code: 0x01
   *                            Available model: NHM,WSM,SNB,IVB,HSW,BDW,SKL,KNL,
- *                                             GLM,CNL,KBL,CML,ICL,TGL
+ *                                             GLM,CNL,KBL,CML,ICL,TGL,TNT
   *                            Scope: Package (physical package)
   *     MSR_PKG_C6_RESIDENCY:  Package C6 Residency Counter.
   *                            perf code: 0x02
- *                            Available model: SLM,AMT,NHM,WSM,SNB,IVB,HSW,BDW
- *                                             SKL,KNL,GLM,CNL,KBL,CML,ICL,TGL
+ *                            Available model: SLM,AMT,NHM,WSM,SNB,IVB,HSW,BDW,
+ *                                             SKL,KNL,GLM,CNL,KBL,CML,ICL,TGL,
+ *                                             TNT
   *                            Scope: Package (physical package)
   *     MSR_PKG_C7_RESIDENCY:  Package C7 Residency Counter.
   *                            perf code: 0x03
@@ -87,7 +89,8 @@
   *                            Scope: Package (physical package)
   *     MSR_PKG_C10_RESIDENCY: Package C10 Residency Counter.
   *                            perf code: 0x06
- *                            Available model: HSW ULT,KBL,GLM,CNL,CML,ICL,TGL
+ *                            Available model: HSW ULT,KBL,GLM,CNL,CML,ICL,TGL,
+ *                                             TNT
   *                            Scope: Package (physical package)
   *
   */
@@ -640,8 +643,9 @@ static const struct x86_cpu_id intel_cstates_match[] __initconst = {
  
         X86_CSTATES_MODEL(INTEL_FAM6_ATOM_GOLDMONT,   glm_cstates),
         X86_CSTATES_MODEL(INTEL_FAM6_ATOM_GOLDMONT_D, glm_cstates),
-
         X86_CSTATES_MODEL(INTEL_FAM6_ATOM_GOLDMONT_PLUS, glm_cstates),
+       X86_CSTATES_MODEL(INTEL_FAM6_ATOM_TREMONT_D, glm_cstates),
+       X86_CSTATES_MODEL(INTEL_FAM6_ATOM_TREMONT, glm_cstates),
  
         X86_CSTATES_MODEL(INTEL_FAM6_ICELAKE_L, icl_cstates),
         X86_CSTATES_MODEL(INTEL_FAM6_ICELAKE,   icl_cstates),
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c

index 4b94ae4..dc43cc1 100644 (file)
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -1714,6 +1714,8 @@ intel_pmu_save_and_restart_reload(struct perf_event *event, int count)
         old = ((s64)(prev_raw_count << shift) >> shift);
         local64_add(new - old + count * period, &event->count);
  
+       local64_set(&hwc->period_left, -new);
+
         perf_event_update_userpage(event);
  
         return 0;
diff --git a/arch/x86/events/msr.c b/arch/x86/events/msr.c

index 6f86650..a949f6f 100644 (file)
--- a/arch/x86/events/msr.c
+++ b/arch/x86/events/msr.c
@@ -75,8 +75,9 @@ static bool test_intel(int idx, void *data)
  
         case INTEL_FAM6_ATOM_GOLDMONT:
         case INTEL_FAM6_ATOM_GOLDMONT_D:
-
         case INTEL_FAM6_ATOM_GOLDMONT_PLUS:
+       case INTEL_FAM6_ATOM_TREMONT_D:
+       case INTEL_FAM6_ATOM_TREMONT:
  
         case INTEL_FAM6_XEON_PHI_KNL:
         case INTEL_FAM6_XEON_PHI_KNM:
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h

index 4dffbc1..40a0c0f 100644 (file)
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -781,9 +781,19 @@ struct kvm_vcpu_arch {
         u64 msr_kvm_poll_control;
  
         /*
-        * Indicate whether the access faults on its page table in guest
-        * which is set when fix page fault and used to detect unhandeable
-        * instruction.
+        * Indicates the guest is trying to write a gfn that contains one or
+        * more of the PTEs used to translate the write itself, i.e. the access
+        * is changing its own translation in the guest page tables.  KVM exits
+        * to userspace if emulation of the faulting instruction fails and this
+        * flag is set, as KVM cannot make forward progress.
+        *
+        * If emulation fails for a write to guest page tables, KVM unprotects
+        * (zaps) the shadow page for the target gfn and resumes the guest to
+        * retry the non-emulatable instruction (on hardware).  Unprotecting the
+        * gfn doesn't allow forward progress for a self-changing access because
+        * doing so also zaps the translation for the gfn, i.e. retrying the
+        * instruction will hit a !PRESENT fault, which results in a new shadow
+        * page and sends KVM back to square one.
          */
         bool write_fault_to_shadow_pgtable;
  
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c

index eafc631..afcd30d 100644 (file)
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -1080,9 +1080,6 @@ static int __apic_accept_irq(struct kvm_lapic *apic, int delivery_mode,
                         result = 1;
                         /* assumes that there are only KVM_APIC_INIT/SIPI */
                         apic->pending_events = (1UL << KVM_APIC_INIT);
-                       /* make sure pending_events is visible before sending
-                        * the request */
-                       smp_wmb();
                         kvm_make_request(KVM_REQ_EVENT, vcpu);
                         kvm_vcpu_kick(vcpu);
                 }
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h

index d55674f..a647601 100644 (file)
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -102,6 +102,19 @@ static inline void kvm_mmu_load_cr3(struct kvm_vcpu *vcpu)
                                               kvm_get_active_pcid(vcpu));
  }
  
+int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
+                      bool prefault);
+
+static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
+                                       u32 err, bool prefault)
+{
+#ifdef CONFIG_RETPOLINE
+       if (likely(vcpu->arch.mmu->page_fault == kvm_tdp_page_fault))
+               return kvm_tdp_page_fault(vcpu, cr2_or_gpa, err, prefault);
+#endif
+       return vcpu->arch.mmu->page_fault(vcpu, cr2_or_gpa, err, prefault);
+}
+
  /*
   * Currently, we have two sorts of write-protection, a) the first one
   * write-protects guest page to sync the guest modification, b) another one is
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c

index 7011a4e..87e9ba2 100644 (file)
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4219,8 +4219,8 @@ int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
  }
  EXPORT_SYMBOL_GPL(kvm_handle_page_fault);
  
-static int tdp_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
-                         bool prefault)
+int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
+                      bool prefault)
  {
         int max_level;
  
@@ -4925,7 +4925,7 @@ static void init_kvm_tdp_mmu(struct kvm_vcpu *vcpu)
                 return;
  
         context->mmu_role.as_u64 = new_role.as_u64;
-       context->page_fault = tdp_page_fault;
+       context->page_fault = kvm_tdp_page_fault;
         context->sync_page = nonpaging_sync_page;
         context->invlpg = nonpaging_invlpg;
         context->update_pte = nonpaging_update_pte;
@@ -5436,9 +5436,8 @@ int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 error_code,
         }
  
         if (r == RET_PF_INVALID) {
-               r = vcpu->arch.mmu->page_fault(vcpu, cr2_or_gpa,
-                                              lower_32_bits(error_code),
-                                              false);
+               r = kvm_mmu_do_page_fault(vcpu, cr2_or_gpa,
+                                         lower_32_bits(error_code), false);
                 WARN_ON(r == RET_PF_INVALID);
         }
  
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h

index 4e1ef04..e4c8a4c 100644 (file)
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -33,7 +33,7 @@
         #define PT_GUEST_ACCESSED_SHIFT PT_ACCESSED_SHIFT
         #define PT_HAVE_ACCESSED_DIRTY(mmu) true
         #ifdef CONFIG_X86_64
-       #define PT_MAX_FULL_LEVELS 4
+       #define PT_MAX_FULL_LEVELS PT64_ROOT_MAX_LEVEL
         #define CMPXCHG cmpxchg
         #else
         #define CMPXCHG cmpxchg64
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c

index a3e32d6..bef0ba3 100644 (file)
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -2175,7 +2175,6 @@ static void svm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
         u32 dummy;
         u32 eax = 1;
  
-       vcpu->arch.microcode_version = 0x01000065;
         svm->spec_ctrl = 0;
         svm->virt_spec_ctrl = 0;
  
@@ -2266,6 +2265,7 @@ static int svm_create_vcpu(struct kvm_vcpu *vcpu)
         init_vmcb(svm);
  
         svm_init_osvw(vcpu);
+       vcpu->arch.microcode_version = 0x01000065;
  
         return 0;
  
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c

index 657c2ed..3589cd3 100644 (file)
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -544,7 +544,8 @@ static void nested_vmx_disable_intercept_for_msr(unsigned long *msr_bitmap_l1,
         }
  }
  
-static inline void enable_x2apic_msr_intercepts(unsigned long *msr_bitmap) {
+static inline void enable_x2apic_msr_intercepts(unsigned long *msr_bitmap)
+{
         int msr;
  
         for (msr = 0x800; msr <= 0x8ff; msr += BITS_PER_LONG) {
@@ -1981,7 +1982,7 @@ static int nested_vmx_handle_enlightened_vmptrld(struct kvm_vcpu *vcpu,
         }
  
         /*
-        * Clean fields data can't de used on VMLAUNCH and when we switch
+        * Clean fields data can't be used on VMLAUNCH and when we switch
          * between different L2 guests as KVM keeps a single VMCS12 per L1.
          */
         if (from_launch || evmcs_gpa_changed)
@@ -3575,6 +3576,33 @@ static void nested_vmx_inject_exception_vmexit(struct kvm_vcpu *vcpu,
         nested_vmx_vmexit(vcpu, EXIT_REASON_EXCEPTION_NMI, intr_info, exit_qual);
  }
  
+/*
+ * Returns true if a debug trap is pending delivery.
+ *
+ * In KVM, debug traps bear an exception payload. As such, the class of a #DB
+ * exception may be inferred from the presence of an exception payload.
+ */
+static inline bool vmx_pending_dbg_trap(struct kvm_vcpu *vcpu)
+{
+       return vcpu->arch.exception.pending &&
+                       vcpu->arch.exception.nr == DB_VECTOR &&
+                       vcpu->arch.exception.payload;
+}
+
+/*
+ * Certain VM-exits set the 'pending debug exceptions' field to indicate a
+ * recognized #DB (data or single-step) that has yet to be delivered. Since KVM
+ * represents these debug traps with a payload that is said to be compatible
+ * with the 'pending debug exceptions' field, write the payload to the VMCS
+ * field if a VM-exit is delivered before the debug trap.
+ */
+static void nested_vmx_update_pending_dbg(struct kvm_vcpu *vcpu)
+{
+       if (vmx_pending_dbg_trap(vcpu))
+               vmcs_writel(GUEST_PENDING_DBG_EXCEPTIONS,
+                           vcpu->arch.exception.payload);
+}
+
  static int vmx_check_nested_events(struct kvm_vcpu *vcpu, bool external_intr)
  {
         struct vcpu_vmx *vmx = to_vmx(vcpu);
@@ -3587,6 +3615,7 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu, bool external_intr)
                 test_bit(KVM_APIC_INIT, &apic->pending_events)) {
                 if (block_nested_events)
                         return -EBUSY;
+               nested_vmx_update_pending_dbg(vcpu);
                 clear_bit(KVM_APIC_INIT, &apic->pending_events);
                 nested_vmx_vmexit(vcpu, EXIT_REASON_INIT_SIGNAL, 0, 0);
                 return 0;
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c

index 9a66648..3be25ec 100644 (file)
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -2947,6 +2947,9 @@ void vmx_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
  
  static int get_ept_level(struct kvm_vcpu *vcpu)
  {
+       /* Nested EPT currently only supports 4-level walks. */
+       if (is_guest_mode(vcpu) && nested_cpu_has_ept(get_vmcs12(vcpu)))
+               return 4;
         if (cpu_has_vmx_ept_5levels() && (cpuid_maxphyaddr(vcpu) > 48))
                 return 5;
         return 4;
@@ -4238,7 +4241,6 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
  
         vmx->msr_ia32_umwait_control = 0;
  
-       vcpu->arch.microcode_version = 0x100000000ULL;
         vmx->vcpu.arch.regs[VCPU_REGS_RDX] = get_rdx_init_val();
         vmx->hv_deadline_tsc = -1;
         kvm_set_cr8(vcpu, 0);
@@ -6763,6 +6765,7 @@ static int vmx_create_vcpu(struct kvm_vcpu *vcpu)
         vmx->nested.posted_intr_nv = -1;
         vmx->nested.current_vmptr = -1ull;
  
+       vcpu->arch.microcode_version = 0x100000000ULL;
         vmx->msr_ia32_feature_control_valid_bits = FEAT_CTL_LOCKED;
  
         /*
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c

index fbabb2f..fb5d64e 100644 (file)
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -438,6 +438,14 @@ void kvm_deliver_exception_payload(struct kvm_vcpu *vcpu)
                  * for #DB exceptions under VMX.
                  */
                 vcpu->arch.dr6 ^= payload & DR6_RTM;
+
+               /*
+                * The #DB payload is defined as compatible with the 'pending
+                * debug exceptions' field under VMX, not DR6. While bit 12 is
+                * defined in the 'pending debug exceptions' field (enabled
+                * breakpoint), it is reserved and must be zero in DR6.
+                */
+               vcpu->arch.dr6 &= ~BIT(12);
                 break;
         case PF_VECTOR:
                 vcpu->arch.cr2 = payload;
@@ -490,19 +498,7 @@ static void kvm_multiple_exception(struct kvm_vcpu *vcpu,
                 vcpu->arch.exception.error_code = error_code;
                 vcpu->arch.exception.has_payload = has_payload;
                 vcpu->arch.exception.payload = payload;
-               /*
-                * In guest mode, payload delivery should be deferred,
-                * so that the L1 hypervisor can intercept #PF before
-                * CR2 is modified (or intercept #DB before DR6 is
-                * modified under nVMX).  However, for ABI
-                * compatibility with KVM_GET_VCPU_EVENTS and
-                * KVM_SET_VCPU_EVENTS, we can't delay payload
-                * delivery unless userspace has enabled this
-                * functionality via the per-VM capability,
-                * KVM_CAP_EXCEPTION_PAYLOAD.
-                */
-               if (!vcpu->kvm->arch.exception_payload_enabled ||
-                   !is_guest_mode(vcpu))
+               if (!is_guest_mode(vcpu))
                         kvm_deliver_exception_payload(vcpu);
                 return;
         }
@@ -2448,7 +2444,7 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
         vcpu->hv_clock.tsc_timestamp = tsc_timestamp;
         vcpu->hv_clock.system_time = kernel_ns + v->kvm->arch.kvmclock_offset;
         vcpu->last_guest_tsc = tsc_timestamp;
-       WARN_ON(vcpu->hv_clock.system_time < 0);
+       WARN_ON((s64)vcpu->hv_clock.system_time < 0);
  
         /* If the host uses TSC clocksource, then it is stable */
         pvclock_flags = 0;
@@ -3795,6 +3791,21 @@ static void kvm_vcpu_ioctl_x86_get_vcpu_events(struct kvm_vcpu *vcpu,
  {
         process_nmi(vcpu);
  
+       /*
+        * In guest mode, payload delivery should be deferred,
+        * so that the L1 hypervisor can intercept #PF before
+        * CR2 is modified (or intercept #DB before DR6 is
+        * modified under nVMX). Unless the per-VM capability,
+        * KVM_CAP_EXCEPTION_PAYLOAD, is set, we may not defer the delivery of
+        * an exception payload and handle after a KVM_GET_VCPU_EVENTS. Since we
+        * opportunistically defer the exception payload, deliver it if the
+        * capability hasn't been requested before processing a
+        * KVM_GET_VCPU_EVENTS.
+        */
+       if (!vcpu->kvm->arch.exception_payload_enabled &&
+           vcpu->arch.exception.pending && vcpu->arch.exception.has_payload)
+               kvm_deliver_exception_payload(vcpu);
+
         /*
          * The API doesn't provide the instruction length for software
          * exceptions, so don't report them. As long as the guest RIP
@@ -8942,7 +8953,6 @@ int kvm_task_switch(struct kvm_vcpu *vcpu, u16 tss_selector, int idt_index,
  
         kvm_rip_write(vcpu, ctxt->eip);
         kvm_set_rflags(vcpu, ctxt->eflags);
-       kvm_make_request(KVM_REQ_EVENT, vcpu);
         return 1;
  }
  EXPORT_SYMBOL_GPL(kvm_task_switch);
@@ -10182,7 +10192,7 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
               work->arch.cr3 != vcpu->arch.mmu->get_cr3(vcpu))
                 return;
  
-       vcpu->arch.mmu->page_fault(vcpu, work->cr2_or_gpa, 0, true);
+       kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true);
  }
  
  static inline u32 kvm_async_pf_hash_fn(gfn_t gfn)
diff --git a/crypto/Kconfig b/crypto/Kconfig

index cdb51d4..c24a474 100644 (file)
--- a/crypto/Kconfig
+++ b/crypto/Kconfig
@@ -136,8 +136,6 @@ config CRYPTO_USER
           Userspace configuration for cryptographic instantiations such as
           cbc(aes).
  
-if CRYPTO_MANAGER2
-
  config CRYPTO_MANAGER_DISABLE_TESTS
         bool "Disable run-time self tests"
         default y
@@ -155,8 +153,6 @@ config CRYPTO_MANAGER_EXTRA_TESTS
           This is intended for developer use only, as these tests take much
           longer to run than the normal self tests.
  
-endif  # if CRYPTO_MANAGER2
-
  config CRYPTO_GF128MUL
         tristate
  
diff --git a/crypto/testmgr.c b/crypto/testmgr.c

index 88f33c0..ccb3d60 100644 (file)
--- a/crypto/testmgr.c
+++ b/crypto/testmgr.c
@@ -4436,6 +4436,15 @@ static const struct alg_test_desc alg_test_descs[] = {
                         .cipher = __VECS(tf_cbc_tv_template)
                 },
         }, {
+#if IS_ENABLED(CONFIG_CRYPTO_PAES_S390)
+               .alg = "cbc-paes-s390",
+               .fips_allowed = 1,
+               .test = alg_test_skcipher,
+               .suite = {
+                       .cipher = __VECS(aes_cbc_tv_template)
+               }
+       }, {
+#endif
                 .alg = "cbcmac(aes)",
                 .fips_allowed = 1,
                 .test = alg_test_hash,
@@ -4587,6 +4596,15 @@ static const struct alg_test_desc alg_test_descs[] = {
                         .cipher = __VECS(tf_ctr_tv_template)
                 }
         }, {
+#if IS_ENABLED(CONFIG_CRYPTO_PAES_S390)
+               .alg = "ctr-paes-s390",
+               .fips_allowed = 1,
+               .test = alg_test_skcipher,
+               .suite = {
+                       .cipher = __VECS(aes_ctr_tv_template)
+               }
+       }, {
+#endif
                 .alg = "cts(cbc(aes))",
                 .test = alg_test_skcipher,
                 .fips_allowed = 1,
@@ -4879,6 +4897,15 @@ static const struct alg_test_desc alg_test_descs[] = {
                         .cipher = __VECS(xtea_tv_template)
                 }
         }, {
+#if IS_ENABLED(CONFIG_CRYPTO_PAES_S390)
+               .alg = "ecb-paes-s390",
+               .fips_allowed = 1,
+               .test = alg_test_skcipher,
+               .suite = {
+                       .cipher = __VECS(aes_tv_template)
+               }
+       }, {
+#endif
                 .alg = "ecdh",
                 .test = alg_test_kpp,
                 .fips_allowed = 1,
@@ -5465,6 +5492,15 @@ static const struct alg_test_desc alg_test_descs[] = {
                         .cipher = __VECS(tf_xts_tv_template)
                 }
         }, {
+#if IS_ENABLED(CONFIG_CRYPTO_PAES_S390)
+               .alg = "xts-paes-s390",
+               .fips_allowed = 1,
+               .test = alg_test_skcipher,
+               .suite = {
+                       .cipher = __VECS(aes_xts_tv_template)
+               }
+       }, {
+#endif
                 .alg = "xts4096(paes)",
                 .test = alg_test_null,
                 .fips_allowed = 1,
diff --git a/drivers/acpi/acpica/achware.h b/drivers/acpi/acpica/achware.h

index 67f282e..6ad0517 100644 (file)
--- a/drivers/acpi/acpica/achware.h
+++ b/drivers/acpi/acpica/achware.h
@@ -101,6 +101,8 @@ acpi_status acpi_hw_enable_all_runtime_gpes(void);
  
  acpi_status acpi_hw_enable_all_wakeup_gpes(void);
  
+u8 acpi_hw_check_all_gpes(void);
+
  acpi_status
  acpi_hw_enable_runtime_gpe_block(struct acpi_gpe_xrupt_info *gpe_xrupt_info,
                                  struct acpi_gpe_block_info *gpe_block,
diff --git a/drivers/acpi/acpica/evxfgpe.c b/drivers/acpi/acpica/evxfgpe.c

index 2c39ff2..f2de66b 100644 (file)
--- a/drivers/acpi/acpica/evxfgpe.c
+++ b/drivers/acpi/acpica/evxfgpe.c
@@ -795,6 +795,38 @@ acpi_status acpi_enable_all_wakeup_gpes(void)
  
  ACPI_EXPORT_SYMBOL(acpi_enable_all_wakeup_gpes)
  
+/******************************************************************************
+ *
+ * FUNCTION:    acpi_any_gpe_status_set
+ *
+ * PARAMETERS:  None
+ *
+ * RETURN:      Whether or not the status bit is set for any GPE
+ *
+ * DESCRIPTION: Check the status bits of all enabled GPEs and return TRUE if any
+ *              of them is set or FALSE otherwise.
+ *
+ ******************************************************************************/
+u32 acpi_any_gpe_status_set(void)
+{
+       acpi_status status;
+       u8 ret;
+
+       ACPI_FUNCTION_TRACE(acpi_any_gpe_status_set);
+
+       status = acpi_ut_acquire_mutex(ACPI_MTX_EVENTS);
+       if (ACPI_FAILURE(status)) {
+               return (FALSE);
+       }
+
+       ret = acpi_hw_check_all_gpes();
+       (void)acpi_ut_release_mutex(ACPI_MTX_EVENTS);
+
+       return (ret);
+}
+
+ACPI_EXPORT_SYMBOL(acpi_any_gpe_status_set)
+
  /*******************************************************************************
   *
   * FUNCTION:    acpi_install_gpe_block
diff --git a/drivers/acpi/acpica/hwgpe.c b/drivers/acpi/acpica/hwgpe.c

index 1b4252b..f4c285c 100644 (file)
--- a/drivers/acpi/acpica/hwgpe.c
+++ b/drivers/acpi/acpica/hwgpe.c
@@ -444,6 +444,53 @@ acpi_hw_enable_wakeup_gpe_block(struct acpi_gpe_xrupt_info *gpe_xrupt_info,
         return (AE_OK);
  }
  
+/******************************************************************************
+ *
+ * FUNCTION:    acpi_hw_get_gpe_block_status
+ *
+ * PARAMETERS:  gpe_xrupt_info      - GPE Interrupt info
+ *              gpe_block           - Gpe Block info
+ *
+ * RETURN:      Success
+ *
+ * DESCRIPTION: Produce a combined GPE status bits mask for the given block.
+ *
+ ******************************************************************************/
+
+static acpi_status
+acpi_hw_get_gpe_block_status(struct acpi_gpe_xrupt_info *gpe_xrupt_info,
+                            struct acpi_gpe_block_info *gpe_block,
+                            void *ret_ptr)
+{
+       struct acpi_gpe_register_info *gpe_register_info;
+       u64 in_enable, in_status;
+       acpi_status status;
+       u8 *ret = ret_ptr;
+       u32 i;
+
+       /* Examine each GPE Register within the block */
+
+       for (i = 0; i < gpe_block->register_count; i++) {
+               gpe_register_info = &gpe_block->register_info[i];
+
+               status = acpi_hw_read(&in_enable,
+                                     &gpe_register_info->enable_address);
+               if (ACPI_FAILURE(status)) {
+                       continue;
+               }
+
+               status = acpi_hw_read(&in_status,
+                                     &gpe_register_info->status_address);
+               if (ACPI_FAILURE(status)) {
+                       continue;
+               }
+
+               *ret |= in_enable & in_status;
+       }
+
+       return (AE_OK);
+}
+
  /******************************************************************************
   *
   * FUNCTION:    acpi_hw_disable_all_gpes
@@ -510,4 +557,28 @@ acpi_status acpi_hw_enable_all_wakeup_gpes(void)
         return_ACPI_STATUS(status);
  }
  
+/******************************************************************************
+ *
+ * FUNCTION:    acpi_hw_check_all_gpes
+ *
+ * PARAMETERS:  None
+ *
+ * RETURN:      Combined status of all GPEs
+ *
+ * DESCRIPTION: Check all enabled GPEs in all GPE blocks and return TRUE if the
+ *              status bit is set for at least one of them of FALSE otherwise.
+ *
+ ******************************************************************************/
+
+u8 acpi_hw_check_all_gpes(void)
+{
+       u8 ret = 0;
+
+       ACPI_FUNCTION_TRACE(acpi_hw_check_all_gpes);
+
+       (void)acpi_ev_walk_gpe_list(acpi_hw_get_gpe_block_status, &ret);
+
+       return (ret != 0);
+}
+
  #endif                         /* !ACPI_REDUCED_HARDWARE */
diff --git a/drivers/acpi/ec.c b/drivers/acpi/ec.c

index 08bc975..d1f1cf5 100644 (file)
--- a/drivers/acpi/ec.c
+++ b/drivers/acpi/ec.c
@@ -179,6 +179,7 @@ EXPORT_SYMBOL(first_ec);
  
  static struct acpi_ec *boot_ec;
  static bool boot_ec_is_ecdt = false;
+static struct workqueue_struct *ec_wq;
  static struct workqueue_struct *ec_query_wq;
  
  static int EC_FLAGS_QUERY_HANDSHAKE; /* Needs QR_EC issued when SCI_EVT set */
@@ -469,7 +470,7 @@ static void acpi_ec_submit_query(struct acpi_ec *ec)
                 ec_dbg_evt("Command(%s) submitted/blocked",
                            acpi_ec_cmd_string(ACPI_EC_COMMAND_QUERY));
                 ec->nr_pending_queries++;
-               schedule_work(&ec->work);
+               queue_work(ec_wq, &ec->work);
         }
  }
  
@@ -535,7 +536,7 @@ static void acpi_ec_enable_event(struct acpi_ec *ec)
  #ifdef CONFIG_PM_SLEEP
  static void __acpi_ec_flush_work(void)
  {
-       flush_scheduled_work(); /* flush ec->work */
+       drain_workqueue(ec_wq); /* flush ec->work */
         flush_workqueue(ec_query_wq); /* flush queries */
  }
  
@@ -556,8 +557,8 @@ static void acpi_ec_disable_event(struct acpi_ec *ec)
  
  void acpi_ec_flush_work(void)
  {
-       /* Without ec_query_wq there is nothing to flush. */
-       if (!ec_query_wq)
+       /* Without ec_wq there is nothing to flush. */
+       if (!ec_wq)
                 return;
  
         __acpi_ec_flush_work();
@@ -2107,25 +2108,33 @@ static struct acpi_driver acpi_ec_driver = {
         .drv.pm = &acpi_ec_pm,
  };
  
-static inline int acpi_ec_query_init(void)
+static void acpi_ec_destroy_workqueues(void)
  {
-       if (!ec_query_wq) {
-               ec_query_wq = alloc_workqueue("kec_query", 0,
-                                             ec_max_queries);
-               if (!ec_query_wq)
-                       return -ENODEV;
+       if (ec_wq) {
+               destroy_workqueue(ec_wq);
+               ec_wq = NULL;
         }
-       return 0;
-}
-
-static inline void acpi_ec_query_exit(void)
-{
         if (ec_query_wq) {
                 destroy_workqueue(ec_query_wq);
                 ec_query_wq = NULL;
         }
  }
  
+static int acpi_ec_init_workqueues(void)
+{
+       if (!ec_wq)
+               ec_wq = alloc_ordered_workqueue("kec", 0);
+
+       if (!ec_query_wq)
+               ec_query_wq = alloc_workqueue("kec_query", 0, ec_max_queries);
+
+       if (!ec_wq || !ec_query_wq) {
+               acpi_ec_destroy_workqueues();
+               return -ENODEV;
+       }
+       return 0;
+}
+
  static const struct dmi_system_id acpi_ec_no_wakeup[] = {
         {
                 .ident = "Thinkpad X1 Carbon 6th",
@@ -2156,8 +2165,7 @@ int __init acpi_ec_init(void)
         int result;
         int ecdt_fail, dsdt_fail;
  
-       /* register workqueue for _Qxx evaluations */
-       result = acpi_ec_query_init();
+       result = acpi_ec_init_workqueues();
         if (result)
                 return result;
  
@@ -2188,6 +2196,6 @@ static void __exit acpi_ec_exit(void)
  {
  
         acpi_bus_unregister_driver(&acpi_ec_driver);
-       acpi_ec_query_exit();
+       acpi_ec_destroy_workqueues();
  }
  #endif /* 0 */
diff --git a/drivers/acpi/sleep.c b/drivers/acpi/sleep.c

index 4398806..152f7fc 100644 (file)
--- a/drivers/acpi/sleep.c
+++ b/drivers/acpi/sleep.c
@@ -990,21 +990,34 @@ static void acpi_s2idle_sync(void)
         acpi_os_wait_events_complete(); /* synchronize Notify handling */
  }
  
-static void acpi_s2idle_wake(void)
+static bool acpi_s2idle_wake(void)
  {
-       /*
-        * If IRQD_WAKEUP_ARMED is set for the SCI at this point, the SCI has
-        * not triggered while suspended, so bail out.
-        */
-       if (!acpi_sci_irq_valid() ||
-           irqd_is_wakeup_armed(irq_get_irq_data(acpi_sci_irq)))
-               return;
+       if (!acpi_sci_irq_valid())
+               return pm_wakeup_pending();
+
+       while (pm_wakeup_pending()) {
+               /*
+                * If IRQD_WAKEUP_ARMED is set for the SCI at this point, the
+                * SCI has not triggered while suspended, so bail out (the
+                * wakeup is pending anyway and the SCI is not the source of
+                * it).
+                */
+               if (irqd_is_wakeup_armed(irq_get_irq_data(acpi_sci_irq)))
+                       return true;
+
+               /*
+                * If there are no EC events to process and at least one of the
+                * other enabled GPEs is active, the wakeup is regarded as a
+                * genuine one.
+                *
+                * Note that the checks below must be carried out in this order
+                * to avoid returning prematurely due to a change of the EC GPE
+                * status bit from unset to set between the checks with the
+                * status bits of all the other GPEs unset.
+                */
+               if (acpi_any_gpe_status_set() && !acpi_ec_dispatch_gpe())
+                       return true;
  
-       /*
-        * If there are EC events to process, the wakeup may be a spurious one
-        * coming from the EC.
-        */
-       if (acpi_ec_dispatch_gpe()) {
                 /*
                  * Cancel the wakeup and process all pending events in case
                  * there are any wakeup ones in there.
@@ -1017,8 +1030,19 @@ static void acpi_s2idle_wake(void)
  
                 acpi_s2idle_sync();
  
+               /*
+                * The SCI is in the "suspended" state now and it cannot produce
+                * new wakeup events till the rearming below, so if any of them
+                * are pending here, they must be resulting from the processing
+                * of EC events above or coming from somewhere else.
+                */
+               if (pm_wakeup_pending())
+                       return true;
+
                 rearm_wake_irq(acpi_sci_irq);
         }
+
+       return false;
  }
  
  static void acpi_s2idle_restore_early(void)
diff --git a/drivers/bus/moxtet.c b/drivers/bus/moxtet.c

index 15fa293..b20fdcb 100644 (file)
--- a/drivers/bus/moxtet.c
+++ b/drivers/bus/moxtet.c
@@ -465,7 +465,7 @@ static ssize_t input_read(struct file *file, char __user *buf, size_t len,
  {
         struct moxtet *moxtet = file->private_data;
         u8 bin[TURRIS_MOX_MAX_MODULES];
-       u8 hex[sizeof(buf) * 2 + 1];
+       u8 hex[sizeof(bin) * 2 + 1];
         int ret, n;
  
         ret = moxtet_spi_read(moxtet, bin);
diff --git a/drivers/char/ipmi/ipmb_dev_int.c b/drivers/char/ipmi/ipmb_dev_int.c

index 1ff4fb1..382b28f 100644 (file)
--- a/drivers/char/ipmi/ipmb_dev_int.c
+++ b/drivers/char/ipmi/ipmb_dev_int.c
@@ -19,7 +19,7 @@
  #include <linux/spinlock.h>
  #include <linux/wait.h>
  
-#define MAX_MSG_LEN            128
+#define MAX_MSG_LEN            240
  #define IPMB_REQUEST_LEN_MIN   7
  #define NETFN_RSP_BIT_MASK     0x4
  #define REQUEST_QUEUE_MAX_LEN  256
@@ -63,6 +63,7 @@ struct ipmb_dev {
         spinlock_t lock;
         wait_queue_head_t wait_queue;
         struct mutex file_mutex;
+       bool is_i2c_protocol;
  };
  
  static inline struct ipmb_dev *to_ipmb_dev(struct file *file)
@@ -112,6 +113,25 @@ static ssize_t ipmb_read(struct file *file, char __user *buf, size_t count,
         return ret < 0 ? ret : count;
  }
  
+static int ipmb_i2c_write(struct i2c_client *client, u8 *msg, u8 addr)
+{
+       struct i2c_msg i2c_msg;
+
+       /*
+        * subtract 1 byte (rq_sa) from the length of the msg passed to
+        * raw i2c_transfer
+        */
+       i2c_msg.len = msg[IPMB_MSG_LEN_IDX] - 1;
+
+       /* Assign message to buffer except first 2 bytes (length and address) */
+       i2c_msg.buf = msg + 2;
+
+       i2c_msg.addr = addr;
+       i2c_msg.flags = client->flags & I2C_CLIENT_PEC;
+
+       return i2c_transfer(client->adapter, &i2c_msg, 1);
+}
+
  static ssize_t ipmb_write(struct file *file, const char __user *buf,
                         size_t count, loff_t *ppos)
  {
@@ -133,6 +153,12 @@ static ssize_t ipmb_write(struct file *file, const char __user *buf,
         rq_sa = GET_7BIT_ADDR(msg[RQ_SA_8BIT_IDX]);
         netf_rq_lun = msg[NETFN_LUN_IDX];
  
+       /* Check i2c block transfer vs smbus */
+       if (ipmb_dev->is_i2c_protocol) {
+               ret = ipmb_i2c_write(ipmb_dev->client, msg, rq_sa);
+               return (ret == 1) ? count : ret;
+       }
+
         /*
          * subtract rq_sa and netf_rq_lun from the length of the msg passed to
          * i2c_smbus_xfer
@@ -253,7 +279,7 @@ static int ipmb_slave_cb(struct i2c_client *client,
                 break;
  
         case I2C_SLAVE_WRITE_RECEIVED:
-               if (ipmb_dev->msg_idx >= sizeof(struct ipmb_msg))
+               if (ipmb_dev->msg_idx >= sizeof(struct ipmb_msg) - 1)
                         break;
  
                 buf[++ipmb_dev->msg_idx] = *val;
@@ -302,6 +328,9 @@ static int ipmb_probe(struct i2c_client *client,
         if (ret)
                 return ret;
  
+       ipmb_dev->is_i2c_protocol
+               = device_property_read_bool(&client->dev, "i2c-protocol");
+
         ipmb_dev->client = client;
         i2c_set_clientdata(client, ipmb_dev);
         ret = i2c_slave_register(client, ipmb_slave_cb);
diff --git a/drivers/char/ipmi/ipmi_ssif.c b/drivers/char/ipmi/ipmi_ssif.c

index 22c6a2e..8ac390c 100644 (file)
--- a/drivers/char/ipmi/ipmi_ssif.c
+++ b/drivers/char/ipmi/ipmi_ssif.c
@@ -775,10 +775,14 @@ static void msg_done_handler(struct ssif_info *ssif_info, int result,
         flags = ipmi_ssif_lock_cond(ssif_info, &oflags);
         msg = ssif_info->curr_msg;
         if (msg) {
+               if (data) {
+                       if (len > IPMI_MAX_MSG_LENGTH)
+                               len = IPMI_MAX_MSG_LENGTH;
+                       memcpy(msg->rsp, data, len);
+               } else {
+                       len = 0;
+               }
                 msg->rsp_size = len;
-               if (msg->rsp_size > IPMI_MAX_MSG_LENGTH)
-                       msg->rsp_size = IPMI_MAX_MSG_LENGTH;
-               memcpy(msg->rsp, data, msg->rsp_size);
                 ssif_info->curr_msg = NULL;
         }
  
diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c

index 4adac3a..cbe6c94 100644 (file)
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -105,6 +105,8 @@ bool have_governor_per_policy(void)
  }
  EXPORT_SYMBOL_GPL(have_governor_per_policy);
  
+static struct kobject *cpufreq_global_kobject;
+
  struct kobject *get_governor_parent_kobj(struct cpufreq_policy *policy)
  {
         if (have_governor_per_policy())
@@ -2745,9 +2747,6 @@ int cpufreq_unregister_driver(struct cpufreq_driver *driver)
  }
  EXPORT_SYMBOL_GPL(cpufreq_unregister_driver);
  
-struct kobject *cpufreq_global_kobject;
-EXPORT_SYMBOL(cpufreq_global_kobject);
-
  static int __init cpufreq_core_init(void)
  {
         if (cpufreq_disabled())
diff --git a/drivers/dax/super.c b/drivers/dax/super.c

index 26a654d..0aa4b6b 100644 (file)
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -61,7 +61,7 @@ struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev)
  {
         if (!blk_queue_dax(bdev->bd_queue))
                 return NULL;
-       return fs_dax_get_by_host(bdev->bd_disk->disk_name);
+       return dax_get_by_host(bdev->bd_disk->disk_name);
  }
  EXPORT_SYMBOL_GPL(fs_dax_get_by_bdev);
  #endif
diff --git a/drivers/edac/edac_mc.c b/drivers/edac/edac_mc.c

index 7243b88..69e0d90 100644 (file)
--- a/drivers/edac/edac_mc.c
+++ b/drivers/edac/edac_mc.c
@@ -505,16 +505,10 @@ void edac_mc_free(struct mem_ctl_info *mci)
  {
         edac_dbg(1, "\n");
  
-       /* If we're not yet registered with sysfs free only what was allocated
-        * in edac_mc_alloc().
-        */
-       if (!device_is_registered(&mci->dev)) {
-               _edac_mc_free(mci);
-               return;
-       }
+       if (device_is_registered(&mci->dev))
+               edac_unregister_sysfs(mci);
  
-       /* the mci instance is freed here, when the sysfs object is dropped */
-       edac_unregister_sysfs(mci);
+       _edac_mc_free(mci);
  }
  EXPORT_SYMBOL_GPL(edac_mc_free);
  
diff --git a/drivers/edac/edac_mc_sysfs.c b/drivers/edac/edac_mc_sysfs.c

index 0367554..c70ec0a 100644 (file)
--- a/drivers/edac/edac_mc_sysfs.c
+++ b/drivers/edac/edac_mc_sysfs.c
@@ -276,10 +276,7 @@ static const struct attribute_group *csrow_attr_groups[] = {
  
  static void csrow_attr_release(struct device *dev)
  {
-       struct csrow_info *csrow = container_of(dev, struct csrow_info, dev);
-
-       edac_dbg(1, "device %s released\n", dev_name(dev));
-       kfree(csrow);
+       /* release device with _edac_mc_free() */
  }
  
  static const struct device_type csrow_attr_type = {
@@ -447,8 +444,7 @@ error:
                 csrow = mci->csrows[i];
                 if (!nr_pages_per_csrow(csrow))
                         continue;
-
-               device_del(&mci->csrows[i]->dev);
+               device_unregister(&mci->csrows[i]->dev);
         }
  
         return err;
@@ -608,10 +604,7 @@ static const struct attribute_group *dimm_attr_groups[] = {
  
  static void dimm_attr_release(struct device *dev)
  {
-       struct dimm_info *dimm = container_of(dev, struct dimm_info, dev);
-
-       edac_dbg(1, "device %s released\n", dev_name(dev));
-       kfree(dimm);
+       /* release device with _edac_mc_free() */
  }
  
  static const struct device_type dimm_attr_type = {
@@ -893,10 +886,7 @@ static const struct attribute_group *mci_attr_groups[] = {
  
  static void mci_attr_release(struct device *dev)
  {
-       struct mem_ctl_info *mci = container_of(dev, struct mem_ctl_info, dev);
-
-       edac_dbg(1, "device %s released\n", dev_name(dev));
-       kfree(mci);
+       /* release device with _edac_mc_free() */
  }
  
  static const struct device_type mci_attr_type = {
diff --git a/drivers/gpio/gpio-bd71828.c b/drivers/gpio/gpio-bd71828.c

index 04aade9..3dbbc63 100644 (file)
--- a/drivers/gpio/gpio-bd71828.c
+++ b/drivers/gpio/gpio-bd71828.c
@@ -10,16 +10,6 @@
  #define GPIO_OUT_REG(off) (BD71828_REG_GPIO_CTRL1 + (off))
  #define HALL_GPIO_OFFSET 3
  
-/*
- * These defines can be removed when
- * "gpio: Add definition for GPIO direction"
- * (9208b1e77d6e8e9776f34f46ef4079ecac9c3c25 in GPIO tree) gets merged,
- */
-#ifndef GPIO_LINE_DIRECTION_IN
-       #define GPIO_LINE_DIRECTION_IN 1
-       #define GPIO_LINE_DIRECTION_OUT 0
-#endif
-
  struct bd71828_gpio {
         struct rohm_regmap_dev chip;
         struct gpio_chip gpio;
diff --git a/drivers/gpio/gpio-sifive.c b/drivers/gpio/gpio-sifive.c

index 147a1bd..c54dd08 100644 (file)
--- a/drivers/gpio/gpio-sifive.c
+++ b/drivers/gpio/gpio-sifive.c
@@ -35,7 +35,7 @@ struct sifive_gpio {
         void __iomem            *base;
         struct gpio_chip        gc;
         struct regmap           *regs;
-       u32                     irq_state;
+       unsigned long           irq_state;
         unsigned int            trigger[SIFIVE_GPIO_MAX];
         unsigned int            irq_parent[SIFIVE_GPIO_MAX];
  };
@@ -94,7 +94,7 @@ static void sifive_gpio_irq_enable(struct irq_data *d)
         spin_unlock_irqrestore(&gc->bgpio_lock, flags);
  
         /* Enable interrupts */
-       assign_bit(offset, (unsigned long *)&chip->irq_state, 1);
+       assign_bit(offset, &chip->irq_state, 1);
         sifive_gpio_set_ie(chip, offset);
  }
  
@@ -104,7 +104,7 @@ static void sifive_gpio_irq_disable(struct irq_data *d)
         struct sifive_gpio *chip = gpiochip_get_data(gc);
         int offset = irqd_to_hwirq(d) % SIFIVE_GPIO_MAX;
  
-       assign_bit(offset, (unsigned long *)&chip->irq_state, 0);
+       assign_bit(offset, &chip->irq_state, 0);
         sifive_gpio_set_ie(chip, offset);
         irq_chip_disable_parent(d);
  }
diff --git a/drivers/gpio/gpio-xilinx.c b/drivers/gpio/gpio-xilinx.c

index a9748b5..67f9f82 100644 (file)
--- a/drivers/gpio/gpio-xilinx.c
+++ b/drivers/gpio/gpio-xilinx.c
@@ -147,9 +147,10 @@ static void xgpio_set_multiple(struct gpio_chip *gc, unsigned long *mask,
         for (i = 0; i < gc->ngpio; i++) {
                 if (*mask == 0)
                         break;
+               /* Once finished with an index write it out to the register */
                 if (index !=  xgpio_index(chip, i)) {
                         xgpio_writereg(chip->regs + XGPIO_DATA_OFFSET +
-                                      xgpio_regoffset(chip, i),
+                                      index * XGPIO_CHANNEL_OFFSET,
                                        chip->gpio_state[index]);
                         spin_unlock_irqrestore(&chip->gpio_lock[index], flags);
                         index =  xgpio_index(chip, i);
@@ -165,7 +166,7 @@ static void xgpio_set_multiple(struct gpio_chip *gc, unsigned long *mask,
         }
  
         xgpio_writereg(chip->regs + XGPIO_DATA_OFFSET +
-                      xgpio_regoffset(chip, i), chip->gpio_state[index]);
+                      index * XGPIO_CHANNEL_OFFSET, chip->gpio_state[index]);
  
         spin_unlock_irqrestore(&chip->gpio_lock[index], flags);
  }
diff --git a/drivers/gpio/gpiolib.c b/drivers/gpio/gpiolib.c

index 7532834..4d0106c 100644 (file)
--- a/drivers/gpio/gpiolib.c
+++ b/drivers/gpio/gpiolib.c
@@ -3035,13 +3035,33 @@ EXPORT_SYMBOL_GPL(gpiochip_free_own_desc);
   * rely on gpio_request() having been called beforehand.
   */
  
-static int gpio_set_config(struct gpio_chip *gc, unsigned int offset,
-                          enum pin_config_param mode)
+static int gpio_do_set_config(struct gpio_chip *gc, unsigned int offset,
+                             unsigned long config)
  {
         if (!gc->set_config)
                 return -ENOTSUPP;
  
-       return gc->set_config(gc, offset, mode);
+       return gc->set_config(gc, offset, config);
+}
+
+static int gpio_set_config(struct gpio_chip *gc, unsigned int offset,
+                          enum pin_config_param mode)
+{
+       unsigned long config;
+       unsigned arg;
+
+       switch (mode) {
+       case PIN_CONFIG_BIAS_PULL_DOWN:
+       case PIN_CONFIG_BIAS_PULL_UP:
+               arg = 1;
+               break;
+
+       default:
+               arg = 0;
+       }
+
+       config = PIN_CONF_PACKED(mode, arg);
+       return gpio_do_set_config(gc, offset, config);
  }
  
  static int gpio_set_bias(struct gpio_chip *chip, struct gpio_desc *desc)
@@ -3277,7 +3297,7 @@ int gpiod_set_debounce(struct gpio_desc *desc, unsigned debounce)
         chip = desc->gdev->chip;
  
         config = pinconf_to_config_packed(PIN_CONFIG_INPUT_DEBOUNCE, debounce);
-       return gpio_set_config(chip, gpio_chip_hwgpio(desc), config);
+       return gpio_do_set_config(chip, gpio_chip_hwgpio(desc), config);
  }
  EXPORT_SYMBOL_GPL(gpiod_set_debounce);
  
@@ -3311,7 +3331,7 @@ int gpiod_set_transitory(struct gpio_desc *desc, bool transitory)
         packed = pinconf_to_config_packed(PIN_CONFIG_PERSIST_STATE,
                                           !transitory);
         gpio = gpio_chip_hwgpio(desc);
-       rc = gpio_set_config(chip, gpio, packed);
+       rc = gpio_do_set_config(chip, gpio, packed);
         if (rc == -ENOTSUPP) {
                 dev_dbg(&desc->gdev->dev, "Persistence not supported for GPIO %d\n",
                                 gpio);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_pmu.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_pmu.c

index 07914e3..1311d6a 100644 (file)
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_pmu.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_pmu.c
@@ -52,7 +52,7 @@ static int amdgpu_perf_event_init(struct perf_event *event)
                 return -ENOENT;
  
         /* update the hw_perf_event struct with config data */
-       hwc->conf = event->attr.config;
+       hwc->config = event->attr.config;
  
         return 0;
  }
@@ -74,9 +74,9 @@ static void amdgpu_perf_start(struct perf_event *event, int flags)
         switch (pe->pmu_perf_type) {
         case PERF_TYPE_AMDGPU_DF:
                 if (!(flags & PERF_EF_RELOAD))
-                       pe->adev->df.funcs->pmc_start(pe->adev, hwc->conf, 1);
+                       pe->adev->df.funcs->pmc_start(pe->adev, hwc->config, 1);
  
-               pe->adev->df.funcs->pmc_start(pe->adev, hwc->conf, 0);
+               pe->adev->df.funcs->pmc_start(pe->adev, hwc->config, 0);
                 break;
         default:
                 break;
@@ -101,7 +101,7 @@ static void amdgpu_perf_read(struct perf_event *event)
  
                 switch (pe->pmu_perf_type) {
                 case PERF_TYPE_AMDGPU_DF:
-                       pe->adev->df.funcs->pmc_get_count(pe->adev, hwc->conf,
+                       pe->adev->df.funcs->pmc_get_count(pe->adev, hwc->config,
                                                           &count);
                         break;
                 default:
@@ -126,7 +126,7 @@ static void amdgpu_perf_stop(struct perf_event *event, int flags)
  
         switch (pe->pmu_perf_type) {
         case PERF_TYPE_AMDGPU_DF:
-               pe->adev->df.funcs->pmc_stop(pe->adev, hwc->conf, 0);
+               pe->adev->df.funcs->pmc_stop(pe->adev, hwc->config, 0);
                 break;
         default:
                 break;
@@ -156,7 +156,8 @@ static int amdgpu_perf_add(struct perf_event *event, int flags)
  
         switch (pe->pmu_perf_type) {
         case PERF_TYPE_AMDGPU_DF:
-               retval = pe->adev->df.funcs->pmc_start(pe->adev, hwc->conf, 1);
+               retval = pe->adev->df.funcs->pmc_start(pe->adev,
+                                                      hwc->config, 1);
                 break;
         default:
                 return 0;
@@ -184,7 +185,7 @@ static void amdgpu_perf_del(struct perf_event *event, int flags)
  
         switch (pe->pmu_perf_type) {
         case PERF_TYPE_AMDGPU_DF:
-               pe->adev->df.funcs->pmc_stop(pe->adev, hwc->conf, 1);
+               pe->adev->df.funcs->pmc_stop(pe->adev, hwc->config, 1);
                 break;
         default:
                 break;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c

index 3a1570d..146f966 100644 (file)
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
@@ -1013,6 +1013,30 @@ static int psp_dtm_initialize(struct psp_context *psp)
         return 0;
  }
  
+static int psp_dtm_unload(struct psp_context *psp)
+{
+       int ret;
+       struct psp_gfx_cmd_resp *cmd;
+
+       /*
+        * TODO: bypass the unloading in sriov for now
+        */
+       if (amdgpu_sriov_vf(psp->adev))
+               return 0;
+
+       cmd = kzalloc(sizeof(struct psp_gfx_cmd_resp), GFP_KERNEL);
+       if (!cmd)
+               return -ENOMEM;
+
+       psp_prep_ta_unload_cmd_buf(cmd, psp->dtm_context.session_id);
+
+       ret = psp_cmd_submit_buf(psp, NULL, cmd, psp->fence_buf_mc_addr);
+
+       kfree(cmd);
+
+       return ret;
+}
+
  int psp_dtm_invoke(struct psp_context *psp, uint32_t ta_cmd_id)
  {
         /*
@@ -1037,7 +1061,7 @@ static int psp_dtm_terminate(struct psp_context *psp)
         if (!psp->dtm_context.dtm_initialized)
                 return 0;
  
-       ret = psp_hdcp_unload(psp);
+       ret = psp_dtm_unload(psp);
         if (ret)
                 return ret;
  
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h

index d6deb0e..6fe0573 100644 (file)
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h
@@ -179,6 +179,7 @@ struct amdgpu_vcn_inst {
         struct amdgpu_irq_src   irq;
         struct amdgpu_vcn_reg   external;
         struct amdgpu_bo        *dpg_sram_bo;
+       struct dpg_pause_state  pause_state;
         void                    *dpg_sram_cpu_addr;
         uint64_t                dpg_sram_gpu_addr;
         uint32_t                *dpg_sram_curr_addr;
@@ -190,8 +191,6 @@ struct amdgpu_vcn {
         const struct firmware   *fw;    /* VCN firmware */
         unsigned                num_enc_rings;
         enum amd_powergating_state cur_state;
-       struct dpg_pause_state pause_state;
-
         bool                    indirect_sram;
  
         uint8_t num_vcn_inst;
diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c

index 1785fda..22bbb36 100644 (file)
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
@@ -3923,11 +3923,13 @@ static uint64_t gfx_v10_0_get_gpu_clock_counter(struct amdgpu_device *adev)
  {
         uint64_t clock;
  
+       amdgpu_gfx_off_ctrl(adev, false);
         mutex_lock(&adev->gfx.gpu_clock_mutex);
         WREG32_SOC15(GC, 0, mmRLC_CAPTURE_GPU_CLOCK_COUNT, 1);
         clock = (uint64_t)RREG32_SOC15(GC, 0, mmRLC_GPU_CLOCK_COUNT_LSB) |
                 ((uint64_t)RREG32_SOC15(GC, 0, mmRLC_GPU_CLOCK_COUNT_MSB) << 32ULL);
         mutex_unlock(&adev->gfx.gpu_clock_mutex);
+       amdgpu_gfx_off_ctrl(adev, true);
         return clock;
  }
  
diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c

index 90f64b8..3afdbbd 100644 (file)
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
@@ -1193,6 +1193,14 @@ static bool gfx_v9_0_should_disable_gfxoff(struct pci_dev *pdev)
         return false;
  }
  
+static bool is_raven_kicker(struct amdgpu_device *adev)
+{
+       if (adev->pm.fw_version >= 0x41e2b)
+               return true;
+       else
+               return false;
+}
+
  static void gfx_v9_0_check_if_need_gfxoff(struct amdgpu_device *adev)
  {
         if (gfx_v9_0_should_disable_gfxoff(adev->pdev))
@@ -1205,9 +1213,8 @@ static void gfx_v9_0_check_if_need_gfxoff(struct amdgpu_device *adev)
                 break;
         case CHIP_RAVEN:
                 if (!(adev->rev_id >= 0x8 || adev->pdev->device == 0x15d8) &&
-                   ((adev->gfx.rlc_fw_version != 106 &&
+                   ((!is_raven_kicker(adev) &&
                       adev->gfx.rlc_fw_version < 531) ||
-                    (adev->gfx.rlc_fw_version == 53815) ||
                      (adev->gfx.rlc_feature_version < 1) ||
                      !adev->gfx.rlc.is_rlc_v2_1))
                         adev->pm.pp_feature &= ~PP_GFXOFF_MASK;
@@ -3959,6 +3966,7 @@ static uint64_t gfx_v9_0_get_gpu_clock_counter(struct amdgpu_device *adev)
  {
         uint64_t clock;
  
+       amdgpu_gfx_off_ctrl(adev, false);
         mutex_lock(&adev->gfx.gpu_clock_mutex);
         if (adev->asic_type == CHIP_VEGA10 && amdgpu_sriov_runtime(adev)) {
                 uint32_t tmp, lsb, msb, i = 0;
@@ -3977,6 +3985,7 @@ static uint64_t gfx_v9_0_get_gpu_clock_counter(struct amdgpu_device *adev)
                         ((uint64_t)RREG32_SOC15(GC, 0, mmRLC_GPU_CLOCK_COUNT_MSB) << 32ULL);
         }
         mutex_unlock(&adev->gfx.gpu_clock_mutex);
+       amdgpu_gfx_off_ctrl(adev, true);
         return clock;
  }
  
@@ -4374,9 +4383,17 @@ static int gfx_v9_0_ecc_late_init(void *handle)
         struct amdgpu_device *adev = (struct amdgpu_device *)handle;
         int r;
  
-       r = gfx_v9_0_do_edc_gds_workarounds(adev);
-       if (r)
-               return r;
+       /*
+        * Temp workaround to fix the issue that CP firmware fails to
+        * update read pointer when CPDMA is writing clearing operation
+        * to GDS in suspend/resume sequence on several cards. So just
+        * limit this operation in cold boot sequence.
+        */
+       if (!adev->in_suspend) {
+               r = gfx_v9_0_do_edc_gds_workarounds(adev);
+               if (r)
+                       return r;
+       }
  
         /* requires IBs so do in late init after IB pool is initialized */
         r = gfx_v9_0_do_edc_gpr_workarounds(adev);
diff --git a/drivers/gpu/drm/amd/amdgpu/soc15.c b/drivers/gpu/drm/amd/amdgpu/soc15.c

index 15f3424..2b488df 100644 (file)
--- a/drivers/gpu/drm/amd/amdgpu/soc15.c
+++ b/drivers/gpu/drm/amd/amdgpu/soc15.c
@@ -272,7 +272,12 @@ static u32 soc15_get_config_memsize(struct amdgpu_device *adev)
  
  static u32 soc15_get_xclk(struct amdgpu_device *adev)
  {
-       return adev->clock.spll.reference_freq;
+       u32 reference_clock = adev->clock.spll.reference_freq;
+
+       if (adev->asic_type == CHIP_RAVEN)
+               return reference_clock / 4;
+
+       return reference_clock;
  }
  
  
diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v1_0.c b/drivers/gpu/drm/amd/amdgpu/vcn_v1_0.c

index 1a24fad..71f61af 100644 (file)
--- a/drivers/gpu/drm/amd/amdgpu/vcn_v1_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/vcn_v1_0.c
@@ -1207,9 +1207,10 @@ static int vcn_v1_0_pause_dpg_mode(struct amdgpu_device *adev,
         struct amdgpu_ring *ring;
  
         /* pause/unpause if state is changed */
-       if (adev->vcn.pause_state.fw_based != new_state->fw_based) {
+       if (adev->vcn.inst[inst_idx].pause_state.fw_based != new_state->fw_based) {
                 DRM_DEBUG("dpg pause state changed %d:%d -> %d:%d",
-                       adev->vcn.pause_state.fw_based, adev->vcn.pause_state.jpeg,
+                       adev->vcn.inst[inst_idx].pause_state.fw_based,
+                       adev->vcn.inst[inst_idx].pause_state.jpeg,
                         new_state->fw_based, new_state->jpeg);
  
                 reg_data = RREG32_SOC15(UVD, 0, mmUVD_DPG_PAUSE) &
@@ -1258,13 +1259,14 @@ static int vcn_v1_0_pause_dpg_mode(struct amdgpu_device *adev,
                         reg_data &= ~UVD_DPG_PAUSE__NJ_PAUSE_DPG_REQ_MASK;
                         WREG32_SOC15(UVD, 0, mmUVD_DPG_PAUSE, reg_data);
                 }
-               adev->vcn.pause_state.fw_based = new_state->fw_based;
+               adev->vcn.inst[inst_idx].pause_state.fw_based = new_state->fw_based;
         }
  
         /* pause/unpause if state is changed */
-       if (adev->vcn.pause_state.jpeg != new_state->jpeg) {
+       if (adev->vcn.inst[inst_idx].pause_state.jpeg != new_state->jpeg) {
                 DRM_DEBUG("dpg pause state changed %d:%d -> %d:%d",
-                       adev->vcn.pause_state.fw_based, adev->vcn.pause_state.jpeg,
+                       adev->vcn.inst[inst_idx].pause_state.fw_based,
+                       adev->vcn.inst[inst_idx].pause_state.jpeg,
                         new_state->fw_based, new_state->jpeg);
  
                 reg_data = RREG32_SOC15(UVD, 0, mmUVD_DPG_PAUSE) &
@@ -1318,7 +1320,7 @@ static int vcn_v1_0_pause_dpg_mode(struct amdgpu_device *adev,
                         reg_data &= ~UVD_DPG_PAUSE__JPEG_PAUSE_DPG_REQ_MASK;
                         WREG32_SOC15(UVD, 0, mmUVD_DPG_PAUSE, reg_data);
                 }
-               adev->vcn.pause_state.jpeg = new_state->jpeg;
+               adev->vcn.inst[inst_idx].pause_state.jpeg = new_state->jpeg;
         }
  
         return 0;
diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v2_0.c b/drivers/gpu/drm/amd/amdgpu/vcn_v2_0.c

index 4f72167..c387c81 100644 (file)
--- a/drivers/gpu/drm/amd/amdgpu/vcn_v2_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/vcn_v2_0.c
@@ -1137,9 +1137,9 @@ static int vcn_v2_0_pause_dpg_mode(struct amdgpu_device *adev,
         int ret_code;
  
         /* pause/unpause if state is changed */
-       if (adev->vcn.pause_state.fw_based != new_state->fw_based) {
+       if (adev->vcn.inst[inst_idx].pause_state.fw_based != new_state->fw_based) {
                 DRM_DEBUG("dpg pause state changed %d -> %d",
-                       adev->vcn.pause_state.fw_based, new_state->fw_based);
+                       adev->vcn.inst[inst_idx].pause_state.fw_based,  new_state->fw_based);
                 reg_data = RREG32_SOC15(UVD, 0, mmUVD_DPG_PAUSE) &
                         (~UVD_DPG_PAUSE__NJ_PAUSE_DPG_ACK_MASK);
  
@@ -1185,7 +1185,7 @@ static int vcn_v2_0_pause_dpg_mode(struct amdgpu_device *adev,
                         reg_data &= ~UVD_DPG_PAUSE__NJ_PAUSE_DPG_REQ_MASK;
                         WREG32_SOC15(UVD, 0, mmUVD_DPG_PAUSE, reg_data);
                 }
-               adev->vcn.pause_state.fw_based = new_state->fw_based;
+               adev->vcn.inst[inst_idx].pause_state.fw_based = new_state->fw_based;
         }
  
         return 0;
diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v2_5.c b/drivers/gpu/drm/amd/amdgpu/vcn_v2_5.c

index 70fae79..2d64ba1 100644 (file)
--- a/drivers/gpu/drm/amd/amdgpu/vcn_v2_5.c
+++ b/drivers/gpu/drm/amd/amdgpu/vcn_v2_5.c
@@ -1367,9 +1367,9 @@ static int vcn_v2_5_pause_dpg_mode(struct amdgpu_device *adev,
         int ret_code;
  
         /* pause/unpause if state is changed */
-       if (adev->vcn.pause_state.fw_based != new_state->fw_based) {
+       if (adev->vcn.inst[inst_idx].pause_state.fw_based != new_state->fw_based) {
                 DRM_DEBUG("dpg pause state changed %d -> %d",
-                       adev->vcn.pause_state.fw_based, new_state->fw_based);
+                       adev->vcn.inst[inst_idx].pause_state.fw_based,  new_state->fw_based);
                 reg_data = RREG32_SOC15(UVD, inst_idx, mmUVD_DPG_PAUSE) &
                         (~UVD_DPG_PAUSE__NJ_PAUSE_DPG_ACK_MASK);
  
@@ -1407,14 +1407,14 @@ static int vcn_v2_5_pause_dpg_mode(struct amdgpu_device *adev,
                                            RREG32_SOC15(UVD, inst_idx, mmUVD_SCRATCH2) & 0x7FFFFFFF);
  
                                 SOC15_WAIT_ON_RREG(UVD, inst_idx, mmUVD_POWER_STATUS,
-                                          0x0, UVD_POWER_STATUS__UVD_POWER_STATUS_MASK, ret_code);
+                                          UVD_PGFSM_CONFIG__UVDM_UVDU_PWR_ON, UVD_POWER_STATUS__UVD_POWER_STATUS_MASK, ret_code);
                         }
                 } else {
                         /* unpause dpg, no need to wait */
                         reg_data &= ~UVD_DPG_PAUSE__NJ_PAUSE_DPG_REQ_MASK;
                         WREG32_SOC15(UVD, inst_idx, mmUVD_DPG_PAUSE, reg_data);
                 }
-               adev->vcn.pause_state.fw_based = new_state->fw_based;
+               adev->vcn.inst[inst_idx].pause_state.fw_based = new_state->fw_based;
         }
  
         return 0;
diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c

index 2795415..e8f66fb 100644 (file)
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -1911,7 +1911,7 @@ static void handle_hpd_irq(void *param)
         mutex_lock(&aconnector->hpd_lock);
  
  #ifdef CONFIG_DRM_AMD_DC_HDCP
-       if (adev->asic_type >= CHIP_RAVEN)
+       if (adev->dm.hdcp_workqueue)
                 hdcp_reset_display(adev->dm.hdcp_workqueue, aconnector->dc_link->link_index);
  #endif
         if (aconnector->fake_enable)
@@ -2088,8 +2088,10 @@ static void handle_hpd_rx_irq(void *param)
                 }
         }
  #ifdef CONFIG_DRM_AMD_DC_HDCP
-       if (hpd_irq_data.bytes.device_service_irq.bits.CP_IRQ)
-               hdcp_handle_cpirq(adev->dm.hdcp_workqueue,  aconnector->base.index);
+           if (hpd_irq_data.bytes.device_service_irq.bits.CP_IRQ) {
+                   if (adev->dm.hdcp_workqueue)
+                           hdcp_handle_cpirq(adev->dm.hdcp_workqueue,  aconnector->base.index);
+           }
  #endif
         if ((dc_link->cur_link_settings.lane_count != LANE_COUNT_UNKNOWN) ||
             (dc_link->type == dc_connection_mst_branch))
@@ -5702,7 +5704,7 @@ void amdgpu_dm_connector_init_helper(struct amdgpu_display_manager *dm,
                 drm_connector_attach_vrr_capable_property(
                         &aconnector->base);
  #ifdef CONFIG_DRM_AMD_DC_HDCP
-               if (adev->asic_type >= CHIP_RAVEN)
+               if (adev->dm.hdcp_workqueue)
                         drm_connector_attach_content_protection_property(&aconnector->base, true);
  #endif
         }
@@ -8408,7 +8410,6 @@ bool amdgpu_dm_psr_enable(struct dc_stream_state *stream)
         /* Calculate number of static frames before generating interrupt to
          * enter PSR.
          */
-       unsigned int frame_time_microsec = 1000000 / vsync_rate_hz;
         // Init fail safe of 2 frames static
         unsigned int num_frames_static = 2;
  
@@ -8423,8 +8424,10 @@ bool amdgpu_dm_psr_enable(struct dc_stream_state *stream)
          * Calculate number of frames such that at least 30 ms of time has
          * passed.
          */
-       if (vsync_rate_hz != 0)
+       if (vsync_rate_hz != 0) {
+               unsigned int frame_time_microsec = 1000000 / vsync_rate_hz;
                 num_frames_static = (30000 / frame_time_microsec) + 1;
+       }
  
         params.triggers.cursor_update = true;
         params.triggers.overlay_update = true;
diff --git a/drivers/gpu/drm/amd/display/dc/bios/command_table2.c b/drivers/gpu/drm/amd/display/dc/bios/command_table2.c

index 629a07a..c4ba6e8 100644 (file)
--- a/drivers/gpu/drm/amd/display/dc/bios/command_table2.c
+++ b/drivers/gpu/drm/amd/display/dc/bios/command_table2.c
@@ -711,10 +711,6 @@ static void enable_disp_power_gating_dmcub(
         power_gating.header.sub_type = DMUB_CMD__VBIOS_ENABLE_DISP_POWER_GATING;
         power_gating.power_gating.pwr = *pwr;
  
-       /* ATOM_ENABLE is old API in DMUB */
-       if (power_gating.power_gating.pwr.enable == ATOM_ENABLE)
-               power_gating.power_gating.pwr.enable = ATOM_INIT;
-
         dc_dmub_srv_cmd_queue(dmcub, &power_gating.header);
         dc_dmub_srv_cmd_execute(dmcub);
         dc_dmub_srv_wait_idle(dmcub);
diff --git a/drivers/gpu/drm/amd/display/dc/clk_mgr/Makefile b/drivers/gpu/drm/amd/display/dc/clk_mgr/Makefile

index 3cd2831..c0f6a8c 100644 (file)
--- a/drivers/gpu/drm/amd/display/dc/clk_mgr/Makefile
+++ b/drivers/gpu/drm/amd/display/dc/clk_mgr/Makefile
@@ -87,6 +87,12 @@ AMD_DISPLAY_FILES += $(AMD_DAL_CLK_MGR_DCN20)
  ###############################################################################
  CLK_MGR_DCN21 = rn_clk_mgr.o rn_clk_mgr_vbios_smu.o
  
+# prevent build errors regarding soft-float vs hard-float FP ABI tags
+# this code is currently unused on ppc64, as it applies to Renoir APUs only
+ifdef CONFIG_PPC64
+CFLAGS_$(AMDDALPATH)/dc/clk_mgr/dcn21/rn_clk_mgr.o := $(call cc-option,-mno-gnu-attribute)
+endif
+
  AMD_DAL_CLK_MGR_DCN21 = $(addprefix $(AMDDALPATH)/dc/clk_mgr/dcn21/,$(CLK_MGR_DCN21))
  
  AMD_DISPLAY_FILES += $(AMD_DAL_CLK_MGR_DCN21)
diff --git a/drivers/gpu/drm/amd/display/dc/clk_mgr/dcn20/dcn20_clk_mgr.c b/drivers/gpu/drm/amd/display/dc/clk_mgr/dcn20/dcn20_clk_mgr.c

index 495f01e..49ce46b 100644 (file)
--- a/drivers/gpu/drm/amd/display/dc/clk_mgr/dcn20/dcn20_clk_mgr.c
+++ b/drivers/gpu/drm/amd/display/dc/clk_mgr/dcn20/dcn20_clk_mgr.c
@@ -117,7 +117,7 @@ void dcn20_update_clocks_update_dpp_dto(struct clk_mgr_internal *clk_mgr,
  
                 prev_dppclk_khz = clk_mgr->base.ctx->dc->current_state->res_ctx.pipe_ctx[i].plane_res.bw.dppclk_khz;
  
-               if (safe_to_lower || prev_dppclk_khz < dppclk_khz) {
+               if ((prev_dppclk_khz > dppclk_khz && safe_to_lower) || prev_dppclk_khz < dppclk_khz) {
                         clk_mgr->dccg->funcs->update_dpp_dto(
                                                         clk_mgr->dccg, dpp_inst, dppclk_khz);
                 }
diff --git a/drivers/gpu/drm/amd/display/dc/clk_mgr/dcn21/rn_clk_mgr.c b/drivers/gpu/drm/amd/display/dc/clk_mgr/dcn21/rn_clk_mgr.c

index 7ae4c06..9ef3f7b 100644 (file)
--- a/drivers/gpu/drm/amd/display/dc/clk_mgr/dcn21/rn_clk_mgr.c
+++ b/drivers/gpu/drm/amd/display/dc/clk_mgr/dcn21/rn_clk_mgr.c
@@ -151,6 +151,12 @@ void rn_update_clocks(struct clk_mgr *clk_mgr_base,
                 rn_vbios_smu_set_min_deep_sleep_dcfclk(clk_mgr, clk_mgr_base->clks.dcfclk_deep_sleep_khz);
         }
  
+       // workaround: Limit dppclk to 100Mhz to avoid lower eDP panel switch to plus 4K monitor underflow.
+       if (!IS_DIAG_DC(dc->ctx->dce_environment)) {
+               if (new_clocks->dppclk_khz < 100000)
+                       new_clocks->dppclk_khz = 100000;
+       }
+
         if (should_set_clock(safe_to_lower, new_clocks->dppclk_khz, clk_mgr->base.clks.dppclk_khz)) {
                 if (clk_mgr->base.clks.dppclk_khz > new_clocks->dppclk_khz)
                         dpp_clock_lowered = true;
@@ -412,19 +418,19 @@ void build_watermark_ranges(struct clk_bw_params *bw_params, struct pp_smu_wm_ra
  
                 ranges->reader_wm_sets[num_valid_sets].wm_inst = bw_params->wm_table.entries[i].wm_inst;
                 ranges->reader_wm_sets[num_valid_sets].wm_type = bw_params->wm_table.entries[i].wm_type;
-               /* We will not select WM based on dcfclk, so leave it as unconstrained */
-               ranges->reader_wm_sets[num_valid_sets].min_drain_clk_mhz = PP_SMU_WM_SET_RANGE_CLK_UNCONSTRAINED_MIN;
-               ranges->reader_wm_sets[num_valid_sets].max_drain_clk_mhz = PP_SMU_WM_SET_RANGE_CLK_UNCONSTRAINED_MAX;
-               /* fclk wil be used to select WM*/
+               /* We will not select WM based on fclk, so leave it as unconstrained */
+               ranges->reader_wm_sets[num_valid_sets].min_fill_clk_mhz = PP_SMU_WM_SET_RANGE_CLK_UNCONSTRAINED_MIN;
+               ranges->reader_wm_sets[num_valid_sets].max_fill_clk_mhz = PP_SMU_WM_SET_RANGE_CLK_UNCONSTRAINED_MAX;
+               /* dcfclk wil be used to select WM*/
  
                 if (ranges->reader_wm_sets[num_valid_sets].wm_type == WM_TYPE_PSTATE_CHG) {
                         if (i == 0)
-                               ranges->reader_wm_sets[num_valid_sets].min_fill_clk_mhz = 0;
+                               ranges->reader_wm_sets[num_valid_sets].min_drain_clk_mhz = 0;
                         else {
                                 /* add 1 to make it non-overlapping with next lvl */
-                               ranges->reader_wm_sets[num_valid_sets].min_fill_clk_mhz = bw_params->clk_table.entries[i - 1].fclk_mhz + 1;
+                               ranges->reader_wm_sets[num_valid_sets].min_drain_clk_mhz = bw_params->clk_table.entries[i - 1].dcfclk_mhz + 1;
                         }
-                       ranges->reader_wm_sets[num_valid_sets].max_fill_clk_mhz = bw_params->clk_table.entries[i].fclk_mhz;
+                       ranges->reader_wm_sets[num_valid_sets].max_drain_clk_mhz = bw_params->clk_table.entries[i].dcfclk_mhz;
  
                 } else {
                         /* unconstrained for memory retraining */
diff --git a/drivers/gpu/drm/amd/display/dc/dce/dce_aux.c b/drivers/gpu/drm/amd/display/dc/dce/dce_aux.c

index f1a5d2c..68c4049 100644 (file)
--- a/drivers/gpu/drm/amd/display/dc/dce/dce_aux.c
+++ b/drivers/gpu/drm/amd/display/dc/dce/dce_aux.c
@@ -400,7 +400,7 @@ static bool acquire(
  {
         enum gpio_result result;
  
-       if (!is_engine_available(engine))
+       if ((engine == NULL) || !is_engine_available(engine))
                 return false;
  
         result = dal_ddc_open(ddc, GPIO_MODE_HARDWARE,
diff --git a/drivers/gpu/drm/amd/display/dc/dcn20/dcn20_hwseq.c b/drivers/gpu/drm/amd/display/dc/dcn20/dcn20_hwseq.c

index cfbbaff..a444fed 100644 (file)
--- a/drivers/gpu/drm/amd/display/dc/dcn20/dcn20_hwseq.c
+++ b/drivers/gpu/drm/amd/display/dc/dcn20/dcn20_hwseq.c
@@ -572,7 +572,6 @@ void dcn20_plane_atomic_disable(struct dc *dc, struct pipe_ctx *pipe_ctx)
         dpp->funcs->dpp_dppclk_control(dpp, false, false);
  
         hubp->power_gated = true;
-       dc->optimized_required = false; /* We're powering off, no need to optimize */
  
         hws->funcs.plane_atomic_power_down(dc,
                         pipe_ctx->plane_res.dpp,
diff --git a/drivers/gpu/drm/amd/display/dc/dcn21/dcn21_resource.c b/drivers/gpu/drm/amd/display/dc/dcn21/dcn21_resource.c

index 0d506d3..33d0a17 100644 (file)
--- a/drivers/gpu/drm/amd/display/dc/dcn21/dcn21_resource.c
+++ b/drivers/gpu/drm/amd/display/dc/dcn21/dcn21_resource.c
@@ -60,6 +60,7 @@
  #include "dcn20/dcn20_dccg.h"
  #include "dcn21_hubbub.h"
  #include "dcn10/dcn10_resource.h"
+#include "dce110/dce110_resource.h"
  
  #include "dcn20/dcn20_dwb.h"
  #include "dcn20/dcn20_mmhubbub.h"
@@ -856,6 +857,7 @@ static const struct dc_debug_options debug_defaults_diags = {
  enum dcn20_clk_src_array_id {
         DCN20_CLK_SRC_PLL0,
         DCN20_CLK_SRC_PLL1,
+       DCN20_CLK_SRC_PLL2,
         DCN20_CLK_SRC_TOTAL_DCN21
  };
  
@@ -1718,6 +1720,10 @@ static bool dcn21_resource_construct(
                         dcn21_clock_source_create(ctx, ctx->dc_bios,
                                 CLOCK_SOURCE_COMBO_PHY_PLL1,
                                 &clk_src_regs[1], false);
+       pool->base.clock_sources[DCN20_CLK_SRC_PLL2] =
+                       dcn21_clock_source_create(ctx, ctx->dc_bios,
+                               CLOCK_SOURCE_COMBO_PHY_PLL2,
+                               &clk_src_regs[2], false);
  
         pool->base.clk_src_count = DCN20_CLK_SRC_TOTAL_DCN21;
  
diff --git a/drivers/gpu/drm/amd/display/modules/hdcp/hdcp2_execution.c b/drivers/gpu/drm/amd/display/modules/hdcp/hdcp2_execution.c

index f730b94..5524671 100644 (file)
--- a/drivers/gpu/drm/amd/display/modules/hdcp/hdcp2_execution.c
+++ b/drivers/gpu/drm/amd/display/modules/hdcp/hdcp2_execution.c
@@ -46,8 +46,8 @@ static inline enum mod_hdcp_status check_hdcp2_capable(struct mod_hdcp *hdcp)
         enum mod_hdcp_status status;
  
         if (is_dp_hdcp(hdcp))
-               status = (hdcp->auth.msg.hdcp2.rxcaps_dp[2] & HDCP_2_2_RX_CAPS_VERSION_VAL) &&
-                               HDCP_2_2_DP_HDCP_CAPABLE(hdcp->auth.msg.hdcp2.rxcaps_dp[0]) ?
+               status = (hdcp->auth.msg.hdcp2.rxcaps_dp[0] == HDCP_2_2_RX_CAPS_VERSION_VAL) &&
+                               HDCP_2_2_DP_HDCP_CAPABLE(hdcp->auth.msg.hdcp2.rxcaps_dp[2]) ?
                                 MOD_HDCP_STATUS_SUCCESS :
                                 MOD_HDCP_STATUS_HDCP2_NOT_CAPABLE;
         else
diff --git a/drivers/gpu/drm/amd/powerplay/inc/smu_v11_0_pptable.h b/drivers/gpu/drm/amd/powerplay/inc/smu_v11_0_pptable.h

index b2f96a1..7a63cf8 100644 (file)
--- a/drivers/gpu/drm/amd/powerplay/inc/smu_v11_0_pptable.h
+++ b/drivers/gpu/drm/amd/powerplay/inc/smu_v11_0_pptable.h
@@ -39,21 +39,39 @@
  #define SMU_11_0_PP_OVERDRIVE_VERSION                   0x0800
  #define SMU_11_0_PP_POWERSAVINGCLOCK_VERSION            0x0100
  
+enum SMU_11_0_ODFEATURE_CAP {
+    SMU_11_0_ODCAP_GFXCLK_LIMITS = 0,
+    SMU_11_0_ODCAP_GFXCLK_CURVE,
+    SMU_11_0_ODCAP_UCLK_MAX,
+    SMU_11_0_ODCAP_POWER_LIMIT,
+    SMU_11_0_ODCAP_FAN_ACOUSTIC_LIMIT,
+    SMU_11_0_ODCAP_FAN_SPEED_MIN,
+    SMU_11_0_ODCAP_TEMPERATURE_FAN,
+    SMU_11_0_ODCAP_TEMPERATURE_SYSTEM,
+    SMU_11_0_ODCAP_MEMORY_TIMING_TUNE,
+    SMU_11_0_ODCAP_FAN_ZERO_RPM_CONTROL,
+    SMU_11_0_ODCAP_AUTO_UV_ENGINE,
+    SMU_11_0_ODCAP_AUTO_OC_ENGINE,
+    SMU_11_0_ODCAP_AUTO_OC_MEMORY,
+    SMU_11_0_ODCAP_FAN_CURVE,
+    SMU_11_0_ODCAP_COUNT,
+};
+
  enum SMU_11_0_ODFEATURE_ID {
-    SMU_11_0_ODFEATURE_GFXCLK_LIMITS        = 1 << 0,         //GFXCLK Limit feature
-    SMU_11_0_ODFEATURE_GFXCLK_CURVE         = 1 << 1,         //GFXCLK Curve feature
-    SMU_11_0_ODFEATURE_UCLK_MAX             = 1 << 2,         //UCLK Limit feature
-    SMU_11_0_ODFEATURE_POWER_LIMIT          = 1 << 3,         //Power Limit feature
-    SMU_11_0_ODFEATURE_FAN_ACOUSTIC_LIMIT   = 1 << 4,         //Fan Acoustic RPM feature
-    SMU_11_0_ODFEATURE_FAN_SPEED_MIN        = 1 << 5,         //Minimum Fan Speed feature
-    SMU_11_0_ODFEATURE_TEMPERATURE_FAN      = 1 << 6,         //Fan Target Temperature Limit feature
-    SMU_11_0_ODFEATURE_TEMPERATURE_SYSTEM   = 1 << 7,         //Operating Temperature Limit feature
-    SMU_11_0_ODFEATURE_MEMORY_TIMING_TUNE   = 1 << 8,         //AC Timing Tuning feature
-    SMU_11_0_ODFEATURE_FAN_ZERO_RPM_CONTROL = 1 << 9,         //Zero RPM feature
-    SMU_11_0_ODFEATURE_AUTO_UV_ENGINE       = 1 << 10,        //Auto Under Volt GFXCLK feature
-    SMU_11_0_ODFEATURE_AUTO_OC_ENGINE       = 1 << 11,        //Auto Over Clock GFXCLK feature
-    SMU_11_0_ODFEATURE_AUTO_OC_MEMORY       = 1 << 12,        //Auto Over Clock MCLK feature
-    SMU_11_0_ODFEATURE_FAN_CURVE            = 1 << 13,        //VICTOR TODO
+    SMU_11_0_ODFEATURE_GFXCLK_LIMITS        = 1 << SMU_11_0_ODCAP_GFXCLK_LIMITS,            //GFXCLK Limit feature
+    SMU_11_0_ODFEATURE_GFXCLK_CURVE         = 1 << SMU_11_0_ODCAP_GFXCLK_CURVE,             //GFXCLK Curve feature
+    SMU_11_0_ODFEATURE_UCLK_MAX             = 1 << SMU_11_0_ODCAP_UCLK_MAX,                 //UCLK Limit feature
+    SMU_11_0_ODFEATURE_POWER_LIMIT          = 1 << SMU_11_0_ODCAP_POWER_LIMIT,              //Power Limit feature
+    SMU_11_0_ODFEATURE_FAN_ACOUSTIC_LIMIT   = 1 << SMU_11_0_ODCAP_FAN_ACOUSTIC_LIMIT,       //Fan Acoustic RPM feature
+    SMU_11_0_ODFEATURE_FAN_SPEED_MIN        = 1 << SMU_11_0_ODCAP_FAN_SPEED_MIN,            //Minimum Fan Speed feature
+    SMU_11_0_ODFEATURE_TEMPERATURE_FAN      = 1 << SMU_11_0_ODCAP_TEMPERATURE_FAN,          //Fan Target Temperature Limit feature
+    SMU_11_0_ODFEATURE_TEMPERATURE_SYSTEM   = 1 << SMU_11_0_ODCAP_TEMPERATURE_SYSTEM,       //Operating Temperature Limit feature
+    SMU_11_0_ODFEATURE_MEMORY_TIMING_TUNE   = 1 << SMU_11_0_ODCAP_MEMORY_TIMING_TUNE,       //AC Timing Tuning feature
+    SMU_11_0_ODFEATURE_FAN_ZERO_RPM_CONTROL = 1 << SMU_11_0_ODCAP_FAN_ZERO_RPM_CONTROL,     //Zero RPM feature
+    SMU_11_0_ODFEATURE_AUTO_UV_ENGINE       = 1 << SMU_11_0_ODCAP_AUTO_UV_ENGINE,           //Auto Under Volt GFXCLK feature
+    SMU_11_0_ODFEATURE_AUTO_OC_ENGINE       = 1 << SMU_11_0_ODCAP_AUTO_OC_ENGINE,           //Auto Over Clock GFXCLK feature
+    SMU_11_0_ODFEATURE_AUTO_OC_MEMORY       = 1 << SMU_11_0_ODCAP_AUTO_OC_MEMORY,           //Auto Over Clock MCLK feature
+    SMU_11_0_ODFEATURE_FAN_CURVE            = 1 << SMU_11_0_ODCAP_FAN_CURVE,                //Fan Curve feature
      SMU_11_0_ODFEATURE_COUNT                = 14,
  };
  #define SMU_11_0_MAX_ODFEATURE    32          //Maximum Number of OD Features
diff --git a/drivers/gpu/drm/amd/powerplay/navi10_ppt.c b/drivers/gpu/drm/amd/powerplay/navi10_ppt.c

index 19a9846..0d73a49 100644 (file)
--- a/drivers/gpu/drm/amd/powerplay/navi10_ppt.c
+++ b/drivers/gpu/drm/amd/powerplay/navi10_ppt.c
@@ -736,9 +736,9 @@ static bool navi10_is_support_fine_grained_dpm(struct smu_context *smu, enum smu
         return dpm_desc->SnapToDiscrete == 0 ? true : false;
  }
  
-static inline bool navi10_od_feature_is_supported(struct smu_11_0_overdrive_table *od_table, enum SMU_11_0_ODFEATURE_ID feature)
+static inline bool navi10_od_feature_is_supported(struct smu_11_0_overdrive_table *od_table, enum SMU_11_0_ODFEATURE_CAP cap)
  {
-       return od_table->cap[feature];
+       return od_table->cap[cap];
  }
  
  static void navi10_od_setting_get_range(struct smu_11_0_overdrive_table *od_table,
@@ -846,7 +846,7 @@ static int navi10_print_clk_levels(struct smu_context *smu,
         case SMU_OD_SCLK:
                 if (!smu->od_enabled || !od_table || !od_settings)
                         break;
-               if (!navi10_od_feature_is_supported(od_settings, SMU_11_0_ODFEATURE_GFXCLK_LIMITS))
+               if (!navi10_od_feature_is_supported(od_settings, SMU_11_0_ODCAP_GFXCLK_LIMITS))
                         break;
                 size += sprintf(buf + size, "OD_SCLK:\n");
                 size += sprintf(buf + size, "0: %uMhz\n1: %uMhz\n", od_table->GfxclkFmin, od_table->GfxclkFmax);
@@ -854,7 +854,7 @@ static int navi10_print_clk_levels(struct smu_context *smu,
         case SMU_OD_MCLK:
                 if (!smu->od_enabled || !od_table || !od_settings)
                         break;
-               if (!navi10_od_feature_is_supported(od_settings, SMU_11_0_ODFEATURE_UCLK_MAX))
+               if (!navi10_od_feature_is_supported(od_settings, SMU_11_0_ODCAP_UCLK_MAX))
                         break;
                 size += sprintf(buf + size, "OD_MCLK:\n");
                 size += sprintf(buf + size, "1: %uMHz\n", od_table->UclkFmax);
@@ -862,7 +862,7 @@ static int navi10_print_clk_levels(struct smu_context *smu,
         case SMU_OD_VDDC_CURVE:
                 if (!smu->od_enabled || !od_table || !od_settings)
                         break;
-               if (!navi10_od_feature_is_supported(od_settings, SMU_11_0_ODFEATURE_GFXCLK_CURVE))
+               if (!navi10_od_feature_is_supported(od_settings, SMU_11_0_ODCAP_GFXCLK_CURVE))
                         break;
                 size += sprintf(buf + size, "OD_VDDC_CURVE:\n");
                 for (i = 0; i < 3; i++) {
@@ -887,7 +887,7 @@ static int navi10_print_clk_levels(struct smu_context *smu,
                         break;
                 size = sprintf(buf, "%s:\n", "OD_RANGE");
  
-               if (navi10_od_feature_is_supported(od_settings, SMU_11_0_ODFEATURE_GFXCLK_LIMITS)) {
+               if (navi10_od_feature_is_supported(od_settings, SMU_11_0_ODCAP_GFXCLK_LIMITS)) {
                         navi10_od_setting_get_range(od_settings, SMU_11_0_ODSETTING_GFXCLKFMIN,
                                                     &min_value, NULL);
                         navi10_od_setting_get_range(od_settings, SMU_11_0_ODSETTING_GFXCLKFMAX,
@@ -896,14 +896,14 @@ static int navi10_print_clk_levels(struct smu_context *smu,
                                         min_value, max_value);
                 }
  
-               if (navi10_od_feature_is_supported(od_settings, SMU_11_0_ODFEATURE_UCLK_MAX)) {
+               if (navi10_od_feature_is_supported(od_settings, SMU_11_0_ODCAP_UCLK_MAX)) {
                         navi10_od_setting_get_range(od_settings, SMU_11_0_ODSETTING_UCLKFMAX,
                                                     &min_value, &max_value);
                         size += sprintf(buf + size, "MCLK: %7uMhz %10uMhz\n",
                                         min_value, max_value);
                 }
  
-               if (navi10_od_feature_is_supported(od_settings, SMU_11_0_ODFEATURE_GFXCLK_CURVE)) {
+               if (navi10_od_feature_is_supported(od_settings, SMU_11_0_ODCAP_GFXCLK_CURVE)) {
                         navi10_od_setting_get_range(od_settings, SMU_11_0_ODSETTING_VDDGFXCURVEFREQ_P1,
                                                     &min_value, &max_value);
                         size += sprintf(buf + size, "VDDC_CURVE_SCLK[0]: %7uMhz %10uMhz\n",
@@ -2056,7 +2056,7 @@ static int navi10_od_edit_dpm_table(struct smu_context *smu, enum PP_OD_DPM_TABL
  
         switch (type) {
         case PP_OD_EDIT_SCLK_VDDC_TABLE:
-               if (!navi10_od_feature_is_supported(od_settings, SMU_11_0_ODFEATURE_GFXCLK_LIMITS)) {
+               if (!navi10_od_feature_is_supported(od_settings, SMU_11_0_ODCAP_GFXCLK_LIMITS)) {
                         pr_warn("GFXCLK_LIMITS not supported!\n");
                         return -ENOTSUPP;
                 }
@@ -2102,7 +2102,7 @@ static int navi10_od_edit_dpm_table(struct smu_context *smu, enum PP_OD_DPM_TABL
                 }
                 break;
         case PP_OD_EDIT_MCLK_VDDC_TABLE:
-               if (!navi10_od_feature_is_supported(od_settings, SMU_11_0_ODFEATURE_UCLK_MAX)) {
+               if (!navi10_od_feature_is_supported(od_settings, SMU_11_0_ODCAP_UCLK_MAX)) {
                         pr_warn("UCLK_MAX not supported!\n");
                         return -ENOTSUPP;
                 }
@@ -2143,7 +2143,7 @@ static int navi10_od_edit_dpm_table(struct smu_context *smu, enum PP_OD_DPM_TABL
                 }
                 break;
         case PP_OD_EDIT_VDDC_CURVE:
-               if (!navi10_od_feature_is_supported(od_settings, SMU_11_0_ODFEATURE_GFXCLK_CURVE)) {
+               if (!navi10_od_feature_is_supported(od_settings, SMU_11_0_ODCAP_GFXCLK_CURVE)) {
                         pr_warn("GFXCLK_CURVE not supported!\n");
                         return -ENOTSUPP;
                 }
diff --git a/drivers/gpu/drm/amd/powerplay/smu_v11_0.c b/drivers/gpu/drm/amd/powerplay/smu_v11_0.c

index 0dc4947..b06c057 100644 (file)
--- a/drivers/gpu/drm/amd/powerplay/smu_v11_0.c
+++ b/drivers/gpu/drm/amd/powerplay/smu_v11_0.c
@@ -898,6 +898,9 @@ int smu_v11_0_system_features_control(struct smu_context *smu,
         if (ret)
                 return ret;
  
+       bitmap_zero(feature->enabled, feature->feature_num);
+       bitmap_zero(feature->supported, feature->feature_num);
+
         if (en) {
                 ret = smu_feature_get_enabled_mask(smu, feature_mask, 2);
                 if (ret)
@@ -907,9 +910,6 @@ int smu_v11_0_system_features_control(struct smu_context *smu,
                             feature->feature_num);
                 bitmap_copy(feature->supported, (unsigned long *)&feature_mask,
                             feature->feature_num);
-       } else {
-               bitmap_zero(feature->enabled, feature->feature_num);
-               bitmap_zero(feature->supported, feature->feature_num);
         }
  
         return ret;
diff --git a/drivers/gpu/drm/drm_dp_mst_topology.c b/drivers/gpu/drm/drm_dp_mst_topology.c

index 20cdaf3..cce0b1b 100644 (file)
--- a/drivers/gpu/drm/drm_dp_mst_topology.c
+++ b/drivers/gpu/drm/drm_dp_mst_topology.c
@@ -3838,7 +3838,8 @@ drm_dp_mst_process_up_req(struct drm_dp_mst_topology_mgr *mgr,
                 else if (msg->req_type == DP_RESOURCE_STATUS_NOTIFY)
                         guid = msg->u.resource_stat.guid;
  
-               mstb = drm_dp_get_mst_branch_device_by_guid(mgr, guid);
+               if (guid)
+                       mstb = drm_dp_get_mst_branch_device_by_guid(mgr, guid);
         } else {
                 mstb = drm_dp_get_mst_branch_device(mgr, hdr->lct, hdr->rad);
         }
diff --git a/drivers/gpu/drm/drm_edid.c b/drivers/gpu/drm/drm_edid.c

index 99769d6..805fb00 100644 (file)
--- a/drivers/gpu/drm/drm_edid.c
+++ b/drivers/gpu/drm/drm_edid.c
@@ -3211,7 +3211,7 @@ static u8 *drm_find_cea_extension(const struct edid *edid)
         return cea;
  }
  
-static const struct drm_display_mode *cea_mode_for_vic(u8 vic)
+static __always_inline const struct drm_display_mode *cea_mode_for_vic(u8 vic)
  {
         BUILD_BUG_ON(1 + ARRAY_SIZE(edid_cea_modes_1) - 1 != 127);
         BUILD_BUG_ON(193 + ARRAY_SIZE(edid_cea_modes_193) - 1 != 219);
diff --git a/drivers/gpu/drm/i915/display/intel_bios.c b/drivers/gpu/drm/i915/display/intel_bios.c

index 8beac06..ef4017a 100644 (file)
--- a/drivers/gpu/drm/i915/display/intel_bios.c
+++ b/drivers/gpu/drm/i915/display/intel_bios.c
@@ -357,14 +357,16 @@ parse_generic_dtd(struct drm_i915_private *dev_priv,
                 panel_fixed_mode->hdisplay + dtd->hfront_porch;
         panel_fixed_mode->hsync_end =
                 panel_fixed_mode->hsync_start + dtd->hsync;
-       panel_fixed_mode->htotal = panel_fixed_mode->hsync_end;
+       panel_fixed_mode->htotal =
+               panel_fixed_mode->hdisplay + dtd->hblank;
  
         panel_fixed_mode->vdisplay = dtd->vactive;
         panel_fixed_mode->vsync_start =
                 panel_fixed_mode->vdisplay + dtd->vfront_porch;
         panel_fixed_mode->vsync_end =
                 panel_fixed_mode->vsync_start + dtd->vsync;
-       panel_fixed_mode->vtotal = panel_fixed_mode->vsync_end;
+       panel_fixed_mode->vtotal =
+               panel_fixed_mode->vdisplay + dtd->vblank;
  
         panel_fixed_mode->clock = dtd->pixel_clock;
         panel_fixed_mode->width_mm = dtd->width_mm;
diff --git a/drivers/gpu/drm/i915/display/intel_display.c b/drivers/gpu/drm/i915/display/intel_display.c

index 19ea842..064dd99 100644 (file)
--- a/drivers/gpu/drm/i915/display/intel_display.c
+++ b/drivers/gpu/drm/i915/display/intel_display.c
@@ -12366,6 +12366,7 @@ static int icl_check_nv12_planes(struct intel_crtc_state *crtc_state)
                 /* Copy parameters to slave plane */
                 linked_state->ctl = plane_state->ctl | PLANE_CTL_YUV420_Y_PLANE;
                 linked_state->color_ctl = plane_state->color_ctl;
+               linked_state->view = plane_state->view;
                 memcpy(linked_state->color_plane, plane_state->color_plane,
                        sizeof(linked_state->color_plane));
  
@@ -14476,37 +14477,23 @@ static int intel_atomic_check_crtcs(struct intel_atomic_state *state)
         return 0;
  }
  
-static bool intel_cpu_transcoder_needs_modeset(struct intel_atomic_state *state,
-                                              enum transcoder transcoder)
+static bool intel_cpu_transcoders_need_modeset(struct intel_atomic_state *state,
+                                              u8 transcoders)
  {
-       struct intel_crtc_state *new_crtc_state;
+       const struct intel_crtc_state *new_crtc_state;
         struct intel_crtc *crtc;
         int i;
  
-       for_each_new_intel_crtc_in_state(state, crtc, new_crtc_state, i)
-               if (new_crtc_state->cpu_transcoder == transcoder)
-                       return needs_modeset(new_crtc_state);
+       for_each_new_intel_crtc_in_state(state, crtc, new_crtc_state, i) {
+               if (new_crtc_state->hw.enable &&
+                   transcoders & BIT(new_crtc_state->cpu_transcoder) &&
+                   needs_modeset(new_crtc_state))
+                       return true;
+       }
  
         return false;
  }
  
-static void
-intel_modeset_synced_crtcs(struct intel_atomic_state *state,
-                          u8 transcoders)
-{
-       struct intel_crtc_state *new_crtc_state;
-       struct intel_crtc *crtc;
-       int i;
-
-       for_each_new_intel_crtc_in_state(state, crtc,
-                                        new_crtc_state, i) {
-               if (transcoders & BIT(new_crtc_state->cpu_transcoder)) {
-                       new_crtc_state->uapi.mode_changed = true;
-                       new_crtc_state->update_pipe = false;
-               }
-       }
-}
-
  static int
  intel_modeset_all_tiles(struct intel_atomic_state *state, int tile_grp_id)
  {
@@ -14662,15 +14649,20 @@ static int intel_atomic_check(struct drm_device *dev,
                 if (intel_dp_mst_is_slave_trans(new_crtc_state)) {
                         enum transcoder master = new_crtc_state->mst_master_transcoder;
  
-                       if (intel_cpu_transcoder_needs_modeset(state, master)) {
+                       if (intel_cpu_transcoders_need_modeset(state, BIT(master))) {
                                 new_crtc_state->uapi.mode_changed = true;
                                 new_crtc_state->update_pipe = false;
                         }
-               } else if (is_trans_port_sync_mode(new_crtc_state)) {
+               }
+
+               if (is_trans_port_sync_mode(new_crtc_state)) {
                         u8 trans = new_crtc_state->sync_mode_slaves_mask |
                                    BIT(new_crtc_state->master_transcoder);
  
-                       intel_modeset_synced_crtcs(state, trans);
+                       if (intel_cpu_transcoders_need_modeset(state, trans)) {
+                               new_crtc_state->uapi.mode_changed = true;
+                               new_crtc_state->update_pipe = false;
+                       }
                 }
         }
  
diff --git a/drivers/gpu/drm/i915/display/intel_dsi_vbt.c b/drivers/gpu/drm/i915/display/intel_dsi_vbt.c

index 89fb0d9..04f953b 100644 (file)
--- a/drivers/gpu/drm/i915/display/intel_dsi_vbt.c
+++ b/drivers/gpu/drm/i915/display/intel_dsi_vbt.c
@@ -384,6 +384,7 @@ static const u8 *mipi_exec_gpio(struct intel_dsi *intel_dsi, const u8 *data)
         return data;
  }
  
+#ifdef CONFIG_ACPI
  static int i2c_adapter_lookup(struct acpi_resource *ares, void *data)
  {
         struct i2c_adapter_lookup *lookup = data;
@@ -393,8 +394,7 @@ static int i2c_adapter_lookup(struct acpi_resource *ares, void *data)
         acpi_handle adapter_handle;
         acpi_status status;
  
-       if (intel_dsi->i2c_bus_num >= 0 ||
-           !i2c_acpi_get_i2c_resource(ares, &sb))
+       if (!i2c_acpi_get_i2c_resource(ares, &sb))
                 return 1;
  
         if (lookup->slave_addr != sb->slave_address)
@@ -413,14 +413,41 @@ static int i2c_adapter_lookup(struct acpi_resource *ares, void *data)
         return 1;
  }
  
-static const u8 *mipi_exec_i2c(struct intel_dsi *intel_dsi, const u8 *data)
+static void i2c_acpi_find_adapter(struct intel_dsi *intel_dsi,
+                                 const u16 slave_addr)
  {
         struct drm_device *drm_dev = intel_dsi->base.base.dev;
         struct device *dev = &drm_dev->pdev->dev;
-       struct i2c_adapter *adapter;
         struct acpi_device *acpi_dev;
         struct list_head resource_list;
         struct i2c_adapter_lookup lookup;
+
+       acpi_dev = ACPI_COMPANION(dev);
+       if (acpi_dev) {
+               memset(&lookup, 0, sizeof(lookup));
+               lookup.slave_addr = slave_addr;
+               lookup.intel_dsi = intel_dsi;
+               lookup.dev_handle = acpi_device_handle(acpi_dev);
+
+               INIT_LIST_HEAD(&resource_list);
+               acpi_dev_get_resources(acpi_dev, &resource_list,
+                                      i2c_adapter_lookup,
+                                      &lookup);
+               acpi_dev_free_resource_list(&resource_list);
+       }
+}
+#else
+static inline void i2c_acpi_find_adapter(struct intel_dsi *intel_dsi,
+                                        const u16 slave_addr)
+{
+}
+#endif
+
+static const u8 *mipi_exec_i2c(struct intel_dsi *intel_dsi, const u8 *data)
+{
+       struct drm_device *drm_dev = intel_dsi->base.base.dev;
+       struct device *dev = &drm_dev->pdev->dev;
+       struct i2c_adapter *adapter;
         struct i2c_msg msg;
         int ret;
         u8 vbt_i2c_bus_num = *(data + 2);
@@ -431,20 +458,7 @@ static const u8 *mipi_exec_i2c(struct intel_dsi *intel_dsi, const u8 *data)
  
         if (intel_dsi->i2c_bus_num < 0) {
                 intel_dsi->i2c_bus_num = vbt_i2c_bus_num;
-
-               acpi_dev = ACPI_COMPANION(dev);
-               if (acpi_dev) {
-                       memset(&lookup, 0, sizeof(lookup));
-                       lookup.slave_addr = slave_addr;
-                       lookup.intel_dsi = intel_dsi;
-                       lookup.dev_handle = acpi_device_handle(acpi_dev);
-
-                       INIT_LIST_HEAD(&resource_list);
-                       acpi_dev_get_resources(acpi_dev, &resource_list,
-                                              i2c_adapter_lookup,
-                                              &lookup);
-                       acpi_dev_free_resource_list(&resource_list);
-               }
+               i2c_acpi_find_adapter(intel_dsi, slave_addr);
         }
  
         adapter = i2c_get_adapter(intel_dsi->i2c_bus_num);
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c

index d5a0f5a..60c984e 100644 (file)
--- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
@@ -1981,9 +1981,20 @@ static int __eb_parse(struct dma_fence_work *work)
                                        pw->trampoline);
  }
  
+static void __eb_parse_release(struct dma_fence_work *work)
+{
+       struct eb_parse_work *pw = container_of(work, typeof(*pw), base);
+
+       if (pw->trampoline)
+               i915_active_release(&pw->trampoline->active);
+       i915_active_release(&pw->shadow->active);
+       i915_active_release(&pw->batch->active);
+}
+
  static const struct dma_fence_work_ops eb_parse_ops = {
         .name = "eb_parse",
         .work = __eb_parse,
+       .release = __eb_parse_release,
  };
  
  static int eb_parse_pipeline(struct i915_execbuffer *eb,
@@ -1997,6 +2008,20 @@ static int eb_parse_pipeline(struct i915_execbuffer *eb,
         if (!pw)
                 return -ENOMEM;
  
+       err = i915_active_acquire(&eb->batch->active);
+       if (err)
+               goto err_free;
+
+       err = i915_active_acquire(&shadow->active);
+       if (err)
+               goto err_batch;
+
+       if (trampoline) {
+               err = i915_active_acquire(&trampoline->active);
+               if (err)
+                       goto err_shadow;
+       }
+
         dma_fence_work_init(&pw->base, &eb_parse_ops);
  
         pw->engine = eb->engine;
@@ -2006,7 +2031,9 @@ static int eb_parse_pipeline(struct i915_execbuffer *eb,
         pw->shadow = shadow;
         pw->trampoline = trampoline;
  
-       dma_resv_lock(pw->batch->resv, NULL);
+       err = dma_resv_lock_interruptible(pw->batch->resv, NULL);
+       if (err)
+               goto err_trampoline;
  
         err = dma_resv_reserve_shared(pw->batch->resv, 1);
         if (err)
@@ -2034,6 +2061,14 @@ static int eb_parse_pipeline(struct i915_execbuffer *eb,
  
  err_batch_unlock:
         dma_resv_unlock(pw->batch->resv);
+err_trampoline:
+       if (trampoline)
+               i915_active_release(&trampoline->active);
+err_shadow:
+       i915_active_release(&shadow->active);
+err_batch:
+       i915_active_release(&eb->batch->active);
+err_free:
         kfree(pw);
         return err;
  }
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_mman.c b/drivers/gpu/drm/i915/gem/i915_gem_mman.c

index b9fdac2..0b6a442 100644 (file)
--- a/drivers/gpu/drm/i915/gem/i915_gem_mman.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_mman.c
@@ -455,10 +455,11 @@ out:
  
  void i915_gem_object_release_mmap_offset(struct drm_i915_gem_object *obj)
  {
-       struct i915_mmap_offset *mmo;
+       struct i915_mmap_offset *mmo, *mn;
  
         spin_lock(&obj->mmo.lock);
-       list_for_each_entry(mmo, &obj->mmo.offsets, offset) {
+       rbtree_postorder_for_each_entry_safe(mmo, mn,
+                                            &obj->mmo.offsets, offset) {
                 /*
                  * vma_node_unmap for GTT mmaps handled already in
                  * __i915_gem_object_release_mmap_gtt
@@ -487,6 +488,67 @@ void i915_gem_object_release_mmap(struct drm_i915_gem_object *obj)
         i915_gem_object_release_mmap_offset(obj);
  }
  
+static struct i915_mmap_offset *
+lookup_mmo(struct drm_i915_gem_object *obj,
+          enum i915_mmap_type mmap_type)
+{
+       struct rb_node *rb;
+
+       spin_lock(&obj->mmo.lock);
+       rb = obj->mmo.offsets.rb_node;
+       while (rb) {
+               struct i915_mmap_offset *mmo =
+                       rb_entry(rb, typeof(*mmo), offset);
+
+               if (mmo->mmap_type == mmap_type) {
+                       spin_unlock(&obj->mmo.lock);
+                       return mmo;
+               }
+
+               if (mmo->mmap_type < mmap_type)
+                       rb = rb->rb_right;
+               else
+                       rb = rb->rb_left;
+       }
+       spin_unlock(&obj->mmo.lock);
+
+       return NULL;
+}
+
+static struct i915_mmap_offset *
+insert_mmo(struct drm_i915_gem_object *obj, struct i915_mmap_offset *mmo)
+{
+       struct rb_node *rb, **p;
+
+       spin_lock(&obj->mmo.lock);
+       rb = NULL;
+       p = &obj->mmo.offsets.rb_node;
+       while (*p) {
+               struct i915_mmap_offset *pos;
+
+               rb = *p;
+               pos = rb_entry(rb, typeof(*pos), offset);
+
+               if (pos->mmap_type == mmo->mmap_type) {
+                       spin_unlock(&obj->mmo.lock);
+                       drm_vma_offset_remove(obj->base.dev->vma_offset_manager,
+                                             &mmo->vma_node);
+                       kfree(mmo);
+                       return pos;
+               }
+
+               if (pos->mmap_type < mmo->mmap_type)
+                       p = &rb->rb_right;
+               else
+                       p = &rb->rb_left;
+       }
+       rb_link_node(&mmo->offset, rb, p);
+       rb_insert_color(&mmo->offset, &obj->mmo.offsets);
+       spin_unlock(&obj->mmo.lock);
+
+       return mmo;
+}
+
  static struct i915_mmap_offset *
  mmap_offset_attach(struct drm_i915_gem_object *obj,
                    enum i915_mmap_type mmap_type,
@@ -496,20 +558,22 @@ mmap_offset_attach(struct drm_i915_gem_object *obj,
         struct i915_mmap_offset *mmo;
         int err;
  
+       mmo = lookup_mmo(obj, mmap_type);
+       if (mmo)
+               goto out;
+
         mmo = kmalloc(sizeof(*mmo), GFP_KERNEL);
         if (!mmo)
                 return ERR_PTR(-ENOMEM);
  
         mmo->obj = obj;
-       mmo->dev = obj->base.dev;
-       mmo->file = file;
         mmo->mmap_type = mmap_type;
         drm_vma_node_reset(&mmo->vma_node);
  
-       err = drm_vma_offset_add(mmo->dev->vma_offset_manager, &mmo->vma_node,
-                                obj->base.size / PAGE_SIZE);
+       err = drm_vma_offset_add(obj->base.dev->vma_offset_manager,
+                                &mmo->vma_node, obj->base.size / PAGE_SIZE);
         if (likely(!err))
-               goto out;
+               goto insert;
  
         /* Attempt to reap some mmap space from dead objects */
         err = intel_gt_retire_requests_timeout(&i915->gt, MAX_SCHEDULE_TIMEOUT);
@@ -517,19 +581,17 @@ mmap_offset_attach(struct drm_i915_gem_object *obj,
                 goto err;
  
         i915_gem_drain_freed_objects(i915);
-       err = drm_vma_offset_add(mmo->dev->vma_offset_manager, &mmo->vma_node,
-                                obj->base.size / PAGE_SIZE);
+       err = drm_vma_offset_add(obj->base.dev->vma_offset_manager,
+                                &mmo->vma_node, obj->base.size / PAGE_SIZE);
         if (err)
                 goto err;
  
+insert:
+       mmo = insert_mmo(obj, mmo);
+       GEM_BUG_ON(lookup_mmo(obj, mmap_type) != mmo);
  out:
         if (file)
                 drm_vma_node_allow(&mmo->vma_node, file);
-
-       spin_lock(&obj->mmo.lock);
-       list_add(&mmo->offset, &obj->mmo.offsets);
-       spin_unlock(&obj->mmo.lock);
-
         return mmo;
  
  err:
@@ -745,60 +807,43 @@ int i915_gem_mmap(struct file *filp, struct vm_area_struct *vma)
         struct drm_vma_offset_node *node;
         struct drm_file *priv = filp->private_data;
         struct drm_device *dev = priv->minor->dev;
+       struct drm_i915_gem_object *obj = NULL;
         struct i915_mmap_offset *mmo = NULL;
-       struct drm_gem_object *obj = NULL;
         struct file *anon;
  
         if (drm_dev_is_unplugged(dev))
                 return -ENODEV;
  
+       rcu_read_lock();
         drm_vma_offset_lock_lookup(dev->vma_offset_manager);
         node = drm_vma_offset_exact_lookup_locked(dev->vma_offset_manager,
                                                   vma->vm_pgoff,
                                                   vma_pages(vma));
-       if (likely(node)) {
-               mmo = container_of(node, struct i915_mmap_offset,
-                                  vma_node);
-               /*
-                * In our dependency chain, the drm_vma_offset_node
-                * depends on the validity of the mmo, which depends on
-                * the gem object. However the only reference we have
-                * at this point is the mmo (as the parent of the node).
-                * Try to check if the gem object was at least cleared.
-                */
-               if (!mmo || !mmo->obj) {
-                       drm_vma_offset_unlock_lookup(dev->vma_offset_manager);
-                       return -EINVAL;
-               }
+       if (node && drm_vma_node_is_allowed(node, priv)) {
                 /*
                  * Skip 0-refcnted objects as it is in the process of being
                  * destroyed and will be invalid when the vma manager lock
                  * is released.
                  */
-               obj = &mmo->obj->base;
-               if (!kref_get_unless_zero(&obj->refcount))
-                       obj = NULL;
+               mmo = container_of(node, struct i915_mmap_offset, vma_node);
+               obj = i915_gem_object_get_rcu(mmo->obj);
         }
         drm_vma_offset_unlock_lookup(dev->vma_offset_manager);
+       rcu_read_unlock();
         if (!obj)
-               return -EINVAL;
-
-       if (!drm_vma_node_is_allowed(node, priv)) {
-               drm_gem_object_put_unlocked(obj);
-               return -EACCES;
-       }
+               return node ? -EACCES : -EINVAL;
  
-       if (i915_gem_object_is_readonly(to_intel_bo(obj))) {
+       if (i915_gem_object_is_readonly(obj)) {
                 if (vma->vm_flags & VM_WRITE) {
-                       drm_gem_object_put_unlocked(obj);
+                       i915_gem_object_put(obj);
                         return -EINVAL;
                 }
                 vma->vm_flags &= ~VM_MAYWRITE;
         }
  
-       anon = mmap_singleton(to_i915(obj->dev));
+       anon = mmap_singleton(to_i915(dev));
         if (IS_ERR(anon)) {
-               drm_gem_object_put_unlocked(obj);
+               i915_gem_object_put(obj);
                 return PTR_ERR(anon);
         }
  
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_object.c b/drivers/gpu/drm/i915/gem/i915_gem_object.c

index 46bacc8..3598521 100644 (file)
--- a/drivers/gpu/drm/i915/gem/i915_gem_object.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_object.c
@@ -63,7 +63,7 @@ void i915_gem_object_init(struct drm_i915_gem_object *obj,
         INIT_LIST_HEAD(&obj->lut_list);
  
         spin_lock_init(&obj->mmo.lock);
-       INIT_LIST_HEAD(&obj->mmo.offsets);
+       obj->mmo.offsets = RB_ROOT;
  
         init_rcu_head(&obj->rcu);
  
@@ -100,8 +100,8 @@ void i915_gem_close_object(struct drm_gem_object *gem, struct drm_file *file)
  {
         struct drm_i915_gem_object *obj = to_intel_bo(gem);
         struct drm_i915_file_private *fpriv = file->driver_priv;
+       struct i915_mmap_offset *mmo, *mn;
         struct i915_lut_handle *lut, *ln;
-       struct i915_mmap_offset *mmo;
         LIST_HEAD(close);
  
         i915_gem_object_lock(obj);
@@ -117,14 +117,8 @@ void i915_gem_close_object(struct drm_gem_object *gem, struct drm_file *file)
         i915_gem_object_unlock(obj);
  
         spin_lock(&obj->mmo.lock);
-       list_for_each_entry(mmo, &obj->mmo.offsets, offset) {
-               if (mmo->file != file)
-                       continue;
-
-               spin_unlock(&obj->mmo.lock);
+       rbtree_postorder_for_each_entry_safe(mmo, mn, &obj->mmo.offsets, offset)
                 drm_vma_node_revoke(&mmo->vma_node, file);
-               spin_lock(&obj->mmo.lock);
-       }
         spin_unlock(&obj->mmo.lock);
  
         list_for_each_entry_safe(lut, ln, &close, obj_link) {
@@ -203,12 +197,14 @@ static void __i915_gem_free_objects(struct drm_i915_private *i915,
  
                 i915_gem_object_release_mmap(obj);
  
-               list_for_each_entry_safe(mmo, mn, &obj->mmo.offsets, offset) {
+               rbtree_postorder_for_each_entry_safe(mmo, mn,
+                                                    &obj->mmo.offsets,
+                                                    offset) {
                         drm_vma_offset_remove(obj->base.dev->vma_offset_manager,
                                               &mmo->vma_node);
                         kfree(mmo);
                 }
-               INIT_LIST_HEAD(&obj->mmo.offsets);
+               obj->mmo.offsets = RB_ROOT;
  
                 GEM_BUG_ON(atomic_read(&obj->bind_count));
                 GEM_BUG_ON(obj->userfault_count);
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_object.h b/drivers/gpu/drm/i915/gem/i915_gem_object.h

index db70a33..9c86f2d 100644 (file)
--- a/drivers/gpu/drm/i915/gem/i915_gem_object.h
+++ b/drivers/gpu/drm/i915/gem/i915_gem_object.h
@@ -69,6 +69,15 @@ i915_gem_object_lookup_rcu(struct drm_file *file, u32 handle)
         return idr_find(&file->object_idr, handle);
  }
  
+static inline struct drm_i915_gem_object *
+i915_gem_object_get_rcu(struct drm_i915_gem_object *obj)
+{
+       if (obj && !kref_get_unless_zero(&obj->base.refcount))
+               obj = NULL;
+
+       return obj;
+}
+
  static inline struct drm_i915_gem_object *
  i915_gem_object_lookup(struct drm_file *file, u32 handle)
  {
@@ -76,8 +85,7 @@ i915_gem_object_lookup(struct drm_file *file, u32 handle)
  
         rcu_read_lock();
         obj = i915_gem_object_lookup_rcu(file, handle);
-       if (obj && !kref_get_unless_zero(&obj->base.refcount))
-               obj = NULL;
+       obj = i915_gem_object_get_rcu(obj);
         rcu_read_unlock();
  
         return obj;
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_object_types.h b/drivers/gpu/drm/i915/gem/i915_gem_object_types.h

index 88e2686..f64ad77 100644 (file)
--- a/drivers/gpu/drm/i915/gem/i915_gem_object_types.h
+++ b/drivers/gpu/drm/i915/gem/i915_gem_object_types.h
@@ -71,13 +71,11 @@ enum i915_mmap_type {
  };
  
  struct i915_mmap_offset {
-       struct drm_device *dev;
         struct drm_vma_offset_node vma_node;
         struct drm_i915_gem_object *obj;
-       struct drm_file *file;
         enum i915_mmap_type mmap_type;
  
-       struct list_head offset;
+       struct rb_node offset;
  };
  
  struct drm_i915_gem_object {
@@ -137,7 +135,7 @@ struct drm_i915_gem_object {
  
         struct {
                 spinlock_t lock; /* Protects access to mmo offsets */
-               struct list_head offsets;
+               struct rb_root offsets;
         } mmo;
  
         I915_SELFTEST_DECLARE(struct list_head st_link);
diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c

index 23137b2..57e8a05 100644 (file)
--- a/drivers/gpu/drm/i915/gt/intel_context.c
+++ b/drivers/gpu/drm/i915/gt/intel_context.c
@@ -67,21 +67,18 @@ static int intel_context_active_acquire(struct intel_context *ce)
  {
         int err;
  
-       err = i915_active_acquire(&ce->active);
-       if (err)
-               return err;
+       __i915_active_acquire(&ce->active);
+
+       if (intel_context_is_barrier(ce))
+               return 0;
  
         /* Preallocate tracking nodes */
-       if (!intel_context_is_barrier(ce)) {
-               err = i915_active_acquire_preallocate_barrier(&ce->active,
-                                                             ce->engine);
-               if (err) {
-                       i915_active_release(&ce->active);
-                       return err;
-               }
-       }
+       err = i915_active_acquire_preallocate_barrier(&ce->active,
+                                                     ce->engine);
+       if (err)
+               i915_active_release(&ce->active);
  
-       return 0;
+       return err;
  }
  
  static void intel_context_active_release(struct intel_context *ce)
@@ -101,13 +98,19 @@ int __intel_context_do_pin(struct intel_context *ce)
                         return err;
         }
  
-       if (mutex_lock_interruptible(&ce->pin_mutex))
-               return -EINTR;
+       err = i915_active_acquire(&ce->active);
+       if (err)
+               return err;
+
+       if (mutex_lock_interruptible(&ce->pin_mutex)) {
+               err = -EINTR;
+               goto out_release;
+       }
  
-       if (likely(!atomic_read(&ce->pin_count))) {
+       if (likely(!atomic_add_unless(&ce->pin_count, 1, 0))) {
                 err = intel_context_active_acquire(ce);
                 if (unlikely(err))
-                       goto err;
+                       goto out_unlock;
  
                 err = ce->ops->pin(ce);
                 if (unlikely(err))
@@ -117,18 +120,19 @@ int __intel_context_do_pin(struct intel_context *ce)
                          ce->ring->head, ce->ring->tail);
  
                 smp_mb__before_atomic(); /* flush pin before it is visible */
+               atomic_inc(&ce->pin_count);
         }
  
-       atomic_inc(&ce->pin_count);
         GEM_BUG_ON(!intel_context_is_pinned(ce)); /* no overflow! */
-
-       mutex_unlock(&ce->pin_mutex);
-       return 0;
+       GEM_BUG_ON(i915_active_is_idle(&ce->active));
+       goto out_unlock;
  
  err_active:
         intel_context_active_release(ce);
-err:
+out_unlock:
         mutex_unlock(&ce->pin_mutex);
+out_release:
+       i915_active_release(&ce->active);
         return err;
  }
  
diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c

index f451ef3..06ff769 100644 (file)
--- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
@@ -671,6 +671,7 @@ void
  intel_engine_init_active(struct intel_engine_cs *engine, unsigned int subclass)
  {
         INIT_LIST_HEAD(&engine->active.requests);
+       INIT_LIST_HEAD(&engine->active.hold);
  
         spin_lock_init(&engine->active.lock);
         lockdep_set_subclass(&engine->active.lock, subclass);
@@ -1422,6 +1423,17 @@ static void print_request_ring(struct drm_printer *m, struct i915_request *rq)
         }
  }
  
+static unsigned long list_count(struct list_head *list)
+{
+       struct list_head *pos;
+       unsigned long count = 0;
+
+       list_for_each(pos, list)
+               count++;
+
+       return count;
+}
+
  void intel_engine_dump(struct intel_engine_cs *engine,
                        struct drm_printer *m,
                        const char *header, ...)
@@ -1491,6 +1503,7 @@ void intel_engine_dump(struct intel_engine_cs *engine,
                         hexdump(m, rq->context->lrc_reg_state, PAGE_SIZE);
                 }
         }
+       drm_printf(m, "\tOn hold?: %lu\n", list_count(&engine->active.hold));
         spin_unlock_irqrestore(&engine->active.lock, flags);
  
         drm_printf(m, "\tMMIO base:  0x%08x\n", engine->mmio_base);
diff --git a/drivers/gpu/drm/i915/gt/intel_engine_types.h b/drivers/gpu/drm/i915/gt/intel_engine_types.h

index 350da59..92be41a 100644 (file)
--- a/drivers/gpu/drm/i915/gt/intel_engine_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_engine_types.h
@@ -295,6 +295,7 @@ struct intel_engine_cs {
         struct {
                 spinlock_t lock;
                 struct list_head requests;
+               struct list_head hold; /* ready requests, but on hold */
         } active;
  
         struct llist_head barrier_tasks;
diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c

index 0cf0f6f..a13a8c4 100644 (file)
--- a/drivers/gpu/drm/i915/gt/intel_lrc.c
+++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
@@ -985,6 +985,8 @@ __unwind_incomplete_requests(struct intel_engine_cs *engine)
                         GEM_BUG_ON(RB_EMPTY_ROOT(&engine->execlists.queue.rb_root));
  
                         list_move(&rq->sched.link, pl);
+                       set_bit(I915_FENCE_FLAG_PQUEUE, &rq->fence.flags);
+
                         active = rq;
                 } else {
                         struct intel_engine_cs *owner = rq->context->engine;
@@ -1535,7 +1537,8 @@ static bool can_merge_rq(const struct i915_request *prev,
                 return true;
  
         if (unlikely((prev->fence.flags ^ next->fence.flags) &
-                    (I915_FENCE_FLAG_NOPREEMPT | I915_FENCE_FLAG_SENTINEL)))
+                    (BIT(I915_FENCE_FLAG_NOPREEMPT) |
+                     BIT(I915_FENCE_FLAG_SENTINEL))))
                 return false;
  
         if (!can_merge_ctx(prev->context, next->context))
@@ -1632,8 +1635,8 @@ static void defer_request(struct i915_request *rq, struct list_head * const pl)
                                    !i915_request_completed(rq));
  
                         GEM_BUG_ON(i915_request_is_active(w));
-                       if (list_empty(&w->sched.link))
-                               continue; /* Not yet submitted; unready */
+                       if (!i915_request_is_ready(w))
+                               continue;
  
                         if (rq_prio(w) < rq_prio(rq))
                                 continue;
@@ -2351,6 +2354,310 @@ static void __execlists_submission_tasklet(struct intel_engine_cs *const engine)
         }
  }
  
+static void __execlists_hold(struct i915_request *rq)
+{
+       LIST_HEAD(list);
+
+       do {
+               struct i915_dependency *p;
+
+               if (i915_request_is_active(rq))
+                       __i915_request_unsubmit(rq);
+
+               RQ_TRACE(rq, "on hold\n");
+               clear_bit(I915_FENCE_FLAG_PQUEUE, &rq->fence.flags);
+               list_move_tail(&rq->sched.link, &rq->engine->active.hold);
+               i915_request_set_hold(rq);
+
+               list_for_each_entry(p, &rq->sched.waiters_list, wait_link) {
+                       struct i915_request *w =
+                               container_of(p->waiter, typeof(*w), sched);
+
+                       /* Leave semaphores spinning on the other engines */
+                       if (w->engine != rq->engine)
+                               continue;
+
+                       if (!i915_request_is_ready(w))
+                               continue;
+
+                       if (i915_request_completed(w))
+                               continue;
+
+                       if (i915_request_on_hold(rq))
+                               continue;
+
+                       list_move_tail(&w->sched.link, &list);
+               }
+
+               rq = list_first_entry_or_null(&list, typeof(*rq), sched.link);
+       } while (rq);
+}
+
+static bool execlists_hold(struct intel_engine_cs *engine,
+                          struct i915_request *rq)
+{
+       spin_lock_irq(&engine->active.lock);
+
+       if (i915_request_completed(rq)) { /* too late! */
+               rq = NULL;
+               goto unlock;
+       }
+
+       if (rq->engine != engine) { /* preempted virtual engine */
+               struct virtual_engine *ve = to_virtual_engine(rq->engine);
+
+               /*
+                * intel_context_inflight() is only protected by virtue
+                * of process_csb() being called only by the tasklet (or
+                * directly from inside reset while the tasklet is suspended).
+                * Assert that neither of those are allowed to run while we
+                * poke at the request queues.
+                */
+               GEM_BUG_ON(!reset_in_progress(&engine->execlists));
+
+               /*
+                * An unsubmitted request along a virtual engine will
+                * remain on the active (this) engine until we are able
+                * to process the context switch away (and so mark the
+                * context as no longer in flight). That cannot have happened
+                * yet, otherwise we would not be hanging!
+                */
+               spin_lock(&ve->base.active.lock);
+               GEM_BUG_ON(intel_context_inflight(rq->context) != engine);
+               GEM_BUG_ON(ve->request != rq);
+               ve->request = NULL;
+               spin_unlock(&ve->base.active.lock);
+               i915_request_put(rq);
+
+               rq->engine = engine;
+       }
+
+       /*
+        * Transfer this request onto the hold queue to prevent it
+        * being resumbitted to HW (and potentially completed) before we have
+        * released it. Since we may have already submitted following
+        * requests, we need to remove those as well.
+        */
+       GEM_BUG_ON(i915_request_on_hold(rq));
+       GEM_BUG_ON(rq->engine != engine);
+       __execlists_hold(rq);
+
+unlock:
+       spin_unlock_irq(&engine->active.lock);
+       return rq;
+}
+
+static bool hold_request(const struct i915_request *rq)
+{
+       struct i915_dependency *p;
+
+       /*
+        * If one of our ancestors is on hold, we must also be on hold,
+        * otherwise we will bypass it and execute before it.
+        */
+       list_for_each_entry(p, &rq->sched.signalers_list, signal_link) {
+               const struct i915_request *s =
+                       container_of(p->signaler, typeof(*s), sched);
+
+               if (s->engine != rq->engine)
+                       continue;
+
+               if (i915_request_on_hold(s))
+                       return true;
+       }
+
+       return false;
+}
+
+static void __execlists_unhold(struct i915_request *rq)
+{
+       LIST_HEAD(list);
+
+       do {
+               struct i915_dependency *p;
+
+               GEM_BUG_ON(!i915_request_on_hold(rq));
+               GEM_BUG_ON(!i915_sw_fence_signaled(&rq->submit));
+
+               i915_request_clear_hold(rq);
+               list_move_tail(&rq->sched.link,
+                              i915_sched_lookup_priolist(rq->engine,
+                                                         rq_prio(rq)));
+               set_bit(I915_FENCE_FLAG_PQUEUE, &rq->fence.flags);
+               RQ_TRACE(rq, "hold release\n");
+
+               /* Also release any children on this engine that are ready */
+               list_for_each_entry(p, &rq->sched.waiters_list, wait_link) {
+                       struct i915_request *w =
+                               container_of(p->waiter, typeof(*w), sched);
+
+                       if (w->engine != rq->engine)
+                               continue;
+
+                       if (!i915_request_on_hold(rq))
+                               continue;
+
+                       /* Check that no other parents are also on hold */
+                       if (hold_request(rq))
+                               continue;
+
+                       list_move_tail(&w->sched.link, &list);
+               }
+
+               rq = list_first_entry_or_null(&list, typeof(*rq), sched.link);
+       } while (rq);
+}
+
+static void execlists_unhold(struct intel_engine_cs *engine,
+                            struct i915_request *rq)
+{
+       spin_lock_irq(&engine->active.lock);
+
+       /*
+        * Move this request back to the priority queue, and all of its
+        * children and grandchildren that were suspended along with it.
+        */
+       __execlists_unhold(rq);
+
+       if (rq_prio(rq) > engine->execlists.queue_priority_hint) {
+               engine->execlists.queue_priority_hint = rq_prio(rq);
+               tasklet_hi_schedule(&engine->execlists.tasklet);
+       }
+
+       spin_unlock_irq(&engine->active.lock);
+}
+
+struct execlists_capture {
+       struct work_struct work;
+       struct i915_request *rq;
+       struct i915_gpu_coredump *error;
+};
+
+static void execlists_capture_work(struct work_struct *work)
+{
+       struct execlists_capture *cap = container_of(work, typeof(*cap), work);
+       const gfp_t gfp = GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOWARN;
+       struct intel_engine_cs *engine = cap->rq->engine;
+       struct intel_gt_coredump *gt = cap->error->gt;
+       struct intel_engine_capture_vma *vma;
+
+       /* Compress all the objects attached to the request, slow! */
+       vma = intel_engine_coredump_add_request(gt->engine, cap->rq, gfp);
+       if (vma) {
+               struct i915_vma_compress *compress =
+                       i915_vma_capture_prepare(gt);
+
+               intel_engine_coredump_add_vma(gt->engine, vma, compress);
+               i915_vma_capture_finish(gt, compress);
+       }
+
+       gt->simulated = gt->engine->simulated;
+       cap->error->simulated = gt->simulated;
+
+       /* Publish the error state, and announce it to the world */
+       i915_error_state_store(cap->error);
+       i915_gpu_coredump_put(cap->error);
+
+       /* Return this request and all that depend upon it for signaling */
+       execlists_unhold(engine, cap->rq);
+       i915_request_put(cap->rq);
+
+       kfree(cap);
+}
+
+static struct execlists_capture *capture_regs(struct intel_engine_cs *engine)
+{
+       const gfp_t gfp = GFP_ATOMIC | __GFP_NOWARN;
+       struct execlists_capture *cap;
+
+       cap = kmalloc(sizeof(*cap), gfp);
+       if (!cap)
+               return NULL;
+
+       cap->error = i915_gpu_coredump_alloc(engine->i915, gfp);
+       if (!cap->error)
+               goto err_cap;
+
+       cap->error->gt = intel_gt_coredump_alloc(engine->gt, gfp);
+       if (!cap->error->gt)
+               goto err_gpu;
+
+       cap->error->gt->engine = intel_engine_coredump_alloc(engine, gfp);
+       if (!cap->error->gt->engine)
+               goto err_gt;
+
+       return cap;
+
+err_gt:
+       kfree(cap->error->gt);
+err_gpu:
+       kfree(cap->error);
+err_cap:
+       kfree(cap);
+       return NULL;
+}
+
+static bool execlists_capture(struct intel_engine_cs *engine)
+{
+       struct execlists_capture *cap;
+
+       if (!IS_ENABLED(CONFIG_DRM_I915_CAPTURE_ERROR))
+               return true;
+
+       /*
+        * We need to _quickly_ capture the engine state before we reset.
+        * We are inside an atomic section (softirq) here and we are delaying
+        * the forced preemption event.
+        */
+       cap = capture_regs(engine);
+       if (!cap)
+               return true;
+
+       cap->rq = execlists_active(&engine->execlists);
+       GEM_BUG_ON(!cap->rq);
+
+       rcu_read_lock();
+       cap->rq = active_request(cap->rq->context->timeline, cap->rq);
+       cap->rq = i915_request_get_rcu(cap->rq);
+       rcu_read_unlock();
+       if (!cap->rq)
+               goto err_free;
+
+       /*
+        * Remove the request from the execlists queue, and take ownership
+        * of the request. We pass it to our worker who will _slowly_ compress
+        * all the pages the _user_ requested for debugging their batch, after
+        * which we return it to the queue for signaling.
+        *
+        * By removing them from the execlists queue, we also remove the
+        * requests from being processed by __unwind_incomplete_requests()
+        * during the intel_engine_reset(), and so they will *not* be replayed
+        * afterwards.
+        *
+        * Note that because we have not yet reset the engine at this point,
+        * it is possible for the request that we have identified as being
+        * guilty, did in fact complete and we will then hit an arbitration
+        * point allowing the outstanding preemption to succeed. The likelihood
+        * of that is very low (as capturing of the engine registers should be
+        * fast enough to run inside an irq-off atomic section!), so we will
+        * simply hold that request accountable for being non-preemptible
+        * long enough to force the reset.
+        */
+       if (!execlists_hold(engine, cap->rq))
+               goto err_rq;
+
+       INIT_WORK(&cap->work, execlists_capture_work);
+       schedule_work(&cap->work);
+       return true;
+
+err_rq:
+       i915_request_put(cap->rq);
+err_free:
+       i915_gpu_coredump_put(cap->error);
+       kfree(cap);
+       return false;
+}
+
  static noinline void preempt_reset(struct intel_engine_cs *engine)
  {
         const unsigned int bit = I915_RESET_ENGINE + engine->id;
@@ -2368,7 +2675,12 @@ static noinline void preempt_reset(struct intel_engine_cs *engine)
         ENGINE_TRACE(engine, "preempt timeout %lu+%ums\n",
                      READ_ONCE(engine->props.preempt_timeout_ms),
                      jiffies_to_msecs(jiffies - engine->execlists.preempt.expires));
-       intel_engine_reset(engine, "preemption time out");
+
+       ring_set_paused(engine, 1); /* Freeze the current request in place */
+       if (execlists_capture(engine))
+               intel_engine_reset(engine, "preemption time out");
+       else
+               ring_set_paused(engine, 0);
  
         tasklet_enable(&engine->execlists.tasklet);
         clear_and_wake_up_bit(bit, lock);
@@ -2430,11 +2742,12 @@ static void execlists_preempt(struct timer_list *timer)
  }
  
  static void queue_request(struct intel_engine_cs *engine,
-                         struct i915_sched_node *node,
-                         int prio)
+                         struct i915_request *rq)
  {
-       GEM_BUG_ON(!list_empty(&node->link));
-       list_add_tail(&node->link, i915_sched_lookup_priolist(engine, prio));
+       GEM_BUG_ON(!list_empty(&rq->sched.link));
+       list_add_tail(&rq->sched.link,
+                     i915_sched_lookup_priolist(engine, rq_prio(rq)));
+       set_bit(I915_FENCE_FLAG_PQUEUE, &rq->fence.flags);
  }
  
  static void __submit_queue_imm(struct intel_engine_cs *engine)
@@ -2462,6 +2775,13 @@ static void submit_queue(struct intel_engine_cs *engine,
         __submit_queue_imm(engine);
  }
  
+static bool ancestor_on_hold(const struct intel_engine_cs *engine,
+                            const struct i915_request *rq)
+{
+       GEM_BUG_ON(i915_request_on_hold(rq));
+       return !list_empty(&engine->active.hold) && hold_request(rq);
+}
+
  static void execlists_submit_request(struct i915_request *request)
  {
         struct intel_engine_cs *engine = request->engine;
@@ -2470,12 +2790,17 @@ static void execlists_submit_request(struct i915_request *request)
         /* Will be called from irq-context when using foreign fences. */
         spin_lock_irqsave(&engine->active.lock, flags);
  
-       queue_request(engine, &request->sched, rq_prio(request));
+       if (unlikely(ancestor_on_hold(engine, request))) {
+               list_add_tail(&request->sched.link, &engine->active.hold);
+               i915_request_set_hold(request);
+       } else {
+               queue_request(engine, request);
  
-       GEM_BUG_ON(RB_EMPTY_ROOT(&engine->execlists.queue.rb_root));
-       GEM_BUG_ON(list_empty(&request->sched.link));
+               GEM_BUG_ON(RB_EMPTY_ROOT(&engine->execlists.queue.rb_root));
+               GEM_BUG_ON(list_empty(&request->sched.link));
  
-       submit_queue(engine, request);
+               submit_queue(engine, request);
+       }
  
         spin_unlock_irqrestore(&engine->active.lock, flags);
  }
@@ -2531,7 +2856,6 @@ static void execlists_context_unpin(struct intel_context *ce)
                       ce->engine);
  
         i915_gem_object_unpin_map(ce->state->obj);
-       intel_ring_reset(ce->ring, ce->ring->tail);
  }
  
  static void
@@ -3325,6 +3649,10 @@ static void execlists_reset_cancel(struct intel_engine_cs *engine)
                 i915_priolist_free(p);
         }
  
+       /* On-hold requests will be flushed to timeline upon their release */
+       list_for_each_entry(rq, &engine->active.hold, sched.link)
+               mark_eio(rq);
+
         /* Cancel all attached virtual engines */
         while ((rb = rb_first_cached(&execlists->virtual))) {
                 struct virtual_engine *ve =
diff --git a/drivers/gpu/drm/i915/gt/mock_engine.c b/drivers/gpu/drm/i915/gt/mock_engine.c

index a560b7e..f280638 100644 (file)
--- a/drivers/gpu/drm/i915/gt/mock_engine.c
+++ b/drivers/gpu/drm/i915/gt/mock_engine.c
@@ -59,11 +59,26 @@ static struct intel_ring *mock_ring(struct intel_engine_cs *engine)
         ring->vaddr = (void *)(ring + 1);
         atomic_set(&ring->pin_count, 1);
  
+       ring->vma = i915_vma_alloc();
+       if (!ring->vma) {
+               kfree(ring);
+               return NULL;
+       }
+       i915_active_init(&ring->vma->active, NULL, NULL);
+
         intel_ring_update_space(ring);
  
         return ring;
  }
  
+static void mock_ring_free(struct intel_ring *ring)
+{
+       i915_active_fini(&ring->vma->active);
+       i915_vma_free(ring->vma);
+
+       kfree(ring);
+}
+
  static struct i915_request *first_request(struct mock_engine *engine)
  {
         return list_first_entry_or_null(&engine->hw_queue,
@@ -121,7 +136,7 @@ static void mock_context_destroy(struct kref *ref)
         GEM_BUG_ON(intel_context_is_pinned(ce));
  
         if (test_bit(CONTEXT_ALLOC_BIT, &ce->flags)) {
-               kfree(ce->ring);
+               mock_ring_free(ce->ring);
                 mock_timeline_unpin(ce->timeline);
         }
  
diff --git a/drivers/gpu/drm/i915/gt/selftest_lrc.c b/drivers/gpu/drm/i915/gt/selftest_lrc.c

index 15cda02..65718ca 100644 (file)
--- a/drivers/gpu/drm/i915/gt/selftest_lrc.c
+++ b/drivers/gpu/drm/i915/gt/selftest_lrc.c
@@ -285,6 +285,107 @@ static int live_unlite_preempt(void *arg)
         return live_unlite_restore(arg, I915_USER_PRIORITY(I915_PRIORITY_MAX));
  }
  
+static int live_hold_reset(void *arg)
+{
+       struct intel_gt *gt = arg;
+       struct intel_engine_cs *engine;
+       enum intel_engine_id id;
+       struct igt_spinner spin;
+       int err = 0;
+
+       /*
+        * In order to support offline error capture for fast preempt reset,
+        * we need to decouple the guilty request and ensure that it and its
+        * descendents are not executed while the capture is in progress.
+        */
+
+       if (!intel_has_reset_engine(gt))
+               return 0;
+
+       if (igt_spinner_init(&spin, gt))
+               return -ENOMEM;
+
+       for_each_engine(engine, gt, id) {
+               struct intel_context *ce;
+               unsigned long heartbeat;
+               struct i915_request *rq;
+
+               ce = intel_context_create(engine);
+               if (IS_ERR(ce)) {
+                       err = PTR_ERR(ce);
+                       break;
+               }
+
+               engine_heartbeat_disable(engine, &heartbeat);
+
+               rq = igt_spinner_create_request(&spin, ce, MI_ARB_CHECK);
+               if (IS_ERR(rq)) {
+                       err = PTR_ERR(rq);
+                       goto out;
+               }
+               i915_request_add(rq);
+
+               if (!igt_wait_for_spinner(&spin, rq)) {
+                       intel_gt_set_wedged(gt);
+                       err = -ETIME;
+                       goto out;
+               }
+
+               /* We have our request executing, now remove it and reset */
+
+               if (test_and_set_bit(I915_RESET_ENGINE + id,
+                                    &gt->reset.flags)) {
+                       intel_gt_set_wedged(gt);
+                       err = -EBUSY;
+                       goto out;
+               }
+               tasklet_disable(&engine->execlists.tasklet);
+
+               engine->execlists.tasklet.func(engine->execlists.tasklet.data);
+               GEM_BUG_ON(execlists_active(&engine->execlists) != rq);
+
+               i915_request_get(rq);
+               execlists_hold(engine, rq);
+               GEM_BUG_ON(!i915_request_on_hold(rq));
+
+               intel_engine_reset(engine, NULL);
+               GEM_BUG_ON(rq->fence.error != -EIO);
+
+               tasklet_enable(&engine->execlists.tasklet);
+               clear_and_wake_up_bit(I915_RESET_ENGINE + id,
+                                     &gt->reset.flags);
+
+               /* Check that we do not resubmit the held request */
+               if (!i915_request_wait(rq, 0, HZ / 5)) {
+                       pr_err("%s: on hold request completed!\n",
+                              engine->name);
+                       i915_request_put(rq);
+                       err = -EIO;
+                       goto out;
+               }
+               GEM_BUG_ON(!i915_request_on_hold(rq));
+
+               /* But is resubmitted on release */
+               execlists_unhold(engine, rq);
+               if (i915_request_wait(rq, 0, HZ / 5) < 0) {
+                       pr_err("%s: held request did not complete!\n",
+                              engine->name);
+                       intel_gt_set_wedged(gt);
+                       err = -ETIME;
+               }
+               i915_request_put(rq);
+
+out:
+               engine_heartbeat_enable(engine, heartbeat);
+               intel_context_put(ce);
+               if (err)
+                       break;
+       }
+
+       igt_spinner_fini(&spin);
+       return err;
+}
+
  static int
  emit_semaphore_chain(struct i915_request *rq, struct i915_vma *vma, int idx)
  {
@@ -3309,12 +3410,168 @@ static int live_virtual_bond(void *arg)
         return 0;
  }
  
+static int reset_virtual_engine(struct intel_gt *gt,
+                               struct intel_engine_cs **siblings,
+                               unsigned int nsibling)
+{
+       struct intel_engine_cs *engine;
+       struct intel_context *ve;
+       unsigned long *heartbeat;
+       struct igt_spinner spin;
+       struct i915_request *rq;
+       unsigned int n;
+       int err = 0;
+
+       /*
+        * In order to support offline error capture for fast preempt reset,
+        * we need to decouple the guilty request and ensure that it and its
+        * descendents are not executed while the capture is in progress.
+        */
+
+       heartbeat = kmalloc_array(nsibling, sizeof(*heartbeat), GFP_KERNEL);
+       if (!heartbeat)
+               return -ENOMEM;
+
+       if (igt_spinner_init(&spin, gt)) {
+               err = -ENOMEM;
+               goto out_free;
+       }
+
+       ve = intel_execlists_create_virtual(siblings, nsibling);
+       if (IS_ERR(ve)) {
+               err = PTR_ERR(ve);
+               goto out_spin;
+       }
+
+       for (n = 0; n < nsibling; n++)
+               engine_heartbeat_disable(siblings[n], &heartbeat[n]);
+
+       rq = igt_spinner_create_request(&spin, ve, MI_ARB_CHECK);
+       if (IS_ERR(rq)) {
+               err = PTR_ERR(rq);
+               goto out_heartbeat;
+       }
+       i915_request_add(rq);
+
+       if (!igt_wait_for_spinner(&spin, rq)) {
+               intel_gt_set_wedged(gt);
+               err = -ETIME;
+               goto out_heartbeat;
+       }
+
+       engine = rq->engine;
+       GEM_BUG_ON(engine == ve->engine);
+
+       /* Take ownership of the reset and tasklet */
+       if (test_and_set_bit(I915_RESET_ENGINE + engine->id,
+                            &gt->reset.flags)) {
+               intel_gt_set_wedged(gt);
+               err = -EBUSY;
+               goto out_heartbeat;
+       }
+       tasklet_disable(&engine->execlists.tasklet);
+
+       engine->execlists.tasklet.func(engine->execlists.tasklet.data);
+       GEM_BUG_ON(execlists_active(&engine->execlists) != rq);
+
+       /* Fake a preemption event; failed of course */
+       spin_lock_irq(&engine->active.lock);
+       __unwind_incomplete_requests(engine);
+       spin_unlock_irq(&engine->active.lock);
+       GEM_BUG_ON(rq->engine != ve->engine);
+
+       /* Reset the engine while keeping our active request on hold */
+       execlists_hold(engine, rq);
+       GEM_BUG_ON(!i915_request_on_hold(rq));
+
+       intel_engine_reset(engine, NULL);
+       GEM_BUG_ON(rq->fence.error != -EIO);
+
+       /* Release our grasp on the engine, letting CS flow again */
+       tasklet_enable(&engine->execlists.tasklet);
+       clear_and_wake_up_bit(I915_RESET_ENGINE + engine->id, &gt->reset.flags);
+
+       /* Check that we do not resubmit the held request */
+       i915_request_get(rq);
+       if (!i915_request_wait(rq, 0, HZ / 5)) {
+               pr_err("%s: on hold request completed!\n",
+                      engine->name);
+               intel_gt_set_wedged(gt);
+               err = -EIO;
+               goto out_rq;
+       }
+       GEM_BUG_ON(!i915_request_on_hold(rq));
+
+       /* But is resubmitted on release */
+       execlists_unhold(engine, rq);
+       if (i915_request_wait(rq, 0, HZ / 5) < 0) {
+               pr_err("%s: held request did not complete!\n",
+                      engine->name);
+               intel_gt_set_wedged(gt);
+               err = -ETIME;
+       }
+
+out_rq:
+       i915_request_put(rq);
+out_heartbeat:
+       for (n = 0; n < nsibling; n++)
+               engine_heartbeat_enable(siblings[n], heartbeat[n]);
+
+       intel_context_put(ve);
+out_spin:
+       igt_spinner_fini(&spin);
+out_free:
+       kfree(heartbeat);
+       return err;
+}
+
+static int live_virtual_reset(void *arg)
+{
+       struct intel_gt *gt = arg;
+       struct intel_engine_cs *siblings[MAX_ENGINE_INSTANCE + 1];
+       unsigned int class, inst;
+
+       /*
+        * Check that we handle a reset event within a virtual engine.
+        * Only the physical engine is reset, but we have to check the flow
+        * of the virtual requests around the reset, and make sure it is not
+        * forgotten.
+        */
+
+       if (USES_GUC_SUBMISSION(gt->i915))
+               return 0;
+
+       if (!intel_has_reset_engine(gt))
+               return 0;
+
+       for (class = 0; class <= MAX_ENGINE_CLASS; class++) {
+               int nsibling, err;
+
+               nsibling = 0;
+               for (inst = 0; inst <= MAX_ENGINE_INSTANCE; inst++) {
+                       if (!gt->engine_class[class][inst])
+                               continue;
+
+                       siblings[nsibling++] = gt->engine_class[class][inst];
+               }
+               if (nsibling < 2)
+                       continue;
+
+               err = reset_virtual_engine(gt, siblings, nsibling);
+               if (err)
+                       return err;
+       }
+
+       return 0;
+}
+
  int intel_execlists_live_selftests(struct drm_i915_private *i915)
  {
         static const struct i915_subtest tests[] = {
                 SUBTEST(live_sanitycheck),
                 SUBTEST(live_unlite_switch),
                 SUBTEST(live_unlite_preempt),
+               SUBTEST(live_hold_reset),
                 SUBTEST(live_timeslice_preempt),
                 SUBTEST(live_timeslice_queue),
                 SUBTEST(live_busywait_preempt),
@@ -3333,6 +3590,7 @@ int intel_execlists_live_selftests(struct drm_i915_private *i915)
                 SUBTEST(live_virtual_mask),
                 SUBTEST(live_virtual_preserved),
                 SUBTEST(live_virtual_bond),
+               SUBTEST(live_virtual_reset),
         };
  
         if (!HAS_EXECLISTS(i915))
diff --git a/drivers/gpu/drm/i915/gvt/firmware.c b/drivers/gpu/drm/i915/gvt/firmware.c

index 049775e..b0c1fda 100644 (file)
--- a/drivers/gpu/drm/i915/gvt/firmware.c
+++ b/drivers/gpu/drm/i915/gvt/firmware.c
@@ -146,7 +146,7 @@ void intel_gvt_free_firmware(struct intel_gvt *gvt)
                 clean_firmware_sysfs(gvt);
  
         kfree(gvt->firmware.cfg_space);
-       kfree(gvt->firmware.mmio);
+       vfree(gvt->firmware.mmio);
  }
  
  static int verify_firmware(struct intel_gvt *gvt,
@@ -229,7 +229,7 @@ int intel_gvt_load_firmware(struct intel_gvt *gvt)
  
         firmware->cfg_space = mem;
  
-       mem = kmalloc(info->mmio_size, GFP_KERNEL);
+       mem = vmalloc(info->mmio_size);
         if (!mem) {
                 kfree(path);
                 kfree(firmware->cfg_space);
diff --git a/drivers/gpu/drm/i915/gvt/gtt.c b/drivers/gpu/drm/i915/gvt/gtt.c

index 34cb404..4a48280 100644 (file)
--- a/drivers/gpu/drm/i915/gvt/gtt.c
+++ b/drivers/gpu/drm/i915/gvt/gtt.c
@@ -1956,7 +1956,11 @@ void _intel_vgpu_mm_release(struct kref *mm_ref)
  
         if (mm->type == INTEL_GVT_MM_PPGTT) {
                 list_del(&mm->ppgtt_mm.list);
+
+               mutex_lock(&mm->vgpu->gvt->gtt.ppgtt_mm_lock);
                 list_del(&mm->ppgtt_mm.lru_list);
+               mutex_unlock(&mm->vgpu->gvt->gtt.ppgtt_mm_lock);
+
                 invalidate_ppgtt_mm(mm);
         } else {
                 vfree(mm->ggtt_mm.virtual_ggtt);
diff --git a/drivers/gpu/drm/i915/i915_active.c b/drivers/gpu/drm/i915/i915_active.c

index f3da5c0..b0a4997 100644 (file)
--- a/drivers/gpu/drm/i915/i915_active.c
+++ b/drivers/gpu/drm/i915/i915_active.c
@@ -416,13 +416,15 @@ int i915_active_acquire(struct i915_active *ref)
         if (err)
                 return err;
  
-       if (!atomic_read(&ref->count) && ref->active)
-               err = ref->active(ref);
-       if (!err) {
-               spin_lock_irq(&ref->tree_lock); /* vs __active_retire() */
-               debug_active_activate(ref);
-               atomic_inc(&ref->count);
-               spin_unlock_irq(&ref->tree_lock);
+       if (likely(!i915_active_acquire_if_busy(ref))) {
+               if (ref->active)
+                       err = ref->active(ref);
+               if (!err) {
+                       spin_lock_irq(&ref->tree_lock); /* __active_retire() */
+                       debug_active_activate(ref);
+                       atomic_inc(&ref->count);
+                       spin_unlock_irq(&ref->tree_lock);
+               }
         }
  
         mutex_unlock(&ref->mutex);
@@ -605,7 +607,7 @@ int i915_active_acquire_preallocate_barrier(struct i915_active *ref,
                                             struct intel_engine_cs *engine)
  {
         intel_engine_mask_t tmp, mask = engine->mask;
-       struct llist_node *pos = NULL, *next;
+       struct llist_node *first = NULL, *last = NULL;
         struct intel_gt *gt = engine->gt;
         int err;
  
@@ -623,6 +625,7 @@ int i915_active_acquire_preallocate_barrier(struct i915_active *ref,
          */
         for_each_engine_masked(engine, gt, mask, tmp) {
                 u64 idx = engine->kernel_context->timeline->fence_context;
+               struct llist_node *prev = first;
                 struct active_node *node;
  
                 node = reuse_idle_barrier(ref, idx);
@@ -656,23 +659,23 @@ int i915_active_acquire_preallocate_barrier(struct i915_active *ref,
                 GEM_BUG_ON(rcu_access_pointer(node->base.fence) != ERR_PTR(-EAGAIN));
  
                 GEM_BUG_ON(barrier_to_engine(node) != engine);
-               next = barrier_to_ll(node);
-               next->next = pos;
-               if (!pos)
-                       pos = next;
+               first = barrier_to_ll(node);
+               first->next = prev;
+               if (!last)
+                       last = first;
                 intel_engine_pm_get(engine);
         }
  
         GEM_BUG_ON(!llist_empty(&ref->preallocated_barriers));
-       llist_add_batch(next, pos, &ref->preallocated_barriers);
+       llist_add_batch(first, last, &ref->preallocated_barriers);
  
         return 0;
  
  unwind:
-       while (pos) {
-               struct active_node *node = barrier_from_ll(pos);
+       while (first) {
+               struct active_node *node = barrier_from_ll(first);
  
-               pos = pos->next;
+               first = first->next;
  
                 atomic_dec(&ref->count);
                 intel_engine_pm_put(barrier_to_engine(node));
diff --git a/drivers/gpu/drm/i915/i915_active.h b/drivers/gpu/drm/i915/i915_active.h

index b571f67..51e1e85 100644 (file)
--- a/drivers/gpu/drm/i915/i915_active.h
+++ b/drivers/gpu/drm/i915/i915_active.h
@@ -188,6 +188,12 @@ int i915_active_acquire(struct i915_active *ref);
  bool i915_active_acquire_if_busy(struct i915_active *ref);
  void i915_active_release(struct i915_active *ref);
  
+static inline void __i915_active_acquire(struct i915_active *ref)
+{
+       GEM_BUG_ON(!atomic_read(&ref->count));
+       atomic_inc(&ref->count);
+}
+
  static inline bool
  i915_active_is_idle(const struct i915_active *ref)
  {
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c

index 94f993e..c2de2f4 100644 (file)
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -265,7 +265,10 @@ i915_gem_dumb_create(struct drm_file *file,
                                                     DRM_FORMAT_MOD_LINEAR))
                 args->pitch = ALIGN(args->pitch, 4096);
  
-       args->size = args->pitch * args->height;
+       if (args->pitch < args->width)
+               return -EINVAL;
+
+       args->size = mul_u32_u32(args->pitch, args->height);
  
         mem_type = INTEL_MEMORY_SYSTEM;
         if (HAS_LMEM(to_i915(dev)))
diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c

index 4c1836f..594341e 100644 (file)
--- a/drivers/gpu/drm/i915/i915_gpu_error.c
+++ b/drivers/gpu/drm/i915/i915_gpu_error.c
@@ -1681,7 +1681,7 @@ static const char *error_msg(struct i915_gpu_coredump *error)
                         "GPU HANG: ecode %d:%x:%08x",
                         INTEL_GEN(error->i915), engines,
                         generate_ecode(first));
-       if (first) {
+       if (first && first->context.pid) {
                 /* Just show the first executing process, more is confusing */
                 len += scnprintf(error->error_msg + len,
                                  sizeof(error->error_msg) - len,
diff --git a/drivers/gpu/drm/i915/i915_gpu_error.h b/drivers/gpu/drm/i915/i915_gpu_error.h

index 9109004..e4a6afe 100644 (file)
--- a/drivers/gpu/drm/i915/i915_gpu_error.h
+++ b/drivers/gpu/drm/i915/i915_gpu_error.h
@@ -314,8 +314,11 @@ i915_vma_capture_finish(struct intel_gt_coredump *gt,
  }
  
  static inline void
-i915_error_state_store(struct drm_i915_private *i915,
-                      struct i915_gpu_coredump *error)
+i915_error_state_store(struct i915_gpu_coredump *error)
+{
+}
+
+static inline void i915_gpu_coredump_put(struct i915_gpu_coredump *gpu)
  {
  }
  
diff --git a/drivers/gpu/drm/i915/i915_pmu.c b/drivers/gpu/drm/i915/i915_pmu.c

index 28a82c8..ec02994 100644 (file)
--- a/drivers/gpu/drm/i915/i915_pmu.c
+++ b/drivers/gpu/drm/i915/i915_pmu.c
@@ -637,8 +637,10 @@ static void i915_pmu_enable(struct perf_event *event)
                 container_of(event->pmu, typeof(*i915), pmu.base);
         unsigned int bit = event_enabled_bit(event);
         struct i915_pmu *pmu = &i915->pmu;
+       intel_wakeref_t wakeref;
         unsigned long flags;
  
+       wakeref = intel_runtime_pm_get(&i915->runtime_pm);
         spin_lock_irqsave(&pmu->lock, flags);
  
         /*
@@ -648,6 +650,14 @@ static void i915_pmu_enable(struct perf_event *event)
         BUILD_BUG_ON(ARRAY_SIZE(pmu->enable_count) != I915_PMU_MASK_BITS);
         GEM_BUG_ON(bit >= ARRAY_SIZE(pmu->enable_count));
         GEM_BUG_ON(pmu->enable_count[bit] == ~0);
+
+       if (pmu->enable_count[bit] == 0 &&
+           config_enabled_mask(I915_PMU_RC6_RESIDENCY) & BIT_ULL(bit)) {
+               pmu->sample[__I915_SAMPLE_RC6_LAST_REPORTED].cur = 0;
+               pmu->sample[__I915_SAMPLE_RC6].cur = __get_rc6(&i915->gt);
+               pmu->sleep_last = ktime_get();
+       }
+
         pmu->enable |= BIT_ULL(bit);
         pmu->enable_count[bit]++;
  
@@ -688,6 +698,8 @@ static void i915_pmu_enable(struct perf_event *event)
          * an existing non-zero value.
          */
         local64_set(&event->hw.prev_count, __i915_pmu_event_read(event));
+
+       intel_runtime_pm_put(&i915->runtime_pm, wakeref);
  }
  
  static void i915_pmu_disable(struct perf_event *event)
diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c

index be18588..78a5f5d 100644 (file)
--- a/drivers/gpu/drm/i915/i915_request.c
+++ b/drivers/gpu/drm/i915/i915_request.c
@@ -221,6 +221,8 @@ static void remove_from_engine(struct i915_request *rq)
                 locked = engine;
         }
         list_del_init(&rq->sched.link);
+       clear_bit(I915_FENCE_FLAG_PQUEUE, &rq->fence.flags);
+       clear_bit(I915_FENCE_FLAG_HOLD, &rq->fence.flags);
         spin_unlock_irq(&locked->active.lock);
  }
  
@@ -408,8 +410,10 @@ bool __i915_request_submit(struct i915_request *request)
  xfer:  /* We may be recursing from the signal callback of another i915 fence */
         spin_lock_nested(&request->lock, SINGLE_DEPTH_NESTING);
  
-       if (!test_and_set_bit(I915_FENCE_FLAG_ACTIVE, &request->fence.flags))
+       if (!test_and_set_bit(I915_FENCE_FLAG_ACTIVE, &request->fence.flags)) {
                 list_move_tail(&request->sched.link, &engine->active.requests);
+               clear_bit(I915_FENCE_FLAG_PQUEUE, &request->fence.flags);
+       }
  
         if (test_bit(DMA_FENCE_FLAG_ENABLE_SIGNAL_BIT, &request->fence.flags) &&
             !test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &request->fence.flags) &&
diff --git a/drivers/gpu/drm/i915/i915_request.h b/drivers/gpu/drm/i915/i915_request.h

index 0314336..f57eadc 100644 (file)
--- a/drivers/gpu/drm/i915/i915_request.h
+++ b/drivers/gpu/drm/i915/i915_request.h
@@ -70,6 +70,18 @@ enum {
          */
         I915_FENCE_FLAG_ACTIVE = DMA_FENCE_FLAG_USER_BITS,
  
+       /*
+        * I915_FENCE_FLAG_PQUEUE - this request is ready for execution
+        *
+        * Using the scheduler, when a request is ready for execution it is put
+        * into the priority queue, and removed from that queue when transferred
+        * to the HW runlists. We want to track its membership within the
+        * priority queue so that we can easily check before rescheduling.
+        *
+        * See i915_request_in_priority_queue()
+        */
+       I915_FENCE_FLAG_PQUEUE,
+
         /*
          * I915_FENCE_FLAG_SIGNAL - this request is currently on signal_list
          *
@@ -78,6 +90,13 @@ enum {
          */
         I915_FENCE_FLAG_SIGNAL,
  
+       /*
+        * I915_FENCE_FLAG_HOLD - this request is currently on hold
+        *
+        * This request has been suspended, pending an ongoing investigation.
+        */
+       I915_FENCE_FLAG_HOLD,
+
         /*
          * I915_FENCE_FLAG_NOPREEMPT - this request should not be preempted
          *
@@ -361,6 +380,11 @@ static inline bool i915_request_is_active(const struct i915_request *rq)
         return test_bit(I915_FENCE_FLAG_ACTIVE, &rq->fence.flags);
  }
  
+static inline bool i915_request_in_priority_queue(const struct i915_request *rq)
+{
+       return test_bit(I915_FENCE_FLAG_PQUEUE, &rq->fence.flags);
+}
+
  /**
   * Returns true if seq1 is later than seq2.
   */
@@ -454,6 +478,27 @@ static inline bool i915_request_is_running(const struct i915_request *rq)
         return __i915_request_has_started(rq);
  }
  
+/**
+ * i915_request_is_running - check if the request is ready for execution
+ * @rq: the request
+ *
+ * Upon construction, the request is instructed to wait upon various
+ * signals before it is ready to be executed by the HW. That is, we do
+ * not want to start execution and read data before it is written. In practice,
+ * this is controlled with a mixture of interrupts and semaphores. Once
+ * the submit fence is completed, the backend scheduler will place the
+ * request into its queue and from there submit it for execution. So we
+ * can detect when a request is eligible for execution (and is under control
+ * of the scheduler) by querying where it is in any of the scheduler's lists.
+ *
+ * Returns true if the request is ready for execution (it may be inflight),
+ * false otherwise.
+ */
+static inline bool i915_request_is_ready(const struct i915_request *rq)
+{
+       return !list_empty(&rq->sched.link);
+}
+
  static inline bool i915_request_completed(const struct i915_request *rq)
  {
         if (i915_request_signaled(rq))
@@ -483,6 +528,21 @@ static inline bool i915_request_has_sentinel(const struct i915_request *rq)
         return unlikely(test_bit(I915_FENCE_FLAG_SENTINEL, &rq->fence.flags));
  }
  
+static inline bool i915_request_on_hold(const struct i915_request *rq)
+{
+       return unlikely(test_bit(I915_FENCE_FLAG_HOLD, &rq->fence.flags));
+}
+
+static inline void i915_request_set_hold(struct i915_request *rq)
+{
+       set_bit(I915_FENCE_FLAG_HOLD, &rq->fence.flags);
+}
+
+static inline void i915_request_clear_hold(struct i915_request *rq)
+{
+       clear_bit(I915_FENCE_FLAG_HOLD, &rq->fence.flags);
+}
+
  static inline struct intel_timeline *
  i915_request_timeline(struct i915_request *rq)
  {
diff --git a/drivers/gpu/drm/i915/i915_scheduler.c b/drivers/gpu/drm/i915/i915_scheduler.c

index bf87c70..5d96cfb 100644 (file)
--- a/drivers/gpu/drm/i915/i915_scheduler.c
+++ b/drivers/gpu/drm/i915/i915_scheduler.c
@@ -326,20 +326,18 @@ static void __i915_schedule(struct i915_sched_node *node,
  
                 node->attr.priority = prio;
  
-               if (list_empty(&node->link)) {
-                       /*
-                        * If the request is not in the priolist queue because
-                        * it is not yet runnable, then it doesn't contribute
-                        * to our preemption decisions. On the other hand,
-                        * if the request is on the HW, it too is not in the
-                        * queue; but in that case we may still need to reorder
-                        * the inflight requests.
-                        */
+               /*
+                * Once the request is ready, it will be placed into the
+                * priority lists and then onto the HW runlist. Before the
+                * request is ready, it does not contribute to our preemption
+                * decisions and we can safely ignore it, as it will, and
+                * any preemption required, be dealt with upon submission.
+                * See engine->submit_request()
+                */
+               if (list_empty(&node->link))
                         continue;
-               }
  
-               if (!intel_engine_is_virtual(engine) &&
-                   !i915_request_is_active(node_to_request(node))) {
+               if (i915_request_in_priority_queue(node_to_request(node))) {
                         if (!cache.priolist)
                                 cache.priolist =
                                         i915_sched_lookup_priolist(engine,
diff --git a/drivers/gpu/drm/i915/i915_vma.c b/drivers/gpu/drm/i915/i915_vma.c

index 17d7c52..4ff3807 100644 (file)
--- a/drivers/gpu/drm/i915/i915_vma.c
+++ b/drivers/gpu/drm/i915/i915_vma.c
@@ -1202,16 +1202,26 @@ int __i915_vma_unbind(struct i915_vma *vma)
         if (ret)
                 return ret;
  
-       GEM_BUG_ON(i915_vma_is_active(vma));
         if (i915_vma_is_pinned(vma)) {
                 vma_print_allocator(vma, "is pinned");
                 return -EAGAIN;
         }
  
-       GEM_BUG_ON(i915_vma_is_active(vma));
+       /*
+        * After confirming that no one else is pinning this vma, wait for
+        * any laggards who may have crept in during the wait (through
+        * a residual pin skipping the vm->mutex) to complete.
+        */
+       ret = i915_vma_sync(vma);
+       if (ret)
+               return ret;
+
         if (!drm_mm_node_allocated(&vma->node))
                 return 0;
  
+       GEM_BUG_ON(i915_vma_is_pinned(vma));
+       GEM_BUG_ON(i915_vma_is_active(vma));
+
         if (i915_vma_is_map_and_fenceable(vma)) {
                 /*
                  * Check that we have flushed all writes through the GGTT
diff --git a/drivers/gpu/drm/msm/msm_drv.c b/drivers/gpu/drm/msm/msm_drv.c

index c26219c..e4b750b 100644 (file)
--- a/drivers/gpu/drm/msm/msm_drv.c
+++ b/drivers/gpu/drm/msm/msm_drv.c
@@ -441,6 +441,14 @@ static int msm_drm_init(struct device *dev, struct drm_driver *drv)
         if (ret)
                 goto err_msm_uninit;
  
+       if (!dev->dma_parms) {
+               dev->dma_parms = devm_kzalloc(dev, sizeof(*dev->dma_parms),
+                                             GFP_KERNEL);
+               if (!dev->dma_parms)
+                       return -ENOMEM;
+       }
+       dma_set_max_seg_size(dev, DMA_BIT_MASK(32));
+
         msm_gem_shrinker_init(ddev);
  
         switch (get_mdp_ver(pdev)) {
diff --git a/drivers/gpu/drm/panfrost/panfrost_drv.c b/drivers/gpu/drm/panfrost/panfrost_drv.c

index 6da59f4..b7a618d 100644 (file)
--- a/drivers/gpu/drm/panfrost/panfrost_drv.c
+++ b/drivers/gpu/drm/panfrost/panfrost_drv.c
@@ -166,6 +166,7 @@ panfrost_lookup_bos(struct drm_device *dev,
                         break;
                 }
  
+               atomic_inc(&bo->gpu_usecount);
                 job->mappings[i] = mapping;
         }
  
diff --git a/drivers/gpu/drm/panfrost/panfrost_gem.h b/drivers/gpu/drm/panfrost/panfrost_gem.h

index ca1bc90..b3517ff 100644 (file)
--- a/drivers/gpu/drm/panfrost/panfrost_gem.h
+++ b/drivers/gpu/drm/panfrost/panfrost_gem.h
@@ -30,6 +30,12 @@ struct panfrost_gem_object {
                 struct mutex lock;
         } mappings;
  
+       /*
+        * Count the number of jobs referencing this BO so we don't let the
+        * shrinker reclaim this object prematurely.
+        */
+       atomic_t gpu_usecount;
+
         bool noexec             :1;
         bool is_heap            :1;
  };
diff --git a/drivers/gpu/drm/panfrost/panfrost_gem_shrinker.c b/drivers/gpu/drm/panfrost/panfrost_gem_shrinker.c

index f5dd7b2..288e46c 100644 (file)
--- a/drivers/gpu/drm/panfrost/panfrost_gem_shrinker.c
+++ b/drivers/gpu/drm/panfrost/panfrost_gem_shrinker.c
@@ -41,6 +41,9 @@ static bool panfrost_gem_purge(struct drm_gem_object *obj)
         struct drm_gem_shmem_object *shmem = to_drm_gem_shmem_obj(obj);
         struct panfrost_gem_object *bo = to_panfrost_bo(obj);
  
+       if (atomic_read(&bo->gpu_usecount))
+               return false;
+
         if (!mutex_trylock(&shmem->pages_lock))
                 return false;
  
diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c b/drivers/gpu/drm/panfrost/panfrost_job.c

index 7c36ec6..7157dfd 100644 (file)
--- a/drivers/gpu/drm/panfrost/panfrost_job.c
+++ b/drivers/gpu/drm/panfrost/panfrost_job.c
@@ -269,8 +269,13 @@ static void panfrost_job_cleanup(struct kref *ref)
         dma_fence_put(job->render_done_fence);
  
         if (job->mappings) {
-               for (i = 0; i < job->bo_count; i++)
+               for (i = 0; i < job->bo_count; i++) {
+                       if (!job->mappings[i])
+                               break;
+
+                       atomic_dec(&job->mappings[i]->obj->gpu_usecount);
                         panfrost_gem_mapping_put(job->mappings[i]);
+               }
                 kvfree(job->mappings);
         }
  
diff --git a/drivers/gpu/drm/sun4i/sun4i_drv.c b/drivers/gpu/drm/sun4i/sun4i_drv.c

index 5ae67d5..328272f 100644 (file)
--- a/drivers/gpu/drm/sun4i/sun4i_drv.c
+++ b/drivers/gpu/drm/sun4i/sun4i_drv.c
@@ -85,7 +85,6 @@ static int sun4i_drv_bind(struct device *dev)
         }
  
         drm_mode_config_init(drm);
-       drm->mode_config.allow_fb_modifiers = true;
  
         ret = component_bind_all(drm->dev, drm);
         if (ret) {
diff --git a/drivers/gpu/drm/vgem/vgem_drv.c b/drivers/gpu/drm/vgem/vgem_drv.c

index 5bd60de..909eba4 100644 (file)
--- a/drivers/gpu/drm/vgem/vgem_drv.c
+++ b/drivers/gpu/drm/vgem/vgem_drv.c
@@ -196,9 +196,10 @@ static struct drm_gem_object *vgem_gem_create(struct drm_device *dev,
                 return ERR_CAST(obj);
  
         ret = drm_gem_handle_create(file, &obj->base, handle);
-       drm_gem_object_put_unlocked(&obj->base);
-       if (ret)
+       if (ret) {
+               drm_gem_object_put_unlocked(&obj->base);
                 return ERR_PTR(ret);
+       }
  
         return &obj->base;
  }
@@ -221,7 +222,9 @@ static int vgem_gem_dumb_create(struct drm_file *file, struct drm_device *dev,
         args->size = gem_object->size;
         args->pitch = pitch;
  
-       DRM_DEBUG("Created object of size %lld\n", size);
+       drm_gem_object_put_unlocked(gem_object);
+
+       DRM_DEBUG("Created object of size %llu\n", args->size);
  
         return 0;
  }
diff --git a/drivers/hwmon/pmbus/ltc2978.c b/drivers/hwmon/pmbus/ltc2978.c

index f01f488..a91ed01 100644 (file)
--- a/drivers/hwmon/pmbus/ltc2978.c
+++ b/drivers/hwmon/pmbus/ltc2978.c
@@ -82,8 +82,8 @@ enum chips { ltc2974, ltc2975, ltc2977, ltc2978, ltc2980, ltc3880, ltc3882,
  
  #define LTC_POLL_TIMEOUT               100     /* in milli-seconds */
  
-#define LTC_NOT_BUSY                   BIT(5)
-#define LTC_NOT_PENDING                        BIT(4)
+#define LTC_NOT_BUSY                   BIT(6)
+#define LTC_NOT_PENDING                        BIT(5)
  
  /*
   * LTC2978 clears peak data whenever the CLEAR_FAULTS command is executed, which
diff --git a/drivers/hwmon/pmbus/xdpe12284.c b/drivers/hwmon/pmbus/xdpe12284.c

index 3d47806..ecd9b65 100644 (file)
--- a/drivers/hwmon/pmbus/xdpe12284.c
+++ b/drivers/hwmon/pmbus/xdpe12284.c
@@ -94,8 +94,8 @@ static const struct i2c_device_id xdpe122_id[] = {
  MODULE_DEVICE_TABLE(i2c, xdpe122_id);
  
  static const struct of_device_id __maybe_unused xdpe122_of_match[] = {
-       {.compatible = "infineon, xdpe12254"},
-       {.compatible = "infineon, xdpe12284"},
+       {.compatible = "infineon,xdpe12254"},
+       {.compatible = "infineon,xdpe12284"},
         {}
  };
  MODULE_DEVICE_TABLE(of, xdpe122_of_match);
diff --git a/drivers/infiniband/core/security.c b/drivers/infiniband/core/security.c

index 6eb6d27..2b4d803 100644 (file)
--- a/drivers/infiniband/core/security.c
+++ b/drivers/infiniband/core/security.c
@@ -339,22 +339,16 @@ static struct ib_ports_pkeys *get_new_pps(const struct ib_qp *qp,
         if (!new_pps)
                 return NULL;
  
-       if (qp_attr_mask & (IB_QP_PKEY_INDEX | IB_QP_PORT)) {
-               if (!qp_pps) {
-                       new_pps->main.port_num = qp_attr->port_num;
-                       new_pps->main.pkey_index = qp_attr->pkey_index;
-               } else {
-                       new_pps->main.port_num = (qp_attr_mask & IB_QP_PORT) ?
-                                                 qp_attr->port_num :
-                                                 qp_pps->main.port_num;
-
-                       new_pps->main.pkey_index =
-                                       (qp_attr_mask & IB_QP_PKEY_INDEX) ?
-                                        qp_attr->pkey_index :
-                                        qp_pps->main.pkey_index;
-               }
+       if (qp_attr_mask & IB_QP_PORT)
+               new_pps->main.port_num =
+                       (qp_pps) ? qp_pps->main.port_num : qp_attr->port_num;
+       if (qp_attr_mask & IB_QP_PKEY_INDEX)
+               new_pps->main.pkey_index = (qp_pps) ? qp_pps->main.pkey_index :
+                                                     qp_attr->pkey_index;
+       if ((qp_attr_mask & IB_QP_PKEY_INDEX) && (qp_attr_mask & IB_QP_PORT))
                 new_pps->main.state = IB_PORT_PKEY_VALID;
-       } else if (qp_pps) {
+
+       if (!(qp_attr_mask & (IB_QP_PKEY_INDEX || IB_QP_PORT)) && qp_pps) {
                 new_pps->main.port_num = qp_pps->main.port_num;
                 new_pps->main.pkey_index = qp_pps->main.pkey_index;
                 if (qp_pps->main.state != IB_PORT_PKEY_NOT_VALID)
diff --git a/drivers/infiniband/core/user_mad.c b/drivers/infiniband/core/user_mad.c

index d1407fa..1235ffb 100644 (file)
--- a/drivers/infiniband/core/user_mad.c
+++ b/drivers/infiniband/core/user_mad.c
@@ -1312,6 +1312,9 @@ static void ib_umad_kill_port(struct ib_umad_port *port)
         struct ib_umad_file *file;
         int id;
  
+       cdev_device_del(&port->sm_cdev, &port->sm_dev);
+       cdev_device_del(&port->cdev, &port->dev);
+
         mutex_lock(&port->file_mutex);
  
         /* Mark ib_dev NULL and block ioctl or other file ops to progress
@@ -1331,8 +1334,6 @@ static void ib_umad_kill_port(struct ib_umad_port *port)
  
         mutex_unlock(&port->file_mutex);
  
-       cdev_device_del(&port->sm_cdev, &port->sm_dev);
-       cdev_device_del(&port->cdev, &port->dev);
         ida_free(&umad_ida, port->dev_num);
  
         /* balances device_initialize() */
diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c

index c8693f5..0259337 100644 (file)
--- a/drivers/infiniband/core/uverbs_cmd.c
+++ b/drivers/infiniband/core/uverbs_cmd.c
@@ -2745,12 +2745,6 @@ static int kern_spec_to_ib_spec_action(struct uverbs_attr_bundle *attrs,
         return 0;
  }
  
-static size_t kern_spec_filter_sz(const struct ib_uverbs_flow_spec_hdr *spec)
-{
-       /* Returns user space filter size, includes padding */
-       return (spec->size - sizeof(struct ib_uverbs_flow_spec_hdr)) / 2;
-}
-
  static ssize_t spec_filter_size(const void *kern_spec_filter, u16 kern_filter_size,
                                 u16 ib_real_filter_sz)
  {
@@ -2894,11 +2888,16 @@ int ib_uverbs_kern_spec_to_ib_spec_filter(enum ib_flow_spec_type type,
  static int kern_spec_to_ib_spec_filter(struct ib_uverbs_flow_spec *kern_spec,
                                        union ib_flow_spec *ib_spec)
  {
-       ssize_t kern_filter_sz;
+       size_t kern_filter_sz;
         void *kern_spec_mask;
         void *kern_spec_val;
  
-       kern_filter_sz = kern_spec_filter_sz(&kern_spec->hdr);
+       if (check_sub_overflow((size_t)kern_spec->hdr.size,
+                              sizeof(struct ib_uverbs_flow_spec_hdr),
+                              &kern_filter_sz))
+               return -EINVAL;
+
+       kern_filter_sz /= 2;
  
         kern_spec_val = (void *)kern_spec +
                 sizeof(struct ib_uverbs_flow_spec_hdr);
diff --git a/drivers/infiniband/core/uverbs_std_types.c b/drivers/infiniband/core/uverbs_std_types.c

index 994d874..3abfc63 100644 (file)
--- a/drivers/infiniband/core/uverbs_std_types.c
+++ b/drivers/infiniband/core/uverbs_std_types.c
@@ -220,6 +220,7 @@ void ib_uverbs_free_event_queue(struct ib_uverbs_event_queue *event_queue)
         list_for_each_entry_safe(entry, tmp, &event_queue->event_list, list) {
                 if (entry->counter)
                         list_del(&entry->obj_list);
+               list_del(&entry->list);
                 kfree(entry);
         }
         spin_unlock_irq(&event_queue->lock);
diff --git a/drivers/infiniband/hw/cxgb4/cm.c b/drivers/infiniband/hw/cxgb4/cm.c

index ee1182f..d69dece 100644 (file)
--- a/drivers/infiniband/hw/cxgb4/cm.c
+++ b/drivers/infiniband/hw/cxgb4/cm.c
@@ -3036,6 +3036,10 @@ static int terminate(struct c4iw_dev *dev, struct sk_buff *skb)
                                        C4IW_QP_ATTR_NEXT_STATE, &attrs, 1);
                 }
  
+               /* As per draft-hilland-iwarp-verbs-v1.0, sec 6.2.3,
+                * when entering the TERM state the RNIC MUST initiate a CLOSE.
+                */
+               c4iw_ep_disconnect(ep, 1, GFP_KERNEL);
                 c4iw_put_ep(&ep->com);
         } else
                 pr_warn("TERM received tid %u no ep/qp\n", tid);
diff --git a/drivers/infiniband/hw/cxgb4/qp.c b/drivers/infiniband/hw/cxgb4/qp.c

index bbcac53..89ac2f9 100644 (file)
--- a/drivers/infiniband/hw/cxgb4/qp.c
+++ b/drivers/infiniband/hw/cxgb4/qp.c
@@ -1948,10 +1948,10 @@ int c4iw_modify_qp(struct c4iw_dev *rhp, struct c4iw_qp *qhp,
                         qhp->attr.layer_etype = attrs->layer_etype;
                         qhp->attr.ecode = attrs->ecode;
                         ep = qhp->ep;
-                       c4iw_get_ep(&ep->com);
-                       disconnect = 1;
                         if (!internal) {
+                               c4iw_get_ep(&ep->com);
                                 terminate = 1;
+                               disconnect = 1;
                         } else {
                                 terminate = qhp->attr.send_term;
                                 ret = rdma_fini(rhp, qhp, ep);
diff --git a/drivers/infiniband/hw/hfi1/affinity.c b/drivers/infiniband/hw/hfi1/affinity.c

index c142b23..1aeea5d 100644 (file)
--- a/drivers/infiniband/hw/hfi1/affinity.c
+++ b/drivers/infiniband/hw/hfi1/affinity.c
@@ -479,6 +479,8 @@ static int _dev_comp_vect_mappings_create(struct hfi1_devdata *dd,
                           rvt_get_ibdev_name(&(dd)->verbs_dev.rdi), i, cpu);
         }
  
+       free_cpumask_var(available_cpus);
+       free_cpumask_var(non_intr_cpus);
         return 0;
  
  fail:
diff --git a/drivers/infiniband/hw/hfi1/file_ops.c b/drivers/infiniband/hw/hfi1/file_ops.c

index bef6946..2591158 100644 (file)
--- a/drivers/infiniband/hw/hfi1/file_ops.c
+++ b/drivers/infiniband/hw/hfi1/file_ops.c
@@ -200,23 +200,24 @@ static int hfi1_file_open(struct inode *inode, struct file *fp)
  
         fd = kzalloc(sizeof(*fd), GFP_KERNEL);
  
-       if (fd) {
-               fd->rec_cpu_num = -1; /* no cpu affinity by default */
-               fd->mm = current->mm;
-               mmgrab(fd->mm);
-               fd->dd = dd;
-               kobject_get(&fd->dd->kobj);
-               fp->private_data = fd;
-       } else {
-               fp->private_data = NULL;
-
-               if (atomic_dec_and_test(&dd->user_refcount))
-                       complete(&dd->user_comp);
-
-               return -ENOMEM;
-       }
-
+       if (!fd || init_srcu_struct(&fd->pq_srcu))
+               goto nomem;
+       spin_lock_init(&fd->pq_rcu_lock);
+       spin_lock_init(&fd->tid_lock);
+       spin_lock_init(&fd->invalid_lock);
+       fd->rec_cpu_num = -1; /* no cpu affinity by default */
+       fd->mm = current->mm;
+       mmgrab(fd->mm);
+       fd->dd = dd;
+       kobject_get(&fd->dd->kobj);
+       fp->private_data = fd;
         return 0;
+nomem:
+       kfree(fd);
+       fp->private_data = NULL;
+       if (atomic_dec_and_test(&dd->user_refcount))
+               complete(&dd->user_comp);
+       return -ENOMEM;
  }
  
  static long hfi1_file_ioctl(struct file *fp, unsigned int cmd,
@@ -301,21 +302,30 @@ static long hfi1_file_ioctl(struct file *fp, unsigned int cmd,
  static ssize_t hfi1_write_iter(struct kiocb *kiocb, struct iov_iter *from)
  {
         struct hfi1_filedata *fd = kiocb->ki_filp->private_data;
-       struct hfi1_user_sdma_pkt_q *pq = fd->pq;
+       struct hfi1_user_sdma_pkt_q *pq;
         struct hfi1_user_sdma_comp_q *cq = fd->cq;
         int done = 0, reqs = 0;
         unsigned long dim = from->nr_segs;
+       int idx;
  
-       if (!cq || !pq)
+       idx = srcu_read_lock(&fd->pq_srcu);
+       pq = srcu_dereference(fd->pq, &fd->pq_srcu);
+       if (!cq || !pq) {
+               srcu_read_unlock(&fd->pq_srcu, idx);
                 return -EIO;
+       }
  
-       if (!iter_is_iovec(from) || !dim)
+       if (!iter_is_iovec(from) || !dim) {
+               srcu_read_unlock(&fd->pq_srcu, idx);
                 return -EINVAL;
+       }
  
         trace_hfi1_sdma_request(fd->dd, fd->uctxt->ctxt, fd->subctxt, dim);
  
-       if (atomic_read(&pq->n_reqs) == pq->n_max_reqs)
+       if (atomic_read(&pq->n_reqs) == pq->n_max_reqs) {
+               srcu_read_unlock(&fd->pq_srcu, idx);
                 return -ENOSPC;
+       }
  
         while (dim) {
                 int ret;
@@ -333,6 +343,7 @@ static ssize_t hfi1_write_iter(struct kiocb *kiocb, struct iov_iter *from)
                 reqs++;
         }
  
+       srcu_read_unlock(&fd->pq_srcu, idx);
         return reqs;
  }
  
@@ -707,6 +718,7 @@ done:
         if (atomic_dec_and_test(&dd->user_refcount))
                 complete(&dd->user_comp);
  
+       cleanup_srcu_struct(&fdata->pq_srcu);
         kfree(fdata);
         return 0;
  }
diff --git a/drivers/infiniband/hw/hfi1/hfi.h b/drivers/infiniband/hw/hfi1/hfi.h

index 6365e8f..cae12f4 100644 (file)
--- a/drivers/infiniband/hw/hfi1/hfi.h
+++ b/drivers/infiniband/hw/hfi1/hfi.h
@@ -1444,10 +1444,13 @@ struct mmu_rb_handler;
  
  /* Private data for file operations */
  struct hfi1_filedata {
+       struct srcu_struct pq_srcu;
         struct hfi1_devdata *dd;
         struct hfi1_ctxtdata *uctxt;
         struct hfi1_user_sdma_comp_q *cq;
-       struct hfi1_user_sdma_pkt_q *pq;
+       /* update side lock for SRCU */
+       spinlock_t pq_rcu_lock;
+       struct hfi1_user_sdma_pkt_q __rcu *pq;
         u16 subctxt;
         /* for cpu affinity; -1 if none */
         int rec_cpu_num;
diff --git a/drivers/infiniband/hw/hfi1/user_exp_rcv.c b/drivers/infiniband/hw/hfi1/user_exp_rcv.c

index f05742a..4da03f8 100644 (file)
--- a/drivers/infiniband/hw/hfi1/user_exp_rcv.c
+++ b/drivers/infiniband/hw/hfi1/user_exp_rcv.c
@@ -87,9 +87,6 @@ int hfi1_user_exp_rcv_init(struct hfi1_filedata *fd,
  {
         int ret = 0;
  
-       spin_lock_init(&fd->tid_lock);
-       spin_lock_init(&fd->invalid_lock);
-
         fd->entry_to_rb = kcalloc(uctxt->expected_count,
                                   sizeof(struct rb_node *),
                                   GFP_KERNEL);
@@ -142,10 +139,12 @@ void hfi1_user_exp_rcv_free(struct hfi1_filedata *fd)
  {
         struct hfi1_ctxtdata *uctxt = fd->uctxt;
  
+       mutex_lock(&uctxt->exp_mutex);
         if (!EXP_TID_SET_EMPTY(uctxt->tid_full_list))
                 unlock_exp_tids(uctxt, &uctxt->tid_full_list, fd);
         if (!EXP_TID_SET_EMPTY(uctxt->tid_used_list))
                 unlock_exp_tids(uctxt, &uctxt->tid_used_list, fd);
+       mutex_unlock(&uctxt->exp_mutex);
  
         kfree(fd->invalid_tids);
         fd->invalid_tids = NULL;
diff --git a/drivers/infiniband/hw/hfi1/user_sdma.c b/drivers/infiniband/hw/hfi1/user_sdma.c

index fd754a1..c2f0d9b 100644 (file)
--- a/drivers/infiniband/hw/hfi1/user_sdma.c
+++ b/drivers/infiniband/hw/hfi1/user_sdma.c
@@ -179,7 +179,6 @@ int hfi1_user_sdma_alloc_queues(struct hfi1_ctxtdata *uctxt,
         pq = kzalloc(sizeof(*pq), GFP_KERNEL);
         if (!pq)
                 return -ENOMEM;
-
         pq->dd = dd;
         pq->ctxt = uctxt->ctxt;
         pq->subctxt = fd->subctxt;
@@ -236,7 +235,7 @@ int hfi1_user_sdma_alloc_queues(struct hfi1_ctxtdata *uctxt,
                 goto pq_mmu_fail;
         }
  
-       fd->pq = pq;
+       rcu_assign_pointer(fd->pq, pq);
         fd->cq = cq;
  
         return 0;
@@ -264,8 +263,14 @@ int hfi1_user_sdma_free_queues(struct hfi1_filedata *fd,
  
         trace_hfi1_sdma_user_free_queues(uctxt->dd, uctxt->ctxt, fd->subctxt);
  
-       pq = fd->pq;
+       spin_lock(&fd->pq_rcu_lock);
+       pq = srcu_dereference_check(fd->pq, &fd->pq_srcu,
+                                   lockdep_is_held(&fd->pq_rcu_lock));
         if (pq) {
+               rcu_assign_pointer(fd->pq, NULL);
+               spin_unlock(&fd->pq_rcu_lock);
+               synchronize_srcu(&fd->pq_srcu);
+               /* at this point there can be no more new requests */
                 if (pq->handler)
                         hfi1_mmu_rb_unregister(pq->handler);
                 iowait_sdma_drain(&pq->busy);
@@ -277,7 +282,8 @@ int hfi1_user_sdma_free_queues(struct hfi1_filedata *fd,
                 kfree(pq->req_in_use);
                 kmem_cache_destroy(pq->txreq_cache);
                 kfree(pq);
-               fd->pq = NULL;
+       } else {
+               spin_unlock(&fd->pq_rcu_lock);
         }
         if (fd->cq) {
                 vfree(fd->cq->comps);
@@ -321,7 +327,8 @@ int hfi1_user_sdma_process_request(struct hfi1_filedata *fd,
  {
         int ret = 0, i;
         struct hfi1_ctxtdata *uctxt = fd->uctxt;
-       struct hfi1_user_sdma_pkt_q *pq = fd->pq;
+       struct hfi1_user_sdma_pkt_q *pq =
+               srcu_dereference(fd->pq, &fd->pq_srcu);
         struct hfi1_user_sdma_comp_q *cq = fd->cq;
         struct hfi1_devdata *dd = pq->dd;
         unsigned long idx = 0;
diff --git a/drivers/infiniband/hw/mlx5/devx.c b/drivers/infiniband/hw/mlx5/devx.c

index d7efc9f..46e1ab7 100644 (file)
--- a/drivers/infiniband/hw/mlx5/devx.c
+++ b/drivers/infiniband/hw/mlx5/devx.c
@@ -2319,14 +2319,12 @@ static int deliver_event(struct devx_event_subscription *event_sub,
  
         if (ev_file->omit_data) {
                 spin_lock_irqsave(&ev_file->lock, flags);
-               if (!list_empty(&event_sub->event_list)) {
+               if (!list_empty(&event_sub->event_list) ||
+                   ev_file->is_destroyed) {
                         spin_unlock_irqrestore(&ev_file->lock, flags);
                         return 0;
                 }
  
-               /* is_destroyed is ignored here because we don't have any memory
-                * allocation to clean up for the omit_data case
-                */
                 list_add_tail(&event_sub->event_list, &ev_file->event_list);
                 spin_unlock_irqrestore(&ev_file->lock, flags);
                 wake_up_interruptible(&ev_file->poll_wait);
@@ -2473,11 +2471,11 @@ static ssize_t devx_async_cmd_event_read(struct file *filp, char __user *buf,
                         return -ERESTARTSYS;
                 }
  
-               if (list_empty(&ev_queue->event_list) &&
-                   ev_queue->is_destroyed)
-                       return -EIO;
-
                 spin_lock_irq(&ev_queue->lock);
+               if (ev_queue->is_destroyed) {
+                       spin_unlock_irq(&ev_queue->lock);
+                       return -EIO;
+               }
         }
  
         event = list_entry(ev_queue->event_list.next,
@@ -2551,10 +2549,6 @@ static ssize_t devx_async_event_read(struct file *filp, char __user *buf,
                 return -EOVERFLOW;
         }
  
-       if (ev_file->is_destroyed) {
-               spin_unlock_irq(&ev_file->lock);
-               return -EIO;
-       }
  
         while (list_empty(&ev_file->event_list)) {
                 spin_unlock_irq(&ev_file->lock);
@@ -2667,8 +2661,10 @@ static int devx_async_cmd_event_destroy_uobj(struct ib_uobject *uobj,
  
         spin_lock_irq(&comp_ev_file->ev_queue.lock);
         list_for_each_entry_safe(entry, tmp,
-                                &comp_ev_file->ev_queue.event_list, list)
+                                &comp_ev_file->ev_queue.event_list, list) {
+               list_del(&entry->list);
                 kvfree(entry);
+       }
         spin_unlock_irq(&comp_ev_file->ev_queue.lock);
         return 0;
  };
@@ -2680,11 +2676,29 @@ static int devx_async_event_destroy_uobj(struct ib_uobject *uobj,
                 container_of(uobj, struct devx_async_event_file,
                              uobj);
         struct devx_event_subscription *event_sub, *event_sub_tmp;
-       struct devx_async_event_data *entry, *tmp;
         struct mlx5_ib_dev *dev = ev_file->dev;
  
         spin_lock_irq(&ev_file->lock);
         ev_file->is_destroyed = 1;
+
+       /* free the pending events allocation */
+       if (ev_file->omit_data) {
+               struct devx_event_subscription *event_sub, *tmp;
+
+               list_for_each_entry_safe(event_sub, tmp, &ev_file->event_list,
+                                        event_list)
+                       list_del_init(&event_sub->event_list);
+
+       } else {
+               struct devx_async_event_data *entry, *tmp;
+
+               list_for_each_entry_safe(entry, tmp, &ev_file->event_list,
+                                        list) {
+                       list_del(&entry->list);
+                       kfree(entry);
+               }
+       }
+
         spin_unlock_irq(&ev_file->lock);
         wake_up_interruptible(&ev_file->poll_wait);
  
@@ -2699,15 +2713,6 @@ static int devx_async_event_destroy_uobj(struct ib_uobject *uobj,
         }
         mutex_unlock(&dev->devx_event_table.event_xa_lock);
  
-       /* free the pending events allocation */
-       if (!ev_file->omit_data) {
-               spin_lock_irq(&ev_file->lock);
-               list_for_each_entry_safe(entry, tmp,
-                                        &ev_file->event_list, list)
-                       kfree(entry); /* read can't come any more */
-               spin_unlock_irq(&ev_file->lock);
-       }
-
         put_device(&dev->ib_dev.dev);
         return 0;
  };
diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c

index e874d68..e4bcfa8 100644 (file)
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -2283,8 +2283,8 @@ static int mlx5_ib_mmap_offset(struct mlx5_ib_dev *dev,
  
  static u64 mlx5_entry_to_mmap_offset(struct mlx5_user_mmap_entry *entry)
  {
-       u16 cmd = entry->rdma_entry.start_pgoff >> 16;
-       u16 index = entry->rdma_entry.start_pgoff & 0xFFFF;
+       u64 cmd = (entry->rdma_entry.start_pgoff >> 16) & 0xFFFF;
+       u64 index = entry->rdma_entry.start_pgoff & 0xFFFF;
  
         return (((index >> 8) << 16) | (cmd << MLX5_IB_MMAP_CMD_SHIFT) |
                 (index & 0xFF)) << PAGE_SHIFT;
@@ -6545,7 +6545,7 @@ static int mlx5_ib_init_var_table(struct mlx5_ib_dev *dev)
                                         doorbell_bar_offset);
         bar_size = (1ULL << log_doorbell_bar_size) * 4096;
         var_table->stride_size = 1ULL << log_doorbell_stride;
-       var_table->num_var_hw_entries = bar_size / var_table->stride_size;
+       var_table->num_var_hw_entries = div64_u64(bar_size, var_table->stride_size);
         mutex_init(&var_table->bitmap_lock);
         var_table->bitmap = bitmap_zalloc(var_table->num_var_hw_entries,
                                           GFP_KERNEL);
diff --git a/drivers/infiniband/hw/mlx5/qp.c b/drivers/infiniband/hw/mlx5/qp.c

index a4f8e70..957f3a5 100644 (file)
--- a/drivers/infiniband/hw/mlx5/qp.c
+++ b/drivers/infiniband/hw/mlx5/qp.c
@@ -3441,9 +3441,6 @@ static int __mlx5_ib_qp_set_counter(struct ib_qp *qp,
         struct mlx5_ib_qp_base *base;
         u32 set_id;
  
-       if (!MLX5_CAP_GEN(dev->mdev, rts2rts_qp_counters_set_id))
-               return 0;
-
         if (counter)
                 set_id = counter->id;
         else
@@ -6576,6 +6573,7 @@ void mlx5_ib_drain_rq(struct ib_qp *qp)
   */
  int mlx5_ib_qp_set_counter(struct ib_qp *qp, struct rdma_counter *counter)
  {
+       struct mlx5_ib_dev *dev = to_mdev(qp->device);
         struct mlx5_ib_qp *mqp = to_mqp(qp);
         int err = 0;
  
@@ -6585,6 +6583,11 @@ int mlx5_ib_qp_set_counter(struct ib_qp *qp, struct rdma_counter *counter)
                 goto out;
         }
  
+       if (!MLX5_CAP_GEN(dev->mdev, rts2rts_qp_counters_set_id)) {
+               err = -EOPNOTSUPP;
+               goto out;
+       }
+
         if (mqp->state == IB_QPS_RTS) {
                 err = __mlx5_ib_qp_set_counter(qp, counter);
                 if (!err)
diff --git a/drivers/infiniband/sw/rdmavt/qp.c b/drivers/infiniband/sw/rdmavt/qp.c

index 3cdf75d..7858d49 100644 (file)
--- a/drivers/infiniband/sw/rdmavt/qp.c
+++ b/drivers/infiniband/sw/rdmavt/qp.c
@@ -61,6 +61,8 @@
  #define RVT_RWQ_COUNT_THRESHOLD 16
  
  static void rvt_rc_timeout(struct timer_list *t);
+static void rvt_reset_qp(struct rvt_dev_info *rdi, struct rvt_qp *qp,
+                        enum ib_qp_type type);
  
  /*
   * Convert the AETH RNR timeout code into the number of microseconds.
@@ -452,40 +454,41 @@ no_qp_table:
  }
  
  /**
- * free_all_qps - check for QPs still in use
+ * rvt_free_qp_cb - callback function to reset a qp
+ * @qp: the qp to reset
+ * @v: a 64-bit value
+ *
+ * This function resets the qp and removes it from the
+ * qp hash table.
+ */
+static void rvt_free_qp_cb(struct rvt_qp *qp, u64 v)
+{
+       unsigned int *qp_inuse = (unsigned int *)v;
+       struct rvt_dev_info *rdi = ib_to_rvt(qp->ibqp.device);
+
+       /* Reset the qp and remove it from the qp hash list */
+       rvt_reset_qp(rdi, qp, qp->ibqp.qp_type);
+
+       /* Increment the qp_inuse count */
+       (*qp_inuse)++;
+}
+
+/**
+ * rvt_free_all_qps - check for QPs still in use
   * @rdi: rvt device info structure
   *
   * There should not be any QPs still in use.
   * Free memory for table.
+ * Return the number of QPs still in use.
   */
  static unsigned rvt_free_all_qps(struct rvt_dev_info *rdi)
  {
-       unsigned long flags;
-       struct rvt_qp *qp;
-       unsigned n, qp_inuse = 0;
-       spinlock_t *ql; /* work around too long line below */
-
-       if (rdi->driver_f.free_all_qps)
-               qp_inuse = rdi->driver_f.free_all_qps(rdi);
+       unsigned int qp_inuse = 0;
  
         qp_inuse += rvt_mcast_tree_empty(rdi);
  
-       if (!rdi->qp_dev)
-               return qp_inuse;
-
-       ql = &rdi->qp_dev->qpt_lock;
-       spin_lock_irqsave(ql, flags);
-       for (n = 0; n < rdi->qp_dev->qp_table_size; n++) {
-               qp = rcu_dereference_protected(rdi->qp_dev->qp_table[n],
-                                              lockdep_is_held(ql));
-               RCU_INIT_POINTER(rdi->qp_dev->qp_table[n], NULL);
+       rvt_qp_iter(rdi, (u64)&qp_inuse, rvt_free_qp_cb);
  
-               for (; qp; qp = rcu_dereference_protected(qp->next,
-                                                         lockdep_is_held(ql)))
-                       qp_inuse++;
-       }
-       spin_unlock_irqrestore(ql, flags);
-       synchronize_rcu();
         return qp_inuse;
  }
  
@@ -902,14 +905,14 @@ static void rvt_init_qp(struct rvt_dev_info *rdi, struct rvt_qp *qp,
  }
  
  /**
- * rvt_reset_qp - initialize the QP state to the reset state
+ * _rvt_reset_qp - initialize the QP state to the reset state
   * @qp: the QP to reset
   * @type: the QP type
   *
   * r_lock, s_hlock, and s_lock are required to be held by the caller
   */
-static void rvt_reset_qp(struct rvt_dev_info *rdi, struct rvt_qp *qp,
-                        enum ib_qp_type type)
+static void _rvt_reset_qp(struct rvt_dev_info *rdi, struct rvt_qp *qp,
+                         enum ib_qp_type type)
         __must_hold(&qp->s_lock)
         __must_hold(&qp->s_hlock)
         __must_hold(&qp->r_lock)
@@ -955,6 +958,27 @@ static void rvt_reset_qp(struct rvt_dev_info *rdi, struct rvt_qp *qp,
         lockdep_assert_held(&qp->s_lock);
  }
  
+/**
+ * rvt_reset_qp - initialize the QP state to the reset state
+ * @rdi: the device info
+ * @qp: the QP to reset
+ * @type: the QP type
+ *
+ * This is the wrapper function to acquire the r_lock, s_hlock, and s_lock
+ * before calling _rvt_reset_qp().
+ */
+static void rvt_reset_qp(struct rvt_dev_info *rdi, struct rvt_qp *qp,
+                        enum ib_qp_type type)
+{
+       spin_lock_irq(&qp->r_lock);
+       spin_lock(&qp->s_hlock);
+       spin_lock(&qp->s_lock);
+       _rvt_reset_qp(rdi, qp, type);
+       spin_unlock(&qp->s_lock);
+       spin_unlock(&qp->s_hlock);
+       spin_unlock_irq(&qp->r_lock);
+}
+
  /** rvt_free_qpn - Free a qpn from the bit map
   * @qpt: QP table
   * @qpn: queue pair number to free
@@ -1546,7 +1570,7 @@ int rvt_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr,
         switch (new_state) {
         case IB_QPS_RESET:
                 if (qp->state != IB_QPS_RESET)
-                       rvt_reset_qp(rdi, qp, ibqp->qp_type);
+                       _rvt_reset_qp(rdi, qp, ibqp->qp_type);
                 break;
  
         case IB_QPS_RTR:
@@ -1695,13 +1719,7 @@ int rvt_destroy_qp(struct ib_qp *ibqp, struct ib_udata *udata)
         struct rvt_qp *qp = ibqp_to_rvtqp(ibqp);
         struct rvt_dev_info *rdi = ib_to_rvt(ibqp->device);
  
-       spin_lock_irq(&qp->r_lock);
-       spin_lock(&qp->s_hlock);
-       spin_lock(&qp->s_lock);
         rvt_reset_qp(rdi, qp, ibqp->qp_type);
-       spin_unlock(&qp->s_lock);
-       spin_unlock(&qp->s_hlock);
-       spin_unlock_irq(&qp->r_lock);
  
         wait_event(qp->wait, !atomic_read(&qp->refcount));
         /* qpn is now available for use again */
diff --git a/drivers/infiniband/sw/rxe/rxe_comp.c b/drivers/infiniband/sw/rxe/rxe_comp.c

index 116cafc..4bc8870 100644 (file)
--- a/drivers/infiniband/sw/rxe/rxe_comp.c
+++ b/drivers/infiniband/sw/rxe/rxe_comp.c
@@ -329,7 +329,7 @@ static inline enum comp_state check_ack(struct rxe_qp *qp,
                                         qp->comp.psn = pkt->psn;
                                         if (qp->req.wait_psn) {
                                                 qp->req.wait_psn = 0;
-                                               rxe_run_task(&qp->req.task, 1);
+                                               rxe_run_task(&qp->req.task, 0);
                                         }
                                 }
                                 return COMPST_ERROR_RETRY;
@@ -463,7 +463,7 @@ static void do_complete(struct rxe_qp *qp, struct rxe_send_wqe *wqe)
          */
         if (qp->req.wait_fence) {
                 qp->req.wait_fence = 0;
-               rxe_run_task(&qp->req.task, 1);
+               rxe_run_task(&qp->req.task, 0);
         }
  }
  
@@ -479,7 +479,7 @@ static inline enum comp_state complete_ack(struct rxe_qp *qp,
                 if (qp->req.need_rd_atomic) {
                         qp->comp.timeout_retry = 0;
                         qp->req.need_rd_atomic = 0;
-                       rxe_run_task(&qp->req.task, 1);
+                       rxe_run_task(&qp->req.task, 0);
                 }
         }
  
@@ -725,7 +725,7 @@ int rxe_completer(void *arg)
                                                         RXE_CNT_COMP_RETRY);
                                         qp->req.need_retry = 1;
                                         qp->comp.started_retry = 1;
-                                       rxe_run_task(&qp->req.task, 1);
+                                       rxe_run_task(&qp->req.task, 0);
                                 }
  
                                 if (pkt) {
diff --git a/drivers/infiniband/sw/siw/siw_cm.c b/drivers/infiniband/sw/siw/siw_cm.c

index 0c3f058..c5651a9 100644 (file)
--- a/drivers/infiniband/sw/siw/siw_cm.c
+++ b/drivers/infiniband/sw/siw/siw_cm.c
@@ -1225,10 +1225,9 @@ static void siw_cm_llp_data_ready(struct sock *sk)
         read_lock(&sk->sk_callback_lock);
  
         cep = sk_to_cep(sk);
-       if (!cep) {
-               WARN_ON(1);
+       if (!cep)
                 goto out;
-       }
+
         siw_dbg_cep(cep, "state: %d\n", cep->state);
  
         switch (cep->state) {
diff --git a/drivers/input/keyboard/goldfish_events.c b/drivers/input/keyboard/goldfish_events.c

index bc8c85a..57d435f 100644 (file)
--- a/drivers/input/keyboard/goldfish_events.c
+++ b/drivers/input/keyboard/goldfish_events.c
@@ -30,7 +30,7 @@ struct event_dev {
         struct input_dev *input;
         int irq;
         void __iomem *addr;
-       char name[0];
+       char name[];
  };
  
  static irqreturn_t events_interrupt(int irq, void *dev_id)
diff --git a/drivers/input/keyboard/gpio_keys.c b/drivers/input/keyboard/gpio_keys.c

index 1f56d53..53c9ff3 100644 (file)
--- a/drivers/input/keyboard/gpio_keys.c
+++ b/drivers/input/keyboard/gpio_keys.c
@@ -55,7 +55,7 @@ struct gpio_keys_drvdata {
         struct input_dev *input;
         struct mutex disable_lock;
         unsigned short *keymap;
-       struct gpio_button_data data[0];
+       struct gpio_button_data data[];
  };
  
  /*
diff --git a/drivers/input/keyboard/gpio_keys_polled.c b/drivers/input/keyboard/gpio_keys_polled.c

index 6eb0a2f..c3937d2 100644 (file)
--- a/drivers/input/keyboard/gpio_keys_polled.c
+++ b/drivers/input/keyboard/gpio_keys_polled.c
@@ -38,7 +38,7 @@ struct gpio_keys_polled_dev {
         const struct gpio_keys_platform_data *pdata;
         unsigned long rel_axis_seen[BITS_TO_LONGS(REL_CNT)];
         unsigned long abs_axis_seen[BITS_TO_LONGS(ABS_CNT)];
-       struct gpio_keys_button_data data[0];
+       struct gpio_keys_button_data data[];
  };
  
  static void gpio_keys_button_event(struct input_dev *input,
diff --git a/drivers/input/keyboard/tca6416-keypad.c b/drivers/input/keyboard/tca6416-keypad.c

index 2a14769..2175876 100644 (file)
--- a/drivers/input/keyboard/tca6416-keypad.c
+++ b/drivers/input/keyboard/tca6416-keypad.c
@@ -33,7 +33,7 @@ MODULE_DEVICE_TABLE(i2c, tca6416_id);
  
  struct tca6416_drv_data {
         struct input_dev *input;
-       struct tca6416_button data[0];
+       struct tca6416_button data[];
  };
  
  struct tca6416_keypad_chip {
@@ -48,7 +48,7 @@ struct tca6416_keypad_chip {
         int irqnum;
         u16 pinmask;
         bool use_polling;
-       struct tca6416_button buttons[0];
+       struct tca6416_button buttons[];
  };
  
  static int tca6416_write_reg(struct tca6416_keypad_chip *chip, int reg, u16 val)
diff --git a/drivers/input/mouse/cyapa_gen5.c b/drivers/input/mouse/cyapa_gen5.c

index 14239fb..7f012bf 100644 (file)
--- a/drivers/input/mouse/cyapa_gen5.c
+++ b/drivers/input/mouse/cyapa_gen5.c
@@ -250,7 +250,7 @@ struct cyapa_tsg_bin_image_data_record {
  
  struct cyapa_tsg_bin_image {
         struct cyapa_tsg_bin_image_head image_head;
-       struct cyapa_tsg_bin_image_data_record records[0];
+       struct cyapa_tsg_bin_image_data_record records[];
  } __packed;
  
  struct pip_bl_packet_start {
@@ -271,7 +271,7 @@ struct pip_bl_cmd_head {
         u8 report_id;  /* Bootloader output report id, must be 40h */
         u8 rsvd;  /* Reserved, must be 0 */
         struct pip_bl_packet_start packet_start;
-       u8 data[0];  /* Command data variable based on commands */
+       u8 data[];  /* Command data variable based on commands */
  } __packed;
  
  /* Initiate bootload command data structure. */
@@ -300,7 +300,7 @@ struct tsg_bl_metadata_row_params {
  struct tsg_bl_flash_row_head {
         u8 flash_array_id;
         __le16 flash_row_id;
-       u8 flash_data[0];
+       u8 flash_data[];
  } __packed;
  
  struct pip_app_cmd_head {
@@ -314,7 +314,7 @@ struct pip_app_cmd_head {
          * Bit 6-0: command code.
          */
         u8 cmd_code;
-       u8 parameter_data[0];  /* Parameter data variable based on cmd_code */
+       u8 parameter_data[];  /* Parameter data variable based on cmd_code */
  } __packed;
  
  /* Application get/set parameter command data structure */
diff --git a/drivers/input/mouse/psmouse-smbus.c b/drivers/input/mouse/psmouse-smbus.c

index 027efdd..a472489 100644 (file)
--- a/drivers/input/mouse/psmouse-smbus.c
+++ b/drivers/input/mouse/psmouse-smbus.c
@@ -190,6 +190,7 @@ static int psmouse_smbus_create_companion(struct device *dev, void *data)
         struct psmouse_smbus_dev *smbdev = data;
         unsigned short addr_list[] = { smbdev->board.addr, I2C_CLIENT_END };
         struct i2c_adapter *adapter;
+       struct i2c_client *client;
  
         adapter = i2c_verify_adapter(dev);
         if (!adapter)
@@ -198,12 +199,13 @@ static int psmouse_smbus_create_companion(struct device *dev, void *data)
         if (!i2c_check_functionality(adapter, I2C_FUNC_SMBUS_HOST_NOTIFY))
                 return 0;
  
-       smbdev->client = i2c_new_probed_device(adapter, &smbdev->board,
-                                              addr_list, NULL);
-       if (!smbdev->client)
+       client = i2c_new_scanned_device(adapter, &smbdev->board,
+                                       addr_list, NULL);
+       if (IS_ERR(client))
                 return 0;
  
         /* We have our(?) device, stop iterating i2c bus. */
+       smbdev->client = client;
         return 1;
  }
  
diff --git a/drivers/input/mouse/synaptics.c b/drivers/input/mouse/synaptics.c

index 1ae6f8b..2c666fb 100644 (file)
--- a/drivers/input/mouse/synaptics.c
+++ b/drivers/input/mouse/synaptics.c
@@ -146,7 +146,6 @@ static const char * const topbuttonpad_pnp_ids[] = {
         "LEN0042", /* Yoga */
         "LEN0045",
         "LEN0047",
-       "LEN0049",
         "LEN2000", /* S540 */
         "LEN2001", /* Edge E431 */
         "LEN2002", /* Edge E531 */
@@ -166,9 +165,11 @@ static const char * const smbus_pnp_ids[] = {
         /* all of the topbuttonpad_pnp_ids are valid, we just add some extras */
         "LEN0048", /* X1 Carbon 3 */
         "LEN0046", /* X250 */
+       "LEN0049", /* Yoga 11e */
         "LEN004a", /* W541 */
         "LEN005b", /* P50 */
         "LEN005e", /* T560 */
+       "LEN006c", /* T470s */
         "LEN0071", /* T480 */
         "LEN0072", /* X1 Carbon Gen 5 (2017) - Elan/ALPS trackpoint */
         "LEN0073", /* X1 Carbon G5 (Elantech) */
@@ -179,6 +180,7 @@ static const char * const smbus_pnp_ids[] = {
         "LEN0097", /* X280 -> ALPS trackpoint */
         "LEN009b", /* T580 */
         "LEN200f", /* T450s */
+       "LEN2044", /* L470  */
         "LEN2054", /* E480 */
         "LEN2055", /* E580 */
         "SYN3052", /* HP EliteBook 840 G4 */
diff --git a/drivers/input/touchscreen/ili210x.c b/drivers/input/touchscreen/ili210x.c

index 4a17096..199cf3d 100644 (file)
--- a/drivers/input/touchscreen/ili210x.c
+++ b/drivers/input/touchscreen/ili210x.c
@@ -167,6 +167,36 @@ static const struct ili2xxx_chip ili211x_chip = {
         .resolution             = 2048,
  };
  
+static bool ili212x_touchdata_to_coords(const u8 *touchdata,
+                                       unsigned int finger,
+                                       unsigned int *x, unsigned int *y)
+{
+       u16 val;
+
+       val = get_unaligned_be16(touchdata + 3 + (finger * 5) + 0);
+       if (!(val & BIT(15)))   /* Touch indication */
+               return false;
+
+       *x = val & 0x3fff;
+       *y = get_unaligned_be16(touchdata + 3 + (finger * 5) + 2);
+
+       return true;
+}
+
+static bool ili212x_check_continue_polling(const u8 *data, bool touch)
+{
+       return touch;
+}
+
+static const struct ili2xxx_chip ili212x_chip = {
+       .read_reg               = ili210x_read_reg,
+       .get_touch_data         = ili210x_read_touch_data,
+       .parse_touch_data       = ili212x_touchdata_to_coords,
+       .continue_polling       = ili212x_check_continue_polling,
+       .max_touches            = 10,
+       .has_calibrate_reg      = true,
+};
+
  static int ili251x_read_reg(struct i2c_client *client,
                             u8 reg, void *buf, size_t len)
  {
@@ -321,7 +351,7 @@ static umode_t ili210x_calibrate_visible(struct kobject *kobj,
         struct i2c_client *client = to_i2c_client(dev);
         struct ili210x *priv = i2c_get_clientdata(client);
  
-       return priv->chip->has_calibrate_reg;
+       return priv->chip->has_calibrate_reg ? attr->mode : 0;
  }
  
  static const struct attribute_group ili210x_attr_group = {
@@ -447,6 +477,7 @@ static int ili210x_i2c_probe(struct i2c_client *client,
  static const struct i2c_device_id ili210x_i2c_id[] = {
         { "ili210x", (long)&ili210x_chip },
         { "ili2117", (long)&ili211x_chip },
+       { "ili2120", (long)&ili212x_chip },
         { "ili251x", (long)&ili251x_chip },
         { }
  };
@@ -455,6 +486,7 @@ MODULE_DEVICE_TABLE(i2c, ili210x_i2c_id);
  static const struct of_device_id ili210x_dt_ids[] = {
         { .compatible = "ilitek,ili210x", .data = &ili210x_chip },
         { .compatible = "ilitek,ili2117", .data = &ili211x_chip },
+       { .compatible = "ilitek,ili2120", .data = &ili212x_chip },
         { .compatible = "ilitek,ili251x", .data = &ili251x_chip },
         { }
  };
diff --git a/drivers/md/bcache/alloc.c b/drivers/md/bcache/alloc.c

index a1df0d9..8bc1faf 100644 (file)
--- a/drivers/md/bcache/alloc.c
+++ b/drivers/md/bcache/alloc.c
@@ -67,6 +67,7 @@
  #include <linux/blkdev.h>
  #include <linux/kthread.h>
  #include <linux/random.h>
+#include <linux/sched/signal.h>
  #include <trace/events/bcache.h>
  
  #define MAX_OPEN_BUCKETS 128
@@ -733,8 +734,21 @@ int bch_open_buckets_alloc(struct cache_set *c)
  
  int bch_cache_allocator_start(struct cache *ca)
  {
-       struct task_struct *k = kthread_run(bch_allocator_thread,
-                                           ca, "bcache_allocator");
+       struct task_struct *k;
+
+       /*
+        * In case previous btree check operation occupies too many
+        * system memory for bcache btree node cache, and the
+        * registering process is selected by OOM killer. Here just
+        * ignore the SIGKILL sent by OOM killer if there is, to
+        * avoid kthread_run() being failed by pending signals. The
+        * bcache registering process will exit after the registration
+        * done.
+        */
+       if (signal_pending(current))
+               flush_signals(current);
+
+       k = kthread_run(bch_allocator_thread, ca, "bcache_allocator");
         if (IS_ERR(k))
                 return PTR_ERR(k);
  
diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c

index fa872df..b12186c 100644 (file)
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -34,6 +34,7 @@
  #include <linux/random.h>
  #include <linux/rcupdate.h>
  #include <linux/sched/clock.h>
+#include <linux/sched/signal.h>
  #include <linux/rculist.h>
  #include <linux/delay.h>
  #include <trace/events/bcache.h>
@@ -1913,6 +1914,18 @@ static int bch_gc_thread(void *arg)
  
  int bch_gc_thread_start(struct cache_set *c)
  {
+       /*
+        * In case previous btree check operation occupies too many
+        * system memory for bcache btree node cache, and the
+        * registering process is selected by OOM killer. Here just
+        * ignore the SIGKILL sent by OOM killer if there is, to
+        * avoid kthread_run() being failed by pending signals. The
+        * bcache registering process will exit after the registration
+        * done.
+        */
+       if (signal_pending(current))
+               flush_signals(current);
+
         c->gc_thread = kthread_run(bch_gc_thread, c, "bcache_gc");
         return PTR_ERR_OR_ZERO(c->gc_thread);
  }
diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c

index 6730820..0e3ff97 100644 (file)
--- a/drivers/md/bcache/journal.c
+++ b/drivers/md/bcache/journal.c
@@ -417,8 +417,6 @@ err:
  
  /* Journalling */
  
-#define nr_to_fifo_front(p, front_p, mask)     (((p) - (front_p)) & (mask))
-
  static void btree_flush_write(struct cache_set *c)
  {
         struct btree *b, *t, *btree_nodes[BTREE_FLUSH_NR];
@@ -510,9 +508,8 @@ static void btree_flush_write(struct cache_set *c)
                  *   journal entry can be reclaimed). These selected nodes
                  *   will be ignored and skipped in the folowing for-loop.
                  */
-               if (nr_to_fifo_front(btree_current_write(b)->journal,
-                                    fifo_front_p,
-                                    mask) != 0) {
+               if (((btree_current_write(b)->journal - fifo_front_p) &
+                    mask) != 0) {
                         mutex_unlock(&b->write_lock);
                         continue;
                 }
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c

index 2749daf..0c3c541 100644 (file)
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -1917,23 +1917,6 @@ static int run_cache_set(struct cache_set *c)
                 if (bch_btree_check(c))
                         goto err;
  
-               /*
-                * bch_btree_check() may occupy too much system memory which
-                * has negative effects to user space application (e.g. data
-                * base) performance. Shrink the mca cache memory proactively
-                * here to avoid competing memory with user space workloads..
-                */
-               if (!c->shrinker_disabled) {
-                       struct shrink_control sc;
-
-                       sc.gfp_mask = GFP_KERNEL;
-                       sc.nr_to_scan = c->btree_cache_used * c->btree_pages;
-                       /* first run to clear b->accessed tag */
-                       c->shrink.scan_objects(&c->shrink, &sc);
-                       /* second run to reap non-accessed nodes */
-                       c->shrink.scan_objects(&c->shrink, &sc);
-               }
-
                 bch_journal_mark(c, &journal);
                 bch_initial_gc_finish(c);
                 pr_debug("btree_check() done");
diff --git a/drivers/net/dsa/mv88e6xxx/chip.h b/drivers/net/dsa/mv88e6xxx/chip.h

index f332cb4..79cad5e 100644 (file)
--- a/drivers/net/dsa/mv88e6xxx/chip.h
+++ b/drivers/net/dsa/mv88e6xxx/chip.h
@@ -236,7 +236,7 @@ struct mv88e6xxx_port {
         bool mirror_ingress;
         bool mirror_egress;
         unsigned int serdes_irq;
-       char serdes_irq_name[32];
+       char serdes_irq_name[64];
  };
  
  struct mv88e6xxx_chip {
@@ -293,16 +293,16 @@ struct mv88e6xxx_chip {
         struct mv88e6xxx_irq g1_irq;
         struct mv88e6xxx_irq g2_irq;
         int irq;
-       char irq_name[32];
+       char irq_name[64];
         int device_irq;
-       char device_irq_name[32];
+       char device_irq_name[64];
         int watchdog_irq;
-       char watchdog_irq_name[32];
+       char watchdog_irq_name[64];
  
         int atu_prob_irq;
-       char atu_prob_irq_name[32];
+       char atu_prob_irq_name[64];
         int vtu_prob_irq;
-       char vtu_prob_irq_name[32];
+       char vtu_prob_irq_name[64];
         struct kthread_worker *kworker;
         struct kthread_delayed_work irq_poll_work;
  
diff --git a/drivers/net/ethernet/amazon/ena/ena_com.c b/drivers/net/ethernet/amazon/ena/ena_com.c

index ea62604..1fb58f9 100644 (file)
--- a/drivers/net/ethernet/amazon/ena/ena_com.c
+++ b/drivers/net/ethernet/amazon/ena/ena_com.c
@@ -200,6 +200,11 @@ static void comp_ctxt_release(struct ena_com_admin_queue *queue,
  static struct ena_comp_ctx *get_comp_ctxt(struct ena_com_admin_queue *queue,
                                           u16 command_id, bool capture)
  {
+       if (unlikely(!queue->comp_ctx)) {
+               pr_err("Completion context is NULL\n");
+               return NULL;
+       }
+
         if (unlikely(command_id >= queue->q_depth)) {
                 pr_err("command id is larger than the queue size. cmd_id: %u queue size %d\n",
                        command_id, queue->q_depth);
@@ -1041,9 +1046,41 @@ static int ena_com_get_feature(struct ena_com_dev *ena_dev,
                                       feature_ver);
  }
  
+int ena_com_get_current_hash_function(struct ena_com_dev *ena_dev)
+{
+       return ena_dev->rss.hash_func;
+}
+
+static void ena_com_hash_key_fill_default_key(struct ena_com_dev *ena_dev)
+{
+       struct ena_admin_feature_rss_flow_hash_control *hash_key =
+               (ena_dev->rss).hash_key;
+
+       netdev_rss_key_fill(&hash_key->key, sizeof(hash_key->key));
+       /* The key is stored in the device in u32 array
+        * as well as the API requires the key to be passed in this
+        * format. Thus the size of our array should be divided by 4
+        */
+       hash_key->keys_num = sizeof(hash_key->key) / sizeof(u32);
+}
+
  static int ena_com_hash_key_allocate(struct ena_com_dev *ena_dev)
  {
         struct ena_rss *rss = &ena_dev->rss;
+       struct ena_admin_feature_rss_flow_hash_control *hash_key;
+       struct ena_admin_get_feat_resp get_resp;
+       int rc;
+
+       hash_key = (ena_dev->rss).hash_key;
+
+       rc = ena_com_get_feature_ex(ena_dev, &get_resp,
+                                   ENA_ADMIN_RSS_HASH_FUNCTION,
+                                   ena_dev->rss.hash_key_dma_addr,
+                                   sizeof(ena_dev->rss.hash_key), 0);
+       if (unlikely(rc)) {
+               hash_key = NULL;
+               return -EOPNOTSUPP;
+       }
  
         rss->hash_key =
                 dma_alloc_coherent(ena_dev->dmadev, sizeof(*rss->hash_key),
@@ -1254,30 +1291,6 @@ static int ena_com_ind_tbl_convert_to_device(struct ena_com_dev *ena_dev)
         return 0;
  }
  
-static int ena_com_ind_tbl_convert_from_device(struct ena_com_dev *ena_dev)
-{
-       u16 dev_idx_to_host_tbl[ENA_TOTAL_NUM_QUEUES] = { (u16)-1 };
-       struct ena_rss *rss = &ena_dev->rss;
-       u8 idx;
-       u16 i;
-
-       for (i = 0; i < ENA_TOTAL_NUM_QUEUES; i++)
-               dev_idx_to_host_tbl[ena_dev->io_sq_queues[i].idx] = i;
-
-       for (i = 0; i < 1 << rss->tbl_log_size; i++) {
-               if (rss->rss_ind_tbl[i].cq_idx > ENA_TOTAL_NUM_QUEUES)
-                       return -EINVAL;
-               idx = (u8)rss->rss_ind_tbl[i].cq_idx;
-
-               if (dev_idx_to_host_tbl[idx] > ENA_TOTAL_NUM_QUEUES)
-                       return -EINVAL;
-
-               rss->host_rss_ind_tbl[i] = dev_idx_to_host_tbl[idx];
-       }
-
-       return 0;
-}
-
  static void ena_com_update_intr_delay_resolution(struct ena_com_dev *ena_dev,
                                                  u16 intr_delay_resolution)
  {
@@ -2297,15 +2310,16 @@ int ena_com_fill_hash_function(struct ena_com_dev *ena_dev,
  
         switch (func) {
         case ENA_ADMIN_TOEPLITZ:
-               if (key_len > sizeof(hash_key->key)) {
-                       pr_err("key len (%hu) is bigger than the max supported (%zu)\n",
-                              key_len, sizeof(hash_key->key));
-                       return -EINVAL;
+               if (key) {
+                       if (key_len != sizeof(hash_key->key)) {
+                               pr_err("key len (%hu) doesn't equal the supported size (%zu)\n",
+                                      key_len, sizeof(hash_key->key));
+                               return -EINVAL;
+                       }
+                       memcpy(hash_key->key, key, key_len);
+                       rss->hash_init_val = init_val;
+                       hash_key->keys_num = key_len >> 2;
                 }
-
-               memcpy(hash_key->key, key, key_len);
-               rss->hash_init_val = init_val;
-               hash_key->keys_num = key_len >> 2;
                 break;
         case ENA_ADMIN_CRC32:
                 rss->hash_init_val = init_val;
@@ -2342,7 +2356,11 @@ int ena_com_get_hash_function(struct ena_com_dev *ena_dev,
         if (unlikely(rc))
                 return rc;
  
-       rss->hash_func = get_resp.u.flow_hash_func.selected_func;
+       /* ffs() returns 1 in case the lsb is set */
+       rss->hash_func = ffs(get_resp.u.flow_hash_func.selected_func);
+       if (rss->hash_func)
+               rss->hash_func--;
+
         if (func)
                 *func = rss->hash_func;
  
@@ -2606,10 +2624,6 @@ int ena_com_indirect_table_get(struct ena_com_dev *ena_dev, u32 *ind_tbl)
         if (!ind_tbl)
                 return 0;
  
-       rc = ena_com_ind_tbl_convert_from_device(ena_dev);
-       if (unlikely(rc))
-               return rc;
-
         for (i = 0; i < (1 << rss->tbl_log_size); i++)
                 ind_tbl[i] = rss->host_rss_ind_tbl[i];
  
@@ -2626,9 +2640,15 @@ int ena_com_rss_init(struct ena_com_dev *ena_dev, u16 indr_tbl_log_size)
         if (unlikely(rc))
                 goto err_indr_tbl;
  
+       /* The following function might return unsupported in case the
+        * device doesn't support setting the key / hash function. We can safely
+        * ignore this error and have indirection table support only.
+        */
         rc = ena_com_hash_key_allocate(ena_dev);
-       if (unlikely(rc))
+       if (unlikely(rc) && rc != -EOPNOTSUPP)
                 goto err_hash_key;
+       else if (rc != -EOPNOTSUPP)
+               ena_com_hash_key_fill_default_key(ena_dev);
  
         rc = ena_com_hash_ctrl_init(ena_dev);
         if (unlikely(rc))
diff --git a/drivers/net/ethernet/amazon/ena/ena_com.h b/drivers/net/ethernet/amazon/ena/ena_com.h

index 0ce37d5..469f298 100644 (file)
--- a/drivers/net/ethernet/amazon/ena/ena_com.h
+++ b/drivers/net/ethernet/amazon/ena/ena_com.h
@@ -44,6 +44,7 @@
  #include <linux/spinlock.h>
  #include <linux/types.h>
  #include <linux/wait.h>
+#include <linux/netdevice.h>
  
  #include "ena_common_defs.h"
  #include "ena_admin_defs.h"
@@ -655,6 +656,14 @@ int ena_com_rss_init(struct ena_com_dev *ena_dev, u16 log_size);
   */
  void ena_com_rss_destroy(struct ena_com_dev *ena_dev);
  
+/* ena_com_get_current_hash_function - Get RSS hash function
+ * @ena_dev: ENA communication layer struct
+ *
+ * Return the current hash function.
+ * @return: 0 or one of the ena_admin_hash_functions values.
+ */
+int ena_com_get_current_hash_function(struct ena_com_dev *ena_dev);
+
  /* ena_com_fill_hash_function - Fill RSS hash function
   * @ena_dev: ENA communication layer struct
   * @func: The hash function (Toeplitz or crc)
diff --git a/drivers/net/ethernet/amazon/ena/ena_ethtool.c b/drivers/net/ethernet/amazon/ena/ena_ethtool.c

index b4e891d..ced1d57 100644 (file)
--- a/drivers/net/ethernet/amazon/ena/ena_ethtool.c
+++ b/drivers/net/ethernet/amazon/ena/ena_ethtool.c
@@ -636,6 +636,28 @@ static u32 ena_get_rxfh_key_size(struct net_device *netdev)
         return ENA_HASH_KEY_SIZE;
  }
  
+static int ena_indirection_table_get(struct ena_adapter *adapter, u32 *indir)
+{
+       struct ena_com_dev *ena_dev = adapter->ena_dev;
+       int i, rc;
+
+       if (!indir)
+               return 0;
+
+       rc = ena_com_indirect_table_get(ena_dev, indir);
+       if (rc)
+               return rc;
+
+       /* Our internal representation of the indices is: even indices
+        * for Tx and uneven indices for Rx. We need to convert the Rx
+        * indices to be consecutive
+        */
+       for (i = 0; i < ENA_RX_RSS_TABLE_SIZE; i++)
+               indir[i] = ENA_IO_RXQ_IDX_TO_COMBINED_IDX(indir[i]);
+
+       return rc;
+}
+
  static int ena_get_rxfh(struct net_device *netdev, u32 *indir, u8 *key,
                         u8 *hfunc)
  {
@@ -644,11 +666,25 @@ static int ena_get_rxfh(struct net_device *netdev, u32 *indir, u8 *key,
         u8 func;
         int rc;
  
-       rc = ena_com_indirect_table_get(adapter->ena_dev, indir);
+       rc = ena_indirection_table_get(adapter, indir);
         if (rc)
                 return rc;
  
+       /* We call this function in order to check if the device
+        * supports getting/setting the hash function.
+        */
         rc = ena_com_get_hash_function(adapter->ena_dev, &ena_func, key);
+
+       if (rc) {
+               if (rc == -EOPNOTSUPP) {
+                       key = NULL;
+                       hfunc = NULL;
+                       rc = 0;
+               }
+
+               return rc;
+       }
+
         if (rc)
                 return rc;
  
@@ -657,7 +693,7 @@ static int ena_get_rxfh(struct net_device *netdev, u32 *indir, u8 *key,
                 func = ETH_RSS_HASH_TOP;
                 break;
         case ENA_ADMIN_CRC32:
-               func = ETH_RSS_HASH_XOR;
+               func = ETH_RSS_HASH_CRC32;
                 break;
         default:
                 netif_err(adapter, drv, netdev,
@@ -700,10 +736,13 @@ static int ena_set_rxfh(struct net_device *netdev, const u32 *indir,
         }
  
         switch (hfunc) {
+       case ETH_RSS_HASH_NO_CHANGE:
+               func = ena_com_get_current_hash_function(ena_dev);
+               break;
         case ETH_RSS_HASH_TOP:
                 func = ENA_ADMIN_TOEPLITZ;
                 break;
-       case ETH_RSS_HASH_XOR:
+       case ETH_RSS_HASH_CRC32:
                 func = ENA_ADMIN_CRC32;
                 break;
         default:
@@ -814,6 +853,7 @@ static const struct ethtool_ops ena_ethtool_ops = {
         .set_channels           = ena_set_channels,
         .get_tunable            = ena_get_tunable,
         .set_tunable            = ena_set_tunable,
+       .get_ts_info            = ethtool_op_get_ts_info,
  };
  
  void ena_set_ethtool_ops(struct net_device *netdev)
diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.c b/drivers/net/ethernet/amazon/ena/ena_netdev.c

index 894e8c1..0b2fd96 100644 (file)
--- a/drivers/net/ethernet/amazon/ena/ena_netdev.c
+++ b/drivers/net/ethernet/amazon/ena/ena_netdev.c
@@ -3706,8 +3706,8 @@ static void check_for_missing_keep_alive(struct ena_adapter *adapter)
         if (adapter->keep_alive_timeout == ENA_HW_HINTS_NO_TIMEOUT)
                 return;
  
-       keep_alive_expired = round_jiffies(adapter->last_keep_alive_jiffies +
-                                          adapter->keep_alive_timeout);
+       keep_alive_expired = adapter->last_keep_alive_jiffies +
+                            adapter->keep_alive_timeout;
         if (unlikely(time_is_before_jiffies(keep_alive_expired))) {
                 netif_err(adapter, drv, adapter->netdev,
                           "Keep alive watchdog timeout.\n");
@@ -3809,7 +3809,7 @@ static void ena_timer_service(struct timer_list *t)
         }
  
         /* Reset the timer */
-       mod_timer(&adapter->timer_service, jiffies + HZ);
+       mod_timer(&adapter->timer_service, round_jiffies(jiffies + HZ));
  }
  
  static int ena_calc_max_io_queue_num(struct pci_dev *pdev,
diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.h b/drivers/net/ethernet/amazon/ena/ena_netdev.h

index 094324f..8795e0b 100644 (file)
--- a/drivers/net/ethernet/amazon/ena/ena_netdev.h
+++ b/drivers/net/ethernet/amazon/ena/ena_netdev.h
@@ -130,6 +130,8 @@
  
  #define ENA_IO_TXQ_IDX(q)      (2 * (q))
  #define ENA_IO_RXQ_IDX(q)      (2 * (q) + 1)
+#define ENA_IO_TXQ_IDX_TO_COMBINED_IDX(q)      ((q) / 2)
+#define ENA_IO_RXQ_IDX_TO_COMBINED_IDX(q)      (((q) - 1) / 2)
  
  #define ENA_MGMNT_IRQ_IDX              0
  #define ENA_IO_IRQ_FIRST_IDX           1
diff --git a/drivers/net/ethernet/cisco/enic/enic_main.c b/drivers/net/ethernet/cisco/enic/enic_main.c

index bbd7b31..ddf60dc 100644 (file)
--- a/drivers/net/ethernet/cisco/enic/enic_main.c
+++ b/drivers/net/ethernet/cisco/enic/enic_main.c
@@ -2013,10 +2013,10 @@ static int enic_stop(struct net_device *netdev)
                 napi_disable(&enic->napi[i]);
  
         netif_carrier_off(netdev);
-       netif_tx_disable(netdev);
         if (vnic_dev_get_intr_mode(enic->vdev) == VNIC_DEV_INTR_MODE_MSIX)
                 for (i = 0; i < enic->wq_count; i++)
                         napi_disable(&enic->napi[enic_cq_wq(enic, i)]);
+       netif_tx_disable(netdev);
  
         if (!enic_is_dynamic(enic) && !enic_is_sriov_vf(enic))
                 enic_dev_del_station_addr(enic);
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c

index ec5f6ee..492bc94 100644 (file)
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
@@ -6113,6 +6113,9 @@ static int hclge_get_all_rules(struct hnae3_handle *handle,
  static void hclge_fd_get_flow_tuples(const struct flow_keys *fkeys,
                                      struct hclge_fd_rule_tuples *tuples)
  {
+#define flow_ip6_src fkeys->addrs.v6addrs.src.in6_u.u6_addr32
+#define flow_ip6_dst fkeys->addrs.v6addrs.dst.in6_u.u6_addr32
+
         tuples->ether_proto = be16_to_cpu(fkeys->basic.n_proto);
         tuples->ip_proto = fkeys->basic.ip_proto;
         tuples->dst_port = be16_to_cpu(fkeys->ports.dst);
@@ -6121,12 +6124,12 @@ static void hclge_fd_get_flow_tuples(const struct flow_keys *fkeys,
                 tuples->src_ip[3] = be32_to_cpu(fkeys->addrs.v4addrs.src);
                 tuples->dst_ip[3] = be32_to_cpu(fkeys->addrs.v4addrs.dst);
         } else {
-               memcpy(tuples->src_ip,
-                      fkeys->addrs.v6addrs.src.in6_u.u6_addr32,
-                      sizeof(tuples->src_ip));
-               memcpy(tuples->dst_ip,
-                      fkeys->addrs.v6addrs.dst.in6_u.u6_addr32,
-                      sizeof(tuples->dst_ip));
+               int i;
+
+               for (i = 0; i < IPV6_SIZE; i++) {
+                       tuples->src_ip[i] = be32_to_cpu(flow_ip6_src[i]);
+                       tuples->dst_ip[i] = be32_to_cpu(flow_ip6_dst[i]);
+               }
         }
  }
  
@@ -9834,6 +9837,13 @@ static int hclge_reset_ae_dev(struct hnae3_ae_dev *ae_dev)
                 return ret;
         }
  
+       ret = init_mgr_tbl(hdev);
+       if (ret) {
+               dev_err(&pdev->dev,
+                       "failed to reinit manager table, ret = %d\n", ret);
+               return ret;
+       }
+
         ret = hclge_init_fd_config(hdev);
         if (ret) {
                 dev_err(&pdev->dev, "fd table init fail, ret=%d\n", ret);
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.c

index 180224e..28db132 100644 (file)
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.c
@@ -566,7 +566,7 @@ static void hclge_tm_vport_tc_info_update(struct hclge_vport *vport)
          */
         kinfo->num_tc = vport->vport_id ? 1 :
                         min_t(u16, vport->alloc_tqps, hdev->tm_info.num_tc);
-       vport->qs_offset = (vport->vport_id ? hdev->tm_info.num_tc : 0) +
+       vport->qs_offset = (vport->vport_id ? HNAE3_MAX_TC : 0) +
                                 (vport->vport_id ? (vport->vport_id - 1) : 0);
  
         max_rss_size = min_t(u16, hdev->rss_size_max,
diff --git a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c

index 69523ac..56b9e44 100644 (file)
--- a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
@@ -2362,7 +2362,7 @@ static int i40e_vc_enable_queues_msg(struct i40e_vf *vf, u8 *msg)
                 goto error_param;
         }
  
-       if (i40e_vc_validate_vqs_bitmaps(vqs)) {
+       if (!i40e_vc_validate_vqs_bitmaps(vqs)) {
                 aq_ret = I40E_ERR_PARAM;
                 goto error_param;
         }
@@ -2424,7 +2424,7 @@ static int i40e_vc_disable_queues_msg(struct i40e_vf *vf, u8 *msg)
                 goto error_param;
         }
  
-       if (i40e_vc_validate_vqs_bitmaps(vqs)) {
+       if (!i40e_vc_validate_vqs_bitmaps(vqs)) {
                 aq_ret = I40E_ERR_PARAM;
                 goto error_param;
         }
diff --git a/drivers/net/ethernet/intel/ice/ice_adminq_cmd.h b/drivers/net/ethernet/intel/ice/ice_adminq_cmd.h

index 4459bc5..6873998 100644 (file)
--- a/drivers/net/ethernet/intel/ice/ice_adminq_cmd.h
+++ b/drivers/net/ethernet/intel/ice/ice_adminq_cmd.h
@@ -1660,6 +1660,7 @@ struct ice_aqc_get_pkg_info_resp {
         __le32 count;
         struct ice_aqc_get_pkg_info pkg_info[1];
  };
+
  /**
   * struct ice_aq_desc - Admin Queue (AQ) descriptor
   * @flags: ICE_AQ_FLAG_* flags
diff --git a/drivers/net/ethernet/intel/ice/ice_base.c b/drivers/net/ethernet/intel/ice/ice_base.c

index d8e975c..81885ef 100644 (file)
--- a/drivers/net/ethernet/intel/ice/ice_base.c
+++ b/drivers/net/ethernet/intel/ice/ice_base.c
@@ -324,7 +324,7 @@ int ice_setup_rx_ctx(struct ice_ring *ring)
                         if (err)
                                 return err;
  
-                       dev_info(&vsi->back->pdev->dev, "Registered XDP mem model MEM_TYPE_ZERO_COPY on Rx ring %d\n",
+                       dev_info(ice_pf_to_dev(vsi->back), "Registered XDP mem model MEM_TYPE_ZERO_COPY on Rx ring %d\n",
                                  ring->q_index);
                 } else {
                         ring->zca.free = NULL;
@@ -405,8 +405,7 @@ int ice_setup_rx_ctx(struct ice_ring *ring)
         /* Absolute queue number out of 2K needs to be passed */
         err = ice_write_rxq_ctx(hw, &rlan_ctx, pf_q);
         if (err) {
-               dev_err(&vsi->back->pdev->dev,
-                       "Failed to set LAN Rx queue context for absolute Rx queue %d error: %d\n",
+               dev_err(ice_pf_to_dev(vsi->back), "Failed to set LAN Rx queue context for absolute Rx queue %d error: %d\n",
                         pf_q, err);
                 return -EIO;
         }
@@ -428,8 +427,7 @@ int ice_setup_rx_ctx(struct ice_ring *ring)
               ice_alloc_rx_bufs_slow_zc(ring, ICE_DESC_UNUSED(ring)) :
               ice_alloc_rx_bufs(ring, ICE_DESC_UNUSED(ring));
         if (err)
-               dev_info(&vsi->back->pdev->dev,
-                        "Failed allocate some buffers on %sRx ring %d (pf_q %d)\n",
+               dev_info(ice_pf_to_dev(vsi->back), "Failed allocate some buffers on %sRx ring %d (pf_q %d)\n",
                          ring->xsk_umem ? "UMEM enabled " : "",
                          ring->q_index, pf_q);
  
@@ -490,8 +488,7 @@ int ice_vsi_ctrl_rx_ring(struct ice_vsi *vsi, bool ena, u16 rxq_idx)
         /* wait for the change to finish */
         ret = ice_pf_rxq_wait(pf, pf_q, ena);
         if (ret)
-               dev_err(ice_pf_to_dev(pf),
-                       "VSI idx %d Rx ring %d %sable timeout\n",
+               dev_err(ice_pf_to_dev(pf), "VSI idx %d Rx ring %d %sable timeout\n",
                         vsi->idx, pf_q, (ena ? "en" : "dis"));
  
         return ret;
@@ -506,20 +503,15 @@ int ice_vsi_ctrl_rx_ring(struct ice_vsi *vsi, bool ena, u16 rxq_idx)
   */
  int ice_vsi_alloc_q_vectors(struct ice_vsi *vsi)
  {
-       struct ice_pf *pf = vsi->back;
-       int v_idx = 0, num_q_vectors;
-       struct device *dev;
-       int err;
+       struct device *dev = ice_pf_to_dev(vsi->back);
+       int v_idx, err;
  
-       dev = ice_pf_to_dev(pf);
         if (vsi->q_vectors[0]) {
                 dev_dbg(dev, "VSI %d has existing q_vectors\n", vsi->vsi_num);
                 return -EEXIST;
         }
  
-       num_q_vectors = vsi->num_q_vectors;
-
-       for (v_idx = 0; v_idx < num_q_vectors; v_idx++) {
+       for (v_idx = 0; v_idx < vsi->num_q_vectors; v_idx++) {
                 err = ice_vsi_alloc_q_vector(vsi, v_idx);
                 if (err)
                         goto err_out;
@@ -648,8 +640,7 @@ ice_vsi_cfg_txq(struct ice_vsi *vsi, struct ice_ring *ring,
         status = ice_ena_vsi_txq(vsi->port_info, vsi->idx, tc, ring->q_handle,
                                  1, qg_buf, buf_len, NULL);
         if (status) {
-               dev_err(ice_pf_to_dev(pf),
-                       "Failed to set LAN Tx queue context, error: %d\n",
+               dev_err(ice_pf_to_dev(pf), "Failed to set LAN Tx queue context, error: %d\n",
                         status);
                 return -ENODEV;
         }
@@ -815,14 +806,12 @@ ice_vsi_stop_tx_ring(struct ice_vsi *vsi, enum ice_disq_rst_src rst_src,
          * queues at the hardware level anyway.
          */
         if (status == ICE_ERR_RESET_ONGOING) {
-               dev_dbg(&vsi->back->pdev->dev,
-                       "Reset in progress. LAN Tx queues already disabled\n");
+               dev_dbg(ice_pf_to_dev(vsi->back), "Reset in progress. LAN Tx queues already disabled\n");
         } else if (status == ICE_ERR_DOES_NOT_EXIST) {
-               dev_dbg(&vsi->back->pdev->dev,
-                       "LAN Tx queues do not exist, nothing to disable\n");
+               dev_dbg(ice_pf_to_dev(vsi->back), "LAN Tx queues do not exist, nothing to disable\n");
         } else if (status) {
-               dev_err(&vsi->back->pdev->dev,
-                       "Failed to disable LAN Tx queues, error: %d\n", status);
+               dev_err(ice_pf_to_dev(vsi->back), "Failed to disable LAN Tx queues, error: %d\n",
+                       status);
                 return -ENODEV;
         }
  
diff --git a/drivers/net/ethernet/intel/ice/ice_common.c b/drivers/net/ethernet/intel/ice/ice_common.c

index 0207e28..04d5db0 100644 (file)
--- a/drivers/net/ethernet/intel/ice/ice_common.c
+++ b/drivers/net/ethernet/intel/ice/ice_common.c
@@ -24,20 +24,6 @@ static enum ice_status ice_set_mac_type(struct ice_hw *hw)
         return 0;
  }
  
-/**
- * ice_dev_onetime_setup - Temporary HW/FW workarounds
- * @hw: pointer to the HW structure
- *
- * This function provides temporary workarounds for certain issues
- * that are expected to be fixed in the HW/FW.
- */
-void ice_dev_onetime_setup(struct ice_hw *hw)
-{
-#define MBX_PF_VT_PFALLOC      0x00231E80
-       /* set VFs per PF */
-       wr32(hw, MBX_PF_VT_PFALLOC, rd32(hw, PF_VT_PFALLOC_HIF));
-}
-
  /**
   * ice_clear_pf_cfg - Clear PF configuration
   * @hw: pointer to the hardware structure
@@ -602,10 +588,10 @@ void ice_output_fw_log(struct ice_hw *hw, struct ice_aq_desc *desc, void *buf)
  }
  
  /**
- * ice_get_itr_intrl_gran - determine int/intrl granularity
+ * ice_get_itr_intrl_gran
   * @hw: pointer to the HW struct
   *
- * Determines the ITR/intrl granularities based on the maximum aggregate
+ * Determines the ITR/INTRL granularities based on the maximum aggregate
   * bandwidth according to the device's configuration during power-on.
   */
  static void ice_get_itr_intrl_gran(struct ice_hw *hw)
@@ -763,8 +749,6 @@ enum ice_status ice_init_hw(struct ice_hw *hw)
         if (status)
                 goto err_unroll_sched;
  
-       ice_dev_onetime_setup(hw);
-
         /* Get MAC information */
         /* A single port can report up to two (LAN and WoL) addresses */
         mac_buf = devm_kcalloc(ice_hw_to_dev(hw), 2,
@@ -834,7 +818,7 @@ void ice_deinit_hw(struct ice_hw *hw)
   */
  enum ice_status ice_check_reset(struct ice_hw *hw)
  {
-       u32 cnt, reg = 0, grst_delay;
+       u32 cnt, reg = 0, grst_delay, uld_mask;
  
         /* Poll for Device Active state in case a recent CORER, GLOBR,
          * or EMPR has occurred. The grst delay value is in 100ms units.
@@ -856,13 +840,20 @@ enum ice_status ice_check_reset(struct ice_hw *hw)
                 return ICE_ERR_RESET_FAILED;
         }
  
-#define ICE_RESET_DONE_MASK    (GLNVM_ULD_CORER_DONE_M | \
-                                GLNVM_ULD_GLOBR_DONE_M)
+#define ICE_RESET_DONE_MASK    (GLNVM_ULD_PCIER_DONE_M |\
+                                GLNVM_ULD_PCIER_DONE_1_M |\
+                                GLNVM_ULD_CORER_DONE_M |\
+                                GLNVM_ULD_GLOBR_DONE_M |\
+                                GLNVM_ULD_POR_DONE_M |\
+                                GLNVM_ULD_POR_DONE_1_M |\
+                                GLNVM_ULD_PCIER_DONE_2_M)
+
+       uld_mask = ICE_RESET_DONE_MASK;
  
         /* Device is Active; check Global Reset processes are done */
         for (cnt = 0; cnt < ICE_PF_RESET_WAIT_COUNT; cnt++) {
-               reg = rd32(hw, GLNVM_ULD) & ICE_RESET_DONE_MASK;
-               if (reg == ICE_RESET_DONE_MASK) {
+               reg = rd32(hw, GLNVM_ULD) & uld_mask;
+               if (reg == uld_mask) {
                         ice_debug(hw, ICE_DBG_INIT,
                                   "Global reset processes done. %d\n", cnt);
                         break;
diff --git a/drivers/net/ethernet/intel/ice/ice_common.h b/drivers/net/ethernet/intel/ice/ice_common.h

index b5c013f..f9fc005 100644 (file)
--- a/drivers/net/ethernet/intel/ice/ice_common.h
+++ b/drivers/net/ethernet/intel/ice/ice_common.h
@@ -54,8 +54,6 @@ enum ice_status ice_get_caps(struct ice_hw *hw);
  
  void ice_set_safe_mode_caps(struct ice_hw *hw);
  
-void ice_dev_onetime_setup(struct ice_hw *hw);
-
  enum ice_status
  ice_write_rxq_ctx(struct ice_hw *hw, struct ice_rlan_ctx *rlan_ctx,
                   u32 rxq_index);
diff --git a/drivers/net/ethernet/intel/ice/ice_dcb.c b/drivers/net/ethernet/intel/ice/ice_dcb.c

index 713e8a8..adb8dab 100644 (file)
--- a/drivers/net/ethernet/intel/ice/ice_dcb.c
+++ b/drivers/net/ethernet/intel/ice/ice_dcb.c
@@ -1323,13 +1323,13 @@ enum ice_status ice_set_dcb_cfg(struct ice_port_info *pi)
  }
  
  /**
- * ice_aq_query_port_ets - query port ets configuration
+ * ice_aq_query_port_ets - query port ETS configuration
   * @pi: port information structure
   * @buf: pointer to buffer
   * @buf_size: buffer size in bytes
   * @cd: pointer to command details structure or NULL
   *
- * query current port ets configuration
+ * query current port ETS configuration
   */
  static enum ice_status
  ice_aq_query_port_ets(struct ice_port_info *pi,
@@ -1416,13 +1416,13 @@ ice_update_port_tc_tree_cfg(struct ice_port_info *pi,
  }
  
  /**
- * ice_query_port_ets - query port ets configuration
+ * ice_query_port_ets - query port ETS configuration
   * @pi: port information structure
   * @buf: pointer to buffer
   * @buf_size: buffer size in bytes
   * @cd: pointer to command details structure or NULL
   *
- * query current port ets configuration and update the
+ * query current port ETS configuration and update the
   * SW DB with the TC changes
   */
  enum ice_status
diff --git a/drivers/net/ethernet/intel/ice/ice_dcb_lib.c b/drivers/net/ethernet/intel/ice/ice_dcb_lib.c

index 0664e5b..7108fb4 100644 (file)
--- a/drivers/net/ethernet/intel/ice/ice_dcb_lib.c
+++ b/drivers/net/ethernet/intel/ice/ice_dcb_lib.c
@@ -315,9 +315,9 @@ ice_dcb_need_recfg(struct ice_pf *pf, struct ice_dcbx_cfg *old_cfg,
   */
  void ice_dcb_rebuild(struct ice_pf *pf)
  {
-       struct ice_dcbx_cfg *local_dcbx_cfg, *desired_dcbx_cfg, *prev_cfg;
         struct ice_aqc_port_ets_elem buf = { 0 };
         struct device *dev = ice_pf_to_dev(pf);
+       struct ice_dcbx_cfg *err_cfg;
         enum ice_status ret;
  
         ret = ice_query_port_ets(pf->hw.port_info, &buf, sizeof(buf), NULL);
@@ -330,53 +330,25 @@ void ice_dcb_rebuild(struct ice_pf *pf)
         if (!test_bit(ICE_FLAG_DCB_ENA, pf->flags))
                 return;
  
-       local_dcbx_cfg = &pf->hw.port_info->local_dcbx_cfg;
-       desired_dcbx_cfg = &pf->hw.port_info->desired_dcbx_cfg;
+       mutex_lock(&pf->tc_mutex);
  
-       /* Save current willing state and force FW to unwilling */
-       local_dcbx_cfg->etscfg.willing = 0x0;
-       local_dcbx_cfg->pfc.willing = 0x0;
-       local_dcbx_cfg->app_mode = ICE_DCBX_APPS_NON_WILLING;
+       if (!pf->hw.port_info->is_sw_lldp)
+               ice_cfg_etsrec_defaults(pf->hw.port_info);
  
-       ice_cfg_etsrec_defaults(pf->hw.port_info);
         ret = ice_set_dcb_cfg(pf->hw.port_info);
         if (ret) {
-               dev_err(dev, "Failed to set DCB to unwilling\n");
+               dev_err(dev, "Failed to set DCB config in rebuild\n");
                 goto dcb_error;
         }
  
-       /* Retrieve DCB config and ensure same as current in SW */
-       prev_cfg = kmemdup(local_dcbx_cfg, sizeof(*prev_cfg), GFP_KERNEL);
-       if (!prev_cfg)
-               goto dcb_error;
-
-       ice_init_dcb(&pf->hw, true);
-       if (pf->hw.port_info->dcbx_status == ICE_DCBX_STATUS_DIS)
-               pf->hw.port_info->is_sw_lldp = true;
-       else
-               pf->hw.port_info->is_sw_lldp = false;
-
-       if (ice_dcb_need_recfg(pf, prev_cfg, local_dcbx_cfg)) {
-               /* difference in cfg detected - disable DCB till next MIB */
-               dev_err(dev, "Set local MIB not accurate\n");
-               kfree(prev_cfg);
-               goto dcb_error;
+       if (!pf->hw.port_info->is_sw_lldp) {
+               ret = ice_cfg_lldp_mib_change(&pf->hw, true);
+               if (ret && !pf->hw.port_info->is_sw_lldp) {
+                       dev_err(dev, "Failed to register for MIB changes\n");
+                       goto dcb_error;
+               }
         }
  
-       /* fetched config congruent to previous configuration */
-       kfree(prev_cfg);
-
-       /* Set the local desired config */
-       if (local_dcbx_cfg->dcbx_mode == ICE_DCBX_MODE_CEE)
-               memcpy(local_dcbx_cfg, desired_dcbx_cfg,
-                      sizeof(*local_dcbx_cfg));
-
-       ice_cfg_etsrec_defaults(pf->hw.port_info);
-       ret = ice_set_dcb_cfg(pf->hw.port_info);
-       if (ret) {
-               dev_err(dev, "Failed to set desired config\n");
-               goto dcb_error;
-       }
         dev_info(dev, "DCB restored after reset\n");
         ret = ice_query_port_ets(pf->hw.port_info, &buf, sizeof(buf), NULL);
         if (ret) {
@@ -384,26 +356,32 @@ void ice_dcb_rebuild(struct ice_pf *pf)
                 goto dcb_error;
         }
  
+       mutex_unlock(&pf->tc_mutex);
+
         return;
  
  dcb_error:
         dev_err(dev, "Disabling DCB until new settings occur\n");
-       prev_cfg = kzalloc(sizeof(*prev_cfg), GFP_KERNEL);
-       if (!prev_cfg)
+       err_cfg = kzalloc(sizeof(*err_cfg), GFP_KERNEL);
+       if (!err_cfg) {
+               mutex_unlock(&pf->tc_mutex);
                 return;
+       }
  
-       prev_cfg->etscfg.willing = true;
-       prev_cfg->etscfg.tcbwtable[0] = ICE_TC_MAX_BW;
-       prev_cfg->etscfg.tsatable[0] = ICE_IEEE_TSA_ETS;
-       memcpy(&prev_cfg->etsrec, &prev_cfg->etscfg, sizeof(prev_cfg->etsrec));
+       err_cfg->etscfg.willing = true;
+       err_cfg->etscfg.tcbwtable[0] = ICE_TC_MAX_BW;
+       err_cfg->etscfg.tsatable[0] = ICE_IEEE_TSA_ETS;
+       memcpy(&err_cfg->etsrec, &err_cfg->etscfg, sizeof(err_cfg->etsrec));
         /* Coverity warns the return code of ice_pf_dcb_cfg() is not checked
          * here as is done for other calls to that function. That check is
          * not necessary since this is in this function's error cleanup path.
          * Suppress the Coverity warning with the following comment...
          */
         /* coverity[check_return] */
-       ice_pf_dcb_cfg(pf, prev_cfg, false);
-       kfree(prev_cfg);
+       ice_pf_dcb_cfg(pf, err_cfg, false);
+       kfree(err_cfg);
+
+       mutex_unlock(&pf->tc_mutex);
  }
  
  /**
@@ -434,9 +412,9 @@ static int ice_dcb_init_cfg(struct ice_pf *pf, bool locked)
  }
  
  /**
- * ice_dcb_sw_default_config - Apply a default DCB config
+ * ice_dcb_sw_dflt_cfg - Apply a default DCB config
   * @pf: PF to apply config to
- * @ets_willing: configure ets willing
+ * @ets_willing: configure ETS willing
   * @locked: was this function called with RTNL held
   */
  static int ice_dcb_sw_dflt_cfg(struct ice_pf *pf, bool ets_willing, bool locked)
@@ -599,8 +577,7 @@ int ice_init_pf_dcb(struct ice_pf *pf, bool locked)
                 goto dcb_init_err;
         }
  
-       dev_info(dev,
-                "DCB is enabled in the hardware, max number of TCs supported on this port are %d\n",
+       dev_info(dev, "DCB is enabled in the hardware, max number of TCs supported on this port are %d\n",
                  pf->hw.func_caps.common_cap.maxtc);
         if (err) {
                 struct ice_vsi *pf_vsi;
@@ -610,8 +587,8 @@ int ice_init_pf_dcb(struct ice_pf *pf, bool locked)
                 clear_bit(ICE_FLAG_FW_LLDP_AGENT, pf->flags);
                 err = ice_dcb_sw_dflt_cfg(pf, true, locked);
                 if (err) {
-                       dev_err(dev,
-                               "Failed to set local DCB config %d\n", err);
+                       dev_err(dev, "Failed to set local DCB config %d\n",
+                               err);
                         err = -EIO;
                         goto dcb_init_err;
                 }
@@ -777,6 +754,8 @@ ice_dcb_process_lldp_set_mib_change(struct ice_pf *pf,
                 }
         }
  
+       mutex_lock(&pf->tc_mutex);
+
         /* store the old configuration */
         tmp_dcbx_cfg = pf->hw.port_info->local_dcbx_cfg;
  
@@ -787,20 +766,20 @@ ice_dcb_process_lldp_set_mib_change(struct ice_pf *pf,
         ret = ice_get_dcb_cfg(pf->hw.port_info);
         if (ret) {
                 dev_err(dev, "Failed to get DCB config\n");
-               return;
+               goto out;
         }
  
         /* No change detected in DCBX configs */
         if (!memcmp(&tmp_dcbx_cfg, &pi->local_dcbx_cfg, sizeof(tmp_dcbx_cfg))) {
                 dev_dbg(dev, "No change detected in DCBX configuration.\n");
-               return;
+               goto out;
         }
  
         need_reconfig = ice_dcb_need_recfg(pf, &tmp_dcbx_cfg,
                                            &pi->local_dcbx_cfg);
         ice_dcbnl_flush_apps(pf, &tmp_dcbx_cfg, &pi->local_dcbx_cfg);
         if (!need_reconfig)
-               return;
+               goto out;
  
         /* Enable DCB tagging only when more than one TC */
         if (ice_dcb_get_num_tc(&pi->local_dcbx_cfg) > 1) {
@@ -814,7 +793,7 @@ ice_dcb_process_lldp_set_mib_change(struct ice_pf *pf,
         pf_vsi = ice_get_main_vsi(pf);
         if (!pf_vsi) {
                 dev_dbg(dev, "PF VSI doesn't exist\n");
-               return;
+               goto out;
         }
  
         rtnl_lock();
@@ -823,13 +802,15 @@ ice_dcb_process_lldp_set_mib_change(struct ice_pf *pf,
         ret = ice_query_port_ets(pf->hw.port_info, &buf, sizeof(buf), NULL);
         if (ret) {
                 dev_err(dev, "Query Port ETS failed\n");
-               rtnl_unlock();
-               return;
+               goto unlock_rtnl;
         }
  
         /* changes in configuration update VSI */
         ice_pf_dcb_recfg(pf);
  
         ice_ena_vsi(pf_vsi, true);
+unlock_rtnl:
         rtnl_unlock();
+out:
+       mutex_unlock(&pf->tc_mutex);
  }
diff --git a/drivers/net/ethernet/intel/ice/ice_dcb_nl.c b/drivers/net/ethernet/intel/ice/ice_dcb_nl.c

index d870c1a..b61aba4 100644 (file)
--- a/drivers/net/ethernet/intel/ice/ice_dcb_nl.c
+++ b/drivers/net/ethernet/intel/ice/ice_dcb_nl.c
@@ -297,8 +297,7 @@ ice_dcbnl_get_pfc_cfg(struct net_device *netdev, int prio, u8 *setting)
                 return;
  
         *setting = (pi->local_dcbx_cfg.pfc.pfcena >> prio) & 0x1;
-       dev_dbg(ice_pf_to_dev(pf),
-               "Get PFC Config up=%d, setting=%d, pfcenable=0x%x\n",
+       dev_dbg(ice_pf_to_dev(pf), "Get PFC Config up=%d, setting=%d, pfcenable=0x%x\n",
                 prio, *setting, pi->local_dcbx_cfg.pfc.pfcena);
  }
  
@@ -418,8 +417,8 @@ ice_dcbnl_get_pg_tc_cfg_tx(struct net_device *netdev, int prio,
                 return;
  
         *pgid = pi->local_dcbx_cfg.etscfg.prio_table[prio];
-       dev_dbg(ice_pf_to_dev(pf),
-               "Get PG config prio=%d tc=%d\n", prio, *pgid);
+       dev_dbg(ice_pf_to_dev(pf), "Get PG config prio=%d tc=%d\n", prio,
+               *pgid);
  }
  
  /**
@@ -713,13 +712,13 @@ static int ice_dcbnl_delapp(struct net_device *netdev, struct dcb_app *app)
                 return -EINVAL;
  
         mutex_lock(&pf->tc_mutex);
-       ret = dcb_ieee_delapp(netdev, app);
-       if (ret)
-               goto delapp_out;
-
         old_cfg = &pf->hw.port_info->local_dcbx_cfg;
  
-       if (old_cfg->numapps == 1)
+       if (old_cfg->numapps <= 1)
+               goto delapp_out;
+
+       ret = dcb_ieee_delapp(netdev, app);
+       if (ret)
                 goto delapp_out;
  
         new_cfg = &pf->hw.port_info->desired_dcbx_cfg;
@@ -882,8 +881,7 @@ ice_dcbnl_vsi_del_app(struct ice_vsi *vsi,
         sapp.protocol = app->prot_id;
         sapp.priority = app->priority;
         err = ice_dcbnl_delapp(vsi->netdev, &sapp);
-       dev_dbg(&vsi->back->pdev->dev,
-               "Deleting app for VSI idx=%d err=%d sel=%d proto=0x%x, prio=%d\n",
+       dev_dbg(ice_pf_to_dev(vsi->back), "Deleting app for VSI idx=%d err=%d sel=%d proto=0x%x, prio=%d\n",
                 vsi->idx, err, app->selector, app->prot_id, app->priority);
  }
  
diff --git a/drivers/net/ethernet/intel/ice/ice_ethtool.c b/drivers/net/ethernet/intel/ice/ice_ethtool.c

index 90c6a3c..b002ab4 100644 (file)
--- a/drivers/net/ethernet/intel/ice/ice_ethtool.c
+++ b/drivers/net/ethernet/intel/ice/ice_ethtool.c
@@ -166,13 +166,24 @@ static void
  ice_get_drvinfo(struct net_device *netdev, struct ethtool_drvinfo *drvinfo)
  {
         struct ice_netdev_priv *np = netdev_priv(netdev);
+       u8 oem_ver, oem_patch, nvm_ver_hi, nvm_ver_lo;
         struct ice_vsi *vsi = np->vsi;
         struct ice_pf *pf = vsi->back;
+       struct ice_hw *hw = &pf->hw;
+       u16 oem_build;
  
         strlcpy(drvinfo->driver, KBUILD_MODNAME, sizeof(drvinfo->driver));
         strlcpy(drvinfo->version, ice_drv_ver, sizeof(drvinfo->version));
-       strlcpy(drvinfo->fw_version, ice_nvm_version_str(&pf->hw),
-               sizeof(drvinfo->fw_version));
+
+       /* Display NVM version (from which the firmware version can be
+        * determined) which contains more pertinent information.
+        */
+       ice_get_nvm_version(hw, &oem_ver, &oem_build, &oem_patch,
+                           &nvm_ver_hi, &nvm_ver_lo);
+       snprintf(drvinfo->fw_version, sizeof(drvinfo->fw_version),
+                "%x.%02x 0x%x %d.%d.%d", nvm_ver_hi, nvm_ver_lo,
+                hw->nvm.eetrack, oem_ver, oem_build, oem_patch);
+
         strlcpy(drvinfo->bus_info, pci_name(pf->pdev),
                 sizeof(drvinfo->bus_info));
         drvinfo->n_priv_flags = ICE_PRIV_FLAG_ARRAY_SIZE;
@@ -363,8 +374,7 @@ static int ice_reg_pattern_test(struct ice_hw *hw, u32 reg, u32 mask)
                 val = rd32(hw, reg);
                 if (val == pattern)
                         continue;
-               dev_err(dev,
-                       "%s: reg pattern test failed - reg 0x%08x pat 0x%08x val 0x%08x\n"
+               dev_err(dev, "%s: reg pattern test failed - reg 0x%08x pat 0x%08x val 0x%08x\n"
                         , __func__, reg, pattern, val);
                 return 1;
         }
@@ -372,8 +382,7 @@ static int ice_reg_pattern_test(struct ice_hw *hw, u32 reg, u32 mask)
         wr32(hw, reg, orig_val);
         val = rd32(hw, reg);
         if (val != orig_val) {
-               dev_err(dev,
-                       "%s: reg restore test failed - reg 0x%08x orig 0x%08x val 0x%08x\n"
+               dev_err(dev, "%s: reg restore test failed - reg 0x%08x orig 0x%08x val 0x%08x\n"
                         , __func__, reg, orig_val, val);
                 return 1;
         }
@@ -791,8 +800,7 @@ ice_self_test(struct net_device *netdev, struct ethtool_test *eth_test,
                 set_bit(__ICE_TESTING, pf->state);
  
                 if (ice_active_vfs(pf)) {
-                       dev_warn(dev,
-                                "Please take active VFs and Netqueues offline and restart the adapter before running NIC diagnostics\n");
+                       dev_warn(dev, "Please take active VFs and Netqueues offline and restart the adapter before running NIC diagnostics\n");
                         data[ICE_ETH_TEST_REG] = 1;
                         data[ICE_ETH_TEST_EEPROM] = 1;
                         data[ICE_ETH_TEST_INTR] = 1;
@@ -1047,7 +1055,7 @@ ice_set_fecparam(struct net_device *netdev, struct ethtool_fecparam *fecparam)
                 fec = ICE_FEC_NONE;
                 break;
         default:
-               dev_warn(&vsi->back->pdev->dev, "Unsupported FEC mode: %d\n",
+               dev_warn(ice_pf_to_dev(vsi->back), "Unsupported FEC mode: %d\n",
                          fecparam->fec);
                 return -EINVAL;
         }
@@ -1200,8 +1208,7 @@ static int ice_set_priv_flags(struct net_device *netdev, u32 flags)
                          * events to respond to.
                          */
                         if (status)
-                               dev_info(dev,
-                                        "Failed to unreg for LLDP events\n");
+                               dev_info(dev, "Failed to unreg for LLDP events\n");
  
                         /* The AQ call to stop the FW LLDP agent will generate
                          * an error if the agent is already stopped.
@@ -1256,8 +1263,7 @@ static int ice_set_priv_flags(struct net_device *netdev, u32 flags)
                         /* Register for MIB change events */
                         status = ice_cfg_lldp_mib_change(&pf->hw, true);
                         if (status)
-                               dev_dbg(dev,
-                                       "Fail to enable MIB change events\n");
+                               dev_dbg(dev, "Fail to enable MIB change events\n");
                 }
         }
         if (test_bit(ICE_FLAG_LEGACY_RX, change_flags)) {
@@ -1710,291 +1716,13 @@ ice_get_settings_link_up(struct ethtool_link_ksettings *ks,
  {
         struct ice_netdev_priv *np = netdev_priv(netdev);
         struct ice_port_info *pi = np->vsi->port_info;
-       struct ethtool_link_ksettings cap_ksettings;
         struct ice_link_status *link_info;
         struct ice_vsi *vsi = np->vsi;
-       bool unrecog_phy_high = false;
-       bool unrecog_phy_low = false;
  
         link_info = &vsi->port_info->phy.link_info;
  
-       /* Initialize supported and advertised settings based on PHY settings */
-       switch (link_info->phy_type_low) {
-       case ICE_PHY_TYPE_LOW_100BASE_TX:
-               ethtool_link_ksettings_add_link_mode(ks, supported, Autoneg);
-               ethtool_link_ksettings_add_link_mode(ks, supported,
-                                                    100baseT_Full);
-               ethtool_link_ksettings_add_link_mode(ks, advertising, Autoneg);
-               ethtool_link_ksettings_add_link_mode(ks, advertising,
-                                                    100baseT_Full);
-               break;
-       case ICE_PHY_TYPE_LOW_100M_SGMII:
-               ethtool_link_ksettings_add_link_mode(ks, supported,
-                                                    100baseT_Full);
-               break;
-       case ICE_PHY_TYPE_LOW_1000BASE_T:
-               ethtool_link_ksettings_add_link_mode(ks, supported, Autoneg);
-               ethtool_link_ksettings_add_link_mode(ks, supported,
-                                                    1000baseT_Full);
-               ethtool_link_ksettings_add_link_mode(ks, advertising, Autoneg);
-               ethtool_link_ksettings_add_link_mode(ks, advertising,
-                                                    1000baseT_Full);
-               break;
-       case ICE_PHY_TYPE_LOW_1G_SGMII:
-               ethtool_link_ksettings_add_link_mode(ks, supported,
-                                                    1000baseT_Full);
-               break;
-       case ICE_PHY_TYPE_LOW_1000BASE_SX:
-       case ICE_PHY_TYPE_LOW_1000BASE_LX:
-               ethtool_link_ksettings_add_link_mode(ks, supported,
-                                                    1000baseX_Full);
-               break;
-       case ICE_PHY_TYPE_LOW_1000BASE_KX:
-               ethtool_link_ksettings_add_link_mode(ks, supported, Autoneg);
-               ethtool_link_ksettings_add_link_mode(ks, supported,
-                                                    1000baseKX_Full);
-               ethtool_link_ksettings_add_link_mode(ks, advertising, Autoneg);
-               ethtool_link_ksettings_add_link_mode(ks, advertising,
-                                                    1000baseKX_Full);
-               break;
-       case ICE_PHY_TYPE_LOW_2500BASE_T:
-               ethtool_link_ksettings_add_link_mode(ks, supported, Autoneg);
-               ethtool_link_ksettings_add_link_mode(ks, advertising, Autoneg);
-               ethtool_link_ksettings_add_link_mode(ks, supported,
-                                                    2500baseT_Full);
-               ethtool_link_ksettings_add_link_mode(ks, advertising,
-                                                    2500baseT_Full);
-               break;
-       case ICE_PHY_TYPE_LOW_2500BASE_X:
-               ethtool_link_ksettings_add_link_mode(ks, supported,
-                                                    2500baseX_Full);
-               break;
-       case ICE_PHY_TYPE_LOW_2500BASE_KX:
-               ethtool_link_ksettings_add_link_mode(ks, supported, Autoneg);
-               ethtool_link_ksettings_add_link_mode(ks, supported,
-                                                    2500baseX_Full);
-               ethtool_link_ksettings_add_link_mode(ks, advertising, Autoneg);
-               ethtool_link_ksettings_add_link_mode(ks, advertising,
-                                                    2500baseX_Full);
-               break;
-       case ICE_PHY_TYPE_LOW_5GBASE_T:
-       case ICE_PHY_TYPE_LOW_5GBASE_KR:
-               ethtool_link_ksettings_add_link_mode(ks, supported, Autoneg);
-               ethtool_link_ksettings_add_link_mode(ks, supported,
-                                                    5000baseT_Full);
-               ethtool_link_ksettings_add_link_mode(ks, advertising, Autoneg);
-               ethtool_link_ksettings_add_link_mode(ks, advertising,
-                                                    5000baseT_Full);
-               break;
-       case ICE_PHY_TYPE_LOW_10GBASE_T:
-               ethtool_link_ksettings_add_link_mode(ks, supported, Autoneg);
-               ethtool_link_ksettings_add_link_mode(ks, supported,
-                                                    10000baseT_Full);
-               ethtool_link_ksettings_add_link_mode(ks, advertising, Autoneg);
-               ethtool_link_ksettings_add_link_mode(ks, advertising,
-                                                    10000baseT_Full);
-               break;
-       case ICE_PHY_TYPE_LOW_10G_SFI_DA:
-       case ICE_PHY_TYPE_LOW_10G_SFI_AOC_ACC:
-       case ICE_PHY_TYPE_LOW_10G_SFI_C2C:
-               ethtool_link_ksettings_add_link_mode(ks, supported,
-                                                    10000baseT_Full);
-               break;
-       case ICE_PHY_TYPE_LOW_10GBASE_SR:
-               ethtool_link_ksettings_add_link_mode(ks, supported,
-                                                    10000baseSR_Full);
-               break;
-       case ICE_PHY_TYPE_LOW_10GBASE_LR:
-               ethtool_link_ksettings_add_link_mode(ks, supported,
-                                                    10000baseLR_Full);
-               break;
-       case ICE_PHY_TYPE_LOW_10GBASE_KR_CR1:
-               ethtool_link_ksettings_add_link_mode(ks, supported, Autoneg);
-               ethtool_link_ksettings_add_link_mode(ks, supported,
-                                                    10000baseKR_Full);
-               ethtool_link_ksettings_add_link_mode(ks, advertising, Autoneg);
-               ethtool_link_ksettings_add_link_mode(ks, advertising,
-                                                    10000baseKR_Full);
-               break;
-       case ICE_PHY_TYPE_LOW_25GBASE_T:
-       case ICE_PHY_TYPE_LOW_25GBASE_CR:
-       case ICE_PHY_TYPE_LOW_25GBASE_CR_S:
-       case ICE_PHY_TYPE_LOW_25GBASE_CR1:
-               ethtool_link_ksettings_add_link_mode(ks, supported, Autoneg);
-               ethtool_link_ksettings_add_link_mode(ks, supported,
-                                                    25000baseCR_Full);
-               ethtool_link_ksettings_add_link_mode(ks, advertising, Autoneg);
-               ethtool_link_ksettings_add_link_mode(ks, advertising,
-                                                    25000baseCR_Full);
-               break;
-       case ICE_PHY_TYPE_LOW_25G_AUI_AOC_ACC:
-       case ICE_PHY_TYPE_LOW_25G_AUI_C2C:
-               ethtool_link_ksettings_add_link_mode(ks, supported,
-                                                    25000baseCR_Full);
-               break;
-       case ICE_PHY_TYPE_LOW_25GBASE_SR:
-       case ICE_PHY_TYPE_LOW_25GBASE_LR:
-               ethtool_link_ksettings_add_link_mode(ks, supported,
-                                                    25000baseSR_Full);
-               break;
-       case ICE_PHY_TYPE_LOW_25GBASE_KR:
-       case ICE_PHY_TYPE_LOW_25GBASE_KR1:
-       case ICE_PHY_TYPE_LOW_25GBASE_KR_S:
-               ethtool_link_ksettings_add_link_mode(ks, supported, Autoneg);
-               ethtool_link_ksettings_add_link_mode(ks, supported,
-                                                    25000baseKR_Full);
-               ethtool_link_ksettings_add_link_mode(ks, advertising, Autoneg);
-               ethtool_link_ksettings_add_link_mode(ks, advertising,
-                                                    25000baseKR_Full);
-               break;
-       case ICE_PHY_TYPE_LOW_40GBASE_CR4:
-               ethtool_link_ksettings_add_link_mode(ks, supported, Autoneg);
-               ethtool_link_ksettings_add_link_mode(ks, supported,
-                                                    40000baseCR4_Full);
-               ethtool_link_ksettings_add_link_mode(ks, advertising, Autoneg);
-               ethtool_link_ksettings_add_link_mode(ks, advertising,
-                                                    40000baseCR4_Full);
-               break;
-       case ICE_PHY_TYPE_LOW_40G_XLAUI_AOC_ACC:
-       case ICE_PHY_TYPE_LOW_40G_XLAUI:
-               ethtool_link_ksettings_add_link_mode(ks, supported,
-                                                    40000baseCR4_Full);
-               break;
-       case ICE_PHY_TYPE_LOW_40GBASE_SR4:
-               ethtool_link_ksettings_add_link_mode(ks, supported,
-                                                    40000baseSR4_Full);
-               break;
-       case ICE_PHY_TYPE_LOW_40GBASE_LR4:
-               ethtool_link_ksettings_add_link_mode(ks, supported,
-                                                    40000baseLR4_Full);
-               break;
-       case ICE_PHY_TYPE_LOW_40GBASE_KR4:
-               ethtool_link_ksettings_add_link_mode(ks, supported, Autoneg);
-               ethtool_link_ksettings_add_link_mode(ks, supported,
-                                                    40000baseKR4_Full);
-               ethtool_link_ksettings_add_link_mode(ks, advertising, Autoneg);
-               ethtool_link_ksettings_add_link_mode(ks, advertising,
-                                                    40000baseKR4_Full);
-               break;
-       case ICE_PHY_TYPE_LOW_50GBASE_CR2:
-       case ICE_PHY_TYPE_LOW_50GBASE_CP:
-               ethtool_link_ksettings_add_link_mode(ks, supported, Autoneg);
-               ethtool_link_ksettings_add_link_mode(ks, supported,
-                                                    50000baseCR2_Full);
-               ethtool_link_ksettings_add_link_mode(ks, advertising, Autoneg);
-               ethtool_link_ksettings_add_link_mode(ks, advertising,
-                                                    50000baseCR2_Full);
-               break;
-       case ICE_PHY_TYPE_LOW_50G_LAUI2_AOC_ACC:
-       case ICE_PHY_TYPE_LOW_50G_LAUI2:
-       case ICE_PHY_TYPE_LOW_50G_AUI2_AOC_ACC:
-       case ICE_PHY_TYPE_LOW_50G_AUI2:
-       case ICE_PHY_TYPE_LOW_50GBASE_SR:
-       case ICE_PHY_TYPE_LOW_50G_AUI1_AOC_ACC:
-       case ICE_PHY_TYPE_LOW_50G_AUI1:
-               ethtool_link_ksettings_add_link_mode(ks, supported,
-                                                    50000baseCR2_Full);
-               break;
-       case ICE_PHY_TYPE_LOW_50GBASE_KR2:
-       case ICE_PHY_TYPE_LOW_50GBASE_KR_PAM4:
-               ethtool_link_ksettings_add_link_mode(ks, supported, Autoneg);
-               ethtool_link_ksettings_add_link_mode(ks, supported,
-                                                    50000baseKR2_Full);
-               ethtool_link_ksettings_add_link_mode(ks, advertising, Autoneg);
-               ethtool_link_ksettings_add_link_mode(ks, advertising,
-                                                    50000baseKR2_Full);
-               break;
-       case ICE_PHY_TYPE_LOW_50GBASE_SR2:
-       case ICE_PHY_TYPE_LOW_50GBASE_LR2:
-       case ICE_PHY_TYPE_LOW_50GBASE_FR:
-       case ICE_PHY_TYPE_LOW_50GBASE_LR:
-               ethtool_link_ksettings_add_link_mode(ks, supported,
-                                                    50000baseSR2_Full);
-               break;
-       case ICE_PHY_TYPE_LOW_100GBASE_CR4:
-               ethtool_link_ksettings_add_link_mode(ks, supported, Autoneg);
-               ethtool_link_ksettings_add_link_mode(ks, supported,
-                                                    100000baseCR4_Full);
-               ethtool_link_ksettings_add_link_mode(ks, advertising, Autoneg);
-               ethtool_link_ksettings_add_link_mode(ks, advertising,
-                                                    100000baseCR4_Full);
-               break;
-       case ICE_PHY_TYPE_LOW_100G_CAUI4_AOC_ACC:
-       case ICE_PHY_TYPE_LOW_100G_CAUI4:
-       case ICE_PHY_TYPE_LOW_100G_AUI4_AOC_ACC:
-       case ICE_PHY_TYPE_LOW_100G_AUI4:
-       case ICE_PHY_TYPE_LOW_100GBASE_CR_PAM4:
-               ethtool_link_ksettings_add_link_mode(ks, supported,
-                                                    100000baseCR4_Full);
-               break;
-       case ICE_PHY_TYPE_LOW_100GBASE_CP2:
-               ethtool_link_ksettings_add_link_mode(ks, supported, Autoneg);
-               ethtool_link_ksettings_add_link_mode(ks, supported,
-                                                    100000baseCR4_Full);
-               ethtool_link_ksettings_add_link_mode(ks, advertising, Autoneg);
-               ethtool_link_ksettings_add_link_mode(ks, advertising,
-                                                    100000baseCR4_Full);
-               break;
-       case ICE_PHY_TYPE_LOW_100GBASE_SR4:
-       case ICE_PHY_TYPE_LOW_100GBASE_SR2:
-               ethtool_link_ksettings_add_link_mode(ks, supported,
-                                                    100000baseSR4_Full);
-               break;
-       case ICE_PHY_TYPE_LOW_100GBASE_LR4:
-       case ICE_PHY_TYPE_LOW_100GBASE_DR:
-               ethtool_link_ksettings_add_link_mode(ks, supported,
-                                                    100000baseLR4_ER4_Full);
-               break;
-       case ICE_PHY_TYPE_LOW_100GBASE_KR4:
-       case ICE_PHY_TYPE_LOW_100GBASE_KR_PAM4:
-               ethtool_link_ksettings_add_link_mode(ks, supported, Autoneg);
-               ethtool_link_ksettings_add_link_mode(ks, supported,
-                                                    100000baseKR4_Full);
-               ethtool_link_ksettings_add_link_mode(ks, advertising, Autoneg);
-               ethtool_link_ksettings_add_link_mode(ks, advertising,
-                                                    100000baseKR4_Full);
-               break;
-       default:
-               unrecog_phy_low = true;
-       }
-
-       switch (link_info->phy_type_high) {
-       case ICE_PHY_TYPE_HIGH_100GBASE_KR2_PAM4:
-               ethtool_link_ksettings_add_link_mode(ks, supported, Autoneg);
-               ethtool_link_ksettings_add_link_mode(ks, supported,
-                                                    100000baseKR4_Full);
-               ethtool_link_ksettings_add_link_mode(ks, advertising, Autoneg);
-               ethtool_link_ksettings_add_link_mode(ks, advertising,
-                                                    100000baseKR4_Full);
-               break;
-       case ICE_PHY_TYPE_HIGH_100G_CAUI2_AOC_ACC:
-       case ICE_PHY_TYPE_HIGH_100G_CAUI2:
-       case ICE_PHY_TYPE_HIGH_100G_AUI2_AOC_ACC:
-       case ICE_PHY_TYPE_HIGH_100G_AUI2:
-               ethtool_link_ksettings_add_link_mode(ks, supported,
-                                                    100000baseCR4_Full);
-               break;
-       default:
-               unrecog_phy_high = true;
-       }
-
-       if (unrecog_phy_low && unrecog_phy_high) {
-               /* if we got here and link is up something bad is afoot */
-               netdev_info(netdev,
-                           "WARNING: Unrecognized PHY_Low (0x%llx).\n",
-                           (u64)link_info->phy_type_low);
-               netdev_info(netdev,
-                           "WARNING: Unrecognized PHY_High (0x%llx).\n",
-                           (u64)link_info->phy_type_high);
-       }
-
-       /* Now that we've worked out everything that could be supported by the
-        * current PHY type, get what is supported by the NVM and intersect
-        * them to get what is truly supported
-        */
-       memset(&cap_ksettings, 0, sizeof(cap_ksettings));
-       ice_phy_type_to_ethtool(netdev, &cap_ksettings);
-       ethtool_intersect_link_masks(ks, &cap_ksettings);
+       /* Get supported and advertised settings from PHY ability with media */
+       ice_phy_type_to_ethtool(netdev, ks);
  
         switch (link_info->link_speed) {
         case ICE_AQ_LINK_SPEED_100GB:
@@ -2028,8 +1756,7 @@ ice_get_settings_link_up(struct ethtool_link_ksettings *ks,
                 ks->base.speed = SPEED_100;
                 break;
         default:
-               netdev_info(netdev,
-                           "WARNING: Unrecognized link_speed (0x%x).\n",
+               netdev_info(netdev, "WARNING: Unrecognized link_speed (0x%x).\n",
                             link_info->link_speed);
                 break;
         }
@@ -2845,13 +2572,11 @@ ice_set_ringparam(struct net_device *netdev, struct ethtool_ringparam *ring)
  
         new_tx_cnt = ALIGN(ring->tx_pending, ICE_REQ_DESC_MULTIPLE);
         if (new_tx_cnt != ring->tx_pending)
-               netdev_info(netdev,
-                           "Requested Tx descriptor count rounded up to %d\n",
+               netdev_info(netdev, "Requested Tx descriptor count rounded up to %d\n",
                             new_tx_cnt);
         new_rx_cnt = ALIGN(ring->rx_pending, ICE_REQ_DESC_MULTIPLE);
         if (new_rx_cnt != ring->rx_pending)
-               netdev_info(netdev,
-                           "Requested Rx descriptor count rounded up to %d\n",
+               netdev_info(netdev, "Requested Rx descriptor count rounded up to %d\n",
                             new_rx_cnt);
  
         /* if nothing to do return success */
@@ -3718,8 +3443,7 @@ ice_set_rc_coalesce(enum ice_container_type c_type, struct ethtool_coalesce *ec,
                 if (ec->rx_coalesce_usecs_high > ICE_MAX_INTRL ||
                     (ec->rx_coalesce_usecs_high &&
                      ec->rx_coalesce_usecs_high < pf->hw.intrl_gran)) {
-                       netdev_info(vsi->netdev,
-                                   "Invalid value, %s-usecs-high valid values are 0 (disabled), %d-%d\n",
+                       netdev_info(vsi->netdev, "Invalid value, %s-usecs-high valid values are 0 (disabled), %d-%d\n",
                                     c_type_str, pf->hw.intrl_gran,
                                     ICE_MAX_INTRL);
                         return -EINVAL;
@@ -3737,8 +3461,7 @@ ice_set_rc_coalesce(enum ice_container_type c_type, struct ethtool_coalesce *ec,
                 break;
         case ICE_TX_CONTAINER:
                 if (ec->tx_coalesce_usecs_high) {
-                       netdev_info(vsi->netdev,
-                                   "setting %s-usecs-high is not supported\n",
+                       netdev_info(vsi->netdev, "setting %s-usecs-high is not supported\n",
                                     c_type_str);
                         return -EINVAL;
                 }
@@ -3755,23 +3478,20 @@ ice_set_rc_coalesce(enum ice_container_type c_type, struct ethtool_coalesce *ec,
  
         itr_setting = rc->itr_setting & ~ICE_ITR_DYNAMIC;
         if (coalesce_usecs != itr_setting && use_adaptive_coalesce) {
-               netdev_info(vsi->netdev,
-                           "%s interrupt throttling cannot be changed if adaptive-%s is enabled\n",
+               netdev_info(vsi->netdev, "%s interrupt throttling cannot be changed if adaptive-%s is enabled\n",
                             c_type_str, c_type_str);
                 return -EINVAL;
         }
  
         if (coalesce_usecs > ICE_ITR_MAX) {
-               netdev_info(vsi->netdev,
-                           "Invalid value, %s-usecs range is 0-%d\n",
+               netdev_info(vsi->netdev, "Invalid value, %s-usecs range is 0-%d\n",
                             c_type_str, ICE_ITR_MAX);
                 return -EINVAL;
         }
  
         /* hardware only supports an ITR granularity of 2us */
         if (coalesce_usecs % 2 != 0) {
-               netdev_info(vsi->netdev,
-                           "Invalid value, %s-usecs must be even\n",
+               netdev_info(vsi->netdev, "Invalid value, %s-usecs must be even\n",
                             c_type_str);
                 return -EINVAL;
         }
@@ -4012,8 +3732,7 @@ ice_get_module_info(struct net_device *netdev,
                 }
                 break;
         default:
-               netdev_warn(netdev,
-                           "SFF Module Type not recognized.\n");
+               netdev_warn(netdev, "SFF Module Type not recognized.\n");
                 return -EINVAL;
         }
         return 0;
@@ -4081,11 +3800,11 @@ ice_get_module_eeprom(struct net_device *netdev,
  static const struct ethtool_ops ice_ethtool_ops = {
         .get_link_ksettings     = ice_get_link_ksettings,
         .set_link_ksettings     = ice_set_link_ksettings,
-       .get_drvinfo            = ice_get_drvinfo,
-       .get_regs_len           = ice_get_regs_len,
-       .get_regs               = ice_get_regs,
-       .get_msglevel           = ice_get_msglevel,
-       .set_msglevel           = ice_set_msglevel,
+       .get_drvinfo            = ice_get_drvinfo,
+       .get_regs_len           = ice_get_regs_len,
+       .get_regs               = ice_get_regs,
+       .get_msglevel           = ice_get_msglevel,
+       .set_msglevel           = ice_set_msglevel,
         .self_test              = ice_self_test,
         .get_link               = ethtool_op_get_link,
         .get_eeprom_len         = ice_get_eeprom_len,
@@ -4112,8 +3831,8 @@ static const struct ethtool_ops ice_ethtool_ops = {
         .get_channels           = ice_get_channels,
         .set_channels           = ice_set_channels,
         .get_ts_info            = ethtool_op_get_ts_info,
-       .get_per_queue_coalesce = ice_get_per_q_coalesce,
-       .set_per_queue_coalesce = ice_set_per_q_coalesce,
+       .get_per_queue_coalesce = ice_get_per_q_coalesce,
+       .set_per_queue_coalesce = ice_set_per_q_coalesce,
         .get_fecparam           = ice_get_fecparam,
         .set_fecparam           = ice_set_fecparam,
         .get_module_info        = ice_get_module_info,
diff --git a/drivers/net/ethernet/intel/ice/ice_hw_autogen.h b/drivers/net/ethernet/intel/ice/ice_hw_autogen.h

index f2cabab..6db3d04 100644 (file)
--- a/drivers/net/ethernet/intel/ice/ice_hw_autogen.h
+++ b/drivers/net/ethernet/intel/ice/ice_hw_autogen.h
@@ -267,8 +267,14 @@
  #define GLNVM_GENS_SR_SIZE_S                   5
  #define GLNVM_GENS_SR_SIZE_M                   ICE_M(0x7, 5)
  #define GLNVM_ULD                              0x000B6008
+#define GLNVM_ULD_PCIER_DONE_M                 BIT(0)
+#define GLNVM_ULD_PCIER_DONE_1_M               BIT(1)
  #define GLNVM_ULD_CORER_DONE_M                 BIT(3)
  #define GLNVM_ULD_GLOBR_DONE_M                 BIT(4)
+#define GLNVM_ULD_POR_DONE_M                   BIT(5)
+#define GLNVM_ULD_POR_DONE_1_M                 BIT(8)
+#define GLNVM_ULD_PCIER_DONE_2_M               BIT(9)
+#define GLNVM_ULD_PE_DONE_M                    BIT(10)
  #define GLPCI_CNF2                             0x000BE004
  #define GLPCI_CNF2_CACHELINE_SIZE_M            BIT(1)
  #define PF_FUNC_RID                            0x0009E880
@@ -331,7 +337,6 @@
  #define GLV_TEPC(_VSI)                         (0x00312000 + ((_VSI) * 4))
  #define GLV_UPRCL(_i)                          (0x003B2000 + ((_i) * 8))
  #define GLV_UPTCL(_i)                          (0x0030A000 + ((_i) * 8))
-#define PF_VT_PFALLOC_HIF                      0x0009DD80
  #define VSIQF_HKEY_MAX_INDEX                   12
  #define VSIQF_HLUT_MAX_INDEX                   15
  #define VFINT_DYN_CTLN(_i)                     (0x00003800 + ((_i) * 4))
diff --git a/drivers/net/ethernet/intel/ice/ice_lib.c b/drivers/net/ethernet/intel/ice/ice_lib.c

index 1874c9f..d974e2f 100644 (file)
--- a/drivers/net/ethernet/intel/ice/ice_lib.c
+++ b/drivers/net/ethernet/intel/ice/ice_lib.c
@@ -117,8 +117,7 @@ static void ice_vsi_set_num_desc(struct ice_vsi *vsi)
                 vsi->num_tx_desc = ICE_DFLT_NUM_TX_DESC;
                 break;
         default:
-               dev_dbg(&vsi->back->pdev->dev,
-                       "Not setting number of Tx/Rx descriptors for VSI type %d\n",
+               dev_dbg(ice_pf_to_dev(vsi->back), "Not setting number of Tx/Rx descriptors for VSI type %d\n",
                         vsi->type);
                 break;
         }
@@ -724,7 +723,7 @@ static void ice_vsi_setup_q_map(struct ice_vsi *vsi, struct ice_vsi_ctx *ctxt)
         vsi->num_txq = tx_count;
  
         if (vsi->type == ICE_VSI_VF && vsi->num_txq != vsi->num_rxq) {
-               dev_dbg(&vsi->back->pdev->dev, "VF VSI should have same number of Tx and Rx queues. Hence making them equal\n");
+               dev_dbg(ice_pf_to_dev(vsi->back), "VF VSI should have same number of Tx and Rx queues. Hence making them equal\n");
                 /* since there is a chance that num_rxq could have been changed
                  * in the above for loop, make num_txq equal to num_rxq.
                  */
@@ -929,8 +928,7 @@ static int ice_vsi_setup_vector_base(struct ice_vsi *vsi)
         vsi->base_vector = ice_get_res(pf, pf->irq_tracker, num_q_vectors,
                                        vsi->idx);
         if (vsi->base_vector < 0) {
-               dev_err(dev,
-                       "Failed to get tracking for %d vectors for VSI %d, err=%d\n",
+               dev_err(dev, "Failed to get tracking for %d vectors for VSI %d, err=%d\n",
                         num_q_vectors, vsi->vsi_num, vsi->base_vector);
                 return -ENOENT;
         }
@@ -1232,8 +1230,9 @@ static void ice_vsi_set_rss_flow_fld(struct ice_vsi *vsi)
   *
   * Returns 0 on success or ENOMEM on failure.
   */
-int ice_add_mac_to_list(struct ice_vsi *vsi, struct list_head *add_list,
-                       const u8 *macaddr)
+int
+ice_add_mac_to_list(struct ice_vsi *vsi, struct list_head *add_list,
+                   const u8 *macaddr)
  {
         struct ice_fltr_list_entry *tmp;
         struct ice_pf *pf = vsi->back;
@@ -1392,12 +1391,10 @@ int ice_vsi_kill_vlan(struct ice_vsi *vsi, u16 vid)
  
         status = ice_remove_vlan(&pf->hw, &tmp_add_list);
         if (status == ICE_ERR_DOES_NOT_EXIST) {
-               dev_dbg(dev,
-                       "Failed to remove VLAN %d on VSI %i, it does not exist, status: %d\n",
+               dev_dbg(dev, "Failed to remove VLAN %d on VSI %i, it does not exist, status: %d\n",
                         vid, vsi->vsi_num, status);
         } else if (status) {
-               dev_err(dev,
-                       "Error removing VLAN %d on vsi %i error: %d\n",
+               dev_err(dev, "Error removing VLAN %d on vsi %i error: %d\n",
                         vid, vsi->vsi_num, status);
                 err = -EIO;
         }
@@ -1453,8 +1450,7 @@ setup_rings:
  
                 err = ice_setup_rx_ctx(vsi->rx_rings[i]);
                 if (err) {
-                       dev_err(&vsi->back->pdev->dev,
-                               "ice_setup_rx_ctx failed for RxQ %d, err %d\n",
+                       dev_err(ice_pf_to_dev(vsi->back), "ice_setup_rx_ctx failed for RxQ %d, err %d\n",
                                 i, err);
                         return err;
                 }
@@ -1623,7 +1619,7 @@ int ice_vsi_manage_vlan_insertion(struct ice_vsi *vsi)
  
         status = ice_update_vsi(hw, vsi->idx, ctxt, NULL);
         if (status) {
-               dev_err(&vsi->back->pdev->dev, "update VSI for VLAN insert failed, err %d aq_err %d\n",
+               dev_err(ice_pf_to_dev(vsi->back), "update VSI for VLAN insert failed, err %d aq_err %d\n",
                         status, hw->adminq.sq_last_status);
                 ret = -EIO;
                 goto out;
@@ -1669,7 +1665,7 @@ int ice_vsi_manage_vlan_stripping(struct ice_vsi *vsi, bool ena)
  
         status = ice_update_vsi(hw, vsi->idx, ctxt, NULL);
         if (status) {
-               dev_err(&vsi->back->pdev->dev, "update VSI for VLAN strip failed, ena = %d err %d aq_err %d\n",
+               dev_err(ice_pf_to_dev(vsi->back), "update VSI for VLAN strip failed, ena = %d err %d aq_err %d\n",
                         ena, status, hw->adminq.sq_last_status);
                 ret = -EIO;
                 goto out;
@@ -1834,8 +1830,7 @@ ice_vsi_set_q_vectors_reg_idx(struct ice_vsi *vsi)
                 struct ice_q_vector *q_vector = vsi->q_vectors[i];
  
                 if (!q_vector) {
-                       dev_err(&vsi->back->pdev->dev,
-                               "Failed to set reg_idx on q_vector %d VSI %d\n",
+                       dev_err(ice_pf_to_dev(vsi->back), "Failed to set reg_idx on q_vector %d VSI %d\n",
                                 i, vsi->vsi_num);
                         goto clear_reg_idx;
                 }
@@ -1898,8 +1893,7 @@ ice_vsi_add_rem_eth_mac(struct ice_vsi *vsi, bool add_rule)
                 status = ice_remove_eth_mac(&pf->hw, &tmp_add_list);
  
         if (status)
-               dev_err(dev,
-                       "Failure Adding or Removing Ethertype on VSI %i error: %d\n",
+               dev_err(dev, "Failure Adding or Removing Ethertype on VSI %i error: %d\n",
                         vsi->vsi_num, status);
  
         ice_free_fltr_list(dev, &tmp_add_list);
@@ -2384,8 +2378,7 @@ ice_get_res(struct ice_pf *pf, struct ice_res_tracker *res, u16 needed, u16 id)
                 return -EINVAL;
  
         if (!needed || needed > res->num_entries || id >= ICE_RES_VALID_BIT) {
-               dev_err(ice_pf_to_dev(pf),
-                       "param err: needed=%d, num_entries = %d id=0x%04x\n",
+               dev_err(ice_pf_to_dev(pf), "param err: needed=%d, num_entries = %d id=0x%04x\n",
                         needed, res->num_entries, id);
                 return -EINVAL;
         }
@@ -2686,7 +2679,6 @@ int ice_vsi_rebuild(struct ice_vsi *vsi, bool init_vsi)
         ice_vsi_put_qs(vsi);
         ice_vsi_clear_rings(vsi);
         ice_vsi_free_arrays(vsi);
-       ice_dev_onetime_setup(&pf->hw);
         if (vsi->type == ICE_VSI_VF)
                 ice_vsi_set_num_qs(vsi, vf->vf_id);
         else
@@ -2765,8 +2757,7 @@ int ice_vsi_rebuild(struct ice_vsi *vsi, bool init_vsi)
         status = ice_cfg_vsi_lan(vsi->port_info, vsi->idx, vsi->tc_cfg.ena_tc,
                                  max_txqs);
         if (status) {
-               dev_err(ice_pf_to_dev(pf),
-                       "VSI %d failed lan queue config, error %d\n",
+               dev_err(ice_pf_to_dev(pf), "VSI %d failed lan queue config, error %d\n",
                         vsi->vsi_num, status);
                 if (init_vsi) {
                         ret = -EIO;
@@ -2834,8 +2825,8 @@ static void ice_vsi_update_q_map(struct ice_vsi *vsi, struct ice_vsi_ctx *ctx)
  int ice_vsi_cfg_tc(struct ice_vsi *vsi, u8 ena_tc)
  {
         u16 max_txqs[ICE_MAX_TRAFFIC_CLASS] = { 0 };
-       struct ice_vsi_ctx *ctx;
         struct ice_pf *pf = vsi->back;
+       struct ice_vsi_ctx *ctx;
         enum ice_status status;
         struct device *dev;
         int i, ret = 0;
@@ -2891,25 +2882,6 @@ out:
  }
  #endif /* CONFIG_DCB */
  
-/**
- * ice_nvm_version_str - format the NVM version strings
- * @hw: ptr to the hardware info
- */
-char *ice_nvm_version_str(struct ice_hw *hw)
-{
-       u8 oem_ver, oem_patch, ver_hi, ver_lo;
-       static char buf[ICE_NVM_VER_LEN];
-       u16 oem_build;
-
-       ice_get_nvm_version(hw, &oem_ver, &oem_build, &oem_patch, &ver_hi,
-                           &ver_lo);
-
-       snprintf(buf, sizeof(buf), "%x.%02x 0x%x %d.%d.%d", ver_hi, ver_lo,
-                hw->nvm.eetrack, oem_ver, oem_build, oem_patch);
-
-       return buf;
-}
-
  /**
   * ice_update_ring_stats - Update ring statistics
   * @ring: ring to update
@@ -2981,7 +2953,7 @@ ice_vsi_cfg_mac_fltr(struct ice_vsi *vsi, const u8 *macaddr, bool set)
                 status = ice_remove_mac(&vsi->back->hw, &tmp_add_list);
  
  cfg_mac_fltr_exit:
-       ice_free_fltr_list(&vsi->back->pdev->dev, &tmp_add_list);
+       ice_free_fltr_list(ice_pf_to_dev(vsi->back), &tmp_add_list);
         return status;
  }
  
@@ -3043,16 +3015,14 @@ int ice_set_dflt_vsi(struct ice_sw *sw, struct ice_vsi *vsi)
  
         /* another VSI is already the default VSI for this switch */
         if (ice_is_dflt_vsi_in_use(sw)) {
-               dev_err(dev,
-                       "Default forwarding VSI %d already in use, disable it and try again\n",
+               dev_err(dev, "Default forwarding VSI %d already in use, disable it and try again\n",
                         sw->dflt_vsi->vsi_num);
                 return -EEXIST;
         }
  
         status = ice_cfg_dflt_vsi(&vsi->back->hw, vsi->idx, true, ICE_FLTR_RX);
         if (status) {
-               dev_err(dev,
-                       "Failed to set VSI %d as the default forwarding VSI, error %d\n",
+               dev_err(dev, "Failed to set VSI %d as the default forwarding VSI, error %d\n",
                         vsi->vsi_num, status);
                 return -EIO;
         }
@@ -3091,8 +3061,7 @@ int ice_clear_dflt_vsi(struct ice_sw *sw)
         status = ice_cfg_dflt_vsi(&dflt_vsi->back->hw, dflt_vsi->idx, false,
                                   ICE_FLTR_RX);
         if (status) {
-               dev_err(dev,
-                       "Failed to clear the default forwarding VSI %d, error %d\n",
+               dev_err(dev, "Failed to clear the default forwarding VSI %d, error %d\n",
                         dflt_vsi->vsi_num, status);
                 return -EIO;
         }
diff --git a/drivers/net/ethernet/intel/ice/ice_lib.h b/drivers/net/ethernet/intel/ice/ice_lib.h

index 68fd0d4..e2c0dad 100644 (file)
--- a/drivers/net/ethernet/intel/ice/ice_lib.h
+++ b/drivers/net/ethernet/intel/ice/ice_lib.h
@@ -97,8 +97,6 @@ void ice_vsi_cfg_frame_size(struct ice_vsi *vsi);
  
  u32 ice_intrl_usec_to_reg(u8 intrl, u8 gran);
  
-char *ice_nvm_version_str(struct ice_hw *hw);
-
  enum ice_status
  ice_vsi_cfg_mac_fltr(struct ice_vsi *vsi, const u8 *macaddr, bool set);
  
diff --git a/drivers/net/ethernet/intel/ice/ice_main.c b/drivers/net/ethernet/intel/ice/ice_main.c

index 5ae6716..5ef2805 100644 (file)
--- a/drivers/net/ethernet/intel/ice/ice_main.c
+++ b/drivers/net/ethernet/intel/ice/ice_main.c
@@ -162,8 +162,7 @@ unregister:
          * had an error
          */
         if (status && vsi->netdev->reg_state == NETREG_REGISTERED) {
-               dev_err(ice_pf_to_dev(pf),
-                       "Could not add MAC filters error %d. Unregistering device\n",
+               dev_err(ice_pf_to_dev(pf), "Could not add MAC filters error %d. Unregistering device\n",
                         status);
                 unregister_netdev(vsi->netdev);
                 free_netdev(vsi->netdev);
@@ -269,7 +268,7 @@ static int ice_cfg_promisc(struct ice_vsi *vsi, u8 promisc_m, bool set_promisc)
   */
  static int ice_vsi_sync_fltr(struct ice_vsi *vsi)
  {
-       struct device *dev = &vsi->back->pdev->dev;
+       struct device *dev = ice_pf_to_dev(vsi->back);
         struct net_device *netdev = vsi->netdev;
         bool promisc_forced_on = false;
         struct ice_pf *pf = vsi->back;
@@ -335,8 +334,7 @@ static int ice_vsi_sync_fltr(struct ice_vsi *vsi)
                     !test_and_set_bit(__ICE_FLTR_OVERFLOW_PROMISC,
                                       vsi->state)) {
                         promisc_forced_on = true;
-                       netdev_warn(netdev,
-                                   "Reached MAC filter limit, forcing promisc mode on VSI %d\n",
+                       netdev_warn(netdev, "Reached MAC filter limit, forcing promisc mode on VSI %d\n",
                                     vsi->vsi_num);
                 } else {
                         err = -EIO;
@@ -382,8 +380,7 @@ static int ice_vsi_sync_fltr(struct ice_vsi *vsi)
                         if (!ice_is_dflt_vsi_in_use(pf->first_sw)) {
                                 err = ice_set_dflt_vsi(pf->first_sw, vsi);
                                 if (err && err != -EEXIST) {
-                                       netdev_err(netdev,
-                                                  "Error %d setting default VSI %i Rx rule\n",
+                                       netdev_err(netdev, "Error %d setting default VSI %i Rx rule\n",
                                                    err, vsi->vsi_num);
                                         vsi->current_netdev_flags &=
                                                 ~IFF_PROMISC;
@@ -395,8 +392,7 @@ static int ice_vsi_sync_fltr(struct ice_vsi *vsi)
                         if (ice_is_vsi_dflt_vsi(pf->first_sw, vsi)) {
                                 err = ice_clear_dflt_vsi(pf->first_sw);
                                 if (err) {
-                                       netdev_err(netdev,
-                                                  "Error %d clearing default VSI %i Rx rule\n",
+                                       netdev_err(netdev, "Error %d clearing default VSI %i Rx rule\n",
                                                    err, vsi->vsi_num);
                                         vsi->current_netdev_flags |=
                                                 IFF_PROMISC;
@@ -752,7 +748,7 @@ void ice_print_link_msg(struct ice_vsi *vsi, bool isup)
         kfree(caps);
  
  done:
-       netdev_info(vsi->netdev, "NIC Link is up %sbps, Requested FEC: %s, FEC: %s, Autoneg: %s, Flow Control: %s\n",
+       netdev_info(vsi->netdev, "NIC Link is up %sbps Full Duplex, Requested FEC: %s, Negotiated FEC: %s, Autoneg: %s, Flow Control: %s\n",
                     speed, fec_req, fec, an, fc);
         ice_print_topo_conflict(vsi);
  }
@@ -815,8 +811,7 @@ ice_link_event(struct ice_pf *pf, struct ice_port_info *pi, bool link_up,
          */
         result = ice_update_link_info(pi);
         if (result)
-               dev_dbg(dev,
-                       "Failed to update link status and re-enable link events for port %d\n",
+               dev_dbg(dev, "Failed to update link status and re-enable link events for port %d\n",
                         pi->lport);
  
         /* if the old link up/down and speed is the same as the new */
@@ -834,13 +829,13 @@ ice_link_event(struct ice_pf *pf, struct ice_port_info *pi, bool link_up,
  
                 result = ice_aq_set_link_restart_an(pi, false, NULL);
                 if (result) {
-                       dev_dbg(dev,
-                               "Failed to set link down, VSI %d error %d\n",
+                       dev_dbg(dev, "Failed to set link down, VSI %d error %d\n",
                                 vsi->vsi_num, result);
                         return result;
                 }
         }
  
+       ice_dcb_rebuild(pf);
         ice_vsi_link_event(vsi, link_up);
         ice_print_link_msg(vsi, link_up);
  
@@ -892,15 +887,13 @@ static int ice_init_link_events(struct ice_port_info *pi)
                        ICE_AQ_LINK_EVENT_MODULE_QUAL_FAIL));
  
         if (ice_aq_set_event_mask(pi->hw, pi->lport, mask, NULL)) {
-               dev_dbg(ice_hw_to_dev(pi->hw),
-                       "Failed to set link event mask for port %d\n",
+               dev_dbg(ice_hw_to_dev(pi->hw), "Failed to set link event mask for port %d\n",
                         pi->lport);
                 return -EIO;
         }
  
         if (ice_aq_get_link_info(pi, true, NULL, NULL)) {
-               dev_dbg(ice_hw_to_dev(pi->hw),
-                       "Failed to enable link events for port %d\n",
+               dev_dbg(ice_hw_to_dev(pi->hw), "Failed to enable link events for port %d\n",
                         pi->lport);
                 return -EIO;
         }
@@ -929,8 +922,8 @@ ice_handle_link_event(struct ice_pf *pf, struct ice_rq_event_info *event)
                                 !!(link_data->link_info & ICE_AQ_LINK_UP),
                                 le16_to_cpu(link_data->link_speed));
         if (status)
-               dev_dbg(ice_pf_to_dev(pf),
-                       "Could not process link event, error %d\n", status);
+               dev_dbg(ice_pf_to_dev(pf), "Could not process link event, error %d\n",
+                       status);
  
         return status;
  }
@@ -979,13 +972,11 @@ static int __ice_clean_ctrlq(struct ice_pf *pf, enum ice_ctl_q q_type)
                         dev_dbg(dev, "%s Receive Queue VF Error detected\n",
                                 qtype);
                 if (val & PF_FW_ARQLEN_ARQOVFL_M) {
-                       dev_dbg(dev,
-                               "%s Receive Queue Overflow Error detected\n",
+                       dev_dbg(dev, "%s Receive Queue Overflow Error detected\n",
                                 qtype);
                 }
                 if (val & PF_FW_ARQLEN_ARQCRIT_M)
-                       dev_dbg(dev,
-                               "%s Receive Queue Critical Error detected\n",
+                       dev_dbg(dev, "%s Receive Queue Critical Error detected\n",
                                 qtype);
                 val &= ~(PF_FW_ARQLEN_ARQVFE_M | PF_FW_ARQLEN_ARQOVFL_M |
                          PF_FW_ARQLEN_ARQCRIT_M);
@@ -998,8 +989,8 @@ static int __ice_clean_ctrlq(struct ice_pf *pf, enum ice_ctl_q q_type)
                    PF_FW_ATQLEN_ATQCRIT_M)) {
                 oldval = val;
                 if (val & PF_FW_ATQLEN_ATQVFE_M)
-                       dev_dbg(dev,
-                               "%s Send Queue VF Error detected\n", qtype);
+                       dev_dbg(dev, "%s Send Queue VF Error detected\n",
+                               qtype);
                 if (val & PF_FW_ATQLEN_ATQOVFL_M) {
                         dev_dbg(dev, "%s Send Queue Overflow Error detected\n",
                                 qtype);
@@ -1048,8 +1039,7 @@ static int __ice_clean_ctrlq(struct ice_pf *pf, enum ice_ctl_q q_type)
                         ice_dcb_process_lldp_set_mib_change(pf, &event);
                         break;
                 default:
-                       dev_dbg(dev,
-                               "%s Receive Queue unknown event 0x%04x ignored\n",
+                       dev_dbg(dev, "%s Receive Queue unknown event 0x%04x ignored\n",
                                 qtype, opcode);
                         break;
                 }
@@ -1238,7 +1228,7 @@ static void ice_handle_mdd_event(struct ice_pf *pf)
                 u16 queue = ((reg & GL_MDET_TX_TCLAN_QNUM_M) >>
                                 GL_MDET_TX_TCLAN_QNUM_S);
  
-               if (netif_msg_rx_err(pf))
+               if (netif_msg_tx_err(pf))
                         dev_info(dev, "Malicious Driver Detection event %d on TX queue %d PF# %d VF# %d\n",
                                  event, queue, pf_num, vf_num);
                 wr32(hw, GL_MDET_TX_TCLAN, 0xffffffff);
@@ -1335,8 +1325,7 @@ static void ice_handle_mdd_event(struct ice_pf *pf)
                         vf->num_mdd_events++;
                         if (vf->num_mdd_events &&
                             vf->num_mdd_events <= ICE_MDD_EVENTS_THRESHOLD)
-                               dev_info(dev,
-                                        "VF %d has had %llu MDD events since last boot, Admin might need to reload AVF driver with this number of events\n",
+                               dev_info(dev, "VF %d has had %llu MDD events since last boot, Admin might need to reload AVF driver with this number of events\n",
                                          i, vf->num_mdd_events);
                 }
         }
@@ -1367,7 +1356,7 @@ static int ice_force_phys_link_state(struct ice_vsi *vsi, bool link_up)
         if (vsi->type != ICE_VSI_PF)
                 return 0;
  
-       dev = &vsi->back->pdev->dev;
+       dev = ice_pf_to_dev(vsi->back);
  
         pi = vsi->port_info;
  
@@ -1378,8 +1367,7 @@ static int ice_force_phys_link_state(struct ice_vsi *vsi, bool link_up)
         retcode = ice_aq_get_phy_caps(pi, false, ICE_AQC_REPORT_SW_CFG, pcaps,
                                       NULL);
         if (retcode) {
-               dev_err(dev,
-                       "Failed to get phy capabilities, VSI %d error %d\n",
+               dev_err(dev, "Failed to get phy capabilities, VSI %d error %d\n",
                         vsi->vsi_num, retcode);
                 retcode = -EIO;
                 goto out;
@@ -1649,8 +1637,8 @@ static int ice_vsi_req_irq_msix(struct ice_vsi *vsi, char *basename)
                 err = devm_request_irq(dev, irq_num, vsi->irq_handler, 0,
                                        q_vector->name, q_vector);
                 if (err) {
-                       netdev_err(vsi->netdev,
-                                  "MSIX request_irq failed, error: %d\n", err);
+                       netdev_err(vsi->netdev, "MSIX request_irq failed, error: %d\n",
+                                  err);
                         goto free_q_irqs;
                 }
  
@@ -1685,7 +1673,7 @@ free_q_irqs:
   */
  static int ice_xdp_alloc_setup_rings(struct ice_vsi *vsi)
  {
-       struct device *dev = &vsi->back->pdev->dev;
+       struct device *dev = ice_pf_to_dev(vsi->back);
         int i;
  
         for (i = 0; i < vsi->num_xdp_txq; i++) {
@@ -2664,14 +2652,12 @@ static void ice_set_pf_caps(struct ice_pf *pf)
         clear_bit(ICE_FLAG_DCB_CAPABLE, pf->flags);
         if (func_caps->common_cap.dcb)
                 set_bit(ICE_FLAG_DCB_CAPABLE, pf->flags);
-#ifdef CONFIG_PCI_IOV
         clear_bit(ICE_FLAG_SRIOV_CAPABLE, pf->flags);
         if (func_caps->common_cap.sr_iov_1_1) {
                 set_bit(ICE_FLAG_SRIOV_CAPABLE, pf->flags);
                 pf->num_vfs_supported = min_t(int, func_caps->num_allocd_vfs,
                                               ICE_MAX_VF_COUNT);
         }
-#endif /* CONFIG_PCI_IOV */
         clear_bit(ICE_FLAG_RSS_ENA, pf->flags);
         if (func_caps->common_cap.rss_table_size)
                 set_bit(ICE_FLAG_RSS_ENA, pf->flags);
@@ -2764,8 +2750,7 @@ static int ice_ena_msix_range(struct ice_pf *pf)
         }
  
         if (v_actual < v_budget) {
-               dev_warn(dev,
-                        "not enough OS MSI-X vectors. requested = %d, obtained = %d\n",
+               dev_warn(dev, "not enough OS MSI-X vectors. requested = %d, obtained = %d\n",
                          v_budget, v_actual);
  /* 2 vectors for LAN (traffic + OICR) */
  #define ICE_MIN_LAN_VECS 2
@@ -2787,8 +2772,7 @@ msix_err:
         goto exit_err;
  
  no_hw_vecs_left_err:
-       dev_err(dev,
-               "not enough device MSI-X vectors. requested = %d, available = %d\n",
+       dev_err(dev, "not enough device MSI-X vectors. requested = %d, available = %d\n",
                 needed, v_left);
         err = -ERANGE;
  exit_err:
@@ -2921,16 +2905,14 @@ ice_log_pkg_init(struct ice_hw *hw, enum ice_status *status)
                     !memcmp(hw->pkg_name, hw->active_pkg_name,
                             sizeof(hw->pkg_name))) {
                         if (hw->pkg_dwnld_status == ICE_AQ_RC_EEXIST)
-                               dev_info(dev,
-                                        "DDP package already present on device: %s version %d.%d.%d.%d\n",
+                               dev_info(dev, "DDP package already present on device: %s version %d.%d.%d.%d\n",
                                          hw->active_pkg_name,
                                          hw->active_pkg_ver.major,
                                          hw->active_pkg_ver.minor,
                                          hw->active_pkg_ver.update,
                                          hw->active_pkg_ver.draft);
                         else
-                               dev_info(dev,
-                                        "The DDP package was successfully loaded: %s version %d.%d.%d.%d\n",
+                               dev_info(dev, "The DDP package was successfully loaded: %s version %d.%d.%d.%d\n",
                                          hw->active_pkg_name,
                                          hw->active_pkg_ver.major,
                                          hw->active_pkg_ver.minor,
@@ -2938,8 +2920,7 @@ ice_log_pkg_init(struct ice_hw *hw, enum ice_status *status)
                                          hw->active_pkg_ver.draft);
                 } else if (hw->active_pkg_ver.major != ICE_PKG_SUPP_VER_MAJ ||
                            hw->active_pkg_ver.minor != ICE_PKG_SUPP_VER_MNR) {
-                       dev_err(dev,
-                               "The device has a DDP package that is not supported by the driver.  The device has package '%s' version %d.%d.x.x.  The driver requires version %d.%d.x.x.  Entering Safe Mode.\n",
+                       dev_err(dev, "The device has a DDP package that is not supported by the driver.  The device has package '%s' version %d.%d.x.x.  The driver requires version %d.%d.x.x.  Entering Safe Mode.\n",
                                 hw->active_pkg_name,
                                 hw->active_pkg_ver.major,
                                 hw->active_pkg_ver.minor,
@@ -2947,8 +2928,7 @@ ice_log_pkg_init(struct ice_hw *hw, enum ice_status *status)
                         *status = ICE_ERR_NOT_SUPPORTED;
                 } else if (hw->active_pkg_ver.major == ICE_PKG_SUPP_VER_MAJ &&
                            hw->active_pkg_ver.minor == ICE_PKG_SUPP_VER_MNR) {
-                       dev_info(dev,
-                                "The driver could not load the DDP package file because a compatible DDP package is already present on the device.  The device has package '%s' version %d.%d.%d.%d.  The package file found by the driver: '%s' version %d.%d.%d.%d.\n",
+                       dev_info(dev, "The driver could not load the DDP package file because a compatible DDP package is already present on the device.  The device has package '%s' version %d.%d.%d.%d.  The package file found by the driver: '%s' version %d.%d.%d.%d.\n",
                                  hw->active_pkg_name,
                                  hw->active_pkg_ver.major,
                                  hw->active_pkg_ver.minor,
@@ -2960,54 +2940,46 @@ ice_log_pkg_init(struct ice_hw *hw, enum ice_status *status)
                                  hw->pkg_ver.update,
                                  hw->pkg_ver.draft);
                 } else {
-                       dev_err(dev,
-                               "An unknown error occurred when loading the DDP package, please reboot the system.  If the problem persists, update the NVM.  Entering Safe Mode.\n");
+                       dev_err(dev, "An unknown error occurred when loading the DDP package, please reboot the system.  If the problem persists, update the NVM.  Entering Safe Mode.\n");
                         *status = ICE_ERR_NOT_SUPPORTED;
                 }
                 break;
         case ICE_ERR_BUF_TOO_SHORT:
                 /* fall-through */
         case ICE_ERR_CFG:
-               dev_err(dev,
-                       "The DDP package file is invalid. Entering Safe Mode.\n");
+               dev_err(dev, "The DDP package file is invalid. Entering Safe Mode.\n");
                 break;
         case ICE_ERR_NOT_SUPPORTED:
                 /* Package File version not supported */
                 if (hw->pkg_ver.major > ICE_PKG_SUPP_VER_MAJ ||
                     (hw->pkg_ver.major == ICE_PKG_SUPP_VER_MAJ &&
                      hw->pkg_ver.minor > ICE_PKG_SUPP_VER_MNR))
-                       dev_err(dev,
-                               "The DDP package file version is higher than the driver supports.  Please use an updated driver.  Entering Safe Mode.\n");
+                       dev_err(dev, "The DDP package file version is higher than the driver supports.  Please use an updated driver.  Entering Safe Mode.\n");
                 else if (hw->pkg_ver.major < ICE_PKG_SUPP_VER_MAJ ||
                          (hw->pkg_ver.major == ICE_PKG_SUPP_VER_MAJ &&
                           hw->pkg_ver.minor < ICE_PKG_SUPP_VER_MNR))
-                       dev_err(dev,
-                               "The DDP package file version is lower than the driver supports.  The driver requires version %d.%d.x.x.  Please use an updated DDP Package file.  Entering Safe Mode.\n",
+                       dev_err(dev, "The DDP package file version is lower than the driver supports.  The driver requires version %d.%d.x.x.  Please use an updated DDP Package file.  Entering Safe Mode.\n",
                                 ICE_PKG_SUPP_VER_MAJ, ICE_PKG_SUPP_VER_MNR);
                 break;
         case ICE_ERR_AQ_ERROR:
                 switch (hw->pkg_dwnld_status) {
                 case ICE_AQ_RC_ENOSEC:
                 case ICE_AQ_RC_EBADSIG:
-                       dev_err(dev,
-                               "The DDP package could not be loaded because its signature is not valid.  Please use a valid DDP Package.  Entering Safe Mode.\n");
+                       dev_err(dev, "The DDP package could not be loaded because its signature is not valid.  Please use a valid DDP Package.  Entering Safe Mode.\n");
                         return;
                 case ICE_AQ_RC_ESVN:
-                       dev_err(dev,
-                               "The DDP Package could not be loaded because its security revision is too low.  Please use an updated DDP Package.  Entering Safe Mode.\n");
+                       dev_err(dev, "The DDP Package could not be loaded because its security revision is too low.  Please use an updated DDP Package.  Entering Safe Mode.\n");
                         return;
                 case ICE_AQ_RC_EBADMAN:
                 case ICE_AQ_RC_EBADBUF:
-                       dev_err(dev,
-                               "An error occurred on the device while loading the DDP package.  The device will be reset.\n");
+                       dev_err(dev, "An error occurred on the device while loading the DDP package.  The device will be reset.\n");
                         return;
                 default:
                         break;
                 }
                 /* fall-through */
         default:
-               dev_err(dev,
-                       "An unknown error (%d) occurred when loading the DDP package.  Entering Safe Mode.\n",
+               dev_err(dev, "An unknown error (%d) occurred when loading the DDP package.  Entering Safe Mode.\n",
                         *status);
                 break;
         }
@@ -3038,8 +3010,7 @@ ice_load_pkg(const struct firmware *firmware, struct ice_pf *pf)
                 status = ice_init_pkg(hw, hw->pkg_copy, hw->pkg_size);
                 ice_log_pkg_init(hw, &status);
         } else {
-               dev_err(dev,
-                       "The DDP package file failed to load. Entering Safe Mode.\n");
+               dev_err(dev, "The DDP package file failed to load. Entering Safe Mode.\n");
         }
  
         if (status) {
@@ -3065,8 +3036,7 @@ ice_load_pkg(const struct firmware *firmware, struct ice_pf *pf)
  static void ice_verify_cacheline_size(struct ice_pf *pf)
  {
         if (rd32(&pf->hw, GLPCI_CNF2) & GLPCI_CNF2_CACHELINE_SIZE_M)
-               dev_warn(ice_pf_to_dev(pf),
-                        "%d Byte cache line assumption is invalid, driver may have Tx timeouts!\n",
+               dev_warn(ice_pf_to_dev(pf), "%d Byte cache line assumption is invalid, driver may have Tx timeouts!\n",
                          ICE_CACHE_LINE_BYTES);
  }
  
@@ -3159,8 +3129,7 @@ static void ice_request_fw(struct ice_pf *pf)
  dflt_pkg_load:
         err = request_firmware(&firmware, ICE_DDP_PKG_FILE, dev);
         if (err) {
-               dev_err(dev,
-                       "The DDP package file was not found or could not be read. Entering Safe Mode\n");
+               dev_err(dev, "The DDP package file was not found or could not be read. Entering Safe Mode\n");
                 return;
         }
  
@@ -3184,7 +3153,9 @@ ice_probe(struct pci_dev *pdev, const struct pci_device_id __always_unused *ent)
         struct ice_hw *hw;
         int err;
  
-       /* this driver uses devres, see Documentation/driver-api/driver-model/devres.rst */
+       /* this driver uses devres, see
+        * Documentation/driver-api/driver-model/devres.rst
+        */
         err = pcim_enable_device(pdev);
         if (err)
                 return err;
@@ -3245,11 +3216,6 @@ ice_probe(struct pci_dev *pdev, const struct pci_device_id __always_unused *ent)
                 goto err_exit_unroll;
         }
  
-       dev_info(dev, "firmware %d.%d.%d api %d.%d.%d nvm %s build 0x%08x\n",
-                hw->fw_maj_ver, hw->fw_min_ver, hw->fw_patch,
-                hw->api_maj_ver, hw->api_min_ver, hw->api_patch,
-                ice_nvm_version_str(hw), hw->fw_build);
-
         ice_request_fw(pf);
  
         /* if ice_request_fw fails, ICE_FLAG_ADV_FEATURES bit won't be
@@ -3257,8 +3223,7 @@ ice_probe(struct pci_dev *pdev, const struct pci_device_id __always_unused *ent)
          * true
          */
         if (ice_is_safe_mode(pf)) {
-               dev_err(dev,
-                       "Package download failed. Advanced features disabled - Device now in Safe Mode\n");
+               dev_err(dev, "Package download failed. Advanced features disabled - Device now in Safe Mode\n");
                 /* we already got function/device capabilities but these don't
                  * reflect what the driver needs to do in safe mode. Instead of
                  * adding conditional logic everywhere to ignore these
@@ -3335,8 +3300,7 @@ ice_probe(struct pci_dev *pdev, const struct pci_device_id __always_unused *ent)
         /* tell the firmware we are up */
         err = ice_send_version(pf);
         if (err) {
-               dev_err(dev,
-                       "probe failed sending driver version %s. error: %d\n",
+               dev_err(dev, "probe failed sending driver version %s. error: %d\n",
                         ice_drv_ver, err);
                 goto err_alloc_sw_unroll;
         }
@@ -3477,8 +3441,7 @@ static pci_ers_result_t ice_pci_err_slot_reset(struct pci_dev *pdev)
  
         err = pci_enable_device_mem(pdev);
         if (err) {
-               dev_err(&pdev->dev,
-                       "Cannot re-enable PCI device after reset, error %d\n",
+               dev_err(&pdev->dev, "Cannot re-enable PCI device after reset, error %d\n",
                         err);
                 result = PCI_ERS_RESULT_DISCONNECT;
         } else {
@@ -3497,8 +3460,7 @@ static pci_ers_result_t ice_pci_err_slot_reset(struct pci_dev *pdev)
  
         err = pci_cleanup_aer_uncorrect_error_status(pdev);
         if (err)
-               dev_dbg(&pdev->dev,
-                       "pci_cleanup_aer_uncorrect_error_status failed, error %d\n",
+               dev_dbg(&pdev->dev, "pci_cleanup_aer_uncorrect_error_status failed, error %d\n",
                         err);
                 /* non-fatal, continue */
  
@@ -3517,8 +3479,8 @@ static void ice_pci_err_resume(struct pci_dev *pdev)
         struct ice_pf *pf = pci_get_drvdata(pdev);
  
         if (!pf) {
-               dev_err(&pdev->dev,
-                       "%s failed, device is unrecoverable\n", __func__);
+               dev_err(&pdev->dev, "%s failed, device is unrecoverable\n",
+                       __func__);
                 return;
         }
  
@@ -3766,8 +3728,7 @@ ice_set_tx_maxrate(struct net_device *netdev, int queue_index, u32 maxrate)
  
         /* Validate maxrate requested is within permitted range */
         if (maxrate && (maxrate > (ICE_SCHED_MAX_BW / 1000))) {
-               netdev_err(netdev,
-                          "Invalid max rate %d specified for the queue %d\n",
+               netdev_err(netdev, "Invalid max rate %d specified for the queue %d\n",
                            maxrate, queue_index);
                 return -EINVAL;
         }
@@ -3783,8 +3744,8 @@ ice_set_tx_maxrate(struct net_device *netdev, int queue_index, u32 maxrate)
                 status = ice_cfg_q_bw_lmt(vsi->port_info, vsi->idx, tc,
                                           q_handle, ICE_MAX_BW, maxrate * 1000);
         if (status) {
-               netdev_err(netdev,
-                          "Unable to set Tx max rate, error %d\n", status);
+               netdev_err(netdev, "Unable to set Tx max rate, error %d\n",
+                          status);
                 return -EIO;
         }
  
@@ -3876,15 +3837,13 @@ ice_set_features(struct net_device *netdev, netdev_features_t features)
  
         /* Don't set any netdev advanced features with device in Safe Mode */
         if (ice_is_safe_mode(vsi->back)) {
-               dev_err(&vsi->back->pdev->dev,
-                       "Device is in Safe Mode - not enabling advanced netdev features\n");
+               dev_err(ice_pf_to_dev(vsi->back), "Device is in Safe Mode - not enabling advanced netdev features\n");
                 return ret;
         }
  
         /* Do not change setting during reset */
         if (ice_is_reset_in_progress(pf->state)) {
-               dev_err(&vsi->back->pdev->dev,
-                       "Device is resetting, changing advanced netdev features temporarily unavailable.\n");
+               dev_err(ice_pf_to_dev(vsi->back), "Device is resetting, changing advanced netdev features temporarily unavailable.\n");
                 return -EBUSY;
         }
  
@@ -4372,21 +4331,18 @@ int ice_down(struct ice_vsi *vsi)
  
         tx_err = ice_vsi_stop_lan_tx_rings(vsi, ICE_NO_RESET, 0);
         if (tx_err)
-               netdev_err(vsi->netdev,
-                          "Failed stop Tx rings, VSI %d error %d\n",
+               netdev_err(vsi->netdev, "Failed stop Tx rings, VSI %d error %d\n",
                            vsi->vsi_num, tx_err);
         if (!tx_err && ice_is_xdp_ena_vsi(vsi)) {
                 tx_err = ice_vsi_stop_xdp_tx_rings(vsi);
                 if (tx_err)
-                       netdev_err(vsi->netdev,
-                                  "Failed stop XDP rings, VSI %d error %d\n",
+                       netdev_err(vsi->netdev, "Failed stop XDP rings, VSI %d error %d\n",
                                    vsi->vsi_num, tx_err);
         }
  
         rx_err = ice_vsi_stop_rx_rings(vsi);
         if (rx_err)
-               netdev_err(vsi->netdev,
-                          "Failed stop Rx rings, VSI %d error %d\n",
+               netdev_err(vsi->netdev, "Failed stop Rx rings, VSI %d error %d\n",
                            vsi->vsi_num, rx_err);
  
         ice_napi_disable_all(vsi);
@@ -4394,8 +4350,7 @@ int ice_down(struct ice_vsi *vsi)
         if (test_bit(ICE_FLAG_LINK_DOWN_ON_CLOSE_ENA, vsi->back->flags)) {
                 link_err = ice_force_phys_link_state(vsi, false);
                 if (link_err)
-                       netdev_err(vsi->netdev,
-                                  "Failed to set physical link down, VSI %d error %d\n",
+                       netdev_err(vsi->netdev, "Failed to set physical link down, VSI %d error %d\n",
                                    vsi->vsi_num, link_err);
         }
  
@@ -4406,8 +4361,7 @@ int ice_down(struct ice_vsi *vsi)
                 ice_clean_rx_ring(vsi->rx_rings[i]);
  
         if (tx_err || rx_err || link_err) {
-               netdev_err(vsi->netdev,
-                          "Failed to close VSI 0x%04X on switch 0x%04X\n",
+               netdev_err(vsi->netdev, "Failed to close VSI 0x%04X on switch 0x%04X\n",
                            vsi->vsi_num, vsi->vsw->sw_id);
                 return -EIO;
         }
@@ -4426,7 +4380,7 @@ int ice_vsi_setup_tx_rings(struct ice_vsi *vsi)
         int i, err = 0;
  
         if (!vsi->num_txq) {
-               dev_err(&vsi->back->pdev->dev, "VSI %d has 0 Tx queues\n",
+               dev_err(ice_pf_to_dev(vsi->back), "VSI %d has 0 Tx queues\n",
                         vsi->vsi_num);
                 return -EINVAL;
         }
@@ -4457,7 +4411,7 @@ int ice_vsi_setup_rx_rings(struct ice_vsi *vsi)
         int i, err = 0;
  
         if (!vsi->num_rxq) {
-               dev_err(&vsi->back->pdev->dev, "VSI %d has 0 Rx queues\n",
+               dev_err(ice_pf_to_dev(vsi->back), "VSI %d has 0 Rx queues\n",
                         vsi->vsi_num);
                 return -EINVAL;
         }
@@ -4554,8 +4508,7 @@ static void ice_vsi_release_all(struct ice_pf *pf)
  
                 err = ice_vsi_release(pf->vsi[i]);
                 if (err)
-                       dev_dbg(ice_pf_to_dev(pf),
-                               "Failed to release pf->vsi[%d], err %d, vsi_num = %d\n",
+                       dev_dbg(ice_pf_to_dev(pf), "Failed to release pf->vsi[%d], err %d, vsi_num = %d\n",
                                 i, err, pf->vsi[i]->vsi_num);
         }
  }
@@ -4582,8 +4535,7 @@ static int ice_vsi_rebuild_by_type(struct ice_pf *pf, enum ice_vsi_type type)
                 /* rebuild the VSI */
                 err = ice_vsi_rebuild(vsi, true);
                 if (err) {
-                       dev_err(dev,
-                               "rebuild VSI failed, err %d, VSI index %d, type %s\n",
+                       dev_err(dev, "rebuild VSI failed, err %d, VSI index %d, type %s\n",
                                 err, vsi->idx, ice_vsi_type_str(type));
                         return err;
                 }
@@ -4591,8 +4543,7 @@ static int ice_vsi_rebuild_by_type(struct ice_pf *pf, enum ice_vsi_type type)
                 /* replay filters for the VSI */
                 status = ice_replay_vsi(&pf->hw, vsi->idx);
                 if (status) {
-                       dev_err(dev,
-                               "replay VSI failed, status %d, VSI index %d, type %s\n",
+                       dev_err(dev, "replay VSI failed, status %d, VSI index %d, type %s\n",
                                 status, vsi->idx, ice_vsi_type_str(type));
                         return -EIO;
                 }
@@ -4605,8 +4556,7 @@ static int ice_vsi_rebuild_by_type(struct ice_pf *pf, enum ice_vsi_type type)
                 /* enable the VSI */
                 err = ice_ena_vsi(vsi, false);
                 if (err) {
-                       dev_err(dev,
-                               "enable VSI failed, err %d, VSI index %d, type %s\n",
+                       dev_err(dev, "enable VSI failed, err %d, VSI index %d, type %s\n",
                                 err, vsi->idx, ice_vsi_type_str(type));
                         return err;
                 }
@@ -4684,8 +4634,7 @@ static void ice_rebuild(struct ice_pf *pf, enum ice_reset_req reset_type)
         }
  
         if (pf->first_sw->dflt_vsi_ena)
-               dev_info(dev,
-                        "Clearing default VSI, re-enable after reset completes\n");
+               dev_info(dev, "Clearing default VSI, re-enable after reset completes\n");
         /* clear the default VSI configuration if it exists */
         pf->first_sw->dflt_vsi = NULL;
         pf->first_sw->dflt_vsi_ena = false;
@@ -4736,8 +4685,7 @@ static void ice_rebuild(struct ice_pf *pf, enum ice_reset_req reset_type)
         /* tell the firmware we are up */
         ret = ice_send_version(pf);
         if (ret) {
-               dev_err(dev,
-                       "Rebuild failed due to error sending driver version: %d\n",
+               dev_err(dev, "Rebuild failed due to error sending driver version: %d\n",
                         ret);
                 goto err_vsi_rebuild;
         }
@@ -4993,7 +4941,7 @@ static int ice_vsi_update_bridge_mode(struct ice_vsi *vsi, u16 bmode)
  
         status = ice_update_vsi(hw, vsi->idx, ctxt, NULL);
         if (status) {
-               dev_err(&vsi->back->pdev->dev, "update VSI for bridge mode failed, bmode = %d err %d aq_err %d\n",
+               dev_err(ice_pf_to_dev(vsi->back), "update VSI for bridge mode failed, bmode = %d err %d aq_err %d\n",
                         bmode, status, hw->adminq.sq_last_status);
                 ret = -EIO;
                 goto out;
@@ -5185,8 +5133,7 @@ int ice_open(struct net_device *netdev)
         if (pi->phy.link_info.link_info & ICE_AQ_MEDIA_AVAILABLE) {
                 err = ice_force_phys_link_state(vsi, true);
                 if (err) {
-                       netdev_err(netdev,
-                                  "Failed to set physical link up, error %d\n",
+                       netdev_err(netdev, "Failed to set physical link up, error %d\n",
                                    err);
                         return err;
                 }
diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.c b/drivers/net/ethernet/intel/ice/ice_txrx.c

index fd17ace..4de61db 100644 (file)
--- a/drivers/net/ethernet/intel/ice/ice_txrx.c
+++ b/drivers/net/ethernet/intel/ice/ice_txrx.c
@@ -644,7 +644,7 @@ static bool ice_page_is_reserved(struct page *page)
   * Update the offset within page so that Rx buf will be ready to be reused.
   * For systems with PAGE_SIZE < 8192 this function will flip the page offset
   * so the second half of page assigned to Rx buffer will be used, otherwise
- * the offset is moved by the @size bytes
+ * the offset is moved by "size" bytes
   */
  static void
  ice_rx_buf_adjust_pg_offset(struct ice_rx_buf *rx_buf, unsigned int size)
@@ -1078,8 +1078,6 @@ construct_skb:
                                 skb = ice_build_skb(rx_ring, rx_buf, &xdp);
                         else
                                 skb = ice_construct_skb(rx_ring, rx_buf, &xdp);
-               } else {
-                       skb = ice_construct_skb(rx_ring, rx_buf, &xdp);
                 }
                 /* exit if we failed to retrieve a buffer */
                 if (!skb) {
@@ -1621,11 +1619,11 @@ ice_tx_map(struct ice_ring *tx_ring, struct ice_tx_buf *first,
  {
         u64 td_offset, td_tag, td_cmd;
         u16 i = tx_ring->next_to_use;
-       skb_frag_t *frag;
         unsigned int data_len, size;
         struct ice_tx_desc *tx_desc;
         struct ice_tx_buf *tx_buf;
         struct sk_buff *skb;
+       skb_frag_t *frag;
         dma_addr_t dma;
  
         td_tag = off->td_l2tag1;
@@ -1738,9 +1736,8 @@ ice_tx_map(struct ice_ring *tx_ring, struct ice_tx_buf *first,
         ice_maybe_stop_tx(tx_ring, DESC_NEEDED);
  
         /* notify HW of packet */
-       if (netif_xmit_stopped(txring_txq(tx_ring)) || !netdev_xmit_more()) {
+       if (netif_xmit_stopped(txring_txq(tx_ring)) || !netdev_xmit_more())
                 writel(i, tx_ring->tail);
-       }
  
         return;
  
@@ -2078,7 +2075,7 @@ static bool __ice_chk_linearize(struct sk_buff *skb)
         frag = &skb_shinfo(skb)->frags[0];
  
         /* Initialize size to the negative value of gso_size minus 1. We
-        * use this as the worst case scenerio in which the frag ahead
+        * use this as the worst case scenario in which the frag ahead
          * of us only provides one byte which is why we are limited to 6
          * descriptors for a single transmit as the header and previous
          * fragment are already consuming 2 descriptors.
diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.h b/drivers/net/ethernet/intel/ice/ice_txrx.h

index a862706..14a1bf4 100644 (file)
--- a/drivers/net/ethernet/intel/ice/ice_txrx.h
+++ b/drivers/net/ethernet/intel/ice/ice_txrx.h
@@ -33,8 +33,8 @@
   * frame.
   *
   * Note: For cache line sizes 256 or larger this value is going to end
- *       up negative.  In these cases we should fall back to the legacy
- *       receive path.
+ *      up negative.  In these cases we should fall back to the legacy
+ *      receive path.
   */
  #if (PAGE_SIZE < 8192)
  #define ICE_2K_TOO_SMALL_WITH_PADDING \
diff --git a/drivers/net/ethernet/intel/ice/ice_txrx_lib.c b/drivers/net/ethernet/intel/ice/ice_txrx_lib.c

index 35bbc4f..6da048a 100644 (file)
--- a/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
+++ b/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
@@ -10,7 +10,7 @@
   */
  void ice_release_rx_desc(struct ice_ring *rx_ring, u32 val)
  {
-       u16 prev_ntu = rx_ring->next_to_use;
+       u16 prev_ntu = rx_ring->next_to_use & ~0x7;
  
         rx_ring->next_to_use = val;
  
diff --git a/drivers/net/ethernet/intel/ice/ice_type.h b/drivers/net/ethernet/intel/ice/ice_type.h

index b361ffa..db0ef6b 100644 (file)
--- a/drivers/net/ethernet/intel/ice/ice_type.h
+++ b/drivers/net/ethernet/intel/ice/ice_type.h
@@ -517,7 +517,7 @@ struct ice_hw {
         struct ice_fw_log_cfg fw_log;
  
  /* Device max aggregate bandwidths corresponding to the GL_PWR_MODE_CTL
- * register. Used for determining the ITR/intrl granularity during
+ * register. Used for determining the ITR/INTRL granularity during
   * initialization.
   */
  #define ICE_MAX_AGG_BW_200G    0x0
diff --git a/drivers/net/ethernet/intel/ice/ice_virtchnl_pf.c b/drivers/net/ethernet/intel/ice/ice_virtchnl_pf.c

index 82b1e7a..262714d 100644 (file)
--- a/drivers/net/ethernet/intel/ice/ice_virtchnl_pf.c
+++ b/drivers/net/ethernet/intel/ice/ice_virtchnl_pf.c
@@ -199,8 +199,7 @@ static void ice_dis_vf_mappings(struct ice_vf *vf)
         if (vsi->rx_mapping_mode == ICE_VSI_MAP_CONTIG)
                 wr32(hw, VPLAN_RX_QBASE(vf->vf_id), 0);
         else
-               dev_err(dev,
-                       "Scattered mode for VF Rx queues is not yet implemented\n");
+               dev_err(dev, "Scattered mode for VF Rx queues is not yet implemented\n");
  }
  
  /**
@@ -402,8 +401,7 @@ static void ice_trigger_vf_reset(struct ice_vf *vf, bool is_vflr, bool is_pfr)
                 if ((reg & VF_TRANS_PENDING_M) == 0)
                         break;
  
-               dev_err(dev,
-                       "VF %d PCI transactions stuck\n", vf->vf_id);
+               dev_err(dev, "VF %d PCI transactions stuck\n", vf->vf_id);
                 udelay(ICE_PCI_CIAD_WAIT_DELAY_US);
         }
  }
@@ -462,7 +460,7 @@ static int ice_vsi_manage_pvid(struct ice_vsi *vsi, u16 vid, bool enable)
  
         status = ice_update_vsi(hw, vsi->idx, ctxt, NULL);
         if (status) {
-               dev_info(&vsi->back->pdev->dev, "update VSI for port VLAN failed, err %d aq_err %d\n",
+               dev_info(ice_pf_to_dev(vsi->back), "update VSI for port VLAN failed, err %d aq_err %d\n",
                          status, hw->adminq.sq_last_status);
                 ret = -EIO;
                 goto out;
@@ -1095,7 +1093,6 @@ bool ice_reset_all_vfs(struct ice_pf *pf, bool is_vflr)
          * finished resetting.
          */
         for (i = 0, v = 0; i < 10 && v < pf->num_alloc_vfs; i++) {
-
                 /* Check each VF in sequence */
                 while (v < pf->num_alloc_vfs) {
                         u32 reg;
@@ -1553,8 +1550,7 @@ ice_vc_send_msg_to_vf(struct ice_vf *vf, u32 v_opcode,
                 dev_info(dev, "VF %d failed opcode %d, retval: %d\n", vf->vf_id,
                          v_opcode, v_retval);
                 if (vf->num_inval_msgs > ICE_DFLT_NUM_INVAL_MSGS_ALLOWED) {
-                       dev_err(dev,
-                               "Number of invalid messages exceeded for VF %d\n",
+                       dev_err(dev, "Number of invalid messages exceeded for VF %d\n",
                                 vf->vf_id);
                         dev_err(dev, "Use PF Control I/F to enable the VF\n");
                         set_bit(ICE_VF_STATE_DIS, vf->vf_states);
@@ -1569,8 +1565,7 @@ ice_vc_send_msg_to_vf(struct ice_vf *vf, u32 v_opcode,
         aq_ret = ice_aq_send_msg_to_vf(&pf->hw, vf->vf_id, v_opcode, v_retval,
                                        msg, msglen, NULL);
         if (aq_ret && pf->hw.mailboxq.sq_last_status != ICE_AQ_RC_ENOSYS) {
-               dev_info(dev,
-                        "Unable to send the message to VF %d ret %d aq_err %d\n",
+               dev_info(dev, "Unable to send the message to VF %d ret %d aq_err %d\n",
                          vf->vf_id, aq_ret, pf->hw.mailboxq.sq_last_status);
                 return -EIO;
         }
@@ -1914,8 +1909,7 @@ int ice_set_vf_spoofchk(struct net_device *netdev, int vf_id, bool ena)
         }
  
         if (vf_vsi->type != ICE_VSI_VF) {
-               netdev_err(netdev,
-                          "Type %d of VSI %d for VF %d is no ICE_VSI_VF\n",
+               netdev_err(netdev, "Type %d of VSI %d for VF %d is no ICE_VSI_VF\n",
                            vf_vsi->type, vf_vsi->vsi_num, vf->vf_id);
                 return -ENODEV;
         }
@@ -1945,8 +1939,7 @@ int ice_set_vf_spoofchk(struct net_device *netdev, int vf_id, bool ena)
  
         status = ice_update_vsi(&pf->hw, vf_vsi->idx, ctx, NULL);
         if (status) {
-               dev_err(dev,
-                       "Failed to %sable spoofchk on VF %d VSI %d\n error %d",
+               dev_err(dev, "Failed to %sable spoofchk on VF %d VSI %d\n error %d",
                         ena ? "en" : "dis", vf->vf_id, vf_vsi->vsi_num, status);
                 ret = -EIO;
                 goto out;
@@ -2063,8 +2056,7 @@ static int ice_vc_ena_qs_msg(struct ice_vf *vf, u8 *msg)
                         continue;
  
                 if (ice_vsi_ctrl_rx_ring(vsi, true, vf_q_id)) {
-                       dev_err(&vsi->back->pdev->dev,
-                               "Failed to enable Rx ring %d on VSI %d\n",
+                       dev_err(ice_pf_to_dev(vsi->back), "Failed to enable Rx ring %d on VSI %d\n",
                                 vf_q_id, vsi->vsi_num);
                         v_ret = VIRTCHNL_STATUS_ERR_PARAM;
                         goto error_param;
@@ -2166,8 +2158,7 @@ static int ice_vc_dis_qs_msg(struct ice_vf *vf, u8 *msg)
  
                         if (ice_vsi_stop_tx_ring(vsi, ICE_NO_RESET, vf->vf_id,
                                                  ring, &txq_meta)) {
-                               dev_err(&vsi->back->pdev->dev,
-                                       "Failed to stop Tx ring %d on VSI %d\n",
+                               dev_err(ice_pf_to_dev(vsi->back), "Failed to stop Tx ring %d on VSI %d\n",
                                         vf_q_id, vsi->vsi_num);
                                 v_ret = VIRTCHNL_STATUS_ERR_PARAM;
                                 goto error_param;
@@ -2193,8 +2184,7 @@ static int ice_vc_dis_qs_msg(struct ice_vf *vf, u8 *msg)
                                 continue;
  
                         if (ice_vsi_ctrl_rx_ring(vsi, false, vf_q_id)) {
-                               dev_err(&vsi->back->pdev->dev,
-                                       "Failed to stop Rx ring %d on VSI %d\n",
+                               dev_err(ice_pf_to_dev(vsi->back), "Failed to stop Rx ring %d on VSI %d\n",
                                         vf_q_id, vsi->vsi_num);
                                 v_ret = VIRTCHNL_STATUS_ERR_PARAM;
                                 goto error_param;
@@ -2357,8 +2347,7 @@ static int ice_vc_cfg_qs_msg(struct ice_vf *vf, u8 *msg)
  
         if (qci->num_queue_pairs > ICE_MAX_BASE_QS_PER_VF ||
             qci->num_queue_pairs > min_t(u16, vsi->alloc_txq, vsi->alloc_rxq)) {
-               dev_err(ice_pf_to_dev(pf),
-                       "VF-%d requesting more than supported number of queues: %d\n",
+               dev_err(ice_pf_to_dev(pf), "VF-%d requesting more than supported number of queues: %d\n",
                         vf->vf_id, min_t(u16, vsi->alloc_txq, vsi->alloc_rxq));
                 v_ret = VIRTCHNL_STATUS_ERR_PARAM;
                 goto error_param;
@@ -2570,8 +2559,7 @@ ice_vc_handle_mac_addr_msg(struct ice_vf *vf, u8 *msg, bool set)
          */
         if (set && !ice_is_vf_trusted(vf) &&
             (vf->num_mac + al->num_elements) > ICE_MAX_MACADDR_PER_VF) {
-               dev_err(ice_pf_to_dev(pf),
-                       "Can't add more MAC addresses, because VF-%d is not trusted, switch the VF to trusted mode in order to add more functionalities\n",
+               dev_err(ice_pf_to_dev(pf), "Can't add more MAC addresses, because VF-%d is not trusted, switch the VF to trusted mode in order to add more functionalities\n",
                         vf->vf_id);
                 v_ret = VIRTCHNL_STATUS_ERR_PARAM;
                 goto handle_mac_exit;
@@ -2648,8 +2636,8 @@ static int ice_vc_request_qs_msg(struct ice_vf *vf, u8 *msg)
         struct ice_pf *pf = vf->pf;
         u16 max_allowed_vf_queues;
         u16 tx_rx_queue_left;
-       u16 cur_queues;
         struct device *dev;
+       u16 cur_queues;
  
         dev = ice_pf_to_dev(pf);
         if (!test_bit(ICE_VF_STATE_ACTIVE, vf->vf_states)) {
@@ -2670,8 +2658,7 @@ static int ice_vc_request_qs_msg(struct ice_vf *vf, u8 *msg)
                 vfres->num_queue_pairs = ICE_MAX_BASE_QS_PER_VF;
         } else if (req_queues > cur_queues &&
                    req_queues - cur_queues > tx_rx_queue_left) {
-               dev_warn(dev,
-                        "VF %d requested %u more queues, but only %u left.\n",
+               dev_warn(dev, "VF %d requested %u more queues, but only %u left.\n",
                          vf->vf_id, req_queues - cur_queues, tx_rx_queue_left);
                 vfres->num_queue_pairs = min_t(u16, max_allowed_vf_queues,
                                                ICE_MAX_BASE_QS_PER_VF);
@@ -2821,8 +2808,8 @@ static int ice_vc_process_vlan_msg(struct ice_vf *vf, u8 *msg, bool add_v)
         for (i = 0; i < vfl->num_elements; i++) {
                 if (vfl->vlan_id[i] > ICE_MAX_VLANID) {
                         v_ret = VIRTCHNL_STATUS_ERR_PARAM;
-                       dev_err(dev,
-                               "invalid VF VLAN id %d\n", vfl->vlan_id[i]);
+                       dev_err(dev, "invalid VF VLAN id %d\n",
+                               vfl->vlan_id[i]);
                         goto error_param;
                 }
         }
@@ -2836,8 +2823,7 @@ static int ice_vc_process_vlan_msg(struct ice_vf *vf, u8 *msg, bool add_v)
  
         if (add_v && !ice_is_vf_trusted(vf) &&
             vsi->num_vlan >= ICE_MAX_VLAN_PER_VF) {
-               dev_info(dev,
-                        "VF-%d is not trusted, switch the VF to trusted mode, in order to add more VLAN addresses\n",
+               dev_info(dev, "VF-%d is not trusted, switch the VF to trusted mode, in order to add more VLAN addresses\n",
                          vf->vf_id);
                 /* There is no need to let VF know about being not trusted,
                  * so we can just return success message here
@@ -2860,8 +2846,7 @@ static int ice_vc_process_vlan_msg(struct ice_vf *vf, u8 *msg, bool add_v)
  
                         if (!ice_is_vf_trusted(vf) &&
                             vsi->num_vlan >= ICE_MAX_VLAN_PER_VF) {
-                               dev_info(dev,
-                                        "VF-%d is not trusted, switch the VF to trusted mode, in order to add more VLAN addresses\n",
+                               dev_info(dev, "VF-%d is not trusted, switch the VF to trusted mode, in order to add more VLAN addresses\n",
                                          vf->vf_id);
                                 /* There is no need to let VF know about being
                                  * not trusted, so we can just return success
@@ -2889,8 +2874,7 @@ static int ice_vc_process_vlan_msg(struct ice_vf *vf, u8 *msg, bool add_v)
                                 status = ice_cfg_vlan_pruning(vsi, true, false);
                                 if (status) {
                                         v_ret = VIRTCHNL_STATUS_ERR_PARAM;
-                                       dev_err(dev,
-                                               "Enable VLAN pruning on VLAN ID: %d failed error-%d\n",
+                                       dev_err(dev, "Enable VLAN pruning on VLAN ID: %d failed error-%d\n",
                                                 vid, status);
                                         goto error_param;
                                 }
@@ -2903,8 +2887,7 @@ static int ice_vc_process_vlan_msg(struct ice_vf *vf, u8 *msg, bool add_v)
                                                              promisc_m, vid);
                                 if (status) {
                                         v_ret = VIRTCHNL_STATUS_ERR_PARAM;
-                                       dev_err(dev,
-                                               "Enable Unicast/multicast promiscuous mode on VLAN ID:%d failed error-%d\n",
+                                       dev_err(dev, "Enable Unicast/multicast promiscuous mode on VLAN ID:%d failed error-%d\n",
                                                 vid, status);
                                 }
                         }
@@ -3140,8 +3123,7 @@ error_handler:
         case VIRTCHNL_OP_GET_VF_RESOURCES:
                 err = ice_vc_get_vf_res_msg(vf, msg);
                 if (ice_vf_init_vlan_stripping(vf))
-                       dev_err(dev,
-                               "Failed to initialize VLAN stripping for VF %d\n",
+                       dev_err(dev, "Failed to initialize VLAN stripping for VF %d\n",
                                 vf->vf_id);
                 ice_vc_notify_vf_link_state(vf);
                 break;
@@ -3313,8 +3295,7 @@ int ice_set_vf_mac(struct net_device *netdev, int vf_id, u8 *mac)
          */
         ether_addr_copy(vf->dflt_lan_addr.addr, mac);
         vf->pf_set_mac = true;
-       netdev_info(netdev,
-                   "MAC on VF %d set to %pM. VF driver will be reinitialized\n",
+       netdev_info(netdev, "MAC on VF %d set to %pM. VF driver will be reinitialized\n",
                     vf_id, mac);
  
         ice_vc_reset_vf(vf);
@@ -3332,10 +3313,8 @@ int ice_set_vf_mac(struct net_device *netdev, int vf_id, u8 *mac)
  int ice_set_vf_trust(struct net_device *netdev, int vf_id, bool trusted)
  {
         struct ice_pf *pf = ice_netdev_to_pf(netdev);
-       struct device *dev;
         struct ice_vf *vf;
  
-       dev = ice_pf_to_dev(pf);
         if (ice_validate_vf_id(pf, vf_id))
                 return -EINVAL;
  
@@ -3358,7 +3337,7 @@ int ice_set_vf_trust(struct net_device *netdev, int vf_id, bool trusted)
  
         vf->trusted = trusted;
         ice_vc_reset_vf(vf);
-       dev_info(dev, "VF %u is now %strusted\n",
+       dev_info(ice_pf_to_dev(pf), "VF %u is now %strusted\n",
                  vf_id, trusted ? "" : "un");
  
         return 0;
diff --git a/drivers/net/ethernet/intel/ice/ice_xsk.c b/drivers/net/ethernet/intel/ice/ice_xsk.c

index 149dca0..4d3407b 100644 (file)
--- a/drivers/net/ethernet/intel/ice/ice_xsk.c
+++ b/drivers/net/ethernet/intel/ice/ice_xsk.c
@@ -338,8 +338,8 @@ static int ice_xsk_umem_dma_map(struct ice_vsi *vsi, struct xdp_umem *umem)
                                                     DMA_BIDIRECTIONAL,
                                                     ICE_RX_DMA_ATTR);
                 if (dma_mapping_error(dev, dma)) {
-                       dev_dbg(dev,
-                               "XSK UMEM DMA mapping error on page num %d", i);
+                       dev_dbg(dev, "XSK UMEM DMA mapping error on page num %d\n",
+                               i);
                         goto out_unmap;
                 }
  
diff --git a/drivers/net/ethernet/socionext/sni_ave.c b/drivers/net/ethernet/socionext/sni_ave.c

index b703242..67ddf78 100644 (file)
--- a/drivers/net/ethernet/socionext/sni_ave.c
+++ b/drivers/net/ethernet/socionext/sni_ave.c
@@ -1810,6 +1810,9 @@ static int ave_pro4_get_pinmode(struct ave_private *priv,
                 break;
         case PHY_INTERFACE_MODE_MII:
         case PHY_INTERFACE_MODE_RGMII:
+       case PHY_INTERFACE_MODE_RGMII_ID:
+       case PHY_INTERFACE_MODE_RGMII_RXID:
+       case PHY_INTERFACE_MODE_RGMII_TXID:
                 priv->pinmode_val = 0;
                 break;
         default:
@@ -1854,6 +1857,9 @@ static int ave_ld20_get_pinmode(struct ave_private *priv,
                 priv->pinmode_val = SG_ETPINMODE_RMII(0);
                 break;
         case PHY_INTERFACE_MODE_RGMII:
+       case PHY_INTERFACE_MODE_RGMII_ID:
+       case PHY_INTERFACE_MODE_RGMII_RXID:
+       case PHY_INTERFACE_MODE_RGMII_TXID:
                 priv->pinmode_val = 0;
                 break;
         default:
@@ -1876,6 +1882,9 @@ static int ave_pxs3_get_pinmode(struct ave_private *priv,
                 priv->pinmode_val = SG_ETPINMODE_RMII(arg);
                 break;
         case PHY_INTERFACE_MODE_RGMII:
+       case PHY_INTERFACE_MODE_RGMII_ID:
+       case PHY_INTERFACE_MODE_RGMII_RXID:
+       case PHY_INTERFACE_MODE_RGMII_TXID:
                 priv->pinmode_val = 0;
                 break;
         default:
diff --git a/drivers/net/ethernet/sun/sunvnet_common.c b/drivers/net/ethernet/sun/sunvnet_common.c

index c23ce83..8dc6c9f 100644 (file)
--- a/drivers/net/ethernet/sun/sunvnet_common.c
+++ b/drivers/net/ethernet/sun/sunvnet_common.c
@@ -1350,27 +1350,12 @@ sunvnet_start_xmit_common(struct sk_buff *skb, struct net_device *dev,
                 if (vio_version_after_eq(&port->vio, 1, 3))
                         localmtu -= VLAN_HLEN;
  
-               if (skb->protocol == htons(ETH_P_IP)) {
-                       struct flowi4 fl4;
-                       struct rtable *rt = NULL;
-
-                       memset(&fl4, 0, sizeof(fl4));
-                       fl4.flowi4_oif = dev->ifindex;
-                       fl4.flowi4_tos = RT_TOS(ip_hdr(skb)->tos);
-                       fl4.daddr = ip_hdr(skb)->daddr;
-                       fl4.saddr = ip_hdr(skb)->saddr;
-
-                       rt = ip_route_output_key(dev_net(dev), &fl4);
-                       if (!IS_ERR(rt)) {
-                               skb_dst_set(skb, &rt->dst);
-                               icmp_send(skb, ICMP_DEST_UNREACH,
-                                         ICMP_FRAG_NEEDED,
-                                         htonl(localmtu));
-                       }
-               }
+               if (skb->protocol == htons(ETH_P_IP))
+                       icmp_ndo_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED,
+                                     htonl(localmtu));
  #if IS_ENABLED(CONFIG_IPV6)
                 else if (skb->protocol == htons(ETH_P_IPV6))
-                       icmpv6_send(skb, ICMPV6_PKT_TOOBIG, 0, localmtu);
+                       icmpv6_ndo_send(skb, ICMPV6_PKT_TOOBIG, 0, localmtu);
  #endif
                 goto out_dropped;
         }
diff --git a/drivers/net/gtp.c b/drivers/net/gtp.c

index af07ea7..672cd2c 100644 (file)
--- a/drivers/net/gtp.c
+++ b/drivers/net/gtp.c
@@ -546,8 +546,8 @@ static int gtp_build_skb_ip4(struct sk_buff *skb, struct net_device *dev,
             mtu < ntohs(iph->tot_len)) {
                 netdev_dbg(dev, "packet too big, fragmentation needed\n");
                 memset(IPCB(skb), 0, sizeof(*IPCB(skb)));
-               icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED,
-                         htonl(mtu));
+               icmp_ndo_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED,
+                             htonl(mtu));
                 goto err_rt;
         }
  
diff --git a/drivers/net/usb/qmi_wwan.c b/drivers/net/usb/qmi_wwan.c

index 9485c8d..3b7a3b8 100644 (file)
--- a/drivers/net/usb/qmi_wwan.c
+++ b/drivers/net/usb/qmi_wwan.c
@@ -61,7 +61,6 @@ enum qmi_wwan_flags {
  
  enum qmi_wwan_quirks {
         QMI_WWAN_QUIRK_DTR = 1 << 0,    /* needs "set DTR" request */
-       QMI_WWAN_QUIRK_QUECTEL_DYNCFG = 1 << 1, /* check num. endpoints */
  };
  
  struct qmimux_hdr {
@@ -916,16 +915,6 @@ static const struct driver_info    qmi_wwan_info_quirk_dtr = {
         .data           = QMI_WWAN_QUIRK_DTR,
  };
  
-static const struct driver_info        qmi_wwan_info_quirk_quectel_dyncfg = {
-       .description    = "WWAN/QMI device",
-       .flags          = FLAG_WWAN | FLAG_SEND_ZLP,
-       .bind           = qmi_wwan_bind,
-       .unbind         = qmi_wwan_unbind,
-       .manage_power   = qmi_wwan_manage_power,
-       .rx_fixup       = qmi_wwan_rx_fixup,
-       .data           = QMI_WWAN_QUIRK_DTR | QMI_WWAN_QUIRK_QUECTEL_DYNCFG,
-};
-
  #define HUAWEI_VENDOR_ID       0x12D1
  
  /* map QMI/wwan function by a fixed interface number */
@@ -946,14 +935,18 @@ static const struct driver_info   qmi_wwan_info_quirk_quectel_dyncfg = {
  #define QMI_GOBI_DEVICE(vend, prod) \
         QMI_FIXED_INTF(vend, prod, 0)
  
-/* Quectel does not use fixed interface numbers on at least some of their
- * devices. We need to check the number of endpoints to ensure that we bind to
- * the correct interface.
+/* Many devices have QMI and DIAG functions which are distinguishable
+ * from other vendor specific functions by class, subclass and
+ * protocol all being 0xff. The DIAG function has exactly 2 endpoints
+ * and is silently rejected when probed.
+ *
+ * This makes it possible to match dynamically numbered QMI functions
+ * as seen on e.g. many Quectel modems.
   */
-#define QMI_QUIRK_QUECTEL_DYNCFG(vend, prod) \
+#define QMI_MATCH_FF_FF_FF(vend, prod) \
         USB_DEVICE_AND_INTERFACE_INFO(vend, prod, USB_CLASS_VENDOR_SPEC, \
                                       USB_SUBCLASS_VENDOR_SPEC, 0xff), \
-       .driver_info = (unsigned long)&qmi_wwan_info_quirk_quectel_dyncfg
+       .driver_info = (unsigned long)&qmi_wwan_info_quirk_dtr
  
  static const struct usb_device_id products[] = {
         /* 1. CDC ECM like devices match on the control interface */
@@ -1059,10 +1052,10 @@ static const struct usb_device_id products[] = {
                 USB_DEVICE_AND_INTERFACE_INFO(0x03f0, 0x581d, USB_CLASS_VENDOR_SPEC, 1, 7),
                 .driver_info = (unsigned long)&qmi_wwan_info,
         },
-       {QMI_QUIRK_QUECTEL_DYNCFG(0x2c7c, 0x0125)},     /* Quectel EC25, EC20 R2.0  Mini PCIe */
-       {QMI_QUIRK_QUECTEL_DYNCFG(0x2c7c, 0x0306)},     /* Quectel EP06/EG06/EM06 */
-       {QMI_QUIRK_QUECTEL_DYNCFG(0x2c7c, 0x0512)},     /* Quectel EG12/EM12 */
-       {QMI_QUIRK_QUECTEL_DYNCFG(0x2c7c, 0x0800)},     /* Quectel RM500Q-GL */
+       {QMI_MATCH_FF_FF_FF(0x2c7c, 0x0125)},   /* Quectel EC25, EC20 R2.0  Mini PCIe */
+       {QMI_MATCH_FF_FF_FF(0x2c7c, 0x0306)},   /* Quectel EP06/EG06/EM06 */
+       {QMI_MATCH_FF_FF_FF(0x2c7c, 0x0512)},   /* Quectel EG12/EM12 */
+       {QMI_MATCH_FF_FF_FF(0x2c7c, 0x0800)},   /* Quectel RM500Q-GL */
  
         /* 3. Combined interface devices matching on interface number */
         {QMI_FIXED_INTF(0x0408, 0xea42, 4)},    /* Yota / Megafon M100-1 */
@@ -1363,6 +1356,7 @@ static const struct usb_device_id products[] = {
         {QMI_FIXED_INTF(0x413c, 0x81b6, 8)},    /* Dell Wireless 5811e */
         {QMI_FIXED_INTF(0x413c, 0x81b6, 10)},   /* Dell Wireless 5811e */
         {QMI_FIXED_INTF(0x413c, 0x81d7, 0)},    /* Dell Wireless 5821e */
+       {QMI_FIXED_INTF(0x413c, 0x81d7, 1)},    /* Dell Wireless 5821e preproduction config */
         {QMI_FIXED_INTF(0x413c, 0x81e0, 0)},    /* Dell Wireless 5821e with eSIM support*/
         {QMI_FIXED_INTF(0x03f0, 0x4e1d, 8)},    /* HP lt4111 LTE/EV-DO/HSPA+ Gobi 4G Module */
         {QMI_FIXED_INTF(0x03f0, 0x9d1d, 1)},    /* HP lt4120 Snapdragon X5 LTE */
@@ -1454,7 +1448,6 @@ static int qmi_wwan_probe(struct usb_interface *intf,
  {
         struct usb_device_id *id = (struct usb_device_id *)prod;
         struct usb_interface_descriptor *desc = &intf->cur_altsetting->desc;
-       const struct driver_info *info;
  
         /* Workaround to enable dynamic IDs.  This disables usbnet
          * blacklisting functionality.  Which, if required, can be
@@ -1490,12 +1483,8 @@ static int qmi_wwan_probe(struct usb_interface *intf,
          * different. Ignore the current interface if the number of endpoints
          * equals the number for the diag interface (two).
          */
-       info = (void *)id->driver_info;
-
-       if (info->data & QMI_WWAN_QUIRK_QUECTEL_DYNCFG) {
-               if (desc->bNumEndpoints == 2)
-                       return -ENODEV;
-       }
+       if (desc->bNumEndpoints == 2)
+               return -ENODEV;
  
         return usbnet_probe(intf, id);
  }
diff --git a/drivers/net/wireguard/device.c b/drivers/net/wireguard/device.c

index 16b1982..43db442 100644 (file)
--- a/drivers/net/wireguard/device.c
+++ b/drivers/net/wireguard/device.c
@@ -203,9 +203,9 @@ err_peer:
  err:
         ++dev->stats.tx_errors;
         if (skb->protocol == htons(ETH_P_IP))
-               icmp_send(skb, ICMP_DEST_UNREACH, ICMP_HOST_UNREACH, 0);
+               icmp_ndo_send(skb, ICMP_DEST_UNREACH, ICMP_HOST_UNREACH, 0);
         else if (skb->protocol == htons(ETH_P_IPV6))
-               icmpv6_send(skb, ICMPV6_DEST_UNREACH, ICMPV6_ADDR_UNREACH, 0);
+               icmpv6_ndo_send(skb, ICMPV6_DEST_UNREACH, ICMPV6_ADDR_UNREACH, 0);
         kfree_skb(skb);
         return ret;
  }
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c

index 5dc32b7..ada59df 100644 (file)
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -66,8 +66,8 @@ MODULE_PARM_DESC(streams, "turn on support for Streams write directives");
   * nvme_reset_wq - hosts nvme reset works
   * nvme_delete_wq - hosts nvme delete works
   *
- * nvme_wq will host works such are scan, aen handling, fw activation,
- * keep-alive error recovery, periodic reconnects etc. nvme_reset_wq
+ * nvme_wq will host works such as scan, aen handling, fw activation,
+ * keep-alive, periodic reconnects etc. nvme_reset_wq
   * runs reset works which also flush works hosted on nvme_wq for
   * serialization purposes. nvme_delete_wq host controller deletion
   * works which flush reset works for serialization.
@@ -976,7 +976,7 @@ static void nvme_keep_alive_end_io(struct request *rq, blk_status_t status)
                 startka = true;
         spin_unlock_irqrestore(&ctrl->lock, flags);
         if (startka)
-               schedule_delayed_work(&ctrl->ka_work, ctrl->kato * HZ);
+               queue_delayed_work(nvme_wq, &ctrl->ka_work, ctrl->kato * HZ);
  }
  
  static int nvme_keep_alive(struct nvme_ctrl *ctrl)
@@ -1006,7 +1006,7 @@ static void nvme_keep_alive_work(struct work_struct *work)
                 dev_dbg(ctrl->device,
                         "reschedule traffic based keep-alive timer\n");
                 ctrl->comp_seen = false;
-               schedule_delayed_work(&ctrl->ka_work, ctrl->kato * HZ);
+               queue_delayed_work(nvme_wq, &ctrl->ka_work, ctrl->kato * HZ);
                 return;
         }
  
@@ -1023,7 +1023,7 @@ static void nvme_start_keep_alive(struct nvme_ctrl *ctrl)
         if (unlikely(ctrl->kato == 0))
                 return;
  
-       schedule_delayed_work(&ctrl->ka_work, ctrl->kato * HZ);
+       queue_delayed_work(nvme_wq, &ctrl->ka_work, ctrl->kato * HZ);
  }
  
  void nvme_stop_keep_alive(struct nvme_ctrl *ctrl)
@@ -3867,7 +3867,7 @@ static void nvme_get_fw_slot_info(struct nvme_ctrl *ctrl)
         if (!log)
                 return;
  
-       if (nvme_get_log(ctrl, NVME_NSID_ALL, 0, NVME_LOG_FW_SLOT, log,
+       if (nvme_get_log(ctrl, NVME_NSID_ALL, NVME_LOG_FW_SLOT, 0, log,
                         sizeof(*log), 0))
                 dev_warn(ctrl->device, "Get FW SLOT INFO log error\n");
         kfree(log);
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c

index da392b5..9c80f9f 100644 (file)
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1401,6 +1401,23 @@ static void nvme_disable_admin_queue(struct nvme_dev *dev, bool shutdown)
         nvme_poll_irqdisable(nvmeq, -1);
  }
  
+/*
+ * Called only on a device that has been disabled and after all other threads
+ * that can check this device's completion queues have synced. This is the
+ * last chance for the driver to see a natural completion before
+ * nvme_cancel_request() terminates all incomplete requests.
+ */
+static void nvme_reap_pending_cqes(struct nvme_dev *dev)
+{
+       u16 start, end;
+       int i;
+
+       for (i = dev->ctrl.queue_count - 1; i > 0; i--) {
+               nvme_process_cq(&dev->queues[i], &start, &end, -1);
+               nvme_complete_cqes(&dev->queues[i], start, end);
+       }
+}
+
  static int nvme_cmb_qdepth(struct nvme_dev *dev, int nr_io_queues,
                                 int entry_size)
  {
@@ -2235,11 +2252,6 @@ static bool __nvme_disable_io_queues(struct nvme_dev *dev, u8 opcode)
                 if (timeout == 0)
                         return false;
  
-               /* handle any remaining CQEs */
-               if (opcode == nvme_admin_delete_cq &&
-                   !test_bit(NVMEQ_DELETE_ERROR, &nvmeq->flags))
-                       nvme_poll_irqdisable(nvmeq, -1);
-
                 sent--;
                 if (nr_queues)
                         goto retry;
@@ -2428,6 +2440,7 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
         nvme_suspend_io_queues(dev);
         nvme_suspend_queue(&dev->queues[0]);
         nvme_pci_disable(dev);
+       nvme_reap_pending_cqes(dev);
  
         blk_mq_tagset_busy_iter(&dev->tagset, nvme_cancel_request, &dev->ctrl);
         blk_mq_tagset_busy_iter(&dev->admin_tagset, nvme_cancel_request, &dev->ctrl);
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c

index 2a47c6c..3e85c5c 100644 (file)
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -1088,7 +1088,7 @@ static void nvme_rdma_error_recovery(struct nvme_rdma_ctrl *ctrl)
         if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_RESETTING))
                 return;
  
-       queue_work(nvme_wq, &ctrl->err_work);
+       queue_work(nvme_reset_wq, &ctrl->err_work);
  }
  
  static void nvme_rdma_wr_error(struct ib_cq *cq, struct ib_wc *wc,
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c

index 6d43b23..49d4373 100644 (file)
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -422,7 +422,7 @@ static void nvme_tcp_error_recovery(struct nvme_ctrl *ctrl)
         if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
                 return;
  
-       queue_work(nvme_wq, &to_tcp_ctrl(ctrl)->err_work);
+       queue_work(nvme_reset_wq, &to_tcp_ctrl(ctrl)->err_work);
  }
  
  static int nvme_tcp_process_nvme_cqe(struct nvme_tcp_queue *queue,
@@ -1054,7 +1054,12 @@ static void nvme_tcp_io_work(struct work_struct *w)
                 } else if (unlikely(result < 0)) {
                         dev_err(queue->ctrl->ctrl.device,
                                 "failed to send request %d\n", result);
-                       if (result != -EPIPE)
+
+                       /*
+                        * Fail the request unless peer closed the connection,
+                        * in which case error recovery flow will complete all.
+                        */
+                       if ((result != -EPIPE) && (result != -ECONNRESET))
                                 nvme_tcp_fail_request(queue->request);
                         nvme_tcp_done_send_req(queue);
                         return;
diff --git a/drivers/perf/arm_smmuv3_pmu.c b/drivers/perf/arm_smmuv3_pmu.c

index d704ecc..f01a57e 100644 (file)
--- a/drivers/perf/arm_smmuv3_pmu.c
+++ b/drivers/perf/arm_smmuv3_pmu.c
@@ -771,7 +771,7 @@ static int smmu_pmu_probe(struct platform_device *pdev)
                 smmu_pmu->reloc_base = smmu_pmu->reg_base;
         }
  
-       irq = platform_get_irq(pdev, 0);
+       irq = platform_get_irq_optional(pdev, 0);
         if (irq > 0)
                 smmu_pmu->irq = irq;
  
diff --git a/drivers/s390/cio/qdio.h b/drivers/s390/cio/qdio.h

index 4b07984..ff74eb5 100644 (file)
--- a/drivers/s390/cio/qdio.h
+++ b/drivers/s390/cio/qdio.h
@@ -182,11 +182,9 @@ enum qdio_queue_irq_states {
  };
  
  struct qdio_input_q {
-       /* input buffer acknowledgement flag */
-       int polling;
         /* first ACK'ed buffer */
         int ack_start;
-       /* how much sbals are acknowledged with qebsm */
+       /* how many SBALs are acknowledged */
         int ack_count;
         /* last time of noticing incoming data */
         u64 timestamp;
diff --git a/drivers/s390/cio/qdio_debug.c b/drivers/s390/cio/qdio_debug.c

index 35410e6..9c0370b 100644 (file)
--- a/drivers/s390/cio/qdio_debug.c
+++ b/drivers/s390/cio/qdio_debug.c
@@ -124,9 +124,8 @@ static int qstat_show(struct seq_file *m, void *v)
         seq_printf(m, "nr_used: %d  ftc: %d\n",
                    atomic_read(&q->nr_buf_used), q->first_to_check);
         if (q->is_input_q) {
-               seq_printf(m, "polling: %d  ack start: %d  ack count: %d\n",
-                          q->u.in.polling, q->u.in.ack_start,
-                          q->u.in.ack_count);
+               seq_printf(m, "ack start: %d  ack count: %d\n",
+                          q->u.in.ack_start, q->u.in.ack_count);
                 seq_printf(m, "DSCI: %x   IRQs disabled: %u\n",
                            *(u8 *)q->irq_ptr->dsci,
                            test_bit(QDIO_QUEUE_IRQS_DISABLED,
diff --git a/drivers/s390/cio/qdio_main.c b/drivers/s390/cio/qdio_main.c

index f8b897b..3475317 100644 (file)
--- a/drivers/s390/cio/qdio_main.c
+++ b/drivers/s390/cio/qdio_main.c
@@ -393,19 +393,15 @@ int debug_get_buf_state(struct qdio_q *q, unsigned int bufnr,
  
  static inline void qdio_stop_polling(struct qdio_q *q)
  {
-       if (!q->u.in.polling)
+       if (!q->u.in.ack_count)
                 return;
  
-       q->u.in.polling = 0;
         qperf_inc(q, stop_polling);
  
         /* show the card that we are not polling anymore */
-       if (is_qebsm(q)) {
-               set_buf_states(q, q->u.in.ack_start, SLSB_P_INPUT_NOT_INIT,
-                              q->u.in.ack_count);
-               q->u.in.ack_count = 0;
-       } else
-               set_buf_state(q, q->u.in.ack_start, SLSB_P_INPUT_NOT_INIT);
+       set_buf_states(q, q->u.in.ack_start, SLSB_P_INPUT_NOT_INIT,
+                      q->u.in.ack_count);
+       q->u.in.ack_count = 0;
  }
  
  static inline void account_sbals(struct qdio_q *q, unsigned int count)
@@ -451,8 +447,7 @@ static inline void inbound_primed(struct qdio_q *q, unsigned int start,
  
         /* for QEBSM the ACK was already set by EQBS */
         if (is_qebsm(q)) {
-               if (!q->u.in.polling) {
-                       q->u.in.polling = 1;
+               if (!q->u.in.ack_count) {
                         q->u.in.ack_count = count;
                         q->u.in.ack_start = start;
                         return;
@@ -471,12 +466,12 @@ static inline void inbound_primed(struct qdio_q *q, unsigned int start,
          * or by the next inbound run.
          */
         new = add_buf(start, count - 1);
-       if (q->u.in.polling) {
+       if (q->u.in.ack_count) {
                 /* reset the previous ACK but first set the new one */
                 set_buf_state(q, new, SLSB_P_INPUT_ACK);
                 set_buf_state(q, q->u.in.ack_start, SLSB_P_INPUT_NOT_INIT);
         } else {
-               q->u.in.polling = 1;
+               q->u.in.ack_count = 1;
                 set_buf_state(q, new, SLSB_P_INPUT_ACK);
         }
  
@@ -1479,13 +1474,12 @@ static int handle_inbound(struct qdio_q *q, unsigned int callflags,
  
         qperf_inc(q, inbound_call);
  
-       if (!q->u.in.polling)
+       if (!q->u.in.ack_count)
                 goto set;
  
         /* protect against stop polling setting an ACK for an emptied slsb */
         if (count == QDIO_MAX_BUFFERS_PER_Q) {
                 /* overwriting everything, just delete polling status */
-               q->u.in.polling = 0;
                 q->u.in.ack_count = 0;
                 goto set;
         } else if (buf_in_between(q->u.in.ack_start, bufnr, count)) {
@@ -1495,15 +1489,14 @@ static int handle_inbound(struct qdio_q *q, unsigned int callflags,
                         diff = sub_buf(diff, q->u.in.ack_start);
                         q->u.in.ack_count -= diff;
                         if (q->u.in.ack_count <= 0) {
-                               q->u.in.polling = 0;
                                 q->u.in.ack_count = 0;
                                 goto set;
                         }
                         q->u.in.ack_start = add_buf(q->u.in.ack_start, diff);
+               } else {
+                       /* the only ACK will be deleted */
+                       q->u.in.ack_count = 0;
                 }
-               else
-                       /* the only ACK will be deleted, so stop polling */
-                       q->u.in.polling = 0;
         }
  
  set:
diff --git a/drivers/s390/cio/qdio_setup.c b/drivers/s390/cio/qdio_setup.c

index dc430bd..3ab8e80 100644 (file)
--- a/drivers/s390/cio/qdio_setup.c
+++ b/drivers/s390/cio/qdio_setup.c
@@ -536,7 +536,7 @@ void qdio_print_subchannel_info(struct qdio_irq *irq_ptr,
  int qdio_enable_async_operation(struct qdio_output_q *outq)
  {
         outq->aobs = kcalloc(QDIO_MAX_BUFFERS_PER_Q, sizeof(struct qaob *),
-                            GFP_ATOMIC);
+                            GFP_KERNEL);
         if (!outq->aobs) {
                 outq->use_cq = 0;
                 return -ENOMEM;
diff --git a/drivers/s390/cio/vfio_ccw_trace.h b/drivers/s390/cio/vfio_ccw_trace.h

index 30162a3..f5d3188 100644 (file)
--- a/drivers/s390/cio/vfio_ccw_trace.h
+++ b/drivers/s390/cio/vfio_ccw_trace.h
@@ -1,5 +1,5 @@
-/* SPDX-License-Identifier: GPL-2.0
- * Tracepoints for vfio_ccw driver
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Tracepoints for vfio_ccw driver
   *
   * Copyright IBM Corp. 2018
   *
diff --git a/drivers/s390/crypto/ap_bus.h b/drivers/s390/crypto/ap_bus.h

index bb35ba4..4348fdf 100644 (file)
--- a/drivers/s390/crypto/ap_bus.h
+++ b/drivers/s390/crypto/ap_bus.h
@@ -162,7 +162,7 @@ struct ap_card {
         unsigned int functions;         /* AP device function bitfield. */
         int queue_depth;                /* AP queue depth.*/
         int id;                         /* AP card number. */
-       atomic_t total_request_count;   /* # requests ever for this AP device.*/
+       atomic64_t total_request_count; /* # requests ever for this AP device.*/
  };
  
  #define to_ap_card(x) container_of((x), struct ap_card, ap_dev.device)
@@ -179,7 +179,7 @@ struct ap_queue {
         enum ap_state state;            /* State of the AP device. */
         int pendingq_count;             /* # requests on pendingq list. */
         int requestq_count;             /* # requests on requestq list. */
-       int total_request_count;        /* # requests ever for this AP device.*/
+       u64 total_request_count;        /* # requests ever for this AP device.*/
         int request_timeout;            /* Request timeout in jiffies. */
         struct timer_list timeout;      /* Timer for request timeouts. */
         struct list_head pendingq;      /* List of message sent to AP queue. */
diff --git a/drivers/s390/crypto/ap_card.c b/drivers/s390/crypto/ap_card.c

index 63b4cc6..e85bfca 100644 (file)
--- a/drivers/s390/crypto/ap_card.c
+++ b/drivers/s390/crypto/ap_card.c
@@ -63,13 +63,13 @@ static ssize_t request_count_show(struct device *dev,
                                   char *buf)
  {
         struct ap_card *ac = to_ap_card(dev);
-       unsigned int req_cnt;
+       u64 req_cnt;
  
         req_cnt = 0;
         spin_lock_bh(&ap_list_lock);
-       req_cnt = atomic_read(&ac->total_request_count);
+       req_cnt = atomic64_read(&ac->total_request_count);
         spin_unlock_bh(&ap_list_lock);
-       return snprintf(buf, PAGE_SIZE, "%d\n", req_cnt);
+       return snprintf(buf, PAGE_SIZE, "%llu\n", req_cnt);
  }
  
  static ssize_t request_count_store(struct device *dev,
@@ -83,7 +83,7 @@ static ssize_t request_count_store(struct device *dev,
         for_each_ap_queue(aq, ac)
                 aq->total_request_count = 0;
         spin_unlock_bh(&ap_list_lock);
-       atomic_set(&ac->total_request_count, 0);
+       atomic64_set(&ac->total_request_count, 0);
  
         return count;
  }
diff --git a/drivers/s390/crypto/ap_queue.c b/drivers/s390/crypto/ap_queue.c

index 37c3bdc..a317ab4 100644 (file)
--- a/drivers/s390/crypto/ap_queue.c
+++ b/drivers/s390/crypto/ap_queue.c
@@ -479,12 +479,12 @@ static ssize_t request_count_show(struct device *dev,
                                   char *buf)
  {
         struct ap_queue *aq = to_ap_queue(dev);
-       unsigned int req_cnt;
+       u64 req_cnt;
  
         spin_lock_bh(&aq->lock);
         req_cnt = aq->total_request_count;
         spin_unlock_bh(&aq->lock);
-       return snprintf(buf, PAGE_SIZE, "%d\n", req_cnt);
+       return snprintf(buf, PAGE_SIZE, "%llu\n", req_cnt);
  }
  
  static ssize_t request_count_store(struct device *dev,
@@ -676,7 +676,7 @@ void ap_queue_message(struct ap_queue *aq, struct ap_message *ap_msg)
         list_add_tail(&ap_msg->list, &aq->requestq);
         aq->requestq_count++;
         aq->total_request_count++;
-       atomic_inc(&aq->card->total_request_count);
+       atomic64_inc(&aq->card->total_request_count);
         /* Send/receive as many request from the queue as possible. */
         ap_wait(ap_sm_event_loop(aq, AP_EVENT_POLL));
         spin_unlock_bh(&aq->lock);
diff --git a/drivers/s390/crypto/pkey_api.c b/drivers/s390/crypto/pkey_api.c

index 71dae64..2f33c5f 100644 (file)
--- a/drivers/s390/crypto/pkey_api.c
+++ b/drivers/s390/crypto/pkey_api.c
@@ -994,7 +994,7 @@ static long pkey_unlocked_ioctl(struct file *filp, unsigned int cmd,
                         return -EFAULT;
                 rc = cca_sec2protkey(ksp.cardnr, ksp.domain,
                                      ksp.seckey.seckey, ksp.protkey.protkey,
-                                    NULL, &ksp.protkey.type);
+                                    &ksp.protkey.len, &ksp.protkey.type);
                 DEBUG_DBG("%s cca_sec2protkey()=%d\n", __func__, rc);
                 if (rc)
                         break;
diff --git a/drivers/s390/crypto/zcrypt_api.c b/drivers/s390/crypto/zcrypt_api.c

index a42257d..56a405d 100644 (file)
--- a/drivers/s390/crypto/zcrypt_api.c
+++ b/drivers/s390/crypto/zcrypt_api.c
@@ -606,8 +606,8 @@ static inline bool zcrypt_card_compare(struct zcrypt_card *zc,
         weight += atomic_read(&zc->load);
         pref_weight += atomic_read(&pref_zc->load);
         if (weight == pref_weight)
-               return atomic_read(&zc->card->total_request_count) >
-                       atomic_read(&pref_zc->card->total_request_count);
+               return atomic64_read(&zc->card->total_request_count) >
+                       atomic64_read(&pref_zc->card->total_request_count);
         return weight > pref_weight;
  }
  
@@ -1226,11 +1226,12 @@ static void zcrypt_qdepth_mask(char qdepth[], size_t max_adapters)
         spin_unlock(&zcrypt_list_lock);
  }
  
-static void zcrypt_perdev_reqcnt(int reqcnt[], size_t max_adapters)
+static void zcrypt_perdev_reqcnt(u32 reqcnt[], size_t max_adapters)
  {
         struct zcrypt_card *zc;
         struct zcrypt_queue *zq;
         int card;
+       u64 cnt;
  
         memset(reqcnt, 0, sizeof(int) * max_adapters);
         spin_lock(&zcrypt_list_lock);
@@ -1242,8 +1243,9 @@ static void zcrypt_perdev_reqcnt(int reqcnt[], size_t max_adapters)
                             || card >= max_adapters)
                                 continue;
                         spin_lock(&zq->queue->lock);
-                       reqcnt[card] = zq->queue->total_request_count;
+                       cnt = zq->queue->total_request_count;
                         spin_unlock(&zq->queue->lock);
+                       reqcnt[card] = (cnt < UINT_MAX) ? (u32) cnt : UINT_MAX;
                 }
         }
         local_bh_enable();
@@ -1421,9 +1423,9 @@ static long zcrypt_unlocked_ioctl(struct file *filp, unsigned int cmd,
                 return 0;
         }
         case ZCRYPT_PERDEV_REQCNT: {
-               int *reqcnt;
+               u32 *reqcnt;
  
-               reqcnt = kcalloc(AP_DEVICES, sizeof(int), GFP_KERNEL);
+               reqcnt = kcalloc(AP_DEVICES, sizeof(u32), GFP_KERNEL);
                 if (!reqcnt)
                         return -ENOMEM;
                 zcrypt_perdev_reqcnt(reqcnt, AP_DEVICES);
@@ -1480,7 +1482,7 @@ static long zcrypt_unlocked_ioctl(struct file *filp, unsigned int cmd,
         }
         case Z90STAT_PERDEV_REQCNT: {
                 /* the old ioctl supports only 64 adapters */
-               int reqcnt[MAX_ZDEV_CARDIDS];
+               u32 reqcnt[MAX_ZDEV_CARDIDS];
  
                 zcrypt_perdev_reqcnt(reqcnt, MAX_ZDEV_CARDIDS);
                 if (copy_to_user((int __user *) arg, reqcnt, sizeof(reqcnt)))
diff --git a/drivers/soc/tegra/fuse/fuse-tegra30.c b/drivers/soc/tegra/fuse/fuse-tegra30.c

index f68f4e1..e6037f9 100644 (file)
--- a/drivers/soc/tegra/fuse/fuse-tegra30.c
+++ b/drivers/soc/tegra/fuse/fuse-tegra30.c
@@ -36,7 +36,8 @@
      defined(CONFIG_ARCH_TEGRA_124_SOC) || \
      defined(CONFIG_ARCH_TEGRA_132_SOC) || \
      defined(CONFIG_ARCH_TEGRA_210_SOC) || \
-    defined(CONFIG_ARCH_TEGRA_186_SOC)
+    defined(CONFIG_ARCH_TEGRA_186_SOC) || \
+    defined(CONFIG_ARCH_TEGRA_194_SOC)
  static u32 tegra30_fuse_read_early(struct tegra_fuse *fuse, unsigned int offset)
  {
         if (WARN_ON(!fuse->base))
diff --git a/drivers/spmi/spmi-pmic-arb.c b/drivers/spmi/spmi-pmic-arb.c

index 97acc2b..de844b4 100644 (file)
--- a/drivers/spmi/spmi-pmic-arb.c
+++ b/drivers/spmi/spmi-pmic-arb.c
@@ -731,6 +731,7 @@ static int qpnpint_irq_domain_translate(struct irq_domain *d,
         return 0;
  }
  
+static struct lock_class_key qpnpint_irq_lock_class, qpnpint_irq_request_class;
  
  static void qpnpint_irq_domain_map(struct spmi_pmic_arb *pmic_arb,
                                    struct irq_domain *domain, unsigned int virq,
@@ -746,6 +747,9 @@ static void qpnpint_irq_domain_map(struct spmi_pmic_arb *pmic_arb,
         else
                 handler = handle_level_irq;
  
+
+       irq_set_lockdep_class(virq, &qpnpint_irq_lock_class,
+                             &qpnpint_irq_request_class);
         irq_domain_set_info(domain, virq, hwirq, &pmic_arb_irqchip, pmic_arb,
                             handler, NULL, NULL);
  }
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c

index 7fa9bb7..89422aa 100644 (file)
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3164,6 +3164,7 @@ int __cold open_ctree(struct super_block *sb,
         /* do not make disk changes in broken FS or nologreplay is given */
         if (btrfs_super_log_root(disk_super) != 0 &&
             !btrfs_test_opt(fs_info, NOLOGREPLAY)) {
+               btrfs_info(fs_info, "start tree-log replay");
                 ret = btrfs_replay_log(fs_info, fs_devices);
                 if (ret) {
                         err = ret;
diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c

index 6f417ff..bd6229f 100644 (file)
--- a/fs/btrfs/extent_map.c
+++ b/fs/btrfs/extent_map.c
@@ -237,6 +237,17 @@ static void try_merge_map(struct extent_map_tree *tree, struct extent_map *em)
         struct extent_map *merge = NULL;
         struct rb_node *rb;
  
+       /*
+        * We can't modify an extent map that is in the tree and that is being
+        * used by another task, as it can cause that other task to see it in
+        * inconsistent state during the merging. We always have 1 reference for
+        * the tree and 1 for this task (which is unpinning the extent map or
+        * clearing the logging flag), so anything > 2 means it's being used by
+        * other tasks too.
+        */
+       if (refcount_read(&em->refs) > 2)
+               return;
+
         if (em->start != 0) {
                 rb = rb_prev(&em->rb_node);
                 if (rb)
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c

index 5b3ec93..7d26b4b 100644 (file)
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4085,6 +4085,8 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans,
         u64 bytes_deleted = 0;
         bool be_nice = false;
         bool should_throttle = false;
+       const u64 lock_start = ALIGN_DOWN(new_size, fs_info->sectorsize);
+       struct extent_state *cached_state = NULL;
  
         BUG_ON(new_size > 0 && min_type != BTRFS_EXTENT_DATA_KEY);
  
@@ -4101,6 +4103,9 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans,
                 return -ENOMEM;
         path->reada = READA_BACK;
  
+       lock_extent_bits(&BTRFS_I(inode)->io_tree, lock_start, (u64)-1,
+                        &cached_state);
+
         /*
          * We want to drop from the next block forward in case this new size is
          * not block aligned since we will be keeping the last block of the
@@ -4367,6 +4372,9 @@ out:
                 btrfs_ordered_update_i_size(inode, last_size, NULL);
         }
  
+       unlock_extent_cached(&BTRFS_I(inode)->io_tree, lock_start, (u64)-1,
+                            &cached_state);
+
         btrfs_free_path(path);
         return ret;
  }
diff --git a/fs/btrfs/ref-verify.c b/fs/btrfs/ref-verify.c

index b57f361..454a101 100644 (file)
--- a/fs/btrfs/ref-verify.c
+++ b/fs/btrfs/ref-verify.c
@@ -744,6 +744,7 @@ int btrfs_ref_tree_mod(struct btrfs_fs_info *fs_info,
                  */
                 be = add_block_entry(fs_info, bytenr, num_bytes, ref_root);
                 if (IS_ERR(be)) {
+                       kfree(ref);
                         kfree(ra);
                         ret = PTR_ERR(be);
                         goto out;
@@ -757,6 +758,8 @@ int btrfs_ref_tree_mod(struct btrfs_fs_info *fs_info,
                         "re-allocated a block that still has references to it!");
                         dump_block_entry(fs_info, be);
                         dump_ref_action(fs_info, ra);
+                       kfree(ref);
+                       kfree(ra);
                         goto out_unlock;
                 }
  
@@ -819,6 +822,7 @@ int btrfs_ref_tree_mod(struct btrfs_fs_info *fs_info,
  "dropping a ref for a existing root that doesn't have a ref on the block");
                                 dump_block_entry(fs_info, be);
                                 dump_ref_action(fs_info, ra);
+                               kfree(ref);
                                 kfree(ra);
                                 goto out_unlock;
                         }
@@ -834,6 +838,7 @@ int btrfs_ref_tree_mod(struct btrfs_fs_info *fs_info,
  "attempting to add another ref for an existing ref on a tree block");
                         dump_block_entry(fs_info, be);
                         dump_ref_action(fs_info, ra);
+                       kfree(ref);
                         kfree(ra);
                         goto out_unlock;
                 }
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c

index 0616a54..67c6385 100644 (file)
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -1834,6 +1834,8 @@ static int btrfs_remount(struct super_block *sb, int *flags, char *data)
                 }
  
                 if (btrfs_super_log_root(fs_info->super_copy) != 0) {
+                       btrfs_warn(fs_info,
+               "mount required to replay tree-log, cannot remount read-write");
                         ret = -EINVAL;
                         goto restore;
                 }
diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c

index 7436422..3c10e78 100644 (file)
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -901,6 +901,12 @@ static int addrm_unknown_feature_attrs(struct btrfs_fs_info *fs_info, bool add)
  
  static void __btrfs_sysfs_remove_fsid(struct btrfs_fs_devices *fs_devs)
  {
+       if (fs_devs->devinfo_kobj) {
+               kobject_del(fs_devs->devinfo_kobj);
+               kobject_put(fs_devs->devinfo_kobj);
+               fs_devs->devinfo_kobj = NULL;
+       }
+
         if (fs_devs->devices_kobj) {
                 kobject_del(fs_devs->devices_kobj);
                 kobject_put(fs_devs->devices_kobj);
@@ -1289,7 +1295,7 @@ int btrfs_sysfs_add_device_link(struct btrfs_fs_devices *fs_devices,
  
                 init_completion(&dev->kobj_unregister);
                 error = kobject_init_and_add(&dev->devid_kobj, &devid_ktype,
-                                            fs_devices->devices_kobj, "%llu",
+                                            fs_devices->devinfo_kobj, "%llu",
                                              dev->devid);
                 if (error) {
                         kobject_put(&dev->devid_kobj);
@@ -1369,6 +1375,15 @@ int btrfs_sysfs_add_fsid(struct btrfs_fs_devices *fs_devs)
                 return -ENOMEM;
         }
  
+       fs_devs->devinfo_kobj = kobject_create_and_add("devinfo",
+                                                      &fs_devs->fsid_kobj);
+       if (!fs_devs->devinfo_kobj) {
+               btrfs_err(fs_devs->fs_info,
+                         "failed to init sysfs devinfo kobject");
+               btrfs_sysfs_remove_fsid(fs_devs);
+               return -ENOMEM;
+       }
+
         return 0;
  }
  
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h

index 409f481..f01552a 100644 (file)
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -258,6 +258,7 @@ struct btrfs_fs_devices {
         /* sysfs kobjects */
         struct kobject fsid_kobj;
         struct kobject *devices_kobj;
+       struct kobject *devinfo_kobj;
         struct completion kobj_unregister;
  };
  
diff --git a/fs/ceph/file.c b/fs/ceph/file.c

index c3b8e8e..7e0190b 100644 (file)
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -1418,6 +1418,7 @@ static ssize_t ceph_write_iter(struct kiocb *iocb, struct iov_iter *from)
         struct ceph_cap_flush *prealloc_cf;
         ssize_t count, written = 0;
         int err, want, got;
+       bool direct_lock = false;
         loff_t pos;
         loff_t limit = max(i_size_read(inode), fsc->max_file_size);
  
@@ -1428,8 +1429,11 @@ static ssize_t ceph_write_iter(struct kiocb *iocb, struct iov_iter *from)
         if (!prealloc_cf)
                 return -ENOMEM;
  
+       if ((iocb->ki_flags & (IOCB_DIRECT | IOCB_APPEND)) == IOCB_DIRECT)
+               direct_lock = true;
+
  retry_snap:
-       if (iocb->ki_flags & IOCB_DIRECT)
+       if (direct_lock)
                 ceph_start_io_direct(inode);
         else
                 ceph_start_io_write(inode);
@@ -1519,14 +1523,15 @@ retry_snap:
  
                 /* we might need to revert back to that point */
                 data = *from;
-               if (iocb->ki_flags & IOCB_DIRECT) {
+               if (iocb->ki_flags & IOCB_DIRECT)
                         written = ceph_direct_read_write(iocb, &data, snapc,
                                                          &prealloc_cf);
-                       ceph_end_io_direct(inode);
-               } else {
+               else
                         written = ceph_sync_write(iocb, &data, pos, snapc);
+               if (direct_lock)
+                       ceph_end_io_direct(inode);
+               else
                         ceph_end_io_write(inode);
-               }
                 if (written > 0)
                         iov_iter_advance(from, written);
                 ceph_put_snap_context(snapc);
@@ -1577,7 +1582,7 @@ retry_snap:
  
         goto out_unlocked;
  out:
-       if (iocb->ki_flags & IOCB_DIRECT)
+       if (direct_lock)
                 ceph_end_io_direct(inode);
         else
                 ceph_end_io_write(inode);
diff --git a/fs/ceph/super.c b/fs/ceph/super.c

index 1d9f083..c7f1506 100644 (file)
--- a/fs/ceph/super.c
+++ b/fs/ceph/super.c
@@ -202,6 +202,26 @@ struct ceph_parse_opts_ctx {
         struct ceph_mount_options       *opts;
  };
  
+/*
+ * Remove adjacent slashes and then the trailing slash, unless it is
+ * the only remaining character.
+ *
+ * E.g. "//dir1////dir2///" --> "/dir1/dir2", "///" --> "/".
+ */
+static void canonicalize_path(char *path)
+{
+       int i, j = 0;
+
+       for (i = 0; path[i] != '\0'; i++) {
+               if (path[i] != '/' || j < 1 || path[j - 1] != '/')
+                       path[j++] = path[i];
+       }
+
+       if (j > 1 && path[j - 1] == '/')
+               j--;
+       path[j] = '\0';
+}
+
  /*
   * Parse the source parameter.  Distinguish the server list from the path.
   *
@@ -224,15 +244,16 @@ static int ceph_parse_source(struct fs_parameter *param, struct fs_context *fc)
  
         dev_name_end = strchr(dev_name, '/');
         if (dev_name_end) {
-               kfree(fsopt->server_path);
-
                 /*
                  * The server_path will include the whole chars from userland
                  * including the leading '/'.
                  */
+               kfree(fsopt->server_path);
                 fsopt->server_path = kstrdup(dev_name_end, GFP_KERNEL);
                 if (!fsopt->server_path)
                         return -ENOMEM;
+
+               canonicalize_path(fsopt->server_path);
         } else {
                 dev_name_end = dev_name + strlen(dev_name);
         }
@@ -456,73 +477,6 @@ static int strcmp_null(const char *s1, const char *s2)
         return strcmp(s1, s2);
  }
  
-/**
- * path_remove_extra_slash - Remove the extra slashes in the server path
- * @server_path: the server path and could be NULL
- *
- * Return NULL if the path is NULL or only consists of "/", or a string
- * without any extra slashes including the leading slash(es) and the
- * slash(es) at the end of the server path, such as:
- * "//dir1////dir2///" --> "dir1/dir2"
- */
-static char *path_remove_extra_slash(const char *server_path)
-{
-       const char *path = server_path;
-       const char *cur, *end;
-       char *buf, *p;
-       int len;
-
-       /* if the server path is omitted */
-       if (!path)
-               return NULL;
-
-       /* remove all the leading slashes */
-       while (*path == '/')
-               path++;
-
-       /* if the server path only consists of slashes */
-       if (*path == '\0')
-               return NULL;
-
-       len = strlen(path);
-
-       buf = kmalloc(len + 1, GFP_KERNEL);
-       if (!buf)
-               return ERR_PTR(-ENOMEM);
-
-       end = path + len;
-       p = buf;
-       do {
-               cur = strchr(path, '/');
-               if (!cur)
-                       cur = end;
-
-               len = cur - path;
-
-               /* including one '/' */
-               if (cur != end)
-                       len += 1;
-
-               memcpy(p, path, len);
-               p += len;
-
-               while (cur <= end && *cur == '/')
-                       cur++;
-               path = cur;
-       } while (path < end);
-
-       *p = '\0';
-
-       /*
-        * remove the last slash if there has and just to make sure that
-        * we will get something like "dir1/dir2"
-        */
-       if (*(--p) == '/')
-               *p = '\0';
-
-       return buf;
-}
-
  static int compare_mount_options(struct ceph_mount_options *new_fsopt,
                                  struct ceph_options *new_opt,
                                  struct ceph_fs_client *fsc)
@@ -530,7 +484,6 @@ static int compare_mount_options(struct ceph_mount_options *new_fsopt,
         struct ceph_mount_options *fsopt1 = new_fsopt;
         struct ceph_mount_options *fsopt2 = fsc->mount_options;
         int ofs = offsetof(struct ceph_mount_options, snapdir_name);
-       char *p1, *p2;
         int ret;
  
         ret = memcmp(fsopt1, fsopt2, ofs);
@@ -540,21 +493,12 @@ static int compare_mount_options(struct ceph_mount_options *new_fsopt,
         ret = strcmp_null(fsopt1->snapdir_name, fsopt2->snapdir_name);
         if (ret)
                 return ret;
+
         ret = strcmp_null(fsopt1->mds_namespace, fsopt2->mds_namespace);
         if (ret)
                 return ret;
  
-       p1 = path_remove_extra_slash(fsopt1->server_path);
-       if (IS_ERR(p1))
-               return PTR_ERR(p1);
-       p2 = path_remove_extra_slash(fsopt2->server_path);
-       if (IS_ERR(p2)) {
-               kfree(p1);
-               return PTR_ERR(p2);
-       }
-       ret = strcmp_null(p1, p2);
-       kfree(p1);
-       kfree(p2);
+       ret = strcmp_null(fsopt1->server_path, fsopt2->server_path);
         if (ret)
                 return ret;
  
@@ -957,7 +901,9 @@ static struct dentry *ceph_real_mount(struct ceph_fs_client *fsc,
         mutex_lock(&fsc->client->mount_mutex);
  
         if (!fsc->sb->s_root) {
-               const char *path, *p;
+               const char *path = fsc->mount_options->server_path ?
+                                    fsc->mount_options->server_path + 1 : "";
+
                 err = __ceph_open_session(fsc->client, started);
                 if (err < 0)
                         goto out;
@@ -969,22 +915,11 @@ static struct dentry *ceph_real_mount(struct ceph_fs_client *fsc,
                                 goto out;
                 }
  
-               p = path_remove_extra_slash(fsc->mount_options->server_path);
-               if (IS_ERR(p)) {
-                       err = PTR_ERR(p);
-                       goto out;
-               }
-               /* if the server path is omitted or just consists of '/' */
-               if (!p)
-                       path = "";
-               else
-                       path = p;
                 dout("mount opening path '%s'\n", path);
  
                 ceph_fs_debugfs_init(fsc);
  
                 root = open_root_dentry(fsc, path, started);
-               kfree(p);
                 if (IS_ERR(root)) {
                         err = PTR_ERR(root);
                         goto out;
@@ -1097,10 +1032,6 @@ static int ceph_get_tree(struct fs_context *fc)
         if (!fc->source)
                 return invalfc(fc, "No source");
  
-#ifdef CONFIG_CEPH_FS_POSIX_ACL
-       fc->sb_flags |= SB_POSIXACL;
-#endif
-
         /* create client (which we may/may not use) */
         fsc = create_fs_client(pctx->opts, pctx->copts);
         pctx->opts = NULL;
@@ -1223,6 +1154,10 @@ static int ceph_init_fs_context(struct fs_context *fc)
         fsopt->max_readdir_bytes = CEPH_MAX_READDIR_BYTES_DEFAULT;
         fsopt->congestion_kb = default_congestion_kb();
  
+#ifdef CONFIG_CEPH_FS_POSIX_ACL
+       fc->sb_flags |= SB_POSIXACL;
+#endif
+
         fc->fs_private = pctx;
         fc->ops = &ceph_context_ops;
         return 0;
diff --git a/fs/ceph/super.h b/fs/ceph/super.h

index 1e456a9..037cdfb 100644 (file)
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -91,7 +91,7 @@ struct ceph_mount_options {
  
         char *snapdir_name;   /* default ".snap" */
         char *mds_namespace;  /* default NULL */
-       char *server_path;    /* default  "/" */
+       char *server_path;    /* default NULL (means "/") */
         char *fscache_uniq;   /* default NULL */
  };
  
diff --git a/fs/cifs/cifsacl.c b/fs/cifs/cifsacl.c

index 440828a..716574a 100644 (file)
--- a/fs/cifs/cifsacl.c
+++ b/fs/cifs/cifsacl.c
@@ -601,7 +601,7 @@ static void access_flags_to_mode(__le32 ace_flags, int type, umode_t *pmode,
                         ((flags & FILE_EXEC_RIGHTS) == FILE_EXEC_RIGHTS))
                 *pmode |= (S_IXUGO & (*pbits_to_set));
  
-       cifs_dbg(NOISY, "access flags 0x%x mode now 0x%x\n", flags, *pmode);
+       cifs_dbg(NOISY, "access flags 0x%x mode now %04o\n", flags, *pmode);
         return;
  }
  
@@ -630,7 +630,7 @@ static void mode_to_access_flags(umode_t mode, umode_t bits_to_use,
         if (mode & S_IXUGO)
                 *pace_flags |= SET_FILE_EXEC_RIGHTS;
  
-       cifs_dbg(NOISY, "mode: 0x%x, access flags now 0x%x\n",
+       cifs_dbg(NOISY, "mode: %04o, access flags now 0x%x\n",
                  mode, *pace_flags);
         return;
  }
diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c

index febab27..46ebaf3 100644 (file)
--- a/fs/cifs/cifsfs.c
+++ b/fs/cifs/cifsfs.c
@@ -414,7 +414,7 @@ cifs_show_security(struct seq_file *s, struct cifs_ses *ses)
                 seq_puts(s, "ntlm");
                 break;
         case Kerberos:
-               seq_printf(s, "krb5,cruid=%u", from_kuid_munged(&init_user_ns,ses->cred_uid));
+               seq_puts(s, "krb5");
                 break;
         case RawNTLMSSP:
                 seq_puts(s, "ntlmssp");
@@ -427,6 +427,10 @@ cifs_show_security(struct seq_file *s, struct cifs_ses *ses)
  
         if (ses->sign)
                 seq_puts(s, "i");
+
+       if (ses->sectype == Kerberos)
+               seq_printf(s, ",cruid=%u",
+                          from_kuid_munged(&init_user_ns, ses->cred_uid));
  }
  
  static void
diff --git a/fs/cifs/connect.c b/fs/cifs/connect.c

index a941ac7..4804d1d 100644 (file)
--- a/fs/cifs/connect.c
+++ b/fs/cifs/connect.c
@@ -4151,7 +4151,7 @@ int cifs_setup_cifs_sb(struct smb_vol *pvolume_info,
         cifs_sb->mnt_gid = pvolume_info->linux_gid;
         cifs_sb->mnt_file_mode = pvolume_info->file_mode;
         cifs_sb->mnt_dir_mode = pvolume_info->dir_mode;
-       cifs_dbg(FYI, "file mode: 0x%hx  dir mode: 0x%hx\n",
+       cifs_dbg(FYI, "file mode: %04ho  dir mode: %04ho\n",
                  cifs_sb->mnt_file_mode, cifs_sb->mnt_dir_mode);
  
         cifs_sb->actimeo = pvolume_info->actimeo;
diff --git a/fs/cifs/inode.c b/fs/cifs/inode.c

index 9ba623b..b5e6635 100644 (file)
--- a/fs/cifs/inode.c
+++ b/fs/cifs/inode.c
@@ -1648,7 +1648,7 @@ int cifs_mkdir(struct inode *inode, struct dentry *direntry, umode_t mode)
         struct TCP_Server_Info *server;
         char *full_path;
  
-       cifs_dbg(FYI, "In cifs_mkdir, mode = 0x%hx inode = 0x%p\n",
+       cifs_dbg(FYI, "In cifs_mkdir, mode = %04ho inode = 0x%p\n",
                  mode, inode);
  
         cifs_sb = CIFS_SB(inode->i_sb);
diff --git a/fs/cifs/smb2ops.c b/fs/cifs/smb2ops.c

index baa825f..e47190c 100644 (file)
--- a/fs/cifs/smb2ops.c
+++ b/fs/cifs/smb2ops.c
@@ -1116,7 +1116,8 @@ smb2_set_ea(const unsigned int xid, struct cifs_tcon *tcon,
         void *data[1];
         struct smb2_file_full_ea_info *ea = NULL;
         struct kvec close_iov[1];
-       int rc;
+       struct smb2_query_info_rsp *rsp;
+       int rc, used_len = 0;
  
         if (smb3_encryption_required(tcon))
                 flags |= CIFS_TRANSFORM_REQ;
@@ -1139,6 +1140,38 @@ smb2_set_ea(const unsigned int xid, struct cifs_tcon *tcon,
                                                              cifs_sb);
                         if (rc == -ENODATA)
                                 goto sea_exit;
+               } else {
+                       /* If we are adding a attribute we should first check
+                        * if there will be enough space available to store
+                        * the new EA. If not we should not add it since we
+                        * would not be able to even read the EAs back.
+                        */
+                       rc = smb2_query_info_compound(xid, tcon, utf16_path,
+                                     FILE_READ_EA,
+                                     FILE_FULL_EA_INFORMATION,
+                                     SMB2_O_INFO_FILE,
+                                     CIFSMaxBufSize -
+                                     MAX_SMB2_CREATE_RESPONSE_SIZE -
+                                     MAX_SMB2_CLOSE_RESPONSE_SIZE,
+                                     &rsp_iov[1], &resp_buftype[1], cifs_sb);
+                       if (rc == 0) {
+                               rsp = (struct smb2_query_info_rsp *)rsp_iov[1].iov_base;
+                               used_len = le32_to_cpu(rsp->OutputBufferLength);
+                       }
+                       free_rsp_buf(resp_buftype[1], rsp_iov[1].iov_base);
+                       resp_buftype[1] = CIFS_NO_BUFFER;
+                       memset(&rsp_iov[1], 0, sizeof(rsp_iov[1]));
+                       rc = 0;
+
+                       /* Use a fudge factor of 256 bytes in case we collide
+                        * with a different set_EAs command.
+                        */
+                       if(CIFSMaxBufSize - MAX_SMB2_CREATE_RESPONSE_SIZE -
+                          MAX_SMB2_CLOSE_RESPONSE_SIZE - 256 <
+                          used_len + ea_name_len + ea_value_len + 1) {
+                               rc = -ENOSPC;
+                               goto sea_exit;
+                       }
                 }
         }
  
@@ -4795,6 +4828,7 @@ struct smb_version_operations smb21_operations = {
         .wp_retry_size = smb2_wp_retry_size,
         .dir_needs_close = smb2_dir_needs_close,
         .enum_snapshots = smb3_enum_snapshots,
+       .notify = smb3_notify,
         .get_dfs_refer = smb2_get_dfs_refer,
         .select_sectype = smb2_select_sectype,
  #ifdef CONFIG_CIFS_XATTR
diff --git a/fs/dax.c b/fs/dax.c

index 1f1f020..35da144 100644 (file)
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -937,12 +937,11 @@ static int dax_writeback_one(struct xa_state *xas, struct dax_device *dax_dev,
   * on persistent storage prior to completion of the operation.
   */
  int dax_writeback_mapping_range(struct address_space *mapping,
-               struct block_device *bdev, struct writeback_control *wbc)
+               struct dax_device *dax_dev, struct writeback_control *wbc)
  {
         XA_STATE(xas, &mapping->i_pages, wbc->range_start >> PAGE_SHIFT);
         struct inode *inode = mapping->host;
         pgoff_t end_index = wbc->range_end >> PAGE_SHIFT;
-       struct dax_device *dax_dev;
         void *entry;
         int ret = 0;
         unsigned int scanned = 0;
@@ -953,10 +952,6 @@ int dax_writeback_mapping_range(struct address_space *mapping,
         if (!mapping->nrexceptional || wbc->sync_mode != WB_SYNC_ALL)
                 return 0;
  
-       dax_dev = dax_get_by_host(bdev->bd_disk->disk_name);
-       if (!dax_dev)
-               return -EIO;
-
         trace_dax_writeback_range(inode, xas.xa_index, end_index);
  
         tag_pages_for_writeback(mapping, xas.xa_index, end_index);
@@ -977,7 +972,6 @@ int dax_writeback_mapping_range(struct address_space *mapping,
                 xas_lock_irq(&xas);
         }
         xas_unlock_irq(&xas);
-       put_dax(dax_dev);
         trace_dax_writeback_range_done(inode, xas.xa_index, end_index);
         return ret;
  }
@@ -1207,6 +1201,9 @@ dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
                 lockdep_assert_held(&inode->i_rwsem);
         }
  
+       if (iocb->ki_flags & IOCB_NOWAIT)
+               flags |= IOMAP_NOWAIT;
+
         while (iov_iter_count(iter)) {
                 ret = iomap_apply(inode, pos, iov_iter_count(iter), flags, ops,
                                 iter, dax_iomap_actor);
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c

index 119667e..c885cf7 100644 (file)
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -960,8 +960,9 @@ ext2_writepages(struct address_space *mapping, struct writeback_control *wbc)
  static int
  ext2_dax_writepages(struct address_space *mapping, struct writeback_control *wbc)
  {
-       return dax_writeback_mapping_range(mapping,
-                       mapping->host->i_sb->s_bdev, wbc);
+       struct ext2_sb_info *sbi = EXT2_SB(mapping->host->i_sb);
+
+       return dax_writeback_mapping_range(mapping, sbi->s_daxdev, wbc);
  }
  
  const struct address_space_operations ext2_aops = {
diff --git a/fs/ext4/block_validity.c b/fs/ext4/block_validity.c

index 1ee04e7..0a734ff 100644 (file)
--- a/fs/ext4/block_validity.c
+++ b/fs/ext4/block_validity.c
@@ -207,6 +207,7 @@ static int ext4_protect_reserved_inode(struct super_block *sb,
                 return PTR_ERR(inode);
         num = (inode->i_size + sb->s_blocksize - 1) >> sb->s_blocksize_bits;
         while (i < num) {
+               cond_resched();
                 map.m_lblk = i;
                 map.m_len = num - i;
                 n = ext4_map_blocks(NULL, inode, &map, 0);
diff --git a/fs/ext4/dir.c b/fs/ext4/dir.c

index 1f34074..9aa1f75 100644 (file)
--- a/fs/ext4/dir.c
+++ b/fs/ext4/dir.c
@@ -129,12 +129,14 @@ static int ext4_readdir(struct file *file, struct dir_context *ctx)
                 if (err != ERR_BAD_DX_DIR) {
                         return err;
                 }
-               /*
-                * We don't set the inode dirty flag since it's not
-                * critical that it get flushed back to the disk.
-                */
-               ext4_clear_inode_flag(file_inode(file),
-                                     EXT4_INODE_INDEX);
+               /* Can we just clear INDEX flag to ignore htree information? */
+               if (!ext4_has_metadata_csum(sb)) {
+                       /*
+                        * We don't set the inode dirty flag since it's not
+                        * critical that it gets flushed back to the disk.
+                        */
+                       ext4_clear_inode_flag(inode, EXT4_INODE_INDEX);
+               }
         }
  
         if (ext4_has_inline_data(inode)) {
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h

index 9a2ee24..4441331 100644 (file)
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -2544,8 +2544,11 @@ void ext4_insert_dentry(struct inode *inode,
                         struct ext4_filename *fname);
  static inline void ext4_update_dx_flag(struct inode *inode)
  {
-       if (!ext4_has_feature_dir_index(inode->i_sb))
+       if (!ext4_has_feature_dir_index(inode->i_sb)) {
+               /* ext4_iget() should have caught this... */
+               WARN_ON_ONCE(ext4_has_feature_metadata_csum(inode->i_sb));
                 ext4_clear_inode_flag(inode, EXT4_INODE_INDEX);
+       }
  }
  static const unsigned char ext4_filetype_table[] = {
         DT_UNKNOWN, DT_REG, DT_DIR, DT_CHR, DT_BLK, DT_FIFO, DT_SOCK, DT_LNK
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c

index 3313168..e60aca7 100644 (file)
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2867,7 +2867,7 @@ static int ext4_dax_writepages(struct address_space *mapping,
         percpu_down_read(&sbi->s_journal_flag_rwsem);
         trace_ext4_writepages(inode, wbc);
  
-       ret = dax_writeback_mapping_range(mapping, inode->i_sb->s_bdev, wbc);
+       ret = dax_writeback_mapping_range(mapping, sbi->s_daxdev, wbc);
         trace_ext4_writepages_result(inode, wbc, ret,
                                      nr_to_write - wbc->nr_to_write);
         percpu_up_read(&sbi->s_journal_flag_rwsem);
@@ -4644,6 +4644,18 @@ struct inode *__ext4_iget(struct super_block *sb, unsigned long ino,
                 ret = -EFSCORRUPTED;
                 goto bad_inode;
         }
+       /*
+        * If dir_index is not enabled but there's dir with INDEX flag set,
+        * we'd normally treat htree data as empty space. But with metadata
+        * checksumming that corrupts checksums so forbid that.
+        */
+       if (!ext4_has_feature_dir_index(sb) && ext4_has_metadata_csum(sb) &&
+           ext4_test_inode_flag(inode, EXT4_INODE_INDEX)) {
+               ext4_error_inode(inode, function, line, 0,
+                        "iget: Dir with htree data on filesystem without dir_index feature.");
+               ret = -EFSCORRUPTED;
+               goto bad_inode;
+       }
         ei->i_disksize = inode->i_size;
  #ifdef CONFIG_QUOTA
         ei->i_reserved_quota = 0;
diff --git a/fs/ext4/mmp.c b/fs/ext4/mmp.c

index 1c44b1a..87f7551 100644 (file)
--- a/fs/ext4/mmp.c
+++ b/fs/ext4/mmp.c
@@ -120,10 +120,10 @@ void __dump_mmp_msg(struct super_block *sb, struct mmp_struct *mmp,
  {
         __ext4_warning(sb, function, line, "%s", msg);
         __ext4_warning(sb, function, line,
-                      "MMP failure info: last update time: %llu, last update "
-                      "node: %s, last update device: %s",
-                      (long long unsigned int) le64_to_cpu(mmp->mmp_time),
-                      mmp->mmp_nodename, mmp->mmp_bdevname);
+                      "MMP failure info: last update time: %llu, last update node: %.*s, last update device: %.*s",
+                      (unsigned long long)le64_to_cpu(mmp->mmp_time),
+                      (int)sizeof(mmp->mmp_nodename), mmp->mmp_nodename,
+                      (int)sizeof(mmp->mmp_bdevname), mmp->mmp_bdevname);
  }
  
  /*
@@ -154,6 +154,7 @@ static int kmmpd(void *data)
         mmp_check_interval = max(EXT4_MMP_CHECK_MULT * mmp_update_interval,
                                  EXT4_MMP_MIN_CHECK_INTERVAL);
         mmp->mmp_check_interval = cpu_to_le16(mmp_check_interval);
+       BUILD_BUG_ON(sizeof(mmp->mmp_bdevname) < BDEVNAME_SIZE);
         bdevname(bh->b_bdev, mmp->mmp_bdevname);
  
         memcpy(mmp->mmp_nodename, init_utsname()->nodename,
@@ -379,7 +380,8 @@ skip:
         /*
          * Start a kernel thread to update the MMP block periodically.
          */
-       EXT4_SB(sb)->s_mmp_tsk = kthread_run(kmmpd, mmpd_data, "kmmpd-%s",
+       EXT4_SB(sb)->s_mmp_tsk = kthread_run(kmmpd, mmpd_data, "kmmpd-%.*s",
+                                            (int)sizeof(mmp->mmp_bdevname),
                                              bdevname(bh->b_bdev,
                                                       mmp->mmp_bdevname));
         if (IS_ERR(EXT4_SB(sb)->s_mmp_tsk)) {
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c

index 129d2eb..ceff4b4 100644 (file)
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -2213,6 +2213,13 @@ static int ext4_add_entry(handle_t *handle, struct dentry *dentry,
                 retval = ext4_dx_add_entry(handle, &fname, dir, inode);
                 if (!retval || (retval != ERR_BAD_DX_DIR))
                         goto out;
+               /* Can we just ignore htree data? */
+               if (ext4_has_metadata_csum(sb)) {
+                       EXT4_ERROR_INODE(dir,
+                               "Directory has corrupted htree index.");
+                       retval = -EFSCORRUPTED;
+                       goto out;
+               }
                 ext4_clear_inode_flag(dir, EXT4_INODE_INDEX);
                 dx_fallback++;
                 ext4_mark_inode_dirty(handle, dir);
diff --git a/fs/ext4/super.c b/fs/ext4/super.c

index 8434217..f464dff 100644 (file)
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -3009,17 +3009,11 @@ static int ext4_feature_set_ok(struct super_block *sb, int readonly)
                 return 0;
         }
  
-#ifndef CONFIG_QUOTA
-       if (ext4_has_feature_quota(sb) && !readonly) {
+#if !defined(CONFIG_QUOTA) || !defined(CONFIG_QFMT_V2)
+       if (!readonly && (ext4_has_feature_quota(sb) ||
+                         ext4_has_feature_project(sb))) {
                 ext4_msg(sb, KERN_ERR,
-                        "Filesystem with quota feature cannot be mounted RDWR "
-                        "without CONFIG_QUOTA");
-               return 0;
-       }
-       if (ext4_has_feature_project(sb) && !readonly) {
-               ext4_msg(sb, KERN_ERR,
-                        "Filesystem with project quota feature cannot be mounted RDWR "
-                        "without CONFIG_QUOTA");
+                        "The kernel was not built with CONFIG_QUOTA and CONFIG_QFMT_V2");
                 return 0;
         }
  #endif  /* CONFIG_QUOTA */
@@ -3814,6 +3808,15 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
          */
         sbi->s_li_wait_mult = EXT4_DEF_LI_WAIT_MULT;
  
+       blocksize = BLOCK_SIZE << le32_to_cpu(es->s_log_block_size);
+       if (blocksize < EXT4_MIN_BLOCK_SIZE ||
+           blocksize > EXT4_MAX_BLOCK_SIZE) {
+               ext4_msg(sb, KERN_ERR,
+                      "Unsupported filesystem blocksize %d (%d log_block_size)",
+                        blocksize, le32_to_cpu(es->s_log_block_size));
+               goto failed_mount;
+       }
+
         if (le32_to_cpu(es->s_rev_level) == EXT4_GOOD_OLD_REV) {
                 sbi->s_inode_size = EXT4_GOOD_OLD_INODE_SIZE;
                 sbi->s_first_ino = EXT4_GOOD_OLD_FIRST_INO;
@@ -3831,6 +3834,7 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
                         ext4_msg(sb, KERN_ERR,
                                "unsupported inode size: %d",
                                sbi->s_inode_size);
+                       ext4_msg(sb, KERN_ERR, "blocksize: %d", blocksize);
                         goto failed_mount;
                 }
                 /*
@@ -4033,14 +4037,6 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
         if (!ext4_feature_set_ok(sb, (sb_rdonly(sb))))
                 goto failed_mount;
  
-       blocksize = BLOCK_SIZE << le32_to_cpu(es->s_log_block_size);
-       if (blocksize < EXT4_MIN_BLOCK_SIZE ||
-           blocksize > EXT4_MAX_BLOCK_SIZE) {
-               ext4_msg(sb, KERN_ERR,
-                      "Unsupported filesystem blocksize %d (%d log_block_size)",
-                        blocksize, le32_to_cpu(es->s_log_block_size));
-               goto failed_mount;
-       }
         if (le32_to_cpu(es->s_log_block_size) >
             (EXT4_MAX_BLOCK_LOG_SIZE - EXT4_MIN_BLOCK_LOG_SIZE)) {
                 ext4_msg(sb, KERN_ERR,
@@ -5585,10 +5581,7 @@ static int ext4_statfs_project(struct super_block *sb,
                 return PTR_ERR(dquot);
         spin_lock(&dquot->dq_dqb_lock);
  
-       limit = 0;
-       if (dquot->dq_dqb.dqb_bsoftlimit &&
-           (!limit || dquot->dq_dqb.dqb_bsoftlimit < limit))
-               limit = dquot->dq_dqb.dqb_bsoftlimit;
+       limit = dquot->dq_dqb.dqb_bsoftlimit;
         if (dquot->dq_dqb.dqb_bhardlimit &&
             (!limit || dquot->dq_dqb.dqb_bhardlimit < limit))
                 limit = dquot->dq_dqb.dqb_bhardlimit;
@@ -5603,10 +5596,7 @@ static int ext4_statfs_project(struct super_block *sb,
                          (buf->f_blocks - curblock) : 0;
         }
  
-       limit = 0;
-       if (dquot->dq_dqb.dqb_isoftlimit &&
-           (!limit || dquot->dq_dqb.dqb_isoftlimit < limit))
-               limit = dquot->dq_dqb.dqb_isoftlimit;
+       limit = dquot->dq_dqb.dqb_isoftlimit;
         if (dquot->dq_dqb.dqb_ihardlimit &&
             (!limit || dquot->dq_dqb.dqb_ihardlimit < limit))
                 limit = dquot->dq_dqb.dqb_ihardlimit;
diff --git a/fs/io-wq.c b/fs/io-wq.c

index cb60a42..0a5ab1a 100644 (file)
--- a/fs/io-wq.c
+++ b/fs/io-wq.c
@@ -16,6 +16,7 @@
  #include <linux/slab.h>
  #include <linux/kthread.h>
  #include <linux/rculist_nulls.h>
+#include <linux/fs_struct.h>
  
  #include "io-wq.h"
  
@@ -59,6 +60,7 @@ struct io_worker {
         const struct cred *cur_creds;
         const struct cred *saved_creds;
         struct files_struct *restore_files;
+       struct fs_struct *restore_fs;
  };
  
  #if BITS_PER_LONG == 64
@@ -151,6 +153,9 @@ static bool __io_worker_unuse(struct io_wqe *wqe, struct io_worker *worker)
                 task_unlock(current);
         }
  
+       if (current->fs != worker->restore_fs)
+               current->fs = worker->restore_fs;
+
         /*
          * If we have an active mm, we need to drop the wq lock before unusing
          * it. If we do, return true and let the caller retry the idle loop.
@@ -311,6 +316,7 @@ static void io_worker_start(struct io_wqe *wqe, struct io_worker *worker)
  
         worker->flags |= (IO_WORKER_F_UP | IO_WORKER_F_RUNNING);
         worker->restore_files = current->files;
+       worker->restore_fs = current->fs;
         io_wqe_inc_running(wqe, worker);
  }
  
@@ -481,6 +487,8 @@ next:
                         current->files = work->files;
                         task_unlock(current);
                 }
+               if (work->fs && current->fs != work->fs)
+                       current->fs = work->fs;
                 if (work->mm != worker->mm)
                         io_wq_switch_mm(worker, work);
                 if (worker->cur_creds != work->creds)
@@ -691,11 +699,16 @@ static int io_wq_manager(void *data)
         /* create fixed workers */
         refcount_set(&wq->refs, workers_to_create);
         for_each_node(node) {
+               if (!node_online(node))
+                       continue;
                 if (!create_io_worker(wq, wq->wqes[node], IO_WQ_ACCT_BOUND))
                         goto err;
                 workers_to_create--;
         }
  
+       while (workers_to_create--)
+               refcount_dec(&wq->refs);
+
         complete(&wq->done);
  
         while (!kthread_should_stop()) {
@@ -703,6 +716,9 @@ static int io_wq_manager(void *data)
                         struct io_wqe *wqe = wq->wqes[node];
                         bool fork_worker[2] = { false, false };
  
+                       if (!node_online(node))
+                               continue;
+
                         spin_lock_irq(&wqe->lock);
                         if (io_wqe_need_worker(wqe, IO_WQ_ACCT_BOUND))
                                 fork_worker[IO_WQ_ACCT_BOUND] = true;
@@ -821,7 +837,9 @@ static bool io_wq_for_each_worker(struct io_wqe *wqe,
  
         list_for_each_entry_rcu(worker, &wqe->all_list, all_list) {
                 if (io_worker_get(worker)) {
-                       ret = func(worker, data);
+                       /* no task if node is/was offline */
+                       if (worker->task)
+                               ret = func(worker, data);
                         io_worker_release(worker);
                         if (ret)
                                 break;
@@ -929,17 +947,19 @@ enum io_wq_cancel io_wq_cancel_cb(struct io_wq *wq, work_cancel_fn *cancel,
         return ret;
  }
  
+struct work_match {
+       bool (*fn)(struct io_wq_work *, void *data);
+       void *data;
+};
+
  static bool io_wq_worker_cancel(struct io_worker *worker, void *data)
  {
-       struct io_wq_work *work = data;
+       struct work_match *match = data;
         unsigned long flags;
         bool ret = false;
  
-       if (worker->cur_work != work)
-               return false;
-
         spin_lock_irqsave(&worker->lock, flags);
-       if (worker->cur_work == work &&
+       if (match->fn(worker->cur_work, match->data) &&
             !(worker->cur_work->flags & IO_WQ_WORK_NO_CANCEL)) {
                 send_sig(SIGINT, worker->task, 1);
                 ret = true;
@@ -950,15 +970,13 @@ static bool io_wq_worker_cancel(struct io_worker *worker, void *data)
  }
  
  static enum io_wq_cancel io_wqe_cancel_work(struct io_wqe *wqe,
-                                           struct io_wq_work *cwork)
+                                           struct work_match *match)
  {
         struct io_wq_work_node *node, *prev;
         struct io_wq_work *work;
         unsigned long flags;
         bool found = false;
  
-       cwork->flags |= IO_WQ_WORK_CANCEL;
-
         /*
          * First check pending list, if we're lucky we can just remove it
          * from there. CANCEL_OK means that the work is returned as-new,
@@ -968,7 +986,7 @@ static enum io_wq_cancel io_wqe_cancel_work(struct io_wqe *wqe,
         wq_list_for_each(node, prev, &wqe->work_list) {
                 work = container_of(node, struct io_wq_work, list);
  
-               if (work == cwork) {
+               if (match->fn(work, match->data)) {
                         wq_node_del(&wqe->work_list, node, prev);
                         found = true;
                         break;
@@ -989,20 +1007,60 @@ static enum io_wq_cancel io_wqe_cancel_work(struct io_wqe *wqe,
          * completion will run normally in this case.
          */
         rcu_read_lock();
-       found = io_wq_for_each_worker(wqe, io_wq_worker_cancel, cwork);
+       found = io_wq_for_each_worker(wqe, io_wq_worker_cancel, match);
         rcu_read_unlock();
         return found ? IO_WQ_CANCEL_RUNNING : IO_WQ_CANCEL_NOTFOUND;
  }
  
+static bool io_wq_work_match(struct io_wq_work *work, void *data)
+{
+       return work == data;
+}
+
  enum io_wq_cancel io_wq_cancel_work(struct io_wq *wq, struct io_wq_work *cwork)
  {
+       struct work_match match = {
+               .fn     = io_wq_work_match,
+               .data   = cwork
+       };
         enum io_wq_cancel ret = IO_WQ_CANCEL_NOTFOUND;
         int node;
  
+       cwork->flags |= IO_WQ_WORK_CANCEL;
+
         for_each_node(node) {
                 struct io_wqe *wqe = wq->wqes[node];
  
-               ret = io_wqe_cancel_work(wqe, cwork);
+               ret = io_wqe_cancel_work(wqe, &match);
+               if (ret != IO_WQ_CANCEL_NOTFOUND)
+                       break;
+       }
+
+       return ret;
+}
+
+static bool io_wq_pid_match(struct io_wq_work *work, void *data)
+{
+       pid_t pid = (pid_t) (unsigned long) data;
+
+       if (work)
+               return work->task_pid == pid;
+       return false;
+}
+
+enum io_wq_cancel io_wq_cancel_pid(struct io_wq *wq, pid_t pid)
+{
+       struct work_match match = {
+               .fn     = io_wq_pid_match,
+               .data   = (void *) (unsigned long) pid
+       };
+       enum io_wq_cancel ret = IO_WQ_CANCEL_NOTFOUND;
+       int node;
+
+       for_each_node(node) {
+               struct io_wqe *wqe = wq->wqes[node];
+
+               ret = io_wqe_cancel_work(wqe, &match);
                 if (ret != IO_WQ_CANCEL_NOTFOUND)
                         break;
         }
@@ -1036,6 +1094,8 @@ void io_wq_flush(struct io_wq *wq)
         for_each_node(node) {
                 struct io_wqe *wqe = wq->wqes[node];
  
+               if (!node_online(node))
+                       continue;
                 init_completion(&data.done);
                 INIT_IO_WORK(&data.work, io_wq_flush_func);
                 data.work.flags |= IO_WQ_WORK_INTERNAL;
@@ -1067,12 +1127,15 @@ struct io_wq *io_wq_create(unsigned bounded, struct io_wq_data *data)
  
         for_each_node(node) {
                 struct io_wqe *wqe;
+               int alloc_node = node;
  
-               wqe = kzalloc_node(sizeof(struct io_wqe), GFP_KERNEL, node);
+               if (!node_online(alloc_node))
+                       alloc_node = NUMA_NO_NODE;
+               wqe = kzalloc_node(sizeof(struct io_wqe), GFP_KERNEL, alloc_node);
                 if (!wqe)
                         goto err;
                 wq->wqes[node] = wqe;
-               wqe->node = node;
+               wqe->node = alloc_node;
                 wqe->acct[IO_WQ_ACCT_BOUND].max_workers = bounded;
                 atomic_set(&wqe->acct[IO_WQ_ACCT_BOUND].nr_running, 0);
                 if (wq->user) {
@@ -1080,7 +1143,6 @@ struct io_wq *io_wq_create(unsigned bounded, struct io_wq_data *data)
                                         task_rlimit(current, RLIMIT_NPROC);
                 }
                 atomic_set(&wqe->acct[IO_WQ_ACCT_UNBOUND].nr_running, 0);
-               wqe->node = node;
                 wqe->wq = wq;
                 spin_lock_init(&wqe->lock);
                 INIT_WQ_LIST(&wqe->work_list);
diff --git a/fs/io-wq.h b/fs/io-wq.h

index 50b3378..ccc7d84 100644 (file)
--- a/fs/io-wq.h
+++ b/fs/io-wq.h
@@ -74,17 +74,20 @@ struct io_wq_work {
         struct files_struct *files;
         struct mm_struct *mm;
         const struct cred *creds;
+       struct fs_struct *fs;
         unsigned flags;
+       pid_t task_pid;
  };
  
  #define INIT_IO_WORK(work, _func)                      \
         do {                                            \
                 (work)->list.next = NULL;               \
                 (work)->func = _func;                   \
-               (work)->flags = 0;                      \
                 (work)->files = NULL;                   \
                 (work)->mm = NULL;                      \
                 (work)->creds = NULL;                   \
+               (work)->fs = NULL;                      \
+               (work)->flags = 0;                      \
         } while (0)                                     \
  
  typedef void (get_work_fn)(struct io_wq_work *);
@@ -107,6 +110,7 @@ void io_wq_flush(struct io_wq *wq);
  
  void io_wq_cancel_all(struct io_wq *wq);
  enum io_wq_cancel io_wq_cancel_work(struct io_wq *wq, struct io_wq_work *cwork);
+enum io_wq_cancel io_wq_cancel_pid(struct io_wq *wq, pid_t pid);
  
  typedef bool (work_cancel_fn)(struct io_wq_work *, void *);
  
diff --git a/fs/io_uring.c b/fs/io_uring.c

index 77f22c3..5a82601 100644 (file)
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -75,6 +75,7 @@
  #include <linux/fsnotify.h>
  #include <linux/fadvise.h>
  #include <linux/eventpoll.h>
+#include <linux/fs_struct.h>
  
  #define CREATE_TRACE_POINTS
  #include <trace/events/io_uring.h>
@@ -204,11 +205,11 @@ struct io_ring_ctx {
  
         struct {
                 unsigned int            flags;
-               int                     compat: 1;
-               int                     account_mem: 1;
-               int                     cq_overflow_flushed: 1;
-               int                     drain_next: 1;
-               int                     eventfd_async: 1;
+               unsigned int            compat: 1;
+               unsigned int            account_mem: 1;
+               unsigned int            cq_overflow_flushed: 1;
+               unsigned int            drain_next: 1;
+               unsigned int            eventfd_async: 1;
  
                 /*
                  * Ring buffer of indices into array of io_uring_sqe, which is
@@ -441,6 +442,7 @@ struct io_async_msghdr {
         struct iovec                    *iov;
         struct sockaddr __user          *uaddr;
         struct msghdr                   msg;
+       struct sockaddr_storage         addr;
  };
  
  struct io_async_rw {
@@ -450,17 +452,12 @@ struct io_async_rw {
         ssize_t                         size;
  };
  
-struct io_async_open {
-       struct filename                 *filename;
-};
-
  struct io_async_ctx {
         union {
                 struct io_async_rw      rw;
                 struct io_async_msghdr  msg;
                 struct io_async_connect connect;
                 struct io_timeout_data  timeout;
-               struct io_async_open    open;
         };
  };
  
@@ -483,6 +480,8 @@ enum {
         REQ_F_MUST_PUNT_BIT,
         REQ_F_TIMEOUT_NOSEQ_BIT,
         REQ_F_COMP_LOCKED_BIT,
+       REQ_F_NEED_CLEANUP_BIT,
+       REQ_F_OVERFLOW_BIT,
  };
  
  enum {
@@ -521,6 +520,10 @@ enum {
         REQ_F_TIMEOUT_NOSEQ     = BIT(REQ_F_TIMEOUT_NOSEQ_BIT),
         /* completion under lock */
         REQ_F_COMP_LOCKED       = BIT(REQ_F_COMP_LOCKED_BIT),
+       /* needs cleanup */
+       REQ_F_NEED_CLEANUP      = BIT(REQ_F_NEED_CLEANUP_BIT),
+       /* in overflow list */
+       REQ_F_OVERFLOW          = BIT(REQ_F_OVERFLOW_BIT),
  };
  
  /*
@@ -553,7 +556,6 @@ struct io_kiocb {
          * llist_node is only used for poll deferred completions
          */
         struct llist_node               llist_node;
-       bool                            has_user;
         bool                            in_async;
         bool                            needs_fixed_file;
         u8                              opcode;
@@ -614,6 +616,8 @@ struct io_op_def {
         unsigned                not_supported : 1;
         /* needs file table */
         unsigned                file_table : 1;
+       /* needs ->fs */
+       unsigned                needs_fs : 1;
  };
  
  static const struct io_op_def io_op_defs[] = {
@@ -656,12 +660,14 @@ static const struct io_op_def io_op_defs[] = {
                 .needs_mm               = 1,
                 .needs_file             = 1,
                 .unbound_nonreg_file    = 1,
+               .needs_fs               = 1,
         },
         [IORING_OP_RECVMSG] = {
                 .async_ctx              = 1,
                 .needs_mm               = 1,
                 .needs_file             = 1,
                 .unbound_nonreg_file    = 1,
+               .needs_fs               = 1,
         },
         [IORING_OP_TIMEOUT] = {
                 .async_ctx              = 1,
@@ -692,6 +698,7 @@ static const struct io_op_def io_op_defs[] = {
                 .needs_file             = 1,
                 .fd_non_neg             = 1,
                 .file_table             = 1,
+               .needs_fs               = 1,
         },
         [IORING_OP_CLOSE] = {
                 .needs_file             = 1,
@@ -705,6 +712,7 @@ static const struct io_op_def io_op_defs[] = {
                 .needs_mm               = 1,
                 .needs_file             = 1,
                 .fd_non_neg             = 1,
+               .needs_fs               = 1,
         },
         [IORING_OP_READ] = {
                 .needs_mm               = 1,
@@ -736,6 +744,7 @@ static const struct io_op_def io_op_defs[] = {
                 .needs_file             = 1,
                 .fd_non_neg             = 1,
                 .file_table             = 1,
+               .needs_fs               = 1,
         },
         [IORING_OP_EPOLL_CTL] = {
                 .unbound_nonreg_file    = 1,
@@ -754,6 +763,7 @@ static int __io_sqe_files_update(struct io_ring_ctx *ctx,
                                  unsigned nr_args);
  static int io_grab_files(struct io_kiocb *req);
  static void io_ring_file_ref_flush(struct fixed_file_data *data);
+static void io_cleanup_req(struct io_kiocb *req);
  
  static struct kmem_cache *req_cachep;
  
@@ -909,6 +919,18 @@ static inline void io_req_work_grab_env(struct io_kiocb *req,
         }
         if (!req->work.creds)
                 req->work.creds = get_current_cred();
+       if (!req->work.fs && def->needs_fs) {
+               spin_lock(&current->fs->lock);
+               if (!current->fs->in_exec) {
+                       req->work.fs = current->fs;
+                       req->work.fs->users++;
+               } else {
+                       req->work.flags |= IO_WQ_WORK_CANCEL;
+               }
+               spin_unlock(&current->fs->lock);
+       }
+       if (!req->work.task_pid)
+               req->work.task_pid = task_pid_vnr(current);
  }
  
  static inline void io_req_work_drop_env(struct io_kiocb *req)
@@ -921,6 +943,16 @@ static inline void io_req_work_drop_env(struct io_kiocb *req)
                 put_cred(req->work.creds);
                 req->work.creds = NULL;
         }
+       if (req->work.fs) {
+               struct fs_struct *fs = req->work.fs;
+
+               spin_lock(&req->work.fs->lock);
+               if (--fs->users)
+                       fs = NULL;
+               spin_unlock(&req->work.fs->lock);
+               if (fs)
+                       free_fs_struct(fs);
+       }
  }
  
  static inline bool io_prep_async_work(struct io_kiocb *req,
@@ -1074,6 +1106,7 @@ static bool io_cqring_overflow_flush(struct io_ring_ctx *ctx, bool force)
                 req = list_first_entry(&ctx->cq_overflow_list, struct io_kiocb,
                                                 list);
                 list_move(&req->list, &list);
+               req->flags &= ~REQ_F_OVERFLOW;
                 if (cqe) {
                         WRITE_ONCE(cqe->user_data, req->user_data);
                         WRITE_ONCE(cqe->res, req->result);
@@ -1126,6 +1159,7 @@ static void io_cqring_fill_event(struct io_kiocb *req, long res)
                         set_bit(0, &ctx->sq_check_overflow);
                         set_bit(0, &ctx->cq_check_overflow);
                 }
+               req->flags |= REQ_F_OVERFLOW;
                 refcount_inc(&req->refs);
                 req->result = res;
                 list_add_tail(&req->list, &ctx->cq_overflow_list);
@@ -1241,6 +1275,9 @@ static void __io_free_req(struct io_kiocb *req)
  {
         __io_req_aux_free(req);
  
+       if (req->flags & REQ_F_NEED_CLEANUP)
+               io_cleanup_req(req);
+
         if (req->flags & REQ_F_INFLIGHT) {
                 struct io_ring_ctx *ctx = req->ctx;
                 unsigned long flags;
@@ -2056,9 +2093,6 @@ static ssize_t io_import_iovec(int rw, struct io_kiocb *req,
                 return iorw->size;
         }
  
-       if (!req->has_user)
-               return -EFAULT;
-
  #ifdef CONFIG_COMPAT
         if (req->ctx->compat)
                 return compat_import_iovec(rw, buf, sqe_len, UIO_FASTIOV,
@@ -2137,6 +2171,8 @@ static void io_req_map_rw(struct io_kiocb *req, ssize_t io_size,
                 req->io->rw.iov = req->io->rw.fast_iov;
                 memcpy(req->io->rw.iov, fast_iov,
                         sizeof(struct iovec) * iter->nr_segs);
+       } else {
+               req->flags |= REQ_F_NEED_CLEANUP;
         }
  }
  
@@ -2148,17 +2184,6 @@ static int io_alloc_async_ctx(struct io_kiocb *req)
         return req->io == NULL;
  }
  
-static void io_rw_async(struct io_wq_work **workptr)
-{
-       struct io_kiocb *req = container_of(*workptr, struct io_kiocb, work);
-       struct iovec *iov = NULL;
-
-       if (req->io->rw.iov != req->io->rw.fast_iov)
-               iov = req->io->rw.iov;
-       io_wq_submit_work(workptr);
-       kfree(iov);
-}
-
  static int io_setup_async_rw(struct io_kiocb *req, ssize_t io_size,
                              struct iovec *iovec, struct iovec *fast_iov,
                              struct iov_iter *iter)
@@ -2171,7 +2196,6 @@ static int io_setup_async_rw(struct io_kiocb *req, ssize_t io_size,
  
                 io_req_map_rw(req, io_size, iovec, fast_iov, iter);
         }
-       req->work.func = io_rw_async;
         return 0;
  }
  
@@ -2189,7 +2213,8 @@ static int io_read_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe,
         if (unlikely(!(req->file->f_mode & FMODE_READ)))
                 return -EBADF;
  
-       if (!req->io)
+       /* either don't need iovec imported or already have it */
+       if (!req->io || req->flags & REQ_F_NEED_CLEANUP)
                 return 0;
  
         io = req->io;
@@ -2258,8 +2283,8 @@ copy_iov:
                 }
         }
  out_free:
-       if (!io_wq_current_is_worker())
-               kfree(iovec);
+       kfree(iovec);
+       req->flags &= ~REQ_F_NEED_CLEANUP;
         return ret;
  }
  
@@ -2277,7 +2302,8 @@ static int io_write_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe,
         if (unlikely(!(req->file->f_mode & FMODE_WRITE)))
                 return -EBADF;
  
-       if (!req->io)
+       /* either don't need iovec imported or already have it */
+       if (!req->io || req->flags & REQ_F_NEED_CLEANUP)
                 return 0;
  
         io = req->io;
@@ -2352,6 +2378,12 @@ static int io_write(struct io_kiocb *req, struct io_kiocb **nxt,
                         ret2 = call_write_iter(req->file, kiocb, &iter);
                 else
                         ret2 = loop_rw_iter(WRITE, req->file, kiocb, &iter);
+               /*
+                * Raw bdev writes will -EOPNOTSUPP for IOCB_NOWAIT. Just
+                * retry them without IOCB_NOWAIT.
+                */
+               if (ret2 == -EOPNOTSUPP && (kiocb->ki_flags & IOCB_NOWAIT))
+                       ret2 = -EAGAIN;
                 if (!force_nonblock || ret2 != -EAGAIN) {
                         kiocb_done(kiocb, ret2, nxt, req->in_async);
                 } else {
@@ -2364,8 +2396,8 @@ copy_iov:
                 }
         }
  out_free:
-       if (!io_wq_current_is_worker())
-               kfree(iovec);
+       req->flags &= ~REQ_F_NEED_CLEANUP;
+       kfree(iovec);
         return ret;
  }
  
@@ -2534,6 +2566,10 @@ static int io_openat_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
  
         if (sqe->ioprio || sqe->buf_index)
                 return -EINVAL;
+       if (sqe->flags & IOSQE_FIXED_FILE)
+               return -EBADF;
+       if (req->flags & REQ_F_NEED_CLEANUP)
+               return 0;
  
         req->open.dfd = READ_ONCE(sqe->fd);
         req->open.how.mode = READ_ONCE(sqe->len);
@@ -2547,6 +2583,7 @@ static int io_openat_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
                 return ret;
         }
  
+       req->flags |= REQ_F_NEED_CLEANUP;
         return 0;
  }
  
@@ -2559,6 +2596,10 @@ static int io_openat2_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
  
         if (sqe->ioprio || sqe->buf_index)
                 return -EINVAL;
+       if (sqe->flags & IOSQE_FIXED_FILE)
+               return -EBADF;
+       if (req->flags & REQ_F_NEED_CLEANUP)
+               return 0;
  
         req->open.dfd = READ_ONCE(sqe->fd);
         fname = u64_to_user_ptr(READ_ONCE(sqe->addr));
@@ -2583,6 +2624,7 @@ static int io_openat2_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
                 return ret;
         }
  
+       req->flags |= REQ_F_NEED_CLEANUP;
         return 0;
  }
  
@@ -2614,6 +2656,7 @@ static int io_openat2(struct io_kiocb *req, struct io_kiocb **nxt,
         }
  err:
         putname(req->open.filename);
+       req->flags &= ~REQ_F_NEED_CLEANUP;
         if (ret < 0)
                 req_set_fail_links(req);
         io_cqring_add_event(req, ret);
@@ -2754,6 +2797,10 @@ static int io_statx_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
  
         if (sqe->ioprio || sqe->buf_index)
                 return -EINVAL;
+       if (sqe->flags & IOSQE_FIXED_FILE)
+               return -EBADF;
+       if (req->flags & REQ_F_NEED_CLEANUP)
+               return 0;
  
         req->open.dfd = READ_ONCE(sqe->fd);
         req->open.mask = READ_ONCE(sqe->len);
@@ -2771,6 +2818,7 @@ static int io_statx_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
                 return ret;
         }
  
+       req->flags |= REQ_F_NEED_CLEANUP;
         return 0;
  }
  
@@ -2808,6 +2856,7 @@ retry:
                 ret = cp_statx(&stat, ctx->buffer);
  err:
         putname(ctx->filename);
+       req->flags &= ~REQ_F_NEED_CLEANUP;
         if (ret < 0)
                 req_set_fail_links(req);
         io_cqring_add_event(req, ret);
@@ -2827,7 +2876,7 @@ static int io_close_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
             sqe->rw_flags || sqe->buf_index)
                 return -EINVAL;
         if (sqe->flags & IOSQE_FIXED_FILE)
-               return -EINVAL;
+               return -EBADF;
  
         req->close.fd = READ_ONCE(sqe->fd);
         if (req->file->f_op == &io_uring_fops ||
@@ -2837,24 +2886,25 @@ static int io_close_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
         return 0;
  }
  
+/* only called when __close_fd_get_file() is done */
+static void __io_close_finish(struct io_kiocb *req, struct io_kiocb **nxt)
+{
+       int ret;
+
+       ret = filp_close(req->close.put_file, req->work.files);
+       if (ret < 0)
+               req_set_fail_links(req);
+       io_cqring_add_event(req, ret);
+       fput(req->close.put_file);
+       io_put_req_find_next(req, nxt);
+}
+
  static void io_close_finish(struct io_wq_work **workptr)
  {
         struct io_kiocb *req = container_of(*workptr, struct io_kiocb, work);
         struct io_kiocb *nxt = NULL;
  
-       /* Invoked with files, we need to do the close */
-       if (req->work.files) {
-               int ret;
-
-               ret = filp_close(req->close.put_file, req->work.files);
-               if (ret < 0)
-                       req_set_fail_links(req);
-               io_cqring_add_event(req, ret);
-       }
-
-       fput(req->close.put_file);
-
-       io_put_req_find_next(req, &nxt);
+       __io_close_finish(req, &nxt);
         if (nxt)
                 io_wq_assign_next(workptr, nxt);
  }
@@ -2877,22 +2927,8 @@ static int io_close(struct io_kiocb *req, struct io_kiocb **nxt,
          * No ->flush(), safely close from here and just punt the
          * fput() to async context.
          */
-       ret = filp_close(req->close.put_file, current->files);
-
-       if (ret < 0)
-               req_set_fail_links(req);
-       io_cqring_add_event(req, ret);
-
-       if (io_wq_current_is_worker()) {
-               struct io_wq_work *old_work, *work;
-
-               old_work = work = &req->work;
-               io_close_finish(&work);
-               if (work && work != old_work)
-                       *nxt = container_of(work, struct io_kiocb, work);
-               return 0;
-       }
-
+       __io_close_finish(req, nxt);
+       return 0;
  eagain:
         req->work.func = io_close_finish;
         /*
@@ -2960,24 +2996,12 @@ static int io_sync_file_range(struct io_kiocb *req, struct io_kiocb **nxt,
         return 0;
  }
  
-#if defined(CONFIG_NET)
-static void io_sendrecv_async(struct io_wq_work **workptr)
-{
-       struct io_kiocb *req = container_of(*workptr, struct io_kiocb, work);
-       struct iovec *iov = NULL;
-
-       if (req->io->rw.iov != req->io->rw.fast_iov)
-               iov = req->io->msg.iov;
-       io_wq_submit_work(workptr);
-       kfree(iov);
-}
-#endif
-
  static int io_sendmsg_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
  {
  #if defined(CONFIG_NET)
         struct io_sr_msg *sr = &req->sr_msg;
         struct io_async_ctx *io = req->io;
+       int ret;
  
         sr->msg_flags = READ_ONCE(sqe->msg_flags);
         sr->msg = u64_to_user_ptr(READ_ONCE(sqe->addr));
@@ -2985,10 +3009,16 @@ static int io_sendmsg_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
  
         if (!io || req->opcode == IORING_OP_SEND)
                 return 0;
+       /* iovec is already imported */
+       if (req->flags & REQ_F_NEED_CLEANUP)
+               return 0;
  
         io->msg.iov = io->msg.fast_iov;
-       return sendmsg_copy_msghdr(&io->msg.msg, sr->msg, sr->msg_flags,
+       ret = sendmsg_copy_msghdr(&io->msg.msg, sr->msg, sr->msg_flags,
                                         &io->msg.iov);
+       if (!ret)
+               req->flags |= REQ_F_NEED_CLEANUP;
+       return ret;
  #else
         return -EOPNOTSUPP;
  #endif
@@ -3008,12 +3038,11 @@ static int io_sendmsg(struct io_kiocb *req, struct io_kiocb **nxt,
         sock = sock_from_file(req->file, &ret);
         if (sock) {
                 struct io_async_ctx io;
-               struct sockaddr_storage addr;
                 unsigned flags;
  
                 if (req->io) {
                         kmsg = &req->io->msg;
-                       kmsg->msg.msg_name = &addr;
+                       kmsg->msg.msg_name = &req->io->msg.addr;
                         /* if iov is set, it's allocated already */
                         if (!kmsg->iov)
                                 kmsg->iov = kmsg->fast_iov;
@@ -3022,7 +3051,7 @@ static int io_sendmsg(struct io_kiocb *req, struct io_kiocb **nxt,
                         struct io_sr_msg *sr = &req->sr_msg;
  
                         kmsg = &io.msg;
-                       kmsg->msg.msg_name = &addr;
+                       kmsg->msg.msg_name = &io.msg.addr;
  
                         io.msg.iov = io.msg.fast_iov;
                         ret = sendmsg_copy_msghdr(&io.msg.msg, sr->msg,
@@ -3041,18 +3070,22 @@ static int io_sendmsg(struct io_kiocb *req, struct io_kiocb **nxt,
                 if (force_nonblock && ret == -EAGAIN) {
                         if (req->io)
                                 return -EAGAIN;
-                       if (io_alloc_async_ctx(req))
+                       if (io_alloc_async_ctx(req)) {
+                               if (kmsg && kmsg->iov != kmsg->fast_iov)
+                                       kfree(kmsg->iov);
                                 return -ENOMEM;
+                       }
+                       req->flags |= REQ_F_NEED_CLEANUP;
                         memcpy(&req->io->msg, &io.msg, sizeof(io.msg));
-                       req->work.func = io_sendrecv_async;
                         return -EAGAIN;
                 }
                 if (ret == -ERESTARTSYS)
                         ret = -EINTR;
         }
  
-       if (!io_wq_current_is_worker() && kmsg && kmsg->iov != kmsg->fast_iov)
+       if (kmsg && kmsg->iov != kmsg->fast_iov)
                 kfree(kmsg->iov);
+       req->flags &= ~REQ_F_NEED_CLEANUP;
         io_cqring_add_event(req, ret);
         if (ret < 0)
                 req_set_fail_links(req);
@@ -3120,6 +3153,7 @@ static int io_recvmsg_prep(struct io_kiocb *req,
  #if defined(CONFIG_NET)
         struct io_sr_msg *sr = &req->sr_msg;
         struct io_async_ctx *io = req->io;
+       int ret;
  
         sr->msg_flags = READ_ONCE(sqe->msg_flags);
         sr->msg = u64_to_user_ptr(READ_ONCE(sqe->addr));
@@ -3127,10 +3161,16 @@ static int io_recvmsg_prep(struct io_kiocb *req,
  
         if (!io || req->opcode == IORING_OP_RECV)
                 return 0;
+       /* iovec is already imported */
+       if (req->flags & REQ_F_NEED_CLEANUP)
+               return 0;
  
         io->msg.iov = io->msg.fast_iov;
-       return recvmsg_copy_msghdr(&io->msg.msg, sr->msg, sr->msg_flags,
+       ret = recvmsg_copy_msghdr(&io->msg.msg, sr->msg, sr->msg_flags,
                                         &io->msg.uaddr, &io->msg.iov);
+       if (!ret)
+               req->flags |= REQ_F_NEED_CLEANUP;
+       return ret;
  #else
         return -EOPNOTSUPP;
  #endif
@@ -3150,12 +3190,11 @@ static int io_recvmsg(struct io_kiocb *req, struct io_kiocb **nxt,
         sock = sock_from_file(req->file, &ret);
         if (sock) {
                 struct io_async_ctx io;
-               struct sockaddr_storage addr;
                 unsigned flags;
  
                 if (req->io) {
                         kmsg = &req->io->msg;
-                       kmsg->msg.msg_name = &addr;
+                       kmsg->msg.msg_name = &req->io->msg.addr;
                         /* if iov is set, it's allocated already */
                         if (!kmsg->iov)
                                 kmsg->iov = kmsg->fast_iov;
@@ -3164,7 +3203,7 @@ static int io_recvmsg(struct io_kiocb *req, struct io_kiocb **nxt,
                         struct io_sr_msg *sr = &req->sr_msg;
  
                         kmsg = &io.msg;
-                       kmsg->msg.msg_name = &addr;
+                       kmsg->msg.msg_name = &io.msg.addr;
  
                         io.msg.iov = io.msg.fast_iov;
                         ret = recvmsg_copy_msghdr(&io.msg.msg, sr->msg,
@@ -3185,18 +3224,22 @@ static int io_recvmsg(struct io_kiocb *req, struct io_kiocb **nxt,
                 if (force_nonblock && ret == -EAGAIN) {
                         if (req->io)
                                 return -EAGAIN;
-                       if (io_alloc_async_ctx(req))
+                       if (io_alloc_async_ctx(req)) {
+                               if (kmsg && kmsg->iov != kmsg->fast_iov)
+                                       kfree(kmsg->iov);
                                 return -ENOMEM;
+                       }
                         memcpy(&req->io->msg, &io.msg, sizeof(io.msg));
-                       req->work.func = io_sendrecv_async;
+                       req->flags |= REQ_F_NEED_CLEANUP;
                         return -EAGAIN;
                 }
                 if (ret == -ERESTARTSYS)
                         ret = -EINTR;
         }
  
-       if (!io_wq_current_is_worker() && kmsg && kmsg->iov != kmsg->fast_iov)
+       if (kmsg && kmsg->iov != kmsg->fast_iov)
                 kfree(kmsg->iov);
+       req->flags &= ~REQ_F_NEED_CLEANUP;
         io_cqring_add_event(req, ret);
         if (ret < 0)
                 req_set_fail_links(req);
@@ -4207,6 +4250,35 @@ static int io_req_defer(struct io_kiocb *req, const struct io_uring_sqe *sqe)
         return -EIOCBQUEUED;
  }
  
+static void io_cleanup_req(struct io_kiocb *req)
+{
+       struct io_async_ctx *io = req->io;
+
+       switch (req->opcode) {
+       case IORING_OP_READV:
+       case IORING_OP_READ_FIXED:
+       case IORING_OP_READ:
+       case IORING_OP_WRITEV:
+       case IORING_OP_WRITE_FIXED:
+       case IORING_OP_WRITE:
+               if (io->rw.iov != io->rw.fast_iov)
+                       kfree(io->rw.iov);
+               break;
+       case IORING_OP_SENDMSG:
+       case IORING_OP_RECVMSG:
+               if (io->msg.iov != io->msg.fast_iov)
+                       kfree(io->msg.iov);
+               break;
+       case IORING_OP_OPENAT:
+       case IORING_OP_OPENAT2:
+       case IORING_OP_STATX:
+               putname(req->open.filename);
+               break;
+       }
+
+       req->flags &= ~REQ_F_NEED_CLEANUP;
+}
+
  static int io_issue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe,
                         struct io_kiocb **nxt, bool force_nonblock)
  {
@@ -4446,7 +4518,6 @@ static void io_wq_submit_work(struct io_wq_work **workptr)
         }
  
         if (!ret) {
-               req->has_user = (work->flags & IO_WQ_WORK_HAS_MM) != 0;
                 req->in_async = true;
                 do {
                         ret = io_issue_sqe(req, NULL, &nxt, false);
@@ -4479,7 +4550,7 @@ static int io_req_needs_file(struct io_kiocb *req, int fd)
  {
         if (!io_op_defs[req->opcode].needs_file)
                 return 0;
-       if (fd == -1 && io_op_defs[req->opcode].fd_non_neg)
+       if ((fd == -1 || fd == AT_FDCWD) && io_op_defs[req->opcode].fd_non_neg)
                 return 0;
         return 1;
  }
@@ -4950,6 +5021,7 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr,
         for (i = 0; i < nr; i++) {
                 const struct io_uring_sqe *sqe;
                 struct io_kiocb *req;
+               int err;
  
                 req = io_get_req(ctx, statep);
                 if (unlikely(!req)) {
@@ -4966,20 +5038,23 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr,
                 submitted++;
  
                 if (unlikely(req->opcode >= IORING_OP_LAST)) {
-                       io_cqring_add_event(req, -EINVAL);
+                       err = -EINVAL;
+fail_req:
+                       io_cqring_add_event(req, err);
                         io_double_put_req(req);
                         break;
                 }
  
                 if (io_op_defs[req->opcode].needs_mm && !*mm) {
                         mm_fault = mm_fault || !mmget_not_zero(ctx->sqo_mm);
-                       if (!mm_fault) {
-                               use_mm(ctx->sqo_mm);
-                               *mm = ctx->sqo_mm;
+                       if (unlikely(mm_fault)) {
+                               err = -EFAULT;
+                               goto fail_req;
                         }
+                       use_mm(ctx->sqo_mm);
+                       *mm = ctx->sqo_mm;
                 }
  
-               req->has_user = *mm != NULL;
                 req->in_async = async;
                 req->needs_fixed_file = async;
                 trace_io_uring_submit_sqe(ctx, req->opcode, req->user_data,
@@ -6301,7 +6376,7 @@ static __poll_t io_uring_poll(struct file *file, poll_table *wait)
         if (READ_ONCE(ctx->rings->sq.tail) - ctx->cached_sq_head !=
             ctx->rings->sq_ring_entries)
                 mask |= EPOLLOUT | EPOLLWRNORM;
-       if (READ_ONCE(ctx->rings->cq.head) != ctx->cached_cq_tail)
+       if (io_cqring_events(ctx, false))
                 mask |= EPOLLIN | EPOLLRDNORM;
  
         return mask;
@@ -6393,6 +6468,29 @@ static void io_uring_cancel_files(struct io_ring_ctx *ctx,
                 if (!cancel_req)
                         break;
  
+               if (cancel_req->flags & REQ_F_OVERFLOW) {
+                       spin_lock_irq(&ctx->completion_lock);
+                       list_del(&cancel_req->list);
+                       cancel_req->flags &= ~REQ_F_OVERFLOW;
+                       if (list_empty(&ctx->cq_overflow_list)) {
+                               clear_bit(0, &ctx->sq_check_overflow);
+                               clear_bit(0, &ctx->cq_check_overflow);
+                       }
+                       spin_unlock_irq(&ctx->completion_lock);
+
+                       WRITE_ONCE(ctx->rings->cq_overflow,
+                               atomic_inc_return(&ctx->cached_cq_overflow));
+
+                       /*
+                        * Put inflight ref and overflow ref. If that's
+                        * all we had, then we're done with this request.
+                        */
+                       if (refcount_sub_and_test(2, &cancel_req->refs)) {
+                               io_put_req(cancel_req);
+                               continue;
+                       }
+               }
+
                 io_wq_cancel_work(ctx->io_wq, &cancel_req->work);
                 io_put_req(cancel_req);
                 schedule();
@@ -6405,6 +6503,13 @@ static int io_uring_flush(struct file *file, void *data)
         struct io_ring_ctx *ctx = file->private_data;
  
         io_uring_cancel_files(ctx, data);
+
+       /*
+        * If the task is going away, cancel work it may have pending
+        */
+       if (fatal_signal_pending(current) || (current->flags & PF_EXITING))
+               io_wq_cancel_pid(ctx->io_wq, task_pid_vnr(current));
+
         return 0;
  }
  
diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c

index 2494095..27373f5 100644 (file)
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -976,29 +976,33 @@ restart_loop:
                  * it. */
  
                 /*
-               * A buffer which has been freed while still being journaled by
-               * a previous transaction.
-               */
-               if (buffer_freed(bh)) {
+                * A buffer which has been freed while still being journaled
+                * by a previous transaction, refile the buffer to BJ_Forget of
+                * the running transaction. If the just committed transaction
+                * contains "add to orphan" operation, we can completely
+                * invalidate the buffer now. We are rather through in that
+                * since the buffer may be still accessible when blocksize <
+                * pagesize and it is attached to the last partial page.
+                */
+               if (buffer_freed(bh) && !jh->b_next_transaction) {
+                       struct address_space *mapping;
+
+                       clear_buffer_freed(bh);
+                       clear_buffer_jbddirty(bh);
+
                         /*
-                        * If the running transaction is the one containing
-                        * "add to orphan" operation (b_next_transaction !=
-                        * NULL), we have to wait for that transaction to
-                        * commit before we can really get rid of the buffer.
-                        * So just clear b_modified to not confuse transaction
-                        * credit accounting and refile the buffer to
-                        * BJ_Forget of the running transaction. If the just
-                        * committed transaction contains "add to orphan"
-                        * operation, we can completely invalidate the buffer
-                        * now. We are rather through in that since the
-                        * buffer may be still accessible when blocksize <
-                        * pagesize and it is attached to the last partial
-                        * page.
+                        * Block device buffers need to stay mapped all the
+                        * time, so it is enough to clear buffer_jbddirty and
+                        * buffer_freed bits. For the file mapping buffers (i.e.
+                        * journalled data) we need to unmap buffer and clear
+                        * more bits. We also need to be careful about the check
+                        * because the data page mapping can get cleared under
+                        * out hands, which alse need not to clear more bits
+                        * because the page and buffers will be freed and can
+                        * never be reused once we are done with them.
                          */
-                       jh->b_modified = 0;
-                       if (!jh->b_next_transaction) {
-                               clear_buffer_freed(bh);
-                               clear_buffer_jbddirty(bh);
+                       mapping = READ_ONCE(bh->b_page->mapping);
+                       if (mapping && !sb_is_blkdev_sb(mapping->host->i_sb)) {
                                 clear_buffer_mapped(bh);
                                 clear_buffer_new(bh);
                                 clear_buffer_req(bh);
diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c

index e77a5a0..2dd848a 100644 (file)
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -2329,14 +2329,16 @@ static int journal_unmap_buffer(journal_t *journal, struct buffer_head *bh,
                         return -EBUSY;
                 }
                 /*
-                * OK, buffer won't be reachable after truncate. We just set
-                * j_next_transaction to the running transaction (if there is
-                * one) and mark buffer as freed so that commit code knows it
-                * should clear dirty bits when it is done with the buffer.
+                * OK, buffer won't be reachable after truncate. We just clear
+                * b_modified to not confuse transaction credit accounting, and
+                * set j_next_transaction to the running transaction (if there
+                * is one) and mark buffer as freed so that commit code knows
+                * it should clear dirty bits when it is done with the buffer.
                  */
                 set_buffer_freed(bh);
                 if (journal->j_running_transaction && buffer_jbddirty(bh))
                         jh->b_next_transaction = journal->j_running_transaction;
+               jh->b_modified = 0;
                 spin_unlock(&journal->j_list_lock);
                 spin_unlock(&jh->b_state_lock);
                 write_unlock(&journal->j_state_lock);
diff --git a/fs/nfs/delegation.c b/fs/nfs/delegation.c

index 4a84107..1865322 100644 (file)
--- a/fs/nfs/delegation.c
+++ b/fs/nfs/delegation.c
@@ -42,13 +42,27 @@ static void nfs_mark_delegation_revoked(struct nfs_delegation *delegation)
         if (!test_and_set_bit(NFS_DELEGATION_REVOKED, &delegation->flags)) {
                 delegation->stateid.type = NFS4_INVALID_STATEID_TYPE;
                 atomic_long_dec(&nfs_active_delegations);
+               if (!test_bit(NFS_DELEGATION_RETURNING, &delegation->flags))
+                       nfs_clear_verifier_delegated(delegation->inode);
         }
  }
  
+static struct nfs_delegation *nfs_get_delegation(struct nfs_delegation *delegation)
+{
+       refcount_inc(&delegation->refcount);
+       return delegation;
+}
+
+static void nfs_put_delegation(struct nfs_delegation *delegation)
+{
+       if (refcount_dec_and_test(&delegation->refcount))
+               __nfs_free_delegation(delegation);
+}
+
  static void nfs_free_delegation(struct nfs_delegation *delegation)
  {
         nfs_mark_delegation_revoked(delegation);
-       __nfs_free_delegation(delegation);
+       nfs_put_delegation(delegation);
  }
  
  /**
@@ -241,13 +255,18 @@ void nfs_inode_reclaim_delegation(struct inode *inode, const struct cred *cred,
  
  static int nfs_do_return_delegation(struct inode *inode, struct nfs_delegation *delegation, int issync)
  {
+       const struct cred *cred;
         int res = 0;
  
-       if (!test_bit(NFS_DELEGATION_REVOKED, &delegation->flags))
-               res = nfs4_proc_delegreturn(inode,
-                               delegation->cred,
+       if (!test_bit(NFS_DELEGATION_REVOKED, &delegation->flags)) {
+               spin_lock(&delegation->lock);
+               cred = get_cred(delegation->cred);
+               spin_unlock(&delegation->lock);
+               res = nfs4_proc_delegreturn(inode, cred,
                                 &delegation->stateid,
                                 issync);
+               put_cred(cred);
+       }
         return res;
  }
  
@@ -273,9 +292,13 @@ nfs_start_delegation_return_locked(struct nfs_inode *nfsi)
         if (delegation == NULL)
                 goto out;
         spin_lock(&delegation->lock);
-       if (!test_and_set_bit(NFS_DELEGATION_RETURNING, &delegation->flags))
-               ret = delegation;
+       if (!test_and_set_bit(NFS_DELEGATION_RETURNING, &delegation->flags)) {
+               /* Refcount matched in nfs_end_delegation_return() */
+               ret = nfs_get_delegation(delegation);
+       }
         spin_unlock(&delegation->lock);
+       if (ret)
+               nfs_clear_verifier_delegated(&nfsi->vfs_inode);
  out:
         return ret;
  }
@@ -393,6 +416,7 @@ int nfs_inode_set_delegation(struct inode *inode, const struct cred *cred,
         if (delegation == NULL)
                 return -ENOMEM;
         nfs4_stateid_copy(&delegation->stateid, stateid);
+       refcount_set(&delegation->refcount, 1);
         delegation->type = type;
         delegation->pagemod_limit = pagemod_limit;
         delegation->change_attr = inode_peek_iversion_raw(inode);
@@ -492,6 +516,8 @@ static int nfs_end_delegation_return(struct inode *inode, struct nfs_delegation
  
         err = nfs_do_return_delegation(inode, delegation, issync);
  out:
+       /* Refcount matched in nfs_start_delegation_return_locked() */
+       nfs_put_delegation(delegation);
         return err;
  }
  
@@ -686,9 +712,12 @@ void nfs4_inode_return_delegation_on_close(struct inode *inode)
                     list_empty(&NFS_I(inode)->open_files) &&
                     !test_and_set_bit(NFS_DELEGATION_RETURNING, &delegation->flags)) {
                         clear_bit(NFS_DELEGATION_RETURN_IF_CLOSED, &delegation->flags);
-                       ret = delegation;
+                       /* Refcount matched in nfs_end_delegation_return() */
+                       ret = nfs_get_delegation(delegation);
                 }
                 spin_unlock(&delegation->lock);
+               if (ret)
+                       nfs_clear_verifier_delegated(inode);
         }
  out:
         rcu_read_unlock();
@@ -1088,10 +1117,11 @@ restart:
                         delegation = nfs_start_delegation_return_locked(NFS_I(inode));
                         rcu_read_unlock();
                         if (delegation != NULL) {
-                               delegation = nfs_detach_delegation(NFS_I(inode),
-                                       delegation, server);
-                               if (delegation != NULL)
+                               if (nfs_detach_delegation(NFS_I(inode), delegation,
+                                                       server) != NULL)
                                         nfs_free_delegation(delegation);
+                               /* Match nfs_start_delegation_return_locked */
+                               nfs_put_delegation(delegation);
                         }
                         iput(inode);
                         nfs_sb_deactive(server->super);
diff --git a/fs/nfs/delegation.h b/fs/nfs/delegation.h

index 31b8460..9b00a0b 100644 (file)
--- a/fs/nfs/delegation.h
+++ b/fs/nfs/delegation.h
@@ -22,6 +22,7 @@ struct nfs_delegation {
         unsigned long pagemod_limit;
         __u64 change_attr;
         unsigned long flags;
+       refcount_t refcount;
         spinlock_t lock;
         struct rcu_head rcu;
  };
diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c

index 1320288..193d6fb 100644 (file)
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -155,6 +155,7 @@ typedef struct {
         loff_t          current_index;
         decode_dirent_t decode;
  
+       unsigned long   dir_verifier;
         unsigned long   timestamp;
         unsigned long   gencount;
         unsigned int    cache_entry_index;
@@ -353,6 +354,7 @@ int nfs_readdir_xdr_filler(struct page **pages, nfs_readdir_descriptor_t *desc,
   again:
         timestamp = jiffies;
         gencount = nfs_inc_attr_generation_counter();
+       desc->dir_verifier = nfs_save_change_attribute(inode);
         error = NFS_PROTO(inode)->readdir(file_dentry(file), cred, entry->cookie, pages,
                                           NFS_SERVER(inode)->dtsize, desc->plus);
         if (error < 0) {
@@ -455,13 +457,13 @@ void nfs_force_use_readdirplus(struct inode *dir)
  }
  
  static
-void nfs_prime_dcache(struct dentry *parent, struct nfs_entry *entry)
+void nfs_prime_dcache(struct dentry *parent, struct nfs_entry *entry,
+               unsigned long dir_verifier)
  {
         struct qstr filename = QSTR_INIT(entry->name, entry->len);
         DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
         struct dentry *dentry;
         struct dentry *alias;
-       struct inode *dir = d_inode(parent);
         struct inode *inode;
         int status;
  
@@ -500,7 +502,7 @@ again:
                 if (nfs_same_file(dentry, entry)) {
                         if (!entry->fh->size)
                                 goto out;
-                       nfs_set_verifier(dentry, nfs_save_change_attribute(dir));
+                       nfs_set_verifier(dentry, dir_verifier);
                         status = nfs_refresh_inode(d_inode(dentry), entry->fattr);
                         if (!status)
                                 nfs_setsecurity(d_inode(dentry), entry->fattr, entry->label);
@@ -526,7 +528,7 @@ again:
                 dput(dentry);
                 dentry = alias;
         }
-       nfs_set_verifier(dentry, nfs_save_change_attribute(dir));
+       nfs_set_verifier(dentry, dir_verifier);
  out:
         dput(dentry);
  }
@@ -564,7 +566,8 @@ int nfs_readdir_page_filler(nfs_readdir_descriptor_t *desc, struct nfs_entry *en
                 count++;
  
                 if (desc->plus)
-                       nfs_prime_dcache(file_dentry(desc->file), entry);
+                       nfs_prime_dcache(file_dentry(desc->file), entry,
+                                       desc->dir_verifier);
  
                 status = nfs_readdir_add_to_array(entry, page);
                 if (status != 0)
@@ -983,14 +986,113 @@ static int nfs_fsync_dir(struct file *filp, loff_t start, loff_t end,
   * full lookup on all child dentries of 'dir' whenever a change occurs
   * on the server that might have invalidated our dcache.
   *
+ * Note that we reserve bit '0' as a tag to let us know when a dentry
+ * was revalidated while holding a delegation on its inode.
+ *
   * The caller should be holding dir->i_lock
   */
  void nfs_force_lookup_revalidate(struct inode *dir)
  {
-       NFS_I(dir)->cache_change_attribute++;
+       NFS_I(dir)->cache_change_attribute += 2;
  }
  EXPORT_SYMBOL_GPL(nfs_force_lookup_revalidate);
  
+/**
+ * nfs_verify_change_attribute - Detects NFS remote directory changes
+ * @dir: pointer to parent directory inode
+ * @verf: previously saved change attribute
+ *
+ * Return "false" if the verifiers doesn't match the change attribute.
+ * This would usually indicate that the directory contents have changed on
+ * the server, and that any dentries need revalidating.
+ */
+static bool nfs_verify_change_attribute(struct inode *dir, unsigned long verf)
+{
+       return (verf & ~1UL) == nfs_save_change_attribute(dir);
+}
+
+static void nfs_set_verifier_delegated(unsigned long *verf)
+{
+       *verf |= 1UL;
+}
+
+#if IS_ENABLED(CONFIG_NFS_V4)
+static void nfs_unset_verifier_delegated(unsigned long *verf)
+{
+       *verf &= ~1UL;
+}
+#endif /* IS_ENABLED(CONFIG_NFS_V4) */
+
+static bool nfs_test_verifier_delegated(unsigned long verf)
+{
+       return verf & 1;
+}
+
+static bool nfs_verifier_is_delegated(struct dentry *dentry)
+{
+       return nfs_test_verifier_delegated(dentry->d_time);
+}
+
+static void nfs_set_verifier_locked(struct dentry *dentry, unsigned long verf)
+{
+       struct inode *inode = d_inode(dentry);
+
+       if (!nfs_verifier_is_delegated(dentry) &&
+           !nfs_verify_change_attribute(d_inode(dentry->d_parent), verf))
+               goto out;
+       if (inode && NFS_PROTO(inode)->have_delegation(inode, FMODE_READ))
+               nfs_set_verifier_delegated(&verf);
+out:
+       dentry->d_time = verf;
+}
+
+/**
+ * nfs_set_verifier - save a parent directory verifier in the dentry
+ * @dentry: pointer to dentry
+ * @verf: verifier to save
+ *
+ * Saves the parent directory verifier in @dentry. If the inode has
+ * a delegation, we also tag the dentry as having been revalidated
+ * while holding a delegation so that we know we don't have to
+ * look it up again after a directory change.
+ */
+void nfs_set_verifier(struct dentry *dentry, unsigned long verf)
+{
+
+       spin_lock(&dentry->d_lock);
+       nfs_set_verifier_locked(dentry, verf);
+       spin_unlock(&dentry->d_lock);
+}
+EXPORT_SYMBOL_GPL(nfs_set_verifier);
+
+#if IS_ENABLED(CONFIG_NFS_V4)
+/**
+ * nfs_clear_verifier_delegated - clear the dir verifier delegation tag
+ * @inode: pointer to inode
+ *
+ * Iterates through the dentries in the inode alias list and clears
+ * the tag used to indicate that the dentry has been revalidated
+ * while holding a delegation.
+ * This function is intended for use when the delegation is being
+ * returned or revoked.
+ */
+void nfs_clear_verifier_delegated(struct inode *inode)
+{
+       struct dentry *alias;
+
+       if (!inode)
+               return;
+       spin_lock(&inode->i_lock);
+       hlist_for_each_entry(alias, &inode->i_dentry, d_u.d_alias) {
+               spin_lock(&alias->d_lock);
+               nfs_unset_verifier_delegated(&alias->d_time);
+               spin_unlock(&alias->d_lock);
+       }
+       spin_unlock(&inode->i_lock);
+}
+EXPORT_SYMBOL_GPL(nfs_clear_verifier_delegated);
+#endif /* IS_ENABLED(CONFIG_NFS_V4) */
+
  /*
   * A check for whether or not the parent directory has changed.
   * In the case it has, we assume that the dentries are untrustworthy
@@ -1159,6 +1261,7 @@ nfs_lookup_revalidate_dentry(struct inode *dir, struct dentry *dentry,
         struct nfs_fh *fhandle;
         struct nfs_fattr *fattr;
         struct nfs4_label *label;
+       unsigned long dir_verifier;
         int ret;
  
         ret = -ENOMEM;
@@ -1168,6 +1271,7 @@ nfs_lookup_revalidate_dentry(struct inode *dir, struct dentry *dentry,
         if (fhandle == NULL || fattr == NULL || IS_ERR(label))
                 goto out;
  
+       dir_verifier = nfs_save_change_attribute(dir);
         ret = NFS_PROTO(dir)->lookup(dir, dentry, fhandle, fattr, label);
         if (ret < 0) {
                 switch (ret) {
@@ -1188,7 +1292,7 @@ nfs_lookup_revalidate_dentry(struct inode *dir, struct dentry *dentry,
                 goto out;
  
         nfs_setsecurity(inode, fattr, label);
-       nfs_set_verifier(dentry, nfs_save_change_attribute(dir));
+       nfs_set_verifier(dentry, dir_verifier);
  
         /* set a readdirplus hint that we had a cache miss */
         nfs_force_use_readdirplus(dir);
@@ -1230,7 +1334,7 @@ nfs_do_lookup_revalidate(struct inode *dir, struct dentry *dentry,
                 goto out_bad;
         }
  
-       if (NFS_PROTO(dir)->have_delegation(inode, FMODE_READ))
+       if (nfs_verifier_is_delegated(dentry))
                 return nfs_lookup_revalidate_delegated(dir, dentry, inode);
  
         /* Force a full look up iff the parent directory has changed */
@@ -1415,6 +1519,7 @@ struct dentry *nfs_lookup(struct inode *dir, struct dentry * dentry, unsigned in
         struct nfs_fh *fhandle = NULL;
         struct nfs_fattr *fattr = NULL;
         struct nfs4_label *label = NULL;
+       unsigned long dir_verifier;
         int error;
  
         dfprintk(VFS, "NFS: lookup(%pd2)\n", dentry);
@@ -1440,6 +1545,7 @@ struct dentry *nfs_lookup(struct inode *dir, struct dentry * dentry, unsigned in
         if (IS_ERR(label))
                 goto out;
  
+       dir_verifier = nfs_save_change_attribute(dir);
         trace_nfs_lookup_enter(dir, dentry, flags);
         error = NFS_PROTO(dir)->lookup(dir, dentry, fhandle, fattr, label);
         if (error == -ENOENT)
@@ -1463,7 +1569,7 @@ no_entry:
                         goto out_label;
                 dentry = res;
         }
-       nfs_set_verifier(dentry, nfs_save_change_attribute(dir));
+       nfs_set_verifier(dentry, dir_verifier);
  out_label:
         trace_nfs_lookup_exit(dir, dentry, flags, error);
         nfs4_label_free(label);
@@ -1668,7 +1774,7 @@ nfs4_do_lookup_revalidate(struct inode *dir, struct dentry *dentry,
         if (inode == NULL)
                 goto full_reval;
  
-       if (NFS_PROTO(dir)->have_delegation(inode, FMODE_READ))
+       if (nfs_verifier_is_delegated(dentry))
                 return nfs_lookup_revalidate_delegated(dir, dentry, inode);
  
         /* NFS only supports OPEN on regular files */
diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c

index 1309e6f..11bf158 100644 (file)
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -2114,6 +2114,7 @@ static void init_once(void *foo)
         init_rwsem(&nfsi->rmdir_sem);
         mutex_init(&nfsi->commit_mutex);
         nfs4_init_once(nfsi);
+       nfsi->cache_change_attribute = 0;
  }
  
  static int __init nfs_init_inodecache(void)
diff --git a/fs/nfs/nfs4file.c b/fs/nfs/nfs4file.c

index be4eb72..1297919 100644 (file)
--- a/fs/nfs/nfs4file.c
+++ b/fs/nfs/nfs4file.c
@@ -87,7 +87,6 @@ nfs4_file_open(struct inode *inode, struct file *filp)
         if (inode != d_inode(dentry))
                 goto out_drop;
  
-       nfs_set_verifier(dentry, nfs_save_change_attribute(dir));
         nfs_file_set_open_context(filp, ctx);
         nfs_fscache_open_file(inode, filp);
         err = 0;
diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c

index 95d07a3..69b7ab7 100644 (file)
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -2974,10 +2974,13 @@ static int _nfs4_open_and_get_state(struct nfs4_opendata *opendata,
         struct dentry *dentry;
         struct nfs4_state *state;
         fmode_t acc_mode = _nfs4_ctx_to_accessmode(ctx);
+       struct inode *dir = d_inode(opendata->dir);
+       unsigned long dir_verifier;
         unsigned int seq;
         int ret;
  
         seq = raw_seqcount_begin(&sp->so_reclaim_seqcount);
+       dir_verifier = nfs_save_change_attribute(dir);
  
         ret = _nfs4_proc_open(opendata, ctx);
         if (ret != 0)
@@ -3005,8 +3008,19 @@ static int _nfs4_open_and_get_state(struct nfs4_opendata *opendata,
                         dput(ctx->dentry);
                         ctx->dentry = dentry = alias;
                 }
-               nfs_set_verifier(dentry,
-                               nfs_save_change_attribute(d_inode(opendata->dir)));
+       }
+
+       switch(opendata->o_arg.claim) {
+       default:
+               break;
+       case NFS4_OPEN_CLAIM_NULL:
+       case NFS4_OPEN_CLAIM_DELEGATE_CUR:
+       case NFS4_OPEN_CLAIM_DELEGATE_PREV:
+               if (!opendata->rpc_done)
+                       break;
+               if (opendata->o_res.delegation_type != 0)
+                       dir_verifier = nfs_save_change_attribute(dir);
+               nfs_set_verifier(dentry, dir_verifier);
         }
  
         /* Parse layoutget results before we check for access */
@@ -5322,7 +5336,7 @@ static void nfs4_proc_write_setup(struct nfs_pgio_header *hdr,
         hdr->timestamp   = jiffies;
  
         msg->rpc_proc = &nfs4_procedures[NFSPROC4_CLNT_WRITE];
-       nfs4_init_sequence(&hdr->args.seq_args, &hdr->res.seq_res, 1, 0);
+       nfs4_init_sequence(&hdr->args.seq_args, &hdr->res.seq_res, 0, 0);
         nfs4_state_protect_write(server->nfs_client, clnt, msg, hdr);
  }
  
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c

index 3a688eb..58e937b 100644 (file)
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -587,7 +587,7 @@ xfs_dax_writepages(
  
         xfs_iflags_clear(ip, XFS_ITRUNCATED);
         return dax_writeback_mapping_range(mapping,
-                       xfs_inode_buftarg(ip)->bt_bdev, wbc);
+                       xfs_inode_buftarg(ip)->bt_daxdev, wbc);
  }
  
  STATIC sector_t
diff --git a/include/acpi/acpixf.h b/include/acpi/acpixf.h

index 00994b1..5867777 100644 (file)
--- a/include/acpi/acpixf.h
+++ b/include/acpi/acpixf.h
@@ -752,6 +752,7 @@ ACPI_HW_DEPENDENT_RETURN_UINT32(u32 acpi_dispatch_gpe(acpi_handle gpe_device, u3
  ACPI_HW_DEPENDENT_RETURN_STATUS(acpi_status acpi_disable_all_gpes(void))
  ACPI_HW_DEPENDENT_RETURN_STATUS(acpi_status acpi_enable_all_runtime_gpes(void))
  ACPI_HW_DEPENDENT_RETURN_STATUS(acpi_status acpi_enable_all_wakeup_gpes(void))
+ACPI_HW_DEPENDENT_RETURN_UINT32(u32 acpi_any_gpe_status_set(void))
  
  ACPI_HW_DEPENDENT_RETURN_STATUS(acpi_status
                                 acpi_get_gpe_device(u32 gpe_index,
diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h

index 018dce8..0fb561d 100644 (file)
--- a/include/linux/cpufreq.h
+++ b/include/linux/cpufreq.h
@@ -201,9 +201,6 @@ static inline bool policy_is_shared(struct cpufreq_policy *policy)
         return cpumask_weight(policy->cpus) > 1;
  }
  
-/* /sys/devices/system/cpu/cpufreq: entry point for global variables */
-extern struct kobject *cpufreq_global_kobject;
-
  #ifdef CONFIG_CPU_FREQ
  unsigned int cpufreq_get(unsigned int cpu);
  unsigned int cpufreq_quick_get(unsigned int cpu);
diff --git a/include/linux/dax.h b/include/linux/dax.h

index 9bd8528..328c2db 100644 (file)
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -129,11 +129,6 @@ static inline bool generic_fsdax_supported(struct dax_device *dax_dev,
                         sectors);
  }
  
-static inline struct dax_device *fs_dax_get_by_host(const char *host)
-{
-       return dax_get_by_host(host);
-}
-
  static inline void fs_put_dax(struct dax_device *dax_dev)
  {
         put_dax(dax_dev);
@@ -141,7 +136,7 @@ static inline void fs_put_dax(struct dax_device *dax_dev)
  
  struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev);
  int dax_writeback_mapping_range(struct address_space *mapping,
-               struct block_device *bdev, struct writeback_control *wbc);
+               struct dax_device *dax_dev, struct writeback_control *wbc);
  
  struct page *dax_layout_busy_page(struct address_space *mapping);
  dax_entry_t dax_lock_page(struct page *page);
@@ -160,11 +155,6 @@ static inline bool generic_fsdax_supported(struct dax_device *dax_dev,
         return false;
  }
  
-static inline struct dax_device *fs_dax_get_by_host(const char *host)
-{
-       return NULL;
-}
-
  static inline void fs_put_dax(struct dax_device *dax_dev)
  {
  }
@@ -180,7 +170,7 @@ static inline struct page *dax_layout_busy_page(struct address_space *mapping)
  }
  
  static inline int dax_writeback_mapping_range(struct address_space *mapping,
-               struct block_device *bdev, struct writeback_control *wbc)
+               struct dax_device *dax_dev, struct writeback_control *wbc)
  {
         return -EOPNOTSUPP;
  }
diff --git a/include/linux/icmpv6.h b/include/linux/icmpv6.h

index ef1cbb5..93338fd 100644 (file)
--- a/include/linux/icmpv6.h
+++ b/include/linux/icmpv6.h
@@ -31,6 +31,12 @@ static inline void icmpv6_send(struct sk_buff *skb,
  }
  #endif
  
+#if IS_ENABLED(CONFIG_NF_NAT)
+void icmpv6_ndo_send(struct sk_buff *skb_in, u8 type, u8 code, __u32 info);
+#else
+#define icmpv6_ndo_send icmpv6_send
+#endif
+
  extern int                             icmpv6_init(void);
  extern int                             icmpv6_err_convert(u8 type, u8 code,
                                                            int *err);
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h

index a9c6b5c..9f1f633 100644 (file)
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1616,6 +1616,7 @@ enum netdev_priv_flags {
   *                             and drivers will need to set them appropriately.
   *
   *     @mpls_features: Mask of features inheritable by MPLS
+ *     @gso_partial_features: value(s) from NETIF_F_GSO\*
   *
   *     @ifindex:       interface index
   *     @group:         The group the device belongs to
@@ -1640,8 +1641,11 @@ enum netdev_priv_flags {
   *     @netdev_ops:    Includes several pointers to callbacks,
   *                     if one wants to override the ndo_*() functions
   *     @ethtool_ops:   Management operations
+ *     @l3mdev_ops:    Layer 3 master device operations
   *     @ndisc_ops:     Includes callbacks for different IPv6 neighbour
   *                     discovery handling. Necessary for e.g. 6LoWPAN.
+ *     @xfrmdev_ops:   Transformation offload operations
+ *     @tlsdev_ops:    Transport Layer Security offload operations
   *     @header_ops:    Includes callbacks for creating,parsing,caching,etc
   *                     of Layer 2 headers.
   *
@@ -1680,6 +1684,7 @@ enum netdev_priv_flags {
   *     @dev_port:              Used to differentiate devices that share
   *                             the same function
   *     @addr_list_lock:        XXX: need comments on this one
+ *     @name_assign_type:      network interface name assignment type
   *     @uc_promisc:            Counter that indicates promiscuous mode
   *                             has been enabled due to the need to listen to
   *                             additional unicast addresses in a device that
@@ -1702,6 +1707,9 @@ enum netdev_priv_flags {
   *     @ip6_ptr:       IPv6 specific data
   *     @ax25_ptr:      AX.25 specific data
   *     @ieee80211_ptr: IEEE 802.11 specific data, assign before registering
+ *     @ieee802154_ptr: IEEE 802.15.4 low-rate Wireless Personal Area Network
+ *                      device struct
+ *     @mpls_ptr:      mpls_dev struct pointer
   *
   *     @dev_addr:      Hw address (before bcast,
   *                     because most packets are unicast)
@@ -1710,6 +1718,8 @@ enum netdev_priv_flags {
   *     @num_rx_queues:         Number of RX queues
   *                             allocated at register_netdev() time
   *     @real_num_rx_queues:    Number of RX queues currently active in device
+ *     @xdp_prog:              XDP sockets filter program pointer
+ *     @gro_flush_timeout:     timeout for GRO layer in NAPI
   *
   *     @rx_handler:            handler for received packets
   *     @rx_handler_data:       XXX: need comments on this one
@@ -1731,10 +1741,14 @@ enum netdev_priv_flags {
   *     @qdisc:                 Root qdisc from userspace point of view
   *     @tx_queue_len:          Max frames per queue allowed
   *     @tx_global_lock:        XXX: need comments on this one
+ *     @xdp_bulkq:             XDP device bulk queue
+ *     @xps_cpus_map:          all CPUs map for XPS device
+ *     @xps_rxqs_map:          all RXQs map for XPS device
   *
   *     @xps_maps:      XXX: need comments on this one
   *     @miniq_egress:          clsact qdisc specific data for
   *                             egress processing
+ *     @qdisc_hash:            qdisc hash table
   *     @watchdog_timeo:        Represents the timeout that is used by
   *                             the watchdog (see dev_watchdog())
   *     @watchdog_timer:        List of timers
@@ -3548,7 +3562,7 @@ static inline unsigned int netif_attrmask_next(int n, const unsigned long *srcp,
  }
  
  /**
- *     netif_attrmask_next_and - get the next CPU/Rx queue in *src1p & *src2p
+ *     netif_attrmask_next_and - get the next CPU/Rx queue in \*src1p & \*src2p
   *     @n: CPU/Rx queue index
   *     @src1p: the first CPUs/Rx queues mask pointer
   *     @src2p: the second CPUs/Rx queues mask pointer
diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h

index a5f8f03..5d5b91e 100644 (file)
--- a/include/linux/nfs_fs.h
+++ b/include/linux/nfs_fs.h
@@ -337,35 +337,17 @@ static inline int nfs_server_capable(struct inode *inode, int cap)
         return NFS_SERVER(inode)->caps & cap;
  }
  
-static inline void nfs_set_verifier(struct dentry * dentry, unsigned long verf)
-{
-       dentry->d_time = verf;
-}
-
  /**
   * nfs_save_change_attribute - Returns the inode attribute change cookie
   * @dir - pointer to parent directory inode
- * The "change attribute" is updated every time we finish an operation
- * that will result in a metadata change on the server.
+ * The "cache change attribute" is updated when we need to revalidate
+ * our dentry cache after a directory was seen to change on the server.
   */
  static inline unsigned long nfs_save_change_attribute(struct inode *dir)
  {
         return NFS_I(dir)->cache_change_attribute;
  }
  
-/**
- * nfs_verify_change_attribute - Detects NFS remote directory changes
- * @dir - pointer to parent directory inode
- * @chattr - previously saved change attribute
- * Return "false" if the verifiers doesn't match the change attribute.
- * This would usually indicate that the directory contents have changed on
- * the server, and that any dentries need revalidating.
- */
-static inline int nfs_verify_change_attribute(struct inode *dir, unsigned long chattr)
-{
-       return chattr == NFS_I(dir)->cache_change_attribute;
-}
-
  /*
   * linux/fs/nfs/inode.c
   */
@@ -495,6 +477,10 @@ extern const struct file_operations nfs_dir_operations;
  extern const struct dentry_operations nfs_dentry_operations;
  
  extern void nfs_force_lookup_revalidate(struct inode *dir);
+extern void nfs_set_verifier(struct dentry * dentry, unsigned long verf);
+#if IS_ENABLED(CONFIG_NFS_V4)
+extern void nfs_clear_verifier_delegated(struct inode *inode);
+#endif /* IS_ENABLED(CONFIG_NFS_V4) */
  extern struct dentry *nfs_add_or_obtain(struct dentry *dentry,
                         struct nfs_fh *fh, struct nfs_fattr *fattr,
                         struct nfs4_label *label);
diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h

index d576503..ae58fad 100644 (file)
--- a/include/linux/pipe_fs_i.h
+++ b/include/linux/pipe_fs_i.h
@@ -29,7 +29,8 @@ struct pipe_buffer {
  /**
   *     struct pipe_inode_info - a linux kernel pipe
   *     @mutex: mutex protecting the whole thing
- *     @wait: reader/writer wait point in case of empty/full pipe
+ *     @rd_wait: reader wait point in case of empty pipe
+ *     @wr_wait: writer wait point in case of full pipe
   *     @head: The point of buffer production
   *     @tail: The point of buffer consumption
   *     @max_usage: The maximum number of slots that may be used in the ring
diff --git a/include/linux/sched/nohz.h b/include/linux/sched/nohz.h

index 1abe91f..6d67e9a 100644 (file)
--- a/include/linux/sched/nohz.h
+++ b/include/linux/sched/nohz.h
@@ -15,9 +15,11 @@ static inline void nohz_balance_enter_idle(int cpu) { }
  
  #ifdef CONFIG_NO_HZ_COMMON
  void calc_load_nohz_start(void);
+void calc_load_nohz_remote(struct rq *rq);
  void calc_load_nohz_stop(void);
  #else
  static inline void calc_load_nohz_start(void) { }
+static inline void calc_load_nohz_remote(struct rq *rq) { }
  static inline void calc_load_nohz_stop(void) { }
  #endif /* CONFIG_NO_HZ_COMMON */
  
diff --git a/include/linux/suspend.h b/include/linux/suspend.h

index 4a230c2..2b2055b 100644 (file)
--- a/include/linux/suspend.h
+++ b/include/linux/suspend.h
@@ -191,7 +191,7 @@ struct platform_s2idle_ops {
         int (*begin)(void);
         int (*prepare)(void);
         int (*prepare_late)(void);
-       void (*wake)(void);
+       bool (*wake)(void);
         void (*restore_early)(void);
         void (*restore)(void);
         void (*end)(void);
diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h

index af2c85d..6c7a10a 100644 (file)
--- a/include/linux/trace_events.h
+++ b/include/linux/trace_events.h
@@ -440,7 +440,7 @@ struct synth_event_trace_state {
         struct synth_event *event;
         unsigned int cur_field;
         unsigned int n_u64;
-       bool enabled;
+       bool disabled;
         bool add_next;
         bool add_name;
  };
diff --git a/include/net/flow_dissector.h b/include/net/flow_dissector.h

index d93017a..e9391e8 100644 (file)
--- a/include/net/flow_dissector.h
+++ b/include/net/flow_dissector.h
@@ -33,7 +33,6 @@ enum flow_dissect_ret {
  
  /**
   * struct flow_dissector_key_basic:
- * @thoff: Transport header offset
   * @n_proto: Network header protocol (eg. IPv4/IPv6)
   * @ip_proto: Transport header protocol (eg. TCP/UDP)
   */
diff --git a/include/net/icmp.h b/include/net/icmp.h

index 5d4bfdb..9ac2d26 100644 (file)
--- a/include/net/icmp.h
+++ b/include/net/icmp.h
@@ -43,6 +43,12 @@ static inline void icmp_send(struct sk_buff *skb_in, int type, int code, __be32
         __icmp_send(skb_in, type, code, info, &IPCB(skb_in)->opt);
  }
  
+#if IS_ENABLED(CONFIG_NF_NAT)
+void icmp_ndo_send(struct sk_buff *skb_in, int type, int code, __be32 info);
+#else
+#define icmp_ndo_send icmp_send
+#endif
+
  int icmp_rcv(struct sk_buff *skb);
  int icmp_err(struct sk_buff *skb, u32 info);
  int icmp_init(void);
diff --git a/include/net/mac80211.h b/include/net/mac80211.h

index aa14580..77e6b5a 100644 (file)
--- a/include/net/mac80211.h
+++ b/include/net/mac80211.h
@@ -1004,12 +1004,11 @@ ieee80211_rate_get_vht_nss(const struct ieee80211_tx_rate *rate)
  struct ieee80211_tx_info {
         /* common information */
         u32 flags;
-       u8 band;
-
-       u8 hw_queue;
-
-       u16 ack_frame_id:6;
-       u16 tx_time_est:10;
+       u32 band:3,
+           ack_frame_id:13,
+           hw_queue:4,
+           tx_time_est:10;
+       /* 2 free bits */
  
         union {
                 struct {
diff --git a/init/Kconfig b/init/Kconfig

index cfee56c..452bc18 100644 (file)
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1227,7 +1227,6 @@ endif
  config BOOT_CONFIG
         bool "Boot config support"
         depends on BLK_DEV_INITRD
-       select LIBXBC
         default y
         help
           Extra boot config allows system admin to pass a config file as
diff --git a/init/main.c b/init/main.c

index cc0ee48..f95b014 100644 (file)
--- a/init/main.c
+++ b/init/main.c
@@ -142,6 +142,15 @@ static char *extra_command_line;
  /* Extra init arguments */
  static char *extra_init_args;
  
+#ifdef CONFIG_BOOT_CONFIG
+/* Is bootconfig on command line? */
+static bool bootconfig_found;
+static bool initargs_found;
+#else
+# define bootconfig_found false
+# define initargs_found false
+#endif
+
  static char *execute_command;
  static char *ramdisk_execute_command;
  
@@ -336,17 +345,30 @@ u32 boot_config_checksum(unsigned char *p, u32 size)
         return ret;
  }
  
+static int __init bootconfig_params(char *param, char *val,
+                                   const char *unused, void *arg)
+{
+       if (strcmp(param, "bootconfig") == 0) {
+               bootconfig_found = true;
+       } else if (strcmp(param, "--") == 0) {
+               initargs_found = true;
+       }
+       return 0;
+}
+
  static void __init setup_boot_config(const char *cmdline)
  {
+       static char tmp_cmdline[COMMAND_LINE_SIZE] __initdata;
         u32 size, csum;
         char *data, *copy;
-       const char *p;
         u32 *hdr;
         int ret;
  
-       p = strstr(cmdline, "bootconfig");
-       if (!p || (p != cmdline && !isspace(*(p-1))) ||
-           (p[10] && !isspace(p[10])))
+       strlcpy(tmp_cmdline, boot_command_line, COMMAND_LINE_SIZE);
+       parse_args("bootconfig", tmp_cmdline, NULL, 0, 0, 0, NULL,
+                  bootconfig_params);
+
+       if (!bootconfig_found)
                 return;
  
         if (!initrd_end)
@@ -562,11 +584,12 @@ static void __init setup_command_line(char *command_line)
                  * to init.
                  */
                 len = strlen(saved_command_line);
-               if (!strstr(boot_command_line, " -- ")) {
+               if (initargs_found) {
+                       saved_command_line[len++] = ' ';
+               } else {
                         strcpy(saved_command_line + len, " -- ");
                         len += 4;
-               } else
-                       saved_command_line[len++] = ' ';
+               }
  
                 strcpy(saved_command_line + len, extra_init_args);
         }
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c

index db552b9..75f6873 100644 (file)
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -5927,11 +5927,14 @@ void cgroup_post_fork(struct task_struct *child)
  
         spin_lock_irq(&css_set_lock);
  
-       WARN_ON_ONCE(!list_empty(&child->cg_list));
-       cset = task_css_set(current); /* current is @child's parent */
-       get_css_set(cset);
-       cset->nr_tasks++;
-       css_set_move_task(child, NULL, cset, false);
+       /* init tasks are special, only link regular threads */
+       if (likely(child->pid)) {
+               WARN_ON_ONCE(!list_empty(&child->cg_list));
+               cset = task_css_set(current); /* current is @child's parent */
+               get_css_set(cset);
+               cset->nr_tasks++;
+               css_set_move_task(child, NULL, cset, false);
+       }
  
         /*
          * If the cgroup has to be frozen, the new task has too.  Let's set
diff --git a/kernel/power/suspend.c b/kernel/power/suspend.c

index 2c47280..8b1bb5e 100644 (file)
--- a/kernel/power/suspend.c
+++ b/kernel/power/suspend.c
@@ -131,11 +131,12 @@ static void s2idle_loop(void)
          * to avoid them upfront.
          */
         for (;;) {
-               if (s2idle_ops && s2idle_ops->wake)
-                       s2idle_ops->wake();
-
-               if (pm_wakeup_pending())
+               if (s2idle_ops && s2idle_ops->wake) {
+                       if (s2idle_ops->wake())
+                               break;
+               } else if (pm_wakeup_pending()) {
                         break;
+               }
  
                 pm_wakeup_clear(false);
  
diff --git a/kernel/sched/core.c b/kernel/sched/core.c

index fc1dfc0..1a9983d 100644 (file)
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -552,27 +552,32 @@ void resched_cpu(int cpu)
   */
  int get_nohz_timer_target(void)
  {
-       int i, cpu = smp_processor_id();
+       int i, cpu = smp_processor_id(), default_cpu = -1;
         struct sched_domain *sd;
  
-       if (!idle_cpu(cpu) && housekeeping_cpu(cpu, HK_FLAG_TIMER))
-               return cpu;
+       if (housekeeping_cpu(cpu, HK_FLAG_TIMER)) {
+               if (!idle_cpu(cpu))
+                       return cpu;
+               default_cpu = cpu;
+       }
  
         rcu_read_lock();
         for_each_domain(cpu, sd) {
-               for_each_cpu(i, sched_domain_span(sd)) {
+               for_each_cpu_and(i, sched_domain_span(sd),
+                       housekeeping_cpumask(HK_FLAG_TIMER)) {
                         if (cpu == i)
                                 continue;
  
-                       if (!idle_cpu(i) && housekeeping_cpu(i, HK_FLAG_TIMER)) {
+                       if (!idle_cpu(i)) {
                                 cpu = i;
                                 goto unlock;
                         }
                 }
         }
  
-       if (!housekeeping_cpu(cpu, HK_FLAG_TIMER))
-               cpu = housekeeping_any_cpu(HK_FLAG_TIMER);
+       if (default_cpu == -1)
+               default_cpu = housekeeping_any_cpu(HK_FLAG_TIMER);
+       cpu = default_cpu;
  unlock:
         rcu_read_unlock();
         return cpu;
@@ -1442,17 +1447,6 @@ void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags)
  
  #ifdef CONFIG_SMP
  
-static inline bool is_per_cpu_kthread(struct task_struct *p)
-{
-       if (!(p->flags & PF_KTHREAD))
-               return false;
-
-       if (p->nr_cpus_allowed != 1)
-               return false;
-
-       return true;
-}
-
  /*
   * Per-CPU kthreads are allowed to run on !active && online CPUs, see
   * __set_cpus_allowed_ptr() and select_fallback_rq().
@@ -3669,28 +3663,32 @@ static void sched_tick_remote(struct work_struct *work)
          * statistics and checks timeslices in a time-independent way, regardless
          * of when exactly it is running.
          */
-       if (idle_cpu(cpu) || !tick_nohz_tick_stopped_cpu(cpu))
+       if (!tick_nohz_tick_stopped_cpu(cpu))
                 goto out_requeue;
  
         rq_lock_irq(rq, &rf);
         curr = rq->curr;
-       if (is_idle_task(curr) || cpu_is_offline(cpu))
+       if (cpu_is_offline(cpu))
                 goto out_unlock;
  
+       curr = rq->curr;
         update_rq_clock(rq);
-       delta = rq_clock_task(rq) - curr->se.exec_start;
  
-       /*
-        * Make sure the next tick runs within a reasonable
-        * amount of time.
-        */
-       WARN_ON_ONCE(delta > (u64)NSEC_PER_SEC * 3);
+       if (!is_idle_task(curr)) {
+               /*
+                * Make sure the next tick runs within a reasonable
+                * amount of time.
+                */
+               delta = rq_clock_task(rq) - curr->se.exec_start;
+               WARN_ON_ONCE(delta > (u64)NSEC_PER_SEC * 3);
+       }
         curr->sched_class->task_tick(rq, curr, 0);
  
+       calc_load_nohz_remote(rq);
  out_unlock:
         rq_unlock_irq(rq, &rf);
-
  out_requeue:
+
         /*
          * Run the remote tick once per second (1Hz). This arbitrary
          * frequency is large enough to avoid overload but short enough
@@ -7063,8 +7061,15 @@ void sched_move_task(struct task_struct *tsk)
  
         if (queued)
                 enqueue_task(rq, tsk, queue_flags);
-       if (running)
+       if (running) {
                 set_next_task(rq, tsk);
+               /*
+                * After changing group, the running task may have joined a
+                * throttled one but it's still the running task. Trigger a
+                * resched to make sure that task can still run.
+                */
+               resched_curr(rq);
+       }
  
         task_rq_unlock(rq, tsk, &rf);
  }
@@ -7260,7 +7265,7 @@ capacity_from_percent(char *buf)
                                              &req.percent);
                 if (req.ret)
                         return req;
-               if (req.percent > UCLAMP_PERCENT_SCALE) {
+               if ((u64)req.percent > UCLAMP_PERCENT_SCALE) {
                         req.ret = -ERANGE;
                         return req;
                 }
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c

index fe4e0d7..3c8a379 100644 (file)
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3516,7 +3516,6 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
   * attach_entity_load_avg - attach this entity to its cfs_rq load avg
   * @cfs_rq: cfs_rq to attach to
   * @se: sched_entity to attach
- * @flags: migration hints
   *
   * Must call update_cfs_rq_load_avg() before this, since we rely on
   * cfs_rq->avg.last_update_time being current.
@@ -5912,6 +5911,20 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
             (available_idle_cpu(prev) || sched_idle_cpu(prev)))
                 return prev;
  
+       /*
+        * Allow a per-cpu kthread to stack with the wakee if the
+        * kworker thread and the tasks previous CPUs are the same.
+        * The assumption is that the wakee queued work for the
+        * per-cpu kthread that is now complete and the wakeup is
+        * essentially a sync wakeup. An obvious example of this
+        * pattern is IO completions.
+        */
+       if (is_per_cpu_kthread(current) &&
+           prev == smp_processor_id() &&
+           this_rq()->nr_running <= 1) {
+               return prev;
+       }
+
         /* Check a recently used CPU as a potential idle candidate: */
         recent_used_cpu = p->recent_used_cpu;
         if (recent_used_cpu != prev &&
@@ -8658,10 +8671,6 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
         /*
          * Try to use spare capacity of local group without overloading it or
          * emptying busiest.
-        * XXX Spreading tasks across NUMA nodes is not always the best policy
-        * and special care should be taken for SD_NUMA domain level before
-        * spreading the tasks. For now, load_balance() fully relies on
-        * NUMA_BALANCING and fbq_classify_group/rq to override the decision.
          */
         if (local->group_type == group_has_spare) {
                 if (busiest->group_type > group_fully_busy) {
@@ -8701,16 +8710,37 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
                         env->migration_type = migrate_task;
                         lsub_positive(&nr_diff, local->sum_nr_running);
                         env->imbalance = nr_diff >> 1;
-                       return;
-               }
+               } else {
  
-               /*
-                * If there is no overload, we just want to even the number of
-                * idle cpus.
-                */
-               env->migration_type = migrate_task;
-               env->imbalance = max_t(long, 0, (local->idle_cpus -
+                       /*
+                        * If there is no overload, we just want to even the number of
+                        * idle cpus.
+                        */
+                       env->migration_type = migrate_task;
+                       env->imbalance = max_t(long, 0, (local->idle_cpus -
                                                  busiest->idle_cpus) >> 1);
+               }
+
+               /* Consider allowing a small imbalance between NUMA groups */
+               if (env->sd->flags & SD_NUMA) {
+                       unsigned int imbalance_min;
+
+                       /*
+                        * Compute an allowed imbalance based on a simple
+                        * pair of communicating tasks that should remain
+                        * local and ignore them.
+                        *
+                        * NOTE: Generally this would have been based on
+                        * the domain size and this was evaluated. However,
+                        * the benefit is similar across a range of workloads
+                        * and machines but scaling by the domain size adds
+                        * the risk that lower domains have to be rebalanced.
+                        */
+                       imbalance_min = 2;
+                       if (busiest->sum_nr_running <= imbalance_min)
+                               env->imbalance = 0;
+               }
+
                 return;
         }
  
diff --git a/kernel/sched/loadavg.c b/kernel/sched/loadavg.c

index 28a5165..de22da6 100644 (file)
--- a/kernel/sched/loadavg.c
+++ b/kernel/sched/loadavg.c
@@ -231,16 +231,11 @@ static inline int calc_load_read_idx(void)
         return calc_load_idx & 1;
  }
  
-void calc_load_nohz_start(void)
+static void calc_load_nohz_fold(struct rq *rq)
  {
-       struct rq *this_rq = this_rq();
         long delta;
  
-       /*
-        * We're going into NO_HZ mode, if there's any pending delta, fold it
-        * into the pending NO_HZ delta.
-        */
-       delta = calc_load_fold_active(this_rq, 0);
+       delta = calc_load_fold_active(rq, 0);
         if (delta) {
                 int idx = calc_load_write_idx();
  
@@ -248,6 +243,24 @@ void calc_load_nohz_start(void)
         }
  }
  
+void calc_load_nohz_start(void)
+{
+       /*
+        * We're going into NO_HZ mode, if there's any pending delta, fold it
+        * into the pending NO_HZ delta.
+        */
+       calc_load_nohz_fold(this_rq());
+}
+
+/*
+ * Keep track of the load for NOHZ_FULL, must be called between
+ * calc_load_nohz_{start,stop}().
+ */
+void calc_load_nohz_remote(struct rq *rq)
+{
+       calc_load_nohz_fold(rq);
+}
+
  void calc_load_nohz_stop(void)
  {
         struct rq *this_rq = this_rq();
@@ -268,7 +281,7 @@ void calc_load_nohz_stop(void)
                 this_rq->calc_load_update += LOAD_FREQ;
  }
  
-static long calc_load_nohz_fold(void)
+static long calc_load_nohz_read(void)
  {
         int idx = calc_load_read_idx();
         long delta = 0;
@@ -323,7 +336,7 @@ static void calc_global_nohz(void)
  }
  #else /* !CONFIG_NO_HZ_COMMON */
  
-static inline long calc_load_nohz_fold(void) { return 0; }
+static inline long calc_load_nohz_read(void) { return 0; }
  static inline void calc_global_nohz(void) { }
  
  #endif /* CONFIG_NO_HZ_COMMON */
@@ -346,7 +359,7 @@ void calc_global_load(unsigned long ticks)
         /*
          * Fold the 'old' NO_HZ-delta to include all NO_HZ CPUs.
          */
-       delta = calc_load_nohz_fold();
+       delta = calc_load_nohz_read();
         if (delta)
                 atomic_long_add(delta, &calc_load_tasks);
  
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c

index ac4bd0c..0285207 100644 (file)
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -1199,6 +1199,9 @@ static ssize_t psi_write(struct file *file, const char __user *user_buf,
         if (static_branch_likely(&psi_disabled))
                 return -EOPNOTSUPP;
  
+       if (!nbytes)
+               return -EINVAL;
+
         buf_size = min(nbytes, sizeof(buf));
         if (copy_from_user(buf, user_buf, buf_size))
                 return -EFAULT;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h

index 1a88dc8..9ea6478 100644 (file)
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -896,7 +896,7 @@ struct rq {
          */
         unsigned long           nr_uninterruptible;
  
-       struct task_struct      *curr;
+       struct task_struct __rcu        *curr;
         struct task_struct      *idle;
         struct task_struct      *stop;
         unsigned long           next_balance;
@@ -2479,3 +2479,16 @@ static inline void membarrier_switch_mm(struct rq *rq,
  {
  }
  #endif
+
+#ifdef CONFIG_SMP
+static inline bool is_per_cpu_kthread(struct task_struct *p)
+{
+       if (!(p->flags & PF_KTHREAD))
+               return false;
+
+       if (p->nr_cpus_allowed != 1)
+               return false;
+
+       return true;
+}
+#endif
diff --git a/kernel/trace/trace_events_hist.c b/kernel/trace/trace_events_hist.c

index e7ce7cd..483b3fd 100644 (file)
--- a/kernel/trace/trace_events_hist.c
+++ b/kernel/trace/trace_events_hist.c
@@ -1798,6 +1798,60 @@ void synth_event_cmd_init(struct dynevent_cmd *cmd, char *buf, int maxlen)
  }
  EXPORT_SYMBOL_GPL(synth_event_cmd_init);
  
+static inline int
+__synth_event_trace_start(struct trace_event_file *file,
+                         struct synth_event_trace_state *trace_state)
+{
+       int entry_size, fields_size = 0;
+       int ret = 0;
+
+       /*
+        * Normal event tracing doesn't get called at all unless the
+        * ENABLED bit is set (which attaches the probe thus allowing
+        * this code to be called, etc).  Because this is called
+        * directly by the user, we don't have that but we still need
+        * to honor not logging when disabled.  For the the iterated
+        * trace case, we save the enabed state upon start and just
+        * ignore the following data calls.
+        */
+       if (!(file->flags & EVENT_FILE_FL_ENABLED) ||
+           trace_trigger_soft_disabled(file)) {
+               trace_state->disabled = true;
+               ret = -ENOENT;
+               goto out;
+       }
+
+       trace_state->event = file->event_call->data;
+
+       fields_size = trace_state->event->n_u64 * sizeof(u64);
+
+       /*
+        * Avoid ring buffer recursion detection, as this event
+        * is being performed within another event.
+        */
+       trace_state->buffer = file->tr->array_buffer.buffer;
+       ring_buffer_nest_start(trace_state->buffer);
+
+       entry_size = sizeof(*trace_state->entry) + fields_size;
+       trace_state->entry = trace_event_buffer_reserve(&trace_state->fbuffer,
+                                                       file,
+                                                       entry_size);
+       if (!trace_state->entry) {
+               ring_buffer_nest_end(trace_state->buffer);
+               ret = -EINVAL;
+       }
+out:
+       return ret;
+}
+
+static inline void
+__synth_event_trace_end(struct synth_event_trace_state *trace_state)
+{
+       trace_event_buffer_commit(&trace_state->fbuffer);
+
+       ring_buffer_nest_end(trace_state->buffer);
+}
+
  /**
   * synth_event_trace - Trace a synthetic event
   * @file: The trace_event_file representing the synthetic event
@@ -1819,71 +1873,38 @@ EXPORT_SYMBOL_GPL(synth_event_cmd_init);
   */
  int synth_event_trace(struct trace_event_file *file, unsigned int n_vals, ...)
  {
-       struct trace_event_buffer fbuffer;
-       struct synth_trace_event *entry;
-       struct trace_buffer *buffer;
-       struct synth_event *event;
+       struct synth_event_trace_state state;
         unsigned int i, n_u64;
-       int fields_size = 0;
         va_list args;
-       int ret = 0;
-
-       /*
-        * Normal event generation doesn't get called at all unless
-        * the ENABLED bit is set (which attaches the probe thus
-        * allowing this code to be called, etc).  Because this is
-        * called directly by the user, we don't have that but we
-        * still need to honor not logging when disabled.
-        */
-       if (!(file->flags & EVENT_FILE_FL_ENABLED))
-               return 0;
-
-       event = file->event_call->data;
-
-       if (n_vals != event->n_fields)
-               return -EINVAL;
-
-       if (trace_trigger_soft_disabled(file))
-               return -EINVAL;
-
-       fields_size = event->n_u64 * sizeof(u64);
-
-       /*
-        * Avoid ring buffer recursion detection, as this event
-        * is being performed within another event.
-        */
-       buffer = file->tr->array_buffer.buffer;
-       ring_buffer_nest_start(buffer);
+       int ret;
  
-       entry = trace_event_buffer_reserve(&fbuffer, file,
-                                          sizeof(*entry) + fields_size);
-       if (!entry) {
-               ret = -EINVAL;
-               goto out;
+       ret = __synth_event_trace_start(file, &state);
+       if (ret) {
+               if (ret == -ENOENT)
+                       ret = 0; /* just disabled, not really an error */
+               return ret;
         }
  
         va_start(args, n_vals);
-       for (i = 0, n_u64 = 0; i < event->n_fields; i++) {
+       for (i = 0, n_u64 = 0; i < state.event->n_fields; i++) {
                 u64 val;
  
                 val = va_arg(args, u64);
  
-               if (event->fields[i]->is_string) {
+               if (state.event->fields[i]->is_string) {
                         char *str_val = (char *)(long)val;
-                       char *str_field = (char *)&entry->fields[n_u64];
+                       char *str_field = (char *)&state.entry->fields[n_u64];
  
                         strscpy(str_field, str_val, STR_VAR_LEN_MAX);
                         n_u64 += STR_VAR_LEN_MAX / sizeof(u64);
                 } else {
-                       entry->fields[n_u64] = val;
+                       state.entry->fields[n_u64] = val;
                         n_u64++;
                 }
         }
         va_end(args);
  
-       trace_event_buffer_commit(&fbuffer);
-out:
-       ring_buffer_nest_end(buffer);
+       __synth_event_trace_end(&state);
  
         return ret;
  }
@@ -1910,64 +1931,31 @@ EXPORT_SYMBOL_GPL(synth_event_trace);
  int synth_event_trace_array(struct trace_event_file *file, u64 *vals,
                             unsigned int n_vals)
  {
-       struct trace_event_buffer fbuffer;
-       struct synth_trace_event *entry;
-       struct trace_buffer *buffer;
-       struct synth_event *event;
+       struct synth_event_trace_state state;
         unsigned int i, n_u64;
-       int fields_size = 0;
-       int ret = 0;
-
-       /*
-        * Normal event generation doesn't get called at all unless
-        * the ENABLED bit is set (which attaches the probe thus
-        * allowing this code to be called, etc).  Because this is
-        * called directly by the user, we don't have that but we
-        * still need to honor not logging when disabled.
-        */
-       if (!(file->flags & EVENT_FILE_FL_ENABLED))
-               return 0;
-
-       event = file->event_call->data;
-
-       if (n_vals != event->n_fields)
-               return -EINVAL;
-
-       if (trace_trigger_soft_disabled(file))
-               return -EINVAL;
-
-       fields_size = event->n_u64 * sizeof(u64);
-
-       /*
-        * Avoid ring buffer recursion detection, as this event
-        * is being performed within another event.
-        */
-       buffer = file->tr->array_buffer.buffer;
-       ring_buffer_nest_start(buffer);
+       int ret;
  
-       entry = trace_event_buffer_reserve(&fbuffer, file,
-                                          sizeof(*entry) + fields_size);
-       if (!entry) {
-               ret = -EINVAL;
-               goto out;
+       ret = __synth_event_trace_start(file, &state);
+       if (ret) {
+               if (ret == -ENOENT)
+                       ret = 0; /* just disabled, not really an error */
+               return ret;
         }
  
-       for (i = 0, n_u64 = 0; i < event->n_fields; i++) {
-               if (event->fields[i]->is_string) {
+       for (i = 0, n_u64 = 0; i < state.event->n_fields; i++) {
+               if (state.event->fields[i]->is_string) {
                         char *str_val = (char *)(long)vals[i];
-                       char *str_field = (char *)&entry->fields[n_u64];
+                       char *str_field = (char *)&state.entry->fields[n_u64];
  
                         strscpy(str_field, str_val, STR_VAR_LEN_MAX);
                         n_u64 += STR_VAR_LEN_MAX / sizeof(u64);
                 } else {
-                       entry->fields[n_u64] = vals[i];
+                       state.entry->fields[n_u64] = vals[i];
                         n_u64++;
                 }
         }
  
-       trace_event_buffer_commit(&fbuffer);
-out:
-       ring_buffer_nest_end(buffer);
+       __synth_event_trace_end(&state);
  
         return ret;
  }
@@ -2004,58 +1992,17 @@ EXPORT_SYMBOL_GPL(synth_event_trace_array);
  int synth_event_trace_start(struct trace_event_file *file,
                             struct synth_event_trace_state *trace_state)
  {
-       struct synth_trace_event *entry;
-       int fields_size = 0;
-       int ret = 0;
+       int ret;
  
-       if (!trace_state) {
-               ret = -EINVAL;
-               goto out;
-       }
+       if (!trace_state)
+               return -EINVAL;
  
         memset(trace_state, '\0', sizeof(*trace_state));
  
-       /*
-        * Normal event tracing doesn't get called at all unless the
-        * ENABLED bit is set (which attaches the probe thus allowing
-        * this code to be called, etc).  Because this is called
-        * directly by the user, we don't have that but we still need
-        * to honor not logging when disabled.  For the the iterated
-        * trace case, we save the enabed state upon start and just
-        * ignore the following data calls.
-        */
-       if (!(file->flags & EVENT_FILE_FL_ENABLED)) {
-               trace_state->enabled = false;
-               goto out;
-       }
-
-       trace_state->enabled = true;
-
-       trace_state->event = file->event_call->data;
-
-       if (trace_trigger_soft_disabled(file)) {
-               ret = -EINVAL;
-               goto out;
-       }
+       ret = __synth_event_trace_start(file, trace_state);
+       if (ret == -ENOENT)
+               ret = 0; /* just disabled, not really an error */
  
-       fields_size = trace_state->event->n_u64 * sizeof(u64);
-
-       /*
-        * Avoid ring buffer recursion detection, as this event
-        * is being performed within another event.
-        */
-       trace_state->buffer = file->tr->array_buffer.buffer;
-       ring_buffer_nest_start(trace_state->buffer);
-
-       entry = trace_event_buffer_reserve(&trace_state->fbuffer, file,
-                                          sizeof(*entry) + fields_size);
-       if (!entry) {
-               ret = -EINVAL;
-               goto out;
-       }
-
-       trace_state->entry = entry;
-out:
         return ret;
  }
  EXPORT_SYMBOL_GPL(synth_event_trace_start);
@@ -2088,7 +2035,7 @@ static int __synth_event_add_val(const char *field_name, u64 val,
                 trace_state->add_next = true;
         }
  
-       if (!trace_state->enabled)
+       if (trace_state->disabled)
                 goto out;
  
         event = trace_state->event;
@@ -2223,9 +2170,7 @@ int synth_event_trace_end(struct synth_event_trace_state *trace_state)
         if (!trace_state)
                 return -EINVAL;
  
-       trace_event_buffer_commit(&trace_state->fbuffer);
-
-       ring_buffer_nest_end(trace_state->buffer);
+       __synth_event_trace_end(trace_state);
  
         return 0;
  }
diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c

index d8264eb..362cca5 100644 (file)
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -1012,7 +1012,7 @@ int __kprobe_event_add_fields(struct dynevent_cmd *cmd, ...)
  {
         struct dynevent_arg arg;
         va_list args;
-       int ret;
+       int ret = 0;
  
         if (cmd->type != DYNEVENT_TYPE_KPROBE)
                 return -EINVAL;
diff --git a/lib/Kconfig b/lib/Kconfig

index 0cf875f..bc7e563 100644 (file)
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -573,9 +573,6 @@ config DIMLIB
  config LIBFDT
         bool
  
-config LIBXBC
-       bool
-
  config OID_REGISTRY
         tristate
         help
diff --git a/lib/Makefile b/lib/Makefile

index 5d64890..611872c 100644 (file)
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -230,7 +230,7 @@ $(foreach file, $(libfdt_files), \
         $(eval CFLAGS_$(file) = -I $(srctree)/scripts/dtc/libfdt))
  lib-$(CONFIG_LIBFDT) += $(libfdt_files)
  
-lib-$(CONFIG_LIBXBC) += bootconfig.o
+lib-$(CONFIG_BOOT_CONFIG) += bootconfig.o
  
  obj-$(CONFIG_RBTREE_TEST) += rbtree_test.o
  obj-$(CONFIG_INTERVAL_TREE_TEST) += interval_tree_test.o
diff --git a/lib/bootconfig.c b/lib/bootconfig.c

index afb2e76..3ea601a 100644 (file)
--- a/lib/bootconfig.c
+++ b/lib/bootconfig.c
@@ -6,12 +6,13 @@
  
  #define pr_fmt(fmt)    "bootconfig: " fmt
  
+#include <linux/bootconfig.h>
  #include <linux/bug.h>
  #include <linux/ctype.h>
  #include <linux/errno.h>
  #include <linux/kernel.h>
+#include <linux/memblock.h>
  #include <linux/printk.h>
-#include <linux/bootconfig.h>
  #include <linux/string.h>
  
  /*
@@ -23,7 +24,7 @@
   * node (for array).
   */
  
-static struct xbc_node xbc_nodes[XBC_NODE_MAX] __initdata;
+static struct xbc_node *xbc_nodes __initdata;
  static int xbc_node_num __initdata;
  static char *xbc_data __initdata;
  static size_t xbc_data_size __initdata;
@@ -719,7 +720,8 @@ void __init xbc_destroy_all(void)
         xbc_data = NULL;
         xbc_data_size = 0;
         xbc_node_num = 0;
-       memset(xbc_nodes, 0, sizeof(xbc_nodes));
+       memblock_free(__pa(xbc_nodes), sizeof(struct xbc_node) * XBC_NODE_MAX);
+       xbc_nodes = NULL;
  }
  
  /**
@@ -748,6 +750,13 @@ int __init xbc_init(char *buf)
                 return -ERANGE;
         }
  
+       xbc_nodes = memblock_alloc(sizeof(struct xbc_node) * XBC_NODE_MAX,
+                                  SMP_CACHE_BYTES);
+       if (!xbc_nodes) {
+               pr_err("Failed to allocate memory for bootconfig nodes.\n");
+               return -ENOMEM;
+       }
+       memset(xbc_nodes, 0, sizeof(struct xbc_node) * XBC_NODE_MAX);
         xbc_data = buf;
         xbc_data_size = ret + 1;
         last_parent = NULL;
diff --git a/net/core/dev.c b/net/core/dev.c

index a69e8bd..a6316b3 100644 (file)
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4527,14 +4527,14 @@ static u32 netif_receive_generic_xdp(struct sk_buff *skb,
         /* Reinjected packets coming from act_mirred or similar should
          * not get XDP generic processing.
          */
-       if (skb_cloned(skb) || skb_is_tc_redirected(skb))
+       if (skb_is_tc_redirected(skb))
                 return XDP_PASS;
  
         /* XDP packets must be linear and must have sufficient headroom
          * of XDP_PACKET_HEADROOM bytes. This is the guarantee that also
          * native XDP provides, thus we need to do it here as well.
          */
-       if (skb_is_nonlinear(skb) ||
+       if (skb_cloned(skb) || skb_is_nonlinear(skb) ||
             skb_headroom(skb) < XDP_PACKET_HEADROOM) {
                 int hroom = XDP_PACKET_HEADROOM - skb_headroom(skb);
                 int troom = skb->tail + skb->data_len - skb->end;
diff --git a/net/core/page_pool.c b/net/core/page_pool.c

index 9b7cbe3..10d2b25 100644 (file)
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -99,8 +99,7 @@ EXPORT_SYMBOL(page_pool_create);
  static void __page_pool_return_page(struct page_pool *pool, struct page *page);
  
  noinline
-static struct page *page_pool_refill_alloc_cache(struct page_pool *pool,
-                                                bool refill)
+static struct page *page_pool_refill_alloc_cache(struct page_pool *pool)
  {
         struct ptr_ring *r = &pool->ring;
         struct page *page;
@@ -141,8 +140,7 @@ static struct page *page_pool_refill_alloc_cache(struct page_pool *pool,
                         page = NULL;
                         break;
                 }
-       } while (pool->alloc.count < PP_ALLOC_CACHE_REFILL &&
-                refill);
+       } while (pool->alloc.count < PP_ALLOC_CACHE_REFILL);
  
         /* Return last page */
         if (likely(pool->alloc.count > 0))
@@ -155,20 +153,16 @@ static struct page *page_pool_refill_alloc_cache(struct page_pool *pool,
  /* fast path */
  static struct page *__page_pool_get_cached(struct page_pool *pool)
  {
-       bool refill = false;
         struct page *page;
  
-       /* Test for safe-context, caller should provide this guarantee */
-       if (likely(in_serving_softirq())) {
-               if (likely(pool->alloc.count)) {
-                       /* Fast-path */
-                       page = pool->alloc.cache[--pool->alloc.count];
-                       return page;
-               }
-               refill = true;
+       /* Caller MUST guarantee safe non-concurrent access, e.g. softirq */
+       if (likely(pool->alloc.count)) {
+               /* Fast-path */
+               page = pool->alloc.cache[--pool->alloc.count];
+       } else {
+               page = page_pool_refill_alloc_cache(pool);
         }
  
-       page = page_pool_refill_alloc_cache(pool, refill);
         return page;
  }
  
diff --git a/net/dsa/tag_ar9331.c b/net/dsa/tag_ar9331.c

index 466ffa9..55b0069 100644 (file)
--- a/net/dsa/tag_ar9331.c
+++ b/net/dsa/tag_ar9331.c
@@ -31,7 +31,7 @@ static struct sk_buff *ar9331_tag_xmit(struct sk_buff *skb,
         __le16 *phdr;
         u16 hdr;
  
-       if (skb_cow_head(skb, 0) < 0)
+       if (skb_cow_head(skb, AR9331_HDR_LEN) < 0)
                 return NULL;
  
         phdr = skb_push(skb, AR9331_HDR_LEN);
diff --git a/net/dsa/tag_qca.c b/net/dsa/tag_qca.c

index c8a128c..70db7c9 100644 (file)
--- a/net/dsa/tag_qca.c
+++ b/net/dsa/tag_qca.c
@@ -33,7 +33,7 @@ static struct sk_buff *qca_tag_xmit(struct sk_buff *skb, struct net_device *dev)
         struct dsa_port *dp = dsa_slave_to_port(dev);
         u16 *phdr, hdr;
  
-       if (skb_cow_head(skb, 0) < 0)
+       if (skb_cow_head(skb, QCA_HDR_LEN) < 0)
                 return NULL;
  
         skb_push(skb, QCA_HDR_LEN);
diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c

index 18068ed..f369e7c 100644 (file)
--- a/net/ipv4/icmp.c
+++ b/net/ipv4/icmp.c
@@ -748,6 +748,39 @@ out:;
  }
  EXPORT_SYMBOL(__icmp_send);
  
+#if IS_ENABLED(CONFIG_NF_NAT)
+#include <net/netfilter/nf_conntrack.h>
+void icmp_ndo_send(struct sk_buff *skb_in, int type, int code, __be32 info)
+{
+       struct sk_buff *cloned_skb = NULL;
+       enum ip_conntrack_info ctinfo;
+       struct nf_conn *ct;
+       __be32 orig_ip;
+
+       ct = nf_ct_get(skb_in, &ctinfo);
+       if (!ct || !(ct->status & IPS_SRC_NAT)) {
+               icmp_send(skb_in, type, code, info);
+               return;
+       }
+
+       if (skb_shared(skb_in))
+               skb_in = cloned_skb = skb_clone(skb_in, GFP_ATOMIC);
+
+       if (unlikely(!skb_in || skb_network_header(skb_in) < skb_in->head ||
+           (skb_network_header(skb_in) + sizeof(struct iphdr)) >
+           skb_tail_pointer(skb_in) || skb_ensure_writable(skb_in,
+           skb_network_offset(skb_in) + sizeof(struct iphdr))))
+               goto out;
+
+       orig_ip = ip_hdr(skb_in)->saddr;
+       ip_hdr(skb_in)->saddr = ct->tuplehash[0].tuple.src.u3.ip;
+       icmp_send(skb_in, type, code, info);
+       ip_hdr(skb_in)->saddr = orig_ip;
+out:
+       consume_skb(cloned_skb);
+}
+EXPORT_SYMBOL(icmp_ndo_send);
+#endif
  
  static void icmp_socket_deliver(struct sk_buff *skb, u32 info)
  {
diff --git a/net/ipv6/ip6_icmp.c b/net/ipv6/ip6_icmp.c

index 0204549..e008675 100644 (file)
--- a/net/ipv6/ip6_icmp.c
+++ b/net/ipv6/ip6_icmp.c
@@ -45,4 +45,38 @@ out:
         rcu_read_unlock();
  }
  EXPORT_SYMBOL(icmpv6_send);
+
+#if IS_ENABLED(CONFIG_NF_NAT)
+#include <net/netfilter/nf_conntrack.h>
+void icmpv6_ndo_send(struct sk_buff *skb_in, u8 type, u8 code, __u32 info)
+{
+       struct sk_buff *cloned_skb = NULL;
+       enum ip_conntrack_info ctinfo;
+       struct in6_addr orig_ip;
+       struct nf_conn *ct;
+
+       ct = nf_ct_get(skb_in, &ctinfo);
+       if (!ct || !(ct->status & IPS_SRC_NAT)) {
+               icmpv6_send(skb_in, type, code, info);
+               return;
+       }
+
+       if (skb_shared(skb_in))
+               skb_in = cloned_skb = skb_clone(skb_in, GFP_ATOMIC);
+
+       if (unlikely(!skb_in || skb_network_header(skb_in) < skb_in->head ||
+           (skb_network_header(skb_in) + sizeof(struct ipv6hdr)) >
+           skb_tail_pointer(skb_in) || skb_ensure_writable(skb_in,
+           skb_network_offset(skb_in) + sizeof(struct ipv6hdr))))
+               goto out;
+
+       orig_ip = ipv6_hdr(skb_in)->saddr;
+       ipv6_hdr(skb_in)->saddr = ct->tuplehash[0].tuple.src.u3.in6;
+       icmpv6_send(skb_in, type, code, info);
+       ipv6_hdr(skb_in)->saddr = orig_ip;
+out:
+       consume_skb(cloned_skb);
+}
+EXPORT_SYMBOL(icmpv6_ndo_send);
+#endif
  #endif
diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c

index b5dd20c..5d65436 100644 (file)
--- a/net/ipv6/ip6_tunnel.c
+++ b/net/ipv6/ip6_tunnel.c
@@ -121,6 +121,7 @@ static struct net_device_stats *ip6_get_stats(struct net_device *dev)
  
  /**
   * ip6_tnl_lookup - fetch tunnel matching the end-point addresses
+ *   @link: ifindex of underlying interface
   *   @remote: the address of the tunnel exit-point
   *   @local: the address of the tunnel entry-point
   *
@@ -134,37 +135,56 @@ static struct net_device_stats *ip6_get_stats(struct net_device *dev)
         for (t = rcu_dereference(start); t; t = rcu_dereference(t->next))
  
  static struct ip6_tnl *
-ip6_tnl_lookup(struct net *net, const struct in6_addr *remote, const struct in6_addr *local)
+ip6_tnl_lookup(struct net *net, int link,
+              const struct in6_addr *remote, const struct in6_addr *local)
  {
         unsigned int hash = HASH(remote, local);
-       struct ip6_tnl *t;
+       struct ip6_tnl *t, *cand = NULL;
         struct ip6_tnl_net *ip6n = net_generic(net, ip6_tnl_net_id);
         struct in6_addr any;
  
         for_each_ip6_tunnel_rcu(ip6n->tnls_r_l[hash]) {
-               if (ipv6_addr_equal(local, &t->parms.laddr) &&
-                   ipv6_addr_equal(remote, &t->parms.raddr) &&
-                   (t->dev->flags & IFF_UP))
+               if (!ipv6_addr_equal(local, &t->parms.laddr) ||
+                   !ipv6_addr_equal(remote, &t->parms.raddr) ||
+                   !(t->dev->flags & IFF_UP))
+                       continue;
+
+               if (link == t->parms.link)
                         return t;
+               else
+                       cand = t;
         }
  
         memset(&any, 0, sizeof(any));
         hash = HASH(&any, local);
         for_each_ip6_tunnel_rcu(ip6n->tnls_r_l[hash]) {
-               if (ipv6_addr_equal(local, &t->parms.laddr) &&
-                   ipv6_addr_any(&t->parms.raddr) &&
-                   (t->dev->flags & IFF_UP))
+               if (!ipv6_addr_equal(local, &t->parms.laddr) ||
+                   !ipv6_addr_any(&t->parms.raddr) ||
+                   !(t->dev->flags & IFF_UP))
+                       continue;
+
+               if (link == t->parms.link)
                         return t;
+               else if (!cand)
+                       cand = t;
         }
  
         hash = HASH(remote, &any);
         for_each_ip6_tunnel_rcu(ip6n->tnls_r_l[hash]) {
-               if (ipv6_addr_equal(remote, &t->parms.raddr) &&
-                   ipv6_addr_any(&t->parms.laddr) &&
-                   (t->dev->flags & IFF_UP))
+               if (!ipv6_addr_equal(remote, &t->parms.raddr) ||
+                   !ipv6_addr_any(&t->parms.laddr) ||
+                   !(t->dev->flags & IFF_UP))
+                       continue;
+
+               if (link == t->parms.link)
                         return t;
+               else if (!cand)
+                       cand = t;
         }
  
+       if (cand)
+               return cand;
+
         t = rcu_dereference(ip6n->collect_md_tun);
         if (t && t->dev->flags & IFF_UP)
                 return t;
@@ -351,7 +371,8 @@ static struct ip6_tnl *ip6_tnl_locate(struct net *net,
              (t = rtnl_dereference(*tp)) != NULL;
              tp = &t->next) {
                 if (ipv6_addr_equal(local, &t->parms.laddr) &&
-                   ipv6_addr_equal(remote, &t->parms.raddr)) {
+                   ipv6_addr_equal(remote, &t->parms.raddr) &&
+                   p->link == t->parms.link) {
                         if (create)
                                 return ERR_PTR(-EEXIST);
  
@@ -485,7 +506,7 @@ ip6_tnl_err(struct sk_buff *skb, __u8 ipproto, struct inet6_skb_parm *opt,
            processing of the error. */
  
         rcu_read_lock();
-       t = ip6_tnl_lookup(dev_net(skb->dev), &ipv6h->daddr, &ipv6h->saddr);
+       t = ip6_tnl_lookup(dev_net(skb->dev), skb->dev->ifindex, &ipv6h->daddr, &ipv6h->saddr);
         if (!t)
                 goto out;
  
@@ -887,7 +908,7 @@ static int ipxip6_rcv(struct sk_buff *skb, u8 ipproto,
         int ret = -1;
  
         rcu_read_lock();
-       t = ip6_tnl_lookup(dev_net(skb->dev), &ipv6h->saddr, &ipv6h->daddr);
+       t = ip6_tnl_lookup(dev_net(skb->dev), skb->dev->ifindex, &ipv6h->saddr, &ipv6h->daddr);
  
         if (t) {
                 u8 tproto = READ_ONCE(t->parms.proto);
@@ -1420,8 +1441,10 @@ tx_err:
  static void ip6_tnl_link_config(struct ip6_tnl *t)
  {
         struct net_device *dev = t->dev;
+       struct net_device *tdev = NULL;
         struct __ip6_tnl_parm *p = &t->parms;
         struct flowi6 *fl6 = &t->fl.u.ip6;
+       unsigned int mtu;
         int t_hlen;
  
         memcpy(dev->dev_addr, &p->laddr, sizeof(struct in6_addr));
@@ -1457,22 +1480,25 @@ static void ip6_tnl_link_config(struct ip6_tnl *t)
                 struct rt6_info *rt = rt6_lookup(t->net,
                                                  &p->raddr, &p->laddr,
                                                  p->link, NULL, strict);
+               if (rt) {
+                       tdev = rt->dst.dev;
+                       ip6_rt_put(rt);
+               }
  
-               if (!rt)
-                       return;
+               if (!tdev && p->link)
+                       tdev = __dev_get_by_index(t->net, p->link);
  
-               if (rt->dst.dev) {
-                       dev->hard_header_len = rt->dst.dev->hard_header_len +
-                               t_hlen;
+               if (tdev) {
+                       dev->hard_header_len = tdev->hard_header_len + t_hlen;
+                       mtu = min_t(unsigned int, tdev->mtu, IP6_MAX_MTU);
  
-                       dev->mtu = rt->dst.dev->mtu - t_hlen;
+                       dev->mtu = mtu - t_hlen;
                         if (!(t->parms.flags & IP6_TNL_F_IGN_ENCAP_LIMIT))
                                 dev->mtu -= 8;
  
                         if (dev->mtu < IPV6_MIN_MTU)
                                 dev->mtu = IPV6_MIN_MTU;
                 }
-               ip6_rt_put(rt);
         }
  }
  
diff --git a/net/mac80211/cfg.c b/net/mac80211/cfg.c

index 000c742..6aee699 100644 (file)
--- a/net/mac80211/cfg.c
+++ b/net/mac80211/cfg.c
@@ -3450,7 +3450,7 @@ int ieee80211_attach_ack_skb(struct ieee80211_local *local, struct sk_buff *skb,
  
         spin_lock_irqsave(&local->ack_status_lock, spin_flags);
         id = idr_alloc(&local->ack_status_frames, ack_skb,
-                      1, 0x40, GFP_ATOMIC);
+                      1, 0x2000, GFP_ATOMIC);
         spin_unlock_irqrestore(&local->ack_status_lock, spin_flags);
  
         if (id < 0) {
diff --git a/net/mac80211/mlme.c b/net/mac80211/mlme.c

index 5fa1317..e041af2 100644 (file)
--- a/net/mac80211/mlme.c
+++ b/net/mac80211/mlme.c
@@ -8,7 +8,7 @@
   * Copyright 2007, Michael Wu <flamingice@sourmilk.net>
   * Copyright 2013-2014  Intel Mobile Communications GmbH
   * Copyright (C) 2015 - 2017 Intel Deutschland GmbH
- * Copyright (C) 2018 - 2019 Intel Corporation
+ * Copyright (C) 2018 - 2020 Intel Corporation
   */
  
  #include <linux/delay.h>
@@ -1311,7 +1311,7 @@ ieee80211_sta_process_chanswitch(struct ieee80211_sub_if_data *sdata,
         if (!res) {
                 ch_switch.timestamp = timestamp;
                 ch_switch.device_timestamp = device_timestamp;
-               ch_switch.block_tx =  beacon ? csa_ie.mode : 0;
+               ch_switch.block_tx = csa_ie.mode;
                 ch_switch.chandef = csa_ie.chandef;
                 ch_switch.count = csa_ie.count;
                 ch_switch.delay = csa_ie.max_switch_time;
@@ -1404,7 +1404,7 @@ ieee80211_sta_process_chanswitch(struct ieee80211_sub_if_data *sdata,
  
         sdata->vif.csa_active = true;
         sdata->csa_chandef = csa_ie.chandef;
-       sdata->csa_block_tx = ch_switch.block_tx;
+       sdata->csa_block_tx = csa_ie.mode;
         ifmgd->csa_ignored_same_chan = false;
  
         if (sdata->csa_block_tx)
@@ -1438,7 +1438,7 @@ ieee80211_sta_process_chanswitch(struct ieee80211_sub_if_data *sdata,
          * reset when the disconnection worker runs.
          */
         sdata->vif.csa_active = true;
-       sdata->csa_block_tx = ch_switch.block_tx;
+       sdata->csa_block_tx = csa_ie.mode;
  
         ieee80211_queue_work(&local->hw, &ifmgd->csa_connection_drop_work);
         mutex_unlock(&local->chanctx_mtx);
diff --git a/net/mac80211/tx.c b/net/mac80211/tx.c

index 4bd1faf..87def9c 100644 (file)
--- a/net/mac80211/tx.c
+++ b/net/mac80211/tx.c
@@ -2442,7 +2442,7 @@ static int ieee80211_store_ack_skb(struct ieee80211_local *local,
  
                 spin_lock_irqsave(&local->ack_status_lock, flags);
                 id = idr_alloc(&local->ack_status_frames, ack_skb,
-                              1, 0x40, GFP_ATOMIC);
+                              1, 0x2000, GFP_ATOMIC);
                 spin_unlock_irqrestore(&local->ack_status_lock, flags);
  
                 if (id >= 0) {
diff --git a/net/mac80211/util.c b/net/mac80211/util.c

index 32a7a53..decd46b 100644 (file)
--- a/net/mac80211/util.c
+++ b/net/mac80211/util.c
@@ -1063,16 +1063,22 @@ _ieee802_11_parse_elems_crc(const u8 *start, size_t len, bool action,
                                 elem_parse_failed = true;
                         break;
                 case WLAN_EID_VHT_OPERATION:
-                       if (elen >= sizeof(struct ieee80211_vht_operation))
+                       if (elen >= sizeof(struct ieee80211_vht_operation)) {
                                 elems->vht_operation = (void *)pos;
-                       else
-                               elem_parse_failed = true;
+                               if (calc_crc)
+                                       crc = crc32_be(crc, pos - 2, elen + 2);
+                               break;
+                       }
+                       elem_parse_failed = true;
                         break;
                 case WLAN_EID_OPMODE_NOTIF:
-                       if (elen > 0)
+                       if (elen > 0) {
                                 elems->opmode_notif = pos;
-                       else
-                               elem_parse_failed = true;
+                               if (calc_crc)
+                                       crc = crc32_be(crc, pos - 2, elen + 2);
+                               break;
+                       }
+                       elem_parse_failed = true;
                         break;
                 case WLAN_EID_MESH_ID:
                         elems->mesh_id = pos;
@@ -2987,10 +2993,22 @@ bool ieee80211_chandef_vht_oper(struct ieee80211_hw *hw,
         int cf0, cf1;
         int ccfs0, ccfs1, ccfs2;
         int ccf0, ccf1;
+       u32 vht_cap;
+       bool support_80_80 = false;
+       bool support_160 = false;
  
         if (!oper || !htop)
                 return false;
  
+       vht_cap = hw->wiphy->bands[chandef->chan->band]->vht_cap.cap;
+       support_160 = (vht_cap & (IEEE80211_VHT_CAP_SUPP_CHAN_WIDTH_MASK |
+                                 IEEE80211_VHT_CAP_EXT_NSS_BW_MASK));
+       support_80_80 = ((vht_cap &
+                        IEEE80211_VHT_CAP_SUPP_CHAN_WIDTH_160_80PLUS80MHZ) ||
+                       (vht_cap & IEEE80211_VHT_CAP_SUPP_CHAN_WIDTH_160MHZ &&
+                        vht_cap & IEEE80211_VHT_CAP_EXT_NSS_BW_MASK) ||
+                       ((vht_cap & IEEE80211_VHT_CAP_EXT_NSS_BW_MASK) >>
+                                   IEEE80211_VHT_CAP_EXT_NSS_BW_SHIFT > 1));
         ccfs0 = oper->center_freq_seg0_idx;
         ccfs1 = oper->center_freq_seg1_idx;
         ccfs2 = (le16_to_cpu(htop->operation_mode) &
@@ -3018,10 +3036,10 @@ bool ieee80211_chandef_vht_oper(struct ieee80211_hw *hw,
                         unsigned int diff;
  
                         diff = abs(ccf1 - ccf0);
-                       if (diff == 8) {
+                       if ((diff == 8) && support_160) {
                                 new.width = NL80211_CHAN_WIDTH_160;
                                 new.center_freq1 = cf1;
-                       } else if (diff > 8) {
+                       } else if ((diff > 8) && support_80_80) {
                                 new.width = NL80211_CHAN_WIDTH_80P80;
                                 new.center_freq2 = cf1;
                         }
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c

index 73780b4..030dee6 100644 (file)
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -643,7 +643,7 @@ static struct ipv6_pinfo *mptcp_inet6_sk(const struct sock *sk)
  }
  #endif
  
-struct sock *mptcp_sk_clone_lock(const struct sock *sk)
+static struct sock *mptcp_sk_clone_lock(const struct sock *sk)
  {
         struct sock *nsk = sk_clone_lock(sk, GFP_ATOMIC);
  
diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c

index f9c0d1e..7e54d2a 100644 (file)
--- a/net/sched/cls_flower.c
+++ b/net/sched/cls_flower.c
@@ -691,6 +691,7 @@ static const struct nla_policy fl_policy[TCA_FLOWER_MAX + 1] = {
                                             .len = 128 / BITS_PER_BYTE },
         [TCA_FLOWER_KEY_CT_LABELS_MASK] = { .type = NLA_BINARY,
                                             .len = 128 / BITS_PER_BYTE },
+       [TCA_FLOWER_FLAGS]              = { .type = NLA_U32 },
  };
  
  static const struct nla_policy
diff --git a/net/sched/cls_matchall.c b/net/sched/cls_matchall.c

index 039cc86..610a0b7 100644 (file)
--- a/net/sched/cls_matchall.c
+++ b/net/sched/cls_matchall.c
@@ -157,6 +157,7 @@ static void *mall_get(struct tcf_proto *tp, u32 handle)
  static const struct nla_policy mall_policy[TCA_MATCHALL_MAX + 1] = {
         [TCA_MATCHALL_UNSPEC]           = { .type = NLA_UNSPEC },
         [TCA_MATCHALL_CLASSID]          = { .type = NLA_U32 },
+       [TCA_MATCHALL_FLAGS]            = { .type = NLA_U32 },
  };
  
  static int mall_set_parms(struct net *net, struct tcf_proto *tp,
diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c

index cee5bf4..90988a5 100644 (file)
--- a/net/smc/af_smc.c
+++ b/net/smc/af_smc.c
@@ -470,6 +470,8 @@ static void smc_switch_to_fallback(struct smc_sock *smc)
         if (smc->sk.sk_socket && smc->sk.sk_socket->file) {
                 smc->clcsock->file = smc->sk.sk_socket->file;
                 smc->clcsock->file->private_data = smc->clcsock;
+               smc->clcsock->wq.fasync_list =
+                       smc->sk.sk_socket->wq.fasync_list;
         }
  }
  
diff --git a/net/smc/smc_clc.c b/net/smc/smc_clc.c

index 0879f7b..86cccc2 100644 (file)
--- a/net/smc/smc_clc.c
+++ b/net/smc/smc_clc.c
@@ -372,7 +372,9 @@ int smc_clc_send_decline(struct smc_sock *smc, u32 peer_diag_info)
         dclc.hdr.length = htons(sizeof(struct smc_clc_msg_decline));
         dclc.hdr.version = SMC_CLC_V1;
         dclc.hdr.flag = (peer_diag_info == SMC_CLC_DECL_SYNCERR) ? 1 : 0;
-       memcpy(dclc.id_for_peer, local_systemid, sizeof(local_systemid));
+       if (smc->conn.lgr && !smc->conn.lgr->is_smcd)
+               memcpy(dclc.id_for_peer, local_systemid,
+                      sizeof(local_systemid));
         dclc.peer_diagnosis = htonl(peer_diag_info);
         memcpy(dclc.trl.eyecatcher, SMC_EYECATCHER, sizeof(SMC_EYECATCHER));
  
diff --git a/net/smc/smc_diag.c b/net/smc/smc_diag.c

index f38727e..e1f64f4 100644 (file)
--- a/net/smc/smc_diag.c
+++ b/net/smc/smc_diag.c
@@ -39,16 +39,15 @@ static void smc_diag_msg_common_fill(struct smc_diag_msg *r, struct sock *sk)
  {
         struct smc_sock *smc = smc_sk(sk);
  
+       memset(r, 0, sizeof(*r));
         r->diag_family = sk->sk_family;
+       sock_diag_save_cookie(sk, r->id.idiag_cookie);
         if (!smc->clcsock)
                 return;
         r->id.idiag_sport = htons(smc->clcsock->sk->sk_num);
         r->id.idiag_dport = smc->clcsock->sk->sk_dport;
         r->id.idiag_if = smc->clcsock->sk->sk_bound_dev_if;
-       sock_diag_save_cookie(sk, r->id.idiag_cookie);
         if (sk->sk_protocol == SMCPROTO_SMC) {
-               memset(&r->id.idiag_src, 0, sizeof(r->id.idiag_src));
-               memset(&r->id.idiag_dst, 0, sizeof(r->id.idiag_dst));
                 r->id.idiag_src[0] = smc->clcsock->sk->sk_rcv_saddr;
                 r->id.idiag_dst[0] = smc->clcsock->sk->sk_daddr;
  #if IS_ENABLED(CONFIG_IPV6)
diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c

index 095be88..125297c 100644 (file)
--- a/net/sunrpc/xprtrdma/frwr_ops.c
+++ b/net/sunrpc/xprtrdma/frwr_ops.c
@@ -288,8 +288,8 @@ struct rpcrdma_mr_seg *frwr_map(struct rpcrdma_xprt *r_xprt,
  {
         struct rpcrdma_ia *ia = &r_xprt->rx_ia;
         struct ib_reg_wr *reg_wr;
+       int i, n, dma_nents;
         struct ib_mr *ibmr;
-       int i, n;
         u8 key;
  
         if (nsegs > ia->ri_max_frwr_depth)
@@ -313,15 +313,16 @@ struct rpcrdma_mr_seg *frwr_map(struct rpcrdma_xprt *r_xprt,
                         break;
         }
         mr->mr_dir = rpcrdma_data_dir(writing);
+       mr->mr_nents = i;
  
-       mr->mr_nents =
-               ib_dma_map_sg(ia->ri_id->device, mr->mr_sg, i, mr->mr_dir);
-       if (!mr->mr_nents)
+       dma_nents = ib_dma_map_sg(ia->ri_id->device, mr->mr_sg, mr->mr_nents,
+                                 mr->mr_dir);
+       if (!dma_nents)
                 goto out_dmamap_err;
  
         ibmr = mr->frwr.fr_mr;
-       n = ib_map_mr_sg(ibmr, mr->mr_sg, mr->mr_nents, NULL, PAGE_SIZE);
-       if (unlikely(n != mr->mr_nents))
+       n = ib_map_mr_sg(ibmr, mr->mr_sg, dma_nents, NULL, PAGE_SIZE);
+       if (n != dma_nents)
                 goto out_mapmr_err;
  
         ibmr->iova &= 0x00000000ffffffff;
diff --git a/net/tipc/node.c b/net/tipc/node.c

index 99b28b6..0c88778 100644 (file)
--- a/net/tipc/node.c
+++ b/net/tipc/node.c
@@ -278,7 +278,7 @@ struct tipc_crypto *tipc_node_crypto_rx_by_list(struct list_head *pos)
  }
  #endif
  
-void tipc_node_free(struct rcu_head *rp)
+static void tipc_node_free(struct rcu_head *rp)
  {
         struct tipc_node *n = container_of(rp, struct tipc_node, rcu);
  
@@ -2798,7 +2798,7 @@ static int tipc_nl_retrieve_nodeid(struct nlattr **attrs, u8 **node_id)
         return 0;
  }
  
-int __tipc_nl_node_set_key(struct sk_buff *skb, struct genl_info *info)
+static int __tipc_nl_node_set_key(struct sk_buff *skb, struct genl_info *info)
  {
         struct nlattr *attrs[TIPC_NLA_NODE_MAX + 1];
         struct net *net = sock_net(skb->sk);
@@ -2875,7 +2875,8 @@ int tipc_nl_node_set_key(struct sk_buff *skb, struct genl_info *info)
         return err;
  }
  
-int __tipc_nl_node_flush_key(struct sk_buff *skb, struct genl_info *info)
+static int __tipc_nl_node_flush_key(struct sk_buff *skb,
+                                   struct genl_info *info)
  {
         struct net *net = sock_net(skb->sk);
         struct tipc_net *tn = tipc_net(net);
diff --git a/net/tipc/socket.c b/net/tipc/socket.c

index f9b4fb9..693e890 100644 (file)
--- a/net/tipc/socket.c
+++ b/net/tipc/socket.c
@@ -2441,6 +2441,8 @@ static int tipc_wait_for_connect(struct socket *sock, long *timeo_p)
                         return -ETIMEDOUT;
                 if (signal_pending(current))
                         return sock_intr_errno(*timeo_p);
+               if (sk->sk_state == TIPC_DISCONNECTING)
+                       break;
  
                 add_wait_queue(sk_sleep(sk), &wait);
                 done = sk_wait_event(sk, timeo_p, tipc_sk_connected(sk),
diff --git a/net/wireless/ethtool.c b/net/wireless/ethtool.c

index a9c0f36..24e1840 100644 (file)
--- a/net/wireless/ethtool.c
+++ b/net/wireless/ethtool.c
@@ -7,9 +7,13 @@
  void cfg80211_get_drvinfo(struct net_device *dev, struct ethtool_drvinfo *info)
  {
         struct wireless_dev *wdev = dev->ieee80211_ptr;
+       struct device *pdev = wiphy_dev(wdev->wiphy);
  
-       strlcpy(info->driver, wiphy_dev(wdev->wiphy)->driver->name,
-               sizeof(info->driver));
+       if (pdev->driver)
+               strlcpy(info->driver, pdev->driver->name,
+                       sizeof(info->driver));
+       else
+               strlcpy(info->driver, "N/A", sizeof(info->driver));
  
         strlcpy(info->version, init_utsname()->release, sizeof(info->version));
  
diff --git a/net/wireless/nl80211.c b/net/wireless/nl80211.c

index 123b8d7..cedf17d 100644 (file)
--- a/net/wireless/nl80211.c
+++ b/net/wireless/nl80211.c
@@ -437,6 +437,7 @@ const struct nla_policy nl80211_policy[NUM_NL80211_ATTR] = {
         [NL80211_ATTR_CONTROL_PORT_NO_ENCRYPT] = { .type = NLA_FLAG },
         [NL80211_ATTR_CONTROL_PORT_OVER_NL80211] = { .type = NLA_FLAG },
         [NL80211_ATTR_PRIVACY] = { .type = NLA_FLAG },
+       [NL80211_ATTR_STATUS_CODE] = { .type = NLA_U16 },
         [NL80211_ATTR_CIPHER_SUITE_GROUP] = { .type = NLA_U32 },
         [NL80211_ATTR_WPA_VERSIONS] = { .type = NLA_U32 },
         [NL80211_ATTR_PID] = { .type = NLA_U32 },
diff --git a/net/xfrm/xfrm_interface.c b/net/xfrm/xfrm_interface.c

index dc651a6..3361e3a 100644 (file)
--- a/net/xfrm/xfrm_interface.c
+++ b/net/xfrm/xfrm_interface.c
@@ -300,10 +300,10 @@ xfrmi_xmit2(struct sk_buff *skb, struct net_device *dev, struct flowi *fl)
                         if (mtu < IPV6_MIN_MTU)
                                 mtu = IPV6_MIN_MTU;
  
-                       icmpv6_send(skb, ICMPV6_PKT_TOOBIG, 0, mtu);
+                       icmpv6_ndo_send(skb, ICMPV6_PKT_TOOBIG, 0, mtu);
                 } else {
-                       icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED,
-                                 htonl(mtu));
+                       icmp_ndo_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED,
+                                     htonl(mtu));
                 }
  
                 dst_release(dst);
diff --git a/scripts/kallsyms.c b/scripts/kallsyms.c

index a566d82..0133dfa 100644 (file)
--- a/scripts/kallsyms.c
+++ b/scripts/kallsyms.c
@@ -210,7 +210,7 @@ static struct sym_entry *read_symbol(FILE *in)
  
         len = strlen(name) + 1;
  
-       sym = malloc(sizeof(*sym) + len);
+       sym = malloc(sizeof(*sym) + len + 1);
         if (!sym) {
                 fprintf(stderr, "kallsyms failure: "
                         "unable to allocate required amount of memory\n");
@@ -219,7 +219,7 @@ static struct sym_entry *read_symbol(FILE *in)
         sym->addr = addr;
         sym->len = len;
         sym->sym[0] = type;
-       memcpy(sym_name(sym), name, len);
+       strcpy(sym_name(sym), name);
         sym->percpu_absolute = 0;
  
         return sym;
diff --git a/scripts/link-vmlinux.sh b/scripts/link-vmlinux.sh

index 1919c31..dd484e9 100755 (executable)
--- a/scripts/link-vmlinux.sh
+++ b/scripts/link-vmlinux.sh
@@ -239,7 +239,7 @@ else
  fi;
  
  # final build of init/
-${MAKE} -f "${srctree}/scripts/Makefile.build" obj=init
+${MAKE} -f "${srctree}/scripts/Makefile.build" obj=init need-builtin=1
  
  #link vmlinux.o
  info LD vmlinux.o
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c

index 4b6991e..1659b59 100644 (file)
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -698,7 +698,7 @@ static int selinux_set_mnt_opts(struct super_block *sb,
  
         if (!strcmp(sb->s_type->name, "debugfs") ||
             !strcmp(sb->s_type->name, "tracefs") ||
-           !strcmp(sb->s_type->name, "binderfs") ||
+           !strcmp(sb->s_type->name, "binder") ||
             !strcmp(sb->s_type->name, "pstore"))
                 sbsec->flags |= SE_SBGENFS;
  
diff --git a/security/selinux/ss/sidtab.c b/security/selinux/ss/sidtab.c

index a308ce1..f511ffc 100644 (file)
--- a/security/selinux/ss/sidtab.c
+++ b/security/selinux/ss/sidtab.c
@@ -518,19 +518,13 @@ void sidtab_sid2str_put(struct sidtab *s, struct sidtab_entry *entry,
                         const char *str, u32 str_len)
  {
         struct sidtab_str_cache *cache, *victim = NULL;
+       unsigned long flags;
  
         /* do not cache invalid contexts */
         if (entry->context.len)
                 return;
  
-       /*
-        * Skip the put operation when in non-task context to avoid the need
-        * to disable interrupts while holding s->cache_lock.
-        */
-       if (!in_task())
-               return;
-
-       spin_lock(&s->cache_lock);
+       spin_lock_irqsave(&s->cache_lock, flags);
  
         cache = rcu_dereference_protected(entry->cache,
                                           lockdep_is_held(&s->cache_lock));
@@ -561,7 +555,7 @@ void sidtab_sid2str_put(struct sidtab *s, struct sidtab_entry *entry,
         rcu_assign_pointer(entry->cache, cache);
  
  out_unlock:
-       spin_unlock(&s->cache_lock);
+       spin_unlock_irqrestore(&s->cache_lock, flags);
         kfree_rcu(victim, rcu_member);
  }
  
diff --git a/sound/core/pcm_native.c b/sound/core/pcm_native.c

index 336406b..d5443ee 100644 (file)
--- a/sound/core/pcm_native.c
+++ b/sound/core/pcm_native.c
@@ -2594,7 +2594,8 @@ void snd_pcm_release_substream(struct snd_pcm_substream *substream)
  
         snd_pcm_drop(substream);
         if (substream->hw_opened) {
-               do_hw_free(substream);
+               if (substream->runtime->status->state != SNDRV_PCM_STATE_OPEN)
+                       do_hw_free(substream);
                 substream->ops->close(substream);
                 substream->hw_opened = 0;
         }
diff --git a/sound/pci/hda/patch_realtek.c b/sound/pci/hda/patch_realtek.c

index 4770fb3..6c8cb4c 100644 (file)
--- a/sound/pci/hda/patch_realtek.c
+++ b/sound/pci/hda/patch_realtek.c
@@ -2447,6 +2447,7 @@ static const struct snd_pci_quirk alc882_fixup_tbl[] = {
         SND_PCI_QUIRK(0x1071, 0x8258, "Evesham Voyaeger", ALC882_FIXUP_EAPD),
         SND_PCI_QUIRK(0x1458, 0xa002, "Gigabyte EP45-DS3/Z87X-UD3H", ALC889_FIXUP_FRONT_HP_NO_PRESENCE),
         SND_PCI_QUIRK(0x1458, 0xa0b8, "Gigabyte AZ370-Gaming", ALC1220_FIXUP_GB_DUAL_CODECS),
+       SND_PCI_QUIRK(0x1462, 0x1276, "MSI-GL73", ALC1220_FIXUP_CLEVO_P950),
         SND_PCI_QUIRK(0x1462, 0x7350, "MSI-7350", ALC889_FIXUP_CD),
         SND_PCI_QUIRK(0x1462, 0xda57, "MSI Z270-Gaming", ALC1220_FIXUP_GB_DUAL_CODECS),
         SND_PCI_QUIRK_VENDOR(0x1462, "MSI", ALC882_FIXUP_GPIO3),
@@ -5701,8 +5702,11 @@ static void alc_fixup_headset_jack(struct hda_codec *codec,
                 break;
         case HDA_FIXUP_ACT_INIT:
                 switch (codec->core.vendor_id) {
+               case 0x10ec0215:
                 case 0x10ec0225:
+               case 0x10ec0285:
                 case 0x10ec0295:
+               case 0x10ec0289:
                 case 0x10ec0299:
                         alc_write_coef_idx(codec, 0x48, 0xd011);
                         alc_update_coef_idx(codec, 0x49, 0x007f, 0x0045);
diff --git a/sound/usb/clock.c b/sound/usb/clock.c

index 018b1ec..a48313d 100644 (file)
--- a/sound/usb/clock.c
+++ b/sound/usb/clock.c
@@ -151,8 +151,34 @@ static int uac_clock_selector_set_val(struct snd_usb_audio *chip, int selector_i
         return ret;
  }
  
+/*
+ * Assume the clock is valid if clock source supports only one single sample
+ * rate, the terminal is connected directly to it (there is no clock selector)
+ * and clock type is internal. This is to deal with some Denon DJ controllers
+ * that always reports that clock is invalid.
+ */
+static bool uac_clock_source_is_valid_quirk(struct snd_usb_audio *chip,
+                                           struct audioformat *fmt,
+                                           int source_id)
+{
+       if (fmt->protocol == UAC_VERSION_2) {
+               struct uac_clock_source_descriptor *cs_desc =
+                       snd_usb_find_clock_source(chip->ctrl_intf, source_id);
+
+               if (!cs_desc)
+                       return false;
+
+               return (fmt->nr_rates == 1 &&
+                       (fmt->clock & 0xff) == cs_desc->bClockID &&
+                       (cs_desc->bmAttributes & 0x3) !=
+                               UAC_CLOCK_SOURCE_TYPE_EXT);
+       }
+
+       return false;
+}
+
  static bool uac_clock_source_is_valid(struct snd_usb_audio *chip,
-                                     int protocol,
+                                     struct audioformat *fmt,
                                       int source_id)
  {
         int err;
@@ -160,7 +186,7 @@ static bool uac_clock_source_is_valid(struct snd_usb_audio *chip,
         struct usb_device *dev = chip->dev;
         u32 bmControls;
  
-       if (protocol == UAC_VERSION_3) {
+       if (fmt->protocol == UAC_VERSION_3) {
                 struct uac3_clock_source_descriptor *cs_desc =
                         snd_usb_find_clock_source_v3(chip->ctrl_intf, source_id);
  
@@ -194,10 +220,14 @@ static bool uac_clock_source_is_valid(struct snd_usb_audio *chip,
                 return false;
         }
  
-       return data ? true :  false;
+       if (data)
+               return true;
+       else
+               return uac_clock_source_is_valid_quirk(chip, fmt, source_id);
  }
  
-static int __uac_clock_find_source(struct snd_usb_audio *chip, int entity_id,
+static int __uac_clock_find_source(struct snd_usb_audio *chip,
+                                  struct audioformat *fmt, int entity_id,
                                    unsigned long *visited, bool validate)
  {
         struct uac_clock_source_descriptor *source;
@@ -217,7 +247,7 @@ static int __uac_clock_find_source(struct snd_usb_audio *chip, int entity_id,
         source = snd_usb_find_clock_source(chip->ctrl_intf, entity_id);
         if (source) {
                 entity_id = source->bClockID;
-               if (validate && !uac_clock_source_is_valid(chip, UAC_VERSION_2,
+               if (validate && !uac_clock_source_is_valid(chip, fmt,
                                                                 entity_id)) {
                         usb_audio_err(chip,
                                 "clock source %d is not valid, cannot use\n",
@@ -248,8 +278,9 @@ static int __uac_clock_find_source(struct snd_usb_audio *chip, int entity_id,
                 }
  
                 cur = ret;
-               ret = __uac_clock_find_source(chip, selector->baCSourceID[ret - 1],
-                                              visited, validate);
+               ret = __uac_clock_find_source(chip, fmt,
+                                             selector->baCSourceID[ret - 1],
+                                             visited, validate);
                 if (!validate || ret > 0 || !chip->autoclock)
                         return ret;
  
@@ -260,8 +291,9 @@ static int __uac_clock_find_source(struct snd_usb_audio *chip, int entity_id,
                         if (i == cur)
                                 continue;
  
-                       ret = __uac_clock_find_source(chip, selector->baCSourceID[i - 1],
-                               visited, true);
+                       ret = __uac_clock_find_source(chip, fmt,
+                                                     selector->baCSourceID[i - 1],
+                                                     visited, true);
                         if (ret < 0)
                                 continue;
  
@@ -281,14 +313,16 @@ static int __uac_clock_find_source(struct snd_usb_audio *chip, int entity_id,
         /* FIXME: multipliers only act as pass-thru element for now */
         multiplier = snd_usb_find_clock_multiplier(chip->ctrl_intf, entity_id);
         if (multiplier)
-               return __uac_clock_find_source(chip, multiplier->bCSourceID,
-                                               visited, validate);
+               return __uac_clock_find_source(chip, fmt,
+                                              multiplier->bCSourceID,
+                                              visited, validate);
  
         return -EINVAL;
  }
  
-static int __uac3_clock_find_source(struct snd_usb_audio *chip, int entity_id,
-                                  unsigned long *visited, bool validate)
+static int __uac3_clock_find_source(struct snd_usb_audio *chip,
+                                   struct audioformat *fmt, int entity_id,
+                                   unsigned long *visited, bool validate)
  {
         struct uac3_clock_source_descriptor *source;
         struct uac3_clock_selector_descriptor *selector;
@@ -307,7 +341,7 @@ static int __uac3_clock_find_source(struct snd_usb_audio *chip, int entity_id,
         source = snd_usb_find_clock_source_v3(chip->ctrl_intf, entity_id);
         if (source) {
                 entity_id = source->bClockID;
-               if (validate && !uac_clock_source_is_valid(chip, UAC_VERSION_3,
+               if (validate && !uac_clock_source_is_valid(chip, fmt,
                                                                 entity_id)) {
                         usb_audio_err(chip,
                                 "clock source %d is not valid, cannot use\n",
@@ -338,7 +372,8 @@ static int __uac3_clock_find_source(struct snd_usb_audio *chip, int entity_id,
                 }
  
                 cur = ret;
-               ret = __uac3_clock_find_source(chip, selector->baCSourceID[ret - 1],
+               ret = __uac3_clock_find_source(chip, fmt,
+                                              selector->baCSourceID[ret - 1],
                                                visited, validate);
                 if (!validate || ret > 0 || !chip->autoclock)
                         return ret;
@@ -350,8 +385,9 @@ static int __uac3_clock_find_source(struct snd_usb_audio *chip, int entity_id,
                         if (i == cur)
                                 continue;
  
-                       ret = __uac3_clock_find_source(chip, selector->baCSourceID[i - 1],
-                               visited, true);
+                       ret = __uac3_clock_find_source(chip, fmt,
+                                                      selector->baCSourceID[i - 1],
+                                                      visited, true);
                         if (ret < 0)
                                 continue;
  
@@ -372,7 +408,8 @@ static int __uac3_clock_find_source(struct snd_usb_audio *chip, int entity_id,
         multiplier = snd_usb_find_clock_multiplier_v3(chip->ctrl_intf,
                                                       entity_id);
         if (multiplier)
-               return __uac3_clock_find_source(chip, multiplier->bCSourceID,
+               return __uac3_clock_find_source(chip, fmt,
+                                               multiplier->bCSourceID,
                                                 visited, validate);
  
         return -EINVAL;
@@ -389,18 +426,18 @@ static int __uac3_clock_find_source(struct snd_usb_audio *chip, int entity_id,
   *
   * Returns the clock source UnitID (>=0) on success, or an error.
   */
-int snd_usb_clock_find_source(struct snd_usb_audio *chip, int protocol,
-                             int entity_id, bool validate)
+int snd_usb_clock_find_source(struct snd_usb_audio *chip,
+                             struct audioformat *fmt, bool validate)
  {
         DECLARE_BITMAP(visited, 256);
         memset(visited, 0, sizeof(visited));
  
-       switch (protocol) {
+       switch (fmt->protocol) {
         case UAC_VERSION_2:
-               return __uac_clock_find_source(chip, entity_id, visited,
+               return __uac_clock_find_source(chip, fmt, fmt->clock, visited,
                                                validate);
         case UAC_VERSION_3:
-               return __uac3_clock_find_source(chip, entity_id, visited,
+               return __uac3_clock_find_source(chip, fmt, fmt->clock, visited,
                                                validate);
         default:
                 return -EINVAL;
@@ -501,8 +538,7 @@ static int set_sample_rate_v2v3(struct snd_usb_audio *chip, int iface,
          * automatic clock selection if the current clock is not
          * valid.
          */
-       clock = snd_usb_clock_find_source(chip, fmt->protocol,
-                                         fmt->clock, true);
+       clock = snd_usb_clock_find_source(chip, fmt, true);
         if (clock < 0) {
                 /* We did not find a valid clock, but that might be
                  * because the current sample rate does not match an
@@ -510,8 +546,7 @@ static int set_sample_rate_v2v3(struct snd_usb_audio *chip, int iface,
                  * and we will do another validation after setting the
                  * rate.
                  */
-               clock = snd_usb_clock_find_source(chip, fmt->protocol,
-                                                 fmt->clock, false);
+               clock = snd_usb_clock_find_source(chip, fmt, false);
                 if (clock < 0)
                         return clock;
         }
@@ -577,7 +612,7 @@ static int set_sample_rate_v2v3(struct snd_usb_audio *chip, int iface,
  
  validation:
         /* validate clock after rate change */
-       if (!uac_clock_source_is_valid(chip, fmt->protocol, clock))
+       if (!uac_clock_source_is_valid(chip, fmt, clock))
                 return -ENXIO;
         return 0;
  }
diff --git a/sound/usb/clock.h b/sound/usb/clock.h

index 076e31b..68df0fb 100644 (file)
--- a/sound/usb/clock.h
+++ b/sound/usb/clock.h
@@ -6,7 +6,7 @@ int snd_usb_init_sample_rate(struct snd_usb_audio *chip, int iface,
                              struct usb_host_interface *alts,
                              struct audioformat *fmt, int rate);
  
-int snd_usb_clock_find_source(struct snd_usb_audio *chip, int protocol,
-                            int entity_id, bool validate);
+int snd_usb_clock_find_source(struct snd_usb_audio *chip,
+                             struct audioformat *fmt, bool validate);
  
  #endif /* __USBAUDIO_CLOCK_H */
diff --git a/sound/usb/format.c b/sound/usb/format.c

index 9260136..9f5cb4e 100644 (file)
--- a/sound/usb/format.c
+++ b/sound/usb/format.c
@@ -151,6 +151,19 @@ static u64 parse_audio_format_i_type(struct snd_usb_audio *chip,
         return pcm_formats;
  }
  
+static int set_fixed_rate(struct audioformat *fp, int rate, int rate_bits)
+{
+       kfree(fp->rate_table);
+       fp->rate_table = kmalloc(sizeof(int), GFP_KERNEL);
+       if (!fp->rate_table)
+               return -ENOMEM;
+       fp->nr_rates = 1;
+       fp->rate_min = rate;
+       fp->rate_max = rate;
+       fp->rates = rate_bits;
+       fp->rate_table[0] = rate;
+       return 0;
+}
  
  /*
   * parse the format descriptor and stores the possible sample rates
@@ -223,6 +236,14 @@ static int parse_audio_format_rates_v1(struct snd_usb_audio *chip, struct audiof
                 fp->rate_min = combine_triple(&fmt[offset + 1]);
                 fp->rate_max = combine_triple(&fmt[offset + 4]);
         }
+
+       /* Jabra Evolve 65 headset */
+       if (chip->usb_id == USB_ID(0x0b0e, 0x030b)) {
+               /* only 48kHz for playback while keeping 16kHz for capture */
+               if (fp->nr_rates != 1)
+                       return set_fixed_rate(fp, 48000, SNDRV_PCM_RATE_48000);
+       }
+
         return 0;
  }
  
@@ -299,17 +320,7 @@ static int line6_parse_audio_format_rates_quirk(struct snd_usb_audio *chip,
         case USB_ID(0x0e41, 0x4248): /* Line6 Helix >= fw 2.82 */
         case USB_ID(0x0e41, 0x4249): /* Line6 Helix Rack >= fw 2.82 */
         case USB_ID(0x0e41, 0x424a): /* Line6 Helix LT >= fw 2.82 */
-               /* supported rates: 48Khz */
-               kfree(fp->rate_table);
-               fp->rate_table = kmalloc(sizeof(int), GFP_KERNEL);
-               if (!fp->rate_table)
-                       return -ENOMEM;
-               fp->nr_rates = 1;
-               fp->rate_min = 48000;
-               fp->rate_max = 48000;
-               fp->rates = SNDRV_PCM_RATE_48000;
-               fp->rate_table[0] = 48000;
-               return 0;
+               return set_fixed_rate(fp, 48000, SNDRV_PCM_RATE_48000);
         }
  
         return -ENODEV;
@@ -325,8 +336,7 @@ static int parse_audio_format_rates_v2v3(struct snd_usb_audio *chip,
         struct usb_device *dev = chip->dev;
         unsigned char tmp[2], *data;
         int nr_triplets, data_size, ret = 0, ret_l6;
-       int clock = snd_usb_clock_find_source(chip, fp->protocol,
-                                             fp->clock, false);
+       int clock = snd_usb_clock_find_source(chip, fp, false);
  
         if (clock < 0) {
                 dev_err(&dev->dev,
diff --git a/sound/usb/mixer.c b/sound/usb/mixer.c

index d659fdb..81b2db0 100644 (file)
--- a/sound/usb/mixer.c
+++ b/sound/usb/mixer.c
@@ -897,6 +897,15 @@ static int parse_term_proc_unit(struct mixer_build *state,
         return 0;
  }
  
+static int parse_term_effect_unit(struct mixer_build *state,
+                                 struct usb_audio_term *term,
+                                 void *p1, int id)
+{
+       term->type = UAC3_EFFECT_UNIT << 16; /* virtual type */
+       term->id = id;
+       return 0;
+}
+
  static int parse_term_uac2_clock_source(struct mixer_build *state,
                                         struct usb_audio_term *term,
                                         void *p1, int id)
@@ -981,8 +990,7 @@ static int __check_input_term(struct mixer_build *state, int id,
                                                     UAC3_PROCESSING_UNIT);
                 case PTYPE(UAC_VERSION_2, UAC2_EFFECT_UNIT):
                 case PTYPE(UAC_VERSION_3, UAC3_EFFECT_UNIT):
-                       return parse_term_proc_unit(state, term, p1, id,
-                                                   UAC3_EFFECT_UNIT);
+                       return parse_term_effect_unit(state, term, p1, id);
                 case PTYPE(UAC_VERSION_1, UAC1_EXTENSION_UNIT):
                 case PTYPE(UAC_VERSION_2, UAC2_EXTENSION_UNIT_V2):
                 case PTYPE(UAC_VERSION_3, UAC3_EXTENSION_UNIT):
diff --git a/sound/usb/quirks.c b/sound/usb/quirks.c

index 3a5242e..7f558f4 100644 (file)
--- a/sound/usb/quirks.c
+++ b/sound/usb/quirks.c
@@ -1440,6 +1440,7 @@ bool snd_usb_get_sample_rate_quirk(struct snd_usb_audio *chip)
         case USB_ID(0x1395, 0x740a): /* Sennheiser DECT */
         case USB_ID(0x1901, 0x0191): /* GE B850V3 CP2114 audio interface */
         case USB_ID(0x21b4, 0x0081): /* AudioQuest DragonFly */
+       case USB_ID(0x2912, 0x30c8): /* Audioengine D1 */
                 return true;
         }
  
diff --git a/tools/arch/arm64/include/uapi/asm/kvm.h b/tools/arch/arm64/include/uapi/asm/kvm.h

index 820e575..ba85bb2 100644 (file)
--- a/tools/arch/arm64/include/uapi/asm/kvm.h
+++ b/tools/arch/arm64/include/uapi/asm/kvm.h
@@ -220,10 +220,18 @@ struct kvm_vcpu_events {
  #define KVM_REG_ARM_PTIMER_CVAL                ARM64_SYS_REG(3, 3, 14, 2, 2)
  #define KVM_REG_ARM_PTIMER_CNT         ARM64_SYS_REG(3, 3, 14, 0, 1)
  
-/* EL0 Virtual Timer Registers */
+/*
+ * EL0 Virtual Timer Registers
+ *
+ * WARNING:
+ *      KVM_REG_ARM_TIMER_CVAL and KVM_REG_ARM_TIMER_CNT are not defined
+ *      with the appropriate register encodings.  Their values have been
+ *      accidentally swapped.  As this is set API, the definitions here
+ *      must be used, rather than ones derived from the encodings.
+ */
  #define KVM_REG_ARM_TIMER_CTL          ARM64_SYS_REG(3, 3, 14, 3, 1)
-#define KVM_REG_ARM_TIMER_CNT          ARM64_SYS_REG(3, 3, 14, 3, 2)
  #define KVM_REG_ARM_TIMER_CVAL         ARM64_SYS_REG(3, 3, 14, 0, 2)
+#define KVM_REG_ARM_TIMER_CNT          ARM64_SYS_REG(3, 3, 14, 3, 2)
  
  /* KVM-as-firmware specific pseudo-registers */
  #define KVM_REG_ARM_FW                 (0x0014 << KVM_REG_ARM_COPROC_SHIFT)
diff --git a/tools/arch/arm64/include/uapi/asm/unistd.h b/tools/arch/arm64/include/uapi/asm/unistd.h

index 4703d21..f83a70e 100644 (file)
--- a/tools/arch/arm64/include/uapi/asm/unistd.h
+++ b/tools/arch/arm64/include/uapi/asm/unistd.h
@@ -19,5 +19,6 @@
  #define __ARCH_WANT_NEW_STAT
  #define __ARCH_WANT_SET_GET_RLIMIT
  #define __ARCH_WANT_TIME32_SYSCALLS
+#define __ARCH_WANT_SYS_CLONE3
  
  #include <asm-generic/unistd.h>
diff --git a/tools/arch/x86/include/asm/cpufeatures.h b/tools/arch/x86/include/asm/cpufeatures.h

index e9b6249..f3327cb 100644 (file)
--- a/tools/arch/x86/include/asm/cpufeatures.h
+++ b/tools/arch/x86/include/asm/cpufeatures.h
@@ -220,6 +220,7 @@
  #define X86_FEATURE_ZEN                        ( 7*32+28) /* "" CPU is AMD family 0x17 (Zen) */
  #define X86_FEATURE_L1TF_PTEINV                ( 7*32+29) /* "" L1TF workaround PTE inversion */
  #define X86_FEATURE_IBRS_ENHANCED      ( 7*32+30) /* Enhanced IBRS */
+#define X86_FEATURE_MSR_IA32_FEAT_CTL  ( 7*32+31) /* "" MSR IA32_FEAT_CTL configured */
  
  /* Virtualization flags: Linux defined, word 8 */
  #define X86_FEATURE_TPR_SHADOW         ( 8*32+ 0) /* Intel TPR Shadow */
@@ -357,6 +358,7 @@
  /* Intel-defined CPU features, CPUID level 0x00000007:0 (EDX), word 18 */
  #define X86_FEATURE_AVX512_4VNNIW      (18*32+ 2) /* AVX-512 Neural Network Instructions */
  #define X86_FEATURE_AVX512_4FMAPS      (18*32+ 3) /* AVX-512 Multiply Accumulation Single precision */
+#define X86_FEATURE_FSRM               (18*32+ 4) /* Fast Short Rep Mov */
  #define X86_FEATURE_AVX512_VP2INTERSECT (18*32+ 8) /* AVX-512 Intersect for D/Q */
  #define X86_FEATURE_MD_CLEAR           (18*32+10) /* VERW clears CPU buffers */
  #define X86_FEATURE_TSX_FORCE_ABORT    (18*32+13) /* "" TSX_FORCE_ABORT */
diff --git a/tools/arch/x86/include/asm/disabled-features.h b/tools/arch/x86/include/asm/disabled-features.h

index 8e1d0bb..4ea8584 100644 (file)
--- a/tools/arch/x86/include/asm/disabled-features.h
+++ b/tools/arch/x86/include/asm/disabled-features.h
@@ -10,12 +10,6 @@
   * cpu_feature_enabled().
   */
  
-#ifdef CONFIG_X86_INTEL_MPX
-# define DISABLE_MPX   0
-#else
-# define DISABLE_MPX   (1<<(X86_FEATURE_MPX & 31))
-#endif
-
  #ifdef CONFIG_X86_SMAP
  # define DISABLE_SMAP  0
  #else
@@ -74,7 +68,7 @@
  #define DISABLED_MASK6 0
  #define DISABLED_MASK7 (DISABLE_PTI)
  #define DISABLED_MASK8 0
-#define DISABLED_MASK9 (DISABLE_MPX|DISABLE_SMAP)
+#define DISABLED_MASK9 (DISABLE_SMAP)
  #define DISABLED_MASK10        0
  #define DISABLED_MASK11        0
  #define DISABLED_MASK12        0
diff --git a/tools/bootconfig/include/linux/memblock.h b/tools/bootconfig/include/linux/memblock.h

new file mode 100644 (file)

index 0000000..7862f21
--- /dev/null
+++ b/tools/bootconfig/include/linux/memblock.h
@@ -0,0 +1,12 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _XBC_LINUX_MEMBLOCK_H
+#define _XBC_LINUX_MEMBLOCK_H
+
+#include <stdlib.h>
+
+#define __pa(addr)     (addr)
+#define SMP_CACHE_BYTES        0
+#define memblock_alloc(size, align)    malloc(size)
+#define memblock_free(paddr, size)     free(paddr)
+
+#endif
diff --git a/tools/bootconfig/include/linux/printk.h b/tools/bootconfig/include/linux/printk.h

index 017bcd6..e978a63 100644 (file)
--- a/tools/bootconfig/include/linux/printk.h
+++ b/tools/bootconfig/include/linux/printk.h
@@ -7,7 +7,7 @@
  /* controllable printf */
  extern int pr_output;
  #define printk(fmt, ...)       \
-       (pr_output ? printf(fmt, __VA_ARGS__) : 0)
+       (pr_output ? printf(fmt, ##__VA_ARGS__) : 0)
  
  #define pr_err printk
  #define pr_warn        printk
diff --git a/tools/bootconfig/main.c b/tools/bootconfig/main.c

index 47f4884..e18eeb0 100644 (file)
--- a/tools/bootconfig/main.c
+++ b/tools/bootconfig/main.c
@@ -140,7 +140,7 @@ int load_xbc_from_initrd(int fd, char **buf)
                 return 0;
  
         if (lseek(fd, -8, SEEK_END) < 0) {
-               printf("Failed to lseek: %d\n", -errno);
+               pr_err("Failed to lseek: %d\n", -errno);
                 return -errno;
         }
  
@@ -155,7 +155,7 @@ int load_xbc_from_initrd(int fd, char **buf)
                 return 0;
  
         if (lseek(fd, stat.st_size - 8 - size, SEEK_SET) < 0) {
-               printf("Failed to lseek: %d\n", -errno);
+               pr_err("Failed to lseek: %d\n", -errno);
                 return -errno;
         }
  
@@ -166,7 +166,7 @@ int load_xbc_from_initrd(int fd, char **buf)
         /* Wrong Checksum, maybe no boot config here */
         rcsum = checksum((unsigned char *)*buf, size);
         if (csum != rcsum) {
-               printf("checksum error: %d != %d\n", csum, rcsum);
+               pr_err("checksum error: %d != %d\n", csum, rcsum);
                 return 0;
         }
  
@@ -185,13 +185,13 @@ int show_xbc(const char *path)
  
         fd = open(path, O_RDONLY);
         if (fd < 0) {
-               printf("Failed to open initrd %s: %d\n", path, fd);
+               pr_err("Failed to open initrd %s: %d\n", path, fd);
                 return -errno;
         }
  
         ret = load_xbc_from_initrd(fd, &buf);
         if (ret < 0)
-               printf("Failed to load a boot config from initrd: %d\n", ret);
+               pr_err("Failed to load a boot config from initrd: %d\n", ret);
         else
                 xbc_show_compact_tree();
  
@@ -209,7 +209,7 @@ int delete_xbc(const char *path)
  
         fd = open(path, O_RDWR);
         if (fd < 0) {
-               printf("Failed to open initrd %s: %d\n", path, fd);
+               pr_err("Failed to open initrd %s: %d\n", path, fd);
                 return -errno;
         }
  
@@ -222,7 +222,7 @@ int delete_xbc(const char *path)
         pr_output = 1;
         if (size < 0) {
                 ret = size;
-               printf("Failed to load a boot config from initrd: %d\n", ret);
+               pr_err("Failed to load a boot config from initrd: %d\n", ret);
         } else if (size > 0) {
                 ret = fstat(fd, &stat);
                 if (!ret)
@@ -245,7 +245,7 @@ int apply_xbc(const char *path, const char *xbc_path)
  
         ret = load_xbc_file(xbc_path, &buf);
         if (ret < 0) {
-               printf("Failed to load %s : %d\n", xbc_path, ret);
+               pr_err("Failed to load %s : %d\n", xbc_path, ret);
                 return ret;
         }
         size = strlen(buf) + 1;
@@ -262,7 +262,7 @@ int apply_xbc(const char *path, const char *xbc_path)
         /* Check the data format */
         ret = xbc_init(buf);
         if (ret < 0) {
-               printf("Failed to parse %s: %d\n", xbc_path, ret);
+               pr_err("Failed to parse %s: %d\n", xbc_path, ret);
                 free(data);
                 free(buf);
                 return ret;
@@ -279,20 +279,20 @@ int apply_xbc(const char *path, const char *xbc_path)
         /* Remove old boot config if exists */
         ret = delete_xbc(path);
         if (ret < 0) {
-               printf("Failed to delete previous boot config: %d\n", ret);
+               pr_err("Failed to delete previous boot config: %d\n", ret);
                 return ret;
         }
  
         /* Apply new one */
         fd = open(path, O_RDWR | O_APPEND);
         if (fd < 0) {
-               printf("Failed to open %s: %d\n", path, fd);
+               pr_err("Failed to open %s: %d\n", path, fd);
                 return fd;
         }
         /* TODO: Ensure the @path is initramfs/initrd image */
         ret = write(fd, data, size + 8);
         if (ret < 0) {
-               printf("Failed to apply a boot config: %d\n", ret);
+               pr_err("Failed to apply a boot config: %d\n", ret);
                 return ret;
         }
         close(fd);
@@ -334,12 +334,12 @@ int main(int argc, char **argv)
         }
  
         if (apply && delete) {
-               printf("Error: You can not specify both -a and -d at once.\n");
+               pr_err("Error: You can not specify both -a and -d at once.\n");
                 return usage();
         }
  
         if (optind >= argc) {
-               printf("Error: No initrd is specified.\n");
+               pr_err("Error: No initrd is specified.\n");
                 return usage();
         }
  
diff --git a/tools/bootconfig/test-bootconfig.sh b/tools/bootconfig/test-bootconfig.sh

index 87725e8..1de06de 100755 (executable)
--- a/tools/bootconfig/test-bootconfig.sh
+++ b/tools/bootconfig/test-bootconfig.sh
@@ -64,6 +64,15 @@ echo "File size check"
  new_size=$(stat -c %s $INITRD)
  xpass test $new_size -eq $initrd_size
  
+echo "No error messge while applying"
+OUTFILE=`mktemp tempout-XXXX`
+dd if=/dev/zero of=$INITRD bs=4096 count=1
+printf " \0\0\0 \0\0\0" >> $INITRD
+$BOOTCONF -a $TEMPCONF $INITRD > $OUTFILE 2>&1
+xfail grep -i "failed" $OUTFILE
+xfail grep -i "error" $OUTFILE
+rm $OUTFILE
+
  echo "Max node number check"
  
  echo -n > $TEMPCONF
diff --git a/tools/include/uapi/asm-generic/mman-common.h b/tools/include/uapi/asm-generic/mman-common.h

index c160a53..f94f65d 100644 (file)
--- a/tools/include/uapi/asm-generic/mman-common.h
+++ b/tools/include/uapi/asm-generic/mman-common.h
@@ -11,6 +11,8 @@
  #define PROT_WRITE     0x2             /* page can be written */
  #define PROT_EXEC      0x4             /* page can be executed */
  #define PROT_SEM       0x8             /* page may be used for atomic ops */
+/*                     0x10               reserved for arch-specific use */
+/*                     0x20               reserved for arch-specific use */
  #define PROT_NONE      0x0             /* page can not be accessed */
  #define PROT_GROWSDOWN 0x01000000      /* mprotect flag: extend change to start of growsdown vma */
  #define PROT_GROWSUP   0x02000000      /* mprotect flag: extend change to end of growsup vma */
diff --git a/tools/include/uapi/asm-generic/unistd.h b/tools/include/uapi/asm-generic/unistd.h

index 1fc8faa..3a3201e 100644 (file)
--- a/tools/include/uapi/asm-generic/unistd.h
+++ b/tools/include/uapi/asm-generic/unistd.h
@@ -851,8 +851,13 @@ __SYSCALL(__NR_pidfd_open, sys_pidfd_open)
  __SYSCALL(__NR_clone3, sys_clone3)
  #endif
  
+#define __NR_openat2 437
+__SYSCALL(__NR_openat2, sys_openat2)
+#define __NR_pidfd_getfd 438
+__SYSCALL(__NR_pidfd_getfd, sys_pidfd_getfd)
+
  #undef __NR_syscalls
-#define __NR_syscalls 436
+#define __NR_syscalls 439
  
  /*
   * 32 bit systems traditionally used different
diff --git a/tools/include/uapi/drm/i915_drm.h b/tools/include/uapi/drm/i915_drm.h

index 5400d7e..829c0a4 100644 (file)
--- a/tools/include/uapi/drm/i915_drm.h
+++ b/tools/include/uapi/drm/i915_drm.h
@@ -395,6 +395,7 @@ typedef struct _drm_i915_sarea {
  #define DRM_IOCTL_I915_GEM_PWRITE      DRM_IOW (DRM_COMMAND_BASE + DRM_I915_GEM_PWRITE, struct drm_i915_gem_pwrite)
  #define DRM_IOCTL_I915_GEM_MMAP                DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_MMAP, struct drm_i915_gem_mmap)
  #define DRM_IOCTL_I915_GEM_MMAP_GTT    DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_MMAP_GTT, struct drm_i915_gem_mmap_gtt)
+#define DRM_IOCTL_I915_GEM_MMAP_OFFSET DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_MMAP_GTT, struct drm_i915_gem_mmap_offset)
  #define DRM_IOCTL_I915_GEM_SET_DOMAIN  DRM_IOW (DRM_COMMAND_BASE + DRM_I915_GEM_SET_DOMAIN, struct drm_i915_gem_set_domain)
  #define DRM_IOCTL_I915_GEM_SW_FINISH   DRM_IOW (DRM_COMMAND_BASE + DRM_I915_GEM_SW_FINISH, struct drm_i915_gem_sw_finish)
  #define DRM_IOCTL_I915_GEM_SET_TILING  DRM_IOWR (DRM_COMMAND_BASE + DRM_I915_GEM_SET_TILING, struct drm_i915_gem_set_tiling)
@@ -793,6 +794,37 @@ struct drm_i915_gem_mmap_gtt {
         __u64 offset;
  };
  
+struct drm_i915_gem_mmap_offset {
+       /** Handle for the object being mapped. */
+       __u32 handle;
+       __u32 pad;
+       /**
+        * Fake offset to use for subsequent mmap call
+        *
+        * This is a fixed-size type for 32/64 compatibility.
+        */
+       __u64 offset;
+
+       /**
+        * Flags for extended behaviour.
+        *
+        * It is mandatory that one of the MMAP_OFFSET types
+        * (GTT, WC, WB, UC, etc) should be included.
+        */
+       __u64 flags;
+#define I915_MMAP_OFFSET_GTT 0
+#define I915_MMAP_OFFSET_WC  1
+#define I915_MMAP_OFFSET_WB  2
+#define I915_MMAP_OFFSET_UC  3
+
+       /*
+        * Zero-terminated chain of extensions.
+        *
+        * No current extensions defined; mbz.
+        */
+       __u64 extensions;
+};
+
  struct drm_i915_gem_set_domain {
         /** Handle for the object */
         __u32 handle;
diff --git a/tools/include/uapi/linux/fcntl.h b/tools/include/uapi/linux/fcntl.h

index 1f97b33..ca88b7b 100644 (file)
--- a/tools/include/uapi/linux/fcntl.h
+++ b/tools/include/uapi/linux/fcntl.h
@@ -3,6 +3,7 @@
  #define _UAPI_LINUX_FCNTL_H
  
  #include <asm/fcntl.h>
+#include <linux/openat2.h>
  
  #define F_SETLEASE     (F_LINUX_SPECIFIC_BASE + 0)
  #define F_GETLEASE     (F_LINUX_SPECIFIC_BASE + 1)
@@ -100,5 +101,4 @@
  
  #define AT_RECURSIVE           0x8000  /* Apply to the entire subtree */
  
-
  #endif /* _UAPI_LINUX_FCNTL_H */
diff --git a/tools/include/uapi/linux/fscrypt.h b/tools/include/uapi/linux/fscrypt.h

index 1beb174..0d8a6f4 100644 (file)
--- a/tools/include/uapi/linux/fscrypt.h
+++ b/tools/include/uapi/linux/fscrypt.h
@@ -8,6 +8,7 @@
  #ifndef _UAPI_LINUX_FSCRYPT_H
  #define _UAPI_LINUX_FSCRYPT_H
  
+#include <linux/ioctl.h>
  #include <linux/types.h>
  
  /* Encryption policy flags */
@@ -109,11 +110,22 @@ struct fscrypt_key_specifier {
         } u;
  };
  
+/*
+ * Payload of Linux keyring key of type "fscrypt-provisioning", referenced by
+ * fscrypt_add_key_arg::key_id as an alternative to fscrypt_add_key_arg::raw.
+ */
+struct fscrypt_provisioning_key_payload {
+       __u32 type;
+       __u32 __reserved;
+       __u8 raw[];
+};
+
  /* Struct passed to FS_IOC_ADD_ENCRYPTION_KEY */
  struct fscrypt_add_key_arg {
         struct fscrypt_key_specifier key_spec;
         __u32 raw_size;
-       __u32 __reserved[9];
+       __u32 key_id;
+       __u32 __reserved[8];
         __u8 raw[];
  };
  
diff --git a/tools/include/uapi/linux/kvm.h b/tools/include/uapi/linux/kvm.h

index f0a16b4..4b95f9a 100644 (file)
--- a/tools/include/uapi/linux/kvm.h
+++ b/tools/include/uapi/linux/kvm.h
@@ -1009,6 +1009,7 @@ struct kvm_ppc_resize_hpt {
  #define KVM_CAP_PPC_GUEST_DEBUG_SSTEP 176
  #define KVM_CAP_ARM_NISV_TO_USER 177
  #define KVM_CAP_ARM_INJECT_EXT_DABT 178
+#define KVM_CAP_S390_VCPU_RESETS 179
  
  #ifdef KVM_CAP_IRQ_ROUTING
  
@@ -1473,6 +1474,10 @@ struct kvm_enc_region {
  /* Available with KVM_CAP_ARM_SVE */
  #define KVM_ARM_VCPU_FINALIZE    _IOW(KVMIO,  0xc2, int)
  
+/* Available with  KVM_CAP_S390_VCPU_RESETS */
+#define KVM_S390_NORMAL_RESET  _IO(KVMIO,   0xc3)
+#define KVM_S390_CLEAR_RESET   _IO(KVMIO,   0xc4)
+
  /* Secure Encrypted Virtualization command */
  enum sev_cmd_id {
         /* Guest initialization commands */
diff --git a/tools/include/uapi/linux/openat2.h b/tools/include/uapi/linux/openat2.h

new file mode 100644 (file)

index 0000000..58b1eb7
--- /dev/null
+++ b/tools/include/uapi/linux/openat2.h
@@ -0,0 +1,39 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef _UAPI_LINUX_OPENAT2_H
+#define _UAPI_LINUX_OPENAT2_H
+
+#include <linux/types.h>
+
+/*
+ * Arguments for how openat2(2) should open the target path. If only @flags and
+ * @mode are non-zero, then openat2(2) operates very similarly to openat(2).
+ *
+ * However, unlike openat(2), unknown or invalid bits in @flags result in
+ * -EINVAL rather than being silently ignored. @mode must be zero unless one of
+ * {O_CREAT, O_TMPFILE} are set.
+ *
+ * @flags: O_* flags.
+ * @mode: O_CREAT/O_TMPFILE file mode.
+ * @resolve: RESOLVE_* flags.
+ */
+struct open_how {
+       __u64 flags;
+       __u64 mode;
+       __u64 resolve;
+};
+
+/* how->resolve flags for openat2(2). */
+#define RESOLVE_NO_XDEV                0x01 /* Block mount-point crossings
+                                       (includes bind-mounts). */
+#define RESOLVE_NO_MAGICLINKS  0x02 /* Block traversal through procfs-style
+                                       "magic-links". */
+#define RESOLVE_NO_SYMLINKS    0x04 /* Block traversal through all symlinks
+                                       (implies OEXT_NO_MAGICLINKS) */
+#define RESOLVE_BENEATH                0x08 /* Block "lexical" trickery like
+                                       "..", symlinks, and absolute
+                                       paths which escape the dirfd. */
+#define RESOLVE_IN_ROOT                0x10 /* Make all jumps to "/" and ".."
+                                       be scoped inside the dirfd
+                                       (similar to chroot(2)). */
+
+#endif /* _UAPI_LINUX_OPENAT2_H */
diff --git a/tools/include/uapi/linux/prctl.h b/tools/include/uapi/linux/prctl.h

index 7da1b37..07b4f81 100644 (file)
--- a/tools/include/uapi/linux/prctl.h
+++ b/tools/include/uapi/linux/prctl.h
@@ -234,4 +234,8 @@ struct prctl_mm_map {
  #define PR_GET_TAGGED_ADDR_CTRL                56
  # define PR_TAGGED_ADDR_ENABLE         (1UL << 0)
  
+/* Control reclaim behavior when allocating memory */
+#define PR_SET_IO_FLUSHER              57
+#define PR_GET_IO_FLUSHER              58
+
  #endif /* _LINUX_PRCTL_H */
diff --git a/tools/include/uapi/linux/sched.h b/tools/include/uapi/linux/sched.h

index 4a02178..2e3bc22 100644 (file)
--- a/tools/include/uapi/linux/sched.h
+++ b/tools/include/uapi/linux/sched.h
@@ -36,6 +36,12 @@
  /* Flags for the clone3() syscall. */
  #define CLONE_CLEAR_SIGHAND 0x100000000ULL /* Clear any signal handler and reset to SIG_DFL. */
  
+/*
+ * cloning flags intersect with CSIGNAL so can be used with unshare and clone3
+ * syscalls only:
+ */
+#define CLONE_NEWTIME  0x00000080      /* New time namespace */
+
  #ifndef __ASSEMBLY__
  /**
   * struct clone_args - arguments for the clone3 syscall
diff --git a/tools/include/uapi/sound/asound.h b/tools/include/uapi/sound/asound.h

index df1153c..535a722 100644 (file)
--- a/tools/include/uapi/sound/asound.h
+++ b/tools/include/uapi/sound/asound.h
@@ -26,7 +26,9 @@
  
  #if defined(__KERNEL__) || defined(__linux__)
  #include <linux/types.h>
+#include <asm/byteorder.h>
  #else
+#include <endian.h>
  #include <sys/ioctl.h>
  #endif
  
@@ -154,7 +156,7 @@ struct snd_hwdep_dsp_image {
   *                                                                           *
   *****************************************************************************/
  
-#define SNDRV_PCM_VERSION              SNDRV_PROTOCOL_VERSION(2, 0, 14)
+#define SNDRV_PCM_VERSION              SNDRV_PROTOCOL_VERSION(2, 0, 15)
  
  typedef unsigned long snd_pcm_uframes_t;
  typedef signed long snd_pcm_sframes_t;
@@ -301,7 +303,9 @@ typedef int __bitwise snd_pcm_subformat_t;
  #define SNDRV_PCM_INFO_DRAIN_TRIGGER   0x40000000              /* internal kernel flag - trigger in drain */
  #define SNDRV_PCM_INFO_FIFO_IN_FRAMES  0x80000000      /* internal kernel flag - FIFO size is in frames */
  
-
+#if (__BITS_PER_LONG == 32 && defined(__USE_TIME_BITS64)) || defined __KERNEL__
+#define __SND_STRUCT_TIME64
+#endif
  
  typedef int __bitwise snd_pcm_state_t;
  #define        SNDRV_PCM_STATE_OPEN            ((__force snd_pcm_state_t) 0) /* stream is open */
@@ -317,8 +321,17 @@ typedef int __bitwise snd_pcm_state_t;
  
  enum {
         SNDRV_PCM_MMAP_OFFSET_DATA = 0x00000000,
-       SNDRV_PCM_MMAP_OFFSET_STATUS = 0x80000000,
-       SNDRV_PCM_MMAP_OFFSET_CONTROL = 0x81000000,
+       SNDRV_PCM_MMAP_OFFSET_STATUS_OLD = 0x80000000,
+       SNDRV_PCM_MMAP_OFFSET_CONTROL_OLD = 0x81000000,
+       SNDRV_PCM_MMAP_OFFSET_STATUS_NEW = 0x82000000,
+       SNDRV_PCM_MMAP_OFFSET_CONTROL_NEW = 0x83000000,
+#ifdef __SND_STRUCT_TIME64
+       SNDRV_PCM_MMAP_OFFSET_STATUS = SNDRV_PCM_MMAP_OFFSET_STATUS_NEW,
+       SNDRV_PCM_MMAP_OFFSET_CONTROL = SNDRV_PCM_MMAP_OFFSET_CONTROL_NEW,
+#else
+       SNDRV_PCM_MMAP_OFFSET_STATUS = SNDRV_PCM_MMAP_OFFSET_STATUS_OLD,
+       SNDRV_PCM_MMAP_OFFSET_CONTROL = SNDRV_PCM_MMAP_OFFSET_CONTROL_OLD,
+#endif
  };
  
  union snd_pcm_sync_id {
@@ -456,8 +469,13 @@ enum {
         SNDRV_PCM_AUDIO_TSTAMP_TYPE_LAST = SNDRV_PCM_AUDIO_TSTAMP_TYPE_LINK_SYNCHRONIZED
  };
  
+#ifndef __KERNEL__
+/* explicit padding avoids incompatibility between i386 and x86-64 */
+typedef struct { unsigned char pad[sizeof(time_t) - sizeof(int)]; } __time_pad;
+
  struct snd_pcm_status {
         snd_pcm_state_t state;          /* stream state */
+       __time_pad pad1;                /* align to timespec */
         struct timespec trigger_tstamp; /* time when stream was started/stopped/paused */
         struct timespec tstamp;         /* reference timestamp */
         snd_pcm_uframes_t appl_ptr;     /* appl ptr */
@@ -473,17 +491,48 @@ struct snd_pcm_status {
         __u32 audio_tstamp_accuracy;    /* in ns units, only valid if indicated in audio_tstamp_data */
         unsigned char reserved[52-2*sizeof(struct timespec)]; /* must be filled with zero */
  };
+#endif
+
+/*
+ * For mmap operations, we need the 64-bit layout, both for compat mode,
+ * and for y2038 compatibility. For 64-bit applications, the two definitions
+ * are identical, so we keep the traditional version.
+ */
+#ifdef __SND_STRUCT_TIME64
+#define __snd_pcm_mmap_status64                snd_pcm_mmap_status
+#define __snd_pcm_mmap_control64       snd_pcm_mmap_control
+#define __snd_pcm_sync_ptr64           snd_pcm_sync_ptr
+#ifdef __KERNEL__
+#define __snd_timespec64               __kernel_timespec
+#else
+#define __snd_timespec64               timespec
+#endif
+struct __snd_timespec {
+       __s32 tv_sec;
+       __s32 tv_nsec;
+};
+#else
+#define __snd_pcm_mmap_status          snd_pcm_mmap_status
+#define __snd_pcm_mmap_control         snd_pcm_mmap_control
+#define __snd_pcm_sync_ptr             snd_pcm_sync_ptr
+#define __snd_timespec                 timespec
+struct __snd_timespec64 {
+       __s64 tv_sec;
+       __s64 tv_nsec;
+};
  
-struct snd_pcm_mmap_status {
+#endif
+
+struct __snd_pcm_mmap_status {
         snd_pcm_state_t state;          /* RO: state - SNDRV_PCM_STATE_XXXX */
         int pad1;                       /* Needed for 64 bit alignment */
         snd_pcm_uframes_t hw_ptr;       /* RO: hw ptr (0...boundary-1) */
-       struct timespec tstamp;         /* Timestamp */
+       struct __snd_timespec tstamp;   /* Timestamp */
         snd_pcm_state_t suspended_state; /* RO: suspended stream state */
-       struct timespec audio_tstamp;   /* from sample counter or wall clock */
+       struct __snd_timespec audio_tstamp; /* from sample counter or wall clock */
  };
  
-struct snd_pcm_mmap_control {
+struct __snd_pcm_mmap_control {
         snd_pcm_uframes_t appl_ptr;     /* RW: appl ptr (0...boundary-1) */
         snd_pcm_uframes_t avail_min;    /* RW: min available frames for wakeup */
  };
@@ -492,14 +541,59 @@ struct snd_pcm_mmap_control {
  #define SNDRV_PCM_SYNC_PTR_APPL                (1<<1)  /* get appl_ptr from driver (r/w op) */
  #define SNDRV_PCM_SYNC_PTR_AVAIL_MIN   (1<<2)  /* get avail_min from driver */
  
-struct snd_pcm_sync_ptr {
+struct __snd_pcm_sync_ptr {
         unsigned int flags;
         union {
-               struct snd_pcm_mmap_status status;
+               struct __snd_pcm_mmap_status status;
+               unsigned char reserved[64];
+       } s;
+       union {
+               struct __snd_pcm_mmap_control control;
+               unsigned char reserved[64];
+       } c;
+};
+
+#if defined(__BYTE_ORDER) ? __BYTE_ORDER == __BIG_ENDIAN : defined(__BIG_ENDIAN)
+typedef char __pad_before_uframe[sizeof(__u64) - sizeof(snd_pcm_uframes_t)];
+typedef char __pad_after_uframe[0];
+#endif
+
+#if defined(__BYTE_ORDER) ? __BYTE_ORDER == __LITTLE_ENDIAN : defined(__LITTLE_ENDIAN)
+typedef char __pad_before_uframe[0];
+typedef char __pad_after_uframe[sizeof(__u64) - sizeof(snd_pcm_uframes_t)];
+#endif
+
+struct __snd_pcm_mmap_status64 {
+       snd_pcm_state_t state;          /* RO: state - SNDRV_PCM_STATE_XXXX */
+       __u32 pad1;                     /* Needed for 64 bit alignment */
+       __pad_before_uframe __pad1;
+       snd_pcm_uframes_t hw_ptr;       /* RO: hw ptr (0...boundary-1) */
+       __pad_after_uframe __pad2;
+       struct __snd_timespec64 tstamp; /* Timestamp */
+       snd_pcm_state_t suspended_state;/* RO: suspended stream state */
+       __u32 pad3;                     /* Needed for 64 bit alignment */
+       struct __snd_timespec64 audio_tstamp; /* sample counter or wall clock */
+};
+
+struct __snd_pcm_mmap_control64 {
+       __pad_before_uframe __pad1;
+       snd_pcm_uframes_t appl_ptr;      /* RW: appl ptr (0...boundary-1) */
+       __pad_before_uframe __pad2;
+
+       __pad_before_uframe __pad3;
+       snd_pcm_uframes_t  avail_min;    /* RW: min available frames for wakeup */
+       __pad_after_uframe __pad4;
+};
+
+struct __snd_pcm_sync_ptr64 {
+       __u32 flags;
+       __u32 pad1;
+       union {
+               struct __snd_pcm_mmap_status64 status;
                 unsigned char reserved[64];
         } s;
         union {
-               struct snd_pcm_mmap_control control;
+               struct __snd_pcm_mmap_control64 control;
                 unsigned char reserved[64];
         } c;
  };
@@ -584,6 +678,8 @@ enum {
  #define SNDRV_PCM_IOCTL_STATUS         _IOR('A', 0x20, struct snd_pcm_status)
  #define SNDRV_PCM_IOCTL_DELAY          _IOR('A', 0x21, snd_pcm_sframes_t)
  #define SNDRV_PCM_IOCTL_HWSYNC         _IO('A', 0x22)
+#define __SNDRV_PCM_IOCTL_SYNC_PTR     _IOWR('A', 0x23, struct __snd_pcm_sync_ptr)
+#define __SNDRV_PCM_IOCTL_SYNC_PTR64   _IOWR('A', 0x23, struct __snd_pcm_sync_ptr64)
  #define SNDRV_PCM_IOCTL_SYNC_PTR       _IOWR('A', 0x23, struct snd_pcm_sync_ptr)
  #define SNDRV_PCM_IOCTL_STATUS_EXT     _IOWR('A', 0x24, struct snd_pcm_status)
  #define SNDRV_PCM_IOCTL_CHANNEL_INFO   _IOR('A', 0x32, struct snd_pcm_channel_info)
@@ -614,7 +710,7 @@ enum {
   *  Raw MIDI section - /dev/snd/midi??
   */
  
-#define SNDRV_RAWMIDI_VERSION          SNDRV_PROTOCOL_VERSION(2, 0, 0)
+#define SNDRV_RAWMIDI_VERSION          SNDRV_PROTOCOL_VERSION(2, 0, 1)
  
  enum {
         SNDRV_RAWMIDI_STREAM_OUTPUT = 0,
@@ -648,13 +744,16 @@ struct snd_rawmidi_params {
         unsigned char reserved[16];     /* reserved for future use */
  };
  
+#ifndef __KERNEL__
  struct snd_rawmidi_status {
         int stream;
+       __time_pad pad1;
         struct timespec tstamp;         /* Timestamp */
         size_t avail;                   /* available bytes */
         size_t xruns;                   /* count of overruns since last status (in bytes) */
         unsigned char reserved[16];     /* reserved for future use */
  };
+#endif
  
  #define SNDRV_RAWMIDI_IOCTL_PVERSION   _IOR('W', 0x00, int)
  #define SNDRV_RAWMIDI_IOCTL_INFO       _IOR('W', 0x01, struct snd_rawmidi_info)
@@ -667,7 +766,7 @@ struct snd_rawmidi_status {
   *  Timer section - /dev/snd/timer
   */
  
-#define SNDRV_TIMER_VERSION            SNDRV_PROTOCOL_VERSION(2, 0, 6)
+#define SNDRV_TIMER_VERSION            SNDRV_PROTOCOL_VERSION(2, 0, 7)
  
  enum {
         SNDRV_TIMER_CLASS_NONE = -1,
@@ -761,6 +860,7 @@ struct snd_timer_params {
         unsigned char reserved[60];     /* reserved */
  };
  
+#ifndef __KERNEL__
  struct snd_timer_status {
         struct timespec tstamp;         /* Timestamp - last update */
         unsigned int resolution;        /* current period resolution in ns */
@@ -769,10 +869,11 @@ struct snd_timer_status {
         unsigned int queue;             /* used queue size */
         unsigned char reserved[64];     /* reserved */
  };
+#endif
  
  #define SNDRV_TIMER_IOCTL_PVERSION     _IOR('T', 0x00, int)
  #define SNDRV_TIMER_IOCTL_NEXT_DEVICE  _IOWR('T', 0x01, struct snd_timer_id)
-#define SNDRV_TIMER_IOCTL_TREAD                _IOW('T', 0x02, int)
+#define SNDRV_TIMER_IOCTL_TREAD_OLD    _IOW('T', 0x02, int)
  #define SNDRV_TIMER_IOCTL_GINFO                _IOWR('T', 0x03, struct snd_timer_ginfo)
  #define SNDRV_TIMER_IOCTL_GPARAMS      _IOW('T', 0x04, struct snd_timer_gparams)
  #define SNDRV_TIMER_IOCTL_GSTATUS      _IOWR('T', 0x05, struct snd_timer_gstatus)
@@ -785,6 +886,15 @@ struct snd_timer_status {
  #define SNDRV_TIMER_IOCTL_STOP         _IO('T', 0xa1)
  #define SNDRV_TIMER_IOCTL_CONTINUE     _IO('T', 0xa2)
  #define SNDRV_TIMER_IOCTL_PAUSE                _IO('T', 0xa3)
+#define SNDRV_TIMER_IOCTL_TREAD64      _IOW('T', 0xa4, int)
+
+#if __BITS_PER_LONG == 64
+#define SNDRV_TIMER_IOCTL_TREAD SNDRV_TIMER_IOCTL_TREAD_OLD
+#else
+#define SNDRV_TIMER_IOCTL_TREAD ((sizeof(__kernel_long_t) >= sizeof(time_t)) ? \
+                                SNDRV_TIMER_IOCTL_TREAD_OLD : \
+                                SNDRV_TIMER_IOCTL_TREAD64)
+#endif
  
  struct snd_timer_read {
         unsigned int resolution;
@@ -810,11 +920,15 @@ enum {
         SNDRV_TIMER_EVENT_MRESUME = SNDRV_TIMER_EVENT_RESUME + 10,
  };
  
+#ifndef __KERNEL__
  struct snd_timer_tread {
         int event;
+       __time_pad pad1;
         struct timespec tstamp;
         unsigned int val;
+       __time_pad pad2;
  };
+#endif
  
  /****************************************************************************
   *                                                                          *
@@ -822,7 +936,7 @@ struct snd_timer_tread {
   *                                                                          *
   ****************************************************************************/
  
-#define SNDRV_CTL_VERSION              SNDRV_PROTOCOL_VERSION(2, 0, 7)
+#define SNDRV_CTL_VERSION              SNDRV_PROTOCOL_VERSION(2, 0, 8)
  
  struct snd_ctl_card_info {
         int card;                       /* card number */
@@ -860,7 +974,7 @@ typedef int __bitwise snd_ctl_elem_iface_t;
  #define SNDRV_CTL_ELEM_ACCESS_WRITE            (1<<1)
  #define SNDRV_CTL_ELEM_ACCESS_READWRITE                (SNDRV_CTL_ELEM_ACCESS_READ|SNDRV_CTL_ELEM_ACCESS_WRITE)
  #define SNDRV_CTL_ELEM_ACCESS_VOLATILE         (1<<2)  /* control value may be changed without a notification */
-#define SNDRV_CTL_ELEM_ACCESS_TIMESTAMP                (1<<3)  /* when was control changed */
+// (1 << 3) is unused.
  #define SNDRV_CTL_ELEM_ACCESS_TLV_READ         (1<<4)  /* TLV read is possible */
  #define SNDRV_CTL_ELEM_ACCESS_TLV_WRITE                (1<<5)  /* TLV write is possible */
  #define SNDRV_CTL_ELEM_ACCESS_TLV_READWRITE    (SNDRV_CTL_ELEM_ACCESS_TLV_READ|SNDRV_CTL_ELEM_ACCESS_TLV_WRITE)
@@ -926,11 +1040,7 @@ struct snd_ctl_elem_info {
                 } enumerated;
                 unsigned char reserved[128];
         } value;
-       union {
-               unsigned short d[4];            /* dimensions */
-               unsigned short *d_ptr;          /* indirect - obsoleted */
-       } dimen;
-       unsigned char reserved[64-4*sizeof(unsigned short)];
+       unsigned char reserved[64];
  };
  
  struct snd_ctl_elem_value {
@@ -955,8 +1065,7 @@ struct snd_ctl_elem_value {
                 } bytes;
                 struct snd_aes_iec958 iec958;
         } value;                /* RO */
-       struct timespec tstamp;
-       unsigned char reserved[128-sizeof(struct timespec)];
+       unsigned char reserved[128];
  };
  
  struct snd_ctl_tlv {
diff --git a/tools/perf/arch/arm64/util/header.c b/tools/perf/arch/arm64/util/header.c

index a32e4b7..d730666 100644 (file)
--- a/tools/perf/arch/arm64/util/header.c
+++ b/tools/perf/arch/arm64/util/header.c
@@ -1,8 +1,10 @@
  #include <stdio.h>
  #include <stdlib.h>
  #include <perf/cpumap.h>
+#include <util/cpumap.h>
  #include <internal/cpumap.h>
  #include <api/fs/fs.h>
+#include <errno.h>
  #include "debug.h"
  #include "header.h"
  
@@ -12,26 +14,21 @@
  #define MIDR_VARIANT_SHIFT      20
  #define MIDR_VARIANT_MASK       (0xf << MIDR_VARIANT_SHIFT)
  
-char *get_cpuid_str(struct perf_pmu *pmu)
+static int _get_cpuid(char *buf, size_t sz, struct perf_cpu_map *cpus)
  {
-       char *buf = NULL;
-       char path[PATH_MAX];
         const char *sysfs = sysfs__mountpoint();
-       int cpu;
         u64 midr = 0;
-       struct perf_cpu_map *cpus;
-       FILE *file;
+       int cpu;
  
-       if (!sysfs || !pmu || !pmu->cpus)
-               return NULL;
+       if (!sysfs || sz < MIDR_SIZE)
+               return EINVAL;
  
-       buf = malloc(MIDR_SIZE);
-       if (!buf)
-               return NULL;
+       cpus = perf_cpu_map__get(cpus);
  
-       /* read midr from list of cpus mapped to this pmu */
-       cpus = perf_cpu_map__get(pmu->cpus);
         for (cpu = 0; cpu < perf_cpu_map__nr(cpus); cpu++) {
+               char path[PATH_MAX];
+               FILE *file;
+
                 scnprintf(path, PATH_MAX, "%s/devices/system/cpu/cpu%d"MIDR,
                                 sysfs, cpus->map[cpu]);
  
@@ -57,12 +54,48 @@ char *get_cpuid_str(struct perf_pmu *pmu)
                 break;
         }
  
-       if (!midr) {
+       perf_cpu_map__put(cpus);
+
+       if (!midr)
+               return EINVAL;
+
+       return 0;
+}
+
+int get_cpuid(char *buf, size_t sz)
+{
+       struct perf_cpu_map *cpus = perf_cpu_map__new(NULL);
+       int ret;
+
+       if (!cpus)
+               return EINVAL;
+
+       ret = _get_cpuid(buf, sz, cpus);
+
+       perf_cpu_map__put(cpus);
+
+       return ret;
+}
+
+char *get_cpuid_str(struct perf_pmu *pmu)
+{
+       char *buf = NULL;
+       int res;
+
+       if (!pmu || !pmu->cpus)
+               return NULL;
+
+       buf = malloc(MIDR_SIZE);
+       if (!buf)
+               return NULL;
+
+       /* read midr from list of cpus mapped to this pmu */
+       res = _get_cpuid(buf, MIDR_SIZE, pmu->cpus);
+       if (res) {
                 pr_err("failed to get cpuid string for PMU %s\n", pmu->name);
                 free(buf);
                 buf = NULL;
         }
  
-       perf_cpu_map__put(cpus);
         return buf;
  }
diff --git a/tools/perf/arch/x86/entry/syscalls/syscall_64.tbl b/tools/perf/arch/x86/entry/syscalls/syscall_64.tbl

index c29976e..44d510b 100644 (file)
--- a/tools/perf/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/tools/perf/arch/x86/entry/syscalls/syscall_64.tbl
@@ -357,6 +357,8 @@
  433    common  fspick                  __x64_sys_fspick
  434    common  pidfd_open              __x64_sys_pidfd_open
  435    common  clone3                  __x64_sys_clone3/ptregs
+437    common  openat2                 __x64_sys_openat2
+438    common  pidfd_getfd             __x64_sys_pidfd_getfd
  
  #
  # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/tools/perf/builtin-trace.c b/tools/perf/builtin-trace.c

index 46a72ec..01d5420 100644 (file)
--- a/tools/perf/builtin-trace.c
+++ b/tools/perf/builtin-trace.c
@@ -1065,7 +1065,9 @@ static struct syscall_fmt syscall_fmts[] = {
         { .name     = "poll", .timeout = true, },
         { .name     = "ppoll", .timeout = true, },
         { .name     = "prctl",
-         .arg = { [0] = { .scnprintf = SCA_PRCTL_OPTION, /* option */ },
+         .arg = { [0] = { .scnprintf = SCA_PRCTL_OPTION, /* option */
+                          .strtoul   = STUL_STRARRAY,
+                          .parm      = &strarray__prctl_options, },
                    [1] = { .scnprintf = SCA_PRCTL_ARG2, /* arg2 */ },
                    [2] = { .scnprintf = SCA_PRCTL_ARG3, /* arg3 */ }, }, },
         { .name     = "pread", .alias = "pread64", },
diff --git a/tools/perf/check-headers.sh b/tools/perf/check-headers.sh

index 68039a9..bfb21d0 100755 (executable)
--- a/tools/perf/check-headers.sh
+++ b/tools/perf/check-headers.sh
@@ -13,6 +13,7 @@ include/uapi/linux/kcmp.h
  include/uapi/linux/kvm.h
  include/uapi/linux/in.h
  include/uapi/linux/mount.h
+include/uapi/linux/openat2.h
  include/uapi/linux/perf_event.h
  include/uapi/linux/prctl.h
  include/uapi/linux/sched.h
diff --git a/tools/perf/trace/beauty/beauty.h b/tools/perf/trace/beauty/beauty.h

index 5a61043..d6dfe68 100644 (file)
--- a/tools/perf/trace/beauty/beauty.h
+++ b/tools/perf/trace/beauty/beauty.h
@@ -213,6 +213,8 @@ size_t syscall_arg__scnprintf_x86_arch_prctl_code(char *bf, size_t size, struct
  size_t syscall_arg__scnprintf_prctl_option(char *bf, size_t size, struct syscall_arg *arg);
  #define SCA_PRCTL_OPTION syscall_arg__scnprintf_prctl_option
  
+extern struct strarray strarray__prctl_options;
+
  size_t syscall_arg__scnprintf_prctl_arg2(char *bf, size_t size, struct syscall_arg *arg);
  #define SCA_PRCTL_ARG2 syscall_arg__scnprintf_prctl_arg2
  
diff --git a/tools/perf/trace/beauty/prctl.c b/tools/perf/trace/beauty/prctl.c

index ba2179a..6fe5ad5 100644 (file)
--- a/tools/perf/trace/beauty/prctl.c
+++ b/tools/perf/trace/beauty/prctl.c
@@ -11,9 +11,10 @@
  
  #include "trace/beauty/generated/prctl_option_array.c"
  
+DEFINE_STRARRAY(prctl_options, "PR_");
+
  static size_t prctl__scnprintf_option(int option, char *bf, size_t size, bool show_prefix)
  {
-       static DEFINE_STRARRAY(prctl_options, "PR_");
         return strarray__scnprintf(&strarray__prctl_options, bf, size, "%d", show_prefix, option);
  }
  
diff --git a/tools/perf/util/llvm-utils.c b/tools/perf/util/llvm-utils.c

index eae47c2..b5af680 100644 (file)
--- a/tools/perf/util/llvm-utils.c
+++ b/tools/perf/util/llvm-utils.c
@@ -288,6 +288,7 @@ static const char *kinc_fetch_script =
  "obj-y := dummy.o\n"
  "\\$(obj)/%.o: \\$(src)/%.c\n"
  "\t@echo -n \"\\$(NOSTDINC_FLAGS) \\$(LINUXINCLUDE) \\$(EXTRA_CFLAGS)\"\n"
+"\t\\$(CC) -c -o \\$@ \\$<\n"
  "EOF\n"
  "touch $TMPDIR/dummy.c\n"
  "make -s -C $KBUILD_DIR M=$TMPDIR $KBUILD_OPTS dummy.o 2>/dev/null\n"
diff --git a/tools/perf/util/machine.c b/tools/perf/util/machine.c

index c8c5410..fb5c2cd 100644 (file)
--- a/tools/perf/util/machine.c
+++ b/tools/perf/util/machine.c
@@ -686,6 +686,7 @@ static struct dso *machine__findnew_module_dso(struct machine *machine,
  
                 dso__set_module_info(dso, m, machine);
                 dso__set_long_name(dso, strdup(filename), true);
+               dso->kernel = DSO_TYPE_KERNEL;
         }
  
         dso__get(dso);
@@ -726,9 +727,17 @@ static int machine__process_ksymbol_register(struct machine *machine,
         struct map *map = maps__find(&machine->kmaps, event->ksymbol.addr);
  
         if (!map) {
-               map = dso__new_map(event->ksymbol.name);
-               if (!map)
+               struct dso *dso = dso__new(event->ksymbol.name);
+
+               if (dso) {
+                       dso->kernel = DSO_TYPE_KERNEL;
+                       map = map__new2(0, dso);
+               }
+
+               if (!dso || !map) {
+                       dso__put(dso);
                         return -ENOMEM;
+               }
  
                 map->start = event->ksymbol.addr;
                 map->end = map->start + event->ksymbol.len;
@@ -972,7 +981,6 @@ int machine__create_extra_kernel_map(struct machine *machine,
  
         kmap = map__kmap(map);
  
-       kmap->kmaps = &machine->kmaps;
         strlcpy(kmap->name, xm->name, KMAP_NAME_LEN);
  
         maps__insert(&machine->kmaps, map);
@@ -1082,9 +1090,6 @@ int __weak machine__create_extra_kernel_maps(struct machine *machine __maybe_unu
  static int
  __machine__create_kernel_maps(struct machine *machine, struct dso *kernel)
  {
-       struct kmap *kmap;
-       struct map *map;
-
         /* In case of renewal the kernel map, destroy previous one */
         machine__destroy_kernel_maps(machine);
  
@@ -1093,14 +1098,7 @@ __machine__create_kernel_maps(struct machine *machine, struct dso *kernel)
                 return -1;
  
         machine->vmlinux_map->map_ip = machine->vmlinux_map->unmap_ip = identity__map_ip;
-       map = machine__kernel_map(machine);
-       kmap = map__kmap(map);
-       if (!kmap)
-               return -1;
-
-       kmap->kmaps = &machine->kmaps;
-       maps__insert(&machine->kmaps, map);
-
+       maps__insert(&machine->kmaps, machine->vmlinux_map);
         return 0;
  }
  
diff --git a/tools/perf/util/map.c b/tools/perf/util/map.c

index f67960b..a08ca27 100644 (file)
--- a/tools/perf/util/map.c
+++ b/tools/perf/util/map.c
@@ -375,8 +375,13 @@ struct symbol *map__find_symbol_by_name(struct map *map, const char *name)
  
  struct map *map__clone(struct map *from)
  {
-       struct map *map = memdup(from, sizeof(*map));
+       size_t size = sizeof(struct map);
+       struct map *map;
+
+       if (from->dso && from->dso->kernel)
+               size += sizeof(struct kmap);
  
+       map = memdup(from, size);
         if (map != NULL) {
                 refcount_set(&map->refcnt, 1);
                 RB_CLEAR_NODE(&map->rb_node);
@@ -538,6 +543,16 @@ void maps__insert(struct maps *maps, struct map *map)
         __maps__insert(maps, map);
         ++maps->nr_maps;
  
+       if (map->dso && map->dso->kernel) {
+               struct kmap *kmap = map__kmap(map);
+
+               if (kmap)
+                       kmap->kmaps = maps;
+               else
+                       pr_err("Internal error: kernel dso with non kernel map\n");
+       }
+
+
         /*
          * If we already performed some search by name, then we need to add the just
          * inserted map and resort.
diff --git a/tools/perf/util/stat-shadow.c b/tools/perf/util/stat-shadow.c

index 2c41d47..90d23cc 100644 (file)
--- a/tools/perf/util/stat-shadow.c
+++ b/tools/perf/util/stat-shadow.c
@@ -18,7 +18,6 @@
   * AGGR_NONE: Use matching CPU
   * AGGR_THREAD: Not supported?
   */
-static bool have_frontend_stalled;
  
  struct runtime_stat rt_stat;
  struct stats walltime_nsecs_stats;
@@ -144,7 +143,6 @@ void runtime_stat__exit(struct runtime_stat *st)
  
  void perf_stat__init_shadow_stats(void)
  {
-       have_frontend_stalled = pmu_have_event("cpu", "stalled-cycles-frontend");
         runtime_stat__init(&rt_stat);
  }
  
@@ -853,10 +851,6 @@ void perf_stat__print_shadow_stats(struct perf_stat_config *config,
                         print_metric(config, ctxp, NULL, "%7.2f ",
                                         "stalled cycles per insn",
                                         ratio);
-               } else if (have_frontend_stalled) {
-                       out->new_line(config, ctxp);
-                       print_metric(config, ctxp, NULL, "%7.2f ",
-                                    "stalled cycles per insn", 0);
                 }
         } else if (perf_evsel__match(evsel, HARDWARE, HW_BRANCH_MISSES)) {
                 if (runtime_stat_n(st, STAT_BRANCHES, ctx, cpu) != 0)
diff --git a/tools/perf/util/symbol.c b/tools/perf/util/symbol.c

index 3b379b1..1077013 100644 (file)
--- a/tools/perf/util/symbol.c
+++ b/tools/perf/util/symbol.c
@@ -635,9 +635,12 @@ out:
  static bool symbol__is_idle(const char *name)
  {
         const char * const idle_symbols[] = {
+               "acpi_idle_do_entry",
+               "acpi_processor_ffh_cstate_enter",
                 "arch_cpu_idle",
                 "cpu_idle",
                 "cpu_startup_entry",
+               "idle_cpu",
                 "intel_idle",
                 "default_idle",
                 "native_safe_halt",
@@ -651,13 +654,17 @@ static bool symbol__is_idle(const char *name)
                 NULL
         };
         int i;
+       static struct strlist *idle_symbols_list;
  
-       for (i = 0; idle_symbols[i]; i++) {
-               if (!strcmp(idle_symbols[i], name))
-                       return true;
-       }
+       if (idle_symbols_list)
+               return strlist__has_entry(idle_symbols_list, name);
  
-       return false;
+       idle_symbols_list = strlist__new(NULL, NULL);
+
+       for (i = 0; idle_symbols[i]; i++)
+               strlist__add(idle_symbols_list, idle_symbols[i]);
+
+       return strlist__has_entry(idle_symbols_list, name);
  }
  
  static int map__process_kallsym_symbol(void *arg, const char *name,
diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile

index 67abc1d..d91c53b 100644 (file)
--- a/tools/testing/selftests/kvm/Makefile
+++ b/tools/testing/selftests/kvm/Makefile
@@ -8,7 +8,7 @@ KSFT_KHDR_INSTALL := 1
  UNAME_M := $(shell uname -m)
  
  LIBKVM = lib/assert.c lib/elf.c lib/io.c lib/kvm_util.c lib/sparsebit.c
-LIBKVM_x86_64 = lib/x86_64/processor.c lib/x86_64/vmx.c lib/x86_64/ucall.c
+LIBKVM_x86_64 = lib/x86_64/processor.c lib/x86_64/vmx.c lib/x86_64/svm.c lib/x86_64/ucall.c
  LIBKVM_aarch64 = lib/aarch64/processor.c lib/aarch64/ucall.c
  LIBKVM_s390x = lib/s390x/processor.c lib/s390x/ucall.c
  
@@ -26,6 +26,7 @@ TEST_GEN_PROGS_x86_64 += x86_64/vmx_dirty_log_test
  TEST_GEN_PROGS_x86_64 += x86_64/vmx_set_nested_state_test
  TEST_GEN_PROGS_x86_64 += x86_64/vmx_tsc_adjust_test
  TEST_GEN_PROGS_x86_64 += x86_64/xss_msr_test
+TEST_GEN_PROGS_x86_64 += x86_64/svm_vmcall_test
  TEST_GEN_PROGS_x86_64 += clear_dirty_log_test
  TEST_GEN_PROGS_x86_64 += dirty_log_test
  TEST_GEN_PROGS_x86_64 += kvm_create_max_vcpus
diff --git a/tools/testing/selftests/kvm/include/x86_64/processor.h b/tools/testing/selftests/kvm/include/x86_64/processor.h

index aa6451b..7428513 100644 (file)
--- a/tools/testing/selftests/kvm/include/x86_64/processor.h
+++ b/tools/testing/selftests/kvm/include/x86_64/processor.h
@@ -36,24 +36,24 @@
  #define X86_CR4_SMAP           (1ul << 21)
  #define X86_CR4_PKE            (1ul << 22)
  
-/* The enum values match the intruction encoding of each register */
-enum x86_register {
-       RAX = 0,
-       RCX,
-       RDX,
-       RBX,
-       RSP,
-       RBP,
-       RSI,
-       RDI,
-       R8,
-       R9,
-       R10,
-       R11,
-       R12,
-       R13,
-       R14,
-       R15,
+/* General Registers in 64-Bit Mode */
+struct gpr64_regs {
+       u64 rax;
+       u64 rcx;
+       u64 rdx;
+       u64 rbx;
+       u64 rsp;
+       u64 rbp;
+       u64 rsi;
+       u64 rdi;
+       u64 r8;
+       u64 r9;
+       u64 r10;
+       u64 r11;
+       u64 r12;
+       u64 r13;
+       u64 r14;
+       u64 r15;
  };
  
  struct desc64 {
@@ -220,20 +220,20 @@ static inline void set_cr4(uint64_t val)
         __asm__ __volatile__("mov %0, %%cr4" : : "r" (val) : "memory");
  }
  
-static inline uint64_t get_gdt_base(void)
+static inline struct desc_ptr get_gdt(void)
  {
         struct desc_ptr gdt;
         __asm__ __volatile__("sgdt %[gdt]"
                              : /* output */ [gdt]"=m"(gdt));
-       return gdt.address;
+       return gdt;
  }
  
-static inline uint64_t get_idt_base(void)
+static inline struct desc_ptr get_idt(void)
  {
         struct desc_ptr idt;
         __asm__ __volatile__("sidt %[idt]"
                              : /* output */ [idt]"=m"(idt));
-       return idt.address;
+       return idt;
  }
  
  #define SET_XMM(__var, __xmm) \
diff --git a/tools/testing/selftests/kvm/include/x86_64/svm.h b/tools/testing/selftests/kvm/include/x86_64/svm.h

new file mode 100644 (file)

index 0000000..f4ea235
--- /dev/null
+++ b/tools/testing/selftests/kvm/include/x86_64/svm.h
@@ -0,0 +1,297 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * tools/testing/selftests/kvm/include/x86_64/svm.h
+ * This is a copy of arch/x86/include/asm/svm.h
+ *
+ */
+
+#ifndef SELFTEST_KVM_SVM_H
+#define SELFTEST_KVM_SVM_H
+
+enum {
+       INTERCEPT_INTR,
+       INTERCEPT_NMI,
+       INTERCEPT_SMI,
+       INTERCEPT_INIT,
+       INTERCEPT_VINTR,
+       INTERCEPT_SELECTIVE_CR0,
+       INTERCEPT_STORE_IDTR,
+       INTERCEPT_STORE_GDTR,
+       INTERCEPT_STORE_LDTR,
+       INTERCEPT_STORE_TR,
+       INTERCEPT_LOAD_IDTR,
+       INTERCEPT_LOAD_GDTR,
+       INTERCEPT_LOAD_LDTR,
+       INTERCEPT_LOAD_TR,
+       INTERCEPT_RDTSC,
+       INTERCEPT_RDPMC,
+       INTERCEPT_PUSHF,
+       INTERCEPT_POPF,
+       INTERCEPT_CPUID,
+       INTERCEPT_RSM,
+       INTERCEPT_IRET,
+       INTERCEPT_INTn,
+       INTERCEPT_INVD,
+       INTERCEPT_PAUSE,
+       INTERCEPT_HLT,
+       INTERCEPT_INVLPG,
+       INTERCEPT_INVLPGA,
+       INTERCEPT_IOIO_PROT,
+       INTERCEPT_MSR_PROT,
+       INTERCEPT_TASK_SWITCH,
+       INTERCEPT_FERR_FREEZE,
+       INTERCEPT_SHUTDOWN,
+       INTERCEPT_VMRUN,
+       INTERCEPT_VMMCALL,
+       INTERCEPT_VMLOAD,
+       INTERCEPT_VMSAVE,
+       INTERCEPT_STGI,
+       INTERCEPT_CLGI,
+       INTERCEPT_SKINIT,
+       INTERCEPT_RDTSCP,
+       INTERCEPT_ICEBP,
+       INTERCEPT_WBINVD,
+       INTERCEPT_MONITOR,
+       INTERCEPT_MWAIT,
+       INTERCEPT_MWAIT_COND,
+       INTERCEPT_XSETBV,
+       INTERCEPT_RDPRU,
+};
+
+
+struct __attribute__ ((__packed__)) vmcb_control_area {
+       u32 intercept_cr;
+       u32 intercept_dr;
+       u32 intercept_exceptions;
+       u64 intercept;
+       u8 reserved_1[40];
+       u16 pause_filter_thresh;
+       u16 pause_filter_count;
+       u64 iopm_base_pa;
+       u64 msrpm_base_pa;
+       u64 tsc_offset;
+       u32 asid;
+       u8 tlb_ctl;
+       u8 reserved_2[3];
+       u32 int_ctl;
+       u32 int_vector;
+       u32 int_state;
+       u8 reserved_3[4];
+       u32 exit_code;
+       u32 exit_code_hi;
+       u64 exit_info_1;
+       u64 exit_info_2;
+       u32 exit_int_info;
+       u32 exit_int_info_err;
+       u64 nested_ctl;
+       u64 avic_vapic_bar;
+       u8 reserved_4[8];
+       u32 event_inj;
+       u32 event_inj_err;
+       u64 nested_cr3;
+       u64 virt_ext;
+       u32 clean;
+       u32 reserved_5;
+       u64 next_rip;
+       u8 insn_len;
+       u8 insn_bytes[15];
+       u64 avic_backing_page;  /* Offset 0xe0 */
+       u8 reserved_6[8];       /* Offset 0xe8 */
+       u64 avic_logical_id;    /* Offset 0xf0 */
+       u64 avic_physical_id;   /* Offset 0xf8 */
+       u8 reserved_7[768];
+};
+
+
+#define TLB_CONTROL_DO_NOTHING 0
+#define TLB_CONTROL_FLUSH_ALL_ASID 1
+#define TLB_CONTROL_FLUSH_ASID 3
+#define TLB_CONTROL_FLUSH_ASID_LOCAL 7
+
+#define V_TPR_MASK 0x0f
+
+#define V_IRQ_SHIFT 8
+#define V_IRQ_MASK (1 << V_IRQ_SHIFT)
+
+#define V_GIF_SHIFT 9
+#define V_GIF_MASK (1 << V_GIF_SHIFT)
+
+#define V_INTR_PRIO_SHIFT 16
+#define V_INTR_PRIO_MASK (0x0f << V_INTR_PRIO_SHIFT)
+
+#define V_IGN_TPR_SHIFT 20
+#define V_IGN_TPR_MASK (1 << V_IGN_TPR_SHIFT)
+
+#define V_INTR_MASKING_SHIFT 24
+#define V_INTR_MASKING_MASK (1 << V_INTR_MASKING_SHIFT)
+
+#define V_GIF_ENABLE_SHIFT 25
+#define V_GIF_ENABLE_MASK (1 << V_GIF_ENABLE_SHIFT)
+
+#define AVIC_ENABLE_SHIFT 31
+#define AVIC_ENABLE_MASK (1 << AVIC_ENABLE_SHIFT)
+
+#define LBR_CTL_ENABLE_MASK BIT_ULL(0)
+#define VIRTUAL_VMLOAD_VMSAVE_ENABLE_MASK BIT_ULL(1)
+
+#define SVM_INTERRUPT_SHADOW_MASK 1
+
+#define SVM_IOIO_STR_SHIFT 2
+#define SVM_IOIO_REP_SHIFT 3
+#define SVM_IOIO_SIZE_SHIFT 4
+#define SVM_IOIO_ASIZE_SHIFT 7
+
+#define SVM_IOIO_TYPE_MASK 1
+#define SVM_IOIO_STR_MASK (1 << SVM_IOIO_STR_SHIFT)
+#define SVM_IOIO_REP_MASK (1 << SVM_IOIO_REP_SHIFT)
+#define SVM_IOIO_SIZE_MASK (7 << SVM_IOIO_SIZE_SHIFT)
+#define SVM_IOIO_ASIZE_MASK (7 << SVM_IOIO_ASIZE_SHIFT)
+
+#define SVM_VM_CR_VALID_MASK   0x001fULL
+#define SVM_VM_CR_SVM_LOCK_MASK 0x0008ULL
+#define SVM_VM_CR_SVM_DIS_MASK  0x0010ULL
+
+#define SVM_NESTED_CTL_NP_ENABLE       BIT(0)
+#define SVM_NESTED_CTL_SEV_ENABLE      BIT(1)
+
+struct __attribute__ ((__packed__)) vmcb_seg {
+       u16 selector;
+       u16 attrib;
+       u32 limit;
+       u64 base;
+};
+
+struct __attribute__ ((__packed__)) vmcb_save_area {
+       struct vmcb_seg es;
+       struct vmcb_seg cs;
+       struct vmcb_seg ss;
+       struct vmcb_seg ds;
+       struct vmcb_seg fs;
+       struct vmcb_seg gs;
+       struct vmcb_seg gdtr;
+       struct vmcb_seg ldtr;
+       struct vmcb_seg idtr;
+       struct vmcb_seg tr;
+       u8 reserved_1[43];
+       u8 cpl;
+       u8 reserved_2[4];
+       u64 efer;
+       u8 reserved_3[112];
+       u64 cr4;
+       u64 cr3;
+       u64 cr0;
+       u64 dr7;
+       u64 dr6;
+       u64 rflags;
+       u64 rip;
+       u8 reserved_4[88];
+       u64 rsp;
+       u8 reserved_5[24];
+       u64 rax;
+       u64 star;
+       u64 lstar;
+       u64 cstar;
+       u64 sfmask;
+       u64 kernel_gs_base;
+       u64 sysenter_cs;
+       u64 sysenter_esp;
+       u64 sysenter_eip;
+       u64 cr2;
+       u8 reserved_6[32];
+       u64 g_pat;
+       u64 dbgctl;
+       u64 br_from;
+       u64 br_to;
+       u64 last_excp_from;
+       u64 last_excp_to;
+};
+
+struct __attribute__ ((__packed__)) vmcb {
+       struct vmcb_control_area control;
+       struct vmcb_save_area save;
+};
+
+#define SVM_CPUID_FUNC 0x8000000a
+
+#define SVM_VM_CR_SVM_DISABLE 4
+
+#define SVM_SELECTOR_S_SHIFT 4
+#define SVM_SELECTOR_DPL_SHIFT 5
+#define SVM_SELECTOR_P_SHIFT 7
+#define SVM_SELECTOR_AVL_SHIFT 8
+#define SVM_SELECTOR_L_SHIFT 9
+#define SVM_SELECTOR_DB_SHIFT 10
+#define SVM_SELECTOR_G_SHIFT 11
+
+#define SVM_SELECTOR_TYPE_MASK (0xf)
+#define SVM_SELECTOR_S_MASK (1 << SVM_SELECTOR_S_SHIFT)
+#define SVM_SELECTOR_DPL_MASK (3 << SVM_SELECTOR_DPL_SHIFT)
+#define SVM_SELECTOR_P_MASK (1 << SVM_SELECTOR_P_SHIFT)
+#define SVM_SELECTOR_AVL_MASK (1 << SVM_SELECTOR_AVL_SHIFT)
+#define SVM_SELECTOR_L_MASK (1 << SVM_SELECTOR_L_SHIFT)
+#define SVM_SELECTOR_DB_MASK (1 << SVM_SELECTOR_DB_SHIFT)
+#define SVM_SELECTOR_G_MASK (1 << SVM_SELECTOR_G_SHIFT)
+
+#define SVM_SELECTOR_WRITE_MASK (1 << 1)
+#define SVM_SELECTOR_READ_MASK SVM_SELECTOR_WRITE_MASK
+#define SVM_SELECTOR_CODE_MASK (1 << 3)
+
+#define INTERCEPT_CR0_READ     0
+#define INTERCEPT_CR3_READ     3
+#define INTERCEPT_CR4_READ     4
+#define INTERCEPT_CR8_READ     8
+#define INTERCEPT_CR0_WRITE    (16 + 0)
+#define INTERCEPT_CR3_WRITE    (16 + 3)
+#define INTERCEPT_CR4_WRITE    (16 + 4)
+#define INTERCEPT_CR8_WRITE    (16 + 8)
+
+#define INTERCEPT_DR0_READ     0
+#define INTERCEPT_DR1_READ     1
+#define INTERCEPT_DR2_READ     2
+#define INTERCEPT_DR3_READ     3
+#define INTERCEPT_DR4_READ     4
+#define INTERCEPT_DR5_READ     5
+#define INTERCEPT_DR6_READ     6
+#define INTERCEPT_DR7_READ     7
+#define INTERCEPT_DR0_WRITE    (16 + 0)
+#define INTERCEPT_DR1_WRITE    (16 + 1)
+#define INTERCEPT_DR2_WRITE    (16 + 2)
+#define INTERCEPT_DR3_WRITE    (16 + 3)
+#define INTERCEPT_DR4_WRITE    (16 + 4)
+#define INTERCEPT_DR5_WRITE    (16 + 5)
+#define INTERCEPT_DR6_WRITE    (16 + 6)
+#define INTERCEPT_DR7_WRITE    (16 + 7)
+
+#define SVM_EVTINJ_VEC_MASK 0xff
+
+#define SVM_EVTINJ_TYPE_SHIFT 8
+#define SVM_EVTINJ_TYPE_MASK (7 << SVM_EVTINJ_TYPE_SHIFT)
+
+#define SVM_EVTINJ_TYPE_INTR (0 << SVM_EVTINJ_TYPE_SHIFT)
+#define SVM_EVTINJ_TYPE_NMI (2 << SVM_EVTINJ_TYPE_SHIFT)
+#define SVM_EVTINJ_TYPE_EXEPT (3 << SVM_EVTINJ_TYPE_SHIFT)
+#define SVM_EVTINJ_TYPE_SOFT (4 << SVM_EVTINJ_TYPE_SHIFT)
+
+#define SVM_EVTINJ_VALID (1 << 31)
+#define SVM_EVTINJ_VALID_ERR (1 << 11)
+
+#define SVM_EXITINTINFO_VEC_MASK SVM_EVTINJ_VEC_MASK
+#define SVM_EXITINTINFO_TYPE_MASK SVM_EVTINJ_TYPE_MASK
+
+#define        SVM_EXITINTINFO_TYPE_INTR SVM_EVTINJ_TYPE_INTR
+#define        SVM_EXITINTINFO_TYPE_NMI SVM_EVTINJ_TYPE_NMI
+#define        SVM_EXITINTINFO_TYPE_EXEPT SVM_EVTINJ_TYPE_EXEPT
+#define        SVM_EXITINTINFO_TYPE_SOFT SVM_EVTINJ_TYPE_SOFT
+
+#define SVM_EXITINTINFO_VALID SVM_EVTINJ_VALID
+#define SVM_EXITINTINFO_VALID_ERR SVM_EVTINJ_VALID_ERR
+
+#define SVM_EXITINFOSHIFT_TS_REASON_IRET 36
+#define SVM_EXITINFOSHIFT_TS_REASON_JMP 38
+#define SVM_EXITINFOSHIFT_TS_HAS_ERROR_CODE 44
+
+#define SVM_EXITINFO_REG_MASK 0x0F
+
+#define SVM_CR0_SELECTIVE_MASK (X86_CR0_TS | X86_CR0_MP)
+
+#endif /* SELFTEST_KVM_SVM_H */
diff --git a/tools/testing/selftests/kvm/include/x86_64/svm_util.h b/tools/testing/selftests/kvm/include/x86_64/svm_util.h

new file mode 100644 (file)

index 0000000..cd03791
--- /dev/null
+++ b/tools/testing/selftests/kvm/include/x86_64/svm_util.h
@@ -0,0 +1,38 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * tools/testing/selftests/kvm/include/x86_64/svm_utils.h
+ * Header for nested SVM testing
+ *
+ * Copyright (C) 2020, Red Hat, Inc.
+ */
+
+#ifndef SELFTEST_KVM_SVM_UTILS_H
+#define SELFTEST_KVM_SVM_UTILS_H
+
+#include <stdint.h>
+#include "svm.h"
+#include "processor.h"
+
+#define CPUID_SVM_BIT          2
+#define CPUID_SVM              BIT_ULL(CPUID_SVM_BIT)
+
+#define SVM_EXIT_VMMCALL       0x081
+
+struct svm_test_data {
+       /* VMCB */
+       struct vmcb *vmcb; /* gva */
+       void *vmcb_hva;
+       uint64_t vmcb_gpa;
+
+       /* host state-save area */
+       struct vmcb_save_area *save_area; /* gva */
+       void *save_area_hva;
+       uint64_t save_area_gpa;
+};
+
+struct svm_test_data *vcpu_alloc_svm(struct kvm_vm *vm, vm_vaddr_t *p_svm_gva);
+void generic_svm_setup(struct svm_test_data *svm, void *guest_rip, void *guest_rsp);
+void run_guest(struct vmcb *vmcb, uint64_t vmcb_gpa);
+void nested_svm_check_supported(void);
+
+#endif /* SELFTEST_KVM_SVM_UTILS_H */
diff --git a/tools/testing/selftests/kvm/lib/x86_64/svm.c b/tools/testing/selftests/kvm/lib/x86_64/svm.c

new file mode 100644 (file)

index 0000000..6e05a8f
--- /dev/null
+++ b/tools/testing/selftests/kvm/lib/x86_64/svm.c
@@ -0,0 +1,161 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * tools/testing/selftests/kvm/lib/x86_64/svm.c
+ * Helpers used for nested SVM testing
+ * Largely inspired from KVM unit test svm.c
+ *
+ * Copyright (C) 2020, Red Hat, Inc.
+ */
+
+#include "test_util.h"
+#include "kvm_util.h"
+#include "../kvm_util_internal.h"
+#include "processor.h"
+#include "svm_util.h"
+
+struct gpr64_regs guest_regs;
+u64 rflags;
+
+/* Allocate memory regions for nested SVM tests.
+ *
+ * Input Args:
+ *   vm - The VM to allocate guest-virtual addresses in.
+ *
+ * Output Args:
+ *   p_svm_gva - The guest virtual address for the struct svm_test_data.
+ *
+ * Return:
+ *   Pointer to structure with the addresses of the SVM areas.
+ */
+struct svm_test_data *
+vcpu_alloc_svm(struct kvm_vm *vm, vm_vaddr_t *p_svm_gva)
+{
+       vm_vaddr_t svm_gva = vm_vaddr_alloc(vm, getpagesize(),
+                                           0x10000, 0, 0);
+       struct svm_test_data *svm = addr_gva2hva(vm, svm_gva);
+
+       svm->vmcb = (void *)vm_vaddr_alloc(vm, getpagesize(),
+                                          0x10000, 0, 0);
+       svm->vmcb_hva = addr_gva2hva(vm, (uintptr_t)svm->vmcb);
+       svm->vmcb_gpa = addr_gva2gpa(vm, (uintptr_t)svm->vmcb);
+
+       svm->save_area = (void *)vm_vaddr_alloc(vm, getpagesize(),
+                                               0x10000, 0, 0);
+       svm->save_area_hva = addr_gva2hva(vm, (uintptr_t)svm->save_area);
+       svm->save_area_gpa = addr_gva2gpa(vm, (uintptr_t)svm->save_area);
+
+       *p_svm_gva = svm_gva;
+       return svm;
+}
+
+static void vmcb_set_seg(struct vmcb_seg *seg, u16 selector,
+                        u64 base, u32 limit, u32 attr)
+{
+       seg->selector = selector;
+       seg->attrib = attr;
+       seg->limit = limit;
+       seg->base = base;
+}
+
+void generic_svm_setup(struct svm_test_data *svm, void *guest_rip, void *guest_rsp)
+{
+       struct vmcb *vmcb = svm->vmcb;
+       uint64_t vmcb_gpa = svm->vmcb_gpa;
+       struct vmcb_save_area *save = &vmcb->save;
+       struct vmcb_control_area *ctrl = &vmcb->control;
+       u32 data_seg_attr = 3 | SVM_SELECTOR_S_MASK | SVM_SELECTOR_P_MASK
+             | SVM_SELECTOR_DB_MASK | SVM_SELECTOR_G_MASK;
+       u32 code_seg_attr = 9 | SVM_SELECTOR_S_MASK | SVM_SELECTOR_P_MASK
+               | SVM_SELECTOR_L_MASK | SVM_SELECTOR_G_MASK;
+       uint64_t efer;
+
+       efer = rdmsr(MSR_EFER);
+       wrmsr(MSR_EFER, efer | EFER_SVME);
+       wrmsr(MSR_VM_HSAVE_PA, svm->save_area_gpa);
+
+       memset(vmcb, 0, sizeof(*vmcb));
+       asm volatile ("vmsave\n\t" : : "a" (vmcb_gpa) : "memory");
+       vmcb_set_seg(&save->es, get_es(), 0, -1U, data_seg_attr);
+       vmcb_set_seg(&save->cs, get_cs(), 0, -1U, code_seg_attr);
+       vmcb_set_seg(&save->ss, get_ss(), 0, -1U, data_seg_attr);
+       vmcb_set_seg(&save->ds, get_ds(), 0, -1U, data_seg_attr);
+       vmcb_set_seg(&save->gdtr, 0, get_gdt().address, get_gdt().size, 0);
+       vmcb_set_seg(&save->idtr, 0, get_idt().address, get_idt().size, 0);
+
+       ctrl->asid = 1;
+       save->cpl = 0;
+       save->efer = rdmsr(MSR_EFER);
+       asm volatile ("mov %%cr4, %0" : "=r"(save->cr4) : : "memory");
+       asm volatile ("mov %%cr3, %0" : "=r"(save->cr3) : : "memory");
+       asm volatile ("mov %%cr0, %0" : "=r"(save->cr0) : : "memory");
+       asm volatile ("mov %%dr7, %0" : "=r"(save->dr7) : : "memory");
+       asm volatile ("mov %%dr6, %0" : "=r"(save->dr6) : : "memory");
+       asm volatile ("mov %%cr2, %0" : "=r"(save->cr2) : : "memory");
+       save->g_pat = rdmsr(MSR_IA32_CR_PAT);
+       save->dbgctl = rdmsr(MSR_IA32_DEBUGCTLMSR);
+       ctrl->intercept = (1ULL << INTERCEPT_VMRUN) |
+                               (1ULL << INTERCEPT_VMMCALL);
+
+       vmcb->save.rip = (u64)guest_rip;
+       vmcb->save.rsp = (u64)guest_rsp;
+       guest_regs.rdi = (u64)svm;
+}
+
+/*
+ * save/restore 64-bit general registers except rax, rip, rsp
+ * which are directly handed through the VMCB guest processor state
+ */
+#define SAVE_GPR_C                             \
+       "xchg %%rbx, guest_regs+0x20\n\t"       \
+       "xchg %%rcx, guest_regs+0x10\n\t"       \
+       "xchg %%rdx, guest_regs+0x18\n\t"       \
+       "xchg %%rbp, guest_regs+0x30\n\t"       \
+       "xchg %%rsi, guest_regs+0x38\n\t"       \
+       "xchg %%rdi, guest_regs+0x40\n\t"       \
+       "xchg %%r8,  guest_regs+0x48\n\t"       \
+       "xchg %%r9,  guest_regs+0x50\n\t"       \
+       "xchg %%r10, guest_regs+0x58\n\t"       \
+       "xchg %%r11, guest_regs+0x60\n\t"       \
+       "xchg %%r12, guest_regs+0x68\n\t"       \
+       "xchg %%r13, guest_regs+0x70\n\t"       \
+       "xchg %%r14, guest_regs+0x78\n\t"       \
+       "xchg %%r15, guest_regs+0x80\n\t"
+
+#define LOAD_GPR_C      SAVE_GPR_C
+
+/*
+ * selftests do not use interrupts so we dropped clgi/sti/cli/stgi
+ * for now. registers involved in LOAD/SAVE_GPR_C are eventually
+ * unmodified so they do not need to be in the clobber list.
+ */
+void run_guest(struct vmcb *vmcb, uint64_t vmcb_gpa)
+{
+       asm volatile (
+               "vmload\n\t"
+               "mov rflags, %%r15\n\t" // rflags
+               "mov %%r15, 0x170(%[vmcb])\n\t"
+               "mov guest_regs, %%r15\n\t"     // rax
+               "mov %%r15, 0x1f8(%[vmcb])\n\t"
+               LOAD_GPR_C
+               "vmrun\n\t"
+               SAVE_GPR_C
+               "mov 0x170(%[vmcb]), %%r15\n\t" // rflags
+               "mov %%r15, rflags\n\t"
+               "mov 0x1f8(%[vmcb]), %%r15\n\t" // rax
+               "mov %%r15, guest_regs\n\t"
+               "vmsave\n\t"
+               : : [vmcb] "r" (vmcb), [vmcb_gpa] "a" (vmcb_gpa)
+               : "r15", "memory");
+}
+
+void nested_svm_check_supported(void)
+{
+       struct kvm_cpuid_entry2 *entry =
+               kvm_get_supported_cpuid_entry(0x80000001);
+
+       if (!(entry->ecx & CPUID_SVM)) {
+               fprintf(stderr, "nested SVM not enabled, skipping test\n");
+               exit(KSFT_SKIP);
+       }
+}
+
diff --git a/tools/testing/selftests/kvm/lib/x86_64/vmx.c b/tools/testing/selftests/kvm/lib/x86_64/vmx.c

index 85064ba..7aaa99c 100644 (file)
--- a/tools/testing/selftests/kvm/lib/x86_64/vmx.c
+++ b/tools/testing/selftests/kvm/lib/x86_64/vmx.c
@@ -288,9 +288,9 @@ static inline void init_vmcs_host_state(void)
         vmwrite(HOST_FS_BASE, rdmsr(MSR_FS_BASE));
         vmwrite(HOST_GS_BASE, rdmsr(MSR_GS_BASE));
         vmwrite(HOST_TR_BASE,
-               get_desc64_base((struct desc64 *)(get_gdt_base() + get_tr())));
-       vmwrite(HOST_GDTR_BASE, get_gdt_base());
-       vmwrite(HOST_IDTR_BASE, get_idt_base());
+               get_desc64_base((struct desc64 *)(get_gdt().address + get_tr())));
+       vmwrite(HOST_GDTR_BASE, get_gdt().address);
+       vmwrite(HOST_IDTR_BASE, get_idt().address);
         vmwrite(HOST_IA32_SYSENTER_ESP, rdmsr(MSR_IA32_SYSENTER_ESP));
         vmwrite(HOST_IA32_SYSENTER_EIP, rdmsr(MSR_IA32_SYSENTER_EIP));
  }
diff --git a/tools/testing/selftests/kvm/x86_64/svm_vmcall_test.c b/tools/testing/selftests/kvm/x86_64/svm_vmcall_test.c

new file mode 100644 (file)

index 0000000..e280f68
--- /dev/null
+++ b/tools/testing/selftests/kvm/x86_64/svm_vmcall_test.c
@@ -0,0 +1,79 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * svm_vmcall_test
+ *
+ * Copyright (C) 2020, Red Hat, Inc.
+ *
+ * Nested SVM testing: VMCALL
+ */
+
+#include "test_util.h"
+#include "kvm_util.h"
+#include "processor.h"
+#include "svm_util.h"
+
+#define VCPU_ID                5
+
+static struct kvm_vm *vm;
+
+static void l2_guest_code(struct svm_test_data *svm)
+{
+       __asm__ __volatile__("vmcall");
+}
+
+static void l1_guest_code(struct svm_test_data *svm)
+{
+       #define L2_GUEST_STACK_SIZE 64
+       unsigned long l2_guest_stack[L2_GUEST_STACK_SIZE];
+       struct vmcb *vmcb = svm->vmcb;
+
+       /* Prepare for L2 execution. */
+       generic_svm_setup(svm, l2_guest_code,
+                         &l2_guest_stack[L2_GUEST_STACK_SIZE]);
+
+       run_guest(vmcb, svm->vmcb_gpa);
+
+       GUEST_ASSERT(vmcb->control.exit_code == SVM_EXIT_VMMCALL);
+       GUEST_DONE();
+}
+
+int main(int argc, char *argv[])
+{
+       vm_vaddr_t svm_gva;
+
+       nested_svm_check_supported();
+
+       vm = vm_create_default(VCPU_ID, 0, (void *) l1_guest_code);
+       vcpu_set_cpuid(vm, VCPU_ID, kvm_get_supported_cpuid());
+
+       vcpu_alloc_svm(vm, &svm_gva);
+       vcpu_args_set(vm, VCPU_ID, 1, svm_gva);
+
+       for (;;) {
+               volatile struct kvm_run *run = vcpu_state(vm, VCPU_ID);
+               struct ucall uc;
+
+               vcpu_run(vm, VCPU_ID);
+               TEST_ASSERT(run->exit_reason == KVM_EXIT_IO,
+                           "Got exit_reason other than KVM_EXIT_IO: %u (%s)\n",
+                           run->exit_reason,
+                           exit_reason_str(run->exit_reason));
+
+               switch (get_ucall(vm, VCPU_ID, &uc)) {
+               case UCALL_ABORT:
+                       TEST_ASSERT(false, "%s",
+                                   (const char *)uc.args[0]);
+                       /* NOT REACHED */
+               case UCALL_SYNC:
+                       break;
+               case UCALL_DONE:
+                       goto done;
+               default:
+                       TEST_ASSERT(false,
+                                   "Unknown ucall 0x%x.", uc.cmd);
+               }
+       }
+done:
+       kvm_vm_free(vm);
+       return 0;
+}
diff --git a/tools/testing/selftests/wireguard/netns.sh b/tools/testing/selftests/wireguard/netns.sh

index f5ab1cd..138d46b 100755 (executable)
--- a/tools/testing/selftests/wireguard/netns.sh
+++ b/tools/testing/selftests/wireguard/netns.sh
@@ -24,6 +24,7 @@
  set -e
  
  exec 3>&1
+export LANG=C
  export WG_HIDE_KEYS=never
  netns0="wg-test-$$-0"
  netns1="wg-test-$$-1"
@@ -297,7 +298,17 @@ ip1 -4 rule add table main suppress_prefixlength 0
  n1 ping -W 1 -c 100 -f 192.168.99.7
  n1 ping -W 1 -c 100 -f abab::1111
  
+# Have ns2 NAT into wg0 packets from ns0, but return an icmp error along the right route.
+n2 iptables -t nat -A POSTROUTING -s 10.0.0.0/24 -d 192.168.241.0/24 -j SNAT --to 192.168.241.2
+n0 iptables -t filter -A INPUT \! -s 10.0.0.0/24 -i vethrs -j DROP # Manual rpfilter just to be explicit.
+n2 bash -c 'printf 1 > /proc/sys/net/ipv4/ip_forward'
+ip0 -4 route add 192.168.241.1 via 10.0.0.100
+n2 wg set wg0 peer "$pub1" remove
+[[ $(! n0 ping -W 1 -c 1 192.168.241.1 || false) == *"From 10.0.0.100 icmp_seq=1 Destination Host Unreachable"* ]]
+
  n0 iptables -t nat -F
+n0 iptables -t filter -F
+n2 iptables -t nat -F
  ip0 link del vethrc
  ip0 link del vethrs
  ip1 link del wg0
diff --git a/virt/kvm/arm/vgic/vgic-mmio.c b/virt/kvm/arm/vgic/vgic-mmio.c

index d656ebd..97fb2a4 100644 (file)
--- a/virt/kvm/arm/vgic/vgic-mmio.c
+++ b/virt/kvm/arm/vgic/vgic-mmio.c
@@ -179,18 +179,6 @@ unsigned long vgic_mmio_read_pending(struct kvm_vcpu *vcpu,
         return value;
  }
  
-/*
- * This function will return the VCPU that performed the MMIO access and
- * trapped from within the VM, and will return NULL if this is a userspace
- * access.
- *
- * We can disable preemption locally around accessing the per-CPU variable,
- * and use the resolved vcpu pointer after enabling preemption again, because
- * even if the current thread is migrated to another CPU, reading the per-CPU
- * value later will give us the same value as we update the per-CPU variable
- * in the preempt notifier handlers.
- */
-
  /* Must be called with irq->irq_lock held */
  static void vgic_hw_irq_spending(struct kvm_vcpu *vcpu, struct vgic_irq *irq,
                                  bool is_uaccess)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c

index 67ae2d5..70f03ce 100644 (file)
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -4409,12 +4409,22 @@ static void kvm_sched_out(struct preempt_notifier *pn,
  
  /**
   * kvm_get_running_vcpu - get the vcpu running on the current CPU.
- * Thanks to preempt notifiers, this can also be called from
- * preemptible context.
+ *
+ * We can disable preemption locally around accessing the per-CPU variable,
+ * and use the resolved vcpu pointer after enabling preemption again,
+ * because even if the current thread is migrated to another CPU, reading
+ * the per-CPU value later will give us the same value as we update the
+ * per-CPU variable in the preempt notifier handlers.
   */
  struct kvm_vcpu *kvm_get_running_vcpu(void)
  {
-        return __this_cpu_read(kvm_running_vcpu);
+       struct kvm_vcpu *vcpu;
+
+       preempt_disable();
+       vcpu = __this_cpu_read(kvm_running_vcpu);
+       preempt_enable();
+
+       return vcpu;
  }
  
  /**
author	Dave Airlie <airlied@redhat.com>
	Thu, 20 Feb 2020 01:01:28 +0000 (11:01 +1000)
committer	Dave Airlie <airlied@redhat.com>
	Thu, 20 Feb 2020 01:01:33 +0000 (11:01 +1000)