Introduce sptep_to_sp() to reduce the boilerplate code needed to get the
shadow page associated with a spte pointer, and to improve readability
as it's not immediately obvious that "page_header" is a KVM-specific
accessor for retrieving a shadow page.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200622202034.15093-6-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add mmu/mmu_internal.h to hold declarations and definitions that need
to be shared between various mmu/ files, but should not be used by
anything outside of the MMU.
Begin populating mmu_internal.h with declarations of the helpers used by
page_track.c.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200622202034.15093-4-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Propagate any error returned by make_mmu_pages_available() out to
userspace instead of resuming the guest if the error occurs while
handling a page fault. Now that zapping the oldest MMU pages skips
active roots, i.e. fails if and only if there are no zappable pages,
there is no chance for a false positive, i.e. no chance of returning a
spurious error to userspace.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200623193542.7554-5-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use the recently introduced kvm_mmu_zap_oldest_mmu_pages() to batch zap
MMU pages when shrinking a slab. This fixes a long standing issue where
KVM's shrinker implementation is completely ineffective due to zapping
only a single page. E.g. without batch zapping, forcing a scan via
drop_caches basically has no impact on a VM with ~2k shadow pages. With
batch zapping, the number of shadow pages can be reduced to a few
hundred pages in one or two runs of drop_caches.
Note, if the default batch size (currently 128) is problematic, e.g.
zapping 128 pages holds mmu_lock for too long, KVM can bound the batch
size by setting @batch in mmu_shrinker.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200623193542.7554-4-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Collect MMU pages for zapping in a loop when making MMU pages available,
and skip over active roots when doing so as zapping an active root can
never immediately free up a page. Batching the zapping avoids multiple
remote TLB flushes and remedies the issue where the loop would bail
early if an active root was encountered.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200623193542.7554-3-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Delete a shadow page from the invalidation list instead of throwing it
back on the list of active pages when it's a root shadow page with
active users. Invalid active root pages will be explicitly freed by
mmu_free_root_page() when the root_count hits zero, i.e. they don't need
to be put on the active list to avoid leakage.
Use sp->role.invalid to detect that a shadow page has already been
zapped, i.e. is not on a list.
WARN if an invalid page is encountered when zapping pages, as it should
now be impossible.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200623193542.7554-2-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Skip the unsync checks and the write flooding clearing for fully direct
MMUs, which are guaranteed to not have unsync'd or indirect pages (write
flooding detection only applies to indirect pages). For TDP, this
avoids unnecessary memory reads and writes, and for the write flooding
count will also avoid dirtying a cache line (unsync_child_bitmap itself
consumes a cache line, i.e. write_flooding_count is guaranteed to be in
a different cache line than parent_ptes).
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200623194027.23135-3-sean.j.christopherson@intel.com>
Reviewed-By: Jon Cargille <jcargill@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Current minimum required version of binutils is 2.23,
which supports VMCALL and VMMCALL instruction mnemonics.
Replace the byte-wise specification of VMCALL and
VMMCALL with these proper mnemonics.
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
CC: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20200623183439.5526-1-ubizjak@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
More often than not, a failed VM-entry in an x86 production
environment is induced by a defective CPU. To help identify the bad
hardware, include the id of the last logical CPU to run a vCPU in the
information provided to userspace on a KVM exit for failed VM-entry or
for KVM internal errors not associated with emulation. The presence of
this additional information is indicated by a new capability,
KVM_CAP_LAST_CPU.
Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Oliver Upton <oupton@google.com>
Reviewed-by: Peter Shier <pshier@google.com>
Message-Id: <20200603235623.245638-5-jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Callers of sev_pin_memory() treat
NULL differently:
sev_launch_secret()/svm_register_enc_region() return -ENOMEM
sev_dbg_crypt() returns -EFAULT.
Switching to ERR_PTR() preserves the error and enables cleaner reporting of
different kinds of failures.
Suggested-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
There are two problems in svn_pin_memory():
1) The return value of get_user_pages_fast() is stored in an
unsigned long, although the declared return value is of type int.
This will not cause any symptoms, but it is misleading.
Fix this by changing the type of npinned to "int".
2) The number of pages passed into get_user_pages_fast() is stored
in an unsigned long, even though get_user_pages_fast() accepts an
int. This means that it is possible to silently overflow the number
of pages.
Fix this by adding a WARN_ON_ONCE() and an early error return. The
npages variable is left as an unsigned long for convenience in
checking for overflow.
Fixes: 89c5058090 ("KVM: SVM: Add support for KVM_SEV_LAUNCH_UPDATE_DATA command")
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: Jim Mattson <jmattson@google.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: x86@kernel.org
Cc: kvm@vger.kernel.org
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Message-Id: <20200526062207.1360225-2-jhubbard@nvidia.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
MSR accesses can be one of:
(1) KVM internal access,
(2) userspace access (e.g., via KVM_SET_MSRS ioctl),
(3) guest access.
The ignore_msrs was previously handled by kvm_get_msr_common() and
kvm_set_msr_common(), which is the bottom of the msr access stack. It's
working in most cases, however it could dump unwanted warning messages to dmesg
even if kvm get/set the msrs internally when calling __kvm_set_msr() or
__kvm_get_msr() (e.g. kvm_cpuid()). Ideally we only want to trap cases (2)
or (3), but not (1) above.
To achieve this, move the ignore_msrs handling upper until the callers of
__kvm_get_msr() and __kvm_set_msr(). To identify the "msr missing" event, a
new return value (KVM_MSR_RET_INVALID==2) is used for that.
Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20200622220442.21998-2-peterx@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Drop kvm_arch_write_log_dirty() in favor of invoking .write_log_dirty()
directly from FNAME(update_accessed_dirty_bits). "kvm_arch" is usually
used for x86 functions that are invoked from generic KVM, and implies
that there are external callers, neither of which is true.
Remove the check for a non-NULL kvm_x86_ops hook as the call is wrapped
in PTTYPE_EPT and is unconditionally set by VMX.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200622215832.22090-3-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Unlike normal 'int' functions returning '0' on success, kvm_setup_async_pf()/
kvm_arch_setup_async_pf() return '1' when a job to handle page fault
asynchronously was scheduled and '0' otherwise. To avoid the confusion
change return type to 'bool'.
No functional change intended.
Suggested-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20200615121334.91300-1-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM guest code in Linux enables APF only when KVM_FEATURE_ASYNC_PF_INT
is supported, this means we will never see KVM_PV_REASON_PAGE_READY
when handling page fault vmexit in KVM.
While on it, make sure we only follow genuine page fault path when
APF reason is zero. If we happen to see something else this means
that the underlying hypervisor is misbehaving. Leave WARN_ON_ONCE()
to catch that.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM/arm fixes for 5.8, take #3
- Disable preemption on context-switching PMU EL0 state happening
on system register trap
- Don't clobber X0 when tearing down KVM via a soft reset (kexec)
Commit 07da1ffaa1 ("KVM: arm64: Remove host_cpu_context
member from vcpu structure") has, by removing the host CPU
context pointer, exposed that kvm_vcpu_pmu_restore_guest
is called in preemptible contexts:
[ 266.932442] BUG: using smp_processor_id() in preemptible [00000000] code: qemu-system-aar/779
[ 266.939721] caller is debug_smp_processor_id+0x20/0x30
[ 266.944157] CPU: 2 PID: 779 Comm: qemu-system-aar Tainted: G E 5.8.0-rc3-00015-g8d4aa58b2fe3 #1374
[ 266.954268] Hardware name: amlogic w400/w400, BIOS 2020.04 05/22/2020
[ 266.960640] Call trace:
[ 266.963064] dump_backtrace+0x0/0x1e0
[ 266.966679] show_stack+0x20/0x30
[ 266.969959] dump_stack+0xe4/0x154
[ 266.973338] check_preemption_disabled+0xf8/0x108
[ 266.977978] debug_smp_processor_id+0x20/0x30
[ 266.982307] kvm_vcpu_pmu_restore_guest+0x2c/0x68
[ 266.986949] access_pmcr+0xf8/0x128
[ 266.990399] perform_access+0x8c/0x250
[ 266.994108] kvm_handle_sys_reg+0x10c/0x2f8
[ 266.998247] handle_exit+0x78/0x200
[ 267.001697] kvm_arch_vcpu_ioctl_run+0x2ac/0xab8
Note that the bug was always there, it is only the switch to
using percpu accessors that made it obvious.
The fix is to wrap these accesses in a preempt-disabled section,
so that we sample a coherent context on trap from the guest.
Fixes: 435e53fb5e ("arm64: KVM: Enable VHE support for :G/:H perf event modifiers")
Cc:: Andrew Murray <amurray@thegoodpenguin.co.uk>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Use the "common" KVM_POSSIBLE_CR*_GUEST_BITS defines to initialize the
CR0/CR4 guest host masks instead of duplicating most of the CR4 mask and
open coding the CR0 mask. SVM doesn't utilize the masks, i.e. the masks
are effectively VMX specific even if they're not named as such. This
avoids duplicate code, better documents the guest owned CR0 bit, and
eliminates the need for a build-time assertion to keep VMX and x86
synchronized.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200703040422.31536-3-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Mark CR4.TSD as being possibly owned by the guest as that is indeed the
case on VMX. Without TSD being tagged as possibly owned by the guest, a
targeted read of CR4 to get TSD could observe a stale value. This bug
is benign in the current code base as the sole consumer of TSD is the
emulator (for RDTSC) and the emulator always "reads" the entirety of CR4
when grabbing bits.
Add a build-time assertion in to ensure VMX doesn't hand over more CR4
bits without also updating x86.
Fixes: 52ce3c21ae ("x86,kvm,vmx: Don't trap writes to CR4.TSD")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200703040422.31536-2-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Inject a #GP on MOV CR4 if CR4.LA57 is toggled in 64-bit mode, which is
illegal per Intel's SDM:
CR4.LA57
57-bit linear addresses (bit 12 of CR4) ... blah blah blah ...
This bit cannot be modified in IA-32e mode.
Note, the pseudocode for MOV CR doesn't call out the fault condition,
which is likely why the check was missed during initial development.
This is arguably an SDM bug and will hopefully be fixed in future
release of the SDM.
Fixes: fd8cb43373 ("KVM: MMU: Expose the LA57 feature to VM.")
Cc: stable@vger.kernel.org
Reported-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200703021714.5549-1-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Sparse complains on a call to get_compat_sigset, fix it. The "if"
right above explains that sigmask_arg->sigset is basically a
compat_sigset_t.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM/arm fixes for 5.8, take #2
- Make sure a vcpu becoming non-resident doesn't race against the doorbell delivery
- Only advertise pvtime if accounting is enabled
- Return the correct error code if reset fails with SVE
- Make sure that pseudo-NMI functions are annotated as __always_inline
Bit 8 would be the "global" bit, which does not quite make sense for non-leaf
page table entries. Intel ignores it; AMD ignores it in PDEs and PDPEs, but
reserves it in PML4Es.
Probably, earlier versions of the AMD manual documented it as reserved in PDPEs
as well, and that behavior made it into KVM as well as kvm-unit-tests; fix it.
Cc: stable@vger.kernel.org
Reported-by: Nadav Amit <namit@vmware.com>
Fixes: a0c0feb579 ("KVM: x86: reserve bit 8 of non-leaf PDPEs and PML4Es in 64-bit mode on AMD", 2014-09-03)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>