Commit Graph

996876 Commits

Author SHA1 Message Date
Sean Christopherson
e0c378684b KVM: x86/mmu: Retry page faults that hit an invalid memslot
Retry page faults (re-enter the guest) that hit an invalid memslot
instead of treating the memslot as not existing, i.e. handling the
page fault as an MMIO access.  When deleting a memslot, SPTEs aren't
zapped and the TLBs aren't flushed until after the memslot has been
marked invalid.

Handling the invalid slot as MMIO means there's a small window where a
page fault could replace a valid SPTE with an MMIO SPTE.  The legacy
MMU handles such a scenario cleanly, but the TDP MMU assumes such
behavior is impossible (see the BUG() in __handle_changed_spte()).
There's really no good reason why the legacy MMU should allow such a
scenario, and closing this hole allows for additional cleanups.

Fixes: 2f2fad0897 ("kvm: x86/mmu: Add functions to handle changed TDP SPTEs")
Cc: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210225204749.1512652-6-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-03-15 04:43:37 -04:00
Sean Christopherson
44aaa0150b KVM: x86/mmu: Disable MMIO caching if MMIO value collides with L1TF
Disable MMIO caching if the MMIO value collides with the L1TF mitigation
that usurps high PFN bits.  In practice this should never happen as only
CPUs with SME support can generate such a collision (because the MMIO
value can theoretically get adjusted into legal memory), and no CPUs
exist that support SME and are susceptible to L1TF.  But, closing the
hole is trivial.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210225204749.1512652-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-03-15 04:43:36 -04:00
Sean Christopherson
ec89e64386 KVM: x86/mmu: Bail from fast_page_fault() if SPTE is not shadow-present
Bail from fast_page_fault() if the SPTE is not a shadow-present SPTE.
Functionally, this is not strictly necessary as the !is_access_allowed()
check will eventually reject the fast path, but an early check on
shadow-present skips unnecessary checks and will allow a future patch to
tweak the A/D status auditing to warn if KVM attempts to query A/D bits
without first ensuring the SPTE is a shadow-present SPTE.

Note, is_shadow_present_pte() is quite expensive at this time, i.e. this
might be a net negative in the short term.  A future patch will optimize
is_shadow_present_pte() to a single AND operation and remedy the issue.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210225204749.1512652-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-03-15 04:43:35 -04:00
Sean Christopherson
64bb2769d7 KVM: x86/mmu: Check for shadow-present SPTE before querying A/D status
When updating accessed and dirty bits, check that the new SPTE is present
before attempting to query its A/D bits.  Failure to confirm the SPTE is
present can theoretically cause a false negative, e.g. if a MMIO SPTE
replaces a "real" SPTE and somehow the PFNs magically match.

Realistically, this is all but guaranteed to be a benign bug.  Fix it up
primarily so that a future patch can tweak the MMU_WARN_ON checking A/D
status to fire if the SPTE is not-present.

Fixes: f8e144971c ("kvm: x86/mmu: Add access tracking for tdp_mmu")
Cc: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210225204749.1512652-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-03-15 04:43:34 -04:00
Sean Christopherson
2032195713 KVM: x86/mmu: Add convenience wrapper for acting on single hva in TDP MMU
Add a TDP MMU helper to handle a single HVA hook, the name is a nice
reminder that the flow in question is operating on a single HVA.

No functional change intended.

Cc: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210226010329.1766033-6-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-03-15 04:43:33 -04:00
Sean Christopherson
c1b91493ed KVM: x86/mmu: Add typedefs for rmap/iter handlers
Add typedefs for the MMU handlers that are invoked when walking the MMU
SPTEs (rmaps in legacy MMU) to act on a host virtual address range.

No functional change intended.

Cc: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210226010329.1766033-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-03-15 04:43:32 -04:00
Sean Christopherson
e12b785e52 KVM: x86/mmu: Use 'end' param in TDP MMU's test_age_gfn()
Use the @end param when aging a GFN instead of hardcoding the walk to a
single GFN.  Unlike tdp_set_spte(), which simply cannot work with more
than one GFN, aging multiple GFNs would not break, though admittedly it
would be weird.  Be nice to the casual reader and don't make them puzzle
out why the end GFN is unused.

No functional change intended.

Cc: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210226010329.1766033-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-03-15 04:43:31 -04:00
Sean Christopherson
74fe0f5474 KVM: x86/mmu: WARN if TDP MMU's set_tdp_spte() sees multiple GFNs
WARN if set_tdp_spte() is invoked with multipel GFNs.  It is specifically
a callback to handle a single host PTE being changed.  Consuming the
@end parameter also eliminates the confusing 'unused' parameter.

Cc: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210226010329.1766033-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-03-15 04:43:30 -04:00
Sean Christopherson
f055ab634c KVM: x86/mmu: Remove spurious TLB flush from TDP MMU's change_pte() hook
Remove an unnecessary remote TLB flush from set_tdp_spte(), the TDP MMu's
hook for handling change_pte() invocations from the MMU notifier.  If
the new host PTE is writable, the flush is completely redundant as there
are no futher changes to the SPTE before the post-loop flush.  If the
host PTE is read-only, then the primary MMU is responsible for ensuring
that the contents of the old and new pages are identical, thus it's safe
to let the guest continue reading both the old and new pages.  KVM must
only ensure the old page cannot be referenced after returning from its
callback; this is handled by the post-loop flush.

Fixes: 1d8dd6b3f1 ("kvm: x86/mmu: Support changed pte notifier in tdp MMU")
Cc: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210226010329.1766033-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-03-15 04:43:29 -04:00
Maxim Levitsky
422e2e1706 KVM: x86: mmu: initialize fault.async_page_fault in walk_addr_generic
This field was left uninitialized by a mistake.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210225154135.405125-3-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-03-15 04:43:28 -04:00
Maxim Levitsky
b97f074583 KVM: x86: determine if an exception has an error code only when injecting it.
A page fault can be queued while vCPU is in real paged mode on AMD, and
AMD manual asks the user to always intercept it
(otherwise result is undefined).
The resulting VM exit, does have an error code.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210225154135.405125-2-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-03-15 04:43:27 -04:00
Cathy Avery
8173396e94 KVM: nSVM: Optimize vmcb12 to vmcb02 save area copies
Use the vmcb12 control clean field to determine which vmcb12.save
registers were marked dirty in order to minimize register copies
when switching from L1 to L2. Those vmcb12 registers marked as dirty need
to be copied to L0's vmcb02 as they will be used to update the vmcb
state cache for the L2 VMRUN.  In the case where we have a different
vmcb12 from the last L2 VMRUN all vmcb12.save registers must be
copied over to L2.save.

Tested:
kvm-unit-tests
kvm selftests
Fedora L1 L2

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Cathy Avery <cavery@redhat.com>
Message-Id: <20210301200844.2000-1-cavery@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-03-15 04:43:26 -04:00
Babu Moger
d00b99c514 KVM: SVM: Add support for Virtual SPEC_CTRL
Newer AMD processors have a feature to virtualize the use of the
SPEC_CTRL MSR. Presence of this feature is indicated via CPUID
function 0x8000000A_EDX[20]: GuestSpecCtrl. Hypervisors are not
required to enable this feature since it is automatically enabled on
processors that support it.

A hypervisor may wish to impose speculation controls on guest
execution or a guest may want to impose its own speculation controls.
Therefore, the processor implements both host and guest
versions of SPEC_CTRL.

When in host mode, the host SPEC_CTRL value is in effect and writes
update only the host version of SPEC_CTRL. On a VMRUN, the processor
loads the guest version of SPEC_CTRL from the VMCB. When the guest
writes SPEC_CTRL, only the guest version is updated. On a VMEXIT,
the guest version is saved into the VMCB and the processor returns
to only using the host SPEC_CTRL for speculation control. The guest
SPEC_CTRL is located at offset 0x2E0 in the VMCB.

The effective SPEC_CTRL setting is the guest SPEC_CTRL setting or'ed
with the hypervisor SPEC_CTRL setting. This allows the hypervisor to
ensure a minimum SPEC_CTRL if desired.

This support also fixes an issue where a guest may sometimes see an
inconsistent value for the SPEC_CTRL MSR on processors that support
this feature. With the current SPEC_CTRL support, the first write to
SPEC_CTRL is intercepted and the virtualized version of the SPEC_CTRL
MSR is not updated. When the guest reads back the SPEC_CTRL MSR, it
will be 0x0, instead of the actual expected value. There isn’t a
security concern here, because the host SPEC_CTRL value is or’ed with
the Guest SPEC_CTRL value to generate the effective SPEC_CTRL value.
KVM writes with the guest's virtualized SPEC_CTRL value to SPEC_CTRL
MSR just before the VMRUN, so it will always have the actual value
even though it doesn’t appear that way in the guest. The guest will
only see the proper value for the SPEC_CTRL register if the guest was
to write to the SPEC_CTRL register again. With Virtual SPEC_CTRL
support, the save area spec_ctrl is properly saved and restored.
So, the guest will always see the proper value when it is read back.

Signed-off-by: Babu Moger <babu.moger@amd.com>
Message-Id: <161188100955.28787.11816849358413330720.stgit@bmoger-ubuntu>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-03-15 04:43:25 -04:00
Babu Moger
f333374e10 x86/cpufeatures: Add the Virtual SPEC_CTRL feature
Newer AMD processors have a feature to virtualize the use of the
SPEC_CTRL MSR. Presence of this feature is indicated via CPUID
function 0x8000000A_EDX[20]: GuestSpecCtrl. When present, the
SPEC_CTRL MSR is automatically virtualized.

Signed-off-by: Babu Moger <babu.moger@amd.com>
Acked-by: Borislav Petkov <bp@suse.de>
Message-Id: <161188100272.28787.4097272856384825024.stgit@bmoger-ubuntu>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-03-15 04:43:25 -04:00
Maxim Levitsky
cc3ed80ae6 KVM: nSVM: always use vmcb01 to for vmsave/vmload of guest state
This allows to avoid copying of these fields between vmcb01
and vmcb02 on nested guest entry/exit.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-03-15 04:43:24 -04:00
Paolo Bonzini
fb0c4a4fee KVM: SVM: move VMLOAD/VMSAVE to C code
Thanks to the new macros that handle exception handling for SVM
instructions, it is easier to just do the VMLOAD/VMSAVE in C.
This is safe, as shown by the fact that the host reload is
already done outside the assembly source.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-03-15 04:43:23 -04:00
Sean Christopherson
c8781feaf1 KVM: SVM: Skip intercepted PAUSE instructions after emulation
Skip PAUSE after interception to avoid unnecessarily re-executing the
instruction in the guest, e.g. after regaining control post-yield.
This is a benign bug as KVM disables PAUSE interception if filtering is
off, including the case where pause_filter_count is set to zero.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210205005750.3841462-10-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-03-15 04:43:22 -04:00
Sean Christopherson
32c23c7d52 KVM: SVM: Don't manually emulate RDPMC if nrips=0
Remove bizarre code that causes KVM to run RDPMC through the emulator
when nrips is disabled.  Accelerated emulation of RDPMC doesn't rely on
any additional data from the VMCB, and SVM has generic handling for
updating RIP to skip instructions when nrips is disabled.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210205005750.3841462-9-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-03-15 04:43:21 -04:00
Sean Christopherson
c483c45471 KVM: x86: Move RDPMC emulation to common code
Move the entirety of the accelerated RDPMC emulation to x86.c, and assign
the common handler directly to the exit handler array for VMX.  SVM has
bizarre nrips behavior that prevents it from directly invoking the common
handler.  The nrips goofiness will be addressed in a future patch.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210205005750.3841462-8-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-03-15 04:43:20 -04:00
Sean Christopherson
5ff3a351f6 KVM: x86: Move trivial instruction-based exit handlers to common code
Move the trivial exit handlers, e.g. for instructions that KVM
"emulates" as nops, to common x86 code.  Assign the common handlers
directly to the exit handler arrays and drop the vendor trampolines.

Opportunistically use pr_warn_once() where appropriate.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210205005750.3841462-7-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-03-15 04:43:19 -04:00
Sean Christopherson
92f9895c14 KVM: x86: Move XSETBV emulation to common code
Move the entirety of XSETBV emulation to x86.c, and assign the
function directly to both VMX's and SVM's exit handlers, i.e. drop the
unnecessary trampolines.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210205005750.3841462-6-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-03-15 04:43:18 -04:00
Sean Christopherson
2ac636a6ea KVM: nSVM: Add VMLOAD/VMSAVE helper to deduplicate code
Add another helper layer for VMLOAD+VMSAVE, the code is identical except
for the one line that determines which VMCB is the source and which is
the destination.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210205005750.3841462-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-03-15 04:43:17 -04:00
Sean Christopherson
3a87c7e0d1 KVM: nSVM: Add helper to synthesize nested VM-Exit without collateral
Add a helper to consolidate boilerplate for nested VM-Exits that don't
provide any data in exit_info_*.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210302174515.2812275-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-03-15 04:43:16 -04:00
Sean Christopherson
cb6a32c2b8 KVM: x86: Handle triple fault in L2 without killing L1
Synthesize a nested VM-Exit if L2 triggers an emulated triple fault
instead of exiting to userspace, which likely will kill L1.  Any flow
that does KVM_REQ_TRIPLE_FAULT is suspect, but the most common scenario
for L2 killing L1 is if L0 (KVM) intercepts a contributory exception that
is _not_intercepted by L1.  E.g. if KVM is intercepting #GPs for the
VMware backdoor, a #GP that occurs in L2 while vectoring an injected #DF
will cause KVM to emulate triple fault.

Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Jim Mattson <jmattson@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210302174515.2812275-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-03-15 04:43:15 -04:00
Paolo Bonzini
6312975417 KVM: SVM: Pass struct kvm_vcpu to exit handlers (and many, many other places)
Refactor the svm_exit_handlers API to pass @vcpu instead of @svm to
allow directly invoking common x86 exit handlers (in a future patch).
Opportunistically convert an absurd number of instances of 'svm->vcpu'
to direct uses of 'vcpu' to avoid pointless casting.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210205005750.3841462-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-03-15 04:43:09 -04:00
Paolo Bonzini
2a32a77cef KVM: SVM: merge update_cr0_intercept into svm_set_cr0
The logic of update_cr0_intercept is pointlessly complicated.
All svm_set_cr0 is compute the effective cr0 and compare it with
the guest value.

Inlining the function and simplifying the condition
clarifies what it is doing.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-03-15 04:42:38 -04:00
Sean Christopherson
11f0cbf0c6 KVM: nSVM: Trace VM-Enter consistency check failures
Use trace_kvm_nested_vmenter_failed() and its macro magic to trace
consistency check failures on nested VMRUN.  Tracing such failures by
running the buggy VMM as a KVM guest is often the only way to get a
precise explanation of why VMRUN failed.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210204000117.3303214-13-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-03-15 04:42:37 -04:00
Sean Christopherson
648fc8ae37 KVM: x86: Move nVMX's consistency check macro to common code
Move KVM's CC() macro to x86.h so that it can be reused by nSVM.
Debugging VM-Enter is as painful on SVM as it is on VMX.

Rename the more visible macro to KVM_NESTED_VMENTER_CONSISTENCY_CHECK
to avoid any collisions with the uber-concise "CC".

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210204000117.3303214-12-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-03-15 04:42:36 -04:00
Krish Sadhukhan
6906e06db9 KVM: nSVM: Add missing checks for reserved bits to svm_set_nested_state()
The path for SVM_SET_NESTED_STATE needs to have the same checks for the CPU
registers, as we have in the VMRUN path for a nested guest. This patch adds
those missing checks to svm_set_nested_state().

Suggested-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Message-Id: <20201006190654.32305-3-krish.sadhukhan@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-03-15 04:42:35 -04:00
Paolo Bonzini
c08f390a75 KVM: nSVM: only copy L1 non-VMLOAD/VMSAVE data in svm_set_nested_state()
The VMLOAD/VMSAVE data is not taken from userspace, since it will
not be restored on VMEXIT (it will be copied from VMCB02 to VMCB01).
For clarity, replace the wholesale copy of the VMCB save area
with a copy of that state only.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-03-15 04:42:34 -04:00
Paolo Bonzini
4bb170a543 KVM: nSVM: do not mark all VMCB02 fields dirty on nested vmexit
Since L1 and L2 now use different VMCBs, most of the fields remain the
same in VMCB02 from one L2 run to the next.  Since KVM itself is not
looking at VMCB12's clean field, for now not much can be optimized.
However, in the future we could avoid more copies if the VMCB12's SEG
and DT sections are clean.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-03-15 04:42:33 -04:00
Paolo Bonzini
7ca62d1322 KVM: nSVM: do not mark all VMCB01 fields dirty on nested vmexit
Since L1 and L2 now use different VMCBs, most of the fields remain
the same from one L1 run to the next.  svm_set_cr0 and other functions
called by nested_svm_vmexit already take care of clearing the
corresponding clean bits; only the TSC offset is special.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-03-15 04:42:32 -04:00
Paolo Bonzini
7c3ecfcd31 KVM: nSVM: do not copy vmcb01->control blindly to vmcb02->control
Most fields were going to be overwritten by vmcb12 control fields, or
do not matter at all because they are filled by the processor on vmexit.
Therefore, we need not copy them from vmcb01 to vmcb02 on vmentry.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-03-15 04:42:32 -04:00
Paolo Bonzini
9e8f0fbfff KVM: nSVM: rename functions and variables according to vmcbXY nomenclature
Now that SVM is using a separate vmcb01 and vmcb02 (and also uses the vmcb12
naming) we can give clearer names to functions that write to and read
from those VMCBs.  Likewise, variables and parameters can be renamed
from nested_vmcb to vmcb12.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-03-15 04:42:31 -04:00
Cathy Avery
193015adf4 KVM: nSVM: Track the ASID generation of the vmcb vmrun through the vmcb
This patch moves the asid_generation from the vcpu to the vmcb
in order to track the ASID generation that was active the last
time the vmcb was run. If sd->asid_generation changes between
two runs, the old ASID is invalid and must be changed.

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Cathy Avery <cavery@redhat.com>
Message-Id: <20210112164313.4204-3-cavery@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-03-15 04:42:30 -04:00
Cathy Avery
af18fa775d KVM: nSVM: Track the physical cpu of the vmcb vmrun through the vmcb
This patch moves the physical cpu tracking from the vcpu
to the vmcb in svm_switch_vmcb. If either vmcb01 or vmcb02
change physical cpus from one vmrun to the next the vmcb's
previous cpu is preserved for comparison with the current
cpu and the vmcb is marked dirty if different. This prevents
the processor from using old cached data for a vmcb that may
have been updated on a prior run on a different processor.

It also moves the physical cpu check from svm_vcpu_load
to pre_svm_run as the check only needs to be done at run.

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Cathy Avery <cavery@redhat.com>
Message-Id: <20210112164313.4204-2-cavery@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-03-15 04:42:29 -04:00
Cathy Avery
4995a3685f KVM: SVM: Use a separate vmcb for the nested L2 guest
svm->vmcb will now point to a separate vmcb for L1 (not nested) or L2
(nested).

The main advantages are removing get_host_vmcb and hsave, in favor of
concepts that are shared with VMX.

We don't need anymore to stash the L1 registers in hsave while L2
runs, but we need to copy the VMLOAD/VMSAVE registers from VMCB01 to
VMCB02 and back.  This more or less has the same cost, but code-wise
nested_svm_vmloadsave can be reused.

This patch omits several optimizations that are possible:

- for simplicity there is some wholesale copying of vmcb.control areas
which can go away.

- we should be able to better use the VMCB01 and VMCB02 clean bits.

- another possibility is to always use VMCB01 for VMLOAD and VMSAVE,
thus avoiding the copy of VMLOAD/VMSAVE registers from VMCB01 to
VMCB02 and back.

Tested:
kvm-unit-tests
kvm self tests
Loaded fedora nested guest on fedora

Signed-off-by: Cathy Avery <cavery@redhat.com>
Message-Id: <20201011184818.3609-3-cavery@redhat.com>
[Fix conflicts; keep VMCB02 G_PAT up to date whenever guest writes the
 PAT MSR; do not copy CR4 over from VMCB01 as it is not needed anymore; add
 a few more comments. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-03-15 04:42:28 -04:00
Sean Christopherson
a3322d5cd8 KVM: nSVM: Set the shadow root level to the TDP level for nested NPT
Override the shadow root level in the MMU context when configuring
NPT for shadowing nested NPT.  The level is always tied to the TDP level
of the host, not whatever level the guest happens to be using.

Fixes: 096586fda5 ("KVM: nSVM: Correctly set the shadow NPT root level in its MMU role")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210305011101.3597423-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-03-15 04:42:27 -04:00
Sean Christopherson
6d1b867d04 KVM: SVM: Don't strip the C-bit from CR2 on #PF interception
Don't strip the C-bit from the faulting address on an intercepted #PF,
the address is a virtual address, not a physical address.

Fixes: 0ede79e132 ("KVM: SVM: Clear C-bit from the page fault address")
Cc: stable@vger.kernel.org
Cc: Brijesh Singh <brijesh.singh@amd.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210305011101.3597423-13-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-03-15 04:42:26 -04:00
Sean Christopherson
73ad160693 KVM: x86/mmu: WARN on NULL pae_root or lm_root, or bad shadow root level
WARN if KVM is about to dereference a NULL pae_root or lm_root when
loading an MMU, and convert the BUG() on a bad shadow_root_level into a
WARN (now that errors are handled cleanly).  With nested NPT, botching
the level and sending KVM down the wrong path is all too easy, and the
on-demand allocation of pae_root and lm_root means bugs crash the host.
Obviously, KVM could unconditionally allocate the roots, but that's
arguably a worse failure mode as it would potentially corrupt the guest
instead of crashing it.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210305011101.3597423-18-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-03-15 04:42:25 -04:00
Sean Christopherson
a91f387b4b KVM: x86/mmu: Sync roots after MMU load iff load as successful
For clarity, explicitly skip syncing roots if the MMU load failed
instead of relying on the !VALID_PAGE check in kvm_mmu_sync_roots().

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210305011101.3597423-17-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-03-15 04:42:24 -04:00
Sean Christopherson
61a1773e2e KVM: x86/mmu: Unexport MMU load/unload functions
Unexport the MMU load and unload helpers now that they are no longer
used (incorrectly) in vendor code.

Opportunistically move the kvm_mmu_sync_roots() declaration into mmu.h,
it should not be exposed to vendor code.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210305011101.3597423-16-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-03-15 04:42:23 -04:00
Sean Christopherson
f66c53b3b9 KVM: x86: Defer the MMU unload to the normal path on an global INVPCID
Defer unloading the MMU after a INVPCID until the instruction emulation
has completed, i.e. until after RIP has been updated.

On VMX, this is a benign bug as VMX doesn't touch the MMU when skipping
an emulated instruction.  However, on SVM, if nrip is disabled, the
emulator is used to skip an instruction, which would lead to fireworks
if the emulator were invoked without a valid MMU.

Fixes: eb4b248e15 ("kvm: vmx: Support INVPCID in shadow paging mode")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210305011101.3597423-15-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-03-15 04:42:22 -04:00
Sean Christopherson
c805f5d558 KVM: nVMX: Defer the MMU reload to the normal path on an EPTP switch
Defer reloading the MMU after a EPTP successful EPTP switch.  The VMFUNC
instruction itself is executed in the previous EPTP context, any side
effects, e.g. updating RIP, should occur in the old context.  Practically
speaking, this bug is benign as VMX doesn't touch the MMU when skipping
an emulated instruction, nor does queuing a single-step #DB.  No other
post-switch side effects exist.

Fixes: 41ab937274 ("KVM: nVMX: Emulate EPTP switching for the L1 hypervisor")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210305011101.3597423-14-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-03-15 04:42:21 -04:00
Sean Christopherson
17e368d94a KVM: x86/mmu: Set the C-bit in the PDPTRs and LM pseudo-PDPTRs
Set the C-bit in SPTEs that are set outside of the normal MMU flows,
specifically the PDPDTRs and the handful of special cased "LM root"
entries, all of which are shadow paging only.

Note, the direct-mapped-root PDPTR handling is needed for the scenario
where paging is disabled in the guest, in which case KVM uses a direct
mapped MMU even though TDP is disabled.

Fixes: d0ec49d4de ("kvm/x86/svm: Support Secure Memory Encryption within KVM")
Cc: stable@vger.kernel.org
Cc: Brijesh Singh <brijesh.singh@amd.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210305011101.3597423-11-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-03-15 04:42:21 -04:00
Sean Christopherson
e49e0b7bf3 KVM: x86/mmu: Fix and unconditionally enable WARNs to detect PAE leaks
Exempt NULL PAE roots from the check to detect leaks, since
kvm_mmu_free_roots() doesn't set them back to INVALID_PAGE.  Stop hiding
the WARNs to detect PAE root leaks behind MMU_WARN_ON, the hidden WARNs
obviously didn't do their job given the hilarious number of bugs that
could lead to PAE roots being leaked, not to mention the above false
positive.

Opportunistically delete a warning on root_hpa being valid, there's
nothing special about 4/5-level shadow pages that warrants a WARN.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210305011101.3597423-9-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-03-15 04:42:20 -04:00
Sean Christopherson
6e0918aec4 KVM: x86/mmu: Check PDPTRs before allocating PAE roots
Check the validity of the PDPTRs before allocating any of the PAE roots,
otherwise a bad PDPTR will cause KVM to leak any previously allocated
roots.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210305011101.3597423-8-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-03-15 04:42:19 -04:00
Sean Christopherson
6e6ec58485 KVM: x86/mmu: Ensure MMU pages are available when allocating roots
Hold the mmu_lock for write for the entire duration of allocating and
initializing an MMU's roots.  This ensures there are MMU pages available
and thus prevents root allocations from failing.  That in turn fixes a
bug where KVM would fail to free valid PAE roots if a one of the later
roots failed to allocate.

Add a comment to make_mmu_pages_available() to call out that the limit
is a soft limit, e.g. KVM will temporarily exceed the threshold if a
page fault allocates multiple shadow pages and there was only one page
"available".

Note, KVM _still_ leaks the PAE roots if the guest PDPTR checks fail.
This will be addressed in a future commit.

Cc: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210305011101.3597423-7-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-03-15 04:42:18 -04:00
Sean Christopherson
748e52b9b7 KVM: x86/mmu: Allocate pae_root and lm_root pages in dedicated helper
Move the on-demand allocation of the pae_root and lm_root pages, used by
nested NPT for 32-bit L1s, into a separate helper.  This will allow a
future patch to hold mmu_lock while allocating the non-special roots so
that make_mmu_pages_available() can be checked once at the start of root
allocation, and thus avoid having to deal with failure in the middle of
root allocation.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210305011101.3597423-6-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-03-15 04:42:17 -04:00
Sean Christopherson
ba0a194ffb KVM: x86/mmu: Allocate the lm_root before allocating PAE roots
Allocate lm_root before the PAE roots so that the PAE roots aren't
leaked if the memory allocation for the lm_root happens to fail.

Note, KVM can still leak PAE roots if mmu_check_root() fails on a guest's
PDPTR, or if mmu_alloc_root() fails due to MMU pages not being available.
Those issues will be fixed in future commits.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210305011101.3597423-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-03-15 04:42:16 -04:00