linux

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2026-05-14 11:11:22 -04:00

Author	SHA1	Message	Date
Jim Mattson	6fbef8615d	KVM: x86: Replace growing set of *_in_guest bools with a u64 Store each "disabled exit" boolean in a single bit rather than a byte. No functional change intended. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Jim Mattson <jmattson@google.com> Link: https://lore.kernel.org/r/20250530185239.2335185-2-jmattson@google.com Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Link: https://lore.kernel.org/r/20250626001225.744268-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-07-09 09:32:32 -07:00
Xin Li	e88cfd50b6	KVM: x86: Advertise support for LKGS Advertise support for LKGS (load into IA32_KERNEL_GS_BASE) to userspace if the instruction is supported by the underlying CPU. LKGS is introduced with FRED to completely eliminate the need to swapgs explicitly. It behaves like the MOV to GS instruction except that it loads the base address into the IA32_KERNEL_GS_BASE MSR instead of the GS segment’s descriptor cache, which is exactly what Linux kernel does to load a user level GS base. Thus there is no need to SWAPGS away from the kernel GS base. LKGS is an independent CPU feature that works correctly in a KVM guest without requiring explicit enablement. Signed-off-by: Xin Li (Intel) <xin@zytor.com> Link: https://lore.kernel.org/r/20250626173521.2301088-1-xin@zytor.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-07-09 09:32:25 -07:00
Sean Christopherson	e1ef1c57ff	KVM: VMX: Add a macro to track which DEBUGCTL bits are host-owned Add VMX_HOST_OWNED_DEBUGCTL_BITS to track which bits are host-owned, i.e. need to be preserved when running the guest, to dedup the logic without having to incur a memory load to get at kvm_x86_ops.HOST_OWNED_DEBUGCTL. No functional change intended. Suggested-by: Maxim Levitsky <mlevitsk@redhat.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Link: https://lore.kernel.org/all/aF1yni8U6XNkyfRf@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-07-09 09:30:52 -07:00
Chao Gao	05186d7a8e	KVM: SVM: Simplify MSR interception logic for IA32_XSS MSR Use svm_set_intercept_for_msr() directly to configure IA32_XSS MSR interception, ensuring consistency with other cases where MSRs are intercepted depending on guest caps and CPUIDs. No functional change intended. Signed-off-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Link: https://lore.kernel.org/r/20250612081947.94081-3-chao.gao@intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-25 09:47:28 -07:00
Chao Gao	3f06b8927a	KVM: x86: Deduplicate MSR interception enabling and disabling Extract a common function from MSR interception disabling logic and create disabling and enabling functions based on it. This removes most of the duplicated code for MSR interception disabling/enabling. No functional change intended. Signed-off-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://lore.kernel.org/r/20250612081947.94081-2-chao.gao@intel.com [sean: s/enable/set, inline the wrappers] Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-24 15:42:12 -07:00
Sean Christopherson	58c81bc1e7	KVM: x86: Refactor handling of SIPI_RECEIVED when setting MP_STATE Convert the incoming mp_state to INIT_RECEIVED instead of manually calling kvm_set_mp_state() to make it more obvious that the SIPI_RECEIVED logic is translating the incoming state to KVM's internal tracking, as opposed to being some entirely unique flow. Opportunistically add a comment to explain what the code is doing. No functional change intended. Link: https://lore.kernel.org/r/20250605195018.539901-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-20 13:08:00 -07:00
Sean Christopherson	0fe3e8d804	KVM: x86: Move INIT_RECEIVED vs. INIT/SIPI blocked check to KVM_RUN Check for the should-be-impossible scenario of a vCPU being in Wait-For-SIPI with INIT/SIPI blocked during KVM_RUN instead of trying to detect and prevent illegal combinations in every ioctl that sets relevant state. Attempting to handle every possible "set" path is a losing game of whack-a-mole, and risks breaking userspace. E.g. INIT/SIPI are blocked on Intel if the vCPU is in VMX Root mode (post-VMXON), and on AMD if GIF=0. Handling those scenarios would require potentially breaking changes to {vmx,svm}_set_nested_state(). Moving the check to KVM_RUN fixes a syzkaller-induced splat due to the aforementioned VMXON case, and in theory should close the hole once and for all. Note, kvm_x86_vcpu_pre_run() already handles SIPI_RECEIVED, only the WFS case needs additional attention. Reported-by: syzbot+c1cbaedc2613058d5194@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?id=490ae63d8d89cb82c5d462d16962cf371df0e476 Link: https://lore.kernel.org/r/20250605195018.539901-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-20 13:08:00 -07:00
Sean Christopherson	16777ebded	KVM: x86: WARN and reject KVM_RUN if vCPU's MP_STATE is SIPI_RECEIVED WARN if KVM_RUN is reached with a vCPU's mp_state set to SIPI_RECEIVED, as KVM no longer uses SIPI_RECEIVED internally, and should morph SIPI_RECEIVED into INIT_RECEIVED with a pending SIPI if userspace forces SIPI_RECEIVED. See commit `66450a21f9` ("KVM: x86: Rework INIT and SIPI handling") for more history and details. Link: https://lore.kernel.org/r/20250605195018.539901-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-20 13:07:59 -07:00
Sean Christopherson	c4a37acc51	KVM: x86: Drop pending_smi vs. INIT_RECEIVED check when setting MP_STATE Allow userspace to set a vCPU's mp_state to INIT_RECEIVED in conjunction with a pending SMI, as rejecting that combination could result in KVM disallowing reflecting the output from KVM_GET_VCPU_EVENTS back into KVM via KVM_SET_VCPU_EVENTS. At the time the check was added, smi_pending could only be set in the context of KVM_RUN, with the vCPU in the RUNNABLE state. I.e. it was impossible for KVM to save vCPU state such that userspace could see a pending SMI for a vCPU in WFS. That no longer holds true now that KVM processes requested SMIs during KVM_GET_VCPU_EVENTS, e.g. if a vCPU receives an SMI while in WFS, and then userspace saves vCPU state. Note, this may partially re-open the user-triggerable WARN that was mostly closed by commit `28bf288879` ("KVM: x86: fix user triggerable warning in kvm_apic_accept_events()"), but that WARN can already be triggered in several other ways, e.g. if userspace stuffs VMXON=1 after putting the vCPU into WFS. That issue will be addressed in an upcoming commit, in a more robust fashion (hopefully). Fixes: `1f7becf1b7` ("KVM: x86: get smi pending status correctly") Link: https://lore.kernel.org/r/20250605195018.539901-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-20 13:07:59 -07:00
Sean Christopherson	0792c71c1c	KVM: selftests: Verify KVM disable interception (for userspace) on filter change Re-read MSR_{FS,GS}_BASE after restoring the "allow everything" userspace MSR filter to verify that KVM stops forwarding exits to userspace. This can also be used in conjunction with manual verification (e.g. printk) to ensure KVM is correctly updating the MSR bitmaps consumed by hardware. Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Tested-by: Manali Shukla <Manali.Shukla@amd.com> Link: https://lore.kernel.org/r/20250610225737.156318-33-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-20 13:07:38 -07:00
Sean Christopherson	bea44d1992	KVM: x86: Simplify userspace filter logic when disabling MSR interception Refactor {svm,vmx}_disable_intercept_for_msr() to simplify the handling of userspace filters that disallow access to an MSR. The more complicated logic is no longer needed or justified now that KVM recalculates all MSR intercepts on a userspace MSR filter change, i.e. now that KVM doesn't need to also update shadow bitmaps. No functional change intended. Suggested-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://lore.kernel.org/r/20250610225737.156318-32-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-20 13:07:37 -07:00
Sean Christopherson	73be81b3bb	KVM: SVM: Add a helper to allocate and initialize permissions bitmaps Add a helper to allocate and initialize an MSR or I/O permissions map, as the logic is identical between the two map types, the only difference is the size of the bitmap. Opportunistically add a comment to explain why the bitmaps are initialized with 0xff, e.g. instead of the more common zero-initialized behavior, which is the main motivation for deduplicating the code. No functional change intended. Link: https://lore.kernel.org/r/20250610225737.156318-31-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-20 13:07:37 -07:00
Sean Christopherson	54f1c77061	KVM: nSVM: Merge MSRPM in 64-bit chunks on 64-bit kernels When merging L0 and L1 MSRPMs as part of nested VMRUN emulation, access the bitmaps using "unsigned long" chunks, i.e. use 8-byte access for 64-bit kernels instead of arbitrarily working on 4-byte chunks. Opportunistically rename local variables in nested_svm_merge_msrpm() to more precisely/accurately reflect their purpose ("offset" in particular is extremely ambiguous). Link: https://lore.kernel.org/r/20250610225737.156318-30-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-20 13:07:36 -07:00
Sean Christopherson	5904ba5172	KVM: SVM: Return -EINVAL instead of MSR_INVALID to signal out-of-range MSR Return -EINVAL instead of MSR_INVALID from svm_msrpm_bit_nr() to indicate that the MSR isn't covered by one of the (currently) three MSRPM ranges, and delete the MSR_INVALID macro now that all users are gone. Link: https://lore.kernel.org/r/20250610225737.156318-29-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-20 13:07:35 -07:00
Sean Christopherson	52f8217742	KVM: nSVM: Access MSRPM in 4-byte chunks only for merging L0 and L1 bitmaps Access the MSRPM using u32/4-byte chunks (and appropriately adjusted offsets) only when merging L0 and L1 bitmaps as part of emulating VMRUN. The only reason to batch accesses to MSRPMs is to avoid the overhead of uaccess operations (e.g. STAC/CLAC and bounds checks) when reading L1's bitmap pointed at by vmcb12. For all other uses, either per-bit accesses are more than fast enough (no uaccess), or KVM is only accessing a single bit (nested_svm_exit_handled_msr()) and so there's nothing to batch. In addition to (hopefully) documenting the uniqueness of the merging code, restricting chunked access to _just_ the merging code will allow for increasing the chunk size (to unsigned long) with minimal risk. Link: https://lore.kernel.org/r/20250610225737.156318-28-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-20 13:07:34 -07:00
Sean Christopherson	7fe0578041	KVM: SVM: Store MSRPM pointer as "void *" instead of "u32 *" Store KVM's MSRPM pointers as "void *" instead of "u32 *" to guard against directly accessing the bitmaps outside of code that is explicitly written to access the bitmaps with a specific type. Opportunistically use svm_vcpu_free_msrpm() in svm_vcpu_free() instead of open coding an equivalent. Link: https://lore.kernel.org/r/20250610225737.156318-27-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-20 13:07:34 -07:00
Sean Christopherson	5c9c084763	KVM: SVM: Move svm_msrpm_offset() to nested.c Move svm_msrpm_offset() from svm.c to nested.c now that all usage of the u32-index offsets is nested virtualization specific. No functional change intended. Link: https://lore.kernel.org/r/20250610225737.156318-26-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-20 13:07:33 -07:00
Sean Christopherson	2f89888434	KVM: SVM: Drop explicit check on MSRPM offset when emulating SEV-ES accesses Now that msr_write_intercepted() defaults to true, i.e. accurately reflects hardware behavior for out-of-range MSRs, and doesn't WARN (or BUG) on an out-of-range MSR, drop sev_es_prevent_msr_access()'s svm_msrpm_offset() check that guarded against calling msr_write_intercepted() with a "bad" index. Opportunistically clean up the helper's formatting. Link: https://lore.kernel.org/r/20250610225737.156318-25-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-20 13:07:32 -07:00
Sean Christopherson	4880919aaf	KVM: SVM: Merge "after set CPUID" intercept recalc helpers Merge svm_recalc_intercepts_after_set_cpuid() and svm_recalc_instruction_intercepts() such that the "after set CPUID" helper simply invokes the type-specific helpers (MSRs vs. instructions), i.e. make svm_recalc_intercepts_after_set_cpuid() a single entry point for all intercept updates that need to be performed after a CPUID change. No functional change intended. Link: https://lore.kernel.org/r/20250610225737.156318-24-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-20 13:07:32 -07:00
Sean Christopherson	40ba80e4b0	KVM: SVM: Fold svm_vcpu_init_msrpm() into its sole caller Fold svm_vcpu_init_msrpm() into svm_recalc_msr_intercepts() now that there is only the one caller (and because the "init" misnomer is even more misleading than it was in the past). No functional change intended. Link: https://lore.kernel.org/r/20250610225737.156318-23-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-20 13:07:31 -07:00
Sean Christopherson	049dff172b	KVM: SVM: Rename init_vmcb_after_set_cpuid() to make it intercepts specific Rename init_vmcb_after_set_cpuid() to svm_recalc_intercepts_after_set_cpuid() to more precisely describe its role. Strictly speaking, the name isn't perfect as toggling virtual VM{LOAD,SAVE} is arguably not recalculating an intercept, but practically speaking it's close enough. No functional change intended. Link: https://lore.kernel.org/r/20250610225737.156318-22-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-20 13:07:30 -07:00
Sean Christopherson	4ceca57e3f	KVM: x86: Rename msr_filter_changed() => recalc_msr_intercepts() Rename msr_filter_changed() to recalc_msr_intercepts() and drop the trampoline wrapper now that both SVM and VMX use a filter-agnostic recalc helper to react to the new userspace filter. No functional change intended. Reviewed-by: Xin Li (Intel) <xin@zytor.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Link: https://lore.kernel.org/r/20250610225737.156318-21-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-20 13:07:30 -07:00
Sean Christopherson	160f143cc1	KVM: SVM: Manually recalc all MSR intercepts on userspace MSR filter change On a userspace MSR filter change, recalculate all MSR intercepts using the filter-agnostic logic instead of maintaining a "shadow copy" of KVM's desired intercepts. The shadow bitmaps add yet another point of failure, are confusing (e.g. what does "handled specially" mean!?!?), an eyesore, and a maintenance burden. Given that KVM must be able to recalculate the correct intercepts at any given time, and that MSR filter updates are not hot paths, there is zero benefit to maintaining the shadow bitmaps. Opportunistically switch from boot_cpu_has() to cpu_feature_enabled() as appropriate. Link: https://lore.kernel.org/all/aCdPbZiYmtni4Bjs@google.com Link: https://lore.kernel.org/all/20241126180253.GAZ0YNTdXH1UGeqsu6@fat_crate.local Cc: Francesco Lavra <francescolavra.fl@gmail.com> Link: https://lore.kernel.org/r/20250610225737.156318-20-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-20 13:07:29 -07:00
Sean Christopherson	8a056ece45	KVM: VMX: Manually recalc all MSR intercepts on userspace MSR filter change On a userspace MSR filter change, recalculate all MSR intercepts using the filter-agnostic logic instead of maintaining a "shadow copy" of KVM's desired intercepts. The shadow bitmaps add yet another point of failure, are confusing (e.g. what does "handled specially" mean!?!?), an eyesore, and a maintenance burden. Given that KVM must be able to recalculate the correct intercepts at any given time, and that MSR filter updates are not hot paths, there is zero benefit to maintaining the shadow bitmaps. Opportunistically switch from boot_cpu_has() to cpu_feature_enabled() as appropriate. Link: https://lore.kernel.org/all/aCdPbZiYmtni4Bjs@google.com Link: https://lore.kernel.org/all/20241126180253.GAZ0YNTdXH1UGeqsu6@fat_crate.local Cc: Borislav Petkov <bp@alien8.de> Reviewed-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Xin Li (Intel) <xin@zytor.com> Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Link: https://lore.kernel.org/r/20250610225737.156318-19-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-20 13:07:28 -07:00
Sean Christopherson	405a63d4d3	KVM: x86: Move definition of X2APIC_MSR() to lapic.h Dedup the definition of X2APIC_MSR and put it in the local APIC code where it belongs. No functional change intended. Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://lore.kernel.org/r/20250610225737.156318-18-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-20 13:07:28 -07:00
Sean Christopherson	cb53d07948	KVM: SVM: Drop "always" flag from list of possible passthrough MSRs Drop the "always" flag from the array of possible passthrough MSRs, and instead manually initialize the permissions for the handful of MSRs that KVM passes through by default. In addition to cutting down on boilerplate copy+paste code and eliminating a misleading flag (the MSRs aren't always passed through, e.g. thanks to MSR filters), this will allow for removing the direct_access_msrs array entirely. Link: https://lore.kernel.org/r/20250610225737.156318-17-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-20 13:07:27 -07:00
Sean Christopherson	3a0f09b361	KVM: SVM: Pass through GHCB MSR if and only if VM is an SEV-ES guest Disable interception of the GHCB MSR if and only if the VM is an SEV-ES guest. While the exact behavior is completely undocumented in the APM, common sense and testing on SEV-ES capable CPUs says that accesses to the GHCB from non-SEV-ES guests will #GP. I.e. from the guest's perspective, no functional change intended. Fixes: `376c6d2850` ("KVM: SVM: Provide support for SEV-ES vCPU creation/loading") Link: https://lore.kernel.org/r/20250610225737.156318-16-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-20 13:07:26 -07:00
Sean Christopherson	6b7315fe54	KVM: SVM: Implement and adopt VMX style MSR intercepts APIs Add and use SVM MSR interception APIs (in most paths) to match VMX's APIs and nomenclature. Specifically, add SVM variants of: vmx_disable_intercept_for_msr(vcpu, msr, type) vmx_enable_intercept_for_msr(vcpu, msr, type) vmx_set_intercept_for_msr(vcpu, msr, type, intercept) to eventually replace SVM's single helper: set_msr_interception(vcpu, msrpm, msr, allow_read, allow_write) which is awkward to use (in all cases, KVM either applies the same logic for both reads and writes, or intercepts one of read or write), and is unintuitive due to using '0' to indicate interception should be set. Keep the guts of the old API for the moment to avoid churning the MSR filter code, as that mess will be overhauled in the near future. Leave behind a temporary comment to call out that the shadow bitmaps have inverted polarity relative to the bitmaps consumed by hardware. No functional change intended. Reviewed-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Link: https://lore.kernel.org/r/20250610225737.156318-15-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-20 13:07:26 -07:00
Sean Christopherson	c38595ad69	KVM: SVM: Add helpers for accessing MSR bitmap that don't rely on offsets Add macro-built helpers for testing, setting, and clearing MSRPM entries without relying on precomputed offsets. This sets the stage for eventually removing general KVM use of precomputed offsets, which are quite confusing and rather inefficient for the vast majority of KVM's usage. Outside of merging L0 and L1 bitmaps for nested SVM, using u32-indexed offsets and accesses is at best unnecessary, and at worst introduces extra operations to retrieve the individual bit from within the offset u32 value. And simply calling them "offsets" is very confusing, as the "unit" of the offset isn't immediately obvious. Use the new helpers in set_msr_interception_bitmap() and msr_write_intercepted() to verify the math and operations, but keep the existing offset-based logic in set_msr_interception_bitmap() to sanity check the "clear" and "set" operations. Manipulating MSR interceptions isn't a hot path and no kernel release is ever expected to contain this specific version of set_msr_interception_bitmap() (it will be removed entirely in the near future). Link: https://lore.kernel.org/r/20250610225737.156318-14-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-20 13:07:25 -07:00
Sean Christopherson	4879dc9469	KVM: nSVM: Don't initialize vmcb02 MSRPM with vmcb01's "always passthrough" Don't initialize vmcb02's MSRPM with KVM's set of "always passthrough" MSRs, as KVM always needs to consult L1's intercepts, i.e. needs to merge vmcb01 with vmcb12 and write the result to vmcb02. This will eventually allow for the removal of svm_vcpu_init_msrpm(). Note, the bitmaps are truly initialized by svm_vcpu_alloc_msrpm() (default to intercepting all MSRs), e.g. if there is a bug lurking elsewhere, the worst case scenario from dropping the call to svm_vcpu_init_msrpm() should be that KVM would fail to passthrough MSRs to L2. Link: https://lore.kernel.org/r/20250610225737.156318-13-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-20 13:07:24 -07:00
Sean Christopherson	f21ff2c8c9	KVM: nSVM: Omit SEV-ES specific passthrough MSRs from L0+L1 bitmap merge Don't merge bitmaps on nested VMRUN for MSRs that KVM passes through only for SEV-ES guests. KVM doesn't support nested virtualization for SEV-ES, and likely never will. Link: https://lore.kernel.org/r/20250610225737.156318-12-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-20 13:07:24 -07:00
Sean Christopherson	9b72c3d59f	KVM: nSVM: Use dedicated array of MSRPM offsets to merge L0 and L1 bitmaps Use a dedicated array of MSRPM offsets to merge L0 and L1 bitmaps, i.e. to merge KVM's vmcb01 bitmap with L1's vmcb12 bitmap. This will eventually allow for the removal of direct_access_msrs, as the only path where tracking the offsets is truly justified is the merge for nested SVM, where merging in chunks is an easy way to batch uaccess reads/writes. Opportunistically omit the x2APIC MSRs from the merge-specific array instead of filtering them out at runtime. Note, disabling interception of DEBUGCTL, XSS, EFER, PAT, GHCB, and TSC_AUX is mutually exclusive with nested virtualization, as KVM passes through those MSRs only for SEV-ES guests, and KVM doesn't support nested virtualization for SEV+ guests. Defer removing those MSRs to a future cleanup in order to make this refactoring as benign as possible. Link: https://lore.kernel.org/r/20250610225737.156318-11-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-20 13:07:23 -07:00
Sean Christopherson	16e9584cc0	KVM: SVM: Clean up macros related to architectural MSRPM definitions Move SVM's MSR Permissions Map macros to svm.h in anticipation of adding helpers that are available to SVM code, and opportunistically replace a variety of open-coded literals with (hopefully) informative macros. Opportunistically open code ARRAY_SIZE(msrpm_ranges) instead of wrapping it as NUM_MSR_MAPS, which is an ambiguous name even if it were qualified with "SVM_MSRPM". Deliberately leave the ranges as open coded literals, as using macros to define the ranges actually introduces more potential failure points, since both the definitions and the usage have to be careful to use the correct index. The lack of clear intent behind the ranges will be addressed in future patches. No functional change intended. Link: https://lore.kernel.org/r/20250610225737.156318-10-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-20 13:07:11 -07:00
Sean Christopherson	925149b6d0	KVM: SVM: Massage name and param of helper that merges vmcb01 and vmcb12 MSRPMs Rename nested_svm_vmrun_msrpm() to nested_svm_merge_msrpm() to better capture its role, and opportunistically feed it @vcpu instead of @svm, as grabbing "svm" only to turn around and grab svm->vcpu is rather silly. No functional change intended. Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://lore.kernel.org/r/20250610225737.156318-9-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-20 13:05:40 -07:00
Sean Christopherson	b1bccf7883	KVM: x86: Use non-atomic bit ops to manipulate "shadow" MSR intercepts Manipulate the MSR bitmaps using non-atomic bit ops APIs (two underscores), as the bitmaps are per-vCPU and are only ever accessed while vcpu->mutex is held. Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Link: https://lore.kernel.org/r/20250610225737.156318-8-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-20 13:05:40 -07:00
Sean Christopherson	6353cd685c	KVM: SVM: Kill the VM instead of the host if MSR interception is buggy WARN and kill the VM instead of panicking the host if KVM attempts to set or query MSR interception for an unsupported MSR. Accessing the MSR interception bitmaps only meaningfully affects post-VMRUN behavior, and KVM_BUG_ON() is guaranteed to prevent the current vCPU from doing VMRUN, i.e. there is no need to panic the entire host. Opportunistically move the sanity checks above their use to index into the MSRPM, e.g. so that bugs only WARN and terminate the VM, as opposed to doing that _and_ generating an out-of-bounds load. Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://lore.kernel.org/r/20250610225737.156318-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-20 13:05:40 -07:00
Sean Christopherson	b241c50c4e	KVM: SVM: Use ARRAY_SIZE() to iterate over direct_access_msrs Drop the unnecessary and dangerous value-terminated behavior of direct_access_msrs, and simply iterate over the actual size of the array. The use in svm_set_x2apic_msr_interception() is especially sketchy, as it relies on unused capacity being zero-initialized, and '0' being outside the range of x2APIC MSRs. To ensure the array and shadow_msr_intercept stay synchronized, simply assert that their sizes are identical (note the six 64-bit-only MSRs). Note, direct_access_msrs will soon be removed entirely; keeping the assert synchronized with the array isn't expected to be a long-term maintenance burden. Reviewed-by: Chao Gao <chao.gao@intel.com> Link: https://lore.kernel.org/r/20250610225737.156318-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-20 13:05:40 -07:00
Sean Christopherson	f886515f9b	KVM: SVM: Tag MSR bitmap initialization helpers with __init Tag init_msrpm_offsets() and add_msr_offset() with __init, as they're used only during hardware setup to map potential passthrough MSRs to offsets in the bitmap. Reviewed-by: Chao Gao <chao.gao@intel.com> Link: https://lore.kernel.org/r/20250610225737.156318-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-20 13:05:40 -07:00
Sean Christopherson	5ebd737308	KVM: SVM: Don't BUG if setting up the MSR intercept bitmaps fails WARN and reject module loading if there is a problem with KVM's MSR interception bitmaps. Panicking the host in this situation is inexcusable since it is trivially easy to propagate the error up the stack. Link: https://lore.kernel.org/r/20250610225737.156318-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-20 13:05:40 -07:00
Sean Christopherson	fb96d5cf0f	KVM: SVM: Allocate IOPM pages after initial setup in svm_hardware_setup() Allocate pages for the IOPM after initial setup has been completed in svm_hardware_setup(), so that sanity checks can be added in the setup flow without needing to free the IOPM pages. The IOPM is only referenced (via iopm_base) in init_vmcb() and svm_hardware_unsetup(), so there's no need to allocate it early on. No functional change intended (beyond the obvious ordering differences, e.g. if the allocation fails). Link: https://lore.kernel.org/r/20250610225737.156318-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-20 13:05:39 -07:00
Sean Christopherson	674ffc6503	KVM: SVM: Disable interception of SPEC_CTRL iff the MSR exists for the guest Disable interception of SPEC_CTRL when the CPU virtualizes (i.e. context switches) SPEC_CTRL if and only if the MSR exists according to the vCPU's CPUID model. Letting the guest access SPEC_CTRL is generally benign, but the guest would see inconsistent behavior if KVM happened to emulate an access to the MSR. Fixes: `d00b99c514` ("KVM: SVM: Add support for Virtual SPEC_CTRL") Reported-by: Chao Gao <chao.gao@intel.com> Link: https://lore.kernel.org/r/20250610225737.156318-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-20 13:05:39 -07:00
Maxim Levitsky	6b1dd26544	KVM: VMX: Preserve host's DEBUGCTLMSR_FREEZE_IN_SMM while running the guest Set/clear DEBUGCTLMSR_FREEZE_IN_SMM in GUEST_IA32_DEBUGCTL based on the host's pre-VM-Enter value, i.e. preserve the host's FREEZE_IN_SMM setting while running the guest. When running with the "default treatment of SMIs" in effect (the only mode KVM supports), SMIs do not generate a VM-Exit that is visible to host (non-SMM) software, and instead transitions directly from VMX non-root to SMM. And critically, DEBUGCTL isn't context switched by hardware on SMI or RSM, i.e. SMM will run with whatever value was resident in hardware at the time of the SMI. Failure to preserve FREEZE_IN_SMM results in the PMU unexpectedly counting events while the CPU is executing in SMM, which can pollute profiling and potentially leak information into the guest. Check for changes in FREEZE_IN_SMM prior to every entry into KVM's inner run loop, as the bit can be toggled in IRQ context via IPI callback (SMP function call), by way of /sys/devices/cpu/freeze_on_smi. Add a field in kvm_x86_ops to communicate which DEBUGCTL bits need to be preserved, as FREEZE_IN_SMM is only supported and defined for Intel CPUs, i.e. explicitly checking FREEZE_IN_SMM in common x86 is at best weird, and at worst could lead to undesirable behavior in the future if AMD CPUs ever happened to pick up a collision with the bit. Exempt TDX vCPUs, i.e. protected guests, from the check, as the TDX Module owns and controls GUEST_IA32_DEBUGCTL. WARN in SVM if KVM_RUN_LOAD_DEBUGCTL is set, mostly to document that the lack of handling isn't a KVM bug (TDX already WARNs on any run_flag). Lastly, explicitly reload GUEST_IA32_DEBUGCTL on a VM-Fail that is missed by KVM but detected by hardware, i.e. in nested_vmx_restore_host_state(). Doing so avoids the need to track host_debugctl on a per-VMCS basis, as GUEST_IA32_DEBUGCTL is unconditionally written by prepare_vmcs02() and load_vmcs12_host_state(). For the VM-Fail case, even though KVM won't have actually entered the guest, vcpu_enter_guest() will have run with vmcs02 active and thus could result in vmcs01 being run with a stale value. Cc: stable@vger.kernel.org Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Co-developed-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20250610232010.162191-9-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-20 13:05:24 -07:00
Maxim Levitsky	7d0cce6cbe	KVM: VMX: Wrap all accesses to IA32_DEBUGCTL with getter/setter APIs Introduce vmx_guest_debugctl_{read,write}() to handle all accesses to vmcs.GUEST_IA32_DEBUGCTL. This will allow stuffing FREEZE_IN_SMM into GUEST_IA32_DEBUGCTL based on the host setting without bleeding the state into the guest, and without needing to copy+paste the FREEZE_IN_SMM logic into every patch that accesses GUEST_IA32_DEBUGCTL. No functional change intended. Cc: stable@vger.kernel.org Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> [sean: massage changelog, make inline, use in all prepare_vmcs02() cases] Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://lore.kernel.org/r/20250610232010.162191-8-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-20 13:05:24 -07:00
Maxim Levitsky	095686e6fc	KVM: nVMX: Check vmcs12->guest_ia32_debugctl on nested VM-Enter Add a consistency check for L2's guest_ia32_debugctl, as KVM only supports a subset of hardware functionality, i.e. KVM can't rely on hardware to detect illegal/unsupported values. Failure to check the vmcs12 value would allow the guest to load any hardware-supported value while running L2. Take care to exempt BTF and LBR from the validity check in order to match KVM's behavior for writes via WRMSR, but without clobbering vmcs12. Even if VM_EXIT_SAVE_DEBUG_CONTROLS is set in vmcs12, L1 can reasonably expect that vmcs12->guest_ia32_debugctl will not be modified if writes to the MSR are being intercepted. Arguably, KVM _should_ update vmcs12 if VM_EXIT_SAVE_DEBUG_CONTROLS is set and writes to MSR_IA32_DEBUGCTLMSR are not being intercepted by L1, but that would incur non-trivial complexity and wouldn't change the fact that KVM's handling of DEBUGCTL is blatantly broken. I.e. the extra complexity is not worth carrying. Cc: stable@vger.kernel.org Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Co-developed-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20250610232010.162191-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-20 13:05:23 -07:00
Sean Christopherson	8a4351ac30	KVM: VMX: Extract checking of guest's DEBUGCTL into helper Move VMX's logic to check DEBUGCTL values into a standalone helper so that the code can be used by nested VM-Enter to apply the same logic to the value being loaded from vmcs12. KVM needs to explicitly check vmcs12->guest_ia32_debugctl on nested VM-Enter, as hardware may support features that KVM does not, i.e. relying on hardware to detect invalid guest state will result in false negatives. Unfortunately, that means applying KVM's funky suppression of BTF and LBR to vmcs12 so as not to break existing guests. No functional change intended. Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://lore.kernel.org/r/20250610232010.162191-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-20 13:05:22 -07:00
Sean Christopherson	17ec2f9653	KVM: VMX: Allow guest to set DEBUGCTL.RTM_DEBUG if RTM is supported Let the guest set DEBUGCTL.RTM_DEBUG if RTM is supported according to the guest CPUID model, as debug support is supposed to be available if RTM is supported, and there are no known downsides to letting the guest debug RTM aborts. Note, there are no known bug reports related to RTM_DEBUG, the primary motivation is to reduce the probability of breaking existing guests when a future change adds a missing consistency check on vmcs12.GUEST_DEBUGCTL (KVM currently lets L2 run with whatever hardware supports; whoops). Note #2, KVM already emulates DR6.RTM, and doesn't restrict access to DR7.RTM. Fixes: `83c529151a` ("KVM: x86: expose Intel cpu new features (HLE, RTM) to guest") Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20250610232010.162191-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-20 13:04:24 -07:00
Sean Christopherson	80c64c7afe	KVM: x86: Drop kvm_x86_ops.set_dr6() in favor of a new KVM_RUN flag Instruct vendor code to load the guest's DR6 into hardware via a new KVM_RUN flag, and remove kvm_x86_ops.set_dr6(), whose sole purpose was to load vcpu->arch.dr6 into hardware when DR6 can be read/written directly by the guest. Note, TDX already WARNs on any run_flag being set, i.e. will yell if KVM thinks DR6 needs to be reloaded. TDX vCPUs force KVM_DEBUGREG_AUTO_SWITCH and never clear the flag, i.e. should never observe KVM_RUN_LOAD_GUEST_DR6. Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20250610232010.162191-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-20 13:04:24 -07:00
Sean Christopherson	2478b1b220	KVM: x86: Convert vcpu_run()'s immediate exit param into a generic bitmap Convert kvm_x86_ops.vcpu_run()'s "force_immediate_exit" boolean parameter into a generic bitmap so that similar "take action" information can be passed to vendor code without creating a pile of boolean parameters. This will allow dropping kvm_x86_ops.set_dr6() in favor of a new flag, and will also allow for adding similar functionality for re-loading debugctl in the active VMCS. Opportunistically massage the TDX WARN and comment to prepare for adding more run_flags, all of which are expected to be mutually exclusive with TDX, i.e. should be WARNed on. No functional change intended. Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20250610232010.162191-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-20 13:04:24 -07:00
Sean Christopherson	7d390a9da8	KVM: TDX: Use kvm_arch_vcpu.host_debugctl to restore the host's DEBUGCTL Use the kvm_arch_vcpu.host_debugctl snapshot to restore DEBUGCTL after running a TD vCPU. The final TDX series rebase was mishandled, likely due to commit `fb71c79593` ("KVM: x86: Snapshot the host's DEBUGCTL in common x86") deleting the same line of code from vmx.h, i.e. creating a semantic conflict of sorts, but no syntactic conflict. Using the version in kvm_vcpu_arch picks up the ulong => u64 fix (which isn't relevant to TDX) as well as the IRQ fix from commit `189ecdb3e1` ("KVM: x86: Snapshot the host's DEBUGCTL after disabling IRQs"). Link: https://lore.kernel.org/all/20250307212053.2948340-10-pbonzini@redhat.com Cc: Adrian Hunter <adrian.hunter@intel.com> Fixes: `8af0990375` ("KVM: TDX: Save and restore IA32_DEBUGCTL") Reviewed-by: Adrian Hunter <adrian.hunter@intel.com> Link: https://lore.kernel.org/r/20250610232010.162191-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-20 13:04:23 -07:00
Paolo Bonzini	28224ef02b	KVM: TDX: Report supported optional TDVMCALLs in TDX capabilities Allow userspace to advertise TDG.VP.VMCALL subfunctions that the kernel also supports. For each output register of GetTdVmCallInfo's leaf 1, add two fields to KVM_TDX_CAPABILITIES: one for kernel-supported TDVMCALLs (userspace can set those blindly) and one for user-supported TDVMCALLs (userspace can set those if it knows how to handle them). Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2025-06-20 14:20:20 -04:00
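
The DEBUGCTL commits listed above (2478b1b220, 7d0cce6cbe, 6b1dd26544) describe three cooperating pieces: a run-flags bitmap passed to vendor code, getter/setter wrappers around the guest's DEBUGCTL field, and a reload of that field when the host's FREEZE_IN_SMM setting changes. The following is a minimal user-space sketch of how those pieces might fit together, not the kernel's actual code. `KVM_RUN_LOAD_GUEST_DR6`, `KVM_RUN_LOAD_DEBUGCTL`, and `DEBUGCTLMSR_FREEZE_IN_SMM` are named in the commit messages; the flag values, the bit position, the `KVM_RUN_FORCE_IMMEDIATE_EXIT` name, and every `demo_*` identifier are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Run flags passed from common code to vendor code. Values are assumed. */
#define KVM_RUN_FORCE_IMMEDIATE_EXIT (1ull << 0) /* assumed name */
#define KVM_RUN_LOAD_GUEST_DR6       (1ull << 1)
#define KVM_RUN_LOAD_DEBUGCTL        (1ull << 2)

/* Bit position assumed for illustration only. */
#define DEBUGCTLMSR_FREEZE_IN_SMM    (1ull << 14)
/* Hypothetical stand-in for the set of host-owned DEBUGCTL bits. */
#define DEMO_HOST_OWNED_DEBUGCTL     DEBUGCTLMSR_FREEZE_IN_SMM

/* Hypothetical vCPU state; not the kernel's structures. */
struct demo_vcpu {
    uint64_t host_debugctl; /* snapshot of the host's DEBUGCTL */
    uint64_t vmcs_debugctl; /* stand-in for vmcs.GUEST_IA32_DEBUGCTL */
};

/* Setter: store the guest value, then stuff in the host-owned bits. */
static void demo_guest_debugctl_write(struct demo_vcpu *v, uint64_t val)
{
    val &= ~DEMO_HOST_OWNED_DEBUGCTL;
    v->vmcs_debugctl = val | (v->host_debugctl & DEMO_HOST_OWNED_DEBUGCTL);
}

/* Getter: hide host-owned bits so host state does not bleed into the guest. */
static uint64_t demo_guest_debugctl_read(const struct demo_vcpu *v)
{
    return v->vmcs_debugctl & ~DEMO_HOST_OWNED_DEBUGCTL;
}

/*
 * Common code: before each entry into the inner run loop, compare the
 * current host DEBUGCTL against the snapshot and request a reload if a
 * host-owned bit changed.
 */
static uint64_t demo_prepare_run_flags(struct demo_vcpu *v,
                                       uint64_t cur_host_debugctl,
                                       bool req_immediate_exit)
{
    uint64_t run_flags = 0;

    if (req_immediate_exit)
        run_flags |= KVM_RUN_FORCE_IMMEDIATE_EXIT;
    if ((cur_host_debugctl ^ v->host_debugctl) & DEMO_HOST_OWNED_DEBUGCTL)
        run_flags |= KVM_RUN_LOAD_DEBUGCTL;

    v->host_debugctl = cur_host_debugctl;
    return run_flags;
}

/* Vendor code: act on the flags instead of on a pile of bool parameters. */
static void demo_vcpu_run(struct demo_vcpu *v, uint64_t run_flags)
{
    if (run_flags & KVM_RUN_LOAD_DEBUGCTL)
        demo_guest_debugctl_write(v, demo_guest_debugctl_read(v));
}
```

Routing every access through the getter/setter pair is what lets the reload be a simple read-then-write: the guest-visible value round-trips unchanged while the host-owned bit is refreshed from the latest snapshot.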


1367593 Commits