Synchronize XSS from the GHCB to KVM's internal tracking if the guest
marks XSS as valid on a #VMGEXIT. Like XCR0, KVM needs an up-to-date copy
of XSS in order to compute the required XSTATE size when emulating
CPUID.0xD.0x1 for the guest.
Treat the incoming XSS change as an emulated write, i.e. validate the
guest-provided value, to avoid letting the guest load garbage into KVM's
tracking. Simply ignore bad values, as either the guest managed to get an
unsupported value into hardware, or the guest is misbehaving and providing
pure garbage. In either case, KVM can't fix the broken guest.
Explicitly allow access to XSS at all times, as KVM needs to ensure its
copy of XSS stays up-to-date. E.g. KVM supports migration of SEV-ES guests
and so needs to allow the host to save/restore XSS, otherwise a guest
that *knows* its XSS hasn't changed could get stale/bad CPUID emulation if
the guest doesn't provide XSS in the GHCB on every exit. This creates a
hypothetical problem where a guest could request emulation of RDMSR or
WRMSR on XSS, but arguably that's not even a problem, e.g. it would be
entirely reasonable for a guest to request "emulation" as a way to inform
the hypervisor that its XSS value has been modified.
Note, emulating the change as an MSR write also takes care of side effects,
e.g. marking dynamic CPUID bits as dirty.
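A minimal sketch of the flow, assuming accessors in the style of the other
kvm_ghcb_*() helpers (kvm_ghcb_xss_is_valid() is an assumption here, as is the
exact name of the MSR-write emulation helper):

  static void sev_es_sync_xss_from_ghcb(struct vcpu_svm *svm)
  {
          struct kvm_vcpu *vcpu = &svm->vcpu;

          if (!kvm_ghcb_xss_is_valid(svm))
                  return;

          /*
           * Emulate the change as an MSR write so that the guest-provided
           * value is validated and dynamic CPUID bits are marked dirty.
           * Bad values are deliberately ignored.
           */
          kvm_emulate_msr_write(vcpu, MSR_IA32_XSS, kvm_ghcb_get_xss(svm));
  }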
Suggested-by: John Allen <john.allen@amd.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-40-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Transfer the three CET Shadow Stack VMCB fields (S_CET, ISST_ADDR, and
SSP) on VMRUN, #VMEXIT, and loading nested state (saving nested state
simply copies the entire save area). SVM doesn't provide a way to
disallow L1 from enabling Shadow Stacks for L2, i.e. KVM *must* provide
nested support before advertising SHSTK to userspace.
Link: https://lore.kernel.org/r/20250919223258.1604852-37-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Advertise the LOAD_CET_STATE VM-Entry/Exit control bits in the nested VMX
MSRs, as all nested support for CET virtualization, including consistency
checks, is in place.
Advertise support if and only if KVM supports at least one of IBT or SHSTK.
While it's userspace's responsibility to provide a consistent CPU model to
the guest, that doesn't mean KVM should set userspace up to fail.
Note, the existing {CLEAR,LOAD}_BNDCFGS behavior predates
KVM_X86_QUIRK_STUFF_FEATURE_MSRS, i.e. KVM "solved" the inconsistent CPU
model problem by overwriting the VMX MSRs provided by userspace.
Signed-off-by: Chao Gao <chao.gao@intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-35-seanjc@google.com
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Introduce consistency checks for CET states during nested VM-entry.
A VMCS contains both guest and host CET states, each comprising the
IA32_S_CET MSR, SSP, and IA32_INTERRUPT_SSP_TABLE_ADDR MSR. Various
checks are applied to CET states during VM-entry as documented in SDM
Vol3 Chapter "VM ENTRIES". Implement all these checks during nested
VM-entry to emulate the architectural behavior.
In summary, there are three kinds of checks on guest/host CET states
during VM-entry:
A. Checks applied to both guest states and host states:
* The IA32_S_CET field must not set any reserved bits; bits 10 (SUPPRESS)
and 11 (TRACKER) cannot both be set.
* SSP should not have bits 1:0 set.
* The IA32_INTERRUPT_SSP_TABLE_ADDR field must be canonical.
B. Checks applied to host states only
* IA32_S_CET MSR and SSP must be canonical if the CPU enters 64-bit mode
after VM-exit. Otherwise, IA32_S_CET and SSP must have their higher 32
bits cleared.
C. Checks applied to guest states only:
* IA32_S_CET MSR and SSP are not required to be fully canonical (i.e., bits
63:N-1 identical, where N is the CPU's maximum linear-address width), but
bits 63:N of SSP must be identical.
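A rough sketch of the common (category A) checks, with the helper returning
0/-EINVAL per the note below; kvm_is_valid_u_s_cet() is a stand-in for
whatever the real reserved-bit/SUPPRESS-TRACKER check is named:

  static int nested_vmx_check_cet_state(struct kvm_vcpu *vcpu, u64 s_cet,
                                        u64 ssp, u64 ssp_tbl)
  {
          /* Reserved bits must be clear; SUPPRESS and TRACKER are exclusive. */
          if (!kvm_is_valid_u_s_cet(vcpu, s_cet))
                  return -EINVAL;

          /* SSP must be 4-byte aligned, i.e. bits 1:0 must be clear. */
          if (ssp & GENMASK_ULL(1, 0))
                  return -EINVAL;

          /* The interrupt SSP table address must be canonical. */
          if (is_noncanonical_msr_address(ssp_tbl, vcpu))
                  return -EINVAL;

          return 0;
  }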
Tested-by: Mathias Krause <minipli@grsecurity.net>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-34-seanjc@google.com
[sean: have common helper return 0/-EINVAL, not true/false]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Set up CET MSRs, related VM_ENTRY/EXIT control bits and fixed CR4 settings
to enable CET for nested VMs.
vmcs12 and vmcs02 need to be synced when L2 exits to L1 or when L1 wants
to resume L2, so that the correct CET state can be observed by one another.
Please note that consistency checks on CET state during VM-Entry will be
added later to prevent this patch from becoming too large.
Advertising the new CET VM_ENTRY/EXIT control bits is also deferred until
after the consistency checks are added.
Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
Tested-by: Mathias Krause <minipli@grsecurity.net>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Xin Li (Intel) <xin@zytor.com>
Tested-by: Xin Li (Intel) <xin@zytor.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-32-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Swap the order between configuring nested VMX capabilities and base CPU
capabilities, so that nested VMX support can be conditioned on core KVM
support, e.g. to allow conditioning support for LOAD_CET_STATE on the
presence of IBT or SHSTK. Because the sanity checks on nested VMX config
performed by vmx_check_processor_compat() run _after_ vmx_hardware_setup(),
any use of kvm_cpu_cap_has() when configuring nested VMX support will lead
to failures in vmx_check_processor_compat().
While swapping the order of two (or more) configuration flows can lead to
a game of whack-a-mole, in this case nested support inarguably should be
done after base support. KVM should never condition base support on nested
support, because nested support is fully optional, while obviously it's
desirable to condition nested support on base support. And there's zero
evidence the current ordering was intentional, e.g. commit 66a6950f99
("KVM: x86: Introduce kvm_cpu_caps to replace runtime CPUID masking")
likely placed the call to kvm_set_cpu_caps() after nested setup because it
looked pretty.
Reviewed-by: Chao Gao <chao.gao@intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-30-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Add support for the LOAD_CET_STATE VM-Enter and VM-Exit controls, the
CET XFEATURE bits in XSS, and advertise support for IBT and SHSTK to
userspace. Explicitly clear IBT and SHSTK on SVM, as additional work is
needed to enable CET on SVM, e.g. to context switch S_CET and other state.
Disable KVM's CET support if unrestricted_guest is unsupported/disabled, as
KVM does not support emulating CET and running without Unrestricted Guest
can result in KVM emulating large swaths of guest code. While it's highly
unlikely any guest will trigger emulation while also utilizing IBT or
SHSTK, there's zero reason to allow CET without Unrestricted Guest, as that
combination should only be possible when explicitly disabling
unrestricted_guest for testing purposes.
Disable CET if VMX_BASIC[bit56] == 0, i.e. if hardware strictly enforces
the presence of an Error Code based on exception vector, as attempting to
inject a #CP with an Error Code (#CP architecturally has an Error Code)
will fail due to the #CP vector historically not having an Error Code.
Clear the S_CET and SSP-related VMCS fields on "reset" to emulate the
architectural behavior of CET MSRs and SSP being reset to 0 after RESET,
power-up, and INIT. Note,
KVM already clears guest CET state that is managed via XSTATE in
kvm_xstate_reset().
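A minimal sketch of the RESET/INIT handling described above (the VMCS field
names match those introduced elsewhere in this series; the exact placement in
vmx_vcpu_reset() and the capability guard are assumptions):

  if (kvm_cpu_cap_has(X86_FEATURE_SHSTK) || kvm_cpu_cap_has(X86_FEATURE_IBT)) {
          vmcs_writel(GUEST_S_CET, 0);
          vmcs_writel(GUEST_SSP, 0);
          vmcs_writel(GUEST_INTR_SSP_TABLE, 0);
  }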
Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
Signed-off-by: Mathias Krause <minipli@grsecurity.net>
Tested-by: Mathias Krause <minipli@grsecurity.net>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
[sean: move some bits to separate patches, massage changelog]
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-29-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Make IBT and SHSTK virtualization mutually exclusive with "officially"
supporting setups with guest.MAXPHYADDR < host.MAXPHYADDR, i.e. if the
allow_smaller_maxphyaddr module param is set. Running a guest with a
smaller MAXPHYADDR requires intercepting #PF, and can also trigger
emulation of arbitrary instructions. Intercepting and reacting to #PFs
doesn't play nice with SHSTK, as KVM's MMU hasn't been taught to handle
Shadow Stack accesses, and emulating arbitrary instructions doesn't play
nice with IBT or SHSTK, as KVM's emulator doesn't handle the various side
effects, e.g. doesn't enforce end-branch markers or model Shadow Stack
updates.
Note, hiding IBT and SHSTK based solely on allow_smaller_maxphyaddr is
overkill, as allow_smaller_maxphyaddr is only problematic if the guest is
actually configured to have a smaller MAXPHYADDR. However, KVM's ABI
doesn't provide a way to express that IBT and SHSTK may break if enabled
in conjunction with guest.MAXPHYADDR < host.MAXPHYADDR. I.e. the
alternative is to do nothing in KVM and instead update documentation and
hope KVM users are thorough readers. Go with the conservative-but-correct
approach; worst case scenario, this restriction can be dropped if there's
a strong use case for enabling CET on hosts with allow_smaller_maxphyaddr.
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-28-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Initialize allow_smaller_maxphyaddr during hardware setup as soon as KVM
knows whether or not TDP will be utilized. To avoid having to teach KVM's
emulator all about CET, KVM's upcoming CET virtualization support will be
mutually exclusive with allow_smaller_maxphyaddr, i.e. will disable SHSTK
and IBT if allow_smaller_maxphyaddr is enabled.
In general, allow_smaller_maxphyaddr should be initialized as soon as
possible since it's globally visible while its only input is whether or
not EPT/NPT is enabled. I.e. there's effectively zero risk of setting
allow_smaller_maxphyaddr too early, and substantial risk of setting it
too late.
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20250922184743.1745778-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Make TDP a hard requirement for Shadow Stacks, as there are no plans to
add Shadow Stack support to the Shadow MMU. E.g. KVM hasn't been taught
to understand the magic Writable=0,Dirty=1 combination that is required
for Shadow Stack accesses, and so enabling Shadow Stacks when using
shadow paging will put the guest into an infinite #PF loop (KVM thinks the
shadow page tables have a valid mapping, hardware says otherwise).
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-27-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Add CET_KERNEL and CET_USER to KVM's set of supported XSS bits when IBT
*or* SHSTK is supported. Like CR4.CET, XFEATURE support for IBT and SHSTK
is bundled together under the CET umbrella, and is thus prone to
virtualization holes if KVM or the guest supports only one of IBT or SHSTK,
but hardware supports both. However, again like CR4.CET, such
virtualization holes are benign from the host's perspective so long as KVM
takes care to always honor the "or" logic.
Require CET_KERNEL and CET_USER to come as a pair, and refuse to support
IBT or SHSTK if one (or both) of the features is missing, as the (host) kernel
expects them to come as a pair, i.e. may get confused and corrupt state if
only one of CET_KERNEL or CET_USER is supported.
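A sketch of the intended setup-time sanity check, assuming the
XFEATURE_MASK_CET_ALL mask mentioned in the notes covers both CET_USER and
CET_KERNEL (exact placement in kvm_set_cpu_caps()/hardware setup is an
assumption):

  /* CET_USER and CET_KERNEL must come as a pair, or not at all. */
  if ((kvm_caps.supported_xss & XFEATURE_MASK_CET_ALL) != XFEATURE_MASK_CET_ALL) {
          kvm_cpu_cap_clear(X86_FEATURE_SHSTK);
          kvm_cpu_cap_clear(X86_FEATURE_IBT);
          kvm_caps.supported_xss &= ~XFEATURE_MASK_CET_ALL;
  }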
Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
Signed-off-by: Mathias Krause <minipli@grsecurity.net>
Tested-by: Mathias Krause <minipli@grsecurity.net>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
[sean: split to separate patch, write changelog, add XFEATURE_MASK_CET_ALL]
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-26-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Unconditionally forward XSAVES/XRSTORS VM-Exits from L2 to L1, as KVM
doesn't utilize the XSS-bitmap (KVM relies on controlling the XSS value
in hardware to prevent unauthorized access to XSAVES state). KVM always
loads vmcs02 with vmcs12's bitmap, and so any exit _must_ be due to
vmcs12's XSS-bitmap.
Drop the comment about XSS never being non-zero in anticipation of
enabling CET_KERNEL and CET_USER support.
Opportunistically WARN if XSAVES is not enabled for L2, as the CPU is
supposed to generate #UD before checking the XSS-bitmap.
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-25-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Drop X86_CR4_CET from CR4_RESERVED_BITS and instead mark CET as reserved
if and only if IBT *and* SHSTK are unsupported, i.e. allow CR4.CET to be
set if IBT or SHSTK is supported. This creates a virtualization hole if
the CPU supports both IBT and SHSTK, but the kernel or vCPU model only
supports one of the features. However, it's entirely legal for a CPU to
have only one of IBT or SHSTK, i.e. the hole is a flaw in the architecture,
not in KVM.
More importantly, so long as KVM is careful to initialize and context
switch both IBT and SHSTK state (when supported in hardware) if either
feature is exposed to the guest, a misbehaving guest can only harm itself.
E.g. VMX initializes host CET VMCS fields based solely on hardware
capabilities.
Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
Signed-off-by: Mathias Krause <minipli@grsecurity.net>
Tested-by: Mathias Krause <minipli@grsecurity.net>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
[sean: split to separate patch, write changelog]
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-24-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Add PK (Protection Keys), SS (Shadow Stacks), and SGX (Software Guard
Extensions) to the set of #PF error flags handled via
kvm_mmu_trace_pferr_flags. While KVM doesn't expect PK or SS #PFs in
particular, pretty printing their names instead of the raw hex value saves
the user from having to go spelunking in the SDM to figure out what's
going on.
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-23-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Add PFERR_SS_MASK, a.k.a. Shadow Stack access, and WARN if KVM attempts to
check permissions for a Shadow Stack access as KVM hasn't been taught to
understand the magic Writable=0,Dirty=1 combination that is required for
Shadow Stack accesses, and likely will never learn. There are no plans to
support Shadow Stacks with the Shadow MMU, and the emulator rejects all
instructions that affect Shadow Stacks, i.e. it should be impossible for
KVM to observe a #PF due to a shadow stack access.
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-22-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Emulate the Shadow Stack restriction that the current SSP must be a 32-bit
value on a FAR JMP from 64-bit mode to compatibility mode. From the SDM's
pseudocode for FAR JMP:
IF ShadowStackEnabled(CPL)
    IF (IA32_EFER.LMA and DEST(segment selector).L) = 0
        (* If target is legacy or compatibility mode then the SSP must be in low 4GB *)
        IF (SSP & 0xFFFFFFFF00000000 != 0); THEN
            #GP(0);
        FI;
    FI;
FI;
Note, only the current CPL needs to be considered, as FAR JMP can't be
used for inter-privilege level transfers, and KVM rejects emulation of all
other far branch instructions when Shadow Stacks are enabled.
To give the emulator access to GUEST_SSP, special case handling
MSR_KVM_INTERNAL_GUEST_SSP in emulator_get_msr() to treat the access as a
host access (KVM doesn't allow guest accesses to internal "MSRs"). The
->get_msr() API is only used for implicit accesses from the emulator, i.e.
is only used with hardcoded MSR indices, and so any access to
MSR_KVM_INTERNAL_GUEST_SSP is guaranteed to be from KVM, i.e. not from the
guest via RDMSR.
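A sketch of the emulator-side check; the helper name and is_shstk_enabled()
are illustrative, only MSR_KVM_INTERNAL_GUEST_SSP and the ->get_msr() path
come from the changelog:

  static int em_check_far_jmp_ssp(struct x86_emulate_ctxt *ctxt, bool dest_long)
  {
          u64 ssp;

          if (ctxt->mode != X86EMUL_MODE_PROT64 || dest_long ||
              !is_shstk_enabled(ctxt))
                  return X86EMUL_CONTINUE;

          /* Treated as a host access; the internal "MSR" isn't guest visible. */
          if (ctxt->ops->get_msr(ctxt, MSR_KVM_INTERNAL_GUEST_SSP, &ssp))
                  return X86EMUL_UNHANDLEABLE;

          /* Target is legacy/compat mode, so SSP must be in the low 4GiB. */
          return (ssp >> 32) ? emulate_gp(ctxt, 0) : X86EMUL_CONTINUE;
  }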
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-21-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Exit to userspace with KVM_INTERNAL_ERROR_EMULATION if the guest triggers
task switch emulation with Indirect Branch Tracking or Shadow Stacks
enabled, as attempting to do the right thing would require non-trivial
effort and complexity, KVM doesn't support emulating CET generally, and
it's extremely unlikely that any guest will do task switches while also
utilizing CET. Defer taking on the complexity until someone cares enough
to put in the time and effort to add support.
Per the SDM:
If shadow stack is enabled, then the SSP of the task is located at the
4 bytes at offset 104 in the 32-bit TSS and is used by the processor to
establish the SSP when a task switch occurs from a task associated with
this TSS. Note that the processor does not write the SSP of the task
initiating the task switch to the TSS of that task, and instead the SSP
of the previous task is pushed onto the shadow stack of the new task.
Note, per the SDM's pseudocode on TASK SWITCHING, IBT state for the new
privilege level is updated. To keep things simple, check both S_CET and
U_CET (again, anyone that wants more precise checking can have the honor
of implementing support).
Reported-by: Binbin Wu <binbin.wu@linux.intel.com>
Closes: https://lore.kernel.org/all/819bd98b-2a60-4107-8e13-41f1e4c706b1@linux.intel.com
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-20-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Don't emulate branch instructions, e.g. CALL/RET/JMP etc., that are
affected by Shadow Stacks and/or Indirect Branch Tracking when said
features are enabled in the guest, as fully emulating CET would require
significant complexity for no practical benefit (KVM shouldn't need to
emulate branch instructions on modern hosts). Simply doing nothing isn't
an option as that would allow a malicious entity to subvert CET
protections via the emulator.
To detect instructions that are subject to IBT or affect IBT state, use
the existing IsBranch flag along with the source operand type to detect
indirect branches, and the existing NearBranch flag to detect far JMPs
and CALLs, all of which are effectively indirect. Explicitly check for
emulation of IRET, FAR RET (IMM), and SYSEXIT (the ret-like far branches)
instead of adding another flag, e.g. IsRet, as it's unlikely the emulator
will ever need to check for return-like instructions outside of this one
specific flow. Use an allow-list instead of a deny-list because (a) it's
a shorter list and (b) so that a missed entry gets a false positive, not a
false negative (i.e. reject emulation instead of clobbering CET state).
For Shadow Stacks, explicitly track instructions that directly affect the
current SSP, as KVM's emulator doesn't have existing flags that can be
used to precisely detect such instructions. Alternatively, the em_xxx()
helpers could directly check for ShadowStack interactions, but using a
dedicated flag is arguably easier to audit, and allows for handling both
IBT and SHSTK in one fell swoop.
Note! On far transfers, do NOT consult the current privilege level and
instead treat SHSTK/IBT as being enabled if they're enabled for User *or*
Supervisor mode. On inter-privilege level far transfers, SHSTK and IBT
can be in play for the target privilege level, i.e. checking the current
privilege could get a false negative, and KVM doesn't know the target
privilege level until emulation gets under way.
Note #2, FAR JMP from 64-bit mode to compatibility mode interacts with
the current SSP, but only to ensure SSP[63:32] == 0. Don't tag FAR JMP
as SHSTK, which would be rather confusing and would result in FAR JMP
being rejected unnecessarily the vast majority of the time (ignoring that
it's unlikely to ever be emulated). A future commit will add the #GP(0)
check for the specific FAR JMP scenario.
Note #3, task switches also modify SSP and so need to be rejected. That
too will be addressed in a future commit.
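A sketch of the detection logic; the IsBranch, NearBranch, and ShadowStack
flags come from the changelog, while the helper name is illustrative and the
explicit checks for the ret-like far branches (IRET, FAR RET, SYSEXIT) are
omitted for brevity:

  static bool emul_is_cet_sensitive(struct x86_emulate_ctxt *ctxt)
  {
          /* Instructions that directly update SSP, e.g. CALL/RET, are tagged. */
          if (ctxt->d & ShadowStack)
                  return true;

          if (!(ctxt->d & IsBranch))
                  return false;

          /* All far branches are effectively indirect and thus IBT-tracked. */
          if (!(ctxt->d & NearBranch))
                  return true;

          /* Near branches are IBT-tracked only if they're indirect. */
          return ctxt->src.type == OP_MEM || ctxt->src.type == OP_REG;
  }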
Suggested-by: Chao Gao <chao.gao@intel.com>
Originally-by: Yang Weijiang <weijiang.yang@intel.com>
Cc: Mathias Krause <minipli@grsecurity.net>
Cc: John Allen <john.allen@amd.com>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-19-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Save constant values to the HOST_{S_CET,SSP,INTR_SSP_TABLE} fields explicitly.
Kernel IBT is supported and the setting in MSR_IA32_S_CET is static post-boot
(the exception is the BIOS call case, but vCPU threads never cross that path),
so KVM doesn't need to refresh the HOST_S_CET field before every VM-Enter/
VM-Exit sequence.
Host supervisor shadow stacks are not currently enabled and SSP is not
accessible in kernel mode, thus it's safe to set the host IA32_INT_SSP_TAB/SSP
VMCS fields to 0. When shadow stacks are enabled for CPL3, SSP is reloaded
from PL3_SSP before returning to userspace. See SDM Vol 2A/B Chapters 3/4 for
SYSCALL/SYSRET/SYSENTER/SYSEXIT/RDSSP/CALL, etc.
Prevent KVM module loading if host supervisor shadow stack SHSTK_EN is set
in MSR_IA32_S_CET, as KVM cannot coexist with it correctly.
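A sketch under the assumption that the host's S_CET value is snapshotted once
at hardware setup (host_s_cet is an illustrative module-level variable) and
written into each VMCS as constant host state:

  /* Hardware setup: snapshot the host's static MSR_IA32_S_CET value. */
  if (boot_cpu_has(X86_FEATURE_SHSTK) || boot_cpu_has(X86_FEATURE_IBT))
          rdmsrl(MSR_IA32_S_CET, host_s_cet);

  /* KVM can't coexist with host supervisor shadow stacks. */
  if (host_s_cet & CET_SHSTK_EN)
          return -EIO;

  /* Constant host state, written once per VMCS rather than on every exit. */
  vmcs_writel(HOST_S_CET, host_s_cet);
  vmcs_writel(HOST_SSP, 0);
  vmcs_writel(HOST_INTR_SSP_TABLE, 0);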
Suggested-by: Sean Christopherson <seanjc@google.com>
Suggested-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Tested-by: Mathias Krause <minipli@grsecurity.net>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
[sean: snapshot host S_CET if SHSTK *or* IBT is supported]
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-18-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Disable interception for CET MSRs that can be accessed via XSAVES/XRSTORS,
and that exist according to CPUID, as accesses through XSTATE aren't subject
to MSR interception checks, i.e. can't be intercepted without intercepting
and emulating XSAVES/XRSTORS, and KVM doesn't support emulating
XSAVE/XRSTOR instructions.
Don't condition interception on the guest actually having XSAVES as there
is no benefit to intercepting the accesses (when the MSRs exist). The
MSRs in question are either context switched by the CPU on VM-Enter/VM-Exit
or by KVM via XSAVES/XRSTORS (KVM requires XSAVES to virtualize SHSTK),
i.e. KVM is going to load guest values into hardware irrespective of guest
XSAVES support.
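A sketch of the VMX side, keying interception off whether the MSRs exist per
guest CPUID (the "or" logic for U_CET is deliberate); exact helper placement
is an assumption:

  bool incpt;

  if (kvm_cpu_cap_has(X86_FEATURE_SHSTK)) {
          incpt = !guest_cpu_cap_has(vcpu, X86_FEATURE_SHSTK);

          vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL0_SSP, MSR_TYPE_RW, incpt);
          vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL1_SSP, MSR_TYPE_RW, incpt);
          vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL2_SSP, MSR_TYPE_RW, incpt);
          vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL3_SSP, MSR_TYPE_RW, incpt);
  }

  if (kvm_cpu_cap_has(X86_FEATURE_SHSTK) || kvm_cpu_cap_has(X86_FEATURE_IBT)) {
          /* U_CET exists if the guest has either SHSTK or IBT. */
          incpt = !guest_cpu_cap_has(vcpu, X86_FEATURE_SHSTK) &&
                  !guest_cpu_cap_has(vcpu, X86_FEATURE_IBT);

          vmx_set_intercept_for_msr(vcpu, MSR_IA32_U_CET, MSR_TYPE_RW, incpt);
  }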
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
Tested-by: Mathias Krause <minipli@grsecurity.net>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Xin Li (Intel) <xin@zytor.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-17-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Add an emulation interface for CET MSR accesses. The emulation code is split
into a common part and a vendor-specific part. The former performs common
checks for the MSRs, e.g., accessibility and data validity, then hands the
operation off to either the XSAVE-managed MSRs (via the helpers) or the CET
VMCS fields.
SSP can only be read via RDSSP; writing it requires destructive and
potentially faulting operations such as SAVEPREVSSP/RSTORSSP or
SETSSBSY/CLRSSBSY. Let the host use a pseudo-MSR that is just a wrapper
for the GUEST_SSP field of the VMCS.
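A sketch of the common validity checks applied before a write is forwarded to
the XSAVE-managed state or the VMCS; the helper name is illustrative, the
CET_* defines and canonical check are the existing kernel ones:

  static bool kvm_cet_msr_write_is_valid(struct kvm_vcpu *vcpu, u32 index, u64 data)
  {
          switch (index) {
          case MSR_IA32_U_CET:
          case MSR_IA32_S_CET:
                  /* No reserved bits; SUPPRESS and TRACKER (WAIT_ENDBR) are exclusive. */
                  return !(data & CET_RESERVED) &&
                         (data & (CET_SUPPRESS | CET_WAIT_ENDBR)) !=
                         (CET_SUPPRESS | CET_WAIT_ENDBR);
          case MSR_IA32_PL0_SSP ... MSR_IA32_PL3_SSP:
          case MSR_KVM_INTERNAL_GUEST_SSP:
                  /* Shadow stack pointers must be canonical and 4-byte aligned. */
                  return !(data & GENMASK_ULL(1, 0)) &&
                         !is_noncanonical_msr_address(data, vcpu);
          default:
                  return false;
          }
  }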
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
Tested-by: Mathias Krause <minipli@grsecurity.net>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
[sean: drop call to kvm_set_xstate_msr() for S_CET, consolidate code]
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-15-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Add a KVM-defined ONE_REG register, KVM_REG_GUEST_SSP, to let userspace
save and restore the guest's Shadow Stack Pointer (SSP). On both Intel
and AMD, SSP is a hardware register that can only be accessed by software
via dedicated ISA (e.g. RDSSP) or via VMCS/VMCB fields (used by hardware
to context switch SSP at entry/exit). As a result, SSP doesn't fit in
any of KVM's existing interfaces for saving/restoring state.
Internally, treat SSP as a fake/synthetic MSR, as the semantics of writes
to SSP follow that of several other Shadow Stack MSRs, e.g. the PLx_SSP
MSRs. Use a translation layer to hide the KVM-internal MSR index so that
the arbitrary index doesn't become ABI, e.g. so that KVM can rework its
implementation as needed, so long as the ONE_REG ABI is maintained.
Explicitly reject accesses to SSP if the vCPU doesn't have Shadow Stack
support to avoid running afoul of ignore_msrs, which unfortunately applies
to host-initiated accesses (which is a discussion for another day). I.e.
ensure consistent behavior for KVM-defined registers irrespective of
ignore_msrs.
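A hedged userspace sketch of saving/restoring SSP via the new register; the
exact register ID encoding (shown here as a hypothetical KVM_X86_REG_KVM()
wrapper around KVM_REG_GUEST_SSP) is defined by the uAPI headers:

  __u64 ssp;
  struct kvm_one_reg reg = {
          .id   = KVM_X86_REG_KVM(KVM_REG_GUEST_SSP),
          .addr = (__u64)&ssp,
  };

  if (ioctl(vcpu_fd, KVM_GET_ONE_REG, &reg))
          err(1, "KVM_GET_ONE_REG(GUEST_SSP)");

  /* ... later, e.g. on the destination of a migration ... */
  if (ioctl(vcpu_fd, KVM_SET_ONE_REG, &reg))
          err(1, "KVM_SET_ONE_REG(GUEST_SSP)");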
Link: https://lore.kernel.org/all/aca9d389-f11e-4811-90cf-d98e345a5cc2@intel.com
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
Tested-by: Mathias Krause <minipli@grsecurity.net>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-14-seanjc@google.com
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Control-flow Enforcement Technology (CET) is a CPU feature used to prevent
Return/Call/Jump-Oriented Programming (ROP/COP/JOP) attacks. It provides two
sub-features (SHSTK and IBT) to defend against ROP/COP/JOP-style control-flow
subversion attacks.
Shadow Stack (SHSTK):
A shadow stack is a second stack used exclusively for control transfer
operations. The shadow stack is separate from the data/normal stack and
can be enabled individually in user and kernel mode. When shadow stack
is enabled, CALL pushes the return address on both the data and shadow
stack. RET pops the return address from both stacks and compares them.
If the return addresses from the two stacks do not match, the processor
generates a #CP.
Indirect Branch Tracking (IBT):
IBT introduces a new instruction (ENDBRANCH) to mark valid target addresses
of indirect branches (CALL, JMP, etc.). If an indirect branch is executed
and the next instruction is _not_ an ENDBRANCH, the processor generates a
#CP. The ENDBRANCH instruction behaves as a NOP on platforms without CET.
Several new CET MSRs are defined to support CET:
MSR_IA32_{U,S}_CET: CET settings for {user,supervisor} CET respectively.
MSR_IA32_PL{0,1,2,3}_SSP: SHSTK pointer linear address for CPL{0,1,2,3}.
MSR_IA32_INT_SSP_TAB: Linear address of the SHSTK pointer table, whose
entries are indexed by the IST field of the interrupt gate descriptor.
Two XSAVES state bits are introduced for CET:
IA32_XSS:[bit 11]: Control saving/restoring user mode CET states
IA32_XSS:[bit 12]: Control saving/restoring supervisor mode CET states.
Six VMCS fields are introduced for CET:
{HOST,GUEST}_S_CET: Stores CET settings for kernel mode.
{HOST,GUEST}_SSP: Stores current active SSP.
{HOST,GUEST}_INTR_SSP_TABLE: Stores current active MSR_IA32_INT_SSP_TAB.
On Intel platforms, two additional bits are defined in VM_EXIT and VM_ENTRY
control fields:
If VM_EXIT_LOAD_CET_STATE = 1, host CET states are loaded from the following
VMCS fields at VM-Exit:
HOST_S_CET
HOST_SSP
HOST_INTR_SSP_TABLE
If VM_ENTRY_LOAD_CET_STATE = 1, guest CET states are loaded from the following
VMCS fields at VM-Entry:
GUEST_S_CET
GUEST_SSP
GUEST_INTR_SSP_TABLE
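For reference, a sketch of the corresponding definitions; the bit positions
and VMCS encodings below reflect the SDM as I understand it and should be
cross-checked against the actual vmx.h changes:

  /* New VM-Entry/VM-Exit controls. */
  #define VM_ENTRY_LOAD_CET_STATE         0x00100000      /* bit 20 */
  #define VM_EXIT_LOAD_CET_STATE          0x10000000      /* bit 28 */

  /* New natural-width VMCS fields. */
  GUEST_S_CET             = 0x00006828,
  GUEST_SSP               = 0x0000682a,
  GUEST_INTR_SSP_TABLE    = 0x0000682c,
  HOST_S_CET              = 0x00006c18,
  HOST_SSP                = 0x00006c1a,
  HOST_INTR_SSP_TABLE     = 0x00006c1c,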
Co-developed-by: Zhang Yi Z <yi.z.zhang@linux.intel.com>
Signed-off-by: Zhang Yi Z <yi.z.zhang@linux.intel.com>
Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Tested-by: Mathias Krause <minipli@grsecurity.net>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-13-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Load the guest's FPU state if userspace is accessing MSRs whose values
are managed by XSAVES. Introduce two helpers, kvm_{get,set}_xstate_msr(),
to facilitate access to such MSRs.
If MSRs supported in kvm_caps.supported_xss are passed through to the guest,
the guest MSR values are swapped with the host's before the vCPU exits to
userspace and after it reenters the kernel prior to the next VM-Entry.
Because the modified code is also used for the KVM_GET_MSRS device ioctl(),
explicitly check @vcpu is non-null before attempting to load guest state.
The XSAVE-managed MSRs cannot be retrieved via the device ioctl() without
loading guest FPU state (which doesn't exist).
Note that guest_cpuid_has() is not queried as host userspace is allowed to
access MSRs that have not been exposed to the guest, e.g. it might do
KVM_SET_MSRS prior to KVM_SET_CPUID2.
The two helpers are placed here to make it explicit that accessing
XSAVE-managed MSRs requires special checks and handling to guarantee correct
reads and writes of the MSRs.
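A sketch of the host-access path, assuming an is_xstate_managed_msr()
predicate (illustrative name); kvm_load_guest_fpu()/kvm_put_guest_fpu() are
the existing internal helpers in x86.c:

  static int do_one_msr_access(struct kvm_vcpu *vcpu, u32 index, u64 *data,
                               int (*do_msr)(struct kvm_vcpu *, u32, u64 *))
  {
          /*
           * @vcpu is NULL for the KVM_GET_MSRS device ioctl(), in which case
           * there is no guest FPU state to load.
           */
          bool load_fpu = vcpu && is_xstate_managed_msr(index);
          int r;

          if (load_fpu)
                  kvm_load_guest_fpu(vcpu);

          r = do_msr(vcpu, index, data);

          if (load_fpu)
                  kvm_put_guest_fpu(vcpu);

          return r;
  }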
Co-developed-by: Yang Weijiang <weijiang.yang@intel.com>
Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Tested-by: Mathias Krause <minipli@grsecurity.net>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
[sean: drop S_CET, add big comment, move accessors to x86.c]
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Xin Li (Intel) <xin@zytor.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-10-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Maintain per-guest valid XSS bits and check XSS validity against them
rather than against KVM capabilities. This is to prevent bits that are
supported by KVM but not supported for a guest from being set.
Opportunistically return KVM_MSR_RET_UNSUPPORTED on IA32_XSS MSR accesses
if guest CPUID doesn't enumerate X86_FEATURE_XSAVES. Since
KVM_MSR_RET_UNSUPPORTED takes care of host_initiated cases, drop the
host_initiated check.
Signed-off-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Enable KVM_{G,S}ET_ONE_REG uAPIs so that userspace can access MSRs and
other non-MSR registers through them, along with support for
KVM_GET_REG_LIST to enumerate support for KVM-defined registers.
This is in preparation for allowing userspace to read/write the guest SSP
register, which is needed for the upcoming CET virtualization support.
Currently, two types of registers are supported: KVM_X86_REG_TYPE_MSR and
KVM_X86_REG_TYPE_KVM. All MSRs are in the former type; the latter type is
added for registers that lack existing KVM uAPIs to access them. The "KVM"
in the name is intended to be vague to give KVM flexibility to include
other potential registers. More precise names like "SYNTHETIC" and
"SYNTHETIC_MSR" were considered, but were deemed too confusing (e.g. can
be conflated with synthetic guest-visible MSRs) and may put KVM into a
corner (e.g. if KVM wants to change how a KVM-defined register is modeled
internally).
Enumerate only KVM-defined registers in KVM_GET_REG_LIST to avoid
duplicating KVM_GET_MSR_INDEX_LIST, and so that KVM can return _only_
registers that are fully supported (KVM_GET_REG_LIST is vCPU-scoped, i.e.
can be precise, whereas KVM_GET_MSR_INDEX_LIST is system-scoped).
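A userspace sketch of enumerating the KVM-defined registers, assuming
KVM_GET_REG_LIST follows the usual two-call pattern (first call fails with
E2BIG and reports the required count):

  struct kvm_reg_list hdr = { .n = 0 };
  struct kvm_reg_list *list;

  /* The first call reports how many register IDs will be returned. */
  if (ioctl(vcpu_fd, KVM_GET_REG_LIST, &hdr) && errno != E2BIG)
          err(1, "KVM_GET_REG_LIST");

  list = calloc(1, sizeof(*list) + hdr.n * sizeof(list->reg[0]));
  list->n = hdr.n;
  if (ioctl(vcpu_fd, KVM_GET_REG_LIST, list))
          err(1, "KVM_GET_REG_LIST");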
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
Link: https://lore.kernel.org/all/20240219074733.122080-18-weijiang.yang@intel.com [1]
Tested-by: Mathias Krause <minipli@grsecurity.net>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Merge the queue of KVM selftests changes for 6.18 to pick up the ex_str()
helper so that it can be used to pretty print expected versus actual
exceptions in a new MSR selftest. CET virtualization will add support for
several MSRs with non-trivial semantics, along with new uAPI for accessing
the guest's Shadow Stack Pointer (SSP) from userspace.
Merge the queue of SVM changes for 6.18 to pick up the KVM-defined GHCB
helpers so that kvm_ghcb_get_xss() can be used to virtualize CET for
SEV-ES+ guests.
Use __kvm_set_xcr() to propagate XCR0 changes from the GHCB to KVM's
software model in order to validate the new XCR0 against KVM's view of
the supported XCR0. Allowing garbage is thankfully mostly benign, as
kvm_load_{guest,host}_xsave_state() bail early for vCPUs with protected
state, xstate_required_size() will simply provide garbage back to the
guest, and attempting to save/restore the bad value via KVM_{G,S}ET_XCRS
will only harm the guest (setting XCR0 will fail).
However, allowing the guest to put junk into a field that KVM assumes is
valid is a CVE waiting to happen. And as a bonus, using the proper API
eliminates the ugly open coding of setting arch.cpuid_dynamic_bits_dirty.
Simply ignore bad values, as either the guest managed to get an
unsupported value into hardware, or the guest is misbehaving and providing
pure garbage. In either case, KVM can't fix the broken guest.
Note, using __kvm_set_xcr() also avoids recomputing dynamic CPUID bits
if XCR0 isn't actually changing (relative to KVM's previous snapshot).
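A minimal sketch of the fix, assuming a kvm_ghcb_xcr0_is_valid() helper in the
style of the accessors added earlier in this series:

  if (kvm_ghcb_xcr0_is_valid(svm)) {
          /*
           * Validate the value against KVM's view of supported XCR0 and
           * recompute dynamic CPUID bits only if XCR0 actually changes.
           * A failed (i.e. unsupported) write is simply ignored.
           */
          __kvm_set_xcr(vcpu, XCR_XFEATURE_ENABLED_MASK, kvm_ghcb_get_xcr0(svm));
  }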
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Fixes: 291bd20d5d ("KVM: SVM: Add initial support for a VMGEXIT VMEXIT")
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Wrap all reads of GHCB save fields with READ_ONCE() via a KVM-specific
GHCB get() utility to help guard against TOCTOU bugs. Using READ_ONCE()
doesn't completely prevent such bugs, e.g. doesn't prevent KVM from
redoing get() after checking the initial value, but at least addresses
all potential TOCTOU issues in the current KVM code base.
To prevent unintentional use of the generic helpers, take only @svm for
the kvm_ghcb_get_xxx() helpers and retrieve the ghcb instead of explicitly
passing it in.
Opportunistically reduce the indentation of the macro-defined helpers and
clean up the alignment.
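A sketch of the macro-built accessors; the real macro also defines the
_is_valid() helpers and may differ in minor details:

  #define DEFINE_KVM_GHCB_ACCESSORS(field)                                      \
          static __always_inline u64 kvm_ghcb_get_##field(struct vcpu_svm *svm) \
          {                                                                      \
                  return READ_ONCE(svm->sev_es.ghcb->save.field);                \
          }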
Fixes: 4e15a0ddc3 ("KVM: SEV: snapshot the GHCB before accessing it")
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Rename kvm_ghcb_get_sw_exit_code() to kvm_get_cached_sw_exit_code() to make
it clear that KVM is getting the cached value, not reading directly from
the guest-controlled GHCB. More importantly, vacating
kvm_ghcb_get_sw_exit_code() will allow adding a KVM-specific macro-built
kvm_ghcb_get_##field() helper to read values from the GHCB.
No functional change intended.
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reduce the number of unavailable PMU event mask combinations that are
tested by the PMU counters test. In reality, testing every possible
combination isn't all that interesting, and certainly not worth the tens
of seconds (or worse, minutes) of runtime. Fully testing the N^2 space
will be especially problematic in the near future, as 5! new arch events
are on their way.
Use alternating bit patterns (and 0 and -1u) in the hopes that _if_ there
is ever a KVM bug, it's not something horribly convoluted that shows up
only with a super specific pattern/value.
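A sketch of the reduced set of masks, assuming the test iterates an explicit
array rather than the full 2^N sweep (the exact values and array name are
assumptions):

  /* Alternating bit patterns, plus all-zeros and all-ones. */
  static const uint32_t unavailable_masks[] = {
          0x0, 0x55555555, 0xaaaaaaaa, 0x33333333, 0xcccccccc, ~0u,
  };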
Reported-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20250919214648.1585683-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>