Apple SoC DTS updates for 6.18, part 2
- New device trees for all M2 Pro, Max and Ultra models are added.
This is responsible for most of the changed lines since we already
need 2000+ lines just to describe all the power domains inside
t602x-pmgr.dtsi for these SoCs.
- Missing WiFi properties for t600x are added.
- Bluetooth nodes are added for all t600x machines.
- The PCIe ethernet iommu-map was fixed for the Apple M1 iMac
to account for a disabled PCIe port.
- SPMI, NVMe, SART and mailbox nodes are added for Apple's T2 and A11.
* tag 'apple-soc-dt-6.18-part2' of https://git.kernel.org/pub/scm/linux/kernel/git/sven/linux:
arm64: dts: apple: t8015: Add SPMI node
arm64: dts: apple: t8012: Add SPMI node
arm64: dts: apple: Add J180d (Mac Pro, M2 Ultra, 2023) device tree
arm64: dts: apple: Add J474s, J475c and J475d device trees
arm64: dts: apple: Add J414 and J416 Macbook Pro device trees
arm64: dts: apple: Add initial t6020/t6021/t6022 DTs
arm64: dts: apple: Add ethernet0 alias for J375 template
dt-bindings: arm: apple: Add t6020x compatibles
arm64: dts: apple: t8015: Add NVMe nodes
arm64: dts: apple: t8015: Fix PCIE power domains dependencies
arm64: dts: apple: Add devicetree for t8112-j415
dt-bindings: arm: apple: Add t8112 j415 compatible
arm64: dts: apple: t600x: Add bluetooth device nodes
arm64: dts: apple: t600x: Add missing WiFi properties
arm64: dts: apple: t8103-j457: Fix PCIe ethernet iommu-map
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
ARM: dts: ti: omap updates for v6.18
These are all minor corrections to the dts files.
* tag 'omap-for-v6.18/dt-signed' of git://git.kernel.org/pub/scm/linux/kernel/git/khilman/linux-omap:
ARM: dts: omap: am335x-cm-t335: Remove unused mcasp num-serializer property
ARM: dts: ti: omap: omap3-devkit8000-lcd: Fix ti,keep-vref-on property to use correct boolean syntax in DTS
ARM: dts: ti: omap: am335x-baltos: Fix ti,en-ck32k-xtal property in DTS to use correct boolean syntax
ARM: dts: omap: Minor whitespace cleanup
ARM: dts: omap: dm816x: Split 'reg' per entry
ARM: dts: omap: dm814x: Split 'reg' per entry
ARM: dts: am33xx-l4: fix UART compatible
ARM: dts: ti: omap4: Use generic "ethernet" as node name
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
The "apm,xgene2-pcie" compatible is unused, undocumented, and in the
wrong position in the compatible list. Given this is a mature and little
used platform, just remove the compatible rather than fix the order and
document it.
Signed-off-by: Rob Herring (Arm) <robh@kernel.org>
Link: https://lore.kernel.org/r/20250919161529.1293151-1-robh@kernel.org
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Amlogic ARM64 DT for v6.18:
- Add cache information to the Amlogic SoCs
- Add RTC node for Amlogic C3 SoC
- Fix PWM node for Amlogic C3 SoC
- Remove UHS capability for Odroid-C2 SDCard
* tag 'amlogic-arm64-dt-for-v6.18' of https://git.kernel.org/pub/scm/linux/kernel/git/amlogic/linux:
arm64: dts: amlogic: gxbb-odroidc2: remove UHS capability for SD card
dts: arm: amlogic: fix pwm node for c3
arm64: dts: amlogic: sm1-bananapi: lower SD card speed for stability
arm64: dts: amlogic: Add cache information to the Amlogic T7 SoC
arm64: dts: amlogic: Add cache information to the Amlogic S922X SoC
arm64: dts: amlogic: Add cache information to the Amlogic S7 SoC
arm64: dts: amlogic: Add cache information to the Amlogic C3 SoC
arm64: dts: amlogic: Add cache information to the Amlogic A4 SoC
arm64: dts: amlogic: Add cache information to the Amlogic A1 SoC
arm64: dts: amlogic: Add cache information to the Amlogic GXM SoCS
arm64: dts: amlogic: Add cache information to the Amlogic AXG SoCS
arm64: dts: amlogic: Add cache information to the Amlogic G12A SoCS
arm64: dts: amlogic: Add cache information to the Amlogic SM1 SoC
arm64: dts: amlogic: Add cache information to the Amlogic GXBB and GXL SoC
arm64: dts: amlogic: C3: Add RTC controller node
Link: https://lore.kernel.org/r/d40e7e96-4a7c-4e4f-b36f-750c6525b95c@linaro.org
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Another missing supply and a wrong headphone gpio level.
* tag 'v6.17-rockchip-dtsfixes2' of git://git.kernel.org/pub/scm/linux/kernel/git/mmind/linux-rockchip:
arm64: dts: rockchip: Fix the headphone detection on the orangepi 5
arm64: dts: rockchip: Add vcc supply for SPI Flash on NanoPC-T6
arm64: dts: rockchip: fix second M.2 slot on ROCK 5T
arm64: dts: rockchip: fix USB on RADXA ROCK 5T
arm64: dts: rockchip: Add vcc-supply to SPI flash on Pinephone Pro
arm64: dts: rockchip: fix es8388 address on rk3588s-roc-pc
arm64: dts: rockchip: Fix Bluetooth interrupts flag on Neardi LBA3368
arm64: dts: rockchip: correct network description on Sige5
arm64: dts: rockchip: Minor whitespace cleanup
ARM: dts: rockchip: Minor whitespace cleanup
arm64: dts: rockchip: Add supplies for eMMC on rk3588-orangepi-5
arm64: dts: rockchip: Fix the headphone detection on the orangepi 5 plus
arm64: dts: rockchip: Add vcc-supply to SPI flash on rk3399-pinebook-pro
arm64: dts: rockchip: mark eeprom as read-only for Radxa E52C
Allwinner Device Tree changes for 6.18
This tag contains two DT binding header changes that are shared with
the clk tree.
In this cycle we gained support for the MCU PRCM clock and reset
controller on the A523/A527/T527 family of SoCs, the NPU which is a
Vivante GC9000 IP block, and the NPU clock that was missing. The other
PRCM clock controller gained default bus clock rate settings. These
were not configured in the upstream U-Boot bootloader, leading to them
running at slower rates. The assigned rates are from the user manual.
There is also a new board, the NetCube Systems Nagami SoM and two of
its carrier boards.
The A523 family development boards now have their internal RTC clocks
configured correctly, so that the RTC does not drift wildly. The missing
functions for the AXP717 on these boards are added. Missing reset GPIOs
and delays for Ethernet PHYs are added. Last, the Cubie A5E now has its
LEDs described and usable.
An overlay for the Orange Pi Zero interface (addon) board was added.
This can be used with the Orange Pi Zero and Zero Plus 2. Default audio
routing for these two boards (to be used with the addon) was added to
complement the overlay.
* tag 'sunxi-dt-for-6.18' of https://git.kernel.org/pub/scm/linux/kernel/git/sunxi/linux:
arm64: dts: allwinner: sun55i: Complete AXP717A sub-functions
arm64: dts: allwinner: t527: orangepi-4a: hook up external 32k crystal
arm64: dts: allwinner: t527: avaota-a1: hook up external 32k crystal
arm64: dts: allwinner: a527: cubie-a5e: Drop external 32.768 KHz crystal
arm64: dts: sun55i: a523: Assign standard clock rates to PRCM bus clocks
ARM: dts: sunxi: add support for NetCube Systems Nagami Keypad Carrier
ARM: dts: sunxi: add support for NetCube Systems Nagami Basic Carrier
ARM: dts: sunxi: add support for NetCube Systems Nagami SoM
riscv: dts: allwinner: d1s-t113: Add pinctrl's required by NetCube Systems Nagami SoM
dt-bindings: arm: sunxi: Add NetCube Systems Nagami SoM and carrier board bindings
ARM: dts: allwinner: Add Orange Pi Zero Interface Board overlay
ARM: dts: allwinner: orangepi-zero-plus2: Add default audio routing
ARM: dts: allwinner: orangepi-zero: Add default audio routing
arm64: dts: allwinner: a523: Add NPU device node
arm64: dts: allwinner: a523: Add MCU PRCM CCU node
dt-bindings: clock: sun55i-a523-ccu: Add A523 MCU CCU clock controller
dt-bindings: clock: sun55i-a523-ccu: Add missing NPU module clock
arm64: dts: allwinner: t527: avaota-a1: Add ethernet PHY reset setting
arm64: dts: allwinner: a527: cubie-a5e: Add ethernet PHY reset setting
arm64: dts: allwinner: a527: cubie-a5e: Add LEDs
Link: https://lore.kernel.org/r/aMrtuZg8HlR--TAt@wens.tw
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
In the __arch_prepare_bpf_trampoline() function, retval_off is only
meaningful when save_ret is true, so the current logic is correct.
However, in the original logic, retval_off is only initialized under
certain conditions; for example, in the fmod_ret logic, the compiler is
not aware that the flags of the fmod_ret program (prog) have set
BPF_TRAMP_F_CALL_ORIG, which results in an uninitialized symbol
compilation warning.
So initialize retval_off unconditionally to fix it.
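The pattern behind the fix can be sketched in isolation. This is a minimal illustration, not the trampoline code itself: compute_retval_off() and the 16-byte slot size are hypothetical names and values chosen for the example.

```c
#include <stdbool.h>

/*
 * Hypothetical helper, not the kernel's code: computes where the return
 * value slot would sit in a trampoline stack frame.
 */
int compute_retval_off(bool save_ret, int *stack_size)
{
	int retval_off;

	if (save_ret)
		*stack_size += 16;	/* assumed size of the return-value slot */

	/*
	 * Store on every path. The offset only matters when save_ret is true,
	 * but an unconditional assignment leaves no path on which a later read
	 * could see an uninitialized value.
	 */
	retval_off = *stack_size;

	return retval_off;
}
```

The value is still only consumed when save_ret is true; the unconditional store just removes the path the compiler cannot prove unreachable.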
Signed-off-by: Chenghao Duan <duanchenghao@kylinos.cn>
Reviewed-by: Pu Lehui <pulehui@huawei.com>
Link: https://lore.kernel.org/r/20250922062244.822937-2-duanchenghao@kylinos.cn
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Add support for signed loads from arena which are internally converted
to loads with mode set BPF_PROBE_MEM32SX by the verifier. The
implementation is similar to BPF_PROBE_MEMSX and BPF_MEMSX but for
BPF_PROBE_MEM32SX, arena_vm_base is added to the src register to form
the address.
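As a rough model of the semantics described above (not the JIT's emitted code), a sign-extending arena load adds the arena base to the 32-bit source offset and widens the loaded value with its sign. The function name arena_load_sx() and the byte-array stand-in for arena memory are assumptions made for this sketch.

```c
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical model of a sign-extending arena load. "arena_vm_base" points
 * at a buffer standing in for the arena, and "src" is the 32-bit arena
 * offset held in the source register.
 */
int64_t arena_load_sx(const uint8_t *arena_vm_base, uint32_t src,
		      unsigned int size)
{
	const uint8_t *addr = arena_vm_base + src;	/* base + src forms the address */
	int8_t b;
	int16_t h;
	int32_t w;

	switch (size) {
	case 1:
		memcpy(&b, addr, sizeof(b));
		return b;	/* implicit widening sign-extends */
	case 2:
		memcpy(&h, addr, sizeof(h));
		return h;
	case 4:
		memcpy(&w, addr, sizeof(w));
		return w;
	default:
		return 0;	/* unsupported size in this sketch */
	}
}
```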
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Link: https://lore.kernel.org/r/20250923110157.18326-3-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Currently, signed load instructions into arena memory are unsupported.
The compiler is free to generate these, and on GCC-14 we see a
corresponding error when it happens. The hurdle in supporting them is
deciding which unused opcode to use to mark them for the JIT's own
consumption. After much thinking, it appears 0xc0 / BPF_NOSPEC can be
combined with load instructions to identify signed arena loads. Use
this to recognize and JIT them appropriately, and remove the verifier
side limitation on the program if the JIT supports them.
Co-developed-by: Puranjay Mohan <puranjay@kernel.org>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Link: https://lore.kernel.org/r/20250923110157.18326-2-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Make CR4.CET a guest-owned bit under VMX by extending
KVM_POSSIBLE_CR4_GUEST_BITS accordingly.
There's no need to intercept changes to CR4.CET, as it's neither
included in KVM's MMU role bits, nor does KVM specifically care about
the actual value of a (nested) guest's CR4.CET, aside from enforcing
architectural constraints, i.e. making sure that CR0.WP=1 if
CR4.CET=1.
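The architectural constraint mentioned above can be written as a one-line predicate. This is a sketch only: cet_wp_combo_valid() is a hypothetical helper, and the bit positions (CR0.WP at bit 16, CR4.CET at bit 23) are defined locally for the example.

```c
#include <stdbool.h>
#include <stdint.h>

#define X86_CR0_WP	(1ULL << 16)
#define X86_CR4_CET	(1ULL << 23)

/* Hypothetical helper: CR4.CET=1 is only legal while CR0.WP=1. */
bool cet_wp_combo_valid(uint64_t cr0, uint64_t cr4)
{
	return !(cr4 & X86_CR4_CET) || (cr0 & X86_CR0_WP);
}
```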
Intercepting writes to CR4.CET is particularly bad for grsecurity
kernels with KERNEXEC or, even worse, KERNSEAL enabled. These features
heavily make use of read-only kernel objects and use a cpu-local CR0.WP
toggle to override it, when needed. Under a CET-enabled kernel, this
also requires toggling CR4.CET, hence the motivation to make it
guest-owned.
Using the old test from [1] gives the following runtime numbers (perf
stat -r 5 ssdd 10 50000):
* grsec guest on linux-6.16-rc5 + cet patches:
2.4647 +- 0.0706 seconds time elapsed ( +- 2.86% )
* grsec guest on linux-6.16-rc5 + cet patches + CR4.CET guest-owned:
1.5648 +- 0.0240 seconds time elapsed ( +- 1.53% )
Leaving CR4.CET unintercepted not only makes the test run ~35% faster,
it also makes it more stable, with less fluctuation due to fewer VMEXITs.
Therefore, make CR4.CET a guest-owned bit where possible.
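For reference, the speedup figure follows from the two mean runtimes quoted above. The helper below is a hypothetical illustration of that arithmetic; with 2.4647 s and 1.5648 s as inputs it yields a reduction of roughly 36.5%, in line with the approximate figure in the text.

```c
/* Relative reduction in elapsed time, as a fraction of the baseline. */
double relative_speedup(double baseline_s, double improved_s)
{
	return (baseline_s - improved_s) / baseline_s;
}
```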
This change is VMX-specific, as SVM has no such fine-grained control
register intercept control.
If KVM's assumptions regarding MMU role handling wrt. a guest's CR4.CET
value ever change, the BUILD_BUG_ON()s related to KVM_MMU_CR4_ROLE_BITS
and KVM_POSSIBLE_CR4_GUEST_BITS will catch that early.
Link: https://lore.kernel.org/kvm/20230322013731.102955-1-minipli@grsecurity.net/ [1]
Reviewed-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: Mathias Krause <minipli@grsecurity.net>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-52-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Add {HV,CP,SX}_VECTOR definitions for AMD's Hypervisor Injection Exception,
VMM Communication Exception, and SVM Security Exception vectors, along with
human friendly formatting for trace_kvm_inj_exception().
Note, KVM is all but guaranteed to never observe or inject #SX, and #HV is
also likely to go unused. Add the architectural collateral mostly for
completeness, and on the off chance that hardware goes off the rails.
Link: https://lore.kernel.org/r/20250919223258.1604852-44-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Synchronize XSS from the GHCB to KVM's internal tracking if the guest
marks XSS as valid on a #VMGEXIT. Like XCR0, KVM needs an up-to-date copy
of XSS in order to compute the required XSTATE size when emulating
CPUID.0xD.0x1 for the guest.
Treat the incoming XSS change as an emulated write, i.e. validate the
guest-provided value, to avoid letting the guest load garbage into KVM's
tracking. Simply ignore bad values, as either the guest managed to get an
unsupported value into hardware, or the guest is misbehaving and providing
pure garbage. In either case, KVM can't fix the broken guest.
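A simplified model of that flow, under stated assumptions: struct vcpu_model, sync_xss_from_ghcb() and the supported_xss mask are hypothetical stand-ins, and the real code routes the value through the MSR write path rather than a direct store.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical vCPU state: the tracked XSS and the mask of supported bits. */
struct vcpu_model {
	uint64_t xss;
	uint64_t supported_xss;
};

/*
 * Returns true if the guest-provided value was accepted. Values the guest
 * did not mark valid, or that set unsupported bits, are silently ignored
 * and leave the tracked XSS untouched.
 */
bool sync_xss_from_ghcb(struct vcpu_model *v, bool xss_valid, uint64_t ghcb_xss)
{
	if (!xss_valid)
		return false;

	if (ghcb_xss & ~v->supported_xss)
		return false;	/* bad value: ignore, nothing to fix up */

	v->xss = ghcb_xss;
	return true;
}
```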
Explicitly allow access to XSS at all times, as KVM needs to ensure its
copy of XSS stays up-to-date. E.g. KVM supports migration of SEV-ES guests
and so needs to allow the host to save/restore XSS, otherwise a guest
that *knows* its XSS hasn't changed could get stale/bad CPUID emulation if
the guest doesn't provide XSS in the GHCB on every exit. This creates a
hypothetical problem where a guest could request emulation of RDMSR or
WRMSR on XSS, but arguably that's not even a problem, e.g. it would be
entirely reasonable for a guest to request "emulation" as a way to inform
the hypervisor that its XSS value has been modified.
Note, emulating the change as an MSR write also takes care of side effects,
e.g. marking dynamic CPUID bits as dirty.
Suggested-by: John Allen <john.allen@amd.com>
base-commit: 14298d819d5a6b7180a4089e7d2121ca3551dc6c
Link: https://lore.kernel.org/r/20250919223258.1604852-40-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Transfer the three CET Shadow Stack VMCB fields (S_CET, ISST_ADDR, and
SSP) on VMRUN, #VMEXIT, and loading nested state (saving nested state
simply copies the entire save area). SVM doesn't provide a way to
disallow L1 from enabling Shadow Stacks for L2, i.e. KVM *must* provide
nested support before advertising SHSTK to userspace.
Link: https://lore.kernel.org/r/20250919223258.1604852-37-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Advertise the LOAD_CET_STATE VM-Entry/Exit control bits in the nested VMX
MSRs, as all nested support for CET virtualization, including consistency
checks, is in place.
Advertise support if and only if KVM supports at least one of IBT or SHSTK.
While it's userspace's responsibility to provide a consistent CPU model to
the guest, that doesn't mean KVM should set userspace up to fail.
Note, the existing {CLEAR,LOAD}_BNDCFGS behavior predates
KVM_X86_QUIRK_STUFF_FEATURE_MSRS, i.e. KVM "solved" the inconsistent CPU
model problem by overwriting the VMX MSRs provided by userspace.
Signed-off-by: Chao Gao <chao.gao@intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-35-seanjc@google.com
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Introduce consistency checks for CET states during nested VM-entry.
A VMCS contains both guest and host CET states, each comprising the
IA32_S_CET MSR, SSP, and IA32_INTERRUPT_SSP_TABLE_ADDR MSR. Various
checks are applied to CET states during VM-entry as documented in SDM
Vol3 Chapter "VM ENTRIES". Implement all these checks during nested
VM-entry to emulate the architectural behavior.
In summary, there are three kinds of checks on guest/host CET states
during VM-entry:
A. Checks applied to both guest states and host states:
* The IA32_S_CET field must not set any reserved bits; bits 10 (SUPPRESS)
and 11 (TRACKER) cannot both be set.
* SSP should not have bits 1:0 set.
* The IA32_INTERRUPT_SSP_TABLE_ADDR field must be canonical.
B. Checks applied to host states only:
* IA32_S_CET MSR and SSP must be canonical if the CPU enters 64-bit mode
after VM-exit. Otherwise, IA32_S_CET and SSP must have their higher 32
bits cleared.
C. Checks applied to guest states only:
* IA32_S_CET MSR and SSP are not required to be canonical (i.e., bits
63:N-1 identical, where N is the CPU's maximum linear-address width).
However, bits 63:N of SSP must be identical.
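The three groups of checks can be sketched as standalone predicates. This is an illustration under assumptions, not KVM's helpers: the function names are hypothetical, the reserved-bit mask (bits 9:6) is an assumption about the IA32_S_CET layout, and vaddr_bits stands for N, the maximum linear-address width.

```c
#include <stdbool.h>
#include <stdint.h>

#define CET_RESERVED_BITS	(0xfULL << 6)	/* assumed: bits 9:6 */
#define CET_SUPPRESS		(1ULL << 10)
#define CET_TRACKER		(1ULL << 11)

/* Canonical means bits 63:N-1 are all zeros or all ones. */
static bool is_canonical(uint64_t addr, unsigned int vaddr_bits)
{
	uint64_t upper = addr >> (vaddr_bits - 1);

	return upper == 0 || upper == (UINT64_MAX >> (vaddr_bits - 1));
}

/* A. Checks shared by guest and host state. */
bool cet_common_ok(uint64_t s_cet, uint64_t ssp, uint64_t ssp_tbl,
		   unsigned int vaddr_bits)
{
	if (s_cet & CET_RESERVED_BITS)
		return false;
	if ((s_cet & CET_SUPPRESS) && (s_cet & CET_TRACKER))
		return false;
	if (ssp & 0x3)
		return false;
	return is_canonical(ssp_tbl, vaddr_bits);
}

/* B. Host only: canonical in 64-bit mode, else upper 32 bits clear. */
bool cet_host_ok(uint64_t s_cet, uint64_t ssp, bool host_64bit,
		 unsigned int vaddr_bits)
{
	if (host_64bit)
		return is_canonical(s_cet, vaddr_bits) &&
		       is_canonical(ssp, vaddr_bits);
	return !(s_cet >> 32) && !(ssp >> 32);
}

/* C. Guest only: bits 63:N of SSP must be identical. */
bool cet_guest_ssp_ok(uint64_t ssp, unsigned int vaddr_bits)
{
	uint64_t upper = ssp >> vaddr_bits;

	return upper == 0 || upper == (UINT64_MAX >> vaddr_bits);
}
```

Note the difference between B and C: an SSP with only bit N-1 set fails the canonical check but still passes the guest check, since bits 63:N are all zero.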
Tested-by: Mathias Krause <minipli@grsecurity.net>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-34-seanjc@google.com
[sean: have common helper return 0/-EINVAL, not true/false]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Set up CET MSRs, related VM_ENTRY/EXIT control bits and fixed CR4 setting
to enable CET for nested VM.
vmcs12 and vmcs02 need to be synced when L2 exits to L1 or when L1 wants
to resume L2, that way correct CET states can be observed by one another.
Please note that consistency checks regarding CET state during VM-Entry
will be added later to prevent this patch from becoming too large.
Advertising the new CET VM_ENTRY/EXIT control bits is also deferred
until after the consistency checks are added.
Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
Tested-by: Mathias Krause <minipli@grsecurity.net>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Xin Li (Intel) <xin@zytor.com>
Tested-by: Xin Li (Intel) <xin@zytor.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-32-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Swap the order between configuring nested VMX capabilities and base CPU
capabilities, so that nested VMX support can be conditioned on core KVM
support, e.g. to allow conditioning support for LOAD_CET_STATE on the
presence of IBT or SHSTK. Because the sanity checks on nested VMX config
performed by vmx_check_processor_compat() run _after_ vmx_hardware_setup(),
any use of kvm_cpu_cap_has() when configuring nested VMX support will lead
to failures in vmx_check_processor_compat().
While swapping the order of two (or more) configuration flows can lead to
a game of whack-a-mole, in this case nested support inarguably should be
done after base support. KVM should never condition base support on nested
support, because nested support is fully optional, while obviously it's
desirable to condition nested support on base support. And there's zero
evidence the current ordering was intentional, e.g. commit 66a6950f99
("KVM: x86: Introduce kvm_cpu_caps to replace runtime CPUID masking")
likely placed the call to kvm_set_cpu_caps() after nested setup because it
looked pretty.
Reviewed-by: Chao Gao <chao.gao@intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-30-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Add support for the LOAD_CET_STATE VM-Enter and VM-Exit controls, the
CET XFEATURE bits in XSS, and advertise support for IBT and SHSTK to
userspace. Explicitly clear IBT and SHSTK on SVM, as additional work is
needed to enable CET on SVM, e.g. to context switch S_CET and other state.
Disable KVM's CET support if unrestricted_guest is unsupported/disabled, as
KVM does not support emulating CET and running without Unrestricted Guest
can result in KVM emulating large swaths of guest code. While it's highly
unlikely any guest will trigger emulation while also utilizing IBT or
SHSTK, there's zero reason to allow CET without Unrestricted Guest as that
combination should only be possible when explicitly disabling
unrestricted_guest for testing purposes.
Disable CET if VMX_BASIC[bit56] == 0, i.e. if hardware strictly enforces
the presence of an Error Code based on exception vector, as attempting to
inject a #CP with an Error Code (#CP architecturally has an Error Code)
will fail due to the #CP vector historically not having an Error Code.
Clear S_CET and SSP-related VMCS on "reset" to emulate the architectural
behavior of CET MSRs and SSP being reset to 0 after RESET, power-up and INIT. Note,
KVM already clears guest CET state that is managed via XSTATE in
kvm_xstate_reset().
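The two VMX-side preconditions described above (Unrestricted Guest, and VMX_BASIC bit 56 being set) can be combined into one gate. This is a sketch: cet_allowed_on_vmx() and the VMX_BASIC_BIT56 macro are hypothetical names for this example, not KVM's identifiers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit 56 of VMX_BASIC, per the text above; the macro name is local to this sketch. */
#define VMX_BASIC_BIT56	(1ULL << 56)

/* Hypothetical gate: may CET be exposed on this VMX host? */
bool cet_allowed_on_vmx(uint64_t vmx_basic, bool unrestricted_guest)
{
	/* Without Unrestricted Guest, KVM may emulate large swaths of guest code. */
	if (!unrestricted_guest)
		return false;

	/* Bit 56 clear: hardware would reject injecting #CP with an error code. */
	if (!(vmx_basic & VMX_BASIC_BIT56))
		return false;

	return true;
}
```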
Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
Signed-off-by: Mathias Krause <minipli@grsecurity.net>
Tested-by: Mathias Krause <minipli@grsecurity.net>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
[sean: move some bits to separate patches, massage changelog]
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-29-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Make IBT and SHSTK virtualization mutually exclusive with "officially"
supporting setups with guest.MAXPHYADDR < host.MAXPHYADDR, i.e. if the
allow_smaller_maxphyaddr module param is set. Running a guest with a
smaller MAXPHYADDR requires intercepting #PF, and can also trigger
emulation of arbitrary instructions. Intercepting and reacting to #PFs
doesn't play nice with SHSTK, as KVM's MMU hasn't been taught to handle
Shadow Stack accesses, and emulating arbitrary instructions doesn't play
nice with IBT or SHSTK, as KVM's emulator doesn't handle the various side
effects, e.g. doesn't enforce end-branch markers or model Shadow Stack
updates.
Note, hiding IBT and SHSTK based solely on allow_smaller_maxphyaddr is
overkill, as allow_smaller_maxphyaddr is only problematic if the guest is
actually configured to have a smaller MAXPHYADDR. However, KVM's ABI
doesn't provide a way to express that IBT and SHSTK may break if enabled
in conjunction with guest.MAXPHYADDR < host.MAXPHYADDR. I.e. the
alternative is to do nothing in KVM and instead update documentation and
hope KVM users are thorough readers. Go with the conservative-but-correct
approach; worst case scenario, this restriction can be dropped if there's
a strong use case for enabling CET on hosts with allow_smaller_maxphyaddr.
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-28-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Initialize allow_smaller_maxphyaddr during hardware setup as soon as KVM
knows whether or not TDP will be utilized. To avoid having to teach KVM's
emulator all about CET, KVM's upcoming CET virtualization support will be
mutually exclusive with allow_smaller_maxphyaddr, i.e. will disable SHSTK
and IBT if allow_smaller_maxphyaddr is enabled.
In general, allow_smaller_maxphyaddr should be initialized as soon as
possible since it's globally visible while its only input is whether or
not EPT/NPT is enabled. I.e. there's effectively zero risk of setting
allow_smaller_maxphyaddr too early, and substantial risk of setting it
too late.
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20250922184743.1745778-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Make TDP a hard requirement for Shadow Stacks, as there are no plans to
add Shadow Stack support to the Shadow MMU. E.g. KVM hasn't been taught
to understand the magic Writable=0,Dirty=1 combination that is required
for Shadow Stack accesses, and so enabling Shadow Stacks when using
shadow paging will put the guest into an infinite #PF loop (KVM thinks the
shadow page tables have a valid mapping, hardware says otherwise).
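Taken together with the allow_smaller_maxphyaddr restriction described earlier, the capability filtering amounts to the following. struct cet_caps and filter_cet_caps() are hypothetical names used only for this sketch.

```c
#include <stdbool.h>

/* Hypothetical view of the two CET features KVM may advertise. */
struct cet_caps {
	bool ibt;
	bool shstk;
};

struct cet_caps filter_cet_caps(struct cet_caps hw, bool tdp_enabled,
				bool allow_smaller_maxphyaddr)
{
	struct cet_caps caps = hw;

	/* Shadow Stacks need TDP; the shadow MMU has no support for them. */
	if (!tdp_enabled)
		caps.shstk = false;

	/* Smaller guest MAXPHYADDR implies #PF interception and emulation. */
	if (allow_smaller_maxphyaddr) {
		caps.ibt = false;
		caps.shstk = false;
	}

	return caps;
}
```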
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-27-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Add CET_KERNEL and CET_USER to KVM's set of supported XSS bits when IBT
*or* SHSTK is supported. Like CR4.CET, XFEATURE support for IBT and SHSTK
is bundled together under the CET umbrella, and thus prone to
virtualization holes if KVM or the guest supports only one of IBT or SHSTK,
but hardware supports both. However, again like CR4.CET, such
virtualization holes are benign from the host's perspective so long as KVM
takes care to always honor the "or" logic.
Require CET_KERNEL and CET_USER to come as a pair, and refuse to support
IBT or SHSTK if one (or both) features is missing, as the (host) kernel
expects them to come as a pair, i.e. may get confused and corrupt state if
only one of CET_KERNEL or CET_USER is supported.
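The pairing rule and the "or" logic can be modeled as below. This is a sketch: filter_cet_xss() is a hypothetical helper, and the XSS bit positions (CET_USER at bit 11, CET_KERNEL at bit 12) are defined locally for the example.

```c
#include <stdbool.h>
#include <stdint.h>

#define XFEATURE_MASK_CET_USER		(1ULL << 11)
#define XFEATURE_MASK_CET_KERNEL	(1ULL << 12)
#define XFEATURE_MASK_CET_ALL \
	(XFEATURE_MASK_CET_USER | XFEATURE_MASK_CET_KERNEL)

/*
 * Hypothetical filter: returns the XSS mask to support, and clears *ibt and
 * *shstk if the CET XFEATURE bits do not come as a pair.
 */
uint64_t filter_cet_xss(uint64_t host_xss, bool *ibt, bool *shstk)
{
	bool have_pair = (host_xss & XFEATURE_MASK_CET_ALL) ==
			 XFEATURE_MASK_CET_ALL;

	if (!have_pair) {
		*ibt = false;
		*shstk = false;
	}

	/* Keep both bits if either feature survives; drop both otherwise. */
	if (!*ibt && !*shstk)
		return host_xss & ~XFEATURE_MASK_CET_ALL;

	return host_xss;
}
```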
Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
Signed-off-by: Mathias Krause <minipli@grsecurity.net>
Tested-by: Mathias Krause <minipli@grsecurity.net>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
[sean: split to separate patch, write changelog, add XFEATURE_MASK_CET_ALL]
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-26-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Unconditionally forward XSAVES/XRSTORS VM-Exits from L2 to L1, as KVM
doesn't utilize the XSS-bitmap (KVM relies on controlling the XSS value
in hardware to prevent unauthorized access to XSAVES state). KVM always
loads vmcs02 with vmcs12's bitmap, and so any exit _must_ be due to
vmcs12's XSS-bitmap.
Drop the comment about XSS never being non-zero in anticipation of
enabling CET_KERNEL and CET_USER support.
Opportunistically WARN if XSAVES is not enabled for L2, as the CPU is
supposed to generate #UD before checking the XSS-bitmap.
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-25-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Drop X86_CR4_CET from CR4_RESERVED_BITS and instead mark CET as reserved
if and only if IBT *and* SHSTK are unsupported, i.e. allow CR4.CET to be
set if IBT or SHSTK is supported. This creates a virtualization hole if
the CPU supports both IBT and SHSTK, but the kernel or vCPU model only
supports one of the features. However, it's entirely legal for a CPU to
have only one of IBT or SHSTK, i.e. the hole is a flaw in the architecture,
not in KVM.
More importantly, so long as KVM is careful to initialize and context
switch both IBT and SHSTK state (when supported in hardware) if either
feature is exposed to the guest, a misbehaving guest can only harm itself.
E.g. VMX initializes host CET VMCS fields based solely on hardware
capabilities.
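In code form, the "if and only if" rule reads as follows. The helper cr4_reserved_bits() is a hypothetical illustration, with CR4.CET at bit 23 defined locally.

```c
#include <stdbool.h>
#include <stdint.h>

#define X86_CR4_CET	(1ULL << 23)

/*
 * Hypothetical helper: CR4.CET is reserved only when neither IBT nor SHSTK
 * is supported, i.e. the guest may set it if at least one feature exists.
 */
uint64_t cr4_reserved_bits(uint64_t base_reserved, bool ibt, bool shstk)
{
	if (!ibt && !shstk)
		base_reserved |= X86_CR4_CET;

	return base_reserved;
}
```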
Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
Signed-off-by: Mathias Krause <minipli@grsecurity.net>
Tested-by: Mathias Krause <minipli@grsecurity.net>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
[sean: split to separate patch, write changelog]
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-24-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Add PK (Protection Keys), SS (Shadow Stacks), and SGX (Software Guard
Extensions) to the set of #PF error flags handled via
kvm_mmu_trace_pferr_flags. While KVM doesn't expect PK or SS #PFs in
particular, pretty printing their names instead of the raw hex value saves
the user from having to go spelunking in the SDM to figure out what's
going on.
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-23-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Add PFERR_SS_MASK, a.k.a. Shadow Stack access, and WARN if KVM attempts to
check permissions for a Shadow Stack access as KVM hasn't been taught to
understand the magic Writable=0,Dirty=1 combination that is required for
Shadow Stack accesses, and likely will never learn. There are no plans to
support Shadow Stacks with the Shadow MMU, and the emulator rejects all
instructions that affect Shadow Stacks, i.e. it should be impossible for
KVM to observe a #PF due to a shadow stack access.
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-22-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Emulate the Shadow Stack restriction that the current SSP must be a 32-bit
value on a FAR JMP from 64-bit mode to compatibility mode. From the SDM's
pseudocode for FAR JMP:
  IF ShadowStackEnabled(CPL)
    IF (IA32_EFER.LMA and DEST(segment selector).L) = 0
      (* If target is legacy or compatibility mode then the SSP must be in low 4GB *)
      IF (SSP & 0xFFFFFFFF00000000 != 0); THEN
        #GP(0);
      FI;
    FI;
  FI;
Note, only the current CPL needs to be considered, as FAR JMP can't be
used for inter-privilege level transfers, and KVM rejects emulation of all
other far branch instructions when Shadow Stacks are enabled.
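The SDM pseudocode above translates fairly directly into a predicate. This is a sketch with hypothetical names; the caller is assumed to have already evaluated ShadowStackEnabled(CPL) and to know the L bit of the destination code segment.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical predicate: would a FAR JMP take #GP(0) because of the SSP?
 * shstk_enabled stands for ShadowStackEnabled(CPL) in the pseudocode.
 */
bool far_jmp_ssp_faults(bool shstk_enabled, bool efer_lma, bool dest_cs_l,
			uint64_t ssp)
{
	if (!shstk_enabled)
		return false;

	/* Target is 64-bit mode: no restriction on SSP. */
	if (efer_lma && dest_cs_l)
		return false;

	/* Legacy or compatibility mode target: SSP must be in the low 4GB. */
	return (ssp >> 32) != 0;
}
```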
To give the emulator access to GUEST_SSP, special case handling
MSR_KVM_INTERNAL_GUEST_SSP in emulator_get_msr() to treat the access as a
host access (KVM doesn't allow guest accesses to internal "MSRs"). The
->get_msr() API is only used for implicit accesses from the emulator, i.e.
is only used with hardcoded MSR indices, and so any access to
MSR_KVM_INTERNAL_GUEST_SSP is guaranteed to be from KVM, i.e. not from the
guest via RDMSR.
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-21-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Exit to userspace with KVM_INTERNAL_ERROR_EMULATION if the guest triggers
task switch emulation with Indirect Branch Tracking or Shadow Stacks
enabled, as attempting to do the right thing would require non-trivial
effort and complexity, KVM doesn't support emulating CET generally, and
it's extremely unlikely that any guest will do task switches while also
utilizing CET. Defer taking on the complexity until someone cares enough
to put in the time and effort to add support.
Per the SDM:
  If shadow stack is enabled, then the SSP of the task is located at the
  4 bytes at offset 104 in the 32-bit TSS and is used by the processor to
  establish the SSP when a task switch occurs from a task associated with
  this TSS. Note that the processor does not write the SSP of the task
  initiating the task switch to the TSS of that task, and instead the SSP
  of the previous task is pushed onto the shadow stack of the new task.
Note, per the SDM's pseudocode on TASK SWITCHING, IBT state for the new
privilege level is updated. To keep things simple, check both S_CET and
U_CET (again, anyone that wants more precise checking can have the honor
of implementing support).
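The coarse check described above might look like the sketch below. Assumptions are labeled: task_switch_needs_exit() is a hypothetical name, the enable bits (SH_STK_EN at bit 0, ENDBR_EN at bit 2) are defined locally, and gating on CR4.CET is an assumption of this example rather than something the text states.

```c
#include <stdbool.h>
#include <stdint.h>

#define X86_CR4_CET	(1ULL << 23)
#define CET_SHSTK_EN	(1ULL << 0)
#define CET_ENDBR_EN	(1ULL << 2)

/*
 * Hypothetical check: should task switch emulation bail to userspace?
 * Both S_CET and U_CET are consulted, without regard to the current or
 * target privilege level.
 */
bool task_switch_needs_exit(uint64_t cr4, uint64_t s_cet, uint64_t u_cet)
{
	/* Assumption: CET is inactive altogether while CR4.CET=0. */
	if (!(cr4 & X86_CR4_CET))
		return false;

	return ((s_cet | u_cet) & (CET_SHSTK_EN | CET_ENDBR_EN)) != 0;
}
```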
Reported-by: Binbin Wu <binbin.wu@linux.intel.com>
Closes: https://lore.kernel.org/all/819bd98b-2a60-4107-8e13-41f1e4c706b1@linux.intel.com
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-20-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Don't emulate branch instructions, e.g. CALL/RET/JMP etc., that are
affected by Shadow Stacks and/or Indirect Branch Tracking when said
features are enabled in the guest, as fully emulating CET would require
significant complexity for no practical benefit (KVM shouldn't need to
emulate branch instructions on modern hosts). Simply doing nothing isn't
an option as that would allow a malicious entity to subvert CET
protections via the emulator.
To detect instructions that are subject to IBT or affect IBT state, use
the existing IsBranch flag along with the source operand type to detect
indirect branches, and the existing NearBranch flag to detect far JMPs
and CALLs, all of which are effectively indirect. Explicitly check for
emulation of IRET, FAR RET (IMM), and SYSEXIT (the ret-like far branches)
instead of adding another flag, e.g. IsRet, as it's unlikely the emulator
will ever need to check for return-like instructions outside of this one
specific flow. Use an allow-list instead of a deny-list because (a) it's
a shorter list and (b) so that a missed entry gets a false positive, not a
false negative (i.e. reject emulation instead of clobbering CET state).
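As an illustration of the classification described above, here is a simplified sketch. The enums and struct are hypothetical stand-ins for the emulator's decode metadata (they are not the emulator's real types), and the set of source operand types is deliberately reduced to four.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-ins for the emulator's decode metadata. */
enum src_type { SRC_NONE, SRC_IMM, SRC_REG, SRC_MEM };
enum op_kind  { OP_OTHER, OP_IRET, OP_RET_FAR, OP_RET_FAR_IMM, OP_SYSEXIT };

struct insn {
	bool is_branch;		/* models the IsBranch flag */
	bool near_branch;	/* models the NearBranch flag */
	enum src_type src;	/* models the source operand type */
	enum op_kind kind;	/* which handler would emulate the instruction */
};

/* Returns true if the instruction is subject to IBT or affects IBT state. */
static bool insn_affects_ibt(const struct insn *i)
{
	if (!i->is_branch)
		return false;

	/* Far transfers: all count, except the explicitly listed ret-like ones. */
	if (!i->near_branch)
		return i->kind != OP_IRET && i->kind != OP_RET_FAR &&
		       i->kind != OP_RET_FAR_IMM && i->kind != OP_SYSEXIT;

	/*
	 * Near branches: allow-list the direct forms.  Anything not listed
	 * is treated as indirect, so a missed entry rejects emulation
	 * (false positive) rather than clobbering CET state.
	 */
	switch (i->src) {
	case SRC_NONE:
	case SRC_IMM:
		return false;
	default:
		return true;
	}
}
```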
For Shadow Stacks, explicitly track instructions that directly affect the
current SSP, as KVM's emulator doesn't have existing flags that can be
used to precisely detect such instructions. Alternatively, the em_xxx()
helpers could directly check for ShadowStack interactions, but using a
dedicated flag is arguably easier to audit, and allows for handling both
IBT and SHSTK in one fell swoop.
Note! On far transfers, do NOT consult the current privilege level and
instead treat SHSTK/IBT as being enabled if they're enabled for User *or*
Supervisor mode. On inter-privilege level far transfers, SHSTK and IBT
can be in play for the target privilege level, i.e. checking the current
privilege could get a false negative, and KVM doesn't know the target
privilege level until emulation gets under way.
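The far-transfer rule can be sketched as follows. The helper and its parameters are hypothetical. The assumption that non-far instructions consult only the MSR for the current privilege level (U_CET at CPL3, S_CET otherwise) is part of the sketch, not something stated above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Architectural enable bits in IA32_U_CET and IA32_S_CET. */
#define CET_SHSTK_EN (1ULL << 0)
#define CET_ENDBR_EN (1ULL << 2)

/*
 * Hypothetical helper: is the given CET feature (CET_SHSTK_EN or
 * CET_ENDBR_EN) considered enabled for the instruction being emulated?
 */
static bool cet_feature_enabled(uint64_t u_cet, uint64_t s_cet, int cpl,
				bool far_transfer, uint64_t feature_bit)
{
	/* Far transfers may change privilege level, so check User *or* Supervisor. */
	if (far_transfer)
		return ((u_cet | s_cet) & feature_bit) != 0;

	/* Assumption: otherwise only the current privilege level matters. */
	return ((cpl == 3 ? u_cet : s_cet) & feature_bit) != 0;
}
```

With SHSTK enabled only for User mode, a far transfer emulated at CPL0 is still treated as SHSTK-affected, which avoids the false negative described above.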
Note #2, FAR JMP from 64-bit mode to compatibility mode interacts with
the current SSP, but only to ensure SSP[63:32] == 0. Don't tag FAR JMP
as SHSTK, which would be rather confusing and would result in FAR JMP
being rejected unnecessarily the vast majority of the time (ignoring that
it's unlikely to ever be emulated). A future commit will add the #GP(0)
check for the specific FAR JMP scenario.
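For reference, the SSP[63:32] == 0 condition amounts to a test of the upper half of SSP. A hypothetical sketch, assuming the check only applies when shadow stacks are enabled (the helper name is made up for illustration and says nothing about how the future commit implements it):

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: returns true if a FAR JMP from 64-bit mode to
 * compatibility mode should take #GP(0) because the current SSP does
 * not fit in 32 bits.
 */
static bool far_jmp_to_compat_faults(bool shstk_enabled, uint64_t ssp)
{
	return shstk_enabled && (ssp >> 32) != 0;
}
```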
Note #3, task switches also modify SSP and so need to be rejected. That
too will be addressed in a future commit.
Suggested-by: Chao Gao <chao.gao@intel.com>
Originally-by: Yang Weijiang <weijiang.yang@intel.com>
Cc: Mathias Krause <minipli@grsecurity.net>
Cc: John Allen <john.allen@amd.com>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-19-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Save constant values to the HOST_{S_CET,SSP,INTR_SSP_TABLE} fields explicitly.
Kernel IBT is supported and the setting in MSR_IA32_S_CET is static after
boot (the exception is the BIOS call case, but a vCPU thread never crosses
it), so KVM doesn't need to refresh the HOST_S_CET field before every
VM-Enter/VM-Exit sequence.
Host supervisor shadow stack is not enabled now and SSP is not accessible
to kernel mode, thus it's safe to set the host IA32_INT_SSP_TAB/SSP VMCS
fields to 0. When shadow stack is enabled for CPL3, SSP is reloaded from
PL3_SSP before it exits to userspace. See SDM Vol 2A/B Chapter 3/4 for
SYSCALL/SYSRET/SYSENTER/SYSEXIT/RDSSP/CALL etc.
Prevent KVM module loading if host supervisor shadow stack SHSTK_EN is set
in MSR_IA32_S_CET, as KVM cannot co-exist with it correctly.
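The load-time policy can be sketched as below. The struct, the function name, and the -EIO return value are assumptions made for illustration. The caller is assumed to have already read MSR_IA32_S_CET.

```c
#include <errno.h>
#include <stdint.h>

/* Architectural enable bits in IA32_S_CET. */
#define CET_SHSTK_EN (1ULL << 0)
#define CET_ENDBR_EN (1ULL << 2)

/* Hypothetical container for the constant host values written to the VMCS. */
struct host_cet_snapshot {
	uint64_t s_cet;
	uint64_t ssp;
	uint64_t intr_ssp_table;
};

/*
 * Hypothetical helper: refuse to load if the host uses supervisor shadow
 * stacks, otherwise record the static S_CET value and zero the SSP fields.
 */
static int snapshot_host_cet(uint64_t host_s_cet, struct host_cet_snapshot *snap)
{
	if (host_s_cet & CET_SHSTK_EN)
		return -EIO;	/* error code is an assumption of this sketch */

	snap->s_cet = host_s_cet;	/* static after boot, e.g. kernel IBT */
	snap->ssp = 0;			/* host supervisor SHSTK is not in use */
	snap->intr_ssp_table = 0;
	return 0;
}
```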
Suggested-by: Sean Christopherson <seanjc@google.com>
Suggested-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Tested-by: Mathias Krause <minipli@grsecurity.net>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
[sean: snapshot host S_CET if SHSTK *or* IBT is supported]
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-18-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Disable interception for CET MSRs that can be accessed via XSAVES/XRSTORS,
and that exist according to CPUID, as accesses through XSTATE aren't subject
to MSR interception checks, i.e. can't be intercepted without intercepting
and emulating XSAVES/XRSTORS, and KVM doesn't support emulating
XSAVE/XRSTOR instructions.
Don't condition interception on the guest actually having XSAVES as there
is no benefit to intercepting the accesses (when the MSRs exist). The
MSRs in question are either context switched by the CPU on VM-Enter/VM-Exit
or by KVM via XSAVES/XRSTORS (KVM requires XSAVES to virtualize SHSTK),
i.e. KVM is going to load guest values into hardware irrespective of guest
XSAVES support.
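A rough sketch of the resulting policy for the XSAVES-managed MSRs follows. The helper is hypothetical and models only IA32_U_CET and IA32_PL{0,1,2,3}_SSP. The sketch assumes that the PLx_SSP MSRs exist only with SHSTK and that U_CET exists with SHSTK or IBT. Note that guest XSAVES support is intentionally not an input.

```c
#include <stdbool.h>
#include <stdint.h>

/* Architectural MSR indices. */
#define MSR_IA32_U_CET		0x6a0
#define MSR_IA32_PL0_SSP	0x6a4
#define MSR_IA32_PL3_SSP	0x6a7

/*
 * Hypothetical helper: should accesses to the given MSR be intercepted?
 * MSRs that are not modeled here stay intercepted in this sketch.
 */
static bool cet_msr_needs_intercept(uint32_t msr, bool guest_shstk, bool guest_ibt)
{
	/* PL0_SSP..PL3_SSP are contiguous and exist only with SHSTK. */
	if (msr >= MSR_IA32_PL0_SSP && msr <= MSR_IA32_PL3_SSP)
		return !guest_shstk;

	/* U_CET exists if the guest has either CET sub-feature. */
	if (msr == MSR_IA32_U_CET)
		return !guest_shstk && !guest_ibt;

	return true;
}
```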
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
Tested-by: Mathias Krause <minipli@grsecurity.net>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Xin Li (Intel) <xin@zytor.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-17-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Add an emulation interface for CET MSR accesses. The emulation code is split
into a common part and a vendor-specific part. The former performs common
checks for MSRs, e.g. accessibility, data validity, etc., then passes the
operation to either the XSAVE-managed MSRs via the helpers or the CET VMCS
fields.
SSP can only be read via RDSSP. Writing it requires destructive and
potentially faulting operations such as SAVEPREVSSP/RSTORSSP or
SETSSBSY/CLRSSBSY. Let the host use a pseudo-MSR that is just a wrapper
for the GUEST_SSP field of the VMCS.
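As one example of what a common "data validity" check might look like for SSP-like values, the sketch below rejects values that are not 4-byte aligned or not canonical. The helper name is hypothetical. Both the exact set of checks and the use of 48-bit canonical addresses (i.e. no LA57) are assumptions of this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: returns true if a value is acceptable as a shadow
 * stack pointer, assuming 48-bit linear addresses.
 */
static bool ssp_value_is_valid(uint64_t ssp)
{
	/* Bits 63:47 must be all zeros or all ones (canonical address). */
	uint64_t top = ssp >> 47;

	if (ssp & 0x3)
		return false;	/* must be 4-byte aligned */

	return top == 0 || top == 0x1ffff;
}
```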
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
Tested-by: Mathias Krause <minipli@grsecurity.net>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
[sean: drop call to kvm_set_xstate_msr() for S_CET, consolidate code]
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-15-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Add a KVM-defined ONE_REG register, KVM_REG_GUEST_SSP, to let userspace
save and restore the guest's Shadow Stack Pointer (SSP). On both Intel
and AMD, SSP is a hardware register that can only be accessed by software
via dedicated ISA (e.g. RDSSP) or via VMCS/VMCB fields (used by hardware
to context switch SSP at entry/exit). As a result, SSP doesn't fit in
any of KVM's existing interfaces for saving/restoring state.
Internally, treat SSP as a fake/synthetic MSR, as the semantics of writes
to SSP follow that of several other Shadow Stack MSRs, e.g. the PLx_SSP
MSRs. Use a translation layer to hide the KVM-internal MSR index so that
the arbitrary index doesn't become ABI, e.g. so that KVM can rework its
implementation as needed, so long as the ONE_REG ABI is maintained.
Explicitly reject accesses to SSP if the vCPU doesn't have Shadow Stack
support to avoid running afoul of ignore_msrs, which unfortunately applies
to host-initiated accesses (which is a discussion for another day). I.e.
ensure consistent behavior for KVM-defined registers irrespective of
ignore_msrs.
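The translation layer can be pictured roughly as follows. All constants and names below are hypothetical placeholders (they are not the real ABI value or KVM's internal MSR index), and the -EINVAL error code is likewise an assumption.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical values, chosen only for this sketch. */
#define SKETCH_REG_GUEST_SSP	0x0ULL		/* userspace-visible register id */
#define SKETCH_INTERNAL_MSR_SSP	0xdead0001U	/* internal synthetic MSR index */

/*
 * Hypothetical helper: translate a ONE_REG id into the internal MSR index.
 * Userspace only ever sees the register id, so the internal index can
 * change without breaking ABI.
 */
static int one_reg_to_internal_msr(uint64_t reg_id, bool vcpu_has_shstk,
				   uint32_t *msr)
{
	switch (reg_id) {
	case SKETCH_REG_GUEST_SSP:
		/* Reject explicitly, independent of ignore_msrs-style behavior. */
		if (!vcpu_has_shstk)
			return -EINVAL;
		*msr = SKETCH_INTERNAL_MSR_SSP;
		return 0;
	default:
		return -EINVAL;
	}
}
```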
Link: https://lore.kernel.org/all/aca9d389-f11e-4811-90cf-d98e345a5cc2@intel.com
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
Tested-by: Mathias Krause <minipli@grsecurity.net>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-14-seanjc@google.com
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Control-flow Enforcement Technology (CET) is a CPU feature used to prevent
Return/Call/Jump-Oriented Programming (ROP/COP/JOP) attacks. It provides
two sub-features (SHSTK and IBT) to defend against ROP/COP/JOP-style
control-flow subversion attacks.
Shadow Stack (SHSTK):
A shadow stack is a second stack used exclusively for control transfer
operations. The shadow stack is separate from the data/normal stack and
can be enabled individually in user and kernel mode. When shadow stack
is enabled, CALL pushes the return address on both the data and shadow
stack. RET pops the return address from both stacks and compares them.
If the return addresses from the two stacks do not match, the processor
generates a #CP.
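The CALL/RET behavior can be modeled with a toy simulation. This is purely illustrative: the struct and helpers are hypothetical, both stacks are modeled as arrays sharing one depth counter, and a false return from toy_ret() stands in for the #CP that hardware would raise.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define STACK_DEPTH 16

/* Toy model: a data stack and a shadow stack that grow in lockstep. */
struct toy_cpu {
	uint64_t data[STACK_DEPTH];
	uint64_t shadow[STACK_DEPTH];
	size_t depth;
};

/* CALL: push the return address on both stacks.  False if out of space. */
static bool toy_call(struct toy_cpu *c, uint64_t ret_addr)
{
	if (c->depth == STACK_DEPTH)
		return false;

	c->data[c->depth] = ret_addr;
	c->shadow[c->depth] = ret_addr;
	c->depth++;
	return true;
}

/*
 * RET: pop from both stacks and compare.  Returns false on a mismatch
 * (where hardware would raise #CP) or if the toy stacks are empty.
 */
static bool toy_ret(struct toy_cpu *c, uint64_t *ret_addr)
{
	if (c->depth == 0)
		return false;

	c->depth--;
	*ret_addr = c->data[c->depth];
	return c->data[c->depth] == c->shadow[c->depth];
}
```

Overwriting an entry in data[] between toy_call() and toy_ret(), as a ROP-style attack on the normal stack would, makes toy_ret() fail because the shadow copy no longer matches.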
Indirect Branch Tracking (IBT):
IBT introduces an instruction (ENDBRANCH) to mark valid target addresses of
indirect branches (CALL, JMP, etc.). If an indirect branch is executed
and the next instruction is _not_ an ENDBRANCH, the processor generates
a #CP. This instruction behaves as a NOP on platforms that have no CET.
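The IBT rule can be illustrated with a small check on the bytes at an indirect branch target. The helper is hypothetical and only considers the 64-bit form, ENDBR64 (encoded as F3 0F 1E FA). The 32-bit form, ENDBR32, differs in the last byte (FB) and is ignored here.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Encoding of ENDBR64. */
static const uint8_t endbr64[4] = { 0xf3, 0x0f, 0x1e, 0xfa };

/*
 * Hypothetical helper: returns true if the instruction at code[target] is
 * ENDBR64, i.e. an indirect branch to it would not raise #CP under IBT.
 */
static bool indirect_target_ok(const uint8_t *code, size_t len, size_t target)
{
	/* The target must leave room for a full 4-byte ENDBR64. */
	if (target > len || len - target < sizeof(endbr64))
		return false;

	return memcmp(code + target, endbr64, sizeof(endbr64)) == 0;
}
```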
Several new CET MSRs are defined to support CET:
MSR_IA32_{U,S}_CET: CET settings for {user,supervisor} CET respectively.
MSR_IA32_PL{0,1,2,3}_SSP: SHSTK pointer linear address for CPL{0,1,2,3}.
MSR_IA32_INT_SSP_TAB: Linear address of SHSTK pointer table, whose entry
is indexed by IST of interrupt gate desc.
Two XSAVES state bits are introduced for CET:
IA32_XSS:[bit 11]: Control saving/restoring user mode CET states.
IA32_XSS:[bit 12]: Control saving/restoring supervisor mode CET states.
Six VMCS fields are introduced for CET:
{HOST,GUEST}_S_CET: Stores CET settings for kernel mode.
{HOST,GUEST}_SSP: Stores current active SSP.
{HOST,GUEST}_INTR_SSP_TABLE: Stores current active MSR_IA32_INT_SSP_TAB.
On Intel platforms, two additional bits are defined in VM_EXIT and VM_ENTRY
control fields:
If VM_EXIT_LOAD_CET_STATE = 1, host CET states are loaded from the following
VMCS fields at VM-Exit:
HOST_S_CET
HOST_SSP
HOST_INTR_SSP_TABLE
If VM_ENTRY_LOAD_CET_STATE = 1, guest CET states are loaded from the following
VMCS fields at VM-Entry:
GUEST_S_CET
GUEST_SSP
GUEST_INTR_SSP_TABLE
Co-developed-by: Zhang Yi Z <yi.z.zhang@linux.intel.com>
Signed-off-by: Zhang Yi Z <yi.z.zhang@linux.intel.com>
Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Tested-by: Mathias Krause <minipli@grsecurity.net>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-13-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>