Pull x86 fixes from Ingo Molnar:
- Update the list of AMD microcode minimum Entrysign revisions
- Add additional fixed AMD RDSEED microcode revisions
- Update the language transliteration for Kiryl Shutsemau's name
in the MAINTAINERS entry
* tag 'x86-urgent-2025-11-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/microcode/AMD: Add Zen5 model 0x44, stepping 0x1 minrev
x86/CPU/AMD: Add additional fixed RDSEED microcode revisions
MAINTAINERS: Update name spelling
Pull bpf fixes from Alexei Starovoitov:
- Fix interaction between livepatch and BPF fexit programs (Song Liu)
With Steven and Masami acks.
- Fix stack ORC unwind from BPF kprobe_multi (Jiri Olsa)
With Steven and Masami acks.
- Fix out of bounds access in widen_imprecise_scalars() in the verifier
(Eduard Zingerman)
- Fix conflicts between MPTCP and BPF sockmap (Jiayuan Chen)
- Fix net_sched storage collision with BPF data_meta/data_end (Eric
Dumazet)
- Add _impl suffix to BPF kfuncs with implicit args to avoid breaking
them in bpf-next when KF_IMPLICIT_ARGS is added (Mykyta Yatsenko)
* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
selftests/bpf: Test widen_imprecise_scalars() with different stack depth
bpf: account for current allocated stack depth in widen_imprecise_scalars()
bpf: Add bpf_prog_run_data_pointers()
selftests/bpf: Add mptcp test with sockmap
mptcp: Fix proto fallback detection with BPF
mptcp: Disallow MPTCP subflows from sockmap
selftests/bpf: Add stacktrace ips test for raw_tp
selftests/bpf: Add stacktrace ips test for kprobe_multi/kretprobe_multi
x86/fgraph,bpf: Fix stack ORC unwind from kprobe_multi return probe
Revert "perf/x86: Always store regs->ip in perf_callchain_kernel()"
bpf: add _impl suffix for bpf_stream_vprintk() kfunc
bpf:add _impl suffix for bpf_task_work_schedule* kfuncs
selftests/bpf: Add tests for livepatch + bpf trampoline
ftrace: bpf: Fix IPMODIFY + DIRECT in modify_ftrace_direct()
ftrace: Fix BPF fexit with livepatch
Pull ACPI fixes from Rafael Wysocki:
"These fix issues in the ACPI CPPC library and in the recently added
parser for the ACPI MRRM table:
- Limit some checks in the ACPI CPPC library to online CPUs to avoid
accessing uninitialized per-CPU variables when some CPUs are
offline to start with, like during boot with 'nosmt=force' (Gautham
Shenoy)
- Rework add_boot_memory_ranges() in the ACPI MRRM table parser to
fix memory leaks and improve error handling (Kaushlendra Kumar)"
* tag 'acpi-6.18-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
ACPI: MRRM: Fix memory leaks and improve error handling
ACPI: CPPC: Limit perf ctrs in PCC check only to online CPUs
ACPI: CPPC: Perform fast check switch only for online CPUs
ACPI: CPPC: Check _CPC validity for only the online CPUs
ACPI: CPPC: Detect preferred core availability on online CPUs
Merge ACPI CPPC library fixes and an ACPI MRRM table parser fix for
6.18-rc6.
* acpi-cppc:
ACPI: CPPC: Limit perf ctrs in PCC check only to online CPUs
ACPI: CPPC: Perform fast check switch only for online CPUs
ACPI: CPPC: Check _CPC validity for only the online CPUs
ACPI: CPPC: Detect preferred core availability on online CPUs
* acpi-tables:
ACPI: MRRM: Fix memory leaks and improve error handling
Pull kvm fixes from Paolo Bonzini:
"Arm:
- Fix trapping regression when no in-kernel irqchip is present
- Check host-provided, untrusted ranges and offsets in pKVM
- Fix regression restoring the ID_PFR1_EL1 register
- Fix vgic ITS locking issues when LPIs are not directly injected
Arm selftests:
- Correct target CPU programming in vgic_lpi_stress selftest
- Fix exposure of SCTLR2_EL2 and ZCR_EL2 in get-reg-list selftest
RISC-V:
- Fix check for local interrupts on riscv32
- Read HGEIP CSR on the correct cpu when checking for IMSIC
interrupts
- Remove automatic I/O mapping from kvm_arch_prepare_memory_region()
x86:
- Inject #UD if the guest attempts to execute SEAMCALL or TDCALL as
KVM doesn't support virtualization the instructions, but the
instructions are gated only by VMXON. That is, they will VM-Exit
instead of taking a #UD and until now this resulted in KVM exiting
to userspace with an emulation error.
- Unload the "FPU" when emulating INIT of XSTATE features if and only
if the FPU is actually loaded, instead of trying to predict when
KVM will emulate an INIT (CET support missed the MP_STATE path).
Add sanity checks to detect and harden against similar bugs in the
future.
- Unregister KVM's GALog notifier (for AVIC) when kvm-amd.ko is
unloaded.
- Use a raw spinlock for svm->ir_list_lock as the lock is taken
during schedule(), and "normal" spinlocks are sleepable locks when
PREEMPT_RT=y.
- Remove guest_memfd bindings on memslot deletion when a gmem file is
dying to fix a use-after-free race found by syzkaller.
- Fix a goof in the EPT Violation handler where KVM checks the wrong
variable when determining if the reported GVA is valid.
- Fix and simplify the handling of LBR virtualization on AMD, which
was made buggy and unnecessarily complicated by nested VM support
Misc:
- Update Oliver's email address"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (28 commits)
KVM: nSVM: Fix and simplify LBR virtualization handling with nested
KVM: nSVM: Always recalculate LBR MSR intercepts in svm_update_lbrv()
KVM: SVM: Mark VMCB_LBR dirty when MSR_IA32_DEBUGCTLMSR is updated
MAINTAINERS: Switch myself to using kernel.org address
KVM: arm64: vgic-v3: Release reserved slot outside of lpi_xa's lock
KVM: arm64: vgic-v3: Reinstate IRQ lock ordering for LPI xarray
KVM: arm64: Limit clearing of ID_{AA64PFR0,PFR1}_EL1.GIC to userspace irqchip
KVM: arm64: Set ID_{AA64PFR0,PFR1}_EL1.GIC when GICv3 is configured
KVM: arm64: Make all 32bit ID registers fully writable
KVM: VMX: Fix check for valid GVA on an EPT violation
KVM: guest_memfd: Remove bindings on memslot deletion when gmem is dying
KVM: SVM: switch to raw spinlock for svm->ir_list_lock
KVM: SVM: Make avic_ga_log_notifier() local to avic.c
KVM: SVM: Unregister KVM's GALog notifier on kvm-amd.ko exit
KVM: SVM: Initialize per-CPU svm_data at the end of hardware setup
KVM: x86: Call out MSR_IA32_S_CET is not handled by XSAVES
KVM: x86: Harden KVM against imbalanced load/put of guest FPU state
KVM: x86: Unload "FPU" state on INIT if and only if its currently in-use
KVM: arm64: Check the untrusted offset in FF-A memory share
KVM: arm64: Check range args for pKVM mem transitions
...
The current scheme for handling LBRV when nested is used is very
complicated, especially when L1 does not enable LBRV (i.e. does not set
LBR_CTL_ENABLE_MASK).
To avoid copying LBRs between VMCB01 and VMCB02 on every nested
transition, the current implementation switches between using VMCB01 or
VMCB02 as the source of truth for the LBRs while L2 is running. If L2
enables LBR, VMCB02 is used as the source of truth. When L2 disables
LBR, the LBRs are copied to VMCB01 and VMCB01 is used as the source of
truth. This introduces significant complexity, and incorrect behavior in
some cases.
For example, on a nested #VMEXIT, the LBRs are only copied from VMCB02
to VMCB01 if LBRV is enabled in VMCB01. This is because L2's writes to
MSR_IA32_DEBUGCTLMSR to enable LBR are intercepted and propagated to
VMCB01 instead of VMCB02. However, LBRV is only enabled in VMCB02 when
L2 is running.
This means that if L2 enables LBR and exits to L1, the LBRs will not be
propagated from VMCB02 to VMCB01, because LBRV is disabled in VMCB01.
There is no meaningful difference in CPUID rate in L2 when copying LBRs
on every nested transition vs. the current approach, so do the simple
and correct thing and always copy LBRs between VMCB01 and VMCB02 on
nested transitions (when LBRV is disabled by L1). Drop the conditional
LBRs copying in __svm_{enable/disable}_lbrv() as it is now unnecessary.
VMCB02 becomes the only source of truth for LBRs when L2 is running,
regardless of LBRV being enabled by L1, drop svm_get_lbr_vmcb() and use
svm->vmcb directly in its place.
Fixes: 1d5a1b5860 ("KVM: x86: nSVM: correctly virtualize LBR msrs when L2 is running")
Cc: stable@vger.kernel.org
Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Link: https://patch.msgid.link/20251108004524.1600006-4-yosry.ahmed@linux.dev
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
svm_update_lbrv() is called when MSR_IA32_DEBUGCTLMSR is updated, and on
nested transitions where LBRV is used. It checks whether LBRV enablement
needs to be changed in the current VMCB, and if it does, it also
recalculate intercepts to LBR MSRs.
However, there are cases where intercepts need to be updated even when
LBRV enablement doesn't. Example scenario:
- L1 has MSR_IA32_DEBUGCTLMSR cleared.
- L1 runs L2 without LBR_CTL_ENABLE (no LBRV).
- L2 sets DEBUGCTLMSR_LBR in MSR_IA32_DEBUGCTLMSR, svm_update_lbrv()
sets LBR_CTL_ENABLE in VMCB02 and disables intercepts to LBR MSRs.
- L2 exits to L1, svm_update_lbrv() is not called on this transition.
- L1 clears MSR_IA32_DEBUGCTLMSR, svm_update_lbrv() finds that
LBR_CTL_ENABLE is already cleared in VMCB01 and does nothing.
- Intercepts remain disabled, L1 reads to LBR MSRs read the host MSRs.
Fix it by always recalculating intercepts in svm_update_lbrv().
Fixes: 1d5a1b5860 ("KVM: x86: nSVM: correctly virtualize LBR msrs when L2 is running")
Cc: stable@vger.kernel.org
Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Link: https://patch.msgid.link/20251108004524.1600006-3-yosry.ahmed@linux.dev
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The APM lists the DbgCtlMsr field as being tracked by the VMCB_LBR clean
bit. Always clear the bit when MSR_IA32_DEBUGCTLMSR is updated.
The history is complicated, it was correctly cleared for L1 before
commit 1d5a1b5860 ("KVM: x86: nSVM: correctly virtualize LBR msrs when
L2 is running"). At that point svm_set_msr() started to rely on
svm_update_lbrv() to clear the bit, but when nested virtualization
is enabled the latter does not always clear it even if MSR_IA32_DEBUGCTLMSR
changed. Go back to clearing it directly in svm_set_msr().
Fixes: 1d5a1b5860 ("KVM: x86: nSVM: correctly virtualize LBR msrs when L2 is running")
Reported-by: Matteo Rizzo <matteorizzo@google.com>
Reported-by: evn@google.com
Co-developed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Link: https://patch.msgid.link/20251108004524.1600006-2-yosry.ahmed@linux.dev
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM x86 fixes for 6.18:
- Inject #UD if the guest attempts to execute SEAMCALL or TDCALL as KVM
doesn't support virtualization the instructions, but the instructions
are gated only by VMXON, i.e. will VM-Exit instead of taking a #UD and
thus result in KVM exiting to userspace with an emulation error.
- Unload the "FPU" when emulating INIT of XSTATE features if and only if
the FPU is actually loaded, instead of trying to predict when KVM will
emulate an INIT (CET support missed the MP_STATE path). Add sanity
checks to detect and harden against similar bugs in the future.
- Unregister KVM's GALog notifier (for AVIC) when kvm-amd.ko is unloaded.
- Use a raw spinlock for svm->ir_list_lock as the lock is taken during
schedule(), and "normal" spinlocks are sleepable locks when PREEMPT_RT=y.
- Remove guest_memfd bindings on memslot deletion when a gmem file is dying
to fix a use-after-free race found by syzkaller.
- Fix a goof in the EPT Violation handler where KVM checks the wrong
variable when determining if the reported GVA is valid.
Pull x86 fixes from Ingo Molnar:
- Fix AMD PCI root device caching regression that triggers
on certain firmware variants
- Fix the zen5_rdseed_microcode[] array to be NULL-terminated
- Add more AMD models to microcode signature checking
* tag 'x86-urgent-2025-11-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/microcode/AMD: Add more known models to entry sign checking
x86/CPU/AMD: Add missing terminator for zen5_rdseed_microcode
x86/amd_node: Fix AMD root device caching
Currently we don't get stack trace via ORC unwinder on top of fgraph exit
handler. We can see that when generating stacktrace from kretprobe_multi
bpf program which is based on fprobe/fgraph.
The reason is that the ORC unwind code won't get pass the return_to_handler
callback installed by fgraph return probe machinery.
Solving this by creating stack frame in return_to_handler expected by
ftrace_graph_ret_addr function to recover original return address and
continue with the unwind.
Also updating the pt_regs data with cs/flags/rsp which are needed for
successful stack retrieval from ebpf bpf_get_stackid helper.
- in get_perf_callchain we check user_mode(regs) so CS has to be set
- in perf_callchain_kernel we call perf_hw_regs(regs), so EFLAGS/FIXED
has to be unset
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20251104215405.168643-3-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
This reverts commit 83f44ae0f8.
Currently we store initial stacktrace entry twice for non-HW ot_regs, which
means callers that fail perf_hw_regs(regs) condition in perf_callchain_kernel.
It's easy to reproduce this bpftrace:
# bpftrace -e 'tracepoint:sched:sched_process_exec { print(kstack()); }'
Attaching 1 probe...
bprm_execve+1767
bprm_execve+1767
do_execveat_common.isra.0+425
__x64_sys_execve+56
do_syscall_64+133
entry_SYSCALL_64_after_hwframe+118
When perf_callchain_kernel calls unwind_start with first_frame, AFAICS
we do not skip regs->ip, but it's added as part of the unwind process.
Hence reverting the extra perf_callchain_store for non-hw regs leg.
I was not able to bisect this, so I'm not really sure why this was needed
in v5.2 and why it's not working anymore, but I could see double entries
as far as v5.10.
I did the test for both ORC and framepointer unwind with and without the
this fix and except for the initial entry the stacktraces are the same.
Acked-by: Song Liu <song@kernel.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20251104215405.168643-2-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Pull rust fixes from Miguel Ojeda:
- Fix/workaround a couple Rust 1.91.0 build issues when sanitizers are
enabled due to extra checking performed by the compiler and an
upstream issue already fixed for Rust 1.93.0
- Fix future Rust 1.93.0 builds by supporting the stabilized name for
the 'no-jump-tables' flag
- Fix a couple private/broken intra-doc links uncovered by the future
move of pin-init to 'syn'
* tag 'rust-fixes-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/ojeda/linux:
rust: kbuild: support `-Cjump-tables=n` for Rust 1.93.0
rust: kbuild: workaround `rustdoc` doctests modifier bug
rust: kbuild: treat `build_error` and `rustdoc` as kernel objects
rust: condvar: fix broken intra-doc link
rust: devres: fix private intra-doc link
The runtime-const infrastructure was never designed to handle the
modular case, because the constant fixup is only done at boot time for
core kernel code.
But by the time I used it for the x86-64 user space limit handling in
commit 86e6b1547b ("x86: fix user address masking non-canonical
speculation issue"), I had completely repressed that fact.
And it all happens to work because the only code that currently actually
gets inlined by modules is for the access_ok() limit check, where the
default constant value works even when not fixed up. Because at least I
had intentionally made it be something that is in the non-canonical
address space region.
But it's technically very wrong, and it does mean that at least in
theory, the use of 'access_ok()' + '__get_user()' can trigger the same
speculation issue with non-canonical addresses that the original commit
was all about.
The pattern is unusual enough that this probably doesn't matter in
practice, but very wrong is still very wrong. Also, let's fix it before
the nice optimized scoped user accessor helpers that Thomas Gleixner is
working on cause this pseudo-constant to then be more widely used.
This all came up due to an unrelated discussion with Mateusz Guzik about
using the runtime const infrastructure for names_cachep accesses too.
There the modular case was much more obviously broken, and Mateusz noted
it in his 'v2' of the patch series.
That then made me notice how broken 'access_ok()' had been in modules
all along. Mea culpa, mea maxima culpa.
Fix it by simply not using the runtime-const code in modules, and just
using the USER_PTR_MAX variable value instead. This is not
performance-critical like the core user accessor functions (get_user()
and friends) are.
Also make sure this doesn't get forgotten the next time somebody wants
to do runtime constant optimizations by having the x86 runtime-const.h
header file error out if included by modules.
Fixes: 86e6b1547b ("x86: fix user address masking non-canonical speculation issue")
Acked-by: Borislav Petkov <bp@alien8.de>
Acked-by: Sean Christopherson <seanjc@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Triggered-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://lore.kernel.org/all/20251030105242.801528-1-mjguzik@gmail.com/
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use a raw spinlock for vcpu_svm.ir_list_lock as the lock can be taken
during schedule() via kvm_sched_out() => __avic_vcpu_put(), and "normal"
spinlocks are sleepable locks when PREEMPT_RT=y.
This fixes the following lockdep warning:
=============================
[ BUG: Invalid wait context ]
6.12.0-146.1640_2124176644.el10.x86_64+debug #1 Not tainted
-----------------------------
qemu-kvm/38299 is trying to lock:
ff11000239725600 (&svm->ir_list_lock){....}-{3:3}, at: __avic_vcpu_put+0xfd/0x300 [kvm_amd]
other info that might help us debug this:
context-{5:5}
2 locks held by qemu-kvm/38299:
#0: ff11000239723ba8 (&vcpu->mutex){+.+.}-{4:4}, at: kvm_vcpu_ioctl+0x240/0xe00 [kvm]
#1: ff11000b906056d8 (&rq->__lock){-.-.}-{2:2}, at: raw_spin_rq_lock_nested+0x2e/0x130
stack backtrace:
CPU: 1 UID: 0 PID: 38299 Comm: qemu-kvm Kdump: loaded Not tainted 6.12.0-146.1640_2124176644.el10.x86_64+debug #1 PREEMPT(voluntary)
Hardware name: AMD Corporation QUARTZ/QUARTZ, BIOS RQZ100AB 09/14/2023
Call Trace:
<TASK>
dump_stack_lvl+0x6f/0xb0
__lock_acquire+0x921/0xb80
lock_acquire.part.0+0xbe/0x270
_raw_spin_lock_irqsave+0x46/0x90
__avic_vcpu_put+0xfd/0x300 [kvm_amd]
svm_vcpu_put+0xfa/0x130 [kvm_amd]
kvm_arch_vcpu_put+0x48c/0x790 [kvm]
kvm_sched_out+0x161/0x1c0 [kvm]
prepare_task_switch+0x36b/0xf60
__schedule+0x4f7/0x1890
schedule+0xd4/0x260
xfer_to_guest_mode_handle_work+0x54/0xc0
vcpu_run+0x69a/0xa70 [kvm]
kvm_arch_vcpu_ioctl_run+0xdc0/0x17e0 [kvm]
kvm_vcpu_ioctl+0x39f/0xe00 [kvm]
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://patch.msgid.link/20251030194130.307900-1-mlevitsk@redhat.com
[sean: massage changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Setup the per-CPU SVM data structures at the very end of hardware setup so
that svm_hardware_unsetup() can be used in svm_hardware_setup() to unwind
AVIC setup (for the GALog notifier). Alternatively, the error path could
do an explicit, manual unwind, e.g. by adding a helper to free the per-CPU
structures. But the per-CPU allocations have no interactions or
dependencies, i.e. can comfortably live at the end, and so converting to
a manual unwind would introduce churn and code without providing any
immediate advantage.
Link: https://patch.msgid.link/20251016190643.80529-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Update the comment above is_xstate_managed_msr() to note that
MSR_IA32_S_CET isn't saved/restored by XSAVES/XRSTORS.
MSR_IA32_S_CET isn't part of CET_U/S state as the SDM states:
The register state used by Control-Flow Enforcement Technology (CET)
comprises the two 64-bit MSRs (IA32_U_CET and IA32_PL3_SSP) that manage
CET when CPL = 3 (CET_U state); and the three 64-bit MSRs
(IA32_PL0_SSP–IA32_PL2_SSP) that manage CET when CPL < 3 (CET_S state).
Opportunistically shift the snippet about the safety of loading certain
MSRs to the function comment for kvm_access_xstate_msr(), which is where
the MSRs are actually loaded into hardware.
Fixes: e44eb58334 ("KVM: x86: Load guest FPU state when access XSAVE-managed MSRs")
Signed-off-by: Chao Gao <chao.gao@intel.com>
Link: https://patch.msgid.link/20251028060142.29830-1-chao.gao@intel.com
[sean: shift snippet about safety to kvm_access_xstate_msr()]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Assert, via KVM_BUG_ON(), that guest FPU state isn't/is in use when
loading/putting the FPU to help detect KVM bugs without needing an assist
from KASAN. If an imbalanced load/put is detected, skip the redundant
load/put to avoid clobbering guest state and/or crashing the host.
Note, kvm_access_xstate_msr() already provides a similar assertion.
Reviewed-by: Yao Yuan <yaoyuan@linux.alibaba.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Link: https://patch.msgid.link/20251030185802.3375059-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Replace the hack added by commit f958bd2314 ("KVM: x86: Fix potential
put_fpu() w/o load_fpu() on MPX platform") with a more robust approach of
unloading+reloading guest FPU state based on whether or not the vCPU's FPU
is currently in-use, i.e. currently loaded. This fixes a bug on hosts
that support CET but not MPX, where kvm_arch_vcpu_ioctl_get_mpstate()
neglects to load FPU state (it only checks for MPX support) and leads to
KVM attempting to put FPU state due to kvm_apic_accept_events() triggering
INIT emulation. E.g. on a host with CET but not MPX, syzkaller+KASAN
generates:
Oops: general protection fault, probably for non-canonical address 0xdffffc0000000004: 0000 [#1] SMP KASAN NOPTI
KASAN: null-ptr-deref in range [0x0000000000000020-0x0000000000000027]
CPU: 211 UID: 0 PID: 20451 Comm: syz.9.26 Tainted: G S 6.18.0-smp-DEV #7 NONE
Tainted: [S]=CPU_OUT_OF_SPEC
Hardware name: Google Izumi/izumi, BIOS 0.20250729.1-0 07/29/2025
RIP: 0010:fpu_swap_kvm_fpstate+0x3ce/0x610 ../arch/x86/kernel/fpu/core.c:377
RSP: 0018:ff1100410c167cc0 EFLAGS: 00010202
RAX: 0000000000000004 RBX: 0000000000000020 RCX: 00000000000001aa
RDX: 00000000000001ab RSI: ffffffff817bb960 RDI: 0000000022600000
RBP: dffffc0000000000 R08: ff110040d23c8007 R09: 1fe220081a479000
R10: dffffc0000000000 R11: ffe21c081a479001 R12: ff110040d23c8d98
R13: 00000000fffdc578 R14: 0000000000000000 R15: ff110040d23c8d90
FS: 00007f86dd1876c0(0000) GS:ff11007fc969b000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f86dd186fa8 CR3: 00000040d1dfa003 CR4: 0000000000f73ef0
PKRU: 80000000
Call Trace:
<TASK>
kvm_vcpu_reset+0x80d/0x12c0 ../arch/x86/kvm/x86.c:11818
kvm_apic_accept_events+0x1cb/0x500 ../arch/x86/kvm/lapic.c:3489
kvm_arch_vcpu_ioctl_get_mpstate+0xd0/0x4e0 ../arch/x86/kvm/x86.c:12145
kvm_vcpu_ioctl+0x5e2/0xed0 ../virt/kvm/kvm_main.c:4539
__se_sys_ioctl+0x11d/0x1b0 ../fs/ioctl.c:51
do_syscall_x64 ../arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x6e/0x940 ../arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7f86de71d9c9
</TASK>
with a very simple reproducer:
r0 = openat$kvm(0xffffffffffffff9c, &(0x7f0000000000), 0x80b00, 0x0)
r1 = ioctl$KVM_CREATE_VM(r0, 0xae01, 0x0)
ioctl$KVM_CREATE_IRQCHIP(r1, 0xae60)
r2 = ioctl$KVM_CREATE_VCPU(r1, 0xae41, 0x0)
ioctl$KVM_SET_IRQCHIP(r1, 0x8208ae63, ...)
ioctl$KVM_GET_MP_STATE(r2, 0x8004ae98, &(0x7f00000000c0))
Alternatively, the MPX hack in GET_MP_STATE could be extended to cover CET,
but from a "don't break existing functionality" perspective, that isn't any
less risky than peeking at the state of in_use, and it's far less robust
for a long term solution (as evidenced by this bug).
Reported-by: Alexander Potapenko <glider@google.com>
Fixes: 69cc3e8865 ("KVM: x86: Add XSS support for CET_KERNEL and CET_USER")
Reviewed-by: Yao Yuan <yaoyuan@linux.alibaba.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Link: https://patch.msgid.link/20251030185802.3375059-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Recent AMD node rework removed the "search and count" method of caching AMD
root devices. This depended on the value from a Data Fabric register that was
expected to hold the PCI bus of one of the root devices attached to that
fabric.
However, this expectation is incorrect. The register, when read from PCI
config space, returns the bitwise-OR of the buses of all attached root
devices.
This behavior is benign on AMD reference design boards, since the bus numbers
are aligned. This results in a bitwise-OR value matching one of the buses. For
example, 0x00 | 0x40 | 0xA0 | 0xE0 = 0xE0.
This behavior breaks on boards where the bus numbers are not exactly aligned.
For example, 0x00 | 0x07 | 0xE0 | 0x15 = 0x1F.
The examples above are for AMD node 0. The first root device on other nodes
will not be 0x00. The first root device for other nodes will depend on the
total number of root devices, the system topology, and the specific PCI bus
number assignment.
For example, a system with 2 AMD nodes could have this:
Node 0 : 0x00 0x07 0x0e 0x15
Node 1 : 0x1c 0x23 0x2a 0x31
The bus numbering style in the reference boards is not a requirement. The
numbering found in other boards is not incorrect. Therefore, the root device
caching method needs to be adjusted.
Go back to the "search and count" method used before the recent rework.
Search for root devices using PCI class code rather than fixed PCI IDs.
This keeps the goal of the rework (remove dependency on PCI IDs) while being
able to support various board designs.
Merge helper functions to reduce code duplication.
[ bp: Reflow comment. ]
Fixes: 40a5f6ffdf ("x86/amd_nb: Simplify root device search")
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/all/20251028-fix-amd-root-v2-1-843e38f8be2c@amd.com
Pull misc x86 fixes from Ingo Molnar:
- Limit AMD microcode Entrysign sha256 signature checking to
known CPU generations
- Disable AMD RDSEED32 on certain Zen5 CPUs that have a
microcode version before when the microcode-based fix was
issued for the AMD-SB-7055 erratum
- Fix FPU AMD XFD state synchronization on signal delivery
- Fix (work around) a SSE4a-disassembly related build failure
on X86_NATIVE_CPU=y builds
- Extend the AMD Zen6 model space with a new range of models
- Fix <asm/intel-family.h> CPU model comments
- Fix the CONFIG_CFI=y and CONFIG_LTO_CLANG_FULL=y build, which
was unhappy due to missing kCFI type annotations of clear_page()
variants
* tag 'x86-urgent-2025-11-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mm: Ensure clear_page() variants always have __kcfi_typeid_ symbols
x86/cpu: Add/fix core comments for {Panther,Nova} Lake
x86/CPU/AMD: Extend Zen6 model range
x86/build: Disable SSE4a
x86/fpu: Ensure XFD state on signal delivery
x86/CPU/AMD: Add RDSEED fix for Zen5
x86/microcode/AMD: Limit Entrysign signature checking to known generations
Pull perf event fixes from Ingo Molnar:
"Miscellaneous fixes and CPU model updates:
- Fix an out-of-bounds access on non-hybrid platforms in the Intel
PMU DS code, reported by KASAN
- Add WildcatLake PMU and uncore support: it's identical to the
PantherLake version"
* tag 'perf-urgent-2025-11-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/intel/uncore: Add uncore PMU support for Wildcat Lake
perf/x86/intel: Add PMU support for WildcatLake
perf/x86/intel: Fix KASAN global-out-of-bounds warning
Pull bpf fixes from Alexei Starovoitov:
- Mark migrate_disable/enable() as always_inline to avoid issues with
partial inlining (Yonghong Song)
- Fix powerpc stack register definition in libbpf bpf_tracing.h (Andrii
Nakryiko)
- Reject negative head_room in __bpf_skb_change_head (Daniel Borkmann)
- Conditionally include dynptr copy kfuncs (Malin Jonsson)
- Sync pending IRQ work before freeing BPF ring buffer (Noorain Eqbal)
- Do not audit capability check in x86 do_jit() (Ondrej Mosnacek)
- Fix arm64 JIT of BPF_ST insn when it writes into arena memory
(Puranjay Mohan)
* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
bpf/arm64: Fix BPF_ST into arena memory
bpf: Make migrate_disable always inline to avoid partial inlining
bpf: Reject negative head_room in __bpf_skb_change_head
bpf: Conditionally include dynptr copy kfuncs
libbpf: Fix powerpc's stack register definition in bpf_tracing.h
bpf: Do not audit capability check in do_jit()
bpf: Sync pending IRQ work before freeing ring buffer
When building with CONFIG_CFI=y and CONFIG_LTO_CLANG_FULL=y, there is a series
of errors from the various versions of clear_page() not having __kcfi_typeid_
symbols.
$ cat kernel/configs/repro.config
CONFIG_CFI=y
# CONFIG_LTO_NONE is not set
CONFIG_LTO_CLANG_FULL=y
$ make -skj"$(nproc)" ARCH=x86_64 LLVM=1 clean defconfig repro.config bzImage
ld.lld: error: undefined symbol: __kcfi_typeid_clear_page_rep
>>> referenced by ld-temp.o
>>> vmlinux.o:(__cfi_clear_page_rep)
ld.lld: error: undefined symbol: __kcfi_typeid_clear_page_orig
>>> referenced by ld-temp.o
>>> vmlinux.o:(__cfi_clear_page_orig)
ld.lld: error: undefined symbol: __kcfi_typeid_clear_page_erms
>>> referenced by ld-temp.o
>>> vmlinux.o:(__cfi_clear_page_erms)
With full LTO, it is possible for LLVM to realize that these functions never
have their address taken (as they are only used within an alternative, which
will make them a direct call) across the whole kernel and either drop or skip
generating their kCFI type identification symbols.
clear_page_{rep,orig,erms}() are defined in clear_page_64.S with
SYM_TYPED_FUNC_START as a result of
2981557cb0 ("x86,kcfi: Fix EXPORT_SYMBOL vs kCFI"),
as exported functions are free to be called indirectly thus need kCFI type
identifiers.
Use KCFI_REFERENCE with these clear_page() functions to force LLVM to see
these functions as address-taken and generate then keep the kCFI type
identifiers.
Fixes: 2981557cb0 ("x86,kcfi: Fix EXPORT_SYMBOL vs kCFI")
Closes: https://github.com/ClangBuiltLinux/linux/issues/2128
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
Link: https://patch.msgid.link/20251013-x86-fix-clear_page-cfi-full-lto-errors-v1-1-d69534c0be61@kernel.org
When running "perf mem record" command on CWF, the below KASAN
global-out-of-bounds warning is seen.
==================================================================
BUG: KASAN: global-out-of-bounds in cmt_latency_data+0x176/0x1b0
Read of size 4 at addr ffffffffb721d000 by task dtlb/9850
Call Trace:
kasan_report+0xb8/0xf0
cmt_latency_data+0x176/0x1b0
setup_arch_pebs_sample_data+0xf49/0x2560
intel_pmu_drain_arch_pebs+0x577/0xb00
handle_pmi_common+0x6c4/0xc80
The issue is caused by below code in __grt_latency_data(). The code
tries to access x86_hybrid_pmu structure which doesn't exist on
non-hybrid platform like CWF.
WARN_ON_ONCE(hybrid_pmu(event->pmu)->pmu_type == hybrid_big)
So add is_hybrid() check before calling this WARN_ON_ONCE to fix the
global-out-of-bounds access issue.
Fixes: 090262439f ("perf/x86/intel: Rename model-specific pebs_latency_data functions")
Reported-by: Xudong Hao <xudong.hao@intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Zide Chen <zide.chen@intel.com>
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20251028064214.1451968-1-dapeng1.mi@linux.intel.com
Leyvi Rose reported that his X86_NATIVE_CPU=y build is failing because our
instruction decoder doesn't support SSE4a and the AMDGPU code seems to be
generating those with his compiler of choice (CLANG+LTO).
Now, our normal build flags disable SSE MMX SSE2 3DNOW AVX, but then
CC_FLAGS_FPU re-enable SSE SSE2.
Since nothing mentions SSE3 or SSE4, I'm assuming that -msse (or its negative)
control all SSE variants -- but why then explicitly enumerate SSE2 ?
Anyway, until the instruction decoder gets fixed, explicitly disallow SSE4a
(an AMD specific SSE4 extension).
Fixes: ea1dcca1de ("x86/kbuild/64: Add the CONFIG_X86_NATIVE_CPU option to locally optimize the kernel with '-march=native'")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Arisu Tachibana <arisu.tachibana@miraclelinux.com>
Acked-by: Christian König <christian.koenig@amd.com>
Acked-by: Harry Wentland <harry.wentland@amd.com>
Cc: <stable@kernel.org>
Sean reported [1] the following splat when running KVM tests:
WARNING: CPU: 232 PID: 15391 at xfd_validate_state+0x65/0x70
Call Trace:
<TASK>
fpu__clear_user_states+0x9c/0x100
arch_do_signal_or_restart+0x142/0x210
exit_to_user_mode_loop+0x55/0x100
do_syscall_64+0x205/0x2c0
entry_SYSCALL_64_after_hwframe+0x4b/0x53
Chao further identified [2] a reproducible scenario involving signal
delivery: a non-AMX task is preempted by an AMX-enabled task which
modifies the XFD MSR.
When the non-AMX task resumes and reloads XSTATE with init values,
a warning is triggered due to a mismatch between fpstate::xfd and the
CPU's current XFD state. fpu__clear_user_states() does not currently
re-synchronize the XFD state after such preemption.
Invoke xfd_update_state() which detects and corrects the mismatch if
there is a dynamic feature.
This also benefits the sigreturn path, as fpu__restore_sig() may call
fpu__clear_user_states() when the sigframe is inaccessible.
[ dhansen: minor changelog munging ]
Closes: https://lore.kernel.org/lkml/aDCo_SczQOUaB2rS@google.com [1]
Fixes: 672365477a ("x86/fpu: Update XFD state where required")
Reported-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Tested-by: Chao Gao <chao.gao@intel.com>
Link: https://lore.kernel.org/all/aDWbctO%2FRfTGiCg3@intel.com [2]
Cc:stable@vger.kernel.org
Link: https://patch.msgid.link/20250610001700.4097-1-chang.seok.bae%40intel.com
There's an issue with RDSEED's 16-bit and 32-bit register output
variants on Zen5 which return a random value of 0 "at a rate inconsistent
with randomness while incorrectly signaling success (CF=1)". Search the
web for AMD-SB-7055 for more detail.
Add a fix glue which checks microcode revisions.
[ bp: Add microcode revisions checking, rewrite. ]
Cc: stable@vger.kernel.org
Signed-off-by: Gregory Price <gourry@gourry.net>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20251018024010.4112396-1-gourry@gourry.net
Limit Entrysign sha256 signature checking to CPUs in the range Zen1-Zen5.
X86_BUG cannot be used here because the loading on the BSP happens way
too early, before the cpufeatures machinery has been set up.
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://patch.msgid.link/all/20251023124629.5385-1-bp@kernel.org
The failure of this check only results in a security mitigation being
applied, slightly affecting performance of the compiled BPF program. It
doesn't result in a failed syscall, an thus auditing a failed LSM
permission check for it is unwanted. For example with SELinux, it causes
a denial to be reported for confined processes running as root, which
tends to be flagged as a problem to be fixed in the policy. Yet
dontauditing or allowing CAP_SYS_ADMIN to the domain may not be
desirable, as it would allow/silence also other checks - either going
against the principle of least privilege or making debugging potentially
harder.
Fix it by changing it from capable() to ns_capable_noaudit(), which
instructs the LSMs to not audit the resulting denials.
Link: https://bugzilla.redhat.com/show_bug.cgi?id=2369326
Fixes: d4e89d212d ("x86/bpf: Call branch history clearing sequence on exit")
Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com>
Reviewed-by: Paul Moore <paul@paul-moore.com>
Link: https://lore.kernel.org/r/20251021122758.2659513-1-omosnace@redhat.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
When retbleed mitigation is disabled, the kernel already prints an info
message that the system is vulnerable. Recent code restructuring also
inadvertently led to RETBLEED_INTEL_MSG being printed as an error, which is
unnecessary as retbleed mitigation was already explicitly disabled (by config
option, cmdline, etc.).
Qualify this print statement so the warning is not printed unless an actual
retbleed mitigation was selected and is being disabled due to incompatibility
with spectre_v2.
Fixes: e3b78a7ad5 ("x86/bugs: Restructure retbleed mitigation")
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=220624
Signed-off-by: David Kaplan <david.kaplan@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://patch.msgid.link/20251003171936.155391-1-david.kaplan@amd.com
Add VMX exit handlers for SEAMCALL and TDCALL to inject a #UD if a non-TD
guest attempts to execute SEAMCALL or TDCALL. Neither SEAMCALL nor TDCALL
is gated by any software enablement other than VMXON, and so will generate
a VM-Exit instead of e.g. a native #UD when executed from the guest kernel.
Note! No unprivileged DoS of the L1 kernel is possible as TDCALL and
SEAMCALL #GP at CPL > 0, and the CPL check is performed prior to the VMX
non-root (VM-Exit) check, i.e. userspace can't crash the VM. And for a
nested guest, KVM forwards unknown exits to L1, i.e. an L2 kernel can
crash itself, but not L1.
Note #2! The Intel® Trust Domain CPU Architectural Extensions spec's
pseudocode shows the CPL > 0 check for SEAMCALL coming _after_ the VM-Exit,
but that appears to be a documentation bug (likely because the CPL > 0
check was incorrectly bundled with other lower-priority #GP checks).
Testing on SPR and EMR shows that the CPL > 0 check is performed before
the VMX non-root check, i.e. SEAMCALL #GPs when executed in usermode.
Note #3! The aforementioned Trust Domain spec uses confusing pseudocode
that says that SEAMCALL will #UD if executed "inSEAM", but "inSEAM"
specifically means in SEAM Root Mode, i.e. in the TDX-Module. The long-
form description explicitly states that SEAMCALL generates an exit when
executed in "SEAM VMX non-root operation". But that's a moot point as the
TDX-Module injects #UD if the guest attempts to execute SEAMCALL, as
documented in the "Unconditionally Blocked Instructions" section of the
TDX-Module base specification.
Cc: stable@vger.kernel.org
Cc: Kai Huang <kai.huang@intel.com>
Cc: Xiaoyao Li <xiaoyao.li@intel.com>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20251016182148.69085-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
The following NULL pointer dereference is encountered on mount of resctrl fs
after booting a system that supports assignable counters with the
"rdt=!mbmtotal,!mbmlocal" kernel parameters:
BUG: kernel NULL pointer dereference, address: 0000000000000008
RIP: 0010:mbm_cntr_get
Call Trace:
rdtgroup_assign_cntr_event
rdtgroup_assign_cntrs
rdt_get_tree
Specifying the kernel parameter "rdt=!mbmtotal,!mbmlocal" effectively disables
the legacy X86_FEATURE_CQM_MBM_TOTAL and X86_FEATURE_CQM_MBM_LOCAL features
and the MBM events they represent. This results in the per-domain MBM event
related data structures to not be allocated during early initialization.
resctrl fs initialization follows by implicitly enabling both MBM total and
local events on a system that supports assignable counters (mbm_event mode),
but this enabling occurs after the per-domain data structures have been
created.
After booting, resctrl fs assumes that an enabled event can access all its
state. This results in NULL pointer dereference when resctrl attempts to
access the un-allocated structures of an enabled event.
Remove the late MBM event enabling from resctrl fs.
This leaves a problem where the X86_FEATURE_CQM_MBM_TOTAL and
X86_FEATURE_CQM_MBM_LOCAL features may be disabled while assignable counter
(mbm_event) mode is enabled without any events to support. Switching between
the "default" and "mbm_event" mode without any events is not practical.
Create a dependency between the X86_FEATURE_{CQM_MBM_TOTAL,CQM_MBM_LOCAL} and
X86_FEATURE_ABMC (assignable counter) hardware features. An x86 system that
supports assignable counters now requires support of X86_FEATURE_CQM_MBM_TOTAL
or X86_FEATURE_CQM_MBM_LOCAL.
This ensures all needed MBM related data structures are created before use and
that it is only possible to switch between "default" and "mbm_event" mode when
the same events are available in both modes. This dependency does not exist in
the hardware but this usage of these feature settings work for known systems.
[ bp: Massage commit message. ]
Fixes: 13390861b4 ("x86,fs/resctrl: Detect Assignable Bandwidth Monitoring feature details")
Co-developed-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://patch.msgid.link/a62e6ac063d0693475615edd213d5be5e55443e6.1760560934.git.babu.moger@amd.com
Pull x86 fixes from Borislav Petkov:
- Reset the why-the-system-rebooted register on AMD to avoid stale bits
remaining from previous boots
- Add a missing barrier in the TLB flushing code to prevent erroneously
not flushing a TLB generation
- Make sure cpa_flush() does not overshoot when computing the end range
of a flush region
- Fix resctrl bandwidth counting on AMD systems when the amount of
monitoring groups created exceeds the number the hardware can track
* tag 'x86_urgent_for_v6.18_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/CPU/AMD: Prevent reset reasons from being retained across reboot
x86/mm: Fix SMP ordering in switch_mm_irqs_off()
x86/mm: Fix overflow in __cpa_addr()
x86/resctrl: Fix miscount of bandwidth event when reactivating previously unavailable RMID
KVM x86 fixes for 6.18:
- Expand the KVM_PRE_FAULT_MEMORY selftest to add a regression test for the
bug fixed by commit 3ccbf6f470 ("KVM: x86/mmu: Return -EAGAIN if userspace
deletes/moves memslot during prefault")
- Don't try to get PMU capabbilities from perf when running a CPU with hybrid
CPUs/PMUs, as perf will rightly WARN.
- Rework KVM_CAP_GUEST_MEMFD_MMAP (newly introduced in 6.18) into a more
generic KVM_CAP_GUEST_MEMFD_FLAGS
- Add a guest_memfd INIT_SHARED flag and require userspace to explicitly set
said flag to initialize memory as SHARED, irrespective of MMAP. The
behavior merged in 6.18 is that enabling mmap() implicitly initializes
memory as SHARED, which would result in an ABI collision for x86 CoCo VMs
as their memory is currently always initialized PRIVATE.
- Allow mmap() on guest_memfd for x86 CoCo VMs, i.e. on VMs with private
memory, to enable testing such setups, i.e. to hopefully flush out any
other lurking ABI issues before 6.18 is officially released.
- Add testcases to the guest_memfd selftest to cover guest_memfd without MMAP,
and host userspace accesses to mmap()'d private memory.