Fix a build error in kvmppc_e500_tlb_init() that was introduced by the
conversion to use kzalloc_objs(), as KVM confusingly uses the size of the
structure that is one and only field in tlbe_priv:
arch/powerpc/kvm/e500_mmu.c:923:33: error: assignment to 'struct tlbe_priv *'
from incompatible pointer type 'struct tlbe_ref *' [-Wincompatible-pointer-types]
923 | vcpu_e500->gtlb_priv[0] = kzalloc_objs(struct tlbe_ref,
| ^
KVM has been flawed since commit 0164c0f0c4 ("KVM: PPC: e500: clear up
confusion between host and guest entries"), but the issue went unnoticed
until kmalloc_obj() came along and enforced types, as "struct tlbe_priv"
was just a wrapper of "struct tlbe_ref" (why on earth the two ever existed
separately...).
Fixes: 69050f8d6d ("treewide: Replace kmalloc with kmalloc_obj for non-scalar types")
Cc: Kees Cook <kees@kernel.org>
Reviewed-by: Christophe Leroy (CS GROUP) <chleroy@kernel.org>
Link: https://patch.msgid.link/20260303190339.974325-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Increase 'maxnode' when using 'get_mempolicy' syscall in guest_memfd
mmap and NUMA policy tests to fix a failure on one Intel GNR platform.
On a CXL-capable platform, the memory affinity of CXL memory regions may
not be covered by the SRAT. Since each CXL memory region is enumerated
via a CFMWS table, at early boot the kernel parses all CFMWS tables to
detect all CXL memory regions and assigns a 'faked' NUMA node for each
of them, starting from the highest NUMA node ID enumerated via the SRAT.
This increases the 'nr_node_ids'. E.g., on the aforementioned Intel GNR
platform which has 4 NUMA nodes and 18 CFMWS tables, it increases to 22.
This results in the 'get_mempolicy' syscall failure on that platform,
because currently 'maxnode' is hard-coded to 8 but the 'get_mempolicy'
syscall requires the 'maxnode' to be not smaller than the 'nr_node_ids'.
Increase the 'maxnode' to the number of bits of 'nodemask', which is
'unsigned long', to fix this.
This may not cover all systems. Perhaps a better way is to always set
the 'nodemask' and 'maxnode' based on the actual maximum NUMA node ID on
the system, but for now just do the simple way.
Reported-by: Yi Lai <yi1.lai@intel.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=221014
Closes: https://lore.kernel.org/all/bug-221014-28872@https.bugzilla.kernel.org%2F
Signed-off-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Yuan Yao <yaoyuan@linux.alibaba.com>
Link: https://patch.msgid.link/20260302205158.178058-1-kai.huang@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM/arm64 fixes for 7.0, take #3
- Correctly handle deeactivation of out-of-LRs interrupts by
starting the EOIcount deactivation walk *after* the last irq
that made it into an LR. This avoids deactivating irqs that
are in the LRs and that the vcpu hasn't deactivated yet.
- Avoid calling into the stubs to probe for ICH_VTR_EL2.TDS when
pKVM is already enabled -- not only thhis isn't possible (pKVM
will reject the call), but it is also useless: this can only
happen for a CPU that has already booted once, and the capability
will not change.
KVM generic changes for 7.0
- Remove a subtle pseudo-overlay of kvm_stats_desc, which, aside from being
unnecessary and confusing, triggered compiler warnings due to
-Wflex-array-member-not-at-end.
- Document that vcpu->mutex is take outside of kvm->slots_lock and
kvm->slots_arch_lock, which is intentional and desirable despite being
rather unintuitive.
KVM/arm64 fixes for 7.0, take #2
- Fix a couple of low-severity bugs in our S2 fault handling path,
affecting the recently introduced LS64 handling and the even more
esoteric handling of hwpoison in a nested context
- Address yet another syzkaller finding in the vgic initialisation,
were we would end-up destroying an uninitialised vgic, with nasty
consequences
- Address an annoying case of pKVM failing to boot when some of the
memblock regions that the host is faulting in are not page-aligned
- Inject some sanity in the NV stage-2 walker by checking the limits
against the advertised PA size, and correctly report the resulting
faults
- Drop an unnecessary ISB when emulating an EL2 S1 address translation
Hotplugging a CPU off and back on fails with pKVM, as we try to
probe for ICH_VTR_EL2.TDS. In a non-VHE setup, this is achieved
by using an EL2 stub helper. However, the stubs are out of reach
once pKVM has deprivileged the kernel. The CPU never boots.
Since pKVM doesn't allow late onlining of CPUs, we can detect
that protected mode is enforced early on, and return the current
state of the capability.
Fixes: 2a28810cbb ("KVM: arm64: GICv3: Detect and work around the lack of ICV_DIR_EL1 trapping")
Reported-by: Vincent Donnefort <vdonnefort@google.com>
Tested-by: Vincent Donnefort <vdonnefort@google.com>
Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://patch.msgid.link/20260310085433.3936742-1-maz@kernel.org
Cc: stable@vger.kernel.org
Valentine reports that their guests fail to boot correctly, losing
interrupts, and indicates that the wrong interrupt gets deactivated.
What happens here is that if the maintenance interrupt is slow enough
to kick us out of the guest, extra interrupts can be activated from
the LRs. We then exit and proceed to handle EOIcount deactivations,
picking active interrupts from the AP list. But we start from the
top of the list, potentially deactivating interrupts that were in
the LRs, while EOIcount only denotes deactivation of interrupts that
are not present in an LR.
Solve this by tracking the last interrupt that made it in the LRs,
and start the EOIcount deactivation walk *after* that interrupt.
Since this only makes sense while the vcpu is loaded, stash this
in the per-CPU host state.
Huge thanks to Valentine for doing all the detective work and
providing an initial patch.
Fixes: 3cfd59f81e ("KVM: arm64: GICv3: Handle LR overflow when EOImode==0")
Fixes: 281c6c06e2 ("KVM: arm64: GICv2: Handle LR overflow when EOImode==0")
Reported-by: Valentine Burley <valentine.burley@collabora.com>
Tested-by: Valentine Burley <valentine.burley@collabora.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20260307115955.369455-1-valentine.burley@collabora.com
Link: https://patch.msgid.link/20260307191151.3781182-1-maz@kernel.org
Cc: stable@vger.kernel.org
When user_mem_abort() handles a nested stage-2 fault, it truncates
vma_pagesize to respect the guest's mapping size. However, the local
variable vma_shift is never updated to match this new size.
If the underlying host page turns out to be hardware poisoned,
kvm_send_hwpoison_signal() is called with the original, larger
vma_shift instead of the actual mapping size. This signals incorrect
poison boundaries to userspace and breaks hugepage memory poison
containment for nested VMs.
Update vma_shift to match the truncated vma_pagesize when operating
on behalf of a nested hypervisor.
Fixes: fd276e71d1 ("KVM: arm64: nv: Handle shadow stage 2 page faults")
Signed-off-by: Fuad Tabba <tabba@google.com>
Link: https://patch.msgid.link/20260304162222.836152-3-tabba@google.com
[maz: simplified vma_shift assignment from the original patch]
Signed-off-by: Marc Zyngier <maz@kernel.org>
The KVM_DEV_RISCV_AIA_GRP_APLIC branch of aia_has_attr() was identified
to have a race condition with concurrent KVM_SET_DEVICE_ATTR ioctls,
leading to a use-after-free bug.
Upon analyzing the code, it was discovered that the
KVM_DEV_RISCV_AIA_GRP_IMSIC branch of aia_has_attr() suffers from the same
lack of synchronization. It invokes kvm_riscv_aia_imsic_has_attr() without
holding dev->kvm->lock.
While aia_has_attr() is running, a concurrent aia_set_attr() could call
aia_init() under the dev->kvm->lock. If aia_init() fails, it may trigger
kvm_riscv_vcpu_aia_imsic_cleanup(), which frees imsic_state. Without proper
locking, kvm_riscv_aia_imsic_has_attr() could attempt to access imsic_state
while it is being deallocated.
Although this specific path has not yet been reported by a fuzzer, it
is logically identical to the APLIC issue. Fix this by acquiring the
dev->kvm->lock before calling kvm_riscv_aia_imsic_has_attr(), ensuring
consistency with the locking pattern used for other AIA attribute groups.
Fixes: 5463091a51 ("RISC-V: KVM: Expose IMSIC registers as attributes of AIA irqchip")
Signed-off-by: Jiakai Xu <xujiakai2025@iscas.ac.cn>
Signed-off-by: Jiakai Xu <jiakaiPeanut@gmail.com>
Reviewed-by: Anup Patel <anup@brainfault.org>
Link: https://lore.kernel.org/r/20260304080804.2281721-1-xujiakai2025@iscas.ac.cn
Signed-off-by: Anup Patel <anup@brainfault.org>
Fuzzer reports a KASAN use-after-free bug triggered by a race
between KVM_HAS_DEVICE_ATTR and KVM_SET_DEVICE_ATTR ioctls on
the AIA device. The root cause is that aia_has_attr() invokes
kvm_riscv_aia_aplic_has_attr() without holding dev->kvm->lock, while
a concurrent aia_set_attr() may call aia_init() under that lock. When
aia_init() fails after kvm_riscv_aia_aplic_init() has succeeded, it
calls kvm_riscv_aia_aplic_cleanup() in its fail_cleanup_imsics path,
which frees both aplic_state and aplic_state->irqs. The concurrent
has_attr path can then dereference the freed aplic->irqs in
aplic_read_pending():
irqd = &aplic->irqs[irq]; /* UAF here */
KASAN report:
BUG: KASAN: slab-use-after-free in aplic_read_pending
arch/riscv/kvm/aia_aplic.c:119 [inline]
BUG: KASAN: slab-use-after-free in aplic_read_pending_word
arch/riscv/kvm/aia_aplic.c:351 [inline]
BUG: KASAN: slab-use-after-free in aplic_mmio_read_offset
arch/riscv/kvm/aia_aplic.c:406
Read of size 8 at addr ff600000ba965d58 by task 9498
Call Trace:
aplic_read_pending arch/riscv/kvm/aia_aplic.c:119 [inline]
aplic_read_pending_word arch/riscv/kvm/aia_aplic.c:351 [inline]
aplic_mmio_read_offset arch/riscv/kvm/aia_aplic.c:406
kvm_riscv_aia_aplic_has_attr arch/riscv/kvm/aia_aplic.c:566
aia_has_attr arch/riscv/kvm/aia_device.c:469
allocated by task 9473:
kvm_riscv_aia_aplic_init arch/riscv/kvm/aia_aplic.c:583
aia_init arch/riscv/kvm/aia_device.c:248 [inline]
aia_set_attr arch/riscv/kvm/aia_device.c:334
freed by task 9473:
kvm_riscv_aia_aplic_cleanup arch/riscv/kvm/aia_aplic.c:644
aia_init arch/riscv/kvm/aia_device.c:292 [inline]
aia_set_attr arch/riscv/kvm/aia_device.c:334
Fix this race by acquiring dev->kvm->lock in aia_has_attr() before
calling kvm_riscv_aia_aplic_has_attr(), consistent with the locking
pattern used in aia_get_attr() and aia_set_attr().
Fixes: 289a007b98 ("RISC-V: KVM: Expose APLIC registers as attributes of AIA irqchip")
Signed-off-by: Jiakai Xu <jiakaiPeanut@gmail.com>
Signed-off-by: Jiakai Xu <xujiakai2025@iscas.ac.cn>
Reviewed-by: Anup Patel <anup@brainfault.org>
Link: https://lore.kernel.org/r/20260302132703.1721415-1-xujiakai2025@iscas.ac.cn
Signed-off-by: Anup Patel <anup@brainfault.org>
kvm_riscv_vcpu_aia_rmw_topei() assumes that the per-vCPU IMSIC state has
been initialized once AIA is reported as available and initialized at
the VM level. This assumption does not always hold.
Under fuzzed ioctl sequences, a guest may access the IMSIC TOPEI CSR
before the vCPU IMSIC state is set up. In this case,
vcpu->arch.aia_context.imsic_state is still NULL, and the TOPEI RMW path
dereferences it unconditionally, leading to a host kernel crash.
The crash manifests as:
Unable to handle kernel paging request at virtual address
dfffffff0000000e
...
kvm_riscv_vcpu_aia_imsic_rmw arch/riscv/kvm/aia_imsic.c:909
kvm_riscv_vcpu_aia_rmw_topei arch/riscv/kvm/aia.c:231
csr_insn arch/riscv/kvm/vcpu_insn.c:208
system_opcode_insn arch/riscv/kvm/vcpu_insn.c:281
kvm_riscv_vcpu_virtual_insn arch/riscv/kvm/vcpu_insn.c:355
kvm_riscv_vcpu_exit arch/riscv/kvm/vcpu_exit.c:230
kvm_arch_vcpu_ioctl_run arch/riscv/kvm/vcpu.c:1008
...
Fix this by explicitly checking whether the vCPU IMSIC state has been
initialized before handling TOPEI CSR accesses. If not, forward the CSR
emulation to user space.
Fixes: db8b7e97d6 ("RISC-V: KVM: Add in-kernel virtualization of AIA IMSIC")
Signed-off-by: Jiakai Xu <xujiakai2025@iscas.ac.cn>
Signed-off-by: Jiakai Xu <jiakaiPeanut@gmail.com>
Reviewed-by: Nutty Liu <nutty.liu@hotmail.com>
Reviewed-by: Anup Patel <anup@brainfault.org>
Link: https://lore.kernel.org/r/20260226085119.643295-1-xujiakai2025@iscas.ac.cn
Signed-off-by: Anup Patel <anup@brainfault.org>
While fuzzing KVM on RISC-V, a use-after-free was observed in
kvm_riscv_gstage_get_leaf(), where ptep_get() dereferences a
freed gstage page table page during gfn unmap.
The crash manifests as:
use-after-free in ptep_get include/linux/pgtable.h:340 [inline]
use-after-free in kvm_riscv_gstage_get_leaf arch/riscv/kvm/gstage.c:89
Call Trace:
ptep_get include/linux/pgtable.h:340 [inline]
kvm_riscv_gstage_get_leaf+0x2ea/0x358 arch/riscv/kvm/gstage.c:89
kvm_riscv_gstage_unmap_range+0xf0/0x308 arch/riscv/kvm/gstage.c:265
kvm_unmap_gfn_range+0x168/0x1fc arch/riscv/kvm/mmu.c:256
kvm_mmu_unmap_gfn_range virt/kvm/kvm_main.c:724 [inline]
page last free pid 808 tgid 808 stack trace:
kvm_riscv_mmu_free_pgd+0x1b6/0x26a arch/riscv/kvm/mmu.c:457
kvm_arch_flush_shadow_all+0x1a/0x24 arch/riscv/kvm/mmu.c:134
kvm_flush_shadow_all virt/kvm/kvm_main.c:344 [inline]
The UAF is caused by gstage page table walks running concurrently with
gstage pgd teardown. In particular, kvm_unmap_gfn_range() can traverse
gstage page tables while kvm_arch_flush_shadow_all() frees the pgd,
leading to use-after-free of page table pages.
Fix the issue by serializing gstage unmap and pgd teardown with
kvm->mmu_lock. Holding mmu_lock ensures that gstage page tables
remain valid for the duration of unmap operations and prevents
concurrent frees.
This matches existing RISC-V KVM usage of mmu_lock to protect gstage
map/unmap operations, e.g. kvm_riscv_mmu_iounmap.
Fixes: dd82e35638 ("RISC-V: KVM: Factor-out g-stage page table management")
Signed-off-by: Jiakai Xu <xujiakai2025@iscas.ac.cn>
Signed-off-by: Jiakai Xu <jiakaiPeanut@gmail.com>
Reviewed-by: Anup Patel <anup@brainfault.org>
Link: https://lore.kernel.org/r/20260202040059.1801167-1-xujiakai2025@iscas.ac.cn
Signed-off-by: Anup Patel <anup@brainfault.org>
When a guest performs an atomic/exclusive operation on memory lacking
the required attributes, user_mem_abort() injects a data abort and
returns early. However, it fails to release the reference to the
host page acquired via __kvm_faultin_pfn().
A malicious guest could repeatedly trigger this fault, leaking host
page references and eventually causing host memory exhaustion (OOM).
Fix this by consolidating the early error returns to a new out_put_page
label that correctly calls kvm_release_page_unused().
Fixes: 2937aeec9d ("KVM: arm64: Handle DABT caused by LS64* instructions on unsupported memory")
Signed-off-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Yuan Yao <yaoyuan@linux.alibaba.com>
Link: https://patch.msgid.link/20260304162222.836152-2-tabba@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
As per R_BFHQH,
" When an Address size fault is generated, the reported fault code
indicates one of the following:
If the fault was generated due to the TTBR_ELx used in the translation
having nonzero address bits above the OA size, then a fault at level 0. "
Fix the reported Address size fault level as being 0 if the base address is
wrongly programmed by L1.
Fixes: 61e30b9eef ("KVM: arm64: nv: Implement nested Stage-2 page table walk logic")
Signed-off-by: Zenghui Yu (Huawei) <zenghui.yu@linux.dev>
Link: https://patch.msgid.link/20260225173515.20490-3-zenghui.yu@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
check_base_s2_limits() checks the validity of SL0 and inputsize against
ia_size (inputsize again!) but the pseudocode from DDI0487 G.a
AArch64.TranslationTableWalk() says that we should check against the
implemented PA size.
We would otherwise fail to walk S2 with a valid configuration. E.g.,
granule size = 4KB, inputsize = 40 bits, initial lookup level = 0 (no
concatenation) on a system with 48 bits PA range supported is allowed by
architecture.
Fix it by obtaining PA size by kvm_get_pa_bits(). Note that
kvm_get_pa_bits() returns the fixed limit now and should eventually reflect
the per VM PARange (one day!). Given that the configured PARange should not
be greater that kvm_ipa_limit, it at least fixes the problem described
above.
While at it, inject a level 0 translation fault to guest if
check_base_s2_limits() fails, as per the pseudocode.
Fixes: 61e30b9eef ("KVM: arm64: nv: Implement nested Stage-2 page table walk logic")
Signed-off-by: Zenghui Yu (Huawei) <zenghui.yu@linux.dev>
Link: https://patch.msgid.link/20260225173515.20490-2-zenghui.yu@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
If, for any odd reason, we cannot converge to mapping size that is
completely contained in a memblock region, we fail to install a S2
mapping and go back to the faulting instruction. Rince, repeat.
This happens when faulting in regions that are smaller than a page
or that do not have PAGE_SIZE-aligned boundaries (as witnessed on
an O6 board that refuses to boot in protected mode).
In this situation, fallback to using a PAGE_SIZE mapping anyway --
it isn't like we can go any lower.
Fixes: e728e70580 ("KVM: arm64: Adjust range correctly during host stage-2 faults")
Link: https://lore.kernel.org/r/86wlzr77cn.wl-maz@kernel.org
Cc: stable@vger.kernel.org
Cc: Quentin Perret <qperret@google.com>
Reviewed-by: Quentin Perret <qperret@google.com>
Link: https://patch.msgid.link/20260305132751.2928138-1-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
Explicitly document the ordering of vcpu->mutex being taken *outside* of
kvm->slots_lock. While somewhat unintuitive since vCPUs conceptually have
narrower scope than VMs, the scope of the owning object (vCPU versus VM)
doesn't automatically carry over to the lock. In this case, vcpu->mutex
has far broader scope than kvm->slots_lock. As Paolo put it, it's a
"don't worry about multiple ioctls at the same time" mutex that's intended
to be taken at the outer edges of KVM.
More importantly, arm64 and x86 have gained flows that take kvm->slots_lock
inside of vcpu->mutex. x86's kvm_inhibit_apic_access_page() is particularly
nasty, as slots_lock is taken quite deep within KVM_RUN, i.e. simply
swapping the ordering isn't an option.
Commit to the vcpu->mutex => kvm->slots_lock ordering, as vcpu->mutex
really is intended to be a "top-level" lock, whereas kvm->slots_lock is
"just" a helper lock.
Opportunistically document that vcpu->mutex is also taken outside of
slots_arch_lock, e.g. when allocating shadow roots on x86 (which is the
entire reason slots_arch_lock exists, as shadow roots must be allocated
while holding kvm->srcu)
kvm_mmu_new_pgd()
|
-> kvm_mmu_reload()
|
-> kvm_mmu_load()
|
-> mmu_alloc_shadow_roots()
|
-> mmu_first_shadow_root_alloc()
but also when manipulating memslots in vCPU context, e.g. when inhibiting
the APIC-access page via the aforementioned kvm_inhibit_apic_access_page()
kvm_inhibit_apic_access_page()
|
-> __x86_set_memory_region()
|
-> kvm_set_internal_memslot()
|
-> kvm_set_memory_region()
|
-> kvm_set_memslot()
Cc: Oliver Upton <oliver.upton@linux.dev>
Cc: Marc Zyngier <maz@kernel.org>
Link: https://patch.msgid.link/20260302170239.596810-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Pull kvm fixes from Paolo Bonzini:
"Arm:
- Make sure we don't leak any S1POE state from guest to guest when
the feature is supported on the HW, but not enabled on the host
- Propagate the ID registers from the host into non-protected VMs
managed by pKVM, ensuring that the guest sees the intended feature
set
- Drop double kern_hyp_va() from unpin_host_sve_state(), which could
bite us if we were to change kern_hyp_va() to not being idempotent
- Don't leak stage-2 mappings in protected mode
- Correctly align the faulting address when dealing with single page
stage-2 mappings for PAGE_SIZE > 4kB
- Fix detection of virtualisation-capable GICv5 IRS, due to the
maintainer being obviously fat fingered... [his words, not mine]
- Remove duplication of code retrieving the ASID for the purpose of
S1 PT handling
- Fix slightly abusive const-ification in vgic_set_kvm_info()
Generic:
- Remove internal Kconfigs that are now set on all architectures
- Remove per-architecture code to enable KVM_CAP_SYNC_MMU, all
architectures finally enable it in Linux 7.0"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: always define KVM_CAP_SYNC_MMU
KVM: remove CONFIG_KVM_GENERIC_MMU_NOTIFIER
KVM: arm64: Deduplicate ASID retrieval code
irqchip/gic-v5: Fix inversion of IRS_IDR0.virt flag
KVM: arm64: Revert accidental drop of kvm_uninit_stage2_mmu() for non-NV VMs
KVM: arm64: Fix protected mode handling of pages larger than 4kB
KVM: arm64: vgic: Handle const qualifier from gic_kvm_info allocation type
KVM: arm64: Remove redundant kern_hyp_va() in unpin_host_sve_state()
KVM: arm64: Fix ID register initialization for non-protected pKVM guests
KVM: arm64: Optimise away S1POE handling when not supported by host
KVM: arm64: Hide S1POE from guests when not supported by the host
Pull debugobjects fix from Thomas Gleixner:
"A single fix for debugobjects.
The deferred page initialization prevents debug objects from
allocating slab pages until the initialization is complete. That
causes depletion of the pool and disabling of debugobjects.
The reason is that debugobjects uses __GFP_HIGH for allocations as it
might be invoked from arbitrary contexts. When PREEMPT_COUNT is
disabled there is no way to know whether the context is safe to set
__GFP_KSWAPD_RECLAIM.
This worked until v6.18. Since then allocations w/o a reclaim flag
cause new_slab() to end up in alloc_frozen_pages_nolock_noprof(),
which returns early when deferred page initialization has not yet
completed.
Work around that when PREEMPT_COUNT is enabled as the preempt counter
allows debugobjects to add __GFP_KSWAPD_RECLAIM to the GFP flags when
the context is preemtible. When PREEMPT_COUNT is disabled the context
is unknown and the reclaim bit can't be set because the caller might
hold locks which might deadlock in the allocator.
That makes debugobjects depend on PREEMPT_COUNT ||
!DEFERRED_STRUCT_PAGE_INIT, which limits the coverage slightly, but
keeps it functional for most cases"
* tag 'core-debugobjects-2026-03-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
debugobject: Make it work with deferred page initialization - again
Pull x86 fixes from Ingo Molnar:
- Fix speculative safety in fred_extint()
- Fix __WARN_printf() trap in early_fixup_exception()
- Fix clang-build boot bug for unusual alignments, triggered by
CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B=y
- Replace the final few __ASSEMBLY__ stragglers that snuck in lately
into non-UAPI x86 headers and use __ASSEMBLER__ consistently (again)
* tag 'x86-urgent-2026-03-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/headers: Replace __ASSEMBLY__ stragglers with __ASSEMBLER__
x86/cfi: Fix CFI rewrite for odd alignments
x86/bug: Handle __WARN_printf() trap in early_fixup_exception()
x86/fred: Correct speculative safety in fred_extint()
Pull timer fix from Ingo Molnar:
"Improve the inlining of jiffies_to_msecs() and jiffies_to_usecs(), for
the common HZ=100, 250 or 1000 cases. Only use a function call for odd
HZ values like HZ=300 that generate more code.
The function call overhead showed up in performance tests of the TCP
code"
* tag 'timers-urgent-2026-03-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
time/jiffies: Inline jiffies_to_msecs() and jiffies_to_usecs()
Pull scheduler fixes from Ingo Molnar:
- Fix zero_vruntime tracking when there's a single task running
- Fix slice protection logic
- Fix the ->vprot logic for reniced tasks
- Fix lag clamping in mixed slice workloads
- Fix objtool uaccess warning (and bug) in the
!CONFIG_RSEQ_SLICE_EXTENSION case caused by unexpected un-inlining,
which triggers with older compilers
- Fix a comment in the rseq registration rseq_size bound check code
- Fix a legacy RSEQ ABI quirk that handled 32-byte area sizes
differently, which special size we now reached naturally and want to
avoid. The visible ugliness of the new reserved field will be avoided
the next time the RSEQ area is extended.
* tag 'sched-urgent-2026-03-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
rseq: slice ext: Ensure rseq feature size differs from original rseq size
rseq: Clarify rseq registration rseq_size bound check comment
sched/core: Fix wakeup_preempt's next_class tracking
rseq: Mark rseq_arm_slice_extension_timer() __always_inline
sched/fair: Fix lag clamp
sched/eevdf: Update se->vprot in reweight_entity()
sched/fair: Only set slice protection at pick time
sched/fair: Fix zero_vruntime tracking
Pull perf events fixes from Ingo Molnar:
- Fix lock ordering bug found by lockdep in perf_event_wakeup()
- Fix uncore counter enumeration on Granite Rapids and Sierra Forest
- Fix perf_mmap() refcount bug found by Syzkaller
- Fix __perf_event_overflow() vs perf_remove_from_context() race
* tag 'perf-urgent-2026-03-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf: Fix __perf_event_overflow() vs perf_remove_from_context() race
perf/core: Fix refcount bug and potential UAF in perf_mmap
perf/x86/intel/uncore: Add per-scheduler IMC CAS count events
perf/core: Fix invalid wait context in ctx_sched_in()
Pull locking fix from Ingo Molnar:
"Now that LLVM 22 has been released officially, require a release
version to use the new CONFIG_WARN_CONTEXT_ANALYSIS feature.
In particular this avoids the widely used Android clang 22.0.1
pre-release build which is known to be broken for this usecase"
* tag 'locking-urgent-2026-03-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
lib/Kconfig.debug: Require a release version of LLVM 22 for context analysis
Pull irqchip driver fixes from Ingo Molnar:
- Fix frozen interrupt bug in the sifive-plic driver
- Limit per-device MSI interrupts on uncommon gic-v3-its hardware
variants
- Address Sparse warning by constifying a variable in the MMP driver
- Revert broken commit and also fix an error check in the ls-extirq
driver
* tag 'irq-urgent-2026-03-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqchip/ls-extirq: Fix devm_of_iomap() error check
Revert "irqchip/ls-extirq: Use for_each_of_imap_item iterator"
irqchip/mmp: Make icu_irq_chip variable static const
irqchip/gic-v3-its: Limit number of per-device MSIs to the range the ITS supports
irqchip/sifive-plic: Fix frozen interrupt due to affinity setting
Pull SCSI fixes from James Bottomley:
"All changes in drivers (well technically SES is enclosure services,
but its change is minor). The biggest is the write combining change in
lpfc followed by the additional NULL checks in mpi3mr"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: ufs: core: Fix shift out of bounds when MAXQ=32
scsi: ufs: core: Move link recovery for hibern8 exit failure to wl_resume
scsi: ufs: core: Fix possible NULL pointer dereference in ufshcd_add_command_trace()
scsi: snic: MAINTAINERS: Update snic maintainers
scsi: snic: Remove unused linkstatus
scsi: pm8001: Fix use-after-free in pm8001_queue_command()
scsi: mpi3mr: Add NULL checks when resetting request and reply queues
scsi: ufs: core: Reset urgent_bkops_lvl to allow runtime PM power mode
scsi: ses: Fix devices attaching to different hosts
scsi: ufs: core: Fix RPMB region size detection for UFS 2.2
scsi: storvsc: Fix scheduling while atomic on PREEMPT_RT
scsi: lpfc: Properly set WC for DPP mapping
Pull bpf fixes from Alexei Starovoitov:
- Fix alignment of arm64 JIT buffer to prevent atomic tearing (Fuad
Tabba)
- Fix invariant violation for single value tnums in the verifier
(Harishankar Vishwanathan, Paul Chaignon)
- Fix a bunch of issues found by ASAN in selftests/bpf (Ihor Solodrai)
- Fix race in devmpa and cpumap on PREEMPT_RT (Jiayuan Chen)
- Fix show_fdinfo of kprobe_multi when cookies are not present (Jiri
Olsa)
- Fix race in freeing special fields in BPF maps to prevent memory
leaks (Kumar Kartikeya Dwivedi)
- Fix OOB read in dmabuf_collector (T.J. Mercier)
* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: (36 commits)
selftests/bpf: Avoid simplification of crafted bounds test
selftests/bpf: Test refinement of single-value tnum
bpf: Improve bounds when tnum has a single possible value
bpf: Introduce tnum_step to step through tnum's members
bpf: Fix race in devmap on PREEMPT_RT
bpf: Fix race in cpumap on PREEMPT_RT
selftests/bpf: Add tests for special fields races
bpf: Retire rcu_trace_implies_rcu_gp() from local storage
bpf: Delay freeing fields in local storage
bpf: Lose const-ness of map in map_check_btf()
bpf: Register dtor for freeing special fields
selftests/bpf: Fix OOB read in dmabuf_collector
selftests/bpf: Fix a memory leak in xdp_flowtable test
bpf: Fix stack-out-of-bounds write in devmap
bpf: Fix kprobe_multi cookies access in show_fdinfo callback
bpf, arm64: Force 8-byte alignment for JIT buffer to prevent atomic tearing
selftests/bpf: Don't override SIGSEGV handler with ASAN
selftests/bpf: Check BPFTOOL env var in detect_bpftool_path()
selftests/bpf: Fix out-of-bounds array access bugs reported by ASAN
selftests/bpf: Fix array bounds warning in jit_disasm_helpers
...
Pull driver core fixes from Danilo Krummrich:
- Do not register imx_clk_scu_driver in imx8qxp_clk_probe(); besides
fixing two other issues, this avoids a deadlock in combination with
commit dc23806a7c ("driver core: enforce device_lock for
driver_match_device()")
- Move secondary node lookup from device_get_next_child_node() to
fwnode_get_next_child_node(); this avoids issues when users switch
from the device API to the fwnode API
- Export io_define_{read,write}!() to avoid unused import warnings when
CONFIG_PCI=n
* tag 'driver-core-7.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core:
clk: scu/imx8qxp: do not register driver in probe()
rust: io: macro_export io_define_read!() and io_define_write!()
device property: Allow secondary lookup in fwnode_get_next_child_node()
Pull smb client fixes from Steve French:
- Two multichannel fixes
- Locking fix for superblock flags
- Fix to remove debug message that could log password
- Cleanup fix for setting credentials
* tag 'v7.0rc1-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
smb: client: Use snprintf in cifs_set_cifscreds
smb: client: Don't log plaintext credentials in cifs_set_cifscreds
smb: client: fix broken multichannel with krb5+signing
smb: client: use atomic_t for mnt_cifs_flags
smb: client: fix cifs_pick_channel when channels are equally loaded
Pull spi fixes from Mark Brown:
"One fix for the stm32 driver which got broken for DMA chaining cases,
plus a removal of some straggling bindings for the Bikal SoC which has
been pulled out of the kernel"
* tag 'spi-fix-v7.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi:
spi: stm32: fix missing pointer assignment in case of dma chaining
spi: dt-bindings: snps,dw-abp-ssi: Remove unused bindings
Pull regulator fixes from Mark Brown:
"A small pile of fixes, none of which are super major - the code fixes
are improved error handling and fixing a leak of a device node.
We also have a typo fix and an improvement to make the binding example
for mt6359 more directly usable"
* tag 'regulator-fix-v7.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator:
regulator: Kconfig: fix a typo
regulator: bq257xx: Fix device node reference leak in bq257xx_reg_dt_parse_gpio()
regulator: fp9931: Fix PM runtime reference leak in fp9931_hwmon_read()
regulator: tps65185: check devm_kzalloc() result in probe
regulator: dt-bindings: mt6359: make regulator names unique
Pull s390 fixes from Vasily Gorbik:
- Fix guest pfault init to pass a physical address to DIAG 0x258,
restoring pfault interrupts and avoiding vCPU stalls during host
page-in
- Fix kexec/kdump hangs with stack protector by marking
s390_reset_system() __no_stack_protector; set_prefix(0) switches
lowcore and the canary no longer matches
- Fix idle/vtime cputime accounting (idle-exit ordering, vtimer
double-forwarding) and small cleanups
* tag 's390-7.0-3' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
s390/pfault: Fix virtual vs physical address confusion
s390/kexec: Disable stack protector in s390_reset_system()
s390/idle: Remove psw_idle() prototype
s390/vtime: Use lockdep_assert_irqs_disabled() instead of BUG_ON()
s390/vtime: Use __this_cpu_read() / get rid of READ_ONCE()
s390/irq/idle: Remove psw bits early
s390/idle: Inline update_timer_idle()
s390/idle: Slightly optimize idle time accounting
s390/idle: Add comment for non obvious code
s390/vtime: Fix virtual timer forwarding
s390/idle: Fix cpu idle exit cpu time accounting
KVM/arm64 fixes for 7.0, take #1
- Make sure we don't leak any S1POE state from guest to guest when
the feature is supported on the HW, but not enabled on the host
- Propagate the ID registers from the host into non-protected VMs
managed by pKVM, ensuring that the guest sees the intended feature set
- Drop double kern_hyp_va() from unpin_host_sve_state(), which could
bite us if we were to change kern_hyp_va() to not being idempotent
- Don't leak stage-2 mappings in protected mode
- Correctly align the faulting address when dealing with single page
stage-2 mappings for PAGE_SIZE > 4kB
- Fix detection of virtualisation-capable GICv5 IRS, due to the
maintainer being obviously fat fingered...
- Remove duplication of code retrieving the ASID for the purpose of
S1 PT handling
- Fix slightly abusive const-ification in vgic_set_kvm_info()