Pull RISC-V fixes from Paul Walmsley:
- Fix a CONFIG_SPARSEMEM crash on RV32 by avoiding early phys_to_page()
- Prevent runtime const infrastructure from being used by modules,
similar to what was done for x86
- Avoid problems when shutting down ACPI systems with IOMMUs by adding
a device dependency between IOMMU and devices that use it
- Fix a bug where the CPU pointer masking state isn't properly reset
when tagged addresses aren't enabled for a task
- Fix some incorrect register assignments, and add some missing ones,
in kgdb support code
- Fix compilation of non-kernel code that uses the ptrace uapi header
by replacing BIT() with _BITUL()
- Fix compilation of the validate_v_ptrace kselftest by working around
kselftest macro expansion issues
* tag 'riscv-for-linus-7.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
ACPI: RIMT: Add dependency between iommu and devices
selftests: riscv: Add braces around EXPECT_EQ()
riscv: use _BITUL macro rather than BIT() in ptrace uapi and kselftests
riscv: Reset pmm when PR_TAGGED_ADDR_ENABLE is not set
riscv: make runtime const not usable by modules
riscv: patch: Avoid early phys_to_page()
riscv: kgdb: fix several debug register assignment bugs
Pull perf fix from Ingo Molnar:
- Fix potential bad container_of() in intel_pmu_hw_config() (Ian
Rogers)
* tag 'perf-urgent-2026-04-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86: Fix potential bad container_of in intel_pmu_hw_config
Pull MIPS fixes from Thomas Bogendoerfer:
- Fix TLB uniquification for systems with TLB not initialised by
firmware
- Fix allocation in TLB uniquification
- Fix SiByte cache initialisation
- Check uart parameters from firmware on Loongson64 systems
- Fix clock id mismatch for Ralink SoCs
- Fix GCC version check for __multi3 workaround
* tag 'mips-fixes_7.0_1' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux:
mips: mm: Allocate tlb_vpn array atomically
MIPS: mm: Rewrite TLB uniquification for the hidden bit feature
MIPS: mm: Suppress TLB uniquification on EHINV hardware
MIPS: Always record SEGBITS in cpu_data.vmbits
MIPS: Fix the GCC version check for `__multi3' workaround
MIPS: SiByte: Bring back cache initialisation
mips: ralink: update CPU clock index
MIPS: Loongson64: env: Check UARTs passed by LEFI cautiously
In set_tagged_addr_ctrl(), when PR_TAGGED_ADDR_ENABLE is not set, pmlen
is correctly set to 0, but it forgets to reset pmm. This results in the
CPU pmm state not corresponding to the software pmlen state.
Fix this by resetting pmm along with pmlen.
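The mismatch can be illustrated with a minimal user-space sketch. The
struct, the function and the PMM_* encodings below are hypothetical
stand-ins chosen for illustration, not the kernel's definitions; only the
rule "reset pmm along with pmlen" is taken from the description above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: pmlen is the software view, pmm the mode handed to the CPU. */
struct pm_state {
	unsigned int pmlen;	/* number of masked pointer bits (software state) */
	unsigned int pmm;	/* pointer masking mode (hardware state) */
};

/* Illustrative encodings only. */
enum { PMM_OFF = 0, PMM_PMLEN_7 = 2, PMM_PMLEN_16 = 3 };

static void set_tagged_addr(struct pm_state *st, bool enable, unsigned int pmlen)
{
	if (!enable) {
		st->pmlen = 0;
		st->pmm = PMM_OFF;	/* the reset that was missing */
		return;
	}

	assert(pmlen == 7 || pmlen == 16);
	st->pmlen = pmlen;
	st->pmm = (pmlen == 16) ? PMM_PMLEN_16 : PMM_PMLEN_7;
}
```

Without the PMM_OFF assignment, a task that enables and then disables
tagged addresses would keep a stale masking mode while pmlen reads 0.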
Fixes: 2e17430858 ("riscv: Add support for the tagged address ABI")
Signed-off-by: Zishun Yi <vulab@iscas.ac.cn>
Reviewed-by: Samuel Holland <samuel.holland@sifive.com>
Link: https://patch.msgid.link/20260322160022.21908-1-vulab@iscas.ac.cn
Signed-off-by: Paul Walmsley <pjw@kernel.org>
Similar to what commit 284922f4c5 ("x86: uaccess: don't use runtime-const
rewriting in modules") does, make riscv's runtime const unusable by
modules too, to "make sure this doesn't get forgotten the next time
somebody wants to do runtime constant optimizations". The reason is
well explained in the above commit: "The runtime-const infrastructure
was never designed to handle the modular case, because the constant
fixup is only done at boot time for core kernel code."
Signed-off-by: Jisheng Zhang <jszhang@kernel.org>
Link: https://patch.msgid.link/20260221023731.3476-1-jszhang@kernel.org
Signed-off-by: Paul Walmsley <pjw@kernel.org>
Fix several bugs in the RISC-V kgdb implementation:
- The element of dbg_reg_def[] that is supposed to pertain to the S1
register embeds instead the struct pt_regs offset of the A1
register. Fix this to use the S1 register offset in struct pt_regs.
- The sleeping_thread_to_gdb_regs() function copies the value of the
S10 register into the gdb_regs[] array element meant for the S9
register, and copies the value of the S11 register into the array
element meant for the S10 register. It also neglects to copy the
value of the S11 register. Fix all of these issues.
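The second item is an off-by-one in a register copy. A minimal sketch of
the corrected mapping follows; the slot names, the saved-context struct
and the function are hypothetical and much smaller than the real kgdb
tables, so treat this as an illustration of the intended one-to-one
mapping rather than the actual code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical gdb register slots for the callee-saved registers S9..S11. */
enum { REG_S9, REG_S10, REG_S11, REG_COUNT };

/* Hypothetical saved context: s[9], s[10] and s[11] hold S9, S10 and S11. */
struct saved_ctx {
	unsigned long s[12];
};

static void copy_saved_regs(unsigned long *gdb_regs, size_t n,
			    const struct saved_ctx *ctx)
{
	assert(n >= REG_COUNT);

	/* Each slot receives the register it is named after. */
	gdb_regs[REG_S9]  = ctx->s[9];
	gdb_regs[REG_S10] = ctx->s[10];
	gdb_regs[REG_S11] = ctx->s[11];
}
```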
Fixes: fe89bd2be8 ("riscv: Add KGDB support")
Cc: Vincent Chen <vincent.chen@sifive.com>
Link: https://patch.msgid.link/fde376f8-bcfd-bfe4-e467-07d8f7608d05@kernel.org
Signed-off-by: Paul Walmsley <pjw@kernel.org>
Pull s390 fixes from Vasily Gorbik:
- Fix a memory leak in the zcrypt driver where the AP message buffer
for clear key RSA requests was allocated twice, once by the caller
and again locally, causing the first allocation to never be freed
- Fix the cpum_sf perf sampling rate overflow adjustment to clamp the
recalculated rate to the hardware maximum, preventing exceptions on
heavily loaded systems running with HZ=1000
* tag 's390-7.0-7' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
s390/zcrypt: Fix memory leak with CCA cards used as accelerator
s390/cpum_sf: Cap sampling rate to prevent lsctl exception
Pull arm64 fix from Will Deacon:
- Implement a basic static call trampoline to fix CFI failures with the
generic implementation
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
arm64: Use static call trampolines when kCFI is enabled
Auto counter reload may have a group of events with software events
present within it. The software event PMU isn't the x86_hybrid_pmu and
a container_of operation in intel_pmu_set_acr_caused_constr (via the
hybrid helper) could cause out of bound memory reads. Avoid this by
guarding the call to intel_pmu_set_acr_caused_constr with an
is_x86_event check.
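The hazard is generic to container_of(): it is only valid when the member
really is embedded in the assumed outer type. A minimal sketch, assuming
hypothetical struct layouts and a hypothetical type tag in place of the
real is_x86_event() test:

```c
#include <stdbool.h>
#include <stddef.h>

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

enum { PMU_SOFTWARE, PMU_X86 };

/* Hypothetical generic PMU and an outer struct that embeds one. */
struct pmu {
	int type;
};

struct hybrid_pmu {
	int acr_mask;
	struct pmu pmu;		/* not the first member, so container_of() subtracts */
};

static bool is_x86_pmu(const struct pmu *p)
{
	return p->type == PMU_X86;
}

/* Returns the ACR mask, or -1 when the pmu is not embedded in a hybrid_pmu. */
static int acr_mask_of(struct pmu *p)
{
	if (!is_x86_pmu(p))
		return -1;	/* guard: never run container_of() on a foreign pmu */

	return container_of(p, struct hybrid_pmu, pmu)->acr_mask;
}
```

Without the guard, a standalone struct pmu would be treated as if it sat
inside a struct hybrid_pmu, and the read of acr_mask would land before
the start of the object.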
Fixes: ec980e4fac ("perf/x86/intel: Support auto counter reload")
Signed-off-by: Ian Rogers <irogers@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Thomas Falcon <thomas.falcon@intel.com>
Link: https://patch.msgid.link/20260312194305.1834035-1-irogers@google.com
Before the introduction of the EHINV feature, which lets software mark
TLB entries invalid, certain older implementations of the MIPS ISA were
equipped with an analogous bit, as a vendor extension, which however is
hidden from software and only ever set at reset, and then any software
write clears it, making the intended TLB entry valid.
This feature makes it unsafe to read a TLB entry with TLBR, modify the
page mask, and write the entry back with TLBWI, because this operation
will implicitly clear the hidden bit and this may create a duplicate
entry, as with the presence of the hidden bit there is no guarantee that
the entries across the TLB are each unique.
Usually the firmware has already uniquified TLB entries before handing
control over, in which case we only need to guarantee at bootstrap no
clash will happen with the VPN2 values chosen in local_flush_tlb_all().
However with systems such as Mikrotik RB532 we get handed the TLB as at
reset, with the hidden bit set across the entries and possibly duplicate
entries present. This then causes a machine check exception when page
sizes are reset in r4k_tlb_uniquify() and prevents the system from
booting.
Rewrite the algorithm used in r4k_tlb_uniquify() then so as to avoid
the reuse of ASID/VPN values across the TLB. Get rid of global entries
first as they may be blocking the entire address space, e.g. 16 256MiB
pages will exhaust the whole address space of a 32-bit CPU and a single
big page can exhaust the 32-bit compatibility space on a 64-bit CPU.
Details of the algorithm chosen are given across the code itself.
Fixes: 9f048fa487 ("MIPS: mm: Prevent a TLB shutdown on initial uniquification")
Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk>
Cc: stable@vger.kernel.org # v6.18+
Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Hardware that supports the EHINV feature, mandatory for R6 ISA and FTLB
implementation, lets software mark TLB entries invalid, which eliminates
the need to ensure no duplicate matching entries are ever created. This
feature is already used by local_flush_tlb_all(), via the UNIQUE_ENTRYHI
macro, making the preceding call to r4k_tlb_uniquify() superfluous.
The next change will also modify uniquification code such that it'll
become incompatible with the FTLB and MMID features, as well as MIPSr6
CPUs that do not implement 4KiB pages.
Therefore prevent r4k_tlb_uniquify() from being used on EHINV hardware,
as denoted by `cpu_has_tlbinv'.
Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk>
Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
With a 32-bit kernel running on 64-bit MIPS hardware the hardcoded value
of `cpu_vmbits' only records the size of compatibility useg and does not
reflect the size of native xuseg or the complete range of values allowed
in the VPN2 field of TLB entries.
An upcoming change will need the actual VPN2 value range permitted even
in 32-bit kernel configurations, so always include the `vmbits' member
in `struct cpuinfo_mips' and probe for SEGBITS when running on 64-bit
hardware, resorting to the currently hardcoded value of 31 on 32-bit
processors. No functional change for users of `cpu_vmbits'.
Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk>
Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
It was only GCC 10 that fixed a MIPS64r6 code generation issue with a
`__multi3' libcall inefficiently produced to perform 64-bit widening
multiplication while suitable machine instructions exist to do such a
calculation. The fix went in with GCC commit 48b2123f6336 ("re PR
target/82981 (unnecessary __multi3 call for mips64r6 linux kernel)").
Adjust our code accordingly, removing build failures such as:
mips64-linux-ld: lib/math/div64.o: in function `mul_u64_add_u64_div_u64':
div64.c:(.text+0x84): undefined reference to `__multi3'
with the GCC versions affected.
Fixes: ebabcf17bc ("MIPS: Implement __multi3 for GCC7 MIPS64r6 builds")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202601140146.hMLODc6v-lkp@intel.com/
Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk>
Cc: stable@vger.kernel.org # v4.15+
Reviewed-by: David Laight <david.laight.linux@gmail.com>
Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Bring back cache initialisation for Broadcom SiByte SB1 cores, which was
removed, causing the kernel to hang at bootstrap right after:
Dentry cache hash table entries: 524288 (order: 8, 4194304 bytes, linear)
Inode-cache hash table entries: 262144 (order: 7, 2097152 bytes, linear)
The cause of the problem is that R4k cache handlers are also used by
Broadcom SiByte SB1 cores, however with a different cache error exception
handler and therefore not using CPU_R4K_CACHE_TLB:
obj-$(CONFIG_CPU_R4K_CACHE_TLB) += c-r4k.o cex-gen.o tlb-r4k.o
obj-$(CONFIG_CPU_SB1) += c-r4k.o cerr-sb1.o cex-sb1.o tlb-r4k.o
(from arch/mips/mm/Makefile).
Fixes: bbe4f634f4 ("mips: fix r3k_cache_init build regression")
Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk>
Cc: stable@vger.kernel.org # v6.8+
Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Some firmware does not set nr_uarts properly and passes empty items.
Iterate at most min(system->nr_uarts, MAX_UARTS) items to prevent
out-of-bounds access, and ignore UARTs with addr 0 silently.
Meanwhile, our DT only works with UPIO_MEM but theoretically firmware
may pass other IO types, so explicitly check against that.
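The three checks can be sketched as follows. The struct layout, the
MAX_UARTS value and the io_type constants are hypothetical placeholders
for the firmware interface, so this shows only the shape of the
validation, not the actual LEFI parsing code.

```c
#include <stdint.h>

#define MAX_UARTS 4	/* hypothetical capacity of the firmware table */

enum io_type { IO_PORT = 0, IO_MEM = 2 };	/* hypothetical values */

struct uart_desc {
	enum io_type iotype;
	uint64_t addr;
};

struct fw_system {
	unsigned int nr_uarts;	/* untrusted: may exceed MAX_UARTS */
	struct uart_desc uarts[MAX_UARTS];
};

static unsigned int count_usable_uarts(const struct fw_system *sys)
{
	/* Clamp the firmware-provided count to the table size. */
	unsigned int n = sys->nr_uarts < MAX_UARTS ? sys->nr_uarts : MAX_UARTS;
	unsigned int i, usable = 0;

	for (i = 0; i < n; i++) {
		if (sys->uarts[i].addr == 0)
			continue;	/* empty item, ignored silently */
		if (sys->uarts[i].iotype != IO_MEM)
			continue;	/* only memory-mapped UARTs are handled */
		usable++;
	}

	return usable;
}
```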
Tested on Loongson-LS3A4000-7A1000-NUC-SE.
Fixes: 3989ed4184 ("MIPS: Loongson64: env: Fixup serial clock-frequency when using LEFI")
Cc: stable@vger.kernel.org
Reviewed-by: Yao Zi <me@ziyao.cc>
Signed-off-by: Rong Zhang <rongrong@oss.cipunited.com>
Reviewed-by: Jiaxun Yang <jiaxun.yang@flygoat.com>
Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Implement arm64 support for the 'unoptimized' static call variety, which
routes all calls through a trampoline that performs a tail call to the
chosen function, and wire it up for use when kCFI is enabled. This works
around an issue with kCFI and generic static calls, where the prototypes
of default handlers such as __static_call_nop() and __static_call_ret0()
don't match the expected prototype of the call site, resulting in kCFI
false positives [0].
Since static call targets may be located in modules loaded out of direct
branching range, this needs an ADRP/LDR pair to load the branch target
into R16 and a branch-to-register (BR) instruction to perform an
indirect call.
Unlike on x86, there is no pressing need on arm64 to avoid indirect
calls at all cost, but hiding it from the compiler as is done here does
have some benefits:
- the literal is located in .rodata, which gives us the same robustness
advantage that code patching does;
- no D-cache pollution from fetching hash values from .text sections.
From an execution speed PoV, this is unlikely to make any difference at
all.
Cc: Sami Tolvanen <samitolvanen@google.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will McVicker <willmcvicker@google.com>
Reported-by: Carlos Llamas <cmllamas@google.com>
Closes: https://lore.kernel.org/all/20260311225822.1565895-1-cmllamas@google.com/ [0]
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
The load_segments() function changes segment registers, invalidating GS base
(which KCOV relies on for per-cpu data). When CONFIG_KCOV is enabled, any
subsequent instrumented C code call (e.g. native_gdt_invalidate()) begins
crashing the kernel in an endless loop.
To reproduce the problem, it's sufficient to do kexec on a KCOV-instrumented
kernel:
$ kexec -l /boot/otherKernel
$ kexec -e
The real-world context for this problem is enabling crash dump collection in
syzkaller. For this, the tool loads a panic kernel before fuzzing and then
calls makedumpfile after the panic. This workflow requires both CONFIG_KEXEC
and CONFIG_KCOV to be enabled simultaneously.
Adding safeguards directly to the KCOV fast-path (__sanitizer_cov_trace_pc())
is also undesirable as it would introduce an extra performance overhead.
Disabling instrumentation for the individual functions would be too fragile,
so disable KCOV instrumentation for the entire machine_kexec_64.c and
physaddr.c. If coverage-guided fuzzing ever needs these components in the
future, other approaches should be considered.
The problem is not relevant for 32 bit kernels as CONFIG_KCOV is not supported
there.
[ bp: Space out comment for better readability. ]
Fixes: 0d345996e4 ("x86/kernel: increase kcov coverage under arch/x86/kernel folder")
Signed-off-by: Aleksandr Nogikh <nogikh@google.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20260325154825.551191-1-nogikh@google.com
Pull kvm fixes from Paolo Bonzini:
"s390:
- Lots of small and not-so-small fixes for the newly rewritten gmap,
mostly affecting the handling of nested guests.
x86:
- Fix an issue with shadow paging, which causes KVM to install an
MMIO PTE in the shadow page tables without first zapping a non-MMIO
SPTE if KVM didn't see the write that modified the shadowed guest
PTE.
While commit a54aa15c6b ("KVM: x86/mmu: Handle MMIO SPTEs
directly in mmu_set_spte()") was right about it being impossible to
miss such a write if it was coming from the guest, it failed to
account for writes to guest memory that are outside the scope of
KVM: if userspace modifies the guest PTE, and then the guest hits a
relevant page fault, KVM will get confused"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: x86/mmu: Only WARN in direct MMUs when overwriting shadow-present SPTE
KVM: x86/mmu: Drop/zap existing present SPTE even when creating an MMIO SPTE
KVM: s390: Fix KVM_S390_VCPU_FAULT ioctl
KVM: s390: vsie: Fix guest page tables protection
KVM: s390: vsie: Fix unshadowing while shadowing
KVM: s390: vsie: Fix refcount overflow for shadow gmaps
KVM: s390: vsie: Fix nested guest memory shadowing
KVM: s390: Correctly handle guest mappings without struct page
KVM: s390: Fix gmap_link()
KVM: s390: vsie: Fix check for pre-existing shadow mapping
KVM: s390: Remove non-atomic dat_crstep_xchg()
KVM: s390: vsie: Fix dat_split_ste()
Pull x86 fixes from Ingo Molnar:
- Fix an early boot crash in AMD SEV-SNP guests, caused by incorrect
FSGSBASE init ordering (Nikunj A Dadhania)
- Remove X86_CR4_FRED from the CR4 pinned bits mask, to fix a race
window during the bootup of SEV-{ES,SNP} or TDX guests, which can
crash them if they trigger exceptions in that window (Borislav
Petkov)
- Fix early boot failures on SEV-ES/SNP guests, due to incorrect early
GHCB access (Nikunj A Dadhania)
- Add clarifying comment to the CRn pinning logic, to avoid future
confusion & bugs (Peter Zijlstra)
* tag 'x86-urgent-2026-03-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/cpu: Add comment clarifying CRn pinning
x86/fred: Fix early boot failures on SEV-ES/SNP guests
x86/cpu: Remove X86_CR4_FRED from the CR4 pinned bits mask
x86/cpu: Enable FSGSBASE early in cpu_init_exception_handling()
Pull s390 fixes from Vasily Gorbik:
- Add array_index_nospec() to syscall dispatch table lookup to prevent
limited speculative out-of-bounds access with user-controlled syscall
number
- Mark array_index_mask_nospec() __always_inline since GCC may emit an
out-of-line call instead of the inline data dependency sequence the
mitigation relies on
- Clear r12 on kernel entry to prevent potential speculative use of
user value in system_call, ext/io/mcck interrupt handlers
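The "inline data dependency sequence" from the second item can be
approximated in portable C. The sketch below is modelled on the generic C
fallback for array_index_mask_nospec(), not on the s390 assembly; the
helper names are hypothetical, it assumes size does not exceed LONG_MAX
and that right-shifting a negative long is arithmetic (true for GCC and
Clang). A compiler may still turn plain C into a branch, which is why the
real mitigation depends on the sequence being emitted inline.

```c
#include <limits.h>

#define BITS_PER_LONG (sizeof(long) * CHAR_BIT)

/* All ones when index < size, zero otherwise, computed without a branch. */
static inline unsigned long index_mask(unsigned long index, unsigned long size)
{
	return (unsigned long)(~(long)(index | (size - 1UL - index)) >>
			       (BITS_PER_LONG - 1));
}

/* Out-of-range indices collapse to 0, so a speculative load stays in bounds. */
static inline unsigned long clamp_index(unsigned long index, unsigned long size)
{
	return index & index_mask(index, size);
}
```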
* tag 's390-7.0-6' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
s390/entry: Scrub r12 register on kernel entry
s390/syscalls: Add spectre boundary for syscall dispatch table
s390/barrier: Make array_index_mask_nospec() __always_inline
Before commit f33f2d4c7c ("s390/bp: remove TIF_ISOLATE_BP"),
all entry handlers loaded r12 with the current task pointer
(lg %r12,__LC_CURRENT) for use by the BPENTER/BPEXIT macros. That
commit removed TIF_ISOLATE_BP, dropping both the branch prediction
macros and the r12 load, but did not add r12 to the register clearing
sequence.
Add the missing xgr %r12,%r12 to make the register scrub consistent
across all entry points.
Fixes: f33f2d4c7c ("s390/bp: remove TIF_ISOLATE_BP")
Cc: stable@kernel.org
Reviewed-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Pull EFI fix from Ard Biesheuvel:
"Fix a potential buffer overrun issue introduced by the previous fix
for EFI boot services region reservations on x86"
* tag 'efi-fixes-for-v7.0-3' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi:
x86/efi: efi_unmap_boot_services: fix calculation of ranges_to_free size
Pull LoongArch fixes from Huacai Chen:
"Fix missing NULL checks for kstrdup(), workaround LS2K/LS7A GPU
DMA hang bug, emit GNU_EH_FRAME for vDSO correctly, and fix some
KVM-related bugs"
* tag 'loongarch-fixes-7.0-2' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson:
LoongArch: KVM: Fix base address calculation in kvm_eiointc_regs_access()
LoongArch: KVM: Handle the case that EIOINTC's coremap is empty
LoongArch: KVM: Make kvm_get_vcpu_by_cpuid() more robust
LoongArch: vDSO: Emit GNU_EH_FRAME correctly
LoongArch: Workaround LS2K/LS7A GPU DMA hang bug
LoongArch: Fix missing NULL checks for kstrdup()
Adjust KVM's sanity check against overwriting a shadow-present SPTE with
another SPTE with a different target PFN to only apply to direct MMUs,
i.e. only to MMUs without shadowed gPTEs. While it's impossible for KVM
to overwrite a shadow-present SPTE in response to a guest write, writes
from outside the scope of KVM, e.g. from host userspace, aren't detected
by KVM's write tracking and so can break KVM's shadow paging rules.
------------[ cut here ]------------
pfn != spte_to_pfn(*sptep)
WARNING: arch/x86/kvm/mmu/mmu.c:3069 at mmu_set_spte+0x1e4/0x440 [kvm], CPU#0: vmx_ept_stale_r/872
Modules linked in: kvm_intel kvm irqbypass
CPU: 0 UID: 1000 PID: 872 Comm: vmx_ept_stale_r Not tainted 7.0.0-rc2-eafebd2d2ab0-sink-vm #319 PREEMPT
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
RIP: 0010:mmu_set_spte+0x1e4/0x440 [kvm]
Call Trace:
<TASK>
ept_page_fault+0x535/0x7f0 [kvm]
kvm_mmu_do_page_fault+0xee/0x1f0 [kvm]
kvm_mmu_page_fault+0x8d/0x620 [kvm]
vmx_handle_exit+0x18c/0x5a0 [kvm_intel]
kvm_arch_vcpu_ioctl_run+0xc55/0x1c20 [kvm]
kvm_vcpu_ioctl+0x2d5/0x980 [kvm]
__x64_sys_ioctl+0x8a/0xd0
do_syscall_64+0xb5/0x730
entry_SYSCALL_64_after_hwframe+0x4b/0x53
</TASK>
---[ end trace 0000000000000000 ]---
Fixes: 11d4517511 ("KVM: x86/mmu: Warn if PFN changes on shadow-present SPTE in shadow MMU")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
When installing an emulated MMIO SPTE, do so *after* dropping/zapping the
existing SPTE (if it's shadow-present). While commit a54aa15c6b was
right about it being impossible to convert a shadow-present SPTE to an
MMIO SPTE due to a _guest_ write, it failed to account for writes to guest
memory that are outside the scope of KVM.
E.g. if host userspace modifies a shadowed gPTE to switch from a memslot
to emulated MMIO and then the guest hits a relevant page fault, KVM will
install the MMIO SPTE without first zapping the shadow-present SPTE.
------------[ cut here ]------------
is_shadow_present_pte(*sptep)
WARNING: arch/x86/kvm/mmu/mmu.c:484 at mark_mmio_spte+0xb2/0xc0 [kvm], CPU#0: vmx_ept_stale_r/4292
Modules linked in: kvm_intel kvm irqbypass
CPU: 0 UID: 1000 PID: 4292 Comm: vmx_ept_stale_r Not tainted 7.0.0-rc2-eafebd2d2ab0-sink-vm #319 PREEMPT
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
RIP: 0010:mark_mmio_spte+0xb2/0xc0 [kvm]
Call Trace:
<TASK>
mmu_set_spte+0x237/0x440 [kvm]
ept_page_fault+0x535/0x7f0 [kvm]
kvm_mmu_do_page_fault+0xee/0x1f0 [kvm]
kvm_mmu_page_fault+0x8d/0x620 [kvm]
vmx_handle_exit+0x18c/0x5a0 [kvm_intel]
kvm_arch_vcpu_ioctl_run+0xc55/0x1c20 [kvm]
kvm_vcpu_ioctl+0x2d5/0x980 [kvm]
__x64_sys_ioctl+0x8a/0xd0
do_syscall_64+0xb5/0x730
entry_SYSCALL_64_after_hwframe+0x4b/0x53
RIP: 0033:0x47fa3f
</TASK>
---[ end trace 0000000000000000 ]---
Reported-by: Alexander Bulekov <bkov@amazon.com>
Debugged-by: Alexander Bulekov <bkov@amazon.com>
Suggested-by: Fred Griffoul <fgriffo@amazon.co.uk>
Fixes: a54aa15c6b ("KVM: x86/mmu: Handle MMIO SPTEs directly in mmu_set_spte()")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
A previous commit changed the behaviour of the KVM_S390_VCPU_FAULT
ioctl. The current (wrong) implementation will trigger a guest
addressing exception if the requested address lies outside of a
memslot, unless the VM is UCONTROL.
Restore the previous behaviour by open coding the fault-in logic.
Fixes: 3762e905ec ("KVM: s390: use __kvm_faultin_pfn()")
Acked-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Reviewed-by: Steffen Eiden <seiden@linux.ibm.com>
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
When shadowing, the guest page tables are write-protected, in order to
trap changes and properly unshadow the shadow mapping for the nested
guest. Already shadowed levels are skipped, so that only the needed
levels are write protected.
Currently the levels that get write protected are exactly one level too
deep: the last level (nested guest memory) gets protected in the wrong
way, and will be protected again correctly a few lines afterwards; most
importantly, the highest non-shadowed level does *not* get write
protected.
Moreover, if the nested guest is running in a real address space, there
are no DAT tables to shadow.
Write protect the correct levels, so that all the levels that need to
be protected are protected, and avoid double protecting the last level;
skip attempting to shadow the DAT tables when the nested guest is
running in a real address space.
Fixes: e38c884df9 ("KVM: s390: Switch to new gmap")
Tested-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
If shadowing causes the shadow gmap to get unshadowed, exit early to
prevent an attempt to dereference the parent pointer, which at this
point is NULL.
Opportunistically add some more checks to prevent NULL parents.
Fixes: a2c17f9270 ("KVM: s390: New gmap code")
Fixes: e5f98a6899 ("KVM: s390: Add some helper functions needed for vSIE")
Fixes: e38c884df9 ("KVM: s390: Switch to new gmap")
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
In most cases gmap_put() was not called when it should have been.
Add the missing gmap_put() in vsie_run().
Fixes: e38c884df9 ("KVM: s390: Switch to new gmap")
Reviewed-by: Steffen Eiden <seiden@linux.ibm.com>
Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Fix _do_shadow_pte() to use the correct pointer (guest pte instead of
nested guest) to set up the new pte.
Add a check to return -EOPNOTSUPP if the mapping for the nested guest
is writeable but the same page in the guest is only read-only.
Fixes: e38c884df9 ("KVM: s390: Switch to new gmap")
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Introduce a new special softbit for large pages, like the one already
present for normal pages, and use it to mark guest mappings that do not
have struct pages.
Whenever a leaf DAT entry becomes dirty, check the special softbit and
only call SetPageDirty() if there is an actual struct page.
Move the logic to mark pages dirty inside _gmap_ptep_xchg() and
_gmap_crstep_xchg_atomic(), to avoid needlessly duplicating the code.
Fixes: 5a74e3d934 ("KVM: s390: KVM-specific bitfields and helper functions")
Fixes: a2c17f9270 ("KVM: s390: New gmap code")
Reviewed-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
The slow path of the fault handler ultimately called gmap_link(), which
assumed the fault was a major fault, and blindly called dat_link().
In case of minor faults, things were not always handled properly; in
particular the prefix and vsie marker bits were ignored.
Move dat_link() into gmap.c, renaming it accordingly. Once moved, the
new _gmap_link() function will be able to correctly honour the prefix
and vsie markers.
This will cause spurious unshadows in some uncommon cases.
Fixes: 94fd9b16cc ("KVM: s390: KVM page table management functions: lifecycle management")
Fixes: a2c17f9270 ("KVM: s390: New gmap code")
Reviewed-by: Steffen Eiden <seiden@linux.ibm.com>
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
When shadowing a nested guest, a check is performed and no shadowing is
attempted if the nested guest is already shadowed.
The existing check was incomplete; fix it by also checking whether the
leaf DAT table entry in the existing shadow gmap has the same protection
as the one specified in the guest DAT entry.
Fixes: e38c884df9 ("KVM: s390: Switch to new gmap")
Reviewed-by: Steffen Eiden <seiden@linux.ibm.com>
Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
In practice dat_crstep_xchg() is racy and hard to use correctly. Simply
remove it and replace its uses with dat_crstep_xchg_atomic().
This solves some actual races that lead to system hangs / crashes.
Opportunistically fix an alignment issue in _gmap_crstep_xchg_atomic().
Fixes: 589071eaaa ("KVM: s390: KVM page table management functions: clear and replace")
Fixes: 94fd9b16cc ("KVM: s390: KVM page table management functions: lifecycle management")
Reviewed-by: Steffen Eiden <seiden@linux.ibm.com>
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
If the guest misbehaves and puts the page tables for its nested guest
inside the memory of the nested guest itself, and the guest and nested
guest are being mapped with large pages, the shadow mapping will
lose synchronization with the actual mapping, since this will cause the
large page with the vsie notification bit to be split, but the
vsie notification bit will not be propagated to the resulting small
pages.
Fix this by propagating the vsie_notif bit from large pages to normal
pages when splitting a large page.
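The propagation step can be sketched with a toy model. The entry structs,
the sizes and the function name are hypothetical simplifications (real DAT
entries are bitfields), so this only illustrates that every small entry
produced by a split inherits the notification bit of the large entry.

```c
#include <stdbool.h>
#include <stddef.h>

#define SMALL_PAGE_SIZE   4096UL	/* assumed small page size */
#define PTES_PER_SEGMENT  256		/* assumed: 1 MiB large page / 4 KiB pages */

/* Hypothetical, simplified entries. */
struct large_entry {
	unsigned long origin;	/* start address of the large page */
	bool vsie_notif;	/* notification bit used for unshadowing */
};

struct small_entry {
	unsigned long addr;
	bool vsie_notif;
};

static void split_large(const struct large_entry *ste, struct small_entry *ptes)
{
	size_t i;

	for (i = 0; i < PTES_PER_SEGMENT; i++) {
		ptes[i].addr = ste->origin + i * SMALL_PAGE_SIZE;
		/* Carry the marker over so later changes still trigger unshadowing. */
		ptes[i].vsie_notif = ste->vsie_notif;
	}
}
```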
Fixes: 2db149a0a6 ("KVM: s390: KVM page table management functions: walks")
Reviewed-by: Christoph Schlameuss <schlameuss@linux.ibm.com>
Reviewed-by: Steffen Eiden <seiden@linux.ibm.com>
Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
In function kvm_eiointc_regs_access(), the register base address is
calculated as the array base address plus an offset, where the offset is
an absolute value from the base address. The data type of the array base
address is u64, so it should be converted to the "void *" type before the
offset is added.
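The difference comes down to C pointer arithmetic: adding to a u64
pointer advances in units of eight bytes, whereas adding to a "void *"
(a GNU extension the kernel relies on) advances in bytes. A minimal
sketch, assuming the offset is meant as a byte offset and using
hypothetical helper names; the portable "char *" cast stands in for the
"void *" arithmetic:

```c
#include <stddef.h>
#include <stdint.h>

/* Scaled form: the offset is multiplied by sizeof(uint64_t). */
static void *reg_addr_scaled(uint64_t *base, size_t offset)
{
	return base + offset;
}

/* Byte form: equivalent to (void *)base + offset in GNU C. */
static void *reg_addr_bytes(uint64_t *base, size_t offset)
{
	return (char *)base + offset;
}
```

With an offset of 8, the scaled form lands on the ninth array element,
while the byte form lands on the second.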
Cc: <stable@vger.kernel.org>
Fixes: d3e43a1f34 ("LoongArch: KVM: Use 64-bit register definition for EIOINTC")
Reported-by: Aurelien Jarno <aurel32@debian.org>
Link: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1131431
Signed-off-by: Bibo Mao <maobibo@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
EIOINTC's coremap in eiointc_update_sw_coremap() can be empty. Currently
we get a cpuid of -1 in this case, but we actually need 0 because it's
similar to the case where cpuid >= 4.
This fixes an out-of-bounds access to kvm_arch::phyid_map::phys_map[].
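A minimal sketch of the corrected mapping, assuming the cpuid is derived
from the lowest set bit of the coremap. The helper names are
hypothetical, and first_set_bit() mimics "ffs(v) - 1" so that the example
needs no non-standard headers:

```c
/* Index of the lowest set bit, or -1 when no bit is set. */
static int first_set_bit(unsigned int v)
{
	int i;

	for (i = 0; i < 32; i++)
		if (v & (1u << i))
			return i;

	return -1;
}

static int coremap_to_cpuid(unsigned int coremap)
{
	int cpuid = first_set_bit(coremap);

	/* An empty coremap (-1) is treated like cpuid >= 4: fall back to 0. */
	if (cpuid < 0 || cpuid >= 4)
		cpuid = 0;

	return cpuid;
}
```

The result is therefore always a valid non-negative index, whereas an
unchecked -1 would index before the start of an array.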
Cc: <stable@vger.kernel.org>
Fixes: 3956a52bc0 ("LoongArch: KVM: Add EIOINTC read and write functions")
Reported-by: Aurelien Jarno <aurel32@debian.org>
Link: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1131431
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
With -fno-asynchronous-unwind-tables and --no-eh-frame-hdr (the default
of the linker), the GNU_EH_FRAME segment (specified by vdso.lds.S) is
empty. This is not valid, as the current DWARF specification mandates
the first byte of the EH frame to be the version number 1. It causes
some unwinders to complain, for example the ClickHouse query profiler
spams the log with messages:
clickhouse-server[365854]: libunwind: unsupported .eh_frame_hdr
version: 127 at 7ffffffb0000
Here "127" is just the byte located at the p_vaddr (0, i.e. the
beginning of the vDSO) of the empty GNU_EH_FRAME segment. Cross-
checking with /proc/365854/maps has also proven 7ffffffb0000 is the
start of vDSO in the process VM image.
In LoongArch the -fno-asynchronous-unwind-tables option seems just a
MIPS legacy, and MIPS only uses this option to satisfy the MIPS-specific
"genvdso" program, per the commit cfd75c2db1 ("MIPS: VDSO: Explicitly
use -fno-asynchronous-unwind-tables"). IIRC it indicates some inherent
limitation of the MIPS ELF ABI and has nothing to do with LoongArch. So
we can simply flip it over to -fasynchronous-unwind-tables and pass
--eh-frame-hdr for linking the vDSO, allowing the profilers to unwind the
stack for statistics even if the sample point is taken when the PC is in
the vDSO.
However simply adjusting the options above would expose an issue: when
the libgcc unwinder saw the invalid GNU_EH_FRAME segment, it silently
fell back to a machine-specific routine to match the code pattern of
rt_sigreturn() and extract the registers saved in the sigframe if the
code pattern is matched. As unwinding from signal handlers is vital for
libgcc to support pthread cancellation etc., the fall-back routine had
been silently keeping the LoongArch Linux systems functioning since
Linux 5.19. But when we start to emit GNU_EH_FRAME with the correct
format, the fall-back routine will no longer be used and libgcc will fail
to unwind the sigframe, and unwinding from signal handlers will no
longer work, causing dozens of glibc test failures. To make it possible
to unwind from signal handlers again, it's necessary to code the unwind
info in __vdso_rt_sigreturn via .cfi_* directives.
The offsets in the .cfi_* directives depend on the layout of struct
sigframe, notably the offset of sigcontext in the sigframe. To use the
offset in the assembly file, factor out struct sigframe into a header to
allow asm-offsets.c to output the offset for assembly.
To work around a long-term issue in the libgcc unwinder (1 is
unconditionally subtracted from the pc: doing so is technically incorrect for
a signal frame), a nop instruction is included with the two real
instructions in __vdso_rt_sigreturn in the same FDE PC range. The same
hack has been used on x86 for a long time.
Cc: stable@vger.kernel.org
Fixes: c6b99bed6b ("LoongArch: Add VDSO and VSYSCALL support")
Signed-off-by: Xi Ruoyao <xry111@xry111.site>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
1. Hardware limitation: GPU, DC and VPU are typically PCI device 06.0,
06.1 and 06.2. They share some hardware resources, so when configuring
BAR1 of the PCI 06.0 device, DMA memory access cannot be performed
through this BAR, otherwise it will cause hardware abnormalities.
2. In typical scenarios of reboot or S3/S4, DC access to memory through
BAR is not prohibited, resulting in GPU DMA hangs.
3. Workaround method: When configuring the 06.0 device BAR1, turn off
the memory access of DC, GPU and VPU (via DC's CRTC registers).
Cc: stable@vger.kernel.org
Signed-off-by: Qianhai Wu <wuqianhai@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>