Pull KVM updates from Paolo Bonzini:
"Loongarch:
- Add more CPUCFG mask bits
- Improve feature detection
- Add lazy load support for FPU and binary translation (LBT) register
state
- Fix return value for memory reads from and writes to in-kernel
devices
- Add support for detecting preemption from within a guest
- Add KVM steal time test case to tools/selftests
ARM:
- Add support for FEAT_IDST, allowing ID registers that are not
implemented to be reported as a normal trap rather than as an UNDEF
exception
- Add sanitisation of the VTCR_EL2 register, fixing a number of
UXN/PXN/XN bugs in the process
- Full handling of RESx bits, instead of only RES0, and resulting in
SCTLR_EL2 being added to the list of sanitised registers
- More pKVM fixes for features that are not supposed to be exposed to
guests
- Make sure that MTE being disabled on the pKVM host doesn't give it
the ability to attack the hypervisor
- Allow pKVM's host stage-2 mappings to use the Force Write Back
version of the memory attributes by using the "pass-through'
encoding
- Fix trapping of ICC_DIR_EL1 on GICv5 hosts emulating GICv3 for the
guest
- Preliminary work for guest GICv5 support
- A bunch of debugfs fixes, removing pointless custom iterators
stored in guest data structures
- A small set of FPSIMD cleanups
- Selftest fixes addressing the incorrect alignment of page
allocation
- Other assorted low-impact fixes and spelling fixes
RISC-V:
- Fixes for issues discoverd by KVM API fuzzing in
kvm_riscv_aia_imsic_has_attr(), kvm_riscv_aia_imsic_rw_attr(), and
kvm_riscv_vcpu_aia_imsic_update()
- Allow Zalasr, Zilsd and Zclsd extensions for Guest/VM
- Transparent huge page support for hypervisor page tables
- Adjust the number of available guest irq files based on MMIO
register sizes found in the device tree or the ACPI tables
- Add RISC-V specific paging modes to KVM selftests
- Detect paging mode at runtime for selftests
s390:
- Performance improvement for vSIE (aka nested virtualization)
- Completely new memory management. s390 was a special snowflake that
enlisted help from the architecture's page table management to
build hypervisor page tables, in particular enabling sharing the
last level of page tables. This however was a lot of code (~3K
lines) in order to support KVM, and also blocked several features.
The biggest advantages is that the page size of userspace is
completely independent of the page size used by the guest:
userspace can mix normal pages, THPs and hugetlbfs as it sees fit,
and in fact transparent hugepages were not possible before. It's
also now possible to have nested guests and guests with huge pages
running on the same host
- Maintainership change for s390 vfio-pci
- Small quality of life improvement for protected guests
x86:
- Add support for giving the guest full ownership of PMU hardware
(contexted switched around the fastpath run loop) and allowing
direct access to data MSRs and PMCs (restricted by the vPMU model).
KVM still intercepts access to control registers, e.g. to enforce
event filtering and to prevent the guest from profiling sensitive
host state. This is more accurate, since it has no risk of
contention and thus dropped events, and also has significantly less
overhead.
For more information, see the commit message for merge commit
bf2c3138ae ("Merge tag 'kvm-x86-pmu-6.20' ...")
- Disallow changing the virtual CPU model if L2 is active, for all
the same reasons KVM disallows change the model after the first
KVM_RUN
- Fix a bug where KVM would incorrectly reject host accesses to PV
MSRs when running with KVM_CAP_ENFORCE_PV_FEATURE_CPUID enabled,
even if those were advertised as supported to userspace,
- Fix a bug with protected guest state (SEV-ES/SNP and TDX) VMs,
where KVM would attempt to read CR3 configuring an async #PF entry
- Fail the build if EXPORT_SYMBOL_GPL or EXPORT_SYMBOL is used in KVM
(for x86 only) to enforce usage of EXPORT_SYMBOL_FOR_KVM_INTERNAL.
Only a few exports that are intended for external usage, and those
are allowed explicitly
- When checking nested events after a vCPU is unblocked, ignore
-EBUSY instead of WARNing. Userspace can sometimes put the vCPU
into what should be an impossible state, and spurious exit to
userspace on -EBUSY does not really do anything to solve the issue
- Also throw in the towel and drop the WARN on INIT/SIPI being
blocked when vCPU is in Wait-For-SIPI, which also resulted in
playing whack-a-mole with syzkaller stuffing architecturally
impossible states into KVM
- Add support for new Intel instructions that don't require anything
beyond enumerating feature flags to userspace
- Grab SRCU when reading PDPTRs in KVM_GET_SREGS2
- Add WARNs to guard against modifying KVM's CPU caps outside of the
intended setup flow, as nested VMX in particular is sensitive to
unexpected changes in KVM's golden configuration
- Add a quirk to allow userspace to opt-in to actually suppress EOI
broadcasts when the suppression feature is enabled by the guest
(currently limited to split IRQCHIP, i.e. userspace I/O APIC).
Sadly, simply fixing KVM to honor Suppress EOI Broadcasts isn't an
option as some userspaces have come to rely on KVM's buggy behavior
(KVM advertises Supress EOI Broadcast irrespective of whether or
not userspace I/O APIC supports Directed EOIs)
- Clean up KVM's handling of marking mapped vCPU pages dirty
- Drop a pile of *ancient* sanity checks hidden behind in KVM's
unused ASSERT() macro, most of which could be trivially triggered
by the guest and/or user, and all of which were useless
- Fold "struct dest_map" into its sole user, "struct rtc_status", to
make it more obvious what the weird parameter is used for, and to
allow fropping these RTC shenanigans if CONFIG_KVM_IOAPIC=n
- Bury all of ioapic.h, i8254.h and related ioctls (including
KVM_CREATE_IRQCHIP) behind CONFIG_KVM_IOAPIC=y
- Add a regression test for recent APICv update fixes
- Handle "hardware APIC ISR", a.k.a. SVI, updates in
kvm_apic_update_apicv() to consolidate the updates, and to
co-locate SVI updates with the updates for KVM's own cache of ISR
information
- Drop a dead function declaration
- Minor cleanups
x86 (Intel):
- Rework KVM's handling of VMCS updates while L2 is active to
temporarily switch to vmcs01 instead of deferring the update until
the next nested VM-Exit.
The deferred updates approach directly contributed to several bugs,
was proving to be a maintenance burden due to the difficulty in
auditing the correctness of deferred updates, and was polluting
"struct nested_vmx" with a growing pile of booleans
- Fix an SGX bug where KVM would incorrectly try to handle EPCM page
faults, and instead always reflect them into the guest. Since KVM
doesn't shadow EPCM entries, EPCM violations cannot be due to KVM
interference and can't be resolved by KVM
- Fix a bug where KVM would register its posted interrupt wakeup
handler even if loading kvm-intel.ko ultimately failed
- Disallow access to vmcb12 fields that aren't fully supported,
mostly to avoid weirdness and complexity for FRED and other
features, where KVM wants enable VMCS shadowing for fields that
conditionally exist
- Print out the "bad" offsets and values if kvm-intel.ko refuses to
load (or refuses to online a CPU) due to a VMCS config mismatch
x86 (AMD):
- Drop a user-triggerable WARN on nested_svm_load_cr3() failure
- Add support for virtualizing ERAPS. Note, correct virtualization of
ERAPS relies on an upcoming, publicly announced change in the APM
to reduce the set of conditions where hardware (i.e. KVM) *must*
flush the RAP
- Ignore nSVM intercepts for instructions that are not supported
according to L1's virtual CPU model
- Add support for expedited writes to the fast MMIO bus, a la VMX's
fastpath for EPT Misconfig
- Don't set GIF when clearing EFER.SVME, as GIF exists independently
of SVM, and allow userspace to restore nested state with GIF=0
- Treat exit_code as an unsigned 64-bit value through all of KVM
- Add support for fetching SNP certificates from userspace
- Fix a bug where KVM would use vmcb02 instead of vmcb01 when
emulating VMLOAD or VMSAVE on behalf of L2
- Misc fixes and cleanups
x86 selftests:
- Add a regression test for TPR<=>CR8 synchronization and IRQ masking
- Overhaul selftest's MMU infrastructure to genericize stage-2 MMU
support, and extend x86's infrastructure to support EPT and NPT
(for L2 guests)
- Extend several nested VMX tests to also cover nested SVM
- Add a selftest for nested VMLOAD/VMSAVE
- Rework the nested dirty log test, originally added as a regression
test for PML where KVM logged L2 GPAs instead of L1 GPAs, to
improve test coverage and to hopefully make the test easier to
understand and maintain
guest_memfd:
- Remove kvm_gmem_populate()'s preparation tracking and half-baked
hugepage handling. SEV/SNP was the only user of the tracking and it
can do it via the RMP
- Retroactively document and enforce (for SNP) that
KVM_SEV_SNP_LAUNCH_UPDATE and KVM_TDX_INIT_MEM_REGION require the
source page to be 4KiB aligned, to avoid non-trivial complexity for
something that no known VMM seems to be doing and to avoid an API
special case for in-place conversion, which simply can't support
unaligned sources
- When populating guest_memfd memory, GUP the source page in common
code and pass the refcounted page to the vendor callback, instead
of letting vendor code do the heavy lifting. Doing so avoids a
looming deadlock bug with in-place due an AB-BA conflict betwee
mmap_lock and guest_memfd's filemap invalidate lock
Generic:
- Fix a bug where KVM would ignore the vCPU's selected address space
when creating a vCPU-specific mapping of guest memory. Actually
this bug could not be hit even on x86, the only architecture with
multiple address spaces, but it's a bug nevertheless"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (267 commits)
KVM: s390: Increase permitted SE header size to 1 MiB
MAINTAINERS: Replace backup for s390 vfio-pci
KVM: s390: vsie: Fix race in acquire_gmap_shadow()
KVM: s390: vsie: Fix race in walk_guest_tables()
KVM: s390: Use guest address to mark guest page dirty
irqchip/riscv-imsic: Adjust the number of available guest irq files
RISC-V: KVM: Transparent huge page support
RISC-V: KVM: selftests: Add Zalasr extensions to get-reg-list test
RISC-V: KVM: Allow Zalasr extensions for Guest/VM
KVM: riscv: selftests: Add riscv vm satp modes
KVM: riscv: selftests: add Zilsd and Zclsd extension to get-reg-list test
riscv: KVM: allow Zilsd and Zclsd extensions for Guest/VM
RISC-V: KVM: Skip IMSIC update if vCPU IMSIC state is not initialized
RISC-V: KVM: Fix null pointer dereference in kvm_riscv_aia_imsic_rw_attr()
RISC-V: KVM: Fix null pointer dereference in kvm_riscv_aia_imsic_has_attr()
RISC-V: KVM: Remove unnecessary 'ret' assignment
KVM: s390: Add explicit padding to struct kvm_s390_keyop
KVM: LoongArch: selftests: Add steal time test case
LoongArch: KVM: Add paravirt vcpu_is_preempted() support in guest side
LoongArch: KVM: Add paravirt preempt feature in hypervisor side
...
Pull MM updates from Andrew Morton:
- "powerpc/64s: do not re-activate batched TLB flush" makes
arch_{enter|leave}_lazy_mmu_mode() nest properly (Alexander Gordeev)
It adds a generic enter/leave layer and switches architectures to use
it. Various hacks were removed in the process.
- "zram: introduce compressed data writeback" implements data
compression for zram writeback (Richard Chang and Sergey Senozhatsky)
- "mm: folio_zero_user: clear page ranges" adds clearing of contiguous
page ranges for hugepages. Large improvements during demand faulting
are demonstrated (David Hildenbrand)
- "memcg cleanups" tidies up some memcg code (Chen Ridong)
- "mm/damon: introduce {,max_}nr_snapshots and tracepoint for damos
stats" improves DAMOS stat's provided information, deterministic
control, and readability (SeongJae Park)
- "selftests/mm: hugetlb cgroup charging: robustness fixes" fixes a few
issues in the hugetlb cgroup charging selftests (Li Wang)
- "Fix va_high_addr_switch.sh test failure - again" addresses several
issues in the va_high_addr_switch test (Chunyu Hu)
- "mm/damon/tests/core-kunit: extend existing test scenarios" improves
the KUnit test coverage for DAMON (Shu Anzai)
- "mm/khugepaged: fix dirty page handling for MADV_COLLAPSE" fixes a
glitch in khugepaged which was causing madvise(MADV_COLLAPSE) to
transiently return -EAGAIN (Shivank Garg)
- "arch, mm: consolidate hugetlb early reservation" reworks and
consolidates a pile of straggly code related to reservation of
hugetlb memory from bootmem and creation of CMA areas for hugetlb
(Mike Rapoport)
- "mm: clean up anon_vma implementation" cleans up the anon_vma
implementation in various ways (Lorenzo Stoakes)
- "tweaks for __alloc_pages_slowpath()" does a little streamlining of
the page allocator's slowpath code (Vlastimil Babka)
- "memcg: separate private and public ID namespaces" cleans up the
memcg ID code and prevents the internal-only private IDs from being
exposed to userspace (Shakeel Butt)
- "mm: hugetlb: allocate frozen gigantic folio" cleans up the
allocation of frozen folios and avoids some atomic refcount
operations (Kefeng Wang)
- "mm/damon: advance DAMOS-based LRU sorting" improves DAMOS's movement
of memory betewwn the active and inactive LRUs and adds auto-tuning
of the ratio-based quotas and of monitoring intervals (SeongJae Park)
- "Support page table check on PowerPC" makes
CONFIG_PAGE_TABLE_CHECK_ENFORCED work on powerpc (Andrew Donnellan)
- "nodemask: align nodes_and{,not} with underlying bitmap ops" makes
nodes_and() and nodes_andnot() propagate the return values from the
underlying bit operations, enabling some cleanup in calling code
(Yury Norov)
- "mm/damon: hide kdamond and kdamond_lock from API callers" cleans up
some DAMON internal interfaces (SeongJae Park)
- "mm/khugepaged: cleanups and scan limit fix" does some cleanup work
in khupaged and fixes a scan limit accounting issue (Shivank Garg)
- "mm: balloon infrastructure cleanups" goes to town on the balloon
infrastructure and its page migration function. Mainly cleanups, also
some locking simplification (David Hildenbrand)
- "mm/vmscan: add tracepoint and reason for kswapd_failures reset" adds
additional tracepoints to the page reclaim code (Jiayuan Chen)
- "Replace wq users and add WQ_PERCPU to alloc_workqueue() users" is
part of Marco's kernel-wide migration from the legacy workqueue APIs
over to the preferred unbound workqueues (Marco Crivellari)
- "Various mm kselftests improvements/fixes" provides various unrelated
improvements/fixes for the mm kselftests (Kevin Brodsky)
- "mm: accelerate gigantic folio allocation" greatly speeds up gigantic
folio allocation, mainly by avoiding unnecessary work in
pfn_range_valid_contig() (Kefeng Wang)
- "selftests/damon: improve leak detection and wss estimation
reliability" improves the reliability of two of the DAMON selftests
(SeongJae Park)
- "mm/damon: cleanup kdamond, damon_call(), damos filter and
DAMON_MIN_REGION" does some cleanup work in the core DAMON code
(SeongJae Park)
- "Docs/mm/damon: update intro, modules, maintainer profile, and misc"
performs maintenance work on the DAMON documentation (SeongJae Park)
- "mm: add and use vma_assert_stabilised() helper" refactors and cleans
up the core VMA code. The main aim here is to be able to use the mmap
write lock's lockdep state to perform various assertions regarding
the locking which the VMA code requires (Lorenzo Stoakes)
- "mm, swap: swap table phase II: unify swapin use" removes some old
swap code (swap cache bypassing and swap synchronization) which
wasn't working very well. Various other cleanups and simplifications
were made. The end result is a 20% speedup in one benchmark (Kairui
Song)
- "enable PT_RECLAIM on more 64-bit architectures" makes PT_RECLAIM
available on 64-bit alpha, loongarch, mips, parisc, and um. Various
cleanups were performed along the way (Qi Zheng)
* tag 'mm-stable-2026-02-11-19-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (325 commits)
mm/memory: handle non-split locks correctly in zap_empty_pte_table()
mm: move pte table reclaim code to memory.c
mm: make PT_RECLAIM depends on MMU_GATHER_RCU_TABLE_FREE
mm: convert __HAVE_ARCH_TLB_REMOVE_TABLE to CONFIG_HAVE_ARCH_TLB_REMOVE_TABLE config
um: mm: enable MMU_GATHER_RCU_TABLE_FREE
parisc: mm: enable MMU_GATHER_RCU_TABLE_FREE
mips: mm: enable MMU_GATHER_RCU_TABLE_FREE
LoongArch: mm: enable MMU_GATHER_RCU_TABLE_FREE
alpha: mm: enable MMU_GATHER_RCU_TABLE_FREE
mm: change mm/pt_reclaim.c to use asm/tlb.h instead of asm-generic/tlb.h
mm/damon/stat: remove __read_mostly from memory_idle_ms_percentiles
zsmalloc: make common caches global
mm: add SPDX id lines to some mm source files
mm/zswap: use %pe to print error pointers
mm/vmscan: use %pe to print error pointers
mm/readahead: fix typo in comment
mm: khugepaged: fix NR_FILE_PAGES and NR_SHMEM in collapse_file()
mm: refactor vma_map_pages to use vm_insert_pages
mm/damon: unify address range representation with damon_addr_range
mm/cma: replace snprintf with strscpy in cma_new_area
...
Pull x86 paravirt updates from Borislav Petkov:
- A nice cleanup to the paravirt code containing a unification of the
paravirt clock interface, taming the include hell by splitting the
pv_ops structure and removing of a bunch of obsolete code (Juergen
Gross)
* tag 'x86_paravirt_for_v7.0_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits)
x86/paravirt: Use XOR r32,r32 to clear register in pv_vcpu_is_preempted()
x86/paravirt: Remove trailing semicolons from alternative asm templates
x86/pvlocks: Move paravirt spinlock functions into own header
x86/paravirt: Specify pv_ops array in paravirt macros
x86/paravirt: Allow pv-calls outside paravirt.h
objtool: Allow multiple pv_ops arrays
x86/xen: Drop xen_mmu_ops
x86/xen: Drop xen_cpu_ops
x86/xen: Drop xen_irq_ops
x86/paravirt: Move pv_native_*() prototypes to paravirt.c
x86/paravirt: Introduce new paravirt-base.h header
x86/paravirt: Move paravirt_sched_clock() related code into tsc.c
x86/paravirt: Use common code for paravirt_steal_clock()
riscv/paravirt: Use common code for paravirt_steal_clock()
loongarch/paravirt: Use common code for paravirt_steal_clock()
arm64/paravirt: Use common code for paravirt_steal_clock()
arm/paravirt: Use common code for paravirt_steal_clock()
sched: Move clock related paravirt code to kernel/sched
paravirt: Remove asm/paravirt_api_clock.h
x86/paravirt: Move thunk macros to paravirt_types.h
...
Pull arm64 updates from Will Deacon:
"There's a little less than normal, probably due to LPC & Christmas/New
Year meaning that a few series weren't quite ready or reviewed in
time. It's still useful across the board, despite the only real
feature being support for the LS64 feature enabling 64-byte atomic
accesses to endpoints that support it.
ACPI:
- Add interrupt signalling support to the AGDI handler
- Add Catalin and myself to the arm64 ACPI MAINTAINERS entry
CPU features:
- Drop Kconfig options for PAN and LSE (these are detected at runtime)
- Add support for 64-byte single-copy atomic instructions (LS64/LS64V)
- Reduce MTE overhead when executing in the kernel on Ampere CPUs
- Ensure POR_EL0 value exposed via ptrace is up-to-date
- Fix error handling on GCS allocation failure
CPU frequency:
- Add CPU hotplug support to the FIE setup in the AMU driver
Entry code:
- Minor optimisations and cleanups to the syscall entry path
- Preparatory rework for moving to the generic syscall entry code
Hardware errata:
- Work around Spectre-BHB on TSV110 processors
- Work around broken CMO propagation on some systems with the SI-L1
interconnect
Miscellaneous:
- Disable branch profiling for arch/arm64/ to avoid issues with
noinstr
- Minor fixes and cleanups (kexec + ubsan, WARN_ONCE() instead of
WARN_ON(), reduction of boolean expression)
- Fix custom __READ_ONCE() implementation for LTO builds when
operating on non-atomic types
Perf and PMUs:
- Support for CMN-600AE
- Be stricter about supported hardware in the CMN driver
- Support for DSU-110 and DSU-120
- Support for the cycles event in the DSU driver (alongside the
dedicated cycles counter)
- Use IRQF_NO_THREAD instead of IRQF_ONESHOT in the cxlpmu driver
- Use !bitmap_empty() as a faster alternative to bitmap_weight()
- Fix SPE error handling when failing to resume profiling
Selftests:
- Add support for the FORCE_TARGETS option to the arm64 kselftests
- Avoid nolibc-specific my_syscall() function
- Add basic test for the LS64 HWCAP
- Extend fp-pidbench to cover additional workload patterns"
* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (43 commits)
perf/arm-cmn: Reject unsupported hardware configurations
perf: arm_spe: Properly set hw.state on failures
arm64/gcs: Fix error handling in arch_set_shadow_stack_status()
arm64: Fix non-atomic __READ_ONCE() with CONFIG_LTO=y
arm64: poe: fix stale POR_EL0 values for ptrace
kselftest/arm64: Raise default number of loops in fp-pidbench
kselftest/arm64: Add a no-SVE loop after SVE in fp-pidbench
perf/cxlpmu: Replace IRQF_ONESHOT with IRQF_NO_THREAD
arm64: mte: Set TCMA1 whenever MTE is present in the kernel
arm64/ptrace: Return early for ptrace_report_syscall_entry() error
arm64/ptrace: Split report_syscall()
arm64: Remove unused _TIF_WORK_MASK
kselftest/arm64: Add missing file in .gitignore
arm64: errata: Workaround for SI L1 downstream coherency issue
kselftest/arm64: Add HWCAP test for FEAT_LS64
arm64: Add support for FEAT_{LS64, LS64_V}
KVM: arm64: Enable FEAT_{LS64, LS64_V} in the supported guest
arm64: Provide basic EL2 setup for FEAT_{LS64, LS64_V} usage at EL0/1
KVM: arm64: Handle DABT caused by LS64* instructions on unsupported memory
KVM: arm64: Add documentation for KVM_EXIT_ARM_LDST64B
...
KVM/arm64 updates for 7.0
- Add support for FEAT_IDST, allowing ID registers that are not
implemented to be reported as a normal trap rather than as an UNDEF
exception.
- Add sanitisation of the VTCR_EL2 register, fixing a number of
UXN/PXN/XN bugs in the process.
- Full handling of RESx bits, instead of only RES0, and resulting in
SCTLR_EL2 being added to the list of sanitised registers.
- More pKVM fixes for features that are not supposed to be exposed to
guests.
- Make sure that MTE being disabled on the pKVM host doesn't give it
the ability to attack the hypervisor.
- Allow pKVM's host stage-2 mappings to use the Force Write Back
version of the memory attributes by using the "pass-through'
encoding.
- Fix trapping of ICC_DIR_EL1 on GICv5 hosts emulating GICv3 for the
guest.
- Preliminary work for guest GICv5 support.
- A bunch of debugfs fixes, removing pointless custom iterators stored
in guest data structures.
- A small set of FPSIMD cleanups.
- Selftest fixes addressing the incorrect alignment of page
allocation.
- Other assorted low-impact fixes and spelling fixes.
* kvm-arm64/misc-6.20:
: .
: Misc KVM/arm64 changes for 6.20
:
: - Trivial FPSIMD cleanups
:
: - Calculate hyp VA size only once, avoiding potential mapping issues when
: VA bits is smaller than expected
:
: - Silence sparse warning for the HYP stack base
:
: - Fix error checking when handling FFA_VERSION
:
: - Add missing trap configuration for DBGWCR15_EL1
:
: - Don't try to deal with nested S2 when NV isn't enabled for a guest
:
: - Various spelling fixes
: .
KVM: arm64: nv: Avoid NV stage-2 code when NV is not supported
KVM: arm64: Fix various comments
KVM: arm64: nv: Add trap config for DBGWCR<15>_EL1
KVM: arm64: Fix error checking for FFA_VERSION
KVM: arm64: Fix missing <asm/stackpage/nvhe.h> include
KVM: arm64: Calculate hyp VA size only once
KVM: arm64: Remove ISB after writing FPEXC32_EL2
KVM: arm64: Shuffle KVM_HOST_DATA_FLAG_* indices
KVM: arm64: Fix comment in fpsimd_lazy_switch_to_host()
Signed-off-by: Marc Zyngier <maz@kernel.org>
* kvm-arm64/resx:
: .
: Add infrastructure to deal with the full gamut of RESx bits
: for NV. As a result, it is now possible to have the expected
: semantics for some bits such as SCTLR_EL2.SPAN.
: .
KVM: arm64: Add debugfs file dumping computed RESx values
KVM: arm64: Add sanitisation to SCTLR_EL2
KVM: arm64: Remove all traces of HCR_EL2.MIOCNCE
KVM: arm64: Remove all traces of FEAT_TME
KVM: arm64: Simplify handling of full register invalid constraint
KVM: arm64: Get rid of FIXED_VALUE altogether
KVM: arm64: Simplify handling of HCR_EL2.E2H RESx
KVM: arm64: Move RESx into individual register descriptors
KVM: arm64: Add RES1_WHEN_E2Hx constraints as configuration flags
KVM: arm64: Add REQUIRES_E2H1 constraint as configuration flags
KVM: arm64: Simplify FIXED_VALUE handling
KVM: arm64: Convert HCR_EL2.RW to AS_RES1
KVM: arm64: Correctly handle SCTLR_EL1 RES1 bits for unsupported features
KVM: arm64: Allow RES1 bits to be inferred from configuration
KVM: arm64: Inherit RESx bits from FGT register descriptors
KVM: arm64: Extend unified RESx handling to runtime sanitisation
KVM: arm64: Introduce data structure tracking both RES0 and RES1 bits
KVM: arm64: Introduce standalone FGU computing primitive
KVM: arm64: Remove duplicate configuration for SCTLR_EL1.{EE,E0E}
arm64: Convert SCTLR_EL2 to sysreg infrastructure
Signed-off-by: Marc Zyngier <maz@kernel.org>
* kvm-arm64/debugfs-fixes:
: .
: Cleanup of the debugfs iterator, which are way more complicated
: than they ought to be, courtesy of Fuad Tabba. From the cover letter:
:
: "This series refactors the debugfs implementations for `idregs` and
: `vgic-state` to use standard `seq_file` iterator patterns.
:
: The existing implementations relied on storing iterator state within
: global VM structures (`kvm_arch` and `vgic_dist`). This approach
: prevented concurrent reads of the debugfs files (returning -EBUSY) and
: created improper dependencies between transient file operations and
: long-lived VM state."
: .
KVM: arm64: Use standard seq_file iterator for vgic-debug debugfs
KVM: arm64: Reimplement vgic-debug XArray iteration
KVM: arm64: Use standard seq_file iterator for idregs debugfs
Signed-off-by: Marc Zyngier <maz@kernel.org>
* kvm-arm64/gicv5-prologue:
: .
: Prologue to GICv5 support, courtesy of Sascha Bischoff.
:
: This is preliminary work that sets the scene for the full-blow
: support.
: .
irqchip/gic-v5: Check if impl is virt capable
KVM: arm64: gic: Set vgic_model before initing private IRQs
arm64/sysreg: Drop ICH_HFGRTR_EL2.ICC_HAPR_EL1 and make RES1
KVM: arm64: gic-v3: Switch vGIC-v3 to use generated ICH_VMCR_EL2
Signed-off-by: Marc Zyngier <maz@kernel.org>
* kvm-arm64/fwb-for-all:
: .
: Allow pKVM's host stage-2 mappings to use the Force Write Back version
: of the memory attributes by using the "pass-through' encoding.
:
: This avoids having two separate encodings for S2 on a given platform.
: .
KVM: arm64: Simplify PAGE_S2_MEMATTR
KVM: arm64: Kill KVM_PGTABLE_S2_NOFWB
KVM: arm64: Switch pKVM host S2 over to KVM_PGTABLE_S2_AS_S1
KVM: arm64: Add KVM_PGTABLE_S2_AS_S1 flag
arm64: Add MT_S2{,_FWB}_AS_S1 encodings
Signed-off-by: Marc Zyngier <maz@kernel.org>
* kvm-arm64/pkvm-no-mte:
: .
: pKVM updates preventing the host from using MTE-related system
: sysrem registers when the feature is disabled from the kernel
: command-line (arm64.nomte), courtesy of Fuad Taba.
:
: From the cover letter:
:
: "If MTE is supported by the hardware (and is enabled at EL3), it remains
: available to lower exception levels by default. Disabling it in the host
: kernel (e.g., via 'arm64.nomte') only stops the kernel from advertising
: the feature; it does not physically disable MTE in the hardware.
:
: The ability to disable MTE in the host kernel is used by some systems,
: such as Android, so that the physical memory otherwise used as tag
: storage can be used for other things (i.e. treated just like the rest of
: memory). In this scenario, a malicious host could still access tags in
: pages donated to a guest using MTE instructions (e.g., STG and LDG),
: bypassing the kernel's configuration."
: .
KVM: arm64: Use kvm_has_mte() in pKVM trap initialization
KVM: arm64: Inject UNDEF when accessing MTE sysregs with MTE disabled
KVM: arm64: Trap MTE access and discovery when MTE is disabled
KVM: arm64: Remove dead code resetting HCR_EL2 for pKVM
Signed-off-by: Marc Zyngier <maz@kernel.org>
Add a new helper to retrieve the RESx values for a given system
register, and use it for the runtime sanitisation.
This results in slightly better code generation for a fairly hot
path in the hypervisor, and additionally covers all sanitised
registers in all conditions, not just the VNCR-based ones.
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Link: https://patch.msgid.link/20260202184329.2724080-6-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
We have so far mostly tracked RES0 bits, but only made a few attempts
at being just as strict for RES1 bits (probably because they are both
rarer and harder to handle).
Start scratching the surface by introducing a data structure tracking
RES0 and RES1 bits at the same time.
Note that contrary to the usual idiom, this structure is mostly passed
around by value -- the ABI handles it nicely, and the resulting code is
much nicer.
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Link: https://patch.msgid.link/20260202184329.2724080-5-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
Convert SCTLR_EL2 to the sysreg infrastructure, as per the 2025-12_rel
revision of the Registers.json file.
Note that we slightly deviate from the above, as we stick to the ARM
ARM M.a definition of SCTLR_EL2[9], which is RES0, in order to avoid
dragging the POE2 definitions...
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Link: https://patch.msgid.link/20260202184329.2724080-2-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
The implementation of __READ_ONCE() under CONFIG_LTO=y incorrectly
qualified the fallback "once" access for types larger than 8 bytes,
which are not atomic but should still happen "once" and suppress common
compiler optimizations.
The cast `volatile typeof(__x)` applied the volatile qualifier to the
pointer type itself rather than the pointee. This created a volatile
pointer to a non-volatile type, which violated __READ_ONCE() semantics.
Fix this by casting to `volatile typeof(*__x) *`.
With a defconfig + LTO + debug options build, we see the following
functions to be affected:
xen_manage_runstate_time (884 -> 944 bytes)
xen_steal_clock (248 -> 340 bytes)
^-- use __READ_ONCE() to load vcpu_runstate_info structs
Fixes: e35123d83e ("arm64: lto: Strengthen READ_ONCE() to acquire when CONFIG_LTO=y")
Cc: stable@vger.kernel.org
Reviewed-by: Boqun Feng <boqun@kernel.org>
Signed-off-by: Marco Elver <elver@google.com>
Tested-by: David Laight <david.laight.linux@gmail.com>
Signed-off-by: Will Deacon <will@kernel.org>
The current implementation uses `idreg_debugfs_iter` in `struct
kvm_arch` to track the sequence position. This effectively makes the
iterator shared across all open file descriptors for the VM.
This approach has significant drawbacks:
- It enforces mutual exclusion, preventing concurrent reads of the
debugfs file (returning -EBUSY).
- It relies on storing transient iterator state in the long-lived VM
structure (`kvm_arch`).
- The use of `u8` for the iterator index imposes an implicit limit of
255 registers. While not currently exceeded, this is fragile against
future architectural growth. Switching to `loff_t` eliminates this
overflow risk.
Refactor the implementation to use the standard `seq_file` iterator.
Instead of storing state in `kvm_arch`, rely on the `pos` argument
passed to the `start` and `next` callbacks, which tracks the logical
index specific to the file descriptor.
This change enables concurrent access and eliminates the
`idreg_debugfs_iter` field from `struct kvm_arch`.
Signed-off-by: Fuad Tabba <tabba@google.com>
Link: https://patch.msgid.link/20260202085721.3954942-2-tabba@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
The GICv5 architecture is dropping the ICC_HAPR_EL1 and ICV_HAPR_EL1
system registers. These registers were never added to the sysregs, but
the traps for them were.
Drop the trap bit from the ICH_HFGRTR_EL2 and make it Res1 as per the
upcoming GICv5 spec change. Additionally, update the EL2 setup code to
not attempt to set that bit.
Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Link: https://patch.msgid.link/20260128175919.3828384-4-sascha.bischoff@arm.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
pKVM usage of S2 translation on the host is purely for isolation
purposes, not translation. To that effect, the memory attributes
being used must be that of S1.
With FWB=0, this is easily achieved by using the Normal Cacheable
type (which is the weakest possible memory type) at S2, and let S1
pick something stronger as required.
With FWB=1, the attributes are combined in a different way, and we
cannot arbitrarily use Normal Cacheable. We can, however, use a
memattr encoding that indicates that the final attributes are that
of Stage-1.
Add these encoding and a few pointers to the relevant parts of the
specification. It might come handy some day.
Reviewed-by: Joey Gouly <joey.gouly@arm.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Link: https://patch.msgid.link/20260123191637.715429-2-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
KVM/arm64 fixes for 6.19
- Ensure early return semantics are preserved for pKVM fault handlers
- Fix case where the kernel runs with the guest's PAN value when
CONFIG_ARM64_PAN is not set
- Make stage-1 walks to set the access flag respect the access
permission of the underlying stage-2, when enabled
- Propagate computed FGT values to the pKVM view of the vCPU at
vcpu_load()
- Correctly program PXN and UXN privilege bits for hVHE's stage-1 page
tables
- Check that the VM is actually using VGICv3 before accessing the GICv3
CPU interface
- Delete some unused code
When software issues a Cache Maintenance Operation (CMO) targeting a
dirty cache line, the CPU and DSU cluster may optimize the operation by
combining the CopyBack Write and CMO into a single combined CopyBack
Write plus CMO transaction presented to the interconnect (MCN).
For these combined transactions, the MCN splits the operation into two
separate transactions, one Write and one CMO, and then propagates the
write and optionally the CMO to the downstream memory system or external
Point of Serialization (PoS).
However, the MCN may return an early CompCMO response to the DSU cluster
before the corresponding Write and CMO transactions have completed at
the external PoS or downstream memory. As a result, stale data may be
observed by external observers that are directly connected to the
external PoS or downstream memory.
This erratum affects any system topology in which the following
conditions apply:
- The Point of Serialization (PoS) is located downstream of the
interconnect.
- A downstream observer accesses memory directly, bypassing the
interconnect.
Conditions:
This erratum occurs only when all of the following conditions are met:
1. Software executes a data cache maintenance operation, specifically,
a clean or clean&invalidate by virtual address (DC CVAC or DC
CIVAC), that hits on unique dirty data in the CPU or DSU cache.
This results in a combined CopyBack and CMO being issued to the
interconnect.
2. The interconnect splits the combined transaction into separate Write
and CMO transactions and returns an early completion response to the
CPU or DSU before the write has completed at the downstream memory
or PoS.
3. A downstream observer accesses the affected memory address after the
early completion response is issued but before the actual memory
write has completed. This allows the observer to read stale data
that has not yet been updated at the PoS or downstream memory.
The implementation of workaround put a second loop of CMOs at the same
virtual address whose operation meet erratum conditions to wait until
cache data be cleaned to PoC. This way of implementation mitigates
performance penalty compared to purely duplicate original CMO.
Signed-off-by: Lucas Wei <lucaswei@google.com>
Signed-off-by: Will Deacon <will@kernel.org>
If MTE is not supported by the hardware, or is disabled in the kernel
configuration (`CONFIG_ARM64_MTE=n`) or command line (`arm64.nomte`),
the kernel stops advertising MTE to userspace and avoids using MTE
instructions. However, this is a software-level disable only.
When MTE hardware is present and enabled by EL3 firmware, leaving
`HCR_EL2.ATA` set allows the host to execute MTE instructions (STG, LDG,
etc.) and access allocation tags in physical memory.
Prevent this by clearing `HCR_EL2.ATA` when MTE is disabled. Remove it
from the `HCR_HOST_NVHE_FLAGS` default, and conditionally set it in
`cpu_prepare_hyp_mode()` only when `system_supports_mte()` returns true.
This causes MTE instructions to trap to EL2 when `HCR_EL2.ATA` is
cleared.
Additionally, set `HCR_EL2.TID5` when MTE is disabled. This traps reads
of `GMID_EL1` (Multiple tag transfer ID register) to EL2, preventing the
discovery of MTE parameters (such as tag block size) when the feature is
suppressed.
Early boot code in `head.S` temporarily keeps `HCR_ATA` set to avoid
special-casing initialization paths. This is safe because this code
executes before untrusted code runs and will clear `HCR_ATA` if MTE is
disabled.
Signed-off-by: Fuad Tabba <tabba@google.com>
Link: https://patch.msgid.link/20260122112218.531948-3-tabba@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
* kvm-arm64/pkvm-features-6.20:
: .
: pKVM guest feature trapping fixes, courtesy of Fuad Tabba.
: .
KVM: arm64: Prevent host from managing timer offsets for protected VMs
KVM: arm64: Check whether a VM IOCTL is allowed in pKVM
KVM: arm64: Track KVM IOCTLs and their associated KVM caps
KVM: arm64: Do not allow KVM_CAP_ARM_MTE for any guest in pKVM
KVM: arm64: Include VM type when checking VM capabilities in pKVM
KVM: arm64: Introduce helper to calculate fault IPA offset
KVM: arm64: Fix MTE flag initialization for protected VMs
KVM: arm64: Fix Trace Buffer trap polarity for protected VMs
KVM: arm64: Fix Trace Buffer trapping for protected VMs
Signed-off-by: Marc Zyngier <maz@kernel.org>
* kvm-arm64/feat_idst:
: .
: Add support for FEAT_IDST, allowing ID registers that are not implemented
: to be reported as a normal trap rather than as an UNDEF exception.
: .
KVM: arm64: selftests: Add a test for FEAT_IDST
KVM: arm64: pkvm: Report optional ID register traps with a 0x18 syndrome
KVM: arm64: pkvm: Add a generic synchronous exception injection primitive
KVM: arm64: Force trap of GMID_EL1 when the guest doesn't have MTE
KVM: arm64: Handle CSSIDR2_EL1 and SMIDR_EL1 in a generic way
KVM: arm64: Handle FEAT_IDST for sysregs without specific handlers
KVM: arm64: Add a generic synchronous exception injection primitive
KVM: arm64: Add trap routing for GMID_EL1
arm64: Repaint ID_AA64MMFR2_EL1.IDS description
Signed-off-by: Marc Zyngier <maz@kernel.org>
* kvm-arm64/vtcr:
: .
: VTCR_EL2 conversion to the configuration-driven RESx framework,
: fixing a couple of UXN/PXN/XN bugs in the process.
: .
KVM: arm64: nv: Return correct RES0 bits for FGT registers
KVM: arm64: Always populate FGT masks at boot time
KVM: arm64: Honor UX/PX attributes for EL2 S1 mappings
KVM: arm64: Convert VTCR_EL2 to config-driven sanitisation
KVM: arm64: Account for RES1 bits in DECLARE_FEAT_MAP() and co
arm64: Convert VTCR_EL2 to sysreg infratructure
arm64: Convert ID_AA64MMFR0_EL1.TGRAN{4,16,64}_2 to UnsignedEnum
KVM: arm64: Invert KVM_PGTABLE_WALK_HANDLE_FAULT to fix pKVM walkers
KVM: arm64: Don't blindly set set PSTATE.PAN on guest exit
KVM: arm64: nv: Respect stage-2 write permssion when setting stage-1 AF
KVM: arm64: Remove unused vcpu_{clear,set}_wfx_traps()
KVM: arm64: Remove unused parameter in synchronize_vcpu_pstate()
KVM: arm64: Remove extra argument for __pvkm_host_{share,unshare}_hyp()
KVM: arm64: Inject UNDEF for a register trap without accessor
KVM: arm64: Copy FGT traps to unprotected pKVM VCPU on VCPU load
KVM: arm64: Fix EL2 S1 XN handling for hVHE setups
KVM: arm64: gic: Check for vGICv3 when clearing TWI
Signed-off-by: Marc Zyngier <maz@kernel.org>
Merge arm64/for-next/cpufeature in to resolve conflicts resulting from
the removal of CONFIG_PAN.
* arm64/for-next/cpufeature:
arm64: Add support for FEAT_{LS64, LS64_V}
KVM: arm64: Enable FEAT_{LS64, LS64_V} in the supported guest
arm64: Provide basic EL2 setup for FEAT_{LS64, LS64_V} usage at EL0/1
KVM: arm64: Handle DABT caused by LS64* instructions on unsupported memory
KVM: arm64: Add documentation for KVM_EXIT_ARM_LDST64B
KVM: arm64: Add exit to userspace on {LD,ST}64B* outside of memslots
arm64: Unconditionally enable PAN support
arm64: Unconditionally enable LSE support
arm64: Add support for TSV110 Spectre-BHB mitigation
Signed-off-by: Marc Zyngier <maz@kernel.org>
Armv8.7 introduces single-copy atomic 64-byte loads and stores
instructions and its variants named under FEAT_{LS64, LS64_V}.
These features are identified by ID_AA64ISAR1_EL1.LS64 and the
use of such instructions in userspace (EL0) can be trapped.
As st64bv (FEAT_LS64_V) and st64bv0 (FEAT_LS64_ACCDATA) can not be tell
apart, FEAT_LS64 and FEAT_LS64_ACCDATA which will be supported in later
patch will be exported to userspace, FEAT_LS64_V will be enabled only
in kernel.
In order to support the use of corresponding instructions in userspace:
- Make ID_AA64ISAR1_EL1.LS64 visbile to userspace
- Add identifying and enabling in the cpufeature list
- Expose these support of these features to userspace through HWCAP3
and cpuinfo
ld64b/st64b (FEAT_LS64) and st64bv (FEAT_LS64_V) is intended for
special memory (device memory) so requires support by the CPU, system
and target memory location (device that support these instructions).
The HWCAP3_LS64, implies the support of CPU and system (since no
identification method from system, so SoC vendors should advertise
support in the CPU if system also support them).
Otherwise for ld64b/st64b the atomicity may not be guaranteed or a
DABT will be generated, so users (probably userspace driver developer)
should make sure the target memory (device) also have the support.
For st64bv 0xffffffffffffffff will be returned as status result for
unsupported memory so user should check it.
Document the restrictions along with HWCAP3_LS64.
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Oliver Upton <oupton@kernel.org>
Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
Signed-off-by: Zhou Wang <wangzhou1@hisilicon.com>
Signed-off-by: Will Deacon <will@kernel.org>
Using FEAT_{LS64, LS64_V} instructions in a guest is also controlled
by HCRX_EL2.{EnALS, EnASR}. Enable it if guest has related feature.
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Oliver Upton <oupton@kernel.org>
Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
Signed-off-by: Zhou Wang <wangzhou1@hisilicon.com>
Signed-off-by: Will Deacon <will@kernel.org>
Instructions introduced by FEAT_{LS64, LS64_V} is controlled by
HCRX_EL2.{EnALS, EnASR}. Configure all of these to allow usage
at EL0/1.
This doesn't mean these instructions are always available in
EL0/1 if provided. The hypervisor still have the control at
runtime.
Acked-by: Will Deacon <will@kernel.org>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Oliver Upton <oupton@kernel.org>
Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
Signed-off-by: Zhou Wang <wangzhou1@hisilicon.com>
Signed-off-by: Will Deacon <will@kernel.org>
If FEAT_LS64WB not supported, FEAT_LS64* instructions only support
to access Device/Uncacheable memory, otherwise a data abort for
unsupported Exclusive or atomic access (0x35, UAoEF) is generated
per spec. It's implementation defined whether the target exception
level is routed and is possible to implemented as route to EL2 on a
VHE VM according to DDI0487L.b Section C3.2.6 Single-copy atomic
64-byte load/store.
If it's implemented as generate the DABT to the final enabled stage
(stage-2), inject the UAoEF back to the guest after checking the
memslot is valid.
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Oliver Upton <oupton@kernel.org>
Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
Signed-off-by: Zhou Wang <wangzhou1@hisilicon.com>
Signed-off-by: Will Deacon <will@kernel.org>
FEAT_PAN has been around since ARMv8.1 (over 11 years ago), has no compiler
dependency (we have our own accessors), and is a great security benefit.
Drop CONFIG_ARM64_PAN, and make the support unconditionnal.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Will Deacon <will@kernel.org>
LSE atomics have been in the architecture since ARMv8.1 (released in
2014), and are hopefully supported by all modern toolchains.
Drop the optional nature of LSE support in the kernel, and always
compile the support in, as this really is very little code. LL/SC
still is the default, and the switch to LSE is done dynamically.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Will Deacon <will@kernel.org>