LoongArch KVM changes for v6.17
1. Simplify some KVM routines.
2. Enhance in-kernel irqchip emulation.
3. Add stat information with kernel irqchip.
4. Add tracepoints for CPUCFG and CSR emulation exits.
KVM/arm64 changes for 6.17, round #1
- Host driver for GICv5, the next generation interrupt controller for
arm64, including support for interrupt routing, MSIs, interrupt
translation and wired interrupts.
- Use FEAT_GCIE_LEGACY on GICv5 systems to virtualize GICv3 VMs on
GICv5 hardware, leveraging the legacy VGIC interface.
- Userspace control of the 'nASSGIcap' GICv3 feature, allowing
userspace to disable support for SGIs w/o an active state on hardware
that previously advertised it unconditionally.
- Map supporting endpoints with cacheable memory attributes on systems
with FEAT_S2FWB and DIC where KVM no longer needs to perform cache
maintenance on the address range.
- Nested support for FEAT_RAS and FEAT_DoubleFault2, allowing the guest
hypervisor to inject external aborts into an L2 VM and take traps of
masked external aborts to the hypervisor.
- Convert more system register sanitization to the config-driven
implementation.
- Fixes to the visibility of EL2 registers, namely making VGICv3 system
registers accessible through the VGIC device instead of the ONE_REG
vCPU ioctls.
- Various cleanups and minor fixes.
KVM SEV cache maintenance changes for 6.17
- Drop a superfluous WBINVD (on all CPUs!) when destroying a VM.
- Use WBNOINVD instead of WBINVD when possible, for SEV cache maintenance,
e.g. to minimize collateral damage when reclaiming memory from an SEV guest.
- When reclaiming memory from an SEV guest, only do cache flushes on CPUs that
have ever run a vCPU for the guest, i.e. don't flush the caches for CPUs
that can't possibly have cache lines with dirty, encrypted data.
KVM SVM changes for 6.17
Drop KVM's rejection of SNP's SMT and single-socket policy restrictions, and
instead rely on firmware to verify that the policy can actually be supported.
Don't bother checking that requested policy(s) can actually be satisfied, as
an incompatible policy doesn't put the kernel at risk in any way, and providing
guarantees with respect to the physical topology is outside of KVM's purview.
KVM selftests changes for 6.17
- Fix a comment typo.
- Verify KVM is loaded when getting any KVM module param so that attempting to
run a selftest without kvm.ko loaded results in a SKIP message about KVM not
being loaded/enabled, versus some random parameter not existing.
- SKIP tests that hit EACCES when attempting to access a file, with a "Root
required?" help message. In most cases, the test just needs to be run with
elevated permissions.
KVM local APIC changes for 6.17
Extract many of KVM's helpers for accessing architectural local APIC state
to common x86 so that they can be shared by guest-side code for Secure AVIC.
KVM x86 MMU changes for 6.17
- Exempt nested EPT from the the !USER + CR0.WP logic, as EPT doesn't interact
with CR0.WP.
- Move the TDX hardware setup code to tdx.c to better co-locate TDX code
and eliminate a few global symbols.
- Dynamically allocation the shadow MMU's hashed page list, and defer
allocating the hashed list until it's actually needed (the TDP MMU doesn't
use the list).
KVM x86 misc changes for 6.17
- Prevert the host's DEBUGCTL.FREEZE_IN_SMM (Intel only) when running the
guest. Failure to honor FREEZE_IN_SMM can bleed host state into the guest.
- Explicitly check vmcs12.GUEST_DEBUGCTL on nested VM-Enter (Intel only) to
prevent L1 from running L2 with features that KVM doesn't support, e.g. BTF.
- Intercept SPEC_CTRL on AMD if the MSR shouldn't exist according to the
vCPU's CPUID model.
- Rework the MSR interception code so that the SVM and VMX APIs are more or
less identical.
- Recalculate all MSR intercepts from the "source" on MSR filter changes, and
drop the dedicated "shadow" bitmaps (and their awful "max" size defines).
- WARN and reject loading kvm-amd.ko instead of panicking the kernel if the
nested SVM MSRPM offsets tracker can't handle an MSR.
- Advertise support for LKGS (Load Kernel GS base), a new instruction that's
loosely related to FRED, but is supported and enumerated independently.
- Fix a user-triggerable WARN that syzkaller found by stuffing INIT_RECEIVED,
a.k.a. WFS, and then putting the vCPU into VMX Root Mode (post-VMXON). Use
the same approach KVM uses for dealing with "impossible" emulation when
running a !URG guest, and simply wait until KVM_RUN to detect that the vCPU
has architecturally impossible state.
- Add KVM_X86_DISABLE_EXITS_APERFMPERF to allow disabling interception of
APERF/MPERF reads, so that a "properly" configured VM can "virtualize"
APERF/MPERF (with many caveats).
- Reject KVM_SET_TSC_KHZ if vCPUs have been created, as changing the "default"
frequency is unsupported for VMs with a "secure" TSC, and there's no known
use case for changing the default frequency for other VM types.
KVM VFIO device assignment cleanups for 6.17
Kill off kvm_arch_{start,end}_assignment() and x86's associated tracking now
that KVM no longer uses assigned_device_count as a bad heuristic for "VM has
an irqbypass producer" or for "VM has access to host MMIO".
KVM Dirty Ring changes for 6.17
Fix issues with dirty ring harvesting where KVM doesn't bound the processing
of entries in any way, which allows userspace to keep KVM in a tight loop
indefinitely. Clean up code and comments along the way.
KVM generic changes for 6.17
- Add a tracepoint for KVM_SET_MEMORY_ATTRIBUTES to help debug issues related
to private <=> shared memory conversions.
- Drop guest_memfd's .getattr() implementation as the VFS layer will call
generic_fillattr() if inode_operations.getattr is NULL.
KVM MMIO Stale Data mitigation cleanup for 6.17
Rework KVM's mitigation for the MMIO State Data vulnerability to track
whether or not a vCPU has access to (host) MMIO based on the MMU that will be
used when running in the guest. The current approach doesn't actually detect
whether or not a guest has access to MMIO, and is prone to false negatives (and
to a lesser extent, false positives), as KVM_DEV_VFIO_FILE_ADD is optional, and
obviously only covers VFIO devices.
KVM IRQ changes for 6.17
- Rework irqbypass to track/match producers and consumers via an xarray
instead of a linked list. Using a linked list leads to O(n^2) insertion
times, which is hugely problematic for use cases that create large numbers
of VMs. Such use cases typically don't actually use irqbypass, but
eliminating the pointless registration is a future problem to solve as it
likely requires new uAPI.
- Track irqbypass's "token" as "struct eventfd_ctx *" instead of a "void *",
to avoid making a simple concept unnecessarily difficult to understand.
- Add CONFIG_KVM_IOAPIC for x86 to allow disabling support for I/O APIC, PIC,
and PIT emulation at compile time.
- Drop x86's irq_comm.c, and move a pile of IRQ related code into irq.c.
- Fix a variety of flaws and bugs in the AVIC device posted IRQ code.
- Inhibited AVIC if a vCPU's ID is too big (relative to what hardware
supports) instead of rejecting vCPU creation.
- Extend enable_ipiv module param support to SVM, by simply leaving IsRunning
clear in the vCPU's physical ID table entry.
- Disable IPI virtualization, via enable_ipiv, if the CPU is affected by
erratum #1235, to allow (safely) enabling AVIC on such CPUs.
- Dedup x86's device posted IRQ code, as the vast majority of functionality
can be shared verbatime between SVM and VMX.
- Harden the device posted IRQ code against bugs and runtime errors.
- Use vcpu_idx, not vcpu_id, for GA log tag/metadata, to make lookups O(1)
instead of O(n).
- Generate GA Log interrupts if and only if the target vCPU is blocking, i.e.
only if KVM needs a notification in order to wake the vCPU.
- Decouple device posted IRQs from VFIO device assignment, as binding a VM to
a VFIO group is not a requirement for enabling device posted IRQs.
- Clean up and document/comment the irqfd assignment code.
- Disallow binding multiple irqfds to an eventfd with a priority waiter, i.e.
ensure an eventfd is bound to at most one irqfd through the entire host,
and add a selftest to verify eventfd:irqfd bindings are globally unique.
Enable ring-based dirty memory tracking on riscv:
- Enable CONFIG_HAVE_KVM_DIRTY_RING_ACQ_REL as riscv is weakly
ordered.
- Set KVM_DIRTY_LOG_PAGE_OFFSET for the ring buffer's physical page
offset.
- Add a check to kvm_vcpu_kvm_riscv_check_vcpu_requests for checking
whether the dirty ring is soft full.
To handle vCPU requests that cause exits to userspace, modified the
`kvm_riscv_check_vcpu_requests` to return a value (currently only
returns 0 or 1).
Signed-off-by: Quan Zhou <zhouquan@iscas.ac.cn>
Reviewed-by: Anup Patel <anup@brainfault.org>
Link: https://lore.kernel.org/r/20e116efb1f7aff211dd8e3cf8990c5521ed5f34.1749810735.git.zhouquan@iscas.ac.cn
Signed-off-by: Anup Patel <anup@brainfault.org>
The Smnpm extension requires special handling because the guest ISA
extension maps to a different extension (Ssnpm) on the host side.
commit 1851e78362 ("RISC-V: KVM: Allow Smnpm and Ssnpm extensions for
guests") missed that the vcpu->arch.isa bit is based only on the host
extension, so currently both KVM_RISCV_ISA_EXT_{SMNPM,SSNPM} map to
vcpu->arch.isa[RISCV_ISA_EXT_SSNPM]. This does not cause any problems
for the guest, because both extensions are force-enabled anyway when the
host supports Ssnpm, but prevents checking for (guest) Smnpm in the SBI
FWFT logic.
Redefine kvm_isa_ext_arr to look up the guest extension, since only the
guest -> host mapping is unambiguous. Factor out the logic for checking
for host support of an extension, so this special case only needs to be
handled in one place, and be explicit about which variables hold a host
vs a guest ISA extension.
Fixes: 1851e78362 ("RISC-V: KVM: Allow Smnpm and Ssnpm extensions for guests")
Signed-off-by: Samuel Holland <samuel.holland@sifive.com>
Reviewed-by: Anup Patel <anup@brainfault.org>
Link: https://lore.kernel.org/r/20250111004702.2813013-2-samuel.holland@sifive.com
Signed-off-by: Anup Patel <anup@brainfault.org>
Delegate illegal instruction fault to VS mode by default to avoid such
exceptions being trapped to HS and redirected back to VS.
The delegation of illegal instruction fault is particularly important
to guest applications that use vector instructions frequently. In such
cases, an illegal instruction fault will be raised when guest user thread
uses vector instruction the first time and then guest kernel will enable
user thread to execute following vector instructions.
The fw pmu event counter remains undeleted so that guest can still query
illegal instruction events via sbi call. Guest will only see zero count
on illegal instruction faults and know 'firmware' has delegated it.
Reviewed-by: Anup Patel <anup@brainfault.org>
Signed-off-by: Xu Lu <luxu.kernel@bytedance.com>
Link: https://lore.kernel.org/r/20250714094554.89151-1-luxu.kernel@bytedance.com
Signed-off-by: Anup Patel <anup@brainfault.org>
* kvm-arm64/vgic-v4-ctl:
: Userspace control of nASSGIcap, courtesy of Raghavendra Rao Ananta
:
: Allow userspace to decide if support for SGIs without an active state is
: advertised to the guest, allowing VMs from GICv3-only hardware to be
: migrated to to GICv4.1 capable machines.
Documentation: KVM: arm64: Describe VGICv3 registers writable pre-init
KVM: arm64: selftests: Add test for nASSGIcap attribute
KVM: arm64: vgic-v3: Allow userspace to write GICD_TYPER2.nASSGIcap
KVM: arm64: vgic-v3: Allow access to GICD_IIDR prior to initialization
KVM: arm64: vgic-v3: Consolidate MAINT_IRQ handling
KVM: arm64: Disambiguate support for vSGIs v. vLPIs
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
* kvm-arm64/el2-reg-visibility:
: Fixes to EL2 register visibility, courtesy of Marc Zyngier
:
: - Expose EL2 VGICv3 registers via the VGIC attributes accessor, not the
: KVM_{GET,SET}_ONE_REG ioctls
:
: - Condition visibility of FGT registers on the presence of FEAT_FGT in
: the VM
KVM: arm64: selftest: vgic-v3: Add basic GICv3 sysreg userspace access test
KVM: arm64: Enforce the sorting of the GICv3 system register table
KVM: arm64: Clarify the check for reset callback in check_sysreg_table()
KVM: arm64: vgic-v3: Fix ordering of ICH_HCR_EL2
KVM: arm64: Document registers exposed via KVM_DEV_ARM_VGIC_GRP_CPU_SYSREGS
KVM: arm64: selftests: get-reg-list: Add base EL2 registers
KVM: arm64: selftests: get-reg-list: Simplify feature dependency
KVM: arm64: Advertise FGT2 registers to userspace
KVM: arm64: Condition FGT registers on feature availability
KVM: arm64: Expose GICv3 EL2 registers via KVM_DEV_ARM_VGIC_GRP_CPU_SYSREGS
KVM: arm64: Let GICv3 save/restore honor visibility attribute
KVM: arm64: Define helper for ICH_VTR_EL2
KVM: arm64: Define constant value for ICC_SRE_EL2
KVM: arm64: Don't advertise ICH_*_EL2 registers through GET_ONE_REG
KVM: arm64: Make RVBAR_EL2 accesses UNDEF
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
* kvm-arm64/config-masks:
: More config-driven mask computation, courtesy of Marc Zyngier
:
: Converts more system registers to the config-driven computation of RESx
: masks based on the advertised feature set
KVM: arm64: Tighten the definition of FEAT_PMUv3p9
KVM: arm64: Convert MDCR_EL2 to config-driven sanitisation
KVM: arm64: Convert SCTLR_EL1 to config-driven sanitisation
KVM: arm64: Convert TCR2_EL2 to config-driven sanitisation
arm64: sysreg: Add THE/ASID2 controls to TCR2_ELx
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Pull timer fix from Thomas Gleixner:
"A single fix for the PTP systemcounter mechanism:
The rework of this mechanism added a 'use_nsec' member to struct
system_counterval. get_device_system_crosststamp() instantiates that
struct on the stack and hands a pointer to the driver callback.
Only the drivers which set use_nsec to true, initialize that field,
but all others ignore it. As get_device_system_crosststamp() does not
initialize the struct, the use_nsec field contains random stack
content in those cases. That causes a miscalulation usually resulting
in a failing range check in the best case.
Initialize the structure before handing it to the drivers to cure
that"
* tag 'timers-urgent-2025-07-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
timekeeping: Zero initialize system_counterval when querying time from phc drivers
Pull spi fix from Mark Brown:
"One last fix for v6.16, removing some hard coding to avoid data
corruption on some NAND devices in the QPIC driver"
* tag 'spi-fix-v6.16-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi:
spi: spi-qpic-snand: don't hardcode ECC steps
Pull i2c fixes from Wolfram Sang:
- qup: avoid potential hang when waiting for bus idle
- tegra: improve ACPI reset error handling
- virtio: use interruptible wait to prevent hang during transfer
* tag 'i2c-for-6.16-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
i2c: qup: jump out of the loop in case of timeout
i2c: virtio: Avoid hang by using interruptible completion wait
i2c: tegra: Fix reset error handling with ACPI
Pull clk fixes from Stephen Boyd:
"A few Allwinner clk driver fixes:
- Mark Allwinner A523 MBUS clock as critical to avoid
system stalls
- Fix names of CSI related clocks on Allwinner V3s. This
includes changes to the driver, DT bindings and DT files.
- Fix parents of TCON clock on Allwinner V3s"
* tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux:
clk: sunxi-ng: v3s: Fix TCON clock parents
clk: sunxi-ng: v3s: Fix CSI1 MCLK clock name
clk: sunxi-ng: v3s: Fix CSI SCLK clock name
clk: sunxi-ng: a523: Mark MBUS clock as critical
Pull ARM fixes from Russell King:
- use an absolute path for asm/unified.h in KBUILD_AFLAGS to solve a
regression caused by commit d5c8d6e0fa ("kbuild: Update assembler
calls to use proper flags and language target")
- fix dead code elimination binutils version check again
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rmk/linux:
ARM: 9450/1: Fix allowing linker DCE with binutils < 2.36
ARM: 9448/1: Use an absolute path to unified.h in KBUILD_AFLAGS
Pull SoC fixes from Arnd Bergmann:
"These are two fixes that came in late, one addresses a regression on a
rockchips based board, the other is for ensuring a consistent dt
binding for a device added in 6.16 before the incorrect one makes it
into a release"
* tag 'soc-fixes-6.16-3' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc:
arm64: dts: rockchip: Drop netdev led-triggers on NanoPi R5S
arm64: dts: allwinner: a523: Rename emac0 to gmac0
* kvm-arm64/gcie-legacy:
: Support for GICv3 emulation on GICv5, courtesy of Sascha Bischoff
:
: FEAT_GCIE_LEGACY adds the necessary hardware for GICv5 systems to
: support the legacy GICv3 for VMs, including a backwards-compatible VGIC
: implementation that we all know and love.
:
: As a starting point for GICv5 enablement in KVM, enable + use the
: GICv3-compatible feature when running VMs on GICv5 hardware.
KVM: arm64: gic-v5: Probe for GICv5
KVM: arm64: gic-v5: Support GICv3 compat
arm64/sysreg: Add ICH_VCTLR_EL2
irqchip/gic-v5: Populate struct gic_kvm_info
irqchip/gic-v5: Skip deactivate for forwarded PPI interrupts
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
GICv5 initial host support
Add host kernel support for the new arm64 GICv5 architecture, which is
quite a departure from the previous ones.
Include support for the full gamut of the architecture (interrupt
routing and delivery to CPUs, wired interrupts, MSIs, and interrupt
translation).
* tag 'irqchip-gic-v5-host': (32 commits)
arm64: smp: Fix pNMI setup after GICv5 rework
arm64: Kconfig: Enable GICv5
docs: arm64: gic-v5: Document booting requirements for GICv5
irqchip/gic-v5: Add GICv5 IWB support
irqchip/gic-v5: Add GICv5 ITS support
irqchip/msi-lib: Add IRQ_DOMAIN_FLAG_FWNODE_PARENT handling
irqchip/gic-v3: Rename GICv3 ITS MSI parent
PCI/MSI: Add pci_msi_map_rid_ctlr_node() helper function
of/irq: Add of_msi_xlate() helper function
irqchip/gic-v5: Enable GICv5 SMP booting
irqchip/gic-v5: Add GICv5 LPI/IPI support
irqchip/gic-v5: Add GICv5 IRS/SPI support
irqchip/gic-v5: Add GICv5 PPI support
arm64: Add support for GICv5 GSB barriers
arm64: smp: Support non-SGIs for IPIs
arm64: cpucaps: Add GICv5 CPU interface (GCIE) capability
arm64: cpucaps: Rename GICv3 CPU interface capability
arm64: Disable GICv5 read/write/instruction traps
arm64/sysreg: Add ICH_HFGITR_EL2
arm64/sysreg: Add ICH_HFGWTR_EL2
...
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
* kvm-arm64/doublefault2: (33 commits)
: NV Support for FEAT_RAS + DoubleFault2
:
: Delegate the vSError context to the guest hypervisor when in a nested
: state, including registers related to ESR propagation. Additionally,
: catch up KVM's external abort infrastructure to the architecture,
: implementing the effects of FEAT_DoubleFault2.
:
: This has some impact on non-nested guests, as SErrors deemed unmasked at
: the time they're made pending are now immediately injected with an
: emulated exception entry rather than using the VSE bit.
KVM: arm64: Make RAS registers UNDEF when RAS isn't advertised
KVM: arm64: Filter out HCR_EL2 bits when running in hypervisor context
KVM: arm64: Check for SYSREGS_ON_CPU before accessing the CPU state
KVM: arm64: Commit exceptions from KVM_SET_VCPU_EVENTS immediately
KVM: arm64: selftests: Test ESR propagation for vSError injection
KVM: arm64: Populate ESR_ELx.EC for emulated SError injection
KVM: arm64: selftests: Catch up set_id_regs with the kernel
KVM: arm64: selftests: Add SCTLR2_EL1 to get-reg-list
KVM: arm64: selftests: Test SEAs are taken to SError vector when EASE=1
KVM: arm64: selftests: Add basic SError injection test
KVM: arm64: Don't retire MMIO instruction w/ pending (emulated) SError
KVM: arm64: Advertise support for FEAT_DoubleFault2
KVM: arm64: Advertise support for FEAT_SCTLR2
KVM: arm64: nv: Enable vSErrors when HCRX_EL2.TMEA is set
KVM: arm64: nv: Honor SError routing effects of SCTLR2_ELx.NMEA
KVM: arm64: nv: Take "masked" aborts to EL2 when HCRX_EL2.TMEA is set
KVM: arm64: Route SEAs to the SError vector when EASE is set
KVM: arm64: nv: Ensure Address size faults affect correct ESR
KVM: arm64: Factor out helper for selecting exception target EL
KVM: arm64: Describe SCTLR2_ELx RESx masks
...
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
* kvm-arm64/cacheable-pfnmap:
: Cacheable PFNMAP support at stage-2, courtesy of Ankit Agrawal
:
: For historical reasons, KVM only allows cacheable mappings at stage-2
: when a kernel alias exists in the direct map for the memory region. On
: hardware without FEAT_S2FWB, this is necessary as KVM must do cache
: maintenance to keep guest/host accesses coherent.
:
: This is unnecessarily restrictive on systems with FEAT_S2FWB and
: CTR_EL0.DIC, as KVM no longer needs to perform cache maintenance to
: maintain correctness.
:
: Allow cacheable mappings at stage-2 on supporting hardware when the
: corresponding VMA has cacheable memory attributes and advertise a
: capability to userspace such that a VMM can determine if a stage-2
: mapping can be established (e.g. VFIO device).
KVM: arm64: Expose new KVM cap for cacheable PFNMAP
KVM: arm64: Allow cacheable stage 2 mapping using VMA flags
KVM: arm64: Block cacheable PFNMAP mapping
KVM: arm64: Assume non-PFNMAP/MIXEDMAP VMAs can be mapped cacheable
KVM: arm64: Rename the device variable to s2_force_noncacheable
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>