linux

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2026-05-25 23:02:51 -04:00

Author	SHA1	Message	Date
Oliver Upton	e6157256ee	Revert "KVM: arm64: Split kvm_pgtable_stage2_destroy()" This reverts commit `0e89ca13ee`. The functional change that depended on this refactoring has been found to be quite problematic. Reverting the whole pile to start fresh when new fixes are available. Message-ID: <20250910180930.3679473-3-oliver.upton@linux.dev> Signed-off-by: Oliver Upton <oliver.upton@linux.dev>	2025-09-10 11:11:22 -07:00
Oliver Upton	fc670ad596	Revert "KVM: arm64: Reschedule as needed when destroying the stage-2 page-tables" This reverts commit `e9abe311f3`. syzkaller has managed to tease out multiple bugs in this change and fixing-forward didn't remedy the situation. Considering newly-introduced memory safety issues the potential for scheduler stalls don't seem that bad in comparison Link: https://lore.kernel.org/kvmarm/68c09802.050a0220.3c6139.000d.GAE@google.com/ Message-ID: <20250910180930.3679473-2-oliver.upton@linux.dev> Signed-off-by: Oliver Upton <oliver.upton@linux.dev>	2025-09-10 11:11:21 -07:00
Alok Tiwari	c04f174129	KVM: arm64: vgic: fix incorrect spinlock API usage The function vgic_flush_lr_state() is calling _raw_spin_unlock() instead of the proper raw_spin_unlock(). _raw_spin_unlock() is an internal low-level API and should not be used directly; using raw_spin_unlock() ensures proper locking semantics in the vgic code. Fixes: `8fa3adb8c6` ("KVM: arm/arm64: vgic: Make vgic_irq->irq_lock a raw_spinlock") Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com> Acked-by: Marc Zyngier <maz@kernel.org> Message-ID: <20250908180413.3655546-1-alok.a.tiwari@oracle.com> Signed-off-by: Oliver Upton <oliver.upton@linux.dev>	2025-09-10 02:56:20 -07:00
Wei-Lin Chang	51d165e92a	KVM: arm64: Remove stage 2 read fault check In the non-NV case, read permission is always granted when mapping stage-2, so checking for it doesn't bring much. On the other hand, shadow stage-2 for NV guests could potentially have non-readable mappings when we align the permissions with those that L1 set for L2, we shouldn't be checking for read faults in this case either. So just remove this check. Suggested-by: Oliver Upton <oliver.upton@linux.dev> Suggested-by: Marc Zyngier <maz@kernel.org> Signed-off-by: Wei-Lin Chang <r09922117@csie.ntu.edu.tw> Link: https://lore.kernel.org/r/20250908064806.4093081-1-r09922117@csie.ntu.edu.tw Signed-off-by: Oliver Upton <oliver.upton@linux.dev>	2025-09-10 02:56:20 -07:00
Fuad Tabba	2dc720e606	KVM: arm64: Fix parameter ordering for VBAR_EL1 assignment The __vcpu_assign_sys_reg() helper expects the register ID as the second argument and the value to be assigned as the third. However, the existing code was passing these parameters in the incorrect order. Fix the function call to properly read the live value of VBAR_EL1 from the guest and update the vCPU value immediately before pending the exception. This ensures the vCPU's value is the same as the guest's and that the exception will be handled at the correct address upon resuming the guest. Fixes: `798eb59787` ("KVM: arm64: Sync protected guest VBAR_EL1 on injecting an undef exception") Signed-off-by: Fuad Tabba <tabba@google.com> Link: https://lore.kernel.org/r/20250908163557.2419780-1-tabba@google.com Signed-off-by: Oliver Upton <oliver.upton@linux.dev>	2025-09-10 02:56:20 -07:00
Dongha Lee	ebb2d8fd81	KVM: arm64: nv: Fix incorrect VNCR invalidation range calculation The code for invalidating VNCR entries in both kvm_invalidate_vncr_ipa() and invalidate_vncr_va() incorrectly uses a bitwise AND with `(size - 1)` instead of `~(size - 1)` to align the start address. This results in masking the address bits instead of aligning them down to the start of the block. This bug may cause stale VNCR TLB entries to remain valid even after a TLBI or MMU notifier, leading to incorrect memory translation and unexpected guest behavior. Credit to Team 0xB6 in bob14: DongHa Lee, Gyujeong Jin, Daehyeon Ko, Geonha Lee, Hyungyu Oh, and Jaewon Yang. Reviewed-by: Marc Zyngier <maz@kernel.org> Signed-off-by: Dongha Lee <p@sswd.pw> Link: https://lore.kernel.org/r/20250906040724.72960-1-p@sswd.pw Signed-off-by: Oliver Upton <oliver.upton@linux.dev>	2025-09-10 02:56:20 -07:00
Oliver Upton	13bba09beb	KVM: arm64: vgic-v3: Indicate vgic_put_irq() may take LPI xarray lock The release path on LPIs is quite rare, meaning it can be difficult to find lock ordering bugs on the LPI xarray's spinlock. Tell lockdep that vgic_put_irq() might acquire the xa_lock to make unsafe patterns more obvious. Reviewed-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20250905100531.282980-7-oliver.upton@linux.dev Signed-off-by: Oliver Upton <oliver.upton@linux.dev>	2025-09-10 02:56:20 -07:00
Oliver Upton	982f31bbb5	KVM: arm64: vgic-v3: Don't require IRQs be disabled for LPI xarray lock Now that releases of LPIs have been unnested from the ap_list_lock there are no xarray writers that exist in a context where IRQs are already disabled. As such we can relax the locking to the non-IRQ disabling spinlock to guard the LPI xarray. Note that there are still readers of the LPI xarray where IRQs are disabled however the readers rely on RCU protection instead of the lock. Reviewed-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20250905100531.282980-6-oliver.upton@linux.dev Signed-off-by: Oliver Upton <oliver.upton@linux.dev>	2025-09-10 02:56:20 -07:00
Oliver Upton	d54594accf	KVM: arm64: vgic-v3: Erase LPIs from xarray outside of raw spinlocks syzkaller has caught us red-handed once more, this time nesting regular spinlocks behind raw spinlocks: ============================= [ BUG: Invalid wait context ] 6.16.0-rc3-syzkaller-g7b8346bd9fce #0 Not tainted ----------------------------- syz.0.29/3743 is trying to lock: a3ff80008e2e9e18 (&xa->xa_lock#20){....}-{3:3}, at: vgic_put_irq+0xb4/0x190 arch/arm64/kvm/vgic/vgic.c:137 other info that might help us debug this: context-{5:5} 3 locks held by syz.0.29/3743: #0: a3ff80008e2e90a8 (&kvm->slots_lock){+.+.}-{4:4}, at: kvm_vgic_destroy+0x50/0x624 arch/arm64/kvm/vgic/vgic-init.c:499 #1: a3ff80008e2e9fa0 (&kvm->arch.config_lock){+.+.}-{4:4}, at: kvm_vgic_destroy+0x5c/0x624 arch/arm64/kvm/vgic/vgic-init.c:500 #2: 58f0000021be1428 (&vgic_cpu->ap_list_lock){....}-{2:2}, at: vgic_flush_pending_lpis+0x3c/0x31c arch/arm64/kvm/vgic/vgic.c:150 stack backtrace: CPU: 0 UID: 0 PID: 3743 Comm: syz.0.29 Not tainted 6.16.0-rc3-syzkaller-g7b8346bd9fce #0 PREEMPT Hardware name: linux,dummy-virt (DT) Call trace: show_stack+0x2c/0x3c arch/arm64/kernel/stacktrace.c:466 (C) __dump_stack+0x30/0x40 lib/dump_stack.c:94 dump_stack_lvl+0xd8/0x12c lib/dump_stack.c:120 dump_stack+0x1c/0x28 lib/dump_stack.c:129 print_lock_invalid_wait_context kernel/locking/lockdep.c:4833 [inline] check_wait_context kernel/locking/lockdep.c:4905 [inline] __lock_acquire+0x978/0x299c kernel/locking/lockdep.c:5190 lock_acquire+0x14c/0x2e0 kernel/locking/lockdep.c:5871 __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline] _raw_spin_lock_irqsave+0x5c/0x7c kernel/locking/spinlock.c:162 vgic_put_irq+0xb4/0x190 arch/arm64/kvm/vgic/vgic.c:137 vgic_flush_pending_lpis+0x24c/0x31c arch/arm64/kvm/vgic/vgic.c:158 __kvm_vgic_vcpu_destroy+0x44/0x500 arch/arm64/kvm/vgic/vgic-init.c:455 kvm_vgic_destroy+0x100/0x624 arch/arm64/kvm/vgic/vgic-init.c:505 kvm_arch_destroy_vm+0x80/0x138 arch/arm64/kvm/arm.c:244 kvm_destroy_vm virt/kvm/kvm_main.c:1308 [inline] kvm_put_kvm+0x800/0xff8 virt/kvm/kvm_main.c:1344 kvm_vm_release+0x58/0x78 virt/kvm/kvm_main.c:1367 __fput+0x4ac/0x980 fs/file_table.c:465 ____fput+0x20/0x58 fs/file_table.c:493 task_work_run+0x1bc/0x254 kernel/task_work.c:227 resume_user_mode_work include/linux/resume_user_mode.h:50 [inline] do_notify_resume+0x1b4/0x270 arch/arm64/kernel/entry-common.c:151 exit_to_user_mode_prepare arch/arm64/kernel/entry-common.c:169 [inline] exit_to_user_mode arch/arm64/kernel/entry-common.c:178 [inline] el0_svc+0xb4/0x160 arch/arm64/kernel/entry-common.c:768 el0t_64_sync_handler+0x78/0x108 arch/arm64/kernel/entry-common.c:786 el0t_64_sync+0x198/0x19c arch/arm64/kernel/entry.S:600 This is of course no good, but is at odds with how LPI refcounts are managed. Solve the locking mess by deferring the release of unreferenced LPIs after the ap_list_lock is released. Mark these to-be-released LPIs specially to avoid racing with vgic_put_irq() and causing a double-free. Since references can only be taken on LPIs with a nonzero refcount, extending the lifetime of freed LPIs is still safe. Reviewed-by: Marc Zyngier <maz@kernel.org> Reported-by: syzbot+cef594105ac7e60c6d93@syzkaller.appspotmail.com Closes: https://lore.kernel.org/kvmarm/68acd0d9.a00a0220.33401d.048b.GAE@google.com/ Link: https://lore.kernel.org/r/20250905100531.282980-5-oliver.upton@linux.dev Signed-off-by: Oliver Upton <oliver.upton@linux.dev>	2025-09-10 02:56:20 -07:00
Oliver Upton	0a4aedf2bd	KVM: arm64: Spin off release helper from vgic_put_irq() Spin off the release implementation from vgic_put_irq() to prepare for a more involved fix for lock ordering such that it may be unnested from raw spinlocks. This has the minor functional change of doing call_rcu() behind the xa_lock although it shouldn't be consequential. Reviewed-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20250905100531.282980-4-oliver.upton@linux.dev Signed-off-by: Oliver Upton <oliver.upton@linux.dev>	2025-09-10 02:56:20 -07:00
Oliver Upton	3a08a6ca7c	KVM: arm64: vgic-v3: Use bare refcount for VGIC LPIs KVM's use of krefs to manage LPIs isn't adding much, especially considering vgic_irq_release() is a noop due to the lack of sufficient context. Switch to using a regular refcount in anticipation of adding a meaningful release concept for LPIs. Reviewed-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20250905100531.282980-3-oliver.upton@linux.dev Signed-off-by: Oliver Upton <oliver.upton@linux.dev>	2025-09-10 02:56:19 -07:00
Alexandru Elisei	da2e743419	KVM: arm64: VHE: Save and restore host MDCR_EL2 value correctly Prior to commit `75a5fbaf66` ("KVM: arm64: Compute MDCR_EL2 at vcpu_load()"), host MDCR_EL2 was saved correctly: kvm_arch_vcpu_load() kvm_vcpu_load_debug() /* Doesn't touch hardware MDCR_EL2. / kvm_vcpu_load_vhe() __activate_traps_common() / Saves host MDCR_EL2. / host_data_ptr(host_debug_state.mdcr_el2) = read_sysreg(mdcr_el2) /* Writes VCPU MDCR_EL2. / write_sysreg(vcpu->arch.mdcr_el2, mdcr_el2) The MDCR_EL2 value saved previously was restored in kvm_arch_vcpu_put() -> kvm_vcpu_put_vhe(). After the aforementioned commit, host MDCR_EL2 is never saved: kvm_arch_vcpu_load() kvm_vcpu_load_debug() / Writes VCPU MDCR_EL2 / kvm_vcpu_load_vhe() __activate_traps_common() / Saves VCPU MDCR_EL2. / host_data_ptr(host_debug_state.mdcr_el2) = read_sysreg(mdcr_el2) /* Writes VCPU MDCR_EL2 a second time. */ write_sysreg(vcpu->arch.mdcr_el2, mdcr_el2) kvm_arch_vcpu_put() -> kvm_vcpu_put_vhe() then restores the VCPU MDCR_EL2 value. Also VCPU's MDCR_EL2 value gets written to hardware twice now. Fix this by saving the host MDCR_EL2 in kvm_arch_vcpu_load() before it gets overwritten by the VCPU's MDCR_EL2 value, and restore it on VCPU put. Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com> Reviewed-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/r/20250902130833.338216-3-alexandru.elisei@arm.com Signed-off-by: Oliver Upton <oliver.upton@linux.dev>	2025-09-10 02:56:19 -07:00
Alexandru Elisei	efad60e460	KVM: arm64: Initialize PMSCR_EL1 when in VHE According to the pseudocode for StatisticalProfilingEnabled() from Arm DDI0487L.b, PMSCR_EL1 controls profiling at EL1 and EL0: - PMSCR_EL1.E1SPE controls profiling at EL1. - PMSCR_EL1.E0SPE controls profiling at EL0 if HCR_EL2.TGE=0. These two fields reset to UNKNOWN values. When KVM runs in VHE mode and profiling is enabled in the host, before entering a guest, KVM does not touch any of the SPE registers, leaving the buffer enabled, and it clears HCR_EL2.TGE. As a result, depending on the reset value for the E1SPE and E0SPE fields, KVM might unintentionally profile a guest. Make the behaviour consistent and predictable by clearing PMSCR_EL1 when KVM initialises the host debug configuration. Note that this is not a problem for nVHE, because KVM clears PMSCR_EL1.{E1SPE,E0SPE} before entering the guest. Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com> Link: https://lore.kernel.org/r/20250902130833.338216-2-alexandru.elisei@arm.com Signed-off-by: Oliver Upton <oliver.upton@linux.dev>	2025-09-10 02:56:19 -07:00
Geonha Lee	860b21c31d	KVM: arm64: nv: fix VNCR TLB ASID match logic for non-Global entries kvm_vncr_tlb_lookup() is supposed to return true when the cached VNCR TLB entry is valid for the current context. For non-Global entries, that means the entry’s ASID must match the current ASID. The current code returns true when the ASIDs do not match, which inverts the logic. This is a potential vulnerability: - Valid entries are ignored and we fall back to kvm_translate_vncr(), hurting performance. - Mismatched entries are treated as permission faults (-EPERM) instead of triggering a fresh translation. - This can also cause stale translations to be (wrongly) considered valid across address spaces. Flip the predicate so non-Global entries only hit when ASIDs match. Special credit to Team 0xB6 for reporting: DongHa Lee, Gyujeong Jin, Daehyeon Ko, Geonha Lee, Hyungyu Oh, and Jaewon Yang. Signed-off-by: Geonha Lee <w1nsom3gna@korea.ac.kr> Reviewed-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20250903150421.90752-1-w1nsom3gna@korea.ac.kr Signed-off-by: Oliver Upton <oliver.upton@linux.dev>	2025-09-10 02:56:19 -07:00
Marc Zyngier	34b8f4aded	KVM: arm64: Mark freed S2 MMUs as invalid When freeing an S2 MMU, we free the associated pgd, but omit to mark the structure as invalid. Subsequently, a call to kvm_nested_s2_unmap() would pick these invalid S2 MMUs and pass them down the teardown path. This ends up with a nasty warning as we try to unmap an unallocated set of page tables. Fix this by making the S2 MMU invalid on freeing the pgd by calling kvm_init_nested_s2_mmu(). Fixes: `4f128f8e1a` ("KVM: arm64: nv: Support multiple nested Stage-2 mmu structures") Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20250905072859.211369-1-maz@kernel.org Signed-off-by: Oliver Upton <oliver.upton@linux.dev>	2025-09-05 00:43:14 -07:00
Paolo Bonzini	42a0305ab1	Merge tag 'kvmarm-fixes-6.17-1' of https://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD KVM/arm64 changes for 6.17, take #2 - Correctly handle 'invariant' system registers for protected VMs - Improved handling of VNCR data aborts, including external aborts - Fixes for handling of FEAT_RAS for NV guests, providing a sane fault context during SEA injection and preventing the use of RASv1p1 fault injection hardware - Ensure that page table destruction when a VM is destroyed gives an opportunity to reschedule - Large fix to KVM's infrastructure for managing guest context loaded on the CPU, addressing issues where the output of AT emulation doesn't get reflected to the guest - Fix AT S12 emulation to actually perform stage-2 translation when necessary - Avoid attempting vLPI irqbypass when GICv4 has been explicitly disabled for a VM - Minor KVM + selftest fixes	2025-08-29 12:57:31 -04:00
Marc Zyngier	ee372e6451	KVM: arm64: nv: Fix ATS12 handling of single-stage translation Volodymyr reports that using a Xen DomU as a nested guest (where HCR_EL2.E2H == 0), ATS12 results in a translation that stops at the L2's S1, which isn't something you'd normally expects. Comparing the code against the spec proves to be illuminating, and suggests that the author of such code must have been tired, cross-eyed, drunk, or maybe all of the above. The gist of it is that, apart from HCR_EL2.VM or HCR_EL2.DC being 0, only the use of the EL2&0 translation regime limits the walk to S1 only, and that we must finish the S2 walk in any other case. Which solves the above issue, as E2H==0 indicates that ATS12 walks the EL1&0 translation regime. Explicitly checking for EL2&0 fixes this. Reported-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com> Suggested-by: Oliver Upton <oliver.upton@linux.dev> Signed-off-by: Marc Zyngier <maz@kernel.org> Fixes: `be04cebf3e` ("KVM: arm64: nv: Add emulation of AT S12E{0,1}{R,W}") Link: https://lore.kernel.org/r/20250806141707.3479194-2-volodymyr_babchuk@epam.com Link: https://lore.kernel.org/r/20250809144811.2314038-2-maz@kernel.org Signed-off-by: Oliver Upton <oliver.upton@linux.dev>	2025-08-28 12:44:42 -07:00
Marc Zyngier	3328d17e70	KVM: arm64: Remove __vcpu_{read,write}_sys_reg_{from,to}_cpu() There is no point having __vcpu_{read,write}_sys_reg_{from,to}_cpu() exposed to the rest of the kernel, as the only callers are in sys_regs.c. Move them where they below, which is another opportunity to simplify things a bit. Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20250817121926.217900-5-maz@kernel.org Signed-off-by: Oliver Upton <oliver.upton@linux.dev>	2025-08-28 11:39:48 -07:00
Marc Zyngier	ec0ab059d4	KVM: arm64: Fix vcpu_{read,write}_sys_reg() accessors Volodymyr reports (again!) that under some circumstances (E2H==0, walking S1 PTs), PAR_EL1 doesn't report the value of the latest walk in the CPU register, but that instead the value is written to the backing store. Further investigation indicates that the root cause of this is that a group of registers (PAR_EL1, TPIDR_EL{0,1}, the 32_EL2 dregs) should always be considered as "on CPU", as they are not remapped between EL1 and EL2. We fail to treat them accordingly, and end-up considering that the register (PAR_EL1 in this example) should be written to memory instead of in the register. While it would be possible to quickly work around it, it is obvious that the way we track these things at the moment is pretty horrible, and could do with some improvement. Revamp the whole thing by: - defining a location for a register (memory, cpu), potentially depending on the state of the vcpu - define a transformation for this register (mapped register, potential translation, special register needing some particular attention) - convey this information in a structure that can be easily passed around As a result, the accessors themselves become much simpler, as the state is explicit instead of being driven by hard-to-understand conventions. We get rid of the "pure EL2 register" notion, which wasn't very useful, and add sanitisation of the values by applying the RESx masks as required, something that was missing so far. And of course, we add the missing registers to the list, with the indication that they are always loaded. Reported-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Fixes: `fedc612314` ("KVM: arm64: nv: Handle virtual EL2 registers in vcpu_read/write_sys_reg()") Link: https://lore.kernel.org/r/20250806141707.3479194-3-volodymyr_babchuk@epam.com Link: https://lore.kernel.org/r/20250817121926.217900-4-maz@kernel.org Signed-off-by: Oliver Upton <oliver.upton@linux.dev>	2025-08-28 11:39:48 -07:00
Marc Zyngier	e3f6836a63	KVM: arm64: Simplify sysreg access on exception delivery Distinguishing between NV and VHE is slightly pointless, and only serves as an extra complication, or a way to introduce bugs, such as the way SPSR_EL1 gets written without checking for the state being resident. Get rid if this silly distinction, and fix the bug in one go. Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20250817121926.217900-3-maz@kernel.org Signed-off-by: Oliver Upton <oliver.upton@linux.dev>	2025-08-28 11:39:48 -07:00
Marc Zyngier	b720269334	KVM: arm64: Check for SYSREGS_ON_CPU before accessing the 32bit state Just like `c6e35dff58` ("KVM: arm64: Check for SYSREGS_ON_CPU before accessing the CPU state") fixed the 64bit state access, add a check for the 32bit state actually being on the CPU before writing it. Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20250817121926.217900-2-maz@kernel.org Signed-off-by: Oliver Upton <oliver.upton@linux.dev>	2025-08-28 11:39:48 -07:00
Marc Zyngier	0843e0ced3	KVM: arm64: Get rid of ARM64_FEATURE_MASK() The ARM64_FEATURE_MASK() macro was a hack introduce whilst the automatic generation of sysreg encoding was introduced, and was too unreliable to be entirely trusted. We are in a better place now, and we could really do without this macro. Get rid of it altogether. Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20250817202158.395078-7-maz@kernel.org Signed-off-by: Oliver Upton <oliver.upton@linux.dev>	2025-08-21 16:31:56 -07:00
Marc Zyngier	7a765aa88e	KVM: arm64: Make ID_AA64PFR1_EL1.RAS_frac writable Allow userspace to write to RAS_frac, under the condition that the host supports RASv1p1 with RAS_frac==1. Other configurations will result in RAS_frac being exposed as 0, and therefore implicitly not writable. To avoid the clutter, the ID_AA64PFR1_EL1 sanitisation is moved to its own function. Signed-off-by: Marc Zyngier <maz@kernel.org> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Link: https://lore.kernel.org/r/20250817202158.395078-6-maz@kernel.org Signed-off-by: Oliver Upton <oliver.upton@linux.dev>	2025-08-21 16:31:56 -07:00
Marc Zyngier	1fab657cb2	KVM: arm64: Make ID_AA64PFR0_EL1.RAS writable Make ID_AA64PFR0_EL1.RAS writable so that we can restore a VM from a system without RAS to a RAS-equipped machine (or disable RAS in the guest). Signed-off-by: Marc Zyngier <maz@kernel.org> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Link: https://lore.kernel.org/r/20250817202158.395078-5-maz@kernel.org Signed-off-by: Oliver Upton <oliver.upton@linux.dev>	2025-08-21 16:28:46 -07:00
Marc Zyngier	9049fb1227	KVM: arm64: Ignore HCR_EL2.FIEN set by L1 guest's EL2 An EL2 guest can set HCR_EL2.FIEN, which gives access to the RASv1p1 fault injection mechanism. This would allow an EL1 guest to inject error records into the system, which does sound like a terrible idea. Prevent this situation by added FIEN to the list of bits we silently exclude from being inserted into the host configuration. Signed-off-by: Marc Zyngier <maz@kernel.org> Reviewed-by: Joey Gouly <joey.gouly@arm.com> Link: https://lore.kernel.org/r/20250817202158.395078-4-maz@kernel.org Signed-off-by: Oliver Upton <oliver.upton@linux.dev>	2025-08-21 16:28:46 -07:00
Marc Zyngier	d7b3e23f94	KVM: arm64: Handle RASv1p1 registers FEAT_RASv1p1 system registeres are not handled at all so far. KVM will give an embarassed warning on the console and inject an UNDEF, despite RASv1p1 being exposed to the guest on suitable HW. Handle these registers similarly to FEAT_RAS, with the added fun that there are two way to indicate the presence of FEAT_RASv1p1. Reviewed-by: Joey Gouly <joey.gouly@arm.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Link: https://lore.kernel.org/r/20250817202158.395078-3-maz@kernel.org Signed-off-by: Oliver Upton <oliver.upton@linux.dev>	2025-08-21 16:28:46 -07:00
Raghavendra Rao Ananta	e9abe311f3	KVM: arm64: Reschedule as needed when destroying the stage-2 page-tables When a large VM, specifically one that holds a significant number of PTEs, gets abruptly destroyed, the following warning is seen during the page-table walk: sched: CPU 0 need_resched set for > 100018840 ns (100 ticks) without schedule CPU: 0 UID: 0 PID: 9617 Comm: kvm_page_table_ Tainted: G O 6.16.0-smp-DEV #3 NONE Tainted: [O]=OOT_MODULE Call trace: show_stack+0x20/0x38 (C) dump_stack_lvl+0x3c/0xb8 dump_stack+0x18/0x30 resched_latency_warn+0x7c/0x88 sched_tick+0x1c4/0x268 update_process_times+0xa8/0xd8 tick_nohz_handler+0xc8/0x168 __hrtimer_run_queues+0x11c/0x338 hrtimer_interrupt+0x104/0x308 arch_timer_handler_phys+0x40/0x58 handle_percpu_devid_irq+0x8c/0x1b0 generic_handle_domain_irq+0x48/0x78 gic_handle_irq+0x1b8/0x408 call_on_irq_stack+0x24/0x30 do_interrupt_handler+0x54/0x78 el1_interrupt+0x44/0x88 el1h_64_irq_handler+0x18/0x28 el1h_64_irq+0x84/0x88 stage2_free_walker+0x30/0xa0 (P) __kvm_pgtable_walk+0x11c/0x258 __kvm_pgtable_walk+0x180/0x258 __kvm_pgtable_walk+0x180/0x258 __kvm_pgtable_walk+0x180/0x258 kvm_pgtable_walk+0xc4/0x140 kvm_pgtable_stage2_destroy+0x5c/0xf0 kvm_free_stage2_pgd+0x6c/0xe8 kvm_uninit_stage2_mmu+0x24/0x48 kvm_arch_flush_shadow_all+0x80/0xa0 kvm_mmu_notifier_release+0x38/0x78 __mmu_notifier_release+0x15c/0x250 exit_mmap+0x68/0x400 __mmput+0x38/0x1c8 mmput+0x30/0x68 exit_mm+0xd4/0x198 do_exit+0x1a4/0xb00 do_group_exit+0x8c/0x120 get_signal+0x6d4/0x778 do_signal+0x90/0x718 do_notify_resume+0x70/0x170 el0_svc+0x74/0xd8 el0t_64_sync_handler+0x60/0xc8 el0t_64_sync+0x1b0/0x1b8 The warning is seen majorly on the host kernels that are configured not to force-preempt, such as CONFIG_PREEMPT_NONE=y. To avoid this, instead of walking the entire page-table in one go, split it into smaller ranges, by checking for cond_resched() between each range. Since the path is executed during VM destruction, after the page-table structure is unlinked from the KVM MMU, relying on cond_resched_rwlock_write() isn't necessary. Signed-off-by: Raghavendra Rao Ananta <rananta@google.com> Suggested-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/r/20250820162242.2624752-3-rananta@google.com Signed-off-by: Oliver Upton <oliver.upton@linux.dev>	2025-08-21 16:27:03 -07:00
Raghavendra Rao Ananta	0e89ca13ee	KVM: arm64: Split kvm_pgtable_stage2_destroy() Split kvm_pgtable_stage2_destroy() into two: - kvm_pgtable_stage2_destroy_range(), that performs the page-table walk and free the entries over a range of addresses. - kvm_pgtable_stage2_destroy_pgd(), that frees the PGD. This refactoring enables subsequent patches to free large page-tables in chunks, calling cond_resched() between each chunk, to yield the CPU as necessary. Existing callers of kvm_pgtable_stage2_destroy(), that probably cannot take advantage of this (such as nVMHE), will continue to function as is. Signed-off-by: Raghavendra Rao Ananta <rananta@google.com> Suggested-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/r/20250820162242.2624752-2-rananta@google.com Signed-off-by: Oliver Upton <oliver.upton@linux.dev>	2025-08-21 16:27:03 -07:00
Marc Zyngier	d19c541d26	KVM: arm64: Correctly populate FAR_EL2 on nested SEA injection vcpu_write_sys_reg()'s signature is not totally obvious, and it is rather easy to write something that looks correct, except that... Oh wait... Swap addr and FAR_EL2 to restore some sanity in the nested SEA department. Fixes: `9aba641b9e` ("KVM: arm64: nv: Respect exception routing rules for SEAs") Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20250813163747.2591317-1-maz@kernel.org Signed-off-by: Oliver Upton <oliver.upton@linux.dev>	2025-08-15 11:51:18 -07:00
Fuad Tabba	f1edb15920	arm64: vgic-v2: Fix guest endianness check in hVHE mode In hVHE when running at the hypervisor, SCTLR_EL1 refers to the hypervisor's System Control Register rather than the guest's. Make sure to access the guest's register to determine its endianness. Reported-by: Will Deacon <will@kernel.org> Signed-off-by: Fuad Tabba <tabba@google.com> Link: https://lore.kernel.org/r/20250807120133.871892-4-tabba@google.com Signed-off-by: Oliver Upton <oliver.upton@linux.dev>	2025-08-08 01:29:32 -07:00
Fuad Tabba	798eb59787	KVM: arm64: Sync protected guest VBAR_EL1 on injecting an undef exception In pKVM, a race condition can occur if a guest updates its VBAR_EL1 register and, before a vCPU exit synchronizes this change, the hypervisor needs to inject an undefined exception into a protected guest. In this scenario, the vCPU still holds the stale VBAR_EL1 value from before the guest's update. When pKVM injects the exception, it ends up using the stale value. Explicitly read the live value of VBAR_EL1 from the guest and update the vCPU value immediately before pending the exception. This ensures the vCPU's value is the same as the guest's and that the exception will be handled at the correct address upon resuming the guest. Reported-by: Keir Fraser <keirf@google.com> Signed-off-by: Fuad Tabba <tabba@google.com> Link: https://lore.kernel.org/r/20250807120133.871892-3-tabba@google.com Signed-off-by: Oliver Upton <oliver.upton@linux.dev>	2025-08-08 01:29:32 -07:00
Fuad Tabba	eaa43934b4	KVM: arm64: Handle AIDR_EL1 and REVIDR_EL1 in host for protected VMs Since commit `17efc1acee` ("arm64: Expose AIDR_EL1 via sysfs"), AIDR_EL1 is read early during boot. Therefore, a guest running as a protected VM will fail to boot because when it attempts to access AIDR_EL1, access to that register is restricted in pKVM for protected guests. Similar to how MIDR_EL1 is handled by the host for protected VMs, let the host handle accesses to AIDR_EL1 as well as REVIDR_EL1. However note that, unlike MIDR_EL1, AIDR_EL1 and REVIDR_EL1 are trapped by HCR_EL2.TID1. Therefore, explicitly mark them as handled by the host for protected VMs. TID1 is always set in pKVM, because it needs to restrict access to SMIDR_EL1, which is also trapped by that bit. Reported-by: Will Deacon <will@kernel.org> Signed-off-by: Fuad Tabba <tabba@google.com> Link: https://lore.kernel.org/r/20250807120133.871892-2-tabba@google.com Signed-off-by: Oliver Upton <oliver.upton@linux.dev>	2025-08-08 01:29:31 -07:00
Arnd Bergmann	700d6868fe	kvm: arm64: use BUG() instead of BUG_ON(1) The BUG_ON() macro adds a little bit of complexity over BUG(), and in some cases this ends up confusing the compiler's control flow analysis in a way that results in a warning. This one now shows up with clang-21: arch/arm64/kvm/vgic/vgic-mmio.c:1094:3: error: variable 'len' is used uninitialized whenever 'if' condition is false [-Werror,-Wsometimes-uninitialized] 1094 \| BUG_ON(1); Change both instances of BUG_ON(1) to a plain BUG() in the arm64 kvm code, to avoid the false-positive warning. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Nathan Chancellor <nathan@kernel.org> Link: https://lore.kernel.org/r/20250807072132.4170088-1-arnd@kernel.org Signed-off-by: Oliver Upton <oliver.upton@linux.dev>	2025-08-08 01:28:57 -07:00
Oliver Upton	69f8fe955d	KVM: arm64: nv: Handle SEAs due to VNCR redirection System register accesses redirected to the VNCR page can also generate external aborts just like any other form of memory access. Route to kvm_handle_guest_sea() for potential APEI handling, falling back to a vSError if the kernel didn't handle the abort. Take the opportunity to throw out the useless kvm_ras.h which provided a helper with a single callsite... Cc: Jiaqi Yan <jiaqiyan@google.com> Reviewed-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20250729182342.3281742-1-oliver.upton@linux.dev Signed-off-by: Oliver Upton <oliver.upton@linux.dev>	2025-08-04 22:16:10 -07:00
Marc Zyngier	07f557f60a	KVM: arm64: nv: Properly check ESR_EL2.VNCR on taking a VNCR_EL2 related fault Instead of checking for the ESR_EL2.VNCR bit being set (the only case we should be here), we are actually testing random bits in ESR_EL2.DFSC. 13 obviously being a lucky number, it matches both permission and translation fault status codes, which explains why we never saw it failing. This was found by inspection, while reviewing a vaguely related patch. Whilst we're at it, turn the BUG_ON() into a WARN_ON_ONCE(), as exploding here is just silly. Fixes: `069a05e535` ("KVM: arm64: nv: Handle VNCR_EL2-triggered faults") Signed-off-by: Marc Zyngier <maz@kernel.org> Reviewed-by: Joey Gouly <joey.gouly@arm.com> Link: https://lore.kernel.org/r/20250730101828.1168707-1-maz@kernel.org Signed-off-by: Oliver Upton <oliver.upton@linux.dev>	2025-08-04 22:15:29 -07:00
Linus Torvalds	63eb28bb14	Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm Pull kvm updates from Paolo Bonzini: "ARM: - Host driver for GICv5, the next generation interrupt controller for arm64, including support for interrupt routing, MSIs, interrupt translation and wired interrupts - Use FEAT_GCIE_LEGACY on GICv5 systems to virtualize GICv3 VMs on GICv5 hardware, leveraging the legacy VGIC interface - Userspace control of the 'nASSGIcap' GICv3 feature, allowing userspace to disable support for SGIs w/o an active state on hardware that previously advertised it unconditionally - Map supporting endpoints with cacheable memory attributes on systems with FEAT_S2FWB and DIC where KVM no longer needs to perform cache maintenance on the address range - Nested support for FEAT_RAS and FEAT_DoubleFault2, allowing the guest hypervisor to inject external aborts into an L2 VM and take traps of masked external aborts to the hypervisor - Convert more system register sanitization to the config-driven implementation - Fixes to the visibility of EL2 registers, namely making VGICv3 system registers accessible through the VGIC device instead of the ONE_REG vCPU ioctls - Various cleanups and minor fixes LoongArch: - Add stat information for in-kernel irqchip - Add tracepoints for CPUCFG and CSR emulation exits - Enhance in-kernel irqchip emulation - Various cleanups RISC-V: - Enable ring-based dirty memory tracking - Improve perf kvm stat to report interrupt events - Delegate illegal instruction trap to VS-mode - MMU improvements related to upcoming nested virtualization s390x - Fixes x86: - Add CONFIG_KVM_IOAPIC for x86 to allow disabling support for I/O APIC, PIC, and PIT emulation at compile time - Share device posted IRQ code between SVM and VMX and harden it against bugs and runtime errors - Use vcpu_idx, not vcpu_id, for GA log tag/metadata, to make lookups O(1) instead of O(n) - For MMIO stale data mitigation, track whether or not a vCPU has access to (host) MMIO based on whether the page tables have MMIO pfns mapped; using VFIO is prone to false negatives - Rework the MSR interception code so that the SVM and VMX APIs are more or less identical - Recalculate all MSR intercepts from scratch on MSR filter changes, instead of maintaining shadow bitmaps - Advertise support for LKGS (Load Kernel GS base), a new instruction that's loosely related to FRED, but is supported and enumerated independently - Fix a user-triggerable WARN that syzkaller found by setting the vCPU in INIT_RECEIVED state (aka wait-for-SIPI), and then putting the vCPU into VMX Root Mode (post-VMXON). Trying to detect every possible path leading to architecturally forbidden states is hard and even risks breaking userspace (if it goes from valid to valid state but passes through invalid states), so just wait until KVM_RUN to detect that the vCPU state isn't allowed - Add KVM_X86_DISABLE_EXITS_APERFMPERF to allow disabling interception of APERF/MPERF reads, so that a "properly" configured VM can access APERF/MPERF. This has many caveats (APERF/MPERF cannot be zeroed on vCPU creation or saved/restored on suspend and resume, or preserved over thread migration let alone VM migration) but can be useful whenever you're interested in letting Linux guests see the effective physical CPU frequency in /proc/cpuinfo - Reject KVM_SET_TSC_KHZ for vm file descriptors if vCPUs have been created, as there's no known use case for changing the default frequency for other VM types and it goes counter to the very reason why the ioctl was added to the vm file descriptor. And also, there would be no way to make it work for confidential VMs with a "secure" TSC, so kill two birds with one stone - Dynamically allocation the shadow MMU's hashed page list, and defer allocating the hashed list until it's actually needed (the TDP MMU doesn't use the list) - Extract many of KVM's helpers for accessing architectural local APIC state to common x86 so that they can be shared by guest-side code for Secure AVIC - Various cleanups and fixes x86 (Intel): - Preserve the host's DEBUGCTL.FREEZE_IN_SMM when running the guest. Failure to honor FREEZE_IN_SMM can leak host state into guests - Explicitly check vmcs12.GUEST_DEBUGCTL on nested VM-Enter to prevent L1 from running L2 with features that KVM doesn't support, e.g. BTF x86 (AMD): - WARN and reject loading kvm-amd.ko instead of panicking the kernel if the nested SVM MSRPM offsets tracker can't handle an MSR (which is pretty much a static condition and therefore should never happen, but still) - Fix a variety of flaws and bugs in the AVIC device posted IRQ code - Inhibit AVIC if a vCPU's ID is too big (relative to what hardware supports) instead of rejecting vCPU creation - Extend enable_ipiv module param support to SVM, by simply leaving IsRunning clear in the vCPU's physical ID table entry - Disable IPI virtualization, via enable_ipiv, if the CPU is affected by erratum #1235, to allow (safely) enabling AVIC on such CPUs - Request GA Log interrupts if and only if the target vCPU is blocking, i.e. only if KVM needs a notification in order to wake the vCPU - Intercept SPEC_CTRL on AMD if the MSR shouldn't exist according to the vCPU's CPUID model - Accept any SNP policy that is accepted by the firmware with respect to SMT and single-socket restrictions. An incompatible policy doesn't put the kernel at risk in any way, so there's no reason for KVM to care - Drop a superfluous WBINVD (on all CPUs!) when destroying a VM and use WBNOINVD instead of WBINVD when possible for SEV cache maintenance - When reclaiming memory from an SEV guest, only do cache flushes on CPUs that have ever run a vCPU for the guest, i.e. don't flush the caches for CPUs that can't possibly have cache lines with dirty, encrypted data Generic: - Rework irqbypass to track/match producers and consumers via an xarray instead of a linked list. Using a linked list leads to O(n^2) insertion times, which is hugely problematic for use cases that create large numbers of VMs. Such use cases typically don't actually use irqbypass, but eliminating the pointless registration is a future problem to solve as it likely requires new uAPI - Track irqbypass's "token" as "struct eventfd_ctx " instead of a "void ", to avoid making a simple concept unnecessarily difficult to understand - Decouple device posted IRQs from VFIO device assignment, as binding a VM to a VFIO group is not a requirement for enabling device posted IRQs - Clean up and document/comment the irqfd assignment code - Disallow binding multiple irqfds to an eventfd with a priority waiter, i.e. ensure an eventfd is bound to at most one irqfd through the entire host, and add a selftest to verify eventfd:irqfd bindings are globally unique - Add a tracepoint for KVM_SET_MEMORY_ATTRIBUTES to help debug issues related to private <=> shared memory conversions - Drop guest_memfd's .getattr() implementation as the VFS layer will call generic_fillattr() if inode_operations.getattr is NULL - Fix issues with dirty ring harvesting where KVM doesn't bound the processing of entries in any way, which allows userspace to keep KVM in a tight loop indefinitely - Kill off kvm_arch_{start,end}_assignment() and x86's associated tracking, now that KVM no longer uses assigned_device_count as a heuristic for either irqbypass usage or MDS mitigation Selftests: - Fix a comment typo - Verify KVM is loaded when getting any KVM module param so that attempting to run a selftest without kvm.ko loaded results in a SKIP message about KVM not being loaded/enabled (versus some random parameter not existing) - Skip tests that hit EACCES when attempting to access a file, and print a "Root required?" help message. In most cases, the test just needs to be run with elevated permissions" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (340 commits) Documentation: KVM: Use unordered list for pre-init VGIC registers RISC-V: KVM: Avoid re-acquiring memslot in kvm_riscv_gstage_map() RISC-V: KVM: Use find_vma_intersection() to search for intersecting VMAs RISC-V: perf/kvm: Add reporting of interrupt events RISC-V: KVM: Enable ring-based dirty memory tracking RISC-V: KVM: Fix inclusion of Smnpm in the guest ISA bitmap RISC-V: KVM: Delegate illegal instruction fault to VS mode RISC-V: KVM: Pass VMID as parameter to kvm_riscv_hfence_xyz() APIs RISC-V: KVM: Factor-out g-stage page table management RISC-V: KVM: Add vmid field to struct kvm_riscv_hfence RISC-V: KVM: Introduce struct kvm_gstage_mapping RISC-V: KVM: Factor-out MMU related declarations into separate headers RISC-V: KVM: Use ncsr_xyz() in kvm_riscv_vcpu_trap_redirect() RISC-V: KVM: Implement kvm_arch_flush_remote_tlbs_range() RISC-V: KVM: Don't flush TLB when PTE is unchanged RISC-V: KVM: Replace KVM_REQ_HFENCE_GVMA_VMID_ALL with KVM_REQ_TLB_FLUSH RISC-V: KVM: Rename and move kvm_riscv_local_tlb_sanitize() RISC-V: KVM: Drop the return value of kvm_riscv_vcpu_aia_init() RISC-V: KVM: Check kvm_riscv_vcpu_alloc_vector_context() return value KVM: arm64: selftests: Add FEAT_RAS EL2 registers to get-reg-list ...	2025-07-30 17:14:01 -07:00
Linus Torvalds	6fb44438a5	Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux Pull arm64 updates from Catalin Marinas: "A quick summary: perf support for Branch Record Buffer Extensions (BRBE), typical PMU hardware updates, small additions to MTE for store-only tag checking and exposing non-address bits to signal handlers, HAVE_LIVEPATCH enabled on arm64, VMAP_STACK forced on. There is also a TLBI optimisation on hardware that does not require break-before-make when changing the user PTEs between contiguous and non-contiguous. More details: Perf and PMU updates: - Add support for new (v3) Hisilicon SLLC and DDRC PMUs - Add support for Arm-NI PMU integrations that share interrupts between clock domains within a given instance - Allow SPE to be configured with a lower sample period than the minimum recommendation advertised by PMSIDR_EL1.Interval - Add suppport for Arm's "Branch Record Buffer Extension" (BRBE) - Adjust the perf watchdog period according to cpu frequency changes - Minor driver fixes and cleanups Hardware features: - Support for MTE store-only checking (FEAT_MTE_STORE_ONLY) - Support for reporting the non-address bits during a synchronous MTE tag check fault (FEAT_MTE_TAGGED_FAR) - Optimise the TLBI when folding/unfolding contiguous PTEs on hardware with FEAT_BBM (break-before-make) level 2 and no TLB conflict aborts Software features: - Enable HAVE_LIVEPATCH after implementing arch_stack_walk_reliable() and using the text-poke API for late module relocations - Force VMAP_STACK always on and change arm64_efi_rt_init() to use arch_alloc_vmap_stack() in order to avoid KASAN false positives ACPI: - Improve SPCR handling and messaging on systems lacking an SPCR table Debug: - Simplify the debug exception entry path - Drop redundant DBG_MDSCR_* macros Kselftests: - Cleanups and improvements for SME, SVE and FPSIMD tests Miscellaneous: - Optimise loop to reduce redundant operations in contpte_ptep_get() - Remove ISB when resetting POR_EL0 during signal handling - Mark the kernel as tainted on SEA and SError panic - Remove redundant gcs_free() call" * tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (93 commits) arm64/gcs: task_gcs_el0_enable() should use passed task arm64: Kconfig: Keep selects somewhat alphabetically ordered arm64: signal: Remove ISB when resetting POR_EL0 kselftest/arm64: Handle attempts to disable SM on SME only systems kselftest/arm64: Fix SVE write data generation for SME only systems kselftest/arm64: Test SME on SME only systems in fp-ptrace kselftest/arm64: Test FPSIMD format data writes via NT_ARM_SVE in fp-ptrace kselftest/arm64: Allow sve-ptrace to run on SME only systems arm64/mm: Drop redundant addr increment in set_huge_pte_at() kselftest/arm4: Provide local defines for AT_HWCAP3 arm64: Mark kernel as tainted on SAE and SError panic arm64/gcs: Don't call gcs_free() when releasing task_struct drivers/perf: hisi: Support PMUs with no interrupt drivers/perf: hisi: Relax the event number check of v2 PMUs drivers/perf: hisi: Add support for HiSilicon SLLC v3 PMU driver drivers/perf: hisi: Use ACPI driver_data to retrieve SLLC PMU information drivers/perf: hisi: Add support for HiSilicon DDRC v3 PMU driver drivers/perf: hisi: Simplify the probe process for each DDRC version perf/arm-ni: Support sharing IRQs within an NI instance perf/arm-ni: Consolidate CPU affinity handling ...	2025-07-29 20:21:54 -07:00
Raghavendra Rao Ananta	7b8346bd9f	KVM: arm64: Don't attempt vLPI mappings when vPE allocation is disabled commit `c652887a92` ("KVM: arm64: vgic-v3: Allow userspace to write GICD_TYPER2.nASSGIcap") makes the allocation of vPEs depend on nASSGIcap for GICv4.1 hosts. While the vGIC v4 initialization and teardown is handled correctly, it erroneously attempts to establish a vLPI mapping to a VM that has no vPEs allocated: Unable to handle kernel NULL pointer dereference at virtual address 00000000000000a8 Mem abort info: ESR = 0x0000000096000044 EC = 0x25: DABT (current EL), IL = 32 bits SET = 0, FnV = 0 EA = 0, S1PTW = 0 FSC = 0x04: level 0 translation fault Data abort info: ISV = 0, ISS = 0x00000044, ISS2 = 0x00000000 CM = 0, WnR = 1, TnD = 0, TagAccess = 0 GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0 user pgtable: 4k pages, 48-bit VAs, pgdp=00000073a453b000 [00000000000000a8] pgd=0000000000000000, p4d=0000000000000000 Internal error: Oops: 0000000096000044 [#1] SMP pstate: 23400009 (nzCv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--) pc : its_irq_set_vcpu_affinity+0x58c/0x95c lr : its_irq_set_vcpu_affinity+0x1e0/0x95c sp : ffff8001029bb9e0 pmr_save: 00000060 x29: ffff8001029bba20 x28: ffff0001ca5e28c0 x27: 0000000000000000 x26: 0000000000000000 x25: ffff00019eee9f80 x24: ffff0001992b3f00 x23: ffff8001029bbab8 x22: ffff00001159fb80 x21: 00000000000024a7 x20: 00000000000024a7 x19: ffff00019eee9fb4 x18: 0000000000000494 x17: 000000000000000e x16: 0000000000000494 x15: 0000000000000002 x14: ffff0001a7f34600 x13: ffffccaad1203000 x12: 0000000000000018 x11: ffff000011991000 x10: 0000000000000000 x9 : 00000000000000a2 x8 : 00000000000020a8 x7 : 0000000000000000 x6 : 000000000000003f x5 : 0000000000000040 x4 : 0000000000000000 x3 : 0000000000000004 x2 : 0000000000000000 x1 : ffff8001029bbab8 x0 : 00000000000000a8 Call trace: its_irq_set_vcpu_affinity+0x58c/0x95c irq_set_vcpu_affinity+0x74/0xc8 its_map_vlpi+0x4c/0x94 kvm_vgic_v4_set_forwarding+0x134/0x298 kvm_arch_irq_bypass_add_producer+0x28/0x34 irq_bypass_register_producer+0xf8/0x1d8 vfio_msi_set_vector_signal+0x2c8/0x308 vfio_pci_set_msi_trigger+0x198/0x2d4 vfio_pci_set_irqs_ioctl+0xf0/0x104 vfio_pci_core_ioctl+0x6ac/0xc5c vfio_device_fops_unl_ioctl+0x128/0x370 __arm64_sys_ioctl+0x98/0xd0 el0_svc_common+0xd8/0x1d8 do_el0_svc+0x28/0x34 el0_svc+0x40/0xb8 el0t_64_sync_handler+0x70/0xbc el0t_64_sync+0x1a8/0x1ac Code: 321f0129 f940094a 8b080148 d1400900 (39000009) ---[ end trace 0000000000000000 ]--- Fix it by moving the GICv4.1 special-casing to vgic_supports_direct_msis(), returning false if the user explicitly disabled nASSGIcap for the VM. Fixes: `c652887a92` ("KVM: arm64: vgic-v3: Allow userspace to write GICD_TYPER2.nASSGIcap") Suggested-by: Oliver Upton <oliver.upton@linux.dev> Signed-off-by: Raghavendra Rao Ananta <rananta@google.com> Link: https://lore.kernel.org/r/20250729210644.830364-1-rananta@google.com Signed-off-by: Oliver Upton <oliver.upton@linux.dev>	2025-07-29 14:17:31 -07:00
Paolo Bonzini	314b40b3b6	Merge tag 'kvmarm-6.17' of https://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD KVM/arm64 changes for 6.17, round #1 - Host driver for GICv5, the next generation interrupt controller for arm64, including support for interrupt routing, MSIs, interrupt translation and wired interrupts. - Use FEAT_GCIE_LEGACY on GICv5 systems to virtualize GICv3 VMs on GICv5 hardware, leveraging the legacy VGIC interface. - Userspace control of the 'nASSGIcap' GICv3 feature, allowing userspace to disable support for SGIs w/o an active state on hardware that previously advertised it unconditionally. - Map supporting endpoints with cacheable memory attributes on systems with FEAT_S2FWB and DIC where KVM no longer needs to perform cache maintenance on the address range. - Nested support for FEAT_RAS and FEAT_DoubleFault2, allowing the guest hypervisor to inject external aborts into an L2 VM and take traps of masked external aborts to the hypervisor. - Convert more system register sanitization to the config-driven implementation. - Fixes to the visibility of EL2 registers, namely making VGICv3 system registers accessible through the VGIC device instead of the ONE_REG vCPU ioctls. - Various cleanups and minor fixes.	2025-07-29 12:27:40 -04:00
Paolo Bonzini	f02b1bcc73	Merge tag 'kvm-x86-irqs-6.17' of https://github.com/kvm-x86/linux into HEAD KVM IRQ changes for 6.17 - Rework irqbypass to track/match producers and consumers via an xarray instead of a linked list. Using a linked list leads to O(n^2) insertion times, which is hugely problematic for use cases that create large numbers of VMs. Such use cases typically don't actually use irqbypass, but eliminating the pointless registration is a future problem to solve as it likely requires new uAPI. - Track irqbypass's "token" as "struct eventfd_ctx " instead of a "void ", to avoid making a simple concept unnecessarily difficult to understand. - Add CONFIG_KVM_IOAPIC for x86 to allow disabling support for I/O APIC, PIC, and PIT emulation at compile time. - Drop x86's irq_comm.c, and move a pile of IRQ related code into irq.c. - Fix a variety of flaws and bugs in the AVIC device posted IRQ code. - Inhibited AVIC if a vCPU's ID is too big (relative to what hardware supports) instead of rejecting vCPU creation. - Extend enable_ipiv module param support to SVM, by simply leaving IsRunning clear in the vCPU's physical ID table entry. - Disable IPI virtualization, via enable_ipiv, if the CPU is affected by erratum #1235, to allow (safely) enabling AVIC on such CPUs. - Dedup x86's device posted IRQ code, as the vast majority of functionality can be shared verbatime between SVM and VMX. - Harden the device posted IRQ code against bugs and runtime errors. - Use vcpu_idx, not vcpu_id, for GA log tag/metadata, to make lookups O(1) instead of O(n). - Generate GA Log interrupts if and only if the target vCPU is blocking, i.e. only if KVM needs a notification in order to wake the vCPU. - Decouple device posted IRQs from VFIO device assignment, as binding a VM to a VFIO group is not a requirement for enabling device posted IRQs. - Clean up and document/comment the irqfd assignment code. - Disallow binding multiple irqfds to an eventfd with a priority waiter, i.e. ensure an eventfd is bound to at most one irqfd through the entire host, and add a selftest to verify eventfd:irqfd bindings are globally unique.	2025-07-29 08:35:46 -04:00
Linus Torvalds	8e736a2eea	Merge tag 'hardening-v6.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull hardening updates from Kees Cook: - Introduce and start using TRAILING_OVERLAP() helper for fixing embedded flex array instances (Gustavo A. R. Silva) - mux: Convert mux_control_ops to a flex array member in mux_chip (Thorsten Blum) - string: Group str_has_prefix() and strstarts() (Andy Shevchenko) - Remove KCOV instrumentation from __init and __head (Ritesh Harjani, Kees Cook) - Refactor and rename stackleak feature to support Clang - Add KUnit test for seq_buf API - Fix KUnit fortify test under LTO * tag 'hardening-v6.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (22 commits) sched/task_stack: Add missing const qualifier to end_of_stack() kstack_erase: Support Clang stack depth tracking kstack_erase: Add -mgeneral-regs-only to silence Clang warnings init.h: Disable sanitizer coverage for __init and __head kstack_erase: Disable kstack_erase for all of arm compressed boot code x86: Handle KCOV __init vs inline mismatches arm64: Handle KCOV __init vs inline mismatches s390: Handle KCOV __init vs inline mismatches arm: Handle KCOV __init vs inline mismatches mips: Handle KCOV __init vs inline mismatch powerpc/mm/book3s64: Move kfence and debug_pagealloc related calls to __init section configs/hardening: Enable CONFIG_INIT_ON_FREE_DEFAULT_ON configs/hardening: Enable CONFIG_KSTACK_ERASE stackleak: Split KSTACK_ERASE_CFLAGS from GCC_PLUGINS_CFLAGS stackleak: Rename stackleak_track_stack to __sanitizer_cov_stack_depth stackleak: Rename STACKLEAK to KSTACK_ERASE seq_buf: Introduce KUnit tests string: Group str_has_prefix() and strstarts() kunit/fortify: Add back "volatile" for sizeof() constants acpi: nfit: intel: avoid multiple -Wflex-array-member-not-at-end warnings ...	2025-07-28 17:16:12 -07:00
Oliver Upton	0d46e324c0	Merge branch 'kvm-arm64/vgic-v4-ctl' into kvmarm/next * kvm-arm64/vgic-v4-ctl: : Userspace control of nASSGIcap, courtesy of Raghavendra Rao Ananta : : Allow userspace to decide if support for SGIs without an active state is : advertised to the guest, allowing VMs from GICv3-only hardware to be : migrated to to GICv4.1 capable machines. Documentation: KVM: arm64: Describe VGICv3 registers writable pre-init KVM: arm64: selftests: Add test for nASSGIcap attribute KVM: arm64: vgic-v3: Allow userspace to write GICD_TYPER2.nASSGIcap KVM: arm64: vgic-v3: Allow access to GICD_IIDR prior to initialization KVM: arm64: vgic-v3: Consolidate MAINT_IRQ handling KVM: arm64: Disambiguate support for vSGIs v. vLPIs Signed-off-by: Oliver Upton <oliver.upton@linux.dev>	2025-07-28 08:11:38 -07:00
Oliver Upton	a7f49a9bf4	Merge branch 'kvm-arm64/el2-reg-visibility' into kvmarm/next * kvm-arm64/el2-reg-visibility: : Fixes to EL2 register visibility, courtesy of Marc Zyngier : : - Expose EL2 VGICv3 registers via the VGIC attributes accessor, not the : KVM_{GET,SET}_ONE_REG ioctls : : - Condition visibility of FGT registers on the presence of FEAT_FGT in : the VM KVM: arm64: selftest: vgic-v3: Add basic GICv3 sysreg userspace access test KVM: arm64: Enforce the sorting of the GICv3 system register table KVM: arm64: Clarify the check for reset callback in check_sysreg_table() KVM: arm64: vgic-v3: Fix ordering of ICH_HCR_EL2 KVM: arm64: Document registers exposed via KVM_DEV_ARM_VGIC_GRP_CPU_SYSREGS KVM: arm64: selftests: get-reg-list: Add base EL2 registers KVM: arm64: selftests: get-reg-list: Simplify feature dependency KVM: arm64: Advertise FGT2 registers to userspace KVM: arm64: Condition FGT registers on feature availability KVM: arm64: Expose GICv3 EL2 registers via KVM_DEV_ARM_VGIC_GRP_CPU_SYSREGS KVM: arm64: Let GICv3 save/restore honor visibility attribute KVM: arm64: Define helper for ICH_VTR_EL2 KVM: arm64: Define constant value for ICC_SRE_EL2 KVM: arm64: Don't advertise ICH_*_EL2 registers through GET_ONE_REG KVM: arm64: Make RVBAR_EL2 accesses UNDEF Signed-off-by: Oliver Upton <oliver.upton@linux.dev>	2025-07-28 08:06:38 -07:00
Oliver Upton	d9b9fa2c32	Merge branch 'kvm-arm64/config-masks' into kvmarm/next * kvm-arm64/config-masks: : More config-driven mask computation, courtesy of Marc Zyngier : : Converts more system registers to the config-driven computation of RESx : masks based on the advertised feature set KVM: arm64: Tighten the definition of FEAT_PMUv3p9 KVM: arm64: Convert MDCR_EL2 to config-driven sanitisation KVM: arm64: Convert SCTLR_EL1 to config-driven sanitisation KVM: arm64: Convert TCR2_EL2 to config-driven sanitisation arm64: sysreg: Add THE/ASID2 controls to TCR2_ELx Signed-off-by: Oliver Upton <oliver.upton@linux.dev>	2025-07-28 08:03:08 -07:00
Oliver Upton	0a2c9d808a	Merge branch 'kvm-arm64/misc' into kvmarm/next * kvm-arm64/misc: : Miscellaneous fixes/cleanups for KVM/arm64 : : - Fixes for computing POE output permissions : : - Return ENXIO for invalid VGIC device attribute : : - String helper conversions arm64: kvm: trace_handle_exit: use string choices helper arm64: kvm: sys_regs: use string choices helper KVM: arm64: Follow specification when implementing WXN KVM: arm64: Remove the wi->{e0,}poe vs wr->{p,u}ov confusion KVM: arm64: vgic-its: Return -ENXIO to invalid KVM_DEV_ARM_VGIC_GRP_CTRL attrs Signed-off-by: Oliver Upton <oliver.upton@linux.dev>	2025-07-26 08:54:47 -07:00
Oliver Upton	1f315e99bd	Merge branch 'kvm-arm64/gcie-legacy' into kvmarm/next * kvm-arm64/gcie-legacy: : Support for GICv3 emulation on GICv5, courtesy of Sascha Bischoff : : FEAT_GCIE_LEGACY adds the necessary hardware for GICv5 systems to : support the legacy GICv3 for VMs, including a backwards-compatible VGIC : implementation that we all know and love. : : As a starting point for GICv5 enablement in KVM, enable + use the : GICv3-compatible feature when running VMs on GICv5 hardware. KVM: arm64: gic-v5: Probe for GICv5 KVM: arm64: gic-v5: Support GICv3 compat arm64/sysreg: Add ICH_VCTLR_EL2 irqchip/gic-v5: Populate struct gic_kvm_info irqchip/gic-v5: Skip deactivate for forwarded PPI interrupts Signed-off-by: Oliver Upton <oliver.upton@linux.dev>	2025-07-26 08:50:06 -07:00
Oliver Upton	3318e42b81	Merge branch 'kvm-arm64/doublefault2' into kvmarm/next * kvm-arm64/doublefault2: (33 commits) : NV Support for FEAT_RAS + DoubleFault2 : : Delegate the vSError context to the guest hypervisor when in a nested : state, including registers related to ESR propagation. Additionally, : catch up KVM's external abort infrastructure to the architecture, : implementing the effects of FEAT_DoubleFault2. : : This has some impact on non-nested guests, as SErrors deemed unmasked at : the time they're made pending are now immediately injected with an : emulated exception entry rather than using the VSE bit. KVM: arm64: Make RAS registers UNDEF when RAS isn't advertised KVM: arm64: Filter out HCR_EL2 bits when running in hypervisor context KVM: arm64: Check for SYSREGS_ON_CPU before accessing the CPU state KVM: arm64: Commit exceptions from KVM_SET_VCPU_EVENTS immediately KVM: arm64: selftests: Test ESR propagation for vSError injection KVM: arm64: Populate ESR_ELx.EC for emulated SError injection KVM: arm64: selftests: Catch up set_id_regs with the kernel KVM: arm64: selftests: Add SCTLR2_EL1 to get-reg-list KVM: arm64: selftests: Test SEAs are taken to SError vector when EASE=1 KVM: arm64: selftests: Add basic SError injection test KVM: arm64: Don't retire MMIO instruction w/ pending (emulated) SError KVM: arm64: Advertise support for FEAT_DoubleFault2 KVM: arm64: Advertise support for FEAT_SCTLR2 KVM: arm64: nv: Enable vSErrors when HCRX_EL2.TMEA is set KVM: arm64: nv: Honor SError routing effects of SCTLR2_ELx.NMEA KVM: arm64: nv: Take "masked" aborts to EL2 when HCRX_EL2.TMEA is set KVM: arm64: Route SEAs to the SError vector when EASE is set KVM: arm64: nv: Ensure Address size faults affect correct ESR KVM: arm64: Factor out helper for selecting exception target EL KVM: arm64: Describe SCTLR2_ELx RESx masks ... Signed-off-by: Oliver Upton <oliver.upton@linux.dev>	2025-07-26 08:47:22 -07:00
Raghavendra Rao Ananta	c652887a92	KVM: arm64: vgic-v3: Allow userspace to write GICD_TYPER2.nASSGIcap KVM unconditionally advertises GICD_TYPER2.nASSGIcap (which internally implies vSGIs) on GICv4.1 systems. Allow userspace to change whether a VM supports the feature. Only allow changes prior to VGIC initialization as at that point vPEs need to be allocated for the VM. For convenience, bundle support for vLPIs and vSGIs behind this feature, allowing userspace to control vPE allocation for VMs in environments that may be constrained on vPE IDs. Signed-off-by: Raghavendra Rao Ananta <rananta@google.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Reviewed-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20250724062805.2658919-5-oliver.upton@linux.dev Signed-off-by: Oliver Upton <oliver.upton@linux.dev>	2025-07-26 08:45:52 -07:00
Oliver Upton	f26e6af757	KVM: arm64: vgic-v3: Allow access to GICD_IIDR prior to initialization KVM allows userspace to write GICD_IIDR for backwards-compatibility with older kernels, where new implementation revisions have new features. Unfortunately this is allowed to happen at runtime, and ripping features out from underneath a running guest is a terrible idea. While we can't do anything about the ABI, prepare for more ID-like registers by allowing access to GICD_IIDR prior to VGIC initialization. Hoist initializaiton of the default value to kvm_vgic_create() and discard the incorrect comment that assumed userspace could access the register before initialization (until now). Subsequent changes will allow the VMM to further provision the GIC feature set, e.g. the presence of nASSGIcap. Reviewed-by: Eric Auger <eric.auger@redhat.com> Reviewed-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20250724062805.2658919-4-oliver.upton@linux.dev Signed-off-by: Oliver Upton <oliver.upton@linux.dev>	2025-07-26 08:37:45 -07:00
Oliver Upton	ef364c5b43	KVM: arm64: vgic-v3: Consolidate MAINT_IRQ handling Consolidate the duplicated handling of the VGICv3 maintenance IRQ attribute as a regular GICv3 attribute, as it is neither a register nor a common attribute. As this is now handled separately from the VGIC registers, the locking is relaxed to only acquire the intended config_lock. Reviewed-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20250724062805.2658919-3-oliver.upton@linux.dev Signed-off-by: Oliver Upton <oliver.upton@linux.dev>	2025-07-26 08:37:45 -07:00

1 2 3 4 5 ...

3180 Commits