Commit Graph

1170494 Commits

Author SHA1 Message Date
Paolo Bonzini
807b758496 Merge tag 'kvm-x86-mmu-6.4' of https://github.com/kvm-x86/linux into HEAD
KVM x86 MMU changes for 6.4:

 - Tweak FNAME(sync_spte) to avoid unnecessary writes+flushes when the
   guest is only adding new PTEs

 - Overhaul .sync_page() and .invlpg() to share the .sync_page()
   implementation, i.e. utilize .sync_page()'s optimizations when emulating
   invalidations

 - Clean up the range-based flushing APIs

 - Revamp the TDP MMU's reaping of Accessed/Dirty bits to clear a single
   A/D bit using a LOCK AND instead of XCHG, and skip all of the "handle
   changed SPTE" overhead associated with writing the entire entry

 - Track the number of "tail" entries in a pte_list_desc to avoid having
   to walk (potentially) all descriptors during insertion and deletion,
   which gets quite expensive if the guest is spamming fork()

 - Misc cleanups
2023-04-26 15:50:01 -04:00
Paolo Bonzini
a1c288f87d Merge tag 'kvm-x86-misc-6.4' of https://github.com/kvm-x86/linux into HEAD
KVM x86 changes for 6.4:

 - Optimize CR0.WP toggling by avoiding an MMU reload when TDP is enabled,
   and by giving the guest control of CR0.WP when EPT is enabled on VMX
   (VMX-only because SVM doesn't support per-bit controls)

 - Add CR0/CR4 helpers to query single bits, and clean up related code
   where KVM was interpreting kvm_read_cr4_bits()'s "unsigned long" return
   as a bool

 - Move AMD_PSFD to cpufeatures.h and purge KVM's definition

 - Misc cleanups
2023-04-26 15:49:23 -04:00
Paolo Bonzini
e1a6d5cf10 Merge tag 'kvm-x86-generic-6.4' of https://github.com/kvm-x86/linux into HEAD
Common KVM changes for 6.4:

 - Drop unnecessary casts from "void *" throughout kvm_main.c

 - Tweak the layout of "struct kvm_mmu_memory_cache" to shrink the struct
   size by 8 bytes on 64-bit kernels by utilizing a padding hole

 - Fix a documentation format goof that was introduced when the KVM docs
   were converted to ReST

 - Constify MIPS's internal callbacks (a leftover from the hardware enabling
   rework that landed in 6.3)
2023-04-26 15:48:44 -04:00
Paolo Bonzini
4f382a79a6 Merge tag 'kvmarm-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 updates for 6.4

- Numerous fixes for the pathological lock inversion issue that
  plagued KVM/arm64 since... forever.

- New framework allowing SMCCC-compliant hypercalls to be forwarded
  to userspace, hopefully paving the way for some more features
  being moved to VMMs rather than be implemented in the kernel.

- Large rework of the timer code to allow a VM-wide offset to be
  applied to both virtual and physical counters as well as a
  per-timer, per-vcpu offset that complements the global one.
  This last part allows the NV timer code to be implemented on
  top.

- A small set of fixes to make sure that we don't change anything
  affecting the EL1&0 translation regime just after having having
  taken an exception to EL2 until we have executed a DSB. This
  ensures that speculative walks started in EL1&0 have completed.

- The usual selftest fixes and improvements.
2023-04-26 15:46:52 -04:00
Paolo Bonzini
b3c129e33e Merge tag 'kvm-s390-next-6.4-1' of https://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD
Minor cleanup:
 - phys_to_virt conversion
 - Improvement of VSIE AP management
2023-04-26 15:43:15 -04:00
Marc Zyngier
36fe1b29b3 Merge branch kvm-arm64/spec-ptw into kvmarm-master/next
* kvm-arm64/spec-ptw:
  : .
  : On taking an exception from EL1&0 to EL2(&0), the page table walker is
  : allowed to carry on with speculative walks started from EL1&0 while
  : running at EL2 (see R_LFHQG). Given that the PTW may be actively using
  : the EL1&0 system registers, the only safe way to deal with it is to
  : issue a DSB before changing any of it.
  :
  : We already did the right thing for SPE and TRBE, but ignored the PTW
  : for unknown reasons (probably because the architecture wasn't crystal
  : clear at the time).
  :
  : This requires a bit of surgery in the nvhe code, though most of these
  : patches are comments so that my future self can understand the purpose
  : of these barriers. The VHE code is largely unaffected, thanks to the
  : DSB in the context switch.
  : .
  KVM: arm64: vhe: Drop extra isb() on guest exit
  KVM: arm64: vhe: Synchronise with page table walker on MMU update
  KVM: arm64: pkvm: Document the side effects of kvm_flush_dcache_to_poc()
  KVM: arm64: nvhe: Synchronise with page table walker on TLBI
  KVM: arm64: nvhe: Synchronise with page table walker on vcpu run

Signed-off-by: Marc Zyngier <maz@kernel.org>
2023-04-21 09:44:58 +01:00
Marc Zyngier
6dcf7316e0 Merge branch kvm-arm64/smccc-filtering into kvmarm-master/next
* kvm-arm64/smccc-filtering:
  : .
  : SMCCC call filtering and forwarding to userspace, courtesy of
  : Oliver Upton. From the cover letter:
  :
  : "The Arm SMCCC is rather prescriptive in regards to the allocation of
  : SMCCC function ID ranges. Many of the hypercall ranges have an
  : associated specification from Arm (FF-A, PSCI, SDEI, etc.) with some
  : room for vendor-specific implementations.
  :
  : The ever-expanding SMCCC surface leaves a lot of work within KVM for
  : providing new features. Furthermore, KVM implements its own
  : vendor-specific ABI, with little room for other implementations (like
  : Hyper-V, for example). Rather than cramming it all into the kernel we
  : should provide a way for userspace to handle hypercalls."
  : .
  KVM: selftests: Fix spelling mistake "KVM_HYPERCAL_EXIT_SMC" -> "KVM_HYPERCALL_EXIT_SMC"
  KVM: arm64: Test that SMC64 arch calls are reserved
  KVM: arm64: Prevent userspace from handling SMC64 arch range
  KVM: arm64: Expose SMC/HVC width to userspace
  KVM: selftests: Add test for SMCCC filter
  KVM: selftests: Add a helper for SMCCC calls with SMC instruction
  KVM: arm64: Let errors from SMCCC emulation to reach userspace
  KVM: arm64: Return NOT_SUPPORTED to guest for unknown PSCI version
  KVM: arm64: Introduce support for userspace SMCCC filtering
  KVM: arm64: Add support for KVM_EXIT_HYPERCALL
  KVM: arm64: Use a maple tree to represent the SMCCC filter
  KVM: arm64: Refactor hvc filtering to support different actions
  KVM: arm64: Start handling SMCs from EL1
  KVM: arm64: Rename SMC/HVC call handler to reflect reality
  KVM: arm64: Add vm fd device attribute accessors
  KVM: arm64: Add a helper to check if a VM has ran once
  KVM: x86: Redefine 'longmode' as a flag for KVM_EXIT_HYPERCALL

Signed-off-by: Marc Zyngier <maz@kernel.org>
2023-04-21 09:44:32 +01:00
Marc Zyngier
367eb095b8 Merge branch kvm-arm64/selftest/misc-6.4 into kvmarm-master/next
* kvm-arm64/selftest/misc-6.4:
  : .
  : Misc selftest updates for 6.4
  :
  : - Add comments for recently added ID registers
  : .
  KVM: selftests: Comment newly defined aarch64 ID registers

Signed-off-by: Marc Zyngier <maz@kernel.org>
2023-04-21 09:39:07 +01:00
Marc Zyngier
e2e321a7d6 Merge branch kvm-arm64/selftest/lpa into kvmarm-master/next
* kvm-arm64/selftest/lpa:
  : .
  : Selftest fixes addressing PTE and TTBR0_EL1 encodings for
  : 52bit PAs
  : .
  KVM: selftests: arm64: Fix ttbr0_el1 encoding for PA bits > 48
  KVM: selftests: arm64: Fix pte encode/decode for PA bits > 48
  KVM: selftests: Fixup config fragment for access_tracking_perf_test

Signed-off-by: Marc Zyngier <maz@kernel.org>
2023-04-21 09:37:36 +01:00
Marc Zyngier
b22498c484 Merge branch kvm-arm64/timer-vm-offsets into kvmarm-master/next
* kvm-arm64/timer-vm-offsets: (21 commits)
  : .
  : This series aims at satisfying multiple goals:
  :
  : - allow a VMM to atomically restore a timer offset for a whole VM
  :   instead of updating the offset each time a vcpu get its counter
  :   written
  :
  : - allow a VMM to save/restore the physical timer context, something
  :   that we cannot do at the moment due to the lack of offsetting
  :
  : - provide a framework that is suitable for NV support, where we get
  :   both global and per timer, per vcpu offsetting, and manage
  :   interrupts in a less braindead way.
  :
  : Conflict resolution involves using the new per-vcpu config lock instead
  : of the home-grown timer lock.
  : .
  KVM: arm64: Handle 32bit CNTPCTSS traps
  KVM: arm64: selftests: Augment existing timer test to handle variable offset
  KVM: arm64: selftests: Deal with spurious timer interrupts
  KVM: arm64: selftests: Add physical timer registers to the sysreg list
  KVM: arm64: nv: timers: Support hyp timer emulation
  KVM: arm64: nv: timers: Add a per-timer, per-vcpu offset
  KVM: arm64: Document KVM_ARM_SET_CNT_OFFSETS and co
  KVM: arm64: timers: Abstract the number of valid timers per vcpu
  KVM: arm64: timers: Fast-track CNTPCT_EL0 trap handling
  KVM: arm64: Elide kern_hyp_va() in VHE-specific parts of the hypervisor
  KVM: arm64: timers: Move the timer IRQs into arch_timer_vm_data
  KVM: arm64: timers: Abstract per-timer IRQ access
  KVM: arm64: timers: Rationalise per-vcpu timer init
  KVM: arm64: timers: Allow save/restoring of the physical timer
  KVM: arm64: timers: Allow userspace to set the global counter offset
  KVM: arm64: Expose {un,}lock_all_vcpus() to the rest of KVM
  KVM: arm64: timers: Allow physical offset without CNTPOFF_EL2
  KVM: arm64: timers: Use CNTPOFF_EL2 to offset the physical timer
  arm64: Add HAS_ECV_CNTPOFF capability
  arm64: Add CNTPOFF_EL2 register definition
  ...

Signed-off-by: Marc Zyngier <maz@kernel.org>
2023-04-21 09:36:40 +01:00
Marc Zyngier
ef5f97e9de Merge branch kvm-arm64/lock-inversion into kvmarm-master/next
* kvm-arm64/lock-inversion:
  : .
  : vm/vcpu lock inversion fixes, courtesy of Oliver Upton, plus a few
  : extra fixes from both Oliver and Reiji Watanabe.
  :
  : From the initial cover letter:
  :
  : As it so happens, lock ordering in KVM/arm64 is completely backwards.
  : There's a significant amount of VM-wide state that needs to be accessed
  : from the context of a vCPU. Until now, this was accomplished by
  : acquiring the kvm->lock, but that cannot be nested within vcpu->mutex.
  :
  : This series fixes the issue with some fine-grained locking for MP state
  : and a new, dedicated mutex that can nest with both kvm->lock and
  : vcpu->mutex.
  : .
  KVM: arm64: Have kvm_psci_vcpu_on() use WRITE_ONCE() to update mp_state
  KVM: arm64: Acquire mp_state_lock in kvm_arch_vcpu_ioctl_vcpu_init()
  KVM: arm64: vgic: Don't acquire its_lock before config_lock
  KVM: arm64: Use config_lock to protect vgic state
  KVM: arm64: Use config_lock to protect data ordered against KVM_RUN
  KVM: arm64: Avoid lock inversion when setting the VM register width
  KVM: arm64: Avoid vcpu->mutex v. kvm->lock inversion in CPU_ON

Signed-off-by: Marc Zyngier <maz@kernel.org>
2023-04-21 09:30:46 +01:00
Nico Boehr
8a46df7cd1 KVM: s390: pci: fix virtual-physical confusion on module unload/load
When the kvm module is unloaded, zpci_setup_aipb() perists some data in the
zpci_aipb structure in s390 pci code. Note that this struct is also passed
to firmware in the zpci_set_irq_ctrl() call and thus the GAIT must be a
physical address.

On module re-insertion, the GAIT is restored from this structure in
zpci_reset_aipb(). But it is a physical address, hence this may cause
issues when the kvm module is unloaded and loaded again.

Fix virtual vs physical address confusion (which currently are the same) by
adding the necessary physical-to-virtual-conversion in zpci_reset_aipb().

Signed-off-by: Nico Boehr <nrb@linux.ibm.com>
Reviewed-by: Matthew Rosato <mjrosato@linux.ibm.com>
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
Link: https://lore.kernel.org/r/20230222155503.43399-1-nrb@linux.ibm.com
Message-Id: <20230222155503.43399-1-nrb@linux.ibm.com>
2023-04-20 16:30:35 +02:00
Pierre Morel
7be3e33923 KVM: s390: vsie: clarifications on setting the APCB
The APCB is part of the CRYCB.
The calculation of the APCB origin can be done by adding
the APCB offset to the CRYCB origin.

Current code makes confusing transformations, converting
the CRYCB origin to a pointer to calculate the APCB origin.

Let's make things simpler and keep the CRYCB origin to make
these calculations.

Signed-off-by: Pierre Morel <pmorel@linux.ibm.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Janosch Frank <frankja@linux.ibm.com>
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
Link: https://lore.kernel.org/r/20230214122841.13066-2-pmorel@linux.ibm.com
Message-Id: <20230214122841.13066-2-pmorel@linux.ibm.com>
2023-04-20 16:30:34 +02:00
Nico Boehr
2f2c0911b9 KVM: s390: interrupt: fix virtual-physical confusion for next alert GISA
We sometimes put a virtual address in next_alert, which should always be
a physical address, since it is shared with hardware.

This currently works, because virtual and physical addresses are
the same.

Add phys_to_virt() to resolve the virtual-physical confusion.

Signed-off-by: Nico Boehr <nrb@linux.ibm.com>
Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
Reviewed-by: Michael Mueller <mimu@linux.ibm.com>
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
Link: https://lore.kernel.org/r/20230223162236.51569-1-nrb@linux.ibm.com
Message-Id: <20230223162236.51569-1-nrb@linux.ibm.com>
2023-04-20 16:26:20 +02:00
Reiji Watanabe
a189884bdc KVM: arm64: Have kvm_psci_vcpu_on() use WRITE_ONCE() to update mp_state
All accessors of kvm_vcpu_arch::mp_state should be {READ,WRITE}_ONCE(),
since readers of the mp_state don't acquire the mp_state_lock.
Nonetheless, kvm_psci_vcpu_on() updates the mp_state without using
WRITE_ONCE(). So, fix the code to update the mp_state using WRITE_ONCE.

Signed-off-by: Reiji Watanabe <reijiw@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230419021852.2981107-3-reijiw@google.com
2023-04-20 09:06:20 +01:00
Reiji Watanabe
4ff910be01 KVM: arm64: Acquire mp_state_lock in kvm_arch_vcpu_ioctl_vcpu_init()
kvm_arch_vcpu_ioctl_vcpu_init() doesn't acquire mp_state_lock
when setting the mp_state to KVM_MP_STATE_RUNNABLE. Fix the
code to acquire the lock.

Signed-off-by: Reiji Watanabe <reijiw@google.com>
[maz: minor refactor]
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230419021852.2981107-2-reijiw@google.com
2023-04-20 09:06:02 +01:00
Marc Zyngier
bcf3e7da3a KVM: arm64: vhe: Drop extra isb() on guest exit
__kvm_vcpu_run_vhe() end on VHE with an isb(). However, this
function is only reachable via kvm_call_hyp_ret(), which already
contains an isb() in order to mimick the behaviour of nVHE and
provide a context synchronisation event.

We thus have two isb()s back to back, which is one too many.
Drop the first one and solely rely on the one in the helper.

Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
2023-04-14 08:23:29 +01:00
Marc Zyngier
1ff2755d68 KVM: arm64: vhe: Synchronise with page table walker on MMU update
Contrary to nVHE, VHE is a lot easier when it comes to dealing
with speculative page table walks started at EL1. As we only change
EL1&0 translation regime when context-switching, we already benefit
from the effect of the DSB that sits in the context switch code.

We only need to take care of it in the NV case, where we can
flip between between two EL1 contexts (one of them being the virtual
EL2) without a context switch.

Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
2023-04-14 08:23:29 +01:00
Marc Zyngier
8442d65373 KVM: arm64: pkvm: Document the side effects of kvm_flush_dcache_to_poc()
We rely on the presence of a DSB at the end of kvm_flush_dcache_to_poc()
that, on top of ensuring completion of the cache clean, also covers
the speculative page table walk started from EL1.

Document this dependency.

Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
2023-04-14 08:23:29 +01:00
Marc Zyngier
7e1b2329c2 KVM: arm64: nvhe: Synchronise with page table walker on TLBI
A TLBI from EL2 impacting EL1 involves messing with the EL1&0
translation regime, and the page table walker may still be
performing speculative walks.

Piggyback on the existing DSBs to always have a DSB ISH that
will synchronise all load/store operations that the PTW may
still have.

Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
2023-04-14 08:23:29 +01:00
Marc Zyngier
a6610435ac KVM: arm64: Handle 32bit CNTPCTSS traps
When CNTPOFF isn't implemented and that we have a non-zero counter
offset, CNTPCT and CNTPCTSS are trapped. We properly handle the
former, but not the latter, as it is not present in the sysreg
table (despite being actually handled in the code). Bummer.

Just populate the cp15_64 table with the missing register.

Reported-by: Reiji Watanabe <reijiw@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
2023-04-13 14:23:42 +01:00
Marc Zyngier
55b5bac159 KVM: arm64: nvhe: Synchronise with page table walker on vcpu run
When taking an exception between the EL1&0 translation regime and
the EL2 translation regime, the page table walker is allowed to
complete the walks started from EL0 or EL1 while running at EL2.

It means that altering the system registers that define the EL1&0
translation regime is fraught with danger *unless* we wait for
the completion of such walk with a DSB (R_LFHQG and subsequent
statements in the ARM ARM). We already did the right thing for
other external agents (SPE, TRBE), but not the PTW.

Rework the existing SPE/TRBE synchronisation to include the PTW,
and add the missing DSB on guest exit.

Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
2023-04-13 08:38:53 +01:00
Oliver Upton
49e5d16b6f KVM: arm64: vgic: Don't acquire its_lock before config_lock
commit f003277311 ("KVM: arm64: Use config_lock to protect vgic
state") was meant to rectify a longstanding lock ordering issue in KVM
where the kvm->lock is taken while holding vcpu->mutex. As it so
happens, the aforementioned commit introduced yet another locking issue
by acquiring the its_lock before acquiring the config lock.

This is obviously wrong, especially considering that the lock ordering
is well documented in vgic.c. Reshuffle the locks once more to take the
config_lock before the its_lock. While at it, sprinkle in the lockdep
hinting that has become popular as of late to keep lockdep apprised of
our ordering.

Cc: stable@vger.kernel.org
Fixes: f003277311 ("KVM: arm64: Use config_lock to protect vgic state")
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230412062733.988229-1-oliver.upton@linux.dev
2023-04-12 13:50:18 +01:00
Sean Christopherson
cf9f4c0eb1 KVM: x86/mmu: Refresh CR0.WP prior to checking for emulated permission faults
Refresh the MMU's snapshot of the vCPU's CR0.WP prior to checking for
permission faults when emulating a guest memory access and CR0.WP may be
guest owned.  If the guest toggles only CR0.WP and triggers emulation of
a supervisor write, e.g. when KVM is emulating UMIP, KVM may consume a
stale CR0.WP, i.e. use stale protection bits metadata.

Note, KVM passes through CR0.WP if and only if EPT is enabled as CR0.WP
is part of the MMU role for legacy shadow paging, and SVM (NPT) doesn't
support per-bit interception controls for CR0.  Don't bother checking for
EPT vs. NPT as the "old == new" check will always be true under NPT, i.e.
the only cost is the read of vcpu->arch.cr4 (SVM unconditionally grabs CR0
from the VMCB on VM-Exit).

Reported-by: Mathias Krause <minipli@grsecurity.net>
Link: https://lkml.kernel.org/r/677169b4-051f-fcae-756b-9a3e1bb9f8fe%40grsecurity.net
Fixes: fb509f76ac ("KVM: VMX: Make CR0.WP a guest owned bit")
Tested-by: Mathias Krause <minipli@grsecurity.net>
Link: https://lore.kernel.org/r/20230405002608.418442-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-04-10 15:25:36 -07:00
Sean Christopherson
9ed3bf4112 KVM: x86/mmu: Move filling of Hyper-V's TLB range struct into Hyper-V code
Refactor Hyper-V's range-based TLB flushing API to take a gfn+nr_pages
pair instead of a struct, and bury said struct in Hyper-V specific code.

Passing along two params generates much better code for the common case
where KVM is _not_ running on Hyper-V, as forwarding the flush on to
Hyper-V's hv_flush_remote_tlbs_range() from kvm_flush_remote_tlbs_range()
becomes a tail call.

Cc: David Matlack <dmatlack@google.com>
Reviewed-by: David Matlack <dmatlack@google.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20230405003133.419177-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-04-10 15:17:29 -07:00
Sean Christopherson
8a1300ff95 KVM: x86: Rename Hyper-V remote TLB hooks to match established scheme
Rename the Hyper-V hooks for TLB flushing to match the naming scheme used
by all the other TLB flushing hooks, e.g. in kvm_x86_ops, vendor code,
arch hooks from common code, etc.

Reviewed-by: David Matlack <dmatlack@google.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20230405003133.419177-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-04-10 15:17:29 -07:00
Colin Ian King
c5284f6d8c KVM: selftests: Fix spelling mistake "KVM_HYPERCAL_EXIT_SMC" -> "KVM_HYPERCALL_EXIT_SMC"
There is a spelling mistake in a test assert message. Fix it.

Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230406080226.122955-1-colin.i.king@gmail.com
2023-04-08 15:38:36 +01:00
Oliver Upton
00e0c94711 KVM: arm64: Test that SMC64 arch calls are reserved
Assert that the SMC64 view of the Arm architecture range is reserved by
KVM and cannot be filtered by userspace.

Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230408121732.3411329-3-oliver.upton@linux.dev
2023-04-08 15:22:55 +01:00
Oliver Upton
5a23ad6510 KVM: arm64: Prevent userspace from handling SMC64 arch range
Though presently unused, there is an SMC64 view of the Arm architecture
calls defined by the SMCCC. The documentation of the SMCCC filter states
that the SMC64 range is reserved, but nothing actually prevents
userspace from applying a filter to the range.

Insert a range with the HANDLE action for the SMC64 arch range, thereby
preventing userspace from imposing filtering/forwarding on it.

Fixes: fb88707dd3 ("KVM: arm64: Use a maple tree to represent the SMCCC filter")
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230408121732.3411329-2-oliver.upton@linux.dev
2023-04-08 15:22:55 +01:00
Sean Christopherson
400d213228 KVM: SVM: Return the local "r" variable from svm_set_msr()
Rename "r" to "ret" and actually return it from svm_set_msr() to reduce
the probability of repeating the mistake of commit 723d5fb0ff ("kvm:
svm: Add IA32_FLUSH_CMD guest support"), which set "r" thinking that it
would be propagated to the caller.

Alternatively, the declaration of "r" could be moved into the handling of
MSR_TSC_AUX, but that risks variable shadowing in the future.  A wrapper
for kvm_set_user_return_msr() would allow eliding a local variable, but
that feels like delaying the inevitable.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20230322011440.2195485-7-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-04-06 13:37:37 -04:00
Sean Christopherson
da3db168fb KVM: x86: Virtualize FLUSH_L1D and passthrough MSR_IA32_FLUSH_CMD
Virtualize FLUSH_L1D so that the guest can use the performant L1D flush
if one of the many mitigations might require a flush in the guest, e.g.
Linux provides an option to flush the L1D when switching mms.

Passthrough MSR_IA32_FLUSH_CMD for write when it's supported in hardware
and exposed to the guest, i.e. always let the guest write it directly if
FLUSH_L1D is fully supported.

Forward writes to hardware in host context on the off chance that KVM
ends up emulating a WRMSR, or in the really unlikely scenario where
userspace wants to force a flush.  Restrict these forwarded WRMSRs to
the known command out of an abundance of caution.  Passing through the
MSR means the guest can throw any and all values at hardware, but doing
so in host context is arguably a bit more dangerous.

Link: https://lkml.kernel.org/r/CALMp9eTt3xzAEoQ038bJQ9LN0ZOXrSWsN7xnNUD%2B0SS%3DWwF7Pg%40mail.gmail.com
Link: https://lore.kernel.org/all/20230201132905.549148-2-eesposit@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20230322011440.2195485-6-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-04-06 13:37:37 -04:00
Sean Christopherson
903358c7ed KVM: x86: Move MSR_IA32_PRED_CMD WRMSR emulation to common code
Dedup the handling of MSR_IA32_PRED_CMD across VMX and SVM by moving the
logic to kvm_set_msr_common().  Now that the MSR interception toggling is
handled as part of setting guest CPUID, the VMX and SVM paths are
identical.

Opportunistically massage the code to make it a wee bit denser.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Message-Id: <20230322011440.2195485-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-04-06 13:37:36 -04:00
Sean Christopherson
bff903e8cd KVM: SVM: Passthrough MSR_IA32_PRED_CMD based purely on host+guest CPUID
Passthrough MSR_IA32_PRED_CMD based purely on whether or not the MSR is
supported and enabled, i.e. don't wait until the first write.  There's no
benefit to deferred passthrough, and the extra logic only adds complexity.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20230322011440.2195485-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-04-06 13:37:36 -04:00
Sean Christopherson
9a4c485013 KVM: VMX: Passthrough MSR_IA32_PRED_CMD based purely on host+guest CPUID
Passthrough MSR_IA32_PRED_CMD based purely on whether or not the MSR is
supported and enabled, i.e. don't wait until the first write.  There's no
benefit to deferred passthrough, and the extra logic only adds complexity.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Message-Id: <20230322011440.2195485-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-04-06 13:37:36 -04:00
Sean Christopherson
52887af565 KVM: x86: Revert MSR_IA32_FLUSH_CMD.FLUSH_L1D enabling
Revert the recently added virtualizing of MSR_IA32_FLUSH_CMD, as both
the VMX and SVM are fatally buggy to guests that use MSR_IA32_FLUSH_CMD or
MSR_IA32_PRED_CMD, and because the entire foundation of the logic is
flawed.

The most immediate problem is an inverted check on @cmd that results in
rejecting legal values.  SVM doubles down on bugs and drops the error,
i.e. silently breaks all guest mitigations based on the command MSRs.

The next issue is that neither VMX nor SVM was updated to mark
MSR_IA32_FLUSH_CMD as being a possible passthrough MSR,
which isn't hugely problematic, but does break MSR filtering and triggers
a WARN on VMX designed to catch this exact bug.

The foundational issues stem from the MSR_IA32_FLUSH_CMD code reusing
logic from MSR_IA32_PRED_CMD, which in turn was likely copied from KVM's
support for MSR_IA32_SPEC_CTRL.  The copy+paste from MSR_IA32_SPEC_CTRL
was misguided as MSR_IA32_PRED_CMD (and MSR_IA32_FLUSH_CMD) is a
write-only MSR, i.e. doesn't need the same "deferred passthrough"
shenanigans as MSR_IA32_SPEC_CTRL.

Revert all MSR_IA32_FLUSH_CMD enabling in one fell swoop so that there is
no point where KVM advertises, but does not support, L1D_FLUSH.

This reverts commits 45cf86f261,
723d5fb0ff, and
a807b78ad0.

Reported-by: Nathan Chancellor <nathan@kernel.org>
Link: https://lkml.kernel.org/r/20230317190432.GA863767%40dev-arch.thelio-3990X
Cc: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Cc: Jim Mattson <jmattson@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Mathias Krause <minipli@grsecurity.net>
Message-Id: <20230322011440.2195485-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-04-06 13:37:35 -04:00
Tom Rix
944a8dad8b KVM: x86: set "mitigate_smt_rsb" storage-class-specifier to static
smatch reports
arch/x86/kvm/x86.c:199:20: warning: symbol
  'mitigate_smt_rsb' was not declared. Should it be static?

This variable is only used in one file so it should be static.

Signed-off-by: Tom Rix <trix@redhat.com>
Link: https://lore.kernel.org/r/20230404010141.1913667-1-trix@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-04-05 16:47:35 -07:00
Marc Zyngier
0e5c9a9d65 KVM: arm64: Expose SMC/HVC width to userspace
When returning to userspace to handle a SMCCC call, we consistently
set PC to point to the instruction immediately after the HVC/SMC.

However, should userspace need to know the exact address of the
trapping instruction, it needs to know about the *size* of that
instruction. For AArch64, this is pretty easy. For AArch32, this
is a bit more funky, as Thumb has 16bit encodings for both HVC
and SMC.

Expose this to userspace with a new flag that directly derives
from ESR_EL2.IL. Also update the documentation to reflect the PC
state at the point of exit.

Finally, this fixes a small buglet where the hypercall.{args,ret}
fields would not be cleared on exit, and could contain some
random junk.

Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/86pm8iv8tj.wl-maz@kernel.org
2023-04-05 19:20:23 +01:00
Oliver Upton
60e7dade49 KVM: selftests: Add test for SMCCC filter
Add a selftest for the SMCCC filter, ensuring basic UAPI constraints
(e.g. reserved ranges, non-overlapping ranges) are upheld. Additionally,
test that the DENIED and FWD_TO_USER work as intended.

Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230404154050.2270077-14-oliver.upton@linux.dev
2023-04-05 12:07:42 +01:00
Oliver Upton
fab19915f4 KVM: selftests: Add a helper for SMCCC calls with SMC instruction
Build a helper for doing SMCs in selftests by macro-izing the current
HVC implementation and taking the conduit instruction as an argument.

Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230404154050.2270077-13-oliver.upton@linux.dev
2023-04-05 12:07:42 +01:00
Oliver Upton
37c8e49479 KVM: arm64: Let errors from SMCCC emulation to reach userspace
Typically a negative return from an exit handler is used to request a
return to userspace with the specified error. KVM's handling of SMCCC
emulation (i.e. both HVCs and SMCs) deviates from the trend and resumes
the guest instead.

Stop handling negative returns this way and instead let the error
percolate to userspace.

Suggested-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230404154050.2270077-12-oliver.upton@linux.dev
2023-04-05 12:07:42 +01:00
Oliver Upton
7e484d2785 KVM: arm64: Return NOT_SUPPORTED to guest for unknown PSCI version
A subsequent change to KVM will allow negative returns from SMCCC
handlers to exit to userspace. Make way for this change by explicitly
returning SMCCC_RET_NOT_SUPPORTED to the guest if the VM is configured
to use an unknown PSCI version. Add a WARN since this is undoubtedly a
KVM bug.

Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230404154050.2270077-11-oliver.upton@linux.dev
2023-04-05 12:07:41 +01:00
Oliver Upton
821d935c87 KVM: arm64: Introduce support for userspace SMCCC filtering
As the SMCCC (and related specifications) march towards an 'everything
and the kitchen sink' interface for interacting with a system it becomes
less likely that KVM will support every related feature. We could do
better by letting userspace have a crack at it instead.

Allow userspace to define an 'SMCCC filter' that applies to both HVCs
and SMCs initiated by the guest. Supporting both conduits with this
interface is important for a couple of reasons. Guest SMC usage is table
stakes for a nested guest, as HVCs are always taken to the virtual EL2.
Additionally, guests may want to interact with a service on the secure
side which can now be proxied by userspace.

Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230404154050.2270077-10-oliver.upton@linux.dev
2023-04-05 12:07:41 +01:00
Oliver Upton
d824dff191 KVM: arm64: Add support for KVM_EXIT_HYPERCALL
In anticipation of user hypercall filters, add the necessary plumbing to
get SMCCC calls out to userspace. Even though the exit structure has
space for KVM to pass register arguments, let's just avoid it altogether
and let userspace poke at the registers via KVM_GET_ONE_REG.

This deliberately stretches the definition of a 'hypercall' to cover
SMCs from EL1 in addition to the HVCs we know and love. KVM doesn't
support EL1 calls into secure services, but now we can paint that as a
userspace problem and be done with it.

Finally, we need a flag to let userspace know what conduit instruction
was used (i.e. SMC vs. HVC).

Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230404154050.2270077-9-oliver.upton@linux.dev
2023-04-05 12:07:41 +01:00
Oliver Upton
fb88707dd3 KVM: arm64: Use a maple tree to represent the SMCCC filter
Maple tree is an efficient B-tree implementation that is intended for
storing non-overlapping intervals. Such a data structure is a good fit
for the SMCCC filter as it is desirable to sparsely allocate the 32 bit
function ID space.

To that end, add a maple tree to kvm_arch and correctly init/teardown
along with the VM. Wire in a test against the hypercall filter for HVCs
which does nothing until the controls are exposed to userspace.

Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230404154050.2270077-8-oliver.upton@linux.dev
2023-04-05 12:07:41 +01:00
Oliver Upton
a8308b3fc9 KVM: arm64: Refactor hvc filtering to support different actions
KVM presently allows userspace to filter guest hypercalls with bitmaps
expressed via pseudo-firmware registers. These bitmaps have a narrow
scope and, of course, can only allow/deny a particular call. A
subsequent change to KVM will introduce a generalized UAPI for filtering
hypercalls, allowing functions to be forwarded to userspace.

Refactor the existing hypercall filtering logic to make room for more
than two actions. While at it, generalize the function names around
SMCCC as it is the basis for the upcoming UAPI.

No functional change intended.

Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230404154050.2270077-7-oliver.upton@linux.dev
2023-04-05 12:07:41 +01:00
Oliver Upton
c2d2e9b3d8 KVM: arm64: Start handling SMCs from EL1
Whelp, the architecture gods have spoken and confirmed that the function
ID space is common between SMCs and HVCs. Not only that, the expectation
is that hypervisors handle calls to both SMC and HVC conduits. KVM
recently picked up support for SMCCCs in commit bd36b1a9eb ("KVM:
arm64: nv: Handle SMCs taken from virtual EL2") but scoped it only to a
nested hypervisor.

Let's just open the floodgates and let EL1 access our SMCCC
implementation with the SMC instruction as well.

Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230404154050.2270077-6-oliver.upton@linux.dev
2023-04-05 12:07:41 +01:00
Oliver Upton
aac9496812 KVM: arm64: Rename SMC/HVC call handler to reflect reality
KVM handles SMCCC calls from virtual EL2 that use the SMC instruction
since commit bd36b1a9eb ("KVM: arm64: nv: Handle SMCs taken from
virtual EL2"). Thus, the function name of the handler no longer reflects
reality.

Normalize the name on SMCCC, since that's the only hypercall interface
KVM supports in the first place. No fuctional change intended.

Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230404154050.2270077-5-oliver.upton@linux.dev
2023-04-05 12:07:41 +01:00
Oliver Upton
e0fc6b2161 KVM: arm64: Add vm fd device attribute accessors
A subsequent change will allow userspace to convey a filter for
hypercalls through a vm device attribute. Add the requisite boilerplate
for vm attribute accessors.

Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230404154050.2270077-4-oliver.upton@linux.dev
2023-04-05 12:07:41 +01:00
Oliver Upton
de40bb8abb KVM: arm64: Add a helper to check if a VM has ran once
The test_bit(...) pattern is quite a lot of keystrokes. Replace
existing callsites with a helper.

No functional change intended.

Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230404154050.2270077-3-oliver.upton@linux.dev
2023-04-05 12:07:41 +01:00
Oliver Upton
e65733b5c5 KVM: x86: Redefine 'longmode' as a flag for KVM_EXIT_HYPERCALL
The 'longmode' field is a bit annoying as it blows an entire __u32 to
represent a boolean value. Since other architectures are looking to add
support for KVM_EXIT_HYPERCALL, now is probably a good time to clean it
up.

Redefine the field (and the remaining padding) as a set of flags.
Preserve the existing ABI by using bit 0 to indicate if the guest was in
long mode and requiring that the remaining 31 bits must be zero.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Acked-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230404154050.2270077-2-oliver.upton@linux.dev
2023-04-05 12:07:41 +01:00