linux

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2026-05-10 09:09:55 -04:00

Author	SHA1	Message	Date
Yan Zhao	c15fcf12ff	drm/i915/gvt: switch from ->track_flush_slot() to ->track_remove_region() Switch from the poorly named and flawed ->track_flush_slot() to the newly introduced ->track_remove_region(). From KVMGT's perspective, the two hooks are functionally equivalent, the only difference being that ->track_remove_region() is called only when KVM is 100% certain the memory region will be removed, i.e. is invoked slightly later in KVM's memslot modification flow. Cc: Zhenyu Wang <zhenyuw@linux.intel.com> Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> [sean: handle name change, massage changelog, rebase] Tested-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Yongwei Ma <yongwei.ma@intel.com> Reviewed-by: Zhi Wang <zhi.a.wang@intel.com> Link: https://lore.kernel.org/r/20230729013535.1070024-20-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2023-08-31 14:07:25 -04:00
Yan Zhao	b83ab124de	KVM: x86: Add a new page-track hook to handle memslot deletion Add a new page-track hook, track_remove_region(), that is called when a memslot DELETE operation is about to be committed. The "remove" hook will be used by KVMGT and will effectively replace the existing track_flush_slot() altogether now that KVM itself doesn't rely on the "flush" hook either. The "flush" hook is flawed as it's invoked before the memslot operation is guaranteed to succeed, i.e. KVM might ultimately keep the existing memslot without notifying external page track users, a.k.a. KVMGT. In practice, this can't currently happen on x86, but there are no guarantees that won't change in the future, not to mention that "flush" does a very poor job of describing what is happening. Pass in the gfn+nr_pages instead of the slot itself so external users, i.e. KVMGT, don't need to exposed to KVM internals (memslots). This will help set the stage for additional cleanups to the page-track APIs. Opportunistically align the existing srcu_read_lock_held() usage so that the new case doesn't stand out like a sore thumb (and not aligning the new code makes bots unhappy). Cc: Zhenyu Wang <zhenyuw@linux.intel.com> Tested-by: Yongwei Ma <yongwei.ma@intel.com> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Co-developed-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20230729013535.1070024-19-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2023-08-31 14:07:25 -04:00
Sean Christopherson	2ee05a4c27	drm/i915/gvt: Don't bother removing write-protection on to-be-deleted slot When handling a slot "flush", don't call back into KVM to drop write protection for gfns in the slot. Now that KVM rejects attempts to move memory slots while KVMGT is attached, the only time a slot is "flushed" is when it's being removed, i.e. the memslot and all its write-tracking metadata is about to be deleted. Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Yongwei Ma <yongwei.ma@intel.com> Reviewed-by: Zhi Wang <zhi.a.wang@intel.com> Link: https://lore.kernel.org/r/20230729013535.1070024-18-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2023-08-31 14:07:24 -04:00
Sean Christopherson	c70934e0ab	KVM: x86: Reject memslot MOVE operations if KVMGT is attached Disallow moving memslots if the VM has external page-track users, i.e. if KVMGT is being used to expose a virtual GPU to the guest, as KVMGT doesn't correctly handle moving memory regions. Note, this is potential ABI breakage! E.g. userspace could move regions that aren't shadowed by KVMGT without harming the guest. However, the only known user of KVMGT is QEMU, and QEMU doesn't move generic memory regions. KVM's own support for moving memory regions was also broken for multiple years (albeit for an edge case, but arguably moving RAM is itself an edge case), e.g. see commit `edd4fa37ba` ("KVM: x86: Allocate new rmap and large page tracking when moving memslot"). Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Yongwei Ma <yongwei.ma@intel.com> Link: https://lore.kernel.org/r/20230729013535.1070024-17-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2023-08-31 14:07:23 -04:00
Sean Christopherson	b271e17def	KVM: drm/i915/gvt: Drop @vcpu from KVM's ->track_write() hook Drop @vcpu from KVM's ->track_write() hook provided for external users of the page-track APIs now that KVM itself doesn't use the page-track mechanism. Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Yongwei Ma <yongwei.ma@intel.com> Reviewed-by: Zhi Wang <zhi.a.wang@intel.com> Link: https://lore.kernel.org/r/20230729013535.1070024-16-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2023-08-31 13:49:01 -04:00
Sean Christopherson	932844462a	KVM: x86/mmu: Don't bounce through page-track mechanism for guest PTEs Don't use the generic page-track mechanism to handle writes to guest PTEs in KVM's MMU. KVM's MMU needs access to information that should not be exposed to external page-track users, e.g. KVM needs (for some definitions of "need") the vCPU to query the current paging mode, whereas external users, i.e. KVMGT, have no ties to the current vCPU and so should never need the vCPU. Moving away from the page-track mechanism will allow dropping use of the page-track mechanism for KVM's own MMU, and will also allow simplifying and cleaning up the page-track APIs. Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Yongwei Ma <yongwei.ma@intel.com> Link: https://lore.kernel.org/r/20230729013535.1070024-15-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2023-08-31 13:49:00 -04:00
Sean Christopherson	eeb87272a3	KVM: x86/mmu: Don't rely on page-track mechanism to flush on memslot change Call kvm_mmu_zap_all_fast() directly when flushing a memslot instead of bouncing through the page-track mechanism. KVM (unfortunately) needs to zap and flush all page tables on memslot DELETE/MOVE irrespective of whether KVM is shadowing guest page tables. This will allow changing KVM to register a page-track notifier on the first shadow root allocation, and will also allow deleting the misguided kvm_page_track_flush_slot() hook itself once KVM-GT also moves to a different method for reacting to memslot changes. No functional change intended. Cc: Yan Zhao <yan.y.zhao@intel.com> Link: https://lore.kernel.org/r/20221110014821.1548347-2-seanjc@google.com Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Yongwei Ma <yongwei.ma@intel.com> Link: https://lore.kernel.org/r/20230729013535.1070024-14-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2023-08-31 13:49:00 -04:00
Sean Christopherson	db0d70e610	KVM: x86/mmu: Move kvm_arch_flush_shadow_{all,memslot}() to mmu.c Move x86's implementation of kvm_arch_flush_shadow_{all,memslot}() into mmu.c, and make kvm_mmu_zap_all() static as it was globally visible only for kvm_arch_flush_shadow_all(). This will allow refactoring kvm_arch_flush_shadow_memslot() to call kvm_mmu_zap_all() directly without having to expose kvm_mmu_zap_all_fast() outside of mmu.c. Keeping everything in mmu.c will also likely simplify supporting TDX, which intends to do zap only relevant SPTEs on memslot updates. No functional change intended. Suggested-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Yongwei Ma <yongwei.ma@intel.com> Link: https://lore.kernel.org/r/20230729013535.1070024-13-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2023-08-31 13:48:59 -04:00
Sean Christopherson	3cca6b2628	drm/i915/gvt: Protect gfn hash table with vgpu_lock Use vgpu_lock instead of KVM's mmu_lock to protect accesses to the hash table used to track which gfns are write-protected when shadowing the guest's GTT, and hoist the acquisition of vgpu_lock from intel_vgpu_page_track_handler() out to its sole caller, kvmgt_page_track_write(). This fixes a bug where kvmgt_page_track_write(), which doesn't hold kvm->mmu_lock, could race with intel_gvt_page_track_remove() and trigger a use-after-free. Fixing kvmgt_page_track_write() by taking kvm->mmu_lock is not an option as mmu_lock is a r/w spinlock, and intel_vgpu_page_track_handler() might sleep when acquiring vgpu->cache_lock deep down the callstack: intel_vgpu_page_track_handler() \| \|-> page_track->handler / ppgtt_write_protection_handler() \| \|-> ppgtt_handle_guest_write_page_table_bytes() \| \|-> ppgtt_handle_guest_write_page_table() \| \|-> ppgtt_handle_guest_entry_removal() \| \|-> ppgtt_invalidate_pte() \| \|-> intel_gvt_dma_unmap_guest_page() \| \|-> mutex_lock(&vgpu->cache_lock); Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Yongwei Ma <yongwei.ma@intel.com> Reviewed-by: Zhi Wang <zhi.a.wang@intel.com> Link: https://lore.kernel.org/r/20230729013535.1070024-12-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2023-08-31 13:48:58 -04:00
Sean Christopherson	a90c367e5a	drm/i915/gvt: Drop unused helper intel_vgpu_reset_gtt() Drop intel_vgpu_reset_gtt() as it no longer has any callers. In addition to eliminating dead code, this eliminates the last possible scenario where __kvmgt_protect_table_find() can be reached without holding vgpu_lock. Requiring vgpu_lock to be held when calling __kvmgt_protect_table_find() will allow a protecting the gfn hash with vgpu_lock without too much fuss. No functional change intended. Fixes: `ba25d97757` ("drm/i915/gvt: Do not destroy ppgtt_mm during vGPU D3->D0.") Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Yongwei Ma <yongwei.ma@intel.com> Reviewed-by: Zhi Wang <zhi.a.wang@intel.com> Link: https://lore.kernel.org/r/20230729013535.1070024-11-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2023-08-31 13:48:57 -04:00
Sean Christopherson	16735297fd	drm/i915/gvt: Use an "unsigned long" to iterate over memslot gfns Use an "unsigned long" instead of an "int" when iterating over the gfns in a memslot. The number of pages in the memslot is tracked as an "unsigned long", e.g. KVMGT could theoretically break if a KVM memslot larger than 16TiB were deleted (2^32 * 4KiB). Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Yongwei Ma <yongwei.ma@intel.com> Reviewed-by: Zhi Wang <zhi.a.wang@intel.com> Link: https://lore.kernel.org/r/20230729013535.1070024-10-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2023-08-31 13:48:57 -04:00
Sean Christopherson	ba193f62c0	drm/i915/gvt: Don't rely on KVM's gfn_to_pfn() to query possible 2M GTT Now that gvt_pin_guest_page() explicitly verifies the pinned PFN is a transparent hugepage page, don't use KVM's gfn_to_pfn() to pre-check if a 2MiB GTT entry is possible and instead just try to map the GFN with a 2MiB entry. Using KVM to query pfn that is ultimately managed through VFIO is odd, and KVM's gfn_to_pfn() is not intended for non-KVM consumption; it's exported only because of KVM vendor modules (x86 and PPC). Open code the check on 2MiB support instead of keeping is_2MB_gtt_possible() around for a single line of code. Move the call to intel_gvt_dma_map_guest_page() for a 4KiB entry into its case statement, i.e. fork the common path into the 4KiB and 2MiB "direct" shadow paths. Keeping the call in the "common" path is arguably more in the spirit of "one change per patch", but retaining the local "page_size" variable is silly, i.e. the call site will be changed either way, and jumping around the no-longer-common code is more subtle and rather odd, i.e. would just need to be immediately cleaned up. Drop the error message from gvt_pin_guest_page() when KVMGT attempts to shadow a 2MiB guest page that isn't backed by a compatible hugepage in the host. Dropping the pre-check on a THP makes it much more likely that the "error" will be encountered in normal operation. Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Yongwei Ma <yongwei.ma@intel.com> Reviewed-by: Zhi Wang <zhi.a.wang@intel.com> Link: https://lore.kernel.org/r/20230729013535.1070024-9-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2023-08-31 13:48:56 -04:00
Sean Christopherson	241f0aadb8	drm/i915/gvt: Error out on an attempt to shadowing an unknown GTT entry type Bail from ppgtt_populate_shadow_entry() if an unexpected GTT entry type is encountered instead of subtly falling through to the common "direct shadow" path. Eliminating the default/error path's reliance on the common handling will allow hoisting intel_gvt_dma_map_guest_page() into the case statements so that the 2MiB case can try intel_gvt_dma_map_guest_page() and fallback to splitting the entry on failure. Reviewed-by: Zhi Wang <zhi.a.wang@intel.com> Tested-by: Yongwei Ma <yongwei.ma@intel.com> Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> Link: https://lore.kernel.org/r/20230729013535.1070024-8-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2023-08-31 13:48:55 -04:00
Sean Christopherson	96b138cd23	drm/i915/gvt: Explicitly check that vGPU is attached before shadowing Move the check that a vGPU is attached from is_2MB_gtt_possible() all the way up to shadow_ppgtt_mm() to avoid unnecessary work, and to make it more obvious that a future cleanup of is_2MB_gtt_possible() isn't introducing a bug. is_2MB_gtt_possible() has only one caller, ppgtt_populate_shadow_entry(), and all paths in ppgtt_populate_shadow_entry() eventually check for attachment by way of intel_gvt_dma_map_guest_page(). And of the paths that lead to ppgtt_populate_shadow_entry(), shadow_ppgtt_mm() is the only one that doesn't already check for INTEL_VGPU_STATUS_ACTIVE or INTEL_VGPU_STATUS_ATTACHED. workload_thread() <= pick_next_workload() => INTEL_VGPU_STATUS_ACTIVE \| -> dispatch_workload() \| \|-> prepare_workload() \| -> intel_vgpu_sync_oos_pages() \| \| \| \|-> ppgtt_set_guest_page_sync() \| \| \| \|-> sync_oos_page() \| \| \| \|-> ppgtt_populate_shadow_entry() \| \|-> intel_vgpu_flush_post_shadow() \| 1: \|-> ppgtt_handle_guest_write_page_table() \| \|-> ppgtt_handle_guest_entry_add() \| 2: \| -> ppgtt_populate_spt_by_guest_entry() \| \| \| \|-> ppgtt_populate_spt() \| \| \| \|-> ppgtt_populate_shadow_entry() \| \| \| \|-> ppgtt_populate_spt_by_guest_entry() [see 2] \| \|-> ppgtt_populate_shadow_entry() kvmgt_page_track_write() <= KVM callback => INTEL_VGPU_STATUS_ATTACHED \| \|-> intel_vgpu_page_track_handler() \| \|-> ppgtt_write_protection_handler() \| \|-> ppgtt_handle_guest_write_page_table_bytes() \| \|-> ppgtt_handle_guest_write_page_table() [see 1] Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Yan Zhao <yan.y.zhao@intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2023-08-31 13:48:54 -04:00
Sean Christopherson	708e49583d	drm/i915/gvt: Put the page reference obtained by KVM's gfn_to_pfn() Put the struct page reference acquired by gfn_to_pfn(), KVM's API is that the caller is ultimately responsible for dropping any reference. Note, kvm_release_pfn_clean() ensures the pfn is actually a refcounted struct page before trying to put any references. Fixes: `b901b252b6` ("drm/i915/gvt: Add 2M huge gtt support") Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Yongwei Ma <yongwei.ma@intel.com> Reviewed-by: Zhi Wang <zhi.a.wang@intel.com> Link: https://lore.kernel.org/r/20230729013535.1070024-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2023-08-31 13:48:54 -04:00
Yan Zhao	a15e61f337	drm/i915/gvt: Don't try to unpin an empty page range Attempt to unpin pages in the error path of gvt_pin_guest_page() if and only if at least one page was successfully pinned. Unpinning doesn't cause functional problems, but vfio_device_container_unpin_pages() rightfully warns about being asked to unpin zero pages. Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> [sean: write changelog] Reviewed-by: Zhi Wang <zhi.a.wang@intel.com> Link: https://lore.kernel.org/r/20230729013535.1070024-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2023-08-31 13:48:53 -04:00
Sean Christopherson	adc7b226b7	drm/i915/gvt: Verify hugepages are contiguous in physical address space When shadowing a GTT entry with a 2M page, verify that the pfns are contiguous, not just that the struct page pointers are contiguous. The memory map is virtual contiguous if "CONFIG_FLATMEM=y \|\| CONFIG_SPARSEMEM_VMEMMAP=y", but not for "CONFIG_SPARSEMEM=y && CONFIG_SPARSEMEM_VMEMMAP=n", so theoretically KVMGT could encounter struct pages that are virtually contiguous, but not physically contiguous. In practice, this flaw is likely a non-issue as it would cause functional problems iff a section isn't 2M aligned _and_ is directly adjacent to another section with discontiguous pfns. Tested-by: Yongwei Ma <yongwei.ma@intel.com> Reviewed-by: Zhi Wang <zhi.a.wang@intel.com> Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> Link: https://lore.kernel.org/r/20230729013535.1070024-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2023-08-31 13:48:52 -04:00
Yan Zhao	1e557c1cd0	drm/i915/gvt: remove interface intel_gvt_is_valid_gfn Currently intel_gvt_is_valid_gfn() is called in two places: (1) shadowing guest GGTT entry (2) shadowing guest PPGTT leaf entry, which was introduced in commit `cc753fbe1a` ("drm/i915/gvt: validate gfn before set shadow page entry"). However, now it's not necessary to call this interface any more, because a. GGTT partial write issue has been fixed by commit `bc0686ff5f` ("drm/i915/gvt: support inconsecutive partial gtt entry write") commit `510fe10b61` ("drm/i915/gvt: fix a bug of partially write ggtt enties") b. PPGTT resides in normal guest RAM and we only treat 8-byte writes as valid page table writes. Any invalid GPA found is regarded as an error, either due to guest misbehavior/attack or bug in host shadow code. So,rather than do GFN pre-checking and replace invalid GFNs with scratch GFN and continue silently, just remove the pre-checking and abort PPGTT shadowing on error detected. c. GFN validity check is still performed in intel_gvt_dma_map_guest_page() --> gvt_pin_guest_page(). It's more desirable to call VFIO interface to do both validity check and mapping. Calling intel_gvt_is_valid_gfn() to do GFN validity check from KVM side while later mapping the GFN through VFIO interface is unnecessarily fragile and confusing for unaware readers. Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> [sean: remove now-unused local variables] Acked-by: Zhi Wang <zhi.a.wang@intel.com> Tested-by: Yongwei Ma <yongwei.ma@intel.com> Link: https://lore.kernel.org/r/20230729013535.1070024-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2023-08-31 13:48:52 -04:00
Sean Christopherson	f046923af7	drm/i915/gvt: Verify pfn is "valid" before dereferencing "struct page" Check that the pfn found by gfn_to_pfn() is actually backed by "struct page" memory prior to retrieving and dereferencing the page. KVM supports backing guest memory with VM_PFNMAP, VM_IO, etc., and so there is no guarantee the pfn returned by gfn_to_pfn() has an associated "struct page". Fixes: `b901b252b6` ("drm/i915/gvt: Add 2M huge gtt support") Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Yongwei Ma <yongwei.ma@intel.com> Reviewed-by: Zhi Wang <zhi.a.wang@intel.com> Link: https://lore.kernel.org/r/20230729013535.1070024-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2023-08-31 13:48:51 -04:00
Sean Christopherson	52e322eda3	KVM: x86/mmu: BUG() in rmap helpers iff CONFIG_BUG_ON_DATA_CORRUPTION=y Introduce KVM_BUG_ON_DATA_CORRUPTION() and use it in the low-level rmap helpers to convert the existing BUG()s to WARN_ON_ONCE() when the kernel is built with CONFIG_BUG_ON_DATA_CORRUPTION=n, i.e. does NOT want to BUG() on corruption of host kernel data structures. Environments that don't have infrastructure to automatically capture crash dumps, i.e. aren't likely to enable CONFIG_BUG_ON_DATA_CORRUPTION=y, are typically better served overall by WARN-and-continue behavior (for the kernel, the VM is dead regardless), as a BUG() while holding mmu_lock all but guarantees the _best_ case scenario is a panic(). Make the BUG()s conditional instead of removing/replacing them entirely as there's a non-zero chance (though by no means a guarantee) that the damage isn't contained to the target VM, e.g. if no rmap is found for a SPTE then KVM may be double-zapping the SPTE, i.e. has already freed the memory the SPTE pointed at and thus KVM is reading/writing memory that KVM no longer owns. Link: https://lore.kernel.org/all/20221129191237.31447-1-mizhang@google.com Suggested-by: Mingwei Zhang <mizhang@google.com> Cc: David Matlack <dmatlack@google.com> Cc: Jim Mattson <jmattson@google.com> Reviewed-by: Mingwei Zhang <mizhang@google.com> Link: https://lore.kernel.org/r/20230729004722.1056172-13-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2023-08-31 13:48:50 -04:00
Mingwei Zhang	069f30c619	KVM: x86/mmu: Plumb "struct kvm" all the way to pte_list_remove() Plumb "struct kvm" all the way to pte_list_remove() to allow the usage of KVM_BUG() and/or KVM_BUG_ON(). This will allow killing only the offending VM instead of doing BUG() if the kernel is built with CONFIG_BUG_ON_DATA_CORRUPTION=n, i.e. does NOT want to BUG() if KVM's data structures (rmaps) appear to be corrupted. Signed-off-by: Mingwei Zhang <mizhang@google.com> [sean: tweak changelog] Link: https://lore.kernel.org/r/20230729004722.1056172-12-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2023-08-31 13:48:49 -04:00
Sean Christopherson	3328dfe0ea	KVM: x86/mmu: Use BUILD_BUG_ON_INVALID() for KVM_MMU_WARN_ON() stub Use BUILD_BUG_ON_INVALID() instead of an empty do-while loop to stub out KVM_MMU_WARN_ON() when CONFIG_KVM_PROVE_MMU=n, that way _some_ build issues with the usage of KVM_MMU_WARN_ON() will be dected even if the kernel is using the stubs, e.g. basic syntax errors will be detected. Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> Link: https://lore.kernel.org/r/20230729004722.1056172-11-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2023-08-31 13:48:48 -04:00
Sean Christopherson	870d4d4ed8	KVM: x86/mmu: Replace MMU_DEBUG with proper KVM_PROVE_MMU Kconfig Replace MMU_DEBUG, which requires manually modifying KVM to enable the macro, with a proper Kconfig, KVM_PROVE_MMU. Now that pgprintk() and rmap_printk() are gone, i.e. the macro guards only KVM_MMU_WARN_ON() and won't flood the kernel logs, enabling the option for debug kernels is both desirable and feasible. Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> Link: https://lore.kernel.org/r/20230729004722.1056172-10-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2023-08-31 13:48:47 -04:00
Sean Christopherson	72e2fb24a0	KVM: x86/mmu: Bug the VM if a vCPU ends up in long mode without PAE enabled Promote the ASSERT(), which is quite dead code in KVM, into a KVM_BUG_ON() for KVM's sanity check that CR4.PAE=1 if the vCPU is in long mode when performing a walk of guest page tables. The sanity is quite cheap since neither EFER nor CR4.PAE requires a VMREAD, especially relative to the cost of walking the guest page tables. More importantly, the sanity check would have prevented the true badness fixed by commit `112e66017b` ("KVM: nVMX: add missing consistency checks for CR0 and CR4"). The missed consistency check resulted in some versions of KVM corrupting the on-stack guest_walker structure due to KVM thinking there are 4/5 levels of page tables, but wiring up the MMU hooks to point at the paging32 implementation, which only allocates space for two levels of page tables in "struct guest_walker32". Queue a page fault for injection if the assertion fails, as both callers, FNAME(gva_to_gpa) and FNAME(walk_addr_generic), assume that walker.fault contains sane info on a walk failure. E.g. not populating the fault info could result in KVM consuming and/or exposing uninitialized stack data before the vCPU is kicked out to userspace, which doesn't happen until KVM checks for KVM_REQ_VM_DEAD on the next enter. Move the check below the initialization of "pte_access" so that the aforementioned to-be-injected page fault doesn't consume uninitialized stack data. The information _shouldn't_ reach the guest or userspace, but there's zero downside to being paranoid in this case. Link: https://lore.kernel.org/r/20230729004722.1056172-9-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2023-08-31 13:48:47 -04:00
Sean Christopherson	20ba462dfd	KVM: x86/mmu: Convert "runtime" WARN_ON() assertions to WARN_ON_ONCE() Convert all "runtime" assertions, i.e. assertions that can be triggered while running vCPUs, from WARN_ON() to WARN_ON_ONCE(). Every WARN in the MMU that is tied to running vCPUs, i.e. not contained to loading and initializing KVM, is likely to fire _a lot_ when it does trigger. E.g. if KVM ends up with a bug that causes a root to be invalidated before the page fault handler is invoked, pretty much _every_ page fault VM-Exit triggers the WARN. If a WARN is triggered frequently, the resulting spam usually causes a lot of damage of its own, e.g. consumes resources to log the WARN and pollutes the kernel log, often to the point where other useful information can be lost. In many case, the damage caused by the spam is actually worse than the bug itself, e.g. KVM can almost always recover from an unexpectedly invalid root. On the flip side, warning every time is rarely helpful for debug and triage, i.e. a single splat is usually sufficient to point a debugger in the right direction, and automated testing, e.g. syzkaller, typically runs with warn_on_panic=1, i.e. will never get past the first WARN anyways. Lastly, when an assertions fails multiple times, the stack traces in KVM are almost always identical, i.e. the full splat only needs to be captured once. And _if_ there is value in captruing information about the failed assert, a ratelimited printk() is sufficient and less likely to rack up a large amount of collateral damage. Link: https://lore.kernel.org/r/20230729004722.1056172-8-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2023-08-31 13:48:44 -04:00
Sean Christopherson	0fe6370eb3	KVM: x86/mmu: Rename MMU_WARN_ON() to KVM_MMU_WARN_ON() Rename MMU_WARN_ON() to make it super obvious that the assertions are all about KVM's MMU, not the primary MMU. Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> Link: https://lore.kernel.org/r/20230729004722.1056172-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2023-08-31 13:48:43 -04:00
Sean Christopherson	58da926caa	KVM: x86/mmu: Cleanup sanity check of SPTEs at SP free Massage the error message for the sanity check on SPTEs when freeing a shadow page to be more verbose, and to print out all shadow-present SPTEs, not just the first SPTE encountered. Printing all SPTEs can be quite valuable for debug, e.g. highlights whether the leak is a one-off or widepsread, or possibly the result of memory corruption (something else in the kernel stomping on KVM's SPTEs). Opportunistically move the MMU_WARN_ON() into the helper itself, which will allow a future cleanup to use BUILD_BUG_ON_INVALID() as the stub for MMU_WARN_ON(). BUILD_BUG_ON_INVALID() works as intended and results in the compiler complaining about is_empty_shadow_page() not being declared. Link: https://lore.kernel.org/r/20230729004722.1056172-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2023-08-31 13:48:43 -04:00
Sean Christopherson	242a6dd8da	KVM: x86/mmu: Avoid pointer arithmetic when iterating over SPTEs Replace the pointer arithmetic used to iterate over SPTEs in is_empty_shadow_page() with more standard interger-based iteration. No functional change intended. Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> Link: https://lore.kernel.org/r/20230729004722.1056172-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2023-08-31 13:48:42 -04:00
Sean Christopherson	c4f92cfe02	KVM: x86/mmu: Delete the "dbg" module param Delete KVM's "dbg" module param now that its usage in KVM is gone (it used to guard pgprintk() and rmap_printk()). Link: https://lore.kernel.org/r/20230729004722.1056172-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2023-08-31 13:48:41 -04:00
Sean Christopherson	350c49fdea	KVM: x86/mmu: Delete rmap_printk() and all its usage Delete rmap_printk() so that MMU_WARN_ON() and MMU_DEBUG can be morphed into something that can be regularly enabled for debug kernels. The information provided by rmap_printk() isn't all that useful now that the rmap and unsync code is mature, as the prints are simultaneously too verbose (_lots_ of message) and yet not verbose enough to be helpful for debug (most instances print just the SPTE pointer/value, which is rarely sufficient to root cause anything but trivial bugs). Alternatively, rmap_printk() could be reworked to into tracepoints, but it's not clear there is a real need as rmap bugs rarely escape initial development, and when bugs do escape to production, they are often edge cases and/or reside in code that isn't directly related to the rmaps. In other words, the problems with rmap_printk() being unhelpful also apply to tracepoints. And deleting rmap_printk() doesn't preclude adding tracepoints in the future. Link: https://lore.kernel.org/r/20230729004722.1056172-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2023-08-31 13:48:40 -04:00
Sean Christopherson	a98b889492	KVM: x86/mmu: Delete pgprintk() and all its usage Delete KVM's pgprintk() and all its usage, as the code is very prone to bitrot due to being buried behind MMU_DEBUG, and the functionality has been rendered almost entirely obsolete by the tracepoints KVM has gained over the years. And for the situations where the information provided by KVM's tracepoints is insufficient, pgprintk() rarely fills in the gaps, and is almost always far too noisy, i.e. developers end up implementing custom prints anyways. Link: https://lore.kernel.org/r/20230729004722.1056172-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2023-08-31 13:48:39 -04:00
Sean Christopherson	d09f711233	KVM: x86/mmu: Guard against collision with KVM-defined PFERR_IMPLICIT_ACCESS Add an assertion in kvm_mmu_page_fault() to ensure the error code provided by hardware doesn't conflict with KVM's software-defined IMPLICIT_ACCESS flag. In the unlikely scenario that future hardware starts using bit 48 for a hardware-defined flag, preserving the bit could result in KVM incorrectly interpreting the unknown flag as KVM's IMPLICIT_ACCESS flag. WARN so that any such conflict can be surfaced to KVM developers and resolved, but otherwise ignore the bit as KVM can't possibly rely on a flag it knows nothing about. Fixes: `4f4aa80e3b` ("KVM: X86: Handle implicit supervisor access with SMAP") Acked-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Link: https://lore.kernel.org/r/20230721223711.2334426-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2023-08-31 13:48:39 -04:00
Like Xu	91303f800e	KVM: x86/mmu: Move the lockdep_assert of mmu_lock to inside clear_dirty_pt_masked() Move the lockdep_assert_held_write(&kvm->mmu_lock) from the only one caller kvm_tdp_mmu_clear_dirty_pt_masked() to inside clear_dirty_pt_masked(). This change makes it more obvious why it's safe for clear_dirty_pt_masked() to use the non-atomic (for non-volatile SPTEs) tdp_mmu_clear_spte_bits() helper. for_each_tdp_mmu_root() does its own lockdep, so the only "loss" in lockdep coverage is if the list is completely empty. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Like Xu <likexu@tencent.com> Link: https://lore.kernel.org/r/20230627042639.12636-1-likexu@tencent.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2023-08-31 13:48:38 -04:00
Paolo Bonzini	6d5e3c318a	Merge tag 'kvm-x86-misc-6.6' of https://github.com/kvm-x86/linux into HEAD KVM x86 changes for 6.6: - Misc cleanups - Retry APIC optimized recalculation if a vCPU is added/enabled - Overhaul emergency reboot code to bring SVM up to par with VMX, tie the "emergency disabling" behavior to KVM actually being loaded, and move all of the logic within KVM - Fix user triggerable WARNs in SVM where KVM incorrectly assumes the TSC ratio MSR can diverge from the default iff TSC scaling is enabled, and clean up related code - Add a framework to allow "caching" feature flags so that KVM can check if the guest can use a feature without needing to search guest CPUID	2023-08-31 13:36:33 -04:00
Paolo Bonzini	bd7fe98b35	Merge tag 'kvm-x86-svm-6.6' of https://github.com/kvm-x86/linux into HEAD KVM: x86: SVM changes for 6.6: - Add support for SEV-ES DebugSwap, i.e. allow SEV-ES guests to use debug registers and generate/handle #DBs - Clean up LBR virtualization code - Fix a bug where KVM fails to set the target pCPU during an IRTE update - Fix fatal bugs in SEV-ES intrahost migration - Fix a bug where the recent (architecturally correct) change to reinject #BP and skip INT3 broke SEV guests (can't decode INT3 to skip it)	2023-08-31 13:32:40 -04:00
Paolo Bonzini	755e732dde	Merge tag 'kvm-x86-vmx-6.6' of https://github.com/kvm-x86/linux into HEAD KVM: x86: VMX changes for 6.6: - Misc cleanups - Fix a bug where KVM reads a stale vmcs.IDT_VECTORING_INFO_FIELD when trying to handle NMI VM-Exits	2023-08-31 13:32:06 -04:00
Paolo Bonzini	8783790a5e	Merge tag 'kvm-x86-pmu-6.6' of https://github.com/kvm-x86/linux into HEAD KVM x86 PMU changes for 6.6: - Clean up KVM's handling of Intel architectural events	2023-08-31 13:31:32 -04:00
Paolo Bonzini	e43ae8b689	Merge tag 'kvm-riscv-6.6-1' of https://github.com/kvm-riscv/linux into HEAD KVM/riscv changes for 6.6 - Zba, Zbs, Zicntr, Zicsr, Zifencei, and Zihpm support for Guest/VM - Added ONE_REG interface for SATP mode - Added ONE_REG interface to enable/disable multiple ISA extensions - Improved error codes returned by ONE_REG interfaces - Added KVM_GET_REG_LIST ioctl() implementation for KVM RISC-V - Added get-reg-list selftest for KVM RISC-V	2023-08-31 13:25:55 -04:00
Paolo Bonzini	69fd3876a4	Merge tag 'kvm-s390-next-6.6-1' of https://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD - PV crypto passthrough enablement (Tony, Steffen, Viktor, Janosch) Allows a PV guest to use crypto cards. Card access is governed by the firmware and once a crypto queue is "bound" to a PV VM every other entity (PV or not) looses access until it is not bound anymore. Enablement is done via flags when creating the PV VM. - Guest debug fixes (Ilya)	2023-08-31 13:21:27 -04:00
Paolo Bonzini	1814db83c0	Merge tag 'kvm-x86-selftests-6.6' of https://github.com/kvm-x86/linux into HEAD KVM: x86: Selftests changes for 6.6: - Add testcases to x86's sync_regs_test for detecting KVM TOCTOU bugs - Add support for printf() in guest code and covert all guest asserts to use printf-based reporting - Clean up the PMU event filter test and add new testcases - Include x86 selftests in the KVM x86 MAINTAINERS entry	2023-08-31 13:20:45 -04:00
Paolo Bonzini	0d15bf966d	Merge tag 'kvm-x86-generic-6.6' of https://github.com/kvm-x86/linux into HEAD Common KVM changes for 6.6: - Wrap kvm_{gfn,hva}_range.pte in a union to allow mmu_notifier events to pass action specific data without needing to constantly update the main handlers. - Drop unused function declarations	2023-08-31 13:19:55 -04:00
Paolo Bonzini	e0fb12c673	Merge tag 'kvmarm-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD KVM/arm64 updates for Linux 6.6 - Add support for TLB range invalidation of Stage-2 page tables, avoiding unnecessary invalidations. Systems that do not implement range invalidation still rely on a full invalidation when dealing with large ranges. - Add infrastructure for forwarding traps taken from a L2 guest to the L1 guest, with L0 acting as the dispatcher, another baby step towards the full nested support. - Simplify the way we deal with the (long deprecated) 'CPU target', resulting in a much needed cleanup. - Fix another set of PMU bugs, both on the guest and host sides, as we seem to never have any shortage of those... - Relax the alignment requirements of EL2 VA allocations for non-stack allocations, as we were otherwise wasting a lot of that precious VA space. - The usual set of non-functional cleanups, although I note the lack of spelling fixes...	2023-08-31 13:18:53 -04:00
Sean Christopherson	50011c2a24	KVM: VMX: Refresh available regs and IDT vectoring info before NMI handling Reset the mask of available "registers" and refresh the IDT vectoring info snapshot in vmx_vcpu_enter_exit(), before KVM potentially handles a an NMI VM-Exit. One of the "registers" that KVM VMX lazily loads is the vmcs.VM_EXIT_INTR_INFO field, which is holds the vector+type on "exception or NMI" VM-Exits, i.e. is needed to identify NMIs. Clearing the available registers bitmask after handling NMIs results in KVM querying info from the last VM-Exit that read vmcs.VM_EXIT_INTR_INFO, and leads to both missed NMIs and spurious NMIs in the host. Opportunistically grab vmcs.IDT_VECTORING_INFO_FIELD early in the VM-Exit path too, e.g. to guard against similar consumption of stale data. The field is read on every "normal" VM-Exit, and there's no point in delaying the inevitable. Reported-by: Like Xu <like.xu.linux@gmail.com> Fixes: `11df586d77` ("KVM: VMX: Handle NMI VM-Exits in noinstr region") Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20230825014532.2846714-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-08-28 20:07:43 -07:00
Steffen Eiden	899e2206f4	KVM: s390: pv: Allow AP-instructions for pv-guests Introduces new feature bits and enablement flags for AP and AP IRQ support. Signed-off-by: Steffen Eiden <seiden@linux.ibm.com> Reviewed-by: Janosch Frank <frankja@linux.ibm.com> Reviewed-by: Michael Mueller <mimu@linux.ibm.com> Signed-off-by: Janosch Frank <frankja@linux.ibm.com> Link: https://lore.kernel.org/r/20230815151415.379760-5-seiden@linux.ibm.com Message-Id: <20230815151415.379760-5-seiden@linux.ibm.com>	2023-08-28 09:27:56 +00:00
Steffen Eiden	19c654bf05	KVM: s390: Add UV feature negotiation Add a uv_feature list for pv-guests to the KVM cpu-model. The feature bits 'AP-interpretation for secure guests' and 'AP-interrupt for secure guests' are available. Signed-off-by: Steffen Eiden <seiden@linux.ibm.com> Reviewed-by: Janosch Frank <frankja@linux.ibm.com> Reviewed-by: Michael Mueller <mimu@linux.ibm.com> Signed-off-by: Janosch Frank <frankja@linux.ibm.com> Link: https://lore.kernel.org/r/20230815151415.379760-4-seiden@linux.ibm.com Message-Id: <20230815151415.379760-4-seiden@linux.ibm.com>	2023-08-28 09:27:55 +00:00
Steffen Eiden	59a881402c	s390/uv: UV feature check utility Introduces a function to check the existence of an UV feature. Refactor feature bit checks to use the new function. Signed-off-by: Steffen Eiden <seiden@linux.ibm.com> Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Reviewed-by: Janosch Frank <frankja@linux.ibm.com> Signed-off-by: Janosch Frank <frankja@linux.ibm.com> Reviewed-by: Michael Mueller <mimu@linux.ibm.com> Link: https://lore.kernel.org/r/20230815151415.379760-3-seiden@linux.ibm.com Message-Id: <20230815151415.379760-3-seiden@linux.ibm.com>	2023-08-28 09:27:55 +00:00
Viktor Mihajlovski	b1e428615f	KVM: s390: pv: relax WARN_ONCE condition for destroy fast Destroy configuration fast may return with RC 0x104 if there are still bound APQNs in the configuration. The final cleanup will occur with the standard destroy configuration UVC as at this point in time all APQNs have been reset and thus unbound. Therefore, don't warn if RC 0x104 is reported. Signed-off-by: Viktor Mihajlovski <mihajlov@linux.ibm.com> Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Reviewed-by: Janosch Frank <frankja@linux.ibm.com> Reviewed-by: Steffen Eiden <seiden@linux.ibm.com> Reviewed-by: Michael Mueller <mimu@linux.ibm.com> Link: https://lore.kernel.org/r/20230815151415.379760-2-seiden@linux.ibm.com Signed-off-by: Janosch Frank <frankja@linux.ibm.com> Message-ID: <20230815151415.379760-2-seiden@linux.ibm.com>	2023-08-28 09:27:55 +00:00
Janosch Frank	5d0545abee	Merge remote-tracking branch 'vfio-ap' into next The Secure Execution AP support makes it possible for SE VMs to securely use APQNs without a third party being able to snoop IO. VMs first bind to an APQN to securely attach it and granting protected key crypto function access. Afterwards they can associate the APQN which grants them clear key crypto function access. Once bound the APQNs are not accessible to the host until a reset is performed. The vfio-ap patches being merged here provide the base hypervisor Secure Execution / Protected Virtualization AP support. This includes proper handling of APQNs that are securely attached to a SE/PV guest especially regarding resets.	2023-08-28 09:26:35 +00:00
Ilya Leoshkevich	642dbc0312	KVM: s390: selftests: Add selftest for single-stepping Test different variations of single-stepping into interrupts: - SVC and PGM interrupts; - Interrupts generated by ISKE; - Interrupts generated by instructions emulated by KVM; - Interrupts generated by instructions emulated by userspace. Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com> Message-ID: <20230725143857.228626-7-iii@linux.ibm.com> Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com> [frankja@de.igm.com: s/ASSERT_EQ/TEST_ASSERT_EQ/ because function was renamed in the selftest printf series] Signed-off-by: Janosch Frank <frankja@linux.ibm.com>	2023-08-28 09:24:20 +00:00
Ilya Leoshkevich	fdbeb55ebd	KVM: s390: interrupt: Fix single-stepping keyless mode exits kvm_s390_skey_check_enable() does not emulate any instructions, rather, it clears CPUSTAT_KSS and arranges the instruction that caused the exit (e.g., ISKE, SSKE, RRBE or LPSWE with a keyed PSW) to run again. Therefore, skip the PER check and let the instruction execution happen. Otherwise, a debugger will see two single-step events on the same instruction. Reviewed-by: Christian Borntraeger <borntraeger@linux.ibm.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com> Message-ID: <20230725143857.228626-6-iii@linux.ibm.com> Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Signed-off-by: Janosch Frank <frankja@linux.ibm.com>	2023-08-28 09:24:20 +00:00

1 2 3 4 5 ...

1202247 Commits