linux

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2026-02-17 12:30:29 -05:00

Author	SHA1	Message	Date
Yi Wang	fbe4a7e881	KVM: Setup empty IRQ routing when creating a VM Setup empty IRQ routing during VM creation so that x86 and s390 don't need to set empty/dummy IRQ routing during KVM_CREATE_IRQCHIP (in future patches). Initializing IRQ routing before there are any potential readers allows KVM to avoid the synchronize_srcu() in kvm_set_irq_routing(), which can introduces 20+ milliseconds of latency in the VM creation path. Ensuring that all VMs have non-NULL IRQ routing also hardens KVM against misbehaving userspace VMMs, e.g. RISC-V dynamically instantiates its interrupt controller, but doesn't override kvm_arch_intc_initialized() or kvm_arch_irqfd_allowed(), and so can likely reach kvm_irq_map_gsi() without fully initialized IRQ routing. Signed-off-by: Yi Wang <foxywang@tencent.com> Acked-by: Christian Borntraeger <borntraeger@linux.ibm.com> Link: https://lore.kernel.org/r/20240506101751.3145407-2-foxywang@tencent.com [sean: init refcount after IRQ routing, fix stub, massage changelog] Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-06-11 14:18:34 -07:00
Breno Leitao	49f683b41f	KVM: Fix a data race on last_boosted_vcpu in kvm_vcpu_on_spin() Use {READ,WRITE}_ONCE() to access kvm->last_boosted_vcpu to ensure the loads and stores are atomic. In the extremely unlikely scenario the compiler tears the stores, it's theoretically possible for KVM to attempt to get a vCPU using an out-of-bounds index, e.g. if the write is split into multiple 8-bit stores, and is paired with a 32-bit load on a VM with 257 vCPUs: CPU0 CPU1 last_boosted_vcpu = 0xff; (last_boosted_vcpu = 0x100) last_boosted_vcpu[15:8] = 0x01; i = (last_boosted_vcpu = 0x1ff) last_boosted_vcpu[7:0] = 0x00; vcpu = kvm->vcpu_array[0x1ff]; As detected by KCSAN: BUG: KCSAN: data-race in kvm_vcpu_on_spin [kvm] / kvm_vcpu_on_spin [kvm] write to 0xffffc90025a92344 of 4 bytes by task 4340 on cpu 16: kvm_vcpu_on_spin (arch/x86/kvm/../../../virt/kvm/kvm_main.c:4112) kvm handle_pause (arch/x86/kvm/vmx/vmx.c:5929) kvm_intel vmx_handle_exit (arch/x86/kvm/vmx/vmx.c:? arch/x86/kvm/vmx/vmx.c:6606) kvm_intel vcpu_run (arch/x86/kvm/x86.c:11107 arch/x86/kvm/x86.c:11211) kvm kvm_arch_vcpu_ioctl_run (arch/x86/kvm/x86.c:?) kvm kvm_vcpu_ioctl (arch/x86/kvm/../../../virt/kvm/kvm_main.c:?) kvm __se_sys_ioctl (fs/ioctl.c:52 fs/ioctl.c:904 fs/ioctl.c:890) __x64_sys_ioctl (fs/ioctl.c:890) x64_sys_call (arch/x86/entry/syscall_64.c:33) do_syscall_64 (arch/x86/entry/common.c:?) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130) read to 0xffffc90025a92344 of 4 bytes by task 4342 on cpu 4: kvm_vcpu_on_spin (arch/x86/kvm/../../../virt/kvm/kvm_main.c:4069) kvm handle_pause (arch/x86/kvm/vmx/vmx.c:5929) kvm_intel vmx_handle_exit (arch/x86/kvm/vmx/vmx.c:? arch/x86/kvm/vmx/vmx.c:6606) kvm_intel vcpu_run (arch/x86/kvm/x86.c:11107 arch/x86/kvm/x86.c:11211) kvm kvm_arch_vcpu_ioctl_run (arch/x86/kvm/x86.c:?) kvm kvm_vcpu_ioctl (arch/x86/kvm/../../../virt/kvm/kvm_main.c:?) kvm __se_sys_ioctl (fs/ioctl.c:52 fs/ioctl.c:904 fs/ioctl.c:890) __x64_sys_ioctl (fs/ioctl.c:890) x64_sys_call (arch/x86/entry/syscall_64.c:33) do_syscall_64 (arch/x86/entry/common.c:?) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130) value changed: 0x00000012 -> 0x00000000 Fixes: `217ece6129` ("KVM: use yield_to instead of sleep in kvm_vcpu_on_spin") Cc: stable@vger.kernel.org Signed-off-by: Breno Leitao <leitao@debian.org> Link: https://lore.kernel.org/r/20240510092353.2261824-1-leitao@debian.org Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-06-05 06:16:11 -07:00
Paolo Bonzini	ab978c62e7	Merge branch 'kvm-6.11-sev-snp' into HEAD Pull base x86 KVM support for running SEV-SNP guests from Michael Roth: * add some basic infrastructure and introduces a new KVM_X86_SNP_VM vm_type to handle differences versus the existing KVM_X86_SEV_VM and KVM_X86_SEV_ES_VM types. * implement the KVM API to handle the creation of a cryptographic launch context, encrypt/measure the initial image into guest memory, and finalize it before launching it. * implement handling for various guest-generated events such as page state changes, onlining of additional vCPUs, etc. * implement the gmem/mmu hooks needed to prepare gmem-allocated pages before mapping them into guest private memory ranges as well as cleaning them up prior to returning them to the host for use as normal memory. Because those cleanup hooks supplant certain activities like issuing WBINVDs during KVM MMU invalidations, avoid duplicating that work to avoid unecessary overhead. This merge leaves out support support for attestation guest requests and for loading the signing keys to be used for attestation requests.	2024-06-03 13:19:46 -04:00
Sean Christopherson	778c350eb5	Revert "KVM: async_pf: avoid recursive flushing of work items" Now that KVM does NOT gift async #PF workers a "struct kvm" reference, don't bother skipping "done" workers when flushing/canceling queued workers, as the deadlock that was being fudged around can no longer occur. When workers, i.e. async_pf_execute(), were gifted a referenced, it was possible for a worker to put the last reference and trigger VM destruction, i.e. trigger flushing of a workqueue from a worker in said workqueue. Note, there is no actual lock, the deadlock was that a worker will be stuck waiting for itself (the workqueue code simulates a lock/unlock via lock_map_{acquire,release}()). Skipping "done" workers isn't problematic per se, but using work->vcpu as a "done" flag is confusing, e.g. it's not clear that async_pf.lock is acquired to protect the work->vcpu, NOT the processing of async_pf.queue (which is protected by vcpu->mutex). This reverts commit `22583f0d9c`. Suggested-by: Xu Yilun <yilun.xu@linux.intel.com> Link: https://lore.kernel.org/r/20240423191649.2885257-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-06-03 08:55:55 -07:00
Parshuram Sangle	aeb1b22a3a	KVM: Enable halt polling shrink parameter by default Default halt_poll_ns_shrink value of 0 always resets polling interval to 0 on an un-successful poll where vcpu wakeup is not received. This is mostly to avoid pointless polling for more number of shorter intervals. But disabled shrink assumes vcpu wakeup is less likely to be received in subsequent shorter polling intervals. Another side effect of 0 shrink value is that, even on a successful poll if total block time was greater than current polling interval, the polling interval starts over from 0 instead of shrinking by a factor. Enabling shrink with value of 2 allows the polling interval to gradually decrement in case of un-successful poll events as well. This gives a fair chance for successful polling events in subsequent polling intervals rather than resetting it to 0 and starting over from grow_start. Below kvm stat log snippet shows interleaved growth and shrinking of polling interval: 87162647182125: kvm_halt_poll_ns: vcpu 0: halt_poll_ns 10000 (grow 0) 87162647637763: kvm_halt_poll_ns: vcpu 0: halt_poll_ns 20000 (grow 10000) 87162649627943: kvm_halt_poll_ns: vcpu 0: halt_poll_ns 40000 (grow 20000) 87162650892407: kvm_halt_poll_ns: vcpu 0: halt_poll_ns 20000 (shrink 40000) 87162651540378: kvm_halt_poll_ns: vcpu 0: halt_poll_ns 40000 (grow 20000) 87162652276768: kvm_halt_poll_ns: vcpu 0: halt_poll_ns 20000 (shrink 40000) 87162652515037: kvm_halt_poll_ns: vcpu 0: halt_poll_ns 40000 (grow 20000) 87162653383787: kvm_halt_poll_ns: vcpu 0: halt_poll_ns 20000 (shrink 40000) 87162653627670: kvm_halt_poll_ns: vcpu 0: halt_poll_ns 10000 (shrink 20000) 87162653796321: kvm_halt_poll_ns: vcpu 0: halt_poll_ns 20000 (grow 10000) 87162656171645: kvm_halt_poll_ns: vcpu 0: halt_poll_ns 10000 (shrink 20000) 87162661607487: kvm_halt_poll_ns: vcpu 0: halt_poll_ns 0 (shrink 10000) Having both grow and shrink enabled creates a balance in polling interval growth and shrink behavior. Tests show improved successful polling attempt ratio which contribute to VM performance. Power penalty is quite negligible as shrunk polling intervals create bursts of very short durations. Performance assessment results show 3-6% improvements in CPU+GPU, Memory and Storage Android VM workloads whereas 5-9% improvement in average FPS of gaming VM workloads. Power penalty is below 1% where host OS is either idle or running a native workload having 2 VMs enabled. CPU/GPU intensive gaming workloads as well do not show any increased power overhead with shrink enabled. Co-developed-by: Rajendran Jaishankar <jaishankar.rajendran@intel.com> Signed-off-by: Rajendran Jaishankar <jaishankar.rajendran@intel.com> Signed-off-by: Parshuram Sangle <parshuram.sangle@intel.com> Link: https://lore.kernel.org/r/20231102154628.2120-2-parshuram.sangle@intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-06-03 08:49:10 -07:00
Borislav Petkov	96a02b9fa9	KVM: Unexport kvm_debugfs_dir After `faf01aef05` ("KVM: PPC: Merge powerpc's debugfs entry content into generic entry") kvm_debugfs_dir is not used anywhere else outside of kvm_main.c Unexport it and make it static. Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20240515150804.9354-1-bp@kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-06-03 08:22:59 -07:00
Linus Torvalds	61307b7be4	Merge tag 'mm-stable-2024-05-17-19-19' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull mm updates from Andrew Morton: "The usual shower of singleton fixes and minor series all over MM, documented (hopefully adequately) in the respective changelogs. Notable series include: - Lucas Stach has provided some page-mapping cleanup/consolidation/ maintainability work in the series "mm/treewide: Remove pXd_huge() API". - In the series "Allow migrate on protnone reference with MPOL_PREFERRED_MANY policy", Donet Tom has optimized mempolicy's MPOL_PREFERRED_MANY mode, yielding almost doubled performance in one test. - In their series "Memory allocation profiling" Kent Overstreet and Suren Baghdasaryan have contributed a means of determining (via /proc/allocinfo) whereabouts in the kernel memory is being allocated: number of calls and amount of memory. - Matthew Wilcox has provided the series "Various significant MM patches" which does a number of rather unrelated things, but in largely similar code sites. - In his series "mm: page_alloc: freelist migratetype hygiene" Johannes Weiner has fixed the page allocator's handling of migratetype requests, with resulting improvements in compaction efficiency. - In the series "make the hugetlb migration strategy consistent" Baolin Wang has fixed a hugetlb migration issue, which should improve hugetlb allocation reliability. - Liu Shixin has hit an I/O meltdown caused by readahead in a memory-tight memcg. Addressed in the series "Fix I/O high when memory almost met memcg limit". - In the series "mm/filemap: optimize folio adding and splitting" Kairui Song has optimized pagecache insertion, yielding ~10% performance improvement in one test. - Baoquan He has cleaned up and consolidated the early zone initialization code in the series "mm/mm_init.c: refactor free_area_init_core()". - Baoquan has also redone some MM initializatio code in the series "mm/init: minor clean up and improvement". - MM helper cleanups from Christoph Hellwig in his series "remove follow_pfn". - More cleanups from Matthew Wilcox in the series "Various page->flags cleanups". - Vlastimil Babka has contributed maintainability improvements in the series "memcg_kmem hooks refactoring". - More folio conversions and cleanups in Matthew Wilcox's series: "Convert huge_zero_page to huge_zero_folio" "khugepaged folio conversions" "Remove page_idle and page_young wrappers" "Use folio APIs in procfs" "Clean up __folio_put()" "Some cleanups for memory-failure" "Remove page_mapping()" "More folio compat code removal" - David Hildenbrand chipped in with "fs/proc/task_mmu: convert hugetlb functions to work on folis". - Code consolidation and cleanup work related to GUP's handling of hugetlbs in Peter Xu's series "mm/gup: Unify hugetlb, part 2". - Rick Edgecombe has developed some fixes to stack guard gaps in the series "Cover a guard gap corner case". - Jinjiang Tu has fixed KSM's behaviour after a fork+exec in the series "mm/ksm: fix ksm exec support for prctl". - Baolin Wang has implemented NUMA balancing for multi-size THPs. This is a simple first-cut implementation for now. The series is "support multi-size THP numa balancing". - Cleanups to vma handling helper functions from Matthew Wilcox in the series "Unify vma_address and vma_pgoff_address". - Some selftests maintenance work from Dev Jain in the series "selftests/mm: mremap_test: Optimizations and style fixes". - Improvements to the swapping of multi-size THPs from Ryan Roberts in the series "Swap-out mTHP without splitting". - Kefeng Wang has significantly optimized the handling of arm64's permission page faults in the series "arch/mm/fault: accelerate pagefault when badaccess" "mm: remove arch's private VM_FAULT_BADMAP/BADACCESS" - GUP cleanups from David Hildenbrand in "mm/gup: consistently call it GUP-fast". - hugetlb fault code cleanups from Vishal Moola in "Hugetlb fault path to use struct vm_fault". - selftests build fixes from John Hubbard in the series "Fix selftests/mm build without requiring "make headers"". - Memory tiering fixes/improvements from Ho-Ren (Jack) Chuang in the series "Improved Memory Tier Creation for CPUless NUMA Nodes". Fixes the initialization code so that migration between different memory types works as intended. - David Hildenbrand has improved follow_pte() and fixed an errant driver in the series "mm: follow_pte() improvements and acrn follow_pte() fixes". - David also did some cleanup work on large folio mapcounts in his series "mm: mapcount for large folios + page_mapcount() cleanups". - Folio conversions in KSM in Alex Shi's series "transfer page to folio in KSM". - Barry Song has added some sysfs stats for monitoring multi-size THP's in the series "mm: add per-order mTHP alloc and swpout counters". - Some zswap cleanups from Yosry Ahmed in the series "zswap same-filled and limit checking cleanups". - Matthew Wilcox has been looking at buffer_head code and found the documentation to be lacking. The series is "Improve buffer head documentation". - Multi-size THPs get more work, this time from Lance Yang. His series "mm/madvise: enhance lazyfreeing with mTHP in madvise_free" optimizes the freeing of these things. - Kemeng Shi has added more userspace-visible writeback instrumentation in the series "Improve visibility of writeback". - Kemeng Shi then sent some maintenance work on top in the series "Fix and cleanups to page-writeback". - Matthew Wilcox reduces mmap_lock traffic in the anon vma code in the series "Improve anon_vma scalability for anon VMAs". Intel's test bot reported an improbable 3x improvement in one test. - SeongJae Park adds some DAMON feature work in the series "mm/damon: add a DAMOS filter type for page granularity access recheck" "selftests/damon: add DAMOS quota goal test" - Also some maintenance work in the series "mm/damon/paddr: simplify page level access re-check for pageout" "mm/damon: misc fixes and improvements" - David Hildenbrand has disabled some known-to-fail selftests ni the series "selftests: mm: cow: flag vmsplice() hugetlb tests as XFAIL". - memcg metadata storage optimizations from Shakeel Butt in "memcg: reduce memory consumption by memcg stats". - DAX fixes and maintenance work from Vishal Verma in the series "dax/bus.c: Fixups for dax-bus locking"" * tag 'mm-stable-2024-05-17-19-19' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (426 commits) memcg, oom: cleanup unused memcg_oom_gfp_mask and memcg_oom_order selftests/mm: hugetlb_madv_vs_map: avoid test skipping by querying hugepage size at runtime mm/hugetlb: add missing VM_FAULT_SET_HINDEX in hugetlb_wp mm/hugetlb: add missing VM_FAULT_SET_HINDEX in hugetlb_fault selftests: cgroup: add tests to verify the zswap writeback path mm: memcg: make alloc_mem_cgroup_per_node_info() return bool mm/damon/core: fix return value from damos_wmark_metric_value mm: do not update memcg stats for NR_{FILE/SHMEM}_PMDMAPPED selftests: cgroup: remove redundant enabling of memory controller Docs/mm/damon/maintainer-profile: allow posting patches based on damon/next tree Docs/mm/damon/maintainer-profile: change the maintainer's timezone from PST to PT Docs/mm/damon/design: use a list for supported filters Docs/admin-guide/mm/damon/usage: fix wrong schemes effective quota update command Docs/admin-guide/mm/damon/usage: fix wrong example of DAMOS filter matching sysfs file selftests/damon: classify tests for functionalities and regressions selftests/damon/_damon_sysfs: use 'is' instead of '==' for 'None' selftests/damon/_damon_sysfs: find sysfs mount point from /proc/mounts selftests/damon/_damon_sysfs: check errors from nr_schemes file reads mm/damon/core: initialize ->esz_bp from damos_quota_init_priv() selftests/damon: add a test for DAMOS quota goal ...	2024-05-19 09:21:03 -07:00
Michael Roth	4f2e7aa1cf	KVM: SEV: Implement gmem hook for initializing private pages This will handle the RMP table updates needed to put a page into a private state before mapping it into an SEV-SNP guest. Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Michael Roth <michael.roth@amd.com> Message-ID: <20240501085210.2213060-14-michael.roth@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-05-12 04:09:32 -04:00
Paolo Bonzini	7323260373	Merge branch 'kvm-coco-hooks' into HEAD Common patches for the target-independent functionality and hooks that are needed by SEV-SNP and TDX.	2024-05-12 04:07:01 -04:00
Paolo Bonzini	7d41e24da2	Merge tag 'kvm-x86-misc-6.10' of https://github.com/kvm-x86/linux into HEAD KVM x86 misc changes for 6.10: - Advertise the max mappable GPA in the "guest MAXPHYADDR" CPUID field, which is unused by hardware, so that KVM can communicate its inability to map GPAs that set bits 51:48 due to lack of 5-level paging. Guest firmware is expected to use the information to safely remap BARs in the uppermost GPA space, i.e to avoid placing a BAR at a legal, but unmappable, GPA. - Use vfree() instead of kvfree() for allocations that always use vcalloc() or __vcalloc(). - Don't completely ignore same-value writes to immutable feature MSRs, as doing so results in KVM failing to reject accesses to MSR that aren't supposed to exist given the vCPU model and/or KVM configuration. - Don't mark APICv as being inhibited due to ABSENT if APICv is disabled KVM-wide to avoid confusing debuggers (KVM will never bother clearing the ABSENT inhibit, even if userspace enables in-kernel local APIC).	2024-05-12 03:18:44 -04:00
Paolo Bonzini	f4bc1373d5	Merge tag 'kvm-x86-generic-6.10' of https://github.com/kvm-x86/linux into HEAD KVM cleanups for 6.10: - Misc cleanups extracted from the "exit on missing userspace mapping" series, which has been put on hold in anticipation of a "KVM Userfault" approach, which should provide a superset of functionality. - Remove kvm_make_all_cpus_request_except(), which got added to hack around an AVIC bug, and then became dead code when a more robust fix came along. - Fix a goof in the KVM_CREATE_GUEST_MEMFD documentation.	2024-05-12 03:16:47 -04:00
Paolo Bonzini	e5f62e27b1	Merge tag 'kvmarm-6.10-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD KVM/arm64 updates for Linux 6.10 - Move a lot of state that was previously stored on a per vcpu basis into a per-CPU area, because it is only pertinent to the host while the vcpu is loaded. This results in better state tracking, and a smaller vcpu structure. - Add full handling of the ERET/ERETAA/ERETAB instructions in nested virtualisation. The last two instructions also require emulating part of the pointer authentication extension. As a result, the trap handling of pointer authentication has been greattly simplified. - Turn the global (and not very scalable) LPI translation cache into a per-ITS, scalable cache, making non directly injected LPIs much cheaper to make visible to the vcpu. - A batch of pKVM patches, mostly fixes and cleanups, as the upstreaming process seems to be resuming. Fingers crossed! - Allocate PPIs and SGIs outside of the vcpu structure, allowing for smaller EL2 mapping and some flexibility in implementing more or less than 32 private IRQs. - Purge stale mpidr_data if a vcpu is created after the MPIDR map has been created. - Preserve vcpu-specific ID registers across a vcpu reset. - Various minor cleanups and improvements.	2024-05-12 03:15:53 -04:00
Paolo Bonzini	4232da23d7	Merge tag 'loongarch-kvm-6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson into HEAD LoongArch KVM changes for v6.10 1. Add ParaVirt IPI support. 2. Add software breakpoint support. 3. Add mmio trace events support.	2024-05-10 13:20:18 -04:00
Michael Roth	a90764f0e4	KVM: guest_memfd: Add hook for invalidating memory In some cases, like with SEV-SNP, guest memory needs to be updated in a platform-specific manner before it can be safely freed back to the host. Wire up arch-defined hooks to the .free_folio kvm_gmem_aops callback to allow for special handling of this sort when freeing memory in response to FALLOC_FL_PUNCH_HOLE operations and when releasing the inode, and go ahead and define an arch-specific hook for x86 since it will be needed for handling memory used for SEV-SNP guests. Signed-off-by: Michael Roth <michael.roth@amd.com> Message-Id: <20231230172351.574091-6-michael.roth@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-05-10 13:11:48 -04:00
Paolo Bonzini	1f6c06b177	KVM: guest_memfd: Add interface for populating gmem pages with user data During guest run-time, kvm_arch_gmem_prepare() is issued as needed to prepare newly-allocated gmem pages prior to mapping them into the guest. In the case of SEV-SNP, this mainly involves setting the pages to private in the RMP table. However, for the GPA ranges comprising the initial guest payload, which are encrypted/measured prior to starting the guest, the gmem pages need to be accessed prior to setting them to private in the RMP table so they can be initialized with the userspace-provided data. Additionally, an SNP firmware call is needed afterward to encrypt them in-place and measure the contents into the guest's launch digest. While it is possible to bypass the kvm_arch_gmem_prepare() hooks so that this handling can be done in an open-coded/vendor-specific manner, this may expose more gmem-internal state/dependencies to external callers than necessary. Try to avoid this by implementing an interface that tries to handle as much of the common functionality inside gmem as possible, while also making it generic enough to potentially be usable/extensible for TDX as well. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Michael Roth <michael.roth@amd.com> Co-developed-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-05-10 13:11:47 -04:00
Paolo Bonzini	17573fd971	KVM: guest_memfd: extract __kvm_gmem_get_pfn() In preparation for adding a function that walks a set of pages provided by userspace and populates them in a guest_memfd, add a version of kvm_gmem_get_pfn() that has a "bool prepare" argument and passes it down to kvm_gmem_get_folio(). Populating guest memory has to call repeatedly __kvm_gmem_get_pfn() on the same file, so make the new function take struct file*. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-05-10 13:11:47 -04:00
Paolo Bonzini	3bb2531e20	KVM: guest_memfd: Add hook for initializing memory guest_memfd pages are generally expected to be in some arch-defined initial state prior to using them for guest memory. For SEV-SNP this initial state is 'private', or 'guest-owned', and requires additional operations to move these pages into a 'private' state by updating the corresponding entries the RMP table. Allow for an arch-defined hook to handle updates of this sort, and go ahead and implement one for x86 so KVM implementations like AMD SVM can register a kvm_x86_ops callback to handle these updates for SEV-SNP guests. The preparation callback is always called when allocating/grabbing folios via gmem, and it is up to the architecture to keep track of whether or not the pages are already in the expected state (e.g. the RMP table in the case of SEV-SNP). In some cases, it is necessary to defer the preparation of the pages to handle things like in-place encryption of initial guest memory payloads before marking these pages as 'private'/'guest-owned'. Add an argument (always true for now) to kvm_gmem_get_folio() that allows for the preparation callback to be bypassed. To detect possible issues in the way userspace initializes memory, it is only possible to add an unprepared page if it is not already included in the filemap. Link: https://lore.kernel.org/lkml/ZLqVdvsF11Ddo7Dq@google.com/ Co-developed-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Michael Roth <michael.roth@amd.com> Message-Id: <20231230172351.574091-5-michael.roth@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-05-10 13:11:46 -04:00
Paolo Bonzini	fa30b0dc91	KVM: guest_memfd: limit overzealous WARN Because kvm_gmem_get_pfn() is called from the page fault path without any of the slots_lock, filemap lock or mmu_lock taken, it is possible for it to race with kvm_gmem_unbind(). This is not a problem, as any PTE that is installed temporarily will be zapped before the guest has the occasion to run. However, it is not possible to have a complete unbind+bind racing with the page fault, because deleting the memslot will call synchronize_srcu_expedited() and wait for the page fault to be resolved. Thus, we can still warn if the file is there and is not the one we expect. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-05-10 13:11:46 -04:00
Paolo Bonzini	7062372377	KVM: guest_memfd: pass error up from filemap_grab_folio Some SNP ioctls will require the page not to be in the pagecache, and as such they will want to return EEXIST to userspace. Start by passing the error up from filemap_grab_folio. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-05-10 13:11:45 -04:00
Michael Roth	1d23040caa	KVM: guest_memfd: Use AS_INACCESSIBLE when creating guest_memfd inode truncate_inode_pages_range() may attempt to zero pages before truncating them, and this will occur before arch-specific invalidations can be triggered via .invalidate_folio/.free_folio hooks via kvm_gmem_aops. For AMD SEV-SNP this would result in an RMP #PF being generated by the hardware, which is currently treated as fatal (and even if specifically allowed for, would not result in anything other than garbage being written to guest pages due to encryption). On Intel TDX this would also result in undesirable behavior. Set the AS_INACCESSIBLE flag to prevent the MM from attempting unexpected accesses of this sort during operations like truncation. This may also in some cases yield a decent performance improvement for guest_memfd userspace implementations that hole-punch ranges immediately after private->shared conversions via KVM_SET_MEMORY_ATTRIBUTES, since the current implementation of truncate_inode_pages_range() always ends up zero'ing an entire 4K range if it is backing by a 2M folio. Link: https://lore.kernel.org/lkml/ZR9LYhpxTaTk6PJX@google.com/ Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Michael Roth <michael.roth@amd.com> Message-ID: <20240329212444.395559-6-michael.roth@amd.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-05-10 13:11:45 -04:00
David Hildenbrand	29ae7d96d1	mm: pass VMA instead of MM to follow_pte() ... and centralize the VM_IO/VM_PFNMAP sanity check in there. We'll now also perform these sanity checks for direct follow_pte() invocations. For generic_access_phys(), we might now check multiple times: nothing to worry about, really. Link: https://lkml.kernel.org/r/20240410155527.474777-3-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Sean Christopherson <seanjc@google.com> [KVM] Cc: Alex Williamson <alex.williamson@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Fei Li <fei1.li@intel.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Yonghua Huang <yonghua.huang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-05 17:53:27 -07:00
Venkatesh Srinivas	82e9c84d87	KVM: Remove kvm_make_all_cpus_request_except() Remove kvm_make_all_cpus_request_except() as it effectively has no users, and arguably should never have been added in the first place. Commit `54163a346d` ("KVM: Introduce kvm_make_all_cpus_request_except()") added the "except" variation for use in SVM's AVIC update path, which used it to skip sending a request to the current vCPU (commit `7d611233b0` ("KVM: SVM: Disable AVIC before setting V_IRQ")). But the AVIC usage of kvm_make_all_cpus_request_except() was essentially a hack-a-fix that simply squashed the most likely scenario of a racy WARN without addressing the underlying problem(s). Commit `f1577ab214` ("KVM: SVM: svm_set_vintr don't warn if AVIC is active but is about to be deactivated") eventually fixed the WARN itself, and the "except" usage was subsequently dropped by `df63202fe5` ("KVM: x86: APICv: drop immediate APICv disablement on current vCPU"). That kvm_make_all_cpus_request_except() hasn't gained any users in the last ~3 years isn't a coincidence. If a VM-wide broadcast needs to skip the current vCPU, then odds are very good that there is underlying bug that could be better fixed elsewhere. Signed-off-by: Venkatesh Srinivas <venkateshs@chromium.org> Link: https://lore.kernel.org/r/20240404232651.1645176-1-venkateshs@chromium.org [sean: rewrite changelog with --verbose] Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-05-02 07:47:03 -07:00
Oliver Upton	ea54dd3742	KVM: Treat the device list as an rculist A subsequent change to KVM/arm64 will necessitate walking the device list outside of the kvm->lock. Prepare by converting to an rculist. This has zero effect on the VM destruction path, as it is expected every reader is backed by a reference on the kvm struct. On the other hand, ensure a given device is completely destroyed before dropping the kvm->lock in the release() path, as certain devices expect to be a singleton (e.g. the vfio-kvm device). Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Sean Christopherson <seanjc@google.com> Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Reviewed-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20240422200158.2606761-2-oliver.upton@linux.dev Signed-off-by: Marc Zyngier <maz@kernel.org>	2024-04-25 13:19:55 +01:00
Sean Christopherson	c23e2b7103	KVM: Allow page-sized MMU caches to be initialized with custom 64-bit values Add support to MMU caches for initializing a page with a custom 64-bit value, e.g. to pre-fill an entire page table with non-zero PTE values. The functionality will be used by x86 to support Intel's TDX, which needs to set bit 63 in all non-present PTEs in order to prevent !PRESENT page faults from getting reflected into the guest (Intel's EPT Violation #VE architecture made the less than brilliant decision of having the per-PTE behavior be opt-out instead of opt-in). Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Message-Id: <5919f685f109a1b0ebc6bd8fc4536ee94bcc172d.1705965635.git.isaku.yamahata@intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-04-19 12:15:18 -04:00
Sean Christopherson	eefb85b3f0	KVM: Drop unused @may_block param from gfn_to_pfn_cache_invalidate_start() Remove gfn_to_pfn_cache_invalidate_start()'s unused @may_block parameter, which was leftover from KVM's abandoned (for now) attempt to support guest usage of gfn_to_pfn caches. Fixes: `a4bff3df51` ("KVM: pfncache: remove KVM_GUEST_USES_PFN usage") Reported-by: Like Xu <like.xu.linux@gmail.com> Cc: Paul Durrant <paul@xen.org> Cc: David Woodhouse <dwmw2@infradead.org> Reviewed-by: Paul Durrant <paul@xen.org> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Link: https://lore.kernel.org/r/20240305003742.245767-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-04-11 12:58:53 -07:00
Paolo Bonzini	5257de954c	KVM: remove unused argument of kvm_handle_hva_range() The only user was kvm_mmu_notifier_change_pte(), which is now gone. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> Message-ID: <20240405115815.3226315-3-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-04-11 13:18:35 -04:00
Paolo Bonzini	f3b65bbaed	KVM: delete .change_pte MMU notifier callback The .change_pte() MMU notifier callback was intended as an optimization. The original point of it was that KSM could tell KVM to flip its secondary PTE to a new location without having to first zap it. At the time there was also an .invalidate_page() callback; both of them were not bracketed by calls to mmu_notifier_invalidate_range_{start,end}(), and .invalidate_page() also doubled as a fallback implementation of .change_pte(). Later on, however, both callbacks were changed to occur within an invalidate_range_start/end() block. In the case of .change_pte(), commit `6bdb913f0a` ("mm: wrap calls to set_pte_at_notify with invalidate_range_start and invalidate_range_end", 2012-10-09) did so to remove the fallback from .invalidate_page() to .change_pte() and allow sleepable .invalidate_page() hooks. This however made KVM's usage of the .change_pte() callback completely moot, because KVM unmaps the sPTEs during .invalidate_range_start() and therefore .change_pte() has no hope of finding a sPTE to change. Drop the generic KVM code that dispatches to kvm_set_spte_gfn(), as well as all the architecture specific implementations. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Acked-by: Anup Patel <anup@brainfault.org> Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc) Reviewed-by: Bibo Mao <maobibo@loongson.cn> Message-ID: <20240405115815.3226315-2-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-04-11 13:18:27 -04:00
Anish Moorthy	f588557ac4	KVM: Simplify error handling in __gfn_to_pfn_memslot() KVM_HVA_ERR_RO_BAD satisfies kvm_is_error_hva(), so there's no need to duplicate the "if (writable)" block. Fix this by bringing all kvm_is_error_hva() cases under one conditional. Signed-off-by: Anish Moorthy <amoorthy@google.com> Link: https://lore.kernel.org/r/20240215235405.368539-5-amoorthy@google.com [sean: use ternary operator] Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-04-09 14:10:08 -07:00
Anish Moorthy	a3bd2f7ead	KVM: Add function comments for __kvm_read/write_guest_page() The (gfn, data, offset, len) order of parameters is a little strange since "offset" applies to "gfn" rather than to "data". Add function comments to make things perfectly clear. Signed-off-by: Anish Moorthy <amoorthy@google.com> Link: https://lore.kernel.org/r/20240215235405.368539-3-amoorthy@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-04-09 14:00:48 -07:00
Anish Moorthy	ed2f049fc1	KVM: Clarify meaning of hva_to_pfn()'s 'atomic' parameter The current description can be read as "atomic -> allowed to sleep," when in fact the intended statement is "atomic -> NOT allowed to sleep." Make that clearer in the docstring. Signed-off-by: Anish Moorthy <amoorthy@google.com> Link: https://lore.kernel.org/r/20240215235405.368539-2-amoorthy@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-04-09 14:00:48 -07:00
Li RongQing	a952d608f0	KVM: Use vfree for memory allocated by vcalloc()/__vcalloc() commit 37b2a6510a48("KVM: use __vcalloc for very large allocations") replaced kvzalloc()/kvcalloc() with vcalloc(), but didn't replace kvfree() with vfree(). Signed-off-by: Li RongQing <lirongqing@baidu.com> Link: https://lore.kernel.org/r/20240131012357.53563-1-lirongqing@baidu.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-04-09 12:18:38 -07:00
Sean Christopherson	fc62a4e8de	KVM: Explicitly disallow activatating a gfn_to_pfn_cache with INVALID_GPA Explicit disallow activating a gfn_to_pfn_cache with an error gpa, i.e. INVALID_GPA, to ensure that KVM doesn't mistake a GPA-based cache for an HVA-based cache (KVM uses INVALID_GPA as a magic value to differentiate between GPA-based and HVA-based caches). WARN if KVM attempts to activate a cache with INVALID_GPA, purely so that new caches need to at least consider what to do with a "bad" GPA, as all existing usage of kvm_gpc_activate() guarantees gpa != INVALID_GPA. I.e. removing the WARN in the future is completely reasonable if doing so would yield cleaner/better code overall. Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Paul Durrant <paul@xen.org> Link: https://lore.kernel.org/r/20240320001542.3203871-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-04-08 13:20:24 -07:00
Sean Christopherson	5c9ca4ed89	KVM: Check validity of offset+length of gfn_to_pfn_cache prior to activation When activating a gfn_to_pfn_cache, verify that the offset+length is sane and usable before marking the cache active. Letting __kvm_gpc_refresh() detect the problem results in a cache being marked active without setting the GPA (or any other fields), which in turn results in KVM trying to refresh a cache with INVALID_GPA. Attempting to refresh a cache with INVALID_GPA isn't functionally problematic, but it runs afoul of the sanity check that exactly one of GPA or userspace HVA is valid, i.e. that a cache is either GPA-based or HVA-based. Reported-by: syzbot+106a4f72b0474e1d1b33@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/0000000000005fa5cc0613f1cebd@google.com Fixes: `721f5b0dda` ("KVM: pfncache: allow a cache to be activated with a fixed (userspace) HVA") Cc: David Woodhouse <dwmw2@infradead.org> Cc: Paul Durrant <paul@xen.org> Reviewed-by: Paul Durrant <paul@xen.org> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Link: https://lore.kernel.org/r/20240320001542.3203871-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-04-08 13:20:24 -07:00
Sean Christopherson	18f06e9769	KVM: Add helpers to consolidate gfn_to_pfn_cache's page split check Add a helper to check that the incoming length for a gfn_to_pfn_cache is valid with respect to the cache's GPA and/or HVA. To avoid activating a cache with a bogus GPA, a future fix will fork the page split check in the inner refresh path into activate() and the public rerfresh() APIs, at which point KVM will check the length in three separate places. Deliberately keep the "page offset" logic open coded, as the only other path that consumes the offset, __kvm_gpc_refresh(), already needs to differentiate between GPA-based and HVA-based caches, and it's not obvious that using a helper is a net positive in overall code readability. Note, for GPA-based caches, this has a subtle side effect of using the GPA instead of the resolved HVA in the check() path, but that should be a nop as the HVA offset is derived from the GPA, i.e. the two offsets are identical, barring a KVM bug. Reviewed-by: Paul Durrant <paul@xen.org> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Link: https://lore.kernel.org/r/20240320001542.3203871-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-04-08 13:20:23 -07:00
Paolo Bonzini	e9a2bba476	Merge tag 'kvm-x86-xen-6.9' of https://github.com/kvm-x86/linux into HEAD KVM Xen and pfncache changes for 6.9: - Rip out the half-baked support for using gfn_to_pfn caches to manage pages that are "mapped" into guests via physical addresses. - Add support for using gfn_to_pfn caches with only a host virtual address, i.e. to bypass the "gfn" stage of the cache. The primary use case is overlay pages, where the guest may change the gfn used to reference the overlay page, but the backing hva+pfn remains the same. - Add an ioctl() to allow mapping Xen's shared_info page using an hva instead of a gpa, so that userspace doesn't need to reconfigure and invalidate the cache/mapping if the guest changes the gpa (but userspace keeps the resolved hva the same). - When possible, use a single host TSC value when computing the deadline for Xen timers in order to improve the accuracy of the timer emulation. - Inject pending upcall events when the vCPU software-enables its APIC to fix a bug where an upcall can be lost (and to follow Xen's behavior). - Fall back to the slow path instead of warning if "fast" IRQ delivery of Xen events fails, e.g. if the guest has aliased xAPIC IDs. - Extend gfn_to_pfn_cache's mutex to cover (de)activation (in addition to refresh), and drop a now-redundant acquisition of xen_lock (that was protecting the shared_info cache) to fix a deadlock due to recursively acquiring xen_lock.	2024-03-11 10:42:55 -04:00
Paolo Bonzini	c9cd0beae9	Merge tag 'kvm-x86-misc-6.9' of https://github.com/kvm-x86/linux into HEAD KVM x86 misc changes for 6.9: - Explicitly initialize a variety of on-stack variables in the emulator that triggered KMSAN false positives (though in fairness in KMSAN, it's comically difficult to see that the uninitialized memory is never truly consumed). - Fix the deubgregs ABI for 32-bit KVM, and clean up code related to reading DR6 and DR7. - Rework the "force immediate exit" code so that vendor code ultimately decides how and when to force the exit. This allows VMX to further optimize handling preemption timer exits, and allows SVM to avoid sending a duplicate IPI (SVM also has a need to force an exit). - Fix a long-standing bug where kvm_has_noapic_vcpu could be left elevated if vCPU creation ultimately failed, and add WARN to guard against similar bugs. - Provide a dedicated arch hook for checking if a different vCPU was in-kernel (for directed yield), and simplify the logic for checking if the currently loaded vCPU is in-kernel. - Misc cleanups and fixes.	2024-03-11 10:24:56 -04:00
Paolo Bonzini	507e72f899	Merge tag 'kvm-x86-generic-6.9' of https://github.com/kvm-x86/linux into HEAD KVM common MMU changes for 6.9: - Harden KVM against underflowing the active mmu_notifier invalidation count, so that "bad" invalidations (usually due to bugs elsehwere in the kernel) are detected earlier and are less likely to hang the kernel. - Fix a benign bug in __kvm_mmu_topup_memory_cache() where the object size and number of objects parameters to kvmalloc_array() were swapped.	2024-03-11 10:23:03 -04:00
Paolo Bonzini	a81d95ae8c	Merge tag 'kvm-x86-asyncpf-6.9' of https://github.com/kvm-x86/linux into HEAD KVM async page fault changes for 6.9: - Always flush the async page fault workqueue when a work item is being removed, especially during vCPU destruction, to ensure that there are no workers running in KVM code when all references to KVM-the-module are gone, i.e. to prevent a use-after-free if kvm.ko is unloaded. - Grab a reference to the VM's mm_struct in the async #PF worker itself instead of gifting the worker a reference, e.g. so that there's no need to remember to conditionally clean up after the worker.	2024-03-11 10:22:41 -04:00
Paolo Bonzini	961e2bfcf3	Merge tag 'kvmarm-6.9' of https://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD KVM/arm64 updates for 6.9 - Infrastructure for building KVM's trap configuration based on the architectural features (or lack thereof) advertised in the VM's ID registers - Support for mapping vfio-pci BARs as Normal-NC (vaguely similar to x86's WC) at stage-2, improving the performance of interacting with assigned devices that can tolerate it - Conversion of KVM's representation of LPIs to an xarray, utilized to address serialization some of the serialization on the LPI injection path - Support for _architectural_ VHE-only systems, advertised through the absence of FEAT_E2H0 in the CPU's ID register - Miscellaneous cleanups, fixes, and spelling corrections to KVM and selftests	2024-03-11 10:02:32 -04:00
Paolo Bonzini	7d8942d8e7	Merge tag 'kvm-x86-guest_memfd_fixes-6.8' of https://github.com/kvm-x86/linux into HEAD KVM GUEST_MEMFD fixes for 6.8: - Make KVM_MEM_GUEST_MEMFD mutually exclusive with KVM_MEM_READONLY to avoid creating ABI that KVM can't sanely support. - Update documentation for KVM_SW_PROTECTED_VM to make it abundantly clear that such VMs are purely a development and testing vehicle, and come with zero guarantees. - Limit KVM_SW_PROTECTED_VM guests to the TDP MMU, as the long term plan is to support confidential VMs with deterministic private memory (SNP and TDX) only in the TDP MMU. - Fix a bug in a GUEST_MEMFD negative test that resulted in false passes when verifying that KVM_MEM_GUEST_MEMFD memslots can't be dirty logged.	2024-03-09 11:48:35 -05:00
David Woodhouse	6addfcf271	KVM: pfncache: simplify locking and make more self-contained The locking on the gfn_to_pfn_cache is... interesting. And awful. There is a rwlock in ->lock which readers take to ensure protection against concurrent changes. But __kvm_gpc_refresh() makes assumptions that certain fields will not change even while it drops the write lock and performs MM operations to revalidate the target PFN and kernel mapping. Commit `93984f19e7` ("KVM: Fully serialize gfn=>pfn cache refresh via mutex") partly addressed that — not by fixing it, but by adding a new mutex, ->refresh_lock. This prevented concurrent __kvm_gpc_refresh() calls on a given gfn_to_pfn_cache, but is still only a partial solution. There is still a theoretical race where __kvm_gpc_refresh() runs in parallel with kvm_gpc_deactivate(). While __kvm_gpc_refresh() has dropped the write lock, kvm_gpc_deactivate() clears the ->active flag and unmaps ->khva. Then __kvm_gpc_refresh() determines that the previous ->pfn and ->khva are still valid, and reinstalls those values into the structure. This leaves the gfn_to_pfn_cache with the ->valid bit set, but ->active clear. And a ->khva which looks like a reasonable kernel address but is actually unmapped. All it takes is a subsequent reactivation to cause that ->khva to be dereferenced. This would theoretically cause an oops which would look something like this: [1724749.564994] BUG: unable to handle page fault for address: ffffaa3540ace0e0 [1724749.565039] RIP: 0010:__kvm_xen_has_interrupt+0x8b/0xb0 I say "theoretically" because theoretically, that oops that was seen in production cannot happen. The code which uses the gfn_to_pfn_cache is supposed to have its own locking, to further paper over the fact that the gfn_to_pfn_cache's own papering-over (->refresh_lock) of its own rwlock abuse is not sufficient. For the Xen vcpu_info that external lock is the vcpu->mutex, and for the shared info it's kvm->arch.xen.xen_lock. Those locks ought to protect the gfn_to_pfn_cache against concurrent deactivation vs. refresh in all but the cases where the vcpu or kvm object is being destroyed, in which case the subsequent reactivation should never happen. Theoretically. Nevertheless, this locking abuse is awful and should be fixed, even if no clear explanation can be found for how the oops happened. So expand the use of the ->refresh_lock mutex to ensure serialization of activate/deactivate vs. refresh and make the pfncache locking entirely self-sufficient. This means that a future commit can simplify the locking in the callers, such as the Xen emulation code which has an outstanding problem with recursive locking of kvm->arch.xen.xen_lock, which will no longer be necessary. The rwlock abuse described above is still not best practice, although it's harmless now that the ->refresh_lock is held for the entire duration while the offending code drops the write lock, does some other stuff, then takes the write lock again and assumes nothing changed. That can also be fixed^W cleaned up in a subsequent commit, but this commit is a simpler basis for the Xen deadlock fix mentioned above. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Paul Durrant <paul@xen.org> Link: https://lore.kernel.org/r/20240227115648.3104-5-dwmw2@infradead.org [sean: use guard(mutex) to fix a missed unlock] Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-03-04 16:22:38 -08:00
Oliver Upton	284851ee5c	KVM: Get rid of return value from kvm_arch_create_vm_debugfs() The general expectation with debugfs is that any initialization failure is nonfatal. Nevertheless, kvm_arch_create_vm_debugfs() allows implementations to return an error and kvm_create_vm_debugfs() allows that to fail VM creation. Change to a void return to discourage architectures from making debugfs failures fatal for the VM. Seems like everyone already had the right idea, as all implementations already return 0 unconditionally. Acked-by: Marc Zyngier <maz@kernel.org> Acked-by: Paolo Bonzini <pbonzini@redhat.com> Link: https://lore.kernel.org/r/20240216155941.2029458-1-oliver.upton@linux.dev Signed-off-by: Oliver Upton <oliver.upton@linux.dev>	2024-02-23 21:44:58 +00:00
Sean Christopherson	e563592224	KVM: Make KVM_MEM_GUEST_MEMFD mutually exclusive with KVM_MEM_READONLY Disallow creating read-only memslots that support GUEST_MEMFD, as GUEST_MEMFD is fundamentally incompatible with KVM's semantics for read-only memslots. Read-only memslots allow the userspace VMM to emulate option ROMs by filling the backing memory with readable, executable code and data, while triggering emulated MMIO on writes. GUEST_MEMFD doesn't currently support writes from userspace and KVM doesn't support emulated MMIO on private accesses, i.e. the guest can only ever read zeros, and writes will always be treated as errors. Cc: Fuad Tabba <tabba@google.com> Cc: Michael Roth <michael.roth@amd.com> Cc: Isaku Yamahata <isaku.yamahata@gmail.com> Cc: Yu Zhang <yu.c.zhang@linux.intel.com> Cc: Chao Peng <chao.p.peng@linux.intel.com> Fixes: `a7800aa80e` ("KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory") Link: https://lore.kernel.org/r/20240222190612.2942589-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-02-22 17:07:06 -08:00
Arnd Bergmann	ea3689d9df	KVM: fix kvm_mmu_memory_cache allocation warning gcc-14 notices that the arguments to kvmalloc_array() are mixed up: arch/x86/kvm/../../../virt/kvm/kvm_main.c: In function '__kvm_mmu_topup_memory_cache': arch/x86/kvm/../../../virt/kvm/kvm_main.c:424:53: error: 'kvmalloc_array' sizes specified with 'sizeof' in the earlier argument and not in the later argument [-Werror=calloc-transposed-args] 424 \| mc->objects = kvmalloc_array(sizeof(void *), capacity, gfp); \| ^~~~ arch/x86/kvm/../../../virt/kvm/kvm_main.c:424:53: note: earlier argument should specify number of elements, later size of each element The code still works correctly, but the incorrect order prevents the compiler from properly tracking the object sizes. Fixes: `837f66c712` ("KVM: Allow for different capacities in kvm_mmu_memory_cache structs") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20240212112419.1186065-1-arnd@kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-02-22 17:02:26 -08:00
Sean Christopherson	dafc17dd52	KVM: Add a comment explaining the directed yield pending interrupt logic Add a comment to explain why KVM treats vCPUs with pending interrupts as in-kernel when a vCPU wants to yield to a vCPU that was preempted while running in kernel mode. Link: https://lore.kernel.org/r/20240110003938.490206-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-02-22 16:27:41 -08:00
Sean Christopherson	77bcd9e623	KVM: Add dedicated arch hook for querying if vCPU was preempted in-kernel Plumb in a dedicated hook for querying whether or not a vCPU was preempted in-kernel. Unlike literally every other architecture, x86's VMX can check if a vCPU is in kernel context if and only if the vCPU is loaded on the current pCPU. x86's kvm_arch_vcpu_in_kernel() works around the limitation by querying kvm_get_running_vcpu() and redirecting to vcpu->arch.preempted_in_kernel as needed. But that's unnecessary, confusing, and fragile, e.g. x86 has had at least one bug where KVM incorrectly used a stale preempted_in_kernel. No functional change intended. Reviewed-by: Yuan Yao <yuan.yao@intel.com> Link: https://lore.kernel.org/r/20240110003938.490206-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-02-22 16:26:26 -08:00
Paul Durrant	9fa336e343	KVM: pfncache: check the need for invalidation under read lock first When processing mmu_notifier invalidations for gpc caches, pre-check for overlap with the invalidation event while holding gpc->lock for read, and only take gpc->lock for write if the cache needs to be invalidated. Doing a pre-check without taking gpc->lock for write avoids unnecessarily contending the lock for unrelated invalidations, which is very beneficial for caches that are heavily used (but rarely subjected to mmu_notifier invalidations). Signed-off-by: Paul Durrant <pdurrant@amazon.com> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Link: https://lore.kernel.org/r/20240215152916.1158-20-paul@xen.org Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-02-22 07:01:20 -08:00
Paul Durrant	721f5b0dda	KVM: pfncache: allow a cache to be activated with a fixed (userspace) HVA Some pfncache pages may actually be overlays on guest memory that have a fixed HVA within the VMM. It's pointless to invalidate such cached mappings if the overlay is moved so allow a cache to be activated directly with the HVA to cater for such cases. A subsequent patch will make use of this facility. Signed-off-by: Paul Durrant <pdurrant@amazon.com> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Link: https://lore.kernel.org/r/20240215152916.1158-10-paul@xen.org Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-02-20 07:37:46 -08:00
Paul Durrant	406c10962a	KVM: pfncache: include page offset in uhva and use it consistently Currently the pfncache page offset is sometimes determined using the gpa and sometimes the khva, whilst the uhva is always page-aligned. After a subsequent patch is applied the gpa will not always be valid so adjust the code to include the page offset in the uhva and use it consistently as the source of truth. Also, where a page-aligned address is required, use PAGE_ALIGN_DOWN() for clarity. No functional change intended. Signed-off-by: Paul Durrant <pdurrant@amazon.com> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Link: https://lore.kernel.org/r/20240215152916.1158-8-paul@xen.org Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-02-20 07:37:44 -08:00
Paul Durrant	53e63e953e	KVM: pfncache: stop open-coding offset_in_page() Some code in pfncache uses offset_in_page() but in other places it is open- coded. Use offset_in_page() consistently everywhere. Signed-off-by: Paul Durrant <pdurrant@amazon.com> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Link: https://lore.kernel.org/r/20240215152916.1158-7-paul@xen.org Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-02-20 07:37:43 -08:00

... 3 4 5 6 7 ...

2739 Commits