linux

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2026-05-22 06:15:06 -04:00

Author	SHA1	Message	Date
Yang Weijiang	6a11c860d8	KVM: x86: Report KVM supported CET MSRs as to-be-saved Add CET MSRs to the list of MSRs reported to userspace if the feature, i.e. IBT or SHSTK, associated with the MSRs is supported by KVM. Suggested-by: Chao Gao <chao.gao@intel.com> Signed-off-by: Yang Weijiang <weijiang.yang@intel.com> Tested-by: Mathias Krause <minipli@grsecurity.net> Tested-by: John Allen <john.allen@amd.com> Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Link: https://lore.kernel.org/r/20250919223258.1604852-12-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-09-23 09:00:49 -07:00
Yang Weijiang	586ef9dcbb	KVM: x86: Add fault checks for guest CR4.CET setting Check potential faults for CR4.CET setting per Intel SDM requirements. CET can be enabled if and only if CR0.WP == 1, i.e. setting CR4.CET == 1 faults if CR0.WP == 0 and setting CR0.WP == 0 fails if CR4.CET == 1. Signed-off-by: Yang Weijiang <weijiang.yang@intel.com> Reviewed-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Tested-by: Mathias Krause <minipli@grsecurity.net> Tested-by: John Allen <john.allen@amd.com> Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Co-developed-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20250919223258.1604852-11-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-09-23 09:00:48 -07:00
Sean Christopherson	e44eb58334	KVM: x86: Load guest FPU state when access XSAVE-managed MSRs Load the guest's FPU state if userspace is accessing MSRs whose values are managed by XSAVES. Introduce two helpers, kvm_{get,set}_xstate_msr(), to facilitate access to such kind of MSRs. If MSRs supported in kvm_caps.supported_xss are passed through to guest, the guest MSRs are swapped with host's before vCPU exits to userspace and after it reenters kernel before next VM-entry. Because the modified code is also used for the KVM_GET_MSRS device ioctl(), explicitly check @vcpu is non-null before attempting to load guest state. The XSAVE-managed MSRs cannot be retrieved via the device ioctl() without loading guest FPU state (which doesn't exist). Note that guest_cpuid_has() is not queried as host userspace is allowed to access MSRs that have not been exposed to the guest, e.g. it might do KVM_SET_MSRS prior to KVM_SET_CPUID2. The two helpers are put here in order to manifest accessing xsave-managed MSRs requires special check and handling to guarantee the correctness of read/write to the MSRs. Co-developed-by: Yang Weijiang <weijiang.yang@intel.com> Signed-off-by: Yang Weijiang <weijiang.yang@intel.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Tested-by: Mathias Krause <minipli@grsecurity.net> Tested-by: John Allen <john.allen@amd.com> Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Chao Gao <chao.gao@intel.com> [sean: drop S_CET, add big comment, move accessors to x86.c] Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: Xin Li (Intel) <xin@zytor.com> Link: https://lore.kernel.org/r/20250919223258.1604852-10-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-09-23 09:00:47 -07:00
Yang Weijiang	779ed05511	KVM: x86: Initialize kvm_caps.supported_xss Set original kvm_caps.supported_xss to (host_xss & KVM_SUPPORTED_XSS) if XSAVES is supported. host_xss contains the host supported xstate feature bits for thread FPU context switch, KVM_SUPPORTED_XSS includes all KVM enabled XSS feature bits, the resulting value represents the supervisor xstates that are available to guest and are backed by host FPU framework for swapping {guest,host} XSAVE-managed registers/MSRs. [sean: relocate and enhance comment about PT / XSS[8] ] Signed-off-by: Yang Weijiang <weijiang.yang@intel.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Reviewed-by: Chao Gao <chao.gao@intel.com> Tested-by: Mathias Krause <minipli@grsecurity.net> Tested-by: John Allen <john.allen@amd.com> Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Signed-off-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Link: https://lore.kernel.org/r/20250919223258.1604852-9-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-09-23 09:00:46 -07:00
Yang Weijiang	9622e116d0	KVM: x86: Refresh CPUID on write to guest MSR_IA32_XSS Update CPUID.(EAX=0DH,ECX=1).EBX to reflect current required xstate size due to XSS MSR modification. CPUID(EAX=0DH,ECX=1).EBX reports the required storage size of all enabled xstate features in (XCR0 \| IA32_XSS). The CPUID value can be used by guest before allocate sufficient xsave buffer. Note, KVM does not yet support any XSS based features, i.e. supported_xss is guaranteed to be zero at this time. Opportunistically skip CPUID updates if XSS value doesn't change. Suggested-by: Sean Christopherson <seanjc@google.com> Co-developed-by: Zhang Yi Z <yi.z.zhang@linux.intel.com> Signed-off-by: Zhang Yi Z <yi.z.zhang@linux.intel.com> Signed-off-by: Yang Weijiang <weijiang.yang@intel.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Reviewed-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Tested-by: Mathias Krause <minipli@grsecurity.net> Tested-by: John Allen <john.allen@amd.com> Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Link: https://lore.kernel.org/r/20250919223258.1604852-8-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-09-23 09:00:46 -07:00
Chao Gao	338543cbe0	KVM: x86: Check XSS validity against guest CPUIDs Maintain per-guest valid XSS bits and check XSS validity against them rather than against KVM capabilities. This is to prevent bits that are supported by KVM but not supported for a guest from being set. Opportunistically return KVM_MSR_RET_UNSUPPORTED on IA32_XSS MSR accesses if guest CPUID doesn't enumerate X86_FEATURE_XSAVES. Since KVM_MSR_RET_UNSUPPORTED takes care of host_initiated cases, drop the host_initiated check. Signed-off-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Link: https://lore.kernel.org/r/20250919223258.1604852-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-09-23 09:00:45 -07:00
Sean Christopherson	c0a5f29891	KVM: x86: Report XSS as to-be-saved if there are supported features Add MSR_IA32_XSS to list of MSRs reported to userspace if supported_xss is non-zero, i.e. KVM supports at least one XSS based feature. Before enabling CET virtualization series, guest IA32_MSR_XSS is guaranteed to be 0, i.e., XSAVES/XRSTORS is executed in non-root mode with XSS == 0, which equals to the effect of XSAVE/XRSTOR. Signed-off-by: Yang Weijiang <weijiang.yang@intel.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Reviewed-by: Chao Gao <chao.gao@intel.com> Tested-by: Mathias Krause <minipli@grsecurity.net> Tested-by: John Allen <john.allen@amd.com> Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Signed-off-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Link: https://lore.kernel.org/r/20250919223258.1604852-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-09-23 09:00:44 -07:00
Yang Weijiang	06f2969c6a	KVM: x86: Introduce KVM_{G,S}ET_ONE_REG uAPIs support Enable KVM_{G,S}ET_ONE_REG uAPIs so that userspace can access MSRs and other non-MSR registers through them, along with support for KVM_GET_REG_LIST to enumerate support for KVM-defined registers. This is in preparation for allowing userspace to read/write the guest SSP register, which is needed for the upcoming CET virtualization support. Currently, two types of registers are supported: KVM_X86_REG_TYPE_MSR and KVM_X86_REG_TYPE_KVM. All MSRs are in the former type; the latter type is added for registers that lack existing KVM uAPIs to access them. The "KVM" in the name is intended to be vague to give KVM flexibility to include other potential registers. More precise names like "SYNTHETIC" and "SYNTHETIC_MSR" were considered, but were deemed too confusing (e.g. can be conflated with synthetic guest-visible MSRs) and may put KVM into a corner (e.g. if KVM wants to change how a KVM-defined register is modeled internally). Enumerate only KVM-defined registers in KVM_GET_REG_LIST to avoid duplicating KVM_GET_MSR_INDEX_LIST, and so that KVM can return _only_ registers that are fully supported (KVM_GET_REG_LIST is vCPU-scoped, i.e. can be precise, whereas KVM_GET_MSR_INDEX_LIST is system-scoped). Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Yang Weijiang <weijiang.yang@intel.com> Link: https://lore.kernel.org/all/20240219074733.122080-18-weijiang.yang@intel.com [1] Tested-by: Mathias Krause <minipli@grsecurity.net> Tested-by: John Allen <john.allen@amd.com> Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Co-developed-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20250919223258.1604852-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-09-23 09:00:44 -07:00
Sean Christopherson	5dca3808b2	KVM: x86: Merge 'svm' into 'cet' to pick up GHCB dependencies Merge the queue of SVM changes for 6.18 to pick up the KVM-defined GHCB helpers so that kvm_ghcb_get_xss() can be used to virtualize CET for SEV-ES+ guests.	2025-09-23 08:59:49 -07:00
Naveen N Rao	ca2967de5a	KVM: SVM: Enable AVIC by default for Zen4+ if x2AVIC is support AVIC and x2AVIC are fully functional since Zen 4, with no known hardware errata. Enable AVIC and x2AVIC by default on Zen4+ so long as x2AVIC is supported (to avoid enabling partial support for APIC virtualization by default). Internally, convert "avic" to an integer so that KVM can identify if the user has asked to explicitly enable or disable AVIC, i.e. so that KVM doesn't override an explicit 'y' from the user. Arbitrarily use -1 to denote auto-mode, and accept the string "auto" for the module param in addition to standard boolean values, i.e. continue to allow the user to configure the "avic" module parameter to explicitly enable/disable AVIC. To again maintain backward compatibility with a standard boolean param, set KERNEL_PARAM_OPS_FL_NOARG, which tells the params infrastructure to allow empty values for %true, i.e. to interpret a bare "avic" as "avic=y". Take care to check for a NULL @val when looking for "auto"! Lastly, always print "avic" as a boolean, since auto-mode is resolved during module initialization, i.e. the user should never see "auto" in sysfs. Signed-off-by: Naveen N Rao (AMD) <naveen@kernel.org> Tested-by: Naveen N Rao (AMD) <naveen@kernel.org> Co-developed-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20250919215934.1590410-8-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-09-23 08:56:49 -07:00
Sean Christopherson	b146653531	KVM: SVM: Move global "avic" variable to avic.c Move "avic" to avic.c so that it's colocated with the other AVIC specific globals and module params, and so that avic_hardware_setup() is a bit more self-contained, e.g. similar to sev_hardware_setup(). Deliberately set enable_apicv in svm.c as it's already globally visible (defined by kvm.ko, not by kvm-amd.ko), and to clearly capture the dependency on enable_apicv being initialized (svm_hardware_setup() clears several AVIC-specific hooks when enable_apicv is disabled). Alternatively, clearing of the hooks (and enable_ipiv) could be moved to avic_hardware_setup(), but that's not obviously better, e.g. it's helpful to isolate the setting of enable_apicv when reading code from the generic x86 side of the world. No functional change intended. Acked-by: Naveen N Rao (AMD) <naveen@kernel.org> Tested-by: Naveen N Rao (AMD) <naveen@kernel.org> Link: https://lore.kernel.org/r/20250919215934.1590410-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-09-23 08:56:48 -07:00
Sean Christopherson	ad65dca2ca	KVM: SVM: Don't advise the user to do force_avic=y (when x2AVIC is detected) Don't advise the end user to try to force enable AVIC when x2AVIC is reported as supported in CPUID, as forcefully enabling AVIC isn't something that should be done lightly. E.g. some Zen4 client systems hide AVIC but leave x2AVIC behind, and while such a configuration is indeed due to buggy firmware in the sense the reporting x2AVIC without AVIC is nonsensical, KVM has no idea _why_ firmware disabled AVIC in the first place. Suggesting that the user try to run with force_avic=y is sketchy even if the user explicitly tries to enable AVIC, and will be downright irresponsible once KVM starts enabling AVIC by default. Alternatively, KVM could print the message only when the user explicitly asks for AVIC, but running with force_avic=y isn't something that should be encouraged for random users. force_avic is a useful knob for developers and perhaps even advanced users, but isn't something that KVM should advertise broadly. Opportunistically append a newline to the pr_warn() so that it prints out immediately, and tweak the message to say that AVIC is unsupported instead of disabled (disabled suggests that the kernel/KVM is somehow responsible). Suggested-by: Naveen N Rao (AMD) <naveen@kernel.org> Reviewed-by: Naveen N Rao (AMD) <naveen@kernel.org> Tested-by: Naveen N Rao (AMD) <naveen@kernel.org> Link: https://lore.kernel.org/r/20250919215934.1590410-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-09-23 08:56:48 -07:00
Sean Christopherson	ce4253e21f	KVM: SVM: Always print "AVIC enabled" separately, even when force enabled Print the customary "AVIC enabled" informational message even when AVIC is force enabled on a system that doesn't advertise supported for AVIC in CPUID, as not printing the standard message can confuse users and tools. Opportunistically clean up the scary message when AVIC is force enabled, but keep it as separate message so that it is printed at level "warn", versus the standard message only being printed for level "info". Suggested-by: Naveen N Rao (AMD) <naveen@kernel.org> Reviewed-by: Naveen N Rao (AMD) <naveen@kernel.org> Tested-by: Naveen N Rao (AMD) <naveen@kernel.org> Link: https://lore.kernel.org/r/20250919215934.1590410-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-09-23 08:56:47 -07:00
Sean Christopherson	a9095e4fc4	KVM: SVM: Update "APICv in x2APIC without x2AVIC" in avic.c, not svm.c Set the "allow_apicv_in_x2apic_without_x2apic_virtualization" flag as part of avic_hardware_setup() instead of handling in svm_hardware_setup(), and make x2avic_enabled local to avic.c (setting the flag was the only use in svm.c). Tag avic_hardware_setup() with __init as necessary (it should have been tagged __init long ago). No functional change intended (aside from the side effects of tagging avic_hardware_setup() with __init). Acked-by: Naveen N Rao (AMD) <naveen@kernel.org> Tested-by: Naveen N Rao (AMD) <naveen@kernel.org> Link: https://lore.kernel.org/r/20250919215934.1590410-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-09-23 08:56:46 -07:00
Sean Christopherson	eb44ea6a7a	KVM: SVM: Move x2AVIC MSR interception helper to avic.c Move svm_set_x2apic_msr_interception() to avic.c as it's only relevant when x2AVIC is enabled/supported and only called by AVIC code. In addition to scoping AVIC code to avic.c, this will allow burying the global x2avic_enabled variable in avic. Opportunistically rename the helper to explicitly scope it to "avic". No functional change intended. Reviewed-by: Naveen N Rao (AMD) <naveen@kernel.org> Tested-by: Naveen N Rao (AMD) <naveen@kernel.org> Link: https://lore.kernel.org/r/20250919215934.1590410-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-09-23 08:56:46 -07:00
Sean Christopherson	44bfe1f049	KVM: SVM: Make svm_x86_ops globally visible, clean up on-HyperV usage Make svm_x86_ops globally visible in anticipation of modifying the struct in avic.c, and clean up the KVM-on-HyperV usage, as declaring _and using_ a local variable in a header that's only defined in one specific .c-file is all kinds of ugly. Opportunistically make svm_hv_enable_l2_tlb_flush() local to svm_onhyperv.c, as the only reason it was visible was due to the aforementioned shenanigans in svm_onhyperv.h. Alternatively, svm_x86_ops could be explicitly passed to svm_hv_hardware_setup() as a parameter. While that approach is slightly safer, e.g. avoids "hidden" updates, for better or worse, the Intel side of KVM has already chosen to expose vt_x86_ops (and vt_init_ops). Given that svm_x86_ops is only truly consumed by kvm_ops_update, the odds of a "hidden" update causing problems are extremely low. So, absent a strong reason to rework the VMX/TDX code, make svm_x86_ops visible, as having all updates use exactly "svm_x86_ops." is advantageous in its own right. No functional change intended. Link: https://lore.kernel.org/r/20250919215934.1590410-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-09-23 08:56:44 -07:00
Hou Wenlong	29da8c823a	KVM: SVM: Re-load current, not host, TSC_AUX on #VMEXIT from SEV-ES guest Prior to running an SEV-ES guest, set TSC_AUX in the host save area to the current value in hardware, as tracked by the user return infrastructure, instead of always loading the host's desired value for the CPU. If the pCPU is also running a non-SEV-ES vCPU, loading the host's value on #VMEXIT could clobber the other vCPU's value, e.g. if the SEV-ES vCPU preempted the non-SEV-ES vCPU, in which case KVM expects the other vCPU's TSC_AUX value to be resident in hardware. Note, unlike TDX, which blindly _zeroes_ TSC_AUX on TD-Exit, SEV-ES CPUs can load an arbitrary value. Stuff the current value in the host save area instead of refreshing the user return cache so that KVM doesn't need to track whether or not the vCPU actually enterred the guest and thus loaded TSC_AUX from the host save area. Opportunistically tag tsc_aux_uret_slot as read-only after init to guard against unexpected modifications, and to make it obvious that using the variable in sev_es_prepare_switch_to_guest() is safe. Fixes: `916e3e5f26` ("KVM: SVM: Do not use user return MSR support for virtualized TSC_AUX") Cc: stable@vger.kernel.org Suggested-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com> [sean: handle the SEV-ES case in sev_es_prepare_switch_to_guest()] Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Link: https://lore.kernel.org/r/20250923153738.1875174-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-09-23 08:55:21 -07:00
Hou Wenlong	9bc3663507	KVM: x86: Add helper to retrieve current value of user return MSR In the user return MSR support, the cached value is always the hardware value of the specific MSR. Therefore, add a helper to retrieve the cached value, which can replace the need for RDMSR, for example, to allow SEV-ES guests to restore the correct host hardware value without using RDMSR. Cc: stable@vger.kernel.org Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com> [sean: drop "cache" from the name, make it a one-liner, tag for stable] Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Link: https://lore.kernel.org/r/20250923153738.1875174-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-09-23 08:55:20 -07:00
Sean Christopherson	5b66e335ea	KVM: SEV: Reject non-positive effective lengths during LAUNCH_UPDATE Check for an invalid length during LAUNCH_UPDATE at the start of snp_launch_update() instead of subtly relying on kvm_gmem_populate() to detect the bad state. Code that directly handles userspace input absolutely should sanitize those inputs; failure to do so is asking for bugs where KVM consumes an invalid "npages". Keep the check in gmem, but wrap it in a WARN to flag any bad usage by the caller. Note, this is technically an ABI change as KVM would previously allow a length of '0'. But allowing a length of '0' is nonsensical and creates pointless conundrums in KVM. E.g. an empty range is arguably neither private nor shared, but LAUNCH_UPDATE will fail if the starting gpa can't be made private. In practice, no known or well-behaved VMM passes a length of '0'. Note #2, the PAGE_ALIGNED(params.len) check ensures that lengths between 1 and 4095 (inclusive) are also rejected, i.e. that KVM won't end up with npages=0 when doing "npages = params.len / PAGE_SIZE". Cc: Thomas Lendacky <thomas.lendacky@amd.com> Cc: Michael Roth <michael.roth@amd.com> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lore.kernel.org/r/20250919211649.1575654-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-09-23 08:55:20 -07:00
Sean Christopherson	4135a9a8cc	KVM: SEV: Validate XCR0 provided by guest in GHCB Use __kvm_set_xcr() to propagate XCR0 changes from the GHCB to KVM's software model in order to validate the new XCR0 against KVM's view of the supported XCR0. Allowing garbage is thankfully mostly benign, as kvm_load_{guest,host}_xsave_state() bail early for vCPUs with protected state, xstate_required_size() will simply provide garbage back to the guest, and attempting to save/restore the bad value via KVM_{G,S}ET_XCRS will only harm the guest (setting XCR0 will fail). However, allowing the guest to put junk into a field that KVM assumes is valid is a CVE waiting to happen. And as a bonus, using the proper API eliminates the ugly open coding of setting arch.cpuid_dynamic_bits_dirty. Simply ignore bad values, as either the guest managed to get an unsupported value into hardware, or the guest is misbehaving and providing pure garbage. In either case, KVM can't fix the broken guest. Note, using __kvm_set_xcr() also avoids recomputing dynamic CPUID bits if XCR0 isn't actually changing (relatively to KVM's previous snapshot). Cc: Tom Lendacky <thomas.lendacky@amd.com> Fixes: `291bd20d5d` ("KVM: SVM: Add initial support for a VMGEXIT VMEXIT") Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lore.kernel.org/r/20250919223258.1604852-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-09-23 08:55:19 -07:00
Sean Christopherson	bd5f500d23	KVM: SEV: Read save fields from GHCB exactly once Wrap all reads of GHCB save fields with READ_ONCE() via a KVM-specific GHCB get() utility to help guard against TOCTOU bugs. Using READ_ONCE() doesn't completely prevent such bugs, e.g. doesn't prevent KVM from redoing get() after checking the initial value, but at least addresses all potential TOCTOU issues in the current KVM code base. To prevent unintentional use of the generic helpers, take only @svm for the kvm_ghcb_get_xxx() helpers and retrieve the ghcb instead of explicitly passing it in. Opportunistically reduce the indentation of the macro-defined helpers and clean up the alignment. Fixes: `4e15a0ddc3` ("KVM: SEV: snapshot the GHCB before accessing it") Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lore.kernel.org/r/20250919223258.1604852-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-09-23 08:55:18 -07:00
Sean Christopherson	e0ff302b79	KVM: SEV: Rename kvm_ghcb_get_sw_exit_code() to kvm_get_cached_sw_exit_code() Rename kvm_ghcb_get_sw_exit_code() to kvm_get_cached_sw_exit_code() to make it clear that KVM is getting the cached value, not reading directly from the guest-controlled GHCB. More importantly, vacating kvm_ghcb_get_sw_exit_code() will allow adding a KVM-specific macro-built kvm_ghcb_get_##field() helper to read values from the GHCB. No functional change intended. Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lore.kernel.org/r/20250919223258.1604852-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-09-23 08:55:18 -07:00
Steven Rostedt	66e2d96b1c	LoongArch: KVM: Move kvm_iocsr tracepoint out of generic code The tracepoint kvm_iocsr is only used by the loongarch architecture. As trace events can take up to 5K of memory, move this tracepoint into the LoongArch specific tracing file so that it doesn't waste memory for all other architectures. Reviewed-by: Bibo Mao <maobibo@loongson.cn> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>	2025-09-23 23:37:26 +08:00
Yury Norov (NVIDIA)	77336b918f	LoongArch: KVM: Rework pch_pic_update_batch_irqs() Use proper bitmap API and drop all the housekeeping code. Reviewed-by: Bibo Mao <maobibo@loongson.cn> Signed-off-by: Yury Norov (NVIDIA) <yury.norov@gmail.com> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>	2025-09-23 23:37:26 +08:00
Bibo Mao	f851fdd895	LoongArch: KVM: Add different length support in loongarch_pch_pic_write() With function loongarch_pch_pic_write(), currently there is only four bytes register write support. But in theory, all length 1/2/4/8 should be supported for all the registers, here add different length support about register write emulation in function loongarch_pch_pic_write(). Signed-off-by: Bibo Mao <maobibo@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>	2025-09-23 23:37:26 +08:00
Bibo Mao	f8a73df503	LoongArch: KVM: Add different length support in loongarch_pch_pic_read() With function loongarch_pch_pic_read(), currently it is hardcoded length for different registers, and the length comes from exising linux pch_pic driver code. But in theory, all length 1/2/4/8 should be supported for all the registers, here add different length support about register read emulation in function loongarch_pch_pic_read(). Signed-off-by: Bibo Mao <maobibo@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>	2025-09-23 23:37:09 +08:00
Bibo Mao	eb626c7704	LoongArch: KVM: Add IRR and ISR register read emulation With LS7A user manual, there are registers PCH_PIC_INT_IRR_START and PCH_PIC_INT_ISR_START. So add read access emulation in function loongarch_pch_pic_read() here. Signed-off-by: Bibo Mao <maobibo@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>	2025-09-23 23:37:09 +08:00
Bibo Mao	2f412eb765	LoongArch: KVM: Set version information at initial stage Register PCH_PIC_INT_ID constains version and supported irq number information, and it is a read only register. The detailed value can be set at initial stage, rather than read callback. Signed-off-by: Bibo Mao <maobibo@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>	2025-09-23 23:37:09 +08:00
Bibo Mao	3da2b0d439	LoongArch: KVM: Access mailbox directly in mail_send() With function mail_send(), it is to write mailbox of other VCPUs. Existing simple APIs read_mailbox()/write_mailbox() can be used directly rather than send command on IOCSR address. Signed-off-by: Bibo Mao <maobibo@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>	2025-09-23 23:37:09 +08:00
Bibo Mao	1cf7b2881d	LoongArch: KVM: Add implementation with IOCSR_IPI_SET IPI IOCSR register IOCSR_IPI_SET can send ipi interrupt to other vCPUs, but it can also send an interrupt to vCPU itself. Indeed there are such operations on Linux as arch_irq_work_raise() which will send ipi message to vCPU itself. Here add implementation of write operation with IOCSR_IPI_SET register. Signed-off-by: Bibo Mao <maobibo@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>	2025-09-23 23:37:08 +08:00
Bibo Mao	44598fe776	LoongArch: KVM: Add sign extension with kernel IOCSR read emulation Function kvm_complete_iocsr_read() is to add sign extension with IOCSR read emulation, it is used in user space IOCSR read completion now. Also it should be used in kernel IOCSR read emulation. Signed-off-by: Bibo Mao <maobibo@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>	2025-09-23 23:37:08 +08:00
Bibo Mao	80edf90831	LoongArch: KVM: Add sign extension with kernel MMIO read emulation Function kvm_complete_mmio_read() is to add sign extension with MMIO read emulation, it is used in user space MMIO read completion now. Also it should be used in kernel MMIO read emulation. Signed-off-by: Bibo Mao <maobibo@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>	2025-09-23 23:37:08 +08:00
Bibo Mao	7109f51bcc	LoongArch: KVM: Add PTW feature detection on new hardware With new Loongson-3A6000/3C6000 hardware platforms (or later), hardware page table walking (PTW) feature is supported on host. So here add this feature detection on KVM host. Signed-off-by: Bibo Mao <maobibo@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>	2025-09-23 23:37:08 +08:00
NeilBrown	3d18f80ce1	VFS: rename kern_path_locked() and related functions. kern_path_locked() is now only used to prepare for removing an object from the filesystem (and that is the only credible reason for wanting a positive locked dentry). Thus it corresponds to kern_path_create() and so should have a corresponding name. Unfortunately the name "kern_path_create" is somewhat misleading as it doesn't actually create anything. The recently added simple_start_creating() provides a better pattern I believe. The "start" can be matched with "end" to bracket the creating or removing. So this patch changes names: kern_path_locked -> start_removing_path kern_path_create -> start_creating_path user_path_create -> start_creating_user_path user_path_locked_at -> start_removing_user_path_at done_path_create -> end_creating_path and also introduces end_removing_path() which is identical to end_creating_path(). __start_removing_path (which was __kern_path_locked) is enhanced to call mnt_want_write() for consistency with the start_creating_path(). Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: NeilBrown <neil@brown.name> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-09-23 12:37:36 +02:00
Julian Ruess	a612dbe8d0	dibs: Move event handling to dibs layer Add defines for all event types and subtypes an ism device is known to produce as it can be helpful for debugging purposes. Introduces a generic 'struct dibs_event' and adopt ism device driver and smc-d client accordingly. Tolerate and ignore other type and subtype values to enable future device extensions. SMC-D and ISM are now independent. struct ism_dev can be moved to drivers/s390/net/ism.h. Note that in smc, the term 'ism' is still used. Future patches could replace that with 'dibs' or 'smc-d' as appropriate. Signed-off-by: Julian Ruess <julianr@linux.ibm.com> Co-developed-by: Alexandra Winter <wintera@linux.ibm.com> Signed-off-by: Alexandra Winter <wintera@linux.ibm.com> Link: https://patch.msgid.link/20250918110500.1731261-15-wintera@linux.ibm.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-09-23 11:13:22 +02:00
Alexandra Winter	69baaac936	dibs: Define dibs_client_ops and dibs_dev_ops Move the device add() and remove() functions from ism_client to dibs_client_ops and call add_dev()/del_dev() for ism devices and dibs_loopback devices. dibs_client_ops->add_dev() = smcd_register_dev() for the smc_dibs_client. This is the first step to handle ism and loopback devices alike (as dibs devices) in the smc dibs client. Define dibs_dev->ops and move smcd_ops->get_chid to dibs_dev_ops->get_fabric_id() for ism and loopback devices. See below for why this needs to be in the same patch as dibs_client_ops->add_dev(). The following changes contain intermediate steps, that will be obsoleted by follow-on patches, once more functionality has been moved to dibs: Use different smcd_ops and max_dmbs for ism and loopback. Follow-on patches will change SMC-D to directly use dibs_ops instead of smcd_ops. In smcd_register_dev() it is now necessary to identify a dibs_loopback device before smcd_dev and smcd_ops->get_chid() are available. So provide dibs_dev_ops->get_fabric_id() in this patch and evaluate it in smc_ism_is_loopback(). Call smc_loopback_init() in smcd_register_dev() and call smc_loopback_exit() in smcd_unregister_dev() to handle the functionality that is still in smc_loopback. Follow-on patches will move all smc_loopback code to dibs_loopback. In smcd_[un]register_dev() use only ism device name, this will be replaced by dibs device name by a follow-on patch. End of changes with intermediate parts. Allocate an smcd event workqueue for all dibs devices, although dibs_loopback does not generate events. Use kernel memory instead of devres memory for smcd_dev and smcd->conn. Since commit `a72178cfe8` ("net/smc: Fix dependency of SMC on ISM") an ism device and its driver can have a longer lifetime than the smc module, so smc should not rely on devres to free its resources [1]. It is now the responsibility of the smc client to free smcd and smcd->conn for all dibs devices, ism devices as well as loopback. Call client->ops->del_dev() for all existing dibs devices in dibs_unregister_client(), so all device related structures can be freed in the client. When dibs_unregister_client() is called in the context of smc_exit() or smc_core_reboot_event(), these functions have already called smc_lgrs_shutdown() which calls smc_smcd_terminate_all(smcd) and sets going_away. This is done a second time in smcd_unregister_dev(). This is analogous to how smcr is handled in these functions, by calling first smc_lgrs_shutdown() and then smc_ib_unregister_client() > smc_ib_remove_dev(), so leave it that way. It may be worth investigating, whether smc_lgrs_shutdown() is still required or useful. Remove CONFIG_SMC_LO. CONFIG_DIBS_LO now controls whether a dibs loopback device exists or not. Link: https://www.kernel.org/doc/Documentation/driver-model/devres.txt [1] Signed-off-by: Alexandra Winter <wintera@linux.ibm.com> Reviewed-by: Mahanta Jambigi <mjambigi@linux.ibm.com> Link: https://patch.msgid.link/20250918110500.1731261-8-wintera@linux.ibm.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-09-23 11:13:22 +02:00
Alexandra Winter	cb990a45d7	dibs: Define dibs loopback The first stage of loopback-ism was implemented as part of the SMC module [1]. Now that we have the dibs layer, provide access to a dibs_loopback device to all dibs clients. This is the first step of moving loopback-ism from net/smc/smc_loopback.* to drivers/dibs/dibs_loopback.*. One global structure lo_dev is allocated and added to the dibs devices. Follow-on patches will move functionality. Same as smc_loopback, dibs_loopback is provided by a config option. Note that there is no way to dynamically add or remove the loopback device. That could be a future improvement. When moving code to drivers/dibs, replace ism_ prefix with dibs_ prefix. As this is mostly a move of existing code, copyright and authors are unchanged. Link: https://lore.kernel.org/lkml/20240428060738.60843-1-guwen@linux.alibaba.com/ [1] Signed-off-by: Alexandra Winter <wintera@linux.ibm.com> Link: https://patch.msgid.link/20250918110500.1731261-7-wintera@linux.ibm.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-09-23 11:13:21 +02:00
Alexandra Winter	d324a2ca3f	dibs: Register smc as dibs_client Formally register smc as dibs client. Functionality will be moved by follow-on patches from ism_client to dibs_client until eventually ism_client can be removed. As DIBS is only a shim layer without any dependencies, we can depend SMC on DIBS without adding indirect dependencies. A follow-on patch will remove dependency of SMC on ISM. Signed-off-by: Alexandra Winter <wintera@linux.ibm.com> Reviewed-by: Julian Ruess <julianr@linux.ibm.com> Link: https://patch.msgid.link/20250918110500.1731261-5-wintera@linux.ibm.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-09-23 11:13:21 +02:00
Haren Myneni	ef104054a3	powerpc/pseries: Define __u{8,32} types in papr_hvpipe_hdr struct Fix the the following build errors with CONFIG_UAPI_HEADER_TEST: ./usr/include/asm/papr-hvpipe.h:16:9: error: unknown type name 'u8' 16 \| u8 version; ./usr/include/asm/papr-hvpipe.h:17:9: error: unknown type name 'u8' 17 \| u8 reserved[3]; ./usr/include/asm/papr-hvpipe.h:18:9: error: unknown type name 'u32' 18 \| u32 flags; ./usr/include/asm/papr-hvpipe.h:19:9: error: unknown type name 'u8' 19 \| u8 reserved2[40]; Fixes: `043439ad1a` ("powerpc/pseries: Define papr-hvpipe ioctl") Reported-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com> Signed-off-by: Haren Myneni <haren@linux.ibm.com> Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com> Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com> Link: https://patch.msgid.link/20250922091108.1483970-1-haren@linux.ibm.com	2025-09-23 14:30:48 +05:30
Thomas Gleixner	2066f00e5b	x86/topology: Implement topology_is_core_online() to address SMT regression Christian reported that commit `a430c11f40` ("intel_idle: Rescan "dead" SMT siblings during initialization") broke the use case in which both 'nosmt' and 'maxcpus' are on the kernel command line because it onlines primary threads, which were offline due to the maxcpus limit. The initially proposed fix to skip primary threads in the loop is inconsistent. While it prevents the primary thread to be onlined, it then onlines the corresponding hyperthread(s), which does not really make sense. The CPU iterator in cpuhp_smt_enable() contains a check which excludes all threads of a core, when the primary thread is offline. The default implementation is a NOOP and therefore not effective on x86. Implement topology_is_core_online() on x86 to address this issue. This makes the behaviour consistent between x86 and PowerPC. Fixes: `a430c11f40` ("intel_idle: Rescan "dead" SMT siblings during initialization") Fixes: `f694481b1d` ("ACPI: processor: Rescan "dead" SMT siblings during initialization") Closes: https://lore.kernel.org/linux-pm/724616a2-6374-4ba3-8ce3-ea9c45e2ae3b@arm.com/ Reported-by: Christian Loehle <christian.loehle@arm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Rafael J. Wysocki (Intel) <rafael@kernel.org> Tested-by: Christian Loehle <christian.loehle@arm.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/12740505.O9o76ZdvQC@rafael.j.wysocki	2025-09-22 21:25:36 +02:00
Mario Limonciello (AMD)	5b87014e99	x86/acpi/cstate: Remove open coded check for cpu_feature_enabled() acpi_processor_power_init_bm_check() has an open coded implementation of cpu_feature_enabled() to detect X86_FEATURE_ZEN. Switch to using cpu_feature_enabled(). Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2025-09-22 19:16:51 +02:00
Dharma Balasubiramani	c656932c3e	ARM: dts: microchip: sam9x7: Add qspi controller Add support for QSPI controller. Signed-off-by: Dharma Balasubiramani <dharma.b@microchip.com> Link: https://lore.kernel.org/r/20250915-sam9x7-qspi-dtsi-v1-1-1cc9adba7573@microchip.com Signed-off-by: Nicolas Ferre <nicolas.ferre@microchip.com>	2025-09-22 18:20:34 +02:00
Christophe Leroy	e9713655b2	soc: fsl: qe: Drop legacy-of-mm-gpiochip.h header from GPIO driver Remove legacy-of-mm-gpiochip.h header file. The above mentioned file provides an OF API that's deprecated. There is no agnostic alternatives to it and we have to open code the logic which was hidden behind of_mm_gpiochip_add_data(). Note, most of the GPIO drivers are using their own labeling schemas and resource retrieval that only a few may gain of the code deduplication, so whenever alternative is appear we can move drivers again to use that one. As a side effect this change fixes a potential memory leak on an error path, if of_mm_gpiochip_add_data() fails. [Text copied from commit `34064c8267` ("powerpc/8xx: Drop legacy-of-mm-gpiochip.h header")] Suggested-by: Bartosz Golaszewski <brgl@bgdev.pl> Reviewed-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org> Link: https://lore.kernel.org/r/e8a5d2c5b72233bd36da7fecc0a551ca54d39478.1758212309.git.christophe.leroy@csgroup.eu Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>	2025-09-22 16:37:24 +02:00
Sean Christopherson	e8f85d7884	KVM: x86: Don't treat ENTER and LEAVE as branches, because they aren't Remove the IsBranch flag from ENTER and LEAVE in KVM's emulator, as ENTER and LEAVE are stack operations, not branches. Add forced emulation of said instructions to the PMU counters test to prove that KVM diverges from hardware, and to guard against regressions. Opportunistically add a missing "1 MOV" to the selftest comment regarding the number of instructions per loop, which commit `7803339fa9` ("KVM: selftests: Use data load to trigger LLC references/misses in Intel PMU") forgot to add. Fixes: `018d70ffcf` ("KVM: x86: Update vPMCs when retiring branch instructions") Cc: Jim Mattson <jmattson@google.com> Reviewed-by: Jim Mattson <jmattson@google.com> Reviewed-by: Chao Gao <chao.gao@intel.com> Link: https://lore.kernel.org/r/20250919004639.1360453-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-09-22 07:14:05 -07:00
Hendrik Brueckner	e11727b2b0	s390/configs: Enable additional network features Enable AF_XDP, kTLS, and Mellanox subfunctions to accelerate network packet processing. Signed-off-by: Hendrik Brueckner <brueckner@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>	2025-09-22 15:25:07 +02:00
Alexander Popov	4f11559613	x86/Kconfig: Reenable PTDUMP on i386 The commit `f9aad62200` ("mm: rename GENERIC_PTDUMP and PTDUMP_CORE") has broken PTDUMP and the Kconfig options that use it on ARCH=i386, including CONFIG_DEBUG_WX. CONFIG_GENERIC_PTDUMP was renamed into CONFIG_ARCH_HAS_PTDUMP, but it was mistakenly moved from "config X86" to "config X86_64". That made PTDUMP unavailable for i386. Move CONFIG_ARCH_HAS_PTDUMP back to "config X86" to fix it. [ bp: Massage commit message. ] Fixes: `f9aad62200` ("mm: rename GENERIC_PTDUMP and PTDUMP_CORE") Signed-off-by: Alexander Popov <alex.popov@linux.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Cc: stable@vger.kernel.org	2025-09-22 14:40:17 +02:00
Can Peng	da9e5c04be	arm/syscalls: mark syscall invocation as likely in invoke_syscall The invoke_syscall() function is overwhelmingly called for valid system call entries. Annotate the main path with likely() to help the compiler generate better branch prediction hints, reducing CPU pipeline stalls due to mispredictions. This is a micro-optimization targeting syscall-heavy workloads [1]. Link: https://lore.kernel.org/r/20250922121730.986761-1-pengcan@kylinos.cn [1] Signed-off-by: Can Peng <pengcan@kylinos.cn> Signed-off-by: Will Deacon <will@kernel.org>	2025-09-22 13:26:16 +01:00
Omar Sandoval	5973a62efa	arm64: map [_text, _stext) virtual address range non-executable+read-only Since the referenced fixes commit, the kernel's .text section is only mapped starting from _stext; the region [_text, _stext) is omitted. As a result, other vmalloc/vmap allocations may use the virtual addresses nominally in the range [_text, _stext). This address reuse confuses multiple things: 1. crash_prepare_elf64_headers() sets up a segment in /proc/vmcore mapping the entire range [_text, _end) to [__pa_symbol(_text), __pa_symbol(_end)). Reading an address in [_text, _stext) from /proc/vmcore therefore gives the incorrect result. 2. Tools doing symbolization (either by reading /proc/kallsyms or based on the vmlinux ELF file) will incorrectly identify vmalloc/vmap allocations in [_text, _stext) as kernel symbols. In practice, both of these issues affect the drgn debugger. Specifically, there were cases where the vmap IRQ stacks for some CPUs were allocated in [_text, _stext). As a result, drgn could not get the stack trace for a crash in an IRQ handler because the core dump contained invalid data for the IRQ stack address. The stack addresses were also symbolized as being in the _text symbol. Fix this by bringing back the mapping of [_text, _stext), but now make it non-executable and read-only. This prevents other allocations from using it while still achieving the original goal of not mapping unpredictable data as executable. Other than the changed protection, this is effectively a revert of the fixes commit. Fixes: `e2a073dde9` ("arm64: omit [_text, _stext) from permanent kernel mapping") Cc: stable@vger.kernel.org Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Will Deacon <will@kernel.org>	2025-09-22 11:58:17 +01:00
Anshuman Khandual	14f158552e	arm64/sysreg: Update TCR_EL1 register Update TCR_EL1 register fields as per latest ARM ARM DDI 0487 L.B and while here drop an explicit sysreg definition SYS_TCR_EL1 from sysreg.h, which is now redundant. Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: linux-arm-kernel@lists.infradead.org Cc: linux-kernel@vger.kernel.org Reviewed-by: Mark Brown <broonie@kernel.org> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: Will Deacon <will@kernel.org>	2025-09-22 11:57:15 +01:00
Dev Jain	fa93b45fd3	arm64: Enable vmalloc-huge with ptdump Our goal is to move towards enabling vmalloc-huge by default on arm64 so as to reduce TLB pressure. Therefore, we need a way to analyze the portion of block mappings in vmalloc space we can get on a production system; this can be done through ptdump, but currently we disable vmalloc-huge if CONFIG_PTDUMP_DEBUGFS is on. The reason is that lazy freeing of kernel pagetables via vmap_try_huge_pxd() may race with ptdump, so ptdump may dereference a bogus address. To solve this, we need to synchronize ptdump_walk() and ptdump_check_wx() with pud_free_pmd_page() and pmd_free_pte_page(). Since this race is very unlikely to happen in practice, we do not want to penalize the vmalloc pagetable tearing path by taking the init_mm mmap_lock. Therefore, we use static keys. ptdump_walk() and ptdump_check_wx() are the pagetable walkers; they will enable the static key - upon observing that, the vmalloc pagetable tearing path will get patched in with an mmap_read_lock/unlock sequence. A combination of the patched-in mmap_read_lock/unlock, the acquire semantics of static_branch_inc(), and the barriers in __flush_tlb_kernel_pgtable() ensures that ptdump will never get a hold on the address of a freed PMD or PTE table. We can verify the correctness of the algorithm via the following litmus test (thanks to James Houghton and Will Deacon): AArch64 ptdump Variant=Ifetch { uint64_t pud=0xa110c; uint64_t pmd; 0:X0=label:"P1:L0"; 0:X1=instr:"NOP"; 0:X2=lock; 0:X3=pud; 0:X4=pmd; 1:X1=0xdead; 1:X2=lock; 1:X3=pud; 1:X4=pmd; } P0 \| P1 ; (* static_key_enable ) \| ( pud_free_pmd_page ) ; STR W1, [X0] \| LDR X9, [X3] ; DC CVAU,X0 \| STR XZR, [X3] ; DSB ISH \| DSB ISH ; IC IVAU,X0 \| ISB ; DSB ISH \| ; ISB \| ( static key ) ; \| L0: ; ( mmap_lock ) \| B out1 ; Lwlock: \| ; MOV W7, #1 \| ( mmap_lock ) ; SWPA W7, W8, [X2] \| Lrlock: ; \| MOV W7, #1 ; \| SWPA W7, W8, [X2] ; ( walk pgtable ) \| ; LDR X9, [X3] \| ( mmap_unlock ) ; CBZ X9, out0 \| STLR WZR, [X2] ; EOR X10, X9, X9 \| ; LDR X11, [X4, X10] \| out1: ; \| EOR X10, X9, X9 ; out0: \| STR X1, [X4, X10] ; exists (0:X8=0 /\ 1:X8=0 /\ ( Lock acquisitions succeed ) 0:X9=0xa110c /\ ( P0 sees the valid PUD ...) 0:X11=0xdead) ( ... but the freed PMD *) For an approximate written proof of why this algorithm works, please read the code comment in [1], which is now removed for the sake of simplicity. mm-selftests pass. No issues were observed while parallelly running test_vmalloc.sh (which stresses the vmalloc subsystem), and cat /sys/kernel/debug/{kernel_page_tables, check_wx_pages} in a loop. Link: https://lore.kernel.org/all/20250723161827.15802-1-dev.jain@arm.com/ [1] Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Dev Jain <dev.jain@arm.com> Signed-off-by: Will Deacon <will@kernel.org>	2025-09-22 11:53:24 +01:00

... 8 9 10 11 12 ...

238443 Commits