Clean up SVM's enter/exit assembly code so that it can be compiled
without OBJECT_FILES_NON_STANDARD. The "standard" __svm_vcpu_run() can't
be made 100% bulletproof, as RBP isn't restored on #VMEXIT, but that's
also the case for __vmx_vcpu_run(), and getting "close enough" is better
than not even trying.
As for SEV-ES, after yet another refresher on swap types, I realized
KVM can simply let the hardware restore registers after #VMEXIT, all
that's missing is storing the current values to the host save area
(they are swap type B). This should provide 100% accuracy when using
stack frames for unwinding, and requires less assembly.
In between, build the SEV-ES code iff CONFIG_KVM_AMD_SEV=y, and yank out
"support" for 32-bit kernels in __svm_sev_es_vcpu_run, which was
unnecessarily polluting the code for a configuration that is disabled
at build time.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
- Fix a mostly benign bug in the gfn_to_pfn_cache infrastructure where KVM
would allow userspace to refresh the cache with a bogus GPA. The bug has
existed for quite some time, but was exposed by a new sanity check added in
6.9 (to ensure a cache is either GPA-based or HVA-based).
- Drop an unused param from gfn_to_pfn_cache_invalidate_start() that got left
behind during a 6.9 cleanup.
- Disable support for virtualizing adaptive PEBS, as KVM's implementation is
architecturally broken and can leak host LBRs to the guest.
- Fix a bug where KVM neglects to set the enable bits for general purpose
counters in PERF_GLOBAL_CTRL when initializing the virtual PMU. Both Intel
and AMD architectures require the bits to be set at RESET in order for v2
PMUs to be backwards compatible with software that was written for v1 PMUs,
i.e. for software that will never manually set the global enables.
- Disable LBR virtualization on CPUs that don't support LBR callstacks, as
KVM unconditionally uses PERF_SAMPLE_BRANCH_CALL_STACK when creating the
virtual LBR perf event, i.e. KVM will always fail to create LBR events on
such CPUs.
- Fix a math goof in x86's hugepage logic for KVM_SET_MEMORY_ATTRIBUTES that
results in an array overflow (detected by KASAN).
- Fix a flaw in the max_guest_memory selftest that results in it exhausting
the supply of ucall structures when run with more than 256 vCPUs.
- Mark KVM_MEM_READONLY as supported for RISC-V in set_memory_region_test.
- Fix a bug where KVM incorrectly thinks a TDP MMU root is an indirect shadow
root due KVM unnecessarily clobbering root_role.direct when userspace sets
guest CPUID.
- Fix a dirty logging bug in the where KVM fails to write-protect TDP MMU
SPTEs used for L2 if Page-Modification Logging is enabled for L1 and the L1
hypervisor is NOT using EPT (if nEPT is enabled, KVM doesn't use the TDP MMU
to run L2). For simplicity, KVM always disables PML when running L2, but
the TDP MMU wasn't accounting for root-specific conditions that force write-
protect based dirty logging.
Extend vmx_dirty_log_test to include accesses made by L2 when EPT is
disabled.
This commit adds explicit coverage of a bug caught by syzkaller, where
the TDP MMU would clear D-bits instead of write-protecting SPTEs being
used to map an L2, which only happens when L1 does not enable EPT,
causing writes made by L2 to not be reflected in the dirty log when PML
is enabled:
$ ./vmx_dirty_log_test
Nested EPT: disabled
==== Test Assertion Failure ====
x86_64/vmx_dirty_log_test.c:151: test_bit(0, bmap)
pid=72052 tid=72052 errno=4 - Interrupted system call
(stack trace empty)
Page 0 incorrectly reported clean
Opportunistically replace the volatile casts with {READ,WRITE}_ONCE().
Link: https://lore.kernel.org/kvm/000000000000c6526f06137f18cc@google.com/
Signed-off-by: David Matlack <dmatlack@google.com>
Link: https://lore.kernel.org/r/20240315230541.1635322-5-dmatlack@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Drop the "If AD bits are enabled/disabled" verbiage from the comments
above kvm_tdp_mmu_clear_dirty_{slot,pt_masked}() since TDP MMU SPTEs may
need to be write-protected even when A/D bits are enabled. i.e. These
comments aren't technically correct.
No functional change intended.
Signed-off-by: David Matlack <dmatlack@google.com>
Link: https://lore.kernel.org/r/20240315230541.1635322-4-dmatlack@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Drop the comments above clear_dirty_gfn_range() and
clear_dirty_pt_masked(), since each is word-for-word identical to the
comment above their parent function.
Leave the comment on the parent functions since they are APIs called by
the KVM/x86 MMU.
No functional change intended.
Signed-off-by: David Matlack <dmatlack@google.com>
Link: https://lore.kernel.org/r/20240315230541.1635322-3-dmatlack@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Check kvm_mmu_page_ad_need_write_protect() when deciding whether to
write-protect or clear D-bits on TDP MMU SPTEs, so that the TDP MMU
accounts for any role-specific reasons for disabling D-bit dirty logging.
Specifically, TDP MMU SPTEs must be write-protected when the TDP MMU is
being used to run an L2 (i.e. L1 has disabled EPT) and PML is enabled.
KVM always disables PML when running L2, even when L1 and L2 GPAs are in
the some domain, so failing to write-protect TDP MMU SPTEs will cause
writes made by L2 to not be reflected in the dirty log.
Reported-by: syzbot+900d58a45dcaab9e4821@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=900d58a45dcaab9e4821
Fixes: 5982a53926 ("KVM: x86/mmu: Use kvm_ad_enabled() to determine if TDP MMU SPTEs need wrprot")
Cc: stable@vger.kernel.org
Cc: Vipin Sharma <vipinsh@google.com>
Cc: Sean Christopherson <seanjc@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Link: https://lore.kernel.org/r/20240315230541.1635322-2-dmatlack@google.com
[sean: massage shortlog and changelog, tweak ternary op formatting]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Disable LBR virtualization if the CPU doesn't support callstacks, which
were introduced in HSW (see commit e9d7f7cd97 ("perf/x86/intel: Add
basic Haswell LBR call stack support"), as KVM unconditionally configures
the perf LBR event with PERF_SAMPLE_BRANCH_CALL_STACK, i.e. LBR
virtualization always fails on pre-HSW CPUs.
Simply disable LBR support on such CPUs, as it has never worked, i.e.
there is no risk of breaking an existing setup, and figuring out a way
to performantly context switch LBRs on old CPUs is not worth the effort.
Fixes: be635e34c2 ("KVM: vmx/pmu: Expose LBR_FMT in the MSR_IA32_PERF_CAPABILITIES")
Cc: Mingwei Zhang <mizhang@google.com>
Cc: Jim Mattson <jmattson@google.com>
Tested-by: Mingwei Zhang <mizhang@google.com>
Link: https://lore.kernel.org/r/20240307011344.835640-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Add a "has_callstack" field to the x86_pmu_lbr structure used to pass
information to KVM, and set it accordingly in x86_perf_get_lbr(). KVM
will use has_callstack to avoid trying to create perf LBR events with
PERF_SAMPLE_BRANCH_CALL_STACK on CPUs that don't support callstacks.
Reviewed-by: Mingwei Zhang <mizhang@google.com>
Link: https://lore.kernel.org/r/20240307011344.835640-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Snapshot VMX's LBR capabilities once during module initialization instead
of calling into perf every time a vCPU reconfigures its vPMU. This will
allow massaging the LBR capabilities, e.g. if the CPU doesn't support
callstacks, without having to remember to update multiple locations.
Opportunistically tag vmx_get_perf_capabilities() with __init, as it's
only called from vmx_set_cpu_caps().
Reviewed-by: Mingwei Zhang <mizhang@google.com>
Link: https://lore.kernel.org/r/20240307011344.835640-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
On AMD and Hygon platforms, the local APIC does not automatically set
the mask bit of the LVTPC register when handling a PMI and there is
no need to clear it in the kernel's PMI handler.
For guests, the mask bit is currently set by kvm_apic_local_deliver()
and unless it is cleared by the guest kernel's PMI handler, PMIs stop
arriving and break use-cases like sampling with perf record.
This does not affect non-PerfMonV2 guests because PMIs are handled in
the guest kernel by x86_pmu_handle_irq() which always clears the LVTPC
mask bit irrespective of the vendor.
Before:
$ perf record -e cycles:u true
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.001 MB perf.data (1 samples) ]
After:
$ perf record -e cycles:u true
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.002 MB perf.data (19 samples) ]
Fixes: a16eb25b09 ("KVM: x86: Mask LVTPC when handling a PMI")
Cc: stable@vger.kernel.org
Signed-off-by: Sandipan Das <sandipan.das@amd.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
[sean: use is_intel_compatible instead of !is_amd_or_hygon()]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20240405235603.1173076-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add kvm_vcpu_arch.is_amd_compatible to cache if a vCPU's vendor model is
compatible with AMD, i.e. if the vCPU vendor is AMD or Hygon, along with
helpers to check if a vCPU is compatible AMD vs. Intel. To handle Intel
vs. AMD behavior related to masking the LVTPC entry, KVM will need to
check for vendor compatibility on every PMI injection, i.e. querying for
AMD will soon be a moderately hot path.
Note! This subtly (or maybe not-so-subtly) makes "Intel compatible" KVM's
default behavior, both if userspace omits (or never sets) CPUID 0x0 and if
userspace sets a completely unknown vendor. One could argue that KVM
should treat such vCPUs as not being compatible with Intel *or* AMD, but
that would add useless complexity to KVM.
KVM needs to do *something* in the face of vendor specific behavior, and
so unless KVM conjured up a magic third option, choosing to treat unknown
vendors as neither Intel nor AMD means that checks on AMD compatibility
would yield Intel behavior, and checks for Intel compatibility would yield
AMD behavior. And that's far worse as it would effectively yield random
behavior depending on whether KVM checked for AMD vs. Intel vs. !AMD vs.
!Intel. And practically speaking, all x86 CPUs follow either Intel or AMD
architecture, i.e. "supporting" an unknown third architecture adds no
value.
Deliberately don't convert any of the existing guest_cpuid_is_intel()
checks, as the Intel side of things is messier due to some flows explicitly
checking for exactly vendor==Intel, versus some flows assuming anything
that isn't "AMD compatible" gets Intel behavior. The Intel code will be
cleaned up in the future.
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20240405235603.1173076-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Stop compiling vmenter.S with OBJECT_FILES_NON_STANDARD to skip objtool's
stack validation now that __svm_vcpu_run() and __svm_sev_es_vcpu_run()
create stack frames (though the former's effectiveness is dubious).
Note, due to a quirk in how OBJECT_FILES_NON_STANDARD was handled by the
build system prior to commit bf48d9b756 ("kbuild: change tool coverage
variables to take the path relative to $(obj)"), vmx/vmenter.S got lumped
in with svm/vmenter.S. __vmx_vcpu_run() already plays nice with frame
pointers, i.e. it was collateral damage when commit 7f4b5cde24 ("kvm:
Disable objtool frame pointer checking for vmenter.S") added the
OBJECT_FILES_NON_STANDARD hack-a-fix.
Link: https://lore.kernel.org/all/20240217055504.2059803-1-masahiroy@kernel.org
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20240223204233.3337324-9-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Now that KVM uses the host save area to context switch RBP, i.e.
preserves RBP for the entirety of __svm_sev_es_vcpu_run(), create a stack
frame using the standared FRAME_{BEGIN,END} macros.
Note, __svm_sev_es_vcpu_run() is subtly not a leaf function as it can call
into ibpb_feature() via UNTRAIN_RET_VM.
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20240223204233.3337324-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Use the host save area to preserve volatile registers that are used in
__svm_sev_es_vcpu_run() to access function parameters after #VMEXIT.
Like saving/restoring non-volatile registers, there's no reason not to
take advantage of hardware restoring registers on #VMEXIT, as doing so
shaves a few instructions and the save area is going to be accessed no
matter what.
Converting all register save/restore code to use the host save area also
make it easier to follow the SEV-ES VMRUN flow in its entirety, as
opposed to having a mix of stack-based versus host save area save/restore.
Add a parameter to RESTORE_HOST_SPEC_CTRL_BODY so that the SEV-ES path
doesn't need to write @spec_ctrl_intercepted to memory just to play nice
with the common macro.
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20240223204233.3337324-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Use the host save area to save/restore non-volatile (callee-saved)
registers in __svm_sev_es_vcpu_run() to take advantage of hardware loading
all registers from the save area on #VMEXIT. KVM still needs to save the
registers it wants restored, but the loads are handled automatically by
hardware.
Aside from less assembly code, letting hardware do the restoration means
stack frames are preserved for the entirety of __svm_sev_es_vcpu_run().
Opportunistically add a comment to call out why @svm needs to be saved
across VMRUN->#VMEXIT, as it's not easy to decipher that from the macro
hell.
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Michael Roth <michael.roth@amd.com>
Cc: Alexey Kardashevskiy <aik@amd.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20240223204233.3337324-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
POP @spec_ctrl_intercepted into RAX instead of RBX when discarding it from
the stack so that __svm_sev_es_vcpu_run() doesn't modify any non-volatile
registers. __svm_sev_es_vcpu_run() doesn't return a value, and RAX is
already are clobbered multiple times in the #VMEXIT path.
This will allowing using the host save area to save/restore non-volatile
registers in __svm_sev_es_vcpu_run().
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20240223204233.3337324-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Drop 32-bit "support" from __svm_sev_es_vcpu_run(), as SEV/SEV-ES firmly
64-bit only. The "support" was purely the result of bad copy+paste from
__svm_vcpu_run(), which in turn was slightly less bad copy+paste from
__vmx_vcpu_run().
Opportunistically convert to unadulterated register accesses so that it's
easier (but still not easy) to follow which registers hold what arguments,
and when.
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20240223204233.3337324-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Compile (and link) __svm_sev_es_vcpu_run() if and only if SEV support is
actually enabled. This will allow dropping non-existent 32-bit "support"
from __svm_sev_es_vcpu_run() without causing undue confusion.
Intentionally don't provide a stub (but keep the declaration), as any sane
compiler, even with things like KASAN enabled, should eliminate the call
to __svm_sev_es_vcpu_run() since sev_es_guest() unconditionally returns
"false" if CONFIG_KVM_AMD_SEV=n.
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20240223204233.3337324-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Unconditionally create a stack frame in __svm_vcpu_run() to play nice with
unwinding via frame pointers, at least until the point where RBP is loaded
with the guest's value. Don't bother conditioning the code on
CONFIG_FRAME_POINTER=y, as RBP needs to be saved and restored anyways (due
to it being clobbered with the guest's value); omitting the "MOV RSP, RBP"
is not worth the extra #ifdef.
Creating a stack frame will allow removing the OBJECT_FILES_NON_STANDARD
tag from vmenter.S once __svm_sev_es_vcpu_run() is fixed to not stomp all
over RBP for no reason.
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20240223204233.3337324-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Use the raw/true host.MAXPHYADDR when deciding whether or not KVM must
intercept #PFs when allow_smaller_maxphyaddr is enabled, as any adjustments
the kernel makes to boot_cpu_data.x86_phys_bits to account for MKTME KeyID
bits do not apply to the guest physical address space. I.e. the KeyID are
off-limits for host physical addresses, but are not reserved for GPAs as
far as hardware is concerned.
Signed-off-by: Tao Su <tao1.su@linux.intel.com>
Link: https://lore.kernel.org/r/20240319031111.495006-1-tao1.su@linux.intel.com
[sean: massage changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
max_guest_memory_test uses ucalls to sync with the host, but
it also resets the guest RIP back to its initial value in between
tests stages.
This makes the guest never reach the code which frees the ucall struct
and since a fixed pool of 512 ucall structs is used, the test starts
to fail when more that 256 vCPUs are used.
Fix that by replacing the manual register reset with a loop in
the guest code.
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20240315143507.102629-1-mlevitsk@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Add a guest assert in the PMU counters test to verify that KVM stuffs
the vCPU's post-RESET value to globally enable all general purpose
counters. Per Intel's SDM,
IA32_PERF_GLOBAL_CTRL: Sets bits n-1:0 and clears the upper bits.
and
Where "n" is the number of general-purpose counters available in
the processor.
For the edge case where there are zero GP counters, follow the spirit
of the architecture, not the SDM's literal wording, which doesn't account
for this possibility and would require the CPU to set _all_ bits in
PERF_GLOBAL_CTRL.
Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20240309013641.1413400-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Set the enable bits for general purpose counters in IA32_PERF_GLOBAL_CTRL
when refreshing the PMU to emulate the MSR's architecturally defined
post-RESET behavior. Per Intel's SDM:
IA32_PERF_GLOBAL_CTRL: Sets bits n-1:0 and clears the upper bits.
and
Where "n" is the number of general-purpose counters available in the processor.
AMD also documents this behavior for PerfMonV2 CPUs in one of AMD's many
PPRs.
Do not set any PERF_GLOBAL_CTRL bits if there are no general purpose
counters, although a literal reading of the SDM would require the CPU to
set either bits 63:0 or 31:0. The intent of the behavior is to globally
enable all GP counters; honor the intent, if not the letter of the law.
Leaving PERF_GLOBAL_CTRL '0' effectively breaks PMU usage in guests that
haven't been updated to work with PMUs that support PERF_GLOBAL_CTRL.
This bug was recently exposed when KVM added supported for AMD's
PerfMonV2, i.e. when KVM started exposing a vPMU with PERF_GLOBAL_CTRL to
guest software that only knew how to program v1 PMUs (that don't support
PERF_GLOBAL_CTRL).
Failure to emulate the post-RESET behavior results in such guests
unknowingly leaving all general purpose counters globally disabled (the
entire reason the post-RESET value sets the GP counter enable bits is to
maintain backwards compatibility).
The bug has likely gone unnoticed because PERF_GLOBAL_CTRL has been
supported on Intel CPUs for as long as KVM has existed, i.e. hardly anyone
is running guest software that isn't aware of PERF_GLOBAL_CTRL on Intel
PMUs. And because up until v6.0, KVM _did_ emulate the behavior for Intel
CPUs, although the old behavior was likely dumb luck.
Because (a) that old code was also broken in its own way (the history of
this code is a comedy of errors), and (b) PERF_GLOBAL_CTRL was documented
as having a value of '0' post-RESET in all SDMs before March 2023.
Initial vPMU support in commit f5132b0138 ("KVM: Expose a version 2
architectural PMU to a guests") *almost* got it right (again likely by
dumb luck), but for some reason only set the bits if the guest PMU was
advertised as v1:
if (pmu->version == 1) {
pmu->global_ctrl = (1 << pmu->nr_arch_gp_counters) - 1;
return;
}
Commit f19a0c2c2e ("KVM: PMU emulation: GLOBAL_CTRL MSR should be
enabled on reset") then tried to remedy that goof, presumably because
guest PMUs were leaving PERF_GLOBAL_CTRL '0', i.e. weren't enabling
counters.
pmu->global_ctrl = ((1 << pmu->nr_arch_gp_counters) - 1) |
(((1ull << pmu->nr_arch_fixed_counters) - 1) << X86_PMC_IDX_FIXED);
pmu->global_ctrl_mask = ~pmu->global_ctrl;
That was KVM's behavior up until commit c49467a45f ("KVM: x86/pmu:
Don't overwrite the pmu->global_ctrl when refreshing") removed
*everything*. However, it did so based on the behavior defined by the
SDM , which at the time stated that "Global Perf Counter Controls" is
'0' at Power-Up and RESET.
But then the March 2023 SDM (325462-079US), stealthily changed its
"IA-32 and Intel 64 Processor States Following Power-up, Reset, or INIT"
table to say:
IA32_PERF_GLOBAL_CTRL: Sets bits n-1:0 and clears the upper bits.
Note, kvm_pmu_refresh() can be invoked multiple times, i.e. it's not a
"pure" RESET flow. But it can only be called prior to the first KVM_RUN,
i.e. the guest will only ever observe the final value.
Note #2, KVM has always cleared global_ctrl during refresh (see commit
f5132b0138 ("KVM: Expose a version 2 architectural PMU to a guests")),
i.e. there is no danger of breaking existing setups by clobbering a value
set by userspace.
Reported-by: Babu Moger <babu.moger@amd.com>
Cc: Sandipan Das <sandipan.das@amd.com>
Cc: Like Xu <like.xu.linux@gmail.com>
Cc: Mingwei Zhang <mizhang@google.com>
Cc: Dapeng Mi <dapeng1.mi@linux.intel.com>
Cc: stable@vger.kernel.org
Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20240309013641.1413400-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Fix KVM_SET_MEMORY_ATTRIBUTES to not overflow lpage_info array and trigger
KASAN splat, as seen in the private_mem_conversions_test selftest.
When memory attributes are set on a GFN range, that range will have
specific properties applied to the TDP. A huge page cannot be used when
the attributes are inconsistent, so they are disabled for those the
specific huge pages. For internal KVM reasons, huge pages are also not
allowed to span adjacent memslots regardless of whether the backing memory
could be mapped as huge.
What GFNs support which huge page sizes is tracked by an array of arrays
'lpage_info' on the memslot, of ‘kvm_lpage_info’ structs. Each index of
lpage_info contains a vmalloc allocated array of these for a specific
supported page size. The kvm_lpage_info denotes whether a specific huge
page (GFN and page size) on the memslot is supported. These arrays include
indices for unaligned head and tail huge pages.
Preventing huge pages from spanning adjacent memslot is covered by
incrementing the count in head and tail kvm_lpage_info when the memslot is
allocated, but disallowing huge pages for memory that has mixed attributes
has to be done in a more complicated way. During the
KVM_SET_MEMORY_ATTRIBUTES ioctl KVM updates lpage_info for each memslot in
the range that has mismatched attributes. KVM does this a memslot at a
time, and marks a special bit, KVM_LPAGE_MIXED_FLAG, in the kvm_lpage_info
for any huge page. This bit is essentially a permanently elevated count.
So huge pages will not be mapped for the GFN at that page size if the
count is elevated in either case: a huge head or tail page unaligned to
the memslot or if KVM_LPAGE_MIXED_FLAG is set because it has mixed
attributes.
To determine whether a huge page has consistent attributes, the
KVM_SET_MEMORY_ATTRIBUTES operation checks an xarray to make sure it
consistently has the incoming attribute. Since level - 1 huge pages are
aligned to level huge pages, it employs an optimization. As long as the
level - 1 huge pages are checked first, it can just check these and assume
that if each level - 1 huge page contained within the level sized huge
page is not mixed, then the level size huge page is not mixed. This
optimization happens in the helper hugepage_has_attrs().
Unfortunately, although the kvm_lpage_info array representing page size
'level' will contain an entry for an unaligned tail page of size level,
the array for level - 1 will not contain an entry for each GFN at page
size level. The level - 1 array will only contain an index for any
unaligned region covered by level - 1 huge page size, which can be a
smaller region. So this causes the optimization to overflow the level - 1
kvm_lpage_info and perform a vmalloc out of bounds read.
In some cases of head and tail pages where an overflow could happen,
callers skip the operation completely as KVM_LPAGE_MIXED_FLAG is not
required to prevent huge pages as discussed earlier. But for memslots that
are smaller than the 1GB page size, it does call hugepage_has_attrs(). In
this case the huge page is both the head and tail page. The issue can be
observed simply by compiling the kernel with CONFIG_KASAN_VMALLOC and
running the selftest “private_mem_conversions_test”, which produces the
output like the following:
BUG: KASAN: vmalloc-out-of-bounds in hugepage_has_attrs+0x7e/0x110
Read of size 4 at addr ffffc900000a3008 by task private_mem_con/169
Call Trace:
dump_stack_lvl
print_report
? __virt_addr_valid
? hugepage_has_attrs
? hugepage_has_attrs
kasan_report
? hugepage_has_attrs
hugepage_has_attrs
kvm_arch_post_set_memory_attributes
kvm_vm_ioctl
It is a little ambiguous whether the unaligned head page (in the bug case
also the tail page) should be expected to have KVM_LPAGE_MIXED_FLAG set.
It is not functionally required, as the unaligned head/tail pages will
already have their kvm_lpage_info count incremented. The comments imply
not setting it on unaligned head pages is intentional, so fix the callers
to skip trying to set KVM_LPAGE_MIXED_FLAG in this case, and in doing so
not call hugepage_has_attrs().
Cc: stable@vger.kernel.org
Fixes: 90b4fe1798 ("KVM: x86: Disallow hugepages when memory attributes are mixed")
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Chao Peng <chao.p.peng@linux.intel.com>
Link: https://lore.kernel.org/r/20240314212902.2762507-1-rick.p.edgecombe@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Drop support for virtualizing adaptive PEBS, as KVM's implementation is
architecturally broken without an obvious/easy path forward, and because
exposing adaptive PEBS can leak host LBRs to the guest, i.e. can leak
host kernel addresses to the guest.
Bug #1 is that KVM doesn't account for the upper 32 bits of
IA32_FIXED_CTR_CTRL when (re)programming fixed counters, e.g
fixed_ctrl_field() drops the upper bits, reprogram_fixed_counters()
stores local variables as u8s and truncates the upper bits too, etc.
Bug #2 is that, because KVM _always_ sets precise_ip to a non-zero value
for PEBS events, perf will _always_ generate an adaptive record, even if
the guest requested a basic record. Note, KVM will also enable adaptive
PEBS in individual *counter*, even if adaptive PEBS isn't exposed to the
guest, but this is benign as MSR_PEBS_DATA_CFG is guaranteed to be zero,
i.e. the guest will only ever see Basic records.
Bug #3 is in perf. intel_pmu_disable_fixed() doesn't clear the upper
bits either, i.e. leaves ICL_FIXED_0_ADAPTIVE set, and
intel_pmu_enable_fixed() effectively doesn't clear ICL_FIXED_0_ADAPTIVE
either. I.e. perf _always_ enables ADAPTIVE counters, regardless of what
KVM requests.
Bug #4 is that adaptive PEBS *might* effectively bypass event filters set
by the host, as "Updated Memory Access Info Group" records information
that might be disallowed by userspace via KVM_SET_PMU_EVENT_FILTER.
Bug #5 is that KVM doesn't ensure LBR MSRs hold guest values (or at least
zeros) when entering a vCPU with adaptive PEBS, which allows the guest
to read host LBRs, i.e. host RIPs/addresses, by enabling "LBR Entries"
records.
Disable adaptive PEBS support as an immediate fix due to the severity of
the LBR leak in particular, and because fixing all of the bugs will be
non-trivial, e.g. not suitable for backporting to stable kernels.
Note! This will break live migration, but trying to make KVM play nice
with live migration would be quite complicated, wouldn't be guaranteed to
work (i.e. KVM might still kill/confuse the guest), and it's not clear
that there are any publicly available VMMs that support adaptive PEBS,
let alone live migrate VMs that support adaptive PEBS, e.g. QEMU doesn't
support PEBS in any capacity.
Link: https://lore.kernel.org/all/20240306230153.786365-1-seanjc@google.com
Link: https://lore.kernel.org/all/ZeepGjHCeSfadANM@google.com
Fixes: c59a1f106f ("KVM: x86/pmu: Add IA32_PEBS_ENABLE MSR emulation for extended PEBS")
Cc: stable@vger.kernel.org
Cc: Like Xu <like.xu.linux@gmail.com>
Cc: Mingwei Zhang <mizhang@google.com>
Cc: Zhenyu Wang <zhenyuw@linux.intel.com>
Cc: Zhang Xiong <xiong.y.zhang@intel.com>
Cc: Lv Zhiyuan <zhiyuan.lv@intel.com>
Cc: Dapeng Mi <dapeng1.mi@intel.com>
Cc: Jim Mattson <jmattson@google.com>
Acked-by: Like Xu <likexu@tencent.com>
Link: https://lore.kernel.org/r/20240307005833.827147-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Explicit disallow activating a gfn_to_pfn_cache with an error gpa, i.e.
INVALID_GPA, to ensure that KVM doesn't mistake a GPA-based cache for an
HVA-based cache (KVM uses INVALID_GPA as a magic value to differentiate
between GPA-based and HVA-based caches).
WARN if KVM attempts to activate a cache with INVALID_GPA, purely so that
new caches need to at least consider what to do with a "bad" GPA, as all
existing usage of kvm_gpc_activate() guarantees gpa != INVALID_GPA. I.e.
removing the WARN in the future is completely reasonable if doing so would
yield cleaner/better code overall.
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
Link: https://lore.kernel.org/r/20240320001542.3203871-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Add a helper to check that the incoming length for a gfn_to_pfn_cache is
valid with respect to the cache's GPA and/or HVA. To avoid activating a
cache with a bogus GPA, a future fix will fork the page split check in
the inner refresh path into activate() and the public rerfresh() APIs, at
which point KVM will check the length in three separate places.
Deliberately keep the "page offset" logic open coded, as the only other
path that consumes the offset, __kvm_gpc_refresh(), already needs to
differentiate between GPA-based and HVA-based caches, and it's not obvious
that using a helper is a net positive in overall code readability.
Note, for GPA-based caches, this has a subtle side effect of using the GPA
instead of the resolved HVA in the check() path, but that should be a nop
as the HVA offset is derived from the GPA, i.e. the two offsets are
identical, barring a KVM bug.
Reviewed-by: Paul Durrant <paul@xen.org>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Link: https://lore.kernel.org/r/20240320001542.3203871-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Pull x86 fixes from Ingo Molnar:
- Fix MCE timer reinit locking
- Fix/improve CoCo guest random entropy pool init
- Fix SEV-SNP late disable bugs
- Fix false positive objtool build warning
- Fix header dependency bug
- Fix resctrl CPU offlining bug
* tag 'x86-urgent-2024-04-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/retpoline: Add NOENDBR annotation to the SRSO dummy return thunk
x86/mce: Make sure to grab mce_sysfs_mutex in set_bank()
x86/CPU/AMD: Track SNP host status with cc_platform_*()
x86/cc: Add cc_platform_set/_clear() helpers
x86/kvm/Kconfig: Have KVM_AMD_SEV select ARCH_HAS_CC_PLATFORM
x86/coco: Require seeding RNG with RDRAND on CoCo systems
x86/numa/32: Include missing <asm/pgtable_areas.h>
x86/resctrl: Fix uninitialized memory read when last CPU of domain goes offline
Pull timer fixes from Ingo Molnar:
"Fix various timer bugs:
- Fix a timer migration bug that may result in missed events
- Fix timer migration group hierarchy event updates
- Fix a PowerPC64 build warning
- Fix a handful of DocBook annotation bugs"
* tag 'timers-urgent-2024-04-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
timers/migration: Return early on deactivation
timers/migration: Fix ignored event due to missing CPU update
vdso: Use CONFIG_PAGE_SHIFT in vdso/datapage.h
timers: Fix text inconsistencies and spelling
tick/sched: Fix struct tick_sched doc warnings
tick/sched: Fix various kernel-doc warnings
timers: Fix kernel-doc format and add Return values
time/timekeeping: Fix kernel-doc warnings and typos
time/timecounter: Fix inline documentation
Pull x86 perf fix from Ingo Molnar:
"Fix a combined PEBS events bug on x86 Intel CPUs"
* tag 'perf-urgent-2024-04-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/intel/ds: Don't clear ->pebs_data_cfg for the last PEBS event
Pull nfsd fixes from Chuck Lever:
- Address a slow memory leak with RPC-over-TCP
- Prevent another NFS4ERR_DELAY loop during CREATE_SESSION
* tag 'nfsd-6.9-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
nfsd: hold a lighter-weight client reference over CB_RECALL_ANY
SUNRPC: Fix a slow server-side memory leak with RPC-over-TCP
Pull xfs fix from Chandan Babu:
- Allow creating new links to special files which were not associated
with a project quota
* tag 'xfs-6.9-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: allow cross-linking special files without project quota
Pull smb client fixes from Steve French:
- fix to retry close to avoid potential handle leaks when server
returns EBUSY
- DFS fixes including a fix for potential use after free
- fscache fix
- minor strncpy cleanup
- reconnect race fix
- deal with various possible UAF race conditions tearing sessions down
* tag '6.9-rc2-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
smb: client: fix potential UAF in cifs_signal_cifsd_for_reconnect()
smb: client: fix potential UAF in smb2_is_network_name_deleted()
smb: client: fix potential UAF in is_valid_oplock_break()
smb: client: fix potential UAF in smb2_is_valid_oplock_break()
smb: client: fix potential UAF in smb2_is_valid_lease_break()
smb: client: fix potential UAF in cifs_stats_proc_show()
smb: client: fix potential UAF in cifs_stats_proc_write()
smb: client: fix potential UAF in cifs_dump_full_key()
smb: client: fix potential UAF in cifs_debug_files_proc_show()
smb3: retrying on failed server close
smb: client: serialise cifs_construct_tcon() with cifs_mount_mutex
smb: client: handle DFS tcons in cifs_construct_tcon()
smb: client: refresh referral without acquiring refpath_lock
smb: client: guarantee refcounted children from parent session
cifs: Fix caching to try to do open O_WRONLY as rdwr on server
smb: client: fix UAF in smb2_reconnect_server()
smb: client: replace deprecated strncpy with strscpy
srso_alias_untrain_ret() is special code, even if it is a dummy
which is called in the !SRSO case, so annotate it like its real
counterpart, to address the following objtool splat:
vmlinux.o: warning: objtool: .export_symbol+0x2b290: data relocation to !ENDBR: srso_alias_untrain_ret+0x0
Fixes: 4535e1a417 ("x86/bugs: Fix the SRSO mitigation on Zen3/4")
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20240405144637.17908-1-bp@kernel.org
We want to fix:
0e11073247 ("x86/retpoline: Do the necessary fixup to the Zen3/4 srso return thunk for !SRSO")
So merge in Linus's latest into x86/urgent to have it available.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull firewire fixes from Takashi Sakamoto:
"The firewire-ohci kernel module has a parameter for verbose kernel
logging. It is well-known that it logs the spurious IRQ for bus-reset
event due to the unmasked register for IRQ event. This update fixes
the issue"
* tag 'firewire-fixes-6.9-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394:
firewire: ohci: mask bus reset interrupts between ISR and bottom half
In the FireWire OHCI interrupt handler, if a bus reset interrupt has
occurred, mask bus reset interrupts until bus_reset_work has serviced and
cleared the interrupt.
Normally, we always leave bus reset interrupts masked. We infer the bus
reset from the self-ID interrupt that happens shortly thereafter. A
scenario where we unmask bus reset interrupts was introduced in 2008 in
a007bb857e: If
OHCI_PARAM_DEBUG_BUSRESETS (8) is set in the debug parameter bitmask, we
will unmask bus reset interrupts so we can log them.
irq_handler logs the bus reset interrupt. However, we can't clear the bus
reset event flag in irq_handler, because we won't service the event until
later. irq_handler exits with the event flag still set. If the
corresponding interrupt is still unmasked, the first bus reset will
usually freeze the system due to irq_handler being called again each
time it exits. This freeze can be reproduced by loading firewire_ohci
with "modprobe firewire_ohci debug=-1" (to enable all debugging output).
Apparently there are also some cases where bus_reset_work will get called
soon enough to clear the event, and operation will continue normally.
This freeze was first reported a few months after a007bb85 was committed,
but until now it was never fixed. The debug level could safely be set
to -1 through sysfs after the module was loaded, but this would be
ineffectual in logging bus reset interrupts since they were only
unmasked during initialization.
irq_handler will now leave the event flag set but mask bus reset
interrupts, so irq_handler won't be called again and there will be no
freeze. If OHCI_PARAM_DEBUG_BUSRESETS is enabled, bus_reset_work will
unmask the interrupt after servicing the event, so future interrupts
will be caught as desired.
As a side effect to this change, OHCI_PARAM_DEBUG_BUSRESETS can now be
enabled through sysfs in addition to during initial module loading.
However, when enabled through sysfs, logging of bus reset interrupts will
be effective only starting with the second bus reset, after
bus_reset_work has executed.
Signed-off-by: Adam Goldman <adamg@pobox.com>
Signed-off-by: Takashi Sakamoto <o-takashi@sakamocchi.jp>
Pull spi fixes from Mark Brown:
"A few small driver specific fixes, the most important being the
s3c64xx change which is likely to be hit during normal operation"
* tag 'spi-fix-v6.9-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi:
spi: mchp-pci1xxx: Fix a possible null pointer dereference in pci1xxx_spi_probe
spi: spi-fsl-lpspi: remove redundant spi_controller_put call
spi: s3c64xx: Use DMA mode from fifo size
Pull regulator fix from Mark Brown:
"One simple regualtor fix, fixing module autoloading on tps65132"
* tag 'regulator-fix-v6.9-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator:
regulator: tps65132: Add of_match table
Pull regmap fixes from Mark Brown:
"Richard found a nasty corner case in the maple tree code which he
fixed, and also fixed a compiler warning which was showing up with the
toolchain he uses and helpfully identified a possible incorrect error
code which could have runtime impacts"
* tag 'regmap-fix-v6.9-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap:
regmap: maple: Fix uninitialized symbol 'ret' warnings
regmap: maple: Fix cache corruption in regcache_maple_drop()
Pull io_uring fixes from Jens Axboe:
- Backport of some fixes that came up during development of the 6.10
io_uring patches. This includes some kbuf cleanups and reference
fixes.
- Disable multishot read if we don't have NOWAIT support on the target
- Fix for a dependency issue with workqueue flushing
* tag 'io_uring-6.9-20240405' of git://git.kernel.dk/linux:
io_uring/kbuf: hold io_buffer_list reference over mmap
io_uring/kbuf: protect io_buffer_list teardown with a reference
io_uring/kbuf: get rid of bl->is_ready
io_uring/kbuf: get rid of lower BGID lists
io_uring: use private workqueue for exit work
io_uring: disable io-wq execution of multishot NOWAIT requests
io_uring/rw: don't allow multishot reads without NOWAIT support