Commit Graph

238443 Commits

Author SHA1 Message Date
Oliver Upton
5aea409638 KVM: arm64: nv: Allow userspace to de-feature stage-2 TGRANs
KVM advertises the stage-2 TGRAN fields as writable to userspace but
prevents any modification for NV-enabled VMs. Update the special-cased
sanitization to permit de-featuring a particular TGRAN without allowing
the legacy value which refers to the stage-1 field for support.

Reported-by: Itaru Kitayama <itaru.kitayama@linux.dev>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-09-19 10:52:01 +01:00
Oliver Upton
ff37a41db8 KVM: arm64: nv: Treat AMO as 1 when at EL2 and {E2H,TGE} = {1, 0}
SErrors are not deliverable at EL2 when the effective value of
HCR_EL2.{TGE,AMO} = {0, 0}. This is bothersome to deal with in nested
as we need to use auxiliary pending state to track the pending vSError
since HCR_EL2.VSE has no mechanism for honoring the guest HCR. On top of
that, we have no way of making that auxiliary pending state visible in
ISR_EL1.

A defect against the architecture now allows an implementation to treat
HCR_EL2.AMO as 1 when HCR_EL2.{E2H,TGE} = {1, 0}. Let's do exactly that,
meaning SErrors are always deliverable at EL2 for the typical E2H=RES1
VM.

Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-09-19 10:41:56 +01:00
Suzuki K Poulose
d02c2e45b1 arm64: acpi: Enable ACPI CCEL support
Add support for ACPI CCEL by handling the EfiACPIMemoryNVS type memory.
As per UEFI specifications NVS memory is reserved for Firmware use even
after exiting boot services. Thus map the region as read-only.

Cc: Sami Mujawar <sami.mujawar@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@kernel.org>
Cc: Steven Price <steven.price@arm.com>
Cc: Sudeep Holla <sudeep.holla@arm.com>
Cc: Gavin Shan <gshan@redhat.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Tested-by: Sami Mujawar <sami.mujawar@arm.com>
Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
2025-09-19 10:12:02 +01:00
Suzuki K Poulose
9e8a3df3e7 arm64: Enable EFI secret area Securityfs support
Enable EFI COCO secrets support. Provide the ioremap_encrypted() support required
by the driver.

Cc: Sami Mujawar <sami.mujawar@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@kernel.org>
Cc: Steven Price <steven.price@arm.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Tested-by: Sami Mujawar <sami.mujawar@arm.com>
Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
2025-09-19 10:12:01 +01:00
Suzuki K Poulose
fa84e534c3 arm64: realm: ioremap: Allow mapping memory as encrypted
For ioremap(), so far we only checked if it was a device (RIPAS_DEV) to choose
an encrypted vs decrypted mapping. However, we may have firmware reserved memory
regions exposed to the OS (e.g., EFI Coco Secret Securityfs, ACPI CCEL).
We need to make sure that anything that is RIPAS_RAM (i.e., Guest
protected memory with RMM guarantees) are also mapped as encrypted.

Rephrasing the above, anything that is not RIPAS_EMPTY is guaranteed to be
protected by the RMM. Thus we choose encrypted mapping for anything that is not
RIPAS_EMPTY. While at it, rename the helper function

  __arm64_is_protected_mmio => arm64_rsi_is_protected

to clearly indicate that this not an arm64 generic helper, but something to do
with Realms.

Cc: Sami Mujawar <sami.mujawar@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@kernel.org>
Cc: Steven Price <steven.price@arm.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Steven Price <steven.price@arm.com>
Tested-by: Sami Mujawar <sami.mujawar@arm.com>
Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
2025-09-19 10:12:01 +01:00
J. Neuschäfer
07c7f4f4e9 arm64: dts: allwinner: h313: Add Amediatech X96Q
The X96Q is a set-top box with an H313 SoC, AXP305 PMIC, 1 or 2 GiB RAM,
8 or 16 GiB eMMC flash, 2x USB A, Micro-SD, HDMI, Ethernet, audio/video
output, and infrared input.

  https://x96mini.com/products/x96q-tv-box-android-10-set-top-box

Tested, works:
- debug UART
- status LED
- USB ports in host mode
- MicroSD
- eMMC
- recovery button hidden behind audio/video port
- analog audio (line out)

Does not work:
- Ethernet (requires AC200 MFD/EPHY driver)
- WLAN (requires out-of-tree XRadio driver)
- analog video output (requires AC200 driver)
- HDMI audio/video output

Untested:
- "OTG" USB port in device mode
- built-in IR receiver
- external IR receiver

Table of regulators on the downstream kernel, for reference:

 vcc-5v      1   15      0 unknown  5000mV     0mA  5000mV  5000mV
    dcdca    0    0      0 unknown   900mV     0mA     0mV     0mV
    dcdcb    0    0      0 unknown  1350mV     0mA     0mV     0mV
    dcdcc    0    0      0 unknown   900mV     0mA     0mV     0mV
    dcdcd    0    0      0 unknown  1500mV     0mA     0mV     0mV
    dcdce    0    0      0 unknown  3300mV     0mA     0mV     0mV
    aldo1    0    0      0 unknown  3300mV     0mA     0mV     0mV
    aldo2    0    0      0 unknown   700mV     0mA     0mV     0mV
    aldo3    0    0      0 unknown   700mV     0mA     0mV     0mV
    bldo1    0    0      0 unknown  1800mV     0mA     0mV     0mV
    bldo2    0    0      0 unknown  1800mV     0mA     0mV     0mV
    bldo3    0    0      0 unknown   700mV     0mA     0mV     0mV
    bldo4    0    0      0 unknown   700mV     0mA     0mV     0mV
    cldo1    0    0      0 unknown  2500mV     0mA     0mV     0mV
    cldo2    0    0      0 unknown   700mV     0mA     0mV     0mV
    cldo3    0    0      0 unknown   700mV     0mA     0mV     0mV

Signed-off-by: J. Neuschäfer <j.ne@posteo.net>
Reviewed-by: Andre Przywara <andre.przywara@arm.com>
Link: https://patch.msgid.link/20250918-x96q-v2-2-51bd39928806@posteo.net
Signed-off-by: Chen-Yu Tsai <wens@csie.org>
2025-09-19 12:37:16 +08:00
Aleksa Paunovic
a8fed1bc03 riscv: Add xmipsexectl as a vendor extension
Add support for MIPS vendor extensions. Add support for the xmipsexectl
vendor extension.

Signed-off-by: Aleksa Paunovic <aleksa.paunovic@htecgroup.com>
Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Link: https://lore.kernel.org/r/20250724-p8700-pause-v5-2-a6cbbe1c3412@htecgroup.com
[pjw@kernel.org: added the MIPS vendor ID from another patch to fix the build]
Signed-off-by: Paul Walmsley <pjw@kernel.org>
2025-09-18 20:36:00 -06:00
Clément Léger
2e2cf5581f riscv: cpufeature: add validation for zfa, zfh and zfhmin
These extensions depends on the F one. Add a validation callback
checking for the F extension to be present. Now that extensions are
correctly reported using the F/D presence, we can remove the
has_fpu() check in hwprobe_isa_ext0().

Signed-off-by: Clément Léger <cleger@rivosinc.com>
Reviewed-by: Conor Dooley <conor.dooley@microchip.com>
Link: https://lore.kernel.org/r/20250527100001.33284-1-cleger@rivosinc.com
Signed-off-by: Paul Walmsley <pjw@kernel.org>
2025-09-18 19:51:09 -06:00
Andrew Davis
70ddf86d76 riscv: sbi: Switch to new sys-off handler API
Kernel now supports chained power-off handlers. Use
register_platform_power_off() that registers a platform level power-off
handler. Legacy pm_power_off() will be removed once all drivers and archs
are converted to the new sys-off API.

Signed-off-by: Andrew Davis <afd@ti.com>
Tested-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Link: https://lore.kernel.org/r/20250813151855.105237-1-afd@ti.com
Signed-off-by: Paul Walmsley <pjw@kernel.org>
2025-09-18 17:48:04 -06:00
Yang Shi
a166563e7e arm64: mm: support large block mapping when rodata=full
When rodata=full is specified, kernel linear mapping has to be mapped at
PTE level since large page table can't be split due to break-before-make
rule on ARM64.

This resulted in a couple of problems:
  - performance degradation
  - more TLB pressure
  - memory waste for kernel page table

With FEAT_BBM level 2 support, splitting large block page table to
smaller ones doesn't need to make the page table entry invalid anymore.
This allows kernel split large block mapping on the fly.

Add kernel page table split support and use large block mapping by
default when FEAT_BBM level 2 is supported for rodata=full.  When
changing permissions for kernel linear mapping, the page table will be
split to smaller size.

The machine without FEAT_BBM level 2 will fallback to have kernel linear
mapping PTE-mapped when rodata=full.

With this we saw significant performance boost with some benchmarks and
much less memory consumption on my AmpereOne machine (192 cores, 1P)
with 256GB memory.

* Memory use after boot
Before:
MemTotal:       258988984 kB
MemFree:        254821700 kB

After:
MemTotal:       259505132 kB
MemFree:        255410264 kB

Around 500MB more memory are free to use.  The larger the machine, the
more memory saved.

* Memcached
We saw performance degradation when running Memcached benchmark with
rodata=full vs rodata=on.  Our profiling pointed to kernel TLB pressure.
With this patchset we saw ops/sec is increased by around 3.5%, P99
latency is reduced by around 9.6%.
The gain mainly came from reduced kernel TLB misses.  The kernel TLB
MPKI is reduced by 28.5%.

The benchmark data is now on par with rodata=on too.

* Disk encryption (dm-crypt) benchmark
Ran fio benchmark with the below command on a 128G ramdisk (ext4) with
disk encryption (by dm-crypt).
fio --directory=/data --random_generator=lfsr --norandommap            \
    --randrepeat 1 --status-interval=999 --rw=write --bs=4k --loops=1  \
    --ioengine=sync --iodepth=1 --numjobs=1 --fsync_on_close=1         \
    --group_reporting --thread --name=iops-test-job --eta-newline=1    \
    --size 100G

The IOPS is increased by 90% - 150% (the variance is high, but the worst
number of good case is around 90% more than the best number of bad
case). The bandwidth is increased and the avg clat is reduced
proportionally.

* Sequential file read
Read 100G file sequentially on XFS (xfs_io read with page cache
populated). The bandwidth is increased by 150%.

Co-developed-by: Ryan Roberts <ryan.roberts@arm.com>
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Yang Shi <yang@os.amperecomputing.com>
Signed-off-by: Will Deacon <will@kernel.org>
2025-09-18 21:36:37 +01:00
Dev Jain
a660194dd1 arm64: Enable permission change on arm64 kernel block mappings
This patch paves the path to enable huge mappings in vmalloc space and
linear map space by default on arm64. For this we must ensure that we
can handle any permission games on the kernel (init_mm) pagetable.
Previously, __change_memory_common() used apply_to_page_range() which
does not support changing permissions for block mappings. We move away
from this by using the pagewalk API, similar to what riscv does right
now. It is the responsibility of the caller to ensure that the range
over which permissions are being changed falls on leaf mapping
boundaries. For systems with BBML2, this will be handled in future
patches by dyanmically splitting the mappings when required.

Unlike apply_to_page_range(), the pagewalk API currently enforces the
init_mm.mmap_lock to be held. To avoid the unnecessary bottleneck of the
mmap_lock for our usecase, this patch extends this generic API to be
used locklessly, so as to retain the existing behaviour for changing
permissions. Apart from this reason, it is noted at [1] that KFENCE can
manipulate kernel pgtable entries during softirqs. It does this by
calling set_memory_valid() -> __change_memory_common(). This being a
non-sleepable context, we cannot take the init_mm mmap lock.

Add comments to highlight the conditions under which we can use the
lockless variant - no underlying VMA, and the user having exclusive
control over the range, thus guaranteeing no concurrent access.

We require that the start and end of a given range do not partially
overlap block mappings, or cont mappings. Return -EINVAL in case a
partial block mapping is detected in any of the PGD/P4D/PUD/PMD levels;
add a corresponding comment in update_range_prot() to warn that
eliminating such a condition is the responsibility of the caller.

Note that, the pte level callback may change permissions for a whole
contpte block, and that will be done one pte at a time, as opposed to an
atomic operation for the block mappings. This is fine as any access will
decode either the old or the new permission until the TLBI.

apply_to_page_range() currently performs all pte level callbacks while
in lazy mmu mode. Since arm64 can optimize performance by batching
barriers when modifying kernel pgtables in lazy mmu mode, we would like
to continue to benefit from this optimisation. Unfortunately
walk_kernel_page_table_range() does not use lazy mmu mode. However,
since the pagewalk framework is not allocating any memory, we can safely
bracket the whole operation inside lazy mmu mode ourselves. Therefore,
wrap the call to walk_kernel_page_table_range() with the lazy MMU
helpers.

Link: https://lore.kernel.org/linux-arm-kernel/89d0ad18-4772-4d8f-ae8a-7c48d26a927e@arm.com/ [1]
Signed-off-by: Dev Jain <dev.jain@arm.com>
Signed-off-by: Yang Shi <yshi@os.amperecomputing.com>
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
2025-09-18 21:36:37 +01:00
Yang Shi
13efe932d2 arm64: cpufeature: add AmpereOne to BBML2 allow list
AmpereOne supports BBML2 without conflict abort, add to the allow list.

Reviewed-by: Christoph Lameter (Ampere) <cl@gentwo.org>
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Yang Shi <yang@os.amperecomputing.com>
Signed-off-by: Will Deacon <will@kernel.org>
2025-09-18 21:26:50 +01:00
Jeremy Linton
ea87c5536a arm64: probes: Fix incorrect bl/blr address and register usage
The pt_regs registers are 64-bit on arm64, and should be u64 when
manipulated. Correct this so that we aren't truncating the address
during br/blr sequences.

Fixes: efb07ac534 ("arm64: probes: Add GCS support to bl/blr/ret")
Signed-off-by: Jeremy Linton <jeremy.linton@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
2025-09-18 21:06:59 +01:00
Sean Christopherson
c49aa98376 KVM: x86/pmu: Restrict GLOBAL_{CTRL,STATUS}, fixed PMCs, and PEBS to PMU v2+
Restrict support for GLOBAL_CTRL, GLOBAL_STATUS, fixed PMCs, and PEBS to
v2 or later vPMUs.  The SDM explicitly states that GLOBAL_{CTRL,STATUS} and
fixed counters were introduced with PMU v2, and PEBS has hard dependencies
on fixed counters and the bitmap MSR layouts established by PMU v2.

Reported-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Tested-by: Xudong Hao <xudong.hao@intel.com>
Link: https://lore.kernel.org/r/20250806195706.1650976-32-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-09-18 12:58:15 -07:00
Sean Christopherson
9bae7a0863 KVM: x86/pmu: Move initialization of valid PMCs bitmask to common x86
Move all initialization of all_valid_pmc_idx to common code, as the logic
is 100% common to Intel and AMD, and KVM heavily relies on Intel and AMD
having the same semantics.  E.g. the fact that AMD doesn't support fixed
counters doesn't allow KVM to use all_valid_pmc_idx[63:32] for other
purposes.

Tested-by: Xudong Hao <xudong.hao@intel.com>
Link: https://lore.kernel.org/r/20250806195706.1650976-31-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-09-18 12:58:13 -07:00
Dapeng Mi
30c0267f15 KVM: x86/pmu: Use BIT_ULL() instead of open coded equivalents
Replace a variety of "1ull << N" and "(u64)1 << N" snippets with BIT_ULL()
in the PMU code.

No functional change intended.

Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
[sean: split to separate patch, write changelog]
Tested-by: Xudong Hao <xudong.hao@intel.com>
Link: https://lore.kernel.org/r/20250806195706.1650976-30-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-09-18 12:57:20 -07:00
Dapeng Mi
2bff2edf69 KVM: VMX: Add helpers to toggle/change a bit in VMCS execution controls
Expand the VMCS controls builder macros to generate helpers to change a
bit to the desired value, and use the new helpers when toggling APICv
related controls.

No functional change intended.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
[sean: rewrite changelog]
Tested-by: Xudong Hao <xudong.hao@intel.com>
Link: https://lore.kernel.org/r/20250806195706.1650976-27-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-09-18 12:57:19 -07:00
Sean Christopherson
5a1a726e68 KVM: x86: Use KVM_REQ_RECALC_INTERCEPTS to react to CPUID updates
Defer recalculating MSR and instruction intercepts after a CPUID update
via RECALC_INTERCEPTS to converge on RECALC_INTERCEPTS as the "official"
mechanism for triggering recalcs.  As a bonus, because KVM does a "recalc"
during vCPU creation, and every functional VMM sets CPUID at least once,
for all intents and purposes this saves at least one recalc.

Tested-by: Xudong Hao <xudong.hao@intel.com>
Link: https://lore.kernel.org/r/20250806195706.1650976-26-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-09-18 12:57:19 -07:00
Sean Christopherson
6057497336 KVM: x86: Rework KVM_REQ_MSR_FILTER_CHANGED into a generic RECALC_INTERCEPTS
Rework the MSR_FILTER_CHANGED request into a more generic RECALC_INTERCEPTS
request, and expand the responsibilities of vendor code to recalculate all
intercepts that vary based on userspace input, e.g. instruction intercepts
that are tied to guest CPUID.

Providing a generic recalc request will allow the upcoming mediated PMU
support to trigger a recalc when PMU features, e.g. PERF_CAPABILITIES, are
set by userspace, without having to make multiple calls to/from PMU code.
As a bonus, using a request will effectively coalesce recalcs, e.g. will
reduce the number of recalcs for normal usage from 3+ to 1 (vCPU create,
set CPUID, set PERF_CAPABILITIES (Intel only), set filter).

The downside is that MSR filter changes that are done in isolation will do
a small amount of unnecessary work, but that's already a relatively slow
path, and the cost of recalculating instruction intercepts is negligible.

Tested-by: Xudong Hao <xudong.hao@intel.com>
Link: https://lore.kernel.org/r/20250806195706.1650976-25-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-09-18 12:57:18 -07:00
Dapeng Mi
cdfed9370b KVM: x86/pmu: Move PMU_CAP_{FW_WRITES,LBR_FMT} into msr-index.h header
Move PMU_CAP_{FW_WRITES,LBR_FMT} into msr-index.h and rename them with
PERF_CAP prefix to keep consistent with other perf capabilities macros.

No functional change intended.

Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Tested-by: Xudong Hao <xudong.hao@intel.com>
Link: https://lore.kernel.org/r/20250806195706.1650976-24-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-09-18 12:57:16 -07:00
Dapeng Mi
1e24bece26 KVM: x86: Rename vmx_vmentry/vmexit_ctrl() helpers
Rename the two helpers vmx_vmentry/vmexit_ctrl() to
vmx_get_initial_vmentry/vmexit_ctrl() to represent their real meaning.

No functional change intended.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Tested-by: Xudong Hao <xudong.hao@intel.com>
Link: https://lore.kernel.org/r/20250806195706.1650976-23-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-09-18 12:56:45 -07:00
Sean Christopherson
51f34b1e65 KVM: x86/pmu: Snapshot host (i.e. perf's) reported PMU capabilities
Take a snapshot of the unadulterated PMU capabilities provided by perf so
that KVM can compare guest vPMU capabilities against hardware capabilities
when determining whether or not to intercept PMU MSRs (and RDPMC).

Reviewed-by: Sandipan Das <sandipan.das@amd.com>
Tested-by: Xudong Hao <xudong.hao@intel.com>
Link: https://lore.kernel.org/r/20250806195706.1650976-18-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-09-18 12:56:44 -07:00
Sean Christopherson
e3d1f2826d KVM: SVM: Check pmu->version, not enable_pmu, when getting PMC MSRs
Gate access to PMC MSRs based on pmu->version, not on kvm->arch.enable_pmu,
to more accurately reflect KVM's behavior.  This is a glorified nop, as
pmu->version and pmu->nr_arch_gp_counters can only be non-zero if
amd_pmu_refresh() is reached, kvm_pmu_refresh() invokes amd_pmu_refresh()
if and only if kvm->arch.enable_pmu is true, and amd_pmu_refresh() forces
pmu->version to be 1 or 2.

I.e. the following holds true:

  !pmu->nr_arch_gp_counters || kvm->arch.enable_pmu == (pmu->version > 0)

and so the only way for amd_pmu_get_pmc() to return a non-NULL value is if
both kvm->arch.enable_pmu and pmu->version evaluate to true.

No real functional change intended.

Reviewed-by: Sandipan Das <sandipan.das@amd.com>
Tested-by: Xudong Hao <xudong.hao@intel.com>
Link: https://lore.kernel.org/r/20250806195706.1650976-16-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-09-18 12:56:28 -07:00
Sean Christopherson
4687a2c4e6 KVM: VMX: Setup canonical VMCS config prior to kvm_x86_vendor_init()
Setup the golden VMCS config during vmx_init(), before the call to
kvm_x86_vendor_init(), instead of waiting until the callback to do
hardware setup.  setup_vmcs_config() only touches VMX state, i.e. doesn't
poke anything in kvm.ko, and has no runtime dependencies beyond
hv_init_evmcs().

Setting the VMCS config early on will allow referencing VMCS and VMX
capabilities at any point during setup, e.g. to check for PERF_GLOBAL_CTRL
save/load support during mediated PMU initialization.

Tested-by: Xudong Hao <xudong.hao@intel.com>
Link: https://lore.kernel.org/r/20250806195706.1650976-15-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-09-18 12:56:28 -07:00
Shanker Donthineni
cc80537caa arm64: cpufeature: Add Olympus MIDR to BBML2 allow list
The NVIDIA Olympus core supports BBML2 without conflict abort. Add
its MIDR to the allow list to enable FEAT_BBM.

Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com>
Signed-off-by: Will Deacon <will@kernel.org>
2025-09-18 20:21:37 +01:00
Shanker Donthineni
e185c8a0d8 arm64: cputype: Add NVIDIA Olympus definitions
Add cpu part and model macro definitions for NVIDIA Olympus core.

Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com>
Signed-off-by: Will Deacon <will@kernel.org>
2025-09-18 20:21:36 +01:00
Nick Chan
70fa521f4d arm64: dts: apple: t8015: Add SPMI node
Add SPMI node for Apple A11 SoC.

Signed-off-by: Nick Chan <towinchenmi@gmail.com>
Signed-off-by: Sven Peter <sven@kernel.org>
2025-09-18 21:13:45 +02:00
Nick Chan
8f6e6934e3 arm64: dts: apple: t8012: Add SPMI node
Add SPMI node for Apple T2 SoC.

Signed-off-by: Nick Chan <towinchenmi@gmail.com>
Signed-off-by: Sven Peter <sven@kernel.org>
2025-09-18 21:13:45 +02:00
Hector Martin
637f7d2c73 arm64: dts: apple: Add J180d (Mac Pro, M2 Ultra, 2023) device tree
The M2 Ultra in the Mac Pro differs from the M2 Ultra Mac Studio in its
PCIe setup. It uses all available 16 PCIe Gen4 on the first die and 8
PCIe Gen4 lanes on the second die to connect to a 100 lane Microchip
Switchtec PCIe switch. All internal PCIe devices and the PCIe slots are
connected to the PCIe switch.
Each die implements a PCIe controller with a single 16 or 8 lane port.
The PCIe controller is mostly compatible with existing implementation
in pcie-apple.c.
The resources for other 8 lanes on the second die are used to connect
the NVMe flash with the controller in the SoC.
This initial device tree does not include PCIe support.

Signed-off-by: Hector Martin <marcan@marcan.st>
Reviewed-by: Neal Gompa <neal@gompa.dev>
Co-developed-by: Janne Grunau <j@jannau.net>
Signed-off-by: Janne Grunau <j@jannau.net>
Reviewed-by: Sven Peter <sven@kernel.org>
Signed-off-by: Sven Peter <sven@kernel.org>
2025-09-18 19:06:13 +00:00
Mikko Rapeli
a1b20e0622 ARM: rockchip: remove REGULATOR conditional to PM
PM is explicitly enabled in lines just below so
REGULATOR can be too.

Suggested-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Mikko Rapeli <mikko.rapeli@linaro.org>
Link: https://lore.kernel.org/r/20250915083317.2885761-5-mikko.rapeli@linaro.org
Signed-off-by: Heiko Stuebner <heiko@sntech.de>
2025-09-18 21:05:39 +02:00
Kaison Deng
93781211e9 arm64: dts: rockchip: Add devicetree for the ROC-RK3588-RT
Link: https://en.t-firefly.com/product/industry/rocrk3588rt

The Firefly ROC-RK3588-RT is RK3588 based SBC featuring:
- TF card slot
- SATA 2242 socket
- 1x USB 3.0 Port, 1x USB 2.0 Port, 1x Typec Port
- 1x HDMI 2.1 out, 1x HDMI 2.0 out
- 2x Gigabit Ethernet, 1x 2.5G Ethernet
- M.2 E-KEY for Extended WiFI and Bluetoolh
- ES8388 on-board sound codec - jack in/out
- RTC
- LED: WORK, DIY

Signed-off-by: Kaison Deng <dkx@t-chip.com.cn>
Reviewed-by: Andrew Lunn <andrew@lunn.ch> #gmac0, gmac1, mdio0, mdio1 nodes
Link: https://lore.kernel.org/r/349c4226824efa52ceb14e3d8518c8bb5c7465fc.1757902513.git.dkx@t-chip.com.cn
Signed-off-by: Heiko Stuebner <heiko@sntech.de>
2025-09-18 20:59:59 +02:00
Jakub Kicinski
f2cdc4c22b Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.17-rc7).

No conflicts.

Adjacent changes:

drivers/net/ethernet/mellanox/mlx5/core/en/fs.h
  9536fbe10c ("net/mlx5e: Add PSP steering in local NIC RX")
  7601a0a462 ("net/mlx5e: Add a miss level for ipsec crypto offload")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-18 11:26:06 -07:00
Chukun Pan
cf311ff5e7 arm64: dts: rockchip: update pinctrl names for Radxa E52C
Updated the pinctrl names of the user key and power LED according
to the schematic. Also updated the nodenames of other pinctrls.

Signed-off-by: Chukun Pan <amadeus@jmu.edu.cn>
Link: https://lore.kernel.org/r/20250901100027.164594-4-amadeus@jmu.edu.cn
Signed-off-by: Heiko Stuebner <heiko@sntech.de>
2025-09-18 19:32:24 +02:00
Chukun Pan
3edb9c95ff arm64: dts: rockchip: remove vcc_3v3_pmu regulator for Radxa E52C
According to Radxa E52C Schematic V1.2 [1] page 5, vcc_3v3_pmu
is directly connected to vcc_3v3_s3 via a 0 ohm resistor.
The vcc_3v3_pmu is not a new regulator, so remove it.

[1] https://dl.radxa.com/e/e52c/hw/radxa_e52c_v1.2_schematic.pdf

Signed-off-by: Chukun Pan <amadeus@jmu.edu.cn>
Link: https://lore.kernel.org/r/20250901100027.164594-3-amadeus@jmu.edu.cn
Signed-off-by: Heiko Stuebner <heiko@sntech.de>
2025-09-18 19:32:24 +02:00
Mark Rutland
52b49bd6de arm64: cputype: Remove duplicate Cortex-X1C definitions
We currently have duplicate definitions for ARM_CPU_PART_CORTEX_X1C and
MIDR_CORTEX_X1C as a result of commits:

  58d245e03c ("arm64: cputype: Add Cortex-X1C definitions")
  efe676a1a7 ("arm64: proton-pack: Add new CPUs 'k' values for branch mitigation")

Due to inconsistent sorting when adding entries, there was no textual
conflict between the two patches.

Delete the duplicate definitions added by the latter commit.

The definitions in general are largely (but not entirely) in order of
the MIDR_EL1.PartNum value rather than by CPU name, and the remaining
Cortex-X1C definitions appear later in the list.

For now I haven't sorted the remaining MIDR definitions to minimize
churn. I intend to perform some larger cleanup of these in the near
future which should supersede that anyhow.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: James Morse <james.morse@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
2025-09-18 17:51:50 +01:00
Linus Torvalds
86cc796e5e Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
 "These are mostly Oliver's Arm changes: lock ordering fixes for the
  vGIC, and reverts for a buggy attempt to avoid RCU stalls on large
  VMs.

  Arm:

   - Invalidate nested MMUs upon freeing the PGD to avoid WARNs when
     visiting from an MMU notifier

   - Fixes to the TLB match process and TLB invalidation range for
     managing the VCNR pseudo-TLB

   - Prevent SPE from erroneously profiling guests due to UNKNOWN reset
     values in PMSCR_EL1

   - Fix save/restore of host MDCR_EL2 to account for eagerly
     programming at vcpu_load() on VHE systems

   - Correct lock ordering when dealing with VGIC LPIs, avoiding
     scenarios where an xarray's spinlock was nested with a *raw*
     spinlock

   - Permit stage-2 read permission aborts which are possible in the
     case of NV depending on the guest hypervisor's stage-2 translation

   - Call raw_spin_unlock() instead of the internal spinlock API

   - Fix parameter ordering when assigning VBAR_EL1

   - Reverted a couple of fixes for RCU stalls when destroying a stage-2
     page table.

     There appears to be some nasty refcounting / UAF issues lurking in
     those patches and the band-aid we tried to apply didn't hold.

  s390:

   - mm fixes, including userfaultfd bug fix

  x86:

   - Sync the vTPR from the local APIC to the VMCB even when AVIC is
     active.

     This fixes a bug where host updates to the vTPR, e.g. via
     KVM_SET_LAPIC or emulation of a guest access, are lost and result
     in interrupt delivery issues in the guest"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: SVM: Sync TPR from LAPIC into VMCB::V_TPR even if AVIC is active
  Revert "KVM: arm64: Split kvm_pgtable_stage2_destroy()"
  Revert "KVM: arm64: Reschedule as needed when destroying the stage-2 page-tables"
  KVM: arm64: vgic: fix incorrect spinlock API usage
  KVM: arm64: Remove stage 2 read fault check
  KVM: arm64: Fix parameter ordering for VBAR_EL1 assignment
  KVM: arm64: nv: Fix incorrect VNCR invalidation range calculation
  KVM: arm64: vgic-v3: Indicate vgic_put_irq() may take LPI xarray lock
  KVM: arm64: vgic-v3: Don't require IRQs be disabled for LPI xarray lock
  KVM: arm64: vgic-v3: Erase LPIs from xarray outside of raw spinlocks
  KVM: arm64: Spin off release helper from vgic_put_irq()
  KVM: arm64: vgic-v3: Use bare refcount for VGIC LPIs
  KVM: arm64: vgic: Drop stale comment on IRQ active state
  KVM: arm64: VHE: Save and restore host MDCR_EL2 value correctly
  KVM: arm64: Initialize PMSCR_EL1 when in VHE
  KVM: arm64: nv: fix VNCR TLB ASID match logic for non-Global entries
  KVM: s390: Fix FOLL_*/FAULT_FLAG_* confusion
  KVM: s390: Fix incorrect usage of mmu_notifier_register()
  KVM: s390: Fix access to unavailable adapter indicator pages during postcopy
  KVM: arm64: Mark freed S2 MMUs as invalid
2025-09-18 09:42:55 -07:00
Linus Torvalds
f03e578c8a Merge tag 'uml-for-6.17-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/uml/linux
Pull UML fixes from Johannes Berg:
 "A few fixes for UML, which I'd meant to send earlier but then forgot.

  All of them are pretty long-standing issues that are either not really
  happening (the UAF), in rarely used code (the FD buffer issue), or an
  issue only for some host configurations (the executable stack):

   - mark stack not executable to work on more modern systems with
     selinux

   - fix use-after-free in a virtio error path

   - fix stack buffer overflow in external unix socket FD receive
     function"

* tag 'uml-for-6.17-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/uml/linux:
  um: Fix FD copy size in os_rcv_fd_msg()
  um: virtio_uml: Fix use-after-free after put_device in probe
  um: Don't mark stack executable
2025-09-18 09:18:27 -07:00
Oliver Upton
3af1105c4f KVM: arm64: nv: Apply guest's MDCR traps in nested context
KVM needs to ensure the guest hypervisor's traps take effect when the
vCPU is in a nested context. While supporting infrastructure is in place
for most of the EL2 trap registers, MDCR_EL2 is not.

Fold the guest's trap configuration into the effective MDCR_EL2. Apply
it directly to the in-memory representation as it gets recomputed on
every vcpu_load() anyway.

Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-09-18 16:46:20 +01:00
Oliver Upton
4a68408842 KVM: arm64: nv: Trap debug registers when in hyp context
In case you haven't realized it yet, the architecture is _slightly_
broken in the context of nested virt. Here we have another example of
FEAT_NV2 redirecting a sysreg (MDSCR_EL1) to memory that actually
affects execution at vEL2.

Fortunately, MDCR_EL2.TDA provides the necessary traps to hide this
mess at the expense of unnecessarily trapping the breakpoint/watchpoint
registers. Yes, FEAT_FGT gives us a precise trap but let's just opt for
obvious correctness to start.

Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-09-18 16:46:20 +01:00
Guo Ren (Alibaba DAMO Academy)
16d18e3eaf riscv: Move vendor errata definitions to new header
Move vendor errata definitions into errata_list_vendors.h.

Signed-off-by: Guo Ren (Alibaba DAMO Academy) <guoren@kernel.org>
Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Tested-by: Han Gao <rabenda.cn@gmail.com>
Link: https://lore.kernel.org/r/20250713155321.2064856-2-guoren@kernel.org
[pjw@kernel.org: updated to apply and to make the whitespace consistent]
Signed-off-by: Paul Walmsley <pjw@kernel.org>
2025-09-18 08:22:00 -06:00
Heinrich Schuchardt
92c4995b4d RISC-V: ACPI: enable parsing the BGRT table
The BGRT table is used to display a vendor logo during the boot process.
Add the code for parsing it.

Signed-off-by: Heinrich Schuchardt <heinrich.schuchardt@canonical.com>
Reviewed-by: Sunil V L <sunilvl@ventanamicro.com>
Link: https://lore.kernel.org/r/20250729131535.522205-2-heinrich.schuchardt@canonical.com
Signed-off-by: Paul Walmsley <pjw@kernel.org>
2025-09-18 08:21:45 -06:00
Pu Lehui
205cbc7148 riscv: Enable ARCH_HAVE_NMI_SAFE_CMPXCHG
The implement of cmpxchg() in riscv is based on atomic primitives and
has NMI-safe features, so it can be used safely in the in_nmi context.
ftrace's ringbuffer relies on NMI-safe cmpxchg() in the NMI context.
Currently, in_nmi() is true when riscv kprobe is in trap-based mode, so
this config needs to be selected, otherwise kprobetrace will not be
available.

Signed-off-by: Pu Lehui <pulehui@huawei.com>
Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Tested-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Link: https://lore.kernel.org/r/20250711090443.1688404-1-pulehui@huaweicloud.com
[pjw@kernel.org: moved to preserve alphabetical order]
Signed-off-by: Paul Walmsley <pjw@kernel.org>
2025-09-18 08:20:59 -06:00
Masahiro Yamada
6dab7e15c0 riscv: pi: use 'targets' instead of extra-y in Makefile
%.pi.o files are built as prerequisites of other objects.
There is no need to use extra-y, which is planned for deprecation.

Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Link: https://lore.kernel.org/r/20250602181023.528550-1-masahiroy@kernel.org
Signed-off-by: Paul Walmsley <pjw@kernel.org>
2025-09-18 08:20:56 -06:00
Ignacio Encinas
cc2294d3f9 riscv: introduce asm/swab.h
Implement endianness swap macros for RISC-V.

Use the rev8 instruction when Zbb is available. Otherwise, rely on the
default mask-and-shift implementation.

Acked-by: Palmer Dabbelt <palmer@rivosinc.com>
Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Tested-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Signed-off-by: Ignacio Encinas <ignacio@iencinas.com>
Link: https://lore.kernel.org/r/20250723-riscv-swab-v6-1-fc11e9a2efc9@iencinas.com
Signed-off-by: Paul Walmsley <pjw@kernel.org>
2025-09-18 08:20:25 -06:00
Jessica Liu
316b60b984 riscv: mmap(): use unsigned offset type in riscv_sys_mmap
The variable type of offset should be consistent with the relevant
interfaces of mmap which described in commit 295f10061a ("syscalls:
mmap(): use unsigned offset type consistently"). Otherwise, a user input
with the top bit set would result in a negative page offset rather than a
large one.

Signed-off-by: Jessica Liu <liu.xuemei1@zte.com.cn>
Tested-by: Han Gao <rabenda.cn@gmail.com>
Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Reviewed-by: Nutty Liu <liujingqi@lanxincomputing.com>
Link: https://lore.kernel.org/r/20250801104948133AaMr5S6E382PbNNhoJgHA@zte.com.cn
[pjw@kernel.org: hand-applied mangled patch; fixed checkpatch error]
Signed-off-by: Paul Walmsley <pjw@kernel.org>
2025-09-18 08:19:47 -06:00
Junhui Liu
17e9521044 riscv: mm: Use mmu-type from FDT to limit SATP mode
Some RISC-V implementations may hang when attempting to write an
unsupported SATP mode, even though the latest RISC-V specification
states such writes should have no effect. To avoid this issue, the
logic for selecting SATP mode has been refined:

The kernel now determines the SATP mode limit by taking the minimum of
the value specified by the kernel command line (noXlvl) and the
"mmu-type" property in the device tree (FDT). If only one is specified,
use that.
- If the resulting limit is sv48 or higher, the kernel will probe SATP
  modes from this limit downward until a supported mode is found.
- If the limit is sv39, the kernel will directly use sv39 without
  probing.

This ensures SATP mode selection is safe and compatible with both
hardware and user configuration, minimizing the risk of hangs.

Signed-off-by: Junhui Liu <junhui.liu@pigmoral.tech>
Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Reviewed-by: Nutty Liu <liujingqi@lanxincomputing.com>
Link: https://lore.kernel.org/r/20250722-satp-from-fdt-v1-2-5ba22218fa5f@pigmoral.tech
Signed-off-by: Paul Walmsley <pjw@kernel.org>
2025-09-18 08:18:14 -06:00
James Clark
00d7a1af5a arm64/boot: Enable EL2 requirements for SPE_FEAT_FDS
SPE data source filtering (optional from Armv8.8) requires that traps to
the filter register PMSDSFR be disabled. Document the requirements and
disable the traps if the feature is present.

Tested-by: Leo Yan <leo.yan@arm.com>
Reviewed-by: Leo Yan <leo.yan@arm.com>
Signed-off-by: James Clark <james.clark@linaro.org>
Signed-off-by: Will Deacon <will@kernel.org>
2025-09-18 14:17:02 +01:00
James Clark
510a8fa49d arm64/boot: Factor out a macro to check SPE version
We check the version of SPE twice, and we'll add one more check in the
next commit so factor out a macro to do this. Change the #3 magic number
to the actual SPE version define (V1p2) to make it more readable.

No functional changes intended.

Tested-by: Leo Yan <leo.yan@arm.com>
Reviewed-by: Leo Yan <leo.yan@arm.com>
Signed-off-by: James Clark <james.clark@linaro.org>
Signed-off-by: Will Deacon <will@kernel.org>
2025-09-18 14:17:02 +01:00
James Clark
b4401403af perf: arm_spe: Support FEAT_SPEv1p4 filters
FEAT_SPEv1p4 (optional from Armv8.8) adds some new filter bits and also
makes some previously available bits unavailable again e.g:

  E[30], bit [30]
  When FEAT_SPEv1p4 is _not_ implemented ...

Continuing to hard code the valid filter bits for each version isn't
scalable, and it also doesn't work for filter bits that aren't related
to SPE version. For example most bits have a further condition:

  E[15], bit [15]
  When ... and filtering on event 15 is supported:

Whether "filtering on event 15" is implemented or not is only
discoverable from the TRM of that specific CPU or by probing
PMSEVFR_EL1.

Instead of hard coding them, write all 1s to the PMSEVFR_EL1 register
and read it back to discover the RES0 bits. Unsupported bits are RAZ/WI
so should read as 0s.

For any hardware that doesn't strictly follow RAZ/WI for unsupported
filters: Any bits that should have been supported in a specific SPE
version but now incorrectly appear to be RES0 wouldn't have worked
anyway, so it's better to fail to open events that request them rather
than behaving unexpectedly. Bits that aren't implemented but also aren't
RAZ/WI will be incorrectly reported as supported, but allowing them to
be used is harmless.

Testing on N1SDP shows the probed RES0 bits to be the same as the hard
coded ones. The FVP with SPEv1p4 shows only additional new RES0 bits,
i.e. no previously hard coded RES0 bits are missing.

Tested-by: Leo Yan <leo.yan@arm.com>
Signed-off-by: James Clark <james.clark@linaro.org>
Signed-off-by: Will Deacon <will@kernel.org>
2025-09-18 14:17:02 +01:00
James Clark
a7005ff2d0 arm64: sysreg: Add new PMSFCR_EL1 fields and PMSDSFR_EL1 register
Add new fields and register that are introduced for the features
FEAT_SPE_EFT (extended filtering) and FEAT_SPE_FDS (data source
filtering).

Tested-by: Leo Yan <leo.yan@arm.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Acked-by: Will Deacon <will@kernel.org>
Signed-off-by: James Clark <james.clark@linaro.org>
Signed-off-by: Will Deacon <will@kernel.org>
2025-09-18 14:17:02 +01:00