Commit Graph

14540 Commits

Author SHA1 Message Date
Zhao Liu
f24ef0093d KVM: x86: Advertise MOVRS CPUID to userspace
Define the feature flag for MOVRS and advertise support to userspace when
the feature is supported by the host.

MOVRS is a new set of instructions introduced in the Intel platform
Diamond Rapids, to provide load instructions that carry a read-shared
hint.

Functionally, MOVRS family is equivalent to existing load instructions,
but its read-shared hint indicates that the source memory location is
likely to become read-shared by multiple processors, i.e., read in the
future by at least one other processor before it is written (assuming it
is ever written in the future). This hint could optimize the behavior of
the caches, especially shared caches, for this data for future reads by
multiple processors. Additionally, MOVRS family also includes a software
prefetch instruction, PREFETCHRST2, that carries the same read-shared
hint. [*]

MOVRS family is enumerated by CPUID single-bit (0x7.0x1.EAX[bit 31]).
Since it's on a densely-populated CPUID leaf and some other bits on
this leaf have kernel usages, define this new feature in cpufeatures.h,
but hide it in /proc/cpuinfo due to lack of current kernel usage.

Advertise MOVRS bit to userspace directly. It's safe, since there's no
new VMX controls or additional host enabling required for guests to use
this feature.

[*]: Intel Architecture Instruction Set Extensions and Future Features
     (rev.059).

Tested-by: Xudong Hao <xudong.hao@intel.com>
Signed-off-by: Zhao Liu <zhao1.liu@intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Link: https://patch.msgid.link/20251120050720.931449-2-zhao1.liu@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-01-23 09:59:38 -08:00
Linus Torvalds
44087d3d46 Merge tag 'x86-urgent-2025-12-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Ingo Molnar:

 - Fix FPU core dumps on certain CPU models

 - Fix htmldocs build warning

 - Export TLB tracing event name via header

 - Remove unused constant from <linux/mm_types.h>

 - Fix comments

 - Fix whitespace noise in documentation

 - Fix variadic structure's definition to un-confuse UBSAN

 - Fix posted MSI interrupts irq_retrigger() bug

 - Fix asm build failure with older GCC builds

* tag 'x86-urgent-2025-12-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/bug: Fix old GCC compile fails
  x86/msi: Make irq_retrigger() functional for posted MSI
  x86/platform/uv: Fix UBSAN array-index-out-of-bounds
  mm: Remove tlb_flush_reason::NR_TLB_FLUSH_REASONS from <linux/mm_types.h>
  x86/mm/tlb/trace: Export the TLB_REMOTE_WRONG_CPU enum in <trace/events/tlb.h>
  x86/sgx: Remove unmatched quote in __sgx_encl_extend function comment
  x86/boot/Documentation: Fix whitespace noise in boot.rst
  x86/fpu: Fix FPU state core dump truncation on CPUs with no extended xfeatures
  x86/boot/Documentation: Fix htmldocs build warning due to malformed table in boot.rst
2025-12-21 14:41:29 -08:00
Eric Dumazet
91ff28ae6d x86/irqflags: Use ASM_OUTPUT_RM in native_save_fl()
clang is generating very inefficient code for native_save_fl() which is
used for local_irq_save() in critical spots.

Allowing the "pop %0" to use memory:

 1) forces the compiler to add annoying stack canaries when
    CONFIG_STACKPROTECTOR_STRONG=y in many places.

 2) Almost always is followed by an immediate "move memory,register"

One good example is _raw_spin_lock_irqsave, with 8 extra instructions

  ffffffff82067a30 <_raw_spin_lock_irqsave>:
  ffffffff82067a30:		...
  ffffffff82067a39:		53						push   %rbx

  // Three instructions to ajust the stack, read the per-cpu canary
  // and copy it to 8(%rsp)
  ffffffff82067a3a:		48 83 ec 10 			sub    $0x10,%rsp
  ffffffff82067a3e:		65 48 8b 05 da 15 45 02 mov    %gs:0x24515da(%rip),%rax 	   # <__stack_chk_guard>
  ffffffff82067a46:		48 89 44 24 08			mov    %rax,0x8(%rsp)

  ffffffff82067a4b:		9c						pushf

  // instead of pop %rbx, compiler uses 2 instructions.
  ffffffff82067a4c:		8f 04 24				pop    (%rsp)
  ffffffff82067a4f:		48 8b 1c 24 			mov    (%rsp),%rbx

  ffffffff82067a53:		fa						cli
  ffffffff82067a54:		b9 01 00 00 00			mov    $0x1,%ecx
  ffffffff82067a59:		31 c0					xor    %eax,%eax
  ffffffff82067a5b:		f0 0f b1 0f 			lock cmpxchg %ecx,(%rdi)
  ffffffff82067a5f:		75 1d					jne    ffffffff82067a7e <_raw_spin_lock_irqsave+0x4e>

  // three instructions to check the stack canary
  ffffffff82067a61:		65 48 8b 05 b7 15 45 02 mov    %gs:0x24515b7(%rip),%rax 	   # <__stack_chk_guard>
  ffffffff82067a69:		48 3b 44 24 08			cmp    0x8(%rsp),%rax
  ffffffff82067a6e:		75 17					jne    ffffffff82067a87

  ...

  // One extra instruction to adjust the stack.
  ffffffff82067a73:		48 83 c4 10 			add    $0x10,%rsp
  ...

  // One more instruction in case the stack was mangled.
  ffffffff82067a87:		e8 a4 35 ff ff			call   ffffffff8205b030 <__stack_chk_fail>

This patch changes nothing for gcc, but for clang saves ~20000 bytes of text
even though more functions are inlined.

  $ size vmlinux.gcc.before vmlinux.gcc.after vmlinux.clang.before vmlinux.clang.after
     text	   data		bss		dec		hex	filename
  45565821	25005462	4704800	75276083	47c9f33	vmlinux.gcc.before
  45565821	25005462	4704800	75276083	47c9f33	vmlinux.gcc.after
  45121072	24638617	5533040	75292729	47ce039	vmlinux.clang.before
  45093887	24638633	5536808	75269328	47c84d0	vmlinux.clang.after

  $ scripts/bloat-o-meter -t vmlinux.clang.before vmlinux.clang.after
  add/remove: 1/2 grow/shrink: 21/533 up/down: 2250/-22112 (-19862)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2025-12-20 14:47:12 -08:00
Peter Zijlstra
c56a12c71a x86/bug: Fix old GCC compile fails
For some mysterious reasons the GCC 8 and 9 preprocessor manages to
sporadically fumble _ASM_BYTES(0x0f, 0x0b):

$ grep ".byte[ ]*0x0f" defconfig-build/drivers/net/wireless/realtek/rtlwifi/base.s
        1:       .byte0x0f,0x0b ;
        1:       .byte 0x0f,0x0b ;

which makes the assembler upset and all that. While there are more
_ASM_BYTES() users (notably the NOP instructions), those don't seem
affected. Therefore replace the offending ASM_UD2 with one using the
ud2 mnemonic.

Reported-by: Jean Delvare <jdelvare@suse.de>
Suggested-by: Uros Bizjak <ubizjak@gmail.com>
Fixes: 85a2d4a890 ("x86,ibt: Use UDB instead of 0xEA")
Cc: stable@kernel.org
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251218104659.GT3911114@noisy.programming.kicks-ass.net
2025-12-18 11:55:40 +01:00
Thomas Gleixner
0edc78b82b x86/msi: Make irq_retrigger() functional for posted MSI
Luigi reported that retriggering a posted MSI interrupt does not work
correctly.

The reason is that the retrigger happens at the vector domain by sending an
IPI to the actual vector on the target CPU. That works correctly exactly
once because the posted MSI interrupt chip does not issue an EOI as that's
only required for the posted MSI notification vector itself.

As a consequence the vector becomes stale in the ISR, which not only
affects this vector but also any lower priority vector in the affected
APIC because the ISR bit is not cleared.

Luigi proposed to set the vector in the remap PIR bitmap and raise the
posted MSI notification vector. That works, but that still does not cure a
related problem:

  If there is ever a stray interrupt on such a vector, then the related
  APIC ISR bit becomes stale due to the lack of EOI as described above.
  Unlikely to happen, but if it happens it's not debuggable at all.

So instead of playing games with the PIR, this can be actually solved
for both cases by:

 1) Keeping track of the posted interrupt vector handler state

 2) Implementing a posted MSI specific irq_ack() callback which checks that
    state. If the posted vector handler is inactive it issues an EOI,
    otherwise it delegates that to the posted handler.

This is correct versus affinity changes and concurrent events on the posted
vector as the actual handler invocation is serialized through the interrupt
descriptor lock.

Fixes: ed1e48ea43 ("iommu/vt-d: Enable posted mode for device MSIs")
Reported-by: Luigi Rizzo <lrizzo@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Luigi Rizzo <lrizzo@google.com>
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20251125214631.044440658@linutronix.de
Closes: https://lore.kernel.org/lkml/20251124104836.3685533-1-lrizzo@google.com
2025-12-17 18:41:52 +01:00
Kyle Meyer
21433d3e3c x86/platform/uv: Fix UBSAN array-index-out-of-bounds
When UBSAN is enabled, multiple array-index-out-of-bounds messages are
printed:

  [    0.000000] [     T0] UBSAN: array-index-out-of-bounds in arch/x86/kernel/apic/x2apic_uv_x.c:276:23
  [    0.000000] [     T0] index 1 is out of range for type '<unknown> [1]'
  ...
  [    0.000000] [     T0] UBSAN: array-index-out-of-bounds in arch/x86/kernel/apic/x2apic_uv_x.c:277:32
  [    0.000000] [     T0] index 1 is out of range for type '<unknown> [1]'
  ...
  [    0.000000] [     T0] UBSAN: array-index-out-of-bounds in arch/x86/kernel/apic/x2apic_uv_x.c:282:16
  [    0.000000] [     T0] index 1 is out of range for type '<unknown> [1]'
  ...
  [    0.515850] [     T1] UBSAN: array-index-out-of-bounds in arch/x86/kernel/apic/x2apic_uv_x.c:1344:23
  [    0.519851] [     T1] index 1 is out of range for type '<unknown> [1]'
  ...
  [    0.603850] [     T1] UBSAN: array-index-out-of-bounds in arch/x86/kernel/apic/x2apic_uv_x.c:1345:32
  [    0.607850] [     T1] index 1 is out of range for type '<unknown> [1]'
  ...
  [    0.691850] [     T1] UBSAN: array-index-out-of-bounds in arch/x86/kernel/apic/x2apic_uv_x.c:1353:20
  [    0.695850] [     T1] index 1 is out of range for type '<unknown> [1]'

One-element arrays have been deprecated:

  https://docs.kernel.org/process/deprecated.html#zero-length-and-one-element-arrays

Switch entry in struct uv_systab to a flexible array member to fix UBSAN
array-index-out-of-bounds messages.

sizeof(struct uv_systab) is passed to early_memremap() and ioremap(). The
flexible array member is not accessed until the UV system table size is used to
remap the entire UV system table, so changes to sizeof(struct uv_systab) have no
impact.

Signed-off-by: Kyle Meyer <kyle.meyer@hpe.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://patch.msgid.link/aTxksN-3otY41WvQ@hpe.com
2025-12-14 08:46:53 +01:00
Linus Torvalds
c752c21c90 Merge tag 'auto-type-conversion-for-v6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/hpa/linux-auto
Pull __auto_type to auto conversion from Peter Anvin:
 "Convert '__auto_type' to 'auto', defining a macro for 'auto' unless
  C23+ is in use"

* tag 'auto-type-conversion-for-v6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/hpa/linux-auto:
  tools/virtio: replace "__auto_type" with "auto"
  selftests/bpf: replace "__auto_type" with "auto"
  arch/x86: replace "__auto_type" with "auto"
  arch/nios2: replace "__auto_type" and adjacent equivalent with "auto"
  fs/proc: replace "__auto_type" with "const auto"
  include/linux: change "__auto_type" to "auto"
  compiler_types.h: add "auto" as a macro for "__auto_type"
2025-12-10 05:41:10 +09:00
H. Peter Anvin
2a5b286bee arch/x86: replace "__auto_type" with "auto"
Replace instances of "__auto_type" with "auto" in:

	arch/x86/include/asm/bug.h
	arch/x86/include/asm/string_64.h
	arch/x86/include/asm/uaccess_64.h

Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
2025-12-08 15:32:15 -08:00
Linus Torvalds
feb06d2690 Merge tag 'hyperv-next-signed-20251207' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux
Pull hyperv updates from Wei Liu:

 - Enhancements to Linux as the root partition for Microsoft Hypervisor:
     - Support a new mode called L1VH, which allows Linux to drive the
       hypervisor running the Azure Host directly
     - Support for MSHV crash dump collection
     - Allow Linux's memory management subsystem to better manage guest
       memory regions
     - Fix issues that prevented a clean shutdown of the whole system on
       bare metal and nested configurations
     - ARM64 support for the MSHV driver
     - Various other bug fixes and cleanups

 - Add support for Confidential VMBus for Linux guest on Hyper-V

 - Secure AVIC support for Linux guests on Hyper-V

 - Add the mshv_vtl driver to allow Linux to run as the secure kernel in
   a higher virtual trust level for Hyper-V

* tag 'hyperv-next-signed-20251207' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux: (58 commits)
  mshv: Cleanly shutdown root partition with MSHV
  mshv: Use reboot notifier to configure sleep state
  mshv: Add definitions for MSHV sleep state configuration
  mshv: Add support for movable memory regions
  mshv: Add refcount and locking to mem regions
  mshv: Fix huge page handling in memory region traversal
  mshv: Move region management to mshv_regions.c
  mshv: Centralize guest memory region destruction
  mshv: Refactor and rename memory region handling functions
  mshv: adjust interrupt control structure for ARM64
  Drivers: hv: use kmalloc_array() instead of kmalloc()
  mshv: Add ioctl for self targeted passthrough hvcalls
  Drivers: hv: Introduce mshv_vtl driver
  Drivers: hv: Export some symbols for mshv_vtl
  static_call: allow using STATIC_CALL_TRAMP_STR() from assembly
  mshv: Extend create partition ioctl to support cpu features
  mshv: Allow mappings that overlap in uaddr
  mshv: Fix create memory region overlap check
  mshv: add WQ_PERCPU to alloc_workqueue users
  Drivers: hv: Use kmalloc_array() instead of kmalloc()
  ...
2025-12-09 06:10:17 +09:00
Linus Torvalds
509d3f4584 Merge tag 'mm-nonmm-stable-2025-12-06-11-14' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull non-MM updates from Andrew Morton:

 - "panic: sys_info: Refactor and fix a potential issue" (Andy Shevchenko)
   fixes a build issue and does some cleanup in ib/sys_info.c

 - "Implement mul_u64_u64_div_u64_roundup()" (David Laight)
   enhances the 64-bit math code on behalf of a PWM driver and beefs up
   the test module for these library functions

 - "scripts/gdb/symbols: make BPF debug info available to GDB" (Ilya Leoshkevich)
   makes BPF symbol names, sizes, and line numbers available to the GDB
   debugger

 - "Enable hung_task and lockup cases to dump system info on demand" (Feng Tang)
   adds a sysctl which can be used to cause additional info dumping when
   the hung-task and lockup detectors fire

 - "lib/base64: add generic encoder/decoder, migrate users" (Kuan-Wei Chiu)
   adds a general base64 encoder/decoder to lib/ and migrates several
   users away from their private implementations

 - "rbree: inline rb_first() and rb_last()" (Eric Dumazet)
   makes TCP a little faster

 - "liveupdate: Rework KHO for in-kernel users" (Pasha Tatashin)
   reworks the KEXEC Handover interfaces in preparation for Live Update
   Orchestrator (LUO), and possibly for other future clients

 - "kho: simplify state machine and enable dynamic updates" (Pasha Tatashin)
   increases the flexibility of KEXEC Handover. Also preparation for LUO

 - "Live Update Orchestrator" (Pasha Tatashin)
   is a major new feature targeted at cloud environments. Quoting the
   cover letter:

      This series introduces the Live Update Orchestrator, a kernel
      subsystem designed to facilitate live kernel updates using a
      kexec-based reboot. This capability is critical for cloud
      environments, allowing hypervisors to be updated with minimal
      downtime for running virtual machines. LUO achieves this by
      preserving the state of selected resources, such as memory,
      devices and their dependencies, across the kernel transition.

      As a key feature, this series includes support for preserving
      memfd file descriptors, which allows critical in-memory data, such
      as guest RAM or any other large memory region, to be maintained in
      RAM across the kexec reboot.

   Mike Rappaport merits a mention here, for his extensive review and
   testing work.

 - "kexec: reorganize kexec and kdump sysfs" (Sourabh Jain)
   moves the kexec and kdump sysfs entries from /sys/kernel/ to
   /sys/kernel/kexec/ and adds back-compatibility symlinks which can
   hopefully be removed one day

 - "kho: fixes for vmalloc restoration" (Mike Rapoport)
   fixes a BUG which was being hit during KHO restoration of vmalloc()
   regions

* tag 'mm-nonmm-stable-2025-12-06-11-14' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (139 commits)
  calibrate: update header inclusion
  Reinstate "resource: avoid unnecessary lookups in find_next_iomem_res()"
  vmcoreinfo: track and log recoverable hardware errors
  kho: fix restoring of contiguous ranges of order-0 pages
  kho: kho_restore_vmalloc: fix initialization of pages array
  MAINTAINERS: TPM DEVICE DRIVER: update the W-tag
  init: replace simple_strtoul with kstrtoul to improve lpj_setup
  KHO: fix boot failure due to kmemleak access to non-PRESENT pages
  Documentation/ABI: new kexec and kdump sysfs interface
  Documentation/ABI: mark old kexec sysfs deprecated
  kexec: move sysfs entries to /sys/kernel/kexec
  test_kho: always print restore status
  kho: free chunks using free_page() instead of kfree()
  selftests/liveupdate: add kexec test for multiple and empty sessions
  selftests/liveupdate: add simple kexec-based selftest for LUO
  selftests/liveupdate: add userspace API selftests
  docs: add documentation for memfd preservation via LUO
  mm: memfd_luo: allow preserving memfd
  liveupdate: luo_file: add private argument to store runtime state
  mm: shmem: export some functions to internal.h
  ...
2025-12-06 14:01:20 -08:00
Linus Torvalds
cc3ee4ba57 Merge tag 'x86-urgent-2025-12-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Ingo Molnar:
 "Miscellaneous documentation fixes"

* tag 'x86-urgent-2025-12-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/boot/Documentation: Prefix hexadecimal literals with 0x
  x86/boot/Documentation: Spell 'ID' consistently
  x86/platform: Fix and extend kernel-doc comments in <asm/x86_init.h>
2025-12-06 12:33:26 -08:00
Linus Torvalds
08b8ddac1f Merge tag 'objtool-urgent-2025-12-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull objtool fixes from Ingo Molnar:
 "Address various objtool scalability bugs/inefficiencies exposed by
  allmodconfig builds, plus improve the quality of alternatives
  instructions generated code and disassembly"

* tag 'objtool-urgent-2025-12-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  objtool: Simplify .annotate_insn code generation output some more
  objtool: Add more robust signal error handling, detect and warn about stack overflows
  objtool: Remove newlines and tabs from annotation macros
  objtool: Consolidate annotation macros
  x86/asm: Remove ANNOTATE_DATA_SPECIAL usage
  x86/alternative: Remove ANNOTATE_DATA_SPECIAL usage
  objtool: Fix stack overflow in validate_branch()
2025-12-06 11:56:51 -08:00
Linus Torvalds
f468cf53c5 Merge tag 'bitmap-for-6.19' of github.com:/norov/linux
Pull bitmap updates from Yury Norov:

 - Runtime field_{get,prep}() (Geert)

 - Rust ID pool updates (Alice)

 - min_t() simplification (David)

 - __sw_hweightN kernel-doc fixes (Andy)

 - cpumask.h headers cleanup (Andy)

* tag 'bitmap-for-6.19' of github.com:/norov/linux: (32 commits)
  rust_binder: use bitmap for allocation of handles
  rust: id_pool: do not immediately acquire new ids
  rust: id_pool: do not supply starting capacity
  rust: id_pool: rename IdPool::new() to with_capacity()
  rust: bitmap: add BitmapVec::new_inline()
  rust: bitmap: add MAX_LEN and MAX_INLINE_LEN constants
  cpumask: Don't use "proxy" headers
  soc: renesas: Use bitfield helpers
  clk: renesas: Use bitfield helpers
  ALSA: usb-audio: Convert to common field_{get,prep}() helpers
  soc: renesas: rz-sysc: Convert to common field_get() helper
  pinctrl: ma35: Convert to common field_{get,prep}() helpers
  iio: mlx90614: Convert to common field_{get,prep}() helpers
  iio: dac: Convert to common field_prep() helper
  gpio: aspeed: Convert to common field_{get,prep}() helpers
  EDAC/ie31200: Convert to common field_get() helper
  crypto: qat - convert to common field_get() helper
  clk: at91: Convert to common field_{get,prep}() helpers
  bitfield: Add non-constant field_{prep,get}() helpers
  bitfield: Add less-checking __FIELD_{GET,PREP}()
  ...
2025-12-06 09:01:27 -08:00
Linus Torvalds
51d90a15fe Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM updates from Paolo Bonzini:
 "ARM:

   - Support for userspace handling of synchronous external aborts
     (SEAs), allowing the VMM to potentially handle the abort in a
     non-fatal manner

   - Large rework of the VGIC's list register handling with the goal of
     supporting more active/pending IRQs than available list registers
     in hardware. In addition, the VGIC now supports EOImode==1 style
     deactivations for IRQs which may occur on a separate vCPU than the
     one that acked the IRQ

   - Support for FEAT_XNX (user / privileged execute permissions) and
     FEAT_HAF (hardware update to the Access Flag) in the software page
     table walkers and shadow MMU

   - Allow page table destruction to reschedule, fixing long
     need_resched latencies observed when destroying a large VM

   - Minor fixes to KVM and selftests

  Loongarch:

   - Get VM PMU capability from HW GCFG register

   - Add AVEC basic support

   - Use 64-bit register definition for EIOINTC

   - Add KVM timer test cases for tools/selftests

  RISC/V:

   - SBI message passing (MPXY) support for KVM guest

   - Give a new, more specific error subcode for the case when in-kernel
     AIA virtualization fails to allocate IMSIC VS-file

   - Support KVM_DIRTY_LOG_INITIALLY_SET, enabling dirty log gradually
     in small chunks

   - Fix guest page fault within HLV* instructions

   - Flush VS-stage TLB after VCPU migration for Andes cores

  s390:

   - Always allocate ESCA (Extended System Control Area), instead of
     starting with the basic SCA and converting to ESCA with the
     addition of the 65th vCPU. The price is increased number of exits
     (and worse performance) on z10 and earlier processor; ESCA was
     introduced by z114/z196 in 2010

   - VIRT_XFER_TO_GUEST_WORK support

   - Operation exception forwarding support

   - Cleanups

  x86:

   - Skip the costly "zap all SPTEs" on an MMIO generation wrap if MMIO
     SPTE caching is disabled, as there can't be any relevant SPTEs to
     zap

   - Relocate a misplaced export

   - Fix an async #PF bug where KVM would clear the completion queue
     when the guest transitioned in and out of paging mode, e.g. when
     handling an SMI and then returning to paged mode via RSM

   - Leave KVM's user-return notifier registered even when disabling
     virtualization, as long as kvm.ko is loaded. On reboot/shutdown,
     keeping the notifier registered is ok; the kernel does not use the
     MSRs and the callback will run cleanly and restore host MSRs if the
     CPU manages to return to userspace before the system goes down

   - Use the checked version of {get,put}_user()

   - Fix a long-lurking bug where KVM's lack of catch-up logic for
     periodic APIC timers can result in a hard lockup in the host

   - Revert the periodic kvmclock sync logic now that KVM doesn't use a
     clocksource that's subject to NTP corrections

   - Clean up KVM's handling of MMIO Stale Data and L1TF, and bury the
     latter behind CONFIG_CPU_MITIGATIONS

   - Context switch XCR0, XSS, and PKRU outside of the entry/exit fast
     path; the only reason they were handled in the fast path was to
     paper of a bug in the core #MC code, and that has long since been
     fixed

   - Add emulator support for AVX MOV instructions, to play nice with
     emulated devices whose guest drivers like to access PCI BARs with
     large multi-byte instructions

  x86 (AMD):

   - Fix a few missing "VMCB dirty" bugs

   - Fix the worst of KVM's lack of EFER.LMSLE emulation

   - Add AVIC support for addressing 4k vCPUs in x2AVIC mode

   - Fix incorrect handling of selective CR0 writes when checking
     intercepts during emulation of L2 instructions

   - Fix a currently-benign bug where KVM would clobber SPEC_CTRL[63:32]
     on VMRUN and #VMEXIT

   - Fix a bug where KVM corrupt the guest code stream when re-injecting
     a soft interrupt if the guest patched the underlying code after the
     VM-Exit, e.g. when Linux patches code with a temporary INT3

   - Add KVM_X86_SNP_POLICY_BITS to advertise supported SNP policy bits
     to userspace, and extend KVM "support" to all policy bits that
     don't require any actual support from KVM

  x86 (Intel):

   - Use the root role from kvm_mmu_page to construct EPTPs instead of
     the current vCPU state, partly as worthwhile cleanup, but mostly to
     pave the way for tracking per-root TLB flushes, and elide EPT
     flushes on pCPU migration if the root is clean from a previous
     flush

   - Add a few missing nested consistency checks

   - Rip out support for doing "early" consistency checks via hardware
     as the functionality hasn't been used in years and is no longer
     useful in general; replace it with an off-by-default module param
     to WARN if hardware fails a check that KVM does not perform

   - Fix a currently-benign bug where KVM would drop the guest's
     SPEC_CTRL[63:32] on VM-Enter

   - Misc cleanups

   - Overhaul the TDX code to address systemic races where KVM (acting
     on behalf of userspace) could inadvertantly trigger lock contention
     in the TDX-Module; KVM was either working around these in weird,
     ugly ways, or was simply oblivious to them (though even Yan's
     devilish selftests could only break individual VMs, not the host
     kernel)

   - Fix a bug where KVM could corrupt a vCPU's cpu_list when freeing a
     TDX vCPU, if creating said vCPU failed partway through

   - Fix a few sparse warnings (bad annotation, 0 != NULL)

   - Use struct_size() to simplify copying TDX capabilities to userspace

   - Fix a bug where TDX would effectively corrupt user-return MSR
     values if the TDX Module rejects VP.ENTER and thus doesn't clobber
     host MSRs as expected

  Selftests:

   - Fix a math goof in mmu_stress_test when running on a single-CPU
     system/VM

   - Forcefully override ARCH from x86_64 to x86 to play nice with
     specifying ARCH=x86_64 on the command line

   - Extend a bunch of nested VMX to validate nested SVM as well

   - Add support for LA57 in the core VM_MODE_xxx macro, and add a test
     to verify KVM can save/restore nested VMX state when L1 is using
     5-level paging, but L2 is not

   - Clean up the guest paging code in anticipation of sharing the core
     logic for nested EPT and nested NPT

  guest_memfd:

   - Add NUMA mempolicy support for guest_memfd, and clean up a variety
     of rough edges in guest_memfd along the way

   - Define a CLASS to automatically handle get+put when grabbing a
     guest_memfd from a memslot to make it harder to leak references

   - Enhance KVM selftests to make it easer to develop and debug
     selftests like those added for guest_memfd NUMA support, e.g. where
     test and/or KVM bugs often result in hard-to-debug SIGBUS errors

   - Misc cleanups

  Generic:

   - Use the recently-added WQ_PERCPU when creating the per-CPU
     workqueue for irqfd cleanup

   - Fix a goof in the dirty ring documentation

   - Fix choice of target for directed yield across different calls to
     kvm_vcpu_on_spin(); the function was always starting from the first
     vCPU instead of continuing the round-robin search"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (260 commits)
  KVM: arm64: at: Update AF on software walk only if VM has FEAT_HAFDBS
  KVM: arm64: at: Use correct HA bit in TCR_EL2 when regime is EL2
  KVM: arm64: Document KVM_PGTABLE_PROT_{UX,PX}
  KVM: arm64: Fix spelling mistake "Unexpeced" -> "Unexpected"
  KVM: arm64: Add break to default case in kvm_pgtable_stage2_pte_prot()
  KVM: arm64: Add endian casting to kvm_swap_s[12]_desc()
  KVM: arm64: Fix compilation when CONFIG_ARM64_USE_LSE_ATOMICS=n
  KVM: arm64: selftests: Add test for AT emulation
  KVM: arm64: nv: Expose hardware access flag management to NV guests
  KVM: arm64: nv: Implement HW access flag management in stage-2 SW PTW
  KVM: arm64: Implement HW access flag management in stage-1 SW PTW
  KVM: arm64: Propagate PTW errors up to AT emulation
  KVM: arm64: Add helper for swapping guest descriptor
  KVM: arm64: nv: Use pgtable definitions in stage-2 walk
  KVM: arm64: Handle endianness in read helper for emulated PTW
  KVM: arm64: nv: Stop passing vCPU through void ptr in S2 PTW
  KVM: arm64: Call helper for reading descriptors directly
  KVM: arm64: nv: Advertise support for FEAT_XNX
  KVM: arm64: Teach ptdump about FEAT_XNX permissions
  KVM: s390: Use generic VIRT_XFER_TO_GUEST_WORK functions
  ...
2025-12-05 17:01:20 -08:00
Praveen K Paladugu
615a6e7d83 mshv: Cleanly shutdown root partition with MSHV
Root partitions running on MSHV currently attempt ACPI power-off, which
MSHV intercepts and triggers a Machine Check Exception (MCE), leading
to a kernel panic.

Root partitions panic with a trace similar to:

  [   81.306348] reboot: Power down
  [   81.314709] mce: [Hardware Error]: CPU 0: Machine Check Exception: 4 Bank 0: b2000000c0060001
  [   81.314711] mce: [Hardware Error]: TSC 3b8cb60a66 PPIN 11d98332458e4ea9
  [   81.314713] mce: [Hardware Error]: PROCESSOR 0:606a6 TIME 1759339405 SOCKET 0 APIC 0 microcode ffffffff
  [   81.314715] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
  [   81.314716] mce: [Hardware Error]: Machine check: Processor context corrupt
  [   81.314717] Kernel panic - not syncing: Fatal machine check

To avoid this, configure the sleep state in the hypervisor and invoke
the HVCALL_ENTER_SLEEP_STATE hypercall as the final step in the shutdown
sequence. This ensures a clean and safe shutdown of the root partition.

Signed-off-by: Praveen K Paladugu <prapal@linux.microsoft.com>
Co-developed-by: Anatol Belski <anbelski@linux.microsoft.com>
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Reviewed-by: Easwar Hariharan <easwar.hariharan@linux.microsoft.com>
Acked-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
Signed-off-by: Wei Liu <wei.liu@kernel.org>
2025-12-05 23:25:05 +00:00
Praveen K Paladugu
f0be2600ac mshv: Use reboot notifier to configure sleep state
Configure sleep state information from ACPI in MSHV hypervisor using
a reboot notifier. This data allows the hypervisor to correctly power
off the host during shutdown.

Signed-off-by: Praveen K Paladugu <prapal@linux.microsoft.com>
Co-developed-by: Anatol Belski <anbelski@linux.microsoft.com>
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Reviewed-by: Easwar Hariharan <easwar.hariharan@linux.microsoft.com>
Reviewed-by: Stansialv Kinsburskii <skinsburskii@linux.miscrosoft.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
Signed-off-by: Wei Liu <wei.liu@kernel.org>
2025-12-05 23:25:01 +00:00
Naman Jain
7bfe3b8ea6 Drivers: hv: Introduce mshv_vtl driver
Provide an interface for Virtual Machine Monitor like OpenVMM and its
use as OpenHCL paravisor to control VTL0 (Virtual trust Level).
Expose devices and support IOCTLs for features like VTL creation,
VTL0 memory management, context switch, making hypercalls,
mapping VTL0 address space to VTL2 userspace, getting new VMBus
messages and channel events in VTL2 etc.

Co-developed-by: Roman Kisel <romank@linux.microsoft.com>
Signed-off-by: Roman Kisel <romank@linux.microsoft.com>
Co-developed-by: Saurabh Sengar <ssengar@linux.microsoft.com>
Signed-off-by: Saurabh Sengar <ssengar@linux.microsoft.com>
Reviewed-by: Michael Kelley <mhklinux@outlook.com>
Signed-off-by: Naman Jain <namjain@linux.microsoft.com>
Signed-off-by: Wei Liu <wei.liu@kernel.org>
2025-12-05 23:16:26 +00:00
Josh Poimboeuf
ed3bf863dc objtool: Remove newlines and tabs from annotation macros
Remove newlines and tabs from the annotation macros so the invoking code
can insert them as needed to match the style of the surrounding code.

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://patch.msgid.link/66305834c2eb78f082217611b756231ae9c0b555.1764694625.git.jpoimboe@kernel.org
2025-12-03 19:42:37 +01:00
Josh Poimboeuf
f387d0e102 x86/asm: Remove ANNOTATE_DATA_SPECIAL usage
Instead of manually annotating each __ex_table entry, just make the
section mergeable and store the entry size in the ELF section header.

Either way works for objtool create_fake_symbols(), this way produces
cleaner code generation.

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://patch.msgid.link/b858cb7891c1ba0080e22a9c32595e6c302435e2.1764694625.git.jpoimboe@kernel.org
2025-12-03 16:53:19 +01:00
Josh Poimboeuf
a818f28f01 x86/alternative: Remove ANNOTATE_DATA_SPECIAL usage
Instead of manually annotating each .altinstructions entry, just make
the section mergeable and store the entry size in the ELF section
header.

Either way works for objtool create_fake_symbols(), this way produces
cleaner code generation.

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://patch.msgid.link/5ac04e6db5be6453dce8003a771ebb0c47b4cd7a.1764694625.git.jpoimboe@kernel.org
2025-12-03 16:53:19 +01:00
Linus Torvalds
d61f1cc5db Merge tag 'x86_cpu_for_6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 CPU feature updates from Dave Hansen:
 "The biggest thing of note here is Linear Address Space Separation
  (LASS). It represents the first time I can think of that the
  upper=>kernel/lower=>user address space convention is actually
  recognized by the hardware on x86. It ensures that userspace can not
  even get the hardware to _start_ page walks for the kernel address
  space. This, of course, is a really nice generic side channel defense.

  This is really only a down payment on LASS support. There are still
  some details to work out in its interaction with EFI calls and
  vsyscall emulation. For now, LASS is disabled if either of those
  features is compiled in (which is almost always the case).

  There's also one straggler commit in here which converts an
  under-utilized AMD CPU feature leaf into a generic Linux-defined leaf
  so more feature can be packed in there.

  Summary:

   - Enable Linear Address Space Separation (LASS)

   - Change X86_FEATURE leaf 17 from an AMD leaf to Linux-defined"

* tag 'x86_cpu_for_6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/cpu: Enable LASS during CPU initialization
  selftests/x86: Update the negative vsyscall tests to expect a #GP
  x86/traps: Communicate a LASS violation in #GP message
  x86/kexec: Disable LASS during relocate kernel
  x86/alternatives: Disable LASS when patching kernel code
  x86/asm: Introduce inline memcpy and memset
  x86/cpu: Add an LASS dependency on SMAP
  x86/cpufeatures: Enumerate the LASS feature bits
  x86/cpufeatures: Make X86_FEATURE leaf 17 Linux-specific
2025-12-02 14:48:08 -08:00
Linus Torvalds
a7610b8465 Merge tag 'x86_entry_for_6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 entry update from Dave Hansen:
 "This one is pretty trivial: fix a badly-named FRED data structure
  member"

* tag 'x86_entry_for_6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/fred: Fix 64bit identifier in fred_ss
2025-12-02 14:24:21 -08:00
Linus Torvalds
e2aa39b368 Merge tag 'x86_misc_for_6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull misc x86 updates from Dave Hansen:
 "The most significant are some changes to ensure that symbols exported
  for KVM are used only by KVM modules themselves, along with some
  related cleanups.

  In true x86/misc fashion, the other patch is completely unrelated and
  just enhances an existing pr_warn() to make it clear to users how they
  have tainted their kernel when something is mucking with MSRs.

  Summary:

   - Make MSR-induced taint easier for users to track down

   - Restrict KVM-specific exports to KVM itself"

* tag 'x86_misc_for_6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86: Restrict KVM-induced symbol exports to KVM modules where obvious/possible
  x86/mm: Drop unnecessary export of "ptdump_walk_pgd_level_debugfs"
  x86/mtrr: Drop unnecessary export of "mtrr_state"
  x86/bugs: Drop unnecessary export of "x86_spec_ctrl_base"
  x86/msr: Add CPU_OUT_OF_SPEC taint name to "unrecognized" pr_warn(msg)
2025-12-02 14:16:42 -08:00
Linus Torvalds
54de197c9a Merge tag 'x86_sgx_for_6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 SGX updates from Dave HansenL
 "The main content here is adding support for the new EUPDATESVN SGX
  ISA. Before this, folks who updated microcode had to reboot before
  enclaves could attest to the new microcode. The new functionality lets
  them do this without a reboot.

  The rest are some nice, but relatively mundane comment and kernel-doc
  fixups.

  Summary:

   - Allow security version (SVN) updates so enclaves can attest to new
     microcode

   - Fix kernel docs typos"

* tag 'x86_sgx_for_6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/sgx: Fix a typo in the kernel-doc comment for enum sgx_attribute
  x86/sgx: Remove superfluous asterisk from copyright comment in asm/sgx.h
  x86/sgx: Document structs and enums with '@', not '%'
  x86/sgx: Add kernel-doc descriptions for params passed to vDSO user handler
  x86/sgx: Add a missing colon in kernel-doc markup for "struct sgx_enclave_run"
  x86/sgx: Enable automatic SVN updates for SGX enclaves
  x86/sgx: Implement ENCLS[EUPDATESVN]
  x86/sgx: Define error codes for use by ENCLS[EUPDATESVN]
  x86/cpufeatures: Add X86_FEATURE_SGX_EUPDATESVN feature flag
  x86/sgx: Introduce functions to count the sgx_(vepc_)open()
2025-12-02 14:03:05 -08:00
Linus Torvalds
c76431e3b5 Merge tag 'x86_mm_for_v6.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 mm updates from Borislav Petkov:

 - Use the proper accessors when reading CR3 as part of the page level
   transitions (5-level to 4-level, the use case being kexec) so that
   only the physical address in CR3 is picked up and not flags which are
   above the physical mask shift

 - Clean up and unify __phys_addr_symbol() definitions

* tag 'x86_mm_for_v6.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  efi/libstub: Fix page table access in 5-level to 4-level paging transition
  x86/boot: Fix page table access in 5-level to 4-level paging transition
  x86/mm: Unify __phys_addr_symbol()
2025-12-02 13:32:52 -08:00
Linus Torvalds
cb502f0e5e Merge tag 'x86_sev_for_v6.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 SEV updates from Borislav Petkov:

 - Largely cleanups along with a change to save XSS to the GHCB
   (Guest-Host Communication Block) in SEV-ES guests so that the
   hypervisor can determine the guest's XSAVES buffer size properly
   and thus support shadow stacks in AMD confidential guests

* tag 'x86_sev_for_v6.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/cc: Fix enum spelling to fix kernel-doc warnings
  x86/boot: Drop unused sev_enable() fallback
  x86/coco/sev: Convert has_cpuflag() to use cpu_feature_enabled()
  x86/sev: Include XSS value in GHCB CPUID request
  x86/boot: Move boot_*msr helpers to asm/shared/msr.h
2025-12-02 13:07:53 -08:00
Linus Torvalds
d748981834 Merge tag 'x86_cleanups_for_v6.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 cleanups from Borislav Petkov:

 - The mandatory pile of cleanups the cat drags in every merge window

* tag 'x86_cleanups_for_v6.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/boot: Clean up whitespace in a20.c
  x86/mm: Delete disabled debug code
  x86/{boot,mtrr}: Remove unused function declarations
  x86/percpu: Use BIT_WORD() and BIT_MASK() macros
  x86/cpufeatures: Correct LKGS feature flag description
  x86/idtentry: Add missing '*' to kernel-doc lines
2025-12-02 12:17:47 -08:00
Linus Torvalds
2ae20d6510 Merge tag 'x86_cache_for_v6.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 resource control updates from Borislav Petkov:

 - Add support for AMD's Smart Data Cache Injection feature which allows
   for direct insertion of data from I/O devices into the L3 cache, thus
   bypassing DRAM and saving its bandwidth; the resctrl side of the
   feature allows the size of the L3 used for data injection to be
   controlled

 - Add Intel Clearwater Forest to the list of CPUs which support
   Sub-NUMA clustering

 - Other fixes and cleanups

* tag 'x86_cache_for_v6.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  fs/resctrl: Update bit_usage to reflect io_alloc
  fs/resctrl: Introduce interface to modify io_alloc capacity bitmasks
  fs/resctrl: Modify struct rdt_parse_data to pass mode and CLOSID
  fs/resctrl: Introduce interface to display io_alloc CBMs
  fs/resctrl: Add user interface to enable/disable io_alloc feature
  fs/resctrl: Introduce interface to display "io_alloc" support
  x86,fs/resctrl: Implement "io_alloc" enable/disable handlers
  x86,fs/resctrl: Detect io_alloc feature
  x86/resctrl: Add SDCIAE feature in the command line options
  x86/cpufeatures: Add support for L3 Smart Data Cache Injection Allocation Enforcement
  fs/resctrl: Consider sparse masks when initializing new group's allocation
  x86/resctrl: Support Sub-NUMA Cluster (SNC) mode on Clearwater Forest
2025-12-02 11:55:58 -08:00
Linus Torvalds
2a47c26e55 Merge tag 'x86_microcode_for_v6.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 microcode loading updates from Borislav Petkov:

 - Add microcode staging support on Intel: it moves the sole microcode
   blobs loading to a non-critical path so that microcode loading
   latencies are kept at minimum. The actual "directing" the hardware to
   load microcode is the only step which is done on the critical path.

   This scheme is also opportunistic as in: on a failure, the machinery
   falls back to normal loading

 - Add the capability to the AMD side of the loader to select one of two
   per-family/model/stepping patches: one is pre-Entrysign and the other
   is post-Entrysign; with the goal to take care of machines which
   haven't updated their BIOS yet - something they should absolutely do
   as this is the only proper Entrysign fix

 - Other small cleanups and fixlets

* tag 'x86_microcode_for_v6.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/microcode: Mark early_parse_cmdline() as __init
  x86/microcode/AMD: Select which microcode patch to load
  x86/microcode/intel: Enable staging when available
  x86/microcode/intel: Support mailbox transfer
  x86/microcode/intel: Implement staging handler
  x86/microcode/intel: Define staging state struct
  x86/microcode/intel: Establish staging control logic
  x86/microcode: Introduce staging step to reduce late-loading time
  x86/cpu/topology: Make primary thread mask available with SMP=n
2025-12-02 11:35:49 -08:00
Linus Torvalds
a61288200e Merge tag 'ras_core_for_v6.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 RAS updates from Borislav Petkov:

 - The second part of the AMD MCA interrupts rework after the
   last-minute show-stopper from the last merge window was sorted out.
   After this, the AMD MCA deferred errors, thresholding and corrected
   errors interrupt handlers use common MCA code and are tightly
   integrated into the core MCA code, thereby getting rid of
   considerable duplication. All culminating into allowing CMCI error
   thresholding storms to be detected at AMD too, using the common
   infrastructure

 - Add support for two new MCA bank bits on AMD Zen6 which denote
   whether the error address logged is a system physical address, which
   obviates the need for it to be translated before further error
   recovery can be done

* tag 'ras_core_for_v6.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mce: Handle AMD threshold interrupt storms
  x86/mce: Do not clear bank's poll bit in mce_poll_banks on AMD SMCA systems
  x86/mce: Add support for physical address valid bit
  x86/mce: Save and use APEI corrected threshold limit
  x86/mce/amd: Define threshold restart function for banks
  x86/mce/amd: Remove redundant reset_block()
  x86/mce/amd: Support SMCA Corrected Error Interrupt
  x86/mce/amd: Enable interrupt vectors once per-CPU on SMCA systems
  x86/mce: Unify AMD DFR handler with MCA Polling
  x86/mce: Unify AMD THR handler with MCA Polling
2025-12-02 11:04:37 -08:00
Linus Torvalds
2b09f480f0 Merge tag 'core-rseq-2025-11-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull rseq updates from Thomas Gleixner:
 "A large overhaul of the restartable sequences and CID management:

  The recent enablement of RSEQ in glibc resulted in regressions which
  are caused by the related overhead. It turned out that the decision to
  invoke the exit to user work was not really a decision. More or less
  each context switch caused that. There is a long list of small issues
  which sums up nicely and results in a 3-4% regression in I/O
  benchmarks.

  The other detail which caused issues due to extra work in context
  switch and task migration is the CID (memory context ID) management.
  It also requires to use a task work to consolidate the CID space,
  which is executed in the context of an arbitrary task and results in
  sporadic uncontrolled exit latencies.

  The rewrite addresses this by:

   - Removing deprecated and long unsupported functionality

   - Moving the related data into dedicated data structures which are
     optimized for fast path processing.

   - Caching values so actual decisions can be made

   - Replacing the current implementation with a optimized inlined
     variant.

   - Separating fast and slow path for architectures which use the
     generic entry code, so that only fault and error handling goes into
     the TIF_NOTIFY_RESUME handler.

   - Rewriting the CID management so that it becomes mostly invisible in
     the context switch path. That moves the work of switching modes
     into the fork/exit path, which is a reasonable tradeoff. That work
     is only required when a process creates more threads than the
     cpuset it is allowed to run on or when enough threads exit after
     that. An artificial thread pool benchmarks which triggers this did
     not degrade, it actually improved significantly.

     The main effect in migration heavy scenarios is that runqueue lock
     held time and therefore contention goes down significantly"

* tag 'core-rseq-2025-11-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (54 commits)
  sched/mmcid: Switch over to the new mechanism
  sched/mmcid: Implement deferred mode change
  irqwork: Move data struct to a types header
  sched/mmcid: Provide CID ownership mode fixup functions
  sched/mmcid: Provide new scheduler CID mechanism
  sched/mmcid: Introduce per task/CPU ownership infrastructure
  sched/mmcid: Serialize sched_mm_cid_fork()/exit() with a mutex
  sched/mmcid: Provide precomputed maximal value
  sched/mmcid: Move initialization out of line
  signal: Move MMCID exit out of sighand lock
  sched/mmcid: Convert mm CID mask to a bitmap
  cpumask: Cache num_possible_cpus()
  sched/mmcid: Use cpumask_weighted_or()
  cpumask: Introduce cpumask_weighted_or()
  sched/mmcid: Prevent pointless work in mm_update_cpus_allowed()
  sched/mmcid: Move scheduler code out of global header
  sched: Fixup whitespace damage
  sched/mmcid: Cacheline align MM CID storage
  sched/mmcid: Use proper data structures
  sched/mmcid: Revert the complex CID management
  ...
2025-12-02 08:48:53 -08:00
Linus Torvalds
1dce50698a Merge tag 'core-uaccess-2025-11-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scoped user access updates from Thomas Gleixner:
 "Scoped user mode access and related changes:

   - Implement the missing u64 user access function on ARM when
     CONFIG_CPU_SPECTRE=n.

     This makes it possible to access a 64bit value in generic code with
     [unsafe_]get_user(). All other architectures and ARM variants
     provide the relevant accessors already.

   - Ensure that ASM GOTO jump label usage in the user mode access
     helpers always goes through a local C scope label indirection
     inside the helpers.

     This is required because compilers are not supporting that a ASM
     GOTO target leaves a auto cleanup scope. GCC silently fails to emit
     the cleanup invocation and CLANG fails the build.

     [ Editor's note: gcc-16 will have fixed the code generation issue
       in commit f68fe3ddda4 ("eh: Invoke cleanups/destructors in asm
       goto jumps [PR122835]"). But we obviously have to deal with clang
       and older versions of gcc, so.. - Linus ]

     This provides generic wrapper macros and the conversion of affected
     architecture code to use them.

   - Scoped user mode access with auto cleanup

     Access to user mode memory can be required in hot code paths, but
     if it has to be done with user controlled pointers, the access is
     shielded with a speculation barrier, so that the CPU cannot
     speculate around the address range check. Those speculation
     barriers impact performance quite significantly.

     This cost can be avoided by "masking" the provided pointer so it is
     guaranteed to be in the valid user memory access range and
     otherwise to point to a guaranteed unpopulated address space. This
     has to be done without branches so it creates an address dependency
     for the access, which the CPU cannot speculate ahead.

     This results in repeating and error prone programming patterns:

       	    if (can_do_masked_user_access())
                      from = masked_user_read_access_begin((from));
              else if (!user_read_access_begin(from, sizeof(*from)))
                      return -EFAULT;
              unsafe_get_user(val, from, Efault);
              user_read_access_end();
              return 0;
        Efault:
              user_read_access_end();
              return -EFAULT;

      which can be replaced with scopes and automatic cleanup:

              scoped_user_read_access(from, Efault)
                      unsafe_get_user(val, from, Efault);
              return 0;
         Efault:
              return -EFAULT;

   - Convert code which implements the above pattern over to
     scope_user.*.access(). This also corrects a couple of imbalanced
     masked_*_begin() instances which are harmless on most
     architectures, but prevent PowerPC from implementing the masking
     optimization.

   - Add a missing speculation barrier in copy_from_user_iter()"

* tag 'core-uaccess-2025-11-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  lib/strn*,uaccess: Use masked_user_{read/write}_access_begin when required
  scm: Convert put_cmsg() to scoped user access
  iov_iter: Add missing speculation barrier to copy_from_user_iter()
  iov_iter: Convert copy_from_user_iter() to masked user access
  select: Convert to scoped user access
  x86/futex: Convert to scoped user access
  futex: Convert to get/put_user_inline()
  uaccess: Provide put/get_user_inline()
  uaccess: Provide scoped user access regions
  arm64: uaccess: Use unsafe wrappers for ASM GOTO
  s390/uaccess: Use unsafe wrappers for ASM GOTO
  riscv/uaccess: Use unsafe wrappers for ASM GOTO
  powerpc/uaccess: Use unsafe wrappers for ASM GOTO
  x86/uaccess: Use unsafe wrappers for ASM GOTO
  uaccess: Provide ASM GOTO safe wrappers for unsafe_*_user()
  ARM: uaccess: Implement missing __get_user_asm_dword()
2025-12-02 08:01:39 -08:00
Linus Torvalds
4a26e7032d Merge tag 'core-bugs-2025-12-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull bug handling infrastructure updates from Ingo Molnar:
 "Core updates:

   - Improve WARN(), which has vararg printf like arguments, to work
     with the x86 #UD based WARN-optimizing infrastructure by hiding the
     format in the bug_table and replacing this first argument with the
     address of the bug-table entry, while making the actual function
     that's called a UD1 instruction (Peter Zijlstra)

   - Introduce the CONFIG_DEBUG_BUGVERBOSE_DETAILED Kconfig switch (Ingo
     Molnar, s390 support by Heiko Carstens)

  Fixes and cleanups:

   - bugs/s390: Remove private WARN_ON() implementation (Heiko Carstens)

   - <asm/bugs.h>: Make i386 use GENERIC_BUG_RELATIVE_POINTERS (Peter
     Zijlstra)"

* tag 'core-bugs-2025-12-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (31 commits)
  x86/bugs: Make i386 use GENERIC_BUG_RELATIVE_POINTERS
  x86/bug: Fix BUG_FORMAT vs KASLR
  x86_64/bug: Inline the UD1
  x86/bug: Implement WARN_ONCE()
  x86_64/bug: Implement __WARN_printf()
  x86/bug: Use BUG_FORMAT for DEBUG_BUGVERBOSE_DETAILED
  x86/bug: Add BUG_FORMAT basics
  bug: Allow architectures to provide __WARN_printf()
  bug: Implement WARN_ON() using __WARN_FLAGS()
  bug: Add report_bug_entry()
  bug: Add BUG_FORMAT_ARGS infrastructure
  bug: Clean up CONFIG_GENERIC_BUG_RELATIVE_POINTERS
  bug: Add BUG_FORMAT infrastructure
  x86: Rework __bug_table helpers
  bugs/s390: Remove private WARN_ON() implementation
  bugs/core: Reorganize fields in the first line of WARNING output, add ->comm[] output
  bugs/sh: Concatenate 'cond_str' with '__FILE__' in __WARN_FLAGS(), to extend WARN_ON/BUG_ON output
  bugs/parisc: Concatenate 'cond_str' with '__FILE__' in __WARN_FLAGS(), to extend WARN_ON/BUG_ON output
  bugs/riscv: Concatenate 'cond_str' with '__FILE__' in __BUG_FLAGS(), to extend WARN_ON/BUG_ON output
  bugs/riscv: Pass in 'cond_str' to __BUG_FLAGS()
  ...
2025-12-01 21:33:01 -08:00
Linus Torvalds
6d2c10e889 Merge tag 'sched-core-2025-12-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
 "Scalability and load-balancing improvements:

   - Enable scheduler feature NEXT_BUDDY (Mel Gorman)

   - Reimplement NEXT_BUDDY to align with EEVDF goals (Mel Gorman)

   - Skip sched_balance_running cmpxchg when balance is not due (Tim
     Chen)

   - Implement generic code for architecture specific sched domain NUMA
     distances (Tim Chen)

   - Optimize the NUMA distances of the sched-domains builds of Intel
     Granite Rapids (GNR) and Clearwater Forest (CWF) platforms (Tim
     Chen)

   - Implement proportional newidle balance: a randomized algorithm that
     runs newidle balancing proportional to its success rate. (Peter
     Zijlstra)

  Scheduler infrastructure changes:

   - Implement the 'sched_change' scoped_guard() pattern for the entire
     scheduler (Peter Zijlstra)

   - More broadly utilize the sched_change guard (Peter Zijlstra)

   - Add support to pick functions to take runqueue-flags (Joel
     Fernandes)

   - Provide and use set_need_resched_current() (Peter Zijlstra)

  Fair scheduling enhancements:

   - Forfeit vruntime on yield (Fernand Sieber)

   - Only update stats for allowed CPUs when looking for dst group (Adam
     Li)

  CPU-core scheduling enhancements:

   - Optimize core cookie matching check (Fernand Sieber)

  Deadline scheduler fixes:

   - Only set free_cpus for online runqueues (Doug Berger)

   - Fix dl_server time accounting (Peter Zijlstra)

   - Fix dl_server stop condition (Peter Zijlstra)

  Proxy scheduling fixes:

   - Yield the donor task (Fernand Sieber)

  Fixes and cleanups:

   - Fix do_set_cpus_allowed() locking (Peter Zijlstra)

   - Fix migrate_disable_switch() locking (Peter Zijlstra)

   - Remove double update_rq_clock() in __set_cpus_allowed_ptr_locked()
     (Hao Jia)

   - Increase sched_tick_remote timeout (Phil Auld)

   - sched/deadline: Use cpumask_weight_and() in dl_bw_cpus() (Shrikanth
     Hegde)

   - sched/deadline: Clean up select_task_rq_dl() (Shrikanth Hegde)"

* tag 'sched-core-2025-12-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (44 commits)
  sched: Provide and use set_need_resched_current()
  sched/fair: Proportional newidle balance
  sched/fair: Small cleanup to update_newidle_cost()
  sched/fair: Small cleanup to sched_balance_newidle()
  sched/fair: Revert max_newidle_lb_cost bump
  sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals
  sched/fair: Enable scheduler feature NEXT_BUDDY
  sched: Increase sched_tick_remote timeout
  sched/fair: Have SD_SERIALIZE affect newidle balancing
  sched/fair: Skip sched_balance_running cmpxchg when balance is not due
  sched/deadline: Minor cleanup in select_task_rq_dl()
  sched/deadline: Use cpumask_weight_and() in dl_bw_cpus
  sched/deadline: Document dl_server
  sched/deadline: Fix dl_server stop condition
  sched/deadline: Fix dl_server time accounting
  sched/core: Remove double update_rq_clock() in __set_cpus_allowed_ptr_locked()
  sched/eevdf: Fix min_vruntime vs avg_vruntime
  sched/core: Add comment explaining force-idle vruntime snapshots
  sched/core: Optimize core cookie matching check
  sched/proxy: Yield the donor task
  ...
2025-12-01 21:04:45 -08:00
Linus Torvalds
6c26fbe8c9 Merge tag 'perf-core-2025-12-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull performance events updates from Ingo Molnar:
 "Callchain support:

   - Add support for deferred user-space stack unwinding for perf,
     enabled on x86. (Peter Zijlstra, Steven Rostedt)

   - unwind_user/x86: Enable frame pointer unwinding on x86 (Josh
     Poimboeuf)

  x86 PMU support and infrastructure:

   - x86/insn: Simplify for_each_insn_prefix() (Peter Zijlstra)

   - x86/insn,uprobes,alternative: Unify insn_is_nop() (Peter Zijlstra)

  Intel PMU driver:

   - Large series to prepare for and implement architectural PEBS
     support for Intel platforms such as Clearwater Forest (CWF) and
     Panther Lake (PTL). (Dapeng Mi, Kan Liang)

   - Check dynamic constraints (Kan Liang)

   - Optimize PEBS extended config (Peter Zijlstra)

   - cstates:
      - Remove PC3 support from LunarLake (Zhang Rui)
      - Add Pantherlake support (Zhang Rui)
      - Clearwater Forest support (Zide Chen)

  AMD PMU driver:

   - x86/amd: Check event before enable to avoid GPF (George Kennedy)

  Fixes and cleanups:

   - task_work: Fix NMI race condition (Peter Zijlstra)

   - perf/x86: Fix NULL event access and potential PEBS record loss
     (Dapeng Mi)

   - Misc other fixes and cleanups (Dapeng Mi, Ingo Molnar, Peter
     Zijlstra)"

* tag 'perf-core-2025-12-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (38 commits)
  perf/x86/intel: Fix and clean up intel_pmu_drain_arch_pebs() type use
  perf/x86/intel: Optimize PEBS extended config
  perf/x86/intel: Check PEBS dyn_constraints
  perf/x86/intel: Add a check for dynamic constraints
  perf/x86/intel: Add counter group support for arch-PEBS
  perf/x86/intel: Setup PEBS data configuration and enable legacy groups
  perf/x86/intel: Update dyn_constraint base on PEBS event precise level
  perf/x86/intel: Allocate arch-PEBS buffer and initialize PEBS_BASE MSR
  perf/x86/intel: Process arch-PEBS records or record fragments
  perf/x86/intel/ds: Factor out PEBS group processing code to functions
  perf/x86/intel/ds: Factor out PEBS record processing code to functions
  perf/x86/intel: Initialize architectural PEBS
  perf/x86/intel: Correct large PEBS flag check
  perf/x86/intel: Replace x86_pmu.drain_pebs calling with static call
  perf/x86: Fix NULL event access and potential PEBS record loss
  perf/x86: Remove redundant is_x86_event() prototype
  entry,unwind/deferred: Fix unwind_reset_info() placement
  unwind_user/x86: Fix arch=um build
  perf: Support deferred user unwind
  unwind_user/x86: Teach FP unwind about start of function
  ...
2025-12-01 20:42:01 -08:00
Linus Torvalds
63e6995005 Merge tag 'objtool-core-2025-12-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull objtool updates from Ingo Molnar:

 - klp-build livepatch module generation (Josh Poimboeuf)

   Introduce new objtool features and a klp-build script to generate
   livepatch modules using a source .patch as input.

   This builds on concepts from the longstanding out-of-tree kpatch
   project which began in 2012 and has been used for many years to
   generate livepatch modules for production kernels. However, this is a
   complete rewrite which incorporates hard-earned lessons from 12+
   years of maintaining kpatch.

   Key improvements compared to kpatch-build:

    - Integrated with objtool: Leverages objtool's existing control-flow
      graph analysis to help detect changed functions.

    - Works on vmlinux.o: Supports late-linked objects, making it
      compatible with LTO, IBT, and similar.

    - Simplified code base: ~3k fewer lines of code.

    - Upstream: No more out-of-tree #ifdef hacks, far less cruft.

    - Cleaner internals: Vastly simplified logic for
      symbol/section/reloc inclusion and special section extraction.

    - Robust __LINE__ macro handling: Avoids false positive binary diffs
      caused by the __LINE__ macro by introducing a fix-patch-lines
      script which injects #line directives into the source .patch to
      preserve the original line numbers at compile time.

 - Disassemble code with libopcodes instead of running objdump
   (Alexandre Chartre)

 - Disassemble support (-d option to objtool) by Alexandre Chartre,
   which supports the decoding of various Linux kernel code generation
   specials such as alternatives:

      17ef:  sched_balance_find_dst_group+0x62f                 mov    0x34(%r9),%edx
      17f3:  sched_balance_find_dst_group+0x633               | <alternative.17f3>             | X86_FEATURE_POPCNT
      17f3:  sched_balance_find_dst_group+0x633               | call   0x17f8 <__sw_hweight64> | popcnt %rdi,%rax
      17f8:  sched_balance_find_dst_group+0x638                 cmp    %eax,%edx

   ... jump table alternatives:

      1895:  sched_use_asym_prio+0x5                            test   $0x8,%ch
      1898:  sched_use_asym_prio+0x8                            je     0x18a9 <sched_use_asym_prio+0x19>
      189a:  sched_use_asym_prio+0xa                          | <jump_table.189a>                        | JUMP
      189a:  sched_use_asym_prio+0xa                          | jmp    0x18ae <sched_use_asym_prio+0x1e> | nop2
      189c:  sched_use_asym_prio+0xc                            mov    $0x1,%eax
      18a1:  sched_use_asym_prio+0x11                           and    $0x80,%ecx

   ... exception table alternatives:

    native_read_msr:
      5b80:  native_read_msr+0x0                                                     mov    %edi,%ecx
      5b82:  native_read_msr+0x2                                                   | <ex_table.5b82> | EXCEPTION
      5b82:  native_read_msr+0x2                                                   | rdmsr           | resume at 0x5b84 <native_read_msr+0x4>
      5b84:  native_read_msr+0x4                                                     shl    $0x20,%rdx

   .... x86 feature flag decoding (also see the X86_FEATURE_POPCNT
        example in sched_balance_find_dst_group() above):

      2faaf:  start_thread_common.constprop.0+0x1f                                    jne    0x2fba4 <start_thread_common.constprop.0+0x114>
      2fab5:  start_thread_common.constprop.0+0x25                                  | <alternative.2fab5>                  | X86_FEATURE_ALWAYS                                  | X86_BUG_NULL_SEG
      2fab5:  start_thread_common.constprop.0+0x25                                  | jmp    0x2faba <.altinstr_aux+0x2f4> | jmp    0x4b0 <start_thread_common.constprop.0+0x3f> | nop5
      2faba:  start_thread_common.constprop.0+0x2a                                    mov    $0x2b,%eax

   ... NOP sequence shortening:

      1048e2:  snapshot_write_finalize+0xc2                                            je     0x104917 <snapshot_write_finalize+0xf7>
      1048e4:  snapshot_write_finalize+0xc4                                            nop6
      1048ea:  snapshot_write_finalize+0xca                                            nop11
      1048f5:  snapshot_write_finalize+0xd5                                            nop11
      104900:  snapshot_write_finalize+0xe0                                            mov    %rax,%rcx
      104903:  snapshot_write_finalize+0xe3                                            mov    0x10(%rdx),%rax

   ... and much more.

 - Function validation tracing support (Alexandre Chartre)

 - Various -ffunction-sections fixes (Josh Poimboeuf)

 - Clang AutoFDO (Automated Feedback-Directed Optimizations) support
   (Josh Poimboeuf)

 - Misc fixes and cleanups (Borislav Petkov, Chen Ni, Dylan Hatch, Ingo
   Molnar, John Wang, Josh Poimboeuf, Pankaj Raghav, Peter Zijlstra,
   Thorsten Blum)

* tag 'objtool-core-2025-12-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (129 commits)
  objtool: Fix segfault on unknown alternatives
  objtool: Build with disassembly can fail when including bdf.h
  objtool: Trim trailing NOPs in alternative
  objtool: Add wide output for disassembly
  objtool: Compact output for alternatives with one instruction
  objtool: Improve naming of group alternatives
  objtool: Add Function to get the name of a CPU feature
  objtool: Provide access to feature and flags of group alternatives
  objtool: Fix address references in alternatives
  objtool: Disassemble jump table alternatives
  objtool: Disassemble exception table alternatives
  objtool: Print addresses with alternative instructions
  objtool: Disassemble group alternatives
  objtool: Print headers for alternatives
  objtool: Preserve alternatives order
  objtool: Add the --disas=<function-pattern> action
  objtool: Do not validate IBT for .return_sites and .call_sites
  objtool: Improve tracing of alternative instructions
  objtool: Add functions to better name alternatives
  objtool: Identify the different types of alternatives
  ...
2025-12-01 20:18:59 -08:00
Randy Dunlap
33b4c26d4d x86/platform: Fix and extend kernel-doc comments in <asm/x86_init.h>
Fix most (17) kernel-doc warnings in x86_init.h (except for struct
x86_init_ops). The changes are:

- fix struct member name typos
- add ending ':' to struct member names
- add some missing struct member descriptions

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://patch.msgid.link/20251129002524.1196500-1-rdunlap@infradead.org
2025-12-01 21:57:16 +01:00
Peter Zijlstra
b0a848f4a4 x86/bugs: Make i386 use GENERIC_BUG_RELATIVE_POINTERS
Linus figured less #ifdef is more better and making x86-32 use
GENERIC_BUG_RELATIVE_POINTERS removes one layer of macro magic from
the bug.h bits.

Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2025-11-27 09:32:48 +01:00
Peter Zijlstra
d62e4f2b95 x86/bug: Fix BUG_FORMAT vs KASLR
Encoding a relative NULL pointer doesn't work for KASLR, when the
whole kernel image gets shifted, the __bug_table and the target string
get shifted by the same amount and the relative offset is preserved.

However when the target is an absolute 0 value and the __bug_table
gets moved about, the end result in a pointer equivalent to
kaslr_offset(), not NULL.

Notably, this will generate SHN_UNDEF relocations, and Ard would
really like to not have those at all.

Use the empty string to denote no-string.

Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2025-11-27 09:32:47 +01:00
Paolo Bonzini
679fcce002 Merge tag 'kvm-x86-svm-6.19' of https://github.com/kvm-x86/linux into HEAD
KVM SVM changes for 6.19:

 - Fix a few missing "VMCB dirty" bugs.

 - Fix the worst of KVM's lack of EFER.LMSLE emulation.

 - Add AVIC support for addressing 4k vCPUs in x2AVIC mode.

 - Fix incorrect handling of selective CR0 writes when checking intercepts
   during emulation of L2 instructions.

 - Fix a currently-benign bug where KVM would clobber SPEC_CTRL[63:32] on
   VMRUN and #VMEXIT.

 - Fix a bug where KVM corrupt the guest code stream when re-injecting a soft
   interrupt if the guest patched the underlying code after the VM-Exit, e.g.
   when Linux patches code with a temporary INT3.

 - Add KVM_X86_SNP_POLICY_BITS to advertise supported SNP policy bits to
   userspace, and extend KVM "support" to all policy bits that don't require
   any actual support from KVM.
2025-11-26 09:48:39 +01:00
Paolo Bonzini
de8e8ebb1a Merge tag 'kvm-x86-tdx-6.19' of https://github.com/kvm-x86/linux into HEAD
KVM TDX changes for 6.19:

 - Overhaul the TDX code to address systemic races where KVM (acting on behalf
   of userspace) could inadvertantly trigger lock contention in the TDX-Module,
   which KVM was either working around in weird, ugly ways, or was simply
   oblivious to (as proven by Yan tripping several KVM_BUG_ON()s with clever
   selftests).

 - Fix a bug where KVM could corrupt a vCPU's cpu_list when freeing a vCPU if
   creating said vCPU failed partway through.

 - Fix a few sparse warnings (bad annotation, 0 != NULL).

 - Use struct_size() to simplify copying capabilities to userspace.
2025-11-26 09:36:37 +01:00
Paolo Bonzini
e64dcfab57 Merge tag 'kvm-x86-misc-6.19' of https://github.com/kvm-x86/linux into HEAD
KVM x86 misc changes for 6.19:

 - Fix an async #PF bug where KVM would clear the completion queue when the
   guest transitioned in and out of paging mode, e.g. when handling an SMI and
   then returning to paged mode via RSM.

 - Fix a bug where TDX would effectively corrupt user-return MSR values if the
   TDX Module rejects VP.ENTER and thus doesn't clobber host MSRs as expected.

 - Leave the user-return notifier used to restore MSRs registered when
   disabling virtualization, and instead pin kvm.ko.  Restoring host MSRs via
   IPI callback is either pointless (clean reboot) or dangerous (forced reboot)
   since KVM has no idea what code it's interrupting.

 - Use the checked version of {get,put}_user(), as Linus wants to kill them
   off, and they're measurably faster on modern CPUs due to the unchecked
   versions containing an LFENCE.

 - Fix a long-lurking bug where KVM's lack of catch-up logic for periodic APIC
   timers can result in a hard lockup in the host.

 - Revert the periodic kvmclock sync logic now that KVM doesn't use a
   clocksource that's subject to NPT corrections.

 - Clean up KVM's handling of MMIO Stale Data and L1TF, and bury the latter
   behind CONFIG_CPU_MITIGATIONS.

 - Context switch XCR0, XSS, and PKRU outside of the entry/exit fastpath as
   the only reason they were handled in the faspath was to paper of a bug in
   the core #MC code that has long since been fixed.

 - Add emulator support for AVX MOV instructions to play nice with emulated
   devices whose PCI BARs guest drivers like to access with large multi-byte
   instructions.
2025-11-26 09:34:21 +01:00
Andy Shevchenko
4bd68e4753 cpumask: Don't use "proxy" headers
Update header inclusions to follow IWYU (Include What You Use)
principle.

Note that kernel.h is discouraged to be included as it's written
at the top of that file.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Yury Norov (NVIDIA) <yury.norov@gmail.com>
2025-11-24 16:08:18 -05:00
Peter Zijlstra
860238af7a x86_64/bug: Inline the UD1
(Ab)use the static_call infrastructure to convert all:

  call __WARN_trap

instances into the desired:

  ud1 (%edx), %rdi

eliminating the CALL/RET, but more importantly, fixing the
fact that all WARNs will have:

  RIP: 0010:__WARN_trap+0

Basically, by making it a static_call trampoline call, objtool will
collect the callsites, and then the inline rewrite will hit the
special case and replace the code with the magic instruction.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251110115758.456717741@infradead.org
2025-11-24 20:23:25 +01:00
Peter Zijlstra
11bb4944f0 x86/bug: Implement WARN_ONCE()
Implement WARN_ONCE like WARN using BUGFLAG_ONCE.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251110115758.339309119@infradead.org
2025-11-24 20:23:25 +01:00
Peter Zijlstra
5b472b6e5b x86_64/bug: Implement __WARN_printf()
The basic idea is to have __WARN_printf() be a vararg function such
that the compiler can do the optimal calling convention for us. This
function body will be a #UD and then set up a va_list in the exception
from pt_regs.

But because the trap will be in a called function, the bug_entry must
be passed in. Have that be the first argument, with the format tucked
away inside the bug_entry.

The comments should clarify the real fun details.

The big downside is that all WARNs will now show:

 RIP: 0010:__WARN_trap:+0

One possible solution is to simply discard the top frame when
unwinding. A follow up patch takes care of this slightly differently
by abusing the x86 static_call implementation.

This changes (with the next patches):

	WARN_ONCE(preempt_count() != 2*PREEMPT_DISABLE_OFFSET,
		  "corrupted preempt_count: %s/%d/0x%x\n",

from:

	cmpl    $2, %ecx	#, _7
        jne     .L1472
	...

  .L1472:
        cmpb    $0, __already_done.11(%rip)
        je      .L1513
	...

  .L1513
	movb    $1, __already_done.11(%rip)
	movl    1424(%r14), %edx        # _15->pid, _15->pid
        leaq    1912(%r14), %rsi        #, _17
        movq    $.LC43, %rdi    #,
        call    __warn_printk   #
	ud2
  .pushsection __bug_table,"aw"
        2:
        .long 1b - .    # bug_entry::bug_addr
        .long .LC1 - .  # bug_entry::file
        .word 5093      # bug_entry::line
        .word 2313      # bug_entry::flags
        .org 2b + 12
  .popsection
  .pushsection .discard.annotate_insn,"M", @progbits, 8
        .long 1b - .
        .long 8         # ANNOTYPE_REACHABLE
  .popsection

into:

	cmpl    $2, %ecx        #, _7
        jne     .L1442  #,
	...

  .L1442:
        lea (2f)(%rip), %rdi
  1:
  .pushsection __bug_table,"aw"
        2:
        .long 1b - .    # bug_entry::bug_addr
        .long .LC43 - . # bug_entry::format
        .long .LC1 - .  # bug_entry::file
        .word 5093      # bug_entry::line
        .word 2323      # bug_entry::flags
        .org 2b + 16
  .popsection
        movl    1424(%r14), %edx        # _19->pid, _19->pid
        leaq    1912(%r14), %rsi        #, _13
	ud1 (%edx), %rdi

Notably, by pushing everything into the exception handler it can take
care of the ONCE thing.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251110115758.213813530@infradead.org
2025-11-24 20:23:05 +01:00
Peter Zijlstra
4f1b701f24 x86/bug: Use BUG_FORMAT for DEBUG_BUGVERBOSE_DETAILED
Since we have an explicit format string, use it for the condition string
instead of frobbing it in the file string.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251110115758.097401406@infradead.org
2025-11-24 20:22:21 +01:00
Peter Zijlstra
0a52d339d3 x86/bug: Add BUG_FORMAT basics
Opt-in to BUG_FORMAT for x86_64, adjust the BUGTABLE helper and for
now, just store NULL pointers.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251110115757.980264454@infradead.org
2025-11-24 20:22:11 +01:00
Babu Moger
3767def18f x86/cpufeatures: Add support for L3 Smart Data Cache Injection Allocation Enforcement
Smart Data Cache Injection (SDCI) is a mechanism that enables direct insertion
of data from I/O devices into the L3 cache. By directly caching data from I/O
devices rather than first storing the I/O data in DRAM, SDCI reduces demands on
DRAM bandwidth and reduces latency to the processor consuming the I/O data.

The SDCIAE (SDCI Allocation Enforcement) PQE feature allows system software to
control the portion of the L3 cache used for SDCI.

When enabled, SDCIAE forces all SDCI lines to be placed into the L3 cache
partitions identified by the highest-supported L3_MASK_n register, where n is
the maximum supported CLOSID.

Add CPUID feature bit that can be used to configure SDCIAE.

The SDCIAE feature details are documented in:

  AMD64 Architecture Programmer's Manual Volume 2: System Programming
     Publication # 24593 Revision 3.41 section 19.4.7 L3 Smart Data Cache
    Injection Allocation Enforcement (SDCIAE).

available at https://bugzilla.kernel.org/show_bug.cgi?id=206537

Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://patch.msgid.link/83ca10d981c48e86df2c3ad9658bb3ba3544c763.1762995456.git.babu.moger@amd.com
2025-11-21 22:03:07 +01:00
Peter Zijlstra
1be1fac648 x86: Rework __bug_table helpers
Rework the __bug_table helpers such that extension becomes easier.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251110115757.111187573@infradead.org
2025-11-21 11:21:30 +01:00