mirror of
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
synced 2026-05-17 23:37:10 -04:00
e5fd091663f82ee6c36cffd4bb93fcee9fc644c3
9357 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
40286d6379 |
Merge tag 'pci-v7.1-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci
Pull pci updates from Bjorn Helgaas:
"Enumeration:
- Allow TLP Processing Hints to be enabled for RCiEPs (George Abraham
P)
- Enable AtomicOps only if we know the Root Port supports them (Gerd
Bayer)
- Don't enable AtomicOps for RCiEPs since none of them need Atomic
Ops and we can't tell whether the Root Complex would support them
(Gerd Bayer)
- Leave Precision Time Measurement disabled until a driver enables it
to avoid PCIe errors (Mika Westerberg)
- Make pci_set_vga_state() fail if bridge doesn't support VGA
routing, i.e., PCI_BRIDGE_CTL_VGA is not writable, and return
errors to vga_get() callers including userspace via
/dev/vga_arbiter (Simon Richter)
- Validate max-link-speed from DT in j721e, brcmstb, mediatek-gen3,
rzg3s drivers (where the actual controller constraints are known),
and remove validation from the generic OF DT accessor (Hans Zhang)
- Remove pc110pad driver (no longer useful after 486 CPU support
removed) and no_pci_devices() (pc110pad was the last user) (Dmitry
Torokhov, Heiner Kallweit)
Resource management:
- Prevent assigning space to unimplemented bridge windows; previously
we mistakenly assumed prefetchable window existed and assigned
space and put a BAR there (Ahmed Naseef)
- Avoid shrinking bridge windows to fit in the initial Root Port
window; fixes one problem with devices with large BARs connected
via switches, e.g., Thunderbolt (Ilpo Järvinen)
- Pass full extent of empty space, not just the aligned space, to
resource_alignf callback so free space before the requested
alignment can be used (Ilpo Järvinen)
- Place small resources before larger ones for better utilization of
address space (Ilpo Järvinen)
- Fix alignment calculation for resource size larger than align,
e.g., bridge windows larger than the 1MB required alignment (Ilpo
Järvinen)
Reset:
- Update slot handling so all ARI functions are treated as being in
the same slot. They're all reset by Secondary Bus Reset, but
previously drivers of ARI functions that appeared to be on a
non-zero device weren't notified and fatal hardware errors could
result (Keith Busch)
- Make sysfs reset_subordinate hotplug safe to avoid spurious hotplug
events (Keith Busch)
- Hide Secondary Bus Reset ('bus') from sysfs reset_methods if masked
by CXL because it has no effect (Vidya Sagar)
- Avoid FLR for AMD NPU device, where it causes the device to hang
(Lizhi Hou)
Error handling:
- Clear only error bits in PCIe Device Status to avoid accidentally
clearing Emergency Power Reduction Detected (Shuai Xue)
- Check for AER errors even in devices without drivers (Lukas Wunner)
- Initialize ratelimit info so DPC and EDR paths log AER error
information (Kuppuswamy Sathyanarayanan)
Power control:
- Add UPD720201/UPD720202 USB 3.0 xHCI Host Controller .compatible so
generic pwrctrl driver can control it (Neil Armstrong)
Hotplug:
- Set LED_HW_PLUGGABLE for NPEM hotplug-capable ports so LED core
doesn't complain when setting brightness fails because the endpoint
is gone (Richard Cheng)
Peer-to-peer DMA:
- Allow wildcards in list of host bridges that support peer-to-peer
DMA between hierarchy domains and add all Google SoCs (Jacob
Moroni)
Endpoint framework:
- Advertise dynamic inbound mapping support in pci-epf-test and
update host pci_endpoint_test to skip doorbell testing if not
advertised by endpoint (Koichiro Den)
- Return 0, not remaining timeout, when MHI eDMA ops complete so
mhi_ep_ring_add_element() doesn't interpret non-zero as failure
(Daniel Hodges)
- Remove vntb and ntb duplicate resource teardown that leads to oops
when .allow_link() fails or .drop_link() is called (Koichiro Den)
- Disable vntb delayed work before clearing BAR mappings and
doorbells to avoid oops caused by doing the work after resources
have been torn down (Koichiro Den)
- Add a way to describe reserved subregions within BARs, e.g.,
platform-owned fixed register windows, and use it for the RK3588
BAR4 DMA ctrl window (Koichiro Den)
- Add BAR_DISABLED for BARs that will never be available to an EPF
driver, and change some BAR_RESERVED annotations to BAR_DISABLED
(Niklas Cassel)
- Add NTB .get_dma_dev() callback for cases where DMA API requires a
different device, e.g., vNTB devices (Koichiro Den)
- Add reserved region types for MSI-X Table and PBA so Endpoint
controllers can them as describe hardware-owned regions in a
BAR_RESERVED BAR (Manikanta Maddireddy)
- Make Tegra194/234 BAR0 programmable and remove 1MB size limit
(Manikanta Maddireddy)
- Expose Tegra BAR2 (MSI-X) and BAR4 (DMA) as 64-bit BAR_RESERVED
(Manikanta Maddireddy)
- Add Tegra194 and Tegra234 device table entries to pci_endpoint_test
(Manikanta Maddireddy)
- Skip the BAR subrange selftest if there are not enough inbound
window resources to run the test (Christian Bruel)
New native PCIe controller drivers:
- Add DT binding and driver for Andes QiLai SoC PCIe host controller
(Randolph Lin)
- Add DT binding and driver for ESWIN PCIe Root Complex (Senchuan
Zhang)
Baikal T-1 PCIe controller driver:
- Remove driver since it never quite became usable (Andy Shevchenko)
Cadence PCIe controller driver:
- Implement byte/word config reads with dword (32-bit) reads because
some Cadence controllers don't support sub-dword accesses (Aksh
Garg)
CIX Sky1 PCIe controller driver:
- Add 'power-domains' to DT binding for SCMI power domain (Gary Yang)
Freescale i.MX6 PCIe controller driver:
- Add i.MX94 and i.MX943 to fsl,imx6q-pcie-ep DT binding (Richard
Zhu)
- Delay instead of polling for L2/L3 Ready after PME_Turn_off when
suspending i.MX6SX because LTSSM registers are inaccessible
(Richard Zhu)
- Separate PERST# assertion (for resetting endpoints) from core reset
(for resetting the RC itself) to prepare for new DTs with PERST#
GPIO in per-Root Port nodes (Sherry Sun)
- Retain Root Port MSI capability on i.MX7D, i.MX8MM, and i.MX8MQ so
MSI from downstream devices will work (Richard Zhu)
- Fix i.MX95 reference clock source selection when internal refclk is
used (Franz Schnyder)
Freescale Layerscape PCIe controller driver:
- Allow building as a removable module (Sascha Hauer)
MediaTek PCIe Gen3 controller driver:
- Use dev_err_probe() to simplify error paths and make deferred probe
messages visible in /sys/kernel/debug/devices_deferred (Chen-Yu
Tsai)
- Power off device if setup fails (Chen-Yu Tsai)
- Integrate new pwrctrl API to enable power control for WiFi/BT
adapters on mainboard or in PCIe or M.2 slots (Chen-Yu Tsai)
NVIDIA Tegra194 PCIe controller driver:
- Poll less aggressively and non-atomically for PME_TO_Ack during
transition to L2 (Vidya Sagar)
- Disable LTSSM after transition to Detect on surprise link down to
stop toggling between Polling and Detect (Manikanta Maddireddy)
- Don't force the device into the D0 state before L2 when suspending
or shutting down the controller (Vidya Sagar)
- Disable PERST# IRQ only in Endpoint mode because it's not
registered in Root Port mode (Manikanta Maddireddy)
- Handle 'nvidia,refclk-select' as optional (Vidya Sagar)
- Disable direct speed change in Endpoint mode so link speed change
is controlled by the host (Vidya Sagar)
- Set LTR values before link up to avoid bogus LTR messages with 0
latency (Vidya Sagar)
- Allow system suspend when the Endpoint link is down (Vidya Sagar)
- Use DWC IP core version, not Tegra custom values, to avoid DWC core
version check warnings (Manikanta Maddireddy)
- Apply ECRC workaround to devices based on DesignWare 5.00a as well
as 4.90a (Manikanta Maddireddy)
- Disable PM Substate L1.2 in Endpoint mode to work around Tegra234
erratum (Vidya Sagar)
- Delay post-PERST# cleanup until core is powered on to avoid CBB
timeout (Manikanta Maddireddy)
- Assert CLKREQ# so switches that forward it to their downstream side
can bring up those links successfully (Vidya Sagar)
- Calibrate pipe to UPHY for Endpoint mode to reset stale PLL state
from any previous bad link state (Vidya Sagar)
- Remove IRQF_ONESHOT flag from Endpoint interrupt registration so
DMA driver and Endpoint controller driver can share the interrupt
line (Vidya Sagar)
- Enable DMA interrupt to support DMA in both Root Port and Endpoint
modes (Vidya Sagar)
- Enable hardware link retraining after link goes down in Endpoint
mode (Vidya Sagar)
- Add DT binding and driver support for core clock monitoring (Vidya
Sagar)
Qualcomm PCIe controller driver:
- Advertise 'Hot-Plug Capable' and set 'No Command Completed Support'
since Qcom Root Ports support hotplug events like DL_Up/Down and
can accept writes to Slot Control without delays between writes
(Krishna Chaitanya Chundru)
Renesas R-Car PCIe controller driver:
- Mark Endpoint BAR0 and BAR2 as Resizable (Koichiro Den)
- Reduce EPC BAR alignment requirement to 4K (Koichiro Den)
Renesas RZ/G3S PCIe controller driver:
- Add RZ/G3E to DT binding and to driver (John Madieu)
- Assert (not deassert) resets in probe error path (John Madieu)
- Assert resets in suspend path in reverse order they were deasserted
during probe (John Madieu)
- Rework inbound window algorithm to prevent mapping more than
intended region and enforce alignment on size, to prepare for
RZ/G3E support (John Madieu)
Rockchip DesignWare PCIe controller driver:
- Add tracepoints for PCIe controller LTSSM transitions and link rate
changes (Shawn Lin)
- Trace LTSSM events collected by the dw-rockchip debug FIFO (Shawn
Lin)
SOPHGO PCIe controller driver:
- Disable ASPM L0s and L1 on Sophgo 2042 PCIe Root Ports that
advertise support for them (Yao Zi)
Synopsys DesignWare PCIe controller driver:
- Continue with system suspend even if an Endpoint doesn't respond
with PME_TO_Ack message (Manivannan Sadhasivam)
- Set Endpoint MSI-X Table Size in the correct function of a
multi-function device when configuring MSI-X, not in Function 0
(Aksh Garg)
- Set Max Link Width and Max Link Speed for all functions of a
multi-function device, not just Function 0 (Aksh Garg)
- Expose PCIe event counters in groups 5-7 in debugfs (Hans Zhang)
Miscellaneous:
- Warn only once about invalid ACS kernel parameter format (Richard
Cheng)
- Suppress FW_BUG warning when writing sysfs 'numa_node' with the
current value (Li RongQing)
- Drop redundant 'depends on PCI' from Kconfig (Julian Braha)"
* tag 'pci-v7.1-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci: (165 commits)
PCI/P2PDMA: Add Google SoCs to the P2P DMA host bridge list
PCI/P2PDMA: Allow wildcard Device IDs in host bridge list
PCI: sg2042: Avoid L0s and L1 on Sophgo 2042 PCIe Root Ports
PCI: cadence: Add flags for disabling ASPM capability for broken Root Ports
PCI: tegra194: Add core monitor clock support
dt-bindings: PCI: tegra194: Add monitor clock support
PCI: tegra194: Enable hardware hot reset mode in Endpoint mode
PCI: tegra194: Enable DMA interrupt
PCI: tegra194: Remove IRQF_ONESHOT flag during Endpoint interrupt registration
PCI: tegra194: Calibrate pipe to UPHY for Endpoint mode
PCI: tegra194: Assert CLKREQ# explicitly by default
PCI: tegra194: Fix CBB timeout caused by DBI access before core power-on
PCI: tegra194: Disable L1.2 capability of Tegra234 EP
PCI: dwc: Apply ECRC workaround to DesignWare 5.00a as well
PCI: tegra194: Use DWC IP core version
PCI: tegra194: Free up Endpoint resources during remove()
PCI: tegra194: Allow system suspend when the Endpoint link is not up
PCI: tegra194: Set LTR message request before PCIe link up in Endpoint mode
PCI: tegra194: Disable direct speed change for Endpoint mode
PCI: tegra194: Use devm_gpiod_get_optional() to parse "nvidia,refclk-select"
...
|
||
|
|
3203a08c12 |
Merge tag 'powerpc-7.1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates from Madhavan Srinivasan: - powerpc support for huge pfnmaps - Cleanups to use masked user access - Rework pnv_ioda_pick_m64_pe() to use better bitmap API - Convert powerpc to AUDIT_ARCH_COMPAT_GENERIC - Backup region offset update to eflcorehdr - Fixes for wii/ps3 platform - Implement JIT support for private stack in powerpc - Implement JIT support for fsession in powerpc64 trampoline - Add support for instruction array and indirect jump in powerpc - Misc selftest fixes and cleanups Thanks to Abhishek Dubey, Aditya Gupta, Alex Williamson, Amit Machhiwal, Andrew Donnellan, Bartosz Golaszewski, Cédric Le Goater, Chen Ni, Christophe Leroy (CS GROUP), Hari Bathini, J. Neuschäfer, Mukesh Kumar Chaurasiya (IBM), Nam Cao, Nilay Shroff, Pavithra Prakash, Randy Dunlap, Ritesh Harjani (IBM), Shrikanth Hegde, Sourabh Jain, Vaibhav Jain, Venkat Rao Bagalkote, and Yury Norov (NVIDIA) * tag 'powerpc-7.1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (47 commits) mailmap: Add entry for Andrew Donnellan powerpc32/bpf: fix loading fsession func metadata using PPC_LI32 selftest/bpf: Enable gotox tests for powerpc64 powerpc64/bpf: Add support for indirect jump selftest/bpf: Enable instruction array test for powerpc powerpc/bpf: Add support for instruction array powerpc32/bpf: Add fsession support powerpc64/bpf: Implement fsession support selftests/bpf: Enable private stack tests for powerpc64 powerpc64/bpf: Implement JIT support for private stack powerpc: pci-ioda: Optimize pnv_ioda_pick_m64_pe() powerpc: pci-ioda: use bitmap_alloc() in pnv_ioda_pick_m64_pe() powerpc/net: Inline checksum wrappers and convert to scoped user access powerpc/sstep: Convert to scoped user access powerpc/align: Convert emulate_spe() to scoped user access powerpc/ptrace: Convert gpr32_set_common_user() to scoped user access powerpc/futex: Use masked user access powerpc/audit: Convert powerpc to AUDIT_ARCH_COMPAT_GENERIC cpuidle: powerpc: avoid double clear when breaking snooze powerpc/ps3: spu.c: fix enum and Return kernel-doc warnings ... |
||
|
|
f21f7b5162 |
Merge tag 'timers-vdso-2026-04-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull vdso updates from Thomas Gleixner: - Make the handling of compat functions consistent and more robust - Rework the underlying data store so that it is dynamically allocated, which allows the conversion of the last holdout SPARC64 to the generic VDSO implementation - Rework the SPARC64 VDSO to utilize the generic implementation - Mop up the left overs of the non-generic VDSO support in the core code - Expand the VDSO selftest and make them more robust - Allow time namespaces to be enabled independently of the generic VDSO support, which was not possible before due to SPARC64 not using it - Various cleanups and improvements in the related code * tag 'timers-vdso-2026-04-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (51 commits) timens: Use task_lock guard in timens_get*() timens: Use mutex guard in proc_timens_set_offset() timens: Simplify some calls to put_time_ns() timens: Add a __free() wrapper for put_time_ns() timens: Remove dependency on the vDSO vdso/timens: Move functions to new file selftests: vDSO: vdso_test_correctness: Add a test for time() selftests: vDSO: vdso_test_correctness: Use facilities from parse_vdso.c selftests: vDSO: vdso_test_correctness: Handle different tv_usec types selftests: vDSO: vdso_test_correctness: Drop SYS_getcpu fallbacks selftests: vDSO: vdso_test_gettimeofday: Remove nolibc checks Revert "selftests: vDSO: parse_vdso: Use UAPI headers instead of libc headers" random: vDSO: Remove ifdeffery random: vDSO: Trim vDSO includes vdso/datapage: Trim down unnecessary includes vdso/datapage: Remove inclusion of gettimeofday.h vdso/helpers: Explicitly include vdso/processor.h vdso/gettimeofday: Add explicit includes random: vDSO: Add explicit includes MIPS: vdso: Explicitly include asm/vdso/vdso.h ... |
||
|
|
d568788baa |
Merge tag 'hardening-v7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull hardening updates from Kees Cook: - randomize_kstack: Improve implementation across arches (Ryan Roberts) - lkdtm/fortify: Drop unneeded FORTIFY_STR_OBJECT test - refcount: Remove unused __signed_wrap function annotations * tag 'hardening-v7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: lkdtm/fortify: Drop unneeded FORTIFY_STR_OBJECT test refcount: Remove unused __signed_wrap function annotations randomize_kstack: Unify random source across arches randomize_kstack: Maintain kstack_offset per task |
||
|
|
ef3da345cc |
Merge tag 'vfs-7.1-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull misc vfs updates from Christian Brauner:
"Features:
- coredump: add tracepoint for coredump events
- fs: hide file and bfile caches behind runtime const machinery
Fixes:
- fix architecture-specific compat_ftruncate64 implementations
- dcache: Limit the minimal number of bucket to two
- fs/omfs: reject s_sys_blocksize smaller than OMFS_DIR_START
- fs/mbcache: cancel shrink work before destroying the cache
- dcache: permit dynamic_dname()s up to NAME_MAX
Cleanups:
- remove or unexport unused fs_context infrastructure
- trivial ->setattr cleanups
- selftests/filesystems: Assume that TIOCGPTPEER is defined
- writeback: fix kernel-doc function name mismatch for wb_put_many()
- autofs: replace manual symlink buffer allocation in autofs_dir_symlink
- init/initramfs.c: trivial fix: FSM -> Finite-state machine
- fs: remove stale and duplicate forward declarations
- readdir: Introduce dirent_size()
- fs: Replace user_access_{begin/end} by scoped user access
- kernel: acct: fix duplicate word in comment
- fs: write a better comment in step_into() concerning .mnt assignment
- fs: attr: fix comment formatting and spelling issues"
* tag 'vfs-7.1-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (28 commits)
dcache: permit dynamic_dname()s up to NAME_MAX
fs: attr: fix comment formatting and spelling issues
fs: hide file and bfile caches behind runtime const machinery
fs: write a better comment in step_into() concerning .mnt assignment
proc: rename proc_notify_change to proc_setattr
proc: rename proc_setattr to proc_nochmod_setattr
affs: rename affs_notify_change to affs_setattr
adfs: rename adfs_notify_change to adfs_setattr
hfs: update comments on hfs_inode_setattr
kernel: acct: fix duplicate word in comment
fs: Replace user_access_{begin/end} by scoped user access
readdir: Introduce dirent_size()
coredump: add tracepoint for coredump events
fs: remove do_sys_truncate
fs: pass on FTRUNCATE_* flags to do_truncate
fs: fix archiecture-specific compat_ftruncate64
fs: remove stale and duplicate forward declarations
init/initramfs.c: trivial fix: FSM -> Finite-state machine
autofs: replace manual symlink buffer allocation in autofs_dir_symlink
fs/mbcache: cancel shrink work before destroying the cache
...
|
||
|
|
328335a794 |
powerpc/powernv/iommu: iommu incorrectly bypass DMA APIs
In a PowerNV environment, for devices that supports DMA mask less than
64 bit but larger than 32 bits, iommu is incorrectly bypassing DMA
APIs while allocating and mapping buffers for DMA operations.
Devices are failing with ENOMEN during probe with the following messages
amdgpu 0000:01:00.0: [drm] Detected VRAM RAM=4096M, BAR=4096M
amdgpu 0000:01:00.0: [drm] RAM width 128bits GDDR5
amdgpu 0000:01:00.0: iommu: 64-bit OK but direct DMA is limited by 0
amdgpu 0000:01:00.0: dma_iommu_get_required_mask: returning bypass mask 0xfffffffffffffff
amdgpu 0000:01:00.0: 4096M of VRAM memory ready
amdgpu 0000:01:00.0: 32570M of GTT memory ready.
amdgpu 0000:01:00.0: (-12) failed to allocate kernel bo
amdgpu 0000:01:00.0: [drm] Debug VRAM access will use slowpath MM access
amdgpu 0000:01:00.0: [drm] GART: num cpu pages 4096, num gpu pages 65536
amdgpu 0000:01:00.0: [drm] PCIE GART of 256M enabled (table at 0x000000F4FFF80000).
amdgpu 0000:01:00.0: (-12) failed to allocate kernel bo
amdgpu 0000:01:00.0: (-12) create WB bo failed
amdgpu 0000:01:00.0: amdgpu_device_wb_init failed -12
amdgpu 0000:01:00.0: amdgpu_device_ip_init failed
amdgpu 0000:01:00.0: Fatal error during GPU init
amdgpu 0000:01:00.0: finishing device.
amdgpu 0000:01:00.0: probe with driver amdgpu failed with error -12
amdgpu 0000:01:00.0: ttm finalized
Fixes:
|
||
|
|
bf53ede003 |
powerpc/align: Convert emulate_spe() to scoped user access
Commit
|
||
|
|
679fa9c756 |
powerpc/ptrace: Convert gpr32_set_common_user() to scoped user access
Commit
|
||
|
|
f26ad12356 |
powerpc/audit: Convert powerpc to AUDIT_ARCH_COMPAT_GENERIC
Commit |
||
|
|
7695a4e12e |
powerpc: kgdb: fix kernel-doc warnings
Remove empty comment line at the beginning of a kernel-doc function
block. Add a "Return:" section for this function.
These changes prevent 2 kernel-doc warnings:
Warning: ../arch/powerpc/kernel/kgdb.c:103 Cannot find identifier on line:
*
Warning: kgdb.c:113 No description found for return value of 'kgdb_skipexception'
Fixes:
|
||
|
|
9036bd0efc |
PCI: Align head space better
When a bridge window contains big and small resource(s), the small resource(s) may not amount to the half of the size of the big resource which would allow calculate_head_align() to shrink the head alignment. This results in always placing the small resource(s) after the big resource. In general, it would be good to be able to place the small resource(s) before the big resource to achieve better utilization of the address space. In the cases where the large resource can only fit at the end of the window, it is even required. However, carrying the information over from pbus_size_mem() and calculate_head_align() to __pci_assign_resource() and pcibios_align_resource() is not easy with the current data structures. A somewhat hacky way to move the non-aligning tail part to the head is possible within pcibios_align_resource(). The free space between the start of the free space span and the aligned start address can be compared with the non-aligning remainder of the size. If the free space is larger than the remainder, placing the remainder before the start address is possible. This relocation should generally work, because PCI resources consist only power-of-2 atoms. Various arch requirements may still need to override the relocation, so the relocation is only applied selectively in such cases. Closes: https://bugzilla.kernel.org/show_bug.cgi?id=221205 Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Tested-by: Xifer <xiferdev@gmail.com> Link: https://patch.msgid.link/20260324165633.4583-10-ilpo.jarvinen@linux.intel.com |
||
|
|
f699bcc8bc |
resource: Pass full extent of empty space to resource_alignf callback
__find_resource_space() calculates the full extent of empty space but only passes the aligned space to resource_alignf callback. In some situations, the callback may choose take advantage of the free space before the requested alignment. Pass the full extent of the calculated empty space to resource_alignf callback as an additional parameter. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Tested-by: Xifer <xiferdev@gmail.com> Link: https://patch.msgid.link/20260324165633.4583-3-ilpo.jarvinen@linux.intel.com |
||
|
|
a96ef5848c |
randomize_kstack: Unify random source across arches
Previously different architectures were using random sources of
differing strength and cost to decide the random kstack offset. A number
of architectures (loongarch, powerpc, s390, x86) were using their
timestamp counter, at whatever the frequency happened to be. Other
arches (arm64, riscv) were using entropy from the crng via
get_random_u16().
There have been concerns that in some cases the timestamp counters may
be too weak, because they can be easily guessed or influenced by user
space. And get_random_u16() has been shown to be too costly for the
level of protection kstack offset randomization provides.
So let's use a common, architecture-agnostic source of entropy; a
per-cpu prng, seeded at boot-time from the crng. This has a few
benefits:
- We can remove choose_random_kstack_offset(); That was only there to
try to make the timestamp counter value a bit harder to influence
from user space [*].
- The architecture code is simplified. All it has to do now is call
add_random_kstack_offset() in the syscall path.
- The strength of the randomness can be reasoned about independently
of the architecture.
- Arches previously using get_random_u16() now have much faster
syscall paths, see below results.
[*] Additionally, this gets rid of some redundant work on s390 and x86.
Before this patch, those architectures called
choose_random_kstack_offset() under arch_exit_to_user_mode_prepare(),
which is also called for exception returns to userspace which were *not*
syscalls (e.g. regular interrupts). Getting rid of
choose_random_kstack_offset() avoids a small amount of redundant work
for the non-syscall cases.
In some configurations, add_random_kstack_offset() will now call
instrumentable code, so for a couple of arches, I have moved the call a
bit later to the first point where instrumentation is allowed. This
doesn't impact the efficacy of the mechanism.
There have been some claims that a prng may be less strong than the
timestamp counter if not regularly reseeded. But the prng has a period
of about 2^113. So as long as the prng state remains secret, it should
not be possible to guess. If the prng state can be accessed, we have
bigger problems.
Additionally, we are only consuming 6 bits to randomize the stack, so
there are only 64 possible random offsets. I assert that it would be
trivial for an attacker to brute force by repeating their attack and
waiting for the random stack offset to be the desired one. The prng
approach seems entirely proportional to this level of protection.
Performance data are provided below. The baseline is v6.18 with rndstack
on for each respective arch. (I)/(R) indicate statistically significant
improvement/regression. arm64 platform is AWS Graviton3 (m7g.metal).
x86_64 platform is AWS Sapphire Rapids (m7i.24xlarge):
+-----------------+--------------+---------------+---------------+
| Benchmark | Result Class | per-cpu-prng | per-cpu-prng |
| | | arm64 (metal) | x86_64 (VM) |
+=================+==============+===============+===============+
| syscall/getpid | mean (ns) | (I) -9.50% | (I) -17.65% |
| | p99 (ns) | (I) -59.24% | (I) -24.41% |
| | p99.9 (ns) | (I) -59.52% | (I) -28.52% |
+-----------------+--------------+---------------+---------------+
| syscall/getppid | mean (ns) | (I) -9.52% | (I) -19.24% |
| | p99 (ns) | (I) -59.25% | (I) -25.03% |
| | p99.9 (ns) | (I) -59.50% | (I) -28.17% |
+-----------------+--------------+---------------+---------------+
| syscall/invalid | mean (ns) | (I) -10.31% | (I) -18.56% |
| | p99 (ns) | (I) -60.79% | (I) -20.06% |
| | p99.9 (ns) | (I) -61.04% | (I) -25.04% |
+-----------------+--------------+---------------+---------------+
I tested an earlier version of this change on x86 bare metal and it
showed a smaller but still significant improvement. The bare metal
system wasn't available this time around so testing was done in a VM
instance. I'm guessing the cost of rdtsc is higher for VMs.
Acked-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Link: https://patch.msgid.link/20260303150840.3789438-3-ryan.roberts@arm.com
Signed-off-by: Kees Cook <kees@kernel.org>
|
||
|
|
e43dce8a0b |
fs: fix archiecture-specific compat_ftruncate64
The "small" argument to do_sys_ftruncate indicates if > 32-bit size should be reject, but all the arch-specific compat ftruncate64 implementations get this wrong. Merge do_sys_ftruncate and ksys_ftruncate, replace the integer as boolean small flag with a descriptive one about LFS semantics, and use it correctly in the architecture-specific ftruncate64 implementations. Fixes: |
||
|
|
07791ff060 |
powerpc: Print MMU_FTRS_POSSIBLE & MMU_FTRS_ALWAYS at startup
Similar to CPU_FTRS_[POSSIBLE|ALWAYS], let's also print MMU_FTRS_[POSSIBLE|ALWAYS]. This has some useful data to capture during bootup. Reviewed-by: Christophe Leroy (CS GROUP) <chleroy@kernel.org> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com> Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com> Link: https://patch.msgid.link/c37a9f314a723048d25aa5424f7ede8eec691d86.1773078178.git.ritesh.list@gmail.com |
||
|
|
4f3df2e5ea |
Merge tag 'powerpc-7.0-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fixes from Madhavan Srinivasan: - Fix KUAP warning in VMX usercopy path - Fix lockdep warning during PCI enumeration - Fix to move CMA reservations to arch_mm_preinit - Fix to check current->mm is alive before getting user callchain Thanks to Aboorva Devarajan, Christophe Leroy (CS GROUP), Dan Horák, Nicolin Chen, Nilay Shroff, Qiao Zhao, Ritesh Harjani (IBM), Saket Kumar Bhaskar, Sayali Patil, Shrikanth Hegde, Venkat Rao Bagalkote, and Viktor Malik. * tag 'powerpc-7.0-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: powerpc/iommu: fix lockdep warning during PCI enumeration powerpc/selftests/copyloops: extend selftest to exercise __copy_tofrom_user_power7_vmx powerpc: fix KUAP warning in VMX usercopy path powerpc, perf: Check that current->mm is alive before getting user callchain powerpc/mem: Move CMA reservations to arch_mm_preinit |
||
|
|
82f73ef9c4 |
powerpc/iommu: fix lockdep warning during PCI enumeration
Commit |
||
|
|
0a8321dde0 |
powerpc/mem: Move CMA reservations to arch_mm_preinit
commit |
||
|
|
2b8e3fac9b |
Merge tag 'powerpc-7.0-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fixes from Madhavan Srinivasan: - Correct MSI allocation tracking - Always use 64 bits PTE for powerpc/e500 - Fix inline assembly for clang build on PPC32 - Fixes for clang build issues in powerpc64/ftrace - Fixes for powerpc64/bpf JIT and tailcall support - Cleanup MPC83XX devicetrees - Fix keymile vendor prefix - Fix to use big-endian types for crash variables Thanks to Abhishek Dubey, Christophe Leroy (CS GROUP), Hari Bathini, Heiko Schocher, J. Neuschäfer, Mahesh Salgaonkar, Nam Cao, Nilay Shroff, Rob Herring (Arm), Saket Kumar Bhaskar, Sourabh Jain, Stan Johnson, and Venkat Rao Bagalkote. * tag 'powerpc-7.0-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (23 commits) powerpc/pseries: Correct MSI allocation tracking powerpc: dts: mpc83xx: Add unit addresses to /memory powerpc: dts: mpc8315erdb: Add missing #cells properties to SPI bus powerpc: dts: mpc8315erdb: Rename LED nodes to comply with schema powerpc: dts: mpc8315erdb: Use IRQ_TYPE_* macros powerpc: dts: mpc8313erdb: Use IRQ_TYPE_* macros powerpc: 83xx: km83xx: Fix keymile vendor prefix dt-bindings: powerpc: Add Freescale/NXP MPC83xx SoCs powerpc64/bpf: fix kfunc call support powerpc64/bpf: fix handling of BPF stack in exception callback powerpc64/bpf: remove BPF redzone protection in trampoline stack powerpc64/bpf: use consistent tailcall offset in trampoline powerpc64/bpf: fix the address returned by bpf_get_func_ip powerpc64/bpf: do not increment tailcall count when prog is NULL powerpc64/ftrace: workaround clang recording GEP in __patchable_function_entries powerpc64/ftrace: fix OOL stub count with clang powerpc64: make clang cross-build friendly powerpc/crash: adjust the elfcorehdr size powerpc/kexec/core: use big-endian types for crash variables powerpc/prom_init: Fixup missing #size-cells on PowerMac media-bay nodes ... |
||
|
|
9b444349a2 |
powerpc/audit: Directly include unistd_32.h from compat_audit.c
This source file undefines '__powerpc64__' to get the 32-bit system call numbers from asm/unistd.h. However this symbol is also evaluated by other headers, among them is asm/bitsperlong.h. The undefinition leads to an inconsistency between __BITS_PER_LONG and the C type 'long'. An upcoming consistency check will be tripped by this. Directly include asm/unistd_32.h to get access to the 32-bit system call numbers instead. Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Link: https://patch.msgid.link/20260302-vdso-compat-checkflags-v2-4-78e55baa58ba@linutronix.de |
||
|
|
db54c28702 |
powerpc64/ftrace: workaround clang recording GEP in __patchable_function_entries
Support for -fpatchable-function-entry on ppc64le was added in Clang with [1]. However, when no prefix NOPs are specified - as is the case with CONFIG_PPC_FTRACE_OUT_OF_LINE - the first NOP is emitted at LEP, but Clang records the Global Entry Point (GEP) unlike GCC which does record the Local Entry Point (LEP). Issue [2] has been raised to align Clang's behavior with GCC. As a temporary workaround to ensure ftrace initialization works as expected with Clang, derive the LEP using ppc_function_entry() for kernel symbols and by looking for the below module GEP sequence for module addresses, until [2] is resolved: ld r2, -8(r12) add r2, r2, r12 [1] https://github.com/llvm/llvm-project/pull/151569 [2] https://github.com/llvm/llvm-project/issues/163706 Signed-off-by: Hari Bathini <hbathini@linux.ibm.com> Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com> Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com> Link: https://patch.msgid.link/20260127084926.34497-4-hbathini@linux.ibm.com |
||
|
|
4ae12d8bd9 |
Merge tag 'kbuild-fixes-7.0-2' of git://git.kernel.org/pub/scm/linux/kernel/git/kbuild/linux
Pull Kbuild fixes from Nathan Chancellor:
- Split out .modinfo section from ELF_DETAILS macro, as that macro may
be used in other areas that expect to discard .modinfo, breaking
certain image layouts
- Adjust genksyms parser to handle optional attributes in certain
declarations, necessary after commit
|
||
|
|
6fc5d63c6f |
powerpc/prom_init: Fixup missing #size-cells on PowerMac media-bay nodes
Similar to other PowerMac mac-io devices, the media-bay node is missing the
"#size-cells" property.
Depends-on: commit
|
||
|
|
b9e7e3ea60 |
powerpc/e500: Always use 64 bits PTE
Today there are two PTE formats for e500: - The 64 bits format, used - On 64 bits kernel - On 32 bits kernel with 64 bits physical addresses - On 32 bits kernel with support of huge pages - The 32 bits format, used in other cases Maintaining two PTE formats means unnecessary maintenance burden because every change needs to be implemented and tested for both formats. Remove the 32 bits PTE format. The memory usage increase due to larger PTEs is minimal (approx. 0,1% of memory). This also means that from now on huge pages are supported also with 32 bits physical addresses. Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com> Link: https://patch.msgid.link/04a658209ea78dcc0f3dbde6b2c29cf1939adfe9.1767721208.git.chleroy@kernel.org |
||
|
|
2185904ff8 |
powerpc/pci: Initialize msi_addr_mask for OF-created PCI devices
Recent changes replaced the use of no_64bit_msi with msi_addr_mask. As a
result, msi_addr_mask is now expected to be initialized to DMA_BIT_MASK(64)
when a pci_dev is set up. However, this initialization was missed on
powerpc due to differences in the device initialization path compared to
other (x86) architecture. Due to this, now PCI device probe method fails on
powerpc system.
On powerpc systems, struct pci_dev instances are created from device tree
nodes via of_create_pci_dev(). Because msi_addr_mask was not initialized
there, it remained zero. Later, during MSI setup, msi_verify_entries()
validates the programmed MSI address against pdev->msi_addr_mask. Since the
mask was not set correctly, the validation fails, causing PCI driver probe
failures for devices on powerpc systems.
Initialize pdev->msi_addr_mask to DMA_BIT_MASK(64) in of_create_pci_dev()
so that MSI address validation succeeds and device probe works as expected.
Fixes:
|
||
|
|
8678591b47 |
kbuild: Split .modinfo out from ELF_DETAILS
Commit |
||
|
|
bf4afc53b7 |
Convert 'alloc_obj' family to use the new default GFP_KERNEL argument
This was done entirely with mindless brute force, using
git grep -l '\<k[vmz]*alloc_objs*(.*, GFP_KERNEL)' |
xargs sed -i 's/\(alloc_objs*(.*\), GFP_KERNEL)/\1)/'
to convert the new alloc_obj() users that had a simple GFP_KERNEL
argument to just drop that argument.
Note that due to the extreme simplicity of the scripting, any slightly
more complex cases spread over multiple lines would not be triggered:
they definitely exist, but this covers the vast bulk of the cases, and
the resulting diff is also then easier to check automatically.
For the same reason the 'flex' versions will be done as a separate
conversion.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
||
|
|
69050f8d6d |
treewide: Replace kmalloc with kmalloc_obj for non-scalar types
This is the result of running the Coccinelle script from scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to avoid scalar types (which need careful case-by-case checking), and instead replace kmalloc-family calls that allocate struct or union object instances: Single allocations: kmalloc(sizeof(TYPE), ...) are replaced with: kmalloc_obj(TYPE, ...) Array allocations: kmalloc_array(COUNT, sizeof(TYPE), ...) are replaced with: kmalloc_objs(TYPE, COUNT, ...) Flex array allocations: kmalloc(struct_size(PTR, FAM, COUNT), ...) are replaced with: kmalloc_flex(*PTR, FAM, COUNT, ...) (where TYPE may also be *VAR) The resulting allocations no longer return "void *", instead returning "TYPE *". Signed-off-by: Kees Cook <kees@kernel.org> |
||
|
|
136114e0ab |
Merge tag 'mm-nonmm-stable-2026-02-12-10-48' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull non-MM updates from Andrew Morton: - "ocfs2: give ocfs2 the ability to reclaim suballocator free bg" saves disk space by teaching ocfs2 to reclaim suballocator block group space (Heming Zhao) - "Add ARRAY_END(), and use it to fix off-by-one bugs" adds the ARRAY_END() macro and uses it in various places (Alejandro Colomar) - "vmcoreinfo: support VMCOREINFO_BYTES larger than PAGE_SIZE" makes the vmcore code future-safe, if VMCOREINFO_BYTES ever exceeds the page size (Pnina Feder) - "kallsyms: Prevent invalid access when showing module buildid" cleans up kallsyms code related to module buildid and fixes an invalid access crash when printing backtraces (Petr Mladek) - "Address page fault in ima_restore_measurement_list()" fixes a kexec-related crash that can occur when booting the second-stage kernel on x86 (Harshit Mogalapalli) - "kho: ABI headers and Documentation updates" updates the kexec handover ABI documentation (Mike Rapoport) - "Align atomic storage" adds the __aligned attribute to atomic_t and atomic64_t definitions to get natural alignment of both types on csky, m68k, microblaze, nios2, openrisc and sh (Finn Thain) - "kho: clean up page initialization logic" simplifies the page initialization logic in kho_restore_page() (Pratyush Yadav) - "Unload linux/kernel.h" moves several things out of kernel.h and into more appropriate places (Yury Norov) - "don't abuse task_struct.group_leader" removes the usage of ->group_leader when it is "obviously unnecessary" (Oleg Nesterov) - "list private v2 & luo flb" adds some infrastructure improvements to the live update orchestrator (Pasha Tatashin) * tag 'mm-nonmm-stable-2026-02-12-10-48' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (107 commits) watchdog/hardlockup: simplify perf event probe and remove per-cpu dependency procfs: fix missing RCU protection when reading real_parent in do_task_stat() watchdog/softlockup: fix sample ring index wrap in need_counting_irqs() kcsan, compiler_types: avoid duplicate type issues in BPF Type Format kho: fix doc for kho_restore_pages() tests/liveupdate: add in-kernel liveupdate test liveupdate: luo_flb: introduce File-Lifecycle-Bound global state liveupdate: luo_file: Use private list list: add kunit test for private list primitives list: add primitives for private list manipulations delayacct: fix uapi timespec64 definition panic: add panic_force_cpu= parameter to redirect panic to a specific CPU netclassid: use thread_group_leader(p) in update_classid_task() RDMA/umem: don't abuse current->group_leader drm/pan*: don't abuse current->group_leader drm/amd: kill the outdated "Only the pthreads threading model is supported" checks drm/amdgpu: don't abuse current->group_leader android/binder: use same_thread_group(proc->tsk, current) in binder_mmap() android/binder: don't abuse current->group_leader kho: skip memoryless NUMA nodes when reserving scratch areas ... |
||
|
|
4cff5c05e0 |
Merge tag 'mm-stable-2026-02-11-19-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
- "powerpc/64s: do not re-activate batched TLB flush" makes
arch_{enter|leave}_lazy_mmu_mode() nest properly (Alexander Gordeev)
It adds a generic enter/leave layer and switches architectures to use
it. Various hacks were removed in the process.
- "zram: introduce compressed data writeback" implements data
compression for zram writeback (Richard Chang and Sergey Senozhatsky)
- "mm: folio_zero_user: clear page ranges" adds clearing of contiguous
page ranges for hugepages. Large improvements during demand faulting
are demonstrated (David Hildenbrand)
- "memcg cleanups" tidies up some memcg code (Chen Ridong)
- "mm/damon: introduce {,max_}nr_snapshots and tracepoint for damos
stats" improves DAMOS stat's provided information, deterministic
control, and readability (SeongJae Park)
- "selftests/mm: hugetlb cgroup charging: robustness fixes" fixes a few
issues in the hugetlb cgroup charging selftests (Li Wang)
- "Fix va_high_addr_switch.sh test failure - again" addresses several
issues in the va_high_addr_switch test (Chunyu Hu)
- "mm/damon/tests/core-kunit: extend existing test scenarios" improves
the KUnit test coverage for DAMON (Shu Anzai)
- "mm/khugepaged: fix dirty page handling for MADV_COLLAPSE" fixes a
glitch in khugepaged which was causing madvise(MADV_COLLAPSE) to
transiently return -EAGAIN (Shivank Garg)
- "arch, mm: consolidate hugetlb early reservation" reworks and
consolidates a pile of straggly code related to reservation of
hugetlb memory from bootmem and creation of CMA areas for hugetlb
(Mike Rapoport)
- "mm: clean up anon_vma implementation" cleans up the anon_vma
implementation in various ways (Lorenzo Stoakes)
- "tweaks for __alloc_pages_slowpath()" does a little streamlining of
the page allocator's slowpath code (Vlastimil Babka)
- "memcg: separate private and public ID namespaces" cleans up the
memcg ID code and prevents the internal-only private IDs from being
exposed to userspace (Shakeel Butt)
- "mm: hugetlb: allocate frozen gigantic folio" cleans up the
allocation of frozen folios and avoids some atomic refcount
operations (Kefeng Wang)
- "mm/damon: advance DAMOS-based LRU sorting" improves DAMOS's movement
of memory betewwn the active and inactive LRUs and adds auto-tuning
of the ratio-based quotas and of monitoring intervals (SeongJae Park)
- "Support page table check on PowerPC" makes
CONFIG_PAGE_TABLE_CHECK_ENFORCED work on powerpc (Andrew Donnellan)
- "nodemask: align nodes_and{,not} with underlying bitmap ops" makes
nodes_and() and nodes_andnot() propagate the return values from the
underlying bit operations, enabling some cleanup in calling code
(Yury Norov)
- "mm/damon: hide kdamond and kdamond_lock from API callers" cleans up
some DAMON internal interfaces (SeongJae Park)
- "mm/khugepaged: cleanups and scan limit fix" does some cleanup work
in khupaged and fixes a scan limit accounting issue (Shivank Garg)
- "mm: balloon infrastructure cleanups" goes to town on the balloon
infrastructure and its page migration function. Mainly cleanups, also
some locking simplification (David Hildenbrand)
- "mm/vmscan: add tracepoint and reason for kswapd_failures reset" adds
additional tracepoints to the page reclaim code (Jiayuan Chen)
- "Replace wq users and add WQ_PERCPU to alloc_workqueue() users" is
part of Marco's kernel-wide migration from the legacy workqueue APIs
over to the preferred unbound workqueues (Marco Crivellari)
- "Various mm kselftests improvements/fixes" provides various unrelated
improvements/fixes for the mm kselftests (Kevin Brodsky)
- "mm: accelerate gigantic folio allocation" greatly speeds up gigantic
folio allocation, mainly by avoiding unnecessary work in
pfn_range_valid_contig() (Kefeng Wang)
- "selftests/damon: improve leak detection and wss estimation
reliability" improves the reliability of two of the DAMON selftests
(SeongJae Park)
- "mm/damon: cleanup kdamond, damon_call(), damos filter and
DAMON_MIN_REGION" does some cleanup work in the core DAMON code
(SeongJae Park)
- "Docs/mm/damon: update intro, modules, maintainer profile, and misc"
performs maintenance work on the DAMON documentation (SeongJae Park)
- "mm: add and use vma_assert_stabilised() helper" refactors and cleans
up the core VMA code. The main aim here is to be able to use the mmap
write lock's lockdep state to perform various assertions regarding
the locking which the VMA code requires (Lorenzo Stoakes)
- "mm, swap: swap table phase II: unify swapin use" removes some old
swap code (swap cache bypassing and swap synchronization) which
wasn't working very well. Various other cleanups and simplifications
were made. The end result is a 20% speedup in one benchmark (Kairui
Song)
- "enable PT_RECLAIM on more 64-bit architectures" makes PT_RECLAIM
available on 64-bit alpha, loongarch, mips, parisc, and um. Various
cleanups were performed along the way (Qi Zheng)
* tag 'mm-stable-2026-02-11-19-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (325 commits)
mm/memory: handle non-split locks correctly in zap_empty_pte_table()
mm: move pte table reclaim code to memory.c
mm: make PT_RECLAIM depends on MMU_GATHER_RCU_TABLE_FREE
mm: convert __HAVE_ARCH_TLB_REMOVE_TABLE to CONFIG_HAVE_ARCH_TLB_REMOVE_TABLE config
um: mm: enable MMU_GATHER_RCU_TABLE_FREE
parisc: mm: enable MMU_GATHER_RCU_TABLE_FREE
mips: mm: enable MMU_GATHER_RCU_TABLE_FREE
LoongArch: mm: enable MMU_GATHER_RCU_TABLE_FREE
alpha: mm: enable MMU_GATHER_RCU_TABLE_FREE
mm: change mm/pt_reclaim.c to use asm/tlb.h instead of asm-generic/tlb.h
mm/damon/stat: remove __read_mostly from memory_idle_ms_percentiles
zsmalloc: make common caches global
mm: add SPDX id lines to some mm source files
mm/zswap: use %pe to print error pointers
mm/vmscan: use %pe to print error pointers
mm/readahead: fix typo in comment
mm: khugepaged: fix NR_FILE_PAGES and NR_SHMEM in collapse_file()
mm: refactor vma_map_pages to use vm_insert_pages
mm/damon: unify address range representation with damon_addr_range
mm/cma: replace snprintf with strscpy in cma_new_area
...
|
||
|
|
192c015940 |
Merge tag 'powerpc-7.0-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates for 7.0 - Implement masked user access - Add bpf support for internal only per-CPU instructions and inline the bpf_get_smp_processor_id() and bpf_get_current_task() functions - Fix pSeries MSI-X allocation failure when quota is exceeded - Fix recursive pci_lock_rescan_remove locking in EEH event handling - Support tailcalls with subprogs & BPF exceptions on 64bit - Extend "trusted" keys to support the PowerVM Key Wrapping Module (PKWM) Thanks to Abhishek Dubey, Christophe Leroy, Gaurav Batra, Guangshuo Li, Jarkko Sakkinen, Mahesh Salgaonkar, Mimi Zohar, Miquel Sabaté Solà, Nam Cao, Narayana Murty N, Nayna Jain, Nilay Shroff, Puranjay Mohan, Saket Kumar Bhaskar, Sourabh Jain, Srish Srinivasan, and Venkat Rao Bagalkote. * tag 'powerpc-7.0-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (27 commits) powerpc/pseries: plpks: export plpks_wrapping_is_supported docs: trusted-encryped: add PKWM as a new trust source keys/trusted_keys: establish PKWM as a trusted source pseries/plpks: add HCALLs for PowerVM Key Wrapping Module pseries/plpks: expose PowerVM wrapping features via the sysfs powerpc/pseries: move the PLPKS config inside its own sysfs directory pseries/plpks: fix kernel-doc comment inconsistencies powerpc/smp: Add check for kcalloc() failure in parse_thread_groups() powerpc: kgdb: Remove OUTBUFMAX constant powerpc64/bpf: Additional NVR handling for bpf_throw powerpc64/bpf: Support exceptions powerpc64/bpf: Add arch_bpf_stack_walk() for BPF JIT powerpc64/bpf: Avoid tailcall restore from trampoline powerpc64/bpf: Support tailcalls with subprogs powerpc64/bpf: Moving tail_call_cnt to bottom of frame powerpc/eeh: fix recursive pci_lock_rescan_remove locking in EEH event handling powerpc/pseries: Fix MSI-X allocation failure when quota is exceeded powerpc/iommu: bypass DMA APIs for coherent allocations for pre-mapped memory powerpc64/bpf: Inline bpf_get_smp_processor_id() and bpf_get_current_task/_btf() powerpc64/bpf: Support internal-only MOV instruction to resolve per-CPU addrs ... |
||
|
|
f1c538ca81 |
Merge tag 'timers-vdso-2026-02-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull VDSO updates from Thomas Gleixner: - Provide the missing 64-bit variant of clock_getres() This allows the extension of CONFIG_COMPAT_32BIT_TIME to the vDSO and finally the removal of 32-bit time types from the kernel and UAPI. - Remove the useless and broken getcpu_cache from the VDSO The intention was to provide a trivial way to retrieve the CPU number from the VDSO, but as the VDSO data is per process there is no way to make it work. - Switch get/put_unaligned() from packed struct to memcpy() The packed struct violates strict aliasing rules which requires to pass -fno-strict-aliasing to the compiler. As this are scalar values __builtin_memcpy() turns them into simple loads and stores - Use __typeof_unqual__() for __unqual_scalar_typeof() The get/put_unaligned() changes triggered a new sparse warning when __beNN types are used with get/put_unaligned() as sparse builds add a special 'bitwise' attribute to them which prevents sparse to evaluate the Generic in __unqual_scalar_typeof(). Newer sparse versions support __typeof_unqual__() which avoids the problem, but requires a recent sparse install. So this adds a sanity check to sparse builds, which validates that sparse is available and capable of handling it. - Force inline __cvdso_clock_getres_common() Compilers sometimes un-inline agressively, which results in function call overhead and problems with automatic stack variable initialization. Interestingly enough the force inlining results in smaller code than the un-inlined variant produced by GCC when optimizing for size. * tag 'timers-vdso-2026-02-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: vdso/gettimeofday: Force inlining of __cvdso_clock_getres_common() x86/percpu: Make CONFIG_USE_X86_SEG_SUPPORT work with sparse compiler: Use __typeof_unqual__() for __unqual_scalar_typeof() powerpc/vdso: Provide clock_getres_time64() tools headers: Remove unneeded ignoring of warnings in unaligned.h tools headers: Update the linux/unaligned.h copy with the kernel sources vdso: Switch get/put_unaligned() from packed struct to memcpy() parisc: Inline a type punning version of get_unaligned_le32() vdso: Remove struct getcpu_cache MIPS: vdso: Provide getres_time64() for 32-bit ABIs arm64: vdso32: Provide clock_getres_time64() ARM: VDSO: Provide clock_getres_time64() ARM: VDSO: Patch out __vdso_clock_getres() if unavailable x86/vdso: Provide clock_getres_time64() for x86-32 selftests: vDSO: vdso_test_abi: Add test for clock_getres_time64() selftests: vDSO: vdso_test_abi: Use UAPI system call numbers selftests: vDSO: vdso_config: Add configurations for clock_getres_time64() vdso: Add prototype for __vdso_clock_getres_time64() |
||
|
|
3e4067169c |
Merge branch 'v6.19-rc8'
Update to avoid conflicts with /urgent patches. Signed-off-by: Peter Zijlstra <peterz@infradead.org> |
||
|
|
40850c909f |
powerpc/pseries: move the PLPKS config inside its own sysfs directory
The /sys/firmware/secvar/config directory represents Power LPAR Platform KeyStore (PLPKS) configuration properties such as max_object_size, signed_ update_algorithms, supported_policies, total_size, used_space, and version. These attributes describe the PLPKS, and not the secure boot variables (secvars). Create /sys/firmware/plpks directory and move the PLPKS config inside this directory. For backwards compatibility, create a soft link from the secvar sysfs directory to this config and emit a warning stating that the older sysfs path has been deprecated. Separate out the plpks specific documentation from secvar. Signed-off-by: Srish Srinivasan <ssrish@linux.ibm.com> Tested-by: Nayna Jain <nayna@linux.ibm.com> Reviewed-by: Mimi Zohar <zohar@linux.ibm.com> Reviewed-by: Nayna Jain <nayna@linux.ibm.com> Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com> Link: https://patch.msgid.link/20260127145228.48320-3-ssrish@linux.ibm.com |
||
|
|
33c1c6d8a2 |
powerpc/smp: Add check for kcalloc() failure in parse_thread_groups()
As kcalloc() may fail, check its return value to avoid a NULL pointer
dereference when passing it to of_property_read_u32_array().
Fixes:
|
||
|
|
9fac145b6d |
mm, arch: consolidate hugetlb CMA reservation
Every architecture that supports hugetlb_cma command line parameter reserves CMA areas for hugetlb during setup_arch(). This obfuscates the ordering of hugetlb CMA initialization with respect to the rest initialization of the core MM. Introduce arch_hugetlb_cma_order() callback to allow architectures report the desired order-per-bit of CMA areas and provide a week implementation of arch_hugetlb_cma_order() for architectures that don't support hugetlb with CMA. Use this callback in hugetlb_cma_reserve() instead if passing the order as parameter and call hugetlb_cma_reserve() from mm_core_init_early() rather than have it spread over architecture specific code. Link: https://lkml.kernel.org/r/20260111082105.290734-28-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alex Shi <alexs@kernel.org> Cc: Andreas Larsson <andreas@gaisler.com> Cc: "Borislav Petkov (AMD)" <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Hildenbrand <david@kernel.org> Cc: David S. Miller <davem@davemloft.net> Cc: Dinh Nguyen <dinguyen@kernel.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Guo Ren <guoren@kernel.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Klara Modin <klarasmodin@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Magnus Lindholm <linmag7@gmail.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Michal Simek <monstr@monstr.eu> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Pratyush Yadav <pratyush@kernel.org> Cc: Richard Weinberger <richard@nod.at> Cc: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Stafford Horne <shorne@gmail.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vineet Gupta <vgupta@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
99d2592023 |
rseq: Implement sys_rseq_slice_yield()
Provide a new syscall which has the only purpose to yield the CPU after the kernel granted a time slice extension. sched_yield() is not suitable for that because it unconditionally schedules, but the end of the time slice extension is not required to schedule when the task was already preempted. This also allows to have a strict check for termination to catch user space invoking random syscalls including sched_yield() from a time slice extension region. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Link: https://patch.msgid.link/20251215155708.929634896@linutronix.de |
||
|
|
24c776355f |
kernel.h: drop hex.h and update all hex.h users
Remove <linux/hex.h> from <linux/kernel.h> and update all users/callers of hex.h interfaces to directly #include <linux/hex.h> as part of the process of putting kernel.h on a diet. Removing hex.h from kernel.h means that 36K C source files don't have to pay the price of parsing hex.h for the roughly 120 C source files that need it. This change has been build-tested with allmodconfig on most ARCHes. Also, all users/callers of <linux/hex.h> in the entire source tree have been updated if needed (if not already #included). Link: https://lkml.kernel.org/r/20251215005206.2362276-1-rdunlap@infradead.org Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Reviewed-by: Andy Shevchenko <andriy.shevchenko@intel.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Yury Norov (NVIDIA) <yury.norov@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
58852f24f9 |
powerpc/64s: do not re-activate batched TLB flush
Patch series "Nesting support for lazy MMU mode", v6.
When the lazy MMU mode was introduced eons ago, it wasn't made clear
whether such a sequence was legal:
arch_enter_lazy_mmu_mode()
...
arch_enter_lazy_mmu_mode()
...
arch_leave_lazy_mmu_mode()
...
arch_leave_lazy_mmu_mode()
It seems fair to say that nested calls to
arch_{enter,leave}_lazy_mmu_mode() were not expected, and most
architectures never explicitly supported it.
Nesting does in fact occur in certain configurations, and avoiding it has
proved difficult. This series therefore enables lazy_mmu sections to
nest, on all architectures.
Nesting is handled using a counter in task_struct (patch 8), like other
stateless APIs such as pagefault_{disable,enable}(). This is fully
handled in a new generic layer in <linux/pgtable.h>; the arch_* API
remains unchanged. A new pair of calls, lazy_mmu_mode_{pause,resume}(),
is also introduced to allow functions that are called with the lazy MMU
mode enabled to temporarily pause it, regardless of nesting.
An arch now opts in to using the lazy MMU mode by selecting
CONFIG_ARCH_LAZY_MMU; this is more appropriate now that we have a generic
API, especially with state conditionally added to task_struct.
This patch (of 14):
Since commit
|
||
|
|
e561383a39 |
powerpc/watchdog: add support for hardlockup_sys_info sysctl
Commit |
||
|
|
759a1f9737 |
powerpc/vdso: Provide clock_getres_time64()
For consistency with __vdso_clock_gettime64() there should also be a 64-bit variant of clock_getres(). This will allow the extension of CONFIG_COMPAT_32BIT_TIME to the vDSO and finally the removal of 32-bit time types from the kernel and UAPI. Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Reviewed-by: Christophe Leroy (CS GROUP) <chleroy@kernel.org> Link: https://patch.msgid.link/20260114-vdso-powerpc-align-v1-1-acf09373d568@linutronix.de |
||
|
|
815a8d2feb |
powerpc/eeh: fix recursive pci_lock_rescan_remove locking in EEH event handling
The recent commit |
||
|
|
1471c517cf |
powerpc/iommu: bypass DMA APIs for coherent allocations for pre-mapped memory
Leverage ARCH_HAS_DMA_MAP_DIRECT config option for coherent allocations as well. This will bypass DMA ops for memory allocations that have been pre-mapped. Always set device bus_dma_limit when memory is pre-mapped. In some architectures, like PowerPC, pmemory can be converted to regular memory via daxctl command. This will gate the coherent allocations to pre-mapped RAM only, by dma_coherent_ok(). Signed-off-by: Gaurav Batra <gbatra@linux.ibm.com> Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com> Link: https://patch.msgid.link/20251107161105.85999-1-gbatra@linux.ibm.com |
||
|
|
fb7903771c |
powerpc/32s: Fix segments setup when TASK_SIZE is not a multiple of 256M
For book3s/32 it is assumed that TASK_SIZE is a multiple of 256 Mbytes, but Kconfig allows any value for TASK_SIZE. In all relevant calculations, align TASK_SIZE to the upper 256 Mbytes boundary. Also use ASM_CONST() in the definition of TASK_SIZE to ensure it is seen as an unsigned constant. Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com> Link: https://patch.msgid.link/8928d906079e156c59794c41e826a684eaaaebb4.1766574657.git.chleroy@kernel.org |
||
|
|
608328ba5b |
powerpc/32: Restore disabling of interrupts at interrupt/syscall exit
Commit |
||
|
|
b94b735675 |
powerpc: Add reloc_offset() to font bitmap pointer used for bootx_printf()
Since Linux v6.7, booting using BootX on an Old World PowerMac produces
an early crash. Stan Johnson writes, "the symptoms are that the screen
goes blank and the backlight stays on, and the system freezes (Linux
doesn't boot)."
Further testing revealed that the failure can be avoided by disabling
CONFIG_BOOTX_TEXT. Bisection revealed that the regression was caused by
a change to the font bitmap pointer that's used when btext_init() begins
painting characters on the display, early in the boot process.
Christophe Leroy explains, "before kernel text is relocated to its final
location ... data is addressed with an offset which is added to the
Global Offset Table (GOT) entries at the start of bootx_init()
by function reloc_got2(). But the pointers that are located inside a
structure are not referenced in the GOT and are therefore not updated by
reloc_got2(). It is therefore needed to apply the offset manually by using
PTRRELOC() macro."
Cc: stable@vger.kernel.org
Link: https://lists.debian.org/debian-powerpc/2025/10/msg00111.html
Link: https://lore.kernel.org/linuxppc-dev/d81ddca8-c5ee-d583-d579-02b19ed95301@yahoo.com/
Reported-by: Cedar Maxwell <cedarmaxwell@mac.com>
Closes: https://lists.debian.org/debian-powerpc/2025/09/msg00031.html
Bisected-by: Stan Johnson <userm57@yahoo.com>
Tested-by: Stan Johnson <userm57@yahoo.com>
Fixes:
|
||
|
|
a7405aa92f |
Merge tag 'dma-mapping-6.19-2025-12-05' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux
Pull dma-mapping updates from Marek Szyprowski: - More DMA mapping API refactoring to physical addresses as the primary interface instead of page+offset parameters. This time dma_map_ops callbacks are converted to physical addresses, what in turn results also in some simplification of architecture specific code (Leon Romanovsky and Jason Gunthorpe) - Clarify that dma_map_benchmark is not a kernel self-test, but standalone tool (Qinxin Xia) * tag 'dma-mapping-6.19-2025-12-05' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux: dma-mapping: remove unused map_page callback xen: swiotlb: Convert mapping routine to rely on physical address x86: Use physical address for DMA mapping sparc: Use physical address DMA mapping powerpc: Convert to physical address DMA mapping parisc: Convert DMA map_page to map_phys interface MIPS/jazzdma: Provide physical address directly alpha: Convert mapping routine to rely on physical address dma-mapping: remove unused mapping resource callbacks xen: swiotlb: Switch to physical address mapping callbacks ARM: dma-mapping: Switch to physical address mapping callbacks ARM: dma-mapping: Reduce struct page exposure in arch_sync_dma*() dma-mapping: convert dummy ops to physical address mapping dma-mapping: prepare dma_map_ops to conversion to physical address tools/dma: move dma_map_benchmark from selftests to tools/dma |
||
|
|
ad952db4a8 |
Merge tag 'powerpc-6.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates from Michael Ellerman: - Restore clearing of MSR[RI] at interrupt/syscall exit on 32-bit - Fix unpaired stwcx on interrupt exit on 32-bit - Fix race condition leading to double list-add in mac_hid_toggle_emumouse() - Fix mprotect on book3s 32-bit - Fix SLB multihit issue during SLB preload with 64-bit hash MMU - Add support for crashkernel CMA reservation - Add die_id and die_cpumask for Power10 & later to expose chip hemispheres - A series of minor fixes and improvements to the hash SLB code Thanks to Antonio Alvarez Feijoo, Ben Collins, Bhaskar Chowdhury, Christophe Leroy, Daniel Thompson, Dave Vasilevsky, Donet Tom, J. Neuschäfer, Kunwu Chan, Long Li, Naresh Kamboju, Nathan Chancellor, Ritesh Harjani (IBM), Shirisha G, Shrikanth Hegde, Sourabh Jain, Srikar Dronamraju, Stephen Rothwell, Thomas Zimmermann, Venkat Rao Bagalkote, and Vishal Chourasia. * tag 'powerpc-6.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (32 commits) macintosh/via-pmu-backlight: Include <linux/fb.h> and <linux/of.h> powerpc/powermac: backlight: Include <linux/of.h> powerpc/64s/slb: Add no_slb_preload early cmdline param powerpc/64s/slb: Make preload_add return type as void powerpc/ptdump: Dump PXX level info for kernel_page_tables powerpc/64s/pgtable: Enable directMap counters in meminfo for Hash powerpc/64s/hash: Update directMap page counters for Hash powerpc/64s/hash: Hash hpt_order should be only available with Hash MMU powerpc/64s/hash: Improve hash mmu printk messages powerpc/64s/hash: Fix phys_addr_t printf format in htab_initialize() powerpc/64s/ptdump: Fix kernel_hash_pagetable dump for ISA v3.00 HPTE format powerpc/64s/hash: Restrict stress_hpt_struct memblock region to within RMA limit powerpc/64s/slb: Fix SLB multihit issue during SLB preload powerpc, mm: Fix mprotect on book3s 32-bit powerpc/smp: Expose die_id and die_cpumask powerpc/83xx: Add a null pointer check to mcu_gpiochip_add arch:powerpc:tools This file was missing shebang line, so added it kexec: Include kernel-end even without crashkernel powerpc: p2020: Rename wdt@ nodes to watchdog@ powerpc: 86xx: Rename wdt@ nodes to watchdog@ ... |
||
|
|
ce5cfb0fa2 |
Merge tag 'iommu-updates-v6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux
Pull iommu updates from Joerg Roedel:
- Introduction of the generic IO page-table framework with support for
Intel and AMD IOMMU formats from Jason.
This has good potential for unifying more IO page-table
implementations and making future enhancements more easy. But this
also needed quite some fixes during development. All known issues
have been fixed, but my feeling is that there is a higher potential
than usual that more might be needed.
- Intel VT-d updates:
- Use right invalidation hint in qi_desc_iotlb()
- Reduce the scope of INTEL_IOMMU_FLOPPY_WA
- ARM-SMMU updates:
- Qualcomm device-tree binding updates for Kaanapali and Glymur SoCs
and a new clock for the TBU.
- Fix error handling if level 1 CD table allocation fails.
- Permit more than the architectural maximum number of SMRs for
funky Qualcomm mis-implementations of SMMUv2.
- Mediatek driver:
- MT8189 iommu support
- Move ARM IO-pgtable selftests to kunit
- Device leak fixes for a couple of drivers
- Random smaller fixes and improvements
* tag 'iommu-updates-v6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux: (81 commits)
iommupt/vtd: Support mgaw's less than a 4 level walk for first stage
iommupt/vtd: Allow VT-d to have a larger table top than the vasz requires
powerpc/pseries/svm: Make mem_encrypt.h self contained
genpt: Make GENERIC_PT invisible
iommupt: Avoid a compiler bug with sw_bit
iommu/arm-smmu-qcom: Enable use of all SMR groups when running bare-metal
iommupt: Fix unlikely flows in increase_top()
iommu/amd: Propagate the error code returned by __modify_irte_ga() in modify_irte_ga()
MAINTAINERS: Update my email address
iommu/arm-smmu-v3: Fix error check in arm_smmu_alloc_cd_tables
dt-bindings: iommu: qcom_iommu: Allow 'tbu' clock
iommu/vt-d: Restore previous domain::aperture_end calculation
iommu/vt-d: Fix unused invalidation hint in qi_desc_iotlb
iommu/vt-d: Set INTEL_IOMMU_FLOPPY_WA depend on BLK_DEV_FD
iommu/tegra: fix device leak on probe_device()
iommu/sun50i: fix device leak on of_xlate()
iommu/omap: simplify probe_device() error handling
iommu/omap: fix device leaks on probe_device()
iommu/mediatek-v1: add missing larb count sanity check
iommu/mediatek-v1: fix device leaks on probe()
...
|
||
|
|
00312419f0 |
powerpc/64s/slb: Fix SLB multihit issue during SLB preload
On systems using the hash MMU, there is a software SLB preload cache that
mirrors the entries loaded into the hardware SLB buffer. This preload
cache is subject to periodic eviction — typically after every 256 context
switches — to remove old entry.
To optimize performance, the kernel skips switch_mmu_context() in
switch_mm_irqs_off() when the prev and next mm_struct are the same.
However, on hash MMU systems, this can lead to inconsistencies between
the hardware SLB and the software preload cache.
If an SLB entry for a process is evicted from the software cache on one
CPU, and the same process later runs on another CPU without executing
switch_mmu_context(), the hardware SLB may retain stale entries. If the
kernel then attempts to reload that entry, it can trigger an SLB
multi-hit error.
The following timeline shows how stale SLB entries are created and can
cause a multi-hit error when a process moves between CPUs without a
MMU context switch.
CPU 0 CPU 1
----- -----
Process P
exec swapper/1
load_elf_binary
begin_new_exc
activate_mm
switch_mm_irqs_off
switch_mmu_context
switch_slb
/*
* This invalidates all
* the entries in the HW
* and setup the new HW
* SLB entries as per the
* preload cache.
*/
context_switch
sched_migrate_task migrates process P to cpu-1
Process swapper/0 context switch (to process P)
(uses mm_struct of Process P) switch_mm_irqs_off()
switch_slb
load_slb++
/*
* load_slb becomes 0 here
* and we evict an entry from
* the preload cache with
* preload_age(). We still
* keep HW SLB and preload
* cache in sync, that is
* because all HW SLB entries
* anyways gets evicted in
* switch_slb during SLBIA.
* We then only add those
* entries back in HW SLB,
* which are currently
* present in preload_cache
* (after eviction).
*/
load_elf_binary continues...
setup_new_exec()
slb_setup_new_exec()
sched_switch event
sched_migrate_task migrates
process P to cpu-0
context_switch from swapper/0 to Process P
switch_mm_irqs_off()
/*
* Since both prev and next mm struct are same we don't call
* switch_mmu_context(). This will cause the HW SLB and SW preload
* cache to go out of sync in preload_new_slb_context. Because there
* was an SLB entry which was evicted from both HW and preload cache
* on cpu-1. Now later in preload_new_slb_context(), when we will try
* to add the same preload entry again, we will add this to the SW
* preload cache and then will add it to the HW SLB. Since on cpu-0
* this entry was never invalidated, hence adding this entry to the HW
* SLB will cause a SLB multi-hit error.
*/
load_elf_binary continues...
START_THREAD
start_thread
preload_new_slb_context
/*
* This tries to add a new EA to preload cache which was earlier
* evicted from both cpu-1 HW SLB and preload cache. This caused the
* HW SLB of cpu-0 to go out of sync with the SW preload cache. The
* reason for this was, that when we context switched back on CPU-0,
* we should have ideally called switch_mmu_context() which will
* bring the HW SLB entries on CPU-0 in sync with SW preload cache
* entries by setting up the mmu context properly. But we didn't do
* that since the prev mm_struct running on cpu-0 was same as the
* next mm_struct (which is true for swapper / kernel threads). So
* now when we try to add this new entry into the HW SLB of cpu-0,
* we hit a SLB multi-hit error.
*/
WARNING: CPU: 0 PID: 1810970 at arch/powerpc/mm/book3s64/slb.c:62
assert_slb_presence+0x2c/0x50(48 results) 02:47:29 [20157/42149]
Modules linked in:
CPU: 0 UID: 0 PID: 1810970 Comm: dd Not tainted 6.16.0-rc3-dirty #12
VOLUNTARY
Hardware name: IBM pSeries (emulated by qemu) POWER8 (architected)
0x4d0200 0xf000004 of:SLOF,HEAD hv:linux,kvm pSeries
NIP: c00000000015426c LR: c0000000001543b4 CTR: 0000000000000000
REGS: c0000000497c77e0 TRAP: 0700 Not tainted (6.16.0-rc3-dirty)
MSR: 8000000002823033 <SF,VEC,VSX,FP,ME,IR,DR,RI,LE> CR: 28888482 XER: 00000000
CFAR: c0000000001543b0 IRQMASK: 3
<...>
NIP [c00000000015426c] assert_slb_presence+0x2c/0x50
LR [c0000000001543b4] slb_insert_entry+0x124/0x390
Call Trace:
0x7fffceb5ffff (unreliable)
preload_new_slb_context+0x100/0x1a0
start_thread+0x26c/0x420
load_elf_binary+0x1b04/0x1c40
bprm_execve+0x358/0x680
do_execveat_common+0x1f8/0x240
sys_execve+0x58/0x70
system_call_exception+0x114/0x300
system_call_common+0x160/0x2c4
>From the above analysis, during early exec the hardware SLB is cleared,
and entries from the software preload cache are reloaded into hardware
by switch_slb. However, preload_new_slb_context and slb_setup_new_exec
also attempt to load some of the same entries, which can trigger a
multi-hit. In most cases, these additional preloads simply hit existing
entries and add nothing new. Removing these functions avoids redundant
preloads and eliminates the multi-hit issue. This patch removes these
two functions.
We tested process switching performance using the context_switch
benchmark on POWER9/hash, and observed no regression.
Without this patch: 129041 ops/sec
With this patch: 129341 ops/sec
We also measured SLB faults during boot, and the counts are essentially
the same with and without this patch.
SLB faults without this patch: 19727
SLB faults with this patch: 19786
Fixes:
|