Commit Graph

49367 Commits

Author SHA1 Message Date
Linus Torvalds
eb3289fc47 Merge tag 'driver-core-6.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core
Pull driver core updates from Danilo Krummrich:
 "Auxiliary:
   - Drop call to dev_pm_domain_detach() in auxiliary_bus_probe()
   - Optimize logic of auxiliary_match_id()

  Rust:
   - Auxiliary:
      - Use primitive C types from prelude

   - DebugFs:
      - Add debugfs support for simple read/write files and custom
        callbacks through a File-type-based and directory-scope-based
        API
      - Sample driver code for the File-type-based API
      - Sample module code for the directory-scope-based API

   - I/O:
      - Add io::poll module and implement Rust specific
        read_poll_timeout() helper

   - IRQ:
      - Implement support for threaded and non-threaded device IRQs
        based on (&Device<Bound>, IRQ number) tuples (IrqRequest)
      - Provide &Device<Bound> cookie in IRQ handlers

   - PCI:
      - Support IRQ requests from IRQ vectors for a specific
        pci::Device<Bound>
      - Implement accessors for subsystem IDs, revision, devid and
        resource start
      - Provide dedicated pci::Vendor and pci::Class types for vendor
        and class ID numbers
      - Implement Display to print actual vendor and class names; Debug
        to print the raw ID numbers
      - Add pci::DeviceId::from_class_and_vendor() helper
      - Use primitive C types from prelude
      - Various minor inline and (safety) comment improvements

   - Platform:
      - Support IRQ requests from IRQ vectors for a specific
        platform::Device<Bound>

   - Nova:
      - Use pci::DeviceId::from_class_and_vendor() to avoid probing
        non-display/compute PCI functions

   - Misc:
      - Add helper for cpu_relax()
      - Update ARef import from sync::aref

  sysfs:
   - Remove bin_attrs_new field from struct attribute_group
   - Remove read_new() and write_new() from struct bin_attribute

  Misc:
   - Document potential race condition in get_dev_from_fwnode()
   - Constify node_group argument in software node registration
     functions
   - Fix order of kernel-doc parameters in various functions
   - Set power.no_pm flag for faux devices
   - Set power.no_callbacks flag along with the power.no_pm flag
   - Constify the pmu_bus bus type
   - Minor spelling fixes"

* tag 'driver-core-6.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core: (43 commits)
  rust: pci: display symbolic PCI vendor names
  rust: pci: display symbolic PCI class names
  rust: pci: fix incorrect platform reference in PCI driver probe doc comment
  rust: pci: fix incorrect platform reference in PCI driver unbind doc comment
  perf: make pmu_bus const
  samples: rust: Add scoped debugfs sample driver
  rust: debugfs: Add support for scoped directories
  samples: rust: Add debugfs sample driver
  rust: debugfs: Add support for callback-based files
  rust: debugfs: Add support for writable files
  rust: debugfs: Add support for read-only files
  rust: debugfs: Add initial support for directories
  driver core: auxiliary bus: Optimize logic of auxiliary_match_id()
  driver core: auxiliary bus: Drop dev_pm_domain_detach() call
  driver core: Fix order of the kernel-doc parameters
  driver core: get_dev_from_fwnode(): document potential race
  drivers: base: fix "publically"->"publicly"
  driver core/PM: Set power.no_callbacks along with power.no_pm
  driver core: faux: Set power.no_pm for faux devices
  rust: pci: inline several tiny functions
  ...
2025-10-01 08:39:23 -07:00
Linus Torvalds
ae28ed4578 Merge tag 'bpf-next-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Pull bpf updates from Alexei Starovoitov:

 - Support pulling non-linear xdp data with bpf_xdp_pull_data() kfunc
   (Amery Hung)

   Applied as a stable branch in bpf-next and net-next trees.

 - Support reading skb metadata via bpf_dynptr (Jakub Sitnicki)

   Also a stable branch in bpf-next and net-next trees.

 - Enforce expected_attach_type for tailcall compatibility (Daniel
   Borkmann)

 - Replace path-sensitive with path-insensitive live stack analysis in
   the verifier (Eduard Zingerman)

   This is a significant change in the verification logic. More details,
   motivation, long term plans are in the cover letter/merge commit.

 - Support signed BPF programs (KP Singh)

   This is another major feature that took years to materialize.

   Algorithm details are in the cover letter/marge commit

 - Add support for may_goto instruction to s390 JIT (Ilya Leoshkevich)

 - Add support for may_goto instruction to arm64 JIT (Puranjay Mohan)

 - Fix USDT SIB argument handling in libbpf (Jiawei Zhao)

 - Allow uprobe-bpf program to change context registers (Jiri Olsa)

 - Support signed loads from BPF arena (Kumar Kartikeya Dwivedi and
   Puranjay Mohan)

 - Allow access to union arguments in tracing programs (Leon Hwang)

 - Optimize rcu_read_lock() + migrate_disable() combination where it's
   used in BPF subsystem (Menglong Dong)

 - Introduce bpf_task_work_schedule*() kfuncs to schedule deferred
   execution of BPF callback in the context of a specific task using the
   kernel’s task_work infrastructure (Mykyta Yatsenko)

 - Enforce RCU protection for KF_RCU_PROTECTED kfuncs (Kumar Kartikeya
   Dwivedi)

 - Add stress test for rqspinlock in NMI (Kumar Kartikeya Dwivedi)

 - Improve the precision of tnum multiplier verifier operation
   (Nandakumar Edamana)

 - Use tnums to improve is_branch_taken() logic (Paul Chaignon)

 - Add support for atomic operations in arena in riscv JIT (Pu Lehui)

 - Report arena faults to BPF error stream (Puranjay Mohan)

 - Search for tracefs at /sys/kernel/tracing first in bpftool (Quentin
   Monnet)

 - Add bpf_strcasecmp() kfunc (Rong Tao)

 - Support lookup_and_delete_elem command in BPF_MAP_STACK_TRACE (Tao
   Chen)

* tag 'bpf-next-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (197 commits)
  libbpf: Replace AF_ALG with open coded SHA-256
  selftests/bpf: Add stress test for rqspinlock in NMI
  selftests/bpf: Add test case for different expected_attach_type
  bpf: Enforce expected_attach_type for tailcall compatibility
  bpftool: Remove duplicate string.h header
  bpf: Remove duplicate crypto/sha2.h header
  libbpf: Fix error when st-prefix_ops and ops from differ btf
  selftests/bpf: Test changing packet data from kfunc
  selftests/bpf: Add stacktrace map lookup_and_delete_elem test case
  selftests/bpf: Refactor stacktrace_map case with skeleton
  bpf: Add lookup_and_delete_elem for BPF_MAP_STACK_TRACE
  selftests/bpf: Fix flaky bpf_cookie selftest
  selftests/bpf: Test changing packet data from global functions with a kfunc
  bpf: Emit struct bpf_xdp_sock type in vmlinux BTF
  selftests/bpf: Task_work selftest cleanup fixes
  MAINTAINERS: Delete inactive maintainers from AF_XDP
  bpf: Mark kfuncs as __noclone
  selftests/bpf: Add kprobe multi write ctx attach test
  selftests/bpf: Add kprobe write ctx attach test
  selftests/bpf: Add uprobe context ip register change test
  ...
2025-09-30 17:58:11 -07:00
Linus Torvalds
4b81e2eb9e Merge tag 'timers-vdso-2025-09-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull VDSO updates from Thomas Gleixner:

 - Further consolidation of the VDSO infrastructure and the common data
   store

 - Simplification of the related Kconfig logic

 - Improve the VDSO selftest suite

* tag 'timers-vdso-2025-09-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  selftests: vDSO: Drop vdso_test_clock_getres
  selftests: vDSO: vdso_test_abi: Add tests for clock_gettime64()
  selftests: vDSO: vdso_test_abi: Test CPUTIME clocks
  selftests: vDSO: vdso_test_abi: Use explicit indices for name array
  selftests: vDSO: vdso_test_abi: Drop clock availability tests
  selftests: vDSO: vdso_test_abi: Use ksft_finished()
  selftests: vDSO: vdso_test_abi: Correctly skip whole test with missing vDSO
  selftests: vDSO: Fix -Wunitialized in powerpc VDSO_CALL() wrapper
  vdso: Add struct __kernel_old_timeval forward declaration to gettime.h
  vdso: Gate VDSO_GETRANDOM behind HAVE_GENERIC_VDSO
  vdso: Drop Kconfig GENERIC_VDSO_TIME_NS
  vdso: Drop Kconfig GENERIC_VDSO_DATA_STORE
  vdso: Drop kconfig GENERIC_COMPAT_VDSO
  vdso: Drop kconfig GENERIC_VDSO_32
  riscv: vdso: Untangle Kconfig logic
  time: Build generic update_vsyscall() only with generic time vDSO
  vdso/gettimeofday: Remove !CONFIG_TIME_NS stubs
  vdso: Move ENABLE_COMPAT_VDSO from core to arm64
  ARM: VDSO: Remove cntvct_ok global variable
  vdso/datastore: Gate time data behind CONFIG_GENERIC_GETTIMEOFDAY
2025-09-30 16:58:21 -07:00
Linus Torvalds
70de5572a8 Merge tag 'timers-clocksource-2025-09-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull clocksource updates from Thomas Gleixner:

 - Further preparations for modular clocksource/event drivers

 - The usual device tree updates to support new chip variants and the
   related changes to thise drivers

 - Avoid a 64-bit division in the TEGRA186 driver, which caused a build
   fail on 32-bit machines.

 - Small fixes, improvements and cleanups all over the place

* tag 'timers-clocksource-2025-09-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (52 commits)
  dt-bindings: timer: exynos4210-mct: Add compatible for ARTPEC-9 SoC
  clocksource/drivers/sh_cmt: Split start/stop of clock source and events
  clocksource/drivers/clps711x: Fix resource leaks in error paths
  clocksource/drivers/arm_global_timer: Add auto-detection for initial prescaler values
  clocksource/drivers/ingenic-sysost: Convert from round_rate() to determine_rate()
  clocksource/drivers/timer-tegra186: Don't print superfluous errors
  clocksource/drivers/timer-rtl-otto: Simplify documentation
  clocksource/drivers/timer-rtl-otto: Do not interfere with interrupts
  clocksource/drivers/timer-rtl-otto: Drop set_counter function
  clocksource/drivers/timer-rtl-otto: Work around dying timers
  clocksource/drivers/timer-ti-dm : Capture functionality for OMAP DM timer
  clocksource/drivers/arm_arch_timer_mmio: Add MMIO clocksource
  clocksource/drivers/arm_arch_timer_mmio: Switch over to standalone driver
  clocksource/drivers/arm_arch_timer: Add standalone MMIO driver
  ACPI: GTDT: Generate platform devices for MMIO timers
  clocksource/drivers/nxp-pit: Add NXP Automotive s32g2 / s32g3 support
  dt: bindings: fsl,vf610-pit: Add compatible for s32g2 and s32g3
  clocksource/drivers/vf-pit: Rename the VF PIT to NXP PIT
  clocksource/drivers/vf-pit: Unify the function name for irq ack
  clocksource/drivers/vf-pit: Consolidate calls to pit_*_disable/enable
  ...
2025-09-30 16:53:59 -07:00
Linus Torvalds
c5448d46b3 Merge tag 'timers-core-2025-09-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer core updates from Thomas Gleixner:

 - Address the inconsistent shutdown sequence of per CPU clockevents on
   CPU hotplug, which only removed it from the core but failed to invoke
   the actual device driver shutdown callback. This kept the timer
   active, which prevented power savings and caused pointless noise in
   virtualization.

 - Encapsulate the open coded access to the hrtimer clock base, which is
   a private implementation detail, so that the implementation can be
   changed without breaking a lot of usage sites.

 - Enhance the debug output of the clocksource watchdog to provide
   better information for analysis.

 - The usual set of cleanups and enhancements all over the place

* tag 'timers-core-2025-09-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  time: Fix spelling mistakes in comments
  clocksource: Print durations for sync check unconditionally
  LoongArch: Remove clockevents shutdown call on offlining
  tick: Do not set device to detached state in tick_shutdown()
  hrtimer: Reorder branches in hrtimer_clockid_to_base()
  hrtimer: Remove hrtimer_clock_base:: Get_time
  hrtimer: Use hrtimer_cb_get_time() helper
  media: pwm-ir-tx: Avoid direct access to hrtimer clockbase
  ALSA: hrtimer: Avoid direct access to hrtimer clockbase
  lib: test_objpool: Avoid direct access to hrtimer clockbase
  sched/core: Avoid direct access to hrtimer clockbase
  timers/itimer: Avoid direct access to hrtimer clockbase
  posix-timers: Avoid direct access to hrtimer clockbase
  jiffies: Remove obsolete SHIFTED_HZ comment
2025-09-30 16:09:27 -07:00
Linus Torvalds
c574fb2ed7 Merge tag 'locking-futex-2025-09-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull futex updates from Thomas Gleixner:
 "A set of updates for futexes and related selftests:

   - Plug the ptrace_may_access() race against a concurrent exec() which
     allows to pass the check before the target's process transition in
     exec() by taking a read lock on signal->ext_update_lock.

   - A large set of cleanups and enhancement to the futex selftests. The
     bulk of changes is the conversion to the kselftest harness"

* tag 'locking-futex-2025-09-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (25 commits)
  selftest/futex: Fix spelling mistake "boundarie" -> "boundary"
  selftests/futex: Remove logging.h file
  selftests/futex: Drop logging.h include from futex_numa
  selftests/futex: Refactor futex_numa_mpol with kselftest_harness.h
  selftests/futex: Refactor futex_priv_hash with kselftest_harness.h
  selftests/futex: Refactor futex_waitv with kselftest_harness.h
  selftests/futex: Refactor futex_requeue with kselftest_harness.h
  selftests/futex: Refactor futex_wait with kselftest_harness.h
  selftests/futex: Refactor futex_wait_private_mapped_file with kselftest_harness.h
  selftests/futex: Refactor futex_wait_unitialized_heap with kselftest_harness.h
  selftests/futex: Refactor futex_wait_wouldblock with kselftest_harness.h
  selftests/futex: Refactor futex_wait_timeout with kselftest_harness.h
  selftests/futex: Refactor futex_requeue_pi_signal_restart with kselftest_harness.h
  selftests/futex: Refactor futex_requeue_pi_mismatched_ops with kselftest_harness.h
  selftests/futex: Refactor futex_requeue_pi with kselftest_harness.h
  selftests: kselftest: Create ksft_print_dbg_msg()
  futex: Don't leak robust_list pointer on exec race
  selftest/futex: Compile also with libnuma < 2.0.16
  selftest/futex: Reintroduce "Memory out of range" numa_mpol's subtest
  selftest/futex: Make the error check more precise for futex_numa_mpol
  ...
2025-09-30 16:07:10 -07:00
Linus Torvalds
d8de3685f1 Merge tag 'smp-core-2025-09-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull smp doc fixlet from Thomas Gleixner:
 "An update of the stale smp_call_function_many() documentation to bring
  it back in sync with the actual implementation"

* tag 'smp-core-2025-09-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  smp: Fix up and expand the smp_call_function_many() kerneldoc
2025-09-30 16:04:52 -07:00
Linus Torvalds
3b2074c77d Merge tag 'irq-core-2025-09-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq core updates from Thomas Gleixner:
 "A set of updates for the interrupt core subsystem:

   - Introduce irq_chip_[startup|shutdown]_parent() to prepare for
     addressing a few short comings in the PCI/MSI interrupt subsystem.

     It allows to utilize the interrupt chip startup/shutdown callbacks
     for initializing the interrupt chip hierarchy properly on certain
     RISCV implementations and provides a mechanism to reduce the
     overhead of masking and unmasking PCI/MSI interrupts during
     operation when the underlying MSI provider can mask the interrupt.

     The actual usage comes with the interrupt driver pull request.

   - Add generic error handling for devm_request_*_irq()

     This allows to remove the zoo of random error printk's all over the
     usage sites.

   - Add a mechanism to warn about long-running interrupt handlers

     Long running interrupt handlers can introduce latencies and
     tracking them down is a tedious task. The tracking has to be
     enabled with a threshold on the kernel command line and utilizes a
     static branch to remove the overhead when disabled.

   - Update and extend the selftests which validate the CPU hotplug
     interrupt migration logic

   - Allow dropping the per CPU softirq lock on PREEMPT_RT kernels,
     which causes contention and latencies all over the place.

     The serialization requirements have been pushed down into the
     actual affected usage sites already.

   - The usual small cleanups and improvements"

* tag 'irq-core-2025-09-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  softirq: Allow to drop the softirq-BKL lock on PREEMPT_RT
  softirq: Provide a handshake for canceling tasklets via polling
  genirq/test: Ensure CPU 1 is online for hotplug test
  genirq/test: Drop CONFIG_GENERIC_IRQ_MIGRATION assumptions
  genirq/test: Depend on SPARSE_IRQ
  genirq/test: Fail early if interrupt request fails
  genirq/test: Factor out fake-virq setup
  genirq/test: Select IRQ_DOMAIN
  genirq/test: Fix depth tests on architectures with NOREQUEST by default.
  genirq: Add support for warning on long-running interrupt handlers
  genirq/devres: Add error handling in devm_request_*_irq()
  genirq: Add irq_chip_(startup/shutdown)_parent()
  genirq: Remove GENERIC_IRQ_LEGACY
2025-09-30 15:55:25 -07:00
Linus Torvalds
1d17e808cf Merge tag 'core-rseq-2025-09-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull rseq updates from Thomas Gleixner:
 "Two fixes for RSEQ:

   - Protect the event mask modification against the membarrier() IPI as
     otherwise the RmW operation is unprotected and events might be lost

   - Fix the weak symbol reference in rseq selftests

     The current weak RSEQ symbols definitions which were added to allow
     static linkage are not working correctly as they effectively
     re-define the glibc symbols leading to multiple versions of the
     symbols when compiled with -fno-common.

     Mark them as 'extern' to convert them from weak symbol definitions
     to weak symbol references. That works with static and dynamic
     linkage independent of -fcommon and -fno-common"

* tag 'core-rseq-2025-09-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  rseq/selftests: Use weak symbol reference, not definition, to link with glibc
  rseq: Protect event mask against membarrier IPI
2025-09-30 15:06:33 -07:00
Linus Torvalds
e4dcbdff11 Merge tag 'perf-core-2025-09-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull performance events updates from Ingo Molnar:
 "Core perf code updates:

   - Convert mmap() related reference counts to refcount_t. This is in
     reaction to the recently fixed refcount bugs, which could have been
     detected earlier and could have mitigated the bug somewhat (Thomas
     Gleixner, Peter Zijlstra)

   - Clean up and simplify the callchain code, in preparation for
     sframes (Steven Rostedt, Josh Poimboeuf)

  Uprobes updates:

   - Add support to optimize usdt probes on x86-64, which gives a
     substantial speedup (Jiri Olsa)

   - Cleanups and fixes on x86 (Peter Zijlstra)

  PMU driver updates:

   - Various optimizations and fixes to the Intel PMU driver (Dapeng Mi)

  Misc cleanups and fixes:

   - Remove redundant __GFP_NOWARN (Qianfeng Rong)"

* tag 'perf-core-2025-09-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (57 commits)
  selftests/bpf: Fix uprobe_sigill test for uprobe syscall error value
  uprobes/x86: Return error from uprobe syscall when not called from trampoline
  perf: Skip user unwind if the task is a kernel thread
  perf: Simplify get_perf_callchain() user logic
  perf: Use current->flags & PF_KTHREAD|PF_USER_WORKER instead of current->mm == NULL
  perf: Have get_perf_callchain() return NULL if crosstask and user are set
  perf: Remove get_perf_callchain() init_nr argument
  perf/x86: Print PMU counters bitmap in x86_pmu_show_pmu_cap()
  perf/x86/intel: Add ICL_FIXED_0_ADAPTIVE bit into INTEL_FIXED_BITS_MASK
  perf/x86/intel: Change macro GLOBAL_CTRL_EN_PERF_METRICS to BIT_ULL(48)
  perf/x86: Add PERF_CAP_PEBS_TIMING_INFO flag
  perf/x86/intel: Fix IA32_PMC_x_CFG_B MSRs access error
  perf/x86/intel: Use early_initcall() to hook bts_init()
  uprobes: Remove redundant __GFP_NOWARN
  selftests/seccomp: validate uprobe syscall passes through seccomp
  seccomp: passthrough uprobe systemcall without filtering
  selftests/bpf: Fix uprobe syscall shadow stack test
  selftests/bpf: Change test_uretprobe_regs_change for uprobe and uretprobe
  selftests/bpf: Add uprobe_regs_equal test
  selftests/bpf: Add optimized usdt variant for basic usdt test
  ...
2025-09-30 11:11:21 -07:00
Linus Torvalds
6c7340a7a8 Merge tag 'sched-core-2025-09-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
 "Core scheduler changes:

   - Make migrate_{en,dis}able() inline, to improve performance
     (Menglong Dong)

   - Move STDL_INIT() functions out-of-line (Peter Zijlstra)

   - Unify the SCHED_{SMT,CLUSTER,MC} Kconfig (Peter Zijlstra)

  Fair scheduling:

   - Defer throttling to when tasks exit to user-space, to reduce the
     chance & impact of throttle-preemption with held locks and other
     resources (Aaron Lu, Valentin Schneider)

   - Get rid of sched_domains_curr_level hack for tl->cpumask(), as the
     warning was getting triggered on certain topologies (Peter
     Zijlstra)

  Misc cleanups & fixes:

   - Header cleanups (Menglong Dong)

   - Fix race in push_dl_task() (Harshit Agarwal)"

* tag 'sched-core-2025-09-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched: Fix some typos in include/linux/preempt.h
  sched: Make migrate_{en,dis}able() inline
  rcu: Replace preempt.h with sched.h in include/linux/rcupdate.h
  arch: Add the macro COMPILE_OFFSETS to all the asm-offsets.c
  sched/fair: Do not balance task to a throttled cfs_rq
  sched/fair: Do not special case tasks in throttled hierarchy
  sched/fair: update_cfs_group() for throttled cfs_rqs
  sched/fair: Propagate load for throttled cfs_rq
  sched/fair: Get rid of throttled_lb_pair()
  sched/fair: Task based throttle time accounting
  sched/fair: Switch to task based throttle model
  sched/fair: Implement throttle task work and related helpers
  sched/fair: Add related data structure for task based throttle
  sched: Unify the SCHED_{SMT,CLUSTER,MC} Kconfig
  sched: Move STDL_INIT() functions out-of-line
  sched/fair: Get rid of sched_domains_curr_level hack for tl->cpumask()
  sched/deadline: Fix race in push_dl_task()
2025-09-30 10:35:11 -07:00
Linus Torvalds
755fa5b4fb Merge tag 'cgroup-for-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:

 - Extensive cpuset code cleanup and refactoring work with no functional
   changes: CPU mask computation logic refactoring, introducing new
   helpers, removing redundant code paths, and improving error handling
   for better maintainability.

 - A few bug fixes to cpuset including fixes for partition creation
   failures when isolcpus is in use, missing error returns, and null
   pointer access prevention in free_tmpmasks().

 - Core cgroup changes include replacing the global percpu_rwsem with
   per-threadgroup rwsem when writing to cgroup.procs for better
   scalability, workqueue conversions to use WQ_PERCPU and
   system_percpu_wq to prepare for workqueue default switching from
   percpu to unbound, and removal of unused code including the
   post_attach callback.

 - New cgroup.stat.local time accounting feature that tracks frozen time
   duration.

 - Misc changes including selftests updates (new freezer time tests and
   backward compatibility fixes), documentation sync, string function
   safety improvements, and 64-bit division fixes.

* tag 'cgroup-for-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (39 commits)
  cpuset: remove is_prs_invalid helper
  cpuset: remove impossible warning in update_parent_effective_cpumask
  cpuset: remove redundant special case for null input in node mask update
  cpuset: fix missing error return in update_cpumask
  cpuset: Use new excpus for nocpu error check when enabling root partition
  cpuset: fix failure to enable isolated partition when containing isolcpus
  Documentation: cgroup-v2: Sync manual toctree
  cpuset: use partition_cpus_change for setting exclusive cpus
  cpuset: use parse_cpulist for setting cpus.exclusive
  cpuset: introduce partition_cpus_change
  cpuset: refactor cpus_allowed_validate_change
  cpuset: refactor out validate_partition
  cpuset: introduce cpus_excl_conflict and mems_excl_conflict helpers
  cpuset: refactor CPU mask buffer parsing logic
  cpuset: Refactor exclusive CPU mask computation logic
  cpuset: change return type of is_partition_[in]valid to bool
  cpuset: remove unused assignment to trialcs->partition_root_state
  cpuset: move the root cpuset write check earlier
  cgroup/cpuset: Remove redundant rcu_read_lock/unlock() in spin_lock
  cgroup: Remove redundant rcu_read_lock/unlock() in spin_lock
  ...
2025-09-30 09:55:41 -07:00
Linus Torvalds
77fc3f6696 Merge tag 'wq-for-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue updates from Tejun Heo:

 - WQ_PERCPU was added to remaining alloc_workqueue() users and
   system_wq usage was replaced with system_percpu_wq and
   system_unbound_wq with system_dfl_wq.

   These are equivalent conversions with no functional changes,
   preparing for switching default to unbound workqueues from percpu.

 - A handshake mechanism was added for canceling BH workers to avoid
   live lock scenarios under PREEMPT_RT.

 - Unnecessary rcu_read_lock/unlock() calls were dropped in
   wq_watchdog_timer_fn() and workqueue_congested().

 - Documentation was fixed to resolve texinfodocs warnings.

* tag 'wq-for-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  workqueue: fix texinfodocs warning for WQ_* flags reference
  workqueue: WQ_PERCPU added to alloc_workqueue users
  workqueue: replace use of system_wq with system_percpu_wq
  workqueue: replace use of system_unbound_wq with system_dfl_wq
  workqueue: Provide a handshake for canceling BH workers
  workqueue: Remove rcu_read_lock/unlock() in wq_watchdog_timer_fn()
  workqueue: Remove redundant rcu_read_lock/unlock() in workqueue_congested()
2025-09-30 09:31:09 -07:00
Linus Torvalds
a23cd25bae Merge tag 'sched_ext-for-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext
Pull sched_ext updates from Tejun Heo:

 - Code organization cleanup. Separate internal types and accessors to
   ext_internal.h to reduce the size of ext.c and improve
   maintainability.

 - Prepare for cgroup sub-scheduler support by adding @sch parameter to
   various functions and helpers, reorganizing scheduler instance
   handling, and dropping obsolete helpers like scx_kf_exit() and
   kf_cpu_valid().

 - Add new scx_bpf_cpu_curr() and scx_bpf_locked_rq() BPF helpers to
   provide safer access patterns with proper RCU protection.
   scx_bpf_cpu_rq() is deprecated with warnings due to potential race
   conditions.

 - Improve debugging with migration-disabled counter in error state
   dumps, SCX_EFLAG_INITIALIZED flag, bitfields for warning flags, and
   other enhancements to help diagnose issues.

 - Use cgroup_lock/unlock() for cgroup synchronization instead of
   scx_cgroup_rwsem based synchronization. This is simpler and allows
   enable/disable paths to synchronize against cgroup changes
   independent of the CPU controller.

 - rhashtable_lookup() replacement to avoid redundant RCU locking was
   reverted due to RCU usage warnings. Will be redone once rhashtable is
   updated to use rcu_dereference_all().

 - Other misc updates and fixes including bypass handling improvements,
   scx_task_iter_relock() improvements, tools/sched_ext updates, and
   compatibility helpers.

* tag 'sched_ext-for-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: (28 commits)
  Revert "sched_ext: Use rhashtable_lookup() instead of rhashtable_lookup_fast()"
  sched_ext: Misc updates around scx_sched instance pointer
  sched_ext: Drop scx_kf_exit() and scx_kf_error()
  sched_ext: Add the @sch parameter to scx_dsq_insert_preamble/commit()
  sched_ext: Drop kf_cpu_valid()
  sched_ext: Add the @sch parameter to ext_idle helpers
  sched_ext: Add the @sch parameter to __bstr_format()
  sched_ext: Separate out scx_kick_cpu() and add @sch to it
  tools/sched_ext: scx_qmap: Make debug output quieter by default
  sched_ext: Make qmap dump operation non-destructive
  sched_ext: Add SCX_EFLAG_INITIALIZED to indicate successful ops.init()
  sched_ext: Use bitfields for boolean warning flags
  sched_ext: Fix stray scx_root usage in task_can_run_on_remote_rq()
  sched_ext: Improve SCX_KF_DISPATCH comment
  sched_ext: Use rhashtable_lookup() instead of rhashtable_lookup_fast()
  sched_ext: Verify RCU protection in scx_bpf_cpu_curr()
  sched_ext: Add migration-disabled counter to error state dump
  sched_ext: Fix NULL dereference in scx_bpf_cpu_rq() warning
  tools/sched_ext: Add compat helper for scx_bpf_cpu_curr()
  sched_ext: deprecation warn for scx_bpf_cpu_rq()
  ...
2025-09-30 09:05:07 -07:00
Linus Torvalds
56a0810d8c Merge tag 'audit-pr-20250926' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit
Pull audit updates from Paul Moore:

 - Proper audit support for multiple LSMs

   As the audit subsystem predated the work to enable multiple LSMs,
   some additional work was needed to support logging the different LSM
   labels for the subjects/tasks and objects on the system. Casey's
   patches add new auxillary records for subjects and objects that
   convey the additional labels.

 - Ensure fanotify audit events are always generated

   Generally speaking security relevant subsystems always generate audit
   events, unless explicitly ignored. However, up to this point fanotify
   events had been ignored by default, but starting with this pull
   request fanotify follows convention and generates audit events by
   default.

 - Replace an instance of strcpy() with strscpy()

 - Minor indentation, style, and comment fixes

* tag 'audit-pr-20250926' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
  audit: fix skb leak when audit rate limit is exceeded
  audit: init ab->skb_list earlier in audit_buffer_alloc()
  audit: add record for multiple object contexts
  audit: add record for multiple task security contexts
  lsm: security_lsmblob_to_secctx module selection
  audit: create audit_stamp structure
  audit: add a missing tab
  audit: record fanotify event regardless of presence of rules
  audit: fix typo in auditfilter.c comment
  audit: Replace deprecated strcpy() with strscpy()
  audit: fix indentation in audit_log_exit()
2025-09-30 08:22:16 -07:00
Linus Torvalds
417552999d Merge tag 'powerpc-6.18-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates from Madhavan Srinivasan:

 - powerpc support for BPF arena and arena atomics

 - Patches to switch to msi parent domain (per-device MSI domains)

 - Add a lock contention tracepoint in the queued spinlock slowpath

 - Fixes for underflow in pseries/powernv msi and pci paths

 - Switch from legacy-of-mm-gpiochip dependency to platform driver

 - Fixes for handling TLB misses

 - Introduce support for powerpc papr-hvpipe

 - Add vpa-dtl PMU driver for pseries platform

 - Misc fixes and cleanups

Thanks to Aboorva Devarajan, Aditya Bodkhe, Andrew Donnellan, Athira
Rajeev, Cédric Le Goater, Christophe Leroy, Erhard Furtner, Gautam
Menghani, Geert Uytterhoeven, Haren Myneni, Hari Bathini, Joe Lawrence,
Kajol Jain, Kienan Stewart, Linus Walleij, Mahesh Salgaonkar, Nam Cao,
Nicolas Schier, Nysal Jan K.A., Ritesh Harjani (IBM), Ruben Wauters,
Saket Kumar Bhaskar, Shashank MS, Shrikanth Hegde, Tejas Manhas, Thomas
Gleixner, Thomas Huth, Thorsten Blum, Tyrel Datwyler, and Venkat Rao
Bagalkote.

* tag 'powerpc-6.18-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (49 commits)
  powerpc/pseries: Define __u{8,32} types in papr_hvpipe_hdr struct
  genirq/msi: Remove msi_post_free()
  powerpc/perf/vpa-dtl: Add documentation for VPA dispatch trace log PMU
  powerpc/perf/vpa-dtl: Handle the writing of perf record when aux wake up is needed
  powerpc/perf/vpa-dtl: Add support to capture DTL data in aux buffer
  powerpc/perf/vpa-dtl: Add support to setup and free aux buffer for capturing DTL data
  docs: ABI: sysfs-bus-event_source-devices-vpa-dtl: Document sysfs event format entries for vpa_dtl pmu
  powerpc/vpa_dtl: Add interface to expose vpa dtl counters via perf
  powerpc/time: Expose boot_tb via accessor
  powerpc/32: Remove PAGE_KERNEL_TEXT to fix startup failure
  powerpc/fprobe: fix updated fprobe for function-graph tracer
  powerpc/ftrace: support CONFIG_FUNCTION_GRAPH_RETVAL
  powerpc64/modules: replace stub allocation sentinel with an explicit counter
  powerpc64/modules: correctly iterate over stubs in setup_ftrace_ool_stubs
  powerpc/ftrace: ensure ftrace record ops are always set for NOPs
  powerpc/603: Really copy kernel PGD entries into all PGDIRs
  powerpc/8xx: Remove left-over instruction and comments in DataStoreTLBMiss handler
  powerpc/pseries: HVPIPE changes to support migration
  powerpc/pseries: Enable hvpipe with ibm,set-system-parameter RTAS
  powerpc/pseries: Enable HVPIPE event message interrupt
  ...
2025-09-29 19:28:50 -07:00
Linus Torvalds
feafee2845 Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 updates from Will Deacon:
 "There's good stuff across the board, including some nice mm
  improvements for CPUs with the 'noabort' BBML2 feature and a clever
  patch to allow ptdump to play nicely with block mappings in the
  vmalloc area.

  Confidential computing:

   - Add support for accepting secrets from firmware (e.g. ACPI CCEL)
     and mapping them with appropriate attributes.

  CPU features:

   - Advertise atomic floating-point instructions to userspace

   - Extend Spectre workarounds to cover additional Arm CPU variants

   - Extend list of CPUs that support break-before-make level 2 and
     guarantee not to generate TLB conflict aborts for changes of
     mapping granularity (BBML2_NOABORT)

   - Add GCS support to our uprobes implementation.

  Documentation:

   - Remove bogus SME documentation concerning register state when
     entering/exiting streaming mode.

  Entry code:

   - Switch over to the generic IRQ entry code (GENERIC_IRQ_ENTRY)

   - Micro-optimise syscall entry path with a compiler branch hint.

  Memory management:

   - Enable huge mappings in vmalloc space even when kernel page-table
     dumping is enabled

   - Tidy up the types used in our early MMU setup code

   - Rework rodata= for closer parity with the behaviour on x86

   - For CPUs implementing BBML2_NOABORT, utilise block mappings in the
     linear map even when rodata= applies to virtual aliases

   - Don't re-allocate the virtual region between '_text' and '_stext',
     as doing so confused tools parsing /proc/vmcore.

  Miscellaneous:

   - Clean-up Kconfig menuconfig text for architecture features

   - Avoid redundant bitmap_empty() during determination of supported
     SME vector lengths

   - Re-enable warnings when building the 32-bit vDSO object

   - Avoid breaking our eggs at the wrong end.

  Perf and PMUs:

   - Support for v3 of the Hisilicon L3C PMU

   - Support for Hisilicon's MN and NoC PMUs

   - Support for Fujitsu's Uncore PMU

   - Support for SPE's extended event filtering feature

   - Preparatory work to enable data source filtering in SPE

   - Support for multiple lanes in the DWC PCIe PMU

   - Support for i.MX94 in the IMX DDR PMU driver

   - MAINTAINERS update (Thank you, Yicong)

   - Minor driver fixes (PERF_IDX2OFF() overflow, CMN register offsets).

  Selftests:

   - Add basic LSFE check to the existing hwcaps test

   - Support nolibc in GCS tests

   - Extend SVE ptrace test to pass unsupported regsets and invalid
     vector lengths

   - Minor cleanups (typos, cosmetic changes).

  System registers:

   - Fix ID_PFR1_EL1 definition

   - Fix incorrect signedness of some fields in ID_AA64MMFR4_EL1

   - Sync TCR_EL1 definition with the latest Arm ARM (L.b)

   - Be stricter about the input fed into our AWK sysreg generator
     script

   - Typo fixes and removal of redundant definitions.

  ACPI, EFI and PSCI:

   - Decouple Arm's "Software Delegated Exception Interface" (SDEI)
     support from the ACPI GHES code so that it can be used by platforms
     booted with device-tree

   - Remove unnecessary per-CPU tracking of the FPSIMD state across EFI
     runtime calls

   - Fix a node refcount imbalance in the PSCI device-tree code.

  CPU Features:

   - Ensure register sanitisation is applied to fields in ID_AA64MMFR4

   - Expose AIDR_EL1 to userspace via sysfs, primarily so that KVM
     guests can reliably query the underlying CPU types from the VMM

   - Re-enabling of SME support (CONFIG_ARM64_SME) as a result of fixes
     to our context-switching, signal handling and ptrace code"

* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (93 commits)
  arm64: cpufeature: Remove duplicate asm/mmu.h header
  arm64: Kconfig: Make CPU_BIG_ENDIAN depend on BROKEN
  perf/dwc_pcie: Fix use of uninitialized variable
  arm/syscalls: mark syscall invocation as likely in invoke_syscall
  Documentation: hisi-pmu: Add introduction to HiSilicon V3 PMU
  Documentation: hisi-pmu: Fix of minor format error
  drivers/perf: hisi: Add support for L3C PMU v3
  drivers/perf: hisi: Refactor the event configuration of L3C PMU
  drivers/perf: hisi: Extend the field of tt_core
  drivers/perf: hisi: Extract the event filter check of L3C PMU
  drivers/perf: hisi: Simplify the probe process of each L3C PMU version
  drivers/perf: hisi: Export hisi_uncore_pmu_isr()
  drivers/perf: hisi: Relax the event ID check in the framework
  perf: Fujitsu: Add the Uncore PMU driver
  arm64: map [_text, _stext) virtual address range non-executable+read-only
  arm64/sysreg: Update TCR_EL1 register
  arm64: Enable vmalloc-huge with ptdump
  arm64: cpufeature: add Neoverse-V3AE to BBML2 allow list
  arm64: errata: Apply workarounds for Neoverse-V3AE
  arm64: cputype: Add Neoverse-V3AE definitions
  ...
2025-09-29 18:48:39 -07:00
Linus Torvalds
a5ba183bde Merge tag 'hardening-v6.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull hardening updates from Kees Cook:
 "One notable addition is the creation of the 'transitional' keyword for
  kconfig so CONFIG renaming can go more smoothly.

  This has been a long-standing deficiency, and with the renaming of
  CONFIG_CFI_CLANG to CONFIG_CFI (since GCC will soon have KCFI
  support), this came up again.

  The breadth of the diffstat is mainly this renaming.

   - Clean up usage of TRAILING_OVERLAP() (Gustavo A. R. Silva)

   - lkdtm: fortify: Fix potential NULL dereference on kmalloc failure
     (Junjie Cao)

   - Add str_assert_deassert() helper (Lad Prabhakar)

   - gcc-plugins: Remove TODO_verify_il for GCC >= 16

   - kconfig: Fix BrokenPipeError warnings in selftests

   - kconfig: Add transitional symbol attribute for migration support

   - kcfi: Rename CONFIG_CFI_CLANG to CONFIG_CFI"

* tag 'hardening-v6.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  lib/string_choices: Add str_assert_deassert() helper
  kcfi: Rename CONFIG_CFI_CLANG to CONFIG_CFI
  kconfig: Add transitional symbol attribute for migration support
  kconfig: Fix BrokenPipeError warnings in selftests
  gcc-plugins: Remove TODO_verify_il for GCC >= 16
  stddef: Introduce __TRAILING_OVERLAP()
  stddef: Remove token-pasting in TRAILING_OVERLAP()
  lkdtm: fortify: Fix potential NULL dereference on kmalloc failure
2025-09-29 17:48:27 -07:00
Linus Torvalds
a240a79d43 Merge tag 'seccomp-v6.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull seccomp update from Kees Cook:

 - Fix race with WAIT_KILLABLE_RECV (Johannes Nixdorf)

* tag 'seccomp-v6.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  selftests/seccomp: Add a test for the WAIT_KILLABLE_RECV fast reply race
  seccomp: Fix a race with WAIT_KILLABLE_RECV if the tracer replies too fast
2025-09-29 17:44:09 -07:00
Linus Torvalds
449c2b302c Merge tag 'vfs-6.18-rc1.async' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs async directory updates from Christian Brauner:
 "This contains further preparatory changes for the asynchronous directory
  locking scheme:

   - Add lookup_one_positive_killable() which allows overlayfs to
     perform lookup that won't block on a fatal signal

   - Unify the mount idmap handling in struct renamedata as a rename can
     only happen within a single mount

   - Introduce kern_path_parent() for audit which sets the path to the
     parent and returns a dentry for the target without holding any
     locks on return

   - Rename kern_path_locked() as it is only used to prepare for the
     removal of an object from the filesystem:

	kern_path_locked()    => start_removing_path()
	kern_path_create()    => start_creating_path()
	user_path_create()    => start_creating_user_path()
	user_path_locked_at() => start_removing_user_path_at()
	done_path_create()    => end_creating_path()
	NA                    => end_removing_path()"

* tag 'vfs-6.18-rc1.async' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  debugfs: rename start_creating() to debugfs_start_creating()
  VFS: rename kern_path_locked() and related functions.
  VFS/audit: introduce kern_path_parent() for audit
  VFS: unify old_mnt_idmap and new_mnt_idmap in renamedata
  VFS: discard err2 in filename_create()
  VFS/ovl: add lookup_one_positive_killable()
2025-09-29 11:55:15 -07:00
Linus Torvalds
18b19abc37 Merge tag 'namespace-6.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull namespace updates from Christian Brauner:
 "This contains a larger set of changes around the generic namespace
  infrastructure of the kernel.

  Each specific namespace type (net, cgroup, mnt, ...) embedds a struct
  ns_common which carries the reference count of the namespace and so
  on.

  We open-coded and cargo-culted so many quirks for each namespace type
  that it just wasn't scalable anymore. So given there's a bunch of new
  changes coming in that area I've started cleaning all of this up.

  The core change is to make it possible to correctly initialize every
  namespace uniformly and derive the correct initialization settings
  from the type of the namespace such as namespace operations, namespace
  type and so on. This leaves the new ns_common_init() function with a
  single parameter which is the specific namespace type which derives
  the correct parameters statically. This also means the compiler will
  yell as soon as someone does something remotely fishy.

  The ns_common_init() addition also allows us to remove ns_alloc_inum()
  and drops any special-casing of the initial network namespace in the
  network namespace initialization code that Linus complained about.

  Another part is reworking the reference counting. The reference
  counting was open-coded and copy-pasted for each namespace type even
  though they all followed the same rules. This also removes all open
  accesses to the reference count and makes it private and only uses a
  very small set of dedicated helpers to manipulate them just like we do
  for e.g., files.

  In addition this generalizes the mount namespace iteration
  infrastructure introduced a few cycles ago. As reminder, the vfs makes
  it possible to iterate sequentially and bidirectionally through all
  mount namespaces on the system or all mount namespaces that the caller
  holds privilege over. This allow userspace to iterate over all mounts
  in all mount namespaces using the listmount() and statmount() system
  call.

  Each mount namespace has a unique identifier for the lifetime of the
  systems that is exposed to userspace. The network namespace also has a
  unique identifier working exactly the same way. This extends the
  concept to all other namespace types.

  The new nstree type makes it possible to lookup namespaces purely by
  their identifier and to walk the namespace list sequentially and
  bidirectionally for all namespace types, allowing userspace to iterate
  through all namespaces. Looking up namespaces in the namespace tree
  works completely locklessly.

  This also means we can move the mount namespace onto the generic
  infrastructure and remove a bunch of code and members from struct
  mnt_namespace itself.

  There's a bunch of stuff coming on top of this in the future but for
  now this uses the generic namespace tree to extend a concept
  introduced first for pidfs a few cycles ago. For a while now we have
  supported pidfs file handles for pidfds. This has proven to be very
  useful.

  This extends the concept to cover namespaces as well. It is possible
  to encode and decode namespace file handles using the common
  name_to_handle_at() and open_by_handle_at() apis.

  As with pidfs file handles, namespace file handles are exhaustive,
  meaning it is not required to actually hold a reference to nsfs in
  able to decode aka open_by_handle_at() a namespace file handle.
  Instead the FD_NSFS_ROOT constant can be passed which will let the
  kernel grab a reference to the root of nsfs internally and thus decode
  the file handle.

  Namespaces file descriptors can already be derived from pidfds which
  means they aren't subject to overmount protection bugs. IOW, it's
  irrelevant if the caller would not have access to an appropriate
  /proc/<pid>/ns/ directory as they could always just derive the
  namespace based on a pidfd already.

  It has the same advantage as pidfds. It's possible to reliably and for
  the lifetime of the system refer to a namespace without pinning any
  resources and to compare them trivially.

  Permission checking is kept simple. If the caller is located in the
  namespace the file handle refers to they are able to open it otherwise
  they must hold privilege over the owning namespace of the relevant
  namespace.

  The namespace file handle layout is exposed as uapi and has a stable
  and extensible format. For now it simply contains the namespace
  identifier, the namespace type, and the inode number. The stable
  format means that userspace may construct its own namespace file
  handles without going through name_to_handle_at() as they are already
  allowed for pidfs and cgroup file handles"

* tag 'namespace-6.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (65 commits)
  ns: drop assert
  ns: move ns type into struct ns_common
  nstree: make struct ns_tree private
  ns: add ns_debug()
  ns: simplify ns_common_init() further
  cgroup: add missing ns_common include
  ns: use inode initializer for initial namespaces
  selftests/namespaces: verify initial namespace inode numbers
  ns: rename to __ns_ref
  nsfs: port to ns_ref_*() helpers
  net: port to ns_ref_*() helpers
  uts: port to ns_ref_*() helpers
  ipv4: use check_net()
  net: use check_net()
  net-sysfs: use check_net()
  user: port to ns_ref_*() helpers
  time: port to ns_ref_*() helpers
  pid: port to ns_ref_*() helpers
  ipc: port to ns_ref_*() helpers
  cgroup: port to ns_ref_*() helpers
  ...
2025-09-29 11:20:29 -07:00
Linus Torvalds
722df25ddf Merge tag 'kernel-6.18-rc1.clone3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull copy_process updates from Christian Brauner:
 "This contains the changes to enable support for clone3() on nios2
  which apparently is still a thing.

  The more exciting part of this is that it cleans up the inconsistency
  in how the 64-bit flag argument is passed from copy_process() into the
  various other copy_*() helpers"

[ Fixed up rv ltl_monitor 32-bit support as per Sasha Levin in the merge ]

* tag 'kernel-6.18-rc1.clone3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  nios2: implement architecture-specific portion of sys_clone3
  arch: copy_thread: pass clone_flags as u64
  copy_process: pass clone_flags as u64 across calltree
  copy_sighand: Handle architectures where sizeof(unsigned long) < sizeof(u64)
2025-09-29 10:36:50 -07:00
Linus Torvalds
e571372101 Merge tag 'vfs-6.18-rc1.pidfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull pidfs updates from Christian Brauner:
 "This just contains a few changes to pid_nr_ns() to make it more robust
  and cleans up or improves a few users that ab- or misuse it currently"

* tag 'vfs-6.18-rc1.pidfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  pid: change task_state() to use task_ppid_nr_ns()
  pid: change bacct_add_tsk() to use task_ppid_nr_ns()
  pid: make __task_pid_nr_ns(ns => NULL) safe for zombie callers
  pid: Add a judgment for ns null in pid_nr_ns
2025-09-29 10:02:35 -07:00
Linus Torvalds
b7ce6fa90f Merge tag 'vfs-6.18-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull misc vfs updates from Christian Brauner:
 "This contains the usual selections of misc updates for this cycle.

  Features:

   - Add "initramfs_options" parameter to set initramfs mount options.
     This allows to add specific mount options to the rootfs to e.g.,
     limit the memory size

   - Add RWF_NOSIGNAL flag for pwritev2()

     Add RWF_NOSIGNAL flag for pwritev2. This flag prevents the SIGPIPE
     signal from being raised when writing on disconnected pipes or
     sockets. The flag is handled directly by the pipe filesystem and
     converted to the existing MSG_NOSIGNAL flag for sockets

   - Allow to pass pid namespace as procfs mount option

     Ever since the introduction of pid namespaces, procfs has had very
     implicit behaviour surrounding them (the pidns used by a procfs
     mount is auto-selected based on the mounting process's active
     pidns, and the pidns itself is basically hidden once the mount has
     been constructed)

     This implicit behaviour has historically meant that userspace was
     required to do some special dances in order to configure the pidns
     of a procfs mount as desired. Examples include:

     * In order to bypass the mnt_too_revealing() check, Kubernetes
       creates a procfs mount from an empty pidns so that user
       namespaced containers can be nested (without this, the nested
       containers would fail to mount procfs)

       But this requires forking off a helper process because you cannot
       just one-shot this using mount(2)

     * Container runtimes in general need to fork into a container
       before configuring its mounts, which can lead to security issues
       in the case of shared-pidns containers (a privileged process in
       the pidns can interact with your container runtime process)

       While SUID_DUMP_DISABLE and user namespaces make this less of an
       issue, the strict need for this due to a minor uAPI wart is kind
       of unfortunate

       Things would be much easier if there was a way for userspace to
       just specify the pidns they want. So this pull request contains
       changes to implement a new "pidns" argument which can be set
       using fsconfig(2):

           fsconfig(procfd, FSCONFIG_SET_FD, "pidns", NULL, nsfd);
           fsconfig(procfd, FSCONFIG_SET_STRING, "pidns", "/proc/self/ns/pid", 0);

       or classic mount(2) / mount(8):

           // mount -t proc -o pidns=/proc/self/ns/pid proc /tmp/proc
           mount("proc", "/tmp/proc", "proc", MS_..., "pidns=/proc/self/ns/pid");

  Cleanups:

   - Remove the last references to EXPORT_OP_ASYNC_LOCK

   - Make file_remove_privs_flags() static

   - Remove redundant __GFP_NOWARN when GFP_NOWAIT is used

   - Use try_cmpxchg() in start_dir_add()

   - Use try_cmpxchg() in sb_init_done_wq()

   - Replace offsetof() with struct_size() in ioctl_file_dedupe_range()

   - Remove vfs_ioctl() export

   - Replace rwlock() with spinlock in epoll code as rwlock causes
     priority inversion on preempt rt kernels

   - Make ns_entries in fs/proc/namespaces const

   - Use a switch() statement() in init_special_inode() just like we do
     in may_open()

   - Use struct_size() in dir_add() in the initramfs code

   - Use str_plural() in rd_load_image()

   - Replace strcpy() with strscpy() in find_link()

   - Rename generic_delete_inode() to inode_just_drop() and
     generic_drop_inode() to inode_generic_drop()

   - Remove unused arguments from fcntl_{g,s}et_rw_hint()

  Fixes:

   - Document @name parameter for name_contains_dotdot() helper

   - Fix spelling mistake

   - Always return zero from replace_fd() instead of the file descriptor
     number

   - Limit the size for copy_file_range() in compat mode to prevent a
     signed overflow

   - Fix debugfs mount options not being applied

   - Verify the inode mode when loading it from disk in minixfs

   - Verify the inode mode when loading it from disk in cramfs

   - Don't trigger automounts with RESOLVE_NO_XDEV

     If openat2() was called with RESOLVE_NO_XDEV it didn't traverse
     through automounts, but could still trigger them

   - Add FL_RECLAIM flag to show_fl_flags() macro so it appears in
     tracepoints

   - Fix unused variable warning in rd_load_image() on s390

   - Make INITRAMFS_PRESERVE_MTIME depend on BLK_DEV_INITRD

   - Use ns_capable_noaudit() when determining net sysctl permissions

   - Don't call path_put() under namespace semaphore in listmount() and
     statmount()"

* tag 'vfs-6.18-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (38 commits)
  fcntl: trim arguments
  listmount: don't call path_put() under namespace semaphore
  statmount: don't call path_put() under namespace semaphore
  pid: use ns_capable_noaudit() when determining net sysctl permissions
  fs: rename generic_delete_inode() and generic_drop_inode()
  init: INITRAMFS_PRESERVE_MTIME should depend on BLK_DEV_INITRD
  initramfs: Replace strcpy() with strscpy() in find_link()
  initrd: Use str_plural() in rd_load_image()
  initramfs: Use struct_size() helper to improve dir_add()
  initrd: Fix unused variable warning in rd_load_image() on s390
  fs: use the switch statement in init_special_inode()
  fs/proc/namespaces: make ns_entries const
  filelock: add FL_RECLAIM to show_fl_flags() macro
  eventpoll: Replace rwlock with spinlock
  selftests/proc: add tests for new pidns APIs
  procfs: add "pidns" mount option
  pidns: move is-ancestor logic to helper
  openat2: don't trigger automounts with RESOLVE_NO_XDEV
  namei: move cross-device check to __traverse_mounts
  namei: remove LOOKUP_NO_XDEV check from handle_mounts
  ...
2025-09-29 09:03:07 -07:00
Linus Torvalds
8f9736633f Merge tag 'trace-v6.17-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fixes from Steven Rostedt:

 - Fix buffer overflow in osnoise_cpu_write()

   The allocated buffer to read user space did not add a nul terminating
   byte after copying from user the string. It then reads the string,
   and if user space did not add a nul byte, the read will continue
   beyond the string.

   Add a nul terminating byte after reading the string.

 - Fix missing check for lockdown on tracing

   There's a path from kprobe events or uprobe events that can update
   the tracing system even if lockdown on tracing is activate. Add a
   check in the dynamic event path.

 - Add a recursion check for the function graph return path

   Now that fprobes can hook to the function graph tracer and call
   different code between the entry and the exit, the exit code may now
   call functions that are not called in entry. This means that the exit
   handler can possibly trigger recursion that is not caught and cause
   the system to crash.

   Add the same recursion checks in the function exit handler as exists
   in the entry handler path.

* tag 'trace-v6.17-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing: fgraph: Protect return handler from recursion loop
  tracing: dynevent: Add a missing lockdown check on dynevent
  tracing/osnoise: Fix slab-out-of-bounds in _parse_integer_limit()
2025-09-28 10:26:35 -07:00
Daniel Borkmann
4540aed51b bpf: Enforce expected_attach_type for tailcall compatibility
Yinhao et al. recently reported:

  Our fuzzer tool discovered an uninitialized pointer issue in the
  bpf_prog_test_run_xdp() function within the Linux kernel's BPF subsystem.
  This leads to a NULL pointer dereference when a BPF program attempts to
  deference the txq member of struct xdp_buff object.

The test initializes two programs of BPF_PROG_TYPE_XDP: progA acts as the
entry point for bpf_prog_test_run_xdp() and its expected_attach_type can
neither be of be BPF_XDP_DEVMAP nor BPF_XDP_CPUMAP. progA calls into a slot
of a tailcall map it owns. progB's expected_attach_type must be BPF_XDP_DEVMAP
to pass xdp_is_valid_access() validation. The program returns struct xdp_md's
egress_ifindex, and the latter is only allowed to be accessed under mentioned
expected_attach_type. progB is then inserted into the tailcall which progA
calls.

The underlying issue goes beyond XDP though. Another example are programs
of type BPF_PROG_TYPE_CGROUP_SOCK_ADDR. sock_addr_is_valid_access() as well
as sock_addr_func_proto() have different logic depending on the programs'
expected_attach_type. Similarly, a program attached to BPF_CGROUP_INET4_GETPEERNAME
should not be allowed doing a tailcall into a program which calls bpf_bind()
out of BPF which is only enabled for BPF_CGROUP_INET4_CONNECT.

In short, specifying expected_attach_type allows to open up additional
functionality or restrictions beyond what the basic bpf_prog_type enables.
The use of tailcalls must not violate these constraints. Fix it by enforcing
expected_attach_type in __bpf_prog_map_compatible().

Note that we only enforce this for tailcall maps, but not for BPF devmaps or
cpumaps: There, the programs are invoked through dev_map_bpf_prog_run*() and
cpu_map_bpf_prog_run*() which set up a new environment / context and therefore
these situations are not prone to this issue.

Fixes: 5e43f899b0 ("bpf: Check attach type at prog load time")
Reported-by: Yinhao Hu <dddddd@hust.edu.cn>
Reported-by: Kaiyan Mei <M202472210@hust.edu.cn>
Reviewed-by: Dongliang Mu <dzm91@hust.edu.cn>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/20250926171201.188490-1-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-09-27 06:24:27 -07:00
Masami Hiramatsu (Google)
0db0934e7f tracing: fgraph: Protect return handler from recursion loop
function_graph_enter_regs() prevents itself from recursion by
ftrace_test_recursion_trylock(), but __ftrace_return_to_handler(),
which is called at the exit, does not prevent such recursion.
Therefore, while it can prevent recursive calls from
fgraph_ops::entryfunc(), it is not able to prevent recursive calls
to fgraph from fgraph_ops::retfunc(), resulting in a recursive loop.
This can lead an unexpected recursion bug reported by Menglong.

 is_endbr() is called in __ftrace_return_to_handler -> fprobe_return
  -> kprobe_multi_link_exit_handler -> is_endbr.

To fix this issue, acquire ftrace_test_recursion_trylock() in the
__ftrace_return_to_handler() after unwind the shadow stack to mark
this section must prevent recursive call of fgraph inside user-defined
fgraph_ops::retfunc().

This is essentially a fix to commit 4346ba1604 ("fprobe: Rewrite
fprobe on function-graph tracer"), because before that fgraph was
only used from the function graph tracer. Fprobe allowed user to run
any callbacks from fgraph after that commit.

Reported-by: Menglong Dong <menglong8.dong@gmail.com>
Closes: https://lore.kernel.org/all/20250918120939.1706585-1-dongml2@chinatelecom.cn/
Fixes: 4346ba1604 ("fprobe: Rewrite fprobe on function-graph tracer")
Cc: stable@vger.kernel.org
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/175852292275.307379.9040117316112640553.stgit@devnote2
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Tested-by: Menglong Dong <menglong8.dong@gmail.com>
Acked-by: Menglong Dong <menglong8.dong@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-09-27 09:04:05 -04:00
Linus Torvalds
083fc6d7fa Merge tag 'sched-urgent-2025-09-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
 "Fix two dl_server regressions: a race that can end up leaving the
  dl_server stuck, and a dl_server throttling bug causing lag to fair
  tasks"

* tag 'sched-urgent-2025-09-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/deadline: Fix dl_server behaviour
  sched/deadline: Fix dl_server getting stuck
2025-09-26 12:30:23 -07:00
Linus Torvalds
2cea0ed979 Merge tag 'locking-urgent-2025-09-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking fixes from Ingo Molnar:
 "Fix a PI-futexes race, and fix a copy_process() futex cleanup bug"

* tag 'locking-urgent-2025-09-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  futex: Use correct exit on failure from futex_hash_allocate_default()
  futex: Prevent use-after-free during requeue-PI
2025-09-26 12:28:32 -07:00
Tao Chen
17f0d1f632 bpf: Add lookup_and_delete_elem for BPF_MAP_STACK_TRACE
The stacktrace map can be easily full, which will lead to failure in
obtaining the stack. In addition to increasing the size of the map,
another solution is to delete the stack_id after looking it up from
the user, so extend the existing bpf_map_lookup_and_delete_elem()
functionality to stacktrace map types.

Signed-off-by: Tao Chen <chen.dylane@linux.dev>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20250925175030.1615837-1-chen.dylane@linux.dev
2025-09-25 16:12:14 -07:00
Linus Torvalds
93a2744561 Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost
Pull virtio fixes from Michael Tsirkin:
 "virtio,vhost: last minute fixes

  More small fixes. Most notably this fixes crashes and hangs in
  vhost-net"

* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
  MAINTAINERS, mailmap: Update address for Peter Hilber
  virtio_config: clarify output parameters
  uapi: vduse: fix typo in comment
  vhost: Take a reference on the task in struct vhost_task.
  vhost-net: flush batched before enabling notifications
  Revert "vhost/net: Defer TX queue re-enable until after sendmsg"
  vhost-net: unbreak busy polling
  vhost-scsi: fix argument order in tport allocation error message
2025-09-25 08:06:03 -07:00
Menglong Dong
378b770819 sched: Make migrate_{en,dis}able() inline
For now, migrate_enable and migrate_disable are global, which makes them
become hotspots in some case. Take BPF for example, the function calling
to migrate_enable and migrate_disable in BPF trampoline can introduce
significant overhead, and following is the 'perf top' of FENTRY's
benchmark (./tools/testing/selftests/bpf/bench trig-fentry):

  54.63% bpf_prog_2dcccf652aac1793_bench_trigger_fentry [k]
                 bpf_prog_2dcccf652aac1793_bench_trigger_fentry
  10.43% [kernel] [k] migrate_enable
  10.07% bpf_trampoline_6442517037 [k] bpf_trampoline_6442517037
  8.06% [kernel] [k] __bpf_prog_exit_recur
  4.11% libc.so.6 [.] syscall
  2.15% [kernel] [k] entry_SYSCALL_64
  1.48% [kernel] [k] memchr_inv
  1.32% [kernel] [k] fput
  1.16% [kernel] [k] _copy_to_user
  0.73% [kernel] [k] bpf_prog_test_run_raw_tp

So in this commit, we make migrate_enable/migrate_disable inline to obtain
better performance. The struct rq is defined internally in
kernel/sched/sched.h, and the field "nr_pinned" is accessed in
migrate_enable/migrate_disable, which makes it hard to make them inline.

Alexei Starovoitov suggests to generate the offset of "nr_pinned" in [1],
so we can define the migrate_enable/migrate_disable in
include/linux/sched.h and access "this_rq()->nr_pinned" with
"(void *)this_rq() + RQ_nr_pinned".

The offset of "nr_pinned" is generated in include/generated/rq-offsets.h
by kernel/sched/rq-offsets.c.

Generally speaking, we move the definition of migrate_enable and
migrate_disable to include/linux/sched.h from kernel/sched/core.c. The
calling to __set_cpus_allowed_ptr() is leaved in ___migrate_enable().

The "struct rq" is not available in include/linux/sched.h, so we can't
access the "runqueues" with this_cpu_ptr(), as the compilation will fail
in this_cpu_ptr() -> raw_cpu_ptr() -> __verify_pcpu_ptr():
  typeof((ptr) + 0)

So we introduce the this_rq_raw() and access the runqueues with
arch_raw_cpu_ptr/PERCPU_PTR directly.

The variable "runqueues" is not visible in the kernel modules, and export
it is not a good idea. As Peter Zijlstra advised in [2], we define and
export migrate_enable/migrate_disable in kernel/sched/core.c too, and use
them for the modules.

Before this patch, the performance of BPF FENTRY is:

  fentry         :  113.030 ± 0.149M/s
  fentry         :  112.501 ± 0.187M/s
  fentry         :  112.828 ± 0.267M/s
  fentry         :  115.287 ± 0.241M/s

After this patch, the performance of BPF FENTRY increases to:

  fentry         :  143.644 ± 0.670M/s
  fentry         :  149.764 ± 0.362M/s
  fentry         :  149.642 ± 0.156M/s
  fentry         :  145.263 ± 0.221M/s

Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/bpf/CAADnVQ+5sEDKHdsJY5ZsfGDO_1SEhhQWHrt2SMBG5SYyQ+jt7w@mail.gmail.com/ [1]
Link: https://lore.kernel.org/all/20250819123214.GH4067720@noisy.programming.kicks-ass.net/ [2]
2025-09-25 09:57:16 +02:00
Peter Zijlstra
a3a70caf79 sched/deadline: Fix dl_server behaviour
John reported undesirable behaviour with the dl_server since commit:
cccb45d7c4 ("sched/deadline: Less agressive dl_server handling").

When starving fair tasks on purpose (starting spinning FIFO tasks),
his fair workload, which often goes (briefly) idle, would delay fair
invocations for a second, running one invocation per second was both
unexpected and terribly slow.

The reason this happens is that when dl_se->server_pick_task() returns
NULL, indicating no runnable tasks, it would yield, pushing any later
jobs out a whole period (1 second).

Instead simply stop the server. This should restore behaviour in that
a later wakeup (which restarts the server) will be able to continue
running (subject to the CBS wakeup rules).

Notably, this does not re-introduce the behaviour cccb45d7c4 set
out to solve, any start/stop cycle is naturally throttled by the timer
period (no active cancel).

Fixes: cccb45d7c4 ("sched/deadline: Less agressive dl_server handling")
Reported-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: John Stultz <jstultz@google.com>
2025-09-25 09:51:50 +02:00
Peter Zijlstra
4ae8d9aa9f sched/deadline: Fix dl_server getting stuck
John found it was easy to hit lockup warnings when running locktorture
on a 2 CPU VM, which he bisected down to: commit cccb45d7c4
("sched/deadline: Less agressive dl_server handling").

While debugging it seems there is a chance where we end up with the
dl_server dequeued, with dl_se->dl_server_active. This causes
dl_server_start() to return without enqueueing the dl_server, thus it
fails to run when RT tasks starve the cpu.

When this happens, dl_server_timer() catches the
'!dl_se->server_has_tasks(dl_se)' case, which then calls
replenish_dl_entity() and dl_server_stopped() and finally return
HRTIMER_NO_RESTART.

This ends in no new timer and also no enqueue, leaving the dl_server
'dead', allowing starvation.

What should have happened is for the bandwidth timer to start the
zero-laxity timer, which in turn would enqueue the dl_server and cause
dl_se->server_pick_task() to be called -- which will stop the
dl_server if no fair tasks are observed for a whole period.

IOW, it is totally irrelevant if there are fair tasks at the moment of
bandwidth refresh.

This removes all dl_se->server_has_tasks() users, so remove the whole
thing.

Fixes: cccb45d7c4 ("sched/deadline: Less agressive dl_server handling")
Reported-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: John Stultz <jstultz@google.com>
2025-09-25 09:51:50 +02:00
Christian Brauner
af075603f2 ns: drop assert
Otherwise we warn when e.g., no namespaces are configured but the
initial namespace for is still around.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-09-25 09:23:54 +02:00
Christian Brauner
4055526d35 ns: move ns type into struct ns_common
It's misplaced in struct proc_ns_operations and ns->ops might be NULL if
the namespace is compiled out but we still want to know the type of the
namespace for the initial namespace struct.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-09-25 09:23:54 +02:00
Christian Brauner
10cdfcd37a nstree: make struct ns_tree private
Don't expose it directly. There's no need to do that.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-09-25 09:23:47 +02:00
Linus Torvalds
bf40f4b877 Merge tag 'probes-fixes-v6.17-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull probes fixes from Masami Hiramatsu:

 - fprobe: Even if there is a memory allocation failure, try to remove
   the addresses recorded until then from the filter. Previously we just
   skipped it.

 - tracing: dynevent: Add a missing lockdown check on dynevent. This
   dynevent is the interface for all probe events. Thus if there is no
   check, any probe events can be added after lock down the tracefs.

* tag 'probes-fixes-v6.17-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing: dynevent: Add a missing lockdown check on dynevent
  tracing: fprobe: Fix to remove recorded module addresses from filter
2025-09-24 19:17:07 -07:00
Kees Cook
23ef9d4397 kcfi: Rename CONFIG_CFI_CLANG to CONFIG_CFI
The kernel's CFI implementation uses the KCFI ABI specifically, and is
not strictly tied to a particular compiler. In preparation for GCC
supporting KCFI, rename CONFIG_CFI_CLANG to CONFIG_CFI (along with
associated options).

Use new "transitional" Kconfig option for old CONFIG_CFI_CLANG that will
enable CONFIG_CFI during olddefconfig.

Reviewed-by: Linus Walleij <linus.walleij@linaro.org>
Reviewed-by: Nathan Chancellor <nathan@kernel.org>
Link: https://lore.kernel.org/r/20250923213422.1105654-3-kees@kernel.org
Signed-off-by: Kees Cook <kees@kernel.org>
2025-09-24 14:29:14 -07:00
Will Deacon
4e4e36dce3 Merge branch 'for-next/uprobes' into for-next/core
* for-next/uprobes:
  arm64: probes: Fix incorrect bl/blr address and register usage
  uprobes: uprobe_warn should use passed task
  arm64: Kconfig: Remove GCS restrictions on UPROBES
  arm64: uprobes: Add GCS support to uretprobes
  arm64: probes: Add GCS support to bl/blr/ret
  arm64: uaccess: Add additional userspace GCS accessors
  arm64: uaccess: Move existing GCS accessors definitions to gcs.h
  arm64: probes: Break ret out from bl/blr
2025-09-24 16:35:06 +01:00
Masami Hiramatsu (Google)
456c32e3c4 tracing: dynevent: Add a missing lockdown check on dynevent
Since dynamic_events interface on tracefs is compatible with
kprobe_events and uprobe_events, it should also check the lockdown
status and reject if it is set.

Link: https://lore.kernel.org/all/175824455687.45175.3734166065458520748.stgit@devnote2/

Fixes: 17911ff38a ("tracing: Add locked_down checks to the open calls of files created for tracefs")
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Cc: stable@vger.kernel.org
2025-09-25 00:22:46 +09:00
Masami Hiramatsu (Google)
c539feff3c tracing: fprobe: Fix to remove recorded module addresses from filter
Even if there is a memory allocation failure in fprobe_addr_list_add(),
there is a partial list of module addresses. So remove the recorded
addresses from filter if exists.
This also removes the redundant ret local variable.

Fixes: a3dc2983ca ("tracing: fprobe: Cleanup fprobe hash when module unloading")
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Cc: stable@vger.kernel.org
Reviewed-by: Menglong Dong <menglong8.dong@gmail.com>
2025-09-24 23:18:26 +09:00
Jiri Olsa
4363264111 uprobe: Do not emulate/sstep original instruction when ip is changed
If uprobe handler changes instruction pointer we still execute single
step) or emulate the original instruction and increment the (new) ip
with its length.

This makes the new instruction pointer bogus and application will
likely crash on illegal instruction execution.

If user decided to take execution elsewhere, it makes little sense
to execute the original instruction, so let's skip it.

Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20250916215301.664963-3-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-09-24 02:25:06 -07:00
Jiri Olsa
7384893d97 bpf: Allow uprobe program to change context registers
Currently uprobe (BPF_PROG_TYPE_KPROBE) program can't write to the
context registers data. While this makes sense for kprobe attachments,
for uprobe attachment it might make sense to be able to change user
space registers to alter application execution.

Since uprobe and kprobe programs share the same type (BPF_PROG_TYPE_KPROBE),
we can't deny write access to context during the program load. We need
to check on it during program attachment to see if it's going to be
kprobe or uprobe.

Storing the program's write attempt to context and checking on it
during the attachment.

Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20250916215301.664963-2-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-09-24 02:25:06 -07:00
Sebastian Andrzej Siewior
4ec3c15462 futex: Use correct exit on failure from futex_hash_allocate_default()
copy_process() uses the wrong error exit path from futex_hash_allocate_default().
After exiting from futex_hash_allocate_default(), neither tasklist_lock
nor siglock has been acquired. The exit label bad_fork_core_free unlocks
both of these locks which is wrong.

The next exit label, bad_fork_cancel_cgroup, is the correct exit.
sched_cgroup_fork() did not allocate any resources that need to freed.

Use bad_fork_cancel_cgroup on error exit from futex_hash_allocate_default().

Fixes: 7c4f75a21f ("futex: Allow automatic allocation of process wide futex hash")
Reported-by: syzbot+80cb3cc5c14fad191a10@syzkaller.appspotmail.com
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Closes: https://lore.kernel.org/all/68cb1cbd.050a0220.2ff435.0599.GAE@google.com
2025-09-24 09:20:02 +02:00
Tejun Heo
df10932ad7 Revert "sched_ext: Use rhashtable_lookup() instead of rhashtable_lookup_fast()"
This reverts commit c8191ee8e6 which triggers
the following suspicious RCU usage warning:

[    6.647598] =============================
[    6.647603] WARNING: suspicious RCU usage
[    6.647605] 6.17.0-rc7-virtme #1 Not tainted
[    6.647608] -----------------------------
[    6.647608] ./include/linux/rhashtable.h:602 suspicious rcu_dereference_check() usage!
[    6.647610]
[    6.647610] other info that might help us debug this:
[    6.647610]
[    6.647612]
[    6.647612] rcu_scheduler_active = 2, debug_locks = 1
[    6.647613] 1 lock held by swapper/10/0:
[    6.647614]  #0: ffff8b14bbb3cc98 (&rq->__lock){-.-.}-{2:2}, at:
+raw_spin_rq_lock_nested+0x20/0x90
[    6.647630]
[    6.647630] stack backtrace:
[    6.647633] CPU: 10 UID: 0 PID: 0 Comm: swapper/10 Not tainted 6.17.0-rc7-virtme #1
+PREEMPT(full)
[    6.647643] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[    6.647646] Sched_ext: beerland_1.0.2_g27d63fc3_x86_64_unknown_linux_gnu (enabled+all)
[    6.647648] Call Trace:
[    6.647652]  <IRQ>
[    6.647655]  dump_stack_lvl+0x78/0xe0
[    6.647665]  lockdep_rcu_suspicious+0x14a/0x1b0
[    6.647672]  __rhashtable_lookup.constprop.0+0x1d5/0x250
[    6.647680]  find_dsq_for_dispatch+0xbc/0x190
[    6.647684]  do_enqueue_task+0x25b/0x550
[    6.647689]  enqueue_task_scx+0x21d/0x360
[    6.647692]  ? trace_lock_acquire+0x22/0xb0
[    6.647695]  enqueue_task+0x2e/0xd0
[    6.647698]  ttwu_do_activate+0xa2/0x290
[    6.647703]  sched_ttwu_pending+0xfd/0x250
[    6.647706]  __flush_smp_call_function_queue+0x1cd/0x610
[    6.647714]  __sysvec_call_function_single+0x34/0x150
[    6.647720]  sysvec_call_function_single+0x6e/0x80
[    6.647726]  </IRQ>
[    6.647726]  <TASK>
[    6.647727]  asm_sysvec_call_function_single+0x1a/0x20

Reported-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-23 20:38:23 -10:00
Martin KaFai Lau
34f033a6c9 Merge branch 'bpf-next/xdp_pull_data' into 'bpf-next/master'
Merge the xdp_pull_data stable branch into the master branch. No conflict.

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2025-09-23 16:23:58 -07:00
Amery Hung
0e7a733ab3 bpf: Clear packet pointers after changing packet data in kfuncs
bpf_xdp_pull_data() may change packet data and therefore packet pointers
need to be invalidated. Add bpf_xdp_pull_data() to the special kfunc
list instead of introducing a new KF_ flag until there are more kfuncs
changing packet data.

Signed-off-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20250922233356.3356453-5-ameryhung@gmail.com
2025-09-23 13:35:12 -07:00
Tejun Heo
ebfd5226ec sched_ext: Merge branch 'for-6.17-fixes' into for-6.18
Pull sched_ext/for-6.17-fixes to receive:

 55ed11b181 ("sched_ext: idle: Handle migration-disabled tasks in BPF code")

which conflicts with the following commit in for-6.18:

 2407bae23d ("sched_ext: Add the @sch parameter to ext_idle helpers")

The conflict is a simple context conflict which can be resolved by taking
the updated parts from both commits.

Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-23 09:10:20 -10:00
Leon Hwang
ccb4f5d91e bpf: Allow union argument in trampoline based programs
Currently, functions with 'union' arguments cannot be traced with
fentry/fexit:

bpftrace -e 'fentry:release_pages { exit(); }' -v

The function release_pages arg0 type UNION is unsupported.

The type of the 'release_pages' arg0 is defined as:

typedef union {
	struct page **pages;
	struct folio **folios;
	struct encoded_page **encoded_pages;
} release_pages_arg __attribute__ ((__transparent_union__));

This patch relaxes the restriction by allowing function arguments of type
'union' to be traced in verifier.

Reviewed-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
Link: https://lore.kernel.org/r/20250919044110.23729-2-leon.hwang@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-09-23 12:07:46 -07:00