Commit Graph

49620 Commits

Author SHA1 Message Date
KP Singh
603b441623 bpf: Update the bpf_prog_calc_tag to use SHA256
Exclusive maps restrict map access to specific programs using a hash.
The current hash used for this is SHA1, which is prone to collisions.
This patch uses SHA256, which  is more resilient against
collisions. This new hash is stored in bpf_prog and used by the verifier
to determine if a program can access a given exclusive map.

The original 64-bit tags are kept, as they are used by users as a short,
possibly colliding program identifier for non-security purposes.

Signed-off-by: KP Singh <kpsingh@kernel.org>
Link: https://lore.kernel.org/r/20250914215141.15144-2-kpsingh@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-09-18 19:10:20 -07:00
Kumar Kartikeya Dwivedi
1512231b6c bpf: Enforce RCU protection for KF_RCU_PROTECTED
Currently, KF_RCU_PROTECTED only applies to iterator APIs and that too
in a convoluted fashion: the presence of this flag on the kfunc is used
to set MEM_RCU in iterator type, and the lack of RCU protection results
in an error only later, once next() or destroy() methods are invoked on
the iterator. While there is no bug, this is certainly a bit
unintuitive, and makes the enforcement of the flag iterator specific.

In the interest of making this flag useful for other upcoming kfuncs,
e.g. scx_bpf_cpu_curr() [0][1], add enforcement for invoking the kfunc
in an RCU critical section in general.

This would also mean that iterator APIs using KF_RCU_PROTECTED will
error out earlier, instead of throwing an error for lack of RCU CS
protection when next() or destroy() methods are invoked.

In addition to this, if the kfuncs tagged KF_RCU_PROTECTED return a
pointer value, ensure that this pointer value is only usable in an RCU
critical section. There might be edge cases where the return value is
special and doesn't need to imply MEM_RCU semantics, but in general, the
assumption should hold for the majority of kfuncs, and we can revisit
things if necessary later.

  [0]: https://lore.kernel.org/all/20250903212311.369697-3-christian.loehle@arm.com
  [1]: https://lore.kernel.org/all/20250909195709.92669-1-arighi@nvidia.com

Tested-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250917032755.4068726-2-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-09-18 15:36:17 -07:00
Linus Torvalds
097a6c336d Merge tag 'trace-rv-v6.17-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull runtime verifier fixes from Steven Rostedt:

 - Fix build in some RISC-V flavours

   Some system calls only are available for the 64bit RISC-V machines.
   #ifdef out the cases of clock_nanosleep and futex in the sleep
   monitor if they are not supported by the architecture.

 - Fix wrong cast, obsolete after refactoring

   Use container_of() to get to the rv_monitor structure from the
   enable_monitors_next() 'p' pointer. The assignment worked only
   because the list field used happened to be the first field of the
   structure.

 - Remove redundant include files

   Some include files were listed twice. Remove the extra ones and sort
   the includes.

 - Fix missing unlock on failure

   There was an error path that exited the rv_register_monitor()
   function without releasing a lock. Change that to goto the lock
   release.

 - Add Gabriele Monaco to be Runtime Verifier maintainer

   Gabriele is doing most of the work on RV as well as collecting
   patches. Add him to the maintainers file for Runtime Verification.

* tag 'trace-rv-v6.17-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  rv: Add Gabriele Monaco as maintainer for Runtime Verification
  rv: Fix missing mutex unlock in rv_register_monitor()
  include/linux/rv.h: remove redundant include file
  rv: Fix wrong type cast in enabled_monitors_next()
  rv: Support systems with time64-only syscalls
2025-09-18 15:22:00 -07:00
Rafael J. Wysocki
ccf09357ff smp: Fix up and expand the smp_call_function_many() kerneldoc
The smp_call_function_many() kerneldoc comment got out of sync with the
function definition (bool parameter "wait" is incorrectly described as a
bitmask in it), so fix it up by copying the "wait" description from the
smp_call_function() kerneldoc and add information regarding the handling
of the local CPU to it.

Fixes: 49b3bd213a ("smp: Fix all kernel-doc warnings")
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2025-09-18 22:21:28 +02:00
Andrea Righi
ac6772e8bc sched_ext: Add migration-disabled counter to error state dump
Include the task's migration-disabled counter when dumping task state
during an error exit.

This can help diagnose cases where tasks can get stuck, because they're
unable to migrate elsewhere.

tj: s/nomig/no_mig/ for readability and consistency with other keys.

Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-18 08:54:57 -10:00
Jakub Kicinski
f2cdc4c22b Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.17-rc7).

No conflicts.

Adjacent changes:

drivers/net/ethernet/mellanox/mlx5/core/en/fs.h
  9536fbe10c ("net/mlx5e: Add PSP steering in local NIC RX")
  7601a0a462 ("net/mlx5e: Add a miss level for ipsec crypto offload")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-18 11:26:06 -07:00
Linus Torvalds
992d4e481e Merge tag 'probes-fixes-v6.17-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull probe fix from Masami Hiramatsu:

 - kprobe-event: Fix null-ptr-deref in trace_kprobe_create_internal(),
   by handling NULL return of kmemdup() correctly

* tag 'probes-fixes-v6.17-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing: kprobe-event: Fix null-ptr-deref in trace_kprobe_create_internal()
2025-09-17 16:52:26 -07:00
Wang Liang
dc3382fffd tracing: kprobe-event: Fix null-ptr-deref in trace_kprobe_create_internal()
A crash was observed with the following output:

Oops: general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] SMP KASAN PTI
KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
CPU: 1 UID: 0 PID: 2899 Comm: syz.2.399 Not tainted 6.17.0-rc5+ #5 PREEMPT(none)
RIP: 0010:trace_kprobe_create_internal+0x3fc/0x1440 kernel/trace/trace_kprobe.c:911
Call Trace:
 <TASK>
 trace_kprobe_create_cb+0xa2/0xf0 kernel/trace/trace_kprobe.c:1089
 trace_probe_create+0xf1/0x110 kernel/trace/trace_probe.c:2246
 dyn_event_create+0x45/0x70 kernel/trace/trace_dynevent.c:128
 create_or_delete_trace_kprobe+0x5e/0xc0 kernel/trace/trace_kprobe.c:1107
 trace_parse_run_command+0x1a5/0x330 kernel/trace/trace.c:10785
 vfs_write+0x2b6/0xd00 fs/read_write.c:684
 ksys_write+0x129/0x240 fs/read_write.c:738
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0x5d/0x2d0 arch/x86/entry/syscall_64.c:94
 </TASK>

Function kmemdup() may return NULL in trace_kprobe_create_internal(), add
check for it's return value.

Link: https://lore.kernel.org/all/20250916075816.3181175-1-wangliang74@huawei.com/

Fixes: 33b4e38baa ("tracing: kprobe-event: Allocate string buffers from heap")
Signed-off-by: Wang Liang <wangliang74@huawei.com>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2025-09-18 07:36:41 +09:00
Linus Torvalds
37889ceadd Merge tag 'sched_ext-for-6.17-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext
Pull sched_ext fixes from Tejun Heo:

 - Fix build failure when !FAIR_GROUP_SCHED && EXT_GROUP_SCHED

 - Revert "sched_ext: Skip per-CPU tasks in scx_bpf_reenqueue_local()"
   which was causing issues with per-CPU task scheduling and reenqueuing
   behavior

* tag 'sched_ext-for-6.17-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
  sched_ext, sched/core: Fix build failure when !FAIR_GROUP_SCHED && EXT_GROUP_SCHED
  Revert "sched_ext: Skip per-CPU tasks in scx_bpf_reenqueue_local()"
2025-09-17 13:27:31 -07:00
Linus Torvalds
05950213a9 Merge tag 'cgroup-for-6.17-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fixes from Tejun Heo:
 "This contains two cgroup changes. Both are pretty low risk.

  - Fix deadlock in cgroup destruction when repeatedly
    mounting/unmounting perf_event and net_prio controllers.

    The issue occurs because cgroup_destroy_wq has max_active=1, causing
    root destruction to wait for CSS offline operations that are queued
    behind it.

    The fix splits cgroup_destroy_wq into three separate workqueues to
    eliminate the blocking.

  - Set of->priv to NULL upon file release to make potential bugs to
    manifest as NULL pointer dereferences rather than use-after-free
    errors"

* tag 'cgroup-for-6.17-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup/psi: Set of->priv to NULL upon file release
  cgroup: split cgroup_destroy_wq into 3 workqueues
2025-09-17 13:22:08 -07:00
Chen Ridong
c49b5e89c4 cpuset: use partition_cpus_change for setting exclusive cpus
A previous patch has introduced a new helper function
partition_cpus_change(). Now replace the exclusive cpus setting logic
with this helper function.

Signed-off-by: Chen Ridong <chenridong@huawei.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-17 08:37:31 -10:00
Chen Ridong
de9f15e21c cpuset: use parse_cpulist for setting cpus.exclusive
Previous patches made parse_cpulist handle empty cpu mask input.
Now use this helper for exclusive cpus setting. Also, compute_trialcs_xcpus
can be called with empty cpus and handles it correctly.

Signed-off-by: Chen Ridong <chenridong@huawei.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-17 08:37:31 -10:00
Chen Ridong
27db824600 cpuset: introduce partition_cpus_change
Introduce the partition_cpus_change function to handle both regular CPU
set updates and exclusive CPU modifications, either of which may trigger
partition state changes. This generalized function will also be utilized
for exclusive CPU updates in subsequent patches.

With the introduction of compute_trialcs_excpus in a previous patch,
the trialcs->effective_xcpus field is now consistently computed and
maintained. Consequently, the legacy logic which assigned
**trialcs->allowed_cpus to a local 'xcpus' variable** when
trialcs->effective_xcpus was empty has been removed.

This removal is safe because when trialcs is not a partition member,
trialcs->effective_xcpus is now correctly populated with the intersection
of user_xcpus and the parent's effective_xcpus. This calculation inherently
covers the scenario previously handled by the removed code.

Signed-off-by: Chen Ridong <chenridong@huawei.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-17 08:37:31 -10:00
Chen Ridong
c636673980 cpuset: refactor cpus_allowed_validate_change
Refactor cpus_allowed_validate_change to handle the special case where
cpuset.cpus can be set even when violating partition sibling CPU
exclusivity rules. This differs from the general validation logic in
validate_change. Add a wrapper function to properly handle this
exceptional case.

The trialcs->prs_err field is cleared before performing validation checks
for both CPU changes and partition errors. If cpus_allowed_validate_change
fails its validation, trialcs->prs_err is set to PERR_NOTEXCL. If partition
validation fails, the specific error code returned by validate_partition
is assigned to trialcs->prs_err.

With the partition validation status now directly available through
trialcs->prs_err, the local boolean variable 'invalidate' becomes
redundant and can be safely removed.

Signed-off-by: Chen Ridong <chenridong@huawei.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-17 08:37:31 -10:00
Chen Ridong
7e05981ba3 cpuset: refactor out validate_partition
Refactor the validate_partition function to handle cpuset partition
validation when modifying cpuset.cpus. This refactoring also makes the
function reusable for handling cpuset.cpus.exclusive updates in subsequent
patches.

Signed-off-by: Chen Ridong <chenridong@huawei.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-17 08:37:30 -10:00
Chen Ridong
8daab66eb3 cpuset: introduce cpus_excl_conflict and mems_excl_conflict helpers
This patch adds cpus_excl_conflict() and mems_excl_conflict() helper
functions to improve code readability and maintainability. The exclusive
conflict checking follows these rules:

1. If either cpuset has the 'exclusive' flag set, their user_xcpus must
   not have any overlap.
2. If neither cpuset has the 'exclusive' flag set, their
   'cpuset.cpus.exclusive' (only for v2) values must not intersect.
3. The 'cpuset.cpus' of one cpuset must not form a subset of another
   cpuset's 'cpuset.cpus.exclusive'.

Signed-off-by: Chen Ridong <chenridong@huawei.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-17 08:37:30 -10:00
Chen Ridong
c5866c9a00 cpuset: refactor CPU mask buffer parsing logic
The current implementation contains redundant handling for empty mask
inputs, as cpulist_parse() already properly handles these cases. This
refactoring introduces a new helper function parse_cpuset_cpulist() to
consolidate CPU list parsing logic and eliminate special-case checks for
empty inputs.

Additionally, the effective_xcpus computation for trial cpusets has been
simplified. Rather than computing effective_xcpus only when exclusive_cpus
is set or when the cpuset forms a valid partition, we now recalculate it
on every cpuset.cpus update. This approach ensures consistency and allows
removal of redundant effective_xcpus logic in subsequent patches.

The trial cpuset's effective_xcpus calculation follows two distinct cases:
1. For member cpusets: effective_xcpus is determined by the intersection
   of cpuset->exclusive_cpus and the parent's effective_xcpus.
2. For non-member cpusets: effective_xcpus is derived from the intersection
   of user_xcpus and the parent's effective_xcpus.

Signed-off-by: Chen Ridong <chenridong@huawei.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-17 08:37:30 -10:00
Chen Ridong
86bbbd1f33 cpuset: Refactor exclusive CPU mask computation logic
The current compute_effective_exclusive_cpumask function handles multiple
scenarios with different input parameters, making the code difficult to
follow. This patch refactors it into two separate functions:
compute_excpus and compute_trialcs_excpus.

The compute_excpus function calculates the exclusive CPU mask for a given
input and excludes exclusive CPUs from sibling cpusets when cs's
exclusive_cpus is not explicitly set.

The compute_trialcs_excpus function specifically handles exclusive CPU
computation for trial cpusets used during CPU mask configuration updates,
and always excludes exclusive CPUs from sibling cpusets.

This refactoring significantly improves code readability and clarity,
making it explicit which function to call for each use case and what
parameters should be provided.

Signed-off-by: Chen Ridong <chenridong@huawei.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-17 08:37:30 -10:00
Chen Ridong
6a59fc4a3a cpuset: change return type of is_partition_[in]valid to bool
The functions is_partition_valid() and is_partition_invalid() logically
return boolean values, but were previously declared with return type
'int'. This patch changes their return type to 'bool' to better reflect
their semantic meaning and improve type safety.

Signed-off-by: Chen Ridong <chenridong@huawei.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-17 08:37:30 -10:00
Chen Ridong
bba0ccf829 cpuset: remove unused assignment to trialcs->partition_root_state
The trialcs->partition_root_state field is not used during the
configuration of 'cpuset.cpus' or 'cpuset.cpus.exclusive'. Therefore,
the assignment of values to this field can be safely removed.

Signed-off-by: Chen Ridong <chenridong@huawei.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-17 08:35:28 -10:00
Chen Ridong
b783a62655 cpuset: move the root cpuset write check earlier
The 'cpus' or 'mems' lists of the top_cpuset cannot be modified.
This check can be moved before acquiring any locks as a common code
block to improve efficiency and maintainability.

Signed-off-by: Chen Ridong <chenridong@huawei.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-17 08:35:17 -10:00
Eduard Zingerman
a3c73d629e bpf: dont report verifier bug for missing bpf_scc_visit on speculative path
Syzbot generated a program that triggers a verifier_bug() call in
maybe_exit_scc(). maybe_exit_scc() assumes that, when called for a
state with insn_idx in some SCC, there should be an instance of struct
bpf_scc_visit allocated for that SCC. Turns out the assumption does
not hold for speculative execution paths. See example in the next
patch.

maybe_scc_exit() is called from update_branch_counts() for states that
reach branch count of zero, meaning that path exploration for a
particular path is finished. Path exploration can finish in one of
three ways:
a. Verification error is found. In this case, update_branch_counts()
   is called only for non-speculative paths.
b. Top level BPF_EXIT is reached. Such instructions are never a part of
   an SCC, so compute_scc_callchain() in maybe_scc_exit() will return
   false, and maybe_scc_exit() will return early.
c. A checkpoint is reached and matched. Checkpoints are created by
   is_state_visited(), which calls maybe_enter_scc(), which allocates
   bpf_scc_visit instances for checkpoints within SCCs.

Hence, for non-speculative symbolic execution paths, the assumption
still holds: if maybe_scc_exit() is called for a state within an SCC,
bpf_scc_visit instance must exist.

This patch removes the verifier_bug() call for speculative paths.

Fixes: c9e31900b5 ("bpf: propagate read/precision marks over state graph backedges")
Reported-by: syzbot+3afc814e8df1af64b653@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/bpf/68c85acd.050a0220.2ff435.03a4.GAE@google.com/
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250916212251.3490455-1-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-09-17 11:19:58 -07:00
Sebastian Andrzej Siewior
3253cb49cb softirq: Allow to drop the softirq-BKL lock on PREEMPT_RT
softirqs are preemptible on PREEMPT_RT. There is synchronisation between
individual sections which disable bottom halves. This in turn means that
a forced threaded interrupt cannot preempt another forced threaded
interrupt. Instead it will PI-boost the other handler and wait for its
completion.

This is required because code within a softirq section is assumed to be
non-preemptible and may expect exclusive access to per-CPU resources
such as variables or pinned timers.

Code with such expectation has been identified and updated to use
local_lock_nested_bh() for locking of the per-CPU resource. This means the
softirq lock can be removed.

Disable the softirq synchronization, but add a new config switch
CONFIG_PREEMPT_RT_NEEDS_BH_LOCK which allows to re-enable the synchronized
behavior in case that there are issues, which haven't been detected yet.

The softirq_ctrl.cnt accounting remains to let the NOHZ code know if
softirqs are currently handled.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2025-09-17 16:25:41 +02:00
Sebastian Andrzej Siewior
fd4e876f59 softirq: Provide a handshake for canceling tasklets via polling
The tasklet_unlock_spin_wait() via tasklet_disable_in_atomic() is
provided for a few legacy tasklet users. The interface is used from
atomic context (which is either softirq or disabled preemption) on
non-PREEMPT_RT and relies on spinning until the tasklet callback
completes.

On PREEMPT_RT the context is never atomic but the busy polling logic
remains. It is possible that the thread invoking tasklet_unlock_spin_wait()
has higher priority than the tasklet. If both run on the same CPU the the
tasklet makes no progress and the thread trying to cancel the tasklet will
live-lock the system.

To avoid the lockup tasklet_unlock_spin_wait() uses local_bh_disable()/
enable() which utilizes the local_lock_t for synchronisation. This lock is
a central per-CPU BKL and about to be removed.

Solve this by acquire a lock in tasklet_action_common() which is held while
the tasklet's callback is invoked. This lock will be acquired from
tasklet_unlock_spin_wait() via tasklet_callback_cancel_wait_running().

After the tasklet completed tasklet_callback_sync_wait_running() drops the
lock and acquires it again. In order to avoid unlocking the lock even if
there is no cancel request, there is a cb_waiters counter which is
incremented during a cancel request.  Blocking on the lock will PI-boost
the tasklet if needed, ensuring progress is made.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2025-09-17 16:25:41 +02:00
Tejun Heo
a1eab4d813 sched_ext, sched/core: Fix build failure when !FAIR_GROUP_SCHED && EXT_GROUP_SCHED
While collecting SCX related fields in struct task_group into struct
scx_task_group, 6e6558a6bc ("sched_ext, sched/core: Factor out struct
scx_task_group") forgot update tg->scx_weight usage in tg_weight(), which
leads to build failure when CONFIG_FAIR_GROUP_SCHED is disabled but
CONFIG_EXT_GROUP_SCHED is enabled. Fix it.

Fixes: 6e6558a6bc ("sched_ext, sched/core: Factor out struct scx_task_group")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202509170230.MwZsJSWa-lkp@intel.com/
Tested-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-16 23:07:27 -10:00
Jeremy Linton
ba1afc94de uprobes: uprobe_warn should use passed task
uprobe_warn() is passed a task structure, yet its using current. For
the most part this shouldn't matter, but since a task structure is
provided, lets use it.

Fixes: 248d3a7b2f ("uprobes: Change uprobe_copy_process() to dup return_instances")
Signed-off-by: Jeremy Linton <jeremy.linton@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Will Deacon <will@kernel.org>
2025-09-16 21:34:49 +01:00
Marco Crivellari
dadb3ebcf3 workqueue: WQ_PERCPU added to alloc_workqueue users
Currently if a user enqueue a work item using schedule_delayed_work() the
used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use
WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to
schedule_work() that is using system_wq and queue_work(), that makes use
again of WORK_CPU_UNBOUND.
This lack of consistentcy cannot be addressed without refactoring the API.

alloc_workqueue() treats all queues as per-CPU by default, while unbound
workqueues must opt-in via WQ_UNBOUND.

This default is suboptimal: most workloads benefit from unbound queues,
allowing the scheduler to place worker threads where they’re needed and
reducing noise when CPUs are isolated.

This patch adds a new WQ_PERCPU flag to explicitly request the use of
the per-CPU behavior. Both flags coexist for one release cycle to allow
callers to transition their calls.

Once migration is complete, WQ_UNBOUND can be removed and unbound will
become the implicit default.

With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND),
any alloc_workqueue() caller that doesn’t explicitly specify WQ_UNBOUND
must now use WQ_PERCPU.

All existing users have been updated accordingly.

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-16 10:33:53 -10:00
Andrea Righi
0b47b6c354 Revert "sched_ext: Skip per-CPU tasks in scx_bpf_reenqueue_local()"
scx_bpf_reenqueue_local() can be called from ops.cpu_release() when a
CPU is taken by a higher scheduling class to give tasks queued to the
CPU's local DSQ a chance to be migrated somewhere else, instead of
waiting indefinitely for that CPU to become available again.

In doing so, we decided to skip migration-disabled tasks, under the
assumption that they cannot be migrated anyway.

However, when a higher scheduling class preempts a CPU, the running task
is always inserted at the head of the local DSQ as a migration-disabled
task. This means it is always skipped by scx_bpf_reenqueue_local(), and
ends up being confined to the same CPU even if that CPU is heavily
contended by other higher scheduling class tasks.

As an example, let's consider the following scenario:

 $ schedtool -a 0,1, -e yes > /dev/null
 $ sudo schedtool -F -p 99 -a 0, -e \
   stress-ng -c 1 --cpu-load 99 --cpu-load-slice 1000

The first task (SCHED_EXT) can run on CPU0 or CPU1. The second task
(SCHED_FIFO) is pinned to CPU0 and consumes ~99% of it. If the SCHED_EXT
task initially runs on CPU0, it will remain there because it always sees
CPU0 as "idle" in the short gaps left by the RT task, resulting in ~1%
utilization while CPU1 stays idle:

    0[||||||||||||||||||||||100.0%]   8[                        0.0%]
    1[                        0.0%]   9[                        0.0%]
    2[                        0.0%]  10[                        0.0%]
    3[                        0.0%]  11[                        0.0%]
    4[                        0.0%]  12[                        0.0%]
    5[                        0.0%]  13[                        0.0%]
    6[                        0.0%]  14[                        0.0%]
    7[                        0.0%]  15[                        0.0%]
  PID USER       PRI  NI  S CPU  CPU%▽MEM%   TIME+  Command
 1067 root        RT   0  R   0  99.0  0.2  0:31.16 stress-ng-cpu [run]
  975 arighi      20   0  R   0   1.0  0.0  0:26.32 yes

By allowing scx_bpf_reenqueue_local() to re-enqueue migration-disabled
tasks, the scheduler can choose to migrate them to other CPUs (CPU1 in
this case) via ops.enqueue(), leading to better CPU utilization:

    0[||||||||||||||||||||||100.0%]   8[                        0.0%]
    1[||||||||||||||||||||||100.0%]   9[                        0.0%]
    2[                        0.0%]  10[                        0.0%]
    3[                        0.0%]  11[                        0.0%]
    4[                        0.0%]  12[                        0.0%]
    5[                        0.0%]  13[                        0.0%]
    6[                        0.0%]  14[                        0.0%]
    7[                        0.0%]  15[                        0.0%]
  PID USER       PRI  NI  S CPU  CPU%▽MEM%   TIME+  Command
  577 root        RT   0  R   0 100.0  0.2  0:23.17 stress-ng-cpu [run]
  555 arighi      20   0  R   1 100.0  0.0  0:28.67 yes

It's debatable whether per-CPU tasks should be re-enqueued as well, but
doing so is probably safer: the scheduler can recognize re-enqueued
tasks through the %SCX_ENQ_REENQ flag, reassess their placement, and
either put them back at the head of the local DSQ or let another task
attempt to take the CPU.

This also prevents giving per-CPU tasks an implicit priority boost,
which would otherwise make them more likely to reclaim CPUs preempted by
higher scheduling classes.

Fixes: 97e13ecb02 ("sched_ext: Skip per-CPU tasks in scx_bpf_reenqueue_local()")
Cc: stable@vger.kernel.org # v6.15+
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-16 10:15:23 -10:00
pengdonglin
58ab6d25a1 cgroup/cpuset: Remove redundant rcu_read_lock/unlock() in spin_lock
Since commit a8bb74acd8 ("rcu: Consolidate RCU-sched update-side function definitions")
there is no difference between rcu_read_lock(), rcu_read_lock_bh() and
rcu_read_lock_sched() in terms of RCU read section and the relevant grace
period. That means that spin_lock(), which implies rcu_read_lock_sched(),
also implies rcu_read_lock().

There is no need no explicitly start a RCU read section if one has already
been started implicitly by spin_lock().

Simplify the code and remove the inner rcu_read_lock() invocation.

Cc: Waiman Long <longman@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Waiman Long <longman@redhat.com>
Signed-off-by: pengdonglin <pengdonglin@xiaomi.com>
Signed-off-by: pengdonglin <dolinux.peng@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-16 08:36:24 -10:00
pengdonglin
3ee4211ef8 cgroup: Remove redundant rcu_read_lock/unlock() in spin_lock
Since commit a8bb74acd8 ("rcu: Consolidate RCU-sched update-side function definitions")
there is no difference between rcu_read_lock(), rcu_read_lock_bh() and
rcu_read_lock_sched() in terms of RCU read section and the relevant grace
period. That means that spin_lock(), which implies rcu_read_lock_sched(),
also implies rcu_read_lock().

There is no need no explicitly start a RCU read section if one has already
been started implicitly by spin_lock().

Simplify the code and remove the inner rcu_read_lock() invocation.

Cc: Tejun Heo <tj@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Waiman Long <longman@redhat.com>
Signed-off-by: pengdonglin <pengdonglin@xiaomi.com>
Signed-off-by: pengdonglin <dolinux.peng@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-16 08:36:14 -10:00
Al Viro
1f6df58474 drop_collected_paths(): constify arguments
... and use that to constify the pointers in callers

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-09-15 21:26:43 -04:00
Al Viro
1b8abbb121 bpf...d_path(): constify path argument
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-09-15 21:17:08 -04:00
Eduard Zingerman
b13448dd64 bpf: potential double-free of env->insn_aux_data
Function bpf_patch_insn_data() has the following structure:

  static struct bpf_prog *bpf_patch_insn_data(... env ...)
  {
        struct bpf_prog *new_prog;
        struct bpf_insn_aux_data *new_data = NULL;

        if (len > 1) {
                new_data = vrealloc(...);  // <--------- (1)
                if (!new_data)
                        return NULL;

                env->insn_aux_data = new_data;  // <---- (2)
        }

        new_prog = bpf_patch_insn_single(env->prog, off, patch, len);
        if (IS_ERR(new_prog)) {
                ...
                vfree(new_data);   // <----------------- (3)
                return NULL;
        }
        ... happy path ...
  }

In case if bpf_patch_insn_single() returns an error the `new_data`
allocated at (1) will be freed at (3). However, at (2) this pointer
is stored in `env->insn_aux_data`. Which is freed unconditionally
by verifier.c:bpf_check() on both happy and error paths.
Thus, leading to double-free.

Fix this by removing vfree() call at (3), ownership over `new_data` is
already passed to `env->insn_aux_data` at this point.

Fixes: 77620d1267 ("bpf: use realloc in bpf_patch_insn_data")
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250912-patch-insn-data-double-free-v1-1-af05bd85a21a@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-09-15 13:04:21 -07:00
Kumar Kartikeya Dwivedi
2c89513395 bpf: Do not limit bpf_cgroup_from_id to current's namespace
The bpf_cgroup_from_id kfunc relies on cgroup_get_from_id to obtain the
cgroup corresponding to a given cgroup ID. This helper can be called in
a lot of contexts where the current thread can be random. A recent
example was its use in sched_ext's ops.tick(), to obtain the root cgroup
pointer. Since the current task can be whatever random user space task
preempted by the timer tick, this makes the behavior of the helper
unreliable.

Refactor out __cgroup_get_from_id as the non-namespace aware version of
cgroup_get_from_id, and change bpf_cgroup_from_id to make use of it.

There is no compatibility breakage here, since changing the namespace
against which the lookup is being done to the root cgroup namespace only
permits a wider set of lookups to succeed now. The cgroup IDs across
namespaces are globally unique, and thus don't need to be retranslated.

Reported-by: Dan Schatzberg <dschatzberg@meta.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20250915032618.1551762-2-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-09-15 10:53:15 -07:00
Mateusz Guzik
f99b391778 fs: rename generic_delete_inode() and generic_drop_inode()
generic_delete_inode() is rather misleading for what the routine is
doing. inode_just_drop() should be much clearer.

The new naming is inconsistent with generic_drop_inode(), so rename that
one as well with inode_ as the suffix.

No functional changes.

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-09-15 16:09:42 +02:00
Kaushlendra Kumar
1441edd129 refperf: Set reader_tasks to NULL after kfree()
Set reader_tasks to NULL after kfree() in ref_scale_cleanup() to
improve debugging experience with kernel debugging tools. This
follows the common pattern of NULLing pointers after freeing to
avoid dangling pointer issues during debugging sessions.

Setting pointers to NULL after freeing helps debugging tools like
kdgb,drgn, and other kernel debuggers by providing clear indication
that the memory has been freed and the pointer is no longer valid.

Suggested-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Kaushlendra Kumar <kaushlendra.kumar@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2025-09-15 05:20:09 -07:00
Kaushlendra Kumar
fb7855a6b5 refperf: Remove redundant kfree() after torture_stop_kthread()
Remove unnecessary kfree(main_task) call in ref_scale_cleanup() as
torture_stop_kthread() already handles the memory cleanup for the
task structure internally.

The additional kfree(main_task) call after torture_stop_kthread()
is redundant and confusing since torture_stop_kthread() sets the
pointer to NULL, making this a no-op.

This pattern is consistent with other torture test modules where
torture_stop_kthread() is called without explicit kfree() of the
task pointer, as the torture framework manages the task lifecycle
internally.

Signed-off-by: Kaushlendra Kumar <kaushlendra.kumar@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2025-09-15 05:20:09 -07:00
Zqiang
e6a43aeb71 srcu/tiny: Remove preempt_disable/enable() in srcu_gp_start_if_needed()
Currently, the srcu_gp_start_if_needed() is always be invoked in
preempt disable's critical section, this commit therefore remove
redundant preempt_disable/enable() in srcu_gp_start_if_needed()
and adds a call to lockdep_assert_preemption_disabled() in order
to enable lockdep to diagnose mistaken invocations of this function
from preempts-enabled code.

Fixes: 65b4a59557 ("srcu: Make Tiny SRCU explicitly disable preemption")
Signed-off-by: Zqiang <qiang.zhang@linux.dev>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2025-09-15 05:16:28 -07:00
Paul E. McKenney
1c77e862b8 srcu: Document srcu_flip() memory-barrier D relation to SRCU-fast
The smp_mb() memory barrier at the end of srcu_flip() has a comment,
but that comment does not make it clear that this memory barrier is an
optimization, as opposed to being needed for correctness.  This commit
therefore adds this information and points out that it is omitted
for SRCU-fast, where a much heavier weight synchronize_srcu() would
be required.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: <bpf@vger.kernel.org>
2025-09-15 05:16:28 -07:00
Rafael J. Wysocki
bd03c7020d Merge back earlier material related to system sleep for 6.18 2025-09-15 12:03:26 +02:00
Christian Loehle
1ebe8f7e78 PM: EM: Fix late boot with holes in CPU topology
Commit e3f1164fc9 ("PM: EM: Support late CPUs booting and capacity
adjustment") added a mechanism to handle CPUs that come up late by
retrying when any of the `cpufreq_cpu_get()` call fails.

However, if there are holes in the CPU topology (offline CPUs, e.g.
nosmt), the first missing CPU causes the loop to break, preventing
subsequent online CPUs from being updated.

Instead of aborting on the first missing CPU policy, loop through all
and retry if any were missing.

Fixes: e3f1164fc9 ("PM: EM: Support late CPUs booting and capacity adjustment")
Suggested-by: Kenneth Crudup <kenneth.crudup@gmail.com>
Reported-by: Kenneth Crudup <kenneth.crudup@gmail.com>
Link: https://lore.kernel.org/linux-pm/40212796-734c-4140-8a85-854f72b8144d@panix.com/
Cc: 6.9+ <stable@vger.kernel.org> # 6.9+
Signed-off-by: Christian Loehle <christian.loehle@arm.com>
Link: https://patch.msgid.link/20250831214357.2020076-1-christian.loehle@arm.com
[ rjw: Drop the new pr_debug() message which is not very useful ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-09-15 12:02:24 +02:00
Aaron Lu
0d4eaf8caf sched/fair: Do not balance task to a throttled cfs_rq
When doing load balance and the target cfs_rq is in throttled hierarchy,
whether to allow balancing there is a question.

The good side to allow balancing is: if the target CPU is idle or less
loaded and the being balanced task is holding some kernel resources,
then it seems a good idea to balance the task there and let the task get
the CPU earlier and release kernel resources sooner. The bad part is, if
the task is not holding any kernel resources, then the balance seems not
that useful.

While theoretically it's debatable, a performance test[0] which involves
200 cgroups and each cgroup runs hackbench(20 sender, 20 receiver) in
pipe mode showed a performance degradation on AMD Genoa when allowing
load balance to throttled cfs_rq. Analysis[1] showed hackbench doesn't
like task migration across LLC boundary. For this reason, add a check in
can_migrate_task() to forbid balancing to a cfs_rq that is in throttled
hierarchy. This reduced task migration a lot and performance restored.

[0]: https://lore.kernel.org/lkml/20250822110701.GB289@bytedance/
[1]: https://lore.kernel.org/lkml/20250903101102.GB42@bytedance/

Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
2025-09-15 09:38:38 +02:00
Aaron Lu
253b3f5872 sched/fair: Do not special case tasks in throttled hierarchy
With the introduction of task based throttle model, task in a throttled
hierarchy is allowed to continue to run till it gets throttled on its
ret2user path.

For this reason, remove those throttled_hierarchy() checks in the
following functions so that those tasks can get their turn as normal
tasks: dequeue_entities(), check_preempt_wakeup_fair() and
yield_to_task_fair().

The benefit of doing it this way is: if those tasks gets the chance to
run earlier and if they hold any kernel resources, they can release
those resources earlier. The downside is, if they don't hold any kernel
resouces, all they can do is to throttle themselves on their way back to
user space so the favor to let them run seems not that useful and for
check_preempt_wakeup_fair(), that favor may be bad for curr.

K Prateek Nayak pointed out prio_changed_fair() can send a throttled
task to check_preempt_wakeup_fair(), further tests showed the affinity
change path from move_queued_task() can also send a throttled task to
check_preempt_wakeup_fair(), that's why the check of task_is_throttled()
in that function.

Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2025-09-15 09:38:37 +02:00
Aaron Lu
fcd394866e sched/fair: update_cfs_group() for throttled cfs_rqs
With task based throttle model, tasks in a throttled hierarchy are
allowed to continue to run if they are running in kernel mode. For this
reason, PELT clock is not stopped for these cfs_rqs in throttled
hierarchy when they still have tasks running or queued.

Since PELT clock is not stopped, whether to allow update_cfs_group()
doing its job for cfs_rqs which are in throttled hierarchy but still
have tasks running/queued is a question.

The good side is, continue to run update_cfs_group() can get these
cfs_rq entities with an up2date weight and that up2date weight can be
useful to derive an accurate load for the CPU as well as ensure fairness
if multiple tasks of different cgroups are running on the same CPU.
OTOH, as Benjamin Segall pointed: when unthrottle comes around the most
likely correct distribution is the distribution we had at the time of
throttle.

In reality, either way may not matter that much if tasks in throttled
hierarchy don't run in kernel mode for too long. But in case that
happens, let these cfs_rq entities have an up2date weight seems a good
thing to do.

Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2025-09-15 09:38:37 +02:00
Aaron Lu
fe8d238e64 sched/fair: Propagate load for throttled cfs_rq
Before task based throttle model, propagating load will stop at a
throttled cfs_rq and that propagate will happen on unthrottle time by
update_load_avg().

Now that there is no update_load_avg() on unthrottle for throttled
cfs_rq and all load tracking is done by task related operations, let the
propagate happen immediately.

While at it, add a comment to explain why cfs_rqs that are not affected
by throttle have to be added to leaf cfs_rq list in
propagate_entity_cfs_rq() per my understanding of commit 0258bdfaff
("sched/fair: Fix unfairness caused by missing load decay").

Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
2025-09-15 09:38:37 +02:00
Zhen Ni
9b5096761c rv: Fix missing mutex unlock in rv_register_monitor()
If create_monitor_dir() fails, the function returns directly without
releasing rv_interface_lock. This leaves the mutex locked and causes
subsequent monitor registration attempts to deadlock.

Fix it by making the error path jump to out_unlock, ensuring that the
mutex is always released before returning.

Fixes: 24cbfe18d5 ("rv: Merge struct rv_monitor_def into struct rv_monitor")
Signed-off-by: Zhen Ni <zhen.ni@easystack.cn>
Reviewed-by: Gabriele Monaco <gmonaco@redhat.com>
Reviewed-by: Nam Cao <namcao@linutronix.de>
Link: https://lore.kernel.org/r/20250903065112.1878330-1-zhen.ni@easystack.cn
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
2025-09-15 08:36:35 +02:00
Nam Cao
de090d1cca rv: Fix wrong type cast in enabled_monitors_next()
Argument 'p' of enabled_monitors_next() is not a pointer to struct
rv_monitor, it is actually a pointer to the list_head inside struct
rv_monitor. Therefore it is wrong to cast 'p' to struct rv_monitor *.

This wrong type cast has been there since the beginning. But it still
worked because the list_head was the first field in struct rv_monitor_def.
This is no longer true since commit 24cbfe18d5 ("rv: Merge struct
rv_monitor_def into struct rv_monitor") moved the list_head, and this wrong
type cast became a functional problem.

Properly use container_of() instead.

Fixes: 24cbfe18d5 ("rv: Merge struct rv_monitor_def into struct rv_monitor")
Signed-off-by: Nam Cao <namcao@linutronix.de>
Reviewed-by: Gabriele Monaco <gmonaco@redhat.com>
Link: https://lore.kernel.org/r/20250806120911.989365-1-namcao@linutronix.de
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
2025-09-15 08:36:35 +02:00
Palmer Dabbelt
03ee64b5e5 rv: Support systems with time64-only syscalls
Some systems (like 32-bit RISC-V) only have the 64-bit time_t versions
of syscalls.  So handle the 32-bit time_t version of those being
undefined.

Fixes: f74f8bb246 ("rv: Add rtapp_sleep monitor")
Closes: https://lore.kernel.org/oe-kbuild-all/202508160204.SsFyNfo6-lkp@intel.com
Signed-off-by: Palmer Dabbelt <palmer@dabbelt.com>
Acked-by: Nam Cao <namcao@linutronix.de>
Link: https://lore.kernel.org/r/20250804194518.97620-2-palmer@dabbelt.com
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
2025-09-15 08:36:27 +02:00
Greg Kroah-Hartman
c319c4ec06 Merge 6.17-rc6 into driver-core-next
We need the driver core fixes in here to build on top of.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-09-15 08:26:05 +02:00
Linus Torvalds
8378c89172 Merge tag 'timers-urgent-2025-09-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fix from Ingo Molnar:
 "Fix a lost-timeout CPU hotplug bug in the hrtimer code, which can
  trigger with certain hardware configs and regular HZ"

* tag 'timers-urgent-2025-09-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  hrtimers: Unconditionally update target CPU base after offline timer migration
2025-09-14 08:38:05 -07:00