Commit Graph

1438672 Commits

Author SHA1 Message Date
Tejun Heo
cd6aab7367 sched_ext: Close sub-sched init race with post-init DEAD recheck
scx_sub_enable_workfn()'s init pass and scx_sub_disable() migration both
drop the rq lock to call __scx_init_task() against the other sched. A
TASK_DEAD @p can fall through sched_ext_dead() in that window.
sched_ext_dead() runs ops.exit_task() on the sched @p was attached to, not
on the sched whose init just completed, so the new allocation leaks.

Reuse the DEAD signal set by sched_ext_dead(). After __scx_init_task()
returns, take task_rq_lock(p) and check for DEAD; on hit, call
scx_sub_init_cancel_task() against the sub sched the init ran for and drop
@p; on miss, proceed as before.

Reported-by: zhidao su <suzhidao@xiaomi.com>
Link: https://lore.kernel.org/all/20260429133155.3825247-1-suzhidao@xiaomi.com/
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-05-10 10:08:16 -10:00
Tejun Heo
c941d7391f sched_ext: Close root-enable vs sched_ext_dead() race with SCX_TASK_INIT_BEGIN
scx_root_enable_workfn() drops the iter rq lock for ops.init_task() and a
TASK_DEAD @p can fall through sched_ext_dead() in that window. The race hits
when sched_ext_dead() observes SCX_TASK_INIT (the intermediate state before
@p->scx.sched is published) and dereferences NULL via SCX_HAS_OP(NULL,
exit_task), or observes SCX_TASK_NONE during the unlocked init window and
skips cleanup so exit_task() never runs.

Add SCX_TASK_INIT_BEGIN. The enable path writes NONE -> INIT_BEGIN under the
iter rq lock, then takes the rq lock again after init to walk INIT_BEGIN ->
INIT -> READY. sched_ext_dead() that wins the rq-lock race observes
INIT_BEGIN and sets DEAD without calling into ops; the post-init recheck
unwinds via scx_sub_init_cancel_task().

scx_fork() runs single-threaded against sched_ext_dead() (the task is not on
scx_tasks until scx_post_fork() adds it) so its INIT_BEGIN -> INIT walk
needs no rq-lock pairing; it rolls back to NONE on ops.init_task() failure.

The validation matrix grows the INIT_BEGIN row and the INIT_BEGIN -> DEAD
edge; INIT now requires INIT_BEGIN as the predecessor. scx_sub_disable()'s
migration writes INIT_BEGIN as a synthetic predecessor to satisfy the
tightened verification.

The sub-sched paths still race with sched_ext_dead() during the unlocked
init window. This will be fixed by the next patch.

Reported-by: zhidao su <suzhidao@xiaomi.com>
Link: https://lore.kernel.org/all/20260429133155.3825247-1-suzhidao@xiaomi.com/
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-05-10 10:08:16 -10:00
Tejun Heo
cceb8fa9cb sched_ext: Replace SCX_TASK_OFF_TASKS flag with SCX_TASK_DEAD state
SCX_TASK_OFF_TASKS marked tasks already through sched_ext_dead() so cgroup
task iteration would skip them. This can be expressed better with a task
state. Replace the flag with SCX_TASK_DEAD.

scx_disable_and_exit_task() resets state to NONE on its way out, so
sched_ext_dead() now sets DEAD after the wrapper returns. The validation
matrix grows NONE -> DEAD, warns on DEAD -> NONE, and tightens READY's
predecessor to INIT or ENABLED so the new DEAD value cannot silently
transition to READY.

Prepares for the following enable vs dead race fix.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-05-10 10:08:16 -10:00
Tejun Heo
938dd9ab2b sched_ext: Inline scx_init_task() and move RESET_RUNNABLE_AT into scx_set_task_state()
Prepare for the SCX_TASK_INIT_BEGIN/DEAD work that follows by collapsing the
scx_init_task() helper. Move the SCX_TASK_RESET_RUNNABLE_AT setting into
scx_set_task_state() on the INIT transition (it was set unconditionally at
every INIT site through the scx_init_task() helper), inline scx_init_task()
into scx_fork() and scx_root_enable_workfn(), and drop the helper.

As a side effect, scx_sub_disable() migration sequence now also sets
RESET_RUNNABLE_AT (it previously wrote INIT directly without going through
scx_init_task()). The flag triggers a runnable_at reset on the next
set_task_runnable(), which is harmless on a task that has just been moved
between scheds.

On root-enable, p->scx.flags is written without the task's rq lock. The task
isn't visible to scx yet, and a follow-up patch restores the lock-held
write.

v2: Note p->scx.flags rq-lock relaxation on root-enable path. (Andrea)

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-05-10 10:08:16 -10:00
Tejun Heo
6947bea4b7 sched_ext: Cleanups in preparation for the SCX_TASK_INIT_BEGIN/DEAD work
Cleanups in preparation for the state-machine work that follows:

- Convert three sub-sched call sites that open-code
  rcu_assign_pointer(p->scx.sched, ...) to scx_set_task_sched().

- Move scx_get_task_state()/scx_set_task_state() above the SCX task iter
  section so scx_task_iter_next_locked() can use them without a forward
  declaration.

No functional change.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-05-10 10:08:16 -10:00
Zqiang
ab28a0673d sched_ext: Use IRQ_WORK_INIT_HARD() to initialize sch->disable_irq_work
For built with PREEMPT_RT kernels, the scx_disable_irq_workfn() is
called from per-cpu irq_work kthreads context, this means that
when call the scx_dump_state() in the scx_disable_irq_workfn() to
output current->comm/pid, it always output current irq_work kthread's
comm/pid. this commit therefore use the IRQ_WORK_INIT_HARD() to
initialize sch->disable_irq_work to make scx_disable_irq_workfn() is
called from hardirq context.

Fixes: f4a6c506d1 ("sched_ext: Always bounce scx_disable() through irq_work")
Signed-off-by: Zqiang <qiang.zhang@linux.dev>
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-05-08 05:11:53 -10:00
Tejun Heo
1f91d0d582 sched_ext: Fix !CONFIG_EXT_SUB_SCHED build warnings
W=1 with CONFIG_EXT_SUB_SCHED=n flags 'err_msg' uninitialized and
'err_free_lb_resched' unused. Initialize err_msg and gate the label.

Signed-off-by: Tejun Heo <tj@kernel.org>
2026-05-07 14:16:59 -10:00
Tejun Heo
363a53749c sched_ext: Drop unused scx_find_sub_sched() stub
scx_find_sub_sched()'s only caller, scx_bpf_sub_dispatch(), is gated on
CONFIG_EXT_SUB_SCHED. When CONFIG_EXT_SUB_SCHED=n the caller compiles out
and the stub becomes dead code, tripping -Wunused-function on randconfigs.
Drop the stub.

Fixes: 25037af712 ("sched_ext: Add rhashtable lookup for sub-schedulers")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/all/202605080556.42PXw8U9-lkp@intel.com/
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-05-07 12:09:21 -10:00
Tejun Heo
4616a9c36b sched_ext: Move scx_error() out of scx_link_sched()'s lock region
scx_link_sched() holds scx_sched_lock. The scx_error() calls inside take the
same lock through scx_claim_exit() and deadlock. Move them out of the guard.

Fixes: 6b4576b097 ("sched_ext: Reject sub-sched attachment to a disabled parent")
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-05-07 11:05:50 -10:00
David Carlier
b34c82777a sched_ext: idle: Recheck prev_cpu after narrowing allowed mask
scx_select_cpu_dfl() narrows @allowed to @cpus_allowed & @p->cpus_ptr
when the BPF caller supplies a @cpus_allowed that differs from
@p->cpus_ptr and @p doesn't have full affinity. However,
@is_prev_allowed was computed against the original (wider)
@cpus_allowed, so the prev_cpu fast paths could pick a @prev_cpu that
is in @cpus_allowed but not in @p->cpus_ptr, violating the intended
invariant that the returned CPU is always usable by @p. The kernel
masks this via the SCX_EV_SELECT_CPU_FALLBACK fallback, but the
behavior contradicts the documented contract.

Move the @is_prev_allowed evaluation past the narrowing block so it
tests against the final @allowed mask.

Fixes: ee9a4e9279 ("sched_ext: idle: Properly handle invalid prev_cpu during idle selection")
Cc: stable@vger.kernel.org # v6.16+
Assisted-by: Claude <noreply@anthropic.com>
Signed-off-by: David Carlier <devnexen@gmail.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-05-04 11:01:04 -10:00
Tejun Heo
ff9eda4ea9 sched_ext: Skip past-sched_ext_dead() tasks in scx_task_iter_next_locked()
scx_task_iter's cgroup-scoped mode can return tasks whose
sched_ext_dead() has already completed: cgroup_task_dead() removes
from cset->tasks after sched_ext_dead() in finish_task_switch() and is
irq-work deferred on PREEMPT_RT. The global mode is fine -
sched_ext_dead() removes from scx_tasks via list_del_init() first.

Callers (sub-sched enable prep/abort/apply, scx_sub_disable(),
scx_fail_parent()) assume returned tasks are still on @sch and trip
WARN_ON_ONCE() or operate on torn-down state otherwise.

Set %SCX_TASK_OFF_TASKS in sched_ext_dead() under @p's rq lock and
have scx_task_iter_next_locked() skip flagged tasks under the same
lock. Setter and reader serialize on the per-task rq lock - no race.

Signed-off-by: Tejun Heo <tj@kernel.org>
2026-05-04 09:06:03 -10:00
Tejun Heo
60f21a2649 cgroup, sched_ext: Include exiting tasks in cgroup iter
a72f73c4dd ("cgroup: Don't expose dead tasks in cgroup") made
css_task_iter_advance() skip exiting tasks so cgroup.procs stays consistent
with waitpid() visibility. Unfortunately, this broke scx_task_iter.

scx_task_iter walks either scx_tasks (global) or a cgroup subtree via
css_task_iter() and the two modes are expected to cover the same set of
tasks. After the above change the cgroup-scoped mode silently skips tasks
past exit_signals() that are still on scx_tasks.

scx_sub_enable_workfn()'s abort path is one of the symptoms: an exiting
SCX_TASK_SUB_INIT task can race past the cgroup iter leaking
__scx_init_task() state. Other iterations share the same gap.

Add CSS_TASK_ITER_WITH_DEAD to opt out of the skip and use it from
scx_task_iter().

Fixes: b0e4c2f8a0 ("sched_ext: Implement cgroup subtree iteration for scx_task_iter")
Reported-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-05-04 09:06:03 -10:00
Cheng-Yang Chou
d99f7a32f0 sched_ext: Fix scx_flush_disable_work() UAF race
scx_flush_disable_work() calls irq_work_sync() followed by
kthread_flush_work() to ensure that the disable kthread work has
fully completed before bpf_scx_unreg() frees the SCX scheduler.

However, a concurrent scx_vexit() (e.g., triggered by a watchdog stall)
creates a race window between scx_claim_exit() and irq_work_queue():

  CPU A (scx_vexit (watchdog))        CPU B (bpf_scx_unreg)
  ----                                ----
  scx_claim_exit()
    atomic_try_cmpxchg(NONE->kind)
  stack_trace_save()
  vscnprintf()
                                      scx_disable()
                                        scx_claim_exit() -> FAIL
                                      scx_flush_disable_work()
                                        irq_work_sync()      // no-op: not queued yet
                                        kthread_flush_work() // no-op: not queued yet
                                      kobject_put(&sch->kobj) -> free %sch
  irq_work_queue() -> UAF on %sch
  scx_disable_irq_workfn()
    kthread_queue_work() -> UAF

The root cause is that CPU B's scx_flush_disable_work() returns after
syncing an irq_work that has not yet been queued, while CPU A is still
executing the code between scx_claim_exit() and irq_work_queue().

Loop until exit_kind reaches SCX_EXIT_DONE or SCX_EXIT_NONE, draining
disable_irq_work and disable_work in each pass. This ensures that any
work queued after the previous check is caught, while also correctly
handling cases where no disable was triggered (e.g., the
scx_sub_enable_workfn() abort path).

Fixes: 510a270554 ("sched_ext: sync disable_irq_work in bpf_scx_unreg()")
Reported-by: https://sashiko.dev/#/patchset/20260424100221.32407-1-icheng%40nvidia.com
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-04-28 07:40:03 -10:00
Kuba Piecuch
163f8b7f9a sched_ext: Call wakeup_preempt() in local_dsq_post_enq()
There are several edge cases (see linked thread) where an IMMED task
can be left lingering on a local DSQ if an RT task swoops in at the
wrong time. All of these edge cases are due to rq->next_class being idle
even after dispatching a task to rq's local DSQ. We should bump
rq->next_class to &ext_sched_class as soon as we've inserted a task into
the local DSQ.

To optimize the common case of rq->next_class == &ext_sched_class,
only call wakeup_preempt() if rq->next_class is below EXT. If next_class
is EXT or above, wakeup_preempt() is a no-op anyway.

This lets us also simplify the preempt_curr() logic a bit since
wakeup_preempt() will call preempt_curr() for us if next_class is
below EXT.

Link: https://lore.kernel.org/all/DHZPHUFXB4N3.2RY28MUEWBNYK@google.com/
Signed-off-by: Kuba Piecuch <jpiecuch@google.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-04-28 06:28:48 -10:00
Tejun Heo
deb7b2f93d sched_ext: Release cpus_read_lock on scx_link_sched() failure in root enable
scx_root_enable_workfn() takes cpus_read_lock() before
scx_link_sched(sch), but the `if (ret) goto err_disable` on failure
skips the matching cpus_read_unlock() - all other err_disable gotos
along this path drop the lock first.

scx_link_sched() only returns non-zero on the sub-sched path
(parent != NULL), so the leak path is unreachable via the root
caller today. Still, the unwind is out of line with the surrounding
paths.

Drop cpus_read_lock() before goto err_disable.

v2: Correct Fixes: tag (Andrea Righi).

Fixes: 25037af712 ("sched_ext: Add rhashtable lookup for sub-schedulers")
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-04-24 14:31:36 -10:00
Tejun Heo
05b4a9a9bc sched_ext: Reject NULL-sch callers in scx_bpf_task_set_slice/dsq_vtime
scx_prog_sched(aux) returns NULL for TRACING / SYSCALL BPF progs that
have no struct_ops association when the root scheduler has sub_attach
set. scx_bpf_task_set_slice() and scx_bpf_task_set_dsq_vtime() pass
that NULL into scx_task_on_sched(sch, p), which under
CONFIG_EXT_SUB_SCHED is rcu_access_pointer(p->scx.sched) == sch. For
any non-scx task p->scx.sched is NULL, so NULL == NULL returns true
and the authority gate is bypassed - a privileged but
non-struct_ops-associated prog can poke p->scx.slice /
p->scx.dsq_vtime on arbitrary tasks.

Reject !sch up front so the gate only admits callers with a resolved
scheduler.

Fixes: 245d09c594 ("sched_ext: Enforce scheduler ownership when updating slice and dsq_vtime")
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-04-24 14:31:36 -10:00
Tejun Heo
ea7c716a24 sched_ext: Refuse cross-task select_cpu_from_kfunc calls
select_cpu_from_kfunc() skipped pi_lock for @p when called from
ops.select_cpu() or another rq-locked SCX op, assuming the held lock
protects @p. scx_bpf_select_cpu_dfl() / __scx_bpf_select_cpu_and() accept an
arbitrary KF_RCU task_struct, so a caller in e.g. ops.select_cpu(p1) or
ops.enqueue(p1) can pass some other p2 - the held pi_lock / rq lock is p1's,
not p2's - and reading p2->cpus_ptr / nr_cpus_allowed races with
set_cpus_allowed_ptr() and migrate_disable_switch() on another CPU.

Abort the scheduler on cross-task calls in both branches: for
ops.select_cpu() use scx_kf_arg_task_ok() to verify @p is the wake-up
task recorded in current->scx.kf_tasks[] by SCX_CALL_OP_TASK_RET();
for other rq-locked SCX ops compare task_rq(p) against scx_locked_rq().

v2: Switch the in_select_cpu cross-task check from direct_dispatch_task
    comparison to scx_kf_arg_task_ok(). The former spuriously rejects when
    ops.select_cpu() calls scx_bpf_dsq_insert() first, then calls
    scx_bpf_select_cpu_*() on the same task. (Andrea Righi)

Fixes: 0022b32850 ("sched_ext: Decouple kfunc unlocked-context check from kf_mask")
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Andrea Righi <arighi@nvidia.com>
2026-04-24 14:31:36 -10:00
Tejun Heo
c0e8ddc76d sched_ext: Align cgroup #ifdef guards with SUB_SCHED vs GROUP_SCHED
Two EXT_GROUP_SCHED/SUB_SCHED guards are misclassified:

- scx_root_enable_workfn()'s cgroup_get(cgrp) and the err_put_cgrp unwind
  in scx_alloc_and_add_sched() are under `#if GROUP || SUB`, but the
  matching cgroup_put() in scx_sched_free_rcu_work() is inside `#ifdef SUB`
  only (via sch->cgrp, stored only under SUB). GROUP-only would leak a
  reference on every root-sched enable.

- sch_cgroup() / set_cgroup_sched() live under `#if GROUP || SUB` but touch
  SUB-only fields (sch->cgrp, cgroup->scx_sched). GROUP-only wouldn't
  compile.

GROUP needs CGROUP_SCHED; SUB needs only CGROUPS. CGROUPS=y/CGROUP_SCHED=n
gives the reachable GROUP=n, SUB=y combination; GROUP=y, SUB=n isn't
reachable today (SUB is def_bool y under CGROUPS). Neither miscategorization
triggers a real bug in any reachable config, but keep the guards honest:

- Narrow cgroup_get and err_put_cgrp to `#ifdef SUB` (matches the free-side
  put).
- Move sch_cgroup() and set_cgroup_sched() to a separate `#ifdef SUB` block
  with no-op stubs for the !SUB case; keep root_cgroup() and scx_cgroup_{
  lock,unlock}() under `#if GROUP || SUB` since those only need cgroup core.

Fixes: ebeca1f930 ("sched_ext: Introduce cgroup sub-sched support")
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-04-24 14:31:36 -10:00
Tejun Heo
d292aa00de sched_ext: Make bypass LB cpumasks per-scheduler
scx_bypass_lb_{donee,resched}_cpumask were file-scope statics shared by all
scheduler instances. With CONFIG_EXT_SUB_SCHED, multiple sched instances
each arm their own bypass_lb_timer; concurrent bypass_lb_node() calls RMW
the global cpumasks with no lock, corrupting donee/resched decisions.

Move the cpumasks into struct scx_sched, allocate them alongside the timer
in scx_alloc_and_add_sched(), free them in scx_sched_free_rcu_work().

Fixes: 95d1df610c ("sched_ext: Implement load balancer for bypass mode")
Cc: stable@vger.kernel.org # v6.19+
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-04-24 14:31:36 -10:00
Tejun Heo
4155fb489f sched_ext: Pass held rq to SCX_CALL_OP() for core_sched_before
scx_prio_less() runs from core-sched's pick_next_task() path with rq
locked but invokes ops.core_sched_before() with NULL locked_rq, leaving
scx_locked_rq_state NULL. If the BPF callback calls a kfunc that
re-acquires rq based on scx_locked_rq() - e.g. scx_bpf_cpuperf_set(cpu)
- it re-acquires the already-held rq.

Pass task_rq(a).

Fixes: 7b0888b7cc ("sched_ext: Implement core-sched support")
Cc: stable@vger.kernel.org # v6.12+
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-04-24 14:31:36 -10:00
Tejun Heo
207d76a372 sched_ext: Pass held rq to SCX_CALL_OP() for dump_cpu/dump_task
scx_dump_state() walks CPUs with rq_lock_irqsave() held and invokes
ops.dump_cpu / ops.dump_task with NULL locked_rq, leaving
scx_locked_rq_state NULL. If the BPF callback calls a kfunc that
re-acquires rq based on scx_locked_rq() - e.g. scx_bpf_cpuperf_set(cpu)
- it re-acquires the already-held rq.

Pass the held rq to SCX_CALL_OP(). Thread it into scx_dump_task() too.
The pre-loop ops.dump call runs before rq_lock_irqsave() so keeps
rq=NULL.

Fixes: 07814a9439 ("sched_ext: Print debug dump after an error exit")
Cc: stable@vger.kernel.org # v6.12+
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-04-24 14:31:36 -10:00
Tejun Heo
7fb39e4eb4 sched_ext: Save and restore scx_locked_rq across SCX_CALL_OP
SCX_CALL_OP{,_RET}() unconditionally clears scx_locked_rq_state to NULL on
exit. Correct at the top level, but ops can recurse via
scx_bpf_sub_dispatch(): a parent's ops.dispatch calls the helper, which
invokes the child's ops.dispatch under another SCX_CALL_OP. When the inner
call returns, the NULL clobbers the outer's state. The parent's BPF then
calls kfuncs like scx_bpf_cpuperf_set() which read scx_locked_rq()==NULL and
re-acquire the already-held rq.

Snapshot scx_locked_rq_state on entry and restore on exit. Rename the rq
parameter to locked_rq across all SCX_CALL_OP* macros so the snapshot local
can be typed as 'struct rq *' without colliding with the parameter token in
the expansion. SCX_CALL_OP_TASK{,_RET}() and SCX_CALL_OP_2TASKS_RET() funnel
through the two base macros and inherit the fix.

Fixes: 4f8b122848 ("sched_ext: Add basic building blocks for nested sub-scheduler dispatching")
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-04-24 14:31:36 -10:00
Tejun Heo
2f2ea77092 sched_ext: Use dsq->first_task instead of list_empty() in dispatch_enqueue() FIFO-tail
dispatch_enqueue()'s FIFO-tail path used list_empty(&dsq->list) to decide
whether to set dsq->first_task on enqueue. dsq->list can contain parked BPF
iterator cursors (SCX_DSQ_LNODE_ITER_CURSOR), so list_empty() is not a
reliable "no real task" check. If the last real task is unlinked while a
cursor is parked, first_task becomes NULL; the next FIFO-tail enqueue then
sees list_empty() == false and skips the first_task update, leaving
scx_bpf_dsq_peek() returning NULL for a non-empty DSQ.

Test dsq->first_task directly, which already tracks only real tasks and is
maintained under dsq->lock.

Fixes: 44f5c8ec5b ("sched_ext: Add lockless peek operation for DSQs")
Cc: stable@vger.kernel.org # v6.19+
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Cc: Ryan Newton <newton@meta.com>
2026-04-24 14:31:35 -10:00
Tejun Heo
cc2a387d33 sched_ext: Resolve caller's scheduler in scx_bpf_destroy_dsq() / scx_bpf_dsq_nr_queued()
scx_bpf_create_dsq() resolves the calling scheduler via scx_prog_sched(aux)
and inserts the new DSQ into that scheduler's dsq_hash. Its inverse
scx_bpf_destroy_dsq() and the query helper scx_bpf_dsq_nr_queued() were
hard-coded to rcu_dereference(scx_root), so a sub-scheduler could only
destroy or query DSQs in the root scheduler's hash - never its own. If the
root had a DSQ with the same id, the sub-sched silently destroyed it and the
root aborted on the next dispatch ("invalid DSQ ID 0x0..").

Take a const struct bpf_prog_aux *aux via KF_IMPLICIT_ARGS and resolve the
scheduler with scx_prog_sched(aux), matching scx_bpf_create_dsq().

Fixes: ebeca1f930 ("sched_ext: Introduce cgroup sub-sched support")
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-04-24 14:31:35 -10:00
Tejun Heo
80afd4c84b sched_ext: Read scx_root under scx_cgroup_ops_rwsem in cgroup setters
scx_group_set_{weight,idle,bandwidth}() cache scx_root before acquiring
scx_cgroup_ops_rwsem, so the pointer can be stale by the time the op runs.
If the loaded scheduler is disabled and freed (via RCU work) and another is
enabled between the naked load and the rwsem acquire, the reader sees
scx_cgroup_enabled=true (the new scheduler's) but dereferences the freed one
- UAF on SCX_HAS_OP(sch, ...) / SCX_CALL_OP(sch, ...).

scx_cgroup_enabled is toggled only under scx_cgroup_ops_rwsem write
(scx_cgroup_{init,exit}), so reading scx_root inside the rwsem read section
correlates @sch with the enabled snapshot.

Fixes: a5bd6ba30b ("sched_ext: Use cgroup_lock/unlock() to synchronize against cgroup operations")
Cc: stable@vger.kernel.org # v6.18+
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-04-24 14:31:35 -10:00
Tejun Heo
21a5a97ba4 sched_ext: Don't disable tasks in scx_sub_enable_workfn() abort path
scx_sub_enable_workfn()'s prep loop calls __scx_init_task(sch, p, false)
without transitioning task state, then sets SCX_TASK_SUB_INIT. If prep fails
partway, the abort path runs __scx_disable_and_exit_task(sch, p) on the
marked tasks. Task state is still the parent's ENABLED, so that dispatches
to the SCX_TASK_ENABLED arm and calls scx_disable_task(sch, p) - i.e.
child->ops.disable() - for tasks on which child->ops.enable() never ran. A
BPF sub-scheduler allocating per-task state in enable/freeing in disable
would operate on uninitialized state.

The dying-task branch in scx_disable_and_exit_task() has the same problem,
and scx_enabling_sub_sched was cleared before the abort cleanup loop - a
task exiting during cleanup tripped the WARN and skipped both ops.exit_task
and the SCX_TASK_SUB_INIT clear, leaking per-task resources and leaving the
task stuck.

Introduce scx_sub_init_cancel_task() that calls ops.exit_task with
cancelled=true - matching what the top-level init path does when init_task
itself returns -errno. Use it in the abort loop and in the dying-task
branch. scx_enabling_sub_sched now stays set until the abort loop finishes
clearing SUB_INIT, so concurrent exits hitting the dying-task branch can
still find @sch. That branch also clears SCX_TASK_SUB_INIT unconditionally
when seen, leaving the task unmarked even if the WARN fires.

Fixes: 337ec00b1d ("sched_ext: Implement cgroup sub-sched enabling and disabling")
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-04-24 14:31:35 -10:00
Tejun Heo
da2d81b411 sched_ext: Skip tasks with stale task_rq in bypass_lb_cpu()
bypass_lb_cpu() transfers tasks between per-CPU bypass DSQs without
migrating them - task_cpu() only updates when the donee later consumes the
task via move_remote_task_to_local_dsq(). If the LB timer fires again before
consumption and the new DSQ becomes a donor, @p is still on the previous CPU
and task_rq(@p) != donor_rq. @p can't be moved without its own rq locked.

Skip such tasks.

Fixes: 95d1df610c ("sched_ext: Implement load balancer for bypass mode")
Cc: stable@vger.kernel.org # v6.19+
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-04-24 14:31:35 -10:00
Tejun Heo
4fda9f0e7c sched_ext: Guard scx_dsq_move() against NULL kit->dsq after failed iter_new
bpf_iter_scx_dsq_new() clears kit->dsq on failure and
bpf_iter_scx_dsq_{next,destroy}() guard against that. scx_dsq_move() doesn't -
it dereferences kit->dsq immediately, so a BPF program that calls
scx_bpf_dsq_move[_vtime]() after a failed iter_new oopses the kernel.

Return false if kit->dsq is NULL.

Fixes: 4c30f5ce4f ("sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq()")
Cc: stable@vger.kernel.org # v6.12+
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-04-24 14:31:35 -10:00
Tejun Heo
411d3ef1a7 sched_ext: Unregister sub_kset on scheduler disable
When ops.sub_attach is set, scx_alloc_and_add_sched() creates sub_kset as a
child of &sch->kobj, which pins the parent with its own reference. The
disable paths never call kset_unregister(), so the final kobject_put() in
bpf_scx_unreg() leaves a stale reference and scx_kobj_release() never runs,
leaking the whole struct scx_sched on every load/unload cycle.

Unregister sub_kset in scx_root_disable() and scx_sub_disable() before
kobject_del(&sch->kobj).

Fixes: ebeca1f930 ("sched_ext: Introduce cgroup sub-sched support")
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-04-24 14:31:35 -10:00
Tejun Heo
bd2d76455b sched_ext: Defer scx_hardlockup() out of NMI
scx_hardlockup() runs from NMI and eventually calls scx_claim_exit(),
which takes scx_sched_lock. scx_sched_lock isn't NMI-safe and grabbing
it from NMI context can lead to deadlocks.

The hardlockup handler is best-effort recovery and the disable path it
triggers runs off of irq_work anyway. Move the handle_lockup() call into
an irq_work so it runs in IRQ context.

Fixes: ebeca1f930 ("sched_ext: Introduce cgroup sub-sched support")
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-04-24 14:13:22 -10:00
Richard Cheng
510a270554 sched_ext: sync disable_irq_work in bpf_scx_unreg()
When unregistered my self-written scx scheduler, the following panic
occurs.

[  229.923133] Kernel text patching generated an invalid instruction at 0xffff80009bc2c1f8!
[  229.923146] Internal error: Oops - BRK: 00000000f2000100 [#1]  SMP
[  230.077871] CPU: 48 UID: 0 PID: 1760 Comm: kworker/u583:7 Not tainted 7.0.0+ #3 PREEMPT(full)
[  230.086677] Hardware name: NVIDIA GB200 NVL/P3809-BMC, BIOS 02.05.12 20251107
[  230.093972] Workqueue: events_unbound bpf_map_free_deferred
[  230.099675] Sched_ext: invariant_0.1.0_aarch64_unknown_linux_gnu_debug (disabling), task: runnable_at=-174ms
[  230.116843] pc : 0xffff80009bc2c1f8
[  230.120406] lr : dequeue_task_scx+0x270/0x2d0
[  230.217749] Call trace:
[  230.228515]  0xffff80009bc2c1f8 (P)
[  230.232077]  dequeue_task+0x84/0x188
[  230.235728]  sched_change_begin+0x1dc/0x250
[  230.240000]  __set_cpus_allowed_ptr_locked+0x17c/0x240
[  230.245250]  __set_cpus_allowed_ptr+0x74/0xf0
[  230.249701]  ___migrate_enable+0x4c/0xa0
[  230.253707]  bpf_map_free_deferred+0x1a4/0x1b0
[  230.258246]  process_one_work+0x184/0x540
[  230.262342]  worker_thread+0x19c/0x348
[  230.266170]  kthread+0x13c/0x150
[  230.269465]  ret_from_fork+0x10/0x20
[  230.281393] Code: d4202000 d4202000 d4202000 d4202000 (d4202000)
[  230.287621] ---[ end trace 0000000000000000 ]---
[  231.160046] Kernel panic - not syncing: Oops - BRK: Fatal exception in interrupt

The root cause is that the JIT page backing ops->quiescent() is freed
before all callers of that function have stopped.

The expected ordering during teardown is:
    bitmap_zero(sch->has_op) + synchronize_rcu()
        -> guarantees no CPU will ever call sch->ops.* again
    -> only THEN free the BPF struct_ops JIT page

bpf_scx_unreg() is supposed to enforce the order, but after
commit f4a6c506d1 ("sched_ext: Always bounce scx_disable() through
irq_work"), disable_work is no longer queued directly, causing
kthread_flush_work() to be a noop. Thus, the caller drops the struct_ops
map too early and poisoned with AARCH64_BREAK_FAULT before
disable_workfn ever execute.

So the subsequent dequeue_task() still sees SCX_HAS_OP(sch, quiescent)
as true and calls ops.quiescent, which hit on the poisoned page and BRK
panic.

Add a helper scx_flush_disable_work() so the future use cases that want
to flush disable_work can use it.
Also amend the call for scx_root_enable_workfn() and
scx_sub_enable_workfn() which have similar pattern in the error path.

Fixes: f4a6c506d1 ("sched_ext: Always bounce scx_disable() through irq_work")
Signed-off-by: Richard Cheng <icheng@nvidia.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Reviewed-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-04-24 07:26:48 -10:00
zhidao su
4e3d7c89e1 sched_ext: Fix local_dsq_post_enq() to use task's scheduler in sub-sched
local_dsq_post_enq() calls call_task_dequeue() with scx_root instead of
the scheduler instance actually managing the task. When
CONFIG_EXT_SUB_SCHED is enabled, tasks may be managed by a sub-scheduler
whose ops.dequeue() callback differs from root's. Using scx_root causes
the wrong scheduler's ops.dequeue() to be consulted: sub-sched tasks
dispatched to a local DSQ via scx_bpf_dsq_move_to_local() will have
SCX_TASK_IN_CUSTODY cleared but the sub-scheduler's ops.dequeue() is
never invoked, violating the custody exit semantics.

Fix by adding a 'struct scx_sched *sch' parameter to local_dsq_post_enq()
and move_local_task_to_local_dsq(), and propagating the correct scheduler
from their callers dispatch_enqueue(), move_task_between_dsqs(), and
consume_dispatch_q().

This is consistent with dispatch_enqueue()'s non-local path which already
passes 'sch' directly to call_task_dequeue() for global/bypass DSQs.

Fixes: ebf1ccff79 ("sched_ext: Fix ops.dequeue() semantics")
Signed-off-by: zhidao su <suzhidao@xiaomi.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-04-23 06:36:56 -10:00
Tejun Heo
05909810a9 tools/sched_ext: scx_qmap: Silence task_ctx lookup miss
scx_fork() dispatches ops.init_task to exactly one scheduler - the one
owning the forking task's cgroup. A task forked inside a sub-scheduler's
cgroup is init'd into the sub only; the root scheduler has no task_ctx
entry for it. When that task later appears as @prev in the root's
qmap_dispatch() (or flows through core-sched comparison via task_qdist),
the bpf_task_storage_get() legitimately misses.

qmap treated those misses as fatal via scx_bpf_error("task_ctx lookup
failed") and aborted the scheduler as soon as the first cross-sched
task hit the root. Drop the error in the sites where the miss is
legitimate: lookup_task_ctx() (helper; callers already check for NULL),
qmap_dispatch()'s @prev branch (bookkeeping-only), task_qdist()
(returns 0 which makes the comparison a no-op), and qmap_select_cpu()
(returns prev_cpu as a no-op fallback instead of -ESRCH). The existing
scx_error was a paranoid guard from the pre-sub-sched world where every
task was owned by the one and only scheduler.

v2: qmap_select_cpu() returns prev_cpu on NULL instead of -ESRCH, so
    the root scheduler doesn't error on cross-sched tasks that pass
    through it (Andrea Righi).

Fixes: 4f8b122848 ("sched_ext: Add basic building blocks for nested sub-scheduler dispatching")
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Reviewed-by: Zhao Mengmeng <zhaomengmeng@kylinos.cn>
2026-04-21 06:18:58 -10:00
Tejun Heo
4fe9852927 rhashtable: Bounce deferred worker kick through irq_work
Inserts past 75% load call schedule_work(&ht->run_work) to kick an
async resize. If a caller holds a raw spinlock (e.g. an
insecure_elasticity user), schedule_work() under that lock records

  caller_lock -> pool->lock -> pi_lock -> rq->__lock

A cycle forms if any of these locks is acquired in the reverse
direction elsewhere. sched_ext, the only current insecure_elasticity
user, hits this: it holds scx_sched_lock across rhashtable inserts of
sub-schedulers, while scx_bypass() takes rq->__lock -> scx_sched_lock.
Exercising the resize path produces:

  Chain exists of:
    &pool->lock --> &rq->__lock --> scx_sched_lock

Bounce the kick from the insert paths through irq_work so
schedule_work() runs from hard IRQ context with the caller's lock no
longer held. rht_deferred_worker()'s self-rearm on error stays on
schedule_work(&ht->run_work) - the worker runs in process context with
no caller lock held, and keeping the self-requeue on @run_work lets
cancel_work_sync() in rhashtable_free_and_destroy() drain it.

v3: Keep rht_deferred_worker()'s self-rearm on schedule_work(&run_work).
    Routing it through irq_work in v2 broke cancel_work_sync()'s
    self-requeue handling - an irq_work queued after irq_work_sync()
    returned but while cancel_work_sync() was still waiting could fire
    post-teardown.

v2: Bounce unconditionally instead of gating on insecure_elasticity,
    as suggested by Herbert.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
2026-04-20 20:10:50 -10:00
Cheng-Yang Chou
5897ca15d2 selftests/sched_ext: Add non_scx_kfunc_deny test
Verify that the BPF verifier rejects a non-SCX struct_ops program
(tcp_congestion_ops) that attempts to call an SCX kfunc (scx_bpf_kick_cpu).
The test expects the load to fail with -EACCES from scx_kfunc_context_filter.

Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-04-20 07:57:29 -10:00
Cheng-Yang Chou
2d2b026c3e sched_ext: Deny SCX kfuncs to non-SCX struct_ops programs
scx_kfunc_context_filter() currently allows non-SCX struct_ops programs
(e.g. tcp_congestion_ops) to call SCX unlocked kfuncs. This is wrong
for two reasons:

- It is semantically incorrect: a TCP congestion control program has no
  business calling SCX kfuncs such as scx_bpf_kick_cpu().

- With CONFIG_EXT_SUB_SCHED=y, kfuncs like scx_bpf_kick_cpu() call
  scx_prog_sched(aux), which invokes bpf_prog_get_assoc_struct_ops(aux)
  and casts the result to struct sched_ext_ops * before reading ops->priv.
  For a non-SCX struct_ops program the returned pointer is the kdata of
  that struct_ops type, which is far smaller than sched_ext_ops, making
  the read an out-of-bounds access (confirmed with KASAN).

Extend the filter to cover scx_kfunc_set_any and scx_kfunc_set_idle as
well, and deny all SCX kfuncs for any struct_ops program that is not the
SCX struct_ops. This addresses both issues: the semantic contract is
enforced at the verifier level, and the runtime out-of-bounds access
becomes unreachable.

Fixes: d1d3c1c6ae ("sched_ext: Add verifier-time kfunc context filter")
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-04-20 07:57:29 -10:00
Tejun Heo
87019cb6c2 sched_ext: Mark scx_sched_hash insecure_elasticity
scx_sched_hash is inserted into under scx_sched_lock (raw_spinlock_irq)
in scx_link_sched(). rhashtable's sync grow path calls get_random_u32()
and does a GFP_ATOMIC allocation; both acquire regular spinlocks, which
is unsafe under raw_spinlock_t. Set insecure_elasticity to skip the
sync grow.

v2:
- Dropped dsq_hash changes. Insertion is not under raw_spin_lock.

- Switched from no_sync_grow flag to insecure_elasticity.

Fixes: 25037af712 ("sched_ext: Add rhashtable lookup for sub-schedulers")
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-04-19 05:47:28 -10:00
Herbert Xu
73bd122778 rhashtable: Restore insecure_elasticity toggle
Some users of rhashtable cannot handle insertion failures, and
are happy to accept the consequences of a hash table that having
very long chains.

Restore the insecure_elasticity toggle for these users.  In
addition to disabling the chain length checks, this also removes
the emergency resize that would otherwise occur when the hash
table occupancy hits 100% (an async resize is still scheduled
at 75%).

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-04-19 05:47:21 -10:00
Linus Torvalds
3cd8b194bf Merge tag 'v7.1-rc-part1-smbdirect-fixes' of git://git.samba.org/ksmbd
Pull smbdirect updates from Steve French:
 "Move smbdirect server and client code to common directory:

   - temporary use of smbdirect_all_c_files.c to allow micro steps

   - factor out common functions into a smbdirect.ko.

   - convert cifs.ko to use smbdirect.ko

   - convert ksmbd.ko to use smbdirect.ko

   - let smbdirect.ko use global workqueues

   - move ib_client logic from ksmbd.ko into smbdirect.ko

   - remove smbdirect_all_c_files.c hack again

   - some locking and teardown related fixes on top"

* tag 'v7.1-rc-part1-smbdirect-fixes' of git://git.samba.org/ksmbd: (145 commits)
  smb: smbdirect: let smbdirect_connection_deregister_mr_io unlock while waiting
  smb: smbdirect: fix the logic in smbdirect_socket_destroy_sync() without an error
  smb: smbdirect: fix copyright header of smbdirect.h
  smb: smbdirect: change smbdirect_socket_parameters.{initiator_depth,responder_resources} to __u16
  smb: smbdirect: remove unused SMBDIRECT_USE_INLINE_C_FILES logic
  smb: server: no longer use smbdirect_socket_set_custom_workqueue()
  smb: client: no longer use smbdirect_socket_set_custom_workqueue()
  smb: smbdirect: introduce global workqueues
  smb: smbdirect: prepare use of dedicated workqueues for different steps
  smb: smbdirect: remove unused smbdirect_connection_mr_io_recovery_work()
  smb: smbdirect: wrap rdma_disconnect() in rdma_[un]lock_handler()
  smb: server: make use of smbdirect_netdev_rdma_capable_mode_type()
  smb: smbdirect: introduce smbdirect_netdev_rdma_capable_mode_type()
  smb: server: make use of smbdirect.ko
  smb: server: remove unused ksmbd_transport_ops.prepare()
  smb: server: make use of smbdirect_socket_{listen,accept}()
  smb: server: only use public smbdirect functions
  smb: server: make use of smbdirect_socket_create_accepting()/smbdirect_socket_release()
  smb: server: make use of smbdirect_{socket_init_accepting,connection_wait_for_connected}()
  smb: server: make use of smbdirect_connection_send_iter() and related functions
  ...
2026-04-16 08:25:04 -07:00
Linus Torvalds
d3d9443f8b Merge tag 'livepatching-for-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/livepatching/livepatching
Pull livepatching updates from Petr Mladek:

 - Add two new selftests

* tag 'livepatching-for-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/livepatching/livepatching:
  selftests/livepatch: add test for module function patching
  selftests: livepatch: test-ftrace: livepatch a traced function
2026-04-16 08:13:27 -07:00
Linus Torvalds
090748e62f Merge tag 'm68k-for-v7.1-tag1' of git://git.kernel.org/pub/scm/linux/kernel/git/geert/linux-m68k
Pull m68k updates from Geert Uytterhoeven:

 - Add support for QEMU virt-ctrl, and use it for system reset
   and power off on the virt platform

 - defconfig updates

 - Miscellaneous fixes and improvements

* tag 'm68k-for-v7.1-tag1' of git://git.kernel.org/pub/scm/linux/kernel/git/geert/linux-m68k:
  m68k: virt: Switch to qemu-virt-ctrl driver
  power: reset: Add QEMU virt-ctrl driver
  m68k: defconfig: Update defconfigs for v7.0-rc1
  m68k: emu: Replace unbounded sprintf() in nfhd_init_one()
  m68k: uapi: Add ucontext.h
  m68k: defconfig: hp300: Enable monochrome and 16-color linux logos
  m68k: q40: Remove commented out code
2026-04-16 08:11:01 -07:00
Linus Torvalds
948ef73f7e Merge tag 'efi-next-for-v7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi
Pull EFI updates from Ard Biesheuvel:
 "Again not a busy cycle for EFI, just some minor tweaks and bug fixes:

   - Enable boot graphics resource table (BGRT) on Xen/x86

   - Correct a misguided assumption in the memory attributes table
     sanity check

   - Start tagging efi_mem_reserve()'d regions as MEMBLOCK_RSRV_KERN

   - Some other minor fixes and cleanups"

* tag 'efi-next-for-v7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi:
  efi/capsule-loader: fix incorrect sizeof in phys array reallocation
  efi: Tag memblock reservations of boot services regions as RSRV_KERN
  memblock: Permit existing reserved regions to be marked RSRV_KERN
  efi/memattr: Fix thinko in table size sanity check
  efi: libstub: fix type of fdt 32 and 64bit variables
  efi: Drop unused efi_range_is_wc() function
  efi: Enable BGRT loading under Xen
  efi: make efi_mem_type() and efi_mem_attributes() work on Xen PV
2026-04-16 08:06:25 -07:00
Linus Torvalds
f0bf3eac92 Merge tag 'vfio-v7.1-rc1' of https://github.com/awilliam/linux-vfio
Pull VFIO updates from Alex Williamson:

 - Update QAT vfio-pci variant driver for Gen 5, 420xx devices (Vijay
   Sundar Selvamani, Suman Kumar Chakraborty, Giovanni Cabiddu)

 - Fix vfio selftest MMIO DMA mapping selftest (Alex Mastro)

 - Conversions to const struct class in support of class_create()
   deprecation (Jori Koolstra)

 - Improve selftest compiler compatibility by avoiding initializer on
   variable-length array (Manish Honap)

 - Define new uAPI for drivers supporting migration to advise user-
   space of new initial data for reducing target startup latency.
   Implemented for mlx5 vfio-pci variant driver (Yishai Hadas)

 - Enable vfio selftests on aarch64, not just cross-compiles reporting
   arm64 (Ted Logan)

 - Update vfio selftest driver support to include additional DSA devices
   (Yi Lai)

 - Unconditionally include debugfs root pointer in vfio device struct,
   avoiding a build failure seen in hisi_acc variant driver without
   debugfs otherwise (Arnd Bergmann)

 - Add support for the s390 ISM (Internal Shared Memory) device via a
   new variant driver. The device is unique in the size of its BAR space
   (256TiB) and lack of mmap support (Julian Ruess)

 - Enforce that vfio-pci drivers implement a name in their ops structure
   for use in sequestering SR-IOV VFs (Alex Williamson)

 - Prune leftover group notifier code (Paolo Bonzini)

 - Fix Xe vfio-pci variant driver to avoid migration support as a
   dependency in the reset path and missing release call (Michał
   Winiarski)

* tag 'vfio-v7.1-rc1' of https://github.com/awilliam/linux-vfio: (23 commits)
  vfio/xe: Add a missing vfio_pci_core_release_dev()
  vfio/xe: Reorganize the init to decouple migration from reset
  vfio: remove dead notifier code
  vfio/pci: Require vfio_device_ops.name
  MAINTAINERS: add VFIO ISM PCI DRIVER section
  vfio/ism: Implement vfio_pci driver for ISM devices
  vfio/pci: Rename vfio_config_do_rw() to vfio_pci_config_rw_single() and export it
  vfio: unhide vdev->debug_root
  vfio/qat: add support for Intel QAT 420xx VFs
  vfio: selftests: Support DMR and GNR-D DSA devices
  vfio: selftests: Build tests on aarch64
  vfio/mlx5: Add REINIT support to VFIO_MIG_GET_PRECOPY_INFO
  vfio/mlx5: consider inflight SAVE during PRE_COPY
  net/mlx5: Add IFC bits for migration state
  vfio: Adapt drivers to use the core helper vfio_check_precopy_ioctl
  vfio: Add support for VFIO_DEVICE_FEATURE_MIG_PRECOPY_INFOv2
  vfio: Define uAPI for re-init initial bytes during the PRE_COPY phase
  vfio: selftests: Fix VLA initialisation in vfio_pci_irq_set()
  vfio: uapi: fix comment typo
  vfio: mdev: replace mtty_dev->vd_class with a const struct class
  ...
2026-04-16 08:01:16 -07:00
Petr Mladek
448c0f8cb7 Merge branch 'for-7.1/module-function-test' into for-linus 2026-04-16 10:33:43 +02:00
Stefan Metzmacher
d09a040c18 smb: smbdirect: let smbdirect_connection_deregister_mr_io unlock while waiting
We should not hold a mutex locked during wait_for_completion()
holding a reference is enough.

Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Long Li <longli@microsoft.com>
Cc: Namjae Jeon <linkinjeon@kernel.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Henrique Carvalho <henrique.carvalho@suse.com>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2026-04-15 21:58:24 -05:00
Stefan Metzmacher
25c2e34931 smb: smbdirect: fix the logic in smbdirect_socket_destroy_sync() without an error
If smbdirect_socket_destroy_sync() and sc->first_error was not set
we should set -ESHUTDOWN, that's a better condition
doing it only implicitly with the
sc->status < SMBDIRECT_SOCKET_DISCONNECTING check.

Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Long Li <longli@microsoft.com>
Cc: Namjae Jeon <linkinjeon@kernel.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Henrique Carvalho <henrique.carvalho@suse.com>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2026-04-15 21:58:24 -05:00
Stefan Metzmacher
3892007f2b smb: smbdirect: fix copyright header of smbdirect.h
Everything in smbdirect.h was taken from my out of
tree prototype.

Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Long Li <longli@microsoft.com>
Cc: Namjae Jeon <linkinjeon@kernel.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Henrique Carvalho <henrique.carvalho@suse.com>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2026-04-15 21:58:24 -05:00
Stefan Metzmacher
735610d0ce smb: smbdirect: change smbdirect_socket_parameters.{initiator_depth,responder_resources} to __u16
We still limit this to U8_MAX as the rdma api only uses __u8
and that's also the limit for Infiniband and RoCE*,
while iWarp would be able to support larger values at
the protocol level.

As struct smbdirect_socket_parameters will be part
of the uapi for IPPROTO_SMBDIRECT in future, change it
now even if userspace sockets won't be supported yet.

Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Long Li <longli@microsoft.com>
Cc: Namjae Jeon <linkinjeon@kernel.org>
Cc: David Howells <dhowells@redhat.com>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Acked-by: Henrique Carvalho <henrique.carvalho@suse.com>
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2026-04-15 21:58:24 -05:00
Stefan Metzmacher
aa43bb2c0f smb: smbdirect: remove unused SMBDIRECT_USE_INLINE_C_FILES logic
We always build as standalone module (or as part of the core kernel).

This also removes unused elements from struct smbdirect_socket
and unused exports.

Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Long Li <longli@microsoft.com>
Cc: Namjae Jeon <linkinjeon@kernel.org>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2026-04-15 21:58:24 -05:00
Stefan Metzmacher
649c47559a smb: server: no longer use smbdirect_socket_set_custom_workqueue()
smbdirect.ko has global workqueues now, so we should use these
default once.

Cc: Namjae Jeon <linkinjeon@kernel.org>
Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2026-04-15 21:58:24 -05:00