Commit Graph

49584 Commits

Author SHA1 Message Date
Aleksa Sarai
7df8782012 pidns: move is-ancestor logic to helper
This check will be needed in later patches, and there's no point
open-coding it each time.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Link: https://lore.kernel.org/20250805-procfs-pidns-api-v4-1-705f984940e7@cyphar.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-09-02 11:37:24 +02:00
Baochen Qiang
7e2368a217 dma-debug: don't enforce dma mapping check on noncoherent allocations
As discussed in [1], there is no need to enforce dma mapping check on
noncoherent allocations, a simple test on the returned CPU address is
good enough.

Add a new pair of debug helpers and use them for noncoherent alloc/free
to fix this issue.

Fixes: efa70f2fdc ("dma-mapping: add a new dma_alloc_pages API")
Link: https://lore.kernel.org/all/ff6c1fe6-820f-4e58-8395-df06aa91706c@oss.qualcomm.com # 1
Signed-off-by: Baochen Qiang <baochen.qiang@oss.qualcomm.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Link: https://lore.kernel.org/r/20250828-dma-debug-fix-noncoherent-dma-check-v1-1-76e9be0dd7fc@oss.qualcomm.com
2025-09-02 10:18:16 +02:00
Simon Schuster
edd3cb05c0 copy_process: pass clone_flags as u64 across calltree
With the introduction of clone3 in commit 7f192e3cd3 ("fork: add
clone3") the effective bit width of clone_flags on all architectures was
increased from 32-bit to 64-bit, with a new type of u64 for the flags.
However, for most consumers of clone_flags the interface was not
changed from the previous type of unsigned long.

While this works fine as long as none of the new 64-bit flag bits
(CLONE_CLEAR_SIGHAND and CLONE_INTO_CGROUP) are evaluated, this is still
undesirable in terms of the principle of least surprise.

Thus, this commit fixes all relevant interfaces of callees to
sys_clone3/copy_process (excluding the architecture-specific
copy_thread) to consistently pass clone_flags as u64, so that
no truncation to 32-bit integers occurs on 32-bit architectures.

Signed-off-by: Simon Schuster <schuster.simon@siemens-energy.com>
Link: https://lore.kernel.org/20250901-nios2-implement-clone3-v2-2-53fcf5577d57@siemens-energy.com
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-09-01 15:31:34 +02:00
Simon Schuster
04ff48239f copy_sighand: Handle architectures where sizeof(unsigned long) < sizeof(u64)
With the introduction of clone3 in commit 7f192e3cd3 ("fork: add
clone3") the effective bit width of clone_flags on all architectures was
increased from 32-bit to 64-bit. However, the signature of the copy_*
helper functions (e.g., copy_sighand) used by copy_process was not
adapted.

As such, they truncate the flags on any 32-bit architectures that
supports clone3 (arc, arm, csky, m68k, microblaze, mips32, openrisc,
parisc32, powerpc32, riscv32, x86-32 and xtensa).

For copy_sighand with CLONE_CLEAR_SIGHAND being an actual u64
constant, this triggers an observable bug in kernel selftest
clone3_clear_sighand:

        if (clone_flags & CLONE_CLEAR_SIGHAND)

in function copy_sighand within fork.c will always fail given:

        unsigned long /* == uint32_t */ clone_flags
        #define CLONE_CLEAR_SIGHAND 0x100000000ULL

This commit fixes the bug by always passing clone_flags to copy_sighand
via their declared u64 type, invariant of architecture-dependent integer
sizes.

Fixes: b612e5df45 ("clone3: add CLONE_CLEAR_SIGHAND")
Cc: stable@vger.kernel.org # linux-5.5+
Signed-off-by: Simon Schuster <schuster.simon@siemens-energy.com>
Link: https://lore.kernel.org/20250901-nios2-implement-clone3-v2-1-53fcf5577d57@siemens-energy.com
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-09-01 15:31:33 +02:00
Li Jun
98da8a4aec PM: hibernate: Fix typo in memory bitmaps description comment
Correct 'leave' to 'leaf' in memory bitmaps description comment.

Signed-off-by: Li Jun <lijun01@kylinos.cn>
Link: https://patch.msgid.link/20250819104038.1596952-1-lijun01@kylinos.cn
[ rjw: Subject and changelog edits ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-09-01 11:55:53 +02:00
Qianfeng Rong
5545d56fd1 PM: hibernate: Use vmalloc_array() and vcalloc() to improve code
Remove array_size() calls and replace vmalloc() and vzalloc() with
vmalloc_array() and vcalloc() respectively to simplify the code in
save_compressed_image() and load_compressed_image().

vmalloc_array() is also optimized better, resulting in less
instructions being used, and vmalloc_array() handling overhead is
lower [1].

Link: https://lore.kernel.org/lkml/abc66ec5-85a4-47e1-9759-2f60ab111971@vivo.com/ [1]
Signed-off-by: Qianfeng Rong <rongqianfeng@vivo.com>
Link: https://patch.msgid.link/20250817083636.53872-1-rongqianfeng@vivo.com
[ rjw: Changelog edits ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-09-01 11:55:53 +02:00
Linus Torvalds
fe3ad7a58b Merge tag 'sched_urgent_for_v6.17_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Borislav Petkov:

 - Fix a stall on the CPU offline path due to mis-counting a deadline
   server task twice as part of the runqueue's running tasks count

 - Fix a realtime tasks starvation case where failure to enqueue a timer
   whose expiration time is already in the past would cause repeated
   attempts to re-enqueue a deadline server task which leads to starving
   the former, realtime one

 - Prevent a delayed deadline server task stop from breaking the
   per-runqueue bandwidth tracking

 - Have a function checking whether the deadline server task has
   stopped, return the correct value

* tag 'sched_urgent_for_v6.17_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/deadline: Don't count nr_running for dl_server proxy tasks
  sched/deadline: Fix RT task potential starvation when expiry time passed
  sched/deadline: Always stop dl-server before changing parameters
  sched/deadline: Fix dl_server_stopped()
2025-08-31 09:13:00 -07:00
Sebastian Andrzej Siewior
d9b05321e2 futex: Move futex_hash_free() back to __mmput()
To avoid a memory leak via mm_alloc() + mmdrop() the futex cleanup code
has been moved to __mmdrop(). This resulted in a warnings if the futex
hash table has been allocated via vmalloc() the mmdrop() was invoked
from atomic context.
The free path must stay in __mmput() to ensure it is invoked from
preemptible context.

In order to avoid the memory leak, delay the allocation of
mm_struct::mm->futex_ref to futex_hash_allocate(). This works because
neither the per-CPU counter nor the private hash has been allocated and
therefore
- futex_private_hash() callers (such as exit_pi_state_list()) don't
  acquire reference if there is no private hash yet. There is also no
  reference put.

- Regular callers (futex_hash()) fallback to global hash. No reference
  counting here.

The futex_ref member can be allocated in futex_hash_allocate() before
the private hash itself is allocated. This happens either while the
first thread is created or on request. In both cases the process has
just a single thread so there can be either futex operation in progress
or the request to create a private hash.

Move futex_hash_free() back to __mmput();
Move the allocation of mm_struct::futex_ref to futex_hash_allocate().

  [ bp: Fold a follow-up fix to prevent a use-after-free:
    https://lore.kernel.org/r/20250830213806.sEKuuGSm@linutronix.de ]

Fixes:  e703b7e247 ("futex: Move futex cleanup to __mmdrop()")
Closes: https://lore.kernel.org/all/20250821102721.6deae493@kernel.org/
Reported-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lkml.kernel.org/r/20250822141238.PfnkTjFb@linutronix.de
2025-08-31 11:48:19 +02:00
Casey Schaufler
0ffbc876d0 audit: add record for multiple object contexts
Create a new audit record AUDIT_MAC_OBJ_CONTEXTS.
An example of the MAC_OBJ_CONTEXTS record is:

    type=MAC_OBJ_CONTEXTS
      msg=audit(1601152467.009:1050):
      obj_selinux=unconfined_u:object_r:user_home_t:s0

When an audit event includes a AUDIT_MAC_OBJ_CONTEXTS record
the "obj=" field in other records in the event will be "obj=?".
An AUDIT_MAC_OBJ_CONTEXTS record is supplied when the system has
multiple security modules that may make access decisions based
on an object security context.

Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
[PM: subj tweak, audit example readability indents]
Signed-off-by: Paul Moore <paul@paul-moore.com>
2025-08-30 10:15:30 -04:00
Casey Schaufler
eb59d494ee audit: add record for multiple task security contexts
Replace the single skb pointer in an audit_buffer with a list of
skb pointers. Add the audit_stamp information to the audit_buffer as
there's no guarantee that there will be an audit_context containing
the stamp associated with the event. At audit_log_end() time create
auxiliary records as have been added to the list. Functions are
created to manage the skb list in the audit_buffer.

Create a new audit record AUDIT_MAC_TASK_CONTEXTS.
An example of the MAC_TASK_CONTEXTS record is:

    type=MAC_TASK_CONTEXTS
      msg=audit(1600880931.832:113)
      subj_apparmor=unconfined
      subj_smack=_

When an audit event includes a AUDIT_MAC_TASK_CONTEXTS record the
"subj=" field in other records in the event will be "subj=?".
An AUDIT_MAC_TASK_CONTEXTS record is supplied when the system has
multiple security modules that may make access decisions based on a
subject security context.

Refactor audit_log_task_context(), creating a new audit_log_subj_ctx().
This is used in netlabel auditing to provide multiple subject security
contexts as necessary.

Suggested-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
[PM: subj tweak, audit example readability indents]
Signed-off-by: Paul Moore <paul@paul-moore.com>
2025-08-30 10:15:30 -04:00
Casey Schaufler
a59076f266 lsm: security_lsmblob_to_secctx module selection
Add a parameter lsmid to security_lsmblob_to_secctx() to identify which
of the security modules that may be active should provide the security
context. If the value of lsmid is LSM_ID_UNDEF the first LSM providing
a hook is used. security_secid_to_secctx() is unchanged, and will
always report the first LSM providing a hook.

Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
[PM: subj tweak]
Signed-off-by: Paul Moore <paul@paul-moore.com>
2025-08-30 10:15:29 -04:00
Casey Schaufler
0a561e3904 audit: create audit_stamp structure
Replace the timestamp and serial number pair used in audit records
with a structure containing the two elements.

Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
[PM: subj tweak]
Signed-off-by: Paul Moore <paul@paul-moore.com>
2025-08-30 10:15:28 -04:00
Jakub Kicinski
d23ad54de7 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.17-rc4).

No conflicts.

Adjacent changes:

drivers/net/ethernet/intel/idpf/idpf_txrx.c
  02614eee26 ("idpf: do not linearize big TSO packets")
  6c4e684802 ("idpf: remove obsolete stashing code")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-29 11:48:01 -07:00
Linus Torvalds
4d28e28098 Merge tag 'dma-mapping-6.17-2025-08-28' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux
Pull dma-mapping fixes from Marek Szyprowski:

 - another small fix for arm64 systems with memory encryption (Shanker
   Donthineni)

 - fix for arm32 systems with non-standard CMA configuration (Oreoluwa
   Babatunde)

* tag 'dma-mapping-6.17-2025-08-28' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux:
  dma/pool: Ensure DMA_DIRECT_REMAP allocations are decrypted
  of: reserved_mem: Restructure call site for dma_contiguous_early_fixup()
2025-08-28 16:04:14 -07:00
Nandakumar Edamana
1df7dad4d5 bpf: Improve the general precision of tnum_mul
Drop the value-mask decomposition technique and adopt straightforward
long-multiplication with a twist: when LSB(a) is uncertain, find the
two partial products (for LSB(a) = known 0 and LSB(a) = known 1) and
take a union.

Experiment shows that applying this technique in long multiplication
improves the precision in a significant number of cases (at the cost
of losing precision in a relatively lower number of cases).

Signed-off-by: Nandakumar Edamana <nandakumar@nandakumar.co.in>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Tested-by: Harishankar Vishwanathan <harishankar.vishwanathan@gmail.com>
Reviewed-by: Harishankar Vishwanathan <harishankar.vishwanathan@gmail.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/20250826034524.2159515-1-nandakumar@nandakumar.co.in
2025-08-27 15:00:26 -07:00
Marie Zhussupova
b9a214b5f6 kunit: Pass parameterized test context to generate_params()
To enable more complex parameterized testing scenarios, the
generate_params() function needs additional context beyond just
the previously generated parameter. This patch modifies the
generate_params() function signature to include an extra
`struct kunit *test` argument, giving test users access to the
parameterized test context when generating parameters.

The `struct kunit *test` argument was added as the first parameter
to the function signature as it aligns with the convention of other
KUnit functions that accept `struct kunit *test` first. This also
mirrors the "this" or "self" reference found in object-oriented
programming languages.

This patch also modifies xe_pci_live_device_gen_param() in xe_pci.c
and nthreads_gen_params() in kcsan_test.c to reflect this signature
change.

Link: https://lore.kernel.org/r/20250826091341.1427123-4-davidgow@google.com
Reviewed-by: David Gow <davidgow@google.com>
Reviewed-by: Rae Moar <rmoar@google.com>
Acked-by: Marco Elver <elver@google.com>
Acked-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: Marie Zhussupova <marievic@google.com>
[Catch some additional gen_params signatures in drm/xe/tests --David]
Signed-off-by: David Gow <davidgow@google.com>
Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
2025-08-26 23:36:03 -06:00
Yicong Yang
52d15521eb sched/deadline: Don't count nr_running for dl_server proxy tasks
On CPU offline the kernel stalled with below call trace:

  INFO: task kworker/0:1:11 blocked for more than 120 seconds.

cpuhp hold the cpu hotplug lock endless and stalled vmstat_shepherd.
This is because we count nr_running twice on cpuhp enqueuing and failed
the wait condition of cpuhp:

  enqueue_task_fair() // pick cpuhp from idle, rq->nr_running = 0
    dl_server_start()
      [...]
      add_nr_running() // rq->nr_running = 1
    add_nr_running() // rq->nr_running = 2
  [switch to cpuhp, waiting on balance_hotplug_wait()]
  rcuwait_wait_event(rq->nr_running == 1 && ...) // failed, rq->nr_running=2
    schedule() // wait again

It doesn't make sense to count the dl_server towards runnable tasks,
since it runs other tasks.

Fixes: 63ba8422f8 ("sched/deadline: Introduce deadline servers")
Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250627035420.37712-1-yangyicong@huawei.com
2025-08-26 10:46:01 +02:00
kuyo chang
421fc59cf5 sched/deadline: Fix RT task potential starvation when expiry time passed
[Symptom]
The fair server mechanism, which is intended to prevent fair starvation
when higher-priority tasks monopolize the CPU.
Specifically, RT tasks on the runqueue may not be scheduled as expected.

[Analysis]
The log "sched: DL replenish lagged too much" triggered.

By memory dump of dl_server:
    curr = 0xFFFFFF80D6A0AC00 (
      dl_server = 0xFFFFFF83CD5B1470(
        dl_runtime = 0x02FAF080,
        dl_deadline = 0x3B9ACA00,
        dl_period = 0x3B9ACA00,
        dl_bw = 0xCCCC,
        dl_density = 0xCCCC,
        runtime = 0x02FAF080,
        deadline = 0x0000082031EB0E80,
        flags = 0x0,
        dl_throttled = 0x0,
        dl_yielded = 0x0,
        dl_non_contending = 0x0,
        dl_overrun = 0x0,
        dl_server = 0x1,
        dl_server_active = 0x1,
        dl_defer = 0x1,
        dl_defer_armed = 0x0,
        dl_defer_running = 0x1,
        dl_timer = (
          node = (
            expires = 0x000008199756E700),
          _softexpires = 0x000008199756E700,
          function = 0xFFFFFFDB9AF44D30 = dl_task_timer,
          base = 0xFFFFFF83CD5A12C0,
          state = 0x0,
          is_rel = 0x0,
          is_soft = 0x0,
    clock_update_flags = 0x4,
    clock = 0x000008204A496900,

 - The timer expiration time (rq->curr->dl_server->dl_timer->expires)
   is already in the past, indicating the timer has expired.
 - The timer state (rq->curr->dl_server->dl_timer->state) is 0.

[Suspected Root Cause]
The relevant code flow in the throttle path of
update_curr_dl_se() as follows:

  dequeue_dl_entity(dl_se, 0);                // the DL entity is dequeued

  if (unlikely(is_dl_boosted(dl_se) || !start_dl_timer(dl_se))) {
      if (dl_server(dl_se))                   // timer registration fails
          enqueue_dl_entity(dl_se, ENQUEUE_REPLENISH);//enqueue immediately
      ...
  }

The failure of `start_dl_timer` is caused by attempting to register a
timer with an expiration time that is already in the past. When this
situation persists, the code repeatedly re-enqueues the DL entity
without properly replenishing or restarting the timer, resulting in RT
task may not be scheduled as expected.

[Proposed Solution]:
Instead of immediately re-enqueuing the DL entity on timer registration
failure, this change ensures the DL entity is properly replenished and
the timer is restarted, preventing RT potential starvation.

Fixes: 63ba8422f8 ("sched/deadline: Introduce deadline servers")
Signed-off-by: kuyo chang <kuyo.chang@mediatek.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Closes: https://lore.kernel.org/CAMuHMdXn4z1pioTtBGMfQM0jsLviqS2jwysaWXpoLxWYoGa82w@mail.gmail.com
Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
Tested-by: Jiri Slaby <jirislaby@kernel.org>
Tested-by: Diederik de Haas <didi.debian@cknow.org>
Link: https://lkml.kernel.org/r/20250615131129.954975-1-kuyo.chang@mediatek.com
2025-08-26 10:46:01 +02:00
Juri Lelli
bb4700adc3 sched/deadline: Always stop dl-server before changing parameters
Commit cccb45d7c4 ("sched/deadline: Less agressive dl_server
handling") reduced dl-server overhead by delaying disabling servers only
after there are no fair task around for a whole period, which means that
deadline entities are not dequeued right away on a server stop event.
However, the delay opens up a window in which a request for changing
server parameters can break per-runqueue running_bw tracking, as
reported by Yuri.

Close the problematic window by unconditionally calling dl_server_stop()
before applying the new parameters (ensuring deadline entities go
through an actual dequeue).

Fixes: cccb45d7c4 ("sched/deadline: Less agressive dl_server handling")
Reported-by: Yuri Andriaccio <yurand2000@gmail.com>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lore.kernel.org/r/20250721-upstream-fix-dlserver-lessaggressive-b4-v1-1-4ebc10c87e40@redhat.com
2025-08-26 10:46:00 +02:00
Huacai Chen
4717432dfd sched/deadline: Fix dl_server_stopped()
Commit cccb45d7c4 ("sched/deadline: Less agressive dl_server handling")
introduces dl_server_stopped(). But it is obvious that dl_server_stopped()
should return true if dl_se->dl_server_active is 0.

Fixes: cccb45d7c4 ("sched/deadline: Less agressive dl_server handling")
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250809130419.1980742-1-chenhuacai@loongson.cn
2025-08-26 10:46:00 +02:00
Josh Poimboeuf
16ed389227 perf: Skip user unwind if the task is a kernel thread
If the task is not a user thread, there's no user stack to unwind.

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250820180428.930791978@kernel.org
2025-08-26 09:51:13 +02:00
Josh Poimboeuf
d77e3319e3 perf: Simplify get_perf_callchain() user logic
Simplify the get_perf_callchain() user logic a bit.  task_pt_regs()
should never be NULL.

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/r/20250820180428.760066227@kernel.org
2025-08-26 09:51:13 +02:00
Steven Rostedt
90942f9fac perf: Use current->flags & PF_KTHREAD|PF_USER_WORKER instead of current->mm == NULL
To determine if a task is a kernel thread or not, it is more reliable to
use (current->flags & (PF_KTHREAD|PF_USER_WORKERi)) than to rely on
current->mm being NULL.  That is because some kernel tasks (io_uring
helpers) may have a mm field.

Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250820180428.592367294@kernel.org
2025-08-26 09:51:13 +02:00
Josh Poimboeuf
153f9e74de perf: Have get_perf_callchain() return NULL if crosstask and user are set
get_perf_callchain() doesn't support cross-task unwinding for user space
stacks, have it return NULL if both the crosstask and user arguments are
set.

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250820180428.426423415@kernel.org
2025-08-26 09:51:12 +02:00
Josh Poimboeuf
e649bcda25 perf: Remove get_perf_callchain() init_nr argument
The 'init_nr' argument has double duty: it's used to initialize both the
number of contexts and the number of stack entries.  That's confusing
and the callers always pass zero anyway.  Hard code the zero.

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Namhyung Kim <Namhyung@kernel.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/r/20250820180428.259565081@kernel.org
2025-08-26 09:51:12 +02:00
Menglong Dong
8e4f0b1ebc bpf: use rcu_read_lock_dont_migrate() for trampoline.c
Use rcu_read_lock_dont_migrate() and rcu_read_unlock_migrate() in
trampoline.c to obtain better performance when PREEMPT_RCU is not enabled.

Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Link: https://lore.kernel.org/r/20250821090609.42508-8-dongml2@chinatelecom.cn
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-08-25 18:52:16 -07:00
Menglong Dong
427a36bb55 bpf: use rcu_read_lock_dont_migrate() for bpf_prog_run_array_cg()
Use rcu_read_lock_dont_migrate() and rcu_read_unlock_migrate() in
bpf_prog_run_array_cg to obtain better performance when PREEMPT_RCU is
not enabled.

Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Link: https://lore.kernel.org/r/20250821090609.42508-7-dongml2@chinatelecom.cn
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-08-25 18:52:16 -07:00
Menglong Dong
cf4303b70d bpf: use rcu_read_lock_dont_migrate() for bpf_task_storage_free()
Use rcu_read_lock_dont_migrate() and rcu_read_unlock_migrate() in
bpf_task_storage_free to obtain better performance when PREEMPT_RCU is
not enabled.

Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Link: https://lore.kernel.org/r/20250821090609.42508-6-dongml2@chinatelecom.cn
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-08-25 18:52:16 -07:00
Menglong Dong
68748f0397 bpf: use rcu_read_lock_dont_migrate() for bpf_iter_run_prog()
Use rcu_read_lock_dont_migrate() and rcu_read_unlock_migrate() in
bpf_iter_run_prog to obtain better performance when PREEMPT_RCU is
not enabled.

Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Link: https://lore.kernel.org/r/20250821090609.42508-5-dongml2@chinatelecom.cn
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-08-25 18:52:16 -07:00
Menglong Dong
f2fa9b9069 bpf: use rcu_read_lock_dont_migrate() for bpf_inode_storage_free()
Use rcu_read_lock_dont_migrate() and rcu_read_unlock_migrate() in
bpf_inode_storage_free to obtain better performance when PREEMPT_RCU is
not enabled.

Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Link: https://lore.kernel.org/r/20250821090609.42508-4-dongml2@chinatelecom.cn
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-08-25 18:52:16 -07:00
Menglong Dong
8c0afc7c9c bpf: use rcu_read_lock_dont_migrate() for bpf_cgrp_storage_free()
Use rcu_read_lock_dont_migrate() and rcu_read_unlock_migrate() in
bpf_cgrp_storage_free to obtain better performance when PREEMPT_RCU is
not enabled.

Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Link: https://lore.kernel.org/r/20250821090609.42508-3-dongml2@chinatelecom.cn
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-08-25 18:52:16 -07:00
Chen Ridong
2c98144fc8 cpuset: add helpers for cpus read and cpuset_mutex locks
cpuset: add helpers for cpus_read_lock and cpuset_mutex locks.

Replace repetitive locking patterns with new helpers:
- cpuset_full_lock()
- cpuset_full_unlock()

This makes the code cleaner and ensures consistent lock ordering.

Signed-off-by: Chen Ridong <chenridong@huawei.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-08-25 08:20:22 -10:00
Chen Ridong
ada00d5162 cpuset: separate tmpmasks and cpuset allocation logic
The original alloc_cpumasks() served dual purposes: allocating cpumasks
for both temporary masks (tmpmasks) and cpuset structures. This patch:

1. Decouples these allocation paths for better code clarity
2. Introduces dedicated alloc_tmpmasks() and dup_or_alloc_cpuset()
   functions
3. Maintains symmetric pairing:
   - alloc_tmpmasks() ↔ free_tmpmasks()
   - dup_or_alloc_cpuset() ↔ free_cpuset()

Signed-off-by: Chen Ridong <chenridong@huawei.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-08-25 08:19:57 -10:00
Chen Ridong
5806b3d051 cpuset: decouple tmpmasks and cpumasks freeing in cgroup
Currently, free_cpumasks() can free both tmpmasks and cpumasks of a cpuset
(cs). However, these two operations are not logically coupled. To improve
code clarity:
1. Move cpumask freeing to free_cpuset()
2. Rename free_cpumasks() to free_tmpmasks()

This change enforces the single responsibility principle.

Signed-off-by: Chen Ridong <chenridong@huawei.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-08-25 08:19:24 -10:00
Tiffany Yang
8d2a755895 cgroup: Fix 64-bit division in cgroup.stat.local
Fix the following build error for 32-bit systems:
   arm-linux-gnueabi-ld: kernel/cgroup/cgroup.o: in function `cgroup_core_local_stat_show':
>> kernel/cgroup/cgroup.c:3781:(.text+0x28f4): undefined reference to `__aeabi_uldivmod'
   arm-linux-gnueabi-ld: (__aeabi_uldivmod): Unknown destination type (ARM/Thumb) in kernel/cgroup/cgroup.o
>> kernel/cgroup/cgroup.c:3781:(.text+0x28f4): dangerous relocation: unsupported relocation

Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202508230604.KyvqOy81-lkp@intel.com/
Signed-off-by: Tiffany Yang <ynaffit@google.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-08-25 08:16:55 -10:00
Andy Shevchenko
a214365140 rculist: move list_for_each_rcu() to where it belongs
The list_for_each_rcu() relies on the rcu_dereference() API which is not
provided by the list.h. At the same time list.h is a low-level basic header
that must not have dependencies like RCU, besides the fact of the potential
circular dependencies in some cases. With all that said, move RCU related
API to the rculist.h where it belongs.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Simona Vetter <simona.vetter@ffwll.ch>
Reviewed-by: "Paul E. McKenney" <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.upadhyay@kernel.org>
Signed-off-by: "Paul E. McKenney" <paulmck@kernel.org>
2025-08-25 10:13:26 -07:00
Linus Torvalds
69fd6b99b8 Merge tag 'perf_urgent_for_v6.17_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fix from Borislav Petkov:

 - Fix a case where the events throttling logic operates on inactive
   events

* tag 'perf_urgent_for_v6.17_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf: Avoid undefined behavior from stopping/starting inactive events
2025-08-24 10:13:05 -04:00
Linus Torvalds
14f84cd318 Merge tag 'modules-6.17-rc3.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/modules/linux
Pull modules fix from Daniel Gomez:
 "This includes a fix part of the KSPP (Kernel Self Protection Project)
  to replace the deprecated and unsafe strcpy() calls in the kernel
  parameter string handler and sysfs parameters for built-in modules.
  Single commit, no functional changes"

* tag 'modules-6.17-rc3.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/modules/linux:
  params: Replace deprecated strcpy() with strscpy() and memcpy()
2025-08-24 09:43:50 -04:00
Pan Chuang
55b48e23f5 genirq/devres: Add error handling in devm_request_*_irq()
devm_request_threaded_irq() and devm_request_any_context_irq() currently
don't print any error message when interrupt registration fails.

This forces each driver to implement redundant error logging - over 2,000
lines of error messages exist across drivers. Additionally, when
upper-layer functions propagate these errors without logging, critical
debugging information is lost.

Add devm_request_result() helper to unify error reporting via dev_err_probe(),

Use it in devm_request_threaded_irq() and devm_request_any_context_irq()
printing device name, IRQ number, handler functions, and error code on failure
automatically.

Co-developed-by: Yangtao Li <frank.li@vivo.com>
Signed-off-by: Yangtao Li <frank.li@vivo.com>
Signed-off-by: Pan Chuang <panchuang@vivo.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250805092922.135500-2-panchuang@vivo.com
2025-08-24 13:00:45 +02:00
Inochi Amaoto
7a721a2fee genirq: Add irq_chip_(startup/shutdown)_parent()
As the MSI controller on SG2044 uses PLIC as the underlying interrupt
controller, it needs to call irq_enable() and irq_disable() to
startup/shutdown interrupts. Otherwise, the MSI interrupt can not be
startup correctly and will not respond any incoming interrupt.

Introduce irq_chip_startup_parent() and irq_chip_shutdown_parent() to allow
the interrupt controller to call the irq_startup()/irq_shutdown() callbacks
of the parent interrupt chip.

In case the irq_startup()/irq_shutdown() callbacks are not implemented for
the parent interrupt chip, this will fallback to irq_chip_enable_parent()
or irq_chip_disable_parent().

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Inochi Amaoto <inochiama@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Chen Wang <unicorn_wang@outlook.com> # Pioneerbox
Reviewed-by: Chen Wang <unicorn_wang@outlook.com>
Link: https://lore.kernel.org/all/20250813232835.43458-2-inochiama@gmail.com
Link: https://lore.kernel.org/lkml/20250722224513.22125-1-inochiama@gmail.com/
2025-08-23 21:20:25 +02:00
Sebastian Andrzej Siewior
3c71648793 genirq: Remove GENERIC_IRQ_LEGACY
IA64 is gone and with it the last GENERIC_IRQ_LEGACY user.

Remove GENERIC_IRQ_LEGACY.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250814165949.hvtP03r4@linutronix.de
2025-08-23 19:46:04 +02:00
Linus Torvalds
e1d8f9ccb2 Merge tag 'trace-v6.17-rc2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fixes from Steven Rostedt:

 - Fix rtla and latency tooling pkg-config errors

   If libtraceevent and libtracefs is installed, but their corresponding
   '.pc' files are not installed, it reports that the libraries are
   missing and confuses the developer. Instead, report that the
   pkg-config files are missing and should be installed.

 - Fix overflow bug of the parser in trace_get_user()

   trace_get_user() uses the parsing functions to parse the user space
   strings. If the parser fails due to incorrect processing, it doesn't
   terminate the buffer with a nul byte. Add a "failed" flag to the
   parser that gets set when parsing fails and is used to know if the
   buffer is fine to use or not.

 - Remove a semicolon that was at an end of a comment line

 - Fix register_ftrace_graph() to unregister the pm notifier on error

   The register_ftrace_graph() registers a pm notifier but there's an
   error path that can exit the function without unregistering it. Since
   the function returns an error, it will never be unregistered.

 - Allocate and copy ftrace hash for reader of ftrace filter files

   When the set_ftrace_filter or set_ftrace_notrace files are open for
   read, an iterator is created and sets its hash pointer to the
   associated hash that represents filtering or notrace filtering to it.
   The issue is that the hash it points to can change while the
   iteration is happening. All the locking used to access the tracer's
   hashes are released which means those hashes can change or even be
   freed. Using the hash pointed to by the iterator can cause UAF bugs
   or similar.

   Have the read of these files allocate and copy the corresponding
   hashes and use that as that will keep them the same while the
   iterator is open. This also simplifies the code as opening it for
   write already does an allocate and copy, and now that the read is
   doing the same, there's no need to check which way it was opened on
   the release of the file, and the iterator hash can always be freed.

 - Fix function graph to copy args into temp storage

   The output of the function graph tracer shows both the entry and the
   exit of a function. When the exit is right after the entry, it
   combines the two events into one with the output of "function();",
   instead of showing:

     function() {
     }

   In order to do this, the iterator descriptor that reads the events
   includes storage that saves the entry event while it peaks at the
   next event in the ring buffer. The peek can free the entry event so
   the iterator must store the information to use it after the peek.

   With the addition of function graph tracer recording the args, where
   the args are a dynamic array in the entry event, the temp storage
   does not save them. This causes the args to be corrupted or even
   cause a read of unsafe memory.

   Add space to save the args in the temp storage of the iterator.

 - Fix race between ftrace_dump and reading trace_pipe

   ftrace_dump() is used when a crash occurs where the ftrace buffer
   will be printed to the console. But it can also be triggered by
   sysrq-z. If a sysrq-z is triggered while a task is reading trace_pipe
   it can cause a race in the ftrace_dump() where it checks if the
   buffer has content, then it checks if the next event is available,
   and then prints the output (regardless if the next event was
   available or not). Reading trace_pipe at the same time can cause it
   to not be available, and this triggers a WARN_ON in the print. Move
   the printing into the check if the next event exists or not

* tag 'trace-v6.17-rc2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  ftrace: Also allocate and copy hash for reading of filter files
  ftrace: Fix potential warning in trace_printk_seq during ftrace_dump
  fgraph: Copy args in intermediate storage with entry
  trace/fgraph: Fix the warning caused by missing unregister notifier
  ring-buffer: Remove redundant semicolons
  tracing: Limit access to parser->buffer when trace_get_user failed
  rtla: Check pkg-config install
  tools/latency-collector: Check pkg-config install
2025-08-23 10:11:34 -04:00
Steven Rostedt
bfb336cf97 ftrace: Also allocate and copy hash for reading of filter files
Currently the reader of set_ftrace_filter and set_ftrace_notrace just adds
the pointer to the global tracer hash to its iterator. Unlike the writer
that allocates a copy of the hash, the reader keeps the pointer to the
filter hashes. This is problematic because this pointer is static across
function calls that release the locks that can update the global tracer
hashes. This can cause UAF and similar bugs.

Allocate and copy the hash for reading the filter files like it is done
for the writers. This not only fixes UAF bugs, but also makes the code a
bit simpler as it doesn't have to differentiate when to free the
iterator's hash between writers and readers.

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/20250822183606.12962cc3@batman.local.home
Fixes: c20489dad1 ("ftrace: Assign iter->hash to filter or notrace hashes on seq read")
Closes: https://lore.kernel.org/all/20250813023044.2121943-1-wutengda@huaweicloud.com/
Closes: https://lore.kernel.org/all/20250822192437.GA458494@ax162/
Reported-by: Tengda Wu <wutengda@huaweicloud.com>
Tested-by: Tengda Wu <wutengda@huaweicloud.com>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-08-22 19:58:35 -04:00
Tengda Wu
4013aef2ce ftrace: Fix potential warning in trace_printk_seq during ftrace_dump
When calling ftrace_dump_one() concurrently with reading trace_pipe,
a WARN_ON_ONCE() in trace_printk_seq() can be triggered due to a race
condition.

The issue occurs because:

CPU0 (ftrace_dump)                              CPU1 (reader)
echo z > /proc/sysrq-trigger

!trace_empty(&iter)
trace_iterator_reset(&iter) <- len = size = 0
                                                cat /sys/kernel/tracing/trace_pipe
trace_find_next_entry_inc(&iter)
  __find_next_entry
    ring_buffer_empty_cpu <- all empty
  return NULL

trace_printk_seq(&iter.seq)
  WARN_ON_ONCE(s->seq.len >= s->seq.size)

In the context between trace_empty() and trace_find_next_entry_inc()
during ftrace_dump, the ring buffer data was consumed by other readers.
This caused trace_find_next_entry_inc to return NULL, failing to populate
`iter.seq`. At this point, due to the prior trace_iterator_reset, both
`iter.seq.len` and `iter.seq.size` were set to 0. Since they are equal,
the WARN_ON_ONCE condition is triggered.

Move the trace_printk_seq() into the if block that checks to make sure the
return value of trace_find_next_entry_inc() is non-NULL in
ftrace_dump_one(), ensuring the 'iter.seq' is properly populated before
subsequent operations.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Ingo Molnar <mingo@elte.hu>
Link: https://lore.kernel.org/20250822033343.3000289-1-wutengda@huaweicloud.com
Fixes: d769041f86 ("ring_buffer: implement new locking")
Signed-off-by: Tengda Wu <wutengda@huaweicloud.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-08-22 17:32:36 -04:00
Steven Rostedt
e3d01979e4 fgraph: Copy args in intermediate storage with entry
The output of the function graph tracer has two ways to display its
entries. One way for leaf functions with no events recorded within them,
and the other is for functions with events recorded inside it. As function
graph has an entry and exit event, to simplify the output of leaf
functions it combines the two, where as non leaf functions are separate:

 2)               |              invoke_rcu_core() {
 2)               |                raise_softirq() {
 2)   0.391 us    |                  __raise_softirq_irqoff();
 2)   1.191 us    |                }
 2)   2.086 us    |              }

The __raise_softirq_irqoff() function above is really two events that were
merged into one. Otherwise it would have looked like:

 2)               |              invoke_rcu_core() {
 2)               |                raise_softirq() {
 2)               |                  __raise_softirq_irqoff() {
 2)   0.391 us    |                  }
 2)   1.191 us    |                }
 2)   2.086 us    |              }

In order to do this merge, the reading of the trace output file needs to
look at the next event before printing. But since the pointer to the event
is on the ring buffer, it needs to save the entry event before it looks at
the next event as the next event goes out of focus as soon as a new event
is read from the ring buffer. After it reads the next event, it will print
the entry event with either the '{' (non leaf) or ';' and timestamps (leaf).

The iterator used to read the trace file has storage for this event. The
problem happens when the function graph tracer has arguments attached to
the entry event as the entry now has a variable length "args" field. This
field only gets set when funcargs option is used. But the args are not
recorded in this temp data and garbage could be printed. The entry field
is copied via:

  data->ent = *curr;

Where "curr" is the entry field. But this method only saves the non
variable length fields from the structure.

Add a helper structure to the iterator data that adds the max args size to
the data storage in the iterator. Then simply copy the entire entry into
this storage (with size protection).

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Link: https://lore.kernel.org/20250820195522.51d4a268@gandalf.local.home
Reported-by: Sasha Levin <sashal@kernel.org>
Tested-by: Sasha Levin <sashal@kernel.org>
Closes: https://lore.kernel.org/all/aJaxRVKverIjF4a6@lappy/
Fixes: ff5c9c576e ("ftrace: Add support for function argument to graph tracer")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-08-22 17:32:35 -04:00
Tao Chen
4223bf833c bpf: Remove preempt_disable in bpf_try_get_buffers
Now BPF program will run with migration disabled, so it is safe
to access this_cpu_inc_return(bpf_bprintf_nest_level).

Fixes: d9c9e4db18 ("bpf: Factorize bpf_trace_printk and bpf_seq_printf")
Signed-off-by: Tao Chen <chen.dylane@linux.dev>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20250819125638.2544715-1-chen.dylane@linux.dev
2025-08-22 11:44:09 -07:00
Eric Biggers
d47cc4dea1 bpf: Use sha1() instead of sha1_transform() in bpf_prog_calc_tag()
Now that there's a proper SHA-1 library API, just use that instead of
the low-level SHA-1 compression function.  This eliminates the need for
bpf_prog_calc_tag() to implement the SHA-1 padding itself.  No
functional change; the computed tags remain the same.

Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/20250811201615.564461-1-ebiggers@kernel.org
2025-08-22 11:40:05 -07:00
Tiffany Yang
afa3701c0e cgroup: cgroup.stat.local time accounting
There isn't yet a clear way to identify a set of "lost" time that
everyone (or at least a wider group of users) cares about. However,
users can perform some delay accounting by iterating over components of
interest. This patch allows cgroup v2 freezing time to be one of those
components.

Track the cumulative time that each v2 cgroup spends freezing and expose
it to userland via a new local stat file in cgroupfs. Thank you to
Michal, who provided the ASCII art in the updated documentation.

To access this value:
  $ mkdir /sys/fs/cgroup/test
  $ cat /sys/fs/cgroup/test/cgroup.stat.local
  freeze_time_total 0

Ensure consistent freeze time reads with freeze_seq, a per-cgroup
sequence counter. Writes are serialized using the css_set_lock.

Signed-off-by: Tiffany Yang <ynaffit@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-08-22 07:50:43 -10:00
Chen Ridong
94a4acfec1 cgroup/psi: Set of->priv to NULL upon file release
Setting of->priv to NULL when the file is released enables earlier bug
detection. This allows potential bugs to manifest as NULL pointer
dereferences rather than use-after-free errors[1], which are generally more
difficult to diagnose.

[1] https://lore.kernel.org/cgroups/38ef3ff9-b380-44f0-9315-8b3714b0948d@huaweicloud.com/T/#m8a3b3f88f0ff3da5925d342e90043394f8b2091b
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-08-22 07:47:43 -10:00
Chen Ridong
79f919a89c cgroup: split cgroup_destroy_wq into 3 workqueues
A hung task can occur during [1] LTP cgroup testing when repeatedly
mounting/unmounting perf_event and net_prio controllers with
systemd.unified_cgroup_hierarchy=1. The hang manifests in
cgroup_lock_and_drain_offline() during root destruction.

Related case:
cgroup_fj_function_perf_event cgroup_fj_function.sh perf_event
cgroup_fj_function_net_prio cgroup_fj_function.sh net_prio

Call Trace:
	cgroup_lock_and_drain_offline+0x14c/0x1e8
	cgroup_destroy_root+0x3c/0x2c0
	css_free_rwork_fn+0x248/0x338
	process_one_work+0x16c/0x3b8
	worker_thread+0x22c/0x3b0
	kthread+0xec/0x100
	ret_from_fork+0x10/0x20

Root Cause:

CPU0                            CPU1
mount perf_event                umount net_prio
cgroup1_get_tree                cgroup_kill_sb
rebind_subsystems               // root destruction enqueues
				// cgroup_destroy_wq
// kill all perf_event css
                                // one perf_event css A is dying
                                // css A offline enqueues cgroup_destroy_wq
                                // root destruction will be executed first
                                css_free_rwork_fn
                                cgroup_destroy_root
                                cgroup_lock_and_drain_offline
                                // some perf descendants are dying
                                // cgroup_destroy_wq max_active = 1
                                // waiting for css A to die

Problem scenario:
1. CPU0 mounts perf_event (rebind_subsystems)
2. CPU1 unmounts net_prio (cgroup_kill_sb), queuing root destruction work
3. A dying perf_event CSS gets queued for offline after root destruction
4. Root destruction waits for offline completion, but offline work is
   blocked behind root destruction in cgroup_destroy_wq (max_active=1)

Solution:
Split cgroup_destroy_wq into three dedicated workqueues:
cgroup_offline_wq – Handles CSS offline operations
cgroup_release_wq – Manages resource release
cgroup_free_wq – Performs final memory deallocation

This separation eliminates blocking in the CSS free path while waiting for
offline operations to complete.

[1] https://github.com/linux-test-project/ltp/blob/master/runtest/controllers
Fixes: 334c3679ec ("cgroup: reimplement rebind_subsystems() using cgroup_apply_control() and friends")
Reported-by: Gao Yingjie <gaoyingjie@uniontech.com>
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Suggested-by: Teju Heo <tj@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-08-22 07:44:11 -10:00