Commit Graph

49620 Commits

Author SHA1 Message Date
Richard Guy Briggs
654d61b8e0 audit: record AUDIT_ANOM_* events regardless of presence of rules
When no audit rules are in place, AUDIT_ANOM_{LINK,CREAT} events
reported in audit_log_path_denied() are unconditionally dropped due to
an explicit check for the existence of any audit rules.  Given this is a
report of a security violation, allow it to be recorded regardless of
the existence of any audit rules.

To test,
	mkdir -p /root/tmp
	chmod 1777 /root/tmp
	touch /root/tmp/test.txt
	useradd test
	chown test /root/tmp/test.txt
	{echo C0644 12 test.txt; printf 'hello\ntest1\n'; printf \\000;} | \
		scp -t /root/tmp
Check with
	ausearch -m ANOM_CREAT -ts recent

Link: https://issues.redhat.com/browse/RHEL-9065
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2025-04-11 14:14:41 -04:00
Andy Shevchenko
baaba7b448 audit: mark audit_log_vformat() with __printf() attribute
audit_log_vformat() is using printf() type of format, and GCC compiler
(Debian 14.2.0-17) is not happy about this:

kernel/audit.c:1978:9: error: function ‘audit_log_vformat’
  might be a candidate for ‘gnu_printf’ format attribute
kernel/audit.c:1987:17: error: function ‘audit_log_vformat’
  might be a candidate for ‘gnu_printf’ format attribute

Fix the compilation errors (`make W=1` when CONFIG_WERROR=y, which is
default) by adding __printf() attribute.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
[PM: commit description line wrap fixes]
Signed-off-by: Paul Moore <paul@paul-moore.com>
2025-04-11 13:57:22 -04:00
Kumar Kartikeya Dwivedi
a650d38915 bpf: Convert ringbuf map to rqspinlock
Convert the raw spinlock used by BPF ringbuf to rqspinlock. Currently,
we have an open syzbot report of a potential deadlock. In addition, the
ringbuf can fail to reserve spuriously under contention from NMI
context.

It is potentially attractive to enable unconstrained usage (incl. NMIs)
while ensuring no deadlocks manifest at runtime, perform the conversion
to rqspinlock to achieve this.

This change was benchmarked for BPF ringbuf's multi-producer contention
case on an Intel Sapphire Rapids server, with hyperthreading disabled
and performance governor turned on. 5 warm up runs were done for each
case before obtaining the results.

Before (raw_spinlock_t):

Ringbuf, multi-producer contention
==================================
rb-libbpf nr_prod 1  11.440 ± 0.019M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 2  2.706 ± 0.010M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 3  3.130 ± 0.004M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 4  2.472 ± 0.003M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 8  2.352 ± 0.001M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 12 2.813 ± 0.001M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 16 1.988 ± 0.001M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 20 2.245 ± 0.001M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 24 2.148 ± 0.001M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 28 2.190 ± 0.001M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 32 2.490 ± 0.001M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 36 2.180 ± 0.001M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 40 2.201 ± 0.001M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 44 2.226 ± 0.001M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 48 2.164 ± 0.001M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 52 1.874 ± 0.001M/s (drops 0.000 ± 0.000M/s)

After (rqspinlock_t):

Ringbuf, multi-producer contention
==================================
rb-libbpf nr_prod 1  11.078 ± 0.019M/s (drops 0.000 ± 0.000M/s) (-3.16%)
rb-libbpf nr_prod 2  2.801 ± 0.014M/s (drops 0.000 ± 0.000M/s) (3.51%)
rb-libbpf nr_prod 3  3.454 ± 0.005M/s (drops 0.000 ± 0.000M/s) (10.35%)
rb-libbpf nr_prod 4  2.567 ± 0.002M/s (drops 0.000 ± 0.000M/s) (3.84%)
rb-libbpf nr_prod 8  2.468 ± 0.001M/s (drops 0.000 ± 0.000M/s) (4.93%)
rb-libbpf nr_prod 12 2.510 ± 0.001M/s (drops 0.000 ± 0.000M/s) (-10.77%)
rb-libbpf nr_prod 16 2.075 ± 0.001M/s (drops 0.000 ± 0.000M/s) (4.38%)
rb-libbpf nr_prod 20 2.640 ± 0.001M/s (drops 0.000 ± 0.000M/s) (17.59%)
rb-libbpf nr_prod 24 2.092 ± 0.001M/s (drops 0.000 ± 0.000M/s) (-2.61%)
rb-libbpf nr_prod 28 2.426 ± 0.005M/s (drops 0.000 ± 0.000M/s) (10.78%)
rb-libbpf nr_prod 32 2.331 ± 0.004M/s (drops 0.000 ± 0.000M/s) (-6.39%)
rb-libbpf nr_prod 36 2.306 ± 0.003M/s (drops 0.000 ± 0.000M/s) (5.78%)
rb-libbpf nr_prod 40 2.178 ± 0.002M/s (drops 0.000 ± 0.000M/s) (-1.04%)
rb-libbpf nr_prod 44 2.293 ± 0.001M/s (drops 0.000 ± 0.000M/s) (3.01%)
rb-libbpf nr_prod 48 2.022 ± 0.001M/s (drops 0.000 ± 0.000M/s) (-6.56%)
rb-libbpf nr_prod 52 1.809 ± 0.001M/s (drops 0.000 ± 0.000M/s) (-3.47%)

There's a fair amount of noise in the benchmark, with numbers on reruns
going up and down by 10%, so all changes are in the range of this
disturbance, and we see no major regressions.

Reported-by: syzbot+850aaf14624dc0c6d366@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/0000000000004aa700061379547e@google.com
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20250411101759.4061366-1-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-04-11 10:28:26 -07:00
Breno Leitao
0f08335ade trace: tcp: Add tracepoint for tcp_sendmsg_locked()
Add a tracepoint to monitor TCP send operations, enabling detailed
visibility into TCP message transmission.

Create a new tracepoint within the tcp_sendmsg_locked function,
capturing traditional fields along with size_goal, which indicates the
optimal data size for a single TCP segment. Additionally, a reference to
the struct sock sk is passed, allowing direct access for BPF programs.
The implementation is largely based on David's patch[1] and suggestions.

Link: https://lore.kernel.org/all/70168c8f-bf52-4279-b4c4-be64527aa1ac@kernel.org/ [1]
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250408-tcpsendmsg-v3-2-208b87064c28@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-10 18:34:05 -07:00
Jakub Kicinski
cb7103298d Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.15-rc2).

Conflict:

Documentation/networking/netdevices.rst
net/core/lock_debug.c
  04efcee6ef ("net: hold instance lock during NETDEV_CHANGE")
  03df156dd3 ("xdp: double protect netdev->xdp_flags with netdev->lock")

No adjacent changes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-10 16:51:07 -07:00
Linus Torvalds
34833819d2 Merge tag 'timers-urgent-2025-04-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull misc timer fixes from Ingo Molnar:

 - Fix missing ACCESS_PRIVATE() that triggered a Sparse warning

 - Fix lockdep false positive in tick_freeze() on CONFIG_PREEMPT_RT=y

 - Avoid <vdso/unaligned.h> macro's variable shadowing to address build
   warning that triggers under W=2 builds

* tag 'timers-urgent-2025-04-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  vdso: Address variable shadowing in macros
  timekeeping: Add a lockdep override in tick_freeze()
  hrtimer: Add missing ACCESS_PRIVATE() for hrtimer::function
2025-04-10 15:39:39 -07:00
Linus Torvalds
ac253a537d Merge tag 'perf-urgent-2025-04-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull misc perf events fixes from Ingo Molnar:

 - Fix __free_event() corner case splat

 - Fix false-positive uprobes related lockdep splat on
   CONFIG_PREEMPT_RT=y kernels

 - Fix a complicated perf sigtrap race that may result in hangs

* tag 'perf-urgent-2025-04-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf: Fix hang while freeing sigtrap event
  uprobes: Avoid false-positive lockdep splat on CONFIG_PREEMPT_RT=y in the ri_timer() uprobe timer callback, use raw_write_seqcount_*()
  perf/core: Fix WARN_ON(!ctx) in __free_event() for partial init
2025-04-10 14:47:36 -07:00
Kumar Kartikeya Dwivedi
2f41503d64 bpf: Convert queue_stack map to rqspinlock
Replace all usage of raw_spinlock_t in queue_stack_maps.c with
rqspinlock. This is a map type with a set of open syzbot reports
reproducing possible deadlocks. Prior attempt to fix the issues
was at [0], but was dropped in favor of this approach.

Make sure we return the -EBUSY error in case of possible deadlocks or
timeouts, just to make sure user space or BPF programs relying on the
error code to detect problems do not break.

With these changes, the map should be safe to access in any context,
including NMIs.

  [0]: https://lore.kernel.org/all/20240429165658.1305969-1-sidchintamaneni@gmail.com

Reported-by: syzbot+8bdfc2c53fb2b63e1871@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/0000000000004c3fc90615f37756@google.com
Reported-by: syzbot+252bc5c744d0bba917e1@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/000000000000c80abd0616517df9@google.com
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20250410153142.2064340-1-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-04-10 12:51:10 -07:00
Kumar Kartikeya Dwivedi
92b90f780d bpf: Use architecture provided res_smp_cond_load_acquire
In v2 of rqspinlock [0], we fixed potential problems with WFE usage in
arm64 to fallback to a version copied from Ankur's series [1]. This
logic was moved into arch-specific headers in v3 [2].

However, we missed using the arch-provided res_smp_cond_load_acquire
in commit ebababcd03 ("rqspinlock: Hardcode cond_acquire loops for arm64")
due to a rebasing mistake between v2 and v3 of the rqspinlock series.
Fix the typo to fallback to the arm64 definition as we did in v2.

  [0]: https://lore.kernel.org/bpf/20250206105435.2159977-18-memxor@gmail.com
  [1]: https://lore.kernel.org/lkml/20250203214911.898276-1-ankur.a.arora@oracle.com
  [2]: https://lore.kernel.org/bpf/20250303152305.3195648-9-memxor@gmail.com

Fixes: ebababcd03 ("rqspinlock: Hardcode cond_acquire loops for arm64")
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20250410145512.1876745-1-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-04-10 12:47:07 -07:00
Hou Tao
6704b1e8cf bpf: Don't allocate per-cpu extra_elems for fd htab
The update of element in fd htab is in-place now, therefore, there is no
need to allocate per-cpu extra_elems, just remove it.

Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20250401062250.543403-6-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-04-09 20:12:54 -07:00
Hou Tao
e8a65856c7 bpf: Add is_fd_htab() helper
Add is_fd_htab() helper to check whether the map is htab of maps.

Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20250401062250.543403-5-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-04-09 20:12:53 -07:00
Hou Tao
2c304172e0 bpf: Support atomic update for htab of maps
As reported by Cody Haas [1], when there is concurrent map lookup and
map update operation in an existing element for htab of maps, the map
lookup procedure may return -ENOENT unexpectedly.

The root cause is twofold:

1) the update of existing element involves two separated list operation
In htab_map_update_elem(), it first inserts the new element at the head
of list, then it deletes the old element. Therefore, it is possible a
lookup operation has already iterated to the middle of the list when a
concurrent update operation begins, and the lookup operation will fail
to find the target element.

2) the immediate reuse of htab element.
It is more subtle. Even through the lookup operation finds the old
element, it is possible that the target element has been removed by a
concurrent update operation, and the element has been reused immediately
by other update operation which runs on the same CPU as the previous
update operation, and the element is inserted into the same bucket list.
After these steps above, when the lookup operation tries to compare the
key in the old element with the expected key, the match will fail
because the key in the old element have been overwritten by other update
operation.

The two-step update process is relatively straightforward to address.
The more challenging aspect is the immediate reuse. As Alexei pointed
out:

 So since 2022 both prealloc and no_prealloc reuse elements.
 We can consider a new flag for the hash map like F_REUSE_AFTER_RCU_GP
 that will use _rcu() flavor of freeing into bpf_ma,
 but it has to have a strong reason.

Given that htab of maps doesn't support special field in value and
directly stores the inner map pointer in htab_element, just do in-place
update for htab of maps instead of attempting to address the immediate
reuse issue.

[1]: https://lore.kernel.org/xdp-newbies/CAH7f-ULFTwKdoH_t2SFc5rWCVYLEg-14d1fBYWH2eekudsnTRg@mail.gmail.com/

Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20250401062250.543403-4-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-04-09 20:12:53 -07:00
Hou Tao
5771e306b6 bpf: Rename __htab_percpu_map_update_elem to htab_map_update_elem_in_place
Rename __htab_percpu_map_update_elem to htab_map_update_elem_in_place,
and add a new percpu argument for the helper to support in-place update
for both per-cpu htab and htab of maps.

Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20250401062250.543403-3-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-04-09 20:12:53 -07:00
Hou Tao
ba2b31b0f3 bpf: Factor out htab_elem_value helper()
All hash maps store map key and map value together. The relative offset
of the map value compared to the map key is round_up(key_size, 8).
Therefore, factor out a common helper htab_elem_value() to calculate the
address of the map value instead of duplicating the logic.

Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20250401062250.543403-2-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-04-09 20:12:53 -07:00
Stanislav Fomichev
311920774c configs/debug: run and debug PREEMPT
Recent change [0] resulted in a "BUG: using __this_cpu_read() in
preemptible" splat [1]. PREEMPT kernels have additional requirements
on what can and can not run with/without preemption enabled.
Expose those constrains in the debug kernels.

0: https://lore.kernel.org/netdev/20250314120048.12569-2-justin.iurman@uliege.be/
1: https://lore.kernel.org/netdev/20250402094458.006ba2a7@kernel.org/T/#mbf72641e9d7d274daee9003ef5edf6833201f1bc

Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Simon Horman <horms@kernel.org>
Acked-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250402172305.1775226-1-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-09 17:47:06 -07:00
Tao Chen
a76116f422 bpf: Check link_create.flags parameter for multi_uprobe
The link_create.flags are currently not used for multi-uprobes, so return
-EINVAL if it is set, same as for other attach APIs.

We allow target_fd to have an arbitrary value for multi-uprobe, though,
as there are existing users (libbpf) relying on this.

Fixes: 89ae89f53d ("bpf: Add multi uprobe link")
Signed-off-by: Tao Chen <chen.dylane@linux.dev>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20250407035752.1108927-2-chen.dylane@linux.dev
2025-04-09 16:28:51 -07:00
Tao Chen
243911982a bpf: Check link_create.flags parameter for multi_kprobe
The link_create.flags are currently not used for multi-kprobes, so return
-EINVAL if it is set, same as for other attach APIs.

We allow target_fd, on the other hand, to have an arbitrary value for
multi-kprobe, as there are existing users (libbpf) relying on this.

Fixes: 0dcac27254 ("bpf: Add multi kprobe link")
Signed-off-by: Tao Chen <chen.dylane@linux.dev>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20250407035752.1108927-1-chen.dylane@linux.dev
2025-04-09 16:28:22 -07:00
Sebastian Andrzej Siewior
92e250c624 timekeeping: Add a lockdep override in tick_freeze()
tick_freeze() acquires a raw spinlock (tick_freeze_lock). Later in the
callchain (timekeeping_suspend() -> mc146818_avoid_UIP()) the RTC driver
acquires a spinlock which becomes a sleeping lock on PREEMPT_RT.  Lockdep
complains about this lock nesting.

Add a lockdep override for this special case and a comment explaining
why it is okay.

Reported-by: Borislav Petkov <bp@alien8.de>
Reported-by: Chris Bainbridge <chris.bainbridge@gmail.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/20250404133429.pnAzf-eF@linutronix.de
Closes: https://lore.kernel.org/all/20250330113202.GAZ-krsjAnurOlTcp-@fat_crate.local/
Closes: https://lore.kernel.org/all/CAP-bSRZ0CWyZZsMtx046YV8L28LhY0fson2g4EqcwRAVN1Jk+Q@mail.gmail.com/
2025-04-09 22:30:39 +02:00
Eric Dumazet
0df6db767a posix-timers: Initialize cache early and move pointer into __timer_data
Move posix_timers_cache initialization to posixtimer_init(). At that point
the memory subsystem is already up and running.

Also move the cache pointer to the __timer_data variable to avoid
potential false sharing, since it never was marked as __ro_after_init.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250402133114.253901-1-edumazet@google.com
2025-04-09 21:21:36 +02:00
Tejun Heo
0b30461793 sched_ext: Make scx_has_op a bitmap
scx_has_op is used to encode which ops are implemented by the BPF scheduler
into an array of static_keys. While this saves a bit of branching overhead,
that is unlikely to be noticeable compared to the overall cost. As the
global static_keys can't work with the planned hierarchical multiple
scheduler support, replace the static_key array with a bitmap.

In repeated hackbench runs before and after static_keys removal on an AMD
Ryzen 3900X, I couldn't tell any measurable performance difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Changwoo Min <changwoo@igalia.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
2025-04-09 09:06:00 -10:00
Tejun Heo
743354e3bb sched_ext: Remove scx_ops_allow_queued_wakeup static_key
scx_ops_allow_queued_wakeup is used to encode SCX_OPS_ALLOW_QUEUED_WAKEUP
into a static_key. The test is gated behind scx_enabled(), and, even when
sched_ext is enabled, is unlikely for the static_key usage to make any
meaningful difference. It is made to use a static_key mostly because there
was no reason not to. However, global static_keys can't work with the
planned hierarchical multiple scheduler support. Remove the static_key and
instead test SCX_OPS_ALLOW_QUEUED_WAKEUP directly.

In repeated hackbench runs before and after static_keys removal on an AMD
Ryzen 3900X, I couldn't tell any measurable performance difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Changwoo Min <changwoo@igalia.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
2025-04-09 09:06:00 -10:00
Tejun Heo
54d2e717bc sched_ext: Remove scx_ops_cpu_preempt static_key
scx_ops_cpu_preempt is used to encode whether ops.cpu_acquire/release() are
implemented into a static_key. These tests aren't hot enough for static_key
usage to make any meaningful difference and are made to use a static_key
mostly because there was no reason not to. However, global static_keys can't
work with the planned hierarchical multiple scheduler support. Remove the
static_key and instead use an internal ops flag SCX_OPS_HAS_CPU_PREEMPT to
record and test whether ops.cpu_acquire/release() are implemented.

In repeated hackbench runs before and after static_keys removal on an AMD
Ryzen 3900X, I couldn't tell any measurable performance difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Changwoo Min <changwoo@igalia.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
2025-04-09 09:06:00 -10:00
Tejun Heo
cc39454c34 sched_ext: Remove scx_ops_enq_* static_keys
scx_ops_enq_last/exiting/migration_disabled are used to encode the
corresponding SCX_OPS_ flags into static_keys. These flags aren't hot enough
for static_key usage to make any meaningful difference and are made
static_keys mostly because there was no reason not to. However, global
static_keys can't work with the planned hierarchical multiple scheduler
support. Remove the static_keys and test the ops flags directly.

In repeated hackbench runs before and after static_keys removal on an AMD
Ryzen 3900X, I couldn't tell any measurable performance difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Changwoo Min <changwoo@igalia.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
2025-04-09 09:05:59 -10:00
Tejun Heo
d75ee2d678 sched_ext: Indentation updates
Purely cosmetic.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Changwoo Min <changwoo@igalia.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
2025-04-09 09:05:59 -10:00
Nam Cao
2424e146be hrtimer: Add missing ACCESS_PRIVATE() for hrtimer::function
The "function" field of struct hrtimer has been changed to private, but
two instances have not been converted to use ACCESS_PRIVATE().

Convert them to use ACCESS_PRIVATE().

Fixes: 04257da0c9 ("hrtimers: Make callback function pointer private")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250408103854.1851093-1-namcao@linutronix.de
Closes: https://lore.kernel.org/oe-kbuild-all/202504071931.vOVl13tt-lkp@intel.com/
Closes: https://lore.kernel.org/oe-kbuild-all/202504072155.5UAZjYGU-lkp@intel.com/
2025-04-09 21:00:42 +02:00
Thomas Gleixner
9357e329cd genirq/msi: Rename msi_[un]lock_descs()
Now that all abuse is gone and the legit users are converted to
guard(msi_descs_lock), rename the lock functions and document them as
internal.

No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huwei.com>
Link: https://lore.kernel.org/all/20250319105506.864699741@linutronix.de
2025-04-09 20:47:30 +02:00
Thomas Gleixner
0dac2b0930 genirq/msi: Use lock guards for MSI descriptor locking
Provide a lock guard for MSI descriptor locking and update the core code
accordingly.

No functional change intended.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Link: https://lore.kernel.org/all/20250319105506.144672678@linutronix.de
2025-04-09 20:47:29 +02:00
Thorsten Blum
cfdb7520f9 PM: hibernate: Remove size arguments when calling strscpy()
The size parameter is optional and strscpy() automatically determines
the length of the destination buffer using sizeof() if the argument is
omitted. This makes the explicit sizeof() calls unnecessary. Remove
them to shorten and simplify the code.

Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Link: https://patch.msgid.link/20250318080755.61126-2-thorsten.blum@linux.dev
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-04-09 20:11:46 +02:00
Steven Rostedt
e1a453a57b tracing: Do not add length to print format in synthetic events
The following causes a vsnprintf fault:

  # echo 's:wake_lat char[] wakee; u64 delta;' >> /sys/kernel/tracing/dynamic_events
  # echo 'hist:keys=pid:ts=common_timestamp.usecs if !(common_flags & 0x18)' > /sys/kernel/tracing/events/sched/sched_waking/trigger
  # echo 'hist:keys=next_pid:delta=common_timestamp.usecs-$ts:onmatch(sched.sched_waking).trace(wake_lat,next_comm,$delta)' > /sys/kernel/tracing/events/sched/sched_switch/trigger

Because the synthetic event's "wakee" field is created as a dynamic string
(even though the string copied is not). The print format to print the
dynamic string changed from "%*s" to "%s" because another location
(__set_synth_event_print_fmt()) exported this to user space, and user
space did not need that. But it is still used in print_synth_event(), and
the output looks like:

          <idle>-0       [001] d..5.   193.428167: wake_lat: wakee=(efault)sshd-sessiondelta=155
    sshd-session-879     [001] d..5.   193.811080: wake_lat: wakee=(efault)kworker/u34:5delta=58
          <idle>-0       [002] d..5.   193.811198: wake_lat: wakee=(efault)bashdelta=91
            bash-880     [002] d..5.   193.811371: wake_lat: wakee=(efault)kworker/u35:2delta=21
          <idle>-0       [001] d..5.   193.811516: wake_lat: wakee=(efault)sshd-sessiondelta=129
    sshd-session-879     [001] d..5.   193.967576: wake_lat: wakee=(efault)kworker/u34:5delta=50

The length isn't needed as the string is always nul terminated. Just print
the string and not add the length (which was hard coded to the max string
length anyway).

Cc: stable@vger.kernel.org
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Tom Zanussi <zanussi@kernel.org>
Cc: Douglas Raillard <douglas.raillard@arm.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Link: https://lore.kernel.org/20250407154139.69955768@gandalf.local.home
Fixes: 4d38328eb4 ("tracing: Fix synth event printk format for str fields");
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-04-09 11:34:21 -04:00
Arnd Bergmann
d7b98ae522 dma/contiguous: avoid warning about unused size_bytes
When building with W=1, this variable is unused for configs with
CONFIG_CMA_SIZE_SEL_PERCENTAGE=y:

kernel/dma/contiguous.c:67:26: error: 'size_bytes' defined but not used [-Werror=unused-const-variable=]

Change this to a macro to avoid the warning.

Fixes: c64be2bb1c ("drivers: add Contiguous Memory Allocator")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Link: https://lore.kernel.org/r/20250409151557.3890443-1-arnd@kernel.org
2025-04-09 17:28:53 +02:00
Joel Granados
bc4f328ff5 sparc: mv sparc sysctls into their own file under arch/sparc/kernel
Move sparc sysctls (reboot-cmd, stop-a, scons-poweroff and tsb-ratio)
into a new file (arch/sparc/kernel/setup.c). This file will be included
for both 32 and 64 bit sparc. Leave "tsb-ratio" under SPARC64 ifdef as
it was in kernel/sysctl.c. The sysctl table register is called with
arch_initcall placing it after its original place in proc_root_init.

This is part of a greater effort to move ctl tables into their
respective subsystems which will reduce the merge conflicts in
kernel/sysctl.c.

Signed-off-by: Joel Granados <joel.granados@kernel.org>
2025-04-09 13:32:16 +02:00
Joel Granados
67049b53e0 stack_tracer: move sysctl registration to kernel/trace/trace_stack.c
Move stack_tracer_enabled into trace_stack_sysctl_table. This is part of
a greater effort to move ctl tables into their respective subsystems
which will reduce the merge conflicts in kernel/sysctl.c.

Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Joel Granados <joel.granados@kernel.org>
2025-04-09 13:32:16 +02:00
Joel Granados
dd293df639 tracing: Move trace sysctls into trace.c
Move trace ctl tables into their own const array in
kernel/trace/trace.c. The sysctl table register is called with
subsys_initcall placing if after its original place in proc_root_init.
This is part of a greater effort to move ctl tables into their
respective subsystems which will reduce the merge conflicts in
kernel/sysctl.c.

Signed-off-by: Joel Granados <joel.granados@kernel.org>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-04-09 13:32:16 +02:00
Joel Granados
256db5c9b5 signal: Move signal ctl tables into signal.c
Move print-fatal-signals into its own const ctl table array in
kernel/signal.c. This is part of a greater effort to move ctl tables
into their respective subsystems which will reduce the merge conflicts
in kernel/sysctl.c.

Signed-off-by: Joel Granados <joel.granados@kernel.org>
2025-04-09 13:32:16 +02:00
Joel Granados
c09b981041 panic: Move panic ctl tables into panic.c
Move panic, panic_on_oops, panic_print, panic_on_warn into
kerne/panic.c. This is part of a greater effort to move ctl tables into
their respective subsystems which will reduce the merge conflicts in
kernel/sysctl.c.

Signed-off-by: Joel Granados <joel.granados@kernel.org>
2025-04-09 13:32:16 +02:00
Geert Uytterhoeven
771487050f genirq/generic-chip: Fix incorrect lock guard conversions
When booting BeagleBone Black:

    ------------[ cut here ]------------
    WARNING: CPU: 0 PID: 0 at kernel/locking/lockdep.c:4398 lockdep_hardirqs_on_prepare+0x23c/0x280
    DEBUG_LOCKS_WARN_ON(early_boot_irqs_disabled)
    CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted 6.15.0-rc1-boneblack-00004-g195298c3b116 #209 NONE
    Hardware name: Generic AM33XX (Flattened Device Tree)
    Call trace:
     _raw_spin_unlock_irq from irq_map_generic_chip+0x144/0x190
     irq_map_generic_chip from irq_domain_associate_locked+0x68/0x164
     irq_domain_associate_locked from irq_create_fwspec_mapping+0x34c/0x43c
     irq_create_fwspec_mapping from irq_create_of_mapping+0x64/0x8c
     irq_create_of_mapping from irq_of_parse_and_map+0x54/0x7c
     irq_of_parse_and_map from dmtimer_clkevt_init_common+0x54/0x15c
     dmtimer_clkevt_init_common from dmtimer_systimer_init+0x41c/0x5b8
     dmtimer_systimer_init from timer_probe+0x68/0xf0
     timer_probe from start_kernel+0x4a4/0x6bc
     start_kernel from 0x0
    irq event stamp: 0
    hardirqs last  enabled at (0): [<00000000>] 0x0
    hardirqs last disabled at (0): [<00000000>] 0x0
    softirqs last  enabled at (0): [<00000000>] 0x0
    softirqs last disabled at (0): [<00000000>] 0x0
    ---[ end trace 0000000000000000 ]---

and:

    ------------[ cut here ]------------
    WARNING: CPU: 0 PID: 0 at init/main.c:1022 start_kernel+0x4e8/0x6bc
    Interrupts were enabled early
    CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Tainted: G        W           6.15.0-rc1-boneblack-00004-g195298c3b116 #209 NONE
    Tainted: [W]=WARN
    Hardware name: Generic AM33XX (Flattened Device Tree)
    Call trace:
     unwind_backtrace from show_stack+0x10/0x14
     show_stack from dump_stack_lvl+0x6c/0x90
     dump_stack_lvl from __warn+0x70/0x1b0
     __warn from warn_slowpath_fmt+0x1d4/0x1ec
     warn_slowpath_fmt from start_kernel+0x4e8/0x6bc
     start_kernel from 0x0
    irq event stamp: 0
    hardirqs last  enabled at (0): [<00000000>] 0x0
    hardirqs last disabled at (0): [<00000000>] 0x0
    softirqs last  enabled at (0): [<00000000>] 0x0
    softirqs last disabled at (0): [<00000000>] 0x0
    ---[ end trace 0000000000000000 ]---

Fix this by correcting two misconversions of
raw_spin_{,un}lock_irq{save,restore}() to lock guards.

Fixes: 195298c3b1 ("genirq/generic-chip: Convert core code to lock guards")
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/514f94c5891c61ac0a4a7fdad113e75db1eea367.1744135467.git.geert+renesas@glider.be
2025-04-08 22:49:02 +02:00
Linus Torvalds
bec7dcbc24 Merge tag 'probes-fixes-v6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull probes fixes from Masami Hiramatsu:

 - fprobe: remove fprobe_hlist_node when module unloading

   When a fprobe target module is removed, the fprobe_hlist_node should
   be removed from the fprobe's hash table to prevent reusing
   accidentally if another module is loaded at the same address.

 - fprobe: lock module while registering fprobe

   The module containing the function to be probeed is locked using a
   reference counter until the fprobe registration is complete, which
   prevents use after free.

 - fprobe-events: fix possible UAF on modules

   Basically as same as above, but in the fprobe-events layer we also
   need to get module reference counter when we find the tracepoint in
   the module.

* tag 'probes-fixes-v6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing: fprobe: Cleanup fprobe hash when module unloading
  tracing: fprobe events: Fix possible UAF on modules
  tracing: fprobe: Fix to lock module while registering fprobe
2025-04-08 12:51:34 -07:00
Linus Torvalds
e37f72b3b4 Merge tag 'cgroup-for-6.15-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fixes from Tejun Heo:

 - A number of cpuset remote partition related fixes and cleanups along
   with selftest updates.

 - A change from this merge window made cgroup_rstat_updated_list()
   called outside cgroup_rstat_lock leading to list corruptions. Fix it
   by relocating the call inside the lock.

* tag 'cgroup-for-6.15-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup/cpuset: Fix race between newly created partition and dying one
  cgroup: rstat: call cgroup_rstat_updated_list with cgroup_rstat_lock
  selftest/cgroup: Add a remote partition transition test to test_cpuset_prs.sh
  selftest/cgroup: Clean up and restructure test_cpuset_prs.sh
  selftest/cgroup: Update test_cpuset_prs.sh to use | as effective CPUs and state separator
  cgroup/cpuset: Remove unneeded goto in sched_partition_write() and rename it
  cgroup/cpuset: Code cleanup and comment update
  cgroup/cpuset: Don't allow creation of local partition over a remote one
  cgroup/cpuset: Remove remote_partition_check() & make update_cpumasks_hier() handle remote partition
  cgroup/cpuset: Fix error handling in remote_partition_disable()
  cgroup/cpuset: Fix incorrect isolated_cpus update in update_parent_effective_cpumask()
2025-04-08 12:15:05 -07:00
Tejun Heo
294f5ff474 sched_ext: Merge branch 'for-6.15-fixes' into for-6.16
Pull for-6.15-fixes to receive:

 e776b26e37 ("sched_ext: Remove cpu.weight / cpu.idle unimplemented warnings")

which conflicts with:

 1a7ff7216c ("sched_ext: Drop "ops" from scx_ops_enable_state and friends")

The former removes code updated by the latter. Resolved by removing the
updated section.

Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-08 08:56:57 -10:00
Joel Fernandes
f50ad4b73e srcu: Use rcu_seq_done_exact() for polling API
poll_state_synchronize_srcu() uses rcu_seq_done() unlike
poll_state_synchronize_rcu() which uses rcu_seq_done_exact().

The  rcu_seq_done_exact() makes more sense for polling API, as with
this API, there is a higher chance that there is a significant delay
between the get_state..() and poll_state..() calls since a cookie
can be stored and reused at a later time. During such a delay, if
the gp_seq counter progresses more than ULONG_MAX/2 distance, then
poll_state..() may return false for a long time unwantedly.

Fix by using the more accurate rcu_seq_done_exact() API which is
exactly what straight RCU's polling does.

It may make sense, as future work, to add debug code here as well, where
we compare a physical timestamp between get_state..() and poll_state()
calls and yell if significant time has past but the grace period has
still not progressed.

Reviewed-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
2025-04-08 14:55:55 -04:00
Phil Auld
6432e163ba sched/isolation: Make use of more than one housekeeping cpu
The exising code uses housekeeping_any_cpu() to select a cpu for
a given housekeeping task. However, this often ends up calling
cpumask_any_and() which is defined as cpumask_first_and() which has
the effect of alyways using the first cpu among those available.

The same applies when multiple NUMA nodes are involved. In that
case the first cpu in the local node is chosen which does provide
a bit of spreading but with multiple HK cpus per node the same
issues arise.

We have numerous cases where a single HK cpu just cannot keep up
and the remote_tick warning fires. It also can lead to the other
things (orchastration sw, HA keepalives etc) on the HK cpus getting
starved which leads to other issues.  In these cases we recommend
increasing the number of HK cpus.  But... that only helps the
userspace tasks somewhat. It does not help the actual housekeeping
part.

Spread the HK work out by having housekeeping_any_cpu() and
sched_numa_find_closest() use cpumask_any_and_distribute()
instead of cpumask_any_and().

Signed-off-by: Phil Auld <pauld@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Waiman Long <longman@redhat.com>
Reviewed-by: Vishal Chourasia <vishalc@linux.ibm.com>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20250218184618.1331715-1-pauld@redhat.com
2025-04-08 20:55:55 +02:00
Harshit Agarwal
690e47d140 sched/rt: Fix race in push_rt_task
Overview
========
When a CPU chooses to call push_rt_task and picks a task to push to
another CPU's runqueue then it will call find_lock_lowest_rq method
which would take a double lock on both CPUs' runqueues. If one of the
locks aren't readily available, it may lead to dropping the current
runqueue lock and reacquiring both the locks at once. During this window
it is possible that the task is already migrated and is running on some
other CPU. These cases are already handled. However, if the task is
migrated and has already been executed and another CPU is now trying to
wake it up (ttwu) such that it is queued again on the runqeue
(on_rq is 1) and also if the task was run by the same CPU, then the
current checks will pass even though the task was migrated out and is no
longer in the pushable tasks list.

Crashes
=======
This bug resulted in quite a few flavors of crashes triggering kernel
panics with various crash signatures such as assert failures, page
faults, null pointer dereferences, and queue corruption errors all
coming from scheduler itself.

Some of the crashes:
-> kernel BUG at kernel/sched/rt.c:1616! BUG_ON(idx >= MAX_RT_PRIO)
   Call Trace:
   ? __die_body+0x1a/0x60
   ? die+0x2a/0x50
   ? do_trap+0x85/0x100
   ? pick_next_task_rt+0x6e/0x1d0
   ? do_error_trap+0x64/0xa0
   ? pick_next_task_rt+0x6e/0x1d0
   ? exc_invalid_op+0x4c/0x60
   ? pick_next_task_rt+0x6e/0x1d0
   ? asm_exc_invalid_op+0x12/0x20
   ? pick_next_task_rt+0x6e/0x1d0
   __schedule+0x5cb/0x790
   ? update_ts_time_stats+0x55/0x70
   schedule_idle+0x1e/0x40
   do_idle+0x15e/0x200
   cpu_startup_entry+0x19/0x20
   start_secondary+0x117/0x160
   secondary_startup_64_no_verify+0xb0/0xbb

-> BUG: kernel NULL pointer dereference, address: 00000000000000c0
   Call Trace:
   ? __die_body+0x1a/0x60
   ? no_context+0x183/0x350
   ? __warn+0x8a/0xe0
   ? exc_page_fault+0x3d6/0x520
   ? asm_exc_page_fault+0x1e/0x30
   ? pick_next_task_rt+0xb5/0x1d0
   ? pick_next_task_rt+0x8c/0x1d0
   __schedule+0x583/0x7e0
   ? update_ts_time_stats+0x55/0x70
   schedule_idle+0x1e/0x40
   do_idle+0x15e/0x200
   cpu_startup_entry+0x19/0x20
   start_secondary+0x117/0x160
   secondary_startup_64_no_verify+0xb0/0xbb

-> BUG: unable to handle page fault for address: ffff9464daea5900
   kernel BUG at kernel/sched/rt.c:1861! BUG_ON(rq->cpu != task_cpu(p))

-> kernel BUG at kernel/sched/rt.c:1055! BUG_ON(!rq->nr_running)
   Call Trace:
   ? __die_body+0x1a/0x60
   ? die+0x2a/0x50
   ? do_trap+0x85/0x100
   ? dequeue_top_rt_rq+0xa2/0xb0
   ? do_error_trap+0x64/0xa0
   ? dequeue_top_rt_rq+0xa2/0xb0
   ? exc_invalid_op+0x4c/0x60
   ? dequeue_top_rt_rq+0xa2/0xb0
   ? asm_exc_invalid_op+0x12/0x20
   ? dequeue_top_rt_rq+0xa2/0xb0
   dequeue_rt_entity+0x1f/0x70
   dequeue_task_rt+0x2d/0x70
   __schedule+0x1a8/0x7e0
   ? blk_finish_plug+0x25/0x40
   schedule+0x3c/0xb0
   futex_wait_queue_me+0xb6/0x120
   futex_wait+0xd9/0x240
   do_futex+0x344/0xa90
   ? get_mm_exe_file+0x30/0x60
   ? audit_exe_compare+0x58/0x70
   ? audit_filter_rules.constprop.26+0x65e/0x1220
   __x64_sys_futex+0x148/0x1f0
   do_syscall_64+0x30/0x80
   entry_SYSCALL_64_after_hwframe+0x62/0xc7

-> BUG: unable to handle page fault for address: ffff8cf3608bc2c0
   Call Trace:
   ? __die_body+0x1a/0x60
   ? no_context+0x183/0x350
   ? spurious_kernel_fault+0x171/0x1c0
   ? exc_page_fault+0x3b6/0x520
   ? plist_check_list+0x15/0x40
   ? plist_check_list+0x2e/0x40
   ? asm_exc_page_fault+0x1e/0x30
   ? _cond_resched+0x15/0x30
   ? futex_wait_queue_me+0xc8/0x120
   ? futex_wait+0xd9/0x240
   ? try_to_wake_up+0x1b8/0x490
   ? futex_wake+0x78/0x160
   ? do_futex+0xcd/0xa90
   ? plist_check_list+0x15/0x40
   ? plist_check_list+0x2e/0x40
   ? plist_del+0x6a/0xd0
   ? plist_check_list+0x15/0x40
   ? plist_check_list+0x2e/0x40
   ? dequeue_pushable_task+0x20/0x70
   ? __schedule+0x382/0x7e0
   ? asm_sysvec_reschedule_ipi+0xa/0x20
   ? schedule+0x3c/0xb0
   ? exit_to_user_mode_prepare+0x9e/0x150
   ? irqentry_exit_to_user_mode+0x5/0x30
   ? asm_sysvec_reschedule_ipi+0x12/0x20

Above are some of the common examples of the crashes that were observed
due to this issue.

Details
=======
Let's look at the following scenario to understand this race.

1) CPU A enters push_rt_task
  a) CPU A has chosen next_task = task p.
  b) CPU A calls find_lock_lowest_rq(Task p, CPU Z’s rq).
  c) CPU A identifies CPU X as a destination CPU (X < Z).
  d) CPU A enters double_lock_balance(CPU Z’s rq, CPU X’s rq).
  e) Since X is lower than Z, CPU A unlocks CPU Z’s rq. Someone else has
     locked CPU X’s rq, and thus, CPU A must wait.

2) At CPU Z
  a) Previous task has completed execution and thus, CPU Z enters
     schedule, locks its own rq after CPU A releases it.
  b) CPU Z dequeues previous task and begins executing task p.
  c) CPU Z unlocks its rq.
  d) Task p yields the CPU (ex. by doing IO or waiting to acquire a
     lock) which triggers the schedule function on CPU Z.
  e) CPU Z enters schedule again, locks its own rq, and dequeues task p.
  f) As part of dequeue, it sets p.on_rq = 0 and unlocks its rq.

3) At CPU B
  a) CPU B enters try_to_wake_up with input task p.
  b) Since CPU Z dequeued task p, p.on_rq = 0, and CPU B updates
     B.state = WAKING.
  c) CPU B via select_task_rq determines CPU Y as the target CPU.

4) The race
  a) CPU A acquires CPU X’s lock and relocks CPU Z.
  b) CPU A reads task p.cpu = Z and incorrectly concludes task p is
     still on CPU Z.
  c) CPU A failed to notice task p had been dequeued from CPU Z while
     CPU A was waiting for locks in double_lock_balance. If CPU A knew
     that task p had been dequeued, it would return NULL forcing
     push_rt_task to give up the task p's migration.
  d) CPU B updates task p.cpu = Y and calls ttwu_queue.
  e) CPU B locks Ys rq. CPU B enqueues task p onto Y and sets task
     p.on_rq = 1.
  f) CPU B unlocks CPU Y, triggering memory synchronization.
  g) CPU A reads task p.on_rq = 1, cementing its assumption that task p
     has not migrated.
  h) CPU A decides to migrate p to CPU X.

This leads to A dequeuing p from Y's queue and various crashes down the
line.

Solution
========
The solution here is fairly simple. After obtaining the lock (at 4a),
the check is enhanced to make sure that the task is still at the head of
the pushable tasks list. If not, then it is anyway not suitable for
being pushed out.

Testing
=======
The fix is tested on a cluster of 3 nodes, where the panics due to this
are hit every couple of days. A fix similar to this was deployed on such
cluster and was stable for more than 30 days.

Co-developed-by: Jon Kohler <jon@nutanix.com>
Signed-off-by: Jon Kohler <jon@nutanix.com>
Co-developed-by: Gauri Patwardhan <gauri.patwardhan@nutanix.com>
Signed-off-by: Gauri Patwardhan <gauri.patwardhan@nutanix.com>
Co-developed-by: Rahul Chunduru <rahul.chunduru@nutanix.com>
Signed-off-by: Rahul Chunduru <rahul.chunduru@nutanix.com>
Signed-off-by: Harshit Agarwal <harshit@nutanix.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: "Steven Rostedt (Google)" <rostedt@goodmis.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Tested-by: Will Ton <william.ton@nutanix.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20250225180553.167995-1-harshit@nutanix.com
2025-04-08 20:55:55 +02:00
Michal Koutný
0ab94c3242 sched: Add annotations to RT_GROUP_SCHED fields
Update comments to ease RT throttling understanding.

Signed-off-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250310170442.504716-10-mkoutny@suse.com
2025-04-08 20:55:55 +02:00
Frederic Weisbecker
4d949edbc4 rcu: Comment on the extraneous delta test on rcu_seq_done_exact()
The numbers used in rcu_seq_done_exact() lack some explanation behind
their magic. Especially after the commit:

    85aad7cc41 ("rcu: Fix get_state_synchronize_rcu_full() GP-start detection")

which reported a subtle issue where a new GP sequence snapshot was taken
on the root node state while a grace period had already been started and
reflected on the global state sequence but not yet on the root node
sequence, making a polling user waiting on a wrong already started grace
period that would ignore freshly online CPUs.

The fix involved taking the snaphot on the global state sequence and
waiting on the root node sequence. And since a grace period is first
started on the global state and only afterward reflected on the root
node, a snapshot taken on the global state sequence might be two full
grace periods ahead of the root node as in the following example:

rnp->gp_seq = rcu_state.gp_seq = 0

    CPU 0                                           CPU 1
    -----                                           -----
    // rcu_state.gp_seq = 1
    rcu_seq_start(&rcu_state.gp_seq)
                                                    // snap = 8
                                                    snap = rcu_seq_snap(&rcu_state.gp_seq)
                                                    // Two full GP differences
                                                    rcu_seq_done_exact(&rnp->gp_seq, snap)
    // rnp->gp_seq = 1
    WRITE_ONCE(rnp->gp_seq, rcu_state.gp_seq);

Add a comment about those expectations and to clarify the magic within
the relevant function.

Note that the issue arises mainly with the use of rcu_seq_done_exact()
which has a much tigher guardband (of 2 GPs) to ensure the false-negative
window of the API during wraparound is limited to just 2 GPs.
rcu_seq_done() does not have such strict requirements, however its large
false-negative window of ULONG_MAX/2 is not ideal for the polling API.
However, this also means care is needed to ensure the guardband is as
large as needed to avoid the example scenario describe above which a
warning added in an earlier patch does.

[ Comment wordsmithing by Joel ]

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
2025-04-08 14:55:54 -04:00
Michal Koutný
87f1fb77d8 sched: Add RT_GROUP WARN checks for non-root task_groups
With CONFIG_RT_GROUP_SCHED but runtime disabling of RT_GROUPs we expect
the existence of the root task_group only and all rt_sched_entity'ies
should be queued on root's rt_rq.

If we get a non-root RT_GROUP something went wrong.

Signed-off-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250310170442.504716-9-mkoutny@suse.com
2025-04-08 20:55:54 +02:00
Joel Fernandes
4aa6e94cf9 rcu: Add warning to ensure rcu_seq_done_exact() is working
The previous patch improved the rcu_seq_done_exact() function by adding
a meaningful constant for the guardband.

Ensure that this is working for the future by a quick check during
rcu_gp_init().

Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
2025-04-08 14:55:54 -04:00
Michal Koutný
d6809c2f60 sched: Do not construct nor expose RT_GROUP_SCHED structures if disabled
Thanks to kernel cmdline being available early, before any
cgroup hierarchy exists, we can achieve the RT_GROUP_SCHED boottime
disabling goal by simply skipping any creation (and destruction) of
RT_GROUP data and its exposure via RT attributes.

We can do this thanks to previously placed runtime guards that would
redirect all operations to root_task_group's data when RT_GROUP_SCHED
disabled.

Signed-off-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250310170442.504716-8-mkoutny@suse.com
2025-04-08 20:55:54 +02:00
Joel Fernandes
9c94c5ad39 rcu: Replace magic number with meaningful constant in rcu_seq_done_exact()
The rcu_seq_done_exact() function checks if a grace period has completed by
comparing sequence numbers. It includes a guard band to handle sequence number
wraparound, which was previously expressed using the magic number calculation
'3 * RCU_SEQ_STATE_MASK + 1'.

This magic number is not immediately obvious in terms of what it represents.

Instead, the reason we need this tiny guardband is because of the lag between
the setting of rcu_state.gp_seq_polled and root rnp's gp_seq in rcu_gp_init().

This guardband needs to be at least 2 GPs worth of counts, to avoid recognizing
the newly started GP as completed immediately, due to the following sequence
which arises due to the delay between update of rcu_state.gp_seq_polled and
root rnp's gp_seq:

rnp->gp_seq = rcu_state.gp_seq = 0

    CPU 0                                           CPU 1
    -----                                           -----
    // rcu_state.gp_seq = 1
    rcu_seq_start(&rcu_state.gp_seq)
                                                    // snap = 8
                                                    snap = rcu_seq_snap(&rcu_state.gp_seq)
                                                    // Two full GP differences
                                                    rcu_seq_done_exact(&rnp->gp_seq, snap)
    // rnp->gp_seq = 1
    WRITE_ONCE(rnp->gp_seq, rcu_state.gp_seq);

This can happen due to get_state_synchronize_rcu_full() sampling
rcu_state.gp_seq_polled, however the poll_state_synchronize_rcu_full()
sampling the root rnp's gp_seq. The delay between the update of the 2
counters occurs in rcu_gp_init() during which the counters briefly go
out of sync.

Make the guardband explictly 2 GPs. This improves code readability and
maintainability by making the intent clearer as well.

Suggested-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
2025-04-08 14:55:54 -04:00
Michal Koutný
277e090975 sched: Bypass bandwitdh checks with runtime disabled RT_GROUP_SCHED
When RT_GROUPs are compiled but not exposed, their bandwidth cannot
be configured (and it is not initialized for non-root task_groups neither).
Therefore bypass any checks of task vs task_group bandwidth.

This will achieve behavior very similar to setups that have
!CONFIG_RT_GROUP_SCHED and attach cpu controller to cgroup v2 hierarchy.
(On a related note, this may allow having RT tasks with
CONFIG_RT_GROUP_SCHED and cgroup v2 hierarchy.)

Signed-off-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250310170442.504716-7-mkoutny@suse.com
2025-04-08 20:55:54 +02:00
Michal Koutný
61d3164fec sched: Skip non-root task_groups with disabled RT_GROUP_SCHED
First, we want to prevent placement of RT tasks on non-root rt_rqs which
we achieve in the task migration code that'd fall back to
root_task_group's rt_rq.

Second, we want to work with only root_task_group's rt_rq when iterating
all "real" rt_rqs when RT_GROUP is disabled. To achieve this we keep
root_task_group as the first one on the task_groups and break out
quickly.

Signed-off-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250310170442.504716-6-mkoutny@suse.com
2025-04-08 20:55:53 +02:00