Commit Graph

45917 Commits

Author SHA1 Message Date
Thorsten Blum
b96c312551 ring-buffer: Use vma_pages() helper function
Use the vma_pages() helper function and fix the following
Coccinelle/coccicheck warning reported by vma_pages.cocci:

  WARNING: Consider using vma_pages helper on vma

Rename the local variable vma_pages accordingly.

Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com>
Link: https://lore.kernel.org/20240709215657.322071-2-thorsten.blum@toblux.com
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-07-15 15:09:06 -04:00
levi.yun
7dc836187f trace/pid_list: Change gfp flags in pid_list_fill_irq()
pid_list_fill_irq() runs via irq_work.
When CONFIG_PREEMPT_RT is disabled, it would run in irq_context.
so it shouldn't sleep while memory allocation.

Change gfp flags from GFP_KERNEL to GFP_NOWAIT to prevent sleep in
irq_work.

This change wouldn't impact functionality in practice because the worst-size
is 2K.

Cc: stable@goodmis.org
Fixes: 8d6e90983a ("tracing: Create a sparse bitmask for pid filtering")
Link: https://lore.kernel.org/20240704150226.1359936-1-yeoreum.yun@arm.com
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: levi.yun <yeoreum.yun@arm.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-07-15 15:07:14 -04:00
Dan Carpenter
94dfa500e7 tracing: Fix NULL vs IS_ERR() check in enable_instances()
The trace_array_create_systems() function returns error pointers, not
NULL.  Fix the check to match.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Fixes: e645535a95 ("tracing: Add option to use memmapped memory for trace boot instance")
Link: https://lore.kernel.org/9b23ea03-d709-435f-a309-461c3d747457@moroto.mountain
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-07-15 14:55:02 -04:00
Linus Torvalds
b051320d6a Merge tag 'vfs-6.11.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull misc vfs updates from Christian Brauner:
 "Features:

   - Support passing NULL along AT_EMPTY_PATH for statx().

     NULL paths with any flag value other than AT_EMPTY_PATH go the
     usual route and end up with -EFAULT to retain compatibility (Rust
     is abusing calls of the sort to detect availability of statx)

     This avoids path lookup code, lockref management, memory allocation
     and in case of NULL path userspace memory access (which can be
     quite expensive with SMAP on x86_64)

   - Don't block i_writecount during exec. Remove the
     deny_write_access() mechanism for executables

   - Relax open_by_handle_at() permissions in specific cases where we
     can prove that the caller had sufficient privileges to open a file

   - Switch timespec64 fields in struct inode to discrete integers
     freeing up 4 bytes

  Fixes:

   - Fix false positive circular locking warning in hfsplus

   - Initialize hfs_inode_info after hfs_alloc_inode() in hfs

   - Avoid accidental overflows in vfs_fallocate()

   - Don't interrupt fallocate with EINTR in tmpfs to avoid constantly
     restarting shmem_fallocate()

   - Add missing quote in comment in fs/readdir

  Cleanups:

   - Don't assign and test in an if statement in mqueue. Move the
     assignment out of the if statement

   - Reflow the logic in may_create_in_sticky()

   - Remove the usage of the deprecated ida_simple_xx() API from procfs

   - Reject FSCONFIG_CMD_CREATE_EXCL requets that depend on the new
     mount api early

   - Rename variables in copy_tree() to make it easier to understand

   - Replace WARN(down_read_trylock, ...) abuse with proper asserts in
     various places in the VFS

   - Get rid of user_path_at_empty() and drop the empty argument from
     getname_flags()

   - Check for error while copying and no path in one branch in
     getname_flags()

   - Avoid redundant smp_mb() for THP handling in do_dentry_open()

   - Rename parent_ino to d_parent_ino and make it use RCU

   - Remove unused header include in fs/readdir

   - Export in_group_capable() helper and switch f2fs and fuse over to
     it instead of open-coding the logic in both places"

* tag 'vfs-6.11.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (27 commits)
  ipc: mqueue: remove assignment from IS_ERR argument
  vfs: rename parent_ino to d_parent_ino and make it use RCU
  vfs: support statx(..., NULL, AT_EMPTY_PATH, ...)
  stat: use vfs_empty_path() helper
  fs: new helper vfs_empty_path()
  fs: reflow may_create_in_sticky()
  vfs: remove redundant smp_mb for thp handling in do_dentry_open
  fuse: Use in_group_or_capable() helper
  f2fs: Use in_group_or_capable() helper
  fs: Export in_group_or_capable()
  vfs: reorder checks in may_create_in_sticky
  hfs: fix to initialize fields of hfs_inode_info after hfs_alloc_inode()
  proc: Remove usage of the deprecated ida_simple_xx() API
  hfsplus: fix to avoid false alarm of circular locking
  Improve readability of copy_tree
  vfs: shave a branch in getname_flags
  vfs: retire user_path_at_empty and drop empty arg from getname_flags
  vfs: stop using user_path_at_empty in do_readlinkat
  tmpfs: don't interrupt fallocate with EINTR
  fs: don't block i_writecount during exec
  ...
2024-07-15 10:52:51 -07:00
Masahiro Yamada
c442db3f49 kbuild: remove PROVIDE() for kallsyms symbols
This reimplements commit 951bcae6c5 ("kallsyms: Avoid weak references
for kallsyms symbols") because I am not a big fan of PROVIDE().

As an alternative solution, this commit prepends one more kallsyms step.

    KSYMS   .tmp_vmlinux.kallsyms0.S          # added
    AS      .tmp_vmlinux.kallsyms0.o          # added
    LD      .tmp_vmlinux.btf
    BTF     .btf.vmlinux.bin.o
    LD      .tmp_vmlinux.kallsyms1
    NM      .tmp_vmlinux.kallsyms1.syms
    KSYMS   .tmp_vmlinux.kallsyms1.S
    AS      .tmp_vmlinux.kallsyms1.o
    LD      .tmp_vmlinux.kallsyms2
    NM      .tmp_vmlinux.kallsyms2.syms
    KSYMS   .tmp_vmlinux.kallsyms2.S
    AS      .tmp_vmlinux.kallsyms2.o
    LD      vmlinux

Step 0 takes /dev/null as input, and generates .tmp_vmlinux.kallsyms0.o,
which has a valid kallsyms format with the empty symbol list, and can be
linked to vmlinux. Since it is really small, the added compile-time cost
is negligible.

Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Reviewed-by: Nicolas Schier <nicolas@fjasle.eu>
2024-07-16 01:08:36 +09:00
Vlastimil Babka
436381eaf2 Merge branch 'slab/for-6.11/buckets' into slab/for-next
Merge all the slab patches previously collected on top of v6.10-rc1,
over cleanups/fixes that had to be based on rc6.
2024-07-15 10:44:16 +02:00
Lai Jiangshan
58629d4871 workqueue: Always queue work items to the newest PWQ for order workqueues
To ensure non-reentrancy, __queue_work() attempts to enqueue a work
item to the pool of the currently executing worker. This is not only
unnecessary for an ordered workqueue, where order inherently suggests
non-reentrancy, but it could also disrupt the sequence if the item is
not enqueued on the newest PWQ.

Just queue it to the newest PWQ and let order management guarantees
non-reentrancy.

Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Fixes: 4c065dbce1 ("workqueue: Enable unbound cpumask update on ordered workqueues")
Cc: stable@vger.kernel.org # v6.9+
Signed-off-by: Tejun Heo <tj@kernel.org>
(cherry picked from commit 74347be3edfd11277799242766edf844c43dd5d3)
2024-07-14 18:20:19 -10:00
Jiapeng Chong
8bb30798fd sched_ext: Fixes incorrect type in bpf_scx_init()
The type_id is defined as u32type, if(type_id<0) is invalid, hence
modified its type to s32.

./kernel/sched/ext.c:4958:5-12: WARNING: Unsigned expression compared with zero: type_id < 0.

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=9523
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-07-14 18:10:10 -10:00
Tejun Heo
9283ff5be1 Merge branch 'for-6.10-fixes' into for-6.11 2024-07-14 18:04:03 -10:00
Kent Overstreet
1a616c2fe9 lockdep: lockdep_set_notrack_class()
Add a new helper to disable lockdep tracking entirely for a given class.

This is needed for bcachefs, which takes too many btree node locks for
lockdep to track. Instead, we have a single lockdep_map for "btree_trans
has any btree nodes locked", which makes more since given that we have
centralized lock management and a cycle detector.

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:16 -04:00
Linus Torvalds
365346980e Merge tag 'sched_urgent_for_v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Borislav Petkov:

 - Fix a performance regression when measuring the CPU time of a thread
   (clock_gettime(CLOCK_THREAD_CPUTIME_ID,...)) due to the addition of
   PSI IRQ time accounting in the hotpath

 - Fix a task_struct leak due to missing to decrement the refcount when
   the task is enqueued before the timer which is supposed to do that,
   expires

 - Revert an attempt to expedite detaching of movable tasks, as finding
   those could become very costly. Turns out the original issue wasn't
   even hit by anyone

* tag 'sched_urgent_for_v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched: Move psi_account_irqtime() out of update_rq_clock_task() hotpath
  sched/deadline: Fix task_struct reference leak
  Revert "sched/fair: Make sure to try to detach at least one movable task"
2024-07-14 10:18:25 -07:00
Thomas Gleixner
b7625d67eb Merge tag 'timers-v6.11-rc1' of https://git.linaro.org/people/daniel.lezcano/linux into timers/core
Pull clocksource/event driver updates from Daniel Lezcano:

  - Remove unnecessary local variables initialization as they will be
    initialized in the code path anyway right after on the ARM arch
    timer and the ARM global timer (Li kunyu)

  - Fix a race condition in the interrupt leading to a deadlock on the
    SH CMT driver. Note that this fix was not tested on the platform
    using this timer but the fix seems reasonable enough to be picked
    confidently (Niklas Söderlund)

  - Increase the rating of the gic-timer and use the configured width
    clocksource register on the MIPS architecture (Jiaxun Yang)

  - Add the DT bindings for the TMU on the Renesas platforms (Geert
    Uytterhoeven)

  - Add the DT bindings for the SOPHGO SG2002 clint on RiscV (Thomas
    Bonnefille)

  - Add the rtl-otto timer driver along with the DT bindings for the
    Realtek platform (Chris Packham)

Link: https://lore.kernel.org/all/91cd05de-4c5d-4242-a381-3b8a4fe6a2a2@linaro.org
2024-07-13 12:07:10 +02:00
Yang Li
b69bdba5a3 swiotlb: fix kernel-doc description for swiotlb_del_transient
Describe the pool argument in the kernel-doc comment for
swiotlb_del_transient.

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2024-07-13 07:36:10 +02:00
Jakub Kicinski
26f453176a Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2024-07-12

We've added 23 non-merge commits during the last 3 day(s) which contain
a total of 18 files changed, 234 insertions(+), 243 deletions(-).

The main changes are:

1) Improve BPF verifier by utilizing overflow.h helpers to check
   for overflows, from Shung-Hsi Yu.

2) Fix NULL pointer dereference in resolve_prog_type() for BPF_PROG_TYPE_EXT
   when attr->attach_prog_fd was not specified, from Tengda Wu.

3) Fix arm64 BPF JIT when generating code for BPF trampolines with
   BPF_TRAMP_F_CALL_ORIG which corrupted upper address bits,
   from Puranjay Mohan.

4) Remove test_run callback from lwt_seg6local_prog_ops which never worked
   in the first place and caused syzbot reports,
   from Sebastian Andrzej Siewior.

5) Relax BPF verifier to accept non-zero offset on KF_TRUSTED_ARGS/
   /KF_RCU-typed BPF kfuncs, from Matt Bobrowski.

6) Fix a long standing bug in libbpf with regards to handling of BPF
   skeleton's forward and backward compatibility, from Andrii Nakryiko.

7) Annotate btf_{seq,snprintf}_show functions with __printf,
   from Alan Maguire.

8) BPF selftest improvements to reuse common network helpers in sk_lookup
   test and dropping the open-coded inetaddr_len() and make_socket() ones,
   from Geliang Tang.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (23 commits)
  selftests/bpf: Test for null-pointer-deref bugfix in resolve_prog_type()
  bpf: Fix null pointer dereference in resolve_prog_type() for BPF_PROG_TYPE_EXT
  selftests/bpf: DENYLIST.aarch64: Skip fexit_sleep again
  bpf: use check_sub_overflow() to check for subtraction overflows
  bpf: use check_add_overflow() to check for addition overflows
  bpf: fix overflow check in adjust_jmp_off()
  bpf: Eliminate remaining "make W=1" warnings in kernel/bpf/btf.o
  bpf: annotate BTF show functions with __printf
  bpf, arm64: Fix trampoline for BPF_TRAMP_F_CALL_ORIG
  selftests/bpf: Close obj in error path in xdp_adjust_tail
  selftests/bpf: Null checks for links in bpf_tcp_ca
  selftests/bpf: Use connect_fd_to_fd in sk_lookup
  selftests/bpf: Use start_server_addr in sk_lookup
  selftests/bpf: Use start_server_str in sk_lookup
  selftests/bpf: Close fd in error path in drop_on_reuseport
  selftests/bpf: Add ASSERT_OK_FD macro
  selftests/bpf: Add backlog for network_helper_opts
  selftests/bpf: fix compilation failure when CONFIG_NF_FLOW_TABLE=m
  bpf: Remove tst_run from lwt_seg6local_prog_ops.
  bpf: relax zero fixed offset constraint on KF_TRUSTED_ARGS/KF_RCU
  ...
====================

Link: https://patch.msgid.link/20240712212448.5378-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-07-12 22:25:54 -07:00
Kees Cook
0fe2356434 tsacct: replace strncpy() with strscpy()
Replace the deprecated[1] use of strncpy() in bacct_add_tsk().  Since this
is UAPI, include trailing padding in the copy.

Link: https://github.com/KSPP/linux/issues/90 [1]
Link: https://lkml.kernel.org/r/20240711171308.work.995-kees@kernel.org
Signed-off-by: Kees Cook <kees@kernel.org>
Cc: "Dr. Thomas Orgis" <thomas.orgis@uni-hamburg.de>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Ismael Luceno <ismael@iodev.co.uk>
Cc: Peng Liu <liupeng256@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-12 16:39:53 -07:00
Christophe Leroy
18d095b255 mm: define __pte_leaf_size() to also take a PMD entry
On powerpc 8xx, when a page is 8M size, the information is in the PMD
entry.  So allow architectures to provide __pte_leaf_size() instead of
pte_leaf_size() and provide the PMD entry to that function.

When __pte_leaf_size() is not defined, define it as a pte_leaf_size() so
that architectures not interested in the PMD arguments are not impacted.

Only define a default pte_leaf_size() when __pte_leaf_size() is not
defined to make sure nobody adds new calls to pte_leaf_size() in the core.

Link: https://lkml.kernel.org/r/c7c008f0a314bf8029ad7288fdc908db1ec7e449.1719928057.git.christophe.leroy@csgroup.eu
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-12 15:52:15 -07:00
Tejun Heo
5b26f7b920 sched_ext: Allow SCX_DSQ_LOCAL_ON for direct dispatches
In ops.dispatch(), SCX_DSQ_LOCAL_ON can be used to dispatch the task to the
local DSQ of any CPU. However, during direct dispatch from ops.select_cpu()
and ops.enqueue(), this isn't allowed. This is because dispatching to the
local DSQ of a remote CPU requires locking both the task's current and new
rq's and such double locking can't be done directly from ops.enqueue().

While waking up a task, as ops.select_cpu() can pick any CPU and both
ops.select_cpu() and ops.enqueue() can use SCX_DSQ_LOCAL as the dispatch
target to dispatch to the DSQ of the picked CPU, the BPF scheduler can still
do whatever it wants to do. However, while a task is being enqueued for a
different reason, e.g. after its slice expiration, only ops.enqueue() is
called and there's no way for the BPF scheduler to directly dispatch to the
local DSQ of a remote CPU. This gap in API forces schedulers into
work-arounds which are not straightforward or optimal such as skipping
direct dispatches in such cases.

Implement deferred enqueueing to allow directly dispatching to the local DSQ
of a remote CPU from ops.select_cpu() and ops.enqueue(). Such tasks are
temporarily queued on rq->scx.ddsp_deferred_locals. When the rq lock can be
safely released, the tasks are taken off the list and queued on the target
local DSQs using dispatch_to_local_dsq().

v2: - Add missing return after queue_balance_callback() in
      schedule_deferred(). (David).

    - dispatch_to_local_dsq() now assumes that @rq is locked but unpinned
      and thus no longer takes @rf. Updated accordingly.

    - UP build warning fix.

Signed-off-by: Tejun Heo <tj@kernel.org>
Tested-by: Andrea Righi <righi.andrea@gmail.com>
Acked-by: David Vernet <void@manifault.com>
Cc: Dan Schatzberg <schatzberg.dan@gmail.com>
Cc: Changwoo Min <changwoo@igalia.com>
2024-07-12 08:20:33 -10:00
Tejun Heo
f47a818950 sched_ext: s/SCX_RQ_BALANCING/SCX_RQ_IN_BALANCE/ and add SCX_RQ_IN_WAKEUP
SCX_RQ_BALANCING is used to mark that the rq is currently in balance().
Rename it to SCX_RQ_IN_BALANCE and add SCX_RQ_IN_WAKEUP which marks whether
the rq is currently enqueueing for a wakeup. This will be used to implement
direct dispatching to local DSQ of another CPU.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
2024-07-12 08:20:33 -10:00
Tejun Heo
3cf78c5d01 sched_ext: Unpin and repin rq lock from balance_scx()
sched_ext often needs to migrate tasks across CPUs right before execution
and thus uses the balance path to dispatch tasks from the BPF scheduler.
balance_scx() is called with rq locked and pinned but is passed @rf and thus
allowed to unpin and unlock. Currently, @rf is passed down the call stack so
the rq lock is unpinned just when double locking is needed.

This creates unnecessary complications such as having to explicitly
manipulate lock pinning for core scheduling. We also want to use
dispatch_to_local_dsq_lock() from other paths which are called with rq
locked but unpinned.

rq lock handling in the dispatch path is straightforward outside the
migration implementation and extending the pinning protection down the
callstack doesn't add enough meaningful extra protection to justify the
extra complexity.

Unpin and repin rq lock from the outer balance_scx() and drop @rf passing
and lock pinning handling from the inner functions. UP is updated to call
balance_one() instead of balance_scx() to avoid adding NULL @rf handling to
balance_scx(). AS this makes balance_scx() unused in UP, it's put inside a
CONFIG_SMP block.

No functional changes intended outside of lock annotation updates.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrea Righi <righi.andrea@gmail.com>
2024-07-12 08:20:32 -10:00
Tejun Heo
d6a05910d2 sched_ext: Open-code task_linked_on_dsq()
task_linked_on_dsq() exists as a helper because it used to test both the
rbtree and list nodes. It now only tests the list node and the list node
will soon be used for something else too. The helper doesn't improve
anything materially and the naming will become confusing. Open-code the list
node testing and remove task_linked_on_dsq()

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
2024-07-12 08:20:32 -10:00
Tejun Heo
fc283116d0 sched: Move struct balance_callback definition upward
Move struct balance_callback definition upward so that it's visible to
class-specific rq struct definitions. This will be used to embed a struct
balance_callback in struct scx_rq.

No functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
2024-07-12 08:20:32 -10:00
Xiu Jianfeng
6a26f9c689 cgroup/misc: Introduce misc.events.local
Currently the event counting provided by misc.events is hierarchical,
it's not practical if user is only concerned with events of a
specified cgroup. Therefore, introduce misc.events.local collect events
specific to the given cgroup.

This is analogous to memory.events.local and pids.events.local.

Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-07-12 06:45:23 -10:00
Shung-Hsi Yu
deac5871eb bpf: use check_sub_overflow() to check for subtraction overflows
Similar to previous patch that drops signed_add*_overflows() and uses
(compiler) builtin-based check_add_overflow(), do the same for
signed_sub*_overflows() and replace them with the generic
check_sub_overflow() to make future refactoring easier and have the
checks implemented more efficiently.

Unsigned overflow check for subtraction does not use helpers and are
simple enough already, so they're left untouched.

After the change GCC 13.3.0 generates cleaner assembly on x86_64:

	if (check_sub_overflow(*dst_smin, src_reg->smax_value, dst_smin) ||
   139bf:	mov    0x28(%r12),%rax
   139c4:	mov    %edx,0x54(%r12)
   139c9:	sub    %r11,%rax
   139cc:	mov    %rax,0x28(%r12)
   139d1:	jo     14627 <adjust_reg_min_max_vals+0x1237>
	    check_sub_overflow(*dst_smax, src_reg->smin_value, dst_smax)) {
   139d7:	mov    0x30(%r12),%rax
   139dc:	sub    %r9,%rax
   139df:	mov    %rax,0x30(%r12)
	if (check_sub_overflow(*dst_smin, src_reg->smax_value, dst_smin) ||
   139e4:	jo     14627 <adjust_reg_min_max_vals+0x1237>
   ...
		*dst_smin = S64_MIN;
   14627:	movabs $0x8000000000000000,%rax
   14631:	mov    %rax,0x28(%r12)
		*dst_smax = S64_MAX;
   14636:	sub    $0x1,%rax
   1463a:	mov    %rax,0x30(%r12)

Before the change it gives:

	if (signed_sub_overflows(dst_reg->smin_value, smax_val) ||
   13a50:	mov    0x28(%r12),%rdi
   13a55:	mov    %edx,0x54(%r12)
		dst_reg->smax_value = S64_MAX;
   13a5a:	movabs $0x7fffffffffffffff,%rdx
   13a64:	mov    %eax,0x50(%r12)
		dst_reg->smin_value = S64_MIN;
   13a69:	movabs $0x8000000000000000,%rax
	s64 res = (s64)((u64)a - (u64)b);
   13a73:	mov    %rdi,%rsi
   13a76:	sub    %rcx,%rsi
	if (b < 0)
   13a79:	test   %rcx,%rcx
   13a7c:	js     145ea <adjust_reg_min_max_vals+0x119a>
	if (signed_sub_overflows(dst_reg->smin_value, smax_val) ||
   13a82:	cmp    %rsi,%rdi
   13a85:	jl     13ac7 <adjust_reg_min_max_vals+0x677>
	    signed_sub_overflows(dst_reg->smax_value, smin_val)) {
   13a87:	mov    0x30(%r12),%r8
	s64 res = (s64)((u64)a - (u64)b);
   13a8c:	mov    %r8,%rax
   13a8f:	sub    %r9,%rax
	return res > a;
   13a92:	cmp    %rax,%r8
   13a95:	setl   %sil
	if (b < 0)
   13a99:	test   %r9,%r9
   13a9c:	js     147d1 <adjust_reg_min_max_vals+0x1381>
		dst_reg->smax_value = S64_MAX;
   13aa2:	movabs $0x7fffffffffffffff,%rdx
		dst_reg->smin_value = S64_MIN;
   13aac:	movabs $0x8000000000000000,%rax
	if (signed_sub_overflows(dst_reg->smin_value, smax_val) ||
   13ab6:	test   %sil,%sil
   13ab9:	jne    13ac7 <adjust_reg_min_max_vals+0x677>
		dst_reg->smin_value -= smax_val;
   13abb:	mov    %rdi,%rax
		dst_reg->smax_value -= smin_val;
   13abe:	mov    %r8,%rdx
		dst_reg->smin_value -= smax_val;
   13ac1:	sub    %rcx,%rax
		dst_reg->smax_value -= smin_val;
   13ac4:	sub    %r9,%rdx
   13ac7:	mov    %rax,0x28(%r12)
   ...
   13ad1:	mov    %rdx,0x30(%r12)
   ...
	if (signed_sub_overflows(dst_reg->smin_value, smax_val) ||
   145ea:	cmp    %rsi,%rdi
   145ed:	jg     13ac7 <adjust_reg_min_max_vals+0x677>
   145f3:	jmp    13a87 <adjust_reg_min_max_vals+0x637>

Suggested-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20240712080127.136608-4-shung-hsi.yu@suse.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-07-12 08:54:08 -07:00
Shung-Hsi Yu
28a4411076 bpf: use check_add_overflow() to check for addition overflows
signed_add*_overflows() was added back when there was no overflow-check
helper. With the introduction of such helpers in commit f0907827a8
("compiler.h: enable builtin overflow checkers and add fallback code"), we
can drop signed_add*_overflows() in kernel/bpf/verifier.c and use the
generic check_add_overflow() instead.

This will make future refactoring easier, and takes advantage of
compiler-emitted hardware instructions that efficiently implement these
checks.

After the change GCC 13.3.0 generates cleaner assembly on x86_64:

	err = adjust_scalar_min_max_vals(env, insn, dst_reg, *src_reg);
   13625:	mov    0x28(%rbx),%r9  /*  r9 = src_reg->smin_value */
   13629:	mov    0x30(%rbx),%rcx /* rcx = src_reg->smax_value */
   ...
	if (check_add_overflow(*dst_smin, src_reg->smin_value, dst_smin) ||
   141c1:	mov    %r9,%rax
   141c4:	add    0x28(%r12),%rax
   141c9:	mov    %rax,0x28(%r12)
   141ce:	jo     146e4 <adjust_reg_min_max_vals+0x1294>
	    check_add_overflow(*dst_smax, src_reg->smax_value, dst_smax)) {
   141d4:	add    0x30(%r12),%rcx
   141d9:	mov    %rcx,0x30(%r12)
	if (check_add_overflow(*dst_smin, src_reg->smin_value, dst_smin) ||
   141de:	jo     146e4 <adjust_reg_min_max_vals+0x1294>
   ...
		*dst_smin = S64_MIN;
   146e4:	movabs $0x8000000000000000,%rax
   146ee:	mov    %rax,0x28(%r12)
		*dst_smax = S64_MAX;
   146f3:	sub    $0x1,%rax
   146f7:	mov    %rax,0x30(%r12)

Before the change it gives:

	s64 smin_val = src_reg->smin_value;
     675:	mov    0x28(%rsi),%r8
	s64 smax_val = src_reg->smax_value;
	u64 umin_val = src_reg->umin_value;
	u64 umax_val = src_reg->umax_value;
     679:	mov    %rdi,%rax /* rax = dst_reg */
	if (signed_add_overflows(dst_reg->smin_value, smin_val) ||
     67c:	mov    0x28(%rdi),%rdi /* rdi = dst_reg->smin_value */
	u64 umin_val = src_reg->umin_value;
     680:	mov    0x38(%rsi),%rdx
	u64 umax_val = src_reg->umax_value;
     684:	mov    0x40(%rsi),%rcx
	s64 res = (s64)((u64)a + (u64)b);
     688:	lea    (%r8,%rdi,1),%r9 /* r9 = dst_reg->smin_value + src_reg->smin_value */
	return res < a;
     68c:	cmp    %r9,%rdi
     68f:	setg   %r10b /* r10b = (dst_reg->smin_value + src_reg->smin_value) > dst_reg->smin_value */
	if (b < 0)
     693:	test   %r8,%r8
     696:	js     72b <scalar_min_max_add+0xbb>
	    signed_add_overflows(dst_reg->smax_value, smax_val)) {
		dst_reg->smin_value = S64_MIN;
		dst_reg->smax_value = S64_MAX;
     69c:	movabs $0x7fffffffffffffff,%rdi
	s64 smax_val = src_reg->smax_value;
     6a6:	mov    0x30(%rsi),%r8
		dst_reg->smin_value = S64_MIN;
     6aa:	00 00 00 	movabs $0x8000000000000000,%rsi
	if (signed_add_overflows(dst_reg->smin_value, smin_val) ||
     6b4:	test   %r10b,%r10b /* (dst_reg->smin_value + src_reg->smin_value) > dst_reg->smin_value ? goto 6cb */
     6b7:	jne    6cb <scalar_min_max_add+0x5b>
	    signed_add_overflows(dst_reg->smax_value, smax_val)) {
     6b9:	mov    0x30(%rax),%r10   /* r10 = dst_reg->smax_value */
	s64 res = (s64)((u64)a + (u64)b);
     6bd:	lea    (%r10,%r8,1),%r11 /* r11 = dst_reg->smax_value + src_reg->smax_value */
	if (b < 0)
     6c1:	test   %r8,%r8
     6c4:	js     71e <scalar_min_max_add+0xae>
	if (signed_add_overflows(dst_reg->smin_value, smin_val) ||
     6c6:	cmp    %r11,%r10 /* (dst_reg->smax_value + src_reg->smax_value) <= dst_reg->smax_value ? goto 723 */
     6c9:	jle    723 <scalar_min_max_add+0xb3>
	} else {
		dst_reg->smin_value += smin_val;
		dst_reg->smax_value += smax_val;
	}
     6cb:	mov    %rsi,0x28(%rax)
     ...
     6d5:	mov    %rdi,0x30(%rax)
     ...
	if (signed_add_overflows(dst_reg->smin_value, smin_val) ||
     71e:	cmp    %r11,%r10
     721:	jl     6cb <scalar_min_max_add+0x5b>
		dst_reg->smin_value += smin_val;
     723:	mov    %r9,%rsi
		dst_reg->smax_value += smax_val;
     726:	mov    %r11,%rdi
     729:	jmp    6cb <scalar_min_max_add+0x5b>
		return res > a;
     72b:	cmp    %r9,%rdi
     72e:	setl   %r10b
     732:	jmp    69c <scalar_min_max_add+0x2c>
     737:	nopw   0x0(%rax,%rax,1)

Note: unlike adjust_ptr_min_max_vals() and scalar*_min_max_add(), it is
necessary to introduce intermediate variable in adjust_jmp_off() to keep
the functional behavior unchanged. Without an intermediate variable
imm/off will be altered even on overflow.

Suggested-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Link: https://lore.kernel.org/r/20240712080127.136608-3-shung-hsi.yu@suse.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-07-12 08:54:08 -07:00
Shung-Hsi Yu
4a04b4f0de bpf: fix overflow check in adjust_jmp_off()
adjust_jmp_off() incorrectly used the insn->imm field for all overflow check,
which is incorrect as that should only be done or the BPF_JMP32 | BPF_JA case,
not the general jump instruction case. Fix it by using insn->off for overflow
check in the general case.

Fixes: 5337ac4c9b ("bpf: Fix the corner case with may_goto and jump to the 1st insn.")
Signed-off-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Link: https://lore.kernel.org/r/20240712080127.136608-2-shung-hsi.yu@suse.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-07-12 08:54:07 -07:00
Alan Maguire
2454075f8e bpf: Eliminate remaining "make W=1" warnings in kernel/bpf/btf.o
As reported by Mirsad [1] we still see format warnings in kernel/bpf/btf.o
at W=1 warning level:

  CC      kernel/bpf/btf.o
./kernel/bpf/btf.c: In function ‘btf_type_seq_show_flags’:
./kernel/bpf/btf.c:7553:21: warning: assignment left-hand side might be a candidate for a format attribute [-Wsuggest-attribute=format]
 7553 |         sseq.showfn = btf_seq_show;
      |                     ^
./kernel/bpf/btf.c: In function ‘btf_type_snprintf_show’:
./kernel/bpf/btf.c:7604:31: warning: assignment left-hand side might be a candidate for a format attribute [-Wsuggest-attribute=format]
 7604 |         ssnprintf.show.showfn = btf_snprintf_show;
      |                               ^

Combined with CONFIG_WERROR=y these can halt the build.

The fix (annotating the structure field with __printf())
suggested by Mirsad resolves these. Apologies I missed this last time.
No other W=1 warnings were observed in kernel/bpf after this fix.

[1] https://lore.kernel.org/bpf/92c9d047-f058-400c-9c7d-81d4dc1ef71b@gmail.com/

Fixes: b3470da314 ("bpf: annotate BTF show functions with __printf")
Reported-by: Mirsad Todorovac <mtodorovac69@gmail.com>
Suggested-by: Mirsad Todorovac <mtodorovac69@gmail.com>
Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20240712092859.1390960-1-alan.maguire@oracle.com
2024-07-12 17:02:26 +02:00
Lai Jiangshan
b2b1f93384 workqueue: Rename wq_update_pod() to unbound_wq_update_pwq()
What wq_update_pod() does is just to update the pwq of the specific
cpu.  Rename it and update the comments.

Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-07-11 12:50:35 -10:00
Lai Jiangshan
d160a58de5 workqueue: Remove the arguments @hotplug_cpu and @online from wq_update_pod()
The arguments @hotplug_cpu and @online are not used in wq_update_pod()
since the functions called by wq_update_pod() don't need them.

Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-07-11 12:50:35 -10:00
Lai Jiangshan
88a41b185d workqueue: Remove the argument @cpu_going_down from wq_calc_pod_cpumask()
wq_calc_pod_cpumask() uses wq_online_cpumask, which excludes the cpu
going down, so the argument cpu_going_down is unused and can be removed.

Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-07-11 12:50:34 -10:00
Lai Jiangshan
2cb61f76be workqueue: Remove the unneeded cpumask empty check in wq_calc_pod_cpumask()
The cpumask empty check in wq_calc_pod_cpumask() has long been useless.
It just works purely as documents which states that the cpumask is not
possible empty after the function returns.

Now the code above is even more explicit that the cpumask is not empty,
so the document-only empty check can be removed.

Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-07-11 12:50:34 -10:00
Lai Jiangshan
19af457573 workqueue: Remove cpus_read_lock() from apply_wqattrs_lock()
1726a17135 ("workqueue: Put PWQ allocation and WQ enlistment in the same
lock C.S.") led to the following possible deadlock:

  WARNING: possible recursive locking detected
  6.10.0-rc5-00004-g1d4c6111406c #1 Not tainted
   --------------------------------------------
   swapper/0/1 is trying to acquire lock:
   c27760f4 (cpu_hotplug_lock){++++}-{0:0}, at: alloc_workqueue (kernel/workqueue.c:5152 kernel/workqueue.c:5730) 
  
   but task is already holding lock:
   c27760f4 (cpu_hotplug_lock){++++}-{0:0}, at: padata_alloc (kernel/padata.c:1007) 
   ...  
   stack backtrace:
   ...
   cpus_read_lock (include/linux/percpu-rwsem.h:53 kernel/cpu.c:488) 
   alloc_workqueue (kernel/workqueue.c:5152 kernel/workqueue.c:5730) 
   padata_alloc (kernel/padata.c:1007 (discriminator 1)) 
   pcrypt_init_padata (crypto/pcrypt.c:327 (discriminator 1)) 
   pcrypt_init (crypto/pcrypt.c:353) 
   do_one_initcall (init/main.c:1267) 
   do_initcalls (init/main.c:1328 (discriminator 1) init/main.c:1345 (discriminator 1)) 
   kernel_init_freeable (init/main.c:1364) 
   kernel_init (init/main.c:1469) 
   ret_from_fork (arch/x86/kernel/process.c:153) 
   ret_from_fork_asm (arch/x86/entry/entry_32.S:737) 
   entry_INT80_32 (arch/x86/entry/entry_32.S:944) 

This is caused by pcrypt allocating a workqueue while holding
cpus_read_lock(), so workqueue code can't do it again as that can lead to
deadlocks if down_write starts after the first down_read.

The pwq creations and installations have been reworked based on
wq_online_cpumask rather than cpu_online_mask making cpus_read_lock() is
unneeded during wqattrs changes. Fix the deadlock by removing
cpus_read_lock() from apply_wqattrs_lock().

tj: Updated changelog.

Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Fixes: 1726a17135 ("workqueue: Put PWQ allocation and WQ enlistment in the same lock C.S.")
Link: http://lkml.kernel.org/r/202407081521.83b627c1-lkp@intel.com
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-07-11 12:50:34 -10:00
Lai Jiangshan
fbb3d4c15d workqueue: Simplify wq_calc_pod_cpumask() with wq_online_cpumask
Avoid relying on cpu_online_mask for wqattrs changes so that
cpus_read_lock() can be removed from apply_wqattrs_lock().

And with wq_online_cpumask, attrs->__pod_cpumask doesn't need to be
reused as a temporary storage to calculate if the pod have any online
CPUs @attrs wants since @cpu_going_down is not in the wq_online_cpumask.

Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-07-11 12:50:34 -10:00
Lai Jiangshan
8d84baf760 workqueue: Add wq_online_cpumask
The new wq_online_mask mirrors the cpu_online_mask except during
hotplugging; specifically, it differs between the hotplugging stages
of workqueue_offline_cpu() and workqueue_online_cpu(), during which
the transitioning CPU is not represented in the mask.

Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-07-11 12:50:34 -10:00
Alan Maguire
b3470da314 bpf: annotate BTF show functions with __printf
-Werror=suggest-attribute=format warns about two functions
in kernel/bpf/btf.c [1]; add __printf() annotations to silence
these warnings since for CONFIG_WERROR=y they will trigger
build failures.

[1] https://lore.kernel.org/bpf/a8b20c72-6631-4404-9e1f-0410642d7d20@gmail.com/

Fixes: 31d0bc8163 ("bpf: Move to generic BTF show support, apply it to seq files/strings")
Reported-by: Mirsad Todorovac <mtodorovac69@gmail.com>
Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
Tested-by: Mirsad Todorovac <mtodorovac69@yahoo.com>
Link: https://lore.kernel.org/r/20240711182321.963667-1-alan.maguire@oracle.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-07-11 14:15:17 -07:00
Jakub Kicinski
7c8267275d Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

Conflicts:

net/sched/act_ct.c
  26488172b0 ("net/sched: Fix UAF when resolving a clash")
  3abbd7ed8b ("act_ct: prepare for stolen verdict coming from conntrack and nat engine")

No adjacent changes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-07-11 12:58:13 -07:00
Yu Liao
f7d43dd206 tick/broadcast: Make takeover of broadcast hrtimer reliable
Running the LTP hotplug stress test on a aarch64 machine results in
rcu_sched stall warnings when the broadcast hrtimer was owned by the
un-plugged CPU. The issue is the following:

CPU1 (owns the broadcast hrtimer)	CPU2

				tick_broadcast_enter()
				  // shutdown local timer device
				  broadcast_shutdown_local()
				...
				tick_broadcast_exit()
				  clockevents_switch_state(dev, CLOCK_EVT_STATE_ONESHOT)
				  // timer device is not programmed
				  cpumask_set_cpu(cpu, tick_broadcast_force_mask)

				initiates offlining of CPU1
take_cpu_down()
/*
 * CPU1 shuts down and does not
 * send broadcast IPI anymore
 */
				takedown_cpu()
				  hotplug_cpu__broadcast_tick_pull()
				    // move broadcast hrtimer to this CPU
				    clockevents_program_event()
				      bc_set_next()
					hrtimer_start()
					/*
					 * timer device is not programmed
					 * because only the first expiring
					 * timer will trigger clockevent
					 * device reprogramming
					 */

What happens is that CPU2 exits broadcast mode with force bit set, then the
local timer device is not reprogrammed and CPU2 expects to receive the
expired event by the broadcast IPI. But this does not happen because CPU1
is offlined by CPU2. CPU switches the clockevent device to ONESHOT state,
but does not reprogram the device.

The subsequent reprogramming of the hrtimer broadcast device does not
program the clockevent device of CPU2 either because the pending expiry
time is already in the past and the CPU expects the event to be delivered.
As a consequence all CPUs which wait for a broadcast event to be delivered
are stuck forever.

Fix this issue by reprogramming the local timer device if the broadcast
force bit of the CPU is set so that the broadcast hrtimer is delivered.

[ tglx: Massage comment and change log. Add Fixes tag ]

Fixes: 989dcb645c ("tick: Handle broadcast wakeup of multiple cpus")
Signed-off-by: Yu Liao <liaoyu15@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20240711124843.64167-1-liaoyu15@huawei.com
2024-07-11 18:00:24 +02:00
Juergen Gross
9fe6a8c5b2 x86/xen: remove deprecated xen_nopvspin boot parameter
The xen_nopvspin boot parameter is deprecated since 2019. nopvspin
can be used instead.

Remove the xen_nopvspin boot parameter and replace the xen_pvspin
variable use cases with nopvspin.

This requires to move the nopvspin variable out of the .initdata
section, as it needs to be accessed for cpuhotplug, too.

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Message-ID: <20240710110139.22300-1-jgross@suse.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
2024-07-11 16:33:51 +02:00
Ingo Molnar
011b1134b8 Merge branch 'sched/urgent' into sched/core, to pick up fixes and refresh the branch
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2024-07-11 10:42:33 +02:00
Pavel Begunkov
943ad0b62e kernel: rerun task_work while freezing in get_signal()
io_uring can asynchronously add a task_work while the task is getting
freezed. TIF_NOTIFY_SIGNAL will prevent the task from sleeping in
do_freezer_trap(), and since the get_signal()'s relock loop doesn't
retry task_work, the task will spin there not being able to sleep
until the freezing is cancelled / the task is killed / etc.

Run task_works in the freezer path. Keep the patch small and simple
so it can be easily back ported, but we might need to do some cleaning
after and look if there are other places with similar problems.

Cc: stable@vger.kernel.org
Link: https://github.com/systemd/systemd/issues/33626
Fixes: 12db8b6900 ("entry: Add support for TIF_NOTIFY_SIGNAL")
Reported-by: Julian Orth <ju.orth@gmail.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/89ed3a52933370deaaf61a0a620a6ac91f1e754d.1720634146.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-11 01:51:44 -06:00
Kumar Kartikeya Dwivedi
a6fcd19d7e bpf: Defer work in bpf_timer_cancel_and_free
Currently, the same case as previous patch (two timer callbacks trying
to cancel each other) can be invoked through bpf_map_update_elem as
well, or more precisely, freeing map elements containing timers. Since
this relies on hrtimer_cancel as well, it is prone to the same deadlock
situation as the previous patch.

It would be sufficient to use hrtimer_try_to_cancel to fix this problem,
as the timer cannot be enqueued after async_cancel_and_free. Once
async_cancel_and_free has been done, the timer must be reinitialized
before it can be armed again. The callback running in parallel trying to
arm the timer will fail, and freeing bpf_hrtimer without waiting is
sufficient (given kfree_rcu), and bpf_timer_cb will return
HRTIMER_NORESTART, preventing the timer from being rearmed again.

However, there exists a UAF scenario where the callback arms the timer
before entering this function, such that if cancellation fails (due to
timer callback invoking this routine, or the target timer callback
running concurrently). In such a case, if the timer expiration is
significantly far in the future, the RCU grace period expiration
happening before it will free the bpf_hrtimer state and along with it
the struct hrtimer, that is enqueued.

Hence, it is clear cancellation needs to occur after
async_cancel_and_free, and yet it cannot be done inline due to deadlock
issues. We thus modify bpf_timer_cancel_and_free to defer work to the
global workqueue, adding a work_struct alongside rcu_head (both used at
_different_ points of time, so can share space).

Update existing code comments to reflect the new state of affairs.

Fixes: b00628b1c7 ("bpf: Introduce bpf timers.")
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20240709185440.1104957-3-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-07-10 15:59:44 -07:00
Kumar Kartikeya Dwivedi
d4523831f0 bpf: Fail bpf_timer_cancel when callback is being cancelled
Given a schedule:

timer1 cb			timer2 cb

bpf_timer_cancel(timer2);	bpf_timer_cancel(timer1);

Both bpf_timer_cancel calls would wait for the other callback to finish
executing, introducing a lockup.

Add an atomic_t count named 'cancelling' in bpf_hrtimer. This keeps
track of all in-flight cancellation requests for a given BPF timer.
Whenever cancelling a BPF timer, we must check if we have outstanding
cancellation requests, and if so, we must fail the operation with an
error (-EDEADLK) since cancellation is synchronous and waits for the
callback to finish executing. This implies that we can enter a deadlock
situation involving two or more timer callbacks executing in parallel
and attempting to cancel one another.

Note that we avoid incrementing the cancelling counter for the target
timer (the one being cancelled) if bpf_timer_cancel is not invoked from
a callback, to avoid spurious errors. The whole point of detecting
cur->cancelling and returning -EDEADLK is to not enter a busy wait loop
(which may or may not lead to a lockup). This does not apply in case the
caller is in a non-callback context, the other side can continue to
cancel as it sees fit without running into errors.

Background on prior attempts:

Earlier versions of this patch used a bool 'cancelling' bit and used the
following pattern under timer->lock to publish cancellation status.

lock(t->lock);
t->cancelling = true;
mb();
if (cur->cancelling)
	return -EDEADLK;
unlock(t->lock);
hrtimer_cancel(t->timer);
t->cancelling = false;

The store outside the critical section could overwrite a parallel
requests t->cancelling assignment to true, to ensure the parallely
executing callback observes its cancellation status.

It would be necessary to clear this cancelling bit once hrtimer_cancel
is done, but lack of serialization introduced races. Another option was
explored where bpf_timer_start would clear the bit when (re)starting the
timer under timer->lock. This would ensure serialized access to the
cancelling bit, but may allow it to be cleared before in-flight
hrtimer_cancel has finished executing, such that lockups can occur
again.

Thus, we choose an atomic counter to keep track of all outstanding
cancellation requests and use it to prevent lockups in case callbacks
attempt to cancel each other while executing in parallel.

Reported-by: Dohyun Kim <dohyunkim@google.com>
Reported-by: Neel Natu <neelnatu@google.com>
Fixes: b00628b1c7 ("bpf: Introduce bpf timers.")
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20240709185440.1104957-2-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-07-10 15:59:44 -07:00
Mohammad Shehar Yaar Tausif
af253aef18 bpf: fix order of args in call to bpf_map_kvcalloc
The original function call passed size of smap->bucket before the number of
buckets which raises the error 'calloc-transposed-args' on compilation.

Vlastimil Babka added:

The order of parameters can be traced back all the way to 6ac99e8f23
("bpf: Introduce bpf sk local storage") accross several refactorings,
and that's why the commit is used as a Fixes: tag.

In v6.10-rc1, a different commit 2c321f3f70 ("mm: change inlined
allocation helpers to account at the call site") however exposed the
order of args in a way that gcc-14 has enough visibility to start
warning about it, because (in !CONFIG_MEMCG case) bpf_map_kvcalloc is
then a macro alias for kvcalloc instead of a static inline wrapper.

To sum up the warning happens when the following conditions are all met:

- gcc-14 is used (didn't see it with gcc-13)
- commit 2c321f3f70 is present
- CONFIG_MEMCG is not enabled in .config
- CONFIG_WERROR turns this from a compiler warning to error

Fixes: 6ac99e8f23 ("bpf: Introduce bpf sk local storage")
Reviewed-by: Andrii Nakryiko <andrii@kernel.org>
Tested-by: Christian Kujau <lists@nerdbynature.de>
Signed-off-by: Mohammad Shehar Yaar Tausif <sheharyaar48@gmail.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Link: https://lore.kernel.org/r/20240710100521.15061-2-vbabka@suse.cz
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-07-10 15:31:19 -07:00
Zqiang
77aeb1b685 smp: Add missing destroy_work_on_stack() call in smp_call_on_cpu()
For CONFIG_DEBUG_OBJECTS_WORK=y kernels sscs.work defined by
INIT_WORK_ONSTACK() is initialized by debug_object_init_on_stack() for
the debug check in __init_work() to work correctly.

But this lacks the counterpart to remove the tracked object from debug
objects again, which will cause a debug object warning once the stack is
freed.

Add the missing destroy_work_on_stack() invocation to cure that.

[ tglx: Massaged changelog ]

Signed-off-by: Zqiang <qiang.zhang1211@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Paul E. McKenney <paulmck@kernel.org>
Link: https://lore.kernel.org/r/20240704065213.13559-1-qiang.zhang1211@gmail.com
2024-07-10 22:40:39 +02:00
Wei Yang
9325585288 kernel/fork.c: put set_max_threads()/task_struct_whitelist() in __init section
The functions set_max_threads() and task_struct_whitelist() are only used
by fork_init() during bootup.

Let's add __init tag to them.

Link: https://lkml.kernel.org/r/20240701013410.17260-2-richard.weiyang@gmail.com
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-10 12:14:54 -07:00
Wei Yang
66b4aaf733 kernel/fork.c: get totalram_pages from memblock to calculate max_threads
Since we plan to move the accounting into __free_pages_core(),
totalram_pages may not represent the total usable pages on system at this
point when defer_init is enabled.

Instead we can get the total usable pages from memblock directly.

Link: https://lkml.kernel.org/r/20240701013410.17260-1-richard.weiyang@gmail.com
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-10 12:14:54 -07:00
Johannes Weiner
3a3b7fec39 mm: remove CONFIG_MEMCG_KMEM
CONFIG_MEMCG_KMEM used to be a user-visible option for whether slab
tracking is enabled.  It has been default-enabled and equivalent to
CONFIG_MEMCG for almost a decade.  We've only grown more kernel memory
accounting sites since, and there is no imaginable cgroup usecase going
forward that wants to track user pages but not the multitude of
user-drivable kernel allocations.

Link: https://lkml.kernel.org/r/20240701153148.452230-1-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Muchun Song <muchun.song@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-10 12:14:54 -07:00
Arnd Bergmann
505d66d1ab clone3: drop __ARCH_WANT_SYS_CLONE3 macro
When clone3() was introduced, it was not obvious how each architecture
deals with setting up the stack and keeping the register contents in
a fork()-like system call, so this was left for the architecture
maintainers to implement, with __ARCH_WANT_SYS_CLONE3 defined by those
that already implement it.

Five years later, we still have a few architectures left that are missing
clone3(), and the macro keeps getting in the way as it's fundamentally
different from all the other __ARCH_WANT_SYS_* macros that are meant
to provide backwards-compatibility with applications using older
syscalls that are no longer provided by default.

Address this by reversing the polarity of the macro, adding an
__ARCH_BROKEN_SYS_CLONE3 macro to all architectures that don't
already provide the syscall, and remove __ARCH_WANT_SYS_CLONE3
from all the other ones.

Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2024-07-10 14:23:38 +02:00
Michael Kelley
7296f2301a swiotlb: reduce swiotlb pool lookups
With CONFIG_SWIOTLB_DYNAMIC enabled, each round-trip map/unmap pair
in the swiotlb results in 6 calls to swiotlb_find_pool(). In multiple
places, the pool is found and used in one function, and then must
be found again in the next function that is called because only the
tlb_addr is passed as an argument. These are the six call sites:

dma_direct_map_page:
 1. swiotlb_map -> swiotlb_tbl_map_single -> swiotlb_bounce

dma_direct_unmap_page:
 2. dma_direct_sync_single_for_cpu -> is_swiotlb_buffer
 3. dma_direct_sync_single_for_cpu -> swiotlb_sync_single_for_cpu ->
	swiotlb_bounce
 4. is_swiotlb_buffer
 5. swiotlb_tbl_unmap_single -> swiotlb_del_transient
 6. swiotlb_tbl_unmap_single -> swiotlb_release_slots

Reduce the number of calls by finding the pool at a higher level, and
passing it as an argument instead of searching again. A key change is
for is_swiotlb_buffer() to return a pool pointer instead of a boolean,
and then pass this pool pointer to subsequent swiotlb functions.

There are 9 occurrences of is_swiotlb_buffer() used to test if a buffer
is a swiotlb buffer before calling a swiotlb function. To reduce code
duplication in getting the pool pointer and passing it as an argument,
introduce inline wrappers for this pattern. The generated code is
essentially unchanged.

Since is_swiotlb_buffer() no longer returns a boolean, rename some
functions to reflect the change:

 * swiotlb_find_pool() becomes __swiotlb_find_pool()
 * is_swiotlb_buffer() becomes swiotlb_find_pool()
 * is_xen_swiotlb_buffer() becomes xen_swiotlb_find_pool()

With these changes, a round-trip map/unmap pair requires only 2 pool
lookups (listed using the new names and wrappers):

dma_direct_unmap_page:
 1. dma_direct_sync_single_for_cpu -> swiotlb_find_pool
 2. swiotlb_tbl_unmap_single -> swiotlb_find_pool

These changes come from noticing the inefficiencies in a code review,
not from performance measurements. With CONFIG_SWIOTLB_DYNAMIC,
__swiotlb_find_pool() is not trivial, and it uses an RCU read lock,
so avoiding the redundant calls helps performance in a hot path.
When CONFIG_SWIOTLB_DYNAMIC is *not* set, the code size reduction
is minimal and the perf benefits are likely negligible, but no
harm is done.

No functional change is intended.

Signed-off-by: Michael Kelley <mhklinux@outlook.com>
Reviewed-by: Petr Tesarik <petr@tesarici.cz>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2024-07-10 07:59:03 +02:00
Matt Bobrowski
605c96997d bpf: relax zero fixed offset constraint on KF_TRUSTED_ARGS/KF_RCU
Currently, BPF kfuncs which accept trusted pointer arguments
i.e. those flagged as KF_TRUSTED_ARGS, KF_RCU, or KF_RELEASE, all
require an original/unmodified trusted pointer argument to be supplied
to them. By original/unmodified, it means that the backing register
holding the trusted pointer argument that is to be supplied to the BPF
kfunc must have its fixed offset set to zero, or else the BPF verifier
will outright reject the BPF program load. However, this zero fixed
offset constraint that is currently enforced by the BPF verifier onto
BPF kfuncs specifically flagged to accept KF_TRUSTED_ARGS or KF_RCU
trusted pointer arguments is rather unnecessary, and can limit their
usability in practice. Specifically, it completely eliminates the
possibility of constructing a derived trusted pointer from an original
trusted pointer. To put it simply, a derived pointer is a pointer
which points to one of the nested member fields of the object being
pointed to by the original trusted pointer.

This patch relaxes the zero fixed offset constraint that is enforced
upon BPF kfuncs which specifically accept KF_TRUSTED_ARGS, or KF_RCU
arguments. Although, the zero fixed offset constraint technically also
applies to BPF kfuncs accepting KF_RELEASE arguments, relaxing this
constraint for such BPF kfuncs has subtle and unwanted
side-effects. This was discovered by experimenting a little further
with an initial version of this patch series [0]. The primary issue
with relaxing the zero fixed offset constraint on BPF kfuncs accepting
KF_RELEASE arguments is that it'd would open up the opportunity for
BPF programs to supply both trusted pointers and derived trusted
pointers to them. For KF_RELEASE BPF kfuncs specifically, this could
be problematic as resources associated with the backing pointer could
be released by the backing BPF kfunc and cause instabilities for the
rest of the kernel.

With this new fixed offset semantic in-place for BPF kfuncs accepting
KF_TRUSTED_ARGS and KF_RCU arguments, we now have more flexibility
when it comes to the BPF kfuncs that we're able to introduce moving
forward.

Early discussions covering the possibility of relaxing the zero fixed
offset constraint can be found using the link below. This will provide
more context on where all this has stemmed from [1].

Notably, pre-existing tests have been updated such that they provide
coverage for the updated zero fixed offset
functionality. Specifically, the nested offset test was converted from
a negative to positive test as it was already designed to assert zero
fixed offset semantics of a KF_TRUSTED_ARGS BPF kfunc.

[0] https://lore.kernel.org/bpf/ZnA9ndnXKtHOuYMe@google.com/
[1] https://lore.kernel.org/bpf/ZhkbrM55MKQ0KeIV@google.com/

Signed-off-by: Matt Bobrowski <mattbobrowski@google.com>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20240709210939.1544011-1-mattbobrowski@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-07-09 19:11:47 -07:00
Masami Hiramatsu (Google)
b10545b6b8 tracing/kprobes: Fix build error when find_module() is not available
The kernel test robot reported that the find_module() is not available
if CONFIG_MODULES=n.
Fix this error by hiding find_modules() in #ifdef CONFIG_MODULES with
related rcu locks as try_module_get_by_name().

Link: https://lore.kernel.org/all/172056819167.201571.250053007194508038.stgit@devnote2/

Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202407070744.RcLkn8sq-lkp@intel.com/
Closes: https://lore.kernel.org/oe-kbuild-all/202407070917.VVUCBlaS-lkp@intel.com/
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2024-07-10 09:47:00 +09:00