Commit Graph

43291 Commits

Author SHA1 Message Date
Li Zhe
9d02330abd softlockup: serialized softlockup's log
If multiple CPUs trigger softlockup at the same time with
'softlockup_all_cpu_backtrace=0', the softlockup's logs will appear
staggeredly in dmesg, which will affect the viewing of the logs for
developer.  Since the code path for outputting softlockup logs is not a
kernel hotspot and the performance requirements for the code are not
strict, locks are used to serialize the softlockup log output to improve
the readability of the logs.

Link: https://lkml.kernel.org/r/20231123084022.10302-1-lizhe.67@bytedance.com
Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Cc: Lecopzer Chen <lecopzer.chen@mediatek.com>
Cc: Pingfan Liu <kernelfans@gmail.com>
Cc: Zefan Li <lizefan.x@bytedance.com>
Cc: John Ogness <john.ogness@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-10 17:21:44 -08:00
Baoquan He
b3ba234171 kexec_file: load kernel at top of system RAM if required
Patch series "kexec_file: Load kernel at top of system RAM if required".

Justification:
==============

Kexec_load interface has been doing top down searching and loading
kernel/initrd/purgtory etc to prepare for kexec reboot.  In that way, the
benefits are that it avoids to consume and fragment limited low memory
which satisfy DMA buffer allocation and big chunk of continuous memory
during system init; and avoids to stir with BIOS/FW reserved or occupied
areas, or corner case handling/work around/quirk occupied areas when doing
system init.  By the way, the top-down searching and loading of kexec-ed
kernel is done in user space utility code.

For kexec_file loading, even if kexec_buf.top_down is 'true', it's simply
ignored.  It calls walk_system_ram_res() directly to go through all
resources of System RAM bottom up, to find an available memory region,
then call locate_mem_hole_callback() to allocate memory in that found
memory region from top to down.  This is not expected and inconsistent
with kexec_load.

Implementation
===============

In patch 1, introduce a new function walk_system_ram_res_rev() which is a
variant of walk_system_ram_res(), it walks through a list of all the
resources of System RAM in reversed order, i.e., from higher to lower.

In patch 2, check if kexec_buf.top_down is 'true' in
kexec_walk_resources(), if yes, call walk_system_ram_res_rev() to find
memory region of system RAM from top to down to load kernel/initrd etc.

Background information: ======================= And I ever tried this in
the past in a different way, please see below link.  In the post, I tried
to adjust struct sibling linking code, replace the the singly linked list
with list_head so that walk_system_ram_res_rev() can be implemented in a
much easier way.  Finally I failed. 
https://lore.kernel.org/all/20180718024944.577-4-bhe@redhat.com/

This time, I picked up the patch from AKASHI Takahiro's old post and made
some change to take as the current patch 1:
https://lists.infradead.org/pipermail/linux-arm-kernel/2017-September/531456.html


This patch (of 2):

Kexec_load interface has been doing top down searching and loading
kernel/initrd/purgtory etc to prepare for kexec reboot.  In that way, the
benefits are that it avoids to consume and fragment limited low memory
which satisfy DMA buffer allocation and big chunk of continuous memory
during system init; and avoids to stir with BIOS/FW reserved or occupied
areas, or corner case handling/work around/quirk occupied areas when doing
system init.  By the way, the top-down searching and loading of kexec-ed
kernel is done in user space utility code.

For kexec_file loading, even if kexec_buf.top_down is 'true', it's simply
ignored.  It calls walk_system_ram_res() directly to go through all
resources of System RAM bottom up, to find an available memory region,
then call locate_mem_hole_callback() to allocate memory in that found
memory region from top to down.  This is not expected and inconsistent
with kexec_load.

Here check if kexec_buf.top_down is 'true' in kexec_walk_resources(), if
yes, call the newly added walk_system_ram_res_rev() to find memory region
of system RAM from top to down to load kernel/initrd etc.

Link: https://lkml.kernel.org/r/20231114091658.228030-1-bhe@redhat.com
Link: https://lkml.kernel.org/r/20231114091658.228030-3-bhe@redhat.com
Signed-off-by: Baoquan He <bhe@redhat.com>
Cc: AKASHI Takahiro <takahiro.akashi@linaro.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-10 17:21:44 -08:00
Baoquan He
7acf164b25 resource: add walk_system_ram_res_rev()
This function, being a variant of walk_system_ram_res() introduced in
commit 8c86e70ace ("resource: provide new functions to walk through
resources"), walks through a list of all the resources of System RAM in
reversed order, i.e., from higher to lower.

It will be used in kexec_file code to load kernel, initrd etc when
preparing kexec reboot.

Link: https://lkml.kernel.org/r/ZVTA6z/06cLnWKUz@MiWiFi-R3L-srv
Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org>
Signed-off-by: Baoquan He <bhe@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-10 17:21:44 -08:00
Arnd Bergmann
b1c3efe079 sched: fair: move unused stub functions to header
These four functions have a normal definition for CONFIG_FAIR_GROUP_SCHED,
and empty one that is only referenced when FAIR_GROUP_SCHED is disabled
but CGROUP_SCHED is still enabled.  If both are turned off, the functions
are still defined but the misisng prototype causes a W=1 warning:

kernel/sched/fair.c:12544:6: error: no previous prototype for 'free_fair_sched_group'
kernel/sched/fair.c:12546:5: error: no previous prototype for 'alloc_fair_sched_group'
kernel/sched/fair.c:12553:6: error: no previous prototype for 'online_fair_sched_group'
kernel/sched/fair.c:12555:6: error: no previous prototype for 'unregister_fair_sched_group'

Move the alternatives into the header as static inline functions with the
correct combination of #ifdef checks to avoid the warning without adding
even more complexity.

[A different patch with the same description got applied by accident
 and was later reverted, but the original patch is still missing]

Link: https://lkml.kernel.org/r/20231123110506.707903-4-arnd@kernel.org
Fixes: 7aa55f2a59 ("sched/fair: Move unused stub functions to header")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Dinh Nguyen <dinguyen@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Cc: Kees Cook <keescook@chromium.org>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Nicolas Schier <nicolas@fjasle.eu>
Cc: Palmer Dabbelt <palmer@rivosinc.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Richard Henderson <richard.henderson@linaro.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Tudor Ambarus <tudor.ambarus@linaro.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-10 17:21:43 -08:00
Uros Bizjak
0311d82724 kexec: use atomic_try_cmpxchg in crash_kexec
Use atomic_try_cmpxchg instead of cmpxchg (*ptr, old, new) == old in
crash_kexec().  x86 CMPXCHG instruction returns success in ZF flag,
so this change saves a compare after cmpxchg.

No functional change intended.

Link: https://lkml.kernel.org/r/20231114161228.108516-1-ubizjak@gmail.com
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-10 17:21:33 -08:00
Oleg Nesterov
27bbb2a0fd __ptrace_unlink: kill the obsolete "FIXME" code
The corner case described by the comment is no longer possible after the
commit 7b3c36fc4c ("ptrace: fix task_join_group_stop() for the case when
current is traced"), task_join_group_stop() ensures that the new thread
has the correct signr in JOBCTL_STOP_SIGMASK regardless of ptrace.

Link: https://lkml.kernel.org/r/20231121162650.GA6635@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-10 17:21:32 -08:00
Oleg Nesterov
b454ec2922 kernel/signal.c: simplify force_sig_info_to_task(), kill recalc_sigpending_and_wake()
The purpose of recalc_sigpending_and_wake() is not clear, it looks
"obviously unneeded" because we are going to send the signal which can't
be blocked or ignored.

Add the comment to explain why we can't rely on send_signal_locked() and
make this logic more simple/explicit.  recalc_sigpending_and_wake() has no
other users, it can die.

In fact I think we don't even need signal_wake_up(), the target task must
be either current or a TASK_TRACED child, otherwise the usage of siglock
is not safe.  But this needs another change.

Link: https://lkml.kernel.org/r/20231120151649.GA15995@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-10 17:21:32 -08:00
Heiko Carstens
3888750e21 arch: remove ARCH_TASK_STRUCT_ALLOCATOR
IA-64 was the only architecture which selected ARCH_TASK_STRUCT_ALLOCATOR.
IA-64 was removed with commit cf8e865810 ("arch: Remove Itanium (IA-64)
architecture"). Therefore remove support for ARCH_THREAD_STACK_ALLOCATOR
as well.

Link: https://lkml.kernel.org/r/20231116133638.1636277-3-hca@linux.ibm.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-10 17:21:31 -08:00
Heiko Carstens
f72709ab69 arch: remove ARCH_THREAD_STACK_ALLOCATOR
Patch series "Remove unused code after IA-64 removal".

While looking into something different I noticed that there are a couple
of Kconfig options which were only selected by IA-64 and which are now
unused.

So remove them and simplify the code a bit.


This patch (of 3):

IA-64 was the only architecture which selected ARCH_THREAD_STACK_ALLOCATOR.
IA-64 was removed with commit cf8e865810 ("arch: Remove Itanium (IA-64)
architecture"). Therefore remove support for ARCH_THREAD_STACK_ALLOCATOR as
well.

Link: https://lkml.kernel.org/r/20231116133638.1636277-1-hca@linux.ibm.com
Link: https://lkml.kernel.org/r/20231116133638.1636277-2-hca@linux.ibm.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-10 17:21:30 -08:00
Oleg Nesterov
61a7a5e25f introduce for_other_threads(p, t)
Cosmetic, but imho it makes the usage look more clear and simple, the new
helper doesn't require to initialize "t".

After this change while_each_thread() has only 3 users, and it is only
used in the do/while loops.

Link: https://lkml.kernel.org/r/20231030155710.GA9095@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-10 17:21:25 -08:00
Dongmin Lee
a9a1d6ad66 kernel/reboot: explicitly notify if halt occurred instead of power off
When kernel_can_power_off() returns false, and reboot has called with
LINUX_REBOOT_CMD_POWER_OFF, kernel_halt() will be initiated instead of
actual power off function.

However, in this situation, Kernel never explicitly notifies user that
system halted instead of requested power off.

Since halt and power off perform different behavior, and user initiated
reboot call with power off command, not halt, This could be unintended
behavior to user, like this:

~ # poweroff -f
[    3.581482] reboot: System halted

Therefore, this explicitly notifies user that poweroff is not available,
and halting has been occured as an alternative behavior instead:

~ # poweroff -f
[    4.123668] reboot: Power off not available: System halted instead

[akpm@linux-foundation.org: tweak comment text]
Link: https://lkml.kernel.org/r/20231104113320.72440-1-ldmldm05@gmail.com
Signed-off-by: Dongmin Lee <ldmldm05@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-10 17:21:25 -08:00
Andrew Morton
0c92218f4e Merge branch 'master' into mm-hotfixes-stable 2023-12-06 17:03:50 -08:00
Baoquan He
dccf78d39f kernel/Kconfig.kexec: drop select of KEXEC for CRASH_DUMP
Ignat Korchagin complained that a potential config regression was
introduced by commit 89cde45591 ("kexec: consolidate kexec and crash
options into kernel/Kconfig.kexec").  Before the commit, CONFIG_CRASH_DUMP
has no dependency on CONFIG_KEXEC.  After the commit, CRASH_DUMP selects
KEXEC.  That enforces system to have CONFIG_KEXEC=y as long as
CONFIG_CRASH_DUMP=Y which people may not want.

In Ignat's case, he sets CONFIG_CRASH_DUMP=y, CONFIG_KEXEC_FILE=y and
CONFIG_KEXEC=n because kexec_load interface could have security issue if
kernel/initrd has no chance to be signed and verified.

CRASH_DUMP has select of KEXEC because Eric, author of above commit, met a
LKP report of build failure when posting patch of earlier version.  Please
see below link to get detail of the LKP report:

    https://lore.kernel.org/all/3e8eecd1-a277-2cfb-690e-5de2eb7b988e@oracle.com/T/#u

In fact, that LKP report is triggered because arm's <asm/kexec.h> is
wrapped in CONFIG_KEXEC ifdeffery scope.  That is wrong.  CONFIG_KEXEC
controls the enabling/disabling of kexec_load interface, but not kexec
feature.  Removing the wrongly added CONFIG_KEXEC ifdeffery scope in
<asm/kexec.h> of arm allows us to drop the select KEXEC for CRASH_DUMP. 
Meanwhile, change arch/arm/kernel/Makefile to let machine_kexec.o
relocate_kernel.o depend on KEXEC_CORE.

Link: https://lkml.kernel.org/r/20231128054457.659452-1-bhe@redhat.com
Fixes: 89cde45591 ("kexec: consolidate kexec and crash options into kernel/Kconfig.kexec")
Signed-off-by: Baoquan He <bhe@redhat.com>
Reported-by: Ignat Korchagin <ignat@cloudflare.com>
Tested-by: Ignat Korchagin <ignat@cloudflare.com>	[compile-time only]
Tested-by: Alexander Gordeev <agordeev@linux.ibm.com>
Reviewed-by: Eric DeVolder <eric_devolder@yahoo.com>
Tested-by: Eric DeVolder <eric_devolder@yahoo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-06 16:12:48 -08:00
Linus Torvalds
669fc83452 Merge tag 'probes-fixes-v6.7-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull probes fixes from Masami Hiramatsu:

 - objpool: Fix objpool overrun case on memory/cache access delay
   especially on the big.LITTLE SoC. The objpool uses a copy of object
   slot index internal loop, but the slot index can be changed on
   another processor in parallel. In that case, the difference of 'head'
   local copy and the 'slot->last' index will be bigger than local slot
   size. In that case, we need to re-read the slot::head to update it.

 - kretprobe: Fix to use appropriate rcu API for kretprobe holder. Since
   kretprobe_holder::rp is RCU managed, it should use
   rcu_assign_pointer() and rcu_dereference_check() correctly. Also
   adding __rcu tag for finding wrong usage by sparse.

 - rethook: Fix to use appropriate rcu API for rethook::handler. The
   same as kretprobe, rethook::handler is RCU managed and it should use
   rcu_assign_pointer() and rcu_dereference_check(). This also adds
   __rcu tag for finding wrong usage by sparse.

* tag 'probes-fixes-v6.7-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  rethook: Use __rcu pointer for rethook::handler
  kprobes: consistent rcu api usage for kretprobe holder
  lib: objpool: fix head overrun on RK3588 SBC
2023-12-03 08:02:49 +09:00
Masami Hiramatsu (Google)
a1461f1fd6 rethook: Use __rcu pointer for rethook::handler
Since the rethook::handler is an RCU-maganged pointer so that it will
notice readers the rethook is stopped (unregistered) or not, it should
be an __rcu pointer and use appropriate functions to be accessed. This
will use appropriate memory barrier when accessing it. OTOH,
rethook::data is never changed, so we don't need to check it in
get_kretprobe().

NOTE: To avoid sparse warning, rethook::handler is defined by a raw
function pointer type with __rcu instead of rethook_handler_t.

Link: https://lore.kernel.org/all/170126066201.398836.837498688669005979.stgit@devnote2/

Fixes: 54ecbe6f1e ("rethook: Add a generic return hook")
Cc: stable@vger.kernel.org
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202311241808.rv9ceuAh-lkp@intel.com/
Tested-by: JP Kobryn <inwardvessel@gmail.com>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2023-12-01 14:53:56 +09:00
JP Kobryn
d839a656d0 kprobes: consistent rcu api usage for kretprobe holder
It seems that the pointer-to-kretprobe "rp" within the kretprobe_holder is
RCU-managed, based on the (non-rethook) implementation of get_kretprobe().
The thought behind this patch is to make use of the RCU API where possible
when accessing this pointer so that the needed barriers are always in place
and to self-document the code.

The __rcu annotation to "rp" allows for sparse RCU checking. Plain writes
done to the "rp" pointer are changed to make use of the RCU macro for
assignment. For the single read, the implementation of get_kretprobe()
is simplified by making use of an RCU macro which accomplishes the same,
but note that the log warning text will be more generic.

I did find that there is a difference in assembly generated between the
usage of the RCU macros vs without. For example, on arm64, when using
rcu_assign_pointer(), the corresponding store instruction is a
store-release (STLR) which has an implicit barrier. When normal assignment
is done, a regular store (STR) is found. In the macro case, this seems to
be a result of rcu_assign_pointer() using smp_store_release() when the
value to write is not NULL.

Link: https://lore.kernel.org/all/20231122132058.3359-1-inwardvessel@gmail.com/

Fixes: d741bf41d7 ("kprobes: Remove kretprobe hash")
Cc: stable@vger.kernel.org
Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2023-12-01 14:53:55 +09:00
Linus Torvalds
6172a5180f Merge tag 'net-6.7-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
 "Including fixes from bpf and wifi.

  Current release - regressions:

   - neighbour: fix __randomize_layout crash in struct neighbour

   - r8169: fix deadlock on RTL8125 in jumbo mtu mode

  Previous releases - regressions:

   - wifi:
       - mac80211: fix warning at station removal time
       - cfg80211: fix CQM for non-range use

   - tools: ynl-gen: fix unexpected response handling

   - octeontx2-af: fix possible buffer overflow

   - dpaa2: recycle the RX buffer only after all processing done

   - rswitch: fix missing dev_kfree_skb_any() in error path

  Previous releases - always broken:

   - ipv4: fix uaf issue when receiving igmp query packet

   - wifi: mac80211: fix debugfs deadlock at device removal time

   - bpf:
       - sockmap: af_unix stream sockets need to hold ref for pair sock
       - netdevsim: don't accept device bound programs

   - selftests: fix a char signedness issue

   - dsa: mv88e6xxx: fix marvell 6350 probe crash

   - octeontx2-pf: restore TC ingress police rules when interface is up

   - wangxun: fix memory leak on msix entry

   - ravb: keep reverse order of operations in ravb_remove()"

* tag 'net-6.7-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (51 commits)
  net: ravb: Keep reverse order of operations in ravb_remove()
  net: ravb: Stop DMA in case of failures on ravb_open()
  net: ravb: Start TX queues after HW initialization succeeded
  net: ravb: Make write access to CXR35 first before accessing other EMAC registers
  net: ravb: Use pm_runtime_resume_and_get()
  net: ravb: Check return value of reset_control_deassert()
  net: libwx: fix memory leak on msix entry
  ice: Fix VF Reset paths when interface in a failed over aggregate
  bpf, sockmap: Add af_unix test with both sockets in map
  bpf, sockmap: af_unix stream sockets need to hold ref for pair sock
  tools: ynl-gen: always construct struct ynl_req_state
  ethtool: don't propagate EOPNOTSUPP from dumps
  ravb: Fix races between ravb_tx_timeout_work() and net related ops
  r8169: prevent potential deadlock in rtl8169_close
  r8169: fix deadlock on RTL8125 in jumbo mtu mode
  neighbour: Fix __randomize_layout crash in struct neighbour
  octeontx2-pf: Restore TC ingress police rules when interface is up
  octeontx2-pf: Fix adding mbox work queue entry when num_vfs > 64
  net: stmmac: xgmac: Disable FPE MMC interrupts
  octeontx2-af: Fix possible buffer overflow
  ...
2023-12-01 08:24:46 +09:00
Hou Tao
75a442581d bpf: Add missed allocation hint for bpf_mem_cache_alloc_flags()
bpf_mem_cache_alloc_flags() may call __alloc() directly when there is no
free object in free list, but it doesn't initialize the allocation hint
for the returned pointer. It may lead to bad memory dereference when
freeing the pointer, so fix it by initializing the allocation hint.

Fixes: 822fb26bdb ("bpf: Add a hint to allocated objects.")
Signed-off-by: Hou Tao <houtao1@huawei.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20231111043821.2258513-1-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-26 18:00:26 -08:00
Linus Torvalds
1d0dbc3d16 Merge tag 'locking-urgent-2023-11-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking fix from Ingo Molnar:
 "Fix lockdep block chain corruption resulting in KASAN warnings"

* tag 'locking-urgent-2023-11-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  lockdep: Fix block chain corruption
2023-11-26 08:30:11 -08:00
Peter Zijlstra
bca4104b00 lockdep: Fix block chain corruption
Kent reported an occasional KASAN splat in lockdep. Mark then noted:

> I suspect the dodgy access is to chain_block_buckets[-1], which hits the last 4
> bytes of the redzone and gets (incorrectly/misleadingly) attributed to
> nr_large_chain_blocks.

That would mean @size == 0, at which point size_to_bucket() returns -1
and the above happens.

alloc_chain_hlocks() has 'size - req', for the first with the
precondition 'size >= rq', which allows the 0.

This code is trying to split a block, del_chain_block() takes what we
need, and add_chain_block() puts back the remainder, except in the
above case the remainder is 0 sized and things go sideways.

Fixes: 810507fe6f ("locking/lockdep: Reuse freed chain_hlocks entries")
Reported-by: Kent Overstreet <kent.overstreet@linux.dev>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Kent Overstreet <kent.overstreet@linux.dev>
Link: https://lkml.kernel.org/r/20231121114126.GH8262@noisy.programming.kicks-ass.net
2023-11-24 11:04:54 +01:00
Linus Torvalds
d3fa86b1a7 Merge tag 'net-6.7-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
 "Including fixes from bpf.

  Current release - regressions:

   - Revert "net: r8169: Disable multicast filter for RTL8168H and
     RTL8107E"

   - kselftest: rtnetlink: fix ip route command typo

  Current release - new code bugs:

   - s390/ism: make sure ism driver implies smc protocol in kconfig

   - two build fixes for tools/net

  Previous releases - regressions:

   - rxrpc: couple of ACK/PING/RTT handling fixes

  Previous releases - always broken:

   - bpf: verify bpf_loop() callbacks as if they are called unknown
     number of times

   - improve stability of auto-bonding with Hyper-V

   - account BPF-neigh-redirected traffic in interface statistics

  Misc:

   - net: fill in some more MODULE_DESCRIPTION()s"

* tag 'net-6.7-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (58 commits)
  tools: ynl: fix duplicate op name in devlink
  tools: ynl: fix header path for nfsd
  net: ipa: fix one GSI register field width
  tls: fix NULL deref on tls_sw_splice_eof() with empty record
  net: axienet: Fix check for partial TX checksum
  vsock/test: fix SEQPACKET message bounds test
  i40e: Fix adding unsupported cloud filters
  ice: restore timestamp configuration after device reset
  ice: unify logic for programming PFINT_TSYN_MSK
  ice: remove ptp_tx ring parameter flag
  amd-xgbe: propagate the correct speed and duplex status
  amd-xgbe: handle the corner-case during tx completion
  amd-xgbe: handle corner-case during sfp hotplug
  net: veth: fix ethtool stats reporting
  octeontx2-pf: Fix ntuple rule creation to direct packet to VF with higher Rx queue than its PF
  net: usb: qmi_wwan: claim interface 4 for ZTE MF290
  Revert "net: r8169: Disable multicast filter for RTL8168H and RTL8107E"
  net/smc: avoid data corruption caused by decline
  nfc: virtual_ncidev: Add variable to check if ndev is running
  dpll: Fix potential msg memleak when genlmsg_put_reply failed
  ...
2023-11-23 10:40:13 -08:00
Eduard Zingerman
bb124da69c bpf: keep track of max number of bpf_loop callback iterations
In some cases verifier can't infer convergence of the bpf_loop()
iteration. E.g. for the following program:

    static int cb(__u32 idx, struct num_context* ctx)
    {
        ctx->i++;
        return 0;
    }

    SEC("?raw_tp")
    int prog(void *_)
    {
        struct num_context ctx = { .i = 0 };
        __u8 choice_arr[2] = { 0, 1 };

        bpf_loop(2, cb, &ctx, 0);
        return choice_arr[ctx.i];
    }

Each 'cb' simulation would eventually return to 'prog' and reach
'return choice_arr[ctx.i]' statement. At which point ctx.i would be
marked precise, thus forcing verifier to track multitude of separate
states with {.i=0}, {.i=1}, ... at bpf_loop() callback entry.

This commit allows "brute force" handling for such cases by limiting
number of callback body simulations using 'umax' value of the first
bpf_loop() parameter.

For this, extend bpf_func_state with 'callback_depth' field.
Increment this field when callback visiting state is pushed to states
traversal stack. For frame #N it's 'callback_depth' field counts how
many times callback with frame depth N+1 had been executed.
Use bpf_func_state specifically to allow independent tracking of
callback depths when multiple nested bpf_loop() calls are present.

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20231121020701.26440-11-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-20 18:36:40 -08:00
Eduard Zingerman
cafe2c2150 bpf: widening for callback iterators
Callbacks are similar to open coded iterators, so add imprecise
widening logic for callback body processing. This makes callback based
loops behave identically to open coded iterators, e.g. allowing to
verify programs like below:

  struct ctx { u32 i; };
  int cb(u32 idx, struct ctx* ctx)
  {
          ++ctx->i;
          return 0;
  }
  ...
  struct ctx ctx = { .i = 0 };
  bpf_loop(100, cb, &ctx, 0);
  ...

Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20231121020701.26440-9-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-20 18:36:40 -08:00
Eduard Zingerman
ab5cfac139 bpf: verify callbacks as if they are called unknown number of times
Prior to this patch callbacks were handled as regular function calls,
execution of callback body was modeled exactly once.
This patch updates callbacks handling logic as follows:
- introduces a function push_callback_call() that schedules callback
  body verification in env->head stack;
- updates prepare_func_exit() to reschedule callback body verification
  upon BPF_EXIT;
- as calls to bpf_*_iter_next(), calls to callback invoking functions
  are marked as checkpoints;
- is_state_visited() is updated to stop callback based iteration when
  some identical parent state is found.

Paths with callback function invoked zero times are now verified first,
which leads to necessity to modify some selftests:
- the following negative tests required adding release/unlock/drop
  calls to avoid previously masked unrelated error reports:
  - cb_refs.c:underflow_prog
  - exceptions_fail.c:reject_rbtree_add_throw
  - exceptions_fail.c:reject_with_cp_reference
- the following precision tracking selftests needed change in expected
  log trace:
  - verifier_subprog_precision.c:callback_result_precise
    (note: r0 precision is no longer propagated inside callback and
           I think this is a correct behavior)
  - verifier_subprog_precision.c:parent_callee_saved_reg_precise_with_callback
  - verifier_subprog_precision.c:parent_stack_slot_precise_with_callback

Reported-by: Andrew Werner <awerner32@gmail.com>
Closes: https://lore.kernel.org/bpf/CA+vRuzPChFNXmouzGG+wsy=6eMcfr1mFG0F3g7rbg-sedGKW3w@mail.gmail.com/
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20231121020701.26440-7-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-20 18:35:44 -08:00
Eduard Zingerman
58124a98cb bpf: extract setup_func_entry() utility function
Move code for simulated stack frame creation to a separate utility
function. This function would be used in the follow-up change for
callbacks handling.

Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20231121020701.26440-6-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-20 18:33:35 -08:00
Eduard Zingerman
683b96f960 bpf: extract __check_reg_arg() utility function
Split check_reg_arg() into two utility functions:
- check_reg_arg() operating on registers from current verifier state;
- __check_reg_arg() operating on a specific set of registers passed as
  a parameter;

The __check_reg_arg() function would be used by a follow-up change for
callbacks handling.

Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20231121020701.26440-5-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-20 18:33:35 -08:00
Linus Torvalds
b0014556a2 Merge tag 'timers_urgent_for_v6.7_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fix from Borislav Petkov:

 - Do the push of pending hrtimers away from a CPU which is being
   offlined earlier in the offlining process in order to prevent a
   deadlock

* tag 'timers_urgent_for_v6.7_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  hrtimers: Push pending hrtimers away from outgoing CPU earlier
2023-11-19 13:35:07 -08:00
Linus Torvalds
2a0adc4954 Merge tag 'sched_urgent_for_v6.7_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Borislav Petkov:

 - Fix virtual runtime calculation when recomputing a sched entity's
   weights

 - Fix wrongly rejected unprivileged poll requests to the cgroup psi
   pressure files

 - Make sure the load balancing is done by only one CPU

* tag 'sched_urgent_for_v6.7_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/fair: Fix the decision for load balance
  sched: psi: fix unprivileged polling against cgroups
  sched/eevdf: Fix vruntime adjustment on reweight
2023-11-19 13:32:00 -08:00
Linus Torvalds
2f84f8232e Merge tag 'locking_urgent_for_v6.7_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking fix from Borislav Petkov:

 - Fix a hardcoded futex flags case which lead to one robust futex test
   failure

* tag 'locking_urgent_for_v6.7_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  futex: Fix hardcoded flags
2023-11-19 13:30:21 -08:00
Linus Torvalds
c8b3443cbd Merge tag 'perf_urgent_for_v6.7_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fix from Borislav Petkov:

 - Make sure the context refcount is transferred too when migrating perf
   events

* tag 'perf_urgent_for_v6.7_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/core: Fix cpuctx refcounting
2023-11-19 13:26:42 -08:00
Linus Torvalds
2254005ef1 Merge tag 'parisc-for-6.7-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux
Pull parisc fixes from Helge Deller:
 "On parisc we still sometimes need writeable stacks, e.g. if programs
  aren't compiled with gcc-14. To avoid issues with the upcoming
  systemd-254 we therefore have to disable prctl(PR_SET_MDWE) for now
  (for parisc only).

  The other two patches are minor: a bugfix for the soft power-off on
  qemu with 64-bit kernel and prefer strscpy() over strlcpy():

   - Fix power soft-off on qemu

   - Disable prctl(PR_SET_MDWE) since parisc sometimes still needs
     writeable stacks

   - Use strscpy instead of strlcpy in show_cpuinfo()"

* tag 'parisc-for-6.7-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
  prctl: Disable prctl(PR_SET_MDWE) on parisc
  parisc/power: Fix power soft-off when running on qemu
  parisc: Replace strlcpy() with strscpy()
2023-11-18 15:13:10 -08:00
Helge Deller
793838138c prctl: Disable prctl(PR_SET_MDWE) on parisc
systemd-254 tries to use prctl(PR_SET_MDWE) for it's MemoryDenyWriteExecute
functionality, but fails on parisc which still needs executable stacks in
certain combinations of gcc/glibc/kernel.

Disable prctl(PR_SET_MDWE) by returning -EINVAL for now on parisc, until
userspace has catched up.

Signed-off-by: Helge Deller <deller@gmx.de>
Co-developed-by: Linus Torvalds <torvalds@linux-foundation.org>
Reported-by: Sam James <sam@gentoo.org>
Closes: https://github.com/systemd/systemd/issues/29775
Tested-by: Sam James <sam@gentoo.org>
Link: https://lore.kernel.org/all/875y2jro9a.fsf@gentoo.org/
Cc: <stable@vger.kernel.org> # v6.3+
2023-11-18 19:35:31 +01:00
Linus Torvalds
bf786e2a78 Merge tag 'audit-pr-20231116' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit
Pull audit fix from Paul Moore:
 "One small audit patch to convert a WARN_ON_ONCE() into a normal
  conditional to avoid scary looking console warnings when eBPF code
  generates audit records from unexpected places"

* tag 'audit-pr-20231116' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
  audit: don't WARN_ON_ONCE(!current->mm) in audit_exe_compare()
2023-11-17 08:42:05 -05:00
Linus Torvalds
7475e51b87 Merge tag 'net-6.7-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
 "Including fixes from BPF and netfilter.

  Current release - regressions:

   - core: fix undefined behavior in netdev name allocation

   - bpf: do not allocate percpu memory at init stage

   - netfilter: nf_tables: split async and sync catchall in two
     functions

   - mptcp: fix possible NULL pointer dereference on close

  Current release - new code bugs:

   - eth: ice: dpll: fix initial lock status of dpll

  Previous releases - regressions:

   - bpf: fix precision backtracking instruction iteration

   - af_unix: fix use-after-free in unix_stream_read_actor()

   - tipc: fix kernel-infoleak due to uninitialized TLV value

   - eth: bonding: stop the device in bond_setup_by_slave()

   - eth: mlx5:
      - fix double free of encap_header
      - avoid referencing skb after free-ing in drop path

   - eth: hns3: fix VF reset

   - eth: mvneta: fix calls to page_pool_get_stats

  Previous releases - always broken:

   - core: set SOCK_RCU_FREE before inserting socket into hashtable

   - bpf: fix control-flow graph checking in privileged mode

   - eth: ppp: limit MRU to 64K

   - eth: stmmac: avoid rx queue overrun

   - eth: icssg-prueth: fix error cleanup on failing initialization

   - eth: hns3: fix out-of-bounds access may occur when coalesce info is
     read via debugfs

   - eth: cortina: handle large frames

  Misc:

   - selftests: gso: support CONFIG_MAX_SKB_FRAGS up to 45"

* tag 'net-6.7-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (78 commits)
  macvlan: Don't propagate promisc change to lower dev in passthru
  net: sched: do not offload flows with a helper in act_ct
  net/mlx5e: Check return value of snprintf writing to fw_version buffer for representors
  net/mlx5e: Check return value of snprintf writing to fw_version buffer
  net/mlx5e: Reduce the size of icosq_str
  net/mlx5: Increase size of irq name buffer
  net/mlx5e: Update doorbell for port timestamping CQ before the software counter
  net/mlx5e: Track xmit submission to PTP WQ after populating metadata map
  net/mlx5e: Avoid referencing skb after free-ing in drop path of mlx5e_sq_xmit_wqe
  net/mlx5e: Don't modify the peer sent-to-vport rules for IPSec offload
  net/mlx5e: Fix pedit endianness
  net/mlx5e: fix double free of encap_header in update funcs
  net/mlx5e: fix double free of encap_header
  net/mlx5: Decouple PHC .adjtime and .adjphase implementations
  net/mlx5: DR, Allow old devices to use multi destination FTE
  net/mlx5: Free used cpus mask when an IRQ is released
  Revert "net/mlx5: DR, Supporting inline WQE when possible"
  bpf: Do not allocate percpu memory at init stage
  net: Fix undefined behavior in netdev name allocation
  dt-bindings: net: ethernet-controller: Fix formatting error
  ...
2023-11-16 07:51:26 -05:00
Yonghong Song
1fda5bb66a bpf: Do not allocate percpu memory at init stage
Kirill Shutemov reported significant percpu memory consumption increase after
booting in 288-cpu VM ([1]) due to commit 41a5db8d81 ("bpf: Add support for
non-fix-size percpu mem allocation"). The percpu memory consumption is
increased from 111MB to 969MB. The number is from /proc/meminfo.

I tried to reproduce the issue with my local VM which at most supports upto
255 cpus. With 252 cpus, without the above commit, the percpu memory
consumption immediately after boot is 57MB while with the above commit the
percpu memory consumption is 231MB.

This is not good since so far percpu memory from bpf memory allocator is not
widely used yet. Let us change pre-allocation in init stage to on-demand
allocation when verifier detects there is a need of percpu memory for bpf
program. With this change, percpu memory consumption after boot can be reduced
signicantly.

  [1] https://lore.kernel.org/lkml/20231109154934.4saimljtqx625l3v@box.shutemov.name/

Fixes: 41a5db8d81 ("bpf: Add support for non-fix-size percpu mem allocation")
Reported-and-tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20231111013928.948838-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-15 07:51:06 -08:00
Peter Zijlstra
889c58b315 perf/core: Fix cpuctx refcounting
Audit of the refcounting turned up that perf_pmu_migrate_context()
fails to migrate the ctx refcount.

Fixes: bd27568117 ("perf: Rewrite core context handling")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20230612093539.085862001@infradead.org
Cc: <stable@vger.kernel.org>
2023-11-15 04:18:31 +01:00
Peter Zijlstra
c9bd1568d5 futex: Fix hardcoded flags
Xi reported that commit 5694289ce1 ("futex: Flag conversion") broke
glibc's robust futex tests.

This was narrowed down to the change of FLAGS_SHARED from 0x01 to
0x10, at which point Florian noted that handle_futex_death() has a
hardcoded flags argument of 1.

Change this to: FLAGS_SIZE_32 | FLAGS_SHARED, matching how
futex_to_flags() unconditionally sets FLAGS_SIZE_32 for all legacy
futex ops.

Reported-by: Xi Ruoyao <xry111@xry111.site>
Reported-by: Florian Weimer <fweimer@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20231114201402.GA25315@noisy.programming.kicks-ass.net
Fixes: 5694289ce1 ("futex: Flag conversion")
Cc: <stable@vger.kernel.org>
2023-11-15 04:02:25 +01:00
Paul Moore
969d90ec21 audit: don't WARN_ON_ONCE(!current->mm) in audit_exe_compare()
eBPF can end up calling into the audit code from some odd places, and
some of these places don't have @current set properly so we end up
tripping the `WARN_ON_ONCE(!current->mm)` near the top of
`audit_exe_compare()`.  While the basic `!current->mm` check is good,
the `WARN_ON_ONCE()` results in some scary console messages so let's
drop that and just do the regular `!current->mm` check to avoid
problems.

Cc: <stable@vger.kernel.org>
Fixes: 47846d5134 ("audit: don't take task_lock() in audit_exe_compare() code path")
Reported-by: Artem Savkov <asavkov@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2023-11-14 17:34:27 -05:00
Keisuke Nishimura
6d7e4782bc sched/fair: Fix the decision for load balance
should_we_balance is called for the decision to do load-balancing.
When sched ticks invoke this function, only one CPU should return
true. However, in the current code, two CPUs can return true. The
following situation, where b means busy and i means idle, is an
example, because CPU 0 and CPU 2 return true.

        [0, 1] [2, 3]
         b  b   i  b

This fix checks if there exists an idle CPU with busy sibling(s)
after looking for a CPU on an idle core. If some idle CPUs with busy
siblings are found, just the first one should do load-balancing.

Fixes: b1bfeab9b0 ("sched/fair: Consider the idle state of the whole core for load balance")
Signed-off-by: Keisuke Nishimura <keisuke.nishimura@inria.fr>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Chen Yu <yu.c.chen@intel.com>
Reviewed-by: Shrikanth Hegde <sshegde@linux.vnet.ibm.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20231031133821.1570861-1-keisuke.nishimura@inria.fr
2023-11-14 22:27:01 +01:00
Johannes Weiner
8b39d20ece sched: psi: fix unprivileged polling against cgroups
519fabc7aa ("psi: remove 500ms min window size limitation for
triggers") breaks unprivileged psi polling on cgroups.

Historically, we had a privilege check for polling in the open() of a
pressure file in /proc, but were erroneously missing it for the open()
of cgroup pressure files.

When unprivileged polling was introduced in d82caa2735 ("sched/psi:
Allow unprivileged polling of N*2s period"), it needed to filter
privileges depending on the exact polling parameters, and as such
moved the CAP_SYS_RESOURCE check from the proc open() callback to
psi_trigger_create(). Both the proc files as well as cgroup files go
through this during write(). This implicitly added the missing check
for privileges required for HT polling for cgroups.

When 519fabc7aa ("psi: remove 500ms min window size limitation for
triggers") followed right after to remove further restrictions on the
RT polling window, it incorrectly assumed the cgroup privilege check
was still missing and added it to the cgroup open(), mirroring what we
used to do for proc files in the past.

As a result, unprivileged poll requests that would be supported now
get rejected when opening the cgroup pressure file for writing.

Remove the cgroup open() check. psi_trigger_create() handles it.

Fixes: 519fabc7aa ("psi: remove 500ms min window size limitation for triggers")
Reported-by: Luca Boccassi <bluca@debian.org>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Luca Boccassi <bluca@debian.org>
Acked-by: Suren Baghdasaryan <surenb@google.com>
Cc: stable@vger.kernel.org # 6.5+
Link: https://lore.kernel.org/r/20231026164114.2488682-1-hannes@cmpxchg.org
2023-11-14 22:27:00 +01:00
Abel Wu
eab03c23c2 sched/eevdf: Fix vruntime adjustment on reweight
vruntime of the (on_rq && !0-lag) entity needs to be adjusted when
it gets re-weighted, and the calculations can be simplified based
on the fact that re-weight won't change the w-average of all the
entities. Please check the proofs in comments.

But adjusting vruntime can also cause position change in RB-tree
hence require re-queue to fix up which might be costly. This might
be avoided by deferring adjustment to the time the entity actually
leaves tree (dequeue/pick), but that will negatively affect task
selection and probably not good enough either.

Fixes: 147f3efaa2 ("sched/fair: Implement an EEVDF-like scheduling policy")
Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20231107090510.71322-2-wuyun.abel@bytedance.com
2023-11-14 22:27:00 +01:00
Thomas Gleixner
5c0930ccaa hrtimers: Push pending hrtimers away from outgoing CPU earlier
2b8272ff4a ("cpu/hotplug: Prevent self deadlock on CPU hot-unplug")
solved the straight forward CPU hotplug deadlock vs. the scheduler
bandwidth timer. Yu discovered a more involved variant where a task which
has a bandwidth timer started on the outgoing CPU holds a lock and then
gets throttled. If the lock required by one of the CPU hotplug callbacks
the hotplug operation deadlocks because the unthrottling timer event is not
handled on the dying CPU and can only be recovered once the control CPU
reaches the hotplug state which pulls the pending hrtimers from the dead
CPU.

Solve this by pushing the hrtimers away from the dying CPU in the dying
callbacks. Nothing can queue a hrtimer on the dying CPU at that point because
all other CPUs spin in stop_machine() with interrupts disabled and once the
operation is finished the CPU is marked offline.

Reported-by: Yu Liao <liaoyu15@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Liu Tie <liutie4@huawei.com>
Link: https://lore.kernel.org/r/87a5rphara.ffs@tglx
2023-11-11 18:06:42 +01:00
Linus Torvalds
3ca112b71f Merge tag 'probes-fixes-v6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull probes fixes from Masami Hiramatsu:

 - Documentation update: Add a note about argument and return value
   fetching is the best effort because it depends on the type.

 - objpool: Fix to make internal global variables static in
   test_objpool.c.

 - kprobes: Unify kprobes_exceptions_nofify() prototypes. There are the
   same prototypes in asm/kprobes.h for some architectures, but some of
   them are missing the prototype and it causes a warning. So move the
   prototype into linux/kprobes.h.

 - tracing: Fix to check the tracepoint event and return event at
   parsing stage. The tracepoint event doesn't support %return but if
   $retval exists, it will be converted to %return silently. This finds
   that case and rejects it.

 - tracing: Fix the order of the descriptions about the parameters of
   __kprobe_event_gen_cmd_start() to be consistent with the argument
   list of the function.

* tag 'probes-fixes-v6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing/kprobes: Fix the order of argument descriptions
  tracing: fprobe-event: Fix to check tracepoint event and return
  kprobes: unify kprobes_exceptions_nofify() prototypes
  lib: test_objpool: make global variables static
  Documentation: tracing: Add a note about argument and retval access
2023-11-10 16:35:04 -08:00
Yujie Liu
f032c53bea tracing/kprobes: Fix the order of argument descriptions
The order of descriptions should be consistent with the argument list of
the function, so "kretprobe" should be the second one.

int __kprobe_event_gen_cmd_start(struct dynevent_cmd *cmd, bool kretprobe,
                                 const char *name, const char *loc, ...)

Link: https://lore.kernel.org/all/20231031041305.3363712-1-yujie.liu@intel.com/

Fixes: 2a588dd1d5 ("tracing: Add kprobe event command generation functions")
Suggested-by: Mukesh Ojha <quic_mojha@quicinc.com>
Signed-off-by: Yujie Liu <yujie.liu@intel.com>
Reviewed-by: Mukesh Ojha <quic_mojha@quicinc.com>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2023-11-11 08:00:43 +09:00
Linus Torvalds
391ce5b9c4 Merge tag 'dma-mapping-6.7-2023-11-10' of git://git.infradead.org/users/hch/dma-mapping
Pull dma-mapping fixes from Christoph Hellwig:

 - don't leave pages decrypted for DMA in encrypted memory setups linger
   around on failure (Petr Tesarik)

 - fix an out of bounds access in the new dynamic swiotlb code (Petr
   Tesarik)

 - fix dma_addressing_limited for systems with weird physical memory
   layouts (Jia He)

* tag 'dma-mapping-6.7-2023-11-10' of git://git.infradead.org/users/hch/dma-mapping:
  swiotlb: fix out-of-bounds TLB allocations with CONFIG_SWIOTLB_DYNAMIC
  dma-mapping: fix dma_addressing_limited() if dma_range_map can't cover all system RAM
  dma-mapping: move dma_addressing_limited() out of line
  swiotlb: do not free decrypted pages if dynamic
2023-11-10 11:09:07 -08:00
Masami Hiramatsu (Google)
ce51e6153f tracing: fprobe-event: Fix to check tracepoint event and return
Fix to check the tracepoint event is not valid with $retval.
The commit 08c9306fc2 ("tracing/fprobe-event: Assume fprobe is
a return event by $retval") introduced automatic return probe
conversion with $retval. But since tracepoint event does not
support return probe, $retval is not acceptable.

Without this fix, ftracetest, tprobe_syntax_errors.tc fails;

[22] Tracepoint probe event parser error log check      [FAIL]
 ----
 # tail 22-tprobe_syntax_errors.tc-log.mRKroL
 + ftrace_errlog_check trace_fprobe t kfree ^$retval dynamic_events
 + printf %s t kfree
 + wc -c
 + pos=8
 + printf %s t kfree ^$retval
 + tr -d ^
 + command=t kfree $retval
 + echo Test command: t kfree $retval
 Test command: t kfree $retval
 + echo
 ----

So 't kfree $retval' should fail (tracepoint doesn't support
return probe) but passed it.

Link: https://lore.kernel.org/all/169944555933.45057.12831706585287704173.stgit@devnote2/

Fixes: 08c9306fc2 ("tracing/fprobe-event: Assume fprobe is a return event by $retval")
Cc: stable@vger.kernel.org
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2023-11-10 20:06:12 +09:00
Andrii Nakryiko
10e14e9652 bpf: fix control-flow graph checking in privileged mode
When BPF program is verified in privileged mode, BPF verifier allows
bounded loops. This means that from CFG point of view there are
definitely some back-edges. Original commit adjusted check_cfg() logic
to not detect back-edges in control flow graph if they are resulting
from conditional jumps, which the idea that subsequent full BPF
verification process will determine whether such loops are bounded or
not, and either accept or reject the BPF program. At least that's my
reading of the intent.

Unfortunately, the implementation of this idea doesn't work correctly in
all possible situations. Conditional jump might not result in immediate
back-edge, but just a few unconditional instructions later we can arrive
at back-edge. In such situations check_cfg() would reject BPF program
even in privileged mode, despite it might be bounded loop. Next patch
adds one simple program demonstrating such scenario.

To keep things simple, instead of trying to detect back edges in
privileged mode, just assume every back edge is valid and let subsequent
BPF verification prove or reject bounded loops.

Note a few test changes. For unknown reason, we have a few tests that
are specified to detect a back-edge in a privileged mode, but looking at
their code it seems like the right outcome is passing check_cfg() and
letting subsequent verification to make a decision about bounded or not
bounded looping.

Bounded recursion case is also interesting. The example should pass, as
recursion is limited to just a few levels and so we never reach maximum
number of nested frames and never exhaust maximum stack depth. But the
way that max stack depth logic works today it falsely detects this as
exceeding max nested frame count. This patch series doesn't attempt to
fix this orthogonal problem, so we just adjust expected verifier failure.

Suggested-by: Alexei Starovoitov <ast@kernel.org>
Fixes: 2589726d12 ("bpf: introduce bounded loops")
Reported-by: Hao Sun <sunhao.th@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231110061412.2995786-1-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-09 22:57:24 -08:00
Andrii Nakryiko
4bb7ea946a bpf: fix precision backtracking instruction iteration
Fix an edge case in __mark_chain_precision() which prematurely stops
backtracking instructions in a state if it happens that state's first
and last instruction indexes are the same. This situations doesn't
necessarily mean that there were no instructions simulated in a state,
but rather that we starting from the instruction, jumped around a bit,
and then ended up at the same instruction before checkpointing or
marking precision.

To distinguish between these two possible situations, we need to consult
jump history. If it's empty or contain a single record "bridging" parent
state and first instruction of processed state, then we indeed
backtracked all instructions in this state. But if history is not empty,
we are definitely not done yet.

Move this logic inside get_prev_insn_idx() to contain it more nicely.
Use -ENOENT return code to denote "we are out of instructions"
situation.

This bug was exposed by verifier_loop1.c's bounded_recursion subtest, once
the next fix in this patch set is applied.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Fixes: b5dc0163d8 ("bpf: precise scalar_value tracking")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231110002638.4168352-3-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-09 20:11:20 -08:00
Andrii Nakryiko
3feb263bb5 bpf: handle ldimm64 properly in check_cfg()
ldimm64 instructions are 16-byte long, and so have to be handled
appropriately in check_cfg(), just like the rest of BPF verifier does.

This has implications in three places:
  - when determining next instruction for non-jump instructions;
  - when determining next instruction for callback address ldimm64
    instructions (in visit_func_call_insn());
  - when checking for unreachable instructions, where second half of
    ldimm64 is expected to be unreachable;

We take this also as an opportunity to report jump into the middle of
ldimm64. And adjust few test_verifier tests accordingly.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Reported-by: Hao Sun <sunhao.th@gmail.com>
Fixes: 475fb78fbf ("bpf: verifier (add branch/goto checks)")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231110002638.4168352-2-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-09 20:11:20 -08:00
Linus Torvalds
89cdf9d556 Merge tag 'net-6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
 "Including fixes from netfilter and bpf.

  Current release - regressions:

   - sched: fix SKB_NOT_DROPPED_YET splat under debug config

  Current release - new code bugs:

   - tcp:
       - fix usec timestamps with TCP fastopen
       - fix possible out-of-bounds reads in tcp_hash_fail()
       - fix SYN option room calculation for TCP-AO

   - tcp_sigpool: fix some off by one bugs

   - bpf: fix compilation error without CGROUPS

   - ptp:
       - ptp_read() should not release queue
       - fix tsevqs corruption

  Previous releases - regressions:

   - llc: verify mac len before reading mac header

  Previous releases - always broken:

   - bpf:
       - fix check_stack_write_fixed_off() to correctly spill imm
       - fix precision tracking for BPF_ALU | BPF_TO_BE | BPF_END
       - check map->usercnt after timer->timer is assigned

   - dsa: lan9303: consequently nested-lock physical MDIO

   - dccp/tcp: call security_inet_conn_request() after setting IP addr

   - tg3: fix the TX ring stall due to incorrect full ring handling

   - phylink: initialize carrier state at creation

   - ice: fix direction of VF rules in switchdev mode

  Misc:

   - fill in a bunch of missing MODULE_DESCRIPTION()s, more to come"

* tag 'net-6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (84 commits)
  net: ti: icss-iep: fix setting counter value
  ptp: fix corrupted list in ptp_open
  ptp: ptp_read should not release queue
  net_sched: sch_fq: better validate TCA_FQ_WEIGHTS and TCA_FQ_PRIOMAP
  net: kcm: fill in MODULE_DESCRIPTION()
  net/sched: act_ct: Always fill offloading tuple iifidx
  netfilter: nat: fix ipv6 nat redirect with mapped and scoped addresses
  netfilter: xt_recent: fix (increase) ipv6 literal buffer length
  ipvs: add missing module descriptions
  netfilter: nf_tables: remove catchall element in GC sync path
  netfilter: add missing module descriptions
  drivers/net/ppp: use standard array-copy-function
  net: enetc: shorten enetc_setup_xdp_prog() error message to fit NETLINK_MAX_FMTMSG_LEN
  virtio/vsock: Fix uninit-value in virtio_transport_recv_pkt()
  r8169: respect userspace disabling IFF_MULTICAST
  selftests/bpf: get trusted cgrp from bpf_iter__cgroup directly
  bpf: Let verifier consider {task,cgroup} is trusted in bpf_iter_reg
  net: phylink: initialize carrier state at creation
  test/vsock: add dobule bind connect test
  test/vsock: refactor vsock_accept
  ...
2023-11-09 17:09:35 -08:00