linux

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2025-12-27 11:06:41 -05:00

Author	SHA1	Message	Date
Uros Bizjak	58b4fba81a	ucount: use atomic_long_try_cmpxchg() in atomic_long_inc_below() Use atomic_long_try_cmpxchg() instead of atomic_long_cmpxchg (ptr, old, new) == old in atomic_long_inc_below(). x86 CMPXCHG instruction returns success in ZF flag, so this change saves a compare after cmpxchg (and related move instruction in front of cmpxchg). Also, atomic_long_try_cmpxchg implicitly assigns old ptr value to "old" when cmpxchg fails, enabling further code simplifications. No functional change intended. Link: https://lkml.kernel.org/r/20250721174610.28361-2-ubizjak@gmail.com Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Reviewed-by: Alexey Gladkov <legion@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: Alexey Gladkov <legion@kernel.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: MengEn Sun <mengensun@tencent.com> Cc: "Thomas Weißschuh" <linux@weissschuh.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-08-02 12:01:38 -07:00
Uros Bizjak	f8cd9193b6	ucount: fix atomic_long_inc_below() argument type The type of u argument of atomic_long_inc_below() should be long to avoid unwanted truncation to int. The patch fixes the wrong argument type of an internal function to prevent unwanted argument truncation. It fixes an internal locking primitive; it should not have any direct effect on userspace. Mark said : AFAICT there's no problem in practice because atomic_long_inc_below() : is only used by inc_ucount(), and it looks like the value is : constrained between 0 and INT_MAX. : : In inc_ucount() the limit value is taken from : user_namespace::ucount_max[], and AFAICT that's only written by : sysctls, to the table setup by setup_userns_sysctls(), where : UCOUNT_ENTRY() limits the value between 0 and INT_MAX. : : This is certainly a cleanup, but there might be no functional issue in : practice as above. Link: https://lkml.kernel.org/r/20250721174610.28361-1-ubizjak@gmail.com Fixes: `f9c82a4ea8` ("Increase size of ucounts to atomic_long_t") Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: Alexey Gladkov <legion@kernel.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: MengEn Sun <mengensun@tencent.com> Cc: "Thomas Weißschuh" <linux@weissschuh.net> Cc: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-08-02 12:01:38 -07:00
Alexander Graf	07d2490297	kexec: enable CMA based contiguous allocation When booting a new kernel with kexec_file, the kernel picks a target location that the kernel should live at, then allocates random pages, checks whether any of those patches magically happens to coincide with a target address range and if so, uses them for that range. For every page allocated this way, it then creates a page list that the relocation code - code that executes while all CPUs are off and we are just about to jump into the new kernel - copies to their final memory location. We can not put them there before, because chances are pretty good that at least some page in the target range is already in use by the currently running Linux environment. Copying is happening from a single CPU at RAM rate, which takes around 4-50 ms per 100 MiB. All of this is inefficient and error prone. To successfully kexec, we need to quiesce all devices of the outgoing kernel so they don't scribble over the new kernel's memory. We have seen cases where that does not happen properly (cough GIC cough) and hence the new kernel was corrupted. This started a month long journey to root cause failing kexecs to eventually see memory corruption, because the new kernel was corrupted severely enough that it could not emit output to tell us about the fact that it was corrupted. By allocating memory for the next kernel from a memory range that is guaranteed scribbling free, we can boot the next kernel up to a point where it is at least able to detect corruption and maybe even stop it before it becomes severe. This increases the chance for successful kexecs. Since kexec got introduced, Linux has gained the CMA framework which can perform physically contiguous memory mappings, while keeping that memory available for movable memory when it is not needed for contiguous allocations. The default CMA allocator is for DMA allocations. This patch adds logic to the kexec file loader to attempt to place the target payload at a location allocated from CMA. If successful, it uses that memory range directly instead of creating copy instructions during the hot phase. To ensure that there is a safety net in case anything goes wrong with the CMA allocation, it also adds a flag for user space to force disable CMA allocations. Using CMA allocations has two advantages: 1) Faster by 4-50 ms per 100 MiB. There is no more need to copy in the hot phase. 2) More robust. Even if by accident some page is still in use for DMA, the new kernel image will be safe from that access because it resides in a memory region that is considered allocated in the old kernel and has a chance to reinitialize that component. Link: https://lkml.kernel.org/r/20250610085327.51817-1-graf@amazon.com Signed-off-by: Alexander Graf <graf@amazon.com> Acked-by: Baoquan He <bhe@redhat.com> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Zhongkun He <hezhongkun.hzk@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-08-02 12:01:38 -07:00
Matt Fleming	ed4f142f72	stackdepot: make max number of pools boot-time configurable We're hitting the WARN in depot_init_pool() about reaching the stack depot limit because we have long stacks that don't dedup very well. Introduce a new start-up parameter to allow users to set the number of maximum stack depot pools. Link: https://lkml.kernel.org/r/20250718153928.94229-1-matt@readmodwrite.com Signed-off-by: Matt Fleming <mfleming@cloudflare.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Marco Elver <elver@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Dmitriy Vyukov <dvyukov@google.com> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-08-02 12:01:38 -07:00
Dr. David Alan Gilbert	6c6d8f8ba7	lib/xxhash: remove unused functions xxh32_digest() and xxh32_update() were added in 2017 in the original xxhash commit, but have remained unused. Remove them. Link: https://lkml.kernel.org/r/20250716133245.243363-1-linux@treblig.org Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Dave Gilbert <linux@treblig.org> Cc: Nick Terrell <terrelln@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-08-02 12:01:37 -07:00
Andrew Morton	fefbeed8c6	init/Kconfig: restore CONFIG_BROKEN help text Linus added it in 2003, it later was removed. Put it back. Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Borislav Betkov <bp@alien8.de> Cc: David S. Miller <davem@davemloft.net> Cc: Ingo Molnar <mingo@redhat.com> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-08-02 12:01:37 -07:00
Herbert Xu	a9ed4422ad	lib/raid6: update recov_rvv.c zero page usage Update lib/raid6/recov_rvv.c, for `1857fcc847` ("lib/raid6: replace custom zero page with ZERO_PAGE"), per Klara. Link: https://lkml.kernel.org/r/aFkUnXWtxcgOTVkw@gondor.apana.org.au Fixes: `1857fcc847` ("lib/raid6: replace custom zero page with ZERO_PAGE") Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Cc: Song Liu <song@kernel.org> Cc: Yu Kuai <yukuai3@huawei.com> Cc: Klara Modin <klarasmodin@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-07-19 19:08:29 -07:00
Wang Yaxin	320bf1709a	docs: update docs after introducing delaytop The "getdelays" can only display the latency of a single task by specifying a PID, we introduce the "delaytop" with the following capabilities: 1. system view: monitors latency metrics (CPU, I/O, memory, IRQ, etc.) for all system processes 2. supports field-based sorting (e.g., default sort by CPU latency in descending order) 3. dynamic interactive interface: focus on specific processes with --pid; limit displayed entries with --processes 20; control monitoring duration with --iterations; Link: https://lkml.kernel.org/r/2025071013514177028RdjISjqeIOnTCRvGAwy@zte.com.cn Signed-off-by: Fan Yu <fan.yu9@zte.com.cn> Signed-off-by: Wang Yaxin <wang.yaxin@zte.com.cn> Signed-off-by: Jiang Kun <jiang.kun2@zte.com.cn> Cc: Balbir Singh <bsingharora@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: Peilin He <he.peilin@zte.com.cn> Cc: Qiang Tu <tu.qiang35@zte.com.cn> Cc: wangyong <wang.yong12@zte.com.cn> Cc: xu xin <xu.xin16@zte.com.cn> Cc: Yang Yang <yang.yang29@zte.com.cn> Cc: Yunkai Zhang <zhang.yunkai@zte.com.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-07-19 19:08:29 -07:00
Wang Yaxin	6b47c9f8ee	delaytop: add psi info to show system delay Support showing whole delay of system by reading PSI, just like the first few lines of information output by the top command. the output of delaytop includes both system-wide delay and delay of individual tasks, providing a more comprehensive reflection of system latency status. Use case ======== bash# ./delaytop System Pressure Information: (avg10/avg60/avg300/total) CPU: full: 0.0%/ 0.0%/ 0.0%/0 some: 0.1%/ 0.0%/ 0.0%/14216596 Memory: full: 0.0%/ 0.0%/ 0.0%/34010659 some: 0.0%/ 0.0%/ 0.0%/35406492 IO: full: 0.1%/ 0.0%/ 0.0%/51029453 some: 0.1%/ 0.0%/ 0.0%/55330465 IRQ: full: 0.0%/ 0.0%/ 0.0%/0 Top 20 processes (sorted by CPU delay): PID TGID COMMAND CPU(ms) IO(ms) SWAP(ms) RCL(ms) THR(ms) CMP(ms) WP(ms) IRQ(ms) --------------------------------------------------------------------------------------------- 32 32 kworker/2:0H-sy 23.65 0.00 0.00 0.00 0.00 0.00 0.00 0.00 497 497 kworker/R-scsi_ 1.20 0.00 0.00 0.00 0.00 0.00 0.00 0.00 495 495 kworker/R-scsi_ 1.13 0.00 0.00 0.00 0.00 0.00 0.00 0.00 494 494 scsi_eh_0 1.12 0.00 0.00 0.00 0.00 0.00 0.00 0.00 485 485 kworker/R-ata_s 0.90 0.00 0.00 0.00 0.00 0.00 0.00 0.00 574 574 kworker/R-kdmfl 0.36 0.00 0.00 0.00 0.00 0.00 0.00 0.00 34 34 idle_inject/3 0.33 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1123 1123 nde-netfilter 0.28 0.00 0.00 0.00 0.00 0.00 0.00 0.00 60 60 ksoftirqd/7 0.25 0.00 0.00 0.00 0.00 0.00 0.00 0.00 114 114 kworker/0:2-cgr 0.25 0.00 0.00 0.00 0.00 0.00 0.00 0.00 496 496 scsi_eh_1 0.24 0.00 0.00 0.00 0.00 0.00 0.00 0.00 51 51 cpuhp/6 0.24 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1667 1667 atd 0.24 0.00 0.00 0.00 0.00 0.00 0.00 0.00 45 45 cpuhp/5 0.23 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1102 1102 nde-backupservi 0.22 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1098 1098 systemsettings 0.21 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1100 1100 audit-monitor 0.20 0.00 0.00 0.00 0.00 0.00 0.00 0.00 53 53 migration/6 0.20 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1482 1482 sshd 0.19 0.00 0.00 0.00 0.00 0.00 0.00 0.00 39 39 cpuhp/4 0.19 0.00 0.00 0.00 0.00 0.00 0.00 0.00 Link: https://lkml.kernel.org/r/20250710135451340_5pOgpIFi0M5AE7H44W1D@zte.com.cn Co-developed-by: Fan Yu <fan.yu9@zte.com.cn> Signed-off-by: Fan Yu <fan.yu9@zte.com.cn> Signed-off-by: Wang Yaxin <wang.yaxin@zte.com.cn> Signed-off-by: Jiang Kun <jiang.kun2@zte.com.cn> Cc: Balbir Singh <bsingharora@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: Peilin He <he.peilin@zte.com.cn> Cc: Qiang Tu <tu.qiang35@zte.com.cn> Cc: wangyong <wang.yong12@zte.com.cn> Cc: xu xin <xu.xin16@zte.com.cn> Cc: Yang Yang <yang.yang29@zte.com.cn> Cc: Yunkai Zhang <zhang.yunkai@zte.com.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-07-19 19:08:29 -07:00
WangYuli	599579e857	selftests/thermal: remove duplicate newlines in perror calls perror() automatically appends a newline character, so the explicit '\n' in the format strings is redundant and results in duplicate newlines in the output. Remove the redundant '\n' characters from perror() calls in workload_hint_test.c to fix the formatting. Link: https://lkml.kernel.org/r/F482FB1EC020000C+20250710134751.306096-1-wangyuli@uniontech.com Signed-off-by: WangYuli <wangyuli@uniontech.com> Cc: Guan Wentao <guanwentao@uniontech.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-07-19 19:08:28 -07:00
WangYuli	813b468088	selftests/thermal: remove duplicate sprintf() call in workload_hint_test Remove redundant sprintf() call that was duplicating the same operation of formatting delay_str with argv[1]. Link: https://lkml.kernel.org/r/6338CD0E839B770B+20250710130412.284531-1-wangyuli@uniontech.com Signed-off-by: WangYuli <wangyuli@uniontech.com> Cc: Guan Wentao <guanwentao@uniontech.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-07-19 19:08:28 -07:00
Kuan-Wei Chiu	36e2241687	riscv: optimize gcd() performance on RISC-V without Zbb extension The binary GCD implementation uses FFS (find first set), which benefits from hardware support for the ctz instruction, provided by the Zbb extension on RISC-V. Without Zbb, this results in slower software-emulated behavior. Previously, RISC-V always used the binary GCD, regardless of actual hardware support. This patch improves runtime efficiency by disabling the efficient_ffs_key static branch when Zbb is either not enabled in the kernel (config) or not supported on the executing CPU. This selects the odd-even GCD implementation, which is faster in the absence of efficient FFS. This change ensures the most suitable GCD algorithm is chosen dynamically based on actual hardware capabilities. Link: https://lkml.kernel.org/r/20250606134758.1308400-4-visitorckw@gmail.com Co-developed-by: Yu-Chun Lin <eleanor15x@gmail.com> Signed-off-by: Yu-Chun Lin <eleanor15x@gmail.com> Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com> Acked-by: Alexandre Ghiti <alexghiti@rivosinc.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Ching-Chun (Jim) Huang <jserv@ccns.ncku.edu.tw> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-07-19 19:08:28 -07:00
Kuan-Wei Chiu	26b537edc5	riscv: optimize gcd() code size when CONFIG_RISCV_ISA_ZBB is disabled The binary GCD implementation depends on efficient ffs(), which on RISC-V requires hardware support for the Zbb extension. When CONFIG_RISCV_ISA_ZBB is not enabled, the kernel will never use binary GCD, as runtime logic will always fall back to the odd-even implementation. To avoid compiling unused code and reduce code size, select CONFIG_CPU_NO_EFFICIENT_FFS when CONFIG_RISCV_ISA_ZBB is not set. $ ./scripts/bloat-o-meter ./lib/math/gcd.o.old ./lib/math/gcd.o.new add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-274 (-274) Function old new delta gcd 360 86 -274 Total: Before=384, After=110, chg -71.35% Link: https://lkml.kernel.org/r/20250606134758.1308400-3-visitorckw@gmail.com Co-developed-by: Yu-Chun Lin <eleanor15x@gmail.com> Signed-off-by: Yu-Chun Lin <eleanor15x@gmail.com> Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com> Acked-by: Alexandre Ghiti <alexghiti@rivosinc.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Ching-Chun (Jim) Huang <jserv@ccns.ncku.edu.tw> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-07-19 19:08:28 -07:00
Kuan-Wei Chiu	b3d5fd6f82	lib/math/gcd: use static key to select implementation at runtime Patch series "Optimize GCD performance on RISC-V by selecting implementation at runtime", v3. The current implementation of gcd() selects between the binary GCD and the odd-even GCD algorithm at compile time, depending on whether CONFIG_CPU_NO_EFFICIENT_FFS is set. On platforms like RISC-V, however, this compile-time decision can be misleading: even when the compiler emits ctz instructions based on the assumption that they are efficient (as is the case when CONFIG_RISCV_ISA_ZBB is enabled), the actual hardware may lack support for the Zbb extension. In such cases, ffs() falls back to a software implementation at runtime, making the binary GCD algorithm significantly slower than the odd-even variant. To address this, we introduce a static key to allow runtime selection between the binary and odd-even GCD implementations. On RISC-V, the kernel now checks for Zbb support during boot. If Zbb is unavailable, the static key is disabled so that gcd() consistently uses the more efficient odd-even algorithm in that scenario. Additionally, to further reduce code size, we select CONFIG_CPU_NO_EFFICIENT_FFS automatically when CONFIG_RISCV_ISA_ZBB is not enabled, avoiding compilation of the unused binary GCD implementation entirely on systems where it would never be executed. This series ensures that the most efficient GCD algorithm is used in practice and avoids compiling unnecessary code based on hardware capabilities and kernel configuration. This patch (of 3): On platforms like RISC-V, the compiler may generate hardware FFS instructions even if the underlying CPU does not actually support them. Currently, the GCD implementation is chosen at compile time based on CONFIG_CPU_NO_EFFICIENT_FFS, which can result in suboptimal behavior on such systems. Introduce a static key, efficient_ffs_key, to enable runtime selection between the binary GCD (using ffs) and the odd-even GCD implementation. This allows the kernel to default to the faster binary GCD when FFS is efficient, while retaining the ability to fall back when needed. Link: https://lkml.kernel.org/r/20250606134758.1308400-1-visitorckw@gmail.com Link: https://lkml.kernel.org/r/20250606134758.1308400-2-visitorckw@gmail.com Co-developed-by: Yu-Chun Lin <eleanor15x@gmail.com> Signed-off-by: Yu-Chun Lin <eleanor15x@gmail.com> Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Ching-Chun (Jim) Huang <jserv@ccns.ncku.edu.tw> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Alexandre Ghiti <alexghiti@rivosinc.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-07-19 19:08:28 -07:00
Ivan Pravdin	08eabe4b9e	ocfs2: avoid potential ABBA deadlock by reordering tl_inode lock In ocfs2_move_extent(), tl_inode is currently locked after the global bitmap inode. However, in ocfs2_flush_truncate_log(), the lock order is reversed: tl_inode is locked first, followed by the global bitmap inode. This creates a classic ABBA deadlock scenario if two threads attempt these operations concurrently and acquire the locks in different orders. To prevent this, move the tl_inode locking earlier in ocfs2_move_extent(), so that it always precedes the global bitmap inode lock. No functional changes beyond lock ordering. Link: https://lkml.kernel.org/r/20250708020640.387741-1-ipravdin.official@gmail.com Reported-by: syzbot+6bf948e47f9bac7aacfa@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/67d5645c.050a0220.1dc86f.0004.GAE@google.com/ Signed-off-by: Ivan Pravdin <ipravdin.official@gmail.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Jun Piao <piaojun@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-07-19 19:08:27 -07:00
Colin Ian King	97103dcec2	squashfs: fix incorrect argument to sizeof in kmalloc_array call The sizeof(void ) is the incorrect argument in the kmalloc_array call, it best to fix this by using sizeof(cache_folios) instead. Fortunately the sizes of void* and folio* happen to be the same, so this has not shown up as a run time issue. [akpm@linux-foundation.org: fix build] Link: https://lkml.kernel.org/r/20250708142604.1891156-1-colin.i.king@gmail.com Fixes: `2e227ff5e2` ("squashfs: add optional full compressed block caching") Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Cc: Phillip Lougher <phillip@squashfs.org.uk> Cc: Chanho Min <chanho.min@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-07-19 19:08:27 -07:00
Colin Ian King	c0f98be69f	squashfs: replace ;; with ; and end of fi declaration There is an extraneous ; after a declaration, remove it. Link: https://lkml.kernel.org/r/20250708114900.1883130-1-colin.i.king@gmail.com Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Reviewed-by: Phillip Lougher <phillip@squashfs.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-07-19 19:08:27 -07:00
Ivan Pravdin	44acc46d18	ocfs2: avoid NULL pointer dereference in dx_dir_lookup_rec() When a directory entry is not found, ocfs2_dx_dir_lookup_rec() prints an error message that unconditionally dereferences the 'rec' pointer. However, if 'rec' is NULL, this leads to a NULL pointer dereference and a kernel panic. Add an explicit check empty extent list to avoid dereferencing NULL 'rec' pointer. Link: https://lkml.kernel.org/r/20250708001009.372263-1-ipravdin.official@gmail.com Reported-by: syzbot+20282c1b2184a857ac4c@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/67cd7e29.050a0220.e1a89.0007.GAE@google.com/ Signed-off-by: Ivan Pravdin <ipravdin.official@gmail.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Jun Piao <piaojun@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-07-19 19:08:27 -07:00
Ahelenia Ziemiańska	988f451ecb	ocfs2/dlm: fix "take a while" typo Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <jiangqi903@gmail.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Jun Piao <piaojun@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-07-19 19:08:26 -07:00
Lillian Berry	98aa4d5d24	init/main.c: add warning when file specified in rdinit is inaccessible Avoid silently ignoring the initramfs when the file specified in rdinit is not usable. This prints an error that clearly explains the issue (file was not found, vs initramfs was not found). Link: https://lkml.kernel.org/r/20250707091411.1412681-1-lillian@star-ark.net Signed-off-by: Lillian Berry <lillian@star-ark.net> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-07-19 19:08:26 -07:00
Zi Li	4efec6c091	samples: enhance hung_task detector test with read-write semaphore support Extend the hung_task detector test module to include read-write semaphore support alongside existing mutex and semaphore tests. This module now creates additional debugfs files under <debugfs>/hung_task, namely 'rw_semaphore_read' and 'rw_semaphore_write', in addition to 'mutex' and 'semaphore'. Reading these files with multiple processes triggers a prolonged sleep (256 seconds) while holding the respective lock, enabling hung_task detector testing for various locking mechanisms. This change builds on the extensible hung_task_tests module, adding read-write semaphore functionality to improve test coverage for kernel locking primitives. The implementation ensures proper lock handling and includes checks to prevent redundant data reads. Usage is: > cd /sys/kernel/debug/hung_task > cat mutex & cat mutex # Test mutex blocking > cat semaphore & cat semaphore # Test semaphore blocking > cat rw_semaphore_write \ & cat rw_semaphore_read # Test rwsem blocking > cat rw_semaphore_write \ & cat rw_semaphore_write # Test rwsem blocking Update the Kconfig description to reflect the addition of read-write semaphore debugfs files. Link: https://lkml.kernel.org/r/20250627072924.36567-4-lance.yang@linux.dev Signed-off-by: Zi Li <zi.li@linux.dev> Suggested-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Cc: Anna Schumaker <anna.schumaker@oracle.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Joel Granados <joel.granados@kernel.org> Cc: John Stultz <jstultz@google.com> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Mingzhe Yang <mingzhe.yang@ly.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Tomasz Figa <tfiga@chromium.org> Cc: Waiman Long <longman@redhat.com> Cc: Will Deacon <will@kernel.org> Cc: Yongliang Gao <leonylgao@tencent.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-07-19 19:08:26 -07:00
Lance Yang	77da18de55	hung_task: extend hung task blocker tracking to rwsems Inspired by mutex blocker tracking[1], and having already extended it to semaphores, let's now add support for reader-writer semaphores (rwsems). The approach is simple: when a task enters TASK_UNINTERRUPTIBLE while waiting for an rwsem, we just call hung_task_set_blocker(). The hung task detector can then query the rwsem's owner to identify the lock holder. Tracking works reliably for writers, as there can only be a single writer holding the lock, and its task struct is stored in the owner field. The main challenge lies with readers. The owner field points to only one of many concurrent readers, so we might lose track of the blocker if that specific reader unlocks, even while others remain. This is not a significant issue, however. In practice, long-lasting lock contention is almost always caused by a writer. Therefore, reliably tracking the writer is the primary goal of this patch series ;) With this change, the hung task detector can now show blocker task's info like below: [Fri Jun 27 15:21:34 2025] INFO: task cat:28631 blocked for more than 122 seconds. [Fri Jun 27 15:21:34 2025] Tainted: G S 6.16.0-rc3 #8 [Fri Jun 27 15:21:34 2025] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [Fri Jun 27 15:21:34 2025] task:cat state:D stack:0 pid:28631 tgid:28631 ppid:28501 task_flags:0x400000 flags:0x00004000 [Fri Jun 27 15:21:34 2025] Call Trace: [Fri Jun 27 15:21:34 2025] <TASK> [Fri Jun 27 15:21:34 2025] __schedule+0x7c7/0x1930 [Fri Jun 27 15:21:34 2025] ? __pfx___schedule+0x10/0x10 [Fri Jun 27 15:21:34 2025] ? policy_nodemask+0x215/0x340 [Fri Jun 27 15:21:34 2025] ? _raw_spin_lock_irq+0x8a/0xe0 [Fri Jun 27 15:21:34 2025] ? __pfx__raw_spin_lock_irq+0x10/0x10 [Fri Jun 27 15:21:34 2025] schedule+0x6a/0x180 [Fri Jun 27 15:21:34 2025] schedule_preempt_disabled+0x15/0x30 [Fri Jun 27 15:21:34 2025] rwsem_down_read_slowpath+0x55e/0xe10 [Fri Jun 27 15:21:34 2025] ? __pfx_rwsem_down_read_slowpath+0x10/0x10 [Fri Jun 27 15:21:34 2025] ? __pfx___might_resched+0x10/0x10 [Fri Jun 27 15:21:34 2025] down_read+0xc9/0x230 [Fri Jun 27 15:21:34 2025] ? __pfx_down_read+0x10/0x10 [Fri Jun 27 15:21:34 2025] ? __debugfs_file_get+0x14d/0x700 [Fri Jun 27 15:21:34 2025] ? __pfx___debugfs_file_get+0x10/0x10 [Fri Jun 27 15:21:34 2025] ? handle_pte_fault+0x52a/0x710 [Fri Jun 27 15:21:34 2025] ? selinux_file_permission+0x3a9/0x590 [Fri Jun 27 15:21:34 2025] read_dummy_rwsem_read+0x4a/0x90 [Fri Jun 27 15:21:34 2025] full_proxy_read+0xff/0x1c0 [Fri Jun 27 15:21:34 2025] ? rw_verify_area+0x6d/0x410 [Fri Jun 27 15:21:34 2025] vfs_read+0x177/0xa50 [Fri Jun 27 15:21:34 2025] ? __pfx_vfs_read+0x10/0x10 [Fri Jun 27 15:21:34 2025] ? fdget_pos+0x1cf/0x4c0 [Fri Jun 27 15:21:34 2025] ksys_read+0xfc/0x1d0 [Fri Jun 27 15:21:34 2025] ? __pfx_ksys_read+0x10/0x10 [Fri Jun 27 15:21:34 2025] do_syscall_64+0x66/0x2d0 [Fri Jun 27 15:21:34 2025] entry_SYSCALL_64_after_hwframe+0x76/0x7e [Fri Jun 27 15:21:34 2025] RIP: 0033:0x7f3f8faefb40 [Fri Jun 27 15:21:34 2025] RSP: 002b:00007ffdeda5ab98 EFLAGS: 00000246 ORIG_RAX: 0000000000000000 [Fri Jun 27 15:21:34 2025] RAX: ffffffffffffffda RBX: 0000000000010000 RCX: 00007f3f8faefb40 [Fri Jun 27 15:21:34 2025] RDX: 0000000000010000 RSI: 00000000010fa000 RDI: 0000000000000003 [Fri Jun 27 15:21:34 2025] RBP: 00000000010fa000 R08: 0000000000000000 R09: 0000000000010fff [Fri Jun 27 15:21:34 2025] R10: 00007ffdeda59fe0 R11: 0000000000000246 R12: 00000000010fa000 [Fri Jun 27 15:21:34 2025] R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000fff [Fri Jun 27 15:21:34 2025] </TASK> [Fri Jun 27 15:21:34 2025] INFO: task cat:28631 <reader> blocked on an rw-semaphore likely owned by task cat:28630 <writer> [Fri Jun 27 15:21:34 2025] task:cat state:S stack:0 pid:28630 tgid:28630 ppid:28501 task_flags:0x400000 flags:0x00004000 [Fri Jun 27 15:21:34 2025] Call Trace: [Fri Jun 27 15:21:34 2025] <TASK> [Fri Jun 27 15:21:34 2025] __schedule+0x7c7/0x1930 [Fri Jun 27 15:21:34 2025] ? __pfx___schedule+0x10/0x10 [Fri Jun 27 15:21:34 2025] ? __mod_timer+0x304/0xa80 [Fri Jun 27 15:21:34 2025] schedule+0x6a/0x180 [Fri Jun 27 15:21:34 2025] schedule_timeout+0xfb/0x230 [Fri Jun 27 15:21:34 2025] ? __pfx_schedule_timeout+0x10/0x10 [Fri Jun 27 15:21:34 2025] ? __pfx_process_timeout+0x10/0x10 [Fri Jun 27 15:21:34 2025] ? down_write+0xc4/0x140 [Fri Jun 27 15:21:34 2025] msleep_interruptible+0xbe/0x150 [Fri Jun 27 15:21:34 2025] read_dummy_rwsem_write+0x54/0x90 [Fri Jun 27 15:21:34 2025] full_proxy_read+0xff/0x1c0 [Fri Jun 27 15:21:34 2025] ? rw_verify_area+0x6d/0x410 [Fri Jun 27 15:21:34 2025] vfs_read+0x177/0xa50 [Fri Jun 27 15:21:34 2025] ? __pfx_vfs_read+0x10/0x10 [Fri Jun 27 15:21:34 2025] ? fdget_pos+0x1cf/0x4c0 [Fri Jun 27 15:21:34 2025] ksys_read+0xfc/0x1d0 [Fri Jun 27 15:21:34 2025] ? __pfx_ksys_read+0x10/0x10 [Fri Jun 27 15:21:34 2025] do_syscall_64+0x66/0x2d0 [Fri Jun 27 15:21:34 2025] entry_SYSCALL_64_after_hwframe+0x76/0x7e [Fri Jun 27 15:21:34 2025] RIP: 0033:0x7f8f288efb40 [Fri Jun 27 15:21:34 2025] RSP: 002b:00007ffffb631038 EFLAGS: 00000246 ORIG_RAX: 0000000000000000 [Fri Jun 27 15:21:34 2025] RAX: ffffffffffffffda RBX: 0000000000010000 RCX: 00007f8f288efb40 [Fri Jun 27 15:21:34 2025] RDX: 0000000000010000 RSI: 000000002a4b5000 RDI: 0000000000000003 [Fri Jun 27 15:21:34 2025] RBP: 000000002a4b5000 R08: 0000000000000000 R09: 0000000000010fff [Fri Jun 27 15:21:34 2025] R10: 00007ffffb630460 R11: 0000000000000246 R12: 000000002a4b5000 [Fri Jun 27 15:21:34 2025] R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000fff [Fri Jun 27 15:21:34 2025] </TASK> [1] https://lore.kernel.org/all/174046694331.2194069.15472952050240807469.stgit@mhiramat.tok.corp.google.com/ Link: https://lkml.kernel.org/r/20250627072924.36567-3-lance.yang@linux.dev Signed-off-by: Lance Yang <lance.yang@linux.dev> Suggested-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Cc: Anna Schumaker <anna.schumaker@oracle.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Joel Granados <joel.granados@kernel.org> Cc: John Stultz <jstultz@google.com> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Mingzhe Yang <mingzhe.yang@ly.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Tomasz Figa <tfiga@chromium.org> Cc: Waiman Long <longman@redhat.com> Cc: Will Deacon <will@kernel.org> Cc: Yongliang Gao <leonylgao@tencent.com> Cc: Zi Li <zi.li@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-07-19 19:08:26 -07:00
Lance Yang	ae2da51def	locking/rwsem: make owner helpers globally available Patch series "extend hung task blocker tracking to rwsems". Inspired by mutex blocker tracking[1], and having already extended it to semaphores, let's now add support for reader-writer semaphores (rwsems). The approach is simple: when a task enters TASK_UNINTERRUPTIBLE while waiting for an rwsem, we just call hung_task_set_blocker(). The hung task detector can then query the rwsem's owner to identify the lock holder. Tracking works reliably for writers, as there can only be a single writer holding the lock, and its task struct is stored in the owner field. The main challenge lies with readers. The owner field points to only one of many concurrent readers, so we might lose track of the blocker if that specific reader unlocks, even while others remain. This is not a significant issue, however. In practice, long-lasting lock contention is almost always caused by a writer. Therefore, reliably tracking the writer is the primary goal of this patch series ;) With this change, the hung task detector can now show blocker task's info like below: [Fri Jun 27 15:21:34 2025] INFO: task cat:28631 blocked for more than 122 seconds. [Fri Jun 27 15:21:34 2025] Tainted: G S 6.16.0-rc3 #8 [Fri Jun 27 15:21:34 2025] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [Fri Jun 27 15:21:34 2025] task:cat state:D stack:0 pid:28631 tgid:28631 ppid:28501 task_flags:0x400000 flags:0x00004000 [Fri Jun 27 15:21:34 2025] Call Trace: [Fri Jun 27 15:21:34 2025] <TASK> [Fri Jun 27 15:21:34 2025] __schedule+0x7c7/0x1930 [Fri Jun 27 15:21:34 2025] ? __pfx___schedule+0x10/0x10 [Fri Jun 27 15:21:34 2025] ? policy_nodemask+0x215/0x340 [Fri Jun 27 15:21:34 2025] ? _raw_spin_lock_irq+0x8a/0xe0 [Fri Jun 27 15:21:34 2025] ? __pfx__raw_spin_lock_irq+0x10/0x10 [Fri Jun 27 15:21:34 2025] schedule+0x6a/0x180 [Fri Jun 27 15:21:34 2025] schedule_preempt_disabled+0x15/0x30 [Fri Jun 27 15:21:34 2025] rwsem_down_read_slowpath+0x55e/0xe10 [Fri Jun 27 15:21:34 2025] ? __pfx_rwsem_down_read_slowpath+0x10/0x10 [Fri Jun 27 15:21:34 2025] ? __pfx___might_resched+0x10/0x10 [Fri Jun 27 15:21:34 2025] down_read+0xc9/0x230 [Fri Jun 27 15:21:34 2025] ? __pfx_down_read+0x10/0x10 [Fri Jun 27 15:21:34 2025] ? __debugfs_file_get+0x14d/0x700 [Fri Jun 27 15:21:34 2025] ? __pfx___debugfs_file_get+0x10/0x10 [Fri Jun 27 15:21:34 2025] ? handle_pte_fault+0x52a/0x710 [Fri Jun 27 15:21:34 2025] ? selinux_file_permission+0x3a9/0x590 [Fri Jun 27 15:21:34 2025] read_dummy_rwsem_read+0x4a/0x90 [Fri Jun 27 15:21:34 2025] full_proxy_read+0xff/0x1c0 [Fri Jun 27 15:21:34 2025] ? rw_verify_area+0x6d/0x410 [Fri Jun 27 15:21:34 2025] vfs_read+0x177/0xa50 [Fri Jun 27 15:21:34 2025] ? __pfx_vfs_read+0x10/0x10 [Fri Jun 27 15:21:34 2025] ? fdget_pos+0x1cf/0x4c0 [Fri Jun 27 15:21:34 2025] ksys_read+0xfc/0x1d0 [Fri Jun 27 15:21:34 2025] ? __pfx_ksys_read+0x10/0x10 [Fri Jun 27 15:21:34 2025] do_syscall_64+0x66/0x2d0 [Fri Jun 27 15:21:34 2025] entry_SYSCALL_64_after_hwframe+0x76/0x7e [Fri Jun 27 15:21:34 2025] RIP: 0033:0x7f3f8faefb40 [Fri Jun 27 15:21:34 2025] RSP: 002b:00007ffdeda5ab98 EFLAGS: 00000246 ORIG_RAX: 0000000000000000 [Fri Jun 27 15:21:34 2025] RAX: ffffffffffffffda RBX: 0000000000010000 RCX: 00007f3f8faefb40 [Fri Jun 27 15:21:34 2025] RDX: 0000000000010000 RSI: 00000000010fa000 RDI: 0000000000000003 [Fri Jun 27 15:21:34 2025] RBP: 00000000010fa000 R08: 0000000000000000 R09: 0000000000010fff [Fri Jun 27 15:21:34 2025] R10: 00007ffdeda59fe0 R11: 0000000000000246 R12: 00000000010fa000 [Fri Jun 27 15:21:34 2025] R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000fff [Fri Jun 27 15:21:34 2025] </TASK> [Fri Jun 27 15:21:34 2025] INFO: task cat:28631 <reader> blocked on an rw-semaphore likely owned by task cat:28630 <writer> [Fri Jun 27 15:21:34 2025] task:cat state:S stack:0 pid:28630 tgid:28630 ppid:28501 task_flags:0x400000 flags:0x00004000 [Fri Jun 27 15:21:34 2025] Call Trace: [Fri Jun 27 15:21:34 2025] <TASK> [Fri Jun 27 15:21:34 2025] __schedule+0x7c7/0x1930 [Fri Jun 27 15:21:34 2025] ? __pfx___schedule+0x10/0x10 [Fri Jun 27 15:21:34 2025] ? __mod_timer+0x304/0xa80 [Fri Jun 27 15:21:34 2025] schedule+0x6a/0x180 [Fri Jun 27 15:21:34 2025] schedule_timeout+0xfb/0x230 [Fri Jun 27 15:21:34 2025] ? __pfx_schedule_timeout+0x10/0x10 [Fri Jun 27 15:21:34 2025] ? __pfx_process_timeout+0x10/0x10 [Fri Jun 27 15:21:34 2025] ? down_write+0xc4/0x140 [Fri Jun 27 15:21:34 2025] msleep_interruptible+0xbe/0x150 [Fri Jun 27 15:21:34 2025] read_dummy_rwsem_write+0x54/0x90 [Fri Jun 27 15:21:34 2025] full_proxy_read+0xff/0x1c0 [Fri Jun 27 15:21:34 2025] ? rw_verify_area+0x6d/0x410 [Fri Jun 27 15:21:34 2025] vfs_read+0x177/0xa50 [Fri Jun 27 15:21:34 2025] ? __pfx_vfs_read+0x10/0x10 [Fri Jun 27 15:21:34 2025] ? fdget_pos+0x1cf/0x4c0 [Fri Jun 27 15:21:34 2025] ksys_read+0xfc/0x1d0 [Fri Jun 27 15:21:34 2025] ? __pfx_ksys_read+0x10/0x10 [Fri Jun 27 15:21:34 2025] do_syscall_64+0x66/0x2d0 [Fri Jun 27 15:21:34 2025] entry_SYSCALL_64_after_hwframe+0x76/0x7e [Fri Jun 27 15:21:34 2025] RIP: 0033:0x7f8f288efb40 [Fri Jun 27 15:21:34 2025] RSP: 002b:00007ffffb631038 EFLAGS: 00000246 ORIG_RAX: 0000000000000000 [Fri Jun 27 15:21:34 2025] RAX: ffffffffffffffda RBX: 0000000000010000 RCX: 00007f8f288efb40 [Fri Jun 27 15:21:34 2025] RDX: 0000000000010000 RSI: 000000002a4b5000 RDI: 0000000000000003 [Fri Jun 27 15:21:34 2025] RBP: 000000002a4b5000 R08: 0000000000000000 R09: 0000000000010fff [Fri Jun 27 15:21:34 2025] R10: 00007ffffb630460 R11: 0000000000000246 R12: 000000002a4b5000 [Fri Jun 27 15:21:34 2025] R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000fff [Fri Jun 27 15:21:34 2025] </TASK> This patch (of 3): In preparation for extending blocker tracking to support rwsems, make the rwsem_owner() and is_rwsem_reader_owned() helpers globally available for determining if the blocker is a writer or one of the readers. Additionally, a stale owner pointer in a reader-owned rwsem can lead to false positives in blocker tracking when CONFIG_DETECT_HUNG_TASK_BLOCKER is enabled. To mitigate this, clear the owner field on the reader unlock path, similar to what CONFIG_DEBUG_RWSEMS does. A NULL owner is better than a stale one for diagnostics. Link: https://lkml.kernel.org/r/20250627072924.36567-1-lance.yang@linux.dev Link: https://lkml.kernel.org/r/20250627072924.36567-2-lance.yang@linux.dev Link: https://lore.kernel.org/all/174046694331.2194069.15472952050240807469.stgit@mhiramat.tok.corp.google.com/ [1] Signed-off-by: Lance Yang <lance.yang@linux.dev> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Cc: Anna Schumaker <anna.schumaker@oracle.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Joel Granados <joel.granados@kernel.org> Cc: John Stultz <jstultz@google.com> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Mingzhe Yang <mingzhe.yang@ly.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Tomasz Figa <tfiga@chromium.org> Cc: Waiman Long <longman@redhat.com> Cc: Will Deacon <will@kernel.org> Cc: Yongliang Gao <leonylgao@tencent.com> Cc: Zi Li <zi.li@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-07-19 19:08:25 -07:00
Easwar Hariharan	249e7ced72	coccinelle: misc: secs_to_jiffies: implement context and report modes As requested by Ricardo and Jakub, implement report and context modes for the secs_to_jiffies Coccinelle script. While here, add the option to look for opportunities to use secs_to_jiffies() in headers. Link: https://lkml.kernel.org/r/20250703225145.152288-1-eahariha@linux.microsoft.com Signed-off-by: Easwar Hariharan <eahariha@linux.microsoft.com> Closes: https://lore.kernel.org/all/20250129-secs_to_jiffles-v1-1-35a5e16b9f03@chromium.org/ Closes: https://lore.kernel.org/all/20250221162107.409ae333@kernel.org/ Tested-by: Ricardo Ribalda <ribalda@chromium.org> Cc: Julia Lawall <julia.lawall@inria.fr> Cc: Nicolas Palix <nicolas.palix@imag.fr> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Ricardo Ribalda <ribalda@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-07-19 19:08:25 -07:00
Feng Tang	ee13240cd7	panic: add note that panic_print sysctl interface is deprecated Add a dedicated core parameter 'panic_console_replay' for controlling console replay, and add note that 'panic_print' sysctl interface will be obsoleted by 'panic_sys_info' and 'panic_console_replay'. When it happens, the SYS_INFO_PANIC_CONSOLE_REPLAY can be removed as well. Link: https://lkml.kernel.org/r/20250703021004.42328-6-feng.tang@linux.alibaba.com Signed-off-by: Feng Tang <feng.tang@linux.alibaba.com> Suggested-by: Petr Mladek <pmladek@suse.com> Cc: John Ogness <john.ogness@linutronix.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Lance Yang <lance.yang@linux.dev> Cc: "Paul E . McKenney" <paulmck@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-07-19 19:08:25 -07:00
Feng Tang	9743d12d0c	panic: add 'panic_sys_info=' setup option for kernel cmdline 'panic_sys_info=' sysctl interface is already added for runtime setting. Add counterpart kernel cmdline option for boottime setting. Link: https://lkml.kernel.org/r/20250703021004.42328-5-feng.tang@linux.alibaba.com Signed-off-by: Feng Tang <feng.tang@linux.alibaba.com> Suggested-by: Petr Mladek <pmladek@suse.com> Cc: John Ogness <john.ogness@linutronix.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Lance Yang <lance.yang@linux.dev> Cc: "Paul E . McKenney" <paulmck@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-07-19 19:08:24 -07:00
Feng Tang	d747755917	panic: add 'panic_sys_info' sysctl to take human readable string parameter Bitmap definition for 'panic_print' is hard to remember and decode. Add 'panic_sys_info='sysctl to take human readable string like "tasks,mem,timers,locks,ftrace,..." and translate it into bitmap. The detailed mapping is: SYS_INFO_TASKS "tasks" SYS_INFO_MEM "mem" SYS_INFO_TIMERS "timers" SYS_INFO_LOCKS "locks" SYS_INFO_FTRACE "ftrace" SYS_INFO_ALL_CPU_BT "all_bt" SYS_INFO_BLOCKED_TASKS "blocked_tasks" [nathan@kernel.org: add __maybe_unused to sys_info_avail] Link: https://lkml.kernel.org/r/20250708-fix-clang-sys_info_avail-warning-v1-1-60d239eacd64@kernel.org Link: https://lkml.kernel.org/r/20250703021004.42328-4-feng.tang@linux.alibaba.com Signed-off-by: Feng Tang <feng.tang@linux.alibaba.com> Suggested-by: Petr Mladek <pmladek@suse.com> Cc: John Ogness <john.ogness@linutronix.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Lance Yang <lance.yang@linux.dev> Cc: "Paul E . McKenney" <paulmck@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-07-19 19:08:24 -07:00
Feng Tang	b76e89e50f	panic: generalize panic_print's function to show sys info 'panic_print' was introduced to help debugging kernel panic by dumping different kinds of system information like tasks' call stack, memory, ftrace buffer, etc. Actually this function could also be used to help debugging other cases like task-hung, soft/hard lockup, etc. where user may need the snapshot of system info at that time. Extract system info dump function related code from panic.c to separate file sys_info.[ch], for wider usage by other kernel parts for debugging. Also modify the macro names about singulars/plurals. Link: https://lkml.kernel.org/r/20250703021004.42328-3-feng.tang@linux.alibaba.com Signed-off-by: Feng Tang <feng.tang@linux.alibaba.com> Suggested-by: Petr Mladek <pmladek@suse.com> Cc: John Ogness <john.ogness@linutronix.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Lance Yang <lance.yang@linux.dev> Cc: "Paul E . McKenney" <paulmck@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-07-19 19:08:24 -07:00
Feng Tang	261743b013	panic: clean up code for console replay Patch series "generalize panic_print's dump function to be used by other kernel parts", v3. When working on kernel stability issues, panic, task-hung and software/hardware lockup are frequently met. And to debug them, user may need lots of system information at that time, like task call stacks, lock info, memory info etc. panic case already has panic_print_sys_info() for this purpose, and has a 'panic_print' bitmask to control what kinds of information is needed, which is also helpful to debug other task-hung and lockup cases. So this patchset extracts the function out to a new file 'lib/sys_info.c', and makes it available for other cases which also need to dump system info for debugging. Also as suggested by Petr Mladek, add 'panic_sys_info=' interface to take human readable string like "tasks,mem,locks,timers,ftrace,....", and eventually obsolete the current 'panic_print' bitmap interface. In RFC and V1 version, hung_task and SW/HW watchdog modules are enabled with the new sys_info dump interface. In v2, they are kept out for better review of current change, and will be posted later. Locally these have been used in our bug chasing for stability issues and was proven helpful. Many thanks to Petr Mladek for great suggestions on both the code and architectures! This patch (of 5): Currently the panic_print_sys_info() was called twice with different parameters to handle console replay case, which is kind of confusing. Add panic_console_replay() explicitly and rename 'PANIC_PRINT_ALL_PRINTK_MSG' to 'PANIC_CONSOLE_REPLAY', to make the code straightforward. The related kernel document is also updated. Link: https://lkml.kernel.org/r/20250703021004.42328-1-feng.tang@linux.alibaba.com Link: https://lkml.kernel.org/r/20250703021004.42328-2-feng.tang@linux.alibaba.com Signed-off-by: Feng Tang <feng.tang@linux.alibaba.com> Suggested-by: Petr Mladek <pmladek@suse.com> Reviewed-by: Petr Mladek <pmladek@suse.com> Cc: John Ogness <john.ogness@linutronix.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Lance Yang <lance.yang@linux.dev> Cc: "Paul E . McKenney" <paulmck@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-07-19 19:08:24 -07:00
Jiri Bohac	bf8be1c361	x86: implement crashkernel cma reservation Implement the crashkernel CMA reservation for x86: - enable parsing of the cma suffix by parse_crashkernel() - reserve memory with reserve_crashkernel_cma() - add the CMA-reserved ranges to the e820 map for the crash kernel - exclude the CMA-reserved ranges from vmcore Link: https://lkml.kernel.org/r/aEqp1LD2og4QeBw9@dwarf.suse.cz Signed-off-by: Jiri Bohac <jbohac@suse.cz> Cc: Baoquan He <bhe@redhat.com> Cc: Dave Young <dyoung@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: Donald Dutile <ddutile@redhat.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Philipp Rudo <prudo@redhat.com> Cc: Pingfan Liu <piliu@redhat.com> Cc: Tao Liu <ltao@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-07-19 19:08:23 -07:00
Jiri Bohac	e1280f3071	kdump: wait for DMA to finish when using CMA When re-using the CMA area for kdump there is a risk of pending DMA into pinned user pages in the CMA area. Pages residing in CMA areas can usually not get long-term pinned and are instead migrated away from the CMA area, so long-term pinning is typically not a concern. (BUGs in the kernel might still lead to long-term pinning of such pages if everything goes wrong.) Pages pinned without FOLL_LONGTERM remain in the CMA and may possibly be the source or destination of a pending DMA transfer. Although there is no clear specification how long a page may be pinned without FOLL_LONGTERM, pinning without the flag shows an intent of the caller to only use the memory for short-lived DMA transfers, not a transfer initiated by a device asynchronously at a random time in the future. Add a delay of CMA_DMA_TIMEOUT_SEC seconds before starting the kdump kernel, giving such short-lived DMA transfers time to finish before the CMA memory is re-used by the kdump kernel. Set CMA_DMA_TIMEOUT_SEC to 10 seconds - chosen arbitrarily as both a huge margin for a DMA transfer, yet not increasing the kdump time too significantly. Link: https://lkml.kernel.org/r/aEqpgDIBndZ5LXSo@dwarf.suse.cz Signed-off-by: Jiri Bohac <jbohac@suse.cz> Acked-by: David Hildenbrand <david@redhat.com> Cc: Baoquan He <bhe@redhat.com> Cc: Dave Young <dyoung@redhat.com> Cc: Donald Dutile <ddutile@redhat.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Philipp Rudo <prudo@redhat.com> Cc: Pingfan Liu <piliu@redhat.com> Cc: Tao Liu <ltao@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-07-19 19:08:23 -07:00
Jiri Bohac	ce1bf19a34	kdump, documentation: describe craskernel CMA reservation Describe the new crashkernel ",cma" suffix in Documentation/ Link: https://lkml.kernel.org/r/aEqpQwUy6gqSiUkV@dwarf.suse.cz Signed-off-by: Jiri Bohac <jbohac@suse.cz> Cc: Baoquan He <bhe@redhat.com> Cc: Dave Young <dyoung@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: Donald Dutile <ddutile@redhat.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Philipp Rudo <prudo@redhat.com> Cc: Pingfan Liu <piliu@redhat.com> Cc: Tao Liu <ltao@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-07-19 19:08:23 -07:00
Jiri Bohac	ab475510e0	kdump: implement reserve_crashkernel_cma reserve_crashkernel_cma() reserves CMA ranges for the crash kernel. If allocating the requested size fails, try to reserve in smaller blocks. Store the reserved ranges in the crashk_cma_ranges array and the number of ranges in crashk_cma_cnt. Link: https://lkml.kernel.org/r/aEqpBwOy_ekm0gw9@dwarf.suse.cz Signed-off-by: Jiri Bohac <jbohac@suse.cz> Cc: Baoquan He <bhe@redhat.com> Cc: Dave Young <dyoung@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: Donald Dutile <ddutile@redhat.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Philipp Rudo <prudo@redhat.com> Cc: Pingfan Liu <piliu@redhat.com> Cc: Tao Liu <ltao@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-07-19 19:08:23 -07:00
Jiri Bohac	35c18f2933	Add a new optional ",cma" suffix to the crashkernel= command line option Patch series "kdump: crashkernel reservation from CMA", v5. This series implements a way to reserve additional crash kernel memory using CMA. Currently, all the memory for the crash kernel is not usable by the 1st (production) kernel. It is also unmapped so that it can't be corrupted by the fault that will eventually trigger the crash. This makes sense for the memory actually used by the kexec-loaded crash kernel image and initrd and the data prepared during the load (vmcoreinfo, ...). However, the reserved space needs to be much larger than that to provide enough run-time memory for the crash kernel and the kdump userspace. Estimating the amount of memory to reserve is difficult. Being too careful makes kdump likely to end in OOM, being too generous takes even more memory from the production system. Also, the reservation only allows reserving a single contiguous block (or two with the "low" suffix). I've seen systems where this fails because the physical memory is fragmented. By reserving additional crashkernel memory from CMA, the main crashkernel reservation can be just large enough to fit the kernel and initrd image, minimizing the memory taken away from the production system. Most of the run-time memory for the crash kernel will be memory previously available to userspace in the production system. As this memory is no longer wasted, the reservation can be done with a generous margin, making kdump more reliable. Kernel memory that we need to preserve for dumping is normally not allocated from CMA, unless it is explicitly allocated as movable. Currently this is only the case for memory ballooning and zswap. Such movable memory will be missing from the vmcore. User data is typically not dumped by makedumpfile. When dumping of user data is intended this new CMA reservation cannot be used. There are five patches in this series: The first adds a new ",cma" suffix to the recenly introduced generic crashkernel parsing code. parse_crashkernel() takes one more argument to store the cma reservation size. The second patch implements reserve_crashkernel_cma() which performs the reservation. If the requested size is not available in a single range, multiple smaller ranges will be reserved. The third patch updates Documentation/, explicitly mentioning the potential DMA corruption of the CMA-reserved memory. The fourth patch adds a short delay before booting the kdump kernel, allowing pending DMA transfers to finish. The fifth patch enables the functionality for x86 as a proof of concept. There are just three things every arch needs to do: - call reserve_crashkernel_cma() - include the CMA-reserved ranges in the physical memory map - exclude the CMA-reserved ranges from the memory available through /proc/vmcore by excluding them from the vmcoreinfo PT_LOAD ranges. Adding other architectures is easy and I can do that as soon as this series is merged. With this series applied, specifying crashkernel=100M craskhernel=1G,cma on the command line will make a standard crashkernel reservation of 100M, where kexec will load the kernel and initrd. An additional 1G will be reserved from CMA, still usable by the production system. The crash kernel will have 1.1G memory available. The 100M can be reliably predicted based on the size of the kernel and initrd. The new cma suffix is completely optional. When no crashkernel=size,cma is specified, everything works as before. This patch (of 5): Add a new cma_size parameter to parse_crashkernel(). When not NULL, call __parse_crashkernel to parse the CMA reservation size from "crashkernel=size,cma" and store it in cma_size. Set cma_size to NULL in all calls to parse_crashkernel(). Link: https://lkml.kernel.org/r/aEqnxxfLZMllMC8I@dwarf.suse.cz Link: https://lkml.kernel.org/r/aEqoQckgoTQNULnh@dwarf.suse.cz Signed-off-by: Jiri Bohac <jbohac@suse.cz> Cc: Baoquan He <bhe@redhat.com> Cc: Dave Young <dyoung@redhat.com> Cc: Donald Dutile <ddutile@redhat.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Philipp Rudo <prudo@redhat.com> Cc: Pingfan Liu <piliu@redhat.com> Cc: Tao Liu <ltao@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-07-19 19:08:22 -07:00
Brian Norris	22c2ed6996	checkpatch: check for missing sentinels in ID arrays All of the ID tables based on <linux/mod_devicetable.h> (of_device_id, pci_device_id, ...) require their arrays to end in an empty sentinel value. That's usually spelled with an empty initializer entry (e.g., "{}"), but also sometimes with explicit 0 entries, field initializers (e.g., '.id = ""'), or even a macro entry (like PCMCIA_DEVICE_NULL). Without a sentinel, device-matching code may read out of bounds. I've found a number of such bugs in driver reviews, and we even occasionally commit one to the tree. See commit `5751eee5c6` ("i2c: nomadik: Add missing sentinel to match table") for example. Teach checkpatch to find these ID tables, and complain if it looks like there wasn't a sentinel value. Test output: $ git format-patch -1 `a0d15cc47f` --stdout \| scripts/checkpatch.pl - ERROR: missing sentinel in ID array #57: FILE: drivers/i2c/busses/i2c-nomadik.c:1073: +static const struct of_device_id nmk_i2c_eyeq_match_table[] = { { .compatible = "XXXXXXXXXXXXXXXXXX", .data = (void )(NMK_I2C_EYEQ_FLAG_32B_BUS \| NMK_I2C_EYEQ_FLAG_IS_EYEQ5), }, }; total: 1 errors, 0 warnings, 66 lines checked NOTE: For some of the reported defects, checkpatch may be able to mechanically convert to the typical style using --fix or --fix-inplace. "[PATCH] i2c: nomadik: switch from of_device_is_compatible() to" has style problems, please review. NOTE: If any of the errors are false positives, please report them to the maintainer, see CHECKPATCH in MAINTAINERS. When run across the entire tree (scripts/checkpatch.pl -q --types MISSING_SENTINEL -f ...), false positives exist: where macros are used that hide the table from analysis (e.g., drivers/gpu/drm/radeon/radeon_drv.c / radeon_PCI_IDS). There are fewer than 5 of these. * where such tables are processed correctly via ARRAY_SIZE() (fewer than 5 instances). This is by far not the typical usage of _device_id arrays. some odd parsing artifacts, where ctx_statement_block() seems to quit in the middle of a block due to #if/#else/#endif. Also, not every "struct *_device_id" is in fact a sentinel-requiring structure, but even with such types, false positives are very rare. Link: https://lkml.kernel.org/r/20250702235245.1007351-1-briannorris@chromium.org Signed-off-by: Brian Norris <briannorris@chromium.org> Acked-by: Joe Perches <joe@perches.com> Cc: Andy Whitcroft <apw@canonical.com> Cc: Brian Norris <briannorris@chromium.org> Cc: Dwaipayan Ray <dwaipayanray1@gmail.com> Cc: Lukas Bulwahn <lukas.bulwahn@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-07-09 22:57:57 -07:00
Moon Hee Lee	1f04e0e652	selftests: ptrace: add set_syscall_info to .gitignore Add the set_syscall_info test binary to .gitignore to avoid tracking build artifacts in the ptrace selftests directory. Link: https://lkml.kernel.org/r/20250623183405.133434-2-moonhee.lee.ca@gmail.com Signed-off-by: Moon Hee Lee <moonhee.lee.ca@gmail.com> Cc: "Dmitry V. Levin" <ldv@strace.io> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-07-09 22:57:57 -07:00
Tetsuo Handa	d0118d7d20	ocfs2: update d_splice_alias() return code checking When commit `d3556babd7` ("ocfs2: fix d_splice_alias() return code checking") was merged into v3.18-rc3, d_splice_alias() was returning one of a valid dentry, NULL or an ERR_PTR. When commit `b5ae6b15bd` ("merge d_materialise_unique() into d_splice_alias()") was merged into v3.19-rc1, d_splice_alias() started returning -ELOOP as one of ERR_PTR values. Now, when syzkaller mounts a crafted ocfs2 filesystem image that hits d_splice_alias() == -ELOOP case from ocfs2_lookup(), ocfs2_lookup() fails to handle -ELOOP case and generic_shutdown_super() hits "VFS: Busy inodes after unmount" message. Instead of calling ocfs2_dentry_attach_lock() or ocfs2_dentry_attach_gen() when d_splice_alias() returned an ERR_PTR value, change ocfs2_lookup() to bail out immediately. Also, ocfs2_lookup() needs to call dupt() when ocfs2_dentry_attach_lock() returned an ERR_PTR value. Link: https://lkml.kernel.org/r/da5be67d-2a0b-4b93-85d6-42f3b7440135@I-love.SAKURA.ne.jp Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reported-by: syzbot <syzbot+1134d3a5b062e9665a7a@syzkaller.appspotmail.com> Closes: https://syzkaller.appspot.com/bug?extid=1134d3a5b062e9665a7a Suggested-by: Al Viro <viro@zeniv.linux.org.uk> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mark@fasheh.com> Cc: Richard Weinberger <richard@nod.at> Cc: Tetsuo Handa <penguin-kernel@i-love-sakura.ne.jp> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Jun Piao <piaojun@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-07-09 22:57:57 -07:00
Su Hui	ad0039db42	fs/proc/vmcore: a few cleanups for vmcore_add_device_dump() There are two cleanups for vmcore_add_device_dump(). Return -ENOMEM directly rather than goto the label to simplify the code and use scoped_guard() to simplify the lock/unlock code. Link: https://lkml.kernel.org/r/20250626105440.1053139-1-suhui@nfschina.com Signed-off-by: Su Hui <suhui@nfschina.com> Reviewed-by: Dan Carpenter <dan.carpenter@linaro.org> Cc: Baoquan He <bhe@redhat.com> Cc: Dave Young <dyoung@redhat.com> Cc: Suhui <suhui@nfschina.com> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-07-09 22:57:56 -07:00
Sachin Mokashi	37d0f07bc5	mailmap: update Sachin Mokashi's email address As previous contributions were made with the older email address, which is no longer in use. Update my new address to map the old one. Link: https://lkml.kernel.org/r/20250624141220.1264691-1-sachin.mokashi@intel.com Signed-off-by: Sachin Mokashi <sachin.mokashi@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-07-09 22:57:56 -07:00
Tetsuo Handa	0c954c57f9	ocfs2: embed actual values into ocfs2_sysfile_lock_key names Since lockdep_set_class() uses stringified key name via macro, calling lockdep_set_class() with an array causes lockdep warning messages to report variable name than actual index number. Change ocfs2_init_locked_inode() to pass actual index number for better readability of lockdep reports. This patch does not change behavior. Before: Chain exists of: &ocfs2_sysfile_lock_key[args->fi_sysfile_type] --> jbd2_handle --> &oi->ip_xattr_sem Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&oi->ip_xattr_sem); lock(jbd2_handle); lock(&oi->ip_xattr_sem); lock(&ocfs2_sysfile_lock_key[args->fi_sysfile_type]); * DEADLOCK * After: Chain exists of: &ocfs2_sysfile_lock_key[EXTENT_ALLOC_SYSTEM_INODE] --> jbd2_handle --> &oi->ip_xattr_sem Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&oi->ip_xattr_sem); lock(jbd2_handle); lock(&oi->ip_xattr_sem); lock(&ocfs2_sysfile_lock_key[EXTENT_ALLOC_SYSTEM_INODE]); * DEADLOCK * Link: https://lkml.kernel.org/r/29348724-639c-443d-bbce-65c3a0a13a38@I-love.SAKURA.ne.jp Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Jun Piao <piaojun@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-07-09 22:57:56 -07:00
Yaxin Wang	01bda05819	tools/accounting/delaytop: add delaytop to record top-n task delay Problem ======= The "getdelays" can only display the latency of a single task by specifying a PID, but it has the following limitations: 1. single-task perspective: only supports querying the latency (CPU, I/O, memory, etc.) of an individual task via PID and cannot provide a global analysis of high-latency processes across the system. 2. lack of High-Latency process awareness: when the overall system latency is high (e.g., a spike in CPU latency), there is no way to quickly identify the top N processes contributing to the highest latency. 3. poor interactivity: It lacks dynamic sorting and refresh capabilities (similar to top), making it difficult to monitor latency changes in real time. Solution ======== To address these limitations, we introduce the "delaytop" with the following capabilities: 1. system view: monitors latency metrics (CPU, I/O, memory, IRQ, etc.) for all system processes 2. supports field-based sorting (e.g., default sort by CPU latency in descending order) 3. dynamic interactive interface: focus on specific processes with --pid; limit displayed entries with --processes 20; control monitoring duration with --iterations; Use case ======== bash# ./delaytop Top 20 processes (sorted by CPU delay): PID TGID COMMAND CPU(ms) IO(ms) SWAP(ms) RCL(ms) THR(ms) CMP(ms) WP(ms) IRQ(ms) --------------------------------------------------------------------------------------------- 26 26 kworker/1:0H 5.55 0.00 0.00 0.00 0.00 0.00 0.00 0.00 32 32 kworker/2:0H-kb 2.93 0.00 0.00 0.00 0.00 0.00 0.00 0.00 38 38 kworker/3:0H-ev 2.88 0.00 0.00 0.00 0.00 0.00 0.00 0.00 84 84 kworker/R-vfio- 1.62 0.00 0.00 0.00 0.00 0.00 0.00 0.00 24 24 ksoftirqd/1 1.43 0.00 0.00 0.00 0.00 0.00 0.00 0.00 19 19 idle_inject/0 0.99 0.00 0.00 0.00 0.00 0.00 0.00 0.00 16 16 rcu_exp_par_gp_ 0.87 0.00 0.00 0.00 0.00 0.00 0.00 0.00 11 11 kworker/0:1 0.87 0.00 0.00 0.00 0.00 0.00 0.00 0.00 22 22 idle_inject/1 0.80 0.00 0.00 0.00 0.00 0.00 0.00 0.00 3 3 pool_workqueue_ 0.74 0.00 0.00 0.00 0.00 0.00 0.00 0.00 81 81 scsi_eh_1 0.59 0.00 0.00 0.00 0.00 0.00 0.00 0.00 30 30 ksoftirqd/2 0.42 0.00 0.00 0.00 0.00 0.00 0.00 0.00 36 36 ksoftirqd/3 0.37 0.00 0.00 0.00 0.00 0.00 0.00 0.00 9 9 kworker/0:0-eve 0.36 0.00 0.00 0.00 0.00 0.00 0.00 0.00 8 8 kworker/R-netns 0.34 0.00 0.00 0.00 0.00 0.00 0.00 0.00 76 76 kworker/1:1-pm 0.32 0.00 0.00 0.00 0.00 0.00 0.00 0.00 21 21 cpuhp/1 0.30 0.00 0.00 0.00 0.00 0.00 0.00 0.00 4 4 kworker/R-rcu_g 0.21 0.00 0.00 0.00 0.00 0.00 0.00 0.00 12 12 kworker/u16:0-i 0.20 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1 1 init 0.18 0.00 0.00 0.00 0.00 0.00 0.08 0.00 Link: https://lkml.kernel.org/r/20250619211843633h05gWrBDMFkEH6xAVm_5y@zte.com.cn Co-developed-by: Fan Yu <fan.yu9@zte.com.cn> Signed-off-by: Fan Yu <fan.yu9@zte.com.cn> Signed-off-by: Yaxin Wang <wang.yaxin@zte.com.cn> Cc: Balbir Singh <bsingharora@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: Peilin He <he.peilin@zte.com.cn> Cc: Qiang Tu <tu.qiang35@zte.com.cn> Cc: wangyong <wang.yong12@zte.com.cn> Cc: xu xin <xu.xin16@zte.com.cn> Cc: Yang Yang <yang.yang29@zte.com.cn> Cc: ye xingchen <ye.xingchen@zte.com.cn> Cc: Yunkai Zhang <zhang.yunkai@zte.com.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-07-09 22:57:56 -07:00
Li Chen	896f612273	fs: fat: Prevent fsfuzzer from dominating the console fsfuzzer may make many invalid access for FAT-fs and generate many kmsg like "FAT-fs (loop2): error, invalid access to FAT (entry 0x00000ccb)". For platforms & os that enables hardware serial device whose speed are slow, this may cause softlockup easily. So let's ratelimit the error log. The log as below: [11916.242560] FAT-fs (loop2): error, invalid access to FAT (entry 0x00000ccb) [11916.254485] FAT-fs (loop2): error, invalid access to FAT (entry 0x00000ccb) [11916.266388] FAT-fs (loop2): error, invalid access to FAT (entry 0x00000ccb) [11916.278287] FAT-fs (loop2): error, invalid access to FAT (entry 0x00000ccb) [11916.290180] FAT-fs (loop2): error, invalid access to FAT (entry 0x00000ccb) [11916.302068] FAT-fs (loop2): error, invalid access to FAT (entry 0x00000ccb) [11916.313962] FAT-fs (loop2): error, invalid access to FAT (entry 0x00000ccb) [11916.325848] FAT-fs (loop2): error, invalid access to FAT (entry 0x00000ccb) [11916.337732] FAT-fs (loop2): error, invalid access to FAT (entry 0x00000ccb) [11916.349619] FAT-fs (loop2): error, invalid access to FAT (entry 0x00000ccb) [11916.361505] FAT-fs (loop2): error, invalid access to FAT (entry 0x00000ccb) [11916.373391] FAT-fs (loop2): error, invalid access to FAT (entry 0x00000ccb) [11916.385272] FAT-fs (loop2): error, invalid access to FAT (entry 0x00000ccb) [11916.397144] FAT-fs (loop2): error, invalid access to FAT (entry 0x00000ccb) [11916.409025] FAT-fs (loop2): error, invalid access to FAT (entry 0x00000ccb) [11916.420909] FAT-fs (loop2): error, invalid access to FAT (entry 0x00000ccb) [11916.432791] FAT-fs (loop2): error, invalid access to FAT (entry 0x00000ccb) [11916.444674] FAT-fs (loop2): error, invalid access to FAT (entry 0x00000ccb) [11916.456558] FAT-fs (loop2): error, invalid access to FAT (entry 0x00000ccb) [11916.468446] FAT-fs (loop2): error, invalid access to FAT (entry 0x00000ccb) [11916.480352] watchdog: BUG: soft lockup - CPU#58 stuck for 26s! [cat:2446035] [11916.480357] Modules linked in: ... [11916.480503] CPU: 58 PID: 2446035 Comm: cat Kdump: loaded Tainted: ... [11916.480508] Hardware name: vclusters VSFT5000 B/VSFT5000 B, BIOS ... [11916.480510] pstate: 60400009 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) [11916.480513] pc : console_emit_next_record+0x1b4/0x288 [11916.480524] lr : console_emit_next_record+0x1ac/0x288 [11916.480525] sp : ffff80009bcdae90 [11916.480527] x29: ffff80009bcdaec0 x28: ffff800082513810 x27: 0000000000000001 [11916.480530] x26: 0000000000000001 x25: ffff800081f66000 x24: 0000000000000000 [11916.480533] x23: 0000000000000000 x22: ffff80009bcdaf8f x21: 0000000000000001 [11916.480535] x20: 0000000000000000 x19: ffff800082513810 x18: ffffffffffffffff [11916.480538] x17: 0000000000000002 x16: 0000000000000001 x15: ffff80009bcdab30 [11916.480541] x14: 0000000000000000 x13: 205d353330363434 x12: 32545b5d36343438 [11916.480543] x11: 652820544146206f x10: 7420737365636361 x9 : ffff800080159a6c [11916.480546] x8 : 69202c726f727265 x7 : 545b5d3634343836 x6 : 342e36313931315b [11916.480549] x5 : ffff800082513a01 x4 : ffff80009bcdad31 x3 : 0000000000000000 [11916.480551] x2 : 00000000ffffffff x1 : 0000000001b9b000 x0 : ffff8000836cef00 [11916.480554] Call trace: [11916.480557] console_emit_next_record+0x1b4/0x288 [11916.480560] console_flush_all+0xcc/0x190 [11916.480563] console_unlock+0x78/0x138 [11916.480565] vprintk_emit+0x1c4/0x210 [11916.480568] vprintk_default+0x40/0x58 [11916.480570] vprintk+0x84/0xc8 [11916.480572] _printk+0x68/0xa0 [11916.480578] _fat_msg+0x6c/0xa0 [fat] [11916.480593] __fat_fs_error+0xf8/0x118 [fat] [11916.480601] fat_ent_read+0x164/0x238 [fat] [11916.480609] fat_get_cluster+0x180/0x2c8 [fat] [11916.480617] fat_get_mapped_cluster+0xb8/0x170 [fat] Link: https://lkml.kernel.org/r/20250620020231.9292-1-me@linux.beauty Signed-off-by: Li Chen <chenl311@chinatelecom.cn> Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Cc: Christian Brauner <brauner@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-07-09 22:57:55 -07:00
Jiazi Li	fed307b67c	kthread: update comment for __to_kthread With commit `343f4c49f2` ("kthread: Don't allocate kthread_struct for init and umh") and commit `753550eb0c` ("fork: Explicitly set PF_KTHREAD"), umh task no longer have struct kthread and PF_KTHREAD flag. Update the comment to describe what the current rules are to detect is something is a kthread. Link: https://lkml.kernel.org/r/20250620100801.23185-1-jqqlijiazi@gmail.com Signed-off-by: Jiazi Li <jqqlijiazi@gmail.com> Signed-off-by: mingzhu.wang <mingzhu.wang@transsion.com> Suggested-by Eric W . Biederman <ebiederm@xmission.com> Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-07-09 22:57:55 -07:00
Arnd Bergmann	caf728dfa7	lib: test_objagg: split test_hints_case() into two functions With sanitizers enabled, this function uses a lot of stack, causing a harmless warning: lib/test_objagg.c: In function 'test_hints_case.constprop': lib/test_objagg.c:994:1: error: the frame size of 1440 bytes is larger than 1408 bytes [-Werror=frame-larger-than=] Most of this is from the two 'struct world' structures. Since most of the work in this function is duplicated for the two, split it up into separate functions that each use one of them. The combined stack usage is still the same here, but there is no warning any more, and the code is still safe because of the known call chain. Link: https://lkml.kernel.org/r/20250620111907.3395296-1-arnd@kernel.org Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: Jiri Pirko <jiri@resnulli.us> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-07-09 22:57:55 -07:00
Andrew Morton	4d71d99f36	MAINTAINERS: add lib/raid6/ to "SOFTWARE RAID" Cc: Song Liu <song@kernel.org> Cc: Yu Kuai <yukuai3@huawei.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-07-09 22:57:55 -07:00
Herbert Xu	1857fcc847	lib/raid6: replace custom zero page with ZERO_PAGE Use the system-wide zero page instead of a custom zero page. [herbert@gondor.apana.org.au: update lib/raid6/recov_rvv.c, per Klara] Link: https://lkml.kernel.org/r/aFkUnXWtxcgOTVkw@gondor.apana.org.au Link: https://lkml.kernel.org/r/Z9flJNkWQICx0PXk@gondor.apana.org.au Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Cc: Song Liu <song@kernel.org> Cc: Yu Kuai <yukuai3@huawei.com> Cc: Klara Modin <klarasmodin@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-07-09 22:57:54 -07:00
Johannes Berg	41a7f73768	scripts: gdb: move MNT_* constants to gdb-parsed Since these are now no longer defines, but in an enum. Link: https://lkml.kernel.org/r/20250618134629.25700-2-johannes@sipsolutions.net Fixes: `101f2bbab5` ("fs: convert mount flags to enum") Reviewed-by: Benjamin Berg <benjamin.berg@intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com> Cc: Jan Kiszka <jan.kiszka@siemens.com> Cc: Kieran Bingham <kbingham@kernel.org> Cc: Stephen Brennan <stephen.s.brennan@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-07-09 22:57:54 -07:00
Pasha Tatashin	64960497ea	fork: clean up ifdef logic around stack allocation There is an unneeded OR in the ifdef functions that are used to allocate and free kernel stacks based on direct map or vmap. Adding dynamic stack support would complicate this logic even further. Therefore, clean up by changing the order so OR is no longer needed. Link: https://lkml.kernel.org/r/20250618-fork-fixes-v4-1-2e05a2e1f5fc@linaro.org Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Link: https://lore.kernel.org/20240311164638.2015063-3-pasha.tatashin@soleen.com Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Cc: Mateusz Guzik <mjguzik@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-07-09 22:57:54 -07:00
Long Li	816a880032	ocfs2: remove redundant NULL check in rename path The code checks newfe_bh for NULL after it has already been dereferenced to access b_data. This NULL check is unnecessary for two reasons: 1. If ocfs2_inode_lock() succeeds (returns >= 0), newfe_bh is guaranteed to be valid. 2. We've already dereferenced newfe_bh to access b_data, so it must be non-NULL at this point. Remove the redundant NULL check in the trace_ocfs2_rename_over_existing() call to improve code clarity. Link: https://lkml.kernel.org/r/20250617012534.3458669-1-leo.lilong@huawei.com Signed-off-by: Long Li <leo.lilong@huawei.com> Reviewed-by: Su Yue <glass.su@suse.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Jun Piao <piaojun@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-07-09 22:57:54 -07:00
Lizhi Xu	2ae8267999	ocfs2: reset folio to NULL when get folio fails The reproducer uses FAULT_INJECTION to make memory allocation fail, which causes __filemap_get_folio() to fail, when initializing w_folios[i] in ocfs2_grab_folios_for_write(), it only returns an error code and the value of w_folios[i] is the error code, which causes ocfs2_unlock_and_free_folios() to recycle the invalid w_folios[i] when releasing folios. Link: https://lkml.kernel.org/r/20250616013140.3602219-1-lizhi.xu@windriver.com Reported-by: syzbot+c2ea94ae47cd7e3881ec@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=c2ea94ae47cd7e3881ec Signed-off-by: Lizhi Xu <lizhi.xu@windriver.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Jun Piao <piaojun@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-07-09 22:57:53 -07:00

1 2 3 4 5 ...

1368670 Commits