dquot_scan_active() can race with quota deactivation in
quota_release_workfn() like:
CPU0 (quota_release_workfn) CPU1 (dquot_scan_active)
============================== ==============================
spin_lock(&dq_list_lock);
list_replace_init(
&releasing_dquots, &rls_head);
/* dquot X on rls_head,
dq_count == 0,
DQ_ACTIVE_B still set */
spin_unlock(&dq_list_lock);
synchronize_srcu(&dquot_srcu);
spin_lock(&dq_list_lock);
list_for_each_entry(dquot,
&inuse_list, dq_inuse) {
/* finds dquot X */
dquot_active(X) -> true
atomic_inc(&X->dq_count);
}
spin_unlock(&dq_list_lock);
spin_lock(&dq_list_lock);
dquot = list_first_entry(&rls_head);
WARN_ON_ONCE(atomic_read(&dquot->dq_count));
The problem is not only a cosmetic one as under memory pressure the
caller of dquot_scan_active() can end up working on freed dquot.
Fix the problem by making sure the dquot is removed from releasing list
when we acquire a reference to it.
Fixes: 869b6ea160 ("quota: Fix slow quotaoff")
Reported-by: Sam Sun <samsun1006219@gmail.com>
Link: https://lore.kernel.org/all/CAEkJfYPTt3uP1vAYnQ5V2ZWn5O9PLhhGi5HbOcAzyP9vbXyjeg@mail.gmail.com
Signed-off-by: Jan Kara <jack@suse.cz>
Mounting a crafted UDF image with repeated partition descriptors can
trigger a heap out-of-bounds write in part_descs_loc[].
handle_partition_descriptor() deduplicates entries by partition number,
but appended slots never record partnum. As a result duplicate
Partition Descriptors are appended repeatedly and num_part_descs keeps
growing.
Once the table is full, the growth path still sizes the allocation from
partnum even though inserts are indexed by num_part_descs. If partnum is
already aligned to PART_DESC_ALLOC_STEP, ALIGN(partnum, step) can keep
the old capacity and the next append writes past the end of the table.
Store partnum in the appended slot and size growth from the next append
count so deduplication and capacity tracking follow the same model.
Fixes: ee4af50ca9 ("udf: Fix mounting of Win7 created UDF filesystems")
Cc: stable@vger.kernel.org
Signed-off-by: Seohyeon Maeng <bioloidgp@gmail.com>
Link: https://patch.msgid.link/20260310081652.21220-1-bioloidgp@gmail.com
Signed-off-by: Jan Kara <jack@suse.cz>
ext2_unlink() calls inode_dec_link_count() unconditionally, which
invokes drop_nlink(). If the inode was loaded from a corrupted disk
image with i_links_count == 0, drop_nlink()
triggers WARN_ON(inode->i_nlink == 0)
Follow the ext4 pattern from __ext4_unlink(): check i_nlink before
decrementing. If already zero, skip the decrement.
Signed-off-by: Ziyi Guo <n7l8m4@u.northwestern.edu>
Link: https://patch.msgid.link/20260211022052.973114-1-n7l8m4@u.northwestern.edu
Signed-off-by: Jan Kara <jack@suse.cz>
The function __rsv_window_dump() is a heavyweight debug tool that walks
the reservation red-black tree. It is currently guarded by #if 1,
forcing it to be compiled into all kernels, even production ones.
Match the rest of the file by guarding it with #ifdef EXT2FS_DEBUG,
so it is only included when explicit debugging is enabled.
This removes the unused function code from standard builds.
Signed-off-by: Milos Nikic <nikic.milos@gmail.com>
Link: https://patch.msgid.link/20260207022920.258247-1-nikic.milos@gmail.com
Signed-off-by: Jan Kara <jack@suse.cz>
If ext2_get_blocks() is called with maxblocks == 0, it currently triggers
a BUG_ON(), causing a kernel panic.
While this condition implies a logic error in the caller, a filesystem
should not crash the system due to invalid arguments.
Replace the BUG_ON() with a WARN_ON_ONCE() to provide a stack trace for
debugging, and return -EINVAL to handle the error gracefully.
Signed-off-by: Milos Nikic <nikic.milos@gmail.com>
Link: https://patch.msgid.link/20260207010617.216675-1-nikic.milos@gmail.com
Signed-off-by: Jan Kara <jack@suse.cz>
Pull erofs fixes from Gao Xiang:
- Do not share the page cache if the real @aops differs
- Fix the incomplete condition for interlaced plain extents
- Get rid of more unnecessary #ifdefs
* tag 'erofs-for-7.0-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
erofs: fix interlaced plain identification for encoded extents
erofs: remove more unnecessary #ifdefs
erofs: allow sharing page cache with the same aops only
Pull ata fixes from Niklas Cassel:
- The newly introduced feature that issues a deferred (non-NCQ) command
from a workqueue, forgot to consider the case where the deferred QC
times out. Fix the code to take timeouts into consideration, which
avoids a use after free (Damien)
- The newly introduced feature that issues a deferred (non-NCQ) command
from a workqueue, when unloading the module, calls cancel_work_sync(),
a function that can sleep, while holding a spin lock. Move the function
call outside the lock (Damien)
* tag 'ata-7.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/libata/linux:
ata: libata-core: fix cancellation of a port deferred qc work
ata: libata-eh: correctly handle deferred qc timeouts
Pull vfs fixes from Christian Brauner:
- Fix an uninitialized variable in file_getattr().
The flags_valid field wasn't initialized before calling
vfs_fileattr_get(), triggering KMSAN uninit-value reports in fuse
- Fix writeback wakeup and logging timeouts when DETECT_HUNG_TASK is
not enabled.
sysctl_hung_task_timeout_secs is 0 in that case causing spurious
"waiting for writeback completion for more than 1 seconds" warnings
- Fix a null-ptr-deref in do_statmount() when the mount is internal
- Add missing kernel-doc description for the @private parameter in
iomap_readahead()
- Fix mount namespace creation to hold namespace_sem across the mount
copy in create_new_namespace().
The previous drop-and-reacquire pattern was fragile and failed to
clean up mount propagation links if the real rootfs was a shared or
dependent mount
- Fix /proc mount iteration where m->index wasn't updated when
m->show() overflows, causing a restart to repeatedly show the same
mount entry in a rapidly expanding mount table
- Return EFSCORRUPTED instead of ENOSPC in minix_new_inode() when the
inode number is out of range
- Fix unshare(2) when CLONE_NEWNS is set and current->fs isn't shared.
copy_mnt_ns() received the live fs_struct so if a subsequent
namespace creation failed the rollback would leave pwd and root
pointing to detached mounts. Always allocate a new fs_struct when
CLONE_NEWNS is requested
- fserror bug fixes:
- Remove the unused fsnotify_sb_error() helper now that all callers
have been converted to fserror_report_metadata
- Fix a lockdep splat in fserror_report() where igrab() takes
inode::i_lock which can be held in IRQ context.
Replace igrab() with a direct i_count bump since filesystems
should not report inodes that are about to be freed or not yet
exposed
- Handle error pointer in procfs for try_lookup_noperm()
- Fix an integer overflow in ep_loop_check_proc() where recursive calls
returning INT_MAX would overflow when +1 is added, breaking the
recursion depth check
- Fix a misleading break in pidfs
* tag 'vfs-7.0-rc2.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
pidfs: avoid misleading break
eventpoll: Fix integer overflow in ep_loop_check_proc()
proc: Fix pointer error dereference
fserror: fix lockdep complaint when igrabbing inode
fsnotify: drop unused helper
unshare: fix unshare_fs() handling
minix: Correct errno in minix_new_inode
namespace: fix proc mount iteration
mount: hold namespace_sem across copy in create_new_namespace()
iomap: Describe @private in iomap_readahead()
statmount: Fix the null-ptr-deref in do_statmount()
writeback: Fix wakeup and logging timeouts for !DETECT_HUNG_TASK
fs: init flags_valid before calling vfs_fileattr_get
Only plain data whose start position and on-disk physical length are
both aligned to the block size should be classified as interlaced
plain extents. Otherwise, it must be treated as shifted plain extents.
This issue was found by syzbot using a crafted compressed image
containing plain extents with unaligned physical lengths, which can
cause OOB read in z_erofs_transform_plain().
Reported-and-tested-by: syzbot+d988dc155e740d76a331@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/r/699d5714.050a0220.cdd3c.03e7.GAE@google.com
Fixes: 1d191b4ca5 ("erofs: implement encoded extent metadata")
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
dvb_dvr_open() calls dvb_ringbuffer_init() when a new reader opens the
DVR device. dvb_ringbuffer_init() calls init_waitqueue_head(), which
reinitializes the waitqueue list head to empty.
Since dmxdev->dvr_buffer.queue is a shared waitqueue (all opens of the
same DVR device share it), this orphans any existing waitqueue entries
from io_uring poll or epoll, leaving them with stale prev/next pointers
while the list head is reset to {self, self}.
The waitqueue and spinlock in dvr_buffer are already properly
initialized once in dvb_dmxdev_init(). The open path only needs to
reset the buffer data pointer, size, and read/write positions.
Replace the dvb_ringbuffer_init() call in dvb_dvr_open() with direct
assignment of data/size and a call to dvb_ringbuffer_reset(), which
properly resets pread, pwrite, and error with correct memory ordering
without touching the waitqueue or spinlock.
Cc: stable@vger.kernel.org
Fixes: 34731df288 ("V4L/DVB (3501): Dmxdev: use dvb_ringbuffer")
Reported-by: syzbot+ab12f0c08dd7ab8d057c@syzkaller.appspotmail.com
Tested-by: syzbot+ab12f0c08dd7ab8d057c@syzkaller.appspotmail.com
Link: https://lore.kernel.org/all/698a26d3.050a0220.3b3015.007d.GAE@google.com/
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
cancel_work_sync() is a sleeping function so it cannot be called with
the spin lock of a port being held. Move the call to this function in
ata_port_detach() after EH completes, with the port lock released,
together with other work cancellation calls.
Fixes: 0ea84089db ("ata: libata-scsi: avoid Non-NCQ command starvation")
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Igor Pylypiv <ipylypiv@google.com>
A deferred qc may timeout while waiting for the device queue to drain
to be submitted. In such case, since the qc is not active,
ata_scsi_cmd_error_handler() ends up calling scsi_eh_finish_cmd(),
which frees the qc. But as the port deferred_qc field still references
this finished/freed qc, the deferred qc work may eventually attempt to
call ata_qc_issue() against this invalid qc, leading to errors such as
reported by UBSAN (syzbot run):
UBSAN: shift-out-of-bounds in drivers/ata/libata-core.c:5166:24
shift exponent 4210818301 is too large for 64-bit type 'long long unsigned int'
...
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:94 [inline]
dump_stack_lvl+0x100/0x190 lib/dump_stack.c:120
ubsan_epilogue+0xa/0x30 lib/ubsan.c:233
__ubsan_handle_shift_out_of_bounds+0x279/0x2a0 lib/ubsan.c:494
ata_qc_issue.cold+0x38/0x9f drivers/ata/libata-core.c:5166
ata_scsi_deferred_qc_work+0x154/0x1f0 drivers/ata/libata-scsi.c:1679
process_one_work+0x9d7/0x1920 kernel/workqueue.c:3275
process_scheduled_works kernel/workqueue.c:3358 [inline]
worker_thread+0x5da/0xe40 kernel/workqueue.c:3439
kthread+0x370/0x450 kernel/kthread.c:467
ret_from_fork+0x754/0xd80 arch/x86/kernel/process.c:158
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
</TASK>
Fix this by checking if the qc of a timed out SCSI command is a deferred
one, and in such case, clear the port deferred_qc field and finish the
SCSI command with DID_TIME_OUT.
Reported-by: syzbot+1f77b8ca15336fff21ff@syzkaller.appspotmail.com
Fixes: 0ea84089db ("ata: libata-scsi: avoid Non-NCQ command starvation")
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Igor Pylypiv <ipylypiv@google.com>
This config option goes way back - it used to be an internal debug
option to random.c (at that point called DEBUG_RANDOM_BOOT), then was
renamed and exposed as a config option as CONFIG_WARN_UNSEEDED_RANDOM,
and then further renamed to the current CONFIG_WARN_ALL_UNSEEDED_RANDOM.
It was all done with the best of intentions: the more limited
rate-limited reports were reporting some cases, but if you wanted to see
all the gory details, you'd enable this "ALL" option.
However, it turns out - perhaps not surprisingly - that when people
don't care about and fix the first rate-limited cases, they most
certainly don't care about any others either, and so warning about all
of them isn't actually helping anything.
And the non-ratelimited reporting causes problems, where well-meaning
people enable debug options, but the excessive flood of messages that
nobody cares about will hide actual real information when things go
wrong.
I just got a kernel bug report (which had nothing to do with randomness)
where two thirds of the the truncated dmesg was just variations of
random: get_random_u32 called from __get_random_u32_below+0x10/0x70 with crng_init=0
and in the process early boot messages had been lost (in addition to
making the messages that _hadn't_ been lost harder to read).
The proper way to find these things for the hypothetical developer that
cares - if such a person exists - is almost certainly with boot time
tracing. That gives you the option to get call graphs etc too, which is
likely a requirement for fixing any problems anyway.
See Documentation/trace/boottime-trace.rst for that option.
And if we for some reason do want to re-introduce actual printing of
these things, it will need to have some uniqueness filtering rather than
this "just print it all" model.
Fixes: cc1e127bfa ("random: remove ratelimiting for in-kernel unseeded randomness")
Acked-by: Jason Donenfeld <Jason@zx2c4.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The default_gfp() helper that I added is not wrong, but it turns out
that it causes unnecessary headaches for 'sparse' which doesn't support
the use of __VA_OPT__ (introduced in C++20 and C23, and supported by gcc
and clang for a long time).
We do already use __VA_OPT__ in some other cases in the kernel (drm/xe
and btrfs), but it has been fairly limited. Now it triggers for pretty
much everything, and sparse ends up not working at all.
We can use the traditional gcc ',##__VA_ARGS__' syntax instead: it may
not be the "C standard" way and is slightly less natural in this
context, but it is the traditional model for this and avoids the sparse
problem.
Reported-and-tested-by: Ricardo Ribalda <ribalda@chromium.org>
Reported-and-tested-by: Richard Fitzgerald <rf@opensource.cirrus.com>
Reported-by: Ben Dooks <ben.dooks@codethink.co.uk>
Fixes: e19e1b480a ("add default_gfp() helper macro and use it in the new *alloc_obj() helpers")
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Inode with identical data but different @aops cannot be mixed
because the page cache is managed by different subsystems (e.g.,
@aops for compressed on-disk inodes cannot handle plain on-disk
inodes).
In this patch, we never allow inodes to share the page cache
among plain, compressed, and fileio cases. When a shared inode
is created, we initialize @aops that is the same as the initial
real inode, and subsequent inodes cannot share the page cache
if the inferred @aops differ from the corresponding shared inode.
This is reasonable as a first step because, in typical use cases,
if an inode is compressible, it will fall into compressed
inodes across different filesystem images unless users use plain
filesystems. However, in that cases, users will use plain
filesystems all the time.
Fixes: 5ef3208e3b ("erofs: introduce the page cache share feature")
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Pull fsverity fixes from Eric Biggers:
- Fix a build error on parisc
- Remove the non-large-folio-aware function fsverity_verify_page()
* tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fsverity/linux:
fsverity: fix build error by adding fsverity_readahead() stub
fsverity: remove fsverity_verify_page()
f2fs: make f2fs_verify_cluster() partially large-folio-aware
f2fs: remove unnecessary ClearPageUptodate in f2fs_verify_cluster()
Pull crypto library fix from Eric Biggers:
"Fix a big endian specific issue in the PPC64-optimized AES code"
* tag 'libcrypto-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux:
lib/crypto: powerpc/aes: Fix rndkey_from_vsx() on big endian CPUs
Stephen retired and stepped back from -next maintainership, update his
entry in CREDITS to recognise his 18 years of hard work making it what
it is today and all the impact it's had on our development process.
Also update to his current GnuPG key while we're here.
Acked-by: Stephen Rothwell <sfr@canb.auug.org.au>
Acked-by: SeongJae Park <sj@kernel.org>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Mark Brown <broonie@kernel.org>
Acked-by: Krzysztof Kozlowski <krzk@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The x509 public key code gained a dependency on the sha256 hash
implementation, causing a rare link time failure in randconfig
builds:
arm-linux-gnueabi-ld: crypto/asymmetric_keys/x509_public_key.o: in function `x509_get_sig_params':
x509_public_key.c:(.text.x509_get_sig_params+0x12): undefined reference to `sha256'
arm-linux-gnueabi-ld: (sha256): Unknown destination type (ARM/Thumb) in crypto/asymmetric_keys/x509_public_key.o
x509_public_key.c:(.text.x509_get_sig_params+0x12): dangerous relocation: unsupported relocation
Select the necessary library code from Kconfig.
Fixes: 2c62068ac8 ("x509: Separately calculate sha256 for blacklist")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Align to the commit bf4afc53b7 ("Convert 'alloc_obj' family to use the
new default GFP_KERNEL argument") update the 'kmalloc_obj' declaration
for userspace to fix below compile error:
In file included from arch/arm/boot/compressed/../../../../lib/decompress_unxz.c:241,
from arch/arm/boot/compressed/decompress.c:56:
arch/arm/boot/compressed/../../../../lib/xz/xz_dec_stream.c: In function 'xz_dec_init':
arch/arm/boot/compressed/../../../../lib/xz/xz_dec_stream.c:787:28: error: implicit declaration of function 'kmalloc_obj'; did you mean 'kmalloc'? [-Wimplicit-function-declaration]
787 | struct xz_dec *s = kmalloc_obj(*s);
| ^~~~~~~~~~~
| kmalloc
Signed-off-by: Haiyue Wang <haiyuewa@163.com>
Fixes: 69050f8d6d ("treewide: Replace kmalloc with kmalloc_obj for non-scalar types")
Fixes: bf4afc53b7 ("Convert 'alloc_obj' family to use the new default GFP_KERNEL argument")
Reviewed-by: Kees Cook <kees@kernel.org>
Acked-by: Lasse Collin <lasse.collin@tukaani.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull RTC updates from Alexandre Belloni:
- loongson: Loongson-2K0300 support
- s35390a: nvmem support
- zynqmp: rework calibration
* tag 'rtc-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux:
rtc: ds1390: fix number of bytes read from RTC
rtc: class: Remove duplicate check for alarm
rtc: optee: simplify OP-TEE context match
rtc: interface: Alarm race handling should not discard preceding error
rtc: s35390a: implement nvmem support
rtc: loongson: Add Loongson-2K0300 support
dt-bindings: rtc: loongson: Document Loongson-2K0300 compatible
dt-bindings: rtc: loongson: Correct Loongson-1C interrupts property
dt-bindings: rtc: renesas,rz-rtca3: Add RZ/V2N support
dt-bindings: rtc: cpcap: convert to schema
rtc: zynqmp: use dynamic max and min offset ranges
rtc: zynqmp: rework set_offset
rtc: zynqmp: rework read_offset
rtc: zynqmp: check calibration max value
rtc: zynqmp: correct frequency value
rtc: amlogic-a4: Remove IRQF_ONESHOT
rtc: pcf8563: use correct of_node for output clock
rtc: max31335: use correct CONFIG symbol in IS_REACHABLE()
rtc: nvvrs: Add ARCH_TEGRA to the NV VRS RTC driver
Pull rust fixes from Miguel Ojeda:
"Toolchain and infrastructure:
- Pass '-Zunstable-options' flag required by the future Rust 1.95.0
- Fix 'objtool' warning for Rust 1.84.0
'kernel' crate:
- 'irq' module: add missing bound detected by the future Rust 1.95.0
- 'list' module: add missing 'unsafe' blocks and placeholder safety
comments to macros (an issue for future callers within the crate)
'pin-init' crate:
- Clean Clippy warning that changed behavior in the future Rust
1.95.0"
* tag 'rust-fixes-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/ojeda/linux:
rust: list: Add unsafe blocks for container_of and safety comments
rust: pin-init: replace clippy `expect` with `allow`
rust: irq: add `'static` bounds to irq callbacks
objtool/rust: add one more `noreturn` Rust function
rust: kbuild: pass `-Zunstable-options` for Rust 1.95.0
Pull runtime verifier fix from Steven Rostedt:
- Fix multiple definition of __pcpu_unique_da_mon_this
After refactoring monitors, we used static per-cpu variables with the
same names across different per-cpu monitors. This is explicitly
disallowed for modules on some architectures (alpha) or if
CONFIG_DEBUG_FORCE_WEAK_PER_CPU is enabled (e.g. Fedora's debug
kernel). Make sure all those variables have different names to avoid
compilation issues.
* tag 'trace-rv-7.0-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
rv: Fix multiple definition of __pcpu_unique_da_mon_this
This converts some of the visually simpler cases that have been split
over multiple lines. I only did the ones that are easy to verify the
resulting diff by having just that final GFP_KERNEL argument on the next
line.
Somebody should probably do a proper coccinelle script for this, but for
me the trivial script actually resulted in an assertion failure in the
middle of the script. I probably had made it a bit _too_ trivial.
So after fighting that far a while I decided to just do some of the
syntactically simpler cases with variations of the previous 'sed'
scripts.
The more syntactically complex multi-line cases would mostly really want
whitespace cleanup anyway.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is the exact same thing as the 'alloc_obj()' version, only much
smaller because there are a lot fewer users of the *alloc_flex()
interface.
As with alloc_obj() version, this was done entirely with mindless brute
force, using the same script, except using 'flex' in the pattern rather
than 'objs*'.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This was done entirely with mindless brute force, using
git grep -l '\<k[vmz]*alloc_objs*(.*, GFP_KERNEL)' |
xargs sed -i 's/\(alloc_objs*(.*\), GFP_KERNEL)/\1)/'
to convert the new alloc_obj() users that had a simple GFP_KERNEL
argument to just drop that argument.
Note that due to the extreme simplicity of the scripting, any slightly
more complex cases spread over multiple lines would not be triggered:
they definitely exist, but this covers the vast bulk of the cases, and
the resulting diff is also then easier to check automatically.
For the same reason the 'flex' versions will be done as a separate
conversion.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Most simple allocations use GFP_KERNEL, and with the new allocation
helpers being introduced, let's just take advantage of that to simplify
that default case.
It's a numbers game:
git grep 'alloc_obj(' |
sed 's/.*\(GFP_[_A-Z]*\).*/\1/' |
sort | uniq -c | sort -n | tail
shows that about 90% of all those new allocator instances just use that
standard GFP_KERNEL.
Those helpers are already macros, and we can easily just make it be the
default case when the gfp argument is missing.
And yes, we could do that for all the legacy interfaces too, but let's
keep it to just the new ones at least for now, since those all got
converted recently anyway, so this is not any "extra" noise outside of
that limited conversion.
And, in fact, I want to do this before doing the -rc1 release, exactly
so that we don't get extra merge conflicts.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 69050f8d6d ("treewide: Replace kmalloc with kmalloc_obj for
non-scalar types") started using the new allocation helpers, and in the
process showed that they were completely non-working.
The overflow logic in overflows_flex_counter_type() is completely the
wrong way around, and that broke __alloc_flex() completely. By chance,
the resulting code was then such a mess that clang generated
sufficiently garbage code that objtool warned about it all. Which made
it somewhat quicker to narrow things down.
While fixing overflows_flex_counter_type() would presumably fix this
all, I'm excising the whole broken overflow logic from __alloc_flex(),
because we don't want that kind of code in basic allocation functions
anyway.
That (no longer) broken overflows_flex_counter_type() thing needs to be
inserted into the actual __set_flex_counter() logic in the unlikely case
that we ever want this at all. And made conditional.
Fixes: 81cee9166a ("compiler_types: Introduce __flex_counter() and family")
Fixes: 69050f8d6d ("treewide: Replace kmalloc with kmalloc_obj for non-scalar types")
Cc: Kees Cook <kees@kernel.org>
Link: https://lore.kernel.org/all/CAHk-=whEd020BYzGTzYrENjD9Z5_82xx6h8HsQvH5xDSnv0=Hw@mail.gmail.com/
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull kmalloc_obj conversion from Kees Cook:
"This does the tree-wide conversion to kmalloc_obj() and friends using
coccinelle, with a subsequent small manual cleanup of whitespace
alignment that coccinelle does not handle.
This uncovered a clang bug in __builtin_counted_by_ref(), so the
conversion is preceded by disabling that for current versions of
clang. The imminent clang 22.1 release has the fix.
I've done allmodconfig build tests for x86_64, arm64, i386, and arm. I
did defconfig builds for alpha, m68k, mips, parisc, powerpc, riscv,
s390, sparc, sh, arc, csky, xtensa, hexagon, and openrisc"
* tag 'kmalloc_obj-treewide-v7.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
kmalloc_obj: Clean up after treewide replacements
treewide: Replace kmalloc with kmalloc_obj for non-scalar types
compiler_types: Disable __builtin_counted_by_ref for Clang
Pull perf tools updates from Arnaldo Carvalho de Melo:
- Introduce 'perf sched stats' tool with record/report/diff workflows
using schedstat counters
- Add a faster libdw based addr2line implementation and allow selecting
it or its alternatives via 'perf config addr2line.style='
- Data-type profiling fixes and improvements including the ability to
select fields using 'perf report''s -F/-fields, e.g.:
'perf report --fields overhead,type'
- Add 'perf test' regression tests for Data-type profiling with C and
Rust workloads
- Fix srcline printing with inlines in callchains, make sure this has
coverage in 'perf test'
- Fix printing of leaf IP in LBR callchains
- Fix display of metrics without sufficient permission in 'perf stat'
- Print all machines in 'perf kvm report -vvv', not just the host
- Switch from SHA-1 to BLAKE2s for build ID generation, remove SHA-1
code
- Fix 'perf report's histogram entry collapsing with '-F' option
- Use system's cacheline size instead of a hardcoded value in 'perf
report'
- Allow filtering conversion by time range in 'perf data'
- Cover conversion to CTF using 'perf data' in 'perf test'
- Address newer glibc const-correctness (-Werror=discarded-qualifiers)
issues
- Fixes and improvements for ARM's CoreSight support, simplify ARM SPE
event config in 'perf mem', update docs for 'perf c2c' including the
ARM events it can be used with
- Build support for generating metrics from arch specific python
script, add extra AMD, Intel, ARM64 metrics using it
- Add AMD Zen 6 events and metrics
- Add JSON file with OpenHW Risc-V CVA6 hardware counters
- Add 'perf kvm' stats live testing
- Add more 'perf stat' tests to 'perf test'
- Fix segfault in `perf lock contention -b/--use-bpf`
- Fix various 'perf test' cases for s390
- Build system cleanups, bump minimum shellcheck version to 0.7.2
- Support building the capstone based annotation routines as a plugin
- Allow passing extra Clang flags via EXTRA_BPF_FLAGS
* tag 'perf-tools-for-v7.0-1-2026-02-21' of git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools: (255 commits)
perf test script: Add python script testing support
perf test script: Add perl script testing support
perf script: Allow the generated script to be a path
perf test: perf data --to-ctf testing
perf test: Test pipe mode with data conversion --to-json
perf json: Pipe mode --to-ctf support
perf json: Pipe mode --to-json support
perf check: Add libbabeltrace to the listed features
perf build: Allow passing extra Clang flags via EXTRA_BPF_FLAGS
perf test data_type_profiling.sh: Skip just the Rust tests if code_with_type workload is missing
tools build: Fix feature test for rust compiler
perf libunwind: Fix calls to thread__e_machine()
perf stat: Add no-affinity flag
perf evlist: Reduce affinity use and move into iterator, fix no affinity
perf evlist: Missing TPEBS close in evlist__close()
perf evlist: Special map propagation for tool events that read on 1 CPU
perf stat-shadow: In prepare_metric fix guard on reading NULL perf_stat_evsel
Revert "perf tool_pmu: More accurately set the cpus for tool events"
tools build: Emit dependencies file for test-rust.bin
tools build: Make test-rust.bin be removed by the 'clean' target
...
Pull coccinelle updates from Julia Lawall:
"This simplifies and clarifies the handling of output generated by
Coccinelle that is sent to standard error.
By default, this goes to /dev/null. Remind the user of that and
encourage them to provide another file name (Benjamin Philip)"
* tag 'cocci-7.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlawall/linux:
Documentation: Coccinelle: document debug log handling
scripts: coccicheck: warn on unset debug file
scripts: coccicheck: simplify debug file handling
Pull NTB (PCIe non-transparent bridge) updates from Jon Mason:
"NTB updates include debugfs improvements, correctness fixes, cleanups,
and new hardware support:
ntb_transport QP stats are converted to seq_file, a tx_memcpy_offload
module parameter is introduced with associated ordering fixes, and a
debugfs queue name truncation bug is corrected.
Additional fixes address format specifier mismatches in ntb_tool and
boundary conditions in the Switchtec driver, while unused MSI helpers
are removed and the codebase migrates to dma_map_phys().
Intel Gen6 (Diamond Rapids) NTB support is also added"
* tag 'ntb-7.0' of https://github.com/jonmason/ntb:
NTB: ntb_transport: Use seq_file for QP stats debugfs
NTB: ntb_transport: Fix too small buffer for debugfs_name
ntb/ntb_tool: correct sscanf format for u64 and size_t in tool_peer_mw_trans_write
ntb: intel: Add Intel Gen6 NTB support for DiamondRapids
NTB/msi: Remove unused functions
ntb: ntb_hw_switchtec: Increase MAX_MWS limit to 256
ntb: ntb_hw_switchtec: Fix array-index-out-of-bounds access
ntb: ntb_hw_switchtec: Fix shift-out-of-bounds for 0 mw lut
NTB: epf: allow built-in build
ntb: migrate to dma_map_phys instead of map_page
NTB: ntb_transport: Add 'tx_memcpy_offload' module option
NTB: ntb_transport: Remove unused 'retries' field from ntb_queue_entry
Pull io_uring fixes from Jens Axboe:
- A fix for a missing URING_CMD128 opcode check, fixing an issue with
the SQE mixed mode support introduced in 6.19. Merged late due to
having multiple dependencies
- Add sqe->cmd size checking for big SQEs, similar to what we have for
normal sized SQEs
- Fix a race condition in zcrx, that leads to a double free
* tag 'io_uring-20260221' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
io_uring: Add size check for sqe->cmd
io_uring: add IORING_OP_URING_CMD128 to opcode checks
io_uring/zcrx: fix user_ref race between scrub and refill paths
Pull memblock fix from Mike Rapoport:
"Fix detection of NUMA node for CXL windows
phys_to_target_node() may assign a CXL Fixed Memory Window to the
wrong NUMA node when a CXL node resides in the gap of discontinuous
System RAM node.
Fix this by checking both numa_meminfo and numa_reserved_meminfo,
preferring the reserved NID when the address appears in both"
* tag 'fixes-2026-02-21' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock:
mm: numa_memblks: Identify the accurate NUMA ID of CFMW
Pull sched_ext fixes from Tejun Heo:
- Various bug fixes for the example schedulers and selftests
* tag 'sched_ext-for-7.0-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
tools/sched_ext: fix getopt not re-parsed on restart
tools/sched_ext: scx_userland: fix data races on shared counters
tools/sched_ext: scx_pair: fix stride == 0 crash on single-CPU systems
tools/sched_ext: scx_central: fix CPU_SET and skeleton leak on early exit
tools/sched_ext: scx_userland: fix stale data on restart
tools/sched_ext: scx_flatcg: fix potential stack overflow from VLA in fcg_read_stats
selftests/sched_ext: Fix rt_stall flaky failure
tools/sched_ext: scx_userland: fix restart and stats thread lifecycle bugs
tools/sched_ext: scx_central: fix sched_setaffinity() call with the set size
tools/sched_ext: scx_flatcg: zero-initialize stats counter array
Pull smb server fixes from Steve French:
"Two small fixes:
- fix potential deadlock
- minor cleanup"
* tag 'v7.0-rc-part2-ksmbd-server-fixes' of git://git.samba.org/ksmbd:
ksmbd: call ksmbd_vfs_kern_path_end_removing() on some error paths
smb: server: Remove duplicate include of misc.h
The current debug documentation does not mention that logs are printed
to stdout unless DEBUG_FILE is set. It also doesn't mention that
Coccinelle cannot overwrite debug files.
Document this behaviour in the examples and reference it in the
debugging section.
Signed-off-by: Benjamin Philip <benjamin.philip495@gmail.com>
Signed-off-by: Julia Lawall <julia.lawall@inria.fr>
coccicheck prints debug logs to stdout unless a debug file has been set.
This makes it hard to read coccinelle's suggested changes, especially
for someone new to coccicheck.
From this commit, we warn about this behaviour from within the script on
an unset debug file. Explicitly setting the debug file to /dev/null
suppresses the warning while keeping the default.
Signed-off-by: Benjamin Philip <benjamin.philip495@gmail.com>
Signed-off-by: Julia Lawall <julia.lawall@inria.fr>
This commit separates handling unset files and pre-existing files. It
also eliminates a duplicated check for unset files in run_cmd_parmap().
Signed-off-by: Benjamin Philip <benjamin.philip495@gmail.com>
Signed-off-by: Julia Lawall <julia.lawall@inria.fr>
This is the result of running the Coccinelle script from
scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to
avoid scalar types (which need careful case-by-case checking), and
instead replace kmalloc-family calls that allocate struct or union
object instances:
Single allocations: kmalloc(sizeof(TYPE), ...)
are replaced with: kmalloc_obj(TYPE, ...)
Array allocations: kmalloc_array(COUNT, sizeof(TYPE), ...)
are replaced with: kmalloc_objs(TYPE, COUNT, ...)
Flex array allocations: kmalloc(struct_size(PTR, FAM, COUNT), ...)
are replaced with: kmalloc_flex(*PTR, FAM, COUNT, ...)
(where TYPE may also be *VAR)
The resulting allocations no longer return "void *", instead returning
"TYPE *".
Signed-off-by: Kees Cook <kees@kernel.org>
Unfortunately, there is a corner case of __builtin_counted_by_ref()
usage that crashes[1] Clang since support was introduced in Clang 19.
Disable it prior to Clang 22. Found while tested kmalloc_obj treewide
refactoring (via kmalloc_flex() usage).
Link: https://github.com/llvm/llvm-project/issues/182575 [1]
Signed-off-by: Kees Cook <kees@kernel.org>
After goto restart, optind retains its advanced position from the
previous getopt loop, causing getopt() to immediately return -1.
This silently drops all command-line options on the restarted skeleton.
Reset optind to 1 at the restart label so options are re-parsed.
Affected schedulers: scx_simple, scx_central, scx_flatcg, scx_pair,
scx_sdt, scx_cpu0.
Signed-off-by: David Carlier <devnexen@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>