Pull erofs fixes from Gao Xiang:
- Do not share the page cache if the real @aops differs
- Fix the incomplete condition for interlaced plain extents
- Get rid of more unnecessary #ifdefs
* tag 'erofs-for-7.0-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
erofs: fix interlaced plain identification for encoded extents
erofs: remove more unnecessary #ifdefs
erofs: allow sharing page cache with the same aops only
Pull vfs fixes from Christian Brauner:
- Fix an uninitialized variable in file_getattr().
The flags_valid field wasn't initialized before calling
vfs_fileattr_get(), triggering KMSAN uninit-value reports in fuse
- Fix writeback wakeup and logging timeouts when DETECT_HUNG_TASK is
not enabled.
sysctl_hung_task_timeout_secs is 0 in that case causing spurious
"waiting for writeback completion for more than 1 seconds" warnings
- Fix a null-ptr-deref in do_statmount() when the mount is internal
- Add missing kernel-doc description for the @private parameter in
iomap_readahead()
- Fix mount namespace creation to hold namespace_sem across the mount
copy in create_new_namespace().
The previous drop-and-reacquire pattern was fragile and failed to
clean up mount propagation links if the real rootfs was a shared or
dependent mount
- Fix /proc mount iteration where m->index wasn't updated when
m->show() overflows, causing a restart to repeatedly show the same
mount entry in a rapidly expanding mount table
- Return EFSCORRUPTED instead of ENOSPC in minix_new_inode() when the
inode number is out of range
- Fix unshare(2) when CLONE_NEWNS is set and current->fs isn't shared.
copy_mnt_ns() received the live fs_struct so if a subsequent
namespace creation failed the rollback would leave pwd and root
pointing to detached mounts. Always allocate a new fs_struct when
CLONE_NEWNS is requested
- fserror bug fixes:
- Remove the unused fsnotify_sb_error() helper now that all callers
have been converted to fserror_report_metadata
- Fix a lockdep splat in fserror_report() where igrab() takes
inode::i_lock which can be held in IRQ context.
Replace igrab() with a direct i_count bump since filesystems
should not report inodes that are about to be freed or not yet
exposed
- Handle error pointer in procfs for try_lookup_noperm()
- Fix an integer overflow in ep_loop_check_proc() where recursive calls
returning INT_MAX would overflow when +1 is added, breaking the
recursion depth check
- Fix a misleading break in pidfs
* tag 'vfs-7.0-rc2.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
pidfs: avoid misleading break
eventpoll: Fix integer overflow in ep_loop_check_proc()
proc: Fix pointer error dereference
fserror: fix lockdep complaint when igrabbing inode
fsnotify: drop unused helper
unshare: fix unshare_fs() handling
minix: Correct errno in minix_new_inode
namespace: fix proc mount iteration
mount: hold namespace_sem across copy in create_new_namespace()
iomap: Describe @private in iomap_readahead()
statmount: Fix the null-ptr-deref in do_statmount()
writeback: Fix wakeup and logging timeouts for !DETECT_HUNG_TASK
fs: init flags_valid before calling vfs_fileattr_get
Only plain data whose start position and on-disk physical length are
both aligned to the block size should be classified as interlaced
plain extents. Otherwise, it must be treated as shifted plain extents.
This issue was found by syzbot using a crafted compressed image
containing plain extents with unaligned physical lengths, which can
cause OOB read in z_erofs_transform_plain().
Reported-and-tested-by: syzbot+d988dc155e740d76a331@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/r/699d5714.050a0220.cdd3c.03e7.GAE@google.com
Fixes: 1d191b4ca5 ("erofs: implement encoded extent metadata")
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Inode with identical data but different @aops cannot be mixed
because the page cache is managed by different subsystems (e.g.,
@aops for compressed on-disk inodes cannot handle plain on-disk
inodes).
In this patch, we never allow inodes to share the page cache
among plain, compressed, and fileio cases. When a shared inode
is created, we initialize @aops that is the same as the initial
real inode, and subsequent inodes cannot share the page cache
if the inferred @aops differ from the corresponding shared inode.
This is reasonable as a first step because, in typical use cases,
if an inode is compressible, it will fall into compressed
inodes across different filesystem images unless users use plain
filesystems. However, in that cases, users will use plain
filesystems all the time.
Fixes: 5ef3208e3b ("erofs: introduce the page cache share feature")
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Pull fsverity fixes from Eric Biggers:
- Fix a build error on parisc
- Remove the non-large-folio-aware function fsverity_verify_page()
* tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fsverity/linux:
fsverity: fix build error by adding fsverity_readahead() stub
fsverity: remove fsverity_verify_page()
f2fs: make f2fs_verify_cluster() partially large-folio-aware
f2fs: remove unnecessary ClearPageUptodate in f2fs_verify_cluster()
This converts some of the visually simpler cases that have been split
over multiple lines. I only did the ones that are easy to verify the
resulting diff by having just that final GFP_KERNEL argument on the next
line.
Somebody should probably do a proper coccinelle script for this, but for
me the trivial script actually resulted in an assertion failure in the
middle of the script. I probably had made it a bit _too_ trivial.
So after fighting that far a while I decided to just do some of the
syntactically simpler cases with variations of the previous 'sed'
scripts.
The more syntactically complex multi-line cases would mostly really want
whitespace cleanup anyway.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is the exact same thing as the 'alloc_obj()' version, only much
smaller because there are a lot fewer users of the *alloc_flex()
interface.
As with alloc_obj() version, this was done entirely with mindless brute
force, using the same script, except using 'flex' in the pattern rather
than 'objs*'.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This was done entirely with mindless brute force, using
git grep -l '\<k[vmz]*alloc_objs*(.*, GFP_KERNEL)' |
xargs sed -i 's/\(alloc_objs*(.*\), GFP_KERNEL)/\1)/'
to convert the new alloc_obj() users that had a simple GFP_KERNEL
argument to just drop that argument.
Note that due to the extreme simplicity of the scripting, any slightly
more complex cases spread over multiple lines would not be triggered:
they definitely exist, but this covers the vast bulk of the cases, and
the resulting diff is also then easier to check automatically.
For the same reason the 'flex' versions will be done as a separate
conversion.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull kmalloc_obj conversion from Kees Cook:
"This does the tree-wide conversion to kmalloc_obj() and friends using
coccinelle, with a subsequent small manual cleanup of whitespace
alignment that coccinelle does not handle.
This uncovered a clang bug in __builtin_counted_by_ref(), so the
conversion is preceded by disabling that for current versions of
clang. The imminent clang 22.1 release has the fix.
I've done allmodconfig build tests for x86_64, arm64, i386, and arm. I
did defconfig builds for alpha, m68k, mips, parisc, powerpc, riscv,
s390, sparc, sh, arc, csky, xtensa, hexagon, and openrisc"
* tag 'kmalloc_obj-treewide-v7.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
kmalloc_obj: Clean up after treewide replacements
treewide: Replace kmalloc with kmalloc_obj for non-scalar types
compiler_types: Disable __builtin_counted_by_ref for Clang
Pull io_uring fixes from Jens Axboe:
- A fix for a missing URING_CMD128 opcode check, fixing an issue with
the SQE mixed mode support introduced in 6.19. Merged late due to
having multiple dependencies
- Add sqe->cmd size checking for big SQEs, similar to what we have for
normal sized SQEs
- Fix a race condition in zcrx, that leads to a double free
* tag 'io_uring-20260221' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
io_uring: Add size check for sqe->cmd
io_uring: add IORING_OP_URING_CMD128 to opcode checks
io_uring/zcrx: fix user_ref race between scrub and refill paths
Pull smb server fixes from Steve French:
"Two small fixes:
- fix potential deadlock
- minor cleanup"
* tag 'v7.0-rc-part2-ksmbd-server-fixes' of git://git.samba.org/ksmbd:
ksmbd: call ksmbd_vfs_kern_path_end_removing() on some error paths
smb: server: Remove duplicate include of misc.h
This is the result of running the Coccinelle script from
scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to
avoid scalar types (which need careful case-by-case checking), and
instead replace kmalloc-family calls that allocate struct or union
object instances:
Single allocations: kmalloc(sizeof(TYPE), ...)
are replaced with: kmalloc_obj(TYPE, ...)
Array allocations: kmalloc_array(COUNT, sizeof(TYPE), ...)
are replaced with: kmalloc_objs(TYPE, COUNT, ...)
Flex array allocations: kmalloc(struct_size(PTR, FAM, COUNT), ...)
are replaced with: kmalloc_flex(*PTR, FAM, COUNT, ...)
(where TYPE may also be *VAR)
The resulting allocations no longer return "void *", instead returning
"TYPE *".
Signed-off-by: Kees Cook <kees@kernel.org>
Pull btrfs fixes from David Sterba:
- multiple error handling fixes of unexpected conditions
- reset block group size class once it becomes empty so that
its class can be changed
- error message level adjustments
- fixes of returned error values
- use correct block reserve for delayed refs
* tag 'for-7.0-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix invalid leaf access in btrfs_quota_enable() if ref key not found
btrfs: fix lost error return in btrfs_find_orphan_roots()
btrfs: fix lost return value on error in finish_verity()
btrfs: change unaligned root messages to error level in btrfs_validate_super()
btrfs: use the correct type to initialize block reserve for delayed refs
btrfs: do not ASSERT() when the fs flips RO inside btrfs_repair_io_failure()
btrfs: reset block group size class when it becomes empty
btrfs: replace BUG() with error handling in __btrfs_balance()
btrfs: handle unexpected exact match in btrfs_set_inode_index_count()
Pull ecryptfs updates from Tyler Hicks:
"This consists of some really minor typo fixes that fell through the
cracks and some more recent code cleanups:
- Comment typo fixes
- Removal of an unused function declaration
- Use strscpy() instead of the deprecated strcpy()
- Use string copying helpers instead of memcpy() and manually
terminating strings"
* tag 'ecryptfs-7.0-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tyhicks/ecryptfs:
ecryptfs: Replace memcpy + NUL termination in ecryptfs_copy_filename
ecryptfs: Drop redundant NUL terminations after calling ecryptfs_to_hex
ecryptfs: Replace memcpy + NUL termination in ecryptfs_new_file_context
ecryptfs: Replace strcpy with strscpy in ecryptfs_validate_options
ecryptfs: Replace strcpy with strscpy in ecryptfs_cipher_code_to_string
ecryptfs: Replace strcpy with strscpy in ecryptfs_set_default_crypt_stat_vals
ecryptfs: simplify list initialization in ecryptfs_parse_packet_set()
ecryptfs: Remove unused declartion ecryptfs_fill_zeros()
ecryptfs: Fix packet format comment in parse_tag_67_packet()
ecryptfs: comment typo fix
ecryptfs: keystore: Fix typo 'the the' in comment
For SQE128, sqe->cmd provides 80 bytes for uring_cmd. Add macro to
check if size of user struct does not exceed 80 bytes at compile time.
User doesn't have to track this manually during development.
Replace io_uring_sqe_cmd() inline func with macro and add
io_uring_sqe128_cmd() which checks struct
size for 16 bytes cmd and 80 bytes cmd respectively.
Signed-off-by: Govindarajulu Varadarajan <govind.varadar@gmail.com>
Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig reported a lockdep splat in generic/108:
================================
WARNING: inconsistent lock state
6.19.0+ #4827 Tainted: G N
--------------------------------
inconsistent {HARDIRQ-ON-W} -> {IN-HARDIRQ-W} usage.
swapper/1/0 [HC1[1]:SC0[0]:HE0:SE1] takes:
ffff88811ed1b140 (&sb->s_type->i_lock_key#33){?.+.}-{3:3}, at: igrab+0x1a/0xb0
{HARDIRQ-ON-W} state was registered at:
lock_acquire+0xca/0x2c0
_raw_spin_lock+0x2e/0x40
unlock_new_inode+0x2c/0xc0
xfs_iget+0xcf4/0x1080
xfs_trans_metafile_iget+0x3d/0x100
xfs_metafile_iget+0x2b/0x50
xfs_mount_setup_metadir+0x20/0x60
xfs_mountfs+0x457/0xa60
xfs_fs_fill_super+0x6b3/0xa90
get_tree_bdev_flags+0x13c/0x1e0
vfs_get_tree+0x27/0xe0
vfs_cmd_create+0x54/0xe0
__do_sys_fsconfig+0x309/0x620
do_syscall_64+0x8b/0xf80
entry_SYSCALL_64_after_hwframe+0x76/0x7e
irq event stamp: 139080
hardirqs last enabled at (139079): [<ffffffff813a923c>] do_idle+0x1ec/0x270
hardirqs last disabled at (139080): [<ffffffff828a8d09>] common_interrupt+0x19/0xe0
softirqs last enabled at (139032): [<ffffffff8134a853>] __irq_exit_rcu+0xc3/0x120
softirqs last disabled at (139025): [<ffffffff8134a853>] __irq_exit_rcu+0xc3/0x120
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(&sb->s_type->i_lock_key#33);
<Interrupt>
lock(&sb->s_type->i_lock_key#33);
*** DEADLOCK ***
1 lock held by swapper/1/0:
#0: ffff8881052c81a0 (&vblk->vqs[i].lock){-.-.}-{3:3}, at: virtblk_done+0x4b/0x110
stack backtrace:
CPU: 1 UID: 0 PID: 0 Comm: swapper/1 Tainted: G N 6.19.0+ #4827 PREEMPT(full)
Tainted: [N]=TEST
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.17.0-0-gb52ca86e094d-prebuilt.qemu.org 04/01/2014
Call Trace:
<IRQ>
dump_stack_lvl+0x5b/0x80
print_usage_bug.part.0+0x22c/0x2c0
mark_lock+0xa6f/0xe90
__lock_acquire+0x10b6/0x25e0
lock_acquire+0xca/0x2c0
_raw_spin_lock+0x2e/0x40
igrab+0x1a/0xb0
fserror_report+0x135/0x260
iomap_finish_ioend_buffered+0x170/0x210
clone_endio+0x8f/0x1c0
blk_update_request+0x1e4/0x4d0
blk_mq_end_request+0x1b/0x100
virtblk_done+0x6f/0x110
vring_interrupt+0x59/0x80
__handle_irq_event_percpu+0x8a/0x2e0
handle_irq_event+0x33/0x70
handle_edge_irq+0xdd/0x1e0
__common_interrupt+0x6f/0x180
common_interrupt+0xb7/0xe0
</IRQ>
It looks like the concern here is that inode::i_lock is sometimes taken
in IRQ context, and sometimes it is held when going to IRQ context,
though it's a little difficult to tell since I think this is a kernel
from after the actual 6.19 release but before 7.0-rc1.
Either way, we don't need to take i_lock, because filesystems should
not report files to fserror if they're about to be freed or have not
yet been exposed to other threads, because the resulting fsnotify report
will be meaningless.
Therefore, bump inode::i_count directly and clarify the preconditions on
the inode being passed in.
Link: https://lore.kernel.org/linux-fsdevel/aY7BndIgQg3ci_6s@infradead.org/
Reported-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Link: https://patch.msgid.link/177148129564.716249.3069780698231701540.stgit@frogsfrogsfrogs
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Pull more MM updates from Andrew Morton:
- "mm/vmscan: fix demotion targets checks in reclaim/demotion" fixes a
couple of issues in the demotion code - pages were failed demotion
and were finding themselves demoted into disallowed nodes (Bing Jiao)
- "Remove XA_ZERO from error recovery of dup_mmap()" fixes a rare
mapledtree race and performs a number of cleanups (Liam Howlett)
- "mm: add bitmap VMA flag helpers and convert all mmap_prepare to use
them" implements a lot of cleanups following on from the conversion
of the VMA flags into a bitmap (Lorenzo Stoakes)
- "support batch checking of references and unmapping for large folios"
implements batching to greatly improve the performance of reclaiming
clean file-backed large folios (Baolin Wang)
- "selftests/mm: add memory failure selftests" does as claimed (Miaohe
Lin)
* tag 'mm-stable-2026-02-18-19-48' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (36 commits)
mm/page_alloc: clear page->private in free_pages_prepare()
selftests/mm: add memory failure dirty pagecache test
selftests/mm: add memory failure clean pagecache test
selftests/mm: add memory failure anonymous page test
mm: rmap: support batched unmapping for file large folios
arm64: mm: implement the architecture-specific clear_flush_young_ptes()
arm64: mm: support batch clearing of the young flag for large folios
arm64: mm: factor out the address and ptep alignment into a new helper
mm: rmap: support batched checks of the references for large folios
tools/testing/vma: add VMA userland tests for VMA flag functions
tools/testing/vma: separate out vma_internal.h into logical headers
tools/testing/vma: separate VMA userland tests into separate files
mm: make vm_area_desc utilise vma_flags_t only
mm: update all remaining mmap_prepare users to use vma_flags_t
mm: update shmem_[kernel]_file_*() functions to use vma_flags_t
mm: update secretmem to use VMA flags on mmap_prepare
mm: update hugetlbfs to use VMA flags on mmap_prepare
mm: add basic VMA flag operation helper functions
tools: bitmap: add missing bitmap_[subset(), andnot()]
mm: add mk_vma_flags() bitmap flag macro helper
...
Pull sysctl updates from Joel Granados:
- Remove macros from proc handler converters
Replace the proc converter macros with "regular" functions. Though it
is more verbose than the macro version, it helps when debugging and
better aligns with coding-style.rst.
- General cleanup
Remove superfluous ctl_table forward declarations. Const qualify the
memory_allocation_profiling_sysctl and loadpin_sysctl_table arrays.
Add missing kernel doc to proc_dointvec_conv.
- Testing
This series was run through sysctl selftests/kunit test suite in
x86_64. And went into linux-next after rc4, giving it a good 3 weeks
of testing
* tag 'sysctl-7.00-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl:
sysctl: replace SYSCTL_INT_CONV_CUSTOM macro with functions
sysctl: Replace unidirectional INT converter macros with functions
sysctl: Add kernel doc to proc_douintvec_conv
sysctl: Replace UINT converter macros with functions
sysctl: Add CONFIG_PROC_SYSCTL guards for converter macros
sysctl: clarify proc_douintvec_minmax doc
sysctl: Return -ENOSYS from proc_douintvec_conv when CONFIG_PROC_SYSCTL=n
sysctl: Remove unused ctl_table forward declarations
loadpin: Implement custom proc_handler for enforce
alloc_tag: move memory_allocation_profiling_sysctls into .rodata
sysctl: Add missing kernel-doc for proc_dointvec_conv
If btrfs_search_slot_for_read() returns 1, it means we did not find any
key greater than or equals to the key we asked for, meaning we have
reached the end of the tree and therefore the path is not valid. If
this happens we need to break out of the loop and stop, instead of
continuing and accessing an invalid path.
Fixes: 5223cc60b4 ("btrfs: drop the path before adding qgroup items when enabling qgroups")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If the root nodes for the chunk root, tree root or log root are not sector
size aligned, we are logging a warning message but these are in fact
errors that makes the super block validation fail. So change the level of
the messages from warning to error.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When initializing the delayed refs block reserve for a transaction handle
we are passing a type of BTRFS_BLOCK_RSV_DELOPS, which is meant for
delayed items and not for delayed refs. The correct type for delayed refs
is BTRFS_BLOCK_RSV_DELREFS.
On release of any excess space reserved in a local delayed refs reserve,
we also should transfer that excess space to the global block reserve
(it it's full, we return to the space info for general availability).
By initializing a transaction's local delayed refs block reserve with a
type of BTRFS_BLOCK_RSV_DELOPS, we were also causing any excess space
released from the delayed block reserve (fs_info->delayed_block_rsv, used
for delayed inodes and items) to be transferred to the global block
reserve instead of the global delayed refs block reserve. This was an
unintentional change in commit 28270e25c6 ("btrfs: always reserve space
for delayed refs when starting transaction"), but it's not particularly
serious as things tend to cancel out each other most of the time and it's
relatively rare to be anywhere near exhaustion of the global reserve.
Fix this by initializing a transaction's local delayed refs reserve with
a type of BTRFS_BLOCK_RSV_DELREFS and making btrfs_block_rsv_release()
attempt to transfer unused space from such a reserve into the global block
reserve, just as we did before that commit for when the block reserve is
a delayed refs rsv.
Reported-by: Alex Lyakas <alex.lyakas@zadara.com>
Link: https://lore.kernel.org/linux-btrfs/CAOcd+r0FHG5LWzTSu=LknwSoqxfw+C00gFAW7fuX71+Z5AfEew@mail.gmail.com/
Fixes: 28270e25c6 ("btrfs: always reserve space for delayed refs when starting transaction")
Reviewed-by: Alex Lyakas <alex.lyakas@zadara.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
There is a bug report that when btrfs hits ENOSPC error in a critical
path, btrfs flips RO (this part is expected, although the ENOSPC bug
still needs to be addressed).
The problem is after the RO flip, if there is a read repair pending, we
can hit the ASSERT() inside btrfs_repair_io_failure() like the following:
BTRFS info (device vdc): relocating block group 30408704 flags metadata|raid1
------------[ cut here ]------------
BTRFS: Transaction aborted (error -28)
WARNING: fs/btrfs/extent-tree.c:3235 at __btrfs_free_extent.isra.0+0x453/0xfd0, CPU#1: btrfs/383844
Modules linked in: kvm_intel kvm irqbypass
[...]
---[ end trace 0000000000000000 ]---
BTRFS info (device vdc state EA): 2 enospc errors during balance
BTRFS info (device vdc state EA): balance: ended with status: -30
BTRFS error (device vdc state EA): parent transid verify failed on logical 30556160 mirror 2 wanted 8 found 6
BTRFS error (device vdc state EA): bdev /dev/nvme0n1 errs: wr 0, rd 0, flush 0, corrupt 10, gen 0
[...]
assertion failed: !(fs_info->sb->s_flags & SB_RDONLY) :: 0, in fs/btrfs/bio.c:938
------------[ cut here ]------------
assertion failed: !(fs_info->sb->s_flags & SB_RDONLY) :: 0, in fs/btrfs/bio.c:938
kernel BUG at fs/btrfs/bio.c:938!
Oops: invalid opcode: 0000 [#1] SMP NOPTI
CPU: 0 UID: 0 PID: 868 Comm: kworker/u8:13 Tainted: G W N 6.19.0-rc6+ #4788 PREEMPT(full)
Tainted: [W]=WARN, [N]=TEST
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.17.0-0-gb52ca86e094d-prebuilt.qemu.org 04/01/2014
Workqueue: btrfs-endio simple_end_io_work
RIP: 0010:btrfs_repair_io_failure.cold+0xb2/0x120
RSP: 0000:ffffc90001d2bcf0 EFLAGS: 00010246
RAX: 0000000000000051 RBX: 0000000000001000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffffff8305cf42 RDI: 00000000ffffffff
RBP: 0000000000000002 R08: 00000000fffeffff R09: ffffffff837fa988
R10: ffffffff8327a9e0 R11: 6f69747265737361 R12: ffff88813018d310
R13: ffff888168b8a000 R14: ffffc90001d2bd90 R15: ffff88810a169000
FS: 0000000000000000(0000) GS:ffff8885e752c000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
------------[ cut here ]------------
[CAUSE]
The cause of -ENOSPC error during the test case btrfs/124 is still
unknown, although it's known that we still have cases where metadata can
be over-committed but can not be fulfilled correctly, thus if we hit
such ENOSPC error inside a critical path, we have no choice but abort
the current transaction.
This will mark the fs read-only.
The problem is inside the btrfs_repair_io_failure() path that we require
the fs not to be mount read-only. This is normally fine, but if we are
doing a read-repair meanwhile the fs flips RO due to a critical error,
we can enter btrfs_repair_io_failure() with super block set to
read-only, thus triggering the above crash.
[FIX]
Just replace the ASSERT() with a proper return if the fs is already
read-only.
Reported-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/linux-btrfs/20260126045555.GB31641@lst.de/
Tested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Block group size classes are managed consistently everywhere.
Currently, btrfs_use_block_group_size_class() sets a block group's size
class to specialize it for a specific allocation size. However, this
size class remains "stale" even if the block group becomes completely
empty (both used and reserved bytes reach zero).
This happens in two scenarios:
1. When space reservations are freed (e.g., due to errors or transaction
aborts) via btrfs_free_reserved_bytes().
2. When the last extent in a block group is freed via
btrfs_update_block_group().
While size classes are advisory, a stale size class can cause
find_free_extent to unnecessarily skip candidate block groups during
initial search loops. This undermines the purpose of size classes to
reduce fragmentation by keeping block groups restricted to a specific
size class when they could be reused for any size.
Fix this by resetting the size class to BTRFS_BG_SZ_NONE whenever a
block group's used and reserved counts both reach zero. This ensures
that empty block groups are fully available for any allocation size in
the next cycle.
Fixes: 52bb7a2166 ("btrfs: introduce size class to block group allocator")
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Jiasheng Jiang <jiashengjiangcool@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We search with offset (u64)-1 which should never match exactly.
Previously this was handled with BUG(). Now logs an error
and return -EUCLEAN.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Adarsh Das <adarshdas950@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We search with offset (u64)-1 which should never match exactly.
Previously the code silently returned success without setting the index
count. Now logs an error and return -EUCLEAN instead.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Adarsh Das <adarshdas950@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>,
Signed-off-by: David Sterba <dsterba@suse.com>
The m->index isn't updated when m->show() overflows and retains its
value before the current mount causing a restart to start at the same
value. If that happens in short order to due a quickly expanding mount
table this would cause the same mount to be shown again and again.
Ensure that *pos always equals the mount id of the mount that was
returned by start/next. On restart after overflow mnt_find_id_at(*pos)
finds the exact mount. This should avoid duplicates, avoid skips and
should handle concurrent modification just fine.
Cc: <stable@vger.kernel.org>
Fixed: 2eea9ce431 ("mounts: keep list of mounts in an rbtree")
Link: https://patch.msgid.link/20260129-geleckt-treuhand-4bb940acacd9@brauner
Signed-off-by: Christian Brauner <brauner@kernel.org>
Fix an oversight when creating a new mount namespace. If someone had the
bright idea to make the real rootfs a shared or dependent mount and it
is later copied the copy will become a peer of the old real rootfs mount
or a dependent mount of it. The namespace semaphore is dropped and we
use mount lock exact to lock the new real root mount. If that fails or
the subsequent do_loopback() fails we rely on the copy of the real root
mount to be cleaned up by path_put(). The problem is that this doesn't
deal with mount propagation and will leave the mounts linked in the
propagation lists.
When creating a new mount namespace create_new_namespace() first
acquires namespace_sem to clone the nullfs root, drops it, then
reacquires it via LOCK_MOUNT_EXACT which takes inode_lock first to
respect the inode_lock -> namespace_sem lock ordering. This
drop-and-reacquire pattern is fragile and was the source of the
propagation cleanup bug fixed in the preceding commit.
Extend lock_mount_exact() with a copy_mount mode that clones the mount
under the locks atomically. When copy_mount is true, path_overmounted()
is skipped since we're copying the mount, not mounting on top of it -
the nullfs root always has rootfs mounted on top so the check would
always fail. If clone_mnt() fails after get_mountpoint() has pinned the
mountpoint, __unlock_mount() is used to properly unpin the mountpoint
and release both locks.
This allows create_new_namespace() to use LOCK_MOUNT_EXACT_COPY which
takes inode_lock and namespace_sem once and holds them throughout the
clone and subsequent mount operations, eliminating the
drop-and-reacquire pattern entirely.
Reported-by: syzbot+a89f9434fb5a001ccd58@syzkaller.appspotmail.com
Fixes: 9b8a0ba682 ("mount: add OPEN_TREE_NAMESPACE") # mainline only
Link: https://lore.kernel.org/699047f6.050a0220.2757fb.0024.GAE@google.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
f2fs_verify_cluster() is the only remaining caller of the
non-large-folio-aware function fsverity_verify_page(). To unblock the
removal of that function, change f2fs_verify_cluster() to verify the
entire folio of each page and mark it up-to-date.
Note that this doesn't actually make f2fs_verify_cluster()
large-folio-aware, as it is still passed an array of pages. Currently,
it's never called with large folios.
Suggested-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20260218010630.7407-3-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Pull ntfs3 updates from Konstantin Komarov:
"New code:
- improve readahead for bitmap initialization and large directory scans
- fsync files by syncing parent inodes
- drop of preallocated clusters for sparse and compressed files
- zero-fill folios beyond i_valid in ntfs_read_folio()
- implement llseek SEEK_DATA/SEEK_HOLE by scanning data runs
- implement iomap-based file operations
- allow explicit boolean acl/prealloc mount options
- fall-through between switch labels
- delayed-allocation (delalloc) support
Fixes:
- check return value of indx_find to avoid infinite loop
- initialize new folios before use
- infinite loop in attr_load_runs_range on inconsistent metadata
- infinite loop triggered by zero-sized ATTR_LIST
- ntfs_mount_options leak in ntfs_fill_super()
- deadlock in ni_read_folio_cmpr
- circular locking dependency in run_unpack_ex
- prevent infinite loops caused by the next valid being the same
- restore NULL folio initialization in ntfs_writepages()
- slab-out-of-bounds read in DeleteIndexEntryRoot
Updates:
- allow readdir() to finish after directory mutations without rewinddir()
- handle attr_set_size() errors when truncating files
- make ntfs_writeback_ops static
- refactor duplicate kmemdup pattern in do_action()
- avoid calling run_get_entry() when run == NULL in ntfs_read_run_nb_ra()
Replaced:
- use wait_on_buffer() directly
- rename ni_readpage_cmpr into ni_read_folio_cmpr"
* tag 'ntfs3_for_7.0' of https://github.com/Paragon-Software-Group/linux-ntfs3: (26 commits)
fs/ntfs3: add delayed-allocation (delalloc) support
fs/ntfs3: avoid calling run_get_entry() when run == NULL in ntfs_read_run_nb_ra()
fs/ntfs3: add fall-through between switch labels
fs/ntfs3: allow explicit boolean acl/prealloc mount options
fs/ntfs3: Fix slab-out-of-bounds read in DeleteIndexEntryRoot
ntfs3: Restore NULL folio initialization in ntfs_writepages()
ntfs3: Refactor duplicate kmemdup pattern in do_action()
fs/ntfs3: prevent infinite loops caused by the next valid being the same
fs/ntfs3: make ntfs_writeback_ops static
ntfs3: fix circular locking dependency in run_unpack_ex
fs/ntfs3: implement iomap-based file operations
fs/ntfs3: fix deadlock in ni_read_folio_cmpr
fs/ntfs3: implement llseek SEEK_DATA/SEEK_HOLE by scanning data runs
fs/ntfs3: zero-fill folios beyond i_valid in ntfs_read_folio()
fs/ntfs3: handle attr_set_size() errors when truncating files
fs/ntfs3: drop preallocated clusters for sparse and compressed files
fs/ntfs3: fsync files by syncing parent inodes
fs/ntfs3: fix ntfs_mount_options leak in ntfs_fill_super()
fs/ntfs3: allow readdir() to finish after directory mutations without rewinddir()
fs/ntfs3: improve readahead for bitmap initialization and large directory scans
...
Pull ceph updates from Ilya Dryomov:
"This adds support for the upcoming aes256k key type in CephX that is
based on Kerberos 5 and brings a bunch of assorted CephFS fixes from
Ethan and Sam. One of Sam's patches in particular undoes a change in
the fscrypt area that had an inadvertent side effect of making CephFS
behave as if mounted with wsize=4096 and leading to the corresponding
degradation in performance, especially for sequential writes"
* tag 'ceph-for-7.0-rc1' of https://github.com/ceph/ceph-client:
ceph: assert loop invariants in ceph_writepages_start()
ceph: remove error return from ceph_process_folio_batch()
ceph: fix write storm on fscrypted files
ceph: do not propagate page array emplacement errors as batch errors
ceph: supply snapshot context in ceph_uninline_data()
ceph: supply snapshot context in ceph_zero_partial_object()
libceph: adapt ceph_x_challenge_blob hashing and msgr1 message signing
libceph: add support for CEPH_CRYPTO_AES256KRB5
libceph: introduce ceph_crypto_key_prepare()
libceph: generalize ceph_x_encrypt_offset() and ceph_x_encrypt_buflen()
libceph: define and enforce CEPH_MAX_KEY_LEN
Pull overlayfs update from Amir Goldstein:
"Relax the semantics of uuid=off to cater to a use case of overlayfs
lower layers on btrfs clones, whose UUID are ephemeral and an upper
layer on a different filesystem"
* tag 'ovl-update-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/overlayfs/vfs:
ovl: relax requirement for uuid=off,index=on
Pull smb client fixes from Steve French:
- Fix three potential double free vulnerabilities
- Fix data corruption due to racy lease checks
- Enforce SMB1 signing verification checks
- Fix invalid mount option parsing
- Remove unneeded tracepoint
- Various minor error code corrections
- Minor cleanup
* tag 'v7.0-rc-part2-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
smb: client: terminate session upon failed client required signing
cifs: some missing initializations on replay
cifs: remove unnecessary tracing after put tcon
cifs: update internal module version number
smb: client: fix data corruption due to racy lease checks
smb/client: move NT_STATUS_MORE_ENTRIES
smb/client: rename to NT_ERROR_INVALID_DATATYPE
smb/client: rename to NT_STATUS_SOME_NOT_MAPPED
smb/client: map NT_STATUS_PRIVILEGE_NOT_HELD
smb/client: map NT_STATUS_MORE_PROCESSING_REQUIRED
smb/client: map NT_STATUS_BUFFER_OVERFLOW
smb/client: map NT_STATUS_NOTIFY_ENUM_DIR
cifs: SMB1 split: Remove duplicate include of cifs_debug.h
smb: client: fix regression with mount options parsing
Pull more misc vfs updates from Christian Brauner:
"Features:
- Optimize close_range() from O(range size) to O(active FDs) by using
find_next_bit() on the open_fds bitmap instead of linearly scanning
the entire requested range. This is a significant improvement for
large-range close operations on sparse file descriptor tables.
- Add FS_XFLAG_VERITY file attribute for fs-verity files, retrievable
via FS_IOC_FSGETXATTR and file_getattr(). The flag is read-only.
Add tracepoints for fs-verity enable and verify operations,
replacing the previously removed debug printk's.
- Prevent nfsd from exporting special kernel filesystems like pidfs
and nsfs. These filesystems have custom ->open() and ->permission()
export methods that are designed for open_by_handle_at(2) only and
are incompatible with nfsd. Update the exportfs documentation
accordingly.
Fixes:
- Fix KMSAN uninit-value in ovl_fill_real() where strcmp() was used
on a non-null-terminated decrypted directory entry name from
fscrypt. This triggered on encrypted lower layers when the
decrypted name buffer contained uninitialized tail data.
The fix also adds VFS-level name_is_dot(), name_is_dotdot(), and
name_is_dot_dotdot() helpers, replacing various open-coded "." and
".." checks across the tree.
- Fix read-only fsflags not being reset together with xflags in
vfs_fileattr_set(). Currently harmless since no read-only xflags
overlap with flags, but this would cause inconsistencies for any
future shared read-only flag
- Return -EREMOTE instead of -ESRCH from PIDFD_GET_INFO when the
target process is in a different pid namespace. This lets userspace
distinguish "process exited" from "process in another namespace",
matching glibc's pidfd_getpid() behavior
Cleanups:
- Use C-string literals in the Rust seq_file bindings, replacing the
kernel::c_str!() macro (available since Rust 1.77)
- Fix typo in d_walk_ret enum comment, add porting notes for the
readlink_copy() calling convention change"
* tag 'vfs-7.0-rc1.misc.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
fs: add porting notes about readlink_copy()
pidfs: return -EREMOTE when PIDFD_GET_INFO is called on another ns
nfsd: do not allow exporting of special kernel filesystems
exportfs: clarify the documentation of open()/permission() expotrfs ops
fsverity: add tracepoints
fs: add FS_XFLAG_VERITY for fs-verity files
rust: seq_file: replace `kernel::c_str!` with C-Strings
fs: dcache: fix typo in enum d_walk_ret comment
ovl: use name_is_dot* helpers in readdir code
fs: add helpers name_is_dot{,dot,_dotdot}
ovl: Fix uninit-value in ovl_fill_real
fs: reset read-only fsflags together with xflags
fs/file: optimize close_range() complexity from O(N) to O(Sparse)
Pull pidfs updates from Christian Brauner:
- pid: introduce task_ppid_vnr() helper
- pidfs: convert rb-tree to rhashtable
Mateusz reported performance penalties during task creation because
pidfs uses pidmap_lock to add elements into the rbtree. Switch to an
rhashtable to have separate fine-grained locking and to decouple from
pidmap_lock moving all heavy manipulations outside of it
Also move inode allocation outside of pidmap_lock. With this there's
nothing happening for pidfs under pidmap_lock
- pid: reorder fields in pid_namespace to reduce false sharing
- Revert "pid: make __task_pid_nr_ns(ns => NULL) safe for zombie
callers"
- ipc: Add SPDX license id to mqueue.c
* tag 'kernel-7.0-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
pid: introduce task_ppid_vnr() helper
pidfs: implement ino allocation without the pidmap lock
Revert "pid: make __task_pid_nr_ns(ns => NULL) safe for zombie callers"
pid: reorder fields in pid_namespace to reduce false sharing
pidfs: convert rb-tree to rhashtable
ipc: Add SPDX license id to mqueue.c
This patch implements delayed allocation (delalloc) in ntfs3 driver.
It introduces an in-memory delayed-runlist (run_da) and the helpers to
track, reserve and later convert those delayed reservations into real
clusters at writeback time. The change keeps on-disk formats untouched and
focuses on pagecache integration, correctness and safe interaction with
fallocate, truncate, and dio/iomap paths.
Key points:
- add run_da (delay-allocated run tree) and bookkeeping for delayed clusters.
- mark ranges as delalloc (DELALLOC_LCN) instead of immediately allocating.
Actual allocation performed later (writeback / attr_set_size_ex / explicit
flush paths).
- direct i/o / iomap paths updated to avoid dio collisions with
delalloc: dio falls back or forces allocation of delayed blocks before
proceeding.
- punch/collapse/truncate/fallocate check and cancel delay-alloc reservations.
Sparse/compressed files handled specially.
- free-space checks updated (ntfs_check_free_space) to account for reserved
delalloc clusters and MFT record budgeting.
- delayed allocations are committed on last writer (file release) and on
explicit allocation flush paths.
Tested-by: syzbot@syzkaller.appspotmail.com
Reported-by: syzbot+2bd8e813c7f767aa9bb1@syzkaller.appspotmail.com
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
Currently, when smb signature verification fails, the behaviour is to log
the failure without any action to terminate the session.
Call cifs_reconnect() when client required signature verification fails.
Otherwise, log the error without reconnecting.
Signed-off-by: Aaditya Kansal <aadityakansal390@gmail.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
In several places in the code, we have a label to signify
the start of the code where a request can be replayed if
necessary. However, some of these places were missing the
necessary reinitializations of certain local variables
before replay.
This change makes sure that these variables get initialized
after the label.
Cc: stable@vger.kernel.org
Reported-by: Yuchan Nam <entropy1110@gmail.com>
Tested-by: Yuchan Nam <entropy1110@gmail.com>
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
This code was recently changed from manually decrementing
tcon ref to using cifs_put_tcon. But even before that change
this tracing happened after decrementing the ref count, which
is wrong. With cifs_put_tcon, tracing already happens inside it.
So just removing the extra tracing here.
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
There are two places where ksmbd_vfs_kern_path_end_removing() needs to be
called in order to balance what the corresponding successful call to
ksmbd_vfs_kern_path_start_removing() has done, i.e. drop inode locks and
put the taken references. Otherwise there might be potential deadlocks
and unbalanced locks which are caught like:
BUG: workqueue leaked lock or atomic: kworker/5:21/0x00000000/7596
last function: handle_ksmbd_work
2 locks held by kworker/5:21/7596:
#0: ffff8881051ae448 (sb_writers#3){.+.+}-{0:0}, at: ksmbd_vfs_kern_path_locked+0x142/0x660
#1: ffff888130e966c0 (&type->i_mutex_dir_key#3/1){+.+.}-{4:4}, at: ksmbd_vfs_kern_path_locked+0x17d/0x660
CPU: 5 PID: 7596 Comm: kworker/5:21 Not tainted 6.1.162-00456-gc29b353f383b #138
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.17.0-debian-1.17.0-1 04/01/2014
Workqueue: ksmbd-io handle_ksmbd_work
Call Trace:
<TASK>
dump_stack_lvl+0x44/0x5b
process_one_work.cold+0x57/0x5c
worker_thread+0x82/0x600
kthread+0x153/0x190
ret_from_fork+0x22/0x30
</TASK>
Found by Linux Verification Center (linuxtesting.org).
Fixes: d5fc1400a3 ("smb/server: avoid deadlock when linking with ReplaceIfExists")
Cc: stable@vger.kernel.org
Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>