clean fit; namespace_shared due to iterating through ns->mounts.
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Clean fit for guards use; guards can't be weaker due to umount_tree() calls.
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Note that just as path_put, it should never be done in scope of
namespace_sem, be it shared or exclusive.
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
mount_writer: write_seqlock; that's an equivalent of {un,}lock_mount_hash()
mount_locked_reader: read_seqlock_excl; these tend to be open-coded.
No bulk conversions, please - if nothing else, quite a few places take
use mount_writer form when mount_locked_reader is sufficent. It needs
to be dealt with carefully.
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
If anything, namespace_lock should be DEFINE_LOCK_GUARD_0, not DEFINE_GUARD.
That way we
* do not need to feed it a bogus argument
* do not get gcc trying to compare an address of static in
file variable with -4097 - and, if we are unlucky, trying to keep
it in a register, with spills and all such.
The same problems apply to grabbing namespace_sem shared.
Rename it to namespace_excl, add namespace_shared, convert the existing users:
guard(namespace_lock, &namespace_sem) => guard(namespace_excl)()
guard(rwsem_read, &namespace_sem) => guard(namespace_shared)()
scoped_guard(namespace_lock, &namespace_sem) => scoped_guard(namespace_excl)
scoped_guard(rwsem_read, &namespace_sem) => scoped_guard(namespace_shared)
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull misc fixes from Andrew Morton:
"17 hotfixes. 13 are cc:stable and the remainder address post-6.16
issues or aren't considered necessary for -stable kernels. 11 of these
fixes are for MM.
This includes a three-patch series from Harry Yoo which fixes an
intermittent boot failure which can occur on x86 systems. And a
two-patch series from Alexander Gordeev which fixes a KASAN crash on
S390 systems"
* tag 'mm-hotfixes-stable-2025-09-01-17-20' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
mm: fix possible deadlock in kmemleak
x86/mm/64: define ARCH_PAGE_TABLE_SYNC_MASK and arch_sync_kernel_mappings()
mm: introduce and use {pgd,p4d}_populate_kernel()
mm: move page table sync declarations to linux/pgtable.h
proc: fix missing pde_set_flags() for net proc files
mm: fix accounting of memmap pages
mm/damon/core: prevent unnecessary overflow in damos_set_effective_quota()
kexec: add KEXEC_FILE_NO_CMA as a legal flag
kasan: fix GCC mem-intrinsic prefix with sw tags
mm/kasan: avoid lazy MMU mode hazards
mm/kasan: fix vmalloc shadow memory (de-)population races
kunit: kasan_test: disable fortify string checker on kasan_strings() test
selftests/mm: fix FORCE_READ to read input value correctly
mm/userfaultfd: fix kmap_local LIFO ordering for CONFIG_HIGHPTE
ocfs2: prevent release journal inode after journal shutdown
rust: mm: mark VmaNew as transparent
of_numa: fix uninitialized memory nodes causing kernel panic
Let's split IPU writes in hot data area to improve the GC efficiency.
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Pull btrfs fixes from David Sterba:
- fix a few races related to inode link count
- fix inode leak on failure to add link to inode
- move transaction aborts closer to where they happen
* tag 'for-6.17-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: avoid load/store tearing races when checking if an inode was logged
btrfs: fix race between setting last_dir_index_offset and inode logging
btrfs: fix race between logging inode and checking if it was logged before
btrfs: simplify error handling logic for btrfs_link()
btrfs: fix inode leak on failure to add link to inode
btrfs: abort transaction on failure to add link to inode
There is a race condition between inode eviction and inode caching that
can cause a live struct btrfs_inode to be missing from the root->inodes
xarray. Specifically, there is a window during evict() between the inode
being unhashed and deleted from the xarray. If btrfs_iget() is called
for the same inode in that window, it will be recreated and inserted
into the xarray, but then eviction will delete the new entry, leaving
nothing in the xarray:
Thread 1 Thread 2
---------------------------------------------------------------
evict()
remove_inode_hash()
btrfs_iget_path()
btrfs_iget_locked()
btrfs_read_locked_inode()
btrfs_add_inode_to_root()
destroy_inode()
btrfs_destroy_inode()
btrfs_del_inode_from_root()
__xa_erase
In turn, this can cause issues for subvolume deletion. Specifically, if
an inode is in this lost state, and all other inodes are evicted, then
btrfs_del_inode_from_root() will call btrfs_add_dead_root() prematurely.
If the lost inode has a delayed_node attached to it, then when
btrfs_clean_one_deleted_snapshot() calls btrfs_kill_all_delayed_nodes(),
it will loop forever because the delayed_nodes xarray will never become
empty (unless memory pressure forces the inode out). We saw this
manifest as soft lockups in production.
Fix it by only deleting the xarray entry if it matches the given inode
(using __xa_cmpxchg()).
Fixes: 310b2f5d5a ("btrfs: use an xarray to track open inodes in a root")
Cc: stable@vger.kernel.org # 6.11+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Co-authored-by: Leo Martins <loemra.dev@gmail.com>
Signed-off-by: Leo Martins <loemra.dev@gmail.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
With 64K page size (aarch64 with 64K page size config) and 4K btrfs
block size, the following workload can easily lead to a corrupted read:
mkfs.btrfs -f -s 4k $dev > /dev/null
mount -o compress $dev $mnt
xfs_io -f -c "pwrite -S 0xff 0 64k" $mnt/base > /dev/null
echo "correct result:"
od -Ad -t x1 $mnt/base
xfs_io -f -c "reflink $mnt/base 32k 0 32k" \
-c "reflink $mnt/base 0 32k 32k" \
-c "pwrite -S 0xff 60k 4k" $mnt/new > /dev/null
echo "incorrect result:"
od -Ad -t x1 $mnt/new
umount $mnt
This shows the following result:
correct result:
0000000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
*
0065536
incorrect result:
0000000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
*
0032768 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
0061440 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
*
0065536
Notice the zero in the range [32K, 60K), which is incorrect.
[CAUSE]
With extra trace printk, it shows the following events during od:
(some unrelated info removed like CPU and context)
od-3457 btrfs_do_readpage: enter r/i=5/258 folio=0(65536) prev_em_start=0000000000000000
The "r/i" is indicating the root and inode number. In our case the file
"new" is using ino 258 from fs tree (root 5).
Here notice the @prev_em_start pointer is NULL. This means the
btrfs_do_readpage() is called from btrfs_read_folio(), not from
btrfs_readahead().
od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=0 got em start=0 len=32768
od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=4096 got em start=0 len=32768
od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=8192 got em start=0 len=32768
od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=12288 got em start=0 len=32768
od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=16384 got em start=0 len=32768
od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=20480 got em start=0 len=32768
od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=24576 got em start=0 len=32768
od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=28672 got em start=0 len=32768
These above 32K blocks will be read from the first half of the
compressed data extent.
od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=32768 got em start=32768 len=32768
Note here there is no btrfs_submit_compressed_read() call. Which is
incorrect now.
Although both extent maps at 0 and 32K are pointing to the same compressed
data, their offsets are different thus can not be merged into the same
read.
So this means the compressed data read merge check is doing something
wrong.
od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=36864 got em start=32768 len=32768
od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=40960 got em start=32768 len=32768
od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=45056 got em start=32768 len=32768
od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=49152 got em start=32768 len=32768
od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=53248 got em start=32768 len=32768
od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=57344 got em start=32768 len=32768
od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=61440 skip uptodate
od-3457 btrfs_submit_compressed_read: cb orig_bio: file off=0 len=61440
The function btrfs_submit_compressed_read() is only called at the end of
folio read. The compressed bio will only have an extent map of range [0,
32K), but the original bio passed in is for the whole 64K folio.
This will cause the decompression part to only fill the first 32K,
leaving the rest untouched (aka, filled with zero).
This incorrect compressed read merge leads to the above data corruption.
There were similar problems that happened in the past, commit 808f80b467
("Btrfs: update fix for read corruption of compressed and shared
extents") is doing pretty much the same fix for readahead.
But that's back to 2015, where btrfs still only supports bs (block size)
== ps (page size) cases.
This means btrfs_do_readpage() only needs to handle a folio which
contains exactly one block.
Only btrfs_readahead() can lead to a read covering multiple blocks.
Thus only btrfs_readahead() passes a non-NULL @prev_em_start pointer.
With v5.15 kernel btrfs introduced bs < ps support. This breaks the above
assumption that a folio can only contain one block.
Now btrfs_read_folio() can also read multiple blocks in one go.
But btrfs_read_folio() doesn't pass a @prev_em_start pointer, thus the
existing bio force submission check will never be triggered.
In theory, this can also happen for btrfs with large folios, but since
large folio is still experimental, we don't need to bother it, thus only
bs < ps support is affected for now.
[FIX]
Instead of passing @prev_em_start to do the proper compressed extent
check, introduce one new member, btrfs_bio_ctrl::last_em_start, so that
the existing bio force submission logic will always be triggered.
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The compression level is meaningless for lzo, but before commit
3f093ccb95 ("btrfs: harden parsing of compression mount options"),
it was silently ignored if passed.
After that commit, passing a level with lzo fails to mount:
BTRFS error: unrecognized compression value lzo:1
It seems reasonable for users to expect that lzo would permit a numeric
level option, as all the other algos do, even though the kernel's
implementation of LZO currently only supports a single level. Because it
has always worked to pass a level, it seems likely to me that users in
the real world are relying on doing so.
This patch restores the old behavior, giving "lzo:N" the same semantics
as all of the other compression algos.
To be clear, silly variants like "lzo:one", "lzo:the_first_option", or
"lzo:armageddon" also used to work. This isn't meant to suggest that
any possible mis-interpretation of mount options that once worked must
continue to work forever. This is an exceptional case where it makes
sense to preserve compatibility, both because the mis-interpretation is
reasonable, and because nothing tangible is sacrificed.
Finally update btrfs_show_options() to ignore the level of LZO, as it
is only the default level without any extra meaning.
Fixes: 3f093ccb95 ("btrfs: harden parsing of compression mount options")
Reviewed-by: Daniel Vacek <neelx@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Calvin Owens <calvin@wbinvd.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The following workload on a squota enabled fs:
btrfs subvol create mnt/subvol
# ensure subvol extents get accounted
sync
btrfs qgroup create 1/1 mnt
btrfs qgroup assign mnt/subvol 1/1 mnt
btrfs qgroup delete mnt/subvol
# make the cleaner thread run
btrfs filesystem sync mnt
sleep 1
btrfs filesystem sync mnt
btrfs qgroup destroy 1/1 mnt
will fail with EBUSY. The reason is that 1/1 does the quick accounting
when we assign subvol to it, gaining its exclusive usage as excl and
excl_cmpr. But then when we delete subvol, the decrement happens via
record_squota_delta() which does not update excl_cmpr, as squotas does
not make any distinction between compressed and normal extents. Thus,
we increment excl_cmpr but never decrement it, and are unable to delete
1/1. The two possible fixes are to make squota always mirror excl and
excl_cmpr or to make the fast accounting separately track the plain and
cmpr numbers. The latter felt cleaner to me so that is what I opted for.
Fixes: 1e0e9d5771 ("btrfs: add helper for recording simple quota deltas")
CC: stable@vger.kernel.org # 6.12+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
Since the introduction of pid namespaces, their interaction with procfs
has been entirely implicit in ways that require a lot of dancing around
by programs that need to construct sandboxes with different PID
namespaces.
Being able to explicitly specify the pid namespace to use when
constructing a procfs super block will allow programs to no longer need
to fork off a process which does then does unshare(2) / setns(2) and
forks again in order to construct a procfs in a pidns.
So, provide a "pidns" mount option which allows such users to just
explicitly state which pid namespace they want that procfs instance to
use. This interface can be used with fsconfig(2) either with a file
descriptor or a path:
fsconfig(procfd, FSCONFIG_SET_FD, "pidns", NULL, nsfd);
fsconfig(procfd, FSCONFIG_SET_STRING, "pidns", "/proc/self/ns/pid", 0);
or with classic mount(2) / mount(8):
// mount -t proc -o pidns=/proc/self/ns/pid proc /tmp/proc
mount("proc", "/tmp/proc", "proc", MS_..., "pidns=/proc/self/ns/pid");
As this new API is effectively shorthand for setns(2) followed by
mount(2), the permission model for this mirrors pidns_install() to avoid
opening up new attack surfaces by loosening the existing permission
model.
In order to avoid having to RCU-protect all users of proc_pid_ns() (to
avoid UAFs), attempting to reconfigure an existing procfs instance's pid
namespace will error out with -EBUSY. Creating new procfs instances is
quite cheap, so this should not be an impediment to most users, and lets
us avoid a lot of churn in fs/proc/* for a feature that it seems
unlikely userspace would use.
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Link: https://lore.kernel.org/20250805-procfs-pidns-api-v4-2-705f984940e7@cyphar.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
Remove extra logic in fuse_readpages_end() that checks against null
folio mappings. This was added in commit ce534fb052 ("fuse: allow
splice to move pages"):
"Since the remove_from_page_cache() + add_to_page_cache_locked()
are non-atomic it is possible that the page cache is repopulated in
between the two and add_to_page_cache_locked() will fail. This
could be fixed by creating a new atomic replace_page_cache_page()
function.
fuse_readpages_end() needed to be reworked so it works even if
page->mapping is NULL for some or all pages which can happen if the
add_to_page_cache_locked() failed."
Commit ef6a3c6311 ("mm: add replace_page_cache_page() function") added
atomic page cache replacement, which means the check against null
mappings can be removed.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
FUSE_INIT has always been asynchronous with mount. That means that the
server processed this request after the mount syscall returned.
This means that FUSE_INIT can't supply the root inode's ID, hence it
currently has a hardcoded value. There are other limitations such as not
being able to perform getxattr during mount, which is needed by selinux.
To remove these limitations allow server to process FUSE_INIT while
initializing the in-core super block for the fuse filesystem. This can
only be done if the server is prepared to handle this, so add
FUSE_DEV_IOC_SYNC_INIT ioctl, which
a) lets the server know whether this feature is supported, returning
ENOTTY othewrwise.
b) lets the kernel know to perform a synchronous initialization
The implementation is slightly tricky, since fuse_dev/fuse_conn are set up
only during super block creation. This is solved by setting the private
data of the fuse device file to a special value ((struct fuse_dev *) 1) and
waiting for this to be turned into a proper fuse_dev before commecing with
operations on the device file.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
With the introduction of clone3 in commit 7f192e3cd3 ("fork: add
clone3") the effective bit width of clone_flags on all architectures was
increased from 32-bit to 64-bit, with a new type of u64 for the flags.
However, for most consumers of clone_flags the interface was not
changed from the previous type of unsigned long.
While this works fine as long as none of the new 64-bit flag bits
(CLONE_CLEAR_SIGHAND and CLONE_INTO_CGROUP) are evaluated, this is still
undesirable in terms of the principle of least surprise.
Thus, this commit fixes all relevant interfaces of callees to
sys_clone3/copy_process (excluding the architecture-specific
copy_thread) to consistently pass clone_flags as u64, so that
no truncation to 32-bit integers occurs on 32-bit architectures.
Signed-off-by: Simon Schuster <schuster.simon@siemens-energy.com>
Link: https://lore.kernel.org/20250901-nios2-implement-clone3-v2-2-53fcf5577d57@siemens-energy.com
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
fuse fixes for 6.17-rc5
* tag 'fuse-fixes-6.17-rc5' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: (6 commits)
fuse: Block access to folio overlimit
fuse: fix fuseblk i_blkbits for iomap partial writes
fuse: reflect cached blocksize if blocksize was changed
fuse: prevent overflow in copy_file_range return value
fuse: check if copy_file_range() returns larger than requested size
fuse: do not allow mapping a non-regular backing file
Link: https://lore.kernel.org/CAJfpeguEVMMyw_zCb+hbOuSxdE2Z3Raw=SJsq=Y56Ae6dn2W3g@mail.gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
Currently, if we are the last iput, and we have the I_DIRTY_TIME bit
set, we will grab a reference on the inode again and then mark it dirty
and then redo the put. This is to make sure we delay the time update
for as long as possible.
We can rework this logic to simply dec i_count if it is not 1, and if it
is do the time update while still holding the i_count reference.
Then we can replace the atomic_dec_and_lock with locking the ->i_lock
and doing atomic_dec_and_test, since we did the atomic_add_unless above.
Co-developed-by: Mateusz Guzik <mjguzik@gmail.com>
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Link: https://lore.kernel.org/be208b89bdb650202e712ce2bcfc407ac7044c7a.1756222464.git.josef@toxicpanda.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
Currently, hfs_brec_remove() executes moving records
towards the location of deleted record and it updates
offsets of moved records. However, the hfs_brec_remove()
logic ignores the "mess" of b-tree node's free space and
it doesn't touch the offsets out of records number.
Potentially, it could confuse fsck or driver logic or
to be a reason of potential corruption cases.
This patch reworks the logic of hfs_brec_remove()
by means of clearing freed space of b-tree node
after the records moving. And it clear the last
offset that keeping old location of free space
because now the offset before this one is keeping
the actual offset to the free space after the record
deletion.
Signed-off-by: Viacheslav Dubeyko <slava@dubeyko.com>
cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
cc: Yangtao Li <frank.li@vivo.com>
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/20250815194918.38165-1-slava@dubeyko.com
Signed-off-by: Viacheslav Dubeyko <slava@dubeyko.com>
The syzbot reported issue in hfs_find_set_zero_bits():
=====================================================
BUG: KMSAN: uninit-value in hfs_find_set_zero_bits+0x74d/0xb60 fs/hfs/bitmap.c:45
hfs_find_set_zero_bits+0x74d/0xb60 fs/hfs/bitmap.c:45
hfs_vbm_search_free+0x13c/0x5b0 fs/hfs/bitmap.c:151
hfs_extend_file+0x6a5/0x1b00 fs/hfs/extent.c:408
hfs_get_block+0x435/0x1150 fs/hfs/extent.c:353
__block_write_begin_int+0xa76/0x3030 fs/buffer.c:2151
block_write_begin fs/buffer.c:2262 [inline]
cont_write_begin+0x10e1/0x1bc0 fs/buffer.c:2601
hfs_write_begin+0x85/0x130 fs/hfs/inode.c:52
cont_expand_zero fs/buffer.c:2528 [inline]
cont_write_begin+0x35a/0x1bc0 fs/buffer.c:2591
hfs_write_begin+0x85/0x130 fs/hfs/inode.c:52
hfs_file_truncate+0x1d6/0xe60 fs/hfs/extent.c:494
hfs_inode_setattr+0x964/0xaa0 fs/hfs/inode.c:654
notify_change+0x1993/0x1aa0 fs/attr.c:552
do_truncate+0x28f/0x310 fs/open.c:68
do_ftruncate+0x698/0x730 fs/open.c:195
do_sys_ftruncate fs/open.c:210 [inline]
__do_sys_ftruncate fs/open.c:215 [inline]
__se_sys_ftruncate fs/open.c:213 [inline]
__x64_sys_ftruncate+0x11b/0x250 fs/open.c:213
x64_sys_call+0xfe3/0x3db0 arch/x86/include/generated/asm/syscalls_64.h:78
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0xd9/0x210 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
Uninit was created at:
slab_post_alloc_hook mm/slub.c:4154 [inline]
slab_alloc_node mm/slub.c:4197 [inline]
__kmalloc_cache_noprof+0x7f7/0xed0 mm/slub.c:4354
kmalloc_noprof include/linux/slab.h:905 [inline]
hfs_mdb_get+0x1cc8/0x2a90 fs/hfs/mdb.c:175
hfs_fill_super+0x3d0/0xb80 fs/hfs/super.c:337
get_tree_bdev_flags+0x6e3/0x920 fs/super.c:1681
get_tree_bdev+0x38/0x50 fs/super.c:1704
hfs_get_tree+0x35/0x40 fs/hfs/super.c:388
vfs_get_tree+0xb0/0x5c0 fs/super.c:1804
do_new_mount+0x738/0x1610 fs/namespace.c:3902
path_mount+0x6db/0x1e90 fs/namespace.c:4226
do_mount fs/namespace.c:4239 [inline]
__do_sys_mount fs/namespace.c:4450 [inline]
__se_sys_mount+0x6eb/0x7d0 fs/namespace.c:4427
__x64_sys_mount+0xe4/0x150 fs/namespace.c:4427
x64_sys_call+0xfa7/0x3db0 arch/x86/include/generated/asm/syscalls_64.h:166
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0xd9/0x210 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
CPU: 1 UID: 0 PID: 12609 Comm: syz.1.2692 Not tainted 6.16.0-syzkaller #0 PREEMPT(none)
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 07/12/2025
=====================================================
The HFS_SB(sb)->bitmap buffer is allocated in hfs_mdb_get():
HFS_SB(sb)->bitmap = kmalloc(8192, GFP_KERNEL);
Finally, it can trigger the reported issue because kmalloc()
doesn't clear the allocated memory. If allocated memory contains
only zeros, then everything will work pretty fine.
But if the allocated memory contains the "garbage", then
it can affect the bitmap operations and it triggers
the reported issue.
This patch simply exchanges the kmalloc() on kzalloc()
with the goal to guarantee the correctness of bitmap operations.
Because, newly created allocation bitmap should have all
available blocks free. Potentially, initialization bitmap's read
operation could not fill the whole allocated memory and
"garbage" in the not initialized memory will be the reason of
volume coruptions and file system driver bugs.
Reported-by: syzbot <syzbot+773fa9d79b29bd8b6831@syzkaller.appspotmail.com>
Closes: https://syzkaller.appspot.com/bug?extid=773fa9d79b29bd8b6831
Signed-off-by: Viacheslav Dubeyko <slava@dubeyko.com>
cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
cc: Yangtao Li <frank.li@vivo.com>
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/20250820230636.179085-1-slava@dubeyko.com
Signed-off-by: Viacheslav Dubeyko <slava@dubeyko.com>
Potenatially, __hfs_ext_read_extent() could operate by
not initialized values of fd->key after hfs_brec_find() call:
static inline int __hfs_ext_read_extent(struct hfs_find_data *fd, struct hfs_extent *extent,
u32 cnid, u32 block, u8 type)
{
int res;
hfs_ext_build_key(fd->search_key, cnid, block, type);
fd->key->ext.FNum = 0;
res = hfs_brec_find(fd);
if (res && res != -ENOENT)
return res;
if (fd->key->ext.FNum != fd->search_key->ext.FNum ||
fd->key->ext.FkType != fd->search_key->ext.FkType)
return -ENOENT;
if (fd->entrylength != sizeof(hfs_extent_rec))
return -EIO;
hfs_bnode_read(fd->bnode, extent, fd->entryoffset, sizeof(hfs_extent_rec));
return 0;
}
This patch changes kmalloc() on kzalloc() in hfs_find_init()
and intializes fd->record, fd->keyoffset, fd->keylength,
fd->entryoffset, fd->entrylength for the case if hfs_brec_find()
has been found nothing in the b-tree node.
Signed-off-by: Viacheslav Dubeyko <slava@dubeyko.com>
cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
cc: Yangtao Li <frank.li@vivo.com>
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/20250818225252.126427-1-slava@dubeyko.com
Signed-off-by: Viacheslav Dubeyko <slava@dubeyko.com>
If the client sends SMB2_CREATE_POSIX_CONTEXT to ksmbd, allow the filename
to contain a colon (':'). This requires disabling the support for Alternate
Data Streams (ADS), which are denoted by a colon-separated suffix to the
filename on Windows. This should not be an issue, since this concept is not
known to POSIX anyway and the client has to explicitly request a POSIX
context to get this behavior.
Link: https://lore.kernel.org/all/f9401718e2be2ab22058b45a6817db912784ef61.camel@rx2.rx-server.de/
Signed-off-by: Philipp Kerling <pkerling@casix.org>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reading /proc/fs/cifs/open_dirs may hit a NULL dereference when
tcon->cfids is NULL.
Add NULL check before accessing cfids to prevent the crash.
Reproduction:
- Mount CIFS share
- cat /proc/fs/cifs/open_dirs
Fixes: 844e5c0eb1 ("smb3 client: add way to show directory leases for improved debugging")
Signed-off-by: Wang Zhaolong <wangzhaolong@huaweicloud.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Introduces two new sys nodes: allocate_section_hint and
allocate_section_policy. The allocate_section_hint identifies the boundary
between devices, measured in sections; it defaults to the end of the device
for single storage setups, and the end of the first device for multiple
storage setups. The allocate_section_policy determines the write strategy,
with a default value of 0 for normal sequential write strategy. A value of
1 prioritizes writes before the allocate_section_hint, while a value of 2
prioritizes writes after it.
This strategy addresses the issue where, despite F2FS supporting multiple
devices, SOC vendors lack multi-devices support (currently only supporting
zoned devices). As a workaround, multiple storage devices are mapped to a
single dm device. Both this workaround and the F2FS multi-devices solution
may require prioritizing writing to certain devices, such as a device with
better performance or when switching is needed due to performance
degradation near a device's end. For scenarios with more than two devices,
sort them at mount time to utilize this feature.
When using this feature with a single storage device, it has almost no
impact. However, for configurations where multiple storage devices are
mapped to the same dm device using F2FS, utilizing this feature can provide
some optimization benefits. Therefore, I believe it should not be limited
to just multi-devices usage.
Signed-off-by: Liao Yuanhong <liaoyuanhong@vivo.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
nfs_server_set_fsinfo() shouldn't assume that NFS_CAP_XATTR is unset
on entry to the function.
Fixes: b78ef845c3 ("NFSv4.2: query the server for extended attribute support")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
_nfs4_server_capabilities() should clear capabilities that are not
supported by the server.
Fixes: d2a00cceb9 ("NFSv4: Detect support for OPEN4_SHARE_ACCESS_WANT_OPEN_XOR_DELEGATION")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
_nfs4_server_capabilities() is expected to clear any flags that are not
supported by the server.
Fixes: 8a59bb93b7 ("NFSv4 store server support for fs_location attribute")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Don't clear the capabilities that are not going to get reset by the call
to _nfs4_server_capabilities().
Reported-by: Scott Haiden <scott.b.haiden@gmail.com>
Fixes: b01f21cacd ("NFS: Fix the setting of capabilities when automounting a new filesystem")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Pull EFI fixes from Ard Biesheuvel:
- Assorted fixes for the OP-TEE based pseudo-EFI variable store
- Fix for an OOB access when looking up the same non-existing efivarfs
entry multiple times in parallel
* tag 'efi-fixes-for-v6.17-1' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi:
efivarfs: Fix slab-out-of-bounds in efivarfs_d_compare
efi: stmm: Drop unneeded null pointer check
efi: stmm: Drop unused EFI error from setup_mm_hdr arguments
efi: stmm: Do not return EFI_OUT_OF_RESOURCES on internal errors
efi: stmm: Fix incorrect buffer allocation method
Pull smb client fixes from Steve French:
- Fix possible refcount leak in compound operations
- Fix remap_file_range() return code mapping, found by generic/157
* tag 'v6.17-rc3-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
fs/smb: Fix inconsistent refcnt update
smb3 client: fix return code mapping of remap_file_range
Pull xfs fixes from Carlos Maiolino:
"The highlight I'd like to point here is related to the XFS_RT
Kconfig, which has been updated to be enabled by default now if
CONFIG_BLK_DEV_ZONED is enabled.
This also contains a few fixes for zoned devices support in XFS,
specially related to swapon requests in inodes belonging to the zoned
FS.
A null-ptr dereference fix in the xattr data, due to a mishandling of
medium errors generated by block devices is also included"
* tag 'xfs-fixes-6.17-rc4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: do not propagate ENODATA disk errors into xattr code
xfs: reject swapon for inodes on a zoned file system earlier
xfs: kick off inodegc when failing to reserve zoned blocks
xfs: remove xfs_last_used_zone
xfs: Default XFS_RT to Y if CONFIG_BLK_DEV_ZONED is enabled
For a user mode library to avoid generating SIGPIPE signals (e.g.
because this behaviour is not portable across operating systems) is
cumbersome. It is generally bad form to change the process-wide signal
mask in a library, so a local solution is needed instead.
For I/O performed directly using system calls (synchronous or readiness
based asynchronous) this currently involves applying a thread-specific
signal mask before the operation and reverting it afterwards. This can be
avoided when it is known that the file descriptor refers to neither a
pipe nor a socket, but a conservative implementation must always apply
the mask. This incurs the cost of two additional system calls. In the
case of sockets, the existing MSG_NOSIGNAL flag can be used with send.
For asynchronous I/O performed using io_uring, currently the only option
(apart from MSG_NOSIGNAL for sockets), is to mask SIGPIPE entirely in the
call to io_uring_enter. Thankfully io_uring_enter takes a signal mask, so
only a single syscall is needed. However, copying the signal mask on
every call incurs a non-zero performance penalty. Furthermore, this mask
applies to all completions, meaning that if the non-signaling behaviour
is desired only for some subset of operations, the desired signals must
be raised manually from user-mode depending on the completed operation.
Add RWF_NOSIGNAL flag for pwritev2. This flag prevents the SIGPIPE signal
from being raised when writing on disconnected pipes or sockets. The flag
is handled directly by the pipe filesystem and converted to the existing
MSG_NOSIGNAL flag for sockets.
Signed-off-by: Lauri Vasama <git@vasama.org>
Link: https://lore.kernel.org/20250827133901.1820771-1-git@vasama.org
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Commit 620c266f39 ("fhandle: relax open_by_handle_at() permission
checks") relaxed the coditions for decoding a file handle from non init
userns.
The conditions are that that decoded dentry is accessible from the user
provided mountfd (or to fs root) and that all the ancestors along the
path have a valid id mapping in the userns.
These conditions are intentionally more strict than the condition that
the decoded dentry should be "lookable" by path from the mountfd.
For example, the path /home/amir/dir/subdir is lookable by path from
unpriv userns of user amir, because /home perms is 755, but the owner of
/home does not have a valid id mapping in unpriv userns of user amir.
The current code did not check that the decoded dentry itself has a
valid id mapping in the userns. There is no security risk in that,
because that final open still performs the needed permission checks,
but this is inconsistent with the checks performed on the ancestors,
so the behavior can be a bit confusing.
Add the check for the decoded dentry itself, so that the entire path,
including the last component has a valid id mapping in the userns.
Fixes: 620c266f39 ("fhandle: relax open_by_handle_at() permission checks")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Link: https://lore.kernel.org/20250827194309.1259650-1-amir73il@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>