Pull btrfs fixes from David Sterba:
- fix delayed inode tracking in xarray, eviction can race with
insertion and leave behind a disconnected inode
- on systems with large page (64K) and small block size (4K) fix
compression read that can return partially filled folio
- slightly relax compression option format for backward compatibility,
allow to specify level for LZO although there's only one
- fix simple quota accounting of compressed extents
- validate minimum device size in 'device add'
- update maintainers' entry
* tag 'for-6.17-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: don't allow adding block device of less than 1 MB
MAINTAINERS: update btrfs entry
btrfs: fix subvolume deletion lockup caused by inodes xarray race
btrfs: fix corruption reading compressed range when block size is smaller than page size
btrfs: accept and ignore compression level for lzo
btrfs: fix squota compressed stats leak
Pull misc fixes from Andrew Morton:
"20 hotfixes. 15 are cc:stable and the remainder address post-6.16
issues or aren't considered necessary for -stable kernels. 14 of these
fixes are for MM.
This includes
- kexec fixes from Breno for a recently introduced
use-uninitialized bug
- DAMON fixes from Quanmin Yan to avoid div-by-zero crashes
which can occur if the operator uses poorly-chosen insmod
parameters
and misc singleton fixes"
* tag 'mm-hotfixes-stable-2025-09-10-20-00' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
MAINTAINERS: add tree entry to numa memblocks and emulation block
mm/damon/sysfs: fix use-after-free in state_show()
proc: fix type confusion in pde_set_flags()
compiler-clang.h: define __SANITIZE_*__ macros only when undefined
mm/vmalloc, mm/kasan: respect gfp mask in kasan_populate_vmalloc()
ocfs2: fix recursive semaphore deadlock in fiemap call
mm/memory-failure: fix VM_BUG_ON_PAGE(PagePoisoned(page)) when unpoison memory
mm/mremap: fix regression in vrm->new_addr check
percpu: fix race on alloc failed warning limit
mm/memory-failure: fix redundant updates for already poisoned pages
s390: kexec: initialize kexec_buf struct
riscv: kexec: initialize kexec_buf struct
arm64: kexec: initialize kexec_buf struct in load_other_segments()
mm/damon/reclaim: avoid divide-by-zero in damon_reclaim_apply_parameters()
mm/damon/lru_sort: avoid divide-by-zero in damon_lru_sort_apply_parameters()
mm/damon/core: set quota->charged_from to jiffies at first charge window
mm/hugetlb: add missing hugetlb_lock in __unmap_hugepage_range()
init/main.c: fix boot time tracing crash
mm/memory_hotplug: fix hwpoisoned large folio handling in do_migrate_range()
mm/khugepaged: fix the address passed to notifier on testing young
Pull NFS client fixes from Trond Myklebust:
"Stable patches:
- Revert "SUNRPC: Don't allow waiting for exiting tasks" as it is
breaking ltp tests
Bugfixes:
- Another set of fixes to the tracking of NFSv4 server capabilities
when crossing filesystem boundaries
- Localio fix to restore credentials and prevent triggering a
BUG_ON()
- Fix to prevent flapping of the localio on/off trigger
- Protections against 'eof page pollution' as demonstrated in
xfstests generic/363
- Series of patches to ensure correct ordering of O_DIRECT i/o and
truncate, fallocate and copy functions
- Fix a NULL pointer check in flexfiles reads that regresses 6.17
- Correct a typo that breaks flexfiles layout segment processing"
* tag 'nfs-for-6.17-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
NFSv4/flexfiles: Fix layout merge mirror check.
SUNRPC: call xs_sock_process_cmsg for all cmsg
Revert "SUNRPC: Don't allow waiting for exiting tasks"
NFS: Fix the marking of the folio as up to date
NFS: nfs_invalidate_folio() must observe the offset and size arguments
NFSv4.2: Serialise O_DIRECT i/o and copy range
NFSv4.2: Serialise O_DIRECT i/o and clone range
NFSv4.2: Serialise O_DIRECT i/o and fallocate()
NFS: Serialise O_DIRECT i/o and truncate()
NFSv4.2: Protect copy offload and clone against 'eof page pollution'
NFS: Protect against 'eof page pollution'
flexfiles/pNFS: fix NULL checks on result of ff_layout_choose_ds_for_read
nfs/localio: avoid bouncing LOCALIO if nfs_client_is_local()
nfs/localio: restore creds before releasing pageio data
NFSv4: Clear the NFS_CAP_XATTR flag if not supported by the server
NFSv4: Clear NFS_CAP_OPEN_XOR and NFS_CAP_DELEGTIME if not supported
NFSv4: Clear the NFS_CAP_FS_LOCATIONS flag if it is not set
NFSv4: Don't clear capabilities that won't be reset
Typo in ff_lseg_match_mirrors makes the diff ineffective. This results
in merge happening all the time. Merge happening all the time is
problematic because it marks lsegs invalid. Marking lsegs invalid
causes all outstanding IO to get restarted with EAGAIN and connections
to get closed.
Closing connections constantly triggers race conditions in the RDMA
implementation...
Fixes: 660d1eb223 ("pNFS/flexfile: Don't merge layout segments if the mirrors don't match")
Signed-off-by: Jonathan Curley <jcurley@purestorage.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Pull vfs fixes from Christian Brauner:
"fuse:
- Prevent opening of non-regular backing files.
Fuse doesn't support non-regular files anyway.
- Check whether copy_file_range() returns a larger size than
requested.
- Prevent overflow in copy_file_range() as fuse currently only
supports 32-bit sized copies.
- Cache the blocksize value if the server returned a new value as
inode->i_blkbits isn't modified directly anymore.
- Fix i_blkbits handling for iomap partial writes.
By default i_blkbits is set to PAGE_SIZE which causes iomap to mark
the whole folio as uptodate even on a partial write. But fuseblk
filesystems support choosing a blocksize smaller than PAGE_SIZE
risking data corruption. Simply enforce PAGE_SIZE as blocksize for
fuseblk's internal inode for now.
- Prevent out-of-bounds acces in fuse_dev_write() when the number of
bytes to be retrieved is truncated to the fc->max_pages limit.
virtiofs:
- Fix page faults for DAX page addresses.
Misc:
- Tighten file handle decoding from userns.
Check that the decoded dentry itself has a valid idmapping in the
user namespace.
- Fix mount-notify selftests.
- Fix some indentation errors.
- Add an FMODE_ flag to indicate IOCB_HAS_METADATA availability.
This will be moved to an FOP_* flag with a bit more rework needed
for that to happen not suitable for a fix.
- Don't silently ignore metadata for sync read/write.
- Don't pointlessly log warning when reading coredump sysctls"
* tag 'vfs-6.17-rc6.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
fuse: virtio_fs: fix page fault for DAX page address
selftests/fs/mount-notify: Fix compilation failure.
fhandle: use more consistent rules for decoding file handle from userns
fuse: Block access to folio overlimit
fuse: fix fuseblk i_blkbits for iomap partial writes
fuse: reflect cached blocksize if blocksize was changed
fuse: prevent overflow in copy_file_range return value
fuse: check if copy_file_range() returns larger than requested size
fuse: do not allow mapping a non-regular backing file
coredump: don't pointlessly check and spew warnings
fs: fix indentation style
block: don't silently ignore metadata for sync read/write
fs: add a FMODE_ flag to indicate IOCB_HAS_METADATA availability
Please enter a commit message to explain why this merge is necessary,
especially if it merges an updated upstream into a topic branch.
Since all callers of nfs_page_group_covers_page() have already ensured
that there is only one group member, all that is required is to check
that the entire folio contains dirty data.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
If we're truncating part of the folio, then we need to write out the
data on the part that is not covered by the cancellation.
Fixes: d47992f86b ("mm: change invalidatepage prototype to accept length")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Ensure that all O_DIRECT reads and writes complete before copying a file
range, so that the destination is up to date.
Fixes: a5864c999d ("NFS: Do not serialise O_DIRECT reads and writes")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Ensure that all O_DIRECT reads and writes complete before cloning a file
range, so that both the source and destination are up to date.
Fixes: a5864c999d ("NFS: Do not serialise O_DIRECT reads and writes")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Ensure that all O_DIRECT reads and writes complete before calling
fallocate so that we don't race w.r.t. attribute updates.
Fixes: 99f2378322 ("NFSv4.2: Always flush out writes in nfs42_proc_fallocate()")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Ensure that all O_DIRECT reads and writes are complete, and prevent the
initiation of new i/o until the setattr operation that will truncate the
file is complete.
Fixes: a5864c999d ("NFS: Do not serialise O_DIRECT reads and writes")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
The NFSv4.2 copy offload and clone functions can also end up extending
the size of the destination file, so they too need to call
nfs_truncate_last_folio().
Reported-by: Olga Kornievskaia <okorniev@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
This commit fixes the failing xfstest 'generic/363'.
When the user mmaps() an area that extends beyond the end of file, and
proceeds to write data into the folio that straddles that eof, we're
required to discard that folio data if the user calls some function that
extends the file length.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Recent commit f06bedfa62 ("pNFS/flexfiles: don't attempt pnfs on fatal DS
errors") has changed the error return type of ff_layout_choose_ds_for_read() from
NULL to an error pointer. However, not all code paths have been updated
to match the change. Thus, some non-NULL checks will accept error pointers
as a valid return value.
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Suggested-by: Dan Carpenter <dan.carpenter@linaro.org>
Fixes: f06bedfa62 ("pNFS/flexfiles: don't attempt pnfs on fatal DS errors")
Signed-off-by: Tigran Mkrtchyan <tigran.mkrtchyan@desy.de>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Previously nfs_local_probe() was made to disable and then attempt to
re-enable LOCALIO (via LOCALIO protocol handshake) if/when it was
called and LOCALIO already enabled.
Vague memory for _why_ this was the case is that this was useful
if/when a local NFS server were to be restarted with a local NFS
client connected to it.
But as it happens this causes an absurd amount of LOCALIO flapping
which has a side-effect of too much IO being needlessly sent to NFSD
(using RPC over the loopback network interface). This is the
definition of "serious performance loss" (that negates the point of
having LOCALIO).
So remove this mis-optimization for re-enabling LOCALIO if/when an NFS
server is restarted (which is an extremely rare thing to do). Will
revisit testing that scenario again but in the meantime this patch
restores the full benefit of LOCALIO.
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: NeilBrown <neil@brown.name>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Pull smb client fixes from Steve French:
- Fix two potential NULL pointer references
- Two debugging improvements (to help debug recent issues) a new
tracepoint, and minor improvement to DebugData
- Trivial comment cleanup
* tag '6.17-RC4-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
cifs: prevent NULL pointer dereference in UTF16 conversion
smb: client: show negotiated cipher in DebugData
smb: client: add new tracepoint to trace lease break notification
smb: client: fix spellings in comments
smb: client: Fix NULL pointer dereference in cifs_debug_dirs_proc_show()
Commit 15ae0410c37a79 ("btrfs-progs: add error handling for
device_get_partition_size_fd_stat()") in btrfs-progs inadvertently
changed it so that if the BLKGETSIZE64 ioctl on a block device returned
a size of 0, this was no longer seen as an error condition.
Unfortunately this is how disconnected NBD devices behave, meaning that
with btrfs-progs 6.16 it's now possible to add a device you can't
remove:
# btrfs device add /dev/nbd0 /root/temp
# btrfs device remove /dev/nbd0 /root/temp
ERROR: error removing device '/dev/nbd0': Invalid argument
This check should always have been done kernel-side anyway, so add a
check in btrfs_init_new_device() that the new device doesn't have a size
less than BTRFS_DEVICE_RANGE_RESERVED (i.e. 1 MB).
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There can be a NULL pointer dereference bug here. NULL is passed to
__cifs_sfu_make_node without checks, which passes it unchecked to
cifs_strndup_to_utf16, which in turn passes it to
cifs_local_to_utf16_bytes where '*from' is dereferenced, causing a crash.
This patch adds a check for NULL 'src' in cifs_strndup_to_utf16 and
returns NULL early to prevent dereferencing NULL pointer.
Found by Linux Verification Center (linuxtesting.org) with SVACE
Signed-off-by: Makar Semyonov <m.semenov@tssltd.ru>
Cc: stable@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
Pull smb server fix from Steve French:
- fix handling filenames with ":" (colon) in them
* tag 'v6.17-rc4-ksmbd-fix' of git://git.samba.org/ksmbd:
ksmbd: allow a filename to contain colons on SMB3.1.1 posix extensions
Add smb3_lease_break_enter to trace lease break notifications,
recording lease state, flags, epoch, and lease key. Align
smb3_lease_not_found to use the same payload and print format.
Signed-off-by: Bharath SM <bharathsm@microsoft.com>
Acked-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Pull misc fixes from Andrew Morton:
"17 hotfixes. 13 are cc:stable and the remainder address post-6.16
issues or aren't considered necessary for -stable kernels. 11 of these
fixes are for MM.
This includes a three-patch series from Harry Yoo which fixes an
intermittent boot failure which can occur on x86 systems. And a
two-patch series from Alexander Gordeev which fixes a KASAN crash on
S390 systems"
* tag 'mm-hotfixes-stable-2025-09-01-17-20' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
mm: fix possible deadlock in kmemleak
x86/mm/64: define ARCH_PAGE_TABLE_SYNC_MASK and arch_sync_kernel_mappings()
mm: introduce and use {pgd,p4d}_populate_kernel()
mm: move page table sync declarations to linux/pgtable.h
proc: fix missing pde_set_flags() for net proc files
mm: fix accounting of memmap pages
mm/damon/core: prevent unnecessary overflow in damos_set_effective_quota()
kexec: add KEXEC_FILE_NO_CMA as a legal flag
kasan: fix GCC mem-intrinsic prefix with sw tags
mm/kasan: avoid lazy MMU mode hazards
mm/kasan: fix vmalloc shadow memory (de-)population races
kunit: kasan_test: disable fortify string checker on kasan_strings() test
selftests/mm: fix FORCE_READ to read input value correctly
mm/userfaultfd: fix kmap_local LIFO ordering for CONFIG_HIGHPTE
ocfs2: prevent release journal inode after journal shutdown
rust: mm: mark VmaNew as transparent
of_numa: fix uninitialized memory nodes causing kernel panic
Pull btrfs fixes from David Sterba:
- fix a few races related to inode link count
- fix inode leak on failure to add link to inode
- move transaction aborts closer to where they happen
* tag 'for-6.17-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: avoid load/store tearing races when checking if an inode was logged
btrfs: fix race between setting last_dir_index_offset and inode logging
btrfs: fix race between logging inode and checking if it was logged before
btrfs: simplify error handling logic for btrfs_link()
btrfs: fix inode leak on failure to add link to inode
btrfs: abort transaction on failure to add link to inode
There is a race condition between inode eviction and inode caching that
can cause a live struct btrfs_inode to be missing from the root->inodes
xarray. Specifically, there is a window during evict() between the inode
being unhashed and deleted from the xarray. If btrfs_iget() is called
for the same inode in that window, it will be recreated and inserted
into the xarray, but then eviction will delete the new entry, leaving
nothing in the xarray:
Thread 1 Thread 2
---------------------------------------------------------------
evict()
remove_inode_hash()
btrfs_iget_path()
btrfs_iget_locked()
btrfs_read_locked_inode()
btrfs_add_inode_to_root()
destroy_inode()
btrfs_destroy_inode()
btrfs_del_inode_from_root()
__xa_erase
In turn, this can cause issues for subvolume deletion. Specifically, if
an inode is in this lost state, and all other inodes are evicted, then
btrfs_del_inode_from_root() will call btrfs_add_dead_root() prematurely.
If the lost inode has a delayed_node attached to it, then when
btrfs_clean_one_deleted_snapshot() calls btrfs_kill_all_delayed_nodes(),
it will loop forever because the delayed_nodes xarray will never become
empty (unless memory pressure forces the inode out). We saw this
manifest as soft lockups in production.
Fix it by only deleting the xarray entry if it matches the given inode
(using __xa_cmpxchg()).
Fixes: 310b2f5d5a ("btrfs: use an xarray to track open inodes in a root")
Cc: stable@vger.kernel.org # 6.11+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Co-authored-by: Leo Martins <loemra.dev@gmail.com>
Signed-off-by: Leo Martins <loemra.dev@gmail.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
With 64K page size (aarch64 with 64K page size config) and 4K btrfs
block size, the following workload can easily lead to a corrupted read:
mkfs.btrfs -f -s 4k $dev > /dev/null
mount -o compress $dev $mnt
xfs_io -f -c "pwrite -S 0xff 0 64k" $mnt/base > /dev/null
echo "correct result:"
od -Ad -t x1 $mnt/base
xfs_io -f -c "reflink $mnt/base 32k 0 32k" \
-c "reflink $mnt/base 0 32k 32k" \
-c "pwrite -S 0xff 60k 4k" $mnt/new > /dev/null
echo "incorrect result:"
od -Ad -t x1 $mnt/new
umount $mnt
This shows the following result:
correct result:
0000000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
*
0065536
incorrect result:
0000000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
*
0032768 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
0061440 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
*
0065536
Notice the zero in the range [32K, 60K), which is incorrect.
[CAUSE]
With extra trace printk, it shows the following events during od:
(some unrelated info removed like CPU and context)
od-3457 btrfs_do_readpage: enter r/i=5/258 folio=0(65536) prev_em_start=0000000000000000
The "r/i" is indicating the root and inode number. In our case the file
"new" is using ino 258 from fs tree (root 5).
Here notice the @prev_em_start pointer is NULL. This means the
btrfs_do_readpage() is called from btrfs_read_folio(), not from
btrfs_readahead().
od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=0 got em start=0 len=32768
od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=4096 got em start=0 len=32768
od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=8192 got em start=0 len=32768
od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=12288 got em start=0 len=32768
od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=16384 got em start=0 len=32768
od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=20480 got em start=0 len=32768
od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=24576 got em start=0 len=32768
od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=28672 got em start=0 len=32768
These above 32K blocks will be read from the first half of the
compressed data extent.
od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=32768 got em start=32768 len=32768
Note here there is no btrfs_submit_compressed_read() call. Which is
incorrect now.
Although both extent maps at 0 and 32K are pointing to the same compressed
data, their offsets are different thus can not be merged into the same
read.
So this means the compressed data read merge check is doing something
wrong.
od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=36864 got em start=32768 len=32768
od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=40960 got em start=32768 len=32768
od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=45056 got em start=32768 len=32768
od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=49152 got em start=32768 len=32768
od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=53248 got em start=32768 len=32768
od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=57344 got em start=32768 len=32768
od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=61440 skip uptodate
od-3457 btrfs_submit_compressed_read: cb orig_bio: file off=0 len=61440
The function btrfs_submit_compressed_read() is only called at the end of
folio read. The compressed bio will only have an extent map of range [0,
32K), but the original bio passed in is for the whole 64K folio.
This will cause the decompression part to only fill the first 32K,
leaving the rest untouched (aka, filled with zero).
This incorrect compressed read merge leads to the above data corruption.
There were similar problems that happened in the past, commit 808f80b467
("Btrfs: update fix for read corruption of compressed and shared
extents") is doing pretty much the same fix for readahead.
But that's back to 2015, where btrfs still only supports bs (block size)
== ps (page size) cases.
This means btrfs_do_readpage() only needs to handle a folio which
contains exactly one block.
Only btrfs_readahead() can lead to a read covering multiple blocks.
Thus only btrfs_readahead() passes a non-NULL @prev_em_start pointer.
With v5.15 kernel btrfs introduced bs < ps support. This breaks the above
assumption that a folio can only contain one block.
Now btrfs_read_folio() can also read multiple blocks in one go.
But btrfs_read_folio() doesn't pass a @prev_em_start pointer, thus the
existing bio force submission check will never be triggered.
In theory, this can also happen for btrfs with large folios, but since
large folio is still experimental, we don't need to bother it, thus only
bs < ps support is affected for now.
[FIX]
Instead of passing @prev_em_start to do the proper compressed extent
check, introduce one new member, btrfs_bio_ctrl::last_em_start, so that
the existing bio force submission logic will always be triggered.
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The compression level is meaningless for lzo, but before commit
3f093ccb95 ("btrfs: harden parsing of compression mount options"),
it was silently ignored if passed.
After that commit, passing a level with lzo fails to mount:
BTRFS error: unrecognized compression value lzo:1
It seems reasonable for users to expect that lzo would permit a numeric
level option, as all the other algos do, even though the kernel's
implementation of LZO currently only supports a single level. Because it
has always worked to pass a level, it seems likely to me that users in
the real world are relying on doing so.
This patch restores the old behavior, giving "lzo:N" the same semantics
as all of the other compression algos.
To be clear, silly variants like "lzo:one", "lzo:the_first_option", or
"lzo:armageddon" also used to work. This isn't meant to suggest that
any possible mis-interpretation of mount options that once worked must
continue to work forever. This is an exceptional case where it makes
sense to preserve compatibility, both because the mis-interpretation is
reasonable, and because nothing tangible is sacrificed.
Finally update btrfs_show_options() to ignore the level of LZO, as it
is only the default level without any extra meaning.
Fixes: 3f093ccb95 ("btrfs: harden parsing of compression mount options")
Reviewed-by: Daniel Vacek <neelx@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Calvin Owens <calvin@wbinvd.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The following workload on a squota enabled fs:
btrfs subvol create mnt/subvol
# ensure subvol extents get accounted
sync
btrfs qgroup create 1/1 mnt
btrfs qgroup assign mnt/subvol 1/1 mnt
btrfs qgroup delete mnt/subvol
# make the cleaner thread run
btrfs filesystem sync mnt
sleep 1
btrfs filesystem sync mnt
btrfs qgroup destroy 1/1 mnt
will fail with EBUSY. The reason is that 1/1 does the quick accounting
when we assign subvol to it, gaining its exclusive usage as excl and
excl_cmpr. But then when we delete subvol, the decrement happens via
record_squota_delta() which does not update excl_cmpr, as squotas does
not make any distinction between compressed and normal extents. Thus,
we increment excl_cmpr but never decrement it, and are unable to delete
1/1. The two possible fixes are to make squota always mirror excl and
excl_cmpr or to make the fast accounting separately track the plain and
cmpr numbers. The latter felt cleaner to me so that is what I opted for.
Fixes: 1e0e9d5771 ("btrfs: add helper for recording simple quota deltas")
CC: stable@vger.kernel.org # 6.12+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
fuse fixes for 6.17-rc5
* tag 'fuse-fixes-6.17-rc5' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: (6 commits)
fuse: Block access to folio overlimit
fuse: fix fuseblk i_blkbits for iomap partial writes
fuse: reflect cached blocksize if blocksize was changed
fuse: prevent overflow in copy_file_range return value
fuse: check if copy_file_range() returns larger than requested size
fuse: do not allow mapping a non-regular backing file
Link: https://lore.kernel.org/CAJfpeguEVMMyw_zCb+hbOuSxdE2Z3Raw=SJsq=Y56Ae6dn2W3g@mail.gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
If the client sends SMB2_CREATE_POSIX_CONTEXT to ksmbd, allow the filename
to contain a colon (':'). This requires disabling the support for Alternate
Data Streams (ADS), which are denoted by a colon-separated suffix to the
filename on Windows. This should not be an issue, since this concept is not
known to POSIX anyway and the client has to explicitly request a POSIX
context to get this behavior.
Link: https://lore.kernel.org/all/f9401718e2be2ab22058b45a6817db912784ef61.camel@rx2.rx-server.de/
Signed-off-by: Philipp Kerling <pkerling@casix.org>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reading /proc/fs/cifs/open_dirs may hit a NULL dereference when
tcon->cfids is NULL.
Add NULL check before accessing cfids to prevent the crash.
Reproduction:
- Mount CIFS share
- cat /proc/fs/cifs/open_dirs
Fixes: 844e5c0eb1 ("smb3 client: add way to show directory leases for improved debugging")
Signed-off-by: Wang Zhaolong <wangzhaolong@huaweicloud.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
nfs_server_set_fsinfo() shouldn't assume that NFS_CAP_XATTR is unset
on entry to the function.
Fixes: b78ef845c3 ("NFSv4.2: query the server for extended attribute support")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
_nfs4_server_capabilities() should clear capabilities that are not
supported by the server.
Fixes: d2a00cceb9 ("NFSv4: Detect support for OPEN4_SHARE_ACCESS_WANT_OPEN_XOR_DELEGATION")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
_nfs4_server_capabilities() is expected to clear any flags that are not
supported by the server.
Fixes: 8a59bb93b7 ("NFSv4 store server support for fs_location attribute")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Don't clear the capabilities that are not going to get reset by the call
to _nfs4_server_capabilities().
Reported-by: Scott Haiden <scott.b.haiden@gmail.com>
Fixes: b01f21cacd ("NFS: Fix the setting of capabilities when automounting a new filesystem")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Pull EFI fixes from Ard Biesheuvel:
- Assorted fixes for the OP-TEE based pseudo-EFI variable store
- Fix for an OOB access when looking up the same non-existing efivarfs
entry multiple times in parallel
* tag 'efi-fixes-for-v6.17-1' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi:
efivarfs: Fix slab-out-of-bounds in efivarfs_d_compare
efi: stmm: Drop unneeded null pointer check
efi: stmm: Drop unused EFI error from setup_mm_hdr arguments
efi: stmm: Do not return EFI_OUT_OF_RESOURCES on internal errors
efi: stmm: Fix incorrect buffer allocation method
Pull smb client fixes from Steve French:
- Fix possible refcount leak in compound operations
- Fix remap_file_range() return code mapping, found by generic/157
* tag 'v6.17-rc3-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
fs/smb: Fix inconsistent refcnt update
smb3 client: fix return code mapping of remap_file_range
Pull xfs fixes from Carlos Maiolino:
"The highlight I'd like to point here is related to the XFS_RT
Kconfig, which has been updated to be enabled by default now if
CONFIG_BLK_DEV_ZONED is enabled.
This also contains a few fixes for zoned devices support in XFS,
specially related to swapon requests in inodes belonging to the zoned
FS.
A null-ptr dereference fix in the xattr data, due to a mishandling of
medium errors generated by block devices is also included"
* tag 'xfs-fixes-6.17-rc4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: do not propagate ENODATA disk errors into xattr code
xfs: reject swapon for inodes on a zoned file system earlier
xfs: kick off inodegc when failing to reserve zoned blocks
xfs: remove xfs_last_used_zone
xfs: Default XFS_RT to Y if CONFIG_BLK_DEV_ZONED is enabled
Commit 620c266f39 ("fhandle: relax open_by_handle_at() permission
checks") relaxed the coditions for decoding a file handle from non init
userns.
The conditions are that that decoded dentry is accessible from the user
provided mountfd (or to fs root) and that all the ancestors along the
path have a valid id mapping in the userns.
These conditions are intentionally more strict than the condition that
the decoded dentry should be "lookable" by path from the mountfd.
For example, the path /home/amir/dir/subdir is lookable by path from
unpriv userns of user amir, because /home perms is 755, but the owner of
/home does not have a valid id mapping in unpriv userns of user amir.
The current code did not check that the decoded dentry itself has a
valid id mapping in the userns. There is no security risk in that,
because that final open still performs the needed permission checks,
but this is inconsistent with the checks performed on the ancestors,
so the behavior can be a bit confusing.
Add the check for the decoded dentry itself, so that the entire path,
including the last component has a valid id mapping in the userns.
Fixes: 620c266f39 ("fhandle: relax open_by_handle_at() permission checks")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Link: https://lore.kernel.org/20250827194309.1259650-1-amir73il@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
Observed on kernel 6.6 (present on master as well):
BUG: KASAN: slab-out-of-bounds in memcmp+0x98/0xd0
Call trace:
kasan_check_range+0xe8/0x190
__asan_loadN+0x1c/0x28
memcmp+0x98/0xd0
efivarfs_d_compare+0x68/0xd8
__d_lookup_rcu_op_compare+0x178/0x218
__d_lookup_rcu+0x1f8/0x228
d_alloc_parallel+0x150/0x648
lookup_open.isra.0+0x5f0/0x8d0
open_last_lookups+0x264/0x828
path_openat+0x130/0x3f8
do_filp_open+0x114/0x248
do_sys_openat2+0x340/0x3c0
__arm64_sys_openat+0x120/0x1a0
If dentry->d_name.len < EFI_VARIABLE_GUID_LEN , 'guid' can become
negative, leadings to oob. The issue can be triggered by parallel
lookups using invalid filename:
T1 T2
lookup_open
->lookup
simple_lookup
d_add
// invalid dentry is added to hash list
lookup_open
d_alloc_parallel
__d_lookup_rcu
__d_lookup_rcu_op_compare
hlist_bl_for_each_entry_rcu
// invalid dentry can be retrieved
->d_compare
efivarfs_d_compare
// oob
Fix it by checking 'guid' before cmp.
Fixes: da27a24383 ("efivarfs: guid part of filenames are case-insensitive")
Signed-off-by: Li Nan <linan122@huawei.com>
Signed-off-by: Wu Guanghao <wuguanghao3@huawei.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
A possible inconsistent update of refcount was identified in `smb2_compound_op`.
Such inconsistent update could lead to possible resource leaks.
Why it is a possible bug:
1. In the comment section of the function, it clearly states that the
reference to `cfile` should be dropped after calling this function.
2. Every control flow path would check and drop the reference to
`cfile`, except the patched one.
3. Existing callers would not handle refcount update of `cfile` if
-ENOMEM is returned.
To fix the bug, an extra goto label "out" is added, to make sure that the
cleanup logic would always be respected. As the problem is caused by the
allocation failure of `vars`, the cleanup logic between label "finished"
and "out" can be safely ignored. According to the definition of function
`is_replayable_error`, the error code of "-ENOMEM" is not recoverable.
Therefore, the replay logic also gets ignored.
Signed-off-by: Shuhao Fu <sfual@cse.ust.hk>
Acked-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
Cc: stable@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
On regular fuse filesystems, i_blkbits is set to PAGE_SHIFT which means
any iomap partial writes will mark the entire folio as uptodate. However
fuseblk filesystems work differently and allow the blocksize to be less
than the page size. As such, this may lead to data corruption if fuseblk
sets its blocksize to less than the page size, uses the writeback cache,
and does a partial write, then a read and the read happens before the
write has undergone writeback, since the folio will not be marked
uptodate from the partial write so the read will read in the entire
folio from disk, which will overwrite the partial write.
The long-term solution for this, which will also be needed for fuse to
enable large folios with the writeback cache on, is to have fuse also
use iomap for folio reads, but until that is done, the cleanest
workaround is to use the page size for fuseblk's internal kernel inode
blksize/blkbits values while maintaining current behavior for stat().
This was verified using ntfs-3g:
$ sudo mkfs.ntfs -f -c 512 /dev/vdd1
$ sudo ntfs-3g /dev/vdd1 ~/fuseblk
$ stat ~/fuseblk/hi.txt
IO Block: 512
Fixes: a4c9ab1d49 ("fuse: use iomap for buffered writes")
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
As pointed out by Miklos[1], in the fuse_update_get_attr() path, the
attributes returned to stat may be cached values instead of fresh ones
fetched from the server. In the case where the server returned a
modified blocksize value, we need to cache it and reflect it back to
stat if values are not re-fetched since we now no longer directly change
inode->i_blkbits.
Link: https://lore.kernel.org/linux-fsdevel/CAJfpeguCOxeVX88_zPd1hqziB_C+tmfuDhZP5qO2nKmnb-dTUA@mail.gmail.com/ [1]
Fixes: 542ede096e ("fuse: keep inode->i_blkbits constant")
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>