Patch series "mm: Remove pXX_devmap page table bit and pfn_t type", v3.
All users of dax now require a ZONE_DEVICE page which is properly
refcounted. This means there is no longer any need for the PFN_DEV,
PFN_MAP and PFN_SPECIAL flags. Furthermore the PFN_SG_CHAIN and
PFN_SG_LAST flags never appear to have been used. It is therefore
possible to remove the pfn_t type and replace any usage with raw pfns.
The remaining users of PFN_DEV have simply passed this to
vmf_insert_mixed() to create pte_devmap() mappings. It is unclear why
this was the case but presumably to ensure vm_normal_page() does not
return these pages. These users can be trivially converted to raw pfns
and creating a pXX_special() mapping to ensure vm_normal_page() still
doesn't return these pages.
Now that there are no users of PFN_DEV we can remove the devmap page table
bit and all associated functions and macros, freeing up a software page
table bit.
This patch (of 14):
Currently dax is the only user of pmd and pud mapped ZONE_DEVICE pages.
Therefore page walkers that want to exclude DAX pages can check pmd_devmap
or pud_devmap. However soon dax will no longer set PFN_DEV, meaning dax
pages are mapped as normal pages.
Ensure page walkers that currently use pXd_devmap to skip DAX pages
continue to do so by adding explicit checks of the VMA instead.
Remove vma_is_dax() check from mm/userfaultfd.c as validate_move_areas()
will already skip DAX VMA's on account of them not being anonymous.
The check in userfaultfd_must_wait() is also redundant as
vma_can_userfault() should have already filtered out dax vma's.
For HMM the pud_devmap check seems unnecessary as there is no reason we
shouldn't be able to handle any leaf PUD here so remove it also.
Link: https://lkml.kernel.org/r/cover.176965585864cb8d2cf41464b44dcc0471e643a0.1750323463.git-series.apopple@nvidia.com
Link: https://lkml.kernel.org/r/f0611f6f475f48fcdf34c65084a359aefef4cccc.1750323463.git-series.apopple@nvidia.com
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Cc: Balbir Singh <balbirs@nvidia.com>
Cc: Björn Töpel <bjorn@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chunyan Zhang <zhang.lyra@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Deepak Gupta <debug@rivosinc.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Inki Dae <m.szyprowski@samsung.com>
Cc: John Groves <john@groves.net>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Björn Töpel <bjorn@rivosinc.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "mm/memfd: Reserve hugetlb folios before allocation", v4.
There are cases when we try to pin a folio but discover that it has not
been faulted-in. So, we try to allocate it in memfd_alloc_folio() but the
allocation request may not succeed if there are no active reservations in
the system at that instant.
Therefore, making a reservation (by calling hugetlb_reserve_pages())
associated with the allocation will ensure that our request would not fail
due to lack of reservations. This will also ensure that proper
region/subpool accounting is done with our allocation.
This patch (of 3):
Currently, hugetlb_reserve_pages() returns a bool to indicate whether the
reservation map update for the range [from, to] was successful or not.
This is not sufficient for the case where the caller needs to determine
how many entries were updated for the range.
Therefore, have hugetlb_reserve_pages() return the number of entries
updated in the reservation map associated with the range [from, to].
Also, update the callers of hugetlb_reserve_pages() to handle the new
return value.
Link: https://lkml.kernel.org/r/20250618053415.1036185-1-vivek.kasireddy@intel.com
Link: https://lkml.kernel.org/r/20250618053415.1036185-2-vivek.kasireddy@intel.com
Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
Cc: Steve Sistare <steven.sistare@oracle.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: David Hildenbrand <david@redhat.com>
Cc: Gerd Hoffmann <kraxel@redhat.com>
Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The core kernel code is currently very inconsistent in its use of
vm_flags_t vs. unsigned long. This prevents us from changing the type of
vm_flags_t in the future and is simply not correct, so correct this.
While this results in rather a lot of churn, it is a critical
pre-requisite for a future planned change to VMA flag type.
Additionally, update VMA userland tests to account for the changes.
To make review easier and to break things into smaller parts, driver and
architecture-specific changes is left for a subsequent commit.
The code has been adjusted to cascade the changes across all calling code
as far as is needed.
We will adjust architecture-specific and driver code in a subsequent patch.
Overall, this patch does not introduce any functional change.
Link: https://lkml.kernel.org/r/d1588e7bb96d1ea3fe7b9df2c699d5b4592d901d.1750274467.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: Kees Cook <kees@kernel.org>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Acked-by: Jan Kara <jack@suse.cz>
Acked-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Pedro Falcato <pfalcato@suse.de>
Acked-by: Zi Yan <ziy@nvidia.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Jann Horn <jannh@google.com>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
BUG_ON() is deprecated [1]. Convert all the BUG_ON()s and VM_BUG_ON()s to
use VM_WARN_ON_ONCE().
There are a few additional cases that are converted or modified:
- Convert the printk(KERN_WARNING ...) in handle_userfault() to use
pr_warn().
- Convert the WARN_ON_ONCE()s in move_pages() to use VM_WARN_ON_ONCE(),
as the relevant conditions are already checked in validate_range() in
move_pages()'s caller.
- Convert the VM_WARN_ON()'s in move_pages() to VM_WARN_ON_ONCE(). These
cases should never happen and are similar to those in mfill_atomic()
and mfill_atomic_hugetlb(), which were previously BUG_ON()s.
move_pages() was added later than those functions and makes use of
VM_WARN_ON() as a replacement for the deprecated BUG_ON(), but.
VM_WARN_ON_ONCE() is likely a better direct replacement.
- Convert the WARN_ON() for !VM_MAYWRITE in userfaultfd_unregister() and
userfaultfd_register_range() to VM_WARN_ON_ONCE(). This condition is
enforced in userfaultfd_register() so it should never happen, and can
be converted to a debug check.
[1] https://www.kernel.org/doc/html/v6.15/process/coding-style.html#use-warn-rather-than-bug
Link: https://lkml.kernel.org/r/20250619-uffd-fixes-v3-3-a7274d3bd5e4@columbia.edu
Signed-off-by: Tal Zussman <tz2294@columbia.edu>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jason A. Donenfeld <Jason@zx2c4.com>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Currently, a VMA registered with a uffd can be unregistered through a
different uffd associated with the same mm_struct.
The existing behavior is slightly broken and may incorrectly reject
unregistering some VMAs due to the following check:
if (!vma_can_userfault(cur, cur->vm_flags, wp_async))
goto out_unlock;
where wp_async is derived from ctx, not from cur. For example, a
file-backed VMA registered with wp_async enabled and UFFD_WP mode cannot
be unregistered through a uffd that does not have wp_async enabled.
Rather than fix this and maintain this odd behavior, make unregistration
stricter by requiring VMAs to be unregistered through the same uffd they
were registered with. Additionally, reorder the BUG() checks to avoid the
aforementioned wp_async issue in them. Convert the existing check to
VM_WARN_ON_ONCE() as BUG_ON() is deprecated.
This change slightly modifies the ABI. It should not be backported to
-stable. It is expected that no one depends on this behavior, and no such
cases are known.
While at it, correct the comment for the no userfaultfd case. This seems
to be a copy-paste artifact from the analogous userfaultfd_register()
check.
Link: https://lkml.kernel.org/r/20250619-uffd-fixes-v3-2-a7274d3bd5e4@columbia.edu
Fixes: 86039bd3b4 ("userfaultfd: add new syscall to provide memory externalization")
Signed-off-by: Tal Zussman <tz2294@columbia.edu>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jason A. Donenfeld <Jason@zx2c4.com>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Pull /proc/sys dcache lookup fix from Al Viro:
"Fix for the breakage spotted by Neil in the interplay between
/proc/sys ->d_compare() weirdness and parallel lookups"
* tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fix proc_sys_compare() handling of in-lookup dentries
Pull smb client fixes from Steve French:
- Two reconnect fixes including one for a reboot/reconnect race
- Fix for incorrect file type that can be returned by SMB3.1.1 POSIX
extensions
- tcon initialization fix
- Fix for resolving Windows symlinks with absolute paths
* tag 'v6.16-rc4-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
smb: client: fix native SMB symlink traversal
smb: client: fix race condition in negotiate timeout by using more precise timing
cifs: all initializations for tcon should happen in tcon_info_alloc
smb: client: fix warning when reconnecting channel
smb: client: fix readdir returning wrong type with POSIX extensions
Pull bcachefs fixes from Kent Overstreet:
"The 'opts.casefold_disabled' patch is non critical, but would be a
6.15 backport; it's to address the casefolding + overlayfs
incompatibility that was discovvered late.
It's late because I was hoping that this would be addressed on the
overlayfs side (and will be in 6.17), but user reports keep coming in
on this one (lots of people are using docker these days)"
* tag 'bcachefs-2025-07-03' of git://evilpiepirate.org/bcachefs:
bcachefs: opts.casefold_disabled
bcachefs: Work around deadlock to btree node rewrites in journal replay
bcachefs: Fix incorrect transaction restart handling
bcachefs: fix btree_trans_peek_prev_journal()
bcachefs: mark invalid_btree_id autofix
Pull vfs fixes from Christian Brauner:
- Fix a regression caused by the anonymous inode rework. Making them
regular files causes various places in the kernel to tip over
starting with io_uring.
Revert to the former status quo and port our assertion to be based on
checking the inode so we don't lose the valuable VFS_*_ON_*()
assertions that have already helped discover weird behavior our
outright bugs.
- Fix the the upper bound calculation in fuse_fill_write_pages()
- Fix priority inversion issues in the eventpoll code
- Make secretmen use anon_inode_make_secure_inode() to avoid bypassing
the LSM layer
- Fix a netfs hang due to missing case in final DIO read result
collection
- Fix a double put of the netfs_io_request struct
- Provide some helpers to abstract out NETFS_RREQ_IN_PROGRESS flag
wrangling
- Fix infinite looping in netfs_wait_for_pause/request()
- Fix a netfs ref leak on an extra subrequest inserted into a request's
list of subreqs
- Fix various cifs RPC callbacks to set NETFS_SREQ_NEED_RETRY if a
subrequest fails retriably
- Fix a cifs warning in the workqueue code when reconnecting a channel
- Fix the updating of i_size in netfs to avoid a race between testing
if we should have extended the file with a DIO write and changing
i_size
- Merge the places in netfs that update i_size on write
- Fix coredump socket selftests
* tag 'vfs-6.16-rc5.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
anon_inode: rework assertions
netfs: Update tracepoints in a number of ways
netfs: Renumber the NETFS_RREQ_* flags to make traces easier to read
netfs: Merge i_size update functions
netfs: Fix i_size updating
smb: client: set missing retry flag in cifs_writev_callback()
smb: client: set missing retry flag in cifs_readv_callback()
smb: client: set missing retry flag in smb2_writev_callback()
netfs: Fix ref leak on inserted extra subreq in write retry
netfs: Fix looping in wait functions
netfs: Provide helpers to perform NETFS_RREQ_IN_PROGRESS flag wangling
netfs: Fix double put of request
netfs: Fix hang due to missing case in final DIO read result collection
eventpoll: Fix priority inversion problem
fuse: fix fuse_fill_write_pages() upper bound calculation
fs: export anon_inode_make_secure_inode() and fix secretmem LSM bypass
selftests/coredump: Fix "socket_detect_userspace_client" test failure
There's one case where ->d_compare() can be called for an in-lookup
dentry; usually that's nothing special from ->d_compare() point of
view, but... proc_sys_compare() is weird.
The thing is, /proc/sys subdirectories can look differently for
different processes. Up to and including having the same name
resolve to different dentries - all of them hashed.
The way it's done is ->d_compare() refusing to admit a match unless
this dentry is supposed to be visible to this caller. The information
needed to discriminate between them is stored in inode; it is set
during proc_sys_lookup() and until it's done d_splice_alias() we really
can't tell who should that dentry be visible for.
Normally there's no negative dentries in /proc/sys; we can run into
a dying dentry in RCU dcache lookup, but those can be safely rejected.
However, ->d_compare() is also called for in-lookup dentries, before
they get positive - or hashed, for that matter. In case of match
we will wait until dentry leaves in-lookup state and repeat ->d_compare()
afterwards. In other words, the right behaviour is to treat the
name match as sufficient for in-lookup dentries; if dentry is not
for us, we'll see that when we recheck once proc_sys_lookup() is
done with it.
While we are at it, fix the misspelled READ_ONCE and WRITE_ONCE there.
Fixes: d9171b9345 ("parallel lookups machinery, part 4 (and last)")
Reported-by: NeilBrown <neilb@brown.name>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: NeilBrown <neil@brown.name>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
We've seen customers having shares mounted in paths like /??/C:/ or
/??/UNC/foo.example.com/share in order to get their native SMB
symlinks successfully followed from different mounts.
After commit 12b466eb52 ("cifs: Fix creating and resolving absolute NT-style symlinks"),
the client would then convert absolute paths from "/??/C:/" to "/mnt/c/"
by default. The absolute paths would vary depending on the value of
symlinkroot= mount option.
Fix this by restoring old behavior of not trying to convert absolute
paths by default. Only do this if symlinkroot= was _explicitly_ set.
Before patch:
$ mount.cifs //w22-fs0/test2 /mnt/1 -o vers=3.1.1,username=xxx,password=yyy
$ ls -l /mnt/1/symlink2
lrwxr-xr-x 1 root root 15 Jun 20 14:22 /mnt/1/symlink2 -> /mnt/c/testfile
$ mkdir -p /??/C:; echo foo > //??/C:/testfile
$ cat /mnt/1/symlink2
cat: /mnt/1/symlink2: No such file or directory
After patch:
$ mount.cifs //w22-fs0/test2 /mnt/1 -o vers=3.1.1,username=xxx,password=yyy
$ ls -l /mnt/1/symlink2
lrwxr-xr-x 1 root root 15 Jun 20 14:22 /mnt/1/symlink2 -> '/??/C:/testfile'
$ mkdir -p /??/C:; echo foo > //??/C:/testfile
$ cat /mnt/1/symlink2
foo
Cc: linux-cifs@vger.kernel.org
Reported-by: Pierguido Lambri <plambri@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Stefan Metzmacher <metze@samba.org>
Fixes: 12b466eb52 ("cifs: Fix creating and resolving absolute NT-style symlinks")
Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
When the SMB server reboots and the client immediately accesses the mount
point, a race condition can occur that causes operations to fail with
"Host is down" error.
Reproduction steps:
# Mount SMB share
mount -t cifs //192.168.245.109/TEST /mnt/ -o xxxx
ls /mnt
# Reboot server
ssh root@192.168.245.109 reboot
ssh root@192.168.245.109 /path/to/cifs_server_setup.sh
ssh root@192.168.245.109 systemctl stop firewalld
# Immediate access fails
ls /mnt
ls: cannot access '/mnt': Host is down
# But works if there is a delay
The issue is caused by a race condition between negotiate and reconnect.
The 20-second negotiate timeout mechanism can interfere with the normal
recovery process when both are triggered simultaneously.
ls cifsd
---------------------------------------------------
cifs_getattr
cifs_revalidate_dentry
cifs_get_inode_info
cifs_get_fattr
smb2_query_path_info
smb2_compound_op
SMB2_open_init
smb2_reconnect
cifs_negotiate_protocol
smb2_negotiate
cifs_send_recv
smb_send_rqst
wait_for_response
cifs_demultiplex_thread
cifs_read_from_socket
cifs_readv_from_socket
server_unresponsive
cifs_reconnect
__cifs_reconnect
cifs_abort_connection
mid->mid_state = MID_RETRY_NEEDED
cifs_wake_up_task
cifs_sync_mid_result
// case MID_RETRY_NEEDED
rc = -EAGAIN;
// In smb2_negotiate()
rc = -EHOSTDOWN;
The server_unresponsive() timeout triggers cifs_reconnect(), which aborts
ongoing mid requests and causes the ls command to receive -EAGAIN, leading
to -EHOSTDOWN.
Fix this by introducing a dedicated `neg_start` field to
precisely tracks when the negotiate process begins. The timeout check
now uses this accurate timestamp instead of `lstrp`, ensuring that:
1. Timeout is only triggered after negotiate has actually run for 20s
2. The mechanism doesn't interfere with concurrent recovery processes
3. Uninitialized timestamps (value 0) don't trigger false timeouts
Fixes: 7ccc146546 ("smb: client: fix hang in wait_for_response() for negproto")
Signed-off-by: Wang Zhaolong <wangzhaolong@huaweicloud.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Pull btrfs fixes from David Sterba:
- tree-log fixes:
- fixes of log tracking of directories and subvolumes
- fix iteration and error handling of inode references
during log replay
- fix free space tree rebuild (reported by syzbot)
* tag 'for-6.16-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: use btrfs_record_snapshot_destroy() during rmdir
btrfs: propagate last_unlink_trans earlier when doing a rmdir
btrfs: record new subvolume in parent dir earlier to avoid dir logging races
btrfs: fix inode lookup error handling during log replay
btrfs: fix iteration of extrefs during log replay
btrfs: fix missing error handling when searching for inode refs during log replay
btrfs: fix failure to rebuild free space tree using multiple transactions
Pull xfs fixes from Carlos Maiolino:
- Fix umount hang with unflushable inodes (and add new tracepoint used
for debugging this)
- Fix ABBA deadlock in xfs_reclaim_inode() vs xfs_ifree_cluster()
- Fix dquot buffer pin deadlock
* tag 'xfs-fixes-6.16-rc5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: add FALLOC_FL_ALLOCATE_RANGE to supported flags mask
xfs: fix unmount hang with unflushable inodes stuck in the AIL
xfs: factor out stale buffer item completion
xfs: rearrange code in xfs_buf_item.c
xfs: add tracepoints for stale pinned inode state debug
xfs: avoid dquot buffer pin deadlock
xfs: catch stale AGF/AGF metadata
xfs: xfs_ifree_cluster vs xfs_iflush_shutdown_abort deadlock
xfs: actually use the xfs_growfs_check_rtgeom tracepoint
xfs: Improve error handling in xfs_mru_cache_create()
xfs: move xfs_submit_zoned_bio a bit
xfs: use xfs_readonly_buftarg in xfs_remount_rw
xfs: remove NULL pointer checks in xfs_mru_cache_insert
xfs: check for shutdown before going to sleep in xfs_select_zone
Add an option for completely disabling casefolding on a filesystem, as a
workaround for overlayfs.
This should only be needed as a temporary workaround, until the
overlayfs fix arrives.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Don't mark btree nodes for rewrites, if they are or would be degraded,
if journal replay hasn't finished, to avoid a deadlock.
This is because btree node rewrites generate more updates for the
interior updates (alloc, backpointers), and if those updates touch
new nodes and generate more rewrites - we can only have so many interior
btree updates in flight before we deadlock on open_buckets.
The biggest cause is that we don't use the btree write buffer (for
the backpointer updates - this needs some real thought on locking in
order to fix.
The problem with this workaround (not doing the rewrite for degraded
nodes in journal replay) is that those degraded nodes persist, and we
don't want that (this is a real bug when a btree node write completes
with fewer replicas than we wanted and leaves a degraded node due to
device _removal_, i.e. the device went away mid write).
It's less of a bug here, but still a problem because we don't yet
have a way of tracking degraded data - we another index (all
extents/btree nodes, by replicas entry) in order to fix properly
(re-replicate degraded data at the earliest possible time).
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Pull NFS client fixes from Anna Schumaker:
- Fix loop in GSS sequence number cache
- Clean up /proc/net/rpc/nfs if nfs_fs_proc_net_init() fails
- Fix a race to wake on NFS_LAYOUT_DRAIN
- Fix handling of NFS level errors in I/O
* tag 'nfs-for-6.16-2' of git://git.linux-nfs.org/projects/anna/linux-nfs:
NFSv4/flexfiles: Fix handling of NFS level errors in I/O
NFSv4/pNFS: Fix a race to wake on NFS_LAYOUT_DRAIN
nfs: Clean up /proc/net/rpc/nfs when nfs_fs_proc_net_init() fails.
sunrpc: fix loop in gss seqno cache
Make a number of updates to the netfs tracepoints:
(1) Remove a duplicate trace from netfs_unbuffered_write_iter_locked().
(2) Move the trace in netfs_wake_rreq_flag() to after the flag is cleared
so that the change appears in the trace.
(3) Differentiate the use of netfs_rreq_trace_wait/woke_queue symbols.
(4) Don't do so many trace emissions in the wait functions as some of them
are redundant.
(5) In netfs_collect_read_results(), differentiate a subreq that's being
abandoned vs one that has been consumed in a regular way.
(6) Add a tracepoint to indicate the call to ->ki_complete().
(7) Don't double-increment the subreq_counter when retrying a write.
(8) Move the netfs_sreq_trace_io_progress tracepoint within cifs code to
just MID_RESPONSE_RECEIVED and add different tracepoints for other MID
states and note check failure.
Signed-off-by: David Howells <dhowells@redhat.com>
Co-developed-by: Paulo Alcantara <pc@manguebit.org>
Signed-off-by: Paulo Alcantara <pc@manguebit.org>
Link: https://lore.kernel.org/20250701163852.2171681-14-dhowells@redhat.com
cc: Steve French <sfrench@samba.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
cc: linux-cifs@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
Renumber the NETFS_RREQ_* flags to put the most useful status bits in the
bottom nibble - and therefore the last hex digit in the trace output -
making it easier to grasp the state at a glance.
In particular, put the IN_PROGRESS flag in bit 0 and ALL_QUEUED at bit 1.
Also make the flags field in /proc/fs/netfs/requests larger to accommodate
all the flags.
Also make the flags field in the netfs_sreq tracepoint larger to
accommodate all the NETFS_SREQ_* flags.
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/20250701163852.2171681-13-dhowells@redhat.com
Reviewed-by: Paulo Alcantara <pc@manguebit.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
Netfslib has two functions for updating the i_size after a write: one for
buffered writes into the pagecache and one for direct/unbuffered writes.
However, what needs to be done is much the same in both cases, so merge
them together.
This does raise one question, though: should updating the i_size after a
direct write do the same estimated update of i_blocks as is done for
buffered writes.
Also get rid of the cleanup function pointer from netfs_io_request as it's
only used for direct write to update i_size; instead do the i_size setting
directly from write collection.
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/20250701163852.2171681-12-dhowells@redhat.com
cc: Steve French <sfrench@samba.org>
cc: Paulo Alcantara <pc@manguebit.org>
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
Fix the updating of i_size, particularly in regard to the completion of DIO
writes and especially async DIO writes by using a lock.
The bug is triggered occasionally by the generic/207 xfstest as it chucks a
bunch of AIO DIO writes at the filesystem and then checks that fstat()
returns a reasonable st_size as each completes.
The problem is that netfs is trying to do "if new_size > inode->i_size,
update inode->i_size" sort of thing but without a lock around it.
This can be seen with cifs, but shouldn't be seen with kafs because kafs
serialises modification ops on the client whereas cifs sends the requests
to the server as they're generated and lets the server order them.
Fixes: 153a9961b5 ("netfs: Implement unbuffered/DIO write support")
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/20250701163852.2171681-11-dhowells@redhat.com
Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
cc: Steve French <sfrench@samba.org>
cc: Paulo Alcantara <pc@manguebit.org>
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
The write-retry algorithm will insert extra subrequests into the list if it
can't get sufficient capacity to split the range that needs to be retried
into the sequence of subrequests it currently has (for instance, if the
cifs credit pool has fewer credits available than it did when the range was
originally divided).
However, the allocator furnishes each new subreq with 2 refs and then
another is added for resubmission, causing one to be leaked.
Fix this by replacing the ref-getting line with a neutral trace line.
Fixes: 288ace2f57 ("netfs: New writeback implementation")
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/20250701163852.2171681-6-dhowells@redhat.com
Tested-by: Steve French <sfrench@samba.org>
Reviewed-by: Paulo Alcantara <pc@manguebit.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
netfs_wait_for_request() and netfs_wait_for_pause() can loop forever if
netfs_collect_in_app() returns 2, indicating that it wants to repeat
because the ALL_QUEUED flag isn't yet set and there are no subreqs left
that haven't been collected.
The problem is that, unless collection is offloaded (OFFLOAD_COLLECTION),
we have to return to the application thread to continue and eventually set
ALL_QUEUED after pausing to deal with a retry - but we never get there.
Fix this by inserting checks for the IN_PROGRESS and PAUSE flags as
appropriate before cycling round - and add cond_resched() for good measure.
Fixes: 2b1424cd13 ("netfs: Fix wait/wake to be consistent about the waitqueue used")
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/20250701163852.2171681-5-dhowells@redhat.com
Tested-by: Steve French <sfrench@samba.org>
Reviewed-by: Paulo Alcantara <pc@manguebit.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
If a netfs request finishes during the pause loop, it will have the ref
that belongs to the IN_PROGRESS flag removed at that point - however, if it
then goes to the final wait loop, that will *also* put the ref because it
sees that the IN_PROGRESS flag is clear and incorrectly assumes that this
happened when it called the collector.
In fact, since IN_PROGRESS is clear, we shouldn't call the collector again
since it's done all the cleanup, such as calling ->ki_complete().
Fix this by making netfs_collect_in_app() just return, indicating that
we're done if IN_PROGRESS is removed.
Fixes: 2b1424cd13 ("netfs: Fix wait/wake to be consistent about the waitqueue used")
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/20250701163852.2171681-3-dhowells@redhat.com
Tested-by: Steve French <sfrench@samba.org>
Reviewed-by: Paulo Alcantara <pc@manguebit.org>
cc: Steve French <sfrench@samba.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
cc: linux-cifs@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
When doing a DIO read, if the subrequests we issue fail and cause the
request PAUSE flag to be set to put a pause on subrequest generation, we
may complete collection of the subrequests (possibly discarding them) prior
to the ALL_QUEUED flags being set.
In such a case, netfs_read_collection() doesn't see ALL_QUEUED being set
after netfs_collect_read_results() returns and will just return to the app
(the collector can be seen unpausing the generator in the trace log).
The subrequest generator can then set ALL_QUEUED and the app thread reaches
netfs_wait_for_request(). This causes netfs_collect_in_app() to be called
to see if we're done yet, but there's missing case here.
netfs_collect_in_app() will see that a thread is active and set inactive to
false, but won't see any subrequests in the read stream, and so won't set
need_collect to true. The function will then just return 0, indicating
that the caller should just sleep until further activity (which won't be
forthcoming) occurs.
Fix this by making netfs_collect_in_app() check to see if an active thread
is complete - i.e. that ALL_QUEUED is set and the subrequests list is empty
- and to skip the sleep return path. The collector will then be called
which will clear the request IN_PROGRESS flag, allowing the app to
progress.
Fixes: 2b1424cd13 ("netfs: Fix wait/wake to be consistent about the waitqueue used")
Reported-by: Steve French <sfrench@samba.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/20250701163852.2171681-2-dhowells@redhat.com
Tested-by: Steve French <sfrench@samba.org>
Reviewed-by: Paulo Alcantara <pc@manguebit.org>
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
The ready event list of an epoll object is protected by read-write
semaphore:
- The consumer (waiter) acquires the write lock and takes items.
- the producer (waker) takes the read lock and adds items.
The point of this design is enabling epoll to scale well with large number
of producers, as multiple producers can hold the read lock at the same
time.
Unfortunately, this implementation may cause scheduling priority inversion
problem. Suppose the consumer has higher scheduling priority than the
producer. The consumer needs to acquire the write lock, but may be blocked
by the producer holding the read lock. Since read-write semaphore does not
support priority-boosting for the readers (even with CONFIG_PREEMPT_RT=y),
we have a case of priority inversion: a higher priority consumer is blocked
by a lower priority producer. This problem was reported in [1].
Furthermore, this could also cause stall problem, as described in [2].
To fix this problem, make the event list half-lockless:
- The consumer acquires a mutex (ep->mtx) and takes items.
- The producer locklessly adds items to the list.
Performance is not the main goal of this patch, but as the producer now can
add items without waiting for consumer to release the lock, performance
improvement is observed using the stress test from
https://github.com/rouming/test-tools/blob/master/stress-epoll.c. This is
the same test that justified using read-write semaphore in the past.
Testing using 12 x86_64 CPUs:
Before After Diff
threads events/ms events/ms
8 6932 19753 +185%
16 7820 27923 +257%
32 7648 35164 +360%
64 9677 37780 +290%
128 11166 38174 +242%
Testing using 1 riscv64 CPU (averaged over 10 runs, as the numbers are
noisy):
Before After Diff
threads events/ms events/ms
1 73 129 +77%
2 151 216 +43%
4 216 364 +69%
8 234 382 +63%
16 251 392 +56%
Reported-by: Frederic Weisbecker <frederic@kernel.org>
Closes: https://lore.kernel.org/linux-rt-users/20210825132754.GA895675@lothringen/ [1]
Reported-by: Valentin Schneider <vschneid@redhat.com>
Closes: https://lore.kernel.org/linux-rt-users/xhsmhttqvnall.mognet@vschneid.remote.csb/ [2]
Signed-off-by: Nam Cao <namcao@linutronix.de>
Link: https://lore.kernel.org/20250527090836.1290532-1-namcao@linutronix.de
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Today, a few work structs inside tcon are initialized inside
cifs_get_tcon and not in tcon_info_alloc. As a result, if a tcon
is obtained from tcon_info_alloc, but not called as a part of
cifs_get_tcon, we may trip over.
Cc: <stable@vger.kernel.org>
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Add FALLOC_FL_ALLOCATE_RANGE to the set of supported fallocate flags in
XFS_FALLOC_FL_SUPPORTED. This change improves code clarity and maintains
by explicitly showing this flag in the supported flags mask.
Note that since FALLOC_FL_ALLOCATE_RANGE is defined as 0x00, this addition
has no functional modifications.
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Youling Tang <tangyouling@kylinos.cn>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
When SMB 3.1.1 POSIX Extensions are negotiated, userspace applications
using readdir() or getdents() calls without stat() on each individual file
(such as a simple "ls" or "find") would misidentify file types and exhibit
strange behavior such as not descending into directories. The reason for
this behavior is an oversight in the cifs_posix_to_fattr conversion
function. Instead of extracting the entry type for cf_dtype from the
properly converted cf_mode field, it tries to extract the type from the
PDU. While the wire representation of the entry mode is similar in
structure to POSIX stat(), the assignments of the entry types are
different. Applying the S_DT macro to cf_mode instead yields the correct
result. This is also what the equivalent function
smb311_posix_info_to_fattr in inode.c already does for stat() etc.; which
is why "ls -l" would give the correct file type but "ls" would not (as
identified by the colors).
Cc: stable@vger.kernel.org
Signed-off-by: Philipp Kerling <pkerling@casix.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Pull smb client fixes from Steve French:
- Multichannel reconnect lock ordering deadlock fix
- Fix for regression in handling native Windows symlinks
- Three smbdirect fixes:
- oops in RDMA response processing
- smbdirect memcpy issue
- fix smbdirect regression with large writes (smbdirect test cases
now all passing)
- Fix for "FAILED_TO_PARSE" warning in trace-cmd report output
* tag 'v6.16-rc3-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
cifs: Fix reading into an ITER_FOLIOQ from the smbdirect code
cifs: Fix the smbd_response slab to allow usercopy
smb: client: fix potential deadlock when reconnecting channels
smb: client: remove \t from TP_printk statements
smb: client: let smbd_post_send_iter() respect the peers max_send_size and transmit all data
smb: client: fix regression with native SMB symlinks
Pull misc fixes from Andrew Morton:
"16 hotfixes.
6 are cc:stable and the remainder address post-6.15 issues or aren't
considered necessary for -stable kernels. 5 are for MM"
* tag 'mm-hotfixes-stable-2025-06-27-16-56' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
MAINTAINERS: add Lorenzo as THP co-maintainer
mailmap: update Duje Mihanović's email address
selftests/mm: fix validate_addr() helper
crashdump: add CONFIG_KEYS dependency
mailmap: correct name for a historical account of Zijun Hu
mailmap: add entries for Zijun Hu
fuse: fix runtime warning on truncate_folio_batch_exceptionals()
scripts/gdb: fix dentry_name() lookup
mm/damon/sysfs-schemes: free old damon_sysfs_scheme_filter->memcg_path on write
mm/alloc_tag: fix the kmemleak false positive issue in the allocation of the percpu variable tag->counters
lib/group_cpus: fix NULL pointer dereference from group_cpus_evenly()
mm/hugetlb: remove unnecessary holding of hugetlb_lock
MAINTAINERS: add missing files to mm page alloc section
MAINTAINERS: add tree entry to mm init block
mm: add OOM killer maintainer structure
fs/proc/task_mmu: fix PAGE_IS_PFNZERO detection for the huge zero folio
We are setting the parent directory's last_unlink_trans directly which
may result in a concurrent task starting to log the directory not see the
update and therefore can log the directory after we removed a child
directory which had a snapshot within instead of falling back to a
transaction commit. Replaying such a log tree would result in a mount
failure since we can't currently delete snapshots (and subvolumes) during
log replay. This is the type of failure described in commit 1ec9a1ae1e
("Btrfs: fix unreplayable log after snapshot delete + parent dir fsync").
Fix this by using btrfs_record_snapshot_destroy() which updates the
last_unlink_trans field while holding the inode's log_mutex lock.
Fixes: 44f714dae5 ("Btrfs: improve performance on fsync against new inode after rename/unlink")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In case the removed directory had a snapshot that was deleted, we are
propagating its inode's last_unlink_trans to the parent directory after
we removed the entry from the parent directory. This leaves a small race
window where someone can log the parent directory after we removed the
entry and before we updated last_unlink_trans, and as a result if we ever
try to replay such a log tree, we will fail since we will attempt to
remove a snapshot during log replay, which is currently not possible and
results in the log replay (and mount) to fail. This is the type of failure
described in commit 1ec9a1ae1e ("Btrfs: fix unreplayable log after
snapshot delete + parent dir fsync").
So fix this by propagating the last_unlink_trans to the parent directory
before we remove the entry from it.
Fixes: 44f714dae5 ("Btrfs: improve performance on fsync against new inode after rename/unlink")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>