Don't roll the whole transaction after every extent; that's rather
inefficient.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
These sysctl knobs were scheduled for removal in September 2025. That
time has come, so remove them.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Delete XREAP_MAX_BINVAL and XREAP_MAX_DEFER_CHAIN because the reap code
now calculates those limits dynamically, so they're no longer needed.
Move the third limit (XREP_MAX_ITRUNCATE_EFIS) to the one file that uses
it. Note that the btree rebuilding code should reserve exactly the
number of blocks needed to rebuild a btree, so it is rare that the newbt
code will need to add any EFIs to the commit transaction. That's why
that static limit remains.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
We promised to turn off these old features by default in September 2025.
Do so now.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reaping file fork mappings is a little different -- log recovery can
free the blocks for us, so we only try to process a single mapping at a
time. Therefore, we only need to figure out the maximum number of
blocks that we can invalidate in a single transaction.
The rough calculation here is:
    nr_extents = (logres - reservation used by any one step) /
                 (space used per binval)
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Calculate the maximum number of CoW staging extents that can be reaped
in a single transaction chain. The rough calculation here is:
    nr_extents = (logres - reservation used by any one step) /
                 (space used by intents per extent)
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Calculate the maximum number of CoW staging extents that can be reaped
in a single transaction chain. The rough calculation here is:
    nr_extents = (logres - reservation used by any one step) /
                 (space used by intents per extent +
                  space used for a few buffer invalidations)
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Calculate the maximum number of extents that can be reaped in a single
transaction chain, and the number of buffers that can be invalidated in
a single transaction. The rough calculation here is:
    nr_extents = (logres - reservation used by any one step) /
                 (space used by intents per extent +
                  space used per binval)
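The rough calculation above can be sketched in plain C. This is only an illustration: the function name, the parameter names, and the decision to clamp the result to at least one extent are assumptions made for the example, not the kernel's actual helpers.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical sketch of the rough calculation; none of these names come
 * from the kernel source.
 *
 * logres:      total log reservation of the transaction, in bytes
 * step_res:    reservation consumed by any one step of the chain
 * per_extent:  log space used by the intent items for one extent
 * per_binval:  log space used by one buffer invalidation
 */
static unsigned int
max_reap_extents(size_t logres, size_t step_res, size_t per_extent,
		 size_t per_binval)
{
	size_t per_item = per_extent + per_binval;
	size_t n;

	/* At least one of the two per-item costs must be nonzero. */
	assert(per_item > 0);

	/* Assumption: with no headroom left, fall back to one extent. */
	if (logres <= step_res)
		return 1;

	n = (logres - step_res) / per_item;
	return n > 0 ? (unsigned int)n : 1;
}
```

Passing 0 for `per_binval` or `per_extent` gives the variants of the formula that account for only intents or only buffer invalidations.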
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Convert the file fork reaping code to use struct xreap_state so that we
can reuse the dynamic state tracking code.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
The online repair block reaping code employs static limits to decide if
it's time to roll the transaction or finish the deferred item chains to
avoid overflowing the scrub transaction's reservation. However, static
limits aren't great -- btree blocks are assumed to be
scattered around the AG and the buffers need to be invalidated, whereas
COW staging extents are usually contiguous and do not have buffers. We
would like to configure the limits dynamically.
To get ready for this, reorganize struct xreap_state to store dynamic
limits, and add helpers to hide some of the details of how the limits
are enforced. Also rename the "xreap roll" functions to include the
word "binval" because they only exist to decide when we should roll the
transaction to deal with buffer invalidations.
No functional changes intended here.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
When we're removing rmap records for crosslinked blocks, use deferred
intent items so that we can try to free/unmap as many of the old data
structure's blocks as we can in the same transaction as the commit.
Cc: <stable@vger.kernel.org> # v6.6
Fixes: 1c7ce115e5 ("xfs: reap large AG metadata extents when possible")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
The ready event list of an epoll object is protected by a read-write
semaphore:
- The consumer (waiter) acquires the write lock and takes items.
- The producer (waker) takes the read lock and adds items.
The point of this design is to let epoll scale well with a large number
of producers, as multiple producers can hold the read lock at the same
time.
Unfortunately, this implementation may cause a scheduling priority
inversion problem. Suppose the consumer has higher scheduling priority than
the producer. The consumer needs to acquire the write lock, but may be
blocked by the producer holding the read lock. Since the read-write
semaphore does not support priority boosting for readers (even with
CONFIG_PREEMPT_RT=y), we have a case of priority inversion: a higher
priority consumer is blocked by a lower priority producer. This problem was
reported in [1].
Furthermore, this could also cause a stall problem, as described in [2].
Fix this problem by replacing rwlock with spinlock.
This reduces the event bandwidth, as the producers now have to contend with
each other for the spinlock. According to the benchmark from
https://github.com/rouming/test-tools/blob/master/stress-epoll.c:
On 12 x86 CPUs:
              Before      After       Diff
    threads   events/ms   events/ms
          8        7162        4956   -31%
         16        8733        5383   -38%
         32        7968        5572   -30%
         64       10652        5739   -46%
        128       11236        5931   -47%
On 4 riscv CPUs:
              Before      After       Diff
    threads   events/ms   events/ms
          8        2958        2833    -4%
         16        3323        3097    -7%
         32        3451        3240    -6%
         64        3554        3178   -11%
        128        3601        3235   -10%
Although the numbers look bad, it should be noted that this benchmark
creates multiple threads that do nothing except constantly generate new
epoll events, so contention on the spinlock is high. For real workloads,
the event rate is likely much lower, and the performance drop is not as
bad.
Using another benchmark (perf bench epoll wait) where spinlock contention
is lower, improvement is even observed on x86:
On 12 x86 CPUs:
Before: Averaged 110279 operations/sec (+- 1.09%), total secs = 8
After: Averaged 114577 operations/sec (+- 2.25%), total secs = 8
On 4 riscv CPUs:
Before: Averaged 175767 operations/sec (+- 0.62%), total secs = 8
After: Averaged 167396 operations/sec (+- 0.23%), total secs = 8
In conclusion, no one is likely to be upset over this change. After all,
spinlock was used originally for years, and the commit which converted to
rwlock didn't mention a real workload, just that the benchmark numbers are
nice.
This patch is not an exact revert of commit a218cc4914 ("epoll: use
rwlock in order to reduce ep_poll_callback() contention"), because git
revert conflicts in some places where the resolution is not obvious.
This patch is intended to be backported, so go with the obvious
approach:
- Replace rwlock_t with spinlock_t one to one
- Delete list_add_tail_lockless() and chain_epi_lockless(). These were
introduced to allow producers to concurrently add items to the list.
But now that spinlock no longer allows producers to touch the event
list concurrently, these two functions are not necessary anymore.
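The locking scheme after this change can be pictured with a small userspace sketch. It is not the epoll code: the `ready_list` and `ready_item` types and all function names are hypothetical, and a C11 `atomic_flag` stands in for the kernel's `spinlock_t`. It only shows that both producer and consumer now take the same exclusive lock, so a plain list append is enough.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-ins; not the kernel's epoll structures. */
struct ready_item {
	struct ready_item *next;
	int event;
};

struct ready_list {
	atomic_flag lock;		/* toy spinlock */
	struct ready_item *head;
	struct ready_item *tail;
};

static void ready_list_init(struct ready_list *rl)
{
	atomic_flag_clear(&rl->lock);
	rl->head = rl->tail = NULL;
}

static void rl_lock(struct ready_list *rl)
{
	while (atomic_flag_test_and_set_explicit(&rl->lock,
						 memory_order_acquire))
		;	/* spin until the holder releases the lock */
}

static void rl_unlock(struct ready_list *rl)
{
	atomic_flag_clear_explicit(&rl->lock, memory_order_release);
}

/* Producer (waker): append one item under the exclusive lock. */
static void ready_list_add(struct ready_list *rl, struct ready_item *it)
{
	assert(it != NULL);
	it->next = NULL;

	rl_lock(rl);
	if (rl->tail)
		rl->tail->next = it;
	else
		rl->head = it;
	rl->tail = it;
	rl_unlock(rl);
}

/* Consumer (waiter): detach the whole list under the same lock. */
static struct ready_item *ready_list_take_all(struct ready_list *rl)
{
	struct ready_item *first;

	rl_lock(rl);
	first = rl->head;
	rl->head = rl->tail = NULL;
	rl_unlock(rl);
	return first;
}
```

Because producers can no longer run concurrently inside the critical section, no lockless append helper is needed in this sketch either.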
Fixes: a218cc4914 ("epoll: use rwlock in order to reduce ep_poll_callback() contention")
Signed-off-by: Nam Cao <namcao@linutronix.de>
Link: https://lore.kernel.org/ec92458ea357ec503c737ead0f10b2c6e4c37d47.1752581388.git.namcao@linutronix.de
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: stable@vger.kernel.org
Reported-by: Frederic Weisbecker <frederic@kernel.org>
Closes: https://lore.kernel.org/linux-rt-users/20210825132754.GA895675@lothringen/ [1]
Reported-by: Valentin Schneider <vschneid@redhat.com>
Closes: https://lore.kernel.org/linux-rt-users/xhsmhttqvnall.mognet@vschneid.remote.csb/ [2]
Signed-off-by: Christian Brauner <brauner@kernel.org>
This change modernizes the code by aligning it with current kernel best
practices. It improves code clarity and consistency, since strncpy() is
deprecated, as explained in Documentation/process/deprecated.rst. This
change does not alter the functionality or introduce any behavioral changes.
Suggested-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Marcelo Moreira <marcelomoreira1905@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Some fuse servers need to prune their caches, which can only be done if the
kernel's own dentry/inode caches are pruned first to avoid dangling
references.
Add FUSE_NOTIFY_PRUNE, which takes an array of node IDs to try and get rid
of. Inodes with active references are skipped.
Similar functionality is already provided by FUSE_NOTIFY_INVAL_ENTRY with
the FUSE_EXPIRE_ONLY flag. The differences in the interface are:
FUSE_NOTIFY_INVAL_ENTRY:
- can only prune one dentry
- dentry is determined by parent ID and name
- if inode has multiple aliases (cached hard links), then they would have
to be invalidated individually to be able to get rid of the inode
FUSE_NOTIFY_PRUNE:
- can prune multiple inodes
- inodes determined by their node ID
- aliases are taken care of automatically
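As a rough illustration of the prune semantics described above (an array of node IDs, with busy inodes skipped), here is a toy userspace model. The `toy_inode` structure and `prune_nodes()` are invented for this sketch and do not reflect the fuse kernel or wire interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cache entry; not a real fuse or VFS structure. */
struct toy_inode {
	uint64_t nodeid;
	unsigned int refs;	/* active references held by users */
	int cached;		/* nonzero while the entry is in the cache */
};

/*
 * Try to drop every cached inode whose node ID appears in ids[].
 * Inodes with active references are skipped. Returns the number of
 * inodes actually pruned.
 */
static size_t prune_nodes(struct toy_inode *cache, size_t ncache,
			  const uint64_t *ids, size_t nids)
{
	size_t pruned = 0;

	assert(cache != NULL || ncache == 0);

	for (size_t i = 0; i < nids; i++) {
		for (size_t j = 0; j < ncache; j++) {
			struct toy_inode *in = &cache[j];

			if (!in->cached || in->nodeid != ids[i])
				continue;
			if (in->refs == 0) {
				in->cached = 0;
				pruned++;
			}
			break;	/* node IDs are unique in this toy cache */
		}
	}
	return pruned;
}
```

Unknown node IDs are simply ignored in this model, which matches the "try and get rid of" wording but is otherwise an assumption.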
Reviewed-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Remove tail calls of fuse_copy_finish(), since it's now done from
fuse_dev_do_write().
No functional change.
Reviewed-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
In case of FUSE_NOTIFY_RESEND and FUSE_NOTIFY_INC_EPOCH fuse_copy_finish()
isn't called.
Fix by always calling fuse_copy_finish() after fuse_notify(). It's a no-op
if called a second time.
Fixes: 760eac73f9 ("fuse: Introduce a new notification type for resend pending requests")
Fixes: 2396356a94 ("fuse: add more control over cache invalidation behaviour")
Cc: <stable@vger.kernel.org> # v6.9
Reviewed-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
The absolute majority of callers pass a 4th argument equal to
strlen() of the 3rd one.
Drop the v_size argument, add vfs_parse_fs_qstr() for the cases that
want independent length.
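The shape of this API split can be sketched in userspace C. The names `parse_param_string()` and `parse_param_len()` and the `param` structure are hypothetical and chosen only for illustration; they are not the VFS signatures.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical parsed parameter; not struct fs_parameter. */
struct param {
	const char *key;
	const char *value;
	size_t len;
};

/* Explicit-length variant for the few callers that need it. */
static int parse_param_len(struct param *out, const char *key,
			   const char *value, size_t len)
{
	assert(out != NULL && key != NULL);
	out->key = key;
	out->value = value;
	out->len = value ? len : 0;
	return 0;
}

/* Common case: the length is simply strlen() of the value. */
static int parse_param_string(struct param *out, const char *key,
			      const char *value)
{
	return parse_param_len(out, key, value, value ? strlen(value) : 0);
}
```

Most callers would use the string variant and never compute a length themselves, which removes a class of mismatched-length mistakes.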
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
There can be a NULL pointer dereference bug here. NULL is passed to
__cifs_sfu_make_node without checks, which passes it unchecked to
cifs_strndup_to_utf16, which in turn passes it to
cifs_local_to_utf16_bytes where '*from' is dereferenced, causing a crash.
This patch adds a check for NULL 'src' in cifs_strndup_to_utf16 and
returns NULL early to prevent dereferencing a NULL pointer.
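The guard pattern looks roughly like the following userspace sketch. `dup_prefix()` is a hypothetical helper that only duplicates bytes; it does not perform the UTF-16 conversion that the real function does.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical stand-in for a "duplicate up to maxlen bytes" helper.
 * Returns NULL for a NULL source instead of dereferencing it.
 * The caller frees the result.
 */
static char *dup_prefix(const char *src, size_t maxlen)
{
	size_t len = 0;
	char *dst;

	if (!src)
		return NULL;	/* the early return this kind of fix adds */

	while (len < maxlen && src[len] != '\0')
		len++;
	assert(len <= maxlen);

	dst = malloc(len + 1);
	if (!dst)
		return NULL;
	memcpy(dst, src, len);
	dst[len] = '\0';
	return dst;
}
```

Callers already have to handle a NULL return for allocation failure, so returning NULL for a NULL source adds no new error path for them.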
Found by Linux Verification Center (linuxtesting.org) with SVACE
Signed-off-by: Makar Semyonov <m.semenov@tssltd.ru>
Cc: stable@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
Some architectures, such as RISC-V, use the ELF e_flags field to encode
ABI-specific information (e.g., ISA extensions, FPU support). Debuggers
like GDB rely on these flags in core dumps to correctly interpret
optional register sets. If the flags are missing or incorrect, GDB may
warn and ignore valid data, for example:
warning: Unexpected size of section '.reg2/213' in core file.
This can prevent access to FPU or other architecture-specific registers
even when they were dumped.
Save the e_flags field during ELF binary loading (in load_elf_binary())
into the mm_struct, and later retrieve it during core dump generation
(in fill_note_info()). Kconfig option CONFIG_ARCH_HAS_ELF_CORE_EFLAGS
is introduced for architectures that require this behaviour.
Signed-off-by: Svetlana Parfenova <svetlana.parfenova@syntacore.com>
Link: https://lore.kernel.org/r/20250901135350.619485-1-svetlana.parfenova@syntacore.com
Signed-off-by: Kees Cook <kees@kernel.org>
Pull smb server fix from Steve French:
- fix handling filenames with ":" (colon) in them
* tag 'v6.17-rc4-ksmbd-fix' of git://git.samba.org/ksmbd:
ksmbd: allow a filename to contain colons on SMB3.1.1 posix extensions
There is a kernel panic due to WARN_ONCE when panic_on_warn is set.
This issue occurs when writeback is triggered by a sync call for an
open file (i.e., the writeback reason is WB_REASON_SYNC). When f2fs
balance is needed in the sync path, a flush of quota_release_work is
triggered. By default, quota_release_work is queued to the
"events_unbound" workqueue, which does not have the WQ_MEM_RECLAIM flag.
During f2fs balance, the "writeback" workqueue tries to flush
quota_release_work, causing a kernel panic due to the WQ_MEM_RECLAIM
flag mismatch.
This patch creates a dedicated workqueue with the WQ_MEM_RECLAIM flag
for quota_release_work.
------------[ cut here ]------------
WARNING: CPU: 4 PID: 14867 at kernel/workqueue.c:3721 check_flush_dependency+0x13c/0x148
Call trace:
check_flush_dependency+0x13c/0x148
__flush_work+0xd0/0x398
flush_delayed_work+0x44/0x5c
dquot_writeback_dquots+0x54/0x318
f2fs_do_quota_sync+0xb8/0x1a8
f2fs_write_checkpoint+0x3cc/0x99c
f2fs_gc+0x190/0x750
f2fs_balance_fs+0x110/0x168
f2fs_write_single_data_page+0x474/0x7dc
f2fs_write_data_pages+0x7d0/0xd0c
do_writepages+0xe0/0x2f4
__writeback_single_inode+0x44/0x4ac
writeback_sb_inodes+0x30c/0x538
wb_writeback+0xf4/0x440
wb_workfn+0x128/0x5d4
process_scheduled_works+0x1c4/0x45c
worker_thread+0x32c/0x3e8
kthread+0x11c/0x1b0
ret_from_fork+0x10/0x20
Kernel panic - not syncing: kernel: panic_on_warn set ...
Fixes: ac6f420291 ("quota: flush quota_release_work upon quota writeback")
CC: stable@vger.kernel.org
Signed-off-by: Shashank A P <shashank.ap@samsung.com>
Link: https://patch.msgid.link/20250901092905.2115-1-shashank.ap@samsung.com
Signed-off-by: Jan Kara <jack@suse.cz>
Add smb3_lease_break_enter to trace lease break notifications,
recording lease state, flags, epoch, and lease key. Align
smb3_lease_not_found to use the same payload and print format.
Signed-off-by: Bharath SM <bharathsm@microsoft.com>
Acked-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
do_add_mount() consumes vfsmount on success; just follow it with
conditional retain_and_null_ptr() on success and we can switch
to __free() for mnt and be done with that - unlock_mount() is
in the very end.
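The ownership hand-off described here can be mimicked in userspace with the GCC/Clang `cleanup` attribute, which the kernel's `__free()` is built on. Everything below is a hypothetical sketch: `consume()`, `make_and_hand_off()` and the counters are invented, and setting the pointer to NULL by hand stands in for `retain_and_null_ptr()`.

```c
#include <assert.h>
#include <stdlib.h>

static int nr_freed;	/* counts automatic frees, for illustration */
static char *owned;	/* what the consumer kept on success */

/* Cleanup handler: runs when the annotated variable goes out of scope. */
static void free_buf(char **p)
{
	assert(p != NULL);
	if (*p) {
		free(*p);
		nr_freed++;
	}
}

/* Takes ownership of buf only on success. */
static int consume(char *buf, int fail)
{
	if (fail)
		return -1;
	owned = buf;
	return 0;
}

static int make_and_hand_off(int fail)
{
	char *buf __attribute__((cleanup(free_buf))) = malloc(16);
	int err;

	if (!buf)
		return -1;

	err = consume(buf, fail);
	if (!err)
		buf = NULL;	/* consumer owns it now: disarm the cleanup */
	return err;
}
```

On failure the buffer is freed automatically at scope exit. On success the pointer has been cleared, so the cleanup handler sees NULL and leaves the consumer's object alone.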
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
finish_automount() can't use lock_mount() - it treats finding something
already mounted as "quietly drop our mount and return 0", not as
"mount on top of whatever mounted there". It's been open-coded;
let's take it into a helper similar to lock_mount(). "something's
already mounted" => -EBUSY, finish_automount() needs to distinguish
it from the normal case and it can't happen in other failure cases.
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
preparations for making unlock_mount() a __cleanup();
can't have path_put() inside mount_lock scope.
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
preparations for making unlock_mount() a __cleanup();
can't have path_put() inside mount_lock scope.
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
It's enough to check that dentries match; if path->dentry is equal to
m->mnt_root, superblocks will match as well.
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
We want to mount beneath the given location. For that operation to
make sense, location must be the root of some mount that has something
under it. Currently we let it proceed if those requirements are not met,
with rather meaningless results, and have that bogosity caught further
down the road; let's fail early instead - do_lock_mount() doesn't make
sense unless those conditions hold, and checking them there makes
things simpler.
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
1) checking that location we want to move does point to root of some mount
can be done before anything else; that property is not going to change
and having it already verified simplifies the analysis.
2) checking the type agreement between what we are trying to move and what
we are trying to move it onto also belongs in the very beginning -
do_lock_mount() might end up switching new_path to something that overmounts
the original location, but... the same type agreement applies to overmounts,
so we could just as well check against the original location.
3) since we know that old_path->dentry is the root of old_path->mnt, there's
no point bothering with path_is_overmounted() in can_move_mount_beneath();
it's simply a check for the mount we are trying to move having non-NULL
->overmount. And with that, we can switch can_move_mount_beneath() to
taking old instead of old_path, leaving no uses of old_path past the original
checks.
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Both 'parent' and 'ns' are used at most once, no point precalculating those...
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Prior to the call of do_new_mount_fc() the caller has just done successful
vfs_get_tree(). Then do_new_mount_fc() does several checks on resulting
superblock, and either does fc_drop_locked() and returns an error or
proceeds to unlock the superblock and call vfs_create_mount().
The thing is, there's no reason to delay that unlock + vfs_create_mount() -
the tests do not rely upon the state of ->s_umount and
    fc_drop_locked()
    put_fs_context()
is equivalent to
    unlock ->s_umount
    put_fs_context()
Doing vfs_create_mount() before the checks allows us to move vfs_get_tree()
from caller to do_new_mount_fc() and collapse it with vfs_create_mount()
into an fc_mount() call.
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
here a use of __free(path_put) for dropping fs_root is enough to
make guard(mount_locked_reader) fit...
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
All we need here is to follow ->overmount on root mount of namespace...
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
... and document the locking requirements for is_path_reachable().
There is one questionable caller in do_listmount() where we are not
holding mount_lock *and* might not have the first argument mounted.
However, in that case it will immediately return true without having
to look at the ancestors. Might be cleaner to move the check into
non-LSTM_ROOT case which it really belongs in - there the check is
not always true and is_mounted() is guaranteed.
Document the locking environments for is_path_reachable() callers:
    get_peer_under_root()
    get_dominating_id()
    do_statmount()
    do_listmount()
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The reason why it needs only mount_locked_reader is that there are no
lockless accesses of expiry lists.
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
... and document the locking requirements of __has_locked_children()
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Currently we are taking mount_writer; what that function needs is
either mount_locked_reader (we are not changing anything, we just
want to iterate through the subtree) or namespace_shared and
a reference held by caller on the root of subtree - that's also
enough to stabilize the topology.
The thing is, all callers are already holding at least namespace_shared
as well as a reference to the root of subtree.
Let's make the callers provide locking warranties - don't mess with
mount_lock in check_for_nsfs_mounts() itself and document the locking
requirements.
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
clean fit; namespace_shared due to iterating through ns->mounts.
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
clean fit; guards can't be weaker due to umount_tree() call.
Setting emptied_ns requires namespace_excl, but not anything
mount_lock-related.
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>