The hint block group selection in the extent allocator is wrong in the
first place, as it can select the dedicated data relocation block group for
the normal data allocation.
Since we separated the normal data space_info and the data relocation
space_info, we can easily identify a block group is for data relocation or
not. Do not choose it for the normal data allocation.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If you run a workload with:
- a cgroup that does tons of parallel data reading, with a working set
much larger than its memory limit
- a second cgroup that writes relatively fewer files, with overwrites,
with no memory limit
(see full code listing at the bottom for a reproducer)
Then what quickly occurs is:
- we have a large number of threads trying to read the csum tree
- we have a decent number of threads deleting csums running delayed refs
- we have a large number of threads in direct reclaim and thus high
memory pressure
The result of this is that we writeback the csum tree repeatedly mid
transaction, to get back the extent_buffer folios for reclaim. As a
result, we repeatedly COW the csum tree for the delayed refs that are
deleting csums. This means repeatedly write locking the higher levels of
the tree.
As a result of this, we achieve an unpleasant priority inversion. We
have:
- a high degree of contention on the csum root node (and other upper
nodes) eb rwsem
- a memory starved cgroup doing tons of reclaim on CPU.
- many reader threads in the memory starved cgroup "holding" the sem
as readers, but not scheduling promptly. i.e., task __state == 0, but
not running on a cpu.
- btrfs_commit_transaction stuck trying to acquire the sem as a writer.
(running delayed_refs, deleting csums for unreferenced data extents)
This results in arbitrarily long transactions. This then results in
seriously degraded performance for any cgroup using the filesystem (the
victim cgroup in the script).
It isn't an academic problem, as we see this exact problem in production
at Meta with one cgroup over its memory limit ruining btrfs performance
for the whole system, stalling critical system services that depend on
btrfs syncs.
The underlying scheduling "problem" with global rwsems is sort of thorny
and apparently well known and was discussed at LPC 2024, for example.
As a result, our main lever in the short term is just trying to reduce
contention on our various rwsems with an eye to reducing the frequency
of write locking, to avoid disabling the read lock fast acquisition path.
Luckily, it seems likely that many reads are for old extents written
many transactions ago, and that for those we *can* in fact search the
commit root. The commit_root_sem only gets taken write once, near the
end of transaction commit, no matter how much memory pressure there is,
so we have much less contention between readers and writers.
This change detects when we are trying to read an old extent (according
to extent map generation) and then wires that through bio_ctrl to the
btrfs_bio, which unfortunately isn't allocated yet when we have this
information. When we go to lookup the csums in lookup_bio_sums we can
check this condition on the btrfs_bio and do the commit root lookup
accordingly.
Note that a single bio_ctrl might collect a few extent_maps into a single
bio, so it is important to track a maximum generation across all the
extent_maps used for each bio to make an accurate decision on whether it
is valid to look in the commit root. If any extent_map is updated in the
current generation, we can't use the commit root.
To test and reproduce this issue, I used the following script and
accompanying C program (to avoid bottlenecks in constantly forking
thousands of dd processes):
====== big-read.c ======
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <errno.h>
#define BUF_SZ (128 * (1 << 10UL))
int read_once(int fd, size_t sz) {
char buf[BUF_SZ];
size_t rd = 0;
int ret = 0;
while (rd < sz) {
ret = read(fd, buf, BUF_SZ);
if (ret < 0) {
if (errno == EINTR)
continue;
fprintf(stderr, "read failed: %d\n", errno);
return -errno;
} else if (ret == 0) {
break;
} else {
rd += ret;
}
}
return rd;
}
int read_loop(char *fname) {
int fd;
struct stat st;
size_t sz = 0;
int ret;
while (1) {
fd = open(fname, O_RDONLY);
if (fd == -1) {
perror("open");
return 1;
}
if (!sz) {
if (!fstat(fd, &st)) {
sz = st.st_size;
} else {
perror("stat");
return 1;
}
}
ret = read_once(fd, sz);
close(fd);
}
}
int main(int argc, char *argv[]) {
int fd;
struct stat st;
off_t sz;
char *buf;
int ret;
if (argc != 2) {
fprintf(stderr, "Usage: %s <filename>\n", argv[0]);
return 1;
}
return read_loop(argv[1]);
}
====== repro.sh ======
#!/usr/bin/env bash
SCRIPT=$(readlink -f "$0")
DIR=$(dirname "$SCRIPT")
dev=$1
mnt=$2
shift
shift
CG_ROOT=/sys/fs/cgroup
BAD_CG=$CG_ROOT/bad-nbr
GOOD_CG=$CG_ROOT/good-nbr
NR_BIGGOS=1
NR_LITTLE=10
NR_VICTIMS=32
NR_VILLAINS=512
START_SEC=$(date +%s)
_elapsed() {
echo "elapsed: $(($(date +%s) - $START_SEC))"
}
_stats() {
local sysfs=/sys/fs/btrfs/$(findmnt -no UUID $dev)
echo "================"
date
_elapsed
cat $sysfs/commit_stats
cat $BAD_CG/memory.pressure
}
_setup_cgs() {
echo "+memory +cpuset" > $CG_ROOT/cgroup.subtree_control
mkdir -p $GOOD_CG
mkdir -p $BAD_CG
echo max > $BAD_CG/memory.max
# memory.high much less than the working set will cause heavy reclaim
echo $((1 << 30)) > $BAD_CG/memory.high
# victims get a subset of villain CPUs
echo 0 > $GOOD_CG/cpuset.cpus
echo 0,1,2,3 > $BAD_CG/cpuset.cpus
}
_kill_cg() {
local cg=$1
local attempts=0
echo "kill cgroup $cg"
[ -f $cg/cgroup.procs ] || return
while true; do
attempts=$((attempts + 1))
echo 1 > $cg/cgroup.kill
sleep 1
procs=$(wc -l $cg/cgroup.procs | cut -d' ' -f1)
[ $procs -eq 0 ] && break
done
rmdir $cg
echo "killed cgroup $cg in $attempts attempts"
}
_biggo_vol() {
echo $mnt/biggo_vol.$1
}
_biggo_file() {
echo $(_biggo_vol $1)/biggo
}
_subvoled_biggos() {
total_sz=$((10 << 30))
per_sz=$((total_sz / $NR_VILLAINS))
dd_count=$((per_sz >> 20))
echo "create $NR_VILLAINS subvols with a file of size $per_sz bytes for a total of $total_sz bytes."
for i in $(seq $NR_VILLAINS)
do
btrfs subvol create $(_biggo_vol $i) &>/dev/null
dd if=/dev/zero of=$(_biggo_file $i) bs=1M count=$dd_count &>/dev/null
done
echo "done creating subvols."
}
_setup() {
[ -f .done ] && rm .done
findmnt -n $dev && exit 1
if [ -f .re-mkfs ]; then
mkfs.btrfs -f -m single -d single $dev >/dev/null || exit 2
else
echo "touch .re-mkfs to populate the test fs"
fi
mount -o noatime $dev $mnt || exit 3
[ -f .re-mkfs ] && _subvoled_biggos
_setup_cgs
}
_my_cleanup() {
echo "CLEANUP!"
_kill_cg $BAD_CG
_kill_cg $GOOD_CG
sleep 1
umount $mnt
}
_bad_exit() {
_err "Unexpected Exit! $?"
_stats
exit $?
}
trap _my_cleanup EXIT
trap _bad_exit INT TERM
_setup
# Use a lot of page cache reading the big file
_villain() {
local i=$1
echo $BASHPID > $BAD_CG/cgroup.procs
$DIR/big-read $(_biggo_file $i)
}
# Hit del_csum a lot by overwriting lots of small new files
_victim() {
echo $BASHPID > $GOOD_CG/cgroup.procs
i=0;
while (true)
do
local tmp=$mnt/tmp.$i
dd if=/dev/zero of=$tmp bs=4k count=2 >/dev/null 2>&1
i=$((i+1))
[ $i -eq $NR_LITTLE ] && i=0
done
}
_one_sync() {
echo "sync..."
before=$(date +%s)
sync
after=$(date +%s)
echo "sync done in $((after - before))s"
_stats
}
# sync in a loop
_sync() {
echo "start sync loop"
syncs=0
echo $BASHPID > $GOOD_CG/cgroup.procs
while true
do
[ -f .done ] && break
_one_sync
syncs=$((syncs + 1))
[ -f .done ] && break
sleep 10
done
if [ $syncs -eq 0 ]; then
echo "do at least one sync!"
_one_sync
fi
echo "sync loop done."
}
_sleep() {
local time=${1-60}
local now=$(date +%s)
local end=$((now + time))
while [ $now -lt $end ];
do
echo "SLEEP: $((end - now))s left. Sleep 10."
sleep 10
now=$(date +%s)
done
}
echo "start $NR_VILLAINS villains"
for i in $(seq $NR_VILLAINS)
do
_villain $i &
disown # get rid of annoying log on kill (done via cgroup anyway)
done
echo "start $NR_VICTIMS victims"
for i in $(seq $NR_VICTIMS)
do
_victim &
disown
done
_sync &
SYNC_PID=$!
_sleep $1
_elapsed
touch .done
wait $SYNC_PID
echo "OK"
exit 0
Without this patch, that reproducer:
- Ran for 6+ minutes instead of 60s
- Hung hundreds of threads in D state on the csum reader lock
- Got a commit stuck for 3 minutes
sync done in 388s
================
Wed Jul 9 09:52:31 PM UTC 2025
elapsed: 420
commits 2
cur_commit_ms 0
last_commit_ms 159446
max_commit_ms 159446
total_commit_ms 160058
some avg10=99.03 avg60=98.97 avg300=75.43 total=418033386
full avg10=82.79 avg60=80.52 avg300=59.45 total=324995274
419 hits state R, D comms big-read
btrfs_tree_read_lock_nested
btrfs_read_lock_root_node
btrfs_search_slot
btrfs_lookup_csum
btrfs_lookup_bio_sums
btrfs_submit_bbio
1 hits state D comms btrfs-transacti
btrfs_tree_lock_nested
btrfs_lock_root_node
btrfs_search_slot
btrfs_del_csums
__btrfs_run_delayed_refs
btrfs_run_delayed_refs
With the patch, the reproducer exits naturally, in 65s, completing a
pretty decent 4 commits, despite heavy memory pressure. Occasionally you
can still trigger a rather long commit (couple seconds) but never one
that is minutes long.
sync done in 3s
================
elapsed: 65
commits 4
cur_commit_ms 0
last_commit_ms 485
max_commit_ms 689
total_commit_ms 2453
some avg10=98.28 avg60=64.54 avg300=19.39 total=64849893
full avg10=74.43 avg60=48.50 avg300=14.53 total=48665168
some random rwalker samples showed the most common stack in reclaim,
rather than the csum tree:
145 hits state R comms bash, sleep, dd, shuf
shrink_folio_list
shrink_lruvec
shrink_node
do_try_to_free_pages
try_to_free_mem_cgroup_pages
reclaim_high
Link: https://lpc.events/event/18/contributions/1883/
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that btrfs_zone_finish_endio_workfn() is directly calling
do_zone_finish() the only caller of btrfs_zone_finish_endio() is
btrfs_finish_one_ordered().
btrfs_finish_one_ordered() already has error handling in-place so
btrfs_zone_finish_endio() can return an error if the block group lookup
fails.
Also as btrfs_zone_finish_endio() already checks for zoned filesystems and
returns early, there's no need to do this in the caller.
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When btrfs_zone_finish_endio_workfn() is calling btrfs_zone_finish_endio()
it already has a pointer to the block group. Furthermore
btrfs_zone_finish_endio() does additional checks if the block group can be
finished or not.
But in the context of btrfs_zone_finish_endio_workfn() only the actual
call to do_zone_finish() is of interest, as the skipping condition when
there is still room to allocate from the block group cannot be checked.
Directly call do_zone_finish() on the block group.
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's one only one caller of unaccount_log_buffer() and both this
function and the caller are short, so move its code into the caller.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of extracting again the disk_bytenr and disk_num_bytes values from
the file extent item to pass to btrfs_qgroup_trace_extent(), use the key
local variable 'ins' which already has those values, reducing the size of
the source code.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of having an if statement to check for regular and prealloc
extents first, process them in a block, and then following with an else
statement to check for an inline extent, check for an inline extent first,
process it and jump to the 'update_inode' label, allowing us to avoid
having the code for processing regular and prealloc extents inside a
block, reducing the high indentation level by one and making the code
easier to read and avoid line splittings too.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At replay_one_extent(), we can jump to the code that updates the file
extent range and updates the inode when processing a file extent item that
represents a hole and we don't have the NO_HOLES feature enabled. This
helps reduce the high indentation level by one in replay_one_extent() and
avoid splitting some lines to make the code easier to read.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In the replay_one_buffer() log tree walk callback we return errors to the
log tree walk caller and then the caller aborts the transaction, if we
have one, or turns the fs into error state if we don't have one. While
this reduces code it makes it harder to figure out where exactly an error
came from. So add the transaction aborts after every failure inside the
replay_one_buffer() callback and the functions it calls, making it as
fine grained as possible, so that it helps figuring out why failures
happen.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If read_alloc_one_name() we explicitly return -ENOMEM and currently that
is fine since it's the only error read_alloc_one_name() can return for
now. However this is fragile and not future proof, so return instead what
read_alloc_one_name() returned.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of keep dereferencing the walk_control structure to extract the
transaction handle whenever is needed, do it once by storing it in a local
variable and then use the variable everywhere. This reduces code verbosity
and eliminates the need for some split lines.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In the process_one_buffer() log tree walk callback we return errors to the
log tree walk caller and then the caller aborts the transaction, if we
have one, or turns the fs into error state if we don't have one. While
this reduces code it makes it harder to figure out where exactly an error
came from. So add the transaction aborts after every failure inside the
process_one_buffer() callback, so that it helps figuring out why failures
happen.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We do several things while walking a log tree (for replaying and for
freeing a log tree) like reading extent buffers and cleaning them up,
but we don't immediately abort the transaction, or turn the fs into an
error state, when one of these things fails. Instead we the transaction
abort or turn the fs into error state in the caller of the entry point
function that walks a log tree - walk_log_tree() - which means we don't
get to know exactly where an error came from.
Improve on this by doing a transaction abort / turn fs into error state
after each such failure so that when it happens we have a better
understanding where the failure comes from. This deliberately leaves
the transaction abort / turn fs into error state in the callers of
walk_log_tree() as to ensure we don't get into an inconsistent state in
case we forget to do it deeper in call chain. It also deliberately does
not do it after errors from the calls to the callback defined in
struct walk_control::process_func(), as we will do it later on another
patch.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function cow_file_range() has two boolean parameters. Replace it
with a single @flags parameter, with two flags:
- COW_FILE_RANGE_NO_INLINE
- COW_FILE_RANGE_KEEP_LOCKED
And since we're here, also update the comments of cow_file_range() to
replace the old "page" usage with "folio".
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In order to identify whether a certain file is open by a different
client, start the unlink process by sending a compound request of
CREATE(DELETE_ON_CLOSE) + CLOSE with only FILE_SHARE_DELETE bit set in
smb2_create_req::ShareAccess. If the file is currently open, then the
server will fail the request with STATUS_SHARING_VIOLATION, in which
case we'll map it to -EBUSY, so __cifs_unlink() will fall back to
silly-rename the file.
This fixes the following case where open(O_CREAT) fails with
-ENOENT (STATUS_DELETE_PENDING) due to file still open by a different
client.
* Before patch
$ mount.cifs //srv/share /mnt/1 -o ...,nosharesock
$ mount.cifs //srv/share /mnt/2 -o ...,nosharesock
$ cd /mnt/1
$ touch foo
$ exec 3<>foo
$ cd /mnt/2
$ rm foo
$ touch foo
touch: cannot touch 'foo': No such file or directory
$ exec 3>&-
* After patch
$ mount.cifs //srv/share /mnt/1 -o ...,nosharesock
$ mount.cifs //srv/share /mnt/2 -o ...,nosharesock
$ cd /mnt/1
$ touch foo
$ exec 3<>foo
$ cd /mnt/2
$ rm foo
$ touch foo
$ exec 3>&-
Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
Reviewed-by: David Howells <dhowells@redhat.com>
Cc: Frank Sorenson <sorenson@redhat.com>
Cc: linux-cifs@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
Now that nfsd is accessing SHA-256 via the library API instead of via
crypto_shash, there is a direct symbol dependency on the SHA-256 code
and there is no benefit to be gained from forcing it to be built-in.
Therefore, select CRYPTO_LIB_SHA256 from NFSD (conditional on NFSD_V4)
instead of from NFSD_V4, so that it can be 'm' if NFSD is 'm'.
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
When v3 NLM request finds a conflicting delegation, it triggers
a delegation recall and nfsd_open fails with EAGAIN. nfsd_open
then translates EAGAIN into nfserr_jukebox. In nlm_fopen, instead
of returning nlm_failed for when there is a conflicting delegation,
drop this NLM request so that the client retries. Once delegation
is recalled and if a local lock is claimed, a retry would lead to
nfsd returning a nlm_lck_blocked error or a successful nlm lock.
Fixes: d343fce148 ("[PATCH] knfsd: Allow lockd to drop replies as appropriate")
Cc: stable@vger.kernel.org # v6.6
Signed-off-by: Olga Kornievskaia <okorniev@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
The common case is that a DRC lookup will not find the XID in the
bucket. Reduce the amount of pointer chasing during the lookup by
keeping fewer entries in each hash bucket.
Changing the bucket size constant forces the size of the DRC hash
table to increase, and the height of each bucket r-b tree to be
reduced.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Neil Brown observes:
> I would not include RC_INPROG entries in the lru at all - they are
> always ignored, and will be added when they are switched to
> RCU_DONE.
I also removed a stale comment.
Suggested-by: NeilBrown <neil@brown.name>
Reviewed-by: NeilBrown <neil@brown.name>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Clean up: because svc_rpcb_cleanup() and svc_xprt_destroy_all()
are always invoked in pairs, we can deduplicate code by moving
the svc_rpcb_cleanup() call sites into svc_xprt_destroy_all().
Tested-by: Olga Kornievskaia <okorniev@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
The NFS client's NFSv4.0 callback listeners are created with
SVC_SOCK_ANONYMOUS, therefore svc_setup_socket() does not register
them with the client's rpcbind service.
And, note that nfs_callback_down_net() does not call
svc_rpcb_cleanup() at all when shutting down the callback server.
Even if svc_setup_socket() were to attempt to register or unregister
these sockets, the callback service has vs_hidden set, which shunts
the rpcbind upcalls.
The svc_rpcb_cleanup() error flow was introduced by
commit c946556b87 ("NFS: move per-net callback thread
initialization to nfs_callback_up_net()"). It doesn't appear in the
code that was relocated by that commit.
Therefore, there is no need to call svc_rpcb_cleanup() when listener
creation fails during callback server start-up.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
The result of integer comparison already evaluates to bool. No need for
explicit conversion.
Signed-off-by: Xichao Zhao <zhao.xichao@vivo.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
A while back I had reported that an NFSv3 client could successfully
mount using '-o xprtsec=none' an export that had been exported with
'xprtsec=tls:mtls'. By "successfully" I mean that the mount command
would succeed and the mount would show up in /proc/mount. Attempting
to do anything futher with the mount would be met with NFS3ERR_ACCES.
This was fixed (albeit accidentally) by commit bb4f07f240 ("nfsd:
Fix NFSD_MAY_BYPASS_GSS and NFSD_MAY_BYPASS_GSS_ON_ROOT") and was
subsequently re-broken by commit 0813c5f012 ("nfsd: fix access
checking for NLM under XPRTSEC policies").
Transport Layer Security isn't an RPC security flavor or pseudo-flavor,
so we shouldn't be conflating them when determining whether the access
checks can be bypassed. Split check_nfsd_access() into two helpers, and
have __fh_verify() call the helpers directly since __fh_verify() has
logic that allows one or both of the checks to be skipped. All other
sites will continue to call check_nfsd_access().
Link: https://lore.kernel.org/linux-nfs/ZjO3Qwf_G87yNXb2@aion/
Fixes: 9280c57743 ("NFSD: Handle new xprtsec= export option")
Cc: stable@vger.kernel.org
Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Commit 5304877936 ("NFSD: Fix strncpy() fortify warning") replaced
strncpy(,, sizeof(..)) with strlcpy(,, sizeof(..) - 1), but strlcpy()
already guaranteed NUL-termination of the destination buffer and
subtracting one byte potentially truncated the source string.
The incorrect size was then carried over in commit 72f78ae00a ("NFSD:
move from strlcpy with unused retval to strscpy") when switching from
strlcpy() to strscpy().
Fix this off-by-one error by using the full size of the destination
buffer again.
Cc: stable@vger.kernel.org
Fixes: 5304877936 ("NFSD: Fix strncpy() fortify warning")
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Since MD5 digests are fixed-size, make nfs4_make_rec_clidname() store
the digest in a stack buffer instead of a dynamically allocated buffer.
Use MD5_DIGEST_SIZE instead of a hard-coded value, both in
nfs4_make_rec_clidname() and in the definition of HEXDIR_LEN.
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Since the Linux kernel's sprintf() has conversion to hex built-in via
"%*phN", delete md5_to_hex() and just use that. Also add an explicit
array bound to the dname parameter of nfs4_make_rec_clidname() to make
its size clear. No functional change.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
There is an extraneous space before a newline in a dprintk message.
Remove the space.
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Instead of allowing the ctime to roll backward with a WRITE_ATTRS
delegation, set FMODE_NOCMTIME on the file and have it skip mtime and
ctime updates.
It is possible that the client will never send a SETATTR to set the
times before returning the delegation. Add two new bools to struct
nfs4_delegation:
dl_written: tracks whether the file has been written since the
delegation was granted. This is set in the WRITE and LAYOUTCOMMIT
handlers.
dl_setattr: tracks whether the client has sent at least one valid
mtime that can also update the ctime in a SETATTR.
When unlocking the lease for the delegation, clear FMODE_NOCMTIME. If
the file has been written, but no setattr for the delegated mtime and
ctime has been done, update the timestamps to current_time().
Suggested-by: NeilBrown <neil@brown.name>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
When updating the local timestamps from CB_GETATTR, the updated values
are not being properly vetted.
Compare the update times vs. the saved times in the delegation rather
than the current times in the inode. Also, ensure that the ctime is
properly vetted vs. its original value.
Fixes: 6ae30d6eb2 ("nfsd: add support for delegated timestamps")
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
SETATTRs containing delegated timestamp updates are currently not being
vetted properly. Since we no longer need to compare the timestamps vs.
the current timestamps, move the vetting of delegated timestamps wholly
into nfsd.
Rename the set_cb_time() helper to nfsd4_vet_deleg_time(), and make it
non-static. Add a new vet_deleg_attrs() helper that is called from
nfsd4_setattr that uses nfsd4_vet_deleg_time() to properly validate the
all the timestamps. If the validation indicates that the update should
be skipped, unset the appropriate flags in ia_valid.
Fixes: 7e13f4f8d2 ("nfsd: handle delegated timestamps in SETATTR")
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
As Trond points out [1], the "original time" mentioned in RFC 9754
refers to the timestamps on the files at the time that the delegation
was granted, and not the current timestamp of the file on the server.
Store the current timestamps for the file in the nfs4_delegation when
granting one. Add STATX_ATIME and STATX_MTIME to the request mask in
nfs4_delegation_stat(). When granting OPEN_DELEGATE_READ_ATTRS_DELEG, do
a nfs4_delegation_stat() and save the correct atime. If the stat() fails
for any reason, fall back to granting a normal read deleg.
[1]: https://lore.kernel.org/linux-nfs/47a4e40310e797f21b5137e847b06bb203d99e66.camel@kernel.org/
Fixes: 7e13f4f8d2 ("nfsd: handle delegated timestamps in SETATTR")
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Ensure that notify_change() doesn't clobber a delegated ctime update
with current_time() by setting ATTR_CTIME_SET for those updates.
Don't bother setting the timestamps in cb_getattr_update_times() in the
non-delegated case. notify_change() will do that itself.
Fixes: 7e13f4f8d2 ("nfsd: handle delegated timestamps in SETATTR")
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
When ATTR_ATIME_SET and ATTR_MTIME_SET are set in the ia_valid mask, the
notify_change() logic takes that to mean that the request should set
those values explicitly, and not override them with "now".
With the advent of delegated timestamps, similar functionality is needed
for the ctime. Add a ATTR_CTIME_SET flag, and use that to indicate that
the ctime should be accepted as-is. Also, clean up the if statements to
eliminate the extra negatives.
In setattr_copy() and setattr_copy_mgtime() use inode_set_ctime_deleg()
when ATTR_CTIME_SET is set, instead of basing the decision on ATTR_DELEG.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
If the only flag left is ATTR_DELEG, then there are no changes to be
made.
Fixes: 7e13f4f8d2 ("nfsd: handle delegated timestamps in SETATTR")
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
The ia_ctime.tv_nsec field should be set to modify.nseconds.
Fixes: 7e13f4f8d2 ("nfsd: handle delegated timestamps in SETATTR")
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
The data type of loca_last_write_offset is newoffset4 and is switched
on a boolean value, no_newoffset, that indicates if a previous write
occurred or not. If no_newoffset is FALSE, an offset is not given.
This means that client does not try to update the file size. Thus,
server should not try to calculate new file size and check if it fits
into the segment range. See RFC 8881, section 12.5.4.2.
Sometimes the current incorrect logic may cause clients to hang when
trying to sync an inode. If layoutcommit fails, the client marks the
inode as dirty again.
Fixes: 9cf514ccfa ("nfsd: implement pNFS operations")
Cc: stable@vger.kernel.org
Co-developed-by: Konstantin Evtushenko <koevtushenko@yandex.com>
Signed-off-by: Konstantin Evtushenko <koevtushenko@yandex.com>
Signed-off-by: Sergey Bashirov <sergeybashirov@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
When pNFS client in the block or scsi layout mode sends layoutcommit
to MDS, a variable length array of modified extents is supplied within
the request. This patch allows the server to accept such extent arrays
if they do not fit within single memory page.
The issue can be reproduced when writing to a 1GB file using FIO with
O_DIRECT, 4K block and large I/O depth without preallocation of the
file. In this case, the server returns NFSERR_BADXDR to the client.
Co-developed-by: Konstantin Evtushenko <koevtushenko@yandex.com>
Signed-off-by: Konstantin Evtushenko <koevtushenko@yandex.com>
Signed-off-by: Sergey Bashirov <sergeybashirov@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Use the appropriate xdr function to decode the lc_newoffset field,
which is a boolean value. See RFC 8881, section 18.42.1.
Signed-off-by: Sergey Bashirov <sergeybashirov@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Remove dprintk in nfsd4_layoutcommit. These are not needed
in day to day usage, and the information is also available
in Wireshark when capturing NFS traffic.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sergey Bashirov <sergeybashirov@gmail.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Compilers may optimize the layout of C structures, so we should not rely
on sizeof struct and memcpy to encode and decode XDR structures. The byte
order of the fields should also be taken into account.
This patch adds the correct functions to handle the deviceid4 structure
and removes the pad field, which is currently not used by NFSD, from the
runtime state. The server's byte order is preserved because the deviceid4
blob on the wire is only used as a cookie by the client.
Signed-off-by: Sergey Bashirov <sergeybashirov@gmail.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
This interface was deprecated by commit e6f7e1487a ("nfs_localio:
simplify interface to nfsd for getting nfsd_file") and is now
unused. So let's remove it.
Signed-off-by: NeilBrown <neil@brown.name>
Reviewed-by: Mike Snitzer <snitzer@kernel.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Clean up: The fh_getattr() function is part of NFSD's file handle
API, so relocate it.
I've made it an un-inlined function so that trace points and new
functionality can easily be introduced. That increases the size of
nfsd.ko by about a page on my x86_64 system (out of 26MB; compiled
with -O2).
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Clean up: these helpers are part of the NFSD file handle API.
Relocate them to fs/nfsd/nfsfh.h.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
For exFAT filesystems with 4MB read_ahead_size, removing the storage
device during read operations can delay EIO error reporting by several
minutes. This occurs because the read-ahead implementation in mpage
doesn't handle errors.
Another reason for the delay is that the filesystem requires metadata to
issue file read request. When the storage device is removed, the metadata
buffers are invalidated, causing mpage to repeatedly attempt to fetch
metadata during each get_block call.
The original purpose of this patch is terminate read ahead when we fail to
get metadata, to make the patch more generic, implement it by checking
folio status, instead of checking the return of get_block().
So, if a folio is synchronously unlocked and non-uptodate, should we quit
the read ahead?
I think it depends on whether the error is permanent or temporary, and
whether further read ahead might succeed. A device being unplugged is one
reason for returning such a folio, but we could return it for many other
reasons (e.g., metadata errors). I think most errors won't be restored in
a short time, so we should quit read ahead when they occur.
Link: https://lkml.kernel.org/r/20250829023659.688649-1-chizhiling@163.com
Signed-off-by: Chi Zhiling <chizhiling@kylinos.cn>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Namjae Jeon <linkinjeon@kernel.org>
Cc: Sungjong Seo <sj1557.seo@samsung.com>
Cc: Yuezhang Mo <Yuezhang.Mo@sony.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>