Commit Graph

102029 Commits

Author SHA1 Message Date
Naohiro Aota
0d703963d2 btrfs: zoned: refine extent allocator hint selection
The hint block group selection in the extent allocator is wrong in the
first place, as it can select the dedicated data relocation block group for
the normal data allocation.

Since we separated the normal data space_info and the data relocation
space_info, we can easily identify a block group is for data relocation or
not. Do not choose it for the normal data allocation.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-22 10:54:31 +02:00
Boris Burkov
f07b855c56 btrfs: try to search for data csums in commit root
If you run a workload with:

- a cgroup that does tons of parallel data reading, with a working set
  much larger than its memory limit
- a second cgroup that writes relatively fewer files, with overwrites,
  with no memory limit
(see full code listing at the bottom for a reproducer)

Then what quickly occurs is:

- we have a large number of threads trying to read the csum tree
- we have a decent number of threads deleting csums running delayed refs
- we have a large number of threads in direct reclaim and thus high
  memory pressure

The result of this is that we writeback the csum tree repeatedly mid
transaction, to get back the extent_buffer folios for reclaim. As a
result, we repeatedly COW the csum tree for the delayed refs that are
deleting csums. This means repeatedly write locking the higher levels of
the tree.

As a result of this, we achieve an unpleasant priority inversion. We
have:

- a high degree of contention on the csum root node (and other upper
  nodes) eb rwsem
- a memory starved cgroup doing tons of reclaim on CPU.
- many reader threads in the memory starved cgroup "holding" the sem
  as readers, but not scheduling promptly. i.e., task __state == 0, but
  not running on a cpu.
- btrfs_commit_transaction stuck trying to acquire the sem as a writer.
  (running delayed_refs, deleting csums for unreferenced data extents)

This results in arbitrarily long transactions. This then results in
seriously degraded performance for any cgroup using the filesystem (the
victim cgroup in the script).

It isn't an academic problem, as we see this exact problem in production
at Meta with one cgroup over its memory limit ruining btrfs performance
for the whole system, stalling critical system services that depend on
btrfs syncs.

The underlying scheduling "problem" with global rwsems is sort of thorny
and apparently well known and was discussed at LPC 2024, for example.

As a result, our main lever in the short term is just trying to reduce
contention on our various rwsems with an eye to reducing the frequency
of write locking, to avoid disabling the read lock fast acquisition path.

Luckily, it seems likely that many reads are for old extents written
many transactions ago, and that for those we *can* in fact search the
commit root. The commit_root_sem only gets taken write once, near the
end of transaction commit, no matter how much memory pressure there is,
so we have much less contention between readers and writers.

This change detects when we are trying to read an old extent (according
to extent map generation) and then wires that through bio_ctrl to the
btrfs_bio, which unfortunately isn't allocated yet when we have this
information. When we go to lookup the csums in lookup_bio_sums we can
check this condition on the btrfs_bio and do the commit root lookup
accordingly.

Note that a single bio_ctrl might collect a few extent_maps into a single
bio, so it is important to track a maximum generation across all the
extent_maps used for each bio to make an accurate decision on whether it
is valid to look in the commit root. If any extent_map is updated in the
current generation, we can't use the commit root.

To test and reproduce this issue, I used the following script and
accompanying C program (to avoid bottlenecks in constantly forking
thousands of dd processes):

====== big-read.c ======
  #include <fcntl.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/mman.h>
  #include <sys/stat.h>
  #include <unistd.h>
  #include <errno.h>

  #define BUF_SZ (128 * (1 << 10UL))

  int read_once(int fd, size_t sz) {
  	char buf[BUF_SZ];
  	size_t rd = 0;
  	int ret = 0;

  	while (rd < sz) {
  		ret = read(fd, buf, BUF_SZ);
  		if (ret < 0) {
  			if (errno == EINTR)
  				continue;
  			fprintf(stderr, "read failed: %d\n", errno);
  			return -errno;
  		} else if (ret == 0) {
  			break;
  		} else {
  			rd += ret;
  		}
  	}
  	return rd;
  }

  int read_loop(char *fname) {
  	int fd;
  	struct stat st;
  	size_t sz = 0;
  	int ret;

  	while (1) {
  		fd = open(fname, O_RDONLY);
  		if (fd == -1) {
  			perror("open");
  			return 1;
  		}
  		if (!sz) {
  			if (!fstat(fd, &st)) {
  				sz = st.st_size;
  			} else {
  				perror("stat");
  				return 1;
  			}
  		}

                  ret = read_once(fd, sz);
  		close(fd);
  	}
  }

  int main(int argc, char *argv[]) {
  	int fd;
  	struct stat st;
  	off_t sz;
  	char *buf;
  	int ret;

  	if (argc != 2) {
  		fprintf(stderr, "Usage: %s <filename>\n", argv[0]);
  		return 1;
  	}

  	return read_loop(argv[1]);
  }

====== repro.sh ======
  #!/usr/bin/env bash

  SCRIPT=$(readlink -f "$0")
  DIR=$(dirname "$SCRIPT")

  dev=$1
  mnt=$2
  shift
  shift

  CG_ROOT=/sys/fs/cgroup
  BAD_CG=$CG_ROOT/bad-nbr
  GOOD_CG=$CG_ROOT/good-nbr
  NR_BIGGOS=1
  NR_LITTLE=10
  NR_VICTIMS=32
  NR_VILLAINS=512

  START_SEC=$(date +%s)

  _elapsed() {
  	echo "elapsed: $(($(date +%s) - $START_SEC))"
  }

  _stats() {
  	local sysfs=/sys/fs/btrfs/$(findmnt -no UUID $dev)

  	echo "================"
  	date
  	_elapsed
  	cat $sysfs/commit_stats
  	cat $BAD_CG/memory.pressure
  }

  _setup_cgs() {
  	echo "+memory +cpuset" > $CG_ROOT/cgroup.subtree_control
  	mkdir -p $GOOD_CG
  	mkdir -p $BAD_CG
  	echo max > $BAD_CG/memory.max
  	# memory.high much less than the working set will cause heavy reclaim
  	echo $((1 << 30)) > $BAD_CG/memory.high

  	# victims get a subset of villain CPUs
  	echo 0 > $GOOD_CG/cpuset.cpus
  	echo 0,1,2,3 > $BAD_CG/cpuset.cpus
  }

  _kill_cg() {
  	local cg=$1
  	local attempts=0
  	echo "kill cgroup $cg"
  	[ -f $cg/cgroup.procs ] || return
  	while true; do
  		attempts=$((attempts + 1))
  		echo 1 > $cg/cgroup.kill
  		sleep 1
  		procs=$(wc -l $cg/cgroup.procs | cut -d' ' -f1)
  		[ $procs -eq 0 ] && break
  	done
  	rmdir $cg
  	echo "killed cgroup $cg in $attempts attempts"
  }

  _biggo_vol() {
  	echo $mnt/biggo_vol.$1
  }

  _biggo_file() {
  	echo $(_biggo_vol $1)/biggo
  }

  _subvoled_biggos() {
  	total_sz=$((10 << 30))
  	per_sz=$((total_sz / $NR_VILLAINS))
  	dd_count=$((per_sz >> 20))
  	echo "create $NR_VILLAINS subvols with a file of size $per_sz bytes for a total of $total_sz bytes."
  	for i in $(seq $NR_VILLAINS)
  	do
  		btrfs subvol create $(_biggo_vol $i) &>/dev/null
  		dd if=/dev/zero of=$(_biggo_file $i) bs=1M count=$dd_count &>/dev/null
  	done
  	echo "done creating subvols."
  }

  _setup() {
  	[ -f .done ] && rm .done
  	findmnt -n $dev && exit 1
        if [ -f .re-mkfs ]; then
		mkfs.btrfs -f -m single -d single $dev >/dev/null || exit 2
	else
		echo "touch .re-mkfs to populate the test fs"
	fi

  	mount -o noatime $dev $mnt || exit 3
  	[ -f .re-mkfs ] && _subvoled_biggos
  	_setup_cgs
  }

  _my_cleanup() {
  	echo "CLEANUP!"
  	_kill_cg $BAD_CG
  	_kill_cg $GOOD_CG
  	sleep 1
  	umount $mnt
  }

  _bad_exit() {
  	_err "Unexpected Exit! $?"
  	_stats
  	exit $?
  }

  trap _my_cleanup EXIT
  trap _bad_exit INT TERM

  _setup

  # Use a lot of page cache reading the big file
  _villain() {
  	local i=$1
  	echo $BASHPID > $BAD_CG/cgroup.procs
  	$DIR/big-read $(_biggo_file $i)
  }

  # Hit del_csum a lot by overwriting lots of small new files
  _victim() {
  	echo $BASHPID > $GOOD_CG/cgroup.procs
  	i=0;
  	while (true)
  	do
  		local tmp=$mnt/tmp.$i

  		dd if=/dev/zero of=$tmp bs=4k count=2 >/dev/null 2>&1
  		i=$((i+1))
  		[ $i -eq $NR_LITTLE ] && i=0
  	done
  }

  _one_sync() {
  	echo "sync..."
  	before=$(date +%s)
  	sync
  	after=$(date +%s)
  	echo "sync done in $((after - before))s"
  	_stats
  }

  # sync in a loop
  _sync() {
  	echo "start sync loop"
  	syncs=0
  	echo $BASHPID > $GOOD_CG/cgroup.procs
  	while true
  	do
  		[ -f .done ] && break
  		_one_sync
  		syncs=$((syncs + 1))
  		[ -f .done ] && break
  		sleep 10
  	done
  	if [ $syncs -eq 0 ]; then
  		echo "do at least one sync!"
  		_one_sync
  	fi
  	echo "sync loop done."
  }

  _sleep() {
  	local time=${1-60}
  	local now=$(date +%s)
  	local end=$((now + time))
  	while [ $now -lt $end ];
  	do
  		echo "SLEEP: $((end - now))s left. Sleep 10."
  		sleep 10
  		now=$(date +%s)
  	done
  }

  echo "start $NR_VILLAINS villains"
  for i in $(seq $NR_VILLAINS)
  do
  	_villain $i &
  	disown # get rid of annoying log on kill (done via cgroup anyway)
  done

  echo "start $NR_VICTIMS victims"
  for i in $(seq $NR_VICTIMS)
  do
  	_victim &
  	disown
  done

  _sync &
  SYNC_PID=$!

  _sleep $1
  _elapsed
  touch .done
  wait $SYNC_PID

  echo "OK"
  exit 0

Without this patch, that reproducer:

- Ran for 6+ minutes instead of 60s
- Hung hundreds of threads in D state on the csum reader lock
- Got a commit stuck for 3 minutes

sync done in 388s
================
Wed Jul  9 09:52:31 PM UTC 2025
elapsed: 420
commits 2
cur_commit_ms 0
last_commit_ms 159446
max_commit_ms 159446
total_commit_ms 160058
some avg10=99.03 avg60=98.97 avg300=75.43 total=418033386
full avg10=82.79 avg60=80.52 avg300=59.45 total=324995274

419 hits state R, D comms big-read
                 btrfs_tree_read_lock_nested
                 btrfs_read_lock_root_node
                 btrfs_search_slot
                 btrfs_lookup_csum
                 btrfs_lookup_bio_sums
                 btrfs_submit_bbio

1 hits state D comms btrfs-transacti
                 btrfs_tree_lock_nested
                 btrfs_lock_root_node
                 btrfs_search_slot
                 btrfs_del_csums
                 __btrfs_run_delayed_refs
                 btrfs_run_delayed_refs

With the patch, the reproducer exits naturally, in 65s, completing a
pretty decent 4 commits, despite heavy memory pressure. Occasionally you
can still trigger a rather long commit (couple seconds) but never one
that is minutes long.

sync done in 3s
================
elapsed: 65
commits 4
cur_commit_ms 0
last_commit_ms 485
max_commit_ms 689
total_commit_ms 2453
some avg10=98.28 avg60=64.54 avg300=19.39 total=64849893
full avg10=74.43 avg60=48.50 avg300=14.53 total=48665168

some random rwalker samples showed the most common stack in reclaim,
rather than the csum tree:
145 hits state R comms bash, sleep, dd, shuf
                 shrink_folio_list
                 shrink_lruvec
                 shrink_node
                 do_try_to_free_pages
                 try_to_free_mem_cgroup_pages
                 reclaim_high

Link: https://lpc.events/event/18/contributions/1883/
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-22 10:54:31 +02:00
Jiapeng Chong
6d9cce2d1b btrfs: remove duplicate inclusion of linux/types.h
In messages.h there's linux/types.h included more than once.

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=22939
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-22 10:54:31 +02:00
Johannes Thumshirn
3c44cd3c79 btrfs: zoned: return error from btrfs_zone_finish_endio()
Now that btrfs_zone_finish_endio_workfn() is directly calling
do_zone_finish() the only caller of btrfs_zone_finish_endio() is
btrfs_finish_one_ordered().

btrfs_finish_one_ordered() already has error handling in-place so
btrfs_zone_finish_endio() can return an error if the block group lookup
fails.

Also as btrfs_zone_finish_endio() already checks for zoned filesystems and
returns early, there's no need to do this in the caller.

Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-22 10:54:30 +02:00
Johannes Thumshirn
3d16abf6c8 btrfs: zoned: directly call do_zone_finish() from btrfs_zone_finish_endio_workfn()
When btrfs_zone_finish_endio_workfn() is calling btrfs_zone_finish_endio()
it already has a pointer to the block group. Furthermore
btrfs_zone_finish_endio() does additional checks if the block group can be
finished or not.

But in the context of btrfs_zone_finish_endio_workfn() only the actual
call to do_zone_finish() is of interest, as the skipping condition when
there is still room to allocate from the block group cannot be checked.

Directly call do_zone_finish() on the block group.

Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-22 10:54:30 +02:00
Filipe Manana
669d38bf58 btrfs: collapse unaccount_log_buffer() into clean_log_buffer()
There's one only one caller of unaccount_log_buffer() and both this
function and the caller are short, so move its code into the caller.

Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-22 10:54:30 +02:00
Filipe Manana
cac2ab34d8 btrfs: use local key variable to pass arguments in replay_one_extent()
Instead of extracting again the disk_bytenr and disk_num_bytes values from
the file extent item to pass to btrfs_qgroup_trace_extent(), use the key
local variable 'ins' which already has those values, reducing the size of
the source code.

Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-22 10:54:30 +02:00
Filipe Manana
575f52a77a btrfs: process inline extent earlier in replay_one_extent()
Instead of having an if statement to check for regular and prealloc
extents first, process them in a block, and then following with an else
statement to check for an inline extent, check for an inline extent first,
process it and jump to the 'update_inode' label, allowing us to avoid
having the code for processing regular and prealloc extents inside a
block, reducing the high indentation level by one and making the code
easier to read and avoid line splittings too.

Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-22 10:54:30 +02:00
Filipe Manana
88666b6df9 btrfs: exit early when replaying hole file extent item from a log tree
At replay_one_extent(), we can jump to the code that updates the file
extent range and updates the inode when processing a file extent item that
represents a hole and we don't have the NO_HOLES feature enabled. This
helps reduce the high indentation level by one in replay_one_extent() and
avoid splitting some lines to make the code easier to read.

Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-22 10:54:30 +02:00
Filipe Manana
912c257c88 btrfs: abort transaction where errors happen during log tree replay
In the replay_one_buffer() log tree walk callback we return errors to the
log tree walk caller and then the caller aborts the transaction, if we
have one, or turns the fs into error state if we don't have one. While
this reduces code it makes it harder to figure out where exactly an error
came from. So add the transaction aborts after every failure inside the
replay_one_buffer() callback and the functions it calls, making it as
fine grained as possible, so that it helps figuring out why failures
happen.

Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-22 10:54:29 +02:00
Filipe Manana
874576d2a7 btrfs: return real error from read_alloc_one_name() in drop_one_dir_item()
If read_alloc_one_name() we explicitly return -ENOMEM and currently that
is fine since it's the only error read_alloc_one_name() can return for
now. However this is fragile and not future proof, so return instead what
read_alloc_one_name() returned.

Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-22 10:54:29 +02:00
Filipe Manana
425652cf10 btrfs: use local variable for the transaction handle in replay_one_buffer()
Instead of keep dereferencing the walk_control structure to extract the
transaction handle whenever is needed, do it once by storing it in a local
variable and then use the variable everywhere. This reduces code verbosity
and eliminates the need for some split lines.

Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-22 10:54:29 +02:00
Filipe Manana
e6dd405b66 btrfs: abort transaction in the process_one_buffer() log tree walk callback
In the process_one_buffer() log tree walk callback we return errors to the
log tree walk caller and then the caller aborts the transaction, if we
have one, or turns the fs into error state if we don't have one. While
this reduces code it makes it harder to figure out where exactly an error
came from. So add the transaction aborts after every failure inside the
process_one_buffer() callback, so that it helps figuring out why failures
happen.

Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-22 10:54:29 +02:00
Filipe Manana
6ebd726b10 btrfs: abort transaction on specific error places when walking log tree
We do several things while walking a log tree (for replaying and for
freeing a log tree) like reading extent buffers and cleaning them up,
but we don't immediately abort the transaction, or turn the fs into an
error state, when one of these things fails. Instead we the transaction
abort or turn the fs into error state in the caller of the entry point
function that walks a log tree - walk_log_tree() - which means we don't
get to know exactly where an error came from.

Improve on this by doing a transaction abort / turn fs into error state
after each such failure so that when it happens we have a better
understanding where the failure comes from. This deliberately leaves
the transaction abort / turn fs into error state in the callers of
walk_log_tree() as to ensure we don't get into an inconsistent state in
case we forget to do it deeper in call chain. It also deliberately does
not do it after errors from the calls to the callback defined in
struct walk_control::process_func(), as we will do it later on another
patch.

Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-22 10:54:29 +02:00
Qu Wenruo
aa8fc9469d btrfs: replace double boolean parameters of cow_file_range()
The function cow_file_range() has two boolean parameters.  Replace it
with a single @flags parameter, with two flags:

- COW_FILE_RANGE_NO_INLINE
- COW_FILE_RANGE_KEEP_LOCKED

And since we're here, also update the comments of cow_file_range() to
replace the old "page" usage with "folio".

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-22 10:54:29 +02:00
Paulo Alcantara
1cf9f2a6a5 smb: client: handle unlink(2) of files open by different clients
In order to identify whether a certain file is open by a different
client, start the unlink process by sending a compound request of
CREATE(DELETE_ON_CLOSE) + CLOSE with only FILE_SHARE_DELETE bit set in
smb2_create_req::ShareAccess.  If the file is currently open, then the
server will fail the request with STATUS_SHARING_VIOLATION, in which
case we'll map it to -EBUSY, so __cifs_unlink() will fall back to
silly-rename the file.

This fixes the following case where open(O_CREAT) fails with
-ENOENT (STATUS_DELETE_PENDING) due to file still open by a different
client.

* Before patch

$ mount.cifs //srv/share /mnt/1 -o ...,nosharesock
$ mount.cifs //srv/share /mnt/2 -o ...,nosharesock
$ cd /mnt/1
$ touch foo
$ exec 3<>foo
$ cd /mnt/2
$ rm foo
$ touch foo
touch: cannot touch 'foo': No such file or directory
$ exec 3>&-

* After patch

$ mount.cifs //srv/share /mnt/1 -o ...,nosharesock
$ mount.cifs //srv/share /mnt/2 -o ...,nosharesock
$ cd /mnt/1
$ touch foo
$ exec 3<>foo
$ cd /mnt/2
$ rm foo
$ touch foo
$ exec 3>&-

Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
Reviewed-by: David Howells <dhowells@redhat.com>
Cc: Frank Sorenson <sorenson@redhat.com>
Cc: linux-cifs@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
2025-09-21 19:36:27 -05:00
Stefan Metzmacher
f7f8925017 smb: server: use disable_work_sync in transport_rdma.c
This makes it safer during the disconnect and avoids
requeueing.

It's ok to call disable_work[_sync]() more than once.

Cc: Namjae Jeon <linkinjeon@kernel.org>
Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Fixes: 0626e6641f ("cifsd: add server handler for central processing and tranport layers")
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2025-09-21 19:34:52 -05:00
Stefan Metzmacher
1cde0a74a7 smb: server: don't use delayed_work for post_recv_credits_work
If we are using a hardcoded delay of 0 there's no point in
using delayed_work it only adds confusion.

The client also uses a normal work_struct and now
it is easier to move it to the common smbdirect_socket.

Cc: Namjae Jeon <linkinjeon@kernel.org>
Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Fixes: 0626e6641f ("cifsd: add server handler for central processing and tranport layers")
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2025-09-21 19:34:52 -05:00
Eric Biggers
13289ed501 nfsd: Don't force CRYPTO_LIB_SHA256 to be built-in
Now that nfsd is accessing SHA-256 via the library API instead of via
crypto_shash, there is a direct symbol dependency on the SHA-256 code
and there is no benefit to be gained from forcing it to be built-in.
Therefore, select CRYPTO_LIB_SHA256 from NFSD (conditional on NFSD_V4)
instead of from NFSD_V4, so that it can be 'm' if NFSD is 'm'.

Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-09-21 19:24:50 -04:00
Olga Kornievskaia
a082e4b4d0 nfsd: nfserr_jukebox in nlm_fopen should lead to a retry
When v3 NLM request finds a conflicting delegation, it triggers
a delegation recall and nfsd_open fails with EAGAIN. nfsd_open
then translates EAGAIN into nfserr_jukebox. In nlm_fopen, instead
of returning nlm_failed for when there is a conflicting delegation,
drop this NLM request so that the client retries. Once delegation
is recalled and if a local lock is claimed, a retry would lead to
nfsd returning a nlm_lck_blocked error or a successful nlm lock.

Fixes: d343fce148 ("[PATCH] knfsd: Allow lockd to drop replies as appropriate")
Cc: stable@vger.kernel.org # v6.6
Signed-off-by: Olga Kornievskaia <okorniev@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-09-21 19:24:50 -04:00
Chuck Lever
8ddd06be9a NFSD: Reduce DRC bucket size
The common case is that a DRC lookup will not find the XID in the
bucket. Reduce the amount of pointer chasing during the lookup by
keeping fewer entries in each hash bucket.

Changing the bucket size constant forces the size of the DRC hash
table to increase, and the height of each bucket r-b tree to be
reduced.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-09-21 19:24:50 -04:00
Chuck Lever
fb340bfd48 NFSD: Delay adding new entries to LRU
Neil Brown observes:
> I would not include RC_INPROG entries in the lru at all - they are
> always ignored, and will be added when they are switched to
> RCU_DONE.

I also removed a stale comment.

Suggested-by: NeilBrown <neil@brown.name>
Reviewed-by: NeilBrown <neil@brown.name>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-09-21 19:24:50 -04:00
Chuck Lever
d73d06dac6 SUNRPC: Move the svc_rpcb_cleanup() call sites
Clean up: because svc_rpcb_cleanup() and svc_xprt_destroy_all()
are always invoked in pairs, we can deduplicate code by moving
the svc_rpcb_cleanup() call sites into svc_xprt_destroy_all().

Tested-by: Olga Kornievskaia <okorniev@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-09-21 19:24:50 -04:00
Chuck Lever
dd9adfa0da NFS: Remove rpcbind cleanup for NFSv4.0 callback
The NFS client's NFSv4.0 callback listeners are created with
SVC_SOCK_ANONYMOUS, therefore svc_setup_socket() does not register
them with the client's rpcbind service.

And, note that nfs_callback_down_net() does not call
svc_rpcb_cleanup() at all when shutting down the callback server.

Even if svc_setup_socket() were to attempt to register or unregister
these sockets, the callback service has vs_hidden set, which shunts
the rpcbind upcalls.

The svc_rpcb_cleanup() error flow was introduced by
commit c946556b87 ("NFS: move per-net callback thread
initialization to nfs_callback_up_net()"). It doesn't appear in the
code that was relocated by that commit.

Therefore, there is no need to call svc_rpcb_cleanup() when listener
creation fails during callback server start-up.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-09-21 19:24:50 -04:00
Xichao Zhao
f64397e04b NFSD: Drop redundant conversion to bool
The result of integer comparison already evaluates to bool. No need for
explicit conversion.

Signed-off-by: Xichao Zhao <zhao.xichao@vivo.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-09-21 19:24:50 -04:00
Scott Mayhew
e4f574ca9c nfsd: decouple the xprtsec policy check from check_nfsd_access()
A while back I had reported that an NFSv3 client could successfully
mount using '-o xprtsec=none' an export that had been exported with
'xprtsec=tls:mtls'.  By "successfully" I mean that the mount command
would succeed and the mount would show up in /proc/mount.  Attempting
to do anything futher with the mount would be met with NFS3ERR_ACCES.

This was fixed (albeit accidentally) by commit bb4f07f240 ("nfsd:
Fix NFSD_MAY_BYPASS_GSS and NFSD_MAY_BYPASS_GSS_ON_ROOT") and was
subsequently re-broken by commit 0813c5f012 ("nfsd: fix access
checking for NLM under XPRTSEC policies").

Transport Layer Security isn't an RPC security flavor or pseudo-flavor,
so we shouldn't be conflating them when determining whether the access
checks can be bypassed.  Split check_nfsd_access() into two helpers, and
have __fh_verify() call the helpers directly since __fh_verify() has
logic that allows one or both of the checks to be skipped.  All other
sites will continue to call check_nfsd_access().

Link: https://lore.kernel.org/linux-nfs/ZjO3Qwf_G87yNXb2@aion/
Fixes: 9280c57743 ("NFSD: Handle new xprtsec= export option")
Cc: stable@vger.kernel.org
Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-09-21 19:24:50 -04:00
Thorsten Blum
ab1c282c01 NFSD: Fix destination buffer size in nfsd4_ssc_setup_dul()
Commit 5304877936 ("NFSD: Fix strncpy() fortify warning") replaced
strncpy(,, sizeof(..)) with strlcpy(,, sizeof(..) - 1), but strlcpy()
already guaranteed NUL-termination of the destination buffer and
subtracting one byte potentially truncated the source string.

The incorrect size was then carried over in commit 72f78ae00a ("NFSD:
move from strlcpy with unused retval to strscpy") when switching from
strlcpy() to strscpy().

Fix this off-by-one error by using the full size of the destination
buffer again.

Cc: stable@vger.kernel.org
Fixes: 5304877936 ("NFSD: Fix strncpy() fortify warning")
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-09-21 19:24:50 -04:00
Eric Biggers
9ebcd022a3 nfsd: Eliminate an allocation in nfs4_make_rec_clidname()
Since MD5 digests are fixed-size, make nfs4_make_rec_clidname() store
the digest in a stack buffer instead of a dynamically allocated buffer.
Use MD5_DIGEST_SIZE instead of a hard-coded value, both in
nfs4_make_rec_clidname() and in the definition of HEXDIR_LEN.

Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-09-21 19:24:50 -04:00
Eric Biggers
17695d72d0 nfsd: Replace open-coded conversion of bytes to hex
Since the Linux kernel's sprintf() has conversion to hex built-in via
"%*phN", delete md5_to_hex() and just use that.  Also add an explicit
array bound to the dname parameter of nfs4_make_rec_clidname() to make
its size clear.  No functional change.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-09-21 19:24:50 -04:00
Colin Ian King
6ecdfd7aa8 lockd: Remove space before newline
There is an extraneous space before a newline in a dprintk message.
Remove the space.

Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-09-21 19:24:50 -04:00
Jeff Layton
e5e9b24ab8 nfsd: freeze c/mtime updates with outstanding WRITE_ATTRS delegation
Instead of allowing the ctime to roll backward with a WRITE_ATTRS
delegation, set FMODE_NOCMTIME on the file and have it skip mtime and
ctime updates.

It is possible that the client will never send a SETATTR to set the
times before returning the delegation. Add two new bools to struct
nfs4_delegation:

dl_written: tracks whether the file has been written since the
delegation was granted. This is set in the WRITE and LAYOUTCOMMIT
handlers.

dl_setattr: tracks whether the client has sent at least one valid
mtime that can also update the ctime in a SETATTR.

When unlocking the lease for the delegation, clear FMODE_NOCMTIME. If
the file has been written, but no setattr for the delegated mtime and
ctime has been done, update the timestamps to current_time().

Suggested-by: NeilBrown <neil@brown.name>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-09-21 19:24:50 -04:00
Jeff Layton
b40b1ba37a nfsd: fix timestamp updates in CB_GETATTR
When updating the local timestamps from CB_GETATTR, the updated values
are not being properly vetted.

Compare the update times vs. the saved times in the delegation rather
than the current times in the inode. Also, ensure that the ctime is
properly vetted vs. its original value.

Fixes: 6ae30d6eb2 ("nfsd: add support for delegated timestamps")
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-09-21 19:24:50 -04:00
Jeff Layton
3952f1cbcb nfsd: fix SETATTR updates for delegated timestamps
SETATTRs containing delegated timestamp updates are currently not being
vetted properly. Since we no longer need to compare the timestamps vs.
the current timestamps, move the vetting of delegated timestamps wholly
into nfsd.

Rename the set_cb_time() helper to nfsd4_vet_deleg_time(), and make it
non-static. Add a new vet_deleg_attrs() helper that is called from
nfsd4_setattr that uses nfsd4_vet_deleg_time() to properly validate the
all the timestamps. If the validation indicates that the update should
be skipped, unset the appropriate flags in ia_valid.

Fixes: 7e13f4f8d2 ("nfsd: handle delegated timestamps in SETATTR")
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-09-21 19:24:50 -04:00
Jeff Layton
7663e963a5 nfsd: track original timestamps in nfs4_delegation
As Trond points out [1], the "original time" mentioned in RFC 9754
refers to the timestamps on the files at the time that the delegation
was granted, and not the current timestamp of the file on the server.

Store the current timestamps for the file in the nfs4_delegation when
granting one. Add STATX_ATIME and STATX_MTIME to the request mask in
nfs4_delegation_stat(). When granting OPEN_DELEGATE_READ_ATTRS_DELEG, do
a nfs4_delegation_stat() and save the correct atime. If the stat() fails
for any reason, fall back to granting a normal read deleg.

[1]: https://lore.kernel.org/linux-nfs/47a4e40310e797f21b5137e847b06bb203d99e66.camel@kernel.org/

Fixes: 7e13f4f8d2 ("nfsd: handle delegated timestamps in SETATTR")
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-09-21 19:24:50 -04:00
Jeff Layton
c066ff58e5 nfsd: use ATTR_CTIME_SET for delegated ctime updates
Ensure that notify_change() doesn't clobber a delegated ctime update
with current_time() by setting ATTR_CTIME_SET for those updates.

Don't bother setting the timestamps in cb_getattr_update_times() in the
non-delegated case. notify_change() will do that itself.

Fixes: 7e13f4f8d2 ("nfsd: handle delegated timestamps in SETATTR")
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-09-21 19:24:50 -04:00
Jeff Layton
afc5b36e29 vfs: add ATTR_CTIME_SET flag
When ATTR_ATIME_SET and ATTR_MTIME_SET are set in the ia_valid mask, the
notify_change() logic takes that to mean that the request should set
those values explicitly, and not override them with "now".

With the advent of delegated timestamps, similar functionality is needed
for the ctime. Add a ATTR_CTIME_SET flag, and use that to indicate that
the ctime should be accepted as-is. Also, clean up the if statements to
eliminate the extra negatives.

In setattr_copy() and setattr_copy_mgtime() use inode_set_ctime_deleg()
when ATTR_CTIME_SET is set, instead of basing the decision on ATTR_DELEG.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-09-21 19:24:50 -04:00
Jeff Layton
5affb498e7 nfsd: ignore ATTR_DELEG when checking ia_valid before notify_change()
If the only flag left is ATTR_DELEG, then there are no changes to be
made.

Fixes: 7e13f4f8d2 ("nfsd: handle delegated timestamps in SETATTR")
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-09-21 19:24:50 -04:00
Jeff Layton
2990b5a479 nfsd: fix assignment of ia_ctime.tv_nsec on delegated mtime update
The ia_ctime.tv_nsec field should be set to modify.nseconds.

Fixes: 7e13f4f8d2 ("nfsd: handle delegated timestamps in SETATTR")
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-09-21 19:24:50 -04:00
Sergey Bashirov
d68886bae7 NFSD: Fix last write offset handling in layoutcommit
The data type of loca_last_write_offset is newoffset4 and is switched
on a boolean value, no_newoffset, that indicates if a previous write
occurred or not. If no_newoffset is FALSE, an offset is not given.
This means that client does not try to update the file size. Thus,
server should not try to calculate new file size and check if it fits
into the segment range. See RFC 8881, section 12.5.4.2.

Sometimes the current incorrect logic may cause clients to hang when
trying to sync an inode. If layoutcommit fails, the client marks the
inode as dirty again.

Fixes: 9cf514ccfa ("nfsd: implement pNFS operations")
Cc: stable@vger.kernel.org
Co-developed-by: Konstantin Evtushenko <koevtushenko@yandex.com>
Signed-off-by: Konstantin Evtushenko <koevtushenko@yandex.com>
Signed-off-by: Sergey Bashirov <sergeybashirov@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-09-21 19:24:50 -04:00
Sergey Bashirov
f963cf2b91 NFSD: Implement large extent array support in pNFS
When pNFS client in the block or scsi layout mode sends layoutcommit
to MDS, a variable length array of modified extents is supplied within
the request. This patch allows the server to accept such extent arrays
if they do not fit within single memory page.

The issue can be reproduced when writing to a 1GB file using FIO with
O_DIRECT, 4K block and large I/O depth without preallocation of the
file. In this case, the server returns NFSERR_BADXDR to the client.

Co-developed-by: Konstantin Evtushenko <koevtushenko@yandex.com>
Signed-off-by: Konstantin Evtushenko <koevtushenko@yandex.com>
Signed-off-by: Sergey Bashirov <sergeybashirov@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-09-21 19:24:50 -04:00
Sergey Bashirov
6bf1be3399 NFSD: Minor cleanup in layoutcommit decoding
Use the appropriate xdr function to decode the lc_newoffset field,
which is a boolean value. See RFC 8881, section 18.42.1.

Signed-off-by: Sergey Bashirov <sergeybashirov@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-09-21 19:24:50 -04:00
Sergey Bashirov
274365a51d NFSD: Minor cleanup in layoutcommit processing
Remove dprintk in nfsd4_layoutcommit. These are not needed
in day to day usage, and the information is also available
in Wireshark when capturing NFS traffic.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sergey Bashirov <sergeybashirov@gmail.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-09-21 19:24:50 -04:00
Sergey Bashirov
832738e4b3 NFSD: Rework encoding and decoding of nfsd4_deviceid
Compilers may optimize the layout of C structures, so we should not rely
on sizeof struct and memcpy to encode and decode XDR structures. The byte
order of the fields should also be taken into account.

This patch adds the correct functions to handle the deviceid4 structure
and removes the pad field, which is currently not used by NFSD, from the
runtime state. The server's byte order is preserved because the deviceid4
blob on the wire is only used as a cookie by the client.

Signed-off-by: Sergey Bashirov <sergeybashirov@gmail.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-09-21 19:24:50 -04:00
NeilBrown
2ee3a75e42 nfsd: discard nfsd_file_get_local()
This interface was deprecated by commit e6f7e1487a ("nfs_localio:
simplify interface to nfsd for getting nfsd_file") and is now
unused. So let's remove it.

Signed-off-by: NeilBrown <neil@brown.name>
Reviewed-by: Mike Snitzer <snitzer@kernel.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-09-21 19:24:50 -04:00
Chuck Lever
c1f203e46c NFSD: Move the fh_getattr() helper
Clean up: The fh_getattr() function is part of NFSD's file handle
API, so relocate it.

I've made it an un-inlined function so that trace points and new
functionality can easily be introduced. That increases the size of
nfsd.ko by about a page on my x86_64 system (out of 26MB; compiled
with -O2).

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-09-21 19:24:50 -04:00
Chuck Lever
c926f0298d NFSD: Relocate the fh_want_write() and fh_drop_write() helpers
Clean up: these helpers are part of the NFSD file handle API.
Relocate them to fs/nfsd/nfsfh.h.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-09-21 19:24:50 -04:00
Miaohe Lin
5ce1dbfdd8 mm/hwpoison: decouple hwpoison_filter from mm/memory-failure.c
mm/memory-failure.c defines and uses hwpoison_filter_* parameters but the
values of those parameters can only be modified via mm/hwpoison-inject.c
from userspace.  They have a potentially different life time.  Decouple
those parameters from mm/memory-failure.c to fix this broken layering.

Link: https://lkml.kernel.org/r/20250904062258.3336092-1-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Suggested-by: Michal Hocko <mhocko@suse.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21 14:22:21 -07:00
Chi Zhiling
8583bb0f9a mpage: convert do_mpage_readpage() to return void type
The return value of do_mpage_readpage() is arg->bio, which is already set
in the arg structure.  Returning it again is redundant.

This patch changes the return type to void since the caller doesn't care
about the return value.

Link: https://lkml.kernel.org/r/20250829023659.688649-2-chizhiling@163.com
Signed-off-by: Chi Zhiling <chizhiling@kylinos.cn>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Namjae Jeon <linkinjeon@kernel.org>
Cc: Sungjong Seo <sj1557.seo@samsung.com>
Cc: Yuezhang Mo <Yuezhang.Mo@sony.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21 14:22:16 -07:00
Chi Zhiling
f6d8c71020 mpage: terminate read-ahead on read error
For exFAT filesystems with 4MB read_ahead_size, removing the storage
device during read operations can delay EIO error reporting by several
minutes.  This occurs because the read-ahead implementation in mpage
doesn't handle errors.

Another reason for the delay is that the filesystem requires metadata to
issue file read request.  When the storage device is removed, the metadata
buffers are invalidated, causing mpage to repeatedly attempt to fetch
metadata during each get_block call.

The original purpose of this patch is terminate read ahead when we fail to
get metadata, to make the patch more generic, implement it by checking
folio status, instead of checking the return of get_block().

So, if a folio is synchronously unlocked and non-uptodate, should we quit
the read ahead?

I think it depends on whether the error is permanent or temporary, and
whether further read ahead might succeed.  A device being unplugged is one
reason for returning such a folio, but we could return it for many other
reasons (e.g., metadata errors).  I think most errors won't be restored in
a short time, so we should quit read ahead when they occur.

Link: https://lkml.kernel.org/r/20250829023659.688649-1-chizhiling@163.com
Signed-off-by: Chi Zhiling <chizhiling@kylinos.cn>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Namjae Jeon <linkinjeon@kernel.org>
Cc: Sungjong Seo <sj1557.seo@samsung.com>
Cc: Yuezhang Mo <Yuezhang.Mo@sony.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21 14:22:16 -07:00
David Hildenbrand
a638ee7f19 fs: hugetlbfs: cleanup folio in adjust_range_hwpoison()
Let's cleanup and simplify the function a bit.

Link: https://lkml.kernel.org/r/20250901150359.867252-17-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21 14:22:04 -07:00