This function can process some consecutive blocks at a time.
When using update_sit_entry() to release consecutive blocks,
ensure that the consecutive blocks belong to the same segment.
Because after update_sit_entry_for_realese(), @segno is still
in use in update_sit_entry().
Signed-off-by: Yi Sun <yi.sun@unisoc.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch addresses an issue where some files in case-insensitive
directories become inaccessible due to changes in how the kernel function,
utf8_casefold(), generates case-folded strings from the commit 5c26d2f1d3
("unicode: Don't special case ignorable code points").
F2FS uses these case-folded names to calculate hash values for locating
dentries and stores them on disk. Since utf8_casefold() can produce
different output across kernel versions, stored hash values and newly
calculated hash values may differ. This results in affected files no
longer being found via the hash-based lookup.
To resolve this, the patch introduces a linear search fallback.
If the initial hash-based search fails, F2FS will sequentially scan the
directory entries.
Fixes: 5c26d2f1d3 ("unicode: Don't special case ignorable code points")
Link: https://bugzilla.kernel.org/show_bug.cgi?id=219586
Signed-off-by: Daniel Lee <chullee@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
New function f2fs_invalidate_compress_pages_range() adds the @len
parameter. So it can process some consecutive blocks at a time.
Signed-off-by: Yi Sun <yi.sun@unisoc.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
kthread_create() creates a kthread without running it yet. kthread_run()
creates a kthread and runs it.
On the other hand, kthread_create_worker() creates a kthread worker and
runs it.
This difference in behaviours is confusing. Also there is no way to
create a kthread worker and affine it using kthread_bind_mask() or
kthread_affine_preferred() before starting it.
Consolidate the behaviours and introduce kthread_run_worker[_on_cpu]()
that behaves just like kthread_run(). kthread_create_worker[_on_cpu]()
will now only create a kthread worker without starting it.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
kthread_create_on_cpu() uses the CPU argument as an implicit and unique
printf argument to add to the format whereas
kthread_create_worker_on_cpu() still relies on explicitly passing the
printf arguments. This difference in behaviour is error prone and
doesn't help standardizing per-CPU kthread names.
Unify the behaviours and convert kthread_create_worker_on_cpu() to
use the printf behaviour of kthread_create_on_cpu().
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Non-rtg file systems have a fake RT group even if they do not have a RT
device, and thus an rgcount of 1. Ensure xfs_update_last_rtgroup_size
doesn't fail when called for !XFS_RT to handle this case.
Fixes: 87fe4c34a3 ("xfs: create incore realtime group structures")
Reported-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
When `ksmbd_vfs_kern_path_locked` met an error and it is not the last
entry, it will exit without restoring changed path buffer. But later this
buffer may be used as the filename for creation.
Fixes: c5a709f08d ("ksmbd: handle caseless file creation")
Signed-off-by: He Wang <xw897002528@gmail.com>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Geert reported that my previous short fops debugfs changes
broke m68k, because it only has mandatory alignement of two,
so we can't stash the "is it short" information into the
pointer (as we already did with the "is it real" bit.)
Instead, exploit the fact that debugfs_file_get() called on
an already open file will already find that the fsdata is
no longer the real fops but rather the allocated data that
already distinguishes full/short ops, so only open() needs
to be able to distinguish. We can achieve that by using two
different open functions.
Unfortunately this requires another set of full file ops,
increasing the size by 536 bytes (x86-64), but that's still
a reasonable trade-off given that only converting some of
the wireless stack gained over 28k. This brings the total
cost of this to around 1k, for wins of 28k (all x86-64).
Reported-and-tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
Link: https://lore.kernel.org/CAMuHMdWu_9-L2Te101w8hU7H_2yobJFPXSwwUmGHSJfaPWDKiQ@mail.gmail.com
Fixes: 8dc6d81c6b ("debugfs: add small file operations for most files")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
Link: https://lore.kernel.org/r/20241129121536.30989-2-johannes@sipsolutions.net
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Pull fuse fixes from Miklos Szeredi <mszeredi@redhat.com>:
- Fix fuse_get_user_pages() allocation failure handling.
- Fix direct-io folio offset and length calculation.
* tag 'fuse-fixes-6.13-rc7' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: Set *nbytesp=0 in fuse_get_user_pages on allocation failure
fuse: fix direct io folio offset and length calculation
Link: https://lore.kernel.org/r/CAJfpegu7o_X%3DSBWk_C47dUVUQ1mJZDEGe1MfD0N3wVJoUBWdmg@mail.gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
Pull vfs fixes from Christian Brauner:
- Relax assertions on failure to encode file handles
The ->encode_fh() method can fail for various reasons. None of them
warrant a WARN_ON().
- Fix overlayfs file handle encoding by allowing encoding an fid from
an inode without an alias
- Make sure fuse_dir_open() handles FOPEN_KEEP_CACHE. If it's not
specified fuse needs to invaludate the directory inode page cache
- Fix qnx6 so it builds with gcc-15
- Various fixes for netfslib and ceph and nfs filesystems:
- Ignore silly rename files from afs and nfs when building header
archives
- Fix read result collection in netfslib with multiple subrequests
- Handle ENOMEM for netfslib buffered reads
- Fix oops in nfs_netfs_init_request()
- Parse the secctx command immediately in cachefiles
- Remove a redundant smp_rmb() in netfslib
- Handle recursion in read retry in netfslib
- Fix clearing of folio_queue
- Fix missing cancellation of copy-to_cache when the cache for a
file is temporarly disabled in netfslib
- Sanity check the hfs root record
- Fix zero padding data issues in concurrent write scenarios
- Fix is_mnt_ns_file() after converting nsfs to path_from_stashed()
- Fix missing declaration of init_files
- Increase I/O priority when writing revoke records in jbd2
- Flush filesystem device before updating tail sequence in jbd2
* tag 'vfs-6.13-rc7.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (23 commits)
ovl: support encoding fid from inode with no alias
ovl: pass realinode to ovl_encode_real_fh() instead of realdentry
fuse: respect FOPEN_KEEP_CACHE on opendir
netfs: Fix is-caching check in read-retry
netfs: Fix the (non-)cancellation of copy when cache is temporarily disabled
netfs: Fix ceph copy to cache on write-begin
netfs: Work around recursion by abandoning retry if nothing read
netfs: Fix missing barriers by using clear_and_wake_up_bit()
netfs: Remove redundant use of smp_rmb()
cachefiles: Parse the "secctx" immediately
nfs: Fix oops in nfs_netfs_init_request() when copying to cache
netfs: Fix enomem handling in buffered reads
netfs: Fix non-contiguous donation between completed reads
kheaders: Ignore silly-rename files
fs: relax assertions on failure to encode file handles
fs: fix missing declaration of init_files
fs: fix is_mnt_ns_file()
iomap: fix zero padding data issue in concurrent append writes
iomap: pass byte granular end position to iomap_add_to_ioend
jbd2: flush filesystem device before updating tail sequence
...
Since commit 559218d43e ("block: pre-calculate max_zone_append_sectors"),
queue_limits's max_zone_append_sectors is default to be 0 and it is only
updated when there is a zoned device. So, we have
lim->max_zone_append_sectors = 0 when there is no zoned device in the
filesystem.
That leads to fs_info->max_zone_append_size and thus
fs_info->max_extent_size to be 0, which is wrong and can for example
lead to a divide by zero in count_max_extents().
Fix this by only capping fs_info->max_extent_size to
fs_info->max_zone_append_size when it is non-zero.
Based on a patch from Naohiro Aota <naohiro.aota@wdc.com>, from which
much of this commit message is stolen as well.
Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Fixes: 559218d43e ("block: pre-calculate max_zone_append_sectors")
Tested-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
Syzbot reported a crash with the following call trace:
BTRFS info (device loop0): scrub: started on devid 1
BUG: kernel NULL pointer dereference, address: 0000000000000208
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 106e70067 P4D 106e70067 PUD 107143067 PMD 0
Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
CPU: 1 UID: 0 PID: 689 Comm: repro Kdump: loaded Tainted: G O 6.13.0-rc4-custom+ #206
Tainted: [O]=OOT_MODULE
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 02/02/2022
RIP: 0010:find_first_extent_item+0x26/0x1f0 [btrfs]
Call Trace:
<TASK>
scrub_find_fill_first_stripe+0x13d/0x3b0 [btrfs]
scrub_simple_mirror+0x175/0x260 [btrfs]
scrub_stripe+0x5d4/0x6c0 [btrfs]
scrub_chunk+0xbb/0x170 [btrfs]
scrub_enumerate_chunks+0x2f4/0x5f0 [btrfs]
btrfs_scrub_dev+0x240/0x600 [btrfs]
btrfs_ioctl+0x1dc8/0x2fa0 [btrfs]
? do_sys_openat2+0xa5/0xf0
__x64_sys_ioctl+0x97/0xc0
do_syscall_64+0x4f/0x120
entry_SYSCALL_64_after_hwframe+0x76/0x7e
</TASK>
[CAUSE]
The reproducer is using a corrupted image where extent tree root is
corrupted, thus forcing to use "rescue=all,ro" mount option to mount the
image.
Then it triggered a scrub, but since scrub relies on extent tree to find
where the data/metadata extents are, scrub_find_fill_first_stripe()
relies on an non-empty extent root.
But unfortunately scrub_find_fill_first_stripe() doesn't really expect
an NULL pointer for extent root, it use extent_root to grab fs_info and
triggered a NULL pointer dereference.
[FIX]
Add an extra check for a valid extent root at the beginning of
scrub_find_fill_first_stripe().
The new error path is introduced by 42437a6386 ("btrfs: introduce
mount option rescue=ignorebadroots"), but that's pretty old, and later
commit b979547513 ("btrfs: scrub: introduce helper to find and fill
sector info for a scrub_stripe") changed how we do scrub.
So for kernels older than 6.6, the fix will need manual backport.
Reported-by: syzbot+339e9dbe3a2ca419b85d@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/67756935.050a0220.25abdd.0a12.GAE@google.com/
Fixes: 42437a6386 ("btrfs: introduce mount option rescue=ignorebadroots")
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Dmitry Safonov reported that a WARN_ON() assertion can be trigered by
userspace when calling inotify_show_fdinfo() for an overlayfs watched
inode, whose dentry aliases were discarded with drop_caches.
The WARN_ON() assertion in inotify_show_fdinfo() was removed, because
it is possible for encoding file handle to fail for other reason, but
the impact of failing to encode an overlayfs file handle goes beyond
this assertion.
As shown in the LTP test case mentioned in the link below, failure to
encode an overlayfs file handle from a non-aliased inode also leads to
failure to report an fid with FAN_DELETE_SELF fanotify events.
As Dmitry notes in his analyzis of the problem, ovl_encode_fh() fails
if it cannot find an alias for the inode, but this failure can be fixed.
ovl_encode_fh() seldom uses the alias and in the case of non-decodable
file handles, as is often the case with fanotify fid info,
ovl_encode_fh() never needs to use the alias to encode a file handle.
Defer finding an alias until it is actually needed so ovl_encode_fh()
will not fail in the common case of FAN_DELETE_SELF fanotify events.
Fixes: 16aac5ad1f ("ovl: support encoding non-decodable file handles")
Reported-by: Dmitry Safonov <dima@arista.com>
Closes: https://lore.kernel.org/linux-fsdevel/CAOQ4uxiie81voLZZi2zXS1BziXZCM24nXqPAxbu8kxXCUWdwOg@mail.gmail.com/
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Link: https://lore.kernel.org/r/20250105162404.357058-3-amir73il@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
We can access exp->ex_stats or exp->ex_uuid in rcu context(c_show and
e_show). All these resources should be released using kfree_rcu. Fix this
by using call_rcu, clean them all after a rcu grace period.
==================================================================
BUG: KASAN: slab-use-after-free in svc_export_show+0x362/0x430 [nfsd]
Read of size 1 at addr ff11000010fdc120 by task cat/870
CPU: 1 UID: 0 PID: 870 Comm: cat Not tainted 6.12.0-rc3+ #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
1.16.1-2.fc37 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x53/0x70
print_address_description.constprop.0+0x2c/0x3a0
print_report+0xb9/0x280
kasan_report+0xae/0xe0
svc_export_show+0x362/0x430 [nfsd]
c_show+0x161/0x390 [sunrpc]
seq_read_iter+0x589/0x770
seq_read+0x1e5/0x270
proc_reg_read+0xe1/0x140
vfs_read+0x125/0x530
ksys_read+0xc1/0x160
do_syscall_64+0x5f/0x170
entry_SYSCALL_64_after_hwframe+0x76/0x7e
Allocated by task 830:
kasan_save_stack+0x20/0x40
kasan_save_track+0x14/0x30
__kasan_kmalloc+0x8f/0xa0
__kmalloc_node_track_caller_noprof+0x1bc/0x400
kmemdup_noprof+0x22/0x50
svc_export_parse+0x8a9/0xb80 [nfsd]
cache_do_downcall+0x71/0xa0 [sunrpc]
cache_write_procfs+0x8e/0xd0 [sunrpc]
proc_reg_write+0xe1/0x140
vfs_write+0x1a5/0x6d0
ksys_write+0xc1/0x160
do_syscall_64+0x5f/0x170
entry_SYSCALL_64_after_hwframe+0x76/0x7e
Freed by task 868:
kasan_save_stack+0x20/0x40
kasan_save_track+0x14/0x30
kasan_save_free_info+0x3b/0x60
__kasan_slab_free+0x37/0x50
kfree+0xf3/0x3e0
svc_export_put+0x87/0xb0 [nfsd]
cache_purge+0x17f/0x1f0 [sunrpc]
nfsd_destroy_serv+0x226/0x2d0 [nfsd]
nfsd_svc+0x125/0x1e0 [nfsd]
write_threads+0x16a/0x2a0 [nfsd]
nfsctl_transaction_write+0x74/0xa0 [nfsd]
vfs_write+0x1a5/0x6d0
ksys_write+0xc1/0x160
do_syscall_64+0x5f/0x170
entry_SYSCALL_64_after_hwframe+0x76/0x7e
Fixes: ae74136b4b ("SUNRPC: Allow cache lookups to use RCU protection rather than the r/w spinlock")
Signed-off-by: Yang Erkun <yangerkun@huawei.com>
Reviewed-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
rcu_read_lock/rcu_read_unlock has already provide protection for the
pointer we will reference when we call e_show. Therefore, there is no
need to obtain a cache reference to help protect cache_head.
Additionally, the .put such as expkey_put/svc_export_put will invoke
dput, which can sleep and break rcu. Stop get cache reference to fix
them all.
Fixes: ae74136b4b ("SUNRPC: Allow cache lookups to use RCU protection rather than the r/w spinlock")
Suggested-by: NeilBrown <neilb@suse.de>
Signed-off-by: Yang Erkun <yangerkun@huawei.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
It helps to know what kind of callback happened that triggered the
WARN_ONCE in nfsd4_cb_done() function in diagnosing what can set
an uncommon state where both cb_status and tk_status are set at
the same time.
Signed-off-by: Olga Kornievskaia <okorniev@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
If a client were to send an error to a CB_GETATTR call, the code
erronously continues to try decode past the error code. It ends
up returning BAD_XDR error to the rpc layer and then in turn
trigger a WARN_ONCE in nfsd4_cb_done() function.
Fixes: 6487a13b5c ("NFSD: add support for CB_GETATTR callback")
Signed-off-by: Olga Kornievskaia <okorniev@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Add a shrinker which frees unused slots and may ask the clients to use
fewer slots on each session.
We keep a global count of the number of freeable slots, which is the sum
of one less than the current "target" slots in all sessions in all
clients in all net-namespaces. This number is reported by the shrinker.
When the shrinker is asked to free some, we call xxx on each session in
a round-robin asking each to reduce the slot count by 1. This will
reduce the "target" so the number reported by the shrinker will reduce
immediately. The memory will only be freed later when the client
confirmed that it is no longer needed.
We use a global list of sessions and move the "head" to after the last
session that we asked to reduce, so the next callback from the shrinker
will move on to the next session. This pressure should be applied
"evenly" across all sessions over time.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reducing the number of slots in the session slot table requires
confirmation from the client. This patch adds reduce_session_slots()
which starts the process of getting confirmation, but never calls it.
That will come in a later patch.
Before we can free a slot we need to confirm that the client won't try
to use it again. This involves returning a lower cr_maxrequests in a
SEQUENCE reply and then seeing a ca_maxrequests on the same slot which
is not larger than we limit we are trying to impose. So for each slot
we need to remember that we have sent a reduced cr_maxrequests.
To achieve this we introduce a concept of request "generations". Each
time we decide to reduce cr_maxrequests we increment the generation
number, and record this when we return the lower cr_maxrequests to the
client. When a slot with the current generation reports a low
ca_maxrequests, we commit to that level and free extra slots.
We use an 16 bit generation number (64 seems wasteful) and if it cycles
we iterate all slots and reset the generation number to avoid false matches.
When we free a slot we store the seqid in the slot pointer so that it can
be restored when we reactivate the slot. The RFC can be read as
suggesting that the slot number could restart from one after a slot is
retired and reactivated, but also suggests that retiring slots is not
required. So when we reactive a slot we accept with the next seqid in
sequence, or 1.
When decoding sa_highest_slotid into maxslots we need to add 1 - this
matches how it is encoded for the reply.
se_dead is moved in struct nfsd4_session to remove a hole.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
If a client ever uses the highest available slot for a given session,
attempt to allocate more slots so there is room for the client to use
them if wanted. GFP_NOWAIT is used so if there is not plenty of
free memory, failure is expected - which is what we want. It also
allows the allocation while holding a spinlock.
Each time we increase the number of slots by 20% (rounded up). This
allows fairly quick growth while avoiding excessive over-shoot.
We would expect to stablise with around 10% more slots available than
the client actually uses.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Rather than guessing how much space it might be safe to use for the DRC,
simply try allocating slots and be prepared to accept failure.
The first slot for each session is allocated with GFP_KERNEL which is
unlikely to fail. Subsequent slots are allocated with the addition of
__GFP_NORETRY which is expected to fail if there isn't much free memory.
This is probably too aggressive but clears the way for adding a
shrinker interface to free extra slots when memory is tight.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Using an xarray to store session slots will make it easier to change the
number of active slots based on demand, and removes an unnecessary
limit.
To achieve good throughput with a high-latency server it can be helpful
to have hundreds of concurrent writes, which means hundreds of slots.
So increase the limit to 2048 (twice what the Linux client will
currently use). This limit is only a sanity check, not a hard limit.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Now that the connection limit only apply to unconfirmed connections,
there is no need to configure it. So remove all the configuration and
fix the number of unconfirmed connections as always 64 - which is
now given a name: XPT_MAX_TMP_CONN
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
The heuristic for limiting the number of incoming connections to nfsd
currently uses sv_nrthreads - allowing more connections if more threads
were configured.
A future patch will allow number of threads to grow dynamically so that
there will be no need to configure sv_nrthreads. So we need a different
solution for limiting connections.
It isn't clear what problem is solved by limiting connections (as
mentioned in a code comment) but the most likely problem is a connection
storm - many connections that are not doing productive work. These will
be closed after about 6 minutes already but it might help to slow down a
storm.
This patch adds a per-connection flag XPT_PEER_VALID which indicates
that the peer has presented a filehandle for which it has some sort of
access. i.e the peer is known to be trusted in some way. We now only
count connections which have NOT been determined to be valid. There
should be relative few of these at any given time.
If the number of non-validated peer exceed a limit - currently 64 - we
close the oldest non-validated peer to avoid having too many of these
useless connections.
Note that this patch significantly changes the meaning of the various
configuration parameters for "max connections". The next patch will
remove all of these.
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Get rid of the nfsd4_legacy_tracking_ops->init() call in
check_for_legacy_methods(). That will be handled in the caller
(nfsd4_client_tracking_init()). Otherwise, we'll wind up calling
nfsd4_legacy_tracking_ops->init() twice, and the second time we'll
trigger the BUG_ON() in nfsd4_init_recdir().
Fixes: 74fd48739d ("nfsd: new Kconfig option for legacy client tracking")
Reported-by: Jur van der Burg <jur@avtware.com>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=219580
Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Tested-by: Salvatore Bonaccorso <carnil@debian.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
The wake_up_var interface is fragile as barriers are sometimes needed.
There are now new interfaces so that most wake-ups can use an interface
that is guaranteed to have all barriers needed.
This patch changes the wake up on cl_cb_inflight to use
atomic_dec_and_wake_up().
It also changes the wake up on rp_locked to use store_release_wake_up().
This involves changing rp_locked from atomic_t to int.
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Since commit e56dc9e294 ("nfsd: remove fault injection code") remove
all nfsd_recall_delegations codes,
we don't need trace_nfsd_deleg_recall any more.
Signed-off-by: Chen Hanxiao <chenhx.fnst@fujitsu.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Pull exfat fixes from Namjae Jeon:
"All fixes are for issues reported by syzbot:
- Fix wrong error return in exfat_find_empty_entry()
- Fix a endless loop by self-linked chain
- fix a KMSAN uninit-value issue in exfat_extend_valid_size()"
* tag 'exfat-for-6.13-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat:
exfat: fix the infinite loop in __exfat_free_cluster()
exfat: fix the new buffer was not zeroed before writing
exfat: fix the infinite loop in exfat_readdir()
exfat: fix exfat_find_empty_entry() not returning error on failure
If we return -EAGAIN the first time because we need to block,
btrfs_uring_encoded_read() will get called twice. Take a copy of args,
the iovs, and the iter the first time, as by the time we are called the
second time these may have gone out of scope.
Reported-by: Jens Axboe <axboe@kernel.dk>
Fixes: 34310c442e ("btrfs: add io_uring command for encoded reads (ENCODED_READ ioctl)")
Signed-off-by: Mark Harmstone <maharmstone@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The rrror handling in fanotify_init(2) is buggy and overwrites 'fd'
before calling put_unused_fd() leading to possible access beyond the end
of fd bitmap. Fix it.
Reported-by: syzbot+6a3aa63412255587b21b@syzkaller.appspotmail.com
Fixes: ebe559609d ("fs: get rid of __FMODE_NONOTIFY kludge")
Signed-off-by: Jan Kara <jack@suse.cz>
In the smb2_send_interim_resp(), if ksmbd_alloc_work_struct()
fails to allocate a node, it returns a NULL pointer to the
in_work pointer. This can lead to an illegal memory write of
in_work->response_buf when allocate_interim_rsp_buf() attempts
to perform a kzalloc() on it.
To address this issue, incorporating a check for the return
value of ksmbd_alloc_work_struct() ensures that the function
returns immediately upon allocation failure, thereby preventing
the aforementioned illegal memory access.
Fixes: 041bba4414 ("ksmbd: fix wrong interim response on compound")
Signed-off-by: Wentao Liang <liangwentao@iscas.ac.cn>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Pull hotfixes from Andrew Morton:
"25 hotfixes. 16 are cc:stable. 18 are MM and 7 are non-MM.
The usual bunch of singletons and two doubletons - please see the
relevant changelogs for details"
* tag 'mm-hotfixes-stable-2025-01-04-18-02' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (25 commits)
MAINTAINERS: change Arınç _NAL's name and email address
scripts/sorttable: fix orc_sort_cmp() to maintain symmetry and transitivity
mm/util: make memdup_user_nul() similar to memdup_user()
mm, madvise: fix potential workingset node list_lru leaks
mm/damon/core: fix ignored quota goals and filters of newly committed schemes
mm/damon/core: fix new damon_target objects leaks on damon_commit_targets()
mm/list_lru: fix false warning of negative counter
vmstat: disable vmstat_work on vmstat_cpu_down_prep()
mm: shmem: fix the update of 'shmem_falloc->nr_unswapped'
mm: shmem: fix incorrect index alignment for within_size policy
percpu: remove intermediate variable in PERCPU_PTR()
mm: zswap: fix race between [de]compression and CPU hotunplug
ocfs2: fix slab-use-after-free due to dangling pointer dqi_priv
fs/proc/task_mmu: fix pagemap flags with PMD THP entries on 32bit
kcov: mark in_softirq_really() as __always_inline
docs: mm: fix the incorrect 'FileHugeMapped' field
mailmap: modify the entry for Mathieu Othacehe
mm/kmemleak: fix sleeping function called from invalid context at print message
mm: hugetlb: independent PMD page table shared count
maple_tree: reload mas before the second call for mas_empty_area
...
The mtree mechanism has been effective at creating directory offsets
that are stable over multiple opendir instances. However, it has not
been able to handle the subtleties of renames that are concurrent
with readdir.
Instead of using the mtree to emit entries in the order of their
offset values, use it only to map incoming ctx->pos to a starting
entry. Then use the directory's d_children list, which is already
maintained properly by the dcache, to find the next child to emit.
One of the sneaky things about this is that when the mtree-allocated
offset value wraps (which is very rare), looking up ctx->pos++ is
not going to find the next entry; it will return NULL. Instead, by
following the d_children list, the offset values can appear in any
order but all of the entries in the directory will be visited
eventually.
Note also that the readdir() is guaranteed to reach the tail of this
list. Entries are added only at the head of d_children, and readdir
walks from its current position in that list towards its tail.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Link: https://lore.kernel.org/r/20241228175522.1854234-6-cel@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
According to getdents(3), the d_off field in each returned directory
entry points to the next entry in the directory. The d_off field in
the last returned entry in the readdir buffer must contain a valid
offset value, but if it points to an actual directory entry, then
readdir/getdents can loop.
This patch introduces a specific fixed offset value that is placed
in the d_off field of the last entry in a directory. Some user space
applications assume that the EOD offset value is larger than the
offsets of real directory entries, so the largest valid offset value
is reserved for this purpose. This new value is never allocated by
simple_offset_add().
When ->iterate_dir() returns, getdents{64} inserts the ctx->pos
value into the d_off field of the last valid entry in the readdir
buffer. When it hits EOD, offset_readdir() sets ctx->pos to the EOD
offset value so the last entry is updated to point to the EOD marker.
When trying to read the entry at the EOD offset, offset_readdir()
terminates immediately.
It is worth noting that using a Maple tree for directory offset
value allocation does not guarantee a 63-bit range of values --
on platforms where "long" is a 32-bit type, the directory offset
value range is still 0..(2^31 - 1). For broad compatibility with
32-bit user space, the largest tmpfs directory cookie value is now
S32_MAX.
Fixes: 796432efab ("libfs: getdents() should return 0 after reaching EOD")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Link: https://lore.kernel.org/r/20241228175522.1854234-5-cel@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
The current directory offset allocator (based on mtree_alloc_cyclic)
stores the next offset value to return in octx->next_offset. This
mechanism typically returns values that increase monotonically over
time. Eventually, though, the newly allocated offset value wraps
back to a low number (say, 2) which is smaller than other already-
allocated offset values.
Yu Kuai <yukuai3@huawei.com> reports that, after commit 64a7ce76fb
("libfs: fix infinite directory reads for offset dir"), if a
directory's offset allocator wraps, existing entries are no longer
visible via readdir/getdents because offset_readdir() stops listing
entries once an entry's offset is larger than octx->next_offset.
These entries vanish persistently -- they can be looked up, but will
never again appear in readdir(3) output.
The reason for this is that the commit treats directory offsets as
monotonically increasing integer values rather than opaque cookies,
and introduces this comparison:
if (dentry2offset(dentry) >= last_index) {
On 64-bit platforms, the directory offset value upper bound is
2^63 - 1. Directory offsets will monotonically increase for millions
of years without wrapping.
On 32-bit platforms, however, LONG_MAX is 2^31 - 1. The allocator
can wrap after only a few weeks (at worst).
Revert commit 64a7ce76fb ("libfs: fix infinite directory reads for
offset dir") to prepare for a fix that can work properly on 32-bit
systems and might apply to recent LTS kernels where shmem employs
the simple_offset mechanism.
Reported-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Link: https://lore.kernel.org/r/20241228175522.1854234-4-cel@kernel.org
Reviewed-by: Yang Erkun <yangerkun@huawei.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>