linux

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2026-05-21 20:45:27 -04:00

Author	SHA1	Message	Date
Bo Liu (OpenAnolis)	1cf12c7177	erofs: Add support for FS_IOC_GETFSLABEL Add support for reading the erofs volume label via the FS_IOC_GETFSLABEL ioctl. Signed-off-by: Bo Liu (OpenAnolis) <liubo03@inspur.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Chao Yu <chao@kernel.org> Reviewed-by: Hongbo Li <lihongbo22@huawei.com> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2025-09-25 11:26:20 +08:00
Viacheslav Dubeyko	f32a26fab3	hfs/hfsplus: rework debug output subsystem Currently, HFS/HFS+ has a very obsolete and inconvenient debug output subsystem. Also, the code is duplicated in the HFS and HFS+ drivers. This patch introduces linux/hfs_common.h for gathering common declarations, inline functions, and common short methods. Currently, this file contains only the hfs_dbg() function, which employs pr_debug() to print debug-level messages conditionally. So, now, it is possible to enable the debug output by means of: echo 'file extent.c +p' > /proc/dynamic_debug/control echo 'func hfsplus_evict_inode +p' > /proc/dynamic_debug/control And debug output looks like this: hfs: pid 5831:fs/hfs/catalog.c:228 hfs_cat_delete(): delete_cat: 00,48 hfs: pid 5831:fs/hfs/extent.c:484 hfs_file_truncate(): truncate: 48, 409600 -> 0 hfs: pid 5831:fs/hfs/extent.c:212 hfs_dump_extent(): hfs: pid 5831:fs/hfs/extent.c:214 hfs_dump_extent(): 78:4 hfs: pid 5831:fs/hfs/extent.c:214 hfs_dump_extent(): 0:0 hfs: pid 5831:fs/hfs/extent.c:214 hfs_dump_extent(): 0:0 v4 Debug messages have been reworked and information about the new HFS/HFS+ shared declarations file has been added to the MAINTAINERS file. v5 Yangtao Li suggested cleaning up the debug output and fixing several typos. Signed-off-by: Viacheslav Dubeyko <slava@dubeyko.com> cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> cc: Yangtao Li <frank.li@vivo.com> cc: linux-fsdevel@vger.kernel.org cc: Johannes Thumshirn <Johannes.Thumshirn@wdc.com> Signed-off-by: Viacheslav Dubeyko <slava@dubeyko.com>	2025-09-24 16:30:34 -07:00
Linus Torvalds	74c7cc79aa	Merge tag 'for-6.17-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fix from David Sterba: "One more regression fix for a problem in zoned mode: mounting would fail if the number of open and active zones reached a common limit that didn't use to be checked" * tag 'for-6.17-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: zoned: don't fail mount needlessly due to too many active zones	2025-09-24 11:09:09 -07:00
Al Viro	a890a2e339	nfs4_setup_readdir(): insufficient locking for ->d_parent->d_inode dereferencing Theoretically it's an oopsable race, but I don't believe one can manage to hit it on real hardware; might become doable on a KVM, but it still won't be easy to attack. Anyway, it's easy to deal with - since xdr_encode_hyper() is just a call of put_unaligned_be64(), we can put that under ->d_lock and be done with that. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>	2025-09-23 13:29:51 -04:00
Trond Myklebust	902893e390	NFS: Enable use of the RWF_DONTCACHE flag on the NFS client The NFS client needs to defer dropbehind until after any writes to the folio have been persisted on the server. Since this may be a 2 step process, use folio_end_writeback_no_dropbehind() to allow release of the writeback flag, and then call folio_end_dropbehind() once the COMMIT is done. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>	2025-09-23 13:29:50 -04:00
Anna Schumaker	4b7c3b4c67	NFS: Update the flexfilelayout driver to use xdr_set_scratch_folio() Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>	2025-09-23 13:29:50 -04:00
Anna Schumaker	1a33b629af	NFS: Update the filelayout to use xdr_set_scratch_folio() Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>	2025-09-23 13:29:50 -04:00
Anna Schumaker	cf289099ab	NFS: Update the blocklayout to use xdr_set_scratch_folio() Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>	2025-09-23 13:29:50 -04:00
Anna Schumaker	c9cefd7ae8	NFS: Update listxattr to use xdr_set_scratch_folio() Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>	2025-09-23 13:29:50 -04:00
Anna Schumaker	2f8416f23e	NFS: Update getacl to use xdr_set_scratch_folio() Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>	2025-09-23 13:29:50 -04:00
Anna Schumaker	670335c0f9	NFS: Update readdir to use a scratch folio Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>	2025-09-23 13:29:50 -04:00
Chuck Lever	62c0c0e749	SUNRPC: Move the svc_rpcb_cleanup() call sites Clean up: because svc_rpcb_cleanup() and svc_xprt_destroy_all() are always invoked in pairs, we can deduplicate code by moving the svc_rpcb_cleanup() call sites into svc_xprt_destroy_all(). Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Olga Kornievskaia <okorniev@redhat.com> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>	2025-09-23 13:28:19 -04:00
Chuck Lever	c231cea10d	NFS: Remove rpcbind cleanup for NFSv4.0 callback The NFS client's NFSv4.0 callback listeners are created with SVC_SOCK_ANONYMOUS, therefore svc_setup_socket() does not register them with the client's rpcbind service. And, note that nfs_callback_down_net() does not call svc_rpcb_cleanup() at all when shutting down the callback server. Even if svc_setup_socket() were to attempt to register or unregister these sockets, the callback service has vs_hidden set, which shunts the rpcbind upcalls. The svc_rpcb_cleanup() error flow was introduced by commit `c946556b87` ("NFS: move per-net callback thread initialization to nfs_callback_up_net()"). It doesn't appear in the code that was relocated by that commit. Therefore, there is no need to call svc_rpcb_cleanup() when listener creation fails during callback server start-up. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>	2025-09-23 13:28:19 -04:00
Anthony Iliopoulos	bf75ad0968	NFSv4.1: fix mount hang after CREATE_SESSION failure When client initialization goes through server trunking discovery, it schedules the state manager and then sleeps waiting for nfs_client initialization completion. The state manager can fail during state recovery, and specifically in lease establishment as nfs41_init_clientid() will bail out in case of errors returned from nfs4_proc_create_session(), without ever marking the client ready. The session creation can fail for a variety of reasons e.g. during backchannel parameter negotiation, with status -EINVAL. The error status will propagate all the way to the nfs4_state_manager but the client status will not be marked, and thus the mount process will remain blocked waiting. Fix it by adding -EINVAL error handling to nfs4_state_manager(). Signed-off-by: Anthony Iliopoulos <ailiop@suse.com> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>	2025-09-23 13:28:19 -04:00
Anthony Iliopoulos	191512355e	NFSv4.1: fix backchannel max_resp_sz verification check When the client max_resp_sz is larger than what the server encodes in its reply, the nfs4_verify_back_channel_attrs() check fails and this causes nfs4_proc_create_session() to fail, in cases where the client page size is larger than that of the server and the server does not want to negotiate upwards. While this is not a problem with the linux nfs server that will reflect the proposed value in its reply irrespective of the local page size, other nfs server implementations may insist on their own max_resp_sz value, which could be smaller. Fix this by accepting smaller max_resp_sz values from the server, as this does not violate the protocol. The server is allowed to decrease but not increase the proposed size, and as such values smaller than the client-proposed ones are valid. Fixes: `43c2e885be` ("nfs4: fix channel attribute sanity-checks") Signed-off-by: Anthony Iliopoulos <ailiop@suse.com> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>	2025-09-23 13:28:19 -04:00
Xichao Zhao	64afd87839	NFSv4: fix "prefered"->"preferred" Trivial fix to spelling mistake in comment text. Signed-off-by: Xichao Zhao <zhao.xichao@vivo.com> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>	2025-09-23 13:28:19 -04:00
Olga Kornievskaia	be390f9524	NFSv4: handle ERR_GRACE on delegation recalls RFC7530 states that clients should be prepared for the return of NFS4ERR_GRACE errors for non-reclaim lock and I/O requests. Signed-off-by: Olga Kornievskaia <okorniev@redhat.com> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>	2025-09-23 13:28:19 -04:00
Jeff Layton	9082aae154	sunrpc: remove dfprintk_cont() and dfprintk_rcu_cont() KERN_CONT hails from a simpler time, when SMP wasn't the norm. These days, it doesn't quite work right since another printk() can always race in between the first one and the one being "continued". Nothing calls dprintk_rcu_cont(), so just remove it. The only caller of dprintk_cont() is in nfs_commit_release_pages(). Just use a normal dprintk() there instead, since this is not SMP-safe anyway. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>	2025-09-23 13:28:19 -04:00
Leo Martins	64dd802224	nfs: cleanup tracepoint declarations Cleanup tracepoint declarations by replacing commas with semicolons to better match other tracepoint declarations. No functional changes introduced. Signed-off-by: Leo Martins <loemra.dev@gmail.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>	2025-09-23 13:28:19 -04:00
Jeff Layton	83c47ef8ac	nfs: add tracepoints to nfs_writepages() Show the inode info and requested range. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>	2025-09-23 13:28:19 -04:00
Jeff Layton	b6ef079fd9	nfs: more in-depth tracing of writepage events Add tracepoints to nfs_writepage_setup() and nfs_do_writepage(). Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>	2025-09-23 13:28:19 -04:00
Jeff Layton	4a2d81714d	nfs: new tracepoints around write handling New start and done tracepoints for: nfs_update_folio() nfs_write_begin() nfs_write_end() nfs_try_to_update_request() Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>	2025-09-23 13:28:19 -04:00
Jeff Layton	4b62f0e448	nfs: add tracepoints to nfs_file_read() and nfs_file_write() Add some tracepoints to the I/O submission codepaths. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>	2025-09-23 13:28:19 -04:00
Dave Chinner	c91d38b57f	xfs: rework datasync tracking and execution Jan Kara reported that the shared ILOCK held across the journal flush during fdatasync operations slows down O_DSYNC DIO on unwritten extents significantly. The underlying issue is that unwritten extent conversion needs the ILOCK exclusive, whilst the datasync operation after the extent conversion holds it shared. Hence we cannot be flushing the journal for one IO completion whilst at the same time doing unwritten extent conversion on another IO completion on the same inode. This means that IO completions lock-step, and IO performance is dependent on the journal flush latency. Jan demonstrated that reducing the ifdatasync lock hold time can improve O_DSYNC DIO to unwritten extents performance by 2.5x. Discussion on that patch found issues with the method, and we came to the conclusion that separately tracking datasync flush sequences was the best approach to solving the problem. The fsync code uses the ILOCK to serialise against concurrent modifications in the transaction commit phase. In a transaction commit, there are several disjoint updates to inode log item state that need to be considered atomically by the fsync code. These operations are all done under ILOCK_EXCL context: 1. ili_fsync_flags is updated in ->iop_precommit 2. i_pincount is updated in ->iop_pin before it is added to the CIL 3. ili_commit_seq is updated in ->iop_committing, after it has been added to the CIL In fsync, we need to: 1. check that the inode is dirty in the journal (ipincount) 2. check that ili_fsync_flags is set 3. grab the ili_commit_seq if a journal flush is needed 4. clear the ili_fsync_flags to ensure that new modifications that require fsync are tracked in ->iop_precommit correctly The serialisation of ipincount/ili_commit_seq is needed to ensure that we don't try to unnecessarily flush the journal. 
The serialisation of ili_fsync_flags being set in ->iop_precommit and cleared in fsync post journal flush is required for correctness. Hence holding the ILOCK_SHARED in xfs_file_fsync() performs all this serialisation for us. Ideally, we want to remove the need to hold the ILOCK_SHARED in xfs_file_fsync() for best performance. We start with the observation that fsync/fdatasync() only need to wait for operations that have been completed. Hence operations that are still being committed have not completed and datasync operations do not need to wait for them. This means we can use a single point in time in the commit process to signal "this modification is complete". This is what ->iop_committing is supposed to provide - it is the point at which the object is unlocked after the modification has been recorded in the CIL. Hence we could use ili_commit_seq to determine if we should flush the journal. In theory, we can already do this. However, in practice this will expose an internal global CIL lock to the IO path. The ipincount() checks optimise away the need to take this lock - if the inode is not pinned, then it is not in the CIL and we don't need to check if a journal flush at ili_commit_seq needs to be performed. The reason this is needed is that the ili_commit_seq is never cleared. Once it is set, it remains set even once the journal has been committed and the object has been unpinned. Hence we have to look at the journal's internal commit sequence state to determine if ili_commit_seq needs to be acted on or not. We can solve this by clearing ili_commit_seq when the inode is unpinned. If we clear it atomically with the last unpin going away, then we are guaranteed that new modifications will order correctly as they add new pin counts and we won't clear a sequence number for an active modification in the CIL.
Further, we can then allow the per-transaction flag state to propagate into ->iop_committing (instead of clearing it in ->iop_precommit) and that will allow us to determine if the modification needs a full fsync or just a datasync, and so we can record a separate datasync sequence number (Jan's idea!) and then use that in the fdatasync path instead of the full fsync sequence number. With this infrastructure in place, we no longer need the ILOCK_SHARED in the fsync path. All serialisation is done against the commit sequence numbers - if the sequence number is set, then we have to flush the journal. If it is not set, then we have nothing to do. This greatly simplifies the fsync implementation.... Signed-off-by: Dave Chinner <dchinner@redhat.com> Tested-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2025-09-23 15:12:43 +02:00
Dave Chinner	bc7d684fea	xfs: rearrange code in xfs_inode_item_precommit There are similar extsize checks and updates done inside and outside the inode item lock, which could all be done under a single top level logic branch outside the ili_lock. The COW extsize fixup can potentially miss updating the XFS_ILOG_CORE in ili_fsync_fields, so moving this code up above the ili_fsync_fields update could also be considered a fix. Further, to make the next change a bit cleaner, move where we calculate the on-disk flag mask to after we attach the cluster buffer to the inode log item. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2025-09-23 15:12:37 +02:00
Darrick J. Wong	d3906d8f3c	fuse: enable FUSE_SYNCFS for all fuseblk servers Turn on syncfs for all fuseblk servers so that the ones in the know can flush cached intermediate data and logs to disk. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2025-09-23 12:45:25 +02:00
NeilBrown	0a2c705947	debugfs: rename start_creating() to debugfs_start_creating() start_creating() is a generic name which I would like to use for a function similar to simple_start_creating(), only not quite so simple. debugfs is using this name which, though static, will cause complaints if the name is given a different signature in a header file. So rename it to debugfs_start_creating(). Signed-off-by: NeilBrown <neil@brown.name> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-09-23 12:37:36 +02:00
NeilBrown	3d18f80ce1	VFS: rename kern_path_locked() and related functions. kern_path_locked() is now only used to prepare for removing an object from the filesystem (and that is the only credible reason for wanting a positive locked dentry). Thus it corresponds to kern_path_create() and so should have a corresponding name. Unfortunately the name "kern_path_create" is somewhat misleading as it doesn't actually create anything. The recently added simple_start_creating() provides a better pattern I believe. The "start" can be matched with "end" to bracket the creating or removing. So this patch changes names: kern_path_locked -> start_removing_path kern_path_create -> start_creating_path user_path_create -> start_creating_user_path user_path_locked_at -> start_removing_user_path_at done_path_create -> end_creating_path and also introduces end_removing_path() which is identical to end_creating_path(). __start_removing_path (which was __kern_path_locked) is enhanced to call mnt_want_write() for consistency with the start_creating_path(). Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: NeilBrown <neil@brown.name> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-09-23 12:37:36 +02:00
NeilBrown	76a53de6f7	VFS/audit: introduce kern_path_parent() for audit audit_alloc_mark() and audit_get_nd() both need to perform a path lookup getting the parent dentry (which must exist) and the final target (following a LAST_NORM name) which sometimes doesn't need to exist. They don't need the parent to be locked, but use kern_path_locked() or kern_path_locked_negative() anyway. This is somewhat misleading to the casual reader. This patch introduces a more targeted function, kern_path_parent(), which returns not holding locks. On success the "path" will be set to the parent, which must be found, and the return value is the dentry of the target, which might be negative. This will clear the way to rename kern_path_locked() which is otherwise only used to prepare for removing something. It also allows us to remove kern_path_locked_negative(), which is transformed into the new kern_path_parent(). Signed-off-by: NeilBrown <neil@brown.name> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-09-23 12:37:35 +02:00
NeilBrown	d7fb2c4102	VFS: unify old_mnt_idmap and new_mnt_idmap in renamedata A rename operation can only rename within a single mount. Callers of vfs_rename() must and do ensure this is the case. So there is no point in having two mnt_idmaps in renamedata as they are always the same. Only one of them is passed to ->rename in any case. This patch replaces both with a single "mnt_idmap" and changes all callers. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: NeilBrown <neil@brown.name> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-09-23 12:37:35 +02:00
NeilBrown	e66ccd30dc	VFS: discard err2 in filename_create() Since `204a575e91` "VFS: add common error checks to lookup_one_qstr_excl()" filename_create() does not need to stash the error value from mnt_want_write() into a separate variable - the logic that used to clobber 'error' after the call of mnt_want_write() has migrated into lookup_one_qstr_excl(). So there is no need for two different err variables. This patch discards "err2" and uses "error" throughout. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: NeilBrown <neil@brown.name> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-09-23 12:37:35 +02:00
NeilBrown	17eb98d6b5	VFS/ovl: add lookup_one_positive_killable() ovl wants a lookup which won't block on a fatal signal. It currently uses down_write_killable() and then repeatedly calls lookup_one(). The lock may not be needed if the name is already in the dcache, and it aids proposed future changes if the locking is kept internal to namei.c. So this patch adds lookup_one_positive_killable() which is like lookup_one_positive() but will abort in the face of a fatal signal. overlayfs is changed to use this. Note that instead of always getting an exclusive lock, ovl now only gets a shared lock, and only sometimes. The exclusive lock was never needed. However down_read_killable() was only added in v4.15 but overlayfs started using down_write_killable() here in v4.7. Note that the linked list ->first_maybe_whiteout ->next_maybe_white is local to the thread so there is no concurrency in that list which could be threatened by removing the locking. Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: NeilBrown <neil@brown.name> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-09-23 12:37:35 +02:00
Amir Goldstein	ad14239227	ovl: make sure that ovl_create_real() returns a hashed dentry `e8bd877fb7` ("ovl: fix possible double unlink") added a sanity check of !d_unhashed(child) to try to verify that the child dentry was not unlinked while the parent dir was unlocked. This "was not unlinked" check has a false positive result in the case of a casefolded parent dir, because in that case, ovl_create_temp() returns an unhashed dentry after ovl_create_real() gets an unhashed dentry from ovl_lookup_upper() and makes it positive. To avoid returning an unhashed dentry from ovl_create_temp(), let ovl_create_real() look up again after making the newdentry positive, so it always returns a hashed positive dentry (or an error). This fixes the error in ovl_parent_lock() in ovl_check_rename_whiteout() after ovl_create_temp() and allows mount of overlayfs with casefolding enabled layers. Reported-by: André Almeida <andrealmeid@igalia.com> Closes: https://lore.kernel.org/r/18704e8c-c734-43f3-bc7c-b8be345e1bf5@igalia.com/ Suggested-by: Neil Brown <neil@brown.name> Reviewed-by: Neil Brown <neil@brown.name> Signed-off-by: Amir Goldstein <amir73il@gmail.com>	2025-09-23 12:29:36 +02:00
André Almeida	16754d61dc	ovl: Support mounting case-insensitive enabled layers Drop the restriction for casefold dentries lookup to enable support for case-insensitive layers in overlayfs. Support case-insensitive layers with the condition that they should be uniformly enabled across the stack (i.e. if the root mount dir has casefold enabled, so should all the dirs below for every layer). Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: André Almeida <andrealmeid@igalia.com> Signed-off-by: Amir Goldstein <amir73il@gmail.com>	2025-09-23 12:29:36 +02:00
André Almeida	dfc7da402c	ovl: Check for casefold consistency when creating new dentries In an overlayfs with casefold enabled, all new dentries should have casefold enabled as well. Check this at ovl_create_real(). Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: André Almeida <andrealmeid@igalia.com> Signed-off-by: Amir Goldstein <amir73il@gmail.com>	2025-09-23 12:29:36 +02:00
André Almeida	f9377faaea	ovl: Add S_CASEFOLD as part of the inode flag to be copied To keep ovl's inodes consistent with their real inodes, create a new mask for inode file attributes that needs to be copied. Add the S_CASEFOLD flag as part of the flags that need to be copied along with the other file attributes. Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: André Almeida <andrealmeid@igalia.com> Signed-off-by: Amir Goldstein <amir73il@gmail.com>	2025-09-23 12:29:36 +02:00
André Almeida	8a78f18975	ovl: Set case-insensitive dentry operations for ovl sb For filesystems with encoding (i.e. with case-insensitive support), set the dentry operations for the super block as ovl_dentry_ci_operations. Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: André Almeida <andrealmeid@igalia.com> Signed-off-by: Amir Goldstein <amir73il@gmail.com>	2025-09-23 12:29:35 +02:00
André Almeida	1f7168b28f	ovl: Ensure that all layers have the same encoding When merging layers from different filesystems with casefold enabled, all layers should use the same encoding version and have the same flags to avoid any kind of incompatibility issues. Also, set the encoding and the encoding flags for the ovl super block as the same as used by the first valid layer. Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: André Almeida <andrealmeid@igalia.com> Signed-off-by: Amir Goldstein <amir73il@gmail.com>	2025-09-23 12:29:35 +02:00
André Almeida	ee95c5fc86	ovl: Create ovl_casefold() to support casefolded strncmp() To add overlayfs support for casefold layers, create a new function ovl_casefold(), to be able to do case-insensitive strncmp(). ovl_casefold() allocates a new buffer and stores the casefolded version of the string in it. If the allocation or the casefold operation fails, fall back to using the original string. The case-insensitive name is then used in the rb-tree search/insertion operation. If the name is found in the rb-tree, the name can be discarded and the buffer is freed. If the name isn't found, it's then stored at struct ovl_cache_entry to be used later. Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: André Almeida <andrealmeid@igalia.com> Signed-off-by: Amir Goldstein <amir73il@gmail.com>	2025-09-23 12:29:35 +02:00
André Almeida	5fbf73c7f1	ovl: Prepare for mounting case-insensitive enabled layers Prepare for mounting layers with case-insensitive dentries in order to support such layers in overlayfs, while enforcing uniform casefold layers. Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: André Almeida <andrealmeid@igalia.com> Reviewed-by: Gabriel Krisman Bertazi <gabriel@krisman.be> Signed-off-by: Amir Goldstein <amir73il@gmail.com>	2025-09-23 12:29:35 +02:00
Darrick J. Wong	0d375a1385	fuse: capture the unique id of fuse commands being sent The fuse_request_{send,end} tracepoints capture the value of req->in.h.unique in the trace output. It would be really nice if we could use this to match a request to its response for debugging and latency analysis, but the call to trace_fuse_request_send occurs before the unique id has been set: fuse_request_send: connection 8388608 req 0 opcode 1 (FUSE_LOOKUP) len 107 fuse_request_end: connection 8388608 req 6 len 16 error -2 (Notice that req moves from 0 to 6) Move the callsites to trace_fuse_request_send to after the unique id has been set by introducing a helper to do that for standard fuse_req requests. FUSE_FORGET requests are not covered by this because they appear to be synthesized into the event stream without a fuse_req object and are never replied to. Requests that are aborted without ever having been submitted to the fuse server retain the behavior that only the fuse_request_end tracepoint shows up in the trace record, and with req==0. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2025-09-23 11:32:17 +02:00
Darrick J. Wong	26e5c67deb	fuse: fix livelock in synchronous file put from fuseblk workers I observed a hang when running generic/323 against a fuseblk server. This test opens a file, initiates a lot of AIO writes to that file descriptor, and closes the file descriptor before the writes complete. Unsurprisingly, the AIO exerciser threads are mostly stuck waiting for responses from the fuseblk server: # cat /proc/372265/task/372313/stack [<0>] request_wait_answer+0x1fe/0x2a0 [fuse] [<0>] __fuse_simple_request+0xd3/0x2b0 [fuse] [<0>] fuse_do_getattr+0xfc/0x1f0 [fuse] [<0>] fuse_file_read_iter+0xbe/0x1c0 [fuse] [<0>] aio_read+0x130/0x1e0 [<0>] io_submit_one+0x542/0x860 [<0>] __x64_sys_io_submit+0x98/0x1a0 [<0>] do_syscall_64+0x37/0xf0 [<0>] entry_SYSCALL_64_after_hwframe+0x4b/0x53 But the /weird/ part is that the fuseblk server threads are waiting for responses from itself: # cat /proc/372210/task/372232/stack [<0>] request_wait_answer+0x1fe/0x2a0 [fuse] [<0>] __fuse_simple_request+0xd3/0x2b0 [fuse] [<0>] fuse_file_put+0x9a/0xd0 [fuse] [<0>] fuse_release+0x36/0x50 [fuse] [<0>] __fput+0xec/0x2b0 [<0>] task_work_run+0x55/0x90 [<0>] syscall_exit_to_user_mode+0xe9/0x100 [<0>] do_syscall_64+0x43/0xf0 [<0>] entry_SYSCALL_64_after_hwframe+0x4b/0x53 The fuseblk server is fuse2fs so there's nothing all that exciting in the server itself. So why is the fuse server calling fuse_file_put? The commit message for the fstest sheds some light on that: "By closing the file descriptor before calling io_destroy, you pretty much guarantee that the last put on the ioctx will be done in interrupt context (during I/O completion). Aha. AIO fgets a new struct file from the fd when it queues the ioctx. The completion of the FUSE_WRITE command from userspace causes the fuse server to call the AIO completion function. The completion puts the struct file, queuing a delayed fput to the fuse server task. 
When the fuse server task returns to userspace, it has to run the delayed fput, which in the case of a fuseblk server, it does synchronously. Sending the FUSE_RELEASE command synchronously from fuse server threads is a bad idea because a client program can initiate enough simultaneous AIOs such that all the fuse server threads end up in delayed_fput, and now there aren't any threads left to handle the queued fuse commands. Fix this by only using asynchronous fputs when closing files, and leave a comment explaining why. Cc: stable@vger.kernel.org # v2.6.38 Fixes: `5a18ec176c` ("fuse: fix hang of single threaded fuseblk filesystem") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2025-09-23 11:32:17 +02:00
Johannes Thumshirn	53de7ee4e2	btrfs: zoned: don't fail mount needlessly due to too many active zones Previously BTRFS did not look at a device's reported max_open_zones limit, but starting with commit `04147d8394` ("btrfs: zoned: limit active zones to max_open_zones"), zoned BTRFS limited the number of concurrently used block-groups to the number of max_open_zones a device reported, if it hadn't already reported a number of max_active_zones. Starting with commit `04147d8394` the number of open zones is treated the same way as active zones. But this leads to mount failures on filesystems which have been used before `04147d8394` because too many zones are in an open state. Ignore the new limitations on these filesystems, so zones can be finished or evacuated. Reported-by: Yuwei Han <hrx@bupt.moe> Link: https://lore.kernel.org/all/2F48A90AF7DDF380+1790bcfd-cb6f-456b-870d-7982f21b5eae@bupt.moe/ Fixes: `04147d8394` ("btrfs: zoned: limit active zones to max_open_zones") Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-23 11:22:21 +02:00
Gao Xiang	334c0e493c	erofs: avoid reading more for fragment maps All real encoded extents (directly handled by the decompression subsystem) have a sane, limited maximum decoded length (Z_EROFS_PCLUSTER_MAX_DSIZE), and the read-more policy is only applied if needed. However, it makes no sense to read more for non-encoded maps, such as fragment extents, since such extents can be huge (up to i_size) and there is no benefit to reading more at this layer. For normal images, it does not really matter, but for crafted images generated by syzbot, excessively large fragment extents can cause read-more to run for an overly long time. Reported-and-tested-by: syzbot+1a9af3ef3c84c5e14dcc@syzkaller.appspotmail.com Closes: https://lore.kernel.org/r/68c8583d.050a0220.2ff435.03a3.GAE@google.com Fixes: `b44686c839` ("erofs: fix large fragment handling") Fixes: `b15b2e307c` ("erofs: support on-disk compressed fragments data") Reviewed-by: Hongbo Li <lihongbo22@huawei.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2025-09-23 15:56:31 +08:00
Filipe Manana	45c222468d	btrfs: use smp_mb__after_atomic() when forcing COW in create_pending_snapshot() After setting the BTRFS_ROOT_FORCE_COW flag on the root we are doing a full write barrier, smp_wmb(), but we don't need to, all we need is a smp_mb__after_atomic(). The use of the smp_wmb() is from the old days when we didn't use a bit and used instead an int field in the root to signal if cow is forced. After the int field was changed to a bit in the root's state (flags field), we forgot to update the memory barrier in create_pending_snapshot() to smp_mb__after_atomic(), but we did the change in commit_fs_roots() after clearing BTRFS_ROOT_FORCE_COW. That happened in commit `27cdeb7096` ("Btrfs: use bitfield instead of integer data type for the some variants in btrfs_root"). On the reader side, in should_cow_block(), we also use the counterpart smp_mb__before_atomic() which generates further confusion. So change the smp_wmb() to smp_mb__after_atomic(). In fact we don't even need any barrier at all since create_pending_snapshot() is called in the critical section of a transaction commit and therefore no one can concurrently join/attach the transaction, or start a new one, until the transaction is unblocked. By the time someone starts a new transaction and enters should_cow_block(), a lot of implicit memory barriers already took place by having acquired several locks such as fs_info->trans_lock and extent buffer locks on the root node at least. Nevertheless, for consistency use smp_mb__after_atomic() after setting the force cow bit in create_pending_snapshot(). Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-23 09:02:17 +02:00
David Sterba	a929904cf7	btrfs: add unlikely annotations to branches leading to transaction abort The unlikely() annotation is a static prediction hint that the compiler may use to reorder code out of the hot path. We use it elsewhere (namely tree-checker.c) for error branches that almost never happen. Transaction abort is one such error: btrfs_abort_transaction() inlines code to check the state and print a warning, and this ought to be out of the hot path. The most common pattern is when transaction abort is called after checking a return value and the control flow leads to a quick return. In other cases it may not be necessary to add unlikely(), e.g. when the function returns anyway or the control flow is not changed noticeably. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-23 08:49:26 +02:00
David Sterba	cc53bd2085	btrfs: add unlikely annotations to branches leading to EIO The unlikely() annotation is a static prediction hint that the compiler may use to reorder code out of the hot path. We use it elsewhere (namely tree-checker.c) for error branches that almost never happen, where EIO is one of them. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-23 08:49:26 +02:00
David Sterba	9264d004a6	btrfs: add unlikely annotations to branches leading to EUCLEAN The unlikely() annotation is a static prediction hint that the compiler may use to reorder code out of the hot path. We use it elsewhere (namely tree-checker.c) for error branches that almost never happen, where EUCLEAN (a corruption) is one of them. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-23 08:49:26 +02:00
Sun YangKai	4ca6f24a52	btrfs: more trivial BTRFS_PATH_AUTO_FREE conversions Trivial pattern for the auto freeing with goto -> return conversions if possible. The following cases are considered trivial in this patch: 1. Cases where there are no operations between btrfs_free_path() and the function return. 2. Cases where only simple cleanup operations (such as kfree(), kvfree(), clear_bit(), and fs_path_free()) are present between btrfs_free_path() and the function return. Signed-off-by: Sun YangKai <sunk67188@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-23 08:49:26 +02:00
Johannes Thumshirn	c9ff83963a	btrfs: zoned: don't fail mount needlessly due to too many active zones Previously BTRFS did not look at a device's reported max_open_zones limit, but starting with commit `04147d8394` ("btrfs: zoned: limit active zones to max_open_zones"), zoned BTRFS limited the number of concurrently used block-groups to the number of max_open_zones a device reported, if it hadn't already reported a number of max_active_zones. Starting with commit `04147d8394` the number of open zones is treated the same way as active zones. But this leads to mount failures on filesystems which have been used before `04147d8394` because too many zones are in an open state. Ignore the new limitations on these filesystems, so zones can be finished or evacuated. Reported-by: Yuwei Han <hrx@bupt.moe> Link: https://lore.kernel.org/all/2F48A90AF7DDF380+1790bcfd-cb6f-456b-870d-7982f21b5eae@bupt.moe/ Fixes: `04147d8394` ("btrfs: zoned: limit active zones to max_open_zones") Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-23 08:49:25 +02:00
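
Three of the btrfs commits listed above add unlikely() annotations to error branches. As a rough userspace illustration of that pattern (not code from those patches), the sketch below defines likely()/unlikely() on top of the GCC/Clang builtin __builtin_expect, in the same shape as the kernel's helpers. The function checked_sum() and its choice of -EIO for a NULL argument are hypothetical and exist only to show an annotated error branch with a quick return.

```c
#include <errno.h>
#include <stddef.h>

/*
 * Branch prediction hints built on the GCC/Clang builtin.
 * The double negation normalizes any truthy value to 1.
 */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/*
 * Hypothetical helper (not from the kernel): sums len ints from buf
 * into *out. The NULL check is an error path that should almost never
 * be taken, so it is marked unlikely() and returns immediately.
 */
static int checked_sum(const int *buf, size_t len, long *out)
{
    long sum = 0;

    if (unlikely(buf == NULL || out == NULL))
        return -EIO;

    for (size_t i = 0; i < len; i++)
        sum += buf[i];

    *out = sum;
    return 0;
}
```

The hint does not change what the function computes; it only tells the compiler which side of the branch to favor when laying out code, which is why the commit messages limit it to branches that almost never happen.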


102029 Commits