linux

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2026-05-16 12:31:52 -04:00

Author	SHA1	Message	Date
Hans Holmberg	181ea4e2de	xfs: start gc on zonegc_low_space attribute updates Start gc if the aggressiveness of zone garbage collection is changed by the user (if the file system is not read only). Without this change, the new setting will not be taken into account until the gc thread is woken up by e.g. a write. Cc: stable@vger.kernel.org # v6.15 Fixes: `845abeb1f0` ("xfs: add tunable threshold parameter for triggering zone GC") Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2026-03-30 16:34:05 +02:00
Christoph Hellwig	8166876aad	xfs: don't decrement the buffer LRU count for in-use buffers XFS buffers are added to the LRU when they are unused, but are only removed from the LRU lazily when the LRU list scan finds a used buffer. So far this only happens when the LRU counter hits 0, which is suboptimal as buffers that were added to the LRU, but are in use again still consume LRU scanning resources and are aged while actually in use. Fix this by checking for in-use buffers and removing them from the LRU before decrementing the LRU counter. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2026-03-30 16:34:05 +02:00
Christoph Hellwig	497560b9ef	xfs: switch (back) to a per-buftarg buffer hash The per-AG buffer hashes were added when all buffer lookups took a per-hash lock. Since then we've made lookups entirely lockless and removed the need for a hash-wide lock for inserts and removals as well. With this there is no need for sharding the hash, so reduce the used resources by using a per-buftarg hash for all buftargs. Long after writing this initially, syzbot found a problem in the buffer cache teardown order, which this happens to fix as well by doing the entire buffer cache teardown in one place instead of splitting it between destroying the buftarg and the perag structures. Link: https://lore.kernel.org/linux-xfs/aLeUdemAZ5wmtZel@dread.disaster.area/ Reported-by: syzbot+0391d34e801643e2809b@syzkaller.appspotmail.com Reviewed-by: Darrick J. Wong <djwong@kernel.org> Tested-by: syzbot+0391d34e801643e2809b@syzkaller.appspotmail.com Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2026-03-30 16:34:05 +02:00
Christoph Hellwig	d02ee47bbe	xfs: use a lockref for the buffer reference count The lockref structure allows incrementing/decrementing counters like an atomic_t for the fast path, while still allowing complex slow path operations as if the counter was protected by a lock. The only slow path operations that actually need to take the lock are the final put, LRU evictions and marking a buffer stale. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2026-03-30 16:34:05 +02:00
Christoph Hellwig	67fe430397	xfs: don't keep a reference for buffers on the LRU Currently the buffer cache adds a reference to b_hold for buffers that are on the LRU. This seems to go all the way back and allows releasing buffers from the LRU using xfs_buf_rele. But it makes xfs_buf_rele really complicated and differs from how other LRUs are implemented in Linux. Switch to not having a reference for buffers in the LRU, and use a separate negative hold value to mark buffers as dead. This simplifies xfs_buf_rele, which now just deals with the last "real" reference, and prepares for using the lockref primitive. This also removes the b_lock protection for removing buffers from the buffer hash. This is the desired outcome because the rhashtable is fully internally synchronized, and previously the lock was mostly held out of ordering constraints in xfs_buf_rele_cached. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2026-03-30 16:34:05 +02:00
Carlos Maiolino	025b245f0b	Merge branch 'xfs-7.0-fixes' into for-next Signed-off-by: Carlos Maiolino <cem@kernel.org>	2026-03-26 18:06:56 +01:00
Darrick J. Wong	e31c53a806	xfs: remove file_path tracepoint data The xfile/xmbuf shmem file descriptions are no longer as detailed as they were when online fsck was first merged, because moving to static strings in commit `60382993a2` ("xfs: get rid of the xchk_xfile_*_descr calls") removed a memory allocation and hence a source of failure. However this makes encoding the description in the tracepoints sort of a waste of memory. David Laight also points out that file_path doesn't zero the whole buffer which causes exposure of stale trace bytes, and Steven Rostedt wonders why we're not using a dynamic array for the file path. I don't think this is worth fixing, so let's just rip it out. Cc: rostedt@goodmis.org Cc: david.laight.linux@gmail.com Link: https://lore.kernel.org/linux-xfs/20260323172204.work.979-kees@kernel.org/ Cc: stable@vger.kernel.org # v6.11 Fixes: `19ebc8f84e` ("xfs: fix file_path handling in tracepoints") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2026-03-26 14:25:23 +01:00
Darrick J. Wong	70685c291e	xfs: don't irele after failing to iget in xfs_attri_recover_work xlog_recovery_iget* never set @ip to a valid pointer if they return an error, so this irele will walk off a dangling pointer. Fix that. Cc: stable@vger.kernel.org # v6.10 Fixes: `ae673f534a` ("xfs: record inode generation in xattr update log intent items") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Long Li <leo.lilong@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2026-03-26 14:25:06 +01:00
Carlos Maiolino	df236c996b	Merge branch 'xfs-7.1-merge' into for-next Signed-off-by: Carlos Maiolino <cem@kernel.org>	2026-03-23 11:11:13 +01:00
Carlos Maiolino	e9b7a02e58	Merge branch 'xfs-7.0-fixes' into for-next Signed-off-by: Carlos Maiolino <cem@kernel.org>	2026-03-23 11:10:52 +01:00
Brian Foster	388bb26b3d	xfs: report cow mappings with dirty pagecache for iomap zero range XFS has long supported the case where it is possible to have dirty data in pagecache backed by COW fork blocks and a hole in the data fork. This occurs for two reasons. On reflink enabled files, COW fork blocks are allocated with preallocation to help avoid fragmentation. Second, if a mapping lookup for a write finds blocks in the COW fork, it consumes those blocks unconditionally. This might mean that COW fork blocks are backed by non-shared blocks or even a hole in the data fork, both of which are perfectly fine. This leaves an odd corner case for zero range, however, because it needs to distinguish between ranges that are sparse and thus do not require zeroing and those that are not. A range backed by COW fork blocks and a data fork hole might either be a legitimate hole in the file or a range with pending buffered writes that will be written back (which will remap COW fork blocks into the data fork). This "COW fork blocks over data fork hole" situation has historically been reported as a hole to iomap, which then has grown a flush hack as a workaround to ensure zeroing occurs correctly. Now that this has been lifted into the filesystem and replaced by the dirty folio lookup mechanism, we can do better and use the pagecache state to decide how to report the mapping. If a COW fork range exists with dirty folios in cache, then report a typical shared mapping. If the range is clean in cache, then we can consider the COW blocks preallocation and call it a hole. This doesn't fundamentally change behavior, but makes mapping reporting more accurate. Note that this does require splitting across the EOF boundary (similar to normal zero range) to ensure we don't spuriously perform post-eof zeroing. iomap will warn about zeroing beyond EOF because folios beyond i_size may not be written back. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2026-03-23 11:07:59 +01:00
Brian Foster	ce9d27ca8b	xfs: replace zero range flush with folio batch Now that the zero range pagecache flush is purely isolated to providing zeroing correctness in this case, we can remove it and replace it with the folio batch mechanism that is used for handling unwritten extents. This is still slightly odd in that XFS reports a hole vs. a mapping that reflects the COW fork extents, but that has always been the case in this situation and so a separate issue. We drop the iomap warning that assumes the folio batch is always associated with unwritten mappings, but this is mainly a development assertion as otherwise the core iomap fbatch code doesn't care much about the mapping type if it's handed the set of folios to process. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2026-03-23 11:07:59 +01:00
Brian Foster	c770f997a4	xfs: only flush when COW fork blocks overlap data fork holes The zero range hole mapping flush case has been lifted from iomap into XFS. Now that we have more mapping context available from the ->iomap_begin() handler, we can isolate the flush further to when we know a hole is fronted by COW blocks. Rather than purely rely on pagecache dirty state, explicitly check for the case where a range is a hole in both forks. Otherwise trim to the range where there does happen to be overlap and use that for the pagecache writeback check. This might prevent some spurious zeroing, but more importantly makes it easier to remove the flush entirely. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2026-03-23 11:07:59 +01:00
Brian Foster	a8eb41376d	xfs: look up cow fork extent earlier for buffered iomap_begin To further isolate the need for flushing for zero range, we need to know whether a hole in the data fork is fronted by blocks in the COW fork or not. COW fork lookup currently occurs further down in the function, after the zero range case is handled. As a preparation step, lift the COW fork extent lookup to earlier in the function, at the same time as the data fork lookup. Only the lookup logic is lifted. The COW fork branch/reporting logic remains as is to avoid any observable behavior change from an iomap reporting perspective. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2026-03-23 11:07:59 +01:00
Brian Foster	c35a3e273e	xfs: flush eof folio before insert range size update The flush in xfs_buffered_write_iomap_begin() for zero range over a data fork hole fronted by COW fork prealloc is primarily designed to provide correct zeroing behavior in particular pagecache conditions. As it turns out, this also partially masks some odd behavior in insert range (via zero range via setattr). Insert range bumps i_size the length of the new range, flushes, unmaps pagecache and cancels COW prealloc, and then right shifts extents from the end of the file back to the target offset of the insert. Since the i_size update occurs before the pagecache flush, this creates a transient situation where writeback around EOF can behave differently. This appears to be a corner case situation, but if it happens to be fronted by COW fork speculative preallocation and a large, dirty folio that contains at least one full COW block beyond EOF, the writeback after i_size is bumped may remap that COW fork block into the data fork within EOF. The block is zeroed and then shifted back out to post-eof, but this is unexpected in that it leads to a written post-eof data fork block. This can cause a zero range warning on a subsequent size extension, because we should never find blocks that require physical zeroing beyond i_size. To avoid this quirk, flush the EOF folio before the i_size update during insert range. The entire range will be flushed, unmapped and invalidated anyways, so this should be relatively unnoticeable. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2026-03-23 11:07:59 +01:00
Brian Foster	a35bb0dec9	iomap, xfs: lift zero range hole mapping flush into xfs iomap zero range has a wart in that it also flushes dirty pagecache over hole mappings (rather than only unwritten mappings). This was included to accommodate a quirk in XFS where COW fork preallocation can exist over a hole in the data fork, and the associated range is reported as a hole. This is because the range actually is a hole, but XFS also has an optimization where if COW fork blocks exist for a range being written to, those blocks are used regardless of whether the data fork blocks are shared or not. For zeroing, COW fork blocks over a data fork hole are only relevant if the range is dirty in pagecache, otherwise the range is already considered zeroed. The easiest way to deal with this corner case is to flush the pagecache to trigger COW remapping into the data fork, and then operate on the updated on-disk state. The problem is that ext4 cannot accommodate a flush from this context due to being a transaction deadlock vector. Outside of the hole quirk, ext4 can avoid the flush for zero range by using the recently introduced folio batch lookup mechanism for unwritten mappings. Therefore, take the next logical step and lift the hole handling logic into the XFS iomap_begin handler. iomap will still flush on unwritten mappings without a folio batch, and XFS will flush and retry mapping lookups in the case where it would otherwise report a hole with dirty pagecache during a zero range. Note that this is intended to be a fairly straightforward lift and otherwise not change behavior. Now that the flush exists within XFS, follow on patches can further optimize it. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2026-03-23 11:07:59 +01:00
Brian Foster	2f46c239fc	xfs: flush dirty pagecache over hole in zoned mode zero range For zoned filesystems a window exists between the first write to a sparse range (i.e. data fork hole) and writeback completion where we might spuriously observe holes in both the COW and data forks. This occurs because a buffered write populates the COW fork with delalloc, writeback submission removes the COW fork delalloc blocks and unlocks the inode, and then writeback completion remaps the physically allocated blocks into the data fork. If a zero range operation does a lookup during this window where both forks show a hole, it incorrectly reports a hole mapping for a range that contains data. This currently works because iomap checks for dirty pagecache over holes and unwritten mappings. If found, it flushes and retries the lookup. We plan to remove the hole flush logic from iomap, however, so lift the flush into xfs_zoned_buffered_write_iomap_begin() to preserve behavior and document the purpose for it. Zoned XFS filesystems don't support unwritten extents, so if zoned mode can come up with a way to close this transient hole window in the future, this flush can likely be removed. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2026-03-23 11:07:59 +01:00
Brian Foster	92e9dff9ca	xfs: fix iomap hole map reporting for zoned zero range The hole mapping logic for zero range in zoned mode is not quite correct. It currently reports a hole whenever one exists in the data fork. If the first write to a sparse range has completed and not yet written back, the blocks exist in the COW fork as delalloc until writeback completes, at which point they are allocated and mapped into the data fork. If a zero range occurs on a range that has not yet populated the data fork, we will incorrectly report it as a hole. Note that this currently functions correctly because we are bailed out by the pagecache flush in iomap_zero_range(). If a hole or unwritten mapping is reported with dirty pagecache, it assumes there is pending data, flushes to induce any pending block allocations/remaps, and retries the lookup. We want to remove this hack from iomap, however, so update iomap_begin() to only report a hole for zeroing when one exists in both forks. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2026-03-23 11:07:59 +01:00
Long Li	c6c56ff975	xfs: remove redundant validation in xlog_recover_attri_commit_pass2 Remove the redundant post-parse validation switch. By the time that block is reached, xfs_attri_validate() has already guaranteed all name lengths are non-zero via xfs_attri_validate_namelen(), and xfs_attri_validate_name_iovec() has already returned -EFSCORRUPTED for NULL names. For the REMOVE case, attr_value and value_len are structurally guaranteed to be NULL/zero because the parsing loop only populates them when value_len != 0. All checks in that switch are therefore dead code. Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Long Li <leo.lilong@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2026-03-23 11:00:08 +01:00
Long Li	d72f2084e3	xfs: fix ri_total validation in xlog_recover_attri_commit_pass2 The ri_total checks for SET/REPLACE operations are hardcoded to 3, but xfs_attri_item_size() only emits a value iovec when value_len > 0, so ri_total is 2 when value_len == 0. For PPTR_SET/PPTR_REMOVE/PPTR_REPLACE, value_len is validated by xfs_attri_validate() to be exactly sizeof(struct xfs_parent_rec) and is never zero, so their hardcoded checks remain correct. This problem may cause log recovery failures. The following script can be used to reproduce the problem: #!/bin/bash mkfs.xfs -f /dev/sda mount /dev/sda /mnt/test/ touch /mnt/test/file for i in {1..200}; do attr -s "user.attr_$i" -V "value_$i" /mnt/test/file > /dev/null done echo 1 > /sys/fs/xfs/debug/larp echo 1 > /sys/fs/xfs/sda/errortag/larp attr -s "user.zero" -V "" /mnt/test/file echo 0 > /sys/fs/xfs/sda/errortag/larp umount /mnt/test mount /dev/sda /mnt/test/ # mount failed Fix this by deriving the expected count dynamically as "2 + !!value_len" for SET/REPLACE operations. Cc: stable@vger.kernel.org # v6.9 Fixes: `ad206ae50e` ("xfs: check opcode and iovec count match in xlog_recover_attri_commit_pass2") Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Long Li <leo.lilong@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2026-03-23 11:00:08 +01:00
Long Li	b854e1c4ef	xfs: close crash window in attr dabtree inactivation When inactivating an inode with node-format extended attributes, xfs_attr3_node_inactive() invalidates all child leaf/node blocks via xfs_trans_binval(), but intentionally does not remove the corresponding entries from their parent node blocks. The implicit assumption is that xfs_attr_inactive() will truncate the entire attr fork to zero extents afterwards, so log recovery will never reach the root node and follow those stale pointers. However, if a log shutdown occurs after the leaf/node block cancellations commit but before the attr bmap truncation commits, this assumption breaks. Recovery replays the attr bmap intact (the inode still has attr fork extents), but suppresses replay of all cancelled leaf/node blocks, maybe leaving them as stale data on disk. On the next mount, xlog_recover_process_iunlinks() retries inactivation and attempts to read the root node via the attr bmap. If the root node was not replayed, reading the unreplayed root block triggers a metadata verification failure immediately; if it was replayed, following its child pointers to unreplayed child blocks triggers the same failure: XFS (pmem0): Metadata corruption detected at xfs_da3_node_read_verify+0x53/0x220, xfs_da3_node block 0x78 XFS (pmem0): Unmount and run xfs_repair XFS (pmem0): First 128 bytes of corrupted metadata buffer: 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00000030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00000040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00000050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00000060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00000070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ XFS (pmem0): metadata I/O error in "xfs_da_read_buf+0x104/0x190" at daddr 0x78 len 8 error 117 Fix this in two places: In xfs_attr3_node_inactive(), after calling xfs_trans_binval() on a child block, immediately remove the entry that references it from the parent node in the same transaction. This eliminates the window where the parent holds a pointer to a cancelled block. Once all children are removed, the now-empty root node is converted to a leaf block within the same transaction. This node-to-leaf conversion is necessary for crash safety. If the system shuts down after the empty node is written to the log but before the second-phase bmap truncation commits, log recovery will attempt to verify the root block on disk. xfs_da3_node_verify() does not permit a node block with count == 0; such a block will fail verification and trigger a metadata corruption shutdown. On the other hand, leaf blocks are allowed to have this transient state. In xfs_attr_inactive(), split the attr fork truncation into two explicit phases. First, truncate all extents beyond the root block (the child extents whose parent references have already been removed above). Second, invalidate the root block and truncate the attr bmap to zero in a single transaction. The two operations in the second phase must be atomic: as long as the attr bmap has any non-zero length, recovery can follow it to the root block, so the root block invalidation must commit together with the bmap-to-zero truncation. Fixes: `1da177e4c3` ("Linux-2.6.12-rc2") Cc: stable@vger.kernel.org Signed-off-by: Long Li <leo.lilong@huawei.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2026-03-23 10:47:28 +01:00
Long Li	e65bb55d7f	xfs: factor out xfs_attr3_leaf_init Factor out wrapper xfs_attr3_leaf_init function, which is exported for external use. Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Long Li <leo.lilong@huawei.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2026-03-23 10:47:28 +01:00
Long Li	ce4e789cf3	xfs: factor out xfs_attr3_node_entry_remove Factor out wrapper xfs_attr3_node_entry_remove function, which is exported for external use. Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Long Li <leo.lilong@huawei.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2026-03-23 10:47:28 +01:00
Long Li	e942498385	xfs: only assert new size for datafork during truncate extents The assertion functions properly because we currently only truncate the attr to a zero size. Any other new size of the attr is not preempted. Make this assertion specific to the datafork, preparing for subsequent patches to truncate the attribute to a non-zero size. Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Long Li <leo.lilong@huawei.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2026-03-23 10:47:27 +01:00
Lukas Herbolt	8da6fd0884	xfs: Use xarray to track SB UUIDs instead of plain array. Removing the plain array to track the UUIDs and switch to xarray makes it more readable. Signed-off-by: Lukas Herbolt <lukas@herbolt.com> [cem: remove unneeded return from xfs_uuid_unmount] Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2026-03-23 10:40:18 +01:00
Carlos Maiolino	2c0ff6151c	Merge branch 'xfs-7.1-merge' into for-next Signed-off-by: Carlos Maiolino <cem@kernel.org>	2026-03-18 11:04:17 +01:00
Damien Le Moal	c1f9554374	xfs: avoid unnecessary calculations in xfs_zoned_need_gc() If zonegc_low_space is set to zero (which is the default), the second condition in xfs_zoned_need_gc() that triggers GC never evaluates to true because the calculated threshold will always be 0. So there is no need to calculate the threshold and to evaluate that condition. Return early when zonegc_low_space is zero. While at it, add comments to document the intent of each of the 3 tests used to determine the return value to control the execution of garbage collection. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2026-03-18 10:08:07 +01:00
Damien Le Moal	68aa101bf2	xfs: display more zone related information in mountstats Modify xfs_zoned_show_stats() to add to the information displayed with /proc/self/mountstats the total number of zones (RT groups) and the number of open zones together with the maximum number of open zones. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2026-03-18 10:08:07 +01:00
Damien Le Moal	6a82a691b0	xfs: fix a comment typo in xfs_select_zone_nowait() Fix a typo in the comment describing the second call to xfs_select_open_zone_lru() in xfs_select_zone_nowait(). Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2026-03-18 10:08:07 +01:00
Damien Le Moal	770323d418	xfs: avoid unnecessary open zone check in xfs_select_zone_nowait() When xfs_select_zone_nowait() is called with pack_tight equal to true, the function xfs_select_open_zone_mru() is called if no open zone is returned by xfs_select_open_zone_lru(), that is, when oz is NULL. The open zone pointer return of xfs_select_zone_nowait() is then checked, but this check is outside of the "if (pack_tight)" that triggered the call to xfs_select_open_zone_mru(). In other words, this check is unnecessarily done even when pack_tight is false. Move the check for the return value of the call to xfs_select_open_zone_mru() inside the if that controls the call to this function, so that we do not uselessly test again the value of oz when pack_tight is false. No functional changes. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2026-03-18 09:54:39 +01:00
Bill Wendling	e5966096d0	xfs: annotate struct xfs_attr_list_context with __counted_by_ptr Add the `__counted_by_ptr` attribute to the `buffer` field of `struct xfs_attr_list_context`. This field is used to point to a buffer of size `bufsize`. The `buffer` field is assigned in: 1. `xfs_ioc_attr_list` in `fs/xfs/xfs_handle.c` 2. `xfs_xattr_list` in `fs/xfs/xfs_xattr.c` 3. `xfs_getparents` in `fs/xfs/xfs_handle.c` (implicitly initialized to NULL) In `xfs_ioc_attr_list`, `buffer` was assigned before `bufsize`. Reorder them to ensure `bufsize` is set before `buffer` is assigned, although no access happens between them. In `xfs_xattr_list`, `buffer` was assigned before `bufsize`. Reorder them to ensure `bufsize` is set before `buffer` is assigned. In `xfs_getparents`, `buffer` is NULL (from zero initialization) and remains NULL. `bufsize` is set to a non-zero value, but since `buffer` is NULL, no access occurs. In all cases, the pointer `buffer` is not accessed before `bufsize` is set. This patch was generated by CodeMender and reviewed by Bill Wendling. Tested by running xfstests. Signed-off-by: Bill Wendling <morbo@google.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2026-03-18 09:54:39 +01:00
Christoph Hellwig	0c98524ab2	xfs: cleanup buftarg handling in XFS_IOC_VERIFY_MEDIA The newly added XFS_IOC_VERIFY_MEDIA is a bit unusual in how it handles buftarg fields. Update it to be more in line with other XFS code: - use btp->bt_dev instead of btp->bt_bdev->bd_dev to retrieve the device number for tracing - use btp->bt_logical_sectorsize instead of bdev_logical_block_size(btp->bt_bdev) to retrieve the logical sector size - compare the buftarg and not the bdev to see if there is a separate log buftarg Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2026-03-18 09:52:33 +01:00
hongao	268378b6ad	xfs: scrub: unlock dquot before early return in quota scrub xchk_quota_item can return early after calling xchk_fblock_process_error. When that helper returns false, the function returned immediately without dropping dq->q_qlock, which can leave the dquot lock held and risk lock leaks or deadlocks in later quota operations. Fix this by unlocking dq->q_qlock before the early return. Signed-off-by: hongao <hongao@uniontech.com> Fixes: `7d1f0e167a` ("xfs: check the ondisk space mapping behind a dquot") Cc: <stable@vger.kernel.org> # v6.8 Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2026-03-18 09:44:46 +01:00
Yuto Ohnuki	7cac609473	xfs: refactor xfsaild_push loop into helper Factor the loop body of xfsaild_push() into a separate xfsaild_process_logitem() helper to improve readability. This is a pure code movement with no functional change. Signed-off-by: Yuto Ohnuki <ytohnuki@amazon.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2026-03-18 09:40:31 +01:00
Yuto Ohnuki	394d70b86f	xfs: save ailp before dropping the AIL lock in push callbacks In xfs_inode_item_push() and xfs_qm_dquot_logitem_push(), the AIL lock is dropped to perform buffer IO. Once the cluster buffer no longer protects the log item from reclaim, the log item may be freed by background reclaim or the dquot shrinker. The subsequent spin_lock() call dereferences lip->li_ailp, which is a use-after-free. Fix this by saving the ailp pointer in a local variable while the AIL lock is held and the log item is guaranteed to be valid. Reported-by: syzbot+652af2b3c5569c4ab63c@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=652af2b3c5569c4ab63c Fixes: `90c60e1640` ("xfs: xfs_iflush() is no longer necessary") Cc: stable@vger.kernel.org # v5.9 Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Yuto Ohnuki <ytohnuki@amazon.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2026-03-18 09:40:31 +01:00
Yuto Ohnuki	79ef34ec05	xfs: avoid dereferencing log items after push callbacks After xfsaild_push_item() calls iop_push(), the log item may have been freed if the AIL lock was dropped during the push. Background inode reclaim or the dquot shrinker can free the log item while the AIL lock is not held, and the tracepoints in the switch statement dereference the log item after iop_push() returns. Fix this by capturing the log item type, flags, and LSN before calling xfsaild_push_item(), and introducing a new xfs_ail_push_class trace event class that takes these pre-captured values and the ailp pointer instead of the log item pointer. Reported-by: syzbot+652af2b3c5569c4ab63c@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=652af2b3c5569c4ab63c Fixes: `90c60e1640` ("xfs: xfs_iflush() is no longer necessary") Cc: stable@vger.kernel.org # v5.9 Signed-off-by: Yuto Ohnuki <ytohnuki@amazon.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2026-03-18 09:40:31 +01:00
Yuto Ohnuki	4f24a767e3	xfs: stop reclaim before pushing AIL during unmount The unmount sequence in xfs_unmount_flush_inodes() pushed the AIL while background reclaim and inodegc are still running. This is broken independently of any use-after-free issues - background reclaim and inodegc should not be running while the AIL is being pushed during unmount, as inodegc can dirty and insert inodes into the AIL during the flush, and background reclaim can race to abort and free dirty inodes. Reorder xfs_unmount_flush_inodes() to stop inodegc and cancel background reclaim before pushing the AIL. Stop inodegc before cancelling m_reclaim_work because the inodegc worker can re-queue m_reclaim_work via xfs_inodegc_set_reclaimable. Reported-by: syzbot+652af2b3c5569c4ab63c@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=652af2b3c5569c4ab63c Fixes: `90c60e1640` ("xfs: xfs_iflush() is no longer necessary") Cc: stable@vger.kernel.org # v5.9 Signed-off-by: Yuto Ohnuki <ytohnuki@amazon.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2026-03-18 09:40:31 +01:00
Carlos Maiolino	01478f356f	xfs: opencode xfs_zone_record_blocks We only have a single caller, no need to keep it in its own function. Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> [hch: add zone_record_blocks trace back] Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2026-03-18 09:33:52 +01:00
Carlos Maiolino	3bdc20b005	xfs: factor out xfs_zone_inc_written Move the written blocks increment and full zone check into a new helper. Also add an assert to ensure rmap lock is held here. Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2026-03-12 08:49:36 +01:00
Carlos Maiolino	02a5d8993b	xfs: factor out xfs_dio_write_zoned_end_io Stop sharing direct IO end_io between regular and zoned devices by factoring out zoned dio end_io to its own function. Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2026-03-12 08:49:36 +01:00
Carlos Maiolino	db8367f63b	xfs: factor out isize updates from xfs_dio_write_end_io This is the only code needed for zoned inodes, so factor it out so we can move zoned inodes ioend to its own callback. Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2026-03-12 08:49:36 +01:00
Long Li	362c490980	xfs: fix integer overflow in bmap intent sort comparator xfs_bmap_update_diff_items() sorts bmap intents by inode number using a subtraction of two xfs_ino_t (uint64_t) values, with the result truncated to int. This is incorrect when two inode numbers differ by more than INT_MAX (2^31 - 1), which is entirely possible on large XFS filesystems. Fix this by replacing the subtraction with cmp_int(). Cc: <stable@vger.kernel.org> # v4.9 Fixes: `9f3afb57d5` ("xfs: implement deferred bmbt map/unmap operations") Signed-off-by: Long Li <leo.lilong@huawei.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2026-03-11 13:21:42 +01:00
Darrick J. Wong	52a8a1ba88	xfs: fix undersized l_iclog_roundoff values If the superblock doesn't list a log stripe unit, we set the incore log roundoff value to 512. This leads to corrupt logs and unmountable filesystems in generic/617 on a disk with 4k physical sectors... XFS (sda1): Mounting V5 Filesystem ff3121ca-26e6-4b77-b742-aaff9a449e1c XFS (sda1): Torn write (CRC failure) detected at log block 0x318e. Truncating head block from 0x3197. XFS (sda1): failed to locate log tail XFS (sda1): log mount/recovery failed: error -74 XFS (sda1): log mount failed XFS (sda1): Mounting V5 Filesystem ff3121ca-26e6-4b77-b742-aaff9a449e1c XFS (sda1): Ending clean mount ...on the current xfsprogs for-next which has a broken mkfs. xfs_info shows this... meta-data=/dev/sda1 isize=512 agcount=4, agsize=644992 blks = sectsz=4096 attr=2, projid32bit=1 = crc=1 finobt=1, sparse=1, rmapbt=1 = reflink=1 bigtime=1 inobtcount=1 nrext64=1 = exchange=1 metadir=1 data = bsize=4096 blocks=2579968, imaxpct=25 = sunit=0 swidth=0 blks naming =version 2 bsize=4096 ascii-ci=0, ftype=1, parent=1 log =internal log bsize=4096 blocks=16384, version=2 = sectsz=4096 sunit=0 blks, lazy-count=1 realtime =none extsz=4096 blocks=0, rtextents=0 = rgcount=0 rgsize=268435456 extents = zoned=0 start=0 reserved=0 ...observe that the log section has sectsz=4096 sunit=0, which means that the roundoff factor is 512, not 4096 as you'd expect. We should fix mkfs not to generate broken filesystems, but anyone can fuzz the ondisk superblock so we should be more cautious. I think the inadequate logic predates commit `a6a65fef5e`, but that's clearly going to require a different backport. Cc: stable@vger.kernel.org # v5.14 Fixes: `a6a65fef5e` ("xfs: log stripe roundoff is a property of the log") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2026-03-10 16:19:31 +01:00
Long Li	186ac39b8a	xfs: ensure dquot item is deleted from AIL only after log shutdown In xfs_qm_dqflush(), when a dquot flush fails due to corruption (the out_abort error path), the original code removed the dquot log item from the AIL before calling xfs_force_shutdown(). This ordering introduces a subtle race condition that can lead to data loss after a crash. The AIL tracks the oldest dirty metadata in the journal. The position of the tail item in the AIL determines the log tail LSN, which is the oldest LSN that must be preserved for crash recovery. When an item is removed from the AIL, the log tail can advance past the LSN of that item. The race window is as follows: if the dquot item happens to be at the tail of the log, removing it from the AIL allows the log tail to advance. If a concurrent log write is sampling the tail LSN at the same time and subsequently writes a complete checkpoint (i.e., one containing a commit record) to disk before the shutdown takes effect, the journal will no longer protect the dquot's last modification. On the next mount, log recovery will not replay the dquot changes, even though they were never written back to disk, resulting in silent data loss. Fix this by calling xfs_force_shutdown() before xfs_trans_ail_delete() in the out_abort path. Once the log is shut down, no new log writes can complete with an updated tail LSN, making it safe to remove the dquot item from the AIL. Cc: stable@vger.kernel.org Fixes: `b707fffda6` ("xfs: abort consistently on dquot flush failure") Signed-off-by: Long Li <leo.lilong@huawei.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2026-03-10 09:40:38 +01:00
Long Li	f1d77b863b	xfs: remove redundant set null for ip->i_itemp ip->i_itemp has been set null in xfs_inode_item_destroy(), so there is no need set it null again in xfs_inode_free_callback(). Signed-off-by: Long Li <leo.lilong@huawei.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2026-03-10 09:39:53 +01:00
Carlos Maiolino	54fcd2f95f	xfs: fix returned valued from xfs_defer_can_append xfs_defer_can_append returns a bool, it shouldn't be returning a NULL. Found by code inspection. Fixes: `4dffb2cbb4` ("xfs: allow pausing of pending deferred work items") Cc: <stable@vger.kernel.org> # v6.8 Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Acked-by: Souptick Joarder <souptick.joarder@hpe.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2026-03-06 09:30:07 +01:00
hongao	281cb17787	xfs: Remove redundant NULL check after __GFP_NOFAIL kzalloc() is called with __GFP_NOFAIL, so a NULL return is not expected. Drop the redundant !map check in xfs_dabuf_map(). Also switch the nirecs-sized allocation to kcalloc(). Signed-off-by: hongao <hongao@uniontech.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2026-03-05 10:02:45 +01:00
Darrick J. Wong	0ca1a8331c	xfs: fix race between healthmon unmount and read_iter xfs/1879 on one of my test VMs got stuck due to the xfs_io healthmon subcommand sleeping in wait_event_interruptible at: xfs_healthmon_read_iter+0x558/0x5f8 [xfs] vfs_read+0x248/0x320 ksys_read+0x78/0x120 Looking at xfs_healthmon_read_iter, in !O_NONBLOCK mode it will sleep until the mount cookie == DETACHED_MOUNT_COOKIE, there are events waiting to be formatted, or there are formatted events in the read buffer that could be copied to userspace. Poking into the running kernel, I see that there are zero events in the list, the read buffer is empty, and the mount cookie is indeed in DETACHED state. IOWs, xfs_healthmon_has_eventdata should have returned true, but instead we're asleep waiting for a wakeup. I think what happened here is that xfs_healthmon_read_iter and xfs_healthmon_unmount were racing with each other, and _read_iter lost the race. _unmount queued an unmount event, which woke up _read_iter. It found, formatted, and copied the event out to userspace. That cleared out the pending event list and emptied the read buffer. xfs_io then called read() again, so _has_eventdata decided that we should sleep on the empty event queue. Next, _unmount called xfs_healthmon_detach, which set the mount cookie to DETACHED. Unfortunately, it didn't call wake_up_all on the hm, so the wait_event_interruptible in the _read_iter thread remains asleep. That's why the test stalled. Fix this by moving the wake_up_all call to xfs_healthmon_detach. Fixes: `b3a289a2a9` ("xfs: create event queuing, formatting, and discovery infrastructure") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2026-03-04 10:11:47 +01:00
Wilfred Mallawa	c6ce65cb17	xfs: add write pointer to xfs_rtgroup_geometry There is currently no XFS ioctl that allows userspace to retrieve the write pointer for a specific realtime group block for zoned XFS. On zoned block devices, userspace can obtain this information via zone reports from the underlying device. However, for zoned XFS operating on regular block devices, no equivalent mechanism exists. Access to the realtime group write pointer is useful to userspace development and analysis tools such as Zonar [1]. So extend the existing struct xfs_rtgroup_geometry to add a new rg_writepointer field. This field is valid if XFS_RTGROUP_GEOM_WRITEPOINTER flag is set. The rg_writepointer field specifies the location of the current writepointer as a block offset into the respective rtgroup. [1] https://lwn.net/Articles/1059364/ Signed-off-by: Wilfred Mallawa <wilfred.mallawa@wdc.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2026-03-04 09:58:24 +01:00
Damien Le Moal	6270b8ac2f	xfs: remove scratch field from struct xfs_gc_bio The scratch field in struct xfs_gc_bio is unused. Remove it. Fixes: `102f444b57` ("xfs: rework zone GC buffer management") Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2026-03-04 09:34:12 +01:00
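
Note on commit 362c490980 ("xfs: fix integer overflow in bmap intent sort comparator"): the message describes a comparator that subtracts two 64-bit inode numbers and truncates the result to int. The following is a minimal sketch of that failure mode and of an overflow-free three-way comparison in the spirit of cmp_int(). It is illustrative userspace C, not the kernel code: the names ino64_t, cmp_by_subtraction, cmp_three_way, and check_ordering are hypothetical, and the narrowing conversion in the buggy variant is implementation-defined (common compilers keep the low 32 bits).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a 64-bit unsigned inode number type. */
typedef uint64_t ino64_t;

/*
 * Buggy pattern: the 64-bit difference is narrowed to int, so the sign
 * of the result can be wrong (or zero) when the values are far apart.
 */
static int cmp_by_subtraction(ino64_t a, ino64_t b)
{
	return (int)(a - b);
}

/* Overflow-free three-way comparison: yields -1, 0, or 1. */
static int cmp_three_way(ino64_t a, ino64_t b)
{
	return (a > b) - (a < b);
}

/* Self-check using inode numbers that differ by exactly 2^32. */
static void check_ordering(void)
{
	ino64_t small = 1;
	ino64_t large = small + ((ino64_t)1 << 32);

	assert(cmp_three_way(large, small) > 0);
	assert(cmp_three_way(small, large) < 0);
	assert(cmp_three_way(small, small) == 0);
}
```

With typical compilers, cmp_by_subtraction() applied to the two values in check_ordering() keeps only the low 32 bits of the difference and so reports the two distinct inode numbers as equal, which is the kind of misordering the commit message warns about.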

1426966 Commits