Commit Graph

103189 Commits

Author SHA1 Message Date
Darrick J. Wong
bd3138e891 xfs: fix remote xattr valuelblk check
In debugging other problems with generic/753, it turns out that it's
possible for the system to go down in the middle of a remote xattr set
operation such that the leaf block entry is marked incomplete and
valueblk is set to zero.  Make this no longer a failure.

Cc: <stable@vger.kernel.org> # v4.15
Fixes: 13791d3b83 ("xfs: scrub extended attribute leaf space")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2026-01-23 09:27:33 -08:00
Darrick J. Wong
6fed827044 xfs: fix the xattr scrub to detect freemap/entries array collisions
In the previous patches, we observed that it's possible for there to be
freemap entries with zero size but a nonzero base.  This isn't an
inconsistency per se, but older kernels can get confused by this and
corrupt the block.

If we see this, flag the xattr structure for optimization so that it
gets rebuilt.

Cc: <stable@vger.kernel.org> # v4.15
Fixes: 13791d3b83 ("xfs: scrub extended attribute leaf space")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2026-01-23 09:27:33 -08:00
Darrick J. Wong
27a0c41f33 xfs: strengthen attr leaf block freemap checking
Check for erroneous overlapping freemap regions and collisions between
freemap regions and the xattr leaf entry array.

Note that we must explicitly zero out the extra freemaps in
xfs_attr3_leaf_compact so that the in-memory buffer has a correctly
initialized freemap array to satisfy the new verification code, even if
subsequent code changes the contents before unlocking the buffer.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2026-01-23 09:27:32 -08:00
Darrick J. Wong
a165f7e763 xfs: refactor attr3 leaf table size computation
Replace all the open-coded callsites with a single static inline helper.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2026-01-23 09:27:31 -08:00
Darrick J. Wong
3eefc0c2b7 xfs: fix freemap adjustments when adding xattrs to leaf blocks
xfs/592 and xfs/794 both trip this assertion in the leaf block freemap
adjustment code after ~20 minutes of running on my test VMs:

 ASSERT(ichdr->firstused >= ichdr->count * sizeof(xfs_attr_leaf_entry_t)
					+ xfs_attr3_leaf_hdr_size(leaf));

Upon enabling quite a lot more debugging code, I narrowed this down to
fsstress trying to set a local extended attribute with namelen=3 and
valuelen=71.  This results in an entry size of 80 bytes.

At the start of xfs_attr3_leaf_add_work, the freemap looks like this:

i 0 base 448 size 0 rhs 448 count 46
i 1 base 388 size 132 rhs 448 count 46
i 2 base 2120 size 4 rhs 448 count 46
firstused = 520

where "rhs" is the first byte past the end of the leaf entry array.
This is inconsistent -- the entries array ends at byte 448, but
freemap[1] says there's free space starting at byte 388!

By the end of the function, the freemap is in worse shape:

i 0 base 456 size 0 rhs 456 count 47
i 1 base 388 size 52 rhs 456 count 47
i 2 base 2120 size 4 rhs 456 count 47
firstused = 440

Important note: 388 is not aligned with the entries array element size
of 8 bytes.

Based on the incorrect freemap, the name area starts at byte 440, which
is below the end of the entries array!  That's why the assertion
triggers and the filesystem shuts down.
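The arithmetic behind the failed assertion can be checked with a small
userspace sketch.  This is not the kernel code: the function names are
hypothetical, and the 80-byte header size is an assumption inferred from
the dump above (rhs 448 minus 46 entries of 8 bytes each).

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_ENTRY_SIZE 8u   /* entries array element size, per the note above */
#define MODEL_HDR_SIZE   80u  /* assumption: inferred as 448 - 46 * 8 */

/* First byte past the end of the leaf entry array ("rhs" in the dump). */
static unsigned int model_entries_end(unsigned int count)
{
	return count * MODEL_ENTRY_SIZE + MODEL_HDR_SIZE;
}

/* Same shape as the ASSERT: the name area must start at or above rhs. */
static bool model_firstused_ok(unsigned int firstused, unsigned int count)
{
	return firstused >= model_entries_end(count);
}
```

With count = 47 the entries array ends at byte 456, so a firstused of
440 (520 minus the 80-byte entry) fails the check.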

How did we end up here?  First, recall from the previous patch that the
freemap array in an xattr leaf block is not intended to be a
comprehensive map of all free space in the leaf block.  In other words,
it's perfectly legal to have a leaf block with:

 * 376 bytes in use by the entries array
 * freemap[0] has [base = 376, size = 8]
 * freemap[1] has [base = 388, size = 1500]
 * the space between 376 and 388 is free, but the freemap stopped
   tracking that some time ago

If we add one xattr, the entries array grows to 384 bytes, and
freemap[0] becomes [base = 384, size = 0].  So far, so good.  But if we
add a second xattr, the entries array grows to 392 bytes, and freemap[0]
gets pushed up to [base = 392, size = 0].  This is bad, because
freemap[1] hasn't been updated, and now the entries array and the free
space claim the same space.

The fix here is to adjust all freemap entries so that none of them
collide with the entries array.  Note that this fix relies on commit
2a2b5932db ("xfs: fix attr leaf header freemap.size underflow") and
the previous patch that resets zero length freemap entries to have
base = 0.
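As a rough illustration of the idea (not the actual patch), the sketch
below models a three-entry freemap in userspace and trims every region
that would collide with the grown entries array.  The struct, the
function name, and the trimming policy are all assumptions made for the
example.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_NFREEMAPS  3
#define MODEL_ENTRY_SIZE 8

struct model_freemap {
	uint16_t base;
	uint16_t size;
};

/*
 * Hypothetical model: grow the entries array by one entry and adjust
 * every freemap region so that none of them overlaps the new end.
 */
static void model_add_entry(struct model_freemap map[MODEL_NFREEMAPS],
			    uint16_t *entries_end)
{
	uint16_t new_end = *entries_end + MODEL_ENTRY_SIZE;

	for (int i = 0; i < MODEL_NFREEMAPS; i++) {
		uint16_t end = map[i].base + map[i].size;

		if (map[i].size == 0) {
			/* zero-length entries carry no information */
			map[i].base = 0;
		} else if (map[i].base >= new_end) {
			/* no collision with the entries array */
			continue;
		} else if (end <= new_end) {
			/* region swallowed entirely by the entries array */
			map[i].base = 0;
			map[i].size = 0;
		} else {
			/* region partially overlapped: trim its start */
			map[i].base = new_end;
			map[i].size = end - new_end;
		}
	}
	*entries_end = new_end;
}
```

Starting from the example above (376 bytes of entries, freemap[0] =
[376, 8], freemap[1] = [388, 1500]), two calls leave freemap[1] at
[392, 1496] instead of overlapping the entries array.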

Cc: <stable@vger.kernel.org> # v2.6.12
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2026-01-23 09:27:31 -08:00
Darrick J. Wong
6f13c1d2a6 xfs: delete attr leaf freemap entries when empty
Back in commit 2a2b5932db ("xfs: fix attr leaf header freemap.size
underflow"), Brian Foster observed that it's possible for a small
freemap at the end of the xattr entries array to experience
a size underflow when subtracting the space consumed by an expansion of
the entries array.  There are only three freemap entries, which means
that it is not a complete index of all free space in the leaf block.

This code can leave behind a zero-length freemap entry with a nonzero
base.  Subsequent setxattr operations can increase the base up to the
point that it overlaps with another freemap entry.  This isn't in and of
itself a problem because the code in _leaf_add that finds free space
ignores any freemap entry with zero size.

However, there's another bug in the freemap update code in _leaf_add,
which is that it fails to update a freemap entry that begins midway
through the xattr entry that was just appended to the array.  That can
result in the freemap containing two entries with the same base but
different sizes (0 for the "pushed-up" entry, nonzero for the entry
that's actually tracking free space).  A subsequent _leaf_add can then
allocate xattr namevalue entries on top of the entries array, leading to
data loss.  But fixing that is for later.

For now, eliminate the possibility of confusion by zeroing out the base
of any freemap entry that has zero size.  Because the freemap is not
intended to be a complete index of free space, a subsequent failure to
find any free space for a new xattr will trigger block compaction, which
regenerates the freemap.

It looks like this bug has been in the codebase for quite a long time.

Cc: <stable@vger.kernel.org> # v2.6.12
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2026-01-23 09:27:30 -08:00
Wenwu Hou
a1ca658d64 xfs: fix incorrect context handling in xfs_trans_roll
The memalloc_nofs_save() and memalloc_nofs_restore() calls are
incorrectly paired in xfs_trans_roll.

Call path:
xfs_trans_alloc()
    __xfs_trans_alloc()
	// tp->t_pflags = memalloc_nofs_save();
	xfs_trans_set_context()
...
xfs_defer_trans_roll()
    xfs_trans_roll()
        xfs_trans_dup()
            // old_tp->t_pflags = 0;
            xfs_trans_switch_context()
        __xfs_trans_commit()
            xfs_trans_free()
                // memalloc_nofs_restore(tp->t_pflags);
                xfs_trans_clear_context()

The code passes 0 to memalloc_nofs_restore() when committing the original
transaction, but memalloc_nofs_restore() should always receive the
flags returned from the paired memalloc_nofs_save() call.

Before commit 3f6d5e6a46 ("mm: introduce memalloc_flags_{save,restore}"),
calling memalloc_nofs_restore(0) would unset the PF_MEMALLOC_NOFS flag,
which could cause memory allocation deadlocks[1].
Fortunately, after that commit, memalloc_nofs_restore(0) does nothing,
so this issue is currently harmless.
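The difference between the two behaviours can be modelled in userspace.
The sketch below is a simplified stand-in, not the kernel helpers: the
model_* names, the flag value, and the global task_flags variable are
assumptions used only to show why passing 0 to the restore side used to
matter and no longer does.

```c
#include <assert.h>

#define MODEL_NOFS 0x1u          /* stand-in for PF_MEMALLOC_NOFS */

static unsigned int task_flags;  /* stand-in for current->flags */

/* Older-style model: save returns the previous bit, restore writes it back. */
static unsigned int model_old_save(void)
{
	unsigned int old = task_flags & MODEL_NOFS;

	task_flags |= MODEL_NOFS;
	return old;
}

static void model_old_restore(unsigned int saved)
{
	task_flags = (task_flags & ~MODEL_NOFS) | saved;
}

/* Newer-style model: save returns the bits it newly set, restore clears them. */
static unsigned int model_new_save(void)
{
	unsigned int newly_set = ~task_flags & MODEL_NOFS;

	task_flags |= MODEL_NOFS;
	return newly_set;
}

static void model_new_restore(unsigned int saved)
{
	task_flags &= ~saved;
}
```

In the older model, restore(0) clears the flag while the rolled
transaction still depends on it; in the newer model, restore(0) leaves
the flag alone and only the value returned by the paired save clears it.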

Fixes: 756b1c3433 ("xfs: use current->journal_info for detecting transaction recursion")
Link: https://lore.kernel.org/linux-xfs/20251104131857.1587584-1-leo.lilong@huawei.com [1]
Signed-off-by: Wenwu Hou <hwenwur@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-01-21 12:59:10 +01:00
Hans Holmberg
01a2896154 xfs: always allocate the free zone with the lowest index
Zones in the beginning of the address space are typically mapped to
higher bandwidth tracks on HDDs than those at the end of the address
space. So, instead of allocating zones "round robin" across the whole
address space, always allocate the zone with the lowest index.
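A minimal sketch of the two selection policies, assuming the free zones
are tracked in a simple boolean array.  The function names and the
array representation are hypothetical and do not reflect how XFS
actually tracks free zones.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Lowest-index policy: return the first free zone, or -1 if none is free. */
static long model_pick_lowest_free_zone(const bool *zone_free, size_t nr_zones)
{
	for (size_t i = 0; i < nr_zones; i++) {
		if (zone_free[i])
			return (long)i;
	}
	return -1;
}

/* Round-robin policy, for comparison: search from a cursor and wrap. */
static long model_pick_round_robin(const bool *zone_free, size_t nr_zones,
				   size_t cursor)
{
	for (size_t n = 0; n < nr_zones; n++) {
		size_t i = (cursor + n) % nr_zones;

		if (zone_free[i])
			return (long)i;
	}
	return -1;
}
```

With zones 1, 4 and 5 free and a cursor at 3, round robin picks zone 4
while the lowest-index policy picks zone 1.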

This increases average write bandwidth for overwrite workloads
when less than the full capacity is being used. At ~50% utilization
this improves bandwidth for a random file overwrite benchmark
with 128MiB files and 256MiB zone capacity by 30%.

Running the same benchmark with small 2-8 MiB files at 67% capacity
shows no significant difference in performance. Due to heavy
fragmentation the whole zone range is in use, greatly limiting the
number of free zones with high bw.

Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-01-21 12:57:17 +01:00
Darrick J. Wong
4d6d335ea9 xfs: promote metadata directories and large block support
Large block support was merged upstream in 6.12 (Dec 2024) and metadata
directories were merged in 6.13 (Jan 2025).  We've not received any
serious complaints about the ondisk formats of these two features in the
past year, so let's remove the experimental warnings.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-01-21 12:57:17 +01:00
Christoph Hellwig
12d12dcc15 xfs: use blkdev_get_zone_info to simplify zone reporting
Unwind the callback based programming model by querying the cached
zone information using blkdev_get_zone_info.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-01-21 12:57:17 +01:00
Christoph Hellwig
b37c1e4e9a xfs: check that used blocks are smaller than the write pointer
Any used block must have been written, so reject used blocks > write
pointer.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-01-21 12:57:17 +01:00
Christoph Hellwig
19c5b6051e xfs: split and refactor zone validation
Currently xfs_zone_validate mixes validating the software zone state in
the XFS realtime group with validating the hardware state reported in
struct blk_zone and deriving the write pointer from that.

Move all code that works on the realtime group to xfs_init_zone, and only
keep the hardware state validation in xfs_zone_validate.  This makes the
code more clear, and allows for better reuse in userspace.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-01-21 12:57:17 +01:00
Christoph Hellwig
776b76f754 xfs: pass the write pointer to xfs_init_zone
Move the two methods to query the write pointer out of xfs_init_zone into
the callers, so that xfs_init_zone doesn't have to bother with the
blk_zone structure and instead operates purely at the XFS realtime group
level.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-01-21 12:57:17 +01:00
Christoph Hellwig
fc633b5c5b xfs: add a xfs_rtgroup_raw_size helper
Add a helper to figure the on-disk size of a group, accounting for the
XFS_SB_FEAT_INCOMPAT_ZONE_GAPS feature if needed.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-01-21 12:57:17 +01:00
Damien Le Moal
41263267ef xfs: add missing forward declaration in xfs_zones.h
Add the missing forward declaration for struct blk_zone in xfs_zones.h.
This avoids headaches with the order of header file inclusion to avoid
compilation errors.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-01-21 12:57:16 +01:00
Christoph Hellwig
3a65ea768b xfs: remove xfs_attr_leaf_hasname
The calling convention of xfs_attr_leaf_hasname() is problematic, because
it returns a NULL buffer when xfs_attr3_leaf_read fails, a valid buffer
when xfs_attr3_leaf_lookup_int returns -ENOATTR or -EEXIST, and a
non-NULL buffer pointer for an already released buffer when
xfs_attr3_leaf_lookup_int fails with other error values.

Fix this by simply open coding xfs_attr_leaf_hasname in the callers, so
that the buffer release code is done by each caller of
xfs_attr3_leaf_read.

Cc: stable@vger.kernel.org # v5.19+
Fixes: 07120f1abd ("xfs: Add xfs_has_attr and subroutines")
Reported-by: Mark Tinguely <mark.tinguely@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-01-21 12:57:16 +01:00
Darrick J. Wong
f39854a3fb xfs: mark data structures corrupt on EIO and ENODATA
I learned a few things this year: first, blk_status_to_errno can return
ENODATA for critical media errors; and second, the scrub code doesn't
mark data structures as corrupt on ENODATA or EIO.

Currently, scrub failing to capture these errors isn't all that
impactful -- the checking code will exit to userspace with EIO/ENODATA,
and xfs_scrub will log a complaint and exit with nonzero status.  Most
people treat fsck tools failing as a sign that the fs is corrupt, but
online fsck should mark the metadata bad and keep moving.
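The intended behaviour can be sketched as follows.  This is a
hypothetical userspace model rather than the scrub code itself: the
flag name, the function name, and its return convention are assumptions
for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define MODEL_OFLAG_CORRUPT 0x1u  /* hypothetical "structure is corrupt" flag */

/*
 * Returns true when there was no error or the error was recorded as
 * corruption (and cleared), so the caller can move on to the next
 * object.  Returns false for errors that must be passed to the caller.
 */
static bool model_scrub_absorb_error(int *error, unsigned int *oflags)
{
	switch (*error) {
	case 0:
		return true;
	case -EIO:
	case -ENODATA:
		/* media errors: mark the structure bad and keep going */
		*oflags |= MODEL_OFLAG_CORRUPT;
		*error = 0;
		return true;
	default:
		return false;
	}
}
```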

Cc: stable@vger.kernel.org # v4.15
Fixes: 4700d22980 ("xfs: create helpers to record and deal with scrub problems")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-01-21 12:57:16 +01:00
Christoph Hellwig
102f444b57 xfs: rework zone GC buffer management
The double buffering where just one scratch area is used at a time does
not efficiently use the available memory.  It was originally implemented
when GC I/O could happen out of order, but that was removed before
upstream submission to avoid fragmentation.  Now that all GC I/Os are
processed in order, just use a number of buffers as a simple ring buffer.
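A minimal sketch of the in-order ring idea, assuming a fixed number of
equally sized buffers.  The struct, the buffer count, and the function
names are hypothetical and are not taken from the zone GC code.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_NR_BUFS 4u  /* hypothetical number of GC buffers */

struct model_gc_ring {
	unsigned int head;  /* next buffer to hand out */
	unsigned int tail;  /* oldest buffer still in flight */
	unsigned int used;  /* buffers currently in flight */
};

/* Hand out the next buffer index, or -1 if all buffers are in flight. */
static int model_ring_get(struct model_gc_ring *r)
{
	unsigned int idx;

	if (r->used == MODEL_NR_BUFS)
		return -1;
	idx = r->head;
	r->head = (r->head + 1) % MODEL_NR_BUFS;
	r->used++;
	return (int)idx;
}

/* Release the oldest buffer; valid because I/Os complete in order. */
static bool model_ring_put(struct model_gc_ring *r)
{
	if (r->used == 0)
		return false;
	r->tail = (r->tail + 1) % MODEL_NR_BUFS;
	r->used--;
	return true;
}
```

Because completions arrive in submission order, releasing always frees
the buffer at the tail, so every buffer can be in use at once instead of
only one scratch area at a time.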

For a synthetic benchmark that fills 256MiB HDD zones and punches out
holes to free half the space this leads to a decrease of GC time by
a little more than 25%.

Thanks to Hans Holmberg <hans.holmberg@wdc.com> for testing and
benchmarking.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-01-21 12:57:16 +01:00
Christoph Hellwig
0506d32f7c xfs: use bio_reuse in the zone GC code
Replace our somewhat fragile code to reuse the bio, which caused a
regression in the past, with the block layer bio_reuse helper.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-01-21 12:57:16 +01:00
Christoph Hellwig
cf9b52fa7d xfs: directly include xfs_platform.h
The xfs.h header conflicts with the public xfs.h in xfsprogs, leading
to a spurious difference in all shared libxfs files that have to
include libxfs_priv.h in userspace.  Directly include xfs_platform.h so
that we can add a header of the same name to xfsprogs and remove this
major annoyance for the shared code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-01-21 12:57:16 +01:00
Christoph Hellwig
19a46f1246 xfs: move the remaining content from xfs.h to xfs_platform.h
Move the global defines from xfs.h to xfs_platform.h to prepare for
removing xfs.h.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-01-21 12:57:16 +01:00
Christoph Hellwig
501a5161d2 xfs: include global headers first in xfs_platform.h
Ensure we have all kernel headers included by the time we do our own
thing, just like the rest of the tree.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-01-21 12:57:16 +01:00
Christoph Hellwig
971ffb6341 xfs: rename xfs_linux.h to xfs_platform.h
Rename xfs_linux.h to prepare for including it directly
from source files including those shared with xfsprogs.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-01-21 12:57:16 +01:00
Christoph Hellwig
a10b44cf10 xfs: factor out a xlog_write_space_advance helper
Add a new xlog_write_space_advance that returns the current place in the
iclog that data is written to, and advances the various counters by the
amount taken from xlog_write_iovec, and also use it in xlog_write_partial,
which open codes the counter adjustments, but misses the asserts.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-01-21 12:57:16 +01:00
Christoph Hellwig
e2663443da xfs: improve the iclog space assert in xlog_write_iovec
We need enough space for the length we copy into the iclog, not just
some space, so tighten up the check a bit.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-01-21 12:57:16 +01:00
Christoph Hellwig
865970d49a xfs: add a xlog_write_space_left helper
Various places check how much space is left in the current iclog,
add a helper for that.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-01-21 12:57:16 +01:00
Christoph Hellwig
a3eb1f9cf8 xfs: improve the calling convention for the xlog_write helpers
The xlog_write chain passes around the same seven variables that are
often passed by reference. Add a xlog_write_data structure to contain
them to improve code generation and readability.

This change increases the generated code size by about 140 bytes for my
x86_64 build, which is hopefully worth the much easier to follow code:

$ size fs/xfs/xfs_log.o*
   text	   data	    bss	    dec	    hex	filename
  29300	   1730	    176	  31206	   79e6	fs/xfs/xfs_log.o
  29160	   1730	    176	  31066	   795a	fs/xfs/xfs_log.o.old

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-01-21 12:57:16 +01:00
Christoph Hellwig
a82d7aac75 xfs: regularize iclog space accounting in xlog_write_partial
When xlog_write_partial splits a log region over multiple iclogs, it
has to include the continuation ophder in the length requested for the
new iclog.  Currently it simply adds that to the request, which makes
the accounting of the used space below look slightly different from the
other users of iclog space that decrement it.

To prepare for more code sharing, add the ophdr size to the len variable
that tracks the number of bytes still left in this xlog_write
operation before calling xlog_write_get_more_iclog_space, and then
decrement it later when consuming that space.

This changes the value of len when xlog_write_get_more_iclog_space
returns an error, but as nothing looks at len in that case the
difference doesn't matter.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-01-21 12:57:16 +01:00
Christoph Hellwig
2499d91180 xfs: move struct xfs_log_vec to xfs_log_priv.h
The log_vec is a private type for the log/CIL code and should not be
exposed to anything else.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-01-21 12:57:16 +01:00
Christoph Hellwig
0274105914 xfs: move struct xfs_log_iovec to xfs_log_priv.h
This structure is now only used by the core logging and CIL code.

Also remove the unused typedef.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-01-21 12:57:16 +01:00
Christoph Hellwig
8e76253443 xfs: improve the ->iop_format interface
Export a higher level interface to format log items.  The xlog_format_buf
structure is hidden inside xfs_log_cil.c and only accessed using two
helpers (and a wrapper built on top), hiding details of log iovecs from
the log items.  This also allows simply using an index into lv_iovecp
instead of keeping a cursor vec.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-01-21 12:57:16 +01:00
Christoph Hellwig
c53fbeedbe xfs: set lv_bytes in xlog_write_one_vec
lv_bytes is mostly just used by the CIL code, but has crept into the
low-level log writing code to decide on a full or partial iclog
write.  Ensure it is valid even for the special log writes that don't
go through the CIL by initializing it in xlog_write_one_vec.

Note that even without this fix, the checkpoint commits would never
trigger a partial iclog write, as they have no payload beyond the
opheader.

The unmount record on the other hand could in theory trigger an
overflow of the iclog, but given that it has never been seen in
the wild this has probably been masked by the small size of it
and the fact that the unmount process does multiple log forces
before writing the unmount record and we thus usually operate on
an empty or almost empty iclog.

Fixes: 110dc24ad2 ("xfs: log vector rounding leaks log space")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-01-21 12:57:16 +01:00
Christoph Hellwig
2d4521e4c0 xfs: add a xlog_write_one_vec helper
Add a wrapper for xlog_write for the two callers who need to build a
log_vec and add it to a single-entry chain instead of duplicating the
code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-01-21 12:57:16 +01:00
Linus Torvalds
f8907398a6 Merge tag 'ext4_for_linus-6.19-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 fixes from Ted Ts'o:

 - Fix an inconsistency in structure size on 32-bit platforms caused by
   padding differences for the new EXT4_IOC_[GS]ET_TUNE_SB_PARAM ioctls

 - Fix a buffer leak on the error path when dropping the refcount of an
   xattr value stored in an inode

 - Fix missing locking on the error path for the file defragmentation
   ioctl leading to a BUG

* tag 'ext4_for_linus-6.19-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: fix iloc.bh leak in ext4_xattr_inode_update_ref
  ext4: add missing down_write_data_sem in mext_move_extent().
  ext4: fix ext4_tune_sb_params padding
2026-01-18 14:01:20 -08:00
Yang Erkun
d250bdf531 ext4: fix iloc.bh leak in ext4_xattr_inode_update_ref
The error branch for ext4_xattr_inode_update_ref forgets to release the
refcount for iloc.bh. Found during code review.

Fixes: 57295e8354 ("ext4: guard against EA inode refcount underflow in xattr update")
Signed-off-by: Yang Erkun <yangerkun@huawei.com>
Reviewed-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20251213055706.3417529-1-yangerkun@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
2026-01-18 11:23:10 -05:00
Julian Sun
0ef7ef4227 ext4: add missing down_write_data_sem in mext_move_extent().
Commit 962e8a01ea ("ext4: introduce mext_move_extent()") attempts to
call ext4_swap_extents() on the failure path to recover the swapped
extents, but fails to acquire locks for the two inode->i_data_sem,
triggering the BUG_ON statement in ext4_swap_extents().

This issue can be fixed by calling ext4_double_down_write_data_sem()
before ext4_swap_extents().

Signed-off-by: Julian Sun <sunjunchao@bytedance.com>
Reported-by: syzbot+4ea6bd8737669b423aae@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/69368649.a70a0220.38f243.0093.GAE@google.com/
Fixes: 962e8a01ea ("ext4: introduce mext_move_extent()")
Reviewed-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Link: https://patch.msgid.link/20251208123713.1971068-1-sunjunchao@bytedance.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2026-01-18 11:23:10 -05:00
Linus Torvalds
e84d960149 Merge tag 'for-6.19-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:

 - with large folios in use, fix partial incorrect update of a reflinked
   range

 - fix potential deadlock in iget when lookup fails and eviction is
   needed

 - in send, validate inline extent type while detecting file holes

 - fix memory leak after an error when creating a space info

 - remove zone statistics from sysfs again, the output size limitations
   make it unusable, we'll do it in another way in another release

 - test fixes:
     - return proper error codes from block remapping tests
     - fix tree root leaks in qgroup tests after errors

* tag 'for-6.19-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: remove zoned statistics from sysfs
  btrfs: fix memory leaks in create_space_info() error paths
  btrfs: invalidate pages instead of truncate after reflinking
  btrfs: update the Kconfig string for CONFIG_BTRFS_EXPERIMENTAL
  btrfs: send: check for inline extents in range_is_hole_in_parent()
  btrfs: tests: fix return 0 on rmap test failure
  btrfs: tests: fix root tree leak in btrfs_test_qgroups()
  btrfs: release path before iget_failed() in btrfs_read_locked_inode()
2026-01-17 19:29:32 -08:00
Linus Torvalds
353c6f43ab Merge tag 'xfs-fixes-6.19-rc6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fixes from Carlos Maiolino:
 "Just a few obvious fixes and some 'cosmetic' changes"

* tag 'xfs-fixes-6.19-rc6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: set max_agbno to allow sparse alloc of last full inode chunk
  xfs: Fix xfs_grow_last_rtg()
  xfs: improve the assert at the top of xfs_log_cover
  xfs: fix an overly long line in xfs_rtgroup_calc_geometry
  xfs: mark __xfs_rtgroup_extents static
  xfs: Fix the return value of xfs_rtcopy_summary()
  xfs: fix memory leak in xfs_growfs_check_rtgeom()
2026-01-16 09:09:41 -08:00
Linus Torvalds
603c05a163 Merge tag 'nfs-for-6.19-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client fixes from Trond Myklebust:

 - Fix another deadlock involving nfs_release_folio()

 - localio:
     - Stop I/O upon hitting a fatal error
     - Deal with page offsets that are > PAGE_SIZE

 - Fix size read races in truncate, fallocate and copy offload

 - Several bugfixes for the NFSv4.x directory delegation client code

 - pNFS:
    - Fix a deadlock when returning delegations during open
    - Fix memory leaks in various error paths

* tag 'nfs-for-6.19-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  NFS: Fix size read races in truncate, fallocate and copy offload
  NFS: Don't immediately return directory delegations when disabled
  NFS/localio: Deal with page bases that are > PAGE_SIZE
  NFS/localio: Stop further I/O upon hitting an error
  NFSv4.x: Directory delegations don't require any state recovery
  NFSv4: Don't free slots prematurely if requesting a directory delegation
  NFSv4: Fix nfs_clear_verifier_delegated() for delegated directories
  NFS: Fix directory delegation verifier checks
  pnfs/blocklayout: Fix memory leak in bl_parse_scsi()
  pnfs/flexfiles: Fix memory leak in nfs4_ff_alloc_deviceid_node()
  NFS: Fix a deadlock involving nfs_release_folio()
  pNFS: Fix a deadlock when returning a delegation during open()
2026-01-15 11:59:49 -08:00
Trond Myklebust
d5811e6297 NFS: Fix size read races in truncate, fallocate and copy offload
If the pre-operation file size is read before locking the inode and
quiescing O_DIRECT writes, then nfs_truncate_last_folio() might end up
overwriting valid file data.

Fixes: b1817b18ff ("NFS: Protect against 'eof page pollution'")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2026-01-15 14:38:25 -05:00
Johannes Thumshirn
437cc6057e btrfs: remove zoned statistics from sysfs
Remove the newly introduced zoned statistics from sysfs, as sysfs can
only show a single page this will truncate the output on a busy
filesystem.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-01-14 22:08:04 +01:00
Linus Torvalds
b54345928f Merge tag 'gfs2-for-6.19-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
Pull gfs2 revert from Andreas Gruenbacher:
 "Revert bad commit "gfs2: Fix use of bio_chain"

  I was originally assuming that there must be a bug in gfs2
  because gfs2 chains bios in the opposite direction of what
  bio_chain_and_submit() expects.

  It turns out that the bio chains are set up in "reverse direction"
  intentionally so that the first bio's bi_end_io callback is invoked
  rather than the last bio's callback.

  We want the first bio's callback invoked for the following reason: The
  initial bio starts page aligned and covers one or more pages. When it
  terminates at a non-page-aligned offset, subsequent bios are added to
  handle the remaining portion of the final page.

  Upon completion of the bio chain, all affected pages need to be
  marked as read, and only the first bio references all of these pages"

* tag 'gfs2-for-6.19-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
  Revert "gfs2: Fix use of bio_chain"
2026-01-13 10:04:34 -08:00
Brian Foster
c360004c01 xfs: set max_agbno to allow sparse alloc of last full inode chunk
Sparse inode cluster allocation sets min/max agbno values to avoid
allocating an inode cluster that might map to an invalid inode
chunk. For example, we can't have an inode record mapped to agbno 0
or that extends past the end of a runt AG of misaligned size.

The initial calculation of max_agbno is unnecessarily conservative,
however. This has triggered a corner case allocation failure where a
small runt AG (i.e. 2063 blocks) is mostly full save for an extent
to the EOFS boundary: [2050,13]. max_agbno is set to 2048 in this
case, which happens to be the offset of the last possible valid
inode chunk in the AG. In practice, we should be able to allocate
the 4-block cluster at agbno 2052 to map to the parent inode record
at agbno 2048, but the max_agbno value precludes it.

Note that this can result in filesystem shutdown via dirty trans
cancel on stable kernels prior to commit 9eb775968b ("xfs: walk
all AGs if TRYLOCK passed to xfs_alloc_vextent_iterate_ags") because
the tail AG selection by the allocator sets t_highest_agno on the
transaction. If the inode allocator spins around and finds an inode
chunk with free inodes in an earlier AG, the subsequent dir name
creation path may still fail to allocate due to the AG restriction
and cancel.

To avoid this problem, update the max_agbno calculation to the agbno
prior to the last chunk aligned agbno in the AG. This is not
necessarily the last valid allocation target for a sparse chunk, but
since inode chunks (i.e. records) are chunk aligned and sparse
allocs are cluster sized/aligned, this allows the sb_spino_align
alignment restriction to take over and round down the max effective
agbno to within the last valid inode chunk in the AG.
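
The numbers in the example above can be checked with a small standalone
sketch. This is not the kernel code: the function names are hypothetical,
and the 8-block chunk size and 8-block chunk alignment are assumptions,
chosen only because they reproduce the 2048 and 2052 figures quoted in
this message.

```c
#include <assert.h>
#include <stdint.h>

/* Round x down to a multiple of align (align must be nonzero). */
static uint32_t round_down_u32(uint32_t x, uint32_t align)
{
	assert(align != 0);
	return x - (x % align);
}

/* Conservative limit: start of the last full chunk that fits in the AG. */
static uint32_t max_agbno_old(uint32_t ag_blocks, uint32_t chunk_align,
			      uint32_t chunk_blocks)
{
	return round_down_u32(ag_blocks, chunk_align) - chunk_blocks;
}

/* Relaxed limit: the block just before the last chunk-aligned agbno. */
static uint32_t max_agbno_new(uint32_t ag_blocks, uint32_t chunk_align)
{
	return round_down_u32(ag_blocks, chunk_align) - 1;
}

/* Highest cluster-aligned start block at or below max_agbno. */
static uint32_t last_sparse_target(uint32_t max_agbno, uint32_t spino_align)
{
	return round_down_u32(max_agbno, spino_align);
}
```

Under these assumptions, a 2063-block AG gives an old limit of 2048 and a
new limit of 2055. With a 4-block sparse alignment, the new limit rounds
down to 2052, the cluster start the message says should be allocatable.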

Note that even though the allocator improvements in the
aforementioned commit seem to avoid this particular dirty trans
cancel situation, the max_agbno logic improvement still applies as
we should be able to allocate from an AG that has been appropriately
selected. The more important targets for this patch, however, are
older/stable kernels prior to this allocator rework/improvement.

Cc: stable@vger.kernel.org # v4.2
Fixes: 56d1115c9b ("xfs: allocate sparse inode chunks on full chunk allocation failure")
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-01-13 10:45:55 +01:00
Nirjhar Roy (IBM)
a65fd81207 xfs: Fix xfs_grow_last_rtg()
The last rtg should be able to grow when the size of the last rtg is
less than (and not equal to) sb_rgextents. xfs_growfs with realtime
groups fails without this patch. The reason is that xfs_growfs_rtg()
tries to grow the last rt group even when the last rt group is at its
maximal size, i.e., sb_rgextents. It fails with the following messages:

XFS (loop0): Internal error block >= mp->m_rsumblocks at line 253 of file fs/xfs/libxfs/xfs_rtbitmap.c.  Caller xfs_rtsummary_read_buf+0x20/0x80
XFS (loop0): Corruption detected. Unmount and run xfs_repair
XFS (loop0): Internal error xfs_trans_cancel at line 976 of file fs/xfs/xfs_trans.c.  Caller xfs_growfs_rt_bmblock+0x402/0x450
XFS (loop0): Corruption of in-memory data (0x8) detected at xfs_trans_cancel+0x10a/0x1f0 (fs/xfs/xfs_trans.c:977).  Shutting down filesystem.
XFS (loop0): Please unmount the filesystem and rectify the problem(s)
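
The condition described in this message can be sketched as a strict
comparison. The predicate below is a hypothetical stand-in, not the body
of xfs_grow_last_rtg(); its name and parameters are assumptions made for
illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for the check: the last realtime group has room
 * to grow only while it is strictly smaller than the per-group maximum.
 */
static bool last_rtg_can_grow(uint64_t last_rtg_extents, uint64_t rgextents)
{
	/* A group is assumed never to exceed the per-group maximum. */
	assert(last_rtg_extents <= rgextents);
	return last_rtg_extents < rgextents;
}
```

A "less than or equal" comparison would also return true for a group that
is already full, which is the case the message reports as failing.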

Signed-off-by: Nirjhar Roy (IBM) <nirjhar.roy.lists@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-01-13 10:40:45 +01:00
Christoph Hellwig
df7ec7226f xfs: improve the assert at the top of xfs_log_cover
Move each condition into a separate assert so that we can see which
one triggered.
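
A minimal userspace sketch of why splitting helps, assuming an assert
macro that reports the text of the failed expression. CHECK and the
function names are hypothetical and are not the kernel's ASSERT or the
conditions in xfs_log_cover().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Text of the first failed check, or NULL if every check passed. */
static const char *first_failure;

#define CHECK(expr)						\
	do {							\
		if (!(expr) && first_failure == NULL)		\
			first_failure = #expr;			\
	} while (0)

/* One combined check: a failure reports the whole expression. */
static const char *check_combined(bool a, bool b, bool c)
{
	first_failure = NULL;
	CHECK(a && b && c);
	return first_failure;
}

/* One check per condition: a failure names the condition that was false. */
static const char *check_split(bool a, bool b, bool c)
{
	first_failure = NULL;
	CHECK(a);
	CHECK(b);
	CHECK(c);
	return first_failure;
}

/* Usage: only the split form identifies the false condition. */
static void demo(void)
{
	assert(strcmp(check_combined(true, false, true), "a && b && c") == 0);
	assert(strcmp(check_split(true, false, true), "b") == 0);
	assert(check_split(true, true, true) == NULL);
}
```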

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-01-13 10:36:23 +01:00
Christoph Hellwig
baed03efe2 xfs: fix an overly long line in xfs_rtgroup_calc_geometry
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-01-13 10:34:29 +01:00
Christoph Hellwig
e0aea42a32 xfs: mark __xfs_rtgroup_extents static
__xfs_rtgroup_extents is not used outside of xfs_rtgroup.c, so mark it
static.  Move it and xfs_rtgroup_extents up in the file to avoid forward
declarations.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-01-13 10:34:29 +01:00
Nirjhar Roy (IBM)
6b2d155366 xfs: Fix the return value of xfs_rtcopy_summary()
xfs_rtcopy_summary() should return the appropriate error code
instead of always returning 0. Its caller, xfs_growfs_rt_bmblock(),
already handles the error.
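
The message does not show the code, so the following is only a generic
sketch of the pattern it describes: a function with a shared exit label
that must return the saved error rather than a constant 0. copy_one(),
copy_all() and the cleanup counter are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical per-item step that fails on a negative input. */
static int copy_one(int value, int *sum)
{
	if (value < 0)
		return -EINVAL;
	*sum += value;
	return 0;
}

/*
 * Copy all items, funnelling every exit through one cleanup label.
 * Ending with "return 0" after the label would hide a failure;
 * returning "error" propagates it to the caller.
 */
static int copy_all(const int *values, int count, int *sum, int *cleanups)
{
	int error = 0;
	int i;

	assert(values != NULL && sum != NULL && cleanups != NULL);
	for (i = 0; i < count; i++) {
		error = copy_one(values[i], sum);
		if (error)
			goto out;
	}
out:
	(*cleanups)++;	/* stands in for releasing cached state */
	return error;
}
```

The cleanup still runs on both paths; only the value handed back to the
caller changes.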

Fixes: e94b53ff69 ("xfs: cache last bitmap block in realtime allocator")
Signed-off-by: Nirjhar Roy (IBM) <nirjhar.roy.lists@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: stable@vger.kernel.org # v6.7
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-01-13 10:32:12 +01:00
Anna Schumaker
803e18641f NFS: Don't immediately return directory delegations when disabled
The function nfs_inode_evict_delegation() immediately and synchronously
returns a delegation when called. This means we can't call it from
nfs4_have_delegation(), since that function could be called under a
lock. Instead we should mark the delegation for return and let the state
manager handle it for us.

Fixes: b6d2a520f4 ("NFS: Add a module option to disable directory delegations")
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2026-01-12 11:50:22 -05:00
Jiasheng Jiang
a11224a016 btrfs: fix memory leaks in create_space_info() error paths
In create_space_info(), the 'space_info' object is allocated at the
beginning of the function. However, there are two error paths where the
function returns an error code without freeing the allocated memory:

1. When create_space_info_sub_group() fails in zoned mode.
2. When btrfs_sysfs_add_space_info_type() fails.

In both cases, 'space_info' has not yet been added to the
fs_info->space_info list, resulting in a memory leak. Fix this by
adding an error handling label to kfree(space_info) before returning.
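
The fix described above follows the usual "free on the error label"
shape. The userspace sketch below illustrates it under stated
assumptions: the struct, the function, the failure flags and the
live_allocs counter are all hypothetical, and calloc()/free() stand in
for the kernel allocator.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

struct space_info_sketch {
	int flags;
};

/* Counts objects that are allocated and not yet freed. */
static int live_allocs;

static int create_sketch(int flags, bool sub_group_fails, bool sysfs_fails,
			 struct space_info_sketch **out)
{
	struct space_info_sketch *info;
	int ret;

	assert(out != NULL);
	info = calloc(1, sizeof(*info));
	if (!info)
		return -ENOMEM;
	live_allocs++;
	info->flags = flags;

	if (sub_group_fails) {		/* first failing step */
		ret = -ENOMEM;
		goto out_free;
	}
	if (sysfs_fails) {		/* second failing step */
		ret = -EIO;
		goto out_free;
	}

	*out = info;			/* ownership passes to the caller */
	return 0;

out_free:
	/* The object was never published, so it must be freed here. */
	free(info);
	live_allocs--;
	return ret;
}
```

Returning directly from either failing step, without the jump to
out_free, would leave live_allocs at 1 with no remaining pointer to the
object.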

Fixes: 2be12ef79f ("btrfs: Separate space_info create/update")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Jiasheng Jiang <jiashengjiangcool@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-01-12 16:21:55 +01:00