Add the __counted_by compiler attribute to the flexible array member
inodes to improve access bounds-checking via CONFIG_UBSAN_BOUNDS and
CONFIG_FORTIFY_SOURCE.
Remove the now obsolete comment on the count field.
In ext4_expand_inode_array(), use struct_size() instead of offsetof()
and remove the local variable count. Increment the count field before
adding a new inode to the inodes array.
Compile-tested only.
Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com>
Link: https://patch.msgid.link/20240730220200.410939-3-thorsten.blum@toblux.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Function jbd2_journal_shrink_checkpoint_list() assumes that '0' is not a
valid value for transaction IDs, which is incorrect.
Furthermore, the sbi->s_fc_ineligible_tid handling also makes the same
assumption by being initialised to '0'. Fortunately, the sb flag
EXT4_MF_FC_INELIGIBLE can be used to check whether sbi->s_fc_ineligible_tid
has been previously set instead of comparing it with '0'.
Signed-off-by: Luis Henriques (SUSE) <luis.henriques@linux.dev>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20240724161119.13448-5-luis.henriques@linux.dev
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Iterate the folio's list of buffer_heads twice instead of keeping
an array of pointers. This solves a too-large-array-for-stack problem
on architectures with a ridiculoously large PAGE_SIZE and prepares
ext4 to support larger folios.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://patch.msgid.link/20240718223005.568869-3-willy@infradead.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Instead of synchronously reading one buffer at a time, submit reads
as we walk the buffers in the first loop, then wait for them in the
second loop. This should be significantly more efficient, particularly
on HDDs, but I have not measured.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://patch.msgid.link/20240718223005.568869-2-willy@infradead.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
This function is very similar to do_mpage_readpage() and a similar
approach to that taken in commit 12ac5a65cb will work. As in
do_mpage_readpage(), we only use this array for checking block contiguity
and we can do that more efficiently with a little arithmetic.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://patch.msgid.link/20240718223005.568869-1-willy@infradead.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
In __jbd2_log_wait_for_space(), we might call jbd2_cleanup_journal_tail()
to recover some journal space. But if an error occurs while executing
jbd2_cleanup_journal_tail() (e.g., an EIO), we don't stop waiting for free
space right away, we try other branches, and if j_committing_transaction
is NULL (i.e., the tid is 0), we will get the following complain:
============================================
JBD2: I/O error when updating journal superblock for sdd-8.
__jbd2_log_wait_for_space: needed 256 blocks and only had 217 space available
__jbd2_log_wait_for_space: no way to get more journal space in sdd-8
------------[ cut here ]------------
WARNING: CPU: 2 PID: 139804 at fs/jbd2/checkpoint.c:109 __jbd2_log_wait_for_space+0x251/0x2e0
Modules linked in:
CPU: 2 PID: 139804 Comm: kworker/u8:3 Not tainted 6.6.0+ #1
RIP: 0010:__jbd2_log_wait_for_space+0x251/0x2e0
Call Trace:
<TASK>
add_transaction_credits+0x5d1/0x5e0
start_this_handle+0x1ef/0x6a0
jbd2__journal_start+0x18b/0x340
ext4_dirty_inode+0x5d/0xb0
__mark_inode_dirty+0xe4/0x5d0
generic_update_time+0x60/0x70
[...]
============================================
So only if jbd2_cleanup_journal_tail() returns 1, i.e., there is nothing to
clean up at the moment, continue to try to reclaim free space in other ways.
Note that this fix relies on commit 6f6a6fda29 ("jbd2: fix ocfs2 corrupt
when updating journal superblock fails") to make jbd2_cleanup_journal_tail
return the correct error code.
Fixes: 8c3f25d895 ("jbd2: don't give up looking for space so easily in __jbd2_log_wait_for_space")
Cc: stable@kernel.org
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20240718115336.2554501-1-libaokun@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
The following kernel trace can be triggered with fstest generic/629 when
executed against a filesystem with fast-commit feature enabled:
INFO: trying to register non-static key.
The code is fine but needs lockdep annotation, or maybe
you didn't initialize this object before use?
turning off the locking correctness validator.
CPU: 0 PID: 866 Comm: mount Not tainted 6.10.0+ #11
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-3-gd478f380-prebuilt.qemu.org 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x66/0x90
register_lock_class+0x759/0x7d0
__lock_acquire+0x85/0x2630
? __find_get_block+0xb4/0x380
lock_acquire+0xd1/0x2d0
? __ext4_journal_get_write_access+0xd5/0x160
_raw_spin_lock+0x33/0x40
? __ext4_journal_get_write_access+0xd5/0x160
__ext4_journal_get_write_access+0xd5/0x160
ext4_reserve_inode_write+0x61/0xb0
__ext4_mark_inode_dirty+0x79/0x270
? ext4_ext_replay_set_iblocks+0x2f8/0x450
ext4_ext_replay_set_iblocks+0x330/0x450
ext4_fc_replay+0x14c8/0x1540
? jread+0x88/0x2e0
? rcu_is_watching+0x11/0x40
do_one_pass+0x447/0xd00
jbd2_journal_recover+0x139/0x1b0
jbd2_journal_load+0x96/0x390
ext4_load_and_init_journal+0x253/0xd40
ext4_fill_super+0x2cc6/0x3180
...
In the replay path there's an attempt to lock sbi->s_bdev_wb_lock in
function ext4_check_bdev_write_error(). Unfortunately, at this point this
spinlock has not been initialized yet. Moving it's initialization to an
earlier point in __ext4_fill_super() fixes this splat.
Signed-off-by: Luis Henriques (SUSE) <luis.henriques@linux.dev>
Link: https://patch.msgid.link/20240718094356.7863-1-luis.henriques@linux.dev
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
When a full journal commit is on-going, any fast commit has to be enqueued
into a different queue: FC_Q_STAGING instead of FC_Q_MAIN. This enqueueing
is done only once, i.e. if an inode is already queued in a previous fast
commit entry it won't be enqueued again. However, if a full commit starts
_after_ the inode is enqueued into FC_Q_MAIN, the next fast commit needs to
be done into FC_Q_STAGING. And this is not being done in function
ext4_fc_track_template().
This patch fixes the issue by re-enqueuing an inode into the STAGING queue
during the fast commit clean-up callback when doing a full commit. However,
to prevent a race with a fast-commit, the clean-up callback has to be called
with the journal locked.
This bug was found using fstest generic/047. This test creates several 32k
bytes files, sync'ing each of them after it's creation, and then shutting
down the filesystem. Some data may be loss in this operation; for example a
file may have it's size truncated to zero.
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Luis Henriques (SUSE) <luis.henriques@linux.dev>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20240717172220.14201-1-luis.henriques@linux.dev
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Syzbot has found an ODEBUG bug in ext4_fill_super
The del_timer_sync function cancels the s_err_report timer,
which reminds about filesystem errors daily. We should
guarantee the timer is no longer active before kfree(sbi).
When filesystem mounting fails, the flow goes to failed_mount3,
where an error occurs when ext4_stop_mmpd is called, causing
a read I/O failure. This triggers the ext4_handle_error function
that ultimately re-arms the timer,
leaving the s_err_report timer active before kfree(sbi) is called.
Fix the issue by canceling the s_err_report timer after calling ext4_stop_mmpd.
Signed-off-by: Xiaxi Shen <shenxiaxi26@gmail.com>
Reported-and-tested-by: syzbot+59e0101c430934bc9a36@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=59e0101c430934bc9a36
Link: https://patch.msgid.link/20240715043336.98097-1-shenxiaxi26@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Pull more btrfs fixes from David Sterba:
"A more fixes. We got reports that shrinker added in 6.10 still causes
latency spikes and the fixes don't handle all corner cases. Due to
summer holidays we're taking a shortcut to disable it for release
builds and will fix it in the near future.
- only enable extent map shrinker for DEBUG builds, temporary quick
fix to avoid latency spikes for regular builds
- update target inode's ctime on unlink, mandated by POSIX
- properly take lock to read/update block group's zoned variables
- add counted_by() annotations"
* tag 'for-6.11-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: only enable extent map shrinker for DEBUG builds
btrfs: zoned: properly take lock to read/update block group's zoned variables
btrfs: tree-checker: add dev extent item checks
btrfs: update target inode's ctime on unlink
btrfs: send: annotate struct name_cache_entry with __counted_by()
fuse_notify_store(), unlike fuse_do_readpage(), does not enable page
zeroing (because it can be used to change partial page contents).
So fuse_notify_store() must be more careful to fully initialize page
contents (including parts of the page that are beyond end-of-file)
before marking the page uptodate.
The current code can leave beyond-EOF page contents uninitialized, which
makes these uninitialized page contents visible to userspace via mmap().
This is an information leak, but only affects systems which do not
enable init-on-alloc (via CONFIG_INIT_ON_ALLOC_DEFAULT_ON=y or the
corresponding kernel command line parameter).
Link: https://bugs.chromium.org/p/project-zero/issues/detail?id=2574
Cc: stable@kernel.org
Fixes: a1d75f2582 ("fuse: add store request")
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull smb client fixes from Steve French:
- fix for clang warning - additional null check
- fix for cached write with posix locks
- flexible structure fix
* tag 'v6.11-rc3-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
smb: smb2pdu.h: Use static_assert() to check struct sizes
smb3: fix lock breakage for cached writes
smb/client: avoid possible NULL dereference in cifs_free_subrequest()
Pull xfs fixes from Chandan Babu:
- Check for presence of only 'attr' feature before scrubbing an inode's
attribute fork.
- Restore the behaviour of setting AIL thread to TASK_INTERRUPTIBLE for
long (i.e. 50ms) sleep durations to prevent high load averages.
- Do not allow users to change the realtime flag of a file unless the
datadev and rtdev both support fsdax access modes.
* tag 'xfs-6.11-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: conditionally allow FS_XFLAG_REALTIME changes if S_DAX is set
xfs: revert AIL TASK_KILLABLE threshold
xfs: attr forks require attr, not attr2
Pull bcachefs fixes from Kent OverstreetL
- New on disk format version, bcachefs_metadata_version_disk_accounting_inum
This adds one more disk accounting counter, which counts disk usage
and number of extents per inode number. This lets us track
fragmentation, for implementing defragmentation later, and it also
counts disk usage per inode in all snapshots, which will be a useful
thing to expose to users.
- One performance issue we've observed is threads spinning when they
should be waiting for dirty keys in the key cache to be flushed by
journal reclaim, so we now have hysteresis for the waiting thread, as
well as improving the tracepoint and a new time_stat, for tracking
time blocked waiting on key cache flushing.
... and various assorted smaller fixes.
* tag 'bcachefs-2024-08-16' of git://evilpiepirate.org/bcachefs:
bcachefs: Fix locking in __bch2_trans_mark_dev_sb()
bcachefs: fix incorrect i_state usage
bcachefs: avoid overflowing LRU_TIME_BITS for cached data lru
bcachefs: Fix forgetting to pass trans to fsck_err()
bcachefs: Increase size of cuckoo hash table on too many rehashes
bcachefs: bcachefs_metadata_version_disk_accounting_inum
bcachefs: Kill __bch2_accounting_mem_mod()
bcachefs: Make bkey_fsck_err() a wrapper around fsck_err()
bcachefs: Fix warning in __bch2_fsck_err() for trans not passed in
bcachefs: Add a time_stat for blocked on key cache flush
bcachefs: Improve trans_blocked_journal_reclaim tracepoint
bcachefs: Add hysteresis to waiting on btree key cache flush
lib/generic-radix-tree.c: Fix rare race in __genradix_ptr_alloc()
bcachefs: Convert for_each_btree_node() to lockrestart_do()
bcachefs: Add missing downgrade table entry
bcachefs: disk accounting: ignore unknown types
bcachefs: bch2_accounting_invalid() fixup
bcachefs: Fix bch2_trigger_alloc when upgrading from old versions
bcachefs: delete faulty fastpath in bch2_btree_path_traverse_cached()
We run this in full RW mode now, so we have to guard against the
superblock buffer being reallocated.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Commit 9f9bef9bc5 ("smb: smb2pdu.h: Avoid -Wflex-array-member-not-at-end
warnings") introduced tagged `struct create_context_hdr`. We want to
ensure that when new members need to be added to the flexible structure,
they are always included within this tagged struct.
So, we use `static_assert()` to ensure that the memory layout for
both the flexible structure and the tagged struct is the same after
any changes.
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Mandatory locking is enforced for cached writes, which violates
default posix semantics, and also it is enforced inconsistently.
This apparently breaks recent versions of libreoffice, but can
also be demonstrated by opening a file twice from the same
client, locking it from handle one and writing to it from
handle two (which fails, returning EACCES).
Since there was already a mount option "forcemandatorylock"
(which defaults to off), with this change only when the user
intentionally specifies "forcemandatorylock" on mount will we
break posix semantics on write to a locked range (ie we will
only fail the write in this case, if the user mounts with
"forcemandatorylock").
Fixes: 85160e03a7 ("CIFS: Implement caching mechanism for mandatory brlocks")
Cc: stable@vger.kernel.org
Cc: Pavel Shilovsky <piastryyy@gmail.com>
Reported-by: abartlet@samba.org
Reported-by: Kevin Ottens <kevin.ottens@enioka.com>
Reviewed-by: David Howells <dhowells@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Clang static checker (scan-build) warning:
cifsglob.h:line 890, column 3
Access to field 'ops' results in a dereference of a null pointer.
Commit 519be98971 ("cifs: Add a tracepoint to track credits involved in
R/W requests") adds a check for 'rdata->server', and let clang throw this
warning about NULL dereference.
When 'rdata->credits.value != 0 && rdata->server == NULL' happens,
add_credits_and_wake_if() will call rdata->server->ops->add_credits().
This will cause NULL dereference problem. Add a check for 'rdata->server'
to avoid NULL dereference.
Cc: stable@vger.kernel.org
Fixes: 69c3c023af ("cifs: Implement netfslib hooks")
Reviewed-by: David Howells <dhowells@redhat.com>
Signed-off-by: Su Hui <suhui@nfschina.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
__btrfs_add_free_space_zoned() references and modifies bg's alloc_offset,
ro, and zone_unusable, but without taking the lock. It is mostly safe
because they monotonically increase (at least for now) and this function is
mostly called by a transaction commit, which is serialized by itself.
Still, taking the lock is a safer and correct option and I'm going to add a
change to reset zone_unusable while a block group is still alive. So, add
locking around the operations.
Fixes: 169e0da91a ("btrfs: zoned: track unusable bytes for zones")
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[REPORT]
There is a corruption report that btrfs refused to mount a fs that has
overlapping dev extents:
BTRFS error (device sdc): dev extent devid 4 physical offset 14263979671552 overlap with previous dev extent end 14263980982272
BTRFS error (device sdc): failed to verify dev extents against chunks: -117
BTRFS error (device sdc): open_ctree failed
[CAUSE]
The direct cause is very obvious, there is a bad dev extent item with
incorrect length.
With btrfs check reporting two overlapping extents, the second one shows
some clue on the cause:
ERROR: dev extent devid 4 offset 14263979671552 len 6488064 overlap with previous dev extent end 14263980982272
ERROR: dev extent devid 13 offset 2257707008000 len 6488064 overlap with previous dev extent end 2257707270144
ERROR: errors found in extent allocation tree or chunk allocation
The second one looks like a bitflip happened during new chunk
allocation:
hex(2257707008000) = 0x20da9d30000
hex(2257707270144) = 0x20da9d70000
diff = 0x00000040000
So it looks like a bitflip happened during new dev extent allocation,
resulting the second overlap.
Currently we only do the dev-extent verification at mount time, but if the
corruption is caused by memory bitflip, we really want to catch it before
writing the corruption to the storage.
Furthermore the dev extent items has the following key definition:
(<device id> DEV_EXTENT <physical offset>)
Thus we can not just rely on the generic key order check to make sure
there is no overlapping.
[ENHANCEMENT]
Introduce dedicated dev extent checks, including:
- Fixed member checks
* chunk_tree should always be BTRFS_CHUNK_TREE_OBJECTID (3)
* chunk_objectid should always be
BTRFS_FIRST_CHUNK_CHUNK_TREE_OBJECTID (256)
- Alignment checks
* chunk_offset should be aligned to sectorsize
* length should be aligned to sectorsize
* key.offset should be aligned to sectorsize
- Overlap checks
If the previous key is also a dev-extent item, with the same
device id, make sure we do not overlap with the previous dev extent.
Reported: Stefan N <stefannnau@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CA+W5K0rSO3koYTo=nzxxTm1-Pdu1HYgVxEpgJ=aGc7d=E8mGEg@mail.gmail.com/
CC: stable@vger.kernel.org # 5.10+
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Unlink changes the link count on the target inode. POSIX mandates that
the ctime must also change when this occurs.
According to https://pubs.opengroup.org/onlinepubs/9699919799/functions/unlink.html:
"Upon successful completion, unlink() shall mark for update the last data
modification and last file status change timestamps of the parent
directory. Also, if the file's link count is not 0, the last file status
change timestamp of the file shall be marked for update."
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add link to the opengroup docs ]
Signed-off-by: David Sterba <dsterba@suse.com>
Add the __counted_by compiler attribute to the flexible array member
name to improve access bounds-checking via CONFIG_UBSAN_BOUNDS and
CONFIG_FORTIFY_SOURCE.
Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Pull btrfs fixes from David Sterba:
- extend tree-checker verification of directory item type
- fix regression in page/folio and extent state tracking in xarray, the
dirty status can get out of sync and can cause problems e.g. a hang
- in send, detect last extent and allow to clone it instead of sending
it as write, reduces amount of data transferred in the stream
- fix checking extent references when cleaning deleted subvolumes
- fix one more case in the extent map shrinker, let it run only in the
kswapd context so it does not cause latency spikes during other
operations
* tag 'for-6.11-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix invalid mapping of extent xarray state
btrfs: send: allow cloning non-aligned extent if it ends at i_size
btrfs: only run the extent map shrinker from kswapd tasks
btrfs: tree-checker: reject BTRFS_FT_UNKNOWN dir type
btrfs: check delayed refs when we're checking if a ref exists
Pull vfs fixes from Christian Brauner:
"VFS:
- Fix the name of file lease slab cache. When file leases were split
out of file locks the name of the file lock slab cache was used for
the file leases slab cache as well.
- Fix a type in take_fd() helper.
- Fix infinite directory iteration for stable offsets in tmpfs.
- When the icache is pruned all reclaimable inodes are marked with
I_FREEING and other processes that try to lookup such inodes will
block.
But some filesystems like ext4 can trigger lookups in their inode
evict callback causing deadlocks. Ext4 does such lookups if the
ea_inode feature is used whereby a separate inode may be used to
store xattrs.
Introduce I_LRU_ISOLATING which pins the inode while its pages are
reclaimed. This avoids inode deletion during inode_lru_isolate()
avoiding the deadlock and evict is made to wait until
I_LRU_ISOLATING is done.
netfs:
- Fault in smaller chunks for non-large folio mappings for
filesystems that haven't been converted to large folios yet.
- Fix the CONFIG_NETFS_DEBUG config option. The config option was
renamed a short while ago and that introduced two minor issues.
First, it depended on CONFIG_NETFS whereas it wants to depend on
CONFIG_NETFS_SUPPORT. The former doesn't exist, while the latter
does. Second, the documentation for the config option wasn't fixed
up.
- Revert the removal of the PG_private_2 writeback flag as ceph is
using it and fix how that flag is handled in netfs.
- Fix DIO reads on 9p. A program watching a file on a 9p mount
wouldn't see any changes in the size of the file being exported by
the server if the file was changed directly in the source
filesystem. Fix this by attempting to read the full size specified
when a DIO read is requested.
- Fix a NULL pointer dereference bug due to a data race where a
cachefiles cookies was retired even though it was still in use.
Check the cookie's n_accesses counter before discarding it.
nsfs:
- Fix ioctl declaration for NS_GET_MNTNS_ID from _IO() to _IOR() as
the kernel is writing to userspace.
pidfs:
- Prevent the creation of pidfds for kthreads until we have a
use-case for it and we know the semantics we want. It also confuses
userspace why they can get pidfds for kthreads.
squashfs:
- Fix an unitialized value bug reported by KMSAN caused by a
corrupted symbolic link size read from disk. Check that the
symbolic link size is not larger than expected"
* tag 'vfs-6.11-rc4.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
Squashfs: sanity check symbolic link size
9p: Fix DIO read through netfs
vfs: Don't evict inode under the inode lru traversing context
netfs: Fix handling of USE_PGPRIV2 and WRITE_TO_CACHE flags
netfs, ceph: Revert "netfs: Remove deprecated use of PG_private_2 as a second writeback flag"
file: fix typo in take_fd() comment
pidfd: prevent creation of pidfds for kthreads
netfs: clean up after renaming FSCACHE_DEBUG config
libfs: fix infinite directory reads for offset dir
nsfs: fix ioctl declaration
fs/netfs/fscache_cookie: add missing "n_accesses" check
filelock: fix name of file_lease slab cache
netfs: Fault in smaller chunks for non-large folio mappings
If a file has the S_DAX flag (aka fsdax access mode) set, we cannot
allow users to change the realtime flag unless the datadev and rtdev
both support fsdax access modes. Even if there are no extents allocated
to the file, the setattr thread could be racing with another thread
that has already started down the write code paths.
Fixes: ba23cba9b3 ("fs: allow per-device dax status checking for filesystems")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
In commit 9adf40249e, we changed the behavior of the AIL thread to
set its own task state to KILLABLE whenever the timeout value is
nonzero. Unfortunately, this missed the fact that xfsaild_push will
return 50ms (aka a longish sleep) when we reach the push target or the
AIL becomes empty, so xfsaild goes to sleep for a long period of time in
uninterruptible D state.
This results in artificially high load averages because KILLABLE
processes are UNINTERRUPTIBLE, which contributes to load average even
though the AIL is asleep waiting for someone to interrupt it. It's not
blocked on IOs or anything, but people scrap ps for processes that look
like they're stuck in D state, so restore the previous threshold.
Fixes: 9adf40249e ("xfs: AIL doesn't need manual pushing")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
It turns out that I misunderstood the difference between the attr and
attr2 feature bits. "attr" means that at some point an attr fork was
created somewhere in the filesystem. "attr2" means that inodes have
variable-sized forks, but says nothing about whether or not there
actually /are/ attr forks in the system.
If we have an attr fork, we only need to check that attr is set.
Fixes: 99d9d8d05d ("xfs: scrub inode block mappings")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
This adds another disk accounting counter to track usage per inode
number (any snapshot ID).
This will be used for a couple things:
- It'll give us a way to tell the user how much space a given file ista
consuming in all snapshots; i.e. how much extra space it's consuming
due to snapshot versioning.
- It counts number of extents and total size of extents (both in btree
keyspace sectors and actual disk usage), meaning it gives us average
extent size: that is, it'll let us cheaply find fragmented files that
should be defragmented.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The next patch will be adding a disk accounting counter type which is
not kept in the in-memory eytzinger tree.
As prep, fold __bch2_accounting_mem_mod() into
bch2_accounting_mem_mod_locked() so that we can check for that counter
type and bail out without calling bpos_to_disk_accounting_pos() twice.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>