linux

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2026-04-10 22:59:03 -04:00

Author	SHA1	Message	Date
Ojaswin Mujoo	bae76c035b	ext4: fix fsmap end of range reporting with bigalloc With bigalloc enabled, the logic to report last extent has a bug since we try to use cluster units instead of block units. This can cause an issue where extra incorrect entries might be returned back to the user. This was flagged by generic/365 with 64k bs and -O bigalloc. Details of issue The issue was noticed on 5G 64k blocksize FS with -O bigalloc which has only 1 bg. $ xfs_io -c "fsmap -d" /mnt/scratch 0: 253:48 [0..127]: static fs metadata 128 /* sb / 1: 253:48 [128..255]: special 102:1 128 / gdt / 3: 253:48 [256..383]: special 102:3 128 / block bitmap / 4: 253:48 [384..2303]: unknown 1920 / flex bg empty space / 5: 253:48 [2304..2431]: special 102:4 128 / inode bitmap / 6: 253:48 [2432..4351]: unknown 1920 / flex bg empty space / 7: 253:48 [4352..6911]: inodes 2560 8: 253:48 [6912..538623]: unknown 531712 9: 253:48 [538624..10485759]: free space `9947136` The issue can be seen with: $ xfs_io -c "fsmap -d 0 3" /mnt/scratch 0: 253:48 [0..127]: static fs metadata 128 1: 253:48 [384..2047]: unknown 1664 Only the first entry was expected to be returned but we get 2. This is because: ext4_getfsmap_datadev() first_cluster, last_cluster = 0 ... info->gfi_last = true; ext4_getfsmap_datadev_helper(sb, end_ag, last_cluster + 1, 0, info); fsb = C2B(1) = 16 fslen = 0 ... / Merge in any relevant extents from the meta_list / list_for_each_entry_safe(p, tmp, &info->gfi_meta_list, fmr_list) { ... // since fsb = 16, considers all metadata which starts before 16 blockno iter 1: error = ext4_getfsmap_helper(sb, info, p); // p = sb (0,1), nop info->gfi_next_fsblk = 1 iter 2: error = ext4_getfsmap_helper(sb, info, p); // p = gdt (1,2), nop info->gfi_next_fsblk = 2 iter 3: error = ext4_getfsmap_helper(sb, info, p); // p = blk bitmap (2,3), nop info->gfi_next_fsblk = 3 iter 4: error = ext4_getfsmap_helper(sb, info, p); // p = ino bitmap (18,19) if (rec_blk > info->gfi_next_fsblk) { // (18 > 3) // emits an extra entry * BUG ** } } Fix this by directly calling ext4_getfsmap_datadev() with a dummy record that has fmr_physical set to (end_fsb + 1) instead of last_cluster + 1. By using the block instead of cluster we get the correct behavior. Replacing ext4_getfsmap_datadev_helper() with ext4_getfsmap_helper() is okay since the gfi_lastfree and metadata checks in ext4_getfsmap_datadev_helper() are anyways redundant when we only want to emit the last allocated block of the range, as we have already taken care of emitting metadata and any last free blocks. Cc: stable@kernel.org Reported-by: Disha Goel <disgoel@linux.ibm.com> Fixes: `4a622e4d47` ("ext4: fix FS_IOC_GETFSMAP handling") Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Link: https://patch.msgid.link/e7472c8535c9c5ec10f425f495366864ea12c9da.1754377641.git.ojaswin@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2025-08-12 23:15:05 -04:00
Qianfeng Rong	4ba97589ed	ext4: remove redundant __GFP_NOWARN GFP_NOWAIT already includes __GFP_NOWARN, so let's remove the redundant __GFP_NOWARN. Signed-off-by: Qianfeng Rong <rongqianfeng@vivo.com> Link: https://patch.msgid.link/20250803102243.623705-4-rongqianfeng@vivo.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2025-08-12 23:15:05 -04:00
Theodore Ts'o	59d8731c88	ext4: fix unused variable warning in ext4_init_new_dir Cc: stable@kernel.org Fixes: `90f097b140` ("ext4: refactor the inline directory conversion and...") Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2025-08-12 23:15:05 -04:00
Antonio Quartulli	f3fbaa74d9	ext4: remove useless if check This if branch is only jumping to 'out' which is defined just after the branch itself. Hence this is if-check is a no-op and can be removed. Address-Coverity-ID: 1647981 ("Incorrect expression (IDENTICAL_BRANCHES)") Signed-off-by: Antonio Quartulli <antonio@mandelbit.com> Link: https://patch.msgid.link/20250721200902.1071-1-antonio@mandelbit.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2025-08-12 23:15:05 -04:00
Andreas Dilger	b4cc4a4077	ext4: check fast symlink for ea_inode correctly The check for a fast symlink in the presence of only an external xattr inode is incorrect. If a fast symlink does not have an xattr block (i_file_acl == 0), but does have an external xattr inode that increases inode i_blocks, then the check for a fast symlink will incorrectly fail and __ext4_iget()->ext4_ind_check_inode() will report the inode is corrupt when it "validates" i_data[] on the next read: # ln -s foo /mnt/tmp/bar # setfattr -h -n trusted.test \ -v "$(yes \| head -n 4000)" /mnt/tmp/bar # umount /mnt/tmp # mount /mnt/tmp # ls -l /mnt/tmp ls: cannot access '/mnt/tmp/bar': Structure needs cleaning total 4 ? l?????????? ? ? ? ? ? bar # dmesg \| tail -1 EXT4-fs error (device dm-8): __ext4_iget:5098: inode #24578: block 7303014: comm ls: invalid block (note that "block 7303014" = 0x6f6f66 = "foo" in LE order). ext4_inode_is_fast_symlink() should check the superblock EXT4_FEATURE_INCOMPAT_EA_INODE feature flag, not the inode EXT4_EA_INODE_FL, since the latter is only set on the xattr inode itself, and not on the inode that uses this xattr. Cc: stable@vger.kernel.org Fixes: `fc82228a5e` ("ext4: support fast symlinks from ext3 file systems") Signed-off-by: Andreas Dilger <adilger@whamcloud.com> Reviewed-by: Li Dongyang <dongyangli@ddn.com> Reviewed-by: Alex Zhuravlev <bzzz@whamcloud.com> Reviewed-by: Oleg Drokin <green@whamcloud.com> Reviewed-on: https://review.whamcloud.com/59879 Lustre-bug-id: https://jira.whamcloud.com/browse/LU-19121 Link: https://patch.msgid.link/20250717063709.757077-1-adilger@dilger.ca Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2025-08-12 23:15:05 -04:00
Baokun Li	f2326fd14a	ext4: preserve SB_I_VERSION on remount IMA testing revealed that after an ext4 remount, file accesses triggered full measurements even without modifications, instead of skipping as expected when i_version is unchanged. Debugging showed `SB_I_VERSION` was cleared in reconfigure_super() during remount due to commit `1ff2030739` ("ext4: unconditionally enable the i_version counter") removing the fix from commit `960e0ab63b` ("ext4: fix i_version handling on remount"). To rectify this, `SB_I_VERSION` is always set for `fc->sb_flags` in ext4_init_fs_context(), instead of `sb->s_flags` in __ext4_fill_super(), ensuring it persists across all mounts. Cc: stable@kernel.org Fixes: `1ff2030739` ("ext4: unconditionally enable the i_version counter") Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20250703073903.6952-2-libaokun@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2025-08-12 23:14:58 -04:00
Baokun Li	6a912e8aa2	ext4: show the default enabled i_version option Display `i_version` in `/proc/fs/ext4/sdx/options`, even though it's default enabled. This aids users managing multi-version scenarios and simplifies debugging. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20250703073903.6952-1-libaokun@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2025-08-12 21:28:07 -04:00
Theodore Ts'o	099b847ccc	ext4: do not BUG when INLINE_DATA_FL lacks system.data xattr A syzbot fuzzed image triggered a BUG_ON in ext4_update_inline_data() when an inode had the INLINE_DATA_FL flag set but was missing the system.data extended attribute. Since this can happen due to a maiciouly fuzzed file system, we shouldn't BUG, but rather, report it as a corrupted file system. Add similar replacements of BUG_ON with EXT4_ERROR_INODE() ii ext4_create_inline_data() and ext4_inline_data_truncate(). Reported-by: syzbot+544248a761451c0df72f@syzkaller.appspotmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2025-07-25 09:14:17 -04:00
Baokun Li	a3ce570a5d	ext4: implement linear-like traversal across order xarrays Although we now perform ordered traversal within an xarray, this is currently limited to a single xarray. However, we have multiple such xarrays, which prevents us from guaranteeing a linear-like traversal where all groups on the right are visited before all groups on the left. For example, suppose we have 128 block groups, with a target group of 64, a target length corresponding to an order of 1, and available free groups of 16 (order 1) and group 65 (order 8): For linear traversal, when no suitable free block is found in group 64, it will search in the next block group until group 127, then start searching from 0 up to block group 63. It ensures continuous forward traversal, which is consistent with the unidirectional rotation behavior of HDD platters. Additionally, the block group lock contention during freeing block is unavoidable. The goal increasing from 0 to 64 indicates that previously scanned groups (which had no suitable free space and are likely to free blocks later) and skipped groups (which are currently in use) have newly freed some used blocks. If we allocate blocks in these groups, the probability of competing with other processes increases. For non-linear traversal, we first traverse all groups in order_1. If only group 16 has free space in this list, we first traverse [63, 128), then traverse [0, 64) to find the available group 16, and then allocate blocks in group 16. Therefore, it cannot guarantee continuous traversal in one direction, thus increasing the probability of contention. So refactor ext4_mb_scan_groups_xarray() to ext4_mb_scan_groups_xa_range() to only traverse a fixed range of groups, and move the logic for handling wrap around to the caller. The caller first iterates through all xarrays in the range [start, ngroups) and then through the range [0, start). This approach simulates a linear scan, which reduces contention between freeing blocks and allocating blocks. Assume we have the following groups, where "\|" denotes the xarray traversal start position: order_1_groups: AB \| CD order_2_groups: EF \| GH Traversal order: Before: C > D > A > B > G > H > E > F After: C > D > G > H > A > B > E > F Performance test data follows: \|CPU: Kunpeng 920 \| P80 \| P1 \| \|Memory: 512GB \|------------------------\|-------------------------\| \|960GB SSD (0.5GB/s)\| base \| patched \| base \| patched \| \|-------------------\|-------\|----------------\|--------\|----------------\| \|mb_optimize_scan=0 \| 19555 \| 20049 (+2.5%) \| 315636 \| 316724 (-0.3%) \| \|mb_optimize_scan=1 \| 15496 \| 19342 (+24.8%) \| 323569 \| 328324 (+1.4%) \| \|CPU: AMD 9654 * 2 \| P96 \| P1 \| \|Memory: 1536GB \|------------------------\|-------------------------\| \|960GB SSD (1GB/s) \| base \| patched \| base \| patched \| \|-------------------\|-------\|----------------\|--------\|----------------\| \|mb_optimize_scan=0 \| 53192 \| 52125 (-2.0%) \| 212678 \| 215136 (+1.1%) \| \|mb_optimize_scan=1 \| 37636 \| 50331 (+33.7%) \| 214189 \| 209431 (-2.2%) \| Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250714130327.1830534-18-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2025-07-25 09:14:17 -04:00
Baokun Li	6347558764	ext4: refactor choose group to scan group This commit converts the `choose group` logic to `scan group` using previously prepared helper functions. This allows us to leverage xarrays for ordered non-linear traversal, thereby mitigating the "bouncing" issue inherent in the `choose group` mechanism. This also decouples linear and non-linear traversals, leading to cleaner and more readable code. Key changes: * ext4_mb_choose_next_group() is refactored to ext4_mb_scan_groups(). * Replaced ext4_mb_good_group() with ext4_mb_scan_group() in non-linear traversals, and related functions now return error codes instead of group info. * Added ext4_mb_scan_groups_linear() for performing linear scans starting from a specific group for a set number of times. * Linear scans now execute up to sbi->s_mb_max_linear_groups times, so ac_groups_linear_remaining is removed as it's no longer used. * ac->ac_criteria is now used directly instead of passing cr around. Also, ac->ac_criteria is incremented directly after groups scan fails for the corresponding criteria. * Since we're now directly scanning groups instead of finding a good group then scanning, the following variables and flags are no longer needed, s_bal_cX_groups_considered is sufficient. s_bal_p2_aligned_bad_suggestions s_bal_goal_fast_bad_suggestions s_bal_best_avail_bad_suggestions EXT4_MB_CR_POWER2_ALIGNED_OPTIMIZED EXT4_MB_CR_GOAL_LEN_FAST_OPTIMIZED EXT4_MB_CR_BEST_AVAIL_LEN_OPTIMIZED Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250714130327.1830534-17-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2025-07-25 09:14:17 -04:00
Baokun Li	f7eaacbb4e	ext4: convert free groups order lists to xarrays While traversing the list, holding a spin_lock prevents load_buddy, making direct use of ext4_try_lock_group impossible. This can lead to a bouncing scenario where spin_is_locked(grp_A) succeeds, but ext4_try_lock_group() fails, forcing the list traversal to repeatedly restart from grp_A. In contrast, linear traversal directly uses ext4_try_lock_group(), avoiding this bouncing. Therefore, we need a lockless, ordered traversal to achieve linear-like efficiency. Therefore, this commit converts both average fragment size lists and largest free order lists into ordered xarrays. In an xarray, the index represents the block group number and the value holds the block group information; a non-empty value indicates the block group's presence. While insertion and deletion complexity remain O(1), lookup complexity changes from O(1) to O(nlogn), which may slightly reduce single-threaded performance. Additionally, xarray insertions might fail, potentially due to memory allocation issues. However, since we have linear traversal as a fallback, this isn't a major problem. Therefore, we've only added a warning message for insertion failures here. A helper function ext4_mb_find_good_group_xarray() is added to find good groups in the specified xarray starting at the specified position start, and when it reaches ngroups-1, it wraps around to 0 and then to start-1. This ensures an ordered traversal within the xarray. Performance test results are as follows: Single-process operations on an empty disk show negligible impact, while multi-process workloads demonstrate a noticeable performance gain. \|CPU: Kunpeng 920 \| P80 \| P1 \| \|Memory: 512GB \|------------------------\|-------------------------\| \|960GB SSD (0.5GB/s)\| base \| patched \| base \| patched \| \|-------------------\|-------\|----------------\|--------\|----------------\| \|mb_optimize_scan=0 \| 20097 \| 19555 (-2.6%) \| 316141 \| 315636 (-0.2%) \| \|mb_optimize_scan=1 \| 13318 \| 15496 (+16.3%) \| 325273 \| 323569 (-0.5%) \| \|CPU: AMD 9654 * 2 \| P96 \| P1 \| \|Memory: 1536GB \|------------------------\|-------------------------\| \|960GB SSD (1GB/s) \| base \| patched \| base \| patched \| \|-------------------\|-------\|----------------\|--------\|----------------\| \|mb_optimize_scan=0 \| 53603 \| 53192 (-0.7%) \| 214243 \| 212678 (-0.7%) \| \|mb_optimize_scan=1 \| 20887 \| 37636 (+80.1%) \| 213632 \| 214189 (+0.2%) \| [ Applied spelling fixes per discussion on the ext4-list see thread referened in the Link tag. --tytso] Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250714130327.1830534-16-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2025-07-25 09:14:17 -04:00
Baokun Li	9c08e42db9	ext4: factor out ext4_mb_scan_group() Extract ext4_mb_scan_group() to make the code clearer and to prepare for the later conversion of 'choose group' to 'scan groups'. No functional changes. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250714130327.1830534-15-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2025-07-25 09:14:17 -04:00
Baokun Li	5abd85f667	ext4: factor out ext4_mb_might_prefetch() Extract ext4_mb_might_prefetch() to make the code clearer and to prepare for the later conversion of 'choose group' to 'scan groups'. No functional changes. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250714130327.1830534-14-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2025-07-25 09:14:17 -04:00
Baokun Li	45704f92e5	ext4: factor out __ext4_mb_scan_group() Extract __ext4_mb_scan_group() to make the code clearer and to prepare for the later conversion of 'choose group' to 'scan groups'. No functional changes. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250714130327.1830534-13-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2025-07-25 09:14:17 -04:00
Baokun Li	7d345aa1fa	ext4: fix largest free orders lists corruption on mb_optimize_scan switch The grp->bb_largest_free_order is updated regardless of whether mb_optimize_scan is enabled. This can lead to inconsistencies between grp->bb_largest_free_order and the actual s_mb_largest_free_orders list index when mb_optimize_scan is repeatedly enabled and disabled via remount. For example, if mb_optimize_scan is initially enabled, largest free order is 3, and the group is in s_mb_largest_free_orders[3]. Then, mb_optimize_scan is disabled via remount, block allocations occur, updating largest free order to 2. Finally, mb_optimize_scan is re-enabled via remount, more block allocations update largest free order to 1. At this point, the group would be removed from s_mb_largest_free_orders[3] under the protection of s_mb_largest_free_orders_locks[2]. This lock mismatch can lead to list corruption. To fix this, whenever grp->bb_largest_free_order changes, we now always attempt to remove the group from its old order list. However, we only insert the group into the new order list if `mb_optimize_scan` is enabled. This approach helps prevent lock inconsistencies and ensures the data in the order lists remains reliable. Fixes: `196e402adf` ("ext4: improve cr 0 / cr 1 group scanning") CC: stable@vger.kernel.org Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250714130327.1830534-12-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2025-07-25 09:14:17 -04:00
Baokun Li	1c320d8e92	ext4: fix zombie groups in average fragment size lists Groups with no free blocks shouldn't be in any average fragment size list. However, when all blocks in a group are allocated(i.e., bb_fragments or bb_free is 0), we currently skip updating the average fragment size, which means the group isn't removed from its previous s_mb_avg_fragment_size[old] list. This created "zombie" groups that were always skipped during traversal as they couldn't satisfy any block allocation requests, negatively impacting traversal efficiency. Therefore, when a group becomes completely full, bb_avg_fragment_size_order is now set to -1. If the old order was not -1, a removal operation is performed; if the new order is not -1, an insertion is performed. Fixes: `196e402adf` ("ext4: improve cr 0 / cr 1 group scanning") CC: stable@vger.kernel.org Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250714130327.1830534-11-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2025-07-25 09:14:17 -04:00
Baokun Li	e7f101a808	ext4: merge freed extent with existing extents before insertion Attempt to merge ext4_free_data with already inserted free extents prior to adding new ones. This strategy drastically cuts down the number of times locks are held. For example, if prev, new, and next extents are all mergeable, the existing code (before this patch) requires acquiring the s_md_lock three times: prev merge into new and free prev // hold lock next merge into new and free next // hold lock insert new // hold lock After the patch, it only needs to be acquired once: new merge into next and free new // no lock next merge into prev and free next // hold lock Performance test data follows: Test: Running will-it-scale/fallocate2 on CPU-bound containers. Observation: Average fallocate operations per container per second. \|CPU: Kunpeng 920 \| P80 \| P1 \| \|Memory: 512GB \|------------------------\|-------------------------\| \|960GB SSD (0.5GB/s)\| base \| patched \| base \| patched \| \|-------------------\|-------\|----------------\|--------\|----------------\| \|mb_optimize_scan=0 \| 20043 \| 20097 (+0.2%) \| 314331 \| 316141 (+0.5%) \| \|mb_optimize_scan=1 \| 7290 \| 13318 (+87.4%) \| 324226 \| 325273 (+0.3%) \| \|CPU: AMD 9654 * 2 \| P96 \| P1 \| \|Memory: 1536GB \|------------------------\|-------------------------\| \|960GB SSD (1GB/s) \| base \| patched \| base \| patched \| \|-------------------\|-------\|----------------\|--------\|----------------\| \|mb_optimize_scan=0 \| 54999 \| 53603 (-2.5%) \| 214380 \| 214243 (-0.06%)\| \|mb_optimize_scan=1 \| 13497 \| 20887 (+54.6%) \| 216276 \| 213632 (-1.2%) \| Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250714130327.1830534-10-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2025-07-25 09:14:17 -04:00
Baokun Li	0a2326f6ae	ext4: convert sbi->s_mb_free_pending to atomic_t Previously, s_md_lock was used to protect s_mb_free_pending during modifications, while smp_mb() ensured fresh reads, so s_md_lock just guarantees the atomicity of s_mb_free_pending. Thus we optimized it by converting s_mb_free_pending into an atomic variable, thereby eliminating s_md_lock and minimizing lock contention. This also prepares for future lockless merging of free extents. Following this modification, s_md_lock is exclusively responsible for managing insertions and deletions within s_freed_data_list, along with operations involving list_splice. Performance test data follows: Test: Running will-it-scale/fallocate2 on CPU-bound containers. Observation: Average fallocate operations per container per second. \|CPU: Kunpeng 920 \| P80 \| P1 \| \|Memory: 512GB \|------------------------\|-------------------------\| \|960GB SSD (0.5GB/s)\| base \| patched \| base \| patched \| \|-------------------\|-------\|----------------\|--------\|----------------\| \|mb_optimize_scan=0 \| 19628 \| 20043 (+2.1%) \| 320885 \| 314331 (-2.0%) \| \|mb_optimize_scan=1 \| 7129 \| 7290 (+2.2%) \| 321275 \| 324226 (+0.9%) \| \|CPU: AMD 9654 * 2 \| P96 \| P1 \| \|Memory: 1536GB \|------------------------\|-------------------------\| \|960GB SSD (1GB/s) \| base \| patched \| base \| patched \| \|-------------------\|-------\|----------------\|--------\|----------------\| \|mb_optimize_scan=0 \| 53760 \| 54999 (+2.3%) \| 213145 \| 214380 (+0.5%) \| \|mb_optimize_scan=1 \| 12716 \| 13497 (+6.1%) \| 215262 \| 216276 (+0.4%) \| Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250714130327.1830534-9-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2025-07-25 09:14:17 -04:00
Baokun Li	9a0ed16981	ext4: fix typo in CR_GOAL_LEN_SLOW comment Remove the superfluous "find_". Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250714130327.1830534-8-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2025-07-25 09:14:17 -04:00
Baokun Li	4d18a0b982	ext4: get rid of some obsolete EXT4_MB_HINT flags Since nobody has used these EXT4_MB_HINT flags for ages, let's remove them. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250714130327.1830534-7-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2025-07-25 09:14:16 -04:00
Baokun Li	8f2c3b7486	ext4: utilize multiple global goals to reduce contention When allocating data blocks, if the first try (goal allocation) fails and stream allocation is on, it tries a global goal starting from the last group we used (s_mb_last_group). This helps cluster large files together to reduce free space fragmentation, and the data block contiguity also accelerates write-back to disk. However, when multiple processes allocate blocks, having just one global goal means they all fight over the same group. This drastically lowers the chances of extents merging and leads to much worse file fragmentation. To mitigate this multi-process contention, we now employ multiple global goals, with the number of goals being the minimum between the number of possible CPUs and one-quarter of the filesystem's total block group count. To ensure a consistent goal for each inode, we select the corresponding goal by taking the inode number modulo the total number of goals. Performance test data follows: Test: Running will-it-scale/fallocate2 on CPU-bound containers. Observation: Average fallocate operations per container per second. \|CPU: Kunpeng 920 \| P80 \| P1 \| \|Memory: 512GB \|------------------------\|-------------------------\| \|960GB SSD (0.5GB/s)\| base \| patched \| base \| patched \| \|-------------------\|-------\|----------------\|--------\|----------------\| \|mb_optimize_scan=0 \| 9636 \| 19628 (+103%) \| 337597 \| 320885 (-4.9%) \| \|mb_optimize_scan=1 \| 4834 \| 7129 (+47.4%) \| 341440 \| 321275 (-5.9%) \| \|CPU: AMD 9654 * 2 \| P96 \| P1 \| \|Memory: 1536GB \|------------------------\|-------------------------\| \|960GB SSD (1GB/s) \| base \| patched \| base \| patched \| \|-------------------\|-------\|----------------\|--------\|----------------\| \|mb_optimize_scan=0 \| 22341 \| 53760 (+140%) \| 219707 \| 213145 (-2.9%) \| \|mb_optimize_scan=1 \| 9177 \| 12716 (+38.5%) \| 215732 \| 215262 (+0.2%) \| Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250714130327.1830534-6-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2025-07-25 09:14:16 -04:00
Baokun Li	4b41deb896	ext4: remove unnecessary s_md_lock on update s_mb_last_group After we optimized the block group lock, we found another lock contention issue when running will-it-scale/fallocate2 with multiple processes. The fallocate's block allocation and the truncate's block release were fighting over the s_md_lock. The problem is, this lock protects totally different things in those two processes: the list of freed data blocks (s_freed_data_list) when releasing, and where to start looking for new blocks (mb_last_group) when allocating. Now we only need to track s_mb_last_group and no longer need to track s_mb_last_start, so we don't need the s_md_lock lock to ensure that the two are consistent. Since s_mb_last_group is merely a hint and doesn't require strong synchronization, READ_ONCE/WRITE_ONCE is sufficient. Besides, the s_mb_last_group data type only requires ext4_group_t (i.e., unsigned int), rendering unsigned long superfluous. Performance test data follows: Test: Running will-it-scale/fallocate2 on CPU-bound containers. Observation: Average fallocate operations per container per second. \|CPU: Kunpeng 920 \| P80 \| P1 \| \|Memory: 512GB \|------------------------\|-------------------------\| \|960GB SSD (0.5GB/s)\| base \| patched \| base \| patched \| \|-------------------\|-------\|----------------\|--------\|----------------\| \|mb_optimize_scan=0 \| 4821 \| 9636 (+99.8%) \| 314065 \| 337597 (+7.4%) \| \|mb_optimize_scan=1 \| 4784 \| 4834 (+1.04%) \| 316344 \| 341440 (+7.9%) \| \|CPU: AMD 9654 * 2 \| P96 \| P1 \| \|Memory: 1536GB \|------------------------\|-------------------------\| \|960GB SSD (1GB/s) \| base \| patched \| base \| patched \| \|-------------------\|-------\|----------------\|--------\|----------------\| \|mb_optimize_scan=0 \| 15371 \| 22341 (+45.3%) \| 205851 \| 219707 (+6.7%) \| \|mb_optimize_scan=1 \| 6101 \| 9177 (+50.4%) \| 207373 \| 215732 (+4.0%) \| Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250714130327.1830534-5-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2025-07-25 09:14:16 -04:00
Baokun Li	f0374d8071	ext4: remove unnecessary s_mb_last_start Since stream allocation does not use ac->ac_f_ex.fe_start, it is set to -1 by default, so the no longer needed sbi->s_mb_last_start is removed. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250714130327.1830534-4-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2025-07-25 09:14:16 -04:00
Baokun Li	35bfd4b44e	ext4: separate stream goal hits from s_bal_goals for better tracking In ext4_mb_regular_allocator(), after the call to ext4_mb_find_by_goal() fails to achieve the inode goal, allocation continues with the stream allocation global goal. Currently, hits for both are combined in sbi->s_bal_goals, hindering accurate optimization. This commit separates global goal hits into sbi->s_bal_stream_goals. Since stream allocation doesn't use ac->ac_g_ex.fe_start, set fe_start to -1. This prevents stream allocations from being counted in s_bal_goals. Also clear EXT4_MB_HINT_TRY_GOAL to avoid calling ext4_mb_find_by_goal again. After adding `stream_goal_hits`, `/proc/fs/ext4/sdx/mb_stats` will show: mballoc: reqs: 840347 success: 750992 groups_scanned: 1230506 cr_p2_aligned_stats: hits: 21531 groups_considered: 411664 extents_scanned: 21531 useless_loops: 0 bad_suggestions: 6 cr_goal_fast_stats: hits: 111222 groups_considered: 1806728 extents_scanned: 467908 useless_loops: 0 bad_suggestions: 13 cr_best_avail_stats: hits: 36267 groups_considered: 1817631 extents_scanned: 156143 useless_loops: 0 bad_suggestions: 204 cr_goal_slow_stats: hits: 106396 groups_considered: 5671710 extents_scanned: 22540056 useless_loops: 123747 cr_any_free_stats: hits: 138071 groups_considered: 724692 extents_scanned: 23615593 useless_loops: 585 extents_scanned: 46804261 goal_hits: 1307 stream_goal_hits: 236317 len_goal_hits: 155549 2^n_hits: 21531 breaks: 225096 lost: 35062 buddies_generated: 40/40 buddies_time_used: 48004 preallocated: 5962467 discarded: 4847560 Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250714130327.1830534-3-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2025-07-25 09:14:16 -04:00
Baokun Li	e9eec6f339	ext4: add ext4_try_lock_group() to skip busy groups When ext4 allocates blocks, we used to just go through the block groups one by one to find a good one. But when there are tons of block groups (like hundreds of thousands or even millions) and not many have free space (meaning they're mostly full), it takes a really long time to check them all, and performance gets bad. So, we added the "mb_optimize_scan" mount option (which is on by default now). It keeps track of some group lists, so when we need a free block, we can just grab a likely group from the right list. This saves time and makes block allocation much faster. But when multiple processes or containers are doing similar things, like constantly allocating 8k blocks, they all try to use the same block group in the same list. Even just two processes doing this can cut the IOPS in half. For example, one container might do 300,000 IOPS, but if you run two at the same time, the total is only 150,000. Since we can already look at block groups in a non-linear way, the first and last groups in the same list are basically the same for finding a block right now. Therefore, add an ext4_try_lock_group() helper function to skip the current group when it is locked by another process, thereby avoiding contention with other processes. This helps ext4 make better use of having multiple block groups. Also, to make sure we don't skip all the groups that have free space when allocating blocks, we won't try to skip busy groups anymore when ac_criteria is CR_ANY_FREE. Performance test data follows: Test: Running will-it-scale/fallocate2 on CPU-bound containers. Observation: Average fallocate operations per container per second. \|CPU: Kunpeng 920 \| P80 \| \|Memory: 512GB \|-------------------------\| \|960GB SSD (0.5GB/s)\| base \| patched \| \|-------------------\|-------\|-----------------\| \|mb_optimize_scan=0 \| 2667 \| 4821 (+80.7%) \| \|mb_optimize_scan=1 \| 2643 \| 4784 (+81.0%) \| \|CPU: AMD 9654 * 2 \| P96 \| \|Memory: 1536GB \|-------------------------\| \|960GB SSD (1GB/s) \| base \| patched \| \|-------------------\|-------\|-----------------\| \|mb_optimize_scan=0 \| 3450 \| 15371 (+345%) \| \|mb_optimize_scan=1 \| 3209 \| 6101 (+90.0%) \| Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250714130327.1830534-2-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2025-07-25 09:14:16 -04:00
Zhang Yi	82e6381e23	ext4: initialize superblock fields in the kballoc-test.c kunit tests Various changes in the "ext4: better scalability for ext4 block allocation" patch series have resulted in kunit test failures, most notably in the test_new_blocks_simple and the test_mb_mark_used tests. The root cause of these failures is that various in-memory ext4 data structures were not getting initialized, and while previous versions of the functions exercised by the unit tests didn't use these structure members, this was arguably a test bug. Since one of the patches in the block allocation scalability patches is a fix which is has a cc:stable tag, this commit also has a cc:stable tag. CC: stable@vger.kernel.org Link: https://lore.kernel.org/r/20250714130327.1830534-1-libaokun1@huawei.com Link: https://patch.msgid.link/20250725021550.3177573-1-yi.zhang@huaweicloud.com Link: https://patch.msgid.link/20250725021654.3188798-1-yi.zhang@huaweicloud.com Reported-by: Guenter Roeck <linux@roeck-us.net> Closes: https://lore.kernel.org/linux-ext4/b0635ad0-7ebf-4152-a69b-58e7e87d5085@roeck-us.net/ Tested-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2025-07-25 09:13:44 -04:00
Theodore Ts'o	90f097b140	ext4: refactor the inline directory conversion and new directory codepaths There was a lot of common code in the codepaths used to convert an inline directory and to creaet a new directory. To address this, rename ext4_init_dot_dotdot() to ext4_init_dirblock() and then move common code into that function. This reduces the lines of code count in fs/ext4/inline.c and fs/ext4/namei.c, as well as reducing the size of their object files. Signed-off-by: Theodore Ts'o <tytso@mit.edu> Link: https://patch.msgid.link/20250712181249.434530-3-tytso@mit.edu Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2025-07-17 23:25:21 -04:00
Theodore Ts'o	a35454ecf8	ext4: use memcpy() instead of strcpy() The strcpy() function is considered dangerous and eeeevil by people who are using sophisticated code analysis tools such as "grep". This is true even when a quick inspection would show that the source is a constant string ("." or "..") and the destination is a fixed array which is guaranteed to have enough space. Make the "grep" code analysis tool happy by using memcpy() isstead of strcpy(). :-) Signed-off-by: Theodore Ts'o <tytso@mit.edu> Link: https://patch.msgid.link/20250712181249.434530-2-tytso@mit.edu Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2025-07-17 23:25:21 -04:00
Theodore Ts'o	3658b8b339	ext4: replace strcmp with direct comparison for '.' and '..' In a discussion over a proposed patch, "ext4: replace strcpy() with '.' assignment"[1], I had asserted that directory entries in ext4 were not NUL terminated, and hence it was safe to replace strcpy() with a direct assignment. As it turns out, this was incorrect. It's true for all all directory entries except for '.' and '..' where the kernel was using strcmp() and where e2fsck actually checks and offers to fix things if '.' and '..' are not NUL terminated. [1] https://lore.kernel.org/r/202505191316.JJMnPobO-lkp@intel.com We can't change this without breaking old kernel versions, but in the spirit of "be liberal in what you receive", use direct comparison of de->name_len and de->name[0,1] instead of strcmp(). This has the side benefit of reducing the compiled text size by 96 bytes on x86_64. Signed-off-by: Theodore Ts'o <tytso@mit.edu> Link: https://patch.msgid.link/20250712181249.434530-1-tytso@mit.edu Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2025-07-17 23:25:21 -04:00
Jan Kara	91b8ca8b26	ext4: Make sure BH_New bit is cleared in ->write_end handler Currently we clear BH_New bit in case of error and also in the standard ext4_write_end() handler (in block_commit_write()). However ext4_journalled_write_end() misses this clearing and thus we are leaving stale BH_New bits behind. Generally ext4_block_write_begin() clears these bits before any harm can be done but in case blocksize < pagesize and we hit some error when processing a page with these stale bits, we'll try to zero buffers with these stale BH_New bits and jbd2 will complain (as buffers were not prepared for writing in this transaction). Fix the problem by clearing BH_New bits in ext4_journalled_write_end() and WARN if ext4_block_write_begin() sees stale BH_New bits. Reported-by: Baolin Liu <liubaolin12138@163.com> Reported-by: Zhi Long <longzhi@sangfor.com.cn> Fixes: `3910b513fc` ("ext4: persist the new uptodate buffers in ext4_journalled_zero_new_buffers") Signed-off-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20250709084831.23876-2-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2025-07-17 23:25:21 -04:00
Baokun Li	c678bdc998	ext4: fix inode use after free in ext4_end_io_rsv_work() In ext4_io_end_defer_completion(), check if io_end->list_vec is empty to avoid adding an io_end that requires no conversion to the i_rsv_conversion_list, which in turn prevents starting an unnecessary worker. An ext4_emergency_state() check is also added to avoid attempting to abort the journal in an emergency state. Additionally, ext4_put_io_end_defer() is refactored to call ext4_io_end_defer_completion() directly instead of being open-coded. This also prevents starting an unnecessary worker when EXT4_IO_END_FAILED is set but data_err=abort is not enabled. This ensures that the check in ext4_put_io_end_defer() is consistent with the check in ext4_end_bio(). Otherwise, we might add an io_end to the i_rsv_conversion_list and then call ext4_finish_bio(), after which the inode could be freed before ext4_end_io_rsv_work() is called, triggering a use-after-free issue. Fixes: `ce51afb8cc` ("ext4: abort journal on data writeback failure if in data_err=abort mode") Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20250708111504.3208660-1-libaokun@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2025-07-17 23:25:21 -04:00
I Hsin Cheng	9d9076238f	ext4: Refactor breaking condition for xattr_find_entry() Refactor the condition for breaking the loop within xattr_find_entry(). Elimate the usage of "<=" and take condition shortcut when "!cmp" is true. Originally, the condition was "(cmp <= 0 && (sorted \|\| cmp == 0))", which means after it knows "cmp <= 0" is true, it has to check the value of "sorted" and "cmp". The checking of "cmp" here would be redundant since it has already checked it. Observing from the logic, when "cmp == 0" the branch is going to be true, no need to check "cmp == 0" again, so we only need to take shortcut when "cmp == 0", on the other hand, we'll check "sorted" when "cmp < 0". The refactor can shrink the generated code size by 44 bytes. Numerous instructions can be saved thus should also benefit execution efficiency as well. $ ./scripts/bloat-o-meter vmlinux_old vmlinux_new add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-44 (-44) Function old new delta xattr_find_entry 300 256 -44 Total: Before=22989434, After=22989390, chg -0.00% The test is done on kernel version 6.16 with x86_64 defconfig and gcc 13.3.0. Signed-off-by: I Hsin Cheng <richard120310@gmail.com> Link: https://patch.msgid.link/20250708020013.175728-1-richard120310@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2025-07-17 10:41:05 -04:00
Zhang Yi	b12f423d59	ext4: limit the maximum folio order In environments with a page size of 64KB, the maximum size of a folio can reach up to 128MB. Consequently, during the write-back of folios, the 'rsv_blocks' will be overestimated to 1,577, which can make pressure on the journal space where the journal is small. This can easily exceed the limit of a single transaction. Besides, an excessively large folio is meaningless and will instead increase the overhead of traversing the bhs within the folio. Therefore, limit the maximum order of a folio to 2048 filesystem blocks. Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org> Reported-by: Joseph Qi <jiangqi903@gmail.com> Closes: https://lore.kernel.org/linux-ext4/CA+G9fYsyYQ3ZL4xaSg1-Tt5Evto7Zd+hgNWZEa9cQLbahA1+xg@mail.gmail.com/ Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Tested-by: Joseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20250707140814.542883-12-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2025-07-14 23:48:15 -04:00
Zhang Yi	5137d6c890	ext4: fix insufficient credits calculation in ext4_meta_trans_blocks() The calculation of journal credits in ext4_meta_trans_blocks() should include pextents, as each extent separately may be allocated from a different group and thus need to update different bitmap and group descriptor block. Fixes: `0e32d86170` ("ext4: correct the journal credits calculations of allocating blocks") Reported-by: Jan Kara <jack@suse.cz> Closes: https://lore.kernel.org/linux-ext4/nhxfuu53wyacsrq7xqgxvgzcggyscu2tbabginahcygvmc45hy@t4fvmyeky33e/ Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Baokun Li <libaokun1@huawei.com> Link: https://patch.msgid.link/20250707140814.542883-11-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2025-07-13 23:41:52 -04:00
Zhang Yi	57661f2875	ext4: replace ext4_writepage_trans_blocks() After ext4 supports large folios, the semantics of reserving credits in pages is no longer applicable. In most scenarios, reserving credits in extents is sufficient. Therefore, introduce ext4_chunk_trans_extent() to replace ext4_writepage_trans_blocks(). move_extent_per_page() is the only remaining location where we are still processing extents in pages. Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20250707140814.542883-10-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2025-07-13 23:41:52 -04:00
Zhang Yi	bbbf150f3f	ext4: reserved credits for one extent during the folio writeback After ext4 supports large folios, reserving journal credits for one maximum-ordered folio based on the worst case cenario during the writeback process can easily exceed the maximum transaction credits. Additionally, reserving journal credits for one page is also no longer appropriate. Currently, the folio writeback process can either extend the journal credits or initiate a new transaction if the currently reserved journal credits are insufficient. Therefore, it can be modified to reserve credits for only one extent at the outset. In most cases involving continuous mapping, these credits are generally adequate, and we may only need to perform some basic credit expansion. However, in extreme cases where the block size and folio size differ significantly, or when the folios are sufficiently discontinuous, it may be necessary to restart a new transaction and resubmit the folios. Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20250707140814.542883-9-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2025-07-13 23:41:52 -04:00
Zhang Yi	95ad8ee45c	ext4: correct the reserved credits for extent conversion Now, we reserve journal credits for converting extents in only one page to written state when the I/O operation is complete. This is insufficient when large folio is enabled. Fix this by reserving credits for converting up to one extent per block in the largest 2MB folio, this calculation should only involve extents index and leaf blocks, so it should not estimate too many credits. Fixes: `7ac67301e8` ("ext4: enable large folio for regular file") Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Baokun Li <libaokun1@huawei.com> Link: https://patch.msgid.link/20250707140814.542883-8-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2025-07-13 23:41:52 -04:00
Zhang Yi	6b132759b0	ext4: enhance tracepoints during the folios writeback After mpage_map_and_submit_extent() supports restarting handle if credits are insufficient during allocating blocks, it is more likely to exit the current mapping iteration and continue to process the current processing partially mapped folio again. The existing tracepoints are not sufficient to track this situation, so enhance the tracepoints to track the writeback position and the return value before and after submitting the folios. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20250707140814.542883-7-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2025-07-13 23:41:51 -04:00
Zhang Yi	e2c4c49dee	ext4: restart handle if credits are insufficient during allocating blocks After large folios are supported on ext4, writing back a sufficiently large and discontinuous folio may consume a significant number of journal credits, placing considerable strain on the journal. For example, in a 20GB filesystem with 1K block size and 1MB journal size, writing back a 2MB folio could require thousands of credits in the worst-case scenario (when each block is discontinuous and distributed across different block groups), potentially exceeding the journal size. This issue can also occur in ext4_write_begin() and ext4_page_mkwrite() when delalloc is not enabled. Fix this by ensuring that there are sufficient journal credits before allocating an extent in mpage_map_one_extent() and ext4_block_write_begin(). If there are not enough credits, return -EAGAIN, exit the current mapping loop, restart a new handle and a new transaction, and allocating blocks on this folio again in the next iteration. Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20250707140814.542883-6-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2025-07-13 23:41:51 -04:00
Zhang Yi	2bddafea3d	ext4: refactor the block allocation process of ext4_page_mkwrite() The block allocation process and error handling in ext4_page_mkwrite() is complex now. Refactor it by introducing a new helper function, ext4_block_page_mkwrite(). It will call ext4_block_write_begin() to allocate blocks instead of directly calling block_page_mkwrite(). Preparing to implement retry logic in a subsequent patch to address situations where the reserved journal credits are insufficient. Additionally, this modification will help prevent potential deadlocks that may occur when waiting for folio writeback while holding the transaction handle. Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20250707140814.542883-5-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2025-07-13 23:41:51 -04:00
Zhang Yi	ded2d726a3	ext4: fix stale data if it bail out of the extents mapping loop During the process of writing back folios, if mpage_map_and_submit_extent() exits the extent mapping loop due to an ENOSPC or ENOMEM error, it may result in stale data or filesystem inconsistency in environments where the block size is smaller than the folio size. When mapping a discontinuous folio in mpage_map_and_submit_extent(), some buffers may have already be mapped. If we exit the mapping loop prematurely, the folio data within the mapped range will not be written back, and the file's disk size will not be updated. Once the transaction that includes this range of extents is committed, this can lead to stale data or filesystem inconsistency. Fix this by submitting the current processing partially mapped folio. Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20250707140814.542883-4-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2025-07-13 23:41:51 -04:00
Zhang Yi	f922c8c246	ext4: move the calculation of wbc->nr_to_write to mpage_folio_done() mpage_folio_done() should be a more appropriate place than mpage_submit_folio() for updating the wbc->nr_to_write after we have submitted a fully mapped folio. Preparing to make mpage_submit_folio() allows to submit partially mapped folio that is still under processing. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Baokun Li <libaokun1@huawei.com> Link: https://patch.msgid.link/20250707140814.542883-3-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2025-07-13 23:41:51 -04:00
Zhang Yi	1bfe6354e0	ext4: process folios writeback in bytes Since ext4 supports large folios, processing writebacks in pages is no longer appropriate, it can be modified to process writebacks in bytes. Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20250707140814.542883-2-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2025-07-13 23:41:51 -04:00
Baolin Liu	a073e8577f	ext4: remove unused EXT_STATS macro from ext4_extents.h The EXT_STATS macro in fs/ext4/ext4_extents.h has been defined but never used in the codebase since its introduction. This patch removes it. Analysis: 1. No references found in fs/ext4/ or other kernel code. 2. No impact on compilation or functionality. 3. Git history shows it was never utilized. Signed-off-by: Baolin Liu <liubaolin@kylinos.cn> Reviewed-by: Baokun Li <libaokun1@huawei.com> Link: https://patch.msgid.link/20250527053805.1550912-1-liubaolin12138@163.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2025-07-12 19:01:38 -04:00
Dan Carpenter	c5da1f6694	ext4: remove unnecessary duplicate check in ext4_map_blocks() The previous lines ensure that EXT4_GET_BLOCKS_QUERY_LAST_IN_LEAF is set so remove this duplicate check. Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://patch.msgid.link/aDCdjUhpzxB64vkD@stanley.mountain Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2025-07-11 16:52:59 -04:00
Jinliang Zheng	b6f3801727	ext4: remove duplicate check for EXT4_FC_REPLAY EXT4_FC_REPLAY will be checked in ext4_es_lookup_extent(). If it is set, ext4_es_lookup_extent() will return 0. Remove the repeated check for EXT4_FC_REPLAY in ext4_map_blocks() to simplify the code. Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250429111722.294975-1-alexjlzheng@tencent.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2025-07-10 21:33:03 -04:00
Linus Torvalds	aaf724ed69	Merge tag 'v6.16-rc3-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6 Pull smb client fixes from Steve French: - Multichannel reconnect lock ordering deadlock fix - Fix for regression in handling native Windows symlinks - Three smbdirect fixes: - oops in RDMA response processing - smbdirect memcpy issue - fix smbdirect regression with large writes (smbdirect test cases now all passing) - Fix for "FAILED_TO_PARSE" warning in trace-cmd report output * tag 'v6.16-rc3-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6: cifs: Fix reading into an ITER_FOLIOQ from the smbdirect code cifs: Fix the smbd_response slab to allow usercopy smb: client: fix potential deadlock when reconnecting channels smb: client: remove \t from TP_printk statements smb: client: let smbd_post_send_iter() respect the peers max_send_size and transmit all data smb: client: fix regression with native SMB symlinks	2025-06-27 20:38:05 -07:00
Linus Torvalds	0fd39af24e	Merge tag 'mm-hotfixes-stable-2025-06-27-16-56' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "16 hotfixes. 6 are cc:stable and the remainder address post-6.15 issues or aren't considered necessary for -stable kernels. 5 are for MM" * tag 'mm-hotfixes-stable-2025-06-27-16-56' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: MAINTAINERS: add Lorenzo as THP co-maintainer mailmap: update Duje Mihanović's email address selftests/mm: fix validate_addr() helper crashdump: add CONFIG_KEYS dependency mailmap: correct name for a historical account of Zijun Hu mailmap: add entries for Zijun Hu fuse: fix runtime warning on truncate_folio_batch_exceptionals() scripts/gdb: fix dentry_name() lookup mm/damon/sysfs-schemes: free old damon_sysfs_scheme_filter->memcg_path on write mm/alloc_tag: fix the kmemleak false positive issue in the allocation of the percpu variable tag->counters lib/group_cpus: fix NULL pointer dereference from group_cpus_evenly() mm/hugetlb: remove unnecessary holding of hugetlb_lock MAINTAINERS: add missing files to mm page alloc section MAINTAINERS: add tree entry to mm init block mm: add OOM killer maintainer structure fs/proc/task_mmu: fix PAGE_IS_PFNZERO detection for the huge zero folio	2025-06-27 20:34:10 -07:00
Linus Torvalds	6f2a71a99e	Merge tag 'bcachefs-2025-06-26' of git://evilpiepirate.org/bcachefs Pull bcachefs fixes from Kent Overstreet: - Lots of small check/repair fixes, primarily in subvol loop and directory structure loop (when involving snapshots). - Fix a few 6.16 regressions: rare UAF in the foreground allocator path when taking a transaction restart from the transaction bump allocator, and some small fallout from the change to log the error being corrected in the journal when repairing errors, also some fallout from the btree node read error logging improvements. (Alan, Bharadwaj) - New option: journal_rewind This lets the entire filesystem be reset to an earlier point in time. Note that this is only a disaster recovery tool, and right now there are major caveats to using it (discards should be disabled, in particular), but it successfully restored the filesystem of one of the users who was bit by the subvolume deletion bug and didn't have backups. I'll likely be making some changes to the discard path in the future to make this a reliable recovery tool. - Some new btree iterator tracepoints, for tracking down some livelock-ish behaviour we've been seeing in the main data write path. * tag 'bcachefs-2025-06-26' of git://evilpiepirate.org/bcachefs: (51 commits) bcachefs: Plumb correct ip to trans_relock_fail tracepoint bcachefs: Ensure we rewind to run recovery passes bcachefs: Ensure btree node scan runs before checking for scanned nodes bcachefs: btree_root_unreadable_and_scan_found_nothing should not be autofix bcachefs: fix bch2_journal_keys_peek_prev_min() underflow bcachefs: Use wait_on_allocator() when allocating journal bcachefs: Check for bad write buffer key when moving from journal bcachefs: Don't unlock the trans if ret doesn't match BCH_ERR_operation_blocked bcachefs: Fix range in bch2_lookup_indirect_extent() error path bcachefs: fix spurious error_throw bcachefs: Add missing bch2_err_class() to fileattr_set() bcachefs: Add missing key type checks to check_snapshot_exists() bcachefs: Don't log fsck err in the journal if doing repair elsewhere bcachefs: Fix *__bch2_trans_subbuf_alloc() error path bcachefs: Fix missing newlines before ero bcachefs: fix spurious error in read_btree_roots() bcachefs: fsck: Fix oops in key_visible_in_snapshot() bcachefs: fsck: fix unhandled restart in topology repair bcachefs: fsck: Fix check_directory_structure when no check_dirents bcachefs: Fix restart handling in btree_node_scrub_work() ...	2025-06-26 19:49:12 -07:00
David Howells	263debecb4	cifs: Fix reading into an ITER_FOLIOQ from the smbdirect code When performing a file read from RDMA, smbd_recv() prints an "Invalid msg type 4" error and fails the I/O. This is due to the switch-statement there not handling the ITER_FOLIOQ handed down from netfslib. Fix this by collapsing smbd_recv_buf() and smbd_recv_page() into smbd_recv() and just using copy_to_iter() instead of memcpy(). This future-proofs the function too, in case more ITER_* types are added. Fixes: `ee4cdf7ba8` ("netfs: Speed up buffered reading") Reported-by: Stefan Metzmacher <metze@samba.org> Signed-off-by: David Howells <dhowells@redhat.com> cc: Tom Talpey <tom@talpey.com> cc: Paulo Alcantara (Red Hat) <pc@manguebit.com> cc: Matthew Wilcox <willy@infradead.org> cc: linux-cifs@vger.kernel.org cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Signed-off-by: Steve French <stfrench@microsoft.com>	2025-06-26 11:13:16 -05:00

1 2 3 4 5 ...

99675 Commits