Commit Graph

1368284 Commits

Author SHA1 Message Date
Baokun Li
45704f92e5 ext4: factor out __ext4_mb_scan_group()
Extract __ext4_mb_scan_group() to make the code clearer and to
prepare for the later conversion of 'choose group' to 'scan groups'.
No functional changes.

Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250714130327.1830534-13-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-07-25 09:14:17 -04:00
Baokun Li
7d345aa1fa ext4: fix largest free orders lists corruption on mb_optimize_scan switch
The grp->bb_largest_free_order is updated regardless of whether
mb_optimize_scan is enabled. This can lead to inconsistencies between
grp->bb_largest_free_order and the actual s_mb_largest_free_orders list
index when mb_optimize_scan is repeatedly enabled and disabled via remount.

For example, if mb_optimize_scan is initially enabled, largest free
order is 3, and the group is in s_mb_largest_free_orders[3]. Then,
mb_optimize_scan is disabled via remount, block allocations occur,
updating largest free order to 2. Finally, mb_optimize_scan is re-enabled
via remount, more block allocations update largest free order to 1.

At this point, the group would be removed from s_mb_largest_free_orders[3]
under the protection of s_mb_largest_free_orders_locks[2]. This lock
mismatch can lead to list corruption.

To fix this, whenever grp->bb_largest_free_order changes, we now always
attempt to remove the group from its old order list. However, we only
insert the group into the new order list if `mb_optimize_scan` is enabled.
This approach helps prevent lock inconsistencies and ensures the data in
the order lists remains reliable.

Fixes: 196e402adf ("ext4: improve cr 0 / cr 1 group scanning")
CC: stable@vger.kernel.org
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250714130327.1830534-12-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-07-25 09:14:17 -04:00
Baokun Li
1c320d8e92 ext4: fix zombie groups in average fragment size lists
Groups with no free blocks shouldn't be in any average fragment size list.
However, when all blocks in a group are allocated(i.e., bb_fragments or
bb_free is 0), we currently skip updating the average fragment size, which
means the group isn't removed from its previous s_mb_avg_fragment_size[old]
list.

This created "zombie" groups that were always skipped during traversal as
they couldn't satisfy any block allocation requests, negatively impacting
traversal efficiency.

Therefore, when a group becomes completely full, bb_avg_fragment_size_order
is now set to -1. If the old order was not -1, a removal operation is
performed; if the new order is not -1, an insertion is performed.

Fixes: 196e402adf ("ext4: improve cr 0 / cr 1 group scanning")
CC: stable@vger.kernel.org
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250714130327.1830534-11-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-07-25 09:14:17 -04:00
Baokun Li
e7f101a808 ext4: merge freed extent with existing extents before insertion
Attempt to merge ext4_free_data with already inserted free extents prior
to adding new ones. This strategy drastically cuts down the number of
times locks are held.

For example, if prev, new, and next extents are all mergeable, the existing
code (before this patch) requires acquiring the s_md_lock three times:

  prev merge into new and free prev // hold lock
  next merge into new and free next // hold lock
  insert new // hold lock

After the patch, it only needs to be acquired once:

  new merge into next and free new // no lock
  next merge into prev and free next // hold lock

Performance test data follows:

Test: Running will-it-scale/fallocate2 on CPU-bound containers.
Observation: Average fallocate operations per container per second.

|CPU: Kunpeng 920   |          P80           |            P1           |
|Memory: 512GB      |------------------------|-------------------------|
|960GB SSD (0.5GB/s)| base  |    patched     | base   |    patched     |
|-------------------|-------|----------------|--------|----------------|
|mb_optimize_scan=0 | 20043 | 20097 (+0.2%)  | 314331 | 316141 (+0.5%) |
|mb_optimize_scan=1 | 7290  | 13318 (+87.4%) | 324226 | 325273 (+0.3%) |

|CPU: AMD 9654 * 2  |          P96           |             P1          |
|Memory: 1536GB     |------------------------|-------------------------|
|960GB SSD (1GB/s)  | base  |    patched     | base   |    patched     |
|-------------------|-------|----------------|--------|----------------|
|mb_optimize_scan=0 | 54999 | 53603 (-2.5%)  | 214380 | 214243 (-0.06%)|
|mb_optimize_scan=1 | 13497 | 20887 (+54.6%) | 216276 | 213632 (-1.2%) |

Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250714130327.1830534-10-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-07-25 09:14:17 -04:00
Baokun Li
0a2326f6ae ext4: convert sbi->s_mb_free_pending to atomic_t
Previously, s_md_lock was used to protect s_mb_free_pending during
modifications, while smp_mb() ensured fresh reads, so s_md_lock just
guarantees the atomicity of s_mb_free_pending. Thus we optimized it by
converting s_mb_free_pending into an atomic variable, thereby eliminating
s_md_lock and minimizing lock contention. This also prepares for future
lockless merging of free extents.

Following this modification, s_md_lock is exclusively responsible for
managing insertions and deletions within s_freed_data_list, along with
operations involving list_splice.

Performance test data follows:

Test: Running will-it-scale/fallocate2 on CPU-bound containers.
Observation: Average fallocate operations per container per second.

|CPU: Kunpeng 920   |          P80           |            P1           |
|Memory: 512GB      |------------------------|-------------------------|
|960GB SSD (0.5GB/s)| base  |    patched     | base   |    patched     |
|-------------------|-------|----------------|--------|----------------|
|mb_optimize_scan=0 | 19628 | 20043 (+2.1%)  | 320885 | 314331 (-2.0%) |
|mb_optimize_scan=1 | 7129  | 7290  (+2.2%)  | 321275 | 324226 (+0.9%) |

|CPU: AMD 9654 * 2  |          P96           |             P1          |
|Memory: 1536GB     |------------------------|-------------------------|
|960GB SSD (1GB/s)  | base  |    patched     | base   |    patched     |
|-------------------|-------|----------------|--------|----------------|
|mb_optimize_scan=0 | 53760 | 54999 (+2.3%)  | 213145 | 214380 (+0.5%) |
|mb_optimize_scan=1 | 12716 | 13497 (+6.1%)  | 215262 | 216276 (+0.4%) |

Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250714130327.1830534-9-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-07-25 09:14:17 -04:00
Baokun Li
9a0ed16981 ext4: fix typo in CR_GOAL_LEN_SLOW comment
Remove the superfluous "find_".

Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250714130327.1830534-8-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-07-25 09:14:17 -04:00
Baokun Li
4d18a0b982 ext4: get rid of some obsolete EXT4_MB_HINT flags
Since nobody has used these EXT4_MB_HINT flags for ages,
let's remove them.

Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250714130327.1830534-7-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-07-25 09:14:16 -04:00
Baokun Li
8f2c3b7486 ext4: utilize multiple global goals to reduce contention
When allocating data blocks, if the first try (goal allocation) fails and
stream allocation is on, it tries a global goal starting from the last
group we used (s_mb_last_group). This helps cluster large files together
to reduce free space fragmentation, and the data block contiguity also
accelerates write-back to disk.

However, when multiple processes allocate blocks, having just one global
goal means they all fight over the same group. This drastically lowers
the chances of extents merging and leads to much worse file fragmentation.

To mitigate this multi-process contention, we now employ multiple global
goals, with the number of goals being the minimum between the number of
possible CPUs and one-quarter of the filesystem's total block group count.

To ensure a consistent goal for each inode, we select the corresponding
goal by taking the inode number modulo the total number of goals.

Performance test data follows:

Test: Running will-it-scale/fallocate2 on CPU-bound containers.
Observation: Average fallocate operations per container per second.

|CPU: Kunpeng 920   |          P80           |            P1           |
|Memory: 512GB      |------------------------|-------------------------|
|960GB SSD (0.5GB/s)| base  |    patched     | base   |    patched     |
|-------------------|-------|----------------|--------|----------------|
|mb_optimize_scan=0 | 9636  | 19628 (+103%)  | 337597 | 320885 (-4.9%) |
|mb_optimize_scan=1 | 4834  | 7129  (+47.4%) | 341440 | 321275 (-5.9%) |

|CPU: AMD 9654 * 2  |          P96           |             P1          |
|Memory: 1536GB     |------------------------|-------------------------|
|960GB SSD (1GB/s)  | base  |    patched     | base   |    patched     |
|-------------------|-------|----------------|--------|----------------|
|mb_optimize_scan=0 | 22341 | 53760 (+140%)  | 219707 | 213145 (-2.9%) |
|mb_optimize_scan=1 | 9177  | 12716 (+38.5%) | 215732 | 215262 (+0.2%) |

Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250714130327.1830534-6-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-07-25 09:14:16 -04:00
Baokun Li
4b41deb896 ext4: remove unnecessary s_md_lock on update s_mb_last_group
After we optimized the block group lock, we found another lock
contention issue when running will-it-scale/fallocate2 with multiple
processes. The fallocate's block allocation and the truncate's block
release were fighting over the s_md_lock. The problem is, this lock
protects totally different things in those two processes: the list of
freed data blocks (s_freed_data_list) when releasing, and where to start
looking for new blocks (mb_last_group) when allocating.

Now we only need to track s_mb_last_group and no longer need to track
s_mb_last_start, so we don't need the s_md_lock lock to ensure that the
two are consistent. Since s_mb_last_group is merely a hint and doesn't
require strong synchronization, READ_ONCE/WRITE_ONCE is sufficient.

Besides, the s_mb_last_group data type only requires ext4_group_t
(i.e., unsigned int), rendering unsigned long superfluous.

Performance test data follows:

Test: Running will-it-scale/fallocate2 on CPU-bound containers.
Observation: Average fallocate operations per container per second.

|CPU: Kunpeng 920   |          P80           |            P1           |
|Memory: 512GB      |------------------------|-------------------------|
|960GB SSD (0.5GB/s)| base  |    patched     | base   |    patched     |
|-------------------|-------|----------------|--------|----------------|
|mb_optimize_scan=0 | 4821  | 9636  (+99.8%) | 314065 | 337597 (+7.4%) |
|mb_optimize_scan=1 | 4784  | 4834  (+1.04%) | 316344 | 341440 (+7.9%) |

|CPU: AMD 9654 * 2  |          P96           |             P1          |
|Memory: 1536GB     |------------------------|-------------------------|
|960GB SSD (1GB/s)  | base  |    patched     | base   |    patched     |
|-------------------|-------|----------------|--------|----------------|
|mb_optimize_scan=0 | 15371 | 22341 (+45.3%) | 205851 | 219707 (+6.7%) |
|mb_optimize_scan=1 | 6101  | 9177  (+50.4%) | 207373 | 215732 (+4.0%) |

Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250714130327.1830534-5-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-07-25 09:14:16 -04:00
Baokun Li
f0374d8071 ext4: remove unnecessary s_mb_last_start
Since stream allocation does not use ac->ac_f_ex.fe_start, it is set to -1
by default, so the no longer needed sbi->s_mb_last_start is removed.

Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250714130327.1830534-4-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-07-25 09:14:16 -04:00
Baokun Li
35bfd4b44e ext4: separate stream goal hits from s_bal_goals for better tracking
In ext4_mb_regular_allocator(), after the call to ext4_mb_find_by_goal()
fails to achieve the inode goal, allocation continues with the stream
allocation global goal. Currently, hits for both are combined in
sbi->s_bal_goals, hindering accurate optimization.

This commit separates global goal hits into sbi->s_bal_stream_goals. Since
stream allocation doesn't use ac->ac_g_ex.fe_start, set fe_start to -1.
This prevents stream allocations from being counted in s_bal_goals. Also
clear EXT4_MB_HINT_TRY_GOAL to avoid calling ext4_mb_find_by_goal again.

After adding `stream_goal_hits`, `/proc/fs/ext4/sdx/mb_stats` will show:

mballoc:
	reqs: 840347
	success: 750992
	groups_scanned: 1230506
	cr_p2_aligned_stats:
		hits: 21531
		groups_considered: 411664
		extents_scanned: 21531
		useless_loops: 0
		bad_suggestions: 6
	cr_goal_fast_stats:
		hits: 111222
		groups_considered: 1806728
		extents_scanned: 467908
		useless_loops: 0
		bad_suggestions: 13
	cr_best_avail_stats:
		hits: 36267
		groups_considered: 1817631
		extents_scanned: 156143
		useless_loops: 0
		bad_suggestions: 204
	cr_goal_slow_stats:
		hits: 106396
		groups_considered: 5671710
		extents_scanned: 22540056
		useless_loops: 123747
	cr_any_free_stats:
		hits: 138071
		groups_considered: 724692
		extents_scanned: 23615593
		useless_loops: 585
	extents_scanned: 46804261
		goal_hits: 1307
		stream_goal_hits: 236317
		len_goal_hits: 155549
		2^n_hits: 21531
		breaks: 225096
		lost: 35062
	buddies_generated: 40/40
	buddies_time_used: 48004
	preallocated: 5962467
	discarded: 4847560

Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250714130327.1830534-3-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-07-25 09:14:16 -04:00
Baokun Li
e9eec6f339 ext4: add ext4_try_lock_group() to skip busy groups
When ext4 allocates blocks, we used to just go through the block groups
one by one to find a good one. But when there are tons of block groups
(like hundreds of thousands or even millions) and not many have free space
(meaning they're mostly full), it takes a really long time to check them
all, and performance gets bad. So, we added the "mb_optimize_scan" mount
option (which is on by default now). It keeps track of some group lists,
so when we need a free block, we can just grab a likely group from the
right list. This saves time and makes block allocation much faster.

But when multiple processes or containers are doing similar things, like
constantly allocating 8k blocks, they all try to use the same block group
in the same list. Even just two processes doing this can cut the IOPS in
half. For example, one container might do 300,000 IOPS, but if you run two
at the same time, the total is only 150,000.

Since we can already look at block groups in a non-linear way, the first
and last groups in the same list are basically the same for finding a block
right now. Therefore, add an ext4_try_lock_group() helper function to skip
the current group when it is locked by another process, thereby avoiding
contention with other processes. This helps ext4 make better use of having
multiple block groups.

Also, to make sure we don't skip all the groups that have free space
when allocating blocks, we won't try to skip busy groups anymore when
ac_criteria is CR_ANY_FREE.

Performance test data follows:

Test: Running will-it-scale/fallocate2 on CPU-bound containers.
Observation: Average fallocate operations per container per second.

|CPU: Kunpeng 920   |          P80            |
|Memory: 512GB      |-------------------------|
|960GB SSD (0.5GB/s)| base  |    patched      |
|-------------------|-------|-----------------|
|mb_optimize_scan=0 | 2667  | 4821  (+80.7%)  |
|mb_optimize_scan=1 | 2643  | 4784  (+81.0%)  |

|CPU: AMD 9654 * 2  |          P96            |
|Memory: 1536GB     |-------------------------|
|960GB SSD (1GB/s)  | base  |    patched      |
|-------------------|-------|-----------------|
|mb_optimize_scan=0 | 3450  | 15371 (+345%)   |
|mb_optimize_scan=1 | 3209  | 6101  (+90.0%)  |

Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250714130327.1830534-2-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-07-25 09:14:16 -04:00
Zhang Yi
82e6381e23 ext4: initialize superblock fields in the kballoc-test.c kunit tests
Various changes in the "ext4: better scalability for ext4 block
allocation" patch series have resulted in kunit test failures, most
notably in the test_new_blocks_simple and the test_mb_mark_used tests.
The root cause of these failures is that various in-memory ext4 data
structures were not getting initialized, and while previous versions
of the functions exercised by the unit tests didn't use these
structure members, this was arguably a test bug.

Since one of the patches in the block allocation scalability patches
is a fix which is has a cc:stable tag, this commit also has a
cc:stable tag.

CC: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20250714130327.1830534-1-libaokun1@huawei.com
Link: https://patch.msgid.link/20250725021550.3177573-1-yi.zhang@huaweicloud.com
Link: https://patch.msgid.link/20250725021654.3188798-1-yi.zhang@huaweicloud.com
Reported-by: Guenter Roeck <linux@roeck-us.net>
Closes: https://lore.kernel.org/linux-ext4/b0635ad0-7ebf-4152-a69b-58e7e87d5085@roeck-us.net/
Tested-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-07-25 09:13:44 -04:00
Theodore Ts'o
90f097b140 ext4: refactor the inline directory conversion and new directory codepaths
There was a lot of common code in the codepaths used to convert an
inline directory and to creaet a new directory.  To address this,
rename ext4_init_dot_dotdot() to ext4_init_dirblock() and then move
common code into that function.

This reduces the lines of code count in fs/ext4/inline.c and
fs/ext4/namei.c, as well as reducing the size of their object files.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Link: https://patch.msgid.link/20250712181249.434530-3-tytso@mit.edu
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-07-17 23:25:21 -04:00
Theodore Ts'o
a35454ecf8 ext4: use memcpy() instead of strcpy()
The strcpy() function is considered dangerous and eeeevil by people
who are using sophisticated code analysis tools such as "grep".  This
is true even when a quick inspection would show that the source is a
constant string ("." or "..") and the destination is a fixed array
which is guaranteed to have enough space.  Make the "grep" code
analysis tool happy by using memcpy() isstead of strcpy().  :-)

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Link: https://patch.msgid.link/20250712181249.434530-2-tytso@mit.edu
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-07-17 23:25:21 -04:00
Theodore Ts'o
3658b8b339 ext4: replace strcmp with direct comparison for '.' and '..'
In a discussion over a proposed patch, "ext4: replace strcpy() with
'.' assignment"[1], I had asserted that directory entries in ext4 were
not NUL terminated, and hence it was safe to replace strcpy() with a
direct assignment.  As it turns out, this was incorrect.  It's true
for all all directory entries *except* for '.' and '..' where the
kernel was using strcmp() and where e2fsck actually checks and offers
to fix things if '.'  and '..' are not NUL terminated.

[1] https://lore.kernel.org/r/202505191316.JJMnPobO-lkp@intel.com

We can't change this without breaking old kernel versions, but in the
spirit of "be liberal in what you receive", use direct comparison of
de->name_len and de->name[0,1] instead of strcmp().  This has the side
benefit of reducing the compiled text size by 96 bytes on x86_64.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Link: https://patch.msgid.link/20250712181249.434530-1-tytso@mit.edu
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-07-17 23:25:21 -04:00
Jan Kara
91b8ca8b26 ext4: Make sure BH_New bit is cleared in ->write_end handler
Currently we clear BH_New bit in case of error and also in the standard
ext4_write_end() handler (in block_commit_write()). However
ext4_journalled_write_end() misses this clearing and thus we are leaving
stale BH_New bits behind. Generally ext4_block_write_begin() clears
these bits before any harm can be done but in case blocksize < pagesize
and we hit some error when processing a page with these stale bits,
we'll try to zero buffers with these stale BH_New bits and jbd2 will
complain (as buffers were not prepared for writing in this transaction).
Fix the problem by clearing BH_New bits in ext4_journalled_write_end()
and WARN if ext4_block_write_begin() sees stale BH_New bits.

Reported-by: Baolin Liu <liubaolin12138@163.com>
Reported-by: Zhi Long <longzhi@sangfor.com.cn>
Fixes: 3910b513fc ("ext4: persist the new uptodate buffers in ext4_journalled_zero_new_buffers")
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20250709084831.23876-2-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-07-17 23:25:21 -04:00
Baokun Li
c678bdc998 ext4: fix inode use after free in ext4_end_io_rsv_work()
In ext4_io_end_defer_completion(), check if io_end->list_vec is empty to
avoid adding an io_end that requires no conversion to the
i_rsv_conversion_list, which in turn prevents starting an unnecessary
worker. An ext4_emergency_state() check is also added to avoid attempting
to abort the journal in an emergency state.

Additionally, ext4_put_io_end_defer() is refactored to call
ext4_io_end_defer_completion() directly instead of being open-coded.
This also prevents starting an unnecessary worker when EXT4_IO_END_FAILED
is set but data_err=abort is not enabled.

This ensures that the check in ext4_put_io_end_defer() is consistent with
the check in ext4_end_bio(). Otherwise, we might add an io_end to the
i_rsv_conversion_list and then call ext4_finish_bio(), after which the
inode could be freed before ext4_end_io_rsv_work() is called, triggering
a use-after-free issue.

Fixes: ce51afb8cc ("ext4: abort journal on data writeback failure if in data_err=abort mode")
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20250708111504.3208660-1-libaokun@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-07-17 23:25:21 -04:00
I Hsin Cheng
9d9076238f ext4: Refactor breaking condition for xattr_find_entry()
Refactor the condition for breaking the loop within xattr_find_entry().
Elimate the usage of "<=" and take condition shortcut when "!cmp" is
true.

Originally, the condition was "(cmp <= 0 && (sorted || cmp == 0))", which
means after it knows "cmp <= 0" is true, it has to check the value of
"sorted" and "cmp". The checking of "cmp" here would be redundant since
it has already checked it.

Observing from the logic, when "cmp == 0" the branch is going to be true,
no need to check "cmp == 0" again, so we only need to take shortcut when
"cmp == 0", on the other hand, we'll check "sorted" when "cmp < 0".

The refactor can shrink the generated code size by 44 bytes. Numerous
instructions can be saved thus should also benefit execution efficiency
as well.

$ ./scripts/bloat-o-meter vmlinux_old vmlinux_new
add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-44 (-44)
Function                                     old     new   delta
xattr_find_entry                             300     256     -44
Total: Before=22989434, After=22989390, chg -0.00%

The test is done on kernel version 6.16 with x86_64 defconfig
and gcc 13.3.0.

Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
Link: https://patch.msgid.link/20250708020013.175728-1-richard120310@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-07-17 10:41:05 -04:00
Zhang Yi
b12f423d59 ext4: limit the maximum folio order
In environments with a page size of 64KB, the maximum size of a folio
can reach up to 128MB. Consequently, during the write-back of folios,
the 'rsv_blocks' will be overestimated to 1,577, which can make
pressure on the journal space where the journal is small. This can
easily exceed the limit of a single transaction. Besides, an excessively
large folio is meaningless and will instead increase the overhead of
traversing the bhs within the folio. Therefore, limit the maximum order
of a folio to 2048 filesystem blocks.

Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Reported-by: Joseph Qi <jiangqi903@gmail.com>
Closes: https://lore.kernel.org/linux-ext4/CA+G9fYsyYQ3ZL4xaSg1-Tt5Evto7Zd+hgNWZEa9cQLbahA1+xg@mail.gmail.com/
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Tested-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20250707140814.542883-12-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-07-14 23:48:15 -04:00
Zhang Yi
5137d6c890 ext4: fix insufficient credits calculation in ext4_meta_trans_blocks()
The calculation of journal credits in ext4_meta_trans_blocks() should
include pextents, as each extent separately may be allocated from a
different group and thus need to update different bitmap and group
descriptor block.

Fixes: 0e32d86170 ("ext4: correct the journal credits calculations of allocating blocks")
Reported-by: Jan Kara <jack@suse.cz>
Closes: https://lore.kernel.org/linux-ext4/nhxfuu53wyacsrq7xqgxvgzcggyscu2tbabginahcygvmc45hy@t4fvmyeky33e/
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Baokun Li <libaokun1@huawei.com>
Link: https://patch.msgid.link/20250707140814.542883-11-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-07-13 23:41:52 -04:00
Zhang Yi
57661f2875 ext4: replace ext4_writepage_trans_blocks()
After ext4 supports large folios, the semantics of reserving credits in
pages is no longer applicable. In most scenarios, reserving credits in
extents is sufficient. Therefore, introduce ext4_chunk_trans_extent()
to replace ext4_writepage_trans_blocks(). move_extent_per_page() is the
only remaining location where we are still processing extents in pages.

Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20250707140814.542883-10-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-07-13 23:41:52 -04:00
Zhang Yi
bbbf150f3f ext4: reserved credits for one extent during the folio writeback
After ext4 supports large folios, reserving journal credits for one
maximum-ordered folio based on the worst case cenario during the
writeback process can easily exceed the maximum transaction credits.
Additionally, reserving journal credits for one page is also no
longer appropriate.

Currently, the folio writeback process can either extend the journal
credits or initiate a new transaction if the currently reserved journal
credits are insufficient. Therefore, it can be modified to reserve
credits for only one extent at the outset. In most cases involving
continuous mapping, these credits are generally adequate, and we may
only need to perform some basic credit expansion. However, in extreme
cases where the block size and folio size differ significantly, or when
the folios are sufficiently discontinuous, it may be necessary to
restart a new transaction and resubmit the folios.

Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20250707140814.542883-9-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-07-13 23:41:52 -04:00
Zhang Yi
95ad8ee45c ext4: correct the reserved credits for extent conversion
Now, we reserve journal credits for converting extents in only one page
to written state when the I/O operation is complete. This is
insufficient when large folio is enabled.

Fix this by reserving credits for converting up to one extent per block in
the largest 2MB folio, this calculation should only involve extents index
and leaf blocks, so it should not estimate too many credits.

Fixes: 7ac67301e8 ("ext4: enable large folio for regular file")
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Baokun Li <libaokun1@huawei.com>
Link: https://patch.msgid.link/20250707140814.542883-8-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-07-13 23:41:52 -04:00
Zhang Yi
6b132759b0 ext4: enhance tracepoints during the folios writeback
After mpage_map_and_submit_extent() supports restarting handle if
credits are insufficient during allocating blocks, it is more likely to
exit the current mapping iteration and continue to process the current
processing partially mapped folio again. The existing tracepoints are
not sufficient to track this situation, so enhance the tracepoints to
track the writeback position and the return value before and after
submitting the folios.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20250707140814.542883-7-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-07-13 23:41:51 -04:00
Zhang Yi
e2c4c49dee ext4: restart handle if credits are insufficient during allocating blocks
After large folios are supported on ext4, writing back a sufficiently
large and discontinuous folio may consume a significant number of
journal credits, placing considerable strain on the journal. For
example, in a 20GB filesystem with 1K block size and 1MB journal size,
writing back a 2MB folio could require thousands of credits in the
worst-case scenario (when each block is discontinuous and distributed
across different block groups), potentially exceeding the journal size.
This issue can also occur in ext4_write_begin() and ext4_page_mkwrite()
when delalloc is not enabled.

Fix this by ensuring that there are sufficient journal credits before
allocating an extent in mpage_map_one_extent() and
ext4_block_write_begin(). If there are not enough credits, return
-EAGAIN, exit the current mapping loop, restart a new handle and a new
transaction, and allocating blocks on this folio again in the next
iteration.

Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20250707140814.542883-6-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-07-13 23:41:51 -04:00
Zhang Yi
2bddafea3d ext4: refactor the block allocation process of ext4_page_mkwrite()
The block allocation process and error handling in ext4_page_mkwrite()
is complex now. Refactor it by introducing a new helper function,
ext4_block_page_mkwrite(). It will call ext4_block_write_begin() to
allocate blocks instead of directly calling block_page_mkwrite().
Preparing to implement retry logic in a subsequent patch to address
situations where the reserved journal credits are insufficient.
Additionally, this modification will help prevent potential deadlocks
that may occur when waiting for folio writeback while holding the
transaction handle.

Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20250707140814.542883-5-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-07-13 23:41:51 -04:00
Zhang Yi
ded2d726a3 ext4: fix stale data if it bail out of the extents mapping loop
During the process of writing back folios, if
mpage_map_and_submit_extent() exits the extent mapping loop due to an
ENOSPC or ENOMEM error, it may result in stale data or filesystem
inconsistency in environments where the block size is smaller than the
folio size.

When mapping a discontinuous folio in mpage_map_and_submit_extent(),
some buffers may have already be mapped. If we exit the mapping loop
prematurely, the folio data within the mapped range will not be written
back, and the file's disk size will not be updated. Once the transaction
that includes this range of extents is committed, this can lead to stale
data or filesystem inconsistency.

Fix this by submitting the current processing partially mapped folio.

Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20250707140814.542883-4-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-07-13 23:41:51 -04:00
Zhang Yi
f922c8c246 ext4: move the calculation of wbc->nr_to_write to mpage_folio_done()
mpage_folio_done() should be a more appropriate place than
mpage_submit_folio() for updating the wbc->nr_to_write after we have
submitted a fully mapped folio. Preparing to make mpage_submit_folio()
allows to submit partially mapped folio that is still under processing.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Baokun Li <libaokun1@huawei.com>
Link: https://patch.msgid.link/20250707140814.542883-3-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-07-13 23:41:51 -04:00
Zhang Yi
1bfe6354e0 ext4: process folios writeback in bytes
Since ext4 supports large folios, processing writebacks in pages is no
longer appropriate, it can be modified to process writebacks in bytes.

Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20250707140814.542883-2-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-07-13 23:41:51 -04:00
Baolin Liu
a073e8577f ext4: remove unused EXT_STATS macro from ext4_extents.h
The EXT_STATS macro in fs/ext4/ext4_extents.h has been defined
but never used in the codebase since its introduction. This patch
removes it.

Analysis:
1. No references found in fs/ext4/ or other kernel code.
2. No impact on compilation or functionality.
3. Git history shows it was never utilized.

Signed-off-by: Baolin Liu <liubaolin@kylinos.cn>
Reviewed-by: Baokun Li <libaokun1@huawei.com>
Link: https://patch.msgid.link/20250527053805.1550912-1-liubaolin12138@163.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-07-12 19:01:38 -04:00
Dan Carpenter
c5da1f6694 ext4: remove unnecessary duplicate check in ext4_map_blocks()
The previous lines ensure that EXT4_GET_BLOCKS_QUERY_LAST_IN_LEAF is
set so remove this duplicate check.

Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://patch.msgid.link/aDCdjUhpzxB64vkD@stanley.mountain
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-07-11 16:52:59 -04:00
Jinliang Zheng
b6f3801727 ext4: remove duplicate check for EXT4_FC_REPLAY
EXT4_FC_REPLAY will be checked in ext4_es_lookup_extent(). If it is
set, ext4_es_lookup_extent() will return 0.

Remove the repeated check for EXT4_FC_REPLAY in ext4_map_blocks()
to simplify the code.

Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250429111722.294975-1-alexjlzheng@tencent.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-07-10 21:33:03 -04:00
Linus Torvalds
d0b3b7b22d Linux 6.16-rc4 v6.16-rc4 2025-06-29 13:09:04 -07:00
Linus Torvalds
afa9a6f4f5 Merge tag 'staging-6.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging
Pull staging driver fix from Greg KH:
 "Here is a single staging driver fix for 6.16-rc4. It resolves a build
  error in the rtl8723bs driver for some versions of clang on arm64 when
  checking the frame size with -Wframe-larger-than.

  It has been in linux-next for a while now with no reported issues"

* tag 'staging-6.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
  staging: rtl8723bs: Avoid memset() in aes_cipher() and aes_decipher()
2025-06-29 09:25:55 -07:00
Linus Torvalds
798804b69f Merge tag 'tty-6.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
Pull tty/serial driver fixes from Greg KH:
 "Here are five small serial and tty and vt fixes for 6.16-rc4. Included
  in here are:

   - kerneldoc fixes for recent vt changes

   - imx serial driver fix

   - of_node sysfs fix for a regression

   - vt missing notification fix

   - 8250 dt bindings fix

  All of these have been in linux-next for a while with no reported issues"

* tag 'tty-6.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
  dt-bindings: serial: 8250: Make clocks and clock-frequency exclusive
  serial: imx: Restore original RXTL for console to fix data loss
  serial: core: restore of_node information in sysfs
  vt: fix kernel-doc warnings in ucs_get_fallback()
  vt: add missing notification when switching back to text mode
2025-06-29 09:21:27 -07:00
Linus Torvalds
3b1890e4b2 Merge tag 'edac_urgent_for_v6.16_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras
Pull EDAC fix from Borislav Petkov:

 - Consider secondary address mask registers in amd64_edac in order to
   get the correct total memory size of the system

* tag 'edac_urgent_for_v6.16_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras:
  EDAC/amd64: Fix size calculation for Non-Power-of-Two DIMMs
2025-06-29 08:43:54 -07:00
Linus Torvalds
cc69ac7a65 Merge tag 'x86_urgent_for_v6.16_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Borislav Petkov:

 - Make sure DR6 and DR7 are initialized to their architectural values
   and not accidentally cleared, leading to misconfigurations

* tag 'x86_urgent_for_v6.16_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/traps: Initialize DR7 by writing its architectural reset value
  x86/traps: Initialize DR6 by writing its architectural reset value
2025-06-29 08:28:24 -07:00
Linus Torvalds
2fc18d0b89 Merge tag 'perf_urgent_for_v6.16_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fix from Borislav Petkov:

 - Make sure an AUX perf event is really disabled when it overruns

* tag 'perf_urgent_for_v6.16_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/aux: Fix pending disable flow when the AUX ring buffer overruns
2025-06-29 08:16:02 -07:00
Linus Torvalds
753a0f61b9 Merge tag 'locking_urgent_for_v6.16_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking fix from Borislav Petkov:

 - Make sure the new futex phash is not copied during fork in order to
   avoid a double-free

* tag 'locking_urgent_for_v6.16_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  futex: Initialize futex_phash_new during fork().
2025-06-29 08:09:13 -07:00
Linus Torvalds
dfba48a70c Merge tag 'i2c-for-6.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux
Pull i2c fixes from Wolfram Sang:

 - imx: fix SMBus protocol compliance during block read

 - omap: fix error handling path in probe

 - robotfuzz, tiny-usb: prevent zero-length reads

 - x86, designware, amdisp: fix build error when modules are disabled
   (agreed to go in via i2c)

 - scx200_acb: fix build error because of missing HAS_IOPORT

* tag 'i2c-for-6.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
  i2c: scx200_acb: depends on HAS_IOPORT
  i2c: omap: Fix an error handling path in omap_i2c_probe()
  platform/x86: Use i2c adapter name to fix build errors
  i2c: amd-isp: Initialize unique adapter name
  i2c: designware: Initialize adapter name only when not set
  i2c: tiny-usb: disable zero-length read messages
  i2c: robotfuzz-osif: disable zero-length read messages
  i2c: imx: fix emulated smbus block read
2025-06-28 15:23:17 -07:00
Linus Torvalds
ded779017a Merge tag 'trace-v6.16-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fix from Steven Rostedt:

 - Fix possible UAF on error path in filter_free_subsystem_filters()

   When freeing a subsystem filter, the filter for the subsystem is
   passed in to be freed and all the events within the subsystem will
   have their filter freed too. In order to free without waiting for RCU
   synchronization, list items are allocated to hold what is going to be
   freed to free it via a call_rcu(). If the allocation of these items
   fails, it will call the synchronization directly and free after that
   (causing a bit of delay for the user).

   The subsystem filter is first added to this list and then the filters
   for all the events under the subsystem. The bug is if one of the
   allocations of the list items for the event filters fail to allocate,
   it jumps to the "free_now" label which will free the subsystem
   filter, then all the items on the allocated list, and then the event
   filters that were not added to the list yet. But because the
   subsystem filter was added first, it gets freed twice.

   The solution is to add the subsystem filter after the events, and
   then if any of the allocations fail it will not try to free any of
   them twice

* tag 'trace-v6.16-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing: Fix filter logic error
2025-06-28 11:39:24 -07:00
Linus Torvalds
3a3de75a68 Merge tag 'loongarch-fixes-6.16-1' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson
Pull LoongArch fixes from Huacai Chen:

 - replace __ASSEMBLY__ with __ASSEMBLER__ in headers like others

 - fix build warnings about export.h

 - reserve the EFI memory map region for kdump

 - handle __init vs inline mismatches

 - fix some KVM bugs

* tag 'loongarch-fixes-6.16-1' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson:
  LoongArch: KVM: Disable updating of "num_cpu" and "feature"
  LoongArch: KVM: Check validity of "num_cpu" from user space
  LoongArch: KVM: Check interrupt route from physical CPU
  LoongArch: KVM: Fix interrupt route update with EIOINTC
  LoongArch: KVM: Add address alignment check for IOCSR emulation
  LoongArch: KVM: Avoid overflow with array index
  LoongArch: Handle KCOV __init vs inline mismatches
  LoongArch: Reserve the EFI memory map region
  LoongArch: Fix build warnings about export.h
  LoongArch: Replace __ASSEMBLY__ with __ASSEMBLER__ in headers
2025-06-28 11:35:11 -07:00
Linus Torvalds
aaf724ed69 Merge tag 'v6.16-rc3-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull smb client fixes from Steve French:

 - Multichannel reconnect lock ordering deadlock fix

 - Fix for regression in handling native Windows symlinks

 - Three smbdirect fixes:
     - oops in RDMA response processing
     - smbdirect memcpy issue
     - fix smbdirect regression with large writes (smbdirect test cases
       now all passing)

 - Fix for "FAILED_TO_PARSE" warning in trace-cmd report output

* tag 'v6.16-rc3-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: Fix reading into an ITER_FOLIOQ from the smbdirect code
  cifs: Fix the smbd_response slab to allow usercopy
  smb: client: fix potential deadlock when reconnecting channels
  smb: client: remove \t from TP_printk statements
  smb: client: let smbd_post_send_iter() respect the peers max_send_size and transmit all data
  smb: client: fix regression with native SMB symlinks
2025-06-27 20:38:05 -07:00
Linus Torvalds
0fd39af24e Merge tag 'mm-hotfixes-stable-2025-06-27-16-56' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
 "16 hotfixes.

  6 are cc:stable and the remainder address post-6.15 issues or aren't
  considered necessary for -stable kernels. 5 are for MM"

* tag 'mm-hotfixes-stable-2025-06-27-16-56' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  MAINTAINERS: add Lorenzo as THP co-maintainer
  mailmap: update Duje Mihanović's email address
  selftests/mm: fix validate_addr() helper
  crashdump: add CONFIG_KEYS dependency
  mailmap: correct name for a historical account of Zijun Hu
  mailmap: add entries for Zijun Hu
  fuse: fix runtime warning on truncate_folio_batch_exceptionals()
  scripts/gdb: fix dentry_name() lookup
  mm/damon/sysfs-schemes: free old damon_sysfs_scheme_filter->memcg_path on write
  mm/alloc_tag: fix the kmemleak false positive issue in the allocation of the percpu variable tag->counters
  lib/group_cpus: fix NULL pointer dereference from group_cpus_evenly()
  mm/hugetlb: remove unnecessary holding of hugetlb_lock
  MAINTAINERS: add missing files to mm page alloc section
  MAINTAINERS: add tree entry to mm init block
  mm: add OOM killer maintainer structure
  fs/proc/task_mmu: fix PAGE_IS_PFNZERO detection for the huge zero folio
2025-06-27 20:34:10 -07:00
Linus Torvalds
867b9987a3 Merge tag 'riscv-for-linus-5.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux
Pull RISC-V Fixes for 5.16-rc4

 - .rodata is no longer linkd into PT_DYNAMIC.

   It was not supposed to be there in the first place and resulted in
   invalid (but unused) entries. This manifests as at least warnings in
   llvm-readelf

 - A fix for runtime constants with all-0 upper 32-bits. This should
   only manifest on MMU=n kernels

 - A fix for context save/restore on systems using the T-Head vector
   extensions

 - A fix for a conflicting "+r"/"r" register constraint in the VDSO
   getrandom syscall wrapper, which is undefined behavior in clang

 - A fix for a missing register clobber in the RVV raid6 implementation.

   This manifests as a NULL pointer reference on some compilers, but
   could trigger in other ways

 - Misaligned accesses from userspace at faulting addresses are now
   handled correctly

 - A fix for an incorrect optimization that allowed access_ok() to mark
   invalid addresses as accessible, which can result in userspace
   triggering BUG()s

 - A few fixes for build warnings, and an update to Drew's email address

* tag 'riscv-for-linus-5.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
  riscv: export boot_cpu_hartid
  Revert "riscv: Define TASK_SIZE_MAX for __access_ok()"
  riscv: Fix sparse warning in vendor_extensions/sifive.c
  Revert "riscv: misaligned: fix sleeping function called during misaligned access handling"
  MAINTAINERS: Update Drew Fustini's email address
  RISC-V: uaccess: Wrap the get_user_8 uaccess macro
  raid6: riscv: Fix NULL pointer dereference caused by a missing clobber
  RISC-V: vDSO: Correct inline assembly constraints in the getrandom syscall wrapper
  riscv: vector: Fix context save/restore with xtheadvector
  riscv: fix runtime constant support for nommu kernels
  riscv: vdso: Exclude .rodata from the PT_DYNAMIC segment
2025-06-27 20:22:18 -07:00
Linus Torvalds
fa33adcaf8 Merge tag 'pci-v6.16-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci
Pull PCI fix from Bjorn Helgaas:

 - Fix a PTM debugfs build error with CONFIG_DEBUG_FS=n &&
   CONFIG_PCIE_PTM=y (Manivannan Sadhasivam)

* tag 'pci-v6.16-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci:
  PCI/PTM: Build debugfs code only if CONFIG_DEBUG_FS is enabled
2025-06-27 20:17:48 -07:00
Linus Torvalds
7abdafd234 Merge tag 'drm-fixes-2025-06-28' of https://gitlab.freedesktop.org/drm/kernel
Pull drm fixes from Dave Airlie:
 "Regular weekly drm updates, nothing out of the ordinary, amdgpu, xe,
  i915 and a few misc bits. Seems about right for this time in the
  release cycle.

  core:
   - fix drm_writeback_connector_cleanup function signature
   - use correct HDMI audio bridge in drm_connector_hdmi_audio_init

  bridge:
   - SN65DSI86: fix HPD

  amdgpu:
   - Cleaner shader support for additional GFX9 GPUs
   - MES firmware compatibility fixes
   - Discovery error reporting fixes
   - SDMA6/7 userq fixes
   - Backlight fix
   - EDID sanity check

  i915:
   - Fix for SNPS PHY HDMI for 1080p@120Hz
   - Correct DP AUX DPCD probe address
   - Followup build fix for GCOV and AutoFDO enabled config

  xe:
   - Missing error check
   - Fix xe_hwmon_power_max_write
   - Move flushes
   - Explicitly exit CT safe mode on unwind
   - Process deferred GGTT node removals on device unwind"

* tag 'drm-fixes-2025-06-28' of https://gitlab.freedesktop.org/drm/kernel:
  drm/xe: Process deferred GGTT node removals on device unwind
  drm/xe/guc: Explicitly exit CT safe mode on unwind
  drm/xe: move DPT l2 flush to a more sensible place
  drm/xe: Move DSB l2 flush to a more sensible place
  drm/bridge: ti-sn65dsi86: Add HPD for DisplayPort connector type
  drm/i915: fix build error some more
  drm/xe/hwmon: Fix xe_hwmon_power_max_write
  drm/xe/display: Add check for alloc_ordered_workqueue()
  drm/amd/display: Add sanity checks for drm_edid_raw()
  drm/amd/display: Fix AMDGPU_MAX_BL_LEVEL value
  drm/amdgpu/sdma7: add ucode version checks for userq support
  drm/amdgpu/sdma6: add ucode version checks for userq support
  drm/amd: Adjust output for discovery error handling
  drm/amdgpu/mes: add compatibility checks for set_hw_resource_1
  drm/amdgpu/gfx9: Add Cleaner Shader Support for GFX9.x GPUs
  drm/bridge-connector: Fix bridge in drm_connector_hdmi_audio_init()
  drm/dp: Change AUX DPCD probe address from DPCD_REV to LANE0_1_STATUS
  drm/i915/snps_hdmi_pll: Fix 64-bit divisor truncation by using div64_u64
  drm: writeback: Fix drm_writeback_connector_cleanup signature
2025-06-27 19:38:36 -07:00
Linus Torvalds
26fd9f7b7f Merge tag 'cxl-fixes-6.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl
Pull Compute Express Link (CXL) fixes from Dave Jiang:
 "These fixes address a few issues in the CXL subsystem, including
  dealing with some bugs in the CXL EDAC and RAS drivers:

   - Fix return value of cxlctl_validate_set_features()

   - Fix min_scrub_cycle of a region miscaculation and add additional
     documentation

   - Fix potential memory leak issues for CXL EDAC

   - Fix CPER handler device confusion for CXL RAS

   - Fix using wrong repair type to check DRAM event record"

* tag 'cxl-fixes-6.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl:
  cxl/edac: Fix using wrong repair type to check dram event record
  cxl/ras: Fix CPER handler device confusion
  cxl/edac: Fix potential memory leak issues
  cxl/Documentation: Add more description about min/max scrub cycle
  cxl/edac: Fix the min_scrub_cycle of a region miscalculation
  cxl: fix return value in cxlctl_validate_set_features()
2025-06-27 17:58:32 -07:00
Linus Torvalds
5683cd63a3 Merge tag 'libcrypto-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux
Pull crypto library fix from Eric Biggers:
 "Fix a regression where the purgatory code sometimes fails to build"

* tag 'libcrypto-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux:
  lib/crypto: sha256: Mark sha256_choose_blocks as __always_inline
2025-06-27 17:32:30 -07:00