Commit Graph

1429747 Commits

Author SHA1 Message Date
Filipe Manana
40f2b11c1b btrfs: don't allow log trees to consume global reserve or overcommit metadata
For a fsync we never reserve space in advance, we just start a transaction
without reserving space and we use an empty block reserve for a log tree.
We reserve space as we need while updating a log tree, we end up in
btrfs_use_block_rsv() when reserving space for the allocation of a log
tree extent buffer and we attempt first to reserve without flushing,
and if that fails we attempt to consume from the global reserve or
overcommit metadata. This makes us consume space that may be the last
resort for a transaction commit to succeed, therefore increasing the
chances for a transaction abort with -ENOSPC.

So make btrfs_use_block_rsv() fail if we can't reserve metadata space for
a log tree extent buffer allocation without flushing, making the fsync
fallback to a transaction commit and avoid using critical space that could
be the only resort for a transaction commit to succeed when we are in a
critical space situation.

Reviewed-by: Leo Martins <loemra.dev@gmail.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07 18:55:53 +02:00
Filipe Manana
574d93fc62 btrfs: be less aggressive with metadata overcommit when we can do full flushing
Over the years we often get reports of some -ENOSPC failure while updating
metadata that leads to a transaction abort. I have seen this happen for
filesystems of all sizes and with workloads that are very user/customer
specific and unable to reproduce, but Aleksandar recently reported a
simple way to reproduce this with a 1G filesystem and using the bonnie++
benchmark tool. The following test script reproduces the failure:

    $ cat test.sh
    #!/bin/bash

    # Create and use a 1G null block device, memory backed, otherwise
    # the test takes a very long time.
    modprobe null_blk nr_devices="0"
    null_dev="/sys/kernel/config/nullb/nullb0"
    mkdir "$null_dev"
    size=$((1 * 1024)) # in MB
    echo 2 > "$null_dev/submit_queues"
    echo "$size" > "$null_dev/size"
    echo 1 > "$null_dev/memory_backed"
    echo 1 > "$null_dev/discard"
    echo 1 > "$null_dev/power"

    DEV=/dev/nullb0
    MNT=/mnt/nullb0

    mkfs.btrfs -f $DEV
    mount $DEV $MNT

    mkdir $MNT/test/
    bonnie++ -d $MNT/test/ -m BTRFS -u 0 -s 256M -r 128M -b

    umount $MNT

    echo 0 > "$null_dev/power"
    rmdir "$null_dev"

When running this bonnie++ fails in the phase where it deletes test
directories and files:

    $ ./test.sh
    (...)
    Using uid:0, gid:0.
    Writing a byte at a time...done
    Writing intelligently...done
    Rewriting...done
    Reading a byte at a time...done
    Reading intelligently...done
    start 'em...done...done...done...done...done...
    Create files in sequential order...done.
    Stat files in sequential order...done.
    Delete files in sequential order...done.
    Create files in random order...done.
    Stat files in random order...done.
    Delete files in random order...Can't sync directory, turning off dir-sync.
    Can't delete file 9Bq7sr0000000338
    Cleaning up test directory after error.
    Bonnie: drastic I/O error (rmdir): Read-only file system

And in the syslog/dmesg we can see the following transaction abort trace:

    [161915.501506] BTRFS warning (device nullb0): Skipping commit of aborted transaction.
    [161915.502983] ------------[ cut here ]------------
    [161915.503832] BTRFS: Transaction aborted (error -28)
    [161915.504748] WARNING: fs/btrfs/transaction.c:2045 at btrfs_commit_transaction+0xa21/0xd30 [btrfs], CPU#11: bonnie++/3377975
    [161915.506786] Modules linked in: btrfs dm_zero dm_snapshot (...)
    [161915.518759] CPU: 11 UID: 0 PID: 3377975 Comm: bonnie++ Tainted: G        W           6.19.0-rc7-btrfs-next-224+ #4 PREEMPT(full)
    [161915.520857] Tainted: [W]=WARN
    [161915.521405] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
    [161915.523414] RIP: 0010:btrfs_commit_transaction+0xa24/0xd30 [btrfs]
    [161915.524630] Code: 48 8b 7c 24 (...)
    [161915.526982] RSP: 0018:ffffd3fe8206fda8 EFLAGS: 00010292
    [161915.527707] RAX: 0000000000000002 RBX: ffff8f4886d3c000 RCX: 0000000000000000
    [161915.528723] RDX: 0000000002040001 RSI: 00000000ffffffe4 RDI: ffffffffc088f780
    [161915.529691] RBP: ffff8f4f5adae7e0 R08: 0000000000000000 R09: ffffd3fe8206fb90
    [161915.530842] R10: ffff8f4f9c1fffa8 R11: 0000000000000003 R12: 00000000ffffffe4
    [161915.532027] R13: ffff8f4ef2cf2400 R14: ffff8f4f5adae708 R15: ffff8f4f62d18000
    [161915.533229] FS:  00007ff93112a780(0000) GS:ffff8f4ff63ee000(0000) knlGS:0000000000000000
    [161915.534611] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [161915.535575] CR2: 00005571b3072000 CR3: 0000000176080005 CR4: 0000000000370ef0
    [161915.536758] Call Trace:
    [161915.537185]  <TASK>
    [161915.537575]  btrfs_sync_file+0x431/0x530 [btrfs]
    [161915.538473]  do_fsync+0x39/0x80
    [161915.539042]  __x64_sys_fsync+0xf/0x20
    [161915.539750]  do_syscall_64+0x50/0xf20
    [161915.540396]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
    [161915.541301] RIP: 0033:0x7ff930ca49ee
    [161915.541904] Code: 08 0f 85 f5 (...)
    [161915.544830] RSP: 002b:00007ffd94291f38 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
    [161915.546152] RAX: ffffffffffffffda RBX: 00007ff93112a780 RCX: 00007ff930ca49ee
    [161915.547263] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000003
    [161915.548383] RBP: 0000000000000dab R08: 0000000000000000 R09: 0000000000000000
    [161915.549853] R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffd94291fb0
    [161915.551196] R13: 00007ffd94292350 R14: 0000000000000001 R15: 00007ffd94292340
    [161915.552161]  </TASK>
    [161915.552457] ---[ end trace 0000000000000000 ]---
    [161915.553232] BTRFS info (device nullb0 state A): dumping space info:
    [161915.553236] BTRFS info (device nullb0 state A): space_info DATA (sub-group id 0) has 12582912 free, is not full
    [161915.553239] BTRFS info (device nullb0 state A): space_info total=12582912, used=0, pinned=0, reserved=0, may_use=0, readonly=0 zone_unusable=0
    [161915.553243] BTRFS info (device nullb0 state A): space_info METADATA (sub-group id 0) has -5767168 free, is full
    [161915.553245] BTRFS info (device nullb0 state A): space_info total=53673984, used=6635520, pinned=46956544, reserved=16384, may_use=5767168, readonly=65536 zone_unusable=0
    [161915.553251] BTRFS info (device nullb0 state A): space_info SYSTEM (sub-group id 0) has 8355840 free, is not full
    [161915.553254] BTRFS info (device nullb0 state A): space_info total=8388608, used=16384, pinned=16384, reserved=0, may_use=0, readonly=0 zone_unusable=0
    [161915.553257] BTRFS info (device nullb0 state A): global_block_rsv: size 5767168 reserved 5767168
    [161915.553261] BTRFS info (device nullb0 state A): trans_block_rsv: size 0 reserved 0
    [161915.553263] BTRFS info (device nullb0 state A): chunk_block_rsv: size 0 reserved 0
    [161915.553265] BTRFS info (device nullb0 state A): remap_block_rsv: size 0 reserved 0
    [161915.553268] BTRFS info (device nullb0 state A): delayed_block_rsv: size 0 reserved 0
    [161915.553270] BTRFS info (device nullb0 state A): delayed_refs_rsv: size 0 reserved 0
    [161915.553272] BTRFS: error (device nullb0 state A) in cleanup_transaction:2045: errno=-28 No space left
    [161915.554463] BTRFS info (device nullb0 state EA): forced readonly

The problem is that we allow for a very aggressive metadata overcommit,
about 1/8th of the currently available space, even when the task
attempting the reservation allows for full flushing. Over time this allows
more and more tasks to overcommit without getting a transaction commit to
release pinned extents, joining the same transaction and eventually lead
to the transaction abort when attempting some tree update, as the extent
allocator is not able to find any available metadata extent and it's not
able to allocate a new metadata block group either (not enough unallocated
space for that).

Fix this by allowing the overcommit to be up to 1/64th of the available
(unallocated) space instead and for that limit to apply to both types of
full flushing, BTRFS_RESERVE_FLUSH_ALL and BTRFS_RESERVE_FLUSH_ALL_STEAL.
This way we get more frequent transaction commits to release pinned
extents in case our caller is in a context where full flushing is allowed.

Note that the space infos dump in the dmesg/syslog right after the
transaction abort give the wrong idea that we have plenty of unallocated
space when the abort happened. During the bonnie++ workload we had a
metadata chunk allocation attempt and it failed with -ENOSPC because at
that time we had a bunch of data block groups allocated, which then became
empty and got deleted by the cleaner kthread after the metadata chunk
allocation failed with -ENOSPC and before the transaction abort happened
and dumped the space infos.

The custom tracing (some trace_printk() calls spread in strategic places)
used to check that:

  mount-1793735 [011] ...1. 28877.261096: btrfs_add_bg_to_space_info: added bg offset 13631488 length 8388608 flags 1 to space_info->flags 1 total_bytes 8388608 bytes_used 0 bytes_may_use 0
  mount-1793735 [011] ...1. 28877.261098: btrfs_add_bg_to_space_info: added bg offset 22020096 length 8388608 flags 34 to space_info->flags 2 total_bytes 8388608 bytes_used 16384 bytes_may_use 0
  mount-1793735 [011] ...1. 28877.261100: btrfs_add_bg_to_space_info: added bg offset 30408704 length 53673984 flags 36 to space_info->flags 4 total_bytes 53673984 bytes_used 131072 bytes_may_use 0

These are from loading the block groups created by mkfs during mount.

Then when bonnie++ starts doing its thing:

  kworker/u48:5-1792004 [011] ..... 28886.122050: btrfs_create_chunk: gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want 1073741824
  kworker/u48:5-1792004 [011] ..... 28886.122053: btrfs_create_chunk: gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want 1073741824 max_avail 927596544
  kworker/u48:5-1792004 [011] ..... 28886.122055: btrfs_make_block_group: make bg offset 84082688 size 117440512 type 1
  kworker/u48:5-1792004 [011] ...1. 28886.122064: btrfs_add_bg_to_space_info: added bg offset 84082688 length 117440512 flags 1 to space_info->flags 1 total_bytes 125829120 bytes_used 0 bytes_may_use 5251072

First allocation of a data block group of 112M.

  kworker/u48:5-1792004 [011] ..... 28886.192408: btrfs_create_chunk: gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want 1073741824
  kworker/u48:5-1792004 [011] ..... 28886.192413: btrfs_create_chunk: gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want 1073741824 max_avail 810156032
  kworker/u48:5-1792004 [011] ..... 28886.192415: btrfs_make_block_group: make bg offset 201523200 size 117440512 type 1
  kworker/u48:5-1792004 [011] ...1. 28886.192425: btrfs_add_bg_to_space_info: added bg offset 201523200 length 117440512 flags 1 to space_info->flags 1 total_bytes 243269632 bytes_used 0 bytes_may_use 122691584

Another 112M data block group allocated.

  kworker/u48:5-1792004 [011] ..... 28886.260935: btrfs_create_chunk: gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want 1073741824
  kworker/u48:5-1792004 [011] ..... 28886.260941: btrfs_create_chunk: gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want 1073741824 max_avail 692715520
  kworker/u48:5-1792004 [011] ..... 28886.260943: btrfs_make_block_group: make bg offset 318963712 size 117440512 type 1
  kworker/u48:5-1792004 [011] ...1. 28886.260954: btrfs_add_bg_to_space_info: added bg offset 318963712 length 117440512 flags 1 to space_info->flags 1 total_bytes 360710144 bytes_used 0 bytes_may_use 240132096

Yet another one.

  bonnie++-1793755 [010] ..... 28886.280407: btrfs_create_chunk: gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want 1073741824
  bonnie++-1793755 [010] ..... 28886.280412: btrfs_create_chunk: gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want 1073741824 max_avail 575275008
  bonnie++-1793755 [010] ..... 28886.280414: btrfs_make_block_group: make bg offset 436404224 size 117440512 type 1
  bonnie++-1793755 [010] ...1. 28886.280419: btrfs_add_bg_to_space_info: added bg offset 436404224 length 117440512 flags 1 to space_info->flags 1 total_bytes 478150656 bytes_used 0 bytes_may_use 268435456

One more.

  kworker/u48:5-1792004 [011] ..... 28886.566233: btrfs_create_chunk: gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want 1073741824
  kworker/u48:5-1792004 [011] ..... 28886.566238: btrfs_create_chunk: gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want 1073741824 max_avail 457834496
  kworker/u48:5-1792004 [011] ..... 28886.566241: btrfs_make_block_group: make bg offset 553844736 size 117440512 type 1
  kworker/u48:5-1792004 [011] ...1. 28886.566250: btrfs_add_bg_to_space_info: added bg offset 553844736 length 117440512 flags 1 to space_info->flags 1 total_bytes 595591168 bytes_used 268435456 bytes_may_use 209723392

Another one.

  bonnie++-1793755 [009] ..... 28886.613446: btrfs_create_chunk: gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want 1073741824
  bonnie++-1793755 [009] ..... 28886.613451: btrfs_create_chunk: gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want 1073741824 max_avail 340393984
  bonnie++-1793755 [009] ..... 28886.613453: btrfs_make_block_group: make bg offset 671285248 size 117440512 type 1
  bonnie++-1793755 [009] ...1. 28886.613458: btrfs_add_bg_to_space_info: added bg offset 671285248 length 117440512 flags 1 to space_info->flags 1 total_bytes 713031680 bytes_used 268435456 bytes_may_use 2 68435456

Another one.

  bonnie++-1793755 [009] ..... 28886.674953: btrfs_create_chunk: gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want 1073741824
  bonnie++-1793755 [009] ..... 28886.674957: btrfs_create_chunk: gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want 1073741824 max_avail 222953472
  bonnie++-1793755 [009] ..... 28886.674959: btrfs_make_block_group: make bg offset 788725760 size 117440512 type 1
  bonnie++-1793755 [009] ...1. 28886.674963: btrfs_add_bg_to_space_info: added bg offset 788725760 length 117440512 flags 1 to space_info->flags 1 total_bytes 830472192 bytes_used 268435456 bytes_may_use 1 34217728

Another one.

  bonnie++-1793755 [009] ..... 28886.674981: btrfs_create_chunk: gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want 1073741824
  bonnie++-1793755 [009] ..... 28886.674982: btrfs_create_chunk: gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want 1073741824 max_avail 105512960
  bonnie++-1793755 [009] ..... 28886.674983: btrfs_make_block_group: make bg offset 906166272 size 105512960 type 1
  bonnie++-1793755 [009] ...1. 28886.674984: btrfs_add_bg_to_space_info: added bg offset 906166272 length 105512960 flags 1 to space_info->flags 1 total_bytes 935985152 bytes_used 268435456 bytes_may_use 67108864

Another one, but a bit smaller (~100.6M) since we now have less space.

  bonnie++-1793758 [009] ..... 28891.962096: btrfs_create_chunk: gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want 1073741824
  bonnie++-1793758 [009] ..... 28891.962103: btrfs_create_chunk: gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want 1073741824 max_avail 12582912
  bonnie++-1793758 [009] ..... 28891.962105: btrfs_make_block_group: make bg offset 1011679232 size 12582912 type 1
  bonnie++-1793758 [009] ...1. 28891.962114: btrfs_add_bg_to_space_info: added bg offset 1011679232 length 12582912 flags 1 to space_info->flags 1 total_bytes 948568064 bytes_used 268435456 bytes_may_use 8192

Another one, this one even smaller (12M).

  kworker/u48:5-1792004 [011] ..... 28892.112802: btrfs_chunk_alloc: enter first metadata chunk alloc attempt
  kworker/u48:5-1792004 [011] ..... 28892.112805: btrfs_create_chunk: gather_device_info 1 ctl->dev_extent_min = 131072 dev_extent_want 536870912
  kworker/u48:5-1792004 [011] ..... 28892.112806: btrfs_create_chunk: gather_device_info 2 ctl->dev_extent_min = 131072 dev_extent_want 536870912 max_avail 0

536870912 is 512M, the standard 256M metadata chunk size times 2 because
of the DUP profile for metadata.
'max_avail' is what find_free_dev_extent() returns to us in
gather_device_info().

As a result, gather_device_info() sets ctl->ndevs to 0, making
decide_stripe_size() fail with -ENOSPC, and therefore metadata chunk
allocation fails while we are attempting to run delayed items during
the transaction commit.

  kworker/u48:5-1792004 [011] ..... 28892.112807: btrfs_create_chunk: decide_stripe_size fail -ENOSPC

In the syslog/dmesg pasted above, which happened after the transaction was
aborted, the space info dumps did not account for all these data block
groups that were allocated during bonnie++'s workload. And that is because
after the metadata chunk allocation failed with -ENOSPC and before the
transaction abort happened, most of the data block groups had become empty
and got deleted by by the cleaner kthread - when the abort happened, we
had bonnie++ in the middle of deleting the files it created.

But dumping the space infos right after the metadata chunk allocation fails
by adding a call to btrfs_dump_space_info_for_trans_abort() in
decide_stripe_size() when it returns -ENOSPC, we get:

  [29972.409295] BTRFS info (device nullb0): dumping space info:
  [29972.409300] BTRFS info (device nullb0): space_info DATA (sub-group id 0) has 673341440 free, is not full
  [29972.409303] BTRFS info (device nullb0): space_info total=948568064, used=0, pinned=275226624, reserved=0, may_use=0, readonly=0 zone_unusable=0
  [29972.409305] BTRFS info (device nullb0): space_info METADATA (sub-group id 0) has 3915776 free, is not full
  [29972.409306] BTRFS info (device nullb0): space_info total=53673984, used=163840, pinned=42827776, reserved=147456, may_use=6553600, readonly=65536 zone_unusable=0
  [29972.409308] BTRFS info (device nullb0): space_info SYSTEM (sub-group id 0) has 7979008 free, is not full
  [29972.409310] BTRFS info (device nullb0): space_info total=8388608, used=16384, pinned=0, reserved=0, may_use=393216, readonly=0 zone_unusable=0
  [29972.409311] BTRFS info (device nullb0): global_block_rsv: size 5767168 reserved 5767168
  [29972.409313] BTRFS info (device nullb0): trans_block_rsv: size 0 reserved 0
  [29972.409314] BTRFS info (device nullb0): chunk_block_rsv: size 393216 reserved 393216
  [29972.409315] BTRFS info (device nullb0): remap_block_rsv: size 0 reserved 0
  [29972.409316] BTRFS info (device nullb0): delayed_block_rsv: size 0 reserved 0

So here we see there's ~904.6M of data space, ~51.2M of metadata space and
8M of system space, making a total of 963.8M.

Reported-by: Aleksandar Gerasimovski <Aleksandar.Gerasimovski@belden.com>
Link: https://lore.kernel.org/linux-btrfs/SA1PR18MB56922F690C5EC2D85371408B998FA@SA1PR18MB5692.namprd18.prod.outlook.com/
Link: https://lore.kernel.org/linux-btrfs/CAL3q7H61vZ3_+eqJ1A9po2WcgNJJjUu9MJQoYB2oDSAAecHaug@mail.gmail.com/
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07 18:55:53 +02:00
Qu Wenruo
2672a26a75 btrfs: use per-profile available space in calc_available_free_space()
For the following disk layout, can_overcommit() can cause false
confidence in available space:

  devid 1 unallocated:	1GiB
  devid 2 unallocated:	50GiB
  metadata type:	RAID1

As can_overcommit() simply uses unallocated space with factor to
calculate the allocatable metadata chunk size, resulting 25.5GiB
available space.

But in reality we can only allocate one 1GiB RAID1 chunk, the remaining
49GiB on devid 2 will never be utilized to fulfill a RAID1 chunk.

This leads to various ENOSPC related transaction abort and flips the fs
read-only.

Now use per-profile available space in calc_available_free_space(), and
only when that failed we fall back to the old factor based estimation.

And for zoned devices or for the very low chance of temporary memory
allocation failure, we will still fallback to factor based estimation.
But I hope in reality it's very rare.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07 18:55:53 +02:00
Qu Wenruo
c84053d9f7 btrfs: update per-profile available estimation
This involves the following timing:

- Chunk allocation
- Chunk removal
- After Mount
- New device
- Device removal
- Device shrink
- Device enlarge

And since the function btrfs_update_per_profile_avail() will not return
an error, this won't cause new error handling path.

Although when btrfs_update_per_profile_avail() failed (only ENOSPC
possible) it will mark the per-profile available estimation as
unreliable, so later btrfs_get_per_profile_avail() will return false and
require the caller to have a fallback solution.

The function btrfs_update_per_profile_avail() will be executed with
chunk_mutex hold, thus it will slightly slow down those involved
functions, but not a lot.

As all the core workload is just various u64 calculations inside a loop,
without any tree search, the overhead should be acceptable even for all
supported 9 profiles.

For 4 disks (which exercises all 9 profiles), the execution time of that
function will still be less than 10 us.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07 18:55:53 +02:00
Qu Wenruo
52fead5eb8 btrfs: introduce the device layout aware per-profile available space
[BUG]
There is a long known bug that if metadata is using RAID1 on two disks
with unbalanced sizes, there is a very high chance to hit ENOSPC related
transaction abort.

[CAUSE]
The root cause is in the available space estimation code:

- Factor based calculation
  Just use all unallocated space, divide by the profile factor
  One obvious user is can_overcommit().

This can not handle the following example:

  devid 1 unallocated:	1GiB
  devid 2 unallocated:	50GiB
  metadata type:	RAID1

If using factor based estimation, we can use (1GiB + 50GiB) / 2 = 25.5GiB
free space for metadata.
Thus we can continue allocating metadata (over-commit) way beyond the
1GiB limit.

But this estimation is completely wrong, in reality we can only allocate
one single 1GiB RAID1 block group, thus if we continue over-commit, at
one time we will hit ENOSPC at some critical path and flips the fs
read-only.

[SOLUTION]
This patch will introduce per-profile available space estimation,
which can provide chunk-allocator like behavior to give a (mostly)
accurate result, with under-estimate corner cases.

There are some differences between the estimation and real chunk
allocator:

- No consideration on hole size
  It's fine for most cases, as all data/metadata strips are in 1GiB size
  thus there should not be any hole wasting much space.

  And chunk allocator is able to use smaller stripes when there is
  really no other choice.

  Although in theory this means it can lead to some over-estimation, it
  should not cause too much hassle in the real world.

  The other benefit of such behavior is, we avoid dev-extent tree search
  completely, thus the overhead is very small.

- No true balance for certain cases
  If we have 3 disks RAID1, and each device has 2GiB unallocated space,
  we can load balance the chunk allocation so that we can allocate 3GiB
  RAID1 chunks, and that's what chunk allocator will do.

  But this current estimation code is using the largest available space
  to do a single allocation. Meaning the estimation will be 2GiB, thus
  under estimate.

  Such under estimation is fine and after the first chunk allocation, the
  estimation will be updated and still give a correct 2GiB
  estimation.
  So this only means the estimation will be a little conservative, which
  is safer for call sites like metadata over-commit check.

With that facility, for above 1GiB + 50GiB case, it will give a RAID1
estimation of 1GiB, instead of the incorrect 25.5GiB.

Or for a more complex example:
  devid 1 unallocated:	1T
  devid 2 unallocated:  1T
  devid 3 unallocated:	10T

We will get an array of:
  RAID10:	2T
  RAID1:	2T
  RAID1C3:	1T
  RAID1C4:	0  (not enough devices)
  DUP:		6T
  RAID0:	3T
  SINGLE:	12T
  RAID5:	2T
  RAID6:	1T

[IMPLEMENTATION]
And for the each profile , we go chunk allocator level calculation:
The pseudo code looks like:

  clear_virtual_used_space_of_all_rw_devices();
  do {
  	/*
  	 * The same as chunk allocator, despite used space,
  	 * we also take virtual used space into consideration.
  	 */
  	sort_device_with_virtual_free_space();

  	/*
  	 * Unlike chunk allocator, we don't need to bother hole/stripe
  	 * size, so we use the smallest device to make sure we can
  	 * allocated as many stripes as regular chunk allocator
  	 */
  	stripe_size = device_with_smallest_free->avail_space;
	stripe_size = min(stripe_size, to_alloc / ndevs);

  	/*
  	 * Allocate a virtual chunk, allocated virtual chunk will
  	 * increase virtual used space, allow next iteration to
  	 * properly emulate chunk allocator behavior.
  	 */
  	ret = alloc_virtual_chunk(stripe_size, &allocated_size);
  	if (ret == 0)
  		avail += allocated_size;
  } while (ret == 0)

This minimal available space based calculation is not perfect, but the
important part is, the estimation is never exceeding the real available
space.

This patch just introduces the infrastructure, no hooks are executed
yet.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07 18:55:53 +02:00
Jiasheng Jiang
08ef56661f btrfs: zoned: remove redundant space_info lock and variable in do_allocation_zoned()
In do_allocation_zoned(), the code acquires space_info->lock before
block_group->lock. However, the critical section does not access or
modify any members of the space_info structure. Thus, the lock is
redundant as it provides no necessary synchronization here.

This change simplifies the locking logic and aligns the function with
other zoned paths, such as __btrfs_add_free_space_zoned(), which only
rely on block_group->lock. Since the 'space_info' local variable is
no longer used after removing the lock calls, it is also removed.

Removing this unnecessary lock reduces contention on the global
space_info lock, improving concurrency in the zoned allocation path.

Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Jiasheng Jiang <jiashengjiangcool@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07 18:55:52 +02:00
Filipe Manana
6141abb7f1 btrfs: move min sys chunk array size check to validate_sys_chunk_array()
We check the minimum size of the sys chunk array in btrfs_validate_super()
but we have a better place for that, the helper validate_sys_chunk_array()
which we use for every other sys chunk array check. So move it there, also
converting the return error from -EINVAL to -EUCLEAN, which is a better
fit and also consistent with the other checks.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07 18:55:52 +02:00
Filipe Manana
f3da62571b btrfs: remove duplicate system chunk array max size overflow check
We check it twice, once in validate_sys_chunk_array() and then again in
its caller, btrfs_validate_super(), right after it calls
validate_sys_chunk_array(). So remove the duplicated check from
btrfs_validate_super().

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07 18:55:52 +02:00
Filipe Manana
c4d30088fa btrfs: pass boolean literals as the last argument to inc_block_group_ro()
The last argument of inc_block_group_ro() is defined as a boolean, but
every caller is passing an integer literal, 0 or 1 for false and true
respectively. While this is not incorrect, as 0 and 1 are converted to
false and true, it's less readable and somewhat awkward since the
argument is defined as boolean. Replace 0 and 1 with false and true.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07 18:55:52 +02:00
Naohiro Aota
1ba19a6ea9 btrfs: tests: zoned: add tests cases for zoned code
Add a test function for the zoned code, for now it tests
btrfs_load_block_group_by_raid_type() with various test cases. The
load_zone_info_tests[] array defines the test cases.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07 18:55:52 +02:00
Linus Torvalds
591cd656a1 Linux 7.0-rc7 v7.0-rc7 2026-04-05 15:26:23 -07:00
Linus Torvalds
85fb6da43a Merge tag 'riscv-for-linus-7.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux
Pull RISC-V fixes from Paul Walmsley:

 - Fix a CONFIG_SPARSEMEM crash on RV32 by avoiding early phys_to_page()

 - Prevent runtime const infrastructure from being used by modules,
   similar to what was done for x86

 - Avoid problems when shutting down ACPI systems with IOMMUs by adding
   a device dependency between IOMMU and devices that use it

 - Fix a bug where the CPU pointer masking state isn't properly reset
   when tagged addresses aren't enabled for a task

 - Fix some incorrect register assignments, and add some missing ones,
   in kgdb support code

 - Fix compilation of non-kernel code that uses the ptrace uapi header
   by replacing BIT() with _BITUL()

 - Fix compilation of the validate_v_ptrace kselftest by working around
   kselftest macro expansion issues

* tag 'riscv-for-linus-7.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
  ACPI: RIMT: Add dependency between iommu and devices
  selftests: riscv: Add braces around EXPECT_EQ()
  riscv: use _BITUL macro rather than BIT() in ptrace uapi and kselftests
  riscv: Reset pmm when PR_TAGGED_ADDR_ENABLE is not set
  riscv: make runtime const not usable by modules
  riscv: patch: Avoid early phys_to_page()
  riscv: kgdb: fix several debug register assignment bugs
2026-04-05 14:43:47 -07:00
Linus Torvalds
10b76a429a Merge tag 'x86-urgent-2026-04-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Ingo Molnar:

 - Fix kexec crash on KCOV-instrumented kernels (Aleksandr Nogikh)

 - Fix Geode platform driver on-stack property data use-after-return
   bug (Dmitry Torokhov)

* tag 'x86-urgent-2026-04-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/platform/geode: Fix on-stack property data use-after-return bug
  x86/kexec: Disable KCOV instrumentation after load_segments()
2026-04-05 13:53:07 -07:00
Linus Torvalds
2ab99ad7fa Merge tag 'sched-urgent-2026-04-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:

 - Fix zero_vruntime tracking again (Peter Zijlstra)

 - Fix avg_vruntime() usage in sched_debug (Peter Zijlstra)

* tag 'sched-urgent-2026-04-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/debug: Fix avg_vruntime() usage
  sched/fair: Fix zero_vruntime tracking fix
2026-04-05 13:45:37 -07:00
Linus Torvalds
7bba6c8622 Merge tag 'perf-urgent-2026-04-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fix from Ingo Molnar:

 - Fix potential bad container_of() in intel_pmu_hw_config() (Ian
   Rogers)

* tag 'perf-urgent-2026-04-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86: Fix potential bad container_of in intel_pmu_hw_config
2026-04-05 13:43:26 -07:00
Linus Torvalds
1391af0364 Merge tag 'irq-urgent-2026-04-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fix from Ingo Molnar:

 - Fix RISC-V APLIC irqchip driver setup errors on ACPI systems (Jessica
   Liu)

* tag 'irq-urgent-2026-04-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  irqchip/riscv-aplic: Restrict genpd notifier to device tree only
2026-04-05 13:40:58 -07:00
Linus Torvalds
5401b9adeb i915: don't use a vma that didn't match the context VM
In eb_lookup_vma(), the code checks that the context vm matches before
incrementing the i915 vma usage count, but for the non-matching case it
didn't clear the non-matching vma pointer, so it would then mistakenly
be returned, causing potential UaF and refcount issues.

Reported-by: Yassine Mounir <sosohero200@gmail.com>
Suggested-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-04-05 12:42:25 -07:00
Linus Torvalds
eb3765aa71 Merge tag 'mips-fixes_7.0_1' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux
Pull MIPS fixes from Thomas Bogendoerfer:

 - Fix TLB uniquification for systems with TLB not initialised by
   firmware

 - Fix allocation in TLB uniquification

 - Fix SiByte cache initialisation

 - Check uart parameters from firmware on Loongson64 systems

 - Fix clock id mismatch for Ralink SoCs

 - Fix GCC version check for __mutli3 workaround

* tag 'mips-fixes_7.0_1' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux:
  mips: mm: Allocate tlb_vpn array atomically
  MIPS: mm: Rewrite TLB uniquification for the hidden bit feature
  MIPS: mm: Suppress TLB uniquification on EHINV hardware
  MIPS: Always record SEGBITS in cpu_data.vmbits
  MIPS: Fix the GCC version check for `__multi3' workaround
  MIPS: SiByte: Bring back cache initialisation
  mips: ralink: update CPU clock index
  MIPS: Loongson64: env: Check UARTs passed by LEFI cautiously
2026-04-05 11:29:07 -07:00
Linus Torvalds
1791c39014 Merge tag 'char-misc-7.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
Pull char/misc/iio driver fixes from Greg KH:
 "Here are a relativly large number of small char/misc/iio and other
  driver fixes for 7.0-rc7. There's a bunch, but overall they are all
  small fixes for issues that people have been having that I finally
  caught up with getting merged due to delays on my end.

  The "largest" change overall is just some documentation updates to the
  security-bugs.rst file to hopefully tell the AI tools (and any users
  that actually read the documentation), how to send us better security
  bug reports as the quantity of reports these past few weeks has
  increased dramatically due to tools getting better at "finding"
  things.

  Included in here are:
   - lots of small IIO driver fixes for issues reported in 7.0-rc
   - gpib driver fixes
   - comedi driver fixes
   - interconnect driver fix
   - nvmem driver fixes
   - mei driver fix
   - counter driver fix
   - binder rust driver fixes
   - some other small misc driver fixes

  All of these have been in linux-next this week with no reported issues"

* tag 'char-misc-7.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (63 commits)
  Documentation: fix two typos in latest update to the security report howto
  Documentation: clarify the mandatory and desirable info for security reports
  Documentation: explain how to find maintainers addresses for security reports
  Documentation: minor updates to the security contacts
  .get_maintainer.ignore: add myself
  nvmem: zynqmp_nvmem: Fix buffer size in DMA and memcpy
  nvmem: imx: assign nvmem_cell_info::raw_len
  misc: fastrpc: check qcom_scm_assign_mem() return in rpmsg_probe
  misc: fastrpc: possible double-free of cctx->remote_heap
  comedi: dt2815: add hardware detection to prevent crash
  comedi: runflags cannot determine whether to reclaim chanlist
  comedi: Reinit dev->spinlock between attachments to low-level drivers
  comedi: me_daq: Fix potential overrun of firmware buffer
  comedi: me4000: Fix potential overrun of firmware buffer
  comedi: ni_atmio16d: Fix invalid clean-up after failed attach
  gpib: fix use-after-free in IO ioctl handlers
  gpib: lpvo_usb: fix memory leak on disconnect
  gpib: Fix fluke driver s390 compile issue
  lis3lv02d: Omit IRQF_ONESHOT if no threaded handler is provided
  lis3lv02d: fix kernel-doc warnings
  ...
2026-04-05 10:09:33 -07:00
Linus Torvalds
7a60c79bd0 Merge tag 'tty-7.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
Pull tty fixes from Greg KH:
 "Here are two small tty vt fixes for 7.0-rc7 to resolve some reported
  issues with the resize ability of the alt screen buffer. Both of these
  have been in linux-next all week with no reported issues"

* tag 'tty-7.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
  vt: resize saved unicode buffer on alt screen exit after resize
  vt: discard stale unicode buffer on alt screen exit after resize
2026-04-05 10:04:28 -07:00
Linus Torvalds
aea7c84f28 Merge tag 'usb-7.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
Pull USB/Thunderbolt fixes from Greg KH:
 "Here are a bunch of USB and Thunderbolt fixes (most all are USB) for
  7.0-rc7. More than I normally like this late in the release cycle,
  partly due to my recent travels, and partly due to people banging away
  on the USB gadget interfaces and apis more than normal (big shoutout
  to Android for getting the vendors to actually work upstream on this,
  that's a huge win overall for everyone here)

  Included in here are:
   - Small thunderbolt fix
   - new USB serial driver ids added
   - typec driver fixes
   - gadget driver fixes for some disconnect issues
   - other usb gadget driver fixes for reported problems with binding
     and unbinding devices as happens when a gadget device connects /
     disconnects from a system it is plugged into (or it switches device
     mode at a user's request, these things are complex little
     beasts...)
   - usb offload fixes (where USB audio tunnels through the controller
     while the main CPU is asleep) for when EMP spikes hit the system
     causing disconnects to happen (as often happens with static
     electricity in the winter months). This has been much reported by
     at least one vendor, and resolves the issues they have been seeing
     with this codepath. Can't wait for the "formal methods are the
     answer!" people to try to model that one properly...
   - Other small usb driver fixes for issues reported.

  All of these have been in linux-next this week, and before, with no
  reported issues, and I've personally been stressing these harder than
  normal on my systems here with no problems"

* tag 'usb-7.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: (39 commits)
  usb: gadget: f_hid: move list and spinlock inits from bind to alloc
  usb: host: xhci-sideband: delegate offload_usage tracking to class drivers
  usb: core: use dedicated spinlock for offload state
  usb: cdns3: gadget: fix state inconsistency on gadget init failure
  usb: dwc3: imx8mp: fix memory leak on probe failure path
  usb: gadget: f_uac1_legacy: validate control request size
  usb: ulpi: fix double free in ulpi_register_interface() error path
  usb: misc: usbio: Fix URB memory leak on submit failure
  USB: core: add NO_LPM quirk for Razer Kiyo Pro webcam
  usb: cdns3: gadget: fix NULL pointer dereference in ep_queue
  usb: core: phy: avoid double use of 'usb3-phy'
  USB: serial: option: add MeiG Smart SRM825WN
  usb: gadget: f_rndis: Fix net_device lifecycle with device_move
  usb: gadget: f_subset: Fix net_device lifecycle with device_move
  usb: gadget: f_eem: Fix net_device lifecycle with device_move
  usb: gadget: f_ecm: Fix net_device lifecycle with device_move
  usb: gadget: u_ncm: Add kernel-doc comments for struct f_ncm_opts
  usb: gadget: f_rndis: Protect RNDIS options with mutex
  usb: gadget: f_subset: Fix unbalanced refcnt in geth_free
  dt-bindings: connector: add pd-disable dependency
  ...
2026-04-05 10:00:26 -07:00
Sunil V L
9156585280 ACPI: RIMT: Add dependency between iommu and devices
EPROBE_DEFER ensures IOMMU devices are probed before the devices that
depend on them. During shutdown, however, the IOMMU may be removed
first, leading to issues. To avoid this, a device link is added
which enforces the correct removal order.

Fixes: 8f77295525 ("ACPI: RISC-V: Add support for RIMT")
Signed-off-by: Sunil V L <sunilvl@oss.qualcomm.com>
Link: https://patch.msgid.link/20260303061605.722949-1-sunilvl@oss.qualcomm.com
Signed-off-by: Paul Walmsley <pjw@kernel.org>
2026-04-04 18:38:03 -06:00
Charlie Jenkins
511361fe7a selftests: riscv: Add braces around EXPECT_EQ()
EXPECT_EQ() expands to multiple lines, breaking up one-line if
statements. This issue was not present in the patch on the mailing list
but was instead introduced by the maintainer when attempting to fix up
checkpatch warnings. Add braces around EXPECT_EQ() to avoid the error
even though checkpatch suggests them to be removed:

validate_v_ptrace.c:626:17: error: ‘else’ without a previous ‘if’

Fixes: 3789d5eecd ("selftests: riscv: verify syscalls discard vector context")
Fixes: 30eb191c89 ("selftests: riscv: verify ptrace rejects invalid vector csr inputs")
Fixes: 849f05ae1e ("selftests: riscv: verify ptrace accepts valid vector csr values")
Signed-off-by: Charlie Jenkins <thecharlesjenkins@gmail.com>
Reviewed-and-tested-by: Sergey Matyukevich <geomatsi@gmail.com>
Link: https://patch.msgid.link/20260309-fix_selftests-v2-2-9d5a553a531e@gmail.com
Signed-off-by: Paul Walmsley <pjw@kernel.org>
2026-04-04 18:37:57 -06:00
Paul Walmsley
87ad7cc9aa riscv: use _BITUL macro rather than BIT() in ptrace uapi and kselftests
Fix the build of non-kernel code that includes the RISC-V ptrace uapi
header, and the RISC-V validate_v_ptrace.c kselftest, by using the
_BITUL() macro rather than BIT().  BIT() is not available outside
the kernel.

Based on patches and comments from Charlie Jenkins, Michael Neuling,
and Andreas Schwab.

Fixes: 30eb191c89 ("selftests: riscv: verify ptrace rejects invalid vector csr inputs")
Fixes: 2af7c9cf02 ("riscv/ptrace: expose riscv CFI status and state via ptrace and in core files")
Cc: Andreas Schwab <schwab@suse.de>
Cc: Michael Neuling <mikey@neuling.org>
Cc: Charlie Jenkins <thecharlesjenkins@gmail.com>
Link: https://patch.msgid.link/20260330024248.449292-1-mikey@neuling.org
Link: https://lore.kernel.org/linux-riscv/20260309-fix_selftests-v2-1-9d5a553a531e@gmail.com/
Link: https://lore.kernel.org/linux-riscv/20260309-fix_selftests-v2-3-9d5a553a531e@gmail.com/
Signed-off-by: Paul Walmsley <pjw@kernel.org>
2026-04-04 18:37:54 -06:00
Zishun Yi
3033b2b1e3 riscv: Reset pmm when PR_TAGGED_ADDR_ENABLE is not set
In set_tagged_addr_ctrl(), when PR_TAGGED_ADDR_ENABLE is not set, pmlen
is correctly set to 0, but it forgets to reset pmm. This results in the
CPU pmm state not corresponding to the software pmlen state.

Fix this by resetting pmm along with pmlen.

Fixes: 2e17430858 ("riscv: Add support for the tagged address ABI")
Signed-off-by: Zishun Yi <vulab@iscas.ac.cn>
Reviewed-by: Samuel Holland <samuel.holland@sifive.com>
Link: https://patch.msgid.link/20260322160022.21908-1-vulab@iscas.ac.cn
Signed-off-by: Paul Walmsley <pjw@kernel.org>
2026-04-04 18:37:45 -06:00
Jisheng Zhang
57f0253bc1 riscv: make runtime const not usable by modules
Similar as commit 284922f4c5 ("x86: uaccess: don't use runtime-const
rewriting in modules") does, make riscv's runtime const not usable by
modules too, to "make sure this doesn't get forgotten the next time
somebody wants to do runtime constant optimizations". The reason is
well explained in the above commit: "The runtime-const infrastructure
was never designed to handle the modular case, because the constant
fixup is only done at boot time for core kernel code."

Signed-off-by: Jisheng Zhang <jszhang@kernel.org>
Link: https://patch.msgid.link/20260221023731.3476-1-jszhang@kernel.org
Signed-off-by: Paul Walmsley <pjw@kernel.org>
2026-04-04 18:37:31 -06:00
Vivian Wang
6b60a128c2 riscv: patch: Avoid early phys_to_page()
Similarly to commit 8d09e2d569 ("arm64: patching: avoid early
page_to_phys()"), avoid using phys_to_page() for the kernel address case
in patch_map().

Since this is called from apply_boot_alternatives() in setup_arch(), and
commit 4267739cab ("arch, mm: consolidate initialization of SPARSE
memory model") has moved sparse_init() to after setup_arch(),
phys_to_page() is not available there yet, and it panics on boot with
SPARSEMEM on RV32, which does not use SPARSEMEM_VMEMMAP.

Reported-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Closes: https://lore.kernel.org/r/20260223144108-dcace0b9-02e8-4b67-a7ce-f263bed36f26@linutronix.de/
Fixes: 4267739cab ("arch, mm: consolidate initialization of SPARSE memory model")
Suggested-by: Mike Rapoport <rppt@kernel.org>
Signed-off-by: Vivian Wang <wangruikang@iscas.ac.cn>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Tested-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Link: https://patch.msgid.link/20260310-riscv-sparsemem-alternatives-fix-v1-1-659d5dd257e2@iscas.ac.cn
[pjw@kernel.org: fix the subject line to align with the patch description]
Signed-off-by: Paul Walmsley <pjw@kernel.org>
2026-04-04 18:37:03 -06:00
Paul Walmsley
834911eb8e riscv: kgdb: fix several debug register assignment bugs
Fix several bugs in the RISC-V kgdb implementation:

- The element of dbg_reg_def[] that is supposed to pertain to the S1
  register embeds instead the struct pt_regs offset of the A1
  register.  Fix this to use the S1 register offset in struct pt_regs.

- The sleeping_thread_to_gdb_regs() function copies the value of the
  S10 register into the gdb_regs[] array element meant for the S9
  register, and copies the value of the S11 register into the array
  element meant for the S10 register.  It also neglects to copy the
  value of the S11 register.  Fix all of these issues.

Fixes: fe89bd2be8 ("riscv: Add KGDB support")
Cc: Vincent Chen <vincent.chen@sifive.com>
Link: https://patch.msgid.link/fde376f8-bcfd-bfe4-e467-07d8f7608d05@kernel.org
Signed-off-by: Paul Walmsley <pjw@kernel.org>
2026-04-04 18:36:52 -06:00
Linus Torvalds
3aae9383f4 Merge tag 'input-for-v7.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input
Pull input fixes from Dmitry Torokhov:

 - new IDs for BETOP BTP-KP50B/C and Razer Wolverine V3 Pro added to
   xpad controller driver

 - another quirk for new TUXEDO InfinityBook added to i8042

 - a small fixup for Synaptics RMI4 driver to properly unlock mutex when
   encountering an error in F54

 - an update to bcm5974 touch controller driver to reliably switch into
   wellspring mode

* tag 'input-for-v7.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
  Input: xpad - add support for BETOP BTP-KP50B/C controller's wireless mode
  Input: xpad - add support for Razer Wolverine V3 Pro
  Input: synaptics-rmi4 - fix a locking bug in an error path
  Input: i8042 - add TUXEDO InfinityBook Max 16 Gen10 AMD to i8042 quirk table
  Input: bcm5974 - recover from failed mode switch
2026-04-04 08:24:32 -07:00
Willy Tarreau
f387e2e2b9 Documentation: fix two typos in latest update to the security report howto
In previous patch "Documentation: clarify the mandatory and desirable
info for security reports" I left two typos that I didn't detect in local
checks. One is "get_maintainers.pl" (no 's' in the script name), and the
other one is a missing closing quote after "Reported-by", which didn't
have effect here but I don't know if it can break rendering elsewhere
(e.g. on the public HTML page). Better fix it before it gets merged.

Signed-off-by: Willy Tarreau <w@1wt.eu>
Link: https://patch.msgid.link/20260404082033.5160-1-w@1wt.eu
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-04-04 10:38:43 +02:00
Shengyu Qu
0d9363a764 Input: xpad - add support for BETOP BTP-KP50B/C controller's wireless mode
BETOP's BTP-KP50B and BTP-KP50C controller's wireless dongles are both
working as standard Xbox 360 controllers. Add USB device IDs for them to
xpad driver.

Signed-off-by: Shengyu Qu <wiagn233@outlook.com>
Link: https://patch.msgid.link/TY4PR01MB14432B4B298EA186E5F86C46B9855A@TY4PR01MB14432.jpnprd01.prod.outlook.com
Cc: stable@vger.kernel.org
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
2026-04-03 22:37:30 -07:00
Zoltan Illes
e2b0ae529d Input: xpad - add support for Razer Wolverine V3 Pro
Add device IDs for the Razer Wolverine V3 Pro controller in both
wired (0x0a57) and wireless 2.4 GHz dongle (0x0a59) modes.

The controller uses the Xbox 360 protocol (vendor-specific class,
subclass 93, protocol 1) on interface 0 with an identical 20-byte
input report layout, so no additional processing is needed.

Signed-off-by: Zoltan Illes <zoliviragh@gmail.com>
Link: https://patch.msgid.link/20260329220031.1325509-1-137647604+ZlordHUN@users.noreply.github.com
Cc: stable@vger.kernel.org
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
2026-04-03 22:37:29 -07:00
Linus Torvalds
7ca6d1cfec Merge tag 'powerpc-7.0-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fix from Madhavan Srinivasan:

 - fix iommu incorrectly bypassing DMA APIs

Thanks to Dan Horak, Gaurav Batra, and Ritesh Harjani (IBM).

* tag 'powerpc-7.0-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  powerpc/powernv/iommu: iommu incorrectly bypass DMA APIs
2026-04-03 20:08:25 -07:00
Linus Torvalds
3719114091 Merge tag 's390-7.0-7' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 fixes from Vasily Gorbik:

 - Fix a memory leak in the zcrypt driver where the AP message buffer
   for clear key RSA requests was allocated twice, once by the caller
   and again locally, causing the first allocation to never be freed

 - Fix the cpum_sf perf sampling rate overflow adjustment to clamp the
   recalculated rate to the hardware maximum, preventing exceptions on
   heavily loaded systems running with HZ=1000

* tag 's390-7.0-7' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
  s390/zcrypt: Fix memory leak with CCA cards used as accelerator
  s390/cpum_sf: Cap sampling rate to prevent lsctl exception
2026-04-03 17:50:24 -07:00
Linus Torvalds
1523e4d567 Merge tag 'hwmon-for-v7.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging
Pull hwmon fixes from Guenter Roeck:

 - Fix temperature sensor for PRIME X670E-PRO WIFI

 - occ: Add missing newline, and fix potential division by zero

 - pmbus:
    - Fix device ID comparison and printing in tps53676_identify()
    - Add missing MODULE_IMPORT_NS("PMBUS") for ltc4286
    - Check return value of page-select write in pxe1610 probe
    - Fix array access with zero-length block tps53679 read

* tag 'hwmon-for-v7.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging:
  hwmon: (asus-ec-sensors) Fix T_Sensor for PRIME X670E-PRO WIFI
  hwmon: (occ) Fix missing newline in occ_show_extended()
  hwmon: (occ) Fix division by zero in occ_show_power_1()
  hwmon: (tps53679) Fix device ID comparison and printing in tps53676_identify()
  hwmon: (ltc4286) Add missing MODULE_IMPORT_NS("PMBUS")
  hwmon: (pxe1610) Check return value of page-select write in probe
  hwmon: (tps53679) Fix array access with zero-length block read
2026-04-03 17:13:59 -07:00
Linus Torvalds
631919fb12 Merge tag 'sched_ext-for-7.0-rc6-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext
Pull sched_ext fixes from Tejun Heo:
 "These are late but both fix subtle yet critical problems and the blast
  radius is limited strictly to sched_ext.

   - Fix stale direct dispatch state in ddsp_dsq_id which can cause
     spurious warnings in mark_direct_dispatch() on task wakeup

   - Fix is_bpf_migration_disabled() false negative on non-PREEMPT_RCU
     configs which can lead to incorrectly dispatching migration-
     disabled tasks to remote CPUs"

* tag 'sched_ext-for-7.0-rc6-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
  sched_ext: Fix stale direct dispatch state in ddsp_dsq_id
  sched_ext: Fix is_bpf_migration_disabled() false negative on non-PREEMPT_RCU
2026-04-03 12:05:06 -07:00
Linus Torvalds
e41255ce7a Merge tag 'io_uring-7.0-20260403' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull io_uring fixes from Jens Axboe:

 - A previous fix in this release covered the case of the rings being
   RCU protected during resize, but it missed a few spots. This covers
   the rest

 - Fix the cBPF filters when COW'ed, introduced in this merge window

 - Fix for an attempt to import a zero sized buffer

 - Fix for a missing clamp in importing bundle buffers

* tag 'io_uring-7.0-20260403' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
  io_uring/bpf_filters: retain COW'ed settings on parse failures
  io_uring: protect remaining lockless ctx->rings accesses with RCU
  io_uring/rsrc: reject zero-length fixed buffer import
  io_uring/net: fix slab-out-of-bounds read in io_bundle_nbufs()
2026-04-03 11:58:04 -07:00
Linus Torvalds
c514f73377 Merge tag 'spi-fix-v7.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi
Pull spi fixes from Mark Brown:
 "A small collection of fixes, mostly probe/remove issues that are the
  result of Felix Gu going and auditing those areas, plus one error
  handling fix for the Cadence QSPI driver"

* tag 'spi-fix-v7.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi:
  spi: cadence-qspi: Fix exec_mem_op error handling
  spi: amlogic: spifc-a4: unregister ECC engine on probe failure and remove() callback
  spi: stm32-ospi: Fix DMA channel leak on stm32_ospi_dma_setup() failure
  spi: stm32-ospi: Fix reset control leak on probe error
  spi: stm32-ospi: Fix resource leak in remove() callback
2026-04-03 10:19:52 -07:00
Andrea Righi
7e0ffb72de sched_ext: Fix stale direct dispatch state in ddsp_dsq_id
@p->scx.ddsp_dsq_id can be left set (non-SCX_DSQ_INVALID) triggering a
spurious warning in mark_direct_dispatch() when the next wakeup's
ops.select_cpu() calls scx_bpf_dsq_insert(), such as:

 WARNING: kernel/sched/ext.c:1273 at scx_dsq_insert_commit+0xcd/0x140

The root cause is that ddsp_dsq_id was only cleared in dispatch_enqueue(),
which is not reached in all paths that consume or cancel a direct dispatch
verdict.

Fix it by clearing it at the right places:

 - direct_dispatch(): cache the direct dispatch state in local variables
   and clear it before dispatch_enqueue() on the synchronous path. For
   the deferred path, the direct dispatch state must remain set until
   process_ddsp_deferred_locals() consumes them.

 - process_ddsp_deferred_locals(): cache the dispatch state in local
   variables and clear it before calling dispatch_to_local_dsq(), which
   may migrate the task to another rq.

 - do_enqueue_task(): clear the dispatch state on the enqueue path
   (local/global/bypass fallbacks), where the direct dispatch verdict is
   ignored.

 - dequeue_task_scx(): clear the dispatch state after dispatch_dequeue()
   to handle both the deferred dispatch cancellation and the holding_cpu
   race, covering all cases where a pending direct dispatch is
   cancelled.

 - scx_disable_task(): clear the direct dispatch state when
   transitioning a task out of the current scheduler. Waking tasks may
   have had the direct dispatch state set by the outgoing scheduler's
   ops.select_cpu() and then been queued on a wake_list via
   ttwu_queue_wakelist(), when SCX_OPS_ALLOW_QUEUED_WAKEUP is set. Such
   tasks are not on the runqueue and are not iterated by scx_bypass(),
   so their direct dispatch state won't be cleared. Without this clear,
   any subsequent SCX scheduler that tries to direct dispatch the task
   will trigger the WARN_ON_ONCE() in mark_direct_dispatch().

Fixes: 5b26f7b920 ("sched_ext: Allow SCX_DSQ_LOCAL_ON for direct dispatches")
Cc: stable@vger.kernel.org # v6.12+
Cc: Daniel Hodges <hodgesd@meta.com>
Cc: Patrick Somaru <patsomaru@meta.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-04-03 07:14:49 -10:00
Linus Torvalds
1270605fd2 Merge tag 'pm-7.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fixes from Rafael Wysocki:
 "These fix a potential NULL pointer dereference in the energy model
  netlink interface and a potential double free in an error path in
  the common cpufreq governor management code:

   - Fix a NULL pointer dereference in the energy model netlink
     interface that may occur if a given perf domain ID is not
     recognized (Changwoo Min)

   - Avoid double free in the cpufreq_dbs_governor_init() error
     path when kobject_init_and_add() fails (Guangshuo Li)"

* tag 'pm-7.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  cpufreq: governor: fix double free in cpufreq_dbs_governor_init() error path
  PM: EM: Fix NULL pointer dereference when perf domain ID is not found
2026-04-03 09:56:32 -07:00
Linus Torvalds
576db0f375 Merge tag 'thermal-7.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull thermal control fixes from Rafael Wysocki:
 "Address potential races between thermal zone removal and system
  resume that may lead to a use-after-free (in two different ways)
  and a potential use-after-free in the thermal zone unregistration
  path (Rafael Wysocki)"

* tag 'thermal-7.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  thermal: core: Fix thermal zone device registration error path
  thermal: core: Address thermal zone removal races with resume
2026-04-03 09:49:06 -07:00
Linus Torvalds
116a3308e1 Merge tag 'gpio-fixes-for-v7.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux
Pull gpio fixes from Bartosz Golaszewski:

 - fix kerneldocs for gpio-timberdale and gpio-nomadik

 - clear the "requested" flag in error path in gpiod_request_commit()

 - call of_xlate() if provided when setting up shared GPIOs

 - handle pins shared by child firmware nodes of consumer devices

 - fix return value check in gpio-qixis-fpga

 - fix suspend on gpio-mxc

 - fix gpio-microchip DT bindings

* tag 'gpio-fixes-for-v7.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux:
  dt-bindings: gpio: fix microchip #interrupt-cells
  gpio: shared: shorten the critical section in gpiochip_setup_shared()
  gpio: mxc: map Both Edge pad wakeup to Rising Edge
  gpio: qixis-fpga: Fix error handling for devm_regmap_init_mmio()
  gpio: shared: handle pins shared by child nodes of devices
  gpio: shared: call gpio_chip::of_xlate() if set
  gpiolib: clear requested flag if line is invalid
  gpio: nomadik: repair some kernel-doc comments
  gpio: timberdale: repair kernel-doc comments
  gpio: Fix resource leaks on errors in gpiochip_add_data_with_key()
2026-04-03 09:33:38 -07:00
Linus Torvalds
441c63ff42 Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 fix from Will Deacon:

 - Implement a basic static call trampoline to fix CFI failures with the
   generic implementation

* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
  arm64: Use static call trampolines when kCFI is enabled
2026-04-03 08:47:13 -07:00
Linus Torvalds
60d9212c69 Merge tag 'drm-fixes-2026-04-03' of https://gitlab.freedesktop.org/drm/kernel
Pull drm fixes from Dave Airlie:
 "Hopefully no Easter eggs in this bunch of fixes. Usual stuff across
  the amd/intel with some misc bits. Thanks to Thorsten and Alex for
  making sure a regression fix that was hanging around in process land
  finally made it in, that is probably the biggest change in here.

  core:
   - revert unplug/framebuffer fix as it caused problems
   - compat ioctl speculation fix

  bridge:
   - refcounting fix

  sysfb:
   - error handling fix

  amdgpu:
   - fix renoir audio regression
   - UserQ fixes
   - PASID handling fix
   - S4 fix for smu11 chips
   - Misc small fixes

  amdkfd:
   - Non-4K page fixes

  i915:
   - Fix for #12045: Huawei Matebook E (DRR-WXX): Persistent Black
     Screen on Boot with i915 and Gen11: Modesetting and Backlight
     Control Malfunction
   - Fix for #15826: i915: Raptor Lake-P [UHD Graphics] display
     flicker/corruption on eDP panel
   - Use crtc_state->enhanced_framing properly on ivb/hsw CPU eDP

  xe:
   - uapi: Accept canonical GPU addresses in xe_vm_madvise_ioctl
   - Disallow writes to read-only VMAs
   - PXP fixes
   - Disable garbage collector work item on SVM close
   - void memory allocations in xe_device_declare_wedged

  qaic:
   - hang fix

  ast:
   - initialisation fix"

* tag 'drm-fixes-2026-04-03' of https://gitlab.freedesktop.org/drm/kernel: (28 commits)
  drm/amd/display: Wire up dcn10_dio_construct() for all pre-DCN401 generations
  drm/ioc32: stop speculation on the drm_compat_ioctl path
  drm/sysfb: Fix efidrm error handling and memory type mismatch
  drm/i915/dp: Use crtc_state->enhanced_framing properly on ivb/hsw CPU eDP
  drm/i915/cdclk: Do the full CDCLK dance for min_voltage_level changes
  drm/amdkfd: Fix queue preemption/eviction failures by aligning control stack size to GPU page size
  drm/amdgpu: Fix wait after reset sequence in S4
  drm/amd/display: Fix NULL pointer dereference in dcn401_init_hw()
  drm/amdgpu: Change AMDGPU_VA_RESERVED_TRAP_SIZE to 64KB
  drm/amdgpu/userq: fix memory leak in MQD creation error paths
  drm/amd: Fix MQD and control stack alignment for non-4K
  drm/amdkfd: Align expected_queue_size to PAGE_SIZE
  drm/amdgpu: fix the idr allocation flags
  drm/amdgpu: validate doorbell_offset in user queue creation
  drm/amdgpu/pm: drop SMU driver if version not matched messages
  drm/xe: Avoid memory allocations in xe_device_declare_wedged()
  drm/xe: Disable garbage collector work item on SVM close
  drm/xe/pxp: Don't allow PXP on older PTL GSC FWs
  drm/xe/pxp: Clear restart flag in pxp_start after jumping back
  drm/xe/pxp: Remove incorrect handling of impossible state during suspend
  ...
2026-04-03 08:23:51 -07:00
Rafael J. Wysocki
744d5721d2 Merge branch 'pm-em'
Fix a NULL pointer dereference in the energy model netlink interface
that may occur if a given perf domain ID is not recognized (Changwoo Min).

* pm-em:
  PM: EM: Fix NULL pointer dereference when perf domain ID is not found
2026-04-03 14:15:06 +02:00
Willy Tarreau
496fa1befb Documentation: clarify the mandatory and desirable info for security reports
A significant part of the effort of the security team consists in begging
reporters for patch proposals, or asking them to provide them in regular
format, and most of the time they're willing to provide this, they just
didn't know that it would help. So let's add a section detailing the
required and desirable contents in a security report to help reporters
write more actionable reports which do not require round trips.

Cc: Eric Dumazet <edumazet@google.com>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>
Link: https://patch.msgid.link/20260403062018.31080-4-w@1wt.eu
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-04-03 13:11:23 +02:00
Willy Tarreau
a72b832a48 Documentation: explain how to find maintainers addresses for security reports
These days, 80% of the work done by the security team consists in
locating the affected subsystem in a report, running get_maintainers on
it, forwarding the report to these persons and responding to the reporter
with them in Cc. This is a huge and unneeded overhead that we must try to
lower for a better overall efficiency. This patch adds a complete section
explaining how to figure the list of recipients to send the report to.

Cc: Eric Dumazet <edumazet@google.com>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>
Link: https://patch.msgid.link/20260403062018.31080-3-w@1wt.eu
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-04-03 13:11:23 +02:00
Willy Tarreau
f2b1cbef15 Documentation: minor updates to the security contacts
This clarifies the fact that the bug reporters must use a valid
e-mail address to send their report, and that the security team
assists developers working on a fix but doesn't always produce
fixes on its own.

Cc: Eric Dumazet <edumazet@google.com>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>
Link: https://patch.msgid.link/20260403062018.31080-2-w@1wt.eu
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-04-03 13:11:23 +02:00
Dave Airlie
75f53c4b2d Merge tag 'drm-misc-fixes-2026-04-02' of https://gitlab.freedesktop.org/drm/misc/kernel into drm-fixes
A refcounting fix for bridges, revert a previous framebuffer
use-after-free fix that turned out to be causing more problems, a hang
fix for qaic, an initialization fix for ast, a error handling fix for
sysfb, and a speculation fix for drm_compat_ioctl.

Signed-off-by: Dave Airlie <airlied@redhat.com>

From: Maxime Ripard <mripard@redhat.com>
Link: https://patch.msgid.link/20260402-vivid-perfect-caiman-ca055e@houat
2026-04-03 19:07:43 +10:00
Dave Airlie
293fa6ebd4 Merge tag 'amd-drm-fixes-7.0-2026-04-02' of https://gitlab.freedesktop.org/agd5f/linux into drm-fixes
amd-drm-fixes-7.0-2026-04-02:

amdgpu:
- Fix audio regression on renoir

Signed-off-by: Dave Airlie <airlied@redhat.com>

From: Alex Deucher <alexander.deucher@amd.com>
Link: https://patch.msgid.link/20260402194409.914769-1-alexander.deucher@amd.com
2026-04-03 18:43:09 +10:00