linux

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2025-12-27 08:45:26 -05:00

Author	SHA1	Message	Date
Linus Torvalds	115fada16b	Merge tag 'for-6.19-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - fix missing btrfs_path release after printing a relocation error message - fix extent changeset leak on mmap write after failure to reserve metadata - fix fs devices list structure freeing, it could be potentially leaked under some circumstances - tree log fixes: - fix incremental directory logging where inodes for new dentries were incorrectly skipped - don't log conflicting inode if it's a directory moved in the current transaction - regression fixes: - fix incorrect btrfs_path freeing when it's auto-cleaned - revert commit simplifying preallocation of temporary structures in qgroup functions, some cases were not handled properly * tag 'for-6.19-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: fix changeset leak on mmap write after failure to reserve metadata btrfs: fix memory leak of fs_devices in degraded seed device path btrfs: fix a potential path leak in print_data_reloc_error() Revert "btrfs: add ASSERTs on prealloc in qgroup functions" btrfs: do not skip logging new dentries when logging a new name btrfs: don't log conflicting inode if it's a dir moved in the current transaction btrfs: tests: fix double btrfs_path free in remove_extent_ref()	2025-12-16 19:28:20 +12:00
Filipe Manana	37343524f0	btrfs: fix changeset leak on mmap write after failure to reserve metadata If the call to btrfs_delalloc_reserve_metadata() fails we jump to the 'out_noreserve' label and there we never free the extent_changeset allocated by the previous call to btrfs_check_data_free_space() (if qgroups are enabled). Fix this by calling extent_changeset_free() under the 'out_noreserve' label. Fixes: `6599716de2` ("btrfs: fix -ENOSPC mmap write failure on NOCOW files/extents") Reported-by: syzbot+2f8aa76e6acc9fce6638@syzkaller.appspotmail.com Link: https://lore.kernel.org/linux-btrfs/693a635a.a70a0220.33cd7b.0029.GAE@google.com/ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-12-12 16:33:18 +01:00
Deepanshu Kartikey	b57f2ddd28	btrfs: fix memory leak of fs_devices in degraded seed device path In open_seed_devices(), when find_fsid() fails and we're in DEGRADED mode, a new fs_devices is allocated via alloc_fs_devices() but is never added to the seed_list before returning. This contrasts with the normal path where fs_devices is properly added via list_add(). If any error occurs later in read_one_dev() or btrfs_read_chunk_tree(), the cleanup code iterates seed_list to free seed devices, but this orphaned fs_devices is never found and never freed, causing a memory leak. Any devices allocated via add_missing_dev() and attached to this fs_devices are also leaked. Fix this by adding the newly allocated fs_devices to seed_list in the degraded path, consistent with the normal path. Fixes: `5f37583569` ("Btrfs: move the missing device to its own fs device list") Reported-by: syzbot+eadd98df8bceb15d7fed@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=eadd98df8bceb15d7fed Tested-by: syzbot+eadd98df8bceb15d7fed@syzkaller.appspotmail.com Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Deepanshu Kartikey <kartikey406@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-12-12 16:33:12 +01:00
Qu Wenruo	313ef70a9f	btrfs: fix a potential path leak in print_data_reloc_error() Inside print_data_reloc_error(), if extent_from_logical() failed we return immediately. However there are the following cases where extent_from_logical() can return error but still holds a path: - btrfs_search_slot() returned 0 - No backref item found in extent tree - No flags_ret provided This is not possible in this call site though. So for the above two cases, we can return without releasing the path, causing extent buffer leaks. Fixes: `b9a9a85059` ("btrfs: output affected files when relocation fails") Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-12-09 04:35:28 +01:00
Qu Wenruo	428e1b114c	Revert "btrfs: add ASSERTs on prealloc in qgroup functions" This reverts commit `252877a870`. Commit `252877a870` ("btrfs: add ASSERTs on prealloc in qgroup functions") tries to remove the kfree() on preallocated qgroup during several call sites, but this cannot work as intended: - btrfs_quota_enable() - btrfs_create_qgroup() If add_qgroup_item() failed, we go out_free_path() and at that time prealloc is not yet utilized and will trigger the new ASSERT(). - btrfs_qgroup_inherit() If qgroup_auto_inherit() failed, prealloc is not yet utilized and will trigger the new ASSERT() Reported-by: syzbot+b44d4a4885bc82af2a06@syzkaller.appspotmail.com Link: https://lore.kernel.org/linux-btrfs/69369331.a70a0220.38f243.009e.GAE@google.com/ Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-12-09 04:32:46 +01:00
Filipe Manana	5630f7557d	btrfs: do not skip logging new dentries when logging a new name When we are logging a directory and the log context indicates that we are logging a new name for some other file (that is or was inside that directory), we skip logging the inodes for new dentries in the directory. This is ok most of the time, but if after the rename or link operation that triggered the logging of that directory, we have an explicit fsync of that directory without the directory inode being evicted and reloaded, we end up never logging the inodes for the new dentries that we found during the new name logging, as the next directory fsync will only process dentries that were added after the last time we logged the directory (we are doing an incremental directory logging). So make sure we always log new dentries for a directory even if we are in a context of logging a new name. We started skipping logging inodes for new dentries as of commit `c48792c6ee` ("btrfs: do not log new dentries when logging that a new name exists") and it was fine back then, because when logging a directory we always iterated over all the directory entries (for leaves changed in the current transaction) so a subsequent fsync would always log anything that was previously skipped while logging a directory when logging a new name (with btrfs_log_new_name()). But later support for incrementally logging a directory was added in commit `dc2872247e` ("btrfs: keep track of the last logged keys when logging a directory"), to avoid checking all dir items every time we log a directory, so the check to skip dentry logging added in the first commit should have been removed when the incremental support for logging a directory was added. A test case for fstests will follow soon. Reported-by: Vyacheslav Kovalevsky <slava.kovalevskiy.2014@gmail.com> Link: https://lore.kernel.org/linux-btrfs/84c4e713-85d6-42b9-8dcf-0722ed26cb05@gmail.com/ Fixes: `dc2872247e` ("btrfs: keep track of the last logged keys when logging a directory") Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-12-08 21:46:56 +01:00
Filipe Manana	266273eaf4	btrfs: don't log conflicting inode if it's a dir moved in the current transaction We can't log a conflicting inode if it's a directory and it was moved from one parent directory to another parent directory in the current transaction, as this can result an attempt to have a directory with two hard links during log replay, one for the old parent directory and another for the new parent directory. The following scenario triggers that issue: 1) We have directories "dir1" and "dir2" created in a past transaction. Directory "dir1" has inode A as its parent directory; 2) We move "dir1" to some other directory; 3) We create a file with the name "dir1" in directory inode A; 4) We fsync the new file. This results in logging the inode of the new file and the inode for the directory "dir1" that was previously moved in the current transaction. So the log tree has the INODE_REF item for the new location of "dir1"; 5) We move the new file to some other directory. This results in updating the log tree to included the new INODE_REF for the new location of the file and removes the INODE_REF for the old location. This happens during the rename when we call btrfs_log_new_name(); 6) We fsync the file, and that persists the log tree changes done in the previous step (btrfs_log_new_name() only updates the log tree in memory); 7) We have a power failure; 8) Next time the fs is mounted, log replay happens and when processing the inode for directory "dir1" we find a new INODE_REF and add that link, but we don't remove the old link of the inode since we have not logged the old parent directory of the directory inode "dir1". As a result after log replay finishes when we trigger writeback of the subvolume tree's extent buffers, the tree check will detect that we have a directory a hard link count of 2 and we get a mount failure. The errors and stack traces reported in dmesg/syslog are like this: [ 3845.729764] BTRFS info (device dm-0): start tree-log replay [ 3845.730304] page: refcount:3 mapcount:0 mapping:000000005c8a3027 index:0x1d00 pfn:0x11510c [ 3845.731236] memcg:ffff9264c02f4e00 [ 3845.731751] aops:btree_aops [btrfs] ino:1 [ 3845.732300] flags: 0x17fffc00000400a(uptodate\|private\|writeback\|node=0\|zone=2\|lastcpupid=0x1ffff) [ 3845.733346] raw: 017fffc00000400a 0000000000000000 dead000000000122 ffff9264d978aea8 [ 3845.734265] raw: 0000000000001d00 ffff92650e6d4738 00000003ffffffff ffff9264c02f4e00 [ 3845.735305] page dumped because: eb page dump [ 3845.735981] BTRFS critical (device dm-0): corrupt leaf: root=5 block=30408704 slot=6 ino=257, invalid nlink: has 2 expect no more than 1 for dir [ 3845.737786] BTRFS info (device dm-0): leaf 30408704 gen 10 total ptrs 17 free space 14881 owner 5 [ 3845.737789] BTRFS info (device dm-0): refs 4 lock_owner 0 current 30701 [ 3845.737792] item 0 key (256 INODE_ITEM 0) itemoff 16123 itemsize 160 [ 3845.737794] inode generation 3 transid 9 size 16 nbytes 16384 [ 3845.737795] block group 0 mode 40755 links 1 uid 0 gid 0 [ 3845.737797] rdev 0 sequence 2 flags 0x0 [ 3845.737798] atime 1764259517.0 [ 3845.737800] ctime 1764259517.572889464 [ 3845.737801] mtime 1764259517.572889464 [ 3845.737802] otime 1764259517.0 [ 3845.737803] item 1 key (256 INODE_REF 256) itemoff 16111 itemsize 12 [ 3845.737805] index 0 name_len 2 [ 3845.737807] item 2 key (256 DIR_ITEM 2363071922) itemoff 16077 itemsize 34 [ 3845.737808] location key (257 1 0) type 2 [ 3845.737810] transid 9 data_len 0 name_len 4 [ 3845.737811] item 3 key (256 DIR_ITEM 2676584006) itemoff 16043 itemsize 34 [ 3845.737813] location key (258 1 0) type 2 [ 3845.737814] transid 9 data_len 0 name_len 4 [ 3845.737815] item 4 key (256 DIR_INDEX 2) itemoff 16009 itemsize 34 [ 3845.737816] location key (257 1 0) type 2 [ 3845.737818] transid 9 data_len 0 name_len 4 [ 3845.737819] item 5 key (256 DIR_INDEX 3) itemoff 15975 itemsize 34 [ 3845.737820] location key (258 1 0) type 2 [ 3845.737821] transid 9 data_len 0 name_len 4 [ 3845.737822] item 6 key (257 INODE_ITEM 0) itemoff 15815 itemsize 160 [ 3845.737824] inode generation 9 transid 10 size 6 nbytes 0 [ 3845.737825] block group 0 mode 40755 links 2 uid 0 gid 0 [ 3845.737826] rdev 0 sequence 1 flags 0x0 [ 3845.737827] atime 1764259517.572889464 [ 3845.737828] ctime 1764259517.572889464 [ 3845.737830] mtime 1764259517.572889464 [ 3845.737831] otime 1764259517.572889464 [ 3845.737832] item 7 key (257 INODE_REF 256) itemoff 15801 itemsize 14 [ 3845.737833] index 2 name_len 4 [ 3845.737834] item 8 key (257 INODE_REF 258) itemoff 15787 itemsize 14 [ 3845.737836] index 2 name_len 4 [ 3845.737837] item 9 key (257 DIR_ITEM 2507850652) itemoff 15754 itemsize 33 [ 3845.737838] location key (259 1 0) type 1 [ 3845.737839] transid 10 data_len 0 name_len 3 [ 3845.737840] item 10 key (257 DIR_INDEX 2) itemoff 15721 itemsize 33 [ 3845.737842] location key (259 1 0) type 1 [ 3845.737843] transid 10 data_len 0 name_len 3 [ 3845.737844] item 11 key (258 INODE_ITEM 0) itemoff 15561 itemsize 160 [ 3845.737846] inode generation 9 transid 10 size 8 nbytes 0 [ 3845.737847] block group 0 mode 40755 links 1 uid 0 gid 0 [ 3845.737848] rdev 0 sequence 1 flags 0x0 [ 3845.737849] atime 1764259517.572889464 [ 3845.737850] ctime 1764259517.572889464 [ 3845.737851] mtime 1764259517.572889464 [ 3845.737852] otime 1764259517.572889464 [ 3845.737853] item 12 key (258 INODE_REF 256) itemoff 15547 itemsize 14 [ 3845.737855] index 3 name_len 4 [ 3845.737856] item 13 key (258 DIR_ITEM 1843588421) itemoff 15513 itemsize 34 [ 3845.737857] location key (257 1 0) type 2 [ 3845.737858] transid 10 data_len 0 name_len 4 [ 3845.737860] item 14 key (258 DIR_INDEX 2) itemoff 15479 itemsize 34 [ 3845.737861] location key (257 1 0) type 2 [ 3845.737862] transid 10 data_len 0 name_len 4 [ 3845.737863] item 15 key (259 INODE_ITEM 0) itemoff 15319 itemsize 160 [ 3845.737865] inode generation 10 transid 10 size 0 nbytes 0 [ 3845.737866] block group 0 mode 100600 links 1 uid 0 gid 0 [ 3845.737867] rdev 0 sequence 2 flags 0x0 [ 3845.737868] atime 1764259517.580874966 [ 3845.737869] ctime 1764259517.586121869 [ 3845.737870] mtime 1764259517.580874966 [ 3845.737872] otime 1764259517.580874966 [ 3845.737873] item 16 key (259 INODE_REF 257) itemoff 15306 itemsize 13 [ 3845.737874] index 2 name_len 3 [ 3845.737875] BTRFS error (device dm-0): block=30408704 write time tree block corruption detected [ 3845.739448] ------------[ cut here ]------------ [ 3845.740092] WARNING: CPU: 5 PID: 30701 at fs/btrfs/disk-io.c:335 btree_csum_one_bio+0x25a/0x270 [btrfs] [ 3845.741439] Modules linked in: btrfs dm_flakey crc32c_cryptoapi (...) [ 3845.750626] CPU: 5 UID: 0 PID: 30701 Comm: mount Tainted: G W 6.18.0-rc6-btrfs-next-218+ #1 PREEMPT(full) [ 3845.752414] Tainted: [W]=WARN [ 3845.752828] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014 [ 3845.754499] RIP: 0010:btree_csum_one_bio+0x25a/0x270 [btrfs] [ 3845.755460] Code: 31 f6 48 89 (...) [ 3845.758685] RSP: 0018:ffffa8d9c5677678 EFLAGS: 00010246 [ 3845.759450] RAX: 0000000000000000 RBX: ffff92650e6d4738 RCX: 0000000000000000 [ 3845.760309] RDX: 0000000000000000 RSI: ffffffff9aab45b9 RDI: ffff9264c4748000 [ 3845.761239] RBP: ffff9264d4324000 R08: 0000000000000000 R09: ffffa8d9c5677468 [ 3845.762607] R10: ffff926bdc1fffa8 R11: 0000000000000003 R12: ffffa8d9c5677680 [ 3845.764099] R13: 0000000000004000 R14: ffff9264dd624000 R15: ffff9264d978aba8 [ 3845.765094] FS: 00007f751fa5a840(0000) GS:ffff926c42a82000(0000) knlGS:0000000000000000 [ 3845.766226] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 3845.766970] CR2: 0000558df1815380 CR3: 000000010ed88003 CR4: 0000000000370ef0 [ 3845.768009] Call Trace: [ 3845.768392] <TASK> [ 3845.768714] btrfs_submit_bbio+0x6ee/0x7f0 [btrfs] [ 3845.769640] ? write_one_eb+0x28e/0x340 [btrfs] [ 3845.770588] btree_write_cache_pages+0x2f0/0x550 [btrfs] [ 3845.771286] ? alloc_extent_state+0x19/0x100 [btrfs] [ 3845.771967] ? merge_next_state+0x1a/0x90 [btrfs] [ 3845.772586] ? set_extent_bit+0x233/0x8b0 [btrfs] [ 3845.773198] ? xas_load+0x9/0xc0 [ 3845.773589] ? xas_find+0x14d/0x1a0 [ 3845.773969] do_writepages+0xc6/0x160 [ 3845.774367] filemap_fdatawrite_wbc+0x48/0x60 [ 3845.775003] __filemap_fdatawrite_range+0x5b/0x80 [ 3845.775902] btrfs_write_marked_extents+0x61/0x170 [btrfs] [ 3845.776707] btrfs_write_and_wait_transaction+0x4e/0xc0 [btrfs] [ 3845.777379] ? _raw_spin_unlock_irqrestore+0x23/0x40 [ 3845.777923] btrfs_commit_transaction+0x5ea/0xd20 [btrfs] [ 3845.778551] ? _raw_spin_unlock+0x15/0x30 [ 3845.778986] ? release_extent_buffer+0x34/0x160 [btrfs] [ 3845.779659] btrfs_recover_log_trees+0x7a3/0x7c0 [btrfs] [ 3845.780416] ? __pfx_replay_one_buffer+0x10/0x10 [btrfs] [ 3845.781499] open_ctree+0x10bb/0x15f0 [btrfs] [ 3845.782194] btrfs_get_tree.cold+0xb/0x16c [btrfs] [ 3845.782764] ? fscontext_read+0x15c/0x180 [ 3845.783202] ? rw_verify_area+0x50/0x180 [ 3845.783667] vfs_get_tree+0x25/0xd0 [ 3845.784047] vfs_cmd_create+0x59/0xe0 [ 3845.784458] __do_sys_fsconfig+0x4f6/0x6b0 [ 3845.784914] do_syscall_64+0x50/0x1220 [ 3845.785340] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 3845.785980] RIP: 0033:0x7f751fc7f4aa [ 3845.786759] Code: 73 01 c3 48 (...) [ 3845.789951] RSP: 002b:00007ffcdba45dc8 EFLAGS: 00000246 ORIG_RAX: 00000000000001af [ 3845.791402] RAX: ffffffffffffffda RBX: 000055ccc8291c20 RCX: 00007f751fc7f4aa [ 3845.792688] RDX: 0000000000000000 RSI: 0000000000000006 RDI: 0000000000000003 [ 3845.794308] RBP: 000055ccc8292120 R08: 0000000000000000 R09: 0000000000000000 [ 3845.795829] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 [ 3845.797183] R13: 00007f751fe11580 R14: 00007f751fe1326c R15: 00007f751fdf8a23 [ 3845.798633] </TASK> [ 3845.799067] ---[ end trace 0000000000000000 ]--- [ 3845.800215] BTRFS: error (device dm-0) in btrfs_commit_transaction:2553: errno=-5 IO failure (Error while writing out transaction) [ 3845.801860] BTRFS warning (device dm-0 state E): Skipping commit of aborted transaction. [ 3845.802815] BTRFS error (device dm-0 state EA): Transaction aborted (error -5) [ 3845.803728] BTRFS: error (device dm-0 state EA) in cleanup_transaction:2036: errno=-5 IO failure [ 3845.805374] BTRFS: error (device dm-0 state EA) in btrfs_replay_log:2083: errno=-5 IO failure (Failed to recover log tree) [ 3845.807919] BTRFS error (device dm-0 state EA): open_ctree failed: -5 Fix this by never logging a conflicting inode that is a directory and was moved in the current transaction (its last_unlink_trans equals the current transaction) and instead fallback to a transaction commit. A test case for fstests will follow soon. Reported-by: Vyacheslav Kovalevsky <slva.kovalevskiy.2014@gmail.com> Link: https://lore.kernel.org/linux-btrfs/7bbc9419-5c56-450a-b5a0-efeae7457113@gmail.com/ CC: stable@vger.kernel.org # 6.1+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-12-08 21:46:47 +01:00
Dan Carpenter	564d59410c	btrfs: tests: fix double btrfs_path free in remove_extent_ref() We converted this code to use auto free cleanup.h magic but one remaining free was accidentally left behind which leads to a double free bug. Fixes: `a320476ca8` ("btrfs: tests: do trivial BTRFS_PATH_AUTO_FREE conversions") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-12-08 21:46:43 +01:00
Linus Torvalds	51d90a15fe	Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm Pull KVM updates from Paolo Bonzini: "ARM: - Support for userspace handling of synchronous external aborts (SEAs), allowing the VMM to potentially handle the abort in a non-fatal manner - Large rework of the VGIC's list register handling with the goal of supporting more active/pending IRQs than available list registers in hardware. In addition, the VGIC now supports EOImode==1 style deactivations for IRQs which may occur on a separate vCPU than the one that acked the IRQ - Support for FEAT_XNX (user / privileged execute permissions) and FEAT_HAF (hardware update to the Access Flag) in the software page table walkers and shadow MMU - Allow page table destruction to reschedule, fixing long need_resched latencies observed when destroying a large VM - Minor fixes to KVM and selftests Loongarch: - Get VM PMU capability from HW GCFG register - Add AVEC basic support - Use 64-bit register definition for EIOINTC - Add KVM timer test cases for tools/selftests RISC/V: - SBI message passing (MPXY) support for KVM guest - Give a new, more specific error subcode for the case when in-kernel AIA virtualization fails to allocate IMSIC VS-file - Support KVM_DIRTY_LOG_INITIALLY_SET, enabling dirty log gradually in small chunks - Fix guest page fault within HLV* instructions - Flush VS-stage TLB after VCPU migration for Andes cores s390: - Always allocate ESCA (Extended System Control Area), instead of starting with the basic SCA and converting to ESCA with the addition of the 65th vCPU. The price is increased number of exits (and worse performance) on z10 and earlier processor; ESCA was introduced by z114/z196 in 2010 - VIRT_XFER_TO_GUEST_WORK support - Operation exception forwarding support - Cleanups x86: - Skip the costly "zap all SPTEs" on an MMIO generation wrap if MMIO SPTE caching is disabled, as there can't be any relevant SPTEs to zap - Relocate a misplaced export - Fix an async #PF bug where KVM would clear the completion queue when the guest transitioned in and out of paging mode, e.g. when handling an SMI and then returning to paged mode via RSM - Leave KVM's user-return notifier registered even when disabling virtualization, as long as kvm.ko is loaded. On reboot/shutdown, keeping the notifier registered is ok; the kernel does not use the MSRs and the callback will run cleanly and restore host MSRs if the CPU manages to return to userspace before the system goes down - Use the checked version of {get,put}_user() - Fix a long-lurking bug where KVM's lack of catch-up logic for periodic APIC timers can result in a hard lockup in the host - Revert the periodic kvmclock sync logic now that KVM doesn't use a clocksource that's subject to NTP corrections - Clean up KVM's handling of MMIO Stale Data and L1TF, and bury the latter behind CONFIG_CPU_MITIGATIONS - Context switch XCR0, XSS, and PKRU outside of the entry/exit fast path; the only reason they were handled in the fast path was to paper of a bug in the core #MC code, and that has long since been fixed - Add emulator support for AVX MOV instructions, to play nice with emulated devices whose guest drivers like to access PCI BARs with large multi-byte instructions x86 (AMD): - Fix a few missing "VMCB dirty" bugs - Fix the worst of KVM's lack of EFER.LMSLE emulation - Add AVIC support for addressing 4k vCPUs in x2AVIC mode - Fix incorrect handling of selective CR0 writes when checking intercepts during emulation of L2 instructions - Fix a currently-benign bug where KVM would clobber SPEC_CTRL[63:32] on VMRUN and #VMEXIT - Fix a bug where KVM corrupt the guest code stream when re-injecting a soft interrupt if the guest patched the underlying code after the VM-Exit, e.g. when Linux patches code with a temporary INT3 - Add KVM_X86_SNP_POLICY_BITS to advertise supported SNP policy bits to userspace, and extend KVM "support" to all policy bits that don't require any actual support from KVM x86 (Intel): - Use the root role from kvm_mmu_page to construct EPTPs instead of the current vCPU state, partly as worthwhile cleanup, but mostly to pave the way for tracking per-root TLB flushes, and elide EPT flushes on pCPU migration if the root is clean from a previous flush - Add a few missing nested consistency checks - Rip out support for doing "early" consistency checks via hardware as the functionality hasn't been used in years and is no longer useful in general; replace it with an off-by-default module param to WARN if hardware fails a check that KVM does not perform - Fix a currently-benign bug where KVM would drop the guest's SPEC_CTRL[63:32] on VM-Enter - Misc cleanups - Overhaul the TDX code to address systemic races where KVM (acting on behalf of userspace) could inadvertantly trigger lock contention in the TDX-Module; KVM was either working around these in weird, ugly ways, or was simply oblivious to them (though even Yan's devilish selftests could only break individual VMs, not the host kernel) - Fix a bug where KVM could corrupt a vCPU's cpu_list when freeing a TDX vCPU, if creating said vCPU failed partway through - Fix a few sparse warnings (bad annotation, 0 != NULL) - Use struct_size() to simplify copying TDX capabilities to userspace - Fix a bug where TDX would effectively corrupt user-return MSR values if the TDX Module rejects VP.ENTER and thus doesn't clobber host MSRs as expected Selftests: - Fix a math goof in mmu_stress_test when running on a single-CPU system/VM - Forcefully override ARCH from x86_64 to x86 to play nice with specifying ARCH=x86_64 on the command line - Extend a bunch of nested VMX to validate nested SVM as well - Add support for LA57 in the core VM_MODE_xxx macro, and add a test to verify KVM can save/restore nested VMX state when L1 is using 5-level paging, but L2 is not - Clean up the guest paging code in anticipation of sharing the core logic for nested EPT and nested NPT guest_memfd: - Add NUMA mempolicy support for guest_memfd, and clean up a variety of rough edges in guest_memfd along the way - Define a CLASS to automatically handle get+put when grabbing a guest_memfd from a memslot to make it harder to leak references - Enhance KVM selftests to make it easer to develop and debug selftests like those added for guest_memfd NUMA support, e.g. where test and/or KVM bugs often result in hard-to-debug SIGBUS errors - Misc cleanups Generic: - Use the recently-added WQ_PERCPU when creating the per-CPU workqueue for irqfd cleanup - Fix a goof in the dirty ring documentation - Fix choice of target for directed yield across different calls to kvm_vcpu_on_spin(); the function was always starting from the first vCPU instead of continuing the round-robin search" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (260 commits) KVM: arm64: at: Update AF on software walk only if VM has FEAT_HAFDBS KVM: arm64: at: Use correct HA bit in TCR_EL2 when regime is EL2 KVM: arm64: Document KVM_PGTABLE_PROT_{UX,PX} KVM: arm64: Fix spelling mistake "Unexpeced" -> "Unexpected" KVM: arm64: Add break to default case in kvm_pgtable_stage2_pte_prot() KVM: arm64: Add endian casting to kvm_swap_s[12]_desc() KVM: arm64: Fix compilation when CONFIG_ARM64_USE_LSE_ATOMICS=n KVM: arm64: selftests: Add test for AT emulation KVM: arm64: nv: Expose hardware access flag management to NV guests KVM: arm64: nv: Implement HW access flag management in stage-2 SW PTW KVM: arm64: Implement HW access flag management in stage-1 SW PTW KVM: arm64: Propagate PTW errors up to AT emulation KVM: arm64: Add helper for swapping guest descriptor KVM: arm64: nv: Use pgtable definitions in stage-2 walk KVM: arm64: Handle endianness in read helper for emulated PTW KVM: arm64: nv: Stop passing vCPU through void ptr in S2 PTW KVM: arm64: Call helper for reading descriptors directly KVM: arm64: nv: Advertise support for FEAT_XNX KVM: arm64: Teach ptdump about FEAT_XNX permissions KVM: s390: Use generic VIRT_XFER_TO_GUEST_WORK functions ...	2025-12-05 17:01:20 -08:00
Linus Torvalds	7696286034	Merge tag 'for-6.19-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs updates from David Sterba: "Features: - shutdown ioctl support (needs CONFIG_BTRFS_EXPERIMENTAL for now): - set filesystem state as being shut down (also named going down in other filesystems), where all active operations return EIO and this cannot be changed until unmount - pending operations are attempted to be finished but error messages may still show up depending on where exactly the shutdown happened - scrub (and device replace) vs suspend/hibernate: - a running scrub will prevent suspend, which can be annoying as suspend is an immediate request and scrub is not critical - filesystem freezing before suspend was not sufficient as the problem was in process freezing - behaviour change: on suspend scrub and device replace are cancelled, where scrub can record the last state and continue from there; the device replace has to be restarted from the beginning - zone stats exported in sysfs, from the perspective of the filesystem this includes active, reclaimable, relocation etc zones Performance: - improvements when processing space reservation tickets by optimizing locking and shrinking critical sections, cumulative improvements in lockstat numbers show +15% Notable fixes: - use vmalloc fallback when allocating bios as high order allocations can happen with wide checksums (like sha256) - scrub will always track the last position of progress so it's not starting from zero after an error Core: - under experimental config, checksum calculations are offloaded to process context, simplifies locking and allows to remove compression write worker kthread(s): - speed improvement in direct IO throughput with buffered IO fallback is +15% when not offloaded but this is more related to internal crypto subsystem improvements - this will be probably default in the future removing the sysfs tunable - (experimental) block size > page size updates: - support more operations when not using large folios (encoded read/write and send) - raid56 - more preparations for fscrypt support Other: - more conversions to auto-cleaned variables - parameter cleanups and removals - extended warning fixes - improved printing of structured values like keys - lots of other cleanups and refactoring" * tag 'for-6.19-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (147 commits) btrfs: remove unnecessary inode key in btrfs_log_all_parents() btrfs: remove redundant zero/NULL initializations in btrfs_alloc_root() btrfs: remaining BTRFS_PATH_AUTO_FREE conversions btrfs: send: do not allocate memory for xattr data when checking it exists btrfs: send: add unlikely to all unexpected overflow checks btrfs: reduce arguments to btrfs_del_inode_ref_in_log() btrfs: remove root argument from btrfs_del_dir_entries_in_log() btrfs: use test_and_set_bit() in btrfs_delayed_delete_inode_ref() btrfs: don't search back for dir inode item in INO_LOOKUP_USER btrfs: don't rewrite ret from inode_permission btrfs: add orig_logical to btrfs_bio for encryption btrfs: disable verity on encrypted inodes btrfs: disable various operations on encrypted inodes btrfs: remove redundant level reset in btrfs_del_items() btrfs: simplify leaf traversal after path release in btrfs_next_old_leaf() btrfs: optimize balance_level() path reference handling btrfs: factor out root promotion logic into promote_child_to_root() btrfs: raid56: remove the "_step" infix btrfs: raid56: enable bs > ps support btrfs: raid56: prepare finish_parity_scrub() to support bs > ps cases ...	2025-12-03 20:03:46 -08:00
Linus Torvalds	cc25df3e2e	Merge tag 'for-6.19/block-20251201' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Pull block updates from Jens Axboe: - Fix head insertion for mq-deadline, a regression from when priority support was added - Series simplifying and improving the ublk user copy code - Various ublk related cleanups - Fixup REQ_NOWAIT handling in loop/zloop, clearing NOWAIT when the request is punted to a thread for handling - Merge and then later revert loop dio nowait support, as it ended up causing excessive stack usage for when the inline issue code needs to dip back into the full file system code - Improve auto integrity code, making it less deadlock prone - Speedup polled IO handling, but manually managing the hctx lookups - Fixes for blk-throttle for SSD devices - Small series with fixes for the S390 dasd driver - Add support for caching zones, avoiding unnecessary report zone queries - MD pull requests via Yu: - fix null-ptr-dereference regression for dm-raid0 - fix IO hang for raid5 when array is broken with IO inflight - remove legacy 1s delay to speed up system shutdown - change maintainer's email address - data can be lost if array is created with different lbs devices, fix this problem and record lbs of the array in metadata - fix rcu protection for md_thread - fix mddev kobject lifetime regression - enable atomic writes for md-linear - some cleanups - bcache updates via Coly - remove useless discard and cache device code - improve usage of per-cpu workqueues - Reorganize the IO scheduler switching code, fixing some lockdep reports as well - Improve the block layer P2P DMA support - Add support to the block tracing code for zoned devices - Segment calculation improves, and memory alignment flexibility improvements - Set of prep and cleanups patches for ublk batching support. The actual batching hasn't been added yet, but helps shrink down the workload of getting that patchset ready for 6.20 - Fix for how the ps3 block driver handles segments offsets - Improve how block plugging handles batch tag allocations - nbd fixes for use-after-free of the configuration on device clear/put - Set of improvements and fixes for zloop - Add Damien as maintainer of the block zoned device code handling - Various other fixes and cleanups * tag 'for-6.19/block-20251201' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (162 commits) block/rnbd: correct all kernel-doc complaints blk-mq: use queue_hctx in blk_mq_map_queue_type md: remove legacy 1s delay in md_notify_reboot md/raid5: fix IO hang when array is broken with IO inflight md: warn about updating super block failure md/raid0: fix NULL pointer dereference in create_strip_zones() for dm-raid sbitmap: fix all kernel-doc warnings ublk: add helper of __ublk_fetch() ublk: pass const pointer to ublk_queue_is_zoned() ublk: refactor auto buffer register in ublk_dispatch_req() ublk: add `union ublk_io_buf` with improved naming ublk: add parameter `struct io_uring_cmd *` to ublk_prep_auto_buf_reg() kfifo: add kfifo_alloc_node() helper for NUMA awareness blk-mq: fix potential uaf for 'queue_hw_ctx' blk-mq: use array manage hctx map instead of xarray ublk: prevent invalid access with DEBUG s390/dasd: Use scnprintf() instead of sprintf() s390/dasd: Move device name formatting into separate function s390/dasd: Remove unnecessary debugfs_create() return checks s390/dasd: Fix gendisk parent after copy pair swap ...	2025-12-03 19:26:18 -08:00
Linus Torvalds	0abcfd8983	Merge tag 'for-6.19/io_uring-20251201' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Pull io_uring updates from Jens Axboe: - Unify how task_work cancelations are detected, placing it in the task_work running state rather than needing to check the task state - Series cleaning up and moving the cancelation code to where it belongs, in cancel.c - Cleanup of waitid and futex argument handling - Add support for mixed sized SQEs. 6.18 added support for mixed sized CQEs, improving flexibility and efficiency of workloads that need big CQEs. This adds similar support for SQEs, where the occasional need for a 128b SQE doesn't necessitate having all SQEs be 128b in size - Introduce zcrx and SQ/CQ layout queries. The former returns what zcrx features are available. And both return the ring size information to help with allocation size calculation for user provided rings like IORING_SETUP_NO_MMAP and IORING_MEM_REGION_TYPE_USER - Zcrx updates for 6.19. It includes a bunch of small patches, IORING_REGISTER_ZCRX_CTRL and RQ flushing and David's work on sharing zcrx b/w multiple io_uring instances - Series cleaning up ring initializations, notable deduplicating ring size and offset calculations. It also moves most of the checking before doing any allocations, making the code simpler - Add support for getsockname and getpeername, which is mostly a trivial hookup after a bit of refactoring on the networking side - Various fixes and cleanups * tag 'for-6.19/io_uring-20251201' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (68 commits) io_uring: Introduce getsockname io_uring cmd socket: Split out a getsockname helper for io_uring socket: Unify getsockname and getpeername implementation io_uring/query: drop unused io_handle_query_entry() ctx arg io_uring/kbuf: remove obsolete buf_nr_pages and update comments io_uring/register: use correct location for io_rings_layout io_uring/zcrx: share an ifq between rings io_uring/zcrx: add io_fill_zcrx_offsets() io_uring/zcrx: export zcrx via a file io_uring/zcrx: move io_zcrx_scrub() and dependencies up io_uring/zcrx: count zcrx users io_uring/zcrx: add sync refill queue flushing io_uring/zcrx: introduce IORING_REGISTER_ZCRX_CTRL io_uring/zcrx: elide passing msg flags io_uring/zcrx: use folio_nr_pages() instead of shift operation io_uring/zcrx: convert to use netmem_desc io_uring/query: introduce rings info query io_uring/query: introduce zcrx query io_uring: move cq/sq user offset init around io_uring: pre-calculate scq layout ...	2025-12-03 18:58:57 -08:00
Linus Torvalds	2ddcf4962c	Merge tag 'kbuild-6.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kbuild/linux Pull Kbuild updates from Nicolas Schier: - Enable -fms-extensions, allowing anonymous use of tagged struct or union in struct/union (tag kbuild-ms-extensions-6.19). An exemplary conversion patch is added here, too (btrfs). [ Editor's note: the core of this actually came in early through a shared branch and a few other trees - Linus ] - Introduce architecture-specific CC_CAN_LINK and flags for userprogs - Add new packaging target 'modules-cpio-pkg' for building a initramfs cpio w/ kmods - Handle included .c files in gen_compile_commands - Minor kbuild changes: - Use objtree for module signing key path, fixing oot kmod signing - Improve documentation of KBUILD_BUILD_TIMESTAMP - Reuse KBUILD_USERCFLAGS for UAPI, instead of defining twice - Rename scripts/Makefile.extrawarn to Makefile.warn - Drop obsolete types.h check from headers_check.pl - Remove outdated config leak ignore entries * tag 'kbuild-6.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kbuild/linux: kbuild: add target to build a cpio containing modules initramfs: add gen_init_cpio to hostprogs unconditionally kbuild: allow architectures to override CC_CAN_LINK init: deduplicate cc-can-link.sh invocations kbuild: don't enable CC_CAN_LINK if the dummy program generates warnings scripts: headers_install.sh: Remove two outdated config leak ignore entries scripts/clang-tools: Handle included .c files in gen_compile_commands kbuild: uapi: Drop types.h check from headers_check.pl kbuild: Rename Makefile.extrawarn to Makefile.warn MAINTAINERS, .mailmap: Update mail address for Nicolas Schier kbuild: uapi: reuse KBUILD_USERCFLAGS kbuild: doc: improve KBUILD_BUILD_TIMESTAMP documentation kbuild: Use objtree for module signing key path btrfs: send: make use of -fms-extensions for defining struct fs_path	2025-12-03 14:42:21 -08:00
Linus Torvalds	a8058f8442	Merge tag 'vfs-6.19-rc1.directory.locking' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull directory locking updates from Christian Brauner: "This contains the work to add centralized APIs for directory locking operations. This series is part of a larger effort to change directory operation locking to allow multiple concurrent operations in a directory. The ultimate goal is to lock the target dentry(s) rather than the whole parent directory. To help with changing the locking protocol, this series centralizes locking and lookup in new helper functions. The helpers establish a pattern where it is the dentry that is being locked and unlocked (currently the lock is held on dentry->d_parent->d_inode, but that can change in the future). This also changes vfs_mkdir() to unlock the parent on failure, as well as dput()ing the dentry. This allows end_creating() to only require the target dentry (which may be IS_ERR() after vfs_mkdir()), not the parent" * tag 'vfs-6.19-rc1.directory.locking' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: nfsd: fix end_creating() conversion VFS: introduce end_creating_keep() VFS: change vfs_mkdir() to unlock on failure. ecryptfs: use new start_creating/start_removing APIs Add start_renaming_two_dentries() VFS/ovl/smb: introduce start_renaming_dentry() VFS/nfsd/ovl: introduce start_renaming() and end_renaming() VFS: add start_creating_killable() and start_removing_killable() VFS: introduce start_removing_dentry() smb/server: use end_removing_noperm for for target of smb2_create_link() VFS: introduce start_creating_noperm() and start_removing_noperm() VFS/nfsd/cachefiles/ovl: introduce start_removing() and end_removing() VFS/nfsd/cachefiles/ovl: add start_creating() and end_creating() VFS: tidy up do_unlinkat() VFS: introduce start_dirop() and end_dirop() debugfs: rename end_creating() to debugfs_end_creating()	2025-12-01 16:13:46 -08:00
Linus Torvalds	978d337c2e	Merge tag 'vfs-6.19-rc1.guards' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull superblock lock guard updates from Christian Brauner: "This starts the work of introducing guards for superblock related locks. Introduce super_write_guard for scoped superblock write protection. This provides a guard-based alternative to the manual sb_start_write() and sb_end_write() pattern, allowing the compiler to automatically handle the cleanup" * tag 'vfs-6.19-rc1.guards' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: xfs: use super write guard in xfs_file_ioctl() open: use super write guard in do_ftruncate() btrfs: use super write guard in relocating_repair_kthread() ext4: use super write guard in write_mmp_block() btrfs: use super write guard in sb_start_write() btrfs: use super write guard btrfs_run_defrag_inode() btrfs: use super write guard in btrfs_reclaim_bgs_work() fs: add super_write_guard	2025-12-01 14:39:03 -08:00
Linus Torvalds	afdf0fb340	Merge tag 'vfs-6.19-rc1.fs_header' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull fs header updates from Christian Brauner: "This contains initial work to start splitting up fs.h. Begin the long-overdue work of splitting up the monolithic fs.h header. The header has grown to over 3000 lines and includes types and functions for many different subsystems, making it difficult to navigate and causing excessive compilation dependencies. This series introduces new focused headers for superblock-related code: - Rename fs_types.h to fs_dirent.h to better reflect its actual content (directory entry types) - Add fs/super_types.h containing superblock type definitions - Add fs/super.h containing superblock function declarations This is the first step in a longer effort to modularize the VFS headers. Cleanups: - Inode Field Layout Optimization (Mateusz Guzik) Move inode fields used during fast path lookup closer together to improve cache locality during path resolution. - current_umask() Optimization (Mateusz Guzik) Inline current_umask() and move it to fs_struct.h. This improves performance by avoiding function call overhead for this frequently-used function, and places it in a more appropriate header since it operates on fs_struct" * tag 'vfs-6.19-rc1.fs_header' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: fs: move inode fields used during fast path lookup closer together fs: inline current_umask() and move it to fs_struct.h fs: add fs/super.h header fs: add fs/super_types.h header fs: rename fs_types.h to fs_dirent.h	2025-12-01 14:18:01 -08:00
Linus Torvalds	f2e74ecfba	Merge tag 'vfs-6.19-rc1.folio' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull folio updates from Christian Brauner: "Add a new folio_next_pos() helper function that returns the file position of the first byte after the current folio. This is a common operation in filesystems when needing to know the end of the current folio. The helper is lifted from btrfs which already had its own version, and is now used across multiple filesystems and subsystems: - btrfs - buffer - ext4 - f2fs - gfs2 - iomap - netfs - xfs - mm This fixes a long-standing bug in ocfs2 on 32-bit systems with files larger than 2GiB. Presumably this is not a common configuration, but the fix is backported anyway. The other filesystems did not have bugs, they were just mildly inefficient. This also introduce uoff_t as the unsigned version of loff_t. A recent commit inadvertently changed a comparison from being unsigned (on 64-bit systems) to being signed (which it had always been on 32-bit systems), leading to sporadic fstests failures. Generally file sizes are restricted to being a signed integer, but in places where -1 is passed to indicate "up to the end of the file", it is convenient to have an unsigned type to ensure comparisons are always unsigned regardless of architecture" * tag 'vfs-6.19-rc1.folio' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: fs: Add uoff_t mm: Use folio_next_pos() xfs: Use folio_next_pos() netfs: Use folio_next_pos() iomap: Use folio_next_pos() gfs2: Use folio_next_pos() f2fs: Use folio_next_pos() ext4: Use folio_next_pos() buffer: Use folio_next_pos() btrfs: Use folio_next_pos() filemap: Add folio_next_pos()	2025-12-01 10:26:38 -08:00
Linus Torvalds	ebaeabfa5a	Merge tag 'vfs-6.19-rc1.writeback' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull writeback updates from Christian Brauner: "Features: - Allow file systems to increase the minimum writeback chunk size. The relatively low minimal writeback size of 4MiB means that written back inodes on rotational media are switched a lot. Besides introducing additional seeks, this also can lead to extreme file fragmentation on zoned devices when a lot of files are cached relative to the available writeback bandwidth. This adds a superblock field that allows the file system to override the default size, and sets it to the zone size for zoned XFS. - Add logging for slow writeback when it exceeds sysctl_hung_task_timeout_secs. This helps identify tasks waiting for a long time and pinpoint potential issues. Recording the starting jiffies is also useful when debugging a crashed vmcore. - Wake up waiting tasks when finishing the writeback of a chunk Cleanups: - filemap_* writeback interface cleanups. Adding filemap_fdatawrite_wbc ended up being a mistake, as all but the original btrfs caller should be using better high level interfaces instead. This series removes all these low-level interfaces, switches btrfs to a more specific interface, and cleans up other too low-level interfaces. With this the writeback_control that is passed to the writeback code is only initialized in three places. - Remove __filemap_fdatawrite, __filemap_fdatawrite_range, and filemap_fdatawrite_wbc - Add filemap_flush_nr helper for btrfs - Push struct writeback_control into start_delalloc_inodes in btrfs - Rename filemap_fdatawrite_range_kick to filemap_flush_range - Stop opencoding filemap_fdatawrite_range in 9p, ocfs2, and mm - Make wbc_to_tag() inline and use it in fs" * tag 'vfs-6.19-rc1.writeback' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: fs: Make wbc_to_tag() inline and use it in fs. xfs: set s_min_writeback_pages for zoned file systems writeback: allow the file system to override MIN_WRITEBACK_PAGES writeback: cleanup writeback_chunk_size mm: rename filemap_fdatawrite_range_kick to filemap_flush_range mm: remove __filemap_fdatawrite_range mm: remove filemap_fdatawrite_wbc mm: remove __filemap_fdatawrite mm,btrfs: add a filemap_flush_nr helper btrfs: push struct writeback_control into start_delalloc_inodes btrfs: use the local tmp_inode variable in start_delalloc_inodes ocfs2: don't opencode filemap_fdatawrite_range in ocfs2_journal_submit_inode_data_buffers 9p: don't opencode filemap_fdatawrite_range in v9fs_mmap_vm_close mm: don't opencode filemap_fdatawrite_range in filemap_invalidate_inode writeback: Add logging for slow writeback (exceeds sysctl_hung_task_timeout_secs) writeback: Wake up waiting tasks when finishing the writeback of a chunk.	2025-12-01 09:20:51 -08:00
Linus Torvalds	9368f0f941	Merge tag 'vfs-6.19-rc1.inode' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs inode updates from Christian Brauner: "Features: - Hide inode->i_state behind accessors. Open-coded accesses prevent asserting they are done correctly. One obvious aspect is locking, but significantly more can be checked. For example it can be detected when the code is clearing flags which are already missing, or is setting flags when it is illegal (e.g., I_FREEING when ->i_count > 0) - Provide accessors for ->i_state, converts all filesystems using coccinelle and manual conversions (btrfs, ceph, smb, f2fs, gfs2, overlayfs, nilfs2, xfs), and makes plain ->i_state access fail to compile - Rework I_NEW handling to operate without fences, simplifying the code after the accessor infrastructure is in place Cleanups: - Move wait_on_inode() from writeback.h to fs.h - Spell out fenced ->i_state accesses with explicit smp_wmb/smp_rmb for clarity - Cosmetic fixes to LRU handling - Push list presence check into inode_io_list_del() - Touch up predicts in __d_lookup_rcu() - ocfs2: retire ocfs2_drop_inode() and I_WILL_FREE usage - Assert on ->i_count in iput_final() - Assert ->i_lock held in __iget() Fixes: - Add missing fences to I_NEW handling" * tag 'vfs-6.19-rc1.inode' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (22 commits) dcache: touch up predicts in __d_lookup_rcu() fs: push list presence check into inode_io_list_del() fs: cosmetic fixes to lru handling fs: rework I_NEW handling to operate without fences fs: make plain ->i_state access fail to compile xfs: use the new ->i_state accessors nilfs2: use the new ->i_state accessors overlayfs: use the new ->i_state accessors gfs2: use the new ->i_state accessors f2fs: use the new ->i_state accessors smb: use the new ->i_state accessors ceph: use the new ->i_state accessors btrfs: use the new ->i_state accessors Manual conversion to use ->i_state accessors of all places not covered by coccinelle Coccinelle-based conversion to use ->i_state accessors fs: provide accessors for ->i_state fs: spell out fenced ->i_state accesses with explicit smp_wmb/smp_rmb fs: move wait_on_inode() from writeback.h to fs.h fs: add missing fences to I_NEW handling ocfs2: retire ocfs2_drop_inode() and I_WILL_FREE usage ...	2025-12-01 09:02:34 -08:00
Linus Torvalds	b04b2e7a61	Merge tag 'vfs-6.19-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull misc vfs updates from Christian Brauner: "Features: - Cheaper MAY_EXEC handling for path lookup. This elides MAY_WRITE permission checks during path lookup and adds the IOP_FASTPERM_MAY_EXEC flag so filesystems like btrfs can avoid expensive permission work. - Hide dentry_cache behind runtime const machinery. - Add German Maglione as virtiofs co-maintainer. Cleanups: - Tidy up and inline step_into() and walk_component() for improved code generation. - Re-enable IOCB_NOWAIT writes to files. This refactors file timestamp update logic, fixing a layering bypass in btrfs when updating timestamps on device files and improving FMODE_NOCMTIME handling in VFS now that nfsd started using it. - Path lookup optimizations extracting slowpaths into dedicated routines and adding branch prediction hints for mntput_no_expire(), fd_install(), lookup_slow(), and various other hot paths. - Enable clang's -fms-extensions flag, requiring a JFS rename to avoid conflicts. - Remove spurious exports in fs/file_attr.c. - Stop duplicating union pipe_index declaration. This depends on the shared kbuild branch that brings in -fms-extensions support which is merged into this branch. - Use MD5 library instead of crypto_shash in ecryptfs. - Use largest_zero_folio() in iomap_dio_zero(). - Replace simple_strtol/strtoul with kstrtoint/kstrtouint in init and initrd code. - Various typo fixes. Fixes: - Fix emergency sync for btrfs. Btrfs requires an explicit sync_fs() call with wait == 1 to commit super blocks. The emergency sync path never passed this, leaving btrfs data uncommitted during emergency sync. - Use local kmap in watch_queue's post_one_notification(). - Add hint prints in sb_set_blocksize() for LBS dependency on THP" * tag 'vfs-6.19-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (35 commits) MAINTAINERS: add German Maglione as virtiofs co-maintainer fs: inline step_into() and walk_component() fs: tidy up step_into() & friends before inlining orangefs: use inode_update_timestamps directly btrfs: fix the comment on btrfs_update_time btrfs: use vfs_utimes to update file timestamps fs: export vfs_utimes fs: lift the FMODE_NOCMTIME check into file_update_time_flags fs: refactor file timestamp update logic include/linux/fs.h: trivial fix: regualr -> regular fs/splice.c: trivial fix: pipes -> pipe's fs: mark lookup_slow() as noinline fs: add predicts based on nd->depth fs: move mntput_no_expire() slowpath into a dedicated routine fs: remove spurious exports in fs/file_attr.c watch_queue: Use local kmap in post_one_notification() fs: touch up predicts in path lookup fs: move fd_install() slowpath into a dedicated routine and provide commentary fs: hide dentry_cache behind runtime const machinery fs: touch predicts in do_dentry_open() ...	2025-12-01 08:44:26 -08:00
Christoph Hellwig	f981264ae7	btrfs: fix the comment on btrfs_update_time Since commit `e41f941a23` ("Btrfs: move over to use ->update_time") this is not a copy of the high-level file_update_time helper. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://patch.msgid.link/20251120064859.2911749-6-hch@lst.de Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-11-26 14:50:10 +01:00
Christoph Hellwig	ded9958704	btrfs: use vfs_utimes to update file timestamps Btrfs updates the device node timestamps for block device special files when it stop using the device. Commit `8f96a5bfa1` ("btrfs: update the bdev time directly when closing") switch that update from the correct layering to directly call the low-level helper on the bdev inode. This is wrong and got fixed in commit `54fde91f52` ("btrfs: update device path inode time instead of bd_inode") by updating the file system inode instead of the bdev inode, but this kept the incorrect bypassing of the VFS interfaces and file system ->update_times method. Fix this by using the propet vfs_utimes interface. Fixes: `8f96a5bfa1` ("btrfs: update the bdev time directly when closing") Fixes: `54fde91f52` ("btrfs: update device path inode time instead of bd_inode") Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://patch.msgid.link/20251120064859.2911749-5-hch@lst.de Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-11-26 14:50:10 +01:00
Paolo Bonzini	236831743c	Merge tag 'kvm-x86-gmem-6.19' of https://github.com/kvm-x86/linux into HEAD KVM guest_memfd changes for 6.19: - Add NUMA mempolicy support for guest_memfd, and clean up a variety of rough edges in guest_memfd along the way. - Define a CLASS to automatically handle get+put when grabbing a guest_memfd from a memslot to make it harder to leak references. - Enhance KVM selftests to make it easer to develop and debug selftests like those added for guest_memfd NUMA support, e.g. where test and/or KVM bugs often result in hard-to-debug SIGBUS errors. - Misc cleanups.	2025-11-26 09:32:44 +01:00
Filipe Manana	9e0e6577b3	btrfs: remove unnecessary inode key in btrfs_log_all_parents() We are setting up an inode key to lookup parent directory inode but all we need is the inode's objectid. The use of the key was necessary in the past but since commit `0202e83fda` ("btrfs: simplify iget helpers") we only need the objectid. So remove the key variable in the stack and use instead a simple u64 for the inode's objectid. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-25 01:53:33 +01:00
Filipe Manana	1c3e03b340	btrfs: remove redundant zero/NULL initializations in btrfs_alloc_root() We have allocated the root with kzalloc() so all the memory is already zero initialized, therefore it's redundant to assign 0 and NULL to several of the root members. Remove all of them except the atomic initializations since atomic_t is an opaque type and it's not a good practice to assume its internals. This slightly reduces the binary size. With gcc 14.2.0-19 from Debian on x86_64, before this change: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 1939404 162963 15592 2117959 205147 fs/btrfs/btrfs.ko After this change: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 1939212 162963 15592 2117767 205087 fs/btrfs/btrfs.ko Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-25 01:53:33 +01:00
David Sterba	10934c131f	btrfs: remaining BTRFS_PATH_AUTO_FREE conversions Do the remaining btrfs_path conversion to the auto cleaning, this seems to be the last one. Most of the conversions are trivial, only adding the declaration and removing the freeing, or changing the goto patterns to return. There are some functions with many changes, like __btrfs_free_extent(), btrfs_remove_from_free_space_tree() or btrfs_add_to_free_space_tree() but it still follows the same pattern. Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-25 01:53:33 +01:00
Filipe Manana	5c9cac55b7	btrfs: send: do not allocate memory for xattr data when checking it exists When checking if xattrs were deleted we don't care about their data, but we are allocating memory for the data and copying it, which only wastes time and can result in an unnecessary error in case the allocation fails. So stop allocating memory and copying data by making find_xattr() and __find_xattr() skip those steps if the given data buffer is NULL. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-25 01:53:33 +01:00
Filipe Manana	7c3acdb998	btrfs: send: add unlikely to all unexpected overflow checks There are several checks for unexpected overflows of buffers and path lengths that makes us fail the send operation with an error if for some highly unexpected reason they happen. So add the unlikely tag to those checks to hint the compiler to generate better code, while also making it more explicit in the source that it's highly unexpected. With gcc 14.2.0-19 from Debian on x86_64, I also got a small reduction the text size of the btrfs module. Before: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 1936917 162723 15592 2115232 2046a0 fs/btrfs/btrfs.ko After: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 1936789 162723 15592 2115104 204620 fs/btrfs/btrfs.ko Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-25 01:53:33 +01:00
Filipe Manana	139e3167d8	btrfs: reduce arguments to btrfs_del_inode_ref_in_log() Instead of passing a root and the objectid of the parent directory, just pass the directory inode, as like that we can extract both the root and the objectid, reducing the number of arguments by one. It also makes the function more consistent with other log tree functions in the sense that we pass the inode and not only its objectid. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-25 01:53:33 +01:00
Filipe Manana	1361f7d8da	btrfs: remove root argument from btrfs_del_dir_entries_in_log() There's no need to pass the root as we can extract it from the directory inode, so remove it. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-25 01:53:32 +01:00
Filipe Manana	9c78fe4a85	btrfs: use test_and_set_bit() in btrfs_delayed_delete_inode_ref() Instead of testing and setting the BTRFS_DELAYED_NODE_DEL_IREF bit in the delayed node's flags, use test_and_set_bit() which makes the code shorter without compromising readability and getting rid of the label and goto. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Daniel Vacek <neelx@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-25 01:53:32 +01:00
Josef Bacik	70085399b1	btrfs: don't search back for dir inode item in INO_LOOKUP_USER We don't need to search back to the inode item, the directory inode number is in key.offset, so simply use that. If we can't find the directory we'll get an ENOENT at the iget(). Note: The patch was taken from v5 of fscrypt patchset (https://lore.kernel.org/linux-btrfs/cover.1706116485.git.josef@toxicpanda.com/) which was handled over time by various people: Omar Sandoval, Sweet Tea Dorminy, Josef Bacik. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Daniel Vacek <neelx@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add note ] Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-25 01:53:27 +01:00
Josef Bacik	0185c2292c	btrfs: don't rewrite ret from inode_permission In our user safe ino resolve ioctl we'll just turn any ret into -EACCES from inode_permission(). This is redundant, and could potentially be wrong if we had an ENOMEM in the security layer or some such other error, so simply return the actual return value. Note: The patch was taken from v5 of fscrypt patchset (https://lore.kernel.org/linux-btrfs/cover.1706116485.git.josef@toxicpanda.com/) which was handled over time by various people: Omar Sandoval, Sweet Tea Dorminy, Josef Bacik. Fixes: `23d0b79dfa` ("btrfs: Add unprivileged version of ino_lookup ioctl") CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Daniel Vacek <neelx@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add note ] Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-25 01:52:24 +01:00
Josef Bacik	bd45e9e3f6	btrfs: add orig_logical to btrfs_bio for encryption When checksumming the encrypted bio on writes we need to know which logical address this checksum is for. At the point where we get the encrypted bio the bi_sector is the physical location on the target disk, so we need to save the original logical offset in the btrfs_bio. Then we can use this when checksumming the bio instead of the bio->iter.bi_sector. Note: The patch was taken from v5 of fscrypt patchset (https://lore.kernel.org/linux-btrfs/cover.1706116485.git.josef@toxicpanda.com/) which was handled over time by various people: Omar Sandoval, Sweet Tea Dorminy, Josef Bacik. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Daniel Vacek <neelx@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add note ] Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-25 01:52:23 +01:00
Sweet Tea Dorminy	45d99129b6	btrfs: disable verity on encrypted inodes Right now there isn't a way to encrypt things that aren't either filenames in directories or data on blocks on disk with extent encryption, so for now, disable verity usage with encryption on btrfs. fscrypt with fsverity should be possible and it can be implemented in the future. Note: The patch was taken from v5 of fscrypt patchset (https://lore.kernel.org/linux-btrfs/cover.1706116485.git.josef@toxicpanda.com/) which was handled over time by various people: Omar Sandoval, Sweet Tea Dorminy, Josef Bacik. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Signed-off-by: Daniel Vacek <neelx@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add note ] Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-25 01:50:56 +01:00
Omar Sandoval	f968340053	btrfs: disable various operations on encrypted inodes Initially, only normal data extents will be encrypted. This change forbids various other bits: - allows reflinking only if both inodes have the same encryption status - disable inline data on encrypted inodes Note: The patch was taken from v5 of fscrypt patchset (https://lore.kernel.org/linux-btrfs/cover.1706116485.git.josef@toxicpanda.com/) which was handled over time by various people: Omar Sandoval, Sweet Tea Dorminy, Josef Bacik. Signed-off-by: Omar Sandoval <osandov@osandov.com> Signed-off-by: Daniel Vacek <neelx@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add note ] Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-25 01:50:56 +01:00
Sun YangKai	4357dd76f5	btrfs: remove redundant level reset in btrfs_del_items() When btrfs_del_items() empties a leaf, it deletes the leaf unless it's the root node. For the root leaf case, the code used to reset its level to 0 via btrfs_set_header_level(). This is redundant as leaf nodes always have level == 0. Remove the unnecessary level assignment and invert the conditional to handle only the non-root leaf deletion. The root leaf is correctly left as-is. Signed-off-by: Sun YangKai <sunk67188@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-25 01:50:56 +01:00
Sun YangKai	139f75a3b1	btrfs: simplify leaf traversal after path release in btrfs_next_old_leaf() After releasing the path in btrfs_next_old_leaf(), we need to re-check the leaf because a balance operation may have added items or removed the last item. The original code handled this with two separate conditional blocks, the second marked with a lengthy comment explaining a "missed case". Merge these two blocks into a single logical structure that handles both scenarios more clearly. Also update the comment to be more concise and accurate, incorporating the explanation directly into the main block rather than a separate annotation. Signed-off-by: Sun YangKai <sunk67188@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-25 01:50:56 +01:00
Sun YangKai	3afa17bf24	btrfs: optimize balance_level() path reference handling Instead of incrementing refcount on 'left' node when it's referenced by path, simply transfer ownership to path and set left to NULL. This eliminates: - Unnecessary refcount increment/decrement operations - Redundant conditional checks for left node cleanup The path now consistently owns the left node reference when used. Signed-off-by: Sun YangKai <sunk67188@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-25 01:50:56 +01:00
Sun YangKai	31b37b7667	btrfs: factor out root promotion logic into promote_child_to_root() The balance_level() function is overly long and contains a cold code path that handles promoting a child node to root when the root has only one item. This code has distinct logic that is clearer and more maintainable when isolated in its own function. Signed-off-by: Sun YangKai <sunk67188@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-25 01:50:56 +01:00
Qu Wenruo	1a332a6d70	btrfs: raid56: remove the "_step" infix The following functions are introduced as a middle step for bs > ps support: - rbio_streip_step_paddr() - rbio_pstripe_step_paddr() - rbio_qstripe_step_paddr() - sector_step_paddr_in_rbio() As there is already an existing function without the infix, and has a different parameter list. But the existing functions have been cleaned up, there is no need to keep the "_step" infix, just remove it completely. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-25 01:50:19 +01:00
Qu Wenruo	8870dbeedc	btrfs: raid56: enable bs > ps support The support code for bs > ps is complete, enable it and update assertions. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-25 01:48:52 +01:00
Qu Wenruo	89ca1a403e	btrfs: raid56: prepare finish_parity_scrub() to support bs > ps cases The function finish_parity_scrub() assume each fs block can be mapped by one page, blocking bs > ps support for raid56. Prepare it for bs > ps cases by: - Introduce a helper, verify_one_parity_step() Since the P/Q generation is always done in a vertical stripe, we have to handle the range step by step. - Only clear the rbio->dbitmap if all steps of an fs block match - Remove rbio_stripe_paddr() and sector_paddr_in_rbio() helpers Now we either use the paddrs version for checksum, or the step version for P/Q generation/recovery. - Make alloc_rbio_essential_pages() to handle bs > ps cases Since for bs > ps cases, one fs block needs multiple pages, the existing simple check against rbio->stripe_pages[] is not enough. Extract a dedicated helper, alloc_rbio_sector_pages(), for the existing alloc_rbio_essential_pages(), which is still based on sector number. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-25 01:48:02 +01:00
Qu Wenruo	ba88278c69	btrfs: raid56: prepare rbio_bio_add_io_paddr() to support bs > ps cases The function rbio_bio_add_io_paddr() assume each fs block can be mapped by one page, blocking bs > ps support for raid56. Prepare it for bs > ps cases by: - Introduce a helper bio_add_paddrs() Previously we only need to add a single page to a bio for a fs block, but now we need to add multiple pages, this means we can fail halfway. In that case we need to properly revert the bio (only for its size though) for halfway failed cases. - Rename rbio_add_io_paddr() to rbio_add_io_paddrs() And change the @paddr parameter to @paddrs[]. - Change all callers to use the updated rbio_add_io_paddrs() For the @paddrs pointer used for the new function, it can be grabbed using sector_paddrs_in_rbio() and rbio_stripe_paddrs() helpers. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-25 01:47:54 +01:00
Qu Wenruo	53474a2ae1	btrfs: raid56: prepare steal_rbio() to support bs > ps cases The function steal_rbio() assume each fs block can be mapped by one page, blocking bs > ps support for raid56. Prepare it for bs > ps cases by: - Introduce two helpers to calculate the sector number Previously we assume one page will contain at least one fs block, thus can use something like "sectors_per_page = PAGE_SIZE / sectorsize;", but with bs > ps support that above number will be 0. Instead introduce two helpers: * page_nr_to_sector_nr() Returns the sector number of the first sector covered by the page. * page_nr_to_num_sectors() Return how many sectors are covered by the page. And use the returned values for bitmap operations other than open-coded "PAGE_SIZE / sectorsize". Those helpers also have extra ASSERT()s to catch weird numbers. - Use above helpers The involved functions are: * steal_rbio_page() * is_data_stripe_page() * full_page_sectors_uptodate() Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-25 01:47:41 +01:00
Qu Wenruo	05ddf35a5d	btrfs: raid56: prepare set_bio_pages_uptodate() to support bs > ps cases The function set_bio_pages_uptodate() assume each fs block can be mapped by one page, blocking bs > ps support for raid56. Prepare it for bs > ps cases by: - Update find_stripe_sector_nr() to check only the first step paddr We don't need to check each paddr, as the bios are still aligned to fs block size, thus checking the first step is enough. - Use step size to iterate the bio This means we only need to find the sector number for the first step of each fs block, and skip the remaining part. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-25 01:47:31 +01:00
Qu Wenruo	64e7b8c7c5	btrfs: raid56: prepare verify_bio_data_sectors() to support bs > ps cases The function verify_bio_data_sectors() assume each fs block can be mapped by one page, blocking bs > ps support for raid56. Prepare it for bs > ps cases by: - Make get_bio_sector_nr() to consider bs > ps cases The function is utilized to calculate the sector number of a device bio submitted by btrfs raid56 layer. - Assemble a local paddrs[] for checksum calculation - Open code btrfs_check_block_csum() btrfs_check_block_csum() only supports fs blocks backed by large folios. But for raid56 we can have fs blocks backed by multiple non-contiguous pages, e.g. direct IO, encoded read/write/send. So instead of using btrfs_check_block_csum(), open code it to use btrfs_calculate_block_csum_pages(). Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-25 01:47:20 +01:00
Qu Wenruo	e0eadfcc95	btrfs: raid56: prepare verify_one_sector() to support bs > ps cases The function verify_one_sector() assume each fs block can be mapped by one page, blocking bs > ps support for raid56. Prepare it for bs > ps cases by: - Introduce helpers to get a paddrs pointer Thankfully all the higher layer bio should still be aligned to fs block size, thus a fs block should still be fully covered by the bio. Introduce sector_paddrs_in_rbio() and rbio_stripe_paddrs(), which will return a paddrs pointer inside btrfs_raid_bio::bio_paddrs[] or stripe_paddrs[]. The pointer can be directly passed to btrfs_calculate_block_csum_pages() to verify the checksum. - Open code btrfs_check_block_csum() btrfs_check_block_csum() only supports fs blocks backed by large folios. But for raid56 we can have fs blocks backed by multiple non-contiguous pages, e.g. direct IO, encoded read/write/send. So instead of using btrfs_check_block_csum(), open code it to use btrfs_calculate_block_csum_pages(). Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-25 01:47:16 +01:00
Qu Wenruo	9ba67fd616	btrfs: raid56: prepare recover_vertical() to support bs > ps cases Currently recover_vertical() assumes that every fs block can be mapped by one page, this is blocking bs > ps support for raid56. Prepare recover_vertical() to support bs > ps cases by: - Introduce recover_vertical_step() helper Which will recover a full step (min(PAGE_SIZE, sectorsize)). Now recover_vertical() will do the error check for the specified sector, do the recover step by step, then do the sector verification. - Fix a spelling error of get_rbio_vertical_errors() The old name has a typo: "veritical". Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-25 01:47:11 +01:00
Qu Wenruo	826325b6d0	btrfs: raid56: prepare generate_pq_vertical() for bs > ps cases Unlike btrfs_calculate_block_csum_pages(), we cannot handle multiple pages at the same time for P/Q generation. So here we introduce a new @step_nr, and various helpers to grab the sub-block page from the rbio, and generate the P/Q stripe page by page. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-25 01:47:05 +01:00

1 2 3 4 5 ...

14824 Commits