Commit Graph

98291 Commits

Author SHA1 Message Date
Jakub Kicinski
bebd7b2626 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.15-rc7).

Conflicts:

tools/testing/selftests/drivers/net/hw/ncdevmem.c
  97c4e094a4 ("tests/ncdevmem: Fix double-free of queue array")
  2f1a805f32 ("selftests: ncdevmem: Implement devmem TCP TX")
https://lore.kernel.org/20250514122900.1e77d62d@canb.auug.org.au

Adjacent changes:

net/core/devmem.c
net/core/devmem.h
  0afc44d8cd ("net: devmem: fix kernel panic when netlink socket close after module unload")
  bd61848900 ("net: devmem: Implement TX path")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-15 11:28:30 -07:00
Linus Torvalds
74a6325597 Merge tag 'for-6.15-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:

 - fix potential endless loop when discarding a block group when
   disabling discard

 - reinstate message when setting a large value of mount option 'commit'

 - fix a folio leak when async extent submission fails

* tag 'for-6.15-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: add back warning for mount option commit values exceeding 300
  btrfs: fix folio leak in submit_one_async_extent()
  btrfs: fix discard worker infinite loop after disabling discard
2025-05-14 18:39:12 -07:00
Linus Torvalds
1a80a098c6 Merge tag 'execve-v6.15-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull execve fix from Kees Cook:
 "This fixes a corner case for ASLR-disabled static-PIE brk collision
  with vdso allocations:

   - binfmt_elf: Move brk for static PIE even if ASLR disabled"

* tag 'execve-v6.15-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  binfmt_elf: Move brk for static PIE even if ASLR disabled
2025-05-14 09:15:16 -07:00
Kyoji Ogasawara
4ce2affc6e btrfs: add back warning for mount option commit values exceeding 300
The Btrfs documentation states that if the commit value is greater than
300 a warning should be issued. The warning was accidentally lost in the
new mount API update.

Fixes: 6941823cc8 ("btrfs: remove old mount API code")
CC: stable@vger.kernel.org # 6.12+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Kyoji Ogasawara <sawara04.o@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-12 21:39:34 +02:00
Boris Burkov
a0fd1c6098 btrfs: fix folio leak in submit_one_async_extent()
If btrfs_reserve_extent() fails while submitting an async_extent for a
compressed write, then we fail to call free_async_extent_pages() on the
async_extent and leak its folios. A likely cause for such a failure
would be btrfs_reserve_extent() failing to find a large enough
contiguous free extent for the compressed extent.

I was able to reproduce this by:

1. mount with compress-force=zstd:3
2. fallocating most of a filesystem to a big file
3. fragmenting the remaining free space
4. trying to copy in a file which zstd would generate large compressed
   extents for (vmlinux worked well for this)

Step 4. hits the memory leak and can be repeated ad nauseam to
eventually exhaust the system memory.

Fix this by detecting the case where we fallback to uncompressed
submission for a compressed async_extent and ensuring that we call
free_async_extent_pages().

Fixes: 131a821a24 ("btrfs: fallback if compressed IO fails for ENOSPC")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Co-developed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-12 21:39:13 +02:00
Filipe Manana
54db6d1bdd btrfs: fix discard worker infinite loop after disabling discard
If the discard worker is running and there's currently only one block
group, that block group is a data block group, it's in the unused block
groups discard list and is being used (it got an extent allocated from it
after becoming unused), the worker can end up in an infinite loop if a
transaction abort happens or the async discard is disabled (during remount
or unmount for example).

This happens like this:

1) Task A, the discard worker, is at peek_discard_list() and
   find_next_block_group() returns block group X;

2) Block group X is in the unused block groups discard list (its discard
   index is BTRFS_DISCARD_INDEX_UNUSED) since at some point in the past
   it become an unused block group and was added to that list, but then
   later it got an extent allocated from it, so its ->used counter is not
   zero anymore;

3) The current transaction is aborted by task B and we end up at
   __btrfs_handle_fs_error() in the transaction abort path, where we call
   btrfs_discard_stop(), which clears BTRFS_FS_DISCARD_RUNNING from
   fs_info, and then at __btrfs_handle_fs_error() we set the fs to RO mode
   (setting SB_RDONLY in the super block's s_flags field);

4) Task A calls __add_to_discard_list() with the goal of moving the block
   group from the unused block groups discard list into another discard
   list, but at __add_to_discard_list() we end up doing nothing because
   btrfs_run_discard_work() returns false, since the super block has
   SB_RDONLY set in its flags and BTRFS_FS_DISCARD_RUNNING is not set
   anymore in fs_info->flags. So block group X remains in the unused block
   groups discard list;

5) Task A then does a goto into the 'again' label, calls
   find_next_block_group() again we gets block group X again. Then it
   repeats the previous steps over and over since there are not other
   block groups in the discard lists and block group X is never moved
   out of the unused block groups discard list since
   btrfs_run_discard_work() keeps returning false and therefore
   __add_to_discard_list() doesn't move block group X out of that discard
   list.

When this happens we can get a soft lockup report like this:

  [71.957] watchdog: BUG: soft lockup - CPU#0 stuck for 27s! [kworker/u4:3:97]
  [71.957] Modules linked in: xfs af_packet rfkill (...)
  [71.957] CPU: 0 UID: 0 PID: 97 Comm: kworker/u4:3 Tainted: G        W          6.14.2-1-default #1 openSUSE Tumbleweed 968795ef2b1407352128b466fe887416c33af6fa
  [71.957] Tainted: [W]=WARN
  [71.957] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-3-gd478f380-rebuilt.opensuse.org 04/01/2014
  [71.957] Workqueue: btrfs_discard btrfs_discard_workfn [btrfs]
  [71.957] RIP: 0010:btrfs_discard_workfn+0xc4/0x400 [btrfs]
  [71.957] Code: c1 01 48 83 (...)
  [71.957] RSP: 0018:ffffafaec03efe08 EFLAGS: 00000246
  [71.957] RAX: ffff897045500000 RBX: ffff8970413ed8d0 RCX: 0000000000000000
  [71.957] RDX: 0000000000000001 RSI: ffff8970413ed8d0 RDI: 0000000a8f1272ad
  [71.957] RBP: 0000000a9d61c60e R08: ffff897045500140 R09: 8080808080808080
  [71.957] R10: ffff897040276800 R11: fefefefefefefeff R12: ffff8970413ed860
  [71.957] R13: ffff897045500000 R14: ffff8970413ed868 R15: 0000000000000000
  [71.957] FS:  0000000000000000(0000) GS:ffff89707bc00000(0000) knlGS:0000000000000000
  [71.957] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [71.957] CR2: 00005605bcc8d2f0 CR3: 000000010376a001 CR4: 0000000000770ef0
  [71.957] PKRU: 55555554
  [71.957] Call Trace:
  [71.957]  <TASK>
  [71.957]  process_one_work+0x17e/0x330
  [71.957]  worker_thread+0x2ce/0x3f0
  [71.957]  ? __pfx_worker_thread+0x10/0x10
  [71.957]  kthread+0xef/0x220
  [71.957]  ? __pfx_kthread+0x10/0x10
  [71.957]  ret_from_fork+0x34/0x50
  [71.957]  ? __pfx_kthread+0x10/0x10
  [71.957]  ret_from_fork_asm+0x1a/0x30
  [71.957]  </TASK>
  [71.957] Kernel panic - not syncing: softlockup: hung tasks
  [71.987] CPU: 0 UID: 0 PID: 97 Comm: kworker/u4:3 Tainted: G        W    L     6.14.2-1-default #1 openSUSE Tumbleweed 968795ef2b1407352128b466fe887416c33af6fa
  [71.989] Tainted: [W]=WARN, [L]=SOFTLOCKUP
  [71.989] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-3-gd478f380-rebuilt.opensuse.org 04/01/2014
  [71.991] Workqueue: btrfs_discard btrfs_discard_workfn [btrfs]
  [71.992] Call Trace:
  [71.993]  <IRQ>
  [71.994]  dump_stack_lvl+0x5a/0x80
  [71.994]  panic+0x10b/0x2da
  [71.995]  watchdog_timer_fn.cold+0x9a/0xa1
  [71.996]  ? __pfx_watchdog_timer_fn+0x10/0x10
  [71.997]  __hrtimer_run_queues+0x132/0x2a0
  [71.997]  hrtimer_interrupt+0xff/0x230
  [71.998]  __sysvec_apic_timer_interrupt+0x55/0x100
  [71.999]  sysvec_apic_timer_interrupt+0x6c/0x90
  [72.000]  </IRQ>
  [72.000]  <TASK>
  [72.001]  asm_sysvec_apic_timer_interrupt+0x1a/0x20
  [72.002] RIP: 0010:btrfs_discard_workfn+0xc4/0x400 [btrfs]
  [72.002] Code: c1 01 48 83 (...)
  [72.005] RSP: 0018:ffffafaec03efe08 EFLAGS: 00000246
  [72.006] RAX: ffff897045500000 RBX: ffff8970413ed8d0 RCX: 0000000000000000
  [72.006] RDX: 0000000000000001 RSI: ffff8970413ed8d0 RDI: 0000000a8f1272ad
  [72.007] RBP: 0000000a9d61c60e R08: ffff897045500140 R09: 8080808080808080
  [72.008] R10: ffff897040276800 R11: fefefefefefefeff R12: ffff8970413ed860
  [72.009] R13: ffff897045500000 R14: ffff8970413ed868 R15: 0000000000000000
  [72.010]  ? btrfs_discard_workfn+0x51/0x400 [btrfs 23b01089228eb964071fb7ca156eee8cd3bf996f]
  [72.011]  process_one_work+0x17e/0x330
  [72.012]  worker_thread+0x2ce/0x3f0
  [72.013]  ? __pfx_worker_thread+0x10/0x10
  [72.014]  kthread+0xef/0x220
  [72.014]  ? __pfx_kthread+0x10/0x10
  [72.015]  ret_from_fork+0x34/0x50
  [72.015]  ? __pfx_kthread+0x10/0x10
  [72.016]  ret_from_fork_asm+0x1a/0x30
  [72.017]  </TASK>
  [72.017] Kernel Offset: 0x15000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
  [72.019] Rebooting in 90 seconds..

So fix this by making sure we move a block group out of the unused block
groups discard list when calling __add_to_discard_list().

Fixes: 2bee7eb8bb ("btrfs: discard one region at a time in async discard")
Link: https://bugzilla.suse.com/show_bug.cgi?id=1242012
CC: stable@vger.kernel.org # 5.10+
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Daniel Vacek <neelx@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-12 21:38:56 +02:00
Linus Torvalds
8b64199a7f Merge tag 'udf_for_v6.15-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull UDF fix from Jan Kara:
 "Fix a bug in UDF inode eviction leading to spewing pointless
  error messages"

* tag 'udf_for_v6.15-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  udf: Make sure i_lenExtents is uptodate on inode eviction
2025-05-12 10:23:20 -07:00
Linus Torvalds
e238e49b18 Merge tag 'vfs-6.15-rc7.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs fixes from Christian Brauner:

 - Ensure that simple_xattr_list() always includes security.* xattrs

 - Fix eventpoll busy loop optimization when combined with timeouts

 - Disable swapon() for devices with block sizes greater than page sizes

 - Don't call errseq_set() twice during mark_buffer_write_io_error().
   Just use mapping_set_error() which takes care to not deference
   unconditionally

* tag 'vfs-6.15-rc7.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  fs: Remove redundant errseq_set call in mark_buffer_write_io_error.
  swapfile: disable swapon for bs > ps devices
  fs/eventpoll: fix endless busy loop after timeout has expired
  fs/xattr.c: fix simple_xattr_list to always include security.* xattrs
2025-05-12 10:04:14 -07:00
Linus Torvalds
3ce9925823 Merge tag 'mm-hotfixes-stable-2025-05-10-14-23' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc hotfixes from Andrew Morton:
 "22 hotfixes. 13 are cc:stable and the remainder address post-6.14
  issues or aren't considered necessary for -stable kernels.

  About half are for MM. Five OCFS2 fixes and a few MAINTAINERS updates"

* tag 'mm-hotfixes-stable-2025-05-10-14-23' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (22 commits)
  mm: fix folio_pte_batch() on XEN PV
  nilfs2: fix deadlock warnings caused by lock dependency in init_nilfs()
  mm/hugetlb: copy the CMA flag when demoting
  mm, swap: fix false warning for large allocation with !THP_SWAP
  selftests/mm: fix a build failure on powerpc
  selftests/mm: fix build break when compiling pkey_util.c
  mm: vmalloc: support more granular vrealloc() sizing
  tools/testing/selftests: fix guard region test tmpfs assumption
  ocfs2: stop quota recovery before disabling quotas
  ocfs2: implement handshaking with ocfs2 recovery thread
  ocfs2: switch osb->disable_recovery to enum
  mailmap: map Uwe's BayLibre addresses to a single one
  MAINTAINERS: add mm THP section
  mm/userfaultfd: fix uninitialized output field for -EAGAIN race
  selftests/mm: compaction_test: support platform with huge mount of memory
  MAINTAINERS: add core mm section
  ocfs2: fix panic in failed foilio allocation
  mm/huge_memory: fix dereferencing invalid pmd migration entry
  MAINTAINERS: add reverse mapping section
  x86: disable image size check for test builds
  ...
2025-05-10 15:50:56 -07:00
Linus Torvalds
acbf235235 Merge tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull mount fixes from Al Viro:
 "A couple of races around legalize_mnt vs umount (both fairly old and
  hard to hit) plus two bugs in move_mount(2) - both around 'move
  detached subtree in place' logics"

* tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fix IS_MNT_PROPAGATING uses
  do_move_mount(): don't leak MNTNS_PROPAGATING on failures
  do_umount(): add missing barrier before refcount checks in sync case
  __legitimize_mnt(): check for MNT_SYNC_UMOUNT should be under mount_lock
2025-05-10 08:36:07 -07:00
Linus Torvalds
1a33418a69 Merge tag '6.15-rc5-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull smb client fixes from Steve French:

 - Fix dentry leak which can cause umount crash

 - Add warning for parse contexts error on compounded operation

* tag '6.15-rc5-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  smb: client: Avoid race in open_cached_dir with lease breaks
  smb3 client: warn when parse contexts returns error on compounded operation
2025-05-09 16:45:21 -07:00
Al Viro
d1ddc6f1d9 fix IS_MNT_PROPAGATING uses
propagate_mnt() does not attach anything to mounts created during
propagate_mnt() itself.  What's more, anything on ->mnt_slave_list
of such new mount must also be new, so we don't need to even look
there.

When move_mount() had been introduced, we've got an additional
class of mounts to skip - if we are moving from anon namespace,
we do not want to propagate to mounts we are moving (i.e. all
mounts in that anon namespace).

Unfortunately, the part about "everything on their ->mnt_slave_list
will also be ignorable" is not true - if we have propagation graph
	A -> B -> C
and do OPEN_TREE_CLONE open_tree() of B, we get
	A -> [B <-> B'] -> C
as propagation graph, where B' is a clone of B in our detached tree.
Making B private will result in
	A -> B' -> C
C still gets propagation from A, as it would after making B private
if we hadn't done that open_tree(), but now the propagation goes
through B'.  Trying to move_mount() our detached tree on subdirectory
in A should have
	* moved B' on that subdirectory in A
	* skipped the corresponding subdirectory in B' itself
	* copied B' on the corresponding subdirectory in C.
As it is, the logics in propagation_next() and friends ends up
skipping propagation into C, since it doesn't consider anything
downstream of B'.

IOW, walking the propagation graph should only skip the ->mnt_slave_list
of new mounts; the only places where the check for "in that one
anon namespace" are applicable are propagate_one() (where we should
treat that as the same kind of thing as "mountpoint we are looking
at is not visible in the mount we are looking at") and
propagation_would_overmount().  The latter is better dealt with
in the caller (can_move_mount_beneath()); on the first call of
propagation_would_overmount() the test is always false, on the
second it is always true in "move from anon namespace" case and
always false in "move within our namespace" one, so it's easier
to just use check_mnt() before bothering with the second call and
be done with that.

Fixes: 064fe6e233 ("mount: handle mount propagation for detached mount trees")
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-05-09 18:06:27 -04:00
Al Viro
267fc3a06a do_move_mount(): don't leak MNTNS_PROPAGATING on failures
as it is, a failed move_mount(2) from anon namespace breaks
all further propagation into that namespace, including normal
mounts in non-anon namespaces that would otherwise propagate
there.

Fixes: 064fe6e233 ("mount: handle mount propagation for detached mount trees")
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-05-09 18:06:10 -04:00
Al Viro
65781e19dc do_umount(): add missing barrier before refcount checks in sync case
do_umount() analogue of the race fixed in 119e1ef80e "fix
__legitimize_mnt()/mntput() race".  Here we want to make sure that
if __legitimize_mnt() doesn't notice our lock_mount_hash(), we will
notice their refcount increment.  Harder to hit than mntput_no_expire()
one, fortunately, and consequences are milder (sync umount acting
like umount -l on a rare race with RCU pathwalk hitting at just the
wrong time instead of use-after-free galore mntput_no_expire()
counterpart used to be hit).  Still a bug...

Fixes: 48a066e72d ("RCU'd vfsmounts")
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-05-09 18:05:55 -04:00
Al Viro
250cf36930 __legitimize_mnt(): check for MNT_SYNC_UMOUNT should be under mount_lock
... or we risk stealing final mntput from sync umount - raising mnt_count
after umount(2) has verified that victim is not busy, but before it
has set MNT_SYNC_UMOUNT; in that case __legitimize_mnt() doesn't see
that it's safe to quietly undo mnt_count increment and leaves dropping
the reference to caller, where it'll be a full-blown mntput().

Check under mount_lock is needed; leaving the current one done before
taking that makes no sense - it's nowhere near common enough to bother
with.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-05-09 18:02:12 -04:00
Jeremy Bongio
04679f3c27 fs: Remove redundant errseq_set call in mark_buffer_write_io_error.
mark_buffer_write_io_error sets sb->s_wb_err to -EIO twice.
Once in mapping_set_error and once in errseq_set.
Only mapping_set_error checks if bh->b_assoc_map->host is NULL.

Discovered during null pointer dereference during writeback
to a failing device:

[<ffffffff9a416dc8>] ? mark_buffer_write_io_error+0x98/0xc0
[<ffffffff9a416dbe>] ? mark_buffer_write_io_error+0x8e/0xc0
[<ffffffff9ad4bda0>] end_buffer_async_write+0x90/0xd0
[<ffffffff9ad4e3eb>] end_bio_bh_io_sync+0x2b/0x40
[<ffffffff9adbafe6>] blk_update_request+0x1b6/0x480
[<ffffffff9adbb3d8>] blk_mq_end_request+0x18/0x30
[<ffffffff9adbc6aa>] blk_mq_dispatch_rq_list+0x4da/0x8e0
[<ffffffff9adc0a68>] __blk_mq_sched_dispatch_requests+0x218/0x6a0
[<ffffffff9adc07fa>] blk_mq_sched_dispatch_requests+0x3a/0x80
[<ffffffff9adbbb98>] blk_mq_run_hw_queue+0x108/0x330
[<ffffffff9adbcf58>] blk_mq_flush_plug_list+0x178/0x5f0
[<ffffffff9adb6741>] __blk_flush_plug+0x41/0x120
[<ffffffff9adb6852>] blk_finish_plug+0x22/0x40
[<ffffffff9ad47cb0>] wb_writeback+0x150/0x280
[<ffffffff9ac5343f>] ? set_worker_desc+0x9f/0xc0
[<ffffffff9ad4676e>] wb_workfn+0x24e/0x4a0

Fixes: 485e9605c0 ("fs/buffer.c: record blockdev write errors in super_block that it backs")
Signed-off-by: Jeremy Bongio <jbongio@google.com>
Link: https://lore.kernel.org/20250507123010.1228243-1-jbongio@google.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-09 12:31:57 +02:00
Linus Torvalds
9c69f88849 Merge tag 'bcachefs-2025-05-08' of git://evilpiepirate.org/bcachefs
Pull bcachefs fixes from Kent Overstreet:

 - Some fixes to help with filesystem analysis: ensure superblock
   error count gets written if we go ERO, don't discard the journal
   aggressively (so it's available for list_journal -a)

 - Fix lost wakeup on arm causing us to get stuck when reading btree
   nodes

 - Fix fsck failing to exit on ctrl-c

 - An additional fix for filesystems with misaligned bucket sizes: we
   now ensure that allocations are properly aligned

 - Setting background target but not promote target will now leave that
   data cached on the foreground target, as it used to

 - Revert a change to when we allocate the VFS superblock, this was done
   for implementing blk_holder_ops but ended up not being needed, and
   allocating a superblock and not setting SB_BORN while we do recovery
   caused sync() calls and other things to hang

 - Assorted fixes for harmless error messages that caused concern to
   users

* tag 'bcachefs-2025-05-08' of git://evilpiepirate.org/bcachefs:
  bcachefs: Don't aggressively discard the journal
  bcachefs: Ensure superblock gets written when we go ERO
  bcachefs: Filter out harmless EROFS error messages
  bcachefs: journal_shutdown is EROFS, not EIO
  bcachefs: Call bch2_fs_start before getting vfs superblock
  bcachefs: fix hung task timeout in journal read
  bcachefs: Add missing barriers before wake_up_bit()
  bcachefs: Ensure proper write alignment
  bcachefs: Improve want_cached_ptr()
  bcachefs: thread_with_stdio: fix spinning instead of exiting
2025-05-08 14:28:49 -07:00
Jakub Kicinski
6b02fd7799 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.15-rc6).

No conflicts.

Adjacent changes:

net/core/dev.c:
  08e9f2d584 ("net: Lock netdevices during dev_shutdown")
  a82dc19db1 ("net: avoid potential race between netdev_get_by_index_lock() and netns switch")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-08 08:59:02 -07:00
Linus Torvalds
80ae5fb229 Merge tag 'v6.15-rc5-ksmbd-server-fixes' of git://git.samba.org/ksmbd
Pull smb server fixes from Steve French:

 - Fix UAF closing file table (e.g. in tree disconnect)

 - Fix potential out of bounds write

 - Fix potential memory leak parsing lease state in open

 - Fix oops in rename with empty target

* tag 'v6.15-rc5-ksmbd-server-fixes' of git://git.samba.org/ksmbd:
  ksmbd: Fix UAF in __close_file_table_ids
  ksmbd: prevent out-of-bounds stream writes by validating *pos
  ksmbd: fix memory leak in parse_lease_state()
  ksmbd: prevent rename with empty string
2025-05-08 08:22:35 -07:00
Ryusuke Konishi
fb881cd760 nilfs2: fix deadlock warnings caused by lock dependency in init_nilfs()
After commit c0e473a0d2 ("block: fix race between set_blocksize and read
paths") was merged, set_blocksize() called by sb_set_blocksize() now locks
the inode of the backing device file.  As a result of this change, syzbot
started reporting deadlock warnings due to a circular dependency involving
the semaphore "ns_sem" of the nilfs object, the inode lock of the backing
device file, and the locks that this inode lock is transitively dependent
on.

This is caused by a new lock dependency added by the above change, since
init_nilfs() calls sb_set_blocksize() in the lock section of "ns_sem". 
However, these warnings are false positives because init_nilfs() is called
in the early stage of the mount operation and the filesystem has not yet
started.

The reason why "ns_sem" is locked in init_nilfs() was to avoid a race
condition in nilfs_fill_super() caused by sharing a nilfs object among
multiple filesystem instances (super block structures) in the early
implementation.  However, nilfs objects and super block structures have
long ago become one-to-one, and there is no longer any need to use the
semaphore there.

So, fix this issue by removing the use of the semaphore "ns_sem" in
init_nilfs().

Link: https://lkml.kernel.org/r/20250503053327.12294-1-konishi.ryusuke@gmail.com
Fixes: c0e473a0d2 ("block: fix race between set_blocksize and read paths")
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Reported-by: syzbot+00f7f5b884b117ee6773@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=00f7f5b884b117ee6773
Tested-by: syzbot+00f7f5b884b117ee6773@syzkaller.appspotmail.com
Reported-by: syzbot+f30591e72bfc24d4715b@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=f30591e72bfc24d4715b
Tested-by: syzbot+f30591e72bfc24d4715b@syzkaller.appspotmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-07 23:39:42 -07:00
Jan Kara
fcaf3b2683 ocfs2: stop quota recovery before disabling quotas
Currently quota recovery is synchronized with unmount using sb->s_umount
semaphore.  That is however prone to deadlocks because
flush_workqueue(osb->ocfs2_wq) called from umount code can wait for quota
recovery to complete while ocfs2_finish_quota_recovery() waits for
sb->s_umount semaphore.

Grabbing of sb->s_umount semaphore in ocfs2_finish_quota_recovery() is
only needed to protect that function from disabling of quotas from
ocfs2_dismount_volume().  Handle this problem by disabling quota recovery
early during unmount in ocfs2_dismount_volume() instead so that we can
drop acquisition of sb->s_umount from ocfs2_finish_quota_recovery().

Link: https://lkml.kernel.org/r/20250424134515.18933-6-jack@suse.cz
Fixes: 5f530de63c ("ocfs2: Use s_umount for quota recovery protection")
Signed-off-by: Jan Kara <jack@suse.cz>
Reported-by: Shichangkuo <shi.changkuo@h3c.com>
Reported-by: Murad Masimov <m.masimov@mt-integration.ru>
Reviewed-by: Heming Zhao <heming.zhao@suse.com>
Tested-by: Heming Zhao <heming.zhao@suse.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Jun Piao <piaojun@huawei.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-07 23:39:40 -07:00
Jan Kara
8f947e0fd5 ocfs2: implement handshaking with ocfs2 recovery thread
We will need ocfs2 recovery thread to acknowledge transitions of
recovery_state when disabling particular types of recovery.  This is
similar to what currently happens when disabling recovery completely, just
more general.  Implement the handshake and use it for exit from recovery.

Link: https://lkml.kernel.org/r/20250424134515.18933-5-jack@suse.cz
Fixes: 5f530de63c ("ocfs2: Use s_umount for quota recovery protection")
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Heming Zhao <heming.zhao@suse.com>
Tested-by: Heming Zhao <heming.zhao@suse.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Jun Piao <piaojun@huawei.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Murad Masimov <m.masimov@mt-integration.ru>
Cc: Shichangkuo <shi.changkuo@h3c.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-07 23:39:40 -07:00
Jan Kara
c0fb83088f ocfs2: switch osb->disable_recovery to enum
Patch series "ocfs2: Fix deadlocks in quota recovery", v3.

This implements another approach to fixing quota recovery deadlocks.  We
avoid grabbing sb->s_umount semaphore from ocfs2_finish_quota_recovery()
and instead stop quota recovery early in ocfs2_dismount_volume().


This patch (of 3):

We will need more recovery states than just pure enable / disable to fix
deadlocks with quota recovery.  Switch osb->disable_recovery to enum.

Link: https://lkml.kernel.org/r/20250424134301.1392-1-jack@suse.cz
Link: https://lkml.kernel.org/r/20250424134515.18933-4-jack@suse.cz
Fixes: 5f530de63c ("ocfs2: Use s_umount for quota recovery protection")
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Heming Zhao <heming.zhao@suse.com>
Tested-by: Heming Zhao <heming.zhao@suse.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Jun Piao <piaojun@huawei.com>
Cc: Murad Masimov <m.masimov@mt-integration.ru>
Cc: Shichangkuo <shi.changkuo@h3c.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-07 23:39:40 -07:00
Peter Xu
9556772917 mm/userfaultfd: fix uninitialized output field for -EAGAIN race
While discussing some userfaultfd relevant issues recently, Andrea noticed
a potential ABI breakage with -EAGAIN on almost all userfaultfd ioctl()s.

Quote from Andrea, explaining how -EAGAIN was processed, and how this
should fix it (taking example of UFFDIO_COPY ioctl):

  The "mmap_changing" and "stale pmd" conditions are already reported as
  -EAGAIN written in the copy field, this does not change it. This change
  removes the subnormal case that left copy.copy uninitialized and required
  apps to explicitly set the copy field to get deterministic
  behavior (which is a requirement contrary to the documentation in both
  the manpage and source code). In turn there's no alteration to backwards
  compatibility as result of this change because userland will find the
  copy field consistently set to -EAGAIN, and not anymore sometime -EAGAIN
  and sometime uninitialized.

  Even then the change only can make a difference to non cooperative users
  of userfaultfd, so when UFFD_FEATURE_EVENT_* is enabled, which is not
  true for the vast majority of apps using userfaultfd or this unintended
  uninitialized field may have been noticed sooner.

Meanwhile, since this bug existed for years, it also almost affects all
ioctl()s that was introduced later.  Besides UFFDIO_ZEROPAGE, these also
get affected in the same way:

  - UFFDIO_CONTINUE
  - UFFDIO_POISON
  - UFFDIO_MOVE

This patch should have fixed all of them.

Link: https://lkml.kernel.org/r/20250424215729.194656-2-peterx@redhat.com
Fixes: df2cc96e77 ("userfaultfd: prevent non-cooperative events vs mcopy_atomic races")
Fixes: f619147104 ("userfaultfd: add UFFDIO_CONTINUE ioctl")
Fixes: fc71884a5f ("mm: userfaultfd: add new UFFDIO_POISON ioctl")
Fixes: adef440691 ("userfaultfd: UFFDIO_MOVE uABI")
Signed-off-by: Peter Xu <peterx@redhat.com>
Reported-by: Andrea Arcangeli <aarcange@redhat.com>
Suggested-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-07 23:39:39 -07:00
Mark Tinguely
31d4cd4eb2 ocfs2: fix panic in failed foilio allocation
commit 7e119cff9d ("ocfs2: convert w_pages to w_folios") and commit
9a5e08652d ("ocfs2: use an array of folios instead of an array of
pages") save -ENOMEM in the folio array upon allocation failure and call
the folio array free code.

The folio array free code expects either valid folio pointers or NULL. 
Finding the -ENOMEM will result in a panic.  Fix by NULLing the error
folio entry.

Link: https://lkml.kernel.org/r/c879a52b-835c-4fa0-902b-8b2e9196dcbd@oracle.com
Fixes: 7e119cff9d ("ocfs2: convert w_pages to w_folios")
Fixes: 9a5e08652d ("ocfs2: use an array of folios instead of an array of pages")
Signed-off-by: Mark Tinguely <mark.tinguely@oracle.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-07 23:39:38 -07:00
Heming Zhao
bd1261b16d ocfs2: fix the issue with discontiguous allocation in the global_bitmap
commit 4eb7b93e03 ("ocfs2: improve write IO performance when
fragmentation is high") introduced another regression.

The following ocfs2-test case can trigger this issue:
> discontig_runner.sh => activate_discontig_bg.sh => resv_unwritten:
> ${RESV_UNWRITTEN_BIN} -f ${WORK_PLACE}/large_testfile -s 0 -l \
> $((${FILE_MAJOR_SIZE_M}*1024*1024))

In my env, test disk size (by "fdisk -l <dev>"):
> 53687091200 bytes, 104857600 sectors.

Above command is:
> /usr/local/ocfs2-test/bin/resv_unwritten -f \
> /mnt/ocfs2/ocfs2-activate-discontig-bg-dir/large_testfile -s 0 -l \
> 53187969024

Error log:
> [*] Reserve 50724M space for a LARGE file, reserve 200M space for future test.
> ioctl error 28: "No space left on device"
> resv allocation failed Unknown error -1
> reserve unwritten region from 0 to 53187969024.

Call flow:
__ocfs2_change_file_space //by ioctl OCFS2_IOC_RESVSP64
 ocfs2_allocate_unwritten_extents //start:0 len:53187969024
  while()
   + ocfs2_get_clusters //cpos:0, alloc_size:1623168 (cluster number)
   + ocfs2_extend_allocation
     + ocfs2_lock_allocators
     |  + choose OCFS2_AC_USE_MAIN & ocfs2_cluster_group_search
     |
     + ocfs2_add_inode_data
        ocfs2_add_clusters_in_btree
         __ocfs2_claim_clusters
          ocfs2_claim_suballoc_bits
          + During the allocation of the final part of the large file
	    (after ~47GB), no chain had the required contiguous
            bits_wanted. Consequently, the allocation failed.

How to fix:
When OCFS2 is encountering fragmented allocation, the file system should
stop attempting bits_wanted contiguous allocation and instead provide the
largest available contiguous free bits from the cluster groups.

Link: https://lkml.kernel.org/r/20250414060125.19938-2-heming.zhao@suse.com
Fixes: 4eb7b93e03 ("ocfs2: improve write IO performance when fragmentation is high")
Signed-off-by: Heming Zhao <heming.zhao@suse.com>
Reported-by: Gautham Ananthakrishna <gautham.ananthakrishna@oracle.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-07 23:39:37 -07:00
Kent Overstreet
8e4d28036c bcachefs: Don't aggressively discard the journal
We frequently use 'bcachefs list_journal -a' for debugging, as it
provides a record of all btree transactions, and a history of what
happened.

But it's not so useful if we immediately discard journal buckets right
after they're no longer dirty.

This tweaks journal reclaim to only discard when we're low on space,
keeping the journal mostly un-discarded.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-07 17:10:10 -04:00
Kent Overstreet
da18dabc37 bcachefs: Ensure superblock gets written when we go ERO
When we go emergency read-only, make sure we do a final write_super() to
persist counters and error counts - this can be critical for piecing
together what fsck was doing.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-07 17:09:59 -04:00
Kent Overstreet
2fea3aa76e bcachefs: Filter out harmless EROFS error messages
These just indicate that we're shutting down.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-07 16:58:32 -04:00
Kent Overstreet
473f09f362 bcachefs: journal_shutdown is EROFS, not EIO
We often filter out EROFS errors to avoid log spew after an emergency
shutdown - journal_shutdown is just another emergency shutdown error.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-07 16:58:26 -04:00
Paul Aurich
3ca02e63ed smb: client: Avoid race in open_cached_dir with lease breaks
A pre-existing valid cfid returned from find_or_create_cached_dir might
race with a lease break, meaning open_cached_dir doesn't consider it
valid, and thinks it's newly-constructed. This leaks a dentry reference
if the allocation occurs before the queued lease break work runs.

Avoid the race by extending holding the cfid_list_lock across
find_or_create_cached_dir and when the result is checked.

Cc: stable@vger.kernel.org
Reviewed-by: Henrique Carvalho <henrique.carvalho@suse.com>
Signed-off-by: Paul Aurich <paul@darkrain42.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2025-05-07 15:24:46 -05:00
Linus Torvalds
d76bb1ebb5 Merge tag 'erofs-for-6.15-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs
Pull erofs fixes from Gao Xiang:

 - Add a new reviewer, Hongbo Li, for better community development

 - Fix an I/O hang out of file-backed mounts

 - Address a rare data corruption caused by concurrent I/Os on the same
   deduplicated compressed data

 - Minor cleanup

* tag 'erofs-for-6.15-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
  erofs: ensure the extra temporary copy is valid for shortened bvecs
  erofs: remove unused enum type
  fs/erofs/fileio: call erofs_onlinefolio_split() after bio_add_folio()
  MAINTAINERS: erofs: add myself as reviewer
2025-05-07 10:19:47 -07:00
Jan Kara
55dd5b4db3 udf: Make sure i_lenExtents is uptodate on inode eviction
UDF maintains total length of all extents in i_lenExtents. Generally we
keep extent lengths (and thus i_lenExtents) block aligned because it
makes the file appending logic simpler. However the standard mandates
that the inode size must match the length of all extents and thus we
trim the last extent when closing the file. To catch possible bugs we
also verify that i_lenExtents matches i_size when evicting inode from
memory. Commit b405c1e58b ("udf: refactor udf_next_aext() to handle
error") however broke the code updating i_lenExtents and thus
udf_evict_inode() ended up spewing lots of errors about incorrectly
sized extents although the extents were actually sized properly. Fix the
updating of i_lenExtents to silence the errors.

Fixes: b405c1e58b ("udf: refactor udf_next_aext() to handle error")
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
2025-05-07 12:04:07 +02:00
Gao Xiang
35076d2223 erofs: ensure the extra temporary copy is valid for shortened bvecs
When compressed data deduplication is enabled, multiple logical extents
may reference the same compressed physical cluster.

The previous commit 94c43de735 ("erofs: fix wrong primary bvec
selection on deduplicated extents") already avoids using shortened
bvecs.  However, in such cases, the extra temporary buffers also
need to be preserved for later use in z_erofs_fill_other_copies() to
to prevent data corruption.

IOWs, extra temporary buffers have to be retained not only due to
varying start relative offsets (`pageofs_out`, as indicated by
`pcl->multibases`) but also because of shortened bvecs.

android.hardware.graphics.composer@2.1.so : 270696 bytes
   0:        0..  204185 |  204185 :  628019200.. 628084736 |   65536
-> 1:   204185..  225536 |   21351 :  544063488.. 544129024 |   65536
   2:   225536..  270696 |   45160 :          0..         0 |       0

com.android.vndk.v28.apex : 93814897 bytes
...
   364: 53869896..54095257 |  225361 :  543997952.. 544063488 |   65536
-> 365: 54095257..54309344 |  214087 :  544063488.. 544129024 |   65536
   366: 54309344..54514557 |  205213 :  544129024.. 544194560 |   65536
...

Both 204185 and 54095257 have the same start relative offset of 3481,
but the logical page 55 of `android.hardware.graphics.composer@2.1.so`
ranges from 225280 to 229632, forming a shortened bvec [225280, 225536)
that cannot be used for decompressing the range from 54095257 to
54309344 of `com.android.vndk.v28.apex`.

Since `pcl->multibases` is already meaningless, just mark `be->keepxcpy`
on demand for simplicity.

Again, this issue can only lead to data corruption if `-Ededupe` is on.

Fixes: 94c43de735 ("erofs: fix wrong primary bvec selection on deduplicated extents")
Reviewed-by: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20250506101850.191506-1-hsiangkao@linux.alibaba.com
2025-05-07 09:50:51 +08:00
Linus Torvalds
0d8d44db29 Merge tag 'for-6.15-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:

 - revert device path canonicalization, this does not work as intended
   with namespaces and is not reliable in all setups

 - fix crash in scrub when checksum tree is not valid, e.g. when mounted
   with rescue=ignoredatacsums

 - fix crash when tracepoint btrfs_prelim_ref_insert is enabled

 - other minor fixups:
     - open code folio_index(), meant to be used in MM code
     - use matching type for sizeof in compression allocation

* tag 'for-6.15-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: open code folio_index() in btree_clear_folio_dirty_tag()
  Revert "btrfs: canonicalize the device path before adding it"
  btrfs: avoid NULL pointer dereference if no valid csum tree
  btrfs: handle empty eb->folios in num_extent_folios()
  btrfs: correct the order of prelim_ref arguments in btrfs__prelim_ref
  btrfs: compression: adjust cb->compressed_folios allocation type
2025-05-06 08:19:09 -07:00
Steve French
d90b023718 smb3 client: warn when parse contexts returns error on compounded operation
Coverity noticed that the rc on smb2_parse_contexts() was not being checked
in the case of compounded operations.  Since we don't want to stop parsing
the following compounded responses which are likely valid, we can't easily
error out here, but at least print a warning message if server has a bug
causing us to skip parsing the open response contexts.

Addresses-Coverity: 1639191
Acked-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2025-05-06 09:05:15 -05:00
Sean Heelan
36991c1ccd ksmbd: Fix UAF in __close_file_table_ids
A use-after-free is possible if one thread destroys the file
via __ksmbd_close_fd while another thread holds a reference to
it. The existing checks on fp->refcount are not sufficient to
prevent this.

The fix takes ft->lock around the section which removes the
file from the file table. This prevents two threads acquiring the
same file pointer via __close_file_table_ids, as well as the other
functions which retrieve a file from the IDR and which already use
this same lock.

Cc: stable@vger.kernel.org
Signed-off-by: Sean Heelan <seanheelan@gmail.com>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2025-05-06 08:37:02 -05:00
Norbert Szetei
0ca6df4f40 ksmbd: prevent out-of-bounds stream writes by validating *pos
ksmbd_vfs_stream_write() did not validate whether the write offset
(*pos) was within the bounds of the existing stream data length (v_len).
If *pos was greater than or equal to v_len, this could lead to an
out-of-bounds memory write.

This patch adds a check to ensure *pos is less than v_len before
proceeding. If the condition fails, -EINVAL is returned.

Cc: stable@vger.kernel.org
Signed-off-by: Norbert Szetei <norbert@doyensec.com>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2025-05-06 08:36:36 -05:00
Kent Overstreet
9c61856099 bcachefs: Call bch2_fs_start before getting vfs superblock
This reverts

1fdbe0b184 bcachefs: Make sure c->vfs_sb is set before starting fs

switched up bch2_fs_get_tree() so that we got a superblock before
calling bch2_fs_start, so that c->vfs_sb would always be initialized
while the filesystem was active.

This turned out not to be necessary, because blk_holder_ops were
implemented using our own locking, not vfs locking.

And this had the side effect of creating a super_block and doing our
full recovery (including potentially fsck) before setting SB_BORN, which
causes things like sync calls to hang until our recovery is finished.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-05 16:06:35 -04:00
Kent Overstreet
aed4ccbf45 bcachefs: fix hung task timeout in journal read
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-05 14:21:28 -04:00
Kent Overstreet
7a69fa6571 bcachefs: Add missing barriers before wake_up_bit()
wake_up() doesn't require a barrier - but wake_up_bit() does.

This only affected non x86, and primarily lead to lost wakeups after
btree node reads.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-05 14:19:10 -04:00
Kent Overstreet
50a7b899a0 bcachefs: Ensure proper write alignment
There was a buggy version of bcachefs-tools which picked misaligned
bucket sizes when formatting, and we're also about to do dynamic block
sizes - which will allow picking logical block size or physical block
size of the device per-write, allowing for better compression ratios at
the cost of slightly worse write performance (i.e. forcing the device to
do RMW or extra buffering).

To account for this, tweak bch2_alloc_sectors_start() to properly align
open_buckets to the blocksize of the write we're about to do.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-05 14:19:01 -04:00
Kent Overstreet
844f766e02 bcachefs: Improve want_cached_ptr()
If promote target isn't set, rebalance should still leave a cached copy
on the faster device.

Fall back to foreground_target if it's set, or allow a cached copy on
any device if neither are set.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-05 14:16:20 -04:00
Kent Overstreet
df2e19a883 bcachefs: thread_with_stdio: fix spinning instead of exiting
bch2_stdio_redirect_vprintf() was missing a check for stdio->done, i.e.
exiting.

This caused the thread attempting to print to spin, and since it was
being called from the kthread ran by thread_with_stdio, the userspace
side hung as well.

Change it to return -EPIPE - i.e. writing to a pipe that's been closed.

Reported-by: Jan Solanti <jhs@psonet.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-04 14:00:14 -04:00
Linus Torvalds
daad00c063 Merge tag '6.15-rc4-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull smb client fixes from Steve French:

 - fix posix mkdir error to ksmbd (also avoids crash in
   cifs_destroy_request_bufs)

 - two smb1 fixes: fixing querypath info and setpathinfo to old servers

 - fix rsize/wsize when not multiple of page size to address DIO
   reads/writes

* tag '6.15-rc4-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  smb: client: ensure aligned IO sizes
  cifs: Fix changing times and read-only attr over SMB1 smb_set_file_info() function
  cifs: Fix and improve cifs_query_path_info() and cifs_query_file_info()
  smb: client: fix zero length for mkdir POSIX create context
2025-05-02 14:37:16 -07:00
Linus Torvalds
2bfcee565c Merge tag 'bcachefs-2025-05-01' of git://evilpiepirate.org/bcachefs
Pull bcachefs fixes from Kent Overstreet:
 "Lots of assorted small fixes...

   - Some repair path fixes, a fix for -ENOMEM when reconstructing lots
     of alloc info on large filesystems, upgrade for ancient 0.14
     filesystems, etc.

   - Various assert tweaks; assert -> ERO, ERO -> log the error in the
     superblock and continue

   - casefolding now uses d_ops like on other casefolding filesystems

   - fix device label create on device add, fix bucket array resize on
     filesystem resize

   - fix xattrs with FORTIFY_SOURCE builds with gcc-15/clang"

* tag 'bcachefs-2025-05-01' of git://evilpiepirate.org/bcachefs: (22 commits)
  bcachefs: Remove incorrect __counted_by annotation
  bcachefs: add missing sched_annotate_sleep()
  bcachefs: Fix __bch2_dev_group_set()
  bcachefs: Kill ERO for i_blocks check in truncate
  bcachefs: check for inode.bi_sectors underflow
  bcachefs: Kill ERO in __bch2_i_sectors_acct()
  bcachefs: readdir fixes
  bcachefs: improve missing journal write device error message
  bcachefs: Topology error after insert is now an ERO
  bcachefs: Use bch2_kvmalloc() for journal keys array
  bcachefs: More informative error message when shutting down due to error
  bcachefs: btree_root_unreadable_and_scan_found_nothing autofix for non data btrees
  bcachefs: btree_node_data_missing is now autofix
  bcachefs: Don't generate alloc updates to invalid buckets
  bcachefs: Improve bch2_dev_bucket_missing()
  bcachefs: fix bch2_dev_buckets_resize()
  bcachefs: Add upgrade table entry from 0.14
  bcachefs: Run BCH_RECOVERY_PASS_reconstruct_snapshots on missing subvol -> snapshot
  bcachefs: Add missing utf8_unload()
  bcachefs: Emit unicode version message on startup
  ...
2025-05-02 09:12:29 -07:00
Max Kellermann
d9ec733010 fs/eventpoll: fix endless busy loop after timeout has expired
After commit 0a65bc27bd ("eventpoll: Set epoll timeout if it's in
the future"), the following program would immediately enter a busy
loop in the kernel:

```
int main() {
  int e = epoll_create1(0);
  struct epoll_event event = {.events = EPOLLIN};
  epoll_ctl(e, EPOLL_CTL_ADD, 0, &event);
  const struct timespec timeout = {.tv_nsec = 1};
  epoll_pwait2(e, &event, 1, &timeout, 0);
}
```

This happens because the given (non-zero) timeout of 1 nanosecond
usually expires before ep_poll() is entered and then
ep_schedule_timeout() returns false, but `timed_out` is never set
because the code line that sets it is skipped.  This quickly turns
into a soft lockup, RCU stalls and deadlocks, inflicting severe
headaches to the whole system.

When the timeout has expired, we don't need to schedule a hrtimer, but
we should set the `timed_out` variable.  Therefore, I suggest moving
the ep_schedule_timeout() check into the `timed_out` expression
instead of skipping it.

brauner: Note that there was an earlier fix by Joe Damato in response to
my bug report in [1].

Fixes: 0a65bc27bd ("eventpoll: Set epoll timeout if it's in the future")
Cc: Joe Damato <jdamato@fastly.com>
Cc: stable@vger.kernel.org
Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
Link: https://lore.kernel.org/20250429153419.94723-1-jdamato@fastly.com [1]
Link: https://lore.kernel.org/20250429185827.3564438-1-max.kellermann@ionos.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-02 14:21:26 +02:00
Kairui Song
38e541051e btrfs: open code folio_index() in btree_clear_folio_dirty_tag()
The folio_index() helper is only needed for mixed usage of page cache
and swap cache, for pure page cache usage, the caller can just use
folio->index instead.

It can't be a swap cache folio here.  Swap mapping may only call into fs
through 'swap_rw' but btrfs does not use that method for swap.

Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-02 13:20:56 +02:00
Qu Wenruo
8fb1dcbbcc Revert "btrfs: canonicalize the device path before adding it"
This reverts commit 7e06de7c83.

Commit 7e06de7c83 ("btrfs: canonicalize the device path before adding
it") tries to make btrfs to use "/dev/mapper/*" name first, then any
filename inside "/dev/" as the device path.

This is mostly fine when there is only the root namespace involved, but
when multiple namespace are involved, things can easily go wrong for the
d_path() usage.

As d_path() returns a file path that is namespace dependent, the
resulted string may not make any sense in another namespace.

Furthermore, the "/dev/" prefix checks itself is not reliable, one can
still make a valid initramfs without devtmpfs, and fill all needed
device nodes manually.

Overall the userspace has all its might to pass whatever device path for
mount, and we are not going to win the war trying to cover every corner
case.

So just revert that commit, and do no extra d_path() based file path
sanity check.

CC: stable@vger.kernel.org # 6.12+
Link: https://lore.kernel.org/linux-fsdevel/20250115185608.GA2223535@zen.localdomain/
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-02 13:20:26 +02:00
Qu Wenruo
f95d186255 btrfs: avoid NULL pointer dereference if no valid csum tree
[BUG]
When trying read-only scrub on a btrfs with rescue=idatacsums mount
option, it will crash with the following call trace:

  BUG: kernel NULL pointer dereference, address: 0000000000000208
  #PF: supervisor read access in kernel mode
  #PF: error_code(0x0000) - not-present page
  CPU: 1 UID: 0 PID: 835 Comm: btrfs Tainted: G           O        6.15.0-rc3-custom+ #236 PREEMPT(full)
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 02/02/2022
  RIP: 0010:btrfs_lookup_csums_bitmap+0x49/0x480 [btrfs]
  Call Trace:
   <TASK>
   scrub_find_fill_first_stripe+0x35b/0x3d0 [btrfs]
   scrub_simple_mirror+0x175/0x290 [btrfs]
   scrub_stripe+0x5f7/0x6f0 [btrfs]
   scrub_chunk+0x9a/0x150 [btrfs]
   scrub_enumerate_chunks+0x333/0x660 [btrfs]
   btrfs_scrub_dev+0x23e/0x600 [btrfs]
   btrfs_ioctl+0x1dcf/0x2f80 [btrfs]
   __x64_sys_ioctl+0x97/0xc0
   do_syscall_64+0x4f/0x120
   entry_SYSCALL_64_after_hwframe+0x76/0x7e

[CAUSE]
Mount option "rescue=idatacsums" will completely skip loading the csum
tree, so that any data read will not find any data csum thus we will
ignore data checksum verification.

Normally call sites utilizing csum tree will check the fs state flag
NO_DATA_CSUMS bit, but unfortunately scrub does not check that bit at all.

This results in scrub to call btrfs_search_slot() on a NULL pointer
and triggered above crash.

[FIX]
Check both extent and csum tree root before doing any tree search.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-02 13:20:11 +02:00