Commit Graph

98138 Commits

Author SHA1 Message Date
Linus Torvalds
05e8d261a3 gcc-15: add '__nonstring' markers to byte arrays
All of these cases are perfectly valid and good traditional C, but hit
by the "you're not NUL-terminating your byte array" warning.

And none of the cases want any terminating NUL character.

Mark them __nonstring to shut up gcc-15 (and in the case of the ak8974
magnetometer driver, I just removed the explicit array size and let gcc
expand the 3-byte and 6-byte arrays by one extra byte, because it was
the simpler change).

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2025-04-20 11:57:54 -07:00
Linus Torvalds
6fea5fabd3 Merge tag 'mm-hotfixes-stable-2025-04-19-21-24' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc hotfixes from Andrew Morton:
 "16 hotfixes. 2 are cc:stable and the remainder address post-6.14
  issues or aren't considered necessary for -stable kernels.

  All patches are basically for MM although five are alterations to
  MAINTAINERS"

[ Basic counting skills are clearly not a strictly necessary requirement
  for kernel maintainers.     - Linus ]

* tag 'mm-hotfixes-stable-2025-04-19-21-24' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  MAINTAINERS: add section for locking of mm's and VMAs
  mm: vmscan: fix kswapd exit condition in defrag_mode
  mm: vmscan: restore high-cpu watermark safety in kswapd
  MAINTAINERS: add Pedro as reviewer to the MEMORY MAPPING section
  mm/memory: move sanity checks in do_wp_page() after mapcount vs. refcount stabilization
  mm, hugetlb: increment the number of pages to be reset on HVO
  writeback: fix false warning in inode_to_wb()
  docs: ABI: replace mcroce@microsoft.com with new Meta address
  mm/gup: fix wrongly calculated returned value in fault_in_safe_writeable()
  MAINTAINERS: add memory advice section
  MAINTAINERS: add mmap trace events to MEMORY MAPPING
  mm: memcontrol: fix swap counter leak from offline cgroup
  MAINTAINERS: add MM subsection for the page allocator
  MAINTAINERS: update SLAB ALLOCATOR maintainers
  fs/dax: fix folio splitting issue by resetting old folio order + _nr_pages
  mm/page_alloc: fix deadlock on cpu_hotplug_lock in __accept_page()
2025-04-19 21:46:58 -07:00
Linus Torvalds
119009db26 Merge tag 'vfs-6.15-rc3.fixes.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs fixes from Christian Brauner:

 - Revert the hfs{plus} deprecation warning that's also included in this
   pull request. The commit introducing the deprecation warning resides
   rather early in this branch. So simply dropping it would've rebased
   all other commits which I decided to avoid. Hence the revert in the
   same branch

   [ Background - the deprecation warning discussion resulted in people
     stepping up, and so hfs{plus} will have a maintainer taking care of
     it after all..   - Linus ]

 - Switch CONFIG_SYSFS_SYCALL default to n and decouple from
   CONFIG_EXPERT

 - Fix an audit bug caused by changes to our kernel path lookup helpers
   this cycle. Audit needs the parent path even if the dentry it tried
   to look up is negative

 - Ensure that the kernel path lookup helpers leave the passed in path
   argument clean when they return an error. This is consistent with all
   our other helpers

 - Ensure that vfs_getattr_nosec() calls bdev_statx() so the relevant
   information is available to kernel consumers as well

 - Don't set a timer and call schedule() if the timer will expire
   immediately in epoll

 - Make netfs lookup tables with __nonstring

* tag 'vfs-6.15-rc3.fixes.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  Revert "hfs{plus}: add deprecation warning"
  fs: move the bdex_statx call to vfs_getattr_nosec
  netfs: Mark __nonstring lookup tables
  eventpoll: Set epoll timeout if it's in the future
  fs: ensure that *path_locked*() helpers leave passed path pristine
  fs: add kern_path_locked_negative()
  hfs{plus}: add deprecation warning
  Kconfig: switch CONFIG_SYSFS_SYCALL default to n
2025-04-19 14:31:08 -07:00
Christian Brauner
408e4504f9 Revert "hfs{plus}: add deprecation warning"
This reverts commit ddee68c499.

There's ongoing discussion about better maintenance of at least hfsplus.
Rever the deprecation warning for now.

Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-04-19 22:48:59 +02:00
Linus Torvalds
1ca0f935a1 Merge tag 'nfsd-6.15-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull nfsd fixes from Chuck Lever:

 - v6.15 libcrc clean-up makes invalid configurations possible

 - Fix a potential deadlock introduced during the v6.15 merge window

* tag 'nfsd-6.15-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
  nfsd: decrease sc_count directly if fail to queue dl_recall
  nfs: add missing selections of CONFIG_CRC32
2025-04-19 10:38:03 -07:00
Linus Torvalds
8560697b23 Merge tag '6.15-rc2-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull smb client fixes from Steve French:

 - Fix hard link lease key problem when close is deferred

 - Revert the socket lockdep/refcount workarounds done in cifs.ko now
   that it is fixed at the socket layer

* tag '6.15-rc2-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  Revert "smb: client: fix TCP timers deadlock after rmmod"
  Revert "smb: client: Fix netns refcount imbalance causing leaks and use-after-free"
  smb3 client: fix open hardlink on deferred close file error
2025-04-18 20:10:42 -07:00
Linus Torvalds
7e74f756f5 Merge tag 'v6.15-rc2-ksmbd-server-fixes' of git://git.samba.org/ksmbd
Pull smb server fixes from Steve French:

 - Fix integer overflow in server disconnect deadtime calculation

 - Three fixes for potential use after frees: one for oplocks, and one
   for leases and one for kerberos authentication

 - Fix to prevent attempted write to directory

 - Fix locking warning for durable scavenger thread

* tag 'v6.15-rc2-ksmbd-server-fixes' of git://git.samba.org/ksmbd:
  ksmbd: Prevent integer overflow in calculation of deadtime
  ksmbd: fix the warning from __kernel_write_iter
  ksmbd: fix use-after-free in smb_break_all_levII_oplock()
  ksmbd: fix use-after-free in __smb2_lease_break_noti()
  ksmbd: fix WARNING "do not call blocking ops when !TASK_RUNNING"
  ksmbd: Fix dangling pointer in krb_authenticate
2025-04-18 09:37:44 -07:00
David Hildenbrand
98b1917cde fs/dax: fix folio splitting issue by resetting old folio order + _nr_pages
Alison reports an issue with fsdax when large extends end up using large
ZONE_DEVICE folios:

[  417.796271] BUG: kernel NULL pointer dereference, address: 0000000000000b00
[  417.796982] #PF: supervisor read access in kernel mode
[  417.797540] #PF: error_code(0x0000) - not-present page
[  417.798123] PGD 2a5c5067 P4D 2a5c5067 PUD 2a5c6067 PMD 0
[  417.798690] Oops: Oops: 0000 [#1] SMP NOPTI
[  417.799178] CPU: 5 UID: 0 PID: 1515 Comm: mmap Tainted: ...
[  417.800150] Tainted: [O]=OOT_MODULE
[  417.800583] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[  417.801358] RIP: 0010:__lruvec_stat_mod_folio+0x7e/0x250
[  417.801948] Code: ...
[  417.803662] RSP: 0000:ffffc90002be3a08 EFLAGS: 00010206
[  417.804234] RAX: 0000000000000000 RBX: 0000000000000200 RCX: 0000000000000002
[  417.804984] RDX: ffffffff815652d7 RSI: 0000000000000000 RDI: ffffffff82a2beae
[  417.805689] RBP: ffffc90002be3a28 R08: 0000000000000000 R09: 0000000000000000
[  417.806384] R10: ffffea0007000040 R11: ffff888376ffe000 R12: 0000000000000001
[  417.807099] R13: 0000000000000012 R14: ffff88807fe4ab40 R15: ffff888029210580
[  417.807801] FS:  00007f339fa7a740(0000) GS:ffff8881fa9b9000(0000) knlGS:0000000000000000
[  417.808570] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  417.809193] CR2: 0000000000000b00 CR3: 000000002a4f0004 CR4: 0000000000370ef0
[  417.809925] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  417.810622] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  417.811353] Call Trace:
[  417.811709]  <TASK>
[  417.812038]  folio_add_file_rmap_ptes+0x143/0x230
[  417.812566]  insert_page_into_pte_locked+0x1ee/0x3c0
[  417.813132]  insert_page+0x78/0xf0
[  417.813558]  vmf_insert_page_mkwrite+0x55/0xa0
[  417.814088]  dax_fault_iter+0x484/0x7b0
[  417.814542]  dax_iomap_pte_fault+0x1ca/0x620
[  417.815055]  dax_iomap_fault+0x39/0x40
[  417.815499]  __xfs_write_fault+0x139/0x380
[  417.815995]  ? __handle_mm_fault+0x5e5/0x1a60
[  417.816483]  xfs_write_fault+0x41/0x50
[  417.816966]  xfs_filemap_fault+0x3b/0xe0
[  417.817424]  __do_fault+0x31/0x180
[  417.817859]  __handle_mm_fault+0xee1/0x1a60
[  417.818325]  ? debug_smp_processor_id+0x17/0x20
[  417.818844]  handle_mm_fault+0xe1/0x2b0
[...]

The issue is that when we split a large ZONE_DEVICE folio to order-0 ones,
we don't reset the order/_nr_pages.  As folio->_nr_pages overlays
page[1]->memcg_data, once page[1] is a folio, it suddenly looks like it
has folio->memcg_data set.  And we never manually initialize
folio->memcg_data in fsdax code, because we never expect it to be set at
all.

When __lruvec_stat_mod_folio() then stumbles over such a folio, it tries
to use folio->memcg_data (because it's non-NULL) but it does not actually
point at a memcg, resulting in the problem.

Alison also observed that these folios sometimes have "locked" set, which
is rather concerning (folios locked from the beginning ...).  The reason
is that the order for large folios is stored in page[1]->flags, which
become the folio->flags of a new small folio.

Let's fix it by adding a folio helper to clear order/_nr_pages for
splitting purposes.

Maybe we should reinitialize other large folio flags / folio members as
well when splitting, because they might similarly cause harm once page[1]
becomes a folio?  At least other flags in PAGE_FLAGS_SECOND should not be
set for fsdax, so at least page[1]->flags might be as expected with this
fix.

From a quick glimpse, initializing ->mapping, ->pgmap and ->share should
re-initialize most things from a previous page[1] used by large folios
that fsdax cares about.  For example folio->private might not get
reinitialized, but maybe that's not relevant -- no traces of it's use in
fsdax code.  Needs a closer look.

Another thing that should be considered in the future is performing
similar checks as we perform in free_tail_page_prepare()
-- checking pincount etc.
-- when freeing a large fsdax folio.

Link: https://lkml.kernel.org/r/20250410091020.119116-1-david@redhat.com
Fixes: 4996fc547f ("mm: let _folio_nr_pages overlay memcg_data in first tail page")
Fixes: 38607c62b3 ("fs/dax: properly refcount fs dax pages")
Signed-off-by: David Hildenbrand <david@redhat.com>
Reported-by: Alison Schofield <alison.schofield@intel.com>
Closes: https://lkml.kernel.org/r/Z_W9Oeg-D9FhImf3@aschofie-mobl2.lan
Tested-by: Alison Schofield <alison.schofield@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: "Darrick J. Wong" <djwong@kernel.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-04-17 20:10:06 -07:00
Linus Torvalds
9e99c1accb Merge tag 'bcachefs-2025-04-17' of git://evilpiepirate.org/bcachefs
Pull bcachefs fixes from Kent Overstreet:
 "Usual set of small fixes/logging improvements.

  One bigger user reported fix, for inode <-> dirent inconsistencies
  reported in fsck, after moving a subvolume that had been snapshotted"

* tag 'bcachefs-2025-04-17' of git://evilpiepirate.org/bcachefs:
  bcachefs: Fix snapshotting a subvolume, then renaming it
  bcachefs: Add missing READ_ONCE() for metadata replicas
  bcachefs: snapshot_node_missing is now autofix
  bcachefs: Log message when incompat version requested but not enabled
  bcachefs: Print version_incompat_allowed on startup
  bcachefs: Silence extent_poisoned error messages
  bcachefs: btree_root_unreadable_and_scan_found_nothing now AUTOFIX
  bcachefs: fix bch2_dev_usage_full_read_fast()
  bcachefs: Don't print data read retry success on non-errors
  bcachefs: Add missing error handling
  bcachefs: Prevent granting write refs when filesystem is read-only
2025-04-17 15:08:29 -07:00
Kent Overstreet
261592ba06 bcachefs: Fix snapshotting a subvolume, then renaming it
Subvolume roots and the dirents that point to them are special; they
don't obey the normal snapshot versioning rules because they cross
snapshot boundaries.

We don't keep around older versions of subvolume dirents on rename - we
don't need to, because subvolume dirents are only visible in the parent
subvolume, and we wouldn't be able to match up the different dirent and
inode versions due to crossing the snapshot ID boundary.

That means that when we rename a subvolume, that's been snapshotted, the
older version of the subvolume root will become dangling - it won't have
a dirent that points to it.

That's expected, we just need to tell fsck that this is ok.

Fixes: https://github.com/koverstreet/bcachefs/issues/856
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-17 14:17:16 -04:00
Linus Torvalds
096384deed Merge tag 'xfs-fixes-6.15-rc3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull XFS fixes from Carlos Maiolino:
 "This mostly includes fixes and documentation for the zoned allocator
  feature merged during previous merge window, but it also adds a sysfs
  tunable for the zone garbage collector.

  There is also a fix for a regression to the RT device that we'd like
  to fix ASAP now that we're getting more users on the RT zoned
  allocator"

* tag 'xfs-fixes-6.15-rc3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: document zoned rt specifics in admin-guide
  xfs: fix fsmap for internal zoned devices
  xfs: Fix spelling mistake "drity" -> "dirty"
  xfs: compute buffer address correctly in xmbuf_map_backing_mem
  xfs: add tunable threshold parameter for triggering zone GC
  xfs: mark xfs_buf_free as might_sleep()
  xfs: remove the leftover xfs_{set,clear}_li_failed infrastructure
2025-04-17 09:24:56 -07:00
Linus Torvalds
0cb9ce06a6 Merge tag 'for-6.15-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:

 - handle encoded read ioctl returning EAGAIN so it does not mistakenly
   free the work structure

 - escape subvolume path in mount option list so it cannot be wrongly
   parsed when the path contains ","

 - remove folio size assertions when writing super block to device with
   enabled large folios

* tag 'for-6.15-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: remove folio order ASSERT()s in super block writeback path
  btrfs: correctly escape subvol in btrfs_show_options()
  btrfs: ioctl: don't free iov when btrfs_encoded_read() returns -EAGAIN
2025-04-17 09:17:57 -07:00
Christoph Hellwig
777d0961ff fs: move the bdex_statx call to vfs_getattr_nosec
Currently bdex_statx is only called from the very high-level
vfs_statx_path function, and thus bypassing it for in-kernel calls
to vfs_getattr or vfs_getattr_nosec.

This breaks querying the block ѕize of the underlying device in the
loop driver and also is a pitfall for any other new kernel caller.

Move the call into the lowest level helper to ensure all callers get
the right results.

Fixes: 2d985f8c6b ("vfs: support STATX_DIOALIGN on block devices")
Fixes: f4774e92aa ("loop: take the file system minimum dio alignment into account")
Reported-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/20250417064042.712140-1-hch@lst.de
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-04-17 10:14:34 +02:00
Kees Cook
58db1c3cd0 netfs: Mark __nonstring lookup tables
GCC 15's new -Wunterminated-string-initialization notices that the
character lookup tables "fscache_cache_states" and "fscache_cookie_states"
(which are not used as a C-String) need to be marked as "nonstring":

fs/netfs/fscache_cache.c:375:67: warning: initializer-string for array of 'char' truncates NUL terminator but destination lacks 'nonstring' attribute (6 chars into 5 available) [-Wunterminated-string-initialization]
  375 | static const char fscache_cache_states[NR__FSCACHE_CACHE_STATE] = "-PAEW";
      |                                                                   ^~~~~~~
fs/netfs/fscache_cookie.c:32:69: warning: initializer-string for array of 'char' truncates NUL terminator but destination lacks 'nonstring' attribute (11 chars into 10 available) [-Wunterminated-string-initialization]
   32 | static const char fscache_cookie_states[FSCACHE_COOKIE_STATE__NR] = "-LCAIFUWRD";
      |                                                                     ^~~~~~~~~~~~

Annotate the arrays.

Signed-off-by: Kees Cook <kees@kernel.org>
Link: https://lore.kernel.org/20250416221654.work.028-kees@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-04-17 10:13:46 +02:00
Joe Damato
0a65bc27bd eventpoll: Set epoll timeout if it's in the future
Avoid an edge case where epoll_wait arms a timer and calls schedule()
even if the timer will expire immediately.

For example: if the user has specified an epoll busy poll usecs which is
equal or larger than the epoll_wait/epoll_pwait2 timeout, it is
unnecessary to call schedule_hrtimeout_range; the busy poll usecs have
consumed the entire timeout duration so it is unnecessary to induce
scheduling latency by calling schedule() (via schedule_hrtimeout_range).

This can be measured using a simple bpftrace script:

tracepoint:sched:sched_switch
/ args->prev_pid == $1 /
{
  print(kstack());
  print(ustack());
}

Before this patch is applied:

  Testing an epoll_wait app with busy poll usecs set to 1000, and
  epoll_wait timeout set to 1ms using the script above shows:

     __traceiter_sched_switch+69
     __schedule+1495
     schedule+32
     schedule_hrtimeout_range+159
     do_epoll_wait+1424
     __x64_sys_epoll_wait+97
     do_syscall_64+95
     entry_SYSCALL_64_after_hwframe+118

     epoll_wait+82

  Which is unexpected; the busy poll usecs should have consumed the
  entire timeout and there should be no reason to arm a timer.

After this patch is applied: the same test scenario does not generate a
call to schedule() in the above edge case. If the busy poll usecs are
reduced (for example usecs: 100, epoll_wait timeout 1ms) the timer is
armed as expected.

Fixes: bf3b9f6372 ("epoll: Add busy poll support to epoll with socket fds.")
Signed-off-by: Joe Damato <jdamato@fastly.com>
Link: https://lore.kernel.org/20250416185826.26375-1-jdamato@fastly.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-04-17 09:56:20 +02:00
Kent Overstreet
8dd3804bf4 bcachefs: Add missing READ_ONCE() for metadata replicas
If we race with the user changing the metadata_replicas setting, this
could cause us to get an incorrectly sized disk reservation.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-16 18:31:47 -04:00
Darrick J. Wong
c6f1401b1d xfs: fix fsmap for internal zoned devices
Filesystems with an internal zoned rt section use xfs_rtblock_t values
that are relative to the start of the data device.  When fsmap reports
on internal rt sections, it reports the space used by the data section
as "OWN_FS".

Unfortunately, the logic for resuming a query isn't quite right, so
xfs/273 fails because it stress-tests GETFSMAP with a single-record
buffer.  If we enter the "report fake space as OWN_FS" block with a
nonzero key[0].fmr_length, we should add that to key[0].fmr_physical
and recheck if we still need to emit the fake record.  We should /not/
just return 0 from the whole function because that prevents all rmap
record iteration.

If we don't enter that block, the resumption is still wrong.
keys[*].fmr_physical is a reflection of what we copied out to userspace
on a previous query, which means that it already accounts for rgstart.
It is not correct to add rtstart_daddr when computing start_rtb or
end_rtb, so stop that.

While we're at it, add a xfs_has_zoned to make it clear that this is a
zoned filesystem thing.

Fixes: e50ec7fac8 ("xfs: enable fsmap reporting for internal RT devices")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-04-16 12:56:10 +02:00
Zhang Xianwei
1c406526bd xfs: Fix spelling mistake "drity" -> "dirty"
There is a spelling mistake in fs/xfs/xfs_log.c. Fix it.

Signed-off-by: Zhang Xianwei <zhang.xianwei8@zte.com.cn>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-04-16 10:43:23 +02:00
Christian Brauner
a681b7c17d fs: ensure that *path_locked*() helpers leave passed path pristine
The functions currently leaving dangling pointers in the passed-in path
leading to hard to debug bugs in the long run. Ensure that the path is
left in pristine state just like we do in e.g., path_parentat() and
other helpers.

Link: https://lore.kernel.org/20250414-rennt-wimmeln-f186c3a780f1@brauner
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-04-16 09:33:09 +02:00
Kent Overstreet
72b5259053 bcachefs: snapshot_node_missing is now autofix
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-15 17:37:39 -04:00
Kent Overstreet
c3b02e6d67 bcachefs: Log message when incompat version requested but not enabled
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-15 11:35:05 -04:00
Kent Overstreet
14bcf982f4 bcachefs: Print version_incompat_allowed on startup
Let users know if incompatible features aren't enabled

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-15 11:35:05 -04:00
Kent Overstreet
a06459657e bcachefs: Silence extent_poisoned error messages
extent poisoning is partly so that we don't keep spewing the dmesg log
when we've got unreadable data - we don't want to print these.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-15 11:34:37 -04:00
Linus Torvalds
065d49851e Merge tag 'fs_for_v6.15-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull isofs fix from Jan Kara:
 "Fix a case where isofs could be reading beyond end of the passed
  file handle if its type was incorrectly set"

* tag 'fs_for_v6.15-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  isofs: Prevent the use of too small fid
2025-04-15 08:19:52 -07:00
Christian Brauner
c86b300b1e fs: add kern_path_locked_negative()
The audit code relies on the fact that kern_path_locked() returned a
path even for a negative dentry. If it doesn't find a valid dentry it
immediately calls:

    audit_find_parent(d_backing_inode(parent_path.dentry));

which assumes that parent_path.dentry is still valid. But it isn't since
kern_path_locked() has been changed to path_put() also for a negative
dentry.

Fix this by adding a helper that implements the required audit semantics
and allows us to fix the immediate bleeding. We can find a unified
solution for this afterwards.

Link: https://lore.kernel.org/20250414-rennt-wimmeln-f186c3a780f1@brauner
Fixes: 1c3cb50b58 ("VFS: change kern_path_locked() and user_path_locked_at() to never return negative dentry")
Reported-and-tested-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-04-15 11:32:34 +02:00
Christian Brauner
ddee68c499 hfs{plus}: add deprecation warning
Both the hfs and hfsplus filesystem have been orphaned since at least
2014, i.e., over 10 years. It's time to remove them from the kernel as
they're exhibiting more and more issues and no one is stepping up to
fixing them.

Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-04-15 10:28:35 +02:00
Denis Arefev
a93ff74282 ksmbd: Prevent integer overflow in calculation of deadtime
The user can set any value for 'deadtime'. This affects the arithmetic
expression 'req->deadtime * SMB_ECHO_INTERVAL', which is subject to
overflow. The added check makes the server behavior more predictable.

Found by Linux Verification Center (linuxtesting.org) with SVACE.

Fixes: 0626e6641f ("cifsd: add server handler for central processing and tranport layers")
Cc: stable@vger.kernel.org
Signed-off-by: Denis Arefev <arefev@swemel.ru>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2025-04-14 22:21:26 -05:00
Namjae Jeon
b37f2f332b ksmbd: fix the warning from __kernel_write_iter
[ 2110.972290] ------------[ cut here ]------------
[ 2110.972301] WARNING: CPU: 3 PID: 735 at fs/read_write.c:599 __kernel_write_iter+0x21b/0x280

This patch doesn't allow writing to directory.

Cc: stable@vger.kernel.org
Reported-by: Norbert Szetei <norbert@doyensec.com>
Tested-by: Norbert Szetei <norbert@doyensec.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2025-04-14 22:21:26 -05:00
Namjae Jeon
18b4fac5ef ksmbd: fix use-after-free in smb_break_all_levII_oplock()
There is a room in smb_break_all_levII_oplock that can cause racy issues
when unlocking in the middle of the loop. This patch use read lock
to protect whole loop.

Cc: stable@vger.kernel.org
Reported-by: Norbert Szetei <norbert@doyensec.com>
Tested-by: Norbert Szetei <norbert@doyensec.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2025-04-14 22:21:26 -05:00
Namjae Jeon
21a4e47578 ksmbd: fix use-after-free in __smb2_lease_break_noti()
Move tcp_transport free to ksmbd_conn_free. If ksmbd connection is
referenced when ksmbd server thread terminates, It will not be freed,
but conn->tcp_transport is freed. __smb2_lease_break_noti can be performed
asynchronously when the connection is disconnected. __smb2_lease_break_noti
calls ksmbd_conn_write, which can cause use-after-free
when conn->ksmbd_transport is already freed.

Cc: stable@vger.kernel.org
Reported-by: Norbert Szetei <norbert@doyensec.com>
Tested-by: Norbert Szetei <norbert@doyensec.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2025-04-14 22:21:26 -05:00
Namjae Jeon
1df0d4c616 ksmbd: fix WARNING "do not call blocking ops when !TASK_RUNNING"
wait_event_timeout() will set the state of the current
task to TASK_UNINTERRUPTIBLE, before doing the condition check. This
means that ksmbd_durable_scavenger_alive() will try to acquire the mutex
while already in a sleeping state. The scheduler warns us by giving
the following warning:

do not call blocking ops when !TASK_RUNNING; state=2 set at
 [<0000000061515a6f>] prepare_to_wait_event+0x9f/0x6c0
WARNING: CPU: 2 PID: 4147 at kernel/sched/core.c:10099 __might_sleep+0x12f/0x160

mutex lock is not needed in ksmbd_durable_scavenger_alive().

Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2025-04-14 22:21:26 -05:00
Sean Heelan
1e440d5b25 ksmbd: Fix dangling pointer in krb_authenticate
krb_authenticate frees sess->user and does not set the pointer
to NULL. It calls ksmbd_krb5_authenticate to reinitialise
sess->user but that function may return without doing so. If
that happens then smb2_sess_setup, which calls krb_authenticate,
will be accessing free'd memory when it later uses sess->user.

Cc: stable@vger.kernel.org
Signed-off-by: Sean Heelan <seanheelan@gmail.com>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2025-04-14 22:21:26 -05:00
Linus Torvalds
3618002d00 Merge tag 'vfs-6.15-rc3.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs fixes from Christian Brauner:

 - Fix NULL pointer dereference in virtiofs

 - Fix slab OOB access in hfs/hfsplus

 - Only create /proc/fs/netfs when CONFIG_PROC_FS is set

 - Fix getname_flags() to initialize pointer correctly

 - Convert dentry flags to enum

 - Don't allow datadir without lowerdir in overlayfs

 - Use namespace_{lock,unlock} helpers in dissolve_on_fput() instead of
   plain namespace_sem so unmounted mounts are properly cleaned up

 - Skip unnecessary ifs_block_is_uptodate check in iomap

 - Remove an unused forward declaration in overlayfs

 - Fix devpts uid/gid handling after converting to the new mount api

 - Fix afs_dynroot_readdir() to not use the RCU read lock

 - Fix mount_setattr() and open_tree_attr() to not pointlessly do path
   lookup or walk the mount tree if no mount option change has been
   requested

* tag 'vfs-6.15-rc3.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  fs: use namespace_{lock,unlock} in dissolve_on_fput()
  iomap: skip unnecessary ifs_block_is_uptodate check
  fs: Fix filename init after recent refactoring
  netfs: Only create /proc/fs/netfs with CONFIG_PROC_FS
  mount: ensure we don't pointlessly walk the mount tree
  dcache: convert dentry flag macros to enum
  afs: Fix afs_dynroot_readdir() to not use the RCU read lock
  hfs/hfsplus: fix slab-out-of-bounds in hfs_bnode_read_key
  virtiofs: add filesystem context source name check
  devpts: Fix type for uid and gid params
  ovl: remove unused forward declaration
  ovl: don't allow datadir only
2025-04-14 09:36:16 -07:00
Darrick J. Wong
a37b3b9c3c xfs: compute buffer address correctly in xmbuf_map_backing_mem
Prior to commit e614a00117, xmbuf_map_backing_mem relied on
folio_file_page to return the base page for the xmbuf's loff_t in the
xfile, and set b_addr to the page_address of that base page.

Now that folio_file_page has been removed from xmbuf_map_backing_mem, we
always set b_addr to the folio_address of the folio.  This is correct
for the situation where the folio size matches the buffer size, but it's
totally wrong if tmpfs uses large folios.  We need to use
offset_in_folio here.

Found via xfs/801, which demonstrated evidence of corruption of an
in-memory rmap btree block right after initializing an adjacent block.

Fixes: e614a00117 ("xfs: cleanup mapping tmpfs folios into the buffer cache")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-04-14 11:22:52 +02:00
Hans Holmberg
845abeb1f0 xfs: add tunable threshold parameter for triggering zone GC
Presently we start garbage collection late - when we start running
out of free zones to backfill max_open_zones. This is a reasonable
default as it minimizes write amplification. The longer we wait,
the more blocks are invalidated and reclaim cost less in terms
of blocks to relocate.

Starting this late however introduces a risk of GC being outcompeted
by user writes. If GC can't keep up, user writes will be forced to
wait for free zones with high tail latencies as a result.

This is not a problem under normal circumstances, but if fragmentation
is bad and user write pressure is high (multiple full-throttle
writers) we will "bottom out" of free zones.

To mitigate this, introduce a zonegc_low_space tunable that lets the
user specify a percentage of how much of the unused space that GC
should keep available for writing. A high value will reclaim more of
the space occupied by unused blocks, creating a larger buffer against
write bursts.

This comes at a cost as write amplification is increased. To
illustrate this using a sample workload, setting zonegc_low_space to
60% avoids high (500ms) max latencies while increasing write
amplification by 15%.

Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-04-14 10:41:33 +02:00
Christoph Hellwig
a1a56f541a xfs: mark xfs_buf_free as might_sleep()
xfs_buf_free can call vunmap, which can sleep.  The vunmap path is an
unlikely one, so add might_sleep to ensure calling xfs_buf_free from
atomic context gets caught more easily.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-04-14 10:24:30 +02:00
Christoph Hellwig
b73e05281c xfs: remove the leftover xfs_{set,clear}_li_failed infrastructure
Marking a log item as failed kept a buffer reference around for
resubmission of inode and dquote items.

For inode items commit 298f7bec50 ("xfs: pin inode backing buffer to
the inode log item") started pinning the inode item buffers
unconditionally and removed the need for this.  Later commit acc8f8628c
("xfs: attach dquot buffer to dquot log item buffer") did the same for
dquot items but didn't fully clean up the xfs_clear_li_failed side
for them.  Stop adding the extra pin for dquot items and remove the
helpers.

This happens to fix a call to xfs_buf_free with the AIL lock held,
which would be incorrect for the unlikely case freeing the buffer
ends up calling vfree.

Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-04-14 10:24:30 +02:00
Kent Overstreet
8692c7db9a bcachefs: btree_root_unreadable_and_scan_found_nothing now AUTOFIX
This will likely mean that the btree had only one node - there was
nothing or almost nothing in it, and we should reconstruct and continue.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-13 21:17:55 -04:00
Kuniyuki Iwashima
95d2b9f693 Revert "smb: client: fix TCP timers deadlock after rmmod"
This reverts commit e9f2517a3e.

Commit e9f2517a3e ("smb: client: fix TCP timers deadlock after
rmmod") is intended to fix a null-ptr-deref in LOCKDEP, which is
mentioned as CVE-2024-54680, but is actually did not fix anything;
The issue can be reproduced on top of it. [0]

Also, it reverted the change by commit ef7134c7fc ("smb: client:
Fix use-after-free of network namespace.") and introduced a real
issue by reviving the kernel TCP socket.

When a reconnect happens for a CIFS connection, the socket state
transitions to FIN_WAIT_1.  Then, inet_csk_clear_xmit_timers_sync()
in tcp_close() stops all timers for the socket.

If an incoming FIN packet is lost, the socket will stay at FIN_WAIT_1
forever, and such sockets could be leaked up to net.ipv4.tcp_max_orphans.

Usually, FIN can be retransmitted by the peer, but if the peer aborts
the connection, the issue comes into reality.

I warned about this privately by pointing out the exact report [1],
but the bogus fix was finally merged.

So, we should not stop the timers to finally kill the connection on
our side in that case, meaning we must not use a kernel socket for
TCP whose sk->sk_net_refcnt is 0.

The kernel socket does not have a reference to its netns to make it
possible to tear down netns without cleaning up every resource in it.

For example, tunnel devices use a UDP socket internally, but we can
destroy netns without removing such devices and let it complete
during exit.  Otherwise, netns would be leaked when the last application
died.

However, this is problematic for TCP sockets because TCP has timers to
close the connection gracefully even after the socket is close()d.  The
lifetime of the socket and its netns is different from the lifetime of
the underlying connection.

If the socket user does not maintain the netns lifetime, the timer could
be fired after the socket is close()d and its netns is freed up, resulting
in use-after-free.

Actually, we have seen so many similar issues and converted such sockets
to have a reference to netns.

That's why I converted the CIFS client socket to have a reference to
netns (sk->sk_net_refcnt == 1), which is somehow mentioned as out-of-scope
of CIFS and technically wrong in e9f2517a3e, but **is in-scope and right
fix**.

Regarding the LOCKDEP issue, we can prevent the module unload by
bumping the module refcount when switching the LOCKDDEP key in
sock_lock_init_class_and_name(). [2]

For a while, let's revert the bogus fix.

Note that now we can use sk_net_refcnt_upgrade() for the socket
conversion, but I'll do so later separately to make backport easy.

Link: https://lore.kernel.org/all/20250402020807.28583-1-kuniyu@amazon.com/ #[0]
Link: https://lore.kernel.org/netdev/c08bd5378da647a2a4c16698125d180a@huawei.com/ #[1]
Link: https://lore.kernel.org/lkml/20250402005841.19846-1-kuniyu@amazon.com/ #[2]
Fixes: e9f2517a3e ("smb: client: fix TCP timers deadlock after rmmod")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Cc: stable@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
2025-04-13 17:24:55 -05:00
Kuniyuki Iwashima
c707193a17 Revert "smb: client: Fix netns refcount imbalance causing leaks and use-after-free"
This reverts commit 4e7f1644f2.

The commit e9f2517a3e ("smb: client: fix TCP timers deadlock after
rmmod") is not only a bogus fix for LOCKDEP null-ptr-deref but also
introduces a real issue, TCP sockets leak, which will be explained in
detail in the next revert.

Also, CNA assigned CVE-2024-54680 to it but is rejecting it. [0]

Thus, we are reverting the commit and its follow-up commit 4e7f1644f2
("smb: client: Fix netns refcount imbalance causing leaks and
use-after-free").

Link: https://lore.kernel.org/all/2025040248-tummy-smilingly-4240@gregkh/ #[0]
Fixes: 4e7f1644f2 ("smb: client: Fix netns refcount imbalance causing leaks and use-after-free")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Cc: stable@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
2025-04-13 17:24:55 -05:00
Chunjie Zhu
262b73ef44 smb3 client: fix open hardlink on deferred close file error
The following Python script results in unexpected behaviour when run on
a CIFS filesystem against a Windows Server:

    # Create file
    fd = os.open('test', os.O_WRONLY|os.O_CREAT)
    os.write(fd, b'foo')
    os.close(fd)

    # Open and close the file to leave a pending deferred close
    fd = os.open('test', os.O_RDONLY|os.O_DIRECT)
    os.close(fd)

    # Try to open the file via a hard link
    os.link('test', 'new')
    newfd = os.open('new', os.O_RDONLY|os.O_DIRECT)

The final open returns EINVAL due to the server returning
STATUS_INVALID_PARAMETER. The root cause of this is that the client
caches lease keys per inode, but the spec requires them to be related to
the filename which causes problems when hard links are involved:

From MS-SMB2 section 3.3.5.9.11:

"The server MUST attempt to locate a Lease by performing a lookup in the
LeaseTable.LeaseList using the LeaseKey in the
SMB2_CREATE_REQUEST_LEASE_V2 as the lookup key. If a lease is found,
Lease.FileDeleteOnClose is FALSE, and Lease.Filename does not match the
file name for the incoming request, the request MUST be failed with
STATUS_INVALID_PARAMETER"

On client side, we first check the context of file open, if it hits above
conditions, we first close all opening files which are belong to the same
inode, then we do open the hard link file.

Cc: stable@vger.kernel.org
Signed-off-by: Chunjie Zhu <chunjie.zhu@cloud.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2025-04-13 17:24:55 -05:00
Li Lingfeng
a1d14d931b nfsd: decrease sc_count directly if fail to queue dl_recall
A deadlock warning occurred when invoking nfs4_put_stid following a failed
dl_recall queue operation:
            T1                            T2
                                nfs4_laundromat
                                 nfs4_get_client_reaplist
                                  nfs4_anylock_blockers
__break_lease
 spin_lock // ctx->flc_lock
                                   spin_lock // clp->cl_lock
                                   nfs4_lockowner_has_blockers
                                    locks_owner_has_blockers
                                     spin_lock // flctx->flc_lock
 nfsd_break_deleg_cb
  nfsd_break_one_deleg
   nfs4_put_stid
    refcount_dec_and_lock
     spin_lock // clp->cl_lock

When a file is opened, an nfs4_delegation is allocated with sc_count
initialized to 1, and the file_lease holds a reference to the delegation.
The file_lease is then associated with the file through kernel_setlease.

The disassociation is performed in nfsd4_delegreturn via the following
call chain:
nfsd4_delegreturn --> destroy_delegation --> destroy_unhashed_deleg -->
nfs4_unlock_deleg_lease --> kernel_setlease --> generic_delete_lease
The corresponding sc_count reference will be released after this
disassociation.

Since nfsd_break_one_deleg executes while holding the flc_lock, the
disassociation process becomes blocked when attempting to acquire flc_lock
in generic_delete_lease. This means:
1) sc_count in nfsd_break_one_deleg will not be decremented to 0;
2) The nfs4_put_stid called by nfsd_break_one_deleg will not attempt to
acquire cl_lock;
3) Consequently, no deadlock condition is created.

Given that sc_count in nfsd_break_one_deleg remains non-zero, we can
safely perform refcount_dec on sc_count directly. This approach
effectively avoids triggering deadlock warnings.

Fixes: 230ca75845 ("nfsd: put dl_stid if fail to queue dl_recall")
Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-04-13 16:39:42 -04:00
Eric Biggers
cd35b6cb46 nfs: add missing selections of CONFIG_CRC32
nfs.ko, nfsd.ko, and lockd.ko all use crc32_le(), which is available
only when CONFIG_CRC32 is enabled.  But the only NFS kconfig option that
selected CONFIG_CRC32 was CONFIG_NFS_DEBUG, which is client-specific and
did not actually guard the use of crc32_le() even on the client.

The code worked around this bug by only actually calling crc32_le() when
CONFIG_CRC32 is built-in, instead hard-coding '0' in other cases.  This
avoided randconfig build errors, and in real kernels the fallback code
was unlikely to be reached since CONFIG_CRC32 is 'default y'.  But, this
really needs to just be done properly, especially now that I'm planning
to update CONFIG_CRC32 to not be 'default y'.

Therefore, make CONFIG_NFS_FS, CONFIG_NFSD, and CONFIG_LOCKD select
CONFIG_CRC32.  Then remove the fallback code that becomes unnecessary,
as well as the selection of CONFIG_CRC32 from CONFIG_NFS_DEBUG.

Fixes: 1264a2f053 ("NFS: refactor code for calculating the crc32 hash of a filehandle")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Acked-by: Anna Schumaker <anna.schumaker@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-04-13 16:39:42 -04:00
Kent Overstreet
345731a389 bcachefs: fix bch2_dev_usage_full_read_fast()
One reference to bch_dev_usage wasn't updated, which meant we weren't
reading the full bch_dev_usage_full - oops.

Fixes: 955ba7b5ea ("bcachefs: bch_dev_usage_full")
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-13 14:54:34 -04:00
Linus Torvalds
004a365eb8 Merge tag 'erofs-for-6.15-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs
Pull erofs fixes from Gao Xiang:

 - Properly handle errors when file-backed I/O fails

 - Fix compilation issues on ARM platform (arm-linux-gnueabi)

 - Fix parsing of encoded extents

 - Minor cleanup

* tag 'erofs-for-6.15-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
  erofs: remove duplicate code
  erofs: fix encoded extents handling
  erofs: add __packed annotation to union(__le16..)
  erofs: set error to bio if file-backed IO fails
2025-04-13 10:52:04 -07:00
Kent Overstreet
7dfd42a07a bcachefs: Don't print data read retry success on non-errors
We may end up in the data read retry path when reading cached data and
racing with invalidation, or on checksum error when we were reading into
a userspace buffer that might have been modified while the read was in
flight.

These aren't real errors, so we shouldn't print the 'retry success'
message.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-13 10:17:00 -04:00
Linus Torvalds
5aaaedb0cb Merge tag 'ext4_for_linus-6.15-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 fixes from Ted Ts'o:
 "A few more miscellaneous ext4 bug fixes and cleanups including some
  syzbot failures and fixing a stale file handing refeencing an inode
  previously used as a regular file, but which has been deleted and
  reused as an ea_inode would result in ext4 erroneously considering
  this a case of fs corruption"

* tag 'ext4_for_linus-6.15-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: fix off-by-one error in do_split
  ext4: make block validity check resistent to sb bh corruption
  ext4: avoid -Wflex-array-member-not-at-end warning
  Documentation: ext4: Add fields to ext4_super_block documentation
  ext4: don't treat fhandle lookup of ea_inode as FS corruption
2025-04-13 07:15:50 -07:00
Alan Huang
806776ad9c bcachefs: Add missing error handling
Reported-by: syzbot+d10151bf01574a09a915@syzkaller.appspotmail.com
Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-13 07:56:28 -04:00
Gabriel Shahrouzi
d62922ba3c bcachefs: Prevent granting write refs when filesystem is read-only
Fix a shutdown WARNING in bch2_dev_free caused by active write I/O
references (ca->io_ref[WRITE]) on a device being freed.

The problem occurs when:
- The filesystem is marked read-only (BCH_FS_rw clear in c->flags).
- A subsequent operation (e.g., error handling for device removal)
  incorrectly tries to grant write references back to a device.
- During final shutdown, the read-only flag causes the system to skip
  stopping write I/O references (bch2_dev_io_ref_stop(ca, WRITE)).
- The leftover active write reference triggers the WARN_ON in
  bch2_dev_free.

Prevent this by checking if the filesystem is read-only before
attempting to grant write references to a device in the problematic
code path. Ensure consistency between the filesystem state flag
and the device I/O reference state during shutdown.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-13 07:56:28 -04:00
Artem Sadovnikov
94824ac9a8 ext4: fix off-by-one error in do_split
Syzkaller detected a use-after-free issue in ext4_insert_dentry that was
caused by out-of-bounds access due to incorrect splitting in do_split.

BUG: KASAN: use-after-free in ext4_insert_dentry+0x36a/0x6d0 fs/ext4/namei.c:2109
Write of size 251 at addr ffff888074572f14 by task syz-executor335/5847

CPU: 0 UID: 0 PID: 5847 Comm: syz-executor335 Not tainted 6.12.0-rc6-syzkaller-00318-ga9cda7c0ffed #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/30/2024
Call Trace:
 <TASK>
 __dump_stack lib/dump_stack.c:94 [inline]
 dump_stack_lvl+0x241/0x360 lib/dump_stack.c:120
 print_address_description mm/kasan/report.c:377 [inline]
 print_report+0x169/0x550 mm/kasan/report.c:488
 kasan_report+0x143/0x180 mm/kasan/report.c:601
 kasan_check_range+0x282/0x290 mm/kasan/generic.c:189
 __asan_memcpy+0x40/0x70 mm/kasan/shadow.c:106
 ext4_insert_dentry+0x36a/0x6d0 fs/ext4/namei.c:2109
 add_dirent_to_buf+0x3d9/0x750 fs/ext4/namei.c:2154
 make_indexed_dir+0xf98/0x1600 fs/ext4/namei.c:2351
 ext4_add_entry+0x222a/0x25d0 fs/ext4/namei.c:2455
 ext4_add_nondir+0x8d/0x290 fs/ext4/namei.c:2796
 ext4_symlink+0x920/0xb50 fs/ext4/namei.c:3431
 vfs_symlink+0x137/0x2e0 fs/namei.c:4615
 do_symlinkat+0x222/0x3a0 fs/namei.c:4641
 __do_sys_symlink fs/namei.c:4662 [inline]
 __se_sys_symlink fs/namei.c:4660 [inline]
 __x64_sys_symlink+0x7a/0x90 fs/namei.c:4660
 do_syscall_x64 arch/x86/entry/common.c:52 [inline]
 do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
 </TASK>

The following loop is located right above 'if' statement.

for (i = count-1; i >= 0; i--) {
	/* is more than half of this entry in 2nd half of the block? */
	if (size + map[i].size/2 > blocksize/2)
		break;
	size += map[i].size;
	move++;
}

'i' in this case could go down to -1, in which case sum of active entries
wouldn't exceed half the block size, but previous behaviour would also do
split in half if sum would exceed at the very last block, which in case of
having too many long name files in a single block could lead to
out-of-bounds access and following use-after-free.

Found by Linux Verification Center (linuxtesting.org) with Syzkaller.

Cc: stable@vger.kernel.org
Fixes: 5872331b3d ("ext4: fix potential negative array index in do_split()")
Signed-off-by: Artem Sadovnikov <a.sadovnikov@ispras.ru>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20250404082804.2567-3-a.sadovnikov@ispras.ru
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-04-12 22:07:36 -04:00