Commit Graph

99744 Commits

Linus Torvalds
f4a40a4282 Merge tag 'efi-fixes-for-v6.16-2' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi
Pull EFI fix from Ard Biesheuvel:

 - Fix potential memory leak reported by kmemleak

* tag 'efi-fixes-for-v6.16-2' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi:
  efivarfs: Fix memory leak of efivarfs_fs_info in fs_context error paths
2025-07-19 16:27:03 -07:00
Linus Torvalds
1c6aa1121e Merge tag 'vfs-6.16-rc7.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs fixes from Christian Brauner:

 - Fix a memory leak in fcntl_dirnotify()

 - Raise SB_I_NOEXEC on the secretmem superblock instead of messing with
   flags on the mount

 - Add fsdevel and block mailing lists to uio entry. We had a few
   instances where very questionable stuff was added without either block
   or the VFS being aware of it

 - Fix netfs copy-to-cache so that it performs collection with
   ceph+fscache

 - Fix netfs race between cache write completion and ALL_QUEUED being
   set

 - Verify the inode mode when loading entries from disk in isofs

 - Avoid state_lock in iomap_set_range_uptodate()

 - Fix PIDFD_INFO_COREDUMP check in PIDFD_GET_INFO ioctl

 - Fix the incorrect return value in __cachefiles_write()

* tag 'vfs-6.16-rc7.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  MAINTAINERS: add block and fsdevel lists to iov_iter
  netfs: Fix race between cache write completion and ALL_QUEUED being set
  netfs: Fix copy-to-cache so that it performs collection with ceph+fscache
  fix a leak in fcntl_dirnotify()
  iomap: avoid unnecessary ifs_set_range_uptodate() with locks
  isofs: Verify inode mode when loading from disk
  cachefiles: Fix the incorrect return value in __cachefiles_write()
  secretmem: use SB_I_NOEXEC
  coredump: fix PIDFD_INFO_COREDUMP ioctl check
2025-07-19 08:47:59 -07:00
Linus Torvalds
4871b7cb27 Merge tag 'v6.16-rc6-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull smb client fixes from Steve French:

 - fix creating special files to Samba when using SMB3.1.1 POSIX
   Extensions

 - fix incorrect caching on new file creation with directory leases
   enabled

 - two use after free fixes: one in oplock_break and one in async
   decryption

* tag 'v6.16-rc6-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  Fix SMB311 posix special file creation to servers which do not advertise reparse support
  smb: invalidate and close cached directory when creating child entries
  smb: client: fix use-after-free in crypt_message when using async crypto
  smb: client: fix use-after-free in cifs_oplock_break
2025-07-18 22:32:30 -07:00
Steve French
8767cb3fbd Fix SMB311 posix special file creation to servers which do not advertise reparse support
Some servers (including Samba) support the SMB3.1.1 POSIX Extensions (which use reparse
points for handling special files) but do not properly advertise the file system attribute
FILE_SUPPORTS_REPARSE_POINTS.  Although we don't check for this attribute flag when
querying special file information, we do check it when creating special files, which
causes them to fail unnecessarily.  If we have negotiated the SMB3.1.1 POSIX Extensions
with the server, we can expect the server to support creating special files via
reparse points, and even if the server rejects the operation because it genuinely forbids
creating special files, that is harmless and the server is more likely to return an
accurate rc in any case (e.g. EACCES instead of EOPNOTSUPP).

Allow creating special files as long as the server supports either reparse points
or the SMB3.1.1 POSIX Extensions (note that if the "sfu" mount option is specified
it uses a different way of storing special files that does not rely on reparse points).
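
A minimal sketch of the relaxed check (the helper below is hypothetical;
the field names follow the usual cifs tcon layout but are only illustrative):

    static bool can_create_special_file(struct cifs_tcon *tcon)
    {
            /* SMB3.1.1 POSIX Extensions imply reparse-point style specials */
            if (tcon->posix_extensions)
                    return true;
            /* otherwise require the advertised filesystem attribute */
            return le32_to_cpu(tcon->fsAttrInfo.Attributes) &
                   FILE_SUPPORTS_REPARSE_POINTS;
    }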

Cc: <stable@vger.kernel.org>
Fixes: 6c06be908c ("cifs: Check if server supports reparse points before using them")
Acked-by: Ralph Boehme <slow@samba.org>
Acked-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2025-07-18 12:12:02 -05:00
Linus Torvalds
d551d7bbf2 Merge tag 'xfs-fixes-6.16-rc7' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fixes from Carlos Maiolino:
 "This contains mostly code clean up, refactoring and comments
  modification.

  The most important patch in this series is the last one that removes
  an unnecessary data structure allocation of xfs busy extents which
  might lead to a memory leak on the zoned allocator code"

* tag 'xfs-fixes-6.16-rc7' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: don't allocate the xfs_extent_busy structure for zoned RTGs
  xfs: remove the bt_bdev_file buftarg field
  xfs: rename the bt_bdev_* buftarg fields
  xfs: refactor xfs_calc_atomic_write_unit_max
  xfs: add a xfs_group_type_buftarg helper
  xfs: remove the call to sync_blockdev in xfs_configure_buftarg
  xfs: clean up the initial read logic in xfs_readsb
  xfs: replace strncpy with memcpy in xattr listing
2025-07-18 10:09:48 -07:00
Christoph Hellwig
5948705adb xfs: don't allocate the xfs_extent_busy structure for zoned RTGs
Busy extent tracking is primarily used to ensure that freed blocks are
not reused for data allocations before the transaction that deleted them
has been committed to stable storage, and secondarily to drive online
discard.  None of the use cases applies to zoned RTGs, as the zoned
allocator can't overwrite blocks before resetting the zone, which already
flushes out all transactions touching the RTGs.

So the busy extent tracking is not needed for zoned RTGs, and also not
called for zoned RTGs.  But somehow the code to skip allocating and
freeing the structure got lost during the zoned XFS upstreaming process.
This not only causes these structures to be unnecessarily allocated, but can
also lead to memory leaks, as the xg_busy_extents pointer in the
xfs_group structure is overlaid with the pointer for the linked list
of zones to be reset.

Stop allocating and freeing the structure to not pointlessly allocate
memory which is then leaked when the zone is reset.
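
In sketch form, the allocation path simply bails out early for zoned RTGs
(the exact predicate spelling below is illustrative, not the literal diff):

    /* Sketch: busy extent tracking is neither needed nor used for zoned RTGs */
    if (xfs_has_zoned(mp) && xg->xg_type == XG_TYPE_RTG)
            return;         /* nothing allocated, so nothing can leak on zone reset */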

Fixes: 080d01c41d ("xfs: implement zoned garbage collection")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: <stable@vger.kernel.org> # v6.15
[cem: Fix type and add stable tag]
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-07-18 17:42:31 +02:00
Breno Leitao
64e135f1ea efivarfs: Fix memory leak of efivarfs_fs_info in fs_context error paths
When processing mount options, efivarfs allocates efivarfs_fs_info (sfi)
early in fs_context initialization. However, sfi is associated with the
superblock and typically freed when the superblock is destroyed. If the
fs_context is released (final put) before fill_super is called—such as
on error paths or during reconfiguration—the sfi structure would leak,
as ownership never transfers to the superblock.

Implement the .free callback in efivarfs_context_ops to ensure any
allocated sfi is properly freed if the fs_context is torn down before
fill_super, preventing this memory leak.
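
A minimal sketch of such a .free callback, assuming the usual fs_context
conventions (simplified, not the literal patch):

    static void efivarfs_free(struct fs_context *fc)
    {
            /* kfree(NULL) is a no-op once ownership has moved to the sb */
            kfree(fc->s_fs_info);
    }

    static const struct fs_context_operations efivarfs_context_ops = {
            ...
            .free           = efivarfs_free,
    };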

Suggested-by: James Bottomley <James.Bottomley@HansenPartnership.com>
Fixes: 5329aa5101 ("efivarfs: Add uid/gid mount options")
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
2025-07-18 12:00:20 +02:00
Kent Overstreet
89edfcf710 bcachefs: Fix bch2_maybe_casefold() when CONFIG_UTF8=n
bch2_maybe_casefold() shouldn't have been no-opped, just bch2_casefold().

Fixes: 94426e4201 ("bcachefs: opts.casefold_disabled")
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-16 17:32:33 -04:00
Kent Overstreet
40c35a0b47 bcachefs: Fix build when CONFIG_UNICODE=n
94426e4201, which added the killswitch for casefolding, accidentally
removed some of the ifdefs we need to avoid build errors.

It appears we need better build testing for different configurations; it
took two weeks for the robots to catch this one.

Fixes: 94426e4201 ("bcachefs: opts.casefold_disabled")
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-16 17:32:33 -04:00
Kent Overstreet
c02b943f7d bcachefs: Fix reference to invalid bucket in copygc
Use bch2_dev_bucket_tryget() instead of bch2_dev_tryget() before
checking the bucket bitmap.

Reported-by: syzbot+3168625f36f4a539237e@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-16 17:32:33 -04:00
Kent Overstreet
9fe8ec8664 bcachefs: Don't build aux search tree when still repairing node
bch2_btree_node_drop_keys_outside_node() will (re)build aux search
trees, because it's also called by topology repair.

bch2_btree_node_read_done() was calling it before validating individual
keys; invalid ones have to be dropped.

If we call drop_keys_outside_node() first, then
bch2_bset_build_aux_tree() doesn't run because the node already has an
aux search tree - which was invalidated by the repair.

Reported-by: syzbot+c5e7a66b3b23ae65d44f@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-16 17:32:33 -04:00
Kent Overstreet
6a1c4323de bcachefs: Tweak threshold for allocator triggering discards
The allocator path has a "if we're really low on free buckets, check if
we should issue discards" - tweak this to also trigger discards if more
than 1/128th of the device is in need_discard state.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-16 17:32:33 -04:00
Kent Overstreet
b4d6e204f8 bcachefs: Fix triggering of discard by the journal path
It becomes possible to do discards after a journal flush, which
the journal code is naturally responsible for.

A prior refactoring seems to have broken this - which went unnoticed
because the foreground allocator path can also trigger discards.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-16 17:32:33 -04:00
David Howells
89635eae07 netfs: Fix race between cache write completion and ALL_QUEUED being set
When netfslib is issuing subrequests, the subrequests start processing
immediately and may complete before we reach the end of the issuing
function.  At the end of the issuing function we set NETFS_RREQ_ALL_QUEUED
to indicate to the collector that we aren't going to issue any more subreqs
and that it can do the final notifications and cleanup.

Now, this isn't a problem if the request is synchronous
(NETFS_RREQ_OFFLOAD_COLLECTION is unset) as the result collection will be
done in-thread and we're guaranteed an opportunity to run the collector.

However, if the request is asynchronous, collection is primarily triggered
by the termination of subrequests queuing it on a workqueue.  Now, a race
can occur here if the app thread sets ALL_QUEUED after the last subrequest
terminates.

This can happen most easily with the copy2cache code (as used by Ceph)
where, in the collection routine of a read request, an asynchronous write
request is spawned to copy data to the cache.  Folios are added to the
write request as they're unlocked, but there may be a delay before
ALL_QUEUED is set as the write subrequests may complete before we get
there.

If all the write subreqs have finished by the ALL_QUEUED point, no further
events happen and the collection never happens, leaving the request
hanging.

Fix this by queuing the collector after setting ALL_QUEUED.  This is a bit
heavy-handed and it may be sufficient to do it only if there are no extant
subreqs.
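
In sketch form, the ordering change amounts to (the wake-up mechanism shown
here is illustrative):

    /* After the issuing loop, once no more subreqs will be added: */
    set_bit(NETFS_RREQ_ALL_QUEUED, &rreq->flags);
    /* Kick the collector so it runs at least once after ALL_QUEUED, even if
     * every subrequest had already terminated before this point. */
    queue_work(system_unbound_wq, &rreq->work);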

Also add a tracepoint to cross-reference both requests in a copy-to-request
operation and add a trace to the netfs_rreq tracepoint to indicate the
setting of ALL_QUEUED.

Fixes: e2d46f2ec3 ("netfs: Change the read result collector to only use one work item")
Reported-by: Max Kellermann <max.kellermann@ionos.com>
Link: https://lore.kernel.org/r/CAKPOu+8z_ijTLHdiCYGU_Uk7yYD=shxyGLwfe-L7AV3DhebS3w@mail.gmail.com/
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/20250711151005.2956810-3-dhowells@redhat.com
Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
cc: Paulo Alcantara <pc@manguebit.org>
cc: Viacheslav Dubeyko <slava@dubeyko.com>
cc: Alex Markuze <amarkuze@redhat.com>
cc: Ilya Dryomov <idryomov@gmail.com>
cc: netfs@lists.linux.dev
cc: ceph-devel@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
cc: stable@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-14 11:05:02 +02:00
David Howells
4c238e3077 netfs: Fix copy-to-cache so that it performs collection with ceph+fscache
The netfs copy-to-cache that is used by Ceph with local caching sets up a
new request to write data just read to the cache.  The request is started
and then left to look after itself whilst the app continues.  The request
gets notified by the backing fs upon completion of the async DIO write, but
then tries to wake up the app because NETFS_RREQ_OFFLOAD_COLLECTION isn't
set - but the app isn't waiting there, and so the request just hangs.

Fix this by setting NETFS_RREQ_OFFLOAD_COLLECTION which causes the
notification from the backing filesystem to put the collection onto a work
queue instead.
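
In sketch form, when the copy-to-cache write request (creq) is set up:

    /* Route result collection to a workqueue, since no thread will ever
     * wait on this request. */
    __set_bit(NETFS_RREQ_OFFLOAD_COLLECTION, &creq->flags);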

Fixes: e2d46f2ec3 ("netfs: Change the read result collector to only use one work item")
Reported-by: Max Kellermann <max.kellermann@ionos.com>
Link: https://lore.kernel.org/r/CAKPOu+8z_ijTLHdiCYGU_Uk7yYD=shxyGLwfe-L7AV3DhebS3w@mail.gmail.com/
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/20250711151005.2956810-2-dhowells@redhat.com
Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
cc: Paulo Alcantara <pc@manguebit.org>
cc: Viacheslav Dubeyko <slava@dubeyko.com>
cc: Alex Markuze <amarkuze@redhat.com>
cc: Ilya Dryomov <idryomov@gmail.com>
cc: netfs@lists.linux.dev
cc: ceph-devel@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
cc: stable@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-14 11:05:02 +02:00
Al Viro
fdfe013347 fix a leak in fcntl_dirnotify()
[into #fixes, unless somebody objects]

Lifetime of new_dn_mark is controlled by that of its ->fsn_mark,
pointed to by new_fsn_mark.  Unfortunately, a failure exit had
been inserted between the allocation of new_dn_mark and the
call of fsnotify_init_mark(), ending up with a leak.
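
One plausible shape of the fix is to free the not-yet-initialised mark on
that early exit (a sketch, not the literal diff):

    error = file_f_owner_allocate(filp);
    if (error) {
            /* new_dn_mark isn't tied to new_fsn_mark's lifetime yet,
             * so it must be freed by hand here */
            kmem_cache_free(dnotify_mark_cache, new_dn_mark);
            goto out_err;
    }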

Fixes: 1934b21261 "file: reclaim 24 bytes from f_owner"
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Link: https://lore.kernel.org/20250712171843.GB1880847@ZenIV
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-14 10:13:31 +02:00
Bharath SM
83898e4a85 smb: invalidate and close cached directory when creating child entries
When a parent lease key is passed to the server during a create operation
while holding a directory lease, the server may not send a lease break to
the client. In such cases, it becomes the client’s responsibility to
ensure cache consistency.

This led to a problem where directory listings (e.g., `ls` or `readdir`)
could return stale results after a new file is created.
e.g.:
ls /mnt/share/
touch /mnt/share/file1
ls /mnt/share/

In this scenario, the final `ls` may not show `file1` due to the stale
directory cache.

For now, fix this by marking the cached directory as invalid if using
the parent lease key during create, and explicitly closing the cached
directory after successful file creation.
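
Roughly, around the create path (helper and field names are illustrative,
not the exact cifs code):

    /* Sketch: after a create that carried the parent lease key */
    if (cfid) {
            cfid->dirents.is_valid = false;   /* stop serving cached readdir data */
            close_cached_dir(cfid);           /* drop the cached handle on success */
    }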

Fixes: 037e1bae58 ("smb: client: use ParentLeaseKey in cifs_do_create")

Signed-off-by: Bharath SM <bharathsm@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2025-07-13 17:16:29 -05:00
Wang Zhaolong
b220bed633 smb: client: fix use-after-free in crypt_message when using async crypto
The CVE-2024-50047 fix removed asynchronous crypto handling from
crypt_message(), assuming all crypto operations are synchronous.
However, when hardware crypto accelerators are used, this can cause
use-after-free crashes:

  crypt_message()
    // Allocate the creq buffer containing the req
    creq = smb2_get_aead_req(..., &req);

    // Async encryption returns -EINPROGRESS immediately
    rc = enc ? crypto_aead_encrypt(req) : crypto_aead_decrypt(req);

    // Free creq while async operation is still in progress
    kvfree_sensitive(creq, ...);

Hardware crypto modules often implement async AEAD operations for
performance. When crypto_aead_encrypt/decrypt() returns -EINPROGRESS,
the operation completes asynchronously. Without crypto_wait_req(),
the function immediately frees the request buffer, leading to crashes
when the driver later accesses the freed memory.

This results in a use-after-free condition when the hardware crypto
driver later accesses the freed request structure, leading to kernel
crashes with NULL pointer dereferences.

The issue occurs because crypto_alloc_aead() with mask=0 doesn't
guarantee synchronous operation. Even without CRYPTO_ALG_ASYNC in
the mask, async implementations can be selected.

Fix by restoring the async crypto handling:
- DECLARE_CRYPTO_WAIT(wait) for completion tracking
- aead_request_set_callback() for async completion notification
- crypto_wait_req() to wait for operation completion

This ensures the request buffer isn't freed until the crypto operation
completes, whether synchronous or asynchronous, while preserving the
CVE-2024-50047 fix.
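
Sketched with the standard crypto wait helpers named above (the surrounding
variable names are illustrative):

    DECLARE_CRYPTO_WAIT(wait);
    ...
    aead_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG |
                              CRYPTO_TFM_REQ_MAY_SLEEP, crypto_req_done, &wait);
    ...
    rc = crypto_wait_req(enc ? crypto_aead_encrypt(req)
                             : crypto_aead_decrypt(req), &wait);

    /* only now is it safe to free the request buffer */
    kvfree_sensitive(creq, sensitive_size);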

Fixes: b0abcd65ec ("smb: client: fix UAF in async decryption")
Link: https://lore.kernel.org/all/8b784a13-87b0-4131-9ff9-7a8993538749@huaweicloud.com/
Cc: stable@vger.kernel.org
Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
Signed-off-by: Wang Zhaolong <wangzhaolong@huaweicloud.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2025-07-13 17:16:29 -05:00
Wang Zhaolong
705c79101c smb: client: fix use-after-free in cifs_oplock_break
A race condition can occur in cifs_oplock_break() leading to a
use-after-free of the cinode structure when unmounting:

  cifs_oplock_break()
    _cifsFileInfo_put(cfile)
      cifsFileInfo_put_final()
        cifs_sb_deactive()
          [last ref, start releasing sb]
            kill_sb()
              kill_anon_super()
                generic_shutdown_super()
                  evict_inodes()
                    dispose_list()
                      evict()
                        destroy_inode()
                          call_rcu(&inode->i_rcu, i_callback)
    spin_lock(&cinode->open_file_lock)  <- OK
                            [later] i_callback()
                              cifs_free_inode()
                                kmem_cache_free(cinode)
    spin_unlock(&cinode->open_file_lock)  <- UAF
    cifs_done_oplock_break(cinode)       <- UAF

The issue occurs when umount has already released its reference to the
superblock. When _cifsFileInfo_put() calls cifs_sb_deactive(), this
releases the last reference, triggering the immediate cleanup of all
inodes under RCU. However, cifs_oplock_break() continues to access the
cinode after this point, resulting in use-after-free.

Fix this by holding an extra reference to the superblock during the
entire oplock break operation. This ensures that the superblock and
its inodes remain valid until the oplock break completes.
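
In sketch form (simplified; the real flow lives in cifs_oplock_break()):

    struct super_block *sb = cfile->dentry->d_sb;

    cifs_sb_active(sb);             /* pin the superblock for the whole break */
    ... handle the oplock break, _cifsFileInfo_put(), ...
    cifs_done_oplock_break(cinode);
    cifs_sb_deactive(sb);           /* drop the extra reference only at the end */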

Link: https://bugzilla.kernel.org/show_bug.cgi?id=220309
Fixes: b98749cac4 ("CIFS: keep FileInfo handle live during oplock break")
Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
Signed-off-by: Wang Zhaolong <wangzhaolong@huaweicloud.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2025-07-13 17:16:29 -05:00
Kent Overstreet
30792947c6 bcachefs: io_read: remove from async obj list in rbio_done()
Previously, only split rbios allocated in io_read.c would be removed
from the async obj list.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-13 17:45:13 -04:00
Linus Torvalds
3f31a806a6 Merge tag 'mm-hotfixes-stable-2025-07-11-16-16' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
 "19 hotfixes. A whopping 16 are cc:stable and the remainder address
  post-6.15 issues or aren't considered necessary for -stable kernels.

  14 are for MM.  Three gdb-script fixes and a kallsyms build fix"

* tag 'mm-hotfixes-stable-2025-07-11-16-16' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  Revert "sched/numa: add statistics of numa balance task"
  mm: fix the inaccurate memory statistics issue for users
  mm/damon: fix divide by zero in damon_get_intervals_score()
  samples/damon: fix damon sample mtier for start failure
  samples/damon: fix damon sample wsse for start failure
  samples/damon: fix damon sample prcl for start failure
  kasan: remove kasan_find_vm_area() to prevent possible deadlock
  scripts: gdb: vfs: support external dentry names
  mm/migrate: fix do_pages_stat in compat mode
  mm/damon/core: handle damon_call_control as normal under kdmond deactivation
  mm/rmap: fix potential out-of-bounds page table access during batched unmap
  mm/hugetlb: don't crash when allocating a folio if there are no resv
  scripts/gdb: de-reference per-CPU MCE interrupts
  scripts/gdb: fix interrupts.py after maple tree conversion
  maple_tree: fix mt_destroy_walk() on root leaf node
  mm/vmalloc: leave lazy MMU mode on PTE mapping error
  scripts/gdb: fix interrupts display after MCP on x86
  lib/alloc_tag: do not acquire non-existent lock in alloc_tag_top_users()
  kallsyms: fix build without execinfo
2025-07-12 10:30:47 -07:00
Linus Torvalds
3b428e1cfc Merge tag 'erofs-for-6.16-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs
Pull erofs fixes from Gao Xiang:
 "Fix for a cache aliasing issue by adding missing flush_dcache_folio(),
  which causes execution failures on some arm32 setups.

  Fix for large compressed fragments, which could be generated by
  -Eall-fragments option (but should be rare) and was rejected by
  mistake due to an on-disk hardening commit.

  The remaining ones are small fixes. Summary:

   - Address cache aliasing for mappable page cache folios

   - Allow readdir() to be interrupted

   - Fix large fragment handling which was errored out by mistake

   - Add missing tracepoints

   - Use memcpy_to_folio() to replace copy_to_iter() for inline data"

* tag 'erofs-for-6.16-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
  erofs: fix large fragment handling
  erofs: allow readdir() to be interrupted
  erofs: address D-cache aliasing
  erofs: use memcpy_to_folio() to replace copy_to_iter()
  erofs: fix to add missing tracepoint in erofs_read_folio()
  erofs: fix to add missing tracepoint in erofs_readahead()
2025-07-12 10:20:03 -07:00
Linus Torvalds
4412b8b23d Merge tag 'bcachefs-2025-07-11' of git://evilpiepirate.org/bcachefs
Pull bcachefs fixes from Kent Overstreet.

* tag 'bcachefs-2025-07-11' of git://evilpiepirate.org/bcachefs:
  bcachefs: Don't set BCH_FS_error on transaction restart
  bcachefs: Fix additional misalignment in journal space calculations
  bcachefs: Don't schedule non persistent passes persistently
  bcachefs: Fix bch2_btree_transactions_read() synchronization
  bcachefs: btree read retry fixes
  bcachefs: btree node scan no longer uses btree cache
  bcachefs: Tweak btree cache helpers for use by btree node scan
  bcachefs: Fix btree for nonexistent tree depth
  bcachefs: Fix bch2_io_failures_to_text()
  bcachefs: bch2_fpunch_snapshot()
2025-07-12 10:13:27 -07:00
Linus Torvalds
2632d81f5a Merge tag 'v6.16-rc5-ksmbd-server-fixes' of git://git.samba.org/ksmbd
Pull smb server fixes from Steve French:

 - fix use after free in lease break

 - small fix for freeing rdma transport (fixes missing logging of
   cm_qp_destroy)

 - fix write count leak

* tag 'v6.16-rc5-ksmbd-server-fixes' of git://git.samba.org/ksmbd:
  ksmbd: fix potential use-after-free in oplock/lease break ack
  ksmbd: fix a mount write count leak in ksmbd_vfs_kern_path_locked()
  smb: server: make use of rdma_destroy_qp()
2025-07-12 10:06:06 -07:00
Linus Torvalds
5f02b80c21 Revert "eventpoll: Fix priority inversion problem"
This reverts commit 8c44dac8ad.

I haven't figured out what the actual bug in this commit is, but I did
spend a lot of time chasing it down and eventually succeeded in
bisecting it down to this.

For some reason, this eventpoll commit ends up causing delays and stuck
user space processes, but it only happens on one of my machines, and
only during early boot or during the flurry of initial activity when
logging in.

I must be triggering some very subtle timing issue, but once I figured
out the behavior pattern that made it reasonably reliable to trigger, it
did bisect right to this, and reverting the commit fixes the problem.

Of course, that was only after I had failed at bisecting it several
times, and had flailed around blaming both the drm people and the
netlink people for the odd problems.  The most obvious of which happened
at the time of the first graphical login (the most common symptom being
that some gnome app aborted due to a 30s timeout, often leading to the
whole session then failing if it was some critical component like
gnome-shell or similar).

Acked-by: Nam Cao <namcao@linutronix.de>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Christian Brauner <brauner@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2025-07-11 17:10:32 -07:00
Gao Xiang
b44686c839 erofs: fix large fragment handling
Fragments aren't limited by Z_EROFS_PCLUSTER_MAX_DSIZE. However, if
a fragment's logical length is larger than Z_EROFS_PCLUSTER_MAX_DSIZE
but the fragment is not the whole inode, it currently returns
-EOPNOTSUPP because m_flags has the wrong EROFS_MAP_ENCODED flag set.
It is not intended by design but should be rare, as it can only be
reproduced by mkfs with `-Eall-fragments` in a specific case.

Let's normalize fragment m_flags using the new EROFS_MAP_FRAGMENT.

Reported-by: Axel Fontaine <axel@axelfontaine.com>
Closes: https://github.com/erofs/erofs-utils/issues/23
Fixes: 7c3ca1838a ("erofs: restrict pcluster size limitations")
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20250711195826.3601157-1-hsiangkao@linux.alibaba.com
2025-07-12 04:02:44 +08:00
Jinliang Zheng
177bb4cba9 iomap: avoid unnecessary ifs_set_range_uptodate() with locks
In the buffer write path, iomap_set_range_uptodate() is called every
time iomap_end_write() is called. But if folio_test_uptodate() holds, we
know that all blocks in this folio are already in the uptodate state, so
there is no need to go deep into the critical section of state_lock to
execute bitmap_set().

This is because the folios always creep towards ifs_is_fully_uptodate()
state and once they've gotten there folio_mark_uptodate() is called, which
means the folio is uptodate.

Then once a folio is uptodate, there is no route back to !uptodate without
going through the removal of the folio from the page cache. Therefore, it's
fine to use folio_test_uptodate() to short-circuit unnecessary code paths.

Although state_lock may not have significant lock contention due to
folio lock, this patch at least reduces the number of instructions,
especially the expensive lock-prefixed instructions.
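
A simplified sketch of the short-circuit (variable names approximate, not
the literal code):

    if (folio_test_uptodate(folio))
            return;         /* every block is already uptodate: skip state_lock */

    spin_lock_irqsave(&ifs->state_lock, flags);
    bitmap_set(ifs->state, first_blk, nr_blks);
    if (ifs_is_fully_uptodate(folio, ifs))
            folio_mark_uptodate(folio);
    spin_unlock_irqrestore(&ifs->state_lock, flags);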

Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
Link: https://lore.kernel.org/20250711081207.1782667-1-alexjlzheng@tencent.com
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-11 11:42:44 +02:00
Jan Kara
0a9e740513 isofs: Verify inode mode when loading from disk
Verify that the inode mode is sane when loading it from the disk to
avoid complaints from VFS about setting up invalid inodes.

Reported-by: syzbot+895c23f6917da440ed0d@syzkaller.appspotmail.com
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/20250709095545.31062-2-jack@suse.cz
Acked-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-11 11:39:31 +02:00
Chao Yu
d31fbdc4c7 erofs: allow readdir() to be interrupted
On a slow device, readdir() may loop for a long time in a large
directory, so give userspace a chance to interrupt it.

Signed-off-by: Chao Yu <chao@kernel.org>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20250710073619.4083422-1-chao@kernel.org
[ Gao Xiang: move cond_resched() to the end of the while loop. ]
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2025-07-10 17:08:27 +08:00
Gao Xiang
27917e8194 erofs: address D-cache aliasing
Flush the D-cache before unlocking folios for compressed inodes, as
they are dirtied during decompression.

Avoid calling flush_dcache_folio() on every CPU write, since it's more
like playing whack-a-mole without real benefit.

It has no impact on x86 and arm64/risc-v: on x86, flush_dcache_folio()
is a no-op, and on arm64/risc-v, PG_dcache_clean (PG_arch_1) is clear
for new page cache folios.  However, certain ARM boards are affected,
as reported.

Fixes: 3883a79abd ("staging: erofs: introduce VLE decompression support")
Closes: https://lore.kernel.org/r/c1e51e16-6cc6-49d0-a63e-4e9ff6c4dd53@pengutronix.de
Closes: https://lore.kernel.org/r/38d43fae-1182-4155-9c5b-ffc7382d9917@siemens.com
Tested-by: Jan Kiszka <jan.kiszka@siemens.com>
Tested-by: Stefan Kerkmann <s.kerkmann@pengutronix.de>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20250709034614.2780117-2-hsiangkao@linux.alibaba.com
2025-07-10 17:08:27 +08:00
Gao Xiang
f5443d0d1a erofs: use memcpy_to_folio() to replace copy_to_iter()
Using copy_to_iter() here is overkill and even messy.

Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20250709034614.2780117-1-hsiangkao@linux.alibaba.com
2025-07-10 17:08:26 +08:00
Chao Yu
99f7619a77 erofs: fix to add missing tracepoint in erofs_read_folio()
Commit 771c994ea5 ("erofs: convert all uncompressed cases to iomap")
converts to use iomap interface, it removed trace_erofs_readpage()
tracepoint in the meantime, let's add it back.

Fixes: 771c994ea5 ("erofs: convert all uncompressed cases to iomap")
Signed-off-by: Chao Yu <chao@kernel.org>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20250708111942.3120926-1-chao@kernel.org
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2025-07-10 17:08:26 +08:00
Chao Yu
d53238b614 erofs: fix to add missing tracepoint in erofs_readahead()
Commit 771c994ea5 ("erofs: convert all uncompressed cases to iomap")
converts to use iomap interface, it removed trace_erofs_readahead()
tracepoint in the meantime, let's add it back.

Fixes: 771c994ea5 ("erofs: convert all uncompressed cases to iomap")
Signed-off-by: Chao Yu <chao@kernel.org>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20250707084832.2725677-1-chao@kernel.org
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2025-07-10 17:08:25 +08:00
Zizhi Wo
6b89819b06 cachefiles: Fix the incorrect return value in __cachefiles_write()
In __cachefiles_write(), if the return value of the write operation > 0, it
is set to 0. This makes it impossible to distinguish scenarios where a
partial write has occurred, and will affect the outer calling functions:

 1) cachefiles_write_complete() will call "term_func" such as
netfs_write_subrequest_terminated(). When "ret" in __cachefiles_write()
is used as the "transferred_or_error" of this function, it cannot
distinguish the amount of data written, which makes the WARN meaningless.

 2) cachefiles_ondemand_fd_write_iter() can only assume all writes were
successful by default when "ret" is 0, and unconditionally returns the full
length specified by user space.

Fix it by modifying "ret" to reflect the actual number of bytes written.
Furthermore, returning a value greater than 0 from __cachefiles_write()
does not affect other call paths, such as cachefiles_issue_write() and
fscache_write().
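
In spirit, the change at the write call site is (a sketch, not the literal
diff):

    ret = vfs_iocb_iter_write(file, &ki->iocb, iter);
    if (ret > 0) {
            /* old: ret = 0;  -- callers could not see short writes */
            /* new: keep ret as the number of bytes actually written */
    }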

Fixes: 047487c947 ("cachefiles: Implement the I/O routines")
Signed-off-by: Zizhi Wo <wozizhi@huawei.com>
Link: https://lore.kernel.org/20250703024418.2809353-1-wozizhi@huaweicloud.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-10 09:40:17 +02:00
Baolin Wang
82241a83cd mm: fix the inaccurate memory statistics issue for users
On some large machines with a high number of CPUs running a 64K pagesize
kernel, we found that the 'RES' field is always 0 displayed by the top
command for some processes, which will cause a lot of confusion for users.

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 875525 root      20   0   12480      0      0 R   0.3   0.0   0:00.08 top
      1 root      20   0  172800      0      0 S   0.0   0.0   0:04.52 systemd

The main reason is that the batch size of the percpu counter is quite
large on these machines, caching a significant percpu value, ever since
commit f1a7941243 ("mm: convert mm's rss stats into percpu_counter")
converted mm's rss stats into percpu_counter.  Intuitively, the batch
number should be optimized, but on some paths, performance may take
precedence over statistical accuracy.  Therefore, introduce a new
interface that adds up the percpu counts and displays the accurate value
to users, which removes the confusion.  In addition, this change is not
expected to be on a performance-critical path, so the modification should
be acceptable.

In addition, the 'mm->rss_stat' is updated by using add_mm_counter() and
dec/inc_mm_counter(), which are all wrappers around
percpu_counter_add_batch().  In percpu_counter_add_batch(), there is
percpu batch caching to avoid 'fbc->lock' contention.  This patch changes
task_mem() and task_statm() to get the accurate mm counters under the
'fbc->lock', but this should not exacerbate kernel 'mm->rss_stat' lock
contention due to the percpu batch caching of the mm counters.  The
following test also confirms the theoretical analysis.
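
A simplified sketch of the accurate read-out (the helper name is
illustrative):

    static unsigned long mm_counter_sum(struct mm_struct *mm, int member)
    {
            /* sum the percpu deltas under fbc->lock instead of returning
             * the cached approximation */
            return percpu_counter_sum_positive(&mm->rss_stat[member]);
    }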

I ran stress-ng stressing anon page faults in 32 threads on my 32-core
machine, while simultaneously running a script that starts 32 threads to
busy-loop pread each stress-ng thread's /proc/pid/status interface.  From
the following data, I did not observe any obvious impact of this patch on
the stress-ng tests.

w/o patch:
stress-ng: info:  [6848]          4,399,219,085,152 CPU Cycles          67.327 B/sec
stress-ng: info:  [6848]          1,616,524,844,832 Instructions          24.740 B/sec (0.367 instr. per cycle)
stress-ng: info:  [6848]          39,529,792 Page Faults Total           0.605 M/sec
stress-ng: info:  [6848]          39,529,792 Page Faults Minor           0.605 M/sec

w/patch:
stress-ng: info:  [2485]          4,462,440,381,856 CPU Cycles          68.382 B/sec
stress-ng: info:  [2485]          1,615,101,503,296 Instructions          24.750 B/sec (0.362 instr. per cycle)
stress-ng: info:  [2485]          39,439,232 Page Faults Total           0.604 M/sec
stress-ng: info:  [2485]          39,439,232 Page Faults Minor           0.604 M/sec

On comparing a very simple app which just allocates & touches some
memory against v6.1 (which doesn't have f1a7941243) and latest Linus
tree (4c06e63b92) I can see that on latest Linus tree the values for
VmRSS, RssAnon and RssFile from /proc/self/status are all zeroes while
they do report values on v6.1 and a Linus tree with this patch.

Link: https://lkml.kernel.org/r/f4586b17f66f97c174f7fd1f8647374fdb53de1c.1749119050.git.baolin.wang@linux.alibaba.com
Fixes: f1a7941243 ("mm: convert mm's rss stats into percpu_counter")
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
Tested-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
Tested-by Donet Tom <donettom@linux.ibm.com>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: SeongJae Park <sj@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09 21:07:55 -07:00
Linus Torvalds
8c2e52ebbe eventpoll: don't decrement ep refcount while still holding the ep mutex
Jann Horn points out that epoll is decrementing the ep refcount and then
doing a

    mutex_unlock(&ep->mtx);

afterwards. That's very wrong, because it can lead to a use-after-free.

That pattern is actually fine for the very last reference, because the
code in question will delay the actual call to "ep_free(ep)" until after
it has unlocked the mutex.

But it's wrong for the much subtler "next to last" case when somebody
*else* may also be dropping their reference and free the ep while we're
still using the mutex.

Note that this is true even if that other user is also using the same ep
mutex: mutexes, unlike spinlocks, can not be used for object ownership,
even if they guarantee mutual exclusion.

A mutex "unlock" operation is not atomic, and as one user is still
accessing the mutex as part of unlocking it, another user can come in
and get the now released mutex and free the data structure while the
first user is still cleaning up.

See our mutex documentation in Documentation/locking/mutex-design.rst,
in particular the section [1] about semantics:

	"mutex_unlock() may access the mutex structure even after it has
	 internally released the lock already - so it's not safe for
	 another context to acquire the mutex and assume that the
	 mutex_unlock() context is not using the structure anymore"

So if we drop our ep ref before the mutex unlock, but we weren't the
last one, we may then unlock the mutex, another user comes in, drops
_their_ reference and releases the 'ep' as it now has no users - all
while the mutex_unlock() is still accessing it.

Fix this by simply moving the ep refcount dropping to outside the mutex:
the refcount itself is atomic, and doesn't need mutex protection (that's
the whole _point_ of refcounts: unlike mutexes, they are inherently
about object lifetimes).
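
In sketch form (the refcount helper spelling is illustrative):

    /* drop the ep reference only after the mutex is fully released */
    mutex_unlock(&ep->mtx);
    if (ep_refcount_dec_and_test(ep))
            ep_free(ep);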

Reported-by: Jann Horn <jannh@google.com>
Link: https://docs.kernel.org/locking/mutex-design.html#semantics [1]
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2025-07-09 10:38:29 -07:00
Kent Overstreet
fec5e6f97d bcachefs: Don't set BCH_FS_error on transaction restart
This started showing up more when we started logging the error being
corrected in the journal - but __bch2_fsck_err() could return
transaction restarts before that.

Setting BCH_FS_error incorrectly causes recovery passes to not be
cleared, among other issues.

Fixes: b43f724927 ("bcachefs: Log fsck errors in the journal")
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-08 15:24:15 -04:00
Namjae Jeon
50f930db22 ksmbd: fix potential use-after-free in oplock/lease break ack
If ksmbd_iov_pin_rsp() returns an error, a use-after-free can happen when
accessing opinfo->state, and opinfo_put() and ksmbd_fd_put() could be
called twice.

Reported-by: Ziyan Xu <research@securitygossip.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2025-07-08 11:25:44 -05:00
Al Viro
277627b431 ksmbd: fix a mount write count leak in ksmbd_vfs_kern_path_locked()
If the call of ksmbd_vfs_lock_parent() fails, we drop the parent_path
references and return an error.  We need to drop the write access we
just got on parent_path->mnt before we drop the mount reference - callers
assume that ksmbd_vfs_kern_path_locked() returns with mount write
access grabbed if and only if it has returned 0.
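
The error unwind then looks roughly like this (a simplified sketch):

    err = ksmbd_vfs_lock_parent(parent_path->dentry, path->dentry);
    if (err) {
            mnt_drop_write(parent_path->mnt);   /* undo the write access first */
            path_put(path);
            path_put(parent_path);
            return err;
    }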

Fixes: 864fb5d371 ("ksmbd: fix possible deadlock in smb2_open")
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2025-07-08 11:25:44 -05:00
Stefan Metzmacher
0c2b53997e smb: server: make use of rdma_destroy_qp()
The qp is created by rdma_create_qp() as t->cm_id->qp
and t->qp is just a shortcut.

rdma_destroy_qp() also calls ib_destroy_qp(cm_id->qp) internally,
but it is protected by a mutex, clears the cm_id and also calls
trace_cm_qp_destroy().

This should make the tracing more useful, as both
rdma_create_qp() and rdma_destroy_qp() are traced, and it makes
the code look more sane, as functions from the same layer are used
for the specific qp object.

trace-cmd stream -e rdma_cma:cm_qp_create -e rdma_cma:cm_qp_destroy
shows this now while doing a mount and unmount from a client:

  <...>-80   [002] 378.514182: cm_qp_create:  cm.id=1 src=172.31.9.167:5445 dst=172.31.9.166:37113 tos=0 pd.id=0 qp_type=RC send_wr=867 recv_wr=255 qp_num=1 rc=0
  <...>-6283 [001] 381.686172: cm_qp_destroy: cm.id=1 src=172.31.9.167:5445 dst=172.31.9.166:37113 tos=0 qp_num=1

Before we only saw the first line.

Cc: Namjae Jeon <linkinjeon@kernel.org>
Cc: Steve French <stfrench@microsoft.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Hyunchul Lee <hyc.lee@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: linux-cifs@vger.kernel.org
Fixes: 0626e6641f ("cifsd: add server handler for central processing and tranport layers")
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Reviewed-by: Tom Talpey <tom@talpey.com>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2025-07-08 11:25:43 -05:00
Christoph Hellwig
9b027aa3e8 xfs: remove the bt_bdev_file buftarg field
And use bt_file for both bdev and shmem backed buftargs.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-07-08 13:30:26 +02:00
Christoph Hellwig
988a168275 xfs: rename the bt_bdev_* buftarg fields
The extra bdev_ is weird, so drop it.  Also improve the comment to make
it clear these are the hardware limits.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-07-08 13:30:26 +02:00
Christoph Hellwig
e4a7a3f9b2 xfs: refactor xfs_calc_atomic_write_unit_max
This function and the helpers used by it duplicate the same logic for AGs
and RTGs.  Use the xfs_group_type enum to unify both variants.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-07-08 13:30:26 +02:00
Christoph Hellwig
e74d1fa6a7 xfs: add a xfs_group_type_buftarg helper
Generalize the xfs_group_type helper in the discard code to return a buftarg
and move it to xfs_mount.h, and use the result in xfs_dax_notify_dev_failure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-07-08 13:30:26 +02:00
Christoph Hellwig
d9b1e348cf xfs: remove the call to sync_blockdev in xfs_configure_buftarg
This extra call is not needed as xfs_alloc_buftarg already calls
sync_blockdev.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-07-08 13:30:26 +02:00
Christoph Hellwig
a578a8efa7 xfs: clean up the initial read logic in xfs_readsb
The initial sb read is always for a device logical block size buffer.
The device logical block size is provided in the bt_logical_sectorsize in
struct buftarg, so use that instead of the confusingly named
xfs_getsize_buftarg helper that reads it from the bdev.

Update the comments surrounding the code to better describe what is going
on.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-07-08 13:30:26 +02:00
Pranav Tyagi
f2eb2796b9 xfs: replace strncpy with memcpy in xattr listing
Use memcpy() in place of strncpy() in __xfs_xattr_put_listent().
The length is known and a null byte is added manually.

No functional change intended.
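
The resulting pattern is roughly (sketch):

    memcpy(offset, prefix, prefix_len);
    offset += prefix_len;
    memcpy(offset, name, namelen);
    offset += namelen;
    *offset = '\0';         /* terminator added explicitly; lengths are known */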

Signed-off-by: Pranav Tyagi <pranav.tyagi03@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-07-08 11:50:09 +02:00
Kent Overstreet
74f3931a1b bcachefs: Fix additional misalignment in journal space calculations
Additional fix on top of

f54b2a80d0 bcachefs: Fix misaligned bucket check in journal space calculations

Make sure that when we calculate space for the next entry it's not
misaligned: we need to round_down() to filesystem block size in multiple
places (next entry size calculation as well as total space available).

Reported-by: Ondřej Kraus <neverberlerfellerer@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-07 18:19:30 -04:00
Kent Overstreet
7de3c8b407 bcachefs: Don't schedule non persistent passes persistently
if (!(in_recovery && (flags & RUN_RECOVERY_PASS_nopersistent)))

should have been

  if (!in_recovery && !(flags & RUN_RECOVERY_PASS_nopersistent))

But the !in_recovery part was also wrong: the assumption is that if
we're in recovery we'll just rewind and run the recovery pass
immediately, but we're not able to do so if we've already gone RW and
the pass must be run before we go RW. In that case, we need to schedule
it in the superblock so it can be run on the next mount attempt.

Scheduling it persistently is fine, because it'll be cleared in the
superblock immediately when the pass completes successfully.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-07 14:10:47 -04:00
Laura Brehm
830a9e37cf coredump: fix PIDFD_INFO_COREDUMP ioctl check
In commit 1d8db6fd69 ("pidfs, coredump: add PIDFD_INFO_COREDUMP"), the
following code was added:

    if (mask & PIDFD_INFO_COREDUMP) {
        kinfo.mask |= PIDFD_INFO_COREDUMP;
        kinfo.coredump_mask = READ_ONCE(pidfs_i(inode)->__pei.coredump_mask);
    }
    [...]
    if (!(kinfo.mask & PIDFD_INFO_COREDUMP)) {
        task_lock(task);
        if (task->mm)
            kinfo.coredump_mask = pidfs_coredump_mask(task->mm->flags);
        task_unlock(task);
    }

The second bit in particular looks off to me - the condition in essence
checks whether PIDFD_INFO_COREDUMP was **not** requested, and if so
fetches the coredump_mask in kinfo, since it's checking !(kinfo.mask &
PIDFD_INFO_COREDUMP), which is unconditionally set in the earlier hunk.

I'm tempted to assume the idea in the second hunk was to calculate the
coredump mask if one was requested but not fetched in the first hunk, in
which case the check should be
    if ((kinfo.mask & PIDFD_INFO_COREDUMP) && !(kinfo.coredump_mask))
which might be more legibly written as
    if ((mask & PIDFD_INFO_COREDUMP) && !(kinfo.coredump_mask))

This could also instead be achieved by changing the first hunk to be:

    if (mask & PIDFD_INFO_COREDUMP) {
	kinfo.coredump_mask = READ_ONCE(pidfs_i(inode)->__pei.coredump_mask);
	if (kinfo.coredump_mask)
	    kinfo.mask |= PIDFD_INFO_COREDUMP;
    }

and the second hunk to:

    if ((mask & PIDFD_INFO_COREDUMP) && !(kinfo.mask & PIDFD_INFO_COREDUMP)) {
	task_lock(task);
        if (task->mm) {
	    kinfo.coredump_mask = pidfs_coredump_mask(task->mm->flags);
            kinfo.mask |= PIDFD_INFO_COREDUMP;
        }
        task_unlock(task);
    }

However, when looking at this, the supposition that the second hunk
means to cover cases where the coredump info was requested but the first
hunk failed to get it starts getting doubtful, so apologies if I'm
completely off-base.

This patch addresses the issue by fixing the check in the second hunk.
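
Applying the suggestion above, the second hunk then reads:

    if ((kinfo.mask & PIDFD_INFO_COREDUMP) && !kinfo.coredump_mask) {
        task_lock(task);
        if (task->mm)
            kinfo.coredump_mask = pidfs_coredump_mask(task->mm->flags);
        task_unlock(task);
    }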

Signed-off-by: Laura Brehm <laurabrehm@hey.com>
Link: https://lore.kernel.org/20250703120244.96908-3-laurabrehm@hey.com
Cc: brauner@kernel.org
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-07 13:26:03 +02:00