Commit Graph

102029 Commits

Author SHA1 Message Date
Jaegeuk Kim
4e715744bf f2fs: add missing dput() when printing the donation list
We missed to call dput() on the grabbed dentry.

Fixes: f1a49c1b11 ("f2fs: show the list of donation files")
Reviewed-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-10-03 03:16:10 +00:00
Linus Torvalds
e406d57be7 Merge tag 'mm-nonmm-stable-2025-10-02-15-29' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull non-MM updates from Andrew Morton:

 - "ida: Remove the ida_simple_xxx() API" from Christophe Jaillet
   completes the removal of this legacy IDR API

 - "panic: introduce panic status function family" from Jinchao Wang
   provides a number of cleanups to the panic code and its various
   helpers, which were rather ad-hoc and scattered all over the place

 - "tools/delaytop: implement real-time keyboard interaction support"
   from Fan Yu adds a few nice user-facing usability changes to the
   delaytop monitoring tool

 - "efi: Fix EFI boot with kexec handover (KHO)" from Evangelos
   Petrongonas fixes a panic which was happening with the combination of
   EFI and KHO

 - "Squashfs: performance improvement and a sanity check" from Phillip
   Lougher teaches squashfs's lseek() about SEEK_DATA/SEEK_HOLE. A mere
   150x speedup was measured for a well-chosen microbenchmark

 - plus another 50-odd singleton patches all over the place

* tag 'mm-nonmm-stable-2025-10-02-15-29' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (75 commits)
  Squashfs: reject negative file sizes in squashfs_read_inode()
  kallsyms: use kmalloc_array() instead of kmalloc()
  MAINTAINERS: update Sibi Sankar's email address
  Squashfs: add SEEK_DATA/SEEK_HOLE support
  Squashfs: add additional inode sanity checking
  lib/genalloc: fix device leak in of_gen_pool_get()
  panic: remove CONFIG_PANIC_ON_OOPS_VALUE
  ocfs2: fix double free in user_cluster_connect()
  checkpatch: suppress strscpy warnings for userspace tools
  cramfs: fix incorrect physical page address calculation
  kernel: prevent prctl(PR_SET_PDEATHSIG) from racing with parent process exit
  Squashfs: fix uninit-value in squashfs_get_parent
  kho: only fill kimage if KHO is finalized
  ocfs2: avoid extra calls to strlen() after ocfs2_sprintf_system_inode_name()
  kernel/sys.c: fix the racy usage of task_lock(tsk->group_leader) in sys_prlimit64() paths
  sched/task.h: fix the wrong comment on task_lock() nesting with tasklist_lock
  coccinelle: platform_no_drv_owner: handle also built-in drivers
  coccinelle: of_table: handle SPI device ID tables
  lib/decompress: use designated initializers for struct compress_format
  efi: support booting with kexec handover (KHO)
  ...
2025-10-02 18:44:54 -07:00
Linus Torvalds
8804d970fa Merge tag 'mm-stable-2025-10-01-19-00' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:

 - "mm, swap: improve cluster scan strategy" from Kairui Song improves
   performance and reduces the failure rate of swap cluster allocation

 - "support large align and nid in Rust allocators" from Vitaly Wool
   permits Rust allocators to set NUMA node and large alignment when
   perforning slub and vmalloc reallocs

 - "mm/damon/vaddr: support stat-purpose DAMOS" from Yueyang Pan extend
   DAMOS_STAT's handling of the DAMON operations sets for virtual
   address spaces for ops-level DAMOS filters

 - "execute PROCMAP_QUERY ioctl under per-vma lock" from Suren
   Baghdasaryan reduces mmap_lock contention during reads of
   /proc/pid/maps

 - "mm/mincore: minor clean up for swap cache checking" from Kairui Song
   performs some cleanup in the swap code

 - "mm: vm_normal_page*() improvements" from David Hildenbrand provides
   code cleanup in the pagemap code

 - "add persistent huge zero folio support" from Pankaj Raghav provides
   a block layer speedup by optionalls making the
   huge_zero_pagepersistent, instead of releasing it when its refcount
   falls to zero

 - "kho: fixes and cleanups" from Mike Rapoport adds a few touchups to
   the recently added Kexec Handover feature

 - "mm: make mm->flags a bitmap and 64-bit on all arches" from Lorenzo
   Stoakes turns mm_struct.flags into a bitmap. To end the constant
   struggle with space shortage on 32-bit conflicting with 64-bit's
   needs

 - "mm/swapfile.c and swap.h cleanup" from Chris Li cleans up some swap
   code

 - "selftests/mm: Fix false positives and skip unsupported tests" from
   Donet Tom fixes a few things in our selftests code

 - "prctl: extend PR_SET_THP_DISABLE to only provide THPs when advised"
   from David Hildenbrand "allows individual processes to opt-out of
   THP=always into THP=madvise, without affecting other workloads on the
   system".

   It's a long story - the [1/N] changelog spells out the considerations

 - "Add and use memdesc_flags_t" from Matthew Wilcox gets us started on
   the memdesc project. Please see

      https://kernelnewbies.org/MatthewWilcox/Memdescs and
      https://blogs.oracle.com/linux/post/introducing-memdesc

 - "Tiny optimization for large read operations" from Chi Zhiling
   improves the efficiency of the pagecache read path

 - "Better split_huge_page_test result check" from Zi Yan improves our
   folio splitting selftest code

 - "test that rmap behaves as expected" from Wei Yang adds some rmap
   selftests

 - "remove write_cache_pages()" from Christoph Hellwig removes that
   function and converts its two remaining callers

 - "selftests/mm: uffd-stress fixes" from Dev Jain fixes some UFFD
   selftests issues

 - "introduce kernel file mapped folios" from Boris Burkov introduces
   the concept of "kernel file pages". Using these permits btrfs to
   account its metadata pages to the root cgroup, rather than to the
   cgroups of random inappropriate tasks

 - "mm/pageblock: improve readability of some pageblock handling" from
   Wei Yang provides some readability improvements to the page allocator
   code

 - "mm/damon: support ARM32 with LPAE" from SeongJae Park teaches DAMON
   to understand arm32 highmem

 - "tools: testing: Use existing atomic.h for vma/maple tests" from
   Brendan Jackman performs some code cleanups and deduplication under
   tools/testing/

 - "maple_tree: Fix testing for 32bit compiles" from Liam Howlett fixes
   a couple of 32-bit issues in tools/testing/radix-tree.c

 - "kasan: unify kasan_enabled() and remove arch-specific
   implementations" from Sabyrzhan Tasbolatov moves KASAN arch-specific
   initialization code into a common arch-neutral implementation

 - "mm: remove zpool" from Johannes Weiner removes zspool - an
   indirection layer which now only redirects to a single thing
   (zsmalloc)

 - "mm: task_stack: Stack handling cleanups" from Pasha Tatashin makes a
   couple of cleanups in the fork code

 - "mm: remove nth_page()" from David Hildenbrand makes rather a lot of
   adjustments at various nth_page() callsites, eventually permitting
   the removal of that undesirable helper function

 - "introduce kasan.write_only option in hw-tags" from Yeoreum Yun
   creates a KASAN read-only mode for ARM, using that architecture's
   memory tagging feature. It is felt that a read-only mode KASAN is
   suitable for use in production systems rather than debug-only

 - "mm: hugetlb: cleanup hugetlb folio allocation" from Kefeng Wang does
   some tidying in the hugetlb folio allocation code

 - "mm: establish const-correctness for pointer parameters" from Max
   Kellermann makes quite a number of the MM API functions more accurate
   about the constness of their arguments. This was getting in the way
   of subsystems (in this case CEPH) when they attempt to improving
   their own const/non-const accuracy

 - "Cleanup free_pages() misuse" from Vishal Moola fixes a number of
   code sites which were confused over when to use free_pages() vs
   __free_pages()

 - "Add Rust abstraction for Maple Trees" from Alice Ryhl makes the
   mapletree code accessible to Rust. Required by nouveau and by its
   forthcoming successor: the new Rust Nova driver

 - "selftests/mm: split_huge_page_test: split_pte_mapped_thp
   improvements" from David Hildenbrand adds a fix and some cleanups to
   the thp selftesting code

 - "mm, swap: introduce swap table as swap cache (phase I)" from Chris
   Li and Kairui Song is the first step along the path to implementing
   "swap tables" - a new approach to swap allocation and state tracking
   which is expected to yield speed and space improvements. This
   patchset itself yields a 5-20% performance benefit in some situations

 - "Some ptdesc cleanups" from Matthew Wilcox utilizes the new memdesc
   layer to clean up the ptdesc code a little

 - "Fix va_high_addr_switch.sh test failure" from Chunyu Hu fixes some
   issues in our 5-level pagetable selftesting code

 - "Minor fixes for memory allocation profiling" from Suren Baghdasaryan
   addresses a couple of minor issues in relatively new memory
   allocation profiling feature

 - "Small cleanups" from Matthew Wilcox has a few cleanups in
   preparation for more memdesc work

 - "mm/damon: add addr_unit for DAMON_LRU_SORT and DAMON_RECLAIM" from
   Quanmin Yan makes some changes to DAMON in furtherance of supporting
   arm highmem

 - "selftests/mm: Add -Wunreachable-code and fix warnings" from Muhammad
   Anjum adds that compiler check to selftests code and fixes the
   fallout, by removing dead code

 - "Improvements to Victim Process Thawing and OOM Reaper Traversal
   Order" from zhongjinji makes a number of improvements in the OOM
   killer: mainly thawing a more appropriate group of victim threads so
   they can release resources

 - "mm/damon: misc fixups and improvements for 6.18" from SeongJae Park
   is a bunch of small and unrelated fixups for DAMON

 - "mm/damon: define and use DAMON initialization check function" from
   SeongJae Park implement reliability and maintainability improvements
   to a recently-added bug fix

 - "mm/damon/stat: expose auto-tuned intervals and non-idle ages" from
   SeongJae Park provides additional transparency to userspace clients
   of the DAMON_STAT information

 - "Expand scope of khugepaged anonymous collapse" from Dev Jain removes
   some constraints on khubepaged's collapsing of anon VMAs. It also
   increases the success rate of MADV_COLLAPSE against an anon vma

 - "mm: do not assume file == vma->vm_file in compat_vma_mmap_prepare()"
   from Lorenzo Stoakes moves us further towards removal of
   file_operations.mmap(). This patchset concentrates upon clearing up
   the treatment of stacked filesystems

 - "mm: Improve mlock tracking for large folios" from Kiryl Shutsemau
   provides some fixes and improvements to mlock's tracking of large
   folios. /proc/meminfo's "Mlocked" field became more accurate

 - "mm/ksm: Fix incorrect accounting of KSM counters during fork" from
   Donet Tom fixes several user-visible KSM stats inaccuracies across
   forks and adds selftest code to verify these counters

 - "mm_slot: fix the usage of mm_slot_entry" from Wei Yang addresses
   some potential but presently benign issues in KSM's mm_slot handling

* tag 'mm-stable-2025-10-01-19-00' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (372 commits)
  mm: swap: check for stable address space before operating on the VMA
  mm: convert folio_page() back to a macro
  mm/khugepaged: use start_addr/addr for improved readability
  hugetlbfs: skip VMAs without shareable locks in hugetlb_vmdelete_list
  alloc_tag: fix boot failure due to NULL pointer dereference
  mm: silence data-race in update_hiwater_rss
  mm/memory-failure: don't select MEMORY_ISOLATION
  mm/khugepaged: remove definition of struct khugepaged_mm_slot
  mm/ksm: get mm_slot by mm_slot_entry() when slot is !NULL
  hugetlb: increase number of reserving hugepages via cmdline
  selftests/mm: add fork inheritance test for ksm_merging_pages counter
  mm/ksm: fix incorrect KSM counter handling in mm_struct during fork
  drivers/base/node: fix double free in register_one_node()
  mm: remove PMD alignment constraint in execmem_vmalloc()
  mm/memory_hotplug: fix typo 'esecially' -> 'especially'
  mm/rmap: improve mlock tracking for large folios
  mm/filemap: map entire large folio faultaround
  mm/fault: try to map the entire file folio in finish_fault()
  mm/rmap: mlock large folios in try_to_unmap_one()
  mm/rmap: fix a mlock race condition in folio_referenced_one()
  ...
2025-10-02 18:18:33 -07:00
Linus Torvalds
e1b1d03cee Merge tag 'for-6.18/block-20250929' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull block updates from Jens Axboe:

 - NVMe pull request via Keith:
     - FC target fixes (Daniel)
     - Authentication fixes and updates (Martin, Chris)
     - Admin controller handling (Kamaljit)
     - Target lockdep assertions (Max)
     - Keep-alive updates for discovery (Alastair)
     - Suspend quirk (Georg)

 - MD pull request via Yu:
     - Add support for a lockless bitmap.

       A key feature for the new bitmap are that the IO fastpath is
       lockless. If a user issues lots of write IO to the same bitmap
       bit in a short time, only the first write has additional overhead
       to update bitmap bit, no additional overhead for the following
       writes.

       By supporting only resync or recover written data, means in the
       case creating new array or replacing with a new disk, there is no
       need to do a full disk resync/recovery.

 - Switch ->getgeo() and ->bios_param() to using struct gendisk rather
   than struct block_device.

 - Rust block changes via Andreas. This series adds configuration via
   configfs and remote completion to the rnull driver. The series also
   includes a set of changes to the rust block device driver API: a few
   cleanup patches, and a few features supporting the rnull changes.

   The series removes the raw buffer formatting logic from
   `kernel::block` and improves the logic available in `kernel::string`
   to support the same use as the removed logic.

 - floppy arch cleanups

 - Reduce the number of dereferencing needed for ublk commands

 - Restrict supported sockets for nbd. Mostly done to eliminate a class
   of issues perpetually reported by syzbot, by using nonsensical socket
   setups.

 - A few s390 dasd block fixes

 - Fix a few issues around atomic writes

 - Improve DMA interation for integrity requests

 - Improve how iovecs are treated with regards to O_DIRECT aligment
   constraints.

   We used to require each segment to adhere to the constraints, now
   only the request as a whole needs to.

 - Clean up and improve p2p support, enabling use of p2p for metadata
   payloads

 - Improve locking of request lookup, using SRCU where appropriate

 - Use page references properly for brd, avoiding very long RCU sections

 - Fix ordering of recursively submitted IOs

 - Clean up and improve updating nr_requests for a live device

 - Various fixes and cleanups

* tag 'for-6.18/block-20250929' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (164 commits)
  s390/dasd: enforce dma_alignment to ensure proper buffer validation
  s390/dasd: Return BLK_STS_INVAL for EINVAL from do_dasd_request
  ublk: remove redundant zone op check in ublk_setup_iod()
  nvme: Use non zero KATO for persistent discovery connections
  nvmet: add safety check for subsys lock
  nvme-core: use nvme_is_io_ctrl() for I/O controller check
  nvme-core: do ioccsz/iorcsz validation only for I/O controllers
  nvme-core: add method to check for an I/O controller
  blk-cgroup: fix possible deadlock while configuring policy
  blk-mq: fix null-ptr-deref in blk_mq_free_tags() from error path
  blk-mq: Fix more tag iteration function documentation
  selftests: ublk: fix behavior when fio is not installed
  ublk: don't access ublk_queue in ublk_unmap_io()
  ublk: pass ublk_io to __ublk_complete_rq()
  ublk: don't access ublk_queue in ublk_need_complete_req()
  ublk: don't access ublk_queue in ublk_check_commit_and_fetch()
  ublk: don't pass ublk_queue to ublk_fetch()
  ublk: don't access ublk_queue in ublk_config_io_buf()
  ublk: don't access ublk_queue in ublk_check_fetch_buf()
  ublk: pass q_id and tag to __ublk_check_and_get_req()
  ...
2025-10-02 10:16:56 -07:00
Linus Torvalds
5832d26433 Merge tag 'for-6.18/io_uring-20250929' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull io_uring updates from Jens Axboe:

 - Store ring provided buffers locally for the users, rather than stuff
   them into struct io_kiocb.

   These types of buffers must always be fully consumed or recycled in
   the current context, and leaving them in struct io_kiocb is hence not
   a good ideas as that struct has a vastly different life time.

   Basically just an architecture cleanup that can help prevent issues
   with ring provided buffers in the future.

 - Support for mixed CQE sizes in the same ring.

   Before this change, a CQ ring either used the default 16b CQEs, or it
   was setup with 32b CQE using IORING_SETUP_CQE32. For use cases where
   a few 32b CQEs were needed, this caused everything else to use big
   CQEs. This is wasteful both in terms of memory usage, but also memory
   bandwidth for the posted CQEs.

   With IORING_SETUP_CQE_MIXED, applications may use request types that
   post both normal 16b and big 32b CQEs on the same ring.

 - Add helpers for async data management, to make it harder for opcode
   handlers to mess it up.

 - Add support for multishot for uring_cmd, which ublk can use. This
   helps improve efficiency, by providing a persistent request type that
   can trigger multiple CQEs.

 - Add initial support for ring feature querying.

   We had basic support for probe operations, but the API isn't great.
   Rather than expand that, add support for QUERY which is easily
   expandable and can cover a lot more cases than the existing probe
   support. This will help applications get a better idea of what
   operations are supported on a given host.

 - zcrx improvements from Pavel:
        - Improve refill entry alignment for better caching
        - Various cleanups, especially around deduplicating normal
          memory vs dmabuf setup.
        - Generalisation of the niov size (Patch 12). It's still hard
          coded to PAGE_SIZE on init, but will let the user to specify
          the rx buffer length on setup.
        - Syscall / synchronous bufer return. It'll be used as a slow
          fallback path for returning buffers when the refill queue is
          full. Useful for tolerating slight queue size misconfiguration
          or with inconsistent load.
        - Accounting more memory to cgroups.
        - Additional independent cleanups that will also be useful for
          mutli-area support.

 - Various fixes and cleanups

* tag 'for-6.18/io_uring-20250929' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (68 commits)
  io_uring/cmd: drop unused res2 param from io_uring_cmd_done()
  io_uring: fix nvme's 32b cqes on mixed cq
  io_uring/query: cap number of queries
  io_uring/query: prevent infinite loops
  io_uring/zcrx: account niov arrays to cgroup
  io_uring/zcrx: allow synchronous buffer return
  io_uring/zcrx: introduce io_parse_rqe()
  io_uring/zcrx: don't adjust free cache space
  io_uring/zcrx: use guards for the refill lock
  io_uring/zcrx: reduce netmem scope in refill
  io_uring/zcrx: protect netdev with pp_lock
  io_uring/zcrx: rename dma lock
  io_uring/zcrx: make niov size variable
  io_uring/zcrx: set sgt for umem area
  io_uring/zcrx: remove dmabuf_offset
  io_uring/zcrx: deduplicate area mapping
  io_uring/zcrx: pass ifq to io_zcrx_alloc_fallback()
  io_uring/zcrx: check all niovs filled with dma addresses
  io_uring/zcrx: move area reg checks into io_import_area
  io_uring/zcrx: don't pass slot to io_zcrx_create_area
  ...
2025-10-02 09:56:23 -07:00
Rajasi Mandal
37e263e68c cifs: client: force multichannel=off when max_channels=1
Previously, specifying both multichannel and max_channels=1 as mount
options would leave multichannel enabled, even though it is not
meaningful when only one channel is allowed. This led to confusion and
inconsistent behavior, as the client would advertise multichannel
capability but never establish secondary channels.

Fix this by forcing multichannel to false whenever max_channels=1,
ensuring the mount configuration is consistent and matches user intent.
This prevents the client from advertising or attempting multichannel
support when it is not possible.

Signed-off-by: Rajasi Mandal <rajasimandal@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2025-10-01 22:42:15 -05:00
Bharath SM
aa12118dbc smb client: fix bug with newly created file in cached dir
Test generic/637 spotted a problem with create of a new file in a
cached directory (by the same client) could cause cases where the
new file does not show up properly in ls on that client until the
lease times out.

Fixes: 037e1bae58 ("smb: client: use ParentLeaseKey in cifs_do_create")
Cc: stable@vger.kernel.org
Signed-off-by: Bharath SM <bharathsm@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2025-10-01 22:42:15 -05:00
Henrique Carvalho
316025335a smb: client: short-circuit negative lookups when parent dir is fully cached
When the parent directory has a valid and complete cached enumeration we
can assume that negative dentries are not present in the directory, thus
we can return without issuing a request.

This reduces traffic for common ENOENT when the directory entries are
cached.

Signed-off-by: Henrique Carvalho <henrique.carvalho@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2025-10-01 22:42:15 -05:00
Henrique Carvalho
55580ad027 smb: client: short-circuit in open_cached_dir_by_dentry() if !dentry
When dentry is NULL, the current code acquires the spinlock and traverses
the entire list, but the condition (dentry && cfid->dentry == dentry)
ensures no match will ever be found.

Return -ENOENT early in this case, avoiding unnecessary lock acquisition
and list traversal.

Signed-off-by: Henrique Carvalho <henrique.carvalho@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2025-10-01 22:42:15 -05:00
Henrique Carvalho
2f6a4af028 smb: client: remove pointless cfid->has_lease check
open_cached_dir() will only return a valid cfid, which has both
has_lease = true and time != 0.

Remove the pointless check of cfid->has_lease right after
open_cached_dir() returns no error.

Signed-off-by: Henrique Carvalho <henrique.carvalho@suse.com>
Reviewed-by: Enzo Matsumiya <ematsumiya@suse.de>
Signed-off-by: Steve French <stfrench@microsoft.com>
2025-10-01 22:42:04 -05:00
Fiona Ebner
6c7fd18423 smb: client: transport: minor indentation style fix
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2025-10-01 22:25:25 -05:00
Fiona Ebner
00be6f26a2 smb: client: transport: avoid reconnects triggered by pending task work
When io_uring is used in the same task as CIFS, there might be
unnecessary reconnects, causing issues in user-space applications
like QEMU with a log like:

> CIFS: VFS: \\10.10.100.81 Error -512 sending data on socket to server

Certain io_uring completions might be added to task_work with
notify_method being TWA_SIGNAL and thus TIF_NOTIFY_SIGNAL is set for
the task.

In __smb_send_rqst(), signals are masked before calling
smb_send_kvec(), but the masking does not apply to TIF_NOTIFY_SIGNAL.

If sk_stream_wait_memory() is reached via sock_sendmsg() while
TIF_NOTIFY_SIGNAL is set, signal_pending(current) will evaluate to
true there, and -EINTR will be propagated all the way from
sk_stream_wait_memory() to sock_sendmsg() in smb_send_kvec().
Afterwards, __smb_send_rqst() will see that not everything was written
and reconnect.

Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2025-10-01 22:25:21 -05:00
Henrique Carvalho
17ef15fa80 smb: client: remove unused fid_lock
The fid_lock in struct cached_fid does not currently provide any real
synchronization. Previously, it had the intention to prevent a double
release of the dentry, but every change to cfid->dentry is already
protected either by cfid_list_lock (while the entry is in the list) or
happens after the cfid has been removed (so no other thread should find
it).

Since there is no scenario in which fid_lock prevents any race, it is
vestigial and can be removed along with its associated
spin_lock()/spin_unlock() calls.

Signed-off-by: Henrique Carvalho <henrique.carvalho@suse.com>
Reviewed-by: Enzo Matsumiya <ematsumiya@suse.de>
Signed-off-by: Steve French <stfrench@microsoft.com>
2025-10-01 22:05:19 -05:00
Henrique Carvalho
5676398315 smb: client: update cfid->last_access_time in open_cached_dir_by_dentry()
open_cached_dir_by_dentry() was missing an update of
cfid->last_access_time to jiffies, similar to what open_cached_dir()
has.

Add it to the function.

Signed-off-by: Henrique Carvalho <henrique.carvalho@suse.com>
Reviewed-by: Enzo Matsumiya <ematsumiya@suse.de>
Signed-off-by: Steve French <stfrench@microsoft.com>
2025-10-01 22:01:03 -05:00
Steve French
a365f2c049 smb: client: ensure open_cached_dir_by_dentry() only returns valid cfid
open_cached_dir_by_dentry() was exposing an invalid cached directory to
callers. The validity check outside the function was exclusively based
on cfid->time.

Add validity check before returning success and introduce
is_valid_cached_dir() helper for consistent checks across the code.

Signed-off-by: Henrique Carvalho <henrique.carvalho@suse.com>
Reviwed-by: Enzo Matsumiya <ematsumiya@suse.de>
Signed-off-by: Steve French <stfrench@microsoft.com>
2025-10-01 21:49:59 -05:00
Bharath SM
63eb8bd6c8 smb: client: account smb directory cache usage and per-tcon totals
Add lightweight accounting for directory lease cache usage
to aid debugging and limiting cache size in future. Track
per-directory entry/byte counts and maintain per-tcon
aggregates. Also expose the totals in /proc/fs/cifs/open_dirs.

Signed-off-by: Bharath SM <bharathsm@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2025-10-01 21:49:53 -05:00
Bharath SM
dde6667fa3 smb: client: add drop_dir_cache module parameter to invalidate cached dirents
Add write-only /sys/module/cifs/parameters/drop_dir_cache. Writing a
non-zero value iterates all tcons and calls invalidate_all_cached_dirs()
to drop cached directory entries. This is useful to force a dirent cache
drop across mounts for debugging and testing purpose.

Signed-off-by: Bharath SM <bharathsm@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2025-10-01 20:36:23 -05:00
NeilBrown
73cc6ec1a8 nfsd: discard nfserr_dropit
nfserr_dropit hasn't been used for over a decade, since rq_dropme and
the RQ_DROPME were introduced.

Time to get rid of it completely.

Signed-off-by: NeilBrown <neil@brown.name>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-10-01 15:54:01 -04:00
Mike Snitzer
6304affe45 NFSD: Add io_cache_{read,write} controls to debugfs
Add 'io_cache_read' to NFSD's debugfs interface so that any data
read by NFSD will either be:
- cached using page cache (NFSD_IO_BUFFERED=0)
- cached but removed from the page cache upon completion
  (NFSD_IO_DONTCACHE=1).

io_cache_read may be set by writing to:
  /sys/kernel/debug/nfsd/io_cache_read

Add 'io_cache_write' to NFSD's debugfs interface so that any data
written by NFSD will either be:
- cached using page cache (NFSD_IO_BUFFERED=0)
- cached but removed from the page cache upon completion
  (NFSD_IO_DONTCACHE=1).

io_cache_write may be set by writing to:
  /sys/kernel/debug/nfsd/io_cache_write

The default value for both settings is NFSD_IO_BUFFERED, which is
NFSD's existing behavior for both read and write. Changes to these
settings take immediate effect for all exports and NFS versions.

Currently only xfs and ext4 implement RWF_DONTCACHE. For file
systems that do not implement RWF_DONTCACHE, NFSD use only buffered
I/O when the io_cache setting is NFSD_IO_DONTCACHE.

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-10-01 15:54:01 -04:00
Chuck Lever
d6e80d48f9 NFSD: Do the grace period check in ->proc_layoutget
RFC 8881 Section 18.43.3 states:
> If the metadata server is in a grace period, and does not persist
> layouts and device ID to device address mappings, then it MUST
> return NFS4ERR_GRACE (see Section 8.4.2.1).

Jeff observed that this suggests the grace period check is better
done by the individual layout type implementations, because checking
for the server grace period is unnecessary for some layout types.

Suggested-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/linux-nfs/7h5p5ktyptyt37u6jhpbjfd5u6tg44lriqkdc7iz7czeeabrvo@ijgxz27dw4sg/T/#t
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-10-01 15:54:01 -04:00
Dan Carpenter
eafdd7e949 nfsd: delete unnecessary NULL check in __fh_verify()
In commit 4a0de50a44bb ("nfsd: decouple the xprtsec policy check from
check_nfsd_access()") we added a NULL check on "rqstp" to earlier in
the function.  This check is no longer required so delete it.

Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: Scott Mayhew <smayhew@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-10-01 15:54:01 -04:00
Sergey Bashirov
e0963ce53b NFSD: Allow layoutcommit during grace period
If the loca_reclaim field is set to TRUE, this indicates that the client
is attempting to commit changes to a layout after the restart of the
metadata server during the metadata server's recovery grace period. This
type of request may be necessary when the client has uncommitted writes
to provisionally allocated byte-ranges of a file that were sent to the
storage devices before the restart of the metadata server. See RFC 8881,
section 18.42.3.

Without this, the client is not able to increase the file size and commit
preallocated extents when the block/scsi layout server is restarted
during a write and is in a grace period. And when the grace period ends,
the client also cannot perform layoutcommit because the old layout state
becomes invalid, resulting in file corruption.

Co-developed-by: Konstantin Evtushenko <koevtushenko@yandex.com>
Signed-off-by: Konstantin Evtushenko <koevtushenko@yandex.com>
Signed-off-by: Sergey Bashirov <sergeybashirov@gmail.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-10-01 15:54:01 -04:00
Bharath SM
ac3ad9845b smb: client: show lease state as R/H/W (or NONE) in open_files
Print the lease/oplock caching state for each open file as a
compact string of letters: R (read), H (handle), W (write).

Signed-off-by: Bharath SM <bharathsm@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2025-10-01 12:41:46 -05:00
Linus Torvalds
eb3289fc47 Merge tag 'driver-core-6.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core
Pull driver core updates from Danilo Krummrich:
 "Auxiliary:
   - Drop call to dev_pm_domain_detach() in auxiliary_bus_probe()
   - Optimize logic of auxiliary_match_id()

  Rust:
   - Auxiliary:
      - Use primitive C types from prelude

   - DebugFs:
      - Add debugfs support for simple read/write files and custom
        callbacks through a File-type-based and directory-scope-based
        API
      - Sample driver code for the File-type-based API
      - Sample module code for the directory-scope-based API

   - I/O:
      - Add io::poll module and implement Rust specific
        read_poll_timeout() helper

   - IRQ:
      - Implement support for threaded and non-threaded device IRQs
        based on (&Device<Bound>, IRQ number) tuples (IrqRequest)
      - Provide &Device<Bound> cookie in IRQ handlers

   - PCI:
      - Support IRQ requests from IRQ vectors for a specific
        pci::Device<Bound>
      - Implement accessors for subsystem IDs, revision, devid and
        resource start
      - Provide dedicated pci::Vendor and pci::Class types for vendor
        and class ID numbers
      - Implement Display to print actual vendor and class names; Debug
        to print the raw ID numbers
      - Add pci::DeviceId::from_class_and_vendor() helper
      - Use primitive C types from prelude
      - Various minor inline and (safety) comment improvements

   - Platform:
      - Support IRQ requests from IRQ vectors for a specific
        platform::Device<Bound>

   - Nova:
      - Use pci::DeviceId::from_class_and_vendor() to avoid probing
        non-display/compute PCI functions

   - Misc:
      - Add helper for cpu_relax()
      - Update ARef import from sync::aref

  sysfs:
   - Remove bin_attrs_new field from struct attribute_group
   - Remove read_new() and write_new() from struct bin_attribute

  Misc:
   - Document potential race condition in get_dev_from_fwnode()
   - Constify node_group argument in software node registration
     functions
   - Fix order of kernel-doc parameters in various functions
   - Set power.no_pm flag for faux devices
   - Set power.no_callbacks flag along with the power.no_pm flag
   - Constify the pmu_bus bus type
   - Minor spelling fixes"

* tag 'driver-core-6.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core: (43 commits)
  rust: pci: display symbolic PCI vendor names
  rust: pci: display symbolic PCI class names
  rust: pci: fix incorrect platform reference in PCI driver probe doc comment
  rust: pci: fix incorrect platform reference in PCI driver unbind doc comment
  perf: make pmu_bus const
  samples: rust: Add scoped debugfs sample driver
  rust: debugfs: Add support for scoped directories
  samples: rust: Add debugfs sample driver
  rust: debugfs: Add support for callback-based files
  rust: debugfs: Add support for writable files
  rust: debugfs: Add support for read-only files
  rust: debugfs: Add initial support for directories
  driver core: auxiliary bus: Optimize logic of auxiliary_match_id()
  driver core: auxiliary bus: Drop dev_pm_domain_detach() call
  driver core: Fix order of the kernel-doc parameters
  driver core: get_dev_from_fwnode(): document potential race
  drivers: base: fix "publically"->"publicly"
  driver core/PM: Set power.no_callbacks along with power.no_pm
  driver core: faux: Set power.no_pm for faux devices
  rust: pci: inline several tiny functions
  ...
2025-10-01 08:39:23 -07:00
Nathan Chancellor
4335c4496b btrfs: fix PAGE_SIZE format specifier in open_ctree()
There is an instance of -Wformat when targeting 32-bit architectures due
to using a 'size_t' specifier (which is 'unsigned int' for 32-bit
platforms) to print PAGE_SIZE:

  In file included from fs/btrfs/compression.h:17,
                   from fs/btrfs/extent_io.h:15,
                   from fs/btrfs/locking.h:13,
                   from fs/btrfs/ctree.h:19,
                   from fs/btrfs/disk-io.c:22:
  fs/btrfs/disk-io.c: In function 'open_ctree':
  include/linux/kern_levels.h:5:25: error: format '%zu' expects argument of type 'size_t', but argument 4 has type 'long unsigned int' [-Werror=format=]
  ...
  fs/btrfs/disk-io.c:3398:17: note: in expansion of macro 'btrfs_warn'
   3398 |                 btrfs_warn(fs_info,
        |                 ^~~~~~~~~~

PAGE_SIZE is consistently defined as an 'unsigned long' in
include/vsdo/page.h so use '%lu' to clear up the warning.

Fixes: 98077f7f21 ("btrfs: enable experimental bs > ps support")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-10-01 16:27:28 +02:00
Namjae Jeon
e28c5bc456 ksmbd: increase session and share hash table bits
Increases the number of bits for the hash table from 3 to 12.
The thousands of sessions and shares can be connected.
So the current 3-bit size can lead to frequent hash collisions.

Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2025-09-30 21:37:55 -05:00
Namjae Jeon
0bcc831be5 ksmbd: replace connection list with hash table
Replace connection list with hash table to improve lookup performance.

Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2025-09-30 21:37:55 -05:00
Namjae Jeon
6747885781 ksmbd: add an error print when maximum IP connections limit is reached
This change introduces an error print using pr_info_ratelimited()
to prevent excessive logging. This message will inform the user that
the limit for maximum IP connections has been hit and what that
current count is, which can be useful for debugging and monitoring
connection limits.

Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2025-09-30 21:37:54 -05:00
Namjae Jeon
d8b6dc9256 ksmbd: add max ip connections parameter
This parameter set the maximum number of connections per ip address.
The default is 8.

Cc: stable@vger.kernel.org
Fixes: c0d41112f1 ("ksmbd: extend the connection limiting mechanism to support IPv6")
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2025-09-30 21:37:54 -05:00
Matvey Kovalev
88daf2f448 ksmbd: fix error code overwriting in smb2_get_info_filesystem()
If client doesn't negotiate with SMB3.1.1 POSIX Extensions,
then proper error code won't be returned due to overwriting.

Return error immediately.

Found by Linux Verification Center (linuxtesting.org) with SVACE.

Fixes: e2f34481b2 ("cifsd: add server-side procedures for SMB3")
Cc: stable@vger.kernel.org
Signed-off-by: Matvey Kovalev <matvey.kovalev@ispras.ru>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2025-09-30 21:37:54 -05:00
Namjae Jeon
c20988c217 ksmbd: copy overlapped range within the same file
cifs.ko request to copy overlapped range within the same file.
ksmbd is using vfs_copy_file_range for this, vfs_copy_file_range() does not
allow overlapped copying within the same file.
This patch use do_splice_direct() if offset and length are overlapped.

Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2025-09-30 21:37:54 -05:00
Namjae Jeon
3677ca67b9 ksmbd: use sock_create_kern interface to create kernel socket
we should use sock_create_kern() if the socket resides in kernel space.

Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2025-09-30 21:37:54 -05:00
Namjae Jeon
5da92a251e ksmbd: make ksmbd thread names distinct by client IP
This patch makes ksmbd thread names distinct by client IP address.

100943 ?        S      0:00 [ksmbd:::ffff:10.177.110.57]
 or
101752 ?        S      0:00 [ksmbd:10.177.110.57]

Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2025-09-30 21:37:54 -05:00
Yunseong Kim
305853cce3 ksmbd: Fix race condition in RPC handle list access
The 'sess->rpc_handle_list' XArray manages RPC handles within a ksmbd
session. Access to this list is intended to be protected by
'sess->rpc_lock' (an rw_semaphore). However, the locking implementation was
flawed, leading to potential race conditions.

In ksmbd_session_rpc_open(), the code incorrectly acquired only a read lock
before calling xa_store() and xa_erase(). Since these operations modify
the XArray structure, a write lock is required to ensure exclusive access
and prevent data corruption from concurrent modifications.

Furthermore, ksmbd_session_rpc_method() accessed the list using xa_load()
without holding any lock at all. This could lead to reading inconsistent
data or a potential use-after-free if an entry is concurrently removed and
the pointer is dereferenced.

Fix these issues by:
1. Using down_write() and up_write() in ksmbd_session_rpc_open()
   to ensure exclusive access during XArray modification, and ensuring
   the lock is correctly released on error paths.
2. Adding down_read() and up_read() in ksmbd_session_rpc_method()
   to safely protect the lookup.

Fixes: a1f46c99d9 ("ksmbd: fix use-after-free in ksmbd_session_rpc_open")
Fixes: b685757c7b ("ksmbd: Implements sess->rpc_handle_list as xarray")
Cc: stable@vger.kernel.org
Signed-off-by: Yunseong Kim <ysk@kzalloc.com>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2025-09-30 21:37:54 -05:00
Linus Torvalds
2cb8eeaf00 Merge tag 'x86_cache_for_v6.18_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 resource control updates from Borislav Petkov:
 "Add support on AMD for assigning QoS bandwidth counters to resources
  (RMIDs) with the ability for those resources to be tracked by the
  counters as long as they're assigned to them.

  Previously, due to hw limitations, bandwidth counts from untracked
  resources would get lost when those resources are not tracked.

  Refactor the code and user interfaces to be able to also support
  other, similar features on ARM, for example"

* tag 'x86_cache_for_v6.18_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (35 commits)
  fs/resctrl: Fix counter auto-assignment on mkdir with mbm_event enabled
  MAINTAINERS: resctrl: Add myself as reviewer
  x86/resctrl: Configure mbm_event mode if supported
  fs/resctrl: Introduce the interface to switch between monitor modes
  fs/resctrl: Disable BMEC event configuration when mbm_event mode is enabled
  fs/resctrl: Introduce the interface to modify assignments in a group
  fs/resctrl: Introduce mbm_L3_assignments to list assignments in a group
  fs/resctrl: Auto assign counters on mkdir and clean up on group removal
  fs/resctrl: Introduce mbm_assign_on_mkdir to enable assignments on mkdir
  fs/resctrl: Provide interface to update the event configurations
  fs/resctrl: Add event configuration directory under info/L3_MON/
  fs/resctrl: Support counter read/reset with mbm_event assignment mode
  x86/resctrl: Implement resctrl_arch_reset_cntr() and resctrl_arch_cntr_read()
  x86/resctrl: Refactor resctrl_arch_rmid_read()
  fs/resctrl: Introduce counter ID read, reset calls in mbm_event mode
  fs/resctrl: Pass struct rdtgroup instead of individual members
  fs/resctrl: Add the functionality to unassign MBM events
  fs/resctrl: Add the functionality to assign MBM events
  x86,fs/resctrl: Implement resctrl_arch_config_cntr() to assign a counter with ABMC
  fs/resctrl: Introduce event configuration field in struct mon_evt
  ...
2025-09-30 13:29:42 -07:00
Mike Snitzer
1f0d4ab0f5 NFS: add basic STATX_DIOALIGN and STATX_DIO_READ_ALIGN support
NFS doesn't have DIO alignment constraints, so have NFS respond with
accommodating DIO alignment attributes (rather than plumb in GETATTR
support for STATX_DIOALIGN and STATX_DIO_READ_ALIGN).

The most coarse-grained dio_offset_align is the most accommodating
(e.g. PAGE_SIZE, in future larger may be supported).

Now that NFS has support, NFS reexport will now handle unaligned DIO
(NFSD's NFSD_IO_DIRECT support requires the underlying filesystem
support STATX_DIOALIGN and/or STATX_DIO_READ_ALIGN).

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2025-09-30 16:10:30 -04:00
Mike Snitzer
cda94457c2 nfs/localio: add tracepoints for misaligned DIO READ and WRITE support
Add nfs_local_dio_class and use it to create nfs_local_dio_read,
nfs_local_dio_write and nfs_local_dio_misaligned trace events.

These trace events show how NFS LOCALIO splits a given misaligned
IO into a mix of misaligned head and/or tail extents and a DIO-aligned
middle extent.  The misaligned head and/or tail extents are issued
using buffered IO and the DIO-aligned middle is issued using O_DIRECT.

This combination of trace events is useful for LOCALIO DIO READs:

  echo 1 > /sys/kernel/tracing/events/nfs/nfs_local_dio_read/enable
  echo 1 > /sys/kernel/tracing/events/nfs/nfs_local_dio_misaligned/enable
  echo 1 > /sys/kernel/tracing/events/nfs/nfs_initiate_read/enable
  echo 1 > /sys/kernel/tracing/events/nfs/nfs_readpage_done/enable
  echo 1 > /sys/kernel/tracing/events/xfs/xfs_file_direct_read/enable

This combination of trace events is useful for LOCALIO DIO WRITEs:

  echo 1 > /sys/kernel/tracing/events/nfs/nfs_local_dio_write/enable
  echo 1 > /sys/kernel/tracing/events/nfs/nfs_local_dio_misaligned/enable
  echo 1 > /sys/kernel/tracing/events/nfs/nfs_initiate_write/enable
  echo 1 > /sys/kernel/tracing/events/nfs/nfs_writeback_done/enable
  echo 1 > /sys/kernel/tracing/events/xfs/xfs_file_direct_write/enable

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2025-09-30 16:10:30 -04:00
Mike Snitzer
c817248fc8 nfs/localio: add proper O_DIRECT support for READ and WRITE
Because the NFS client will already happily handle misaligned O_DIRECT
IO (by sending it out to NFSD via RPC) this commit's new capabilities
are for the benefit of LOCALIO.

LOCALIO will make best effort to transform misaligned IO to
DIO-aligned extents when possible.

LOCALIO's READ and WRITE DIO that is misaligned will be split into as
many as 3 component IOs (@start, @middle and @end) as needed -- IFF
the @middle extent is verified to be DIO-aligned, and then the @start
and/or @end are misaligned (due to each being a partial page).
Otherwise if the @middle isn't DIO-aligned the code will fallback to
issuing only a single contiguous buffered IO.

The @middle is only DIO-aligned if both the memory and on-disk offsets
for the IO are aligned relative to the underlying local filesystem's
block device limits (@dma_alignment and @logical_block_size
respectively).

The misaligned @start and/or @end extents are issued using buffered IO
and the DIO-aligned @middle is issued using O_DIRECT. The @start and
@end IOs are issued first using buffered IO with IOCB_SYNC and then
the @middle is issued last using direct IO with async completion (AIO).
This out of order IO completion means that LOCALIO's IO completion
code (nfs_local_read_done and nfs_local_write_done) is only called for
the IO's last associated iov_iter completion. And in the case of
DIO-aligned @middle it completes last using AIO. nfs_local_pgio_done()
is updated to handle piece-wise partial completion of each iov_iter.

This implementation for LOCALIO's misaligned DIO handling uses 3
iov_iter that share the same backing pages in their bio_vecs (so
unfortunately 'struct nfs_local_kiocb' has 3 instead of only 1).

[Reducing LOCALIO's per-IO (struct nfs_local_kiocb) memory use can be
explored in the future. One logical progression to improve this code,
and eliminate explicit loops over up to 3 iov_iter, is by extending
'struct iov_iter' to support iov_iter_clone() and iov_iter_chain()
interfaces that are comparable to what 'struct bio' is able to support
in the block layer. But even that wouldn't avoid the need to
allocate/use up to 3 iov_iter]

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2025-09-30 16:10:30 -04:00
Mike Snitzer
e43e9a3a3d nfs/localio: refactor iocb initialization
The goal of this commit's various refactoring is to have LOCALIO's per
IO initialization occur in process context so that we don't get into a
situation where IO fails to be issued from workqueue (e.g. due to lack
of memory, etc). Better to have LOCALIO's iocb initialization fail
early.

There isn't immediate need but this commit makes it possible for
LOCALIO to fallback to NFS pagelist code in process context to allow
for immediate retry over RPC.

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2025-09-30 16:10:30 -04:00
Mike Snitzer
091bdcfcec nfs/localio: refactor iocb and iov_iter_bvec initialization
nfs_local_iter_init() is updated to follow the same pattern to
initializing LOCALIO's iov_iter_bvec as was established by
nfsd_iter_read().

Other LOCALIO iocb initialization refactoring in this commit offers
incremental cleanup that will be taken further by the next commit.

No functional change.

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2025-09-30 16:10:30 -04:00
Mike Snitzer
25ba2b84c3 nfs/localio: avoid issuing misaligned IO using O_DIRECT
Add nfsd_file_dio_alignment and use it to avoid issuing misaligned IO
using O_DIRECT. Any misaligned DIO falls back to using buffered IO.

Because misaligned DIO is now handled safely, remove the nfs modparam
'localio_O_DIRECT_semantics' that was added to require users opt-in to
the requirement that all O_DIRECT be properly DIO-aligned.

Also, introduce nfs_iov_iter_aligned_bvec() which is a variant of
iov_iter_aligned_bvec() that also verifies the offset associated with
an iov_iter is DIO-aligned.  NOTE: in a parallel effort,
iov_iter_aligned_bvec() is being removed along with
iov_iter_is_aligned().

Lastly, add pr_info_ratelimited if underlying filesystem returns
-EINVAL because it was made to try O_DIRECT for IO that is not
DIO-aligned (shouldn't happen, so its best to be louder if it does).

Fixes: 3feec68563 ("nfs/localio: add direct IO enablement with sync and async IO support")
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2025-09-30 16:10:29 -04:00
Mike Snitzer
fd6d93c2b7 nfs/localio: make trace_nfs_local_open_fh more useful
Always trigger trace event when LOCALIO opens a file.

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2025-09-30 16:10:29 -04:00
Mike Snitzer
d11f6cd1bb NFSD: filecache: add STATX_DIOALIGN and STATX_DIO_READ_ALIGN support
Use STATX_DIOALIGN and STATX_DIO_READ_ALIGN to get DIO alignment
attributes from the underlying filesystem and store them in the
associated nfsd_file. This is done when the nfsd_file is first
opened for each regular file.

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: NeilBrown <neil@brown.name>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Acked-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2025-09-30 16:10:05 -04:00
Linus Torvalds
f3827213ab Merge tag 'for-6.18-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
 "There are no new features, the changes are in the core code, notably
  tree-log error handling and reporting improvements, and initial
  support for block size > page size.

  Performance improvements:

   - search data checksums in the commit root (previous transaction) to
     avoid locking contention, this improves parallelism of read
     heavy/low write workloads, and also reduces transaction commit
     time; on real and reproducer workload the sync time went from
     minutes to tens of seconds (workload and numbers are in the
     changelog)

  Core:

   - tree-log updates:
      - error handling improvements, transaction aborts
      - add new error state 'O' (printed in status messages) when log
        replay fails and is aborted
      - reduced number of btrfs_path allocations when traversing the
        tree

   - 'block size > page size' support
      - basic implementation with limitations, under experimental build
      - limitations: no direct io, raid56, encoded read (standalone and
        in send ioctl), encoded write
      - preparatory work for compression, removing implicit assumptions
        of page and block sizes
      - compression workspaces are now per-filesystem, we cannot assume
        common block size for work memory among different filesystems

   - tree-checker now verifies INODE_EXTREF item (which is implementing
     hardlinks)

   - tree leaf pretty printer updates, there were missing data from
     items, keys/items

   - move config option CONFIG_BTRFS_REF_VERIFY to CONFIG_BTRFS_DEBUG,
     it's a debugging feature and not needed to be enabled separately

   - more struct btrfs_path auto free updates

   - use ref_tracker API for tracking delayed inodes, enabled by mount
     option 'ref_verify', allowing to better pinpoint leaking references

   - in zoned mode, avoid selecting data relocation zoned for ordinary
     data block groups

   - updated and enhanced error messages

   - lots of cleanups and refactoring"

* tag 'for-6.18-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (113 commits)
  btrfs: use smp_mb__after_atomic() when forcing COW in create_pending_snapshot()
  btrfs: add unlikely annotations to branches leading to transaction abort
  btrfs: add unlikely annotations to branches leading to EIO
  btrfs: add unlikely annotations to branches leading to EUCLEAN
  btrfs: more trivial BTRFS_PATH_AUTO_FREE conversions
  btrfs: zoned: don't fail mount needlessly due to too many active zones
  btrfs: use kmalloc_array() for open-coded arithmetic in kmalloc()
  btrfs: enable experimental bs > ps support
  btrfs: add extra ASSERT()s to catch unaligned bios
  btrfs: fix symbolic link reading when bs > ps
  btrfs: prepare scrub to support bs > ps cases
  btrfs: prepare zlib to support bs > ps cases
  btrfs: prepare lzo to support bs > ps cases
  btrfs: prepare zstd to support bs > ps cases
  btrfs: prepare compression folio alloc/free for bs > ps cases
  btrfs: fix the incorrect max_bytes value for find_lock_delalloc_range()
  btrfs: remove pointless key offset setup in create_pending_snapshot()
  btrfs: annotate btrfs_is_testing() as unlikely and make it return bool
  btrfs: make the rule checking more readable for should_cow_block()
  btrfs: simplify inline extent end calculation at replay_one_extent()
  ...
2025-09-30 08:14:49 -07:00
Thorsten Blum
11f6bce77e fs/orangefs: Replace kzalloc + copy_from_user with memdup_user_nul
Replace kzalloc() followed by copy_from_user() with memdup_user_nul() to
simplify and improve orangefs_debug_write(). Allocate only 'count' bytes
instead of the maximum size ORANGEFS_MAX_DEBUG_STRING_LEN, and set 'buf'
to NULL to ensure kfree(buf) still works.

No functional changes intended.

Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2025-09-30 10:23:24 -04:00
Mike Marshall
025e880759 orangefs: fix xattr related buffer overflow...
Willy Tarreau <w@1wt.eu> forwarded me a message from
Disclosure <disclosure@aisle.com> with the following
warning:

> The helper `xattr_key()` uses the pointer variable in the loop condition
> rather than dereferencing it. As `key` is incremented, it remains non-NULL
> (until it runs into unmapped memory), so the loop does not terminate on
> valid C strings and will walk memory indefinitely, consuming CPU or hanging
> the thread.

I easily reproduced this with setfattr and getfattr, causing a kernel
oops, hung user processes and corrupted orangefs files. Disclosure
sent along a diff (not a patch) with a suggested fix, which I based
this patch on.

After xattr_key started working right, xfstest generic/069 exposed an
xattr related memory leak that lead to OOM. xattr_key returns
a hashed key.  When adding xattrs to the orangefs xattr cache, orangefs
used hash_add, a kernel hashing macro. hash_add also hashes the key using
hash_log which resulted in additions to the xattr cache going to the wrong
hash bucket. generic/069 tortures a single file and orangefs does a
getattr for the xattr "security.capability" every time. Orangefs
negative caches on xattrs which includes a kmalloc. Since adds to the
xattr cache were going to the wrong bucket, every getattr for
"security.capability" resulted in another kmalloc, none of which were
ever freed.

I changed the two uses of hash_add to hlist_add_head instead
and the memory leak ceased and generic/069 quit throwing furniture.

Signed-off-by: Mike Marshall <hubcap@omnibond.com>
Reported-by: Stanislav Fort of Aisle Research <stanislav.fort@aisle.com>
2025-09-30 10:23:20 -04:00
Zhen Ni
3dffadfa99 orangefs: Remove unused type in macro fill_default_sys_attrs
Remove the unused type parameter from the macro definition and all its
callers, making the interface consistent with its actual usage.

Fixes: 69a23de2f3 ("orangefs: clean up fill_default_sys_attrs")
Signed-off-by: Zhen Ni <zhen.ni@easystack.cn>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2025-09-30 10:23:03 -04:00
Ethan Ferguson
d01579d590 exfat: Add support for FS_IOC_{GET,SET}FSLABEL
Add support for reading / writing to the exfat volume label from the
FS_IOC_GETFSLABEL and FS_IOC_SETFSLABEL ioctls

Co-developed-by: Yuezhang Mo <Yuezhang.Mo@sony.com>
Signed-off-by: Yuezhang Mo <Yuezhang.Mo@sony.com>
Signed-off-by: Ethan Ferguson <ethan.ferguson@zetier.com>
Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
2025-09-30 13:49:31 +09:00
Sang-Heon Jeon
29c063658d exfat: combine iocharset and utf8 option setup
Currently, exfat utf8 mount option depends on the iocharset option
value. After exfat remount, utf8 option may become inconsistent with
iocharset option.

If the options are inconsistent; (specifically, iocharset=utf8 but
utf8=0) readdir may reference uninitalized NLS, leading to a null
pointer dereference.

Extract and combine utf8/iocharset setup logic into exfat_set_iocharset().
Then Replace iocharset setup logic to exfat_set_iocharset to prevent
utf8/iocharset option inconsistentcy after remount.

Reported-by: syzbot+3e9cb93e3c5f90d28e19@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=3e9cb93e3c5f90d28e19
Signed-off-by: Sang-Heon Jeon <ekffu200098@gmail.com>
Fixes: acab02ffcd6b ("exfat: support modifying mount options via remount")
Tested-by: syzbot+3e9cb93e3c5f90d28e19@syzkaller.appspotmail.com
Reviewed-by: Yuezhang Mo <Yuezhang.Mo@sony.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
2025-09-30 13:41:22 +09:00
Yuezhang Mo
e6fd5d3a43 exfat: support modifying mount options via remount
Before this commit, all exfat-defined mount options could not be
modified dynamically via remount, and no error was returned.

After this commit, these three exfat-defined mount options
(discard, zero_size_dir, and errors) can be modified dynamically
via remount. While other exfat-defined mount options cannot be
modified dynamically via remount because their old settings are
cached in inodes or dentries, modifying them will be rejected with
an error.

Signed-off-by: Yuezhang Mo <Yuezhang.Mo@sony.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
2025-09-30 13:34:43 +09:00