Commit Graph

5584 Commits

Author SHA1 Message Date
Linus Torvalds
4cff5c05e0 Merge tag 'mm-stable-2026-02-11-19-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:

 - "powerpc/64s: do not re-activate batched TLB flush" makes
   arch_{enter|leave}_lazy_mmu_mode() nest properly (Alexander Gordeev)

   It adds a generic enter/leave layer and switches architectures to use
   it. Various hacks were removed in the process.

 - "zram: introduce compressed data writeback" implements data
   compression for zram writeback (Richard Chang and Sergey Senozhatsky)

 - "mm: folio_zero_user: clear page ranges" adds clearing of contiguous
   page ranges for hugepages. Large improvements during demand faulting
   are demonstrated (David Hildenbrand)

 - "memcg cleanups" tidies up some memcg code (Chen Ridong)

 - "mm/damon: introduce {,max_}nr_snapshots and tracepoint for damos
   stats" improves DAMOS stat's provided information, deterministic
   control, and readability (SeongJae Park)

 - "selftests/mm: hugetlb cgroup charging: robustness fixes" fixes a few
   issues in the hugetlb cgroup charging selftests (Li Wang)

 - "Fix va_high_addr_switch.sh test failure - again" addresses several
   issues in the va_high_addr_switch test (Chunyu Hu)

 - "mm/damon/tests/core-kunit: extend existing test scenarios" improves
   the KUnit test coverage for DAMON (Shu Anzai)

 - "mm/khugepaged: fix dirty page handling for MADV_COLLAPSE" fixes a
   glitch in khugepaged which was causing madvise(MADV_COLLAPSE) to
   transiently return -EAGAIN (Shivank Garg)

 - "arch, mm: consolidate hugetlb early reservation" reworks and
   consolidates a pile of straggly code related to reservation of
   hugetlb memory from bootmem and creation of CMA areas for hugetlb
   (Mike Rapoport)

 - "mm: clean up anon_vma implementation" cleans up the anon_vma
   implementation in various ways (Lorenzo Stoakes)

 - "tweaks for __alloc_pages_slowpath()" does a little streamlining of
   the page allocator's slowpath code (Vlastimil Babka)

 - "memcg: separate private and public ID namespaces" cleans up the
   memcg ID code and prevents the internal-only private IDs from being
   exposed to userspace (Shakeel Butt)

 - "mm: hugetlb: allocate frozen gigantic folio" cleans up the
   allocation of frozen folios and avoids some atomic refcount
   operations (Kefeng Wang)

 - "mm/damon: advance DAMOS-based LRU sorting" improves DAMOS's movement
   of memory betewwn the active and inactive LRUs and adds auto-tuning
   of the ratio-based quotas and of monitoring intervals (SeongJae Park)

 - "Support page table check on PowerPC" makes
   CONFIG_PAGE_TABLE_CHECK_ENFORCED work on powerpc (Andrew Donnellan)

 - "nodemask: align nodes_and{,not} with underlying bitmap ops" makes
   nodes_and() and nodes_andnot() propagate the return values from the
   underlying bit operations, enabling some cleanup in calling code
   (Yury Norov)

 - "mm/damon: hide kdamond and kdamond_lock from API callers" cleans up
   some DAMON internal interfaces (SeongJae Park)

 - "mm/khugepaged: cleanups and scan limit fix" does some cleanup work
   in khupaged and fixes a scan limit accounting issue (Shivank Garg)

 - "mm: balloon infrastructure cleanups" goes to town on the balloon
   infrastructure and its page migration function. Mainly cleanups, also
   some locking simplification (David Hildenbrand)

 - "mm/vmscan: add tracepoint and reason for kswapd_failures reset" adds
   additional tracepoints to the page reclaim code (Jiayuan Chen)

 - "Replace wq users and add WQ_PERCPU to alloc_workqueue() users" is
   part of Marco's kernel-wide migration from the legacy workqueue APIs
   over to the preferred unbound workqueues (Marco Crivellari)

 - "Various mm kselftests improvements/fixes" provides various unrelated
   improvements/fixes for the mm kselftests (Kevin Brodsky)

 - "mm: accelerate gigantic folio allocation" greatly speeds up gigantic
   folio allocation, mainly by avoiding unnecessary work in
   pfn_range_valid_contig() (Kefeng Wang)

 - "selftests/damon: improve leak detection and wss estimation
   reliability" improves the reliability of two of the DAMON selftests
   (SeongJae Park)

 - "mm/damon: cleanup kdamond, damon_call(), damos filter and
   DAMON_MIN_REGION" does some cleanup work in the core DAMON code
   (SeongJae Park)

 - "Docs/mm/damon: update intro, modules, maintainer profile, and misc"
   performs maintenance work on the DAMON documentation (SeongJae Park)

 - "mm: add and use vma_assert_stabilised() helper" refactors and cleans
   up the core VMA code. The main aim here is to be able to use the mmap
   write lock's lockdep state to perform various assertions regarding
   the locking which the VMA code requires (Lorenzo Stoakes)

 - "mm, swap: swap table phase II: unify swapin use" removes some old
   swap code (swap cache bypassing and swap synchronization) which
   wasn't working very well. Various other cleanups and simplifications
   were made. The end result is a 20% speedup in one benchmark (Kairui
   Song)

 - "enable PT_RECLAIM on more 64-bit architectures" makes PT_RECLAIM
   available on 64-bit alpha, loongarch, mips, parisc, and um. Various
   cleanups were performed along the way (Qi Zheng)

* tag 'mm-stable-2026-02-11-19-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (325 commits)
  mm/memory: handle non-split locks correctly in zap_empty_pte_table()
  mm: move pte table reclaim code to memory.c
  mm: make PT_RECLAIM depends on MMU_GATHER_RCU_TABLE_FREE
  mm: convert __HAVE_ARCH_TLB_REMOVE_TABLE to CONFIG_HAVE_ARCH_TLB_REMOVE_TABLE config
  um: mm: enable MMU_GATHER_RCU_TABLE_FREE
  parisc: mm: enable MMU_GATHER_RCU_TABLE_FREE
  mips: mm: enable MMU_GATHER_RCU_TABLE_FREE
  LoongArch: mm: enable MMU_GATHER_RCU_TABLE_FREE
  alpha: mm: enable MMU_GATHER_RCU_TABLE_FREE
  mm: change mm/pt_reclaim.c to use asm/tlb.h instead of asm-generic/tlb.h
  mm/damon/stat: remove __read_mostly from memory_idle_ms_percentiles
  zsmalloc: make common caches global
  mm: add SPDX id lines to some mm source files
  mm/zswap: use %pe to print error pointers
  mm/vmscan: use %pe to print error pointers
  mm/readahead: fix typo in comment
  mm: khugepaged: fix NR_FILE_PAGES and NR_SHMEM in collapse_file()
  mm: refactor vma_map_pages to use vm_insert_pages
  mm/damon: unify address range representation with damon_addr_range
  mm/cma: replace snprintf with strscpy in cma_new_area
  ...
2026-02-12 11:32:37 -08:00
Linus Torvalds
997f9640c9 Merge tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fsverity/linux
Pull fsverity updates from Eric Biggers:
 "fsverity cleanups, speedup, and memory usage optimization from
  Christoph Hellwig:

   - Move some logic into common code

   - Fix btrfs to reject truncates of fsverity files

   - Improve the readahead implementation

   - Store each inode's fsverity_info in a hash table instead of using a
     pointer in the filesystem-specific part of the inode.

     This optimizes for memory usage in the usual case where most files
     don't have fsverity enabled.

   - Look up the fsverity_info fewer times during verification, to
     amortize the hash table overhead"

* tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fsverity/linux:
  fsverity: remove inode from fsverity_verification_ctx
  fsverity: use a hashtable to find the fsverity_info
  btrfs: consolidate fsverity_info lookup
  f2fs: consolidate fsverity_info lookup
  ext4: consolidate fsverity_info lookup
  fs: consolidate fsverity_info lookup in buffer.c
  fsverity: push out fsverity_info lookup
  fsverity: deconstify the inode pointer in struct fsverity_info
  fsverity: kick off hash readahead at data I/O submission time
  ext4: move ->read_folio and ->readahead to readpage.c
  readahead: push invalidate_lock out of page_cache_ra_unbounded
  fsverity: don't issue readahead for non-ENOENT errors from __filemap_get_folio
  fsverity: start consolidating pagecache code
  fsverity: pass struct file to ->write_merkle_tree_block
  f2fs: don't build the fsverity work handler for !CONFIG_FS_VERITY
  ext4: don't build the fsverity work handler for !CONFIG_FS_VERITY
  fs,fsverity: clear out fsverity_info from common code
  fs,fsverity: reject size changes on fsverity files in setattr_prepare
2026-02-12 10:41:34 -08:00
Linus Torvalds
5903c871e2 Merge tag 'ext4_for_linus-7.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
 "New features and improvements for the ext4 file system

   - Avoid unnecessary cache invalidation in the extent status cache
     (es_cache) when adding extents to be cached in the es_cache and we
     are not changing the extent tree

   - Add a sysfs parameter, err_report_sec, to control how frequently to
     log a warning message that file system inconsistency has been
     detected (Previously we logged the warning message every 24 hours)

   - Avoid unnecessary forced ordered writes when appending to a file
     when delayed allocation is enabled

   - Defer splitting unwritten extents to I/O completion to improve
     write performance of concurrent direct I/O writes to multiple files

   - Refactor and add kunit tests to the extent splitting and conversion
     code paths

  Various Bug Fixes:

   - Fix a panic when the debugging DOUBLE_CHECK macro is defined

   - Avoid using fast commit for rare and complex file system operations
     to make fast commit easier to reason about. This can also avoid
     some corner cases that could result in file system inconsistency if
     there is a crash between the fast commit before a subsequent full
     commit

   - Fix memory leaks in error paths

   - Fix a false positive reports caused when running stress tests using
     mixed huge-page workloads caused by a race between page migration
     and bitmap updates

   - Fix a potential recursion into file system reclaim when evicting an
     inode when fast commit is enabled

   - Fix a warning caused by a potential double decrement to the dirty
     clusters counter when executing FS_IOC_SHUTDOWN when running a
     stress test

   - Enable mballoc optimized scanning regardless whether the inode is
     using indirect blocks or extent trees to map blocks"

* tag 'ext4_for_linus-7.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (45 commits)
  et4: allow zeroout when doing written to unwritten split
  ext4: refactor split and convert extents
  ext4: refactor zeroout path and handle all cases
  ext4: propagate flags to ext4_convert_unwritten_extents_endio()
  ext4: propagate flags to convert_initialized_extent()
  ext4: add extent status cache support to kunit tests
  ext4: kunit tests for higher level extent manipulation functions
  ext4: kunit tests for extent splitting and conversion
  ext4: use optimized mballoc scanning regardless of inode format
  ext4: always allocate blocks only from groups inode can use
  ext4: fix dirtyclusters double decrement on fs shutdown
  ext4: fast commit: make s_fc_lock reclaim-safe
  ext4: fix e4b bitmap inconsistency reports
  ext4: remove redundant NULL check after __GFP_NOFAIL
  ext4: remove EXT4_GET_BLOCKS_IO_CREATE_EXT
  ext4: simplify the mapping query logic in ext4_iomap_begin()
  ext4: remove unused unwritten parameter in ext4_dio_write_iter()
  ext4: remove useless ext4_iomap_overwrite_ops
  ext4: avoid starting handle when dio writing an unwritten extent
  ext4: don't split extent before submitting I/O
  ...
2026-02-12 10:19:58 -08:00
Linus Torvalds
148f95f75c Merge tag 'slab-for-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab
Pull slab updates from Vlastimil Babka:

 - The percpu sheaves caching layer was introduced as opt-in in 6.18 and
   now we enable it for all caches and remove the previous cpu (partial)
   slab caching mechanism.

   Besides the lower locking overhead and much more likely fastpath when
   freeing, this removes the rather complicated code related to the cpu
   slab lockless fastpaths (using this_cpu_try_cmpxchg128/64) and all
   its complications for PREEMPT_RT or kmalloc_nolock().

   The lockless slab freelist+counters update operation using
   try_cmpxchg128/64 remains and is crucial for freeing remote NUMA
   objects, and to allow flushing objects from sheaves to slabs mostly
   without the node list_lock (Vlastimil Babka)

 - Eliminate slabobj_ext metadata overhead when possible. Instead of
   using kmalloc() to allocate the array for memcg and/or allocation
   profiling tag pointers, use leftover space in a slab or per-object
   padding due to alignment (Harry Yoo)

 - Various followup improvements to the above (Hao Li)

* tag 'slab-for-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab: (39 commits)
  slub: let need_slab_obj_exts() return false if SLAB_NO_OBJ_EXT is set
  mm/slab: only allow SLAB_OBJ_EXT_IN_OBJ for unmergeable caches
  mm/slab: place slabobj_ext metadata in unused space within s->size
  mm/slab: move [__]ksize and slab_ksize() to mm/slub.c
  mm/slab: save memory by allocating slabobj_ext array from leftover
  mm/memcontrol,alloc_tag: handle slabobj_ext access under KASAN poison
  mm/slab: use stride to access slabobj_ext
  mm/slab: abstract slabobj_ext access via new slab_obj_ext() helper
  ext4: specify the free pointer offset for ext4_inode_cache
  mm/slab: allow specifying free pointer offset when using constructor
  mm/slab: use unsigned long for orig_size to ensure proper metadata align
  slub: clarify object field layout comments
  mm/slab: avoid allocating slabobj_ext array from its own slab
  slub: avoid list_lock contention from __refill_objects_any()
  mm/slub: cleanup and repurpose some stat items
  mm/slub: remove DEACTIVATE_TO_* stat items
  slab: remove frozen slab checks from __slab_free()
  slab: update overview comments
  slab: refill sheaves from all nodes
  slab: remove unused PREEMPT_RT specific macros
  ...
2026-02-11 14:12:50 -08:00
Linus Torvalds
0c00ed308d Merge tag 'for-7.0/block-20260206' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull block updates from Jens Axboe:

 - Support for batch request processing for ublk, improving the
   efficiency of the kernel/ublk server communication. This can yield
   nice 7-12% performance improvements

 - Support for integrity data for ublk

 - Various other ublk improvements and additions, including a ton of
   selftests additions and updated

 - Move the handling of blk-crypto software fallback from below the
   block layer to above it. This reduces the complexity of dealing with
   bio splitting

 - Series fixing a number of potential deadlocks in blk-mq related to
   the queue usage counter and writeback throttling and rq-qos debugfs
   handling

 - Add an async_depth queue attribute, to resolve a performance
   regression that's been around for a qhilw related to the scheduler
   depth handling

 - Only use task_work for IOPOLL completions on NVMe, if it is necessary
   to do so. An earlier fix for an issue resulted in all these
   completions being punted to task_work, to guarantee that completions
   were only run for a given io_uring ring when it was local to that
   ring. With the new changes, we can detect if it's necessary to use
   task_work or not, and avoid it if possible.

 - rnbd fixes:
      - Fix refcount underflow in device unmap path
      - Handle PREFLUSH and NOUNMAP flags properly in protocol
      - Fix server-side bi_size for special IOs
      - Zero response buffer before use
      - Fix trace format for flags
      - Add .release to rnbd_dev_ktype

 - MD pull requests via Yu Kuai
      - Fix raid5_run() to return error when log_init() fails
      - Fix IO hang with degraded array with llbitmap
      - Fix percpu_ref not resurrected on suspend timeout in llbitmap
      - Fix GPF in write_page caused by resize race
      - Fix NULL pointer dereference in process_metadata_update
      - Fix hang when stopping arrays with metadata through dm-raid
      - Fix any_working flag handling in raid10_sync_request
      - Refactor sync/recovery code path, improve error handling for
        badblocks, and remove unused recovery_disabled field
      - Consolidate mddev boolean fields into mddev_flags
      - Use mempool to allocate stripe_request_ctx and make sure
        max_sectors is not less than io_opt in raid5
      - Fix return value of mddev_trylock
      - Fix memory leak in raid1_run()
      - Add Li Nan as mdraid reviewer

 - Move phys_vec definitions to the kernel types, mostly in preparation
   for some VFIO and RDMA changes

 - Improve the speed for secure erase for some devices

 - Various little rust updates

 - Various other minor fixes, improvements, and cleanups

* tag 'for-7.0/block-20260206' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (162 commits)
  blk-mq: ABI/sysfs-block: fix docs build warnings
  selftests: ublk: organize test directories by test ID
  block: decouple secure erase size limit from discard size limit
  block: remove redundant kill_bdev() call in set_blocksize()
  blk-mq: add documentation for new queue attribute async_dpeth
  block, bfq: convert to use request_queue->async_depth
  mq-deadline: covert to use request_queue->async_depth
  kyber: covert to use request_queue->async_depth
  blk-mq: add a new queue sysfs attribute async_depth
  blk-mq: factor out a helper blk_mq_limit_depth()
  blk-mq-sched: unify elevators checking for async requests
  block: convert nr_requests to unsigned int
  block: don't use strcpy to copy blockdev name
  blk-mq-debugfs: warn about possible deadlock
  blk-mq-debugfs: add missing debugfs_mutex in blk_mq_debugfs_register_hctxs()
  blk-mq-debugfs: remove blk_mq_debugfs_unregister_rqos()
  blk-mq-debugfs: make blk_mq_debugfs_register_rqos() static
  blk-rq-qos: fix possible debugfs_mutex deadlock
  blk-mq-debugfs: factor out a helper to register debugfs for all rq_qos
  blk-wbt: fix possible deadlock to nest pcpu_alloc_mutex under q_usage_counter
  ...
2026-02-09 17:57:21 -08:00
Linus Torvalds
9e355113f0 Merge tag 'vfs-7.0-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull misc vfs updates from Christian Brauner:
 "This contains a mix of VFS cleanups, performance improvements, API
  fixes, documentation, and a deprecation notice.

  Scalability and performance:

   - Rework pid allocation to only take pidmap_lock once instead of
     twice during alloc_pid(), improving thread creation/teardown
     throughput by 10-16% depending on false-sharing luck. Pad the
     namespace refcount to reduce false-sharing

   - Track file lock presence via a flag in ->i_opflags instead of
     reading ->i_flctx, avoiding false-sharing with ->i_readcount on
     open/close hot paths. Measured 4-16% improvement on 24-core
     open-in-a-loop benchmarks

   - Use a consume fence in locks_inode_context() to match the
     store-release/load-consume idiom, eliminating a hardware fence on
     some architectures

   - Annotate cdev_lock with __cacheline_aligned_in_smp to prevent
     false-sharing

   - Remove a redundant DCACHE_MANAGED_DENTRY check in
     __follow_mount_rcu() that never fires since the caller already
     verifies it, eliminating a 100% mispredicted branch

   - Fix a 100% mispredicted likely() in devcgroup_inode_permission()
     that became wrong after a prior code reorder

  Bug fixes and correctness:

   - Make insert_inode_locked() wait for inode destruction instead of
     skipping, fixing a corner case where two matching inodes could
     exist in the hash

   - Move f_mode initialization before file_ref_init() in alloc_file()
     to respect the SLAB_TYPESAFE_BY_RCU ordering contract

   - Add a WARN_ON_ONCE guard in try_to_free_buffers() for folios with
     no buffers attached, preventing a null pointer dereference when
     AS_RELEASE_ALWAYS is set but no release_folio op exists

   - Fix select restart_block to store end_time as timespec64, avoiding
     truncation of tv_sec on 32-bit architectures

   - Make dump_inode() use get_kernel_nofault() to safely access inode
     and superblock fields, matching the dump_mapping() pattern

  API modernization:

   - Make posix_acl_to_xattr() allocate the buffer internally since
     every single caller was doing it anyway. Reduces boilerplate and
     unnecessary error checking across ~15 filesystems

   - Replace deprecated simple_strtoul() with kstrtoul() for the
     ihash_entries, dhash_entries, mhash_entries, and mphash_entries
     boot parameters, adding proper error handling

   - Convert chardev code to use guard(mutex) and __free(kfree) cleanup
     patterns

   - Replace min_t() with min() or umin() in VFS code to avoid silently
     truncating unsigned long to unsigned int

   - Gate LOOKUP_RCU assertions behind CONFIG_DEBUG_VFS since callers
     already check the flag

  Deprecation:

   - Begin deprecating legacy BSD process accounting (acct(2)). The
     interface has numerous footguns and better alternatives exist
     (eBPF)

  Documentation:

   - Fix and complete kernel-doc for struct export_operations, removing
     duplicated documentation between ReST and source

   - Fix kernel-doc warnings for __start_dirop() and ilookup5_nowait()

  Testing:

   - Add a kunit test for initramfs cpio handling of entries with
     filesize > PATH_MAX

  Misc:

   - Add missing <linux/init_task.h> include in fs_struct.c"

* tag 'vfs-7.0-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (28 commits)
  posix_acl: make posix_acl_to_xattr() alloc the buffer
  fs: make insert_inode_locked() wait for inode destruction
  initramfs_test: kunit test for cpio.filesize > PATH_MAX
  fs: improve dump_inode() to safely access inode fields
  fs: add <linux/init_task.h> for 'init_fs'
  docs: exportfs: Use source code struct documentation
  fs: move initializing f_mode before file_ref_init()
  exportfs: Complete kernel-doc for struct export_operations
  exportfs: Mark struct export_operations functions at kernel-doc
  exportfs: Fix kernel-doc output for get_name()
  acct(2): begin the deprecation of legacy BSD process accounting
  device_cgroup: remove branch hint after code refactor
  VFS: fix __start_dirop() kernel-doc warnings
  fs: Describe @isnew parameter in ilookup5_nowait()
  fs/namei: Remove redundant DCACHE_MANAGED_DENTRY check in __follow_mount_rcu
  fs: only assert on LOOKUP_RCU when built with CONFIG_DEBUG_VFS
  select: store end_time as timespec64 in restart block
  chardev: Switch to guard(mutex) and __free(kfree)
  namespace: Replace simple_strtoul with kstrtoul to parse boot params
  dcache: Replace simple_strtoul with kstrtoul in set_dhash_entries
  ...
2026-02-09 15:13:05 -08:00
Linus Torvalds
dd466ea002 Merge tag 'vfs-7.0-rc1.fserror' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs error reporting updates from Christian Brauner:
 "This contains the changes to support generic I/O error reporting.

  Filesystems currently have no standard mechanism for reporting
  metadata corruption and file I/O errors to userspace via fsnotify.
  Each filesystem (xfs, ext4, erofs, f2fs, etc.) privately defines
  EFSCORRUPTED, and error reporting to fanotify is inconsistent or
  absent entirely.

  This introduces a generic fserror infrastructure built around struct
  super_block that gives filesystems a standard way to queue metadata
  and file I/O error reports for delivery to fsnotify.

  Errors are queued via mempools and queue_work to avoid holding
  filesystem locks in the notification path; unmount waits for pending
  events to drain. A new super_operations::report_error callback lets
  filesystem drivers respond to file I/O errors themselves (to be used
  by an upcoming XFS self-healing patchset).

  On the uapi side, EFSCORRUPTED and EUCLEAN are promoted from private
  per-filesystem definitions to canonical errno.h values across all
  architectures"

* tag 'vfs-7.0-rc1.fserror' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  ext4: convert to new fserror helpers
  xfs: translate fsdax media errors into file "data lost" errors when convenient
  xfs: report fs metadata errors via fsnotify
  iomap: report file I/O errors to the VFS
  fs: report filesystem and file I/O errors to fsnotify
  uapi: promote EFSCORRUPTED and EUCLEAN to errno.h
2026-02-09 12:21:37 -08:00
Linus Torvalds
aa2a0fcd4c Merge tag 'vfs-7.0-rc1.leases' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs lease updates from Christian Brauner:
 "This contains updates for lease support to require filesystems to
  explicitly opt-in to lease support

  Currently kernel_setlease() falls through to generic_setlease() when a
  a filesystem does not define ->setlease(), silently granting lease
  support to every filesystem regardless of whether it is prepared for
  it.

  This is a poor default: most filesystems never intended to support
  leases, and the silent fallthrough makes it impossible to distinguish
  "supports leases" from "never thought about it".

  This inverts the default. It adds explicit

	.setlease = generic_setlease;

  assignments to every in-tree filesystem that should retain lease
  support, then changes kernel_setlease() to return -EINVAL when
  ->setlease is NULL.

  With the new default in place, simple_nosetlease() is redundant and
  is removed along with all references to it"

* tag 'vfs-7.0-rc1.leases' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (25 commits)
  fuse: add setlease file operation
  fs: remove simple_nosetlease()
  filelock: default to returning -EINVAL when ->setlease operation is NULL
  xfs: add setlease file operation
  ufs: add setlease file operation
  udf: add setlease file operation
  tmpfs: add setlease file operation
  squashfs: add setlease file operation
  overlayfs: add setlease file operation
  orangefs: add setlease file operation
  ocfs2: add setlease file operation
  ntfs3: add setlease file operation
  nilfs2: add setlease file operation
  jfs: add setlease file operation
  jffs2: add setlease file operation
  gfs2: add a setlease file operation
  fat: add setlease file operation
  f2fs: add setlease file operation
  exfat: add setlease file operation
  ext4: add setlease file operation
  ...
2026-02-09 11:59:07 -08:00
Christoph Hellwig
f77f281b61 fsverity: use a hashtable to find the fsverity_info
Use the kernel's resizable hash table (rhashtable) to find the
fsverity_info.  This way file systems that want to support fsverity don't
have to bloat every inode in the system with an extra pointer.  The
trade-off is that looking up the fsverity_info is a bit more expensive
now, but the main operations are still dominated by I/O and hashing
overhead.

The rhashtable implementations requires no external synchronization, and
the _fast versions of the APIs provide the RCU critical sections required
by the implementation.  Because struct fsverity_info is only removed on
inode eviction and does not contain a reference count, there is no need
for an extended critical section to grab a reference or validate the
object state.  The file open path uses rhashtable_lookup_get_insert_fast,
which can either find an existing object for the hash key or insert a
new one in a single atomic operation, so that concurrent opens never
instantiate duplicate fsverity_info structure.  FS_IOC_ENABLE_VERITY must
already be synchronized by a combination of i_rwsem and file system flags
and uses rhashtable_lookup_insert_fast, which errors out on an existing
object for the hash key as an additional safety check.

Because insertion into the hash table now happens before S_VERITY is set,
fsverity just becomes a barrier and a flag check and doesn't have to look
up the fsverity_info at all, so there is only a single lookup per
->read_folio or ->readahead invocation.  For btrfs there is an additional
one for each bio completion, while for ext4 and f2fs the fsverity_info
is stored in the per-I/O context and reused for the completion workqueue.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Link: https://lore.kernel.org/r/20260202060754.270269-12-hch@lst.de
[EB: folded in fix for missing fsverity_free_info()]
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
2026-02-04 11:31:54 -08:00
Christoph Hellwig
d0b57f0962 ext4: consolidate fsverity_info lookup
Look up the fsverity_info once in ext4_mpage_readpages, and then use it
for the readahead, local verification of holes and pass it along to the
I/O completion workqueue in struct bio_post_read_ctx.

This amortizes the lookup better once it becomes less efficient.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Link: https://lore.kernel.org/r/20260202060754.270269-9-hch@lst.de
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
2026-02-04 11:27:54 -08:00
Harry Yoo
43d9bb4236 ext4: specify the free pointer offset for ext4_inode_cache
Convert ext4_inode_cache to use the kmem_cache_args interface and
specify a free pointer offset.

Since ext4_inode_cache uses a constructor, the free pointer would be
placed after the object to prevent overwriting fields used by the
constructor.

However, some fields such as ->i_flags are not used by the constructor
and can safely be repurposed for the free pointer.

Specify the free pointer offset at i_flags to reduce the object size.

Signed-off-by: Harry Yoo <harry.yoo@oracle.com>
Link: https://patch.msgid.link/20260113061845.159790-4-harry.yoo@oracle.com
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2026-02-04 10:05:35 +01:00
Christoph Hellwig
47bc2ac9b6 fsverity: push out fsverity_info lookup
Pass a struct fsverity_info to the verification and readahead helpers,
and push the lookup into the callers.  Right now this is a very dumb
almost mechanic move that open codes a lot of fsverity_info_addr() calls
in the file systems.  The subsequent patches will clean this up.

This prepares for reducing the number of fsverity_info lookups, which
will allow to amortize them better when using a more expensive lookup
method.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Acked-by: David Sterba <dsterba@suse.com> # btrfs
Link: https://lore.kernel.org/r/20260202060754.270269-7-hch@lst.de
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
2026-02-02 17:15:26 -08:00
Christoph Hellwig
f1a6cf44b3 fsverity: kick off hash readahead at data I/O submission time
Currently all reads of the fsverity hashes are kicked off from the data
I/O completion handler, leading to needlessly dependent I/O.  This is
worked around a bit by performing readahead on the level 0 nodes, but
still fairly ineffective.

Switch to a model where the ->read_folio and ->readahead methods instead
kick off explicit readahead of the fsverity hashed so they are usually
available at I/O completion time.

For 64k sequential reads on my test VM this improves read performance
from 2.4GB/s - 2.6GB/s to 3.5GB/s - 3.9GB/s.  The improvements for
random reads are likely to be even bigger.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: David Sterba <dsterba@suse.com> # btrfs
Link: https://lore.kernel.org/r/20260202060754.270269-5-hch@lst.de
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
2026-02-02 17:15:26 -08:00
Christoph Hellwig
314b652b7e ext4: move ->read_folio and ->readahead to readpage.c
Keep all the read into pagecache code in a single file.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Theodore Ts'o <tytso@mit.edu>
Link: https://lore.kernel.org/r/20260202060754.270269-4-hch@lst.de
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
2026-02-02 17:15:23 -08:00
Christoph Hellwig
821ddd25fb fsverity: start consolidating pagecache code
ext4 and f2fs are largely using the same code to read a page full
of Merkle tree blocks from the page cache, and the upcoming xfs
fsverity support would add another copy.

Move the ext4 code to fs/verity/ and use it in f2fs as well.  For f2fs
this removes the previous f2fs-specific error injection, but otherwise
the behavior remains unchanged.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andrey Albershteyn <aalbersh@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Link: https://lore.kernel.org/r/20260128152630.627409-7-hch@lst.de
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
2026-01-29 09:39:41 -08:00
Christoph Hellwig
ac09a30900 fsverity: pass struct file to ->write_merkle_tree_block
This will make an iomap implementation of the method easier.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andrey Albershteyn <aalbersh@redhat.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Acked-by: David Sterba <dsterba@suse.com> # btrfs
Link: https://lore.kernel.org/r/20260128152630.627409-6-hch@lst.de
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
2026-01-29 09:39:41 -08:00
Christoph Hellwig
fb26616459 ext4: don't build the fsverity work handler for !CONFIG_FS_VERITY
Use IS_ENABLED to disable this code, leading to a slight size reduction:

   text	   data	    bss	    dec	    hex	filename
   4121	    376	     16	   4513	   11a1	fs/ext4/readpage.o.old
   4030	    328	     16	   4374	   1116	fs/ext4/readpage.o

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Link: https://lore.kernel.org/r/20260128152630.627409-4-hch@lst.de
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
2026-01-29 09:39:41 -08:00
Christoph Hellwig
70098d9327 fs,fsverity: clear out fsverity_info from common code
Free the fsverity_info directly in clear_inode instead of requiring file
systems to handle it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Acked-by: David Sterba <dsterba@suse.com> # btrfs
Link: https://lore.kernel.org/r/20260128152630.627409-3-hch@lst.de
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
2026-01-29 09:39:41 -08:00
Christoph Hellwig
e9734653c5 fs,fsverity: reject size changes on fsverity files in setattr_prepare
Add the check to reject truncates of fsverity files directly to
setattr_prepare instead of requiring the file system to handle it.
Besides removing boilerplate code, this also fixes the complete lack of
such check in btrfs.

Fixes: 146054090b ("btrfs: initial fsverity support")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Link: https://lore.kernel.org/r/20260128152630.627409-2-hch@lst.de
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
2026-01-29 09:39:41 -08:00
Ojaswin Mujoo
4f5e8e6f01 et4: allow zeroout when doing written to unwritten split
Currently, when we are doing an extent split and convert operation of
written to unwritten extent (example, as done by ZERO_RANGE), we don't
allow the zeroout fallback in case the extent tree manipulation fails.
This is mostly because zeroout might take unsually long and the fact that
this code path is more tolerant to failures than endio.

Since we have zeroout machinery in place, we might as well use it hence
lift this restriction. To mitigate zeroout taking too long respect the
max zeroout limit here so that the operation finishes relatively fast.

Also, add kunit tests for this case.

Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://patch.msgid.link/1c3349020b8e098a63f293b84bc8a9b56011cef4.1769149131.git.ojaswin@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2026-01-23 16:50:11 -05:00
Ojaswin Mujoo
716b9c23b8 ext4: refactor split and convert extents
ext4_split_convert_extents() has been historically prone to subtle
bugs and inconsistent behavior due to the way all the various flags
interact with the extent split and conversion process. For example,
callers like ext4_convert_unwritten_extents_endio() and
convert_initialized_extents() needed to open code extent conversion
despite passing CONVERT or CONVERT_UNWRITTEN flags because
ext4_split_convert_extents() wasn't performing the conversion.

Hence, refactor ext4_split_convert_extents() to clearly enforce the
semantics of each flag. The major changes here are:

 * Clearly separate the split and convert process:
   * ext4_split_extent() and ext4_split_extent_at() are now only
     responsible to perform the split.
   * ext4_split_convert_extents() is now responsible to perform extent
     conversion after calling ext4_split_extent() for splitting.
   * This helps get rid of all the MARK_UNWRIT* flags.

 * Clearly enforce the semantics of flags passed to
   ext4_split_convert_extents():

   * EXT4_GET_BLOCKS_CONVERT: Will convert the split extent to written
   * EXT4_GET_BLOCKS_CONVERT_UNWRITTEN: Will convert the split extent to
     unwritten
   * Modify all callers to enforce the above semantics.

 * Use ext4_split_convert_extents() instead of ext4_split_extents()
   in ext4_ext_convert_to_initialized() for uniformity.

 * Now that ext4_split_convert_extents() is handling caching to es, we
   dont need to do it in ext4_split_extent_zeroout().

 * Cleanup all callers open coding the conversion logic. Further, modify
   kuniy tests to pass flags based on the new semantics.

>From an end user point of view, we should not see any changes in
behavior of ext4.

Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://patch.msgid.link/2084a383d69ceefbaa293b8fcf725365eca0a349.1769149131.git.ojaswin@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2026-01-23 16:50:01 -05:00
Ojaswin Mujoo
a985e07c26 ext4: refactor zeroout path and handle all cases
Currently, zeroout is used as a fallback in case we fail to
split/convert extents in the "traditional" modify-the-extent-tree way.
This is essential to mitigate failures in critical paths like extent
splitting during endio. However, the logic is very messy and not easy to
follow. Further, the fragile use of various flags has made it prone to
errors.

Refactor zeroout out logic by moving it up to ext4_split_extents().
Further, zeroout correctly based on the type of conversion we want, ie:
- unwritten to written: Zeroout everything around the mapped range.
- written to unwritten: Zeroout only the mapped range.

Also, ext4_ext_convert_to_initialized() now passes
EXT4_GET_BLOCKS_CONVERT to make the intention clear.

Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://patch.msgid.link/e1b51dedeca7c0b1f702141d91edfe4230560e7b.1769149131.git.ojaswin@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2026-01-23 16:49:53 -05:00
Ojaswin Mujoo
6066990c99 ext4: propagate flags to ext4_convert_unwritten_extents_endio()
Currently, callers like ext4_convert_unwritten_extents() pass
EXT4_EX_NOCACHE flag to avoid caching extents however this is not
respected by ext4_convert_unwritten_extents_endio(). Hence, modify it to
accept flags from the caller and to pass the flags on to other extent
manipulation functions it calls. This makes sure the NOCACHE flag is
respected throughout the code path.

Also, since the caller already passes METADATA_NOFAIL and CONVERT flags
we don't need to explicitly pass it anymore.

Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://patch.msgid.link/7c2139e0ad32c49c19b194f72219e15d613de284.1769149131.git.ojaswin@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2026-01-23 16:49:53 -05:00
Ojaswin Mujoo
3fffa44b6e ext4: propagate flags to convert_initialized_extent()
Currently, ext4_zero_range passes EXT4_EX_NOCACHE flag to avoid caching
extents however this is not respected by convert_initialized_extent().
Hence, modify it to accept flags from the caller and to pass the flags
on to other extent manipulation functions it calls. This makes
sure the NOCACHE flag is respected throughout the code path.

Also, we no longer explicitly pass CONVERT_UNWRITTEN as the caller takes
care of this.

Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://patch.msgid.link/07008fbb14db727fddcaf4c30e2346c49f6c8fe0.1769149131.git.ojaswin@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2026-01-23 16:49:53 -05:00
Ojaswin Mujoo
82f80e2e3b ext4: add extent status cache support to kunit tests
Add support in Kunit tests to ensure that the extent status cache is
also in sync after the extent split and conversion operations.

Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://patch.msgid.link/5f9d2668feeb89a3f3e9d03dadab8c10cbea3741.1769149131.git.ojaswin@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2026-01-23 16:49:43 -05:00
Ojaswin Mujoo
4dff18488f ext4: kunit tests for higher level extent manipulation functions
Add more kunit tests to cover the high level caller
ext4_map_create_blocks(). We pass flags in a manner that covers
the below function:

1. ext4_ext_handle_unwritten_extents()
  1.1 - Split/Convert unwritten extent to written in endio convtext.
  1.2 - Split/Convert unwritten extent to written in non endio context.
  1.3 - Zeroout tests for the above 2 cases
2. convert_initialized_extent() - Convert written extent to unwritten
   during zero range

Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://patch.msgid.link/9d8ad32cb62f44999c0fe3545b44fc3113546c70.1769149131.git.ojaswin@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2026-01-23 16:48:02 -05:00
Ojaswin Mujoo
cb1e0c1d1f ext4: kunit tests for extent splitting and conversion
Add multiple KUnit tests to test various permutations of extent
splitting and conversion.

We test the following cases:

1. Split of unwritten extent into 2 parts and convert 1 part to written
2. Split of unwritten extent into 3 parts and convert 1 part to written
3. Split of written extent into 2 parts and convert 1 part to unwritten
4. Split of written extent into 3 parts and convert 1 part to unwritten
5. Zeroout fallback for all the above cases except 3-4 because zeroout
   is not supported for written to unwritten splits

The main function we test here is ext4_split_convert_extents().
Currently some of the tests are failing due to issues in implementation.
All failures are mitigated at other layers in ext4 [1] but still point
out the mismatch in expectation of what the caller wants vs what the
function does.

The aim is to eventually fix all the failures we see here. More detailed
implementation notes can be found in the topmost commit in the test
file.

[1] for example, EXT4_GET_BLOCKS_CONVERT doesn't really convert the
split extent to written, but rather the callers end up doing the
conversion.

Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://patch.msgid.link/22bb9d17cd88c1318a2edde48887ca7488cb8a13.1769149131.git.ojaswin@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2026-01-23 16:48:02 -05:00
Johannes Weiner
64dd89ae01 mm/block/fs: remove laptop_mode
Laptop mode was introduced to save battery, by delaying and consolidating
writes and thereby maximize the time rotating hard drives wouldn't have to
spin.

Luckily, rotating hard drives, with their high spin-up times and power
draw, are a thing of the past for battery-powered devices.  Reclaim has
also since changed to not write single filesystem pages anymore, and
regular filesystem writeback is lumpy by design.

The juice doesn't appear worth the squeeze anymore.  The footprint of the
feature is small, but nevertheless it's a complicating factor in mm,
block, filesystems.  Developers don't think about it, and it likely hasn't
been tested with new reclaim and writeback changes in years.

Let's sunset it.  Keep the sysctl with a deprecation warning around for a
few more cycles, but remove all functionality behind it.

[akpm@linux-foundation.org: fix Documentation/admin-guide/laptops/index.rst]
Link: https://lkml.kernel.org/r/20251216185201.GH905277@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Suggested-by: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Deepanshu Kartikey <kartikey406@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-20 19:24:47 -08:00
Jan Kara
3574c322b1 ext4: use optimized mballoc scanning regardless of inode format
Currently we don't used mballoc optimized scanning (using max free
extent order and avg free extent order group lists) for inodes with
indirect block based format. This is confusing for users and I don't see
a good reason for that. Even with indirect block based inode format we
can spend big amount of time searching for free blocks for large
filesystems with fragmented free space. To add to the confusion before
commit 077d0c2c78 ("ext4: make mb_optimize_scan performance mount
option work with extents") optimized scanning was applied *only* to
indirect block based inodes so that commit appears as a performance
regression to some users. Just use optimized scanning whenever it is
enabled by mount options.

Reviewed-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: stable@kernel.org
Link: https://patch.msgid.link/20260114182836.14120-4-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2026-01-19 22:54:58 -05:00
Jan Kara
4865c768b5 ext4: always allocate blocks only from groups inode can use
For filesystems with more than 2^32 blocks inodes using indirect block
based format cannot use blocks beyond the 32-bit limit.
ext4_mb_scan_groups_linear() takes care to not select these unsupported
groups for such inodes however other functions selecting groups for
allocation don't. So far this is harmless because the other selection
functions are used only with mb_optimize_scan and this is currently
disabled for inodes with indirect blocks however in the following patch
we want to enable mb_optimize_scan regardless of inode format.

Reviewed-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: Pedro Falcato <pfalcato@suse.de>
Cc: stable@kernel.org
Link: https://patch.msgid.link/20260114182836.14120-3-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2026-01-19 22:54:58 -05:00
Brian Foster
94a8cea54c ext4: fix dirtyclusters double decrement on fs shutdown
fstests test generic/388 occasionally reproduces a warning in
ext4_put_super() associated with the dirty clusters count:

  WARNING: CPU: 7 PID: 76064 at fs/ext4/super.c:1324 ext4_put_super+0x48c/0x590 [ext4]

Tracing the failure shows that the warning fires due to an
s_dirtyclusters_counter value of -1. IOW, this appears to be a
spurious decrement as opposed to some sort of leak. Further tracing
of the dirty cluster count deltas and an LLM scan of the resulting
output identified the cause as a double decrement in the error path
between ext4_mb_mark_diskspace_used() and the caller
ext4_mb_new_blocks().

First, note that generic/388 is a shutdown vs. fsstress test and so
produces a random set of operations and shutdown injections. In the
problematic case, the shutdown triggers an error return from the
ext4_handle_dirty_metadata() call(s) made from
ext4_mb_mark_context(). The changed value is non-zero at this point,
so ext4_mb_mark_diskspace_used() does not exit after the error
bubbles up from ext4_mb_mark_context(). Instead, the former
decrements both cluster counters and returns the error up to
ext4_mb_new_blocks(). The latter falls into the !ar->len out path
which decrements the dirty clusters counter a second time, creating
the inconsistency.

To avoid this problem and simplify ownership of the cluster
reservation in this codepath, lift the counter reduction to a single
place in the caller. This makes it more clear that
ext4_mb_new_blocks() is responsible for acquiring cluster
reservation (via ext4_claim_free_clusters()) in the !delalloc case
as well as releasing it, regardless of whether it ends up consumed
or returned due to failure.

Fixes: 0087d9fb3f ("ext4: Fix s_dirty_blocks_counter if block allocation failed with nodelalloc")
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Baokun Li <libaokun1@huawei.com>
Link: https://patch.msgid.link/20260113171905.118284-1-bfoster@redhat.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
2026-01-19 22:51:23 -05:00
Li Chen
491f2927ae ext4: fast commit: make s_fc_lock reclaim-safe
s_fc_lock can be acquired from inode eviction and thus is
reclaim unsafe. Since the fast commit path holds s_fc_lock while writing
the commit log, allocations under the lock can enter reclaim and invert
the lock order with fs_reclaim. Add ext4_fc_lock()/ext4_fc_unlock()
helpers which acquire s_fc_lock under memalloc_nofs_save()/restore()
context and use them everywhere so allocations under the lock cannot
recurse into filesystem reclaim.

Fixes: 6593714d67 ("ext4: hold s_fc_lock while during fast commit")
Signed-off-by: Li Chen <me@linux.beauty>
Reviewed-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20260106120621.440126-1-me@linux.beauty
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2026-01-19 22:46:05 -05:00
Yongjian Sun
bdc56a9c46 ext4: fix e4b bitmap inconsistency reports
A bitmap inconsistency issue was observed during stress tests under
mixed huge-page workloads. Ext4 reported multiple e4b bitmap check
failures like:

ext4_mb_complex_scan_group:2508: group 350, 8179 free clusters as
per group info. But got 8192 blocks

Analysis and experimentation confirmed that the issue is caused by a
race condition between page migration and bitmap modification. Although
this timing window is extremely narrow, it is still hit in practice:

folio_lock                        ext4_mb_load_buddy
__migrate_folio
  check ref count
  folio_mc_copy                     __filemap_get_folio
                                      folio_try_get(folio)
                                  ......
                                  mb_mark_used
                                  ext4_mb_unload_buddy
  __folio_migrate_mapping
    folio_ref_freeze
folio_unlock

The root cause of this issue is that the fast path of load_buddy only
increments the folio's reference count, which is insufficient to prevent
concurrent folio migration. We observed that the folio migration process
acquires the folio lock. Therefore, we can determine whether to take the
fast path in load_buddy by checking the lock status. If the folio is
locked, we opt for the slow path (which acquires the lock) to close this
concurrency window.

Additionally, this change addresses the following issues:

When the DOUBLE_CHECK macro is enabled to inspect bitmap-related
issues, the following error may be triggered:

corruption in group 324 at byte 784(6272): f in copy != ff on
disk/prealloc

Analysis reveals that this is a false positive. There is a specific race
window where the bitmap and the group descriptor become momentarily
inconsistent, leading to this error report:

ext4_mb_load_buddy                   ext4_mb_load_buddy
  __filemap_get_folio(create|lock)
    folio_lock
  ext4_mb_init_cache
    folio_mark_uptodate
                                     __filemap_get_folio(no lock)
                                     ......
                                     mb_mark_used
                                       mb_mark_used_double
  mb_cmp_bitmaps
                                       mb_set_bits(e4b->bd_bitmap)
  folio_unlock

The original logic assumed that since mb_cmp_bitmaps is called when the
bitmap is newly loaded from disk, the folio lock would be sufficient to
prevent concurrent access. However, this overlooks a specific race
condition: if another process attempts to load buddy and finds the folio
is already in an uptodate state, it will immediately begin using it without
holding folio lock.

Signed-off-by: Yongjian Sun <sunyongjian1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20260106090820.836242-1-sunyongjian@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
2026-01-19 22:41:59 -05:00
Baolin Liu
591a4ab9b8 ext4: remove redundant NULL check after __GFP_NOFAIL
Remove redundant NULL check after kcalloc() with GFP_NOFS | __GFP_NOFAIL.

Signed-off-by: Baolin Liu <liubaolin@kylinos.cn>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20260106062016.154573-1-liubaolin12138@163.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2026-01-19 22:37:06 -05:00
Zhang Yi
5f18f60d56 ext4: remove EXT4_GET_BLOCKS_IO_CREATE_EXT
We do not use EXT4_GET_BLOCKS_IO_CREATE_EXT or split extents before
submitting I/O; therefore, remove the related code.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://patch.msgid.link/20260105014522.1937690-8-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2026-01-19 22:28:30 -05:00
Zhang Yi
5ca28af074 ext4: simplify the mapping query logic in ext4_iomap_begin()
In the write path mapping check of ext4_iomap_begin(), the return value
'ret' should never greater than orig_mlen. If 'ret' equals 'orig_mlen',
it can be returned directly without checking IOMAP_ATOMIC.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://patch.msgid.link/20260105014522.1937690-7-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2026-01-19 22:28:30 -05:00
Zhang Yi
8bd1f257af ext4: remove unused unwritten parameter in ext4_dio_write_iter()
The parameter unwritten in ext4_dio_write_iter() is no longer needed,
simply remove it.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://patch.msgid.link/20260105014522.1937690-6-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2026-01-19 22:28:30 -05:00
Zhang Yi
012924f0ee ext4: remove useless ext4_iomap_overwrite_ops
ext4_iomap_overwrite_ops was introduced in commit 8cd115bdda ("ext4:
Optimize ext4 DIO overwrites"), which can optimize pure overwrite
performance by dropping the IOMAP_WRITE flag to only query the mapped
mapping information. This avoids starting a new journal handle, thereby
improving speed. Later, commit 9faac62d40 ("ext4: optimize file
overwrites") also optimized similar scenarios, but it performs the check
later, examining the mappings status only when the actual block mapping
is needed. Thus, it can handle the previous commit scenario. That means
in the case of an overwrite scenario, the condition
"offset + length <= i_size_read(inode)" in the write path must always be
true.

Therefore, it is acceptable to remove the ext4_iomap_overwrite_ops,
which will also clarify the write and read paths of ext4_iomap_begin.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://patch.msgid.link/20260105014522.1937690-5-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2026-01-19 22:28:30 -05:00
Zhang Yi
5d87c7fca2 ext4: avoid starting handle when dio writing an unwritten extent
Since we have deferred the split of the unwritten extent until after I/O
completion, it is not necessary to initiate the journal handle when
submitting the I/O.

This can improve the write performance of concurrent DIO for multiple
files. The fio tests below show a ~25% performance improvement when
wirting to unwritten files on my VM with a mem disk.

  [unwritten]
  direct=1
  ioengine=psync
  numjobs=16
  rw=write     # write/randwrite
  bs=4K
  iodepth=1
  directory=/mnt
  size=5G
  runtime=30s
  overwrite=0
  norandommap=1
  fallocate=native
  ramp_time=5s
  group_reporting=1

 [w/o]
  w:  IOPS=62.5k, BW=244MiB/s
  rw: IOPS=56.7k, BW=221MiB/s

 [w]
  w:  IOPS=79.6k, BW=311MiB/s
  rw: IOPS=70.2k, BW=274MiB/s

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://patch.msgid.link/20260105014522.1937690-4-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2026-01-19 22:28:30 -05:00
Zhang Yi
ea96cb5c4a ext4: don't split extent before submitting I/O
Currently, when writing back dirty pages to the filesystem with the
dioread_nolock feature enabled and when doing DIO, if the area to be
written back is part of an unwritten extent, the
EXT4_GET_BLOCKS_IO_CREATE_EXT flag is set during block allocation before
submitting I/O. The function ext4_split_convert_extents() then attempts
to split this extent in advance. This approach is designed to prevents
extent splitting and conversion to the written type from failing due to
insufficient disk space at the time of I/O completion, which could
otherwise result in data loss.

However, we already have two mechanisms to ensure successful extent
conversion. The first is the EXT4_GET_BLOCKS_METADATA_NOFAIL flag, which
is a best effort, it permits the use of 2% of the reserved space or
4,096 blocks in the file system when splitting extents. This flag covers
most scenarios where extent splitting might fail. The second is the
EXT4_EXT_MAY_ZEROOUT flag, which is also set during extent splitting. If
the reserved space is insufficient and splitting fails, it does not
retry the allocation. Instead, it directly zeros out the extra part of
the extent, thereby avoiding splitting and directly converting the
entire extent to the written type.

These two mechanisms also exist when I/Os are completed because there is
a concurrency window between write-back and fallocate, which may still
require us to split extents upon I/O completion. There is no much
difference between splitting extents before submitting I/O. Therefore,
It seems possible to defer the splitting until I/O completion, it won't
increase the risk of I/O failure and data loss. On the contrary, if some
I/Os can be merged when I/O completion, it can also reduce unnecessary
splitting operations, thereby alleviating the pressure on reserved
space.

In addition, deferring extent splitting until I/O completion can
also simplify the IO submission process and avoid initiating unnecessary
journal handles when writing unwritten extents.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://patch.msgid.link/20260105014522.1937690-3-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2026-01-19 22:28:30 -05:00
Zhang Yi
01942af95a ext4: use reserved metadata blocks when splitting extent on endio
When performing buffered writes, we may need to split and convert an
unwritten extent into a written one during the end I/O process. However,
we do not reserve space specifically for these metadata changes, we only
reserve 2% of space or 4096 blocks. To address this, we use
EXT4_GET_BLOCKS_PRE_IO to potentially split extents in advance and
EXT4_GET_BLOCKS_METADATA_NOFAIL to utilize reserved space if necessary.

These two approaches can reduce the likelihood of running out of space
and losing data. However, these methods are merely best efforts, we
could still run out of space, and there is not much difference between
converting an extent during the writeback process and the end I/O
process, it won't increase the risk of losing data if we postpone the
conversion.

Therefore, also use EXT4_GET_BLOCKS_METADATA_NOFAIL in
ext4_convert_unwritten_extents_endio() to prepare for the buffered I/O
iomap conversion, which may perform extent conversion during the end I/O
process.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://patch.msgid.link/20260105014522.1937690-2-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2026-01-19 22:28:30 -05:00
Zilin Guan
ca81109d4a ext4: fix memory leak in ext4_ext_shift_extents()
In ext4_ext_shift_extents(), if the extent is NULL in the while loop, the
function returns immediately without releasing the path obtained via
ext4_find_extent(), leading to a memory leak.

Fix this by jumping to the out label to ensure the path is properly
released.

Fixes: a18ed359bd ("ext4: always check ext4_ext_find_extent result")
Signed-off-by: Zilin Guan <zilin@seu.edu.cn>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Baokun Li <libaokun1@huawei.com>
Link: https://patch.msgid.link/20251225084800.905701-1-zilin@seu.edu.cn
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
2026-01-19 22:28:30 -05:00
Zhang Yi
154922b34d ext4: don't order data when zeroing unwritten or delayed block
When zeroing out a written partial block, it is necessary to order the
data to prevent exposing stale data on disk. However, if the buffer is
unwritten or delayed, it is not allocated as written, so ordering the
data is not required. This can prevent strange and unnecessary ordered
writes when appending data across a region within a block.

Assume we have a 2K unwritten file on a filesystem with 4K blocksize,
and buffered write from 3K to 4K. Before this patch,
__ext4_block_zero_page_range() would add the range [2k,3k) to the
ordered range, and then the JBD2 commit process would write back this
block. However, it does nothing since the block is not mapped as
written, this folio will be redirtied and written back agian through the
normal write back process.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Baokun Li <libaokun1@huawei.com>
Link: https://patch.msgid.link/20251223011927.34042-1-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2026-01-19 22:22:32 -05:00
pengdonglin
26f260ce58 ext4: remove unnecessary zero-initialization via memset
The d_path function does not require the caller to pre-zero the
buffer.

Signed-off-by: pengdonglin <pengdonglin@xiaomi.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Baokun Li <libaokun1@huawei.com>
Link: https://patch.msgid.link/20251211123829.2777009-1-dolinux.peng@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2026-01-19 20:13:57 -05:00
Li Chen
1f8dd813a1 ext4: mark group extend fast-commit ineligible
Fast commits only log operations that have dedicated replay support.
EXT4_IOC_GROUP_EXTEND grows the filesystem to the end of the last
block group and updates the same on-disk metadata without going
through the fast commit tracking paths.
In practice these operations are rare and usually followed by further
updates, but mixing them into a fast commit makes the overall
semantics harder to reason about and risks replay gaps if new call
sites appear.

Teach ext4 to mark the filesystem fast-commit ineligible when
EXT4_IOC_GROUP_EXTEND grows the filesystem.
This forces those transactions to fall back to a full commit,
ensuring that the group extension changes are captured by the normal
journal rather than partially encoded in fast commit TLVs.
This change should not affect common workloads but makes online
resize via GROUP_EXTEND safer and easier to reason about under fast
commit.

Testing:
1. prepare:
    dd if=/dev/zero of=/root/fc_resize.img bs=1M count=0 seek=256
    mkfs.ext4 -O fast_commit -F /root/fc_resize.img
    mkdir -p /mnt/fc_resize && mount -t ext4 -o loop /root/fc_resize.img /mnt/fc_resize
2. Extended the filesystem to the end of the last block group using a
   helper that calls EXT4_IOC_GROUP_EXTEND on the mounted filesystem
   and checked fc_info:
    ./group_extend_helper /mnt/fc_resize
    cat /proc/fs/ext4/loop0/fc_info
   shows the "Resize" ineligible reason increased.
3. Fsynced a file on the resized filesystem and confirmed that the fast
   commit ineligible counter incremented for the resize transaction:
    touch /mnt/fc_resize/file
    /root/fsync_file /mnt/fc_resize/file
    sync
    cat /proc/fs/ext4/loop0/fc_info

Signed-off-by: Li Chen <me@linux.beauty>
Link: https://patch.msgid.link/20251211115146.897420-6-me@linux.beauty
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2026-01-19 19:26:35 -05:00
Li Chen
89b4336fd5 ext4: mark group add fast-commit ineligible
Fast commits only log operations that have dedicated replay support.
Online resize via EXT4_IOC_GROUP_ADD updates the superblock and group
descriptor metadata without going through the fast commit tracking
paths.
In practice these operations are rare and usually followed by further
updates, but mixing them into a fast commit makes the overall
semantics harder to reason about and risks replay gaps if new call
sites appear.

Teach ext4 to mark the filesystem fast-commit ineligible when
ext4_ioctl_group_add() adds new block groups.
This forces those transactions to fall back to a full commit,
ensuring that the filesystem geometry updates are captured by the
normal journal rather than partially encoded in fast commit TLVs.
This change should not affect common workloads but makes online
resize via GROUP_ADD safer and easier to reason about under fast
commit.

Testing:
1. prepare:
    dd if=/dev/zero of=/root/fc_resize.img bs=1M count=0 seek=256
    mkfs.ext4 -O fast_commit -F /root/fc_resize.img
    mkdir -p /mnt/fc_resize && mount -t ext4 -o loop /root/fc_resize.img /mnt/fc_resize
2. Ran a helper that issues EXT4_IOC_GROUP_ADD on the mounted
   filesystem and checked the resize ineligible reason:
    ./group_add_helper /mnt/fc_resize
    cat /proc/fs/ext4/loop0/fc_info
   shows "Resize": > 0.
3. Fsynced a file on the resized filesystem and verified that the fast
   commit stats report at least one ineligible commit:
    touch /mnt/fc_resize/file
    /root/fsync_file /mnt/fc_resize/file
    sync
    cat /proc/fs/ext4/loop0/fc_info
   shows fc stats ineligible > 0.

Signed-off-by: Li Chen <me@linux.beauty>
Link: https://patch.msgid.link/20251211115146.897420-5-me@linux.beauty
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2026-01-19 19:26:35 -05:00
Li Chen
690558921d ext4: mark move extents fast-commit ineligible
Fast commits only log operations that have dedicated replay support.
EXT4_IOC_MOVE_EXT swaps extents between regular files and may copy
data, rewriting the affected inodes' block mapping layout without
going through the fast commit tracking paths.
In practice these operations are rare and usually followed by further
updates, but mixing them into a fast commit makes the overall
semantics harder to reason about and risks replay gaps if new call
sites appear.

Teach ext4 to mark the filesystem fast-commit ineligible for the
journal transactions used by move_extent_per_page() when
EXT4_IOC_MOVE_EXT runs.
This forces those transactions to fall back to a full commit,
ensuring that these multi-inode extent swaps are captured by the
normal journal rather than partially encoded in fast commit TLVs.
This change should not affect common workloads but makes online
defragmentation safer and easier to reason about under fast commit.

Testing:
1. prepare:
        dd if=/dev/zero of=/root/fc_move.img bs=1M count=0 seek=256
        mkfs.ext4 -O fast_commit -F /root/fc_move.img
        mkdir -p /mnt/fc_move && mount -t ext4 -o loop \
/root/fc_move.img /mnt/fc_move
2. Created two files, ran EXT4_IOC_MOVE_EXT via e4defrag, and checked
   the ineligible reason statistics:
        fallocate -l 64M /mnt/fc_move/file1
        cp /mnt/fc_move/file1 /mnt/fc_move/file2
        e4defrag /mnt/fc_move/file1
        cat /proc/fs/ext4/loop0/fc_info
   shows "Move extents": > 0 and fc stats ineligible > 0.

Signed-off-by: Li Chen <me@linux.beauty>
Link: https://patch.msgid.link/20251211115146.897420-4-me@linux.beauty
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2026-01-19 19:26:35 -05:00
Li Chen
16d43b9748 ext4: mark fs-verity enable fast-commit ineligible
Fast commits only log operations that have dedicated replay support.
Enabling fs-verity builds a Merkle tree and updates inode and orphan
state in ways that are not described by the fast commit replay tags.
In practice these operations are rare and usually followed by further
updates, but mixing them into a fast commit makes the overall
semantics harder to reason about and risks replay gaps if new call
sites appear.

Teach ext4 to mark the filesystem fast-commit ineligible when
ext4_end_enable_verity() starts its journal transaction.
This forces that transaction to fall back to a full commit, ensuring
that the fs-verity enable changes are captured by the normal journal
rather than partially encoded in fast commit TLVs.
This change should not affect common workloads but makes fs-verity
enable safer and easier to reason about under fast commit.

Testing:
1. prepare:
    dd if=/dev/zero of=/root/fc_verity.img bs=1M count=0 seek=128
    mkfs.ext4 -O fast_commit,verity -F /root/fc_verity.img
    mkdir -p /mnt/fc_verity && mount -t ext4 -o loop /root/fc_verity.img /mnt/fc_verity
2. Enabled fs-verity on a file and verified reason accounting:
    echo "data" > /mnt/fc_verity/verityfile
    /root/enable_verity /mnt/fc_verity/verityfile
    sync
    tail -n 1 /proc/fs/ext4/loop0/fc_info
    "fs-verity enable":     1
3. Enabled fs-verity on a second file, fsynced it, and checked that the
   ineligible commit counter is updated too:
    echo "data2" > /mnt/fc_verity/verityfile2
    /root/enable_verity /mnt/fc_verity/verityfile2
    /root/fsync_file /mnt/fc_verity/verityfile2
    sync
    /proc/fs/ext4/loop0/fc_info shows "fs-verity enable" incremented and
    fc stats ineligible increased accordingly.

Signed-off-by: Li Chen <me@linux.beauty>
Link: https://patch.msgid.link/20251211115146.897420-3-me@linux.beauty
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2026-01-19 19:26:35 -05:00
Li Chen
87e79fa122 ext4: mark inode format migration fast-commit ineligible
Fast commits only log operations that have dedicated replay support.
Inode format migration (indirect<->extent layout changes via
EXT4_IOC_MIGRATE or toggling EXT4_EXTENTS_FL) rewrites the block mapping
representation without going through the fast commit tracking paths.
In practice these migrations are rare and usually followed by further
updates, but mixing them into a fast commit makes the overall semantics
harder to reason about and risks replay gaps if new call sites appear.

Teach ext4 to mark the filesystem fast-commit ineligible when
ext4_ext_migrate() or ext4_ind_migrate() start their journal transactions.
This forces those transactions to fall back to a full commit, ensuring
that the entire inode layout change is captured by the normal journal
rather than partially encoded in fast commit TLVs. This change should
not affect common workloads but makes format migrations safer and easier
to reason about under fast commit.

Testing:
1. prepare:
    dd if=/dev/zero of=/root/fc.img bs=1M count=0 seek=128
    mkfs.ext4 -O fast_commit -F /root/fc.img
    mkdir -p /mnt/fc && mount -t ext4 -o loop /root/fc.img /mnt/fc
2.  Created a test file and toggled the extents flag to exercise both
    ext4_ind_migrate() and ext4_ext_migrate():
    touch /mnt/fc/migtest
    chattr -e /mnt/fc/migtest
    chattr +e /mnt/fc/migtest
3. Verified fast-commit ineligible statistics:
    tail -n 1 /proc/fs/ext4/loop0/fc_info
    "Inode format migration":       2

Signed-off-by: Li Chen <me@linux.beauty>
Link: https://patch.msgid.link/20251211115146.897420-2-me@linux.beauty
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2026-01-19 19:26:35 -05:00
Baolin Liu
d518215c27 ext4: add sysfs attribute err_report_sec to control s_err_report timer
Add a new sysfs attribute "err_report_sec" to control the s_err_report
timer in ext4_sb_info. Writing '0' disables the timer, while writing
a non-zero value enables the timer and sets the timeout in seconds.

Signed-off-by: Baolin Liu <liubaolin@kylinos.cn>
Link: https://patch.msgid.link/20251211030256.28613-1-liubaolin12138@163.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2026-01-19 19:26:35 -05:00