Commit Graph

8674 Commits

Author SHA1 Message Date
Yu Kuai
bb74b093c3 md/md-bitmap: handle the case bitmap is not enabled before end_sync()
This case can be handled without knowing internal implementation.

Prepare to introduce CONFIG_MD_BITMAP.

Link: https://lore.kernel.org/linux-raid/20250707012711.376844-9-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
2025-09-06 17:11:54 +08:00
Yu Kuai
5ae58d1500 md/md-bitmap: handle the case bitmap is not enabled before start_sync()
This case can be handled without knowing internal implementation.

Prepare to introduce CONFIG_MD_BITMAP.

Link: https://lore.kernel.org/linux-raid/20250707012711.376844-8-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
2025-09-06 17:11:50 +08:00
Yu Kuai
110332074d md/md-bitmap: add md_bitmap_registered/enabled() helper
There are no functional changes, prepare to handle the case that
mddev->bitmap_ops can be NULL, which is possible after introducing
CONFIG_MD_BITMAP.

Link: https://lore.kernel.org/linux-raid/20250707012711.376844-7-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
2025-09-06 17:11:45 +08:00
Yu Kuai
9c41ead04e md/md-bitmap: add a new parameter 'flush' to bitmap_ops->enabled
The method is only used from raid1/raid10 IO path, to check if write
bio should be pluged, the parameter is always set to true for now,
following patch will use this helper in other context like updating
superblock.

Link: https://lore.kernel.org/linux-raid/20250707012711.376844-6-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
2025-09-06 17:11:42 +08:00
Yu Kuai
9307dbac0e md/md-bitmap: merge md_bitmap_group into bitmap_operations
Now that all bitmap implementations are internal, it doesn't make sense
to export md_bitmap_group anymore.

Link: https://lore.kernel.org/linux-raid/20250707012711.376844-5-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
2025-09-06 17:11:40 +08:00
Yu Kuai
e57b225c28 md/md-bitmap: remove the parameter 'init' for bitmap_ops->resize()
It's set to 'false' for all callers, hence it's useless and can be
removed.

Link: https://lore.kernel.org/linux-raid/20250707012711.376844-3-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
2025-09-06 17:11:33 +08:00
Li Nan
7202082b7b md: prevent incorrect update of resync/recovery offset
In md_do_sync(), when md_sync_action returns ACTION_FROZEN, subsequent
call to md_sync_position() will return MaxSector. This causes
'curr_resync' (and later 'recovery_offset') to be set to MaxSector too,
which incorrectly signals that recovery/resync has completed, even though
disk data has not actually been updated.

To fix this issue, skip updating any offset values when the sync action
is FROZEN. The same holds true for IDLE.

Fixes: 7d9f107a4e ("md: use new helpers in md_do_sync()")
Signed-off-by: Li Nan <linan122@huawei.com>
Link: https://lore.kernel.org/linux-raid/20250904073452.3408516-1-linan666@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
2025-09-05 00:31:18 +08:00
Yu Kuai
93dec51e71 md/raid1: fix data lost for writemostly rdev
If writemostly is enabled, alloc_behind_master_bio() will allocate a new
bio for rdev, with bi_opf set to 0. Later, raid1_write_request() will
clone from this bio, hence bi_opf is still 0 for the cloned bio. Submit
this cloned bio will end up to be read, causing write data lost.

Fix this problem by inheriting bi_opf from original bio for
behind_mast_bio.

Fixes: e879a0d9cb ("md/raid1,raid10: don't ignore IO flags")
Reported-and-tested-by: Ian Dall <ian@beware.dropbear.id.au>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=220507
Link: https://lore.kernel.org/linux-raid/20250903014140.3690499-1-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Li Nan <linan122@huawei.com>
2025-09-05 00:30:04 +08:00
Jens Axboe
4dbe13c784 Merge tag 'pull-getgeo' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs into for-6.18/block
Pull struct block_device getgeo changes from Al.

"switching ->getgeo() from struct block_device to struct gendisk

 Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>"

* tag 'pull-getgeo' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  block: switch ->getgeo() to struct gendisk
  scsi: switch ->bios_param() to passing gendisk
  scsi: switch scsi_bios_ptable() and scsi_partsize() to gendisk
2025-09-03 15:15:43 -06:00
Qianfeng Rong
4466dd3d71 dm-pcache: use int type to store negative error codes
Change the 'ret' variable from u32 to int to store negative error codes or
zero returned by cache_kset_close().

Storing the negative error codes in unsigned type, doesn't cause an issue
at runtime but it's ugly. Additionally, assigning negative error codes to
unsigned type may trigger a GCC warning when the -Wsign-conversion flag
is enabled.

No effect on runtime.

Signed-off-by: Qianfeng Rong <rongqianfeng@vivo.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-09-02 11:19:12 +02:00
Xichao Zhao
9f5c33bddd dm: fix "writen"->"written"
Trivial fix to spelling mistake in comment text.

Signed-off-by: Xichao Zhao <zhao.xichao@vivo.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-09-01 13:35:18 +02:00
Dongsheng Yang
2473577177 dm-pcache: cleanup: fix coding style report by checkpatch.pl
A patch from a few days ago fixed the division issue on 32-bit machines,
but it introduced a coding style problem.

WARNING: Missing a blank line after declarations
+       u32 rem;
+       div_u64_rem(off >> PCACHE_CACHE_SUBTREE_SIZE_SHIFT,
	cache->n_ksets, &rem);

total: 0 errors, 1 warnings, 634 lines checked

Signed-off-by: Dongsheng Yang <dongsheng.yang@linux.dev>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-09-01 13:30:54 +02:00
Dongsheng Yang
1f9ad14aef dm-pcache: remove ctrl_lock for pcache_cache_segment
The smatch checker reports a “scheduler in atomic context” problem in
the following call chain:

miss_read_end_req()
   -> cache_seg_put()
      -> cache_seg_invalidate()
         -> cache_seg_gen_increase()
            -> mutex_lock(&cache_seg->ctrl_lock);

In practice, this `mutex_lock` will not actually schedule, because it is
only called when `cache_seg_put()` drops the last reference, which is
single-threaded. That is also why the issue never shows up during real
testing.

However, the code is still buggy. The original purpose of `ctrl_lock`
was to prevent read/write conflicts on the cache segment control
information. Looking at the current usage, all control information
accesses are single-threaded: reads only occur during the init phase,
where no conflicts are possible, and writes happen once in the init
phase (also single-threaded) and once when `cache_seg_put()` drops the
last reference (again single-threaded).

Therefore, this patch removes `ctrl_lock` entirely and adds comments in
the appropriate places to document this logic.

Signed-off-by: Dongsheng Yang <dongsheng.yang@linux.dev>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-09-01 13:30:39 +02:00
Zheng Qixing
8d33a030c5 dm: fix NULL pointer dereference in __dm_suspend()
There is a race condition between dm device suspend and table load that
can lead to null pointer dereference. The issue occurs when suspend is
invoked before table load completes:

BUG: kernel NULL pointer dereference, address: 0000000000000054
Oops: 0000 [#1] PREEMPT SMP PTI
CPU: 6 PID: 6798 Comm: dmsetup Not tainted 6.6.0-g7e52f5f0ca9b #62
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.1-2.fc37 04/01/2014
RIP: 0010:blk_mq_wait_quiesce_done+0x0/0x50
Call Trace:
  <TASK>
  blk_mq_quiesce_queue+0x2c/0x50
  dm_stop_queue+0xd/0x20
  __dm_suspend+0x130/0x330
  dm_suspend+0x11a/0x180
  dev_suspend+0x27e/0x560
  ctl_ioctl+0x4cf/0x850
  dm_ctl_ioctl+0xd/0x20
  vfs_ioctl+0x1d/0x50
  __se_sys_ioctl+0x9b/0xc0
  __x64_sys_ioctl+0x19/0x30
  x64_sys_call+0x2c4a/0x4620
  do_syscall_64+0x9e/0x1b0

The issue can be triggered as below:

T1 						T2
dm_suspend					table_load
__dm_suspend					dm_setup_md_queue
						dm_mq_init_request_queue
						blk_mq_init_allocated_queue
						=> q->mq_ops = set->ops; (1)
dm_stop_queue / dm_wait_for_completion
=> q->tag_set NULL pointer!	(2)
						=> q->tag_set = set; (3)

Fix this by checking if a valid table (map) exists before performing
request-based suspend and waiting for target I/O. When map is NULL,
skip these table-dependent suspend steps.

Even when map is NULL, no I/O can reach any target because there is
no table loaded; I/O submitted in this state will fail early in the
DM layer. Skipping the table-dependent suspend logic in this case
is safe and avoids NULL pointer dereferences.

Fixes: c4576aed8d ("dm: fix request-based dm's use of dm_wait_for_completion")
Cc: stable@vger.kernel.org
Signed-off-by: Zheng Qixing <zhengqixing@huawei.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-08-26 16:46:52 +02:00
Zheng Qixing
7f597c2cdb dm: fix queue start/stop imbalance under suspend/load/resume races
When suspend and load run concurrently, before q->mq_ops is set in
blk_mq_init_allocated_queue(), __dm_suspend() skip dm_stop_queue(). As a
result, the queue's quiesce depth is not incremented.

Later, once table load has finished and __dm_resume() runs, which triggers
q->quiesce_depth ==0 warning in blk_mq_unquiesce_queue():
Call Trace:
 <TASK>
 dm_start_queue+0x16/0x20 [dm_mod]
 __dm_resume+0xac/0xb0 [dm_mod]
 dm_resume+0x12d/0x150 [dm_mod]
 do_resume+0x2c2/0x420 [dm_mod]
 dev_suspend+0x30/0x130 [dm_mod]
 ctl_ioctl+0x402/0x570 [dm_mod]
 dm_ctl_ioctl+0x23/0x30 [dm_mod]

Fix this by explicitly tracking whether the request queue was
stopped in __dm_suspend() via a new DMF_QUEUE_STOPPED flag.
Only call dm_start_queue() in __dm_resume() if the queue was
actually stopped.

Fixes: e70feb8b3e ("blk-mq: support concurrent queue quiesce/unquiesce")
Cc: stable@vger.kernel.org
Signed-off-by: Zheng Qixing <zhengqixing@huawei.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-08-26 16:46:32 +02:00
Dongsheng Yang
1d57628ff9 dm-pcache: add persistent cache target in device-mapper
This patch introduces dm-pcache, a new DM target that places a DAX-
capable persistent-memory device in front of any slower block device and
uses it as a high-throughput, low-latency  cache.

Design highlights
-----------------
- DAX data path – data is copied directly between DRAM and the pmem
  mapping, bypassing the block layer’s overhead.

- Segmented, crash-consistent layout
  - all layout metadata are dual-replicated CRC-protected.
  - atomic kset flushes; key replay on mount guarantees cache integrity
    even after power loss.

- Striped multi-tree index
  - Multi‑tree indexing for high parallelism.
  - overlap-resolution logic ensures non-intersecting cached extents.

- Background services
  - write-back worker flushes dirty keys in order, preserving backing-device
    crash consistency. This is important for checkpoint in cloud storage.
  - garbage collector reclaims clean segments when utilisation exceeds a
    tunable threshold.

- Data integrity – optional CRC32 on cached payload; metadata always protected.

Comparison with existing block-level caches
---------------------------------------------------------------------------------------------------------------------------------
| Feature                          | pcache (this patch)             | bcache                       | dm-writecache             |
|----------------------------------|---------------------------------|------------------------------|---------------------------|
| pmem access method               | DAX                             | bio (block I/O)              | DAX                       |
| Write latency (4 K rand-write)   | ~5 µs                           | ~20 µs                       | ~5 µs                     |
| Concurrency                      | multi subtree index             | global index tree            | single tree + wc_lock     |
| IOPS (4K randwrite, 32 numjobs)  | 2.1 M                           | 352 K                        | 283 K                     |
| Read-cache support               | YES                             | YES                          | NO                        |
| Deployment                       | no re-format of backend         | backend devices must be      | no re-format of backend   |
|                                  |                                 | reformatted                  |                           |
| Write-back ordering              | log-structured;                 | no ordering guarantee        | no ordering guarantee     |
|                                  | preserves app-IO-order          |                              |                           |
| Data integrity checks            | metadata + data CRC(optional)   | metadata CRC only            | none                      |
---------------------------------------------------------------------------------------------------------------------------------

Signed-off-by: Dongsheng Yang <dongsheng.yang@linux.dev>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-08-25 15:25:29 +02:00
Christoph Hellwig
499cbe0f2f dm error: mark as DM_TARGET_PASSES_INTEGRITY
Mark dm error as DM_TARGET_PASSES_INTEGRITY so that it can be stacked on
top of PI capable devices.  The claim is strictly speaking as lie as dm
error fails all I/O and doesn't pass anything on, but doing the same for
integrity I/O work just fine :)

This helps to make about two dozen xfstests test cases pass on PI capable
devices.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-08-19 11:12:50 +02:00
Qianfeng Rong
31a9403b0d dm bufio: remove redundant __GFP_NOWARN
GFP_NOWAIT already includes __GFP_NOWARN, so let's remove the redundant
__GFP_NOWARN.  Also update comments to clarify the flag semantics.

Signed-off-by: Qianfeng Rong <rongqianfeng@vivo.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-08-19 11:12:50 +02:00
Mikulas Patocka
d4f00a584b dm-ima: drop a useless argument
The "gfp_t flags" is always GFP_KERNEL, so it can be removed.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-08-19 11:12:50 +02:00
Qianfeng Rong
c3797175b3 dm: use vmalloc_array() to simplify code
Remove array_size() calls and replace vmalloc() with vmalloc_array() to
simplify the code.

Signed-off-by: Qianfeng Rong <rongqianfeng@vivo.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-08-19 11:12:50 +02:00
Dmitry Antipov
a3191475d1 dm-ima: more strlen() drops
Following commit ebbd17695e ("dm: ima: avoid extra calls to
strlen()"), convert 'dm_ima_alloc_and_copy_capacity_str()' to
return the number of characters emitted by 'scnprintf()' and
simplify the users accordingly. Compile tested only.

Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru>
Reviewed-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-08-19 11:12:49 +02:00
Zheng Qixing
b7ee30f0ef md: fix sync_action incorrect display during resync
During raid resync, if a disk becomes faulty, the operation is
briefly interrupted. The MD_RECOVERY_RECOVER flag triggered by
the disk failure causes sync_action to incorrectly show "recover"
instead of "resync". The same issue affects reshape operations.

Reproduction steps:
  mdadm -Cv /dev/md1 -l1 -n4 -e1.2 /dev/sd{a..d} // -> resync happened
  mdadm -f /dev/md1 /dev/sda                     // -> resync interrupted
  cat sync_action
  -> recover

Add progress checks in md_sync_action() for resync/recover/reshape
to ensure the interface correctly reports the actual operation type.

Fixes: 4b10a3bc67 ("md: ensure resync is prioritized over recovery")
Signed-off-by: Zheng Qixing <zhengqixing@huawei.com>
Link: https://lore.kernel.org/linux-raid/20250816002534.1754356-3-zhengqixing@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
2025-08-16 08:52:33 +08:00
Zheng Qixing
cb0780ad43 md: add helper rdev_needs_recovery()
Add a helper for checking if an rdev needs recovery.

Signed-off-by: Zheng Qixing <zhengqixing@huawei.com>
Link: https://lore.kernel.org/linux-raid/20250816002534.1754356-2-zhengqixing@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
2025-08-16 08:51:59 +08:00
Xiao Ni
c27973211f md: keep recovery_cp in mdp_superblock_s
commit 907a99c314 ("md: rename recovery_cp to resync_offset") replaces
recovery_cp with resync_offset in mdp_superblock_s which is in md_p.h.
md_p.h is used in userspace too. So mdadm building fails because of this.
This patch revert this change.

Fixes: 907a99c314 ("md: rename recovery_cp to resync_offset")
Signed-off-by: Xiao Ni <xni@redhat.com>
Link: https://lore.kernel.org/linux-raid/20250815040028.18085-1-xni@redhat.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
2025-08-16 08:47:38 +08:00
Xiao Ni
25db5f284f md: add legacy_async_del_gendisk mode
commit 9e59d60976 ("md: call del_gendisk in control path") changes the
async way to sync way of calling del_gendisk. But it breaks mdadm
--assemble command. The assemble command runs like this:
1. create the array
2. stop the array
3. access the sysfs files after stopping

The sync way calls del_gendisk in step 2, so all sysfs files are removed.
Now to avoid breaking mdadm assemble command, this patch adds the parameter
legacy_async_del_gendisk that can be used to choose which way. The default
is async way. In future, we plan to change default to sync way in kernel
7.0. Then users need to upgrade to mdadm 4.5+ which removes step 2.

Fixes: 9e59d60976 ("md: call del_gendisk in control path")
Reported-by: Mikulas Patocka <mpatocka@redhat.com>
Closes: https://lore.kernel.org/linux-raid/CAMw=ZnQ=ET2St-+hnhsuq34rRPnebqcXqP1QqaHW5Bh4aaaZ4g@mail.gmail.com/T/#t
Suggested-and-reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
Link: https://lore.kernel.org/linux-raid/20250813032929.54978-1-xni@redhat.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
2025-08-13 19:44:17 +08:00
Al Viro
4fc8728aa3 block: switch ->getgeo() to struct gendisk
Instances are happier that way and it makes more sense anyway -
the only part of the result that is related to partition we are given
is the start sector, and that has been filled in by the caller.

Everything else is a function of the disk.  Only one instance
(DASD) is ever looking at anything other than bdev->bd_disk and
that one is trivial to adjust.

Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-08-13 02:59:29 -04:00
Alexander Aring
bea90085dc dlm: use defines for force values in dlm_release_lockspace
Clarify the use of the force parameter by renaming it to
"release_option" and adding defines (with descriptions) for
each of the accepted values.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2025-08-12 11:22:49 -05:00
Mikulas Patocka
1071d560af dm-stripe: fix a possible integer overflow
There's a possible integer overflow in stripe_io_hints if we have too
large chunk size. Test if the overflow happened, and if it did, don't set
limits->io_min and limits->io_opt;

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Suggested-by: Dongsheng Yang <dongsheng.yang@linux.dev>
Cc: stable@vger.kernel.org
2025-08-11 13:17:32 +02:00
Linus Torvalds
2988dfed8a Merge tag 'block-6.17-20250808' of git://git.kernel.dk/linux
Pull more block updates from Jens Axboe:

 - MD pull request via Yu:
      - mddev null-ptr-dereference fix, by Erkun
      - md-cluster fail to remove the faulty disk regression fix, by
        Heming
      - minor cleanup, by Li Nan and Jinchao
      - mdadm lifetime regression fix reported by syzkaller, by Yu Kuai

 - MD pull request via Christoph
      - add support for getting the FDP featuee in fabrics passthru path
        (Nitesh Shetty)
      - add capability to connect to an administrative controller
        (Kamaljit Singh)
      - fix a leak on sgl setup error (Keith Busch)
      - initialize discovery subsys after debugfs is initialized
        (Mohamed Khalfella)
      - fix various comment typos (Bjorn Helgaas)
      - remove unneeded semicolons (Jiapeng Chong)

 - nvmet debugfs ordering issue fix

 - Fix UAF in the tag_set in zloop

 - Ensure sbitmap shallow depth covers entire set

 - Reduce lock roundtrips in io context lookup

 - Move scheduler tags alloc/free out of elevator and freeze lock, to
   fix some lockdep found issues

 - Improve robustness of queue limits checking

 - Fix a regression with IO priorities, if no io context exists

* tag 'block-6.17-20250808' of git://git.kernel.dk/linux: (26 commits)
  lib/sbitmap: make sbitmap_get_shallow() internal
  lib/sbitmap: convert shallow_depth from one word to the whole sbitmap
  nvmet: exit debugfs after discovery subsystem exits
  block, bfq: Reorder struct bfq_iocq_bfqq_data
  md: make rdev_addable usable for rcu mode
  md/raid1: remove struct pool_info and related code
  md/raid1: change r1conf->r1bio_pool to a pointer type
  block: ensure discard_granularity is zero when discard is not supported
  zloop: fix KASAN use-after-free of tag set
  block: Fix default IO priority if there is no IO context
  nvme: fix various comment typos
  nvme-auth: remove unneeded semicolon
  nvme-pci: fix leak on sgl setup error
  nvmet: initialize discovery subsys after debugfs is initialized
  nvme: add capability to connect to an administrative controller
  nvmet: add support for FDP in fabrics passthru path
  md: rename recovery_cp to resync_offset
  md/md-cluster: handle REMOVE message earlier
  md: fix create on open mddev lifetime regression
  block: fix potential deadlock while running nr_hw_queue update
  ...
2025-08-09 08:47:28 +03:00
Linus Torvalds
d632ab86af Merge tag 'for-6.17/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper updates from Mikulas Patocka:

 - fix checking for request-based stackable devices (dm-table)

 - fix corrupt_bio_byte setup checks (dm-flakey)

 - add support for resync w/o metadata devices (dm raid)

 - small code simplification (dm, dm-mpath, vm-vdo, dm-raid)

 - remove support for asynchronous hashes (dm-verity)

 - close smatch warning (dm-zoned-target)

 - update the documentation and enable inline-crypto passthrough
   (dm-thin)

* tag 'for-6.17/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm: set DM_TARGET_PASSES_CRYPTO feature for dm-thin
  dm-thin: update the documentation
  dm-raid: do not include dm-core.h
  vdo: omit need_resched() before cond_resched()
  md: dm-zoned-target: Initialize return variable r to avoid uninitialized use
  dm-verity: remove support for asynchronous hashes
  dm-mpath: don't print the "loaded" message if registering fails
  dm-mpath: make dm_unregister_path_selector return void
  dm: ima: avoid extra calls to strlen()
  dm: Simplify dm_io_complete()
  dm: Remove unnecessary return in dm_zone_endio()
  dm raid: add support for resync w/o metadata devices
  dm-flakey: Fix corrupt_bio_byte setup checks
  dm-table: fix checking for rq stackable devices
2025-08-04 08:58:53 -07:00
Yang Erkun
13017b4271 md: make rdev_addable usable for rcu mode
Our testcase trigger panic:

BUG: kernel NULL pointer dereference, address: 00000000000000e0
...
Oops: Oops: 0000 [#1] SMP NOPTI
CPU: 2 UID: 0 PID: 85 Comm: kworker/2:1 Not tainted 6.16.0+ #94
PREEMPT(none)
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
1.16.1-2.fc37 04/01/2014
Workqueue: md_misc md_start_sync
RIP: 0010:rdev_addable+0x4d/0xf0
...
Call Trace:
 <TASK>
 md_start_sync+0x329/0x480
 process_one_work+0x226/0x6d0
 worker_thread+0x19e/0x340
 kthread+0x10f/0x250
 ret_from_fork+0x14d/0x180
 ret_from_fork_asm+0x1a/0x30
 </TASK>
Modules linked in: raid10
CR2: 00000000000000e0
---[ end trace 0000000000000000 ]---
RIP: 0010:rdev_addable+0x4d/0xf0

md_spares_need_change in md_start_sync will call rdev_addable which
protected by rcu_read_lock/rcu_read_unlock. This rcu context will help
protect rdev won't be released, but rdev->mddev will be set to NULL
before we call synchronize_rcu in md_kick_rdev_from_array. Fix this by
using READ_ONCE and check does rdev->mddev still alive.

Fixes: bc08041b32 ("md: suspend array in md_start_sync() if array need reconfiguration")
Fixes: 570b9147de ("md: use RCU lock to protect traversal in md_spares_need_change()")
Signed-off-by: Yang Erkun <yangerkun@huawei.com>
Link: https://lore.kernel.org/linux-raid/20250731114530.776670-1-yangerkun@huawei.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
2025-08-03 13:08:18 +08:00
Wang Jinchao
178d1391c5 md/raid1: remove struct pool_info and related code
The struct pool_info was originally introduced mainly to support reshape
operations, serving as a parameter for mempool_init() when raid_disks
changes. Now that mempool_create_kmalloc_pool() is sufficient for this
purpose, struct pool_info and its related code are no longer needed.

Remove struct pool_info and all associated code.

Signed-off-by: Wang Jinchao <wangjinchao600@gmail.com>
Link: https://lore.kernel.org/linux-raid/20250707012711.376844-3-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
2025-08-03 13:08:12 +08:00
Wang Jinchao
987ca60637 md/raid1: change r1conf->r1bio_pool to a pointer type
In raid1_reshape(), newpool is a stack variable.
mempool_init() initializes newpool->wait with the stack address.
After assigning newpool to conf->r1bio_pool, the wait queue
need to be reinitialized, which is not ideal.

Change raid1_conf->r1bio_pool to a pointer type and
replace mempool_init() with mempool_create_kmalloc_pool() to
avoid referencing a stack-based wait queue.

Signed-off-by: Wang Jinchao <wangjinchao600@gmail.com>
Link: https://lore.kernel.org/linux-raid/20250707012711.376844-2-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
2025-08-03 13:08:02 +08:00
Linus Torvalds
beace86e61 Merge tag 'mm-stable-2025-07-30-15-25' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
 "As usual, many cleanups. The below blurbiage describes 42 patchsets.
  21 of those are partially or fully cleanup work. "cleans up",
  "cleanup", "maintainability", "rationalizes", etc.

  I never knew the MM code was so dirty.

  "mm: ksm: prevent KSM from breaking merging of new VMAs" (Lorenzo Stoakes)
     addresses an issue with KSM's PR_SET_MEMORY_MERGE mode: newly
     mapped VMAs were not eligible for merging with existing adjacent
     VMAs.

  "mm/damon: introduce DAMON_STAT for simple and practical access monitoring" (SeongJae Park)
     adds a new kernel module which simplifies the setup and usage of
     DAMON in production environments.

  "stop passing a writeback_control to swap/shmem writeout" (Christoph Hellwig)
     is a cleanup to the writeback code which removes a couple of
     pointers from struct writeback_control.

  "drivers/base/node.c: optimization and cleanups" (Donet Tom)
     contains largely uncorrelated cleanups to the NUMA node setup and
     management code.

  "mm: userfaultfd: assorted fixes and cleanups" (Tal Zussman)
     does some maintenance work on the userfaultfd code.

  "Readahead tweaks for larger folios" (Ryan Roberts)
     implements some tuneups for pagecache readahead when it is reading
     into order>0 folios.

  "selftests/mm: Tweaks to the cow test" (Mark Brown)
     provides some cleanups and consistency improvements to the
     selftests code.

  "Optimize mremap() for large folios" (Dev Jain)
     does that. A 37% reduction in execution time was measured in a
     memset+mremap+munmap microbenchmark.

  "Remove zero_user()" (Matthew Wilcox)
     expunges zero_user() in favor of the more modern memzero_page().

  "mm/huge_memory: vmf_insert_folio_*() and vmf_insert_pfn_pud() fixes" (David Hildenbrand)
     addresses some warts which David noticed in the huge page code.
     These were not known to be causing any issues at this time.

  "mm/damon: use alloc_migrate_target() for DAMOS_MIGRATE_{HOT,COLD" (SeongJae Park)
     provides some cleanup and consolidation work in DAMON.

  "use vm_flags_t consistently" (Lorenzo Stoakes)
     uses vm_flags_t in places where we were inappropriately using other
     types.

  "mm/memfd: Reserve hugetlb folios before allocation" (Vivek Kasireddy)
     increases the reliability of large page allocation in the memfd
     code.

  "mm: Remove pXX_devmap page table bit and pfn_t type" (Alistair Popple)
     removes several now-unneeded PFN_* flags.

  "mm/damon: decouple sysfs from core" (SeongJae Park)
     implememnts some cleanup and maintainability work in the DAMON
     sysfs layer.

  "madvise cleanup" (Lorenzo Stoakes)
     does quite a lot of cleanup/maintenance work in the madvise() code.

  "madvise anon_name cleanups" (Vlastimil Babka)
     provides additional cleanups on top or Lorenzo's effort.

  "Implement numa node notifier" (Oscar Salvador)
     creates a standalone notifier for NUMA node memory state changes.
     Previously these were lumped under the more general memory
     on/offline notifier.

  "Make MIGRATE_ISOLATE a standalone bit" (Zi Yan)
     cleans up the pageblock isolation code and fixes a potential issue
     which doesn't seem to cause any problems in practice.

  "selftests/damon: add python and drgn based DAMON sysfs functionality tests" (SeongJae Park)
     adds additional drgn- and python-based DAMON selftests which are
     more comprehensive than the existing selftest suite.

  "Misc rework on hugetlb faulting path" (Oscar Salvador)
     fixes a rather obscure deadlock in the hugetlb fault code and
     follows that fix with a series of cleanups.

  "cma: factor out allocation logic from __cma_declare_contiguous_nid" (Mike Rapoport)
     rationalizes and cleans up the highmem-specific code in the CMA
     allocator.

  "mm/migration: rework movable_ops page migration (part 1)" (David Hildenbrand)
     provides cleanups and future-preparedness to the migration code.

  "mm/damon: add trace events for auto-tuned monitoring intervals and DAMOS quota" (SeongJae Park)
     adds some tracepoints to some DAMON auto-tuning code.

  "mm/damon: fix misc bugs in DAMON modules" (SeongJae Park)
     does that.

  "mm/damon: misc cleanups" (SeongJae Park)
     also does what it claims.

  "mm: folio_pte_batch() improvements" (David Hildenbrand)
     cleans up the large folio PTE batching code.

  "mm/damon/vaddr: Allow interleaving in migrate_{hot,cold} actions" (SeongJae Park)
     facilitates dynamic alteration of DAMON's inter-node allocation
     policy.

  "Remove unmap_and_put_page()" (Vishal Moola)
     provides a couple of page->folio conversions.

  "mm: per-node proactive reclaim" (Davidlohr Bueso)
     implements a per-node control of proactive reclaim - beyond the
     current memcg-based implementation.

  "mm/damon: remove damon_callback" (SeongJae Park)
     replaces the damon_callback interface with a more general and
     powerful damon_call()+damos_walk() interface.

  "mm/mremap: permit mremap() move of multiple VMAs" (Lorenzo Stoakes)
     implements a number of mremap cleanups (of course) in preparation
     for adding new mremap() functionality: newly permit the remapping
     of multiple VMAs when the user is specifying MREMAP_FIXED. It still
     excludes some specialized situations where this cannot be performed
     reliably.

  "drop hugetlb_free_pgd_range()" (Anthony Yznaga)
     switches some sparc hugetlb code over to the generic version and
     removes the thus-unneeded hugetlb_free_pgd_range().

  "mm/damon/sysfs: support periodic and automated stats update" (SeongJae Park)
     augments the present userspace-requested update of DAMON sysfs
     monitoring files. Automatic update is now provided, along with a
     tunable to control the update interval.

  "Some randome fixes and cleanups to swapfile" (Kemeng Shi)
     does what is claims.

  "mm: introduce snapshot_page" (Luiz Capitulino and David Hildenbrand)
     provides (and uses) a means by which debug-style functions can grab
     a copy of a pageframe and inspect it locklessly without tripping
     over the races inherent in operating on the live pageframe
     directly.

  "use per-vma locks for /proc/pid/maps reads" (Suren Baghdasaryan)
     addresses the large contention issues which can be triggered by
     reads from that procfs file. Latencies are reduced by more than
     half in some situations. The series also introduces several new
     selftests for the /proc/pid/maps interface.

  "__folio_split() clean up" (Zi Yan)
     cleans up __folio_split()!

  "Optimize mprotect() for large folios" (Dev Jain)
     provides some quite large (>3x) speedups to mprotect() when dealing
     with large folios.

  "selftests/mm: reuse FORCE_READ to replace "asm volatile("" : "+r" (XXX));" and some cleanup" (wang lian)
     does some cleanup work in the selftests code.

  "tools/testing: expand mremap testing" (Lorenzo Stoakes)
     extends the mremap() selftest in several ways, including adding
     more checking of Lorenzo's recently added "permit mremap() move of
     multiple VMAs" feature.

  "selftests/damon/sysfs.py: test all parameters" (SeongJae Park)
     extends the DAMON sysfs interface selftest so that it tests all
     possible user-requested parameters. Rather than the present minimal
     subset"

* tag 'mm-stable-2025-07-30-15-25' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (370 commits)
  MAINTAINERS: add missing headers to mempory policy & migration section
  MAINTAINERS: add missing file to cgroup section
  MAINTAINERS: add MM MISC section, add missing files to MISC and CORE
  MAINTAINERS: add missing zsmalloc file
  MAINTAINERS: add missing files to page alloc section
  MAINTAINERS: add missing shrinker files
  MAINTAINERS: move memremap.[ch] to hotplug section
  MAINTAINERS: add missing mm_slot.h file THP section
  MAINTAINERS: add missing interval_tree.c to memory mapping section
  MAINTAINERS: add missing percpu-internal.h file to per-cpu section
  mm/page_alloc: remove trace_mm_alloc_contig_migrate_range_info()
  selftests/damon: introduce _common.sh to host shared function
  selftests/damon/sysfs.py: test runtime reduction of DAMON parameters
  selftests/damon/sysfs.py: test non-default parameters runtime commit
  selftests/damon/sysfs.py: generalize DAMON context commit assertion
  selftests/damon/sysfs.py: generalize monitoring attributes commit assertion
  selftests/damon/sysfs.py: generalize DAMOS schemes commit assertion
  selftests/damon/sysfs.py: test DAMOS filters commitment
  selftests/damon/sysfs.py: generalize DAMOS scheme commit assertion
  selftests/damon/sysfs.py: test DAMOS destinations commitment
  ...
2025-07-31 14:57:54 -07:00
LongPing Wei
55a0fbd2ac dm: set DM_TARGET_PASSES_CRYPTO feature for dm-thin
dm-thin obviously can pass through inline crypto support.

Signed-off-by: LongPing Wei <weilongping@oppo.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-07-31 16:30:02 +02:00
Pavel Tikhomirov
8d05316d79 dm-raid: do not include dm-core.h
In commit 4cc96131af ("dm: move request-based code out to dm-rq.[hc]")
we have a note: "DM targets should _never_ include dm-core.h!". And it
is not used in any DM targets except dm-raid now, so let's remove it
from dm-raid for consistency, also use special helpers instead of
accessing dm_table and mapper_device fields directly. This change is
merely a cleanup and should not affect functionality.

Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-07-31 15:39:55 +02:00
Mikulas Patocka
225b2cb640 vdo: omit need_resched() before cond_resched()
There's no need to call need_resched() because cond_resched() will do
nothing if need_resched() returns false.

Reviewed-by: Matthew Sakai <msakai@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-07-31 15:39:55 +02:00
Purva Yeshi
487767bff5 md: dm-zoned-target: Initialize return variable r to avoid uninitialized use
Fix Smatch-detected error:
drivers/md/dm-zoned-target.c:1073 dmz_iterate_devices()
error: uninitialized symbol 'r'.

Smatch detects a possible use of the uninitialized variable 'r' in
dmz_iterate_devices() because if dmz->nr_ddevs is zero, the loop is
skipped and 'r' is returned without being set, leading to undefined
behavior.

Initialize 'r' to 0 before the loop. This ensures that if there are no
devices to iterate over, the function still returns a defined value.

Signed-off-by: Purva Yeshi <purvayeshi550@gmail.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-07-31 15:39:55 +02:00
Eric Biggers
bdf253d580 dm-verity: remove support for asynchronous hashes
The support for asynchronous hashes in dm-verity has outlived its
usefulness.  It adds significant code complexity and opportunity for
bugs.  I don't know of anyone using it in practice.  (The original
submitter of the code possibly was, but that was 8 years ago.)  Data I
recently collected for en/decryption shows that using off-CPU crypto
"accelerators" is consistently much slower than the CPU
(https://lore.kernel.org/r/20250704070322.20692-1-ebiggers@kernel.org/),
even on CPUs that lack dedicated cryptographic instructions.  Similar
results are likely to be seen for hashing.

I already removed support for asynchronous hashes from fsverity two
years ago, and no one ever complained.

Moreover, neither dm-verity, fsverity, nor fscrypt has ever actually
used the asynchronous crypto algorithms in a truly asynchronous manner.
The lack of interest in such optimizations provides further evidence
that it's only the CPU-based crypto that actually matters.

Historically, it's also been common for people to forget to enable the
optimized SHA-256 code, which could contribute to an off-CPU crypto
engine being perceived as more useful than it really is.  In 6.16 I
fixed that: the optimized SHA-256 code is now enabled by default.

Therefore, let's drop the support for asynchronous hashes in dm-verity.

Tested with verity-compat-test.

Acked-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-07-31 15:39:55 +02:00
Li Nan
907a99c314 md: rename recovery_cp to resync_offset
'recovery_cp' was used to represent the progress of sync, but its name
contains recovery, which can cause confusion. Replaces 'recovery_cp'
with 'resync_offset' for clarity.

Signed-off-by: Li Nan <linan122@huawei.com>
Link: https://lore.kernel.org/linux-raid/20250722033340.1933388-1-linan666@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
2025-07-31 01:26:04 +08:00
Heming Zhao
948b1fe120 md/md-cluster: handle REMOVE message earlier
Commit a1fd37f978 ("md: Don't wait for MD_RECOVERY_NEEDED for
HOT_REMOVE_DISK ioctl") introduced a regression in the md_cluster
module. (Failed cases 02r1_Manage_re-add & 02r10_Manage_re-add)

Consider a 2-node cluster:
- node1 set faulty & remove command on a disk.
- node2 must correctly update the array metadata.

Before a1fd37f978, on node1, the delay between msg:METADATA_UPDATED
(triggered by faulty) and msg:REMOVE was sufficient for node2 to
reload the disk info (written by node1).
After a1fd37f978, node1 no longer waits between faulty and remove,
causing it to send msg:REMOVE while node2 is still reloading disk info.
This often results in node2 failing to remove the faulty disk.

== how to trigger ==

set up a 2-node cluster (node1 & node2) with disks vdc & vdd.

on node1:
mdadm -CR /dev/md0 -l1 -b clustered -n2 /dev/vdc /dev/vdd --assume-clean
ssh node2-ip mdadm -A /dev/md0 /dev/vdc /dev/vdd
mdadm --manage /dev/md0 --fail /dev/vdc --remove /dev/vdc

check array status on both nodes with "mdadm -D /dev/md0".
node1 output:
    Number   Major   Minor   RaidDevice State
       -       0        0        0      removed
       1     254       48        1      active sync   /dev/vdd
node2 output:
    Number   Major   Minor   RaidDevice State
       -       0        0        0      removed
       1     254       48        1      active sync   /dev/vdd

       0     254       32        -      faulty   /dev/vdc

Fixes: a1fd37f978 ("md: Don't wait for MD_RECOVERY_NEEDED for HOT_REMOVE_DISK ioctl")
Signed-off-by: Heming Zhao <heming.zhao@suse.com>
Reviewed-by: Su Yue <glass.su@suse.com>
Link: https://lore.kernel.org/linux-raid/20250728042145.9989-1-heming.zhao@suse.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
2025-07-31 01:23:19 +08:00
Yu Kuai
1df1fc845d md: fix create on open mddev lifetime regression
Commit 9e59d60976 ("md: call del_gendisk in control path") moves
setting MD_DELETED from __mddev_put() to do_md_stop(), however, for the
case create on open, mddev can be freed without do_md_stop():

1) open

md_probe
 md_alloc_and_put
  md_alloc
   mddev_alloc
   atomic_set(&mddev->active, 1);
   mddev->hold_active = UNTIL_IOCTL
  mddev_put
   atomic_dec_and_test(&mddev->active)
    if (mddev->hold_active)
    -> active is 0, hold_active is set
md_open
 mddev_get
  atomic_inc(&mddev->active);

2) ioctl that is not STOP_ARRAY, for example, GET_ARRAY_INFO:

md_ioctl
 mddev->hold_active = 0

3) close

md_release
 mddev_put(mddev);
  atomic_dec_and_lock(&mddev->active, &all_mddevs_lock)
  __mddev_put
  -> hold_active is cleared, mddev will be freed
  queue_work(md_misc_wq, &mddev->del_work)

Now that MD_DELETED is not set, before mddev is freed by
mddev_delayed_delete(), md_open can still succeed and break mddev
lifetime, causing mddev->kobj refcount underflow or mddev uaf
problem.

Fix this problem by setting MD_DELETED before queuing del_work.

Reported-by: syzbot+9921e319bd6168140b40@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/68894408.a00a0220.26d0e1.0012.GAE@google.com/
Reported-by: syzbot+fa3a12519f0d3fd4ec16@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/68894408.a00a0220.26d0e1.0013.GAE@google.com/
Fixes: 9e59d60976 ("md: call del_gendisk in control path")
Link: https://lore.kernel.org/linux-raid/20250730073321.2583158-1-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
Reviewed-by: Xiao Ni <xni@redhat.com>
2025-07-31 01:21:43 +08:00
Linus Torvalds
6e11664f14 Merge tag 'for-6.17/block-20250728' of git://git.kernel.dk/linux
Pull block updates from Jens Axboe:

 - MD pull request via Yu:
      - call del_gendisk synchronously (Xiao)
      - cleanup unused variable (John)
      - cleanup workqueue flags (Ryo)
      - fix faulty rdev can't be removed during resync (Qixing)

 - NVMe pull request via Christoph:
      - try PCIe function level reset on init failure (Keith Busch)
      - log TLS handshake failures at error level (Maurizio Lombardi)
      - pci-epf: do not complete commands twice if nvmet_req_init()
        fails (Rick Wertenbroek)
      - misc cleanups (Alok Tiwari)

 - Removal of the pktcdvd driver

   This has been more than a decade coming at this point, and some
   recently revealed breakages that had it causing issues even for cases
   where it isn't required made me re-pull the trigger on this one. It's
   known broken and nobody has stepped up to maintain the code

 - Series for ublk supporting batch commands, enabling the use of
   multishot where appropriate

 - Speed up ublk exit handling

 - Fix for the two-stage elevator fixing which could leak data

 - Convert NVMe to use the new IOVA based API

 - Increase default max transfer size to something more reasonable

 - Series fixing write operations on zoned DM devices

 - Add tracepoints for zoned block device operations

 - Prep series working towards improving blk-mq queue management in the
   presence of isolated CPUs

 - Don't allow updating of the block size of a loop device that is
   currently under exclusively ownership/open

 - Set chunk sectors from stacked device stripe size and use it for the
   atomic write size limit

 - Switch to folios in bcache read_super()

 - Fix for CD-ROM MRW exit flush handling

 - Various tweaks, fixes, and cleanups

* tag 'for-6.17/block-20250728' of git://git.kernel.dk/linux: (94 commits)
  block: restore two stage elevator switch while running nr_hw_queue update
  cdrom: Call cdrom_mrw_exit from cdrom_release function
  sunvdc: Balance device refcount in vdc_port_mpgroup_check
  nvme-pci: try function level reset on init failure
  dm: split write BIOs on zone boundaries when zone append is not emulated
  block: use chunk_sectors when evaluating stacked atomic write limits
  dm-stripe: limit chunk_sectors to the stripe size
  md/raid10: set chunk_sectors limit
  md/raid0: set chunk_sectors limit
  block: sanitize chunk_sectors for atomic write limits
  ilog2: add max_pow_of_two_factor()
  nvmet: pci-epf: Do not complete commands twice if nvmet_req_init() fails
  nvme-tcp: log TLS handshake failures at error level
  docs: nvme: fix grammar in nvme-pci-endpoint-target.rst
  nvme: fix typo in status code constant for self-test in progress
  nvmet: remove redundant assignment of error code in nvmet_ns_enable()
  nvme: fix incorrect variable in io cqes error message
  nvme: fix multiple spelling and grammar issues in host drivers
  block: fix blk_zone_append_update_request_bio() kernel-doc
  md/raid10: fix set but not used variable in sync_request_write()
  ...
2025-07-28 16:43:54 -07:00
Linus Torvalds
cec40a7c80 Merge tag 'vfs-6.17-rc1.integrity' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs 'protection info' updates from Christian Brauner:
 "This adds the new FS_IOC_GETLBMD_CAP ioctl() to query metadata and
  protection info (PI) capabilities. This ioctl returns information
  about the files integrity profile. This is useful for userspace
  applications to understand a files end-to-end data protection support
  and configure the I/O accordingly.

  For now this interface is only supported by block devices. However the
  design and placement of this ioctl in generic FS ioctl space allows us
  to extend it to work over files as well. This maybe useful when
  filesystems start supporting PI-aware layouts.

  A new structure struct logical_block_metadata_cap is introduced, which
  contains the following fields:

   - lbmd_flags:
     bitmask of logical block metadata capability flags

   - lbmd_interval:
     the amount of data described by each unit of logical block metadata

   - lbmd_size:
     size in bytes of the logical block metadata associated with each
     interval

   - lbmd_opaque_size:
     size in bytes of the opaque block tag associated with each interval

   - lbmd_opaque_offset:
     offset in bytes of the opaque block tag within the logical block
     metadata

   - lbmd_pi_size:
     size in bytes of the T10 PI tuple associated with each interval

   - lbmd_pi_offset:
     offset in bytes of T10 PI tuple within the logical block metadata

   - lbmd_pi_guard_tag_type:
     T10 PI guard tag type

   - lbmd_pi_app_tag_size:
     size in bytes of the T10 PI application tag

   - lbmd_pi_ref_tag_size:
     size in bytes of the T10 PI reference tag

   - lbmd_pi_storage_tag_size:
     size in bytes of the T10 PI storage tag

  The internal logic to fetch the capability is encapsulated in a helper
  function blk_get_meta_cap(), which uses the blk_integrity profile
  associated with the device. The ioctl returns -EOPNOTSUPP, if
  CONFIG_BLK_DEV_INTEGRITY is not enabled"

* tag 'vfs-6.17-rc1.integrity' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  block: fix lbmd_guard_tag_type assignment in FS_IOC_GETLBMD_CAP
  block: fix FS_IOC_GETLBMD_CAP parsing in blkdev_common_ioctl()
  fs: add ioctl to query metadata and protection info capabilities
  nvme: set pi_offset only when checksum type is not BLK_INTEGRITY_CSUM_NONE
  block: introduce pi_tuple_size field in blk_integrity
  block: rename tuple_size field in blk_integrity to metadata_size
2025-07-28 15:12:00 -07:00
Linus Torvalds
278c7d9b5e Merge tag 'vfs-6.17-rc1.fallocate' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull fallocate updates from Christian Brauner:
 "fallocate() currently supports creating preallocated files
  efficiently. However, on most filesystems fallocate() will preallocate
  blocks in an unwriten state even if FALLOC_FL_ZERO_RANGE is specified.

  The extent state must later be converted to a written state when the
  user writes data into this range, which can trigger numerous metadata
  changes and journal I/O. This may leads to significant write
  amplification and performance degradation in synchronous write mode.

  At the moment, the only method to avoid this is to create an empty
  file and write zero data into it (for example, using 'dd' with a large
  block size). However, this method is slow and consumes a considerable
  amount of disk bandwidth.

  Now that more and more flash-based storage devices are available it is
  possible to efficiently write zeros to SSDs using the unmap write
  zeroes command if the devices do not write physical zeroes to the
  media.

  For example, if SCSI SSDs support the UMMAP bit or NVMe SSDs support
  the DEAC bit[1], the write zeroes command does not write actual data
  to the device, instead, NVMe converts the zeroed range to a
  deallocated state, which works fast and consumes almost no disk write
  bandwidth.

  This series implements the BLK_FEAT_WRITE_ZEROES_UNMAP feature and
  BLK_FLAG_WRITE_ZEROES_UNMAP_DISABLED flag for SCSI, NVMe and
  device-mapper drivers, and add the FALLOC_FL_WRITE_ZEROES and
  STATX_ATTR_WRITE_ZEROES_UNMAP support for ext4 and raw bdev devices.

  fallocate() is subsequently extended with the FALLOC_FL_WRITE_ZEROES
  flag. FALLOC_FL_WRITE_ZEROES zeroes a specified file range in such a
  way that subsequent writes to that range do not require further
  changes to the file mapping metadata. This flag is beneficial for
  subsequent pure overwriting within this range, as it can save on block
  allocation and, consequently, significant metadata changes"

* tag 'vfs-6.17-rc1.fallocate' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  ext4: add FALLOC_FL_WRITE_ZEROES support
  block: add FALLOC_FL_WRITE_ZEROES support
  block: factor out common part in blkdev_fallocate()
  fs: introduce FALLOC_FL_WRITE_ZEROES to fallocate
  dm: clear unmap write zeroes limits when disabling write zeroes
  scsi: sd: set max_hw_wzeroes_unmap_sectors if device supports SD_ZERO_*_UNMAP
  nvmet: set WZDS and DRB if device enables unmap write zeroes operation
  nvme: set max_hw_wzeroes_unmap_sectors if device supports DEAC bit
  block: introduce max_{hw|user}_wzeroes_unmap_sectors to queue limits
2025-07-28 13:36:49 -07:00
Jens Axboe
c20413b799 Merge tag 'md-6.17-20250722' of https://git.kernel.org/pub/scm/linux/kernel/git/mdraid/linux into for-6.17/block
Pull MD updates from Yu:

"- call del_gendisk synchronously, from Xiao
 - cleanup unused variable, from John
 - cleanup workqueue flags, from Ryo
 - fix faulty rdev can't be removed during resync, from Qixing"

* tag 'md-6.17-20250722' of https://git.kernel.org/pub/scm/linux/kernel/git/mdraid/linux:
  md/raid10: fix set but not used variable in sync_request_write()
  md: allow removing faulty rdev during resync
  md/raid5: unset WQ_CPU_INTENSIVE for raid5 unbound workqueue
  md: remove/add redundancy group only in level change
  md: Don't clear MD_CLOSING until mddev is freed
  md: call del_gendisk in control path
2025-07-22 04:48:52 -06:00
Shin'ichiro Kawasaki
675f940576 dm: split write BIOs on zone boundaries when zone append is not emulated
Commit 2df7168717 ("dm: Always split write BIOs to zoned device
limits") updates the device-mapper driver to perform splits for the
write BIOs. However, it did not address the cases where DM targets do
not emulate zone append, such as in the cases of dm-linear or dm-flakey.
For these targets, when the write BIOs span across zone boundaries, they
trigger WARN_ON_ONCE(bio_straddles_zones(bio)) in
blk_zone_wplug_handle_write(). This results in I/O errors. The errors
are reproduced by running blktests test case zbd/004 using zoned
dm-linear or dm-flakey devices.

To avoid the I/O errors, handle the write BIOs regardless whether DM
targets emulate zone append or not, so that all write BIOs are split at
zone boundaries. For that purpose, drop the check for zone append
emulation in dm_zone_bio_needs_split(). Its argument 'md' is no longer
used then drop it also.

Fixes: 2df7168717 ("dm: Always split write BIOs to zoned device limits")
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Mikulas Patocka <mpatocka@redhat.com>
Link: https://lore.kernel.org/r/20250717103539.37279-1-shinichiro.kawasaki@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-17 06:03:03 -06:00
John Garry
5fb9d4341b dm-stripe: limit chunk_sectors to the stripe size
Same as done for raid0, set chunk_sectors limit to appropriately set the
atomic write size limit.

Setting chunk_sectors limit in this way overrides the stacked limit
already calculated based on the bottom device limits. This is ok, as
when any bios are sent to the bottom devices, the block layer will still
respect the bottom device chunk_sectors.

Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20250711105258.3135198-6-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-17 06:01:16 -06:00
John Garry
7ef50c4c6a md/raid10: set chunk_sectors limit
Same as done for raid0, set chunk_sectors limit to appropriately set the
atomic write size limit.

Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20250711105258.3135198-5-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-17 06:01:16 -06:00
John Garry
4b8beba60d md/raid0: set chunk_sectors limit
Currently we use min io size as the chunk size when deciding on the
atomic write size limits - see blk_stack_atomic_writes_head().

The limit min_io size is not a reliable value to store the chunk size, as
this may be mutated by the block stacking code. Such an example would be
for the min io size less than the physical block size, and the min io size
is raised to the physical block size - see blk_stack_limits().

The block stacking limits will rely on chunk_sectors in future,
so set this value (to the chunk size).

Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20250711105258.3135198-4-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-17 06:01:16 -06:00