8741 Commits

Author SHA1 Message Date
Linus Torvalds
6a1636e066 Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
 "The only core fix is in doc; all the others are in drivers, with the
  biggest impacts in libsas being the rollback on error handling and in
  ufs coming from a couple of error handling fixes, one causing a crash
  if it's activated before scanning and the other fixing W-LUN
  resumption"

* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
  scsi: ufs: qcom: Fix confusing cleanup.h syntax
  scsi: libsas: Add rollback handling when an error occurs
  scsi: device_handler: Return error pointer in scsi_dh_attached_handler_name()
  scsi: ufs: core: Fix a deadlock in the frequency scaling code
  scsi: ufs: core: Fix an error handler crash
  scsi: Revert "scsi: libsas: Fix exp-attached device scan after probe failure scanned in again after probe failed"
  scsi: ufs: core: Fix RPMB link error by reversing Kconfig dependencies
  scsi: qla4xxx: Use time conversion macros
  scsi: qla2xxx: Enable/disable IRQD_NO_BALANCING during reset
  scsi: ipr: Enable/disable IRQD_NO_BALANCING during reset
  scsi: imm: Fix use-after-free bug caused by unfinished delayed work
  scsi: target: sbp: Remove KMSG_COMPONENT macro
  scsi: core: Correct documentation for scsi_device_quiesce()
  scsi: mpi3mr: Prevent duplicate SAS/SATA device entries in channel 1
  scsi: target: Reset t_task_cdb pointer in error case
  scsi: ufs: core: Fix EH failure after W-LUN resume error
2025-12-14 15:35:35 +12:00
Linus Torvalds
35ebee7e72 Merge tag 'block-6.19-20251211' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull block fixes from Jens Axboe:

 - Always initialize DMA state, fixing a potentially nasty issue on the
   block side

 - btrfs zoned write fix with cached zone reports

 - Fix corruption issues in bcache with chained bio's, and further make
   it clear that the chained IO handler is simply a marker, it's not
   code meant to be executed

 - Kill old code dealing with synchronous IO polling in the block layer,
   that has been dead for a long time. Only async polling is supported
   these days

 - Fix a lockdep issue in tag_set management, moving it to RCU

 - Fix an issue with ublks bio_vec iteration

 - Don't unconditionally enforce blocking issue of ublk control
   commands, allow some of them with non-blocking issue as they
   do not block

* tag 'block-6.19-20251211' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
  blk-mq-dma: always initialize dma state
  blk-mq: delete task running check in blk_hctx_poll()
  block: fix cached zone reports on devices with native zone append
  block: Use RCU in blk_mq_[un]quiesce_tagset() instead of set->tag_list_lock
  ublk: don't mutate struct bio_vec in iteration
  block: prohibit calls to bio_chain_endio
  bcache: fix improper use of bi_end_io
  ublk: allow non-blocking ctrl cmds in IO_URING_F_NONBLOCK issue
2025-12-12 22:04:18 +12:00
Linus Torvalds
d358e52546 Merge tag 'for-6.19/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper updates from Mikulas Patocka:

 - convert crypto_shash users to direct crypto library use with simpler
   and faster code and reduced stack usage (Eric Biggers):

     - the dm-verity SHA-256 conversion also teaches it to do two-way
       interleaved hashing for added performance

     - dm-crypt MD5 conversion (used for Loop-AES compatibility)

 - added document for for takeover/reshape raid1 -> raid5 examples (Heinz Mauelshagen)

 - fix dm-vdo kerneldoc warnings (Matthew Sakai)

 - various random fixes and cleanups

* tag 'for-6.19/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (29 commits)
  dm pcache: fix segment info indexing
  dm pcache: fix cache info indexing
  dm-pcache: advance slot index before writing slot
  dm raid: add documentation for takeover/reshape raid1 -> raid5 table line examples
  dm log-writes: Add missing set_freezable() for freezable kthread
  dm-raid: fix possible NULL dereference with undefined raid type
  dm-snapshot: fix 'scheduling while atomic' on real-time kernels
  dm: ignore discard return value
  MAINTAINERS: add Benjamin Marzinski as a device mapper maintainer
  dm-mpath: Simplify the setup_scsi_dh code
  dm vdo: fix kerneldoc warnings
  dm-bufio: align write boundary on physical block size
  dm-crypt: enable DM_TARGET_ATOMIC_WRITES
  dm: test for REQ_ATOMIC in dm_accept_partial_bio()
  dm-verity: remove useless mempool
  dm-verity: disable recursive forward error correction
  dm-ebs: Mark full buffer dirty even on partial write
  dm mpath: enable DM_TARGET_ATOMIC_WRITES
  dm verity fec: Expose corrected block count via status
  dm: Don't warn if IMA_DISABLE_HTABLE is not enabled
  ...
2025-12-11 12:13:29 +09:00
Li Chen
13ea55ea20 dm pcache: fix segment info indexing
Segment info indexing also used sizeof(struct) instead of the
4K metadata stride, so info_index could point between slots and
subsequent writes would advance incorrectly. Derive info_index
from the pointer returned by the segment meta search using
PCACHE_SEG_INFO_SIZE and advance to the next slot for future
updates.

Signed-off-by: Li Chen <chenl311@chinatelecom.cn>
Signed-off-by: Dongsheng Yang <dongsheng.yang@linux.dev>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Reviewed-by: Zheng Gu <cengku@gmail.com>
Cc: stable@vger.kernel.org	# 6.18
2025-12-10 19:28:23 +01:00
Li Chen
ee76331783 dm pcache: fix cache info indexing
The on-media cache_info index used sizeof(struct) instead of the
4K metadata stride, so gc_percent updates from dmsetup message
were written between slots and lost after reboot. Use
PCACHE_CACHE_INFO_SIZE in get_cache_info_addr() and align
info_index with the slot returned by pcache_meta_find_latest().

Signed-off-by: Li Chen <chenl311@chinatelecom.cn>
Signed-off-by: Dongsheng Yang <dongsheng.yang@linux.dev>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Reviewed-by: Zheng Gu <cengku@gmail.com>
Cc: stable@vger.kernel.org	# 6.18
2025-12-10 19:28:23 +01:00
Dongsheng Yang
ebbb90344a dm-pcache: advance slot index before writing slot
In dm-pcache, in order to ensure crash-consistency, a dual-copy scheme
is used to alternately update metadata, and there is a slot index that
records the current slot. However, in the write path the current
implementation writes directly to the current slot indexed by slot
index, and then advances the slot — which ends up overwriting the
existing slot, violating the crash-consistency guarantee.

This patch fixes that behavior, preventing metadata from being
overwritten incorrectly.

In addition, this patch add a missing pmem_wmb() after memcpy_flushcache().

Signed-off-by: Dongsheng Yang <dongsheng.yang@linux.dev>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Reviewed-by: Zheng Gu <cengku@gmail.com>
Cc: stable@vger.kernel.org	# 6.18
2025-12-10 19:28:23 +01:00
Haotian Zhang
ab08f9c8b3 dm log-writes: Add missing set_freezable() for freezable kthread
The log_writes_kthread() calls try_to_freeze() but lacks set_freezable(),
rendering the freeze attempt ineffective since kernel threads are
non-freezable by default. This prevents proper thread suspension during
system suspend/hibernate.

Add set_freezable() to explicitly mark the thread as freezable.

Fixes: 0e9cebe724 ("dm: add log writes target")
Signed-off-by: Haotian Zhang <vulab@iscas.ac.cn>
Reviewed-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-12-10 19:28:22 +01:00
Alexey Simakov
2f6cfd6d7c dm-raid: fix possible NULL dereference with undefined raid type
rs->raid_type is assigned from get_raid_type_by_ll(), which may return
NULL. This NULL value could be dereferenced later in the condition
'if (!(rs_is_raid10(rs) && rt_is_raid0(rs->raid_type)))'.

Add a fail-fast check to return early with an error if raid_type is NULL,
similar to other uses of this function.

Found by Linux Verification Center (linuxtesting.org) with Svace.

Fixes: 33e53f0685 ("dm raid: introduce extended superblock and new raid types to support takeover/reshaping")
Signed-off-by: Alexey Simakov <bigalex934@gmail.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-12-10 19:28:22 +01:00
Mikulas Patocka
8581b19eb2 dm-snapshot: fix 'scheduling while atomic' on real-time kernels
There is reported 'scheduling while atomic' bug when using dm-snapshot on
real-time kernels. The reason for the bug is that the hlist_bl code does
preempt_disable() when taking the lock and the kernel attempts to take
other spinlocks while holding the hlist_bl lock.

Fix this by converting a hlist_bl spinlock into a regular spinlock.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Reported-by: Jiping Ma <jiping.ma2@windriver.com>
2025-12-10 19:28:22 +01:00
Chaitanya Kulkarni
f4412c7d5a dm: ignore discard return value
__blkdev_issue_discard() always returns 0, making all error checking
at call sites dead code.

For dm-thin change issue_discard() return type to void, in
passdown_double_checking_shared_status() remove the r assignment from
return value of the issue_discard(), for end_discard() hardcode value of
r to 0 that matches only value returned from __blkdev_issue_discard().

Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chaitanya Kulkarni <ckulkarnilinux@gmail.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-12-10 19:28:22 +01:00
Benjamin Marzinski
27f204c215 dm-mpath: Simplify the setup_scsi_dh code
There's no point to the MPATHF_RETAIN_ATTACHED_HW_HANDLER flag any more.
The way setup_scsi_dh() worked, if that flag wasn't set, it would
attempt to attach any passed in hardware handler. This would always fail
if a different hardware handler was attached, which caused
setup_scsi_dh() to rerun as if the flag was set. So the code would
already retain any attached handler, because attaching a different one
would always fail.

Also, the code had a bug. If attached_handler_name was NULL but there
was a scsi device handler attached (because either
scsi_dh_attached_handler_name failed() to allocate a name, a handler got
attached after it was called) the code would loop endlessly.

Instead, ignore MPATHF_RETAIN_ATTACHED_HW_HANDLER, and always free the
passed in handler if *attached_handler_name is set. This simplifies the
code, and avoids the endless loop bug, while keeping the functionality
the same.

Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-12-10 19:28:22 +01:00
Matthew Sakai
4efe85b0c4 dm vdo: fix kerneldoc warnings
Fix kerneldoc warnings across the dm-vdo target. Also
remove some unhelpful or inaccurate doc comments, and fix
some format inconsistencies that did not produce warnings.

No functional changes.

Suggested-by: Sunday Adelodun <adelodunolaoluwa@yahoo.com>
Signed-off-by: Matthew Sakai <msakai@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-12-10 19:28:22 +01:00
Mikulas Patocka
d0ac06ae53 dm-bufio: align write boundary on physical block size
There may be devices with physical block size larger than 4k.

If dm-bufio sends I/O that is not aligned on physical block size,
performance is degraded.

The 4k minimum alignment limit is there because some SSDs report logical
and physical block size 512 despite having 4k internally - so dm-bufio
shouldn't send I/Os not aligned on 4k boundary, because they perform
badly (the SSD does read-modify-write for them).

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Reported-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: stable@vger.kernel.org
2025-12-10 19:28:22 +01:00
Mikulas Patocka
ce51c6963a dm-crypt: enable DM_TARGET_ATOMIC_WRITES
Allow handling of bios with REQ_ATOMIC flag set.

Don't split these bios and fail them if they overrun the hard limit
"BIO_MAX_VECS << PAGE_SHIFT".

In order to simplify the code, this commit joins the logic that avoids
splitting emulated zone append bios with the logic that avoids
splitting atomic write bios.

Signed-off-by: John Garry <john.g.garry@oracle.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Tested-by: John Garry <john.g.garry@oracle.com>
2025-12-10 19:28:22 +01:00
Mikulas Patocka
de67c139b3 dm: test for REQ_ATOMIC in dm_accept_partial_bio()
Any bio with REQ_ATOMIC flag set should never be split or partially
completed, so BUG_ON() on this scenario in dm_accept_partial_bio() (whose
intent is to allow partial completions).

Also, we must reject atomic bio to targets that don't support them,
otherwise this BUG could be triggered by stray bios that have the
REQ_ATOMIC set.

Signed-off-by: John Garry <john.g.garry@oracle.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Tested-by: John Garry <john.g.garry@oracle.com>
2025-12-10 19:28:12 +01:00
Mikulas Patocka
b9dd1f71e6 dm-verity: remove useless mempool
v->fec->extra_pool has zero reserved entries, so we can remove it and use
the kernel cache directly.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Reviewed-by: Eric Biggers <ebiggers@kernel.org>
2025-12-10 19:28:12 +01:00
Mikulas Patocka
d9f3e47d3f dm-verity: disable recursive forward error correction
There are two problems with the recursive correction:

1. It may cause denial-of-service. In fec_read_bufs, there is a loop that
has 253 iterations. For each iteration, we may call verity_hash_for_block
recursively. There is a limit of 4 nested recursions - that means that
there may be at most 253^4 (4 billion) iterations. Red Hat QE team
actually created an image that pushes dm-verity to this limit - and this
image just makes the udev-worker process get stuck in the 'D' state.

2. It doesn't work. In fec_read_bufs we store data into the variable
"fio->bufs", but fio bufs is shared between recursive invocations, if
"verity_hash_for_block" invoked correction recursively, it would
overwrite partially filled fio->bufs.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Reported-by: Guangwu Zhang <guazhang@redhat.com>
Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
Reviewed-by: Eric Biggers <ebiggers@kernel.org>
2025-12-10 19:27:59 +01:00
Shida Zhang
53280e3984 bcache: fix improper use of bi_end_io
Don't call bio->bi_end_io() directly. Use the bio_endio() helper
function instead, which handles completion more safely and uniformly.

Suggested-by: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Shida Zhang <zhangshida@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-12-09 10:20:03 -07:00
Benjamin Marzinski
fd81bc5cca scsi: device_handler: Return error pointer in scsi_dh_attached_handler_name()
If scsi_dh_attached_handler_name() fails to allocate the handler name,
dm-multipath (its only caller) assumes there is no attached device
handler, and sets the device up incorrectly. Return an error pointer
instead, so multipath can distinguish between failure, success where
there is no attached device handler, or when the path device is not a
SCSI device at all.

Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Reviewed-by: Martin Wilck <mwilck@suse.com>
Link: https://patch.msgid.link/20251206010015.1595225-1-bmarzins@redhat.com
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2025-12-08 22:04:38 -05:00
Linus Torvalds
cc25df3e2e Merge tag 'for-6.19/block-20251201' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull block updates from Jens Axboe:

 - Fix head insertion for mq-deadline, a regression from when priority
   support was added

 - Series simplifying and improving the ublk user copy code

 - Various ublk related cleanups

 - Fixup REQ_NOWAIT handling in loop/zloop, clearing NOWAIT when the
   request is punted to a thread for handling

 - Merge and then later revert loop dio nowait support, as it ended up
   causing excessive stack usage for when the inline issue code needs to
   dip back into the full file system code

 - Improve auto integrity code, making it less deadlock prone

 - Speedup polled IO handling, but manually managing the hctx lookups

 - Fixes for blk-throttle for SSD devices

 - Small series with fixes for the S390 dasd driver

 - Add support for caching zones, avoiding unnecessary report zone
   queries

 - MD pull requests via Yu:
      - fix null-ptr-dereference regression for dm-raid0
      - fix IO hang for raid5 when array is broken with IO inflight
      - remove legacy 1s delay to speed up system shutdown
      - change maintainer's email address
      - data can be lost if array is created with different lbs devices,
        fix this problem and record lbs of the array in metadata
      - fix rcu protection for md_thread
      - fix mddev kobject lifetime regression
      - enable atomic writes for md-linear
      - some cleanups

 - bcache updates via Coly
      - remove useless discard and cache device code
      - improve usage of per-cpu workqueues

 - Reorganize the IO scheduler switching code, fixing some lockdep
   reports as well

 - Improve the block layer P2P DMA support

 - Add support to the block tracing code for zoned devices

 - Segment calculation improves, and memory alignment flexibility
   improvements

 - Set of prep and cleanups patches for ublk batching support. The
   actual batching hasn't been added yet, but helps shrink down the
   workload of getting that patchset ready for 6.20

 - Fix for how the ps3 block driver handles segments offsets

 - Improve how block plugging handles batch tag allocations

 - nbd fixes for use-after-free of the configuration on device clear/put

 - Set of improvements and fixes for zloop

 - Add Damien as maintainer of the block zoned device code handling

 - Various other fixes and cleanups

* tag 'for-6.19/block-20251201' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (162 commits)
  block/rnbd: correct all kernel-doc complaints
  blk-mq: use queue_hctx in blk_mq_map_queue_type
  md: remove legacy 1s delay in md_notify_reboot
  md/raid5: fix IO hang when array is broken with IO inflight
  md: warn about updating super block failure
  md/raid0: fix NULL pointer dereference in create_strip_zones() for dm-raid
  sbitmap: fix all kernel-doc warnings
  ublk: add helper of __ublk_fetch()
  ublk: pass const pointer to ublk_queue_is_zoned()
  ublk: refactor auto buffer register in ublk_dispatch_req()
  ublk: add `union ublk_io_buf` with improved naming
  ublk: add parameter `struct io_uring_cmd *` to ublk_prep_auto_buf_reg()
  kfifo: add kfifo_alloc_node() helper for NUMA awareness
  blk-mq: fix potential uaf for 'queue_hw_ctx'
  blk-mq: use array manage hctx map instead of xarray
  ublk: prevent invalid access with DEBUG
  s390/dasd: Use scnprintf() instead of sprintf()
  s390/dasd: Move device name formatting into separate function
  s390/dasd: Remove unnecessary debugfs_create() return checks
  s390/dasd: Fix gendisk parent after copy pair swap
  ...
2025-12-03 19:26:18 -08:00
Linus Torvalds
7f8d5f70ff Merge tag 'core-core-2025-12-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull core irq cleanup from Thomas Gleixner:
 "Tree wide cleanup of the remaining users of in_irq() which got
  replaced by in_hardirq() and marked deprecated in 2020"

* tag 'core-core-2025-12-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  treewide: Remove in_irq()
2025-12-02 10:18:49 -08:00
Tarun Sahu
fdd0c6a649 md: remove legacy 1s delay in md_notify_reboot
During system shutdown, the md driver registered notifier function
(md_notify_reboot) currently imposes a hardcoded one-second delay.

This delay was introduced approximately 23 years ago and was likely
necessary for the hardware generation of that time. Proposing this
patch to make sure there are no known devices that need this delay.

Link: https://lore.kernel.org/linux-raid/20251121191422.2758555-1-tarunsahu@google.com
Signed-off-by: Tarun Sahu <tarunsahu@google.com>
Reviewed-by: Yu Kuai<yukuai@fnnas.com>
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2025-11-30 09:42:28 +08:00
Yu Kuai
a913d1f6a7 md/raid5: fix IO hang when array is broken with IO inflight
Following test can cause IO hang:

mdadm -CvR /dev/md0 -l10 -n4 /dev/sd[abcd] --assume-clean --chunk=64K --bitmap=none
sleep 5
echo 1 > /sys/block/sda/device/delete
echo 1 > /sys/block/sdb/device/delete
echo 1 > /sys/block/sdc/device/delete
echo 1 > /sys/block/sdd/device/delete

dd if=/dev/md0 of=/dev/null bs=8k count=1 iflag=direct

Root cause:

1) all disks removed, however all rdevs in the array is still in sync,
IO will be issued normally.

2) IO failure from sda, and set badblocks failed, sda will be faulty
and MD_SB_CHANGING_PENDING will be set.

3) error recovery try to recover this IO from other disks, IO will be
issued to sdb, sdc, and sdd.

4) IO failure from sdb, and set badblocks failed again, now array is
broken and will become read-only.

5) IO failure from sdc and sdd, however, stripe can't be handled anymore
because MD_SB_CHANGING_PENDING is set:

handle_stripe
 handle_stripe
 if (test_bit MD_SB_CHANGING_PENDING)
  set_bit STRIPE_HANDLE
  goto finish
  // skip handling failed stripe

release_stripe
 if (test_bit STRIPE_HANDLE)
  list_add_tail conf->hand_list

6) later raid5d can't handle failed stripe as well:

raid5d
 md_check_recovery
  md_update_sb
   if (!md_is_rdwr())
    // can't clear pending bit
    return
 if (test_bit MD_SB_CHANGING_PENDING)
  break;
  // can't handle failed stripe

Since MD_SB_CHANGING_PENDING can never be cleared for read-only array,
fix this problem by skip this checking for read-only array.

Link: https://lore.kernel.org/linux-raid/20251117085557.770572-3-yukuai@fnnas.com
Fixes: d87f064f58 ("md: never update metadata when array is read-only.")
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
Reviewed-by: Li Nan <linan122@huawei.com>
2025-11-30 09:38:45 +08:00
Yu Kuai
8c9e376b9d md: warn about updating super block failure
Many personalities will handle IO error from daemon thread(like raid1d,
raid10d, raid5d), and sb will require to be clean before hanlding these
failed IO. However update sb can fail, for example array is broken by
IO failure, or user config sysfs api array_state.

This patch adds warning if updating sb failed first, in case this will
be related to IO hang.

Link: https://lore.kernel.org/linux-raid/20251117085557.770572-2-yukuai@fnnas.com
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
Reviewed-by: Li Nan <linan122@huawei.com>
2025-11-30 09:38:22 +08:00
Yu Kuai
46f21952c4 md/raid0: fix NULL pointer dereference in create_strip_zones() for dm-raid
Commit 2107457e31 ("md/raid0: Move queue limit setup before r0conf
initialization") dereference mddev->gendisk unconditionally, which is
NULL for dm-raid.

Fix this problem by reverting to old codes for dm-raid.

Link: https://lore.kernel.org/linux-raid/20251116021816.107648-1-yukuai@fnnas.com
Fixes: 2107457e31 ("md/raid0: Move queue limit setup before r0conf initialization")
Reported-and-tested-by: Changhui Zhong <czhong@redhat.com>
Closes: https://lore.kernel.org/all/CAGVVp+VqVnvGeneUoTbYvBv2cw6GwQRrR3B-iQ-_9rVfyumoKA@mail.gmail.com/
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Reviewed-by: Li Nan <linan122@huawei.com>
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
2025-11-30 09:36:50 +08:00
Mikulas Patocka
fe680d8c74 dm-verity: fix unreliable memory allocation
GFP_NOWAIT allocation may fail anytime. It needs to be changed to
GFP_NOIO. There's no need to handle an error because mempool_alloc with
GFP_NOIO can't fail.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
Reviewed-by: Eric Biggers <ebiggers@kernel.org>
2025-11-21 12:51:41 +01:00
Mikulas Patocka
a612d24e85 dm: fix failure when empty flush's bi_sector points beyond the device end
An empty flush bio can have arbitrary bi_sector. The commit 2b1c6d7a89
introduced a regression that device mapper would fail an empty flush bio
with -EIO if the sector pointed beyond the end of the device.

The commit introduced an optimization, that optimization would pass
flushes to __split_and_process_bio and __split_and_process_bio is not
prepared to handle empty bios. Fix this bug by passing only non-empty
flushes to __split_and_process_bio - non-empty flushes must have valid
bi_sector. Empty bios will go through __send_empty_flush, as they did
before the optimization.

This problem can be reproduced by running the lvm2 test:
make check_local T=lvconvert-thin.sh LVM_TEST_PREFER_BRD=0

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Fixes: 2b1c6d7a89 ("dm: optimize REQ_PREFLUSH with data when using the linear target")
Reported-by: Zdenek Kabelac <zkabelac@redhat.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
2025-11-20 19:50:42 +01:00
Li Chen
a6ee8422b4 dm-pcache: zero cache_info before default init
pcache_meta_find_latest() leaves whatever it last copied into the
caller’s buffer even when it returns NULL. For cache_info_init(),
that meant cache->cache_info could still contain CRC-bad garbage when
no valid metadata exists, leading later initialization paths to read
bogus flags.

Explicitly memset cache->cache_info in cache_info_init_default()
so new-cache paths start from a clean slate. The default sequence
number assignment becomes redundant with this reset, so it drops out.

Signed-off-by: Li Chen <chenl311@chinatelecom.cn>
Reviewed-by: Zheng Gu <cengku@gmail.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-11-18 19:02:08 +01:00
Li Chen
840b80af74 dm-pcache: reuse meta_addr in pcache_meta_find_latest
pcache_meta_find_latest() already computes the metadata address as
meta_addr. Reuse that instead of recomputing.

Signed-off-by: Li Chen <chenl311@chinatelecom.cn>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-11-18 19:01:57 +01:00
Li Chen
341d14bd69 dm-pcache: allow built-in build and rename flush helper
CONFIG_BCACHE is tristate, so dm-pcache can also be built-in.
Switch the Makefile to use obj-$(CONFIG_DM_PCACHE) so the target can be
linked into vmlinux instead of always being a loadable module.

Also rename cache_flush() to pcache_cache_flush() to avoid a global
symbol clash with sunrpc/cache.c's cache_flush().

Signed-off-by: Li Chen <chenl311@chinatelecom.cn>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-11-18 19:01:47 +01:00
Uladzislau Rezki (Sony)
7fa3e7d114 dm-ebs: Mark full buffer dirty even on partial write
When performing a read-modify-write(RMW) operation, any modification
to a buffered block must cause the entire buffer to be marked dirty.

Marking only a subrange as dirty is incorrect because the underlying
device block size(ubs) defines the minimum read/write granularity. A
lower device can perform I/O only on regions which are fully aligned
and sized to ubs.

This change ensures that write-back operations always occur in full
ubs-sized chunks, matching the intended emulation semantics of the
EBS target.

As for user space visible impact, submitting sub-ubs and misaligned
I/O for devices which are tuned to ubs sizes only, will reject such
requests, therefore it can lead to losing data. Example:

1) Create a 8K nvme device in qemu by adding

-device nvme,drive=drv0,serial=foo,logical_block_size=8192,physical_block_size=8192

2) Setup dm-ebs to emulate 512B to 8K mapping

urezki@pc638:~/bin$ cat dmsetup.sh

lower=/dev/nvme0n1
len=$(blockdev --getsz "$lower")

echo "0 $len ebs $lower 0 1 16" | dmsetup create nvme-8k
urezki@pc638:~/bin$

offset 0, ebs=1 and ubs=16(in sectors).

3) Create an ext4 filesystem(default 4K block size)

urezki@pc638:~/bin$ sudo mkfs.ext4 -F /dev/dm-0
mke2fs 1.47.0 (5-Feb-2023)
Discarding device blocks: done
Creating filesystem with 2072576 4k blocks and 518144 inodes
Filesystem UUID: bd0b6ca6-0506-4e31-86da-8d22c9d50b63
Superblock backups stored on blocks:
        32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632

Allocating group tables: done
Writing inode tables: done
Creating journal (16384 blocks): done
Writing superblocks and filesystem accounting information: mkfs.ext4: Input/output error while writing out and closing file system
urezki@pc638:~/bin$ dmesg

<snip>
[ 1618.875449] buffer_io_error: 1028 callbacks suppressed
[ 1618.875456] Buffer I/O error on dev dm-0, logical block 0, lost async page write
[ 1618.875527] Buffer I/O error on dev dm-0, logical block 1, lost async page write
[ 1618.875602] Buffer I/O error on dev dm-0, logical block 2, lost async page write
[ 1618.875620] Buffer I/O error on dev dm-0, logical block 3, lost async page write
[ 1618.875639] Buffer I/O error on dev dm-0, logical block 4, lost async page write
[ 1618.894316] Buffer I/O error on dev dm-0, logical block 5, lost async page write
[ 1618.894358] Buffer I/O error on dev dm-0, logical block 6, lost async page write
[ 1618.894380] Buffer I/O error on dev dm-0, logical block 7, lost async page write
[ 1618.894405] Buffer I/O error on dev dm-0, logical block 8, lost async page write
[ 1618.894427] Buffer I/O error on dev dm-0, logical block 9, lost async page write
<snip>

Many I/O errors because the lower 8K device rejects sub-ubs/misaligned
requests.

with a patch:

urezki@pc638:~/bin$ sudo mkfs.ext4 -F /dev/dm-0
mke2fs 1.47.0 (5-Feb-2023)
Discarding device blocks: done
Creating filesystem with 2072576 4k blocks and 518144 inodes
Filesystem UUID: 9b54f44f-ef55-4bd4-9e40-c8b775a616ac
Superblock backups stored on blocks:
        32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632

Allocating group tables: done
Writing inode tables: done
Creating journal (16384 blocks): done
Writing superblocks and filesystem accounting information: done

urezki@pc638:~/bin$ sudo mount /dev/dm-0 /mnt/
urezki@pc638:~/bin$ ls -al /mnt/
total 24
drwxr-xr-x  3 root root  4096 Oct 17 15:13 .
drwxr-xr-x 19 root root  4096 Jul 10 19:42 ..
drwx------  2 root root 16384 Oct 17 15:13 lost+found
urezki@pc638:~/bin$

After this change: mkfs completes; mount succeeds.

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
2025-11-18 18:24:26 +01:00
John Garry
61c73e8de9 dm mpath: enable DM_TARGET_ATOMIC_WRITES
Both the bio- and rq-based paths have no problem supporting REQ_ATOMIC,
so enable DM_TARGET_ATOMIC_WRITES.

Signed-off-by: John Garry <john.g.garry@oracle.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-11-18 18:24:26 +01:00
Shubhankar Mishra
ae97648e14 dm verity fec: Expose corrected block count via status
Enhance visibility into dm-verity Forward Error Correction (FEC)
activity. While FEC can correct on-disk corruptions, the number of
successful correction events is not readily exposed through a standard
interface.

This change integrates FEC statistics into the verity target's
.status handler for STATUSTYPE_INFO. The info output now
includes count of corrected block by FEC.

The counter is a per-device instance atomic64_t, maintained within
the struct dm_verity_fec, tracking blocks successfully repaired by FEC
on this specific device instance since it was created.

This approach aligns with the standard Device Mapper mechanism for
targets to report runtime information, as used by other targets like
dm-integrity.

This patch also updates Documentation/admin-guide/device-mapper/verity.rst
to reflect the new status information.

Tested:
  Induced single-bit errors on a block device protected by dm-verity
  with FEC on android phone. Confirmed 'dmctl status <device>' on Android
  reports an incrementing 'fec_corrected_blocks' count after the
  corrupted blocks were accessed.

Signed-off-by: Shubhankar Mishra <shubhankarm@google.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-11-18 18:24:26 +01:00
Jon Hunter
c82faa8934 dm: Don't warn if IMA_DISABLE_HTABLE is not enabled
Commit f1cd6cb24b ("dm ima: add a warning in dm_init if duplicate ima
events are not measured") added a warning message if CONFIG_IMA is
enabled but CONFIG_IMA_DISABLE_HTABLE is not to inform users. When
enabling CONFIG_IMA, CONFIG_IMA_DISABLE_HTABLE is disabled by default
and so warning is seen. Therefore, it seems more appropriate to make
this an INFO level message than warning. If this truly is a warning,
then maybe CONFIG_IMA_DISABLE_HTABLE should default to y if CONFIG_IMA
is enabled.

Signed-off-by: Jon Hunter <jonathanh@nvidia.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-11-18 18:24:25 +01:00
Gustavo A. R. Silva
699122b590 bcache: Avoid -Wflex-array-member-not-at-end warning
-Wflex-array-member-not-at-end was introduced in GCC-14, and we are
getting ready to enable it, globally.

Use the new TRAILING_OVERLAP() helper to fix the following warning:

drivers/md/bcache/bset.h:330:27: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end]

This helper creates a union between a flexible-array member (FAM) and a
set of MEMBERS that would otherwise follow it.

This overlays the trailing MEMBER struct btree_iter_set stack_data[MAX_BSETS];
onto the FAM struct btree_iter::data[], while keeping the FAM and the start
of MEMBER aligned.

The static_assert() ensures this alignment remains, and it's
intentionally placed immediately after the corresponding structures --no
blank line in between.

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Coly Li <colyli@fnnas.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-13 09:18:06 -07:00
Marco Crivellari
c0c8082142 bcache: WQ_PERCPU added to alloc_workqueue users
Currently if a user enqueue a work item using schedule_delayed_work() the
used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use
WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to
schedule_work() that is using system_wq and queue_work(), that makes use
again of WORK_CPU_UNBOUND.
This lack of consistentcy cannot be addressed without refactoring the API.

alloc_workqueue() treats all queues as per-CPU by default, while unbound
workqueues must opt-in via WQ_UNBOUND.

This default is suboptimal: most workloads benefit from unbound queues,
allowing the scheduler to place worker threads where they’re needed and
reducing noise when CPUs are isolated.

This patch continues the effort to refactor worqueue APIs, which has begun
with the change introducing new workqueues and a new alloc_workqueue flag:

commit 128ea9f6cc ("workqueue: Add system_percpu_wq and system_dfl_wq")
commit 930c2ea566 ("workqueue: Add new WQ_PERCPU flag")

This change adds a new WQ_PERCPU flag to explicitly request
alloc_workqueue() to be per-cpu when WQ_UNBOUND has not been specified.

With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND),
any alloc_workqueue() caller that doesn’t explicitly specify WQ_UNBOUND
must now use WQ_PERCPU.

Once migration is complete, WQ_UNBOUND can be removed and unbound will
become the implicit default.

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Signed-off-by: Coly Li <colyli@fnnas.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-13 09:18:06 -07:00
Marco Crivellari
fd82071814 bcache: replace use of system_wq with system_percpu_wq
Currently if a user enqueues a work item using schedule_delayed_work() the
used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use
WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to
schedule_work() that is using system_wq and queue_work(), that makes use
again of WORK_CPU_UNBOUND.

This lack of consistency cannot be addressed without refactoring the API.

This patch continues the effort to refactor worqueue APIs, which has begun
with the change introducing new workqueues and a new alloc_workqueue flag:

commit 128ea9f6cc ("workqueue: Add system_percpu_wq and system_dfl_wq")
commit 930c2ea566 ("workqueue: Add new WQ_PERCPU flag")

system_wq should be the per-cpu workqueue, yet in this name nothing makes
that clear, so replace system_wq with system_percpu_wq.

The old wq (system_wq) will be kept for a few release cycles.

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Signed-off-by: Coly Li <colyli@fnnas.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-13 09:18:06 -07:00
Qianfeng Rong
21194c44b6 bcache: remove redundant __GFP_NOWARN
GFP_NOWAIT already includes __GFP_NOWARN, so let's remove the redundant
__GFP_NOWARN.

Signed-off-by: Qianfeng Rong <rongqianfeng@vivo.com>
Acked-by: Coly Li <colyli@fnnas.com>
Acked-by: Coly Li <colyli@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-13 09:18:06 -07:00
Coly Li
70bc173ce0 bcache: reduce gc latency by processing less nodes and sleep less time
When bcache device is busy for high I/O loads, there are two methods to
reduce the garbage collection latency,
- Process less nodes in eac loop of incremental garbage collection in
  btree_gc_recurse().
- Sleep less time between two full garbage collection in
  bch_btree_gc().

This patch introduces to hleper routines to provide different garbage
collection nodes number and sleep intervel time.
- btree_gc_min_nodes()
  If there is no front end I/O, return 128 nodes to process in each
  incremental loop, otherwise only 10 nodes are returned. Then front I/O
  is able to access the btree earlier.
- btree_gc_sleep_ms()
  If there is no synchronized wait for bucket allocation, sleep 100 ms
  between two incremental GC loop. Othersize only sleep 10 ms before
  incremental GC loop. Then a faster GC may provide available buckets
  earlier, to avoid most of bcache working threads from being starved by
  buckets allocation.

The idea is inspired by works from Mingzhe Zou and Robert Pang, but much
simpler and the expected behavior is more predictable.

Signed-off-by: Coly Li <colyli@fnnas.com>
Signed-off-by: Robert Pang <robertpang@google.com>
Signed-off-by: Mingzhe Zou <mingzhe.zou@easystack.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-13 09:18:06 -07:00
Coly Li
73a004f83c bcache: drop discard sysfs interface
Since discard code is removed, now the sysfs interface to enable discard
is useless. This patch removes the corresponding sysfs entry, and remove
bool variable 'discard' from struct cache as well.

Signed-off-by: Coly Li <colyli@fnnas.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-13 09:18:06 -07:00
Coly Li
b4056afbd4 bcache: remove discard code from alloc.c
Bcache allocator initially has no free space to allocate. Firstly it
does a garbage collection which is triggered by a cache device write
and fills free space into ca->free[] lists. The discard happens after
the free bucket is handled by garbage collection added into one of the
ca->free[] lists. But normally this bucket will be allocated out very
soon to requester and filled data onto it. The discard hint on this
bucket LBA range doesn't help SSD control to improve internal erasure
performance, and waste extra CPU cycles to issue discard bios.

This patch removes the almost-useless discard code from alloc.c.

Signed-off-by: Coly Li <colyli@fnnas.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-13 09:18:06 -07:00
Coly Li
0c72e9fcc1 bcache: get rid of discard code from journal
In bcache journal there is discard functionality but almost useless in
reality. Because discard happens after a journal bucket is reclaimed,
and the reclaimed bucket is allocated for new journaling immediately.
There is no time for underlying SSD to use the discard hint for internal
data management.

The discard code in bcache journal doesn't bring any performance
optimization and wastes CPU cycles for issuing discard bios. Therefore
this patch gits rid of it from journal.c and journal.h.

Signed-off-by: Coly Li <colyli@fnnas.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-13 09:18:06 -07:00
Damien Le Moal
7b2038b1b1 dm: fix zone reset all operation processing
dm_zone_get_reset_bitmap() is used to generate a bitmap of the zones of
a zoned device target when a REQ_OP_ZONE_RESET_ALL request is being
processed. This bitmap is built by executing a zone report with a report
callback set to the function dm_zone_need_reset_cb() in struct
dm_report_zones_args. However, the cb callback pointer is not anymore
the same as the callback specified by callers of the
blkdev_report_zones() function. Rather, this is a DM internal callback
and report zones callback functions from blkdev_report_zones() are
passed using struct blk_report_zones_args, introduced with commit
db9aed869f34 ("block: introduce disk_report_zone()").

This commit changed the DM main report zones callback handler function
dm_report_zones_cb() to call the new disk_report_zone() so that callback
functions from blkdev_report_zones() are executed, and this change
resulted in the DM internal dm_zone_need_reset_cb() callback function to
not be executed anymore, turning any REQ_OP_ZONE_RESET_ALL request into
a no-op.

Fix this by calling in dm_report_zones_cb() the DM internal cb function
specified in struct dm_report_zones_args.

Fixes: db9aed869f34 ("block: introduce disk_report_zone()").
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-13 09:10:04 -07:00
Li Nan
62ed1b5822 md: allow configuring logical block size
Previously, raid array used the maximum logical block size (LBS)
of all member disks. Adding a larger LBS disk at runtime could
unexpectedly increase RAID's LBS, risking corruption of existing
partitions. This can be reproduced by:

```
  # LBS of sd[de] is 512 bytes, sdf is 4096 bytes.
  mdadm -CRq /dev/md0 -l1 -n3 /dev/sd[de] missing --assume-clean

  # LBS is 512
  cat /sys/block/md0/queue/logical_block_size

  # create partition md0p1
  parted -s /dev/md0 mklabel gpt mkpart primary 1MiB 100%
  lsblk | grep md0p1

  # LBS becomes 4096 after adding sdf
  mdadm --add -q /dev/md0 /dev/sdf
  cat /sys/block/md0/queue/logical_block_size

  # partition lost
  partprobe /dev/md0
  lsblk | grep md0p1
```

Simply restricting larger-LBS disks is inflexible. In some scenarios,
only disks with 512 bytes LBS are available currently, but later, disks
with 4KB LBS may be added to the array.

Making LBS configurable is the best way to solve this scenario.
After this patch, the raid will:
  - store LBS in disk metadata
  - add a read-write sysfs 'mdX/logical_block_size'

Future mdadm should support setting LBS via metadata field during RAID
creation and the new sysfs. Though the kernel allows runtime LBS changes,
users should avoid modifying it after creating partitions or filesystems
to prevent compatibility issues.

Only 1.x metadata supports configurable LBS. 0.90 metadata inits all
fields to default values at auto-detect. Supporting 0.90 would require
more extensive changes and no such use case has been observed.

Note that many RAID paths rely on PAGE_SIZE alignment, including for
metadata I/O. A larger LBS than PAGE_SIZE will result in metadata
read/write failures. So this config should be prevented.

Link: https://lore.kernel.org/linux-raid/20251103125757.1405796-6-linan666@huaweicloud.com
Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2025-11-11 11:20:15 +08:00
Li Nan
9c47127a80 md: add check_new_feature module parameter
Raid checks if pad3 is zero when loading superblock from disk. Arrays
created with new features may fail to assemble on old kernels as pad3
is used.

Add module parameter check_new_feature to bypass this check.

Link: https://lore.kernel.org/linux-raid/20251103125757.1405796-5-linan666@huaweicloud.com
Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2025-11-11 11:19:54 +08:00
Li Nan
2107457e31 md/raid0: Move queue limit setup before r0conf initialization
Prepare for making logical blocksize configurable. This change has no
impact until logical block size becomes configurable.

Move raid0_set_limits() before create_strip_zones(). It is safe as fields
modified in create_strip_zones() do not involve mddev configuration, and
rdev modifications there are not used in raid0_set_limits().

'blksize' in create_strip_zones() fetches mddev's logical block size,
which is already the maximum aross all rdevs, so the later max() can be
removed.

Link: https://lore.kernel.org/linux-raid/20251103125757.1405796-4-linan666@huaweicloud.com
Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2025-11-11 11:19:27 +08:00
Li Nan
381a3ce1c0 md: init bioset in mddev_init
IO operations may be needed before md_run(), such as updating metadata
after writing sysfs. Without bioset, this triggers a NULL pointer
dereference as below:

 BUG: kernel NULL pointer dereference, address: 0000000000000020
 Call Trace:
  md_update_sb+0x658/0xe00
  new_level_store+0xc5/0x120
  md_attr_store+0xc9/0x1e0
  sysfs_kf_write+0x6f/0xa0
  kernfs_fop_write_iter+0x141/0x2a0
  vfs_write+0x1fc/0x5a0
  ksys_write+0x79/0x180
  __x64_sys_write+0x1d/0x30
  x64_sys_call+0x2818/0x2880
  do_syscall_64+0xa9/0x580
  entry_SYSCALL_64_after_hwframe+0x4b/0x53

Reproducer
```
  mdadm -CR /dev/md0 -l1 -n2 /dev/sd[cd]
  echo inactive > /sys/block/md0/md/array_state
  echo 10 > /sys/block/md0/md/new_level
```

mddev_init() can only be called once per mddev, no need to test if bioset
has been initialized anymore.

Link: https://lore.kernel.org/linux-raid/20251103125757.1405796-3-linan666@huaweicloud.com
Fixes: d981ed8419 ("md: Add new_level sysfs interface")
Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2025-11-11 11:19:10 +08:00
Li Nan
0ce112d917 md: delete md_redundancy_group when array is becoming inactive
'md_redundancy_group' are created in md_run() and deleted in del_gendisk(),
but these are not paired. Writing inactive/active to sysfs array_state can
trigger md_run() multiple times without del_gendisk(), leading to
duplicate creation as below:

 sysfs: cannot create duplicate filename '/devices/virtual/block/md0/md/sync_action'
 Call Trace:
  dump_stack_lvl+0x9f/0x120
  dump_stack+0x14/0x20
  sysfs_warn_dup+0x96/0xc0
  sysfs_add_file_mode_ns+0x19c/0x1b0
  internal_create_group+0x213/0x830
  sysfs_create_group+0x17/0x20
  md_run+0x856/0xe60
  ? __x64_sys_openat+0x23/0x30
  do_md_run+0x26/0x1d0
  array_state_store+0x559/0x760
  md_attr_store+0xc9/0x1e0
  sysfs_kf_write+0x6f/0xa0
  kernfs_fop_write_iter+0x141/0x2a0
  vfs_write+0x1fc/0x5a0
  ksys_write+0x79/0x180
  __x64_sys_write+0x1d/0x30
  x64_sys_call+0x2818/0x2880
  do_syscall_64+0xa9/0x580
  entry_SYSCALL_64_after_hwframe+0x4b/0x53
 md: cannot register extra attributes for md0

Creation of it depends on 'pers', its lifecycle cannot be aligned with
gendisk. So fix this issue by triggering 'md_redundancy_group' deletion
when the array is becoming inactive.

Link: https://lore.kernel.org/linux-raid/20251103125757.1405796-2-linan666@huaweicloud.com
Fixes: 790abe4d77 ("md: remove/add redundancy group only in level change")
Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2025-11-11 11:18:51 +08:00
Li Nan
6c6b66f65e md: prevent adding disks with larger logical_block_size to active arrays
When adding a disk to a md array, avoid updating the array's
logical_block_size to match the new disk. This prevents accidental
partition table loss that renders the array unusable.

The later patch will introduce a way to configure the array's
logical_block_size.

The issue was introduced before Linux 2.6.12-rc2.

Link: https://lore.kernel.org/linux-raid/20250918115759.334067-2-linan666@huaweicloud.com/
Fixes: d2e45eace8 ("[PATCH] Fix raid "bio too big" failures")
Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2025-11-11 11:17:33 +08:00
Huiwen He
a811db3919 md/raid5: remove redundant __GFP_NOWARN
The __GFP_NOWARN flag was included in GFP_NOWAIT since commit
16f5dfbc85 ("gfp: include __GFP_NOWARN in GFP_NOWAIT"). So
remove the redundant __GFP_NOWARN flag.

Link: https://lore.kernel.org/linux-raid/20251102152540.871568-1-hehuiwen@kylinos.cn
Signed-off-by: Huiwen He <hehuiwen@kylinos.cn>
Reviewed-by: Li Nan <linan122@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2025-11-08 17:49:36 +08:00