In commit 4cc96131af ("dm: move request-based code out to dm-rq.[hc]")
we have a note: "DM targets should _never_ include dm-core.h!". dm-raid
is now the only DM target that still includes it, so remove it from
dm-raid for consistency, and use the dedicated helpers instead of
accessing struct dm_table and struct mapped_device fields directly. This
change is merely a cleanup and should not affect functionality.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
There's no need to call need_resched() because cond_resched() will do
nothing if need_resched() returns false.
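In other words, a guard like the following is redundant (a minimal
illustration; the surrounding dm-verity context is elided):

	/* before: the guard repeats a check cond_resched() already does */
	if (need_resched())
		cond_resched();

	/* after: equivalent and simpler */
	cond_resched();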
Reviewed-by: Matthew Sakai <msakai@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Fix Smatch-detected error:
drivers/md/dm-zoned-target.c:1073 dmz_iterate_devices()
error: uninitialized symbol 'r'.
Smatch detects a possible use of the uninitialized variable 'r' in
dmz_iterate_devices() because if dmz->nr_ddevs is zero, the loop is
skipped and 'r' is returned without being set, leading to undefined
behavior.
Initialize 'r' to 0 before the loop. This ensures that if there are no
devices to iterate over, the function still returns a defined value.
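A sketch of the fixed function (body abridged; the zone-aligned capacity
calculation is omitted):

	static int dmz_iterate_devices(struct dm_target *ti,
				       iterate_devices_callout_fn fn, void *data)
	{
		struct dmz_target *dmz = ti->private;
		unsigned int i;
		int r = 0;	/* defined result even when dmz->nr_ddevs == 0 */

		for (i = 0; i < dmz->nr_ddevs; i++) {
			r = fn(ti, dmz->ddev[i], 0, dmz->dev[i].capacity, data);
			if (r)
				break;
		}
		return r;
	}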
Signed-off-by: Purva Yeshi <purvayeshi550@gmail.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
The support for asynchronous hashes in dm-verity has outlived its
usefulness. It adds significant code complexity and opportunity for
bugs. I don't know of anyone using it in practice. (The original
submitter of the code possibly was, but that was 8 years ago.) Data I
recently collected for en/decryption shows that using off-CPU crypto
"accelerators" is consistently much slower than the CPU
(https://lore.kernel.org/r/20250704070322.20692-1-ebiggers@kernel.org/),
even on CPUs that lack dedicated cryptographic instructions. Similar
results are likely to be seen for hashing.
I already removed support for asynchronous hashes from fsverity two
years ago, and no one ever complained.
Moreover, neither dm-verity, fsverity, nor fscrypt has ever actually
used the asynchronous crypto algorithms in a truly asynchronous manner.
The lack of interest in such optimizations provides further evidence
that it's only the CPU-based crypto that actually matters.
Historically, it's also been common for people to forget to enable the
optimized SHA-256 code, which could contribute to an off-CPU crypto
engine being perceived as more useful than it really is. In 6.16 I
fixed that: the optimized SHA-256 code is now enabled by default.
Therefore, let's drop the support for asynchronous hashes in dm-verity.
Tested with verity-compat-test.
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
dm_unregister_path_selector can only return an error if there's a bug in
the code - so make it return void and print a warning if the caller
abuses this function to unregister a path selector that was not
registered.
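A sketch of the new shape (the internal lookup and list handling are
illustrative, not necessarily the exact dm-path-selector.c code):

	void dm_unregister_path_selector(struct path_selector_type *pst)
	{
		struct ps_internal *psi;

		down_write(&_ps_lock);
		psi = __find_path_selector_type(pst->name);
		if (!psi) {
			up_write(&_ps_lock);
			DMWARN("unregistering a path selector that was not registered");
			return;
		}
		list_del(&psi->list);
		up_write(&_ps_lock);
		kfree(psi);
	}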
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Since 'scnprintf()' returns the number of characters emitted (not
including the trailing '\0'), use that return value instead of the
subsequent calls to 'strlen()' where appropriate. Compile tested only.
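The pattern looks roughly like this (buffer and format string are
illustrative):

	/* before: the length is recomputed with strlen() */
	scnprintf(buf, sizeof(buf), "%u:%u", major, minor);
	sz += strlen(buf);

	/* after: scnprintf() already returns the number of characters emitted */
	sz += scnprintf(buf, sizeof(buf), "%u:%u", major, minor);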
Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
The local variable first_requeue is not needed since it is always equal
to dm_io_flagged(io, DM_IO_WAS_SPLIT). Call __dm_io_complete() passing
this value directly and remove first_requeue.
Also declare dm_io_complete() as inline to make sure it is inlined in
its single call site, thus avoiding the cost of a function call.
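A sketch of the simplification (surrounding code abridged):

	/* before */
	bool first_requeue = dm_io_flagged(io, DM_IO_WAS_SPLIT);

	__dm_io_complete(io, first_requeue);

	/* after */
	__dm_io_complete(io, dm_io_flagged(io, DM_IO_WAS_SPLIT));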
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
The return statement at the end of dm_zone_endio() is not needed.
Remove it.
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
The target does not honour the "sync" argument when activated without
metadata devices, e.g. with the table line:
"0 $(blockdev --getsz $data1) raid raid1 2 0 sync 2 - $data1 - $data2".
Fix this to support temporary, transient raid devices, which are useful
for data duplication.
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Due to the "any" semantics of iterate_devices(), the current code allows
a request-based dm table as long as it includes at least one
request-stackable device. It is supposed to allow only tables where
there are no non-request-stackable devices.
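An illustrative sketch of those semantics and the inverted check (the
callback names and bodies are illustrative):

	/*
	 * iterate_devices() returns true as soon as the callback returns
	 * true for any one device. To require a property of every device,
	 * ask whether any device lacks it.
	 */
	static int device_is_not_rq_stackable(struct dm_target *ti,
					      struct dm_dev *dev,
					      sector_t start, sector_t len,
					      void *data)
	{
		return !queue_is_mq(bdev_get_queue(dev->bdev));
	}

	/* buggy: passes as long as one device is request-stackable */
	if (!ti->type->iterate_devices(ti, device_is_rq_stackable, NULL))
		return false;

	/* fixed: fails if any device is not request-stackable */
	if (ti->type->iterate_devices(ti, device_is_not_rq_stackable, NULL))
		return false;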
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Reviewed-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Pull more block updates from Jens Axboe:
- NVMe pull request via Christoph:
- TCP error handling fix (Shin'ichiro Kawasaki)
- TCP I/O stall handling fixes (Hannes Reinecke)
- fix command limits status code (Keith Busch)
- support vectored buffers also for passthrough (Pavel Begunkov)
- spelling fixes (Yi Zhang)
- MD pull request via Yu:
- fix REQ_RAHEAD and REQ_NOWAIT IO err handling for raid1/10
- fix max_write_behind setting for dm-raid
- some minor cleanups
- Integrity data direction fix and cleanup
- bcache NULL pointer fix
- Fix for loop missing write start/end handling
- Decouple hardware queues and IO threads in ublk
- Slew of ublk selftests additions and updates
* tag 'block-6.16-20250606' of git://git.kernel.dk/linux: (29 commits)
nvme: spelling fixes
nvme-tcp: fix I/O stalls on congested sockets
nvme-tcp: sanitize request list handling
nvme-tcp: remove tag set when second admin queue config fails
nvme: enable vectored registered bufs for passthrough cmds
nvme: fix implicit bool to flags conversion
nvme: fix command limits status code
selftests: ublk: kublk: improve behavior on init failure
block: flip iter directions in blk_rq_integrity_map_user()
block: drop direction param from bio_integrity_copy_user()
selftests: ublk: cover PER_IO_DAEMON in more stress tests
Documentation: ublk: document UBLK_F_PER_IO_DAEMON
selftests: ublk: add stress test for per io daemons
selftests: ublk: add functional test for per io daemons
selftests: ublk: kublk: decouple ublk_queues from ublk server threads
selftests: ublk: kublk: move per-thread data out of ublk_queue
selftests: ublk: kublk: lift queue initialization out of thread
selftests: ublk: kublk: tie sqe allocation to io instead of queue
selftests: ublk: kublk: plumb q_id in io_uring user_data
ublk: have a per-io daemon instead of a per-queue daemon
...
Pull device mapper updates from Mikulas Patocka:
- better error handling when reloading a table
- use generic disable_* functions instead of open coding them
- lock queue limits when reading them
- remove unneeded kvfree from alloc_targets
- fix BLK_FEAT_ATOMIC_WRITES
- pass through operations on wrapped inline crypto keys
- dm-verity:
- use softirq context only when !need_resched()
- fix a memory leak if some arguments are specified multiple times
- dm-mpath:
- interface for explicit probing of active paths
- replace spin_lock_irqsave with spin_lock_irq
- dm-delay: don't busy-wait in kthread
- dm-bufio: remove maximum age based eviction
- dm-flakey: various fixes
- vdo indexer: don't read request structure after enqueuing
- dm-zone: Use bdev_*() helper functions where applicable
- dm-mirror: fix a tiny race condition
- dm-stripe: small code cleanup
* tag 'for-6.16/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (29 commits)
dm-stripe: small code cleanup
dm-verity: fix a memory leak if some arguments are specified multiple times
dm-mirror: fix a tiny race condition
dm-table: check BLK_FEAT_ATOMIC_WRITES inside limits_lock
dm mpath: replace spin_lock_irqsave with spin_lock_irq
dm-mpath: Don't grab work_mutex while probing paths
dm-zone: Use bdev_*() helper functions where applicable
dm vdo indexer: don't read request structure after enqueuing
dm: pass through operations on wrapped inline crypto keys
blk-crypto: export wrapped key functions
dm-table: Set BLK_FEAT_ATOMIC_WRITES for target queue limits
dm mpath: Interface for explicit probing of active paths
dm: Allow .prepare_ioctl to handle ioctls directly
dm-flakey: make corrupting read bios work
dm-flakey: remove useless ERROR_READS check in flakey_end_io
dm-flakey: error all IOs when num_features is absent
dm-flakey: Clean up parsing messages
dm: remove unneeded kvfree from alloc_targets
dm-bufio: remove maximum age based eviction
dm-verity: use softirq context only when !need_resched()
...
This commit doesn't fix any bug; it is just a code cleanup. Use the
function format_dev_t instead of sprintf, because format_dev_t does the
same thing.
Remove the useless memset call.
An unsigned integer can take at most 10 digits, so extend the array size
to 22. (Note that because the ranges of major and minor numbers are
limited, the existing size of 16 could not actually be exceeded, so the
function couldn't write beyond the end of the string.)
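Illustrative usage (the variable names are not from the actual patch):

	char buffer[22];	/* two u32s (up to 10 digits each) + ':' + '\0' */

	format_dev_t(buffer, sd->dev->bdev->bd_dev);	/* prints "major:minor" */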
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
If some of the arguments "check_at_most_once", "ignore_zero_blocks",
"use_fec_from_device", "root_hash_sig_key_desc" were specified more than
once on the target line, a memory leak would happen.
This commit fixes the memory leak. It also fixes error handling in
verity_verify_sig_parse_opt_args.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
There's a tiny race condition in dm-mirror. The functions queue_bio and
write_callback grab a spinlock, add a bio to the list, drop the spinlock
and wake up the mirrord thread that processes bios in the list.
It may be possible that the mirrord thread processes the bio just after
spin_unlock_irqrestore is called and before wakeup_mirrord runs. This
spurious wake-up is normally harmless; however, if the device mapper
device is unloaded just after the bio was processed, wakeup_mirrord(ms)
may use an invalid "ms" pointer.
Fix this bug by moving wakeup_mirrord inside the spinlock.
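A sketch of the change in queue_bio()/write_callback() (abridged):

	/* before: "ms" may be freed between the unlock and the wakeup */
	spin_lock_irqsave(&ms->lock, flags);
	bio_list_add(&ms->writes, bio);
	spin_unlock_irqrestore(&ms->lock, flags);
	wakeup_mirrord(ms);

	/* after: wake mirrord while still holding the spinlock */
	spin_lock_irqsave(&ms->lock, flags);
	bio_list_add(&ms->writes, bio);
	wakeup_mirrord(ms);
	spin_unlock_irqrestore(&ms->lock, flags);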
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
dm_set_device_limits() should check q->limits.features for
BLK_FEAT_ATOMIC_WRITES while holding q->limits_lock, like it does for
the rest of the queue limits.
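A sketch of the idea (abridged from dm_set_device_limits()):

	mutex_lock(&q->limits_lock);
	if (q->limits.features & BLK_FEAT_ATOMIC_WRITES)
		limits->features |= BLK_FEAT_ATOMIC_WRITES;
	mutex_unlock(&q->limits_lock);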
Fixes: b7c18b17a1 ("dm-table: Set BLK_FEAT_ATOMIC_WRITES for target queue limits")
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Pull non-MM updates from Andrew Morton:
- "hung_task: extend blocking task stacktrace dump to semaphore" from
Lance Yang enhances the hung task detector.
The detector presently dumps the blocking task's stack when it is
blocked on a mutex. Lance's series extends this to semaphores
- "nilfs2: improve sanity checks in dirty state propagation" from
Wentao Liang addresses a couple of minor flaws in nilfs2
- "scripts/gdb: Fixes related to lx_per_cpu()" from Illia Ostapyshyn
fixes a couple of issues in the gdb scripts
- "Support kdump with LUKS encryption by reusing LUKS volume keys" from
Coiby Xu addresses a usability problem with kdump.
When the dump device is LUKS-encrypted, the kdump kernel may not have
the keys to the encrypted filesystem. A full writeup of this is in
the series [0/N] cover letter
- "sysfs: add counters for lockups and stalls" from Max Kellermann adds
/sys/kernel/hardlockup_count and /sys/kernel/hardlockup_count and
/sys/kernel/rcu_stall_count
- "fork: Page operation cleanups in the fork code" from Pasha Tatashin
implements a number of code cleanups in fork.c
- "scripts/gdb/symbols: determine KASLR offset on s390 during early
boot" from Ilya Leoshkevich fixes some s390 issues in the gdb
scripts
* tag 'mm-nonmm-stable-2025-05-31-15-28' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (67 commits)
llist: make llist_add_batch() a static inline
delayacct: remove redundant code and adjust indentation
squashfs: add optional full compressed block caching
crash_dump, nvme: select CONFIGFS_FS as built-in
scripts/gdb/symbols: determine KASLR offset on s390 during early boot
scripts/gdb/symbols: factor out pagination_off()
scripts/gdb/symbols: factor out get_vmlinux()
kernel/panic.c: format kernel-doc comments
mailmap: update and consolidate Casey Connolly's name and email
nilfs2: remove wbc->for_reclaim handling
fork: define a local GFP_VMAP_STACK
fork: check charging success before zeroing stack
fork: clean-up naming of vm_stack/vm_struct variables in vmap stacks code
fork: clean-up ifdef logic around stack allocation
kernel/rcu/tree_stall: add /sys/kernel/rcu_stall_count
kernel/watchdog: add /sys/kernel/{hard,soft}lockup_count
x86/crash: make the page that stores the dm crypt keys inaccessible
x86/crash: pass dm crypt keys to kdump kernel
Revert "x86/mm: Remove unused __set_memory_prot()"
crash_dump: retrieve dm crypt keys in kdump kernel
...
Pull hardening updates from Kees Cook:
- Update overflow helpers to ease refactoring of on-stack flex array
instances (Gustavo A. R. Silva, Kees Cook)
- lkdtm: Use SLAB_NO_MERGE instead of constructors (Harry Yoo)
- Simplify CONFIG_CC_HAS_COUNTED_BY (Jan Hendrik Farr)
- Disable u64 usercopy KUnit test on 32-bit SPARC (Thomas Weißschuh)
- Add missed designated initializers now exposed by fixed randstruct
(Nathan Chancellor, Kees Cook)
- Document compiler versions for __builtin_dynamic_object_size
- Remove ARM_SSP_PER_TASK GCC plugin
- Fix GCC plugin randstruct, add selftests, and restore COMPILE_TEST
builds
- Kbuild: induce full rebuilds when dependencies change with GCC
plugins, the Clang sanitizer .scl file, or the randstruct seed.
- Kbuild: Switch from -Wvla to -Wvla-larger-than=1
- Correct several __nonstring uses for -Wunterminated-string-initialization
* tag 'hardening-v6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (23 commits)
Revert "hardening: Disable GCC randstruct for COMPILE_TEST"
lib/tests: randstruct: Add deep function pointer layout test
lib/tests: Add randstruct KUnit test
randstruct: gcc-plugin: Remove bogus void member
net: qede: Initialize qede_ll_ops with designated initializer
scsi: qedf: Use designated initializer for struct qed_fcoe_cb_ops
md/bcache: Mark __nonstring look-up table
integer-wrap: Force full rebuild when .scl file changes
randstruct: Force full rebuild when seed changes
gcc-plugins: Force full rebuild when plugins change
kbuild: Switch from -Wvla to -Wvla-larger-than=1
hardening: simplify CONFIG_CC_HAS_COUNTED_BY
overflow: Fix direct struct member initialization in _DEFINE_FLEX()
kunit/overflow: Add tests for STACK_FLEX_ARRAY_SIZE() helper
overflow: Add STACK_FLEX_ARRAY_SIZE() helper
input/joystick: magellan: Mark __nonstring look-up table const
watchdog: exar: Shorten identity name to fit correctly
mod_devicetable: Enlarge the maximum platform_device_id name length
overflow: Clarify expectations for getting DEFINE_FLEX variable sizes
compiler_types: Identify compiler versions for __builtin_dynamic_object_size
...
An IO hang and unrecoverable error were reported in our testing
environment. After careful research, we found that bch_allocator_thread
is stuck; the call stack is as follows:
[<0>] __switch_to+0xbc/0x108
[<0>] __closure_sync+0x7c/0xbc [bcache]
[<0>] bch_prio_write+0x430/0x448 [bcache]
[<0>] bch_allocator_thread+0xb44/0xb70 [bcache]
[<0>] kthread+0x124/0x130
[<0>] ret_from_fork+0x10/0x18
Moreover, the RESERVE_BTREE bucket slots are empty and journal_full
occurs at the same time.
When the cache disk is first used, sb.njournal_buckets defaults to 0, so
only 8 RESERVE_BTREE buckets are reserved. If the RESERVE_BTREE buckets
are used up, or btree_check_reserve() fails while a request is handling
a btree split, the request will be retried repeatedly, waiting for the
allocator thread to refill the buckets.
After the allocator thread fills the buckets, it will call
bch_prio_write(). If journal_full occurs simultaneously at this time,
journal_reclaim() and btree_flush_write() will be called sequentially,
and journal_write cannot be completed.
This is a low-probability event; we believe that reserving more
RESERVE_BTREE buckets can avoid the worst situation.
Fixes: 682811b3ce ("bcache: fix for allocator and register thread race")
Signed-off-by: Mingzhe Zou <mingzhe.zou@easystack.cn>
Signed-off-by: Coly Li <colyli@kernel.org>
Link: https://lore.kernel.org/r/20250527051601.74407-4-colyli@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull block updates from Jens Axboe:
- ublk updates:
- Add support for updating the size of a ublk instance
- Zero-copy improvements
- Auto-registering of buffers for zero-copy
- Series simplifying and improving GET_DATA and request lookup
- Series adding quiesce support
- Lots of selftests additions
- Various cleanups
- NVMe updates via Christoph:
- add per-node DMA pools and use them for PRP/SGL allocations
(Caleb Sander Mateos, Keith Busch)
- nvme-fcloop refcounting fixes (Daniel Wagner)
- support delayed removal of the multipath node and optionally
support the multipath node for private namespaces (Nilay Shroff)
- support shared CQs in the PCI endpoint target code (Wilfred
Mallawa)
- support admin-queue only authentication (Hannes Reinecke)
- use the crc32c library instead of the crypto API (Eric Biggers)
- misc cleanups (Christoph Hellwig, Marcelo Moreira, Hannes
Reinecke, Leon Romanovsky, Gustavo A. R. Silva)
- MD updates via Yu:
- Fix that normal IO can be starved by sync IO, found by mkfs on
newly created large raid5, with some clean up patches for bdev
inflight counters
- Clean up brd, getting rid of atomic kmaps and bvec poking
- Add loop driver specifically for zoned IO testing
- Eliminate blk-rq-qos calls with a static key, if not enabled
- Improve hctx locking for when a plug has IO for multiple queues
pending
- Remove block layer bouncing support, which in turn means we can
remove the per-node bounce stat as well
- Improve blk-throttle support
- Improve delay support for blk-throttle
- Improve brd discard support
- Unify IO scheduler switching. This should also fix a bunch of lockdep
warnings we've been seeing, after enabling lockdep support for queue
freezing/unfreezeing
- Add support for block write streams via FDP (flexible data placement)
on NVMe
- Add a bunch of block helpers, facilitating the removal of a bunch of
duplicated boilerplate code
- Remove obsolete BLK_MQ pci and virtio Kconfig options
- Add atomic/untorn write support to blktrace
- Various little cleanups and fixes
* tag 'for-6.16/block-20250523' of git://git.kernel.dk/linux: (186 commits)
selftests: ublk: add test for UBLK_F_QUIESCE
ublk: add feature UBLK_F_QUIESCE
selftests: ublk: add test case for UBLK_U_CMD_UPDATE_SIZE
traceevent/block: Add REQ_ATOMIC flag to block trace events
ublk: run auto buf unregisgering in same io_ring_ctx with registering
io_uring: add helper io_uring_cmd_ctx_handle()
ublk: remove io argument from ublk_auto_buf_reg_fallback()
ublk: handle ublk_set_auto_buf_reg() failure correctly in ublk_fetch()
selftests: ublk: add test for covering UBLK_AUTO_BUF_REG_FALLBACK
selftests: ublk: support UBLK_F_AUTO_BUF_REG
ublk: support UBLK_AUTO_BUF_REG_FALLBACK
ublk: register buffer to local io_uring with provided buf index via UBLK_F_AUTO_BUF_REG
ublk: prepare for supporting to register request buffer automatically
ublk: convert to refcount_t
selftests: ublk: make IO & device removal test more stressful
nvme: rename nvme_mpath_shutdown_disk to nvme_mpath_remove_disk
nvme: introduce multipath_always_on module param
nvme-multipath: introduce delayed removal of the multipath head node
nvme-pci: derive and better document max segments limits
nvme-pci: use struct_size for allocation struct nvme_dev
...
Replace spin_lock_irqsave/spin_unlock_irqrestore with
spin_lock_irq/spin_unlock_irq at places where it is known that interrupts
are enabled.
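An illustrative before/after (the critical section body is illustrative):

	/* before: saves and restores the interrupt state */
	spin_lock_irqsave(&m->lock, flags);
	fail_path(pgpath);
	spin_unlock_irqrestore(&m->lock, flags);

	/* after: interrupts are known to be enabled here */
	spin_lock_irq(&m->lock);
	fail_path(pgpath);
	spin_unlock_irq(&m->lock);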
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Grabbing the work_mutex keeps probe_active_paths() from running at the
same time as multipath_message(). The only messages that could interfere
with probing the paths are "disable_group", "enable_group", and
"switch_group". These messages could force multipath to pick a new
pathgroup while probe_active_paths() was probing the current pathgroup.
If the multipath device has a hardware handler, and it switches active
pathgroups while there is outstanding IO to a path device, it's possible
that IO to the path will fail, even if the path would be usable if it
was in the active pathgroup. To avoid this, do not clear the current
pathgroup for the *_group messages while probe_active_paths() is
running. Instead set a flag, and probe_active_paths() will clear the
current pathgroup when it finishes probing the paths. For this to work
correctly, multipath needs to check current_pg before next_pg in
choose_pgpath(), but before this patch next_pg was only ever set when
current_pg was cleared, so this doesn't change the current behavior when
paths aren't being probed. Even with this change, it is still possible
to switch pathgroups while the probe is running, but only if all the
paths have failed, and the probe function will skip them as well in this
case.
If multiple DM_MPATH_PROBE_PATHS requests are received at once, there is
no point in repeatedly issuing test IOs. Instead, the later probes
should wait for the current probe to complete. If the current pathgroup
is still the same as the one that was just checked, the other probes
should skip probing and just check the number of valid paths. Finally,
probing the paths should quit early if the multipath device is trying to
suspend, instead of continuing to issue test IOs and delaying the
suspend.
While this patch will not change the behavior of existing multipath
users which don't use the DM_MPATH_PROBE_PATHS ioctl, when that ioctl
is used, the behavior of the "disable_group", "enable_group", and
"switch_group" messages can change subtly. When these messages return,
the next IO to the multipath device will no longer be guaranteed to
choose a new pathgroup. Instead, choosing a new pathgroup could be
delayed by an in-progress DM_MPATH_PROBE_PATHS ioctl. The userspace
multipath tools make no assumptions about what will happen to IOs after
sending these messages, so this change will not affect already released
versions of them, even if the DM_MPATH_PROBE_PATHS ioctl is run
alongside them.
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
The function get_volume_page_protected may place a request on
a queue for another thread to process asynchronously. When this
happens, the volume should not read the request from the original
thread. This cannot currently cause problems, due to the way request
processing is handled, but it is not safe in general.
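The safe pattern looks like this (function and field names are
illustrative, not the actual vdo code):

	/* capture everything needed from the request before enqueuing it */
	unsigned int page_number = request->page_number;

	enqueue_for_async_processing(queue, request);
	/*
	 * Another thread may now be processing - or may have freed - the
	 * request, so only the values captured above may be used from
	 * here on.
	 */
	wait_for_page(volume, page_number);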
Reviewed-by: Ken Raeburn <raeburn@redhat.com>
Signed-off-by: Matthew Sakai <msakai@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
If sync_speed is above speed_min, then is_mddev_idle() will be called
for each sync IO to check if the array is idle, and inflight sync_io
will be limited if the array is not idle.
However, when running mkfs.ext4 on a large raid5 array while recovery is
in progress, it was found that sync_speed is already above speed_min
while lots of stripes are used for sync IO, causing a long delay for
mkfs.ext4.
The root cause is the following check in is_mddev_idle():
t1: submit sync IO: events1 = completed IO - issued sync IO
t2: submit next sync IO: events2 = completed IO - issued sync IO
if (events2 - events1 > 64)
As a consequence, the more sync IO is issued, the less likely the check
will pass. And when the completed normal IO exceeds the issued sync IO,
the condition will finally pass and is_mddev_idle() will return false;
however, last_events will be updated, hence is_mddev_idle() can only
return false once in a while.
Fix this problem by changing the check as follows, considering the array
idle only if:
1) the mddev has no normal IO completed;
2) the mddev has no normal IO inflight;
3) if any member disk is a partition, all other partitions on the same
disk have no IO completed.
Also change rdev->last_events to unsigned long to clean up type casting.
Link: https://lore.kernel.org/linux-raid/20250506124903.2540268-9-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Currently, if the sync speed is above speed_min and below speed_max,
md_do_sync() will wait for all sync IOs to be done before issuing new
sync IO, which means the sync IO depth is limited to just 1.
This limit is too low. In order to prevent the sync speed from dropping
conspicuously after fixing is_mddev_idle() in the next patch, add a new
api for limiting the sync IO depth; the default value is 32.
Link: https://lore.kernel.org/linux-raid/20250506124903.2540268-8-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
GCC 15's new -Wunterminated-string-initialization notices that the 16
character lookup table "zero_uuid" (which is not used as a C-String)
needs to be marked as "nonstring":
drivers/md/bcache/super.c: In function 'uuid_find_empty':
drivers/md/bcache/super.c:549:43: warning: initializer-string for array of 'char' truncates NUL terminator but destination lacks 'nonstring' attribute (17 chars into 16 available) [-Wunterminated-string-initialization]
549 | static const char zero_uuid[16] = "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0";
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Add the annotation (since it is not used as a C-String), and switch the
initializer to an array of bytes rather than an empty initializer,
as preferred by Coly Li.
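The resulting declaration (as suggested) is:

	static const char zero_uuid[16] __nonstring = { 0 };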
Suggested-by: Coly Li <colyli@kernel.org>
Link: https://lore.kernel.org/lkml/389A9925-0990-422C-A1B3-0195FAA73288@coly.li/
Signed-off-by: Kees Cook <kees@kernel.org>
Make the device-mapper layer pass through the derive_sw_secret,
import_key, generate_key, and prepare_key blk-crypto operations when all
underlying devices support hardware-wrapped inline crypto keys and are
passing through inline crypto support.
Commit ebc4176551 ("blk-crypto: add basic hardware-wrapped key
support") already arranged for BLK_CRYPTO_KEY_TYPE_HW_WRAPPED to be
passed through in the same way as the other crypto capabilities. But the
wrapped
key support also includes additional operations in blk_crypto_ll_ops,
and the dm layer needs to implement those to pass them through.
derive_sw_secret is needed by fscrypt, while the other operations are
needed for the new blk-crypto ioctls to work on device-mapper devices
and not just the raw partitions.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Pull device mapper fixes from Mikulas Patocka:
- fix reading past the end of allocated memory
- fix missing dm_put_live_table() in dm_keyslot_evict()
* tag 'for-6.15/dm-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm: fix copying after src array boundaries
dm: add missing unlock on in dm_keyslot_evict()
Feature flag BLK_FEAT_ATOMIC_WRITES is not being properly set for the
target queue limits, and this means that atomic writes are not being
enabled for any dm personalities.
When calling dm_set_device_limits() -> blk_stack_limits() ->
... -> blk_stack_atomic_writes_limits(), the bottom device limits
(which correspond to the intermediate target queue limits) do not have
BLK_FEAT_ATOMIC_WRITES set, and so atomic writes can never be enabled.
Typically such a flag would be inherited from the stacked device in
dm_set_device_limits() -> blk_stack_limits() via BLK_FEAT_INHERIT_MASK,
but BLK_FEAT_ATOMIC_WRITES is not inherited as it's preferred to manually
enable on a per-personality basis.
Set BLK_FEAT_ATOMIC_WRITES manually for the intermediate target queue
limits from the stacked device to get atomic writes working.
Fixes: 3194e36488 ("dm-table: atomic writes support")
Cc: stable@vger.kernel.org # v6.14
Signed-off-by: John Garry <john.g.garry@oracle.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Multipath cannot directly provide failover for ioctls in the kernel
because it doesn't know what each ioctl means and which result could
indicate a path error. Userspace generally knows what the ioctl it
issued means and if it might be a path error, but neither does it know
which path the ioctl took nor does it necessarily have the privileges to
fail a path using the control device.
In order to allow userspace to address this situation, implement a
DM_MPATH_PROBE_PATHS ioctl that prompts the dm-mpath driver to probe all
active paths in the current path group to see whether they still work,
and fail them if not. If this returns success, userspace can retry the
ioctl and expect that the previously hit bad path is now failed (or
working again).
The immediate motivation for this is the use of SG_IO in QEMU for SCSI
passthrough. Following a failed SG_IO ioctl, QEMU will trigger probing
to ensure that all active paths are actually alive, so that retrying
SG_IO at least has a lower chance of failing due to a path error.
However, the problem is broader than just SG_IO (it affects any ioctl),
and if applications need failover support for other ioctls, the same
probing can be used.
This is not implemented on the DM control device, but on the DM mpath
block devices, to allow all users who have access to such a block device
to make use of this interface, specifically to implement failover for
ioctls. For the same reason, it is also unprivileged. Its implementation
is effectively just a bunch of reads, which could already be issued by
userspace, just without any guarantee that all the right paths are
selected.
The probing implemented here is done fully synchronously path by path;
probing all paths concurrently is left as an improvement for the future.
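A hedged usage sketch from userspace, assuming DM_MPATH_PROBE_PATHS is
exposed through <linux/dm-ioctl.h> and takes no argument:

	#include <fcntl.h>
	#include <unistd.h>
	#include <sys/ioctl.h>
	#include <linux/dm-ioctl.h>

	int probe_active_paths(const char *mpath_dev)
	{
		int fd = open(mpath_dev, O_RDONLY);
		int ret;

		if (fd < 0)
			return -1;
		/* synchronously probe all paths in the current pathgroup */
		ret = ioctl(fd, DM_MPATH_PROBE_PATHS);
		close(fd);
		return ret;	/* on success, retry the failed ioctl (e.g. SG_IO) */
	}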
Co-developed-by: Hanna Czenczek <hreitz@redhat.com>
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
This adds a 'bool *forward' parameter to .prepare_ioctl, which allows
device mapper targets to accept ioctls to themselves instead of the
underlying device. If the target already fully handled the ioctl, it
sets *forward to false and device mapper won't forward it to the
underlying device any more.
In order for targets to actually know what the ioctl is about and how to
handle it, also pass cmd and arg.
As long as targets restrict themselves to interpreting ioctls of type
DM_IOCTL, this is a backwards compatible change because previously, any
such ioctl would have been passed down through all device mapper layers
until it reached a device that can't understand the ioctl and would
return an error.
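A sketch of a target hook using the extended interface (the signature
follows this change; the target internals and helper are hypothetical):

	static int my_prepare_ioctl(struct dm_target *ti,
				    struct block_device **bdev,
				    unsigned int cmd, unsigned long arg,
				    bool *forward)
	{
		struct my_target *mt = ti->private;	/* hypothetical */

		if (_IOC_TYPE(cmd) == DM_IOCTL) {
			*forward = false;	/* fully handled, don't pass it down */
			return my_handle_ioctl(mt, cmd, arg);	/* hypothetical */
		}

		*bdev = mt->dev->bdev;	/* *forward stays true: forward to this bdev */
		return 0;
	}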
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
dm-flakey corrupts read bios in the endio function. However, the
corrupt_bio_* functions checked bio_has_data() to see if there was data
to corrupt. Since this ran in the endio function, the bio's iterator had
already been advanced past all of its data, so bio_has_data() was always
false. Fix this by saving a copy of the bio's bi_iter in flakey_map(),
and using this to initialize the iter for corrupting the read bios. This
patch also skips cloning the bio for write bios with no data.
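A sketch of the approach (struct and field names illustrative):

	struct per_bio_data {
		struct bvec_iter saved_iter;
	};

	static int flakey_map(struct dm_target *ti, struct bio *bio)
	{
		struct per_bio_data *pb = dm_per_bio_data(bio, sizeof(*pb));

		pb->saved_iter = bio->bi_iter;	/* bi_iter is consumed by completion */
		return DM_MAPIO_REMAPPED;	/* rest of the mapping abridged */
	}

	/*
	 * In the endio path, initialize the corruption iterator from
	 * pb->saved_iter instead of the now-exhausted bio->bi_iter.
	 */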
Reported-by: Kent Overstreet <kent.overstreet@linux.dev>
Fixes: a3998799fb ("dm flakey: add corrupt_bio_byte feature")
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>