Pull hardening updates from Kees Cook:
- Update overflow helpers to ease refactoring of on-stack flex array
instances (Gustavo A. R. Silva, Kees Cook)
- lkdtm: Use SLAB_NO_MERGE instead of constructors (Harry Yoo)
- Simplify CONFIG_CC_HAS_COUNTED_BY (Jan Hendrik Farr)
- Disable u64 usercopy KUnit test on 32-bit SPARC (Thomas Weißschuh)
- Add missed designated initializers now exposed by fixed randstruct
(Nathan Chancellor, Kees Cook)
- Document compilers versions for __builtin_dynamic_object_size
- Remove ARM_SSP_PER_TASK GCC plugin
- Fix GCC plugin randstruct, add selftests, and restore COMPILE_TEST
builds
- Kbuild: induce full rebuilds when dependencies change with GCC
plugins, the Clang sanitizer .scl file, or the randstruct seed.
- Kbuild: Switch from -Wvla to -Wvla-larger-than=1
- Correct several __nonstring uses for -Wunterminated-string-initialization
* tag 'hardening-v6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (23 commits)
Revert "hardening: Disable GCC randstruct for COMPILE_TEST"
lib/tests: randstruct: Add deep function pointer layout test
lib/tests: Add randstruct KUnit test
randstruct: gcc-plugin: Remove bogus void member
net: qede: Initialize qede_ll_ops with designated initializer
scsi: qedf: Use designated initializer for struct qed_fcoe_cb_ops
md/bcache: Mark __nonstring look-up table
integer-wrap: Force full rebuild when .scl file changes
randstruct: Force full rebuild when seed changes
gcc-plugins: Force full rebuild when plugins change
kbuild: Switch from -Wvla to -Wvla-larger-than=1
hardening: simplify CONFIG_CC_HAS_COUNTED_BY
overflow: Fix direct struct member initialization in _DEFINE_FLEX()
kunit/overflow: Add tests for STACK_FLEX_ARRAY_SIZE() helper
overflow: Add STACK_FLEX_ARRAY_SIZE() helper
input/joystick: magellan: Mark __nonstring look-up table const
watchdog: exar: Shorten identity name to fit correctly
mod_devicetable: Enlarge the maximum platform_device_id name length
overflow: Clarify expectations for getting DEFINE_FLEX variable sizes
compiler_types: Identify compiler versions for __builtin_dynamic_object_size
...
Pull block updates from Jens Axboe:
- ublk updates:
- Add support for updating the size of a ublk instance
- Zero-copy improvements
- Auto-registering of buffers for zero-copy
- Series simplifying and improving GET_DATA and request lookup
- Series adding quiesce support
- Lots of selftests additions
- Various cleanups
- NVMe updates via Christoph:
- add per-node DMA pools and use them for PRP/SGL allocations
(Caleb Sander Mateos, Keith Busch)
- nvme-fcloop refcounting fixes (Daniel Wagner)
- support delayed removal of the multipath node and optionally
support the multipath node for private namespaces (Nilay Shroff)
- support shared CQs in the PCI endpoint target code (Wilfred
Mallawa)
- support admin-queue only authentication (Hannes Reinecke)
- use the crc32c library instead of the crypto API (Eric Biggers)
- misc cleanups (Christoph Hellwig, Marcelo Moreira, Hannes
Reinecke, Leon Romanovsky, Gustavo A. R. Silva)
- MD updates via Yu:
- Fix that normal IO can be starved by sync IO, found by mkfs on
newly created large raid5, with some clean up patches for bdev
inflight counters
- Clean up brd, getting rid of atomic kmaps and bvec poking
- Add loop driver specifically for zoned IO testing
- Eliminate blk-rq-qos calls with a static key, if not enabled
- Improve hctx locking for when a plug has IO for multiple queues
pending
- Remove block layer bouncing support, which in turn means we can
remove the per-node bounce stat as well
- Improve blk-throttle support
- Improve delay support for blk-throttle
- Improve brd discard support
- Unify IO scheduler switching. This should also fix a bunch of lockdep
warnings we've been seeing, after enabling lockdep support for queue
freezing/unfreezeing
- Add support for block write streams via FDP (flexible data placement)
on NVMe
- Add a bunch of block helpers, facilitating the removal of a bunch of
duplicated boilerplate code
- Remove obsolete BLK_MQ pci and virtio Kconfig options
- Add atomic/untorn write support to blktrace
- Various little cleanups and fixes
* tag 'for-6.16/block-20250523' of git://git.kernel.dk/linux: (186 commits)
selftests: ublk: add test for UBLK_F_QUIESCE
ublk: add feature UBLK_F_QUIESCE
selftests: ublk: add test case for UBLK_U_CMD_UPDATE_SIZE
traceevent/block: Add REQ_ATOMIC flag to block trace events
ublk: run auto buf unregisgering in same io_ring_ctx with registering
io_uring: add helper io_uring_cmd_ctx_handle()
ublk: remove io argument from ublk_auto_buf_reg_fallback()
ublk: handle ublk_set_auto_buf_reg() failure correctly in ublk_fetch()
selftests: ublk: add test for covering UBLK_AUTO_BUF_REG_FALLBACK
selftests: ublk: support UBLK_F_AUTO_BUF_REG
ublk: support UBLK_AUTO_BUF_REG_FALLBACK
ublk: register buffer to local io_uring with provided buf index via UBLK_F_AUTO_BUF_REG
ublk: prepare for supporting to register request buffer automatically
ublk: convert to refcount_t
selftests: ublk: make IO & device removal test more stressful
nvme: rename nvme_mpath_shutdown_disk to nvme_mpath_remove_disk
nvme: introduce multipath_always_on module param
nvme-multipath: introduce delayed removal of the multipath head node
nvme-pci: derive and better document max segments limits
nvme-pci: use struct_size for allocation struct nvme_dev
...
If sync_speed is above speed_min, then is_mddev_idle() will be called
for each sync IO to check if the array is idle, and inflight sync_io
will be limited if the array is not idle.
However, while mkfs.ext4 for a large raid5 array while recovery is in
progress, it's found that sync_speed is already above speed_min while
lots of stripes are used for sync IO, causing long delay for mkfs.ext4.
Root cause is the following checking from is_mddev_idle():
t1: submit sync IO: events1 = completed IO - issued sync IO
t2: submit next sync IO: events2 = completed IO - issued sync IO
if (events2 - events1 > 64)
For consequence, the more sync IO issued, the less likely checking will
pass. And when completed normal IO is more than issued sync IO, the
condition will finally pass and is_mddev_idle() will return false,
however, last_events will be updated hence is_mddev_idle() can only
return false once in a while.
Fix this problem by changing the checking as following:
1) mddev doesn't have normal IO completed;
2) mddev doesn't have normal IO inflight;
3) if any member disks is partition, and all other partitions doesn't
have IO completed.
Also change rdev->last_events to unsigned long to cleanup type casting.
Link: https://lore.kernel.org/linux-raid/20250506124903.2540268-9-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Currently if sync speed is above speed_min and below speed_max,
md_do_sync() will wait for all sync IOs to be done before issuing new
sync IO, means sync IO depth is limited to just 1.
This limit is too low, in order to prevent sync speed drop conspicuously
after fixing is_mddev_idle() in the next patch, add a new api for
limiting sync IO depth, the default value is 32.
Link: https://lore.kernel.org/linux-raid/20250506124903.2540268-8-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
GCC 15's new -Wunterminated-string-initialization notices that the 16
character lookup table "zero_uuid" (which is not used as a C-String)
needs to be marked as "nonstring":
drivers/md/bcache/super.c: In function 'uuid_find_empty':
drivers/md/bcache/super.c:549:43: warning: initializer-string for array of 'char' truncates NUL terminator but destination lacks 'nonstring' attribute (17 chars into 16 available) [-Wunterminated-string-initialization]
549 | static const char zero_uuid[16] = "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0";
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Add the annotation (since it is not used as a C-String), and switch the
initializer to an array of bytes rather than an empty initializer,
as preferred by Coly Li.
Suggested-by: Coly Li <colyli@kernel.org>
Link: https://lore.kernel.org/lkml/389A9925-0990-422C-A1B3-0195FAA73288@coly.li/
Signed-off-by: Kees Cook <kees@kernel.org>
Pull device mapper fixes from Mikulas Patocka:
- fix reading past the end of allocated memory
- fix missing dm_put_live_table() in dm_keyslot_evict()
* tag 'for-6.15/dm-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm: fix copying after src array boundaries
dm: add missing unlock on in dm_keyslot_evict()
We need to call dm_put_live_table() even if dm_get_live_table() returns
NULL.
Fixes: 9355a9eb21 ("dm: support key eviction from keyslot managers of underlying devices")
Cc: stable@vger.kernel.org # v5.12+
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Pull device mapper fixes from Mikulas Patocka:
- always update the array size in realloc_argv on success
- dm-integrity: fix a warning on invalid table line
- dm-bufio: don't schedule in atomic context
- Fix W=1 build with clang
* tag 'for-6.15/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm: always update the array size in realloc_argv on success
dm-integrity: fix a warning on invalid table line
dm-bufio: don't schedule in atomic context
dm table: Fix W=1 build warning when mempool_needs_integrity is unused
realloc_argv() was only updating the array size if it was called with
old_argv already allocated. The first time it was called to create an
argv array, it would allocate the array but return the array size as
zero. dm_split_args() would think that it couldn't store any arguments
in the array and would call realloc_argv() again, causing it to
reallocate the initial slots (this time using GPF_KERNEL) and finally
return a size. Aside from being wasteful, this could cause deadlocks on
targets that need to process messages without starting new IO. Instead,
realloc_argv should always update the allocated array size on success.
Fixes: a065192655 ("dm table: don't copy from a NULL pointer in realloc_argv()")
Cc: stable@vger.kernel.org
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
If we use the 'B' mode and we have an invalit table line,
cancel_delayed_work_sync would trigger a warning. This commit avoids the
warning.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
This removes two cases of explicit NUL padding that now causes warnings
because of '-Wunterminated-string-initialization' being part of -Wextra
in gcc-15.
Gcc is being silly in this case when it says that it truncates a NUL
terminator, because in these cases there were _multiple_ NUL characters.
But we can get rid of the warning by just simplifying the two
initializers that trigger the warning for me, so this does exactly that.
I'm not sure why the power supply code did that odd
.attr_name = #_name "\0",
pattern: it was introduced in commit 2cabeaf151 ("power: supply: core:
Cleanup power supply sysfs attribute list"), but that 'attr_name[]'
field is an explicitly sized character array in a statically initialized
variable, and a string initializer always has a terminating NUL _and_
statically initialized character arrays are zero-padded anyway, so it
really seems to be rather extraneous belt-and-suspenders.
The zero_uuid[16] initialization in drivers/md/bcache/super.c makes
perfect sense, but it isn't necessary for the same reasons, and not
worth the new gcc warning noise.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull block fixes from Jens Axboe:
- MD pull via Yu:
- fix raid10 missing discard IO accounting (Yu Kuai)
- fix bitmap stats for bitmap file (Zheng Qixing)
- fix oops while reading all member disks failed during
check/repair (Meir Elisha)
- NVMe pull via Christoph:
- fix scan failure for non-ANA multipath controllers (Hannes
Reinecke)
- fix multipath sysfs links creation for some cases (Hannes
Reinecke)
- PCIe endpoint fixes (Damien Le Moal)
- use NULL instead of 0 in the auth code (Damien Le Moal)
- Various ublk fixes:
- Slew of selftest additions
- Improvements and fixes for IO cancelation
- Tweak to Kconfig verbiage
- Fix for page dirtying for blk integrity mapped pages
- loop fixes:
- buffered IO fix
- uevent fixes
- request priority inheritance fix
- Various little fixes
* tag 'block-6.15-20250417' of git://git.kernel.dk/linux: (38 commits)
selftests: ublk: add generic_06 for covering fault inject
ublk: simplify aborting ublk request
ublk: remove __ublk_quiesce_dev()
ublk: improve detection and handling of ublk server exit
ublk: move device reset into ublk_ch_release()
ublk: rely on ->canceling for dealing with ublk_nosrv_dev_should_queue_io
ublk: add ublk_force_abort_dev()
ublk: properly serialize all FETCH_REQs
selftests: ublk: move creating UBLK_TMP into _prep_test()
selftests: ublk: add test_stress_05.sh
selftests: ublk: support user recovery
selftests: ublk: support target specific command line
selftests: ublk: increase max nr_queues and queue depth
selftests: ublk: set queue pthread's cpu affinity
selftests: ublk: setup ring with IORING_SETUP_SINGLE_ISSUER/IORING_SETUP_DEFER_TASKRUN
selftests: ublk: add two stress tests for zero copy feature
selftests: ublk: run stress tests in parallel
selftests: ublk: make sure _add_ublk_dev can return in sub-shell
selftests: ublk: cleanup backfile automatically
selftests: ublk: add io_uring uapi header
...
During recovery/check operations, the process_checks function loops
through available disks to find a 'primary' source with successfully
read data.
If no suitable source disk is found after checking all possibilities,
the 'primary' index will reach conf->raid_disks * 2. Add an explicit
check for this condition after the loop. If no source disk was found,
print an error message and return early to prevent further processing
without a valid primary source.
Link: https://lore.kernel.org/linux-raid/20250408143808.1026534-1-meir.elisha@volumez.com
Signed-off-by: Meir Elisha <meir.elisha@volumez.com>
Suggested-and-reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
The mempool_needs_integrity is unused. This, in particular, prevents
kernel builds with Clang, `make W=1` and CONFIG_WERROR=y:
drivers/md/dm-table.c:1052:7: error: variable 'mempool_needs_integrity' set but not used [-Werror,-Wunused-but-set-variable]
1052 | bool mempool_needs_integrity = t->integrity_supported;
| ^
Fix this by removing the leftover.
Fixes: 105ca2a2c2 ("block: split struct bio_integrity_payload")
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Pull CRC cleanups from Eric Biggers:
"Finish cleaning up the CRC kconfig options by removing the remaining
unnecessary prompts and an unnecessary 'default y', removing
CONFIG_LIBCRC32C, and documenting all the CRC library options"
* tag 'crc-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux:
lib/crc: remove CONFIG_LIBCRC32C
lib/crc: document all the CRC library kconfig options
lib/crc: remove unnecessary prompt for CONFIG_CRC_ITU_T
lib/crc: remove unnecessary prompt for CONFIG_CRC_T10DIF
lib/crc: remove unnecessary prompt for CONFIG_CRC16
lib/crc: remove unnecessary prompt for CONFIG_CRC_CCITT
lib/crc: remove unnecessary prompt for CONFIG_CRC32 and drop 'default y'
The bitmap_get_stats() function incorrectly returns -ENOENT for external
bitmaps.
Remove the external bitmap check as the statistics should be available
regardless of bitmap storage location.
Return -EINVAL only for invalid bitmap with no storage (neither in
superblock nor in external file).
Note: "bitmap_info.external" here refers to a bitmap stored in a separate
file (bitmap_file), not to external metadata.
Fixes: 8d28d0ddb9 ("md/md-bitmap: Synchronize bitmap_get_stats() with bitmap lifetime")
Signed-off-by: Zheng Qixing <zhengqixing@huawei.com>
Link: https://lore.kernel.org/linux-raid/20250403015322.2873369-1-zhengqixing@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
timer_delete[_sync]() replaces del_timer[_sync](). Convert the whole tree
over and remove the historical wrapper inlines.
Conversion was done with coccinelle plus manual fixups where necessary.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
There's a possible race condition in dm-ebs - dm bufio prefetch may be in
progress while the device is suspended. Fix this by calling
dm_bufio_client_reset in the postsuspend hook.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
There's a possible race condition in dm-verity - the prefetch work item
may race with suspend and it is possible that prefetch continues to run
while the device is suspended. Fix this by calling flush_workqueue and
dm_bufio_client_reset in the postsuspend hook.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
When using dm-integrity in standalone mode with a keyed hmac algorithm,
integrity tags are calculated and verified internally.
Using plain memcmp to compare the stored and computed tags may leak the
position of the first byte mismatch through side-channel analysis,
allowing to brute-force expected tags in linear time (e.g., by counting
single-stepping interrupts in confidential virtual machine environments).
Co-developed-by: Luca Wilke <work@luca-wilke.com>
Signed-off-by: Luca Wilke <work@luca-wilke.com>
Signed-off-by: Jo Van Bulck <jo.vanbulck@cs.kuleuven.be>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
Calling verity_verify_io in bh for IO of all sizes is not suitable for
embedded devices. From our tests, it can improve the performance of 4K
synchronise random reads.
For example:
./fio --name=rand_read --ioengine=psync --rw=randread --bs=4K \
--direct=1 --numjobs=8 --runtime=60 --time_based --group_reporting \
--filename=/dev/block/mapper/xx-verity
But it will degrade the performance of 512K synchronise sequential reads
on our devices.
For example:
./fio --name=read --ioengine=psync --rw=read --bs=512K --direct=1 \
--numjobs=8 --runtime=60 --time_based --group_reporting \
--filename=/dev/block/mapper/xx-verity
A parameter array is introduced by this change. And users can modify the
default config by /sys/module/dm_verity/parameters/use_bh_bytes.
The default limits for NONE/RT/BE is set to 8192.
The default limits for IDLE is set to 0.
Call verity_verify_io directly when verity_end_io is not in hardirq.
Signed-off-by: LongPing Wei <weilongping@oppo.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Pull block updates from Jens Axboe:
- Fixes for integrity handling
- NVMe pull request via Keith:
- Secure concatenation for TCP transport (Hannes)
- Multipath sysfs visibility (Nilay)
- Various cleanups (Qasim, Baruch, Wang, Chen, Mike, Damien, Li)
- Correct use of 64-bit BARs for pci-epf target (Niklas)
- Socket fix for selinux when used in containers (Peijie)
- MD pull request via Yu:
- fix recovery can preempt resync (Li Nan)
- fix md-bitmap IO limit (Su Yue)
- fix raid10 discard with REQ_NOWAIT (Xiao Ni)
- fix raid1 memory leak (Zheng Qixing)
- fix mddev uaf (Yu Kuai)
- fix raid1,raid10 IO flags (Yu Kuai)
- some refactor and cleanup (Yu Kuai)
- Series cleaning up and fixing bugs in the bad block handling code
- Improve support for write failure simulation in null_blk
- Various lock ordering fixes
- Fixes for locking for debugfs attributes
- Various ublk related fixes and improvements
- Cleanups for blk-rq-qos wait handling
- blk-throttle fixes
- Fixes for loop dio and sync handling
- Fixes and cleanups for the auto-PI code
- Block side support for hardware encryption keys in blk-crypto
- Various cleanups and fixes
* tag 'for-6.15/block-20250322' of git://git.kernel.dk/linux: (105 commits)
nvmet: replace max(a, min(b, c)) by clamp(val, lo, hi)
nvme-tcp: fix selinux denied when calling sock_sendmsg
nvmet: pci-epf: Always configure BAR0 as 64-bit
nvmet: Remove duplicate uuid_copy
nvme: zns: Simplify nvme_zone_parse_entry()
nvmet: pci-epf: Remove redundant 'flush_workqueue()' calls
nvmet-fc: Remove unused functions
nvme-pci: remove stale comment
nvme-fc: Utilise min3() to simplify queue count calculation
nvme-multipath: Add visibility for queue-depth io-policy
nvme-multipath: Add visibility for numa io-policy
nvme-multipath: Add visibility for round-robin io-policy
nvmet: add tls_concat and tls_key debugfs entries
nvmet-tcp: support secure channel concatenation
nvmet: Add 'sq' argument to alloc_ctrl_args
nvme-fabrics: reset admin connection for secure concatenation
nvme-tcp: request secure channel concatenation
nvme-keyring: add nvme_tls_psk_refresh()
nvme: add nvme_auth_derive_tls_psk()
nvme: add nvme_auth_generate_digest()
...
Pull CRC updates from Eric Biggers:
"Another set of improvements to the kernel's CRC (cyclic redundancy
check) code:
- Rework the CRC64 library functions to be directly optimized, like
what I did last cycle for the CRC32 and CRC-T10DIF library
functions
- Rewrite the x86 PCLMULQDQ-optimized CRC code, and add VPCLMULQDQ
support and acceleration for crc64_be and crc64_nvme
- Rewrite the riscv Zbc-optimized CRC code, and add acceleration for
crc_t10dif, crc64_be, and crc64_nvme
- Remove crc_t10dif and crc64_rocksoft from the crypto API, since
they are no longer needed there
- Rename crc64_rocksoft to crc64_nvme, as the old name was incorrect
- Add kunit test cases for crc64_nvme and crc7
- Eliminate redundant functions for calculating the Castagnoli CRC32,
settling on just crc32c()
- Remove unnecessary prompts from some of the CRC kconfig options
- Further optimize the x86 crc32c code"
* tag 'crc-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux: (36 commits)
x86/crc: drop the avx10_256 functions and rename avx10_512 to avx512
lib/crc: remove unnecessary prompt for CONFIG_CRC64
lib/crc: remove unnecessary prompt for CONFIG_LIBCRC32C
lib/crc: remove unnecessary prompt for CONFIG_CRC8
lib/crc: remove unnecessary prompt for CONFIG_CRC7
lib/crc: remove unnecessary prompt for CONFIG_CRC4
lib/crc7: unexport crc7_be_syndrome_table
lib/crc_kunit.c: update comment in crc_benchmark()
lib/crc_kunit.c: add test and benchmark for crc7_be()
x86/crc32: optimize tail handling for crc32c short inputs
riscv/crc64: add Zbc optimized CRC64 functions
riscv/crc-t10dif: add Zbc optimized CRC-T10DIF function
riscv/crc32: reimplement the CRC32 functions using new template
riscv/crc: add "template" for Zbc optimized CRC functions
x86/crc: add ANNOTATE_NOENDBR to suppress objtool warnings
x86/crc32: improve crc32c_arch() code generation with clang
x86/crc64: implement crc64_be and crc64_nvme using new template
x86/crc-t10dif: implement crc_t10dif using new template
x86/crc32: implement crc32_le using new template
x86/crc: add "template" for [V]PCLMULQDQ based CRC functions
...
Add support for zoned device by passing through report_zoned to the
underlying read device.
This is required to make enable xfstests xfs/311 on zoned devices.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
The devices with size >= 2^63 bytes can't be used reliably by userspace
because the type off_t is a signed 64-bit integer.
Therefore, we limit the maximum size of a device mapper device to
2^63-512 bytes.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Merge MD changes from Yu:
"- fix recovery can preempt resync (Li Nan)
- fix md-bitmap IO limit (Su Yue)
- fix raid10 discard with REQ_NOWAIT (Xiao Ni)
- fix raid1 memory leak (Zheng Qixing)
- fix mddev uaf (Yu Kuai)
- fix raid1,raid10 IO flags (Yu Kuai)
- some refactor and cleanup (Yu Kuai)"
* tag 'md-6.15-20250312' of https://git.kernel.org/pub/scm/linux/kernel/git/mdraid/linux:
md/raid10: wait barrier before returning discard request with REQ_NOWAIT
md/md-bitmap: fix wrong bitmap_limit for clustermd when write sb
md/raid1,raid10: don't ignore IO flags
md/raid5: merge reshape_progress checking inside get_reshape_loc()
md: fix mddev uaf while iterating all_mddevs list
md: switch md-cluster to use md_submodle_head
md: don't export md_cluster_ops
md/md-cluster: cleanup md_cluster_ops reference
md: switch personalities to use md_submodule_head
md: introduce struct md_submodule_head and APIs
md: only include md-cluster.h if necessary
md: merge common code into find_pers()
md/raid1: fix memory leak in raid1_run() if no active rdev
md: ensure resync is prioritized over recovery
This patch introduces formal support for shrinking the cache origin by
reducing the cache target length via table reloads. Cache blocks mapped
beyond the new target length must be clean and are invalidated during
preresume. If any dirty blocks exist in the area being removed, the
preresume operation fails without setting the NEEDS_CHECK flag in
superblock, and the resume ioctl returns EFBIG. The cache device remains
suspended until a table reload with target length that fits existing
mappings is performed.
Without this patch, reducing the cache target length could result in
io errors (RHBZ: 2134334), out-of-bounds memory access to the discard
bitset, and security concerns regarding data leakage.
Verification steps:
1. create a cache metadata with some cached blocks mapped to the tail
of the origin device. Here we use cache_restore v1.0 to build a
metadata with one clean block mapped to the last origin block.
cat <<EOF >> cmeta.xml
<superblock uuid="" block_size="128" nr_cache_blocks="512" \
policy="smq" hint_width="4">
<mappings>
<mapping cache_block="0" origin_block="4095" dirty="false"/>
</mappings>
</superblock>
EOF
dmsetup create cmeta --table "0 8192 linear /dev/sdc 0"
cache_restore -i cmeta.xml -o /dev/mapper/cmeta --metadata-version=2
dmsetup remove cmeta
2. bring up the cache whilst shrinking the cache origin by one block:
dmsetup create cmeta --table "0 8192 linear /dev/sdc 0"
dmsetup create cdata --table "0 65536 linear /dev/sdc 8192"
dmsetup create corig --table "0 524160 linear /dev/sdc 262144"
dmsetup create cache --table "0 524160 cache /dev/mapper/cmeta \
/dev/mapper/cdata /dev/mapper/corig 128 2 metadata2 writethrough smq 0"
3. check the number of cached data blocks via dmsetup status. It is
expected to be zero.
dmsetup status cache | cut -d ' ' -f 7
In addition to the script above, this patch can be verified using the
"cache/resize" tests in dmtest-python:
./dmtest run --rx cache/resize/shrink_origin --result-set default
Signed-off-by: Ming-Hung Tsai <mtsai@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
There is a truncation of badblocks length issue when set badblocks as
follow:
echo "2055 4294967299" > bad_blocks
cat bad_blocks
2055 3
Change 'sectors' argument type from 'int' to 'sector_t'.
This change avoids truncation of badblocks length for large sectors by
replacing 'int' with 'sector_t' (u64), enabling proper handling of larger
disk sizes and ensuring compatibility with 64-bit sector addressing.
Fixes: 9e0e252a04 ("badblocks: Add core badblock management code")
Signed-off-by: Zheng Qixing <zhengqixing@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Coly Li <colyli@kernel.org>
Link: https://lore.kernel.org/r/20250227075507.151331-13-zhengqixing@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
rdev_set_badblocks() only indicates success/failure, so convert its return
type from int to boolean for better semantic clarity.
rdev_clear_badblocks() return value is never used by any caller, convert it
to void. This removes unnecessary value returns.
Also update narrow_write_error() in both raid1 and raid10 to use boolean
return type to match rdev_set_badblocks().
Signed-off-by: Zheng Qixing <zhengqixing@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20250227075507.151331-12-zhengqixing@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Change the return type of badblocks_set() and badblocks_clear()
from int to bool, indicating success or failure. Specifically:
- _badblocks_set() and _badblocks_clear() functions now return
true for success and false for failure.
- All calls to these functions are updated to handle the new
boolean return type.
- This change improves code clarity and ensures a more consistent
handling of success and failure states.
Signed-off-by: Zheng Qixing <zhengqixing@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Coly Li <colyli@kernel.org>
Acked-by: Ira Weiny <ira.weiny@intel.com>
Link: https://lore.kernel.org/r/20250227075507.151331-11-zhengqixing@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
raid10_handle_discard should wait barrier before returning a discard bio
which has REQ_NOWAIT. And there is no need to print warning calltrace
if a discard bio has REQ_NOWAIT flag. Quality engineer usually checks
dmesg and reports error if dmesg has warning/error calltrace.
Fixes: c9aa889b03 ("md: raid10 add nowait support")
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Coly Li <colyli@kernel.org>
Link: https://lore.kernel.org/linux-raid/20250306094938.48952-1-xni@redhat.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
In clustermd, separate write-intent-bitmaps are used for each cluster
node:
0 4k 8k 12k
-------------------------------------------------------------------
| idle | md super | bm super [0] + bits |
| bm bits[0, contd] | bm super[1] + bits | bm bits[1, contd] |
| bm super[2] + bits | bm bits [2, contd] | bm super[3] + bits |
| bm bits [3, contd] | | |
So in node 1, pg_index in __write_sb_page() could equal to
bitmap->storage.file_pages. Then bitmap_limit will be calculated to
0. md_super_write() will be called with 0 size.
That means the first 4k sb area of node 1 will never be updated
through filemap_write_page().
This bug causes hang of mdadm/clustermd_tests/01r1_Grow_resize.
Here use (pg_index % bitmap->storage.file_pages) to make calculation
of bitmap_limit correct.
Fixes: ab99a87542 ("md/md-bitmap: fix writing non bitmap pages")
Signed-off-by: Su Yue <glass.su@suse.com>
Reviewed-by: Heming Zhao <heming.zhao@suse.com>
Link: https://lore.kernel.org/linux-raid/20250303033918.32136-1-glass.su@suse.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
If blk-wbt is enabled by default, it's found that raid write performance
is quite bad because all IO are throttled by wbt of underlying disks,
due to flag REQ_IDLE is ignored. And turns out this behaviour exist since
blk-wbt is introduced.
Other than REQ_IDLE, other flags should not be ignored as well, for
example REQ_META can be set for filesystems, clearing it can cause priority
reverse problems; And REQ_NOWAIT should not be cleared as well, because
io will wait instead of failing directly in underlying disks.
Fix those problems by keep IO flags from master bio.
Fises: f51d46d0e7 ("md: add support for REQ_NOWAIT")
Fixes: e34cbd3074 ("blk-wbt: add general throttling mechanism")
Fixes: 5404bc7a87 ("[PATCH] Allow file systems to differentiate between data and meta reads")
Link: https://lore.kernel.org/linux-raid/20250227121657.832356-1-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
During code review, it's found that other than raid5_bitmap_sector(),
reshape_progress is always checked before get_reshape_loc(), while
raid5_bitmap_sector() should check as well to prevent holding the
lock 'conf->device_lock'. Hence merge that checking inside
get_reshape_loc().
Link: https://lore.kernel.org/linux-raid/20250227120452.808503-1-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
While iterating all_mddevs list from md_notify_reboot() and md_exit(),
list_for_each_entry_safe is used, and this can race with deletint the
next mddev, causing UAF:
t1:
spin_lock
//list_for_each_entry_safe(mddev, n, ...)
mddev_get(mddev1)
// assume mddev2 is the next entry
spin_unlock
t2:
//remove mddev2
...
mddev_free
spin_lock
list_del
spin_unlock
kfree(mddev2)
mddev_put(mddev1)
spin_lock
//continue dereference mddev2->all_mddevs
The old helper for_each_mddev() actually grab the reference of mddev2
while holding the lock, to prevent from being freed. This problem can be
fixed the same way, however, the code will be complex.
Hence switch to use list_for_each_entry, in this case mddev_put() can free
the mddev1 and it's not safe as well. Refer to md_seq_show(), also factor
out a helper mddev_put_locked() to fix this problem.
Cc: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/linux-raid/20250220124348.845222-1-yukuai1@huaweicloud.com
Fixes: f265143422 ("md: stop using for_each_mddev in md_notify_reboot")
Fixes: 16648bac86 ("md: stop using for_each_mddev in md_exit")
Reported-and-tested-by: Guillaume Morin <guillaume@morinfr.org>
Closes: https://lore.kernel.org/all/Z7Y0SURoA8xwg7vn@bender.morinfr.org/
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>