Commit Graph

8552 Commits

Author SHA1 Message Date
Yu Kuai
696936838b md/md-bitmap: merge md_bitmap_status() into bitmap_operations
So that the implementation won't be exposed, and it'll be possible
to invent a new bitmap by replacing bitmap_operations.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-20-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
2024-08-27 10:14:16 -07:00
Yu Kuai
fe59b34676 md/md-bitmap: merge md_bitmap_update_sb() into bitmap_operations
So that the implementation won't be exposed, and it'll be possible
to invent a new bitmap by replacing bitmap_operations.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-19-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
2024-08-27 10:14:16 -07:00
Yu Kuai
a0240e3ec7 md/md-bitmap: make md_bitmap_print_sb() internal
md_bitmap_print_sb() is only used inside md-bitmap.c, hence make it
static, also rename it to bitmap_print_sb.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-18-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
2024-08-27 10:14:16 -07:00
Yu Kuai
ca925302e8 md/md-bitmap: merge md_bitmap_flush() into bitmap_operations
So that the implementation won't be exposed, and it'll be possible
to invent a new bitmap by replacing bitmap_operations.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-17-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
2024-08-27 10:14:16 -07:00
Yu Kuai
a2bd703192 md/md-bitmap: merge md_bitmap_destroy() into bitmap_operations
So that the implementation won't be exposed, and it'll be possible
to invent a new bitmap by replacing bitmap_operations.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-16-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
2024-08-27 10:14:16 -07:00
Yu Kuai
e1e4908059 md/md-bitmap: merge md_bitmap_load() into bitmap_operations
So that the implementation won't be exposed, and it'll be possible
to invent a new bitmap by replacing bitmap_operations.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-15-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
2024-08-27 10:14:16 -07:00
Yu Kuai
04c80e6495 md/md-bitmap: merge md_bitmap_create() into bitmap_operations
So that the implementation won't be exposed, and it'll be possible
to invent a new bitmap by replacing bitmap_operations.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-14-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
2024-08-27 10:14:15 -07:00
Yu Kuai
7545d385ec md/md-bitmap: simplify md_bitmap_create() + md_bitmap_load()
Other than internal api get_bitmap_from_slot(), all other places will
set returned bitmap to mddev->bitmap. So move the setting of
mddev->bitmap into md_bitmap_create() to simplify code.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-13-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
2024-08-27 10:14:15 -07:00
Yu Kuai
7add9db6ba md/md-bitmap: introduce struct bitmap_operations
The structure is empty for now, and will be used in later patches to
merge in bitmap operations, so that bitmap implementation won't be
exposed.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-12-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
2024-08-27 10:14:15 -07:00
Yu Kuai
27832ad3f7 md/md-bitmap: add a new helper md_bitmap_set_pages()
Currently md-cluster will set bitmap->counts.pages directly, add a
helper to do this to avoid dereferencing bitmap directly.

Noted that after this patch bitmap is not dereferenced directly anymore
and following patches will move the structure inside md-bitmap.c.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-11-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
2024-08-27 10:14:15 -07:00
Yu Kuai
9e4481ce0e md/md-cluster: use helper md_bitmap_get_stats() to get pages in resize_bitmaps()
Use the existed helper instead of open coding it, avoid dereferencing
bitmap directly to prepare inventing a new bitmap.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-10-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
2024-08-27 10:14:15 -07:00
Yu Kuai
a0e7744a46 md/md-bitmap: add 'behind_writes' and 'behind_wait' into struct md_bitmap_stats
There are no functional changes, avoid dereferencing bitmap directly to
prepare inventing a new bitmap.

Also fix following checkpatch warning by using wq_has_sleeper().

WARNING: waitqueue_active without comment

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-9-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
2024-08-27 10:14:15 -07:00
Yu Kuai
10bc2ac105 md/md-bitmap: add 'file_pages' into struct md_bitmap_stats
There are no functional changes, avoid dereferencing bitmap directly to
prepare inventing a new bitmap.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-8-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
2024-08-27 10:14:15 -07:00
Yu Kuai
ec6bb299c7 md/md-bitmap: add 'sync_size' into struct md_bitmap_stats
To avoid dereferencing bitmap directly in md-cluster to prepare
inventing a new bitmap.

BTW, also fix following checkpatch warnings:

WARNING: Deprecated use of 'kmap_atomic', prefer 'kmap_local_page' instead
WARNING: Deprecated use of 'kunmap_atomic', prefer 'kunmap_local' instead

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-7-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
2024-08-27 10:14:15 -07:00
Yu Kuai
82697ccf7e md/md-cluster: fix spares warnings for __le64
drivers/md/md-cluster.c:1220:22: warning: incorrect type in assignment (different base types)
drivers/md/md-cluster.c:1220:22:    expected unsigned long my_sync_size
drivers/md/md-cluster.c:1220:22:    got restricted __le64 [usertype] sync_size
drivers/md/md-cluster.c:1252:35: warning: incorrect type in assignment (different base types)
drivers/md/md-cluster.c:1252:35:    expected unsigned long sync_size
drivers/md/md-cluster.c:1252:35:    got restricted __le64 [usertype] sync_size
drivers/md/md-cluster.c:1253:41: warning: restricted __le64 degrades to integer

Fix the warnings by using le64_to_cpu() to convet __le64 to integer.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-6-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
2024-08-27 10:14:14 -07:00
Yu Kuai
d004442f46 md/md-bitmap: add 'events_cleared' into struct md_bitmap_stats
Also add a new helper to get events_cleared to avoid dereferencing
bitmap directly to prepare inventing a new bitmap.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-5-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
2024-08-27 10:14:14 -07:00
Yu Kuai
9681538122 md: use new helper md_bitmap_get_stats() in update_array_info()
There are no functional changes, avoid dereferencing bitmap directly to
prepare inventing a new bitmap.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-4-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
2024-08-27 10:14:14 -07:00
Yu Kuai
38f287d7e4 md/md-bitmap: replace md_bitmap_status() with a new helper md_bitmap_get_stats()
There are no functional changes, and the new helper will be used in
multiple places in following patches to avoid dereferencing bitmap
directly.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-3-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
2024-08-27 10:14:14 -07:00
Yu Kuai
2db4fa1b7e md/raid1: use md_bitmap_wait_behind_writes() in raid1_read_request()
Use the existed helper instead of open coding it to make the code cleaner.
There are no functional changes, and also avoid dereferencing bitmap
directly to prepare inventing a new bitmap.

Noted that this patch also export md_bitmap_wait_behind_writes(), which
is necessary for now, and the exported api will be removed in following
patches to convert bitmap apis into ops.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-2-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
2024-08-27 10:14:14 -07:00
Yu Kuai
2d389a759d md/raid1: Clean up local variable 'b' from raid1_read_request()
The local variable will only be used onced, in the error path that
read_balance() failed to find a valid rdev to read. Since now the rdev
is ensured can't be removed from conf while IO is still pending,
remove the local variable and dereference rdev directly.

Since we're here, also remove an extra empty line, and unnecessary
type conversion from sector_t(u64) to unsigned long long.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240801133008.459998-1-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
2024-08-27 10:08:22 -07:00
Yu Kuai
86ad4cda79 md: Don't flush sync_work in md_write_start()
Because flush sync_work may trigger mddev_suspend() if there are spares,
and this should never be done in IO path because mddev_suspend() is used
to wait for IO.

This problem is found by code review.

Fixes: bc08041b32 ("md: suspend array in md_start_sync() if array need reconfiguration")
Cc: stable@vger.kernel.org
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240801124746.242558-1-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
2024-08-27 09:55:16 -07:00
Yuesong Li
02c0207ecd dm bufio: Remove NULL check of list_entry()
list_entry() will never return a NULL pointer, thus remove the
check.

Signed-off-by: Yuesong Li <liyuesong@vivo.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2024-08-22 18:32:55 +02:00
Ingo Franzki
4441686b24 dm-crypt: Allow to specify the integrity key size as option
For the MAC based integrity operation, the integrity key size (i.e.
key_mac_size) is currently set to the digest size of the used digest.

For wrapped key HMAC algorithms, the key size is independent of the
cryptographic key size. So there is no known size of the mac key in
such cases. The desired key size can optionally be specified as argument
when the dm-crypt device is configured via 'integrity_key_size:%u'.
If no integrity_key_size argument is specified, the mac key size
is still set to the digest size, as before.

Increase version number to 1.28.0 so that support for the new
argument can be detected by user space (i.e. cryptsetup).

Signed-off-by: Ingo Franzki <ifranzki@linux.ibm.com>
Reviewed-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2024-08-21 15:36:27 +02:00
Zhang Zekun
f3631ae11d dm: Remove unused declaration and empty definition "dm_zone_map_bio"
dm_zone_map_bio() has beed removed since commit f211268ed1 ("dm: Use
the block layer zone append emulation"), remain the declaration unused in
header files. So, let's remove this unused declaration and empty
definition.

Signed-off-by: Zhang Zekun <zhangzekun11@huawei.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2024-08-21 13:12:12 +02:00
Susan LeGendre-McGhee
448c4e4eb1 dm vdo: force read-only mode for a corrupt recovery journal
Ensure the recovery journal does not attempt recovery when blocks
with mismatched metadata versions are detected. This check is
performed after determining that the blocks are otherwise valid
so that it does not interfere with normal recovery.

Signed-off-by: Susan LeGendre-McGhee <slegendr@redhat.com>
Signed-off-by: Matthew Sakai <msakai@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2024-08-21 13:12:12 +02:00
Susan LeGendre-McGhee
f3ff668352 dm vdo: abort loading dirty VDO with the old recovery journal format
Abort the load process with status code VDO_UNSUPPORTED_VERSION
without forcing read-only mode when a journal block with the
old format version is detected.

Forcing the VDO volume into read-only mode and thus requiring
a read-only rebuild should only be done when absolutely necessary.

Signed-off-by: Susan LeGendre-McGhee <slegendr@redhat.com>
Signed-off-by: Matthew Sakai <msakai@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2024-08-21 13:11:34 +02:00
Bruce Johnston
47874c98dc dm vdo: add dmsetup message for returning configuration info
Add a new dmsetup message called config, which will return
useful configuration information for the vdo volume and
the uds index associated with it. The output is a YAML
string, and contains a version number to allow future
additions to the content.

Signed-off-by: Bruce Johnston <bjohnsto@redhat.com>
Signed-off-by: Matthew Sakai <msakai@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2024-08-21 13:05:56 +02:00
Ken Raeburn
3a59b2ec24 dm vdo: remove bad check of bi_next field
Remove this check to prevent spurious warning messages due to
the behavior of other storage layers. This check was intended
to make sure dm-vdo does not chain metadata bios together.
However, vdo has no control over underlying storage layers,
so the assertion is not always true.

Signed-off-by: Ken Raeburn <raeburn@redhat.com>
Signed-off-by: Matthew Sakai <msakai@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2024-08-21 13:05:56 +02:00
Ken Raeburn
0808ebf2f8 dm vdo: don't refer to dedupe_context after releasing it
Clear the dedupe_context pointer in a data_vio whenever ownership of
the context is lost, so that vdo can't examine it accidentally.

Signed-off-by: Ken Raeburn <raeburn@redhat.com>
Signed-off-by: Matthew Sakai <msakai@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2024-08-21 13:05:56 +02:00
Deven Bowers
a6af7bc3d7 dm-verity: expose root hash digest and signature data to LSMs
dm-verity provides a strong guarantee of a block device's integrity. As
a generic way to check the integrity of a block device, it provides
those integrity guarantees to its higher layers, including the filesystem
level.

However, critical security metadata like the dm-verity roothash and its
signing information are not easily accessible to the LSMs.
To address this limitation, this patch introduces a mechanism to store
and manage these essential security details within a newly added LSM blob
in the block_device structure.

This addition allows LSMs to make access control decisions on the integrity
data stored within the block_device, enabling more flexible security
policies. For instance, LSMs can now revoke access to dm-verity devices
based on their roothashes, ensuring that only authorized and verified
content is accessible. Additionally, LSMs can enforce policies to only
allow files from dm-verity devices that have a valid digital signature to
execute, effectively blocking any unsigned files from execution, thus
enhancing security against unauthorized modifications.

The patch includes new hook calls, `security_bdev_setintegrity()`, in
dm-verity to expose the dm-verity roothash and the roothash signature to
LSMs via preresume() callback. By using the preresume() callback, it
ensures that the security metadata is consistently in sync with the
metadata of the dm-verity target in the current active mapping table.
The hook calls are depended on CONFIG_SECURITY.

Signed-off-by: Deven Bowers <deven.desai@linux.microsoft.com>
Signed-off-by: Fan Wu <wufan@linux.microsoft.com>
Reviewed-by: Mikulas Patocka <mpatocka@redhat.com>
[PM: moved sig_size field as discussed]
Signed-off-by: Paul Moore <paul@paul-moore.com>
2024-08-20 14:02:38 -04:00
Linus Torvalds
85652baa89 Merge tag 'block-6.11-20240824' of git://git.kernel.dk/linux
Pull block fixes from Jens Axboe:

 - Fix corruption issues with s390/dasd (Eric, Stefan)

 - Fix a misuse of non irq locking grab of a lock (Li)

 - MD pull request with a single data corruption fix for raid1 (Yu)

* tag 'block-6.11-20240824' of git://git.kernel.dk/linux:
  block: Fix lockdep warning in blk_mq_mark_tag_wait
  md/raid1: Fix data corruption for degraded array with slow disk
  s390/dasd: fix error recovery leading to data corruption on ESE devices
  s390/dasd: Remove DMA alignment
2024-08-16 14:03:31 -07:00
Yu Kuai
c916ca3530 md/raid1: Fix data corruption for degraded array with slow disk
read_balance() will avoid reading from slow disks as much as possible,
however, if valid data only lands in slow disks, and a new normal disk
is still in recovery, unrecovered data can be read:

raid1_read_request
 read_balance
  raid1_should_read_first
  -> return false
  choose_best_rdev
  -> normal disk is not recovered, return -1
  choose_bb_rdev
  -> missing the checking of recovery, return the normal disk
 -> read unrecovered data

Root cause is that the checking of recovery is missing in
choose_bb_rdev(). Hence add such checking to fix the problem.

Also fix similar problem in choose_slow_rdev().

Cc: stable@vger.kernel.org
Fixes: 9f3ced7922 ("md/raid1: factor out choose_bb_rdev() from read_balance()")
Fixes: dfa8ecd167 ("md/raid1: factor out choose_slow_rdev() from read_balance()")
Reported-and-tested-by: Mateusz Jończyk <mat.jonczyk@o2.pl>
Closes: https://lore.kernel.org/all/9952f532-2554-44bf-b906-4880b2e88e3a@o2.pl/
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240803091137.3197008-1-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
2024-08-15 13:38:17 -07:00
Chen Ni
ca958879ad md: convert comma to semicolon
Replace a comma between expression statements by a semicolon.

Fixes: 5e5702898e ("md/raid10: Handle read errors during recovery better.")
Signed-off-by: Chen Ni <nichen@iscas.ac.cn>
Link: https://lore.kernel.org/r/20240716025852.400259-1-nichen@iscas.ac.cn
Signed-off-by: Song Liu <song@kernel.org>
2024-08-15 00:06:50 -07:00
Mikulas Patocka
faada2174c dm persistent data: fix memory allocation failure
kmalloc is unreliable when allocating more than 8 pages of memory. It may
fail when there is plenty of free memory but the memory is fragmented.
Zdenek Kabelac observed such failure in his tests.

This commit changes kmalloc to kvmalloc - kvmalloc will fall back to
vmalloc if the large allocation fails.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Reported-by: Zdenek Kabelac <zkabelac@redhat.com>
Reviewed-by: Mike Snitzer <snitzer@kernel.org>
Cc: stable@vger.kernel.org
2024-08-13 21:14:21 +02:00
Khazhismel Kumykov
7a636b4f03 dm resume: don't return EINVAL when signalled
If the dm_resume method is called on a device that is not suspended, the
method will suspend the device briefly, before resuming it (so that the
table will be swapped).

However, there was a bug that the return value of dm_suspended_md was not
checked. dm_suspended_md may return an error when it is interrupted by a
signal. In this case, do_resume would call dm_swap_table, which would
return -EINVAL.

This commit fixes the logic, so that error returned by dm_suspend is
checked and the resume operation is undone.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Khazhismel Kumykov <khazhy@google.com>
Cc: stable@vger.kernel.org
2024-08-13 13:51:34 +02:00
Mikulas Patocka
1e1fd567d3 dm suspend: return -ERESTARTSYS instead of -EINTR
This commit changes device mapper, so that it returns -ERESTARTSYS
instead of -EINTR when it is interrupted by a signal (so that the ioctl
can be restarted).

The manpage signal(7) says that the ioctl function should be restarted if
the signal was handled with SA_RESTART.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
2024-08-13 13:50:45 +02:00
Linus Torvalds
4477b39c32 minmax: add a few more MIN_T/MAX_T users
Commit 3a7e02c040 ("minmax: avoid overly complicated constant
expressions in VM code") added the simpler MIN_T/MAX_T macros in order
to avoid some excessive expansion from the rather complicated regular
min/max macros.

The complexity of those macros stems from two issues:

 (a) trying to use them in situations that require a C constant
     expression (in static initializers and for array sizes)

 (b) the type sanity checking

and MIN_T/MAX_T avoids both of these issues.

Now, in the whole (long) discussion about all this, it was pointed out
that the whole type sanity checking is entirely unnecessary for
min_t/max_t which get a fixed type that the comparison is done in.

But that still leaves min_t/max_t unnecessarily complicated due to
worries about the C constant expression case.

However, it turns out that there really aren't very many cases that use
min_t/max_t for this, and we can just force-convert those.

This does exactly that.

Which in turn will then allow for much simpler implementations of
min_t()/max_t().  All the usual "macros in all upper case will evaluate
the arguments multiple times" rules apply.

We should do all the same things for the regular min/max() vs MIN/MAX()
cases, but that has the added complexity of various drivers defining
their own local versions of MIN/MAX, so that needs another level of
fixes first.

Link: https://lore.kernel.org/all/b47fad1d0cf8449886ad148f8c013dae@AcuMS.aculab.com/
Cc: David Laight <David.Laight@aculab.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-07-28 13:41:14 -07:00
Linus Torvalds
7d080fa867 Merge tag 'for-6.11/block-20240722' of git://git.kernel.dk/linux
Pull more block updates from Jens Axboe:

 - MD fixes via Song:
     - md-cluster fixes (Heming Zhao)
     - raid1 fix (Mateusz Jończyk)

 - s390/dasd module description (Jeff)

 - Series cleaning up and hardening the blk-mq debugfs flag handling
   (John, Christoph)

 - blk-cgroup cleanup (Xiu)

 - Error polled IO attempts if backend doesn't support it (hexue)

 - Fix for an sbitmap hang (Yang)

* tag 'for-6.11/block-20240722' of git://git.kernel.dk/linux: (23 commits)
  blk-cgroup: move congestion_count to struct blkcg
  sbitmap: fix io hung due to race on sbitmap_word::cleared
  block: avoid polling configuration errors
  block: Catch possible entries missing from rqf_name[]
  block: Simplify definition of RQF_NAME()
  block: Use enum to define RQF_x bit indexes
  block: Catch possible entries missing from cmd_flag_name[]
  block: Catch possible entries missing from alloc_policy_name[]
  block: Catch possible entries missing from hctx_flag_name[]
  block: Catch possible entries missing from hctx_state_name[]
  block: Catch possible entries missing from blk_queue_flag_name[]
  block: Make QUEUE_FLAG_x as an enum
  block: Relocate BLK_MQ_MAX_DEPTH
  block: Relocate BLK_MQ_CPU_WORK_BATCH
  block: remove QUEUE_FLAG_STOPPED
  block: Add missing entry to hctx_flag_name[]
  block: Add zone write plugging entry to rqf_name[]
  block: Add missing entries from cmd_flag_name[]
  s390/dasd: fix error checks in dasd_copy_pair_store()
  s390/dasd: add missing MODULE_DESCRIPTION() macros
  ...
2024-07-22 11:32:05 -07:00
Linus Torvalds
0256994887 Merge tag 'for-6.11/block-post-20240722' of git://git.kernel.dk/linux
Pull block integrity mapping updates from Jens Axboe:
 "A set of cleanups and fixes for the block integrity support.

  Sent separately from the main block changes from last week, as they
  depended on later fixes in the 6.10-rc cycle"

* tag 'for-6.11/block-post-20240722' of git://git.kernel.dk/linux:
  block: don't free the integrity payload in bio_integrity_unmap_free_user
  block: don't free submitter owned integrity payload on I/O completion
  block: call bio_integrity_unmap_free_user from blk_rq_unmap_user
  block: don't call bio_uninit from bio_endio
  block: also return bio_integrity_payload * from stubs
  block: split integrity support out of bio.h
2024-07-22 11:04:09 -07:00
Linus Torvalds
527eff227d Merge tag 'mm-nonmm-stable-2024-07-21-15-07' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull non-MM updates from Andrew Morton:

 - In the series "treewide: Refactor heap related implementation",
   Kuan-Wei Chiu has significantly reworked the min_heap library code
   and has taught bcachefs to use the new more generic implementation.

 - Yury Norov's series "Cleanup cpumask.h inclusion in core headers"
   reworks the cpumask and nodemask headers to make things generally
   more rational.

 - Kuan-Wei Chiu has sent along some maintenance work against our
   sorting library code in the series "lib/sort: Optimizations and
   cleanups".

 - More library maintainance work from Christophe Jaillet in the series
   "Remove usage of the deprecated ida_simple_xx() API".

 - Ryusuke Konishi continues with the nilfs2 fixes and clanups in the
   series "nilfs2: eliminate the call to inode_attach_wb()".

 - Kuan-Ying Lee has some fixes to the gdb scripts in the series "Fix
   GDB command error".

 - Plus the usual shower of singleton patches all over the place. Please
   see the relevant changelogs for details.

* tag 'mm-nonmm-stable-2024-07-21-15-07' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (98 commits)
  ia64: scrub ia64 from poison.h
  watchdog/perf: properly initialize the turbo mode timestamp and rearm counter
  tsacct: replace strncpy() with strscpy()
  lib/bch.c: use swap() to improve code
  test_bpf: convert comma to semicolon
  init/modpost: conditionally check section mismatch to __meminit*
  init: remove unused __MEMINIT* macros
  nilfs2: Constify struct kobj_type
  nilfs2: avoid undefined behavior in nilfs_cnt32_ge macro
  math: rational: add missing MODULE_DESCRIPTION() macro
  lib/zlib: add missing MODULE_DESCRIPTION() macro
  fs: ufs: add MODULE_DESCRIPTION()
  lib/rbtree.c: fix the example typo
  ocfs2: add bounds checking to ocfs2_check_dir_entry()
  fs: add kernel-doc comments to ocfs2_prepare_orphan_dir()
  coredump: simplify zap_process()
  selftests/fpu: add missing MODULE_DESCRIPTION() macro
  compiler.h: simplify data_race() macro
  build-id: require program headers to be right after ELF header
  resource: add missing MODULE_DESCRIPTION()
  ...
2024-07-21 17:56:22 -07:00
Linus Torvalds
661fb4e68c Merge tag 'for-6.11/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper updates from Mikulas Patocka:

 - Optimize processing of flush bios in the dm-linear and dm-stripe
   targets

 - Dm-io cleansups and refactoring

 - Remove unused 'struct thunk' in dm-cache

 - Handle minor device numbers > 255 in dm-init

 - Dm-verity refactoring & enabling platform keyring

 - Fix warning in dm-raid

 - Improve dm-crypt performance - split bios to smaller pieces, so that
   They could be processed concurrently

 - Stop using blk_limits_io_{min,opt}

 - Dm-vdo cleanup and refactoring

 - Remove max_write_zeroes_granularity and max_secure_erase_granularity

 - Dm-multipath cleanup & refactoring

 - Add dm-crypt and dm-integrity support for non-power-of-2 sector size

 - Fix reshape in dm-raid

 - Make dm_block_validator const

* tag 'for-6.11/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (33 commits)
  dm vdo: fix a minor formatting issue in vdo.rst
  dm vdo int-map: fix kerneldoc formatting
  dm vdo repair: add missing kerneldoc fields
  dm: Constify struct dm_block_validator
  dm-integrity: introduce the Inline mode
  dm: introduce the target flag mempool_needs_integrity
  dm raid: fix stripes adding reshape size issues
  dm raid: move _get_reshape_sectors() as prerequisite to fixing reshape size issues
  dm-crypt: support for per-sector NVMe metadata
  dm mpath: don't call dm_get_device in multipath_message
  dm: factor out helper function from dm_get_device
  dm-verity: fix dm_is_verity_target() when dm-verity is builtin
  dm: Remove max_secure_erase_granularity
  dm: Remove max_write_zeroes_granularity
  dm vdo indexer: use swap() instead of open coding it
  dm vdo: remove unused struct 'uds_attribute'
  dm: stop using blk_limits_io_{min,opt}
  dm-crypt: limit the size of encryption requests
  dm verity: add support for signature verification with platform keyring
  dm-raid: Fix WARN_ON_ONCE check for sync_thread in raid_resume
  ...
2024-07-19 10:48:44 -07:00
Matthew Sakai
513789b7fb dm vdo int-map: fix kerneldoc formatting
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202407141607.M3E2XQ0Z-lkp@intel.com/
Signed-off-by: Matthew Sakai <msakai@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2024-07-19 12:08:21 +02:00
Matthew Sakai
fa398e603f dm vdo repair: add missing kerneldoc fields
Also remove trivial comment for increment_recovery_point.

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=9518
Signed-off-by: Matthew Sakai <msakai@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2024-07-19 12:08:21 +02:00
Christophe JAILLET
0b60be1628 dm: Constify struct dm_block_validator
'struct dm_block_validator' are not modified in these drivers.

Constifying this structure moves some data to a read-only section, so
increase overall security.

On a x86_64, with allmodconfig, as an example:

Before:
======
   text	   data	    bss	    dec	    hex	filename
  32047	    920	     16	  32983	   80d7	drivers/md/dm-cache-metadata.o

After:
=====
   text	   data	    bss	    dec	    hex	filename
  32075	    896	     16	  32987	   80db	drivers/md/dm-cache-metadata.o

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2024-07-19 12:08:15 +02:00
Mikulas Patocka
fb0987682c dm-integrity: introduce the Inline mode
This commit introduces a new 'I' mode for dm-integrity.

The 'I' mode may be selected if the underlying device has non-power-of-2
sector size. In this mode, dm-integrity will store integrity data
directly in device's sectors and it will not use journal.

This mode improves performance and reduces flash wear because there would
be no journal writes.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2024-07-19 12:06:44 +02:00
Linus Torvalds
f097ef0e76 Merge tag 'dlm-6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm
Pull dlm updates from David Teigland:

 - New flag DLM_LSFL_SOFTIRQ_SAFE can be set by code using dlm to
   indicate callbacks can be run from softirq

 - Change md-cluster to set DLM_LSFL_SOFTIRQ_SAFE

 - Clean up for previous changes, e.g. unused code and parameters

 - Remove custom pre-allocation of rsb structs which is unnecessary with
   kmem caches

 - Change idr to xarray for lkb structs in use

 - Change idr to xarray for rsb structs being recovered

 - Change outdated naming related to internal rsb states

 - Fix some incorrect add/remove of rsb on scan list

 - Use rcu to free rsb structs

* tag 'dlm-6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm:
  dlm: add rcu_barrier before destroy kmem cache
  dlm: remove DLM_LSFL_SOFTIRQ from exflags
  fs: dlm: remove unused struct 'dlm_processed_nodes'
  md-cluster: use DLM_LSFL_SOFTIRQ for dlm_new_lockspace()
  dlm: implement LSFL_SOFTIRQ_SAFE
  dlm: introduce DLM_LSFL_SOFTIRQ_SAFE
  dlm: use LSFL_FS to check for kernel lockspace
  dlm: use rcu to avoid an extra rsb struct lookup
  dlm: fix add_scan and del_scan usage
  dlm: change list and timer names
  dlm: move recover idr to xarray datastructure
  dlm: move lkb idr to xarray datastructure
  dlm: drop own rsb pre allocation mechanism
  dlm: remove ls_local_handle from struct dlm_ls
  dlm: remove unused parameter in dlm_midcomms_addr
  dlm: don't kref_init rsbs created for toss list
  dlm: remove scand leftovers
2024-07-17 12:16:22 -07:00
Linus Torvalds
3e78198862 Merge tag 'for-6.11/block-20240710' of git://git.kernel.dk/linux
Pull block updates from Jens Axboe:

 - NVMe updates via Keith:
     - Device initialization memory leak fixes (Keith)
     - More constants defined (Weiwen)
     - Target debugfs support (Hannes)
     - PCIe subsystem reset enhancements (Keith)
     - Queue-depth multipath policy (Redhat and PureStorage)
     - Implement get_unique_id (Christoph)
     - Authentication error fixes (Gaosheng)

 - MD updates via Song
     - sync_action fix and refactoring (Yu Kuai)
     - Various small fixes (Christoph Hellwig, Li Nan, and Ofir Gal, Yu
       Kuai, Benjamin Marzinski, Christophe JAILLET, Yang Li)

 - Fix loop detach/open race (Gulam)

 - Fix lower control limit for blk-throttle (Yu)

 - Add module descriptions to various drivers (Jeff)

 - Add support for atomic writes for block devices, and statx reporting
   for same. Includes SCSI and NVMe (John, Prasad, Alan)

 - Add IO priority information to block trace points (Dongliang)

 - Various zone improvements and tweaks (Damien)

 - mq-deadline tag reservation improvements (Bart)

 - Ignore direct reclaim swap writes in writeback throttling (Baokun)

 - Block integrity improvements and fixes (Anuj)

 - Add basic support for rust based block drivers. Has a dummy null_blk
   variant for now (Andreas)

 - Series converting driver settings to queue limits, and cleanups and
   fixes related to that (Christoph)

 - Cleanup for poking too deeply into the bvec internals, in preparation
   for DMA mapping API changes (Christoph)

 - Various minor tweaks and fixes (Jiapeng, John, Kanchan, Mikulas,
   Ming, Zhu, Damien, Christophe, Chaitanya)

* tag 'for-6.11/block-20240710' of git://git.kernel.dk/linux: (206 commits)
  floppy: add missing MODULE_DESCRIPTION() macro
  loop: add missing MODULE_DESCRIPTION() macro
  ublk_drv: add missing MODULE_DESCRIPTION() macro
  xen/blkback: add missing MODULE_DESCRIPTION() macro
  block/rnbd: Constify struct kobj_type
  block: take offset into account in blk_bvec_map_sg again
  block: fix get_max_segment_size() warning
  loop: Don't bother validating blocksize
  virtio_blk: Don't bother validating blocksize
  null_blk: Don't bother validating blocksize
  block: Validate logical block size in blk_validate_limits()
  virtio_blk: Fix default logical block size fallback
  nvmet-auth: fix nvmet_auth hash error handling
  nvme: implement ->get_unique_id
  block: pass a phys_addr_t to get_max_segment_size
  block: add a bvec_phys helper
  blk-lib: check for kill signal in ioctl BLKZEROOUT
  block: limit the Write Zeroes to manually writing zeroes fallback
  block: refacto blkdev_issue_zeroout
  block: move read-only and supported checks into (__)blkdev_issue_zeroout
  ...
2024-07-15 14:20:22 -07:00
Mikulas Patocka
617069741d dm: introduce the target flag mempool_needs_integrity
This commit introduces the dm target flag mempool_needs_integrity. When
the flag is set, device mapper will call bioset_integrity_create on it's
bio sets. The target can then call bio_integrity_alloc on the bios
allocated from the table's mempool.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2024-07-12 12:39:06 -04:00
Mateusz Jończyk
36a5c03f23 md/raid1: set max_sectors during early return from choose_slow_rdev()
Linux 6.9+ is unable to start a degraded RAID1 array with one drive,
when that drive has a write-mostly flag set. During such an attempt,
the following assertion in bio_split() is hit:

	BUG_ON(sectors <= 0);

Call Trace:
	? bio_split+0x96/0xb0
	? exc_invalid_op+0x53/0x70
	? bio_split+0x96/0xb0
	? asm_exc_invalid_op+0x1b/0x20
	? bio_split+0x96/0xb0
	? raid1_read_request+0x890/0xd20
	? __call_rcu_common.constprop.0+0x97/0x260
	raid1_make_request+0x81/0xce0
	? __get_random_u32_below+0x17/0x70
	? new_slab+0x2b3/0x580
	md_handle_request+0x77/0x210
	md_submit_bio+0x62/0xa0
	__submit_bio+0x17b/0x230
	submit_bio_noacct_nocheck+0x18e/0x3c0
	submit_bio_noacct+0x244/0x670

After investigation, it turned out that choose_slow_rdev() does not set
the value of max_sectors in some cases and because of it,
raid1_read_request calls bio_split with sectors == 0.

Fix it by filling in this variable.

This bug was introduced in
commit dfa8ecd167 ("md/raid1: factor out choose_slow_rdev() from read_balance()")
but apparently hidden until
commit 0091c5a269 ("md/raid1: factor out helpers to choose the best rdev from read_balance()")
shortly thereafter.

Cc: stable@vger.kernel.org # 6.9.x+
Signed-off-by: Mateusz Jończyk <mat.jonczyk@o2.pl>
Fixes: dfa8ecd167 ("md/raid1: factor out choose_slow_rdev() from read_balance()")
Cc: Song Liu <song@kernel.org>
Cc: Yu Kuai <yukuai3@huawei.com>
Cc: Paul Luse <paul.e.luse@linux.intel.com>
Cc: Xiao Ni <xni@redhat.com>
Cc: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com>
Link: https://lore.kernel.org/linux-raid/20240706143038.7253-1-mat.jonczyk@o2.pl/

--

Tested on both Linux 6.10 and 6.9.8.

Inside a VM, mdadm testsuite for RAID1 on 6.10 did not find any problems:
	./test --dev=loop --no-error --raidtype=raid1
(on 6.9.8 there was one failure, caused by external bitmap support not
compiled in).

Notes:
- I was reliably getting deadlocks when adding / removing devices
  on such an array - while the array was loaded with fsstress with 20
  concurrent processes. When the array was idle or loaded with fsstress
  with 8 processes, no such deadlocks happened in my tests.
  This occurred also on unpatched Linux 6.8.0 though, but not on
  6.1.97-rc1, so this is likely an independent regression (to be
  investigated).
- I was also getting deadlocks when adding / removing the bitmap on the
  array in similar conditions - this happened on Linux 6.1.97-rc1
  also though. fsstress with 8 concurrent processes did cause it only
  once during many tests.
- in my testing, there was once a problem with hot adding an
  internal bitmap to the array:
	mdadm: Cannot add bitmap while array is resyncing or reshaping etc.
	mdadm: failed to set internal bitmap.
  even though no such reshaping was happening according to /proc/mdstat.
  This seems unrelated, though.

Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240711202316.10775-1-mat.jonczyk@o2.pl
2024-07-12 01:30:38 +00:00
Heming Zhao
35a0a409fa md-cluster: fix no recovery job when adding/re-adding a disk
The commit db5e653d7c ("md: delay choosing sync action to
md_start_sync()") delays the start of the sync action. In a
clustered environment, this will cause another node to first
activate the spare disk and skip recovery. As a result, no
nodes will perform recovery when a disk is added or re-added.

Before db5e653d7c:

```
   node1                                node2
----------------------------------------------------------------
md_check_recovery
 + md_update_sb
 |  sendmsg: METADATA_UPDATED
 + md_choose_sync_action           process_metadata_update
 |  remove_and_add_spares           //node1 has not finished adding
 + call mddev->sync_work            //the spare disk:do nothing

md_start_sync
 starts md_do_sync

md_do_sync
 + grabbed resync_lockres:DLM_LOCK_EX
 + do syncing job

md_check_recovery
 sendmsg: METADATA_UPDATED
                                 process_metadata_update
                                   //activate spare disk

                                 ... ...

                                 md_do_sync
                                  waiting to grab resync_lockres:EX
```

After db5e653d7c:

(note: if 'cmd:idle' sets MD_RECOVERY_INTR after md_check_recovery
starts md_start_sync, setting the INTR action will exacerbate the
delay in node1 calling the md_do_sync function.)

```
   node1                                node2
----------------------------------------------------------------
md_check_recovery
 + md_update_sb
 |  sendmsg: METADATA_UPDATED
 + calls mddev->sync_work         process_metadata_update
                                   //node1 has not finished adding
                                   //the spare disk:do nothing

md_start_sync
 + md_choose_sync_action
 |  remove_and_add_spares
 + calls md_do_sync

md_check_recovery
 md_update_sb
  sendmsg: METADATA_UPDATED
                                  process_metadata_update
                                    //activate spare disk

  ... ...                         ... ...

                                  md_do_sync
                                   + grabbed resync_lockres:EX
                                   + raid1_sync_request skip sync under
				     conf->fullsync:0
md_do_sync
 1. waiting to grab resync_lockres:EX
 2. when node1 could grab EX lock,
    node1 will skip resync under recovery_offset:MaxSector
```

How to trigger:

```(commands @node1)
 # to easily watch the recovery status
echo 2000 > /proc/sys/dev/raid/speed_limit_max
ssh root@node2 "echo 2000 > /proc/sys/dev/raid/speed_limit_max"

mdadm -CR /dev/md0 -l1 -b clustered -n 2 /dev/sda /dev/sdb --assume-clean
ssh root@node2 mdadm -A /dev/md0 /dev/sda /dev/sdb
mdadm --manage /dev/md0 --fail /dev/sda --remove /dev/sda
mdadm --manage /dev/md0 --add /dev/sdc

=== "cat /proc/mdstat" on both node, there are no recovery action. ===
```

How to fix:

because md layer code logic is hard to restore for speeding up sync job
on local node, we add new cluster msg to pending the another node to
active disk.

Signed-off-by: Heming Zhao <heming.zhao@suse.com>
Reviewed-by: Su Yue <glass.su@suse.com>
Acked-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240709104120.22243-2-heming.zhao@suse.com
2024-07-12 01:30:18 +00:00