linux

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2026-07-16 17:57:38 -04:00

Author	SHA1	Message	Date
Connor Williamson	181bb9c9ea	block: remove redundant GD_NEED_PART_SCAN in add_disk_final() add_disk_final() sets GD_NEED_PART_SCAN before calling bdev_add(), then calls disk_scan_partitions() which sets the flag itself. The early set is redundant and introduces a race. Between bdev_add() and disk_scan_partitions(), concurrent openers (multipathd, blkid, LVM) see the flag in blkdev_get_whole() and trigger bdev_disk_changed(). When disk_scan_partitions() then runs, it calls bdev_disk_changed() again, dropping the partitions the concurrent opener already created before re-adding them, which can result in transient partition disappearances. The race is observable by inserting an msleep() between bdev_add() and disk_scan_partitions() while running concurrent open() calls during device bind. Without artificial delay, it manifests under scheduling pressure during boot on systems with aggressive device scanners (multipathd, systemd-udevd). Therefore, do not set GD_NEED_PART_SCAN in add_disk_final(). Other GD_NEED_PART_SCAN consumers (blkdev_get_whole(), sd_need_revalidate()) should not be affected as the flag is set internally by disk_scan_partitions(). The retry-on-next-open intention from commit `e5cfefa97b` ("block: fix scan partition for exclusively open device again") should also not be affected as the early return paths in disk_scan_partitions() should be unreachable at device registration time (bd_holder is NULL and open_partitions is zero). Fixes: `e5cfefa97b` ("block: fix scan partition for exclusively open device again") Cc: stable@vger.kernel.org Signed-off-by: Connor Williamson <connordw@amazon.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://patch.msgid.link/20260615130715.53693-1-connordw@amazon.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-07-10 09:08:50 -06:00
Guzebing	1e56f30a73	block: Make WBT latency writes honor enable state queue/wbt_lat_usec controls both the stored WBT latency target and the effective WBT enable state. The old no-op check skipped updates whenever the converted latency matched the stored min_lat_nsec. That check ignored whether the current WBT state already matched the state requested by the write. For a queue disabled by default, attempting to enable WBT by writing the default value through sysfs could return success while the enable state was left unchanged. Treat a write as a no-op only when both the stored latency and the effective WBT enabled state already match the converted value. Signed-off-by: Guzebing <guzebing1612@gmail.com> Link: https://patch.msgid.link/20260621014030.1625306-1-guzebing1612@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-07-01 19:07:20 -06:00
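The no-op condition described in the WBT commit above can be sketched in plain C. This is a minimal illustration, not the kernel code: `struct wbt_state`, `is_noop_old()` and `is_noop_new()` are hypothetical names, and the rule that a nonzero latency requests the enabled state is an assumption made for the example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-queue WBT state; not the kernel's struct. */
struct wbt_state {
    uint64_t min_lat_nsec; /* stored latency target */
    bool enabled;          /* effective WBT enable state */
};

/* Old check: the write is skipped whenever the stored latency matches. */
bool is_noop_old(const struct wbt_state *s, uint64_t new_lat_nsec)
{
    return s->min_lat_nsec == new_lat_nsec;
}

/*
 * New check: the write is skipped only when the stored latency AND the
 * effective enable state already match what the write requests.
 * Assumption for this sketch: a nonzero latency requests "enabled".
 */
bool is_noop_new(const struct wbt_state *s, uint64_t new_lat_nsec)
{
    bool want_enabled = new_lat_nsec != 0;

    return s->min_lat_nsec == new_lat_nsec && s->enabled == want_enabled;
}
```

For a queue that is disabled by default but already stores the default latency, the old check reports a no-op and leaves WBT disabled, while the new check lets the write through.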
Damien Le Moal	3dd63dba8f	block: avoid potential deadlock on zone revalidation failure If revalidating the zones of a zoned block device with blk_revalidate_disk_zones() fails during a SCSI disk rescan, the following lockdep splat is thrown: [ 347.251859] [ T11230] sda: failed to revalidate zones [ 347.261380] [ T11230] ====================================================== [ 347.263882] [ T11230] WARNING: possible circular locking dependency detected [ 347.266353] [ T11230] 7.1.0+ #1194 Not tainted [ 347.268052] [ T11230] ------------------------------------------------------ [ 347.270537] [ T11230] tcsh/11230 is trying to acquire lock: [ 347.272555] [ T11230] ffffffff8f91d400 (wq_pool_mutex){+.+.}-{4:4}, at: destroy_workqueue+0x15d/0x8d0 [ 347.275914] [ T11230] but task is already holding lock: [ 347.278646] [ T11230] ffff88812fa1bcc0 (&q->q_usage_counter(io)#5){++++}-{0:0}, at: blk_mq_freeze_queue_nomemsave+0x16/0x30 [ 347.282503] [ T11230] which lock already depends on the new lock. 
[ 347.286239] [ T11230] the existing dependency chain (in reverse order) is: [ 347.289408] [ T11230] -> #2 (&q->q_usage_counter(io)#5){++++}-{0:0}: [ 347.292437] [ T11230] blk_alloc_queue+0x5ca/0x750 [ 347.294379] [ T11230] blk_mq_alloc_queue+0x14c/0x240 [ 347.296375] [ T11230] scsi_alloc_sdev+0x871/0xd10 [scsi_mod] [ 347.298619] [ T11230] scsi_probe_and_add_lun+0x600/0xc50 [scsi_mod] [ 347.301056] [ T11230] __scsi_scan_target+0x187/0x3b0 [scsi_mod] [ 347.303385] [ T11230] scsi_scan_channel+0xf2/0x180 [scsi_mod] [ 347.305651] [ T11230] scsi_scan_host_selected+0x20b/0x2d0 [scsi_mod] [ 347.308119] [ T11230] do_scan_async+0x42/0x420 [scsi_mod] [ 347.310276] [ T11230] async_run_entry_fn+0x94/0x5a0 [ 347.312284] [ T11230] process_one_work+0x8da/0x1690 [ 347.314287] [ T11230] worker_thread+0x5fe/0x1010 [ 347.316216] [ T11230] kthread+0x358/0x450 [ 347.317675] [ T11230] ret_from_fork+0x5b9/0x8e0 [ 347.319181] [ T11230] ret_from_fork_asm+0x11/0x20 [ 347.320778] [ T11230] -> #1 (fs_reclaim){+.+.}-{0:0}: [ 347.322890] [ T11230] fs_reclaim_acquire+0xd5/0x120 [ 347.324464] [ T11230] __kmalloc_cache_node_noprof+0x39/0x620 [ 347.326223] [ T11230] init_rescuer+0x19b/0x560 [ 347.327697] [ T11230] workqueue_init+0x33b/0x6a0 [ 347.329224] [ T11230] kernel_init_freeable+0x2eb/0x600 [ 347.330881] [ T11230] kernel_init+0x1c/0x140 [ 347.332334] [ T11230] ret_from_fork+0x5b9/0x8e0 [ 347.333847] [ T11230] ret_from_fork_asm+0x11/0x20 [ 347.335360] [ T11230] -> #0 (wq_pool_mutex){+.+.}-{4:4}: [ 347.337510] [ T11230] __lock_acquire+0xdea/0x2260 [ 347.339030] [ T11230] lock_acquire+0x187/0x2f0 [ 347.340495] [ T11230] __mutex_lock+0x1ab/0x2600 [ 347.341464] [ T11230] destroy_workqueue+0x15d/0x8d0 [ 347.342485] [ T11230] disk_free_zone_resources+0xd5/0x560 [ 347.343577] [ T11230] blk_revalidate_disk_zones+0x620/0xac7 [ 347.344723] [ T11230] sd_zbc_revalidate_zones+0x1dd/0x790 [sd_mod] [ 347.345938] [ T11230] sd_revalidate_disk+0xc66/0x8e60 [sd_mod] [ 347.347112] [ T11230] 
scsi_rescan_device+0x1f9/0x310 [scsi_mod] [ 347.348318] [ T11230] store_rescan_field+0x19/0x20 [scsi_mod] [ 347.349507] [ T11230] kernfs_fop_write_iter+0x3d2/0x5e0 [ 347.350565] [ T11230] vfs_write+0x469/0x1000 [ 347.351484] [ T11230] ksys_write+0x116/0x250 [ 347.352403] [ T11230] do_syscall_64+0xf0/0x6e0 [ 347.353361] [ T11230] entry_SYSCALL_64_after_hwframe+0x4b/0x53 [ 347.354533] [ T11230] other info that might help us debug this: [ 347.356432] [ T11230] Chain exists of: wq_pool_mutex --> fs_reclaim --> &q->q_usage_counter(io)#5 [ 347.358919] [ T11230] Possible unsafe locking scenario: [ 347.360307] [ T11230] CPU0 CPU1 [ 347.361327] [ T11230] ---- ---- [ 347.362340] [ T11230] lock(&q->q_usage_counter(io)#5); [ 347.363344] [ T11230] lock(fs_reclaim); [ 347.364526] [ T11230] lock(&q->q_usage_counter(io)#5); [ 347.365968] [ T11230] lock(wq_pool_mutex); [ 347.366811] [ T11230] * DEADLOCK * This happens because SCSI disk rescan is executed from a work context and a failure of blk_revalidate_disk_zones() causes a call to disk_free_zone_resources() which will free the disk zone write plug workqueue. Avoid this by delaying the destruction of the disk zone write plug workqueue to disk_release(). Do this by introducing the function disk_release_zone_resources() and using this new function from disk_release(). This new function destroys the zone write plugs workqueue and calls disk_free_zone_resources(), thus allowing to remove the call to destroy_workqueue() from disk_free_zone_resources(). disk_alloc_zone_resources() is modified to not create the disk zone write plug work queue if it already exists. 
Fixes: `a8f59e5a5d` ("block: use a per disk workqueue for zone write plugging") Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@kernel.org> Link: https://patch.msgid.link/20260701082155.1369996-1-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-07-01 05:34:35 -06:00
Anuj Gupta	30e542a362	blk-mq: bound blk_hctx_poll() to one jiffy blk_hctx_poll() can busy-poll until a completion is found or need_resched() becomes true. On preemptible kernels, the scheduler can set TIF_NEED_RESCHED on the timer tick and preempt the task at IRQ return before the loop condition re-evaluates it. After the context switch, the flag is cleared, so the poller can continue spinning instead of returning to its caller. This can happen with io_uring IOPOLL reads inside iocb_bio_iopoll(), which holds the rcu_read_lock() while calling bio_poll(). If another poller on the same polled queue drains the available completions, this poller may repeatedly find no completions and remain inside the RCU read-side critical section long enough to trigger RCU stall reports: rcu: INFO: rcu_preempt detected stalls on CPUs/tasks: rcu: Tasks blocked on level-1 rcu_node (CPUs 0-9): P3961 rcu: (detected by 3, t=60002 jiffies, g=18533, q=4943 ncpus=20) task:fio state:R running task stack:0 pid:3961 Call Trace: <TASK> ? nvme_poll+0x36/0xa0 [nvme] ? blk_hctx_poll+0x39/0x90 ? blk_mq_poll+0x30/0x60 ? bio_poll+0x87/0x170 ? iocb_bio_iopoll+0x32/0x50 ? io_uring_classic_poll+0x25/0x50 ? io_do_iopoll+0x216/0x420 ? __do_sys_io_uring_enter+0x2c7/0x7c0 Reproducible with: fio -filename=/dev/nvme0n1 -direct=1 -size=4g -rw=randread \ --numjobs=32 -bs=4K -ioengine=io_uring -hipri=1 -iodepth=1 \ --registerfiles=1 --group_reporting --thread Record the starting jiffy and exit the loop once jiffies has advanced. This bounds each blk_hctx_poll() invocation while also covering the case where the reschedule flag was cleared by the context switch before the loop condition could observe it. 
Fixes: `f22ecf9c14` ("blk-mq: delete task running check in blk_hctx_poll()") Reviewed-by: Fengnan Chang <changfengnan@bytedance.com> Suggested-by: Fengnan Chang <changfengnan@bytedance.com> Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> Signed-off-by: Alok Rathore <alok.rathore@samsung.com> Link: https://patch.msgid.link/20260617155051.1266079-1-anuj20.g@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-06-26 10:41:33 -06:00
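The "record the starting jiffy and exit once jiffies has advanced" idea from the blk_hctx_poll() commit above can be sketched as a user-space simulation. Everything here is an assumption for illustration: `struct poll_env`, `poll_once()` and `bounded_poll()` are hypothetical names, and the tick counter and reschedule flag are simulated fields rather than the kernel's `jiffies` and `need_resched()`.

```c
#include <stdbool.h>

/* Hypothetical simulation state; none of these names exist in the kernel. */
struct poll_env {
    unsigned long jiffies; /* simulated tick counter */
    bool need_resched;     /* simulated reschedule flag */
    int polls;             /* poll attempts made so far */
    int complete_after;    /* attempt that finds a completion, 0 = never */
    int tick_every;        /* advance the tick every N attempts, 0 = never */
};

/* One poll attempt: returns the number of completions found (0 or 1). */
int poll_once(struct poll_env *env)
{
    env->polls++;
    if (env->tick_every && env->polls % env->tick_every == 0)
        env->jiffies++;
    return (env->complete_after && env->polls >= env->complete_after) ? 1 : 0;
}

/*
 * Poll until a completion is found, a reschedule is requested, or the
 * tick counter has moved past its starting value. The last condition
 * bounds the loop even if the reschedule flag was set and cleared
 * without the loop ever observing it.
 */
int bounded_poll(struct poll_env *env)
{
    unsigned long start = env->jiffies;
    int found;

    do {
        found = poll_once(env);
        if (found > 0)
            return found;
    } while (!env->need_resched && env->jiffies == start);

    return 0;
}
```

With no completions ever arriving and a tick every five attempts, the loop gives up after five attempts instead of spinning indefinitely.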
Christoph Hellwig	a1c8bdbbd7	block: handle REQ_OP_ZONE_APPEND in __bio_integrity_action Otherwise zone append commands will miss their integrity data. While this works "fine" for auto-PI, it breaks file system PI and non-PI metadata. With this, XFS on a ZNS namespace works with non-PI metadata and with PI on 512-byte sectors, while 4k-sector formats with PI work only when Caleb's "block: fix integrity offset/length conversions" is applied as well. Note that unlike regular writes, zone append does not need remapping as partitions are not supported on zoned block devices. Fixes: `df3c485e0e` ("block: switch on bio operation in bio_integrity_prep") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://patch.msgid.link/20260624080014.1998650-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-06-24 06:53:25 -06:00
Christoph Hellwig	e7c1627afd	block: fix GFP_ flags confusion in bio_integrity_alloc_buf bio_integrity_alloc_buf usage of GFP_ flags is messed up. For one it mixes GFP_NOFS and GFP_NOIO for neighbouring allocations, but it also makes the allocations fail more often than needed. That code was copied from bio_alloc_bioset which needs to do that so that it can punt to the rescuer workqueue, but none of that is needed for the integrity allocations that either sits in the file system or at the very bottom of the I/O stack. Failing early means we'll do a fully waiting allocation from the mempool ->alloc callback which is usually much larger than required. Fix this by passing a gfp_t so that the file system path can pass GFP_NOFS and the auto-integrity code can pass GFP_NOIO, and don't modify the allocation type except for disabling warnings. Fixes: `ec7f31b2a2` ("block: make bio auto-integrity deadlock safe") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://patch.msgid.link/20260624080014.1998650-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-06-24 06:53:25 -06:00
Yu Kuai	3ca4f4e3ae	block, bfq: don't grab queue_lock to initialize bfq The request_queue is frozen and quiesced while the elevator init_sched() method runs, so queue_lock is not needed for BFQ cgroup initialization. Signed-off-by: Yu Kuai <yukuai@fygo.io> Link: https://patch.msgid.link/1965073ea20f33114a8d903816b986e483b9bb34.1780621988.git.yukuai@fygo.io Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-06-24 06:42:31 -06:00
Yu Kuai	4cfd7c1cff	blk-cgroup: don't nest queue_lock under blkcg->lock in blkcg_destroy_blkgs() The correct lock order is q->queue_lock before blkcg->lock. To prevent deadlock in blkcg_destroy_blkgs(), trylock is used for q->queue_lock while blkcg->lock is already held, which is hacky. Refactor blkcg_destroy_blkgs() to hold blkcg->lock only long enough to get the first blkg and then release it. Then take q->queue_lock and blkcg->lock in the correct order to destroy the blkg. This is a very cold path, so the extra lock/unlock cycles are acceptable. This also prepares for protecting blkcg with blkcg_mutex instead of queue_lock. Signed-off-by: Yu Kuai <yukuai@fygo.io> Link: https://patch.msgid.link/00b03cf74a9937cb4d6dd67a189ddc00a3de0451.1780621988.git.yukuai@fygo.io Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-06-24 06:42:31 -06:00
Yu Kuai	457d3c4f0f	blk-cgroup: don't nest queue_lock under rcu in bio_associate_blkg() If a bio is already associated with a blkg, the blkcg is already pinned until the bio is done, so there is no need for RCU protection. Otherwise, protect blkcg_css() with RCU independently. Prepare to protect blkcg with blkcg_mutex instead of queue_lock. Signed-off-by: Yu Kuai <yukuai@fygo.io> Link: https://patch.msgid.link/8496fa234b21d4b31b7f068766906d0bffcac8e6.1780621988.git.yukuai@fygo.io Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-06-24 06:42:31 -06:00
Yu Kuai	9327a865e3	blk-cgroup: don't nest queue_lock under rcu in blkg_lookup_create() Change this in two steps: 1) hold rcu lock and do blkg_lookup() from fast path; 2) hold queue_lock directly from slow path, and don't nest it under rcu lock; Prepare to convert protecting blkcg with blkcg_mutex instead of queue_lock. Signed-off-by: Yu Kuai <yukuai@fygo.io> Link: https://patch.msgid.link/93f33cc9e5a39dddb78dcd934d0c1d04b564fb00.1780621988.git.yukuai@fygo.io Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-06-24 06:42:31 -06:00
Yu Kuai	56cc24f59c	blk-cgroup: don't nest queue_lock under rcu in blkcg_print_blkgs() With previous modification to delay freeing policy data after an RCU grace period, prfill() can run under RCU instead of taking queue_lock. However, policy teardown can still clear blkg->pd[plid] after blkcg_print_blkgs() observes the policy enabled bit. Load policy data once with READ_ONCE() and skip the blkg if teardown already cleared it. Do the same in recursive stat walks for descendant blkgs. Remove the stale BFQ debug queue_lock assertion because blkcg_print_blkgs() no longer calls prfill() with queue_lock held. This also lets ioc_qos_prfill() and ioc_cost_model_prfill() use IRQ-safe ioc->lock locking without re-enabling IRQs while queue_lock is still held. Signed-off-by: Yu Kuai <yukuai@fygo.io> Link: https://patch.msgid.link/db7633d5e263dd1c2bf9b901762545a84b7d714e.1780621988.git.yukuai@fygo.io Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-06-24 06:42:19 -06:00
Yu Kuai	0af3fedb8c	blk-cgroup: delay freeing policy data after rcu grace period Currently blkcg_print_blkgs() must hold RCU to iterate blkgs from a blkcg, and prfill() must hold queue_lock to prevent policy data from being freed by policy deactivation. As a consequence, queue_lock has to be nested under RCU from blkcg_print_blkgs(). Delay freeing policy data until after an RCU grace period so prfill() can be protected by RCU alone. Signed-off-by: Yu Kuai <yukuai@fygo.io> Link: https://patch.msgid.link/e20e5d984b41a026d61851966bed35eb094c4bff.1780621988.git.yukuai@fygo.io Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-06-24 06:37:54 -06:00
Yu Kuai	25656304da	blk-cgroup: protect iterating blkgs with blkcg->lock in blkcg_print_stat() blkcg_print_one_stat() will be called for each blkg: - access blkg->iostat, which is freed from rcu callback blkg_free_workfn(); - access policy data from pd_stat_fn(), which is freed from pd_free_fn(), while pd_free_fn() can be called by removing blkcg or deactivating policy; Take blkcg->lock while iterating so the blkgs stay online and both blkg->iostat and policy data for activated policies stay valid. Use irq-safe locking because blkcg->lock can be nested under q->queue_lock, which is used from IRQ completion paths. Prepare to convert protecting blkgs from request_queue with mutex. Signed-off-by: Yu Kuai <yukuai@fygo.io> Link: https://patch.msgid.link/05799877e720dcd300e2ddd4625e8e162959d7cc.1780621988.git.yukuai@fygo.io Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-06-24 06:37:54 -06:00
Zizhi Wo	3ed9b4779a	blk-cgroup: defer blkcg css_put until blkg is unlinked from queue [BUG] Our fuzz testing triggered a blkcg use-after-free issue: BUG: KASAN: slab-use-after-free in _raw_spin_lock+0x75/0xe0 Call Trace: ... blkcg_deactivate_policy+0x244/0x4d0 ioc_rqos_exit+0x44/0xe0 rq_qos_exit+0xba/0x120 __del_gendisk+0x50b/0x800 del_gendisk+0xff/0x190 ... [CAUSE] process1 process2 cgroup_rmdir ... css_killed_work_fn offline_css ... blkcg_destroy_blkgs ... __blkg_release css_put(&blkg->blkcg->css) blkg_free INIT_WORK(xxx, blkg_free_workfn) schedule_work css_put ... blkcg_css_free kfree(blkcg)--------blkcg has been freed!!! ====================================schedule_work blkg_free_workfn __del_gendisk rq_qos_exit ioc_rqos_exit blkcg_deactivate_policy mutex_lock(&q->blkcg_mutex) spin_lock_irq(&q->queue_lock) list_for_each_entry(blkg, xxx) blkcg = blkg->blkcg spin_lock(&blkcg->lock)-------UAF!!! mutex_lock(&q->blkcg_mutex) spin_lock_irq(&q->queue_lock) /* Only then is the blkg removed from the list */ list_del_init(&blkg->q_node) As a result, a blkg can still be reachable through q->blkg_list while its ->blkcg has already been freed. [Fix] Fix this by deferring the blkcg css_put() until after the blkg has been unlinked from q->blkg_list in blkg_free_workfn(). This ensures that the blkcg outlives every blkg still reachable through q->blkg_list, so any iterator holding q->queue_lock is guaranteed to observe a valid blkg->blkcg. 
While at it, move css_tryget_online() from blkg_create() into blkg_alloc() so that the css reference is owned by the alloc/free pair rather than straddling layers: blkg_alloc() <-> blkg_free() blkg_create() <-> blkg_destroy() Fixes: `f1c006f1c6` ("blk-cgroup: synchronize pd_free_fn() from blkg_free_workfn() and blkcg_deactivate_policy()") Suggested-by: Hou Tao <houtao1@huawei.com> Signed-off-by: Zizhi Wo <wozizhi@huawei.com> Reviewed-by: Yu Kuai <yukuai@fygo.io> Reviewed-by: Tang Yizhou <yizhou.tang@shopee.com> Link: https://patch.msgid.link/20260616011746.2451461-1-wozizhi@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-06-22 15:59:53 -06:00
Michal Koutný	0ab5ee5a1b	blk-cgroup: fix UAF in __blkcg_rstat_flush() When multiple blkgs in the same blkcg are released concurrently, a use-after-free can occur. The race happens when one blkg's __blkcg_rstat_flush() removes another blkg's iostat entries via llist_del_all(). The second blkg sees an empty list and proceeds to free itself while the first is still iterating over its entries. Move the flush from __blkg_release() (RCU callback) to blkg_release() (before call_rcu). This ensures the RCU grace period waits for any concurrent flush's rcu_read_lock() section to complete before freeing. Cc: stable@vger.kernel.org Cc: Jay Shin <jaeshin@redhat.com> Cc: Tejun Heo <tj@kernel.org> Cc: Waiman Long <longman@redhat.com> Fixes: `20cb1c2fb7` ("blk-cgroup: Flush stats before releasing blkcg_gq") Reported-by: coregee2000@gmail.com Closes: https://lore.kernel.org/linux-block/CAHPqNmwT9oRpem3J3erS_W0uSQND47LGGSBsNxP8E6uSUish1w@mail.gmail.com/ Signed-off-by: Ming Lei <ming.lei@redhat.com> Tested-by: Jose Fernandez (Anthropic) <jose.fernandez@linux.dev> Link: https://patch.msgid.link/20260205155425.342084-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-06-22 15:58:41 -06:00
Cen Zhang	17b2d950a3	block, bfq: protect async queue reset with blkcg locks Writing 0 to BFQ's low_latency attribute ends weight raising for active, idle and async queues. The async cgroup path walks q->blkg_list, converts each blkg to BFQ policy data and then reads bfqg->async_bfqq and bfqg->async_idle_bfqq. That walk was protected only by bfqd->lock. blkcg release work is serialized by q->blkcg_mutex and q->queue_lock instead, and blkg_free_workfn() can call BFQ's pd_free_fn before it removes blkg->q_node from q->blkg_list. A low_latency reset can therefore still find the blkg on the queue list after the BFQ policy data has been freed. The buggy scenario involves two paths, with each column showing the order within that path: BFQ low_latency reset: blkcg blkg release work: 1. bfq_low_latency_store() 1. blkg_free_workfn() takes calls bfq_end_wr(). q->blkcg_mutex. 2. bfq_end_wr_async() walks 2. BFQ pd_free_fn drops the q->blkg_list. final bfq_group reference. 3. blkg_to_bfqg() returns 3. blkg->q_node remains on the stale policy data. q->blkg_list until list_del_init(). 4. bfq_end_wr_async_queues() reads async queue fields. Fix this by taking q->blkcg_mutex and q->queue_lock around the q->blkg_list walk, then taking bfqd->lock before touching BFQ async queues. The mutex serializes against policy-data free and queue_lock stabilizes the list. Move the async reset out of bfq_end_wr()'s existing bfqd->lock critical section so the lock order matches blkcg policy callbacks. Validation reproduced this kernel report: BUG: KASAN: slab-use-after-free in bfq_end_wr_async_queues+0x246/0x340 Call Trace: <TASK> dump_stack_lvl+0x66/0xa0 print_report+0xce/0x630 ? bfq_end_wr_async_queues+0x246/0x340 ? srso_alias_return_thunk+0x5/0xfbef5 ? __virt_addr_valid+0x20d/0x410 ? bfq_end_wr_async_queues+0x246/0x340 kasan_report+0xe0/0x110 ? bfq_end_wr_async_queues+0x246/0x340 bfq_end_wr_async_queues+0x246/0x340 bfq_end_wr_async+0xba/0x180 bfq_low_latency_store+0x4e5/0x690 ? 
0xffffffffc02150da ? __pfx_bfq_low_latency_store+0x10/0x10 ? __pfx_bfq_low_latency_store+0x10/0x10 elv_attr_store+0xc4/0x110 kernfs_fop_write_iter+0x2f5/0x4a0 vfs_write+0x604/0x11f0 ? __pfx_locks_remove_posix+0x10/0x10 ? __pfx_vfs_write+0x10/0x10 ksys_write+0xf9/0x1d0 ? __pfx_ksys_write+0x10/0x10 do_syscall_64+0x115/0x6a0 entry_SYSCALL_64_after_hwframe+0x77/0x7f Allocated by task 544: kasan_save_stack+0x33/0x60 kasan_save_track+0x14/0x30 __kasan_kmalloc+0xaa/0xb0 bfq_pd_alloc+0xc0/0x1b0 blkg_alloc+0x346/0x960 blkg_create+0x8c2/0x10d0 bio_associate_blkg_from_css+0x9f3/0xfa0 bio_associate_blkg+0xd9/0x200 bio_init+0x303/0x640 __blkdev_direct_IO_simple+0x56b/0x8a0 blkdev_direct_IO+0x8e7/0x2580 blkdev_read_iter+0x205/0x400 vfs_read+0x7b0/0xda0 ksys_read+0xf9/0x1d0 do_syscall_64+0x115/0x6a0 entry_SYSCALL_64_after_hwframe+0x77/0x7f Freed by task 465: kasan_save_stack+0x33/0x60 kasan_save_track+0x14/0x30 kasan_save_free_info+0x3b/0x60 __kasan_slab_free+0x5f/0x80 kfree+0x307/0x580 blkg_free_workfn+0xef/0x460 process_one_work+0x8d0/0x1870 worker_thread+0x575/0xf80 kthread+0x2e7/0x3c0 ret_from_fork+0x576/0x810 ret_from_fork_asm+0x1a/0x30 Fixes: `44e44a1b32` ("block, bfq: improve responsiveness") Assisted-by: Codex:gpt-5.5 Signed-off-by: Cen Zhang <zzzccc427@gmail.com> Reviewed-by: Tao Cui <cuitao@kylinos.cn> Link: https://patch.msgid.link/20260621135930.2657810-1-zzzccc427@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-06-22 15:57:31 -06:00
Christoph Hellwig	214cdae69d	block: fix incorrect error injection static key decrement Only decrement the static key when we had items and thus it was incremented before. Fixes: `e8dcf2d142` ("block: add configurable error injection") Reported-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://patch.msgid.link/20260622160752.1552516-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-06-22 15:55:11 -06:00
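The rule "only decrement when it was incremented before" from the error injection commit above can be sketched with a plain counter standing in for the static key. This is an assumption-laden illustration: `struct inject_state`, `inject_add()` and `inject_clear()` are hypothetical names, and the scheme of taking one key reference when the first item is added is assumed for the example rather than taken from the kernel source.

```c
#include <stddef.h>

/* Hypothetical model: key_count stands in for the static key's enable count. */
struct inject_state {
    int key_count;   /* stand-in for the static key reference count */
    size_t nr_items; /* configured error injection items */
};

/* Adding the first item takes one reference on the key (assumed scheme). */
void inject_add(struct inject_state *s)
{
    if (s->nr_items++ == 0)
        s->key_count++;
}

/*
 * Clearing drops the reference only if items existed, i.e. only if the
 * key was incremented earlier. Clearing an empty list must not touch it.
 */
void inject_clear(struct inject_state *s)
{
    if (s->nr_items == 0)
        return;
    s->nr_items = 0;
    s->key_count--;
}
```

Without the early return, clearing an empty list would drive the count below zero, which is the kind of unbalanced decrement the commit describes.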
Qu Wenruo	d5b58fbb2f	block: respect iov_iter::nofault flag in bio_iov_iter_bounce_write() For the incoming usage of IOMAP_DIO_BOUNCE in btrfs, btrfs has set iov_iter::nofault to prevent a deadlock when a page fault is needed to read the buffer. However, bio_iov_iter_bounce_write() doesn't respect the iov_iter::nofault flag and just calls a plain copy_from_iter(), so it can still trigger a page fault and cause a deadlock in btrfs. Fix it by using copy_folio_from_iter_atomic() if the nofault flag is set, otherwise use copy_folio_from_iter(). Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://patch.msgid.link/9c165a314022b61566eb247852eb773ca6c70889.1781597506.git.wqu@suse.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-06-16 14:48:35 -06:00
Qu Wenruo	b68d4979c8	block: revert the iov_iter after a short copy in bio_iov_iter_bounce_write() For the incoming IOMAP_DIO_BOUNCE flag usage inside btrfs, it's pretty easy to hit a short copy inside bio_iov_iter_bounce_write(). This is because btrfs has disabled page faults to avoid a certain deadlock during direct writes, and instead btrfs manually faults in the pages and then retries. Inside bio_iov_iter_bounce_write(), if we hit a short copy, we didn't revert the iov_iter, which can cause problems like unexpected garbage on the next retry. Revert the iov_iter after a short copy. One thing to note is that the folio is allocated and then immediately queued into the bio, so the proper revert size is (bi_size - this_len + copied). Fixes: `8dd5e7c75d` ("block: add helpers to bounce buffer an iov_iter into bios") Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://patch.msgid.link/c400989f227343b134110773d5acaaacf7024574.1781597506.git.wqu@suse.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-06-16 14:48:35 -06:00
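The revert size given in the commit above, (bi_size - this_len + copied), can be checked with a small worked example. The helper name `bounce_revert_size()` is hypothetical; the three parameters mirror the names used in the commit message, with `bi_size` assumed to already include the folio that was just queued.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Bytes the iterator has consumed so far, and therefore the amount to
 * revert after a short copy:
 *   bi_size  - total bytes queued in the bio, including the current folio
 *   this_len - length of the current folio
 *   copied   - bytes actually copied into the current folio
 * Earlier folios contribute (bi_size - this_len); the current one only
 * contributes what was really copied.
 */
size_t bounce_revert_size(size_t bi_size, size_t this_len, size_t copied)
{
    assert(copied <= this_len);
    assert(this_len <= bi_size);
    return bi_size - this_len + copied;
}
```

For example, with two full 4096-byte folios already copied and a third 4096-byte folio queued but only 1000 bytes copied into it, `bi_size` is 12288 and the revert size is 12288 - 4096 + 1000 = 9192 bytes.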
Wen Xiong	9cbbac29d7	block: Remove redundant plug in __submit_bio() The patch removes the automatic plug/unplug operations from __submit_bio() that were added to cache nsecs time when no explicit plug is used. The plug mechanism is most effective when batching multiple I/O operations together. Creating a plug for every bio submission provides minimal benefit while adding function call overhead and stack usage for every I/O operation. Below is performance comparison with the latest upstream kernel. Iotype qd nj rmix mpstat busy mpstat busy without plug Randrw 1 20 100 53% 24% Randrw 1 40 100 70% 24% Randrw 1 20 70 40% 24% Randrw 1 40 70 60% 26% Randrw 1 20 0 14% 6% Randrw 1 40 0 20% 7% Fixes: `060406c61c` ("block: add plug while submitting IO") Signed-off-by: Wen Xiong <wenxiong@linux.ibm.com> Reviewed-by: Ming Lei <tom.leiming@gmail.com> Link: https://patch.msgid.link/20260616143121.878021-1-wenxiong@linux.ibm.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-06-16 10:06:27 -06:00
Yitang Yang	4f919141be	block: fix IORING_URING_CMD_REISSUE flags check in blkdev_uring_cmd blkdev_uring_cmd() checks IORING_URING_CMD_REISSUE to determine whether this is the first issue. However, this flag lives in cmd->flags instead of issue_flags. Coincidentally, IO_URING_F_NONBLOCK shares bit 31 with IORING_URING_CMD_REISSUE. As a result, the SQE read was never performed, bic->len remained zero, and every BLOCK_URING_CMD_DISCARD failed with -EINVAL. Fix it by checking cmd->flags as intended. Cc: stable@vger.kernel.org Fixes: `212ec34e4e` ("block: only read from sqe on initial invocation of blkdev_uring_cmd") Signed-off-by: Yitang Yang <yi1tang.yang@gmail.com> Link: https://patch.msgid.link/20260616155129.406057-1-yi1tang.yang@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-06-16 09:56:38 -06:00
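The bit collision described in the blkdev_uring_cmd() commit above can be illustrated with a minimal sketch. The `DEMO_*` constants, `struct demo_cmd` and both helper functions are hypothetical stand-ins; the only property carried over from the commit message is that the reissue flag and the nonblocking flag both occupy bit 31 but live in different flag words.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in constants: same bit, different flag words (per the commit message). */
#define DEMO_CMD_REISSUE (1U << 31) /* meaningful in cmd->flags only */
#define DEMO_F_NONBLOCK  (1U << 31) /* meaningful in issue_flags only */

/* Hypothetical command structure carrying its own flags word. */
struct demo_cmd {
    uint32_t flags;
};

/* Buggy variant: tests the reissue bit in the wrong word. */
bool first_issue_buggy(const struct demo_cmd *cmd, uint32_t issue_flags)
{
    (void)cmd;
    return !(issue_flags & DEMO_CMD_REISSUE);
}

/* Fixed variant: tests the reissue bit in the command's own flags. */
bool first_issue_fixed(const struct demo_cmd *cmd, uint32_t issue_flags)
{
    (void)issue_flags;
    return !(cmd->flags & DEMO_CMD_REISSUE);
}
```

On a first, nonblocking issue the buggy variant sees bit 31 set in `issue_flags` and wrongly concludes the command is a reissue, so the initial SQE read would be skipped; the fixed variant reports a first issue as expected.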
Linus Torvalds	b1cbabe84c	Merge tag 'for-7.2/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper updates from Mikulas Patocka: - small cleanups in dm-vdo, dm-raid, dm-cache, dm-zoned-metadata - rework of dm-ima - introduce dm-inlinecrypt - fix wrong return value in dm-ioctl - fix rcu stall when polling * tag 'for-7.2/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm-zoned-metadata: Use strscpy() to copy device name dm cache: make smq background work limit configurable dm-inlinecrypt: add support for hardware-wrapped keys dm: limit target bio polling to one shot dm-ioctl: report an error if a device has no table dm: add documentation for dm-inlinecrypt target dm-inlinecrypt: add target for inline block device encryption block: export blk-crypto symbols required by dm-inlinecrypt dm-ima: use active table's size if available dm-ima: Fail more gracefully in dm_ima_measure_on_* dm-ima: Handle race between rename and table swap dm-ima: Fix issues with dm_ima_measure_on_device_rename dm-ima: remove new_map from dm_ima_measure_on_device_clear dm-ima: Fix UAF errors and measuring incorrect context dm-ima: don't copy the active table to the inactive table dm-ima: Remove status_flags from dm_ima_measure_on_table_load() dm-ima: remove broken last_target_measured logic dm-ima: remove dm_ima_reset_data() dm-raid: only requeue bios when dm is suspending dm vdo: use get_random_u32() where appropriate	2026-06-16 13:20:54 +05:30
Linus Torvalds	ba9c792c82	Merge tag 'for-7.2/block-20260615' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Pull block updates from Jens Axboe: - NVMe pull request via Keith: - Per-controller admin and IO timeout sysfs attributes, and letting the block layer set request timeouts (Maurizio, Maximilian) - Multipath passthrough iostats, and PCI P2PDMA enablement for multipath devices (Keith, Kiran) - A new diag sysfs attribute group exporting per-controller counters (retries, multipath failover, error counters, requeue and failure counts, reset and reconnect events) (Nilay) - FDP configuration validation and bounds check fixes (liuxixin) - Various nvmet fixes, including a pre-auth out-of-bounds read in the Discovery Get Log Page handler, auth payload bounds validation, and tcp error-path leak fixes (Bryam, Tianchu, Geliang) - nvme-tcp lockdep and workqueue fixes (Shin'ichiro, Kuniyuki, Eric) - Assorted other fixes and cleanups (John, Yao, Chao, Mateusz, Achkinazi, Wentao) - MD pull request via Yu Kuai: - raid1/raid10 fixes for a deadlock in the read error recovery path, error-path detection and bio accounting with cloned bios, and an nr_pending leak in the REQ_ATOMIC bad-block error path (Abd-Alrhman) - PCI P2PDMA propagation from member devices to the RAID device (Kiran) - dm-raid bio requeue fix, and various smaller fixes and cleanups (Benjamin, Chen, Li, Thorsten) - Enable Clang lock context analysis for the block layer, with the accompanying annotations across queue limits, the blk_holder_ops callbacks, crypto, cgroup, iocost, kyber and mq-deadline (Bart) - Block status code infrastructure work: a tagged status table, a str_to_blk_op() helper, a bio_endio_status() helper, and on top of that a new configurable block-layer error injection facility (Christoph) - DRBD netlink rework, replacing the genl_magic machinery with explicit netlink serialization and moving the DRBD UAPI headers to include/uapi/linux/ (Christoph Böhmwalder) - bvec improvements: a 
bvec_folio() helper and making the bvec_iter helpers proper inline functions (Willy, Christoph) - ublk cleanups and a canceling-flag fix for the disk-not-allocated case (Caleb, Ming) - Partition handling fixes: bound the AIX pp_count scan, fix an of_node refcount leak, and replace __get_free_page() with kmalloc() (Bryam, Wentao, Mike) - Convert numa_node to int in blk_mq_hw_ctx and ->init_request, and add WQ_PERCPU to the block workqueue users (Mateusz, Marco) - Block statistics and tracing: propagate in-flight to the whole disk on partition IO, export passthrough stats, and a new block_rq_tag_wait tracepoint (Tang, Keith, Aaron) - A round of removals, unexports and cleanups across bio, direct-io and the bvec helpers (Christoph) - Various driver fixes (mtip32xx use-after-free, rbd snap_count validation and strscpy conversion, nbd socket lockdep reclassify, virtio-blk zone report clamp, floppy) and a batch of MAINTAINERS email/list updates (Coly, Li, Yu, Christoph Böhmwalder) - Other little fixes and cleanups all over * tag 'for-7.2/block-20260615' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (117 commits) MAINTAINERS: Update Coly Li's email address block: check bio split for unaligned bvec nbd: Reclassify sockets to avoid lockdep circular dependency block: add configurable error injection block: add a str_to_blk_op helper block: add a "tag" for block status codes block: add a macro to initialize the status table floppy: Drop unused pnp driver data block: propagate in_flight to whole disk on partition I/O virtio-blk: clamp zone report to the report buffer capacity block: optimize I/O merge hot path with unlikely() hints drivers/block/rbd: Use strscpy() to copy strings into arrays partitions: aix: bound the pp_count scan to the ppe array block: Enable lock context analysis block/mq-deadline: Make the lock context annotations compatible with Clang block/Kyber: Make the lock context annotations compatible with Clang block/blk-mq-debugfs: Improve lock 
context annotations block/blk-iocost: Inline iocg_lock() and iocg_unlock() block/blk-iocost: Split ioc_rqos_throttle() block/crypto: Annotate the crypto functions ...	2026-06-16 13:02:47 +05:30
Linus Torvalds	f8115f0e8a	Merge tag 'slab-for-7.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab Pull slab updates from Vlastimil Babka: - Support for "allocation tokens" (currently available in Clang 22+) for smarter partitioning of kmalloc caches based on the allocated object type, which can be enabled instead of the "random" per-caller-address-hash partitioning. It should be able to deterministically separate types containing a pointer from those that do not (Marco Elver) - Improvements and simplification of the kmem_cache_alloc_bulk() and mempool_alloc_bulk() API. This includes adaptation of callers (Christoph Hellwig) - Performance improvements and cleanups related mostly to sheaves refill (Hao Li, Shengming Hu, Vlastimil Babka) - Several fixups for the slabinfo tool (Xuewen Wang) * tag 'slab-for-7.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab: mm/slab: do not limit zeroing to orig_size when only red zoning is enabled mm/slub: preserve original size in _kmalloc_nolock_noprof retry path mm: simplify the mempool_alloc_bulk API mm/slab: improve kmem_cache_alloc_bulk mm/slub: detach and reattach partial slabs in batch mm/slub: introduce helpers for node partial slab state mm/slub: use empty sheaf helpers for oversized sheaves tools/mm/slabinfo: remove redundant slab->partial assignment tools/mm/slabinfo: remove dead assignment in get_obj_and_str() tools/mm/slabinfo: Fix trace disable logic inversion MAINTAINERS: add slab-related scripts and tools to SLAB ALLOCATOR mm/slub: fix typo in sheaves comment mm, slab: simplify returning slab in __refill_objects_node() mm, slab: add an optimistic __slab_try_return_freelist() slab: fix kernel-docs for mm-api slab: improve KMALLOC_PARTITION_RANDOM randomness slab: support for compiler-assisted type-based slab cache partitioning mm/slub: defer freelist construction until after bulk allocation from a new slab	2026-06-16 08:44:43 +05:30
Keith Busch	9b0c3673c8	block: check bio split for unaligned bvec Offsets and lengths need to be validated against the dma alignment. This check was skipped for a sufficiently small bio with a single bvec, which may allow an invalid request to be dispatched to the driver. Force the validation for an unaligned bvec by forcing the bio split path that handles this condition. Fixes: `7eac331869` ("iomap: simplify direct io validity check") Fixes: `5ff3f74e14` ("block: simplify direct io validity check") Reported-by: Carlos Maiolino <cem@kernel.org> Signed-off-by: Keith Busch <kbusch@kernel.org> Link: https://patch.msgid.link/20260612223205.465913-1-kbusch@meta.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-06-13 06:35:44 -06:00
Christoph Hellwig	e8dcf2d142	block: add configurable error injection Add a new block error injection interface that allows injecting specific status codes for specific ranges. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Md Haris Iqbal <haris.iqbal@linux.dev> Link: https://patch.msgid.link/20260611140703.2401204-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-06-12 10:40:35 -06:00
Christoph Hellwig	d39a63ead3	block: add a str_to_blk_op helper Add a helper to find the REQ_OP_XYZ constant from the "XYZ" string. This will be used for the error injection debugfs interface. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Md Haris Iqbal <haris.iqbal@linux.dev> Link: https://patch.msgid.link/20260611140703.2401204-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-06-12 10:40:35 -06:00
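The string-to-operation lookup described in the commit above can be illustrated with a small userspace sketch. This is not the kernel helper: the enum `demo_op`, the table `demo_op_names` and the function `demo_str_to_op()` are hypothetical names invented for illustration, and the set of operations is an arbitrary subset.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical subset of operations; names and values are illustrative. */
enum demo_op {
	DEMO_OP_INVALID = -1,
	DEMO_OP_READ,
	DEMO_OP_WRITE,
	DEMO_OP_FLUSH,
	DEMO_OP_DISCARD,
};

/* Table mapping the bare "XYZ" string to its constant. */
static const struct {
	const char *name;
	enum demo_op op;
} demo_op_names[] = {
	{ "READ",    DEMO_OP_READ },
	{ "WRITE",   DEMO_OP_WRITE },
	{ "FLUSH",   DEMO_OP_FLUSH },
	{ "DISCARD", DEMO_OP_DISCARD },
};

/* Return the constant for an exact, case-sensitive name match. */
static enum demo_op demo_str_to_op(const char *name)
{
	size_t n = sizeof(demo_op_names) / sizeof(demo_op_names[0]);

	for (size_t i = 0; i < n; i++) {
		if (strcmp(name, demo_op_names[i].name) == 0)
			return demo_op_names[i].op;
	}
	return DEMO_OP_INVALID;
}
```

A linear scan is adequate here because the table is tiny and the lookup would only run when a user writes to a control file, not on the I/O path.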
Christoph Hellwig	ce351560b7	block: add a "tag" for block status codes The full name of the status codes is not good for user interfaces as it can contain white spaces. Add the name of the status code without the BLK_STS_ prefix as a tag so that it can be used for user interfaces. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Md Haris Iqbal <haris.iqbal@linux.dev> Link: https://patch.msgid.link/20260611140703.2401204-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-06-12 10:40:35 -06:00
Christoph Hellwig	8c8ebed165	block: add a macro to initialize the status table Prepare for adding a new value to the error table by adding a macro to fill it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Md Haris Iqbal <haris.iqbal@linux.dev> Link: https://patch.msgid.link/20260611140703.2401204-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-06-12 10:40:35 -06:00
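The idea behind the status-table macro, and the whitespace-free tag that the follow-up commit derives from it, can be sketched in userspace C. Everything named `demo_*` or `DEMO_*` below is hypothetical, and the entries are an arbitrary illustrative subset rather than the kernel's table.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical status codes standing in for the BLK_STS_* values. */
enum demo_status { DEMO_STS_OK, DEMO_STS_IOERR, DEMO_STS_NOSPC, DEMO_STS_MAX };

struct demo_status_entry {
	int errno_val;    /* errno the status maps to */
	const char *tag;  /* short name without prefix, safe for user interfaces */
	const char *name; /* human-readable name, may contain spaces */
};

/* One macro fills the slot and derives the tag by stringifying the suffix. */
#define DEMO_ENTRY(_tag, _errno, _name) \
	[DEMO_STS_##_tag] = { .errno_val = (_errno), .tag = #_tag, .name = (_name) }

static const struct demo_status_entry demo_status_table[DEMO_STS_MAX] = {
	DEMO_ENTRY(OK,    0,       ""),
	DEMO_ENTRY(IOERR, -EIO,    "I/O"),
	DEMO_ENTRY(NOSPC, -ENOSPC, "critical space allocation"),
};

/* Look a status up by tag; returns -1 when the tag is unknown. */
static int demo_tag_to_status(const char *tag)
{
	for (int i = 0; i < DEMO_STS_MAX; i++) {
		if (strcmp(tag, demo_status_table[i].tag) == 0)
			return i;
	}
	return -1;
}
```

Deriving the tag from the enumerator suffix keeps the two in sync by construction, so adding a new field to the table only requires touching the macro.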
Tang Yizhou	5bdb8ec58b	block: propagate in_flight to whole disk on partition I/O Now when I/O is submitted to a partition, the per-CPU in_flight[] counter is incremented only on the partition's block_device, not on the underlying whole disk. This leads to a problem which can be shown by a fio test: lsblk NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS mydev 252:1 0 20G 0 disk └─mydev1 259:0 0 10G 0 part iostat -xp 1 Device r/s rkB/s ... aqu-sz %util mydev 128153.00 512612.00 ... 13.22 72.20 mydev1 128154.00 512616.00 ... 13.22 100.00 %util is different between mydev and mydev1, which is unexpected. This is the cumulative effect of a series of patches. The root cause is commit `e016b78201` ("block: return just one value from part_in_flight"), which deleted the branch in part_in_flight() that aggregated the whole-disk in_flight count on top of the partition's. Then the second commit is commit `10ec5e86f9` ("block: merge part_{inc,dev}_in_flight into their only callers"), which folded the whole-disk in_flight accounting into generic_start_io_acct() and generic_end_io_acct(). Those two helpers were then removed by commit `e722fff238` ("block: remove generic_{start,end}_io_acct"), and from that point on the whole disk's in_flight is no longer accounted at all. In update_io_ticks(), if calling bdev_count_inflight() finds that the inflight value of the whole device is 0, the accumulation of io_ticks will be skipped, causing the reported util% value to be underestimated. Fix it by restoring the whole-disk in_flight accounting. Fixes: `e016b78201` ("block: return just one value from part_in_flight") Suggested-by: Leon Hwang <leon.huangfu@shopee.com> Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Tang Yizhou <yizhou.tang@shopee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://patch.msgid.link/20260526021555.359500-1-yizhou.tang@shopee.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-06-09 10:12:50 -06:00
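The accounting rule restored by the commit above can be sketched as follows. This is a simplified userspace model, not the kernel code: `struct demo_bdev`, `demo_start_io()` and `demo_end_io()` are hypothetical names, and a plain counter stands in for the per-CPU in_flight[] array.

```c
#include <stddef.h>

/* Hypothetical stand-in for a block device; 'whole' is NULL for the disk itself. */
struct demo_bdev {
	struct demo_bdev *whole;
	long in_flight;
};

/* Account a started I/O on the device and, for a partition, on its disk too. */
static void demo_start_io(struct demo_bdev *bdev)
{
	bdev->in_flight++;
	if (bdev->whole)
		bdev->whole->in_flight++;
}

/* Undo the accounting when the I/O completes. */
static void demo_end_io(struct demo_bdev *bdev)
{
	bdev->in_flight--;
	if (bdev->whole)
		bdev->whole->in_flight--;
}
```

With this rule the whole disk never reports zero in-flight requests while one of its partitions is busy, which is the condition the commit message identifies as causing io_ticks accumulation to be skipped.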
Steven Feng	7ed4aab138	block: optimize I/O merge hot path with unlikely() hints Remove redundant '== false' comparisons and add unlikely() branch prediction hints in block I/O merge path functions. These functions (ll_new_hw_segment, ll_merge_requests_fn, and blk_rq_merge_ok) are executed on every I/O request merge attempt, making them critical hot paths. Data integrity check failures are rare events, so marking these conditions as unlikely() helps the CPU optimize the common case by improving branch prediction. Changes: - Replace 'func() == false' with 'unlikely(!func())' for better code style and branch prediction This micro-optimization reduces branch misprediction penalties in high-frequency I/O merge paths. Signed-off-by: Steven Feng <steven@joint-cloud.com> Link: https://patch.msgid.link/tencent_79B652BD0CC23E093F27914380F161E7E505@qq.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-06-08 07:46:58 -06:00
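The transformation described in the commit above looks roughly like this in userspace C. The `likely()`/`unlikely()` definitions mirror the usual GCC/Clang `__builtin_expect` idiom; `demo_integrity_ok()` and the two `demo_merge_*()` functions are hypothetical stand-ins, not the kernel functions named in the message.

```c
#include <stdbool.h>

/* Userspace equivalents of the branch-hint macros (GCC and Clang builtin). */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Hypothetical check standing in for a data integrity merge test. */
static bool demo_integrity_ok(int nr_segments, int max_segments)
{
	return nr_segments <= max_segments;
}

/* Before: explicit comparison against false, no hint. */
static int demo_merge_old(int nr, int max)
{
	if (demo_integrity_ok(nr, max) == false)
		return 0;
	return 1;
}

/* After: same logic, with the rare failure branch marked as unlikely. */
static int demo_merge_new(int nr, int max)
{
	if (unlikely(!demo_integrity_ok(nr, max)))
		return 0;
	return 1;
}
```

The two variants return identical results; the hint only influences how the compiler lays out the branches, so any benefit depends on the target and workload.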
Bryam Vargas	2dc0bfd2fe	partitions: aix: bound the pp_count scan to the ppe array aix_partition() reads the physical volume descriptor into a fixed-size struct pvd and then scans its physical-partition-extent array: int numpps = be16_to_cpu(pvd->pp_count); ... for (i = 0; i < numpps; i += 1) { struct ppe *p = pvd->ppe + i; ... lp_ix = be16_to_cpu(p->lp_ix); pvd points at a single kmalloc()'d struct pvd whose ppe[] member holds a fixed ARRAY_SIZE(pvd->ppe) (1016) entries, but the loop runs up to the on-disk pp_count. pp_count is an unvalidated __be16 read straight from the descriptor, so a crafted AIX image with pp_count larger than 1016 drives the loop to read pvd->ppe[i] past the end of the allocation (up to 65535 entries, ~2 MB out of bounds). The partition scan runs without mounting anything, when a block device with a crafted AIX/IBM partition table appears (an attacker-supplied image attached with losetup -P, or a device auto-scanned by udev), via msdos_partition() -> aix_partition(). Clamp the scan to the number of entries the ppe[] array can hold. Fixes: `6ceea22bbb` ("partitions: add aix lvm partition support files") Cc: stable@vger.kernel.org Signed-off-by: Bryam Vargas <hexlabsecurity@proton.me> Acked-by: Philippe De Muyter <phdm@macqel.be> Link: https://patch.msgid.link/20260607064137.302574-1-hexlabsecurity@proton.me Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-06-08 07:41:21 -06:00
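The clamp described in the commit above can be sketched in userspace C. This is an illustration under stated assumptions, not the kernel patch: `struct demo_pvd`, `struct demo_ppe`, `clamped_pp_count()` and `count_mapped()` are hypothetical, byte-order conversion is omitted, and only the 1016-entry array size and the 16-bit pp_count come from the commit message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_PPE_SLOTS 1016 /* array size quoted in the commit message */

/* Hypothetical stand-ins for the on-disk structures. */
struct demo_ppe {
	uint16_t lp_ix;
};

struct demo_pvd {
	uint16_t pp_count;                   /* untrusted value read from disk */
	struct demo_ppe ppe[DEMO_PPE_SLOTS];
};

/* Return how many ppe[] entries may safely be scanned. */
static size_t clamped_pp_count(const struct demo_pvd *pvd)
{
	size_t slots = sizeof(pvd->ppe) / sizeof(pvd->ppe[0]);
	size_t wanted = pvd->pp_count;

	return wanted < slots ? wanted : slots;
}

/* Count entries with a non-zero lp_ix without ever leaving the array. */
static size_t count_mapped(const struct demo_pvd *pvd)
{
	size_t n = clamped_pp_count(pvd);
	size_t mapped = 0;

	for (size_t i = 0; i < n; i++) {
		assert(i < DEMO_PPE_SLOTS); /* invariant guaranteed by the clamp */
		if (pvd->ppe[i].lp_ix)
			mapped++;
	}
	return mapped;
}
```

Clamping the loop bound, rather than rejecting the descriptor outright, lets the scan still process the entries that do fit in the array.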
Bart Van Assche	5f0777166e	block: Enable lock context analysis Now that all block/*.c files have been annotated, enable lock context analysis for all these source files. Reviewed-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://patch.msgid.link/e248ca3aeead238bbc489cf3afdafcbff9e41faf.1780682325.git.bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-06-05 13:41:11 -06:00
Bart Van Assche	f10b2de2af	block/mq-deadline: Make the lock context annotations compatible with Clang While sparse ignores the __acquires() and __releases() arguments, Clang verifies these. Make the arguments of __acquires() and __releases() acceptable for Clang. Reviewed-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://patch.msgid.link/3b6e336ced91e27213608ffce205ccd24f4ba285.1780682325.git.bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-06-05 13:41:11 -06:00
Bart Van Assche	b4591b9152	block/Kyber: Make the lock context annotations compatible with Clang While sparse ignores the __acquires() and __releases() arguments, Clang verifies these. Make the arguments of __acquires() and __releases() acceptable for Clang. Reviewed-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://patch.msgid.link/91cb8c790fc8b26b8aa742569fbf8c2c1d099dac.1780682325.git.bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-06-05 13:41:11 -06:00
Bart Van Assche	131f14125a	block/blk-mq-debugfs: Improve lock context annotations Make the existing lock context annotations compatible with Clang. Add the lock context annotations that are missing. Reviewed-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://patch.msgid.link/f58fe220ff98f9dfddfed4573f40005c773b7fb7.1780682325.git.bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-06-05 13:41:11 -06:00
Bart Van Assche	a255026594	block/blk-iocost: Inline iocg_lock() and iocg_unlock() Both iocg_lock() and iocg_unlock() use conditional locking. Fold these functions into their callers such that unlocking becomes unconditional. Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://patch.msgid.link/f8c9867788957d2e40a32e23c6d9b866e480ad9d.1780682325.git.bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-06-05 13:41:11 -06:00
Bart Van Assche	1ff85a3879	block/blk-iocost: Split ioc_rqos_throttle() Prepare for inlining iocg_lock() and iocg_unlock() by moving the code between these two calls into a new function. No functionality has been changed. Reviewed-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://patch.msgid.link/a6d3ed953cef6669d23a80923bf46600733cbdae.1780682325.git.bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-06-05 13:41:11 -06:00
Bart Van Assche	73bb2480e3	block/crypto: Annotate the crypto functions Add the lock context annotations required for Clang's thread-safety analysis. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@kernel.org> Cc: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://patch.msgid.link/297b40e43a7f9b7d20e91a6c44b41a69d01f5c63.1780682325.git.bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-06-05 13:41:11 -06:00
Bart Van Assche	998cda78d4	block/cgroup: Inline blkg_conf_{open,close}_bdev_frozen() The blkg_conf_open_bdev_frozen() calling convention is not compatible with lock context annotations. Fold both blkg_conf_open_bdev_frozen() and blkg_conf_close_bdev_frozen() into their only caller. This patch prepares for enabling lock context analysis. The type of 'memflags' has been changed from unsigned long into unsigned int to match the type of current->flags. See also <linux/sched.h>. Reviewed-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://patch.msgid.link/05661d1555decc6dd5389174ba448d803b72ed9a.1780682325.git.bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-06-05 13:41:11 -06:00
Bart Van Assche	6a7717a2df	block/blk-iocost: Combine two error paths in ioc_qos_write() Reduce code duplication by combining two error paths. No functionality has been changed. Suggested-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://patch.msgid.link/80d4fc1ecd5eaf187c0a31c63a1033a7326d4c7e.1780682325.git.bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-06-05 13:41:11 -06:00
Bart Van Assche	9865e41664	block/cgroup: Improve lock context annotations Add lock context annotations where these are missing. Move the blkg_conf_prep() annotation into block/blk-cgroup.h to make it visible to all blkg_conf_prep() callers. Reviewed-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://patch.msgid.link/58ddd6e2b960bdfa03d0007984386bc0ba351391.1780682325.git.bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-06-05 13:41:11 -06:00
Bart Van Assche	c574c3cc36	block/cgroup: Split blkg_conf_exit() Split blkg_conf_exit() into blkg_conf_unprep() and blkg_conf_close_bdev() because blkg_conf_exit() is not compatible with the Clang thread-safety annotations. Remove blkg_conf_exit(). Rename blkg_conf_exit_frozen() into blkg_conf_close_bdev_frozen(). Add thread-safety annotations to the new functions. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@kernel.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://patch.msgid.link/c1ec1f1c4b675bc5f187f77b3e6436234c6b244c.1780682325.git.bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-06-05 13:41:11 -06:00
Bart Van Assche	ea4f575e72	block/cgroup: Split blkg_conf_prep() Move the blkg_conf_open_bdev() call out of blkg_conf_prep() to make it possible to add lock context annotations to blkg_conf_prep(). Change an if-statement in blkg_conf_open_bdev() into a WARN_ON_ONCE() call. Export blkg_conf_open_bdev() because it is called by the BFQ I/O scheduler and the BFQ I/O scheduler may be built as a kernel module. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@kernel.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://patch.msgid.link/e6ea0387f413217c8561a0ca54ce7b846aa5c7c5.1780682325.git.bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-06-05 13:41:11 -06:00
Marco Crivellari	7e712f292e	block: Add WQ_PERCPU to alloc_workqueue users This continues the effort to refactor workqueue APIs, which began with the introduction of new workqueues and a new alloc_workqueue flag in: commit `128ea9f6cc` ("workqueue: Add system_percpu_wq and system_dfl_wq") commit `930c2ea566` ("workqueue: Add new WQ_PERCPU flag") The refactoring is going to alter the default behavior of alloc_workqueue() to be unbound by default. With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND), any alloc_workqueue() caller that doesn’t explicitly specify WQ_UNBOUND must now use WQ_PERCPU. For more details see the Link tag below. In order to keep alloc_workqueue() behavior identical, explicitly request WQ_PERCPU. Link: https://lore.kernel.org/all/20250221112003.1dSuoGyc@linutronix.de/ Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Marco Crivellari <marco.crivellari@suse.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://patch.msgid.link/20260604105347.168322-1-marco.crivellari@suse.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-06-05 11:21:39 -06:00
Christoph Hellwig	be6e0a533d	mm: simplify the mempool_alloc_bulk API The mempool_alloc_bulk was modelled after the alloc_pages_bulk API, including some misunderstanding of it. Remove checking for NULL slots in the array, as alloc_pages_bulk and kmem_cache_alloc_bulk always fill the array from the beginning and thus we know the offset of the first failing allocation. This removes support for working well with alloc_pages_bulk used to refill page arrays that might have an entry removed from in the middle, but that is only used by sunrpc and hopefully on its way out. Also remove the allocated parameter as it is redundant because the caller can simply specify an offset into the entries array. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://patch.msgid.link/20260602160038.3976341-1-hch@lst.de Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>	2026-06-03 18:20:47 +02:00
Thorsten Blum	aa528cd12c	block/partitions/acorn: use min in {riscix,linux}_partition Use min() to replace the open-coded implementations and to simplify riscix_partition() and linux_partition(). Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Reviewed-by: Kees Cook <kees@kernel.org> Link: https://patch.msgid.link/20260602160757.973736-3-thorsten.blum@linux.dev Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-06-02 11:14:52 -06:00
Yu Kuai	6636e16e60	block, bfq: release cgroup stats with bfq_group BFQ cgroup stats contain percpu counters embedded in struct bfq_group, but the old free path destroys them from bfq_pd_free(), which is tied to blkg policy-data teardown. That is not the same lifetime as struct bfq_group. BFQ pins bfq_group while bfq_queue entities refer to it, so bfq_pd_free() can drop the policy-data reference while other bfq_group references still exist. The following blkcg change also defers policy-data release through RCU and leaves BFQ to run the final bfqg_put() from an RCU callback. For that conversion, stats teardown must belong to the last bfq_group put, not to policy-data teardown. Move stats teardown to bfqg_put() so the embedded counters are destroyed exactly when the last bfq_group reference is released, before kfree(bfqg). Without this preparatory change, the RCU-delayed policy-data free conversion reproduced the following KASAN report: BUG: KASAN: slab-use-after-free in percpu_counter_destroy_many+0xf1/0x2e0 Write of size 8 at addr ffff88811d9409e0 by task test_blkcg/535 CPU: 0 UID: 0 PID: 535 Comm: test_blkcg Not tainted 7.1.0-rc2-g1e14adca0199 #1 PREEMPT ea13f83d4b74a12510d20db4a7d9a0fe8275f05c Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.17.0-5.fc42 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x54/0x70 print_address_description+0x77/0x200 ? percpu_counter_destroy_many+0xf1/0x2e0 print_report+0x64/0x70 kasan_report+0x118/0x150 ? 
percpu_counter_destroy_many+0xf1/0x2e0 percpu_counter_destroy_many+0xf1/0x2e0 __mmdrop+0x1d8/0x350 finish_task_switch+0x3f5/0x570 __schedule+0xe8e/0x18a0 schedule+0xfe/0x1c0 schedule_timeout+0x7f/0x1d0 __wait_for_common+0x26c/0x3f0 wait_for_completion_state+0x21/0x40 call_usermodehelper_exec+0x271/0x2c0 __request_module+0x296/0x410 elv_iosched_store+0x1bc/0x2c0 queue_attr_store+0x152/0x1c0 kernfs_fop_write_iter+0x1d7/0x280 vfs_write+0x580/0x630 ksys_write+0xec/0x190 do_syscall_64+0x156/0x490 entry_SYSCALL_64_after_hwframe+0x77/0x7f Allocated by task 535: kasan_save_track+0x3e/0x80 __kasan_kmalloc+0x72/0x90 bfq_pd_alloc+0x60/0x100 [bfq] blkg_create+0x3bb/0xbe0 blkg_lookup_create+0x3a2/0x460 blkg_conf_start+0x24a/0x2d0 bfq_io_set_weight+0x17f/0x430 [bfq] cgroup_file_write+0x1c5/0x4b0 kernfs_fop_write_iter+0x1d7/0x280 vfs_write+0x580/0x630 ksys_write+0xec/0x190 do_syscall_64+0x156/0x490 entry_SYSCALL_64_after_hwframe+0x77/0x7f Freed by task 0: kasan_save_track+0x3e/0x80 kasan_save_free_info+0x46/0x50 __kasan_slab_free+0x3a/0x60 kfree+0x14e/0x4f0 rcu_core+0x6f3/0xcd0 handle_softirqs+0x1a0/0x550 __irq_exit_rcu+0x8c/0x150 irq_exit_rcu+0xe/0x20 sysvec_apic_timer_interrupt+0x6e/0x80 asm_sysvec_apic_timer_interrupt+0x1a/0x20 Last potentially related work creation: kasan_save_stack+0x3e/0x60 kasan_record_aux_stack+0x99/0xb0 call_rcu+0x55/0x5c0 blkg_free_workfn+0x130/0x220 process_scheduled_works+0x655/0xb60 worker_thread+0x446/0x600 kthread+0x1f4/0x230 ret_from_fork+0x259/0x420 ret_from_fork_asm+0x1a/0x30 Signed-off-by: Yu Kuai <yukuai@fygo.io> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20260601061502.899552-1-yukuai@fygo.io Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-06-02 07:24:58 -06:00
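The lifetime rule from the commit above, that embedded stats are torn down by the last reference put and not by policy-data teardown, can be modelled in userspace C. All names here (`struct demo_group`, `demo_group_alloc()`, `demo_group_get()`, `demo_group_put()`, `demo_pd_free()`) are hypothetical, a plain integer replaces the kernel's reference counting, and a counter owned by the caller records when stats teardown runs.

```c
#include <stdlib.h>

/* Hypothetical group object with embedded stats and a simple refcount. */
struct demo_group {
	int refs;
	int *stats_teardowns; /* caller-owned counter, bumped on stats teardown */
};

/* Allocate a group holding one reference (the policy-data reference). */
static struct demo_group *demo_group_alloc(int *stats_teardowns)
{
	struct demo_group *g = malloc(sizeof(*g));

	if (!g)
		return NULL;
	g->refs = 1;
	g->stats_teardowns = stats_teardowns;
	return g;
}

static void demo_group_get(struct demo_group *g)
{
	g->refs++;
}

/* Stats are destroyed only here, right before the object is freed. */
static void demo_group_put(struct demo_group *g)
{
	if (--g->refs > 0)
		return;
	(*g->stats_teardowns)++;
	free(g);
}

/* Policy-data teardown just drops its reference; it no longer touches stats. */
static void demo_pd_free(struct demo_group *g)
{
	demo_group_put(g);
}
```

Because teardown is tied to the final put, an outstanding reference held elsewhere keeps the stats valid even after policy-data teardown has run.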
Linus Torvalds	9215e74f22	Merge tag 'block-7.1-20260529' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Pull block fix from Jens Axboe: "Just a single fix for the block side, making a slight tweak to a fix from this cycle" * tag 'block-7.1-20260529' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: blk-mq: reinsert cached request to the list	2026-05-29 10:41:07 -07:00
Matthew Wilcox (Oracle)	ca36c1aaee	block: Add bvec_folio() This is a simple helper which replaces page_folio(bvec->bv_page). Minor improvement in readability, but the real motivation is to reduce the number of references to bvec->bv_page so that it can be changed with less work. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Leon Romanovsky <leon@kernel.org> Reviewed-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: William Kucharski <william.kucharski@linux.dev> Link: https://patch.msgid.link/20260528175905.1102280-2-willy@infradead.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-05-29 05:34:24 -06:00

8304 Commits