8711 Commits

Author SHA1 Message Date
Linus Torvalds
3f0e9c8cef Merge tag 'block-6.19-20251226' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull block fixes from Jens Axboe:

 - Fix for a signedness issue introduced in this kernel release for rnbd

 - Fix up user copy references for ublk when the server exits

* tag 'block-6.19-20251226' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
  block: rnbd-clt: Fix signedness bug in init_dev()
  ublk: clean up user copy references on ublk server exit
2025-12-26 11:44:35 -08:00
Dan Carpenter
1ddb815fdf block: rnbd-clt: Fix signedness bug in init_dev()
The "dev->clt_device_id" variable is set using ida_alloc_max() which
returns an int and in particular it returns negative error codes.
Change the type from u32 to int to fix the error checking.
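
A minimal sketch of the pattern being fixed (the IDA name and range are
illustrative, not rnbd's actual code):

  int id;                              /* was: u32, which can never be < 0 */

  id = ida_alloc_max(&index_ida, max_id, GFP_KERNEL);  /* ID or -errno */
  if (id < 0)
          return id;                   /* error check now works as intended */
  dev->clt_device_id = id;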

Fixes: c9b5645fd8 ("block: rnbd-clt: Fix leaked ID in init_dev()")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-12-20 12:56:48 -07:00
Caleb Sander Mateos
daa24603d9 ublk: clean up user copy references on ublk server exit
If a ublk server process releases a ublk char device file, any requests
dispatched to the ublk server but not yet completed will retain a ref
value of UBLK_REFCOUNT_INIT. Before commit e63d2228ef ("ublk: simplify
aborting ublk request"), __ublk_fail_req() would decrement the reference
count before completing the failed request. However, that commit
optimized __ublk_fail_req() to call __ublk_complete_rq() directly
without decrementing the request reference count.
The leaked reference count incorrectly allows user copy and zero copy
operations on the completed ublk request. It also triggers the
WARN_ON_ONCE(refcount_read(&io->ref)) warnings in ublk_queue_reinit()
and ublk_deinit_queue().
Commit c5c5eb24ed ("ublk: avoid ublk_io_release() called after ublk
char dev is closed") already fixed the issue for ublk devices using
UBLK_F_SUPPORT_ZERO_COPY or UBLK_F_AUTO_BUF_REG. However, the reference
count leak also affects UBLK_F_USER_COPY, the other reference-counted
data copy mode. Fix the condition in ublk_check_and_reset_active_ref()
to include all reference-counted data copy modes. This ensures that any
ublk requests still owned by the ublk server when it exits have their
reference counts reset to 0.
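
A hedged sketch of the broadened check (surrounding helper and field access
are illustrative; the flags are the uapi ublk flags named above):

  /* Reset active io refs for every reference-counted data copy mode,
   * not just zero copy / auto buffer registration. */
  if (!(ub->dev_info.flags & (UBLK_F_SUPPORT_ZERO_COPY |
                              UBLK_F_AUTO_BUF_REG |
                              UBLK_F_USER_COPY)))
          return;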

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Fixes: e63d2228ef ("ublk: simplify aborting ublk request")
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-12-20 12:56:37 -07:00
Linus Torvalds
d8ba32c5a4 Merge tag 'block-6.19-20251218' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull block fixes from Jens Axboe:

 - ublk selftests for missing coverage

 - two fixes for the block integrity code

 - fix for the newly added PR read keys ioctl, limiting the
   memory that can be allocated

 - workaround for a deadlock that can occur with ublk, where partition
   scanning ends up recursing back into file closure, which needs the
   same mutex grabbed. Not the prettiest thing in the world, but an
   acceptable workaround until we can eliminate the reliance on
   disk->open_mutex for this

 - fix for a race between enabling writeback throttling and new IO
   submissions

 - move a bit of bio flag handling code. No changes, but needed for a
   patchset for a future kernel

 - fix for an init-time ID leak in rnbd

 - loop/zloop state check fix

* tag 'block-6.19-20251218' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
  block: validate interval_exp integrity limit
  block: validate pi_offset integrity limit
  block: rnbd-clt: Fix leaked ID in init_dev()
  ublk: fix deadlock when reading partition table
  block: add allocation size check in blkdev_pr_read_keys()
  Documentation: admin-guide: blockdev: replace zone_capacity with zone_capacity_mb when creating devices
  zloop: use READ_ONCE() to read lo->lo_state in queue_rq path
  loop: use READ_ONCE() to read lo->lo_state without locking
  block: fix race between wbt_enable_default and IO submission
  selftests: ublk: add user copy test cases
  selftests: ublk: add support for user copy to kublk
  selftests: ublk: forbid multiple data copy modes
  selftests: ublk: don't share backing files between ublk servers
  selftests: ublk: use auto_zc for PER_IO_DAEMON tests in stress_04
  selftests: ublk: fix fio arguments in run_io_and_recover()
  selftests: ublk: remove unused ios map in seq_io.bt
  selftests: ublk: correct last_rw map type in seq_io.bt
  selftests: ublk: fix overflow in ublk_queue_auto_zc_fallback()
  block: move around bio flagging helpers
2025-12-20 09:48:56 -08:00
Thomas Fourier
c9b5645fd8 block: rnbd-clt: Fix leaked ID in init_dev()
If kstrdup() fails in init_dev(), then the newly allocated ID is lost.
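
A hedged sketch of the shape of the fix (identifiers other than kstrdup()
are illustrative):

  dev->pathname = kstrdup(pathname, GFP_KERNEL);
  if (!dev->pathname) {
          ida_free(&index_ida, dev->clt_device_id);   /* don't leak the ID */
          ret = -ENOMEM;
          goto out;
  }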

Fixes: 64e8a6ece1 ("block/rnbd-clt: Dynamically alloc buffer for pathname & blk_symlink_name")
Signed-off-by: Thomas Fourier <fourier.thomas@gmail.com>
Acked-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-12-18 08:10:28 -07:00
Ming Lei
c258f5c450 ublk: fix deadlock when reading partition table
When one process (such as udev) opens a ublk block device (e.g., to read
the partition table via bdev_open()), a deadlock[1] can occur:

1. bdev_open() grabs disk->open_mutex
2. The process issues read I/O to ublk backend to read partition table
3. In __ublk_complete_rq(), blk_update_request() or blk_mq_end_request()
   runs bio->bi_end_io() callbacks
4. If this triggers fput() on a file descriptor of the ublk block device, the
   work may be deferred to the current task's task work (see the fput() implementation)
5. This eventually calls blkdev_release() from the same context
6. blkdev_release() tries to grab disk->open_mutex again
7. Deadlock: same task waiting for a mutex it already holds

The fix is to run blk_update_request() and blk_mq_end_request() with bottom
halves disabled. This forces blkdev_release() to run in kernel workqueue
context instead of the current task's task work context, allowing the ublk
server to make forward progress and avoiding the deadlock.
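
A minimal sketch of the completion-path change described above (request and
error variables are illustrative):

  /*
   * With bottom halves disabled, an fput() triggered by bio completion
   * cannot add the final release to this task's task work; it is punted
   * to a workqueue instead, so disk->open_mutex is not re-acquired in
   * the context that already holds it.
   */
  local_bh_disable();
  blk_mq_end_request(rq, error);
  local_bh_enable();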

Fixes: 71f28f3136 ("ublk_drv: add io_uring based userspace block driver")
Link: https://github.com/ublk-org/ublksrv/issues/170 [1]
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
[axboe: rewrite comment in ublk]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-12-17 19:40:28 -07:00
Yongpeng Yang
4b2b03151e zloop: use READ_ONCE() to read lo->lo_state in queue_rq path
In the queue_rq path, zlo->state is accessed without locking, and direct
access may read stale data. This patch uses READ_ONCE() to read
zlo->state and data_race() to silence code checkers, and changes all
assignments to use WRITE_ONCE().
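
A hedged sketch of the pattern (zloop's actual field and state names may
differ):

  /* writer, under the device's configuration lock */
  WRITE_ONCE(zlo->state, new_state);

  /* lockless reader in the queue_rq path; READ_ONCE()/data_race()
   * document the intentionally unlocked access for KCSAN */
  if (data_race(zlo->state) != ZLOOP_STATE_LIVE)
          return BLK_STS_IOERR;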

Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Yongpeng Yang <yangyongpeng@xiaomi.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-12-15 09:32:42 -07:00
Yongpeng Yang
54891a96b7 loop: use READ_ONCE() to read lo->lo_state without locking
When lo->lo_mutex is not held, direct access may read stale data. This
patch uses READ_ONCE() to read lo->lo_state and data_race() to silence
code checkers, and changes all assignments to use WRITE_ONCE().

Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Yongpeng Yang <yangyongpeng@xiaomi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-12-15 09:32:42 -07:00
Linus Torvalds
0dfb36b2dc Merge tag 'ceph-for-6.19-rc1' of https://github.com/ceph/ceph-client
Pull ceph updates from Ilya Dryomov:
 "We have a patch that adds an initial set of tracepoints to the MDS
  client from Max, a fix that hardens osdmap parsing code from myself
  (marked for stable) and a few assorted fixups"

* tag 'ceph-for-6.19-rc1' of https://github.com/ceph/ceph-client:
  rbd: stop selecting CRC32, CRYPTO, and CRYPTO_AES
  ceph: stop selecting CRC32, CRYPTO, and CRYPTO_AES
  libceph: make decode_pool() more resilient against corrupted osdmaps
  libceph: Amend checking to fix `make W=1` build breakage
  ceph: Amend checking to fix `make W=1` build breakage
  ceph: add trace points to the MDS client
  libceph: fix log output race condition in OSD client
2025-12-14 15:24:10 +12:00
Linus Torvalds
35ebee7e72 Merge tag 'block-6.19-20251211' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull block fixes from Jens Axboe:

 - Always initialize DMA state, fixing a potentially nasty issue on the
   block side

 - btrfs zoned write fix with cached zone reports

 - Fix corruption issues in bcache with chained bios, and further make
   it clear that the chained IO handler is simply a marker, not code
   meant to be executed

 - Kill old code dealing with synchronous IO polling in the block layer
   that has been dead for a long time. Only async polling is supported
   these days

 - Fix a lockdep issue in tag_set management, moving it to RCU

 - Fix an issue with ublk's bio_vec iteration

 - Don't unconditionally enforce blocking issue of ublk control
   commands; allow some of them with non-blocking issue, as they
   do not block

* tag 'block-6.19-20251211' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
  blk-mq-dma: always initialize dma state
  blk-mq: delete task running check in blk_hctx_poll()
  block: fix cached zone reports on devices with native zone append
  block: Use RCU in blk_mq_[un]quiesce_tagset() instead of set->tag_list_lock
  ublk: don't mutate struct bio_vec in iteration
  block: prohibit calls to bio_chain_endio
  bcache: fix improper use of bi_end_io
  ublk: allow non-blocking ctrl cmds in IO_URING_F_NONBLOCK issue
2025-12-12 22:04:18 +12:00
Ilya Dryomov
21c1466ea2 rbd: stop selecting CRC32, CRYPTO, and CRYPTO_AES
None of the RBD code directly requires CRC32, CRYPTO, or CRYPTO_AES.
These options are needed by CEPH_LIB code and they are selected there
directly.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Dongsheng Yang <dongsheng.yang@linux.dev>
2025-12-10 11:50:54 +01:00
Caleb Sander Mateos
db339b4067 ublk: don't mutate struct bio_vec in iteration
__bio_for_each_segment() uses the returned struct bio_vec's bv_len field
to advance the struct bvec_iter at the end of each loop iteration. So
it's incorrect to modify it during the loop. Don't assign to bv_len (or
bv_offset, for that matter) in ublk_copy_user_pages().
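
A minimal sketch of the hazard (the copy helper and byte accounting are
illustrative):

  struct bvec_iter iter;
  struct bio_vec bv;
  unsigned int done = 0;

  __bio_for_each_segment(bv, bio, iter, bio->bi_iter) {
          unsigned int len = min(bv.bv_len, max_bytes - done);  /* local copy */

          /* bv.bv_len = len;  <-- wrong: the macro advances 'iter' by
           *                        bv.bv_len after the loop body, so
           *                        shrinking it here breaks the walk */
          done += copy_segment(&bv, len);      /* illustrative helper */
  }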

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Fixes: e87d66ab27 ("ublk: use rq_for_each_segment() for user copy")
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-12-09 10:20:18 -07:00
Linus Torvalds
4482ebb297 Merge tag 'block-6.19-20251208' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull block updates from Jens Axboe:
 "Followup set of fixes and updates for block for the 6.19 merge window.

  NVMe had some last-minute debates which led to dropping some patches
  from that tree, which is why the initial PR didn't have NVMe included.
  It's here now. This pull request contains:

   - NVMe pull request via Keith:
       - Subsystem usage cleanups (Max)
       - Endpoint device fixes (Shin'ichiro)
       - Debug statements (Gerd)
       - FC fabrics cleanups and fixes (Daniel)
       - Consistent alloc API usages (Israel)
       - Code comment updates (Chu)
       - Authentication retry fix (Justin)

   - Fix a memory leak in the discard ioctl code, if the task is being
     interrupted by a signal at just the wrong time

   - Zoned write plugging fixes

   - Add ioctls for persistent reservations

   - Enable per-cpu bio caching by default

   - Various little fixes and tweaks"

* tag 'block-6.19-20251208' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (27 commits)
  nvme-fabrics: add ENOKEY to no retry criteria for authentication failures
  nvme-auth: use kvfree() for memory allocated with kvcalloc()
  nvmet-tcp: use kvcalloc for commands array
  nvmet-rdma: use kvcalloc for commands and responses arrays
  nvme: fix typo error in nvme target
  nvmet-fc: use pr_* print macros instead of dev_*
  nvmet-fcloop: remove unused lsdir member.
  nvmet-fcloop: check all request and response have been processed
  nvme-fc: check all request and response have been processed
  block: fix memory leak in __blkdev_issue_zero_pages
  block: fix comment for op_is_zone_mgmt() to include RESET_ALL
  block: Clear BLK_ZONE_WPLUG_PLUGGED when aborting plugged BIOs
  blk-mq: Abort suspend when wakeup events are pending
  blk-mq: add blk_rq_nr_bvec() helper
  block: add IOC_PR_READ_RESERVATION ioctl
  block: add IOC_PR_READ_KEYS ioctl
  nvme: reject invalid pr_read_keys() num_keys values
  scsi: sd: reject invalid pr_read_keys() num_keys values
  block: enable per-cpu bio cache by default
  block: use bio_alloc_bioset for passthru IO by default
  ...
2025-12-09 08:53:24 +09:00
Caleb Sander Mateos
87213b0d84 ublk: allow non-blocking ctrl cmds in IO_URING_F_NONBLOCK issue
Handling most of the ublksrv_ctrl_cmd opcodes requires locking a mutex,
so ublk_ctrl_uring_cmd() bails out with EAGAIN when called with the
IO_URING_F_NONBLOCK issue flag. However, several opcodes can be handled
without blocking:
- UBLK_CMD_GET_QUEUE_AFFINITY
- UBLK_CMD_GET_DEV_INFO
- UBLK_CMD_GET_DEV_INFO2
- UBLK_U_CMD_GET_FEATURES

Handle these opcodes synchronously instead of returning EAGAIN so
io_uring doesn't need to issue the command via the worker thread pool.
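
A hedged sketch of the dispatch shape (the surrounding function body is
simplified):

  if (issue_flags & IO_URING_F_NONBLOCK) {
          switch (cmd_op) {
          case UBLK_CMD_GET_QUEUE_AFFINITY:
          case UBLK_CMD_GET_DEV_INFO:
          case UBLK_CMD_GET_DEV_INFO2:
          case UBLK_U_CMD_GET_FEATURES:
                  break;                  /* no mutex needed: handle inline */
          default:
                  return -EAGAIN;         /* blocking opcodes go to io-wq */
          }
  }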

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-12-08 13:32:30 -07:00
Linus Torvalds
7203ca412f Merge tag 'mm-stable-2025-12-03-21-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:

  "__vmalloc()/kvmalloc() and no-block support" (Uladzislau Rezki)
     Rework the vmalloc() code to support non-blocking allocations
     (GFP_ATOMIC, GFP_NOWAIT)

  "ksm: fix exec/fork inheritance" (xu xin)
     Fix a rare case where the KSM MMF_VM_MERGE_ANY prctl state is not
     inherited across fork/exec

  "mm/zswap: misc cleanup of code and documentations" (SeongJae Park)
     Some light maintenance work on the zswap code

  "mm/page_owner: add debugfs files 'show_handles' and 'show_stacks_handles'" (Mauricio Faria de Oliveira)
     Enhance the /sys/kernel/debug/page_owner debug feature by adding
     unique identifiers to differentiate the various stack traces so
     that userspace monitoring tools can better match stack traces over
     time

  "mm/page_alloc: pcp->batch cleanups" (Joshua Hahn)
     Minor alterations to the page allocator's per-cpu-pages feature

  "Improve UFFDIO_MOVE scalability by removing anon_vma lock" (Lokesh Gidra)
     Address a scalability issue in userfaultfd's UFFDIO_MOVE operation

  "kasan: cleanups for kasan_enabled() checks" (Sabyrzhan Tasbolatov)

  "drivers/base/node: fold node register and unregister functions" (Donet Tom)
     Clean up the NUMA node handling code a little

  "mm: some optimizations for prot numa" (Kefeng Wang)
     Cleanups and small optimizations to the NUMA allocation hinting
     code

  "mm/page_alloc: Batch callers of free_pcppages_bulk" (Joshua Hahn)
     Address long lock hold times at boot on large machines. These were
     causing (harmless) softlockup warnings

  "optimize the logic for handling dirty file folios during reclaim" (Baolin Wang)
     Remove some now-unnecessary work from page reclaim

  "mm/damon: allow DAMOS auto-tuned for per-memcg per-node memory usage" (SeongJae Park)
     Enhance the DAMOS auto-tuning feature

  "mm/damon: fixes for address alignment issues in DAMON_LRU_SORT and DAMON_RECLAIM" (Quanmin Yan)
     Fix DAMON_LRU_SORT and DAMON_RECLAIM with certain userspace
     configuration

  "expand mmap_prepare functionality, port more users" (Lorenzo Stoakes)
     Enhance the new(ish) file_operations.mmap_prepare() method and port
     additional callsites from the old ->mmap() over to ->mmap_prepare()

  "Fix stale IOTLB entries for kernel address space" (Lu Baolu)
     Fix a bug (and possible security issue on non-x86) in the IOMMU
     code. In some situations the IOMMU could be left hanging onto a
     stale kernel pagetable entry

  "mm/huge_memory: cleanup __split_unmapped_folio()" (Wei Yang)
     Clean up and optimize the folio splitting code

  "mm, swap: misc cleanup and bugfix" (Kairui Song)
     Some cleanups and a minor fix in the swap discard code

  "mm/damon: misc documentation fixups" (SeongJae Park)

  "mm/damon: support pin-point targets removal" (SeongJae Park)
     Permit userspace to remove a specific monitoring target in the
     middle of the current targets list

  "mm: MISC follow-up patches for linux/pgalloc.h" (Harry Yoo)
     A couple of cleanups related to mm header file inclusion

  "mm/swapfile.c: select swap devices of default priority round robin" (Baoquan He)
     Improve the selection of swap devices for NUMA machines

  "mm: Convert memory block states (MEM_*) macros to enums" (Israel Batista)
     Change the memory block labels from macros to enums so they will
     appear in kernel debug info

  "ksm: perform a range-walk to jump over holes in break_ksm" (Pedro Demarchi Gomes)
     Address an inefficiency when KSM unmerges an address range

  "mm/damon/tests: fix memory bugs in kunit tests" (SeongJae Park)
     Fix leaks and unhandled malloc() failures in DAMON userspace unit
     tests

  "some cleanups for pageout()" (Baolin Wang)
     Clean up a couple of minor things in the page scanner's
     writeback-for-eviction code

  "mm/hugetlb: refactor sysfs/sysctl interfaces" (Hui Zhu)
     Move hugetlb's sysfs/sysctl handling code into a new file

  "introduce VM_MAYBE_GUARD and make it sticky" (Lorenzo Stoakes)
     Make the VMA guard regions available in /proc/pid/smaps and
     improve the mergeability of guarded VMAs

  "mm: perform guard region install/remove under VMA lock" (Lorenzo Stoakes)
     Reduce mmap lock contention for callers performing VMA guard region
     operations

  "vma_start_write_killable" (Matthew Wilcox)
     Start work on permitting applications to be killed when they are
     waiting on a read_lock on the VMA lock

  "mm/damon/tests: add more tests for online parameters commit" (SeongJae Park)
     Add additional userspace testing of DAMON's "commit" feature

  "mm/damon: misc cleanups" (SeongJae Park)

  "make VM_SOFTDIRTY a sticky VMA flag" (Lorenzo Stoakes)
     Address the possible loss of a VMA's VM_SOFTDIRTY flag when that
     VMA is merged with another

  "mm: support device-private THP" (Balbir Singh)
     Introduce support for Transparent Huge Page (THP) migration in zone
     device-private memory

  "Optimize folio split in memory failure" (Zi Yan)

  "mm/huge_memory: Define split_type and consolidate split support checks" (Wei Yang)
     Some more cleanups in the folio splitting code

  "mm: remove is_swap_[pte, pmd]() + non-swap entries, introduce leaf entries" (Lorenzo Stoakes)
     Clean up our handling of pagetable leaf entries by introducing the
     concept of 'software leaf entries', of type softleaf_t

  "reparent the THP split queue" (Muchun Song)
     Reparent the THP split queue to its parent memcg. This is in
     preparation for addressing the long-standing "dying memcg" problem,
     wherein dead memcgs linger for too long, consuming memory
     resources

  "unify PMD scan results and remove redundant cleanup" (Wei Yang)
     A little cleanup in the hugepage collapse code

  "zram: introduce writeback bio batching" (Sergey Senozhatsky)
     Improve zram writeback efficiency by introducing batched bio
     writeback support

  "memcg: cleanup the memcg stats interfaces" (Shakeel Butt)
     Clean up our handling of the interrupt safety of some memcg stats

  "make vmalloc gfp flags usage more apparent" (Vishal Moola)
     Clean up vmalloc's handling of incoming GFP flags

  "mm: Add soft-dirty and uffd-wp support for RISC-V" (Chunyan Zhang)
     Teach soft dirty and userfaultfd write protect tracking to use
     RISC-V's Svrsw60t59b extension

  "mm: swap: small fixes and comment cleanups" (Youngjun Park)
     Fix a small bug and clean up some of the swap code

  "initial work on making VMA flags a bitmap" (Lorenzo Stoakes)
     Start work on converting the vma struct's flags to a bitmap, so we
     stop running out of them, especially on 32-bit

  "mm/swapfile: fix and cleanup swap list iterations" (Youngjun Park)
     Address a possible bug in the swap discard code and clean things
     up a little

[ This merge also reverts commit ebb9aeb980 ("vfio/nvgrace-gpu:
  register device memory for poison handling") because it looks
  broken to me, I've asked for clarification   - Linus ]

* tag 'mm-stable-2025-12-03-21-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (321 commits)
  mm: fix vma_start_write_killable() signal handling
  mm/swapfile: use plist_for_each_entry in __folio_throttle_swaprate
  mm/swapfile: fix list iteration when next node is removed during discard
  fs/proc/task_mmu.c: fix make_uffd_wp_huge_pte() huge pte handling
  mm/kfence: add reboot notifier to disable KFENCE on shutdown
  memcg: remove inc/dec_lruvec_kmem_state helpers
  selftests/mm/uffd: initialize char variable to Null
  mm: fix DEBUG_RODATA_TEST indentation in Kconfig
  mm: introduce VMA flags bitmap type
  tools/testing/vma: eliminate dependency on vma->__vm_flags
  mm: simplify and rename mm flags function for clarity
  mm: declare VMA flags by bit
  zram: fix a spelling mistake
  mm/page_alloc: optimize lowmem_reserve max lookup using its semantic monotonicity
  mm/vmscan: skip increasing kswapd_failures when reclaim was boosted
  pagemap: update BUDDY flag documentation
  mm: swap: remove scan_swap_map_slots() references from comments
  mm: swap: change swap_alloc_slow() to void
  mm, swap: remove redundant comment for read_swap_cache_async
  mm, swap: use SWP_SOLIDSTATE to determine if swap is rotational
  ...
2025-12-05 13:52:43 -08:00
Chaitanya Kulkarni
71075d25ca blk-mq: add blk_rq_nr_bvec() helper
Add a new helper function blk_rq_nr_bvec() that returns the number of
bvecs in a request. This count represents the number of iterations
rq_for_each_bvec() would perform on a request.

Drivers need to pre-allocate bvec arrays before iterating through
a request's bvecs. Currently, they manually count bvecs using
rq_for_each_bvec() in a loop, which is repetitive. The new helper
centralizes this logic.
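
A minimal sketch of such a helper (the in-tree version may differ in detail):

  static inline unsigned int blk_rq_nr_bvec(struct request *rq)
  {
          struct req_iterator iter;
          struct bio_vec bv;
          unsigned int nr_bvec = 0;

          rq_for_each_bvec(bv, rq, iter)
                  nr_bvec++;

          return nr_bvec;
  }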

This pattern exists in loop and zloop drivers, where multi-bio requests
require copying bvecs into a contiguous array before creating
an iov_iter for file operations.

Update loop and zloop drivers to use the new helper, eliminating
duplicate code.

This patch also provides a clear API to avoid any potential misuse of
blk_rq_nr_phys_segments() for calculating the bvecs, since one bvec can
have more than one segment and use of blk_rq_nr_phys_segments() can
lead to extra memory allocation:

[ 6155.673749] nullb_bio: 128K bio as ONE bvec: sector=0, size=131072
[ 6155.673846] null_blk: #### null_handle_data_transfer:1375
[ 6155.673850] null_blk: nr_bvec=1 blk_rq_nr_phys_segments=2
[ 6155.674263] null_blk: #### null_handle_data_transfer:1375
[ 6155.674267] null_blk: nr_bvec=1 blk_rq_nr_phys_segments=1

Reviewed-by: Niklas Cassel <cassel@kernel.org>
Signed-off-by: Chaitanya Kulkarni <ckulkarnilinux@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-12-04 07:19:26 -07:00
Linus Torvalds
cc25df3e2e Merge tag 'for-6.19/block-20251201' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull block updates from Jens Axboe:

 - Fix head insertion for mq-deadline, a regression from when priority
   support was added

 - Series simplifying and improving the ublk user copy code

 - Various ublk related cleanups

 - Fixup REQ_NOWAIT handling in loop/zloop, clearing NOWAIT when the
   request is punted to a thread for handling

 - Merge and then later revert loop dio nowait support, as it ended up
   causing excessive stack usage when the inline issue code needs to
   dip back into the full file system code

 - Improve auto integrity code, making it less deadlock prone

 - Speed up polled IO handling by manually managing the hctx lookups

 - Fixes for blk-throttle for SSD devices

 - Small series with fixes for the S390 dasd driver

 - Add support for caching zones, avoiding unnecessary report zone
   queries

 - MD pull requests via Yu:
      - fix null-ptr-dereference regression for dm-raid0
      - fix IO hang for raid5 when array is broken with IO inflight
      - remove legacy 1s delay to speed up system shutdown
      - change maintainer's email address
      - data can be lost if array is created with different lbs devices,
        fix this problem and record lbs of the array in metadata
      - fix rcu protection for md_thread
      - fix mddev kobject lifetime regression
      - enable atomic writes for md-linear
      - some cleanups

 - bcache updates via Coly
      - remove useless discard and cache device code
      - improve usage of per-cpu workqueues

 - Reorganize the IO scheduler switching code, fixing some lockdep
   reports as well

 - Improve the block layer P2P DMA support

 - Add support to the block tracing code for zoned devices

 - Segment calculation and memory alignment flexibility improvements

 - Set of prep and cleanup patches for ublk batching support. The
   actual batching hasn't been added yet, but this helps shrink down the
   workload of getting that patchset ready for 6.20

 - Fix for how the ps3 block driver handles segment offsets

 - Improve how block plugging handles batch tag allocations

 - nbd fixes for use-after-free of the configuration on device clear/put

 - Set of improvements and fixes for zloop

 - Add Damien as maintainer of the block zoned device code handling

 - Various other fixes and cleanups

* tag 'for-6.19/block-20251201' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (162 commits)
  block/rnbd: correct all kernel-doc complaints
  blk-mq: use queue_hctx in blk_mq_map_queue_type
  md: remove legacy 1s delay in md_notify_reboot
  md/raid5: fix IO hang when array is broken with IO inflight
  md: warn about updating super block failure
  md/raid0: fix NULL pointer dereference in create_strip_zones() for dm-raid
  sbitmap: fix all kernel-doc warnings
  ublk: add helper of __ublk_fetch()
  ublk: pass const pointer to ublk_queue_is_zoned()
  ublk: refactor auto buffer register in ublk_dispatch_req()
  ublk: add `union ublk_io_buf` with improved naming
  ublk: add parameter `struct io_uring_cmd *` to ublk_prep_auto_buf_reg()
  kfifo: add kfifo_alloc_node() helper for NUMA awareness
  blk-mq: fix potential uaf for 'queue_hw_ctx'
  blk-mq: use array manage hctx map instead of xarray
  ublk: prevent invalid access with DEBUG
  s390/dasd: Use scnprintf() instead of sprintf()
  s390/dasd: Move device name formatting into separate function
  s390/dasd: Remove unnecessary debugfs_create() return checks
  s390/dasd: Fix gendisk parent after copy pair swap
  ...
2025-12-03 19:26:18 -08:00
Linus Torvalds
0abcfd8983 Merge tag 'for-6.19/io_uring-20251201' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull io_uring updates from Jens Axboe:

 - Unify how task_work cancelations are detected, placing it in the
   task_work running state rather than needing to check the task state

 - Series cleaning up and moving the cancelation code to where it
   belongs, in cancel.c

 - Cleanup of waitid and futex argument handling

 - Add support for mixed sized SQEs. 6.18 added support for mixed sized
   CQEs, improving flexibility and efficiency of workloads that need big
   CQEs. This adds similar support for SQEs, where the occasional need
   for a 128b SQE doesn't necessitate having all SQEs be 128b in size

 - Introduce zcrx and SQ/CQ layout queries. The former returns what zcrx
   features are available. And both return the ring size information to
   help with allocation size calculation for user provided rings like
   IORING_SETUP_NO_MMAP and IORING_MEM_REGION_TYPE_USER

 - Zcrx updates for 6.19. It includes a bunch of small patches,
   IORING_REGISTER_ZCRX_CTRL, RQ flushing, and David's work on sharing
   zcrx between multiple io_uring instances

 - Series cleaning up ring initializations, notably deduplicating ring
   size and offset calculations. It also moves most of the checking
   before doing any allocations, making the code simpler

 - Add support for getsockname and getpeername, which is mostly a
   trivial hookup after a bit of refactoring on the networking side

 - Various fixes and cleanups

* tag 'for-6.19/io_uring-20251201' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (68 commits)
  io_uring: Introduce getsockname io_uring cmd
  socket: Split out a getsockname helper for io_uring
  socket: Unify getsockname and getpeername implementation
  io_uring/query: drop unused io_handle_query_entry() ctx arg
  io_uring/kbuf: remove obsolete buf_nr_pages and update comments
  io_uring/register: use correct location for io_rings_layout
  io_uring/zcrx: share an ifq between rings
  io_uring/zcrx: add io_fill_zcrx_offsets()
  io_uring/zcrx: export zcrx via a file
  io_uring/zcrx: move io_zcrx_scrub() and dependencies up
  io_uring/zcrx: count zcrx users
  io_uring/zcrx: add sync refill queue flushing
  io_uring/zcrx: introduce IORING_REGISTER_ZCRX_CTRL
  io_uring/zcrx: elide passing msg flags
  io_uring/zcrx: use folio_nr_pages() instead of shift operation
  io_uring/zcrx: convert to use netmem_desc
  io_uring/query: introduce rings info query
  io_uring/query: introduce zcrx query
  io_uring: move cq/sq user offset init around
  io_uring: pre-calculate scq layout
  ...
2025-12-03 18:58:57 -08:00
Linus Torvalds
8f7aa3d3c7 Merge tag 'net-next-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Jakub Kicinski:
 "Core & protocols:

   - Replace busylock at the Tx queuing layer with a lockless list.

     Resulting in a 300% (4x) improvement on heavy TX workloads, sending
     twice the number of packets per second, for half the cpu cycles.

   - Allow constantly busy flows to migrate to a more suitable CPU/NIC
     queue.

     Normally we perform queue re-selection when flow comes out of idle,
     but under extreme circumstances the flows may be constantly busy.

     Add sysctl to allow periodic rehashing even if it'd risk packet
     reordering.

   - Optimize the NAPI skb cache, make it larger, use it in more paths.

   - Attempt returning Tx skbs to the originating CPU (like we already
     did for Rx skbs).

   - Various data structure layout and prefetch optimizations from Eric.

   - Remove ktime_get() from the recvmsg() fast path, ktime_get() is
     sadly quite expensive on recent AMD machines.

   - Extend threaded NAPI polling to allow the kthread to busy poll for
     packets.

   - Make MPTCP use Rx backlog processing. This lowers the lock
     pressure, improving the Rx performance.

   - Support memcg accounting of MPTCP socket memory.

   - Allow admin to opt sockets out of global protocol memory accounting
     (using a sysctl or BPF-based policy). The global limits are a poor
     fit for modern container workloads, where limits are imposed using
     cgroups.

   - Improve heuristics for when to kick off AF_UNIX garbage collection.

   - Allow users to control TCP SACK compression, and default to 33% of
     RTT.

   - Add tcp_rcvbuf_low_rtt sysctl to let datacenter users avoid
     unnecessarily aggressive rcvbuf growth and overshoot when the
     connection RTT is low.

   - Preserve skb metadata space across skb_push / skb_pull operations.

   - Support for IPIP encapsulation in the nftables flowtable offload.

   - Support appending IP interface information to ICMP messages (RFC
     5837).

   - Support setting max record size in TLS (RFC 8449).

   - Remove taking rtnl_lock from RTM_GETNEIGHTBL and RTM_SETNEIGHTBL.

   - Use a dedicated lock (and RCU) in MPLS, instead of rtnl_lock.

   - Let users configure the number of write buffers in SMC.

   - Add new struct sockaddr_unsized for sockaddr of unknown length,
     from Kees.

   - Some conversions away from the crypto_ahash API, from Eric Biggers.

   - Some preparations for slimming down struct page.

   - YAML Netlink protocol spec for WireGuard.

   - Add a tool on top of YAML Netlink specs/lib for reporting commonly
     computed derived statistics and summarized system state.

  Driver API:

   - Add CAN XL support to the CAN Netlink interface.

   - Add uAPI for reporting PHY Mean Square Error (MSE) diagnostics, as
     defined by the OPEN Alliance's "Advanced diagnostic features for
     100BASE-T1 automotive Ethernet PHYs" specification.

   - Add DPLL phase-adjust-gran pin attribute (and implement it in
     zl3073x).

   - Refactor xfrm_input lock to reduce contention when NIC offloads
     IPsec and performs RSS.

   - Add info to devlink params whether the current setting is the
     default or a user override. Allow resetting back to default.

   - Add standard device stats for PSP crypto offload.

   - Leverage DSA frame broadcast to implement simple HSR frame
     duplication for a lot of switches without dedicated HSR offload.

   - Add uAPI defines for 1.6Tbps link modes.

  Device drivers:

   - Add Motorcomm YT921x gigabit Ethernet switch support.

   - Add MUCSE driver for N500/N210 1GbE NIC series.

   - Convert drivers to support dedicated ops for timestamping control,
     and away from the direct IOCTL handling. While at it support GET
     operations for PHY timestamping.

   - Add (and convert most drivers to) a dedicated ethtool callback for
     reading the Rx ring count.

   - Significant refactoring efforts in the STMMAC driver, which
     supports Synopsys turn-key MAC IP integrated into a ton of SoCs.

   - Ethernet high-speed NICs:
      - Broadcom (bnxt):
         - support PPS in/out on all pins
      - Intel (100G, ice, idpf):
         - ice: implement standard ethtool and timestamping stats
         - i40e: support setting the max number of MAC addresses per VF
         - iavf: support RSS of GTP tunnels for 5G and LTE deployments
      - nVidia/Mellanox (mlx5):
         - reduce downtime on interface reconfiguration
         - disable being an XDP redirect target by default (same as
           other drivers) to avoid wasting resources if feature is
           unused
      - Meta (fbnic):
         - add support for Linux-managed PCS on 25G, 50G, and 100G links
      - Wangxun:
         - support Rx descriptor merge, and Tx head writeback
         - support Rx coalescing offload
         - support 25G SFP and 40G QSFP modules

   - Ethernet virtual:
      - Google (gve):
         - allow ethtool to configure rx_buf_len
         - implement XDP HW RX Timestamping support for DQ descriptor
           format
      - Microsoft vNIC (mana):
         - support HW link state events
         - handle hardware recovery events when probing the device

   - Ethernet NICs consumer, and embedded:
      - usbnet: add support for Byte Queue Limits (BQL)
      - AMD (amd-xgbe):
         - add device selftests
      - NXP (enetc):
         - add i.MX94 support
      - Broadcom integrated MACs (bcmgenet, bcmasp):
         - bcmasp: add support for PHY-based Wake-on-LAN
      - Broadcom switches (b53):
         - support port isolation
         - support BCM5389/97/98 and BCM63XX ARL formats
      - Lantiq/MaxLinear switches:
         - support bridge FDB entries on the CPU port
         - use regmap for register access
         - allow user to enable/disable learning
         - support Energy Efficient Ethernet
         - support configuring RMII clock delays
         - add tagging driver for MaxLinear GSW1xx switches
      - Synopsys (stmmac):
         - support using the HW clock in free running mode
         - add Eswin EIC7700 support
         - add Rockchip RK3506 support
         - add Altera Agilex5 support
      - Cadence (macb):
         - cleanup and consolidate descriptor and DMA address handling
         - add EyeQ5 support
      - TI:
         - icssg-prueth: support AF_XDP
      - Airoha access points:
         - add missing Ethernet stats and link state callback
         - add AN7583 support
         - support out-of-order Tx completion processing
      - Power over Ethernet:
         - pd692x0: preserve PSE configuration across reboots
         - add support for TPS23881B devices

   - Ethernet PHYs:
      - Open Alliance OATC14 10BASE-T1S PHY cable diagnostic support
      - Support 50G SerDes and 100G interfaces in Linux-managed PHYs
      - micrel:
         - support for non PTP SKUs of lan8814
         - enable in-band auto-negotiation on lan8814
      - realtek:
         - cable testing support on RTL8224
         - interrupt support on RTL8221B
      - motorcomm: support for PHY LEDs on YT853
      - microchip: support for LAN867X Rev.D0 PHYs w/ SQI and cable diag
      - mscc: support for PHY LED control

   - CAN drivers:
      - m_can: add support for optional reset and system wake up
      - remove can_change_mtu() obsoleted by core handling
      - mcp251xfd: support GPIO controller functionality

   - Bluetooth:
      - add initial support for PASTa

   - WiFi:
      - split ieee80211.h file, it's way too big
      - improvements in VHT radiotap reporting, S1G, Channel Switch
        Announcement handling, rate tracking in mesh networks
      - improve multi-radio monitor mode support, and add a cfg80211
        debugfs interface for it
      - HT action frame handling on 6 GHz
      - initial chanctx work towards NAN
      - MU-MIMO sniffer improvements

   - WiFi drivers:
      - RealTek (rtw89):
         - support USB devices RTL8852AU and RTL8852CU
         - initial work for RTL8922DE
         - improved injection support
      - Intel:
         - iwlwifi: new sniffer API support
      - MediaTek (mt76):
         - WED support for >32-bit DMA
         - airoha NPU support
         - regdomain improvements
         - continued WiFi7/MLO work
      - Qualcomm/Atheros:
         - ath10k: factory test support
         - ath11k: TX power insertion support
         - ath12k: BSS color change support
         - ath12k: statistics improvements
      - brcmfmac: Acer A1 840 tablet quirk
      - rtl8xxxu: 40 MHz connection fixes/support"

* tag 'net-next-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1381 commits)
  net: page_pool: sanitise allocation order
  net: page pool: xa init with destroy on pp init
  net/mlx5e: Support XDP target xmit with dummy program
  net/mlx5e: Update XDP features in switch channels
  selftests/tc-testing: Test CAKE scheduler when enqueue drops packets
  net/sched: sch_cake: Fix incorrect qlen reduction in cake_drop
  wireguard: netlink: generate netlink code
  wireguard: uapi: generate header with ynl-gen
  wireguard: uapi: move flag enums
  wireguard: uapi: move enum wg_cmd
  wireguard: netlink: add YNL specification
  selftests: drv-net: Fix tolerance calculation in devlink_rate_tc_bw.py
  selftests: drv-net: Fix and clarify TC bandwidth split in devlink_rate_tc_bw.py
  selftests: drv-net: Set shell=True for sysfs writes in devlink_rate_tc_bw.py
  selftests: drv-net: Use Iperf3Runner in devlink_rate_tc_bw.py
  selftests: drv-net: introduce Iperf3Runner for measurement use cases
  selftests: drv-net: Add devlink_rate_tc_bw.py to TEST_PROGS
  net: ps3_gelic_net: Use napi_alloc_skb() and napi_gro_receive()
  Documentation: net: dsa: mention simple HSR offload helpers
  Documentation: net: dsa: mention availability of RedBox
  ...
2025-12-03 17:24:33 -08:00
Linus Torvalds
784faa8eca Merge tag 'rust-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/ojeda/linux
Pull Rust updates from Miguel Ojeda:
 "Toolchain and infrastructure:

   - Add support for 'syn'.

     Syn is a parsing library for parsing a stream of Rust tokens into a
     syntax tree of Rust source code.

     Currently this library is geared toward use in Rust procedural
     macros, but contains some APIs that may be useful more generally.

     'syn' allows us to greatly simplify writing complex macros such as
     'pin-init' (Benno has already prepared the 'syn'-based version). We
     will use it in the 'macros' crate too.

     'syn' is the most downloaded Rust crate (according to crates.io),
     and it is also used by the Rust compiler itself. While the amount
     of code is substantial, there should not be many updates needed for
     these crates, and even if there are, they should not be too big,
     e.g. +7k -3k lines across the 3 crates in the last year.

     'syn' requires two smaller dependencies: 'quote' and 'proc-macro2'.
     I only modified their code to remove a third dependency
     ('unicode-ident') and to add the SPDX identifiers. The code can be
     easily verified to exactly match upstream with the provided
     scripts.

     They are all licensed under "Apache-2.0 OR MIT", like the other
     vendored 'alloc' crate we had for a while.

     Please see the merge commit with the cover letter for more context.

   - Allow 'unreachable_pub' and 'clippy::disallowed_names' for
     doctests.

     Examples (i.e. doctests) may want to do things like show public
     items and use names such as 'foo'.

     Nevertheless, we still try to keep examples as close to real code
     as possible (this is part of why running Clippy on doctests is
     important for us, e.g. for safety comments, which userspace Rust
     does not support yet but we are stricter).

  'kernel' crate:

   - Replace our custom 'CStr' type with 'core::ffi::CStr'.

     Using the standard library type reduces our custom code footprint,
     and we retain needed custom functionality through an extension
     trait and a new 'fmt!' macro which replaces the previous 'core'
     import.

     This started in 6.17 and continued in 6.18, and we finally land the
     replacement now. This required quite some stamina from Tamir, who
     split the changes in steps to prepare for the flag day change here.

   - Replace 'kernel::c_str!' with C string literals.

     C string literals were added in Rust 1.77, which produce '&CStr's
     (the 'core' one), so now we can write:

         c"hi"

     instead of:

         c_str!("hi")

   - Add 'num' module for numerical features.

     It includes the 'Integer' trait, implemented for all primitive
     integer types.

     It also includes the 'Bounded' integer wrapping type: an integer
     value that requires only the 'N' least significant bits of the
     wrapped type to be encoded:

         // An unsigned 8-bit integer, of which only the 4 LSBs are used.
          let v = Bounded::<u8, 4>::new::<15>();
         assert_eq!(v.get(), 15);

     'Bounded' is useful to e.g. enforce guarantees when working with
     bitfields that have an arbitrary number of bits.

     Values can also be constructed from simple non-constant expressions
     or, for more complex ones, validated at runtime.

     'Bounded' also comes with comparison and arithmetic operations
     (with both their backing type and other 'Bounded's with a
     compatible backing type), casts to change the backing type,
     extending/shrinking and infallible/fallible conversions from/to
     primitives as applicable.

   - 'rbtree' module: add immutable cursor ('Cursor').

     It enables using just an immutable tree reference where
     appropriate. The existing fully-featured mutable cursor is renamed
     to 'CursorMut'.

  kallsyms:

   - Fix wrong "big" kernel symbol type read from procfs.

  'pin-init' crate:

   - A couple minor fixes (Benno asked me to pick these patches up for
     him this cycle).

  Documentation:

   - Quick Start guide: add Debian 13 (Trixie).

     Debian Stable is now able to build Linux, since Debian 13 (released
     2025-08-09) packages Rust 1.85.0, which is recent enough.

     We are planning to propose that the minimum supported Rust version
     in Linux follows Debian Stable releases, with Debian 13 being the
     first one we upgrade to, i.e. Rust 1.85.

  MAINTAINERS:

   - Add entry for the new 'num' module.

   - Remove Alex as Rust maintainer: he hasn't had the time to
     contribute for a few years now, so it is a no-op change in
     practice.

  And a few other cleanups and improvements"

* tag 'rust-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/ojeda/linux: (53 commits)
  rust: macros: support `proc-macro2`, `quote` and `syn`
  rust: syn: enable support in kbuild
  rust: syn: add `README.md`
  rust: syn: remove `unicode-ident` dependency
  rust: syn: add SPDX License Identifiers
  rust: syn: import crate
  rust: quote: enable support in kbuild
  rust: quote: add `README.md`
  rust: quote: add SPDX License Identifiers
  rust: quote: import crate
  rust: proc-macro2: enable support in kbuild
  rust: proc-macro2: add `README.md`
  rust: proc-macro2: remove `unicode_ident` dependency
  rust: proc-macro2: add SPDX License Identifiers
  rust: proc-macro2: import crate
  rust: kbuild: support using libraries in `rustc_procmacro`
  rust: kbuild: support skipping flags in `rustc_test_library`
  rust: kbuild: add proc macro library support
  rust: kbuild: simplify `--cfg` handling
  rust: kbuild: introduce `core-flags` and `core-skip_flags`
  ...
2025-12-03 14:16:49 -08:00
Linus Torvalds
1d18101a64 Merge tag 'kernel-6.19-rc1.cred' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull cred guard updates from Christian Brauner:
 "This contains substantial credential infrastructure improvements
  adding guard-based credential management that simplifies code and
  eliminates manual reference counting in many subsystems.

  Features:

   - Kernel Credential Guards

     Add with_kernel_creds() and scoped_with_kernel_creds() guards that
     allow using the kernel credentials without allocating and copying
     them. This was requested by Linus after seeing repeated
     prepare_kernel_cred() calls that duplicate the kernel credentials
     only to drop them again later.

     The new guards completely avoid the allocation and never expose the
     temporary variable to hold the kernel credentials anywhere in
     callers.

   - Generic Credential Guards

     Add scoped_with_creds() guards for the common override_creds() and
     revert_creds() pattern. This builds on earlier work that made
     override_creds()/revert_creds() completely reference count free.

   - Prepare Credential Guards

     Add prepare credential guards for the more complex pattern of
     preparing a new set of credentials and overriding the current
     credentials with them:
      - prepare_creds()
      - modify new creds
      - override_creds()
      - revert_creds()
      - put_cred()

  Cleanups:

   - Make init_cred static since it should not be directly accessed

   - Add kernel_cred() helper to properly access the kernel credentials

   - Fix scoped_class() macro that was introduced two cycles ago

   - coredump: split out do_coredump() from vfs_coredump() for cleaner
     credential handling

   - coredump: move revert_cred() before coredump_cleanup()

   - coredump: mark struct mm_struct as const

   - coredump: pass struct linux_binfmt as const

   - sev-dev: use guard for path"

* tag 'kernel-6.19-rc1.cred' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (36 commits)
  trace: use override credential guard
  trace: use prepare credential guard
  coredump: use override credential guard
  coredump: use prepare credential guard
  coredump: split out do_coredump() from vfs_coredump()
  coredump: mark struct mm_struct as const
  coredump: pass struct linux_binfmt as const
  coredump: move revert_cred() before coredump_cleanup()
  sev-dev: use override credential guards
  sev-dev: use prepare credential guard
  sev-dev: use guard for path
  cred: add prepare credential guard
  net/dns_resolver: use credential guards in dns_query()
  cgroup: use credential guards in cgroup_attach_permissions()
  act: use credential guards in acct_write_process()
  smb: use credential guards in cifs_get_spnego_key()
  nfs: use credential guards in nfs_idmap_get_key()
  nfs: use credential guards in nfs_local_call_write()
  nfs: use credential guards in nfs_local_call_read()
  erofs: use credential guards
  ...
2025-12-01 13:45:41 -08:00
Randy Dunlap
d211a28035 block/rnbd: correct all kernel-doc complaints
Fix all kernel-doc warnings in rnbd-proto.h:
- use correct enum name in kdoc comment
- mark several struct members as "/* private: */" so that no kdoc is
  required for them (a sketch follows this list)
- don't use "/**" for a non-kernel-doc comment
- use the correct struct member name for "dev_name"
- use " *" for a blank kernel-doc line

Fixes these warnings:
Warning: drivers/block/rnbd/rnbd-proto.h:41 expecting prototype for
 enum rnbd_msg_types. Prototype was for enum rnbd_msg_type instead
Warning: drivers/block/rnbd/rnbd-proto.h:50 struct member '__padding'
 not described in 'rnbd_msg_hdr'
Warning: drivers/block/rnbd/rnbd-proto.h:53 This comment starts with
 '/**', but isn't a kernel-doc comment.
 * We allow to map RO many times and RW only once. We allow to map yet another
Warning: drivers/block/rnbd/rnbd-proto.h:81 struct member 'reserved'
 not described in 'rnbd_msg_sess_info'
Warning: drivers/block/rnbd/rnbd-proto.h:92 struct member 'reserved'
 not described in 'rnbd_msg_sess_info_rsp'
Warning: drivers/block/rnbd/rnbd-proto.h:107 struct member 'resv1'
 not described in 'rnbd_msg_open'
Warning: drivers/block/rnbd/rnbd-proto.h:107 struct member 'dev_name'
 not described in 'rnbd_msg_open'
Warning: drivers/block/rnbd/rnbd-proto.h:107 struct member 'reserved'
 not described in 'rnbd_msg_open'
Warning: drivers/block/rnbd/rnbd-proto.h:158 struct member 'reserved'
 not described in 'rnbd_msg_open_rsp'
Warning: drivers/block/rnbd/rnbd-proto.h:189 bad line:

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Acked-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-12-01 07:19:50 -07:00
Chu Guangqing
8f4338b114 zram: fix a spelling mistake
The spelling of the word "relases" is incorrect; it should be "releases".

Link: https://lkml.kernel.org/r/20251125020522.1913-1-chuguangqing@inspur.com
Signed-off-by: Chu Guangqing <chuguangqing@inspur.com>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-29 10:41:08 -08:00
Ming Lei
28d7a371f0 ublk: add helper of __ublk_fetch()
Add helper __ublk_fetch() for refactoring ublk_fetch().

Meantime move ublk_config_io_buf() out of __ublk_fetch() to make
the code structure cleaner.

Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-28 09:20:13 -07:00
Ming Lei
3443bab2f8 ublk: pass const pointer to ublk_queue_is_zoned()
Pass const pointer to ublk_queue_is_zoned() because it is readonly.

Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-28 09:20:13 -07:00
Ming Lei
0a9beafa7c ublk: refactor auto buffer register in ublk_dispatch_req()
Refactor the auto buffer register code and prepare for supporting the batch IO
feature. The main motivation is to put the 'ublk_io' operation code
together, so that a per-io lock can be applied to the code block.

The key changes are:
- Rename ublk_auto_buf_reg() as ublk_do_auto_buf_reg()
- Introduce an enum `auto_buf_reg_res` to represent the result of
  the buffer registration attempt (FAIL, FALLBACK, OK), sketched below.
- Split the existing `ublk_do_auto_buf_reg` function into two:
  - `__ublk_do_auto_buf_reg`: Performs the actual buffer registration
    and returns the `auto_buf_reg_res` status.
  - `ublk_do_auto_buf_reg`: A wrapper that calls the internal function
    and handles the I/O preparation based on the result.
- Introduce `ublk_prep_auto_buf_reg_io` to encapsulate the logic for
  preparing the I/O for completion after buffer registration.
- Pass the `tag` directly to `ublk_auto_buf_reg_fallback` to avoid
  recalculating it.

This refactoring makes the control flow clearer and isolates the different
stages of the auto buffer registration process.
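
A hedged sketch of the result enum described above (the exact enumerator
spellings are assumptions):

  enum auto_buf_reg_res {
          AUTO_BUF_REG_FAIL,          /* registration failed, fail the io */
          AUTO_BUF_REG_FALLBACK,      /* registration failed, fall back to copy */
          AUTO_BUF_REG_OK,            /* buffer registered successfully */
  };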

Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-28 09:20:13 -07:00
Ming Lei
8d61ece156 ublk: add union ublk_io_buf with improved naming
Add `union ublk_io_buf` for naming the anonymous union of struct ublk_io's
addr and buf fields, meantime apply it to `struct ublk_io` for storing either
ublk auto buffer register data or ublk server io buffer address.

The union uses clear field names:
- `addr`: for regular ublk server io buffer addresses
- `auto_reg`: for ublk auto buffer registration data

This eliminates confusing access patterns and improves code readability.
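
A hedged sketch of the union's shape (member types are assumed from context):

  union ublk_io_buf {
          __u64 addr;                          /* ublk server io buffer address */
          struct ublk_auto_buf_reg auto_reg;   /* auto buffer register data */
  };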

Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-28 09:20:13 -07:00
Ming Lei
3035b9b46b ublk: add parameter struct io_uring_cmd * to ublk_prep_auto_buf_reg()
Add parameter `struct io_uring_cmd *` to ublk_prep_auto_buf_reg() and
prepare for reusing this helper for the coming UBLK_BATCH_IO feature,
which can fetch & commit one batch of io commands via single uring_cmd.

Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-28 09:20:13 -07:00
Kevin Brodsky
c6a45ee760 ublk: prevent invalid access with DEBUG
ublk_ch_uring_cmd_local() may jump to the out label before
initialising the io pointer. This will cause trouble if DEBUG is
defined, because the pr_devel() call dereferences io. Clang reports:

drivers/block/ublk_drv.c:2403:6: error: variable 'io' is used uninitialized whenever 'if' condition is true [-Werror,-Wsometimes-uninitialized]
 2403 |         if (tag >= ub->dev_info.queue_depth)
      |             ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
drivers/block/ublk_drv.c:2492:32: note: uninitialized use occurs here
 2492 |                         __func__, cmd_op, tag, ret, io->flags);
      |

Fix this by initialising io to NULL and checking it before
dereferencing it.
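
A minimal sketch of the shape of the fix (surrounding code elided):

  struct ublk_io *io = NULL;    /* was uninitialised on early-exit paths */

  /* ... validation that may 'goto out' before io is assigned ... */

  out:
          if (io)               /* only dereference once it has been set */
                  pr_devel("%s: io flags %x\n", __func__, io->flags);
          return ret;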

Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
Fixes: 71f28f3136 ("ublk_drv: add io_uring based userspace block driver")
Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-26 10:59:18 -07:00
Jens Axboe
96f03c8cb2 Revert "Merge branch 'loop-aio-nowait' into for-6.19/block"
This reverts commit f43fdeb9a3, reversing
changes made to 2c6d792d4b.

There are concerns that doing inline submits can cause excessive
stack usage, particularly when going back into the filesystem. Revert
the loop dio nowait change for now.

Link: https://lore.kernel.org/linux-block/aSP3SG_KaROJTBHx@infradead.org/
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-24 20:53:19 -07:00
Yuwen Chen
04d31610a7 zram: fix the issue that the write-back limits might overflow
When the page size exceeds 4KB, if bd_wb_limit is set to a value that is
not aligned with the page size, it will cause a numerical wrap-around
issue for bd_wb_limit.  For example, when the page size is set to 16KB and
bd_wb_limit is set to 3, after one write-back operation, the value of
bd_wb_limit will become -1.  More seriously, since bd_wb_limit is an
unsigned number, its value may become as large as 2^64 - 1.

The core reason for this problem is that the unit of bd_wb_limit is 4KB. 
For example, when a write-back occurs on a system with a page size of
16KB, 4 needs to be subtracted from bd_wb_limit.  This operation takes
place in the zram_account_writeback_submit function.

This patch fixes the issue by limiting bd_wb_limit to be an integer
multiple of PAGE_SIZE / 4096.
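
A hedged sketch of the constraint (macro and store-handler details are
illustrative):

  /* bd_wb_limit is accounted in 4KB units; writing back one page consumes
   * PAGE_SIZE / 4096 units, so keep the configured limit a multiple of that. */
  #define WB_UNITS_PER_PAGE       (PAGE_SIZE / 4096)

  zram->bd_wb_limit = rounddown(val, WB_UNITS_PER_PAGE);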

Link: https://lkml.kernel.org/r/tencent_5936CFE72BAB2BA76887BB69DCC1B5E67C05@qq.com
Fixes: 1d69a3f8ae ("zram: idle writeback fixes and cleanup")
Signed-off-by: Yuwen Chen <ywen.chen@foxmail.com>
Acked-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Brian Geffon <bgeffon@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Richard Chang <richardycc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-24 15:08:55 -08:00
Sergey Senozhatsky
1b1a4e4d67 zram: read slot block idx under slot lock
Read the slot's block id under the slot lock.  We release the slot lock for the
bdev read so, technically, the slot can still get freed in the meantime, but at
least we will read the bdev block (page) that holds the previously known slot
data, not the slot->handle bdev block, which can be anything at that point.

Link: https://lkml.kernel.org/r/20251122074029.3948921-7-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Brian Geffon <bgeffon@google.com>
Cc: Minchan Kim <minchan@google.com>
Cc: Richard Chang <richardycc@google.com>
Cc: Yuwen Chen <ywen.chen@foxmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-24 15:08:53 -08:00
Sergey Senozhatsky
e87ddea345 zram: rework bdev block allocation
First, writeback bdev ->bitmap bits are set only from one context, as we
can have only a single task performing writeback, so we cannot race with
anything else.  Remove the retry path.

Second, we always check the ZRAM_WB flag to distinguish written-back
slots, so we should not confuse a 0 bdev block index with a 0 handle.  We
can use the first bdev block (bit 0) for writeback as well.

While at it, give the functions slightly more accurate names: we don't
alloc/free anything there, we reserve a block for async writeback or
release the block.

Link: https://lkml.kernel.org/r/20251122074029.3948921-6-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Reviewed-by: Brian Geffon <bgeffon@google.com>
Cc: Minchan Kim <minchan@google.com>
Cc: Richard Chang <richardycc@google.com>
Cc: Yuwen Chen <ywen.chen@foxmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-24 15:08:53 -08:00
Sergey Senozhatsky
a4f506c569 zram: drop wb_limit_lock
We don't need wb_limit_lock.  Writeback limit setters take the zram
init_lock for exclusive (write) access, while wb_limit modifications
happen only from a single task and under the zram init_lock taken for
read.  No concurrent wb_limit modifications are possible (we permit only
one post-processing task at a
time).  Add lockdep assertions to wb_limit mutators.
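
A sketch of the kind of assertion added, with a hypothetical mutator
name; lockdep_assert_held() is satisfied whether init_lock is held for
read (writeback path) or for write (sysfs setters):

```
static void zram_wb_limit_dec_sketch(struct zram *zram, u64 units)
{
        /* holder of init_lock, read or write, is sufficient for lockdep */
        lockdep_assert_held(&zram->init_lock);

        if (zram->bd_wb_limit >= units)
                zram->bd_wb_limit -= units;
        else
                zram->bd_wb_limit = 0;
}
```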

While at it, fixup coding styles.

Link: https://lkml.kernel.org/r/20251122074029.3948921-5-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Reviewed-by: Brian Geffon <bgeffon@google.com>
Cc: Minchan Kim <minchan@google.com>
Cc: Richard Chang <richardycc@google.com>
Cc: Yuwen Chen <ywen.chen@foxmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-24 15:08:53 -08:00
Sergey Senozhatsky
7c929664fd zram: take write lock in wb limit store handlers
The wb limit store (write) device attribute handlers should take the
zram init_lock for writing.  While at it, fix up coding styles.

Link: https://lkml.kernel.org/r/20251122074029.3948921-4-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Reviewed-by: Brian Geffon <bgeffon@google.com>
Cc: Minchan Kim <minchan@google.com>
Cc: Richard Chang <richardycc@google.com>
Cc: Yuwen Chen <ywen.chen@foxmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-24 15:08:53 -08:00
Sergey Senozhatsky
e828cccb72 zram: add writeback batch size device attr
Introduce writeback_batch_size device attribute so that the maximum number
of in-flight writeback bio requests can be configured at run-time
per-device.  This essentially enables batched bio writeback.

Link: https://lkml.kernel.org/r/20251122074029.3948921-3-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Reviewed-by: Brian Geffon <bgeffon@google.com>
Cc: Minchan Kim <minchan@google.com>
Cc: Richard Chang <richardycc@google.com>
Cc: Yuwen Chen <ywen.chen@foxmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-24 15:08:53 -08:00
Sergey Senozhatsky
f405066a1f zram: introduce writeback bio batching
Patch series "zram: introduce writeback bio batching", v6.

As writeback is becoming more and more common, the longstanding
limitations of zram writeback throughput are becoming more visible.
Introduce writeback bio batching so that multiple writeback bios can be
processed simultaneously.


This patch (of 6):

As was stated in a comment [1], single-page writeback IO is not
efficient, but it works.  It's time to address this throughput limitation
as writeback becomes used more often.  Introduce batched (multiple) bio
writeback support to take advantage of parallel request processing and
better request scheduling.

The approach used in this patch doesn't use a dedicated kthread like in
[2], or blk-plug like in [3].  A dedicated kthread adds complexity, which
can be avoided.  Apart from that, not all zram setups use writeback, so
having numerous per-device kthreads (on systems that create multiple zram
devices) hanging around is not the most optimal thing to do.  blk-plug, on
the other hand, works best when requests are sequential, which doesn't
particularly fit zram writeback IO patterns: zram writeback IO patterns
are expected to be random, due to how bdev block reservation/release are
handled.  The blk-plug approach also works in cycles: idle IO, while zram
sets up requests in a batch, is followed by bursts of IO, when zram
submits the entire batch.

Instead we use a batch of requests and submit a new bio as soon as one of
the in-flight requests completes.
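
A minimal sketch of the batching idea under assumed names (the real zram
code keeps its own per-request bookkeeping); it only illustrates
refilling the in-flight window from the completion handler instead of
waiting for each bio in turn:

```
#include <linux/bio.h>
#include <linux/semaphore.h>

#define WB_BATCH_SIZE   32      /* fixed batch size introduced by this patch */

struct wb_batch {
        struct semaphore slots; /* counts free in-flight slots */
};

static void wb_batch_init(struct wb_batch *b)
{
        sema_init(&b->slots, WB_BATCH_SIZE);
}

static void wb_end_io(struct bio *bio)
{
        struct wb_batch *b = bio->bi_private;

        up(&b->slots);          /* a completion frees a slot for the next bio */
        bio_put(bio);
}

static void wb_submit(struct wb_batch *b, struct bio *bio)
{
        down(&b->slots);        /* blocks only when WB_BATCH_SIZE bios are in flight */
        bio->bi_private = b;
        bio->bi_end_io = wb_end_io;
        submit_bio(bio);
}
```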

For the time being, the writeback batch size (maximum number of
in-flight bio requests) is set to 32 for all devices.  A follow-up patch
adds a writeback_batch_size device attribute, so the batch size becomes
run-time configurable.

Link: https://lkml.kernel.org/r/20251122074029.3948921-1-senozhatsky@chromium.org
Link: https://lkml.kernel.org/r/20251122074029.3948921-2-senozhatsky@chromium.org
Link: https://lore.kernel.org/all/20181203024045.153534-6-minchan@kernel.org/ [1]
Link: https://lore.kernel.org/all/20250731064949.1690732-1-richardycc@google.com/ [2]
Link: https://lore.kernel.org/all/tencent_78FC2C4FE16BA1EBAF0897DB60FCD675ED05@qq.com/ [3]
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Co-developed-by: Yuwen Chen <ywen.chen@foxmail.com>
Co-developed-by: Richard Chang <richardycc@google.com>
Suggested-by: Minchan Kim <minchan@google.com>
Cc: Brian Geffon <bgeffon@google.com>
Cc: Richard Chang <richardycc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-24 15:08:52 -08:00
Chaitanya Kulkarni
e8f0abdd49 zloop: clear nowait flag in workqueue context
The zloop driver advertises REQ_NOWAIT support through BLK_FEAT_NOWAIT
(enabled by default for all blk-mq devices), and honors the nowait
behavior throughout zloop_queue_rq().

However, actual I/O to the backing file is performed in a workqueue,
where blocking is allowed.

To avoid imposing unnecessary non-blocking constraints in this blocking
context, clear the REQ_NOWAIT flag before processing the request in the
workqueue context.
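
A minimal sketch of the change, assuming the work handler has the
request at hand; everything but REQ_NOWAIT itself is illustrative:

```
#include <linux/blk-mq.h>

/*
 * Sketch only: the surrounding work handler is omitted. Blocking is
 * allowed here, so the NOWAIT hint is dropped before doing file I/O.
 */
static void zloop_clear_nowait_sketch(struct request *rq)
{
        rq->cmd_flags &= ~REQ_NOWAIT;
        /* ... blocking read/write to the backing file follows ... */
}
```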

Signed-off-by: Chaitanya Kulkarni <ckulkarnilinux@gmail.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-20 07:42:32 -07:00
Chaitanya Kulkarni
b11e483a1c loop: clear nowait flag in workqueue context
The loop driver advertises REQ_NOWAIT support through BLK_FEAT_NOWAIT
(enabled by default for all blk-mq devices), and honors the nowait
behavior throughout loop_queue_rq().

However, actual I/O to the backing file is performed in a workqueue,
where blocking is allowed.

To avoid imposing unnecessary non-blocking constraints in this blocking
context, clear the REQ_NOWAIT flag before processing the request in the
workqueue context.

Signed-off-by: Chaitanya Kulkarni <ckulkarnilinux@gmail.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-20 07:42:32 -07:00
Damien Le Moal
a9637ab93c zloop: fix zone append check in zloop_rw()
While commit cf28f6f923 ("zloop: fail zone append operations that are
targeting full zones") added a check in zloop_rw() that a zone append is
not issued to a full zone, commit e3a96ca904 ("zloop: simplify checks
for writes to sequential zones") inadvertently removed the check that
verifies there is enough unwritten space in a zone for an incoming
zone append operation.

Re-add this check in zloop_rw() to make sure we do not write beyond the
end of a zone. Of note is that this same check is already present in the
function zloop_set_zone_append_sector() when ordered zone append is in
use.
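
A sketch of the re-added constraint with hypothetical zone fields (the
real zloop structures differ):

```
#include <linux/types.h>

/* Hypothetical zone bookkeeping; illustration only. */
struct zloop_zone_sketch {
        sector_t start;         /* first sector of the zone */
        sector_t wp;            /* current write pointer */
        sector_t capacity;      /* writable sectors in the zone */
};

/* A zone append lands at the write pointer, so it must fit in the
 * remaining unwritten space of the zone.
 */
static bool zone_append_fits(const struct zloop_zone_sketch *zone,
                             sector_t nr_sectors)
{
        return zone->wp + nr_sectors <= zone->start + zone->capacity;
}
```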

Reported-by: Hans Holmberg <Hans.Holmberg@wdc.com>
Fixes: e3a96ca904 ("zloop: simplify checks for writes to sequential zones")
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-19 07:39:42 -07:00
Ming Lei
837ed30396 loop: add hint for handling aio via IOCB_NOWAIT
Add a hint for using IOCB_NOWAIT to handle loop aio commands, to avoid
causing a write (especially randwrite) perf regression on sparse backing
files.

Try IOCB_NOWAIT in the following situations:

- backing file is block device

OR

- READ aio command

OR

- there aren't any queued blocking async WRITEs, because then NOWAIT won't
contend with a blocking WRITE, which often implies an exclusive lock

With this simple policy, the randwrite/write perf regression on sparse
backing files is fixed (see the sketch below).
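
A sketch of the decision only; `queued_blocking_writes` is a hypothetical
counter standing in for the driver's own bookkeeping of blocking async
WRITEs already handed to the workqueue:

```
#include <linux/blk-mq.h>
#include <linux/fs.h>

static bool lo_try_nowait_sketch(struct file *backing_file, struct request *rq,
                                 atomic_t *queued_blocking_writes)
{
        if (S_ISBLK(file_inode(backing_file)->i_mode))
                return true;    /* backing file is a block device */

        if (req_op(rq) == REQ_OP_READ)
                return true;    /* READs don't take the exclusive locks WRITEs may */

        /* WRITEs: only when no blocking async WRITE is already queued */
        return atomic_read(queued_blocking_writes) == 0;
}
```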

Link: https://lore.kernel.org/dm-devel/7d6ae2c9-df8e-50d0-7ad6-b787cb3cfab4@redhat.com/
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-18 06:49:52 -07:00
Ming Lei
0ba93a906d loop: try to handle loop aio command via NOWAIT IO first
Try to handle the loop aio command via NOWAIT IO first, so we can avoid
queueing the aio command to the workqueue. This is usually a big win
when the FS block mapping is stable: Mikulas verified [1] that this
approach improves IO perf by close to 5X in a 12-job sequential
read/write test, in which the FS block mapping is stable.

Fall back to the workqueue in case of -EAGAIN. This may add a small cost
for the first retry, but when running the following write test over a
loop device on a sparse file, the actual effect on randwrite is obvious:

```
truncate -s 4G 1.img    #1.img is created on XFS/virtio-scsi
losetup -f 1.img --direct-io=on
fio --direct=1 --bs=4k --runtime=40 --time_based --numjobs=1 --ioengine=libaio \
	--iodepth=16 --group_reporting=1 --filename=/dev/loop0 -name=job --rw=$RW
```

- RW=randwrite: obvious IOPS drop observed
- RW=write: a small drop (5% - 10%)

This perf drop on randwrite over a sparse file will be addressed in the
following patch.

BLK_MQ_F_BLOCKING has to be set because calling into .read_iter() or
.write_iter() might sleep even with NOWAIT; the only effect is that the rcu read
lock is replaced with srcu read lock.
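
A sketch of the resulting submission flow; lo_submit_rw_aio() and
loop_queue_work() are the helpers named in this series, but the
signatures and return-value handling shown here are assumptions:

```
static void lo_queue_cmd_sketch(struct loop_device *lo, struct loop_cmd *cmd)
{
        int ret;

        cmd->iocb.ki_flags |= IOCB_NOWAIT;
        ret = lo_submit_rw_aio(lo, cmd);        /* -EAGAIN: would block */
        if (ret != -EAGAIN)
                return;                         /* handled inline */

        cmd->iocb.ki_flags &= ~IOCB_NOWAIT;
        loop_queue_work(lo, cmd);               /* blocking path via workqueue */
}
```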

Link: https://lore.kernel.org/linux-block/a8e5c76a-231f-07d1-a394-847de930f638@redhat.com/ [1]
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-18 06:49:52 -07:00
Ming Lei
f4788ae9d7 loop: move command blkcg/memcg initialization into loop_queue_work
Move loop command blkcg/memcg initialization into loop_queue_work,
and prepare for supporting handling of loop io commands via IOCB_NOWAIT.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-18 06:49:52 -07:00
Ming Lei
c66e9708f9 loop: add lo_submit_rw_aio()
Refactor lo_rw_aio() by extracting the I/O submission logic into a new
helper function lo_submit_rw_aio(). This further improves code organization
by separating the I/O preparation, submission, and completion handling into
distinct phases.

Prepare for using NOWAIT to improve loop performance.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-18 06:49:51 -07:00
Ming Lei
fd858d1ca9 loop: add helper lo_rw_aio_prep()
Add helper lo_rw_aio_prep() to separate the preparation phase (setting up
bio vectors and initializing the iocb structure) from the actual I/O
execution in the loop block driver.

Prepare for using NOWAIT to improve loop performance.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-18 06:49:51 -07:00
Ming Lei
c3e6c11147 loop: add helper lo_cmd_nr_bvec()
Add lo_cmd_nr_bvec() and prepare for refactoring lo_rw_aio().

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-18 06:49:51 -07:00
Sukrut Heroorkar
2c6d792d4b drbd: turn bitmap I/O comments into regular block comments
A W=1 build warns because the bitmap I/O comments use '/**', which
marks them as kernel-doc comments even though these comments do not
document an external API.

Convert these comments to regular block comments so kernel-doc no
longer parses them.
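
For illustration, the shape of the conversion on a made-up helper
comment:

```
/* Before: parsed by kernel-doc, warns under W=1 */
/**
 * drbd_bm_io_sketch - illustration only, not a real drbd symbol
 */

/* After: a regular block comment that kernel-doc ignores */
/*
 * drbd_bm_io_sketch - illustration only, not a real drbd symbol
 */
```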

Signed-off-by: Sukrut Heroorkar <hsukrut3@gmail.com>
Acked-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-18 06:48:33 -07:00
Damien Le Moal
fcc6eaa3a0 zloop: introduce the ordered_zone_append configuration parameter
The zone append operation processing for zloop devices is similar to any
other command, that is, the operation is processed as a command work
item, without any special serialization between the work items (besides
the zone mutex for mutually exclusive code sections).

This processing is fine and gives excellent performance. However, it has
a side effect: zone append operations are very often reordered and
processed in a sequence that is very different from their issuing order
by the user. This effect is very visible using an XFS file system on top
of a zloop device. A simple file write leads to many file extents as the
data writes using zone append are reordered and so result in the
physical order being different from the file logical order.
E.g. executing:

$ dd if=/dev/zero of=/mnt/test bs=1M count=10 && sync
$ xfs_bmap /mnt/test
/mnt/test:
	0: [0..4095]: 2162688..2166783
	1: [4096..6143]: 2168832..2170879
	2: [6144..8191]: 2166784..2168831
	3: [8192..10239]: 2170880..2172927
	4: [10240..12287]: 2174976..2177023
	5: [12288..14335]: 2172928..2174975
	6: [14336..20479]: 2177024..2183167

For 10 IOs, 6 extents are created.

This is fine and actually allows exercising XFS zone garbage collection
very well. However, this also makes debugging/working on XFS data
placement harder as the underlying device will most of the time reorder
IOs, resulting in many file extents.

Allow a user to mitigate this with the new ordered_zone_append
configuration parameter. For a zloop device created with this parameter
specified, the sector of a zone append command is set early, when the
command is submitted by the block layer with the zloop_queue_rq()
function, instead of in the zloop_rw() function which is executed later
in the command work item context. This change ensures that, more often
than not, zone append data ends up being written in the same order as
the commands were submitted by the user.

In the case of XFS, this leads to far fewer file data extents. E.g., for
the previous example, we get a single file data extent for the written
file.

$ dd if=/dev/zero of=/mnt/test bs=1M count=10 && sync
$ xfs_bmap /mnt/test
/mnt/test:
	0: [0..20479]: 2162688..2183167

Since we cannot use a mutex in the context of the zloop_queue_rq()
function to atomically set a zone append operation sector to the target
zone write pointer location and advance that write pointer, a new
per-zone spinlock is introduced to protect zone write pointer accesses
and modifications. To check a zone write pointer location and set a zone
append operation target sector to that value, the function
zloop_set_zone_append_sector() is introduced and called from
zloop_queue_rq().
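
A sketch of the idea behind zloop_set_zone_append_sector(), with
hypothetical structures; it only shows fixing the append sector and
advancing the write pointer atomically under the new per-zone spinlock:

```
#include <linux/blk_types.h>
#include <linux/spinlock.h>

/* Hypothetical zone bookkeeping with the new per-zone lock. */
struct zone_sketch {
        spinlock_t wp_lock;     /* protects write pointer access/update */
        sector_t start;
        sector_t wp;
        sector_t capacity;
};

static blk_status_t set_zone_append_sector_sketch(struct zone_sketch *zone,
                                                  sector_t *append_sector,
                                                  unsigned int nr_sectors)
{
        blk_status_t ret = BLK_STS_OK;

        spin_lock(&zone->wp_lock);
        if (zone->wp + nr_sectors > zone->start + zone->capacity) {
                ret = BLK_STS_IOERR;            /* append would not fit */
        } else {
                *append_sector = zone->wp;      /* fix the location at submit time */
                zone->wp += nr_sectors;
        }
        spin_unlock(&zone->wp_lock);
        return ret;
}
```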

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-17 09:40:09 -07:00
Damien Le Moal
9236c5fdd5 zloop: introduce the zone_append configuration parameter
A zloop zoned block device declares to the block layer that it supports
zone append operations. That is, a zloop device resembles an NVMe ZNS
device supporting zone append.

This native support is fine but it does not allow exercising the block
layer zone write plugging emulation of zone append, as is done with SCSI
or ATA SMR HDDs.

Introduce the zone_append configuration parameter to allow creating a
zloop device without native support for zone append, thus relying on the
block layer zone append emulation. If not specified, zone append support
is enabled by default. Otherwise, a value of 0 disables native zone
append and a value of 1 enables it.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-17 09:40:09 -07:00
Damien Le Moal
e3a96ca904 zloop: simplify checks for writes to sequential zones
The function zloop_rw() already checks early that a request is fully
contained within the target zone. So this check does not need to be done
again for regular writes to sequential zones. Furthermore, since zone
append operations are always directed to the zone write pointer
location, we do not need to check for their alignment to that value
after setting it. So turn the "if" checking the write pointer alignment
into an "else if".

While at it, improve the comment describing the write pointer
modification and how this value is corrected in case of error.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-17 09:40:09 -07:00