linux

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2026-05-26 21:12:53 -04:00

Author	SHA1	Message	Date
Linus Torvalds	7df48e3631	Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma Pull rdma fixes from Jason Gunthorpe: - Quite a few irdma bug fixes, several user triggerable - Fix a 0 SMAC header in ionic - Tolerate FW errors for RAAS in bng_re - Don't UAF in efa when printing error events - Better handle pool exhaustion in the new bvec paths * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: RDMA/irdma: Harden depth calculation functions RDMA/irdma: Return EINVAL for invalid arp index error RDMA/irdma: Fix deadlock during netdev reset with active connections RDMA/irdma: Remove reset check from irdma_modify_qp_to_err() RDMA/irdma: Clean up unnecessary dereference of event->cm_node RDMA/irdma: Remove a NOP wait_event() in irdma_modify_qp_roce() RDMA/irdma: Update ibqp state to error if QP is already in error state RDMA/irdma: Initialize free_qp completion before using it RDMA/efa: Fix possible deadlock RDMA/rw: Fix MR pool exhaustion in bvec RDMA READ path RDMA/rw: Fall back to direct SGE on MR pool exhaustion RDMA/efa: Fix use of completion ctx after free RDMA/bng_re: Fix silent failure in HWRM version query RDMA/ionic: Preserve and set Ethernet source MAC after ib_ud_header_init() RDMA/irdma: Fix double free related to rereg_user_mr	2026-03-27 13:30:04 -07:00
Linus Torvalds	dabb83ecf4	Merge tag 'dma-mapping-7.0-2026-03-25' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux Pull dma-mapping fixes from Marek Szyprowski: "A set of fixes for DMA-mapping subsystem, which resolve false- positive warnings from KMSAN and DMA-API debug (Shigeru Yoshida and Leon Romanovsky) as well as a simple build fix (Miguel Ojeda)" * tag 'dma-mapping-7.0-2026-03-25' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux: dma-mapping: add missing `inline` for `dma_free_attrs` mm/hmm: Indicate that HMM requires DMA coherency RDMA/umem: Tell DMA mapping that UMEM requires coherency iommu/dma: add support for DMA_ATTR_REQUIRE_COHERENT attribute dma-direct: prevent SWIOTLB path when DMA_ATTR_REQUIRE_COHERENT is set dma-mapping: Introduce DMA require coherency attribute dma-mapping: Clarify valid conditions for CPU cache line overlap dma-mapping: handle DMA_ATTR_CPU_CACHE_CLEAN in trace output dma-debug: Allow multiple invocations of overlapping entries dma: swiotlb: add KMSAN annotations to swiotlb_bounce()	2026-03-26 08:22:07 -07:00
Leon Romanovsky	d9d43a3f5c	RDMA/umem: Tell DMA mapping that UMEM requires coherency The RDMA subsystem exposes DMA regions through the verbs interface, which assumes a coherent system. Use the DMA_ATTR_REQUIRE_COHERENCE attribute to ensure coherency and avoid taking the SWIOTLB path. The RDMA verbs programming model resembles HMM and assumes concurrent DMA and CPU access to userspace memory. The hardware and programming model support "one-sided" operations initiated remotely without any local CPU involvement or notification. These include ATOMIC compare/swap, READ, and WRITE. A remote CPU can use these operations to traverse data structures, manipulate locks, and perform similar tasks without the host CPU’s awareness. If SWIOTLB substitutes memory or DMA is not cache coherent, these use cases break entirely. In-kernel RDMA is fine with incoherent mappings because kernel users do not rely on one-sided operations in ways that would expose these issues. A given region may also be exported multiple times, which can trigger warnings about cacheline overlaps. These warnings are suppressed when the new attribute is used. infiniband rocep8s0f0: mlx5_ib_reg_user_mr:1592:(pid 5812): start 0x2b28c000, iova 0x2b28c000, length 0x1000, access_flags 0x1 infiniband rocep8s0f0: mlx5_ib_reg_user_mr:1592:(pid 5812): start 0x2b28c001, iova 0x2b28c001, length 0xfff, access_flags 0x1 ------------[ cut here ]------------ DMA-API: mlx5_core 0000:08:00.0: cacheline tracking EEXIST, overlapping mappings aren't supported WARNING: kernel/dma/debug.c:620 at add_dma_entry+0x1bb/0x280, CPU#6: ibv_rc_pingpong/5812 Modules linked in: veth xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink iptable_nat nf_nat xt_addrtype br_netfilter rpcsec_gss_krb5 auth_rpcgss oid_registry overlay mlx5_fwctl zram zsmalloc mlx5_ib fuse rpcrdma rdma_ucm ib_uverbs ib_iser libiscsi scsi_transport_iscsi ib_umad rdma_cm ib_ipoib iw_cm ib_cm mlx5_core ib_core CPU: 6 UID: 2733 PID: 5812 Comm: ibv_rc_pingpong Tainted: G W 6.19.0+ #129 PREEMPT Tainted: [W]=WARN Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 RIP: 0010:add_dma_entry+0x1be/0x280 Code: 8b 7b 10 48 85 ff 0f 84 c3 00 00 00 48 8b 6f 50 48 85 ed 75 03 48 8b 2f e8 ff 8e 6a 00 48 89 c6 48 8d 3d 55 ef 2d 01 48 89 ea <67> 48 0f b9 3a 48 85 db 74 1a 48 c7 c7 b0 00 2b 82 e8 9c 25 fd ff RSP: 0018:ff11000138717978 EFLAGS: 00010286 RAX: ffffffffa02d7831 RBX: ff1100010246de00 RCX: 0000000000000000 RDX: ff110001036fac30 RSI: ffffffffa02d7831 RDI: ffffffff82678650 RBP: ff110001036fac30 R08: ff11000110dcb4a0 R09: ff11000110dcb478 R10: 0000000000000000 R11: ffffffff824b30a8 R12: 0000000000000000 R13: 00000000ffffffef R14: 0000000000000202 R15: ff1100010246de00 FS: 00007f59b411c740(0000) GS:ff110008dcc99000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007ffe538f7000 CR3: 000000010e066005 CR4: 0000000000373eb0 Call Trace: <TASK> debug_dma_map_sg+0x1b4/0x390 __dma_map_sg_attrs+0x6d/0x1a0 dma_map_sgtable+0x19/0x30 ib_umem_get+0x254/0x380 [ib_uverbs] mlx5_ib_reg_user_mr+0x68/0x2a0 [mlx5_ib] ib_uverbs_reg_mr+0x17f/0x2a0 [ib_uverbs] ib_uverbs_handler_UVERBS_METHOD_INVOKE_WRITE+0xc2/0x130 [ib_uverbs] ib_uverbs_cmd_verbs+0xa0b/0xae0 [ib_uverbs] ? ib_uverbs_handler_UVERBS_METHOD_QUERY_PORT_SPEED+0xe0/0xe0 [ib_uverbs] ? mmap_region+0x7a/0xb0 ? do_mmap+0x3b8/0x5c0 ib_uverbs_ioctl+0xa7/0x110 [ib_uverbs] __x64_sys_ioctl+0x14f/0x8b0 ? ksys_mmap_pgoff+0xc5/0x190 do_syscall_64+0x8c/0xbf0 entry_SYSCALL_64_after_hwframe+0x4b/0x53 RIP: 0033:0x7f59b430aeed Code: 04 25 28 00 00 00 48 89 45 c8 31 c0 48 8d 45 10 c7 45 b0 10 00 00 00 48 89 45 b8 48 8d 45 d0 48 89 45 c0 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1a 48 8b 45 c8 64 48 2b 04 25 28 00 00 00 RSP: 002b:00007ffe538f9430 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 00007ffe538f94c0 RCX: 00007f59b430aeed RDX: 00007ffe538f94e0 RSI: 00000000c0181b01 RDI: 0000000000000003 RBP: 00007ffe538f9480 R08: 0000000000000028 R09: 00007ffe538f9684 R10: 0000000000000001 R11: 0000000000000246 R12: 00007ffe538f9684 R13: 000000000000000c R14: 000000002b28d170 R15: 000000000000000c </TASK> ---[ end trace 0000000000000000 ]--- Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Link: https://lore.kernel.org/r/20260316-dma-debug-overlap-v3-7-1dde90a7f08b@nvidia.com	2026-03-20 12:05:56 +01:00
Chuck Lever	f28599f396	RDMA/rw: Fix MR pool exhaustion in bvec RDMA READ path When IOVA-based DMA mapping is unavailable (e.g., IOMMU passthrough mode), rdma_rw_ctx_init_bvec() falls back to checking rdma_rw_io_needs_mr() with the raw bvec count. Unlike the scatterlist path in rdma_rw_ctx_init(), which passes a post-DMA-mapping entry count that reflects coalescing of physically contiguous pages, the bvec path passes the pre-mapping page count. This overstates the number of DMA entries, causing every multi-bvec RDMA READ to consume an MR from the QP's pool. Under NFS WRITE workloads the server performs RDMA READs to pull data from the client. With the inflated MR demand, the pool is rapidly exhausted, ib_mr_pool_get() returns NULL, and rdma_rw_init_one_mr() returns -EAGAIN. svcrdma treats this as a DMA mapping failure, closes the connection, and the client reconnects -- producing a cycle of 71% RPC retransmissions and ~100 reconnections per test run. RDMA WRITEs (NFS READ direction) are unaffected because DMA_TO_DEVICE never triggers the max_sgl_rd check. Remove the rdma_rw_io_needs_mr() gate from the bvec path entirely, so that bvec RDMA operations always use the map_wrs path (direct WR posting without MR allocation). The bvec caller has no post-DMA-coalescing segment count available -- xdr_buf and svc_rqst hold pages as individual pointers, and physical contiguity is discovered only during DMA mapping -- so the raw page count cannot serve as a reliable input to rdma_rw_io_needs_mr(). iWARP devices, which require MRs unconditionally, are handled by an earlier check in rdma_rw_ctx_init_bvec() and are unaffected. Fixes: `bea28ac14c` ("RDMA/core: add MR support for bvec-based RDMA operations") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://patch.msgid.link/20260313194201.5818-3-cel@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>	2026-03-17 15:00:56 -04:00
Chuck Lever	00da250c21	RDMA/rw: Fall back to direct SGE on MR pool exhaustion When IOMMU passthrough mode is active, ib_dma_map_sgtable_attrs() produces no coalescing: each scatterlist page maps 1:1 to a DMA entry, so sgt.nents equals the raw page count. A 1 MB transfer yields 256 DMA entries. If that count exceeds the device's max_sgl_rd threshold (an optimization hint from mlx5 firmware), rdma_rw_io_needs_mr() steers the operation into the MR registration path. Each such operation consumes one or more MRs from a pool sized at max_rdma_ctxs -- roughly one MR per concurrent context. Under write-intensive workloads that issue many concurrent RDMA READs, the pool is rapidly exhausted, ib_mr_pool_get() returns NULL, and rdma_rw_init_one_mr() returns -EAGAIN. Upper layer protocols treat this as a fatal DMA mapping failure and tear down the connection. The max_sgl_rd check is a performance optimization, not a correctness requirement: the device can handle large SGE counts via direct posting, just less efficiently than with MR registration. When the MR pool cannot satisfy a request, falling back to the direct SGE (map_wrs) path avoids the connection reset while preserving the MR optimization for the common case where pool resources are available. Add a fallback in rdma_rw_ctx_init() so that -EAGAIN from rdma_rw_init_mr_wrs() triggers direct SGE posting instead of propagating the error. iWARP devices, which mandate MR registration for RDMA READs, and force_mr debug mode continue to treat -EAGAIN as terminal. Fixes: `00bd1439f4` ("RDMA/rw: Support threshold for registration vs scattering to local pages") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://patch.msgid.link/20260313194201.5818-2-cel@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>	2026-03-17 15:00:38 -04:00
Leon Romanovsky	7c2889af82	RDMA/uverbs: Import DMA-BUF module in uverbs_std_types_dmabuf file Fix the following compilation error: ERROR: modpost: module ib_uverbs uses symbol dma_buf_move_notify from namespace DMA_BUF, but does not import it. Fixes: `0ac6f4056c` ("RDMA/uverbs: Add DMABUF object type and operations") Link: https://patch.msgid.link/20260225-fix-uverbs-compilation-v1-1-acf7b3d0f9fa@nvidia.com Signed-off-by: Leon Romanovsky <leonro@nvidia.com>	2026-02-26 04:58:24 -05:00
Jacob Moroni	104016eb67	RDMA/umem: Fix double dma_buf_unpin in failure path In ib_umem_dmabuf_get_pinned_with_dma_device(), the call to ib_umem_dmabuf_map_pages() can fail. If this occurs, the dmabuf is immediately unpinned but the umem_dmabuf->pinned flag is still set. Then, when ib_umem_release() is called, it calls ib_umem_dmabuf_revoke() which will call dma_buf_unpin() again. Fix this by removing the immediate unpin upon failure and just let the ib_umem_release/revoke path handle it. This also ensures the proper unmap-unpin unwind ordering if the dmabuf_map_pages call happened to fail due to dma_resv_wait_timeout (and therefore has a non-NULL umem_dmabuf->sgt). Fixes: `1e4df4a21c` ("RDMA/umem: Allow pinned dmabuf umem usage") Signed-off-by: Jacob Moroni <jmoroni@google.com> Link: https://patch.msgid.link/20260224234153.1207849-1-jmoroni@google.com Signed-off-by: Leon Romanovsky <leon@kernel.org>	2026-02-25 07:50:58 -05:00
Stefan Metzmacher	93a4a9b732	RDMA/core: Check id_priv->restricted_node_type in cma_listen_on_dev() When listening on wildcard addresses we have a global list for the application layer rdma_cm_id and for any existing device or any device added in future we try to listen on any wildcard listener. When the listener has a restricted_node_type we should prevent listening on devices with a different node type. While there fix the documentation comment of rdma_restrict_node_type() to include rdma_resolve_addr() instead of having rdma_bind_addr() twice. Fixes: `a760e80e90` ("RDMA/core: introduce rdma_restrict_node_type()") Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Leon Romanovsky <leon@kernel.org> Cc: Steve French <smfrench@gmail.com> Cc: Namjae Jeon <linkinjeon@kernel.org> Cc: Tom Talpey <tom@talpey.com> Cc: Long Li <longli@microsoft.com> Cc: linux-rdma@vger.kernel.org Cc: linux-cifs@vger.kernel.org Cc: samba-technical@lists.samba.org Signed-off-by: Stefan Metzmacher <metze@samba.org> Link: https://patch.msgid.link/20260224165951.3582093-2-metze@samba.org Signed-off-by: Leon Romanovsky <leon@kernel.org>	2026-02-25 07:50:10 -05:00
Jiri Pirko	9af0feae80	RDMA/core: Fix stale RoCE GIDs during netdev events at registration RoCE GID entries become stale when netdev properties change during the IB device registration window. This is reproducible with a udev rule that sets a MAC address when a VF netdev appears: ACTION=="add", SUBSYSTEM=="net", KERNEL=="eth4", \ RUN+="/sbin/ip link set eth4 address 88:22:33:44:55:66" After VF creation, show_gids displays GIDs derived from the original random MAC rather than the configured one. The root cause is a race between netdev event processing and device registration: CPU 0 (driver) CPU 1 (udev/workqueue) ────────────── ────────────────────── ib_register_device() ib_cache_setup_one() gid_table_setup_one() _gid_table_setup_one() ← GID table allocated rdma_roce_rescan_device() ← GIDs populated with OLD MAC ip link set eth4 addr NEW_MAC NETDEV_CHANGEADDR queued netdevice_event_work_handler() ib_enum_all_roce_netdevs() ← Iterates DEVICE_REGISTERED ← Device NOT marked yet, SKIP! enable_device_and_get() xa_set_mark(DEVICE_REGISTERED) ← Too late, event was lost The netdev event handler uses ib_enum_all_roce_netdevs() which only iterates devices marked DEVICE_REGISTERED. However, this mark is set late in the registration process, after the GID cache is already populated. Events arriving in this window are silently dropped. Fix this by introducing a new xarray mark DEVICE_GID_UPDATES that is set immediately after the GID table is allocated and initialized. Use the new mark in ib_enum_all_roce_netdevs() function to iterate devices instead of DEVICE_REGISTERED. This is safe because: - After _gid_table_setup_one(), all required structures exist (port_data, immutable, cache.gid) - The GID table mutex serializes concurrent access between the initial rescan and event handlers - Event handlers correctly update stale GIDs even when racing with rescan - The mark is cleared in ib_cache_cleanup_one() before teardown This also fixes similar races for IP address events (inetaddr_event, inet6addr_event) which use the same enumeration path. Fixes: `0df91bb673` ("RDMA/devices: Use xarray to store the client_data") Signed-off-by: Jiri Pirko <jiri@nvidia.com> Link: https://patch.msgid.link/20260127093839.126291-1-jiri@resnulli.us Reported-by: syzbot+881d65229ca4f9ae8c84@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=881d65229ca4f9ae8c84 Signed-off-by: Leon Romanovsky <leon@kernel.org>	2026-02-24 03:43:16 -05:00
Kees Cook	189f164e57	Convert remaining multi-line kmalloc_obj/flex GFP_KERNEL uses Conversion performed via this Coccinelle script: // SPDX-License-Identifier: GPL-2.0-only // Options: --include-headers-for-types --all-includes --include-headers --keep-comments virtual patch @gfp depends on patch && !(file in "tools") && !(file in "samples")@ identifier ALLOC = {kmalloc_obj,kmalloc_objs,kmalloc_flex, kzalloc_obj,kzalloc_objs,kzalloc_flex, kvmalloc_obj,kvmalloc_objs,kvmalloc_flex, kvzalloc_obj,kvzalloc_objs,kvzalloc_flex}; @@ ALLOC(... - , GFP_KERNEL ) $ make coccicheck MODE=patch COCCI=gfp.cocci Build and boot tested x86_64 with Fedora 42's GCC and Clang: Linux version 6.19.0+ (user@host) (gcc (GCC) 15.2.1 20260123 (Red Hat 15.2.1-7), GNU ld version 2.44-12.fc42) #1 SMP PREEMPT_DYNAMIC 1970-01-01 Linux version 6.19.0+ (user@host) (clang version 20.1.8 (Fedora 20.1.8-4.fc42), LLD 20.1.8) #1 SMP PREEMPT_DYNAMIC 1970-01-01 Signed-off-by: Kees Cook <kees@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2026-02-22 08:26:33 -08:00
Linus Torvalds	32a92f8c89	Convert more 'alloc_obj' cases to default GFP_KERNEL arguments This converts some of the visually simpler cases that have been split over multiple lines. I only did the ones that are easy to verify the resulting diff by having just that final GFP_KERNEL argument on the next line. Somebody should probably do a proper coccinelle script for this, but for me the trivial script actually resulted in an assertion failure in the middle of the script. I probably had made it a bit _too_ trivial. So after fighting that far a while I decided to just do some of the syntactically simpler cases with variations of the previous 'sed' scripts. The more syntactically complex multi-line cases would mostly really want whitespace cleanup anyway. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2026-02-21 20:03:00 -08:00
Linus Torvalds	323bbfcf1e	Convert 'alloc_flex' family to use the new default GFP_KERNEL argument This is the exact same thing as the 'alloc_obj()' version, only much smaller because there are a lot fewer users of the alloc_flex() interface. As with alloc_obj() version, this was done entirely with mindless brute force, using the same script, except using 'flex' in the pattern rather than 'objs'. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2026-02-21 17:09:51 -08:00
Linus Torvalds	bf4afc53b7	Convert 'alloc_obj' family to use the new default GFP_KERNEL argument This was done entirely with mindless brute force, using git grep -l '\<k[vmz]alloc_objs(., GFP_KERNEL)' \| xargs sed -i 's/$alloc_objs(.*$, GFP_KERNEL)/\1)/' to convert the new alloc_obj() users that had a simple GFP_KERNEL argument to just drop that argument. Note that due to the extreme simplicity of the scripting, any slightly more complex cases spread over multiple lines would not be triggered: they definitely exist, but this covers the vast bulk of the cases, and the resulting diff is also then easier to check automatically. For the same reason the 'flex' versions will be done as a separate conversion. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2026-02-21 17:09:51 -08:00
Kees Cook	69050f8d6d	treewide: Replace kmalloc with kmalloc_obj for non-scalar types This is the result of running the Coccinelle script from scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to avoid scalar types (which need careful case-by-case checking), and instead replace kmalloc-family calls that allocate struct or union object instances: Single allocations: kmalloc(sizeof(TYPE), ...) are replaced with: kmalloc_obj(TYPE, ...) Array allocations: kmalloc_array(COUNT, sizeof(TYPE), ...) are replaced with: kmalloc_objs(TYPE, COUNT, ...) Flex array allocations: kmalloc(struct_size(PTR, FAM, COUNT), ...) are replaced with: kmalloc_flex(PTR, FAM, COUNT, ...) (where TYPE may also be VAR) The resulting allocations no longer return "void ", instead returning "TYPE ". Signed-off-by: Kees Cook <kees@kernel.org>	2026-02-21 01:02:28 -08:00
Linus Torvalds	311aa68319	Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma Pull rdma updates from Jason Gunthorpe: "Usual smallish cycle. The NFS biovec work to push it down into RDMA instead of indirecting through a scatterlist is pretty nice to see, been talked about for a long time now. - Various code improvements in irdma, rtrs, qedr, ocrdma, irdma, rxe - Small driver improvements and minor bug fixes to hns, mlx5, rxe, mana, mlx5, irdma - Robusness improvements in completion processing for EFA - New query_port_speed() verb to move past limited IBA defined speed steps - Support for SG_GAPS in rts and many other small improvements - Rare list corruption fix in iwcm - Better support different page sizes in rxe - Device memory support for mana - Direct bio vec to kernel MR for use by NFS-RDMA - QP rate limiting for bnxt_re - Remote triggerable NULL pointer crash in siw - DMA-buf exporter support for RDMA mmaps like doorbells" * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (66 commits) RDMA/mlx5: Implement DMABUF export ops RDMA/uverbs: Add DMABUF object type and operations RDMA/uverbs: Support external FD uobjects RDMA/siw: Fix potential NULL pointer dereference in header processing RDMA/umad: Reject negative data_len in ib_umad_write IB/core: Extend rate limit support for RC QPs RDMA/mlx5: Support rate limit only for Raw Packet QP RDMA/bnxt_re: Report QP rate limit in debugfs RDMA/bnxt_re: Report packet pacing capabilities when querying device RDMA/bnxt_re: Add support for QP rate limiting MAINTAINERS: Drop RDMA files from Hyper-V section RDMA/uverbs: Add __GFP_NOWARN to ib_uverbs_unmarshall_recv() kmalloc svcrdma: use bvec-based RDMA read/write API RDMA/core: add rdma_rw_max_sge() helper for SQ sizing RDMA/core: add MR support for bvec-based RDMA operations RDMA/core: use IOVA-based DMA mapping for bvec RDMA operations RDMA/core: add bio_vec based RDMA read/write API RDMA/irdma: Use kvzalloc for paged memory DMA address array RDMA/rxe: Fix race condition in QP timer handlers RDMA/mana_ib: Add device‑memory support ...	2026-02-12 17:05:20 -08:00
Linus Torvalds	136114e0ab	Merge tag 'mm-nonmm-stable-2026-02-12-10-48' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull non-MM updates from Andrew Morton: - "ocfs2: give ocfs2 the ability to reclaim suballocator free bg" saves disk space by teaching ocfs2 to reclaim suballocator block group space (Heming Zhao) - "Add ARRAY_END(), and use it to fix off-by-one bugs" adds the ARRAY_END() macro and uses it in various places (Alejandro Colomar) - "vmcoreinfo: support VMCOREINFO_BYTES larger than PAGE_SIZE" makes the vmcore code future-safe, if VMCOREINFO_BYTES ever exceeds the page size (Pnina Feder) - "kallsyms: Prevent invalid access when showing module buildid" cleans up kallsyms code related to module buildid and fixes an invalid access crash when printing backtraces (Petr Mladek) - "Address page fault in ima_restore_measurement_list()" fixes a kexec-related crash that can occur when booting the second-stage kernel on x86 (Harshit Mogalapalli) - "kho: ABI headers and Documentation updates" updates the kexec handover ABI documentation (Mike Rapoport) - "Align atomic storage" adds the __aligned attribute to atomic_t and atomic64_t definitions to get natural alignment of both types on csky, m68k, microblaze, nios2, openrisc and sh (Finn Thain) - "kho: clean up page initialization logic" simplifies the page initialization logic in kho_restore_page() (Pratyush Yadav) - "Unload linux/kernel.h" moves several things out of kernel.h and into more appropriate places (Yury Norov) - "don't abuse task_struct.group_leader" removes the usage of ->group_leader when it is "obviously unnecessary" (Oleg Nesterov) - "list private v2 & luo flb" adds some infrastructure improvements to the live update orchestrator (Pasha Tatashin) * tag 'mm-nonmm-stable-2026-02-12-10-48' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (107 commits) watchdog/hardlockup: simplify perf event probe and remove per-cpu dependency procfs: fix missing RCU protection when reading real_parent in do_task_stat() watchdog/softlockup: fix sample ring index wrap in need_counting_irqs() kcsan, compiler_types: avoid duplicate type issues in BPF Type Format kho: fix doc for kho_restore_pages() tests/liveupdate: add in-kernel liveupdate test liveupdate: luo_flb: introduce File-Lifecycle-Bound global state liveupdate: luo_file: Use private list list: add kunit test for private list primitives list: add primitives for private list manipulations delayacct: fix uapi timespec64 definition panic: add panic_force_cpu= parameter to redirect panic to a specific CPU netclassid: use thread_group_leader(p) in update_classid_task() RDMA/umem: don't abuse current->group_leader drm/pan*: don't abuse current->group_leader drm/amd: kill the outdated "Only the pthreads threading model is supported" checks drm/amdgpu: don't abuse current->group_leader android/binder: use same_thread_group(proc->tsk, current) in binder_mmap() android/binder: don't abuse current->group_leader kho: skip memoryless NUMA nodes when reserving scratch areas ...	2026-02-12 12:13:01 -08:00
Yishai Hadas	0ac6f4056c	RDMA/uverbs: Add DMABUF object type and operations Expose DMABUF functionality to userspace through the uverbs interface, enabling InfiniBand/RDMA devices to export PCI based memory regions (e.g. device memory) as DMABUF file descriptors. This allows zero-copy sharing of RDMA memory with other subsystems that support the dma-buf framework. A new UVERBS_OBJECT_DMABUF object type and allocation method were introduced. During allocation, uverbs invokes the driver to supply the rdma_user_mmap_entry associated with the given page offset (pgoff). Based on the returned rdma_user_mmap_entry, uverbs requests the driver to provide the corresponding physical-memory details as well as the driver’s PCI provider information. Using this information, dma_buf_export() is called; if it succeeds, uobj->object is set to the underlying file pointer returned by the dma-buf framework. The file descriptor number follows the standard uverbs allocation flow, but the file pointer comes from the dma-buf subsystem, including its own fops and private data. When an mmap entry is removed, uverbs iterates over its associated DMABUFs, marks them as revoked, and calls dma_buf_move_notify() so that their importers are notified. The same procedure applies during the disassociate flow; final cleanup occurs when the application closes the file. Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Signed-off-by: Edward Srouji <edwards@nvidia.com> Link: https://patch.msgid.link/20260201-dmabuf-export-v3-2-da238b614fe3@nvidia.com Signed-off-by: Leon Romanovsky <leon@kernel.org>	2026-02-08 23:50:41 -05:00
Yishai Hadas	9ad95a0f2b	RDMA/uverbs: Support external FD uobjects Add support for uobjects that wrap externally allocated file descriptors (FDs). In this mode, the FD number still follows the standard uverbs allocation flow, but the file pointer is allocated externally and has its own fops and private data. As a result, alloc_begin_fd_uobject() must handle cases where fd_type->fops is NULL, and both alloc_commit_fd_uobject() and alloc_abort_fd_uobject() must account for whether filp->private_data exists, since it is populated outside the standard uverbs flow. Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Signed-off-by: Edward Srouji <edwards@nvidia.com> Link: https://patch.msgid.link/20260201-dmabuf-export-v3-1-da238b614fe3@nvidia.com Signed-off-by: Leon Romanovsky <leon@kernel.org>	2026-02-08 23:47:31 -05:00
Stefan Metzmacher	a760e80e90	RDMA/core: introduce rdma_restrict_node_type() For smbdirect it required to use different ports depending on the RDMA protocol. E.g. for iWarp 5445 is needed (as tcp port 445 already used by the raw tcp transport for SMB), while InfiniBand, RoCEv1 and RoCEv2 use port 445, as they use an independent port range (even for RoCEv2, which uses udp port 4791 itself). Currently ksmbd is not able to function correctly at all if the system has iWarp (RDMA_NODE_RNIC) interface(s) and any InfiniBand, RoCEv1 and/or RoCEv2 interface(s) at the same time. And cifs.ko uses 5445 with a fallback to 445, which means depending on the available interfaces, it tries 5445 in the RoCE range or may tries iWarp with 445 as a fallback. This leads to strange error messages and strange network captures. To avoid these problems they will be able to use rdma_restrict_node_type(RDMA_NODE_RNIC) before trying port 5445 and rdma_restrict_node_type(RDMA_NODE_IB_CA) before trying port 445. It means we'll get early -ENODEV early from rdma_resolve_addr() without any network traffic and timeouts. This is designed to be called before calling any of rdma_bind_addr(), rdma_resolve_addr() or rdma_listen(). Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Steve French <smfrench@gmail.com> Cc: Tom Talpey <tom@talpey.com> Cc: Long Li <longli@microsoft.com> Cc: linux-rdma@vger.kernel.org Cc: linux-cifs@vger.kernel.org Cc: samba-technical@lists.samba.org Acked-by: Leon Romanovsky <leon@kernel.org> Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Stefan Metzmacher <metze@samba.org> Signed-off-by: Steve French <stfrench@microsoft.com>	2026-02-08 17:12:58 -06:00
YunJe Shin	5551b02fdb	RDMA/umad: Reject negative data_len in ib_umad_write ib_umad_write computes data_len from user-controlled count and the MAD header sizes. With a mismatched user MAD header size and RMPP header length, data_len can become negative and reach ib_create_send_mad(). This can make the padding calculation exceed the segment size and trigger an out-of-bounds memset in alloc_send_rmpp_list(). Add an explicit check to reject negative data_len before creating the send buffer. KASAN splat: [ 211.363464] BUG: KASAN: slab-out-of-bounds in ib_create_send_mad+0xa01/0x11b0 [ 211.364077] Write of size 220 at addr ffff88800c3fa1f8 by task spray_thread/102 [ 211.365867] ib_create_send_mad+0xa01/0x11b0 [ 211.365887] ib_umad_write+0x853/0x1c80 Fixes: `2be8e3ee8e` ("IB/umad: Add P_Key index support") Signed-off-by: YunJe Shin <ioerts@kookmin.ac.kr> Link: https://patch.msgid.link/20260203100628.1215408-1-ioerts@kookmin.ac.kr Signed-off-by: Leon Romanovsky <leon@kernel.org>	2026-02-05 07:25:38 -05:00
Oleg Nesterov	6fd390e2bc	RDMA/umem: don't abuse current->group_leader Cleanup and preparation to simplify the next changes. Use current->tgid instead of current->group_leader->pid. Link: https://lkml.kernel.org/r/aXY_2JIhCeGAYC0r@redhat.com Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Leon Romanovsky <leon@kernel.org> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Boris Brezillon <boris.brezillon@collabora.com> Cc: Christan König <christian.koenig@amd.com> Cc: David S. Miller <davem@davemloft.net> Cc: Eric Dumazet <edumazet@google.com> Cc: Felix Kuehling <felix.kuehling@amd.com> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Simon Horman <horms@kernel.org> Cc: Steven Price <steven.price@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-02-03 08:21:25 -08:00
Kalesh AP	42e3aac65c	IB/core: Extend rate limit support for RC QPs Broadcom devices supports setting the rate limit while changing RC QP state from INIT to RTR, RTR to RTS and RTS to RTS. Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Link: https://patch.msgid.link/20260202133413.3182578-6-kalesh-anakkur.purayil@broadcom.com Reviewed-by: Damodharam Ammepalli <damodharam.ammepalli@broadcom.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>	2026-02-02 08:37:59 -05:00
Yi Liu	58b604dfc7	RDMA/uverbs: Add __GFP_NOWARN to ib_uverbs_unmarshall_recv() kmalloc Since wqe_size in ib_uverbs_unmarshall_recv() is user-provided and already validated, but can still be large, add __GFP_NOWARN to suppress memory allocation warnings for large sizes, consistent with the similar fix in ib_uverbs_post_send(). Fixes: `67cdb40ca4` ("[IB] uverbs: Implement more commands") Signed-off-by: Yi Liu <liuy22@mails.tsinghua.edu.cn> Link: https://patch.msgid.link/20260129094900.3517706-1-liuy22@mails.tsinghua.edu.cn Signed-off-by: Leon Romanovsky <leon@kernel.org>	2026-02-01 07:39:14 -05:00
Chuck Lever	afcae7d7b8	RDMA/core: add rdma_rw_max_sge() helper for SQ sizing svc_rdma_accept() computes sc_sq_depth as the sum of rq_depth and the number of rdma_rw contexts (ctxts). This value is used to allocate the Send CQ and to initialize the sc_sq_avail credit pool. However, when the device uses memory registration for RDMA operations, rdma_rw_init_qp() inflates the QP's max_send_wr by a factor of three per context to account for REG and INV work requests. The Send CQ and credit pool remain sized for only one work request per context, causing Send Queue exhaustion under heavy NFS WRITE workloads. Introduce rdma_rw_max_sge() to compute the actual number of Send Queue entries required for a given number of rdma_rw contexts. Upper layer protocols call this helper before creating a Queue Pair so that their Send CQs and credit accounting match the QP's true capacity. Update svc_rdma_accept() to use rdma_rw_max_sge() when computing sc_sq_depth, ensuring the credit pool reflects the work requests that rdma_rw_init_qp() will reserve. Reviewed-by: Christoph Hellwig <hch@lst.de> Fixes: `00bd1439f4` ("RDMA/rw: Support threshold for registration vs scattering to local pages") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Link: https://patch.msgid.link/20260128005400.25147-5-cel@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>	2026-01-28 05:54:53 -05:00
Chuck Lever	bea28ac14c	RDMA/core: add MR support for bvec-based RDMA operations The bvec-based RDMA API currently returns -EOPNOTSUPP when Memory Region registration is required. This prevents iWARP devices from using the bvec path, since iWARP requires MR registration for RDMA READ operations. The force_mr debug parameter is also unusable with bvec input. Add rdma_rw_init_mr_wrs_bvec() to handle MR registration for bvec arrays. The approach creates a synthetic scatterlist populated with DMA addresses from the bvecs, then reuses the existing ib_map_mr_sg() infrastructure. This avoids driver changes while keeping the implementation small. The synthetic scatterlist is stored in the rdma_rw_ctx for cleanup. On destroy, the MRs are returned to the pool and the bvec DMA mappings are released using the stored addresses. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Link: https://patch.msgid.link/20260128005400.25147-4-cel@kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Leon Romanovsky <leon@kernel.org>	2026-01-28 05:54:53 -05:00
Chuck Lever	853e892076	RDMA/core: use IOVA-based DMA mapping for bvec RDMA operations The bvec RDMA API maps each bvec individually via dma_map_phys(), requiring an IOTLB sync for each mapping. For large I/O operations with many bvecs, this overhead becomes significant. The two-step IOVA API (dma_iova_try_alloc / dma_iova_link / dma_iova_sync) allocates a contiguous IOVA range upfront, links all physical pages without IOTLB syncs, then performs a single sync at the end. This reduces IOTLB flushes from O(n) to O(1). It also requires only a single output dma_addr_t compared to extra per-input element storage in struct scatterlist. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Link: https://patch.msgid.link/20260128005400.25147-3-cel@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>	2026-01-28 05:54:53 -05:00
Chuck Lever	5e54155358	RDMA/core: add bio_vec based RDMA read/write API The existing rdma_rw_ctx_init() API requires callers to construct a scatterlist, which is then DMA-mapped page by page. Callers that already have data in bio_vec form (such as the NVMe-oF target) must first convert to scatterlist, adding overhead and complexity. Introduce rdma_rw_ctx_init_bvec() and rdma_rw_ctx_destroy_bvec() to accept bio_vec arrays directly. The new helpers use dma_map_phys() for hardware RDMA devices and virtual addressing for software RDMA devices (rxe, siw), avoiding intermediate scatterlist construction. Memory registration (MR) path support is deferred to a follow-up series; callers requiring MR-based transfers (iWARP devices or force_mr=1) receive -EOPNOTSUPP and should use the scatterlist API. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Link: https://patch.msgid.link/20260128005400.25147-2-cel@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>	2026-01-28 05:54:53 -05:00
Yi Liu	1956f0a74c	RDMA/uverbs: Validate wqe_size before using it in ib_uverbs_post_send ib_uverbs_post_send() uses cmd.wqe_size from userspace without any validation before passing it to kmalloc() and using the allocated buffer as struct ib_uverbs_send_wr. If a user provides a small wqe_size value (e.g., 1), kmalloc() will succeed, but subsequent accesses to user_wr->opcode, user_wr->num_sge, and other fields will read beyond the allocated buffer, resulting in an out-of-bounds read from kernel heap memory. This could potentially leak sensitive kernel information to userspace. Additionally, providing an excessively large wqe_size can trigger a WARNING in the memory allocation path, as reported by syzkaller. This is inconsistent with ib_uverbs_unmarshall_recv() which properly validates that wqe_size >= sizeof(struct ib_uverbs_recv_wr) before proceeding. Add the same validation for ib_uverbs_post_send() to ensure wqe_size is at least sizeof(struct ib_uverbs_send_wr). Fixes: `c3bea3d2dc` ("RDMA/uverbs: Use the iterator for ib_uverbs_unmarshall_recv()") Signed-off-by: Yi Liu <liuy22@mails.tsinghua.edu.cn> Link: https://patch.msgid.link/20260122142900.2356276-2-liuy22@mails.tsinghua.edu.cn Signed-off-by: Leon Romanovsky <leon@kernel.org>	2026-01-26 07:57:44 -05:00
Jacob Moroni	7874eeacfa	RDMA/iwcm: Fix workqueue list corruption by removing work_list The commit `e1168f0` ("RDMA/iwcm: Simplify cm_event_handler()") changed the work submission logic to unconditionally call queue_work() with the expectation that queue_work() would have no effect if work was already pending. The problem is that a free list of struct iwcm_work is used (for which struct work_struct is embedded), so each call to queue_work() is basically unique and therefore does indeed queue the work. This causes a problem in the work handler which walks the work_list until it's empty to process entries. This means that a single run of the work handler could process item N+1 and release it back to the free list while the actual workqueue entry is still queued. It could then get reused (INIT_WORK...) and lead to list corruption in the workqueue logic. Fix this by just removing the work_list. The workqueue already does this for us. This fixes the following error that was observed when stress testing with ucmatose on an Intel E830 in iWARP mode: [ 151.465780] list_del corruption. next->prev should be ffff9f0915c69c08, but was ffff9f0a1116be08. (next=ffff9f0a15b11c08) [ 151.466639] ------------[ cut here ]------------ [ 151.466986] kernel BUG at lib/list_debug.c:67! [ 151.467349] Oops: invalid opcode: 0000 [#1] SMP NOPTI [ 151.467753] CPU: 14 UID: 0 PID: 2306 Comm: kworker/u64:18 Not tainted 6.19.0-rc4+ #1 PREEMPT(voluntary) [ 151.468466] Hardware name: QEMU Ubuntu 24.04 PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 [ 151.469192] Workqueue: 0x0 (iw_cm_wq) [ 151.469478] RIP: 0010:__list_del_entry_valid_or_report+0xf0/0x100 [ 151.469942] Code: c7 58 5f 4c b2 e8 10 50 aa ff 0f 0b 48 89 ef e8 36 57 cb ff 48 8b 55 08 48 89 e9 48 89 de 48 c7 c7 a8 5f 4c b2 e8 f0 4f aa ff <0f> 0b 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 90 90 90 90 90 90 [ 151.471323] RSP: 0000:ffffb15644e7bd68 EFLAGS: 00010046 [ 151.471712] RAX: 000000000000006d RBX: ffff9f0915c69c08 RCX: 0000000000000027 [ 151.472243] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9f0a37d9c600 [ 151.472768] RBP: ffff9f0a15b11c08 R08: 0000000000000000 R09: c0000000ffff7fff [ 151.473294] R10: 0000000000000001 R11: ffffb15644e7bba8 R12: ffff9f092339ee68 [ 151.473817] R13: ffff9f0900059c28 R14: ffff9f092339ee78 R15: 0000000000000000 [ 151.474344] FS: 0000000000000000(0000) GS:ffff9f0a847b5000(0000) knlGS:0000000000000000 [ 151.474934] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 151.475362] CR2: 0000559e233a9088 CR3: 000000020296b004 CR4: 0000000000770ef0 [ 151.475895] PKRU: 55555554 [ 151.476118] Call Trace: [ 151.476331] <TASK> [ 151.476497] move_linked_works+0x49/0xa0 [ 151.476792] __pwq_activate_work.isra.46+0x2f/0xa0 [ 151.477151] pwq_dec_nr_in_flight+0x1e0/0x2f0 [ 151.477479] process_scheduled_works+0x1c8/0x410 [ 151.477823] worker_thread+0x125/0x260 [ 151.478108] ? __pfx_worker_thread+0x10/0x10 [ 151.478430] kthread+0xfe/0x240 [ 151.478671] ? __pfx_kthread+0x10/0x10 [ 151.478955] ? __pfx_kthread+0x10/0x10 [ 151.479240] ret_from_fork+0x208/0x270 [ 151.479523] ? __pfx_kthread+0x10/0x10 [ 151.479806] ret_from_fork_asm+0x1a/0x30 [ 151.480103] </TASK> Fixes: `e1168f09b3` ("RDMA/iwcm: Simplify cm_event_handler()") Signed-off-by: Jacob Moroni <jmoroni@google.com> Link: https://patch.msgid.link/20260112020006.1352438-1-jmoroni@google.com Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Leon Romanovsky <leon@kernel.org>	2026-01-15 04:59:53 -05:00
Parav Pandit	8d466b155f	RDMA/core: Avoid exporting module local functions and remove not-used ones Some of the functions are local to the module and some are not used starting from commit `36783dec8d` ("RDMA/rxe: Delete deprecated module parameters interface"). Delete and avoid exporting them. Signed-off-by: Parav Pandit <parav@nvidia.com> Link: https://patch.msgid.link/20260104-ib-core-misc-v1-2-00367f77f3a8@nvidia.com Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev> Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>	2026-01-05 04:02:29 -05:00
Leon Romanovsky	ac7dea328a	RDMA/umem: Remove redundant DMABUF ops check ib_umem_dmabuf_get_with_dma_device() is an in-kernel function and does not require a defensive check for the .move_notify callback. All current callers guarantee that this callback is always present. Link: https://patch.msgid.link/20260104-ib-core-misc-v1-1-00367f77f3a8@nvidia.com Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>	2026-01-05 04:02:18 -05:00
Or Har-Toov	51a07ce2fe	IB/core: Add query_port_speed verb Add new ibv_query_port_speed() verb to enable applications to query the effective bandwidth of a port. This verb is particularly useful when the speed is not a multiplication of IB speed and width where width is 2^n. Signed-off-by: Or Har-Toov <ohartoov@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Edward Srouji <edwards@nvidia.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>	2026-01-05 02:43:05 -05:00
Or Har-Toov	d4adeff26c	IB/core: Refactor rate_show to use ib_port_attr_to_rate() Update sysfs rate_show() to rely on ib_port_attr_to_speed_info() for converting IB port speed and width attributes to data rate and speed string. Signed-off-by: Or Har-Toov <ohartoov@nvidia.com> Reviewed-by: Maher Sanalla <msanalla@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Edward Srouji <edwards@nvidia.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>	2026-01-05 02:42:59 -05:00
Or Har-Toov	2941abac6d	IB/core: Add helper to convert port attributes to data rate Introduce ib_port_attr_to_rate() to compute the data rate in 100 Mbps units (deci-Gb/sec) from a port's active_speed and active_width attributes. This generic helper removes duplicated speed-to-rate calculations, which are used by sysfs and the upcoming new verb. Signed-off-by: Or Har-Toov <ohartoov@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Edward Srouji <edwards@nvidia.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>	2026-01-05 02:42:53 -05:00
Or Har-Toov	263d1d9975	IB/core: Add async event on device speed change Add IB_EVENT_DEVICE_SPEED_CHANGE for notifying user applications on device's ports speed changes. Signed-off-by: Or Har-Toov <ohartoov@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Edward Srouji <edwards@nvidia.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>	2026-01-05 02:42:43 -05:00
Etienne AUJAMES	ddd6c8c873	IB/cache: update gid cache on client reregister event Some HCAs (e.g: ConnectX4) do not trigger a IB_EVENT_GID_CHANGE on subnet prefix update from SM (PortInfo). Since the commit `d58c23c925` ("IB/core: Only update PKEY and GID caches on respective events"), the GID cache is updated exclusively on IB_EVENT_GID_CHANGE. If this event is not emitted, the subnet prefix in the IPoIB interface’s hardware address remains set to its default value (0xfe80000000000000). Then rdma_bind_addr() failed because it relies on hardware address to find the port GID (subnet_prefix + port GUID). This patch fixes this issue by updating the GID cache on IB_EVENT_CLIENT_REREGISTER event (emitted on PortInfo::ClientReregister=1). Fixes: `d58c23c925` ("IB/core: Only update PKEY and GID caches on respective events") Signed-off-by: Etienne AUJAMES <eaujames@ddn.com> Link: https://patch.msgid.link/aVUfsO58QIDn5bGX@eaujamesFR0130 Reviewed-by: Parav Pandit <parav@nvidia.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>	2026-01-04 09:07:22 -05:00
Tetsuo Handa	fa3c411d21	RDMA/core: always drop device refcount in ib_del_sub_device_and_put() Since nldev_deldev() (introduced by commit `060c642b2a` ("RDMA/nldev: Add support to add/delete a sub IB device through netlink") grabs a reference using ib_device_get_by_index() before calling ib_del_sub_device_and_put(), we need to drop that reference before returning -EOPNOTSUPP error. Reported-by: syzbot+881d65229ca4f9ae8c84@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=881d65229ca4f9ae8c84 Fixes: `bca5119762` ("RDMA/core: Support IB sub device with type "SMI"") Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Link: https://patch.msgid.link/80749a85-cbe2-460c-8451-42516013f9fa@I-love.SAKURA.ne.jp Reviewed-by: Parav Pandit <parav@nvidia.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>	2025-12-21 06:45:17 -05:00
Jang Ingyu	8aaa848ead	RDMA/core: Fix logic error in ib_get_gids_from_rdma_hdr() Fix missing comparison operator for RDMA_NETWORK_ROCE_V1 in the conditional statement. The constant was used directly instead of being compared with net_type, causing the condition to always evaluate to true. Fixes: `1c15b4f2a4` ("RDMA/core: Modify enum ib_gid_type and enum rdma_network_type") Signed-off-by: Jang Ingyu <ingyujang25@korea.ac.kr> Link: https://patch.msgid.link/20251219041508.1725947-1-ingyujang25@korea.ac.kr Signed-off-by: Leon Romanovsky <leon@kernel.org>	2025-12-21 04:19:51 -05:00
Jason Gunthorpe	57f3cb6c84	RDMA/cm: Fix leaking the multicast GID table reference If the CM ID is destroyed while the CM event for multicast creating is still queued the cancel_work_sync() will prevent the work from running which also prevents destroying the ah_attr. This leaks a refcount and triggers a WARN: GID entry ref leak for dev syz1 index 2 ref=573 WARNING: CPU: 1 PID: 655 at drivers/infiniband/core/cache.c:809 release_gid_table drivers/infiniband/core/cache.c:806 [inline] WARNING: CPU: 1 PID: 655 at drivers/infiniband/core/cache.c:809 gid_table_release_one+0x284/0x3cc drivers/infiniband/core/cache.c:886 Destroy the ah_attr after canceling the work, it is safe to call this twice. Link: https://patch.msgid.link/r/0-v1-4285d070a6b2+20a-rdma_mc_gid_leak_syz_jgg@nvidia.com Cc: stable@vger.kernel.org Fixes: `fe454dc31e` ("RDMA/ucma: Fix use-after-free bug in ucma_create_uevent") Reported-by: syzbot+b0da83a6c0e2e2bddbd4@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/68232e7b.050a0220.f2294.09f6.GAE@google.com Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2025-12-16 21:32:13 -04:00
Jason Gunthorpe	a7b8e876e0	RDMA/core: Check for the presence of LS_NLA_TYPE_DGID correctly The netlink response for RDMA_NL_LS_OP_IP_RESOLVE should always have a LS_NLA_TYPE_DGID attribute, it is invalid if it does not. Use the nl parsing logic properly and call nla_parse_deprecated() to fill the nlattrs array and then directly index that array to get the data for the DGID. Just fail if it is NULL. Remove the for loop searching for the nla, and squash the validation and parsing into one function. Fixes an uninitialized read from the stack triggered by userspace if it does not provide the DGID to a kernel initiated RDMA_NL_LS_OP_IP_RESOLVE query. BUG: KMSAN: uninit-value in hex_byte_pack include/linux/hex.h:13 [inline] BUG: KMSAN: uninit-value in ip6_string+0xef4/0x13a0 lib/vsprintf.c:1490 hex_byte_pack include/linux/hex.h:13 [inline] ip6_string+0xef4/0x13a0 lib/vsprintf.c:1490 ip6_addr_string+0x18a/0x3e0 lib/vsprintf.c:1509 ip_addr_string+0x245/0xee0 lib/vsprintf.c:1633 pointer+0xc09/0x1bd0 lib/vsprintf.c:2542 vsnprintf+0xf8a/0x1bd0 lib/vsprintf.c:2930 vprintk_store+0x3ae/0x1530 kernel/printk/printk.c:2279 vprintk_emit+0x307/0xcd0 kernel/printk/printk.c:2426 vprintk_default+0x3f/0x50 kernel/printk/printk.c:2465 vprintk+0x36/0x50 kernel/printk/printk_safe.c:82 _printk+0x17e/0x1b0 kernel/printk/printk.c:2475 ib_nl_process_good_ip_rsep drivers/infiniband/core/addr.c:128 [inline] ib_nl_handle_ip_res_resp+0x963/0x9d0 drivers/infiniband/core/addr.c:141 rdma_nl_rcv_msg drivers/infiniband/core/netlink.c:-1 [inline] rdma_nl_rcv_skb drivers/infiniband/core/netlink.c:239 [inline] rdma_nl_rcv+0xefa/0x11c0 drivers/infiniband/core/netlink.c:259 netlink_unicast_kernel net/netlink/af_netlink.c:1320 [inline] netlink_unicast+0xf04/0x12b0 net/netlink/af_netlink.c:1346 netlink_sendmsg+0x10b3/0x1250 net/netlink/af_netlink.c:1896 sock_sendmsg_nosec net/socket.c:714 [inline] __sock_sendmsg+0x333/0x3d0 net/socket.c:729 ____sys_sendmsg+0x7e0/0xd80 net/socket.c:2617 ___sys_sendmsg+0x271/0x3b0 net/socket.c:2671 __sys_sendmsg+0x1aa/0x300 net/socket.c:2703 __compat_sys_sendmsg net/compat.c:346 [inline] __do_compat_sys_sendmsg net/compat.c:353 [inline] __se_compat_sys_sendmsg net/compat.c:350 [inline] __ia32_compat_sys_sendmsg+0xa4/0x100 net/compat.c:350 ia32_sys_call+0x3f6c/0x4310 arch/x86/include/generated/asm/syscalls_32.h:371 do_syscall_32_irqs_on arch/x86/entry/syscall_32.c:83 [inline] __do_fast_syscall_32+0xb0/0x150 arch/x86/entry/syscall_32.c:306 do_fast_syscall_32+0x38/0x80 arch/x86/entry/syscall_32.c:331 do_SYSENTER_32+0x1f/0x30 arch/x86/entry/syscall_32.c:3 Link: https://patch.msgid.link/r/0-v1-3fbaef094271+2cf-rdma_op_ip_rslv_syz_jgg@nvidia.com Cc: stable@vger.kernel.org Fixes: `ae43f82867` ("IB/core: Add IP to GID netlink offload") Reported-by: syzbot+938fcd548c303fe33c1a@syzkaller.appspotmail.com Closes: https://lore.kernel.org/r/68dc3dac.a00a0220.102ee.004f.GAE@google.com Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2025-12-16 21:30:56 -04:00
Linus Torvalds	55aa394a5e	Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma Pull rdma updates from Jason Gunthorpe: "This has another new RDMA driver 'bng_en' for latest generation Broadcom NICs. There might be one more new driver still to come. Otherwise it is a fairly quite cycle. Summary: - Minor driver bug fixes and updates to cxgb4, rxe, rdmavt, bnxt_re, mlx5 - Many bug fix patches for irdma - WQ_PERCPU annotations and system_dfl_wq changes - Improved mlx5 support for "other eswitches" and multiple PFs - 1600Gbps link speed reporting support. Four Digits Now! - New driver bng_en for latest generation Broadcom NICs - Bonding support for hns - Adjust mlx5's hmm based ODP to work with the very large address space created by the new 5 level paging default on x86 - Lockdep fixups in rxe and siw" * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (65 commits) RDMA/rxe: reclassify sockets in order to avoid false positives from lockdep RDMA/siw: reclassify sockets in order to avoid false positives from lockdep RDMA/bng_re: Remove prefetch instruction RDMA/core: Reduce cond_resched() frequency in __ib_umem_release RDMA/irdma: Fix SRQ shadow area address initialization RDMA/irdma: Remove doorbell elision logic RDMA/irdma: Do not set IBK_LOCAL_DMA_LKEY for GEN3+ RDMA/irdma: Do not directly rely on IB_PD_UNSAFE_GLOBAL_RKEY RDMA/irdma: Add missing mutex destroy RDMA/irdma: Fix SIGBUS in AEQ destroy RDMA/irdma: Add a missing kfree of struct irdma_pci_f for GEN2 RDMA/irdma: Fix data race in irdma_free_pble RDMA/irdma: Fix data race in irdma_sc_ccq_arm RDMA/mlx5: Add support for 1600_8x lane speed RDMA/core: Add new IB rate for XDR (8x) support IB/mlx5: Reduce IMR KSM size when 5-level paging is enabled RDMA/bnxt_re: Pass correct flag for dma mr creation RDMA/bnxt_re: Fix the inline size for GenP7 devices RDMA/hns: Support reset recovery for bond RDMA/hns: Support link state reporting for bond ...	2025-12-04 18:54:37 -08:00
Li RongQing	f37e286879	RDMA/core: Reduce cond_resched() frequency in __ib_umem_release The current implementation calls cond_resched() for every SG entry in __ib_umem_release(), which can increase needless overhead. This patch introduces RESCHED_LOOP_CNT_THRESHOLD (0x1000) to limit how often cond_resched() is called. The function now yields the CPU once every 4096 iterations, and yield at the very first iteration for lots of small umem case, to reduce scheduling overhead. Fixes: `d056bc45b6` ("RDMA/core: Prevent soft lockup during large user memory region cleanup") Signed-off-by: Li RongQing <lirongqing@baidu.com> Link: https://patch.msgid.link/20251126025147.2627-1-lirongqing@baidu.com Signed-off-by: Leon Romanovsky <leon@kernel.org>	2025-11-26 03:15:36 -05:00
Maher Sanalla	0f1f9b5e47	RDMA/core: Add new IB rate for XDR (8x) support Add the new rates as defined in the Infiniband spec for XDR and 8x link width support. Furthermore, modify the utility conversion methods accordingly. Reference: IB Spec Release 1.8 Reviewed-by: Michael Guralnik <michaelgur@nvidia.com> Signed-off-by: Maher Sanalla <msanalla@nvidia.com> Link: https://patch.msgid.link/20251120-speed-8-v1-1-e6a7efef8cb8@nvidia.com Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>	2025-11-24 02:58:30 -05:00
Li RongQing	d056bc45b6	RDMA/core: Prevent soft lockup during large user memory region cleanup When a process exits with numerous large, pinned memory regions consisting of 4KB pages, the cleanup of the memory region through __ib_umem_release() may cause soft lockups. This is because unpin_user_page_range_dirty_lock() is called in a tight loop for unpin and releasing page without yielding the CPU. watchdog: BUG: soft lockup - CPU#44 stuck for 26s! [python3:73464] Kernel panic - not syncing: softlockup: hung tasks CPU: 44 PID: 73464 Comm: python3 Tainted: G OEL asm_sysvec_apic_timer_interrupt+0x1b/0x20 RIP: 0010:free_unref_page+0xff/0x190 ? free_unref_page+0xe3/0x190 __put_page+0x77/0xe0 put_compound_head+0xed/0x100 unpin_user_page_range_dirty_lock+0xb2/0x180 __ib_umem_release+0x57/0xb0 [ib_core] ib_umem_release+0x3f/0xd0 [ib_core] mlx5_ib_dereg_mr+0x2e9/0x440 [mlx5_ib] ib_dereg_mr_user+0x43/0xb0 [ib_core] uverbs_free_mr+0x15/0x20 [ib_uverbs] destroy_hw_idr_uobject+0x21/0x60 [ib_uverbs] uverbs_destroy_uobject+0x38/0x1b0 [ib_uverbs] __uverbs_cleanup_ufile+0xd1/0x150 [ib_uverbs] uverbs_destroy_ufile_hw+0x3f/0x100 [ib_uverbs] ib_uverbs_close+0x1f/0xb0 [ib_uverbs] __fput+0x9c/0x280 ____fput+0xe/0x20 task_work_run+0x6a/0xb0 do_exit+0x217/0x3c0 do_group_exit+0x3b/0xb0 get_signal+0x150/0x900 arch_do_signal_or_restart+0xde/0x100 exit_to_user_mode_loop+0xc4/0x160 exit_to_user_mode_prepare+0xa0/0xb0 syscall_exit_to_user_mode+0x27/0x50 do_syscall_64+0x63/0xb0 Fix soft lockup issues by incorporating cond_resched() calls within __ib_umem_release(), and this SG entries are typically grouped in 2MB chunks on x86_64, adding cond_resched() should has minimal performance impact. Signed-off-by: Li RongQing <lirongqing@baidu.com> Link: https://patch.msgid.link/20251113095317.2628-1-lirongqing@baidu.com Acked-by: Junxian Huang <huangjunxian6@hisilicon.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>	2025-11-13 08:35:16 -05:00
Kalesh AP	d43358cda7	RDMA/restrack: Fix typos in the comments Fix couple of occurrences of the misspelled word "reource" in the comments with the correct spelling "resource". Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Link: https://patch.msgid.link/20251113105457.879903-1-kalesh-anakkur.purayil@broadcom.com Signed-off-by: Leon Romanovsky <leon@kernel.org>	2025-11-13 06:27:39 -05:00
Marco Crivellari	e60c5583b6	RDMA/core: WQ_PERCPU added to alloc_workqueue users Currently if a user enqueue a work item using schedule_delayed_work() the used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to schedule_work() that is using system_wq and queue_work(), that makes use again of WORK_CPU_UNBOUND. This lack of consistentcy cannot be addressed without refactoring the API. alloc_workqueue() treats all queues as per-CPU by default, while unbound workqueues must opt-in via WQ_UNBOUND. This default is suboptimal: most workloads benefit from unbound queues, allowing the scheduler to place worker threads where they’re needed and reducing noise when CPUs are isolated. This change adds a new WQ_PERCPU flag to explicitly request alloc_workqueue() to be per-cpu when WQ_UNBOUND has not been specified. With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND), any alloc_workqueue() caller that doesn’t explicitly specify WQ_UNBOUND must now use WQ_PERCPU. Once migration is complete, WQ_UNBOUND can be removed and unbound will become the implicit default. Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Marco Crivellari <marco.crivellari@suse.com> Link: https://patch.msgid.link/20251101163121.78400-3-marco.crivellari@suse.com Signed-off-by: Leon Romanovsky <leon@kernel.org>	2025-11-06 02:23:23 -05:00
Marco Crivellari	f673fb3449	RDMA/core: RDMA/mlx5: replace use of system_unbound_wq with system_dfl_wq Currently if a user enqueue a work item using schedule_delayed_work() the used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to schedule_work() that is using system_wq and queue_work(), that makes use again of WORK_CPU_UNBOUND. This lack of consistency cannot be addressed without refactoring the API. system_unbound_wq should be the default workqueue so as not to enforce locality constraints for random work whenever it's not required. Adding system_dfl_wq to encourage its use when unbound work should be used. The old system_unbound_wq will be kept for a few release cycles. Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Marco Crivellari <marco.crivellari@suse.com> Link: https://patch.msgid.link/20251101163121.78400-2-marco.crivellari@suse.com Signed-off-by: Leon Romanovsky <leon@kernel.org>	2025-11-06 02:23:23 -05:00
Håkon Bugge	58aca1f3de	RDMA/cm: Base cm_id destruction timeout on CMA values When a GSI MAD packet is sent on the QP, it will potentially be retried CMA_MAX_CM_RETRIES times with a timeout value of: 4.096usec * 2 ^ CMA_CM_RESPONSE_TIMEOUT The above equates to ~64 seconds using the default CMA values. The cm_id_priv's refcount will be incremented for this period. Therefore, the timeout value waiting for a cm_id destruction must be based on the effective timeout of MAD packets. To provide additional leeway, we add 25% to this timeout and use that instead of the constant 10 seconds timeout, which may result in false negatives. Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com> Link: https://patch.msgid.link/20251021132738.4179604-1-haakon.bugge@oracle.com Signed-off-by: Leon Romanovsky <leon@kernel.org>	2025-10-27 07:36:39 -04:00
Shuhao Fu	d8713158fa	RDMA/uverbs: Fix umem release in UVERBS_METHOD_CQ_CREATE In `UVERBS_METHOD_CQ_CREATE`, umem should be released if anything goes wrong. Currently, if `create_cq_umem` fails, umem would not be released or referenced, causing a possible leak. In this patch, we release umem at `UVERBS_METHOD_CQ_CREATE`, the driver should not release umem if it returns an error code. Fixes: `1a40c362ae` ("RDMA/uverbs: Add a common way to create CQ with umem") Signed-off-by: Shuhao Fu <sfual@cse.ust.hk> Link: https://patch.msgid.link/aOh1le4YqtYwj-hH@osx.local Signed-off-by: Leon Romanovsky <leon@kernel.org>	2025-10-19 07:31:25 -04:00
Stefan Metzmacher	879424832d	RDMA/core: let rdma_connect_locked() call lockdep_assert_held(&id_priv->handler_mutex) rdma_accept() also has this, so this is now more consistent and may prevent bugs in future. Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Leon Romanovsky <leon@kernel.org> Cc: linux-rdma@vger.kernel.org Signed-off-by: Stefan Metzmacher <metze@samba.org> Link: https://patch.msgid.link/20251008165913.444276-1-metze@samba.org Signed-off-by: Leon Romanovsky <leon@kernel.org>	2025-10-19 07:03:25 -04:00

1 2 3 4 5 ...

3311 Commits