Commit Graph

1396693 Commits

Author SHA1 Message Date
Siva Reddy Kallam
4f830cd8d7 RDMA/bng_re: Add infrastructure for enabling Firmware channel
Add infrastructure for enabling Firmware channel.

Signed-off-by: Siva Reddy Kallam <siva.kallam@broadcom.com>
Link: https://patch.msgid.link/20251117171136.128193-6-siva.kallam@broadcom.com
Reviewed-by: Usman Ansari <usman.ansari@broadcom.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-24 02:58:29 -05:00
Siva Reddy Kallam
53310b698f RDMA/bng_re: Allocate required memory resources for Firmware channel
Allocate required memory resources for Firmware channel.

Signed-off-by: Siva Reddy Kallam <siva.kallam@broadcom.com>
Link: https://patch.msgid.link/20251117171136.128193-5-siva.kallam@broadcom.com
Reviewed-by: Usman Ansari <usman.ansari@broadcom.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-24 02:58:29 -05:00
Siva Reddy Kallam
745065770c RDMA/bng_re: Register and get the resources from bnge driver
Register and get the basic required resources from bnge driver.

Signed-off-by: Siva Reddy Kallam <siva.kallam@broadcom.com>
Link: https://patch.msgid.link/20251117171136.128193-4-siva.kallam@broadcom.com
Reviewed-by: Usman Ansari <usman.ansari@broadcom.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-24 02:58:29 -05:00
Siva Reddy Kallam
d0da769c19 RDMA/bng_re: Add Auxiliary interface
Add basic Auxiliary interface to the driver which supports
the BCM5770X NIC family.

Signed-off-by: Siva Reddy Kallam <siva.kallam@broadcom.com>
Link: https://patch.msgid.link/20251117171136.128193-3-siva.kallam@broadcom.com
Reviewed-by: Usman Ansari <usman.ansari@broadcom.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-24 02:58:29 -05:00
Vikas Gupta
8ac050ec3b bng_en: Add RoCE aux device support
Add an auxiliary (aux) device to support RoCE. The base driver is
responsible for creating the auxiliary device and allocating the
required resources to it, which will be owned by the bnge RoCE
driver in future patches.

Signed-off-by: Vikas Gupta <vikas.gupta@broadcom.com>
Link: https://patch.msgid.link/20251117171136.128193-2-siva.kallam@broadcom.com
Reviewed-by: Siva Reddy Kallam <siva.kallam@broadcom.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-20 10:35:56 -05:00
Kalesh AP
fecaa0c74f RDMA/bnxt_re: Fix wrong check for CQ coalesc support
Driver is not creating the debugfs hooks for CQ coalesc parameters
because of a wrong check. Fixed the condition check inside
bnxt_re_init_cq_coal_debugfs().

Fixes: cf27490790 ("RDMA/bnxt_re: Add a debugfs entry for CQE coalescing tuning")
Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Link: https://patch.msgid.link/20251117061306.1140588-1-kalesh-anakkur.purayil@broadcom.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-20 10:30:52 -05:00
Li RongQing
d056bc45b6 RDMA/core: Prevent soft lockup during large user memory region cleanup
When a process exits with numerous large, pinned memory regions consisting
of 4KB pages, the cleanup of the memory region through __ib_umem_release()
may cause soft lockups. This is because unpin_user_page_range_dirty_lock()
is called in a tight loop for unpin and releasing page without yielding the
CPU.

 watchdog: BUG: soft lockup - CPU#44 stuck for 26s! [python3:73464]
 Kernel panic - not syncing: softlockup: hung tasks
 CPU: 44 PID: 73464 Comm: python3 Tainted: G           OEL

 asm_sysvec_apic_timer_interrupt+0x1b/0x20
 RIP: 0010:free_unref_page+0xff/0x190

  ? free_unref_page+0xe3/0x190
  __put_page+0x77/0xe0
  put_compound_head+0xed/0x100
  unpin_user_page_range_dirty_lock+0xb2/0x180
  __ib_umem_release+0x57/0xb0 [ib_core]
  ib_umem_release+0x3f/0xd0 [ib_core]
  mlx5_ib_dereg_mr+0x2e9/0x440 [mlx5_ib]
  ib_dereg_mr_user+0x43/0xb0 [ib_core]
  uverbs_free_mr+0x15/0x20 [ib_uverbs]
  destroy_hw_idr_uobject+0x21/0x60 [ib_uverbs]
  uverbs_destroy_uobject+0x38/0x1b0 [ib_uverbs]
  __uverbs_cleanup_ufile+0xd1/0x150 [ib_uverbs]
  uverbs_destroy_ufile_hw+0x3f/0x100 [ib_uverbs]
  ib_uverbs_close+0x1f/0xb0 [ib_uverbs]
  __fput+0x9c/0x280
  ____fput+0xe/0x20
  task_work_run+0x6a/0xb0
  do_exit+0x217/0x3c0
  do_group_exit+0x3b/0xb0
  get_signal+0x150/0x900
  arch_do_signal_or_restart+0xde/0x100
  exit_to_user_mode_loop+0xc4/0x160
  exit_to_user_mode_prepare+0xa0/0xb0
  syscall_exit_to_user_mode+0x27/0x50
  do_syscall_64+0x63/0xb0

Fix soft lockup issues by incorporating cond_resched() calls within
__ib_umem_release(), and this SG entries are typically grouped in 2MB
chunks on x86_64, adding cond_resched() should has minimal performance
impact.

Signed-off-by: Li RongQing <lirongqing@baidu.com>
Link: https://patch.msgid.link/20251113095317.2628-1-lirongqing@baidu.com
Acked-by: Junxian Huang <huangjunxian6@hisilicon.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-13 08:35:16 -05:00
Kalesh AP
d43358cda7 RDMA/restrack: Fix typos in the comments
Fix couple of occurrences of the misspelled word "reource"
in the comments with the correct spelling "resource".

Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Link: https://patch.msgid.link/20251113105457.879903-1-kalesh-anakkur.purayil@broadcom.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-13 06:27:39 -05:00
Tuo Li
a49a9f4555 RDMA/irdma: Remove redundant NULL check of udata in irdma_create_user_ah()
The variable udata cannot be NULL because irdma_create_user_ah() always
receives it. Therefore, the if() check can be safely removed.

Signed-off-by: Tuo Li <islituo@gmail.com>
Link: https://patch.msgid.link/20251112120253.68945-1-islituo@gmail.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-12 07:10:25 -05:00
Randy Dunlap
4e5cba5bb6 RDMA/cm: Correct typedef and bad line warnings
In include/rdma/ib_cm.h:

Correct a typedef's kernel-doc notation by adding the 'typedef' keyword
to it to avoid a warning.
Add a leading " *" to a kernel-doc line to avoid a warning.

Warning: ib_cm.h:289 function parameter 'ib_cm_handler' not described
 in 'int'
Warning: ib_cm.h:289 expecting prototype for ib_cm_handler().  Prototype
 was for int() instead
Warning: ib_cm.h:484 bad line: connection message in case duplicates
 are received.

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Link: https://patch.msgid.link/20251112062908.2711007-1-rdunlap@infradead.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-12 04:39:12 -05:00
Leon Romanovsky
736c561950 Expose definition for 1600Gbps link mode
Single patch to expose new link mode for 1600Gbps, utilizing 8 lanes at
200Gbps per lane.

Signed-off-by: Leon Romanovsky <leon@kernel.org>

* mlx5-next:
  net/mlx5: Expose definition for 1600Gbps link mode
2025-11-12 03:52:18 -05:00
Tariq Toukan
5422318e27 net/mlx5: Expose definition for 1600Gbps link mode
This patch exposes new link mode for 1600Gbps, utilizing 8 lanes at
200Gbps per lane.

Co-developed-by: Yael Chemla <ychemla@nvidia.com>
Reviewed-by: Shahar Shitrit <shshitrit@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/1762863888-1092798-1-git-send-email-tariqt@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-12 03:35:14 -05:00
Ma Ke
a338d6e849 RDMA/rtrs: server: Fix error handling in get_or_create_srv
After device_initialize() is called, use put_device() to release the
device according to kernel device management rules. While direct
kfree() work in this case, using put_device() is more correct.

Found by code review.

Fixes: 9cb8374804 ("RDMA/rtrs: server: main functionality")
Signed-off-by: Ma Ke <make24@iscas.ac.cn>
Link: https://patch.msgid.link/20251110005158.13394-1-make24@iscas.ac.cn
Acked-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-10 03:09:41 -05:00
Marco Crivellari
5c467151f6 IB/isert: add WQ_PERCPU to alloc_workqueue users
Currently if a user enqueues a work item using schedule_delayed_work() the
used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use
WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to
schedule_work() that is using system_wq and queue_work(), that makes use
again of WORK_CPU_UNBOUND.
This lack of consistency cannot be addressed without refactoring the API.

alloc_workqueue() treats all queues as per-CPU by default, while unbound
workqueues must opt-in via WQ_UNBOUND.

This default is suboptimal: most workloads benefit from unbound queues,
allowing the scheduler to place worker threads where they’re needed and
reducing noise when CPUs are isolated.

This continues the effort to refactor workqueue APIs, which began with
the introduction of new workqueues and a new alloc_workqueue flag in:

commit 128ea9f6cc ("workqueue: Add system_percpu_wq and system_dfl_wq")
commit 930c2ea566 ("workqueue: Add new WQ_PERCPU flag")

This change adds a new WQ_PERCPU flag to explicitly request
alloc_workqueue() to be per-cpu when WQ_UNBOUND has not been specified.

With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND),
any alloc_workqueue() caller that doesn’t explicitly specify WQ_UNBOUND
must now use WQ_PERCPU.

Once migration is complete, WQ_UNBOUND can be removed and unbound will
become the implicit default.

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Link: https://patch.msgid.link/20251107133626.190952-1-marco.crivellari@suse.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-09 06:30:58 -05:00
Marco Crivellari
65d21dee53 IB/iser: add WQ_PERCPU to alloc_workqueue users
Currently if a user enqueues a work item using schedule_delayed_work() the
used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use
WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to
schedule_work() that is using system_wq and queue_work(), that makes use
again of WORK_CPU_UNBOUND.
This lack of consistency cannot be addressed without refactoring the API.

alloc_workqueue() treats all queues as per-CPU by default, while unbound
workqueues must opt-in via WQ_UNBOUND.

This default is suboptimal: most workloads benefit from unbound queues,
allowing the scheduler to place worker threads where they’re needed and
reducing noise when CPUs are isolated.

This continues the effort to refactor workqueue APIs, which began with
the introduction of new workqueues and a new alloc_workqueue flag in:

commit 128ea9f6cc ("workqueue: Add system_percpu_wq and system_dfl_wq")
commit 930c2ea566 ("workqueue: Add new WQ_PERCPU flag")

This change adds a new WQ_PERCPU flag to explicitly request
alloc_workqueue() to be per-cpu when WQ_UNBOUND has not been specified.

With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND),
any alloc_workqueue() caller that doesn’t explicitly specify WQ_UNBOUND
must now use WQ_PERCPU.

Once migration is complete, WQ_UNBOUND can be removed and unbound will
become the implicit default.

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Link: https://patch.msgid.link/20251107133306.187939-1-marco.crivellari@suse.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-09 06:30:48 -05:00
Jacob Moroni
5dd68a5914 RDMA/irdma: Remove unused CQ registry
The CQ registry was never actually used (ceq->reg_cq was always NULL),
so remove the dead code.

Signed-off-by: Jacob Moroni <jmoroni@google.com>
Link: https://patch.msgid.link/20251105162841.31786-1-jmoroni@google.com
Acked-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
2025-11-09 06:13:57 -05:00
Patrisious Haddad
6e79e21005 RDMA/mlx5: Add other eswitch support to userspace tables
Allows the creation of RDMA TRANSPORT tables over VFs/SFs that
belong to another eswitch manager. Which is only possible for PFs that
were connected via a create_lag PRM command.

Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Signed-off-by: Edward Srouji <edwards@nvidia.com>
Link: https://patch.msgid.link/20251029-support-other-eswitch-v1-7-98bb707b5d57@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-09 05:21:46 -05:00
Patrisious Haddad
f277662b73 RDMA/mlx5: Refactor _get_prio() function
Refactor the _get_prio() function to remove redundant arguments by
reusing the existing flow table attributes struct instead of passing
attributes separately. This improves code clarity and maintainability.

In addition allows downstream patch to add new parameter without
needing to change __get_prio() arguments.

Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Signed-off-by: Edward Srouji <edwards@nvidia.com>
Link: https://patch.msgid.link/20251029-support-other-eswitch-v1-6-98bb707b5d57@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-09 05:21:42 -05:00
Patrisious Haddad
5939decc64 RDMA/mlx5: Add other_eswitch support for devx destruction
When building a devx object destruction command for steering objects add
consideration for other_eswitch argument to allow proper destruction for
objects that were created with it.

Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Edward Srouji <edwards@nvidia.com>
Link: https://patch.msgid.link/20251029-support-other-eswitch-v1-5-98bb707b5d57@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-09 05:21:38 -05:00
Patrisious Haddad
3506242da0 RDMA/mlx5: Change default device for LAG slaves in RDMA TRANSPORT namespaces
In case of a LAG configuration change the root namespace core device for
all of the LAG slaves to be the core device of the master device for
RDMA_TRANSPORT namespaces, in order to ensure all tables are created
through the master device.
Once the LAG is disabled revert back to the native core device.

Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Signed-off-by: Edward Srouji <edwards@nvidia.com>
Link: https://patch.msgid.link/20251029-support-other-eswitch-v1-4-98bb707b5d57@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-09 05:21:33 -05:00
Leon Romanovsky
d06ccdc952 Add other eswitch support
When the device in switchdev mode, the RDMA device manages all the
vports which belong to its representors, which can lead to a situation
where the PF that is used to manage the RDMA device isn't the native PF
of some of the vports it manages.

Add infrastructure to allow the master PF to manage all the hardware
resources for the vports under its management.
Whereas currently the only such resource is RDMA TRANSPORT steering
domains.

That is done by adding new FW argument other_eswitch which is passed by
the driver to the FW to allow the master PF to properly manage vports
belonging to other native PF.

Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-09 05:17:37 -05:00
Patrisious Haddad
583b4fe1c1 net/mlx5: fs, set non default device per namespace
Add mlx5_fs_set_root_dev() function which swaps the root namespace
core device with another one for a given table_type.

It is intended for usage only by RDMA_TRANSPORT tables in case of LAG
configuration, to allow the creation of tables during LAG always
through the LAG master device, which is valid since during LAG the
master is allowed to manage the RDMA_TRANSPORT tables of its slaves.

In addition move the table_type enum to global include to allow its use
in a downstream patch in the RDMA driver.

Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Signed-off-by: Edward Srouji <edwards@nvidia.com>
Link: https://patch.msgid.link/20251029-support-other-eswitch-v1-3-98bb707b5d57@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-09 05:16:58 -05:00
Patrisious Haddad
3b848dec7e net/mlx5: fs, Add other_eswitch support for steering tables
Add other_eswitch support which allows flow tables creation above vports
that reside on different esw managers.

The new flag MLX5_FLOW_TABLE_OTHER_ESWITCH indicates if the
esw_owner_vhca_id attribute is supported.

Note that this is only supported if the Advanced-RDMA cap-
rdma_transport_manager_other_eswitch is set.
And it is the caller responsibility to check that.

Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Signed-off-by: Edward Srouji <edwards@nvidia.com>
Link: https://patch.msgid.link/20251029-support-other-eswitch-v1-2-98bb707b5d57@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-09 05:16:53 -05:00
Patrisious Haddad
6948417b3f net/mlx5: Add OTHER_ESWITCH HW capabilities
Add OTHER_ESWITCH capabilities which includes other_eswitch and
eswitch_owner_vhca_id to all steering objects.

Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Signed-off-by: Edward Srouji <edwards@nvidia.com>
Link: https://patch.msgid.link/20251029-support-other-eswitch-v1-1-98bb707b5d57@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-09 05:16:47 -05:00
Yishai Hadas
2d838c11e1 net/mlx5: Add direct ST mode support for RDMA
Add support for direct ST mode where ST Table Location equals
PCI_TPH_LOC_NONE.

In that case, no steering table exists, the steering tag itself will be
used directly by the SW, FW, HW from the mkey.

This enables RDMA users to use the current exposed APIs to work in
direct mode.

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Edward Srouji <edwards@nvidia.com>
Link: https://patch.msgid.link/20251027-st-direct-mode-v1-2-e0ad953866b6@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-09 05:13:34 -05:00
Yishai Hadas
7b8a8ec20c PCI/TPH: Expose pcie_tph_get_st_table_loc()
Expose pcie_tph_get_st_table_loc() to be used by drivers as will be done
in the next patch from the series.

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Edward Srouji <edwards@nvidia.com>
Link: https://patch.msgid.link/20251027-st-direct-mode-v1-1-e0ad953866b6@nvidia.com
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-09 05:13:02 -05:00
Kalesh AP
cf27490790 RDMA/bnxt_re: Add a debugfs entry for CQE coalescing tuning
This patch adds debugfs interfaces that allows the user to
enable/disable the RoCE CQ coalescing and fine tune certain
CQ coalescing parameters which would be helpful during debug.

Signed-off-by: Damodharam Ammepalli <damodharam.ammepalli@broadcom.com>
Reviewed-by: Hongguang Gao <hongguang.gao@broadcom.com>
Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Link: https://patch.msgid.link/20251103043425.234846-1-kalesh-anakkur.purayil@broadcom.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-09 04:02:27 -05:00
Randy Dunlap
512c832657 IB/rdmavt: rdmavt_qp.h: clean up kernel-doc comments
Correct the kernel-doc comments format to avoid around 35 kernel-doc
warnings:

- use struct keyword to introduce struct kernel-doc comments
- use correct variable name for some struct members
- use correct function name in comments for some functions
- fix spelling in a few comments
- use a ':' instead of '-' to separate struct members from their
  descriptions
- add a function name heading for rvt_div_mtu()

This leaves one struct member that is not described:
rdmavt_qp.h:206: warning: Function parameter or struct member 'wq'
 not described in 'rvt_krwq'

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Link: https://patch.msgid.link/20251105045127.106822-1-rdunlap@infradead.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-06 02:23:23 -05:00
Marco Crivellari
7196156b0c IB/rdmavt: WQ_PERCPU added to alloc_workqueue users
Currently if a user enqueue a work item using schedule_delayed_work() the
used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use
WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to
schedule_work() that is using system_wq and queue_work(), that makes use
again of WORK_CPU_UNBOUND.
This lack of consistentcy cannot be addressed without refactoring the API.

alloc_workqueue() treats all queues as per-CPU by default, while unbound
workqueues must opt-in via WQ_UNBOUND.

This default is suboptimal: most workloads benefit from unbound queues,
allowing the scheduler to place worker threads where they’re needed and
reducing noise when CPUs are isolated.

This change adds a new WQ_PERCPU flag to explicitly request
alloc_workqueue() to be per-cpu when WQ_UNBOUND has not been specified.

With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND),
any alloc_workqueue() caller that doesn’t explicitly specify WQ_UNBOUND
must now use WQ_PERCPU.

Once migration is complete, WQ_UNBOUND can be removed and unbound will
become the implicit default.

CC: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Link: https://patch.msgid.link/20251101163121.78400-6-marco.crivellari@suse.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-06 02:23:23 -05:00
Marco Crivellari
5267feda50 RDMA/mlx4: WQ_PERCPU added to alloc_workqueue users
Currently if a user enqueue a work item using schedule_delayed_work() the
used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use
WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to
schedule_work() that is using system_wq and queue_work(), that makes use
again of WORK_CPU_UNBOUND.
This lack of consistentcy cannot be addressed without refactoring the API.

alloc_workqueue() treats all queues as per-CPU by default, while unbound
workqueues must opt-in via WQ_UNBOUND.

This default is suboptimal: most workloads benefit from unbound queues,
allowing the scheduler to place worker threads where they’re needed and
reducing noise when CPUs are isolated.

This change adds a new WQ_PERCPU flag to explicitly request
alloc_workqueue() to be per-cpu when WQ_UNBOUND has not been specified.

With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND),
any alloc_workqueue() caller that doesn’t explicitly specify WQ_UNBOUND
must now use WQ_PERCPU.

Once migration is complete, WQ_UNBOUND can be removed and unbound will
become the implicit default.

CC: Yishai Hadas <yishaih@nvidia.com>
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Link: https://patch.msgid.link/20251101163121.78400-5-marco.crivellari@suse.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-06 02:23:23 -05:00
Marco Crivellari
5f93287fa9 hfi1: WQ_PERCPU added to alloc_workqueue users
Currently if a user enqueue a work item using schedule_delayed_work() the
used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use
WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to
schedule_work() that is using system_wq and queue_work(), that makes use
again of WORK_CPU_UNBOUND.
This lack of consistentcy cannot be addressed without refactoring the API.

alloc_workqueue() treats all queues as per-CPU by default, while unbound
workqueues must opt-in via WQ_UNBOUND.

This default is suboptimal: most workloads benefit from unbound queues,
allowing the scheduler to place worker threads where they’re needed and
reducing noise when CPUs are isolated.

This change adds a new WQ_PERCPU flag to explicitly request
alloc_workqueue() to be per-cpu when WQ_UNBOUND has not been specified.

With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND),
any alloc_workqueue() caller that doesn’t explicitly specify WQ_UNBOUND
must now use WQ_PERCPU.

Once migration is complete, WQ_UNBOUND can be removed and unbound will
become the implicit default.

CC: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Link: https://patch.msgid.link/20251101163121.78400-4-marco.crivellari@suse.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-06 02:23:23 -05:00
Marco Crivellari
e60c5583b6 RDMA/core: WQ_PERCPU added to alloc_workqueue users
Currently if a user enqueue a work item using schedule_delayed_work() the
used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use
WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to
schedule_work() that is using system_wq and queue_work(), that makes use
again of WORK_CPU_UNBOUND.
This lack of consistentcy cannot be addressed without refactoring the API.

alloc_workqueue() treats all queues as per-CPU by default, while unbound
workqueues must opt-in via WQ_UNBOUND.

This default is suboptimal: most workloads benefit from unbound queues,
allowing the scheduler to place worker threads where they’re needed and
reducing noise when CPUs are isolated.

This change adds a new WQ_PERCPU flag to explicitly request
alloc_workqueue() to be per-cpu when WQ_UNBOUND has not been specified.

With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND),
any alloc_workqueue() caller that doesn’t explicitly specify WQ_UNBOUND
must now use WQ_PERCPU.

Once migration is complete, WQ_UNBOUND can be removed and unbound will
become the implicit default.

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Link: https://patch.msgid.link/20251101163121.78400-3-marco.crivellari@suse.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-06 02:23:23 -05:00
Marco Crivellari
f673fb3449 RDMA/core: RDMA/mlx5: replace use of system_unbound_wq with system_dfl_wq
Currently if a user enqueue a work item using schedule_delayed_work() the
used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use
WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to
schedule_work() that is using system_wq and queue_work(), that makes use
again of WORK_CPU_UNBOUND.

This lack of consistency cannot be addressed without refactoring the API.

system_unbound_wq should be the default workqueue so as not to enforce
locality constraints for random work whenever it's not required.

Adding system_dfl_wq to encourage its use when unbound work should be used.

The old system_unbound_wq will be kept for a few release cycles.

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Link: https://patch.msgid.link/20251101163121.78400-2-marco.crivellari@suse.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-06 02:23:23 -05:00
Jay Bhat
da58d4223b RDMA/irdma: Take a lock before moving SRQ tail in poll_cq
Need to take an SRQ lock in poll_cq before moving SRQ tail.

Signed-off-by: Jay Bhat <jay.bhat@intel.com>
Reviewed-by: Jacob Moroni <jmoroni@google.com>
Signed-off-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com>
Link: https://patch.msgid.link/20251031021726.1003-7-tatyana.e.nikolova@intel.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-06 02:23:23 -05:00
Jay Bhat
0a19274555 RDMA/irdma: CQ size and shadow update changes for GEN3
CQ shadow area should not be updated at the end of a page (once every
64th CQ entry), except when CQ has no more CQEs. SW must also increase
the requested CQ size by 1 and make sure the CQ is not exactly one page
in size. This is to address a quirk in the hardware.

Signed-off-by: Jay Bhat <jay.bhat@intel.com>
Signed-off-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com>
Link: https://patch.msgid.link/20251031021726.1003-4-tatyana.e.nikolova@intel.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-02 06:52:58 -05:00
Jay Bhat
cd84d8001e RDMA/irdma: Silently consume unsignaled completions
In case we get an unsignaled error completion, we silently consume the CQE by
pretending the QP does not exist. Without this, bookkeeping for signaled
completions does not work correctly.

Signed-off-by: Jay Bhat <jay.bhat@intel.com>
Signed-off-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com>
Link: https://patch.msgid.link/20251031021726.1003-5-tatyana.e.nikolova@intel.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-02 06:52:51 -05:00
Jay Bhat
153243086e RDMA/irdma: Initialize cqp_cmds_info to prevent resource leaks
Failure to initialize info.create field to false in certain cases
was resulting in incorrect status code going to rdma-core when dereg_mr
failed during reset.  To fix this, memset entire cqp_request->info
in irdma_alloc_and_get_cqp_request() function, so that this is not spread
all over the code.

Signed-off-by: Bhat, Jay <jay.bhat@intel.com>
Reviewed-by: Jacob Moroni <jmoroni@google.com>
Signed-off-by: Krzysztof Czurylo <krzysztof.czurylo@intel.com>
Signed-off-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com>
Link: https://patch.msgid.link/20251031021726.1003-2-tatyana.e.nikolova@intel.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-02 06:52:44 -05:00
Jacob Moroni
69e8e429bc RDMA/irdma: Enforce local fence for LOCAL_INV WRs
Enforce local fence for LOCAL_INV WRs to
avoid spurious FASTREG_VALID_MKEY async events
during heavy invalidation/registration activity.

Signed-off-by: Jacob Moroni <jmoroni@google.com>
Signed-off-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com>
Link: https://patch.msgid.link/20251031021726.1003-3-tatyana.e.nikolova@intel.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-02 06:52:33 -05:00
Zhu Yanjun
503a5e4690 RDMA/rxe: Fix null deref on srq->rq.queue after resize failure
A NULL pointer dereference can occur in rxe_srq_chk_attr() when
ibv_modify_srq() is invoked twice in succession under certain error
conditions. The first call may fail in rxe_queue_resize(), which leads
rxe_srq_from_attr() to set srq->rq.queue = NULL. The second call then
triggers a crash (null deref) when accessing
srq->rq.queue->buf->index_mask.

Call Trace:
<TASK>
rxe_modify_srq+0x170/0x480 [rdma_rxe]
? __pfx_rxe_modify_srq+0x10/0x10 [rdma_rxe]
? uverbs_try_lock_object+0x4f/0xa0 [ib_uverbs]
? rdma_lookup_get_uobject+0x1f0/0x380 [ib_uverbs]
ib_uverbs_modify_srq+0x204/0x290 [ib_uverbs]
? __pfx_ib_uverbs_modify_srq+0x10/0x10 [ib_uverbs]
? tryinc_node_nr_active+0xe6/0x150
? uverbs_fill_udata+0xed/0x4f0 [ib_uverbs]
ib_uverbs_handler_UVERBS_METHOD_INVOKE_WRITE+0x2c0/0x470 [ib_uverbs]
? __pfx_ib_uverbs_handler_UVERBS_METHOD_INVOKE_WRITE+0x10/0x10 [ib_uverbs]
? uverbs_fill_udata+0xed/0x4f0 [ib_uverbs]
ib_uverbs_run_method+0x55a/0x6e0 [ib_uverbs]
? __pfx_ib_uverbs_handler_UVERBS_METHOD_INVOKE_WRITE+0x10/0x10 [ib_uverbs]
ib_uverbs_cmd_verbs+0x54d/0x800 [ib_uverbs]
? __pfx_ib_uverbs_cmd_verbs+0x10/0x10 [ib_uverbs]
? __pfx___raw_spin_lock_irqsave+0x10/0x10
? __pfx_do_vfs_ioctl+0x10/0x10
? ioctl_has_perm.constprop.0.isra.0+0x2c7/0x4c0
? __pfx_ioctl_has_perm.constprop.0.isra.0+0x10/0x10
ib_uverbs_ioctl+0x13e/0x220 [ib_uverbs]
? __pfx_ib_uverbs_ioctl+0x10/0x10 [ib_uverbs]
__x64_sys_ioctl+0x138/0x1c0
do_syscall_64+0x82/0x250
? fdget_pos+0x58/0x4c0
? ksys_write+0xf3/0x1c0
? __pfx_ksys_write+0x10/0x10
? do_syscall_64+0xc8/0x250
? __pfx_vm_mmap_pgoff+0x10/0x10
? fget+0x173/0x230
? fput+0x2a/0x80
? ksys_mmap_pgoff+0x224/0x4c0
? do_syscall_64+0xc8/0x250
? do_user_addr_fault+0x37b/0xfe0
? clear_bhb_loop+0x50/0xa0
? clear_bhb_loop+0x50/0xa0
? clear_bhb_loop+0x50/0xa0
entry_SYSCALL_64_after_hwframe+0x76/0x7e

Fixes: 8700e3e7c4 ("Soft RoCE driver")
Tested-by: Liu Yi <asatsuyu.liu@gmail.com>
Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Link: https://patch.msgid.link/20251027215203.1321-1-yanjun.zhu@linux.dev
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-10-28 04:04:07 -04:00
Håkon Bugge
58aca1f3de RDMA/cm: Base cm_id destruction timeout on CMA values
When a GSI MAD packet is sent on the QP, it will potentially be
retried CMA_MAX_CM_RETRIES times with a timeout value of:

    4.096usec * 2 ^ CMA_CM_RESPONSE_TIMEOUT

The above equates to ~64 seconds using the default CMA values.

The cm_id_priv's refcount will be incremented for this period.
Therefore, the timeout value waiting for a cm_id destruction must be
based on the effective timeout of MAD packets.  To provide additional
leeway, we add 25% to this timeout and use that instead of the
constant 10 seconds timeout, which may result in false negatives.

Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
Link: https://patch.msgid.link/20251021132738.4179604-1-haakon.bugge@oracle.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-10-27 07:36:39 -04:00
Adithya Jayachandran
eea31f21dc {rdma,net}/mlx5: Query vports mac address from device
Before this patch during either switchdev or legacy mode enablement we
cleared the mac address of vports between changes. This change allows us
to preserve the vports mac address between eswitch mode changes.

Vports hold information for VFs/SFs such as the permanent mac address.
VF/SF mac can be set either by iproute vf interface or devlink function
interface. For no obvious reason we reset it to 0 on switchdev/legacy
mode changes, this patch is fixing that, to align with other vport
information that are never reset, e.g GUID,mtu,promisc mode, etc ..

Signed-off-by: Adithya Jayachandran <ajayachandra@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Acked-by: Leon Romanovsky <leon@kernel.org> # RDMA
2025-10-24 20:16:01 -07:00
Randy Dunlap
be180c847a RDMA/uverbs: fix some kernel-doc warnings
Fix 49 kernel-doc warnings in ib_verbs.h:

- Add struct short description for rdma_stat_desc, rdma_hw_stats.
- Fix kernel-doc format for struct members (use ':' instead of '-') for
  several structs.
- Don't use "/**" kernel-doc notation for struct members in ib_device_ops
  (most members are not documented and most of the kernel-doc was
  not formatted correctly).
- Spell function parameters correctly in ib_dma_map_sgtable_attrs(),
  ib_device_try_get(), rdma_roce_rescan_device().
- Add kernel-doc for the function parameter in
  rdma_flow_label_to_udp_sport().

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Link: https://patch.msgid.link/20251020034320.3011094-1-rdunlap@infradead.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-10-20 11:45:46 -04:00
Colin Ian King
1511efaca0 RDMA/rxe: Remove redundant assignment to variable page_offset
The variable page_offset is being assigned a value at the start of
a loop and being redundantly zero'd at the end of the loop, there
is no code that reads the zero'd value. The assignment is redundant
and can be removed.

Signed-off-by: Colin Ian King <coking@nvidia.com>
Link: https://patch.msgid.link/20251014120343.2528608-1-coking@nvidia.com
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-10-19 08:06:21 -04:00
Stefan Metzmacher
879424832d RDMA/core: let rdma_connect_locked() call lockdep_assert_held(&id_priv->handler_mutex)
rdma_accept() also has this, so this is now more consistent and may
prevent bugs in future.

Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: linux-rdma@vger.kernel.org
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Link: https://patch.msgid.link/20251008165913.444276-1-metze@samba.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-10-19 07:03:25 -04:00
Alok Tiwari
cfd51fcf11 RDMA/cxgb4: fix typo in write_pbl() debug message
Correct the debug log format string from "pdb_addr" -> "pbl_addr"
to match the actual variable name.

Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com>
Link: https://patch.msgid.link/20250929091102.50384-1-alok.a.tiwari@oracle.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-10-19 06:51:46 -04:00
Linus Torvalds
3a86608788 Linux 6.18-rc1 v6.18-rc1 2025-10-12 13:42:36 -07:00
Linus Torvalds
3dd7b81235 Merge tag 'i2c-for-6.18-rc1-hotfix' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux
Pull i2c fix from Wolfram Sang:
 "One revert because of a regression in the I2C core which has sadly not
  showed up during its time in -next"

* tag 'i2c-for-6.18-rc1-hotfix' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
  Revert "i2c: boardinfo: Annotate code used in init phase only"
2025-10-12 13:27:56 -07:00
Linus Torvalds
8765f46791 Merge tag 'irq_urgent_for_v6.18_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fixes from Borislav Petkov:

 - Skip interrupt ID 0 in sifive-plic during suspend/resume because
   ID 0 is reserved and accessing reserved register space could result
   in undefined behavior

 - Fix a function's retval check in aspeed-scu-ic

* tag 'irq_urgent_for_v6.18_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  irqchip/sifive-plic: Avoid interrupt ID 0 handling during suspend/resume
  irqchip/aspeed-scu-ic: Fix an IS_ERR() vs NULL check
2025-10-12 08:45:52 -07:00
Linus Torvalds
67029a49db Merge tag 'trace-v6.18-3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fixes from Steven Rostedt:
 "The previous fix to trace_marker required updating trace_marker_raw as
  well. The difference between trace_marker_raw from trace_marker is
  that the raw version is for applications to write binary structures
  directly into the ring buffer instead of writing ASCII strings. This
  is for applications that will read the raw data from the ring buffer
  and get the data structures directly. It's a bit quicker than using
  the ASCII version.

  Unfortunately, it appears that our test suite has several tests that
  test writes to the trace_marker file, but lacks any tests to the
  trace_marker_raw file (this needs to be remedied). Two issues came
  about the update to the trace_marker_raw file that syzbot found:

   - Fix tracing_mark_raw_write() to use per CPU buffer

     The fix to use the per CPU buffer to copy from user space was
     needed for both the trace_maker and trace_maker_raw file.

     The fix for reading from user space into per CPU buffers properly
     fixed the trace_marker write function, but the trace_marker_raw
     file wasn't fixed properly. The user space data was correctly
     written into the per CPU buffer, but the code that wrote into the
     ring buffer still used the user space pointer and not the per CPU
     buffer that had the user space data already written.

   - Stop the fortify string warning from writing into trace_marker_raw

     After converting the copy_from_user_nofault() into a memcpy(),
     another issue appeared. As writes to the trace_marker_raw expects
     binary data, the first entry is a 4 byte identifier. The entry
     structure is defined as:

     struct {
   	struct trace_entry ent;
   	int id;
   	char buf[];
     };

     The size of this structure is reserved on the ring buffer with:

       size = sizeof(*entry) + cnt;

     Then it is copied from the buffer into the ring buffer with:

       memcpy(&entry->id, buf, cnt);

     This use to be a copy_from_user_nofault(), but now converting it to
     a memcpy() triggers the fortify-string code, and causes a warning.

     The allocated space is actually more than what is copied, as the
     cnt used also includes the entry->id portion. Allocating
     sizeof(*entry) plus cnt is actually allocating 4 bytes more than
     what is needed.

     Change the size function to:

       size = struct_size(entry, buf, cnt - sizeof(entry->id));

     And update the memcpy() to unsafe_memcpy()"

* tag 'trace-v6.18-3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing: Stop fortify-string from warning in tracing_mark_raw_write()
  tracing: Fix tracing_mark_raw_write() to use buf and not ubuf
2025-10-11 16:06:04 -07:00
Linus Torvalds
c04022dccb Merge tag 'kbuild-fixes-6.18-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kbuild/linux
Pull Kbuild fixes from Nathan Chancellor:

 - Fix UAPI types check in headers_check.pl

 - Only enable -Werror for hostprogs with CONFIG_WERROR / W=e

 - Ignore fsync() error when output of gen_init_cpio is a pipe

 - Several little build fixes for recent modules.builtin.modinfo series

* tag 'kbuild-fixes-6.18-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kbuild/linux:
  kbuild: Use '--strip-unneeded-symbol' for removing module device table symbols
  s390/vmlinux.lds.S: Move .vmlinux.info to end of allocatable sections
  kbuild: Add '.rel.*' strip pattern for vmlinux
  kbuild: Restore pattern to avoid stripping .rela.dyn from vmlinux
  gen_init_cpio: Ignore fsync() returning EINVAL on pipes
  scripts/Makefile.extrawarn: Respect CONFIG_WERROR / W=e for hostprogs
  kbuild: uapi: Strip comments before size type check
2025-10-11 15:47:12 -07:00