Commit Graph

1381724 Commits

Author SHA1 Message Date
Edward Srouji
372fdb5c75 RDMA/mlx5: Fix page size bitmap calculation for KSM mode
When using KSM (Key Scatter-gather Memory) access mode, the HW requires
the IOVA to be aligned to the selected page size.
Without this alignment, the HW may not function correctly.

Currently, mlx5_umem_mkc_find_best_pgsz() does not filter out page sizes
that would result in misaligned IOVAs for KSM mode. This can lead to
selecting page sizes that are incompatible with the given IOVA.

Fix this by filtering the page size bitmap when in KSM mode, keeping
only page sizes to which the IOVA is aligned.

Fixes: fcfb03597b ("RDMA/mlx5: Align mkc page size capability check to PRM")
Signed-off-by: Edward Srouji <edwards@nvidia.com>
Link: https://patch.msgid.link/20250824144839.154717-1-edwards@nvidia.com
Reviewed-by: Michael Guralnik <michaelgur@nvidia.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-09-11 02:18:35 -04:00
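
The filtering described in 372fdb5c75 can be sketched in plain C. This is a
minimal illustration, not the mlx5 driver code: the helper name
filter_pgsz_bitmap_for_iova() is hypothetical, and the bitmap is assumed to
use bit N to mean "page size 2^N is supported".

```c
#include <stdint.h>

/*
 * Hypothetical helper, for illustration only. Bit N of pgsz_bitmap set
 * means page size 2^N is a candidate. Returns the bitmap with every page
 * size to which iova is NOT aligned cleared.
 */
static uint64_t filter_pgsz_bitmap_for_iova(uint64_t pgsz_bitmap, uint64_t iova)
{
	uint64_t lowest_set_bit;

	/* An IOVA of 0 is aligned to every page size. */
	if (iova == 0)
		return pgsz_bitmap;

	/* Lowest set bit of iova: the largest power of two that divides it. */
	lowest_set_bit = iova & (~iova + 1);

	/*
	 * Keep only bits at or below that position. If lowest_set_bit is
	 * bit 63, the shift wraps to 0 and the mask becomes all ones, which
	 * is the intended result for unsigned arithmetic.
	 */
	return pgsz_bitmap & ((lowest_set_bit << 1) - 1);
}
```

For example, with 4K, 64K and 2M candidates and an IOVA of 0x10000, only the
4K and 64K bits survive, because 0x10000 is not 2M-aligned.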
Kalesh AP
aae757ec20 RDMA/bnxt_re: Remove unnecessary condition checks
The checks for "rdev" and "en_dev" pointer validity always
return false.

Remove them.

Reviewed-by: Saravanan Vajravel <saravanan.vajravel@broadcom.com>
Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Link: https://patch.msgid.link/20250822040801.776196-11-kalesh-anakkur.purayil@broadcom.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-09-11 02:18:31 -04:00
Saravanan Vajravel
6133c13154 RDMA/bnxt_re: Use firmware provided message timeout value
Before this patch, we used a hardcoded value of 500 msec as the default
value for L2 firmware message response timeout. With this commit,
the driver uses the timeout value provided by the firmware.

As part of this change, moved bnxt_re_query_hwrm_intf_version() to
bnxt_re_setup_chip_ctx() so that the timeout value is queried before
sending the first command.

Signed-off-by: Saravanan Vajravel <saravanan.vajravel@broadcom.com>
Reviewed-by: Selvin Xavier <selvin.xavier@broadcom.com>
Reviewed-by: Kashyap Desai <kashyap.desai@broadcom.com>
Reviewed-by: Bhargava Chenna Marreddy <bhargava.marreddy@broadcom.com>
Co-developed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Link: https://patch.msgid.link/20250822040801.776196-10-kalesh-anakkur.purayil@broadcom.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-09-11 02:18:27 -04:00
Saravanan Vajravel
959d10d642 RDMA/bnxt_re: Initialize fw with roce_mirror support
- Check FW capability for roce_mirror support.
- Initialize FW with roce_mirror support.
- When modifying QP, use unique GID for sgid in case of RawEth QP.

Signed-off-by: Saravanan Vajravel <saravanan.vajravel@broadcom.com>
Reviewed-by: Kashyap Desai <kashyap.desai@broadcom.com>
Reviewed-by: Selvin Xavier <selvin.xavier@broadcom.com>
Reviewed-by: Anantha Prabhu <anantha.prabhu@broadcom.com>
Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Link: https://patch.msgid.link/20250822040801.776196-9-kalesh-anakkur.purayil@broadcom.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-09-11 02:18:22 -04:00
Saravanan Vajravel
7c7511f165 RDMA/bnxt_re: Add support for flow create/destroy
- Added support for create_flow and destroy_flow verbs. These
  verbs are used on RawEth QP to add a specific flow action.
- To support TCP dump on RoCE, added IB_FLOW_ATTR_SNIFFER
  attribute.
- In create_flow verb, driver allocates mirror_vnic and configures it
  with RawEth QP. Once this is done, driver will enable mirroring.
- In destroy_flow, driver will disable mirroring and free the mirror
  vnic.

Signed-off-by: Saravanan Vajravel <saravanan.vajravel@broadcom.com>
Reviewed-by: Kashyap Desai <kashyap.desai@broadcom.com>
Reviewed-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Link: https://patch.msgid.link/20250822040801.776196-8-kalesh-anakkur.purayil@broadcom.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-09-11 02:18:18 -04:00
Saravanan Vajravel
fd6c9ae7c1 RDMA/bnxt_re: Add support for mirror vnic
Added below support:
- Querying the pre-reserved mirror_vnic_id
- Allocating/freeing mirror_vnic
- Configuring mirror vnic to associate it with raw qp

These functions will be used in the subsequent patch in this series.

Signed-off-by: Saravanan Vajravel <saravanan.vajravel@broadcom.com>
Reviewed-by: Kashyap Desai <kashyap.desai@broadcom.com>
Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Link: https://patch.msgid.link/20250822040801.776196-7-kalesh-anakkur.purayil@broadcom.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-09-11 02:18:14 -04:00
Saravanan Vajravel
af7f9d0d57 RDMA/bnxt_re: Add support for unique GID
- RawEth QP requires unique GID so that per function stats_ctx
  is not polluted by packets mirrored to RoCE vnic.
- Added support to add unique GID when RawEth type QP is created.
- Added support to destroy unique GID when RawEth type QP is
  destroyed.
- Allocated exclusive stats_ctx to use for RawEth type QP.

Signed-off-by: Saravanan Vajravel <saravanan.vajravel@broadcom.com>
Reviewed-by: Kashyap Desai <kashyap.desai@broadcom.com>
Reviewed-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Link: https://patch.msgid.link/20250822040801.776196-6-kalesh-anakkur.purayil@broadcom.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-09-11 02:18:10 -04:00
Kalesh AP
b5942828ea RDMA/bnxt_re: Refactor stats context memory allocation
Moved the stats context allocation logic to a new function.
The stats context memory allocation code has been moved from
bnxt_qplib_alloc_hwctx() to the newly added bnxt_re_get_stats_ctx()
function. Also, the code to send the firmware command has been moved.

This patch is in preparation for other patches in this series.
No functional changes are intended.

Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Link: https://patch.msgid.link/20250822040801.776196-5-kalesh-anakkur.purayil@broadcom.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-09-11 02:18:07 -04:00
Kalesh AP
2419b16a3d RDMA/bnxt_re: Refactor hw context memory allocation
This patch is in preparation for other patches in this series.
No functional changes are intended.

1. Rename bnxt_qplib_alloc_ctx() to bnxt_qplib_alloc_hwctx().
2. Rename bnxt_qplib_free_ctx() to bnxt_qplib_free_hwctx().
3. Reduce the number of arguments of bnxt_qplib_alloc_hwctx()
   by moving a check outside of it.

Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Link: https://patch.msgid.link/20250822040801.776196-4-kalesh-anakkur.purayil@broadcom.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-09-11 02:18:03 -04:00
Saravanan Vajravel
8f47f12db5 RDMA/bnxt_re: Add data structures for RoCE mirror support
Added data structures required for supporting mirroring on
RoCE device.

Signed-off-by: Saravanan Vajravel <saravanan.vajravel@broadcom.com>
Reviewed-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Link: https://patch.msgid.link/20250822040801.776196-3-kalesh-anakkur.purayil@broadcom.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-09-11 02:17:59 -04:00
Saravanan Vajravel
217156bb70 bnxt_en: Enhance stats context reservation logic
When the firmware advertises that the device is capable of supporting
port mirroring on RoCE device, reserve one additional stat_ctx.
To support port mirroring feature, RDMA driver allocates one stat_ctx
for exclusive use in RawEth QP.

Signed-off-by: Saravanan Vajravel <saravanan.vajravel@broadcom.com>
Reviewed-by: Somnath Kotur <somnath.kotur@broadcom.com>
Reviewed-by: Kashyap Desai <kashyap.desai@broadcom.com>
Reviewed-by: Selvin Xavier <selvin.xavier@broadcom.com>
Reviewed-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Link: https://patch.msgid.link/20250822040801.776196-2-kalesh-anakkur.purayil@broadcom.com
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-09-11 02:17:54 -04:00
Kalesh AP
c0da8dc534 RDMA/bnxt_re: Enhance a log message when bnxt_re_register_netdev fails
Make an error log message more user friendly.
When bnxt_re_register_netdev() fails, the current
log does not convey much information.

Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Link: https://patch.msgid.link/20250814112555.221665-10-kalesh-anakkur.purayil@broadcom.com
Reviewed-by: Saravanan Vajravel <saravanan.vajravel@broadcom.com>
Reviewed-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-09-11 02:17:43 -04:00
Kalesh AP
4a9fba4d00 RDMA/bnxt_re: Delete always true SGID table check
The "sgid_tbl" inside "rdev->qplib_res" is static memory.
Hence, the check always returns true.

Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Link: https://patch.msgid.link/20250814112555.221665-9-kalesh-anakkur.purayil@broadcom.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-09-08 03:08:35 -04:00
Abhishek Mohapatra
dc61e916f1 RDMA/bnxt_re: Report udp source port for flow_label in bnxt_re_query_qp
The firmware doesn't capture the flow_label. Therefore, the value
returned by qplib_qp->ah.flow_label is always 0 whenever
a qp is created. As per the IB spec, the udp source port can be reported
for flow_label. Hence, report the udp source port for flow_label in
bnxt_re_query_qp by populating the value of qplib_qp->udp_sport
into qp_attr->ah_attr.grh.flow_label.

Signed-off-by: Abhishek Mohapatra <abhishek.mohapatra@broadcom.com>
Signed-off-by: Saravanan Vajravel <saravanan.vajravel@broadcom.com>
Link: https://patch.msgid.link/20250814112555.221665-8-kalesh-anakkur.purayil@broadcom.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-09-08 03:08:35 -04:00
Vasuthevan Maheswaran
ef56081d18 RDMA/bnxt_re: RoCE related hardware counters update
Support for new hardware counters added, and existing hardware
counters have been modified according to the design documents
for compatibility with open-source monitoring agents.

Signed-off-by: Vasuthevan Maheswaran <vasuthevan.maheswaran@broadcom.com>
Signed-off-by: Saravanan Vajravel <saravanan.vajravel@broadcom.com>
Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Link: https://patch.msgid.link/20250814112555.221665-7-kalesh-anakkur.purayil@broadcom.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-09-08 03:08:35 -04:00
Damodharam Ammepalli
832fc9e161 RDMA/bnxt_re: Optimize bnxt_qplib_get_dev_attr function
Optimize bnxt_qplib_get_dev_attr() by separating out query_version, which
uses the creq notification method to the host. Because the firmware
serializes the cmdq, added response latency might be observed with heavy
multi-threaded rdma applications.

This patch separates the version_query logic out of the device attribute
query and calls it only during rdma driver init.

Signed-off-by: Damodharam Ammepalli <damodharam.ammepalli@broadcom.com>
Reviewed-by: Hongguang Gao <hongguang.gao@broadcom.com>
Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Link: https://patch.msgid.link/20250814112555.221665-6-kalesh-anakkur.purayil@broadcom.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-09-08 03:08:35 -04:00
Chenna Arnoori
ae7990ce1f RDMA/bnxt_re: RoCE Driver Dynamic Debug for HWRM's
Add Linux kernel dynamic debug prints to ROCE HWRM's.
Dump request and response buffers for the ROCE HWRM's using
print_hex_dump_bytes() as part of kernel dynamic debug.

Signed-off-by: Chenna Arnoori <chenna.arnoori@broadcom.com>
Signed-off-by: Saravanan Vajravel <saravanan.vajravel@broadcom.com>
Link: https://patch.msgid.link/20250814112555.221665-4-kalesh-anakkur.purayil@broadcom.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-09-08 03:08:35 -04:00
Kashyap Desai
8191e87415 RDMA/bnxt_re: Show srq_limit in fill_res_srq_entry hook
Added srq_limit in rdma show resource srq hook.

Signed-off-by: Kashyap Desai <kashyap.desai@broadcom.com>
Signed-off-by: Saravanan Vajravel <saravanan.vajravel@broadcom.com>
Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Link: https://patch.msgid.link/20250814112555.221665-3-kalesh-anakkur.purayil@broadcom.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-09-08 03:08:28 -04:00
Qianfeng Rong
2aa35b24ad RDMA/erdma: Use vcalloc() instead of vzalloc()
Replace vzalloc() with vcalloc() in vmalloc_to_dma_addrs().  As noted in
the kernel documentation [1], open-coded multiplication in allocator
arguments is discouraged because it can lead to integer overflow.

Use vcalloc() to gain built-in overflow protection, making memory
allocation safer when calculating allocation size compared to explicit
multiplication.

[1]: https://www.kernel.org/doc/html/next/process/deprecated.html#open-coded-arithmetic-in-allocator-arguments

Link: https://patch.msgid.link/r/20250821072209.510348-1-rongqianfeng@vivo.com
Signed-off-by: Qianfeng Rong <rongqianfeng@vivo.com>
Reviewed-by: Cheng Xu <chengyou@linux.alibaba.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2025-08-25 15:14:54 -03:00
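
The overflow concern behind 2aa35b24ad can be illustrated in userspace C.
This is only a sketch of the idea: array_bytes_overflow() and
zalloc_array_checked() are hypothetical names, and calloc() stands in for
the kernel allocator. It is not the implementation of vcalloc().

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* True when n * size cannot be represented in a size_t. */
static bool array_bytes_overflow(size_t n, size_t size)
{
	return size != 0 && n > SIZE_MAX / size;
}

/*
 * Hypothetical userspace stand-in for an overflow-checked, zeroing array
 * allocator. An open-coded zalloc(n * size) would silently wrap around and
 * return a buffer that is too small; this version fails instead.
 */
static void *zalloc_array_checked(size_t n, size_t size)
{
	if (array_bytes_overflow(n, size))
		return NULL;

	/* The multiplication is now known to be safe. */
	return calloc(1, n * size);
}
```

The caller passes the element count and element size separately, so the
multiplication happens in one audited place rather than at every call site.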
Patrisious Haddad
08aae78604 RDMA/mlx5: Fix vport loopback forcing for MPV device
Previously, loopback for MPV was supposed to be permanently enabled;
however, other driver flows were able to override that configuration and
disable it.

Add force_lb parameter that indicates that loopback should always be
enabled which prevents all other driver flows from disabling it.

Fixes: a9a9e68954 ("RDMA/mlx5: Fix vport loopback for MPV device")
Link: https://patch.msgid.link/r/cfc6b1f0f99f8100b087483cc14da6025317f901.1755088808.git.leon@kernel.org
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2025-08-25 14:49:59 -03:00
Or Har-Toov
1a7c18c485 RDMA/mlx5: Better estimate max_qp_wr to reflect WQE count
The mlx5 driver currently derives max_qp_wr directly from the
log_max_qp_sz HCA capability:

    props->max_qp_wr = 1 << MLX5_CAP_GEN(mdev, log_max_qp_sz);

However, this value represents the number of WQEs in units of Basic
Blocks (see MLX5_SEND_WQE_BB), not actual number of WQEs.  Since the size
of a WQE can vary depending on transport type and features (e.g., atomic
operations, UMR, LSO), the actual number of WQEs can be significantly
smaller than the WQEBB count suggests.

This patch introduces a conservative estimation of the worst-case WQE size
— considering largest segments possible with 1 SGE and no inline data or
special features. It uses this to derive a more accurate max_qp_wr value.

Fixes: 938fe83c8d ("net/mlx5_core: New device capabilities handling")
Link: https://patch.msgid.link/r/7d992c9831c997ed5c33d30973406dc2dcaf5e89.1755088725.git.leon@kernel.org
Reported-by: Chuck Lever <cel@kernel.org>
Closes: https://lore.kernel.org/all/20250506142202.GJ2260621@ziepe.ca/
Signed-off-by: Or Har-Toov <ohartoov@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2025-08-25 14:42:19 -03:00
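
The arithmetic behind 1a7c18c485 can be sketched as follows. This is an
illustration under stated assumptions, not the driver's code:
estimate_max_qp_wr() is a hypothetical helper, the basic block is assumed to
be 64 bytes (the value of MLX5_SEND_WQE_BB as far as I know), and the
worst-case WQE size is passed in by the caller rather than computed from
real segment layouts.

```c
#include <stdint.h>

/* Assumed size in bytes of one WQE basic block (WQEBB). */
#define SEND_WQE_BB_BYTES 64u

/*
 * Hypothetical helper: convert a queue capacity expressed as
 * 2^log_max_qp_sz basic blocks into a conservative count of WQEs, given
 * the worst-case size of a single WQE in bytes.
 */
static uint64_t estimate_max_qp_wr(unsigned int log_max_qp_sz,
				   uint32_t worst_case_wqe_bytes)
{
	uint64_t total_bytes;

	/* Reject inputs that would divide by zero or shift out of range. */
	if (worst_case_wqe_bytes == 0 || log_max_qp_sz >= 32)
		return 0;

	total_bytes = ((uint64_t)1 << log_max_qp_sz) * SEND_WQE_BB_BYTES;

	/* Round down: a partially fitting WQE does not count. */
	return total_bytes / worst_case_wqe_bytes;
}
```

With log_max_qp_sz = 15 the old formula reports 32768 work requests. If a
worst-case WQE occupied three basic blocks (192 bytes), the estimate above
would be 10922, about a third of that.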
Yishai Hadas
d9e6e85b7b RDMA/mlx5: Enable Data-Direct with Relaxed Ordering
Relaxed Ordering can improve performance in certain scenarios.

Enable it in the Data-Direct use case as well.

Link: https://patch.msgid.link/r/1221dcdda8061ba5f6bc3519044083c7438b257e.1755088503.git.leon@kernel.org
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Reviewed-by: Gal Shalom <galshalom@Nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2025-08-25 14:40:29 -03:00
Michael Margolin
8e5442f393 RDMA/efa: Extend admin timeout error print
Add command id to the printed message for additional debug information.

Link: https://patch.msgid.link/r/20250703182314.16442-1-mrgolin@amazon.com
Reviewed-by: Yonatan Nachum <ynachum@amazon.com>
Signed-off-by: Michael Margolin <mrgolin@amazon.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2025-08-25 11:53:48 -03:00
Fushuai Wang
211dc59b7b IB/hfi1: Use for_each_online_cpu() instead of for_each_cpu()
Replace the opencoded for_each_cpu(cpu, cpu_online_mask) loop with the
more readable and equivalent for_each_online_cpu(cpu) macro.

Signed-off-by: Fushuai Wang <wangfushuai@baidu.com>
Link: https://patch.msgid.link/20250811062534.1041-1-wangfushuai@baidu.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-08-13 07:05:40 -04:00
Konstantin Taranov
44d69d3cf2 RDMA/mana_ib: Drain send wrs of GSI QP
Drain send WRs of the GSI QP on device removal.

In rare servicing scenarios, the hardware may delete the
state of the GSI QP, preventing it from generating CQEs
for pending send WRs. Since WRs submitted to the GSI QP
hold CM resources, the device cannot be removed until
those WRs are completed. This patch marks all pending
send WRs as failed, allowing the GSI QP to release the CM
resources and enabling safe device removal.

Signed-off-by: Konstantin Taranov <kotaranov@microsoft.com>
Link: https://patch.msgid.link/1753779618-23629-1-git-send-email-kotaranov@linux.microsoft.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-08-13 06:33:33 -04:00
Boshi Yu
4d674c478d RDMA/erdma: Use dma_map_page to map scatter MTT buffer
Each high-level indirect MTT entry is assumed to point to exactly one page
of the low-level MTT buffer, but dma_map_sg may merge contiguous physical
pages when mapping. To avoid extra overhead from splitting merged regions,
use dma_map_page to map the scatter MTT buffer page by page.

Reviewed-by: Cheng Xu <chengyou@linux.alibaba.com>
Signed-off-by: Boshi Yu <boshiyu@linux.alibaba.com>
Link: https://patch.msgid.link/20250725055410.67520-2-boshiyu@linux.alibaba.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-08-13 06:26:33 -04:00
Mark Zhang
a3c9d0fcd3 RDMA/ucma: Support write an event into a CM
Enable user-space to inject an event into a CM through its event
channel. Two new events are added and supported: RDMA_CM_EVENT_USER and
RDMA_CM_EVENT_INTERNAL. With these 2 events a new event parameter "arg"
is supported, which is passed from sender to receiver transparently.

With this feature an application is able to write an event into a CM
channel with a new user-space rdmacm API. For example thread T1 could
write an event with the API:
    rdma_write_cm_event(cm_id, RDMA_CM_EVENT_USER, status, arg);
and thread T2 could receive the event with rdma_get_cm_event().

Signed-off-by: Mark Zhang <markzhang@nvidia.com>
Reviewed-by: Vlad Dumitrescu <vdumitrescu@nvidia.com>
Link: https://patch.msgid.link/fdf49d0b17a45933c5d8c1d90605c9447d9a3c73.1751279794.git.leonro@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-08-13 06:16:11 -04:00
Mark Zhang
810f874eda RDMA/ucma: Support query resolved service records
Enable user-space to query resolved service records through a ucma
command when a RDMA_CM_EVENT_ADDRINFO_RESOLVED event is received.

Signed-off-by: Or Har-Toov <ohartoov@nvidia.com>
Signed-off-by: Mark Zhang <markzhang@nvidia.com>
Reviewed-by: Vlad Dumitrescu <vdumitrescu@nvidia.com>
Link: https://patch.msgid.link/1090ee7c00c3f8058c4f9e7557de983504a16715.1751279794.git.leonro@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-08-13 06:16:07 -04:00
Mark Zhang
a6404823fe RDMA/cma: Support IB service record resolution
Add new UCMA command and the corresponding CMA implementation. Userspace
can send this command to request service resolution based on service
name or ID.

On a successful resolution, one or multiple service records are
returned; the first one is used as the destination address by default.

Two new CM events are added and returned to caller accordingly:
  - RDMA_CM_EVENT_ADDRINFO_RESOLVED: Resolve succeeded;
  - RDMA_CM_EVENT_ADDRINFO_ERROR: Resolve failed.

Internally two new CM states are added:
  - RDMA_CM_ADDRINFO_QUERY: CM is in the process of IB service
    resolution;
  - RDMA_CM_ADDRINFO_RESOLVED: CM has finished the resolve process.

With these new states, besides the existing state transitions, 2 new
sequences are supported:
 1. The default address is used:
    RDMA_CM_ADDR_BOUND ->
      RDMA_CM_ADDRINFO_QUERY ->
        RDMA_CM_ADDRINFO_RESOLVED ->
          RDMA_CM_ROUTE_QUERY

 2. To use a different address:
    RDMA_CM_ADDR_BOUND ->
      RDMA_CM_ADDRINFO_QUERY ->
        RDMA_CM_ADDRINFO_RESOLVED ->
          RDMA_CM_ADDR_QUERY ->
            RDMA_CM_ADDR_RESOLVED ->
              RDMA_CM_ROUTE_QUERY

In the 2nd case, resolve_addrinfo returns multiple records, and a user
can call rdma_resolve_addr() with one that is not the first.

Signed-off-by: Or Har-Toov <ohartoov@nvidia.com>
Signed-off-by: Mark Zhang <markzhang@nvidia.com>
Reviewed-by: Vlad Dumitrescu <vdumitrescu@nvidia.com>
Link: https://patch.msgid.link/b6e82ad75522a13b5efe4ff86da0e465aab04cc2.1751279794.git.leonro@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-08-13 06:16:00 -04:00
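
The two new state sequences listed in a6404823fe can be written down as a
small transition table. This is a sketch only: it reuses the state names from
the commit message, but the enum values and the helpers
cm_transition_allowed() and cm_path_valid() are hypothetical, and it models
just the two new sequences, not the pre-existing transitions of the CMA state
machine.

```c
#include <stdbool.h>
#include <stddef.h>

/* State names taken from the commit message; values are arbitrary. */
enum cm_state {
	RDMA_CM_ADDR_BOUND,
	RDMA_CM_ADDRINFO_QUERY,
	RDMA_CM_ADDRINFO_RESOLVED,
	RDMA_CM_ADDR_QUERY,
	RDMA_CM_ADDR_RESOLVED,
	RDMA_CM_ROUTE_QUERY,
};

/* Hypothetical check covering only the two sequences described above. */
static bool cm_transition_allowed(enum cm_state from, enum cm_state to)
{
	switch (from) {
	case RDMA_CM_ADDR_BOUND:
		return to == RDMA_CM_ADDRINFO_QUERY;
	case RDMA_CM_ADDRINFO_QUERY:
		return to == RDMA_CM_ADDRINFO_RESOLVED;
	case RDMA_CM_ADDRINFO_RESOLVED:
		/* Sequence 1 goes straight to route query, sequence 2 re-resolves. */
		return to == RDMA_CM_ROUTE_QUERY || to == RDMA_CM_ADDR_QUERY;
	case RDMA_CM_ADDR_QUERY:
		return to == RDMA_CM_ADDR_RESOLVED;
	case RDMA_CM_ADDR_RESOLVED:
		return to == RDMA_CM_ROUTE_QUERY;
	default:
		return false;
	}
}

/* True when every consecutive pair in path is an allowed transition. */
static bool cm_path_valid(const enum cm_state *path, size_t len)
{
	size_t i;

	for (i = 0; i + 1 < len; i++) {
		if (!cm_transition_allowed(path[i], path[i + 1]))
			return false;
	}
	return true;
}
```

Both sequences share the first three states and diverge only after
RDMA_CM_ADDRINFO_RESOLVED, which is why that state is the only one with two
successors in the table.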
Mark Zhang
a892a3e74f RDMA/sa_query: Support IB service records resolution
Add an SA query API ib_sa_service_rec_get() to support building and
sending SA query MADs that ask for service records with a specific
name or ID, and receiving and parsing responses from the SM.

Signed-off-by: Or Har-Toov <ohartoov@nvidia.com>
Signed-off-by: Mark Zhang <markzhang@nvidia.com>
Reviewed-by: Vlad Dumitrescu <vdumitrescu@nvidia.com>
Link: https://patch.msgid.link/9af6c82f3a3a9d975115a33235fb4ffc7c8edb21.1751279793.git.leonro@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-08-13 06:15:56 -04:00
Mark Zhang
ef5fcdb730 RDMA/sa_query: Add RMPP support for SA queries
Register GSI mad agent with RMPP support and add rmpp_callback for
SA queries. This is needed for querying more than one service record
in one query.

Signed-off-by: Or Har-Toov <ohartoov@nvidia.com>
Signed-off-by: Mark Zhang <markzhang@nvidia.com>
Reviewed-by: Vlad Dumitrescu <vdumitrescu@nvidia.com>
Link: https://patch.msgid.link/81dbcb48682e1838dc40f381cdcc0dc63f25f0f1.1751279793.git.leonro@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-08-13 06:15:48 -04:00
Linus Torvalds
8f5ae30d69 Linux 6.17-rc1 v6.17-rc1
2025-08-10 19:41:16 +03:00
Linus Torvalds
2b38afce25 Merge tag 'turbostat-2025.09.09' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux
Pull turbostat updates from Len Brown:
 "tools/power turbostat: version 2025.09.09

   - Probe and display L3 Cache topology

   - Add ability to average an added counter (useful for pre-integrated
     "counters", such as Watts)

   - Break the limit of 64 built-in counters

   - Assorted bug fixes and minor feature tweaks"

* tag 'turbostat-2025.09.09' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux:
  tools/power turbostat: version 2025.09.09
  tools/power turbostat: Handle non-root legacy-uncore sysfs permissions
  tools/power turbostat: standardize PER_THREAD_PARAMS
  tools/power turbostat: Fix DMR support
  tools/power turbostat: add format "average" for external attributes
  tools/power turbostat: delete GET_PKG()
  tools/power turbostat: probe and display L3 cache topology
  tools/power turbostat: Support more than 64 built-in-counters
  tools/power turbostat.8: Document Totl%C0, Any%C0, GFX%C0, CPUGFX% columns
  tools/power turbostat: Fix bogus SysWatt for forked program
  tools/power turbostat: Handle cap_get_proc() ENOSYS
  tools/power turbostat: Fix build with musl
  tools/power turbostat: verify arguments to params --show and --hide
  tools/power turbostat: regression fix: --show C1E%
2025-08-10 09:02:36 +03:00
Linus Torvalds
b96ddbc5c8 Merge tag 'smp_urgent_for_v6.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull smp fixes from Borislav Petkov:

 - Remove an obsolete comment and fix spelling

* tag 'smp_urgent_for_v6.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  cpu: Remove obsolete comment from takedown_cpu()
  smp: Fix spelling in on_each_cpu_cond_mask()'s doc-comment
2025-08-10 08:51:37 +03:00
Linus Torvalds
7d2fed1f3c Merge tag 'irq_urgent_for_v6.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fixes from Borislav Petkov:

 - Fix a wrong ioremap size in mvebu-gicp

 - Remove yet another compile-test case for a driver which needs an
   additional dependency

 - Fix a lock inversion scenario in the IRQ unit test suite

 - Remove an impossible flag situation in gic-v5

 - Do not iounmap resources in gic-v5 which are managed by devm

 - Make sure stale, left-over interrupts in mvebu-gicp are cleared on
   driver init

 - Fix a reference counting mishap in msi-lib

 - Fix a dereference-before-null-ptr-check case in the riscv-imsic
   irqchip driver

* tag 'irq_urgent_for_v6.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  irqchip/mvebu-gicp: Use resource_size() for ioremap()
  irqchip: Build IMX_MU_MSI only on ARM
  genirq/test: Resolve irq lock inversion warnings
  irqchip/gic-v5: Remove IRQD_RESEND_WHEN_IN_PROGRESS for ITS IRQs
  irqchip/gic-v5: iwb: Fix iounmap probe failure path
  irqchip/mvebu-gicp: Clear pending interrupts on init
  irqchip/msi-lib: Fix fwnode refcount in msi_lib_irq_domain_select()
  irqchip/riscv-imsic: Don't dereference before NULL pointer check
2025-08-10 08:46:47 +03:00
Linus Torvalds
acaa21a26f Merge tag 'x86_urgent_for_v6.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Borislav Petkov:

 - Fix an interrupt vector setup race which leads to a non-functioning
   device

 - Add new Intel CPU models *and* a family: 0x12. Finally. Yippie! :-)

* tag 'x86_urgent_for_v6.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/irq: Plug vector setup race
  x86/cpu: Add new Intel CPU model numbers for Wildcatlake and Novalake
2025-08-10 08:15:32 +03:00
Linus Torvalds
8e8f6b635f Merge tag 'locking_urgent_for_v6.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking fix from Borislav Petkov:

 - Prevent a futex hash leak due to different mm lifetimes

* tag 'locking_urgent_for_v6.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  futex: Move futex cleanup to __mmdrop()
2025-08-10 08:11:39 +03:00
Len Brown
5e98a5e73e tools/power turbostat: version 2025.09.09
Probe and display L3 Cache topology
Add ability to average an added counter
	(useful for pre-integrated "counters", such as Watts)
Break the limit of 64 built-in counters.
Assorted bug fixes and minor feature tweaks

Signed-off-by: Len Brown <len.brown@intel.com>
2025-08-09 21:24:46 -04:00
Len Brown
e60a13bcef tools/power turbostat: Handle non-root legacy-uncore sysfs permissions
/sys/devices/system/cpu/intel_uncore_frequency/package_X_die_Y/
may be readable by all, but
/sys/devices/system/cpu/intel_uncore_frequency/package_X_die_Y/current_freq_khz
may be readable only by root.

Non-root turbostat users see complaints in this scenario.

Fail probe of the interface if we can't read current_freq_khz.

Reported-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Original-patch-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
2025-08-09 21:24:46 -04:00
Len Brown
378e901160 tools/power turbostat: standardize PER_THREAD_PARAMS
Use a macro for PER_THREAD_PARAMS to make adding a parameter later clearer.

no functional change

Signed-off-by: Len Brown <len.brown@intel.com>
2025-08-09 21:24:46 -04:00
Zhang Rui
3a088b07c4 tools/power turbostat: Fix DMR support
Together with the RAPL MSRs, there are more MSRs gone on DMR, including
PLR (Perf Limit Reasons), and IRTL (Package cstate Interrupt Response
Time Limit) MSRs. The configurable TDP info should also be retrieved
from TPMI based Intel Speed Select Technology feature.

Remove the access of these MSRs for DMR. Improve the DMR platform
feature table to make it more readable at the same time.

Fixes: 83075bd59d ("tools/power turbostat: Add initial support for DMR")
Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
2025-08-09 21:24:46 -04:00
Michael Hebenstreit
dcd1c379b0 tools/power turbostat: add format "average" for external attributes
External attributes with format "raw" are not printed in summary lines
for nodes/packages (or with option -S). The new format "average"
behaves like "raw" but also adds the summary data.

Signed-off-by: Michael Hebenstreit <michael.hebenstreit@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
2025-08-09 21:24:46 -04:00
Len Brown
a5015d945d tools/power turbostat: delete GET_PKG()
pkg_base[pkg_id] is a simple array of structure pointers,
let the compiler treat it that way.

Signed-off-by: Len Brown <len.brown@intel.com>
2025-08-09 21:24:46 -04:00
Len Brown
5f961fb2a7 tools/power turbostat: probe and display L3 cache topology
Signed-off-by: Len Brown <len.brown@intel.com>
2025-08-09 21:24:46 -04:00
Len Brown
8d14a098b4 tools/power turbostat: Support more than 64 built-in-counters
We have out-grown the ability to use a 64-bit memory location
to inventory every possible built-in counter.
Leverage the CPU_SET(3) macros to break this barrier.

Also, break the Joules & Watts counters into two,
since we can no longer 'or' them together...

Signed-off-by: Len Brown <len.brown@intel.com>
2025-08-09 21:23:45 -04:00
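
The limit that 8d14a098b4 removes can be illustrated with a multi-word
bitmap. Note that this sketch does not use the CPU_SET(3) macros the commit
adopts; it hand-rolls the same idea so that it has no glibc feature-macro
dependency. The names counter_set, counter_set_add() and counter_set_has(),
and the capacity of 128, are all hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_COUNTERS  128u	/* hypothetical capacity, more than 64 */
#define BITS_PER_WORD 64u
#define NUM_WORDS     ((MAX_COUNTERS + BITS_PER_WORD - 1) / BITS_PER_WORD)

/* One bit per counter id, spread over several 64-bit words. */
struct counter_set {
	uint64_t words[NUM_WORDS];
};

/* Mark counter id as present. Returns false if id is out of range. */
static bool counter_set_add(struct counter_set *set, unsigned int id)
{
	if (id >= MAX_COUNTERS)
		return false;
	set->words[id / BITS_PER_WORD] |= (uint64_t)1 << (id % BITS_PER_WORD);
	return true;
}

/* Test whether counter id is present. Out-of-range ids are never present. */
static bool counter_set_has(const struct counter_set *set, unsigned int id)
{
	if (id >= MAX_COUNTERS)
		return false;
	return (set->words[id / BITS_PER_WORD] >> (id % BITS_PER_WORD)) & 1;
}
```

Because each id is a bit index rather than a mask value, two counters can no
longer be combined by OR-ing their identifiers into one constant, which is
consistent with the commit splitting the Joules and Watts counters.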
Len Brown
d240b441b5 tools/power turbostat.8: Document Totl%C0, Any%C0, GFX%C0, CPUGFX% columns
Explain the meaning of the Totl%C0, Any%C0, GFX%C0, CPUGFX% columns.

Signed-off-by: Len Brown <len.brown@intel.com>
2025-08-09 11:14:30 -04:00
Linus Torvalds
561c80369d Merge tag 'tty-6.16-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
Pull TTY fix from Greg KH:
 "Here is a single revert of one of the previous patches that went in
  the last tty/serial merge that is breaking userspace on some platforms
  (specifically powerpc, probably a few others.)

  It accidentally changed the ioctl values of some tty ioctls, which
  breaks xorg.

  The revert has been in linux-next all this week with no reported
  issues"

* tag 'tty-6.16-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
  Revert "tty: vt: use _IO() to define ioctl numbers"
2025-08-09 18:12:23 +03:00
Linus Torvalds
402e262d77 Merge tag 'efi-next-for-v6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi
Pull EFI updates from Ard Biesheuvel:

 - Expose the OVMF firmware debug log via sysfs

 - Lower the default log level for the EFI stub to avoid corrupting any
   splash screens with unimportant diagnostic output

* tag 'efi-next-for-v6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi:
  efi: add API doc entry for ovmf_debug_log
  efistub: Lower default log level
  efi: add ovmf debug log driver
2025-08-09 18:10:01 +03:00
Linus Torvalds
c30a13538d Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Pull bpf fixes from Alexei Starovoitov:

 - Fix memory leak of bpf_scc_info objects (Eduard Zingerman)

 - Fix a regression in the 'perf' tool caused by moving UID filtering to
   BPF (Ilya Leoshkevich)

* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  perf bpf-filter: Enable events manually
  libbpf: Add the ability to suppress perf event enablement
  bpf: Fix memory leak of bpf_scc_info objects
2025-08-09 09:03:21 +03:00
Linus Torvalds
2988dfed8a Merge tag 'block-6.17-20250808' of git://git.kernel.dk/linux
Pull more block updates from Jens Axboe:

 - MD pull request via Yu:
      - mddev null-ptr-dereference fix, by Erkun
      - md-cluster fail to remove the faulty disk regression fix, by
        Heming
      - minor cleanup, by Li Nan and Jinchao
      - mdadm lifetime regression fix reported by syzkaller, by Yu Kuai

 - NVMe pull request via Christoph:
      - add support for getting the FDP feature in fabrics passthru path
        (Nitesh Shetty)
      - add capability to connect to an administrative controller
        (Kamaljit Singh)
      - fix a leak on sgl setup error (Keith Busch)
      - initialize discovery subsys after debugfs is initialized
        (Mohamed Khalfella)
      - fix various comment typos (Bjorn Helgaas)
      - remove unneeded semicolons (Jiapeng Chong)

 - nvmet debugfs ordering issue fix

 - Fix UAF in the tag_set in zloop

 - Ensure sbitmap shallow depth covers entire set

 - Reduce lock roundtrips in io context lookup

 - Move scheduler tags alloc/free out of elevator and freeze lock, to
   fix some lockdep found issues

 - Improve robustness of queue limits checking

 - Fix a regression with IO priorities, if no io context exists

* tag 'block-6.17-20250808' of git://git.kernel.dk/linux: (26 commits)
  lib/sbitmap: make sbitmap_get_shallow() internal
  lib/sbitmap: convert shallow_depth from one word to the whole sbitmap
  nvmet: exit debugfs after discovery subsystem exits
  block, bfq: Reorder struct bfq_iocq_bfqq_data
  md: make rdev_addable usable for rcu mode
  md/raid1: remove struct pool_info and related code
  md/raid1: change r1conf->r1bio_pool to a pointer type
  block: ensure discard_granularity is zero when discard is not supported
  zloop: fix KASAN use-after-free of tag set
  block: Fix default IO priority if there is no IO context
  nvme: fix various comment typos
  nvme-auth: remove unneeded semicolon
  nvme-pci: fix leak on sgl setup error
  nvmet: initialize discovery subsys after debugfs is initialized
  nvme: add capability to connect to an administrative controller
  nvmet: add support for FDP in fabrics passthru path
  md: rename recovery_cp to resync_offset
  md/md-cluster: handle REMOVE message earlier
  md: fix create on open mddev lifetime regression
  block: fix potential deadlock while running nr_hw_queue update
  ...
2025-08-09 08:47:28 +03:00