Whenever a new counter is created, save inside it the user requested
configuration for optional-counters binding, for manual configuration it
is requested directly by the user and for the automatic configuration it
depends on if the automatic binding was enabled with or without
optional-counters binding.
This argument will later be used by the driver to determine if to bind the
optional-counters as well or not when trying to bind this counter to a QP.
It indicates that when binding counters to a QP we also want the
currently enabled link optional-counters to be bound as well.
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/82f1c357606a16932979ef9a5910122675c74a3a.1741875070.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Change rdma_counter allocation to use rdma_zalloc_drv_obj() instead of,
explicitly allocating at core, in order to be contained inside driver
specific structures.
Adjust all drivers that use it to have their containing structure, and
add driver specific initialization operation.
This change is needed to allow upcoming patches to implement
optional-counters binding whereas inside each driver specific counter
struct his bound optional-counters will be maintained.
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/a5a484f421fc2e5595158e61a354fba43272b02d.1741875070.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Add the following optional counters:
rdma_tx_packets,rdma_rx_bytes,rdma_rx_packets,rdma_tx_bytes.
Which counts all RDMA packets/bytes sent and received per link.
Note that since each direction packet and byte counter are shared,
the counter is only reset when both counters of that direction
are removed. But from user-perspective each can be enabled/disabled separately.
The counters can be enabled using:
sudo rdma stat set link rocep8s0f0/1 optional-counters rdma_tx_packets
And can be seen using:
rdma stat -j show link rocep8s0f0/1
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/9f2753ad636f21704416df64b47395c8991d1123.1741875070.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Currently, the IB uverbs API calls uobj_get_uobj_read(), which in turn
uses the rdma_lookup_get_uobject() helper to retrieve user objects.
In case of failure, uobj_get_uobj_read() returns NULL, overriding the
error code from rdma_lookup_get_uobject(). The IB uverbs API then
translates this NULL to -EINVAL, masking the actual error and
complicating debugging. For example, applications calling ibv_modify_qp
that fails with EBUSY when retrieving the QP uobject will see the
overridden error code EINVAL instead, masking the actual error.
Furthermore, based on rdma-core commit:
"2a22f1ced5f3 ("Merge pull request #1568 from jakemoroni/master")"
Kernel's IB uverbs return values are either ignored and passed on as is
to application or overridden with other errnos in a few cases.
Thus, to improve error reporting and debuggability, propagate the
original error from rdma_lookup_get_uobject() instead of replacing it
with EINVAL.
Signed-off-by: Maher Sanalla <msanalla@nvidia.com>
Link: https://patch.msgid.link/64f9d3711b183984e939962c2f83383904f97dfb.1740577869.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
When running under Hyper-V, the master device to the RDMA device is always
bonded to this RDMA device. This is not user-configurable.
The master device can be unbind/bind from the kernel. During those events,
the RDMA device should set to the current netdev to reflect the change of
master device from those events.
Signed-off-by: Long Li <longli@microsoft.com>
Link: https://patch.msgid.link/1741821332-9392-2-git-send-email-longli@linuxonhyperv.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Implement a new User CAPabilities (UCAP) API to provide fine-grained
control over specific firmware features.
This approach offers more granular capabilities than the existing Linux
capabilities, which may be too generic for certain FW features.
This mechanism represents each capability as a character device with
root read-write access. Root processes can grant users special
privileges by allowing access to these character devices (e.g., using
chown).
UCAP character devices are located in /dev/infiniband and the class path
is /sys/class/infiniband_ucaps.
Signed-off-by: Chiara Meiohas <cmeiohas@nvidia.com>
Link: https://patch.msgid.link/5a1379187cd21178e8554afc81a3c941f21af22f.1741261611.git.leon@kernel.org
Reviewed-by: Yishai Hadas <yishaih@nvidia.com>
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
My static checker says this multiplication can overflow. I'm not an
expert in this code but the call tree would be:
ib_uverbs_handler_UVERBS_METHOD_QP_CREATE() <- reads cap from the user
-> ib_create_qp_user()
-> create_qp()
-> mana_ib_create_qp()
-> mana_ib_create_ud_qp()
-> create_shadow_queue()
It can't hurt to use safer interfaces.
Fixes: c8017f5b48 ("RDMA/mana_ib: UD/GSI work requests")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Link: https://patch.msgid.link/58439ac0-1ee5-4f96-a595-7ab83b59139b@stanley.mountain
Reviewed-by: Long Li <longli@microsoft.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
---------------------------------------------------------------------
Hi,
This is preparation series targeted for mlx5-next, which will be used
later in RDMA.
This series adds RDMA transport steering logic which would allow the
vport group manager to catch control packets from VFs and forward them
to control SW to help with congestion control.
In addition, RDMA will provide new set of APIs to better control exposed
FW capabilities and this series is needed to make sure mlx5 command
interface will ensure that privileged commands can always proceed,
Thanks
Link: https://lore.kernel.org/all/cover.1740574103.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
* mlx5-next:
net/mlx5: fs, add RDMA TRANSPORT steering domain support
net/mlx5: Query ADV_RDMA capabilities
net/mlx5: Limit non-privileged commands
net/mlx5: Allow the throttle mechanism to be more dynamic
net/mlx5: Add RDMA_CTRL HW capabilities
Add RX and TX RDMA_TRANSPORT flow table namespace, and the ability
to create flow tables in those namespaces.
The RDMA_TRANSPORT RX and TX are per vport.
Packets will traverse through RDMA_TRANSPORT_RX after RDMA_RX and through
RDMA_TRANSPORT_TX before RDMA_TX, ensuring proper control and management.
RDMA_TRANSPORT domains are managed by the vport group manager.
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/a6b550d9859a197eafa804b9a8d76916ca481da9.1740574103.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Previously, throttle commands were identified and limited based on
opcode. These commands were limited to half the command slots using a
semaphore, and callback commands checked the opcode to determine
semaphore release.
To allow exceptions, we introduce a variable to indicate when the
throttle lock is held. This allows scenarios where throttle commands
are not limited. Callback functions use this variable to determine
if the throttle semaphore needs to be released.
This patch contains no functional changes. It's a preparation for the
next patch.
Signed-off-by: Chiara Meiohas <cmeiohas@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Link: https://patch.msgid.link/055d975edeb816ac4c0fd1e665c6157d11947d26.1740574103.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
After the erdma_cep_put(new_cep) being called, new_cep will be freed,
and the following dereference will cause a UAF problem. Fix this issue.
Fixes: 920d93eac8 ("RDMA/erdma: Add connection management (CM) support")
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Cheng Xu <chengyou@linux.alibaba.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
capable() calls refer to enabled LSMs whether to permit or deny the
request. This is relevant in connection with SELinux, where a
capability check results in a policy decision and by default a denial
message on insufficient permission is issued.
It can lead to three undesired cases:
1. A denial message is generated, even in case the operation was an
unprivileged one and thus the syscall succeeded, creating noise.
2. To avoid the noise from 1. the policy writer adds a rule to ignore
those denial messages, hiding future syscalls, where the task
performs an actual privileged operation, leading to hidden limited
functionality of that task.
3. To avoid the noise from 1. the policy writer adds a rule to permit
the task the requested capability, while it does not need it,
violating the principle of least privilege.
Signed-off-by: Christian Göttsche <cgzones@googlemail.com>
Link: https://patch.msgid.link/20250302160657.127253-10-cgoettsche@seltendoof.de
Reviewed-by: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Commit 467f432a52 ("RDMA/core: Split port and device counter sysfs
attributes") accidentally almost exposed hw counters to non-init net
namespaces. It didn't expose them fully, as an attempt to read any of
those counters leads to a crash like this one:
[42021.807566] BUG: kernel NULL pointer dereference, address: 0000000000000028
[42021.814463] #PF: supervisor read access in kernel mode
[42021.819549] #PF: error_code(0x0000) - not-present page
[42021.824636] PGD 0 P4D 0
[42021.827145] Oops: 0000 [#1] SMP PTI
[42021.830598] CPU: 82 PID: 2843922 Comm: switchto-defaul Kdump: loaded Tainted: G S W I XXX
[42021.841697] Hardware name: XXX
[42021.849619] RIP: 0010:hw_stat_device_show+0x1e/0x40 [ib_core]
[42021.855362] Code: 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 49 89 d0 4c 8b 5e 20 48 8b 8f b8 04 00 00 48 81 c7 f0 fa ff ff <48> 8b 41 28 48 29 ce 48 83 c6 d0 48 c1 ee 04 69 d6 ab aa aa aa 48
[42021.873931] RSP: 0018:ffff97fe90f03da0 EFLAGS: 00010287
[42021.879108] RAX: ffff9406988a8c60 RBX: ffff940e1072d438 RCX: 0000000000000000
[42021.886169] RDX: ffff94085f1aa000 RSI: ffff93c6cbbdbcb0 RDI: ffff940c7517aef0
[42021.893230] RBP: ffff97fe90f03e70 R08: ffff94085f1aa000 R09: 0000000000000000
[42021.900294] R10: ffff94085f1aa000 R11: ffffffffc0775680 R12: ffffffff87ca2530
[42021.907355] R13: ffff940651602840 R14: ffff93c6cbbdbcb0 R15: ffff94085f1aa000
[42021.914418] FS: 00007fda1a3b9700(0000) GS:ffff94453fb80000(0000) knlGS:0000000000000000
[42021.922423] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[42021.928130] CR2: 0000000000000028 CR3: 00000042dcfb8003 CR4: 00000000003726f0
[42021.935194] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[42021.942257] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[42021.949324] Call Trace:
[42021.951756] <TASK>
[42021.953842] [<ffffffff86c58674>] ? show_regs+0x64/0x70
[42021.959030] [<ffffffff86c58468>] ? __die+0x78/0xc0
[42021.963874] [<ffffffff86c9ef75>] ? page_fault_oops+0x2b5/0x3b0
[42021.969749] [<ffffffff87674b92>] ? exc_page_fault+0x1a2/0x3c0
[42021.975549] [<ffffffff87801326>] ? asm_exc_page_fault+0x26/0x30
[42021.981517] [<ffffffffc0775680>] ? __pfx_show_hw_stats+0x10/0x10 [ib_core]
[42021.988482] [<ffffffffc077564e>] ? hw_stat_device_show+0x1e/0x40 [ib_core]
[42021.995438] [<ffffffff86ac7f8e>] dev_attr_show+0x1e/0x50
[42022.000803] [<ffffffff86a3eeb1>] sysfs_kf_seq_show+0x81/0xe0
[42022.006508] [<ffffffff86a11134>] seq_read_iter+0xf4/0x410
[42022.011954] [<ffffffff869f4b2e>] vfs_read+0x16e/0x2f0
[42022.017058] [<ffffffff869f50ee>] ksys_read+0x6e/0xe0
[42022.022073] [<ffffffff8766f1ca>] do_syscall_64+0x6a/0xa0
[42022.027441] [<ffffffff8780013b>] entry_SYSCALL_64_after_hwframe+0x78/0xe2
The problem can be reproduced using the following steps:
ip netns add foo
ip netns exec foo bash
cat /sys/class/infiniband/mlx4_0/hw_counters/*
The panic occurs because of casting the device pointer into an
ib_device pointer using container_of() in hw_stat_device_show() is
wrong and leads to a memory corruption.
However the real problem is that hw counters should never been exposed
outside of the non-init net namespace.
Fix this by saving the index of the corresponding attribute group
(it might be 1 or 2 depending on the presence of driver-specific
attributes) and zeroing the pointer to hw_counters group for compat
devices during the initialization.
With this fix applied hw_counters are not available in a non-init
net namespace:
find /sys/class/infiniband/mlx4_0/ -name hw_counters
/sys/class/infiniband/mlx4_0/ports/1/hw_counters
/sys/class/infiniband/mlx4_0/ports/2/hw_counters
/sys/class/infiniband/mlx4_0/hw_counters
ip netns add foo
ip netns exec foo bash
find /sys/class/infiniband/mlx4_0/ -name hw_counters
Fixes: 467f432a52 ("RDMA/core: Split port and device counter sysfs attributes")
Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Maher Sanalla <msanalla@nvidia.com>
Cc: linux-rdma@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Link: https://patch.msgid.link/20250227165420.3430301-1-roman.gushchin@linux.dev
Reviewed-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Now that the crc32c() library function directly takes advantage of
architecture-specific optimizations, it is unnecessary to go through the
crypto API. Just use crc32c(). This is much simpler, and it improves
performance due to eliminating the crypto API overhead.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://patch.msgid.link/20250227051207.19470-1-ebiggers@kernel.org
Acked-by: Bernard Metzler <bmt@zurich.ibm.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
This is merge of shared branch between RDMA and net-next trees.
* mlx5-next: (550 commits)
net/mlx5: Change POOL_NEXT_SIZE define value and make it global
net/mlx5: Add new health syndrome error and crr bit offset
Linux 6.14-rc3
...
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Change POOL_NEXT_SIZE define value from 0 to BIT(30), since this define
is used to request the available maximum sized flow table, and zero doesn't
make sense for it, whereas some places in the driver use zero explicitly
expecting the smallest table size possible but instead due to this
define they end up allocating the biggest table size unawarely.
In addition move the definition to "include/linux/mlx5/fs.h" to expose the
define to IB driver as well, while appropriately renaming it.
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Maor Gottlieb <maorg@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20250219085808.349923-3-tariqt@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Enable 'fetch and add' and 'compare and swap' operations to be used with
ODP. This is comprised of the following steps:
1. Check the driver page table(umem_odp->dma_list) to see if the target
page is both readable and writable.
2. If not, then trigger page fault to map the page.
3. Convert its user space address to a kernel logical address using PFNs
in the driver page table(umem_odp->pfn_list).
4. Execute the operation.
Link: https://patch.msgid.link/r/20241220100936.2193541-6-matsuda-daisuke@fujitsu.com
Signed-off-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
rxe_mr_copy() is used widely to copy data to/from a user MR. requester uses
it to load payloads of requesting packets; responder uses it to process
Send, Write, and Read operaetions; completer uses it to copy data from
response packets of Read and Atomic operations to a user MR.
Allow these operations to be used with ODP by adding a subordinate function
rxe_odp_mr_copy(). It is comprised of the following steps:
1. Check the driver page table(umem_odp->dma_list) to see if pages being
accessed are present with appropriate permission.
2. If necessary, trigger page fault to map the pages.
3. Convert their user space addresses to kernel logical addresses using
PFNs in the driver page table(umem_odp->pfn_list).
4. Execute data copy to/from the pages.
Link: https://patch.msgid.link/r/20241220100936.2193541-5-matsuda-daisuke@fujitsu.com
Signed-off-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Allow userspace to register an ODP-enabled MR, in which case the flag
IB_ACCESS_ON_DEMAND is passed to rxe_reg_user_mr(). However, there is no
RDMA operation enabled right now. They will be supported later in the
subsequent two patches.
rxe_odp_do_pagefault() is called to initialize an ODP-enabled MR. It syncs
process address space from the CPU page table to the driver page table
(dma_list/pfn_list in umem_odp) when called with RXE_PAGEFAULT_SNAPSHOT
flag. Additionally, It can be used to trigger page fault when pages being
accessed are not present or do not have proper read/write permissions, and
possibly to prefetch pages in the future.
Link: https://patch.msgid.link/r/20241220100936.2193541-4-matsuda-daisuke@fujitsu.com
Signed-off-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
A single scatter-gather entry is limited by a 32 bits "length" field
that is practically 4GB - PAGE_SIZE. This means that even when the
memory is physically contiguous, we might need more than one entry to
represent it. Additionally when using dmabuf, the sg_table might be
originated outside the subsystem and optimized for other needs.
For instance an SGT of 16GB GPU continuous memory might look like this:
(a real life example)
dma_address 34401400000, length fffff000
dma_address 345013ff000, length fffff000
dma_address 346013fe000, length fffff000
dma_address 347013fd000, length fffff000
dma_address 348013fc000, length 4000
Since ib_umem_find_best_pgsz works within SG entries, in the above case
we will result with the worst possible 4KB page size.
Fix this by taking into consideration only the alignment of addresses of
real discontinuity points rather than treating SG entries as such, and
adjust the page iterator to correctly handle cross SG entry pages.
There is currently an assumption that drivers do not ask for pages
bigger than maximal DMA size supported by their devices.
Reviewed-by: Firas Jahjah <firasj@amazon.com>
Reviewed-by: Yonatan Nachum <ynachum@amazon.com>
Signed-off-by: Michael Margolin <mrgolin@amazon.com>
Link: https://patch.msgid.link/20250217141623.12428-1-mrgolin@amazon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Pull Kbuild fixes from Masahiro Yamada:
- Fix annoying logs when building tools in parallel
- Fix the Debian linux-headers package build again
- Fix the target triple detection for userspace programs on Clang
* tag 'kbuild-fixes-v6.14-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
modpost: Fix a few typos in a comment
kbuild: userprogs: fix bitsize and target detection on clang
kbuild: fix linux-headers package build when $(CC) cannot link userspace
tools: fix annoying "mkdir -p ..." logs when building tools in parallel
Pull driver core api addition from Greg KH:
"Here is a driver core new api for 6.14-rc3 that is being added to
allow platform devices from stop being abused.
It adds a new 'faux_device' structure and bus and api to allow almost
a straight or simpler conversion from platform devices that were not
really a platform device. It also comes with a binding for rust, with
an example driver in rust showing how it's used.
I'm adding this now so that the patches that convert the different
drivers and subsystems can all start flowing into linux-next now
through their different development trees, in time for 6.15-rc1.
We have a number that are already reviewed and tested, but adding
those conversions now doesn't seem right. For now, no one is using
this, and it passes all build tests from 0-day and linux-next, so all
should be good"
* tag 'driver-core-6.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
rust/kernel: Add faux device bindings
driver core: add a faux bus for use when a simple device/bus is needed
Pull serial driver fixes from Greg KH:
"Here are some small serial driver fixes for some reported problems.
Nothing major, just:
- sc16is7xx irq check fix
- 8250 fifo underflow fix
- serial_port and 8250 iotype fixes
Most of these have been in linux-next already, and all have passed
0-day testing"
* tag 'tty-6.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
serial: 8250: Fix fifo underflow on flush
serial: 8250_pnp: Remove unneeded ->iotype assignment
serial: 8250_platform: Remove unneeded ->iotype assignment
serial: 8250_of: Remove unneeded ->iotype assignment
serial: port: Make ->iotype validation global in __uart_read_properties()
serial: port: Always update ->iotype in __uart_read_properties()
serial: port: Assign ->iotype correctly when ->iobase is set
serial: sc16is7xx: Fix IRQ number check behavior
Pull USB fixes from Greg KH:
"Here are some small USB driver fixes, and new device ids, for
6.14-rc3. Lots of tiny stuff for reported problems, including:
- new device ids and quirks
- usb hub crash fix found by syzbot
- dwc2 driver fix
- dwc3 driver fixes
- uvc gadget driver fix
- cdc-acm driver fixes for a variety of different issues
- other tiny bugfixes
Almost all of these have been in linux-next this week, and all have
passed 0-day testing"
* tag 'usb-6.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: (25 commits)
usb: typec: tcpm: PSSourceOffTimer timeout in PR_Swap enters ERROR_RECOVERY
usb: roles: set switch registered flag early on
usb: gadget: uvc: Fix unstarted kthread worker
USB: quirks: add USB_QUIRK_NO_LPM quirk for Teclast dist
usb: gadget: core: flush gadget workqueue after device removal
USB: gadget: f_midi: f_midi_complete to call queue_work
usb: core: fix pipe creation for get_bMaxPacketSize0
usb: dwc3: Fix timeout issue during controller enter/exit from halt state
USB: Add USB_QUIRK_NO_LPM quirk for sony xperia xz1 smartphone
USB: cdc-acm: Fill in Renesas R-Car D3 USB Download mode quirk
usb: cdc-acm: Fix handling of oversized fragments
usb: cdc-acm: Check control transfer buffer size before access
usb: xhci: Restore xhci_pci support for Renesas HCs
USB: pci-quirks: Fix HCCPARAMS register error for LS7A EHCI
USB: serial: option: drop MeiG Smart defines
USB: serial: option: fix Telit Cinterion FN990A name
USB: serial: option: add Telit Cinterion FN990B compositions
USB: serial: option: add MeiG Smart SLM828
usb: gadget: f_midi: fix MIDI Streaming descriptor lengths
usb: dwc2: gadget: remove of_node reference upon udc_stop
...