Report standard counter stats->rx_missed_errors
using hc_rx_discards_no_wqe from the hardware.
Add a global workqueue to periodically run
mana_query_gf_stats every 2 seconds to get the latest
info in eth_stats and define a driver capability flag
to notify hardware of the periodic queries.
To avoid repeated failures and log flooding, the workqueue
is not rescheduled if mana_query_gf_stats fails on HWC timeout
error and the stats are reset to 0. Other errors are transient
which will not need a VF reset for recovery.
Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Link: https://patch.msgid.link/1763120599-6331-3-git-send-email-ernis@linux.microsoft.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Move hardware counter (HC) statistics from mana_port_context to
mana_context to enable sharing stats across multiple network ports
on the same MANA VF. Previously, each network port queried
hardware counters independently using MANA_QUERY_GF_STAT command
(GF = Generic Function stats from GDMA hardware), resulting in
redundant queries when multiple ports existed on the same device.
Isolate hardware counter stats by introducing mana_ethtool_hc_stats
in mana_context and update the code to ensure all stats are properly
reported via ethtool -S <interface>, maintaining consistency with
previous behavior.
Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Link: https://patch.msgid.link/1763120599-6331-2-git-send-email-ernis@linux.microsoft.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Once the netshaper is created for MANA, the current bandwidth
is reported in debugfs like this:
$ sudo ./tools/net/ynl/pyynl/cli.py \
--spec Documentation/netlink/specs/net_shaper.yaml \
--do set \
--json '{"ifindex":'3',
"handle":{ "scope": "netdev", "id":'1' },
"bw-max": 200000000 }'
None
$ sudo cat /sys/kernel/debug/mana/1/vport0/current_speed
200
After the shaper is deleted, it is expected to report
the maximum speed supported by the SKU. But currently it is
reporting 0, which is incorrect.
Fix this inconsistency, by resetting apc->speed to apc->max_speed
during deletion of the shaper object. This will improve
readability and debuggability.
Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/1762369468-32570-1-git-send-email-ernis@linux.microsoft.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Handle the NIC hardware link state events received from the HW
channel, then set the proper link state accordingly.
And, add a feature bit, GDMA_DRV_CAP_FLAG_1_HW_VPORT_LINK_AWARE,
to inform the NIC hardware this handler exists.
Our MANA NIC only sends out the link state down/up messages
when we need to let the VM rerun DHCP client and change IP
address. So, add netif_carrier_on() in the probe(), let the NIC
show the right initial state in /sys/class/net/ethX/operstate.
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Link: https://patch.msgid.link/1761770601-16920-1-git-send-email-haiyangz@linux.microsoft.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch enhances RX buffer handling in the mana driver by allocating
pages from a page pool and slicing them into MTU-sized fragments, rather
than dedicating a full page per packet. This approach is especially
beneficial on systems with large base page sizes like 64KB.
Key improvements:
- Proper integration of page pool for RX buffer allocations.
- MTU-sized buffer slicing to improve memory utilization.
- Reduce overall per Rx queue memory footprint.
- Automatic fallback to full-page buffers when:
* Jumbo frames are enabled (MTU > PAGE_SIZE / 2).
* The XDP path is active, to avoid complexities with fragment reuse.
Testing on VMs with 64KB pages shows around 200% throughput improvement.
Memory efficiency is significantly improved due to reduced wastage in page
allocations. Example: We are now able to fit 35 rx buffers in a single 64kb
page for MTU size of 1500, instead of 1 rx buffer per page previously.
Tested:
- iperf3, iperf2, and nttcp benchmarks.
- Jumbo frames with MTU 9000.
- Native XDP programs (XDP_PASS, XDP_DROP, XDP_TX, XDP_REDIRECT) for
testing the XDP path in driver.
- Memory leak detection (kmemleak).
- Driver load/unload, reboot, and stress scenarios.
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Saurabh Sengar <ssengar@linux.microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: Dipayaan Roy <dipayanroy@linux.microsoft.com>
Link: https://patch.msgid.link/20250814140410.GA22089@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
MANA supports RDMA in PF mode. The driver should record the doorbell
physical address when in PF mode.
The doorbell physical address is used by the RDMA driver to map
doorbell pages of the device to user-mode applications through RDMA
verbs interface. In the past, they have been mapped to user-mode while
the device is in VF mode. With the support for PF mode implemented,
also expose those pages in PF mode.
Support for PF mode is implemented in
290e5d3c49 ("net: mana: Add support for Multi Vports on Bare metal")
Signed-off-by: Long Li <longli@microsoft.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/1750210606-12167-1-git-send-email-longli@linuxonhyperv.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Allow tx_packets and tx_bytes counter in the driver to represent
the packets transmitted post GSO processing.
Currently they are populated as bigger pre-GSO packets and bytes
Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow mana ethtool get_link_ksettings operation to report
the maximum speed supported by the SKU in mbps.
The driver retrieves this information by issuing a
HWC command to the hardware via mana_query_link_cfg(),
which retrieves the SKU's maximum supported speed.
These APIs when invoked on hardware that are older/do
not support these APIs, the speed would be reported as UNKNOWN.
Before:
$ethtool enP30832s1
> Settings for enP30832s1:
Supported ports: [ ]
Supported link modes: Not reported
Supported pause frame use: No
Supports auto-negotiation: No
Supported FEC modes: Not reported
Advertised link modes: Not reported
Advertised pause frame use: No
Advertised auto-negotiation: No
Advertised FEC modes: Not reported
Speed: Unknown!
Duplex: Full
Auto-negotiation: off
Port: Other
PHYAD: 0
Transceiver: internal
Link detected: yes
After:
$ethtool enP30832s1
> Settings for enP30832s1:
Supported ports: [ ]
Supported link modes: Not reported
Supported pause frame use: No
Supports auto-negotiation: No
Supported FEC modes: Not reported
Advertised link modes: Not reported
Advertised pause frame use: No
Advertised auto-negotiation: No
Advertised FEC modes: Not reported
Speed: 16000Mb/s
Duplex: Full
Auto-negotiation: off
Port: Other
PHYAD: 0
Transceiver: internal
Link detected: yes
Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
Reviewed-by: Saurabh Singh Sengar <ssengar@linux.microsoft.com>
Reviewed-by: Long Li <longli@microsoft.com>
Link: https://patch.msgid.link/1750144656-2021-4-git-send-email-ernis@linux.microsoft.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Introduce support for net_shaper_ops in the MANA driver,
enabling configuration of rate limiting on the MANA NIC.
To apply rate limiting, the driver issues a HWC command via
mana_set_bw_clamp() and updates the corresponding shaper object
in the net_shaper cache. If an error occurs during this process,
the driver restores the previous speed by querying the current link
configuration using mana_query_link_cfg().
The minimum supported bandwidth is 100 Mbps, and only values that are
exact multiples of 100 Mbps are allowed. Any other values are rejected.
To remove a shaper, the driver resets the bandwidth to the maximum
supported by the SKU using mana_set_bw_clamp() and clears the
associated cache entry. If an error occurs during this process,
the shaper details are retained.
On the hardware that does not support these APIs, the net-shaper
calls to set speed would fail.
Set the speed:
./tools/net/ynl/pyynl/cli.py \
--spec Documentation/netlink/specs/net_shaper.yaml \
--do set --json '{"ifindex":'$IFINDEX',
"handle":{"scope": "netdev", "id":'$ID' },
"bw-max": 200000000 }'
Get the shaper details:
./tools/net/ynl/pyynl/cli.py \
--spec Documentation/netlink/specs/net_shaper.yaml \
--do get --json '{"ifindex":'$IFINDEX',
"handle":{"scope": "netdev", "id":'$ID' }}'
> {'bw-max': 200000000,
> 'handle': {'scope': 'netdev'},
> 'ifindex': $IFINDEX,
> 'metric': 'bps'}
Delete the shaper object:
./tools/net/ynl/pyynl/cli.py \
--spec Documentation/netlink/specs/net_shaper.yaml \
--do delete --json '{"ifindex":'$IFINDEX',
"handle":{"scope": "netdev","id":'$ID' }}'
Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
Reviewed-by: Saurabh Singh Sengar <ssengar@linux.microsoft.com>
Reviewed-by: Long Li <longli@microsoft.com>
Link: https://patch.msgid.link/1750144656-2021-3-git-send-email-ernis@linux.microsoft.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
When net_shaper_ops are enabled for MANA, netdev_ops_lock
becomes active.
MANA VF setup/teardown by netvsc follows this call chain:
netvsc_vf_setup()
dev_change_flags()
...
__dev_open() OR __dev_close()
dev_change_flags() holds the netdev mutex via netdev_lock_ops.
Meanwhile, mana_create_txq() and mana_create_rxq() in mana_open()
path call NAPI APIs (netif_napi_add_tx(), netif_napi_add_weight(),
napi_enable()), which also try to acquire the same lock, risking
deadlock.
Similarly in the teardown path (mana_close()), netif_napi_disable()
and netif_napi_del(), contend for the same lock.
Switch to the _locked variants of these APIs to avoid deadlocks
when the netdev_ops_lock is held.
Fixes: d4c22ec680 ("net: hold netdev instance lock during ndo_open/ndo_stop")
Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
Reviewed-by: Saurabh Singh Sengar <ssengar@linux.microsoft.com>
Reviewed-by: Long Li <longli@microsoft.com>
Link: https://patch.msgid.link/1750144656-2021-2-git-send-email-ernis@linux.microsoft.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Shradha Gupta says:
====================
Allow dyn MSI-X vector allocation of MANA
In this patchset we want to enable the MANA driver to be able to
allocate MSI-X vectors in PCI dynamically.
The first patch exports pci_msix_prepare_desc() in PCI to be able to
correctly prepare descriptors for dynamically added MSI-X vectors.
The second patch adds the support of dynamic vector allocation in
pci-hyperv PCI controller by enabling the MSI_FLAG_PCI_MSIX_ALLOC_DYN
flag and using the pci_msix_prepare_desc() exported in first patch.
The third patch adds a detailed description of the irq_setup(), to
help understand the function design better.
The fourth patch is a preparation patch for mana changes to support
dynamic IRQ allocation. It contains changes in irq_setup() to allow
skipping first sibling CPU sets, in case certain IRQs are already
affinitized to them.
The fifth patch has the changes in MANA driver to be able to allocate
MSI-X vectors dynamically. If the support does not exist it defaults to
older behavior.
* 'shradha_v6.16-rc1' of https://github.com/shradhagupta6/linux:
net: mana: Allocate MSI-X vectors dynamically
net: mana: Allow irq_setup() to skip cpus for affinity
net: mana: explain irq_setup() algorithm
PCI: hv: Allow dynamic MSI-X vector allocation
PCI/MSI: Export pci_msix_prepare_desc() for dynamic MSI-X allocations
====================
Link: https://patch.msgid.link/1749650984-9193-1-git-send-email-shradhagupta@linux.microsoft.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently, the MANA driver allocates MSI-X vectors statically based on
MANA_MAX_NUM_QUEUES and num_online_cpus() values and in some cases ends
up allocating more vectors than it needs. This is because, by this time
we do not have a HW channel and do not know how many IRQs should be
allocated.
To avoid this, we allocate 1 MSI-X vector during the creation of HWC and
after getting the value supported by hardware, dynamically add the
remaining MSI-X vectors.
Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
In order to prepare the MANA driver to allocate the MSI-X IRQs
dynamically, we need to enhance irq_setup() to allow skipping
affinitizing IRQs to the first CPU sibling group.
This would be for cases when the number of IRQs is less than or equal
to the number of online CPUs. In such cases for dynamically added IRQs
the first CPU sibling group would already be affinitized with HWC IRQ.
Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: Yury Norov [NVIDIA] <yury.norov@gmail.com>
Commit 91bfe210e1 ("net: mana: add a function to spread IRQs per CPUs")
added the irq_setup() function that distributes IRQs on CPUs according
to a tricky heuristic. The corresponding commit message explains the
heuristic.
Duplicate it in the source code to make available for readers without
digging git in history. Also, add more detailed explanation about how
the heuristics is implemented.
Signed-off-by: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
Pull rdma updates from Jason Gunthorpe:
"Usual collection of driver fixes:
- Small bug fixes and cleansup in hfi, hns, rxe, mlx5, mana siw
- Further ODP functionality in rxe
- Remote access MRs in mana, along with more page sizes
- Improve CM scalability with a rwlock around the agent
- More trace points for hns
- ODP hmm conversion to the new two step dma API
- Support the ethernet HW device in mana as well as the RNIC
- Cleanups:
- Use secs_to_jiffies() when appropriate
- Use ERR_CAST() instead of naked casts
- Don't use %pK in printk
- Unusued functions removed
- Allocation type matching"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (57 commits)
RDMA/cma: Fix hang when cma_netevent_callback fails to queue_work
RDMA/bnxt_re: Support extended stats for Thor2 VF
RDMA/hns: Fix endian issue in trace events
RDMA/mlx5: Avoid flexible array warning
IB/cm: Remove dead code and adjust naming
RDMA/core: Avoid hmm_dma_map_alloc() for virtual DMA devices
RDMA/rxe: Break endless pagefault loop for RO pages
RDMA/bnxt_re: Fix return code of bnxt_re_configure_cc
RDMA/bnxt_re: Fix missing error handling for tx_queue
RDMA/bnxt_re: Fix incorrect display of inactivity_cp in debugfs output
RDMA/mlx5: Add support for 200Gbps per lane speeds
RDMA/mlx5: Remove the redundant MLX5_IB_STAGE_UAR stage
RDMA/iwcm: Fix use-after-free of work objects after cm_id destruction
net: mana: Add support for auxiliary device servicing events
RDMA/mana_ib: unify mana_ib functions to support any gdma device
RDMA/mana_ib: Add support of mana_ib for RNIC and ETH nic
net: mana: Probe rdma device in mana driver
RDMA/siw: replace redundant ternary operator with just rv
RDMA/umem: Separate implicit ODP initialization from explicit ODP
RDMA/core: Convert UMEM ODP DMA mapping to caching IOVA and page linkage
...
Pull networking fixes from Jakub Kicinski:
"Rather tiny pull request, mostly so that we can get into our trees
your fix to the x86 Makefile.
Current release - regressions:
- Revert "tcp: avoid atomic operations on sk->sk_rmem_alloc", error
queue accounting was missed
Current release - new code bugs:
- 5 fixes for the netdevice instance locking work
Previous releases - regressions:
- usbnet: restore usb%d name exception for local mac addresses
Previous releases - always broken:
- rtnetlink: allocate vfinfo size for VF GUIDs when supported, avoid
spurious GET_LINK failures
- eth: mana: Switch to page pool for jumbo frames
- phy: broadcom: Correct BCM5221 PHY model detection
Misc:
- selftests: drv-net: replace helpers for referring to other files"
* tag 'net-6.15-rc0' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (22 commits)
Revert "tcp: avoid atomic operations on sk->sk_rmem_alloc"
bnxt_en: bring back rtnl lock in bnxt_shutdown
eth: gve: add missing netdev locks on reset and shutdown paths
selftests: mptcp: ignore mptcp_diag binary
selftests: mptcp: close fd_in before returning in main_loop
selftests: mptcp: fix incorrect fd checks in main_loop
mptcp: fix NULL pointer in can_accept_new_subflow
octeontx2-af: Free NIX_AF_INT_VEC_GEN irq
octeontx2-af: Fix mbox INTR handler when num VFs > 64
net: fix use-after-free in the netdev_nl_sock_priv_destroy()
selftests: net: use Path helpers in ping
selftests: net: use the dummy bpf from net/lib
selftests: drv-net: replace the rpath helper with Path objects
net: lapbether: use netdev_lockdep_set_classes() helper
net: phy: broadcom: Correct BCM5221 PHY model detection
net: usb: usbnet: restore usb%d name exception for local mac addresses
net/mlx5e: SHAMPO, Make reserved size independent of page size
net: mana: Switch to page pool for jumbo frames
MAINTAINERS: Add dedicated entries for phy_link_topology
net: move replay logic to tc_modify_qdisc
...
Pull rdma updates from Jason Gunthorpe:
- Usual minor updates and fixes for bnxt_re, hfi1, rxe, mana, iser,
mlx5, vmw_pvrdma, hns
- Make rxe work on tun devices
- mana gains more standard verbs as it moves toward supporting
in-kernel verbs
- DMABUF support for mana
- Fix page size calculations when memory registration exceeds 4G
- On Demand Paging support for rxe
- mlx5 support for RDMA TRANSPORT flow tables and a new ucap mechanism
to access control use of them
- Optional RDMA_TX/RX counters per QP in mlx5
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (73 commits)
IB/mad: Check available slots before posting receive WRs
RDMA/mana_ib: Fix integer overflow during queue creation
RDMA/mlx5: Fix calculation of total invalidated pages
RDMA/mlx5: Fix mlx5_poll_one() cur_qp update flow
RDMA/mlx5: Fix page_size variable overflow
RDMA/mlx5: Drop access_flags from _mlx5_mr_cache_alloc()
RDMA/mlx5: Fix cache entry update on dereg error
RDMA/mlx5: Fix MR cache initialization error flow
RDMA/mlx5: Support optional-counters binding for QPs
RDMA/mlx5: Compile fs.c regardless of INFINIBAND_USER_ACCESS config
RDMA/core: Pass port to counter bind/unbind operations
RDMA/core: Add support to optional-counters binding configuration
RDMA/core: Create and destroy rdma_counter using rdma_zalloc_drv_obj()
RDMA/mlx5: Add optional counters for RDMA_TX/RX_packets/bytes
RDMA/core: Fix use-after-free when rename device name
RDMA/bnxt_re: Support perf management counters
RDMA/rxe: Fix incorrect return value of rxe_odp_atomic_op()
RDMA/uverbs: Propagate errors from rdma_lookup_get_uobject()
RDMA/mana_ib: Handle net event for pointing to the current netdev
net: mana: Change the function signature of mana_get_primary_netdev_rcu
...
Cross-merge networking fixes after downstream PR (net-6.14-rc8).
Conflict:
tools/testing/selftests/net/Makefile
03544faad7 ("selftest: net: add proc_net_pktgen")
3ed61b8938 ("selftests: net: test for lwtunnel dst ref loops")
tools/testing/selftests/net/config:
85cb3711ac ("selftests: net: Add test cases for link and peer netns")
3ed61b8938 ("selftests: net: test for lwtunnel dst ref loops")
Adjacent commits:
tools/testing/selftests/net/Makefile
c935af429e ("selftests: net: add support for testing SO_RCVMARK and SO_RCVPRIORITY")
355d940f4d ("Revert "selftests: Add IPv6 link-local address generation tests for GRE devices."")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Cross-merge networking fixes after downstream PR (net-6.14-rc6).
Conflicts:
tools/testing/selftests/drivers/net/ping.py
75cc19c8ff ("selftests: drv-net: add xdp cases for ping.py")
de94e86974 ("selftests: drv-net: store addresses in dict indexed by ipver")
https://lore.kernel.org/netdev/20250311115758.17a1d414@canb.auug.org.au/
net/core/devmem.c
a70f891e0f ("net: devmem: do not WARN conditionally after netdev_rx_queue_restart()")
1d22d3060b ("net: drop rtnl_lock for queue_mgmt operations")
https://lore.kernel.org/netdev/20250313114929.43744df1@canb.auug.org.au/
Adjacent changes:
tools/testing/selftests/net/Makefile
6f50175cca ("selftests: Add IPv6 link-local address generation tests for GRE devices.")
2e5584e0f9 ("selftests/net: expand cmsg_ipv6.sh with ipv4")
drivers/net/ethernet/broadcom/bnxt/bnxt.c
661958552e ("eth: bnxt: do not use BNXT_VNIC_NTUPLE unconditionally in queue restart logic")
fe96d717d3 ("bnxt_en: Extend queue stop/start for TX rings")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Move the more esoteric helpers for netdev instance lock to
a dedicated header. This avoids growing netdevice.h to infinity
and makes rebuilding the kernel much faster (after touching
the header with the helpers).
The main netdev_lock() / netdev_unlock() functions are used
in static inlines in netdevice.h and will probably be used
most commonly, so keep them in netdevice.h.
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250307183006.2312761-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>