linux

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2026-05-16 12:31:52 -04:00

Author	SHA1	Message	Date
Zilin Guan	990a73dec3	wifi: mwifiex: Fix memory leak in mwifiex_11n_aggregate_pkt() In mwifiex_11n_aggregate_pkt(), skb_aggr is allocated via mwifiex_alloc_dma_align_buf(). If mwifiex_is_ralist_valid() returns false, the function currently returns -1 immediately without freeing the previously allocated skb_aggr, causing a memory leak. Since skb_aggr has not yet been queued via skb_queue_tail(), no other references to this memory exist. Therefore, it has to be freed locally before returning the error. Fix this by calling mwifiex_write_data_complete() to free skb_aggr before returning the error status. Compile tested only. Issue found using a prototype static analysis tool and code review. Fixes: `5e6e3a92b9` ("wireless: mwifiex: initial commit for Marvell mwifiex driver") Signed-off-by: Zilin Guan <zilin@seu.edu.cn> Reviewed-by: Jeff Chen <jeff.chen_1@nxp.com> Link: https://patch.msgid.link/20260119092625.1349934-1-zilin@seu.edu.cn Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2026-03-02 08:17:22 +01:00
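The leak pattern described in the commit above can be illustrated in user space. This is a minimal sketch, not the mwifiex code: `struct buf`, `buf_alloc()`, `buf_free()`, `aggregate()` and the `live_bufs` counter are all hypothetical stand-ins for the driver's allocation, validity check and completion helper.

```c
#include <stdlib.h>

/* Hypothetical stand-ins; none of these names come from the mwifiex driver. */
struct buf {
    size_t len;
};

static int live_bufs; /* outstanding allocations, tracked for illustration */

static struct buf *buf_alloc(size_t len)
{
    struct buf *b = malloc(sizeof(*b));

    if (b) {
        b->len = len;
        live_bufs++;
    }
    return b;
}

static void buf_free(struct buf *b)
{
    if (b) {
        free(b);
        live_bufs--;
    }
}

/*
 * Returns 0 and hands the buffer to the caller through *queued, or -1 on
 * error. list_valid models the validity check that can fail after the
 * aggregate buffer has already been allocated.
 */
static int aggregate(int list_valid, struct buf **queued)
{
    struct buf *aggr = buf_alloc(2048);

    if (!aggr)
        return -1;

    if (!list_valid) {
        /* Not queued yet, so no other reference exists: free it here. */
        buf_free(aggr);
        return -1;
    }

    *queued = aggr; /* ownership passes to the caller */
    return 0;
}
```

Without the `buf_free()` call in the early-return branch, every failed validity check would leave one buffer allocated with no remaining reference to it.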
Jakub Kicinski	0314e382cf	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Cross-merge networking fixes after downstream PR (net-7.0-rc2). Conflicts: tools/testing/selftests/drivers/net/hw/rss_ctx.py `19c3a2a81d` ("selftests: drv-net: rss: Generate unique ports for RSS context tests") `ce5a0f4612` ("selftests: drv-net: rss_ctx: test RSS contexts persist after ifdown/up") include/net/inet_connection_sock.h `858d2a4f67` ("tcp: fix potential race in tcp_v6_syn_recv_sock()") `fcd3d039fa` ("tcp: make tcp_v{4,6}_send_check() static") https://lore.kernel.org/aZ8PSFLzBrEU3I89@sirena.org.uk drivers/net/ethernet/mellanox/mlx5/core/en/xsk/setup.c drivers/net/ethernet/mellanox/mlx5/core/en/xsk/pool.c `69050f8d6d` ("treewide: Replace kmalloc with kmalloc_obj for non-scalar types") `bf4afc53b7` ("Convert 'alloc_obj' family to use the new default GFP_KERNEL argument") `8a96b9144f` ("net/mlx5e: Alloc xsk channel param out of mlx5e_open_xsk()") Adjacent changes: net/netfilter/ipvs/ip_vs_ctl.c `c59bd9e62e` ("ipvs: use more counters to avoid service lookups") `bf4afc53b7` ("Convert 'alloc_obj' family to use the new default GFP_KERNEL argument") Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-26 10:23:00 -08:00
Linus Torvalds	b9c8fc2cae	Merge tag 'net-7.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Paolo Abeni: "Including fixes from IPsec, Bluetooth and netfilter Current release - regressions: - wifi: fix dev_alloc_name() return value check - rds: fix recursive lock in rds_tcp_conn_slots_available Current release - new code bugs: - vsock: lock down child_ns_mode as write-once Previous releases - regressions: - core: - do not pass flow_id to set_rps_cpu() - consume xmit errors of GSO frames - netconsole: avoid OOB reads, msg is not nul-terminated - netfilter: h323: fix OOB read in decode_choice() - tcp: re-enable acceptance of FIN packets when RWIN is 0 - udplite: fix null-ptr-deref in __udp_enqueue_schedule_skb(). - wifi: brcmfmac: fix potential kernel oops when probe fails - phy: register phy led_triggers during probe to avoid AB-BA deadlock - eth: - bnxt_en: fix deleting of Ntuple filters - wan: farsync: fix use-after-free bugs caused by unfinished tasklets - xscale: check for PTP support properly Previous releases - always broken: - tcp: fix potential race in tcp_v6_syn_recv_sock() - kcm: fix zero-frag skb in frag_list on partial sendmsg error - xfrm: - fix race condition in espintcp_close() - always flush state and policy upon NETDEV_UNREGISTER event - bluetooth: - purge error queues in socket destructors - fix response to L2CAP_ECRED_CONN_REQ - eth: - mlx5: - fix circular locking dependency in dump - fix "scheduling while atomic" in IPsec MAC address query - gve: fix incorrect buffer cleanup for QPL - team: avoid NETDEV_CHANGEMTU event when unregistering slave - usb: validate USB endpoints" * tag 'net-7.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (72 commits) netfilter: nf_conntrack_h323: fix OOB read in decode_choice() dpaa2-switch: validate num_ifs to prevent out-of-bounds write net: consume xmit errors of GSO frames vsock: document write-once behavior of the child_ns_mode sysctl vsock: lock 
down child_ns_mode as write-once selftests/vsock: change tests to respect write-once child ns mode net/mlx5e: Fix "scheduling while atomic" in IPsec MAC address query net/mlx5: Fix missing devlink lock in SRIOV enable error path net/mlx5: E-switch, Clear legacy flag when moving to switchdev net/mlx5: LAG, disable MPESW in lag_disable_change() net/mlx5: DR, Fix circular locking dependency in dump selftests: team: Add a reference count leak test team: avoid NETDEV_CHANGEMTU event when unregistering slave net: mana: Fix double destroy_workqueue on service rescan PCI path MAINTAINERS: Update maintainer entry for QUALCOMM ETHQOS ETHERNET DRIVER dpll: zl3073x: Remove redundant cleanup in devm_dpll_init() selftests/net: packetdrill: Verify acceptance of FIN packets when RWIN is 0 tcp: re-enable acceptance of FIN packets when RWIN is 0 vsock: Use container_of() to get net namespace in sysctl handlers net: usb: kaweth: validate USB endpoints ...	2026-02-26 08:00:13 -08:00
Vahagn Vardanian	baed0d9ba9	netfilter: nf_conntrack_h323: fix OOB read in decode_choice() In decode_choice(), the boundary check before get_len() uses the variable `len`, which is still 0 from its initialization at the top of the function: unsigned int type, ext, len = 0; ... if (ext || (son->attr & OPEN)) { BYTE_ALIGN(bs); if (nf_h323_error_boundary(bs, len, 0)) /* len is 0 here */ return H323_ERROR_BOUND; len = get_len(bs); /* OOB read */ When the bitstream is exactly consumed (bs->cur == bs->end), the check nf_h323_error_boundary(bs, 0, 0) evaluates to (bs->cur + 0 > bs->end), which is false. The subsequent get_len() call then dereferences bs->cur++, reading 1 byte past the end of the buffer. If that byte has bit 7 set, get_len() reads a second byte as well. This can be triggered remotely by sending a crafted Q.931 SETUP message with a User-User Information Element containing exactly 2 bytes of PER-encoded data ({0x08, 0x00}) to port 1720 through a firewall with the nf_conntrack_h323 helper active. The decoder fully consumes the PER buffer before reaching this code path, resulting in a 1-2 byte heap-buffer-overflow read confirmed by AddressSanitizer. Fix this by checking for 2 bytes (the maximum that get_len() may read) instead of the uninitialized `len`. This matches the pattern used at every other get_len() call site in the same file, where the caller checks for 2 bytes of available data before calling get_len(). Fixes: `ec8a8f3c31` ("netfilter: nf_ct_h323: Extend nf_h323_error_boundary to work on bits as well") Signed-off-by: Vahagn Vardanian <vahagn@redrays.io> Signed-off-by: Florian Westphal <fw@strlen.de> Link: https://patch.msgid.link/20260225130619.1248-2-fw@strlen.de Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-26 12:50:42 +01:00
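The check-before-read rule from the commit above can be sketched outside the kernel. The code below is an illustrative model under stated assumptions: `struct bitstr` is reduced to two pointers, and `out_of_bounds()`, `read_len()` and `checked_read_len()` are hypothetical names, not the helpers in the nf_conntrack_h323 sources. `read_len()` only models the behaviour the message describes (one byte, plus a second byte when bit 7 is set).

```c
#include <stddef.h>

/* Simplified cursor over a byte buffer; a stand-in for the decoder state. */
struct bitstr {
    const unsigned char *cur;
    const unsigned char *end;
};

/* Nonzero when fewer than `bytes` bytes remain between cur and end. */
static int out_of_bounds(const struct bitstr *bs, size_t bytes)
{
    return (size_t)(bs->end - bs->cur) < bytes;
}

/*
 * Reads a length field: one byte, or two bytes when bit 7 of the first
 * byte is set. The caller must already have verified that two bytes are
 * available, because this function performs no checks of its own.
 */
static unsigned int read_len(struct bitstr *bs)
{
    unsigned int v = *bs->cur++;

    if (v & 0x80) {
        v &= 0x3f;
        v <<= 8;
        v += *bs->cur++;
    }
    return v;
}

/* Checks for the maximum read_len() may consume (2 bytes) before reading. */
static int checked_read_len(struct bitstr *bs, unsigned int *len)
{
    if (out_of_bounds(bs, 2))
        return -1;
    *len = read_len(bs);
    return 0;
}
```

Checking for zero bytes, as the old code effectively did, passes when the cursor sits exactly at the end of the buffer. Checking for the worst-case read size rejects that state before any dereference happens.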
Junrui Luo	8a5752c6dc	dpaa2-switch: validate num_ifs to prevent out-of-bounds write The driver obtains sw_attr.num_ifs from firmware via dpsw_get_attributes() but never validates it against DPSW_MAX_IF (64). This value controls iteration in dpaa2_switch_fdb_get_flood_cfg(), which writes port indices into the fixed-size cfg->if_id[DPSW_MAX_IF] array. When firmware reports num_ifs >= 64, the loop can write past the array bounds. Add a bound check for num_ifs in dpaa2_switch_init(). dpaa2_switch_fdb_get_flood_cfg() appends the control interface (port num_ifs) after all matched ports. When num_ifs == DPSW_MAX_IF and all ports match the flood filter, the loop fills all 64 slots and the control interface write overflows by one entry. The check uses >= because num_ifs == DPSW_MAX_IF is also functionally broken. build_if_id_bitmap() silently drops any ID >= 64: if (id[i] < DPSW_MAX_IF) bmap[id[i] / 64] |= ... Fixes: `539dda3c5d` ("staging: dpaa2-switch: properly setup switching domains") Signed-off-by: Junrui Luo <moonafterrain@outlook.com> Reviewed-by: Ioana Ciornei <ioana.ciornei@nxp.com> Link: https://patch.msgid.link/SYBPR01MB78812B47B7F0470B617C408AAF74A@SYBPR01MB7881.ausprd01.prod.outlook.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-26 12:37:21 +01:00
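The reasoning behind the `>=` comparison in the commit above can be checked with a small model. This is a sketch under assumptions: `MAX_IF` mirrors the value 64 quoted for DPSW_MAX_IF, while `struct flood_cfg`, `validate_num_ifs()` and `build_flood_cfg()` are hypothetical names. The fill routine models the worst case in which every port matches the flood filter.

```c
#include <stddef.h>

#define MAX_IF 64 /* mirrors the DPSW_MAX_IF value (64) quoted in the commit */

struct flood_cfg {
    unsigned short if_id[MAX_IF];
    size_t num_ifs;
};

/* Rejects firmware-reported counts that would not leave room for the
 * control interface entry appended after the ports. */
static int validate_num_ifs(unsigned int num_ifs)
{
    return num_ifs >= MAX_IF ? -1 : 0;
}

/*
 * Worst case: all ports match, then the control interface (port num_ifs)
 * is appended. This writes num_ifs + 1 entries, so it is only safe when
 * validate_num_ifs() accepted num_ifs.
 */
static void build_flood_cfg(struct flood_cfg *cfg, unsigned int num_ifs)
{
    size_t i = 0;

    for (unsigned int port = 0; port < num_ifs; port++)
        cfg->if_id[i++] = (unsigned short)port;

    cfg->if_id[i++] = (unsigned short)num_ifs; /* control interface */
    cfg->num_ifs = i;
}
```

With `num_ifs == 63` the routine writes exactly 64 entries and fills the array. With `num_ifs == 64` it would need 65, which is why the boundary value itself has to be rejected.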
Hangbin Liu	4916f2e2f3	bonding: print churn state via netlink Currently, the churn state is printed only in sysfs. Add netlink support so users could get the state via netlink. Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Link: https://patch.msgid.link/20260224020215.6012-1-liuhangbin@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-26 11:45:35 +01:00
Qingfang Deng	15c9ed1d82	pppoe: remove kernel-mode relay support The kernel-mode PPPoE relay feature and its two associated ioctls (PPPOEIOCSFWD and PPPOEIOCDFWD) are not used by any existing userspace PPPoE implementations. The most commonly-used package, RP-PPPoE [1], handles the relaying entirely in userspace. This legacy code has remained in the driver since its introduction in kernel 2.3.99-pre7 for over two decades, but has served no practical purpose. Remove the unused relay code. [1] https://dianne.skoll.ca/projects/rp-pppoe/ Signed-off-by: Qingfang Deng <dqfext@gmail.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Guillaume Nault <gnault@redhat.com> Link: https://patch.msgid.link/20260224015053.42472-1-dqfext@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-26 11:41:00 +01:00
Jakub Kicinski	7aa767d0d3	net: consume xmit errors of GSO frames udpgro_frglist.sh and udpgro_bench.sh are the flakiest tests currently in NIPA. They fail in the same exact way, TCP GRO test stalls occasionally and the test gets killed after 10min. These tests use veth to simulate GRO. They attach a trivial ("return XDP_PASS;") XDP program to the veth to force TSO off and NAPI on. Digging into the failure mode we can see that the connection is completely stuck after a burst of drops. The sender's snd_nxt is at sequence number N [1], but the receiver claims to have received (rcv_nxt) up to N + 3 * MSS [2]. Last piece of the puzzle is that senders rtx queue is not empty (let's say the block in the rtx queue is at sequence number N - 4 * MSS [3]). In this state, sender sends a retransmission from the rtx queue with a single segment, and sequence numbers N-4*MSS:N-3*MSS [3]. Receiver sees it and responds with an ACK all the way up to N + 3 * MSS [2]. But sender will reject this ack as TCP_ACK_UNSENT_DATA because it has no recollection of ever sending data that far out [1]. And we are stuck. The root cause is the mess of the xmit return codes. veth returns an error when it can't xmit a frame. We end up with a loss event like this:

  -------------------------------------------------
  |   GSO super frame 1   |   GSO super frame 2   |
  |-----------------------------------------------|
  | seg | seg | seg | seg | seg | seg | seg | seg |
  |  1  |  2  |  3  |  4  |  5  |  6  |  7  |  8  |
  -------------------------------------------------
     x    ok    ok    <ok>|  ok    ok    ok   <x>
                          \
                           snd_nxt

"x" means packet lost by veth, and "ok" means it went thru. Since veth has TSO disabled in this test it sees individual segments. Segment 1 is on the retransmit queue and will be resent. So why did the sender not advance snd_nxt even tho it clearly did send up to seg 8? tcp_write_xmit() interprets the return code from the core to mean that data has not been sent at all. 
Since TCP deals with GSO super frames, not individual segment the crux of the problem is that loss of a single segment can be interpreted as loss of all. TCP only sees the last return code for the last segment of the GSO frame (in <> brackets in the diagram above). Of course for the problem to occur we need a setup or a device without a Qdisc. Otherwise Qdisc layer disconnects the protocol layer from the device errors completely. We have multiple ways to fix this. 1) make veth not return an error when it lost a packet. While this is what I think we did in the past, the issue keeps reappearing and it's annoying to debug. The game of whack a mole is not great. 2) fix the damn return codes We only talk about NETDEV_TX_OK and NETDEV_TX_BUSY in the documentation, so maybe we should make the return code from ndo_start_xmit() a boolean. I like that the most, but perhaps some ancient, not-really-networking protocol would suffer. 3) make TCP ignore the errors It is not entirely clear to me what benefit TCP gets from interpreting the result of ip_queue_xmit()? Specifically once the connection is established and we're pushing data - packet loss is just packet loss? 4) this fix Ignore the rc in the Qdisc-less+GSO case, since it's unreliable. We already always return OK in the TCQ_F_CAN_BYPASS case. In the Qdisc-less case let's be a bit more conservative and only mask the GSO errors. This path is taken by non-IP-"networks" like CAN, MCTP etc, so we could regress some ancient thing. This is the simplest, but also maybe the hackiest fix? Similar fix has been proposed by Eric in the past but never committed because original reporter was working with an OOT driver and wasn't providing feedback (see Link). 
Link: https://lore.kernel.org/CANn89iJcLepEin7EtBETrZ36bjoD9LrR=k4cfwWh046GB+4f9A@mail.gmail.com Fixes: `1f59533f9c` ("qdisc: validate frames going through the direct_xmit path") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260223235100.108939-1-kuba@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-26 11:35:00 +01:00
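Option 4 from the message above (mask the return code for GSO frames on the Qdisc-less path) can be modelled in a few lines. This is a deliberately simplified sketch, not the code in the networking core: `XMIT_OK`, `XMIT_DROP`, `xmit_segments()` and `direct_xmit()` are hypothetical names, and the per-segment results are supplied as an array instead of coming from a device.

```c
#include <stdbool.h>

enum { XMIT_OK = 0, XMIT_DROP = 1 };

/*
 * Transmits every segment of a frame and returns only the result of the
 * last one, which is all the protocol layer gets to see according to the
 * commit message. *sent counts the segments that actually went out.
 */
static int xmit_segments(const int *seg_rc, int nsegs, int *sent)
{
    int rc = XMIT_OK;

    for (int i = 0; i < nsegs; i++) {
        rc = seg_rc[i];
        if (rc == XMIT_OK)
            (*sent)++;
    }
    return rc;
}

/*
 * Qdisc-less transmit path in this model: for a GSO frame the last
 * segment's result says nothing reliable about the others, so it is
 * reported as success and any loss is left to normal loss recovery.
 * Non-GSO frames keep their real return code.
 */
static int direct_xmit(const int *seg_rc, int nsegs, bool is_gso, int *sent)
{
    int rc = xmit_segments(seg_rc, nsegs, sent);

    if (is_gso)
        rc = XMIT_OK;
    return rc;
}
```

For "GSO super frame 2" in the diagram (ok, ok, ok, x), three segments leave the device yet the raw return code reports a drop. Masking it lets the sender advance as if the frame had been sent, which matches what the receiver observed.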
Paolo Abeni	f0a2f2aadb	Merge branch 'vsock-add-write-once-semantics-to-child_ns_mode' Bobby Eshleman says: ==================== vsock: add write-once semantics to child_ns_mode Two administrator processes may race when setting child_ns_mode: one sets it to "local" and creates a namespace, but another changes it to "global" in between. The first process ends up with a namespace in the wrong mode. Make child_ns_mode write-once so that a namespace manager can set it once, check the value, and be guaranteed it won't change before creating its namespaces. Writing a different value after the first write returns -EBUSY. One patch for the implementation, one for docs, and one for tests. v2: https://lore.kernel.org/r/20260218-vsock-ns-write-once-v2-0-19e4c50d509a@meta.com v1: https://lore.kernel.org/r/20260217-vsock-ns-write-once-v1-1-a1fb30f289a9@meta.com ==================== Link: https://patch.msgid.link/20260223-vsock-ns-write-once-v3-0-c0cde6959923@meta.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-26 11:10:06 +01:00
Bobby Eshleman	b6302e057f	vsock: document write-once behavior of the child_ns_mode sysctl Update the vsock child_ns_mode documentation to include the new write-once semantics of setting child_ns_mode. The semantics are implemented in a preceding patch in this series. Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Link: https://patch.msgid.link/20260223-vsock-ns-write-once-v3-3-c0cde6959923@meta.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-26 11:10:03 +01:00
Bobby Eshleman	102eab95f0	vsock: lock down child_ns_mode as write-once Two administrator processes may race when setting child_ns_mode as one process sets child_ns_mode to "local" and then creates a namespace, but another process changes child_ns_mode to "global" between the write and the namespace creation. The first process ends up with a namespace in "global" mode instead of "local". While this can be detected after the fact by reading ns_mode and retrying, it is fragile and error-prone. Make child_ns_mode write-once so that a namespace manager can set it once and be sure it won't change. Writing a different value after the first write returns -EBUSY. This applies to all namespaces, including init_net, where an init process can write "local" to lock all future namespaces into local mode. Fixes: `eafb64f40c` ("vsock: add netns to vsock core") Suggested-by: Daan De Meyer <daan.j.demeyer@gmail.com> Suggested-by: Stefano Garzarella <sgarzare@redhat.com> Co-developed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Link: https://patch.msgid.link/20260223-vsock-ns-write-once-v3-2-c0cde6959923@meta.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-26 11:10:03 +01:00
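The write-once rule described in the commit above fits in a few lines. This is a single-threaded illustration only: `struct ns_cfg`, `enum ns_mode` and `set_child_mode()` are hypothetical names, and the sketch assumes that rewriting the same value succeeds, which the message implies by saying only a different value returns -EBUSY. A real implementation would also need to make the check and the store atomic with respect to concurrent writers.

```c
#include <errno.h>

enum ns_mode { NS_MODE_GLOBAL, NS_MODE_LOCAL };

struct ns_cfg {
    enum ns_mode child_mode;
    int locked; /* set by the first successful write */
};

/*
 * The first write stores the mode and locks it. Later writes succeed only
 * when they repeat the stored value; a different value returns -EBUSY and
 * leaves the stored mode untouched.
 */
static int set_child_mode(struct ns_cfg *cfg, enum ns_mode mode)
{
    if (cfg->locked && cfg->child_mode != mode)
        return -EBUSY;

    cfg->child_mode = mode;
    cfg->locked = 1;
    return 0;
}
```

A namespace manager can then write its mode, treat a 0 return as a guarantee that the value stays put, and create its namespaces without re-reading the setting afterwards.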
Bobby Eshleman	a382a34276	selftests/vsock: change tests to respect write-once child ns mode The child_ns_mode sysctl parameter becomes write-once in a future patch in this series, which breaks existing tests. This patch updates the tests to respect this new policy. No additional tests are added. Add "global-parent" and "local-parent" namespaces as intermediaries to spawn namespaces in the given modes. This avoids the need to change "child_ns_mode" in the init_ns. nsenter must be used because ip netns unshares the mount namespace so nested "ip netns add" breaks exec calls from the init ns. Adds nsenter to the deps check. Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com> Link: https://patch.msgid.link/20260223-vsock-ns-write-once-v3-1-c0cde6959923@meta.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-26 11:10:03 +01:00
Paolo Abeni	90fcb0f3bc	Merge branch 'net-mlx5e-shampo-allow-high-order-pages-in-zerocopy-mode' Tariq Toukan says: ==================== net/mlx5e: SHAMPO, Allow high order pages in zerocopy mode This series adds support for high order pages when io_uring/devmem zero copy is used. See detailed description by Dragos below. The first patches are moving code around to allow using queue specific parameters that are not just for XSK. They are a bit large as they touch a lot of functions. The middle part of the series is updating various formulas to remove remaining hardcoded use of PAGE_SIZE/PAGE_SHIFT. The last part adds support for high order pages by implementing the queue configuration functions and allowing larger rx_page_size configurations when in zero-copy mode. Results show an increase in BW and a decrease in CPU usage. The benchmark was done with the zcrx samples from liburing [0]. rx_buf_len=4K, oncpu [1]: packets=3358832 (MB=820027), rps=55794 (MB/s=13621) Average: CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle Average: 9 1.56 0.00 18.09 13.42 0.00 66.80 0.00 0.00 0.00 0.12 rx_buf_len=128K, oncpu [2]: packets=3781376 (MB=923187), rps=62813 (MB/s=15335) Average: CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle Average: 9 0.33 0.00 7.61 18.86 0.00 73.08 0.00 0.00 0.00 0.12 rx_buf_len=4K, offcpu [3]: packets=3460368 (MB=844816), rps=57481 (MB/s=14033) Average: CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle Average: 9 0.00 0.00 0.26 0.00 0.00 92.63 0.00 0.00 0.00 7.11 Average: 11 3.04 0.00 68.09 28.87 0.00 0.00 0.00 0.00 0.00 0.00 rx_buf_len=128K, offcpu [4]: packets=4119840 (MB=1005820), rps=68435 (MB/s=16707) Average: CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle Average: 9 0.00 0.00 0.87 0.00 0.00 63.77 0.00 0.00 0.00 35.36 Average: 11 1.96 0.00 43.68 54.37 0.00 0.00 0.00 0.00 0.00 0.00 [0] https://github.com/isilence/liburing/tree/zcrx/rx-buf-len [1] commands: $> taskset -c 9 ./zcrx 6 
-i eth2 -q 9 -A 1 -B 4096 -S 33554432 $> ./send-zerocopy tcp -6 -D 2001:db8::1 -t 60 -C 0 -l 1 -b 1 -n 1 -z 1 -d -s 256000 [2] commands: $> taskset -c 9 ./zcrx 6 -i eth2 -q 9 -A 1 -B 131072 -S 33554432 $> ./send-zerocopy tcp -6 -D 2001:db8::1 -t 60 -C 0 -l 1 -b 1 -n 1 -z 1 -d -s 256000 [3] commands: $> taskset -c 11 ./zcrx 6 -i eth2 -q 9 -A 1 -B 4096 -S 33554432 $> ./send-zerocopy tcp -6 -D 2001:db8::1 -t 60 -C 0 -l 1 -b 1 -n 1 -z 1 -d -s 256000 [4] commands: $> taskset -c 11 ./zcrx 6 -i eth2 -q 9 -A 1 -B 131072 -S 33554432 $> ./send-zerocopy tcp -6 -D 2001:db8::1 -t 60 -C 0 -l 1 -b 1 -n 1 -z 1 -d -s 256000 ==================== Link: https://patch.msgid.link/20260223204155.1783580-1-tariqt@nvidia.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-26 10:54:41 +01:00
Dragos Tatulea	df5135fced	net/mlx5e: SHAMPO, Allow high order pages in zerocopy mode Allow high order pages only when SHAMPO mode is enabled (hw-gro) and the queue is used for zerocopy (has memory provider ops set). The limit is 128K and it was chosen for the following reasons: - 256K size requires a special case during MTT calculation to split the page in two. That's because two MTTs are needed to form an octword. - Higher sizes require increasing WQE size and/or reducing the number of WQEs. - Having the RQ lined with too few large pages can lead to refill issues. Results show an increase in BW and a decrease in CPU usage. The benchmark was done with the zcrx samples from liburing [0]. rx_buf_len=4K, oncpu [1]: packets=3358832 (MB=820027), rps=55794 (MB/s=13621) Average: CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle Average: 9 1.56 0.00 18.09 13.42 0.00 66.80 0.00 0.00 0.00 0.12 rx_buf_len=128K, oncpu [2]: packets=3781376 (MB=923187), rps=62813 (MB/s=15335) Average: CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle Average: 9 0.33 0.00 7.61 18.86 0.00 73.08 0.00 0.00 0.00 0.12 rx_buf_len=4K, offcpu [3]: packets=3460368 (MB=844816), rps=57481 (MB/s=14033) Average: CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle Average: 9 0.00 0.00 0.26 0.00 0.00 92.63 0.00 0.00 0.00 7.11 Average: 11 3.04 0.00 68.09 28.87 0.00 0.00 0.00 0.00 0.00 0.00 rx_buf_len=128K, offcpu [4]: packets=4119840 (MB=1005820), rps=68435 (MB/s=16707) Average: CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle Average: 9 0.00 0.00 0.87 0.00 0.00 63.77 0.00 0.00 0.00 35.36 Average: 11 1.96 0.00 43.68 54.37 0.00 0.00 0.00 0.00 0.00 0.00 [0] https://github.com/isilence/liburing/tree/zcrx/rx-buf-len [1] commands: $> taskset -c 9 ./zcrx 6 -i eth2 -q 9 -A 1 -B 4096 -S 33554432 $> ./send-zerocopy tcp -6 -D 2001:db8::1 -t 60 -C 0 -l 1 -b 1 -n 1 -z 1 -d -s 256000 [2] commands: $> taskset -c 9 ./zcrx 6 -i eth2 -q 9 -A 1 -B 131072 -S 33554432 
$> ./send-zerocopy tcp -6 -D 2001:db8::1 -t 60 -C 0 -l 1 -b 1 -n 1 -z 1 -d -s 256000 [3] commands: $> taskset -c 11 ./zcrx 6 -i eth2 -q 9 -A 1 -B 4096 -S 33554432 $> ./send-zerocopy tcp -6 -D 2001:db8::1 -t 60 -C 0 -l 1 -b 1 -n 1 -z 1 -d -s 256000 [4] commands: $> taskset -c 11 ./zcrx 6 -i eth2 -q 9 -A 1 -B 131072 -S 33554432 $> ./send-zerocopy tcp -6 -D 2001:db8::1 -t 60 -C 0 -l 1 -b 1 -n 1 -z 1 -d -s 256000 Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20260223204155.1783580-16-tariqt@nvidia.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-26 10:54:24 +01:00
Dragos Tatulea	5b6e0ddb36	net/mlx5e: Add param helper to calculate max page size This function will be necessary to determine the upper limit of rx-page-size. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20260223204155.1783580-15-tariqt@nvidia.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-26 10:54:24 +01:00
Dragos Tatulea	585cfa99d3	net/mlx5e: Pass netdev queue config to param calculations If set, take rx_page_size into consideration when calculating the page shift in Multi Packet WQE mode. The queue config is saved in the mlx5e_rq_opt_param struct which is added to the mlx5e_channel_param struct. Now the configuration can be read from the struct instead of adding it as an argument to all call sites. For consistency, the queue config is assigned in mlx5e_build_channel_param(). The queue configuration is read only from queue management ops as that's the only place where it is currently useful. Furthermore, netdev_queue_config() expects netdev->queue_mgmt_ops to be set which is not always the case (representor netdevs). Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20260223204155.1783580-14-tariqt@nvidia.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-26 10:54:24 +01:00
Dragos Tatulea	0fa8c93357	net/mlx5e: Add queue config ops for page size For now allow only PAGE_SIZE. A subsequent patch will add support for high order pages in zero-copy mode. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20260223204155.1783580-13-tariqt@nvidia.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-26 10:54:23 +01:00
Dragos Tatulea	8611660778	net/mlx5e: RX, Make page frag bias more robust The formula uses the system page size but does not account for high order pages. One way to fix this would be to adapt the formula to take into account the pool order. This would require calculating it for every allocation or adding an additional rq struct member to hold the bias max. However, the above is not really needed as the driver doesn't check the bias value. It has other means to calculate the expected number of fragments based on context. This patch simply sets the value to the max possible value. A sanity check is added during queue init phase to avoid having really big pages from using more fragments than the type can fit. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20260223204155.1783580-12-tariqt@nvidia.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-26 10:54:23 +01:00
Dragos Tatulea	0285cc3dac	net/mlx5e: Alloc rq drop page based on calculated page_shift An upcoming patch will allow setting the page order for RX pages to be greater than 0. Make sure that the drop page will also be allocated with the right size when that happens. Take extra care when calculating the drop page size to account for page_shift < PAGE_SHIFT which can happen for xsk. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20260223204155.1783580-11-tariqt@nvidia.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-26 10:54:23 +01:00
Dragos Tatulea	3a145cf492	net/mlx5e: Set page_pool order based on calculated page_shift Instead of unconditionally setting the page_pool to 0, calculate it from page_shift for MPWQE case. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20260223204155.1783580-10-tariqt@nvidia.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-26 10:54:23 +01:00
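The relationship between a configured page_shift and the page pool order mentioned in the two commits above can be sketched as simple arithmetic. This is an assumption-laden illustration, not the mlx5e code: `SYS_PAGE_SHIFT` assumes 4 KiB system pages, and `pool_order()` and `alloc_size()` are hypothetical helpers. The clamp to order 0 models the page_shift < PAGE_SHIFT case that the drop-page commit calls out for xsk.

```c
#include <stddef.h>

#define SYS_PAGE_SHIFT 12 /* assumption: 4 KiB system pages */

/*
 * Order of the allocation needed to back a buffer of 2^page_shift bytes.
 * Shifts smaller than the system page shift still need one whole page,
 * so the order is clamped at 0 instead of going negative.
 */
static unsigned int pool_order(unsigned int page_shift)
{
    return page_shift > SYS_PAGE_SHIFT ? page_shift - SYS_PAGE_SHIFT : 0;
}

/* Size in bytes of an allocation of the computed order. */
static size_t alloc_size(unsigned int page_shift)
{
    return (size_t)1 << (SYS_PAGE_SHIFT + pool_order(page_shift));
}
```

Under these assumptions a 128K buffer (page_shift 17) maps to order 5, while page_shift values of 12 and below all map to a single order-0 page.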
Dragos Tatulea	dff1c3164a	net/mlx5e: SHAMPO, Always calculate page size Adapt the rx path in SHAMPO mode to calculate page size based on configured page_shift when dealing with payload data. This is necessary as an upcoming patch will add support for using different page sizes. This change has no functional changes. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20260223204155.1783580-9-tariqt@nvidia.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-26 10:54:23 +01:00
Dragos Tatulea	3707a73854	net/mlx5e: Drop unused channel parameters The channel parameters from struct mlx5_qmgmt_data are built in mlx5e_queue_mem_alloc() but are not used. mlx5e_open_channel() builds the channel parameters internally and those parameters will be the ones that are used when opening the queue. This patch drops the unused parameters. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20260223204155.1783580-8-tariqt@nvidia.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-26 10:54:23 +01:00
Dragos Tatulea	099efb294e	net/mlx5e: Move xsk param into new option container struct The xsk parameter configuration (struct mlx5e_xsk_param) is passed around many places during parameter calculation. It is used to contain channel specific information (as opposed to the global info from struct mlx5e_params). Upcoming changes will need to push similar channel specific rq configuration. Instead of adding one more parameter to all these functions, create a new container structure that has optional rq specific parameters. The xsk parameter will be the first of such kind. The new container struct is itself optional. That means that before checking its members, it has to be checked itself for validity. This patch has no functional changes. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20260223204155.1783580-7-tariqt@nvidia.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-26 10:54:23 +01:00
Dragos Tatulea	8a96b9144f	net/mlx5e: Alloc xsk channel param out of mlx5e_open_xsk() Currently the allocation and filling of the xsk channel parameters was done in mlx5e_open_xsk(). Move this responsibility out of mlx5e_open_xsk() and have the function take an already filled mlx5e_channel_param. mlx5e_open_channel() already allocates channel parameters. The only precaution that is needed is to call mlx5e_build_xsk_channel_param() before mlx5e_open_xsk(). mlx5e_xsk_enable_locked() now allocates and fills the xsk parameters. For simplicity, link the xsk parameters in struct mlx5e_channel_params so that channel params can be passed around. This patch has no functional changes. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20260223204155.1783580-6-tariqt@nvidia.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-26 10:54:23 +01:00
Dragos Tatulea	ba4f39c256	net/mlx5e: Expose and rename xsk channel parameter function mlx5e_build_xsk_cparam() is meant to be the alternative to mlx5e_build_channel_param(). It calculates only the parameters that it requires using the previously configured mlx5e_xsk_param. Move this function to params.c to be alongside mlx5e_build_channel_param() and give it a similar name. Expose the function as it will be needed by upcoming changes. This patch has no functional changes. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20260223204155.1783580-5-tariqt@nvidia.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-26 10:54:23 +01:00
Dragos Tatulea	a2ff2f5f80	net/mlx5e: Extract max_xsk_wqebbs into its own function Calculating max_xsk_wqebbs seems large enough to deserve its own function. It will make upcoming changes easier. This patch has no functional changes. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20260223204155.1783580-4-tariqt@nvidia.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-26 10:54:23 +01:00
Dragos Tatulea	d3a99b71a2	net/mlx5e: Extract striding rq param calculation in function Calculating parameters for striding rq is large enough to deserve its own function. As the names are also very long it is very easy to hit on the 80 char limitation every time a change is made. This is an additional sign that it should be extracted into its own function. This patch has no functional change. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20260223204155.1783580-3-tariqt@nvidia.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-26 10:54:23 +01:00
Dragos Tatulea	376cf42274	net/mlx5e: Make mlx5e_rq_param naming consistent This structure is used under different names: rq_param, rq_params, param, rqp. Refactor the code to use a single name: rq_param. This patch has no functional change. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20260223204155.1783580-2-tariqt@nvidia.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-26 10:54:23 +01:00
Jakub Kicinski	97f87e5788	Merge branch 'mlx5-misc-fixes-2026-02-24' Tariq Toukan says: ==================== mlx5 misc fixes 2026-02-24 This patchset provides misc bug fixes from the team to the mlx5 core and Eth drivers. ==================== Link: https://patch.msgid.link/20260224114652.1787431-1-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-25 20:01:53 -08:00
Jianbo Liu	859380694f	net/mlx5e: Fix "scheduling while atomic" in IPsec MAC address query Fix a "scheduling while atomic" bug in mlx5e_ipsec_init_macs() by replacing mlx5_query_mac_address() with ether_addr_copy() to get the local MAC address directly from netdev->dev_addr. The issue occurs because mlx5_query_mac_address() queries the hardware which involves mlx5_cmd_exec() that can sleep, but it is called from the mlx5e_ipsec_handle_event workqueue which runs in atomic context. The MAC address is already available in netdev->dev_addr, so no need to query hardware. This avoids the sleeping call and resolves the bug. Call trace: BUG: scheduling while atomic: kworker/u112:2/69344/0x00000200 __schedule+0x7ab/0xa20 schedule+0x1c/0xb0 schedule_timeout+0x6e/0xf0 __wait_for_common+0x91/0x1b0 cmd_exec+0xa85/0xff0 [mlx5_core] mlx5_cmd_exec+0x1f/0x50 [mlx5_core] mlx5_query_nic_vport_mac_address+0x7b/0xd0 [mlx5_core] mlx5_query_mac_address+0x19/0x30 [mlx5_core] mlx5e_ipsec_init_macs+0xc1/0x720 [mlx5_core] mlx5e_ipsec_build_accel_xfrm_attrs+0x422/0x670 [mlx5_core] mlx5e_ipsec_handle_event+0x2b9/0x460 [mlx5_core] process_one_work+0x178/0x2e0 worker_thread+0x2ea/0x430 Fixes: `cee137a634` ("net/mlx5e: Handle ESN update events") Signed-off-by: Jianbo Liu <jianbol@nvidia.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260224114652.1787431-6-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-25 20:01:44 -08:00
Shay Drory	60253042c0	net/mlx5: Fix missing devlink lock in SRIOV enable error path The cited commit missed adding locking in the error path of mlx5_sriov_enable(). When pci_enable_sriov() fails, mlx5_device_disable_sriov() is called to clean up. This cleanup function now expects to be called with the devlink instance lock held. Add the missing devl_lock(devlink) and devl_unlock(devlink). Fixes: `84a433a40d` ("net/mlx5: Lock mlx5 devlink reload callbacks") Signed-off-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260224114652.1787431-5-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-25 20:01:44 -08:00
Shay Drory	d7073e8b97	net/mlx5: E-switch, Clear legacy flag when moving to switchdev The cited commit introduced MLX5_PRIV_FLAGS_SWITCH_LEGACY to identify when a transition to legacy mode is requested via devlink. However, the logic failed to clear this flag if the mode was subsequently changed back to MLX5_ESWITCH_OFFLOADS (switchdev). Consequently, if a user toggled from legacy to switchdev, the flag remained set, leaving the driver with wrong state indicating Fix this by explicitly clearing the MLX5_PRIV_FLAGS_SWITCH_LEGACY bit when the requested mode is MLX5_ESWITCH_OFFLOADS. Fixes: `2a4f56fbcc` ("net/mlx5e: Keep netdev when leave switchdev for devlink set legacy only") Signed-off-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260224114652.1787431-4-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-25 20:01:44 -08:00
Shay Drory	bd7b9f83fb	net/mlx5: LAG, disable MPESW in lag_disable_change() mlx5_lag_disable_change() unconditionally called mlx5_disable_lag() when LAG was active, which is incorrect for MLX5_LAG_MODE_MPESW. Hence, call mlx5_disable_mpesw() when running in MPESW mode. Fixes: `a32327a3a0` ("net/mlx5: Lag, Control MultiPort E-Switch single FDB mode") Signed-off-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260224114652.1787431-3-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-25 20:01:44 -08:00
Shay Drory	2700b7e603	net/mlx5: DR, Fix circular locking dependency in dump Fix a circular locking dependency between dbg_mutex and the domain rx/tx mutexes that could lead to a deadlock. The dump path in dr_dump_domain_all() was acquiring locks in the order: dbg_mutex -> rx.mutex -> tx.mutex While the table/matcher creation paths acquire locks in the order: rx.mutex -> tx.mutex -> dbg_mutex This inverted lock ordering creates a circular dependency. Fix this by changing dr_dump_domain_all() to acquire the domain lock before dbg_mutex, matching the order used in mlx5dr_table_create() and mlx5dr_matcher_create(). Lockdep splat: ====================================================== WARNING: possible circular locking dependency detected 6.19.0-rc6net_next_e817c4e #1 Not tainted ------------------------------------------------------ sos/30721 is trying to acquire lock: ffff888102df5900 (&dmn->info.rx.mutex){+.+.}-{4:4}, at: dr_dump_start+0x131/0x450 [mlx5_core] but task is already holding lock: ffff888102df5bc0 (&dmn->dump_info.dbg_mutex){+.+.}-{4:4}, at: dr_dump_start+0x10b/0x450 [mlx5_core] which lock already depends on the new lock. 
the existing dependency chain (in reverse order) is: -> #2 (&dmn->dump_info.dbg_mutex){+.+.}-{4:4}: __mutex_lock+0x91/0x1060 mlx5dr_matcher_create+0x377/0x5e0 [mlx5_core] mlx5_cmd_dr_create_flow_group+0x62/0xd0 [mlx5_core] mlx5_create_flow_group+0x113/0x1c0 [mlx5_core] mlx5_chains_create_prio+0x453/0x2290 [mlx5_core] mlx5_chains_get_table+0x2e2/0x980 [mlx5_core] esw_chains_create+0x1e6/0x3b0 [mlx5_core] esw_create_offloads_fdb_tables.cold+0x62/0x63f [mlx5_core] esw_offloads_enable+0x76f/0xd20 [mlx5_core] mlx5_eswitch_enable_locked+0x35a/0x500 [mlx5_core] mlx5_devlink_eswitch_mode_set+0x561/0x950 [mlx5_core] devlink_nl_eswitch_set_doit+0x67/0xe0 genl_family_rcv_msg_doit+0xe0/0x130 genl_rcv_msg+0x188/0x290 netlink_rcv_skb+0x4b/0xf0 genl_rcv+0x24/0x40 netlink_unicast+0x1ed/0x2c0 netlink_sendmsg+0x210/0x450 __sock_sendmsg+0x38/0x60 __sys_sendto+0x119/0x180 __x64_sys_sendto+0x20/0x30 do_syscall_64+0x70/0xd00 entry_SYSCALL_64_after_hwframe+0x4b/0x53 -> #1 (&dmn->info.tx.mutex){+.+.}-{4:4}: __mutex_lock+0x91/0x1060 mlx5dr_table_create+0x11d/0x530 [mlx5_core] mlx5_cmd_dr_create_flow_table+0x62/0x140 [mlx5_core] __mlx5_create_flow_table+0x46f/0x960 [mlx5_core] mlx5_create_flow_table+0x16/0x20 [mlx5_core] esw_create_offloads_fdb_tables+0x136/0x240 [mlx5_core] esw_offloads_enable+0x76f/0xd20 [mlx5_core] mlx5_eswitch_enable_locked+0x35a/0x500 [mlx5_core] mlx5_devlink_eswitch_mode_set+0x561/0x950 [mlx5_core] devlink_nl_eswitch_set_doit+0x67/0xe0 genl_family_rcv_msg_doit+0xe0/0x130 genl_rcv_msg+0x188/0x290 netlink_rcv_skb+0x4b/0xf0 genl_rcv+0x24/0x40 netlink_unicast+0x1ed/0x2c0 netlink_sendmsg+0x210/0x450 __sock_sendmsg+0x38/0x60 __sys_sendto+0x119/0x180 __x64_sys_sendto+0x20/0x30 do_syscall_64+0x70/0xd00 entry_SYSCALL_64_after_hwframe+0x4b/0x53 -> #0 (&dmn->info.rx.mutex){+.+.}-{4:4}: __lock_acquire+0x18b6/0x2eb0 lock_acquire+0xd3/0x2c0 __mutex_lock+0x91/0x1060 dr_dump_start+0x131/0x450 [mlx5_core] seq_read_iter+0xe3/0x410 seq_read+0xfb/0x130 full_proxy_read+0x53/0x80 
vfs_read+0xba/0x330 ksys_read+0x65/0xe0 do_syscall_64+0x70/0xd00 entry_SYSCALL_64_after_hwframe+0x4b/0x53 Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&dmn->dump_info.dbg_mutex); lock(&dmn->info.tx.mutex); lock(&dmn->dump_info.dbg_mutex); lock(&dmn->info.rx.mutex); * DEADLOCK * Fixes: `9222f0b27d` ("net/mlx5: DR, Add support for dumping steering info") Signed-off-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Reviewed-by: Alex Vesker <valex@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260224114652.1787431-2-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-25 20:01:43 -08:00
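The lock-ordering fix described in 2700b7e603 can be illustrated with a small user-space model. This is only a sketch in Python, not the driver code: the three `threading.Lock` objects are hypothetical stand-ins for `rx.mutex`, `tx.mutex`, and `dbg_mutex`, and `create_path` / `dump_path` are invented names for the two code paths. What it shows is that when both paths take the locks in the same order (rx, then tx, then dbg), neither thread can end up holding a lock the other needs while waiting for one the other holds.

```python
import threading

# Hypothetical stand-ins for the three kernel mutexes named in the commit.
rx_mutex = threading.Lock()
tx_mutex = threading.Lock()
dbg_mutex = threading.Lock()

completed = []


def create_path(rounds):
    """Model of the table/matcher creation path: rx -> tx -> dbg."""
    for _ in range(rounds):
        # A multi-item `with` acquires the locks left to right.
        with rx_mutex, tx_mutex, dbg_mutex:
            pass
    completed.append("create")


def dump_path(rounds):
    """Model of the dump path after the fix: same order as create_path."""
    for _ in range(rounds):
        with rx_mutex, tx_mutex, dbg_mutex:
            pass
    completed.append("dump")


def run_both(rounds=1000):
    """Run both paths concurrently and report which ones finished."""
    completed.clear()
    threads = [
        threading.Thread(target=create_path, args=(rounds,), daemon=True),
        threading.Thread(target=dump_path, args=(rounds,), daemon=True),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join(timeout=10)
    return sorted(completed)
```

If `dump_path` instead took `dbg_mutex` first, as the pre-fix dump code is described as doing, the two threads could each hold one lock while waiting on the other, which is the scenario the lockdep splat reports.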
Jakub Kicinski	6668c6f2dd	Merge tag 'wireless-2026-02-25' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless Johannes Berg says: ==================== A good number of fixes: - cfg80211: - cancel rfkill work appropriately - fix radiotap parsing to correctly reject field 18 - fix wext (yes...) off-by-one for IGTK key ID - mac80211: - fix for mesh NULL pointer dereference - fix for stack out-of-bounds (2 bytes) write on specific multi-link action frames - set default WMM parameters for all links - mwifiex: check dev_alloc_name() return value correctly - libertas: fix potential timer use-after-free - brcmfmac: fix crash on probe failure * tag 'wireless-2026-02-25' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless: wifi: mac80211: fix NULL pointer dereference in mesh_rx_csa_frame() wifi: mac80211: bounds-check link_id in ieee80211_ml_reconfiguration wifi: mac80211: set default WMM parameters on all links wifi: libertas: fix use-after-free in lbs_free_adapter() wifi: mwifiex: Fix dev_alloc_name() return value check wifi: brcmfmac: Fix potential kernel oops when probe fails wifi: radiotap: reject radiotap with unknown bits wifi: cfg80211: cancel rfkill_block work in wiphy_unregister() wifi: cfg80211: wext: fix IGTK key ID off-by-one ==================== Link: https://patch.msgid.link/20260225113159.360574-3-johannes@sipsolutions.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-25 19:54:28 -08:00
Jakub Kicinski	c7f5c6fb0f	Merge branch 'add-selftests-helper-to-get-n-unique-ports' Dimitri Daskalakis says: ==================== Add selftests helper to get N unique ports The rss_ctx.py tests would occasionally flake. I found that the successive calls to rand_port would occasionally return duplicate ports, breaking the tests' invariants. Add a new helper that guarantees generated ports are unique. ==================== Link: https://patch.msgid.link/20260224224659.1507082-1-dimitri.daskalakis1@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-25 19:42:06 -08:00
Dimitri Daskalakis	19c3a2a81d	selftests: drv-net: rss: Generate unique ports for RSS context tests The RSS ctx tests rely on NFC rules with unique ports to steer packets to the correct ctx. This updates the test to use the new rand_ports() helper to guarantee the ports are unique. Manual testing shows that generating 32 ports with the existing method would result in at least one duplicate 4% of the time. Signed-off-by: Dimitri Daskalakis <dimitri.daskalakis1@gmail.com> Link: https://patch.msgid.link/20260224224659.1507082-3-dimitri.daskalakis1@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-25 19:42:02 -08:00
Dimitri Daskalakis	b0249c0d41	selftests: net: py: Add rand_ports helper method Certain tests need a unique set of ports. Successive calls to the existing rand_port method may return a duplicate port, resulting in test flakiness. The new helper keeps sockets open while building a list of ephemeral ports, thus the kernel enforces their uniqueness. Signed-off-by: Dimitri Daskalakis <dimitri.daskalakis1@gmail.com> Link: https://patch.msgid.link/20260224224659.1507082-2-dimitri.daskalakis1@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-25 19:42:02 -08:00
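The technique described in b0249c0d41 (keeping sockets open so the kernel cannot hand out the same ephemeral port twice) can be sketched as follows. This is not the helper from the patch: the function name `rand_ports_sketch`, its signature, and the choice of IPv4 loopback are assumptions made for illustration only.

```python
import socket


def rand_ports_sketch(count, stype=socket.SOCK_STREAM):
    """Return `count` distinct ephemeral ports (hypothetical helper).

    Every socket stays open until all ports have been collected, so the
    kernel will not assign the same port to two of them.
    """
    socks = []
    try:
        for _ in range(count):
            s = socket.socket(socket.AF_INET, stype)
            socks.append(s)
            # Port 0 asks the kernel to pick a free ephemeral port.
            s.bind(("127.0.0.1", 0))
        return [s.getsockname()[1] for s in socks]
    finally:
        # Release the sockets only after the full list is built.
        for s in socks:
            s.close()
```

Calling `rand_ports_sketch(32)` yields 32 different port numbers. Closing each socket before opening the next would reintroduce the chance of duplicates that the commit message describes.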
Jakub Kicinski	2cd63825c7	Merge branch 'netfilter-updates-for-net-next' Florian Westphal says: ==================== netfilter: updates for net-next including IPVS updates from and via Julian Anastasov. First updates for IPVS. From Julians cover-letter: * Convert the global __ip_vs_mutex to per-net service_mutex and switch the service tables to be per-net, cowork by Jiejian Wu and Dust Li * Convert some code that walks the service lists to use RCU instead of the service_mutex * We used two tables for services (non-fwmark and fwmark), merge them into single svc_table * The list for unavailable destinations (dest_trash) holds dsts and thus dev references causing extra work for the ip_vs_dst_event() dev notifier handler. Change this by dropping the reference when dest is removed and saved into dest_trash. The dest_trash will need more changes to make it light for lookups. TODO. * On new connection we can do multiple lookups for services by trying different fallback options. Add more counters for service types, so that we can avoid unneeded lookups for services. * The no_cport and dropentry counters can be per-net and also we can avoid extra conn lookups Then, a few cleanups for nf_tables: * keep BH enabled during nft_set_rbtree inserts, this is possible because the root lock is now only taken from control plane. * toss a few EXPORT_SYMBOLs from nf_tables; these were historic leftovers from back in the day when e.g. set backends were still residing in their own modules. * remove the register tracking infra from nftables. It was disabled years ago in 5.18 and there are no plans to salvage this work; the idea was good (remove redundant register stores), but there is just one too many pitfalls, and better rule structuring (verdict maps) largely avoids the scenarios where this would have helped. ==================== Link: https://patch.msgid.link/20260224205048.4718-1-fw@strlen.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-25 19:36:29 -08:00
Florian Westphal	6b94d081f8	netfilter: nf_tables: remove register tracking infrastructure This facility was disabled in commit `9e539c5b6d` ("netfilter: nf_tables: disable expression reduction infra"), because not all nft_exprs guarantee they will update the destination register: some may set NFT_BREAK instead to cancel evaluation of the rule. This has been dead code ever since. There are no plans to salvage this at this time, so remove this. Signed-off-by: Florian Westphal <fw@strlen.de> Link: https://patch.msgid.link/20260224205048.4718-10-fw@strlen.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-25 19:36:26 -08:00
Florian Westphal	b6461103e0	netfilter: nf_tables: drop obsolete EXPORT_SYMBOLs These are no longer required, calling objects are nowadays baked into nf_tables.ko itself. Signed-off-by: Florian Westphal <fw@strlen.de> Link: https://patch.msgid.link/20260224205048.4718-9-fw@strlen.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-25 19:36:26 -08:00
Florian Westphal	3aea466a43	netfilter: nft_set_rbtree: don't disable bh when acquiring tree lock As of commit `7e43e0a114` ("netfilter: nft_set_rbtree: translate rbtree to array for binary search") the lock is only taken from control plane, no need to disable BH anymore. Signed-off-by: Florian Westphal <fw@strlen.de> Link: https://patch.msgid.link/20260224205048.4718-8-fw@strlen.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-25 19:36:26 -08:00
Julian Anastasov	09b71fb459	ipvs: no_cport and dropentry counters can be per-net Change the no_cport counters to be per-net and address family. This should reduce the extra conn lookups done during present NO_CPORT connections. By changing from global to per-net dropentry counters, one net will not affect the drop rate of another net. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Florian Westphal <fw@strlen.de> Link: https://patch.msgid.link/20260224205048.4718-7-fw@strlen.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-25 19:36:26 -08:00
Julian Anastasov	c59bd9e62e	ipvs: use more counters to avoid service lookups When new connection is created we can lookup for services multiple times to support fallback options. We already have some counters to skip specific lookups because it costs CPU cycles for hash calculation, etc. Add more counters for fwmark/non-fwmark services (fwm_services and nonfwm_services) and make all counters per address family. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Florian Westphal <fw@strlen.de> Link: https://patch.msgid.link/20260224205048.4718-6-fw@strlen.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-25 19:36:26 -08:00
Julian Anastasov	40fb72209f	ipvs: do not keep dest_dst after dest is removed Before now dest->dest_dst is not released when server is moved into dest_trash list after removal. As result, we can keep dst/dev references for long time without actively using them. It is better to avoid walking the dest_trash list when ip_vs_dst_event() receives dev events. So, make sure we do not hold dev references in dest_trash list. As packets can be flying while server is being removed, check the IP_VS_DEST_F_AVAILABLE flag in slow path to ensure we do not save new dev references to removed servers. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Florian Westphal <fw@strlen.de> Link: https://patch.msgid.link/20260224205048.4718-5-fw@strlen.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-25 19:36:26 -08:00
Julian Anastasov	b24ae1a387	ipvs: use single svc table fwmark based services and non-fwmark based services can be hashed in same service table. This reduces the burden of working with two tables. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Florian Westphal <fw@strlen.de> Link: https://patch.msgid.link/20260224205048.4718-4-fw@strlen.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-25 19:36:25 -08:00
Julian Anastasov	3de0ec2873	ipvs: some service readers can use RCU Some places walk the services under mutex but they can just use RCU: * ip_vs_dst_event() uses ip_vs_forget_dev() which uses its own lock to modify dest * ip_vs_genl_dump_services(): ip_vs_genl_fill_service() just fills skb * ip_vs_genl_parse_service(): move RCU lock to callers ip_vs_genl_set_cmd(), ip_vs_genl_dump_dests() and ip_vs_genl_get_cmd() * ip_vs_genl_dump_dests(): just fill skb Signed-off-by: Julian Anastasov <ja@ssi.bg> Reviewed-by: Dust Li <dust.li@linux.alibaba.com> Signed-off-by: Florian Westphal <fw@strlen.de> Link: https://patch.msgid.link/20260224205048.4718-3-fw@strlen.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-25 19:36:25 -08:00
Jiejian Wu	74455a5b43	ipvs: make ip_vs_svc_table and ip_vs_svc_fwm_table per netns Current ipvs uses one global mutex "__ip_vs_mutex" to keep the global "ip_vs_svc_table" and "ip_vs_svc_fwm_table" safe. But when there are tens of thousands of services from different netns in the table, it takes a long time to look up the table, for example, using "ipvsadm -ln" from different netns simultaneously. We make "ip_vs_svc_table" and "ip_vs_svc_fwm_table" per netns, and we add "service_mutex" per netns to keep these two tables safe instead of the global "__ip_vs_mutex" in current version. To this end, looking up services from different netns simultaneously will not get stuck, shortening the time consumption in large-scale deployment. It can be reproduced using the simple scripts below. init.sh: #!/bin/bash for((i=1;i<=4;i++));do ip netns add ns$i ip netns exec ns$i ip link set dev lo up ip netns exec ns$i sh add-services.sh done add-services.sh: #!/bin/bash for((i=0;i<30000;i++)); do ipvsadm -A -t 10.10.10.10:$((80+$i)) -s rr done runtest.sh: #!/bin/bash for((i=1;i<4;i++));do ip netns exec ns$i ipvsadm -ln > /dev/null & done ip netns exec ns4 ipvsadm -ln > /dev/null Run "sh init.sh" to initiate the network environment. Then run "time ./runtest.sh" to evaluate the time consumption. Our testbed is a 4-core Intel Xeon ECS. The result of the original version is around 8 seconds, while the result of the modified version is only 0.8 seconds. Signed-off-by: Jiejian Wu <jiejian@linux.alibaba.com> Co-developed-by: Dust Li <dust.li@linux.alibaba.com> Signed-off-by: Dust Li <dust.li@linux.alibaba.com> Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Florian Westphal <fw@strlen.de> Link: https://patch.msgid.link/20260224205048.4718-2-fw@strlen.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-25 19:36:25 -08:00
Eric Woudstra	7717fbb140	net: pppoe: avoid zero-length arrays in struct pppoe_hdr Jakub Kicinski reported following issue in upcoming patches: W=1 C=1 GCC build gives us: net/bridge/netfilter/nf_conntrack_bridge.c: note: in included file (through ../include/linux/if_pppox.h, ../include/uapi/linux/netfilter_bridge.h, ../include/linux/netfilter_bridge.h): include/uapi/linux/if_pppox.h: 153:29: warning: array of flexible structures sparse doesn't like that hdr has a zero-length array which overlaps proto. The kernel code doesn't currently need those arrays. PPPoE connection is functional after applying this patch. Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Reviewed-by: Kees Cook <kees@kernel.org> Signed-off-by: Eric Woudstra <ericwouds@gmail.com> Link: https://patch.msgid.link/20260224155030.106918-1-ericwouds@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-25 19:25:08 -08:00
Abhilekh Deka	8debe7a223	net/ibmveth: fix comment typos in ibmveth.c Correct spelling mistakes in comments: - Fix misspelling of gro_receive - Fix misspelling of Partition Signed-off-by: Abhilekh Deka <abhindeka@gmail.com> Reviewed-by: Dave Marquardt <davemarq@linux.ibm.com> Link: https://patch.msgid.link/20260224153601.17534-1-abhindeka@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-25 19:23:04 -08:00

1427081 Commits