linux

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2026-05-02 14:34:13 -04:00

Author	SHA1	Message	Date
Divya Koppera	9fc3d6fe80	net: phy: microchip_t1 : Add initialization of ptp for lan887x Add initialization of ptp for lan887x. Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Divya Koppera <divya.koppera@microchip.com> Link: https://patch.msgid.link/20241219123311.30213-6-divya.koppera@microchip.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-12-23 10:30:59 -08:00
Divya Koppera	85b39f7593	net: phy: Makefile: Add makefile support for rds ptp in Microchip phys Add makefile support for rds ptp library. Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev> Signed-off-by: Divya Koppera <divya.koppera@microchip.com> Link: https://patch.msgid.link/20241219123311.30213-5-divya.koppera@microchip.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-12-23 10:30:59 -08:00
Divya Koppera	2550afc61e	net: phy: Kconfig: Add rds ptp library support and 1588 optional flag in Microchip phys Add ptp library support in Kconfig As some of Microchip T1 phys support ptp, add dependency of 1588 optional flag in Kconfig Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev> Signed-off-by: Divya Koppera <divya.koppera@microchip.com> Link: https://patch.msgid.link/20241219123311.30213-4-divya.koppera@microchip.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-12-23 10:30:59 -08:00
Divya Koppera	fa51199c5f	net: phy: microchip_rds_ptp : Add rds ptp library for Microchip phys Add rds ptp library for Microchip phys 1-step and 2-step modes are supported, over Ethernet and UDP(ipv4, ipv6) Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev> Signed-off-by: Divya Koppera <divya.koppera@microchip.com> Link: https://patch.msgid.link/20241219123311.30213-3-divya.koppera@microchip.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-12-23 10:30:58 -08:00
Divya Koppera	d46ef4ee38	net: phy: microchip_rds_ptp: Add header file for Microchip rds ptp library This rds ptp header file will cover ptp macros for future phys in Microchip where addresses will be same but base offset and mmd address may change. Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev> Signed-off-by: Divya Koppera <divya.koppera@microchip.com> Link: https://patch.msgid.link/20241219123311.30213-2-divya.koppera@microchip.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-12-23 10:30:58 -08:00
Jakub Kicinski	b4cbbf078c	Merge branch 'vsock-test-tests-for-memory-leaks' Michal Luczaj says: ==================== vsock/test: Tests for memory leaks Series adds tests for recently fixed memory leaks[1]: commit `d7b0ff5a86` ("virtio/vsock: Fix accept_queue memory leak") commit `fbf7085b3a` ("vsock: Fix sk_error_queue memory leak") commit `60cf6206a1` ("virtio/vsock: Improve MSG_ZEROCOPY error handling") Patch 1 is a non-functional preparatory cleanup. Patch 2 is a test suite extension for picking specific tests. Patch 3 explains the need of kmemleak scans. Patch 4 adapts utility functions to handle MSG_ZEROCOPY. Patches 5-6-7 add the tests. NOTE: Test in the last patch ("vsock/test: Add test for MSG_ZEROCOPY completion memory leak") may stop working even before this series is merged. See changes proposed in [2]. The failslab variant would be unaffected. [1] https://lore.kernel.org/20241107-vsock-mem-leaks-v2-0-4e21bfcfc818@rbox.co [2] https://lore.kernel.org/CANn89i+oL+qoPmbbGvE_RT3_3OWgeck7cCPcTafeehKrQZ8kyw@mail.gmail.com v3: https://lore.kernel.org/20241218-test-vsock-leaks-v3-0-f1a4dcef9228@rbox.co v2: https://lore.kernel.org/20241216-test-vsock-leaks-v2-0-55e1405742fc@rbox.co v1: https://lore.kernel.org/20241206-test-vsock-leaks-v1-0-c31e8c875797@rbox.co ==================== Link: https://patch.msgid.link/20241219-test-vsock-leaks-v4-0-a416e554d9d7@rbox.co Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-12-23 10:29:01 -08:00
Michal Luczaj	d127ac8b1d	vsock/test: Add test for MSG_ZEROCOPY completion memory leak Exercise the ENOMEM error path by attempting to hit net.core.optmem_max limit on send(). Test aims to create a memory leak, kmemleak should be employed. Fixed by commit `60cf6206a1` ("virtio/vsock: Improve MSG_ZEROCOPY error handling"). Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Michal Luczaj <mhal@rbox.co> Link: https://patch.msgid.link/20241219-test-vsock-leaks-v4-7-a416e554d9d7@rbox.co Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-12-23 10:28:01 -08:00
Michal Luczaj	ec50efee8c	vsock/test: Add test for sk_error_queue memory leak Ask for MSG_ZEROCOPY completion notification, but do not recv() it. Test attempts to create a memory leak, kmemleak should be employed. Fixed by commit `fbf7085b3a` ("vsock: Fix sk_error_queue memory leak"). Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Michal Luczaj <mhal@rbox.co> Link: https://patch.msgid.link/20241219-test-vsock-leaks-v4-6-a416e554d9d7@rbox.co Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-12-23 10:28:01 -08:00
Michal Luczaj	f66ef469a7	vsock/test: Add test for accept_queue memory leak Attempt to enqueue a child after the queue was flushed, but before SOCK_DONE flag has been set. Test tries to produce a memory leak, kmemleak should be employed. Dealing with a race condition, test by its very nature may lead to a false negative. Fixed by commit `d7b0ff5a86` ("virtio/vsock: Fix accept_queue memory leak"). Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Michal Luczaj <mhal@rbox.co> Link: https://patch.msgid.link/20241219-test-vsock-leaks-v4-5-a416e554d9d7@rbox.co Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-12-23 10:28:01 -08:00
Michal Luczaj	f52e7f593b	vsock/test: Adapt send_byte()/recv_byte() to handle MSG_ZEROCOPY For a zerocopy send(), buffer (always byte 'A') needs to be preserved (thus it can not be on the stack) or the data recv()ed check in recv_byte() might fail. While there, change the printf format to 0x%02x so the '\0' bytes can be seen. Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Michal Luczaj <mhal@rbox.co> Link: https://patch.msgid.link/20241219-test-vsock-leaks-v4-4-a416e554d9d7@rbox.co Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-12-23 10:28:01 -08:00
Michal Luczaj	50f9434463	vsock/test: Add README blurb about kmemleak usage Document the suggested use of kmemleak for memory leak detection. Suggested-by: Stefano Garzarella <sgarzare@redhat.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Michal Luczaj <mhal@rbox.co> Link: https://patch.msgid.link/20241219-test-vsock-leaks-v4-3-a416e554d9d7@rbox.co Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-12-23 10:28:00 -08:00
Michal Luczaj	ef8bd18f47	vsock/test: Introduce option to select tests Allow for selecting specific test IDs to be executed. Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Michal Luczaj <mhal@rbox.co> Link: https://patch.msgid.link/20241219-test-vsock-leaks-v4-2-a416e554d9d7@rbox.co Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-12-23 10:28:00 -08:00
Michal Luczaj	f3af3ba108	vsock/test: Use NSEC_PER_SEC Replace 1000000000ULL with NSEC_PER_SEC. No functional change intended. Reviewed-by: Luigi Leonardi <leonardi@redhat.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Michal Luczaj <mhal@rbox.co> Link: https://patch.msgid.link/20241219-test-vsock-leaks-v4-1-a416e554d9d7@rbox.co Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-12-23 10:28:00 -08:00
Yuyang Huang	aa4ad7c3f2	netlink: correct nlmsg size for multicast notifications Corrected the netlink message size calculation for multicast group join/leave notifications. The previous calculation did not account for the inclusion of both IPv4/IPv6 addresses and ifa_cacheinfo in the payload. This fix ensures that the allocated message size is sufficient to hold all necessary information. This patch also includes the following improvements: * Uses GFP_KERNEL instead of GFP_ATOMIC when holding the RTNL mutex. * Uses nla_total_size(sizeof(struct in6_addr)) instead of nla_total_size(16). * Removes unnecessary EXPORT_SYMBOL(). Fixes: `2c2b61d213` ("netlink: add IGMP/MLD join/leave notifications") Cc: Maciej Żenczykowski <maze@google.com> Cc: Lorenzo Colitti <lorenzo@google.com> Signed-off-by: Yuyang Huang <yuyanghuang@google.com> Link: https://patch.msgid.link/20241221100007.1910089-1-yuyanghuang@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-12-23 10:26:43 -08:00
Jakub Kicinski	f288c7a1ba	selftests: drv-net: assume stats refresh is 0 if no ethtool -c support Tests using HW stats wait for them to stabilize, using data from ethtool -c as the delay. Not all drivers implement ethtool -c so handle the errors gracefully. Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20241220003116.1458863-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-12-23 10:12:20 -08:00
YiFei Zhu	85101bda13	sfc: Use netdev refcount tracking in struct efx_async_filter_insertion I was debugging some netdev refcount issues in OpenOnload, and one of the places I was looking at was in the sfc driver. Only struct efx_async_filter_insertion was not using netdev refcount tracker, so add it here. GFP_ATOMIC because this code path is called by ndo_rx_flow_steer which holds RCU. This patch should be a no-op if !CONFIG_NET_DEV_REFCNT_TRACKER Signed-off-by: YiFei Zhu <zhuyifei@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20241219173004.2615655-1-zhuyifei@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-12-23 10:11:49 -08:00
Jakub Kicinski	a6ac667467	Merge branch 'net-bridge-add-skb-drop-reasons-to-the-most-common-drop-points'

Radu Rendec says:

====================
net/bridge: Add skb drop reasons to the most common drop points

The bridge input code may drop frames for various reasons and at various points in the ingress handling logic. Currently kfree_skb() is used everywhere, and therefore no drop reason is specified. Add drop reasons to the most common drop points.

The purpose of this series is to address the most common drop points on the bridge ingress path. It does not exhaustively add drop reasons to the entire bridge code. The intention here is to incrementally add drop reasons to the rest of the bridge code in follow up patches.

Most of the skb drop points that are addressed in this series can be easily tested by sending crafted packets. The diagram below shows a simple test configuration, and some examples using `packit`(*) are also included. The bridge is set up with STP disabled.
(*) https://github.com/resurrecting-open-source-projects/packit

The following changes were not tested:
* SKB_DROP_REASON_NOMEM in br_flood(). It's not easy to trigger an OOM condition for testing purposes, while everything else works correctly.
* All drop reasons in br_multicast_flood(). I could not find an easy way to make a crafted packet get there.
* SKB_DROP_REASON_BRIDGE_INGRESS_STP_STATE in br_handle_frame_finish() when the port state is BR_STATE_DISABLED, because in that case the frame is already dropped in the switch/case block at the end of br_handle_frame().

    +-------+
    |  br0  |
    +---+---+
        |
    +---+---+  veth pair  +-------+
    | veth0 +-------------+ xeth0 |
    +-------+             +-------+

SKB_DROP_REASON_MAC_INVALID_SOURCE - br_handle_frame()
    packit -t UDP -s 192.168.0.1 -d 192.168.0.2 -S 8000 -D 8000 \
      -e 01:22:33:44:55:66 -E aa:bb:cc:dd:ee:ff -c 1 \
      -p '0x de ad be ef' -i xeth0

SKB_DROP_REASON_MAC_IEEE_MAC_CONTROL - br_handle_frame()
    packit -t UDP -s 192.168.0.1 -d 192.168.0.2 -S 8000 -D 8000 \
      -e 02:22:33:44:55:66 -E 01:80:c2:00:00:01 -c 1 \
      -p '0x de ad be ef' -i xeth0

SKB_DROP_REASON_BRIDGE_INGRESS_STP_STATE - br_handle_frame()
    bridge link set dev veth0 state 0 # disabled
    packit -t UDP -s 192.168.0.1 -d 192.168.0.2 -S 8000 -D 8000 \
      -e 02:22:33:44:55:66 -E aa:bb:cc:dd:ee:ff -c 1 \
      -p '0x de ad be ef' -i xeth0

SKB_DROP_REASON_BRIDGE_INGRESS_STP_STATE - br_handle_frame_finish()
    bridge link set dev veth0 state 2 # learning
    packit -t UDP -s 192.168.0.1 -d 192.168.0.2 -S 8000 -D 8000 \
      -e 02:22:33:44:55:66 -E aa:bb:cc:dd:ee:ff -c 1 \
      -p '0x de ad be ef' -i xeth0

SKB_DROP_REASON_NO_TX_TARGET - br_flood()
    packit -t UDP -s 192.168.0.1 -d 192.168.0.2 -S 8000 -D 8000 \
      -e 02:22:33:44:55:66 -E aa:bb:cc:dd:ee:ff -c 1 \
      -p '0x de ad be ef' -i xeth0
====================

Link: https://patch.msgid.link/20241219163606.717758-1-rrendec@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-12-23 10:11:08 -08:00
Radu Rendec	623e43c2f5	net: bridge: add skb drop reasons to the most common drop points The bridge input code may drop frames for various reasons and at various points in the ingress handling logic. Currently kfree_skb() is used everywhere, and therefore no drop reason is specified. Add drop reasons to the most common drop points. Drop reasons are not added exhaustively to the entire bridge code. The intention is to incrementally add drop reasons to the rest of the bridge code in follow up patches. Signed-off-by: Radu Rendec <rrendec@redhat.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20241219163606.717758-3-rrendec@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-12-23 10:11:04 -08:00
Radu Rendec	46e0ccfb88	net: vxlan: rename SKB_DROP_REASON_VXLAN_NO_REMOTE The SKB_DROP_REASON_VXLAN_NO_REMOTE skb drop reason was introduced in the specific context of vxlan. As it turns out, there are similar cases when a packet needs to be dropped in other parts of the network stack, such as the bridge module. Rename SKB_DROP_REASON_VXLAN_NO_REMOTE and give it a more generic name, so that it can be used in other parts of the network stack. This is not a functional change, and the numeric value of the drop reason even remains unchanged. Signed-off-by: Radu Rendec <rrendec@redhat.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20241219163606.717758-2-rrendec@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-12-23 10:11:04 -08:00
Jakub Kicinski	6ad7f7196a	Merge branch 'add-more-feautues-for-enetc-v4-round-1' Wei Fang says: ==================== Add more feautues for ENETC v4 - round 1 Compared to ENETC v1 (LS1028A), ENETC v4 (i.MX95) adds more features, and some features are configured completely differently from v1. In order to more fully support ENETC v4, these features will be added through several rounds of patch sets. This round adds these features, such as Tx and Rx checksum offload, increase maximum chained Tx BD number and Large send offload (LSO). v1 Link: https://lore.kernel.org/20241107033817.1654163-1-wei.fang@nxp.com v2 Link: https://lore.kernel.org/20241111015216.1804534-1-wei.fang@nxp.com v3 Link: https://lore.kernel.org/20241112091447.1850899-1-wei.fang@nxp.com v4 Link: https://lore.kernel.org/20241115024744.1903377-1-wei.fang@nxp.com v5 Link: https://lore.kernel.org/20241118060630.1956134-1-wei.fang@nxp.com v6 Link: https://lore.kernel.org/20241119082344.2022830-1-wei.fang@nxp.com v6 RESEND Link: https://lore.kernel.org/20241204052932.112446-1-wei.fang@nxp.com v7 Link: https://lore.kernel.org/20241211063752.744975-1-wei.fang@nxp.com v8 Link: https://lore.kernel.org/20241213021731.1157535-1-wei.fang@nxp.com ==================== Link: https://patch.msgid.link/20241219054755.1615626-1-wei.fang@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-12-23 09:54:36 -08:00
Wei Fang	c12e82c053	net: enetc: add UDP segmentation offload support Set NETIF_F_GSO_UDP_L4 bit of hw_features and features because i.MX95 enetc and LS1028A driver implements UDP segmentation. - i.MX95 ENETC supports UDP segmentation via LSO. - LS1028A ENETC supports UDP segmentation since the commit `3d5b459ba0` ("net: tso: add UDP segmentation support"). Signed-off-by: Wei Fang <wei.fang@nxp.com> Reviewed-by: Frank Li <Frank.Li@nxp.com> Link: https://patch.msgid.link/20241219054755.1615626-5-wei.fang@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-12-23 09:54:33 -08:00
Wei Fang	69797ff888	net: enetc: add LSO support for i.MX95 ENETC PF ENETC rev 4.1 supports large send offload (LSO), segmenting large TCP and UDP transmit units into multiple Ethernet frames. To support LSO, software needs to fill some auxiliary information in Tx BD, such as LSO header length, frame length, LSO maximum segment size, etc. At 1Gbps link rate, TCP segmentation was tested using iperf3, and the CPU performance before and after applying the patch was compared through the top command. It can be seen that LSO saves a significant amount of CPU cycles compared to software TSO. Before applying the patch: %Cpu(s): 0.1 us, 4.1 sy, 0.0 ni, 85.7 id, 0.0 wa, 0.5 hi, 9.7 si After applying the patch: %Cpu(s): 0.1 us, 2.3 sy, 0.0 ni, 94.5 id, 0.0 wa, 0.4 hi, 2.6 si Signed-off-by: Wei Fang <wei.fang@nxp.com> Reviewed-by: Frank Li <Frank.Li@nxp.com> Reviewed-by: Claudiu Manoil <claudiu.manoil@nxp.com> Link: https://patch.msgid.link/20241219054755.1615626-4-wei.fang@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-12-23 09:54:33 -08:00
Wei Fang	93c5d5a0dd	net: enetc: update max chained Tx BD number for i.MX95 ENETC The max chained Tx BDs of latest ENETC (i.MX95 ENETC, rev 4.1) has been increased to 63, but since the range of MAX_SKB_FRAGS is 17~45, so for i.MX95 ENETC and later revision, it is better to set ENETC4_MAX_SKB_FRAGS to MAX_SKB_FRAGS. In addition, add max_frags in struct enetc_drvdata to indicate the max chained BDs supported by device. Because the max number of chained BDs supported by LS1028A and i.MX95 ENETC is different. Signed-off-by: Wei Fang <wei.fang@nxp.com> Reviewed-by: Frank Li <Frank.Li@nxp.com> Reviewed-by: Claudiu Manoil <claudiu.manoil@nxp.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20241219054755.1615626-3-wei.fang@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-12-23 09:54:33 -08:00
Wei Fang	d9a093d2d1	net: enetc: add Tx checksum offload for i.MX95 ENETC In addition to supporting Rx checksum offload, i.MX95 ENETC also supports Tx checksum offload. The transmit checksum offload is implemented through the Tx BD. To support Tx checksum offload, software needs to fill some auxiliary information in Tx BD, such as IP version, IP header offset and size, whether L4 is UDP or TCP, etc. Same as Rx checksum offload, Tx checksum offload capability isn't defined in register, so tx_csum bit is added to struct enetc_drvdata to indicate whether the device supports Tx checksum offload. Signed-off-by: Wei Fang <wei.fang@nxp.com> Reviewed-by: Frank Li <Frank.Li@nxp.com> Reviewed-by: Claudiu Manoil <claudiu.manoil@nxp.com> Link: https://patch.msgid.link/20241219054755.1615626-2-wei.fang@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-12-23 09:54:33 -08:00
Stefano Brivio	a502ea6fa9	udp: Deal with race between UDP socket address change and rehash

If a UDP socket changes its local address while it's receiving datagrams, as a result of connect(), there is a period during which a lookup operation might fail to find it, after the address is changed but before the secondary hash (port and address) and the four-tuple hash (local and remote ports and addresses) are updated.

Secondary hash chains were introduced by commit `30fff9231f` ("udp: bind() optimisation") and, as a result, a rehash operation became needed to make a bound socket reachable again after a connect().

This operation was introduced by commit `719f835853` ("udp: add rehash on connect()") which isn't however a complete fix: the socket will be found once the rehashing completes, but not while it's pending.

This is noticeable with a socat(1) server in UDP4-LISTEN mode, and a client sending datagrams to it. After the server receives the first datagram (cf. _xioopen_ipdgram_listen()), it issues a connect() to the address of the sender, in order to set up a directed flow.

Now, if the client, running on a different CPU thread, happens to send a (subsequent) datagram while the server's socket changes its address, but is not rehashed yet, this will result in a failed lookup and a port unreachable error delivered to the client, as apparent from the following reproducer:

    LEN=$(($(cat /proc/sys/net/core/wmem_default) / 4))
    dd if=/dev/urandom bs=1 count=${LEN} of=tmp.in

    while :; do
        taskset -c 1 socat UDP4-LISTEN:1337,null-eof OPEN:tmp.out,create,trunc &
        sleep 0.1 || sleep 1
        taskset -c 2 socat OPEN:tmp.in UDP4:localhost:1337,shut-null
        wait
    done

where the client will eventually get ECONNREFUSED on a write() (typically the second or third one of a given iteration):

    2024/11/13 21:28:23 socat[46901] E write(6, 0x556db2e3c000, 8192): Connection refused

This issue was first observed as a seldom failure in Podman's tests checking UDP functionality while using pasta(1) to connect the container's network namespace, which leads us to a reproducer with the lookup error resulting in an ICMP packet on a tap device:

    LOCAL_ADDR="$(ip -j -4 addr show|jq -rM '.[] | .addr_info[0] | select(.scope == "global").local')"

    while :; do
        ./pasta --config-net -p pasta.pcap -u 1337 socat UDP4-LISTEN:1337,null-eof OPEN:tmp.out,create,trunc &
        sleep 0.2 || sleep 1
        socat OPEN:tmp.in UDP4:${LOCAL_ADDR}:1337,shut-null
        wait
        cmp tmp.in tmp.out
    done

Once this fails:

    tmp.in tmp.out differ: char 8193, line 29

we can finally have a look at what's going on:

    $ tshark -r pasta.pcap
     1 0.000000 :: → ff02::16 ICMPv6 110 Multicast Listener Report Message v2
     2 0.168690 88.198.0.161 → 88.198.0.164 UDP 8234 60260 → 1337 Len=8192
     3 0.168767 88.198.0.161 → 88.198.0.164 UDP 8234 60260 → 1337 Len=8192
     4 0.168806 88.198.0.161 → 88.198.0.164 UDP 8234 60260 → 1337 Len=8192
     5 0.168827 c6:47:05:8d:dc:04 → Broadcast ARP 42 Who has 88.198.0.161? Tell 88.198.0.164
     6 0.168851 9a:55:9a:55:9a:55 → c6:47:05:8d:dc:04 ARP 42 88.198.0.161 is at 9a:55:9a:55:9a:55
     7 0.168875 88.198.0.161 → 88.198.0.164 UDP 8234 60260 → 1337 Len=8192
     8 0.168896 88.198.0.164 → 88.198.0.161 ICMP 590 Destination unreachable (Port unreachable)
     9 0.168926 88.198.0.161 → 88.198.0.164 UDP 8234 60260 → 1337 Len=8192
    10 0.168959 88.198.0.161 → 88.198.0.164 UDP 8234 60260 → 1337 Len=8192
    11 0.168989 88.198.0.161 → 88.198.0.164 UDP 4138 60260 → 1337 Len=4096
    12 0.169010 88.198.0.161 → 88.198.0.164 UDP 42 60260 → 1337 Len=0

On the third datagram received, the network namespace of the container initiates an ARP lookup to deliver the ICMP message.

In another variant of this reproducer, starting the client with:

    strace -f pasta --config-net -u 1337 socat UDP4-LISTEN:1337,null-eof OPEN:tmp.out,create,trunc 2>strace.log &

and connecting to the socat server using a loopback address:

    socat OPEN:tmp.in UDP4:localhost:1337,shut-null

we can more clearly observe a sendmmsg() call failing after the first datagram is delivered:

    [pid 278012] connect(173, 0x7fff96c95fc0, 16) = 0
    [...]
    [pid 278012] recvmmsg(173, 0x7fff96c96020, 1024, MSG_DONTWAIT, NULL) = -1 EAGAIN (Resource temporarily unavailable)
    [pid 278012] sendmmsg(173, 0x561c5ad0a720, 1, MSG_NOSIGNAL) = 1
    [...]
    [pid 278012] sendmmsg(173, 0x561c5ad0a720, 1, MSG_NOSIGNAL) = -1 ECONNREFUSED (Connection refused)

and, somewhat confusingly, after a connect() on the same socket succeeded.

Until commit `4cdeeee925` ("net: udp: prefer listeners bound to an address"), the race between receive address change and lookup didn't actually cause visible issues, because, once the lookup based on the secondary hash chain failed, we would still attempt a lookup based on the primary hash (destination port only), and find the socket with the outdated secondary hash.

That change, however, dropped port-only lookups altogether, as side effect, making the race visible.

To fix this, while avoiding the need to make address changes and rehash atomic against lookups, reintroduce primary hash lookups as fallback, if lookups based on four-tuple and secondary hashes fail.

To this end, introduce a simplified lookup implementation, which doesn't take care of SO_REUSEPORT groups: if we have one, there are multiple sockets that would match the four-tuple or secondary hash, meaning that we can't run into this race at all.

v2:
- instead of synchronising lookup operations against address change plus rehash, reintroduce a simplified version of the original primary hash lookup as fallback

v1:
- fix build with CONFIG_IPV6=n: add ifdef around sk_v6_rcv_saddr usage (Kuniyuki Iwashima)
- directly use sk_rcv_saddr for IPv4 receive addresses instead of fetching inet_rcv_saddr (Kuniyuki Iwashima)
- move inet_update_saddr() to inet_hashtables.h and use that to set IPv4/IPv6 addresses as suitable (Kuniyuki Iwashima)
- rebase onto net-next, update commit message accordingly

Reported-by: Ed Santiago <santiago@redhat.com> Link: https://github.com/containers/podman/issues/24147 Analysed-by: David Gibson <david@gibson.dropbear.id.au> Fixes: `30fff9231f` ("udp: bind() optimisation") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2024-12-23 11:39:55 +00:00
Jakub Kicinski	ae418e95dd	Merge branch 'ipv4-consolidate-route-lookups-from-ipv4-sockets' Guillaume Nault says: ==================== ipv4: Consolidate route lookups from IPv4 sockets. Create inet_sk_init_flowi4() so that the different IPv4 code paths that need to do a route lookup based on an IPv4 socket don't need to reimplement that logic. ==================== Link: https://patch.msgid.link/cover.1734357769.git.gnault@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-12-20 13:50:15 -08:00
Guillaume Nault	c63e9f3b89	l2tp: Use inet_sk_init_flowi4() in l2tp_ip_sendmsg(). Use inet_sk_init_flowi4() to automatically initialise the flowi4 structure in l2tp_ip_sendmsg() instead of passing parameters manually to ip_route_output_ports(). Override ->daddr with the value passed in the msghdr structure if provided. Signed-off-by: Guillaume Nault <gnault@redhat.com> Reviewed-by: James Chapman <jchapman@katalix.com> Link: https://patch.msgid.link/2ff22a3560c5050228928456662b80b9c84a8fe4.1734357769.git.gnault@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-12-20 13:50:09 -08:00
Guillaume Nault	148721f8e0	ipv4: Use inet_sk_init_flowi4() in __ip_queue_xmit(). Use inet_sk_init_flowi4() to automatically initialise the flowi4 structure in __ip_queue_xmit() instead of passing parameters manually to ip_route_output_ports(). Override ->flowi4_tos with the value passed as parameter since that's required by SCTP. Signed-off-by: Guillaume Nault <gnault@redhat.com> Link: https://patch.msgid.link/37e64ffbd9adac187b14aa9097b095f5c86e85be.1734357769.git.gnault@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-12-20 13:50:09 -08:00
Guillaume Nault	42e5ffc385	ipv4: Use inet_sk_init_flowi4() in inet_csk_rebuild_route(). Use inet_sk_init_flowi4() to automatically initialise the flowi4 structure in inet_csk_rebuild_route() instead of passing parameters manually to ip_route_output_ports(). Signed-off-by: Guillaume Nault <gnault@redhat.com> Link: https://patch.msgid.link/b270931636effa1095508e0f0a3e8c3a0e6d357f.1734357769.git.gnault@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-12-20 13:50:09 -08:00
Guillaume Nault	5be1323b50	ipv4: Use inet_sk_init_flowi4() in ip4_datagram_release_cb(). Use inet_sk_init_flowi4() to automatically initialise the flowi4 structure in ip4_datagram_release_cb() instead of passing parameters manually to ip_route_output_ports(). Signed-off-by: Guillaume Nault <gnault@redhat.com> Link: https://patch.msgid.link/9c326b8d9e919478f7952b21473d31da07eba2dd.1734357769.git.gnault@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-12-20 13:50:09 -08:00
Guillaume Nault	1dbdce30f0	ipv4: Define inet_sk_init_flowi4() and use it in inet_sk_rebuild_header(). IPv4 code commonly has to initialise a flowi4 structure from an IPv4 socket. This requires looking at potential IPv4 options to set the proper destination address, call flowi4_init_output() with the correct set of parameters and run the sk_classify_flow security hook. Instead of reimplementing these operations in different parts of the stack, let's define inet_sk_init_flowi4() which does all these operations. The first user is inet_sk_rebuild_header(), where inet_sk_init_flowi4() replaces ip_route_output_ports(). Unlike ip_route_output_ports(), which sets the flowi4 structure and performs the route lookup in one go, inet_sk_init_flowi4() only initialises the flow. The route lookup is then done by ip_route_output_flow(). Decoupling flow initialisation from route lookup makes this new interface applicable more broadly as it will allow some users to overwrite specific struct flowi4 members before the route lookup. Signed-off-by: Guillaume Nault <gnault@redhat.com> Link: https://patch.msgid.link/fd416275262b1f518d5abfcef740ce4f4a1a6522.1734357769.git.gnault@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-12-20 13:50:09 -08:00
Tristram Ha	6ed3472173	net: dsa: microchip: Do not execute PTP driver code for unsupported switches The PTP driver code only works for certain KSZ switches like KSZ9477, KSZ9567, LAN937X and their varieties. This code is enabled by kernel configuration CONFIG_NET_DSA_MICROCHIP_KSZ_PTP. As the DSA driver is common to all KSZ switches, this PTP code is not appropriate for the unsupported switches. The ptp_capable indication is added to the chip data structure to signal whether to execute that code. Signed-off-by: Tristram Ha <tristram.ha@microchip.com> Link: https://patch.msgid.link/20241218020240.70601-1-Tristram.Ha@microchip.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-12-20 13:28:42 -08:00
Thomas Weißschuh	3272040790	qlcnic: use const 'struct bin_attribute' callbacks The sysfs core now provides callback variants that explicitly take a const pointer. Use them so the non-const variants can be removed. Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Link: https://patch.msgid.link/20241219-sysfs-const-bin_attr-net-v2-1-93bdaece3c90@weissschuh.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-12-20 13:24:50 -08:00
Jakub Kicinski	c1bc6d2178	Merge branch 'bridge-handle-changes-in-vlan_flag_bridge_binding' Petr Machata says: ==================== bridge: Handle changes in VLAN_FLAG_BRIDGE_BINDING When bridge binding is enabled on a VLAN netdevice, its link state should track bridge ports that are members of the corresponding VLAN. This works for newly-added netdevices. However toggling the option does not have the effect of enabling or disabling the behavior as appropriate. In this patchset, have bridge react to bridge_binding toggles on VLAN uppers. There has been another attempt at supporting this behavior in 2022 by Sevinj Aghayeva [0]. A discussion ensued that informed how this new patchset is constructed, namely that the logic is in the bridge as opposed to the 8021q driver, and the bridge reacts to NETDEV_CHANGE events on the 8021q upper. Patches #1 and #2 contain the implementation, patches #3 and #4 a selftest. [0] https://lore.kernel.org/netdev/cover.1660100506.git.sevinj.aghayeva@gmail.com/ ==================== Link: https://patch.msgid.link/cover.1734540770.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-12-20 13:16:47 -08:00
Petr Machata	dca12e9ab7	selftests: net: Add a VLAN bridge binding selftest Add a test that exercises bridge binding. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/baf7244fd1fe223a6d93e027584fa9f99dee982c.1734540770.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-12-20 13:14:17 -08:00
Petr Machata	976d248bd3	selftests: net: lib: Add a couple autodefer helpers Alongside the helper ip_link_set_up(), one to set the link down will be useful as well. Add a helper to determine the link state as well, ip_link_is_up(), and use it to short-circuit any changes if the state is already the desired one. Furthermore, add a helper bridge_vlan_add(). Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/856d9e01725fdba21b7f6716358f645b19131af2.1734540770.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-12-20 13:14:17 -08:00
Petr Machata	3abd45122c	net: bridge: Handle changes in VLAN_FLAG_BRIDGE_BINDING When bridge binding is enabled on a VLAN netdevice, its link state should track bridge ports that are members of the corresponding VLAN. This works for newly-added netdevices. However toggling the option does not have the effect of enabling or disabling the behavior as appropriate. In this patch, react to bridge_binding toggles on VLAN uppers. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/90a8ca8aea4d81378b29d75d9e562433e0d5c7ff.1734540770.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-12-20 13:14:17 -08:00
Petr Machata	f284424dc1	net: bridge: Extract a helper to handle bridge_binding toggles Currently, the BROPT_VLAN_BRIDGE_BINDING bridge option is only toggled when VLAN devices are added on top of a bridge or removed from it. Extract the toggling of the option to a function so that it could be invoked by a subsequent patch when the state of an upper VLAN device changes. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/a7455f6fe1dfa7b13126ed8a7fb33d3b611eecb8.1734540770.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-12-20 13:14:17 -08:00
Eric Dumazet	05dd04b218	inetpeer: avoid false sharing in inet_peer_xrlim_allow() Under DOS, inet_peer_xrlim_allow() might be called millions of times per second from different cpus. Make sure to write over peer->rate_tokens and peer->rate_last only when really needed. Note the inherent races of this function are still there, we do not care of precise ICMP rate limiting. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20241219150330.3159027-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-12-20 13:04:40 -08:00
Jakub Kicinski	710aebdad6	Merge branch 'hisilicon-hns-deadcoding' Dr. David Alan Gilbert says: ==================== hisilicon hns deadcoding From: "Dr. David Alan Gilbert" <linux@treblig.org> A small set of deadcoding for functions that are not called, and a couple of function pointers that they called. Build tested only; I don't have the hardware. ==================== Link: https://patch.msgid.link/20241218163341.40297-1-linux@treblig.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-12-20 12:56:19 -08:00
Dr. David Alan Gilbert	8973ce1893	net: hisilicon: hns: Remove unused enums The enums dsaf_roce_port_mode, dsaf_roce_port_num and dsaf_roce_qos_sl are unused after the removal of the reset code. Remove them. Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Reviewed-by: Jijie Shao<shaojijie@huawei.com> Link: https://patch.msgid.link/20241218163341.40297-5-linux@treblig.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-12-20 12:56:17 -08:00
Dr. David Alan Gilbert	0198b459f5	net: hisilicon: hns: Remove reset helpers With hns_dsaf_roce_reset() removed in a previous patch, the two helper member pointers, 'hns_dsaf_roce_srst', and 'hns_dsaf_srst_chns' are now unread. Remove them, and the helper functions that they were initialised to, that is hns_dsaf_srst_chns(), hns_dsaf_srst_chns_acpi(), hns_dsaf_roce_srst() and hns_dsaf_roce_srst_acpi(). Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Reviewed-by: Jijie Shao<shaojijie@huawei.com> Link: https://patch.msgid.link/20241218163341.40297-4-linux@treblig.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-12-20 12:56:17 -08:00
Dr. David Alan Gilbert	0265e9edf2	net: hisilicon: hns: Remove unused hns_rcb_start hns_rcb_start() has been unused since 2016's commit `454784d85d` ("net: hns: delete redundancy ring enable operations") Remove it. Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Reviewed-by: Jijie Shao<shaojijie@huawei.com> Link: https://patch.msgid.link/20241218163341.40297-3-linux@treblig.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-12-20 12:56:17 -08:00
Dr. David Alan Gilbert	a574fe14ed	net: hisilicon: hns: Remove unused hns_dsaf_roce_reset hns_dsaf_roce_reset() has been unused since 2021's commit `38d2208824` ("RDMA/hns: Remove support for HIP06") Remove it. Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Reviewed-by: Jijie Shao<shaojijie@huawei.com> Link: https://patch.msgid.link/20241218163341.40297-2-linux@treblig.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-12-20 12:56:16 -08:00
Jakub Kicinski	b73e56f162	Merge branch 'xdp-a-fistful-of-generic-changes-pt-iii' Alexander Lobakin says: ==================== xdp: a fistful of generic changes pt. III XDP for idpf is currently 5.(6) chapters: * convert Rx to libeth; * convert Tx and stats to libeth; * generic XDP and XSk code changes; * generic XDP and XSk code additions pt. 1; * generic XDP and XSk code additions pt. 2 (you are here); * actual XDP for idpf via new libeth_xdp; * XSk for idpf (via ^). Part III.3 does the following: * adds generic functions to build skbs from xdp_buffs (regular and XSk) and attach frags to xdp_buffs (regular and XSk); * adds helper to optimize XSk xmit in drivers. Everything is prereq for libeth_xdp, but will be useful standalone as well: less code in drivers, faster XSk XDP_PASS, smaller object code. ==================== Link: https://patch.msgid.link/20241218174435.1445282-1-aleksander.lobakin@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-12-19 19:51:18 -08:00
Alexander Lobakin	560d958c6c	xsk: add generic XSk &xdp_buff -> skb conversion Same as with converting &xdp_buff to skb on Rx, the code which allocates a new skb and copies the XSk frame there is identical across the drivers, so make it generic. This includes copying all the frags if they are present in the original buff. System percpu page_pools greatly improve XDP_PASS performance on XSk: instead of page_alloc() + page_free(), the net core recycles the same pages, so the only overhead left is memcpy()s. When the Page Pool is not compiled in, the whole function is a return-NULL (but it always gets selected when eBPF is enabled). Note that the passed buff gets freed if the conversion is done w/o any error, assuming you don't need this buffer after you convert it to an skb. Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Link: https://patch.msgid.link/20241218174435.1445282-6-aleksander.lobakin@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-12-19 19:51:14 -08:00
Alexander Lobakin	51205f841a	xsk: make xsk_buff_add_frag() really add the frag via __xdp_buff_add_frag() Currently, xsk_buff_add_frag() only adds the frag to pool's linked list, not doing anything with the &xdp_buff. The drivers do that manually and the logic is the same. Make it really add an skb frag, just like xdp_buff_add_frag() does that, and freeing frags on error if needed. This allows to remove repeating code from i40e and ice and not add the same code again and again. Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Link: https://patch.msgid.link/20241218174435.1445282-5-aleksander.lobakin@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-12-19 19:51:14 -08:00
Alexander Lobakin	539c1fba1a	xdp: add generic xdp_build_skb_from_buff() The code which builds an skb from an &xdp_buff keeps multiplying itself around the drivers with almost no changes. Let's try to stop that by adding a generic function. Unlike __xdp_build_skb_from_frame(), always allocate an skbuff head using napi_build_skb() and make use of the available xdp_rxq pointer to assign the Rx queue index. In case of PP-backed buffer, mark the skb to be recycled, as every PP user's been switched to recycle skbs. Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Link: https://patch.msgid.link/20241218174435.1445282-4-aleksander.lobakin@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-12-19 19:51:14 -08:00
Alexander Lobakin	68ddc8ae17	xdp: add generic xdp_buff_add_frag() The code piece which would attach a frag to &xdp_buff is almost identical across the drivers supporting XDP multi-buffer on Rx. Make it a generic elegant "oneliner". Also, I see lots of drivers calculating frags_truesize as `xdp->frame_sz * nr_frags`. I can't say this is fully correct, since frags might be backed by chunks of different sizes, especially with stuff like the header split. Even page_pool_alloc() can give you two different truesizes on two subsequent requests to allocate the same buffer size. Add a field to &skb_shared_info (unionized as there's no free slot currently on x86_64) to track the "true" truesize. It can be used later when updating the skb. Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Link: https://patch.msgid.link/20241218174435.1445282-3-aleksander.lobakin@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-12-19 19:51:13 -08:00
Alexander Lobakin	a19d0236f4	page_pool: add page_pool_dev_alloc_netmem() Similarly to other _dev shorthands, add one for page_pool_alloc_netmem() to allocate a netmem using the default Rx GFP flags (ATOMIC | NOWARN) to make the page -> netmem transition of drivers easier. Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Link: https://patch.msgid.link/20241218174435.1445282-2-aleksander.lobakin@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-12-19 19:51:13 -08:00

1324938 Commits