linux

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2026-04-07 06:16:08 -04:00

Author	SHA1	Message	Date
Jason Xing	30ed05adca	xsk: use a smaller new lock for shared pool case - Split cq_lock into two smaller locks: cq_prod_lock and cq_cached_prod_lock - Avoid disabling/enabling interrupts in the hot xmit path In either xsk_cq_cancel_locked() or xsk_cq_reserve_locked() function, the race condition is only between multiple xsks sharing the same pool. They are all in the process context rather than interrupt context, so now the small lock named cq_cached_prod_lock can be used without handling interrupts. While cq_cached_prod_lock ensures the exclusive modification of @cached_prod, cq_prod_lock in xsk_cq_submit_addr_locked() only cares about @producer and corresponding @desc. Both of them don't necessarily be consistent with @cached_prod protected by cq_cached_prod_lock. That's the reason why the previous big lock can be split into two smaller ones. Please note that SPSC rule is all about the global state of producer and consumer that can affect both layers instead of local or cached ones. Frequently disabling and enabling interrupt are very time consuming in some cases, especially in a per-descriptor granularity, which now can be avoided after this optimization, even when the pool is shared by multiple xsks. With this patch, the performance number[1] could go from 1,872,565 pps to 1,961,009 pps. It's a minor rise of around 5%. [1]: taskset -c 1 ./xdpsock -i enp2s0f1 -q 0 -t -S -s 64 Signed-off-by: Jason Xing <kernelxing@tencent.com> Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://patch.msgid.link/20251030000646.18859-3-kerneljasonxing@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-11-04 16:10:53 +01:00
Kuniyuki Iwashima	e833eb2516	mpls: Protect net->mpls.platform_label with a per-netns mutex. MPLS (re)uses RTNL to protect net->mpls.platform_label, but the lock does not need to be RTNL at all. Let's protect net->mpls.platform_label with a dedicated per-netns mutex. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Link: https://patch.msgid.link/20251029173344.2934622-13-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-11-03 17:40:53 -08:00
Kuniyuki Iwashima	d8f9581e1b	ipv6: Add in6_dev_rcu(). rcu_dereference_rtnl() does not clearly tell whether the caller is under RCU or RTNL. Let's add in6_dev_rcu() to make it easy to remove __in6_dev_get() in the future. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Link: https://patch.msgid.link/20251029173344.2934622-5-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-11-03 17:40:46 -08:00
Haiyang Zhang	54133f9b4b	net: mana: Support HW link state events Handle the NIC hardware link state events received from the HW channel, then set the proper link state accordingly. And, add a feature bit, GDMA_DRV_CAP_FLAG_1_HW_VPORT_LINK_AWARE, to inform the NIC hardware this handler exists. Our MANA NIC only sends out the link state down/up messages when we need to let the VM rerun DHCP client and change IP address. So, add netif_carrier_on() in the probe(), let the NIC show the right initial state in /sys/class/net/ethX/operstate. Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com> Link: https://patch.msgid.link/1761770601-16920-1-git-send-email-haiyangz@linux.microsoft.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-31 15:56:53 -07:00
Jakub Kicinski	1a2352ad82	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Cross-merge networking fixes after downstream PR (net-6.18-rc4). No conflicts, adjacent changes: drivers/net/ethernet/stmicro/stmmac/stmmac_main.c `ded9813d17` ("net: stmmac: Consider Tx VLAN offload tag length for maxSDU") `26ab9830be` ("net: stmmac: replace has_xxxx with core_type") Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-31 06:46:03 -07:00
Jakub Kicinski	12a7c6a993	Merge tag 'nf-next-25-10-30' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next Florian Westphal says: ==================== netfilter: updates for net-next 1) Convert nf_tables 'nft_set_iter' usage to use C99 struct initialization, from Fernando Fernandez Mancera. 2) Disallow nf_conntrack_max=0. This was an (undocumented) historic inheritance from ip_conntrack (ipv4 only nf_conntrack predecessor). Doing so will simplify future changes to make this pernet-tuneable. 3) Fix a typo in conntrack.h comment, from Weibiao Tu. * tag 'nf-next-25-10-30' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next: netfilter: fix typo in nf_conntrack_l4proto.h comment netfilter: conntrack: disable 0 value for conntrack_max setting netfilter: nf_tables: use C99 struct initializer for nft_set_iter ==================== Link: https://patch.msgid.link/20251030121954.29175-1-fw@strlen.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-30 17:57:07 -07:00
Jakub Kicinski	1659b441b6	Merge tag 'wireless-next-2025-10-30' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next Johannes Berg says: ==================== Not that many changes this time: - mac80211: - improved VHT radiotap reporting - S1G improvements - multi-radio monitor improvements - HT action frame handling on 6 GHz - mesh rate tracking improvements - CSA handling improvements - cfg80211: multi-radio debugfs - rt2x00: improvements for embedded platforms * tag 'wireless-next-2025-10-30' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: wifi: mac80211: Allow HT Action frame processing on 6 GHz when HE is supported wifi: rt2x00: add nvmem eeprom support wifi: mac80211: add RX flag to report radiotap VHT information net: wireless: Remove redundant pm_runtime_mark_last_busy() calls wifi: cfg80211: Add parameters to radio-specific debugfs directories wifi: cfg80211: Add debugfs support for multi-radio wiphy wifi: mac80211: fix missing RX bitrate update for mesh forwarding path wifi: cfg80211: default S1G chandef width to 1MHz wifi: mac80211: get probe response chan via ieee80211_get_channel_khz wifi: mac80211: reset CRC valid after CSA wifi: mac80211_hwsim: advertise puncturing feature support wifi: cfg80211/mac80211: validate radio frequency range for monitor mode wifi: rt2x00: check retval for of_get_mac_address ==================== Link: https://patch.msgid.link/20251030105355.13216-3-johannes@sipsolutions.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-30 17:38:37 -07:00
Halil Pasic	aef3cdb47b	net/smc: make wr buffer count configurable Think SMC_WR_BUF_CNT_SEND := SMC_WR_BUF_CNT used in send context and SMC_WR_BUF_CNT_RECV := 3 * SMC_WR_BUF_CNT used in recv context. Those get replaced with lgr->max_send_wr and lgr->max_recv_wr respective. Please note that although with the default sysctl values qp_attr.cap.max_send_wr == qp_attr.cap.max_recv_wr is maintained but can not be assumed to be generally true any more. I see no downside to that, but my confidence level is rather modest. Signed-off-by: Halil Pasic <pasic@linux.ibm.com> Reviewed-by: Sidraya Jayagond <sidraya@linux.ibm.com> Reviewed-by: Dust Li <dust.li@linux.alibaba.com> Tested-by: Mahanta Jambigi <mjambigi@linux.ibm.com> Link: https://patch.msgid.link/20251027224856.2970019-2-pasic@linux.ibm.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-10-30 13:31:43 +01:00
caivive (Weibiao Tu)	57347d58a4	netfilter: fix typo in nf_conntrack_l4proto.h comment In the comment for nf_conntrack_l4proto.h, the word "nfnetink" was incorrectly spelled. It has been corrected to "nfnetlink". Fixes a typo to enhance readability and ensure consistency. Signed-off-by: caivive (Weibiao Tu) <cavivie@gmail.com> Signed-off-by: Florian Westphal <fw@strlen.de>	2025-10-30 12:52:45 +01:00
Benjamin Berg	db82ddeaf4	wifi: mac80211: add RX flag to report radiotap VHT information mac80211 already reports some basic information in the radiotap header with the known fields declared by the driver. However, drivers may want to report more accurate information and in that case the full VHT radiotap structure needs to be provided. Add a new RX_FLAG_RADIOTAP_VHT which is set when the VHT information should be pulled from the skb. Update the code to fill in the VHT fields to only do so when requested by the driver or if the information has not yet been set. This way the driver can fully control the information if it chooses so. Signed-off-by: Benjamin Berg <benjamin.berg@intel.com> Reviewed-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com> Link: https://patch.msgid.link/20251027142118.0bad1c307a21.I2cf285c20a822698039603f2af00ed9c548f2ee0@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2025-10-30 08:38:51 +01:00
Shahar Shitrit	c15d5c62ab	net: tls: Cancel RX async resync request on rcd_delta overflow When a netdev issues a RX async resync request for a TLS connection, the TLS module handles it by logging record headers and attempting to match them to the tcp_sn provided by the device. If a match is found, the TLS module approves the tcp_sn for resynchronization. While waiting for a device response, the TLS module also increments rcd_delta each time a new TLS record is received, tracking the distance from the original resync request. However, if the device response is delayed or fails (e.g due to unstable connection and device getting out of tracking, hardware errors, resource exhaustion etc.), the TLS module keeps logging and incrementing, which can lead to a WARN() when rcd_delta exceeds the threshold. To address this, introduce tls_offload_rx_resync_async_request_cancel() to explicitly cancel resync requests when a device response failure is detected. Call this helper also as a final safeguard when rcd_delta crosses its threshold, as reaching this point implies that earlier cancellation did not occur. Signed-off-by: Shahar Shitrit <shshitrit@nvidia.com> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/1761508983-937977-3-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-29 18:32:18 -07:00
Shahar Shitrit	34892cfec0	net: tls: Change async resync helpers argument Update tls_offload_rx_resync_async_request_start() and tls_offload_rx_resync_async_request_end() to get a struct tls_offload_resync_async parameter directly, rather than extracting it from struct sock. This change aligns the function signatures with the upcoming tls_offload_rx_resync_async_request_cancel() helper, which will be introduced in a subsequent patch. Signed-off-by: Shahar Shitrit <shshitrit@nvidia.com> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/1761508983-937977-2-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-29 18:32:17 -07:00
Ido Schimmel	d12d04d221	ipv6: icmp: Add RFC 5837 support Add the ability to append the incoming IP interface information to ICMPv6 error messages in accordance with RFC 5837 and RFC 4884. This is required for more meaningful traceroute results in unnumbered networks. The feature is disabled by default and controlled via a new sysctl ("net.ipv6.icmp.errors_extension_mask") which accepts a bitmask of ICMP extensions to append to ICMP error messages. Currently, only a single value is supported, but the interface and the implementation should be able to support more extensions, if needed. Clone the skb and copy the relevant data portions before modifying the skb as the caller of icmp6_send() still owns the skb after the function returns. This should be fine since by default ICMP error messages are rate limited to 1000 per second and no more than 1 per second per specific host. Trim or pad the packet to 128 bytes before appending the ICMP extension structure in order to be compatible with legacy applications that assume that the ICMP extension structure always starts at this offset (the minimum length specified by RFC 4884). Since commit `20e1954fe2` ("ipv6: RFC 4884 partial support for SIT/GRE tunnels") it is possible for icmp6_send() to be called with an skb that already contains ICMP extensions. This can happen when we receive an ICMPv4 message with extensions from a tunnel and translate it to an ICMPv6 message towards an IPv6 host in the overlay network. I could not find an RFC that supports this behavior, but it makes sense to not overwrite the original extensions that were appended to the packet. Therefore, avoid appending extensions if the length field in the provided ICMPv6 header is already filled. Export netdev_copy_name() using EXPORT_IPV6_MOD_GPL() to make it available to IPv6 when it is built as a module. Reviewed-by: Petr Machata <petrm@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20251027082232.232571-3-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-29 18:28:30 -07:00
Ido Schimmel	f0e7036fc9	ipv4: icmp: Add RFC 5837 support Add the ability to append the incoming IP interface information to ICMPv4 error messages in accordance with RFC 5837 and RFC 4884. This is required for more meaningful traceroute results in unnumbered networks. The feature is disabled by default and controlled via a new sysctl ("net.ipv4.icmp_errors_extension_mask") which accepts a bitmask of ICMP extensions to append to ICMP error messages. Currently, only a single value is supported, but the interface and the implementation should be able to support more extensions, if needed. Clone the skb and copy the relevant data portions before modifying the skb as the caller of __icmp_send() still owns the skb after the function returns. This should be fine since by default ICMP error messages are rate limited to 1000 per second and no more than 1 per second per specific host. Trim or pad the packet to 128 bytes before appending the ICMP extension structure in order to be compatible with legacy applications that assume that the ICMP extension structure always starts at this offset (the minimum length specified by RFC 4884). Reviewed-by: Petr Machata <petrm@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20251027082232.232571-2-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-29 18:28:29 -07:00
Eric Dumazet	b1e014a1f3	tcp: add newval parameter to tcp_rcvbuf_grow() This patch has no functional change, and prepares the following one. tcp_rcvbuf_grow() will need to have access to tp->rcvq_space.space old and new values. Change mptcp_rcvbuf_grow() in a similar way. Signed-off-by: Eric Dumazet <edumazet@google.com> [ Moved 'oldval' declaration to the next patch to avoid warnings at build time. ] Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Reviewed-by: Neal Cardwell <ncardwell@google.com> Link: https://patch.msgid.link/20251028-net-tcp-recv-autotune-v3-3-74b43ba4c84c@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-29 17:30:19 -07:00
Christophe JAILLET	294bfe0343	sctp: Constify struct sctp_sched_ops 'struct sctp_sched_ops' is not modified in these drivers. Constifying this structure moves some data to a read-only section, so increases overall security, especially when the structure holds some function pointers. On a x86_64, with allmodconfig, as an example: Before: ====== text data bss dec hex filename 8019 568 0 8587 218b net/sctp/stream_sched_fc.o After: ===== text data bss dec hex filename 8275 312 0 8587 218b net/sctp/stream_sched_fc.o Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Link: https://patch.msgid.link/dce03527eb7b7cc8a3c26d5cdac12bafe3350135.1761377890.git.christophe.jaillet@wanadoo.fr Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-28 17:50:55 -07:00
Bobby Eshleman	8443c31608	net: netmem: remove NET_IOV_MAX from net_iov_type enum Remove the NET_IOV_MAX workaround from the net_iov_type enum. This entry was previously added to force the enum size to unsigned long to satisfy the NET_IOV_ASSERT_OFFSET static assertions. After commit `f3d85c9ee5` ("netmem: introduce struct netmem_desc mirroring struct page") this approach became unnecessary by placing the net_iov_type after the netmem_desc. Placing the net_iov_type after netmem_desc results in the net_iov_type size having no effect on the position or layout of the fields that mirror the struct page. The layout before this patch: struct net_iov { union { struct netmem_desc desc; /* 0 48 / struct { long unsigned int _flags; / 0 8 / long unsigned int pp_magic; / 8 8 / struct page_pool pp; /* 16 8 / long unsigned int _pp_mapping_pad; / 24 8 / long unsigned int dma_addr; / 32 8 / atomic_long_t pp_ref_count; / 40 8 / }; / 0 48 / }; / 0 48 / struct net_iov_area owner; /* 48 8 / enum net_iov_type type; / 56 8 / / size: 64, cachelines: 1, members: 3 / }; The layout after this patch: struct net_iov { union { struct netmem_desc desc; / 0 48 / struct { long unsigned int _flags; / 0 8 / long unsigned int pp_magic; / 8 8 / struct page_pool pp; /* 16 8 / long unsigned int _pp_mapping_pad; / 24 8 / long unsigned int dma_addr; / 32 8 / atomic_long_t pp_ref_count; / 40 8 / }; / 0 48 / }; / 0 48 / struct net_iov_area owner; /* 48 8 / enum net_iov_type type; / 56 4 / / size: 64, cachelines: 1, members: 3 / / padding: 4 */ }; Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com> Reviewed-by: Mina Almasry <almasrymina@google.com> Link: https://patch.msgid.link/20251024-b4-devmem-remove-niov-max-v1-1-ba72c68bc869@meta.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-28 17:41:46 -07:00
Yue Haibing	6f147c8328	net/sched: Remove unused typedef psched_tdiff_t Since commit `051d442098` ("net/sched: Retire CBQ qdisc") this is not used anymore. Signed-off-by: Yue Haibing <yuehaibing@huawei.com> Link: https://patch.msgid.link/20251024025145.4069583-1-yuehaibing@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-27 18:05:54 -07:00
Kuniyuki Iwashima	71068e2e1b	sctp: Remove sctp_copy_sock() and sctp_copy_descendant(). Now, sctp_accept() and sctp_do_peeloff() use sk_clone(), and we no longer need sctp_copy_sock() and sctp_copy_descendant(). Let's remove them. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Acked-by: Xin Long <lucien.xin@gmail.com> Link: https://patch.msgid.link/20251023231751.4168390-9-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-27 18:04:59 -07:00
Kuniyuki Iwashima	c49ed521f1	sctp: Remove sctp_pf.create_accept_sk(). sctp_v[46]_create_accept_sk() are no longer used. Let's remove sctp_pf.create_accept_sk(). Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Acked-by: Xin Long <lucien.xin@gmail.com> Link: https://patch.msgid.link/20251023231751.4168390-7-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-27 18:04:58 -07:00
Kuniyuki Iwashima	151b98d10e	net: Add sk_clone(). sctp_accept() will use sk_clone_lock(), but it will be called with the parent socket locked, and sctp_migrate() acquires the child lock later. Let's add no lock version of sk_clone_lock(). Note that lockdep complains if we simply use bh_lock_sock_nested(). Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Xin Long <lucien.xin@gmail.com> Link: https://patch.msgid.link/20251023231751.4168390-5-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-27 18:04:57 -07:00
Wilfred Mallawa	82cb5be6ad	net/tls: support setting the maximum payload size During a handshake, an endpoint may specify a maximum record size limit. Currently, the kernel defaults to TLS_MAX_PAYLOAD_SIZE (16KB) for the maximum record size. Meaning that, the outgoing records from the kernel can exceed a lower size negotiated during the handshake. In such a case, the TLS endpoint must send a fatal "record_overflow" alert [1], and thus the record is discarded. Upcoming Western Digital NVMe-TCP hardware controllers implement TLS support. For these devices, supporting TLS record size negotiation is necessary because the maximum TLS record size supported by the controller is less than the default 16KB currently used by the kernel. Currently, there is no way to inform the kernel of such a limit. This patch adds support to a new setsockopt() option `TLS_TX_MAX_PAYLOAD_LEN` that allows for setting the maximum plaintext fragment size. Once set, outgoing records are no larger than the size specified. This option can be used to specify the record size limit. [1] https://www.rfc-editor.org/rfc/rfc8449 Signed-off-by: Wilfred Mallawa <wilfred.mallawa@wdc.com> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Link: https://patch.msgid.link/20251022001937.20155-1-wilfred.opensource@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-27 16:13:42 -07:00
Roopni Devanathan	7cc986c04a	wifi: cfg80211: Add debugfs support for multi-radio wiphy In multi-radio wiphy architecture, where a single wiphy can have multiple radios tied to it, radio specific configuration parameters and global wiphy parameters are maintained for the entire physical device and common to all radios. But, each radio in a wiphy can have different values for each radio configuration parameter, like RTS threshold. With the current debugfs directory structure, the values of global wiphy configuration parameters can be viewed, but, values of individual radio configuration parameters cannot be viewed, as radio specific configuration parameters are not maintained, separately. To address this, in addition to maintaining global wiphy configuration parameters common to all radios, create separate debugfs directories for each radio in a wiphy to maintain parameters corresponding to that radio in this directory. In implementation, maintain a dentry structure in wiphy_radio_cfg, a structure containing radio configurations of a wiphy. This struct is maintained to denote per-radio configurations of a wiphy. Create separate directories representing each radio within phy#X directory in debugfs during wiphy registration. Sample directory structure with this change: ls /sys/kernel/debug/ieee80211/phy0/radio radio0/ radio1/ radio2/ Signed-off-by: Roopni Devanathan <quic_rdevanat@quicinc.com> Link: https://patch.msgid.link/20251024044649.483557-2-quic_rdevanat@quicinc.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2025-10-27 09:18:41 +01:00
Ryder Lee	a392cde88d	wifi: cfg80211/mac80211: validate radio frequency range for monitor mode In multi-radio devices, it is possible to have an MLD AP and a monitor interface active at the same time. In such cases, monitor mode may not be able to specify a fixed channel and could end up capturing frames from all radios, including those outside the intended frequency bands. This patch adds frequency validation for monitor mode. Received frames are now only processed if their frequency fall within the allowed ranges of the radios specified by the interface's radio_mask. This prevents monitor mode from capturing frames outside the supported radio. Signed-off-by: Ryder Lee <ryder.lee@mediatek.com> Link: https://patch.msgid.link/700b8284e845d96654eb98431f8eeb5a81503862.1758647858.git.ryder.lee@mediatek.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2025-10-27 09:16:23 +01:00
Kuniyuki Iwashima	3064d0fe02	neighbour: Convert rwlock of struct neigh_table to spinlock. Only neigh_for_each() and neigh_seq_start/stop() are on the reader side of neigh_table.lock. Let's convert rwlock to the plain spinlock. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20251022054004.2514876-6-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-24 17:57:20 -07:00
Kuniyuki Iwashima	35d7c70870	neighbour: Annotate access to neigh_parms fields. NEIGH_VAR() is read locklessly in the fast path, and IPv6 ndisc uses NEIGH_VAR_SET() locklessly. The next patch will convert neightbl_dump_info() to RCU. Let's annotate accesses to neigh_param with READ_ONCE() and WRITE_ONCE(). Note that ndisc_ifinfo_sysctl_change() uses &NEIGH_VAR() and we cannot use '&' with READ_ONCE(), so NEIGH_VAR_PTR() is introduced. Note also that NEIGH_VAR_INIT() does not need WRITE_ONCE() as it is before parms is published. Also, the only user hippi_neigh_setup_dev() is no longer called since commit `e3804cbebb` ("net: remove COMPAT_NET_DEV_OPS"), which looks wrong, but probably no one uses HIPPI and RoadRunner. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20251022054004.2514876-3-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-24 17:57:20 -07:00
Luiz Augusto von Dentz	751463ceef	Bluetooth: hci_core: Fix tracking of periodic advertisement Periodic advertising enabled flag cannot be tracked by the enabled flag since advertising and periodic advertising each can be enabled/disabled separately from one another causing the states to be inconsistent when for example an advertising set is disabled its enabled flag is set to false which is then used for periodic which has not being disabled. Fixes: `eca0ae4aea` ("Bluetooth: Add initial implementation of BIS connections") Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>	2025-10-24 10:31:59 -04:00
Frédéric Danis	76e20da0bd	Revert "Bluetooth: L2CAP: convert timeouts to secs_to_jiffies()" This reverts commit `c9d84da18d`. It replaces in L2CAP calls to msecs_to_jiffies() to secs_to_jiffies() and updates the constants accordingly. But the constants are also used in LCAP Configure Request and L2CAP Configure Response which expect values in milliseconds. This may prevent correct usage of L2CAP channel. To fix it, keep those constants in milliseconds and so revert this change. Fixes: `c9d84da18d` ("Bluetooth: L2CAP: convert timeouts to secs_to_jiffies()") Signed-off-by: Frédéric Danis <frederic.danis@collabora.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>	2025-10-24 10:30:30 -04:00
Pauli Virtanen	e8785404de	Bluetooth: MGMT: fix crash in set_mesh_sync and set_mesh_complete There is a BUG: KASAN: stack-out-of-bounds in set_mesh_sync due to memcpy from badly declared on-stack flexible array. Another crash is in set_mesh_complete() due to double list_del via mgmt_pending_valid + mgmt_pending_remove. Use DEFINE_FLEX to declare the flexible array right, and don't memcpy outside bounds. As mgmt_pending_valid removes the cmd from list, use mgmt_pending_free, and also report status on error. Fixes: `302a1f674c` ("Bluetooth: MGMT: Fix possible UAFs") Signed-off-by: Pauli Virtanen <pav@iki.fi> Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>	2025-10-24 10:21:37 -04:00
Luiz Augusto von Dentz	0d92808024	Bluetooth: HCI: Fix tracking of advertisement set/instance 0x00 This fixes the state tracking of advertisement set/instance 0x00 which is considered a legacy instance and is not tracked individually by adv_instances list, previously it was assumed that hci_dev itself would track it via HCI_LE_ADV but that is a global state not specifc to instance 0x00, so to fix it a new flag is introduced that only tracks the state of instance 0x00. Fixes: `1488af7b8b` ("Bluetooth: hci_sync: Fix hci_resume_advertising_sync") Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>	2025-10-24 10:21:07 -04:00
Yue Haibing	114573962a	net/sched: Remove unused inline helper qdisc_from_priv() Since commit `fb38306ceb` ("net/sched: Retire ATM qdisc"), this is not used and can be removed. Signed-off-by: Yue Haibing <yuehaibing@huawei.com> Link: https://patch.msgid.link/20251021114626.3148894-1-yuehaibing@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-22 19:20:48 -07:00
David Yang	ca4709843b	net: dsa: tag_yt921x: add support for Motorcomm YT921x tags Add support for Motorcomm YT921x tags, which includes a proper configurable ethertype field (default to 0x9988). Signed-off-by: David Yang <mmyangfl@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://patch.msgid.link/20251017060859.326450-3-mmyangfl@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-21 18:25:30 -07:00
Eric Dumazet	3ff9bcecce	net: avoid extra access to sk->sk_wmem_alloc in sock_wfree() UDP TX packets destructor is sock_wfree(). It suffers from a cache line bouncing in sock_def_write_space_wfree(). Instead of reading sk->sk_wmem_alloc after we just did an atomic RMW on it, use __refcount_sub_and_test() to get the old value for free, and pass the new value to sock_def_write_space_wfree(). Add __sock_writeable() helper. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20251017133712.2842665-1-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-10-21 15:56:21 +02:00
Randy Dunlap	3701572931	nl802154: fix some kernel-doc warnings Correct multiple kernel-doc warnings in nl802154.h: - Fix a typo on one enum name to avoid a kernel-doc warning. - Drop 2 enum descriptions that are no longer needed. - Mark 2 internal enums as "private:" so that kernel-doc is not needed for them. Warning: nl802154.h:239 Enum value 'NL802154_CAP_ATTR_MAX_MAXBE' not described in enum 'nl802154_wpan_phy_capability_attr' Warning: nl802154.h:239 Excess enum value '%NL802154_CAP_ATTR_MIN_CCA_ED_LEVEL' description in 'nl802154_wpan_phy_capability_attr' Warning: nl802154.h:239 Excess enum value '%NL802154_CAP_ATTR_MAX_CCA_ED_LEVEL' description in 'nl802154_wpan_phy_capability_attr' Warning: nl802154.h:369 Enum value '__NL802154_CCA_OPT_ATTR_AFTER_LAST' not described in enum 'nl802154_cca_opts' Warning: nl802154.h:369 Enum value 'NL802154_CCA_OPT_ATTR_MAX' not described in enum 'nl802154_cca_opts' Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20251016035917.1148012-1-rdunlap@infradead.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-20 17:13:40 -07:00
Jakub Kicinski	e90576829c	Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Martin KaFai Lau says: ==================== pull-request: bpf-next 2025-10-16 We've added 6 non-merge commits during the last 1 day(s) which contain a total of 18 files changed, 577 insertions(+), 38 deletions(-). The main changes are: 1) Bypass the global per-protocol memory accounting either by setting a netns sysctl or using bpf_setsockopt in a bpf program, from Kuniyuki Iwashima. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: selftests/bpf: Add test for sk->sk_bypass_prot_mem. bpf: Introduce SK_BPF_BYPASS_PROT_MEM. bpf: Support bpf_setsockopt() for BPF_CGROUP_INET_SOCK_CREATE. net: Introduce net.core.bypass_prot_mem sysctl. net: Allow opt-out from global protocol memory accounting. tcp: Save lock_sock() for memcg in inet_csk_accept(). ==================== Link: https://patch.msgid.link/20251016204539.773707-1-martin.lau@linux.dev Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-17 17:20:42 -07:00
Eric Biggers	37a183d3b7	tcp: Convert tcp-md5 to use MD5 library instead of crypto_ahash Make tcp-md5 use the MD5 library API (added in 6.18) instead of the crypto_ahash API. This is much simpler and also more efficient: - The library API just operates on struct md5_ctx. Just allocate this struct on the stack instead of using a pool of pre-allocated crypto_ahash and ahash_request objects. - The library API accepts standard pointers and doesn't require scatterlists. So, for hashing the headers just use an on-stack buffer instead of a pool of pre-allocated kmalloc'ed scratch buffers. - The library API never fails. Therefore, checking for MD5 hashing errors is no longer necessary. Update tcp_v4_md5_hash_skb(), tcp_v6_md5_hash_skb(), tcp_v4_md5_hash_hdr(), tcp_v6_md5_hash_hdr(), tcp_md5_hash_key(), tcp_sock_af_ops::calc_md5_hash, and tcp_request_sock_ops::calc_md5_hash to return void instead of int. - The library API provides direct access to the MD5 code, eliminating unnecessary overhead such as indirect function calls and scatterlist management. Microbenchmarks of tcp_v4_md5_hash_skb() on x86_64 show a speedup from 7518 to 7041 cycles (6% fewer) with skb->len == 1440, or from 1020 to 678 cycles (33% fewer) with skb->len == 140. Since tcp_sigpool_hash_skb_data() can no longer be used, add a function tcp_md5_hash_skb_data() which is specialized to MD5. Of course, to the extent that this duplicates any code, it's well worth it. To preserve the existing behavior of TCP-MD5 support being disabled when the kernel is booted with "fips=1", make tcp_md5_do_add() check fips_enabled itself. Previously it relied on the error from crypto_alloc_ahash("md5") being bubbled up. I don't know for sure that this is actually needed, but this preserves the existing behavior. Tested with bidirectional TCP-MD5, both IPv4 and IPv6, between a kernel that includes this commit and a kernel that doesn't include this commit. (Side note: please don't use TCP-MD5! It's cryptographically weak. But as long as Linux supports it, it might as well be implemented properly.) Signed-off-by: Eric Biggers <ebiggers@kernel.org> Link: https://patch.msgid.link/20251014215836.115616-1-ebiggers@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-17 17:14:54 -07:00
Xuanqiang Luo	1532ed0d07	inet: Avoid ehash lookup race in inet_ehash_insert() Since ehash lookups are lockless, if one CPU performs a lookup while another concurrently deletes and inserts (removing reqsk and inserting sk), the lookup may fail to find the socket, an RST may be sent. The call trace map is drawn as follows: CPU 0 CPU 1 ----- ----- inet_ehash_insert() spin_lock() sk_nulls_del_node_init_rcu(osk) __inet_lookup_established() (lookup failed) __sk_nulls_add_node_rcu(sk, list) spin_unlock() As both deletion and insertion operate on the same ehash chain, this patch introduces a new sk_nulls_replace_node_init_rcu() helper functions to implement atomic replacement. Fixes: `5e0724d027` ("tcp/dccp: fix hashdance race for passive sessions") Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Jiayuan Chen <jiayuan.chen@linux.dev> Signed-off-by: Xuanqiang Luo <luoxuanqiang@kylinos.cn> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20251015020236.431822-3-xuanqiang.luo@linux.dev Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-17 16:08:43 -07:00
Kuniyuki Iwashima	1c17f4373d	ipv6: Move ipv6_fl_list from ipv6_pinfo to inet_sock. In {tcp6,udp6,raw6}_sock, struct ipv6_pinfo is always placed at the beginning of a new cache line because 1. __alignof__(struct tcp_sock) is 64 due to ____cacheline_aligned of __cacheline_group_begin(tcp_sock_write_tx) 2. __alignof__(struct udp_sock) is 64 due to ____cacheline_aligned of struct numa_drop_counters 3. in raw6_sock, struct numa_drop_counters is placed before struct ipv6_pinfo . struct ipv6_pinfo is 136 bytes, but the last cache line is only used by ipv6_fl_list: $ pahole -C ipv6_pinfo vmlinux struct ipv6_pinfo { ... /* --- cacheline 2 boundary (128 bytes) --- / struct ipv6_fl_socklist ipv6_fl_list; /* 128 8 / / size: 136, cachelines: 3, members: 23 / Let's move ipv6_fl_list from struct ipv6_pinfo to struct inet_sock to save a full cache line for {tcp6,udp6,raw6}_sock. Now, struct ipv6_pinfo is 128 bytes, and {tcp6,udp6,raw6}_sock have 64 bytes less, while {tcp,udp,raw}_sock retain the same size. Before: # grep -E "^(RAW\|UDP[^L\-]\|TCP)" /proc/slabinfo \| awk '{print $1, "\t", $4}' RAWv6 1408 UDPv6 1472 TCPv6 2560 RAW 1152 UDP 1280 TCP 2368 After: # grep -E "^(RAW\|UDP[^L\-]\|TCP)" /proc/slabinfo \| awk '{print $1, "\t", $4}' RAWv6 1344 UDPv6 1408 TCPv6 2496 RAW 1152 UDP 1280 TCP 2368 Also, ipv6_fl_list and inet_flags (SNDFLOW bit) are placed in the same cache line. $ pahole -C inet_sock vmlinux ... / --- cacheline 11 boundary (704 bytes) was 56 bytes ago --- / struct ipv6_pinfo pinet6; /* 760 8 / / --- cacheline 12 boundary (768 bytes) --- / struct ipv6_fl_socklist ipv6_fl_list; /* 768 8 / unsigned long inet_flags; / 776 8 */ Doc churn is due to the insufficient Type column (only 1 space short). Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20251014224210.2964778-1-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-17 16:06:52 -07:00
Eric Dumazet	100dfa74ca	net: dev_queue_xmit() llist adoption Remove busylock spinlock and use a lockless list (llist) to reduce spinlock contention to the minimum. Idea is that only one cpu might spin on the qdisc spinlock, while others simply add their skb in the llist. After this patch, we get a 300 % improvement on heavy TX workloads. - Sending twice the number of packets per second. - While consuming 50 % less cycles. Note that this also allows in the future to submit batches to various qdisc->enqueue() methods. Tested: - Dual Intel(R) Xeon(R) 6985P-C (480 hyper threads). - 100Gbit NIC, 30 TX queues with FQ packet scheduler. - echo 64 >/sys/kernel/slab/skbuff_small_head/cpu_partial (avoid contention in mm) - 240 concurrent "netperf -t UDP_STREAM -- -m 120 -n" Before: 16 Mpps (41 Mpps if each thread is pinned to a different cpu) vmstat 2 5 procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu----- r b swpd free buff cache si so bi bo in cs us sy id wa st 243 0 0 2368988672 51036 1100852 0 0 146 1 242 60 0 9 91 0 0 244 0 0 2368988672 51036 1100852 0 0 536 10 487745 14718 0 52 48 0 0 244 0 0 2368988672 51036 1100852 0 0 512 0 503067 46033 0 52 48 0 0 244 0 0 2368988672 51036 1100852 0 0 512 0 494807 12107 0 52 48 0 0 244 0 0 2368988672 51036 1100852 0 0 702 26 492845 10110 0 52 48 0 0 Lock contention (1 second sample taken on 8 cores) perf lock record -C0-7 sleep 1; perf lock contention contended total wait max wait avg wait type caller 442111 6.79 s 162.47 ms 15.35 us spinlock dev_hard_start_xmit+0xcd 5961 9.57 ms 8.12 us 1.60 us spinlock __dev_queue_xmit+0x3a0 244 560.63 us 7.63 us 2.30 us spinlock do_softirq+0x5b 13 25.09 us 3.21 us 1.93 us spinlock net_tx_action+0xf8 If netperf threads are pinned, spinlock stress is very high. perf lock record -C0-7 sleep 1; perf lock contention contended total wait max wait avg wait type caller 964508 7.10 s 147.25 ms 7.36 us spinlock dev_hard_start_xmit+0xcd 201 268.05 us 4.65 us 1.33 us spinlock __dev_queue_xmit+0x3a0 12 26.05 us 3.84 us 2.17 us spinlock do_softirq+0x5b @__dev_queue_xmit_ns: [256, 512) 21 \| \| [512, 1K) 631 \| \| [1K, 2K) 27328 \|@ \| [2K, 4K) 265392 \|@@@@@@@@@@@@@@@@ \| [4K, 8K) 417543 \|@@@@@@@@@@@@@@@@@@@@@@@@@@ \| [8K, 16K) 826292 \|@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@\| [16K, 32K) 733822 \|@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ \| [32K, 64K) 19055 \|@ \| [64K, 128K) 17240 \|@ \| [128K, 256K) 25633 \|@ \| [256K, 512K) 4 \| \| After: 29 Mpps (57 Mpps if each thread is pinned to a different cpu) vmstat 2 5 procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu----- r b swpd free buff cache si so bi bo in cs us sy id wa st 78 0 0 2369573632 32896 1350988 0 0 22 0 331 254 0 8 92 0 0 75 0 0 2369573632 32896 1350988 0 0 22 50 425713 280199 0 23 76 0 0 104 0 0 2369573632 32896 1350988 0 0 290 0 430238 298247 0 23 76 0 0 86 0 0 2369573632 32896 1350988 0 0 132 0 428019 291865 0 24 76 0 0 90 0 0 2369573632 32896 1350988 0 0 502 0 422498 278672 0 23 76 0 0 perf lock record -C0-7 sleep 1; perf lock contention contended total wait max wait avg wait type caller 2524 116.15 ms 486.61 us 46.02 us spinlock __dev_queue_xmit+0x55b 5821 107.18 ms 371.67 us 18.41 us spinlock dev_hard_start_xmit+0xcd 2377 9.73 ms 35.86 us 4.09 us spinlock ___slab_alloc+0x4e0 923 5.74 ms 20.91 us 6.22 us spinlock ___slab_alloc+0x5c9 121 3.42 ms 193.05 us 28.24 us spinlock net_tx_action+0xf8 6 564.33 us 167.60 us 94.05 us spinlock do_softirq+0x5b If netperf threads are pinned (~54 Mpps) perf lock record -C0-7 sleep 1; perf lock contention 32907 316.98 ms 195.98 us 9.63 us spinlock dev_hard_start_xmit+0xcd 4507 61.83 ms 212.73 us 13.72 us spinlock __dev_queue_xmit+0x554 2781 23.53 ms 40.03 us 8.46 us spinlock ___slab_alloc+0x5c9 3554 18.94 ms 34.69 us 5.33 us spinlock ___slab_alloc+0x4e0 233 9.09 ms 215.70 us 38.99 us spinlock do_softirq+0x5b 153 930.66 us 48.67 us 6.08 us spinlock net_tx_action+0xfd 84 331.10 us 14.22 us 3.94 us spinlock ___slab_alloc+0x5c9 140 323.71 us 9.94 us 2.31 us spinlock ___slab_alloc+0x4e0 @__dev_queue_xmit_ns: [128, 256) 1539830 \|@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ \| [256, 512) 2299558 \|@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@\| [512, 1K) 483936 \|@@@@@@@@@@ \| [1K, 2K) 265345 \|@@@@@@ \| [2K, 4K) 145463 \|@@@ \| [4K, 8K) 54571 \|@ \| [8K, 16K) 10270 \| \| [16K, 32K) 9385 \| \| [32K, 64K) 7749 \| \| [64K, 128K) 26799 \| \| [128K, 256K) 2665 \| \| [256K, 512K) 665 \| \| Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Tested-by: Jamal Hadi Salim <jhs@mojatatu.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://patch.msgid.link/20251014171907.3554413-7-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-16 16:25:10 -07:00
Eric Dumazet	526f5fb112	net: sched: claim one cache line in Qdisc Replace state2 field with a boolean. Move it to a hole between qstats and state so that we shrink Qdisc by a full cache line. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Tested-by: Jamal Hadi Salim <jhs@mojatatu.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://patch.msgid.link/20251014171907.3554413-6-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-16 16:25:10 -07:00
Eric Dumazet	178ca30889	Revert "net/sched: Fix mirred deadlock on device recursion" This reverts commits `0f022d32c3` and `44180feacc`. Prior patch in this series implemented loop detection in act_mirred, we can remove q->owner to save some cycles in the fast path. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Reviewed-by: Victor Nogueira <victor@mojatatu.com> Tested-by: Jamal Hadi Salim <jhs@mojatatu.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://patch.msgid.link/20251014171907.3554413-5-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-16 16:25:10 -07:00
Kuniyuki Iwashima	b46ab63181	net: Introduce net.core.bypass_prot_mem sysctl. If a socket has sk->sk_bypass_prot_mem flagged, the socket opts out of the global protocol memory accounting. Let's control the flag by a new sysctl knob. The flag is written once during socket(2) and is inherited to child sockets. Tested with a script that creates local socket pairs and send()s a bunch of data without recv()ing. Setup: # mkdir /sys/fs/cgroup/test # echo $$ >> /sys/fs/cgroup/test/cgroup.procs # sysctl -q net.ipv4.tcp_mem="1000 1000 1000" # ulimit -n 524288 Without net.core.bypass_prot_mem, charged to tcp_mem & memcg # python3 pressure.py & # cat /sys/fs/cgroup/test/memory.stat \| grep sock sock 22642688 <-------------------------------------- charged to memcg # cat /proc/net/sockstat\| grep TCP TCP: inuse 2006 orphan 0 tw 0 alloc 2008 mem 5376 <-- charged to tcp_mem # ss -tn \| head -n 5 State Recv-Q Send-Q Local Address:Port Peer Address:Port ESTAB 2000 0 127.0.0.1:34479 127.0.0.1:53188 ESTAB 2000 0 127.0.0.1:34479 127.0.0.1:49972 ESTAB 2000 0 127.0.0.1:34479 127.0.0.1:53868 ESTAB 2000 0 127.0.0.1:34479 127.0.0.1:53554 # nstat \| grep Pressure \|\| echo no pressure TcpExtTCPMemoryPressures 1 0.0 With net.core.bypass_prot_mem=1, charged to memcg only: # sysctl -q net.core.bypass_prot_mem=1 # python3 pressure.py & # cat /sys/fs/cgroup/test/memory.stat \| grep sock sock 2757468160 <------------------------------------ charged to memcg # cat /proc/net/sockstat \| grep TCP TCP: inuse 2006 orphan 0 tw 0 alloc 2008 mem 0 <- NOT charged to tcp_mem # ss -tn \| head -n 5 State Recv-Q Send-Q Local Address:Port Peer Address:Port ESTAB 111000 0 127.0.0.1:36019 127.0.0.1:49026 ESTAB 110000 0 127.0.0.1:36019 127.0.0.1:45630 ESTAB 110000 0 127.0.0.1:36019 127.0.0.1:44870 ESTAB 111000 0 127.0.0.1:36019 127.0.0.1:45274 # nstat \| grep Pressure \|\| echo no pressure no pressure Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Link: https://patch.msgid.link/20251014235604.3057003-4-kuniyu@google.com	2025-10-16 12:04:47 -07:00
Kuniyuki Iwashima	7c268eaeec	net: Allow opt-out from global protocol memory accounting. Some protocols (e.g., TCP, UDP) implement memory accounting for socket buffers and charge memory to per-protocol global counters pointed to by sk->sk_proto->memory_allocated. Sometimes, system processes do not want that limitation. For a similar purpose, there is SO_RESERVE_MEM for sockets under memcg. Also, by opting out of the per-protocol accounting, sockets under memcg can avoid paying costs for two orthogonal memory accounting mechanisms. A microbenchmark result is in the subsequent bpf patch. Let's allow opt-out from the per-protocol memory accounting if sk->sk_bypass_prot_mem is true. sk->sk_bypass_prot_mem and sk->sk_prot are placed in the same cache line, and sk_has_account() always fetches sk->sk_prot before accessing sk->sk_bypass_prot_mem, so there is no extra cache miss for this patch. The following patches will set sk->sk_bypass_prot_mem to true, and then, the per-protocol memory accounting will be skipped. Note that this does NOT disable memcg, but rather the per-protocol one. Another option not to use the hole in struct sock_common is create sk_prot variants like tcp_prot_bypass, but this would complicate SOCKMAP logic, tcp_bpf_prots etc. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Link: https://patch.msgid.link/20251014235604.3057003-3-kuniyu@google.com	2025-10-16 12:04:47 -07:00
Jakub Kicinski	55db64ddd6	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Cross-merge networking fixes after downstream PR (net-6.18-rc2). No conflicts or adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-16 11:06:28 -07:00
Eric Dumazet	e5b670e543	net: remove obsolete WARN_ON(refcount_read(&sk->sk_refcnt) == 1) sk->sk_refcnt has been converted to refcount_t in 2017. __sock_put(sk) being refcount_dec(&sk->sk_refcnt), it will complain loudly if the current refcnt is 1 (or less) in a non racy way. We can remove four WARN_ON() in favor of the generic refcount_dec() check. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Xuanqiang Luo<luoxuanqiang@kylinos.cn> Link: https://patch.msgid.link/20251014140605.2982703-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-15 17:18:38 -07:00
Eric Dumazet	4a7708443d	net: allow busy connected flows to switch tx queues This is a followup of commit `726e9e8b94` ("tcp: refine skb->ooo_okay setting") and of prior commit in this series ("net: control skb->ooo_okay from skb_set_owner_w()") skb->ooo_okay might never be set for bulk flows that always have at least one skb in a qdisc queue of NIC queue, especially if TX completion is delayed because of a stressed cpu. The so-called "strange attractors" has caused many performance issues (see for instance `9b462d02d6` ("tcp: TCP Small Queues and strange attractors")), we need to do better. We have tried very hard to avoid reorders because TCP was not dealing with them nicely a decade ago. Use the new net.core.txq_reselection_ms sysctl to let flows follow XPS and select a more efficient queue. After this patch, we no longer have to make sure threads are pinned to cpus, they now can be migrated without adding too much spinlock/qdisc/TX completion pressure anymore. TX completion part was problematic, because it added false sharing on various socket fields, but also added false sharing and spinlock contention in mm layers. Calling skb_orphan() from ndo_start_xmit() is not an option unfortunately. Note for later: 1) move sk->sk_tx_queue_mapping closer to sk_tx_queue_mapping_jiffies for better cache locality. 2) Study if `9b462d02d6` ("tcp: TCP Small Queues and strange attractors") could be revised. Tested: Used a host with 32 TX queues, shared by groups of 8 cores. XPS setup : echo ff >/sys/class/net/eth1/queue/tx-0/xps_cpus echo ff00 >/sys/class/net/eth1/queue/tx-1/xps_cpus echo ff0000 >/sys/class/net/eth1/queue/tx-2/xps_cpus echo ff000000 >/sys/class/net/eth1/queue/tx-3/xps_cpus echo ff,00000000 >/sys/class/net/eth1/queue/tx-4/xps_cpus echo ff00,00000000 >/sys/class/net/eth1/queue/tx-5/xps_cpus echo ff0000,00000000 >/sys/class/net/eth1/queue/tx-6/xps_cpus echo ff000000,00000000 >/sys/class/net/eth1/queue/tx-7/xps_cpus ... Launched a tcp_stream with 15 threads and 1000 flows, initially affined to core 0-15 taskset -c 0-15 tcp_stream -T15 -F1000 -l1000 -c -H target_host Checked that only queues 0 and 1 are used as instructed by XPS : tc -s qdisc show dev eth1\|grep backlog\|grep -v "backlog 0b 0p" backlog 123489410b 1890p backlog 69809026b 1064p backlog 52401054b 805p Then force each thread to run on cpu 1,9,17,25,33,41,49,57,65,73,81,89,97,105,113,121 C=1;PID=`pidof tcp_stream`;for P in `ls /proc/$PID/task`; do taskset -pc $C $P; C=$(($C + 8));done Set txq_reselection_ms to 1000 echo 1000 > /proc/sys/net/core/txq_reselection_ms Check that the flows have migrated nicely: tc -s qdisc show dev eth1\|grep backlog\|grep -v "backlog 0b 0p" backlog 130508314b 1916p backlog 8584380b 126p backlog 8584380b 126p backlog 8379990b 123p backlog 8584380b 126p backlog 8487484b 125p backlog 8584380b 126p backlog 8448120b 124p backlog 8584380b 126p backlog 8720640b 128p backlog 8856900b 130p backlog 8584380b 126p backlog 8652510b 127p backlog 8448120b 124p backlog 8516250b 125p backlog 7834950b 115p Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20251013152234.842065-5-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-15 09:04:22 -07:00
Eric Dumazet	2ddef3462b	net: add /proc/sys/net/core/txq_reselection_ms control Add a new sysctl to control how often a queue reselection can happen even if a flow has a persistent queue of skbs in a Qdisc or NIC queue. A value of zero means the feature is disabled. Default is 1000 (1 second). This sysctl is used in the following patch. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20251013152234.842065-4-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-15 09:04:21 -07:00
Eric Dumazet	6ddb811a57	net: add SK_WMEM_ALLOC_BIAS constant sk->sk_wmem_alloc is initialized to 1, and sk_wmem_alloc_get() takes care of this initial value. Add SK_WMEM_ALLOC_BIAS define to not spread this magic value. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20251013152234.842065-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-15 09:04:21 -07:00
Eric Dumazet	1c51450f1a	tcp: better handle TCP_TX_DELAY on established flows Some applications uses TCP_TX_DELAY socket option after TCP flow is established. Some metrics need to be updated, otherwise TCP might take time to adapt to the new (emulated) RTT. This patch adjusts tp->srtt_us, tp->rtt_min, icsk_rto and sk->sk_pacing_rate. This is best effort, and for instance icsk_rto is reset without taking backoff into account. Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20251013145926.833198-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-15 08:56:30 -07:00
Byungchul Park	53615ad26e	netmem: replace __netmem_clear_lsb() with netmem_to_nmdesc() Now that we have struct netmem_desc, it'd better access the pp fields via struct netmem_desc rather than struct net_iov. Introduce netmem_to_nmdesc() for safely converting netmem_ref to netmem_desc regardless of the type underneath e.i. netmem_desc, net_iov. While at it, remove __netmem_clear_lsb() and make netmem_to_nmdesc() used instead. Suggested-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Byungchul Park <byungchul@sk.com> Reviewed-by: Mina Almasry <almasrymina@google.com> Link: https://patch.msgid.link/20251013044133.69472-1-byungchul@sk.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-10-14 13:37:26 +02:00

1 2 3 4 5 ...

18836 Commits