linux

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2026-05-20 07:31:07 -04:00

Author	SHA1	Message	Date
Ido Schimmel	4a8c416602	ipv4: icmp: Fix source IP derivation in presence of VRFs When the "icmp_errors_use_inbound_ifaddr" sysctl is enabled, the source IP of ICMP error messages should be the "primary address of the interface that received the packet that caused the icmp error". The IPv4 ICMP code determines this interface using inet_iif() which in the input path translates to skb->skb_iif. If the interface that received the packet is a VRF port, skb->skb_iif will contain the ifindex of the VRF device and not that of the receiving interface. This is because in the input path the VRF driver overrides skb->skb_iif with the ifindex of the VRF device itself (see vrf_ip_rcv()). As such, the source IP that will be chosen for the ICMP error message is either an address assigned to the VRF device itself (if present) or an address assigned to some VRF port, not necessarily the input or output interface. This behavior is especially problematic when the error messages are "Time Exceeded" messages as it means that utilities like traceroute will show an incorrect packet path. Solve this by determining the input interface based on the iif field in the control block, if present. This field is set in the input path to skb->skb_iif and is not later overridden by the VRF driver, unlike skb->skb_iif. This behavior is consistent with the IPv6 counterpart that already uses the iif from the control block. Reported-by: Andy Roulin <aroulin@nvidia.com> Reported-by: Rajkumar Srinivasan <rajsrinivasa@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20250908073238.119240-4-idosch@nvidia.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-09-11 12:22:38 +02:00
Ido Schimmel	0d3c4a4416	ipv4: icmp: Pass IPv4 control block structure as an argument to __icmp_send() __icmp_send() is used to generate ICMP error messages in response to various situations such as MTU errors (i.e., "Fragmentation Required") and too many hops (i.e., "Time Exceeded"). The skb that generated the error does not necessarily come from the IPv4 layer and does not always have a valid IPv4 control block in skb->cb. Therefore, commit `9ef6b42ad6` ("net: Add __icmp_send helper.") changed the function to take the IP options structure as argument instead of deriving it from the skb's control block. Some callers of this function such as icmp_send() pass the IP options structure from the skb's control block as in these call paths the control block is known to be valid, but other callers simply pass a zeroed structure. A subsequent patch will need __icmp_send() to access more information from the IPv4 control block (specifically, the ifindex of the input interface). As a preparation for this change, change the function to take the IPv4 control block structure as an argument instead of the IP options structure. This makes the function similar to its IPv6 counterpart that already takes the IPv6 control block structure as an argument. No functional changes intended. Reviewed-by: Petr Machata <petrm@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250908073238.119240-3-idosch@nvidia.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-09-11 12:22:38 +02:00
Ido Schimmel	cda276bcb9	ipv4: cipso: Simplify IP options handling in cipso_v4_error() When __ip_options_compile() is called with an skb, the IP options are parsed from the skb data into the provided IP option argument. This is in contrast to the case where the skb argument is NULL and the options are parsed from opt->__data. Given that cipso_v4_error() always passes an skb to __ip_options_compile(), there is no need to allocate an extra 40 bytes (maximum IP options size). Therefore, simplify the function by removing these extra bytes and make the function similar to ipv4_send_dest_unreach() which also calls both __ip_options_compile() and __icmp_send(). This is a preparation for changing the arguments being passed to __icmp_send(). No functional changes intended. Reviewed-by: Petr Machata <petrm@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Acked-by: Paul Moore <paul@paul-moore.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250908073238.119240-2-idosch@nvidia.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-09-11 12:22:38 +02:00
Alok Tiwari	d436b5abba	ipv4: udp: fix typos in comments Correct typos in ipv4/udp.c comments for clarity: "Encapulation" -> "Encapsulation" "measureable" -> "measurable" "tacking care" -> "taking care" No functional changes. Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250907192535.3610686-1-alok.a.tiwari@oracle.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-09 16:29:05 -07:00
Eric Dumazet	b7b74953f8	ipv4: snmp: do not use SNMP_MIB_SENTINEL anymore Use ARRAY_SIZE(), so that we know the limit at compile time. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Link: https://patch.msgid.link/20250905165813.1470708-5-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-08 18:06:20 -07:00
Eric Dumazet	b13592d20b	tcp: use tcp_eat_recv_skb in __tcp_close() Small change to use tcp_eat_recv_skb() instead of __kfree_skb(). This can help if an application under attack has to close many sockets with unread data. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Link: https://patch.msgid.link/20250903084720.1168904-4-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-04 19:13:41 -07:00
Eric Dumazet	5f92385309	tcp: fix __tcp_close() to only send RST when required If the receive queue contains payload that was already received, __tcp_close() can send an unexpected RST. Refine the code to take tp->copied_seq into account, as we already do in tcp recvmsg(). Fixes: `1da177e4c3` ("Linux-2.6.12-rc2") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Link: https://patch.msgid.link/20250903084720.1168904-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-04 19:13:41 -07:00
Jakub Kicinski	5ef04a7b06	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Cross-merge networking fixes after downstream PR (net-6.17-rc5). No conflicts. Adjacent changes: include/net/sock.h `c51613fa27` ("net: add sk->sk_drop_counters") `5d6b58c932` ("net: lockless sock_i_ino()") Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-04 13:33:00 -07:00
Dan Carpenter	a51160f8da	ipv4: Fix NULL vs error pointer check in inet_blackhole_dev_init() The inetdev_init() function never returns NULL. Check for error pointers instead. Fixes: `22600596b6` ("ipv4: give an IPv4 dev to blackhole_netdev") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/aLaQWL9NguWmeM1i@stanley.mountain Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-03 16:58:44 -07:00
Jakub Kicinski	24ee9feeb3	Merge tag 'nf-next-25-09-02' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next Florian Westphal says: ==================== netfilter: updates for net-next 1) prefer vmalloc_array in ebtables, from Qianfeng Rong. 2) Use csum_replace4 instead of open-coding it, from Christophe Leroy. 3+4) Get rid of GFP_ATOMIC in transaction object allocations, those cause silly failures with large sets under memory pressure, from myself. 5) Remove test for AVX cpu feature in nftables pipapo set type, testing for AVX2 feature is sufficient. 6) Unexport a few function in nf_reject infra: no external callers. 7) Extend payload offset to u16, this was restricted to values <=255 so far, from Fernando Fernandez Mancera. * tag 'nf-next-25-09-02' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next: netfilter: nft_payload: extend offset to 65535 bytes netfilter: nf_reject: remove unneeded exports netfilter: nft_set_pipapo: remove redundant test for avx feature bit netfilter: nf_tables: all transaction allocations can now sleep netfilter: nf_tables: allow iter callbacks to sleep netfilter: nft_payload: Use csum_replace4() instead of opencoding netfilter: ebtables: Use vmalloc_array() to improve code ==================== Link: https://patch.msgid.link/20250902133549.15945-1-fw@strlen.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-03 16:06:45 -07:00
Asbjørn Sloth Tønnesen	9f9581ba74	netlink: specs: fou: change local-v6/peer-v6 check While updating the binary min-len implementation, I noticed that the only user, should AFAICT be using exact-len instead. In net/ipv4/fou_core.c FOU_ATTR_LOCAL_V6 and FOU_ATTR_PEER_V6 are only used for singular IPv6 addresses, and there are AFAICT no known implementations trying to send more, it therefore appears safe to change it to an exact-len policy. This patch therefore changes the local-v6/peer-v6 attributes to use an exact-len check, instead of a min-len check. Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net> Reviewed-by: Donald Hunter <donald.hunter@gmail.com> Link: https://patch.msgid.link/20250902154640.759815-2-ast@fiberby.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-03 15:16:49 -07:00
Florian Westphal	f4f9e05904	netfilter: nf_reject: remove unneeded exports These functions have no external callers and can be static. Signed-off-by: Florian Westphal <fw@strlen.de>	2025-09-02 15:28:17 +02:00
Eric Dumazet	51ba2d26bc	inet: ping: use EXPORT_IPV6_MOD[_GPL]() There is no need to export ping symbols when CONFIG_IPV6=y Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250829153054.474201-5-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-01 13:15:14 -07:00
Eric Dumazet	689adb36bd	inet: ping: make ping_port_rover per netns Provide isolation between netns for ping idents. Randomize initial ping_port_rover value at netns creation. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20250829153054.474201-4-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-01 13:15:14 -07:00
Eric Dumazet	10343e7e6c	inet: ping: remove ping_hash() There is no point in keeping ping_hash(). Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Yue Haibing <yuehaibing@huawei.com> Link: https://patch.msgid.link/20250829153054.474201-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-01 13:15:14 -07:00
Eric Dumazet	59f26d86b2	inet: ping: check sock_net() in ping_get_port() and ping_lookup() We need to check socket netns before considering them in ping_get_port(). Otherwise, one malicious netns could 'consume' all ports. Add corresponding check in ping_lookup(). Fixes: `c319b4d76b` ("net: ipv4: add IPPROTO_ICMP socket kind") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Yue Haibing <yuehaibing@huawei.com> Link: https://patch.msgid.link/20250829153054.474201-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-01 13:15:14 -07:00
Fabian Bläse	c6dd1aa2cb	icmp: fix icmp_ndo_send address translation for reply direction The icmp_ndo_send function was originally introduced to ensure proper rate limiting when icmp_send is called by a network device driver, where the packet's source address may have already been transformed by SNAT. However, the original implementation only considers the IP_CT_DIR_ORIGINAL direction for SNAT and always replaced the packet's source address with that of the original-direction tuple. This causes two problems: 1. For SNAT: Reply-direction packets were incorrectly translated using the source address of the CT original direction, even though no translation is required. 2. For DNAT: Reply-direction packets were not handled at all. In DNAT, the original direction's destination is translated. Therefore, in the reply direction the source address must be set to the reply-direction source, so rate limiting works as intended. Fix this by using the connection direction to select the correct tuple for source address translation, and adjust the pre-checks to handle reply-direction packets in case of DNAT. Additionally, wrap the `ct->status` access in READ_ONCE(). This avoids possible KCSAN reports about concurrent updates to `ct->status`. Fixes: `0b41713b60` ("icmp: introduce helper for nat'd source address in network device context") Signed-off-by: Fabian Bläse <fabian@blaese.de> Cc: Jason A. Donenfeld <Jason@zx2c4.com> Reviewed-by: Florian Westphal <fw@strlen.de> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-01 12:54:41 -07:00
Kuniyuki Iwashima	7051b54fb5	tcp: Remove sk->sk_prot->orphan_count. TCP tracks the number of orphaned (SOCK_DEAD but not yet destructed) sockets in tcp_orphan_count. In some code that was shared with DCCP, tcp_orphan_count is referenced via sk->sk_prot->orphan_count. Let's reference tcp_orphan_count directly. inet_csk_prepare_for_destroy_sock() is moved to inet_connection_sock.c due to header dependency. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250829215641.711664-1-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-01 12:52:09 -07:00
Eric Dumazet	6ad8de3cef	ipv4: start using dst_dev_rcu() Change icmpv4_xrlim_allow(), ip_defrag() to prevent possible UAF. Change ipmr_prepare_xmit(), ipmr_queue_fwd_xmit(), ip_mr_output(), ipv4_neigh_lookup() to use lockdep enabled dst_dev_rcu(). Fixes: `4a6ce2b6f2` ("net: introduce a new function dst_dev_put()") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20250828195823.3958522-9-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-29 19:36:32 -07:00
Eric Dumazet	b62a59c18b	tcp: use dst_dev_rcu() in tcp_fastopen_active_disable_ofo_check() Use RCU to avoid a pair of atomic operations and a potential UAF on dst_dev()->flags. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20250828195823.3958522-8-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-29 19:36:32 -07:00
Eric Dumazet	50c127a69c	tcp_metrics: use dst_dev_net_rcu() Replace three dst_dev() with a lockdep enabled helper. Fixes: `4a6ce2b6f2` ("net: introduce a new function dst_dev_put()") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20250828195823.3958522-7-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-29 19:36:32 -07:00
Eric Dumazet	caedcc5b6d	net: dst: introduce dst->dev_rcu Followup of commit `88fe14253e` ("net: dst: add four helpers to annotate data-races around dst->dev"). We want to gradually add explicit RCU protection to dst->dev, including lockdep support. Add an union to alias dst->dev_rcu and dst->dev. Add dst_dev_net_rcu() helper. Fixes: `4a6ce2b6f2` ("net: introduce a new function dst_dev_put()") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20250828195823.3958522-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-29 19:36:31 -07:00
Eric Dumazet	95fa78830e	inet_diag: avoid cache line misses in inet_diag_bc_sk() inet_diag_bc_sk() pulls five cache lines per socket, while most filters only need the two first ones. Add three booleans to struct inet_diag_dump_data, that are selectively set if a filter needs specific socket fields. - mark_needed /* INET_DIAG_BC_MARK_COND present. */ - cgroup_needed /* INET_DIAG_BC_CGROUP_COND present. */ - userlocks_needed /* INET_DIAG_BC_AUTO present. */ This removes millions of cache lines misses per ss invocation when simple filters are specified on busy servers. offsetof(struct sock, sk_userlocks) = 0xf3 offsetof(struct sock, sk_mark) = 0x20c offsetof(struct sock, sk_cgrp_data) = 0x298 Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250828102738.2065992-6-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-29 19:29:24 -07:00
Eric Dumazet	9529320ad6	inet_diag: change inet_diag_bc_sk() first argument We want to have access to the inet_diag_dump_data structure in the following patch. This patch removes duplication in callers. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250828102738.2065992-5-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-29 19:29:24 -07:00
Eric Dumazet	4fd84a0aaf	inet_diag: annotate data-races in inet_diag_bc_sk() inet_diag_bc_sk() runs with an unlocked socket, annotate potential races with READ_ONCE(). Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250828102738.2065992-4-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-29 19:29:24 -07:00
Eric Dumazet	8e60447f08	tcp: annotate data-races in tcp_req_diag_fill() req->num_retrans and rsk_timer.expires are read locklessly, and can be changed from tcp_rtx_synack(). Add READ_ONCE()/WRITE_ONCE() annotations. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250828102738.2065992-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-29 19:29:24 -07:00
Eric Dumazet	9a574257b9	inet_diag: annotate data-races in inet_diag_msg_common_fill() inet_diag_msg_common_fill() can run without socket lock. Add READ_ONCE() or data_race() annotations. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250828102738.2065992-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-29 19:29:24 -07:00
Jakub Kicinski	d23ad54de7	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Cross-merge networking fixes after downstream PR (net-6.17-rc4). No conflicts. Adjacent changes: drivers/net/ethernet/intel/idpf/idpf_txrx.c `02614eee26` ("idpf: do not linearize big TSO packets") `6c4e684802` ("idpf: remove obsolete stashing code") Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-29 11:48:01 -07:00
Eric Dumazet	b81aa23234	inet: raw: add drop_counters to raw sockets When a packet flood hits one or more RAW sockets, many cpus have to update sk->sk_drops. This slows down other cpus, because currently sk_drops is in sock_write_rx group. Add a socket_drop_counters structure to raw sockets. Using dedicated cache lines to hold drop counters makes sure that consumers no longer suffer from false sharing if/when producers only change sk->sk_drops. This adds 128 bytes per RAW socket. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250826125031.1578842-6-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-08-28 13:14:50 +02:00
Eric Dumazet	cb4d5a6eb6	net: add sk_drops_skbadd() helper Existing sk_drops_add() helper is renamed to sk_drops_skbadd(). Add sk_drops_add() and convert sk_drops_inc() to use it. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250826125031.1578842-3-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-08-28 13:14:50 +02:00
Eric Dumazet	f86f42ed2c	net: add sk_drops_read(), sk_drops_inc() and sk_drops_reset() helpers We want to split sk->sk_drops in the future to reduce potential contention on this field. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250826125031.1578842-2-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-08-28 13:14:50 +02:00
Oscar Maes	5189446ba9	net: ipv4: fix regression in local-broadcast routes Commit `9e30ecf23b` ("net: ipv4: fix incorrect MTU in broadcast routes") introduced a regression where local-broadcast packets would have their gateway set in __mkroute_output, which was caused by fi = NULL being removed. Fix this by resetting the fib_info for local-broadcast packets. This preserves the intended changes for directed-broadcast packets. Cc: stable@vger.kernel.org Fixes: `9e30ecf23b` ("net: ipv4: fix incorrect MTU in broadcast routes") Reported-by: Brett A C Sheffield <bacs@librecast.net> Closes: https://lore.kernel.org/regressions/20250822165231.4353-4-bacs@librecast.net Signed-off-by: Oscar Maes <oscmaes92@gmail.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20250827062322.4807-1-oscmaes92@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-08-28 10:52:30 +02:00
Guillaume Nault	1bec9d0c00	ipv4: Convert ->flowi4_tos to dscp_t. Convert the ->flowic_tos field of struct flowi_common from __u8 to dscp_t, rename it ->flowic_dscp and propagate these changes to struct flowi and struct flowi4. We've had several bugs in the past where ECN bits could interfere with IPv4 routing, because these bits were not properly cleared when setting ->flowi4_tos. These bugs should be fixed now and the dscp_t type has been introduced to ensure that variables carrying DSCP values don't accidentally have any ECN bits set. Several variables and structure fields have been converted to dscp_t already, but the main IPv4 routing structure, struct flowi4, is still using a __u8. To avoid any future regression, this patch converts it to dscp_t. There are many users to convert at once. Fortunately, around half of ->flowi4_tos users already have a dscp_t value at hand, which they currently convert to __u8 using inet_dscp_to_dsfield(). For all of these users, we just need to drop that conversion. But, although we try to do the __u8 <-> dscp_t conversions at the boundaries of the network or of user space, some places still store TOS/DSCP variables as __u8 in core networking code. Those can hardly be converted either because the data structure is part of UAPI or because the same variable or field is also used for handling ECN in other parts of the code. In all of these cases where we don't have a dscp_t variable at hand, we need to use inet_dsfield_to_dscp() when interacting with ->flowi4_dscp. Changes since v1: * Fix space alignment in __bpf_redirect_neigh_v4() (Ido). Signed-off-by: Guillaume Nault <gnault@redhat.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/29acecb45e911d17446b9a3dbdb1ab7b821ea371.1756128932.git.gnault@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-26 17:34:31 -07:00
Kuniyuki Iwashima	382a4d9cb6	tcp: Move TCP-specific diag functions to tcp_diag.c. tcp_diag_dump() / tcp_diag_dump_one() is just a wrapper of inet_diag_dump_icsk() / inet_diag_dump_one_icsk(), respectively. Let's inline them in tcp_diag.c and move static callees as well. Note that inet_sk_attr_size() is merged into tcp_diag_get_aux_size(), and we remove inet_diag_handler.idiag_get_aux_size() accordingly. While at it, BUG_ON() is replaced with DEBUG_NET_WARN_ON_ONCE(). Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250822190803.540788-7-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-25 17:53:36 -07:00
Kuniyuki Iwashima	f1241200cd	tcp: Don't pass hashinfo to inet_diag helpers. These inet_diag functions required struct inet_hashinfo because they are shared by TCP and DCCP: * inet_diag_dump_icsk() * inet_diag_dump_one_icsk() * inet_diag_find_one_icsk() DCCP has gone, and we don't need to pass hashinfo down to them. Let's fetch net->ipv4.tcp_death_row.hashinfo directly in the first 2 functions. Note that inet_diag_find_one_icsk() don't need hashinfo since the previous patch. We will move TCP-specific functions to tcp_diag.c in the next patch. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250822190803.540788-6-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-25 17:53:36 -07:00
Kuniyuki Iwashima	cb16f4b6c7	tcp: Don't pass hashinfo to socket lookup helpers. These socket lookup functions required struct inet_hashinfo because they are shared by TCP and DCCP. * __inet_lookup_established() * __inet_lookup_listener() * __inet6_lookup_established() * inet6_lookup_listener() DCCP has gone, and we don't need to pass hashinfo down to them. Let's fetch net->ipv4.tcp_death_row.hashinfo directly in the above 4 functions. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250822190803.540788-5-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-25 17:53:35 -07:00
Kuniyuki Iwashima	8150f3a44b	tcp: Remove hashinfo test for inet6?_lookup_run_sk_lookup(). Commit `6c886db2e7` ("net: remove duplicate sk_lookup helpers") started to check if hashinfo == net->ipv4.tcp_death_row.hashinfo in __inet_lookup_listener() and inet6_lookup_listener() and stopped invoking BPF sk_lookup prog for DCCP. DCCP has gone and the condition is always true. Let's remove the hashinfo test. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250822190803.540788-4-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-25 17:53:35 -07:00
Kuniyuki Iwashima	2d842b6c67	tcp: Remove timewait_sock_ops.twsk_destructor(). Since DCCP has been removed, sk->sk_prot->twsk_prot->twsk_destructor is always tcp_twsk_destructor(). Let's call tcp_twsk_destructor() directly in inet_twsk_free() and remove ->twsk_destructor(). While at it, tcp_twsk_destructor() is un-exported. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250822190803.540788-3-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-25 17:53:35 -07:00
Kuniyuki Iwashima	9db0163e3c	tcp: Remove sk_protocol test for tcp_twsk_unique(). Commit `383eed2de5` ("tcp: get rid of twsk_unique()") added sk->sk_protocol test in __inet_check_established() and __inet6_check_established() to remove twsk_unique() and call tcp_twsk_unique() directly. DCCP has gone, and the condition is always true. Let's remove the sk_protocol test. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250822190803.540788-2-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-25 17:53:35 -07:00
Eric Dumazet	9bd999eb35	tcp: annotate data-races around icsk->icsk_probes_out icsk->icsk_probes_out is read locklessly from inet_sk_diag_fill(), get_tcp4_sock() and get_tcp6_sock(). Add corresponding READ_ONCE()/WRITE_ONCE() annotations. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Link: https://patch.msgid.link/20250822091727.835869-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-25 16:20:59 -07:00
Eric Dumazet	e6f178be3c	tcp: annotate data-races around icsk->icsk_retransmits icsk->icsk_retransmits is read locklessly from inet_sk_diag_fill(), tcp_get_timestamping_opt_stats, get_tcp4_sock() and get_tcp6_sock(). Add corresponding READ_ONCE()/WRITE_ONCE() annotations. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Link: https://patch.msgid.link/20250822091727.835869-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-25 16:20:59 -07:00
Oscar Maes	1b8c5fa0cb	net: ipv4: allow directed broadcast routes to use dst hint Currently, ip_extract_route_hint uses RTN_BROADCAST to decide whether to use the route dst hint mechanism. This check is too strict, as it prevents directed broadcast routes from using the hint, resulting in poor performance during bursts of directed broadcast traffic. Fix this in ip_extract_route_hint and modify ip_route_use_hint to preserve the intended behaviour. Signed-off-by: Oscar Maes <oscmaes92@gmail.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20250819174642.5148-2-oscmaes92@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-25 16:07:16 -07:00
Eric Dumazet	9217146fee	tcp: lockless TCP_MAXSEG option setsockopt(TCP_MAXSEG) writes over a field that does not need socket lock protection anymore. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Link: https://patch.msgid.link/20250821141901.18839-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-22 15:58:59 -07:00
Eric Dumazet	d5ffba0f25	tcp: annotate data-races around tp->rx_opt.user_mss This field is already read locklessly for listeners, next patch will make setsockopt(TCP_MAXSEG) lockless. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Link: https://patch.msgid.link/20250821141901.18839-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-22 15:58:58 -07:00
Jakub Kicinski	a9af709fda	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Cross-merge networking fixes after downstream PR (net-6.17-rc3). No conflicts or adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-21 11:33:15 -07:00
Florian Westphal	91a79b7922	netfilter: nf_reject: don't leak dst refcount for loopback packets recent patches to add a WARN() when replacing skb dst entry found an old bug: WARNING: include/linux/skbuff.h:1165 skb_dst_check_unset include/linux/skbuff.h:1164 [inline] WARNING: include/linux/skbuff.h:1165 skb_dst_set include/linux/skbuff.h:1210 [inline] WARNING: include/linux/skbuff.h:1165 nf_reject_fill_skb_dst+0x2a4/0x330 net/ipv4/netfilter/nf_reject_ipv4.c:234 [..] Call Trace: nf_send_unreach+0x17b/0x6e0 net/ipv4/netfilter/nf_reject_ipv4.c:325 nft_reject_inet_eval+0x4bc/0x690 net/netfilter/nft_reject_inet.c:27 expr_call_ops_eval net/netfilter/nf_tables_core.c:237 [inline] .. This is because blamed commit forgot about loopback packets. Such packets already have a dst_entry attached, even at PRE_ROUTING stage. Instead of checking hook just check if the skb already has a route attached to it. Fixes: `f53b9b0bdc` ("netfilter: introduce support for reject at prerouting stage") Signed-off-by: Florian Westphal <fw@strlen.de> Link: https://patch.msgid.link/20250820123707.10671-1-fw@strlen.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-21 10:02:00 -07:00
Eric Dumazet	a6d4f25888	net: set net.core.rmem_max and net.core.wmem_max to 4 MB SO_RCVBUF and SO_SNDBUF have limited range today, unless distros or system admins change rmem_max and wmem_max. Even iproute2 uses 1 MB SO_RCVBUF which is capped by the kernel. Decouple [rw]mem_max and [rw]mem_default and increase [rw]mem_max to 4 MB. Before: $ sysctl net.core.rmem_default net.core.rmem_max net.core.wmem_default net.core.wmem_max net.core.rmem_default = 212992 net.core.rmem_max = 212992 net.core.wmem_default = 212992 net.core.wmem_max = 212992 After: $ sysctl net.core.rmem_default net.core.rmem_max net.core.wmem_default net.core.wmem_max net.core.rmem_default = 212992 net.core.rmem_max = 4194304 net.core.wmem_default = 212992 net.core.wmem_max = 4194304 Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Link: https://patch.msgid.link/20250819174030.1986278-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-20 19:35:00 -07:00
Kuniyuki Iwashima	bb178c6bc0	net-memcg: Pass struct sock to mem_cgroup_sk_(un)?charge(). We will store a flag in the lowest bit of sk->sk_memcg. Then, we cannot pass the raw pointer to mem_cgroup_charge_skmem() and mem_cgroup_uncharge_skmem(). Let's pass struct sock to the functions. While at it, they are renamed to match other functions starting with mem_cgroup_sk_. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Link: https://patch.msgid.link/20250815201712.1745332-9-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-19 19:20:59 -07:00
Kuniyuki Iwashima	43049b0db0	net-memcg: Introduce mem_cgroup_sk_enabled(). The socket memcg feature is enabled by a static key and only works for non-root cgroup. We check both conditions in many places. Let's factorise it as a helper function. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Link: https://patch.msgid.link/20250815201712.1745332-8-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-19 19:20:59 -07:00
Kuniyuki Iwashima	f7161b234f	net-memcg: Introduce mem_cgroup_from_sk(). We will store a flag in the lowest bit of sk->sk_memcg. Then, directly dereferencing sk->sk_memcg will be illegal, and we do not want to allow touching the raw sk->sk_memcg in many places. Let's introduce mem_cgroup_from_sk(). Other places accessing the raw sk->sk_memcg will be converted later. Note that we cannot define the helper as an inline function in memcontrol.h as we cannot access any fields of struct sock there due to circular dependency, so it is placed in sock.h. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Link: https://patch.msgid.link/20250815201712.1745332-7-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-19 19:20:59 -07:00

13170 Commits