From 4510d140524ca7d6e772db962e013f26f09a63b1 Mon Sep 17 00:00:00 2001 From: Dudu Lu Date: Mon, 13 Apr 2026 16:49:27 +0800 Subject: [PATCH 001/148] net/sched: act_mirred: fix wrong device for mac_header_xmit check in tcf_blockcast_redir MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit In tcf_blockcast_redir(), when iterating block ports to redirect packets to multiple devices, the mac_header_xmit flag is queried from the wrong device. The loop sends to dev_prev but queries dev_is_mac_header_xmit(dev) — which is the NEXT device in the iteration, not the one being sent to. This causes tcf_mirred_to_dev() to make incorrect decisions about whether to push or pull the MAC header. When the block contains mixed device types (e.g., an ethernet veth and a tunnel device), intermediate devices get the wrong mac_header_xmit flag, leading to skb header corruption. In the worst case, skb_push_rcsum with an incorrect mac_len can exhaust headroom and panic. The last device in the loop is handled correctly (line 365-366 uses dev_is_mac_header_xmit(dev_prev)), confirming this is a copy-paste oversight for the intermediate devices. Fix by using dev_prev instead of dev for the mac_header_xmit query, consistent with the device actually being sent to. Fixes: 42f39036cda8 ("net/sched: act_mirred: Allow mirred to block") Signed-off-by: Dudu Lu Reviewed-by: Simon Horman Link: https://patch.msgid.link/20260413084927.71353-1-phx0fer@gmail.com Signed-off-by: Paolo Abeni --- net/sched/act_mirred.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/sched/act_mirred.c b/net/sched/act_mirred.c index 05e0b14b5773..2c5a7a321a94 100644 --- a/net/sched/act_mirred.c +++ b/net/sched/act_mirred.c @@ -354,7 +354,7 @@ static int tcf_blockcast_redir(struct sk_buff *skb, struct tcf_mirred *m, goto assign_prev; tcf_mirred_to_dev(skb, m, dev_prev, - dev_is_mac_header_xmit(dev), + dev_is_mac_header_xmit(dev_prev), mirred_eaction, retval); assign_prev: dev_prev = dev; From fa92a77b0ed4d5f11a71665a232ac5a54a4b055d Mon Sep 17 00:00:00 2001 From: Dudu Lu Date: Mon, 13 Apr 2026 16:53:49 +0800 Subject: [PATCH 002/148] macvlan: fix macvlan_get_size() not reserving space for IFLA_MACVLAN_BC_CUTOFF macvlan_get_size() does not account for IFLA_MACVLAN_BC_CUTOFF, but macvlan_fill_info() conditionally includes it when port->bc_cutoff != 1. This causes nla_put_s32() to fail with -EMSGSIZE when the netlink skb runs out of space, triggering a WARN_ON in rtnetlink and preventing the interface from being dumped. The bug can be reproduced with: ip link add macvlan0 link eth0 type macvlan mode bridge ip link set macvlan0 type macvlan bc_cutoff 0 ip -d link show macvlan0 # fails with -EMSGSIZE The bc_cutoff feature was added in commit 954d1fa1ac93 ("macvlan: Add netlink attribute for broadcast cutoff"), which added the nla_put_s32() call in macvlan_fill_info() but missed adding the corresponding nla_total_size(4) in macvlan_get_size(). A follow-up commit 55cef78c244d ("macvlan: add forgotten nla_policy for IFLA_MACVLAN_BC_CUTOFF") fixed the missing nla_policy entry but still did not fix the size calculation. 
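As a rough sketch of the invariant being restored (hypothetical IFLA_FOO_* attributes and foo_* helpers, not the macvlan code): rtnetlink sizes the netlink skb from get_size() before fill_info() runs, so every attribute that fill_info() may emit needs a matching nla_total_size() reservation, including conditionally emitted ones:

  static size_t foo_get_size(const struct net_device *dev)
  {
          return nla_total_size(4)        /* IFLA_FOO_MODE */
                 + nla_total_size(4);     /* IFLA_FOO_CUTOFF, even though optional */
  }

  static int foo_fill_info(struct sk_buff *skb, const struct net_device *dev)
  {
          if (nla_put_s32(skb, IFLA_FOO_MODE, foo_mode(dev)))
                  return -EMSGSIZE;
          /* a conditional put still needs its reservation in get_size() */
          if (foo_cutoff(dev) != 1 &&
              nla_put_s32(skb, IFLA_FOO_CUTOFF, foo_cutoff(dev)))
                  return -EMSGSIZE;
          return 0;
  }

Reserving for the worst case is cheap, while a missing reservation only surfaces once the message happens to land on a size boundary, exactly as seen here.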
Fixes: 954d1fa1ac93 ("macvlan: Add netlink attribute for broadcast cutoff") Signed-off-by: Dudu Lu Reviewed-by: Vadim Fedorenko Reviewed-by: Eric Dumazet Link: https://patch.msgid.link/20260413085349.73977-1-phx0fer@gmail.com Signed-off-by: Paolo Abeni --- drivers/net/macvlan.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c index 9f90c598649d..c40fa331836b 100644 --- a/drivers/net/macvlan.c +++ b/drivers/net/macvlan.c @@ -1689,6 +1689,7 @@ static size_t macvlan_get_size(const struct net_device *dev) + macvlan_get_size_mac(vlan) /* IFLA_MACVLAN_MACADDR */ + nla_total_size(4) /* IFLA_MACVLAN_BC_QUEUE_LEN */ + nla_total_size(4) /* IFLA_MACVLAN_BC_QUEUE_LEN_USED */ + + nla_total_size(4) /* IFLA_MACVLAN_BC_CUTOFF */ ); } From df4601653201de21b487c3e7fffd464790cab808 Mon Sep 17 00:00:00 2001 From: Zhengchuan Liang Date: Mon, 13 Apr 2026 17:08:46 +0800 Subject: [PATCH 003/148] net: bridge: use a stable FDB dst snapshot in RCU readers Local FDB entries can be rewritten in place by `fdb_delete_local()`, which updates `f->dst` to another port or to `NULL` while keeping the entry alive. Several bridge RCU readers inspect `f->dst`, including `br_fdb_fillbuf()` through the `brforward_read()` sysfs path. These readers currently load `f->dst` multiple times and can therefore observe inconsistent values across the check and later dereference. In `br_fdb_fillbuf()`, this means a concurrent local-FDB update can change `f->dst` after the NULL check and before the `port_no` dereference, leading to a NULL-ptr-deref. Fix this by taking a single `READ_ONCE()` snapshot of `f->dst` in each affected RCU reader and using that snapshot for the rest of the access sequence. Also publish the in-place `f->dst` updates in `fdb_delete_local()` with `WRITE_ONCE()` so the readers and writer use matching access patterns. 
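Reduced to a sketch (simplified from the bridge code below), the enforced pattern is: the RCU reader loads the pointer once and uses only that snapshot, while the writer publishes in-place updates with WRITE_ONCE():

  /* RCU reader */
  const struct net_bridge_port *dst = READ_ONCE(f->dst);

  if (dst)
          fe->port_no = dst->port_no;     /* dereference the snapshot only */

  /* writer, rewriting a live entry in place */
  WRITE_ONCE(f->dst, op);

Reloading f->dst between the NULL check and the dereference is precisely the window in which fdb_delete_local() can store NULL.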
Fixes: 960b589f86c7 ("bridge: Properly check if local fdb entry can be deleted in br_fdb_change_mac_address") Cc: stable@kernel.org Reported-by: Yifan Wu Reported-by: Juefei Pu Co-developed-by: Yuan Tan Signed-off-by: Yuan Tan Suggested-by: Xin Liu Tested-by: Ren Wei Signed-off-by: Zhengchuan Liang Signed-off-by: Ren Wei Reviewed-by: Ido Schimmel Acked-by: Nikolay Aleksandrov Link: https://patch.msgid.link/6570fabb85ecadb8baaf019efe856f407711c7b9.1776043229.git.zcliangcn@gmail.com Signed-off-by: Paolo Abeni --- net/bridge/br_arp_nd_proxy.c | 8 +++++--- net/bridge/br_fdb.c | 28 ++++++++++++++++++---------- 2 files changed, 23 insertions(+), 13 deletions(-) diff --git a/net/bridge/br_arp_nd_proxy.c b/net/bridge/br_arp_nd_proxy.c index 0c8a06cdd46f..deb1ab1f24b0 100644 --- a/net/bridge/br_arp_nd_proxy.c +++ b/net/bridge/br_arp_nd_proxy.c @@ -201,11 +201,12 @@ void br_do_proxy_suppress_arp(struct sk_buff *skb, struct net_bridge *br, f = br_fdb_find_rcu(br, n->ha, vid); if (f) { + const struct net_bridge_port *dst = READ_ONCE(f->dst); bool replied = false; if ((p && (p->flags & BR_PROXYARP)) || - (f->dst && (f->dst->flags & BR_PROXYARP_WIFI)) || - br_is_neigh_suppress_enabled(f->dst, vid)) { + (dst && (dst->flags & BR_PROXYARP_WIFI)) || + br_is_neigh_suppress_enabled(dst, vid)) { if (!vid) br_arp_send(br, p, skb->dev, sip, tip, sha, n->ha, sha, 0, 0); @@ -469,9 +470,10 @@ void br_do_suppress_nd(struct sk_buff *skb, struct net_bridge *br, f = br_fdb_find_rcu(br, n->ha, vid); if (f) { + const struct net_bridge_port *dst = READ_ONCE(f->dst); bool replied = false; - if (br_is_neigh_suppress_enabled(f->dst, vid)) { + if (br_is_neigh_suppress_enabled(dst, vid)) { if (vid != 0) br_nd_send(br, p, skb, n, skb->vlan_proto, diff --git a/net/bridge/br_fdb.c b/net/bridge/br_fdb.c index e2c17f620f00..6eb3ab69a514 100644 --- a/net/bridge/br_fdb.c +++ b/net/bridge/br_fdb.c @@ -236,6 +236,7 @@ struct net_device *br_fdb_find_port(const struct net_device *br_dev, const unsigned char *addr, __u16 vid) { + const struct net_bridge_port *dst; struct net_bridge_fdb_entry *f; struct net_device *dev = NULL; struct net_bridge *br; @@ -248,8 +249,11 @@ struct net_device *br_fdb_find_port(const struct net_device *br_dev, br = netdev_priv(br_dev); rcu_read_lock(); f = br_fdb_find_rcu(br, addr, vid); - if (f && f->dst) - dev = f->dst->dev; + if (f) { + dst = READ_ONCE(f->dst); + if (dst) + dev = dst->dev; + } rcu_read_unlock(); return dev; @@ -346,7 +350,7 @@ static void fdb_delete_local(struct net_bridge *br, vg = nbp_vlan_group(op); if (op != p && ether_addr_equal(op->dev->dev_addr, addr) && (!vid || br_vlan_find(vg, vid))) { - f->dst = op; + WRITE_ONCE(f->dst, op); clear_bit(BR_FDB_ADDED_BY_USER, &f->flags); return; } @@ -357,7 +361,7 @@ static void fdb_delete_local(struct net_bridge *br, /* Maybe bridge device has same hw addr? 
*/ if (p && ether_addr_equal(br->dev->dev_addr, addr) && (!vid || (v && br_vlan_should_use(v)))) { - f->dst = NULL; + WRITE_ONCE(f->dst, NULL); clear_bit(BR_FDB_ADDED_BY_USER, &f->flags); return; } @@ -928,6 +932,7 @@ int br_fdb_test_addr(struct net_device *dev, unsigned char *addr) int br_fdb_fillbuf(struct net_bridge *br, void *buf, unsigned long maxnum, unsigned long skip) { + const struct net_bridge_port *dst; struct net_bridge_fdb_entry *f; struct __fdb_entry *fe = buf; unsigned long delta; @@ -944,7 +949,8 @@ int br_fdb_fillbuf(struct net_bridge *br, void *buf, continue; /* ignore pseudo entry for local MAC address */ - if (!f->dst) + dst = READ_ONCE(f->dst); + if (!dst) continue; if (skip) { @@ -956,8 +962,8 @@ int br_fdb_fillbuf(struct net_bridge *br, void *buf, memcpy(fe->mac_addr, f->key.addr.addr, ETH_ALEN); /* due to ABI compat need to split into hi/lo */ - fe->port_no = f->dst->port_no; - fe->port_hi = f->dst->port_no >> 8; + fe->port_no = dst->port_no; + fe->port_hi = dst->port_no >> 8; fe->is_local = test_bit(BR_FDB_LOCAL, &f->flags); if (!test_bit(BR_FDB_STATIC, &f->flags)) { @@ -1083,9 +1089,11 @@ int br_fdb_dump(struct sk_buff *skb, rcu_read_lock(); hlist_for_each_entry_rcu(f, &br->fdb_list, fdb_node) { + const struct net_bridge_port *dst = READ_ONCE(f->dst); + if (*idx < ctx->fdb_idx) goto skip; - if (filter_dev && (!f->dst || f->dst->dev != filter_dev)) { + if (filter_dev && (!dst || dst->dev != filter_dev)) { if (filter_dev != dev) goto skip; /* !f->dst is a special case for bridge @@ -1093,10 +1101,10 @@ int br_fdb_dump(struct sk_buff *skb, * Therefore need a little more filtering * we only want to dump the !f->dst case */ - if (f->dst) + if (dst) goto skip; } - if (!filter_dev && f->dst) + if (!filter_dev && dst) goto skip; err = fdb_fill_info(skb, br, f, From f9e40664706927d7ae22a448a3383e23c38a4c0b Mon Sep 17 00:00:00 2001 From: Dudu Lu Date: Mon, 13 Apr 2026 19:00:41 +0800 Subject: [PATCH 004/148] net/sched: sch_cake: fix NAT destination port not being updated in cake_update_flowkeys MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit cake_update_flowkeys() is supposed to update the flow dissector keys with the NAT-translated addresses and ports from conntrack, so that CAKE's per-flow fairness correctly identifies post-NAT flows as belonging to the same connection. For the source port, this works correctly: keys->ports.src = port; But for the destination port, the assignment is reversed: port = keys->ports.dst; This means the NAT destination port is never updated in the flow keys. As a result, when multiple connections are NATed to the same destination, CAKE treats them as separate flows because the original (pre-NAT) destination ports differ. This breaks CAKE's NAT-aware flow isolation when using the "nat" mode. The bug was introduced in commit b0c19ed6088a ("sch_cake: Take advantage of skb->hash where appropriate") which refactored the original direct assignment into a compare-and-conditionally-update pattern, but wrote the destination port update backwards. Fix by reversing the assignment direction to match the source port pattern. 
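Spelled out, the intended compare-and-update for the destination port is the one the fix below restores:

  port = rev ? tuple.src.u.all : tuple.dst.u.all;
  if (port != keys->ports.dst) {
          keys->ports.dst = port;       /* propagate the NAT port into the keys */
          upd = true;
  }

With the assignment reversed, upd was still set, but keys->ports.dst kept the pre-NAT value, so the subsequent hash was computed over the wrong destination port.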
Fixes: b0c19ed6088a ("sch_cake: Take advantage of skb->hash where appropriate") Signed-off-by: Dudu Lu Acked-by: Toke Høiland-Jørgensen Link: https://patch.msgid.link/20260413110041.44704-1-phx0fer@gmail.com Signed-off-by: Paolo Abeni --- net/sched/sch_cake.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/sched/sch_cake.c b/net/sched/sch_cake.c index ffea9fbd522d..02e1fa4577ae 100644 --- a/net/sched/sch_cake.c +++ b/net/sched/sch_cake.c @@ -619,7 +619,7 @@ static bool cake_update_flowkeys(struct flow_keys *keys, } port = rev ? tuple.src.u.all : tuple.dst.u.all; if (port != keys->ports.dst) { - port = keys->ports.dst; + keys->ports.dst = port; upd = true; } } From 29c95185ba32b621fbc3800fb86e7dc3edf5c2be Mon Sep 17 00:00:00 2001 From: Jiayuan Chen Date: Mon, 13 Apr 2026 19:45:19 +0800 Subject: [PATCH 005/148] nexthop: fix IPv6 route referencing IPv4 nexthop syzbot reported a panic [1] [2]. When an IPv6 nexthop is replaced with an IPv4 nexthop, the has_v4 flag of all groups containing this nexthop is not updated. This is because nh_group_v4_update is only called when replacing AF_INET to AF_INET6, but the reverse direction (AF_INET6 to AF_INET) is missed. This allows a stale has_v4=false to bypass fib6_check_nexthop, causing IPv6 routes to be attached to groups that effectively contain only AF_INET members. Subsequent route lookups then call nexthop_fib6_nh() which returns NULL for the AF_INET member, leading to a NULL pointer dereference. Fix by calling nh_group_v4_update whenever the family changes, not just AF_INET to AF_INET6. Reproducer: # AF_INET6 blackhole ip -6 nexthop add id 1 blackhole # group with has_v4=false ip nexthop add id 100 group 1 # replace with AF_INET (no -6), has_v4 stays false ip nexthop replace id 1 blackhole # pass stale has_v4 check ip -6 route add 2001:db8::/64 nhid 100 # panic ping -6 2001:db8::1 [1] https://syzkaller.appspot.com/bug?id=e17283eb2f8dcf3dd9b47fe6f67a95f71faadad0 [2] https://syzkaller.appspot.com/bug?id=8699b6ae54c9f35837d925686208402949e12ef3 Fixes: 7bf4796dd099 ("nexthops: add support for replace") Signed-off-by: Jiayuan Chen Reviewed-by: David Ahern Link: https://patch.msgid.link/20260413114522.147784-1-jiayuan.chen@linux.dev Signed-off-by: Paolo Abeni --- net/ipv4/nexthop.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/net/ipv4/nexthop.c b/net/ipv4/nexthop.c index 904a060a7330..f92fcc39fc4c 100644 --- a/net/ipv4/nexthop.c +++ b/net/ipv4/nexthop.c @@ -2469,10 +2469,10 @@ static int replace_nexthop_single(struct net *net, struct nexthop *old, goto err_notify; } - /* When replacing an IPv4 nexthop with an IPv6 nexthop, potentially + /* When replacing a nexthop with one of a different family, potentially * update IPv4 indication in all the groups using the nexthop. */ - if (oldi->family == AF_INET && newi->family == AF_INET6) { + if (oldi->family != newi->family) { list_for_each_entry(nhge, &old->grp_list, nh_list) { struct nexthop *nhp = nhge->nh_parent; struct nh_group *nhg; From 104f082f5ed6d19c5d85ca905ccd4e4d01aef66e Mon Sep 17 00:00:00 2001 From: Jiayuan Chen Date: Mon, 13 Apr 2026 19:45:20 +0800 Subject: [PATCH 006/148] selftests: fib_nexthops: test stale has_v4 on nexthop replace MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Add test cases that exercise the scenario where an IPv6 nexthop is replaced with an IPv4 nexthop while being part of a group. The group's has_v4 flag must be updated so that subsequent IPv6 route additions are properly rejected. 
Two cases are covered: 1. Gateway nexthop replaced across families with an existing IPv6 route on the group (rejected by fib6_check_nh_list). 2. Blackhole nexthop replaced across families with no existing IPv6 route on the group (fib6_check_nh_list returns early) — this is the path that triggers a NULL ptr deref without the kernel fix. Signed-off-by: Jiayuan Chen Reviewed-by: David Ahern Link: https://patch.msgid.link/20260413114522.147784-2-jiayuan.chen@linux.dev Signed-off-by: Paolo Abeni --- tools/testing/selftests/net/fib_nexthops.sh | 22 +++++++++++++++++++++ 1 file changed, 22 insertions(+) diff --git a/tools/testing/selftests/net/fib_nexthops.sh b/tools/testing/selftests/net/fib_nexthops.sh index 6eb7f95e70e1..ac868a731694 100755 --- a/tools/testing/selftests/net/fib_nexthops.sh +++ b/tools/testing/selftests/net/fib_nexthops.sh @@ -1209,6 +1209,28 @@ ipv6_fcnal_runtime() run_cmd "$IP ro replace 2001:db8:101::1/128 nhid 124" log_test $? 0 "IPv6 route using a group after replacing v4 gateways" + # Replacing an IPv6 nexthop with an IPv4 nexthop should update has_v4 + # for all groups using it, preventing IPv6 routes from referencing the + # group after the replace. + run_cmd "$IP nexthop add id 89 via 2001:db8:91::2 dev veth1" + run_cmd "$IP nexthop add id 125 group 89" + run_cmd "$IP nexthop replace id 89 via 172.16.1.1 dev veth1" + run_cmd "$IP ro replace 2001:db8:101::1/128 nhid 125" + log_test $? 2 "IPv6 route can not use group after v6 nexthop replaced by v4" + + # Same scenario but with a blackhole nexthop: the group has no IPv6 + # routes yet when the replace happens, so fib6_check_nh_list returns + # early without checking. has_v4 must still be updated to block + # subsequent IPv6 route additions. + run_cmd "$IP nexthop flush >/dev/null 2>&1" + run_cmd "$IP -6 nexthop add id 90 blackhole" + run_cmd "$IP nexthop add id 125 group 90" + run_cmd "$IP nexthop replace id 90 blackhole" + run_cmd "$IP -6 ro add 2001:db8:101::1/128 nhid 125" + log_test $? 2 "IPv6 route reject v6 blackhole replaced by v4 blackhole" + run_cmd "ip netns exec $me ping -6 2001:db8:101::1 -c1 -w$PING_TIMEOUT" + log_test $? 2 "Ping unreachable after rejected route" + $IP nexthop flush >/dev/null 2>&1 # From 52bcb57a4e8a0865a76c587c2451906342ae1b2d Mon Sep 17 00:00:00 2001 From: Dudu Lu Date: Mon, 13 Apr 2026 21:14:09 +0800 Subject: [PATCH 007/148] vsock/virtio: fix accept queue count leak on transport mismatch virtio_transport_recv_listen() calls sk_acceptq_added() before vsock_assign_transport(). If vsock_assign_transport() fails or selects a different transport, the error path returns without calling sk_acceptq_removed(), permanently incrementing sk_ack_backlog. After approximately backlog+1 such failures, sk_acceptq_is_full() returns true, causing the listener to reject all new connections. Fix by moving sk_acceptq_added() to after the transport validation, matching the pattern used by vmci_transport and hyperv_transport. Fixes: c0cfa2d8a788 ("vsock: add multi-transports support") Signed-off-by: Dudu Lu Reviewed-by: Bobby Eshleman Reviewed-by: Luigi Leonardi Reviewed-by: Stefano Garzarella Acked-by: Michael S. 
Tsirkin Link: https://patch.msgid.link/20260413131409.19022-1-phx0fer@gmail.com Signed-off-by: Paolo Abeni --- net/vmw_vsock/virtio_transport_common.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c index a152a9e208d0..e96e9893b21b 100644 --- a/net/vmw_vsock/virtio_transport_common.c +++ b/net/vmw_vsock/virtio_transport_common.c @@ -1558,8 +1558,6 @@ virtio_transport_recv_listen(struct sock *sk, struct sk_buff *skb, return -ENOMEM; } - sk_acceptq_added(sk); - lock_sock_nested(child, SINGLE_DEPTH_NESTING); child->sk_state = TCP_ESTABLISHED; @@ -1581,6 +1579,7 @@ virtio_transport_recv_listen(struct sock *sk, struct sk_buff *skb, return ret; } + sk_acceptq_added(sk); if (virtio_transport_space_update(child, skb)) child->sk_write_space(child); From f3206328bb52c2787197d80d7cbd687946047d5f Mon Sep 17 00:00:00 2001 From: Lorenzo Bianconi Date: Tue, 14 Apr 2026 16:08:52 +0200 Subject: [PATCH 008/148] net: airoha: Wait for NPU PPE configuration to complete in airoha_ppe_offload_setup() In order to properly enable flowtable hw offloading, poll REG_PPE_FLOW_CFG register in airoha_ppe_offload_setup routine and wait for NPU PPE configuration triggered by ppe_init callback to complete before running airoha_ppe_hw_init(). Fixes: 00a7678310fe3 ("net: airoha: Introduce flowtable offload support") Signed-off-by: Lorenzo Bianconi Link: https://patch.msgid.link/20260414-airoha-wait-for-npu-config-offload-setup-v2-1-5a9bf6d43aee@kernel.org Signed-off-by: Jakub Kicinski --- drivers/net/ethernet/airoha/airoha_ppe.c | 28 ++++++++++++++++++++++++ 1 file changed, 28 insertions(+) diff --git a/drivers/net/ethernet/airoha/airoha_ppe.c b/drivers/net/ethernet/airoha/airoha_ppe.c index 03115c1c1063..859818676b69 100644 --- a/drivers/net/ethernet/airoha/airoha_ppe.c +++ b/drivers/net/ethernet/airoha/airoha_ppe.c @@ -1356,6 +1356,29 @@ static struct airoha_npu *airoha_ppe_npu_get(struct airoha_eth *eth) return npu; } +static int airoha_ppe_wait_for_npu_init(struct airoha_eth *eth) +{ + int err; + u32 val; + + /* PPE_FLOW_CFG default register value is 0. Since we reset FE + * during the device probe we can just check the configured value + * is not 0 here. + */ + err = read_poll_timeout(airoha_fe_rr, val, val, USEC_PER_MSEC, + 100 * USEC_PER_MSEC, false, eth, + REG_PPE_PPE_FLOW_CFG(0)); + if (err) + return err; + + if (airoha_ppe_is_enabled(eth, 1)) + err = read_poll_timeout(airoha_fe_rr, val, val, USEC_PER_MSEC, + 100 * USEC_PER_MSEC, false, eth, + REG_PPE_PPE_FLOW_CFG(1)); + + return err; +} + static int airoha_ppe_offload_setup(struct airoha_eth *eth) { struct airoha_npu *npu = airoha_ppe_npu_get(eth); @@ -1369,6 +1392,11 @@ static int airoha_ppe_offload_setup(struct airoha_eth *eth) if (err) goto error_npu_put; + /* Wait for NPU PPE configuration to complete */ + err = airoha_ppe_wait_for_npu_init(eth); + if (err) + goto error_npu_put; + ppe_num_stats_entries = airoha_ppe_get_total_num_stats_entries(ppe); if (ppe_num_stats_entries > 0) { err = npu->ops.ppe_init_stats(npu, ppe->foe_stats_dma, From 1e9e7fd839b7f22b46762059c6f3e576b3e6e179 Mon Sep 17 00:00:00 2001 From: Geert Uytterhoeven Date: Tue, 14 Apr 2026 12:30:47 +0200 Subject: [PATCH 009/148] net: mdio: MDIO_PIC64HPSC should depend on ARCH_MICROCHIP The PIC64-HPSC/HX MDIO interface is only present on Microchip PIC64-HPSC/HX SoCs. 
Hence add a dependency on ARCH_MICROCHIP, to prevent asking the user about this driver when configuring a kernel without Microchip SoC support. Fixes: f76aef980206e7c6 ("net: mdio: add a driver for PIC64-HPSC/HX MDIO controller") Signed-off-by: Geert Uytterhoeven Reviewed-by: Charles Perry Reviewed-by: Simon Horman Link: https://patch.msgid.link/980c57efa5843733ef95459c3283aebade56f142.1776162544.git.geert+renesas@glider.be Signed-off-by: Jakub Kicinski --- drivers/net/mdio/Kconfig | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/net/mdio/Kconfig b/drivers/net/mdio/Kconfig index 516b0d05e16e..c71132f33f84 100644 --- a/drivers/net/mdio/Kconfig +++ b/drivers/net/mdio/Kconfig @@ -147,6 +147,7 @@ config MDIO_OCTEON config MDIO_PIC64HPSC tristate "PIC64-HPSC/HX MDIO interface support" + depends on ARCH_MICROCHIP || COMPILE_TEST depends on HAS_IOMEM && OF_MDIO help This driver supports the MDIO interface found on the PIC64-HPSC/HX From 105425b1969c5affe532713cfac1c0b320d7ac2b Mon Sep 17 00:00:00 2001 From: Vinicius Costa Gomes Date: Fri, 10 Apr 2026 18:57:57 -0700 Subject: [PATCH 010/148] net/sched: taprio: fix use-after-free in advance_sched() on schedule switch In advance_sched(), when should_change_schedules() returns true, switch_schedules() is called to promote the admin schedule to oper. switch_schedules() queues the old oper schedule for RCU freeing via call_rcu(), but 'next' still points into an entry of the old oper schedule. The subsequent 'next->end_time = end_time' and rcu_assign_pointer(q->current_entry, next) are use-after-free. Fix this by selecting 'next' from the new oper schedule immediately after switch_schedules(), and using its pre-calculated end_time. setup_first_end_time() sets the first entry's end_time to base_time + interval when the schedule is installed, so the value is already correct. The deleted 'end_time = sched_base_time(admin)' assignment was also harmful independently: it would overwrite the new first entry's pre-calculated end_time with just base_time. Fixes: a3d43c0d56f1 ("taprio: Add support adding an admin schedule") Reported-by: Junxi Qian Signed-off-by: Vinicius Costa Gomes Signed-off-by: Jakub Kicinski --- net/sched/sch_taprio.c | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/net/sched/sch_taprio.c b/net/sched/sch_taprio.c index 8e3752811950..a47a09d76400 100644 --- a/net/sched/sch_taprio.c +++ b/net/sched/sch_taprio.c @@ -972,11 +972,12 @@ static enum hrtimer_restart advance_sched(struct hrtimer *timer) } if (should_change_schedules(admin, oper, end_time)) { - /* Set things so the next time this runs, the new - * schedule runs. - */ - end_time = sched_base_time(admin); switch_schedules(q, &admin, &oper); + /* After changing schedules, the next entry is the first one + * in the new schedule, with a pre-calculated end_time. + */ + next = list_first_entry(&oper->entries, struct sched_entry, list); + end_time = next->end_time; } next->end_time = end_time; From 0f99e0c3e19badaf3fdced0d3feba623e59eed41 Mon Sep 17 00:00:00 2001 From: Stanislav Fomichev Date: Tue, 14 Apr 2026 16:10:35 -0700 Subject: [PATCH 011/148] net: dsa: remove redundant netdev_lock_ops() from conduit ethtool ops DSA replaces the conduit (master) device's ethtool_ops with its own wrappers that aggregate stats from both the conduit and DSA switch ports. Taking the lock again inside the DSA wrappers causes a deadlock. 
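The recursion, sketched as an illustrative call chain rather than a verbatim trace: since the Fixes commit, the ethtool core already holds the netdev instance lock around the ops it invokes on ops-locked devices, and on a DSA conduit those ops are the DSA wrappers:

  ethtool request on the conduit netdev
    netdev_lock_ops(dev)              /* taken by the ethtool core */
      dsa_conduit_get_regs_len(dev)   /* DSA wrapper installed as ethtool_ops */
        netdev_lock_ops(dev)          /* second acquisition, deadlock */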
Stumbled upon this when booting qemu with fbnic and CONFIG_NET_DSA_LOOP=y (which looks like some kind of testing device that auto-populates the ports of eth0). `ethtool -i` is enough to deadlock. This means we have basically zero coverage for DSA stuff with real ops locked devs. Remove the redundant netdev_lock_ops()/netdev_unlock_ops() calls from the DSA conduit ethtool wrappers. Fixes: 2bcf4772e45a ("net: ethtool: try to protect all callback with netdev instance lock") Signed-off-by: Stanislav Fomichev Reviewed-by: Maxime Chevallier Link: https://patch.msgid.link/20260414231035.1917035-1-sdf@fomichev.me Signed-off-by: Jakub Kicinski --- net/dsa/conduit.c | 16 +--------------- 1 file changed, 1 insertion(+), 15 deletions(-) diff --git a/net/dsa/conduit.c b/net/dsa/conduit.c index a1b044467bd6..8398d72d7e4d 100644 --- a/net/dsa/conduit.c +++ b/net/dsa/conduit.c @@ -27,9 +27,7 @@ static int dsa_conduit_get_regs_len(struct net_device *dev) int len; if (ops && ops->get_regs_len) { - netdev_lock_ops(dev); len = ops->get_regs_len(dev); - netdev_unlock_ops(dev); if (len < 0) return len; ret += len; @@ -60,15 +58,11 @@ static void dsa_conduit_get_regs(struct net_device *dev, int len; if (ops && ops->get_regs_len && ops->get_regs) { - netdev_lock_ops(dev); len = ops->get_regs_len(dev); - if (len < 0) { - netdev_unlock_ops(dev); + if (len < 0) return; - } regs->len = len; ops->get_regs(dev, regs, data); - netdev_unlock_ops(dev); data += regs->len; } @@ -115,10 +109,8 @@ static void dsa_conduit_get_ethtool_stats(struct net_device *dev, int count, mcount = 0; if (ops && ops->get_sset_count && ops->get_ethtool_stats) { - netdev_lock_ops(dev); mcount = ops->get_sset_count(dev, ETH_SS_STATS); ops->get_ethtool_stats(dev, stats, data); - netdev_unlock_ops(dev); } list_for_each_entry(dp, &dst->ports, list) { @@ -149,10 +141,8 @@ static void dsa_conduit_get_ethtool_phy_stats(struct net_device *dev, if (count >= 0) phy_ethtool_get_stats(dev->phydev, stats, data); } else if (ops && ops->get_sset_count && ops->get_ethtool_phy_stats) { - netdev_lock_ops(dev); count = ops->get_sset_count(dev, ETH_SS_PHY_STATS); ops->get_ethtool_phy_stats(dev, stats, data); - netdev_unlock_ops(dev); } if (count < 0) @@ -176,13 +166,11 @@ static int dsa_conduit_get_sset_count(struct net_device *dev, int sset) struct dsa_switch_tree *dst = cpu_dp->dst; int count = 0; - netdev_lock_ops(dev); if (sset == ETH_SS_PHY_STATS && dev->phydev && (!ops || !ops->get_ethtool_phy_stats)) count = phy_ethtool_get_sset_count(dev->phydev); else if (ops && ops->get_sset_count) count = ops->get_sset_count(dev, sset); - netdev_unlock_ops(dev); if (count < 0) count = 0; @@ -239,7 +227,6 @@ static void dsa_conduit_get_strings(struct net_device *dev, u32 stringset, struct dsa_switch_tree *dst = cpu_dp->dst; int count, mcount = 0; - netdev_lock_ops(dev); if (stringset == ETH_SS_PHY_STATS && dev->phydev && !ops->get_ethtool_phy_stats) { mcount = phy_ethtool_get_sset_count(dev->phydev); @@ -253,7 +240,6 @@ static void dsa_conduit_get_strings(struct net_device *dev, u32 stringset, mcount = 0; ops->get_strings(dev, stringset, data); } - netdev_unlock_ops(dev); list_for_each_entry(dp, &dst->ports, list) { if (!dsa_port_is_dsa(dp) && !dsa_port_is_cpu(dp)) From 5099807f335ce4f783f0578bef7278fffad30b07 Mon Sep 17 00:00:00 2001 From: Kory Maincent Date: Wed, 15 Apr 2026 15:02:59 +0200 Subject: [PATCH 012/148] net: pse-pd: fix out-of-bounds bitmap access in pse_isr() on 32-bit In pse_isr(), notifs_mask was declared as a single unsigned long on the stack (32 bits on 
32-bit architectures). For PSE controllers with more than 32 ports, this causes two problems: - map_event callbacks could write bit positions >= 32 via *notifs_mask |= BIT(i), which is undefined behaviour on a 32-bit unsigned long and corrupts adjacent stack memory. - for_each_set_bit(i, &notifs_mask, pcdev->nr_lines) treats &notifs_mask as a multi-word bitmap and reads beyond the single unsigned long when nr_lines > BITS_PER_LONG. Fix this by moving notifs_mask off the stack and into struct pse_irq as a dynamically allocated bitmap. It is sized with BITS_TO_LONGS(pcdev->nr_lines) words in devm_pse_irq_helper(), so it is always wide enough regardless of the host word size. [Jakub]: No upstream driver currently supports >=32 ports. Signed-off-by: Kory Maincent Link: https://patch.msgid.link/20260415130300.806152-1-kory.maincent@bootlin.com Signed-off-by: Jakub Kicinski --- drivers/net/pse-pd/pse_core.c | 13 +++++++++---- 1 file changed, 9 insertions(+), 4 deletions(-) diff --git a/drivers/net/pse-pd/pse_core.c b/drivers/net/pse-pd/pse_core.c index f6b94ac7a68a..87aa4f4e9724 100644 --- a/drivers/net/pse-pd/pse_core.c +++ b/drivers/net/pse-pd/pse_core.c @@ -1170,6 +1170,7 @@ struct pse_irq { struct pse_controller_dev *pcdev; struct pse_irq_desc desc; unsigned long *notifs; + unsigned long *notifs_mask; }; /** @@ -1247,7 +1248,6 @@ static int pse_set_config_isr(struct pse_controller_dev *pcdev, int id, static irqreturn_t pse_isr(int irq, void *data) { struct pse_controller_dev *pcdev; - unsigned long notifs_mask = 0; struct pse_irq_desc *desc; struct pse_irq *h = data; int ret, i; @@ -1257,14 +1257,15 @@ static irqreturn_t pse_isr(int irq, void *data) /* Clear notifs mask */ memset(h->notifs, 0, pcdev->nr_lines * sizeof(*h->notifs)); + bitmap_zero(h->notifs_mask, pcdev->nr_lines); mutex_lock(&pcdev->lock); - ret = desc->map_event(irq, pcdev, h->notifs, &notifs_mask); - if (ret || !notifs_mask) { + ret = desc->map_event(irq, pcdev, h->notifs, h->notifs_mask); + if (ret || bitmap_empty(h->notifs_mask, pcdev->nr_lines)) { mutex_unlock(&pcdev->lock); return IRQ_NONE; } - for_each_set_bit(i, &notifs_mask, pcdev->nr_lines) { + for_each_set_bit(i, h->notifs_mask, pcdev->nr_lines) { unsigned long notifs, rnotifs; struct pse_ntf ntf = {}; @@ -1340,6 +1341,10 @@ int devm_pse_irq_helper(struct pse_controller_dev *pcdev, int irq, if (!h->notifs) return -ENOMEM; + h->notifs_mask = devm_bitmap_zalloc(dev, pcdev->nr_lines, GFP_KERNEL); + if (!h->notifs_mask) + return -ENOMEM; + ret = devm_request_threaded_irq(dev, irq, NULL, pse_isr, IRQF_ONESHOT | irq_flags, irq_name, h); From 759a32900b6f3db3d0f34a3b61123742723b50b4 Mon Sep 17 00:00:00 2001 From: Wei Fang Date: Wed, 15 Apr 2026 14:08:32 +0800 Subject: [PATCH 013/148] net: enetc: correct the command BD ring consumer index The command BD ring consumer index register has the consumer index as the lower 10 bits, and bit 31 is SBE, which indicates whether a system bus error occurred during execution of the CBD command. So if a system bus error occurs, reading the register will get the SBE bit set. However, the current implementation directly uses the register value as the consumer index without masking it. Therefore, if a system bus error occurs, an incorrect consumer index will be obtained, causing errors in the processing of the command BD ring. Thus, we need to mask out the other bits to obtain the correct consumer index. In addition, this patch adds a check for the SBE bit after the polling loop and returns an error if the bit is set.
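Decoded with the masks introduced below, a read of the consumer index register splits into two fields (a small sketch of the interpretation, mirroring the fix):

  u32 val = netc_read(cbdr->regs.cir);
  u32 ci  = val & NETC_CBDRCIR_INDEX;   /* bits 9:0, consumer index */

  if (val & NETC_CBDRCIR_SBE)           /* bit 31, system bus error */
          return -EIO;                  /* fail the command */

so completion polling compares only ci against the expected index, and the SBE bit is checked separately once the poll succeeds.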
Fixes: 4701073c3deb ("net: enetc: add initial netc-lib driver to support NTMP") Signed-off-by: Wei Fang Link: https://patch.msgid.link/20260415060833.2303846-2-wei.fang@nxp.com Signed-off-by: Jakub Kicinski --- drivers/net/ethernet/freescale/enetc/ntmp.c | 13 ++++++++++--- drivers/net/ethernet/freescale/enetc/ntmp_private.h | 2 ++ 2 files changed, 12 insertions(+), 3 deletions(-) diff --git a/drivers/net/ethernet/freescale/enetc/ntmp.c b/drivers/net/ethernet/freescale/enetc/ntmp.c index 703752995e93..0dad38735234 100644 --- a/drivers/net/ethernet/freescale/enetc/ntmp.c +++ b/drivers/net/ethernet/freescale/enetc/ntmp.c @@ -55,7 +55,7 @@ int ntmp_init_cbdr(struct netc_cbdr *cbdr, struct device *dev, spin_lock_init(&cbdr->ring_lock); cbdr->next_to_use = netc_read(cbdr->regs.pir); - cbdr->next_to_clean = netc_read(cbdr->regs.cir); + cbdr->next_to_clean = netc_read(cbdr->regs.cir) & NETC_CBDRCIR_INDEX; /* Step 1: Configure the base address of the Control BD Ring */ netc_write(cbdr->regs.bar0, lower_32_bits(cbdr->dma_base_align)); @@ -98,7 +98,7 @@ static void ntmp_clean_cbdr(struct netc_cbdr *cbdr) int i; i = cbdr->next_to_clean; - while (netc_read(cbdr->regs.cir) != i) { + while ((netc_read(cbdr->regs.cir) & NETC_CBDRCIR_INDEX) != i) { cbd = ntmp_get_cbd(cbdr, i); memset(cbd, 0, sizeof(*cbd)); i = (i + 1) % cbdr->bd_num; @@ -135,12 +135,19 @@ static int netc_xmit_ntmp_cmd(struct ntmp_user *user, union netc_cbd *cbd) cbdr->next_to_use = i; netc_write(cbdr->regs.pir, i); - err = read_poll_timeout_atomic(netc_read, val, val == i, + err = read_poll_timeout_atomic(netc_read, val, + (val & NETC_CBDRCIR_INDEX) == i, NETC_CBDR_DELAY_US, NETC_CBDR_TIMEOUT, true, cbdr->regs.cir); if (unlikely(err)) goto cbdr_unlock; + if (unlikely(val & NETC_CBDRCIR_SBE)) { + dev_err(user->dev, "Command BD system bus error\n"); + err = -EIO; + goto cbdr_unlock; + } + dma_rmb(); /* Get the writeback command BD, because the caller may need * to check some other fields of the response header. diff --git a/drivers/net/ethernet/freescale/enetc/ntmp_private.h b/drivers/net/ethernet/freescale/enetc/ntmp_private.h index 34394e40fddd..3459cc45b610 100644 --- a/drivers/net/ethernet/freescale/enetc/ntmp_private.h +++ b/drivers/net/ethernet/freescale/enetc/ntmp_private.h @@ -12,6 +12,8 @@ #define NTMP_EID_REQ_LEN 8 #define NETC_CBDR_BD_NUM 256 +#define NETC_CBDRCIR_INDEX GENMASK(9, 0) +#define NETC_CBDRCIR_SBE BIT(31) union netc_cbd { struct { From 3cade698881eb238f88cbbfec82acc2110440a3f Mon Sep 17 00:00:00 2001 From: Wei Fang Date: Wed, 15 Apr 2026 14:08:33 +0800 Subject: [PATCH 014/148] net: enetc: fix NTMP DMA use-after-free issue The AI-generated review reported a potential DMA use-after-free issue [1]. If netc_xmit_ntmp_cmd() times out and returns an error, the pending command is not explicitly aborted, while ntmp_free_data_mem() unconditionally frees the DMA buffer. If the buffer has already been reallocated elsewhere, this may lead to silent memory corruption, because the hardware eventually processes the pending command and performs a DMA write of the response to the physical address of the freed buffer. To resolve this issue, this patch makes the following modifications: 1. Convert cbdr->ring_lock from a spinlock to a mutex The lock was originally a spinlock in case NTMP operations might be invoked from atomic context. After downstream support for all NTMP tables, no such usage has materialized.
A mutex lock is now required because the driver now needs to reclaim used BDs and release associated DMA memory within the lock's context, while dma_free_coherent() might sleep. 2. Introduce software command BD (struct netc_swcbd) The hardware write-back overwrites the addr and len fields of the BD, so the driver cannot rely on the hardware BD to free the associated DMA memory. The driver now maintains a software shadow BD storing the DMA buffer pointer, DMA address, and size. And netc_xmit_ntmp_cmd() only reclaims older BDs when the number of used BDs reaches NETC_CBDR_CLEAN_WORK (16). The software BD enables correct DMA memory release. With this, struct ntmp_dma_buf and ntmp_free_data_mem() are no longer needed and are removed. 3. Require callers to hold ring_lock across netc_xmit_ntmp_cmd() netc_xmit_ntmp_cmd() releases the ring_lock before the caller finishes consuming the response. At this point, if a concurrent thread submits a new command, it may trigger ntmp_clean_cbdr() and free the DMA buffer while it is still in use. Move ring_lock ownership to the caller to ensure the response buffer cannot be reclaimed prematurely. So the helpers ntmp_select_and_lock_cbdr() and ntmp_unlock_cbdr() are added. These changes eliminate the DMA use-after-free condition and ensure safe and consistent BD reclamation and DMA buffer lifecycle management. Fixes: 4701073c3deb ("net: enetc: add initial netc-lib driver to support NTMP") Link: https://lore.kernel.org/netdev/20260403011729.1795413-1-kuba@kernel.org/ # [1] Signed-off-by: Wei Fang Link: https://patch.msgid.link/20260415060833.2303846-3-wei.fang@nxp.com Signed-off-by: Jakub Kicinski --- drivers/net/ethernet/freescale/enetc/ntmp.c | 228 ++++++++++-------- .../ethernet/freescale/enetc/ntmp_private.h | 8 +- include/linux/fsl/ntmp.h | 9 +- 3 files changed, 141 insertions(+), 104 deletions(-) diff --git a/drivers/net/ethernet/freescale/enetc/ntmp.c b/drivers/net/ethernet/freescale/enetc/ntmp.c index 0dad38735234..c94a928622fd 100644 --- a/drivers/net/ethernet/freescale/enetc/ntmp.c +++ b/drivers/net/ethernet/freescale/enetc/ntmp.c @@ -7,6 +7,7 @@ #include #include #include +#include #include "ntmp_private.h" @@ -42,6 +43,12 @@ int ntmp_init_cbdr(struct netc_cbdr *cbdr, struct device *dev, if (!cbdr->addr_base) return -ENOMEM; + cbdr->swcbd = vcalloc(cbd_num, sizeof(struct netc_swcbd)); + if (!cbdr->swcbd) { + dma_free_coherent(dev, size, cbdr->addr_base, cbdr->dma_base); + return -ENOMEM; + } + cbdr->dma_size = size; cbdr->bd_num = cbd_num; cbdr->regs = *regs; @@ -52,7 +59,7 @@ int ntmp_init_cbdr(struct netc_cbdr *cbdr, struct device *dev, cbdr->addr_base_align = PTR_ALIGN(cbdr->addr_base, NTMP_BASE_ADDR_ALIGN); - spin_lock_init(&cbdr->ring_lock); + mutex_init(&cbdr->ring_lock); cbdr->next_to_use = netc_read(cbdr->regs.pir); cbdr->next_to_clean = netc_read(cbdr->regs.cir) & NETC_CBDRCIR_INDEX; @@ -71,10 +78,24 @@ int ntmp_init_cbdr(struct netc_cbdr *cbdr, struct device *dev, } EXPORT_SYMBOL_GPL(ntmp_init_cbdr); +static void ntmp_free_data_mem(struct device *dev, struct netc_swcbd *swcbd) +{ + if (unlikely(!swcbd->buf)) + return; + + dma_free_coherent(dev, swcbd->size + NTMP_DATA_ADDR_ALIGN, + swcbd->buf, swcbd->dma); +} + void ntmp_free_cbdr(struct netc_cbdr *cbdr) { /* Disable the Control BD Ring */ netc_write(cbdr->regs.mr, 0); + + for (int i = 0; i < cbdr->bd_num; i++) + ntmp_free_data_mem(cbdr->dev, &cbdr->swcbd[i]); + + vfree(cbdr->swcbd); dma_free_coherent(cbdr->dev, cbdr->dma_size, cbdr->addr_base, cbdr->dma_base); memset(cbdr, 0, 
sizeof(*cbdr)); @@ -94,40 +115,59 @@ static union netc_cbd *ntmp_get_cbd(struct netc_cbdr *cbdr, int index) static void ntmp_clean_cbdr(struct netc_cbdr *cbdr) { - union netc_cbd *cbd; - int i; + int i = cbdr->next_to_clean; - i = cbdr->next_to_clean; while ((netc_read(cbdr->regs.cir) & NETC_CBDRCIR_INDEX) != i) { - cbd = ntmp_get_cbd(cbdr, i); + union netc_cbd *cbd = ntmp_get_cbd(cbdr, i); + struct netc_swcbd *swcbd = &cbdr->swcbd[i]; + + ntmp_free_data_mem(cbdr->dev, swcbd); + memset(swcbd, 0, sizeof(*swcbd)); memset(cbd, 0, sizeof(*cbd)); i = (i + 1) % cbdr->bd_num; } + dma_wmb(); cbdr->next_to_clean = i; } -static int netc_xmit_ntmp_cmd(struct ntmp_user *user, union netc_cbd *cbd) +static void ntmp_select_and_lock_cbdr(struct ntmp_user *user, + struct netc_cbdr **cbdr) +{ + /* Currently only ENETC is supported, and it has only one command + * BD ring. + */ + *cbdr = &user->ring[0]; + + mutex_lock(&(*cbdr)->ring_lock); +} + +static void ntmp_unlock_cbdr(struct netc_cbdr *cbdr) +{ + mutex_unlock(&cbdr->ring_lock); +} + +static int netc_xmit_ntmp_cmd(struct netc_cbdr *cbdr, union netc_cbd *cbd, + struct netc_swcbd *swcbd) { union netc_cbd *cur_cbd; - struct netc_cbdr *cbdr; - int i, err; + int i, err, used_bds; u16 status; u32 val; - /* Currently only i.MX95 ENETC is supported, and it only has one - * command BD ring - */ - cbdr = &user->ring[0]; - - spin_lock_bh(&cbdr->ring_lock); - - if (unlikely(!ntmp_get_free_cbd_num(cbdr))) + used_bds = cbdr->bd_num - ntmp_get_free_cbd_num(cbdr); + if (unlikely(used_bds >= NETC_CBDR_CLEAN_WORK)) { ntmp_clean_cbdr(cbdr); + if (unlikely(!ntmp_get_free_cbd_num(cbdr))) { + ntmp_free_data_mem(cbdr->dev, swcbd); + return -EBUSY; + } + } i = cbdr->next_to_use; cur_cbd = ntmp_get_cbd(cbdr, i); *cur_cbd = *cbd; + cbdr->swcbd[i] = *swcbd; dma_wmb(); /* Update producer index of both software and hardware */ @@ -135,17 +175,16 @@ static int netc_xmit_ntmp_cmd(struct ntmp_user *user, union netc_cbd *cbd) cbdr->next_to_use = i; netc_write(cbdr->regs.pir, i); - err = read_poll_timeout_atomic(netc_read, val, - (val & NETC_CBDRCIR_INDEX) == i, - NETC_CBDR_DELAY_US, NETC_CBDR_TIMEOUT, - true, cbdr->regs.cir); + err = read_poll_timeout(netc_read, val, + (val & NETC_CBDRCIR_INDEX) == i, + NETC_CBDR_DELAY_US, NETC_CBDR_TIMEOUT, + true, cbdr->regs.cir); if (unlikely(err)) - goto cbdr_unlock; + return err; if (unlikely(val & NETC_CBDRCIR_SBE)) { - dev_err(user->dev, "Command BD system bus error\n"); - err = -EIO; - goto cbdr_unlock; + dev_err(cbdr->dev, "Command BD system bus error\n"); + return -EIO; } dma_rmb(); @@ -157,38 +196,27 @@ static int netc_xmit_ntmp_cmd(struct ntmp_user *user, union netc_cbd *cbd) /* Check the writeback error status */ status = le16_to_cpu(cbd->resp_hdr.error_rr) & NTMP_RESP_ERROR; if (unlikely(status)) { - err = -EIO; - dev_err(user->dev, "Command BD error: 0x%04x\n", status); + dev_err(cbdr->dev, "Command BD error: 0x%04x\n", status); + return -EIO; } - ntmp_clean_cbdr(cbdr); - dma_wmb(); - -cbdr_unlock: - spin_unlock_bh(&cbdr->ring_lock); - - return err; -} - -static int ntmp_alloc_data_mem(struct ntmp_dma_buf *data, void **buf_align) -{ - void *buf; - - buf = dma_alloc_coherent(data->dev, data->size + NTMP_DATA_ADDR_ALIGN, - &data->dma, GFP_KERNEL); - if (!buf) - return -ENOMEM; - - data->buf = buf; - *buf_align = PTR_ALIGN(buf, NTMP_DATA_ADDR_ALIGN); - return 0; } -static void ntmp_free_data_mem(struct ntmp_dma_buf *data) +static int ntmp_alloc_data_mem(struct device *dev, struct netc_swcbd *swcbd, + void **buf_align) { - 
dma_free_coherent(data->dev, data->size + NTMP_DATA_ADDR_ALIGN, - data->buf, data->dma); + void *buf; + + buf = dma_alloc_coherent(dev, swcbd->size + NTMP_DATA_ADDR_ALIGN, + &swcbd->dma, GFP_KERNEL); + if (!buf) + return -ENOMEM; + + swcbd->buf = buf; + *buf_align = PTR_ALIGN(buf, NTMP_DATA_ADDR_ALIGN); + + return 0; } static void ntmp_fill_request_hdr(union netc_cbd *cbd, dma_addr_t dma, @@ -241,37 +269,39 @@ static int ntmp_delete_entry_by_id(struct ntmp_user *user, int tbl_id, u8 tbl_ver, u32 entry_id, u32 req_len, u32 resp_len) { - struct ntmp_dma_buf data = { - .dev = user->dev, + struct netc_swcbd swcbd = { .size = max(req_len, resp_len), }; struct ntmp_req_by_eid *req; + struct netc_cbdr *cbdr; union netc_cbd cbd; int err; - err = ntmp_alloc_data_mem(&data, (void **)&req); + err = ntmp_alloc_data_mem(user->dev, &swcbd, (void **)&req); if (err) return err; ntmp_fill_crd_eid(req, tbl_ver, 0, 0, entry_id); - ntmp_fill_request_hdr(&cbd, data.dma, NTMP_LEN(req_len, resp_len), + ntmp_fill_request_hdr(&cbd, swcbd.dma, NTMP_LEN(req_len, resp_len), tbl_id, NTMP_CMD_DELETE, NTMP_AM_ENTRY_ID); - err = netc_xmit_ntmp_cmd(user, &cbd); + ntmp_select_and_lock_cbdr(user, &cbdr); + err = netc_xmit_ntmp_cmd(cbdr, &cbd, &swcbd); if (err) dev_err(user->dev, "Failed to delete entry 0x%x of %s, err: %pe", entry_id, ntmp_table_name(tbl_id), ERR_PTR(err)); - - ntmp_free_data_mem(&data); + ntmp_unlock_cbdr(cbdr); return err; } -static int ntmp_query_entry_by_id(struct ntmp_user *user, int tbl_id, - u32 len, struct ntmp_req_by_eid *req, - dma_addr_t dma, bool compare_eid) +static int ntmp_query_entry_by_id(struct netc_cbdr *cbdr, int tbl_id, + struct ntmp_req_by_eid *req, + struct netc_swcbd *swcbd, + bool compare_eid) { + u32 len = NTMP_LEN(sizeof(*req), swcbd->size); struct ntmp_cmn_resp_query *resp; int cmd = NTMP_CMD_QUERY; union netc_cbd cbd; @@ -283,10 +313,11 @@ static int ntmp_query_entry_by_id(struct ntmp_user *user, int tbl_id, cmd = NTMP_CMD_QU; /* Request header */ - ntmp_fill_request_hdr(&cbd, dma, len, tbl_id, cmd, NTMP_AM_ENTRY_ID); - err = netc_xmit_ntmp_cmd(user, &cbd); + ntmp_fill_request_hdr(&cbd, swcbd->dma, len, tbl_id, cmd, + NTMP_AM_ENTRY_ID); + err = netc_xmit_ntmp_cmd(cbdr, &cbd, swcbd); if (err) { - dev_err(user->dev, + dev_err(cbdr->dev, "Failed to query entry 0x%x of %s, err: %pe\n", entry_id, ntmp_table_name(tbl_id), ERR_PTR(err)); return err; @@ -300,7 +331,7 @@ static int ntmp_query_entry_by_id(struct ntmp_user *user, int tbl_id, resp = (struct ntmp_cmn_resp_query *)req; if (unlikely(le32_to_cpu(resp->entry_id) != entry_id)) { - dev_err(user->dev, + dev_err(cbdr->dev, "%s: query EID 0x%x doesn't match response EID 0x%x\n", ntmp_table_name(tbl_id), entry_id, le32_to_cpu(resp->entry_id)); return -EIO; @@ -312,15 +343,15 @@ static int ntmp_query_entry_by_id(struct ntmp_user *user, int tbl_id, int ntmp_maft_add_entry(struct ntmp_user *user, u32 entry_id, struct maft_entry_data *maft) { - struct ntmp_dma_buf data = { - .dev = user->dev, + struct netc_swcbd swcbd = { .size = sizeof(struct maft_req_add), }; struct maft_req_add *req; + struct netc_cbdr *cbdr; union netc_cbd cbd; int err; - err = ntmp_alloc_data_mem(&data, (void **)&req); + err = ntmp_alloc_data_mem(user->dev, &swcbd, (void **)&req); if (err) return err; @@ -329,14 +360,15 @@ int ntmp_maft_add_entry(struct ntmp_user *user, u32 entry_id, req->keye = maft->keye; req->cfge = maft->cfge; - ntmp_fill_request_hdr(&cbd, data.dma, NTMP_LEN(data.size, 0), + ntmp_fill_request_hdr(&cbd, swcbd.dma, NTMP_LEN(swcbd.size, 0), 
NTMP_MAFT_ID, NTMP_CMD_ADD, NTMP_AM_ENTRY_ID); - err = netc_xmit_ntmp_cmd(user, &cbd); + + ntmp_select_and_lock_cbdr(user, &cbdr); + err = netc_xmit_ntmp_cmd(cbdr, &cbd, &swcbd); if (err) dev_err(user->dev, "Failed to add MAFT entry 0x%x, err: %pe\n", entry_id, ERR_PTR(err)); - - ntmp_free_data_mem(&data); + ntmp_unlock_cbdr(cbdr); return err; } @@ -345,31 +377,31 @@ EXPORT_SYMBOL_GPL(ntmp_maft_add_entry); int ntmp_maft_query_entry(struct ntmp_user *user, u32 entry_id, struct maft_entry_data *maft) { - struct ntmp_dma_buf data = { - .dev = user->dev, + struct netc_swcbd swcbd = { .size = sizeof(struct maft_resp_query), }; struct maft_resp_query *resp; struct ntmp_req_by_eid *req; + struct netc_cbdr *cbdr; int err; - err = ntmp_alloc_data_mem(&data, (void **)&req); + err = ntmp_alloc_data_mem(user->dev, &swcbd, (void **)&req); if (err) return err; ntmp_fill_crd_eid(req, user->tbl.maft_ver, 0, 0, entry_id); - err = ntmp_query_entry_by_id(user, NTMP_MAFT_ID, - NTMP_LEN(sizeof(*req), data.size), - req, data.dma, true); + + ntmp_select_and_lock_cbdr(user, &cbdr); + err = ntmp_query_entry_by_id(cbdr, NTMP_MAFT_ID, req, &swcbd, true); if (err) - goto end; + goto unlock_cbdr; resp = (struct maft_resp_query *)req; maft->keye = resp->keye; maft->cfge = resp->cfge; -end: - ntmp_free_data_mem(&data); +unlock_cbdr: + ntmp_unlock_cbdr(cbdr); return err; } @@ -385,8 +417,9 @@ EXPORT_SYMBOL_GPL(ntmp_maft_delete_entry); int ntmp_rsst_update_entry(struct ntmp_user *user, const u32 *table, int count) { - struct ntmp_dma_buf data = {.dev = user->dev}; struct rsst_req_update *req; + struct netc_swcbd swcbd; + struct netc_cbdr *cbdr; union netc_cbd cbd; int err, i; @@ -394,8 +427,8 @@ int ntmp_rsst_update_entry(struct ntmp_user *user, const u32 *table, /* HW only takes in a full 64 entry table */ return -EINVAL; - data.size = struct_size(req, groups, count); - err = ntmp_alloc_data_mem(&data, (void **)&req); + swcbd.size = struct_size(req, groups, count); + err = ntmp_alloc_data_mem(user->dev, &swcbd, (void **)&req); if (err) return err; @@ -405,15 +438,15 @@ int ntmp_rsst_update_entry(struct ntmp_user *user, const u32 *table, for (i = 0; i < count; i++) req->groups[i] = (u8)(table[i]); - ntmp_fill_request_hdr(&cbd, data.dma, NTMP_LEN(data.size, 0), + ntmp_fill_request_hdr(&cbd, swcbd.dma, NTMP_LEN(swcbd.size, 0), NTMP_RSST_ID, NTMP_CMD_UPDATE, NTMP_AM_ENTRY_ID); - err = netc_xmit_ntmp_cmd(user, &cbd); + ntmp_select_and_lock_cbdr(user, &cbdr); + err = netc_xmit_ntmp_cmd(cbdr, &cbd, &swcbd); if (err) dev_err(user->dev, "Failed to update RSST entry, err: %pe\n", ERR_PTR(err)); - - ntmp_free_data_mem(&data); + ntmp_unlock_cbdr(cbdr); return err; } @@ -421,8 +454,9 @@ EXPORT_SYMBOL_GPL(ntmp_rsst_update_entry); int ntmp_rsst_query_entry(struct ntmp_user *user, u32 *table, int count) { - struct ntmp_dma_buf data = {.dev = user->dev}; struct ntmp_req_by_eid *req; + struct netc_swcbd swcbd; + struct netc_cbdr *cbdr; union netc_cbd cbd; int err, i; u8 *group; @@ -431,21 +465,23 @@ int ntmp_rsst_query_entry(struct ntmp_user *user, u32 *table, int count) /* HW only takes in a full 64 entry table */ return -EINVAL; - data.size = NTMP_ENTRY_ID_SIZE + RSST_STSE_DATA_SIZE(count) + - RSST_CFGE_DATA_SIZE(count); - err = ntmp_alloc_data_mem(&data, (void **)&req); + swcbd.size = NTMP_ENTRY_ID_SIZE + RSST_STSE_DATA_SIZE(count) + + RSST_CFGE_DATA_SIZE(count); + err = ntmp_alloc_data_mem(user->dev, &swcbd, (void **)&req); if (err) return err; /* Set the request data buffer */ ntmp_fill_crd_eid(req, user->tbl.rsst_ver, 0, 0, 0); - 
ntmp_fill_request_hdr(&cbd, data.dma, NTMP_LEN(sizeof(*req), data.size), + ntmp_fill_request_hdr(&cbd, swcbd.dma, NTMP_LEN(sizeof(*req), swcbd.size), NTMP_RSST_ID, NTMP_CMD_QUERY, NTMP_AM_ENTRY_ID); - err = netc_xmit_ntmp_cmd(user, &cbd); + + ntmp_select_and_lock_cbdr(user, &cbdr); + err = netc_xmit_ntmp_cmd(cbdr, &cbd, &swcbd); if (err) { dev_err(user->dev, "Failed to query RSST entry, err: %pe\n", ERR_PTR(err)); - goto end; + goto unlock_cbdr; } group = (u8 *)req; @@ -453,8 +489,8 @@ int ntmp_rsst_query_entry(struct ntmp_user *user, u32 *table, int count) for (i = 0; i < count; i++) table[i] = group[i]; -end: - ntmp_free_data_mem(&data); +unlock_cbdr: + ntmp_unlock_cbdr(cbdr); return err; } diff --git a/drivers/net/ethernet/freescale/enetc/ntmp_private.h b/drivers/net/ethernet/freescale/enetc/ntmp_private.h index 3459cc45b610..f8dff3ba2c28 100644 --- a/drivers/net/ethernet/freescale/enetc/ntmp_private.h +++ b/drivers/net/ethernet/freescale/enetc/ntmp_private.h @@ -14,6 +14,7 @@ #define NETC_CBDR_BD_NUM 256 #define NETC_CBDRCIR_INDEX GENMASK(9, 0) #define NETC_CBDRCIR_SBE BIT(31) +#define NETC_CBDR_CLEAN_WORK 16 union netc_cbd { struct { @@ -56,13 +57,6 @@ union netc_cbd { } resp_hdr; /* NTMP Response Message Header Format */ }; -struct ntmp_dma_buf { - struct device *dev; - size_t size; - void *buf; - dma_addr_t dma; -}; - struct ntmp_cmn_req_data { __le16 update_act; u8 dbg_opt; diff --git a/include/linux/fsl/ntmp.h b/include/linux/fsl/ntmp.h index 916dc4fe7de3..83a449b4d6ec 100644 --- a/include/linux/fsl/ntmp.h +++ b/include/linux/fsl/ntmp.h @@ -31,6 +31,12 @@ struct netc_tbl_vers { u8 rsst_ver; }; +struct netc_swcbd { + void *buf; + dma_addr_t dma; + size_t size; +}; + struct netc_cbdr { struct device *dev; struct netc_cbdr_regs regs; @@ -44,9 +50,10 @@ struct netc_cbdr { void *addr_base_align; dma_addr_t dma_base; dma_addr_t dma_base_align; + struct netc_swcbd *swcbd; /* Serialize the order of command BD ring */ - spinlock_t ring_lock; + struct mutex ring_lock; }; struct ntmp_user { From 080f22f5d30233faf3d83be3098f35b8be9b7a00 Mon Sep 17 00:00:00 2001 From: Luigi Leonardi Date: Wed, 15 Apr 2026 17:09:28 +0200 Subject: [PATCH 015/148] vsock/virtio: fix MSG_PEEK ignoring skb offset when calculating bytes to copy `virtio_transport_stream_do_peek()` does not account for the skb offset when computing the number of bytes to copy. This means that, after a partial recv() that advances the offset, a peek requesting more bytes than are available in the sk_buff causes `skb_copy_datagram_iter()` to go past the valid payload, resulting in a -EFAULT. The dequeue path already handles this correctly. Apply the same logic to the peek path. 
Fixes: 0df7cd3c13e4 ("vsock/virtio/vhost: read data from non-linear skb") Reviewed-by: Stefano Garzarella Acked-by: Arseniy Krasnov Signed-off-by: Luigi Leonardi Link: https://patch.msgid.link/20260415-fix_peek-v4-1-8207e872759e@redhat.com Signed-off-by: Jakub Kicinski --- net/vmw_vsock/virtio_transport_common.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c index e96e9893b21b..0742091beae7 100644 --- a/net/vmw_vsock/virtio_transport_common.c +++ b/net/vmw_vsock/virtio_transport_common.c @@ -545,9 +545,8 @@ virtio_transport_stream_do_peek(struct vsock_sock *vsk, skb_queue_walk(&vvs->rx_queue, skb) { size_t bytes; - bytes = len - total; - if (bytes > skb->len) - bytes = skb->len; + bytes = min_t(size_t, len - total, + skb->len - VIRTIO_VSOCK_SKB_CB(skb)->offset); spin_unlock_bh(&vvs->rx_lock); From a3f77afbf67d5ddbc8938fd5627a11221d8a3368 Mon Sep 17 00:00:00 2001 From: Luigi Leonardi Date: Wed, 15 Apr 2026 17:09:29 +0200 Subject: [PATCH 016/148] vsock/test: fix MSG_PEEK handling in recv_buf() `recv_buf` does not handle the MSG_PEEK flag correctly: it keeps calling `recv` until all requested bytes are available or an error occurs. The problem is how it calculates the number of bytes read: MSG_PEEK doesn't consume any bytes and will re-read the same bytes from the buffer head, so summing the return value every time is wrong. Moreover, MSG_PEEK doesn't consume the bytes in the buffer, so if more bytes are requested than are available, the loop will never terminate, because `recv` will never return EOF. For this reason, we need to compare the number of bytes read with the number of bytes expected. Add a check: if the MSG_PEEK flag is present, update the byte counter and break out of the loop only after at least the expected number of bytes have been received; otherwise, retry after a short delay to avoid consuming too many CPU cycles. This allows us to simplify the `test_stream_credit_update_test` by reusing `recv_buf`, like some other tests already do. Suggested-by: Stefano Garzarella Signed-off-by: Luigi Leonardi Reviewed-by: Stefano Garzarella Link: https://patch.msgid.link/20260415-fix_peek-v4-2-8207e872759e@redhat.com Signed-off-by: Jakub Kicinski --- tools/testing/vsock/util.c | 15 +++++++++++++++ tools/testing/vsock/vsock_test.c | 13 +------------ 2 files changed, 16 insertions(+), 12 deletions(-) diff --git a/tools/testing/vsock/util.c b/tools/testing/vsock/util.c index 1fe1338c79cd..fe316b02a590 100644 --- a/tools/testing/vsock/util.c +++ b/tools/testing/vsock/util.c @@ -381,7 +381,13 @@ void send_buf(int fd, const void *buf, size_t len, int flags, } } +#define RECV_PEEK_RETRY_USEC (10 * 1000) + /* Receive bytes in a buffer and check the return value. + * + * When MSG_PEEK is set, recv() is retried until it returns at least + * expected_ret bytes. The function returns on error, EOF, or timeout + * as usual. 
* expected_ret: * <0 Negative errno (for testing errors) @@ -403,6 +409,15 @@ void recv_buf(int fd, void *buf, size_t len, int flags, ssize_t expected_ret) if (ret <= 0) break; + if (flags & MSG_PEEK) { + if (ret >= expected_ret) { + nread = ret; + break; + } + timeout_usleep(RECV_PEEK_RETRY_USEC); + continue; + } + nread += ret; } while (nread < len); timeout_end(); diff --git a/tools/testing/vsock/vsock_test.c b/tools/testing/vsock/vsock_test.c index 5bd20ccd9335..bdb0754965df 100644 --- a/tools/testing/vsock/vsock_test.c +++ b/tools/testing/vsock/vsock_test.c @@ -1500,18 +1500,7 @@ static void test_stream_credit_update_test(const struct test_opts *opts, } /* Wait until there will be 128KB of data in rx queue. */ - while (1) { - ssize_t res; - - res = recv(fd, buf, buf_size, MSG_PEEK); - if (res == buf_size) - break; - - if (res <= 0) { - fprintf(stderr, "unexpected 'recv()' return: %zi\n", res); - exit(EXIT_FAILURE); - } - } + recv_buf(fd, buf, buf_size, MSG_PEEK, buf_size); /* There is 128KB of data in the socket's rx queue, dequeue first * 64KB, credit update is sent if 'low_rx_bytes_test' == true. From 2a2675ef619010912a5826297cd3cab00d7dc697 Mon Sep 17 00:00:00 2001 From: Luigi Leonardi Date: Wed, 15 Apr 2026 17:09:30 +0200 Subject: [PATCH 017/148] vsock/test: add MSG_PEEK after partial recv test Add a test that verifies MSG_PEEK works correctly after a partial recv(). This exercises a bug that was present in `virtio_transport_stream_do_peek()` when computing the number of bytes to copy: after a partial read, the peek function didn't take into account the number of bytes that were already read, so peeking the whole buffer would cause an out-of-bounds read that resulted in a -EFAULT. The test does exactly that: it performs a partial recv() on a buffer, then tries to peek the whole buffer content. The test re-uses `test_stream_msg_peek_client()` to also cover this scenario. Reviewed-by: Stefano Garzarella Signed-off-by: Luigi Leonardi Link: https://patch.msgid.link/20260415-fix_peek-v4-3-8207e872759e@redhat.com Signed-off-by: Jakub Kicinski --- tools/testing/vsock/vsock_test.c | 37 ++++++++++++++++++++++++++++++++ 1 file changed, 37 insertions(+) diff --git a/tools/testing/vsock/vsock_test.c b/tools/testing/vsock/vsock_test.c index bdb0754965df..76be0e4a7f0e 100644 --- a/tools/testing/vsock/vsock_test.c +++ b/tools/testing/vsock/vsock_test.c @@ -346,6 +346,38 @@ static void test_stream_msg_peek_server(const struct test_opts *opts) return test_msg_peek_server(opts, false); } +static void test_stream_peek_after_recv_server(const struct test_opts *opts) +{ + unsigned char buf_normal[MSG_PEEK_BUF_LEN]; + unsigned char buf_peek[MSG_PEEK_BUF_LEN]; + int fd; + + fd = vsock_stream_accept(VMADDR_CID_ANY, opts->peer_port, NULL); + if (fd < 0) { + perror("accept"); + exit(EXIT_FAILURE); + } + + control_writeln("SRVREADY"); + + /* Partial recv to advance offset within the skb */ + recv_buf(fd, buf_normal, 1, 0, 1); + + /* Peek with a buffer larger than the remaining data */ + recv_buf(fd, buf_peek, sizeof(buf_peek), MSG_PEEK, sizeof(buf_peek) - 1); + + /* Consume the remaining data */ + recv_buf(fd, buf_normal, sizeof(buf_normal) - 1, 0, sizeof(buf_normal) - 1); + + /* Compare full peek and normal read.
+	 */
+	if (memcmp(buf_peek, buf_normal, sizeof(buf_peek) - 1)) {
+		fprintf(stderr, "Full peek data mismatch\n");
+		exit(EXIT_FAILURE);
+	}
+
+	close(fd);
+}
+
 #define SOCK_BUF_SIZE (2 * 1024 * 1024)
 #define SOCK_BUF_SIZE_SMALL (64 * 1024)
 #define MAX_MSG_PAGES 4
@@ -2509,6 +2541,11 @@ static struct test_case test_cases[] = {
 		.run_client = test_stream_tx_credit_bounds_client,
 		.run_server = test_stream_tx_credit_bounds_server,
 	},
+	{
+		.name = "SOCK_STREAM MSG_PEEK after partial recv",
+		.run_client = test_stream_msg_peek_client,
+		.run_server = test_stream_peek_after_recv_server,
+	},
 	{},
 };
 

From 82c21069028c5db3463f851ae8ac9cc2e38a3827 Mon Sep 17 00:00:00 2001
From: Jakub Kicinski
Date: Wed, 15 Apr 2026 18:04:39 -0700
Subject: [PATCH 018/148] selftests: net: add missing CMAC to tcp_ao config

Recent changes to crypto and wifi made CMAC no longer selected by
default on x86, and tcp_ao needs it. Add the missing config.

Reviewed-by: Matthieu Baerts (NGI0)
Link: https://patch.msgid.link/20260416010439.1053587-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski
---
 tools/testing/selftests/net/tcp_ao/config | 1 +
 1 file changed, 1 insertion(+)

diff --git a/tools/testing/selftests/net/tcp_ao/config b/tools/testing/selftests/net/tcp_ao/config
index 971cb6fa2d63..f22148512365 100644
--- a/tools/testing/selftests/net/tcp_ao/config
+++ b/tools/testing/selftests/net/tcp_ao/config
@@ -1,3 +1,4 @@
+CONFIG_CRYPTO_CMAC=y
 CONFIG_CRYPTO_HMAC=y
 CONFIG_CRYPTO_RMD160=y
 CONFIG_CRYPTO_SHA1=y

From e5fd34ab8dff6c5bd4f2e9ee4f3945b79e511068 Mon Sep 17 00:00:00 2001
From: Ralf Lici
Date: Tue, 24 Mar 2026 08:48:57 +0100
Subject: [PATCH 019/148] selftests: ovpn: add nftables config dependencies for
 test-mark

test-mark.sh installs nftables rules in an inet/filter output chain and
verifies packet drops via nft counters. In vmksft this can fail when
the nftables core is not enabled by the ovpn selftest config.

Add the missing kernel options required by this test:
- CONFIG_NETFILTER
- CONFIG_NF_TABLES
- CONFIG_NF_TABLES_INET

Fixes: 7b80d8a33500 ("selftests: ovpn: add test for the FW mark feature")
Reported-by: Jakub Kicinski
Closes: https://lore.kernel.org/all/20260319124114.42f91f72@kernel.org/
Signed-off-by: Ralf Lici
Signed-off-by: Antonio Quartulli
---
 tools/testing/selftests/net/ovpn/config | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/tools/testing/selftests/net/ovpn/config b/tools/testing/selftests/net/ovpn/config
index 42699740936d..d6cf033d555e 100644
--- a/tools/testing/selftests/net/ovpn/config
+++ b/tools/testing/selftests/net/ovpn/config
@@ -5,6 +5,9 @@ CONFIG_CRYPTO_GCM=y
 CONFIG_DST_CACHE=y
 CONFIG_INET=y
 CONFIG_NET=y
+CONFIG_NETFILTER=y
 CONFIG_NET_UDP_TUNNEL=y
+CONFIG_NF_TABLES=m
+CONFIG_NF_TABLES_INET=y
 CONFIG_OVPN=m
 CONFIG_STREAM_PARSER=y

From c409da0fe15e2b2aae7f93edbab977e23117ce4d Mon Sep 17 00:00:00 2001
From: Ralf Lici
Date: Mon, 23 Mar 2026 15:32:26 +0100
Subject: [PATCH 020/148] selftests: ovpn: fail notification check on mismatch

compare_ntfs doesn't fail when the expected and received notification
streams diverge. Fix this bug by tracking the diff exit status
explicitly and returning it to the caller so notification mismatches
propagate as test failures.
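A minimal sketch of the exit-status pattern the fix applies (the
function and file names below are illustrative, not taken from the
patch):

	check_streams() {
		local rc=0 out
		out=$(mktemp)
		if diff expected.json received.json >"${out}" 2>&1; then
			echo "OK"
		else
			rc=$?		# diff exits 1 on mismatch, >1 on error
			echo "failed"
			cat "${out}"	# surface the divergence for debugging
		fi
		rm -f "${out}"
		return "${rc}"	# let the caller turn this into a test failure
	}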
Fixes: 77de28cd7cf1 ("selftests: ovpn: add notification parsing and matching")
Signed-off-by: Ralf Lici
Signed-off-by: Antonio Quartulli
---
 tools/testing/selftests/net/ovpn/common.sh | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/net/ovpn/common.sh b/tools/testing/selftests/net/ovpn/common.sh
index 4c08f756e63a..c92415aaddfc 100644
--- a/tools/testing/selftests/net/ovpn/common.sh
+++ b/tools/testing/selftests/net/ovpn/common.sh
@@ -140,23 +140,35 @@ add_peer() {
 }
 
 compare_ntfs() {
+	local diff_rc=0
+	local diff_file
+
 	if [ ${#tmp_jsons[@]} -gt 0 ]; then
 		suffix=""
 		[ "${SYMMETRIC_ID}" -eq 1 ] && suffix="${suffix}-symm"
 		[ "$FLOAT" == 1 ] && suffix="${suffix}-float"
 		expected="json/peer${1}${suffix}.json"
 		received="${tmp_jsons[$1]}"
+		diff_file=$(mktemp)
 
 		kill -TERM ${listener_pids[$1]} || true
 		wait ${listener_pids[$1]} || true
 		printf "Checking notifications for peer ${1}... "
 		if diff <(jq -s "${JQ_FILTER}" ${expected}) \
-			<(jq -s "${JQ_FILTER}" ${received}); then
+			<(jq -s "${JQ_FILTER}" ${received}) \
+			>"${diff_file}" 2>&1; then
 			echo "OK"
+		else
+			diff_rc=$?
+			echo "failed"
+			cat "${diff_file}"
 		fi
 
+		rm -f "${diff_file}" || true
 		rm -f ${received} || true
 	fi
+
+	return "${diff_rc}"
 }
 
 cleanup() {

From 222e7f8d1ca3aaebe7588a79bf64d9820813785c Mon Sep 17 00:00:00 2001
From: Ralf Lici
Date: Tue, 24 Mar 2026 15:54:18 +0100
Subject: [PATCH 021/148] selftests: ovpn: flatten slurped notification JSON
 before filtering

Notification comparison uses jq -s, which slurps all inputs into an
array. Some inputs can be arrays themselves, and applying the .msg.peer
filter directly on those entries triggers jq type errors.

Expand any array-valued JSON items returned by jq -s before selecting
.msg.peer, so the filter handles both normal notification objects and
[] entries without type errors.

Fixes: 77de28cd7cf1 ("selftests: ovpn: add notification parsing and matching")
Signed-off-by: Ralf Lici
Signed-off-by: Antonio Quartulli
---
 tools/testing/selftests/net/ovpn/common.sh | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/net/ovpn/common.sh b/tools/testing/selftests/net/ovpn/common.sh
index c92415aaddfc..d3b322e84fab 100644
--- a/tools/testing/selftests/net/ovpn/common.sh
+++ b/tools/testing/selftests/net/ovpn/common.sh
@@ -15,7 +15,8 @@ SYMMETRIC_ID=${SYMMETRIC_ID:-0}
 
 export ID_OFFSET=$(( 9 * (SYMMETRIC_ID == 0) ))
 
-JQ_FILTER='map(select(.msg.peer | has("remote-ipv6") | not)) |
+JQ_FILTER='map(if type == "array" then .[] else . end) |
+	map(select(.msg.peer | has("remote-ipv6") | not)) |
 	map(del(.msg.ifindex)) | sort_by(.msg.peer.id)[]'
 
 LAN_IP="11.11.11.11"

From 7c29665a3a3cce1b0e9d6b96054eef64bfc4cebd Mon Sep 17 00:00:00 2001
From: Ralf Lici
Date: Fri, 20 Mar 2026 17:29:38 +0100
Subject: [PATCH 022/148] selftests: ovpn: add prefix to helpers and shared
 variables

The current naming of shared variables, helpers and network namespaces
is a bit unfortunate, as it doesn't come with a clean prefix. This
proved problematic in case of name clashes with external scripts, and
in case of abrupt test termination (leftover netns instances weren't
easily traced back to ovpn).

Rename the common helper entry points and all shared globals in the
ovpn selftests to ovpn_ or OVPN_ names so that test scripts and
wrappers use a single explicit prefix. Also rename the temporary
network namespaces created by the tests from peerN to ovpn_peerN. This
makes leaked namespaces easier to identify.

This is a mechanical refactor only; behavior is unchanged.
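As a sketch of the payoff (the loop below is illustrative; the later
cleanup rework in this series uses the same match), namespaces leaked
by an aborted run can now be found and removed by prefix alone:

	for ns in $(ip netns list 2>/dev/null | awk '/^ovpn_/ {print $1}'); do
		ip netns del "${ns}" 2>/dev/null || true
	done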
Fixes: 959bc330a439 ("testing/selftests: add test tool and scripts for ovpn module") Signed-off-by: Ralf Lici Signed-off-by: Antonio Quartulli --- tools/testing/selftests/net/ovpn/common.sh | 182 ++++++++++-------- .../selftests/net/ovpn/test-chachapoly.sh | 2 +- .../net/ovpn/test-close-socket-tcp.sh | 2 +- .../selftests/net/ovpn/test-close-socket.sh | 32 +-- .../testing/selftests/net/ovpn/test-float.sh | 2 +- tools/testing/selftests/net/ovpn/test-mark.sh | 45 ++--- .../net/ovpn/test-symmetric-id-float.sh | 4 +- .../net/ovpn/test-symmetric-id-tcp.sh | 4 +- .../selftests/net/ovpn/test-symmetric-id.sh | 2 +- tools/testing/selftests/net/ovpn/test-tcp.sh | 2 +- tools/testing/selftests/net/ovpn/test.sh | 139 ++++++------- 11 files changed, 224 insertions(+), 192 deletions(-) diff --git a/tools/testing/selftests/net/ovpn/common.sh b/tools/testing/selftests/net/ovpn/common.sh index d3b322e84fab..38f187b9de23 100644 --- a/tools/testing/selftests/net/ovpn/common.sh +++ b/tools/testing/selftests/net/ovpn/common.sh @@ -4,63 +4,72 @@ # # Author: Antonio Quartulli -UDP_PEERS_FILE=${UDP_PEERS_FILE:-udp_peers.txt} -TCP_PEERS_FILE=${TCP_PEERS_FILE:-tcp_peers.txt} +OVPN_UDP_PEERS_FILE=${OVPN_UDP_PEERS_FILE:-udp_peers.txt} +OVPN_TCP_PEERS_FILE=${OVPN_TCP_PEERS_FILE:-tcp_peers.txt} OVPN_CLI=${OVPN_CLI:-./ovpn-cli} -YNL_CLI=${YNL_CLI:-../../../../net/ynl/pyynl/cli.py} -ALG=${ALG:-aes} -PROTO=${PROTO:-UDP} -FLOAT=${FLOAT:-0} -SYMMETRIC_ID=${SYMMETRIC_ID:-0} +OVPN_YNL=${OVPN_YNL:-../../../../net/ynl/pyynl/cli.py} +OVPN_ALG=${OVPN_ALG:-aes} +OVPN_PROTO=${OVPN_PROTO:-UDP} +OVPN_FLOAT=${OVPN_FLOAT:-0} +OVPN_SYMMETRIC_ID=${OVPN_SYMMETRIC_ID:-0} -export ID_OFFSET=$(( 9 * (SYMMETRIC_ID == 0) )) +export OVPN_ID_OFFSET=$(( 9 * (OVPN_SYMMETRIC_ID == 0) )) -JQ_FILTER='map(if type == "array" then .[] else . end) | +OVPN_JQ_FILTER='map(if type == "array" then .[] else . 
end) | map(select(.msg.peer | has("remote-ipv6") | not)) | map(del(.msg.ifindex)) | sort_by(.msg.peer.id)[]' -LAN_IP="11.11.11.11" +OVPN_LAN_IP="11.11.11.11" -declare -A tmp_jsons=() -declare -A listener_pids=() +declare -A OVPN_TMP_JSONS=() +declare -A OVPN_LISTENER_PIDS=() -create_ns() { - ip netns add peer${1} +ovpn_create_ns() { + ip netns add "ovpn_peer${1}" } -setup_ns() { +ovpn_setup_ns() { + local peer="ovpn_peer${1}" + local server_ns="ovpn_peer0" + local peer_ns MODE="P2P" if [ ${1} -eq 0 ]; then MODE="MP" - for p in $(seq 1 ${NUM_PEERS}); do - ip link add veth${p} netns peer0 type veth peer name veth${p} netns peer${p} + for p in $(seq 1 ${OVPN_NUM_PEERS}); do + peer_ns="ovpn_peer${p}" + ip link add veth${p} netns "${server_ns}" type veth \ + peer name veth${p} netns "${peer_ns}" - ip -n peer0 addr add 10.10.${p}.1/24 dev veth${p} - ip -n peer0 addr add fd00:0:0:${p}::1/64 dev veth${p} - ip -n peer0 link set veth${p} up + ip -n "${server_ns}" addr add 10.10.${p}.1/24 dev \ + veth${p} + ip -n "${server_ns}" addr add fd00:0:0:${p}::1/64 dev \ + veth${p} + ip -n "${server_ns}" link set veth${p} up - ip -n peer${p} addr add 10.10.${p}.2/24 dev veth${p} - ip -n peer${p} addr add fd00:0:0:${p}::2/64 dev veth${p} - ip -n peer${p} link set veth${p} up + ip -n "${peer_ns}" addr add 10.10.${p}.2/24 dev veth${p} + ip -n "${peer_ns}" addr add fd00:0:0:${p}::2/64 dev \ + veth${p} + ip -n "${peer_ns}" link set veth${p} up done fi - ip netns exec peer${1} ${OVPN_CLI} new_iface tun${1} $MODE - ip -n peer${1} addr add ${2} dev tun${1} + ip netns exec "${peer}" ${OVPN_CLI} new_iface tun${1} $MODE + ip -n "${peer}" addr add ${2} dev tun${1} # add a secondary IP to peer 1, to test a LAN behind a client - if [ ${1} -eq 1 -a -n "${LAN_IP}" ]; then - ip -n peer${1} addr add ${LAN_IP} dev tun${1} - ip -n peer0 route add ${LAN_IP} via $(echo ${2} |sed -e s'!/.*!!') dev tun0 + if [ ${1} -eq 1 -a -n "${OVPN_LAN_IP}" ]; then + ip -n "${peer}" addr add ${OVPN_LAN_IP} dev tun${1} + ip -n "${server_ns}" route add ${OVPN_LAN_IP} via \ + $(echo ${2} |sed -e s'!/.*!!') dev tun0 fi if [ -n "${3}" ]; then - ip -n peer${1} link set mtu ${3} dev tun${1} + ip -n "${peer}" link set mtu ${3} dev tun${1} fi - ip -n peer${1} link set tun${1} up + ip -n "${peer}" link set tun${1} up } -build_capture_filter() { +ovpn_build_capture_filter() { # match the first four bytes of the openvpn data payload - if [ "${PROTO}" == "UDP" ]; then + if [ "${OVPN_PROTO}" == "UDP" ]; then # For UDP, libpcap transport indexing only works for IPv4, so # use an explicit IPv4 or IPv6 expression based on the peer # address. The IPv6 branch assumes there are no extension @@ -77,86 +86,98 @@ build_capture_filter() { fi } -setup_listener() { +ovpn_setup_listener() { + local peer_ns="ovpn_peer${p}" file=$(mktemp) - PYTHONUNBUFFERED=1 ip netns exec peer${p} ${YNL_CLI} --family ovpn \ - --subscribe peers --output-json --duration 40 > ${file} & - listener_pids[$1]=$! - tmp_jsons[$1]="${file}" + PYTHONUNBUFFERED=1 ip netns exec "${peer_ns}" "${OVPN_YNL}" --family \ + ovpn --subscribe peers --output-json --duration 40 > ${file} & + OVPN_LISTENER_PIDS[$1]=$! 
+ OVPN_TMP_JSONS[$1]="${file}" } -add_peer() { +ovpn_add_peer() { labels=("ASYMM" "SYMM") - M_ID=${labels[SYMMETRIC_ID]} + local peer_ns + local server_ns="ovpn_peer0" + M_ID=${labels[OVPN_SYMMETRIC_ID]} - if [ "${PROTO}" == "UDP" ]; then + if [ "${OVPN_PROTO}" == "UDP" ]; then if [ ${1} -eq 0 ]; then - ip netns exec peer0 ${OVPN_CLI} new_multi_peer tun0 1 \ - ${M_ID} ${UDP_PEERS_FILE} + ip netns exec "${server_ns}" ${OVPN_CLI} \ + new_multi_peer tun0 1 ${M_ID} \ + ${OVPN_UDP_PEERS_FILE} - for p in $(seq 1 ${NUM_PEERS}); do - ip netns exec peer0 ${OVPN_CLI} new_key tun0 ${p} 1 0 ${ALG} 0 \ + for p in $(seq 1 ${OVPN_NUM_PEERS}); do + ip netns exec "${server_ns}" ${OVPN_CLI} \ + new_key tun0 ${p} 1 0 ${OVPN_ALG} 0 \ data64.key done else - if [ "${SYMMETRIC_ID}" -eq 1 ]; then + peer_ns="ovpn_peer${1}" + if [ "${OVPN_SYMMETRIC_ID}" -eq 1 ]; then PEER_ID=${1} TX_ID="none" else PEER_ID=$(awk "NR == ${1} {print \$2}" \ - ${UDP_PEERS_FILE}) + ${OVPN_UDP_PEERS_FILE}) TX_ID=${1} fi - RADDR=$(awk "NR == ${1} {print \$3}" ${UDP_PEERS_FILE}) - RPORT=$(awk "NR == ${1} {print \$4}" ${UDP_PEERS_FILE}) - LPORT=$(awk "NR == ${1} {print \$6}" ${UDP_PEERS_FILE}) - ip netns exec peer${1} ${OVPN_CLI} new_peer tun${1} \ - ${PEER_ID} ${TX_ID} ${LPORT} ${RADDR} ${RPORT} - ip netns exec peer${1} ${OVPN_CLI} new_key tun${1} \ - ${PEER_ID} 1 0 ${ALG} 1 data64.key + RADDR=$(awk "NR == ${1} {print \$3}" \ + ${OVPN_UDP_PEERS_FILE}) + RPORT=$(awk "NR == ${1} {print \$4}" \ + ${OVPN_UDP_PEERS_FILE}) + LPORT=$(awk "NR == ${1} {print \$6}" \ + ${OVPN_UDP_PEERS_FILE}) + ip netns exec "${peer_ns}" ${OVPN_CLI} new_peer \ + tun${1} ${PEER_ID} ${TX_ID} ${LPORT} ${RADDR} \ + ${RPORT} + ip netns exec "${peer_ns}" ${OVPN_CLI} new_key tun${1} \ + ${PEER_ID} 1 0 ${OVPN_ALG} 1 data64.key fi else if [ ${1} -eq 0 ]; then - (ip netns exec peer0 ${OVPN_CLI} listen tun0 1 ${M_ID} \ - ${TCP_PEERS_FILE} && { - for p in $(seq 1 ${NUM_PEERS}); do - ip netns exec peer0 ${OVPN_CLI} new_key tun0 ${p} 1 0 \ - ${ALG} 0 data64.key + (ip netns exec "${server_ns}" ${OVPN_CLI} listen tun0 \ + 1 ${M_ID} ${OVPN_TCP_PEERS_FILE} && { + for p in $(seq 1 ${OVPN_NUM_PEERS}); do + ip netns exec "${server_ns}" \ + ${OVPN_CLI} new_key tun0 ${p} \ + 1 0 ${OVPN_ALG} 0 data64.key done }) & sleep 5 else - if [ "${SYMMETRIC_ID}" -eq 1 ]; then + peer_ns="ovpn_peer${1}" + if [ "${OVPN_SYMMETRIC_ID}" -eq 1 ]; then PEER_ID=${1} TX_ID="none" else PEER_ID=$(awk "NR == ${1} {print \$2}" \ - ${TCP_PEERS_FILE}) + ${OVPN_TCP_PEERS_FILE}) TX_ID=${1} fi - ip netns exec peer${1} ${OVPN_CLI} connect tun${1} \ + ip netns exec "${peer_ns}" ${OVPN_CLI} connect tun${1} \ ${PEER_ID} ${TX_ID} 10.10.${1}.1 1 data64.key fi fi } -compare_ntfs() { +ovpn_compare_ntfs() { local diff_rc=0 local diff_file - if [ ${#tmp_jsons[@]} -gt 0 ]; then + if [ ${#OVPN_TMP_JSONS[@]} -gt 0 ]; then suffix="" - [ "${SYMMETRIC_ID}" -eq 1 ] && suffix="${suffix}-symm" - [ "$FLOAT" == 1 ] && suffix="${suffix}-float" + [ "${OVPN_SYMMETRIC_ID}" -eq 1 ] && suffix="${suffix}-symm" + [ "$OVPN_FLOAT" == 1 ] && suffix="${suffix}-float" expected="json/peer${1}${suffix}.json" - received="${tmp_jsons[$1]}" + received="${OVPN_TMP_JSONS[$1]}" diff_file=$(mktemp) - kill -TERM ${listener_pids[$1]} || true - wait ${listener_pids[$1]} || true + kill -TERM ${OVPN_LISTENER_PIDS[$1]} || true + wait ${OVPN_LISTENER_PIDS[$1]} || true printf "Checking notifications for peer ${1}... 
" - if diff <(jq -s "${JQ_FILTER}" ${expected}) \ - <(jq -s "${JQ_FILTER}" ${received}) \ + if diff <(jq -s "${OVPN_JQ_FILTER}" ${expected}) \ + <(jq -s "${OVPN_JQ_FILTER}" ${received}) \ >"${diff_file}" 2>&1; then echo "OK" else @@ -172,25 +193,30 @@ compare_ntfs() { return "${diff_rc}" } -cleanup() { +ovpn_cleanup() { + local peer_ns # some ovpn-cli processes sleep in background so they need manual poking killall $(basename ${OVPN_CLI}) 2>/dev/null || true # netns peer0 is deleted without erasing ifaces first for p in $(seq 1 10); do - ip -n peer${p} link set tun${p} down 2>/dev/null || true - ip netns exec peer${p} ${OVPN_CLI} del_iface tun${p} 2>/dev/null || true + peer_ns="ovpn_peer${p}" + ip -n "${peer_ns}" link set tun${p} down 2>/dev/null || true + ip netns exec "${peer_ns}" ${OVPN_CLI} del_iface tun${p} \ + 2>/dev/null || true done for p in $(seq 1 10); do - ip -n peer0 link del veth${p} 2>/dev/null || true + ip -n ovpn_peer0 link del veth${p} 2>/dev/null || true done for p in $(seq 0 10); do - ip netns del peer${p} 2>/dev/null || true + ip netns del "ovpn_peer${p}" 2>/dev/null || true done } -if [ "${PROTO}" == "UDP" ]; then - NUM_PEERS=${NUM_PEERS:-$(wc -l ${UDP_PEERS_FILE} | awk '{print $1}')} +if [ "${OVPN_PROTO}" == "UDP" ]; then + OVPN_NUM_PEERS=${OVPN_NUM_PEERS:-$(wc -l ${OVPN_UDP_PEERS_FILE} | \ + awk '{print $1}')} else - NUM_PEERS=${NUM_PEERS:-$(wc -l ${TCP_PEERS_FILE} | awk '{print $1}')} + OVPN_NUM_PEERS=${OVPN_NUM_PEERS:-$(wc -l ${OVPN_TCP_PEERS_FILE} | \ + awk '{print $1}')} fi diff --git a/tools/testing/selftests/net/ovpn/test-chachapoly.sh b/tools/testing/selftests/net/ovpn/test-chachapoly.sh index 32504079a2b8..cd3d94355d58 100755 --- a/tools/testing/selftests/net/ovpn/test-chachapoly.sh +++ b/tools/testing/selftests/net/ovpn/test-chachapoly.sh @@ -4,6 +4,6 @@ # # Author: Antonio Quartulli -ALG="chachapoly" +OVPN_ALG="chachapoly" source test.sh diff --git a/tools/testing/selftests/net/ovpn/test-close-socket-tcp.sh b/tools/testing/selftests/net/ovpn/test-close-socket-tcp.sh index 093d44772ffd..392d269bada5 100755 --- a/tools/testing/selftests/net/ovpn/test-close-socket-tcp.sh +++ b/tools/testing/selftests/net/ovpn/test-close-socket-tcp.sh @@ -4,6 +4,6 @@ # # Author: Antonio Quartulli -PROTO="TCP" +OVPN_PROTO="TCP" source test-close-socket.sh diff --git a/tools/testing/selftests/net/ovpn/test-close-socket.sh b/tools/testing/selftests/net/ovpn/test-close-socket.sh index 0d09df14fe8e..6bc1b6eab8ac 100755 --- a/tools/testing/selftests/net/ovpn/test-close-socket.sh +++ b/tools/testing/selftests/net/ovpn/test-close-socket.sh @@ -8,38 +8,40 @@ set -e source ./common.sh +server_ns="ovpn_peer0" -cleanup +ovpn_cleanup modprobe -q ovpn || true -for p in $(seq 0 ${NUM_PEERS}); do - create_ns ${p} +for p in $(seq 0 ${OVPN_NUM_PEERS}); do + ovpn_create_ns ${p} done -for p in $(seq 0 ${NUM_PEERS}); do - setup_ns ${p} 5.5.5.$((${p} + 1))/24 +for p in $(seq 0 ${OVPN_NUM_PEERS}); do + ovpn_setup_ns ${p} 5.5.5.$((${p} + 1))/24 done -for p in $(seq 0 ${NUM_PEERS}); do - add_peer ${p} +for p in $(seq 0 ${OVPN_NUM_PEERS}); do + ovpn_add_peer ${p} done -for p in $(seq 1 ${NUM_PEERS}); do - ip netns exec peer0 ${OVPN_CLI} set_peer tun0 ${p} 60 120 - ip netns exec peer${p} ${OVPN_CLI} set_peer tun${p} $((${p}+9)) 60 120 +for p in $(seq 1 ${OVPN_NUM_PEERS}); do + ip netns exec "${server_ns}" ${OVPN_CLI} set_peer tun0 ${p} 60 120 + ip netns exec "ovpn_peer${p}" ${OVPN_CLI} set_peer tun${p} $((${p}+9)) \ + 60 120 done sleep 1 -for p in $(seq 1 ${NUM_PEERS}); do - ip netns exec peer0 ping -qfc 
500 -w 3 5.5.5.$((${p} + 1)) +for p in $(seq 1 ${OVPN_NUM_PEERS}); do + ip netns exec "${server_ns}" ping -qfc 500 -w 3 5.5.5.$((${p} + 1)) done -ip netns exec peer0 iperf3 -1 -s & +ip netns exec "${server_ns}" iperf3 -1 -s & sleep 1 -ip netns exec peer1 iperf3 -Z -t 3 -c 5.5.5.1 +ip netns exec ovpn_peer1 iperf3 -Z -t 3 -c 5.5.5.1 -cleanup +ovpn_cleanup modprobe -r ovpn || true diff --git a/tools/testing/selftests/net/ovpn/test-float.sh b/tools/testing/selftests/net/ovpn/test-float.sh index ba5d725e18b0..91f8e113718e 100755 --- a/tools/testing/selftests/net/ovpn/test-float.sh +++ b/tools/testing/selftests/net/ovpn/test-float.sh @@ -4,6 +4,6 @@ # # Author: Antonio Quartulli -FLOAT="1" +OVPN_FLOAT="1" source test.sh diff --git a/tools/testing/selftests/net/ovpn/test-mark.sh b/tools/testing/selftests/net/ovpn/test-mark.sh index 8534428ed3eb..2ee5dc5fc538 100755 --- a/tools/testing/selftests/net/ovpn/test-mark.sh +++ b/tools/testing/selftests/net/ovpn/test-mark.sh @@ -11,62 +11,63 @@ set -e MARK=1056 source ./common.sh +server_ns="ovpn_peer0" -cleanup +ovpn_cleanup modprobe -q ovpn || true -for p in $(seq 0 "${NUM_PEERS}"); do - create_ns "${p}" +for p in $(seq 0 "${OVPN_NUM_PEERS}"); do + ovpn_create_ns "${p}" done for p in $(seq 0 3); do - setup_ns "${p}" 5.5.5.$((p + 1))/24 + ovpn_setup_ns "${p}" 5.5.5.$((p + 1))/24 done # add peer0 with mark -ip netns exec peer0 "${OVPN_CLI}" new_multi_peer tun0 1 ASYMM \ - "${UDP_PEERS_FILE}" \ +ip netns exec "${server_ns}" "${OVPN_CLI}" new_multi_peer tun0 1 ASYMM \ + "${OVPN_UDP_PEERS_FILE}" \ ${MARK} for p in $(seq 1 3); do - ip netns exec peer0 "${OVPN_CLI}" new_key tun0 "${p}" 1 0 "${ALG}" 0 \ - data64.key + ip netns exec "${server_ns}" "${OVPN_CLI}" new_key tun0 "${p}" 1 0 \ + "${OVPN_ALG}" 0 data64.key done for p in $(seq 1 3); do - add_peer "${p}" + ovpn_add_peer "${p}" done for p in $(seq 1 3); do - ip netns exec peer0 "${OVPN_CLI}" set_peer tun0 "${p}" 60 120 - ip netns exec peer"${p}" "${OVPN_CLI}" set_peer tun"${p}" \ + ip netns exec "${server_ns}" "${OVPN_CLI}" set_peer tun0 "${p}" 60 120 + ip netns exec "ovpn_peer${p}" "${OVPN_CLI}" set_peer tun"${p}" \ $((p + 9)) 60 120 done sleep 1 for p in $(seq 1 3); do - ip netns exec peer0 ping -qfc 500 -w 3 5.5.5.$((p + 1)) + ip netns exec "${server_ns}" ping -qfc 500 -w 3 5.5.5.$((p + 1)) done echo "Adding an nftables drop rule based on mark value ${MARK}" -ip netns exec peer0 nft flush ruleset -ip netns exec peer0 nft 'add table inet filter' -ip netns exec peer0 nft 'add chain inet filter output { +ip netns exec "${server_ns}" nft flush ruleset +ip netns exec "${server_ns}" nft 'add table inet filter' +ip netns exec "${server_ns}" nft 'add chain inet filter output { type filter hook output priority 0; policy accept; }' -ip netns exec peer0 nft add rule inet filter output \ +ip netns exec "${server_ns}" nft add rule inet filter output \ meta mark == ${MARK} \ counter drop -DROP_COUNTER=$(ip netns exec peer0 nft list chain inet filter output \ +DROP_COUNTER=$(ip netns exec "${server_ns}" nft list chain inet filter output \ | sed -n 's/.*packets \([0-9]*\).*/\1/p') sleep 1 # ping should fail for p in $(seq 1 3); do - PING_OUTPUT=$(ip netns exec peer0 ping \ + PING_OUTPUT=$(ip netns exec "${server_ns}" ping \ -qfc 500 -w 1 5.5.5.$((p + 1)) 2>&1) && exit 1 echo "${PING_OUTPUT}" LOST_PACKETS=$(echo "$PING_OUTPUT" \ @@ -76,7 +77,7 @@ for p in $(seq 1 3); do done # check if the final nft counter matches our counter -TOTAL_COUNT=$(ip netns exec peer0 nft list chain inet filter output \ +TOTAL_COUNT=$(ip 
netns exec "${server_ns}" nft list chain inet filter output \ | sed -n 's/.*packets \([0-9]*\).*/\1/p') if [ "${DROP_COUNTER}" -ne "${TOTAL_COUNT}" ]; then echo "Expected ${TOTAL_COUNT} drops, got ${DROP_COUNTER}" @@ -84,13 +85,13 @@ if [ "${DROP_COUNTER}" -ne "${TOTAL_COUNT}" ]; then fi echo "Removing the drop rule" -ip netns exec peer0 nft flush ruleset +ip netns exec "${server_ns}" nft flush ruleset sleep 1 for p in $(seq 1 3); do - ip netns exec peer0 ping -qfc 500 -w 3 5.5.5.$((p + 1)) + ip netns exec "${server_ns}" ping -qfc 500 -w 3 5.5.5.$((p + 1)) done -cleanup +ovpn_cleanup modprobe -r ovpn || true diff --git a/tools/testing/selftests/net/ovpn/test-symmetric-id-float.sh b/tools/testing/selftests/net/ovpn/test-symmetric-id-float.sh index b3711a81b463..75296fe72c39 100755 --- a/tools/testing/selftests/net/ovpn/test-symmetric-id-float.sh +++ b/tools/testing/selftests/net/ovpn/test-symmetric-id-float.sh @@ -5,7 +5,7 @@ # Author: Ralf Lici # Antonio Quartulli -SYMMETRIC_ID="1" -FLOAT="1" +OVPN_SYMMETRIC_ID="1" +OVPN_FLOAT="1" source test.sh diff --git a/tools/testing/selftests/net/ovpn/test-symmetric-id-tcp.sh b/tools/testing/selftests/net/ovpn/test-symmetric-id-tcp.sh index 188cafb67b2f..680a465c49d2 100755 --- a/tools/testing/selftests/net/ovpn/test-symmetric-id-tcp.sh +++ b/tools/testing/selftests/net/ovpn/test-symmetric-id-tcp.sh @@ -5,7 +5,7 @@ # Author: Ralf Lici # Antonio Quartulli -PROTO="TCP" -SYMMETRIC_ID=1 +OVPN_PROTO="TCP" +OVPN_SYMMETRIC_ID=1 source test.sh diff --git a/tools/testing/selftests/net/ovpn/test-symmetric-id.sh b/tools/testing/selftests/net/ovpn/test-symmetric-id.sh index 35b119c72e4f..a2e2808959d9 100755 --- a/tools/testing/selftests/net/ovpn/test-symmetric-id.sh +++ b/tools/testing/selftests/net/ovpn/test-symmetric-id.sh @@ -5,6 +5,6 @@ # Author: Ralf Lici # Antonio Quartulli -SYMMETRIC_ID="1" +OVPN_SYMMETRIC_ID="1" source test.sh diff --git a/tools/testing/selftests/net/ovpn/test-tcp.sh b/tools/testing/selftests/net/ovpn/test-tcp.sh index ba3f1f315a34..27cc6e7b98bc 100755 --- a/tools/testing/selftests/net/ovpn/test-tcp.sh +++ b/tools/testing/selftests/net/ovpn/test-tcp.sh @@ -4,6 +4,6 @@ # # Author: Antonio Quartulli -PROTO="TCP" +OVPN_PROTO="TCP" source test.sh diff --git a/tools/testing/selftests/net/ovpn/test.sh b/tools/testing/selftests/net/ovpn/test.sh index b60e94a4094e..b766f4842940 100755 --- a/tools/testing/selftests/net/ovpn/test.sh +++ b/tools/testing/selftests/net/ovpn/test.sh @@ -8,37 +8,38 @@ set -e source ./common.sh +server_ns="ovpn_peer0" -cleanup +ovpn_cleanup modprobe -q ovpn || true -for p in $(seq 0 ${NUM_PEERS}); do - create_ns ${p} +for p in $(seq 0 ${OVPN_NUM_PEERS}); do + ovpn_create_ns ${p} done -for p in $(seq 0 ${NUM_PEERS}); do - setup_listener ${p} +for p in $(seq 0 ${OVPN_NUM_PEERS}); do + ovpn_setup_listener ${p} done -for p in $(seq 0 ${NUM_PEERS}); do - setup_ns ${p} 5.5.5.$((${p} + 1))/24 ${MTU} +for p in $(seq 0 ${OVPN_NUM_PEERS}); do + ovpn_setup_ns ${p} 5.5.5.$((${p} + 1))/24 ${MTU} done -for p in $(seq 0 ${NUM_PEERS}); do - add_peer ${p} +for p in $(seq 0 ${OVPN_NUM_PEERS}); do + ovpn_add_peer ${p} done -for p in $(seq 1 ${NUM_PEERS}); do - ip netns exec peer0 ${OVPN_CLI} set_peer tun0 ${p} 60 120 - ip netns exec peer${p} ${OVPN_CLI} set_peer tun${p} \ - $((${p}+ID_OFFSET)) 60 120 +for p in $(seq 1 ${OVPN_NUM_PEERS}); do + ip netns exec "${server_ns}" ${OVPN_CLI} set_peer tun0 ${p} 60 120 + ip netns exec "ovpn_peer${p}" ${OVPN_CLI} set_peer tun${p} \ + $((${p}+OVPN_ID_OFFSET)) 60 120 done sleep 1 TCPDUMP_TIMEOUT="1.5s" 
-for p in $(seq 1 ${NUM_PEERS}); do +for p in $(seq 1 ${OVPN_NUM_PEERS}); do # The first part of the data packet header consists of: # - TCP only: 2 bytes for the packet length # - 5 bits for opcode ("9" for DATA_V2) @@ -47,119 +48,121 @@ for p in $(seq 1 ${NUM_PEERS}); do # - with asymmetric ID: "${p}" one way and "${p} + 9" the other way # - with symmetric ID: "${p}" both ways HEADER1=$(printf "0x4800000%x" ${p}) - HEADER2=$(printf "0x4800000%x" $((${p} + ID_OFFSET))) + HEADER2=$(printf "0x4800000%x" $((${p} + OVPN_ID_OFFSET))) RADDR="" - if [ "${PROTO}" == "UDP" ]; then - RADDR=$(awk "NR == ${p} {print \$3}" ${UDP_PEERS_FILE}) + if [ "${OVPN_PROTO}" == "UDP" ]; then + RADDR=$(awk "NR == ${p} {print \$3}" ${OVPN_UDP_PEERS_FILE}) fi - timeout ${TCPDUMP_TIMEOUT} ip netns exec peer${p} \ + timeout ${TCPDUMP_TIMEOUT} ip netns exec "ovpn_peer${p}" \ tcpdump --immediate-mode -p -ni veth${p} -c 1 \ - "$(build_capture_filter "${HEADER1}" "${RADDR}")" \ + "$(ovpn_build_capture_filter "${HEADER1}" "${RADDR}")" \ >/dev/null 2>&1 & TCPDUMP_PID1=$! - timeout ${TCPDUMP_TIMEOUT} ip netns exec peer${p} \ + timeout ${TCPDUMP_TIMEOUT} ip netns exec "ovpn_peer${p}" \ tcpdump --immediate-mode -p -ni veth${p} -c 1 \ - "$(build_capture_filter "${HEADER2}" "${RADDR}")" \ + "$(ovpn_build_capture_filter "${HEADER2}" "${RADDR}")" \ >/dev/null 2>&1 & TCPDUMP_PID2=$! sleep 0.3 - ip netns exec peer0 ping -qfc 500 -w 3 5.5.5.$((${p} + 1)) - ip netns exec peer0 ping -qfc 500 -s 3000 -w 3 5.5.5.$((${p} + 1)) + ip netns exec "${server_ns}" ping -qfc 500 -w 3 5.5.5.$((${p} + 1)) + ip netns exec "${server_ns}" ping -qfc 500 -s 3000 -w 3 \ + 5.5.5.$((${p} + 1)) wait ${TCPDUMP_PID1} wait ${TCPDUMP_PID2} done # ping LAN behind client 1 -ip netns exec peer0 ping -qfc 500 -w 3 ${LAN_IP} +ip netns exec "${server_ns}" ping -qfc 500 -w 3 ${OVPN_LAN_IP} -if [ "$FLOAT" == "1" ]; then +if [ "$OVPN_FLOAT" == "1" ]; then # make clients float.. 
- for p in $(seq 1 ${NUM_PEERS}); do - ip -n peer${p} addr del 10.10.${p}.2/24 dev veth${p} - ip -n peer${p} addr add 10.10.${p}.3/24 dev veth${p} + for p in $(seq 1 ${OVPN_NUM_PEERS}); do + ip -n "ovpn_peer${p}" addr del 10.10.${p}.2/24 dev veth${p} + ip -n "ovpn_peer${p}" addr add 10.10.${p}.3/24 dev veth${p} done - for p in $(seq 1 ${NUM_PEERS}); do - ip netns exec peer${p} ping -qfc 500 -w 3 5.5.5.1 + for p in $(seq 1 ${OVPN_NUM_PEERS}); do + ip netns exec "ovpn_peer${p}" ping -qfc 500 -w 3 5.5.5.1 done fi -ip netns exec peer0 iperf3 -1 -s & +ip netns exec "${server_ns}" iperf3 -1 -s & sleep 1 -ip netns exec peer1 iperf3 -Z -t 3 -c 5.5.5.1 +ip netns exec ovpn_peer1 iperf3 -Z -t 3 -c 5.5.5.1 echo "Adding secondary key and then swap:" -for p in $(seq 1 ${NUM_PEERS}); do - ip netns exec peer0 ${OVPN_CLI} new_key tun0 ${p} 2 1 ${ALG} 0 \ - data64.key - ip netns exec peer${p} ${OVPN_CLI} new_key tun${p} \ - $((${p} + ID_OFFSET)) 2 1 ${ALG} 1 data64.key - ip netns exec peer${p} ${OVPN_CLI} swap_keys tun${p} \ - $((${p} + ID_OFFSET)) +for p in $(seq 1 ${OVPN_NUM_PEERS}); do + ip netns exec "${server_ns}" ${OVPN_CLI} new_key tun0 ${p} 2 1 \ + ${OVPN_ALG} 0 data64.key + ip netns exec "ovpn_peer${p}" ${OVPN_CLI} new_key tun${p} \ + $((${p} + OVPN_ID_OFFSET)) 2 1 ${OVPN_ALG} 1 data64.key + ip netns exec "ovpn_peer${p}" ${OVPN_CLI} swap_keys tun${p} \ + $((${p} + OVPN_ID_OFFSET)) done sleep 1 echo "Querying all peers:" -ip netns exec peer0 ${OVPN_CLI} get_peer tun0 -ip netns exec peer1 ${OVPN_CLI} get_peer tun1 +ip netns exec "${server_ns}" ${OVPN_CLI} get_peer tun0 +ip netns exec ovpn_peer1 ${OVPN_CLI} get_peer tun1 echo "Querying peer 1:" -ip netns exec peer0 ${OVPN_CLI} get_peer tun0 1 +ip netns exec "${server_ns}" ${OVPN_CLI} get_peer tun0 1 echo "Querying non-existent peer 20:" -ip netns exec peer0 ${OVPN_CLI} get_peer tun0 20 || true +ip netns exec "${server_ns}" ${OVPN_CLI} get_peer tun0 20 || true echo "Deleting peer 1:" -ip netns exec peer0 ${OVPN_CLI} del_peer tun0 1 -ip netns exec peer1 ${OVPN_CLI} del_peer tun1 $((1 + ID_OFFSET)) +ip netns exec "${server_ns}" ${OVPN_CLI} del_peer tun0 1 +ip netns exec ovpn_peer1 ${OVPN_CLI} del_peer tun1 $((1 + OVPN_ID_OFFSET)) echo "Querying keys:" -for p in $(seq 2 ${NUM_PEERS}); do - ip netns exec peer${p} ${OVPN_CLI} get_key tun${p} \ - $((${p} + ID_OFFSET)) 1 - ip netns exec peer${p} ${OVPN_CLI} get_key tun${p} \ - $((${p} + ID_OFFSET)) 2 +for p in $(seq 2 ${OVPN_NUM_PEERS}); do + ip netns exec "ovpn_peer${p}" ${OVPN_CLI} get_key tun${p} \ + $((${p} + OVPN_ID_OFFSET)) 1 + ip netns exec "ovpn_peer${p}" ${OVPN_CLI} get_key tun${p} \ + $((${p} + OVPN_ID_OFFSET)) 2 done echo "Deleting peer while sending traffic:" -(ip netns exec peer2 ping -qf -w 4 5.5.5.1)& +(ip netns exec ovpn_peer2 ping -qf -w 4 5.5.5.1)& sleep 2 -ip netns exec peer0 ${OVPN_CLI} del_peer tun0 2 +ip netns exec "${server_ns}" ${OVPN_CLI} del_peer tun0 2 # following command fails in TCP mode # (both ends get conn reset when one peer disconnects) -ip netns exec peer2 ${OVPN_CLI} del_peer tun2 $((2 + ID_OFFSET)) || true +ip netns exec ovpn_peer2 ${OVPN_CLI} del_peer tun2 $((2 + OVPN_ID_OFFSET)) || \ + true echo "Deleting keys:" -for p in $(seq 3 ${NUM_PEERS}); do - ip netns exec peer${p} ${OVPN_CLI} del_key tun${p} \ - $((${p} + ID_OFFSET)) 1 - ip netns exec peer${p} ${OVPN_CLI} del_key tun${p} \ - $((${p} + ID_OFFSET)) 2 +for p in $(seq 3 ${OVPN_NUM_PEERS}); do + ip netns exec "ovpn_peer${p}" ${OVPN_CLI} del_key tun${p} \ + $((${p} + OVPN_ID_OFFSET)) 1 + ip netns exec "ovpn_peer${p}" 
${OVPN_CLI} del_key tun${p} \
+		$((${p} + OVPN_ID_OFFSET)) 2
 done
 
 echo "Setting timeout to 3s MP:"
-for p in $(seq 3 ${NUM_PEERS}); do
-	ip netns exec peer0 ${OVPN_CLI} set_peer tun0 ${p} 3 3 || true
-	ip netns exec peer${p} ${OVPN_CLI} set_peer tun${p} \
-		$((${p} + ID_OFFSET)) 0 0
+for p in $(seq 3 ${OVPN_NUM_PEERS}); do
+	ip netns exec "${server_ns}" ${OVPN_CLI} set_peer tun0 ${p} 3 3 || true
+	ip netns exec "ovpn_peer${p}" ${OVPN_CLI} set_peer tun${p} \
+		$((${p} + OVPN_ID_OFFSET)) 0 0
 done
 # wait for peers to timeout
 sleep 5
 
 echo "Setting timeout to 3s P2P:"
-for p in $(seq 3 ${NUM_PEERS}); do
-	ip netns exec peer${p} ${OVPN_CLI} set_peer tun${p} \
-		$((${p} + ID_OFFSET)) 3 3
+for p in $(seq 3 ${OVPN_NUM_PEERS}); do
+	ip netns exec "ovpn_peer${p}" ${OVPN_CLI} set_peer tun${p} \
+		$((${p} + OVPN_ID_OFFSET)) 3 3
 done
 sleep 5
 
-for p in $(seq 0 ${NUM_PEERS}); do
-	compare_ntfs ${p}
+for p in $(seq 0 ${OVPN_NUM_PEERS}); do
+	ovpn_compare_ntfs ${p}
 done
 
-cleanup
+ovpn_cleanup
 
 modprobe -r ovpn || true

From 1be93bb979ab02554541b04406e9e3a6a8e0ce9e Mon Sep 17 00:00:00 2001
From: Ralf Lici
Date: Mon, 23 Mar 2026 15:12:32 +0100
Subject: [PATCH 023/148] selftests: ovpn: align command flow with TAP

The current tests do not properly adhere to the TAP infrastructure; as
a result, they do not properly report failures, which leads to hangs of
the CI machinery.

Restructure the ovpn selftests to use the TAP infrastructure: split
each test into stages, execute stage bodies with fail-fast semantics,
and emit a KTAP pass/fail result for each stage. Centralize behavior
control in common.sh and make the scripts use dedicated wrappers for
required-success, expected-failure, and non-fatal commands. Also add an
OVPN_VERBOSE mode that exposes captured command output for debugging.
This way, tests no longer hang on failure when executed within the CI
machinery.

This change also makes the default OVPN_CLI and YNL resolution
independent of the caller's CWD by anchoring both to OVPN_COMMON_DIR,
so behavior is stable across direct execution and run_tests-style
execution.
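A minimal sketch of how a converted script composes the new wrappers
(the stage body and the commands inside it are illustrative only):

	set -eE
	source ./common.sh
	trap ovpn_stage_err ERR

	ktap_print_header
	ktap_set_plan 1

	demo_stage() {
		# required command: a failure fails the whole stage
		ovpn_cmd_ok "ping loopback" ping -qc 1 127.0.0.1
		# tolerated command: outcome is ignored
		ovpn_cmd_mayfail "optional probe (non-fatal)" false
	}

	ovpn_run_stage "demo stage" demo_stage
	ktap_finished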
Fixes: 959bc330a439 ("testing/selftests: add test tool and scripts for ovpn module") Signed-off-by: Ralf Lici Signed-off-by: Antonio Quartulli --- tools/testing/selftests/net/ovpn/common.sh | 186 +++++++- .../selftests/net/ovpn/test-close-socket.sh | 104 +++-- tools/testing/selftests/net/ovpn/test-mark.sh | 234 ++++++---- tools/testing/selftests/net/ovpn/test.sh | 427 ++++++++++++------ 4 files changed, 677 insertions(+), 274 deletions(-) diff --git a/tools/testing/selftests/net/ovpn/common.sh b/tools/testing/selftests/net/ovpn/common.sh index 38f187b9de23..2d844eb3aa6e 100644 --- a/tools/testing/selftests/net/ovpn/common.sh +++ b/tools/testing/selftests/net/ovpn/common.sh @@ -4,14 +4,18 @@ # # Author: Antonio Quartulli +OVPN_COMMON_DIR=$(dirname "$(readlink -f "${BASH_SOURCE[0]}")") +source "$OVPN_COMMON_DIR/../../kselftest/ktap_helpers.sh" + OVPN_UDP_PEERS_FILE=${OVPN_UDP_PEERS_FILE:-udp_peers.txt} OVPN_TCP_PEERS_FILE=${OVPN_TCP_PEERS_FILE:-tcp_peers.txt} -OVPN_CLI=${OVPN_CLI:-./ovpn-cli} -OVPN_YNL=${OVPN_YNL:-../../../../net/ynl/pyynl/cli.py} +OVPN_CLI=${OVPN_CLI:-${OVPN_COMMON_DIR}/ovpn-cli} +OVPN_YNL=${OVPN_YNL:-${OVPN_COMMON_DIR}/../../../../net/ynl/pyynl/cli.py} OVPN_ALG=${OVPN_ALG:-aes} OVPN_PROTO=${OVPN_PROTO:-UDP} OVPN_FLOAT=${OVPN_FLOAT:-0} OVPN_SYMMETRIC_ID=${OVPN_SYMMETRIC_ID:-0} +OVPN_VERBOSE=${OVPN_VERBOSE:-0} export OVPN_ID_OFFSET=$(( 9 * (OVPN_SYMMETRIC_ID == 0) )) @@ -22,6 +26,111 @@ OVPN_LAN_IP="11.11.11.11" declare -A OVPN_TMP_JSONS=() declare -A OVPN_LISTENER_PIDS=() +OVPN_CURRENT_STAGE="" + +ovpn_is_verbose() { + [[ "${OVPN_VERBOSE}" == "1" ]] +} + +ovpn_log() { + ovpn_is_verbose || return 0 + printf '%s\n' "$*" +} + +ovpn_print_cmd_output() { + local output_file="$1" + local line + + [[ -s "${output_file}" ]] || return 0 + + while IFS= read -r line; do + ovpn_log "${line}" + done < "${output_file}" +} + +ovpn_cmd_run() { + local mode="$1" + local label="$2" + local output_file + local rc + local ret=0 + + shift 2 + + output_file=$(mktemp) + if "$@" >"${output_file}" 2>&1; then + rc=0 + else + rc=$? + fi + + case "${mode}" in + ok) + if [[ "${rc}" -ne 0 ]]; then + cat "${output_file}" + printf '%s\n' \ + "${label}: command failed with rc=${rc}: $*" + ret="${rc}" + fi + ;; + mayfail) + ;; + fail) + [[ "${rc}" -eq 0 ]] && ret=1 + ;; + esac + + if ovpn_is_verbose && [[ "${rc}" -eq 0 || "${mode}" != "ok" ]]; then + ovpn_print_cmd_output "${output_file}" + fi + + rm -f "${output_file}" + return "${ret}" +} + +ovpn_cmd_ok() { + ovpn_cmd_run ok "$@" +} + +ovpn_cmd_mayfail() { + ovpn_cmd_run mayfail "$@" +} + +ovpn_cmd_fail() { + ovpn_cmd_run fail "$@" +} + +ovpn_run_bg() { + local pid_var="$1" + + shift + if ovpn_is_verbose; then + "$@" & + else + "$@" >/dev/null 2>&1 & + fi + + printf -v "${pid_var}" '%s' "$!" +} + +ovpn_run_stage() { + local label="$1" + + shift + OVPN_CURRENT_STAGE="${label}" + "$@" + OVPN_CURRENT_STAGE="" + ktap_test_pass "${label}" +} + +ovpn_stage_err() { + # ERR trap is global under set -eE: only report failures that happen + # while ovpn_run_stage() is actively executing a stage body. 
+ if [[ -n "${OVPN_CURRENT_STAGE}" ]]; then + ktap_test_fail "${OVPN_CURRENT_STAGE}" + OVPN_CURRENT_STAGE="" + fi +} ovpn_create_ns() { ip netns add "ovpn_peer${1}" @@ -87,12 +196,16 @@ ovpn_build_capture_filter() { } ovpn_setup_listener() { - local peer_ns="ovpn_peer${p}" + local peer="$1" + local file + local peer_ns="ovpn_peer${peer}" + file=$(mktemp) PYTHONUNBUFFERED=1 ip netns exec "${peer_ns}" "${OVPN_YNL}" --family \ - ovpn --subscribe peers --output-json --duration 40 > ${file} & - OVPN_LISTENER_PIDS[$1]=$! - OVPN_TMP_JSONS[$1]="${file}" + ovpn --subscribe peers --output-json > "${file}" \ + 2>/dev/null & + OVPN_LISTENER_PIDS["${peer}"]=$! + OVPN_TMP_JSONS["${peer}"]="${file}" } ovpn_add_peer() { @@ -173,8 +286,7 @@ ovpn_compare_ntfs() { received="${OVPN_TMP_JSONS[$1]}" diff_file=$(mktemp) - kill -TERM ${OVPN_LISTENER_PIDS[$1]} || true - wait ${OVPN_LISTENER_PIDS[$1]} || true + ovpn_stop_listener "${1}" 1 printf "Checking notifications for peer ${1}... " if diff <(jq -s "${OVPN_JQ_FILTER}" ${expected}) \ <(jq -s "${OVPN_JQ_FILTER}" ${received}) \ @@ -187,30 +299,60 @@ ovpn_compare_ntfs() { fi rm -f "${diff_file}" || true - rm -f ${received} || true + rm -f "${received}" || true + unset "OVPN_TMP_JSONS[$1]" fi return "${diff_rc}" } -ovpn_cleanup() { - local peer_ns - # some ovpn-cli processes sleep in background so they need manual poking - killall $(basename ${OVPN_CLI}) 2>/dev/null || true +ovpn_stop_listener() { + local peer="$1" + local keep_json="${2:-0}" + local pid="${OVPN_LISTENER_PIDS[$peer]:-}" + local json="${OVPN_TMP_JSONS[$peer]:-}" - # netns peer0 is deleted without erasing ifaces first - for p in $(seq 1 10); do - peer_ns="ovpn_peer${p}" - ip -n "${peer_ns}" link set tun${p} down 2>/dev/null || true - ip netns exec "${peer_ns}" ${OVPN_CLI} del_iface tun${p} \ - 2>/dev/null || true + if [[ -n "${pid}" ]]; then + kill -TERM "${pid}" 2>/dev/null || true + wait "${pid}" 2>/dev/null || true + unset "OVPN_LISTENER_PIDS[$peer]" + fi + + if [[ -n "${json}" && "${keep_json}" -eq 0 ]]; then + rm -f "${json}" || true + unset "OVPN_TMP_JSONS[$peer]" + fi +} + +ovpn_cleanup_peer_ns() { + local peer="$1" + local peer_id="${peer#ovpn_peer}" + + ip -n "${peer}" link set tun${peer_id} down 2>/dev/null || true + ip netns exec "${peer}" ${OVPN_CLI} del_iface tun${peer_id} \ + 1>/dev/null 2>&1 || true + ip netns del "${peer}" 2>/dev/null || true +} + +ovpn_cleanup() { + local peer + + # some ovpn-cli processes sleep in background so they need manual poking + killall "$(basename "${OVPN_CLI}")" 2>/dev/null || true + + for peer in "${!OVPN_LISTENER_PIDS[@]}"; do + ovpn_stop_listener "${peer}" 2>/dev/null done + for p in $(seq 1 10); do ip -n ovpn_peer0 link del veth${p} 2>/dev/null || true done - for p in $(seq 0 10); do - ip netns del "ovpn_peer${p}" 2>/dev/null || true - done + + # remove from ovpn's netns pool + while IFS= read -r peer; do + [[ -n "${peer}" ]] || continue + ovpn_cleanup_peer_ns "${peer}" 2>/dev/null + done < <(ip netns list 2>/dev/null | awk '/^ovpn_/ {print $1}') } if [ "${OVPN_PROTO}" == "UDP" ]; then diff --git a/tools/testing/selftests/net/ovpn/test-close-socket.sh b/tools/testing/selftests/net/ovpn/test-close-socket.sh index 6bc1b6eab8ac..af1532b4d2da 100755 --- a/tools/testing/selftests/net/ovpn/test-close-socket.sh +++ b/tools/testing/selftests/net/ovpn/test-close-socket.sh @@ -5,43 +5,81 @@ # Author: Antonio Quartulli #set -x -set -e +set -eE source ./common.sh -server_ns="ovpn_peer0" + +ovpn_test_finished=0 + +ovpn_test_exit() { + ovpn_cleanup + modprobe -r 
ovpn || true + + if [ "${ovpn_test_finished}" -eq 0 ]; then + ktap_print_totals + fi +} + +ovpn_prepare_network() { + local p + local peer_ns + + for p in $(seq 0 ${OVPN_NUM_PEERS}); do + ovpn_cmd_ok "create namespace peer${p}" ovpn_create_ns "${p}" + done + + for p in $(seq 0 ${OVPN_NUM_PEERS}); do + ovpn_cmd_ok "configure peer${p} namespace" ovpn_setup_ns \ + "${p}" 5.5.5.$((p + 1))/24 + done + + for p in $(seq 0 ${OVPN_NUM_PEERS}); do + ovpn_cmd_ok "register peer${p} in overlay" ovpn_add_peer "${p}" + done + + for p in $(seq 1 ${OVPN_NUM_PEERS}); do + peer_ns="ovpn_peer${p}" + ovpn_cmd_ok "set peer0 timeout for peer ${p}" \ + ip netns exec ovpn_peer0 ${OVPN_CLI} set_peer tun0 \ + ${p} 60 120 + ovpn_cmd_ok "set peer${p} timeout for peer ${p}" \ + ip netns exec "${peer_ns}" ${OVPN_CLI} set_peer \ + tun${p} $((p + OVPN_ID_OFFSET)) 60 120 + done +} + +ovpn_run_ping_traffic() { + local p + + for p in $(seq 1 ${OVPN_NUM_PEERS}); do + ovpn_cmd_ok "send ping traffic to peer ${p}" \ + ip netns exec ovpn_peer0 ping -qfc 500 -w 3 \ + 5.5.5.$((p + 1)) + done +} + +ovpn_run_iperf() { + local iperf_pid + + ovpn_run_bg iperf_pid ip netns exec ovpn_peer0 iperf3 -1 -s + sleep 1 + ovpn_cmd_ok "run iperf throughput flow" \ + ip netns exec ovpn_peer1 iperf3 -Z -t 3 -c 5.5.5.1 + wait "${iperf_pid}" || return 1 +} + +trap ovpn_test_exit EXIT +trap ovpn_stage_err ERR + +ktap_print_header +ktap_set_plan 3 ovpn_cleanup - modprobe -q ovpn || true -for p in $(seq 0 ${OVPN_NUM_PEERS}); do - ovpn_create_ns ${p} -done +ovpn_run_stage "setup network topology" ovpn_prepare_network +ovpn_run_stage "run ping traffic" ovpn_run_ping_traffic +ovpn_run_stage "run iperf throughput" ovpn_run_iperf -for p in $(seq 0 ${OVPN_NUM_PEERS}); do - ovpn_setup_ns ${p} 5.5.5.$((${p} + 1))/24 -done - -for p in $(seq 0 ${OVPN_NUM_PEERS}); do - ovpn_add_peer ${p} -done - -for p in $(seq 1 ${OVPN_NUM_PEERS}); do - ip netns exec "${server_ns}" ${OVPN_CLI} set_peer tun0 ${p} 60 120 - ip netns exec "ovpn_peer${p}" ${OVPN_CLI} set_peer tun${p} $((${p}+9)) \ - 60 120 -done - -sleep 1 - -for p in $(seq 1 ${OVPN_NUM_PEERS}); do - ip netns exec "${server_ns}" ping -qfc 500 -w 3 5.5.5.$((${p} + 1)) -done - -ip netns exec "${server_ns}" iperf3 -1 -s & -sleep 1 -ip netns exec ovpn_peer1 iperf3 -Z -t 3 -c 5.5.5.1 - -ovpn_cleanup - -modprobe -r ovpn || true +ovpn_test_finished=1 +ktap_finished diff --git a/tools/testing/selftests/net/ovpn/test-mark.sh b/tools/testing/selftests/net/ovpn/test-mark.sh index 2ee5dc5fc538..5a8f47554286 100755 --- a/tools/testing/selftests/net/ovpn/test-mark.sh +++ b/tools/testing/selftests/net/ovpn/test-mark.sh @@ -6,92 +6,166 @@ # Antonio Quartulli #set -x -set -e +set -eE MARK=1056 +MARK_DROP_COUNTER=0 source ./common.sh -server_ns="ovpn_peer0" + +ovpn_test_finished=0 + +ovpn_test_exit() { + ovpn_cleanup + modprobe -r ovpn || true + + if [ "${ovpn_test_finished}" -eq 0 ]; then + ktap_print_totals + fi +} + +ovpn_mark_prepare_network() { + local p + local peer_ns + + for p in $(seq 0 "${OVPN_NUM_PEERS}"); do + ovpn_cmd_ok "create namespace peer${p}" ovpn_create_ns "${p}" + done + + for p in $(seq 0 3); do + ovpn_cmd_ok "configure peer${p} namespace" ovpn_setup_ns \ + "${p}" 5.5.5.$((p + 1))/24 + done + + ovpn_cmd_ok "create server-side multi-peer with fwmark" \ + ip netns exec ovpn_peer0 "${OVPN_CLI}" new_multi_peer tun0 1 \ + ASYMM "${OVPN_UDP_PEERS_FILE}" "${MARK}" + for p in $(seq 1 3); do + ovpn_cmd_ok "install server key for peer ${p}" \ + ip netns exec ovpn_peer0 "${OVPN_CLI}" new_key tun0 \ + "${p}" 1 0 "${OVPN_ALG}" 
0 data64.key + done + + for p in $(seq 1 3); do + ovpn_cmd_ok "register peer${p} in overlay" ovpn_add_peer "${p}" + done + + for p in $(seq 1 3); do + peer_ns="ovpn_peer${p}" + ovpn_cmd_ok "set peer0 timeout for peer ${p}" \ + ip netns exec ovpn_peer0 "${OVPN_CLI}" set_peer tun0 \ + "${p}" 60 120 + ovpn_cmd_ok "set peer${p} timeout for peer ${p}" \ + ip netns exec "${peer_ns}" "${OVPN_CLI}" set_peer \ + tun"${p}" $((p + OVPN_ID_OFFSET)) 60 120 + done +} + +ovpn_mark_run_baseline_traffic() { + local p + + for p in $(seq 1 3); do + ovpn_cmd_ok "send baseline traffic to peer ${p}" \ + ip netns exec ovpn_peer0 ping -qfc 500 -w 3 \ + 5.5.5.$((p + 1)) + done +} + +ovpn_mark_add_drop_rule() { + ovpn_log "Adding an nftables drop rule based on mark value ${MARK}" + + ovpn_cmd_ok "flush nft ruleset" ip netns exec ovpn_peer0 nft flush \ + ruleset + ovpn_cmd_ok "create nft filter table" ip netns exec ovpn_peer0 nft \ + "add table inet filter" + ovpn_cmd_ok "create nft filter output chain" \ + ip netns exec ovpn_peer0 nft "add chain inet filter output { \ + type filter hook output priority 0; policy accept; }" + ovpn_cmd_ok "add nft drop rule for mark ${MARK}" \ + ip netns exec ovpn_peer0 nft add rule inet filter output \ + meta mark == "${MARK}" \ + counter drop + + MARK_DROP_COUNTER=$(ip netns exec ovpn_peer0 nft list chain inet \ + filter output | sed -n 's/.*packets \([0-9]*\).*/\1/p') + if [ -z "${MARK_DROP_COUNTER}" ]; then + printf '%s\n' "unable to read nft drop counter" + return 1 + fi +} + +ovpn_mark_verify_drop_traffic() { + local p + local ping_output + local lost_packets + local total_count + + for p in $(seq 1 3); do + if ping_output=$(ip netns exec ovpn_peer0 ping -qfc 500 -w 1 \ + 5.5.5.$((p + 1)) 2>&1); then + printf '%s\n' "expected ping to peer ${p} to fail \ + after nft drop rule" + return 1 + fi + ovpn_log "${ping_output}" + lost_packets=$(echo "${ping_output}" | \ + awk '/packets transmitted/ { print $1 }') + if [ -z "${lost_packets}" ]; then + printf '%s\n' "unable to parse lost packets for peer \ + ${p}" + return 1 + fi + MARK_DROP_COUNTER=$((MARK_DROP_COUNTER + lost_packets)) + done + + total_count=$(ip netns exec ovpn_peer0 nft list chain inet filter \ + output | sed -n 's/.*packets \([0-9]*\).*/\1/p') + if [ -z "${total_count}" ]; then + printf '%s\n' "unable to read final nft drop counter" + return 1 + fi + if [ "${MARK_DROP_COUNTER}" -ne "${total_count}" ]; then + printf '%s\n' "expected ${MARK_DROP_COUNTER} drops, got \ + ${total_count}" + return 1 + fi +} + +ovpn_mark_remove_drop_rule() { + ovpn_log "Removing the drop rule" + + ovpn_cmd_ok "flush nft ruleset" ip netns exec ovpn_peer0 nft flush \ + ruleset +} + +ovpn_mark_verify_traffic_recovery() { + local p + + sleep 1 + for p in $(seq 1 3); do + ovpn_cmd_ok "send recovery traffic to peer ${p}" \ + ip netns exec ovpn_peer0 ping -qfc 500 -w 3 \ + 5.5.5.$((p + 1)) + done +} + +trap ovpn_test_exit EXIT +trap ovpn_stage_err ERR + +ktap_print_header +ktap_set_plan 6 ovpn_cleanup - modprobe -q ovpn || true -for p in $(seq 0 "${OVPN_NUM_PEERS}"); do - ovpn_create_ns "${p}" -done +ovpn_run_stage "setup marked network topology" ovpn_mark_prepare_network +ovpn_run_stage "run baseline traffic" ovpn_mark_run_baseline_traffic +ovpn_run_stage "install nft mark drop rule" ovpn_mark_add_drop_rule +ovpn_run_stage "drop marked traffic and count packets" \ + ovpn_mark_verify_drop_traffic +ovpn_run_stage "remove nft drop rule" ovpn_mark_remove_drop_rule +ovpn_run_stage "traffic recovers after drop removal" \ + 
ovpn_mark_verify_traffic_recovery -for p in $(seq 0 3); do - ovpn_setup_ns "${p}" 5.5.5.$((p + 1))/24 -done - -# add peer0 with mark -ip netns exec "${server_ns}" "${OVPN_CLI}" new_multi_peer tun0 1 ASYMM \ - "${OVPN_UDP_PEERS_FILE}" \ - ${MARK} -for p in $(seq 1 3); do - ip netns exec "${server_ns}" "${OVPN_CLI}" new_key tun0 "${p}" 1 0 \ - "${OVPN_ALG}" 0 data64.key -done - -for p in $(seq 1 3); do - ovpn_add_peer "${p}" -done - -for p in $(seq 1 3); do - ip netns exec "${server_ns}" "${OVPN_CLI}" set_peer tun0 "${p}" 60 120 - ip netns exec "ovpn_peer${p}" "${OVPN_CLI}" set_peer tun"${p}" \ - $((p + 9)) 60 120 -done - -sleep 1 - -for p in $(seq 1 3); do - ip netns exec "${server_ns}" ping -qfc 500 -w 3 5.5.5.$((p + 1)) -done - -echo "Adding an nftables drop rule based on mark value ${MARK}" -ip netns exec "${server_ns}" nft flush ruleset -ip netns exec "${server_ns}" nft 'add table inet filter' -ip netns exec "${server_ns}" nft 'add chain inet filter output { - type filter hook output priority 0; - policy accept; -}' -ip netns exec "${server_ns}" nft add rule inet filter output \ - meta mark == ${MARK} \ - counter drop - -DROP_COUNTER=$(ip netns exec "${server_ns}" nft list chain inet filter output \ - | sed -n 's/.*packets \([0-9]*\).*/\1/p') -sleep 1 - -# ping should fail -for p in $(seq 1 3); do - PING_OUTPUT=$(ip netns exec "${server_ns}" ping \ - -qfc 500 -w 1 5.5.5.$((p + 1)) 2>&1) && exit 1 - echo "${PING_OUTPUT}" - LOST_PACKETS=$(echo "$PING_OUTPUT" \ - | awk '/packets transmitted/ { print $1 }') - # increment the drop counter by the amount of lost packets - DROP_COUNTER=$((DROP_COUNTER + LOST_PACKETS)) -done - -# check if the final nft counter matches our counter -TOTAL_COUNT=$(ip netns exec "${server_ns}" nft list chain inet filter output \ - | sed -n 's/.*packets \([0-9]*\).*/\1/p') -if [ "${DROP_COUNTER}" -ne "${TOTAL_COUNT}" ]; then - echo "Expected ${TOTAL_COUNT} drops, got ${DROP_COUNTER}" - exit 1 -fi - -echo "Removing the drop rule" -ip netns exec "${server_ns}" nft flush ruleset -sleep 1 - -for p in $(seq 1 3); do - ip netns exec "${server_ns}" ping -qfc 500 -w 3 5.5.5.$((p + 1)) -done - -ovpn_cleanup - -modprobe -r ovpn || true +ovpn_test_finished=1 +ktap_finished diff --git a/tools/testing/selftests/net/ovpn/test.sh b/tools/testing/selftests/net/ovpn/test.sh index b766f4842940..eca653112aeb 100755 --- a/tools/testing/selftests/net/ovpn/test.sh +++ b/tools/testing/selftests/net/ovpn/test.sh @@ -5,164 +5,313 @@ # Author: Antonio Quartulli #set -x -set -e +set -eE source ./common.sh -server_ns="ovpn_peer0" -ovpn_cleanup +ovpn_test_finished=0 -modprobe -q ovpn || true +ovpn_test_exit() { + ovpn_cleanup + modprobe -r ovpn || true -for p in $(seq 0 ${OVPN_NUM_PEERS}); do - ovpn_create_ns ${p} -done + if [ "${ovpn_test_finished}" -eq 0 ]; then + ktap_print_totals + fi +} -for p in $(seq 0 ${OVPN_NUM_PEERS}); do - ovpn_setup_listener ${p} -done +ovpn_prepare_network() { + local p + local peer_ns -for p in $(seq 0 ${OVPN_NUM_PEERS}); do - ovpn_setup_ns ${p} 5.5.5.$((${p} + 1))/24 ${MTU} -done + for p in $(seq 0 ${OVPN_NUM_PEERS}); do + ovpn_cmd_ok "create namespace peer${p}" ovpn_create_ns "${p}" + done -for p in $(seq 0 ${OVPN_NUM_PEERS}); do - ovpn_add_peer ${p} -done + for p in $(seq 0 ${OVPN_NUM_PEERS}); do + ovpn_cmd_ok "start notification listener peer${p}" \ + ovpn_setup_listener "${p}" + done -for p in $(seq 1 ${OVPN_NUM_PEERS}); do - ip netns exec "${server_ns}" ${OVPN_CLI} set_peer tun0 ${p} 60 120 - ip netns exec "ovpn_peer${p}" ${OVPN_CLI} set_peer tun${p} \ - 
$((${p}+OVPN_ID_OFFSET)) 60 120 -done + for p in $(seq 0 ${OVPN_NUM_PEERS}); do + ovpn_cmd_ok "configure peer${p} namespace" ovpn_setup_ns \ + "${p}" 5.5.5.$((p + 1))/24 "${MTU}" + done -sleep 1 + for p in $(seq 0 ${OVPN_NUM_PEERS}); do + ovpn_cmd_ok "register peer${p} in overlay" ovpn_add_peer "${p}" + done -TCPDUMP_TIMEOUT="1.5s" -for p in $(seq 1 ${OVPN_NUM_PEERS}); do - # The first part of the data packet header consists of: - # - TCP only: 2 bytes for the packet length - # - 5 bits for opcode ("9" for DATA_V2) - # - 3 bits for key-id ("0" at this point) - # - 12 bytes for peer-id: - # - with asymmetric ID: "${p}" one way and "${p} + 9" the other way - # - with symmetric ID: "${p}" both ways - HEADER1=$(printf "0x4800000%x" ${p}) - HEADER2=$(printf "0x4800000%x" $((${p} + OVPN_ID_OFFSET))) - RADDR="" - if [ "${OVPN_PROTO}" == "UDP" ]; then - RADDR=$(awk "NR == ${p} {print \$3}" ${OVPN_UDP_PEERS_FILE}) + for p in $(seq 1 ${OVPN_NUM_PEERS}); do + peer_ns="ovpn_peer${p}" + ovpn_cmd_ok "set peer0 timeout for peer ${p}" \ + ip netns exec ovpn_peer0 ${OVPN_CLI} set_peer tun0 \ + ${p} 60 120 + ovpn_cmd_ok "set peer${p} timeout for peer ${p}" \ + ip netns exec "${peer_ns}" ${OVPN_CLI} set_peer \ + tun${p} $((p + OVPN_ID_OFFSET)) 60 120 + done +} + +ovpn_run_basic_traffic() { + local p + local header1 + local header2 + local peer_ns + local raddr + local tcpdump_pid1 + local tcpdump_pid2 + local tcpdump_timeout="1.5s" + + for p in $(seq 1 ${OVPN_NUM_PEERS}); do + # The first part of the data packet header consists of: + # - TCP only: 2 bytes for the packet length + # - 5 bits for opcode ("9" for DATA_V2) + # - 3 bits for key-id ("0" at this point) + # - 12 bytes for peer-id: + # - with asymmetric ID: "${p}" one way and "${p} + 9" the + # other way + # - with symmetric ID: "${p}" both ways + header1=$(printf "0x4800000%x" ${p}) + header2=$(printf "0x4800000%x" $((p + OVPN_ID_OFFSET))) + raddr="" + if [ "${OVPN_PROTO}" == "UDP" ]; then + raddr=$(awk "NR == ${p} {print \$3}" \ + "${OVPN_UDP_PEERS_FILE}") + fi + peer_ns="ovpn_peer${p}" + + timeout ${tcpdump_timeout} ip netns exec "${peer_ns}" \ + tcpdump --immediate-mode -p -ni veth${p} -c 1 \ + "$(ovpn_build_capture_filter "${header1}" "${raddr}")" \ + >/dev/null 2>&1 & + tcpdump_pid1=$! + timeout ${tcpdump_timeout} ip netns exec "${peer_ns}" \ + tcpdump --immediate-mode -p -ni veth${p} -c 1 \ + "$(ovpn_build_capture_filter "${header2}" "${raddr}")" \ + >/dev/null 2>&1 & + tcpdump_pid2=$! 
+ + sleep 0.3 + ovpn_cmd_ok "send baseline traffic to peer ${p}" \ + ip netns exec ovpn_peer0 \ + ping -qfc 500 -w 3 5.5.5.$((p + 1)) + ovpn_cmd_ok "send large-payload traffic to peer ${p}" \ + ip netns exec ovpn_peer0 \ + ping -qfc 500 -s 3000 -w 3 5.5.5.$((p + 1)) + + wait "${tcpdump_pid1}" || return 1 + wait "${tcpdump_pid2}" || return 1 + done +} + +ovpn_run_lan_traffic() { + ovpn_cmd_ok "ping LAN behind peer1" \ + ip netns exec ovpn_peer0 ping -qfc 500 -w 3 "${OVPN_LAN_IP}" +} + +ovpn_run_float_mode() { + local p + local peer_ns + + for p in $(seq 1 ${OVPN_NUM_PEERS}); do + peer_ns="ovpn_peer${p}" + ovpn_cmd_ok "float: remove old transport address on peer${p}" \ + ip -n "${peer_ns}" addr del 10.10.${p}.2/24 dev veth${p} + ovpn_cmd_ok "float: add new transport address on peer${p}" \ + ip -n "${peer_ns}" addr add 10.10.${p}.3/24 dev veth${p} + done + for p in $(seq 1 ${OVPN_NUM_PEERS}); do + peer_ns="ovpn_peer${p}" + ovpn_cmd_ok "ping tunnel after float peer ${p}" \ + ip netns exec "${peer_ns}" ping -qfc 500 -w 3 5.5.5.1 + done +} + +ovpn_run_iperf() { + local iperf_pid + + ovpn_run_bg iperf_pid ip netns exec ovpn_peer0 iperf3 -1 -s + sleep 1 + + ovpn_cmd_ok "run iperf throughput flow" \ + ip netns exec ovpn_peer1 iperf3 -Z -t 3 -c 5.5.5.1 + wait "${iperf_pid}" || return 1 +} + +ovpn_run_key_rollover() { + local p + local peer_ns + + ovpn_log "Adding secondary key and then swap:" + + for p in $(seq 1 ${OVPN_NUM_PEERS}); do + peer_ns="ovpn_peer${p}" + ovpn_cmd_ok "add secondary key on peer0 for peer ${p}" \ + ip netns exec ovpn_peer0 ${OVPN_CLI} new_key tun0 \ + ${p} 2 1 ${OVPN_ALG} 0 data64.key + ovpn_cmd_ok "add secondary key on peer${p} for peer ${p}" \ + ip netns exec "${peer_ns}" ${OVPN_CLI} new_key tun${p} \ + $((p + OVPN_ID_OFFSET)) 2 1 ${OVPN_ALG} 1 \ + data64.key + ovpn_cmd_ok "swap keys on peer${p}" \ + ip netns exec "${peer_ns}" ${OVPN_CLI} swap_keys \ + tun${p} $((p + OVPN_ID_OFFSET)) + done +} + +ovpn_run_queries() { + ovpn_log "Querying all peers:" + + ovpn_cmd_ok "query all peers from peer0" \ + ip netns exec ovpn_peer0 ${OVPN_CLI} get_peer tun0 + ovpn_cmd_ok "query all peers from peer1" \ + ip netns exec ovpn_peer1 ${OVPN_CLI} get_peer tun1 + + ovpn_log "Querying peer 1:" + + ovpn_cmd_ok "query peer 1 from peer0" \ + ip netns exec ovpn_peer0 ${OVPN_CLI} get_peer tun0 1 +} + +ovpn_query_peer_missing() { + ovpn_log "Querying non-existent peer 20:" + + ovpn_cmd_fail "query missing peer 20 on peer0" \ + ip netns exec ovpn_peer0 ${OVPN_CLI} get_peer tun0 20 +} + +ovpn_run_peer_cleanup() { + local p + local peer_ns + + ovpn_log "Deleting peer 1:" + + ovpn_cmd_ok "delete peer1 on peer0" \ + ip netns exec ovpn_peer0 ${OVPN_CLI} del_peer tun0 1 + ovpn_cmd_ok "delete peer1 on peer1" \ + ip netns exec ovpn_peer1 ${OVPN_CLI} del_peer tun1 \ + $((1 + OVPN_ID_OFFSET)) + + ovpn_log "Querying keys:" + + for p in $(seq 2 ${OVPN_NUM_PEERS}); do + peer_ns="ovpn_peer${p}" + ovpn_cmd_ok "query peer${p} key 1" \ + ip netns exec "${peer_ns}" ${OVPN_CLI} get_key tun${p} \ + $((p + OVPN_ID_OFFSET)) 1 + ovpn_cmd_ok "query peer${p} key 2" \ + ip netns exec "${peer_ns}" ${OVPN_CLI} get_key tun${p} \ + $((p + OVPN_ID_OFFSET)) 2 + done +} + +ovpn_run_traffic_delete_peer() { + local ping_pid + + ovpn_log "Deleting peer while sending traffic:" + + ovpn_run_bg ping_pid ip netns exec ovpn_peer2 ping -qf -w 4 5.5.5.1 + sleep 2 + ovpn_cmd_ok "delete peer0 peer 2" \ + ip netns exec ovpn_peer0 ${OVPN_CLI} del_peer tun0 2 + + if [ "${OVPN_PROTO}" == "TCP" ]; then + # In TCP mode this command is expected to 
fail for both peers. + ovpn_cmd_mayfail "delete peer2 peer 2 (TCP non-fatal)" \ + ip netns exec ovpn_peer2 ${OVPN_CLI} del_peer tun2 \ + $((2 + OVPN_ID_OFFSET)) + else + ovpn_cmd_ok "delete peer2 peer 2" ip netns exec ovpn_peer2 \ + ${OVPN_CLI} del_peer tun2 $((2 + OVPN_ID_OFFSET)) fi - timeout ${TCPDUMP_TIMEOUT} ip netns exec "ovpn_peer${p}" \ - tcpdump --immediate-mode -p -ni veth${p} -c 1 \ - "$(ovpn_build_capture_filter "${HEADER1}" "${RADDR}")" \ - >/dev/null 2>&1 & - TCPDUMP_PID1=$! - timeout ${TCPDUMP_TIMEOUT} ip netns exec "ovpn_peer${p}" \ - tcpdump --immediate-mode -p -ni veth${p} -c 1 \ - "$(ovpn_build_capture_filter "${HEADER2}" "${RADDR}")" \ - >/dev/null 2>&1 & - TCPDUMP_PID2=$! + wait "${ping_pid}" || true +} - sleep 0.3 - ip netns exec "${server_ns}" ping -qfc 500 -w 3 5.5.5.$((${p} + 1)) - ip netns exec "${server_ns}" ping -qfc 500 -s 3000 -w 3 \ - 5.5.5.$((${p} + 1)) +ovpn_run_key_cleanup() { + local p + local peer_ns - wait ${TCPDUMP_PID1} - wait ${TCPDUMP_PID2} -done + ovpn_log "Deleting keys:" -# ping LAN behind client 1 -ip netns exec "${server_ns}" ping -qfc 500 -w 3 ${OVPN_LAN_IP} - -if [ "$OVPN_FLOAT" == "1" ]; then - # make clients float.. - for p in $(seq 1 ${OVPN_NUM_PEERS}); do - ip -n "ovpn_peer${p}" addr del 10.10.${p}.2/24 dev veth${p} - ip -n "ovpn_peer${p}" addr add 10.10.${p}.3/24 dev veth${p} + for p in $(seq 3 ${OVPN_NUM_PEERS}); do + peer_ns="ovpn_peer${p}" + ovpn_cmd_ok "delete key 1 for peer${p}" \ + ip netns exec "${peer_ns}" ${OVPN_CLI} del_key tun${p} \ + $((p + OVPN_ID_OFFSET)) 1 + ovpn_cmd_ok "delete key 2 for peer${p}" \ + ip netns exec "${peer_ns}" ${OVPN_CLI} del_key tun${p} \ + $((p + OVPN_ID_OFFSET)) 2 done - for p in $(seq 1 ${OVPN_NUM_PEERS}); do - ip netns exec "ovpn_peer${p}" ping -qfc 500 -w 3 5.5.5.1 +} + +ovpn_run_timeouts() { + local p + local peer_ns + + ovpn_log "Setting timeout to 3s MP:" + + for p in $(seq 3 ${OVPN_NUM_PEERS}); do + # Non-fatal: this may fail in some protocol modes. 
+ ovpn_cmd_mayfail "set peer0 timeout for peer ${p} (non-fatal)" \ + ip netns exec ovpn_peer0 ${OVPN_CLI} set_peer tun0 \ + ${p} 3 3 + peer_ns="ovpn_peer${p}" + ovpn_cmd_ok "disable timeout on peer${p} while peer0 adjusts \ + state" ip netns exec "${peer_ns}" ${OVPN_CLI} set_peer \ + tun${p} $((p + OVPN_ID_OFFSET)) 0 0 done + # wait for peers to timeout + sleep 5 + + ovpn_log "Setting timeout to 3s P2P:" + + for p in $(seq 3 ${OVPN_NUM_PEERS}); do + peer_ns="ovpn_peer${p}" + ovpn_cmd_ok "set peer${p} P2P timeout" \ + ip netns exec "${peer_ns}" ${OVPN_CLI} set_peer \ + tun${p} $((p + OVPN_ID_OFFSET)) 3 3 + done + sleep 5 +} + +ovpn_run_notifications() { + local p + + for p in $(seq 0 ${OVPN_NUM_PEERS}); do + ovpn_cmd_ok "validate listener output for peer ${p}" \ + ovpn_compare_ntfs "${p}" + done +} + +trap ovpn_test_exit EXIT +trap ovpn_stage_err ERR + +ktap_print_header +if [ "${OVPN_FLOAT}" == "1" ]; then + ktap_set_plan 13 +else + ktap_set_plan 12 fi -ip netns exec "${server_ns}" iperf3 -1 -s & -sleep 1 -ip netns exec ovpn_peer1 iperf3 -Z -t 3 -c 5.5.5.1 - -echo "Adding secondary key and then swap:" -for p in $(seq 1 ${OVPN_NUM_PEERS}); do - ip netns exec "${server_ns}" ${OVPN_CLI} new_key tun0 ${p} 2 1 \ - ${OVPN_ALG} 0 data64.key - ip netns exec "ovpn_peer${p}" ${OVPN_CLI} new_key tun${p} \ - $((${p} + OVPN_ID_OFFSET)) 2 1 ${OVPN_ALG} 1 data64.key - ip netns exec "ovpn_peer${p}" ${OVPN_CLI} swap_keys tun${p} \ - $((${p} + OVPN_ID_OFFSET)) -done - -sleep 1 - -echo "Querying all peers:" -ip netns exec "${server_ns}" ${OVPN_CLI} get_peer tun0 -ip netns exec ovpn_peer1 ${OVPN_CLI} get_peer tun1 - -echo "Querying peer 1:" -ip netns exec "${server_ns}" ${OVPN_CLI} get_peer tun0 1 - -echo "Querying non-existent peer 20:" -ip netns exec "${server_ns}" ${OVPN_CLI} get_peer tun0 20 || true - -echo "Deleting peer 1:" -ip netns exec "${server_ns}" ${OVPN_CLI} del_peer tun0 1 -ip netns exec ovpn_peer1 ${OVPN_CLI} del_peer tun1 $((1 + OVPN_ID_OFFSET)) - -echo "Querying keys:" -for p in $(seq 2 ${OVPN_NUM_PEERS}); do - ip netns exec "ovpn_peer${p}" ${OVPN_CLI} get_key tun${p} \ - $((${p} + OVPN_ID_OFFSET)) 1 - ip netns exec "ovpn_peer${p}" ${OVPN_CLI} get_key tun${p} \ - $((${p} + OVPN_ID_OFFSET)) 2 -done - -echo "Deleting peer while sending traffic:" -(ip netns exec ovpn_peer2 ping -qf -w 4 5.5.5.1)& -sleep 2 -ip netns exec "${server_ns}" ${OVPN_CLI} del_peer tun0 2 -# following command fails in TCP mode -# (both ends get conn reset when one peer disconnects) -ip netns exec ovpn_peer2 ${OVPN_CLI} del_peer tun2 $((2 + OVPN_ID_OFFSET)) || \ - true - -echo "Deleting keys:" -for p in $(seq 3 ${OVPN_NUM_PEERS}); do - ip netns exec "ovpn_peer${p}" ${OVPN_CLI} del_key tun${p} \ - $((${p} + OVPN_ID_OFFSET)) 1 - ip netns exec "ovpn_peer${p}" ${OVPN_CLI} del_key tun${p} \ - $((${p} + OVPN_ID_OFFSET)) 2 -done - -echo "Setting timeout to 3s MP:" -for p in $(seq 3 ${OVPN_NUM_PEERS}); do - ip netns exec "${server_ns}" ${OVPN_CLI} set_peer tun0 ${p} 3 3 || true - ip netns exec "ovpn_peer${p}" ${OVPN_CLI} set_peer tun${p} \ - $((${p} + OVPN_ID_OFFSET)) 0 0 -done -# wait for peers to timeout -sleep 5 - -echo "Setting timeout to 3s P2P:" -for p in $(seq 3 ${OVPN_NUM_PEERS}); do - ip netns exec "ovpn_peer${p}" ${OVPN_CLI} set_peer tun${p} \ - $((${p} + OVPN_ID_OFFSET)) 3 3 -done -sleep 5 - -for p in $(seq 0 ${OVPN_NUM_PEERS}); do - ovpn_compare_ntfs ${p} -done - ovpn_cleanup +modprobe -q ovpn || true -modprobe -r ovpn || true +ovpn_run_stage "setup network topology" ovpn_prepare_network +ovpn_run_stage "run baseline 
data traffic" ovpn_run_basic_traffic +ovpn_run_stage "run LAN traffic behind peer1" ovpn_run_lan_traffic +[ "${OVPN_FLOAT}" == "1" ] && ovpn_run_stage "run floating peer checks" \ + ovpn_run_float_mode +ovpn_run_stage "run iperf throughput" ovpn_run_iperf +ovpn_run_stage "run key rollout" ovpn_run_key_rollover +ovpn_run_stage "query peers" ovpn_run_queries +ovpn_run_stage "query missing peer fails" ovpn_query_peer_missing +ovpn_run_stage "peer lifecycle and key queries" ovpn_run_peer_cleanup +ovpn_run_stage "delete peer while traffic" ovpn_run_traffic_delete_peer +ovpn_run_stage "delete stale keys" ovpn_run_key_cleanup +ovpn_run_stage "check timeout behavior" ovpn_run_timeouts +ovpn_run_stage "validate notification output" ovpn_run_notifications + +ovpn_test_finished=1 +ktap_finished From 6c9b1dc218fea8b15893953f5299b209f11fa0a8 Mon Sep 17 00:00:00 2001 From: Ralf Lici Date: Thu, 16 Apr 2026 09:19:28 +0200 Subject: [PATCH 024/148] selftests: ovpn: serialize YNL listener startup Starting one background YNL notification listener per peer back-to-back can intermittently stall the test setup before the listeners even reach the Python main function. This was reproducible in a reduced test.sh setup-only loop: a single listener stayed stable across repeated runs, while starting listeners for all peers could hang early in the listener launch phase. Adding a short delay between listener launches makes the listeners start cleanly and eliminates the reproduced hangs in repeated normal and slow-runner tests. Serialize listener startup with a small sleep between setup_listener calls. Fixes: 77de28cd7cf1 ("selftests: ovpn: add notification parsing and matching") Signed-off-by: Ralf Lici Signed-off-by: Antonio Quartulli --- tools/testing/selftests/net/ovpn/test.sh | 3 +++ 1 file changed, 3 insertions(+) diff --git a/tools/testing/selftests/net/ovpn/test.sh b/tools/testing/selftests/net/ovpn/test.sh index eca653112aeb..b50dbe45a4d0 100755 --- a/tools/testing/selftests/net/ovpn/test.sh +++ b/tools/testing/selftests/net/ovpn/test.sh @@ -31,6 +31,9 @@ ovpn_prepare_network() { for p in $(seq 0 ${OVPN_NUM_PEERS}); do ovpn_cmd_ok "start notification listener peer${p}" \ ovpn_setup_listener "${p}" + # starting all YNL listeners back-to-back can intermittently + # stall their startup so serialize launches a bit + sleep 0.5 done for p in $(seq 0 ${OVPN_NUM_PEERS}); do From 267bf3cf9a6f0ffb98b8afd983c1950e835f07c9 Mon Sep 17 00:00:00 2001 From: Eric Dumazet Date: Thu, 16 Apr 2026 20:03:06 +0000 Subject: [PATCH 025/148] tcp: annotate data-races in tcp_get_info_chrono_stats() tcp_get_timestamping_opt_stats() does not own the socket lock, this is intentional. It calls tcp_get_info_chrono_stats() while other threads could change chrono fields in tcp_chrono_set(). I do not think we need coherent TCP socket state snapshot in tcp_get_timestamping_opt_stats(), I chose to only add annotations to keep KCSAN happy. 
Fixes: 1c885808e456 ("tcp: SOF_TIMESTAMPING_OPT_STATS option for SO_TIMESTAMPING") Signed-off-by: Eric Dumazet Link: https://patch.msgid.link/20260416200319.3608680-2-edumazet@google.com Signed-off-by: Jakub Kicinski --- include/net/tcp.h | 10 +++++++--- net/ipv4/tcp.c | 14 ++++++++++---- 2 files changed, 17 insertions(+), 7 deletions(-) diff --git a/include/net/tcp.h b/include/net/tcp.h index dfa52ceefd23..674af493882c 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -2208,10 +2208,14 @@ static inline void tcp_chrono_set(struct tcp_sock *tp, const enum tcp_chrono new const u32 now = tcp_jiffies32; enum tcp_chrono old = tp->chrono_type; + /* Following WRITE_ONCE()s pair with READ_ONCE()s in + * tcp_get_info_chrono_stats(). + */ if (old > TCP_CHRONO_UNSPEC) - tp->chrono_stat[old - 1] += now - tp->chrono_start; - tp->chrono_start = now; - tp->chrono_type = new; + WRITE_ONCE(tp->chrono_stat[old - 1], + tp->chrono_stat[old - 1] + now - tp->chrono_start); + WRITE_ONCE(tp->chrono_start, now); + WRITE_ONCE(tp->chrono_type, new); } static inline void tcp_chrono_start(struct sock *sk, const enum tcp_chrono type) diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index 1a494d18c5fd..7b7812cb710f 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -4191,12 +4191,18 @@ static void tcp_get_info_chrono_stats(const struct tcp_sock *tp, struct tcp_info *info) { u64 stats[__TCP_CHRONO_MAX], total = 0; - enum tcp_chrono i; + enum tcp_chrono i, cur; + /* Following READ_ONCE()s pair with WRITE_ONCE()s in tcp_chrono_set(). + * This is because socket lock might not be owned by us at this point. + * This is best effort, tcp_get_timestamping_opt_stats() can + * see wrong values. A real fix would be too costly for TCP fast path. + */ + cur = READ_ONCE(tp->chrono_type); for (i = TCP_CHRONO_BUSY; i < __TCP_CHRONO_MAX; ++i) { - stats[i] = tp->chrono_stat[i - 1]; - if (i == tp->chrono_type) - stats[i] += tcp_jiffies32 - tp->chrono_start; + stats[i] = READ_ONCE(tp->chrono_stat[i - 1]); + if (i == cur) + stats[i] += tcp_jiffies32 - READ_ONCE(tp->chrono_start); stats[i] *= USEC_PER_SEC / HZ; total += stats[i]; } From 21e92a38cfd891538598ba8f805e0165a820d532 Mon Sep 17 00:00:00 2001 From: Eric Dumazet Date: Thu, 16 Apr 2026 20:03:07 +0000 Subject: [PATCH 026/148] tcp: add data-race annotations around tp->data_segs_out and tp->total_retrans tcp_get_timestamping_opt_stats() intentionally runs lockless, we must add READ_ONCE() and WRITE_ONCE() annotations to keep KCSAN happy. Fixes: 7e98102f4897 ("tcp: record pkts sent and retransmistted") Signed-off-by: Eric Dumazet Link: https://patch.msgid.link/20260416200319.3608680-3-edumazet@google.com Signed-off-by: Jakub Kicinski --- net/ipv4/tcp.c | 4 ++-- net/ipv4/tcp_output.c | 8 +++++--- 2 files changed, 7 insertions(+), 5 deletions(-) diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index 7b7812cb710f..e39e0734d958 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -4434,9 +4434,9 @@ struct sk_buff *tcp_get_timestamping_opt_stats(const struct sock *sk, nla_put_u64_64bit(stats, TCP_NLA_SNDBUF_LIMITED, info.tcpi_sndbuf_limited, TCP_NLA_PAD); nla_put_u64_64bit(stats, TCP_NLA_DATA_SEGS_OUT, - tp->data_segs_out, TCP_NLA_PAD); + READ_ONCE(tp->data_segs_out), TCP_NLA_PAD); nla_put_u64_64bit(stats, TCP_NLA_TOTAL_RETRANS, - tp->total_retrans, TCP_NLA_PAD); + READ_ONCE(tp->total_retrans), TCP_NLA_PAD); rate = READ_ONCE(sk->sk_pacing_rate); rate64 = (rate != ~0UL) ? 
rate : ~0ULL; diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index 8e99687526a6..d8e8bba2d03a 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -1688,7 +1688,8 @@ static int __tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, if (skb->len != tcp_header_size) { tcp_event_data_sent(tp, sk); - tp->data_segs_out += tcp_skb_pcount(skb); + WRITE_ONCE(tp->data_segs_out, + tp->data_segs_out + tcp_skb_pcount(skb)); tp->bytes_sent += skb->len - tcp_header_size; } @@ -3642,7 +3643,7 @@ int __tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb, int segs) TCP_ADD_STATS(sock_net(sk), TCP_MIB_RETRANSSEGS, segs); if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_SYN) __NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPSYNRETRANS); - tp->total_retrans += segs; + WRITE_ONCE(tp->total_retrans, tp->total_retrans + segs); tp->bytes_retrans += skb->len; /* make sure skb->data is aligned on arches that require it @@ -4646,7 +4647,8 @@ int tcp_rtx_synack(const struct sock *sk, struct request_sock *req) * However in this case, we are dealing with a passive fastopen * socket thus we can change total_retrans value. */ - tcp_sk_rw(sk)->total_retrans++; + WRITE_ONCE(tcp_sk_rw(sk)->total_retrans, + tcp_sk_rw(sk)->total_retrans + 1); } trace_tcp_retransmit_synack(sk, req); WRITE_ONCE(req->num_retrans, req->num_retrans + 1); From 829ba1f329cb7cbd56d599a6d225997fba66dc32 Mon Sep 17 00:00:00 2001 From: Eric Dumazet Date: Thu, 16 Apr 2026 20:03:08 +0000 Subject: [PATCH 027/148] tcp: add data-races annotations around tp->reordering, tp->snd_cwnd tcp_get_timestamping_opt_stats() intentionally runs lockless, we must add READ_ONCE(), WRITE_ONCE() data_race() annotations to keep KCSAN happy. Fixes: bb7c19f96012 ("tcp: add related fields into SCM_TIMESTAMPING_OPT_STATS") Signed-off-by: Eric Dumazet Link: https://patch.msgid.link/20260416200319.3608680-4-edumazet@google.com Signed-off-by: Jakub Kicinski --- include/net/tcp.h | 2 +- net/ipv4/tcp.c | 8 ++++---- net/ipv4/tcp_input.c | 14 ++++++++------ net/ipv4/tcp_metrics.c | 2 +- 4 files changed, 14 insertions(+), 12 deletions(-) diff --git a/include/net/tcp.h b/include/net/tcp.h index 674af493882c..ecbadcb3a744 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -1513,7 +1513,7 @@ static inline u32 tcp_snd_cwnd(const struct tcp_sock *tp) static inline void tcp_snd_cwnd_set(struct tcp_sock *tp, u32 val) { WARN_ON_ONCE((int)val <= 0); - tp->snd_cwnd = val; + WRITE_ONCE(tp->snd_cwnd, val); } static inline bool tcp_in_slow_start(const struct tcp_sock *tp) diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index e39e0734d958..24ba80d244b1 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -4445,13 +4445,13 @@ struct sk_buff *tcp_get_timestamping_opt_stats(const struct sock *sk, rate64 = tcp_compute_delivery_rate(tp); nla_put_u64_64bit(stats, TCP_NLA_DELIVERY_RATE, rate64, TCP_NLA_PAD); - nla_put_u32(stats, TCP_NLA_SND_CWND, tcp_snd_cwnd(tp)); - nla_put_u32(stats, TCP_NLA_REORDERING, tp->reordering); - nla_put_u32(stats, TCP_NLA_MIN_RTT, tcp_min_rtt(tp)); + nla_put_u32(stats, TCP_NLA_SND_CWND, READ_ONCE(tp->snd_cwnd)); + nla_put_u32(stats, TCP_NLA_REORDERING, READ_ONCE(tp->reordering)); + nla_put_u32(stats, TCP_NLA_MIN_RTT, data_race(tcp_min_rtt(tp))); nla_put_u8(stats, TCP_NLA_RECUR_RETRANS, READ_ONCE(inet_csk(sk)->icsk_retransmits)); - nla_put_u8(stats, TCP_NLA_DELIVERY_RATE_APP_LMT, !!tp->rate_app_limited); + nla_put_u8(stats, TCP_NLA_DELIVERY_RATE_APP_LMT, data_race(!!tp->rate_app_limited)); nla_put_u32(stats, TCP_NLA_SND_SSTHRESH, tp->snd_ssthresh); 
nla_put_u32(stats, TCP_NLA_DELIVERED, tp->delivered); nla_put_u32(stats, TCP_NLA_DELIVERED_CE, tp->delivered_ce); diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 021f745747c5..6bb6bf049a35 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -1293,8 +1293,9 @@ static void tcp_check_sack_reordering(struct sock *sk, const u32 low_seq, tp->sacked_out, tp->undo_marker ? tp->undo_retrans : 0); #endif - tp->reordering = min_t(u32, (metric + mss - 1) / mss, - READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_max_reordering)); + WRITE_ONCE(tp->reordering, + min_t(u32, (metric + mss - 1) / mss, + READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_max_reordering))); } /* This exciting event is worth to be remembered. 8) */ @@ -2439,8 +2440,9 @@ static void tcp_check_reno_reordering(struct sock *sk, const int addend) if (!tcp_limit_reno_sacked(tp)) return; - tp->reordering = min_t(u32, tp->packets_out + addend, - READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_max_reordering)); + WRITE_ONCE(tp->reordering, + min_t(u32, tp->packets_out + addend, + READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_max_reordering))); tp->reord_seen++; NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPRENOREORDER); } @@ -2579,8 +2581,8 @@ void tcp_enter_loss(struct sock *sk) reordering = READ_ONCE(net->ipv4.sysctl_tcp_reordering); if (icsk->icsk_ca_state <= TCP_CA_Disorder && tp->sacked_out >= reordering) - tp->reordering = min_t(unsigned int, tp->reordering, - reordering); + WRITE_ONCE(tp->reordering, + min_t(unsigned int, tp->reordering, reordering)); tcp_set_ca_state(sk, TCP_CA_Loss); tp->high_seq = tp->snd_nxt; diff --git a/net/ipv4/tcp_metrics.c b/net/ipv4/tcp_metrics.c index 06b1d5d3b6df..7a9d6d9006f6 100644 --- a/net/ipv4/tcp_metrics.c +++ b/net/ipv4/tcp_metrics.c @@ -496,7 +496,7 @@ void tcp_init_metrics(struct sock *sk) } val = tcp_metric_get(tm, TCP_METRIC_REORDERING); if (val && tp->reordering != val) - tp->reordering = val; + WRITE_ONCE(tp->reordering, val); crtt = tcp_metric_get(tm, TCP_METRIC_RTT); rcu_read_unlock(); From fd571afb05ebaeac5d8f09460a0640d4cf6755f8 Mon Sep 17 00:00:00 2001 From: Eric Dumazet Date: Thu, 16 Apr 2026 20:03:09 +0000 Subject: [PATCH 028/148] tcp: annotate data-races around tp->snd_ssthresh tcp_get_timestamping_opt_stats() intentionally runs lockless, we must add READ_ONCE() and WRITE_ONCE() annotations to keep KCSAN happy. 
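The pairing used throughout this series is worth spelling out once in
isolation. A reduced, kernel-style sketch of the pattern (illustrative
only, not code from these patches):

	struct stats {
		u32 counter;	/* written under lock, read locklessly */
	};

	/* Writer side: runs under the socket lock, so the plain read
	 * of s->counter is safe; WRITE_ONCE() only has to make the
	 * store tear-free for concurrent lockless readers.
	 */
	static void writer_update(struct stats *s, u32 delta)
	{
		WRITE_ONCE(s->counter, s->counter + delta);
	}

	/* Reader side: runs without the lock, exactly like
	 * tcp_get_timestamping_opt_stats(); READ_ONCE() takes one
	 * tear-free snapshot and documents the intentional race.
	 */
	static u32 reader_snapshot(const struct stats *s)
	{
		return READ_ONCE(s->counter);
	}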
Fixes: 7156d194a077 ("tcp: add snd_ssthresh stat in SCM_TIMESTAMPING_OPT_STATS") Signed-off-by: Eric Dumazet Link: https://patch.msgid.link/20260416200319.3608680-5-edumazet@google.com Signed-off-by: Jakub Kicinski --- net/core/filter.c | 2 +- net/ipv4/tcp.c | 4 ++-- net/ipv4/tcp_bbr.c | 6 +++--- net/ipv4/tcp_bic.c | 2 +- net/ipv4/tcp_cdg.c | 4 ++-- net/ipv4/tcp_cubic.c | 6 +++--- net/ipv4/tcp_dctcp.c | 2 +- net/ipv4/tcp_input.c | 8 ++++---- net/ipv4/tcp_metrics.c | 4 ++-- net/ipv4/tcp_nv.c | 4 ++-- net/ipv4/tcp_output.c | 4 ++-- net/ipv4/tcp_vegas.c | 9 +++++---- net/ipv4/tcp_westwood.c | 4 ++-- net/ipv4/tcp_yeah.c | 3 ++- 14 files changed, 32 insertions(+), 30 deletions(-) diff --git a/net/core/filter.c b/net/core/filter.c index fcfcb72663ca..3b5609fb96de 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -5396,7 +5396,7 @@ static int bpf_sol_tcp_setsockopt(struct sock *sk, int optname, if (val <= 0) return -EINVAL; tp->snd_cwnd_clamp = val; - tp->snd_ssthresh = val; + WRITE_ONCE(tp->snd_ssthresh, val); break; case TCP_BPF_DELACK_MAX: timeout = usecs_to_jiffies(val); diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index 24ba80d244b1..802a9ea05211 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -3425,7 +3425,7 @@ int tcp_disconnect(struct sock *sk, int flags) icsk->icsk_rto = TCP_TIMEOUT_INIT; WRITE_ONCE(icsk->icsk_rto_min, TCP_RTO_MIN); WRITE_ONCE(icsk->icsk_delack_max, TCP_DELACK_MAX); - tp->snd_ssthresh = TCP_INFINITE_SSTHRESH; + WRITE_ONCE(tp->snd_ssthresh, TCP_INFINITE_SSTHRESH); tcp_snd_cwnd_set(tp, TCP_INIT_CWND); tp->snd_cwnd_cnt = 0; tp->is_cwnd_limited = 0; @@ -4452,7 +4452,7 @@ struct sk_buff *tcp_get_timestamping_opt_stats(const struct sock *sk, nla_put_u8(stats, TCP_NLA_RECUR_RETRANS, READ_ONCE(inet_csk(sk)->icsk_retransmits)); nla_put_u8(stats, TCP_NLA_DELIVERY_RATE_APP_LMT, data_race(!!tp->rate_app_limited)); - nla_put_u32(stats, TCP_NLA_SND_SSTHRESH, tp->snd_ssthresh); + nla_put_u32(stats, TCP_NLA_SND_SSTHRESH, READ_ONCE(tp->snd_ssthresh)); nla_put_u32(stats, TCP_NLA_DELIVERED, tp->delivered); nla_put_u32(stats, TCP_NLA_DELIVERED_CE, tp->delivered_ce); diff --git a/net/ipv4/tcp_bbr.c b/net/ipv4/tcp_bbr.c index 1ddc20a399b0..aec7805b1d37 100644 --- a/net/ipv4/tcp_bbr.c +++ b/net/ipv4/tcp_bbr.c @@ -897,8 +897,8 @@ static void bbr_check_drain(struct sock *sk, const struct rate_sample *rs) if (bbr->mode == BBR_STARTUP && bbr_full_bw_reached(sk)) { bbr->mode = BBR_DRAIN; /* drain queue we created */ - tcp_sk(sk)->snd_ssthresh = - bbr_inflight(sk, bbr_max_bw(sk), BBR_UNIT); + WRITE_ONCE(tcp_sk(sk)->snd_ssthresh, + bbr_inflight(sk, bbr_max_bw(sk), BBR_UNIT)); } /* fall through to check if in-flight is already small: */ if (bbr->mode == BBR_DRAIN && bbr_packets_in_net_at_edt(sk, tcp_packets_in_flight(tcp_sk(sk))) <= @@ -1043,7 +1043,7 @@ __bpf_kfunc static void bbr_init(struct sock *sk) struct bbr *bbr = inet_csk_ca(sk); bbr->prior_cwnd = 0; - tp->snd_ssthresh = TCP_INFINITE_SSTHRESH; + WRITE_ONCE(tp->snd_ssthresh, TCP_INFINITE_SSTHRESH); bbr->rtt_cnt = 0; bbr->next_rtt_delivered = tp->delivered; bbr->prev_ca_state = TCP_CA_Open; diff --git a/net/ipv4/tcp_bic.c b/net/ipv4/tcp_bic.c index 58358bf92e1b..65444ff14241 100644 --- a/net/ipv4/tcp_bic.c +++ b/net/ipv4/tcp_bic.c @@ -74,7 +74,7 @@ static void bictcp_init(struct sock *sk) bictcp_reset(ca); if (initial_ssthresh) - tcp_sk(sk)->snd_ssthresh = initial_ssthresh; + WRITE_ONCE(tcp_sk(sk)->snd_ssthresh, initial_ssthresh); } /* diff --git a/net/ipv4/tcp_cdg.c b/net/ipv4/tcp_cdg.c index ceabfd690a29..0812c390aee5 100644 --- 
a/net/ipv4/tcp_cdg.c +++ b/net/ipv4/tcp_cdg.c @@ -162,7 +162,7 @@ static void tcp_cdg_hystart_update(struct sock *sk) NET_ADD_STATS(sock_net(sk), LINUX_MIB_TCPHYSTARTTRAINCWND, tcp_snd_cwnd(tp)); - tp->snd_ssthresh = tcp_snd_cwnd(tp); + WRITE_ONCE(tp->snd_ssthresh, tcp_snd_cwnd(tp)); return; } } @@ -181,7 +181,7 @@ static void tcp_cdg_hystart_update(struct sock *sk) NET_ADD_STATS(sock_net(sk), LINUX_MIB_TCPHYSTARTDELAYCWND, tcp_snd_cwnd(tp)); - tp->snd_ssthresh = tcp_snd_cwnd(tp); + WRITE_ONCE(tp->snd_ssthresh, tcp_snd_cwnd(tp)); } } } diff --git a/net/ipv4/tcp_cubic.c b/net/ipv4/tcp_cubic.c index ab78b5ae8d0e..119bf8cbb007 100644 --- a/net/ipv4/tcp_cubic.c +++ b/net/ipv4/tcp_cubic.c @@ -136,7 +136,7 @@ __bpf_kfunc static void cubictcp_init(struct sock *sk) bictcp_hystart_reset(sk); if (!hystart && initial_ssthresh) - tcp_sk(sk)->snd_ssthresh = initial_ssthresh; + WRITE_ONCE(tcp_sk(sk)->snd_ssthresh, initial_ssthresh); } __bpf_kfunc static void cubictcp_cwnd_event_tx_start(struct sock *sk) @@ -420,7 +420,7 @@ static void hystart_update(struct sock *sk, u32 delay) NET_ADD_STATS(sock_net(sk), LINUX_MIB_TCPHYSTARTTRAINCWND, tcp_snd_cwnd(tp)); - tp->snd_ssthresh = tcp_snd_cwnd(tp); + WRITE_ONCE(tp->snd_ssthresh, tcp_snd_cwnd(tp)); } } } @@ -440,7 +440,7 @@ static void hystart_update(struct sock *sk, u32 delay) NET_ADD_STATS(sock_net(sk), LINUX_MIB_TCPHYSTARTDELAYCWND, tcp_snd_cwnd(tp)); - tp->snd_ssthresh = tcp_snd_cwnd(tp); + WRITE_ONCE(tp->snd_ssthresh, tcp_snd_cwnd(tp)); } } } diff --git a/net/ipv4/tcp_dctcp.c b/net/ipv4/tcp_dctcp.c index 96c99999e09d..274e628e7cf8 100644 --- a/net/ipv4/tcp_dctcp.c +++ b/net/ipv4/tcp_dctcp.c @@ -177,7 +177,7 @@ static void dctcp_react_to_loss(struct sock *sk) struct tcp_sock *tp = tcp_sk(sk); ca->loss_cwnd = tcp_snd_cwnd(tp); - tp->snd_ssthresh = max(tcp_snd_cwnd(tp) >> 1U, 2U); + WRITE_ONCE(tp->snd_ssthresh, max(tcp_snd_cwnd(tp) >> 1U, 2U)); } __bpf_kfunc static void dctcp_state(struct sock *sk, u8 new_state) diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 6bb6bf049a35..c6361447535f 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -2567,7 +2567,7 @@ void tcp_enter_loss(struct sock *sk) (icsk->icsk_ca_state == TCP_CA_Loss && !icsk->icsk_retransmits)) { tp->prior_ssthresh = tcp_current_ssthresh(sk); tp->prior_cwnd = tcp_snd_cwnd(tp); - tp->snd_ssthresh = icsk->icsk_ca_ops->ssthresh(sk); + WRITE_ONCE(tp->snd_ssthresh, icsk->icsk_ca_ops->ssthresh(sk)); tcp_ca_event(sk, CA_EVENT_LOSS); tcp_init_undo(tp); } @@ -2860,7 +2860,7 @@ static void tcp_undo_cwnd_reduction(struct sock *sk, bool unmark_loss) tcp_snd_cwnd_set(tp, icsk->icsk_ca_ops->undo_cwnd(sk)); if (tp->prior_ssthresh > tp->snd_ssthresh) { - tp->snd_ssthresh = tp->prior_ssthresh; + WRITE_ONCE(tp->snd_ssthresh, tp->prior_ssthresh); tcp_ecn_withdraw_cwr(tp); } } @@ -2978,7 +2978,7 @@ static void tcp_init_cwnd_reduction(struct sock *sk) tp->prior_cwnd = tcp_snd_cwnd(tp); tp->prr_delivered = 0; tp->prr_out = 0; - tp->snd_ssthresh = inet_csk(sk)->icsk_ca_ops->ssthresh(sk); + WRITE_ONCE(tp->snd_ssthresh, inet_csk(sk)->icsk_ca_ops->ssthresh(sk)); tcp_ecn_queue_cwr(tp); } @@ -3120,7 +3120,7 @@ static void tcp_non_congestion_loss_retransmit(struct sock *sk) if (icsk->icsk_ca_state != TCP_CA_Loss) { tp->high_seq = tp->snd_nxt; - tp->snd_ssthresh = tcp_current_ssthresh(sk); + WRITE_ONCE(tp->snd_ssthresh, tcp_current_ssthresh(sk)); tp->prior_ssthresh = 0; tp->undo_marker = 0; tcp_set_ca_state(sk, TCP_CA_Loss); diff --git a/net/ipv4/tcp_metrics.c b/net/ipv4/tcp_metrics.c index 
7a9d6d9006f6..dc0c081fc1f3 100644 --- a/net/ipv4/tcp_metrics.c +++ b/net/ipv4/tcp_metrics.c @@ -490,9 +490,9 @@ void tcp_init_metrics(struct sock *sk) val = READ_ONCE(net->ipv4.sysctl_tcp_no_ssthresh_metrics_save) ? 0 : tcp_metric_get(tm, TCP_METRIC_SSTHRESH); if (val) { - tp->snd_ssthresh = val; + WRITE_ONCE(tp->snd_ssthresh, val); if (tp->snd_ssthresh > tp->snd_cwnd_clamp) - tp->snd_ssthresh = tp->snd_cwnd_clamp; + WRITE_ONCE(tp->snd_ssthresh, tp->snd_cwnd_clamp); } val = tcp_metric_get(tm, TCP_METRIC_REORDERING); if (val && tp->reordering != val) diff --git a/net/ipv4/tcp_nv.c b/net/ipv4/tcp_nv.c index a60662f4bdf9..f345897a68df 100644 --- a/net/ipv4/tcp_nv.c +++ b/net/ipv4/tcp_nv.c @@ -396,8 +396,8 @@ static void tcpnv_acked(struct sock *sk, const struct ack_sample *sample) /* We have enough data to determine we are congested */ ca->nv_allow_cwnd_growth = 0; - tp->snd_ssthresh = - (nv_ssthresh_factor * max_win) >> 3; + WRITE_ONCE(tp->snd_ssthresh, + (nv_ssthresh_factor * max_win) >> 3); if (tcp_snd_cwnd(tp) - max_win > 2) { /* gap > 2, we do exponential cwnd decrease */ int dec; diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index d8e8bba2d03a..2663505a0dd7 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -171,7 +171,7 @@ void tcp_cwnd_restart(struct sock *sk, s32 delta) tcp_ca_event(sk, CA_EVENT_CWND_RESTART); - tp->snd_ssthresh = tcp_current_ssthresh(sk); + WRITE_ONCE(tp->snd_ssthresh, tcp_current_ssthresh(sk)); restart_cwnd = min(restart_cwnd, cwnd); while ((delta -= inet_csk(sk)->icsk_rto) > 0 && cwnd > restart_cwnd) @@ -2143,7 +2143,7 @@ static void tcp_cwnd_application_limited(struct sock *sk) u32 init_win = tcp_init_cwnd(tp, __sk_dst_get(sk)); u32 win_used = max(tp->snd_cwnd_used, init_win); if (win_used < tcp_snd_cwnd(tp)) { - tp->snd_ssthresh = tcp_current_ssthresh(sk); + WRITE_ONCE(tp->snd_ssthresh, tcp_current_ssthresh(sk)); tcp_snd_cwnd_set(tp, (tcp_snd_cwnd(tp) + win_used) >> 1); } tp->snd_cwnd_used = 0; diff --git a/net/ipv4/tcp_vegas.c b/net/ipv4/tcp_vegas.c index 950a66966059..574453af6bc0 100644 --- a/net/ipv4/tcp_vegas.c +++ b/net/ipv4/tcp_vegas.c @@ -245,7 +245,8 @@ static void tcp_vegas_cong_avoid(struct sock *sk, u32 ack, u32 acked) */ tcp_snd_cwnd_set(tp, min(tcp_snd_cwnd(tp), (u32)target_cwnd + 1)); - tp->snd_ssthresh = tcp_vegas_ssthresh(tp); + WRITE_ONCE(tp->snd_ssthresh, + tcp_vegas_ssthresh(tp)); } else if (tcp_in_slow_start(tp)) { /* Slow start. */ @@ -261,8 +262,8 @@ static void tcp_vegas_cong_avoid(struct sock *sk, u32 ack, u32 acked) * we slow down. */ tcp_snd_cwnd_set(tp, tcp_snd_cwnd(tp) - 1); - tp->snd_ssthresh - = tcp_vegas_ssthresh(tp); + WRITE_ONCE(tp->snd_ssthresh, + tcp_vegas_ssthresh(tp)); } else if (diff < alpha) { /* We don't have enough extra packets * in the network, so speed up. @@ -280,7 +281,7 @@ static void tcp_vegas_cong_avoid(struct sock *sk, u32 ack, u32 acked) else if (tcp_snd_cwnd(tp) > tp->snd_cwnd_clamp) tcp_snd_cwnd_set(tp, tp->snd_cwnd_clamp); - tp->snd_ssthresh = tcp_current_ssthresh(sk); + WRITE_ONCE(tp->snd_ssthresh, tcp_current_ssthresh(sk)); } /* Wipe the slate clean for the next RTT. 
*/ diff --git a/net/ipv4/tcp_westwood.c b/net/ipv4/tcp_westwood.c index c6e97141eef2..b5a42adfd6ca 100644 --- a/net/ipv4/tcp_westwood.c +++ b/net/ipv4/tcp_westwood.c @@ -244,11 +244,11 @@ static void tcp_westwood_event(struct sock *sk, enum tcp_ca_event event) switch (event) { case CA_EVENT_COMPLETE_CWR: - tp->snd_ssthresh = tcp_westwood_bw_rttmin(sk); + WRITE_ONCE(tp->snd_ssthresh, tcp_westwood_bw_rttmin(sk)); tcp_snd_cwnd_set(tp, tp->snd_ssthresh); break; case CA_EVENT_LOSS: - tp->snd_ssthresh = tcp_westwood_bw_rttmin(sk); + WRITE_ONCE(tp->snd_ssthresh, tcp_westwood_bw_rttmin(sk)); /* Update RTT_min when next ack arrives */ w->reset_rtt_min = 1; break; diff --git a/net/ipv4/tcp_yeah.c b/net/ipv4/tcp_yeah.c index b22b3dccd05e..9e581154f18f 100644 --- a/net/ipv4/tcp_yeah.c +++ b/net/ipv4/tcp_yeah.c @@ -147,7 +147,8 @@ static void tcp_yeah_cong_avoid(struct sock *sk, u32 ack, u32 acked) tcp_snd_cwnd_set(tp, max(tcp_snd_cwnd(tp), yeah->reno_count)); - tp->snd_ssthresh = tcp_snd_cwnd(tp); + WRITE_ONCE(tp->snd_ssthresh, + tcp_snd_cwnd(tp)); } if (yeah->reno_count <= 2) From faa886ad3ce5fc8f5156493491fe189b2b726bc9 Mon Sep 17 00:00:00 2001 From: Eric Dumazet Date: Thu, 16 Apr 2026 20:03:10 +0000 Subject: [PATCH 029/148] tcp: annotate data-races around tp->delivered and tp->delivered_ce tcp_get_timestamping_opt_stats() intentionally runs lockless, we must add READ_ONCE() and WRITE_ONCE() annotations to keep KCSAN happy. Fixes: feb5f2ec6464 ("tcp: export packets delivery info") Signed-off-by: Eric Dumazet Link: https://patch.msgid.link/20260416200319.3608680-6-edumazet@google.com Signed-off-by: Jakub Kicinski --- include/net/tcp_ecn.h | 2 +- net/ipv4/tcp.c | 4 ++-- net/ipv4/tcp_input.c | 8 ++++---- 3 files changed, 7 insertions(+), 7 deletions(-) diff --git a/include/net/tcp_ecn.h b/include/net/tcp_ecn.h index e9a933641636..865d5c5a7718 100644 --- a/include/net/tcp_ecn.h +++ b/include/net/tcp_ecn.h @@ -181,7 +181,7 @@ static inline void tcp_accecn_third_ack(struct sock *sk, tcp_accecn_validate_syn_feedback(sk, ace, sent_ect)) { if ((tcp_accecn_extract_syn_ect(ace) == INET_ECN_CE) && !tp->delivered_ce) - tp->delivered_ce++; + WRITE_ONCE(tp->delivered_ce, 1); } break; } diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index 802a9ea05211..0aabd02d4496 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -4453,8 +4453,8 @@ struct sk_buff *tcp_get_timestamping_opt_stats(const struct sock *sk, READ_ONCE(inet_csk(sk)->icsk_retransmits)); nla_put_u8(stats, TCP_NLA_DELIVERY_RATE_APP_LMT, data_race(!!tp->rate_app_limited)); nla_put_u32(stats, TCP_NLA_SND_SSTHRESH, READ_ONCE(tp->snd_ssthresh)); - nla_put_u32(stats, TCP_NLA_DELIVERED, tp->delivered); - nla_put_u32(stats, TCP_NLA_DELIVERED_CE, tp->delivered_ce); + nla_put_u32(stats, TCP_NLA_DELIVERED, READ_ONCE(tp->delivered)); + nla_put_u32(stats, TCP_NLA_DELIVERED_CE, READ_ONCE(tp->delivered_ce)); nla_put_u32(stats, TCP_NLA_SNDQ_SIZE, tp->write_seq - tp->snd_una); nla_put_u8(stats, TCP_NLA_CA_STATE, inet_csk(sk)->icsk_ca_state); diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index c6361447535f..63ff89210a72 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -476,14 +476,14 @@ static bool tcp_accecn_process_option(struct tcp_sock *tp, static void tcp_count_delivered_ce(struct tcp_sock *tp, u32 ecn_count) { - tp->delivered_ce += ecn_count; + WRITE_ONCE(tp->delivered_ce, tp->delivered_ce + ecn_count); } /* Updates the delivered and delivered_ce counts */ static void tcp_count_delivered(struct tcp_sock *tp, u32 delivered, bool ece_ack) { - 
tp->delivered += delivered; + WRITE_ONCE(tp->delivered, tp->delivered + delivered); if (tcp_ecn_mode_rfc3168(tp) && ece_ack) tcp_count_delivered_ce(tp, delivered); } @@ -6779,7 +6779,7 @@ static bool tcp_rcv_fastopen_synack(struct sock *sk, struct sk_buff *synack, NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPFASTOPENACTIVE); /* SYN-data is counted as two separate packets in tcp_ack() */ if (tp->delivered > 1) - --tp->delivered; + WRITE_ONCE(tp->delivered, tp->delivered - 1); } tcp_fastopen_add_skb(sk, synack); @@ -7212,7 +7212,7 @@ tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb) SKB_DR_SET(reason, NOT_SPECIFIED); switch (sk->sk_state) { case TCP_SYN_RECV: - tp->delivered++; /* SYN-ACK delivery isn't tracked in tcp_ack */ + WRITE_ONCE(tp->delivered, tp->delivered + 1); /* SYN-ACK delivery isn't tracked in tcp_ack */ if (!tp->srtt_us) tcp_synack_rtt_meas(sk, req); From 124199444de467767175a9004e1574dc42523e62 Mon Sep 17 00:00:00 2001 From: Eric Dumazet Date: Thu, 16 Apr 2026 20:03:11 +0000 Subject: [PATCH 030/148] tcp: add data-race annotations for TCP_NLA_SNDQ_SIZE tcp_get_timestamping_opt_stats() intentionally runs lockless, we must add READ_ONCE() and WRITE_ONCE() annotations to keep KCSAN happy. Fixes: 87ecc95d81d9 ("tcp: add send queue size stat in SCM_TIMESTAMPING_OPT_STATS") Signed-off-by: Eric Dumazet Link: https://patch.msgid.link/20260416200319.3608680-7-edumazet@google.com Signed-off-by: Jakub Kicinski --- net/ipv4/tcp.c | 4 +++- net/ipv4/tcp_input.c | 4 ++-- net/ipv4/tcp_output.c | 2 +- 3 files changed, 6 insertions(+), 4 deletions(-) diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index 0aabd02d4496..729936d13a5c 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -4456,7 +4456,9 @@ struct sk_buff *tcp_get_timestamping_opt_stats(const struct sock *sk, nla_put_u32(stats, TCP_NLA_DELIVERED, READ_ONCE(tp->delivered)); nla_put_u32(stats, TCP_NLA_DELIVERED_CE, READ_ONCE(tp->delivered_ce)); - nla_put_u32(stats, TCP_NLA_SNDQ_SIZE, tp->write_seq - tp->snd_una); + nla_put_u32(stats, TCP_NLA_SNDQ_SIZE, + max_t(int, 0, + READ_ONCE(tp->write_seq) - READ_ONCE(tp->snd_una))); nla_put_u8(stats, TCP_NLA_CA_STATE, inet_csk(sk)->icsk_ca_state); nla_put_u64_64bit(stats, TCP_NLA_BYTES_SENT, tp->bytes_sent, diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 63ff89210a72..edb5013873e0 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -3912,7 +3912,7 @@ static void tcp_snd_una_update(struct tcp_sock *tp, u32 ack) sock_owned_by_me((struct sock *)tp); tp->bytes_acked += delta; tcp_snd_sne_update(tp, ack); - tp->snd_una = ack; + WRITE_ONCE(tp->snd_una, ack); } static void tcp_rcv_sne_update(struct tcp_sock *tp, u32 seq) @@ -7240,7 +7240,7 @@ tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb) if (sk->sk_socket) sk_wake_async(sk, SOCK_WAKE_IO, POLL_OUT); - tp->snd_una = TCP_SKB_CB(skb)->ack_seq; + WRITE_ONCE(tp->snd_una, TCP_SKB_CB(skb)->ack_seq); tp->snd_wnd = ntohs(th->window) << tp->rx_opt.snd_wscale; tcp_init_wl(tp, TCP_SKB_CB(skb)->seq); diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index 2663505a0dd7..9f83c7e4acab 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -4153,7 +4153,7 @@ static void tcp_connect_init(struct sock *sk) tp->snd_wnd = 0; tcp_init_wl(tp, 0); tcp_write_queue_purge(sk); - tp->snd_una = tp->write_seq; + WRITE_ONCE(tp->snd_una, tp->write_seq); tp->snd_sml = tp->write_seq; tp->snd_up = tp->write_seq; WRITE_ONCE(tp->snd_nxt, tp->write_seq); From ee43e957ce2ec77b2ec47fef28f3c0df6ab01a31 Mon Sep 17 00:00:00 2001 From: 
Eric Dumazet Date: Thu, 16 Apr 2026 20:03:12 +0000 Subject: [PATCH 031/148] tcp: annotate data-races around tp->bytes_sent tcp_get_timestamping_opt_stats() intentionally runs lockless, we must add READ_ONCE() and WRITE_ONCE() annotations to keep KCSAN happy. Fixes: ba113c3aa79a ("tcp: add data bytes sent stats") Signed-off-by: Eric Dumazet Link: https://patch.msgid.link/20260416200319.3608680-8-edumazet@google.com Signed-off-by: Jakub Kicinski --- net/ipv4/tcp.c | 2 +- net/ipv4/tcp_output.c | 3 ++- 2 files changed, 3 insertions(+), 2 deletions(-) diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index 729936d13a5c..f999b86851cd 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -4461,7 +4461,7 @@ struct sk_buff *tcp_get_timestamping_opt_stats(const struct sock *sk, READ_ONCE(tp->write_seq) - READ_ONCE(tp->snd_una))); nla_put_u8(stats, TCP_NLA_CA_STATE, inet_csk(sk)->icsk_ca_state); - nla_put_u64_64bit(stats, TCP_NLA_BYTES_SENT, tp->bytes_sent, + nla_put_u64_64bit(stats, TCP_NLA_BYTES_SENT, READ_ONCE(tp->bytes_sent), TCP_NLA_PAD); nla_put_u64_64bit(stats, TCP_NLA_BYTES_RETRANS, tp->bytes_retrans, TCP_NLA_PAD); diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index 9f83c7e4acab..87af4731df87 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -1690,7 +1690,8 @@ static int __tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, tcp_event_data_sent(tp, sk); WRITE_ONCE(tp->data_segs_out, tp->data_segs_out + tcp_skb_pcount(skb)); - tp->bytes_sent += skb->len - tcp_header_size; + WRITE_ONCE(tp->bytes_sent, + tp->bytes_sent + skb->len - tcp_header_size); } if (after(tcb->end_seq, tp->snd_nxt) || tcb->seq == tcb->end_seq) From 5efc7b9f7cbd43401f1af81d3d7f2be00f93390d Mon Sep 17 00:00:00 2001 From: Eric Dumazet Date: Thu, 16 Apr 2026 20:03:13 +0000 Subject: [PATCH 032/148] tcp: annotate data-races around tp->bytes_retrans tcp_get_timestamping_opt_stats() intentionally runs lockless, we must add READ_ONCE() and WRITE_ONCE() annotations to keep KCSAN happy. 
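Note the deliberate asymmetry in these conversions: because every
writer already holds the socket lock, no atomic read-modify-write is
needed, only a tear-free store. Sketched side by side (the atomic
variant is hypothetical and would require changing the field to
atomic64_t):

	/* Not used: correct, but needlessly expensive on the fast path */
	atomic64_add(skb->len, &tp->bytes_retrans);

	/* Used instead: the plain read is fine (writers serialized by
	 * the socket lock); the single store is what lockless readers
	 * in tcp_get_timestamping_opt_stats() snapshot via READ_ONCE().
	 */
	WRITE_ONCE(tp->bytes_retrans, tp->bytes_retrans + skb->len);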
Fixes: fb31c9b9f6c8 ("tcp: add data bytes retransmitted stats") Signed-off-by: Eric Dumazet Link: https://patch.msgid.link/20260416200319.3608680-9-edumazet@google.com Signed-off-by: Jakub Kicinski --- net/ipv4/tcp.c | 4 ++-- net/ipv4/tcp_output.c | 2 +- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index f999b86851cd..8c84639dc54b 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -4463,8 +4463,8 @@ struct sk_buff *tcp_get_timestamping_opt_stats(const struct sock *sk, nla_put_u64_64bit(stats, TCP_NLA_BYTES_SENT, READ_ONCE(tp->bytes_sent), TCP_NLA_PAD); - nla_put_u64_64bit(stats, TCP_NLA_BYTES_RETRANS, tp->bytes_retrans, - TCP_NLA_PAD); + nla_put_u64_64bit(stats, TCP_NLA_BYTES_RETRANS, + READ_ONCE(tp->bytes_retrans), TCP_NLA_PAD); nla_put_u32(stats, TCP_NLA_DSACK_DUPS, tp->dsack_dups); nla_put_u32(stats, TCP_NLA_REORD_SEEN, tp->reord_seen); nla_put_u32(stats, TCP_NLA_SRTT, tp->srtt_us >> 3); diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index 87af4731df87..f9d8755705f7 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -3645,7 +3645,7 @@ int __tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb, int segs) if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_SYN) __NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPSYNRETRANS); WRITE_ONCE(tp->total_retrans, tp->total_retrans + segs); - tp->bytes_retrans += skb->len; + WRITE_ONCE(tp->bytes_retrans, tp->bytes_retrans + skb->len); /* make sure skb->data is aligned on arches that require it * and check if ack-trimming & collapsing extended the headroom From a984705ca88b976bf1087978fd98b7f3993da88c Mon Sep 17 00:00:00 2001 From: Eric Dumazet Date: Thu, 16 Apr 2026 20:03:14 +0000 Subject: [PATCH 033/148] tcp: annotate data-races around tp->dsack_dups tcp_get_timestamping_opt_stats() intentionally runs lockless, we must add READ_ONCE() and WRITE_ONCE() annotations to keep KCSAN happy. 
Fixes: 7e10b6554ff2 ("tcp: add dsack blocks received stats") Signed-off-by: Eric Dumazet Link: https://patch.msgid.link/20260416200319.3608680-10-edumazet@google.com Signed-off-by: Jakub Kicinski --- net/ipv4/tcp.c | 2 +- net/ipv4/tcp_input.c | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index 8c84639dc54b..57c4dcc8bfe9 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -4465,7 +4465,7 @@ struct sk_buff *tcp_get_timestamping_opt_stats(const struct sock *sk, TCP_NLA_PAD); nla_put_u64_64bit(stats, TCP_NLA_BYTES_RETRANS, READ_ONCE(tp->bytes_retrans), TCP_NLA_PAD); - nla_put_u32(stats, TCP_NLA_DSACK_DUPS, tp->dsack_dups); + nla_put_u32(stats, TCP_NLA_DSACK_DUPS, READ_ONCE(tp->dsack_dups)); nla_put_u32(stats, TCP_NLA_REORD_SEEN, tp->reord_seen); nla_put_u32(stats, TCP_NLA_SRTT, tp->srtt_us >> 3); nla_put_u16(stats, TCP_NLA_TIMEOUT_REHASH, tp->timeout_rehash); diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index edb5013873e0..65b3ecc6be4b 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -1246,7 +1246,7 @@ static u32 tcp_dsack_seen(struct tcp_sock *tp, u32 start_seq, else if (tp->tlp_high_seq && tp->tlp_high_seq == end_seq) state->flag |= FLAG_DSACK_TLP; - tp->dsack_dups += dup_segs; + WRITE_ONCE(tp->dsack_dups, tp->dsack_dups + dup_segs); /* Skip the DSACK if dup segs weren't retransmitted by sender */ if (tp->dsack_dups > tp->total_retrans) return 0; From 62585690e6b2a112c408fe25f142b246ac833c42 Mon Sep 17 00:00:00 2001 From: Eric Dumazet Date: Thu, 16 Apr 2026 20:03:15 +0000 Subject: [PATCH 034/148] tcp: annotate data-races around tp->reord_seen tcp_get_timestamping_opt_stats() intentionally runs lockless, we must add READ_ONCE() and WRITE_ONCE() annotations to keep KCSAN happy. Fixes: 7ec65372ca53 ("tcp: add stat of data packet reordering events") Signed-off-by: Eric Dumazet Link: https://patch.msgid.link/20260416200319.3608680-11-edumazet@google.com Signed-off-by: Jakub Kicinski --- net/ipv4/tcp.c | 2 +- net/ipv4/tcp_input.c | 4 ++-- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index 57c4dcc8bfe9..39a4b06e36bb 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -4466,7 +4466,7 @@ struct sk_buff *tcp_get_timestamping_opt_stats(const struct sock *sk, nla_put_u64_64bit(stats, TCP_NLA_BYTES_RETRANS, READ_ONCE(tp->bytes_retrans), TCP_NLA_PAD); nla_put_u32(stats, TCP_NLA_DSACK_DUPS, READ_ONCE(tp->dsack_dups)); - nla_put_u32(stats, TCP_NLA_REORD_SEEN, tp->reord_seen); + nla_put_u32(stats, TCP_NLA_REORD_SEEN, READ_ONCE(tp->reord_seen)); nla_put_u32(stats, TCP_NLA_SRTT, tp->srtt_us >> 3); nla_put_u16(stats, TCP_NLA_TIMEOUT_REHASH, tp->timeout_rehash); nla_put_u32(stats, TCP_NLA_BYTES_NOTSENT, diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 65b3ecc6be4b..896a5a5a6b1a 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -1299,7 +1299,7 @@ static void tcp_check_sack_reordering(struct sock *sk, const u32 low_seq, } /* This exciting event is worth to be remembered. 8) */ - tp->reord_seen++; + WRITE_ONCE(tp->reord_seen, tp->reord_seen + 1); NET_INC_STATS(sock_net(sk), ts ? 
LINUX_MIB_TCPTSREORDER : LINUX_MIB_TCPSACKREORDER); } @@ -2443,7 +2443,7 @@ static void tcp_check_reno_reordering(struct sock *sk, const int addend) WRITE_ONCE(tp->reordering, min_t(u32, tp->packets_out + addend, READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_max_reordering))); - tp->reord_seen++; + WRITE_ONCE(tp->reord_seen, tp->reord_seen + 1); NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPRENOREORDER); } From 290b693ce7c9d48588d88b15a782a3efc6fa036b Mon Sep 17 00:00:00 2001 From: Eric Dumazet Date: Thu, 16 Apr 2026 20:03:16 +0000 Subject: [PATCH 035/148] tcp: annotate data-races around tp->srtt_us tcp_get_timestamping_opt_stats() intentionally runs lockless, we must add READ_ONCE() and WRITE_ONCE() annotations to keep KCSAN happy. Fixes: e8bd8fca6773 ("tcp: add SRTT to SCM_TIMESTAMPING_OPT_STATS") Signed-off-by: Eric Dumazet Link: https://patch.msgid.link/20260416200319.3608680-12-edumazet@google.com Signed-off-by: Jakub Kicinski --- net/ipv4/tcp.c | 5 +++-- net/ipv4/tcp_input.c | 2 +- 2 files changed, 4 insertions(+), 3 deletions(-) diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index 39a4b06e36bb..541bd0d2d8c4 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -3623,7 +3623,8 @@ static void tcp_enable_tx_delay(struct sock *sk, int val) if (delta && sk->sk_state == TCP_ESTABLISHED) { s64 srtt = (s64)tp->srtt_us + delta; - tp->srtt_us = clamp_t(s64, srtt, 1, ~0U); + WRITE_ONCE(tp->srtt_us, + clamp_t(s64, srtt, 1, ~0U)); /* Note: does not deal with non zero icsk_backoff */ tcp_set_rto(sk); @@ -4467,7 +4468,7 @@ struct sk_buff *tcp_get_timestamping_opt_stats(const struct sock *sk, READ_ONCE(tp->bytes_retrans), TCP_NLA_PAD); nla_put_u32(stats, TCP_NLA_DSACK_DUPS, READ_ONCE(tp->dsack_dups)); nla_put_u32(stats, TCP_NLA_REORD_SEEN, READ_ONCE(tp->reord_seen)); - nla_put_u32(stats, TCP_NLA_SRTT, tp->srtt_us >> 3); + nla_put_u32(stats, TCP_NLA_SRTT, READ_ONCE(tp->srtt_us) >> 3); nla_put_u16(stats, TCP_NLA_TIMEOUT_REHASH, tp->timeout_rehash); nla_put_u32(stats, TCP_NLA_BYTES_NOTSENT, max_t(int, 0, tp->write_seq - tp->snd_nxt)); diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 896a5a5a6b1a..e04ae105893c 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -1132,7 +1132,7 @@ static void tcp_rtt_estimator(struct sock *sk, long mrtt_us) tcp_bpf_rtt(sk, mrtt_us, srtt); } - tp->srtt_us = max(1U, srtt); + WRITE_ONCE(tp->srtt_us, max(1U, srtt)); } void tcp_update_pacing_rate(struct sock *sk) From 71c675358b711bbfd8528949249419dc2dfa4ce1 Mon Sep 17 00:00:00 2001 From: Eric Dumazet Date: Thu, 16 Apr 2026 20:03:17 +0000 Subject: [PATCH 036/148] tcp: annotate data-races around tp->timeout_rehash tcp_get_timestamping_opt_stats() intentionally runs lockless, we must add READ_ONCE() and WRITE_ONCE() annotations to keep KCSAN happy. 
Fixes: 32efcc06d2a1 ("tcp: export count for rehash attempts") Signed-off-by: Eric Dumazet Link: https://patch.msgid.link/20260416200319.3608680-13-edumazet@google.com Signed-off-by: Jakub Kicinski --- net/ipv4/tcp.c | 3 ++- net/ipv4/tcp_timer.c | 2 +- 2 files changed, 3 insertions(+), 2 deletions(-) diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index 541bd0d2d8c4..192e95b71ce9 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -4469,7 +4469,8 @@ struct sk_buff *tcp_get_timestamping_opt_stats(const struct sock *sk, nla_put_u32(stats, TCP_NLA_DSACK_DUPS, READ_ONCE(tp->dsack_dups)); nla_put_u32(stats, TCP_NLA_REORD_SEEN, READ_ONCE(tp->reord_seen)); nla_put_u32(stats, TCP_NLA_SRTT, READ_ONCE(tp->srtt_us) >> 3); - nla_put_u16(stats, TCP_NLA_TIMEOUT_REHASH, tp->timeout_rehash); + nla_put_u16(stats, TCP_NLA_TIMEOUT_REHASH, + READ_ONCE(tp->timeout_rehash)); nla_put_u32(stats, TCP_NLA_BYTES_NOTSENT, max_t(int, 0, tp->write_seq - tp->snd_nxt)); nla_put_u64_64bit(stats, TCP_NLA_EDT, orig_skb->skb_mstamp_ns, diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c index ea99988795e7..8d791a954cd6 100644 --- a/net/ipv4/tcp_timer.c +++ b/net/ipv4/tcp_timer.c @@ -297,7 +297,7 @@ static int tcp_write_timeout(struct sock *sk) } if (sk_rethink_txhash(sk)) { - tp->timeout_rehash++; + WRITE_ONCE(tp->timeout_rehash, tp->timeout_rehash + 1); __NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPTIMEOUTREHASH); } From 3a63b3d160560ef51e43fb4c880a5cde8078053c Mon Sep 17 00:00:00 2001 From: Eric Dumazet Date: Thu, 16 Apr 2026 20:03:18 +0000 Subject: [PATCH 037/148] tcp: annotate data-races around (tp->write_seq - tp->snd_nxt) tcp_get_timestamping_opt_stats() intentionally runs lockless, we must add READ_ONCE() annotations to keep KCSAN happy. WRITE_ONCE() annotations are already present. Fixes: e08ab0b377a1 ("tcp: add bytes not sent to SCM_TIMESTAMPING_OPT_STATS") Signed-off-by: Eric Dumazet Link: https://patch.msgid.link/20260416200319.3608680-14-edumazet@google.com Signed-off-by: Jakub Kicinski --- net/ipv4/tcp.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index 192e95b71ce9..68894c03f262 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -4472,7 +4472,8 @@ struct sk_buff *tcp_get_timestamping_opt_stats(const struct sock *sk, nla_put_u16(stats, TCP_NLA_TIMEOUT_REHASH, READ_ONCE(tp->timeout_rehash)); nla_put_u32(stats, TCP_NLA_BYTES_NOTSENT, - max_t(int, 0, tp->write_seq - tp->snd_nxt)); + max_t(int, 0, + READ_ONCE(tp->write_seq) - READ_ONCE(tp->snd_nxt))); nla_put_u64_64bit(stats, TCP_NLA_EDT, orig_skb->skb_mstamp_ns, TCP_NLA_PAD); if (ack_skb) From 9e89b9d03a2d2e30dcca166d5af52f9a8eceab25 Mon Sep 17 00:00:00 2001 From: Eric Dumazet Date: Thu, 16 Apr 2026 20:03:19 +0000 Subject: [PATCH 038/148] tcp: annotate data-races around tp->plb_rehash tcp_get_timestamping_opt_stats() intentionally runs lockless, we must add READ_ONCE() and WRITE_ONCE() annotations to keep KCSAN happy. 
Fixes: 29c1c44646ae ("tcp: add u32 counter in tcp_sock and an SNMP counter for PLB")
Signed-off-by: Eric Dumazet
Link: https://patch.msgid.link/20260416200319.3608680-15-edumazet@google.com
Signed-off-by: Jakub Kicinski
---
 net/ipv4/tcp.c     | 3 ++-
 net/ipv4/tcp_plb.c | 2 +-
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 68894c03f262..7fbf2fca5eb2 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -4480,7 +4480,8 @@ struct sk_buff *tcp_get_timestamping_opt_stats(const struct sock *sk,
 			nla_put_u8(stats, TCP_NLA_TTL,
 				   tcp_skb_ttl_or_hop_limit(ack_skb));
 
-	nla_put_u32(stats, TCP_NLA_REHASH, tp->plb_rehash + tp->timeout_rehash);
+	nla_put_u32(stats, TCP_NLA_REHASH,
+		    READ_ONCE(tp->plb_rehash) + READ_ONCE(tp->timeout_rehash));
 	return stats;
 }
 
diff --git a/net/ipv4/tcp_plb.c b/net/ipv4/tcp_plb.c
index 68ccdb9a5412..c11a0cd3f8fe 100644
--- a/net/ipv4/tcp_plb.c
+++ b/net/ipv4/tcp_plb.c
@@ -80,7 +80,7 @@ void tcp_plb_check_rehash(struct sock *sk, struct tcp_plb_state *plb)
 	sk_rethink_txhash(sk);
 	plb->consec_cong_rounds = 0;
 
-	tcp_sk(sk)->plb_rehash++;
+	WRITE_ONCE(tcp_sk(sk)->plb_rehash, tcp_sk(sk)->plb_rehash + 1);
 	NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPPLBREHASH);
 }
 EXPORT_SYMBOL_GPL(tcp_plb_check_rehash);

From 885c5e57924dc040b23d0ad0d8388f0e35772159 Mon Sep 17 00:00:00 2001
From: Grzegorz Nitka
Date: Thu, 16 Apr 2026 17:53:25 -0700
Subject: [PATCH 039/148] ice: fix 'adjust' timer programming for E830 devices
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Fix an incorrect 'adjust the timer' programming sequence for E830
series devices. The current implementation programmed only the
GLTSYN_SHADJ shadow registers. According to the specification [1], a
write to the GLTSYN_CMD command register, with the CMD field set to the
"Adjust the Time" value, is also required for the timer adjustment to
take effect.

The flow was broken for adjustments within the S32_MAX/S32_MIN range
(around +/- 2 seconds). Bigger adjustments take a separate, non-atomic
programming flow involving set-timer programming; that flow is
implemented correctly.
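For clarity, the atomic adjustment path must end with the command
write. Sketched below with register and helper names that follow ice
driver conventions (a simplified illustration of the sequence under
those naming assumptions, not the literal patch hunk):

	/* Program the shadow adjustment registers of the owned timer */
	wr32(hw, GLTSYN_SHADJ_L(tmr_idx), 0);
	wr32(hw, GLTSYN_SHADJ_H(tmr_idx), adj);

	/* Latch the shadow value: without the "Adjust the Time" command
	 * written to GLTSYN_CMD, the adjustment never takes effect --
	 * this is the step the E830 path skipped by returning early.
	 */
	err = ice_ptp_tmr_cmd(hw, ICE_PTP_ADJ_TIME);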
Testing hints:
Run command:
phc_ctl /dev/ptpX get adj 2 get
Expected result: the returned timestamps differ by at least 2 seconds

[1] Intel® Ethernet Controller E830 Datasheet rev 1.3, chapter 9.7.5.4
https://cdrdv2.intel.com/v1/dl/getContent/787353?explicitVersion=true

Fixes: f00307522786 ("ice: Implement PTP support for E830 devices")
Reviewed-by: Aleksandr Loktionov
Signed-off-by: Grzegorz Nitka
Reviewed-by: Simon Horman
Tested-by: Rinitha S
Reviewed-by: Jacob Keller
Signed-off-by: Jacob Keller
Link: https://patch.msgid.link/20260416-iwl-net-submission-2026-04-14-v2-1-686c33c9828d@intel.com
Signed-off-by: Jakub Kicinski
---
 drivers/net/ethernet/intel/ice/ice_ptp_hw.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_ptp_hw.c b/drivers/net/ethernet/intel/ice/ice_ptp_hw.c
index 61c0a0d93ea8..5a5c511ccbb6 100644
--- a/drivers/net/ethernet/intel/ice/ice_ptp_hw.c
+++ b/drivers/net/ethernet/intel/ice/ice_ptp_hw.c
@@ -5381,8 +5381,8 @@ int ice_ptp_write_incval_locked(struct ice_hw *hw, u64 incval)
  */
 int ice_ptp_adj_clock(struct ice_hw *hw, s32 adj)
 {
+	int err = 0;
 	u8 tmr_idx;
-	int err;
 
 	tmr_idx = hw->func_caps.ts_func_info.tmr_index_owned;
 
@@ -5399,8 +5399,8 @@ int ice_ptp_adj_clock(struct ice_hw *hw, s32 adj)
 		err = ice_ptp_prep_phy_adj_e810(hw, adj);
 		break;
 	case ICE_MAC_E830:
-		/* E830 sync PHYs automatically after setting GLTSYN_SHADJ */
-		return 0;
+		/* E830 sync PHYs automatically after setting cmd register */
+		break;
 	case ICE_MAC_GENERIC:
 		err = ice_ptp_prep_phy_adj_e82x(hw, adj);
 		break;

From 05567e4052732d70c7ff9655217b3d14d25f639a Mon Sep 17 00:00:00 2001
From: Grzegorz Nitka
Date: Thu, 16 Apr 2026 17:53:26 -0700
Subject: [PATCH 040/148] ice: update PCS latency settings for E825 10G/25Gb
 modes

Update the MAC Rx/Tx offset register settings (PHY_MAC_[RX|TX]_OFFSET
registers) with data obtained from the latest research. This applies to
the PCS latency settings for the following speeds/modes:

* 10Gb NO-FEC
  - TX latency changed from 71.25 ns to 73 ns
  - RX latency changed from -25.6 ns to -28 ns
* 25Gb NO-FEC
  - TX latency changed from 28.17 ns to 33 ns
  - RX latency changed from -12.45 ns to -12 ns
* 25Gb RS-FEC
  - TX latency changed from 64.5 ns to 69 ns
  - RX latency changed from -3.6 ns to -3 ns

The original data came from simulation and pre-production hardware. The
new data comes from measurements of the actual delays and as such is
more accurate.
Fixes: 7cab44f1c35f ("ice: Introduce ETH56G PHY model for E825C products") Co-developed-by: Zoltan Fodor Signed-off-by: Zoltan Fodor Reviewed-by: Aleksandr Loktionov Reviewed-by: Jacob Keller Signed-off-by: Grzegorz Nitka Tested-by: Sunitha Mekala Signed-off-by: Jacob Keller Link: https://patch.msgid.link/20260416-iwl-net-submission-2026-04-14-v2-2-686c33c9828d@intel.com Signed-off-by: Jakub Kicinski --- drivers/net/ethernet/intel/ice/ice_ptp_consts.h | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/drivers/net/ethernet/intel/ice/ice_ptp_consts.h b/drivers/net/ethernet/intel/ice/ice_ptp_consts.h index 19dddd9b53dd..4d298c27bfb2 100644 --- a/drivers/net/ethernet/intel/ice/ice_ptp_consts.h +++ b/drivers/net/ethernet/intel/ice/ice_ptp_consts.h @@ -78,14 +78,14 @@ struct ice_eth56g_mac_reg_cfg eth56g_mac_cfg[NUM_ICE_ETH56G_LNK_SPD] = { .blktime = 0x666, /* 3.2 */ .tx_offset = { .serdes = 0x234c, /* 17.6484848 */ - .no_fec = 0x8e80, /* 71.25 */ + .no_fec = 0x93d9, /* 73 */ .fc = 0xb4a4, /* 90.32 */ .sfd = 0x4a4, /* 2.32 */ .onestep = 0x4ccd /* 38.4 */ }, .rx_offset = { .serdes = 0xffffeb27, /* -10.42424 */ - .no_fec = 0xffffcccd, /* -25.6 */ + .no_fec = 0xffffc7b6, /* -28 */ .fc = 0xfffc557b, /* -469.26 */ .sfd = 0x4a4, /* 2.32 */ .bs_ds = 0x32 /* 0.0969697 */ @@ -118,17 +118,17 @@ struct ice_eth56g_mac_reg_cfg eth56g_mac_cfg[NUM_ICE_ETH56G_LNK_SPD] = { .mktime = 0x147b, /* 10.24, only if RS-FEC enabled */ .tx_offset = { .serdes = 0xe1e, /* 7.0593939 */ - .no_fec = 0x3857, /* 28.17 */ + .no_fec = 0x4266, /* 33 */ .fc = 0x48c3, /* 36.38 */ - .rs = 0x8100, /* 64.5 */ + .rs = 0x8a00, /* 69 */ .sfd = 0x1dc, /* 0.93 */ .onestep = 0x1eb8 /* 15.36 */ }, .rx_offset = { .serdes = 0xfffff7a9, /* -4.1697 */ - .no_fec = 0xffffe71a, /* -12.45 */ + .no_fec = 0xffffe700, /* -12 */ .fc = 0xfffe894d, /* -187.35 */ - .rs = 0xfffff8cd, /* -3.6 */ + .rs = 0xfffff8cc, /* -3 */ .sfd = 0x1dc, /* 0.93 */ .bs_ds = 0x14 /* 0.0387879, RS-FEC 0 */ } From 9aab1c3d7299285e2569cbc0ed5892d631a241b2 Mon Sep 17 00:00:00 2001 From: Guangshuo Li Date: Thu, 16 Apr 2026 17:53:27 -0700 Subject: [PATCH 041/148] ice: fix double free in ice_sf_eth_activate() error path When auxiliary_device_add() fails, ice_sf_eth_activate() jumps to aux_dev_uninit and calls auxiliary_device_uninit(&sf_dev->adev). The device release callback ice_sf_dev_release() frees sf_dev, but the current error path falls through to sf_dev_free and calls kfree(sf_dev) again, causing a double free. Keep kfree(sf_dev) for the auxiliary_device_init() failure path, but avoid falling through to sf_dev_free after auxiliary_device_uninit(). 
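The rule being applied is the standard auxiliary bus ownership handoff:
once auxiliary_device_init() has succeeded, the device's release
callback owns the allocation, and the caller must never kfree() it
again. The corrected error handling, condensed into a sketch:

	err = auxiliary_device_init(&sf_dev->adev);
	if (err)
		goto sf_dev_free;	/* init failed: we still own sf_dev */

	err = auxiliary_device_add(&sf_dev->adev);
	if (err) {
		/* Drops the last reference; ice_sf_dev_release() frees
		 * sf_dev, so falling through to kfree() afterwards
		 * would be the double free being fixed here.
		 */
		auxiliary_device_uninit(&sf_dev->adev);
		return err;
	}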
Fixes: 13acc5c4cdbe ("ice: subfunction activation and base devlink ops")
Cc: stable@vger.kernel.org
Reviewed-by: Aleksandr Loktionov
Signed-off-by: Guangshuo Li
Reviewed-by: Simon Horman
Signed-off-by: Jacob Keller
Link: https://patch.msgid.link/20260416-iwl-net-submission-2026-04-14-v2-3-686c33c9828d@intel.com
Signed-off-by: Jakub Kicinski
---
 drivers/net/ethernet/intel/ice/ice_sf_eth.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/ethernet/intel/ice/ice_sf_eth.c b/drivers/net/ethernet/intel/ice/ice_sf_eth.c
index 2cf04bc6edce..a730aa368c92 100644
--- a/drivers/net/ethernet/intel/ice/ice_sf_eth.c
+++ b/drivers/net/ethernet/intel/ice/ice_sf_eth.c
@@ -305,6 +305,8 @@ ice_sf_eth_activate(struct ice_dynamic_port *dyn_port,
 
 aux_dev_uninit:
 	auxiliary_device_uninit(&sf_dev->adev);
+	return err;
+
 sf_dev_free:
 	kfree(sf_dev);
 xa_erase:

From 1a303baa715e6b78d6a406aaf335f87ff35acfcd Mon Sep 17 00:00:00 2001
From: Michal Schmidt
Date: Thu, 16 Apr 2026 17:53:28 -0700
Subject: [PATCH 042/148] ice: fix double-free of tx_buf skb

If ice_tso() or ice_tx_csum() fails, the error path in
ice_xmit_frame_ring() frees the skb, but the 'first' tx_buf still
points to it and is marked as valid (ICE_TX_BUF_SKB). 'next_to_use'
remains unchanged, so the potential problem will likely fix itself when
the next packet is transmitted and the tx_buf gets overwritten. But if
there is no next packet and the interface is brought down instead,
ice_clean_tx_ring() -> ice_unmap_and_free_tx_buf() will find the tx_buf
and free the skb for the second time.

The fix is to reset the tx_buf type to ICE_TX_BUF_EMPTY in the error
path, so that ice_unmap_and_free_tx_buf() knows there is nothing left
for it to free.

Move the initialization of 'first' up to ensure it's already valid in
case we hit the linearization error path.

The bug was spotted by AI while I had it looking for something else. It
also proposed an initial version of the patch. I reproduced the bug and
tested the fix by adding code to inject failures, on a build with
KASAN.

I looked for similar bugs in related Intel drivers and did not find
any.
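One hypothetical way to inject such a failure for testing, in the style
the author describes (purely illustrative; the counter, modulus and
placement are made up and form no part of the fix):

	/* In ice_xmit_frame_ring(), around the real ice_tso() call: */
	static u32 tso_fail_inject;	/* test-only knob */

	tso = ice_tso(first, &offload);
	if ((++tso_fail_inject & 63) == 0)
		tso = -1;	/* pretend TSO metadata parsing failed */
	if (tso < 0)
		goto out_drop;	/* now exercises the double-free path */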
Fixes: d76a60ba7afb ("ice: Add support for VLANs and offloads") Assisted-by: Claude:claude-4.6-opus-high Cursor Signed-off-by: Michal Schmidt Signed-off-by: Jacob Keller Link: https://patch.msgid.link/20260416-iwl-net-submission-2026-04-14-v2-4-686c33c9828d@intel.com Signed-off-by: Jakub Kicinski --- drivers/net/ethernet/intel/ice/ice_txrx.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.c b/drivers/net/ethernet/intel/ice/ice_txrx.c index a2cd4cf37734..7be9c062949b 100644 --- a/drivers/net/ethernet/intel/ice/ice_txrx.c +++ b/drivers/net/ethernet/intel/ice/ice_txrx.c @@ -2158,6 +2158,9 @@ ice_xmit_frame_ring(struct sk_buff *skb, struct ice_tx_ring *tx_ring) ice_trace(xmit_frame_ring, tx_ring, skb); + /* record the location of the first descriptor for this packet */ + first = &tx_ring->tx_buf[tx_ring->next_to_use]; + count = ice_xmit_desc_count(skb); if (ice_chk_linearize(skb, count)) { if (__skb_linearize(skb)) @@ -2183,8 +2186,6 @@ ice_xmit_frame_ring(struct sk_buff *skb, struct ice_tx_ring *tx_ring) offload.tx_ring = tx_ring; - /* record the location of the first descriptor for this packet */ - first = &tx_ring->tx_buf[tx_ring->next_to_use]; first->skb = skb; first->type = ICE_TX_BUF_SKB; first->bytecount = max_t(unsigned int, skb->len, ETH_ZLEN); @@ -2249,6 +2250,7 @@ ice_xmit_frame_ring(struct sk_buff *skb, struct ice_tx_ring *tx_ring) out_drop: ice_trace(xmit_frame_ring_drop, tx_ring, skb); dev_kfree_skb_any(skb); + first->type = ICE_TX_BUF_EMPTY; return NETDEV_TX_OK; } From 55e74f9ea7fea3d3da1cb6d5cacdaf8cf0fe3516 Mon Sep 17 00:00:00 2001 From: Paul Greenwalt Date: Thu, 16 Apr 2026 17:53:29 -0700 Subject: [PATCH 043/148] ice: fix PHY config on media change with link-down-on-close Commit 1a3571b5938c ("ice: restore PHY settings on media insertion") introduced separate flows for setting PHY configuration on media present: ice_configure_phy() when link-down-on-close is disabled, and ice_force_phys_link_state() when enabled. The latter incorrectly uses the previous configuration even after module change, causing link issues such as wrong speed or no link. Unify PHY configuration into a single ice_phy_cfg() function with a link_en parameter, ensuring PHY capabilities are always fetched fresh from hardware. Fixes: 1a3571b5938c ("ice: restore PHY settings on media insertion") Reviewed-by: Przemek Kitszel Signed-off-by: Paul Greenwalt Reviewed-by: Aleksandr Loktionov Tested-by: Sunitha Mekala Signed-off-by: Jacob Keller Link: https://patch.msgid.link/20260416-iwl-net-submission-2026-04-14-v2-5-686c33c9828d@intel.com Signed-off-by: Jakub Kicinski --- drivers/net/ethernet/intel/ice/ice_main.c | 121 +++++----------------- 1 file changed, 27 insertions(+), 94 deletions(-) diff --git a/drivers/net/ethernet/intel/ice/ice_main.c b/drivers/net/ethernet/intel/ice/ice_main.c index 3c36e3641b9e..ce3a0afe302d 100644 --- a/drivers/net/ethernet/intel/ice/ice_main.c +++ b/drivers/net/ethernet/intel/ice/ice_main.c @@ -1922,82 +1922,6 @@ static void ice_handle_mdd_event(struct ice_pf *pf) ice_print_vfs_mdd_events(pf); } -/** - * ice_force_phys_link_state - Force the physical link state - * @vsi: VSI to force the physical link state to up/down - * @link_up: true/false indicates to set the physical link to up/down - * - * Force the physical link state by getting the current PHY capabilities from - * hardware and setting the PHY config based on the determined capabilities. 
If - * link changes a link event will be triggered because both the Enable Automatic - * Link Update and LESM Enable bits are set when setting the PHY capabilities. - * - * Returns 0 on success, negative on failure - */ -static int ice_force_phys_link_state(struct ice_vsi *vsi, bool link_up) -{ - struct ice_aqc_get_phy_caps_data *pcaps; - struct ice_aqc_set_phy_cfg_data *cfg; - struct ice_port_info *pi; - struct device *dev; - int retcode; - - if (!vsi || !vsi->port_info || !vsi->back) - return -EINVAL; - if (vsi->type != ICE_VSI_PF) - return 0; - - dev = ice_pf_to_dev(vsi->back); - - pi = vsi->port_info; - - pcaps = kzalloc_obj(*pcaps); - if (!pcaps) - return -ENOMEM; - - retcode = ice_aq_get_phy_caps(pi, false, ICE_AQC_REPORT_ACTIVE_CFG, pcaps, - NULL); - if (retcode) { - dev_err(dev, "Failed to get phy capabilities, VSI %d error %d\n", - vsi->vsi_num, retcode); - retcode = -EIO; - goto out; - } - - /* No change in link */ - if (link_up == !!(pcaps->caps & ICE_AQC_PHY_EN_LINK) && - link_up == !!(pi->phy.link_info.link_info & ICE_AQ_LINK_UP)) - goto out; - - /* Use the current user PHY configuration. The current user PHY - * configuration is initialized during probe from PHY capabilities - * software mode, and updated on set PHY configuration. - */ - cfg = kmemdup(&pi->phy.curr_user_phy_cfg, sizeof(*cfg), GFP_KERNEL); - if (!cfg) { - retcode = -ENOMEM; - goto out; - } - - cfg->caps |= ICE_AQ_PHY_ENA_AUTO_LINK_UPDT; - if (link_up) - cfg->caps |= ICE_AQ_PHY_ENA_LINK; - else - cfg->caps &= ~ICE_AQ_PHY_ENA_LINK; - - retcode = ice_aq_set_phy_cfg(&vsi->back->hw, pi, cfg, NULL); - if (retcode) { - dev_err(dev, "Failed to set phy config, VSI %d error %d\n", - vsi->vsi_num, retcode); - retcode = -EIO; - } - - kfree(cfg); -out: - kfree(pcaps); - return retcode; -} - /** * ice_init_nvm_phy_type - Initialize the NVM PHY type * @pi: port info structure @@ -2066,7 +1990,7 @@ static void ice_init_link_dflt_override(struct ice_port_info *pi) * first time media is available. The ICE_LINK_DEFAULT_OVERRIDE_PENDING state * is used to indicate that the user PHY cfg default override is initialized * and the PHY has not been configured with the default override settings. The - * state is set here, and cleared in ice_configure_phy the first time the PHY is + * state is set here, and cleared in ice_phy_cfg the first time the PHY is * configured. * * This function should be called only if the FW doesn't support default @@ -2172,14 +2096,18 @@ static int ice_init_phy_user_cfg(struct ice_port_info *pi) } /** - * ice_configure_phy - configure PHY + * ice_phy_cfg - configure PHY * @vsi: VSI of PHY + * @link_en: true/false indicates to set link to enable/disable * * Set the PHY configuration. If the current PHY configuration is the same as - * the curr_user_phy_cfg, then do nothing to avoid link flap. Otherwise - * configure the based get PHY capabilities for topology with media. + * the curr_user_phy_cfg and link_en hasn't changed, then do nothing to avoid + * link flap. Otherwise configure the PHY based get PHY capabilities for + * topology with media and link_en. 
+ * + * Return: 0 on success, negative on failure */ -static int ice_configure_phy(struct ice_vsi *vsi) +static int ice_phy_cfg(struct ice_vsi *vsi, bool link_en) { struct device *dev = ice_pf_to_dev(vsi->back); struct ice_port_info *pi = vsi->port_info; @@ -2199,9 +2127,6 @@ static int ice_configure_phy(struct ice_vsi *vsi) phy->link_info.topo_media_conflict == ICE_AQ_LINK_TOPO_UNSUPP_MEDIA) return -EPERM; - if (test_bit(ICE_FLAG_LINK_DOWN_ON_CLOSE_ENA, pf->flags)) - return ice_force_phys_link_state(vsi, true); - pcaps = kzalloc_obj(*pcaps); if (!pcaps) return -ENOMEM; @@ -2215,10 +2140,8 @@ static int ice_configure_phy(struct ice_vsi *vsi) goto done; } - /* If PHY enable link is configured and configuration has not changed, - * there's nothing to do - */ - if (pcaps->caps & ICE_AQC_PHY_EN_LINK && + /* Configuration has not changed. There's nothing to do. */ + if (link_en == !!(pcaps->caps & ICE_AQC_PHY_EN_LINK) && ice_phy_caps_equals_cfg(pcaps, &phy->curr_user_phy_cfg)) goto done; @@ -2282,8 +2205,12 @@ static int ice_configure_phy(struct ice_vsi *vsi) */ ice_cfg_phy_fc(pi, cfg, phy->curr_user_fc_req); - /* Enable link and link update */ - cfg->caps |= ICE_AQ_PHY_ENA_AUTO_LINK_UPDT | ICE_AQ_PHY_ENA_LINK; + /* Enable/Disable link and link update */ + cfg->caps |= ICE_AQ_PHY_ENA_AUTO_LINK_UPDT; + if (link_en) + cfg->caps |= ICE_AQ_PHY_ENA_LINK; + else + cfg->caps &= ~ICE_AQ_PHY_ENA_LINK; err = ice_aq_set_phy_cfg(&pf->hw, pi, cfg, NULL); if (err) @@ -2336,7 +2263,7 @@ static void ice_check_media_subtask(struct ice_pf *pf) test_bit(ICE_FLAG_LINK_DOWN_ON_CLOSE_ENA, vsi->back->flags)) return; - err = ice_configure_phy(vsi); + err = ice_phy_cfg(vsi, true); if (!err) clear_bit(ICE_FLAG_NO_MEDIA, pf->flags); @@ -4892,9 +4819,15 @@ static int ice_init_link(struct ice_pf *pf) if (!test_bit(ICE_FLAG_LINK_DOWN_ON_CLOSE_ENA, pf->flags)) { struct ice_vsi *vsi = ice_get_main_vsi(pf); + struct ice_link_default_override_tlv *ldo; + bool link_en; + + ldo = &pf->link_dflt_override; + link_en = !(ldo->options & + ICE_LINK_OVERRIDE_AUTO_LINK_DIS); if (vsi) - ice_configure_phy(vsi); + ice_phy_cfg(vsi, link_en); } } else { set_bit(ICE_FLAG_NO_MEDIA, pf->flags); @@ -9707,7 +9640,7 @@ int ice_open_internal(struct net_device *netdev) } } - err = ice_configure_phy(vsi); + err = ice_phy_cfg(vsi, true); if (err) { netdev_err(netdev, "Failed to set physical link up, error %d\n", err); @@ -9748,7 +9681,7 @@ int ice_stop(struct net_device *netdev) } if (test_bit(ICE_FLAG_LINK_DOWN_ON_CLOSE_ENA, vsi->back->flags)) { - int link_err = ice_force_phys_link_state(vsi, false); + int link_err = ice_phy_cfg(vsi, false); if (link_err) { if (link_err == -ENOMEDIUM) From 4a3a940059e98539de293a6e36e464094c2e875b Mon Sep 17 00:00:00 2001 From: Paul Greenwalt Date: Thu, 16 Apr 2026 17:53:30 -0700 Subject: [PATCH 044/148] ice: fix ICE_AQ_LINK_SPEED_M for 200G When setting PHY configuration during driver initialization, 200G link speed is not being advertised even when the PHY is capable. This is because the get PHY capabilities link speed response is being masked by ICE_AQ_LINK_SPEED_M, which does not include the 200G link speed bit. ICE_AQ_LINK_SPEED_200GB is defined as BIT(11), but the mask 0x7FF only covers bits 0-10. Fix ICE_AQ_LINK_SPEED_M to use GENMASK(11, 0) so that it covers all defined link speed bits including 200G. 
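The mask arithmetic can be checked in isolation. The sketch below
re-derives BIT() and GENMASK() locally rather than pulling in kernel
headers:

#include <stdio.h>
#include <stdint.h>

#define BIT(n)		(1ULL << (n))
#define GENMASK(h, l)	(((~0ULL) >> (63 - (h))) & ~((1ULL << (l)) - 1))

int main(void)
{
	uint64_t speed_200g = BIT(11);	/* ICE_AQ_LINK_SPEED_200GB */

	/* Old mask covers bits 10:0 only, so the 200G bit is masked away. */
	printf("0x7FF         keeps 200G: %d\n", !!(speed_200g & 0x7FF));
	/* New mask covers bits 11:0 (0xFFF), so the 200G bit survives. */
	printf("GENMASK(11,0) keeps 200G: %d\n",
	       !!(speed_200g & GENMASK(11, 0)));
	return 0;
}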
Fixes: 24407a01e57c ("ice: Add 200G speed/phy type use")
Signed-off-by: Paul Greenwalt
Signed-off-by: Aleksandr Loktionov
Reviewed-by: Simon Horman
Tested-by: Sunitha Mekala
Signed-off-by: Jacob Keller
Link: https://patch.msgid.link/20260416-iwl-net-submission-2026-04-14-v2-6-686c33c9828d@intel.com
Signed-off-by: Jakub Kicinski
---
 drivers/net/ethernet/intel/ice/ice_adminq_cmd.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_adminq_cmd.h b/drivers/net/ethernet/intel/ice/ice_adminq_cmd.h
index 859e9c66f3e7..3cbb1b0582e3 100644
--- a/drivers/net/ethernet/intel/ice/ice_adminq_cmd.h
+++ b/drivers/net/ethernet/intel/ice/ice_adminq_cmd.h
@@ -1252,7 +1252,7 @@ struct ice_aqc_get_link_status_data {
 #define ICE_AQ_LINK_PWR_QSFP_CLASS_3	2
 #define ICE_AQ_LINK_PWR_QSFP_CLASS_4	3
 	__le16 link_speed;
-#define ICE_AQ_LINK_SPEED_M		0x7FF
+#define ICE_AQ_LINK_SPEED_M		GENMASK(11, 0)
 #define ICE_AQ_LINK_SPEED_10MB		BIT(0)
 #define ICE_AQ_LINK_SPEED_100MB		BIT(1)
 #define ICE_AQ_LINK_SPEED_1000MB	BIT(2)

From 7c72ec18c2a4111204c2e915f8e4f6d849ce9398 Mon Sep 17 00:00:00 2001
From: Keita Morisaki
Date: Thu, 16 Apr 2026 17:53:31 -0700
Subject: [PATCH 045/148] ice: fix race condition in TX timestamp ring cleanup

Fix a race condition between ice_free_tx_tstamp_ring() and ice_tx_map()
that can cause a NULL pointer dereference.

ice_free_tx_tstamp_ring() currently clears the ICE_TX_FLAGS_TXTIME flag
after NULLing the tstamp_ring. This could allow a concurrent
ice_tx_map() call on another CPU to dereference the tstamp_ring and hit
a NULL pointer:

CPU A: ice_free_tx_tstamp_ring()  | CPU B: ice_tx_map()
----------------------------------|------------------------------------
tx_ring->tstamp_ring = NULL       |
                                  | ice_is_txtime_cfg() -> true
                                  | tstamp_ring = tx_ring->tstamp_ring
                                  | tstamp_ring->count  // NULL deref!
flags &= ~ICE_TX_FLAGS_TXTIME     |

Fix by:
1. Reordering ice_free_tx_tstamp_ring() to clear the flag before NULLing
   the pointer, with smp_wmb() to ensure proper ordering.
2. Adding smp_rmb() in ice_tx_map() after the flag check to order the
   flag read before the pointer read, using READ_ONCE() for the pointer,
   and adding a NULL check as a safety net.
3.
Converting tx_ring->flags from u8 to DECLARE_BITMAP() and using atomic bitops (set_bit(), clear_bit(), test_bit()) for all flag operations throughout the driver: - ICE_TX_RING_FLAGS_XDP - ICE_TX_RING_FLAGS_VLAN_L2TAG1 - ICE_TX_RING_FLAGS_VLAN_L2TAG2 - ICE_TX_RING_FLAGS_TXTIME Fixes: ccde82e909467 ("ice: add E830 Earliest TxTime First Offload support") Signed-off-by: Keita Morisaki Reviewed-by: Aleksandr Loktionov Tested-by: Rinitha S Signed-off-by: Jacob Keller Link: https://patch.msgid.link/20260416-iwl-net-submission-2026-04-14-v2-7-686c33c9828d@intel.com Signed-off-by: Jakub Kicinski --- drivers/net/ethernet/intel/ice/ice.h | 4 ++-- drivers/net/ethernet/intel/ice/ice_dcb_lib.c | 2 +- drivers/net/ethernet/intel/ice/ice_lib.c | 4 ++-- drivers/net/ethernet/intel/ice/ice_txrx.c | 23 ++++++++++++++------ drivers/net/ethernet/intel/ice/ice_txrx.h | 16 +++++++++----- 5 files changed, 31 insertions(+), 18 deletions(-) diff --git a/drivers/net/ethernet/intel/ice/ice.h b/drivers/net/ethernet/intel/ice/ice.h index eb3a48330cc1..725b130dd3a2 100644 --- a/drivers/net/ethernet/intel/ice/ice.h +++ b/drivers/net/ethernet/intel/ice/ice.h @@ -753,7 +753,7 @@ static inline bool ice_is_xdp_ena_vsi(struct ice_vsi *vsi) static inline void ice_set_ring_xdp(struct ice_tx_ring *ring) { - ring->flags |= ICE_TX_FLAGS_RING_XDP; + set_bit(ICE_TX_RING_FLAGS_XDP, ring->flags); } /** @@ -778,7 +778,7 @@ static inline bool ice_is_txtime_ena(const struct ice_tx_ring *ring) */ static inline bool ice_is_txtime_cfg(const struct ice_tx_ring *ring) { - return !!(ring->flags & ICE_TX_FLAGS_TXTIME); + return test_bit(ICE_TX_RING_FLAGS_TXTIME, ring->flags); } /** diff --git a/drivers/net/ethernet/intel/ice/ice_dcb_lib.c b/drivers/net/ethernet/intel/ice/ice_dcb_lib.c index bd77f1c001ee..16aa25535152 100644 --- a/drivers/net/ethernet/intel/ice/ice_dcb_lib.c +++ b/drivers/net/ethernet/intel/ice/ice_dcb_lib.c @@ -943,7 +943,7 @@ ice_tx_prepare_vlan_flags_dcb(struct ice_tx_ring *tx_ring, /* if this is not already set it means a VLAN 0 + priority needs * to be offloaded */ - if (tx_ring->flags & ICE_TX_FLAGS_RING_VLAN_L2TAG2) + if (test_bit(ICE_TX_RING_FLAGS_VLAN_L2TAG2, tx_ring->flags)) first->tx_flags |= ICE_TX_FLAGS_HW_OUTER_SINGLE_VLAN; else first->tx_flags |= ICE_TX_FLAGS_HW_VLAN; diff --git a/drivers/net/ethernet/intel/ice/ice_lib.c b/drivers/net/ethernet/intel/ice/ice_lib.c index 689c6025ea82..837b71b7b2b7 100644 --- a/drivers/net/ethernet/intel/ice/ice_lib.c +++ b/drivers/net/ethernet/intel/ice/ice_lib.c @@ -1412,9 +1412,9 @@ static int ice_vsi_alloc_rings(struct ice_vsi *vsi) ring->count = vsi->num_tx_desc; ring->txq_teid = ICE_INVAL_TEID; if (dvm_ena) - ring->flags |= ICE_TX_FLAGS_RING_VLAN_L2TAG2; + set_bit(ICE_TX_RING_FLAGS_VLAN_L2TAG2, ring->flags); else - ring->flags |= ICE_TX_FLAGS_RING_VLAN_L2TAG1; + set_bit(ICE_TX_RING_FLAGS_VLAN_L2TAG1, ring->flags); WRITE_ONCE(vsi->tx_rings[i], ring); } diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.c b/drivers/net/ethernet/intel/ice/ice_txrx.c index 7be9c062949b..4ca1a0602307 100644 --- a/drivers/net/ethernet/intel/ice/ice_txrx.c +++ b/drivers/net/ethernet/intel/ice/ice_txrx.c @@ -190,9 +190,10 @@ void ice_free_tstamp_ring(struct ice_tx_ring *tx_ring) void ice_free_tx_tstamp_ring(struct ice_tx_ring *tx_ring) { ice_free_tstamp_ring(tx_ring); + clear_bit(ICE_TX_RING_FLAGS_TXTIME, tx_ring->flags); + smp_wmb(); /* order flag clear before pointer NULL */ kfree_rcu(tx_ring->tstamp_ring, rcu); - tx_ring->tstamp_ring = NULL; - tx_ring->flags &= ~ICE_TX_FLAGS_TXTIME; + 
WRITE_ONCE(tx_ring->tstamp_ring, NULL); } /** @@ -405,7 +406,7 @@ static int ice_alloc_tstamp_ring(struct ice_tx_ring *tx_ring) tx_ring->tstamp_ring = tstamp_ring; tstamp_ring->desc = NULL; tstamp_ring->count = ice_calc_ts_ring_count(tx_ring); - tx_ring->flags |= ICE_TX_FLAGS_TXTIME; + set_bit(ICE_TX_RING_FLAGS_TXTIME, tx_ring->flags); return 0; } @@ -1521,13 +1522,20 @@ ice_tx_map(struct ice_tx_ring *tx_ring, struct ice_tx_buf *first, return; if (ice_is_txtime_cfg(tx_ring)) { - struct ice_tstamp_ring *tstamp_ring = tx_ring->tstamp_ring; - u32 tstamp_count = tstamp_ring->count; - u32 j = tstamp_ring->next_to_use; + struct ice_tstamp_ring *tstamp_ring; + u32 tstamp_count, j; struct ice_ts_desc *ts_desc; struct timespec64 ts; u32 tstamp; + smp_rmb(); /* order flag read before pointer read */ + tstamp_ring = READ_ONCE(tx_ring->tstamp_ring); + if (unlikely(!tstamp_ring)) + goto ring_kick; + + tstamp_count = tstamp_ring->count; + j = tstamp_ring->next_to_use; + ts = ktime_to_timespec64(first->skb->tstamp); tstamp = ts.tv_nsec >> ICE_TXTIME_CTX_RESOLUTION_128NS; @@ -1555,6 +1563,7 @@ ice_tx_map(struct ice_tx_ring *tx_ring, struct ice_tx_buf *first, tstamp_ring->next_to_use = j; writel_relaxed(j, tstamp_ring->tail); } else { +ring_kick: writel_relaxed(i, tx_ring->tail); } return; @@ -1814,7 +1823,7 @@ ice_tx_prepare_vlan_flags(struct ice_tx_ring *tx_ring, struct ice_tx_buf *first) */ if (skb_vlan_tag_present(skb)) { first->vid = skb_vlan_tag_get(skb); - if (tx_ring->flags & ICE_TX_FLAGS_RING_VLAN_L2TAG2) + if (test_bit(ICE_TX_RING_FLAGS_VLAN_L2TAG2, tx_ring->flags)) first->tx_flags |= ICE_TX_FLAGS_HW_OUTER_SINGLE_VLAN; else first->tx_flags |= ICE_TX_FLAGS_HW_VLAN; diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.h b/drivers/net/ethernet/intel/ice/ice_txrx.h index b6547e1b7c42..5e517f219379 100644 --- a/drivers/net/ethernet/intel/ice/ice_txrx.h +++ b/drivers/net/ethernet/intel/ice/ice_txrx.h @@ -212,6 +212,14 @@ enum ice_rx_dtype { ICE_RX_DTYPE_SPLIT_ALWAYS = 2, }; +enum ice_tx_ring_flags { + ICE_TX_RING_FLAGS_XDP, + ICE_TX_RING_FLAGS_VLAN_L2TAG1, + ICE_TX_RING_FLAGS_VLAN_L2TAG2, + ICE_TX_RING_FLAGS_TXTIME, + ICE_TX_RING_FLAGS_NBITS, +}; + struct ice_pkt_ctx { u64 cached_phctime; __be16 vlan_proto; @@ -352,11 +360,7 @@ struct ice_tx_ring { u16 count; /* Number of descriptors */ u16 q_index; /* Queue number of ring */ - u8 flags; -#define ICE_TX_FLAGS_RING_XDP BIT(0) -#define ICE_TX_FLAGS_RING_VLAN_L2TAG1 BIT(1) -#define ICE_TX_FLAGS_RING_VLAN_L2TAG2 BIT(2) -#define ICE_TX_FLAGS_TXTIME BIT(3) + DECLARE_BITMAP(flags, ICE_TX_RING_FLAGS_NBITS); struct xsk_buff_pool *xsk_pool; @@ -398,7 +402,7 @@ static inline bool ice_ring_ch_enabled(struct ice_tx_ring *ring) static inline bool ice_ring_is_xdp(struct ice_tx_ring *ring) { - return !!(ring->flags & ICE_TX_FLAGS_RING_XDP); + return test_bit(ICE_TX_RING_FLAGS_XDP, ring->flags); } enum ice_container_type { From fa28351f970fa5138c7c5dedfe5dea480a0ee065 Mon Sep 17 00:00:00 2001 From: Kohei Enju Date: Thu, 16 Apr 2026 17:53:32 -0700 Subject: [PATCH 046/148] ice: fix potential NULL pointer deref in error path of ice_set_ringparam() ice_set_ringparam nullifies tstamp_ring of temporary tx_rings, without clearing ICE_TX_RING_FLAGS_TXTIME bit. 
When ICE_TX_RING_FLAGS_TXTIME is set and the subsequent
ice_setup_tx_ring() call fails, a NULL pointer dereference could happen
in the unwinding sequence:

ice_clean_tx_ring()
-> ice_is_txtime_cfg() == true (ICE_TX_RING_FLAGS_TXTIME is set)
-> ice_free_tx_tstamp_ring()
-> ice_free_tstamp_ring()
-> tstamp_ring->desc (NULL deref)

Clear the ICE_TX_RING_FLAGS_TXTIME bit to avoid the potential issue.

Note that this potential issue was found by manual code review.
Compile-tested only, since unfortunately I don't have E830 devices.

Fixes: ccde82e90946 ("ice: add E830 Earliest TxTime First Offload support")
Signed-off-by: Kohei Enju
Reviewed-by: Paul Greenwalt
Tested-by: Rinitha S
Signed-off-by: Jacob Keller
Link: https://patch.msgid.link/20260416-iwl-net-submission-2026-04-14-v2-8-686c33c9828d@intel.com
Signed-off-by: Jakub Kicinski
---
 drivers/net/ethernet/intel/ice/ice_ethtool.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/intel/ice/ice_ethtool.c b/drivers/net/ethernet/intel/ice/ice_ethtool.c
index e6a20af6f63d..f28416a707d7 100644
--- a/drivers/net/ethernet/intel/ice/ice_ethtool.c
+++ b/drivers/net/ethernet/intel/ice/ice_ethtool.c
@@ -3290,6 +3290,7 @@ ice_set_ringparam(struct net_device *netdev, struct ethtool_ringparam *ring,
 		tx_rings[i].desc = NULL;
 		tx_rings[i].tx_buf = NULL;
 		tx_rings[i].tstamp_ring = NULL;
+		clear_bit(ICE_TX_RING_FLAGS_TXTIME, tx_rings[i].flags);
 		tx_rings[i].tx_tstamps = &pf->ptp.port.tx;
 		err = ice_setup_tx_ring(&tx_rings[i]);
 		if (err) {

From a24162f18825684ad04e3a5d0531f8a50d679347 Mon Sep 17 00:00:00 2001
From: Kohei Enju
Date: Thu, 16 Apr 2026 17:53:33 -0700
Subject: [PATCH 047/148] i40e: don't advertise IFF_SUPP_NOFCS

i40e advertises IFF_SUPP_NOFCS, allowing users to use the SO_NOFCS
socket option. However, this option is silently ignored, as the driver
does not check skb->no_fcs, and always enables FCS insertion offload.

Fix this by removing the advertisement of IFF_SUPP_NOFCS.

This behavior can be reproduced with a simple AF_PACKET socket:

import socket
s = socket.socket(socket.AF_PACKET, socket.SOCK_RAW)
s.setsockopt(socket.SOL_SOCKET, 43, 1)  # SO_NOFCS
s.bind(("eth0", 0))
s.send(b'\xff' * 64)

Previously, send() succeeded but the driver ignored SO_NOFCS. With this
change, send() fails with -EPROTONOSUPPORT, as expected.
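For reference, an equivalent C reproducer could look like the sketch
below. It assumes an "eth0" interface and CAP_NET_RAW; SO_NOFCS is 43 in
the asm-generic socket ABI, matching the raw constant used above, though
it may differ on some architectures:

#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <linux/if_packet.h>
#include <net/if.h>

#ifndef SO_NOFCS
#define SO_NOFCS 43
#endif

int main(void)
{
	int one = 1;
	char frame[64];
	struct sockaddr_ll sll = {
		.sll_family  = AF_PACKET,
		.sll_ifindex = if_nametoindex("eth0"),
	};
	int fd = socket(AF_PACKET, SOCK_RAW, 0);

	if (fd < 0 || !sll.sll_ifindex)
		return perror("setup"), 1;
	if (setsockopt(fd, SOL_SOCKET, SO_NOFCS, &one, sizeof(one)))
		return perror("SO_NOFCS"), 1;
	if (bind(fd, (struct sockaddr *)&sll, sizeof(sll)))
		return perror("bind"), 1;

	memset(frame, 0xff, sizeof(frame));
	/* On a fixed i40e netdev this send fails with EPROTONOSUPPORT,
	 * since the driver no longer claims to honor SO_NOFCS. */
	if (send(fd, frame, sizeof(frame), 0) < 0)
		perror("send");
	return 0;
}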
Fixes: 41c445ff0f48 ("i40e: main driver core") Signed-off-by: Kohei Enju Reviewed-by: Aleksandr Loktionov Tested-by: Sunitha Mekala Signed-off-by: Jacob Keller Link: https://patch.msgid.link/20260416-iwl-net-submission-2026-04-14-v2-9-686c33c9828d@intel.com Signed-off-by: Jakub Kicinski --- drivers/net/ethernet/intel/i40e/i40e_main.c | 1 - 1 file changed, 1 deletion(-) diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c index 926d001b2150..028bd500603a 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_main.c +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c @@ -13783,7 +13783,6 @@ static int i40e_config_netdev(struct i40e_vsi *vsi) netdev->neigh_priv_len = sizeof(u32) * 4; netdev->priv_flags |= IFF_UNICAST_FLT; - netdev->priv_flags |= IFF_SUPP_NOFCS; /* Setup netdev TC information */ i40e_vsi_config_netdev_tc(vsi, vsi->tc_config.enabled_tc); From 496d9f91062fa07956702e0f234c5203f03a974d Mon Sep 17 00:00:00 2001 From: Petr Oros Date: Thu, 16 Apr 2026 17:53:34 -0700 Subject: [PATCH 048/148] iavf: fix wrong VLAN mask for legacy Rx descriptors L2TAG2 The IAVF_RXD_LEGACY_L2TAG2_M mask was incorrectly defined as GENMASK_ULL(63, 32), extracting 32 bits from qw2 instead of the 16-bit VLAN tag. In the legacy Rx descriptor layout, the 2nd L2TAG2 (VLAN tag) occupies bits 63:48 of qw2, not 63:32. The oversized mask causes FIELD_GET to return a 32-bit value where the actual VLAN tag sits in bits 31:16. When this value is passed to iavf_receive_skb() as a u16 parameter, it gets truncated to the lower 16 bits (which contain the 1st L2TAG2, typically zero). As a result, __vlan_hwaccel_put_tag() is never called and software VLAN interfaces on VFs receive no traffic. This affects VFs behind ice PF (VIRTCHNL VLAN v2) when the PF advertises VLAN stripping into L2TAG2_2 and legacy descriptors are used. The flex descriptor path already uses the correct mask (IAVF_RXD_FLEX_L2TAG2_2_M = GENMASK_ULL(63, 48)). Reproducer: 1. Create 2 VFs on ice PF (echo 2 > sriov_numvfs) 2. Disable spoofchk on both VFs 3. Move each VF into a separate network namespace 4. On each VF: create VLAN interface (e.g. vlan 198), assign IP, bring up 5. Set rx-vlan-offload OFF on both VFs 6. Ping between VLAN interfaces -> expect PASS (VLAN tag stays in packet data, kernel matches in-band) 7. Set rx-vlan-offload ON on both VFs 8. Ping between VLAN interfaces -> expect FAIL if bug present (HW strips VLAN tag into descriptor L2TAG2 field, wrong mask extracts bits 47:32 instead of 63:48, truncated to u16 -> zero, __vlan_hwaccel_put_tag() never called, packet delivered to parent interface, not VLAN interface) The reproducer requires legacy Rx descriptors. On modern ice + iavf with full PTP support, flex descriptors are always negotiated and the buggy legacy path is never reached. Flex descriptors require all of: - CONFIG_PTP_1588_CLOCK enabled - VIRTCHNL_VF_OFFLOAD_RX_FLEX_DESC granted by PF - PTP capabilities negotiated (VIRTCHNL_VF_CAP_PTP) - VIRTCHNL_1588_PTP_CAP_RX_TSTAMP supported - VIRTCHNL_RXDID_2_FLEX_SQ_NIC present in DDP profile If any condition is not met, iavf_select_rx_desc_format() falls back to legacy descriptors (RXDID=1) and the wrong L2TAG2 mask is hit. 
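The truncation is easy to demonstrate with the mask arithmetic alone. In
this userspace sketch, FIELD_GET() is modeled as masking plus a shift by
the mask's lowest set bit (using a GCC/Clang builtin):

#include <stdio.h>
#include <stdint.h>

#define GENMASK_ULL(h, l) (((~0ULL) >> (63 - (h))) & ~((1ULL << (l)) - 1))

/* Userspace stand-in for the kernel's FIELD_GET(). */
static uint64_t field_get(uint64_t mask, uint64_t reg)
{
	return (reg & mask) >> __builtin_ctzll(mask);
}

int main(void)
{
	/* VLAN tag 198 in bits 63:48 of qw2; bits 47:32 (1st L2TAG2) zero. */
	uint64_t qw2 = (uint64_t)198 << 48;
	uint16_t old_vid = (uint16_t)field_get(GENMASK_ULL(63, 32), qw2);
	uint16_t new_vid = (uint16_t)field_get(GENMASK_ULL(63, 48), qw2);

	printf("old mask -> vid %u (tag lost to u16 truncation)\n", old_vid);
	printf("new mask -> vid %u\n", new_vid);
	return 0;
}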
Fixes: 2dc8e7c36d80 ("iavf: refactor iavf_clean_rx_irq to support legacy and flex descriptors") Signed-off-by: Petr Oros Reviewed-by: Aleksandr Loktionov Reviewed-by: Paul Menzel Reviewed-by: Jacob Keller Tested-by: Rafal Romanowski Signed-off-by: Jacob Keller Link: https://patch.msgid.link/20260416-iwl-net-submission-2026-04-14-v2-10-686c33c9828d@intel.com Signed-off-by: Jakub Kicinski --- drivers/net/ethernet/intel/iavf/iavf_type.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/ethernet/intel/iavf/iavf_type.h b/drivers/net/ethernet/intel/iavf/iavf_type.h index 1d8cf29cb65a..5bb1de1cfd33 100644 --- a/drivers/net/ethernet/intel/iavf/iavf_type.h +++ b/drivers/net/ethernet/intel/iavf/iavf_type.h @@ -277,7 +277,7 @@ struct iavf_rx_desc { /* L2 Tag 2 Presence */ #define IAVF_RXD_LEGACY_L2TAG2P_M BIT(0) /* Stripped S-TAG VLAN from the receive packet */ -#define IAVF_RXD_LEGACY_L2TAG2_M GENMASK_ULL(63, 32) +#define IAVF_RXD_LEGACY_L2TAG2_M GENMASK_ULL(63, 48) /* Stripped S-TAG VLAN from the receive packet */ #define IAVF_RXD_FLEX_L2TAG2_2_M GENMASK_ULL(63, 48) /* The packet is a UDP tunneled packet */ From aa3f7fe409350857c25d050482a2eef2cfd69b58 Mon Sep 17 00:00:00 2001 From: Matt Vollrath Date: Thu, 16 Apr 2026 17:53:36 -0700 Subject: [PATCH 049/148] e1000e: Unroll PTP in probe error handling If probe fails after registering the PTP clock and its delayed work, these resources must be released. This was not an issue until a 2016 fix moved the e1000e_ptp_init() call before the jump to err_register. Fixes: aa524b66c5ef ("e1000e: don't modify SYSTIM registers during SIOCSHWTSTAMP ioctl") Signed-off-by: Matt Vollrath Tested-by: Avigail Dahan Signed-off-by: Jacob Keller Link: https://patch.msgid.link/20260416-iwl-net-submission-2026-04-14-v2-12-686c33c9828d@intel.com Signed-off-by: Jakub Kicinski --- drivers/net/ethernet/intel/e1000e/netdev.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c index 9befdacd6730..7ce0cc8ab8f4 100644 --- a/drivers/net/ethernet/intel/e1000e/netdev.c +++ b/drivers/net/ethernet/intel/e1000e/netdev.c @@ -7706,6 +7706,7 @@ static int e1000_probe(struct pci_dev *pdev, const struct pci_device_id *ent) err_register: if (!(adapter->flags & FLAG_HAS_AMT)) e1000e_release_hw_control(adapter); + e1000e_ptp_remove(adapter); err_eeprom: if (hw->phy.ops.check_reset_block && !hw->phy.ops.check_reset_block(hw)) e1000_phy_hw_reset(&adapter->hw); From f996edd7615e686ada141b7f3395025729ff8ccb Mon Sep 17 00:00:00 2001 From: Eric Dumazet Date: Thu, 16 Apr 2026 10:35:05 +0000 Subject: [PATCH 050/148] ipv6: fix possible UAF in icmpv6_rcv() Caching saddr and daddr before pskb_pull() is problematic since skb->head can change. Remove these temporary variables: - We only access &ipv6_hdr(skb)->saddr and &ipv6_hdr(skb)->daddr when net_dbg_ratelimited() is called in the slow path. - Avoid potential future misuse after pskb_pull() call. 
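The hazard is not skb-specific. This userspace analogy (plain
malloc/realloc, no kernel APIs) shows why a pointer cached before a call
that may move the underlying buffer must be re-derived afterwards:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct buf {
	char *head;	/* analogue of skb->head */
};

/* Stand-in for a pull operation that may move the backing storage. */
static void pull(struct buf *b)
{
	b->head = realloc(b->head, 256);	/* head may change */
}

int main(void)
{
	struct buf b = { .head = malloc(64) };

	strcpy(b.head, "2001:db8::1");
	const char *saddr = b.head;	/* cached too early: may dangle */

	pull(&b);
	/* Safe pattern, as in the patch: re-derive from the current head
	 * instead of using the stale cached pointer. */
	printf("%s\n", b.head);
	(void)saddr;	/* dereferencing it here could be a use-after-free */
	free(b.head);
	return 0;
}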
Fixes: 4b3418fba0fe ("ipv6: icmp: include addresses in debug messages")
Signed-off-by: Eric Dumazet
Reviewed-by: Fernando Fernandez Mancera
Reviewed-by: Joe Damato
Reviewed-by: Ido Schimmel
Link: https://patch.msgid.link/20260416103505.2380753-1-edumazet@google.com
Signed-off-by: Jakub Kicinski
---
 net/ipv6/icmp.c | 10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/net/ipv6/icmp.c b/net/ipv6/icmp.c
index 799d9e9ac45d..efb23807a026 100644
--- a/net/ipv6/icmp.c
+++ b/net/ipv6/icmp.c
@@ -1104,7 +1104,6 @@ static int icmpv6_rcv(struct sk_buff *skb)
 	struct net *net = dev_net_rcu(skb->dev);
 	struct net_device *dev = icmp6_dev(skb);
 	struct inet6_dev *idev = __in6_dev_get(dev);
-	const struct in6_addr *saddr, *daddr;
 	struct icmp6hdr *hdr;
 	u8 type;
 
@@ -1135,12 +1134,10 @@ static int icmpv6_rcv(struct sk_buff *skb)
 
 	__ICMP6_INC_STATS(dev_net_rcu(dev), idev, ICMP6_MIB_INMSGS);
 
-	saddr = &ipv6_hdr(skb)->saddr;
-	daddr = &ipv6_hdr(skb)->daddr;
-
 	if (skb_checksum_validate(skb, IPPROTO_ICMPV6, ip6_compute_pseudo)) {
 		net_dbg_ratelimited("ICMPv6 checksum failed [%pI6c > %pI6c]\n",
-				    saddr, daddr);
+				    &ipv6_hdr(skb)->saddr,
+				    &ipv6_hdr(skb)->daddr);
 		goto csum_error;
 	}
 
@@ -1220,7 +1217,8 @@ static int icmpv6_rcv(struct sk_buff *skb)
 		break;
 
 	net_dbg_ratelimited("icmpv6: msg of unknown type [%pI6c > %pI6c]\n",
-			    saddr, daddr);
+			    &ipv6_hdr(skb)->saddr,
+			    &ipv6_hdr(skb)->daddr);
 
 	/*
 	 * error of unknown type.

From 8cff9dbe89d8bd44d9a5e631c9394dd3901ffd79 Mon Sep 17 00:00:00 2001
From: KhaiWenTan
Date: Thu, 16 Apr 2026 18:26:09 +0800
Subject: [PATCH 051/148] net: stmmac: Update default_an_inband before passing
 value to phylink_config

get_interfaces() will update both plat->phy_interfaces and
mdio_bus_data->default_an_inband based on reading a SERDES register.
Since get_interfaces() was called after default_an_inband had already
been read, dwmac-intel regressed, ending up with an incorrect
default_an_inband value in phylink_config.

Therefore, call priv->plat->get_interfaces() first, before assigning
priv->plat->default_an_inband to config->default_an_inband, to ensure
default_an_inband holds the correct value.

Fixes: d3836052fe09 ("net: stmmac: intel: convert speed_mode_2500() to get_interfaces()")
Signed-off-by: KhaiWenTan
Reviewed-by: Russell King (Oracle)
Link: https://patch.msgid.link/20260416102609.7953-1-khai.wen.tan@intel.com
Signed-off-by: Jakub Kicinski
---
 drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index 01a983001ab4..ca68248dbc78 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -1410,8 +1410,6 @@ static int stmmac_phylink_setup(struct stmmac_priv *priv)
 	priv->tx_lpi_clk_stop = priv->plat->flags &
 				STMMAC_FLAG_EN_TX_LPI_CLOCKGATING;
 
-	config->default_an_inband = priv->plat->default_an_inband;
-
 	/* Get the PHY interface modes (at the PHY end of the link) that
 	 * are supported by the platform.
 	 */
@@ -1419,6 +1417,8 @@ static int stmmac_phylink_setup(struct stmmac_priv *priv)
 		priv->plat->get_interfaces(priv, priv->plat->bsp_priv,
 					   config->supported_interfaces);
 
+	config->default_an_inband = priv->plat->default_an_inband;
+
 	/* Set the platform/firmware specified interface mode if the
 	 * supported interfaces have not already been provided using
 	 * phy_interface as a last resort.
From 965dc93481d1b80d341bdd16c27b16fe197175ee Mon Sep 17 00:00:00 2001 From: Kuniyuki Iwashima Date: Wed, 15 Apr 2026 18:48:29 +0000 Subject: [PATCH 052/148] af_unix: Drop all SCM attributes for SOCKMAP. SOCKMAP can hide inflight fd from AF_UNIX GC. When a socket in SOCKMAP receives skb with inflight fd, sk_psock_verdict_data_ready() looks up the mapped socket and enqueue skb to its psock->ingress_skb. Since neither the old nor the new GC can inspect the psock queue, the hidden skb leaks the inflight sockets. Note that this cannot be detected via kmemleak because inflight sockets are linked to a global list. In addition, SOCKMAP redirect breaks the Tarjan-based GC's assumption that unix_edge.successor is always alive, which is no longer true once skb is redirected, resulting in use-after-free below. [0] Moreover, SOCKMAP does not call scm_stat_del() properly, so unix_show_fdinfo() could report an incorrect fd count. sk_msg_recvmsg() does not support any SCM attributes in the first place. Let's drop all SCM attributes before passing skb to the SOCKMAP layer. [0]: BUG: KASAN: slab-use-after-free in unix_del_edges (net/unix/garbage.c:118 net/unix/garbage.c:181 net/unix/garbage.c:251) Read of size 8 at addr ffff888125362670 by task kworker/56:1/496 CPU: 56 UID: 0 PID: 496 Comm: kworker/56:1 Not tainted 7.0.0-rc7-00263-gb9d8b856689d #3 PREEMPT(lazy) Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.17.0-debian-1.17.0-1 04/01/2014 Workqueue: events sk_psock_backlog Call Trace: dump_stack_lvl (lib/dump_stack.c:122) print_report (mm/kasan/report.c:379) kasan_report (mm/kasan/report.c:597) unix_del_edges (net/unix/garbage.c:118 net/unix/garbage.c:181 net/unix/garbage.c:251) unix_destroy_fpl (net/unix/garbage.c:317) unix_destruct_scm (./include/net/scm.h:80 ./include/net/scm.h:86 net/unix/af_unix.c:1976) sk_psock_backlog (./include/linux/skbuff.h:?) process_scheduled_works (kernel/workqueue.c:?) worker_thread (kernel/workqueue.c:?) kthread (kernel/kthread.c:438) ret_from_fork (arch/x86/kernel/process.c:164) ret_from_fork_asm (arch/x86/entry/entry_64.S:258) Allocated by task 955: kasan_save_track (mm/kasan/common.c:58 mm/kasan/common.c:78) __kasan_slab_alloc (mm/kasan/common.c:369) kmem_cache_alloc_noprof (mm/slub.c:4539) sk_prot_alloc (net/core/sock.c:2240) sk_alloc (net/core/sock.c:2301) unix_create1 (net/unix/af_unix.c:1099) unix_create (net/unix/af_unix.c:1169) __sock_create (net/socket.c:1606) __sys_socketpair (net/socket.c:1811) __x64_sys_socketpair (net/socket.c:1863 net/socket.c:1860 net/socket.c:1860) do_syscall_64 (arch/x86/entry/syscall_64.c:?) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130) Freed by task 496: kasan_save_track (mm/kasan/common.c:58 mm/kasan/common.c:78) kasan_save_free_info (mm/kasan/generic.c:587) __kasan_slab_free (mm/kasan/common.c:287) kmem_cache_free (mm/slub.c:6165) __sk_destruct (net/core/sock.c:2282 net/core/sock.c:2384) sk_psock_destroy (./include/net/sock.h:?) process_scheduled_works (kernel/workqueue.c:?) worker_thread (kernel/workqueue.c:?) 
kthread (kernel/kthread.c:438) ret_from_fork (arch/x86/kernel/process.c:164) ret_from_fork_asm (arch/x86/entry/entry_64.S:258) Fixes: c63829182c37 ("af_unix: Implement ->psock_update_sk_prot()") Fixes: 77462de14a43 ("af_unix: Add read_sock for stream socket types") Reported-by: Xingyu Jin Signed-off-by: Kuniyuki Iwashima Link: https://patch.msgid.link/20260415184830.3988432-1-kuniyu@google.com Signed-off-by: Jakub Kicinski --- net/unix/af_unix.c | 35 +++++++++++++++++++++++++++-------- 1 file changed, 27 insertions(+), 8 deletions(-) diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c index 4c4a8d23ddd2..fa34c7aec88d 100644 --- a/net/unix/af_unix.c +++ b/net/unix/af_unix.c @@ -1968,16 +1968,19 @@ static void unix_peek_fds(struct scm_cookie *scm, struct sk_buff *skb) static void unix_destruct_scm(struct sk_buff *skb) { - struct scm_cookie scm; + struct scm_cookie scm = {}; + + swap(scm.pid, UNIXCB(skb).pid); - memset(&scm, 0, sizeof(scm)); - scm.pid = UNIXCB(skb).pid; if (UNIXCB(skb).fp) unix_detach_fds(&scm, skb); - /* Alas, it calls VFS */ - /* So fscking what? fput() had been SMP-safe since the last Summer */ scm_destroy(&scm); +} + +static void unix_wfree(struct sk_buff *skb) +{ + unix_destruct_scm(skb); sock_wfree(skb); } @@ -1993,7 +1996,7 @@ static int unix_scm_to_skb(struct scm_cookie *scm, struct sk_buff *skb, bool sen if (scm->fp && send_fds) err = unix_attach_fds(scm, skb); - skb->destructor = unix_destruct_scm; + skb->destructor = unix_wfree; return err; } @@ -2070,6 +2073,13 @@ static void scm_stat_del(struct sock *sk, struct sk_buff *skb) } } +static void unix_orphan_scm(struct sock *sk, struct sk_buff *skb) +{ + scm_stat_del(sk, skb); + unix_destruct_scm(skb); + skb->destructor = sock_wfree; +} + /* * Send AF_UNIX data. */ @@ -2683,10 +2693,16 @@ static int unix_read_skb(struct sock *sk, skb_read_actor_t recv_actor) int err; mutex_lock(&u->iolock); + skb = skb_recv_datagram(sk, MSG_DONTWAIT, &err); - mutex_unlock(&u->iolock); - if (!skb) + if (!skb) { + mutex_unlock(&u->iolock); return err; + } + + unix_orphan_scm(sk, skb); + + mutex_unlock(&u->iolock); return recv_actor(sk, skb); } @@ -2886,6 +2902,9 @@ static int unix_stream_read_skb(struct sock *sk, skb_read_actor_t recv_actor) #endif spin_unlock(&queue->lock); + + unix_orphan_scm(sk, skb); + mutex_unlock(&u->iolock); return recv_actor(sk, skb); From 5c9fcac3c872224316714d0d8914d9af16c76a6d Mon Sep 17 00:00:00 2001 From: Marek Vasut Date: Thu, 16 Apr 2026 01:09:44 +0200 Subject: [PATCH 053/148] net: ks8851: Reinstate disabling of BHs around IRQ handler If the driver executes ks8851_irq() AND a TX packet has been sent, then the driver enables TX queue via netif_wake_queue() which schedules TX softirq to queue packets for this device. If CONFIG_PREEMPT_RT=y is set AND a packet has also been received by the MAC, then ks8851_rx_pkts() calls netdev_alloc_skb_ip_align() to allocate SKBs for the received packets. If netdev_alloc_skb_ip_align() is called with BH enabled, then local_bh_enable() at the end of netdev_alloc_skb_ip_align() will trigger the pending softirq processing, which may ultimately call the .xmit callback ks8851_start_xmit_par(). The ks8851_start_xmit_par() will try to lock struct ks8851_net_par .lock spinlock, which is already locked by ks8851_irq() from which ks8851_start_xmit_par() was called. This leads to a deadlock, which is reported by the kernel, including a trace listed below. 
If CONFIG_PREEMPT_RT is not set, then since commit 0913ec336a6c0 ("net: ks8851: Fix deadlock with the SPI chip variant") the deadlock can also be triggered without received packet in the RX FIFO. The pending softirqs will be processed on return from spin_unlock_bh(&ks->statelock) in ks8851_irq(), which triggers the deadlock as well. Fix the problem by disabling BH around critical sections, including the IRQ handler, thus preventing the net_tx_action() softirq from triggering during these critical sections. The net_tx_action() softirq is triggered once BH are re-enabled and at the end of the IRQ handler, once all the other IRQ handler actions have been completed. __schedule from schedule_rtlock+0x1c/0x34 schedule_rtlock from rtlock_slowlock_locked+0x548/0x904 rtlock_slowlock_locked from rt_spin_lock+0x60/0x9c rt_spin_lock from ks8851_start_xmit_par+0x74/0x1a8 ks8851_start_xmit_par from netdev_start_xmit+0x20/0x44 netdev_start_xmit from dev_hard_start_xmit+0xd0/0x188 dev_hard_start_xmit from sch_direct_xmit+0xb8/0x25c sch_direct_xmit from __qdisc_run+0x1f8/0x4ec __qdisc_run from qdisc_run+0x1c/0x28 qdisc_run from net_tx_action+0x1f0/0x268 net_tx_action from handle_softirqs+0x1a4/0x270 handle_softirqs from __local_bh_enable_ip+0xcc/0xe0 __local_bh_enable_ip from __alloc_skb+0xd8/0x128 __alloc_skb from __netdev_alloc_skb+0x3c/0x19c __netdev_alloc_skb from ks8851_irq+0x388/0x4d4 ks8851_irq from irq_thread_fn+0x24/0x64 irq_thread_fn from irq_thread+0x178/0x28c irq_thread from kthread+0x12c/0x138 kthread from ret_from_fork+0x14/0x28 Reviewed-by: Sebastian Andrzej Siewior Fixes: e0863634bf9f ("net: ks8851: Queue RX packets in IRQ handler instead of disabling BHs") Cc: stable@vger.kernel.org Signed-off-by: Marek Vasut Link: https://patch.msgid.link/20260415231020.455298-1-marex@nabladev.com Signed-off-by: Jakub Kicinski --- drivers/net/ethernet/micrel/ks8851.h | 6 +- drivers/net/ethernet/micrel/ks8851_common.c | 64 +++++++++------------ drivers/net/ethernet/micrel/ks8851_par.c | 15 ++--- drivers/net/ethernet/micrel/ks8851_spi.c | 11 ++-- 4 files changed, 38 insertions(+), 58 deletions(-) diff --git a/drivers/net/ethernet/micrel/ks8851.h b/drivers/net/ethernet/micrel/ks8851.h index 31f75b4a67fd..b795a3a60571 100644 --- a/drivers/net/ethernet/micrel/ks8851.h +++ b/drivers/net/ethernet/micrel/ks8851.h @@ -408,10 +408,8 @@ struct ks8851_net { struct gpio_desc *gpio; struct mii_bus *mii_bus; - void (*lock)(struct ks8851_net *ks, - unsigned long *flags); - void (*unlock)(struct ks8851_net *ks, - unsigned long *flags); + void (*lock)(struct ks8851_net *ks); + void (*unlock)(struct ks8851_net *ks); unsigned int (*rdreg16)(struct ks8851_net *ks, unsigned int reg); void (*wrreg16)(struct ks8851_net *ks, diff --git a/drivers/net/ethernet/micrel/ks8851_common.c b/drivers/net/ethernet/micrel/ks8851_common.c index 8048770958d6..6c375647b24d 100644 --- a/drivers/net/ethernet/micrel/ks8851_common.c +++ b/drivers/net/ethernet/micrel/ks8851_common.c @@ -28,25 +28,23 @@ /** * ks8851_lock - register access lock * @ks: The chip state - * @flags: Spinlock flags * * Claim chip register access lock */ -static void ks8851_lock(struct ks8851_net *ks, unsigned long *flags) +static void ks8851_lock(struct ks8851_net *ks) { - ks->lock(ks, flags); + ks->lock(ks); } /** * ks8851_unlock - register access unlock * @ks: The chip state - * @flags: Spinlock flags * * Release chip register access lock */ -static void ks8851_unlock(struct ks8851_net *ks, unsigned long *flags) +static void ks8851_unlock(struct ks8851_net *ks) { - 
ks->unlock(ks, flags); + ks->unlock(ks); } /** @@ -129,11 +127,10 @@ static void ks8851_set_powermode(struct ks8851_net *ks, unsigned pwrmode) static int ks8851_write_mac_addr(struct net_device *dev) { struct ks8851_net *ks = netdev_priv(dev); - unsigned long flags; u16 val; int i; - ks8851_lock(ks, &flags); + ks8851_lock(ks); /* * Wake up chip in case it was powered off when stopped; otherwise, @@ -149,7 +146,7 @@ static int ks8851_write_mac_addr(struct net_device *dev) if (!netif_running(dev)) ks8851_set_powermode(ks, PMECR_PM_SOFTDOWN); - ks8851_unlock(ks, &flags); + ks8851_unlock(ks); return 0; } @@ -163,12 +160,11 @@ static int ks8851_write_mac_addr(struct net_device *dev) static void ks8851_read_mac_addr(struct net_device *dev) { struct ks8851_net *ks = netdev_priv(dev); - unsigned long flags; u8 addr[ETH_ALEN]; u16 reg; int i; - ks8851_lock(ks, &flags); + ks8851_lock(ks); for (i = 0; i < ETH_ALEN; i += 2) { reg = ks8851_rdreg16(ks, KS_MAR(i)); @@ -177,7 +173,7 @@ static void ks8851_read_mac_addr(struct net_device *dev) } eth_hw_addr_set(dev, addr); - ks8851_unlock(ks, &flags); + ks8851_unlock(ks); } /** @@ -312,11 +308,10 @@ static irqreturn_t ks8851_irq(int irq, void *_ks) { struct ks8851_net *ks = _ks; struct sk_buff_head rxq; - unsigned long flags; unsigned int status; struct sk_buff *skb; - ks8851_lock(ks, &flags); + ks8851_lock(ks); status = ks8851_rdreg16(ks, KS_ISR); ks8851_wrreg16(ks, KS_ISR, status); @@ -373,7 +368,7 @@ static irqreturn_t ks8851_irq(int irq, void *_ks) ks8851_wrreg16(ks, KS_RXCR1, rxc->rxcr1); } - ks8851_unlock(ks, &flags); + ks8851_unlock(ks); if (status & IRQ_LCI) mii_check_link(&ks->mii); @@ -405,7 +400,6 @@ static void ks8851_flush_tx_work(struct ks8851_net *ks) static int ks8851_net_open(struct net_device *dev) { struct ks8851_net *ks = netdev_priv(dev); - unsigned long flags; int ret; ret = request_threaded_irq(dev->irq, NULL, ks8851_irq, @@ -418,7 +412,7 @@ static int ks8851_net_open(struct net_device *dev) /* lock the card, even if we may not actually be doing anything * else at the moment */ - ks8851_lock(ks, &flags); + ks8851_lock(ks); netif_dbg(ks, ifup, ks->netdev, "opening\n"); @@ -471,7 +465,7 @@ static int ks8851_net_open(struct net_device *dev) netif_dbg(ks, ifup, ks->netdev, "network device up\n"); - ks8851_unlock(ks, &flags); + ks8851_unlock(ks); mii_check_link(&ks->mii); return 0; } @@ -487,23 +481,22 @@ static int ks8851_net_open(struct net_device *dev) static int ks8851_net_stop(struct net_device *dev) { struct ks8851_net *ks = netdev_priv(dev); - unsigned long flags; netif_info(ks, ifdown, dev, "shutting down\n"); netif_stop_queue(dev); - ks8851_lock(ks, &flags); + ks8851_lock(ks); /* turn off the IRQs and ack any outstanding */ ks8851_wrreg16(ks, KS_IER, 0x0000); ks8851_wrreg16(ks, KS_ISR, 0xffff); - ks8851_unlock(ks, &flags); + ks8851_unlock(ks); /* stop any outstanding work */ ks8851_flush_tx_work(ks); flush_work(&ks->rxctrl_work); - ks8851_lock(ks, &flags); + ks8851_lock(ks); /* shutdown RX process */ ks8851_wrreg16(ks, KS_RXCR1, 0x0000); @@ -512,7 +505,7 @@ static int ks8851_net_stop(struct net_device *dev) /* set powermode to soft power down to save power */ ks8851_set_powermode(ks, PMECR_PM_SOFTDOWN); - ks8851_unlock(ks, &flags); + ks8851_unlock(ks); /* ensure any queued tx buffers are dumped */ while (!skb_queue_empty(&ks->txq)) { @@ -566,14 +559,13 @@ static netdev_tx_t ks8851_start_xmit(struct sk_buff *skb, static void ks8851_rxctrl_work(struct work_struct *work) { struct ks8851_net *ks = container_of(work, struct ks8851_net, 
rxctrl_work); - unsigned long flags; - ks8851_lock(ks, &flags); + ks8851_lock(ks); /* need to shutdown RXQ before modifying filter parameters */ ks8851_wrreg16(ks, KS_RXCR1, 0x00); - ks8851_unlock(ks, &flags); + ks8851_unlock(ks); } static void ks8851_set_rx_mode(struct net_device *dev) @@ -780,7 +772,6 @@ static int ks8851_set_eeprom(struct net_device *dev, { struct ks8851_net *ks = netdev_priv(dev); int offset = ee->offset; - unsigned long flags; int len = ee->len; u16 tmp; @@ -794,7 +785,7 @@ static int ks8851_set_eeprom(struct net_device *dev, if (!(ks->rc_ccr & CCR_EEPROM)) return -ENOENT; - ks8851_lock(ks, &flags); + ks8851_lock(ks); ks8851_eeprom_claim(ks); @@ -817,7 +808,7 @@ static int ks8851_set_eeprom(struct net_device *dev, eeprom_93cx6_wren(&ks->eeprom, false); ks8851_eeprom_release(ks); - ks8851_unlock(ks, &flags); + ks8851_unlock(ks); return 0; } @@ -827,7 +818,6 @@ static int ks8851_get_eeprom(struct net_device *dev, { struct ks8851_net *ks = netdev_priv(dev); int offset = ee->offset; - unsigned long flags; int len = ee->len; /* must be 2 byte aligned */ @@ -837,7 +827,7 @@ static int ks8851_get_eeprom(struct net_device *dev, if (!(ks->rc_ccr & CCR_EEPROM)) return -ENOENT; - ks8851_lock(ks, &flags); + ks8851_lock(ks); ks8851_eeprom_claim(ks); @@ -845,7 +835,7 @@ static int ks8851_get_eeprom(struct net_device *dev, eeprom_93cx6_multiread(&ks->eeprom, offset/2, (__le16 *)data, len/2); ks8851_eeprom_release(ks); - ks8851_unlock(ks, &flags); + ks8851_unlock(ks); return 0; } @@ -904,7 +894,6 @@ static int ks8851_phy_reg(int reg) static int ks8851_phy_read_common(struct net_device *dev, int phy_addr, int reg) { struct ks8851_net *ks = netdev_priv(dev); - unsigned long flags; int result; int ksreg; @@ -912,9 +901,9 @@ static int ks8851_phy_read_common(struct net_device *dev, int phy_addr, int reg) if (ksreg < 0) return ksreg; - ks8851_lock(ks, &flags); + ks8851_lock(ks); result = ks8851_rdreg16(ks, ksreg); - ks8851_unlock(ks, &flags); + ks8851_unlock(ks); return result; } @@ -949,14 +938,13 @@ static void ks8851_phy_write(struct net_device *dev, int phy, int reg, int value) { struct ks8851_net *ks = netdev_priv(dev); - unsigned long flags; int ksreg; ksreg = ks8851_phy_reg(reg); if (ksreg >= 0) { - ks8851_lock(ks, &flags); + ks8851_lock(ks); ks8851_wrreg16(ks, ksreg, value); - ks8851_unlock(ks, &flags); + ks8851_unlock(ks); } } diff --git a/drivers/net/ethernet/micrel/ks8851_par.c b/drivers/net/ethernet/micrel/ks8851_par.c index 78695be2570b..9f1c33f6ddec 100644 --- a/drivers/net/ethernet/micrel/ks8851_par.c +++ b/drivers/net/ethernet/micrel/ks8851_par.c @@ -55,29 +55,27 @@ struct ks8851_net_par { /** * ks8851_lock_par - register access lock * @ks: The chip state - * @flags: Spinlock flags * * Claim chip register access lock */ -static void ks8851_lock_par(struct ks8851_net *ks, unsigned long *flags) +static void ks8851_lock_par(struct ks8851_net *ks) { struct ks8851_net_par *ksp = to_ks8851_par(ks); - spin_lock_irqsave(&ksp->lock, *flags); + spin_lock_bh(&ksp->lock); } /** * ks8851_unlock_par - register access unlock * @ks: The chip state - * @flags: Spinlock flags * * Release chip register access lock */ -static void ks8851_unlock_par(struct ks8851_net *ks, unsigned long *flags) +static void ks8851_unlock_par(struct ks8851_net *ks) { struct ks8851_net_par *ksp = to_ks8851_par(ks); - spin_unlock_irqrestore(&ksp->lock, *flags); + spin_unlock_bh(&ksp->lock); } /** @@ -233,7 +231,6 @@ static netdev_tx_t ks8851_start_xmit_par(struct sk_buff *skb, { struct ks8851_net *ks = 
netdev_priv(dev); netdev_tx_t ret = NETDEV_TX_OK; - unsigned long flags; unsigned int txqcr; u16 txmir; int err; @@ -241,7 +238,7 @@ static netdev_tx_t ks8851_start_xmit_par(struct sk_buff *skb, netif_dbg(ks, tx_queued, ks->netdev, "%s: skb %p, %d@%p\n", __func__, skb, skb->len, skb->data); - ks8851_lock_par(ks, &flags); + ks8851_lock_par(ks); txmir = ks8851_rdreg16_par(ks, KS_TXMIR) & 0x1fff; @@ -262,7 +259,7 @@ static netdev_tx_t ks8851_start_xmit_par(struct sk_buff *skb, ret = NETDEV_TX_BUSY; } - ks8851_unlock_par(ks, &flags); + ks8851_unlock_par(ks); return ret; } diff --git a/drivers/net/ethernet/micrel/ks8851_spi.c b/drivers/net/ethernet/micrel/ks8851_spi.c index a161ae45743a..b9e68520278d 100644 --- a/drivers/net/ethernet/micrel/ks8851_spi.c +++ b/drivers/net/ethernet/micrel/ks8851_spi.c @@ -71,11 +71,10 @@ struct ks8851_net_spi { /** * ks8851_lock_spi - register access lock * @ks: The chip state - * @flags: Spinlock flags * * Claim chip register access lock */ -static void ks8851_lock_spi(struct ks8851_net *ks, unsigned long *flags) +static void ks8851_lock_spi(struct ks8851_net *ks) { struct ks8851_net_spi *kss = to_ks8851_spi(ks); @@ -85,11 +84,10 @@ static void ks8851_lock_spi(struct ks8851_net *ks, unsigned long *flags) /** * ks8851_unlock_spi - register access unlock * @ks: The chip state - * @flags: Spinlock flags * * Release chip register access lock */ -static void ks8851_unlock_spi(struct ks8851_net *ks, unsigned long *flags) +static void ks8851_unlock_spi(struct ks8851_net *ks) { struct ks8851_net_spi *kss = to_ks8851_spi(ks); @@ -309,7 +307,6 @@ static void ks8851_tx_work(struct work_struct *work) struct ks8851_net_spi *kss; unsigned short tx_space; struct ks8851_net *ks; - unsigned long flags; struct sk_buff *txb; bool last; @@ -317,7 +314,7 @@ static void ks8851_tx_work(struct work_struct *work) ks = &kss->ks8851; last = skb_queue_empty(&ks->txq); - ks8851_lock_spi(ks, &flags); + ks8851_lock_spi(ks); while (!last) { txb = skb_dequeue(&ks->txq); @@ -343,7 +340,7 @@ static void ks8851_tx_work(struct work_struct *work) ks->tx_space = tx_space; spin_unlock_bh(&ks->statelock); - ks8851_unlock_spi(ks, &flags); + ks8851_unlock_spi(ks); } /** From 22230e68b2cf1ab6b027be8cf1198164a949c4fa Mon Sep 17 00:00:00 2001 From: Marek Vasut Date: Thu, 16 Apr 2026 01:09:45 +0200 Subject: [PATCH 054/148] net: ks8851: Avoid excess softirq scheduling The code injects a packet into netif_rx() repeatedly, which will add it to its internal NAPI and schedule a softirq, and process it. It is more efficient to queue multiple packets and process them all at the local_bh_enable() time. 
Reviewed-by: Sebastian Andrzej Siewior Fixes: e0863634bf9f ("net: ks8851: Queue RX packets in IRQ handler instead of disabling BHs") Cc: stable@vger.kernel.org Signed-off-by: Marek Vasut Link: https://patch.msgid.link/20260415231020.455298-2-marex@nabladev.com Signed-off-by: Jakub Kicinski --- drivers/net/ethernet/micrel/ks8851_common.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/drivers/net/ethernet/micrel/ks8851_common.c b/drivers/net/ethernet/micrel/ks8851_common.c index 6c375647b24d..4afbb40bc0e4 100644 --- a/drivers/net/ethernet/micrel/ks8851_common.c +++ b/drivers/net/ethernet/micrel/ks8851_common.c @@ -373,9 +373,12 @@ static irqreturn_t ks8851_irq(int irq, void *_ks) if (status & IRQ_LCI) mii_check_link(&ks->mii); - if (status & IRQ_RXI) + if (status & IRQ_RXI) { + local_bh_disable(); while ((skb = __skb_dequeue(&rxq))) netif_rx(skb); + local_bh_enable(); + } return IRQ_HANDLED; } From 0cf004ffb61cd32d140531c3a84afe975f9fc7ea Mon Sep 17 00:00:00 2001 From: Michael Bommarito Date: Wed, 15 Apr 2026 23:19:03 -0400 Subject: [PATCH 055/148] sctp: fix OOB write to userspace in sctp_getsockopt_peer_auth_chunks sctp_getsockopt_peer_auth_chunks() checks that the caller's optval buffer is large enough for the peer AUTH chunk list with if (len < num_chunks) return -EINVAL; but then writes num_chunks bytes to p->gauth_chunks, which lives at offset offsetof(struct sctp_authchunks, gauth_chunks) == 8 inside optval. The check is missing the sizeof(struct sctp_authchunks) = 8-byte header. When the caller supplies len == num_chunks (for any num_chunks > 0) the test passes but copy_to_user() writes sizeof(struct sctp_authchunks) = 8 bytes past the declared buffer. The sibling function sctp_getsockopt_local_auth_chunks() at the next line already has the correct check: if (len < sizeof(struct sctp_authchunks) + num_chunks) return -EINVAL; Align the peer variant with its sibling. Reproducer confirms on v7.0-13-generic: an unprivileged userspace caller that opens a loopback SCTP association with AUTH enabled, queries num_chunks with a short optval, then issues the real getsockopt with len == num_chunks and sentinel bytes painted past the buffer observes those sentinel bytes overwritten with the peer's AUTH chunk type. The bytes written are under the peer's control but land in the caller's own userspace; this is not a kernel memory corruption, but it is a kernel-side contract violation that can silently corrupt adjacent userspace data. 
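The arithmetic behind the fixed check is worth spelling out; the struct
layout below is reproduced from the uapi header for illustration:

#include <stdio.h>
#include <stdint.h>

typedef int32_t sctp_assoc_t;

struct sctp_authchunks {
	sctp_assoc_t gauth_assoc_id;
	uint32_t     gauth_number_of_chunks;
	uint8_t      gauth_chunks[];
};

int main(void)
{
	unsigned int num_chunks = 4;	/* peer advertises 4 chunk types */
	unsigned int len = num_chunks;	/* caller passes a short optlen */

	/* The kernel writes num_chunks bytes at offset sizeof(header). */
	printf("header size: %zu bytes\n", sizeof(struct sctp_authchunks));
	printf("old check (len < num_chunks) rejects: %d\n",
	       len < num_chunks);		/* 0: accepted, OOB write */
	printf("new check rejects: %d\n",
	       len < sizeof(struct sctp_authchunks) + num_chunks);	/* 1 */
	return 0;
}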
Fixes: 65b07e5d0d09 ("[SCTP]: API updates to suport SCTP-AUTH extensions.") Assisted-by: Claude:claude-opus-4-6 Signed-off-by: Michael Bommarito Acked-by: Xin Long Link: https://patch.msgid.link/20260416031903.1447072-1-michael.bommarito@gmail.com Signed-off-by: Jakub Kicinski --- net/sctp/socket.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/sctp/socket.c b/net/sctp/socket.c index d2665bbd41a2..f52fe90d3e00 100644 --- a/net/sctp/socket.c +++ b/net/sctp/socket.c @@ -7033,7 +7033,7 @@ static int sctp_getsockopt_peer_auth_chunks(struct sock *sk, int len, /* See if the user provided enough room for all the data */ num_chunks = ntohs(ch->param_hdr.length) - sizeof(struct sctp_paramhdr); - if (len < num_chunks) + if (len < sizeof(struct sctp_authchunks) + num_chunks) return -EINVAL; if (copy_to_user(to, ch->chunks, num_chunks)) From d6c19b31a3c1d519fabdcf0aa239e6b6109b9473 Mon Sep 17 00:00:00 2001 From: Qingfang Deng Date: Wed, 15 Apr 2026 10:24:50 +0800 Subject: [PATCH 056/148] flow_dissector: do not dissect PPPoE PFC frames RFC 2516 Section 7 states that Protocol Field Compression (PFC) is NOT RECOMMENDED for PPPoE. In practice, pppd does not support negotiating PFC for PPPoE sessions, and the flow dissector driver has assumed an uncompressed frame until the blamed commit. During the review process of that commit [1], support for PFC is suggested. However, having a compressed (1-byte) protocol field means the subsequent PPP payload is shifted by one byte, causing 4-byte misalignment for the network header and an unaligned access exception on some architectures. The exception can be reproduced by sending a PPPoE PFC frame to an ethernet interface of a MIPS board, with RPS enabled, even if no PPPoE session is active on that interface: $ 0 : 00000000 80c40000 00000000 85144817 $ 4 : 00000008 00000100 80a75758 81dc9bb8 $ 8 : 00000010 8087ae2c 0000003d 00000000 $12 : 000000e0 00000039 00000000 00000000 $16 : 85043240 80a75758 81dc9bb8 00006488 $20 : 0000002f 00000007 85144810 80a70000 $24 : 81d1bda0 00000000 $28 : 81dc8000 81dc9aa8 00000000 805ead08 Hi : 00009d51 Lo : 2163358a epc : 805e91f0 __skb_flow_dissect+0x1b0/0x1b50 ra : 805ead08 __skb_get_hash_net+0x74/0x12c Status: 11000403 KERNEL EXL IE Cause : 40800010 (ExcCode 04) BadVA : 85144817 PrId : 0001992f (MIPS 1004Kc) Call Trace: [<805e91f0>] __skb_flow_dissect+0x1b0/0x1b50 [<805ead08>] __skb_get_hash_net+0x74/0x12c [<805ef330>] get_rps_cpu+0x1b8/0x3fc [<805fca70>] netif_receive_skb_list_internal+0x324/0x364 [<805fd120>] napi_complete_done+0x68/0x2a4 [<8058de5c>] mtk_napi_rx+0x228/0xfec [<805fd398>] __napi_poll+0x3c/0x1c4 [<805fd754>] napi_threaded_poll_loop+0x234/0x29c [<805fd848>] napi_threaded_poll+0x8c/0xb0 [<80053544>] kthread+0x104/0x12c [<80002bd8>] ret_from_kernel_thread+0x14/0x1c Code: 02d51821 1060045b 00000000 <8c640000> 3084000f 2c820005 144001a2 00042080 8e220000 To reduce the attack surface and maintain performance, do not process PPPoE PFC frames. 
[1] https://lore.kernel.org/r/20220630231016.GA392@debian.home Fixes: 46126db9c861 ("flow_dissector: Add PPPoE dissectors") Signed-off-by: Qingfang Deng Link: https://patch.msgid.link/20260415022456.141758-1-qingfang.deng@linux.dev Signed-off-by: Jakub Kicinski --- net/core/flow_dissector.c | 13 +++++-------- 1 file changed, 5 insertions(+), 8 deletions(-) diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c index 1b61bb25ba0e..2a98f5fa74eb 100644 --- a/net/core/flow_dissector.c +++ b/net/core/flow_dissector.c @@ -1374,16 +1374,13 @@ bool __skb_flow_dissect(const struct net *net, break; } - /* least significant bit of the most significant octet - * indicates if protocol field was compressed + /* PFC (compressed 1-byte protocol) frames are not processed. + * A compressed protocol field has the least significant bit of + * the most significant octet set, which will fail the following + * ppp_proto_is_valid(), returning FLOW_DISSECT_RET_OUT_BAD. */ ppp_proto = ntohs(hdr->proto); - if (ppp_proto & 0x0100) { - ppp_proto = ppp_proto >> 8; - nhoff += PPPOE_SES_HLEN - 1; - } else { - nhoff += PPPOE_SES_HLEN; - } + nhoff += PPPOE_SES_HLEN; if (ppp_proto == PPP_IP) { proto = htons(ETH_P_IP); From cc1ff87bce1ccd38410ab10960f576dcd17db679 Mon Sep 17 00:00:00 2001 From: Qingfang Deng Date: Wed, 15 Apr 2026 10:24:51 +0800 Subject: [PATCH 057/148] pppoe: drop PFC frames RFC 2516 Section 7 states that Protocol Field Compression (PFC) is NOT RECOMMENDED for PPPoE. In practice, pppd does not support negotiating PFC for PPPoE sessions, and the current PPPoE driver assumes an uncompressed (2-byte) protocol field. However, the generic PPP layer function ppp_input() is not aware of the negotiation result, and still accepts PFC frames. If a peer with a broken implementation or an attacker sends a frame with a compressed (1-byte) protocol field, the subsequent PPP payload is shifted by one byte. This causes the network header to be 4-byte misaligned, which may trigger unaligned access exceptions on some architectures. To reduce the attack surface, drop PPPoE PFC frames. Introduce ppp_skb_is_compressed_proto() helper function to be used in both ppp_generic.c and pppoe.c to avoid open-coding. 
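The bit math behind the drop is visible with a self-contained copy of
ppp_proto_is_valid() (the same check shown in the hunk below): a PFC
frame's first octet has its least significant bit set, so read as the
high octet of a 16-bit protocol it fails the validity test:

#include <stdio.h>
#include <stdint.h>

/* Same check as ppp_proto_is_valid() in include/linux/ppp_defs.h. */
static int ppp_proto_is_valid(uint16_t proto)
{
	return (proto & 0x0101) == 0x0001;
}

int main(void)
{
	uint16_t uncompressed = 0x0021;	/* PPP_IP, full 2-byte protocol */
	uint16_t pfc_wire = 0x2145;	/* PFC: 0x21 then first IP byte 0x45 */

	printf("0x%04x valid: %d\n", uncompressed,
	       ppp_proto_is_valid(uncompressed));	/* 1 */
	printf("0x%04x valid: %d\n", pfc_wire,
	       ppp_proto_is_valid(pfc_wire));		/* 0 -> dropped */
	return 0;
}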
Fixes: 7fb1b8ca8fa1 ("ppp: Move PFC decompression to PPP generic layer")
Signed-off-by: Qingfang Deng
Reviewed-by: Simon Horman
Link: https://patch.msgid.link/20260415022456.141758-2-qingfang.deng@linux.dev
Signed-off-by: Jakub Kicinski
---
 drivers/net/ppp/ppp_generic.c |  2 +-
 drivers/net/ppp/pppoe.c       |  8 +++++++-
 include/linux/ppp_defs.h      | 16 ++++++++++++++++
 3 files changed, 24 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ppp/ppp_generic.c b/drivers/net/ppp/ppp_generic.c
index b0d3bc49c685..57c68efa5ff8 100644
--- a/drivers/net/ppp/ppp_generic.c
+++ b/drivers/net/ppp/ppp_generic.c
@@ -2245,7 +2245,7 @@ ppp_do_recv(struct ppp *ppp, struct sk_buff *skb, struct channel *pch)
  */
 static void __ppp_decompress_proto(struct sk_buff *skb)
 {
-	if (skb->data[0] & 0x01)
+	if (ppp_skb_is_compressed_proto(skb))
 		*(u8 *)skb_push(skb, 1) = 0x00;
 }
 
diff --git a/drivers/net/ppp/pppoe.c b/drivers/net/ppp/pppoe.c
index d546a7af0d54..bdd61c504a1c 100644
--- a/drivers/net/ppp/pppoe.c
+++ b/drivers/net/ppp/pppoe.c
@@ -393,7 +393,7 @@ static int pppoe_rcv(struct sk_buff *skb, struct net_device *dev,
 	if (skb_mac_header_len(skb) < ETH_HLEN)
 		goto drop;
 
-	if (!pskb_may_pull(skb, sizeof(struct pppoe_hdr)))
+	if (!pskb_may_pull(skb, PPPOE_SES_HLEN))
 		goto drop;
 
 	ph = pppoe_hdr(skb);
@@ -403,6 +403,12 @@ static int pppoe_rcv(struct sk_buff *skb, struct net_device *dev,
 	if (skb->len < len)
 		goto drop;
 
+	/* skb->data points to the PPP protocol header after skb_pull_rcsum.
+	 * Drop PFC frames.
+	 */
+	if (ppp_skb_is_compressed_proto(skb))
+		goto drop;
+
 	if (pskb_trim_rcsum(skb, len))
 		goto drop;
 
diff --git a/include/linux/ppp_defs.h b/include/linux/ppp_defs.h
index b7e57fdbd413..b1d1f46d7d3b 100644
--- a/include/linux/ppp_defs.h
+++ b/include/linux/ppp_defs.h
@@ -8,6 +8,7 @@
 #define _PPP_DEFS_H_
 
 #include <linux/crc-ccitt.h>
+#include <linux/skbuff.h>
 #include <uapi/linux/ppp_defs.h>
 
 #define PPP_FCS(fcs, c) crc_ccitt_byte(fcs, c)
@@ -25,4 +26,19 @@ static inline bool ppp_proto_is_valid(u16 proto)
 {
 	return !!((proto & 0x0101) == 0x0001);
 }
+/**
+ * ppp_skb_is_compressed_proto - checks if PPP protocol in a skb is compressed
+ * @skb: skb to check
+ *
+ * Check if the PPP protocol field is compressed (the least significant
+ * bit of the most significant octet is 1). skb->data must point to the PPP
+ * protocol header.
+ *
+ * Return: Whether the PPP protocol field is compressed.
+ */
+static inline bool ppp_skb_is_compressed_proto(const struct sk_buff *skb)
+{
+	return unlikely(skb->data[0] & 0x01);
+}
+
 #endif /* _PPP_DEFS_H_ */

From d03fc81a57956248383efec99967d0ae627390a8 Mon Sep 17 00:00:00 2001
From: Prathamesh Deshpande
Date: Wed, 15 Apr 2026 01:49:37 +0100
Subject: [PATCH 058/148] net/mlx5: Fix HCA caps leak on notifier init failure

mlx5_mdev_init() allocates HCA caps via mlx5_hca_caps_alloc() before
calling mlx5_notifiers_init(). If notifier initialization fails, the
error path jumps to err_hca_caps and skips mlx5_hca_caps_free(), leaking
the allocated caps.

Add a dedicated unwind label for notifier-init failure that frees HCA
caps before continuing the existing cleanup sequence.
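The shape of the resulting unwind ladder, with stand-in functions in
place of the real init and teardown steps (a sketch of the pattern, not
the driver's actual code):

#include <stdio.h>

static int  hca_caps_alloc(void)  { return 0; }
static void hca_caps_free(void)   { puts("caps freed"); }
static int  notifiers_init(void)  { return -1; }	/* simulated failure */
static void earlier_cleanup(void) { puts("earlier steps cleaned"); }

static int mdev_init(void)
{
	int err;

	err = hca_caps_alloc();
	if (err)
		goto err_hca_caps;

	err = notifiers_init();
	if (err)
		goto err_notifiers_init;

	return 0;

err_notifiers_init:
	hca_caps_free();	/* the step the old error path skipped */
err_hca_caps:
	earlier_cleanup();	/* unwind of everything allocated before */
	return err;
}

int main(void)
{
	return mdev_init() ? 1 : 0;
}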
Fixes: b6b03097f982 ("net/mlx5: Initialize events outside devlink lock") Signed-off-by: Prathamesh Deshpande Reviewed-by: Cosmin Ratiu Reviewed-by: Tariq Toukan Link: https://patch.msgid.link/20260415005022.34764-1-prathameshdeshpande7@gmail.com Signed-off-by: Jakub Kicinski --- drivers/net/ethernet/mellanox/mlx5/core/main.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c index b4501cdc2351..09c57841e48d 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/main.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c @@ -1914,7 +1914,7 @@ int mlx5_mdev_init(struct mlx5_core_dev *dev, int profile_idx) err = mlx5_notifiers_init(dev); if (err) - goto err_hca_caps; + goto err_notifiers_init; /* The conjunction of sw_vhca_id with sw_owner_id will be a global * unique id per function which uses mlx5_core. @@ -1930,6 +1930,8 @@ int mlx5_mdev_init(struct mlx5_core_dev *dev, int profile_idx) return 0; +err_notifiers_init: + mlx5_hca_caps_free(dev); err_hca_caps: mlx5_adev_cleanup(dev); err_adev_init: From 2091c6aa0df6aba47deb5c8ab232b1cb60af3519 Mon Sep 17 00:00:00 2001 From: Weiming Shi Date: Wed, 15 Apr 2026 19:46:54 -0700 Subject: [PATCH 059/148] openvswitch: cap upcall PID array size and pre-size vport replies The vport netlink reply helpers allocate a fixed-size skb with nlmsg_new(NLMSG_DEFAULT_SIZE, ...) but serialize the full upcall PID array via ovs_vport_get_upcall_portids(). Since ovs_vport_set_upcall_portids() accepts any non-zero multiple of sizeof(u32) with no upper bound, a CAP_NET_ADMIN user can install a PID array large enough to overflow the reply buffer, causing nla_put() to fail with -EMSGSIZE and hitting BUG_ON(err < 0). On systems with unprivileged user namespaces enabled (e.g., Ubuntu default), this is reachable via unshare -Urn since OVS vport mutation operations use GENL_UNS_ADMIN_PERM. kernel BUG at net/openvswitch/datapath.c:2414! Oops: invalid opcode: 0000 [#1] SMP KASAN NOPTI CPU: 1 UID: 0 PID: 65 Comm: poc Not tainted 7.0.0-rc7-00195-geb216e422044 #1 RIP: 0010:ovs_vport_cmd_set+0x34c/0x400 Call Trace: genl_family_rcv_msg_doit (net/netlink/genetlink.c:1116) genl_rcv_msg (net/netlink/genetlink.c:1194) netlink_rcv_skb (net/netlink/af_netlink.c:2550) genl_rcv (net/netlink/genetlink.c:1219) netlink_unicast (net/netlink/af_netlink.c:1344) netlink_sendmsg (net/netlink/af_netlink.c:1894) __sys_sendto (net/socket.c:2206) __x64_sys_sendto (net/socket.c:2209) do_syscall_64 (arch/x86/entry/syscall_64.c:63) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130) Kernel panic - not syncing: Fatal exception Reject attempts to set more PIDs than nr_cpu_ids in ovs_vport_set_upcall_portids(), and pre-compute the worst-case reply size in ovs_vport_cmd_msg_size() based on that bound, similar to the existing ovs_dp_cmd_msg_size(). nr_cpu_ids matches the cap already used by the per-CPU dispatch configuration on the datapath side (ovs_dp_cmd_fill_info() serialises at most nr_cpu_ids PIDs), so the two sides stay consistent. 
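For reference, nla_total_size(payload) evaluates to NLA_ALIGN(NLA_HDRLEN + payload): a 4-byte attribute header plus the payload rounded up to a 4-byte boundary. A hypothetical sketch of the dominant term in the sizing above:

	/* Worst-case netlink cost of carrying nr_cpu_ids u32 port ids in
	 * a single OVS_VPORT_ATTR_UPCALL_PID attribute (illustrative
	 * helper, not part of the patch).
	 */
	static size_t pid_array_attr_size(unsigned int ncpus)
	{
		return NLA_ALIGN(NLA_HDRLEN + ncpus * sizeof(u32));
	}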
Fixes: 5cd667b0a456 ("openvswitch: Allow each vport to have an array of 'port_id's.") Reported-by: Xiang Mei Assisted-by: Claude:claude-opus-4-6 Signed-off-by: Weiming Shi Reviewed-by: Ilya Maximets Link: https://patch.msgid.link/20260416024653.153456-2-bestswngs@gmail.com Signed-off-by: Jakub Kicinski --- net/openvswitch/datapath.c | 35 +++++++++++++++++++++++++++++++++-- net/openvswitch/vport.c | 3 +++ 2 files changed, 36 insertions(+), 2 deletions(-) diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c index e209099218b4..bbbde50fc649 100644 --- a/net/openvswitch/datapath.c +++ b/net/openvswitch/datapath.c @@ -2184,9 +2184,40 @@ static int ovs_vport_cmd_fill_info(struct vport *vport, struct sk_buff *skb, return err; } +static size_t ovs_vport_cmd_msg_size(void) +{ + size_t msgsize = NLMSG_ALIGN(sizeof(struct ovs_header)); + + msgsize += nla_total_size(sizeof(u32)); /* OVS_VPORT_ATTR_PORT_NO */ + msgsize += nla_total_size(sizeof(u32)); /* OVS_VPORT_ATTR_TYPE */ + msgsize += nla_total_size(IFNAMSIZ); /* OVS_VPORT_ATTR_NAME */ + msgsize += nla_total_size(sizeof(u32)); /* OVS_VPORT_ATTR_IFINDEX */ + msgsize += nla_total_size(sizeof(s32)); /* OVS_VPORT_ATTR_NETNSID */ + + /* OVS_VPORT_ATTR_STATS */ + msgsize += nla_total_size_64bit(sizeof(struct ovs_vport_stats)); + + /* OVS_VPORT_ATTR_UPCALL_STATS(OVS_VPORT_UPCALL_ATTR_SUCCESS + + * OVS_VPORT_UPCALL_ATTR_FAIL) + */ + msgsize += nla_total_size(nla_total_size_64bit(sizeof(u64)) + + nla_total_size_64bit(sizeof(u64))); + + /* OVS_VPORT_ATTR_UPCALL_PID */ + msgsize += nla_total_size(nr_cpu_ids * sizeof(u32)); + + /* OVS_VPORT_ATTR_OPTIONS(OVS_TUNNEL_ATTR_DST_PORT + + * OVS_TUNNEL_ATTR_EXTENSION(OVS_VXLAN_EXT_GBP)) + */ + msgsize += nla_total_size(nla_total_size(sizeof(u16)) + + nla_total_size(nla_total_size(0))); + + return msgsize; +} + static struct sk_buff *ovs_vport_cmd_alloc_info(void) { - return nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL); + return genlmsg_new(ovs_vport_cmd_msg_size(), GFP_KERNEL); } /* Called with ovs_mutex, only via ovs_dp_notify_wq(). */ @@ -2196,7 +2227,7 @@ struct sk_buff *ovs_vport_cmd_build_info(struct vport *vport, struct net *net, struct sk_buff *skb; int retval; - skb = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL); + skb = ovs_vport_cmd_alloc_info(); if (!skb) return ERR_PTR(-ENOMEM); diff --git a/net/openvswitch/vport.c b/net/openvswitch/vport.c index 23f629e94a36..56b2e2d1a749 100644 --- a/net/openvswitch/vport.c +++ b/net/openvswitch/vport.c @@ -406,6 +406,9 @@ int ovs_vport_set_upcall_portids(struct vport *vport, const struct nlattr *ids) if (!nla_len(ids) || nla_len(ids) % sizeof(u32)) return -EINVAL; + if (nla_len(ids) / sizeof(u32) > nr_cpu_ids) + return -EINVAL; + old = ovsl_dereference(vport->upcall_portids); vport_portids = kmalloc(sizeof(*vport_portids) + nla_len(ids), From b94769eb2f30e61e86cd8551c084c34134290d89 Mon Sep 17 00:00:00 2001 From: Lorenzo Bianconi Date: Thu, 16 Apr 2026 12:30:12 +0200 Subject: [PATCH 060/148] net: airoha: Fix possible TX queue stall in airoha_qdma_tx_napi_poll() Since multiple net_device TX queues can share the same hw QDMA TX queue, there is no guarantee we have inflight packets queued in hw belonging to a net_device TX queue stopped in the xmit path because hw QDMA TX queue can be full. In this corner case the net_device TX queue will never be re-activated. In order to avoid any potential net_device TX queue stall, we need to wake all the net_device TX queues feeding the same hw QDMA TX queue in airoha_qdma_tx_napi_poll routine. 
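The underlying hazard is a lost wakeup: txq A stops because the shared hw ring is full, but later completions may all be for skbs that came from txq B, so a wake aimed only at each completed skb's own txq never reaches A. A hypothetical sketch of the wake-all completion pattern (struct hw_ring, ring_free_descs() and the field names are invented):

	static void hw_ring_tx_done(struct hw_ring *ring)
	{
		/* Room is back in the *shared* ring; any of the software
		 * txqs mapped onto it may be the stopped one, so wake
		 * them all instead of just the completed skb's txq.
		 */
		if (ring->stopped && ring_free_descs(ring) >= ring->wake_thr) {
			ring->stopped = false;
			netif_tx_wake_all_queues(ring->ndev);
		}
	}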
Fixes: 23020f0493270 ("net: airoha: Introduce ethernet support for EN7581 SoC") Signed-off-by: Lorenzo Bianconi Reviewed-by: Simon Horman Link: https://patch.msgid.link/20260416-airoha-txq-potential-stall-v2-1-42c732074540@kernel.org Signed-off-by: Jakub Kicinski --- drivers/net/ethernet/airoha/airoha_eth.c | 37 ++++++++++++++++++++---- drivers/net/ethernet/airoha/airoha_eth.h | 1 + 2 files changed, 33 insertions(+), 5 deletions(-) diff --git a/drivers/net/ethernet/airoha/airoha_eth.c b/drivers/net/ethernet/airoha/airoha_eth.c index e1ab15f1ee7d..19f67c7dd8e1 100644 --- a/drivers/net/ethernet/airoha/airoha_eth.c +++ b/drivers/net/ethernet/airoha/airoha_eth.c @@ -843,6 +843,21 @@ static int airoha_qdma_init_rx(struct airoha_qdma *qdma) return 0; } +static void airoha_qdma_wake_netdev_txqs(struct airoha_queue *q) +{ + struct airoha_qdma *qdma = q->qdma; + struct airoha_eth *eth = qdma->eth; + int i; + + for (i = 0; i < ARRAY_SIZE(eth->ports); i++) { + struct airoha_gdm_port *port = eth->ports[i]; + + if (port && port->qdma == qdma) + netif_tx_wake_all_queues(port->dev); + } + q->txq_stopped = false; +} + static int airoha_qdma_tx_napi_poll(struct napi_struct *napi, int budget) { struct airoha_tx_irq_queue *irq_q; @@ -919,12 +934,21 @@ static int airoha_qdma_tx_napi_poll(struct napi_struct *napi, int budget) txq = netdev_get_tx_queue(skb->dev, queue); netdev_tx_completed_queue(txq, 1, skb->len); - if (netif_tx_queue_stopped(txq) && - q->ndesc - q->queued >= q->free_thr) - netif_tx_wake_queue(txq); - dev_kfree_skb_any(skb); } + + if (q->txq_stopped && q->ndesc - q->queued >= q->free_thr) { + /* Since multiple net_device TX queues can share the + * same hw QDMA TX queue, there is no guarantee we have + * inflight packets queued in hw belonging to a + * net_device TX queue stopped in the xmit path. + * In order to avoid any potential net_device TX queue + * stall, we need to wake all the net_device TX queues + * feeding the same hw QDMA TX queue. + */ + airoha_qdma_wake_netdev_txqs(q); + } + unlock: spin_unlock_bh(&q->lock); } @@ -1984,6 +2008,7 @@ static netdev_tx_t airoha_dev_xmit(struct sk_buff *skb, if (q->queued + nr_frags >= q->ndesc) { /* not enough space in the queue */ netif_tx_stop_queue(txq); + q->txq_stopped = true; spin_unlock_bh(&q->lock); return NETDEV_TX_BUSY; } @@ -2039,8 +2064,10 @@ static netdev_tx_t airoha_dev_xmit(struct sk_buff *skb, TX_RING_CPU_IDX_MASK, FIELD_PREP(TX_RING_CPU_IDX_MASK, index)); - if (q->ndesc - q->queued < q->free_thr) + if (q->ndesc - q->queued < q->free_thr) { netif_tx_stop_queue(txq); + q->txq_stopped = true; + } spin_unlock_bh(&q->lock); diff --git a/drivers/net/ethernet/airoha/airoha_eth.h b/drivers/net/ethernet/airoha/airoha_eth.h index 95e557638617..87b328cfefb0 100644 --- a/drivers/net/ethernet/airoha/airoha_eth.h +++ b/drivers/net/ethernet/airoha/airoha_eth.h @@ -193,6 +193,7 @@ struct airoha_queue { int ndesc; int free_thr; int buf_size; + bool txq_stopped; struct napi_struct napi; struct page_pool *page_pool; From f6315295899415f1ddcf39f7c9cb46d25e2c6c6a Mon Sep 17 00:00:00 2001 From: Dexuan Cui Date: Thu, 16 Apr 2026 12:14:33 -0700 Subject: [PATCH 061/148] hv_sock: Report EOF instead of -EIO for FIN Commit f0c5827d07cb unluckily causes a regression for the FIN packet, and the final read syscall gets an error rather than 0. 
Ideally, we would want to fix hvs_channel_readable_payload() so that it could return 0 in the FIN scenario, but it's not good for the hv_sock driver to use the VMBus ringbuffer's cached priv_read_index, which is internal data in the VMBus driver. Fix the regression in hv_sock by returning 0 rather than -EIO. Fixes: f0c5827d07cb ("hv_sock: Return the readable bytes in hvs_stream_has_data()") Cc: stable@vger.kernel.org Reported-by: Ben Hillis Reported-by: Mitchell Levy Signed-off-by: Dexuan Cui Acked-by: Stefano Garzarella Link: https://patch.msgid.link/20260416191433.840637-1-decui@microsoft.com Signed-off-by: Jakub Kicinski --- net/vmw_vsock/hyperv_transport.c | 20 ++++++++++++++++---- 1 file changed, 16 insertions(+), 4 deletions(-) diff --git a/net/vmw_vsock/hyperv_transport.c b/net/vmw_vsock/hyperv_transport.c index 2b7c0b5896ed..76e78c83fdbc 100644 --- a/net/vmw_vsock/hyperv_transport.c +++ b/net/vmw_vsock/hyperv_transport.c @@ -694,7 +694,6 @@ static ssize_t hvs_stream_enqueue(struct vsock_sock *vsk, struct msghdr *msg, static s64 hvs_stream_has_data(struct vsock_sock *vsk) { struct hvsock *hvs = vsk->trans; - bool need_refill; s64 ret; if (hvs->recv_data_len > 0) @@ -702,9 +701,22 @@ static s64 hvs_stream_has_data(struct vsock_sock *vsk) switch (hvs_channel_readable_payload(hvs->chan)) { case 1: - need_refill = !hvs->recv_desc; - if (!need_refill) - return -EIO; + if (hvs->recv_desc) { + /* Here hvs->recv_data_len is 0, so hvs->recv_desc must + * be NULL unless it points to the 0-byte-payload FIN + * packet: see hvs_update_recv_data(). + * + * Here all the payload has been dequeued, but + * hvs_channel_readable_payload() still returns 1, + * because the VMBus ringbuffer's read_index is not + * updated for the FIN packet: hvs_stream_dequeue() -> + * hv_pkt_iter_next() updates the cached priv_read_index + * but has no opportunity to update the read_index in + * hv_pkt_iter_close() as hvs_stream_has_data() returns + * 0 for the FIN packet, so it won't get dequeued. + */ + return 0; + } hvs->recv_desc = hv_pkt_iter_first(hvs->chan); if (!hvs->recv_desc) From 5638504a2aa9e1b9d72af9060df1a160cce2d379 Mon Sep 17 00:00:00 2001 From: David Carlier Date: Fri, 17 Apr 2026 06:54:08 +0100 Subject: [PATCH 062/148] gtp: disable BH before calling udp_tunnel_xmit_skb() MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit gtp_genl_send_echo_req() runs as a generic netlink doit handler in process context with BH not disabled. It calls udp_tunnel_xmit_skb(), which eventually invokes iptunnel_xmit() — that uses __this_cpu_inc/dec on softnet_data.xmit.recursion to track the tunnel xmit recursion level. Without local_bh_disable(), the task may migrate between dev_xmit_recursion_inc() and dev_xmit_recursion_dec(), breaking the per-CPU counter pairing. The result is stale or negative recursion levels that can later produce false-positive SKB_DROP_REASON_RECURSION_LIMIT drops on either CPU. The other udp_tunnel_xmit_skb() call sites in gtp.c are unaffected: the data path runs under ndo_start_xmit and the echo response handlers run from the UDP encap rx softirq, both with BH already disabled. Fix it by disabling BH around the udp_tunnel_xmit_skb() call, mirroring commit 2cd7e6971fc2 ("sctp: disable BH before calling udp_tunnel_xmit_skb()"). 
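The per-CPU pairing that BH protection preserves can be shown with a generic sketch (mydev, do_transmit and xmit_depth are invented names; in the real path the inc/dec live inside iptunnel_xmit()):

	static DEFINE_PER_CPU(int, xmit_depth);

	static void xmit_one(struct mydev *d, struct sk_buff *skb)
	{
		local_bh_disable();		/* no migration from here on */
		__this_cpu_inc(xmit_depth);	/* CPU n */
		do_transmit(d, skb);
		__this_cpu_dec(xmit_depth);	/* still CPU n, back to 0 */
		local_bh_enable();
	}

Without the BH section, a migration between the inc and the dec leaves CPU n's counter stuck at 1 and drives another CPU's counter negative.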
Fixes: 6f1a9140ecda ("net: add xmit recursion limit to tunnel xmit functions") Cc: stable@vger.kernel.org Signed-off-by: David Carlier Link: https://patch.msgid.link/20260417055408.4667-1-devnexen@gmail.com Signed-off-by: Jakub Kicinski --- drivers/net/gtp.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/drivers/net/gtp.c b/drivers/net/gtp.c index 70b9e58b9b78..5150f2e4f66b 100644 --- a/drivers/net/gtp.c +++ b/drivers/net/gtp.c @@ -2400,6 +2400,7 @@ static int gtp_genl_send_echo_req(struct sk_buff *skb, struct genl_info *info) return -ENODEV; } + local_bh_disable(); udp_tunnel_xmit_skb(rt, sk, skb_to_send, fl4.saddr, fl4.daddr, inet_dscp_to_dsfield(fl4.flowi4_dscp), @@ -2409,6 +2410,7 @@ static int gtp_genl_send_echo_req(struct sk_buff *skb, struct genl_info *info) !net_eq(sock_net(sk), dev_net(gtp->dev)), false, 0); + local_bh_enable(); return 0; } From a663bac71a2f0b3ac6c373168ca57b2a6e6381aa Mon Sep 17 00:00:00 2001 From: Yuan Zhaoming Date: Fri, 17 Apr 2026 22:13:40 +0800 Subject: [PATCH 063/148] net: mctp: don't require received header reserved bits to be zero From the MCTP Base specification (DSP0236 v1.2.1), the first byte of the MCTP header contains a 4-bit reserved field and a 4-bit version field. On our current receive path, we require those 4 reserved bits to be zero, but the 9500-8i card is non-conformant, and may set these reserved bits. DSP0236 states that the reserved bits must be written as zero, and ignored when read. While the device might not conform to the former, we should accept these messages to conform to the latter. Relax our check on the MCTP version byte to allow non-zero bits in the reserved field. Fixes: 889b7da23abf ("mctp: Add initial routing framework") Signed-off-by: Yuan Zhaoming Cc: stable@vger.kernel.org Acked-by: Jeremy Kerr Link: https://patch.msgid.link/20260417141340.5306-1-yuanzhaoming901030@126.com Signed-off-by: Jakub Kicinski --- include/net/mctp.h | 3 +++ net/mctp/route.c | 8 ++++++-- 2 files changed, 9 insertions(+), 2 deletions(-) diff --git a/include/net/mctp.h b/include/net/mctp.h index e1e0a69afdce..d8bf9074110d 100644 --- a/include/net/mctp.h +++ b/include/net/mctp.h @@ -26,6 +26,9 @@ struct mctp_hdr { #define MCTP_VER_MIN 1 #define MCTP_VER_MAX 1 +/* Definitions for ver field */ +#define MCTP_HDR_VER_MASK GENMASK(3, 0) + /* Definitions for flags_seq_tag field */ #define MCTP_HDR_FLAG_SOM BIT(7) #define MCTP_HDR_FLAG_EOM BIT(6) diff --git a/net/mctp/route.c b/net/mctp/route.c index 26fb8c6bbad2..1f3dccbb7aed 100644 --- a/net/mctp/route.c +++ b/net/mctp/route.c @@ -441,6 +441,7 @@ static int mctp_dst_input(struct mctp_dst *dst, struct sk_buff *skb) unsigned long f; u8 tag, flags; int rc; + u8 ver; msk = NULL; rc = -EINVAL; @@ -467,7 +468,8 @@ static int mctp_dst_input(struct mctp_dst *dst, struct sk_buff *skb) netid = mctp_cb(skb)->net; skb_pull(skb, sizeof(struct mctp_hdr)); - if (mh->ver != 1) + ver = mh->ver & MCTP_HDR_VER_MASK; + if (ver < MCTP_VER_MIN || ver > MCTP_VER_MAX) goto out; flags = mh->flags_seq_tag & (MCTP_HDR_FLAG_SOM | MCTP_HDR_FLAG_EOM); @@ -1317,6 +1319,7 @@ static int mctp_pkttype_receive(struct sk_buff *skb, struct net_device *dev, struct mctp_dst dst; struct mctp_hdr *mh; int rc; + u8 ver; rcu_read_lock(); mdev = __mctp_dev_get(dev); @@ -1334,7 +1337,8 @@ static int mctp_pkttype_receive(struct sk_buff *skb, struct net_device *dev, /* We have enough for a header; decode and route */ mh = mctp_hdr(skb); - if (mh->ver < MCTP_VER_MIN || mh->ver > MCTP_VER_MAX) + ver = mh->ver & MCTP_HDR_VER_MASK; + if (ver < MCTP_VER_MIN 
|| ver > MCTP_VER_MAX) goto err_drop; /* source must be valid unicast or null; drop reserved ranges and From b336fdbb7103fb1484e1dcb6741151d4b5a41e35 Mon Sep 17 00:00:00 2001 From: Pablo Neira Ayuso Date: Tue, 14 Apr 2026 13:06:38 +0200 Subject: [PATCH 064/148] netfilter: nft_osf: restrict it to ipv4 This expression only supports for ipv4, restrict it. Fixes: b96af92d6eaf ("netfilter: nf_tables: implement Passive OS fingerprint module in nft_osf") Acked-by: Florian Westphal Reviewed-by: Fernando Fernandez Mancera Signed-off-by: Pablo Neira Ayuso --- net/netfilter/nft_osf.c | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/net/netfilter/nft_osf.c b/net/netfilter/nft_osf.c index 18003433476c..c02d5cb52143 100644 --- a/net/netfilter/nft_osf.c +++ b/net/netfilter/nft_osf.c @@ -28,6 +28,11 @@ static void nft_osf_eval(const struct nft_expr *expr, struct nft_regs *regs, struct nf_osf_data data; struct tcphdr _tcph; + if (nft_pf(pkt) != NFPROTO_IPV4) { + regs->verdict.code = NFT_BREAK; + return; + } + if (pkt->tprot != IPPROTO_TCP) { regs->verdict.code = NFT_BREAK; return; @@ -114,7 +119,6 @@ static int nft_osf_validate(const struct nft_ctx *ctx, switch (ctx->family) { case NFPROTO_IPV4: - case NFPROTO_IPV6: case NFPROTO_INET: hooks = (1 << NF_INET_LOCAL_IN) | (1 << NF_INET_PRE_ROUTING) | From 2195574dc6d9017d32ac346987e12659f931d932 Mon Sep 17 00:00:00 2001 From: Xiang Mei Date: Tue, 14 Apr 2026 15:14:01 -0700 Subject: [PATCH 065/148] netfilter: nfnetlink_osf: fix divide-by-zero in OSF_WSS_MODULO nf_osf_match_one() computes ctx->window % f->wss.val in the OSF_WSS_MODULO branch with no guard for f->wss.val == 0. A CAP_NET_ADMIN user can add such a fingerprint via nfnetlink; a subsequent matching TCP SYN divides by zero and panics the kernel. Reject the bogus fingerprint in nfnl_osf_add_callback() above the per-option for-loop. f->wss is per-fingerprint, not per-option, so the check must run regardless of f->opt_num (including 0). Also reject wss.wc >= OSF_WSS_MAX; nf_osf_match_one() already treats that as "should not happen". 
Crash: Oops: divide error: 0000 [#1] SMP KASAN NOPTI RIP: 0010:nf_osf_match_one (net/netfilter/nfnetlink_osf.c:98) Call Trace: nf_osf_match (net/netfilter/nfnetlink_osf.c:220) xt_osf_match_packet (net/netfilter/xt_osf.c:32) ipt_do_table (net/ipv4/netfilter/ip_tables.c:348) nf_hook_slow (net/netfilter/core.c:622) ip_local_deliver (net/ipv4/ip_input.c:265) ip_rcv (include/linux/skbuff.h:1162) __netif_receive_skb_one_core (net/core/dev.c:6181) process_backlog (net/core/dev.c:6642) __napi_poll (net/core/dev.c:7710) net_rx_action (net/core/dev.c:7945) handle_softirqs (kernel/softirq.c:622) Fixes: 11eeef41d5f6 ("netfilter: passive OS fingerprint xtables match") Reported-by: Weiming Shi Suggested-by: Florian Westphal Suggested-by: Pablo Neira Ayuso Signed-off-by: Xiang Mei Reviewed-by: Fernando Fernandez Mancera Signed-off-by: Pablo Neira Ayuso --- net/netfilter/nfnetlink_osf.c | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/net/netfilter/nfnetlink_osf.c b/net/netfilter/nfnetlink_osf.c index d64ce21c7b55..9de91fdd107c 100644 --- a/net/netfilter/nfnetlink_osf.c +++ b/net/netfilter/nfnetlink_osf.c @@ -320,6 +320,10 @@ static int nfnl_osf_add_callback(struct sk_buff *skb, if (f->opt_num > ARRAY_SIZE(f->opt)) return -EINVAL; + if (f->wss.wc >= OSF_WSS_MAX || + (f->wss.wc == OSF_WSS_MODULO && f->wss.val == 0)) + return -EINVAL; + for (i = 0; i < f->opt_num; i++) { if (!f->opt[i].length || f->opt[i].length > MAX_IPOPTLEN) return -EINVAL; From 6e7066bdb481a87fe88c4fa563e348c03b2d373d Mon Sep 17 00:00:00 2001 From: Florian Westphal Date: Tue, 14 Apr 2026 19:13:46 +0200 Subject: [PATCH 066/148] netfilter: conntrack: remove sprintf usage Replace it with scnprintf, the buffer sizes are expected to be large enough to hold the result, no need for snprintf+overflow check. Increase buffer size in mangle_content_len() while at it. BUG: KASAN: stack-out-of-bounds in vsnprintf+0xea5/0x1270 Write of size 1 at addr [..] vsnprintf+0xea5/0x1270 sprintf+0xb1/0xe0 mangle_content_len+0x1ac/0x280 nf_nat_sdp_session+0x1cc/0x240 process_sdp+0x8f8/0xb80 process_invite_request+0x108/0x2b0 process_sip_msg+0x5da/0xf50 sip_help_tcp+0x45e/0x780 nf_confirm+0x34d/0x990 [..] 
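The semantic difference that makes this safe, in a short sketch: snprintf() returns the length the output would have had, so a caller must compare against the buffer size to detect truncation, whereas scnprintf() returns the number of characters actually written:

	char buf[4];
	int n;

	n = snprintf(buf, sizeof(buf), "%u", 65535);	/* n == 5, buf = "655" (truncated) */
	n = scnprintf(buf, sizeof(buf), "%u", 65535);	/* n == 3, buf = "655" */

With buffers sized for the largest possible value, scnprintf() keeps the returned buflen consistent with the bytes actually placed in the buffer.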
Fixes: 9fafcd7b2032 ("[NETFILTER]: nf_conntrack/nf_nat: add SIP helper port") Reported-by: Yiming Qian Signed-off-by: Florian Westphal Signed-off-by: Pablo Neira Ayuso --- net/netfilter/nf_nat_amanda.c | 2 +- net/netfilter/nf_nat_sip.c | 33 ++++++++++++++++++--------------- 2 files changed, 19 insertions(+), 16 deletions(-) diff --git a/net/netfilter/nf_nat_amanda.c b/net/netfilter/nf_nat_amanda.c index 98deef6cde69..8f1054920a85 100644 --- a/net/netfilter/nf_nat_amanda.c +++ b/net/netfilter/nf_nat_amanda.c @@ -50,7 +50,7 @@ static unsigned int help(struct sk_buff *skb, return NF_DROP; } - sprintf(buffer, "%u", port); + snprintf(buffer, sizeof(buffer), "%u", port); if (!nf_nat_mangle_udp_packet(skb, exp->master, ctinfo, protoff, matchoff, matchlen, buffer, strlen(buffer))) { diff --git a/net/netfilter/nf_nat_sip.c b/net/netfilter/nf_nat_sip.c index cf4aeb299bde..c845b6d1a2bd 100644 --- a/net/netfilter/nf_nat_sip.c +++ b/net/netfilter/nf_nat_sip.c @@ -68,25 +68,27 @@ static unsigned int mangle_packet(struct sk_buff *skb, unsigned int protoff, } static int sip_sprintf_addr(const struct nf_conn *ct, char *buffer, + size_t size, const union nf_inet_addr *addr, bool delim) { if (nf_ct_l3num(ct) == NFPROTO_IPV4) - return sprintf(buffer, "%pI4", &addr->ip); + return scnprintf(buffer, size, "%pI4", &addr->ip); else { if (delim) - return sprintf(buffer, "[%pI6c]", &addr->ip6); + return scnprintf(buffer, size, "[%pI6c]", &addr->ip6); else - return sprintf(buffer, "%pI6c", &addr->ip6); + return scnprintf(buffer, size, "%pI6c", &addr->ip6); } } static int sip_sprintf_addr_port(const struct nf_conn *ct, char *buffer, + size_t size, const union nf_inet_addr *addr, u16 port) { if (nf_ct_l3num(ct) == NFPROTO_IPV4) - return sprintf(buffer, "%pI4:%u", &addr->ip, port); + return scnprintf(buffer, size, "%pI4:%u", &addr->ip, port); else - return sprintf(buffer, "[%pI6c]:%u", &addr->ip6, port); + return scnprintf(buffer, size, "[%pI6c]:%u", &addr->ip6, port); } static int map_addr(struct sk_buff *skb, unsigned int protoff, @@ -119,7 +121,7 @@ static int map_addr(struct sk_buff *skb, unsigned int protoff, if (nf_inet_addr_cmp(&newaddr, addr) && newport == port) return 1; - buflen = sip_sprintf_addr_port(ct, buffer, &newaddr, ntohs(newport)); + buflen = sip_sprintf_addr_port(ct, buffer, sizeof(buffer), &newaddr, ntohs(newport)); return mangle_packet(skb, protoff, dataoff, dptr, datalen, matchoff, matchlen, buffer, buflen); } @@ -212,7 +214,7 @@ static unsigned int nf_nat_sip(struct sk_buff *skb, unsigned int protoff, &addr, true) > 0 && nf_inet_addr_cmp(&addr, &ct->tuplehash[dir].tuple.src.u3) && !nf_inet_addr_cmp(&addr, &ct->tuplehash[!dir].tuple.dst.u3)) { - buflen = sip_sprintf_addr(ct, buffer, + buflen = sip_sprintf_addr(ct, buffer, sizeof(buffer), &ct->tuplehash[!dir].tuple.dst.u3, true); if (!mangle_packet(skb, protoff, dataoff, dptr, datalen, @@ -229,7 +231,7 @@ static unsigned int nf_nat_sip(struct sk_buff *skb, unsigned int protoff, &addr, false) > 0 && nf_inet_addr_cmp(&addr, &ct->tuplehash[dir].tuple.dst.u3) && !nf_inet_addr_cmp(&addr, &ct->tuplehash[!dir].tuple.src.u3)) { - buflen = sip_sprintf_addr(ct, buffer, + buflen = sip_sprintf_addr(ct, buffer, sizeof(buffer), &ct->tuplehash[!dir].tuple.src.u3, false); if (!mangle_packet(skb, protoff, dataoff, dptr, datalen, @@ -247,7 +249,7 @@ static unsigned int nf_nat_sip(struct sk_buff *skb, unsigned int protoff, htons(n) == ct->tuplehash[dir].tuple.dst.u.udp.port && htons(n) != ct->tuplehash[!dir].tuple.src.u.udp.port) { __be16 p = 
ct->tuplehash[!dir].tuple.src.u.udp.port; - buflen = sprintf(buffer, "%u", ntohs(p)); + buflen = scnprintf(buffer, sizeof(buffer), "%u", ntohs(p)); if (!mangle_packet(skb, protoff, dataoff, dptr, datalen, poff, plen, buffer, buflen)) { nf_ct_helper_log(skb, ct, "cannot mangle rport"); @@ -418,7 +420,8 @@ static unsigned int nf_nat_sip_expect(struct sk_buff *skb, unsigned int protoff, if (!nf_inet_addr_cmp(&exp->tuple.dst.u3, &exp->saved_addr) || exp->tuple.dst.u.udp.port != exp->saved_proto.udp.port) { - buflen = sip_sprintf_addr_port(ct, buffer, &newaddr, port); + buflen = sip_sprintf_addr_port(ct, buffer, sizeof(buffer), + &newaddr, port); if (!mangle_packet(skb, protoff, dataoff, dptr, datalen, matchoff, matchlen, buffer, buflen)) { nf_ct_helper_log(skb, ct, "cannot mangle packet"); @@ -438,8 +441,8 @@ static int mangle_content_len(struct sk_buff *skb, unsigned int protoff, { enum ip_conntrack_info ctinfo; struct nf_conn *ct = nf_ct_get(skb, &ctinfo); + char buffer[sizeof("4294967295")]; unsigned int matchoff, matchlen; - char buffer[sizeof("65536")]; int buflen, c_len; /* Get actual SDP length */ @@ -454,7 +457,7 @@ static int mangle_content_len(struct sk_buff *skb, unsigned int protoff, &matchoff, &matchlen) <= 0) return 0; - buflen = sprintf(buffer, "%u", c_len); + buflen = scnprintf(buffer, sizeof(buffer), "%u", c_len); return mangle_packet(skb, protoff, dataoff, dptr, datalen, matchoff, matchlen, buffer, buflen); } @@ -491,7 +494,7 @@ static unsigned int nf_nat_sdp_addr(struct sk_buff *skb, unsigned int protoff, char buffer[INET6_ADDRSTRLEN]; unsigned int buflen; - buflen = sip_sprintf_addr(ct, buffer, addr, false); + buflen = sip_sprintf_addr(ct, buffer, sizeof(buffer), addr, false); if (mangle_sdp_packet(skb, protoff, dataoff, dptr, datalen, sdpoff, type, term, buffer, buflen)) return 0; @@ -509,7 +512,7 @@ static unsigned int nf_nat_sdp_port(struct sk_buff *skb, unsigned int protoff, char buffer[sizeof("nnnnn")]; unsigned int buflen; - buflen = sprintf(buffer, "%u", port); + buflen = scnprintf(buffer, sizeof(buffer), "%u", port); if (!mangle_packet(skb, protoff, dataoff, dptr, datalen, matchoff, matchlen, buffer, buflen)) return 0; @@ -529,7 +532,7 @@ static unsigned int nf_nat_sdp_session(struct sk_buff *skb, unsigned int protoff unsigned int buflen; /* Mangle session description owner and contact addresses */ - buflen = sip_sprintf_addr(ct, buffer, addr, false); + buflen = sip_sprintf_addr(ct, buffer, sizeof(buffer), addr, false); if (mangle_sdp_packet(skb, protoff, dataoff, dptr, datalen, sdpoff, SDP_HDR_OWNER, SDP_HDR_MEDIA, buffer, buflen)) return 0; From b6fe26f86a1649f84e057f3f15605b08eda15497 Mon Sep 17 00:00:00 2001 From: Pablo Neira Ayuso Date: Wed, 15 Apr 2026 12:21:00 +0200 Subject: [PATCH 067/148] netfilter: xtables: restrict several matches to inet family This is a partial revert of: commit ab4f21e6fb1c ("netfilter: xtables: use NFPROTO_UNSPEC in more extensions") to allow ipv4 and ipv6 only. - xt_mac - xt_owner - xt_physdev These extensions are not used by ebtables in userspace. Moreover, xt_realm is only for ipv4, since dst->tclassid is ipv4 specific. 
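The hazard with NFPROTO_UNSPEC is that a match can then run in families whose skbs lack the state it dereferences. xt_realm is the clearest example; its match body is roughly the following (sketched from the existing code):

	static bool realm_mt(const struct sk_buff *skb, struct xt_action_param *par)
	{
		const struct xt_realm_info *info = par->matchinfo;
		const struct dst_entry *dst = skb_dst(skb);

		/* dst->tclassid (CONFIG_IP_ROUTE_CLASSID) is only filled
		 * in by the IPv4 routing code, hence the NFPROTO_IPV4-only
		 * registration.
		 */
		return (info->id == (dst->tclassid & info->mask)) ^ info->invert;
	}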
Fixes: ab4f21e6fb1c ("netfilter: xtables: use NFPROTO_UNSPEC in more extensions") Reported-by: "Kito Xu (veritas501)" Signed-off-by: Pablo Neira Ayuso --- net/netfilter/xt_mac.c | 34 +++++++++++++++++++++++----------- net/netfilter/xt_owner.c | 37 +++++++++++++++++++++++++------------ net/netfilter/xt_physdev.c | 29 +++++++++++++++++++---------- net/netfilter/xt_realm.c | 2 +- 4 files changed, 68 insertions(+), 34 deletions(-) diff --git a/net/netfilter/xt_mac.c b/net/netfilter/xt_mac.c index 4798cd2ca26e..7fc5156825e4 100644 --- a/net/netfilter/xt_mac.c +++ b/net/netfilter/xt_mac.c @@ -36,25 +36,37 @@ static bool mac_mt(const struct sk_buff *skb, struct xt_action_param *par) return ret; } -static struct xt_match mac_mt_reg __read_mostly = { - .name = "mac", - .revision = 0, - .family = NFPROTO_UNSPEC, - .match = mac_mt, - .matchsize = sizeof(struct xt_mac_info), - .hooks = (1 << NF_INET_PRE_ROUTING) | (1 << NF_INET_LOCAL_IN) | - (1 << NF_INET_FORWARD), - .me = THIS_MODULE, +static struct xt_match mac_mt_reg[] __read_mostly = { + { + .name = "mac", + .family = NFPROTO_IPV4, + .match = mac_mt, + .matchsize = sizeof(struct xt_mac_info), + .hooks = (1 << NF_INET_PRE_ROUTING) | + (1 << NF_INET_LOCAL_IN) | + (1 << NF_INET_FORWARD), + .me = THIS_MODULE, + }, + { + .name = "mac", + .family = NFPROTO_IPV6, + .match = mac_mt, + .matchsize = sizeof(struct xt_mac_info), + .hooks = (1 << NF_INET_PRE_ROUTING) | + (1 << NF_INET_LOCAL_IN) | + (1 << NF_INET_FORWARD), + .me = THIS_MODULE, + }, }; static int __init mac_mt_init(void) { - return xt_register_match(&mac_mt_reg); + return xt_register_matches(mac_mt_reg, ARRAY_SIZE(mac_mt_reg)); } static void __exit mac_mt_exit(void) { - xt_unregister_match(&mac_mt_reg); + xt_unregister_matches(mac_mt_reg, ARRAY_SIZE(mac_mt_reg)); } module_init(mac_mt_init); diff --git a/net/netfilter/xt_owner.c b/net/netfilter/xt_owner.c index 5bfb4843df66..8f2e57b2a586 100644 --- a/net/netfilter/xt_owner.c +++ b/net/netfilter/xt_owner.c @@ -127,26 +127,39 @@ owner_mt(const struct sk_buff *skb, struct xt_action_param *par) return true; } -static struct xt_match owner_mt_reg __read_mostly = { - .name = "owner", - .revision = 1, - .family = NFPROTO_UNSPEC, - .checkentry = owner_check, - .match = owner_mt, - .matchsize = sizeof(struct xt_owner_match_info), - .hooks = (1 << NF_INET_LOCAL_OUT) | - (1 << NF_INET_POST_ROUTING), - .me = THIS_MODULE, +static struct xt_match owner_mt_reg[] __read_mostly = { + { + .name = "owner", + .revision = 1, + .family = NFPROTO_IPV4, + .checkentry = owner_check, + .match = owner_mt, + .matchsize = sizeof(struct xt_owner_match_info), + .hooks = (1 << NF_INET_LOCAL_OUT) | + (1 << NF_INET_POST_ROUTING), + .me = THIS_MODULE, + }, + { + .name = "owner", + .revision = 1, + .family = NFPROTO_IPV6, + .checkentry = owner_check, + .match = owner_mt, + .matchsize = sizeof(struct xt_owner_match_info), + .hooks = (1 << NF_INET_LOCAL_OUT) | + (1 << NF_INET_POST_ROUTING), + .me = THIS_MODULE, + } }; static int __init owner_mt_init(void) { - return xt_register_match(&owner_mt_reg); + return xt_register_matches(owner_mt_reg, ARRAY_SIZE(owner_mt_reg)); } static void __exit owner_mt_exit(void) { - xt_unregister_match(&owner_mt_reg); + xt_unregister_matches(owner_mt_reg, ARRAY_SIZE(owner_mt_reg)); } module_init(owner_mt_init); diff --git a/net/netfilter/xt_physdev.c b/net/netfilter/xt_physdev.c index 53997771013f..d2b0b52434fa 100644 --- a/net/netfilter/xt_physdev.c +++ b/net/netfilter/xt_physdev.c @@ -137,24 +137,33 @@ static int physdev_mt_check(const struct 
xt_mtchk_param *par) return 0; } -static struct xt_match physdev_mt_reg __read_mostly = { - .name = "physdev", - .revision = 0, - .family = NFPROTO_UNSPEC, - .checkentry = physdev_mt_check, - .match = physdev_mt, - .matchsize = sizeof(struct xt_physdev_info), - .me = THIS_MODULE, +static struct xt_match physdev_mt_reg[] __read_mostly = { + { + .name = "physdev", + .family = NFPROTO_IPV4, + .checkentry = physdev_mt_check, + .match = physdev_mt, + .matchsize = sizeof(struct xt_physdev_info), + .me = THIS_MODULE, + }, + { + .name = "physdev", + .family = NFPROTO_IPV6, + .checkentry = physdev_mt_check, + .match = physdev_mt, + .matchsize = sizeof(struct xt_physdev_info), + .me = THIS_MODULE, + }, }; static int __init physdev_mt_init(void) { - return xt_register_match(&physdev_mt_reg); + return xt_register_matches(physdev_mt_reg, ARRAY_SIZE(physdev_mt_reg)); } static void __exit physdev_mt_exit(void) { - xt_unregister_match(&physdev_mt_reg); + xt_unregister_matches(physdev_mt_reg, ARRAY_SIZE(physdev_mt_reg)); } module_init(physdev_mt_init); diff --git a/net/netfilter/xt_realm.c b/net/netfilter/xt_realm.c index 6df485f4403d..61b2f1e58d15 100644 --- a/net/netfilter/xt_realm.c +++ b/net/netfilter/xt_realm.c @@ -33,7 +33,7 @@ static struct xt_match realm_mt_reg __read_mostly = { .matchsize = sizeof(struct xt_realm_info), .hooks = (1 << NF_INET_POST_ROUTING) | (1 << NF_INET_FORWARD) | (1 << NF_INET_LOCAL_OUT) | (1 << NF_INET_LOCAL_IN), - .family = NFPROTO_UNSPEC, + .family = NFPROTO_IPV4, .me = THIS_MODULE }; From 6eda0d771f94267f73f57c94630aa47e90957915 Mon Sep 17 00:00:00 2001 From: Pablo Neira Ayuso Date: Wed, 15 Apr 2026 17:29:45 +0200 Subject: [PATCH 068/148] netfilter: nat: use kfree_rcu to release ops Florian Westphal says: "Historically this is not an issue, even for normal base hooks: the data path doesn't use the original nf_hook_ops that are used to register the callbacks. However, in v5.14 I added the ability to dump the active netfilter hooks from userspace. This code will peek back into the nf_hook_ops that are available at the tail of the pointer-array blob used by the datapath. The nat hooks are special, because they are called indirectly from the central nat dispatcher hook. They are currently invisible to the nfnl hook dump subsystem though. But once that changes the nat ops structures have to be deferred too." Update nf_nat_register_fn() to deal with partial exposition of the hooks from error path which can be also an issue for nfnetlink_hook. 
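The lifetime rule being enforced, as a sketch (the dump_hooks() reader is hypothetical, standing in for a future nfnetlink_hook dump): once a lockless reader can reach the ops under rcu_read_lock(), the writer must unpublish first and defer the free past a grace period:

	/* reader side */
	rcu_read_lock();
	ops = rcu_dereference(nat_proto_net->nat_hook_ops);
	if (ops)
		dump_hooks(ops);
	rcu_read_unlock();

	/* writer side */
	nat_proto_net->nat_hook_ops = NULL;	/* unpublish */
	kfree_rcu(nat_ops, rcu);		/* free after readers drain */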
Fixes: e2cf17d3774c ("netfilter: add new hook nfnl subsystem") Signed-off-by: Pablo Neira Ayuso --- net/ipv4/netfilter/iptable_nat.c | 4 ++-- net/ipv6/netfilter/ip6table_nat.c | 4 ++-- net/netfilter/nf_nat_core.c | 10 ++++++---- 3 files changed, 10 insertions(+), 8 deletions(-) diff --git a/net/ipv4/netfilter/iptable_nat.c b/net/ipv4/netfilter/iptable_nat.c index a5db7c67d61b..625a1ca13b1b 100644 --- a/net/ipv4/netfilter/iptable_nat.c +++ b/net/ipv4/netfilter/iptable_nat.c @@ -79,7 +79,7 @@ static int ipt_nat_register_lookups(struct net *net) while (i) nf_nat_ipv4_unregister_fn(net, &ops[--i]); - kfree(ops); + kfree_rcu(ops, rcu); return ret; } } @@ -100,7 +100,7 @@ static void ipt_nat_unregister_lookups(struct net *net) for (i = 0; i < ARRAY_SIZE(nf_nat_ipv4_ops); i++) nf_nat_ipv4_unregister_fn(net, &ops[i]); - kfree(ops); + kfree_rcu(ops, rcu); } static int iptable_nat_table_init(struct net *net) diff --git a/net/ipv6/netfilter/ip6table_nat.c b/net/ipv6/netfilter/ip6table_nat.c index e119d4f090cc..5be723232df8 100644 --- a/net/ipv6/netfilter/ip6table_nat.c +++ b/net/ipv6/netfilter/ip6table_nat.c @@ -81,7 +81,7 @@ static int ip6t_nat_register_lookups(struct net *net) while (i) nf_nat_ipv6_unregister_fn(net, &ops[--i]); - kfree(ops); + kfree_rcu(ops, rcu); return ret; } } @@ -102,7 +102,7 @@ static void ip6t_nat_unregister_lookups(struct net *net) for (i = 0; i < ARRAY_SIZE(nf_nat_ipv6_ops); i++) nf_nat_ipv6_unregister_fn(net, &ops[i]); - kfree(ops); + kfree_rcu(ops, rcu); } static int ip6table_nat_table_init(struct net *net) diff --git a/net/netfilter/nf_nat_core.c b/net/netfilter/nf_nat_core.c index 83b2b5e9759a..74ec224ce0d6 100644 --- a/net/netfilter/nf_nat_core.c +++ b/net/netfilter/nf_nat_core.c @@ -1222,9 +1222,11 @@ int nf_nat_register_fn(struct net *net, u8 pf, const struct nf_hook_ops *ops, ret = nf_register_net_hooks(net, nat_ops, ops_count); if (ret < 0) { mutex_unlock(&nf_nat_proto_mutex); - for (i = 0; i < ops_count; i++) - kfree(nat_ops[i].priv); - kfree(nat_ops); + for (i = 0; i < ops_count; i++) { + priv = nat_ops[i].priv; + kfree_rcu(priv, rcu_head); + } + kfree_rcu(nat_ops, rcu); return ret; } @@ -1288,7 +1290,7 @@ void nf_nat_unregister_fn(struct net *net, u8 pf, const struct nf_hook_ops *ops, } nat_proto_net->nat_hook_ops = NULL; - kfree(nat_ops); + kfree_rcu(nat_ops, rcu); } unlock: mutex_unlock(&nf_nat_proto_mutex); From 67bf42cae41d847fd6e5749eb68278ca5d748b25 Mon Sep 17 00:00:00 2001 From: Yingnan Zhang <342144303@qq.com> Date: Wed, 15 Apr 2026 22:40:29 +0800 Subject: [PATCH 069/148] ipvs: fix MTU check for GSO packets in tunnel mode Currently, IPVS skips MTU checks for GSO packets by excluding them with the !skb_is_gso(skb) condition. This creates problems when IPVS tunnel mode encapsulates GSO packets with IPIP headers. The issue manifests in two ways: 1. MTU violation after encapsulation: When a GSO packet passes through IPVS tunnel mode, the original MTU check is bypassed. After adding the IPIP tunnel header, the packet size may exceed the outgoing interface MTU, leading to unexpected fragmentation at the IP layer. 2. Fragmentation with problematic IP IDs: When net.ipv4.vs.pmtu_disc=1 and a GSO packet with multiple segments is fragmented after encapsulation, each segment gets a sequentially incremented IP ID (0, 1, 2, ...). 
This happens because: a) The GSO packet bypasses MTU check and gets encapsulated b) At __ip_finish_output, the oversized GSO packet is split into separate SKBs (one per segment), with IP IDs incrementing c) Each SKB is then fragmented again based on the actual MTU This sequential IP ID allocation differs from the expected behavior and can cause issues with fragment reassembly and packet tracking. Fix this by properly validating GSO packets using skb_gso_validate_network_len(). This function correctly validates whether the GSO segments will fit within the MTU after segmentation. If validation fails, send an ICMP Fragmentation Needed message to enable proper PMTU discovery. Fixes: 4cdd34084d53 ("netfilter: nf_conntrack_ipv6: improve fragmentation handling") Signed-off-by: Yingnan Zhang <342144303@qq.com> Acked-by: Julian Anastasov Signed-off-by: Pablo Neira Ayuso --- net/netfilter/ipvs/ip_vs_xmit.c | 19 +++++++++++++++---- 1 file changed, 15 insertions(+), 4 deletions(-) diff --git a/net/netfilter/ipvs/ip_vs_xmit.c b/net/netfilter/ipvs/ip_vs_xmit.c index 0fb5162992e5..ce542ed4b013 100644 --- a/net/netfilter/ipvs/ip_vs_xmit.c +++ b/net/netfilter/ipvs/ip_vs_xmit.c @@ -102,6 +102,18 @@ __ip_vs_dst_check(struct ip_vs_dest *dest) return dest_dst; } +/* Based on ip_exceeds_mtu(). */ +static bool ip_vs_exceeds_mtu(const struct sk_buff *skb, unsigned int mtu) +{ + if (skb->len <= mtu) + return false; + + if (skb_is_gso(skb) && skb_gso_validate_network_len(skb, mtu)) + return false; + + return true; +} + static inline bool __mtu_check_toobig_v6(const struct sk_buff *skb, u32 mtu) { @@ -111,10 +123,9 @@ __mtu_check_toobig_v6(const struct sk_buff *skb, u32 mtu) */ if (IP6CB(skb)->frag_max_size > mtu) return true; /* largest fragment violate MTU */ - } - else if (skb->len > mtu && !skb_is_gso(skb)) { + } else if (ip_vs_exceeds_mtu(skb, mtu)) return true; /* Packet size violate MTU size */ - } + return false; } @@ -232,7 +243,7 @@ static inline bool ensure_mtu_is_adequate(struct netns_ipvs *ipvs, int skb_af, return true; if (unlikely(ip_hdr(skb)->frag_off & htons(IP_DF) && - skb->len > mtu && !skb_is_gso(skb) && + ip_vs_exceeds_mtu(skb, mtu) && !ip_vs_iph_icmp(ipvsh))) { icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED, htonl(mtu)); From f5ca450087c3baf3651055e7a6de92600f827af3 Mon Sep 17 00:00:00 2001 From: Fernando Fernandez Mancera Date: Fri, 17 Apr 2026 18:20:56 +0200 Subject: [PATCH 070/148] netfilter: nfnetlink_osf: fix out-of-bounds read on option matching In nf_osf_match(), the nf_osf_hdr_ctx structure is initialized once and passed by reference to nf_osf_match_one() for each fingerprint checked. During TCP option parsing, nf_osf_match_one() advances the shared ctx->optp pointer. If a fingerprint perfectly matches, the function returns early without restoring ctx->optp to its initial state. If the user has configured NF_OSF_LOGLEVEL_ALL, the loop continues to the next fingerprint. However, because ctx->optp was not restored, the next call to nf_osf_match_one() starts parsing from the end of the options buffer. This causes subsequent matches to read garbage data and fail immediately, making it impossible to log more than one match or logging incorrect matches. Instead of using a shared ctx->optp pointer, pass the context as a constant pointer and use a local pointer (optp) for TCP option traversal. This makes nf_osf_match_one() strictly stateless from the caller's perspective, ensuring every fingerprint check starts at the correct option offset. 
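The shape of the fix, as a hypothetical sketch (parse_ctx and the helper functions are invented names): traversal state lives in a local cursor copied out of a const context, so each call starts at the same offset no matter how the previous one returned:

	static bool match_one(const struct parse_ctx *ctx)
	{
		const u8 *optp = ctx->optp;	/* local cursor; ctx is never written */

		while (options_remain(ctx, optp))
			optp = next_option(optp);	/* advances the copy only */

		return option_state_matches(optp);
	}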
Fixes: 1a6a0951fc00 ("netfilter: nfnetlink_osf: add missing fmatch check") Suggested-by: Florian Westphal Signed-off-by: Fernando Fernandez Mancera Reviewed-by: Pablo Neira Ayuso Signed-off-by: Pablo Neira Ayuso --- net/netfilter/nfnetlink_osf.c | 19 ++++++++----------- 1 file changed, 8 insertions(+), 11 deletions(-) diff --git a/net/netfilter/nfnetlink_osf.c b/net/netfilter/nfnetlink_osf.c index 9de91fdd107c..9b209241029b 100644 --- a/net/netfilter/nfnetlink_osf.c +++ b/net/netfilter/nfnetlink_osf.c @@ -64,9 +64,9 @@ struct nf_osf_hdr_ctx { static bool nf_osf_match_one(const struct sk_buff *skb, const struct nf_osf_user_finger *f, int ttl_check, - struct nf_osf_hdr_ctx *ctx) + const struct nf_osf_hdr_ctx *ctx) { - const __u8 *optpinit = ctx->optp; + const __u8 *optp = ctx->optp; unsigned int check_WSS = 0; int fmatch = FMATCH_WRONG; int foptsize, optnum; @@ -95,17 +95,17 @@ static bool nf_osf_match_one(const struct sk_buff *skb, check_WSS = f->wss.wc; for (optnum = 0; optnum < f->opt_num; ++optnum) { - if (f->opt[optnum].kind == *ctx->optp) { + if (f->opt[optnum].kind == *optp) { __u32 len = f->opt[optnum].length; - const __u8 *optend = ctx->optp + len; + const __u8 *optend = optp + len; fmatch = FMATCH_OK; - switch (*ctx->optp) { + switch (*optp) { case OSFOPT_MSS: - mss = ctx->optp[3]; + mss = optp[3]; mss <<= 8; - mss |= ctx->optp[2]; + mss |= optp[2]; mss = ntohs((__force __be16)mss); break; @@ -113,7 +113,7 @@ static bool nf_osf_match_one(const struct sk_buff *skb, break; } - ctx->optp = optend; + optp = optend; } else fmatch = FMATCH_OPT_WRONG; @@ -156,9 +156,6 @@ static bool nf_osf_match_one(const struct sk_buff *skb, } } - if (fmatch != FMATCH_OK) - ctx->optp = optpinit; - return fmatch == FMATCH_OK; } From 711987ba281fd806322a7cd244e98e2a81903114 Mon Sep 17 00:00:00 2001 From: Fernando Fernandez Mancera Date: Fri, 17 Apr 2026 18:20:57 +0200 Subject: [PATCH 071/148] netfilter: nfnetlink_osf: fix potential NULL dereference in ttl check The nf_osf_ttl() function accessed skb->dev to perform a local interface address lookup without verifying that the device pointer was valid. Additionally, the implementation used an in_dev_for_each_ifa_rcu() loop to match the packet source address against local interface addresses, on the assumption that packets from the same subnet should not see a decrement of the initial TTL. A packet might appear to be from the same subnet when it actually is not, especially in modern environments with containers and virtual switching. Remove the device dereference and interface loop. Replace the logic with a switch statement that evaluates the TTL according to ttl_check. 
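The resulting per-mode semantics, summarised from the new switch statement:

	NF_OSF_TTL_TRUE:	match only if ip->ttl == fingerprint ttl
	NF_OSF_TTL_NOCHECK:	always match
	NF_OSF_TTL_LESS:	match if ip->ttl <= fingerprint ttl
	(unknown values):	fall through to the NF_OSF_TTL_LESS behaviour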
Fixes: 11eeef41d5f6 ("netfilter: passive OS fingerprint xtables match") Reported-by: Kito Xu (veritas501) Closes: https://lore.kernel.org/netfilter-devel/20260414074556.2512750-1-hxzene@gmail.com/ Signed-off-by: Fernando Fernandez Mancera Reviewed-by: Pablo Neira Ayuso Signed-off-by: Pablo Neira Ayuso --- net/netfilter/nfnetlink_osf.c | 22 +++++++--------------- 1 file changed, 7 insertions(+), 15 deletions(-) diff --git a/net/netfilter/nfnetlink_osf.c b/net/netfilter/nfnetlink_osf.c index 9b209241029b..acb753ec5697 100644 --- a/net/netfilter/nfnetlink_osf.c +++ b/net/netfilter/nfnetlink_osf.c @@ -31,26 +31,18 @@ EXPORT_SYMBOL_GPL(nf_osf_fingers); static inline int nf_osf_ttl(const struct sk_buff *skb, int ttl_check, unsigned char f_ttl) { - struct in_device *in_dev = __in_dev_get_rcu(skb->dev); const struct iphdr *ip = ip_hdr(skb); - const struct in_ifaddr *ifa; - int ret = 0; - if (ttl_check == NF_OSF_TTL_TRUE) + switch (ttl_check) { + case NF_OSF_TTL_TRUE: return ip->ttl == f_ttl; - if (ttl_check == NF_OSF_TTL_NOCHECK) + break; + case NF_OSF_TTL_NOCHECK: return 1; - else if (ip->ttl <= f_ttl) - return 1; - - in_dev_for_each_ifa_rcu(ifa, in_dev) { - if (inet_ifa_match(ip->saddr, ifa)) { - ret = (ip->ttl == f_ttl); - break; - } + case NF_OSF_TTL_LESS: + default: + return ip->ttl <= f_ttl; } - - return ret; } struct nf_osf_hdr_ctx { From e76607442d5b73e1ba6768f501ef815bb58c2c0e Mon Sep 17 00:00:00 2001 From: Weiming Shi Date: Thu, 16 Apr 2026 04:41:31 +0800 Subject: [PATCH 072/148] slip: reject VJ receive packets on instances with no rstate array slhc_init() accepts rslots == 0 as a valid configuration, with the documented meaning of 'no receive compression'. In that case the allocation loop in slhc_init() is skipped, so comp->rstate stays NULL and comp->rslot_limit stays 0 (from the kzalloc of struct slcompress). The receive helpers do not defend against that configuration. slhc_uncompress() dereferences comp->rstate[x] when the VJ header carries an explicit connection ID, and slhc_remember() later assigns cs = &comp->rstate[...] after only comparing the packet's slot number to comp->rslot_limit. Because rslot_limit is 0, slot 0 passes the range check, and the code dereferences a NULL rstate. The configuration is reachable in-tree through PPP. PPPIOCSMAXCID stores its argument in a signed int, and (val >> 16) uses arithmetic shift. Passing 0xffff0000 therefore sign-extends to -1, so val2 + 1 is 0 and ppp_generic.c ends up calling slhc_init(0, 1). Because /dev/ppp open is gated by ns_capable(CAP_NET_ADMIN), the whole path is reachable from an unprivileged user namespace. Once the malformed VJ state is installed, any inbound VJ-compressed or VJ-uncompressed frame that selects slot 0 crashes the kernel in softirq context: Oops: general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] SMP KASAN NOPTI KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007] RIP: 0010:slhc_uncompress (drivers/net/slip/slhc.c:519) Call Trace: ppp_receive_nonmp_frame (drivers/net/ppp/ppp_generic.c:2466) ppp_input (drivers/net/ppp/ppp_generic.c:2359) ppp_async_process (drivers/net/ppp/ppp_async.c:492) tasklet_action_common (kernel/softirq.c:926) handle_softirqs (kernel/softirq.c:623) run_ksoftirqd (kernel/softirq.c:1055) smpboot_thread_fn (kernel/smpboot.c:160) kthread (kernel/kthread.c:436) ret_from_fork (arch/x86/kernel/process.c:164) Reject the receive side on such instances instead of touching rstate. 
slhc_uncompress() falls through to its existing 'bad' label, which bumps sls_i_error and enters the toss state. slhc_remember() mirrors that with an explicit sls_i_error increment followed by slhc_toss(); the sls_i_runt counter is not used here because a missing rstate is an internal configuration state, not a runt packet. The transmit path is unaffected: the only in-tree caller that picks rslots from userspace (ppp_generic.c) still supplies tslots >= 1, and slip.c always calls slhc_init(16, 16), so comp->tstate remains valid and slhc_compress() continues to work. Fixes: 4ab42d78e37a ("ppp, slip: Validate VJ compression slot parameters completely") Reported-by: Xiang Mei Signed-off-by: Weiming Shi Reviewed-by: Simon Horman Link: https://patch.msgid.link/20260415204130.258866-2-bestswngs@gmail.com Signed-off-by: Paolo Abeni --- drivers/net/slip/slhc.c | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/drivers/net/slip/slhc.c b/drivers/net/slip/slhc.c index e3c785da3eef..e18a4213d10c 100644 --- a/drivers/net/slip/slhc.c +++ b/drivers/net/slip/slhc.c @@ -506,6 +506,8 @@ slhc_uncompress(struct slcompress *comp, unsigned char *icp, int isize) comp->sls_i_error++; return 0; } + if (!comp->rstate) + goto bad; changes = *cp++; if(changes & NEW_C){ /* Make sure the state index is in range, then grab the state. @@ -649,6 +651,10 @@ slhc_remember(struct slcompress *comp, unsigned char *icp, int isize) struct cstate *cs; unsigned int ihl; + if (!comp->rstate) { + comp->sls_i_error++; + return slhc_toss(comp); + } /* The packet is shorter than a legal IP header. * Also make sure isize is positive. */ From 4c1367a2d7aad643a6f87c6931b13cc1a25e8ca7 Mon Sep 17 00:00:00 2001 From: Weiming Shi Date: Thu, 16 Apr 2026 18:01:51 +0800 Subject: [PATCH 073/148] slip: bound decode() reads against the compressed packet length slhc_uncompress() parses a VJ-compressed TCP header by advancing a pointer through the packet via decode() and pull16(). Neither helper bounds-checks against isize, and decode() masks its return with & 0xffff so it can never return the -1 that callers test for -- those error paths are dead code. A short compressed frame whose change byte requests optional fields lets decode() read past the end of the packet. The over-read bytes are folded into the cached cstate and reflected into subsequent reconstructed packets. Make decode() and pull16() take the packet end pointer and return -1 when exhausted. Add a bounds check before the TCP-checksum read. The existing == -1 tests now do what they were always meant to. 
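A worked example of the over-read, assuming a minimal compressed frame with isize == 3: one change byte with only NEW_W set, followed by the two TCP-checksum bytes. When decode() is then called for the window delta, the cursor already equals the end of the packet:

	/* old: decode() reads *cp one byte past the frame; if that byte
	 *      happens to be 0, pull16() reads two more out-of-bounds
	 *      bytes, and the & 0xffff mask guaranteed the -1 error path
	 *      could never trigger.
	 * new: cp >= end in decode() (or *cpp + 2 > end in pull16())
	 *      returns -1, so the caller's existing
	 *		if ((x = decode(&cp, end)) == -1) goto bad;
	 *      finally takes the bad path.
	 */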
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Reported-by: Simon Horman Closes: https://lore.kernel.org/netdev/20260414134126.758795-2-horms@kernel.org/ Signed-off-by: Weiming Shi Reviewed-by: Simon Horman Link: https://patch.msgid.link/20260416100147.531855-5-bestswngs@gmail.com Signed-off-by: Paolo Abeni --- drivers/net/slip/slhc.c | 43 ++++++++++++++++++++++++----------------- 1 file changed, 25 insertions(+), 18 deletions(-) diff --git a/drivers/net/slip/slhc.c b/drivers/net/slip/slhc.c index e18a4213d10c..1a9b27d5e256 100644 --- a/drivers/net/slip/slhc.c +++ b/drivers/net/slip/slhc.c @@ -80,9 +80,9 @@ #include static unsigned char *encode(unsigned char *cp, unsigned short n); -static long decode(unsigned char **cpp); +static long decode(unsigned char **cpp, const unsigned char *end); static unsigned char * put16(unsigned char *cp, unsigned short x); -static unsigned short pull16(unsigned char **cpp); +static long pull16(unsigned char **cpp, const unsigned char *end); /* Allocate compression data structure * slots must be in range 0 to 255 (zero meaning no compression) @@ -190,30 +190,34 @@ encode(unsigned char *cp, unsigned short n) return cp; } -/* Pull a 16-bit integer in host order from buffer in network byte order */ -static unsigned short -pull16(unsigned char **cpp) +/* Pull a 16-bit integer in host order from buffer in network byte order. + * Returns -1 if the buffer is exhausted, otherwise the 16-bit value. + */ +static long +pull16(unsigned char **cpp, const unsigned char *end) { - short rval; + long rval; + if (*cpp + 2 > end) + return -1; rval = *(*cpp)++; rval <<= 8; rval |= *(*cpp)++; return rval; } -/* Decode a number */ +/* Decode a number. Returns -1 if the buffer is exhausted. */ static long -decode(unsigned char **cpp) +decode(unsigned char **cpp, const unsigned char *end) { int x; + if (*cpp >= end) + return -1; x = *(*cpp)++; - if(x == 0){ - return pull16(cpp) & 0xffff; /* pull16 returns -1 on error */ - } else { - return x & 0xff; /* -1 if PULLCHAR returned error */ - } + if (x == 0) + return pull16(cpp, end); + return x & 0xff; } /* @@ -499,6 +503,7 @@ slhc_uncompress(struct slcompress *comp, unsigned char *icp, int isize) struct cstate *cs; int len, hdrlen; unsigned char *cp = icp; + const unsigned char *end = icp + isize; /* We've got a compressed packet; read the change byte */ comp->sls_i_compressed++; @@ -536,6 +541,8 @@ slhc_uncompress(struct slcompress *comp, unsigned char *icp, int isize) thp = &cs->cs_tcp; ip = &cs->cs_ip; + if (cp + 2 > end) + goto bad; thp->check = *(__sum16 *)cp; cp += 2; @@ -566,26 +573,26 @@ slhc_uncompress(struct slcompress *comp, unsigned char *icp, int isize) default: if(changes & NEW_U){ thp->urg = 1; - if((x = decode(&cp)) == -1) { + if((x = decode(&cp, end)) == -1) { goto bad; } thp->urg_ptr = htons(x); } else thp->urg = 0; if(changes & NEW_W){ - if((x = decode(&cp)) == -1) { + if((x = decode(&cp, end)) == -1) { goto bad; } thp->window = htons( ntohs(thp->window) + x); } if(changes & NEW_A){ - if((x = decode(&cp)) == -1) { + if((x = decode(&cp, end)) == -1) { goto bad; } thp->ack_seq = htonl( ntohl(thp->ack_seq) + x); } if(changes & NEW_S){ - if((x = decode(&cp)) == -1) { + if((x = decode(&cp, end)) == -1) { goto bad; } thp->seq = htonl( ntohl(thp->seq) + x); @@ -593,7 +600,7 @@ slhc_uncompress(struct slcompress *comp, unsigned char *icp, int isize) break; } if(changes & NEW_I){ - if((x = decode(&cp)) == -1) { + if((x = decode(&cp, end)) == -1) { goto bad; } ip->id = htons (ntohs (ip->id) + x); From 
db9e726525e45dbd713c07897a4d20bc18333ccc Mon Sep 17 00:00:00 2001 From: Stanislav Fomichev Date: Thu, 16 Apr 2026 11:56:58 -0700 Subject: [PATCH 074/148] net: add address list snapshot and reconciliation infrastructure Introduce __hw_addr_list_snapshot() and __hw_addr_list_reconcile() for use by the upcoming ndo_set_rx_mode_async callback. The async rx_mode path needs to snapshot the device's unicast and multicast address lists under the addr_lock, hand those snapshots to the driver (which may sleep), and then propagate any sync_cnt changes back to the real lists. Two identical snapshots are taken: a work copy for the driver to pass to __hw_addr_sync_dev() and a reference copy to compute deltas against. __hw_addr_list_reconcile() walks the reference snapshot comparing each entry against the work snapshot to determine what the driver synced or unsynced. It then applies those deltas to the real list, handling concurrent modifications: - If the real entry was concurrently removed but the driver synced it to hardware (delta > 0), re-insert a stale entry so the next work run properly unsyncs it from hardware. - If the entry still exists, apply the delta normally. An entry whose refcount drops to zero is removed. # dev_addr_test_snapshot_benchmark: 1024 addrs x 1000 snapshots: 89872802 ns total, 89872 ns/iter # dev_addr_test_snapshot_benchmark.speed: slow Reviewed-by: Aleksandr Loktionov Signed-off-by: Stanislav Fomichev Link: https://patch.msgid.link/20260416185712.2155425-2-sdf@fomichev.me Signed-off-by: Paolo Abeni --- include/linux/netdevice.h | 7 + net/core/dev.h | 1 + net/core/dev_addr_lists.c | 109 +++++++++- net/core/dev_addr_lists_test.c | 363 ++++++++++++++++++++++++++++++++- 4 files changed, 477 insertions(+), 3 deletions(-) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 7969fcdd5ac4..a84c55488b8c 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -5004,6 +5004,13 @@ void __hw_addr_unsync_dev(struct netdev_hw_addr_list *list, int (*unsync)(struct net_device *, const unsigned char *)); void __hw_addr_init(struct netdev_hw_addr_list *list); +void __hw_addr_flush(struct netdev_hw_addr_list *list); +int __hw_addr_list_snapshot(struct netdev_hw_addr_list *snap, + const struct netdev_hw_addr_list *list, + int addr_len); +void __hw_addr_list_reconcile(struct netdev_hw_addr_list *real_list, + struct netdev_hw_addr_list *work, + struct netdev_hw_addr_list *ref, int addr_len); /* Functions used for device addresses handling */ void dev_addr_mod(struct net_device *dev, unsigned int offset, diff --git a/net/core/dev.h b/net/core/dev.h index 628bdaebf0ca..585b6d7e88df 100644 --- a/net/core/dev.h +++ b/net/core/dev.h @@ -78,6 +78,7 @@ void linkwatch_run_queue(void); void dev_addr_flush(struct net_device *dev); int dev_addr_init(struct net_device *dev); void dev_addr_check(struct net_device *dev); +void __hw_addr_flush(struct netdev_hw_addr_list *list); #if IS_ENABLED(CONFIG_NET_SHAPER) void net_shaper_flush_netdev(struct net_device *dev); diff --git a/net/core/dev_addr_lists.c b/net/core/dev_addr_lists.c index 76c91f224886..bb4851bc55ce 100644 --- a/net/core/dev_addr_lists.c +++ b/net/core/dev_addr_lists.c @@ -11,6 +11,7 @@ #include #include #include +#include #include "dev.h" @@ -481,7 +482,7 @@ void __hw_addr_unsync_dev(struct netdev_hw_addr_list *list, } EXPORT_SYMBOL(__hw_addr_unsync_dev); -static void __hw_addr_flush(struct netdev_hw_addr_list *list) +void __hw_addr_flush(struct netdev_hw_addr_list *list) { struct netdev_hw_addr *ha, *tmp; @@ 
-492,6 +493,7 @@ static void __hw_addr_flush(struct netdev_hw_addr_list *list) } list->count = 0; } +EXPORT_SYMBOL_IF_KUNIT(__hw_addr_flush); void __hw_addr_init(struct netdev_hw_addr_list *list) { @@ -501,6 +503,111 @@ void __hw_addr_init(struct netdev_hw_addr_list *list) } EXPORT_SYMBOL(__hw_addr_init); +/** + * __hw_addr_list_snapshot - create a snapshot copy of an address list + * @snap: destination snapshot list (needs to be __hw_addr_init-initialized) + * @list: source address list to snapshot + * @addr_len: length of addresses + * + * Creates a copy of @list with individually allocated entries suitable + * for use with __hw_addr_sync_dev() and other list manipulation helpers. + * Each entry is allocated with GFP_ATOMIC; must be called under a spinlock. + * + * Return: 0 on success, -errno on failure. + */ +int __hw_addr_list_snapshot(struct netdev_hw_addr_list *snap, + const struct netdev_hw_addr_list *list, + int addr_len) +{ + struct netdev_hw_addr *ha, *entry; + + list_for_each_entry(ha, &list->list, list) { + entry = __hw_addr_create(ha->addr, addr_len, ha->type, + false, false); + if (!entry) { + __hw_addr_flush(snap); + return -ENOMEM; + } + entry->sync_cnt = ha->sync_cnt; + entry->refcount = ha->refcount; + + list_add_tail(&entry->list, &snap->list); + __hw_addr_insert(snap, entry, addr_len); + snap->count++; + } + + return 0; +} +EXPORT_SYMBOL_IF_KUNIT(__hw_addr_list_snapshot); + +/** + * __hw_addr_list_reconcile - sync snapshot changes back and free snapshots + * @real_list: the real address list to update + * @work: the working snapshot (modified by driver via __hw_addr_sync_dev) + * @ref: the reference snapshot (untouched copy of original state) + * @addr_len: length of addresses + * + * Walks the reference snapshot and compares each entry against the work + * snapshot to compute sync_cnt deltas. Applies those deltas to @real_list. + * Frees both snapshots when done. + * Caller must hold netif_addr_lock_bh. + */ +void __hw_addr_list_reconcile(struct netdev_hw_addr_list *real_list, + struct netdev_hw_addr_list *work, + struct netdev_hw_addr_list *ref, int addr_len) +{ + struct netdev_hw_addr *ref_ha, *tmp, *work_ha, *real_ha; + int delta; + + list_for_each_entry_safe(ref_ha, tmp, &ref->list, list) { + work_ha = __hw_addr_lookup(work, ref_ha->addr, addr_len, + ref_ha->type); + if (work_ha) + delta = work_ha->sync_cnt - ref_ha->sync_cnt; + else + delta = -1; + + if (delta == 0) + continue; + + real_ha = __hw_addr_lookup(real_list, ref_ha->addr, addr_len, + ref_ha->type); + if (!real_ha) { + /* The real entry was concurrently removed. If the + * driver synced this addr to hardware (delta > 0), + * re-insert it as a stale entry so the next work + * run unsyncs it from hardware. 
+ */ + if (delta > 0) { + rb_erase(&ref_ha->node, &ref->tree); + list_del(&ref_ha->list); + ref->count--; + ref_ha->sync_cnt = delta; + ref_ha->refcount = delta; + list_add_tail_rcu(&ref_ha->list, + &real_list->list); + __hw_addr_insert(real_list, ref_ha, + addr_len); + real_list->count++; + } + continue; + } + + real_ha->sync_cnt += delta; + real_ha->refcount += delta; + if (!real_ha->refcount) { + rb_erase(&real_ha->node, &real_list->tree); + list_del_rcu(&real_ha->list); + kfree_rcu(real_ha, rcu_head); + real_list->count--; + } + } + + __hw_addr_flush(work); + __hw_addr_flush(ref); +} +EXPORT_SYMBOL_IF_KUNIT(__hw_addr_list_reconcile); + /* * Device addresses handling functions */ diff --git a/net/core/dev_addr_lists_test.c b/net/core/dev_addr_lists_test.c index 8e1dba825e94..fba926d5ec0d 100644 --- a/net/core/dev_addr_lists_test.c +++ b/net/core/dev_addr_lists_test.c @@ -2,22 +2,31 @@ #include #include +#include #include #include static const struct net_device_ops dummy_netdev_ops = { }; +#define ADDR_A 1 +#define ADDR_B 2 +#define ADDR_C 3 + struct dev_addr_test_priv { u32 addr_seen; + u32 addr_synced; + u32 addr_unsynced; }; static int dev_addr_test_sync(struct net_device *netdev, const unsigned char *a) { struct dev_addr_test_priv *datp = netdev_priv(netdev); - if (a[0] < 31 && !memchr_inv(a, a[0], ETH_ALEN)) + if (a[0] < 31 && !memchr_inv(a, a[0], ETH_ALEN)) { datp->addr_seen |= 1 << a[0]; + datp->addr_synced |= 1 << a[0]; + } return 0; } @@ -26,11 +35,22 @@ static int dev_addr_test_unsync(struct net_device *netdev, { struct dev_addr_test_priv *datp = netdev_priv(netdev); - if (a[0] < 31 && !memchr_inv(a, a[0], ETH_ALEN)) + if (a[0] < 31 && !memchr_inv(a, a[0], ETH_ALEN)) { datp->addr_seen &= ~(1 << a[0]); + datp->addr_unsynced |= 1 << a[0]; + } return 0; } +static void dev_addr_test_reset(struct net_device *netdev) +{ + struct dev_addr_test_priv *datp = netdev_priv(netdev); + + datp->addr_seen = 0; + datp->addr_synced = 0; + datp->addr_unsynced = 0; +} + static int dev_addr_test_init(struct kunit *test) { struct dev_addr_test_priv *datp; @@ -225,6 +245,339 @@ static void dev_addr_test_add_excl(struct kunit *test) rtnl_unlock(); } +/* Snapshot test: basic sync with no concurrent modifications. + * Add one address, snapshot, driver syncs it, reconcile propagates + * sync_cnt delta back to real list. 
+ */ +static void dev_addr_test_snapshot_sync(struct kunit *test) +{ + struct net_device *netdev = test->priv; + struct netdev_hw_addr_list snap, ref; + struct dev_addr_test_priv *datp; + struct netdev_hw_addr *ha; + u8 addr[ETH_ALEN]; + + datp = netdev_priv(netdev); + + rtnl_lock(); + + memset(addr, ADDR_A, sizeof(addr)); + KUNIT_EXPECT_EQ(test, 0, dev_uc_add(netdev, addr)); + + /* Snapshot: ADDR_A has sync_cnt=0, refcount=1 (new) */ + netif_addr_lock_bh(netdev); + __hw_addr_init(&snap); + __hw_addr_init(&ref); + KUNIT_EXPECT_EQ(test, 0, + __hw_addr_list_snapshot(&snap, &netdev->uc, ETH_ALEN)); + KUNIT_EXPECT_EQ(test, 0, + __hw_addr_list_snapshot(&ref, &netdev->uc, ETH_ALEN)); + netif_addr_unlock_bh(netdev); + + /* Driver syncs ADDR_A to hardware */ + dev_addr_test_reset(netdev); + __hw_addr_sync_dev(&snap, netdev, dev_addr_test_sync, + dev_addr_test_unsync); + KUNIT_EXPECT_EQ(test, 1 << ADDR_A, datp->addr_synced); + KUNIT_EXPECT_EQ(test, 0, datp->addr_unsynced); + + /* Reconcile: delta=+1 applied to real entry */ + netif_addr_lock_bh(netdev); + __hw_addr_list_reconcile(&netdev->uc, &snap, &ref, ETH_ALEN); + netif_addr_unlock_bh(netdev); + + /* Real entry should now reflect the sync: sync_cnt=1, refcount=2 */ + KUNIT_EXPECT_EQ(test, 1, netdev->uc.count); + ha = list_first_entry(&netdev->uc.list, struct netdev_hw_addr, list); + KUNIT_EXPECT_MEMEQ(test, ha->addr, addr, ETH_ALEN); + KUNIT_EXPECT_EQ(test, 1, ha->sync_cnt); + KUNIT_EXPECT_EQ(test, 2, ha->refcount); + + /* Second work run: already synced, nothing to do */ + dev_addr_test_reset(netdev); + __hw_addr_sync_dev(&netdev->uc, netdev, dev_addr_test_sync, + dev_addr_test_unsync); + KUNIT_EXPECT_EQ(test, 0, datp->addr_synced); + KUNIT_EXPECT_EQ(test, 0, datp->addr_unsynced); + KUNIT_EXPECT_EQ(test, 1, netdev->uc.count); + + rtnl_unlock(); +} + +/* Snapshot test: ADDR_A synced to hardware, then concurrently removed + * from the real list before reconcile runs. Reconcile re-inserts ADDR_A as + * a stale entry so the next work run unsyncs it from hardware. + */ +static void dev_addr_test_snapshot_remove_during_sync(struct kunit *test) +{ + struct net_device *netdev = test->priv; + struct netdev_hw_addr_list snap, ref; + struct dev_addr_test_priv *datp; + struct netdev_hw_addr *ha; + u8 addr[ETH_ALEN]; + + datp = netdev_priv(netdev); + + rtnl_lock(); + + memset(addr, ADDR_A, sizeof(addr)); + KUNIT_EXPECT_EQ(test, 0, dev_uc_add(netdev, addr)); + + /* Snapshot: ADDR_A is new (sync_cnt=0, refcount=1) */ + netif_addr_lock_bh(netdev); + __hw_addr_init(&snap); + __hw_addr_init(&ref); + KUNIT_EXPECT_EQ(test, 0, + __hw_addr_list_snapshot(&snap, &netdev->uc, ETH_ALEN)); + KUNIT_EXPECT_EQ(test, 0, + __hw_addr_list_snapshot(&ref, &netdev->uc, ETH_ALEN)); + netif_addr_unlock_bh(netdev); + + /* Driver syncs ADDR_A to hardware */ + dev_addr_test_reset(netdev); + __hw_addr_sync_dev(&snap, netdev, dev_addr_test_sync, + dev_addr_test_unsync); + KUNIT_EXPECT_EQ(test, 1 << ADDR_A, datp->addr_synced); + KUNIT_EXPECT_EQ(test, 0, datp->addr_unsynced); + + /* Concurrent removal: user deletes ADDR_A while driver was working */ + memset(addr, ADDR_A, sizeof(addr)); + KUNIT_EXPECT_EQ(test, 0, dev_uc_del(netdev, addr)); + KUNIT_EXPECT_EQ(test, 0, netdev->uc.count); + + /* Reconcile: ADDR_A gone from real list but driver synced it, + * so it gets re-inserted as stale (sync_cnt=1, refcount=1). 
+ */ + netif_addr_lock_bh(netdev); + __hw_addr_list_reconcile(&netdev->uc, &snap, &ref, ETH_ALEN); + netif_addr_unlock_bh(netdev); + + KUNIT_EXPECT_EQ(test, 1, netdev->uc.count); + ha = list_first_entry(&netdev->uc.list, struct netdev_hw_addr, list); + KUNIT_EXPECT_MEMEQ(test, ha->addr, addr, ETH_ALEN); + KUNIT_EXPECT_EQ(test, 1, ha->sync_cnt); + KUNIT_EXPECT_EQ(test, 1, ha->refcount); + + /* Second work run: stale entry gets unsynced from HW and removed */ + dev_addr_test_reset(netdev); + __hw_addr_sync_dev(&netdev->uc, netdev, dev_addr_test_sync, + dev_addr_test_unsync); + KUNIT_EXPECT_EQ(test, 0, datp->addr_synced); + KUNIT_EXPECT_EQ(test, 1 << ADDR_A, datp->addr_unsynced); + KUNIT_EXPECT_EQ(test, 0, netdev->uc.count); + + rtnl_unlock(); +} + +/* Snapshot test: ADDR_A was stale (unsynced from hardware by driver), + * but concurrently re-added by the user. The re-add bumps refcount of + * the existing stale entry. Reconcile applies delta=-1, leaving ADDR_A + * as a fresh entry (sync_cnt=0, refcount=1) for the next work run. + */ +static void dev_addr_test_snapshot_readd_during_unsync(struct kunit *test) +{ + struct net_device *netdev = test->priv; + struct netdev_hw_addr_list snap, ref; + struct dev_addr_test_priv *datp; + struct netdev_hw_addr *ha; + u8 addr[ETH_ALEN]; + + datp = netdev_priv(netdev); + + rtnl_lock(); + + memset(addr, ADDR_A, sizeof(addr)); + KUNIT_EXPECT_EQ(test, 0, dev_uc_add(netdev, addr)); + + /* Sync ADDR_A to hardware: sync_cnt=1, refcount=2 */ + dev_addr_test_reset(netdev); + __hw_addr_sync_dev(&netdev->uc, netdev, dev_addr_test_sync, + dev_addr_test_unsync); + KUNIT_EXPECT_EQ(test, 1 << ADDR_A, datp->addr_synced); + KUNIT_EXPECT_EQ(test, 0, datp->addr_unsynced); + + /* User removes ADDR_A: refcount=1, sync_cnt=1 -> stale */ + KUNIT_EXPECT_EQ(test, 0, dev_uc_del(netdev, addr)); + + /* Snapshot: ADDR_A is stale (sync_cnt=1, refcount=1) */ + netif_addr_lock_bh(netdev); + __hw_addr_init(&snap); + __hw_addr_init(&ref); + KUNIT_EXPECT_EQ(test, 0, + __hw_addr_list_snapshot(&snap, &netdev->uc, ETH_ALEN)); + KUNIT_EXPECT_EQ(test, 0, + __hw_addr_list_snapshot(&ref, &netdev->uc, ETH_ALEN)); + netif_addr_unlock_bh(netdev); + + /* Driver unsyncs stale ADDR_A from hardware */ + dev_addr_test_reset(netdev); + __hw_addr_sync_dev(&snap, netdev, dev_addr_test_sync, + dev_addr_test_unsync); + KUNIT_EXPECT_EQ(test, 0, datp->addr_synced); + KUNIT_EXPECT_EQ(test, 1 << ADDR_A, datp->addr_unsynced); + + /* Concurrent: user re-adds ADDR_A. dev_uc_add finds the existing + * stale entry and bumps refcount from 1 -> 2. sync_cnt stays 1. + */ + KUNIT_EXPECT_EQ(test, 0, dev_uc_add(netdev, addr)); + KUNIT_EXPECT_EQ(test, 1, netdev->uc.count); + + /* Reconcile: ref sync_cnt=1 matches real sync_cnt=1, delta=-1 + * applied. Result: sync_cnt=0, refcount=1 (fresh). 
+ */ + netif_addr_lock_bh(netdev); + __hw_addr_list_reconcile(&netdev->uc, &snap, &ref, ETH_ALEN); + netif_addr_unlock_bh(netdev); + + /* Entry survives as fresh: needs re-sync to HW */ + KUNIT_EXPECT_EQ(test, 1, netdev->uc.count); + ha = list_first_entry(&netdev->uc.list, struct netdev_hw_addr, list); + KUNIT_EXPECT_MEMEQ(test, ha->addr, addr, ETH_ALEN); + KUNIT_EXPECT_EQ(test, 0, ha->sync_cnt); + KUNIT_EXPECT_EQ(test, 1, ha->refcount); + + /* Second work run: fresh entry gets synced to HW */ + dev_addr_test_reset(netdev); + __hw_addr_sync_dev(&netdev->uc, netdev, dev_addr_test_sync, + dev_addr_test_unsync); + KUNIT_EXPECT_EQ(test, 1 << ADDR_A, datp->addr_synced); + KUNIT_EXPECT_EQ(test, 0, datp->addr_unsynced); + + rtnl_unlock(); +} + +/* Snapshot test: ADDR_A is new (synced by driver), and independent ADDR_B + * is concurrently removed from the real list. A's sync delta propagates + * normally; B's absence doesn't interfere. + */ +static void dev_addr_test_snapshot_add_and_remove(struct kunit *test) +{ + struct net_device *netdev = test->priv; + struct netdev_hw_addr_list snap, ref; + struct dev_addr_test_priv *datp; + struct netdev_hw_addr *ha; + u8 addr[ETH_ALEN]; + + datp = netdev_priv(netdev); + + rtnl_lock(); + + /* Add ADDR_A and ADDR_B (will be synced then removed) */ + memset(addr, ADDR_A, sizeof(addr)); + KUNIT_EXPECT_EQ(test, 0, dev_uc_add(netdev, addr)); + memset(addr, ADDR_B, sizeof(addr)); + KUNIT_EXPECT_EQ(test, 0, dev_uc_add(netdev, addr)); + + /* Sync both to hardware: sync_cnt=1, refcount=2 */ + __hw_addr_sync_dev(&netdev->uc, netdev, dev_addr_test_sync, + dev_addr_test_unsync); + + /* Add ADDR_C (new, will be synced by snapshot) */ + memset(addr, ADDR_C, sizeof(addr)); + KUNIT_EXPECT_EQ(test, 0, dev_uc_add(netdev, addr)); + + /* Snapshot: A,B synced (sync_cnt=1,refcount=2); C new (0,1) */ + netif_addr_lock_bh(netdev); + __hw_addr_init(&snap); + __hw_addr_init(&ref); + KUNIT_EXPECT_EQ(test, 0, + __hw_addr_list_snapshot(&snap, &netdev->uc, ETH_ALEN)); + KUNIT_EXPECT_EQ(test, 0, + __hw_addr_list_snapshot(&ref, &netdev->uc, ETH_ALEN)); + netif_addr_unlock_bh(netdev); + + /* Driver syncs snapshot: ADDR_C is new -> synced; A,B already synced */ + dev_addr_test_reset(netdev); + __hw_addr_sync_dev(&snap, netdev, dev_addr_test_sync, + dev_addr_test_unsync); + KUNIT_EXPECT_EQ(test, 1 << ADDR_C, datp->addr_synced); + KUNIT_EXPECT_EQ(test, 0, datp->addr_unsynced); + + /* Concurrent: user removes addr B while driver was working */ + memset(addr, ADDR_B, sizeof(addr)); + KUNIT_EXPECT_EQ(test, 0, dev_uc_del(netdev, addr)); + + /* Reconcile: ADDR_C's delta=+1 applied to real list. + * ADDR_B's delta=0 (unchanged in snapshot), + * so nothing to apply to ADDR_B. 
+ */ + netif_addr_lock_bh(netdev); + __hw_addr_list_reconcile(&netdev->uc, &snap, &ref, ETH_ALEN); + netif_addr_unlock_bh(netdev); + + /* ADDR_A: unchanged (sync_cnt=1, refcount=2) + * ADDR_B: refcount went from 2->1 via dev_uc_del (still present, stale) + * ADDR_C: sync propagated (sync_cnt=1, refcount=2) + */ + KUNIT_EXPECT_EQ(test, 3, netdev->uc.count); + netdev_hw_addr_list_for_each(ha, &netdev->uc) { + u8 id = ha->addr[0]; + + if (!memchr_inv(ha->addr, id, ETH_ALEN)) { + if (id == ADDR_A) { + KUNIT_EXPECT_EQ(test, 1, ha->sync_cnt); + KUNIT_EXPECT_EQ(test, 2, ha->refcount); + } else if (id == ADDR_B) { + /* B: still present but now stale */ + KUNIT_EXPECT_EQ(test, 1, ha->sync_cnt); + KUNIT_EXPECT_EQ(test, 1, ha->refcount); + } else if (id == ADDR_C) { + KUNIT_EXPECT_EQ(test, 1, ha->sync_cnt); + KUNIT_EXPECT_EQ(test, 2, ha->refcount); + } + } + } + + /* Second work run: ADDR_B is stale, gets unsynced and removed */ + dev_addr_test_reset(netdev); + __hw_addr_sync_dev(&netdev->uc, netdev, dev_addr_test_sync, + dev_addr_test_unsync); + KUNIT_EXPECT_EQ(test, 0, datp->addr_synced); + KUNIT_EXPECT_EQ(test, 1 << ADDR_B, datp->addr_unsynced); + KUNIT_EXPECT_EQ(test, 2, netdev->uc.count); + + rtnl_unlock(); +} + +static void dev_addr_test_snapshot_benchmark(struct kunit *test) +{ + struct net_device *netdev = test->priv; + struct netdev_hw_addr_list snap; + u8 addr[ETH_ALEN]; + s64 duration = 0; + ktime_t start; + int i, iter; + + rtnl_lock(); + + for (i = 0; i < 1024; i++) { + memset(addr, 0, sizeof(addr)); + addr[0] = (i >> 8) & 0xff; + addr[1] = i & 0xff; + KUNIT_EXPECT_EQ(test, 0, dev_uc_add(netdev, addr)); + } + + for (iter = 0; iter < 1000; iter++) { + netif_addr_lock_bh(netdev); + __hw_addr_init(&snap); + + start = ktime_get(); + KUNIT_EXPECT_EQ(test, 0, + __hw_addr_list_snapshot(&snap, &netdev->uc, + ETH_ALEN)); + duration += ktime_to_ns(ktime_sub(ktime_get(), start)); + + netif_addr_unlock_bh(netdev); + __hw_addr_flush(&snap); + } + + kunit_info(test, + "1024 addrs x 1000 snapshots: %lld ns total, %lld ns/iter", + duration, div_s64(duration, 1000)); + + rtnl_unlock(); +} + static struct kunit_case dev_addr_test_cases[] = { KUNIT_CASE(dev_addr_test_basic), KUNIT_CASE(dev_addr_test_sync_one), @@ -232,6 +585,11 @@ static struct kunit_case dev_addr_test_cases[] = { KUNIT_CASE(dev_addr_test_del_main), KUNIT_CASE(dev_addr_test_add_set), KUNIT_CASE(dev_addr_test_add_excl), + KUNIT_CASE(dev_addr_test_snapshot_sync), + KUNIT_CASE(dev_addr_test_snapshot_remove_during_sync), + KUNIT_CASE(dev_addr_test_snapshot_readd_during_unsync), + KUNIT_CASE(dev_addr_test_snapshot_add_and_remove), + KUNIT_CASE_SLOW(dev_addr_test_snapshot_benchmark), {} }; @@ -243,5 +601,6 @@ static struct kunit_suite dev_addr_test_suite = { }; kunit_test_suite(dev_addr_test_suite); +MODULE_IMPORT_NS("EXPORTED_FOR_KUNIT_TESTING"); MODULE_DESCRIPTION("KUnit tests for struct netdev_hw_addr_list"); MODULE_LICENSE("GPL"); From 3554b4345d855089ab7af5e3557f5dc3262d14c9 Mon Sep 17 00:00:00 2001 From: Stanislav Fomichev Date: Thu, 16 Apr 2026 11:56:59 -0700 Subject: [PATCH 075/148] net: introduce ndo_set_rx_mode_async and netdev_rx_mode_work Add ndo_set_rx_mode_async callback that drivers can implement instead of the legacy ndo_set_rx_mode. The legacy callback runs under the netif_addr_lock spinlock with BHs disabled, preventing drivers from sleeping. The async variant runs from a work queue with rtnl_lock and netdev_lock_ops held, in fully sleepable context. 
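For illustration, the two callback shapes compare as follows. This is a sketch only; the foo_* names are placeholders for a hypothetical driver, not part of this series:

    /* Legacy: atomic context, netif_addr_lock_bh held, must not sleep. */
    static void foo_set_rx_mode(struct net_device *dev);

    /* Async: sleepable, called with rtnl_lock and netdev_lock_ops(dev)
     * held; operates on uc/mc snapshots instead of dev->uc/dev->mc.
     */
    static void foo_set_rx_mode_async(struct net_device *dev,
                                      struct netdev_hw_addr_list *uc,
                                      struct netdev_hw_addr_list *mc);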
When __dev_set_rx_mode() sees ndo_set_rx_mode_async, it schedules netdev_rx_mode_work instead of calling the driver inline. The work function takes two snapshots of each address list (uc/mc) under the addr_lock, then drops the lock and calls the driver with the work copies. After the driver returns, it reconciles the snapshots back to the real lists under the lock. Add netif_rx_mode_sync() to opportunistically execute the pending workqueue update inline, so that rx mode changes are committed before returning to userspace: - dev_change_flags (SIOCSIFFLAGS / RTM_NEWLINK) - dev_set_promiscuity - dev_set_allmulti - dev_ifsioc SIOCADDMULTI / SIOCDELMULTI - do_setlink (RTM_SETLINK) Note that some deep hierarchies still skip the lower updates via: - dev_uc_sync - dev_mc_sync If we do end up hitting user-visible issues, we can add more calls to netif_rx_mode_sync in specific places. But hopefully we should not: the actual user-visible lists are still synced; it's just the HW state that might be lagging. Signed-off-by: Stanislav Fomichev Link: https://patch.msgid.link/20260416185712.2155425-3-sdf@fomichev.me Signed-off-by: Paolo Abeni --- Documentation/networking/netdevices.rst | 9 + include/linux/netdevice.h | 18 ++ net/core/dev.c | 43 +---- net/core/dev.h | 3 + net/core/dev_addr_lists.c | 209 ++++++++++++++++++++++++ net/core/dev_api.c | 3 + net/core/dev_ioctl.c | 6 +- net/core/rtnetlink.c | 1 + 8 files changed, 249 insertions(+), 43 deletions(-) diff --git a/Documentation/networking/netdevices.rst b/Documentation/networking/netdevices.rst index 83e28b96884f..e89b12d4f3a7 100644 --- a/Documentation/networking/netdevices.rst +++ b/Documentation/networking/netdevices.rst @@ -289,6 +289,15 @@ ndo_tx_timeout: ndo_set_rx_mode: Synchronization: netif_addr_lock spinlock. Context: BHs disabled + Notes: Deprecated in favor of ndo_set_rx_mode_async which runs + in process context. + +ndo_set_rx_mode_async: + Synchronization: rtnl_lock() semaphore. In addition, netdev instance + lock if the driver implements queue management or shaper API. + Context: process (from a work queue) + Notes: Async version of ndo_set_rx_mode which runs in process + context. Receives snapshots of the unicast and multicast address lists. ndo_setup_tc: ``TC_SETUP_BLOCK`` and ``TC_SETUP_FT`` are running under NFT locks diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index a84c55488b8c..6ed97f4c3bc6 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -1119,6 +1119,16 @@ struct netdev_net_notifier { * This function is called device changes address list filtering. * If driver handles unicast address filtering, it should set * IFF_UNICAST_FLT in its priv_flags. + * Cannot sleep, called with netif_addr_lock_bh held. + * Deprecated in favor of ndo_set_rx_mode_async. + * + * void (*ndo_set_rx_mode_async)(struct net_device *dev, + * struct netdev_hw_addr_list *uc, + * struct netdev_hw_addr_list *mc); + * Async version of ndo_set_rx_mode which runs in process context + * with rtnl_lock and netdev_lock_ops(dev) held. The uc/mc parameters + * are snapshots of the address lists - iterate with + * netdev_hw_addr_list_for_each(ha, uc).
* * int (*ndo_set_mac_address)(struct net_device *dev, void *addr); * This function is called when the Media Access Control address @@ -1439,6 +1449,10 @@ struct net_device_ops { void (*ndo_change_rx_flags)(struct net_device *dev, int flags); void (*ndo_set_rx_mode)(struct net_device *dev); + void (*ndo_set_rx_mode_async)( + struct net_device *dev, + struct netdev_hw_addr_list *uc, + struct netdev_hw_addr_list *mc); int (*ndo_set_mac_address)(struct net_device *dev, void *addr); int (*ndo_validate_addr)(struct net_device *dev); @@ -1903,6 +1917,8 @@ enum netdev_reg_state { * has been enabled due to the need to listen to * additional unicast addresses in a device that * does not implement ndo_set_rx_mode() + * @rx_mode_node: List entry for rx_mode work processing + * @rx_mode_tracker: Refcount tracker for rx_mode work * @uc: unicast mac addresses * @mc: multicast mac addresses * @dev_addrs: list of device hw addresses @@ -2294,6 +2310,8 @@ struct net_device { unsigned int promiscuity; unsigned int allmulti; bool uc_promisc; + struct list_head rx_mode_node; + netdevice_tracker rx_mode_tracker; #ifdef CONFIG_LOCKDEP unsigned char nested_level; #endif diff --git a/net/core/dev.c b/net/core/dev.c index e59f6025067c..b37061238a25 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -9593,7 +9593,7 @@ static void dev_change_rx_flags(struct net_device *dev, int flags) ops->ndo_change_rx_flags(dev, flags); } -static int __dev_set_promiscuity(struct net_device *dev, int inc, bool notify) +int __dev_set_promiscuity(struct net_device *dev, int inc, bool notify) { unsigned int old_flags = dev->flags; unsigned int promiscuity, flags; @@ -9697,46 +9697,6 @@ int netif_set_allmulti(struct net_device *dev, int inc, bool notify) return 0; } -/* - * Upload unicast and multicast address lists to device and - * configure RX filtering. When the device doesn't support unicast - * filtering it is put in promiscuous mode while unicast addresses - * are present. - */ -void __dev_set_rx_mode(struct net_device *dev) -{ - const struct net_device_ops *ops = dev->netdev_ops; - - /* dev_open will call this function so the list will stay sane. */ - if (!(dev->flags&IFF_UP)) - return; - - if (!netif_device_present(dev)) - return; - - if (!(dev->priv_flags & IFF_UNICAST_FLT)) { - /* Unicast addresses changes may only happen under the rtnl, - * therefore calling __dev_set_promiscuity here is safe. 
- */ - if (!netdev_uc_empty(dev) && !dev->uc_promisc) { - __dev_set_promiscuity(dev, 1, false); - dev->uc_promisc = true; - } else if (netdev_uc_empty(dev) && dev->uc_promisc) { - __dev_set_promiscuity(dev, -1, false); - dev->uc_promisc = false; - } - } - - if (ops->ndo_set_rx_mode) - ops->ndo_set_rx_mode(dev); -} - -void dev_set_rx_mode(struct net_device *dev) -{ - netif_addr_lock_bh(dev); - __dev_set_rx_mode(dev); - netif_addr_unlock_bh(dev); -} /** * netif_get_flags() - get flags reported to userspace @@ -12127,6 +12087,7 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name, #endif mutex_init(&dev->lock); + INIT_LIST_HEAD(&dev->rx_mode_node); dev->priv_flags = IFF_XMIT_DST_RELEASE | IFF_XMIT_DST_RELEASE_PERM; setup(dev); diff --git a/net/core/dev.h b/net/core/dev.h index 585b6d7e88df..0cf24b8f5008 100644 --- a/net/core/dev.h +++ b/net/core/dev.h @@ -165,6 +165,9 @@ int netif_change_carrier(struct net_device *dev, bool new_carrier); int dev_change_carrier(struct net_device *dev, bool new_carrier); void __dev_set_rx_mode(struct net_device *dev); +int __dev_set_promiscuity(struct net_device *dev, int inc, bool notify); +bool netif_rx_mode_clean(struct net_device *dev); +void netif_rx_mode_sync(struct net_device *dev); void __dev_notify_flags(struct net_device *dev, unsigned int old_flags, unsigned int gchanges, u32 portid, diff --git a/net/core/dev_addr_lists.c b/net/core/dev_addr_lists.c index bb4851bc55ce..056bca6fce12 100644 --- a/net/core/dev_addr_lists.c +++ b/net/core/dev_addr_lists.c @@ -11,10 +11,18 @@ #include #include #include +#include +#include #include #include "dev.h" +static void netdev_rx_mode_work(struct work_struct *work); + +static LIST_HEAD(rx_mode_list); +static DEFINE_SPINLOCK(rx_mode_lock); +static DECLARE_WORK(rx_mode_work, netdev_rx_mode_work); + /* * General list handling functions */ @@ -1156,3 +1164,204 @@ void dev_mc_init(struct net_device *dev) __hw_addr_init(&dev->mc); } EXPORT_SYMBOL(dev_mc_init); + +static int netif_addr_lists_snapshot(struct net_device *dev, + struct netdev_hw_addr_list *uc_snap, + struct netdev_hw_addr_list *mc_snap, + struct netdev_hw_addr_list *uc_ref, + struct netdev_hw_addr_list *mc_ref) +{ + int err; + + err = __hw_addr_list_snapshot(uc_snap, &dev->uc, dev->addr_len); + if (!err) + err = __hw_addr_list_snapshot(uc_ref, &dev->uc, dev->addr_len); + if (!err) + err = __hw_addr_list_snapshot(mc_snap, &dev->mc, + dev->addr_len); + if (!err) + err = __hw_addr_list_snapshot(mc_ref, &dev->mc, dev->addr_len); + + if (err) { + __hw_addr_flush(uc_snap); + __hw_addr_flush(uc_ref); + __hw_addr_flush(mc_snap); + } + + return err; +} + +static void netif_addr_lists_reconcile(struct net_device *dev, + struct netdev_hw_addr_list *uc_snap, + struct netdev_hw_addr_list *mc_snap, + struct netdev_hw_addr_list *uc_ref, + struct netdev_hw_addr_list *mc_ref) +{ + __hw_addr_list_reconcile(&dev->uc, uc_snap, uc_ref, dev->addr_len); + __hw_addr_list_reconcile(&dev->mc, mc_snap, mc_ref, dev->addr_len); +} + +static void netif_rx_mode_run(struct net_device *dev) +{ + struct netdev_hw_addr_list uc_snap, mc_snap, uc_ref, mc_ref; + const struct net_device_ops *ops = dev->netdev_ops; + int err; + + might_sleep(); + netdev_ops_assert_locked(dev); + + __hw_addr_init(&uc_snap); + __hw_addr_init(&mc_snap); + __hw_addr_init(&uc_ref); + __hw_addr_init(&mc_ref); + + if (!(dev->flags & IFF_UP) || !netif_device_present(dev)) + return; + + netif_addr_lock_bh(dev); + err = netif_addr_lists_snapshot(dev, &uc_snap, &mc_snap, + &uc_ref, &mc_ref); + if (err) { 
+ netdev_WARN(dev, "failed to sync uc/mc addresses\n"); + netif_addr_unlock_bh(dev); + return; + } + netif_addr_unlock_bh(dev); + + ops->ndo_set_rx_mode_async(dev, &uc_snap, &mc_snap); + + netif_addr_lock_bh(dev); + netif_addr_lists_reconcile(dev, &uc_snap, &mc_snap, + &uc_ref, &mc_ref); + netif_addr_unlock_bh(dev); +} + +static void netdev_rx_mode_work(struct work_struct *work) +{ + struct net_device *dev; + + rtnl_lock(); + + while (true) { + spin_lock_bh(&rx_mode_lock); + if (list_empty(&rx_mode_list)) { + spin_unlock_bh(&rx_mode_lock); + break; + } + dev = list_first_entry(&rx_mode_list, struct net_device, + rx_mode_node); + list_del_init(&dev->rx_mode_node); + /* We must free netdev tracker under + * the spinlock protection. + */ + netdev_tracker_free(dev, &dev->rx_mode_tracker); + spin_unlock_bh(&rx_mode_lock); + + netdev_lock_ops(dev); + netif_rx_mode_run(dev); + netdev_unlock_ops(dev); + /* Use __dev_put() because netdev_tracker_free() was already + * called above. Must be after netdev_unlock_ops() to prevent + * netdev_run_todo() from freeing the device while still in use. + */ + __dev_put(dev); + } + + rtnl_unlock(); +} + +static void netif_rx_mode_queue(struct net_device *dev) +{ + spin_lock_bh(&rx_mode_lock); + if (list_empty(&dev->rx_mode_node)) { + list_add_tail(&dev->rx_mode_node, &rx_mode_list); + netdev_hold(dev, &dev->rx_mode_tracker, GFP_ATOMIC); + } + spin_unlock_bh(&rx_mode_lock); + schedule_work(&rx_mode_work); +} + +/** + * __dev_set_rx_mode() - upload unicast and multicast address lists to device + * and configure RX filtering. + * @dev: device + * + * When the device doesn't support unicast filtering it is put in promiscuous + * mode while unicast addresses are present. + */ +void __dev_set_rx_mode(struct net_device *dev) +{ + const struct net_device_ops *ops = dev->netdev_ops; + + /* dev_open will call this function so the list will stay sane. */ + if (!(dev->flags & IFF_UP)) + return; + + if (!netif_device_present(dev)) + return; + + if (ops->ndo_set_rx_mode_async) { + netif_rx_mode_queue(dev); + return; + } + + if (!(dev->priv_flags & IFF_UNICAST_FLT)) { + if (!netdev_uc_empty(dev) && !dev->uc_promisc) { + __dev_set_promiscuity(dev, 1, false); + dev->uc_promisc = true; + } else if (netdev_uc_empty(dev) && dev->uc_promisc) { + __dev_set_promiscuity(dev, -1, false); + dev->uc_promisc = false; + } + } + + if (ops->ndo_set_rx_mode) + ops->ndo_set_rx_mode(dev); +} + +void dev_set_rx_mode(struct net_device *dev) +{ + netif_addr_lock_bh(dev); + __dev_set_rx_mode(dev); + netif_addr_unlock_bh(dev); +} + +bool netif_rx_mode_clean(struct net_device *dev) +{ + bool clean = false; + + spin_lock_bh(&rx_mode_lock); + if (!list_empty(&dev->rx_mode_node)) { + list_del_init(&dev->rx_mode_node); + clean = true; + /* We must release netdev tracker under + * the spinlock protection. + */ + netdev_tracker_free(dev, &dev->rx_mode_tracker); + } + spin_unlock_bh(&rx_mode_lock); + + return clean; +} + +/** + * netif_rx_mode_sync() - sync rx mode inline + * @dev: network device + * + * Drivers implementing ndo_set_rx_mode_async() have their rx mode callback + * executed from a workqueue. This allows the callback to sleep, but means + * the hardware update is deferred and may not be visible to userspace + * by the time the initiating syscall returns. netif_rx_mode_sync() steals the + * pending workqueue update and executes it inline. This preserves the + * atomicity of operations as observed by userspace.
+ */ +void netif_rx_mode_sync(struct net_device *dev) +{ + if (netif_rx_mode_clean(dev)) { + netif_rx_mode_run(dev); + /* Use __dev_put() because netdev_tracker_free() was already + * called inside netif_rx_mode_clean(). + */ + __dev_put(dev); + } +} diff --git a/net/core/dev_api.c b/net/core/dev_api.c index f28852078aa6..437947dd08ed 100644 --- a/net/core/dev_api.c +++ b/net/core/dev_api.c @@ -66,6 +66,7 @@ int dev_change_flags(struct net_device *dev, unsigned int flags, netdev_lock_ops(dev); ret = netif_change_flags(dev, flags, extack); + netif_rx_mode_sync(dev); netdev_unlock_ops(dev); return ret; @@ -285,6 +286,7 @@ int dev_set_promiscuity(struct net_device *dev, int inc) netdev_lock_ops(dev); ret = netif_set_promiscuity(dev, inc); + netif_rx_mode_sync(dev); netdev_unlock_ops(dev); return ret; @@ -311,6 +313,7 @@ int dev_set_allmulti(struct net_device *dev, int inc) netdev_lock_ops(dev); ret = netif_set_allmulti(dev, inc, true); + netif_rx_mode_sync(dev); netdev_unlock_ops(dev); return ret; diff --git a/net/core/dev_ioctl.c b/net/core/dev_ioctl.c index 7a8966544c9d..f3979b276090 100644 --- a/net/core/dev_ioctl.c +++ b/net/core/dev_ioctl.c @@ -586,24 +586,26 @@ static int dev_ifsioc(struct net *net, struct ifreq *ifr, void __user *data, return err; case SIOCADDMULTI: - if (!ops->ndo_set_rx_mode || + if ((!ops->ndo_set_rx_mode && !ops->ndo_set_rx_mode_async) || ifr->ifr_hwaddr.sa_family != AF_UNSPEC) return -EINVAL; if (!netif_device_present(dev)) return -ENODEV; netdev_lock_ops(dev); err = dev_mc_add_global(dev, ifr->ifr_hwaddr.sa_data); + netif_rx_mode_sync(dev); netdev_unlock_ops(dev); return err; case SIOCDELMULTI: - if (!ops->ndo_set_rx_mode || + if ((!ops->ndo_set_rx_mode && !ops->ndo_set_rx_mode_async) || ifr->ifr_hwaddr.sa_family != AF_UNSPEC) return -EINVAL; if (!netif_device_present(dev)) return -ENODEV; netdev_lock_ops(dev); err = dev_mc_del_global(dev, ifr->ifr_hwaddr.sa_data); + netif_rx_mode_sync(dev); netdev_unlock_ops(dev); return err; diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c index 69daba3ddaf0..b613bb6e07df 100644 --- a/net/core/rtnetlink.c +++ b/net/core/rtnetlink.c @@ -3431,6 +3431,7 @@ static int do_setlink(const struct sk_buff *skb, struct net_device *dev, dev->name); } + netif_rx_mode_sync(dev); netdev_unlock_ops(dev); return err; From a4c833278144917982510ca43a3438155756122a Mon Sep 17 00:00:00 2001 From: Stanislav Fomichev Date: Thu, 16 Apr 2026 11:57:00 -0700 Subject: [PATCH 076/148] net: cache snapshot entries for ndo_set_rx_mode_async Add a per-device netdev_hw_addr_list cache (rx_mode_addr_cache) that allows __hw_addr_list_snapshot() and __hw_addr_list_reconcile() to reuse previously allocated entries instead of hitting GFP_ATOMIC on every snapshot cycle. snapshot pops entries from the cache when available, falling back to __hw_addr_create(). reconcile splices both snapshot lists back into the cache via __hw_addr_splice(). The cache is flushed in free_netdev(). 
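To make the reuse path concrete, the fast path in __hw_addr_list_snapshot() now looks roughly like this (simplified from the dev_addr_lists.c hunk below, error handling elided):

    if (cache->count) {
            /* Recycle a previously allocated entry. */
            entry = list_first_entry(&cache->list,
                                     struct netdev_hw_addr, list);
            list_del(&entry->list);
            cache->count--;
    } else {
            /* Cache empty: fall back to GFP_ATOMIC allocation. */
            entry = __hw_addr_create(ha->addr, addr_len, ha->type,
                                     false, false);
    }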
Signed-off-by: Stanislav Fomichev Link: https://patch.msgid.link/20260416185712.2155425-4-sdf@fomichev.me Signed-off-by: Paolo Abeni --- include/linux/netdevice.h | 7 ++-- net/core/dev.c | 3 ++ net/core/dev_addr_lists.c | 66 ++++++++++++++++++++++++---------- net/core/dev_addr_lists_test.c | 60 +++++++++++++++++++++---------- 4 files changed, 97 insertions(+), 39 deletions(-) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 6ed97f4c3bc6..97b435da5771 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -1919,6 +1919,7 @@ enum netdev_reg_state { * does not implement ndo_set_rx_mode() * @rx_mode_node: List entry for rx_mode work processing * @rx_mode_tracker: Refcount tracker for rx_mode work + * @rx_mode_addr_cache: Recycled snapshot entries for rx_mode work * @uc: unicast mac addresses * @mc: multicast mac addresses * @dev_addrs: list of device hw addresses @@ -2312,6 +2313,7 @@ struct net_device { bool uc_promisc; struct list_head rx_mode_node; netdevice_tracker rx_mode_tracker; + struct netdev_hw_addr_list rx_mode_addr_cache; #ifdef CONFIG_LOCKDEP unsigned char nested_level; #endif @@ -5025,10 +5027,11 @@ void __hw_addr_init(struct netdev_hw_addr_list *list); void __hw_addr_flush(struct netdev_hw_addr_list *list); int __hw_addr_list_snapshot(struct netdev_hw_addr_list *snap, const struct netdev_hw_addr_list *list, - int addr_len); + int addr_len, struct netdev_hw_addr_list *cache); void __hw_addr_list_reconcile(struct netdev_hw_addr_list *real_list, struct netdev_hw_addr_list *work, - struct netdev_hw_addr_list *ref, int addr_len); + struct netdev_hw_addr_list *ref, int addr_len, + struct netdev_hw_addr_list *cache); /* Functions used for device addresses handling */ void dev_addr_mod(struct net_device *dev, unsigned int offset, diff --git a/net/core/dev.c b/net/core/dev.c index b37061238a25..8597ec56fd64 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -12088,6 +12088,7 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name, mutex_init(&dev->lock); INIT_LIST_HEAD(&dev->rx_mode_node); + __hw_addr_init(&dev->rx_mode_addr_cache); dev->priv_flags = IFF_XMIT_DST_RELEASE | IFF_XMIT_DST_RELEASE_PERM; setup(dev); @@ -12192,6 +12193,8 @@ void free_netdev(struct net_device *dev) kfree(rcu_dereference_protected(dev->ingress_queue, 1)); + __hw_addr_flush(&dev->rx_mode_addr_cache); + /* Flush device addresses */ dev_addr_flush(dev); diff --git a/net/core/dev_addr_lists.c b/net/core/dev_addr_lists.c index 056bca6fce12..7bab2ed0f625 100644 --- a/net/core/dev_addr_lists.c +++ b/net/core/dev_addr_lists.c @@ -511,30 +511,50 @@ void __hw_addr_init(struct netdev_hw_addr_list *list) } EXPORT_SYMBOL(__hw_addr_init); +static void __hw_addr_splice(struct netdev_hw_addr_list *dst, + struct netdev_hw_addr_list *src) +{ + src->tree = RB_ROOT; + list_splice_init(&src->list, &dst->list); + dst->count += src->count; + src->count = 0; +} + /** * __hw_addr_list_snapshot - create a snapshot copy of an address list * @snap: destination snapshot list (needs to be __hw_addr_init-initialized) * @list: source address list to snapshot * @addr_len: length of addresses + * @cache: entry cache to reuse entries from; falls back to GFP_ATOMIC * - * Creates a copy of @list with individually allocated entries suitable - * for use with __hw_addr_sync_dev() and other list manipulation helpers. - * Each entry is allocated with GFP_ATOMIC; must be called under a spinlock. + * Creates a copy of @list reusing entries from @cache when available. 
+ * Must be called under a spinlock. * * Return: 0 on success, -errno on failure. */ int __hw_addr_list_snapshot(struct netdev_hw_addr_list *snap, const struct netdev_hw_addr_list *list, - int addr_len) + int addr_len, struct netdev_hw_addr_list *cache) { struct netdev_hw_addr *ha, *entry; list_for_each_entry(ha, &list->list, list) { - entry = __hw_addr_create(ha->addr, addr_len, ha->type, - false, false); - if (!entry) { - __hw_addr_flush(snap); - return -ENOMEM; + if (cache->count) { + entry = list_first_entry(&cache->list, + struct netdev_hw_addr, list); + list_del(&entry->list); + cache->count--; + memcpy(entry->addr, ha->addr, addr_len); + entry->type = ha->type; + entry->global_use = false; + entry->synced = 0; + } else { + entry = __hw_addr_create(ha->addr, addr_len, ha->type, + false, false); + if (!entry) { + __hw_addr_flush(snap); + return -ENOMEM; + } } entry->sync_cnt = ha->sync_cnt; entry->refcount = ha->refcount; @@ -554,15 +574,17 @@ EXPORT_SYMBOL_IF_KUNIT(__hw_addr_list_snapshot); * @work: the working snapshot (modified by driver via __hw_addr_sync_dev) * @ref: the reference snapshot (untouched copy of original state) * @addr_len: length of addresses + * @cache: entry cache to return snapshot entries to for reuse * * Walks the reference snapshot and compares each entry against the work * snapshot to compute sync_cnt deltas. Applies those deltas to @real_list. - * Frees both snapshots when done. + * Returns snapshot entries to @cache for reuse; frees both snapshots. * Caller must hold netif_addr_lock_bh. */ void __hw_addr_list_reconcile(struct netdev_hw_addr_list *real_list, struct netdev_hw_addr_list *work, - struct netdev_hw_addr_list *ref, int addr_len) + struct netdev_hw_addr_list *ref, int addr_len, + struct netdev_hw_addr_list *cache) { struct netdev_hw_addr *ref_ha, *tmp, *work_ha, *real_ha; int delta; @@ -611,8 +633,8 @@ void __hw_addr_list_reconcile(struct netdev_hw_addr_list *real_list, } } - __hw_addr_flush(work); - __hw_addr_flush(ref); + __hw_addr_splice(cache, work); + __hw_addr_splice(cache, ref); } EXPORT_SYMBOL_IF_KUNIT(__hw_addr_list_reconcile); @@ -1173,14 +1195,18 @@ static int netif_addr_lists_snapshot(struct net_device *dev, { int err; - err = __hw_addr_list_snapshot(uc_snap, &dev->uc, dev->addr_len); + err = __hw_addr_list_snapshot(uc_snap, &dev->uc, dev->addr_len, + &dev->rx_mode_addr_cache); if (!err) - err = __hw_addr_list_snapshot(uc_ref, &dev->uc, dev->addr_len); + err = __hw_addr_list_snapshot(uc_ref, &dev->uc, dev->addr_len, + &dev->rx_mode_addr_cache); if (!err) err = __hw_addr_list_snapshot(mc_snap, &dev->mc, - dev->addr_len); + dev->addr_len, + &dev->rx_mode_addr_cache); if (!err) - err = __hw_addr_list_snapshot(mc_ref, &dev->mc, dev->addr_len); + err = __hw_addr_list_snapshot(mc_ref, &dev->mc, dev->addr_len, + &dev->rx_mode_addr_cache); if (err) { __hw_addr_flush(uc_snap); @@ -1197,8 +1223,10 @@ static void netif_addr_lists_reconcile(struct net_device *dev, struct netdev_hw_addr_list *uc_ref, struct netdev_hw_addr_list *mc_ref) { - __hw_addr_list_reconcile(&dev->uc, uc_snap, uc_ref, dev->addr_len); - __hw_addr_list_reconcile(&dev->mc, mc_snap, mc_ref, dev->addr_len); + __hw_addr_list_reconcile(&dev->uc, uc_snap, uc_ref, dev->addr_len, + &dev->rx_mode_addr_cache); + __hw_addr_list_reconcile(&dev->mc, mc_snap, mc_ref, dev->addr_len, + &dev->rx_mode_addr_cache); } static void netif_rx_mode_run(struct net_device *dev) diff --git a/net/core/dev_addr_lists_test.c b/net/core/dev_addr_lists_test.c index fba926d5ec0d..260e71a2399f 100644 --- 
a/net/core/dev_addr_lists_test.c +++ b/net/core/dev_addr_lists_test.c @@ -251,8 +251,8 @@ static void dev_addr_test_add_excl(struct kunit *test) */ static void dev_addr_test_snapshot_sync(struct kunit *test) { + struct netdev_hw_addr_list snap, ref, cache; struct net_device *netdev = test->priv; - struct netdev_hw_addr_list snap, ref; struct dev_addr_test_priv *datp; struct netdev_hw_addr *ha; u8 addr[ETH_ALEN]; @@ -268,10 +268,13 @@ static void dev_addr_test_snapshot_sync(struct kunit *test) netif_addr_lock_bh(netdev); __hw_addr_init(&snap); __hw_addr_init(&ref); + __hw_addr_init(&cache); KUNIT_EXPECT_EQ(test, 0, - __hw_addr_list_snapshot(&snap, &netdev->uc, ETH_ALEN)); + __hw_addr_list_snapshot(&snap, &netdev->uc, ETH_ALEN, + &cache)); KUNIT_EXPECT_EQ(test, 0, - __hw_addr_list_snapshot(&ref, &netdev->uc, ETH_ALEN)); + __hw_addr_list_snapshot(&ref, &netdev->uc, ETH_ALEN, + &cache)); netif_addr_unlock_bh(netdev); /* Driver syncs ADDR_A to hardware */ @@ -283,7 +286,8 @@ static void dev_addr_test_snapshot_sync(struct kunit *test) /* Reconcile: delta=+1 applied to real entry */ netif_addr_lock_bh(netdev); - __hw_addr_list_reconcile(&netdev->uc, &snap, &ref, ETH_ALEN); + __hw_addr_list_reconcile(&netdev->uc, &snap, &ref, ETH_ALEN, + &cache); netif_addr_unlock_bh(netdev); /* Real entry should now reflect the sync: sync_cnt=1, refcount=2 */ @@ -301,6 +305,7 @@ static void dev_addr_test_snapshot_sync(struct kunit *test) KUNIT_EXPECT_EQ(test, 0, datp->addr_unsynced); KUNIT_EXPECT_EQ(test, 1, netdev->uc.count); + __hw_addr_flush(&cache); rtnl_unlock(); } @@ -310,8 +315,8 @@ static void dev_addr_test_snapshot_sync(struct kunit *test) */ static void dev_addr_test_snapshot_remove_during_sync(struct kunit *test) { + struct netdev_hw_addr_list snap, ref, cache; struct net_device *netdev = test->priv; - struct netdev_hw_addr_list snap, ref; struct dev_addr_test_priv *datp; struct netdev_hw_addr *ha; u8 addr[ETH_ALEN]; @@ -327,10 +332,13 @@ static void dev_addr_test_snapshot_remove_during_sync(struct kunit *test) netif_addr_lock_bh(netdev); __hw_addr_init(&snap); __hw_addr_init(&ref); + __hw_addr_init(&cache); KUNIT_EXPECT_EQ(test, 0, - __hw_addr_list_snapshot(&snap, &netdev->uc, ETH_ALEN)); + __hw_addr_list_snapshot(&snap, &netdev->uc, ETH_ALEN, + &cache)); KUNIT_EXPECT_EQ(test, 0, - __hw_addr_list_snapshot(&ref, &netdev->uc, ETH_ALEN)); + __hw_addr_list_snapshot(&ref, &netdev->uc, ETH_ALEN, + &cache)); netif_addr_unlock_bh(netdev); /* Driver syncs ADDR_A to hardware */ @@ -349,7 +357,8 @@ static void dev_addr_test_snapshot_remove_during_sync(struct kunit *test) * so it gets re-inserted as stale (sync_cnt=1, refcount=1). 
*/ netif_addr_lock_bh(netdev); - __hw_addr_list_reconcile(&netdev->uc, &snap, &ref, ETH_ALEN); + __hw_addr_list_reconcile(&netdev->uc, &snap, &ref, ETH_ALEN, + &cache); netif_addr_unlock_bh(netdev); KUNIT_EXPECT_EQ(test, 1, netdev->uc.count); @@ -366,6 +375,7 @@ static void dev_addr_test_snapshot_remove_during_sync(struct kunit *test) KUNIT_EXPECT_EQ(test, 1 << ADDR_A, datp->addr_unsynced); KUNIT_EXPECT_EQ(test, 0, netdev->uc.count); + __hw_addr_flush(&cache); rtnl_unlock(); } @@ -376,8 +386,8 @@ static void dev_addr_test_snapshot_remove_during_sync(struct kunit *test) */ static void dev_addr_test_snapshot_readd_during_unsync(struct kunit *test) { + struct netdev_hw_addr_list snap, ref, cache; struct net_device *netdev = test->priv; - struct netdev_hw_addr_list snap, ref; struct dev_addr_test_priv *datp; struct netdev_hw_addr *ha; u8 addr[ETH_ALEN]; @@ -403,10 +413,13 @@ static void dev_addr_test_snapshot_readd_during_unsync(struct kunit *test) netif_addr_lock_bh(netdev); __hw_addr_init(&snap); __hw_addr_init(&ref); + __hw_addr_init(&cache); KUNIT_EXPECT_EQ(test, 0, - __hw_addr_list_snapshot(&snap, &netdev->uc, ETH_ALEN)); + __hw_addr_list_snapshot(&snap, &netdev->uc, ETH_ALEN, + &cache)); KUNIT_EXPECT_EQ(test, 0, - __hw_addr_list_snapshot(&ref, &netdev->uc, ETH_ALEN)); + __hw_addr_list_snapshot(&ref, &netdev->uc, ETH_ALEN, + &cache)); netif_addr_unlock_bh(netdev); /* Driver unsyncs stale ADDR_A from hardware */ @@ -426,7 +439,8 @@ static void dev_addr_test_snapshot_readd_during_unsync(struct kunit *test) * applied. Result: sync_cnt=0, refcount=1 (fresh). */ netif_addr_lock_bh(netdev); - __hw_addr_list_reconcile(&netdev->uc, &snap, &ref, ETH_ALEN); + __hw_addr_list_reconcile(&netdev->uc, &snap, &ref, ETH_ALEN, + &cache); netif_addr_unlock_bh(netdev); /* Entry survives as fresh: needs re-sync to HW */ @@ -443,6 +457,7 @@ static void dev_addr_test_snapshot_readd_during_unsync(struct kunit *test) KUNIT_EXPECT_EQ(test, 1 << ADDR_A, datp->addr_synced); KUNIT_EXPECT_EQ(test, 0, datp->addr_unsynced); + __hw_addr_flush(&cache); rtnl_unlock(); } @@ -452,8 +467,8 @@ static void dev_addr_test_snapshot_readd_during_unsync(struct kunit *test) */ static void dev_addr_test_snapshot_add_and_remove(struct kunit *test) { + struct netdev_hw_addr_list snap, ref, cache; struct net_device *netdev = test->priv; - struct netdev_hw_addr_list snap, ref; struct dev_addr_test_priv *datp; struct netdev_hw_addr *ha; u8 addr[ETH_ALEN]; @@ -480,10 +495,13 @@ static void dev_addr_test_snapshot_add_and_remove(struct kunit *test) netif_addr_lock_bh(netdev); __hw_addr_init(&snap); __hw_addr_init(&ref); + __hw_addr_init(&cache); KUNIT_EXPECT_EQ(test, 0, - __hw_addr_list_snapshot(&snap, &netdev->uc, ETH_ALEN)); + __hw_addr_list_snapshot(&snap, &netdev->uc, ETH_ALEN, + &cache)); KUNIT_EXPECT_EQ(test, 0, - __hw_addr_list_snapshot(&ref, &netdev->uc, ETH_ALEN)); + __hw_addr_list_snapshot(&ref, &netdev->uc, ETH_ALEN, + &cache)); netif_addr_unlock_bh(netdev); /* Driver syncs snapshot: ADDR_C is new -> synced; A,B already synced */ @@ -502,7 +520,8 @@ static void dev_addr_test_snapshot_add_and_remove(struct kunit *test) * so nothing to apply to ADDR_B. 
*/ netif_addr_lock_bh(netdev); - __hw_addr_list_reconcile(&netdev->uc, &snap, &ref, ETH_ALEN); + __hw_addr_list_reconcile(&netdev->uc, &snap, &ref, ETH_ALEN, + &cache); netif_addr_unlock_bh(netdev); /* ADDR_A: unchanged (sync_cnt=1, refcount=2) @@ -536,13 +555,14 @@ static void dev_addr_test_snapshot_add_and_remove(struct kunit *test) KUNIT_EXPECT_EQ(test, 1 << ADDR_B, datp->addr_unsynced); KUNIT_EXPECT_EQ(test, 2, netdev->uc.count); + __hw_addr_flush(&cache); rtnl_unlock(); } static void dev_addr_test_snapshot_benchmark(struct kunit *test) { struct net_device *netdev = test->priv; - struct netdev_hw_addr_list snap; + struct netdev_hw_addr_list snap, cache; u8 addr[ETH_ALEN]; s64 duration = 0; ktime_t start; @@ -557,6 +577,8 @@ static void dev_addr_test_snapshot_benchmark(struct kunit *test) KUNIT_EXPECT_EQ(test, 0, dev_uc_add(netdev, addr)); } + __hw_addr_init(&cache); + for (iter = 0; iter < 1000; iter++) { netif_addr_lock_bh(netdev); __hw_addr_init(&snap); @@ -564,13 +586,15 @@ static void dev_addr_test_snapshot_benchmark(struct kunit *test) start = ktime_get(); KUNIT_EXPECT_EQ(test, 0, __hw_addr_list_snapshot(&snap, &netdev->uc, - ETH_ALEN)); + ETH_ALEN, &cache)); duration += ktime_to_ns(ktime_sub(ktime_get(), start)); netif_addr_unlock_bh(netdev); __hw_addr_flush(&snap); } + __hw_addr_flush(&cache); + kunit_info(test, "1024 addrs x 1000 snapshots: %lld ns total, %lld ns/iter", duration, div_s64(duration, 1000)); From 7ef83bf1712b5c441a21d5df844202433f6b0b05 Mon Sep 17 00:00:00 2001 From: Stanislav Fomichev Date: Thu, 16 Apr 2026 11:57:01 -0700 Subject: [PATCH 077/148] net: move promiscuity handling into netdev_rx_mode_work Move unicast promiscuity tracking into netdev_rx_mode_work so it runs under netdev_ops_lock instead of under the addr_lock spinlock. This is required because __dev_set_promiscuity calls dev_change_rx_flags and __dev_notify_flags, both of which may need to sleep. Change ASSERT_RTNL() to netdev_ops_assert_locked() in __dev_set_promiscuity, netif_set_allmulti and __dev_change_flags since these are now called from the work queue under the ops lock. Link: https://lore.kernel.org/netdev/20260214033859.43857-1-jiayuan.chen@linux.dev/ Fixes: 78cd408356fe ("net: add missing instance lock to dev_set_promiscuity") Reported-by: syzbot+2b3391f44313b3983e91@syzkaller.appspotmail.com Reviewed-by: Aleksandr Loktionov Signed-off-by: Stanislav Fomichev Link: https://patch.msgid.link/20260416185712.2155425-5-sdf@fomichev.me Signed-off-by: Paolo Abeni --- Documentation/networking/netdevices.rst | 4 ++ net/core/dev.c | 16 ++--- net/core/dev_addr_lists.c | 82 ++++++++++++++++++------- 3 files changed, 68 insertions(+), 34 deletions(-) diff --git a/Documentation/networking/netdevices.rst b/Documentation/networking/netdevices.rst index e89b12d4f3a7..93e06e8d51a9 100644 --- a/Documentation/networking/netdevices.rst +++ b/Documentation/networking/netdevices.rst @@ -299,6 +299,10 @@ ndo_set_rx_mode_async: Notes: Async version of ndo_set_rx_mode which runs in process context. Receives snapshots of the unicast and multicast address lists. +ndo_change_rx_flags: + Synchronization: rtnl_lock() semaphore. In addition, netdev instance + lock if the driver implements queue management or shaper API. + ndo_setup_tc: ``TC_SETUP_BLOCK`` and ``TC_SETUP_FT`` are running under NFT locks (i.e. no ``rtnl_lock`` and no device instance lock). 
The rest of diff --git a/net/core/dev.c b/net/core/dev.c index 8597ec56fd64..8a69aed56fca 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -9600,7 +9600,7 @@ int __dev_set_promiscuity(struct net_device *dev, int inc, bool notify) kuid_t uid; kgid_t gid; - ASSERT_RTNL(); + netdev_ops_assert_locked(dev); promiscuity = dev->promiscuity + inc; if (promiscuity == 0) { @@ -9636,16 +9636,8 @@ int __dev_set_promiscuity(struct net_device *dev, int inc, bool notify) dev_change_rx_flags(dev, IFF_PROMISC); } - if (notify) { - /* The ops lock is only required to ensure consistent locking - * for `NETDEV_CHANGE` notifiers. This function is sometimes - * called without the lock, even for devices that are ops - * locked, such as in `dev_uc_sync_multiple` when using - * bonding or teaming. - */ - netdev_ops_assert_locked(dev); + if (notify) __dev_notify_flags(dev, old_flags, IFF_PROMISC, 0, NULL); - } return 0; } @@ -9667,7 +9659,7 @@ int netif_set_allmulti(struct net_device *dev, int inc, bool notify) unsigned int old_flags = dev->flags, old_gflags = dev->gflags; unsigned int allmulti, flags; - ASSERT_RTNL(); + netdev_ops_assert_locked(dev); allmulti = dev->allmulti + inc; if (allmulti == 0) { @@ -9735,7 +9727,7 @@ int __dev_change_flags(struct net_device *dev, unsigned int flags, unsigned int old_flags = dev->flags; int ret; - ASSERT_RTNL(); + netdev_ops_assert_locked(dev); /* * Set the flags on our device. diff --git a/net/core/dev_addr_lists.c b/net/core/dev_addr_lists.c index 7bab2ed0f625..4c9e8a69493f 100644 --- a/net/core/dev_addr_lists.c +++ b/net/core/dev_addr_lists.c @@ -1229,10 +1229,34 @@ static void netif_addr_lists_reconcile(struct net_device *dev, &dev->rx_mode_addr_cache); } +/** + * netif_uc_promisc_update() - evaluate whether uc_promisc should be toggled. + * @dev: device + * + * Must be called under netif_addr_lock_bh. + * Return: +1 to enter promisc, -1 to leave, 0 for no change. 
+ */ +static int netif_uc_promisc_update(struct net_device *dev) +{ + if (dev->priv_flags & IFF_UNICAST_FLT) + return 0; + + if (!netdev_uc_empty(dev) && !dev->uc_promisc) { + dev->uc_promisc = true; + return 1; + } + if (netdev_uc_empty(dev) && dev->uc_promisc) { + dev->uc_promisc = false; + return -1; + } + return 0; +} + static void netif_rx_mode_run(struct net_device *dev) { struct netdev_hw_addr_list uc_snap, mc_snap, uc_ref, mc_ref; const struct net_device_ops *ops = dev->netdev_ops; + int promisc_inc; int err; might_sleep(); @@ -1246,22 +1270,39 @@ static void netif_rx_mode_run(struct net_device *dev) if (!(dev->flags & IFF_UP) || !netif_device_present(dev)) return; - netif_addr_lock_bh(dev); - err = netif_addr_lists_snapshot(dev, &uc_snap, &mc_snap, - &uc_ref, &mc_ref); - if (err) { - netdev_WARN(dev, "failed to sync uc/mc addresses\n"); + if (ops->ndo_set_rx_mode_async) { + netif_addr_lock_bh(dev); + err = netif_addr_lists_snapshot(dev, &uc_snap, &mc_snap, + &uc_ref, &mc_ref); + if (err) { + netdev_WARN(dev, "failed to sync uc/mc addresses\n"); + netif_addr_unlock_bh(dev); + return; + } + + promisc_inc = netif_uc_promisc_update(dev); + netif_addr_unlock_bh(dev); + } else { + netif_addr_lock_bh(dev); + promisc_inc = netif_uc_promisc_update(dev); netif_addr_unlock_bh(dev); - return; } - netif_addr_unlock_bh(dev); - ops->ndo_set_rx_mode_async(dev, &uc_snap, &mc_snap); + if (promisc_inc) + __dev_set_promiscuity(dev, promisc_inc, false); - netif_addr_lock_bh(dev); - netif_addr_lists_reconcile(dev, &uc_snap, &mc_snap, - &uc_ref, &mc_ref); - netif_addr_unlock_bh(dev); + if (ops->ndo_set_rx_mode_async) { + ops->ndo_set_rx_mode_async(dev, &uc_snap, &mc_snap); + + netif_addr_lock_bh(dev); + netif_addr_lists_reconcile(dev, &uc_snap, &mc_snap, + &uc_ref, &mc_ref); + netif_addr_unlock_bh(dev); + } else if (ops->ndo_set_rx_mode) { + netif_addr_lock_bh(dev); + ops->ndo_set_rx_mode(dev); + netif_addr_unlock_bh(dev); + } } static void netdev_rx_mode_work(struct work_struct *work) @@ -1320,6 +1361,7 @@ static void netif_rx_mode_queue(struct net_device *dev) void __dev_set_rx_mode(struct net_device *dev) { const struct net_device_ops *ops = dev->netdev_ops; + int promisc_inc; /* dev_open will call this function so the list will stay sane. */ if (!(dev->flags & IFF_UP)) @@ -1328,20 +1370,16 @@ void __dev_set_rx_mode(struct net_device *dev) if (!netif_device_present(dev)) return; - if (ops->ndo_set_rx_mode_async) { + if (ops->ndo_set_rx_mode_async || ops->ndo_change_rx_flags) { netif_rx_mode_queue(dev); return; } - if (!(dev->priv_flags & IFF_UNICAST_FLT)) { - if (!netdev_uc_empty(dev) && !dev->uc_promisc) { - __dev_set_promiscuity(dev, 1, false); - dev->uc_promisc = true; - } else if (netdev_uc_empty(dev) && dev->uc_promisc) { - __dev_set_promiscuity(dev, -1, false); - dev->uc_promisc = false; - } - } + /* Legacy path for non-ops-locked HW devices. */ + + promisc_inc = netif_uc_promisc_update(dev); + if (promisc_inc) + __dev_set_promiscuity(dev, promisc_inc, false); if (ops->ndo_set_rx_mode) ops->ndo_set_rx_mode(dev); From 60dd9781e9b890fa4de3e8ab01f63f0fd4e332e7 Mon Sep 17 00:00:00 2001 From: Stanislav Fomichev Date: Thu, 16 Apr 2026 11:57:02 -0700 Subject: [PATCH 078/148] fbnic: convert to ndo_set_rx_mode_async Convert fbnic from ndo_set_rx_mode to ndo_set_rx_mode_async. The driver's __fbnic_set_rx_mode() now takes explicit uc/mc list parameters and uses __hw_addr_sync_dev() on the snapshots instead of __dev_uc_sync/__dev_mc_sync on the netdev directly. 
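Schematically, the sync calls change as follows (from the fbnic_netdev.c hunk below):

    - err = __dev_uc_sync(netdev, fbnic_uc_sync, fbnic_uc_unsync);
    + err = __hw_addr_sync_dev(uc, netdev, fbnic_uc_sync, fbnic_uc_unsync);

so the driver syncs against the snapshot it was handed rather than reading dev->uc under the addr_lock.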
Update callers in fbnic_up, fbnic_fw_config_after_crash, fbnic_bmc_rpc_check and fbnic_set_mac to pass the real address lists when calling __fbnic_set_rx_mode outside the async work path. Cc: Alexander Duyck Cc: kernel-team@meta.com Reviewed-by: Aleksandr Loktionov Signed-off-by: Stanislav Fomichev Link: https://patch.msgid.link/20260416185712.2155425-6-sdf@fomichev.me Signed-off-by: Paolo Abeni --- .../net/ethernet/meta/fbnic/fbnic_netdev.c | 20 ++++++++++++------- .../net/ethernet/meta/fbnic/fbnic_netdev.h | 4 +++- drivers/net/ethernet/meta/fbnic/fbnic_pci.c | 4 ++-- drivers/net/ethernet/meta/fbnic/fbnic_rpc.c | 2 +- 4 files changed, 19 insertions(+), 11 deletions(-) diff --git a/drivers/net/ethernet/meta/fbnic/fbnic_netdev.c b/drivers/net/ethernet/meta/fbnic/fbnic_netdev.c index b4b396ca9bce..c406a3b56b37 100644 --- a/drivers/net/ethernet/meta/fbnic/fbnic_netdev.c +++ b/drivers/net/ethernet/meta/fbnic/fbnic_netdev.c @@ -183,7 +183,9 @@ static int fbnic_mc_unsync(struct net_device *netdev, const unsigned char *addr) return ret; } -void __fbnic_set_rx_mode(struct fbnic_dev *fbd) +void __fbnic_set_rx_mode(struct fbnic_dev *fbd, + struct netdev_hw_addr_list *uc, + struct netdev_hw_addr_list *mc) { bool uc_promisc = false, mc_promisc = false; struct net_device *netdev = fbd->netdev; @@ -213,10 +215,10 @@ void __fbnic_set_rx_mode(struct fbnic_dev *fbd) } /* Synchronize unicast and multicast address lists */ - err = __dev_uc_sync(netdev, fbnic_uc_sync, fbnic_uc_unsync); + err = __hw_addr_sync_dev(uc, netdev, fbnic_uc_sync, fbnic_uc_unsync); if (err == -ENOSPC) uc_promisc = true; - err = __dev_mc_sync(netdev, fbnic_mc_sync, fbnic_mc_unsync); + err = __hw_addr_sync_dev(mc, netdev, fbnic_mc_sync, fbnic_mc_unsync); if (err == -ENOSPC) mc_promisc = true; @@ -238,18 +240,21 @@ void __fbnic_set_rx_mode(struct fbnic_dev *fbd) fbnic_write_tce_tcam(fbd); } -static void fbnic_set_rx_mode(struct net_device *netdev) +static void fbnic_set_rx_mode(struct net_device *netdev, + struct netdev_hw_addr_list *uc, + struct netdev_hw_addr_list *mc) { struct fbnic_net *fbn = netdev_priv(netdev); struct fbnic_dev *fbd = fbn->fbd; /* No need to update the hardware if we are not running */ if (netif_running(netdev)) - __fbnic_set_rx_mode(fbd); + __fbnic_set_rx_mode(fbd, uc, mc); } static int fbnic_set_mac(struct net_device *netdev, void *p) { + struct fbnic_net *fbn = netdev_priv(netdev); struct sockaddr *addr = p; if (!is_valid_ether_addr(addr->sa_data)) @@ -257,7 +262,8 @@ static int fbnic_set_mac(struct net_device *netdev, void *p) eth_hw_addr_set(netdev, addr->sa_data); - fbnic_set_rx_mode(netdev); + if (netif_running(netdev)) + __fbnic_set_rx_mode(fbn->fbd, &netdev->uc, &netdev->mc); return 0; } @@ -551,7 +557,7 @@ static const struct net_device_ops fbnic_netdev_ops = { .ndo_features_check = fbnic_features_check, .ndo_set_mac_address = fbnic_set_mac, .ndo_change_mtu = fbnic_change_mtu, - .ndo_set_rx_mode = fbnic_set_rx_mode, + .ndo_set_rx_mode_async = fbnic_set_rx_mode, .ndo_get_stats64 = fbnic_get_stats64, .ndo_bpf = fbnic_bpf, .ndo_hwtstamp_get = fbnic_hwtstamp_get, diff --git a/drivers/net/ethernet/meta/fbnic/fbnic_netdev.h b/drivers/net/ethernet/meta/fbnic/fbnic_netdev.h index 9129a658f8fa..eded20b0e9e4 100644 --- a/drivers/net/ethernet/meta/fbnic/fbnic_netdev.h +++ b/drivers/net/ethernet/meta/fbnic/fbnic_netdev.h @@ -97,7 +97,9 @@ void fbnic_time_init(struct fbnic_net *fbn); int fbnic_time_start(struct fbnic_net *fbn); void fbnic_time_stop(struct fbnic_net *fbn); -void __fbnic_set_rx_mode(struct fbnic_dev *fbd);
+void __fbnic_set_rx_mode(struct fbnic_dev *fbd, + struct netdev_hw_addr_list *uc, + struct netdev_hw_addr_list *mc); void fbnic_clear_rx_mode(struct fbnic_dev *fbd); void fbnic_phylink_get_pauseparam(struct net_device *netdev, diff --git a/drivers/net/ethernet/meta/fbnic/fbnic_pci.c b/drivers/net/ethernet/meta/fbnic/fbnic_pci.c index b7c0b7349d00..7e85b480203c 100644 --- a/drivers/net/ethernet/meta/fbnic/fbnic_pci.c +++ b/drivers/net/ethernet/meta/fbnic/fbnic_pci.c @@ -135,7 +135,7 @@ void fbnic_up(struct fbnic_net *fbn) fbnic_rss_reinit_hw(fbn->fbd, fbn); - __fbnic_set_rx_mode(fbn->fbd); + __fbnic_set_rx_mode(fbn->fbd, &fbn->netdev->uc, &fbn->netdev->mc); /* Enable Tx/Rx processing */ fbnic_napi_enable(fbn); @@ -180,7 +180,7 @@ static int fbnic_fw_config_after_crash(struct fbnic_dev *fbd) } fbnic_rpc_reset_valid_entries(fbd); - __fbnic_set_rx_mode(fbd); + __fbnic_set_rx_mode(fbd, &fbd->netdev->uc, &fbd->netdev->mc); return 0; } diff --git a/drivers/net/ethernet/meta/fbnic/fbnic_rpc.c b/drivers/net/ethernet/meta/fbnic/fbnic_rpc.c index 42a186db43ea..fe95b6f69646 100644 --- a/drivers/net/ethernet/meta/fbnic/fbnic_rpc.c +++ b/drivers/net/ethernet/meta/fbnic/fbnic_rpc.c @@ -244,7 +244,7 @@ void fbnic_bmc_rpc_check(struct fbnic_dev *fbd) if (fbd->fw_cap.need_bmc_tcam_reinit) { fbnic_bmc_rpc_init(fbd); - __fbnic_set_rx_mode(fbd); + __fbnic_set_rx_mode(fbd, &fbd->netdev->uc, &fbd->netdev->mc); fbd->fw_cap.need_bmc_tcam_reinit = false; } From 5cf06fbdaf024f1b256dc4fc18f49f09ca26d16d Mon Sep 17 00:00:00 2001 From: Stanislav Fomichev Date: Thu, 16 Apr 2026 11:57:03 -0700 Subject: [PATCH 079/148] mlx5: convert to ndo_set_rx_mode_async Convert mlx5 from ndo_set_rx_mode to ndo_set_rx_mode_async. The driver's mlx5e_set_rx_mode now receives uc/mc snapshots and calls mlx5e_fs_set_rx_mode_work directly instead of queueing work. mlx5e_sync_netdev_addr and mlx5e_handle_netdev_addr now take explicit uc/mc list parameters and iterate with netdev_hw_addr_list_for_each instead of netdev_for_each_{uc,mc}_addr. Fall back to the netdev's uc/mc lists in a few places, taking the addr lock there.
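The fallback lives in mlx5e_sync_netdev_addr() itself (see the en_fs.c hunk below): when called with NULL lists, it re-invokes itself with the real lists under the addr lock:

    if (!uc || !mc) {
            netif_addr_lock_bh(netdev);
            mlx5e_sync_netdev_addr(fs, netdev, &netdev->uc, &netdev->mc);
            netif_addr_unlock_bh(netdev);
            return;
    }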
Cc: Saeed Mahameed Cc: Tariq Toukan Cc: Cosmin Ratiu Reviewed-by: Aleksandr Loktionov Signed-off-by: Stanislav Fomichev Link: https://patch.msgid.link/20260416185712.2155425-7-sdf@fomichev.me Signed-off-by: Paolo Abeni --- .../net/ethernet/mellanox/mlx5/core/en/fs.h | 5 ++- .../net/ethernet/mellanox/mlx5/core/en_fs.c | 32 ++++++++++++------- .../net/ethernet/mellanox/mlx5/core/en_main.c | 13 +++++--- 3 files changed, 34 insertions(+), 16 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/fs.h b/drivers/net/ethernet/mellanox/mlx5/core/en/fs.h index c3408b3f7010..091b80a67189 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en/fs.h +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/fs.h @@ -201,7 +201,10 @@ int mlx5e_add_vlan_trap(struct mlx5e_flow_steering *fs, int trap_id, int tir_nu void mlx5e_remove_vlan_trap(struct mlx5e_flow_steering *fs); int mlx5e_add_mac_trap(struct mlx5e_flow_steering *fs, int trap_id, int tir_num); void mlx5e_remove_mac_trap(struct mlx5e_flow_steering *fs); -void mlx5e_fs_set_rx_mode_work(struct mlx5e_flow_steering *fs, struct net_device *netdev); +void mlx5e_fs_set_rx_mode_work(struct mlx5e_flow_steering *fs, + struct net_device *netdev, + struct netdev_hw_addr_list *uc, + struct netdev_hw_addr_list *mc); int mlx5e_fs_vlan_rx_add_vid(struct mlx5e_flow_steering *fs, struct net_device *netdev, __be16 proto, u16 vid); diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_fs.c b/drivers/net/ethernet/mellanox/mlx5/core/en_fs.c index fdfe9d1cfe21..12492c4a5d41 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_fs.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_fs.c @@ -609,20 +609,26 @@ static void mlx5e_execute_l2_action(struct mlx5e_flow_steering *fs, } static void mlx5e_sync_netdev_addr(struct mlx5e_flow_steering *fs, - struct net_device *netdev) + struct net_device *netdev, + struct netdev_hw_addr_list *uc, + struct netdev_hw_addr_list *mc) { struct netdev_hw_addr *ha; - netif_addr_lock_bh(netdev); + if (!uc || !mc) { + netif_addr_lock_bh(netdev); + mlx5e_sync_netdev_addr(fs, netdev, &netdev->uc, &netdev->mc); + netif_addr_unlock_bh(netdev); + return; + } mlx5e_add_l2_to_hash(fs->l2.netdev_uc, netdev->dev_addr); - netdev_for_each_uc_addr(ha, netdev) + + netdev_hw_addr_list_for_each(ha, uc) mlx5e_add_l2_to_hash(fs->l2.netdev_uc, ha->addr); - netdev_for_each_mc_addr(ha, netdev) + netdev_hw_addr_list_for_each(ha, mc) mlx5e_add_l2_to_hash(fs->l2.netdev_mc, ha->addr); - - netif_addr_unlock_bh(netdev); } static void mlx5e_fill_addr_array(struct mlx5e_flow_steering *fs, int list_type, @@ -724,7 +730,9 @@ static void mlx5e_apply_netdev_addr(struct mlx5e_flow_steering *fs) } static void mlx5e_handle_netdev_addr(struct mlx5e_flow_steering *fs, - struct net_device *netdev) + struct net_device *netdev, + struct netdev_hw_addr_list *uc, + struct netdev_hw_addr_list *mc) { struct mlx5e_l2_hash_node *hn; struct hlist_node *tmp; @@ -736,7 +744,7 @@ static void mlx5e_handle_netdev_addr(struct mlx5e_flow_steering *fs, hn->action = MLX5E_ACTION_DEL; if (fs->state_destroy) - mlx5e_sync_netdev_addr(fs, netdev); + mlx5e_sync_netdev_addr(fs, netdev, uc, mc); mlx5e_apply_netdev_addr(fs); } @@ -820,13 +828,15 @@ static void mlx5e_destroy_promisc_table(struct mlx5e_flow_steering *fs) } void mlx5e_fs_set_rx_mode_work(struct mlx5e_flow_steering *fs, - struct net_device *netdev) + struct net_device *netdev, + struct netdev_hw_addr_list *uc, + struct netdev_hw_addr_list *mc) { struct mlx5e_priv *priv = netdev_priv(netdev); struct mlx5e_l2_table *ea = &fs->l2; if 
(mlx5e_is_uplink_rep(priv)) { - mlx5e_handle_netdev_addr(fs, netdev); + mlx5e_handle_netdev_addr(fs, netdev, uc, mc); goto update_vport_context; } @@ -856,7 +866,7 @@ void mlx5e_fs_set_rx_mode_work(struct mlx5e_flow_steering *fs, if (enable_broadcast) mlx5e_add_l2_flow_rule(fs, &ea->broadcast, MLX5E_FULLMATCH); - mlx5e_handle_netdev_addr(fs, netdev); + mlx5e_handle_netdev_addr(fs, netdev, uc, mc); if (disable_broadcast) mlx5e_del_l2_flow_rule(fs, &ea->broadcast); diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c index 6c4eeb88588c..5a46870c4b74 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c @@ -4145,11 +4145,13 @@ static void mlx5e_nic_set_rx_mode(struct mlx5e_priv *priv) queue_work(priv->wq, &priv->set_rx_mode_work); } -static void mlx5e_set_rx_mode(struct net_device *dev) +static void mlx5e_set_rx_mode(struct net_device *dev, + struct netdev_hw_addr_list *uc, + struct netdev_hw_addr_list *mc) { struct mlx5e_priv *priv = netdev_priv(dev); - mlx5e_nic_set_rx_mode(priv); + mlx5e_fs_set_rx_mode_work(priv->fs, dev, uc, mc); } static int mlx5e_set_mac(struct net_device *netdev, void *addr) @@ -5324,7 +5326,7 @@ const struct net_device_ops mlx5e_netdev_ops = { .ndo_setup_tc = mlx5e_setup_tc, .ndo_select_queue = mlx5e_select_queue, .ndo_get_stats64 = mlx5e_get_stats, - .ndo_set_rx_mode = mlx5e_set_rx_mode, + .ndo_set_rx_mode_async = mlx5e_set_rx_mode, .ndo_set_mac_address = mlx5e_set_mac, .ndo_vlan_rx_add_vid = mlx5e_vlan_rx_add_vid, .ndo_vlan_rx_kill_vid = mlx5e_vlan_rx_kill_vid, @@ -6309,8 +6311,11 @@ void mlx5e_set_rx_mode_work(struct work_struct *work) { struct mlx5e_priv *priv = container_of(work, struct mlx5e_priv, set_rx_mode_work); + struct net_device *dev = priv->netdev; - return mlx5e_fs_set_rx_mode_work(priv->fs, priv->netdev); + netdev_lock_ops(dev); + mlx5e_fs_set_rx_mode_work(priv->fs, dev, NULL, NULL); + netdev_unlock_ops(dev); } /* mlx5e generic netdev management API (move to en_common.c) */ From f6c53cfa1217b651065cf6ff2173b2ce257f3a3f Mon Sep 17 00:00:00 2001 From: Stanislav Fomichev Date: Thu, 16 Apr 2026 11:57:04 -0700 Subject: [PATCH 080/148] bnxt: convert to ndo_set_rx_mode_async Convert bnxt from ndo_set_rx_mode to ndo_set_rx_mode_async. bnxt_set_rx_mode, bnxt_mc_list_updated and bnxt_uc_list_updated now take explicit uc/mc list parameters and iterate with netdev_hw_addr_list_for_each instead of netdev_for_each_{uc,mc}_addr. The bnxt_cfg_rx_mode internal caller passes the real lists under netif_addr_lock_bh. BNXT_RX_MASK_SP_EVENT is still used here, next patch converts to the direct call. 
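Why passing &dev->uc / &dev->mc from internal callers is behavior-preserving:
the per-device accessors are thin wrappers over the generic list helpers.
Roughly, as defined in include/linux/netdevice.h:

	#define netdev_hw_addr_list_count(l)	((l)->count)
	#define netdev_hw_addr_list_for_each(ha, l) \
		list_for_each_entry(ha, &(l)->list, list)

	#define netdev_uc_count(dev)	netdev_hw_addr_list_count(&(dev)->uc)
	#define netdev_for_each_uc_addr(ha, dev) \
		netdev_hw_addr_list_for_each(ha, &(dev)->uc)

So bnxt_uc_list_updated(bp, &dev->uc) under netif_addr_lock_bh() performs the
same walk the old netdev_for_each_uc_addr() based code did.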
Cc: Michael Chan Cc: Pavan Chebbi Reviewed-by: Michael Chan Reviewed-by: Aleksandr Loktionov Signed-off-by: Stanislav Fomichev Link: https://patch.msgid.link/20260416185712.2155425-8-sdf@fomichev.me Signed-off-by: Paolo Abeni --- drivers/net/ethernet/broadcom/bnxt/bnxt.c | 31 +++++++++++++---------- 1 file changed, 17 insertions(+), 14 deletions(-) diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c index 2715632115a5..61d4a9911413 100644 --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c @@ -11132,7 +11132,8 @@ static int bnxt_setup_nitroa0_vnic(struct bnxt *bp) } static int bnxt_cfg_rx_mode(struct bnxt *); -static bool bnxt_mc_list_updated(struct bnxt *, u32 *); +static bool bnxt_mc_list_updated(struct bnxt *, u32 *, + const struct netdev_hw_addr_list *); static int bnxt_init_chip(struct bnxt *bp, bool irq_re_init) { @@ -11222,7 +11223,7 @@ static int bnxt_init_chip(struct bnxt *bp, bool irq_re_init) } else if (bp->dev->flags & IFF_MULTICAST) { u32 mask = 0; - bnxt_mc_list_updated(bp, &mask); + bnxt_mc_list_updated(bp, &mask, &bp->dev->mc); vnic->rx_mask |= mask; } @@ -13620,17 +13621,17 @@ void bnxt_get_ring_drv_stats(struct bnxt *bp, bnxt_get_one_ring_drv_stats(bp, stats, &bp->bnapi[i]->cp_ring); } -static bool bnxt_mc_list_updated(struct bnxt *bp, u32 *rx_mask) +static bool bnxt_mc_list_updated(struct bnxt *bp, u32 *rx_mask, + const struct netdev_hw_addr_list *mc) { struct bnxt_vnic_info *vnic = &bp->vnic_info[BNXT_VNIC_DEFAULT]; - struct net_device *dev = bp->dev; struct netdev_hw_addr *ha; u8 *haddr; int mc_count = 0; bool update = false; int off = 0; - netdev_for_each_mc_addr(ha, dev) { + netdev_hw_addr_list_for_each(ha, mc) { if (mc_count >= BNXT_MAX_MC_ADDRS) { *rx_mask |= CFA_L2_SET_RX_MASK_REQ_MASK_ALL_MCAST; vnic->mc_list_count = 0; @@ -13654,17 +13655,17 @@ static bool bnxt_mc_list_updated(struct bnxt *bp, u32 *rx_mask) return update; } -static bool bnxt_uc_list_updated(struct bnxt *bp) +static bool bnxt_uc_list_updated(struct bnxt *bp, + const struct netdev_hw_addr_list *uc) { - struct net_device *dev = bp->dev; struct bnxt_vnic_info *vnic = &bp->vnic_info[BNXT_VNIC_DEFAULT]; struct netdev_hw_addr *ha; int off = 0; - if (netdev_uc_count(dev) != (vnic->uc_filter_count - 1)) + if (netdev_hw_addr_list_count(uc) != (vnic->uc_filter_count - 1)) return true; - netdev_for_each_uc_addr(ha, dev) { + netdev_hw_addr_list_for_each(ha, uc) { if (!ether_addr_equal(ha->addr, vnic->uc_list + off)) return true; @@ -13673,7 +13674,9 @@ static bool bnxt_uc_list_updated(struct bnxt *bp) return false; } -static void bnxt_set_rx_mode(struct net_device *dev) +static void bnxt_set_rx_mode(struct net_device *dev, + struct netdev_hw_addr_list *uc, + struct netdev_hw_addr_list *mc) { struct bnxt *bp = netdev_priv(dev); struct bnxt_vnic_info *vnic; @@ -13694,7 +13697,7 @@ static void bnxt_set_rx_mode(struct net_device *dev) if (dev->flags & IFF_PROMISC) mask |= CFA_L2_SET_RX_MASK_REQ_MASK_PROMISCUOUS; - uc_update = bnxt_uc_list_updated(bp); + uc_update = bnxt_uc_list_updated(bp, uc); if (dev->flags & IFF_BROADCAST) mask |= CFA_L2_SET_RX_MASK_REQ_MASK_BCAST; @@ -13702,7 +13705,7 @@ static void bnxt_set_rx_mode(struct net_device *dev) mask |= CFA_L2_SET_RX_MASK_REQ_MASK_ALL_MCAST; vnic->mc_list_count = 0; } else if (dev->flags & IFF_MULTICAST) { - mc_update = bnxt_mc_list_updated(bp, &mask); + mc_update = bnxt_mc_list_updated(bp, &mask, mc); } if (mask != vnic->rx_mask || uc_update || mc_update) { @@ -13721,7 
+13724,7 @@ static int bnxt_cfg_rx_mode(struct bnxt *bp) bool uc_update; netif_addr_lock_bh(dev); - uc_update = bnxt_uc_list_updated(bp); + uc_update = bnxt_uc_list_updated(bp, &dev->uc); netif_addr_unlock_bh(dev); if (!uc_update) @@ -15986,7 +15989,7 @@ static const struct net_device_ops bnxt_netdev_ops = { .ndo_start_xmit = bnxt_start_xmit, .ndo_stop = bnxt_close, .ndo_get_stats64 = bnxt_get_stats64, - .ndo_set_rx_mode = bnxt_set_rx_mode, + .ndo_set_rx_mode_async = bnxt_set_rx_mode, .ndo_eth_ioctl = bnxt_ioctl, .ndo_validate_addr = eth_validate_addr, .ndo_set_mac_address = bnxt_change_mac_addr, From a453b5d9b3eda72dc93f0934194c2ebad4615e1f Mon Sep 17 00:00:00 2001 From: Stanislav Fomichev Date: Thu, 16 Apr 2026 11:57:05 -0700 Subject: [PATCH 081/148] bnxt: use snapshot in bnxt_cfg_rx_mode With the introduction of ndo_set_rx_mode_async (as discussed in [1]) we can call bnxt_cfg_rx_mode directly. Convert bnxt_cfg_rx_mode to use uc/mc snapshots and move its call in bnxt_sp_task to the section that resets BNXT_STATE_IN_SP_TASK. Switch to direct call in bnxt_set_rx_mode. Link: https://lore.kernel.org/netdev/CACKFLi=5vj8hPqEUKDd8RTw3au5G+zRgQEqjF+6NZnyoNm90KA@mail.gmail.com/ [1] Cc: Michael Chan Cc: Pavan Chebbi Reviewed-by: Michael Chan Signed-off-by: Stanislav Fomichev Link: https://patch.msgid.link/20260416185712.2155425-9-sdf@fomichev.me Signed-off-by: Paolo Abeni --- drivers/net/ethernet/broadcom/bnxt/bnxt.c | 29 ++++++++++++----------- 1 file changed, 15 insertions(+), 14 deletions(-) diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c index 61d4a9911413..79e286621a28 100644 --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c @@ -11131,7 +11131,7 @@ static int bnxt_setup_nitroa0_vnic(struct bnxt *bp) return rc; } -static int bnxt_cfg_rx_mode(struct bnxt *); +static int bnxt_cfg_rx_mode(struct bnxt *, struct netdev_hw_addr_list *, bool); static bool bnxt_mc_list_updated(struct bnxt *, u32 *, const struct netdev_hw_addr_list *); @@ -11227,7 +11227,7 @@ static int bnxt_init_chip(struct bnxt *bp, bool irq_re_init) vnic->rx_mask |= mask; } - rc = bnxt_cfg_rx_mode(bp); + rc = bnxt_cfg_rx_mode(bp, &bp->dev->uc, true); if (rc) goto err_out; @@ -13711,21 +13711,17 @@ static void bnxt_set_rx_mode(struct net_device *dev, if (mask != vnic->rx_mask || uc_update || mc_update) { vnic->rx_mask = mask; - bnxt_queue_sp_work(bp, BNXT_RX_MASK_SP_EVENT); + bnxt_cfg_rx_mode(bp, uc, uc_update); } } -static int bnxt_cfg_rx_mode(struct bnxt *bp) +static int bnxt_cfg_rx_mode(struct bnxt *bp, struct netdev_hw_addr_list *uc, + bool uc_update) { struct net_device *dev = bp->dev; struct bnxt_vnic_info *vnic = &bp->vnic_info[BNXT_VNIC_DEFAULT]; struct netdev_hw_addr *ha; int i, off = 0, rc; - bool uc_update; - - netif_addr_lock_bh(dev); - uc_update = bnxt_uc_list_updated(bp, &dev->uc); - netif_addr_unlock_bh(dev); if (!uc_update) goto skip_uc; @@ -13740,10 +13736,10 @@ static int bnxt_cfg_rx_mode(struct bnxt *bp) vnic->uc_filter_count = 1; netif_addr_lock_bh(dev); - if (netdev_uc_count(dev) > (BNXT_MAX_UC_ADDRS - 1)) { + if (netdev_hw_addr_list_count(uc) > (BNXT_MAX_UC_ADDRS - 1)) { vnic->rx_mask |= CFA_L2_SET_RX_MASK_REQ_MASK_PROMISCUOUS; } else { - netdev_for_each_uc_addr(ha, dev) { + netdev_hw_addr_list_for_each(ha, uc) { memcpy(vnic->uc_list + off, ha->addr, ETH_ALEN); off += ETH_ALEN; vnic->uc_filter_count++; @@ -14709,6 +14705,7 @@ static void bnxt_ulp_restart(struct bnxt *bp) static void bnxt_sp_task(struct work_struct 
*work) { struct bnxt *bp = container_of(work, struct bnxt, sp_task); + struct net_device *dev = bp->dev; set_bit(BNXT_STATE_IN_SP_TASK, &bp->state); smp_mb__after_atomic(); @@ -14722,9 +14719,6 @@ static void bnxt_sp_task(struct work_struct *work) bnxt_reenable_sriov(bp); } - if (test_and_clear_bit(BNXT_RX_MASK_SP_EVENT, &bp->sp_event)) - bnxt_cfg_rx_mode(bp); - if (test_and_clear_bit(BNXT_RX_NTP_FLTR_SP_EVENT, &bp->sp_event)) bnxt_cfg_ntp_filters(bp); if (test_and_clear_bit(BNXT_HWRM_EXEC_FWD_REQ_SP_EVENT, &bp->sp_event)) @@ -14789,6 +14783,13 @@ static void bnxt_sp_task(struct work_struct *work) /* These functions below will clear BNXT_STATE_IN_SP_TASK. They * must be the last functions to be called before exiting. */ + if (test_and_clear_bit(BNXT_RX_MASK_SP_EVENT, &bp->sp_event)) { + bnxt_lock_sp(bp); + if (test_bit(BNXT_STATE_OPEN, &bp->state)) + bnxt_cfg_rx_mode(bp, &dev->uc, true); + bnxt_unlock_sp(bp); + } + if (test_and_clear_bit(BNXT_RESET_TASK_SP_EVENT, &bp->sp_event)) bnxt_reset(bp, false); From d071c15b43e9a58c1ee774ac94f2a635423371d4 Mon Sep 17 00:00:00 2001 From: Stanislav Fomichev Date: Thu, 16 Apr 2026 11:57:06 -0700 Subject: [PATCH 082/148] iavf: convert to ndo_set_rx_mode_async Convert iavf from ndo_set_rx_mode to ndo_set_rx_mode_async. iavf_set_rx_mode now takes explicit uc/mc list parameters and uses __hw_addr_sync_dev on the snapshots instead of __dev_uc_sync and __dev_mc_sync. The iavf_configure internal caller passes the real lists directly. Cc: Tony Nguyen Cc: Przemek Kitszel Signed-off-by: Stanislav Fomichev Link: https://patch.msgid.link/20260416185712.2155425-10-sdf@fomichev.me Signed-off-by: Paolo Abeni --- drivers/net/ethernet/intel/iavf/iavf_main.c | 16 +++++++++++----- 1 file changed, 11 insertions(+), 5 deletions(-) diff --git a/drivers/net/ethernet/intel/iavf/iavf_main.c b/drivers/net/ethernet/intel/iavf/iavf_main.c index dad001abc908..3c1465cf0515 100644 --- a/drivers/net/ethernet/intel/iavf/iavf_main.c +++ b/drivers/net/ethernet/intel/iavf/iavf_main.c @@ -1150,14 +1150,18 @@ bool iavf_promiscuous_mode_changed(struct iavf_adapter *adapter) /** * iavf_set_rx_mode - NDO callback to set the netdev filters * @netdev: network interface device structure + * @uc: snapshot of uc address list + * @mc: snapshot of mc address list **/ -static void iavf_set_rx_mode(struct net_device *netdev) +static void iavf_set_rx_mode(struct net_device *netdev, + struct netdev_hw_addr_list *uc, + struct netdev_hw_addr_list *mc) { struct iavf_adapter *adapter = netdev_priv(netdev); spin_lock_bh(&adapter->mac_vlan_list_lock); - __dev_uc_sync(netdev, iavf_addr_sync, iavf_addr_unsync); - __dev_mc_sync(netdev, iavf_addr_sync, iavf_addr_unsync); + __hw_addr_sync_dev(uc, netdev, iavf_addr_sync, iavf_addr_unsync); + __hw_addr_sync_dev(mc, netdev, iavf_addr_sync, iavf_addr_unsync); spin_unlock_bh(&adapter->mac_vlan_list_lock); spin_lock_bh(&adapter->current_netdev_promisc_flags_lock); @@ -1210,7 +1214,9 @@ static void iavf_configure(struct iavf_adapter *adapter) struct net_device *netdev = adapter->netdev; int i; - iavf_set_rx_mode(netdev); + netif_addr_lock_bh(netdev); + iavf_set_rx_mode(netdev, &netdev->uc, &netdev->mc); + netif_addr_unlock_bh(netdev); iavf_configure_tx(adapter); iavf_configure_rx(adapter); @@ -5153,7 +5159,7 @@ static const struct net_device_ops iavf_netdev_ops = { .ndo_open = iavf_open, .ndo_stop = iavf_close, .ndo_start_xmit = iavf_xmit_frame, - .ndo_set_rx_mode = iavf_set_rx_mode, + .ndo_set_rx_mode_async = iavf_set_rx_mode, .ndo_validate_addr = eth_validate_addr, 
.ndo_set_mac_address = iavf_set_mac, .ndo_change_mtu = iavf_change_mtu, From 8a5df09e70c2260ad05125a6199229faf20f8671 Mon Sep 17 00:00:00 2001 From: Stanislav Fomichev Date: Thu, 16 Apr 2026 11:57:07 -0700 Subject: [PATCH 083/148] netdevsim: convert to ndo_set_rx_mode_async Convert netdevsim from ndo_set_rx_mode to ndo_set_rx_mode_async. The callback is a no-op stub so just update the signature and ops struct wiring. Reviewed-by: Breno Leitao Signed-off-by: Stanislav Fomichev Link: https://patch.msgid.link/20260416185712.2155425-11-sdf@fomichev.me Signed-off-by: Paolo Abeni --- drivers/net/netdevsim/netdev.c | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/drivers/net/netdevsim/netdev.c b/drivers/net/netdevsim/netdev.c index e1541ca76715..a05af192caf3 100644 --- a/drivers/net/netdevsim/netdev.c +++ b/drivers/net/netdevsim/netdev.c @@ -185,7 +185,9 @@ static netdev_tx_t nsim_start_xmit(struct sk_buff *skb, struct net_device *dev) return NETDEV_TX_OK; } -static void nsim_set_rx_mode(struct net_device *dev) +static void nsim_set_rx_mode(struct net_device *dev, + struct netdev_hw_addr_list *uc, + struct netdev_hw_addr_list *mc) { } @@ -623,7 +625,7 @@ static const struct net_shaper_ops nsim_shaper_ops = { static const struct net_device_ops nsim_netdev_ops = { .ndo_start_xmit = nsim_start_xmit, - .ndo_set_rx_mode = nsim_set_rx_mode, + .ndo_set_rx_mode_async = nsim_set_rx_mode, .ndo_set_mac_address = eth_mac_addr, .ndo_validate_addr = eth_validate_addr, .ndo_change_mtu = nsim_change_mtu, @@ -648,7 +650,7 @@ static const struct net_device_ops nsim_netdev_ops = { static const struct net_device_ops nsim_vf_netdev_ops = { .ndo_start_xmit = nsim_start_xmit, - .ndo_set_rx_mode = nsim_set_rx_mode, + .ndo_set_rx_mode_async = nsim_set_rx_mode, .ndo_set_mac_address = eth_mac_addr, .ndo_validate_addr = eth_validate_addr, .ndo_change_mtu = nsim_change_mtu, From 4d157e89bde4bec79c0aaf91347691310c4b2a11 Mon Sep 17 00:00:00 2001 From: Stanislav Fomichev Date: Thu, 16 Apr 2026 11:57:08 -0700 Subject: [PATCH 084/148] dummy: convert to ndo_set_rx_mode_async Convert dummy driver from ndo_set_rx_mode to ndo_set_rx_mode_async. The dummy driver's set_multicast_list is a no-op, so the conversion is straightforward: update the signature and the ops assignment. 
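For drivers that do sync addresses (unlike the no-op stubs), the conversions
in this series share one schematic shape: the callback grows the two snapshot
parameters, and __dev_uc_sync()/__dev_mc_sync() become snapshot-based
__hw_addr_sync_dev() calls. A sketch only - drv_set_rx_mode, drv_sync and
drv_unsync are placeholder names; see the fbnic and iavf patches above for
real instances:

	-static void drv_set_rx_mode(struct net_device *netdev)
	+static void drv_set_rx_mode(struct net_device *netdev,
	+			    struct netdev_hw_addr_list *uc,
	+			    struct netdev_hw_addr_list *mc)
	 {
	-	__dev_uc_sync(netdev, drv_sync, drv_unsync);
	-	__dev_mc_sync(netdev, drv_sync, drv_unsync);
	+	__hw_addr_sync_dev(uc, netdev, drv_sync, drv_unsync);
	+	__hw_addr_sync_dev(mc, netdev, drv_sync, drv_unsync);
	 }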
Reviewed-by: Aleksandr Loktionov Signed-off-by: Stanislav Fomichev Link: https://patch.msgid.link/20260416185712.2155425-12-sdf@fomichev.me Signed-off-by: Paolo Abeni --- drivers/net/dummy.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/drivers/net/dummy.c b/drivers/net/dummy.c index d6bdad4baadd..f8a4eb365c3d 100644 --- a/drivers/net/dummy.c +++ b/drivers/net/dummy.c @@ -47,7 +47,9 @@ static int numdummies = 1; /* fake multicast ability */ -static void set_multicast_list(struct net_device *dev) +static void set_multicast_list(struct net_device *dev, + struct netdev_hw_addr_list *uc, + struct netdev_hw_addr_list *mc) { } @@ -87,7 +89,7 @@ static const struct net_device_ops dummy_netdev_ops = { .ndo_init = dummy_dev_init, .ndo_start_xmit = dummy_xmit, .ndo_validate_addr = eth_validate_addr, - .ndo_set_rx_mode = set_multicast_list, + .ndo_set_rx_mode_async = set_multicast_list, .ndo_set_mac_address = eth_mac_addr, .ndo_get_stats64 = dummy_get_stats64, .ndo_change_carrier = dummy_change_carrier, From 754b7e1169a7190760dd66cea026f1b4bd7acc54 Mon Sep 17 00:00:00 2001 From: Stanislav Fomichev Date: Thu, 16 Apr 2026 11:57:09 -0700 Subject: [PATCH 085/148] netkit: convert to ndo_set_rx_mode_async Convert netkit driver from ndo_set_rx_mode to ndo_set_rx_mode_async. The netkit driver's set_multicast_list is a no-op, presumably for the same reason as the one in dummy? (fake multicast ability) Signed-off-by: Stanislav Fomichev Link: https://patch.msgid.link/20260416185712.2155425-13-sdf@fomichev.me Signed-off-by: Paolo Abeni --- drivers/net/netkit.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/drivers/net/netkit.c b/drivers/net/netkit.c index 7b56a7ad7a49..5e2eecc3165d 100644 --- a/drivers/net/netkit.c +++ b/drivers/net/netkit.c @@ -186,7 +186,9 @@ static int netkit_get_iflink(const struct net_device *dev) return iflink; } -static void netkit_set_multicast(struct net_device *dev) +static void netkit_set_multicast(struct net_device *dev, + struct netdev_hw_addr_list *uc, + struct netdev_hw_addr_list *mc) { /* Nothing to do, we receive whatever gets pushed to us! */ } @@ -330,7 +332,7 @@ static const struct net_device_ops netkit_netdev_ops = { .ndo_open = netkit_open, .ndo_stop = netkit_close, .ndo_start_xmit = netkit_xmit, - .ndo_set_rx_mode = netkit_set_multicast, + .ndo_set_rx_mode_async = netkit_set_multicast, .ndo_set_rx_headroom = netkit_set_headroom, .ndo_set_mac_address = netkit_set_macaddr, .ndo_get_iflink = netkit_get_iflink, From 3cbd22938877e0c0c7c193a91d5a6d5149c39490 Mon Sep 17 00:00:00 2001 From: Stanislav Fomichev Date: Thu, 16 Apr 2026 11:57:10 -0700 Subject: [PATCH 086/148] net: warn ops-locked drivers still using ndo_set_rx_mode Now that all in-tree ops-locked drivers have been converted to ndo_set_rx_mode_async, add a warning in register_netdevice to catch any remaining or newly added drivers that use ndo_set_rx_mode with ops locking. This ensures future driver authors are guided toward the async path. Also route ops-locked devices through netdev_rx_mode_work even if they lack rx_mode NDOs, to ensure netdev_ops_assert_locked() does not fire on the legacy path where only RTNL is held. 
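Concretely, the new check catches ops wiring like the following sketch (the
foo_* names are hypothetical). The first form now triggers netdev_WARN() in
register_netdevice() for ops-locked devices; the second is the expected
replacement:

	/* before: legacy hook on an ops-locked device -> warns */
	static const struct net_device_ops foo_netdev_ops = {
		.ndo_set_rx_mode	= foo_set_rx_mode,
	};

	/* after: async hook taking uc/mc snapshots -> no warning */
	static const struct net_device_ops foo_netdev_ops = {
		.ndo_set_rx_mode_async	= foo_set_rx_mode_async,
	};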
Reviewed-by: Aleksandr Loktionov Signed-off-by: Stanislav Fomichev Link: https://patch.msgid.link/20260416185712.2155425-14-sdf@fomichev.me Signed-off-by: Paolo Abeni --- net/core/dev.c | 5 +++++ net/core/dev_addr_lists.c | 3 ++- 2 files changed, 7 insertions(+), 1 deletion(-) diff --git a/net/core/dev.c b/net/core/dev.c index 8a69aed56fca..d426c1beeb76 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -11360,6 +11360,11 @@ int register_netdevice(struct net_device *dev) goto err_uninit; } + if (netdev_need_ops_lock(dev) && + dev->netdev_ops->ndo_set_rx_mode && + !dev->netdev_ops->ndo_set_rx_mode_async) + netdev_WARN(dev, "ops-locked drivers should use ndo_set_rx_mode_async\n"); + ret = netdev_do_alloc_pcpu_stats(dev); if (ret) goto err_uninit; diff --git a/net/core/dev_addr_lists.c b/net/core/dev_addr_lists.c index 4c9e8a69493f..d73fcb0c6785 100644 --- a/net/core/dev_addr_lists.c +++ b/net/core/dev_addr_lists.c @@ -1370,7 +1370,8 @@ void __dev_set_rx_mode(struct net_device *dev) if (!netif_device_present(dev)) return; - if (ops->ndo_set_rx_mode_async || ops->ndo_change_rx_flags) { + if (ops->ndo_set_rx_mode_async || ops->ndo_change_rx_flags || + netdev_need_ops_lock(dev)) { netif_rx_mode_queue(dev); return; } From ee514cdb07b330b2a9012e5373d26959fbbc4f81 Mon Sep 17 00:00:00 2001 From: Stanislav Fomichev Date: Thu, 16 Apr 2026 11:57:11 -0700 Subject: [PATCH 087/148] selftests: net: add team_bridge_macvlan rx_mode test Add a test that exercises the ndo_change_rx_flags path through a macvlan -> bridge -> team -> dummy stack. This triggers dev_uc_add under addr_list_lock which flips promiscuity on the lower device. With the new work queue approach, this must not deadlock. Link: https://lore.kernel.org/netdev/20260214033859.43857-1-jiayuan.chen@linux.dev/ Reviewed-by: Breno Leitao Signed-off-by: Stanislav Fomichev Link: https://patch.msgid.link/20260416185712.2155425-15-sdf@fomichev.me Signed-off-by: Paolo Abeni --- tools/testing/selftests/net/config | 1 + tools/testing/selftests/net/rtnetlink.sh | 44 ++++++++++++++++++++++++ 2 files changed, 45 insertions(+) diff --git a/tools/testing/selftests/net/config b/tools/testing/selftests/net/config index 2a390cae41bf..94d722770420 100644 --- a/tools/testing/selftests/net/config +++ b/tools/testing/selftests/net/config @@ -101,6 +101,7 @@ CONFIG_NET_SCH_HTB=m CONFIG_NET_SCH_INGRESS=m CONFIG_NET_SCH_NETEM=y CONFIG_NET_SCH_PRIO=m +CONFIG_NET_TEAM=y CONFIG_NET_VRF=y CONFIG_NF_CONNTRACK=m CONFIG_NF_CONNTRACK_OVS=y diff --git a/tools/testing/selftests/net/rtnetlink.sh b/tools/testing/selftests/net/rtnetlink.sh index 5a5ff88321d5..c499953d4885 100755 --- a/tools/testing/selftests/net/rtnetlink.sh +++ b/tools/testing/selftests/net/rtnetlink.sh @@ -23,6 +23,7 @@ ALL_TESTS=" kci_test_encap kci_test_macsec kci_test_macsec_vlan + kci_test_team_bridge_macvlan kci_test_ipsec kci_test_ipsec_offload kci_test_fdb_get @@ -636,6 +637,49 @@ kci_test_macsec_vlan() end_test "PASS: macsec_vlan" } +# Test ndo_change_rx_flags call from dev_uc_add under addr_list_lock spinlock. +# When we are flipping the promisc, make sure it runs on the work queue. +# +# https://lore.kernel.org/netdev/20260214033859.43857-1-jiayuan.chen@linux.dev/ +# With (more conventional) macvlan instead of macsec. 
+# macvlan -> bridge -> team -> dummy +kci_test_team_bridge_macvlan() +{ + local vlan="test_macv1" + local bridge="test_br1" + local team="test_team1" + local dummy="test_dummy1" + local ret=0 + + run_cmd ip link add $team type team + if [ $ret -ne 0 ]; then + end_test "SKIP: team_bridge_macvlan: can't add team interface" + return $ksft_skip + fi + + run_cmd ip link add $dummy type dummy + run_cmd ip link set $dummy master $team + run_cmd ip link set $team up + run_cmd ip link add $bridge type bridge vlan_filtering 1 + run_cmd ip link set $bridge up + run_cmd ip link set $team master $bridge + run_cmd ip link add link $bridge name $vlan \ + address 00:aa:bb:cc:dd:ee type macvlan mode bridge + run_cmd ip link set $vlan up + + run_cmd ip link del $vlan + run_cmd ip link del $bridge + run_cmd ip link del $team + run_cmd ip link del $dummy + + if [ $ret -ne 0 ]; then + end_test "FAIL: team_bridge_macvlan" + return 1 + fi + + end_test "PASS: team_bridge_macvlan" +} + #------------------------------------------------------------------- # Example commands # ip x s add proto esp src 14.0.0.52 dst 14.0.0.70 \ From c4dde411bc366f568dbe33366253bbfea049e8ea Mon Sep 17 00:00:00 2001 From: Stanislav Fomichev Date: Thu, 16 Apr 2026 11:57:12 -0700 Subject: [PATCH 088/148] selftests: net: use ip commands instead of teamd in team rx_mode test Replace teamd daemon usage with ip link commands for team device setup. teamd -d daemonizes and returns to the shell before port addition completes, creating a race: the test may create the macvlan (and check for its address on a slave) before teamd has finished adding ports. This makes the test inherently dependent on scheduling timing. Using ip commands makes port addition synchronous, removing the race and making the test deterministic. Cc: Jiri Pirko Cc: Jay Vosburgh Signed-off-by: Stanislav Fomichev Link: https://patch.msgid.link/20260416185712.2155425-16-sdf@fomichev.me Signed-off-by: Paolo Abeni --- .../selftests/drivers/net/bonding/lag_lib.sh | 17 +++-------------- .../drivers/net/team/dev_addr_lists.sh | 2 -- 2 files changed, 3 insertions(+), 16 deletions(-) diff --git a/tools/testing/selftests/drivers/net/bonding/lag_lib.sh b/tools/testing/selftests/drivers/net/bonding/lag_lib.sh index bf9bcd1b5ec0..f2e43b6c4c81 100644 --- a/tools/testing/selftests/drivers/net/bonding/lag_lib.sh +++ b/tools/testing/selftests/drivers/net/bonding/lag_lib.sh @@ -23,20 +23,9 @@ test_LAG_cleanup() ip link set dev dummy2 master "$name" elif [ "$driver" = "team" ]; then name="team0" - teamd -d -c ' - { - "device": "'"$name"'", - "runner": { - "name": "'"$mode"'" - }, - "ports": { - "dummy1": - {}, - "dummy2": - {} - } - } - ' + ip link add "$name" type team + ip link set dev dummy1 master "$name" + ip link set dev dummy2 master "$name" ip link set dev "$name" up else check_err 1 diff --git a/tools/testing/selftests/drivers/net/team/dev_addr_lists.sh b/tools/testing/selftests/drivers/net/team/dev_addr_lists.sh index b1ec7755b783..26469f3be022 100755 --- a/tools/testing/selftests/drivers/net/team/dev_addr_lists.sh +++ b/tools/testing/selftests/drivers/net/team/dev_addr_lists.sh @@ -42,8 +42,6 @@ team_cleanup() } -require_command teamd - trap cleanup EXIT tests_run From d647f2545219754603b2064de948425cdfd93fba Mon Sep 17 00:00:00 2001 From: Lorenzo Bianconi Date: Fri, 17 Apr 2026 17:24:41 +0200 Subject: [PATCH 089/148] net: airoha: Fix PPE cpu port configuration for GDM2 loopback path When QoS loopback is enabled for GDM3 or GDM4, incoming packets are forwarded to GDM2. 
However, the PPE cpu port for GDM2 is not configured in this path, causing traffic originating from GDM3/GDM4, which may be set up as WAN ports backed by QDMA1, to be incorrectly directed to QDMA0 instead. Configure the PPE cpu port for GDM2 when QoS loopback is active on GDM3 or GDM4 to ensure traffic is routed to the correct QDMA instance. Fixes: 9cd451d414f6 ("net: airoha: Add loopback support for GDM2") Signed-off-by: Lorenzo Bianconi Link: https://patch.msgid.link/20260417-airoha-ppe-cpu-port-for-gdm2-loopback-v1-1-c7a9de0f6f57@kernel.org Signed-off-by: Paolo Abeni --- drivers/net/ethernet/airoha/airoha_eth.c | 8 ++++++-- drivers/net/ethernet/airoha/airoha_eth.h | 3 ++- drivers/net/ethernet/airoha/airoha_ppe.c | 6 +++--- 3 files changed, 11 insertions(+), 6 deletions(-) diff --git a/drivers/net/ethernet/airoha/airoha_eth.c b/drivers/net/ethernet/airoha/airoha_eth.c index 19f67c7dd8e1..376d91df5441 100644 --- a/drivers/net/ethernet/airoha/airoha_eth.c +++ b/drivers/net/ethernet/airoha/airoha_eth.c @@ -1751,7 +1751,7 @@ static int airoha_set_gdm2_loopback(struct airoha_gdm_port *port) { struct airoha_eth *eth = port->qdma->eth; u32 val, pse_port, chan; - int src_port; + int i, src_port; /* Forward the traffic to the proper GDM port */ pse_port = port->id == AIROHA_GDM3_IDX ? FE_PSE_PORT_GDM3 @@ -1793,6 +1793,9 @@ static int airoha_set_gdm2_loopback(struct airoha_gdm_port *port) SP_CPORT_MASK(val), __field_prep(SP_CPORT_MASK(val), FE_PSE_PORT_CDM2)); + for (i = 0; i < eth->soc->num_ppe; i++) + airoha_ppe_set_cpu_port(port, i, AIROHA_GDM2_IDX); + if (port->id == AIROHA_GDM4_IDX && airoha_is_7581(eth)) { u32 mask = FC_ID_OF_SRC_PORT_MASK(port->nbq); @@ -1831,7 +1834,8 @@ static int airoha_dev_init(struct net_device *dev) } for (i = 0; i < eth->soc->num_ppe; i++) - airoha_ppe_set_cpu_port(port, i); + airoha_ppe_set_cpu_port(port, i, + airoha_get_fe_port(port)); return 0; } diff --git a/drivers/net/ethernet/airoha/airoha_eth.h b/drivers/net/ethernet/airoha/airoha_eth.h index 87b328cfefb0..e389d2fe3b86 100644 --- a/drivers/net/ethernet/airoha/airoha_eth.h +++ b/drivers/net/ethernet/airoha/airoha_eth.h @@ -654,7 +654,8 @@ int airoha_get_fe_port(struct airoha_gdm_port *port); bool airoha_is_valid_gdm_port(struct airoha_eth *eth, struct airoha_gdm_port *port); -void airoha_ppe_set_cpu_port(struct airoha_gdm_port *port, u8 ppe_id); +void airoha_ppe_set_cpu_port(struct airoha_gdm_port *port, u8 ppe_id, + u8 fport); bool airoha_ppe_is_enabled(struct airoha_eth *eth, int index); void airoha_ppe_check_skb(struct airoha_ppe_dev *dev, struct sk_buff *skb, u16 hash, bool rx_wlan); diff --git a/drivers/net/ethernet/airoha/airoha_ppe.c b/drivers/net/ethernet/airoha/airoha_ppe.c index 859818676b69..5c9dff6bccd1 100644 --- a/drivers/net/ethernet/airoha/airoha_ppe.c +++ b/drivers/net/ethernet/airoha/airoha_ppe.c @@ -85,10 +85,9 @@ static u32 airoha_ppe_get_timestamp(struct airoha_ppe *ppe) return FIELD_GET(AIROHA_FOE_IB1_BIND_TIMESTAMP, timestamp); } -void airoha_ppe_set_cpu_port(struct airoha_gdm_port *port, u8 ppe_id) +void airoha_ppe_set_cpu_port(struct airoha_gdm_port *port, u8 ppe_id, u8 fport) { struct airoha_qdma *qdma = port->qdma; - u8 fport = airoha_get_fe_port(port); struct airoha_eth *eth = qdma->eth; u8 qdma_id = qdma - ð->qdma[0]; u32 fe_cpu_port; @@ -182,7 +181,8 @@ static void airoha_ppe_hw_init(struct airoha_ppe *ppe) if (!port) continue; - airoha_ppe_set_cpu_port(port, i); + airoha_ppe_set_cpu_port(port, i, + airoha_get_fe_port(port)); } } } From 478ed6b7d2577439c610f91fa8759a4c878a4264 Mon 
Sep 17 00:00:00 2001 From: Chia-Yu Chang Date: Fri, 17 Apr 2026 17:25:51 +0200 Subject: [PATCH 090/148] net/sched: sch_dualpi2: drain both C-queue and L-queue in dualpi2_change() Fix dualpi2_change() to correctly enforce updated limit and memlimit values after a configuration change of the dualpi2 qdisc. Before this patch, dualpi2_change() always attempted to dequeue packets via the root qdisc (C-queue) when reducing backlog or memory usage, and unconditionally assumed that a valid skb will be returned. When traffic classification results in packets being queued in the L-queue while the C-queue is empty, this leads to a NULL skb dereference during limit or memlimit enforcement. This is fixed by first dequeuing from the C-queue path if it is non-empty. Once the C-queue is empty, packets are dequeued directly from the L-queue. Return values from qdisc_dequeue_internal() are checked for both queues. When dequeuing from the L-queue, the parent qdisc qlen and backlog counters are updated explicitly to keep overall qdisc statistics consistent. Fixes: 320d031ad6e4 ("sched: Struct definition and parsing of dualpi2 qdisc") Reported-by: "Kito Xu (veritas501)" Closes: https://lore.kernel.org/netdev/20260413075740.2234828-1-hxzene@gmail.com/ Signed-off-by: Chia-Yu Chang Link: https://patch.msgid.link/20260417152551.71648-1-chia-yu.chang@nokia-bell-labs.com Signed-off-by: Paolo Abeni --- net/sched/sch_dualpi2.c | 32 ++++++++++++++++++++++++++++---- 1 file changed, 28 insertions(+), 4 deletions(-) diff --git a/net/sched/sch_dualpi2.c b/net/sched/sch_dualpi2.c index fe6f5e889625..241e6a46bd00 100644 --- a/net/sched/sch_dualpi2.c +++ b/net/sched/sch_dualpi2.c @@ -868,11 +868,35 @@ static int dualpi2_change(struct Qdisc *sch, struct nlattr *opt, old_backlog = sch->qstats.backlog; while (qdisc_qlen(sch) > sch->limit || q->memory_used > q->memory_limit) { - struct sk_buff *skb = qdisc_dequeue_internal(sch, true); + struct sk_buff *skb = NULL; - q->memory_used -= skb->truesize; - qdisc_qstats_backlog_dec(sch, skb); - rtnl_qdisc_drop(skb, sch); + if (qdisc_qlen(sch) > qdisc_qlen(q->l_queue)) { + skb = qdisc_dequeue_internal(sch, true); + if (unlikely(!skb)) { + WARN_ON_ONCE(1); + break; + } + q->memory_used -= skb->truesize; + rtnl_qdisc_drop(skb, sch); + } else if (qdisc_qlen(q->l_queue)) { + skb = qdisc_dequeue_internal(q->l_queue, true); + if (unlikely(!skb)) { + WARN_ON_ONCE(1); + break; + } + /* L-queue packets are counted in both sch and + * l_queue on enqueue; qdisc_dequeue_internal() + * handled l_queue, so we further account for sch. + */ + --sch->q.qlen; + qdisc_qstats_backlog_dec(sch, skb); + q->memory_used -= skb->truesize; + rtnl_qdisc_drop(skb, q->l_queue); + qdisc_qstats_drop(sch); + } else { + WARN_ON_ONCE(1); + break; + } } qdisc_tree_reduce_backlog(sch, old_qlen - qdisc_qlen(sch), old_backlog - sch->qstats.backlog); From 3bfcf396081ace536733b454ff128d53116581e5 Mon Sep 17 00:00:00 2001 From: Kohei Enju Date: Mon, 20 Apr 2026 10:54:23 +0000 Subject: [PATCH 091/148] net: validate skb->napi_id in RX tracepoints Since commit 2bd82484bb4c ("xps: fix xps for stacked devices"), skb->napi_id shares storage with sender_cpu. RX tracepoints using net_dev_rx_verbose_template read skb->napi_id directly and can therefore report sender_cpu values as if they were NAPI IDs. 
For example, on the loopback path this can report 1 as napi_id, where 1 comes from raw_smp_processor_id() + 1 in the XPS path: # bpftrace -e 'tracepoint:net:netif_rx_entry{ print(args->napi_id); }' # taskset -c 0 ping -c 1 ::1 Report only valid NAPI IDs in these tracepoints and use 0 otherwise. Fixes: 2bd82484bb4c ("xps: fix xps for stacked devices") Signed-off-by: Kohei Enju Reviewed-by: Simon Horman Reviewed-by: Jiayuan Chen Link: https://patch.msgid.link/20260420105427.162816-1-kohei@enjuk.jp Signed-off-by: Jakub Kicinski --- include/trace/events/net.h | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/include/trace/events/net.h b/include/trace/events/net.h index fdd9ad474ce3..dbc2c5598e35 100644 --- a/include/trace/events/net.h +++ b/include/trace/events/net.h @@ -10,6 +10,7 @@ #include #include #include +#include TRACE_EVENT(net_dev_start_xmit, @@ -208,7 +209,8 @@ DECLARE_EVENT_CLASS(net_dev_rx_verbose_template, TP_fast_assign( __assign_str(name); #ifdef CONFIG_NET_RX_BUSY_POLL - __entry->napi_id = skb->napi_id; + __entry->napi_id = napi_id_valid(skb->napi_id) ? + skb->napi_id : 0; #else __entry->napi_id = 0; #endif From 2c054e17d9d41f1020376806c7f750834ced4dc5 Mon Sep 17 00:00:00 2001 From: Bingquan Chen Date: Sat, 18 Apr 2026 19:20:06 +0800 Subject: [PATCH 092/148] net/packet: fix TOCTOU race on mmap'd vnet_hdr in tpacket_snd() In tpacket_snd(), when PACKET_VNET_HDR is enabled, vnet_hdr points directly into the mmap'd TX ring buffer shared with userspace. The kernel validates the header via __packet_snd_vnet_parse() but then re-reads all fields later in virtio_net_hdr_to_skb(). A concurrent userspace thread can modify the vnet_hdr fields between validation and use, bypassing all safety checks. The non-TPACKET path (packet_snd()) already correctly copies vnet_hdr to a stack-local variable. All other vnet_hdr consumers in the kernel (tun.c, tap.c, virtio_net.c) also use stack copies. The TPACKET TX path is the only caller of virtio_net_hdr_to_skb() that reads directly from user-controlled shared memory. Fix this by copying vnet_hdr from the mmap'd ring buffer to a stack-local variable before validation and use, consistent with the approach used in packet_snd() and all other callers. 
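To make the race window concrete, the pre-patch shape of the bug was
(simplified from tpacket_snd(); 'data' points into the mmap'd TX ring, so
userspace can rewrite it at any point):

	vnet_hdr = data;				/* shared ring memory */
	...
	__packet_snd_vnet_parse(vnet_hdr, tp_len);	/* fetch #1: validate */
	...
	/* userspace may rewrite *vnet_hdr here */
	virtio_net_hdr_to_skb(skb, vnet_hdr, vio_le());	/* fetch #2: use */

Copying the header to a stack variable before validation collapses both
fetches onto one immutable snapshot, closing the double-fetch.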
Fixes: 1d036d25e560 ("packet: tpacket_snd gso and checksum offload") Signed-off-by: Bingquan Chen Reviewed-by: Willem de Bruijn Link: https://patch.msgid.link/20260418112006.78823-1-patzilla007@gmail.com Signed-off-by: Jakub Kicinski --- net/packet/af_packet.c | 21 +++++++++++++-------- 1 file changed, 13 insertions(+), 8 deletions(-) diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c index 4b043241fd56..8e6f3a734ba0 100644 --- a/net/packet/af_packet.c +++ b/net/packet/af_packet.c @@ -2718,7 +2718,8 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg) { struct sk_buff *skb = NULL; struct net_device *dev; - struct virtio_net_hdr *vnet_hdr = NULL; + struct virtio_net_hdr vnet_hdr; + bool has_vnet_hdr = false; struct sockcm_cookie sockc; __be16 proto; int err, reserve = 0; @@ -2819,16 +2820,20 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg) hlen = LL_RESERVED_SPACE(dev); tlen = dev->needed_tailroom; if (vnet_hdr_sz) { - vnet_hdr = data; data += vnet_hdr_sz; tp_len -= vnet_hdr_sz; - if (tp_len < 0 || - __packet_snd_vnet_parse(vnet_hdr, tp_len)) { + if (tp_len < 0) { + tp_len = -EINVAL; + goto tpacket_error; + } + memcpy(&vnet_hdr, data - vnet_hdr_sz, sizeof(vnet_hdr)); + if (__packet_snd_vnet_parse(&vnet_hdr, tp_len)) { tp_len = -EINVAL; goto tpacket_error; } copylen = __virtio16_to_cpu(vio_le(), - vnet_hdr->hdr_len); + vnet_hdr.hdr_len); + has_vnet_hdr = true; } copylen = max_t(int, copylen, dev->hard_header_len); skb = sock_alloc_send_skb(&po->sk, @@ -2865,12 +2870,12 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg) } } - if (vnet_hdr_sz) { - if (virtio_net_hdr_to_skb(skb, vnet_hdr, vio_le())) { + if (has_vnet_hdr) { + if (virtio_net_hdr_to_skb(skb, &vnet_hdr, vio_le())) { tp_len = -EINVAL; goto tpacket_error; } - virtio_net_hdr_set_proto(skb, vnet_hdr); + virtio_net_hdr_set_proto(skb, &vnet_hdr); } skb->destructor = tpacket_destruct_skb; From 645d044d7e5c99079ff2b134bd35d3301be2ec79 Mon Sep 17 00:00:00 2001 From: Ariful Islam Shoikot Date: Mon, 20 Apr 2026 17:45:53 +0600 Subject: [PATCH 093/148] docs: maintainer-netdev: fix typo in "targeting" Fix spelling mistake "targgeting" -> "targeting" in maintainer-netdev.rst No functional change. Signed-off-by: Ariful Islam Shoikot Reviewed-by: Breno Leitao Link: https://patch.msgid.link/20260420114554.1026-1-islamarifulshoikat@gmail.com Signed-off-by: Jakub Kicinski --- Documentation/process/maintainer-netdev.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Documentation/process/maintainer-netdev.rst b/Documentation/process/maintainer-netdev.rst index bda93b459a05..ec7b9aa2877f 100644 --- a/Documentation/process/maintainer-netdev.rst +++ b/Documentation/process/maintainer-netdev.rst @@ -528,7 +528,7 @@ The exact rules a driver must follow to acquire the ``Supported`` status: status will be withdrawn. 5. Test failures due to bugs either in the driver or the test itself, - or lack of support for the feature the test is targgeting are + or lack of support for the feature the test is targeting are *not* a basis for losing the ``Supported`` status. netdev CI will maintain an official page of supported devices, listing their From 70d7c905a07ae8415b955569620bf2bf77423553 Mon Sep 17 00:00:00 2001 From: Vikas Gupta Date: Sat, 18 Apr 2026 08:04:37 +0530 Subject: [PATCH 094/148] bnge: fix initial HWRM sequence Firmware may not advertise correct resources if backing store is not enabled before resource information is queried.
Fix the initial sequence of HWRMs so that driver gets capabilities and resource information correctly. Fixes: 3fa9e977a0cd ("bng_en: Initialize default configuration") Signed-off-by: Vikas Gupta Reviewed-by: Rahul Gupta Link: https://patch.msgid.link/20260418023438.1597876-2-vikas.gupta@broadcom.com Signed-off-by: Jakub Kicinski --- .../net/ethernet/broadcom/bnge/bnge_core.c | 30 ++++++++++++++----- 1 file changed, 22 insertions(+), 8 deletions(-) diff --git a/drivers/net/ethernet/broadcom/bnge/bnge_core.c b/drivers/net/ethernet/broadcom/bnge/bnge_core.c index 1c14c5fe8d61..68b74eb2c3a2 100644 --- a/drivers/net/ethernet/broadcom/bnge/bnge_core.c +++ b/drivers/net/ethernet/broadcom/bnge/bnge_core.c @@ -74,6 +74,13 @@ static int bnge_func_qcaps(struct bnge_dev *bd) return rc; } + return 0; +} + +static int bnge_func_qrcaps_qcfg(struct bnge_dev *bd) +{ + int rc; + rc = bnge_hwrm_func_resc_qcaps(bd); if (rc) { dev_err(bd->dev, "query resc caps failure rc: %d\n", rc); @@ -133,23 +140,28 @@ static int bnge_fw_register_dev(struct bnge_dev *bd) bnge_hwrm_fw_set_time(bd); - rc = bnge_hwrm_func_drv_rgtr(bd); + /* Get the resources and configuration from firmware */ + rc = bnge_func_qcaps(bd); if (rc) { - dev_err(bd->dev, "Failed to rgtr with firmware rc: %d\n", rc); + dev_err(bd->dev, "Failed querying caps rc: %d\n", rc); return rc; } rc = bnge_alloc_ctx_mem(bd); if (rc) { dev_err(bd->dev, "Failed to allocate ctx mem rc: %d\n", rc); - goto err_func_unrgtr; + goto err_free_ctx_mem; } - /* Get the resources and configuration from firmware */ - rc = bnge_func_qcaps(bd); + rc = bnge_hwrm_func_drv_rgtr(bd); if (rc) { - dev_err(bd->dev, "Failed initial configuration rc: %d\n", rc); - rc = -ENODEV; + dev_err(bd->dev, "Failed to rgtr with firmware rc: %d\n", rc); + goto err_free_ctx_mem; + } + + rc = bnge_func_qrcaps_qcfg(bd); + if (rc) { + dev_err(bd->dev, "Failed querying resources rc: %d\n", rc); goto err_func_unrgtr; } @@ -158,7 +170,9 @@ static int bnge_fw_register_dev(struct bnge_dev *bd) return 0; err_func_unrgtr: - bnge_fw_unregister_dev(bd); + bnge_hwrm_func_drv_unrgtr(bd); +err_free_ctx_mem: + bnge_free_ctx_mem(bd); return rc; } From c6b34add67a5402f53359580956b5c318965a893 Mon Sep 17 00:00:00 2001 From: Vikas Gupta Date: Sat, 18 Apr 2026 08:04:38 +0530 Subject: [PATCH 095/148] bnge: remove unsupported backing store type The backing store type, BNGE_CTX_MRAV, is not applicable in Thor Ultra devices. Remove it from the backing store configuration, as the firmware will not populate entities in this backing store type, due to which the driver load fails. 
Fixes: 29c5b358f385 ("bng_en: Add backing store support") Signed-off-by: Vikas Gupta Reviewed-by: Dharmender Garg Link: https://patch.msgid.link/20260418023438.1597876-3-vikas.gupta@broadcom.com Signed-off-by: Jakub Kicinski --- drivers/net/ethernet/broadcom/bnge/bnge_rmem.c | 16 ---------------- 1 file changed, 16 deletions(-) diff --git a/drivers/net/ethernet/broadcom/bnge/bnge_rmem.c b/drivers/net/ethernet/broadcom/bnge/bnge_rmem.c index 94f15e08a88c..b066ee887a09 100644 --- a/drivers/net/ethernet/broadcom/bnge/bnge_rmem.c +++ b/drivers/net/ethernet/broadcom/bnge/bnge_rmem.c @@ -324,7 +324,6 @@ int bnge_alloc_ctx_mem(struct bnge_dev *bd) u32 l2_qps, qp1_qps, max_qps; u32 ena, entries_sp, entries; u32 srqs, max_srqs, min; - u32 num_mr, num_ah; u32 extra_srqs = 0; u32 extra_qps = 0; u32 fast_qpmd_qps; @@ -390,21 +389,6 @@ int bnge_alloc_ctx_mem(struct bnge_dev *bd) if (!bnge_is_roce_en(bd)) goto skip_rdma; - ctxm = &ctx->ctx_arr[BNGE_CTX_MRAV]; - /* 128K extra is needed to accommodate static AH context - * allocation by f/w. - */ - num_mr = min_t(u32, ctxm->max_entries / 2, 1024 * 256); - num_ah = min_t(u32, num_mr, 1024 * 128); - ctxm->split_entry_cnt = BNGE_CTX_MRAV_AV_SPLIT_ENTRY + 1; - if (!ctxm->mrav_av_entries || ctxm->mrav_av_entries > num_ah) - ctxm->mrav_av_entries = num_ah; - - rc = bnge_setup_ctxm_pg_tbls(bd, ctxm, num_mr + num_ah, 2); - if (rc) - return rc; - ena |= FUNC_BACKING_STORE_CFG_REQ_ENABLES_MRAV; - ctxm = &ctx->ctx_arr[BNGE_CTX_TIM]; rc = bnge_setup_ctxm_pg_tbls(bd, ctxm, l2_qps + qp1_qps + extra_qps, 1); if (rc) From 7c9b012d6367a335f1e91da28401a7c612305a46 Mon Sep 17 00:00:00 2001 From: Xin Long Date: Fri, 17 Apr 2026 17:09:40 -0400 Subject: [PATCH 096/148] sctp: fix sockets_allocated imbalance after sk_clone() sk_clone() increments sockets_allocated and sets the socket refcount to 2. SCTP performs additional accounting in sctp_clone_sock(), so the clone-time increment must be undone to avoid double counting. Note we cannot simply remove the SCTP-side increment, because the SCTP destroy path in sctp_destroy_sock() only decrements sockets_allocated when sp->ep is set, which may not be true for all failure paths in sctp_clone_sock(). Fixes: 16942cf4d3e3 ("sctp: Use sk_clone() in sctp_accept().") Signed-off-by: Xin Long Reviewed-by: Kuniyuki Iwashima Link: https://patch.msgid.link/af8d66f928dec3e9fcbee8d4a85b7d5a6b86f515.1776460180.git.lucien.xin@gmail.com Signed-off-by: Jakub Kicinski --- net/sctp/socket.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/net/sctp/socket.c b/net/sctp/socket.c index f52fe90d3e00..58d0d9747f0b 100644 --- a/net/sctp/socket.c +++ b/net/sctp/socket.c @@ -4855,8 +4855,9 @@ static struct sock *sctp_clone_sock(struct sock *sk, if (!newsk) return ERR_PTR(err); - /* sk_clone() sets refcnt to 2 */ + /* sk_clone() sets refcnt to 2 and increments sockets_allocated */ sock_put(newsk); + sk_sockets_allocated_dec(newsk); newinet = inet_sk(newsk); newsp = sctp_sk(newsk); From ade67d5f588832c7ba131aadd4215a94ce0a15c8 Mon Sep 17 00:00:00 2001 From: Andrea Mayer Date: Sat, 18 Apr 2026 18:28:38 +0200 Subject: [PATCH 097/148] seg6: fix seg6 lwtunnel output redirect for L2 reduced encap mode When SEG6_IPTUN_MODE_L2ENCAP_RED (L2ENCAP_RED) was introduced, the condition in seg6_build_state() that excludes L2 encap modes from setting LWTUNNEL_STATE_OUTPUT_REDIRECT was not updated to account for the new mode. 
As a consequence, L2ENCAP_RED routes incorrectly trigger seg6_output() on the output path, where the packet is silently dropped because skb_mac_header_was_set() fails on L3 packets. Extend the check to also exclude L2ENCAP_RED, consistent with L2ENCAP. Fixes: 13f0296be8ec ("seg6: add support for SRv6 H.L2Encaps.Red behavior") Cc: stable@vger.kernel.org Signed-off-by: Andrea Mayer Reviewed-by: Justin Iurman Link: https://patch.msgid.link/20260418162838.31979-1-andrea.mayer@uniroma2.it Signed-off-by: Jakub Kicinski --- net/ipv6/seg6_iptunnel.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/net/ipv6/seg6_iptunnel.c b/net/ipv6/seg6_iptunnel.c index 97b50d9b1365..9b64343ebad6 100644 --- a/net/ipv6/seg6_iptunnel.c +++ b/net/ipv6/seg6_iptunnel.c @@ -746,7 +746,8 @@ static int seg6_build_state(struct net *net, struct nlattr *nla, newts->type = LWTUNNEL_ENCAP_SEG6; newts->flags |= LWTUNNEL_STATE_INPUT_REDIRECT; - if (tuninfo->mode != SEG6_IPTUN_MODE_L2ENCAP) + if (tuninfo->mode != SEG6_IPTUN_MODE_L2ENCAP && + tuninfo->mode != SEG6_IPTUN_MODE_L2ENCAP_RED) newts->flags |= LWTUNNEL_STATE_OUTPUT_REDIRECT; newts->headroom = seg6_lwt_headroom(tuninfo); From c88eb7e8d8397a8c1db59c425332c5a30b2a1682 Mon Sep 17 00:00:00 2001 From: Michael Bommarito Date: Sat, 18 Apr 2026 10:10:47 -0400 Subject: [PATCH 098/148] net/rds: zero per-item info buffer before handing it to visitors rds_for_each_conn_info() and rds_walk_conn_path_info() both hand a caller-allocated on-stack u64 buffer to a per-connection visitor and then copy the full item_len bytes back to user space via rds_info_copy() regardless of how much of the buffer the visitor actually wrote. rds_ib_conn_info_visitor() and rds6_ib_conn_info_visitor() only write a subset of their output struct when the underlying rds_connection is not in state RDS_CONN_UP (src/dst addr, tos, sl and the two GIDs via explicit memsets). Several u32 fields (max_send_wr, max_recv_wr, max_send_sge, rdma_mr_max, rdma_mr_size, cache_allocs) and the 2-byte alignment hole between sl and cache_allocs remain as whatever stack contents preceded the visitor call and are then memcpy_to_user()'d out to user space. struct rds_info_rdma_connection and struct rds6_info_rdma_connection are the only rds_info_* structs in include/uapi/linux/rds.h that are not marked __attribute__((packed)), so they have a real alignment hole. The other info visitors (rds_conn_info_visitor, rds6_conn_info_visitor, rds_tcp_tc_info, ...) write all fields of their packed output struct today and are not known to be vulnerable, but a future visitor that adds a conditional write-path would have the same bug. Reproduction on a kernel built without CONFIG_INIT_STACK_ALL_ZERO=y: a local unprivileged user opens AF_RDS, sets SO_RDS_TRANSPORT=IB, binds to a local address on an RDMA-capable netdev (rxe soft-RoCE on any netdev is sufficient), sendto()'s any peer on the same subnet (fails cleanly but installs an rds_connection in the global hash in RDS_CONN_CONNECTING), then calls getsockopt(SOL_RDS, RDS_INFO_IB_CONNECTIONS). The returned 68-byte item contains 26 bytes of stack garbage including kernel text/data pointers: 0..7 0a 63 00 01 0a 63 00 02 src=10.99.0.1 dst=10.99.0.2 8..39 00 ... 
gids (memset-zeroed) 40..47 e0 92 a3 81 ff ff ff ff kernel pointer (max_send_wr) 48..55 7f 37 b5 81 ff ff ff ff kernel pointer (rdma_mr_max) 56..59 01 00 08 00 rdma_mr_size (garbage) 60..61 00 00 tos, sl 62..63 00 00 alignment padding 64..67 18 00 00 00 cache_allocs (garbage) Fix by zeroing the per-item buffer in both rds_for_each_conn_info() and rds_walk_conn_path_info() before invoking the visitor. This covers the IPv4/IPv6 IB visitors and hardens all current and future visitors against the same class of bug. No functional change for visitors that fully populate their output. Changes in v2: - retarget at the net tree (subject prefix "[PATCH net v2]", net/rds: prefix in the title) - pick up Reviewed-by tags from Sharath Srinivasan and Allison Henderson Fixes: ec16227e1414 ("RDS/IB: Infiniband transport") Signed-off-by: Michael Bommarito Reviewed-by: Sharath Srinivasan Reviewed-by: Allison Henderson Assisted-by: Claude:claude-opus-4-7 Link: https://patch.msgid.link/20260418141047.3398203-1-michael.bommarito@gmail.com Signed-off-by: Jakub Kicinski --- net/rds/connection.c | 14 ++++++++++++++ 1 file changed, 14 insertions(+) diff --git a/net/rds/connection.c b/net/rds/connection.c index 412441aaa298..c10b7ed06c49 100644 --- a/net/rds/connection.c +++ b/net/rds/connection.c @@ -701,6 +701,13 @@ void rds_for_each_conn_info(struct socket *sock, unsigned int len, i++, head++) { hlist_for_each_entry_rcu(conn, head, c_hash_node) { + /* Zero the per-item buffer before handing it to the + * visitor so any field the visitor does not write - + * including implicit alignment padding - cannot leak + * stack contents to user space via rds_info_copy(). + */ + memset(buffer, 0, item_len); + /* XXX no c_lock usage.. */ if (!visitor(conn, buffer)) continue; @@ -750,6 +757,13 @@ static void rds_walk_conn_path_info(struct socket *sock, unsigned int len, */ cp = conn->c_path; + /* Zero the per-item buffer for the same reason as + * rds_for_each_conn_info(): any byte the visitor + * does not write (including alignment padding) must + * not leak stack contents via rds_info_copy(). + */ + memset(buffer, 0, item_len); + /* XXX no cp_lock usage.. */ if (!visitor(cp, buffer)) continue; From c0a575a801a2040eb1e0db54b488f8c548c8458a Mon Sep 17 00:00:00 2001 From: Grzegorz Nitka Date: Mon, 20 Apr 2026 17:51:25 -0700 Subject: [PATCH 099/148] ice: fix timestamp interrupt configuration for E825C The E825C ice_phy_cfg_intr_eth56g() function is responsible for programming the PHY interrupt for a given port. This function writes to the PHY_REG_TS_INT_CONFIG register of the port. The register is responsible for configuring whether the port interrupt logic is enabled, as well as programming the threshold of waiting timestamps that will trigger an interrupt from this port. This threshold value must not be programmed to zero while the interrupt is enabled. Doing so puts the port in a misconfigured state where the PHY timestamp interrupt for the quad of connected ports will become stuck. This occurs, because a threshold of zero results in the timestamp interrupt status for the port becoming stuck high. The four ports in the connected quad have their timestamp status indicators muxed together. A new interrupt cannot be generated until the timestamp status indicators return low for all four ports. Normally, the timestamp status for a port will clear once there are fewer timestamps in that ports timestamp memory bank than the threshold. A threshold of zero makes this impossible, so the timestamp status for the port does not clear. 
The ice driver never intentionally programs the threshold to zero, indeed the driver always programs it to a value of 1, intending to get an interrupt immediately as soon as even a single packet is waiting for a timestamp. However, there is a subtle flaw in the programming logic in the ice_phy_cfg_intr_eth56g() function. Due to the way that the hardware handles enabling the PHY interrupt. If the threshold value is modified at the same time as the interrupt is enabled, the HW PHY state machine might enable the interrupt before the new threshold value is actually updated. This leaves a potential race condition caused by the hardware logic where a PHY timestamp interrupt might be triggered before the non-zero threshold is written, resulting in the PHY timestamp logic becoming stuck. Once the PHY timestamp status is stuck high, it will remain stuck even after attempting to reprogram the PHY block by changing its threshold or disabling the interrupt. Even a typical PF or CORE reset will not reset the particular block of the PHY that becomes stuck. Even a warm power cycle is not guaranteed to cause the PHY block to reset, and a cold power cycle is required. Prevent this by always writing the PHY_REG_TS_INT_CONFIG in two stages. First write the threshold value with the interrupt disabled, and only write the enable bit after the threshold has been programmed. When disabling the interrupt, leave the threshold unchanged. Additionally, re-read the register after writing it to guarantee that the write to the PHY has been flushed upon exit of the function. While we're modifying this function implementation, explicitly reject programming a threshold of 0 when enabling the interrupt. No caller does this today, but the consequences of doing so are significant. An explicit rejection in the code makes this clear. Fixes: 7cab44f1c35f ("ice: Introduce ETH56G PHY model for E825C products") Signed-off-by: Grzegorz Nitka Reviewed-by: Aleksandr Loktionov Reviewed-by: Petr Oros Tested-by: Sunitha Mekala Signed-off-by: Jacob Keller Reviewed-by: Simon Horman Link: https://patch.msgid.link/20260420-jk-iwl-net-2026-04-20-ptp-e825c-phy-interrupt-fixes-v1-1-bc2240f42251@intel.com Signed-off-by: Jakub Kicinski --- drivers/net/ethernet/intel/ice/ice_ptp_hw.c | 36 ++++++++++++++++++--- 1 file changed, 32 insertions(+), 4 deletions(-) diff --git a/drivers/net/ethernet/intel/ice/ice_ptp_hw.c b/drivers/net/ethernet/intel/ice/ice_ptp_hw.c index 5a5c511ccbb6..7f2f7440e705 100644 --- a/drivers/net/ethernet/intel/ice/ice_ptp_hw.c +++ b/drivers/net/ethernet/intel/ice/ice_ptp_hw.c @@ -1847,6 +1847,8 @@ static int ice_phy_cfg_mac_eth56g(struct ice_hw *hw, u8 port) * @ena: enable or disable interrupt * @threshold: interrupt threshold * + * The threshold cannot be 0 while the interrupt is enabled. 
+ * * Configure TX timestamp interrupt for the specified port * * Return: @@ -1858,19 +1860,45 @@ int ice_phy_cfg_intr_eth56g(struct ice_hw *hw, u8 port, bool ena, u8 threshold) int err; u32 val; + if (ena && !threshold) + return -EINVAL; + err = ice_read_ptp_reg_eth56g(hw, port, PHY_REG_TS_INT_CONFIG, &val); if (err) return err; + val &= ~PHY_TS_INT_CONFIG_ENA_M; if (ena) { - val |= PHY_TS_INT_CONFIG_ENA_M; val &= ~PHY_TS_INT_CONFIG_THRESHOLD_M; val |= FIELD_PREP(PHY_TS_INT_CONFIG_THRESHOLD_M, threshold); - } else { - val &= ~PHY_TS_INT_CONFIG_ENA_M; + err = ice_write_ptp_reg_eth56g(hw, port, PHY_REG_TS_INT_CONFIG, + val); + if (err) { + ice_debug(hw, ICE_DBG_PTP, + "Failed to update 'threshold' PHY_REG_TS_INT_CONFIG port=%u ena=%u threshold=%u\n", + port, !!ena, threshold); + return err; + } + val |= PHY_TS_INT_CONFIG_ENA_M; } - return ice_write_ptp_reg_eth56g(hw, port, PHY_REG_TS_INT_CONFIG, val); + err = ice_write_ptp_reg_eth56g(hw, port, PHY_REG_TS_INT_CONFIG, val); + if (err) { + ice_debug(hw, ICE_DBG_PTP, + "Failed to update 'ena' PHY_REG_TS_INT_CONFIG port=%u ena=%u threshold=%u\n", + port, !!ena, threshold); + return err; + } + + err = ice_read_ptp_reg_eth56g(hw, port, PHY_REG_TS_INT_CONFIG, &val); + if (err) { + ice_debug(hw, ICE_DBG_PTP, + "Failed to read PHY_REG_TS_INT_CONFIG port=%u ena=%u threshold=%u\n", + port, !!ena, threshold); + return err; + } + + return 0; } /** From 3ec46e157c7fa420c77dfc23f7030e61f2f3fd55 Mon Sep 17 00:00:00 2001 From: Grzegorz Nitka Date: Mon, 20 Apr 2026 17:51:26 -0700 Subject: [PATCH 100/148] ice: perform PHY soft reset for E825C ports at initialization In some cases the PHY timestamp block of the E825C can become stuck. This is known to occur if the software writes 0 to the Tx timestamp threshold, and with older versions of the ice driver the threshold configuration is buggy and can race in such that hardware briefly operates with a zero threshold enabled. There are no other known ways to trigger this behavior, but once it occurs, the hardware is not recovered by normal reset, a driver reload, or even a warm power cycle of the system. A cold power cycle is sufficient to recover hardware, but this is extremely invasive and can result in significant downtime on customer deployments. The PHY for each port has a timestamping block which has its own reset functionality accessible by programming the PHY_REG_GLOBAL register. Writing to the PHY_REG_GLOBAL_SOFT_RESET_BIT triggers the hardware to perform a complete reset of the timestamping block of the PHY. This includes clearing the timestamp status for the port, clearing all outstanding timestamps in the memory bank, and resetting the PHY timer. The new ice_ptp_phy_soft_reset_eth56g() function toggles the PHY_REG_GLOBAL soft reset bit with the required delays, ensuring the PHY is properly reinitialized without requiring a full device reset. The sequence clears the reset bit, asserts it, then clears it again, with short waits between transitions to allow hardware stabilization. Call this function in the new ice_ptp_init_phc_e825c(), implementing the E825C device specific variant of the ice_ptp_init_phc(). Note that if ice_ptp_init_phc() fails, PTP functionality may be disabled, but the driver will still load to allow basic functionality to continue. This causes the clock owning PF driver to perform a PHY soft reset for every port during initialization. This ensures the driver begins life in a known functional state regardless of how it was previously programmed. 
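In outline, the reset sequence described above is the following
condensed sketch; error handling and debug logging are omitted, and the
helper, register names, and delay values are the ones used in the diff
below:

  val &= ~PHY_REG_GLOBAL_SOFT_RESET_M;    /* 1. clear the reset bit  */
  ice_write_ptp_reg_eth56g(hw, port, PHY_REG_GLOBAL, val);
  usleep_range(5000, 6000);               /* let the hardware settle */

  val |= PHY_REG_GLOBAL_SOFT_RESET_M;     /* 2. assert the reset bit */
  ice_write_ptp_reg_eth56g(hw, port, PHY_REG_GLOBAL, val);
  usleep_range(5000, 6000);

  val &= ~PHY_REG_GLOBAL_SOFT_RESET_M;    /* 3. clear it again       */
  ice_write_ptp_reg_eth56g(hw, port, PHY_REG_GLOBAL, val);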
This ensures that we properly reconfigure the hardware after a device reset or when loading the driver, even if it was previously misconfigured with an out-of-date or modified driver. Fixes: 7cab44f1c35f ("ice: Introduce ETH56G PHY model for E825C products") Signed-off-by: Timothy Miskell Signed-off-by: Grzegorz Nitka Reviewed-by: Aleksandr Loktionov Reviewed-by: Petr Oros Tested-by: Sunitha Mekala Signed-off-by: Jacob Keller Reviewed-by: Simon Horman Link: https://patch.msgid.link/20260420-jk-iwl-net-2026-04-20-ptp-e825c-phy-interrupt-fixes-v1-2-bc2240f42251@intel.com Signed-off-by: Jakub Kicinski --- drivers/net/ethernet/intel/ice/ice_ptp_hw.c | 90 ++++++++++++++++++++- drivers/net/ethernet/intel/ice/ice_ptp_hw.h | 4 + 2 files changed, 93 insertions(+), 1 deletion(-) diff --git a/drivers/net/ethernet/intel/ice/ice_ptp_hw.c b/drivers/net/ethernet/intel/ice/ice_ptp_hw.c index 7f2f7440e705..d4c2bb084255 100644 --- a/drivers/net/ethernet/intel/ice/ice_ptp_hw.c +++ b/drivers/net/ethernet/intel/ice/ice_ptp_hw.c @@ -377,6 +377,31 @@ static void ice_ptp_cfg_sync_delay(const struct ice_hw *hw, u32 delay) * The following functions operate on devices with the ETH 56G PHY. */ +/** + * ice_ptp_init_phc_e825c - Perform E825C specific PHC initialization + * @hw: pointer to HW struct + * + * Perform E825C-specific PTP hardware clock initialization steps. + * + * Return: 0 on success, or a negative error value on failure. + */ +static int ice_ptp_init_phc_e825c(struct ice_hw *hw) +{ + int err; + + /* Soft reset all ports, to ensure everything is at a clean state */ + for (int port = 0; port < hw->ptp.num_lports; port++) { + err = ice_ptp_phy_soft_reset_eth56g(hw, port); + if (err) { + ice_debug(hw, ICE_DBG_PTP, "Failed to soft reset port %d, err %d\n", + port, err); + return err; + } + } + + return 0; +} + /** * ice_ptp_get_dest_dev_e825 - get destination PHY for given port number * @hw: pointer to the HW struct @@ -2179,6 +2204,69 @@ int ice_ptp_read_tx_hwtstamp_status_eth56g(struct ice_hw *hw, u32 *ts_status) return 0; } +/** + * ice_ptp_phy_soft_reset_eth56g - Perform a PHY soft reset on ETH56G + * @hw: pointer to the HW structure + * @port: PHY port number + * + * Trigger a soft reset of the ETH56G PHY by toggling the soft reset + * bit in the PHY global register. The reset sequence consists of: + * 1. Clearing the soft reset bit + * 2. Asserting the soft reset bit + * 3. Clearing the soft reset bit again + * + * Short delays are inserted between each step to allow the hardware + * to settle. This provides a controlled way to reinitialize the PHY + * without requiring a full device reset. + * + * Return: 0 on success, or a negative error code on failure when + * reading or writing the PHY register. 
+ */ +int ice_ptp_phy_soft_reset_eth56g(struct ice_hw *hw, u8 port) +{ + u32 global_val; + int err; + + err = ice_read_ptp_reg_eth56g(hw, port, PHY_REG_GLOBAL, &global_val); + if (err) { + ice_debug(hw, ICE_DBG_PTP, "Failed to read PHY_REG_GLOBAL for port %d, err %d\n", + port, err); + return err; + } + + global_val &= ~PHY_REG_GLOBAL_SOFT_RESET_M; + ice_debug(hw, ICE_DBG_PTP, "Clearing soft reset bit for port %d, val: 0x%x\n", + port, global_val); + err = ice_write_ptp_reg_eth56g(hw, port, PHY_REG_GLOBAL, global_val); + if (err) { + ice_debug(hw, ICE_DBG_PTP, "Failed to write PHY_REG_GLOBAL for port %d, err %d\n", + port, err); + return err; + } + + usleep_range(5000, 6000); + + global_val |= PHY_REG_GLOBAL_SOFT_RESET_M; + ice_debug(hw, ICE_DBG_PTP, "Set soft reset bit for port %d, val: 0x%x\n", + port, global_val); + err = ice_write_ptp_reg_eth56g(hw, port, PHY_REG_GLOBAL, global_val); + if (err) { + ice_debug(hw, ICE_DBG_PTP, "Failed to write PHY_REG_GLOBAL for port %d, err %d\n", + port, err); + return err; + } + usleep_range(5000, 6000); + + global_val &= ~PHY_REG_GLOBAL_SOFT_RESET_M; + ice_debug(hw, ICE_DBG_PTP, "Clear soft reset bit for port %d, val: 0x%x\n", + port, global_val); + err = ice_write_ptp_reg_eth56g(hw, port, PHY_REG_GLOBAL, global_val); + if (err) + ice_debug(hw, ICE_DBG_PTP, "Failed to write PHY_REG_GLOBAL for port %d, err %d\n", + port, err); + return err; +} + /** * ice_get_phy_tx_tstamp_ready_eth56g - Read the Tx memory status register * @hw: pointer to the HW struct @@ -5592,7 +5680,7 @@ int ice_ptp_init_phc(struct ice_hw *hw) case ICE_MAC_GENERIC: return ice_ptp_init_phc_e82x(hw); case ICE_MAC_GENERIC_3K_E825: - return 0; + return ice_ptp_init_phc_e825c(hw); default: return -EOPNOTSUPP; } diff --git a/drivers/net/ethernet/intel/ice/ice_ptp_hw.h b/drivers/net/ethernet/intel/ice/ice_ptp_hw.h index 9bfd3e79c580..ac4611291052 100644 --- a/drivers/net/ethernet/intel/ice/ice_ptp_hw.h +++ b/drivers/net/ethernet/intel/ice/ice_ptp_hw.h @@ -374,6 +374,7 @@ int ice_stop_phy_timer_eth56g(struct ice_hw *hw, u8 port, bool soft_reset); int ice_start_phy_timer_eth56g(struct ice_hw *hw, u8 port); int ice_phy_cfg_intr_eth56g(struct ice_hw *hw, u8 port, bool ena, u8 threshold); int ice_phy_cfg_ptp_1step_eth56g(struct ice_hw *hw, u8 port); +int ice_ptp_phy_soft_reset_eth56g(struct ice_hw *hw, u8 port); #define ICE_ETH56G_NOMINAL_INCVAL 0x140000000ULL #define ICE_ETH56G_NOMINAL_PCS_REF_TUS 0x100000000ULL @@ -676,6 +677,9 @@ static inline u64 ice_get_base_incval(struct ice_hw *hw) #define ICE_P0_GNSS_PRSNT_N BIT(4) /* ETH56G PHY register addresses */ +#define PHY_REG_GLOBAL 0x0 +#define PHY_REG_GLOBAL_SOFT_RESET_M BIT(11) + /* Timestamp PHY incval registers */ #define PHY_REG_TIMETUS_L 0x8 #define PHY_REG_TIMETUS_U 0xC From 359dc1d41358c88955eeff1b75aee55da7a415d3 Mon Sep 17 00:00:00 2001 From: Jacob Keller Date: Mon, 20 Apr 2026 17:51:27 -0700 Subject: [PATCH 101/148] ice: fix ready bitmap check for non-E822 devices The E800 hardware (apart from E810) has a ready bitmap for the PHY indicating which timestamp slots currently have an outstanding timestamp waiting to be read by software. This bitmap is checked in multiple places using the ice_get_phy_tx_tstamp_ready(): * ice_ptp_process_tx_tstamp() calls it to determine which timestamps to attempt reading from the PHY * ice_ptp_tx_tstamps_pending() calls it in a loop at the end of the miscellaneous IRQ to check if new timestamps came in while the interrupt handler was executing. 
* ice_ptp_maybe_trigger_tx_interrupt() calls it in the auxiliary work task to trigger a software interrupt in the event that the hardware logic gets stuck. For E82X devices, multiple PHYs share the same block, and the parameter passed to the ready bitmap is a block number associated with the given port. For E825-C devices, the PHYs have their own independent blocks and do not share, so the parameter passed needs to be the port number. For E810 devices, the ice_get_phy_tx_tstamp_ready() always returns all 1s regardless of what port, since this hardware does not have a ready bitmap. Finally, for E830 devices, each PF has its own ready bitmap accessible via register, and the block parameter is unused. The first call correctly uses the Tx timestamp tracker block parameter to check the appropriate timestamp block. This works because the tracker is setup correctly for each timestamp device type. The second two callers behave incorrectly for all device types other than the older E822 devices. They both iterate in a loop using ICE_GET_QUAD_NUM() which is a macro only used by E822 devices. This logic is incorrect for devices other than the E822 devices. For E810 the calls would always return true, causing E810 devices to always attempt to trigger a software interrupt even when they have no reason to. For E830, this results in duplicate work as the ready bitmap is checked once per number of quads. Finally, for E825-C, this results in the pending checks failing to detect timestamps on ports other than the first two. Fix this by introducing a new hardware API function to ice_ptp_hw.c, ice_check_phy_tx_tstamp_ready(). This function will check if any timestamps are available and returns a positive value if any timestamps are pending. For E810, the function always returns false, so that the re-trigger checks never happen. For E830, check the ready bitmap just once. For E82x hardware, check each quad. Finally, for E825-C, check every port. The interface function returns an integer to enable reporting of error code if the driver is unable read the ready bitmap. This enables callers to handle this case properly. The previous implementation assumed that timestamps are available if they failed to read the bitmap. This is problematic as it could lead to continuous software IRQ triggering if the PHY timestamp registers somehow become inaccessible. This change is especially important for E825-C devices, as the missing checks could leave a window open where a new timestamp could arrive while the existing timestamps aren't completed. As a result, the hardware threshold logic would not trigger a new interrupt. Without the check, the timestamp is left unhandled, and new timestamps will not cause an interrupt again until the timestamp is handled. Since both the interrupt check and the backup check in the auxiliary task do not function properly, the device may have Tx timestamps permanently stuck failing on a given port. The faulty checks originate from commit d938a8cca88a ("ice: Auxbus devices & driver for E822 TS") and commit 712e876371f8 ("ice: periodically kick Tx timestamp interrupt"), however at the time of the original coding, both functions only operated on E822 hardware. 
This is no longer the case, and hasn't been since the introduction of the ETH56G PHY model in commit 7cab44f1c35f ("ice: Introduce ETH56G PHY model for E825C products") Fixes: 7cab44f1c35f ("ice: Introduce ETH56G PHY model for E825C products") Reviewed-by: Aleksandr Loktionov Reviewed-by: Petr Oros Tested-by: Sunitha Mekala Signed-off-by: Jacob Keller Reviewed-by: Simon Horman Link: https://patch.msgid.link/20260420-jk-iwl-net-2026-04-20-ptp-e825c-phy-interrupt-fixes-v1-3-bc2240f42251@intel.com Signed-off-by: Jakub Kicinski --- drivers/net/ethernet/intel/ice/ice_ptp.c | 44 +++----- drivers/net/ethernet/intel/ice/ice_ptp_hw.c | 117 ++++++++++++++++++++ drivers/net/ethernet/intel/ice/ice_ptp_hw.h | 1 + 3 files changed, 136 insertions(+), 26 deletions(-) diff --git a/drivers/net/ethernet/intel/ice/ice_ptp.c b/drivers/net/ethernet/intel/ice/ice_ptp.c index 6cb0cf7a9891..36df742c326c 100644 --- a/drivers/net/ethernet/intel/ice/ice_ptp.c +++ b/drivers/net/ethernet/intel/ice/ice_ptp.c @@ -2710,7 +2710,7 @@ static bool ice_any_port_has_timestamps(struct ice_pf *pf) bool ice_ptp_tx_tstamps_pending(struct ice_pf *pf) { struct ice_hw *hw = &pf->hw; - unsigned int i; + int ret; /* Check software indicator */ switch (pf->ptp.tx_interrupt_mode) { @@ -2731,16 +2731,19 @@ bool ice_ptp_tx_tstamps_pending(struct ice_pf *pf) } /* Check hardware indicator */ - for (i = 0; i < ICE_GET_QUAD_NUM(hw->ptp.num_lports); i++) { - u64 tstamp_ready = 0; - int err; - - err = ice_get_phy_tx_tstamp_ready(&pf->hw, i, &tstamp_ready); - if (err || tstamp_ready) - return true; + ret = ice_check_phy_tx_tstamp_ready(hw); + if (ret < 0) { + dev_dbg(ice_pf_to_dev(pf), "Unable to read PHY Tx timestamp ready bitmap, err %d\n", + ret); + /* Stop triggering IRQs if we're unable to read PHY */ + return false; } - return false; + /* ice_check_phy_tx_tstamp_ready() returns 1 if there are timestamps + * available, 0 if there are no waiting timestamps, and a negative + * value if there was an error (which we checked for above). + */ + return ret > 0; } /** @@ -2824,8 +2827,7 @@ static void ice_ptp_maybe_trigger_tx_interrupt(struct ice_pf *pf) { struct device *dev = ice_pf_to_dev(pf); struct ice_hw *hw = &pf->hw; - bool trigger_oicr = false; - unsigned int i; + int ret; if (!pf->ptp.port.tx.has_ready_bitmap) return; @@ -2833,21 +2835,11 @@ static void ice_ptp_maybe_trigger_tx_interrupt(struct ice_pf *pf) if (!ice_pf_src_tmr_owned(pf)) return; - for (i = 0; i < ICE_GET_QUAD_NUM(hw->ptp.num_lports); i++) { - u64 tstamp_ready; - int err; - - err = ice_get_phy_tx_tstamp_ready(&pf->hw, i, &tstamp_ready); - if (!err && tstamp_ready) { - trigger_oicr = true; - break; - } - } - - if (trigger_oicr) { - /* Trigger a software interrupt, to ensure this data - * gets processed. - */ + ret = ice_check_phy_tx_tstamp_ready(hw); + if (ret < 0) { + dev_dbg(dev, "PTP periodic task unable to read PHY timestamp ready bitmap, err %d\n", + ret); + } else if (ret) { dev_dbg(dev, "PTP periodic task detected waiting timestamps. 
Triggering Tx timestamp interrupt now.\n"); wr32(hw, PFINT_OICR, PFINT_OICR_TSYN_TX_M); diff --git a/drivers/net/ethernet/intel/ice/ice_ptp_hw.c b/drivers/net/ethernet/intel/ice/ice_ptp_hw.c index d4c2bb084255..4795af06b983 100644 --- a/drivers/net/ethernet/intel/ice/ice_ptp_hw.c +++ b/drivers/net/ethernet/intel/ice/ice_ptp_hw.c @@ -2168,6 +2168,35 @@ int ice_start_phy_timer_eth56g(struct ice_hw *hw, u8 port) return 0; } +/** + * ice_check_phy_tx_tstamp_ready_eth56g - Check Tx memory status for all ports + * @hw: pointer to the HW struct + * + * Check the PHY_REG_TX_MEMORY_STATUS for all ports. A set bit indicates + * a waiting timestamp. + * + * Return: 1 if any port has at least one timestamp ready bit set, + * 0 otherwise, and a negative error code if unable to read the bitmap. + */ +static int ice_check_phy_tx_tstamp_ready_eth56g(struct ice_hw *hw) +{ + int port; + + for (port = 0; port < hw->ptp.num_lports; port++) { + u64 tstamp_ready; + int err; + + err = ice_get_phy_tx_tstamp_ready(hw, port, &tstamp_ready); + if (err) + return err; + + if (tstamp_ready) + return 1; + } + + return 0; +} + /** * ice_ptp_read_tx_hwtstamp_status_eth56g - Get TX timestamp status * @hw: pointer to the HW struct @@ -4318,6 +4347,35 @@ ice_get_phy_tx_tstamp_ready_e82x(struct ice_hw *hw, u8 quad, u64 *tstamp_ready) return 0; } +/** + * ice_check_phy_tx_tstamp_ready_e82x - Check Tx memory status for all quads + * @hw: pointer to the HW struct + * + * Check the Q_REG_TX_MEMORY_STATUS for all quads. A set bit indicates + * a waiting timestamp. + * + * Return: 1 if any quad has at least one timestamp ready bit set, + * 0 otherwise, and a negative error value if unable to read the bitmap. + */ +static int ice_check_phy_tx_tstamp_ready_e82x(struct ice_hw *hw) +{ + int quad; + + for (quad = 0; quad < ICE_GET_QUAD_NUM(hw->ptp.num_lports); quad++) { + u64 tstamp_ready; + int err; + + err = ice_get_phy_tx_tstamp_ready(hw, quad, &tstamp_ready); + if (err) + return err; + + if (tstamp_ready) + return 1; + } + + return 0; +} + /** * ice_phy_cfg_intr_e82x - Configure TX timestamp interrupt * @hw: pointer to the HW struct @@ -4871,6 +4929,23 @@ ice_get_phy_tx_tstamp_ready_e810(struct ice_hw *hw, u8 port, u64 *tstamp_ready) return 0; } +/** + * ice_check_phy_tx_tstamp_ready_e810 - Check Tx memory status register + * @hw: pointer to the HW struct + * + * The E810 devices do not have a Tx memory status register. Note this is + * intentionally different behavior from ice_get_phy_tx_tstamp_ready_e810 + * which always says that all bits are ready. This function is called in cases + * where code will trigger interrupts if timestamps are waiting, and should + * not be called for E810 hardware. + * + * Return: 0. + */ +static int ice_check_phy_tx_tstamp_ready_e810(struct ice_hw *hw) +{ + return 0; +} + /* E810 SMA functions * * The following functions operate specifically on E810 hardware and are used @@ -5125,6 +5200,21 @@ static void ice_get_phy_tx_tstamp_ready_e830(const struct ice_hw *hw, u8 port, *tstamp_ready |= rd32(hw, E830_PRTMAC_TS_TX_MEM_VALID_L); } +/** + * ice_check_phy_tx_tstamp_ready_e830 - Check Tx memory status register + * @hw: pointer to the HW struct + * + * Return: 1 if the device has waiting timestamps, 0 otherwise. 
+ */ +static int ice_check_phy_tx_tstamp_ready_e830(struct ice_hw *hw) +{ + u64 tstamp_ready; + + ice_get_phy_tx_tstamp_ready_e830(hw, 0, &tstamp_ready); + + return !!tstamp_ready; +} + /** * ice_ptp_init_phy_e830 - initialize PHY parameters * @ptp: pointer to the PTP HW struct @@ -5717,6 +5807,33 @@ int ice_get_phy_tx_tstamp_ready(struct ice_hw *hw, u8 block, u64 *tstamp_ready) } } +/** + * ice_check_phy_tx_tstamp_ready - Check PHY Tx timestamp memory status + * @hw: pointer to the HW struct + * + * Check the PHY for Tx timestamp memory status on all ports. If you need to + * see individual timestamp status for each index, use + * ice_get_phy_tx_tstamp_ready() instead. + * + * Return: 1 if any port has timestamps available, 0 if there are no timestamps + * available, and a negative error code on failure. + */ +int ice_check_phy_tx_tstamp_ready(struct ice_hw *hw) +{ + switch (hw->mac_type) { + case ICE_MAC_E810: + return ice_check_phy_tx_tstamp_ready_e810(hw); + case ICE_MAC_E830: + return ice_check_phy_tx_tstamp_ready_e830(hw); + case ICE_MAC_GENERIC: + return ice_check_phy_tx_tstamp_ready_e82x(hw); + case ICE_MAC_GENERIC_3K_E825: + return ice_check_phy_tx_tstamp_ready_eth56g(hw); + default: + return -EOPNOTSUPP; + } +} + /** * ice_cgu_get_pin_desc_e823 - get pin description array * @hw: pointer to the hw struct diff --git a/drivers/net/ethernet/intel/ice/ice_ptp_hw.h b/drivers/net/ethernet/intel/ice/ice_ptp_hw.h index ac4611291052..1c9e77dbc770 100644 --- a/drivers/net/ethernet/intel/ice/ice_ptp_hw.h +++ b/drivers/net/ethernet/intel/ice/ice_ptp_hw.h @@ -300,6 +300,7 @@ void ice_ptp_reset_ts_memory(struct ice_hw *hw); int ice_ptp_init_phc(struct ice_hw *hw); void ice_ptp_init_hw(struct ice_hw *hw); int ice_get_phy_tx_tstamp_ready(struct ice_hw *hw, u8 block, u64 *tstamp_ready); +int ice_check_phy_tx_tstamp_ready(struct ice_hw *hw); int ice_ptp_one_port_cmd(struct ice_hw *hw, u8 configured_port, enum ice_ptp_tmr_cmd configured_cmd); From 1f75dbc53f68f0fb2acd99f92315e426a3d0b446 Mon Sep 17 00:00:00 2001 From: Jacob Keller Date: Mon, 20 Apr 2026 17:51:28 -0700 Subject: [PATCH 102/148] ice: fix ice_ptp_read_tx_hwtstamp_status_eth56g The ice_ptp_read_tx_hwtstamp_status_eth56g function calls ice_read_phy_eth56g with a PHY index. However the function actually expects a port index. This causes the function to read the wrong PHY_PTP_INT_STATUS registers, and effectively makes the status wrong for the second set of ports from 4 to 7. The ice_read_phy_eth56g function uses the provided port index to determine which PHY device to read. We could refactor the entire chain to take a PHY index, but this would impact many code sites. Instead, multiply the PHY index by the number of ports, so that we read from the first port of each PHY. 
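For a concrete feel of the indexing, the sketch below walks an assumed
layout of two PHYs with four ports each. It is a standalone userspace
illustration, not driver code; read_phy_status() is an invented
stand-in for ice_read_phy_eth56g():

  #include <stdio.h>

  /* Stand-in: pretend port 2 (on PHY 0) and port 7 (on PHY 1) have
   * timestamps pending, reported as per-PHY 4-bit status nibbles.
   */
  static unsigned int read_phy_status(unsigned int port)
  {
      return port == 0 ? 0x4 : 0x8;
  }

  int main(void)
  {
      const unsigned int ports_per_phy = 4, mask = 0xF;
      unsigned int ts_status = 0;

      for (unsigned int phy = 0; phy < 2; phy++) {
          /* the read helper expects a port index, so address the
           * first port of each PHY: 0, then 4
           */
          unsigned int port = phy * ports_per_phy;

          ts_status |= (read_phy_status(port) & mask) << port;
      }
      printf("0x%02x\n", ts_status);    /* 0x84: ports 2 and 7 */
      return 0;
  }

Shifting by the port number rather than by the PHY number is what keeps
each PHY's status nibble in the right place in the combined word.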
Fixes: 7cab44f1c35f ("ice: Introduce ETH56G PHY model for E825C products") Reviewed-by: Aleksandr Loktionov Reviewed-by: Petr Oros Tested-by: Sunitha Mekala Signed-off-by: Jacob Keller Reviewed-by: Simon Horman Link: https://patch.msgid.link/20260420-jk-iwl-net-2026-04-20-ptp-e825c-phy-interrupt-fixes-v1-4-bc2240f42251@intel.com Signed-off-by: Jakub Kicinski --- drivers/net/ethernet/intel/ice/ice_ptp_hw.c | 10 ++++++++-- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/intel/ice/ice_ptp_hw.c b/drivers/net/ethernet/intel/ice/ice_ptp_hw.c index 4795af06b983..24fb7a3e14d6 100644 --- a/drivers/net/ethernet/intel/ice/ice_ptp_hw.c +++ b/drivers/net/ethernet/intel/ice/ice_ptp_hw.c @@ -2219,13 +2219,19 @@ int ice_ptp_read_tx_hwtstamp_status_eth56g(struct ice_hw *hw, u32 *ts_status) *ts_status = 0; for (phy = 0; phy < params->num_phys; phy++) { + u8 port; int err; - err = ice_read_phy_eth56g(hw, phy, PHY_PTP_INT_STATUS, &status); + /* ice_read_phy_eth56g expects a port index, so use the first + * port of the PHY + */ + port = phy * hw->ptp.ports_per_phy; + + err = ice_read_phy_eth56g(hw, port, PHY_PTP_INT_STATUS, &status); if (err) return err; - *ts_status |= (status & mask) << (phy * hw->ptp.ports_per_phy); + *ts_status |= (status & mask) << port; } ice_debug(hw, ICE_DBG_PTP, "PHY interrupt err: %x\n", *ts_status); From a6edf2cd4156b71e07258876b7626692e158f7e8 Mon Sep 17 00:00:00 2001 From: Eric Dumazet Date: Tue, 21 Apr 2026 14:33:49 +0000 Subject: [PATCH 103/148] net_sched: sch_hhf: annotate data-races in hhf_dump_stats() hhf_dump_stats() only runs with RTNL held, reading fields that can be changed in qdisc fast path. Add READ_ONCE()/WRITE_ONCE() annotations. Fixes: edb09eb17ed8 ("net: sched: do not acquire qdisc spinlock in qdisc/class stats dump") Signed-off-by: Eric Dumazet Reviewed-by: Jamal Hadi Salim Link: https://patch.msgid.link/20260421143349.4052215-1-edumazet@google.com Signed-off-by: Jakub Kicinski --- net/sched/sch_hhf.c | 19 ++++++++++--------- 1 file changed, 10 insertions(+), 9 deletions(-) diff --git a/net/sched/sch_hhf.c b/net/sched/sch_hhf.c index 95e5d9bfd9c8..96021f52d835 100644 --- a/net/sched/sch_hhf.c +++ b/net/sched/sch_hhf.c @@ -198,7 +198,8 @@ static struct hh_flow_state *seek_list(const u32 hash, return NULL; list_del(&flow->flowchain); kfree(flow); - q->hh_flows_current_cnt--; + WRITE_ONCE(q->hh_flows_current_cnt, + q->hh_flows_current_cnt - 1); } else if (flow->hash_id == hash) { return flow; } @@ -226,7 +227,7 @@ static struct hh_flow_state *alloc_new_hh(struct list_head *head, } if (q->hh_flows_current_cnt >= q->hh_flows_limit) { - q->hh_flows_overlimit++; + WRITE_ONCE(q->hh_flows_overlimit, q->hh_flows_overlimit + 1); return NULL; } /* Create new entry. */ @@ -234,7 +235,7 @@ static struct hh_flow_state *alloc_new_hh(struct list_head *head, if (!flow) return NULL; - q->hh_flows_current_cnt++; + WRITE_ONCE(q->hh_flows_current_cnt, q->hh_flows_current_cnt + 1); INIT_LIST_HEAD(&flow->flowchain); list_add_tail(&flow->flowchain, head); @@ -309,7 +310,7 @@ static enum wdrr_bucket_idx hhf_classify(struct sk_buff *skb, struct Qdisc *sch) return WDRR_BUCKET_FOR_NON_HH; flow->hash_id = hash; flow->hit_timestamp = now; - q->hh_flows_total_cnt++; + WRITE_ONCE(q->hh_flows_total_cnt, q->hh_flows_total_cnt + 1); /* By returning without updating counters in q->hhf_arrays, * we implicitly implement "shielding" (see Optimization O1). 
@@ -403,7 +404,7 @@ static int hhf_enqueue(struct sk_buff *skb, struct Qdisc *sch, return NET_XMIT_SUCCESS; prev_backlog = sch->qstats.backlog; - q->drop_overlimit++; + WRITE_ONCE(q->drop_overlimit, q->drop_overlimit + 1); /* Return Congestion Notification only if we dropped a packet from this * bucket. */ @@ -686,10 +687,10 @@ static int hhf_dump_stats(struct Qdisc *sch, struct gnet_dump *d) { struct hhf_sched_data *q = qdisc_priv(sch); struct tc_hhf_xstats st = { - .drop_overlimit = q->drop_overlimit, - .hh_overlimit = q->hh_flows_overlimit, - .hh_tot_count = q->hh_flows_total_cnt, - .hh_cur_count = q->hh_flows_current_cnt, + .drop_overlimit = READ_ONCE(q->drop_overlimit), + .hh_overlimit = READ_ONCE(q->hh_flows_overlimit), + .hh_tot_count = READ_ONCE(q->hh_flows_total_cnt), + .hh_cur_count = READ_ONCE(q->hh_flows_current_cnt), }; return gnet_stats_copy_app(d, &st, sizeof(st)); From 5154561d9b119f781249f8e845fecf059b38b483 Mon Sep 17 00:00:00 2001 From: Eric Dumazet Date: Tue, 21 Apr 2026 14:29:44 +0000 Subject: [PATCH 104/148] net/sched: sch_pie: annotate data-races in pie_dump_stats() pie_dump_stats() only runs with RTNL held, reading fields that can be changed in qdisc fast path. Add READ_ONCE()/WRITE_ONCE() annotations. Alternative would be to acquire the qdisc spinlock, but our long-term goal is to make qdisc dump operations lockless as much as we can. tc_pie_xstats fields don't need to be latched atomically, otherwise this bug would have been caught earlier. Fixes: edb09eb17ed8 ("net: sched: do not acquire qdisc spinlock in qdisc/class stats dump") Signed-off-by: Eric Dumazet Reviewed-by: Jamal Hadi Salim Link: https://patch.msgid.link/20260421142944.4009941-1-edumazet@google.com Signed-off-by: Jakub Kicinski --- include/net/pie.h | 2 +- net/sched/sch_pie.c | 38 +++++++++++++++++++------------------- 2 files changed, 20 insertions(+), 20 deletions(-) diff --git a/include/net/pie.h b/include/net/pie.h index 01cbc66825a4..1f3db0c35514 100644 --- a/include/net/pie.h +++ b/include/net/pie.h @@ -104,7 +104,7 @@ static inline void pie_vars_init(struct pie_vars *vars) vars->dq_tstamp = DTIME_INVALID; vars->accu_prob = 0; vars->dq_count = DQCOUNT_INVALID; - vars->avg_dq_rate = 0; + WRITE_ONCE(vars->avg_dq_rate, 0); } static inline struct pie_skb_cb *get_pie_cb(const struct sk_buff *skb) diff --git a/net/sched/sch_pie.c b/net/sched/sch_pie.c index 16f3f629cb8e..fb53fbf0e328 100644 --- a/net/sched/sch_pie.c +++ b/net/sched/sch_pie.c @@ -90,7 +90,7 @@ static int pie_qdisc_enqueue(struct sk_buff *skb, struct Qdisc *sch, bool enqueue = false; if (unlikely(qdisc_qlen(sch) >= sch->limit)) { - q->stats.overlimit++; + WRITE_ONCE(q->stats.overlimit, q->stats.overlimit + 1); goto out; } @@ -104,7 +104,7 @@ static int pie_qdisc_enqueue(struct sk_buff *skb, struct Qdisc *sch, /* If packet is ecn capable, mark it if drop probability * is lower than 10%, else drop it. 
*/ - q->stats.ecn_mark++; + WRITE_ONCE(q->stats.ecn_mark, q->stats.ecn_mark + 1); enqueue = true; } @@ -114,15 +114,15 @@ static int pie_qdisc_enqueue(struct sk_buff *skb, struct Qdisc *sch, if (!q->params.dq_rate_estimator) pie_set_enqueue_time(skb); - q->stats.packets_in++; + WRITE_ONCE(q->stats.packets_in, q->stats.packets_in + 1); if (qdisc_qlen(sch) > q->stats.maxq) - q->stats.maxq = qdisc_qlen(sch); + WRITE_ONCE(q->stats.maxq, qdisc_qlen(sch)); return qdisc_enqueue_tail(skb, sch); } out: - q->stats.dropped++; + WRITE_ONCE(q->stats.dropped, q->stats.dropped + 1); q->vars.accu_prob = 0; return qdisc_drop_reason(skb, sch, to_free, reason); } @@ -267,11 +267,11 @@ void pie_process_dequeue(struct sk_buff *skb, struct pie_params *params, count = count / dtime; if (vars->avg_dq_rate == 0) - vars->avg_dq_rate = count; + WRITE_ONCE(vars->avg_dq_rate, count); else - vars->avg_dq_rate = + WRITE_ONCE(vars->avg_dq_rate, (vars->avg_dq_rate - - (vars->avg_dq_rate >> 3)) + (count >> 3); + (vars->avg_dq_rate >> 3)) + (count >> 3)); /* If the queue has receded below the threshold, we hold * on to the last drain rate calculated, else we reset @@ -381,7 +381,7 @@ void pie_calculate_probability(struct pie_params *params, struct pie_vars *vars, if (delta > 0) { /* prevent overflow */ if (vars->prob < oldprob) { - vars->prob = MAX_PROB; + WRITE_ONCE(vars->prob, MAX_PROB); /* Prevent normalization error. If probability is at * maximum value already, we normalize it here, and * skip the check to do a non-linear drop in the next @@ -392,7 +392,7 @@ void pie_calculate_probability(struct pie_params *params, struct pie_vars *vars, } else { /* prevent underflow */ if (vars->prob > oldprob) - vars->prob = 0; + WRITE_ONCE(vars->prob, 0); } /* Non-linear drop in probability: Reduce drop probability quickly if @@ -403,7 +403,7 @@ void pie_calculate_probability(struct pie_params *params, struct pie_vars *vars, /* Reduce drop probability to 98.4% */ vars->prob -= vars->prob / 64; - vars->qdelay = qdelay; + WRITE_ONCE(vars->qdelay, qdelay); vars->backlog_old = backlog; /* We restart the measurement cycle if the following conditions are met @@ -502,21 +502,21 @@ static int pie_dump_stats(struct Qdisc *sch, struct gnet_dump *d) struct pie_sched_data *q = qdisc_priv(sch); struct tc_pie_xstats st = { .prob = q->vars.prob << BITS_PER_BYTE, - .delay = ((u32)PSCHED_TICKS2NS(q->vars.qdelay)) / + .delay = ((u32)PSCHED_TICKS2NS(READ_ONCE(q->vars.qdelay))) / NSEC_PER_USEC, - .packets_in = q->stats.packets_in, - .overlimit = q->stats.overlimit, - .maxq = q->stats.maxq, - .dropped = q->stats.dropped, - .ecn_mark = q->stats.ecn_mark, + .packets_in = READ_ONCE(q->stats.packets_in), + .overlimit = READ_ONCE(q->stats.overlimit), + .maxq = READ_ONCE(q->stats.maxq), + .dropped = READ_ONCE(q->stats.dropped), + .ecn_mark = READ_ONCE(q->stats.ecn_mark), }; /* avg_dq_rate is only valid if dq_rate_estimator is enabled */ st.dq_rate_estimating = q->params.dq_rate_estimator; /* unscale and return dq_rate in bytes per sec */ - if (q->params.dq_rate_estimator) - st.avg_dq_rate = q->vars.avg_dq_rate * + if (st.dq_rate_estimating) + st.avg_dq_rate = READ_ONCE(q->vars.avg_dq_rate) * (PSCHED_TICKS_PER_SEC) >> PIE_SCALE; return gnet_stats_copy_app(d, &st, sizeof(st)); From bbfaa73ea6871db03dc05d7f05f00557a8981f25 Mon Sep 17 00:00:00 2001 From: Eric Dumazet Date: Tue, 21 Apr 2026 14:25:09 +0000 Subject: [PATCH 105/148] net/sched: sch_fq_codel: remove data-races from fq_codel_dump_stats() fq_codel_dump_stats() acquires the qdisc spinlock a bit too late. 
Move this acquisition before we fill st.qdisc_stats with live data. Fixes: edb09eb17ed8 ("net: sched: do not acquire qdisc spinlock in qdisc/class stats dump") Signed-off-by: Eric Dumazet Reviewed-by: Jamal Hadi Salim Link: https://patch.msgid.link/20260421142509.3967231-1-edumazet@google.com Signed-off-by: Jakub Kicinski --- net/sched/sch_fq_codel.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/net/sched/sch_fq_codel.c b/net/sched/sch_fq_codel.c index 2a3d758f67ab..0664b2f2d6f2 100644 --- a/net/sched/sch_fq_codel.c +++ b/net/sched/sch_fq_codel.c @@ -585,6 +585,8 @@ static int fq_codel_dump_stats(struct Qdisc *sch, struct gnet_dump *d) }; struct list_head *pos; + sch_tree_lock(sch); + st.qdisc_stats.maxpacket = q->cstats.maxpacket; st.qdisc_stats.drop_overlimit = q->drop_overlimit; st.qdisc_stats.ecn_mark = q->cstats.ecn_mark; @@ -593,7 +595,6 @@ static int fq_codel_dump_stats(struct Qdisc *sch, struct gnet_dump *d) st.qdisc_stats.memory_usage = q->memory_usage; st.qdisc_stats.drop_overmemory = q->drop_overmemory; - sch_tree_lock(sch); list_for_each(pos, &q->new_flows) st.qdisc_stats.new_flows_len++; From a8f5192809caf636d05ba47c144f282cfd0e3839 Mon Sep 17 00:00:00 2001 From: Eric Dumazet Date: Tue, 21 Apr 2026 14:23:09 +0000 Subject: [PATCH 106/148] net/sched: sch_red: annotate data-races in red_dump_stats() red_dump_stats() only runs with RTNL held, reading fields that can be changed in qdisc fast path. Add READ_ONCE()/WRITE_ONCE() annotations. Alternative would be to acquire the qdisc spinlock, but our long-term goal is to make qdisc dump operations lockless as much as we can. tc_red_xstats fields don't need to be latched atomically, otherwise this bug would have been caught earlier. Fixes: edb09eb17ed8 ("net: sched: do not acquire qdisc spinlock in qdisc/class stats dump") Signed-off-by: Eric Dumazet Reviewed-by: Jamal Hadi Salim Link: https://patch.msgid.link/20260421142309.3964322-1-edumazet@google.com Signed-off-by: Jakub Kicinski --- net/sched/sch_red.c | 31 +++++++++++++++++++++---------- 1 file changed, 21 insertions(+), 10 deletions(-) diff --git a/net/sched/sch_red.c b/net/sched/sch_red.c index c8d3d09f15e3..432b8a3000a5 100644 --- a/net/sched/sch_red.c +++ b/net/sched/sch_red.c @@ -90,17 +90,20 @@ static int red_enqueue(struct sk_buff *skb, struct Qdisc *sch, case RED_PROB_MARK: qdisc_qstats_overlimit(sch); if (!red_use_ecn(q)) { - q->stats.prob_drop++; + WRITE_ONCE(q->stats.prob_drop, + q->stats.prob_drop + 1); goto congestion_drop; } if (INET_ECN_set_ce(skb)) { - q->stats.prob_mark++; + WRITE_ONCE(q->stats.prob_mark, + q->stats.prob_mark + 1); skb = tcf_qevent_handle(&q->qe_mark, sch, skb, to_free, &ret); if (!skb) return NET_XMIT_CN | ret; } else if (!red_use_nodrop(q)) { - q->stats.prob_drop++; + WRITE_ONCE(q->stats.prob_drop, + q->stats.prob_drop + 1); goto congestion_drop; } @@ -111,17 +114,20 @@ static int red_enqueue(struct sk_buff *skb, struct Qdisc *sch, reason = QDISC_DROP_OVERLIMIT; qdisc_qstats_overlimit(sch); if (red_use_harddrop(q) || !red_use_ecn(q)) { - q->stats.forced_drop++; + WRITE_ONCE(q->stats.forced_drop, + q->stats.forced_drop + 1); goto congestion_drop; } if (INET_ECN_set_ce(skb)) { - q->stats.forced_mark++; + WRITE_ONCE(q->stats.forced_mark, + q->stats.forced_mark + 1); skb = tcf_qevent_handle(&q->qe_mark, sch, skb, to_free, &ret); if (!skb) return NET_XMIT_CN | ret; } else if (!red_use_nodrop(q)) { - q->stats.forced_drop++; + WRITE_ONCE(q->stats.forced_drop, + q->stats.forced_drop + 1); goto congestion_drop; } @@ -135,7 +141,8 
@@ static int red_enqueue(struct sk_buff *skb, struct Qdisc *sch, sch->qstats.backlog += len; sch->q.qlen++; } else if (net_xmit_drop_count(ret)) { - q->stats.pdrop++; + WRITE_ONCE(q->stats.pdrop, + q->stats.pdrop + 1); qdisc_qstats_drop(sch); } return ret; @@ -463,9 +470,13 @@ static int red_dump_stats(struct Qdisc *sch, struct gnet_dump *d) dev->netdev_ops->ndo_setup_tc(dev, TC_SETUP_QDISC_RED, &hw_stats_request); } - st.early = q->stats.prob_drop + q->stats.forced_drop; - st.pdrop = q->stats.pdrop; - st.marked = q->stats.prob_mark + q->stats.forced_mark; + st.early = READ_ONCE(q->stats.prob_drop) + + READ_ONCE(q->stats.forced_drop); + + st.pdrop = READ_ONCE(q->stats.pdrop); + + st.marked = READ_ONCE(q->stats.prob_mark) + + READ_ONCE(q->stats.forced_mark); return gnet_stats_copy_app(d, &st, sizeof(st)); } From 1ada03fdef82d3d7d2edb9dcd3acc91917675e48 Mon Sep 17 00:00:00 2001 From: Eric Dumazet Date: Tue, 21 Apr 2026 14:16:55 +0000 Subject: [PATCH 107/148] net/sched: sch_sfb: annotate data-races in sfb_dump_stats() sfb_dump_stats() only runs with RTNL held, reading fields that can be changed in qdisc fast path. Add READ_ONCE()/WRITE_ONCE() annotations. Alternative would be to acquire the qdisc spinlock, but our long-term goal is to make qdisc dump operations lockless as much as we can. tc_sfb_xstats fields don't need to be latched atomically, otherwise this bug would have been caught earlier. Fixes: edb09eb17ed8 ("net: sched: do not acquire qdisc spinlock in qdisc/class stats dump") Signed-off-by: Eric Dumazet Link: https://patch.msgid.link/20260421141655.3953721-1-edumazet@google.com Signed-off-by: Jakub Kicinski --- net/sched/sch_sfb.c | 54 +++++++++++++++++++++++++++------------------ 1 file changed, 32 insertions(+), 22 deletions(-) diff --git a/net/sched/sch_sfb.c b/net/sched/sch_sfb.c index 013738662128..bd5ef561030f 100644 --- a/net/sched/sch_sfb.c +++ b/net/sched/sch_sfb.c @@ -130,7 +130,7 @@ static void increment_one_qlen(u32 sfbhash, u32 slot, struct sfb_sched_data *q) sfbhash >>= SFB_BUCKET_SHIFT; if (b[hash].qlen < 0xFFFF) - b[hash].qlen++; + WRITE_ONCE(b[hash].qlen, b[hash].qlen + 1); b += SFB_NUMBUCKETS; /* next level */ } } @@ -159,7 +159,7 @@ static void decrement_one_qlen(u32 sfbhash, u32 slot, sfbhash >>= SFB_BUCKET_SHIFT; if (b[hash].qlen > 0) - b[hash].qlen--; + WRITE_ONCE(b[hash].qlen, b[hash].qlen - 1); b += SFB_NUMBUCKETS; /* next level */ } } @@ -179,12 +179,12 @@ static void decrement_qlen(const struct sk_buff *skb, struct sfb_sched_data *q) static void decrement_prob(struct sfb_bucket *b, struct sfb_sched_data *q) { - b->p_mark = prob_minus(b->p_mark, q->decrement); + WRITE_ONCE(b->p_mark, prob_minus(b->p_mark, q->decrement)); } static void increment_prob(struct sfb_bucket *b, struct sfb_sched_data *q) { - b->p_mark = prob_plus(b->p_mark, q->increment); + WRITE_ONCE(b->p_mark, prob_plus(b->p_mark, q->increment)); } static void sfb_zero_all_buckets(struct sfb_sched_data *q) @@ -202,11 +202,14 @@ static u32 sfb_compute_qlen(u32 *prob_r, u32 *avgpm_r, const struct sfb_sched_da const struct sfb_bucket *b = &q->bins[q->slot].bins[0][0]; for (i = 0; i < SFB_LEVELS * SFB_NUMBUCKETS; i++) { - if (qlen < b->qlen) - qlen = b->qlen; - totalpm += b->p_mark; - if (prob < b->p_mark) - prob = b->p_mark; + u32 b_qlen = READ_ONCE(b->qlen); + u32 b_mark = READ_ONCE(b->p_mark); + + if (qlen < b_qlen) + qlen = b_qlen; + totalpm += b_mark; + if (prob < b_mark) + prob = b_mark; b++; } *prob_r = prob; @@ -295,7 +298,8 @@ static int sfb_enqueue(struct sk_buff *skb, struct Qdisc *sch, if 
(unlikely(sch->q.qlen >= q->limit)) { qdisc_qstats_overlimit(sch); - q->stats.queuedrop++; + WRITE_ONCE(q->stats.queuedrop, + q->stats.queuedrop + 1); goto drop; } @@ -348,7 +352,8 @@ static int sfb_enqueue(struct sk_buff *skb, struct Qdisc *sch, if (unlikely(minqlen >= q->max)) { qdisc_qstats_overlimit(sch); - q->stats.bucketdrop++; + WRITE_ONCE(q->stats.bucketdrop, + q->stats.bucketdrop + 1); goto drop; } @@ -374,7 +379,8 @@ static int sfb_enqueue(struct sk_buff *skb, struct Qdisc *sch, } if (sfb_rate_limit(skb, q)) { qdisc_qstats_overlimit(sch); - q->stats.penaltydrop++; + WRITE_ONCE(q->stats.penaltydrop, + q->stats.penaltydrop + 1); goto drop; } goto enqueue; @@ -390,14 +396,17 @@ static int sfb_enqueue(struct sk_buff *skb, struct Qdisc *sch, * In either case, we want to start dropping packets. */ if (r < (p_min - SFB_MAX_PROB / 2) * 2) { - q->stats.earlydrop++; + WRITE_ONCE(q->stats.earlydrop, + q->stats.earlydrop + 1); goto drop; } } if (INET_ECN_set_ce(skb)) { - q->stats.marked++; + WRITE_ONCE(q->stats.marked, + q->stats.marked + 1); } else { - q->stats.earlydrop++; + WRITE_ONCE(q->stats.earlydrop, + q->stats.earlydrop + 1); goto drop; } } @@ -410,7 +419,8 @@ static int sfb_enqueue(struct sk_buff *skb, struct Qdisc *sch, sch->q.qlen++; increment_qlen(&cb, q); } else if (net_xmit_drop_count(ret)) { - q->stats.childdrop++; + WRITE_ONCE(q->stats.childdrop, + q->stats.childdrop + 1); qdisc_qstats_drop(sch); } return ret; @@ -599,12 +609,12 @@ static int sfb_dump_stats(struct Qdisc *sch, struct gnet_dump *d) { struct sfb_sched_data *q = qdisc_priv(sch); struct tc_sfb_xstats st = { - .earlydrop = q->stats.earlydrop, - .penaltydrop = q->stats.penaltydrop, - .bucketdrop = q->stats.bucketdrop, - .queuedrop = q->stats.queuedrop, - .childdrop = q->stats.childdrop, - .marked = q->stats.marked, + .earlydrop = READ_ONCE(q->stats.earlydrop), + .penaltydrop = READ_ONCE(q->stats.penaltydrop), + .bucketdrop = READ_ONCE(q->stats.bucketdrop), + .queuedrop = READ_ONCE(q->stats.queuedrop), + .childdrop = READ_ONCE(q->stats.childdrop), + .marked = READ_ONCE(q->stats.marked), }; st.maxqlen = sfb_compute_qlen(&st.maxprob, &st.avgprob, q); From f329924bb49458c65297f1361f545816a5b90998 Mon Sep 17 00:00:00 2001 From: Lorenzo Bianconi Date: Fri, 17 Apr 2026 08:36:31 +0200 Subject: [PATCH 108/148] net: airoha: Move ndesc initialization at end of airoha_qdma_init_tx() If queue entry list allocation fails in airoha_qdma_init_tx_queue routine, airoha_qdma_cleanup_tx_queue() will trigger a NULL pointer dereference accessing the queue entry array. The issue is due to the early ndesc initialization in airoha_qdma_init_tx_queue(). Fix the issue moving ndesc initialization at end of airoha_qdma_init_tx routine. 
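The general pattern behind this fix, sketched below with invented
names, is to publish the "initialized" marker only after every
allocation it guards has succeeded, so that a shared cleanup path can
key off it safely. This is a simplified userspace illustration, not the
driver code; the driver's devm-managed allocations make explicit
unwinding unnecessary there, and the element sizes here are arbitrary:

  #include <errno.h>
  #include <stdlib.h>

  struct ring {
      int ndesc;          /* 0 means "not initialized" */
      void *entry;
      void *desc;
  };

  static int ring_init(struct ring *q, int size)
  {
      q->entry = calloc(size, 64);
      if (!q->entry)
          return -ENOMEM;

      q->desc = calloc(size, 32);
      if (!q->desc)
          return -ENOMEM;     /* ndesc still 0: cleanup skips q */

      q->ndesc = size;        /* publish only after full success */
      return 0;
  }

  static void ring_cleanup(struct ring *q)
  {
      if (!q->ndesc)          /* never fully initialized */
          return;
      /* ... safe to walk q->entry / q->desc here ... */
  }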
Fixes: 3f47e67dff1f7 ("net: airoha: Add the capability to consume out-of-order DMA tx descriptors") Signed-off-by: Lorenzo Bianconi Link: https://patch.msgid.link/20260417-airoha_qdma_cleanup_tx_queue-fix-net-v4-1-e04bcc2c9642@kernel.org Reviewed-by: Simon Horman Signed-off-by: Paolo Abeni --- drivers/net/ethernet/airoha/airoha_eth.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/drivers/net/ethernet/airoha/airoha_eth.c b/drivers/net/ethernet/airoha/airoha_eth.c index 376d91df5441..11c8d3589c45 100644 --- a/drivers/net/ethernet/airoha/airoha_eth.c +++ b/drivers/net/ethernet/airoha/airoha_eth.c @@ -978,27 +978,27 @@ static int airoha_qdma_init_tx_queue(struct airoha_queue *q, dma_addr_t dma_addr; spin_lock_init(&q->lock); - q->ndesc = size; q->qdma = qdma; q->free_thr = 1 + MAX_SKB_FRAGS; INIT_LIST_HEAD(&q->tx_list); - q->entry = devm_kzalloc(eth->dev, q->ndesc * sizeof(*q->entry), + q->entry = devm_kzalloc(eth->dev, size * sizeof(*q->entry), GFP_KERNEL); if (!q->entry) return -ENOMEM; - q->desc = dmam_alloc_coherent(eth->dev, q->ndesc * sizeof(*q->desc), + q->desc = dmam_alloc_coherent(eth->dev, size * sizeof(*q->desc), &dma_addr, GFP_KERNEL); if (!q->desc) return -ENOMEM; - for (i = 0; i < q->ndesc; i++) { + for (i = 0; i < size; i++) { u32 val = FIELD_PREP(QDMA_DESC_DONE_MASK, 1); list_add_tail(&q->entry[i].list, &q->tx_list); WRITE_ONCE(q->desc[i].ctrl, cpu_to_le32(val)); } + q->ndesc = size; /* xmit ring drop default setting */ airoha_qdma_set(qdma, REG_TX_RING_BLOCKING(qid), From 3309965fe44c00fd65af7cef5016e9e782c021a7 Mon Sep 17 00:00:00 2001 From: Lorenzo Bianconi Date: Fri, 17 Apr 2026 08:36:32 +0200 Subject: [PATCH 109/148] net: airoha: Add missing bits in airoha_qdma_cleanup_tx_queue() Similar to airoha_qdma_cleanup_rx_queue(), reset DMA TX descriptors in airoha_qdma_cleanup_tx_queue routine. Moreover, reset TX_DMA_IDX to TX_CPU_IDX to notify the NIC the QDMA TX ring is empty. 
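The index reset relies on the usual producer/consumer convention for
DMA rings: the hardware treats the ring as empty exactly when the
consumer (DMA) index has caught up with the producer (CPU) index. A
minimal kernel-style sketch, with assumed names:

  /* After cleanup, both indices are pointed at the same slot so the
   * hardware sees no pending descriptors.
   */
  static bool ring_empty(u16 cpu_idx, u16 dma_idx)
  {
      return cpu_idx == dma_idx;    /* producer == consumer */
  }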
Fixes: 23020f0493270 ("net: airoha: Introduce ethernet support for EN7581 SoC") Signed-off-by: Lorenzo Bianconi Link: https://patch.msgid.link/20260417-airoha_qdma_cleanup_tx_queue-fix-net-v4-2-e04bcc2c9642@kernel.org Reviewed-by: Simon Horman Signed-off-by: Paolo Abeni --- drivers/net/ethernet/airoha/airoha_eth.c | 32 ++++++++++++++++++++++-- 1 file changed, 30 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/airoha/airoha_eth.c b/drivers/net/ethernet/airoha/airoha_eth.c index 11c8d3589c45..e17e40a2090d 100644 --- a/drivers/net/ethernet/airoha/airoha_eth.c +++ b/drivers/net/ethernet/airoha/airoha_eth.c @@ -1063,12 +1063,15 @@ static int airoha_qdma_init_tx(struct airoha_qdma *qdma) static void airoha_qdma_cleanup_tx_queue(struct airoha_queue *q) { - struct airoha_eth *eth = q->qdma->eth; - int i; + struct airoha_qdma *qdma = q->qdma; + struct airoha_eth *eth = qdma->eth; + int i, qid = q - &qdma->q_tx[0]; + u16 index = 0; spin_lock_bh(&q->lock); for (i = 0; i < q->ndesc; i++) { struct airoha_queue_entry *e = &q->entry[i]; + struct airoha_qdma_desc *desc = &q->desc[i]; if (!e->dma_addr) continue; @@ -1079,8 +1082,33 @@ static void airoha_qdma_cleanup_tx_queue(struct airoha_queue *q) e->dma_addr = 0; e->skb = NULL; list_add_tail(&e->list, &q->tx_list); + + /* Reset DMA descriptor */ + WRITE_ONCE(desc->ctrl, 0); + WRITE_ONCE(desc->addr, 0); + WRITE_ONCE(desc->data, 0); + WRITE_ONCE(desc->msg0, 0); + WRITE_ONCE(desc->msg1, 0); + WRITE_ONCE(desc->msg2, 0); + q->queued--; } + + if (!list_empty(&q->tx_list)) { + struct airoha_queue_entry *e; + + e = list_first_entry(&q->tx_list, struct airoha_queue_entry, + list); + index = e - q->entry; + } + /* Set TX_DMA_IDX to TX_CPU_IDX to notify the hw the QDMA TX ring is + * empty. + */ + airoha_qdma_rmw(qdma, REG_TX_CPU_IDX(qid), TX_RING_CPU_IDX_MASK, + FIELD_PREP(TX_RING_CPU_IDX_MASK, index)); + airoha_qdma_rmw(qdma, REG_TX_DMA_IDX(qid), TX_RING_DMA_IDX_MASK, + FIELD_PREP(TX_RING_DMA_IDX_MASK, index)); + spin_unlock_bh(&q->lock); } From 0c078021d3861966614d5e594ee03587f0c9e74d Mon Sep 17 00:00:00 2001 From: Mieczyslaw Nalewaj Date: Sun, 19 Apr 2026 21:37:07 +0200 Subject: [PATCH 110/148] net: dsa: realtek: rtl8365mb: fix mode mask calculation The RTL8365MB_DIGITAL_INTERFACE_SELECT_MODE_MASK macro was shifting the 4-bit mask (0xF) by only (_extint % 2) bits instead of (_extint % 2) * 4. This caused the mask to overlap with the adjacent nibble when configuring odd-numbered external interfaces, selecting the wrong bits entirely. Align the shift calculation with the existing ...MODE_OFFSET macro. Fixes: 4af2950c50c8 ("net: dsa: realtek-smi: add rtl8365mb subdriver for RTL8365MB-VC") Signed-off-by: Abdulkader Alrezej Signed-off-by: Mieczyslaw Nalewaj Reviewed-by: Luiz Angelo Daros de Luca Link: https://patch.msgid.link/400a6387-a444-4576-af6d-26be5410bce3@yahoo.com Signed-off-by: Paolo Abeni --- drivers/net/dsa/realtek/rtl8365mb.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/dsa/realtek/rtl8365mb.c b/drivers/net/dsa/realtek/rtl8365mb.c index 31fa94dac627..c35cef01ec26 100644 --- a/drivers/net/dsa/realtek/rtl8365mb.c +++ b/drivers/net/dsa/realtek/rtl8365mb.c @@ -216,7 +216,7 @@ (_extint) == 2 ? 
RTL8365MB_DIGITAL_INTERFACE_SELECT_REG1 : \ 0x0) #define RTL8365MB_DIGITAL_INTERFACE_SELECT_MODE_MASK(_extint) \ - (0xF << (((_extint) % 2))) + (0xF << (((_extint) % 2) * 4)) #define RTL8365MB_DIGITAL_INTERFACE_SELECT_MODE_OFFSET(_extint) \ (((_extint) % 2) * 4) From fc69decc811b155a0ed8eef17ee940f28c4f6dbc Mon Sep 17 00:00:00 2001 From: Longxuan Yu Date: Mon, 20 Apr 2026 11:18:45 +0800 Subject: [PATCH 111/148] 8021q: use RCU for egress QoS mappings The TX fast path and reporting paths walk egress QoS mappings without RTNL. Convert the mapping lists to RCU-protected pointers, use RCU reader annotations in readers, and defer freeing mapping nodes with an embedded rcu_head. This prepares the egress QoS mapping code for safe removal of mapping nodes in a follow-up change while preserving the current behavior. Co-developed-by: Yuan Tan Signed-off-by: Yuan Tan Signed-off-by: Longxuan Yu Signed-off-by: Ren Wei Link: https://patch.msgid.link/9136768189f8c6d3f824f476c62d2fa1111688e8.1776647968.git.yuantan098@gmail.com Signed-off-by: Paolo Abeni --- include/linux/if_vlan.h | 25 ++++++++++++++++--------- net/8021q/vlan_dev.c | 31 ++++++++++++++++--------------- net/8021q/vlan_netlink.c | 10 ++++++---- net/8021q/vlanproc.c | 12 ++++++++---- 4 files changed, 46 insertions(+), 32 deletions(-) diff --git a/include/linux/if_vlan.h b/include/linux/if_vlan.h index e6272f9c5e42..20cc16ea4e5a 100644 --- a/include/linux/if_vlan.h +++ b/include/linux/if_vlan.h @@ -147,11 +147,13 @@ extern __be16 vlan_dev_vlan_proto(const struct net_device *dev); * @priority: skb priority * @vlan_qos: vlan priority: (skb->priority << 13) & 0xE000 * @next: pointer to next struct + * @rcu: used for deferred freeing of mapping nodes */ struct vlan_priority_tci_mapping { u32 priority; u16 vlan_qos; - struct vlan_priority_tci_mapping *next; + struct vlan_priority_tci_mapping __rcu *next; + struct rcu_head rcu; }; struct proc_dir_entry; @@ -177,7 +179,7 @@ struct vlan_dev_priv { unsigned int nr_ingress_mappings; u32 ingress_priority_map[8]; unsigned int nr_egress_mappings; - struct vlan_priority_tci_mapping *egress_priority_map[16]; + struct vlan_priority_tci_mapping __rcu *egress_priority_map[16]; __be16 vlan_proto; u16 vlan_id; @@ -209,19 +211,24 @@ static inline u16 vlan_dev_get_egress_qos_mask(struct net_device *dev, u32 skprio) { struct vlan_priority_tci_mapping *mp; + u16 vlan_qos = 0; - smp_rmb(); /* coupled with smp_wmb() in vlan_dev_set_egress_priority() */ + rcu_read_lock(); - mp = vlan_dev_priv(dev)->egress_priority_map[(skprio & 0xF)]; + mp = rcu_dereference(vlan_dev_priv(dev)->egress_priority_map[skprio & 0xF]); while (mp) { if (mp->priority == skprio) { - return mp->vlan_qos; /* This should already be shifted - * to mask correctly with the - * VLAN's TCI */ + vlan_qos = READ_ONCE(mp->vlan_qos); + break; } - mp = mp->next; + mp = rcu_dereference(mp->next); } - return 0; + rcu_read_unlock(); + + /* This should already be shifted to mask correctly with + * the VLAN's TCI. 
+ */ + return vlan_qos; } extern bool vlan_do_receive(struct sk_buff **skb); diff --git a/net/8021q/vlan_dev.c b/net/8021q/vlan_dev.c index c40f7d5c4fca..a5340932b657 100644 --- a/net/8021q/vlan_dev.c +++ b/net/8021q/vlan_dev.c @@ -172,39 +172,34 @@ int vlan_dev_set_egress_priority(const struct net_device *dev, u32 skb_prio, u16 vlan_prio) { struct vlan_dev_priv *vlan = vlan_dev_priv(dev); - struct vlan_priority_tci_mapping *mp = NULL; + struct vlan_priority_tci_mapping *mp; struct vlan_priority_tci_mapping *np; + u32 bucket = skb_prio & 0xF; u32 vlan_qos = (vlan_prio << VLAN_PRIO_SHIFT) & VLAN_PRIO_MASK; /* See if a priority mapping exists.. */ - mp = vlan->egress_priority_map[skb_prio & 0xF]; + mp = rtnl_dereference(vlan->egress_priority_map[bucket]); while (mp) { if (mp->priority == skb_prio) { if (mp->vlan_qos && !vlan_qos) vlan->nr_egress_mappings--; else if (!mp->vlan_qos && vlan_qos) vlan->nr_egress_mappings++; - mp->vlan_qos = vlan_qos; + WRITE_ONCE(mp->vlan_qos, vlan_qos); return 0; } - mp = mp->next; + mp = rtnl_dereference(mp->next); } /* Create a new mapping then. */ - mp = vlan->egress_priority_map[skb_prio & 0xF]; np = kmalloc_obj(struct vlan_priority_tci_mapping); if (!np) return -ENOBUFS; - np->next = mp; np->priority = skb_prio; np->vlan_qos = vlan_qos; - /* Before inserting this element in hash table, make sure all its fields - * are committed to memory. - * coupled with smp_rmb() in vlan_dev_get_egress_qos_mask() - */ - smp_wmb(); - vlan->egress_priority_map[skb_prio & 0xF] = np; + RCU_INIT_POINTER(np->next, rtnl_dereference(vlan->egress_priority_map[bucket])); + rcu_assign_pointer(vlan->egress_priority_map[bucket], np); if (vlan_qos) vlan->nr_egress_mappings++; return 0; @@ -604,11 +599,17 @@ void vlan_dev_free_egress_priority(const struct net_device *dev) int i; for (i = 0; i < ARRAY_SIZE(vlan->egress_priority_map); i++) { - while ((pm = vlan->egress_priority_map[i]) != NULL) { - vlan->egress_priority_map[i] = pm->next; - kfree(pm); + pm = rtnl_dereference(vlan->egress_priority_map[i]); + RCU_INIT_POINTER(vlan->egress_priority_map[i], NULL); + while (pm) { + struct vlan_priority_tci_mapping *next; + + next = rtnl_dereference(pm->next); + kfree_rcu(pm, rcu); + pm = next; } } + vlan->nr_egress_mappings = 0; } static void vlan_dev_uninit(struct net_device *dev) diff --git a/net/8021q/vlan_netlink.c b/net/8021q/vlan_netlink.c index a000b1ef0520..a5b16833e2ce 100644 --- a/net/8021q/vlan_netlink.c +++ b/net/8021q/vlan_netlink.c @@ -260,13 +260,15 @@ static int vlan_fill_info(struct sk_buff *skb, const struct net_device *dev) goto nla_put_failure; for (i = 0; i < ARRAY_SIZE(vlan->egress_priority_map); i++) { - for (pm = vlan->egress_priority_map[i]; pm; - pm = pm->next) { - if (!pm->vlan_qos) + for (pm = rcu_dereference_rtnl(vlan->egress_priority_map[i]); pm; + pm = rcu_dereference_rtnl(pm->next)) { + u16 vlan_qos = READ_ONCE(pm->vlan_qos); + + if (!vlan_qos) continue; m.from = pm->priority; - m.to = (pm->vlan_qos >> 13) & 0x7; + m.to = (vlan_qos >> 13) & 0x7; if (nla_put(skb, IFLA_VLAN_QOS_MAPPING, sizeof(m), &m)) goto nla_put_failure; diff --git a/net/8021q/vlanproc.c b/net/8021q/vlanproc.c index fa67374bda49..0e424e0895b7 100644 --- a/net/8021q/vlanproc.c +++ b/net/8021q/vlanproc.c @@ -262,15 +262,19 @@ static int vlandev_seq_show(struct seq_file *seq, void *offset) vlan->ingress_priority_map[7]); seq_printf(seq, " EGRESS priority mappings: "); + rcu_read_lock(); for (i = 0; i < 16; i++) { - const struct vlan_priority_tci_mapping *mp - = vlan->egress_priority_map[i]; + 
const struct vlan_priority_tci_mapping *mp = + rcu_dereference(vlan->egress_priority_map[i]); while (mp) { + u16 vlan_qos = READ_ONCE(mp->vlan_qos); + seq_printf(seq, "%u:%d ", - mp->priority, ((mp->vlan_qos >> 13) & 0x7)); - mp = mp->next; + mp->priority, ((vlan_qos >> 13) & 0x7)); + mp = rcu_dereference(mp->next); } } + rcu_read_unlock(); seq_puts(seq, "\n"); return 0; From 7dddc74af369478ba7f9bc136d0fc1dc4570cb66 Mon Sep 17 00:00:00 2001 From: Longxuan Yu Date: Mon, 20 Apr 2026 11:18:46 +0800 Subject: [PATCH 112/148] 8021q: delete cleared egress QoS mappings vlan_dev_set_egress_priority() currently keeps cleared egress priority mappings in the hash as tombstones. Repeated set/clear cycles with distinct skb priorities therefore accumulate mapping nodes until device teardown and leak memory. Delete mappings when vlan_prio is cleared instead of keeping tombstones. Now that the egress mapping lists are RCU protected, the node can be unlinked safely and freed after a grace period. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Cc: stable@kernel.org Reported-by: Yifan Wu Reported-by: Juefei Pu Reported-by: Xin Liu Co-developed-by: Yuan Tan Signed-off-by: Yuan Tan Signed-off-by: Longxuan Yu Signed-off-by: Ren Wei Link: https://patch.msgid.link/ecfa6f6ce2467a42647ff4c5221238ae85b79a59.1776647968.git.yuantan098@gmail.com Signed-off-by: Paolo Abeni --- net/8021q/vlan_dev.c | 20 ++++++++++++++------ net/8021q/vlan_netlink.c | 4 ---- 2 files changed, 14 insertions(+), 10 deletions(-) diff --git a/net/8021q/vlan_dev.c b/net/8021q/vlan_dev.c index a5340932b657..7aa3af8b10ea 100644 --- a/net/8021q/vlan_dev.c +++ b/net/8021q/vlan_dev.c @@ -172,26 +172,34 @@ int vlan_dev_set_egress_priority(const struct net_device *dev, u32 skb_prio, u16 vlan_prio) { struct vlan_dev_priv *vlan = vlan_dev_priv(dev); + struct vlan_priority_tci_mapping __rcu **mpp; struct vlan_priority_tci_mapping *mp; struct vlan_priority_tci_mapping *np; u32 bucket = skb_prio & 0xF; u32 vlan_qos = (vlan_prio << VLAN_PRIO_SHIFT) & VLAN_PRIO_MASK; /* See if a priority mapping exists.. */ - mp = rtnl_dereference(vlan->egress_priority_map[bucket]); + mpp = &vlan->egress_priority_map[bucket]; + mp = rtnl_dereference(*mpp); while (mp) { if (mp->priority == skb_prio) { - if (mp->vlan_qos && !vlan_qos) + if (!vlan_qos) { + rcu_assign_pointer(*mpp, rtnl_dereference(mp->next)); vlan->nr_egress_mappings--; - else if (!mp->vlan_qos && vlan_qos) - vlan->nr_egress_mappings++; - WRITE_ONCE(mp->vlan_qos, vlan_qos); + kfree_rcu(mp, rcu); + } else { + WRITE_ONCE(mp->vlan_qos, vlan_qos); + } return 0; } - mp = rtnl_dereference(mp->next); + mpp = &mp->next; + mp = rtnl_dereference(*mpp); } /* Create a new mapping then. 
*/
+	if (!vlan_qos)
+		return 0;
+
 	np = kmalloc_obj(struct vlan_priority_tci_mapping);
 	if (!np)
 		return -ENOBUFS;
diff --git a/net/8021q/vlan_netlink.c b/net/8021q/vlan_netlink.c
index a5b16833e2ce..368d53ca7d87 100644
--- a/net/8021q/vlan_netlink.c
+++ b/net/8021q/vlan_netlink.c
@@ -263,10 +263,6 @@ static int vlan_fill_info(struct sk_buff *skb, const struct net_device *dev)
 		for (pm = rcu_dereference_rtnl(vlan->egress_priority_map[i]); pm;
 		     pm = rcu_dereference_rtnl(pm->next)) {
 			u16 vlan_qos = READ_ONCE(pm->vlan_qos);
-
-			if (!vlan_qos)
-				continue;
-
 			m.from = pm->priority;
 			m.to = (vlan_qos >> 13) & 0x7;
 			if (nla_put(skb, IFLA_VLAN_QOS_MAPPING,

From 379050947a1828826ad7ea50c95245a56929b35a Mon Sep 17 00:00:00 2001
From: Lorenzo Bianconi
Date: Mon, 20 Apr 2026 10:07:47 +0200
Subject: [PATCH 113/148] net: airoha: Move ndesc initialization at end of
 airoha_qdma_init_rx_queue()

If queue entry or DMA descriptor list allocation fails in the
airoha_qdma_init_rx_queue routine, airoha_qdma_cleanup() will trigger a
NULL pointer dereference when running netif_napi_del() for RX queue
NAPIs, since netif_napi_add() has never been executed for this
particular RX NAPI.

The issue is due to the early ndesc initialization in
airoha_qdma_init_rx_queue(), since airoha_qdma_cleanup() relies on the
ndesc value to check if the queue is properly initialized. Fix the
issue by moving the ndesc initialization to the end of the
airoha_qdma_init_rx_queue() routine.

Move the page_pool allocation after the descriptor list allocation in
order to avoid memory leaks if the descriptor allocation fails.

Fixes: 23020f049327 ("net: airoha: Introduce ethernet support for EN7581 SoC")
Signed-off-by: Lorenzo Bianconi
Link: https://patch.msgid.link/20260420-airoha_qdma_init_rx_queue-fix-v2-1-d99347e5c18d@kernel.org
Signed-off-by: Paolo Abeni
---
 drivers/net/ethernet/airoha/airoha_eth.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/airoha/airoha_eth.c b/drivers/net/ethernet/airoha/airoha_eth.c
index e17e40a2090d..ff948d555a7d 100644
--- a/drivers/net/ethernet/airoha/airoha_eth.c
+++ b/drivers/net/ethernet/airoha/airoha_eth.c
@@ -745,14 +745,18 @@ static int airoha_qdma_init_rx_queue(struct airoha_queue *q,
 	dma_addr_t dma_addr;
 
 	q->buf_size = PAGE_SIZE / 2;
-	q->ndesc = ndesc;
 	q->qdma = qdma;
 
-	q->entry = devm_kzalloc(eth->dev, q->ndesc * sizeof(*q->entry),
+	q->entry = devm_kzalloc(eth->dev, ndesc * sizeof(*q->entry),
 				GFP_KERNEL);
 	if (!q->entry)
 		return -ENOMEM;
 
+	q->desc = dmam_alloc_coherent(eth->dev, ndesc * sizeof(*q->desc),
+				      &dma_addr, GFP_KERNEL);
+	if (!q->desc)
+		return -ENOMEM;
+
 	q->page_pool = page_pool_create(&pp_params);
 	if (IS_ERR(q->page_pool)) {
 		int err = PTR_ERR(q->page_pool);
@@ -761,11 +765,7 @@ static int airoha_qdma_init_rx_queue(struct airoha_queue *q,
 		return err;
 	}
 
-	q->desc = dmam_alloc_coherent(eth->dev, q->ndesc * sizeof(*q->desc),
-				      &dma_addr, GFP_KERNEL);
-	if (!q->desc)
-		return -ENOMEM;
-
+	q->ndesc = ndesc;
 	netif_napi_add(eth->napi_dev, &q->napi, airoha_qdma_rx_napi_poll);
 
 	airoha_qdma_wr(qdma, REG_RX_RING_BASE(qid), dma_addr);

From 4b91cb65789b794bfc8d50554b8994f8e0f16309 Mon Sep 17 00:00:00 2001
From: Lorenzo Bianconi
Date: Mon, 20 Apr 2026 10:07:48 +0200
Subject: [PATCH 114/148] net: airoha: Add size check for TX NAPIs in
 airoha_qdma_cleanup()

If the airoha_qdma_init routine fails before airoha_qdma_tx_irq_init()
runs successfully for all TX NAPIs, airoha_qdma_cleanup() will
unconditionally run netif_napi_del() on the TX NAPIs, triggering a NULL
pointer dereference.
Fix the issue by relying on the q_tx_irq size value to check whether the TX NAPIs are properly initialized in airoha_qdma_cleanup(). Moreover, run netif_napi_add_tx() only if the irq_q queue is properly allocated. Fixes: 23020f049327 ("net: airoha: Introduce ethernet support for EN7581 SoC") Signed-off-by: Lorenzo Bianconi Link: https://patch.msgid.link/20260420-airoha_qdma_init_rx_queue-fix-v2-2-d99347e5c18d@kernel.org Signed-off-by: Paolo Abeni --- drivers/net/ethernet/airoha/airoha_eth.c | 11 ++++++++--- 1 file changed, 8 insertions(+), 3 deletions(-) diff --git a/drivers/net/ethernet/airoha/airoha_eth.c b/drivers/net/ethernet/airoha/airoha_eth.c index ff948d555a7d..2bb0a3ff9810 100644 --- a/drivers/net/ethernet/airoha/airoha_eth.c +++ b/drivers/net/ethernet/airoha/airoha_eth.c @@ -1020,8 +1020,6 @@ static int airoha_qdma_tx_irq_init(struct airoha_tx_irq_queue *irq_q, struct airoha_eth *eth = qdma->eth; dma_addr_t dma_addr; - netif_napi_add_tx(eth->napi_dev, &irq_q->napi, - airoha_qdma_tx_napi_poll); irq_q->q = dmam_alloc_coherent(eth->dev, size * sizeof(u32), &dma_addr, GFP_KERNEL); if (!irq_q->q) @@ -1031,6 +1029,9 @@ static int airoha_qdma_tx_irq_init(struct airoha_tx_irq_queue *irq_q, irq_q->size = size; irq_q->qdma = qdma; + netif_napi_add_tx(eth->napi_dev, &irq_q->napi, + airoha_qdma_tx_napi_poll); + airoha_qdma_wr(qdma, REG_TX_IRQ_BASE(id), dma_addr); airoha_qdma_rmw(qdma, REG_TX_IRQ_CFG(id), TX_IRQ_DEPTH_MASK, FIELD_PREP(TX_IRQ_DEPTH_MASK, size)); @@ -1450,8 +1451,12 @@ static void airoha_qdma_cleanup(struct airoha_qdma *qdma) } } - for (i = 0; i < ARRAY_SIZE(qdma->q_tx_irq); i++) + for (i = 0; i < ARRAY_SIZE(qdma->q_tx_irq); i++) { + if (!qdma->q_tx_irq[i].size) + continue; + netif_napi_del(&qdma->q_tx_irq[i].napi); + } for (i = 0; i < ARRAY_SIZE(qdma->q_tx); i++) { if (!qdma->q_tx[i].ndesc) From 7079c8c13f2d33992bc846240517d88f4ab07781 Mon Sep 17 00:00:00 2001 From: Breno Leitao Date: Mon, 20 Apr 2026 03:18:36 -0700 Subject: [PATCH 115/148] netconsole: avoid out-of-bounds access on empty string in trim_newline() trim_newline() unconditionally dereferences s[len - 1] after computing len = strnlen(s, maxlen). When the string is empty, len is 0 and the expression underflows to s[(size_t)-1], reading (and potentially writing) one byte before the buffer. The two callers feed trim_newline() with the result of strscpy() from configfs store callbacks (dev_name_store, userdatum_value_store). configfs guarantees count >= 1 reaches the callback, but the byte itself can be NUL: a userspace write(fd, "\0", 1) leaves the destination empty after strscpy() and triggers the underflow. The OOB write only fires if the adjacent byte happens to be '\n', so this is not a security issue, but the access is undefined behaviour either way. This pattern is commonly flagged by LLM-based code reviewers. While it is not a security fix, the underlying access is undefined behaviour and the change is small and self-contained, so it is a reasonable candidate for the stable trees. Guard the dereference on a non-zero length.
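For illustration, a minimal user-space sketch of the underflow and of the added guard; the buffer and values are made up, only strnlen()'s behavior matters:

    #include <string.h>

    static void trim_newline(char *s, size_t maxlen)
    {
            size_t len = strnlen(s, maxlen);

            if (!len)               /* the added guard */
                    return;
            if (s[len - 1] == '\n')
                    s[len - 1] = '\0';
    }

    int main(void)
    {
            char buf[16] = "";      /* what write(fd, "\0", 1) leaves via strscpy() */

            /* Without the !len guard, len is 0, len - 1 wraps to SIZE_MAX,
             * and s[len - 1] touches one byte before buf.
             */
            trim_newline(buf, sizeof(buf));
            return 0;
    }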
Fixes: ae001dc67907 ("net: netconsole: move newline trimming to function") Cc: stable@vger.kernel.org Signed-off-by: Breno Leitao Reviewed-by: Gustavo Luiz Duarte Reviewed-by: Simon Horman Link: https://patch.msgid.link/20260420-netcons_trim_newline-v1-1-dc35889aeedf@debian.org Signed-off-by: Paolo Abeni --- drivers/net/netconsole.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/drivers/net/netconsole.c b/drivers/net/netconsole.c index 3c9acd6e49e8..205384dab89a 100644 --- a/drivers/net/netconsole.c +++ b/drivers/net/netconsole.c @@ -497,6 +497,8 @@ static void trim_newline(char *s, size_t maxlen) size_t len; len = strnlen(s, maxlen); + if (!len) + return; if (s[len - 1] == '\n') s[len - 1] = '\0'; } From cb4a90744bcd1adf12f0d0c7c4f0dd2647444ec5 Mon Sep 17 00:00:00 2001 From: Erni Sri Satya Vennela Date: Mon, 20 Apr 2026 05:47:35 -0700 Subject: [PATCH 116/148] net: mana: Init link_change_work before potential error paths in probe Move INIT_WORK(link_change_work) to right after the mana_context allocation, before any error path that could reach mana_remove(). Previously, if mana_create_eq() or mana_query_device_cfg() failed, mana_probe() would jump to the error path which calls mana_remove(). mana_remove() unconditionally calls disable_work_sync(link_change_work), but the work struct had not been initialized yet. This can trigger debug object warnings with CONFIG_DEBUG_OBJECTS_WORK enabled. Fixes: 54133f9b4b53 ("net: mana: Support HW link state events") Signed-off-by: Erni Sri Satya Vennela Link: https://patch.msgid.link/20260420124741.1056179-2-ernis@linux.microsoft.com Reviewed-by: Simon Horman Signed-off-by: Paolo Abeni --- drivers/net/ethernet/microsoft/mana/mana_en.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c index 6302432b9bf6..e3e4b6de6668 100644 --- a/drivers/net/ethernet/microsoft/mana/mana_en.c +++ b/drivers/net/ethernet/microsoft/mana/mana_en.c @@ -3631,6 +3631,8 @@ int mana_probe(struct gdma_dev *gd, bool resuming) ac->gdma_dev = gd; gd->driver_data = ac; + + INIT_WORK(&ac->link_change_work, mana_link_state_handle); } err = mana_create_eq(ac); @@ -3648,8 +3650,6 @@ int mana_probe(struct gdma_dev *gd, bool resuming) if (!resuming) { ac->num_ports = num_ports; - - INIT_WORK(&ac->link_change_work, mana_link_state_handle); } else { if (ac->num_ports != num_ports) { dev_err(dev, "The number of vPorts changed: %d->%d\n", From 6e8bc03349fe4f09567fa76235abf52bdaf83082 Mon Sep 17 00:00:00 2001 From: Erni Sri Satya Vennela Date: Mon, 20 Apr 2026 05:47:36 -0700 Subject: [PATCH 117/148] net: mana: Init gf_stats_work before potential error paths in probe Move INIT_DELAYED_WORK(gf_stats_work) to before mana_create_eq(), while keeping schedule_delayed_work() at its original location. Previously, if any function between mana_create_eq() and the INIT_DELAYED_WORK call failed, mana_probe() would call mana_remove(), which unconditionally calls cancel_delayed_work_sync(gf_stats_work) on the not-yet-initialized work, leading to a hang in __flush_work() or debug object warnings with CONFIG_DEBUG_OBJECTS_WORK enabled.
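For reference, a hedged sketch of the init-early/arm-late pattern both of these mana fixes follow; my_ctx, my_probe(), my_remove() and the helpers are illustrative names, not the driver's:

    #include <linux/workqueue.h>

    struct my_ctx {
            struct delayed_work stats_work;
    };

    static void my_stats_handler(struct work_struct *work) { }
    static int my_setup_hw(struct my_ctx *ac) { return -EIO; } /* may fail */

    static int my_probe(struct my_ctx *ac)
    {
            /* Initialize before any error path that can reach my_remove(),
             * so cancel_delayed_work_sync() always sees a valid object.
             */
            INIT_DELAYED_WORK(&ac->stats_work, my_stats_handler);

            if (my_setup_hw(ac))    /* error path ends up in my_remove() */
                    return -EIO;

            /* Arm the work only once probe can no longer fail. */
            schedule_delayed_work(&ac->stats_work, HZ);
            return 0;
    }

    static void my_remove(struct my_ctx *ac)
    {
            /* Safe in every teardown path: the work is always initialized. */
            cancel_delayed_work_sync(&ac->stats_work);
    }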
Fixes: be4f1d67ec56 ("net: mana: Add standard counter rx_missed_errors") Signed-off-by: Erni Sri Satya Vennela Link: https://patch.msgid.link/20260420124741.1056179-3-ernis@linux.microsoft.com Reviewed-by: Simon Horman Signed-off-by: Paolo Abeni --- drivers/net/ethernet/microsoft/mana/mana_en.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c index e3e4b6de6668..468ed60a8a00 100644 --- a/drivers/net/ethernet/microsoft/mana/mana_en.c +++ b/drivers/net/ethernet/microsoft/mana/mana_en.c @@ -3635,6 +3635,8 @@ int mana_probe(struct gdma_dev *gd, bool resuming) INIT_WORK(&ac->link_change_work, mana_link_state_handle); } + INIT_DELAYED_WORK(&ac->gf_stats_work, mana_gf_stats_work_handler); + err = mana_create_eq(ac); if (err) { dev_err(dev, "Failed to create EQs: %d\n", err); @@ -3709,7 +3711,6 @@ int mana_probe(struct gdma_dev *gd, bool resuming) err = add_adev(gd, "eth"); - INIT_DELAYED_WORK(&ac->gf_stats_work, mana_gf_stats_work_handler); schedule_delayed_work(&ac->gf_stats_work, MANA_GF_STATS_PERIOD); out: From 50271d7ec95144d26808025b508f463780517d3c Mon Sep 17 00:00:00 2001 From: Erni Sri Satya Vennela Date: Mon, 20 Apr 2026 05:47:37 -0700 Subject: [PATCH 118/148] net: mana: Guard mana_remove against double invocation If PM resume fails (e.g., mana_attach() returns an error), mana_probe() calls mana_remove(), which tears down the device and sets gd->gdma_context = NULL and gd->driver_data = NULL. However, a failed resume callback does not automatically unbind the driver. When the device is eventually unbound, mana_remove() is invoked a second time. Without a NULL check, it dereferences gc->dev with gc == NULL, causing a kernel panic. Add an early return if gdma_context or driver_data is NULL so the second invocation is harmless. Move the dev = gc->dev assignment after the guard so it cannot dereference NULL. Fixes: 635096a86edb ("net: mana: Support hibernation and kexec") Signed-off-by: Erni Sri Satya Vennela Link: https://patch.msgid.link/20260420124741.1056179-4-ernis@linux.microsoft.com Reviewed-by: Simon Horman Signed-off-by: Paolo Abeni --- drivers/net/ethernet/microsoft/mana/mana_en.c | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c index 468ed60a8a00..ce1b7ec46a27 100644 --- a/drivers/net/ethernet/microsoft/mana/mana_en.c +++ b/drivers/net/ethernet/microsoft/mana/mana_en.c @@ -3731,11 +3731,16 @@ void mana_remove(struct gdma_dev *gd, bool suspending) struct gdma_context *gc = gd->gdma_context; struct mana_context *ac = gd->driver_data; struct mana_port_context *apc; - struct device *dev = gc->dev; + struct device *dev; struct net_device *ndev; int err; int i; + if (!gc || !ac) + return; + + dev = gc->dev; + disable_work_sync(&ac->link_change_work); cancel_delayed_work_sync(&ac->gf_stats_work); From a7fdaf069bd031fcc234581fa6a580be11bf2175 Mon Sep 17 00:00:00 2001 From: Erni Sri Satya Vennela Date: Mon, 20 Apr 2026 05:47:38 -0700 Subject: [PATCH 119/148] net: mana: Don't overwrite port probe error with add_adev result In mana_probe(), if mana_probe_port() fails for any port, the error is stored in 'err' and the loop breaks. However, the subsequent unconditional 'err = add_adev(gd, "eth")' overwrites this error. If add_adev() succeeds, mana_probe() returns success despite ports being left in a partially initialized state (ac->ports[i] == NULL). 
Only call add_adev() when there is no prior error, so the probe correctly fails and triggers mana_remove() cleanup. Fixes: a69839d4327d ("net: mana: Add support for auxiliary device") Signed-off-by: Erni Sri Satya Vennela Link: https://patch.msgid.link/20260420124741.1056179-5-ernis@linux.microsoft.com Reviewed-by: Simon Horman Signed-off-by: Paolo Abeni --- drivers/net/ethernet/microsoft/mana/mana_en.c | 17 ++++++++--------- 1 file changed, 8 insertions(+), 9 deletions(-) diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c index ce1b7ec46a27..39b18577fb51 100644 --- a/drivers/net/ethernet/microsoft/mana/mana_en.c +++ b/drivers/net/ethernet/microsoft/mana/mana_en.c @@ -3680,10 +3680,9 @@ int mana_probe(struct gdma_dev *gd, bool resuming) if (!resuming) { for (i = 0; i < ac->num_ports; i++) { err = mana_probe_port(ac, i, &ac->ports[i]); - /* we log the port for which the probe failed and stop - * probes for subsequent ports. - * Note that we keep running ports, for which the probes - * were successful, unless add_adev fails too + /* Log the port for which the probe failed, stop probing + * subsequent ports, and skip add_adev. + * mana_remove() will clean up already-probed ports. */ if (err) { dev_err(dev, "Probe Failed for port %d\n", i); @@ -3697,10 +3696,9 @@ int mana_probe(struct gdma_dev *gd, bool resuming) enable_work(&apc->queue_reset_work); err = mana_attach(ac->ports[i]); rtnl_unlock(); - /* we log the port for which the attach failed and stop - * attach for subsequent ports - * Note that we keep running ports, for which the attach - * were successful, unless add_adev fails too + /* Log the port for which the attach failed, stop + * attaching subsequent ports, and skip add_adev. + * mana_remove() will clean up already-attached ports. */ if (err) { dev_err(dev, "Attach Failed for port %d\n", i); @@ -3709,7 +3707,8 @@ int mana_probe(struct gdma_dev *gd, bool resuming) } } - err = add_adev(gd, "eth"); + if (!err) + err = add_adev(gd, "eth"); schedule_delayed_work(&ac->gf_stats_work, MANA_GF_STATS_PERIOD); From 65267c9c4f28199985505977bc2c628c82fc50ef Mon Sep 17 00:00:00 2001 From: Erni Sri Satya Vennela Date: Mon, 20 Apr 2026 05:47:39 -0700 Subject: [PATCH 120/148] net: mana: Fix EQ leak in mana_remove on NULL port In mana_remove(), when a NULL port is encountered in the port iteration loop, 'goto out' skips the mana_destroy_eq(ac) call, leaking the event queues allocated earlier by mana_create_eq(). This can happen when mana_probe_port() fails for port 0, leaving ac->ports[0] as NULL. On driver unload or error cleanup, mana_remove() hits the NULL entry and jumps past mana_destroy_eq(). Change 'goto out' to 'break' so the for-loop exits normally and mana_destroy_eq() is always reached. Remove the now-unreferenced out: label. 
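A hedged sketch of the resulting cleanup flow (simplified fragment, not the driver source; teardown_port() is a stand-in for the per-port teardown):

    for (i = 0; i < ac->num_ports; i++) {
            ndev = ac->ports[i];
            if (!ndev)
                    break;          /* was: goto out, which skipped the EQs */
            teardown_port(ndev);
    }
    mana_destroy_eq(ac);            /* now reached even with a NULL port 0 */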
Fixes: 1e2d0824a9c3 ("net: mana: Add support for EQ sharing") Signed-off-by: Erni Sri Satya Vennela Link: https://patch.msgid.link/20260420124741.1056179-6-ernis@linux.microsoft.com Reviewed-by: Simon Horman Signed-off-by: Paolo Abeni --- drivers/net/ethernet/microsoft/mana/mana_en.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c index 39b18577fb51..98e2fcc797ca 100644 --- a/drivers/net/ethernet/microsoft/mana/mana_en.c +++ b/drivers/net/ethernet/microsoft/mana/mana_en.c @@ -3752,7 +3752,7 @@ void mana_remove(struct gdma_dev *gd, bool suspending) if (!ndev) { if (i == 0) dev_err(dev, "No net device to remove\n"); - goto out; + break; } apc = netdev_priv(ndev); @@ -3783,7 +3783,7 @@ void mana_remove(struct gdma_dev *gd, bool suspending) } mana_destroy_eq(ac); -out: + if (ac->per_port_queue_reset_wq) { destroy_workqueue(ac->per_port_queue_reset_wq); ac->per_port_queue_reset_wq = NULL; From 1cb36e252211506f51095fe7ced8286cc77b4c80 Mon Sep 17 00:00:00 2001 From: Stefano Garzarella Date: Mon, 20 Apr 2026 15:20:51 +0200 Subject: [PATCH 121/148] vsock/virtio: fix MSG_ZEROCOPY pinned-pages accounting virtio_transport_init_zcopy_skb() uses iter->count as the size argument for msg_zerocopy_realloc(), which in turn passes it to mm_account_pinned_pages() for RLIMIT_MEMLOCK accounting. However, this function is called after virtio_transport_fill_skb() has already consumed the iterator via __zerocopy_sg_from_iter(), so on the last skb, iter->count will be 0, skipping the RLIMIT_MEMLOCK enforcement. Pass pkt_len (the total bytes being sent) as an explicit parameter to virtio_transport_init_zcopy_skb() instead of reading the already-consumed iter->count. This matches TCP and UDP, which both call msg_zerocopy_realloc() with the original message size. Fixes: 581512a6dc93 ("vsock/virtio: MSG_ZEROCOPY flag support") Reported-by: Yiming Qian Signed-off-by: Stefano Garzarella Reviewed-by: Bobby Eshleman Link: https://patch.msgid.link/20260420132051.217589-1-sgarzare@redhat.com Signed-off-by: Paolo Abeni --- net/vmw_vsock/virtio_transport_common.c | 11 ++++++++--- 1 file changed, 8 insertions(+), 3 deletions(-) diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c index 0742091beae7..416d533f493d 100644 --- a/net/vmw_vsock/virtio_transport_common.c +++ b/net/vmw_vsock/virtio_transport_common.c @@ -73,6 +73,7 @@ static bool virtio_transport_can_zcopy(const struct virtio_transport *t_ops, static int virtio_transport_init_zcopy_skb(struct vsock_sock *vsk, struct sk_buff *skb, struct msghdr *msg, + size_t pkt_len, bool zerocopy) { struct ubuf_info *uarg; @@ -81,12 +82,10 @@ static int virtio_transport_init_zcopy_skb(struct vsock_sock *vsk, uarg = msg->msg_ubuf; net_zcopy_get(uarg); } else { - struct iov_iter *iter = &msg->msg_iter; struct ubuf_info_msgzc *uarg_zc; uarg = msg_zerocopy_realloc(sk_vsock(vsk), - iter->count, - NULL, false); + pkt_len, NULL, false); if (!uarg) return -1; @@ -398,11 +397,17 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk, * each iteration. If this is last skb for this buffer * and MSG_ZEROCOPY mode is in use - we must allocate * completion for the current syscall. + * + * Pass pkt_len because msg iter is already consumed + * by virtio_transport_fill_skb(), so iter->count + * can not be used for RLIMIT_MEMLOCK pinned-pages + * accounting done by msg_zerocopy_realloc(). 
*/ if (info->msg && info->msg->msg_flags & MSG_ZEROCOPY && skb_len == rest_len && info->op == VIRTIO_VSOCK_OP_RW) { if (virtio_transport_init_zcopy_skb(vsk, skb, info->msg, + pkt_len, can_zcopy)) { kfree_skb(skb); ret = -ENOMEM; From fcf04b14334641f4b0b8647824480935e9416d52 Mon Sep 17 00:00:00 2001 From: Gang Yan Date: Mon, 20 Apr 2026 18:19:23 +0200 Subject: [PATCH 122/148] mptcp: sync the msk->sndbuf at accept() time On passive MPTCP connections, the msk sndbuf is not updated correctly. The root cause is an order issue in the accept path: - tcp_check_req() -> subflow_syn_recv_sock() -> mptcp_sk_clone_init() calls __mptcp_propagate_sndbuf() to copy the ssk sndbuf into msk - Later, tcp_child_process() -> tcp_init_transfer() -> tcp_sndbuf_expand() grows the ssk sndbuf. So __mptcp_propagate_sndbuf() runs before the ssk sndbuf has been expanded and the msk ends up with a much smaller sndbuf than the subflow: MPTCP: msk->sndbuf:20480, msk->first->sndbuf:2626560 Fix this by moving the __mptcp_propagate_sndbuf() call from mptcp_sk_clone_init() -- the ssk sndbuf is not yet finalized there -- to mptcp_stream_accept() at accept() time, when the ssk sndbuf has been fully expanded by tcp_sndbuf_expand(). Fixes: 8005184fd1ca ("mptcp: refactor sndbuf auto-tuning") Cc: stable@vger.kernel.org Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/602 Signed-off-by: Gang Yan Acked-by: Paolo Abeni Reviewed-by: Matthieu Baerts (NGI0) Signed-off-by: Matthieu Baerts (NGI0) Link: https://patch.msgid.link/20260420-net-mptcp-sync-sndbuf-accept-v1-1-e3523e3aeb44@kernel.org Signed-off-by: Paolo Abeni --- net/mptcp/protocol.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c index fbffd3a43fe8..718e910ff23f 100644 --- a/net/mptcp/protocol.c +++ b/net/mptcp/protocol.c @@ -3594,7 +3594,6 @@ struct sock *mptcp_sk_clone_init(const struct sock *sk, * uses the correct data */ mptcp_copy_inaddrs(nsk, ssk); - __mptcp_propagate_sndbuf(nsk, ssk); mptcp_rcv_space_init(msk, ssk); msk->rcvq_space.time = mptcp_stamp(); @@ -4252,6 +4251,7 @@ static int mptcp_stream_accept(struct socket *sock, struct socket *newsock, mptcp_graft_subflows(newsk); mptcp_rps_record_subflows(msk); + __mptcp_propagate_sndbuf(newsk, mptcp_subflow_tcp_sock(subflow)); /* Do late cleanup for the first subflow as necessary. Also * deal with bad peers not doing a complete shutdown. From d0576eb8508e68b950ee9d2820116a6fc205fd07 Mon Sep 17 00:00:00 2001 From: Gang Yan Date: Mon, 20 Apr 2026 18:19:24 +0200 Subject: [PATCH 123/148] selftests: mptcp: add a check for sndbuf of S/C Add a new chk_sndbuf() helper to diag.sh that extracts the sndbuf (the 'tb' field from 'ss -m' skmem output) for both server and client MPTCP sockets, and verifies they are equal.
Without the previous patch, it will fail: ''' 07 ....chk sndbuf server/client [FAIL] sndbuf S=20480 != C=2630656 ''' Signed-off-by: Gang Yan Reviewed-by: Matthieu Baerts (NGI0) Signed-off-by: Matthieu Baerts (NGI0) Link: https://patch.msgid.link/20260420-net-mptcp-sync-sndbuf-accept-v1-2-e3523e3aeb44@kernel.org Signed-off-by: Paolo Abeni --- tools/testing/selftests/net/mptcp/diag.sh | 28 +++++++++++++++++++++++ 1 file changed, 28 insertions(+) diff --git a/tools/testing/selftests/net/mptcp/diag.sh b/tools/testing/selftests/net/mptcp/diag.sh index d847ff1737c3..27cbda68144e 100755 --- a/tools/testing/selftests/net/mptcp/diag.sh +++ b/tools/testing/selftests/net/mptcp/diag.sh @@ -322,6 +322,33 @@ wait_connected() done } +chk_sndbuf() +{ + local server_sndbuf client_sndbuf msg + local port=${1} + + msg="....chk sndbuf server/client" + server_sndbuf=$(ss -N "${ns}" -inmHM "sport" "${port}" | \ + sed -n 's/.*tb\([0-9]\+\).*/\1/p') + client_sndbuf=$(ss -N "${ns}" -inmHM "dport" "${port}" | \ + sed -n 's/.*tb\([0-9]\+\).*/\1/p') + + mptcp_lib_print_title "${msg}" + if [ -z "${server_sndbuf}" ] || [ -z "${client_sndbuf}" ]; then + mptcp_lib_pr_fail "sndbuf S=${server_sndbuf} C=${client_sndbuf}" + mptcp_lib_result_fail "${msg}" + ret=${KSFT_FAIL} + elif [ "${server_sndbuf}" != "${client_sndbuf}" ]; then + mptcp_lib_pr_fail "sndbuf S=${server_sndbuf} != C=${client_sndbuf}" + mptcp_lib_result_fail "${msg}" + ret=${KSFT_FAIL} + else + mptcp_lib_pr_ok + mptcp_lib_result_pass "${msg}" + fi +} + + trap cleanup EXIT mptcp_lib_ns_init ns @@ -341,6 +368,7 @@ echo "b" | \ 127.0.0.1 >/dev/null & wait_connected $ns 10000 chk_msk_nr 2 "after MPC handshake" +chk_sndbuf 10000 chk_last_time_info 10000 chk_msk_remote_key_nr 2 "....chk remote_key" chk_msk_fallback_nr 0 "....chk no fallback" From 3bc06da858ef17cfe94b49efc0d9713727012835 Mon Sep 17 00:00:00 2001 From: Brett Creeley Date: Thu, 16 Apr 2026 14:21:21 -0700 Subject: [PATCH 124/148] virtio_net: sync rss_trailer.max_tx_vq on queue_pairs change via VQ_PAIRS_SET When netif_is_rxfh_configured() is true (i.e., the user has explicitly configured the RSS indirection table), virtnet_set_queues() skips the RSS update path and falls through to the VIRTIO_NET_CTRL_MQ_VQ_PAIRS_SET command to change the number of queue pairs. However, it does not update vi->rss_trailer.max_tx_vq to reflect the new queue_pairs value. This causes a mismatch between vi->curr_queue_pairs and vi->rss_trailer.max_tx_vq. Any subsequent RSS reconfiguration (e.g., via ethtool -X) calls virtnet_commit_rss_command(), which sends the stale max_tx_vq to the device, silently reverting the queue count. Reproduction: 1. User configured RSS ethtool -X eth0 equal 8 2. VQ_PAIRS_SET path; max_tx_vq stays 16 ethtool -L eth0 combined 12 3. RSS commit uses max_tx_vq=16 instead of 12 ethtool -X eth0 equal 4 Fix this by updating vi->rss_trailer.max_tx_vq after a successful VQ_PAIRS_SET command when RSS is enabled, keeping it in sync with curr_queue_pairs. Fixes: 50bfcaedd78e ("virtio_net: Update rss when set queue") Signed-off-by: Brett Creeley Acked-by: Michael S. 
Tsirkin Link: https://patch.msgid.link/20260416212121.29073-1-brett.creeley@amd.com Signed-off-by: Jakub Kicinski --- drivers/net/virtio_net.c | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index bfb566fecb92..f4adcfee7a80 100644 --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -3759,6 +3759,12 @@ static int virtnet_set_queues(struct virtnet_info *vi, u16 queue_pairs) queue_pairs); return -EINVAL; } + + /* Keep max_tx_vq in sync so that a later RSS command does not + * revert queue_pairs to a stale value. + */ + if (vi->has_rss) + vi->rss_trailer.max_tx_vq = cpu_to_le16(queue_pairs); succ: vi->curr_queue_pairs = queue_pairs; if (dev->flags & IFF_UP) { From 3d1f20727a635811f6b77801a7b57b8995268abd Mon Sep 17 00:00:00 2001 From: Dexuan Cui Date: Wed, 22 Apr 2026 23:48:11 -0700 Subject: [PATCH 125/148] hv_sock: Return -EIO for malformed/short packets Commit f63152958994 fixes a regression; however, it fails to report an error for malformed/short packets -- normally we should never see such packets, but let's report an error for them just in case. Fixes: f63152958994 ("hv_sock: Report EOF instead of -EIO for FIN") Cc: stable@vger.kernel.org Signed-off-by: Dexuan Cui Acked-by: Stefano Garzarella Link: https://patch.msgid.link/20260423064811.1371749-1-decui@microsoft.com Signed-off-by: Jakub Kicinski --- net/vmw_vsock/hyperv_transport.c | 27 ++++++++++++++++++--------- 1 file changed, 18 insertions(+), 9 deletions(-) diff --git a/net/vmw_vsock/hyperv_transport.c b/net/vmw_vsock/hyperv_transport.c index 76e78c83fdbc..f862988c1e86 100644 --- a/net/vmw_vsock/hyperv_transport.c +++ b/net/vmw_vsock/hyperv_transport.c @@ -704,17 +704,26 @@ static s64 hvs_stream_has_data(struct vsock_sock *vsk) if (hvs->recv_desc) { /* Here hvs->recv_data_len is 0, so hvs->recv_desc must * be NULL unless it points to the 0-byte-payload FIN - * packet: see hvs_update_recv_data(). + * packet or a malformed/short packet: see + * hvs_update_recv_data(). * - * Here all the payload has been dequeued, but - * hvs_channel_readable_payload() still returns 1, - * because the VMBus ringbuffer's read_index is not - * updated for the FIN packet: hvs_stream_dequeue() -> - * hv_pkt_iter_next() updates the cached priv_read_index - * but has no opportunity to update the read_index in - * hv_pkt_iter_close() as hvs_stream_has_data() returns - * 0 for the FIN packet, so it won't get dequeued. + * If hvs->recv_desc points to the FIN packet, here all + * the payload has been dequeued and the peer_shutdown + * flag is set, but hvs_channel_readable_payload() still + * returns 1, because the VMBus ringbuffer's read_index + * is not updated for the FIN packet: + * hvs_stream_dequeue() -> hv_pkt_iter_next() updates + * the cached priv_read_index but has no opportunity to + * update the read_index in hv_pkt_iter_close() as + * hvs_stream_has_data() returns 0 for the FIN packet, + * so it won't get dequeued. + * + * In case hvs->recv_desc points to a malformed/short + * packet, return -EIO. */ + if (!(vsk->peer_shutdown & SEND_SHUTDOWN)) + return -EIO; + return 0; } From 5a8db80f721deee8e916c2cfdee78decda02ce4f Mon Sep 17 00:00:00 2001 From: Ruijie Li Date: Wed, 22 Apr 2026 23:40:18 +0800 Subject: [PATCH 126/148] net/smc: avoid early lgr access in smc_clc_wait_msg A CLC decline can be received while the handshake is still in an early stage, before the connection has been associated with a link group.
The decline handling in smc_clc_wait_msg() updates link-group level sync state for first-contact declines, but that state only exists after link group setup has completed. Guard the link-group update accordingly and keep the per-socket peer diagnosis handling unchanged. This preserves the existing sync_err handling for established link-group contexts and avoids touching link-group state before it is available. Fixes: 0cfdd8f92cac ("smc: connection and link group creation") Cc: stable@kernel.org Reported-by: Yuan Tan Reported-by: Yifan Wu Reported-by: Juefei Pu Reported-by: Xin Liu Signed-off-by: Ruijie Li Signed-off-by: Ren Wei Reviewed-by: Dust Li Link: https://patch.msgid.link/08c68a5c817acf198cce63d22517e232e8d60718.1776850759.git.ruijieli51@gmail.com Signed-off-by: Jakub Kicinski --- net/smc/smc_clc.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/net/smc/smc_clc.c b/net/smc/smc_clc.c index c38fc7bf0a7e..014d527d5462 100644 --- a/net/smc/smc_clc.c +++ b/net/smc/smc_clc.c @@ -788,8 +788,8 @@ int smc_clc_wait_msg(struct smc_sock *smc, void *buf, int buflen, dclc = (struct smc_clc_msg_decline *)clcm; reason_code = SMC_CLC_DECL_PEERDECL; smc->peer_diagnosis = ntohl(dclc->peer_diagnosis); - if (((struct smc_clc_msg_decline *)buf)->hdr.typev2 & - SMC_FIRST_CONTACT_MASK) { + if ((dclc->hdr.typev2 & SMC_FIRST_CONTACT_MASK) && + smc->conn.lgr) { smc->conn.lgr->sync_err = 1; smc_lgr_terminate_sched(smc->conn.lgr); } From 4078c5611d7585548b249377ebd60c272e410490 Mon Sep 17 00:00:00 2001 From: Alexey Kodanev Date: Wed, 22 Apr 2026 16:05:36 +0000 Subject: [PATCH 127/148] nfp: fix swapped arguments in nfp_encode_basic_qdr() calls There is a mismatch between the passed arguments and the actual nfp_encode_basic_qdr() function parameter names: static int nfp_encode_basic_qdr(u64 addr, int dest_island, int cpp_tgt, int mode, bool addr40, int isld1, int isld0) { ... But "dest_island" and "cpp_tgt" are swapped at every call-site. For example: return nfp_encode_basic_qdr(*addr, cpp_tgt, dest_island, mode, addr40, isld1, isld0); As a result, nfp_encode_basic_qdr() receives "dest_island" as CPP target type, which is always NFP_CPP_TARGET_QDR(2) for these calls, and "cpp_tgt" as the destination island ID, which can accidentally match or be outside the valid NFP_CPP_TARGET_* types (e.g. '-1' for any destination). Since the code has already worked this way for years, also add extra pr_warn() calls to the error paths in nfp_encode_basic_qdr() to help identify any potential address verification failures. Detected using the static analysis tool - Svace. Fixes: 4cb584e0ee7d ("nfp: add CPP access core") Signed-off-by: Alexey Kodanev Link: https://patch.msgid.link/20260422160536.61855-1-aleksei.kodanev@bell-sw.com Signed-off-by: Jakub Kicinski --- .../ethernet/netronome/nfp/nfpcore/nfp_target.c | 17 +++++++++++------ 1 file changed, 11 insertions(+), 6 deletions(-) diff --git a/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_target.c b/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_target.c index 79470f198a62..9cf19446657c 100644 --- a/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_target.c +++ b/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_target.c @@ -435,12 +435,17 @@ static int nfp_encode_basic_qdr(u64 addr, int dest_island, int cpp_tgt, /* Full Island ID and channel bits overlap? */ ret = nfp_decode_basic(addr, &v, cpp_tgt, mode, addr40, isld1, isld0); - if (ret) + if (ret) { + pr_warn("%s: decode dest_island failed: %d\n", __func__, ret); return ret; + } /* The current address won't go where expected?
*/ - if (dest_island != -1 && dest_island != v) + if (dest_island != -1 && dest_island != v) { + pr_warn("%s: dest_island mismatch: current (%d) != decoded (%d)\n", + __func__, dest_island, v); return -EINVAL; + } /* If dest_island was -1, we don't care where it goes. */ return 0; @@ -493,7 +498,7 @@ static int nfp_encode_basic(u64 *addr, int dest_island, int cpp_tgt, * the address but we can verify if the existing * contents will point to a valid island. */ - return nfp_encode_basic_qdr(*addr, cpp_tgt, dest_island, + return nfp_encode_basic_qdr(*addr, dest_island, cpp_tgt, mode, addr40, isld1, isld0); iid_lsb = addr40 ? 34 : 26; @@ -504,7 +509,7 @@ static int nfp_encode_basic(u64 *addr, int dest_island, int cpp_tgt, return 0; case 1: if (cpp_tgt == NFP_CPP_TARGET_QDR && !addr40) - return nfp_encode_basic_qdr(*addr, cpp_tgt, dest_island, + return nfp_encode_basic_qdr(*addr, dest_island, cpp_tgt, mode, addr40, isld1, isld0); idx_lsb = addr40 ? 39 : 31; @@ -530,7 +535,7 @@ static int nfp_encode_basic(u64 *addr, int dest_island, int cpp_tgt, * be set before hand and with them select an island. * So we need to confirm that it's at least plausible. */ - return nfp_encode_basic_qdr(*addr, cpp_tgt, dest_island, + return nfp_encode_basic_qdr(*addr, dest_island, cpp_tgt, mode, addr40, isld1, isld0); /* Make sure we compare against isldN values @@ -551,7 +556,7 @@ static int nfp_encode_basic(u64 *addr, int dest_island, int cpp_tgt, * iid<1> = addr<30> = channel<0> * channel<1> = addr<31> = Index */ - return nfp_encode_basic_qdr(*addr, cpp_tgt, dest_island, + return nfp_encode_basic_qdr(*addr, dest_island, cpp_tgt, mode, addr40, isld1, isld0); isld[0] &= ~3; From 42726ec644cbdde0035c3e0417fee8ed9547e120 Mon Sep 17 00:00:00 2001 From: Jiayuan Chen Date: Wed, 22 Apr 2026 20:35:38 +0800 Subject: [PATCH 128/148] tcp: send a challenge ACK on SEG.ACK > SND.NXT RFC 5961 Section 5.2 validates an incoming segment's ACK value against the range [SND.UNA - MAX.SND.WND, SND.NXT] and states: "All incoming segments whose ACK value doesn't satisfy the above condition MUST be discarded and an ACK sent back." Commit 354e4aa391ed ("tcp: RFC 5961 5.2 Blind Data Injection Attack Mitigation") opted Linux into this mitigation and implements the challenge ACK on the lower side (SEG.ACK < SND.UNA - MAX.SND.WND), but the symmetric upper side (SEG.ACK > SND.NXT) still takes the pre-RFC-5961 path and silently returns SKB_DROP_REASON_TCP_ACK_UNSENT_DATA, even though RFC 793 Section 3.9 (now RFC 9293 Section 3.10.7.4) has always required: "If the ACK acknowledges something not yet sent (SEG.ACK > SND.NXT) then send an ACK, drop the segment, and return." Complete the mitigation by sending a challenge ACK on that branch, reusing the existing tcp_send_challenge_ack() path which already enforces the per-socket RFC 5961 Section 7 rate limit via __tcp_oow_rate_limited(). FLAG_NO_CHALLENGE_ACK is honoured for symmetry with the lower-edge case. Update the existing tcp_ts_recent_invalid_ack.pkt selftest, which drives this exact path, to consume the new challenge ACK. 
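A hedged sketch of the combined RFC 5961 5.2 check; the real kernel handles the two edges in separate places (the lower edge in tcp_ack()'s old_ack path, the upper edge as patched above), and before()/after() are its modular 32-bit sequence comparisons:

    if (before(ack, prior_snd_una - tp->max_window) ||  /* lower edge */
        after(ack, tp->snd_nxt)) {                      /* upper edge */
            if (!(flag & FLAG_NO_CHALLENGE_ACK))
                    tcp_send_challenge_ack(sk, false);
            /* discard the segment either way */
    }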
Fixes: 354e4aa391ed ("tcp: RFC 5961 5.2 Blind Data Injection Attack Mitigation") Signed-off-by: Jiayuan Chen Reviewed-by: Eric Dumazet Link: https://patch.msgid.link/20260422123605.320000-2-jiayuan.chen@linux.dev Signed-off-by: Jakub Kicinski --- net/ipv4/tcp_input.c | 10 +++++++--- .../net/packetdrill/tcp_ts_recent_invalid_ack.pkt | 4 +++- 2 files changed, 10 insertions(+), 4 deletions(-) diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index e04ae105893c..d5c9e65d9760 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -4286,11 +4286,15 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag) goto old_ack; } - /* If the ack includes data we haven't sent yet, discard - * this segment (RFC793 Section 3.9). + /* If the ack includes data we haven't sent yet, drop the + * segment. RFC 793 Section 3.9 and RFC 5961 Section 5.2 + * require us to send an ACK back in that case. */ - if (after(ack, tp->snd_nxt)) + if (after(ack, tp->snd_nxt)) { + if (!(flag & FLAG_NO_CHALLENGE_ACK)) + tcp_send_challenge_ack(sk, false); return -SKB_DROP_REASON_TCP_ACK_UNSENT_DATA; + } if (after(ack, prior_snd_una)) { flag |= FLAG_SND_UNA_ADVANCED; diff --git a/tools/testing/selftests/net/packetdrill/tcp_ts_recent_invalid_ack.pkt b/tools/testing/selftests/net/packetdrill/tcp_ts_recent_invalid_ack.pkt index 174ce9a1bfc0..ee6baf7c36cf 100644 --- a/tools/testing/selftests/net/packetdrill/tcp_ts_recent_invalid_ack.pkt +++ b/tools/testing/selftests/net/packetdrill/tcp_ts_recent_invalid_ack.pkt @@ -19,7 +19,9 @@ // bad packet with high tsval (its ACK sequence is above our sndnxt) +0 < F. 1:1(0) ack 9999 win 20000 - +// Challenge ACK for SEG.ACK > SND.NXT (RFC 5961 5.2 / RFC 793 3.9). +// ecr=200 (not 200000) proves ts_recent was not updated from the bad packet. + +0 > . 1:1(0) ack 1 +0 < . 1:1001(1000) ack 1 win 20000 +0 > . 1:1(0) ack 1001 From cf94b3c0f052c2674328b330309604af2dedd3a0 Mon Sep 17 00:00:00 2001 From: Jiayuan Chen Date: Wed, 22 Apr 2026 20:35:39 +0800 Subject: [PATCH 129/148] selftests/net: packetdrill: cover RFC 5961 5.2 challenge ACK on both edges RFC 5961 Section 5.2 / RFC 793 Section 3.9 require a challenge ACK whenever an incoming SEG.ACK falls outside [SND.UNA - MAX.SND.WND, SND.NXT]. There is currently no packetdrill coverage for either edge. Add tcp_rfc5961_ack-out-of-window.pkt, which in a single passive-open connection exercises: - Upper edge (SEG.ACK > SND.NXT): peer ACKs data that was never sent before the server has transmitted anything. - Lower edge (SEG.ACK < SND.UNA - MAX.SND.WND): after the server has sent 2000 bytes (the peer-advertised rwnd forces two 1000-byte segments, both acknowledged), peer sends an ACK that is older than the acceptable window. Both cases must elicit a challenge ACK. The per-socket RFC 5961 Section 7 rate limit is disabled for the duration of the test so that both challenge ACKs can fire back-to-back.
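A standalone worked check of the window arithmetic the test relies on; unlike the kernel, it uses plain comparisons and ignores 32-bit sequence wraparound:

    #include <stdio.h>
    #include <stdint.h>

    /* Acceptable range is [SND.UNA - MAX.SND.WND, SND.NXT]. */
    static int ack_acceptable(uint32_t ack, uint32_t snd_una,
                              uint32_t snd_nxt, uint32_t max_snd_wnd)
    {
            return ack >= snd_una - max_snd_wnd && ack <= snd_nxt;
    }

    int main(void)
    {
            /* After the test's two writes: SND.UNA = SND.NXT = 2001,
             * MAX.SND.WND = 1000, so the range is [1001, 2001].
             */
            printf("%d\n", ack_acceptable(2002, 2001, 2001, 1000)); /* 0: upper */
            printf("%d\n", ack_acceptable(1000, 2001, 2001, 1000)); /* 0: lower */
            printf("%d\n", ack_acceptable(1001, 2001, 2001, 1000)); /* 1: in range */
            return 0;
    }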
Signed-off-by: Jiayuan Chen Reviewed-by: Eric Dumazet Link: https://patch.msgid.link/20260422123605.320000-3-jiayuan.chen@linux.dev Signed-off-by: Jakub Kicinski --- .../tcp_rfc5961_ack-out-of-window.pkt | 48 +++++++++++++++++++ 1 file changed, 48 insertions(+) create mode 100644 tools/testing/selftests/net/packetdrill/tcp_rfc5961_ack-out-of-window.pkt diff --git a/tools/testing/selftests/net/packetdrill/tcp_rfc5961_ack-out-of-window.pkt b/tools/testing/selftests/net/packetdrill/tcp_rfc5961_ack-out-of-window.pkt new file mode 100644 index 000000000000..2776b8728085 --- /dev/null +++ b/tools/testing/selftests/net/packetdrill/tcp_rfc5961_ack-out-of-window.pkt @@ -0,0 +1,48 @@ +// SPDX-License-Identifier: GPL-2.0 +// +// RFC 5961 Section 5.2 / RFC 793 Section 3.9: an incoming segment's +// ACK value must lie in [SND.UNA - MAX.SND.WND, SND.NXT]; otherwise +// the receiver MUST discard the segment and send a challenge ACK +// back. Exercise both edges of that window in a single connection. + +`./defaults.sh +sysctl -q net.ipv4.tcp_invalid_ratelimit=0 +` + + 0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3 + +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 + +0 bind(3, ..., ...) = 0 + +0 listen(3, 1) = 0 + +// Three-way handshake. Peer advertises rwnd = 1000 (no wscale), so +// MAX.SND.WND is tracked as 1000. + +0 < S 0:0(0) win 1000 + +0 > S. 0:0(0) ack 1 <...> ++.1 < . 1:1(0) ack 1 win 1000 + +0 accept(3, ..., ...) = 4 + +// ---- Upper edge: SEG.ACK > SND.NXT -------------------------------- +// Server has sent nothing yet, so SND.UNA = SND.NXT = 1. +// Peer sends a pure ACK with SEG.ACK = 2, beyond SND.NXT. + +0 < . 1:1(0) ack 2 win 1000 +// Expect a challenge ACK: . + +0 > . 1:1(0) ack 1 + +// Advance SND.UNA past MAX.SND.WND so that the lower edge becomes +// reachable. Issue two 1-MSS writes so each skb is exactly one MSS +// and PSH is set by tcp_push() at the end of each sendmsg, keeping +// the setup independent of the TSO / tcp_fragment split path. + +0 write(4, ..., 1000) = 1000 + +0 > P. 1:1001(1000) ack 1 ++.01 < . 1:1(0) ack 1001 win 1000 + +0 write(4, ..., 1000) = 1000 + +0 > P. 1001:2001(1000) ack 1 ++.01 < . 1:1(0) ack 2001 win 1000 +// Now SND.UNA = SND.NXT = 2001, MAX.SND.WND = 1000, bytes_acked = 2000. + +// ---- Lower edge: SEG.ACK < SND.UNA - MAX.SND.WND ------------------ +// SND.UNA - MAX.SND.WND = 2001 - 1000 = 1001, so SEG.ACK = 1000 falls +// below the acceptable range. + +0 < . 1:1(0) ack 1000 win 1000 +// Expect a challenge ACK: . + +0 > . 2001:2001(0) ack 1 From 67bf002a2d7387a6312138210d0bd06e3cf4879b Mon Sep 17 00:00:00 2001 From: Ruide Cao Date: Tue, 21 Apr 2026 12:16:31 +0800 Subject: [PATCH 130/148] ipv4: icmp: validate reply type before using icmp_pointers Extended echo replies use ICMP_EXT_ECHOREPLY as the outbound reply type. That value is outside the range covered by icmp_pointers[], which only describes the traditional ICMP types up to NR_ICMP_TYPES. Avoid consulting icmp_pointers[] for reply types outside that range, and use array_index_nospec() for the remaining in-range lookup. Normal ICMP replies keep their existing behavior unchanged. 
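A hedged sketch of the bounded, speculation-safe table lookup pattern the fix applies; the table and type names here are illustrative:

    #include <linux/nospec.h>

    /* 'table' has NR_TYPES + 1 entries, indexed by message type. */
    static bool type_is_error(u8 type)
    {
            if (type > NR_TYPES)
                    return false;   /* e.g. ICMP_EXT_ECHOREPLY: no entry */

            /* Clamp the index so speculation cannot read past the table. */
            return table[array_index_nospec(type, NR_TYPES + 1)].error;
    }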
Fixes: d329ea5bd884 ("icmp: add response to RFC 8335 PROBE messages") Cc: stable@kernel.org Reported-by: Yuan Tan Reported-by: Yifan Wu Reported-by: Juefei Pu Reported-by: Xin Liu Signed-off-by: Ruide Cao Signed-off-by: Ren Wei Reviewed-by: Simon Horman Link: https://patch.msgid.link/0dace90c01a5978e829ca741ef684dbd7304ce62.1776628519.git.caoruide123@gmail.com Signed-off-by: Jakub Kicinski --- net/ipv4/icmp.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c index 2f4fac22d1ab..7eeff658b467 100644 --- a/net/ipv4/icmp.c +++ b/net/ipv4/icmp.c @@ -64,6 +64,7 @@ #include #include #include +#include #include #include #include @@ -371,7 +372,9 @@ static int icmp_glue_bits(void *from, char *to, int offset, int len, int odd, to, len); skb->csum = csum_block_add(skb->csum, csum, odd); - if (icmp_pointers[icmp_param->data.icmph.type].error) + if (icmp_param->data.icmph.type <= NR_ICMP_TYPES && + icmp_pointers[array_index_nospec(icmp_param->data.icmph.type, + NR_ICMP_TYPES + 1)].error) nf_ct_attach(skb, icmp_param->skb); return 0; } From 864ba40c80edae2b98f47d46f2c39399126aa3d6 Mon Sep 17 00:00:00 2001 From: Ernestas Kulik Date: Tue, 21 Apr 2026 09:02:26 +0300 Subject: [PATCH 131/148] llc: Return -EINPROGRESS from llc_ui_connect() Given a zero sk_sndtimeo, llc_ui_connect() skips waiting for state change and returns 0, confusing userspace applications that will assume the socket is connected, making e.g. getpeername() calls error out. More specifically, the issue was discovered in libcoap, where newly-added AF_LLC socket support was behaving differently from AF_INET connections due to EINPROGRESS handling being skipped. Set rc to -EINPROGRESS if connect() would not block, akin to AF_INET sockets. Signed-off-by: Ernestas Kulik Reviewed-by: Simon Horman Link: https://patch.msgid.link/20260421060304.285419-1-ernestas.k@iconn-networks.com Signed-off-by: Jakub Kicinski --- net/llc/af_llc.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/net/llc/af_llc.c b/net/llc/af_llc.c index 59d593bb5d18..1b210db3119e 100644 --- a/net/llc/af_llc.c +++ b/net/llc/af_llc.c @@ -520,8 +520,10 @@ static int llc_ui_connect(struct socket *sock, struct sockaddr_unsized *uaddr, if (sk->sk_state == TCP_SYN_SENT) { const long timeo = sock_sndtimeo(sk, flags & O_NONBLOCK); - if (!timeo || !llc_ui_wait_for_conn(sk, timeo)) + if (!timeo || !llc_ui_wait_for_conn(sk, timeo)) { + rc = -EINPROGRESS; goto out; + } rc = sock_intr_errno(timeo); if (signal_pending(current)) From d293ca716e7d5dffdaecaf6b9b2f857a33dc3d3a Mon Sep 17 00:00:00 2001 From: Lee Jones Date: Tue, 21 Apr 2026 13:45:26 +0100 Subject: [PATCH 132/148] tipc: fix double-free in tipc_buf_append() tipc_msg_validate() can potentially reallocate the skb it is validating, freeing the old one. In tipc_buf_append(), it was being called with a pointer to a local variable which was a copy of the caller's skb pointer. If the skb was reallocated and validation subsequently failed, the error handling path would free the original skb pointer, which had already been freed, leading to double-free. Fix this by checking if head now points to a newly allocated reassembled skb. If it does, reassign *headbuf for later freeing operations. 
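The aliasing hazard in miniature; a hedged sketch with an illustrative validate() that may free and reallocate the buffer, not the TIPC source:

    struct sk_buff *head = *headbuf;        /* local alias of caller's pointer */

    if (!validate(&head)) {
            /* validate() may have freed the old skb and made 'head' point
             * at a fresh one; *headbuf would then be dangling. Resync so
             * the caller's later kfree_skb(*headbuf) frees the live buffer
             * exactly once instead of the already-freed original.
             */
            if (head != *headbuf)
                    *headbuf = head;
            goto err;
    }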
Fixes: d618d09a68e4 ("tipc: enforce valid ratio between skb truesize and contents") Suggested-by: Tung Nguyen Signed-off-by: Lee Jones Reviewed-by: Tung Nguyen Signed-off-by: Jakub Kicinski --- net/tipc/msg.c | 14 +++++++++++++- 1 file changed, 13 insertions(+), 1 deletion(-) diff --git a/net/tipc/msg.c b/net/tipc/msg.c index 76284fc538eb..b0bba0feef56 100644 --- a/net/tipc/msg.c +++ b/net/tipc/msg.c @@ -177,8 +177,20 @@ int tipc_buf_append(struct sk_buff **headbuf, struct sk_buff **buf) if (fragid == LAST_FRAGMENT) { TIPC_SKB_CB(head)->validated = 0; - if (unlikely(!tipc_msg_validate(&head))) + + /* If the reassembled skb has been freed in + * tipc_msg_validate() because of an invalid truesize, + * then head will point to a newly allocated reassembled + * skb, while *headbuf points to freed reassembled skb. + * In such cases, correct *headbuf for freeing the newly + * allocated reassembled skb later. + */ + if (unlikely(!tipc_msg_validate(&head))) { + if (head != *headbuf) + *headbuf = head; goto err; + } + *buf = head; TIPC_SKB_CB(head)->tail = NULL; *headbuf = NULL; From 076b8cad77aa96557719fb5effe8703bfb64df00 Mon Sep 17 00:00:00 2001 From: Daniel Borkmann Date: Tue, 21 Apr 2026 22:24:06 +0200 Subject: [PATCH 133/148] ipv6: Cap TLV scan in ip6_tnl_parse_tlv_enc_lim Commit 47d3d7ac656a ("ipv6: Implement limits on Hop-by-Hop and Destination options") added net.ipv6.max_{hbh,dst}_opts_{cnt,len} and applied them in ip6_parse_tlv(), the generic TLV walker invoked from ipv6_destopt_rcv() and ipv6_parse_hopopts(). ip6_tnl_parse_tlv_enc_lim() does not go through ip6_parse_tlv(); it has its own hand-rolled TLV scanner inside its NEXTHDR_DEST branch which looks for IPV6_TLV_TNL_ENCAP_LIMIT. That inner loop is bounded only by optlen, which can be up to 2048 bytes. Stuffing the Destination Options header with 2046 Pad1 (type=0) entries advances the scanner a single byte at a time, yielding ~2000 TLV iterations per extension header. Reusing max_dst_opts_cnt to bound the TLV iterations, matching the semantics from 47d3d7ac656a, would require duplicating ip6_parse_tlv() to also validate Pad1/PadN payload. It would also mandate enforcing max_dst_opts_len, since otherwise an attacker shifts the axis to few options with a giant PadN and recovers the original DoS. Allowing up to 8 options before the tunnel encapsulation limit TLV is liberal enough; in practice encap limit is the first TLV. Thus, go with a hard-coded limit IP6_TUNNEL_MAX_DEST_TLVS (8). 
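A minimal sketch of a destination-options TLV walk showing why Pad1 is the worst case; illustrative only, with no bounds or validity checking:

    /* One iteration per byte for Pad1, since Pad1 has no length field. */
    static int tlv_steps(const unsigned char *opt, int optlen)
    {
            int off = 2, steps = 0;         /* skip next-header + hdr-ext-len */

            while (off < optlen) {
                    steps++;
                    if (opt[off] == 0)      /* Pad1: a single zero byte */
                            off += 1;
                    else                    /* PadN and real TLVs: type, len, data */
                            off += 2 + opt[off + 1];
            }
            return steps;                   /* ~2046 for 2046 Pad1 bytes */
    }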
Signed-off-by: Daniel Borkmann Reviewed-by: Ido Schimmel Reviewed-by: Justin Iurman Signed-off-by: Jakub Kicinski --- net/ipv6/ip6_tunnel.c | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c index 46bc06506470..c468c83af0f2 100644 --- a/net/ipv6/ip6_tunnel.c +++ b/net/ipv6/ip6_tunnel.c @@ -62,6 +62,8 @@ MODULE_LICENSE("GPL"); MODULE_ALIAS_RTNL_LINK("ip6tnl"); MODULE_ALIAS_NETDEV("ip6tnl0"); +#define IP6_TUNNEL_MAX_DEST_TLVS 8 + #define IP6_TUNNEL_HASH_SIZE_SHIFT 5 #define IP6_TUNNEL_HASH_SIZE (1 << IP6_TUNNEL_HASH_SIZE_SHIFT) @@ -425,11 +427,15 @@ __u16 ip6_tnl_parse_tlv_enc_lim(struct sk_buff *skb, __u8 *raw) break; } if (nexthdr == NEXTHDR_DEST) { + int tlv_cnt = 0; u16 i = 2; while (1) { struct ipv6_tlv_tnl_enc_lim *tel; + if (unlikely(tlv_cnt++ >= IP6_TUNNEL_MAX_DEST_TLVS)) + break; + /* No more room for encapsulation limit */ if (i + sizeof(*tel) > optlen) break; From e08a9fac5cf8c3fecf4755e7e3ac059f78b8f83d Mon Sep 17 00:00:00 2001 From: Kohei Enju Date: Wed, 22 Apr 2026 02:30:24 +0000 Subject: [PATCH 134/148] vhost_net: fix sleeping with preempt-disabled in vhost_net_busy_poll() syzbot reported "sleeping function called from invalid context" in vhost_net_busy_poll(). Commit 030881372460 ("vhost_net: basic polling support") introduced a busy-poll loop and preempt_{disable,enable}() around it, where each iteration calls a sleepable function inside the loop. The purpose of disabling preemption was to keep local_clock()-based timeout accounting on a single CPU, rather than as a requirement of busy-poll itself: https://lore.kernel.org/1448435489-5949-4-git-send-email-jasowang@redhat.com From this perspective, migrate_disable() is sufficient here, so replace preempt_disable() with migrate_disable(), avoiding sleepable accesses from a preempt-disabled context. Fixes: 030881372460 ("vhost_net: basic polling support") Tested-by: syzbot+6985cb8e543ea90ba8ee@syzkaller.appspotmail.com Reported-by: syzbot+6985cb8e543ea90ba8ee@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/69e6a414.050a0220.24bfd3.002d.GAE@google.com/T/ Signed-off-by: Kohei Enju Acked-by: Michael S. Tsirkin Signed-off-by: Jakub Kicinski --- drivers/vhost/net.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c index 80965181920c..c6536cad9c4f 100644 --- a/drivers/vhost/net.c +++ b/drivers/vhost/net.c @@ -560,7 +560,7 @@ static void vhost_net_busy_poll(struct vhost_net *net, busyloop_timeout = poll_rx ? rvq->busyloop_timeout: tvq->busyloop_timeout; - preempt_disable(); + migrate_disable(); endtime = busy_clock() + busyloop_timeout; while (vhost_can_busy_poll(endtime)) { @@ -577,7 +577,7 @@ static void vhost_net_busy_poll(struct vhost_net *net, cpu_relax(); } - preempt_enable(); + migrate_enable(); if (poll_rx || sock_has_rx_data(sock)) vhost_net_busy_poll_try_queue(net, vq); From 3864c6ba1e041bc75342353a70fa2a2c6f909923 Mon Sep 17 00:00:00 2001 From: Zhenzhong Wu Date: Wed, 22 Apr 2026 10:45:53 +0800 Subject: [PATCH 135/148] tcp: call sk_data_ready() after listener migration When inet_csk_listen_stop() migrates an established child socket from a closing listener to another socket in the same SO_REUSEPORT group, the target listener gets a new accept-queue entry via inet_csk_reqsk_queue_add(), but that path never notifies the target listener's waiters. 
A nonblocking accept() still works because it checks the queue directly, but poll()/epoll_wait() waiters and blocking accept() callers can also remain asleep indefinitely. Call READ_ONCE(nsk->sk_data_ready)(nsk) after a successful migration in inet_csk_listen_stop(). However, after inet_csk_reqsk_queue_add() succeeds, the ref acquired in reuseport_migrate_sock() is effectively transferred to nreq->rsk_listener. Another CPU can then dequeue nreq via accept() or listener shutdown, hit reqsk_put(), and drop that listener ref. Since listeners are SOCK_RCU_FREE, wrap the post-queue_add() dereferences of nsk in rcu_read_lock()/rcu_read_unlock(), which also covers the existing sock_net(nsk) access in that path. The reqsk_timer_handler() path does not need the same changes for two reasons: half-open requests become readable only after the final ACK, where tcp_child_process() already wakes the listener; and once nreq is visible via inet_ehash_insert(), the success path no longer touches nsk directly. Fixes: 54b92e841937 ("tcp: Migrate TCP_ESTABLISHED/TCP_SYN_RECV sockets in accept queues.") Cc: stable@vger.kernel.org Suggested-by: Eric Dumazet Reviewed-by: Kuniyuki Iwashima Signed-off-by: Zhenzhong Wu Reviewed-by: Eric Dumazet Link: https://patch.msgid.link/20260422024554.130346-2-jt26wzz@gmail.com Signed-off-by: Jakub Kicinski --- net/ipv4/inet_connection_sock.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c index 4ac3ae1bc1af..928654c34156 100644 --- a/net/ipv4/inet_connection_sock.c +++ b/net/ipv4/inet_connection_sock.c @@ -1479,16 +1479,19 @@ void inet_csk_listen_stop(struct sock *sk) if (nreq) { refcount_set(&nreq->rsk_refcnt, 1); + rcu_read_lock(); if (inet_csk_reqsk_queue_add(nsk, nreq, child)) { __NET_INC_STATS(sock_net(nsk), LINUX_MIB_TCPMIGRATEREQSUCCESS); reqsk_migrate_reset(req); + READ_ONCE(nsk->sk_data_ready)(nsk); } else { __NET_INC_STATS(sock_net(nsk), LINUX_MIB_TCPMIGRATEREQFAILURE); reqsk_migrate_reset(nreq); __reqsk_free(nreq); } + rcu_read_unlock(); /* inet_csk_reqsk_queue_add() has already * called inet_child_forget() on failure case. From c01cfc4886752c86f6d48c58f7b103a01b0caeca Mon Sep 17 00:00:00 2001 From: Zhenzhong Wu Date: Wed, 22 Apr 2026 10:45:54 +0800 Subject: [PATCH 136/148] selftests/bpf: check epoll readiness during reuseport migration Inside migrate_dance(), add epoll checks around shutdown() to verify that the target listener is not ready before shutdown() and becomes ready immediately after shutdown() triggers migration. Cover TCP_ESTABLISHED and TCP_SYN_RECV. Exclude TCP_NEW_SYN_RECV as it depends on later handshake completion. Suggested-by: Kuniyuki Iwashima Reviewed-by: Kuniyuki Iwashima Signed-off-by: Zhenzhong Wu Link: https://patch.msgid.link/20260422024554.130346-3-jt26wzz@gmail.com Signed-off-by: Jakub Kicinski --- .../bpf/prog_tests/migrate_reuseport.c | 49 ++++++++++++++++--- 1 file changed, 42 insertions(+), 7 deletions(-) diff --git a/tools/testing/selftests/bpf/prog_tests/migrate_reuseport.c b/tools/testing/selftests/bpf/prog_tests/migrate_reuseport.c index 653b0a20fab9..c62907732c19 100644 --- a/tools/testing/selftests/bpf/prog_tests/migrate_reuseport.c +++ b/tools/testing/selftests/bpf/prog_tests/migrate_reuseport.c @@ -7,24 +7,29 @@ * 3. call listen() for 1 server socket. (migration target) * 4. update a map to migrate all child sockets * to the last server socket (migrate_map[cookie] = 4) - * 5. call shutdown() for first 4 server sockets + * 5. 
for TCP_ESTABLISHED and TCP_SYN_RECV cases, verify via epoll + * that the last server socket is not ready before migration. + * 6. call shutdown() for first 4 server sockets and migrate the requests in the accept queue to the last server socket. - * 6. call listen() for the second server socket. - * 7. call shutdown() for the last server + * 7. for TCP_ESTABLISHED and TCP_SYN_RECV cases, verify via epoll + * that the last server socket is ready after migration. + * 8. call listen() for the second server socket. + * 9. call shutdown() for the last server * and migrate the requests in the accept queue * to the second server socket. - * 8. call listen() for the last server. - * 9. call shutdown() for the second server + * 10. call listen() for the last server. + * 11. call shutdown() for the second server * and migrate the requests in the accept queue * to the last server socket. - * 10. call accept() for the last server socket. + * 12. call accept() for the last server socket. * * Author: Kuniyuki Iwashima */ #include #include +#include #include "test_progs.h" #include "test_migrate_reuseport.skel.h" @@ -350,21 +355,51 @@ static int update_maps(struct migrate_reuseport_test_case *test_case, static int migrate_dance(struct migrate_reuseport_test_case *test_case) { + struct epoll_event ev = { + .events = EPOLLIN, + }; + int epoll = -1, nfds; int i, err; + if (test_case->state != BPF_TCP_NEW_SYN_RECV) { + epoll = epoll_create1(0); + if (!ASSERT_NEQ(epoll, -1, "epoll_create1")) + return -1; + + ev.data.fd = test_case->servers[MIGRATED_TO]; + if (!ASSERT_OK(epoll_ctl(epoll, EPOLL_CTL_ADD, + test_case->servers[MIGRATED_TO], &ev), + "epoll_ctl")) + goto close_epoll; + + nfds = epoll_wait(epoll, &ev, 1, 0); + if (!ASSERT_EQ(nfds, 0, "epoll_wait 1")) + goto close_epoll; + } + /* Migrate TCP_ESTABLISHED and TCP_SYN_RECV requests * to the last listener based on eBPF. */ for (i = 0; i < MIGRATED_TO; i++) { err = shutdown(test_case->servers[i], SHUT_RDWR); if (!ASSERT_OK(err, "shutdown")) - return -1; + goto close_epoll; } /* No dance for TCP_NEW_SYN_RECV to migrate based on eBPF */ if (test_case->state == BPF_TCP_NEW_SYN_RECV) return 0; + nfds = epoll_wait(epoll, &ev, 1, 0); + if (!ASSERT_EQ(nfds, 1, "epoll_wait 2")) { +close_epoll: + if (epoll >= 0) + close(epoll); + return -1; + } + + close(epoll); + /* Note that we use the second listener instead of the * first one here. * From c263f644add3d6ad81f9d62a99284fde408f0caa Mon Sep 17 00:00:00 2001 From: Jiawen Wu Date: Wed, 22 Apr 2026 15:18:37 +0800 Subject: [PATCH 137/148] net: txgbe: fix firmware version check For the device SP, the firmware version is a 32-bit value where the lower 20 bits represent the base version number. The customized firmware version populates the upper 12 bits with a specific identification number. For the other devices, AML 25G and 40G, the upper 12 bits of the firmware version are always non-zero, and they have other naming conventions. Only SP devices need to check this to tell if XPCS will work properly. So the judgement of MAC type is added here. The original logic compared the entire 32-bit value against 0x20010, which allowed outdated base firmware to bypass the version check without a warning. Apply a mask 0xfffff to isolate the lower 20 bits for an accurate base version comparison.
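A standalone worked example of the masked comparison; the etrack_id value is illustrative:

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
            /* Illustrative customized firmware: ID 0x123 in the upper 12
             * bits, outdated base version 0x20001 in the lower 20 bits.
             */
            uint32_t etrack_id = 0x12320001;

            printf("old: %d\n", etrack_id < 0x20010);             /* 0: no warning */
            printf("new: %d\n", (etrack_id & 0xfffff) < 0x20010); /* 1: warns */
            return 0;
    }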
Fixes: ab928c24e6cd ("net: txgbe: add FW version warning") Cc: stable@vger.kernel.org Signed-off-by: Jiawen Wu Reviewed-by: Jacob Keller Link: https://patch.msgid.link/C787AA5C07598B13+20260422071837.372731-1-jiawenwu@trustnetic.com Signed-off-by: Jakub Kicinski --- drivers/net/ethernet/wangxun/txgbe/txgbe_main.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/net/ethernet/wangxun/txgbe/txgbe_main.c b/drivers/net/ethernet/wangxun/txgbe/txgbe_main.c index ec32a5f422f2..8b7c3753bb6a 100644 --- a/drivers/net/ethernet/wangxun/txgbe/txgbe_main.c +++ b/drivers/net/ethernet/wangxun/txgbe/txgbe_main.c @@ -864,7 +864,8 @@ static int txgbe_probe(struct pci_dev *pdev, "0x%08x", etrack_id); } - if (etrack_id < 0x20010) + if (wx->mac.type == wx_mac_sp && + ((etrack_id & 0xfffff) < 0x20010)) dev_warn(&pdev->dev, "Please upgrade the firmware to 0x20010 or above.\n"); err = txgbe_test_hostif(wx); From 7256eb3e0909fd77902d6d6ff086fd430cff3a58 Mon Sep 17 00:00:00 2001 From: Daniel Palmer Date: Wed, 22 Apr 2026 22:27:10 +0900 Subject: [PATCH 138/148] m68k: mvme147: Make me the maintainer I'm actively using mainline + patches on this board as a bootloader for another VME board and as a terminal server using a multiport serial board in the same VME backplane. I even have mainline u-boot on real EPROMs. Make me the maintainer of its ethernet, scsi and arch code so I get an email before one or more of them get deleted. Signed-off-by: Daniel Palmer Link: https://patch.msgid.link/20260422132710.2855826-1-daniel@thingy.jp Signed-off-by: Jakub Kicinski --- MAINTAINERS | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/MAINTAINERS b/MAINTAINERS index e7dc9e6fad2e..2f9509b7bd53 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -15235,6 +15235,13 @@ S: Maintained W: http://www.tazenda.demon.co.uk/phil/linux-hp F: arch/m68k/hp300/ +M68K ON MVME147 +M: Daniel Palmer +S: Maintained +F: arch/m68k/mvme147/ +F: drivers/net/ethernet/amd/mvme147.c +F: drivers/scsi/mvme147.* + M88DS3103 MEDIA DRIVER L: linux-media@vger.kernel.org S: Orphan From 8141a2dc70080eda1aedc0389ed2db2b292af5bd Mon Sep 17 00:00:00 2001 From: Ao Zhou Date: Wed, 22 Apr 2026 22:52:07 +0800 Subject: [PATCH 139/148] net: rds: fix MR cleanup on copy error __rds_rdma_map() hands sg/pages ownership to the transport after get_mr() succeeds. If copying the generated cookie back to user space fails after that point, the error path must not free those resources again before dropping the MR reference. Remove the duplicate unpin/free from the put_user() failure branch so that MR teardown is handled only through the existing final cleanup path. 
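A hedged sketch of the ownership rule the fix restores; simplified fragment with illustrative helper names, not the RDS source:

    mr = get_mr(sg, nr_pages);      /* on success, the MR owns sg/pages */
    if (IS_ERR(mr)) {
            unpin_user_pages(pages, nr_pages);      /* still ours to release */
            kfree(sg);
            return PTR_ERR(mr);
    }

    if (put_user(cookie, cookie_addr)) {
            ret = -EFAULT;          /* do NOT unpin/kfree here: the final */
            goto out;               /* MR put tears those down exactly once */
    }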
Fixes: 0d4597c8c5ab ("net/rds: Track user mapped pages through special API") Cc: stable@kernel.org Reported-by: Yuan Tan Reported-by: Yifan Wu Reported-by: Juefei Pu Reported-by: Xin Liu Signed-off-by: Ao Zhou Signed-off-by: Ren Wei Reviewed-by: Allison Henderson Link: https://patch.msgid.link/79c8ef73ec8e5844d71038983940cc2943099baf.1776764247.git.draw51280@163.com Signed-off-by: Jakub Kicinski --- net/rds/rdma.c | 4 ---- 1 file changed, 4 deletions(-) diff --git a/net/rds/rdma.c b/net/rds/rdma.c index aa6465dc742c..61fb6e45281b 100644 --- a/net/rds/rdma.c +++ b/net/rds/rdma.c @@ -326,10 +326,6 @@ static int __rds_rdma_map(struct rds_sock *rs, struct rds_get_mr_args *args, if (args->cookie_addr && put_user(cookie, (u64 __user *)(unsigned long)args->cookie_addr)) { - if (!need_odp) { - unpin_user_pages(pages, nr_pages); - kfree(sg); - } ret = -EFAULT; goto out; } From 34f61a07e0cdefaecd3ec03bb5fb22215643678f Mon Sep 17 00:00:00 2001 From: David Howells Date: Wed, 22 Apr 2026 17:14:30 +0100 Subject: [PATCH 140/148] rxrpc: Fix memory leaks in rxkad_verify_response() Fix rxkad_verify_response() to free the ticket and the server key under all circumstances by initialising the ticket pointer to NULL and then making all paths through the function after the first allocation has been done go through a single common epilogue that just releases everything - where all the releases skip on a NULL pointer. Fixes: 57af281e5389 ("rxrpc: Tidy up abort generation infrastructure") Fixes: ec832bd06d6f ("rxrpc: Don't retain the server key in the connection") Closes: https://sashiko.dev/#/patchset/20260408121252.2249051-1-dhowells%40redhat.com Signed-off-by: David Howells cc: Marc Dionne cc: Jeffrey Altman cc: Simon Horman cc: linux-afs@lists.infradead.org cc: stable@kernel.org Link: https://patch.msgid.link/20260422161438.2593376-2-dhowells@redhat.com Signed-off-by: Jakub Kicinski --- net/rxrpc/rxkad.c | 103 +++++++++++++++++++--------------------------- 1 file changed, 42 insertions(+), 61 deletions(-) diff --git a/net/rxrpc/rxkad.c b/net/rxrpc/rxkad.c index eb7f2769d2b1..5a720222854f 100644 --- a/net/rxrpc/rxkad.c +++ b/net/rxrpc/rxkad.c @@ -1136,7 +1136,7 @@ static int rxkad_verify_response(struct rxrpc_connection *conn, struct rxrpc_crypt session_key; struct key *server_key; time64_t expiry; - void *ticket; + void *ticket = NULL; u32 version, kvno, ticket_len, level; __be32 csum; int ret, i; @@ -1162,13 +1162,13 @@ static int rxkad_verify_response(struct rxrpc_connection *conn, ret = -ENOMEM; response = kzalloc_obj(struct rxkad_response, GFP_NOFS); if (!response) - goto temporary_error; + goto error; if (skb_copy_bits(skb, sizeof(struct rxrpc_wire_header), response, sizeof(*response)) < 0) { - rxrpc_abort_conn(conn, skb, RXKADPACKETSHORT, -EPROTO, - rxkad_abort_resp_short); - goto protocol_error; + ret = rxrpc_abort_conn(conn, skb, RXKADPACKETSHORT, -EPROTO, + rxkad_abort_resp_short); + goto error; } version = ntohl(response->version); @@ -1178,62 +1178,62 @@ static int rxkad_verify_response(struct rxrpc_connection *conn, trace_rxrpc_rx_response(conn, sp->hdr.serial, version, kvno, ticket_len); if (version != RXKAD_VERSION) { - rxrpc_abort_conn(conn, skb, RXKADINCONSISTENCY, -EPROTO, - rxkad_abort_resp_version); - goto protocol_error; + ret = rxrpc_abort_conn(conn, skb, RXKADINCONSISTENCY, -EPROTO, + rxkad_abort_resp_version); + goto error; } if (ticket_len < 4 || ticket_len > MAXKRB5TICKETLEN) { - rxrpc_abort_conn(conn, skb, RXKADTICKETLEN, -EPROTO, - rxkad_abort_resp_tkt_len); - goto protocol_error; + 
ret = rxrpc_abort_conn(conn, skb, RXKADTICKETLEN, -EPROTO, + rxkad_abort_resp_tkt_len); + goto error; } if (kvno >= RXKAD_TKT_TYPE_KERBEROS_V5) { - rxrpc_abort_conn(conn, skb, RXKADUNKNOWNKEY, -EPROTO, - rxkad_abort_resp_unknown_tkt); - goto protocol_error; + ret = rxrpc_abort_conn(conn, skb, RXKADUNKNOWNKEY, -EPROTO, + rxkad_abort_resp_unknown_tkt); + goto error; } /* extract the kerberos ticket and decrypt and decode it */ ret = -ENOMEM; ticket = kmalloc(ticket_len, GFP_NOFS); if (!ticket) - goto temporary_error_free_resp; + goto error; if (skb_copy_bits(skb, sizeof(struct rxrpc_wire_header) + sizeof(*response), ticket, ticket_len) < 0) { - rxrpc_abort_conn(conn, skb, RXKADPACKETSHORT, -EPROTO, - rxkad_abort_resp_short_tkt); - goto protocol_error; + ret = rxrpc_abort_conn(conn, skb, RXKADPACKETSHORT, -EPROTO, + rxkad_abort_resp_short_tkt); + goto error; } ret = rxkad_decrypt_ticket(conn, server_key, skb, ticket, ticket_len, &session_key, &expiry); if (ret < 0) - goto temporary_error_free_ticket; + goto error; /* use the session key from inside the ticket to decrypt the * response */ ret = rxkad_decrypt_response(conn, response, &session_key); if (ret < 0) - goto temporary_error_free_ticket; + goto error; if (ntohl(response->encrypted.epoch) != conn->proto.epoch || ntohl(response->encrypted.cid) != conn->proto.cid || ntohl(response->encrypted.securityIndex) != conn->security_ix) { - rxrpc_abort_conn(conn, skb, RXKADSEALEDINCON, -EPROTO, - rxkad_abort_resp_bad_param); - goto protocol_error_free; + ret = rxrpc_abort_conn(conn, skb, RXKADSEALEDINCON, -EPROTO, + rxkad_abort_resp_bad_param); + goto error; } csum = response->encrypted.checksum; response->encrypted.checksum = 0; rxkad_calc_response_checksum(response); if (response->encrypted.checksum != csum) { - rxrpc_abort_conn(conn, skb, RXKADSEALEDINCON, -EPROTO, - rxkad_abort_resp_bad_checksum); - goto protocol_error_free; + ret = rxrpc_abort_conn(conn, skb, RXKADSEALEDINCON, -EPROTO, + rxkad_abort_resp_bad_checksum); + goto error; } for (i = 0; i < RXRPC_MAXCALLS; i++) { @@ -1241,38 +1241,38 @@ static int rxkad_verify_response(struct rxrpc_connection *conn, u32 counter = READ_ONCE(conn->channels[i].call_counter); if (call_id > INT_MAX) { - rxrpc_abort_conn(conn, skb, RXKADSEALEDINCON, -EPROTO, - rxkad_abort_resp_bad_callid); - goto protocol_error_free; + ret = rxrpc_abort_conn(conn, skb, RXKADSEALEDINCON, -EPROTO, + rxkad_abort_resp_bad_callid); + goto error; } if (call_id < counter) { - rxrpc_abort_conn(conn, skb, RXKADSEALEDINCON, -EPROTO, - rxkad_abort_resp_call_ctr); - goto protocol_error_free; + ret = rxrpc_abort_conn(conn, skb, RXKADSEALEDINCON, -EPROTO, + rxkad_abort_resp_call_ctr); + goto error; } if (call_id > counter) { if (conn->channels[i].call) { - rxrpc_abort_conn(conn, skb, RXKADSEALEDINCON, -EPROTO, + ret = rxrpc_abort_conn(conn, skb, RXKADSEALEDINCON, -EPROTO, rxkad_abort_resp_call_state); - goto protocol_error_free; + goto error; } conn->channels[i].call_counter = call_id; } } if (ntohl(response->encrypted.inc_nonce) != conn->rxkad.nonce + 1) { - rxrpc_abort_conn(conn, skb, RXKADOUTOFSEQUENCE, -EPROTO, - rxkad_abort_resp_ooseq); - goto protocol_error_free; + ret = rxrpc_abort_conn(conn, skb, RXKADOUTOFSEQUENCE, -EPROTO, + rxkad_abort_resp_ooseq); + goto error; } level = ntohl(response->encrypted.level); if (level > RXRPC_SECURITY_ENCRYPT) { - rxrpc_abort_conn(conn, skb, RXKADLEVELFAIL, -EPROTO, - rxkad_abort_resp_level); - goto protocol_error_free; + ret = rxrpc_abort_conn(conn, skb, RXKADLEVELFAIL, -EPROTO, + 
rxkad_abort_resp_level); + goto error; } conn->security_level = level; @@ -1280,31 +1280,12 @@ static int rxkad_verify_response(struct rxrpc_connection *conn, * this the connection security can be handled in exactly the same way * as for a client connection */ ret = rxrpc_get_server_data_key(conn, &session_key, expiry, kvno); - if (ret < 0) - goto temporary_error_free_ticket; +error: kfree(ticket); - kfree(response); - _leave(" = 0"); - return 0; - -protocol_error_free: - kfree(ticket); -protocol_error: kfree(response); key_put(server_key); - return -EPROTO; - -temporary_error_free_ticket: - kfree(ticket); -temporary_error_free_resp: - kfree(response); -temporary_error: - /* Ignore the response packet if we got a temporary error such as - * ENOMEM. We just want to send the challenge again. Note that we - * also come out this way if the ticket decryption fails. - */ - key_put(server_key); + _leave(" = %d", ret); return ret; } From def304aae2edf321d2671fd6ca766a93c21f877e Mon Sep 17 00:00:00 2001 From: David Howells Date: Wed, 22 Apr 2026 17:14:31 +0100 Subject: [PATCH 141/148] rxrpc: Fix rxkad crypto unalignment handling Fix handling of a packet with a misaligned crypto length. Also handle non-ENOMEM errors from decryption by aborting. Further, remove the WARN_ON_ONCE() so that it can't be remotely triggered (a trace line can still be emitted). Fixes: f93af41b9f5f ("rxrpc: Fix missing error checks for rxkad encryption/decryption failure") Closes: https://sashiko.dev/#/patchset/20260408121252.2249051-1-dhowells%40redhat.com Signed-off-by: David Howells cc: Marc Dionne cc: Jeffrey Altman cc: Simon Horman cc: linux-afs@lists.infradead.org cc: stable@kernel.org Link: https://patch.msgid.link/20260422161438.2593376-3-dhowells@redhat.com Signed-off-by: Jakub Kicinski --- include/trace/events/rxrpc.h | 1 + net/rxrpc/rxkad.c | 9 +++++++-- 2 files changed, 8 insertions(+), 2 deletions(-) diff --git a/include/trace/events/rxrpc.h b/include/trace/events/rxrpc.h index 578b8038b211..5820d7e41ea0 100644 --- a/include/trace/events/rxrpc.h +++ b/include/trace/events/rxrpc.h @@ -37,6 +37,7 @@ EM(rxkad_abort_1_short_encdata, "rxkad1-short-encdata") \ EM(rxkad_abort_1_short_header, "rxkad1-short-hdr") \ EM(rxkad_abort_2_short_check, "rxkad2-short-check") \ + EM(rxkad_abort_2_crypto_unaligned, "rxkad2-crypto-unaligned") \ EM(rxkad_abort_2_short_data, "rxkad2-short-data") \ EM(rxkad_abort_2_short_header, "rxkad2-short-hdr") \ EM(rxkad_abort_2_short_len, "rxkad2-short-len") \ diff --git a/net/rxrpc/rxkad.c b/net/rxrpc/rxkad.c index 5a720222854f..cba7935977f0 100644 --- a/net/rxrpc/rxkad.c +++ b/net/rxrpc/rxkad.c @@ -510,6 +510,9 @@ static int rxkad_verify_packet_2(struct rxrpc_call *call, struct sk_buff *skb, return rxrpc_abort_eproto(call, skb, RXKADSEALEDINCON, rxkad_abort_2_short_header); + /* Don't let the crypto algo see a misaligned length. */ + sp->len = round_down(sp->len, 8); + /* Decrypt the skbuff in-place. TODO: We really want to decrypt * directly into the target buffer. 
*/ @@ -543,8 +546,10 @@ static int rxkad_verify_packet_2(struct rxrpc_call *call, struct sk_buff *skb, if (sg != _sg) kfree(sg); if (ret < 0) { - WARN_ON_ONCE(ret != -ENOMEM); - return ret; + if (ret == -ENOMEM) + return ret; + return rxrpc_abort_eproto(call, skb, RXKADSEALEDINCON, + rxkad_abort_2_crypto_unaligned); } /* Extract the decrypted packet length */ From 1f2740150f904bfa60e4bad74d65add3ccb5e7f8 Mon Sep 17 00:00:00 2001 From: David Howells Date: Wed, 22 Apr 2026 17:14:32 +0100 Subject: [PATCH 142/148] rxrpc: Fix potential UAF after skb_unshare() failure If skb_unshare() fails to unshare a packet due to allocation failure in rxrpc_input_packet(), the skb pointer in the parent (rxrpc_io_thread()) will be NULL'd out. This will likely cause the call to trace_rxrpc_rx_done() to oops. Fix this by moving the unsharing down to where rxrpc_input_call_event() calls rxrpc_input_call_packet(). There are a number of places prior to that where we ignore DATA packets for a variety of reasons (such as the call already being complete) for which an unshare is then avoided. And with that, rxrpc_input_packet() doesn't need to take a pointer to the pointer to the packet, so change that to just a pointer. Fixes: 2d1faf7a0ca3 ("rxrpc: Simplify skbuff accounting in receive path") Closes: https://sashiko.dev/#/patchset/20260408121252.2249051-1-dhowells%40redhat.com Signed-off-by: David Howells cc: Marc Dionne cc: Jeffrey Altman cc: Simon Horman cc: linux-afs@lists.infradead.org cc: stable@kernel.org Link: https://patch.msgid.link/20260422161438.2593376-4-dhowells@redhat.com Signed-off-by: Jakub Kicinski --- include/trace/events/rxrpc.h | 4 ++-- net/rxrpc/ar-internal.h | 1 - net/rxrpc/call_event.c | 19 ++++++++++++++++++- net/rxrpc/io_thread.c | 24 ++---------------------- net/rxrpc/skbuff.c | 9 --------- 5 files changed, 22 insertions(+), 35 deletions(-) diff --git a/include/trace/events/rxrpc.h b/include/trace/events/rxrpc.h index 5820d7e41ea0..13b9d017f8e1 100644 --- a/include/trace/events/rxrpc.h +++ b/include/trace/events/rxrpc.h @@ -162,8 +162,6 @@ E_(rxrpc_call_poke_timer_now, "Timer-now") #define rxrpc_skb_traces \ - EM(rxrpc_skb_eaten_by_unshare, "ETN unshare ") \ - EM(rxrpc_skb_eaten_by_unshare_nomem, "ETN unshar-nm") \ EM(rxrpc_skb_get_call_rx, "GET call-rx ") \ EM(rxrpc_skb_get_conn_secured, "GET conn-secd") \ EM(rxrpc_skb_get_conn_work, "GET conn-work") \ @@ -190,6 +188,7 @@ EM(rxrpc_skb_put_purge, "PUT purge ") \ EM(rxrpc_skb_put_purge_oob, "PUT purge-oob") \ EM(rxrpc_skb_put_response, "PUT response ") \ + EM(rxrpc_skb_put_response_copy, "PUT resp-cpy ") \ EM(rxrpc_skb_put_rotate, "PUT rotate ") \ EM(rxrpc_skb_put_unknown, "PUT unknown ") \ EM(rxrpc_skb_see_conn_work, "SEE conn-work") \ @@ -198,6 +197,7 @@ EM(rxrpc_skb_see_recvmsg_oob, "SEE recvm-oob") \ EM(rxrpc_skb_see_reject, "SEE reject ") \ EM(rxrpc_skb_see_rotate, "SEE rotate ") \ + EM(rxrpc_skb_see_unshare_nomem, "SEE unshar-nm") \ E_(rxrpc_skb_see_version, "SEE version ") #define rxrpc_local_traces \ diff --git a/net/rxrpc/ar-internal.h b/net/rxrpc/ar-internal.h index 96ecb83c9071..27c2aa2dd023 100644 --- a/net/rxrpc/ar-internal.h +++ b/net/rxrpc/ar-internal.h @@ -1486,7 +1486,6 @@ int rxrpc_server_keyring(struct rxrpc_sock *, sockptr_t, int); void rxrpc_kernel_data_consumed(struct rxrpc_call *, struct sk_buff *); void rxrpc_new_skb(struct sk_buff *, enum rxrpc_skb_trace); void rxrpc_see_skb(struct sk_buff *, enum rxrpc_skb_trace); -void rxrpc_eaten_skb(struct sk_buff *, enum rxrpc_skb_trace); void rxrpc_get_skb(struct sk_buff *, enum 
rxrpc_skb_trace); void rxrpc_free_skb(struct sk_buff *, enum rxrpc_skb_trace); void rxrpc_purge_queue(struct sk_buff_head *); diff --git a/net/rxrpc/call_event.c b/net/rxrpc/call_event.c index fec59d9338b9..cc8f9dfa44e8 100644 --- a/net/rxrpc/call_event.c +++ b/net/rxrpc/call_event.c @@ -332,7 +332,24 @@ bool rxrpc_input_call_event(struct rxrpc_call *call) saw_ack |= sp->hdr.type == RXRPC_PACKET_TYPE_ACK; - rxrpc_input_call_packet(call, skb); + if (sp->hdr.securityIndex != 0 && + skb_cloned(skb)) { + /* Unshare the packet so that it can be + * modified by in-place decryption. + */ + struct sk_buff *nskb = skb_copy(skb, GFP_ATOMIC); + + if (nskb) { + rxrpc_new_skb(nskb, rxrpc_skb_new_unshared); + rxrpc_input_call_packet(call, nskb); + rxrpc_free_skb(nskb, rxrpc_skb_put_call_rx); + } else { + /* OOM - Drop the packet. */ + rxrpc_see_skb(skb, rxrpc_skb_see_unshare_nomem); + } + } else { + rxrpc_input_call_packet(call, skb); + } rxrpc_free_skb(skb, rxrpc_skb_put_call_rx); did_receive = true; } diff --git a/net/rxrpc/io_thread.c b/net/rxrpc/io_thread.c index 697956931925..dc5184a2fa9d 100644 --- a/net/rxrpc/io_thread.c +++ b/net/rxrpc/io_thread.c @@ -192,13 +192,12 @@ static bool rxrpc_extract_abort(struct sk_buff *skb) /* * Process packets received on the local endpoint */ -static bool rxrpc_input_packet(struct rxrpc_local *local, struct sk_buff **_skb) +static bool rxrpc_input_packet(struct rxrpc_local *local, struct sk_buff *skb) { struct rxrpc_connection *conn; struct sockaddr_rxrpc peer_srx; struct rxrpc_skb_priv *sp; struct rxrpc_peer *peer = NULL; - struct sk_buff *skb = *_skb; bool ret = false; skb_pull(skb, sizeof(struct udphdr)); @@ -244,25 +243,6 @@ static bool rxrpc_input_packet(struct rxrpc_local *local, struct sk_buff **_skb) return rxrpc_bad_message(skb, rxrpc_badmsg_zero_call); if (sp->hdr.seq == 0) return rxrpc_bad_message(skb, rxrpc_badmsg_zero_seq); - - /* Unshare the packet so that it can be modified for in-place - * decryption. - */ - if (sp->hdr.securityIndex != 0) { - skb = skb_unshare(skb, GFP_ATOMIC); - if (!skb) { - rxrpc_eaten_skb(*_skb, rxrpc_skb_eaten_by_unshare_nomem); - *_skb = NULL; - return just_discard; - } - - if (skb != *_skb) { - rxrpc_eaten_skb(*_skb, rxrpc_skb_eaten_by_unshare); - *_skb = skb; - rxrpc_new_skb(skb, rxrpc_skb_new_unshared); - sp = rxrpc_skb(skb); - } - } break; case RXRPC_PACKET_TYPE_CHALLENGE: @@ -494,7 +474,7 @@ int rxrpc_io_thread(void *data) switch (skb->mark) { case RXRPC_SKB_MARK_PACKET: skb->priority = 0; - if (!rxrpc_input_packet(local, &skb)) + if (!rxrpc_input_packet(local, skb)) rxrpc_reject_packet(local, skb); trace_rxrpc_rx_done(skb->mark, skb->priority); rxrpc_free_skb(skb, rxrpc_skb_put_input); diff --git a/net/rxrpc/skbuff.c b/net/rxrpc/skbuff.c index 3bcd6ee80396..e2169d1a14b5 100644 --- a/net/rxrpc/skbuff.c +++ b/net/rxrpc/skbuff.c @@ -46,15 +46,6 @@ void rxrpc_get_skb(struct sk_buff *skb, enum rxrpc_skb_trace why) skb_get(skb); } -/* - * Note the dropping of a ref on a socket buffer by the core. - */ -void rxrpc_eaten_skb(struct sk_buff *skb, enum rxrpc_skb_trace why) -{ - int n = atomic_inc_return(&rxrpc_n_rx_skbs); - trace_rxrpc_skb(skb, 0, n, why); -} - /* * Note the destruction of a socket buffer. 
*/ From 24481a7f573305706054c59e275371f8d0fe919f Mon Sep 17 00:00:00 2001 From: David Howells Date: Wed, 22 Apr 2026 17:14:33 +0100 Subject: [PATCH 143/148] rxrpc: Fix conn-level packet handling to unshare RESPONSE packets The security operations that verify a RESPONSE packet decrypt bits of it in place - however, the sk_buff may be shared with a packet sniffer, which would lead to the sniffer seeing an apparently corrupt packet (actually decrypted). Fix this by handing a copy of the packet off to the specific security handler if the packet was cloned. Fixes: 17926a79320a ("[AF_RXRPC]: Provide secure RxRPC sockets for use by userspace and kernel both") Closes: https://sashiko.dev/#/patchset/20260408121252.2249051-1-dhowells%40redhat.com Signed-off-by: David Howells cc: Marc Dionne cc: Jeffrey Altman cc: Simon Horman cc: linux-afs@lists.infradead.org cc: stable@kernel.org Link: https://patch.msgid.link/20260422161438.2593376-5-dhowells@redhat.com Signed-off-by: Jakub Kicinski --- net/rxrpc/conn_event.c | 29 ++++++++++++++++++++++++++++- 1 file changed, 28 insertions(+), 1 deletion(-) diff --git a/net/rxrpc/conn_event.c b/net/rxrpc/conn_event.c index 9a41ec708aeb..aee977291d90 100644 --- a/net/rxrpc/conn_event.c +++ b/net/rxrpc/conn_event.c @@ -240,6 +240,33 @@ static void rxrpc_call_is_secure(struct rxrpc_call *call) rxrpc_notify_socket(call); } +static int rxrpc_verify_response(struct rxrpc_connection *conn, + struct sk_buff *skb) +{ + int ret; + + if (skb_cloned(skb)) { + /* Copy the packet if shared so that we can do in-place + * decryption. + */ + struct sk_buff *nskb = skb_copy(skb, GFP_NOFS); + + if (nskb) { + rxrpc_new_skb(nskb, rxrpc_skb_new_unshared); + ret = conn->security->verify_response(conn, nskb); + rxrpc_free_skb(nskb, rxrpc_skb_put_response_copy); + } else { + /* OOM - Drop the packet. */ + rxrpc_see_skb(skb, rxrpc_skb_see_unshare_nomem); + ret = -ENOMEM; + } + } else { + ret = conn->security->verify_response(conn, skb); + } + + return ret; +} + /* * connection-level Rx packet processor */ @@ -270,7 +297,7 @@ static int rxrpc_process_event(struct rxrpc_connection *conn, } spin_unlock_irq(&conn->state_lock); - ret = conn->security->verify_response(conn, skb); + ret = rxrpc_verify_response(conn, skb); if (ret < 0) return ret; From 6929350080f4da292d111a3b33e53138fee51cec Mon Sep 17 00:00:00 2001 From: David Howells Date: Wed, 22 Apr 2026 17:14:34 +0100 Subject: [PATCH 144/148] rxgk: Fix potential integer overflow in length check Fix potential integer overflow in rxgk_extract_token() when checking the length of the ticket. Rather than rounding up the value to be tested (which might overflow), round down the size of the available data.
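To see the failure mode concretely, here is a standalone sketch of the arithmetic; the macros below are illustrative stand-ins for the XDR helpers, not the kernel's definitions. Rounding the untrusted 32-bit length up can wrap to 0 and slip past the bound check, whereas rounding the trusted remaining space down cannot overflow:

  #include <stdio.h>
  #include <stdint.h>

  #define XDR_ROUND_UP(x)   (((x) + 3u) & ~3u)
  #define XDR_ROUND_DOWN(x) ((x) & ~3u)

  int main(void)
  {
          uint32_t ticket_len = 0xfffffffd;   /* untrusted, attacker-supplied */
          uint32_t avail = 64;                /* bytes actually in the packet */

          /* Broken check: rounding up wraps 0xfffffffd to 0, so the
           * oversized length is NOT rejected. */
          if (XDR_ROUND_UP(ticket_len) > avail)
                  printf("broken check rejects\n");
          else
                  printf("broken check accepts len=%u\n", ticket_len);

          /* Fixed check: no arithmetic on the untrusted value, so it
           * cannot wrap; the oversized length is rejected. */
          if (ticket_len > XDR_ROUND_DOWN(avail))
                  printf("fixed check rejects\n");
          return 0;
  }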
Fixes: 2429a1976481 ("rxrpc: Fix untrusted unsigned subtract") Closes: https://sashiko.dev/#/patchset/20260408121252.2249051-1-dhowells%40redhat.com Signed-off-by: David Howells cc: Marc Dionne cc: Jeffrey Altman cc: Simon Horman cc: linux-afs@lists.infradead.org cc: stable@kernel.org Link: https://patch.msgid.link/20260422161438.2593376-6-dhowells@redhat.com Signed-off-by: Jakub Kicinski --- net/rxrpc/rxgk_app.c | 2 +- net/rxrpc/rxgk_common.h | 1 + 2 files changed, 2 insertions(+), 1 deletion(-) diff --git a/net/rxrpc/rxgk_app.c b/net/rxrpc/rxgk_app.c index 30275cb5ba3e..5587639d60c5 100644 --- a/net/rxrpc/rxgk_app.c +++ b/net/rxrpc/rxgk_app.c @@ -214,7 +214,7 @@ int rxgk_extract_token(struct rxrpc_connection *conn, struct sk_buff *skb, ticket_len = ntohl(container.token_len); ticket_offset = token_offset + sizeof(container); - if (xdr_round_up(ticket_len) > token_len - sizeof(container)) + if (ticket_len > xdr_round_down(token_len - sizeof(container))) goto short_packet; _debug("KVNO %u", kvno); diff --git a/net/rxrpc/rxgk_common.h b/net/rxrpc/rxgk_common.h index 80164d89e19c..1e257d7ab8ec 100644 --- a/net/rxrpc/rxgk_common.h +++ b/net/rxrpc/rxgk_common.h @@ -34,6 +34,7 @@ struct rxgk_context { }; #define xdr_round_up(x) (round_up((x), sizeof(__be32))) +#define xdr_round_down(x) (round_down((x), sizeof(__be32))) #define xdr_object_len(x) (4 + xdr_round_up(x)) /* From ac33733b10b484d666f97688561670afd5861383 Mon Sep 17 00:00:00 2001 From: Anderson Nascimento Date: Wed, 22 Apr 2026 17:14:35 +0100 Subject: [PATCH 145/148] rxrpc: Fix missing validation of ticket length in non-XDR key preparsing In rxrpc_preparse(), there are two paths for parsing key payloads: the XDR path (for large payloads) and the non-XDR path (for payloads <= 28 bytes). While the XDR path (rxrpc_preparse_xdr_rxkad()) correctly validates the ticket length against AFSTOKEN_RK_TIX_MAX, the non-XDR path fails to do so. This allows an unprivileged user to provide a very large ticket length. When this key is later read via rxrpc_read(), the total token size (toksize) calculation results in a value that exceeds AFSTOKEN_LENGTH_MAX, triggering a WARN_ON(). [ 2001.302904] WARNING: CPU: 2 PID: 2108 at net/rxrpc/key.c:778 rxrpc_read+0x109/0x5c0 [rxrpc] Fix this by adding a check in the non-XDR parsing path of rxrpc_preparse() to ensure the ticket length does not exceed AFSTOKEN_RK_TIX_MAX, bringing it into parity with the XDR parsing logic. 
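As a sketch of the invariant (constants and types below are placeholders, not the kernel's): whatever ticket length is accepted at preparse time bounds what rxrpc_read() later serialises, so the cap has to be applied on every parse path, not just the XDR one.

  #include <stdint.h>
  #include <errno.h>        /* EKEYREJECTED is Linux-specific */

  #define RK_TIX_MAX 8192u  /* placeholder per-ticket cap */

  struct key_payload_v1 {
          uint32_t ticket_length;     /* attacker-controlled */
  };

  static int preparse_v1(const struct key_payload_v1 *v1)
  {
          /* Reject here, before the length can inflate the size that
           * the read side checks against its own maximum. */
          if (v1->ticket_length > RK_TIX_MAX)
                  return -EKEYREJECTED;
          return 0;
  }

  int main(void)
  {
          struct key_payload_v1 bad = { .ticket_length = 1u << 20 };
          return preparse_v1(&bad) == -EKEYREJECTED ? 0 : 1;
  }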
Fixes: 8a7a3eb4ddbe ("KEYS: RxRPC: Use key preparsing") Fixes: 84924aac08a4 ("rxrpc: Fix checker warning") Reported-by: Anderson Nascimento Signed-off-by: Anderson Nascimento Signed-off-by: David Howells cc: Marc Dionne cc: Jeffrey Altman cc: Simon Horman cc: linux-afs@lists.infradead.org cc: stable@kernel.org Link: https://patch.msgid.link/20260422161438.2593376-7-dhowells@redhat.com Signed-off-by: Jakub Kicinski --- net/rxrpc/key.c | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/net/rxrpc/key.c b/net/rxrpc/key.c index 6301d79ee35a..3ec3d89fdf14 100644 --- a/net/rxrpc/key.c +++ b/net/rxrpc/key.c @@ -502,6 +502,10 @@ static int rxrpc_preparse(struct key_preparsed_payload *prep) if (v1->security_index != RXRPC_SECURITY_RXKAD) goto error; + ret = -EKEYREJECTED; + if (v1->ticket_length > AFSTOKEN_RK_TIX_MAX) + goto error; + plen = sizeof(*token->kad) + v1->ticket_length; prep->quotalen += plen + sizeof(*token); From 55b2984c96c37f909bbfe8851f13152693951382 Mon Sep 17 00:00:00 2001 From: David Howells Date: Thu, 23 Apr 2026 21:09:06 +0100 Subject: [PATCH 146/148] rxrpc: Fix rxrpc_input_call_event() to only unshare DATA packets Fix rxrpc_input_call_event() to only unshare DATA packets and not ACK, ABORT, etc., since only DATA packets are decrypted in place and so only they need a private copy before modification. Fixes: 1f2740150f90 ("rxrpc: Fix potential UAF after skb_unshare() failure") Closes: https://sashiko.dev/#/patchset/20260422161438.2593376-4-dhowells@redhat.com Signed-off-by: David Howells cc: Marc Dionne cc: Jeffrey Altman cc: Simon Horman cc: linux-afs@lists.infradead.org cc: stable@kernel.org Link: https://patch.msgid.link/20260423200909.3049438-2-dhowells@redhat.com Signed-off-by: Jakub Kicinski --- net/rxrpc/call_event.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/net/rxrpc/call_event.c b/net/rxrpc/call_event.c index cc8f9dfa44e8..fdd683261226 100644 --- a/net/rxrpc/call_event.c +++ b/net/rxrpc/call_event.c @@ -332,7 +332,8 @@ bool rxrpc_input_call_event(struct rxrpc_call *call) saw_ack |= sp->hdr.type == RXRPC_PACKET_TYPE_ACK; - if (sp->hdr.securityIndex != 0 && + if (sp->hdr.type == RXRPC_PACKET_TYPE_DATA && + sp->hdr.securityIndex != 0 && skb_cloned(skb)) { /* Unshare the packet so that it can be * modified by in-place decryption. From 0422e7a4883f25101903f3e8105c0808aa5f4ce9 Mon Sep 17 00:00:00 2001 From: David Howells Date: Thu, 23 Apr 2026 21:09:07 +0100 Subject: [PATCH 147/148] rxrpc: Fix re-decryption of RESPONSE packets If a RESPONSE packet gets a temporary failure during processing, it may end up in a partially decrypted state - and then get requeued for a retry. Fix this by just discarding the packet; we will send another CHALLENGE packet and thereby elicit a further response. Similarly, discard an incoming CHALLENGE packet if we get an error whilst generating a RESPONSE; the server will send another CHALLENGE.
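A toy, self-contained C illustration of why requeuing for a retry is unsafe (the XOR "cipher" merely stands in for real in-place decryption; none of this is rxrpc code): running the transform a second time over an already-processed buffer scrambles it rather than recovering it.

  #include <stdio.h>
  #include <stddef.h>

  static void xor_in_place(unsigned char *buf, size_t n, unsigned char key)
  {
          for (size_t i = 0; i < n; i++)
                  buf[i] ^= key;  /* stand-in for in-place decryption */
  }

  int main(void)
  {
          unsigned char pkt[2] = { 'h' ^ 0x5a, 'i' ^ 0x5a };

          xor_in_place(pkt, 2, 0x5a);  /* first attempt decrypts... */
          /* ...then hits a temporary error and the packet is requeued */
          xor_in_place(pkt, 2, 0x5a);  /* retry re-scrambles the data */
          printf("retry sees %02x %02x, not \"hi\"\n", pkt[0], pkt[1]);
          return 0;
  }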
Fixes: 17926a79320a ("[AF_RXRPC]: Provide secure RxRPC sockets for use by userspace and kernel both") Closes: https://sashiko.dev/#/patchset/20260422161438.2593376-4-dhowells@redhat.com Signed-off-by: David Howells cc: Marc Dionne cc: Jeffrey Altman cc: Simon Horman cc: linux-afs@lists.infradead.org cc: stable@kernel.org Link: https://patch.msgid.link/20260423200909.3049438-3-dhowells@redhat.com Signed-off-by: Jakub Kicinski --- include/trace/events/rxrpc.h | 1 - net/rxrpc/conn_event.c | 14 ++------------ 2 files changed, 2 insertions(+), 13 deletions(-) diff --git a/include/trace/events/rxrpc.h b/include/trace/events/rxrpc.h index 13b9d017f8e1..573f2df3a2c9 100644 --- a/include/trace/events/rxrpc.h +++ b/include/trace/events/rxrpc.h @@ -285,7 +285,6 @@ EM(rxrpc_conn_put_unidle, "PUT unidle ") \ EM(rxrpc_conn_put_work, "PUT work ") \ EM(rxrpc_conn_queue_challenge, "QUE chall ") \ - EM(rxrpc_conn_queue_retry_work, "QUE retry-wk") \ EM(rxrpc_conn_queue_rx_work, "QUE rx-work ") \ EM(rxrpc_conn_see_new_service_conn, "SEE new-svc ") \ EM(rxrpc_conn_see_reap_service, "SEE reap-svc") \ diff --git a/net/rxrpc/conn_event.c b/net/rxrpc/conn_event.c index aee977291d90..a2130d25aaa9 100644 --- a/net/rxrpc/conn_event.c +++ b/net/rxrpc/conn_event.c @@ -389,7 +389,6 @@ void rxrpc_process_delayed_final_acks(struct rxrpc_connection *conn, bool force) static void rxrpc_do_process_connection(struct rxrpc_connection *conn) { struct sk_buff *skb; - int ret; if (test_and_clear_bit(RXRPC_CONN_EV_CHALLENGE, &conn->events)) rxrpc_secure_connection(conn); @@ -398,17 +397,8 @@ static void rxrpc_do_process_connection(struct rxrpc_connection *conn) * connection that each one has when we've finished with it */ while ((skb = skb_dequeue(&conn->rx_queue))) { rxrpc_see_skb(skb, rxrpc_skb_see_conn_work); - ret = rxrpc_process_event(conn, skb); - switch (ret) { - case -ENOMEM: - case -EAGAIN: - skb_queue_head(&conn->rx_queue, skb); - rxrpc_queue_conn(conn, rxrpc_conn_queue_retry_work); - break; - default: - rxrpc_free_skb(skb, rxrpc_skb_put_conn_work); - break; - } + rxrpc_process_event(conn, skb); + rxrpc_free_skb(skb, rxrpc_skb_put_conn_work); } } From 3476c8bb960f48e49355d6f93fb7673211e0163f Mon Sep 17 00:00:00 2001 From: David Howells Date: Thu, 23 Apr 2026 21:09:08 +0100 Subject: [PATCH 148/148] rxrpc: Fix error handling in rxgk_extract_token() Fix a missing bit of error handling in rxgk_extract_token(): in the event that rxgk_decrypt_skb() returns -ENOMEM, it should just return that rather than continuing on (for anything else, it generates an abort). Fixes: 64863f4ca494 ("rxrpc: Fix unhandled errors in rxgk_verify_packet_integrity()") Closes: https://sashiko.dev/#/patchset/20260422161438.2593376-4-dhowells@redhat.com Signed-off-by: David Howells cc: Marc Dionne cc: Jeffrey Altman cc: Simon Horman cc: linux-afs@lists.infradead.org cc: stable@kernel.org Link: https://patch.msgid.link/20260423200909.3049438-4-dhowells@redhat.com Signed-off-by: Jakub Kicinski --- net/rxrpc/rxgk_app.c | 1 + 1 file changed, 1 insertion(+) diff --git a/net/rxrpc/rxgk_app.c b/net/rxrpc/rxgk_app.c index 5587639d60c5..0ef2a29eb695 100644 --- a/net/rxrpc/rxgk_app.c +++ b/net/rxrpc/rxgk_app.c @@ -245,6 +245,7 @@ int rxgk_extract_token(struct rxrpc_connection *conn, struct sk_buff *skb, if (ret != -ENOMEM) return rxrpc_abort_conn(conn, skb, ec, ret, rxgk_abort_resp_tok_dec); + return ret; } ret = conn->security->default_decode_ticket(conn, skb, ticket_offset,