linux

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2026-06-02 15:43:01 -04:00

Author	SHA1	Message	Date
Chuck Lever	ea5fe6a73c	net/handshake: Drain pending requests at net namespace exit The arguments to list_splice_init() in handshake_net_exit() are reversed. The call moves the local empty "requests" list onto hn->hn_requests, leaving the local list empty, so the subsequent drain loop runs zero iterations. Pending handshake requests that had not yet been accepted are not torn down when the net namespace is destroyed; each one keeps a reference on a socket file and on the handshake_req allocation. Pass the source and destination in the documented order (list_splice_init(list, head) moves list onto head) so the pending list is transferred to the local scratch list and drained through handshake_complete(). Fixing the splice direction exposes a list-corruption race. After the splice each req->hr_list still has non-empty link pointers, threading the stack-local scratch list rather than hn_requests. A concurrent handshake_req_cancel() -- for example, from sunrpc's TLS timeout on a kernel socket whose netns reference was not taken -- finds the request through the rhashtable, calls remove_pending(), and sees !list_empty(&req->hr_list). __remove_pending_locked() then list_del_init()s an entry off the scratch list while the drain iterates, corrupting it. The same call arriving after the drain loop has run list_del() on an entry hits LIST_POISON instead. Have remove_pending() check HANDSHAKE_F_NET_DRAINING under hn_lock and report not-found when drain is in progress. The drain has already taken ownership; handshake_complete()'s existing test_and_set on HANDSHAKE_F_REQ_COMPLETED still arbitrates between drain and cancel for who calls the consumer's hp_done. Use list_del_init() rather than list_del() in the drain so req->hr_list does not carry LIST_POISON after drain releases the entry. The DRAINING guard in remove_pending() makes cancel return false, but cancel still falls through to test_and_set_bit on HANDSHAKE_F_REQ_COMPLETED and drops the request's hr_file reference. Without another pin, if that is the last reference, sk_destruct frees the request while it is still linked on the drain loop's local list. Pin each request's hr_file under hn_lock before releasing the list, and drop that drain pin after the loop finishes with the request. Fixes: `3b3009ea8a` ("net/handshake: Create a NETLINK service for handling handshake requests") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Hannes Reinecke <hare@kernel.org> Link: https://patch.msgid.link/20260525-handshake-file-pin-v3-8-66c616906ead@oracle.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-05-28 13:35:32 +02:00
Chuck Lever	204a5efde5	net/handshake: Verify file-reference balance in submit paths The new file-reference contract on struct handshake_req is silently breakable: a missing get_file() at submit or a missing fput() on an error path leaves the file leaked but does not crash the test, so the existing absence-of-crash checks pass either way. Snapshot file_count(filp) before each handshake_req_submit() in the submit-success, EAGAIN, EBUSY, and cancel tests, and assert the expected balance after submit and again after cancel. The already-completed cancel test also asserts the post-complete balance, which pins down that handshake_complete() drops the reference and that the subsequent cancel does not double-fput. The destroy test gets the same treatment before __fput_sync(), which double-checks that cancel's fput() ran and the only remaining reference is the one sock_alloc_file() established. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Hannes Reinecke <hare@kernel.org> Link: https://patch.msgid.link/20260525-handshake-file-pin-v3-7-66c616906ead@oracle.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-05-28 13:35:31 +02:00
Chuck Lever	5da98f55b1	net/handshake: Close the submit-side sock_hold race handshake_req_submit() publishes the request via handshake_req_hash_add() and __add_pending_locked(), drops hn_lock, and calls handshake_genl_notify() (which can sleep) before taking sock_hold() on req->hr_sk. A fast tlshd ACCEPT followed by DONE can drive handshake_complete()'s sock_put() into the window between the spin_unlock and the late sock_hold(); on a system where the consumer's fd held the only sk reference, the late sock_hold() then operates on an sk whose refcount has reached zero. The preceding two patches install an explicit file reference on struct handshake_req. That file pins sock->file, which pins the embedded struct socket, which defers inet_release()'s sock_put(). As long as hr_file is held, sk cannot reach refcount zero from the consumer side, and the submit-side sock_hold() with its matching sock_put() calls in handshake_complete() and handshake_req_cancel() is now redundant. Drop all three. The file reference already keeps each request's socket alive, and the lifetime story is contained in a single get_file()/fput() pair. Fixes: `3b3009ea8a` ("net/handshake: Create a NETLINK service for handling handshake requests") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Hannes Reinecke <hare@kernel.org> Link: https://patch.msgid.link/20260525-handshake-file-pin-v3-6-66c616906ead@oracle.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-05-28 13:35:31 +02:00
Chuck Lever	f4251190e5	net/handshake: hand off the pinned file reference to accept_doit handshake_req_next() removes the request from the per-net pending list and drops hn_lock before handshake_nl_accept_doit() reads req->hr_sk->sk_socket and dereferences sock->file (once in FD_PREPARE() and again in get_file()). In that window a consumer running tls_handshake_cancel() followed by sockfd_put() (svc_sock_free) or __fput_sync() (xs_reset_transport) releases sock->file. sock_release() then runs sock_orphan(), zeroing sk_socket, and frees the struct socket. The accept-side code either reads NULL through sk_socket or chases freed memory. The submit-side sock_hold() does not prevent this. sk_refcnt protects struct sock, but struct socket and sock->file are independently refcounted via the file descriptor the consumer owns. Pinning sk leaves sock and sock->file unprotected. Retarget the accept-side dereferences at req->hr_file, which was pinned at submit time, instead of req->hr_sk->sk_socket->file. Pinning on its own is not sufficient: a consumer that cancels between handshake_req_next() returning and accept_doit reaching FD_PREPARE() takes the !remove_pending() branch in handshake_req_cancel() and drops hr_file before the accept side takes its own reference. Hand off an additional file reference inside handshake_req_next(), under hn_lock, so the accept side operates on a reference that no concurrent handshake_req_cancel() can revoke. FD_PREPARE() consumes that handed-off reference, either by transferring it to the new fd in fd_publish() or by dropping it in the cleanup destructor on error; the explicit get_file() that previously balanced FD_PREPARE() is therefore redundant and goes away. Update handshake_req_cancel_test2 and _test3 to simulate the FD_PREPARE() consumption with an fput() so the kunit file-count assertions stay balanced. Reported-by: Chris Mason <clm@meta.com> Fixes: `3b3009ea8a` ("net/handshake: Create a NETLINK service for handling handshake requests") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Hannes Reinecke <hare@kernel.org> Link: https://patch.msgid.link/20260525-handshake-file-pin-v3-5-66c616906ead@oracle.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-05-28 13:35:31 +02:00
Chuck Lever	09dba37eee	net/handshake: Take a long-lived file reference at submit handshake_nl_accept_doit() needs the file pointer backing req->hr_sk->sk_socket to survive the window between handshake_req_next() and the subsequent FD_PREPARE() and get_file(). The submit-side sock_hold() does not provide that. sk_refcnt keeps struct sock alive, but struct socket is owned by sock->file: when the consumer fputs the last file reference, sock_release() tears the socket down regardless of any sock_hold. Add an hr_file pointer to struct handshake_req and acquire an explicit reference on sock->file during handshake_req_submit(). handshake_complete() and handshake_req_cancel() release the reference on the completion-bit-winning path. The submit error path must also release the file reference, but after rhashtable insertion a concurrent handshake_req_cancel() can discover the request and race the error path. Gate the error-path cleanup -- sk_destruct restoration, fput, and request destruction -- with test_and_set_bit(HANDSHAKE_F_REQ_COMPLETED), the same serialization handshake_complete() and handshake_req_cancel() already use. When cancel has already claimed ownership, the submit error path returns without touching the request; socket teardown handles final destruction. The accept-side dereferences are not yet retargeted; that change comes in the next patch. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Link: https://patch.msgid.link/20260525-handshake-file-pin-v3-4-66c616906ead@oracle.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-05-28 13:35:31 +02:00
Chuck Lever	6b22d433aa	net/handshake: Pass negative errno through handshake_complete() handshake_complete() declares status as unsigned int and tls_handshake_done() negates that value (-status) before handing it to the TLS consumer. Consumers match on negative errno constants -- xs_tls_handshake_done() has switch (status) { case 0: case -EACCES: case -ETIMEDOUT: lower_transport->xprt_err = status; break; default: lower_transport->xprt_err = -EACCES; } so the API as designed expects callers to pass positive errno values that the tlshd shim then negates. Three internal callers in handshake_nl_accept_doit(), the net-exit drain, and a kunit test follow kernel convention and pass negative errnos -- -EIO, -ETIMEDOUT, -ETIMEDOUT. The implicit conversion to unsigned int turns -ETIMEDOUT into 0xFFFFFF92; the subsequent -status in tls_handshake_done() wraps back to 110, the consumer's switch falls through, and the xprt reports -EACCES on what should be -ETIMEDOUT or -EIO. Fix the API rather than the call sites. The natural kernel convention is negative errno in, negative errno out. Change handshake_complete() and hp_done to take int status, drop the negation in tls_handshake_done(), and negate once in handshake_nl_done_doit() where status arrives from the wire as an unsigned netlink attribute. The three internal callers were already correct under that convention and need no change. At the same wire boundary, declare MAX_ERRNO as the netlink policy upper bound for HANDSHAKE_A_DONE_STATUS. Attribute validation rejects out-of-range values before handshake_nl_done_doit() runs, and negating a bounded u32 there stays within int range -- closing the UBSAN-visible signed- integer overflow that an unconstrained u32 would invoke. Fixes: `3b3009ea8a` ("net/handshake: Create a NETLINK service for handling handshake requests") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Hannes Reinecke <hare@kernel.org> Link: https://patch.msgid.link/20260525-handshake-file-pin-v3-3-66c616906ead@oracle.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-05-28 13:35:31 +02:00
Chuck Lever	9015985b5e	nvme-tcp: store negative errno in queue->tls_err nvme_tcp_tls_done() assigns queue->tls_err in three branches. The ENOKEY lookup failure and the EOPNOTSUPP initializer both store negative errnos. The third branch, reached when the handshake layer reports a non-zero status, stores -status. The handshake layer delivers status to the consumer callback as a negative errno; the other in-tree consumers -- xs_tls_handshake_done() and the nvmet target callback -- treat their status argument that way. The extra negation in nvme_tcp_tls_done() flips the sign, leaving tls_err as a positive value (for instance, +EIO), which nvme_tcp_start_tls() then returns to its caller. Drop the extra negation so queue->tls_err uniformly carries a negative errno on failure. Fixes: `be8e82caa6` ("nvme-tcp: enable TLS handshake upcall") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Alistair Francis <alistair.francis@wdc.com> Link: https://patch.msgid.link/20260525-handshake-file-pin-v3-2-66c616906ead@oracle.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-05-28 13:35:31 +02:00
Chuck Lever	cc993e0927	net/handshake: Use spin_lock_bh for hn_lock nvmet_tcp_state_change(), a socket callback that runs in BH context, can reach handshake_req_cancel() via nvmet_tcp_schedule_release_queue() and tls_handshake_cancel(). handshake_req_cancel() acquires hn->hn_lock with plain spin_lock(). If a process-context thread on the same CPU holds hn->hn_lock when a softirq invokes the cancel path, the lock attempt deadlocks. This is the only caller that invokes tls_handshake_cancel() from BH context; every other consumer calls it from process context. Deferring the cancel to process context in the NVMe target is not straightforward: nvmet_tcp_schedule_release_queue() must call tls_handshake_cancel() atomically with its state transition to DISCONNECTING. If the cancel were deferred, the handshake completion callback could fire in the window before the cancel runs, observe the unexpected state, and return without dropping its kref on the queue. Reworking that interlock is considerably more invasive than hardening the handshake lock. Convert all hn->hn_lock acquisitions from spin_lock/spin_unlock to spin_lock_bh/spin_unlock_bh so the lock is never taken with softirqs enabled. Fixes: `675b453e02` ("nvmet-tcp: enable TLS handshake upcall") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Hannes Reinecke <hare@kernel.org> Link: https://patch.msgid.link/20260525-handshake-file-pin-v3-1-66c616906ead@oracle.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-05-28 13:35:31 +02:00
Minh Nguyen	98d0912e9f	net: skbuff: fix missing zerocopy reference in pskb_carve helpers pskb_carve_inside_header() and pskb_carve_inside_nonlinear() both copy the old skb_shared_info header into a new buffer via memcpy(), which includes the destructor_arg pointer (uarg) for MSG_ZEROCOPY skbs. Neither function calls net_zcopy_get() for the new shinfo, creating an unaccounted holder: every skb_shared_info with destructor_arg set will call skb_zcopy_clear() once when freed, but the corresponding net_zcopy_get() was never called for the new copy. Repeated calls drive uarg->refcnt to zero prematurely, freeing ubuf_info_msgzc while TX skbs still hold live destructor_arg pointers. KASAN reports use-after-free on a freed ubuf_info_msgzc: BUG: KASAN: slab-use-after-free in skb_release_data+0x77b/0x810 Read of size 8 at addr ffff88801574d3e8 by task poc/220 Call Trace: skb_release_data+0x77b/0x810 kfree_skb_list_reason+0x13e/0x610 skb_release_data+0x4cd/0x810 sk_skb_reason_drop+0xf3/0x340 skb_queue_purge_reason+0x282/0x440 rds_tcp_inc_free+0x1e/0x30 rds_recvmsg+0x354/0x1780 __sys_recvmsg+0xdf/0x180 Allocated by task 219: msg_zerocopy_realloc+0x157/0x7b0 tcp_sendmsg_locked+0x2892/0x3ba0 Freed by task 219: ip_recv_error+0x74a/0xb10 tcp_recvmsg+0x475/0x530 The skb consuming the late access still referenced the same uarg via shinfo->destructor_arg copied by pskb_carve_inside_nonlinear() without a refcount bump. This has been verified to be reliably exploitable: a working proof-of-concept achieves full root privilege escalation from an unprivileged local user on a default kernel configuration. The fix follows the pattern of pskb_expand_head() which has the same memcpy/cloned structure. For pskb_carve_inside_header(), net_zcopy_get() is placed after skb_orphan_frags() succeeds, so the orphan error path needs no cleanup. For pskb_carve_inside_nonlinear(), net_zcopy_get() is placed after all failure points and just before skb_release_data(), so no error path needs cleanup at all -- matching pskb_expand_head() more closely and avoiding the need for a balancing net_zcopy_put(). Fixes: `6fa01ccd88` ("skbuff: Add pskb_extract() helper function") Cc: stable@vger.kernel.org Assisted-by: Claude:claude-sonnet-4-6 Signed-off-by: Minh Nguyen <minhnguyen.080505@gmail.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20260526041240.329462-1-minhnguyen.080505@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-05-28 13:26:56 +02:00
Paolo Abeni	85576d7a48	Merge branch 'hibmcge-fix-rx-packet-corruption-issue' Jijie Shao says: ==================== hibmcge: fix RX packet corruption issue This series fixes an RX packet corruption issue observed when SMMU is disabled on the hibmcge driver. The fixes include disabling PCI Relaxed Ordering and correcting the order of DMA barrier operations in the RX data sync path. ==================== Link: https://patch.msgid.link/20260525144525.94884-1-shaojijie@huawei.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-05-28 13:02:59 +02:00
Jijie Shao	b545b6ea18	net: hibmcge: move dma_rmb() after dma_sync_single_for_cpu() in RX path The dma_rmb() barrier was placed before dma_sync_single_for_cpu(), which is incorrect. DMA sync must complete first to make the buffer accessible to the CPU, then the rmb barrier ensures subsequent descriptor reads observe the latest data written by the hardware. Reorder the operations so dma_sync_single_for_cpu() is called before dma_rmb() to guarantee the driver reads consistent data from the DMA buffer. Fixes: `f72e255940` ("net: hibmcge: Implement rx_poll function to receive packets") Signed-off-by: Jijie Shao <shaojijie@huawei.com> Link: https://patch.msgid.link/20260525144525.94884-3-shaojijie@huawei.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-05-28 12:59:36 +02:00
Jijie Shao	463a1271aa	net: hibmcge: disable Relaxed Ordering to fix RX packet corruption When SMMU is disabled, the hibmcge driver may receive corrupted packets. The hardware writes packet data and descriptors to the same page, but with Relaxed Ordering enabled, PCI write transactions may not be strictly ordered. This can cause the driver to observe a valid descriptor before the corresponding packet data is fully written. Fix this by clearing PCI_EXP_DEVCTL_RELAX_EN in the PCI bridge control register to ensure strict write ordering between packet data and descriptors. Fixes: `f72e255940` ("net: hibmcge: Implement rx_poll function to receive packets") Signed-off-by: Jijie Shao <shaojijie@huawei.com> Link: https://patch.msgid.link/20260525144525.94884-2-shaojijie@huawei.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-05-28 12:59:36 +02:00
Paolo Abeni	031f1592e5	Merge branch 'net-sched-fix-packet-loops-in-mirred-and-netem' Jamal Hadi Salim says: ==================== net/sched: Fix packet loops in mirred and netem This patchset adds a 2-bit per-skb tc_depth counter that travels with the packet. The existing per-CPU mirred nest tracking loses state when a packet is deferred through the backlog or moves between CPUs via XPS/RPS. A per-skb field covers both cases. Patch 1 adds the tc_depth field in a padding hole in sk_buff. Patches 2-3 revert the check_netem_in_tree() fix and its tests, which broke legitimate multi-netem configurations. Patch 4 uses tc_depth to stop netem duplicate recursion. Patch 5 uses tc_depth to catch mirred ingress redirect loops. Patch 6 fixes the infinite loop in the mirred egress blockcast case. Patch 7 fixes drop stats in early return error scenarios in tcf_mirred_act for redirect (caught by Sashiko [1]). Patches 8-9 add mirred and netem test cases. [1] https://sashiko.dev/#/patchset/20260413082027.2244884-1-hxzene%40gmail.com ==================== Link: https://patch.msgid.link/20260525122556.973584-1-jhs@mojatatu.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-05-28 12:26:39 +02:00
Victor Nogueira	0f6e00aa5f	selftests/tc-testing: Add netem test case exercising loops Add a netem nested duplicate test case to validate that it won't cause an infinite loop Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Acked-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: Victor Nogueira <victor@mojatatu.com> Link: https://patch.msgid.link/20260525122556.973584-10-jhs@mojatatu.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-05-28 12:26:37 +02:00
Victor Nogueira	d38dc56a02	selftests/tc-testing: Add mirred test cases exercising loops Add mirred loop test cases to validate that those will be caught and other test cases that were previously misinterpreted as loops by mirred. This commit adds 12 test cases: - Redirect multiport: dummy egress -> dev1 ingress -> dummy egress (Loop) - Redirect singleport: dev1 ingress -> dev1 egress -> dev1 ingress (Loop) - Redirect multiport: dev1 ingress -> dummy ingress -> dev1 egress (No Loop) - Redirect multiport: dev1 ingress -> dummy ingress -> dev1 ingress (Loop) - Redirect multiport: dev1 ingress -> dummy egress -> dev1 ingress (Loop) - Redirect multiport: dummy egress -> dev1 ingress -> dummy egress, different prios (Loop) - Redirect multiport: dev1 ingress -> dummy ingress -> dummy egress -> dev1 egress (No Loop) - Redirect multiport: dev1 ingress -> dummy egress -> dev1 egress (No Loop) - Redirect multiport: dev1 ingress -> dummy egress -> dummy ingress (No Loop) - Redirect singleport: dev1 ingress -> dev1 ingress (Loop) - Redirect singleport: dummy egress -> dummy ingress (No Loop) - Redirect multiport: dev1 ingress -> dummy ingress -> dummy egress (No Loop) Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Acked-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: Victor Nogueira <victor@mojatatu.com> Link: https://patch.msgid.link/20260525122556.973584-9-jhs@mojatatu.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-05-28 12:26:36 +02:00
Victor Nogueira	e80ad525fc	net/sched: act_mirred: Fix return code in early mirred redirect error paths Since retval is set as TC_ACT_STOLEN in the mirred redirect case, returning retval in cases where redirect failed will make the callers not register the skb as being dropped. Fix this by returning TC_ACT_SHOT instead in such scenarios. Fixes: `16085e48cb` ("net/sched: act_mirred: Create function tcf_mirred_to_dev and improve readability") Reported-by: Sashiko <sashiko-bot@kernel.org> Closes: https://sashiko.dev/#/patchset/20260413082027.2244884-1-hxzene%40gmail.com Signed-off-by: Victor Nogueira <victor@mojatatu.com> Link: https://patch.msgid.link/20260525122556.973584-8-jhs@mojatatu.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-05-28 12:26:36 +02:00
Kito Xu (veritas501)	a005fa5d75	net/sched: act_mirred: Fix blockcast recursion bypass leading to stack overflow tcf_mirred_act() checks sched_mirred_nest against MIRRED_NEST_LIMIT (4) to prevent deep recursion. However, when the action uses blockcast (tcfm_blockid != 0), the function returns at the tcf_blockcast() call BEFORE reaching the counter increment. As a result, the recursion counter never advances and the limit check is entirely bypassed. When two devices share a TC egress block with a mirred blockcast rule, a packet egressing on device A is mirrored to device B via blockcast; device B's egress TC re-enters tcf_mirred_act() via blockcast and mirrors back to A, creating an unbounded recursion loop: tcf_mirred_act -> tcf_blockcast -> tcf_mirred_to_dev -> dev_queue_xmit -> sch_handle_egress -> tcf_classify -> tcf_mirred_act -> (repeat) This recursion continues until the kernel stack overflows. The bug is reachable from an unprivileged user via unshare(CLONE_NEWUSER \| CLONE_NEWNET): user namespaces grant CAP_NET_ADMIN in the new network namespace, which is sufficient to create dummy devices, attach clsact qdiscs with shared blocks, and install mirred blockcast filters. BUG: TASK stack guard page was hit at ffffc90000b7fff8 Oops: stack guard page: 0000 [#1] SMP KASAN NOPTI CPU: 2 UID: 1000 PID: 169 Comm: poc Not tainted 7.0.0-rc7-next-20260410 RIP: 0010:xas_find+0x17/0x480 Call Trace: xa_find+0x17b/0x1d0 tcf_mirred_act+0x640/0x1060 tcf_action_exec+0x400/0x530 basic_classify+0x128/0x1d0 tcf_classify+0xd83/0x1150 tc_run+0x328/0x620 __dev_queue_xmit+0x797/0x3100 tcf_mirred_to_dev+0x7b1/0xf70 tcf_mirred_act+0x68a/0x1060 [repeating ~30+ times until stack overflow] Kernel panic - not syncing: Fatal exception in interrupt Fix this by incrementing sched_mirred_nest before calling tcf_blockcast() and decrementing it on return, mirroring the non-blockcast path. This ensures subsequent recursive entries see the updated counter and are correctly limited by MIRRED_NEST_LIMIT. Fixes: `fe946a751d` ("net/sched: act_mirred: add loop detection") Signed-off-by: Kito Xu (veritas501) <hxzene@gmail.com> Link: https://patch.msgid.link/20260525122556.973584-7-jhs@mojatatu.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-05-28 12:26:36 +02:00
Jamal Hadi Salim	db875221ab	net/sched: Fix ethx:ingress -> ethy:egress -> ethx:ingress mirred loop When mirred redirects to ingress (from either ingress or egress) the loop state from sched_mirred_dev array dev is lost because of 1) the packet deferral into the backlog and 2) the fact the sched_mirred_dev array is cleared. In such cases, if there was a loop we won't discover it. Here's a simple test to reproduce: ip a add dev port0 10.10.10.11/24 tc qdisc add dev port0 clsact tc filter add dev port0 egress protocol ip \ prio 10 matchall action mirred ingress redirect dev port1 tc qdisc add dev port1 clsact tc filter add dev port1 ingress protocol ip \ prio 10 matchall action mirred egress redirect dev port0 ping -c 1 -W0.01 10.10.10.10 Fixes: `fe946a751d` ("net/sched: act_mirred: add loop detection") Tested-by: Victor Nogueira <victor@mojatatu.com> Reviewed-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://patch.msgid.link/20260525122556.973584-6-jhs@mojatatu.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-05-28 12:26:36 +02:00
Jamal Hadi Salim	9552b11e3e	net/sched: fix packet loop on netem when duplicate is on When netem duplicates a packet it re-enqueues the copy at the root qdisc. If another netem sits in the tree the copy can be duplicated again, recursing until the stack or memory is exhausted. The original duplication guard temporarily zeroed q->duplicate around the re-enqueue, but that does not cover all cases because it is per-qdisc state shared across all concurrent enqueue paths and is not safe without additional locking. Use the skb tc_depth field introduced in an earlier patch: - increment it on the duplicate before re-enqueue - skip duplication for any skb whose tc_depth is already non-zero. This marks the packet itself rather than mutating qdisc state, therefore it is safe regardless of tree topology or concurrency. Fixes: `0afb51e728` ("[PKT_SCHED]: netem: reinsert for duplication") Reported-by: William Liu <will@willsroot.io> Reported-by: Savino Dicanosa <savy@syst3mfailure.io> Closes: https://lore.kernel.org/netdev/8DuRWwfqjoRDLDmBMlIfbrsZg9Gx50DHJc1ilxsEBNe2D6NMoigR_eIRIG0LOjMc3r10nUUZtArXx4oZBIdUfZQrwjcQhdinnMis_0G7VEk=@willsroot.io/ Co-developed-by: Victor Nogueira <victor@mojatatu.com> Signed-off-by: Victor Nogueira <victor@mojatatu.com> Reviewed-by: William Liu <will@willsroot.io> Reviewed-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://patch.msgid.link/20260525122556.973584-5-jhs@mojatatu.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-05-28 12:26:36 +02:00
Jamal Hadi Salim	b213a4c607	Revert "selftests/tc-testing: Add tests for restrictions on netem duplication" This reverts commit `ecdec65ec7`. The tests added were related to check_netem_in_tree() which was just reverted in the previous patch. Reviewed-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://patch.msgid.link/20260525122556.973584-4-jhs@mojatatu.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-05-28 12:26:36 +02:00
Jamal Hadi Salim	eda0b7f203	net/sched: Revert "net/sched: Restrict conditions for adding duplicating netems to qdisc tree" This reverts commit `ec8e0e3d7a`. The original patch rejects any tree containing two netems when either has duplication set, even when they sit on unrelated classes of the same classful parent. That broke configurations that have worked since netem was introduced. The re-entrancy problem the original commit was trying to solve is handled by later patch using tc_depth flag. Doing this revert will (re)expose the original bug with multiple netem duplication. When this patch is backported make sure and get the full series. Fixes: `ec8e0e3d7a` ("net/sched: Restrict conditions for adding duplicating netems to qdisc tree") Reported-by: Ji-Soo Chung <jschung2@proton.me> Reported-by: Gerlinde <lrGerlinde@mailfence.com> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=220774 Reported-by: zyc zyc <zyc199902@zohomail.cn> Closes: https://lore.kernel.org/all/19adda5a1e2.12410b78222774.9191120410578703463@zohomail.cn/ Reported-by: Manas Ghandat <ghandatmanas@gmail.com> Closes: https://lore.kernel.org/netdev/f69b2c8f-8325-4c2e-a011-6dbc089f30e4@gmail.com/ Reviewed-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://patch.msgid.link/20260525122556.973584-3-jhs@mojatatu.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-05-28 12:26:36 +02:00
Jamal Hadi Salim	98b34f3e8c	net: Introduce skb tc depth field to track packet loops Add a 2-bit per-skb tc depth field to track packet loops across the stack. The previous per-CPU loop counters like MIRRED_NEST_LIMIT assume a single call stack and lose state in two cases: 1) When a packet is queued and reprocessed later (e.g., egress->ingress via backlog), the per-cpu state is gone by the time it is dequeued. 2) With XPS/RPS a packet may arrive on one CPU and be processed on another. A per-skb field solves both by travelling with the packet itself. The field fits in existing padding, using 2 bits that were previously a hole: pahole before(-) and after (+) diff looks like: __u8 slow_gro:1; /* 132: 3 1 / __u8 csum_not_inet:1; / 132: 4 1 / __u8 unreadable:1; / 132: 5 1 / + __u8 tc_depth:2; / 132: 6 1 / - / XXX 2 bits hole, try to pack / / XXX 1 byte hole, try to pack / __u16 tc_index; / 134 2 */ There used to be a ttl field which was removed as part of tc_verd in commit `aec745e2c5` ("net-tc: remove unused tc_verd fields"). It was already unused by that time, due to remove earlier in commit `c19ae86a51` ("tc: remove unused redirect ttl"). The first user of this field is netem, which increments tc_depth on duplicated packets before re-enqueueing them at the root qdisc. On re-entry, netem skips duplication for any skb with tc_depth already set, bounding recursion to a single level regardless of tree topology. The other user is mirred which increments it on each pass and limits to depth to MIRRED_DEFER_LIMIT (3). The new field was called ttl in earlier versions of this patch but renamed to tc_depth to avoid confusion with IP ttl. Note (looking at you Sashiko! Dont ignore me and continue bringing this up): 1. Since both mirred and netem utilize the same 2-bit tc_depth field it is possible when netem and mirred are used together that netem qdisc to skip the duplication step. This is a known trade-off, as a 2-bit field cannot independently track both features' recursion depths and it is not considered sane to have a setup that addresses both features on at the same time. 2. skb_scrub_packet does not clear tc_depth. This means a packet's loop history is preserved even across namespaces. While this might be restrictive for some topologies, it is also design intent to provide robustness against loops across namespaces. Reviewed-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://patch.msgid.link/20260525122556.973584-2-jhs@mojatatu.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-05-28 12:26:36 +02:00
Rahul Chandelkar	9d5e7a46a9	ipv6: rpl: fix hdrlen overflow in ipv6_rpl_srh_decompress() ipv6_rpl_srh_decompress() computes: outhdr->hdrlen = (((n + 1) * sizeof(struct in6_addr)) >> 3); hdrlen is __u8. For n >= 127 the result exceeds 255 and silently truncates. With n=127 (cmpri=15, cmpre=15, pad=0, hdrlen=16): (128 * 16) >> 3 = 256, truncated to 0 as __u8 The caller in ipv6_rpl_srh_rcv() then places the compressed header at buf + ((ohdr->hdrlen + 1) << 3). With hdrlen=0 this is buf + 8, but the decompressed region occupies buf[0..2055] (8-byte header plus 128 full addresses). The compressed header overlaps the decompressed data, and ipv6_rpl_srh_compress() writes into this overlap, corrupting the routing header of the forwarded packet. The existing guard at exthdrs.c:546 checks (n + 1) > 255, which prevents n+1 from overflowing unsigned char (the segments_left field), but does not prevent the computed hdrlen from overflowing __u8. n=127 passes because 128 <= 255, yet hdrlen=256 does not fit. Tighten the bound to (n + 1) > 127. This caps n at 126, giving hdrlen = (127 * 16) >> 3 = 254, which fits in __u8. The compressed header then lands at buf + ((254 + 1) << 3) = buf + 2040, exactly past the decompressed region (buf[0..2039]). No overlap. 127 segments is well beyond any realistic RPL deployment. Fixes: `8610c7c6e3` ("net: ipv6: add support for rpl sr exthdr") Signed-off-by: Rahul Chandelkar <rc@rexion.ai> Link: https://patch.msgid.link/20260525154031.2290876-1-rc@rexion.ai Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-27 19:05:53 -07:00
Jakub Kicinski	d07e5b20f1	Merge branch 'ethtool-more-bug-fixes' Jakub Kicinski says: ==================== ethtool: more bug fixes Last week I sent two patch sets - one fixing bugs in RSS handling, and one fixing CMIS / module handling. This set contains the remaining fixes. There's a concentration of fixes around PHY and timestamp config handling but not enough to break those out as separate sets. ==================== Link: https://patch.msgid.link/20260526153533.2779187-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-27 17:42:19 -07:00
Jakub Kicinski	67cfdd9210	ethtool: eeprom: add more safeties to EEPROM Netlink fallback The Netlink fallback path for reading module EEPROM (fallback_set_params()) validates that offset < eeprom_len, but does not check that offset + length stays within eeprom_len. The ioctl equivalent (ethtool_get_any_eeprom() in ioctl.c) has always enforced both bounds: if (eeprom.offset + eeprom.len > total_len) return -EINVAL; This could lead to surprises in both drivers and device FW. Add the missing offset + length validation to fallback_set_params(), mirroring the ioctl. Similarly - ethtool core in general, and ethtool_get_any_eeprom() in particular tries to zero-init all buffers passed to the drivers to avoid any extra work of zeroing things out. eeprom_fallback() uses a plain kmalloc(), change it to zalloc. Fixes: `96d971e307` ("ethtool: Add fallback to get_module_eeprom from netlink command") Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Link: https://patch.msgid.link/20260526153533.2779187-11-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-27 17:42:09 -07:00
Jakub Kicinski	2376586f85	ethtool: eeprom: add missing ethnl_ops_begin() / _complete() during fallback All ethtool driver op calls should be sandwiched between ethnl_ops_begin() / ethnl_ops_complete(). In Netlink eeprom code, if the paged access failed we fall back to old API, but we first call _complete() and the fallback never does its own ethnl_ops_begin(). Move the fallback into the _begin() / _complete() section. Fixes: `96d971e307` ("ethtool: Add fallback to get_module_eeprom from netlink command") Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Link: https://patch.msgid.link/20260526153533.2779187-10-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-27 17:42:09 -07:00
Jakub Kicinski	a8d8bef6b4	ethtool: strset: fix header attribute index in ethnl_req_get_phydev() strset_prepare_data() passes ETHTOOL_A_HEADER_FLAGS (3) as the header attribute to ethnl_req_get_phydev(). This is incorrect, in the main attr space 3 is ETHTOOL_A_STRSET_COUNTS_ONLY, not the request header attr. The correct constant is ETHTOOL_A_STRSET_HEADER (1). ethnl_req_get_phydev() only uses this value for the extack, so this is not a "functionally visible"(?) bug. Fixes: `e96c93aa4b` ("net: ethtool: strset: Allow querying phy stats by index") Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Link: https://patch.msgid.link/20260526153533.2779187-9-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-27 17:42:09 -07:00
Jakub Kicinski	c3fc9976f6	ethtool: tsinfo: don't pass ERR_PTR to genlmsg_cancel on prepare failure The goto err label leads to: genlmsg_cancel(skb, ehdr); return ret; If ethnl_tsinfo_prepare_dump() failed, it has not started a genlmsg. There's nothing to cancel, and passing an error pointer to genlmsg_cancel() would cause a crash. Fixes: `b9e3f7dc9e` ("net: ethtool: tsinfo: Enhance tsinfo to support several hwtstamp by net topology") Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Reviewed-by: Kory Maincent <kory.maincent@bootlin.com> Link: https://patch.msgid.link/20260526153533.2779187-8-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-27 17:42:08 -07:00
Jakub Kicinski	1de405699c	ethtool: tsinfo: fix uninitialized stats on the by-PHC path tsinfo_prepare_data() has two code paths: a "by-PHC" path for user-specified hardware timestamping providers, and the old path. Commit `89e281ebff` ("ethtool: init tsinfo stats if requested") added ethtool_stats_init() to mark stat slots as ETHTOOL_STAT_NOT_SET before the driver callback populates them, but placed the call inside the old-path block. When commit `b9e3f7dc9e` ("net: ethtool: tsinfo: Enhance tsinfo to support several hwtstamp by net topology") added the by-PHC early return, it landed above the stats initialization. On that path the stats array retains the zero-fill from ethnl_init_reply_data()'s zalloc. This leads to the reply including a stats nest with four zero-valued attributes that should have been absent. Reject GET requests for stats with HWTSTAMP_PROVIDER or dump. Fixes: `b9e3f7dc9e` ("net: ethtool: tsinfo: Enhance tsinfo to support several hwtstamp by net topology") Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Link: https://patch.msgid.link/20260526153533.2779187-7-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-27 17:42:08 -07:00
Jakub Kicinski	6386bd772d	ethtool: tsconfig: fix missing ethnl_ops_complete() tsconfig_prepare_data() calls ethnl_ops_begin(), we need to call ethnl_ops_complete() before returning the error. Fixes: `6e9e2eed4f` ("net: ethtool: Add support for tsconfig command to get/set hwtstamp config") Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev> Reviewed-by: Kory Maincent <kory.maincent@bootlin.com> Link: https://patch.msgid.link/20260526153533.2779187-6-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-27 17:42:08 -07:00
Jakub Kicinski	ab5bf428fb	ethtool: pse-pd: fix missing ethnl_ops_complete() pse_prepare_data() is missing ethnl_ops_complete() if ethnl_req_get_phydev() returned an error. Move getting phydev up so that we don't have to worry about this (similar order to linkstate_prepare_data()). Note that phydev may still be NULL (this is checked in pse_get_pse_attributes()), the goal isn't really to avoid the _begin() / _complete() calls, only to simplify the error handling. While at it propagate the original error. Why this code overrides the error with -ENODEV but !phydev generates -EOPNOTSUPP is unclear to me... Fixes: `31748765be` ("net: ethtool: pse-pd: Target the command to the requested PHY") Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Link: https://patch.msgid.link/20260526153533.2779187-5-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-27 17:42:08 -07:00
Jakub Kicinski	596c51ed9e	ethtool: linkstate: fix unbalanced ethnl_ops_complete() on PHY lookup error linkstate_prepare_data() calls ethnl_req_get_phydev() before ethnl_ops_begin(), but routes its error path through "goto out" which calls ethnl_ops_complete(). Fixes: `fe55b1d401` ("ethtool: linkstate: migrate linkstate functions to support multi-PHY setups") Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Link: https://patch.msgid.link/20260526153533.2779187-4-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-27 17:42:08 -07:00
Jakub Kicinski	a888bbd439	ethtool: tsconfig: fix reply error handling A couple of trivial bugs in error handling in tsconfig_send_reply(). If we failed to allocate rskb we need to set the error. If we did allocate it but failed to send it - we need to remember to free it. Fixes: `6e9e2eed4f` ("net: ethtool: Add support for tsconfig command to get/set hwtstamp config") Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev> Reviewed-by: Kory Maincent <kory.maincent@bootlin.com> Link: https://patch.msgid.link/20260526153533.2779187-3-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-27 17:42:08 -07:00
Jakub Kicinski	7281b096b0	ethtool: coalesce: cap profile updates at NET_DIM_PARAMS_NUM_PROFILES ethnl_update_profile() walks the ETHTOOL_A_PROFILE_IRQ_MODERATION nest list with an index 'i' and writes new_profile[i++] without bounding i. The destination is kmemdup()'d at NET_DIM_PARAMS_NUM_PROFILES entries (5), but the Netlink nest count is entirely user-controlled. Netlink policies do not have support for constraining the number of nested entries (or number of multi-attr entries). Fixes: `f750dfe825` ("ethtool: provide customized dim profile management") Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Link: https://patch.msgid.link/20260526153533.2779187-2-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-27 17:42:07 -07:00
Jakub Kicinski	ce1e33020a	Merge branch 'bridge-fix-sleep-in-atomic-context' Ido Schimmel says: ==================== bridge: Fix sleep in atomic context Under certain circumstances the bridge driver can call dev_set_promiscuity() while holding the bridge spin lock. This is a problem as dev_set_promiscuity() might sleep. Patches #1-#2 fix the problem in the netlink and sysfs configuration paths by only taking the lock where it is actually needed, thereby avoiding calling dev_set_promiscuity() from an atomic context. Patch #3 adds test cases for both configuration paths in rtnetlink.sh which already includes test cases for similar issues. Note that dev_set_promiscuity() can sleep either when it takes the net device mutex or when calling netif_rx_mode_sync(). I encountered the problem with the latter, but blamed the former since it came earlier. ==================== Link: https://patch.msgid.link/20260526064818.272516-1-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-27 17:23:07 -07:00
Ido Schimmel	147f3b1f23	selftests: rtnetlink: Add bridge promiscuity tests Add two test cases that always pass, but trigger sleeping in atomic context BUGs without "bridge: Fix sleep in atomic context in netlink path" and "bridge: Fix sleep in atomic context in sysfs path". Reviewed-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20260526064818.272516-4-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-27 17:23:05 -07:00
Ido Schimmel	6d34594cc6	bridge: Fix sleep in atomic context in sysfs path Since the start of the git history, brport_store() always acquired the bridge lock. Back then this decision made sense: The bridge lock protects the STP state of the bridge and its ports and at that time the function was only used by two STP related attributes (cost and priority). Nowadays, brport_store() processes a lot more attributes and most of them do not need the bridge lock: * Bridge flags: Only require RTNL. Read locklessly by the data path. Annotations can be added in net-next. * FDB port flushing: Only requires the FDB lock. * Multicast attributes: Only require the multicast lock. * Group forward mask: Only requires RTNL. Read locklessly by the data path. Annotations can be added in net-next. * Backup port: Only requires RTNL. Read locklessly by the data path. This is a problem as the bridge calls dev_set_promiscuity() when certain bridge port flags change and this function can sleep since the commit cited below, resulting in a splat such as [1]. Fix this by reducing the scope of the bridge lock and only take it when processing the two STP related attributes that require it. Remove the now stale comment from br_switchdev_set_port_flag(). The SWITCHDEV_F_DEFER flag can be removed in net-next. [1] BUG: sleeping function called from invalid context at net/core/dev_addr_lists.c:1262 in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 372, name: bash preempt_count: 201, expected: 0 RCU nest depth: 0, expected: 0 5 locks held by bash/372: #0: ffff88810c51c3f0 (sb_writers#7){.+.+}-{0:0}, at: ksys_write (fs/read_write.c:740) #1: ffff888115ce9480 (&of->mutex){+.+.}-{4:4}, at: kernfs_fop_write_iter (fs/kernfs/file.c:343) #2: ffff88810b9fd330 (kn->active#37){.+.+}-{0:0}, at: kernfs_fop_write_iter (fs/kernfs/file.c:80 fs/kernfs/file.c:344) #3: ffffffffa59473a0 (rtnl_mutex){+.+.}-{4:4}, at: brport_store (net/bridge/br_sysfs_if.c:326) #4: ffff8881099d2d58 (&br->lock){+...}-{3:3}, at: brport_store (./include/linux/spinlock.h:348 net/bridge/br_sysfs_if.c:345) Preemption disabled at: 0x0 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 Call Trace: <TASK> dump_stack_lvl (lib/dump_stack.c:94 lib/dump_stack.c:120) __might_resched.cold (kernel/sched/core.c:9163) netif_rx_mode_run (net/core/dev_addr_lists.c:1262) netif_rx_mode_sync (net/core/dev_addr_lists.c:1428) dev_set_promiscuity (net/core/dev_api.c:289) br_manage_promisc (net/bridge/br_if.c:135 net/bridge/br_if.c:172) br_port_flags_change (net/bridge/br_if.c:242 net/bridge/br_if.c:747) store_learning (net/bridge/br_sysfs_if.c:79 net/bridge/br_sysfs_if.c:235) brport_store (net/bridge/br_sysfs_if.c:346) kernfs_fop_write_iter (fs/kernfs/file.c:352) new_sync_write (fs/read_write.c:595) vfs_write (fs/read_write.c:688) ksys_write (fs/read_write.c:740) do_syscall_64 (arch/x86/entry/syscall_64.c:63 arch/x86/entry/syscall_64.c:94) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:121) Fixes: `78cd408356` ("net: add missing instance lock to dev_set_promiscuity") Reviewed-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20260526064818.272516-3-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-27 17:23:05 -07:00
Ido Schimmel	5eec4427b8	bridge: Fix sleep in atomic context in netlink path Since the introduction of the netlink configuration path for bridge ports in commit `25c71c75ac` ("bridge: bridge port parameters over netlink"), br_setport() was always called with the bridge lock held around it. Back then this decision made sense: The bridge lock protects the STP state of the bridge and its ports and at that time the function only processed three STP related netlink attributes (cost, priority and state). Nowadays, br_setport() processes a lot more attributes and most of them do not need the bridge lock: * Bridge flags: Only require RTNL. Read locklessly by the data path. Annotations can be added in net-next. * FDB port flushing: Only requires the FDB lock. * Multicast attributes: Only require the multicast lock. * Group forward mask: Only requires RTNL. Read locklessly by the data path. Annotations can be added in net-next. * Backup port and NHID: Only require RTNL. Read locklessly by the data path. This is a problem as the bridge calls dev_set_promiscuity() when certain bridge port flags change and this function can sleep since the commit cited below, resulting in a splat such as [1]. Fix this by reducing the scope of the bridge lock and only take it when processing the three STP related attributes that require it. This is consistent with the multicast attributes where each attribute acquires the multicast lock instead of having one critical section for all relevant attributes. [1] BUG: sleeping function called from invalid context at net/core/dev_addr_lists.c:1262 in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 356, name: bridge preempt_count: 201, expected: 0 RCU nest depth: 0, expected: 0 2 locks held by bridge/356: #0: ffffffff919473a0 (rtnl_mutex){+.+.}-{4:4}, at: rtnetlink_rcv_msg (net/core/rtnetlink.c:80 net/core/rtnetlink.c:7002) #1: ffff888115072d58 (&br->lock){+...}-{3:3}, at: br_setlink (./include/linux/spinlock.h:348 net/bridge/br_netlink.c:1117) Preemption disabled at: 0x0 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 Call Trace: <TASK> dump_stack_lvl (lib/dump_stack.c:94 lib/dump_stack.c:120) __might_resched.cold (kernel/sched/core.c:9163) netif_rx_mode_run (net/core/dev_addr_lists.c:1262) netif_rx_mode_sync (net/core/dev_addr_lists.c:1428) dev_set_promiscuity (net/core/dev_api.c:289) br_manage_promisc (net/bridge/br_if.c:135 net/bridge/br_if.c:172) br_port_flags_change (net/bridge/br_if.c:242 net/bridge/br_if.c:747) br_setport (net/bridge/br_netlink.c:1000) br_setlink (net/bridge/br_netlink.c:1118) rtnl_bridge_setlink (net/core/rtnetlink.c:5572) rtnetlink_rcv_msg (net/core/rtnetlink.c:7005) netlink_rcv_skb (net/netlink/af_netlink.c:2550) netlink_unicast (net/netlink/af_netlink.c:1318 net/netlink/af_netlink.c:1344) netlink_sendmsg (net/netlink/af_netlink.c:1894) __sock_sendmsg (net/socket.c:787 (discriminator 4) net/socket.c:802 (discriminator 4)) ____sys_sendmsg (net/socket.c:2698) ___sys_sendmsg (net/socket.c:2752) __sys_sendmsg (net/socket.c:2784) do_syscall_64 (arch/x86/entry/syscall_64.c:63 arch/x86/entry/syscall_64.c:94) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:121) Fixes: `78cd408356` ("net: add missing instance lock to dev_set_promiscuity") Reviewed-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20260526064818.272516-2-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-27 17:23:05 -07:00
Oliver Hartkopp	8ba68464e4	bonding: refuse to enslave CAN devices syzbot reported a kernel paging request crash in can_rx_unregister() inside net/can/af_can.c. The crash occurs because a virtual CAN device (vxcan) is being enslaved to a bonding master. During the enslavement process, the bonding driver mutates and modifies the network device states to fit an Ethernet-like aggregation model. However, CAN devices operate on a completely different Layer 2 architecture, relying on the CAN mid-layer private data structure (can_ml_priv) instead of standard Ethernet structures. Since bonding does not initialize or maintain these CAN structures, subsequent operations on the half-enslaved interface (such as closing associated sockets via isotp_release) lead to a null-pointer dereference when accessing the CAN receiver lists. Bonding CAN interfaces is architecturally invalid as CAN lacks MAC addresses, ARP capabilities, and standard Ethernet link-layer mechanisms. While generic loopback devices are blocked globally in net/core/dev.c, virtual CAN devices bypass this check because they do not carry the IFF_LOOPBACK flag, despite acting as local software-loopbacks. Fix this by explicitly blocking network devices of type ARPHRD_CAN from being enslaved at the very beginning of bond_enslave(). This prevents illegal state mutations, eliminates the resulting KASAN crashes, and avoids potential memory leaks from incomplete socket cleanups. As the CAN support has been added a long time after bonding the Fixes-tag points to the introduction of ARPHRD_CAN that would have needed a specific handling in bonding_main.c. Fixes: `cd05acfe65` ("[CAN]: Allocate protocol numbers for PF_CAN") Reported-by: syzbot+8ed98cbd0161632bce95@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=8ed98cbd0161632bce95 Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Acked-by: Jay Vosburgh <jv@jvosburgh.net> Link: https://patch.msgid.link/20260526-bonding-candev-v1-1-ba1df400918a@hartkopp.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-27 16:58:28 -07:00
Qi Tang	dd433671fe	ipv6: validate extension header length before copying to cmsg ip6_datagram_recv_specific_ctl() builds IPV6_{HOPOPTS,DSTOPTS,RTHDR} cmsgs (and their IPV6_2292* legacy counterparts) by trusting the on-wire hdrlen byte (ptr[1]) when computing the put_cmsg() length. The length was validated only at parse time (ipv6_parse_hopopts(), etc.). An nftables payload-write expression can rewrite hdrlen after parsing and before the skb reaches recvmsg; the write itself is in-bounds but put_cmsg() then reads up to ((hdrlen+1) << 3) = 2040 bytes from an 8-byte header. nftables is reachable from an unprivileged user namespace, so this is an unprivileged slab-out-of-bounds read: BUG: KASAN: slab-out-of-bounds in put_cmsg+0x3ac/0x540 put_cmsg+0x3ac/0x540 udpv6_recvmsg+0xca0/0x1250 sock_recvmsg+0xdf/0x190 ____sys_recvmsg+0x1b1/0x620 Add ipv6_get_exthdr_len() which validates that at least two bytes are accessible before reading the hdrlen field, then checks the computed length against skb_tail_pointer(skb), returning 0 on failure. Extension headers are kept in the linear skb area by pskb_may_pull() during input, so skb_tail_pointer() is the correct bound. Use ipv6_get_exthdr_len() at all non-AH call sites: the five standalone cmsg blocks (HbH, 2292HbH, 2292DSTOPTS x2, 2292RTHDR) and the three standard cases in the extension-header walk loop (DSTOPTS, ROUTING, default). AH retains an inline bounds check because its length formula differs ((ptr[1]+2)<<2). The walk loop also gets a pre-read bounds check at the top to validate ptr before any case accesses ptr[0] or ptr[1]. When the walk loop detects a corrupted header, return from the function instead of continuing to process later socket options. Cc: stable@vger.kernel.org Fixes: `1da177e4c3` ("Linux-2.6.12-rc2") Signed-off-by: Qi Tang <tpluszz77@gmail.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20260523143245.2281415-1-tpluszz77@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-26 18:53:10 -07:00
Jakub Kicinski	e66c456f7c	Merge tag 'nfc-7.1-rc6' of https://codeberg.org/linux-nfc/linux David Heidelberg says: ==================== nfc pull request for net: Code improvements - llcp: Fix use-after-free in llcp_sock_release() - llcp: Fix use-after-free race in nfc_llcp_recv_cc() - hci: fix out-of-bounds read in HCP header parsing Regression fixes: - nxp-nci: i2c: use rising-edge IRQ on ACPI systems Signed-off-by: David Heidelberg <david@ixit.cz> * tag 'nfc-7.1-rc6' of https://codeberg.org/linux-nfc/linux: nfc: nxp-nci: i2c: use rising-edge IRQ on ACPI systems nfc: hci: fix out-of-bounds read in HCP header parsing nfc: llcp: Fix use-after-free race in nfc_llcp_recv_cc() nfc: llcp: Fix use-after-free in llcp_sock_release() ==================== Link: https://patch.msgid.link/217c0646-8a30-4037-b613-580c2b189729@ixit.cz Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-26 18:32:34 -07:00
Eric Dumazet	509323077e	tunnels: do not assume transport header in iptunnel_pmtud_check_icmp() In some cases, iptunnel_pmtud_check_icmp() can be called while skb transport header is not set. This triggers an out-of-bound access, because (typeof(skb->transport_header))~0U is 65535. Access the icmp header based on IPv4 network header, after making sure icmp->type is present in skb linear part. Note that iptunnel_pmtud_check_icmpv6()) is fine. Fixes: `4cb47a8644` ("tunnels: PMTU discovery support for directly bridged IP packets") Reported-by: Damiano Melotti <melotti@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260522115512.1519110-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-26 18:11:47 -07:00
Eric Dumazet	7d9ef0cb27	vxlan: do not reuse cached ip_hdr() value after skb_tunnel_check_pmtu() skb_tunnel_check_pmtu() can change skb->head. Reusing old_iph afer skb_tunnel_check_pmtu() can cause an UAF. Use instead ip_hdr(skb) as done in drivers/net/bareudp.c and drivers/net/geneve.c. Found by Sashiko. Fixes: `4cb47a8644` ("tunnels: PMTU discovery support for directly bridged IP packets") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Link: https://patch.msgid.link/20260525203642.2389723-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-26 18:10:55 -07:00
Eric Dumazet	b4bc943530	tunnels: load network headers after skb_cow() in iptunnel_pmtud_build_icmp[v6]() Sashiko found that iptunnel_pmtud_build_icmp() and iptunnel_pmtud_build_icmpv6() were caching ip_hdr() and ipv6_hdr() before an skb_cow() call which can reallocate skb->head. Fix this possible UAF by initializing the local variables after the skb_cow() call. Remove skb_reset_network_header() calls which were not needed. Fixes: `4cb47a8644` ("tunnels: PMTU discovery support for directly bridged IP packets") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Link: https://patch.msgid.link/20260525201335.2361845-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-26 18:10:25 -07:00
Lucien.Jheng	d895767c33	net: phy: air_en8811h: add AN8811HB MCU assert/deassert support AN8811HB needs a MCU soft-reset cycle before firmware loading begins. Assert the MCU (hold it in reset) and immediately deassert (release) via a dedicated PBUS register pair (0x5cf9f8 / 0x5cf9fc), accessed through a registered mdio_device at PHY-addr+8. Add __air_pbus_reg_write() as a low-level helper taking a struct mdio_device *, create and register the PBUS mdio_device in an8811hb_probe() and store it in priv->pbusdev, then implement an8811hb_mcu_assert() / _deassert() on top of it. Add an8811hb_remove() to unregister the PBUS device on teardown. Wire both calls into an8811hb_load_firmware() and en8811h_restart_mcu() so every firmware load or MCU restart on AN8811HB correctly sequences the reset control registers. Fixes: `5afda1d734` ("net: phy: air_en8811h: add Airoha AN8811HB support") Signed-off-by: Lucien Jheng <lucienzx159@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://patch.msgid.link/20260524063915.47961-1-lucienzx159@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-26 17:59:47 -07:00
Michael Bommarito	05f95729ca	l2tp: use refcount_inc_not_zero in l2tp_session_get_by_ifname A reader in l2tp_session_get_by_ifname() can return a pointer to a session whose refcount has reached zero. The getter takes its reference with plain refcount_inc(), but every other session getter in the same file (l2tp_v2_session_get, l2tp_v3_session_get, and the corresponding _get_next variants) uses refcount_inc_not_zero() because the IDR/RCU lookup can race with refcount_dec_and_test() -> l2tp_session_free() -> kfree_rcu(). The ifname getter is the only outlier; the inconsistency was raised on-list after `979c017803` ("l2tp: use list_del_rcu in l2tp_session_unhash"). A reader inside rcu_read_lock_bh() that matches session->ifname can be preempted between the strcmp() and the refcount_inc(). If the last reference drops on another CPU in that window, the reader's refcount_inc() runs on a counter that has reached zero. refcount_t catches the addition-on-zero, prints "refcount_t: addition on 0; use-after-free", saturates the counter, and returns the saturated pointer to the caller. Session memory is held live by the in-flight RCU read section, but the kfree_rcu() callback queued from l2tp_session_free() will free it once the grace period closes; a caller that dereferences the returned session past that point hits a slab-use-after-free. On PREEMPT_RT local_bh_disable() is a per-CPU sleeping lock and the preemption window is real; on stock PREEMPT kernels local_bh_disable() is a preempt_count increment that closes the cross-CPU race in practice (see below). Use refcount_inc_not_zero() and continue the list walk on failure, matching the other session getters in the file. The ifname getter is the only session getter in net/l2tp/ that still uses the bare refcount_inc() pattern; this change restores file-internal consistency. The success path is unchanged. Fixes: `abe7a1a7d0` ("l2tp: improve tunnel/session refcount helpers") Cc: stable@vger.kernel.org Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com> Reviewed-by: James Chapman <jchapman@katalix.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260523023423.2568972-1-michael.bommarito@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-26 17:35:12 -07:00
Jakub Kicinski	c66d7c3c1f	Merge branch 'ethtool-module-fix-a-handful-of-small-bugs' Jakub Kicinski says: ==================== ethtool: module: fix a handful of small bugs I've been poking at the locking in ethtool and it appears that the FW flashing is not currently taking the ops lock. Existing drivers which implement module FW flashing seem to have their own locking, so this series doesn't actually add the ops lock (I'll add it in net-next). But a number of other errors have been surfaced in the process. ==================== Link: https://patch.msgid.link/20260522231312.1710836-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-26 08:19:36 -07:00
Jakub Kicinski	d5551f4c18	ethtool: cmis: validate fw->size against start_cmd_payload_size cmis_fw_update_start_download() copies start_cmd_payload_size bytes from the firmware blob into the CDB LPL vendor_data[] payload without validating that the FW has enough data. Since the start_cmd_payload_size can only be ~120B an image too short is most likely corrupted, so reject it. Fixes: `c4f78134d4` ("ethtool: cmis_fw_update: add a layer for supporting firmware update using CDB") Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Reviewed-by: Danielle Ratson <danieller@nvidia.com> Link: https://patch.msgid.link/20260522231312.1710836-10-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-26 08:19:33 -07:00
Jakub Kicinski	12c2496a71	ethtool: cmis: validate start_cmd_payload_size from module The CMIS firmware update code reads start_cmd_payload_size from the module's FW Management Features CDB reply and uses it directly as the byte count for memcpy. The destination buffer is 112 bytes (ETHTOOL_CMIS_CDB_LPL_MAX_PL_LENGTH - 8). So a malicious module (or corrupted response) can cause a OOB write later on in cmis_fw_update_start_download(). Let's error out. If modules that expect longer LPL writes actually exist we should revisit. struct cmis_cdb_start_fw_download_pl's definition has to move, no change there. Fixes: `c4f78134d4` ("ethtool: cmis_fw_update: add a layer for supporting firmware update using CDB") Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Reviewed-by: Danielle Ratson <danieller@nvidia.com> Link: https://patch.msgid.link/20260522231312.1710836-9-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-26 08:19:33 -07:00
Jakub Kicinski	3e8c3d464c	ethtool: cmis: fix u16-to-u8 truncation of msleep_pre_rpl ethtool_cmis_cdb_compose_args() accepts msleep_pre_rpl as u16 but stores it into the u8 field ethtool_cmis_cdb_cmd_args::msleep_pre_rpl, silently truncating values >= 256. Seven of the nine call sites pass 1000 ms (it's the third argument from the end). Fixes: `a39c84d796` ("ethtool: cmis_cdb: Add a layer for supporting CDB commands") Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Reviewed-by: Danielle Ratson <danieller@nvidia.com> Link: https://patch.msgid.link/20260522231312.1710836-8-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-26 08:19:33 -07:00

1 2 3 4 5 ...

1445959 Commits