Commit Graph

1402784 Commits

Author SHA1 Message Date
Paolo Abeni
cdc3074c00 Merge branch 'sctp-fix-two-issues-in-sctp_clone_sock'
Kuniyuki Iwashima says:

====================
sctp: Fix two issues in sctp_clone_sock().

syzbot reported two issues in sctp_clone_sock().

This series fixes the issues.

v1: https://lore.kernel.org/netdev/20251208133728.157648-1-kuniyu@google.com/
====================

Link: https://patch.msgid.link/20251210081206.1141086-1-kuniyu@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-12-18 16:18:02 +01:00
Kuniyuki Iwashima
d7ff61e6f3 sctp: Clear inet_opt in sctp_v6_copy_ip_options().
syzbot reported the splat below. [0]

Since the cited commit, the child socket inherits all fields
of its parent socket unless explicitly cleared.

syzbot set IP_OPTIONS to AF_INET6 socket and created a child
socket inheriting inet_sk(sk)->inet_opt.

sctp_v6_copy_ip_options() only clones np->opt, and leaving
inet_opt results in double-free.

Let's clear inet_opt in sctp_v6_copy_ip_options().

[0]:
BUG: KASAN: double-free in inet_sock_destruct+0x538/0x740 net/ipv4/af_inet.c:159
Free of addr ffff8880304b6d40 by task ksoftirqd/0/15

CPU: 0 UID: 0 PID: 15 Comm: ksoftirqd/0 Not tainted syzkaller #0 PREEMPT(full)
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/02/2025
Call Trace:
 <TASK>
 dump_stack_lvl+0x189/0x250 lib/dump_stack.c:120
 print_address_description mm/kasan/report.c:378 [inline]
 print_report+0xca/0x240 mm/kasan/report.c:482
 kasan_report_invalid_free+0xea/0x110 mm/kasan/report.c:557
 check_slab_allocation+0xe1/0x130 include/linux/page-flags.h:-1
 kasan_slab_pre_free include/linux/kasan.h:198 [inline]
 slab_free_hook mm/slub.c:2484 [inline]
 slab_free mm/slub.c:6630 [inline]
 kfree+0x148/0x6d0 mm/slub.c:6837
 inet_sock_destruct+0x538/0x740 net/ipv4/af_inet.c:159
 __sk_destruct+0x89/0x660 net/core/sock.c:2350
 sock_put include/net/sock.h:1991 [inline]
 sctp_endpoint_destroy_rcu+0xa1/0xf0 net/sctp/endpointola.c:197
 rcu_do_batch kernel/rcu/tree.c:2605 [inline]
 rcu_core+0xcab/0x1770 kernel/rcu/tree.c:2861
 handle_softirqs+0x286/0x870 kernel/softirq.c:622
 run_ksoftirqd+0x9b/0x100 kernel/softirq.c:1063
 smpboot_thread_fn+0x542/0xa60 kernel/smpboot.c:160
 kthread+0x711/0x8a0 kernel/kthread.c:463
 ret_from_fork+0x4bc/0x870 arch/x86/kernel/process.c:158
 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
 </TASK>

Allocated by task 6003:
 kasan_save_stack mm/kasan/common.c:56 [inline]
 kasan_save_track+0x3e/0x80 mm/kasan/common.c:77
 poison_kmalloc_redzone mm/kasan/common.c:400 [inline]
 __kasan_kmalloc+0x93/0xb0 mm/kasan/common.c:417
 kasan_kmalloc include/linux/kasan.h:262 [inline]
 __do_kmalloc_node mm/slub.c:5642 [inline]
 __kmalloc_noprof+0x411/0x7f0 mm/slub.c:5654
 kmalloc_noprof include/linux/slab.h:961 [inline]
 kzalloc_noprof include/linux/slab.h:1094 [inline]
 ip_options_get+0x51/0x4c0 net/ipv4/ip_options.c:517
 do_ip_setsockopt+0x1d9b/0x2d00 net/ipv4/ip_sockglue.c:1087
 ip_setsockopt+0x66/0x110 net/ipv4/ip_sockglue.c:1417
 do_sock_setsockopt+0x17c/0x1b0 net/socket.c:2360
 __sys_setsockopt net/socket.c:2385 [inline]
 __do_sys_setsockopt net/socket.c:2391 [inline]
 __se_sys_setsockopt net/socket.c:2388 [inline]
 __x64_sys_setsockopt+0x13f/0x1b0 net/socket.c:2388
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0xfa/0xfa0 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

Freed by task 15:
 kasan_save_stack mm/kasan/common.c:56 [inline]
 kasan_save_track+0x3e/0x80 mm/kasan/common.c:77
 __kasan_save_free_info+0x46/0x50 mm/kasan/generic.c:587
 kasan_save_free_info mm/kasan/kasan.h:406 [inline]
 poison_slab_object mm/kasan/common.c:252 [inline]
 __kasan_slab_free+0x5c/0x80 mm/kasan/common.c:284
 kasan_slab_free include/linux/kasan.h:234 [inline]
 slab_free_hook mm/slub.c:2539 [inline]
 slab_free mm/slub.c:6630 [inline]
 kfree+0x19a/0x6d0 mm/slub.c:6837
 inet_sock_destruct+0x538/0x740 net/ipv4/af_inet.c:159
 __sk_destruct+0x89/0x660 net/core/sock.c:2350
 sock_put include/net/sock.h:1991 [inline]
 sctp_endpoint_destroy_rcu+0xa1/0xf0 net/sctp/endpointola.c:197
 rcu_do_batch kernel/rcu/tree.c:2605 [inline]
 rcu_core+0xcab/0x1770 kernel/rcu/tree.c:2861
 handle_softirqs+0x286/0x870 kernel/softirq.c:622
 run_ksoftirqd+0x9b/0x100 kernel/softirq.c:1063
 smpboot_thread_fn+0x542/0xa60 kernel/smpboot.c:160
 kthread+0x711/0x8a0 kernel/kthread.c:463
 ret_from_fork+0x4bc/0x870 arch/x86/kernel/process.c:158
 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245

Fixes: 16942cf4d3 ("sctp: Use sk_clone() in sctp_accept().")
Reported-by: syzbot+ec33a1a006ed5abe7309@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/netdev/6936d112.a70a0220.38f243.00a8.GAE@google.com/
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20251210081206.1141086-3-kuniyu@google.com
Acked-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-12-18 16:18:00 +01:00
Kuniyuki Iwashima
b98f06f9a5 sctp: Fetch inet6_sk() after setting ->pinet6 in sctp_clone_sock().
syzbot reported the lockdep splat below. [0]

sctp_clone_sock() sets the child socket's ipv6_mc_list to NULL,
but somehow sock_release() in an error path finally acquires
lock_sock() in ipv6_sock_mc_close().

The root cause is that sctp_clone_sock() fetches inet6_sk(newsk)
before setting newinet->pinet6, meaning that the parent's
ipv6_mc_list was actually cleared.

Also, sctp_v6_copy_ip_options() uses inet6_sk() but is called
before newinet->pinet6 is set.

Let's use inet6_sk() only after setting newinet->pinet6.

[0]:
WARNING: possible recursive locking detected
syzkaller #0 Not tainted

syz.0.17/5996 is trying to acquire lock:
ffff888031af4c60 (sk_lock-AF_INET6){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1700 [inline]
ffff888031af4c60 (sk_lock-AF_INET6){+.+.}-{0:0}, at: ipv6_sock_mc_close+0xd3/0x140 net/ipv6/mcast.c:348

but task is already holding lock:
ffff888031af4320 (sk_lock-AF_INET6){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1700 [inline]
ffff888031af4320 (sk_lock-AF_INET6){+.+.}-{0:0}, at: sctp_getsockopt+0x135/0xb60 net/sctp/socket.c:8131

other info that might help us debug this:
 Possible unsafe locking scenario:

       CPU0
       ----
  lock(sk_lock-AF_INET6);
  lock(sk_lock-AF_INET6);

 *** DEADLOCK ***

 May be due to missing lock nesting notation

1 lock held by syz.0.17/5996:
 #0: ffff888031af4320 (sk_lock-AF_INET6){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1700 [inline]
 #0: ffff888031af4320 (sk_lock-AF_INET6){+.+.}-{0:0}, at: sctp_getsockopt+0x135/0xb60 net/sctp/socket.c:8131

stack backtrace:
CPU: 0 UID: 0 PID: 5996 Comm: syz.0.17 Not tainted syzkaller #0 PREEMPT(full)
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/25/2025
Call Trace:
 <TASK>
 dump_stack_lvl+0x189/0x250 lib/dump_stack.c:120
 print_deadlock_bug+0x279/0x290 kernel/locking/lockdep.c:3041
 check_deadlock kernel/locking/lockdep.c:3093 [inline]
 validate_chain kernel/locking/lockdep.c:3895 [inline]
 __lock_acquire+0x2540/0x2cf0 kernel/locking/lockdep.c:5237
 lock_acquire+0x117/0x340 kernel/locking/lockdep.c:5868
 lock_sock_nested+0x48/0x100 net/core/sock.c:3780
 lock_sock include/net/sock.h:1700 [inline]
 ipv6_sock_mc_close+0xd3/0x140 net/ipv6/mcast.c:348
 inet6_release+0x47/0x70 net/ipv6/af_inet6.c:482
 __sock_release net/socket.c:653 [inline]
 sock_release+0x85/0x150 net/socket.c:681
 sctp_getsockopt_peeloff_common+0x56b/0x770 net/sctp/socket.c:5732
 sctp_getsockopt_peeloff_flags+0x13b/0x230 net/sctp/socket.c:5801
 sctp_getsockopt+0x3ab/0xb60 net/sctp/socket.c:8151
 do_sock_getsockopt+0x2b4/0x3d0 net/socket.c:2399
 __sys_getsockopt net/socket.c:2428 [inline]
 __do_sys_getsockopt net/socket.c:2435 [inline]
 __se_sys_getsockopt net/socket.c:2432 [inline]
 __x64_sys_getsockopt+0x1a5/0x250 net/socket.c:2432
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0xfa/0xf80 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f8f8c38f749
Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 a8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007ffcfdade018 EFLAGS: 00000246 ORIG_RAX: 0000000000000037
RAX: ffffffffffffffda RBX: 00007f8f8c5e5fa0 RCX: 00007f8f8c38f749
RDX: 000000000000007a RSI: 0000000000000084 RDI: 0000000000000003
RBP: 00007f8f8c413f91 R08: 0000200000000040 R09: 0000000000000000
R10: 0000200000000340 R11: 0000000000000246 R12: 0000000000000000
R13: 00007f8f8c5e5fa0 R14: 00007f8f8c5e5fa0 R15: 0000000000000005
 </TASK>

Fixes: 16942cf4d3 ("sctp: Use sk_clone() in sctp_accept().")
Reported-by: syzbot+c59e6bb54e7620495725@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/netdev/6936d112.a70a0220.38f243.00a7.GAE@google.com/
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20251210081206.1141086-2-kuniyu@google.com
Acked-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-12-18 16:17:59 +01:00
Scott Mayhew
15564bd67e net/handshake: duplicate handshake cancellations leak socket
When a handshake request is cancelled it is removed from the
handshake_net->hn_requests list, but it is still present in the
handshake_rhashtbl until it is destroyed.

If a second cancellation request arrives for the same handshake request,
then remove_pending() will return false... and assuming
HANDSHAKE_F_REQ_COMPLETED isn't set in req->hr_flags, we'll continue
processing through the out_true label, where we put another reference on
the sock and a refcount underflow occurs.

This can happen for example if a handshake times out - particularly if
the SUNRPC client sends the AUTH_TLS probe to the server but doesn't
follow it up with the ClientHello due to a problem with tlshd.  When the
timeout is hit on the server, the server will send a FIN, which triggers
a cancellation request via xs_reset_transport().  When the timeout is
hit on the client, another cancellation request happens via
xs_tls_handshake_sync().

Add a test_and_set_bit(HANDSHAKE_F_REQ_COMPLETED) in the pending cancel
path so duplicate cancels can be detected.

Fixes: 3b3009ea8a ("net/handshake: Create a NETLINK service for handling handshake requests")
Suggested-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Link: https://patch.msgid.link/20251209193015.3032058-1-smayhew@redhat.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-12-18 16:01:36 +01:00
Paolo Abeni
3e82accd3e Merge tag 'nf-25-12-16' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
Florian Westphal says:

====================
netfilter: updates for net

The following patchset contains Netfilter fixes for *net*:

1)  Jozsef Kadlecsik is retiring.  Fortunately Jozsef will still keep an
    eye on ipset patches.

2)  remove a bogus direction check from nat core, this caused spurious
    flakes in the 'reverse clash' selftest, from myself.

3) nf_tables doesn't need to do chain validation on register store,
   from Pablo Neira Ayuso.

4) nf_tables shouldn't revisit chains during ruleset (graph) validation
   if possible.  Both 3 and 4 were slated for -next initially but there
   are now two independent reports of people hitting soft lockup errors
   during ruleset validation, so it makes no sense anymore to route
   this via -next given this is -stable material. From myself.

5) call cond_resched() in a more frequently visited place during nf_tables
   chain validation, this wasn't possible earlier due to rcu read lock,
   but nowadays its not held anymore during set walks.

6) Don't fail conntrack packetdrill test with HZ=100 kernels.

netfilter pull request nf-25-12-16

* tag 'nf-25-12-16' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
  selftests: netfilter: packetdrill: avoid failure on HZ=100 kernel
  netfilter: nf_tables: avoid softlockup warnings in nft_chain_validate
  netfilter: nf_tables: avoid chain re-validation if possible
  netfilter: nf_tables: remove redundant chain validation on register store
  netfilter: nf_nat: remove bogus direction check
  MAINTAINERS: Remove Jozsef Kadlecsik from MAINTAINERS file
====================

Link: https://patch.msgid.link/20251216190904.14507-1-fw@strlen.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-12-18 13:55:01 +01:00
Paolo Abeni
78a47532ab Merge branch 'mlx5-misc-fixes-2025-12-09'
Tariq Toukan says:

====================
mlx5 misc fixes 2025-12-09

This patchset provides misc bug fixes from the team to the mlx5 core and
Eth drivers.
====================

Link: https://patch.msgid.link/1765284977-1363052-1-git-send-email-tariqt@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-12-18 13:39:32 +01:00
Cosmin Ratiu
4198a14c8c net/mlx5e: Don't include PSP in the hard MTU calculations
Commit [1] added the 40 bytes required by the PSP header+trailer and the
UDP header to MLX5E_ETH_HARD_MTU, which limits the device-wide max
software MTU that could be set. This is not okay, because most packets
are not PSP packets and it doesn't make sense to always reserve space
for headers which won't get added in most cases.

As it turns out, for TCP connections, PSP overhead is already taken into
account in the TCP MSS calculations via inet_csk(sk)->icsk_ext_hdr_len.
This was added in commit [2]. This means that the extra space reserved
in the hard MTU for mlx5 ends up unused and wasted.

Remove the unnecessary 40 byte reservation from hard MTU.

[1] commit e5a1861a29 ("net/mlx5e: Implement PSP Tx data path")
[2] commit e97269257f ("net: psp: update the TCP MSS to reflect PSP
packet overhead")

Fixes: e5a1861a29 ("net/mlx5e: Implement PSP Tx data path")
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Shahar Shitrit <shshitrit@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/1765284977-1363052-10-git-send-email-tariqt@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-12-18 13:39:30 +01:00
Tariq Toukan
c8591decd9 net/mlx5e: Do not update BQL of old txqs during channel reconfiguration
During channel reconfiguration (e.g., ethtool private flags changes),
the driver can trigger a kernel BUG_ON in dql_completed() with the error
"kernel BUG at lib/dynamic_queue_limits.c:99".

The issue occurs in the following sequence:

During mlx5e_safe_switch_params(), old channels are deactivated via
mlx5e_deactivate_txqsq(). New channels are created and activated, taking
ownership of the netdev_queues and their BQL state.

When old channels are closed via mlx5e_close_txqsq(), there may be
pending TX descriptors (sq->cc != sq->pc) that were in-flight during the
deactivation.

mlx5e_free_txqsq_descs() frees these pending descriptors and attempts to
complete them via netdev_tx_completed_queue().

However, the BQL state (dql->num_queued and dql->num_completed) have
been reset in mlx5e_activate_txqsq and belong to the new queue owner,
leading to dql->num_queued - dql->num_completed < nbytes.

This triggers BUG_ON(count > num_queued - num_completed) in
dql_completed().

Fixes: 3b88a535a8 ("net/mlx5e: Defer channels closure to reduce interface down time")
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: William Tu <witu@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Link: https://patch.msgid.link/1765284977-1363052-9-git-send-email-tariqt@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-12-18 13:39:29 +01:00
Jianbo Liu
9ab89bde13 net/mlx5e: Trigger neighbor resolution for unresolved destinations
When initializing the MAC addresses for an outbound IPsec packet offload
rule in mlx5e_ipsec_init_macs, the call to dst_neigh_lookup is used to
find the next-hop neighbor (typically the gateway in tunnel mode).
This call might create a new neighbor entry if one doesn't already
exist. This newly created entry starts in the INCOMPLETE state, as the
kernel hasn't yet sent an ARP or NDISC probe to resolve the MAC
address. In this case, neigh_ha_snapshot will correctly return an
all-zero MAC address.

IPsec packet offload requires the actual next-hop MAC address to
program the rule correctly. If the neighbor state is INCOMPLETE when
the rule is created, the hardware rule is programmed with an all-zero
destination MAC address. Packets sent using this rule will be
subsequently dropped by the receiving network infrastructure or host.

This patch adds a check specifically for the outbound offload path. If
neigh_ha_snapshot returns an all-zero MAC address, it proactively
calls neigh_event_send(n, NULL). This ensures the kernel immediately
sends the initial ARP or NDISC probe if one isn't already pending,
accelerating the resolution process. This helps prevent the hardware
rule from being programmed with an invalid MAC address and avoids
packet drops due to unresolved neighbors.

Fixes: 71670f766b ("net/mlx5e: Support routed networks during IPsec MACs initialization")
Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/1765284977-1363052-8-git-send-email-tariqt@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-12-18 13:39:29 +01:00
Jianbo Liu
e35d7da8dd net/mlx5e: Use ip6_dst_lookup instead of ipv6_dst_lookup_flow for MAC init
Replace ipv6_stub->ipv6_dst_lookup_flow() with ip6_dst_lookup() in
mlx5e_ipsec_init_macs() since IPsec transformations are not needed
during Security Association setup - only basic routing information is
required for nexthop MAC address resolution.

This resolves an issue where XfrmOutNoStates error counter would be
incremented when xfrm policy is configured before xfrm state, as the
IPsec-aware routing function would attempt policy checks during SA
initialization.

Fixes: 71670f766b ("net/mlx5e: Support routed networks during IPsec MACs initialization")
Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/1765284977-1363052-7-git-send-email-tariqt@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-12-18 13:39:29 +01:00
Shay Drory
367e501f8b net/mlx5: Serialize firmware reset with devlink
The firmware reset mechanism can be triggered by asynchronous events,
which may race with other devlink operations like devlink reload or
devlink dev eswitch set, potentially leading to inconsistent states.

This patch addresses the race by using the devl_lock to serialize the
firmware reset against other devlink operations. When a reset is
requested, the driver attempts to acquire the lock. If successful, it
sets a flag to block devlink reload or eswitch changes, ACKs the reset
to firmware and then releases the lock. If the lock is already held by
another operation, the driver NACKs the firmware reset request,
indicating that the reset cannot proceed.

Firmware reset does not keep the devl_lock and instead uses an internal
firmware reset bit. This is because firmware resets can be triggered by
asynchronous events, and processed in different threads. It is illegal
and unsafe to acquire a lock in one thread and attempt to release it in
another, as lock ownership is intrinsically thread-specific.

This change ensures that firmware resets and other devlink operations
are mutually exclusive during the critical reset request phase,
preventing race conditions.

Fixes: 38b9f903f2 ("net/mlx5: Handle sync reset request event")
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mateusz Berezecki <mberezecki@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/1765284977-1363052-6-git-send-email-tariqt@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-12-18 13:39:29 +01:00
Shay Drory
c0289f67f7 net/mlx5: fw_tracer, Handle escaped percent properly
The firmware tracer's format string validation and parameter counting
did not properly handle escaped percent signs (%%). This caused
fw_tracer to count more parameters when trace format strings contained
literal percent characters.

To fix it, allow %% to pass string validation and skip %% sequences when
counting parameters since they represent literal percent signs rather
than format specifiers.

Fixes: 70dd6fdb89 ("net/mlx5: FW tracer, parse traces and kernel tracing support")
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reported-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Closes: https://lore.kernel.org/netdev/hanz6rzrb2bqbplryjrakvkbmv4y5jlmtthnvi3thg5slqvelp@t3s3erottr6s/
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/1765284977-1363052-5-git-send-email-tariqt@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-12-18 13:39:29 +01:00
Shay Drory
b35966042d net/mlx5: fw_tracer, Validate format string parameters
Add validation for format string parameters in the firmware tracer to
prevent potential security vulnerabilities and crashes from malformed
format strings received from firmware.

The firmware tracer receives format strings from the device firmware and
uses them to format trace messages. Without proper validation, bad
firmware could provide format strings with invalid format specifiers
(e.g., %s, %p, %n) that could lead to crashes, or other undefined
behavior.

Add mlx5_tracer_validate_params() to validate that all format specifiers
in trace strings are limited to safe integer/hex formats (%x, %d, %i,
%u, %llx, %lx, etc.). Reject strings containing other format types that
could be used to access arbitrary memory or cause crashes.
Invalid format strings are added to the trace output for visibility with
"BAD_FORMAT: " prefix.

Fixes: 70dd6fdb89 ("net/mlx5: FW tracer, parse traces and kernel tracing support")
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Reported-by: Breno Leitao <leitao@debian.org>
Closes: https://lore.kernel.org/netdev/hanz6rzrb2bqbplryjrakvkbmv4y5jlmtthnvi3thg5slqvelp@t3s3erottr6s/
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/1765284977-1363052-4-git-send-email-tariqt@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-12-18 13:39:29 +01:00
Moshe Shemesh
5846a365fc net/mlx5: Drain firmware reset in shutdown callback
Invoke drain_fw_reset() in the shutdown callback to ensure all
firmware reset handling is completed before shutdown proceeds.

Fixes: 16d42d3133 ("net/mlx5: Drain fw_reset when removing device")
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Shay Drori <shayd@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/1765284977-1363052-3-git-send-email-tariqt@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-12-18 13:39:29 +01:00
Moshe Shemesh
89a898d63f net/mlx5: fw reset, clear reset requested on drain_fw_reset
drain_fw_reset() waits for ongoing firmware reset events and blocks new
event handling, but does not clear the reset requested flag, and may
keep sync reset polling.

To fix it, call mlx5_sync_reset_clear_reset_requested() to clear the
flag, stop sync reset polling, and resume health polling, ensuring
health issues are still detected after the firmware reset drain.

Fixes: 16d42d3133 ("net/mlx5: Drain fw_reset when removing device")
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Shay Drori <shayd@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/1765284977-1363052-2-git-send-email-tariqt@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-12-18 13:39:29 +01:00
Paolo Abeni
71e6b15d21 Merge branch 'net-dsa-lantiq-a-bunch-of-fixes'
Daniel Golle says:

====================
net: dsa: lantiq: a bunch of fixes

This series is the continuation and result of comments received for a fix
for the SGMII restart-an bit not actually being self-clearing, which was
reported by by Rasmus Villemoes.

A closer investigation and testing the .remove and the .shutdown paths
of the mxl-gsw1xx.c and lantiq_gswip.c drivers has revealed a couple of
existing problems, which are also addressed in this series.
====================

Link: https://patch.msgid.link/cover.1765241054.git.daniel@makrotopia.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-12-18 12:53:24 +01:00
Daniel Golle
7b103aaf0d net: dsa: mxl-gsw1xx: manually clear RANEG bit
Despite being documented as self-clearing, the RANEG bit sometimes
remains set, preventing auto-negotiation from happening.

Manually clear the RANEG bit after 10ms as advised by MaxLinear.
In order to not hold RTNL during the 10ms of waiting schedule
delayed work to take care of clearing the bit asynchronously, which
is similar to the self-clearing behavior.

Fixes: 22335939ec ("net: dsa: add driver for MaxLinear GSW1xx switch family")
Reported-by: Rasmus Villemoes <ravi@prevas.dk>
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Link: https://patch.msgid.link/76745fceb5a3f53088110fb7a96acf88434088ca.1765241054.git.daniel@makrotopia.org
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-12-18 12:53:21 +01:00
Daniel Golle
651b253b80 net: dsa: mxl-gsw1xx: fix .shutdown driver operation
The .shutdown operation should call dsa_switch_shutdown() just like
it is done also by the sibling lantiq_gswip driver. Not doing that
results in shutdown or reboot hanging and waiting for the CPU port
becoming free, which introduces a longer delay and a WARNING before
shutdown or reboot in case the driver is built-into the kernel.
Fix this by calling dsa_switch_shutdown() in the driver's shutdown
operation, harmonizing it with what is done in the lantiq_gswip
driver. As a side-effect this now allows to remove the previously
exported gswip_disable_switch() function which no longer got any
users.

Fixes: 22335939ec ("net: dsa: add driver for MaxLinear GSW1xx switch family")
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Link: https://patch.msgid.link/77ed91a5206e5dbf5d3e83d7e364ebfda90d31fd.1765241054.git.daniel@makrotopia.org
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-12-18 12:53:21 +01:00
Daniel Golle
8e4c0f08f6 net: dsa: mxl-gsw1xx: fix order in .remove operation
The driver's .remove operation was calling gswip_disable_switch() which
clears the GSWIP_MDIO_GLOB_ENABLE bit before calling
dsa_unregister_switch() and thereby violating a Golden Rule of driver
development to always unpublish userspace interfaces before disabling
hardware, as pointed out by Russell King.

Fix this by relying in GSWIP_MDIO_GLOB_ENABLE being cleared by the
.teardown operation introduced by the previous commit
("net: dsa: lantiq_gswip: fix teardown order").

Fixes: 22335939ec ("net: dsa: add driver for MaxLinear GSW1xx switch family")
Suggested-by: "Russell King (Oracle)" <linux@armlinux.org.uk>
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Link: https://patch.msgid.link/63f882eeb910cf24503c35a443b541cc54a930f2.1765241054.git.daniel@makrotopia.org
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-12-18 12:53:21 +01:00
Daniel Golle
377d66fa86 net: dsa: lantiq_gswip: fix order in .remove operation
Russell King pointed out that disabling the switch by clearing
GSWIP_MDIO_GLOB_ENABLE before calling dsa_unregister_switch() is
problematic, as it violates a Golden Rule of driver development to
always first unpublish userspace interfaces and then disable the
hardware.

Fix this, and also simplify the probe() function, by introducing a
dsa_switch_ops teardown() operation which takes care of clearing the
GSWIP_MDIO_GLOB_ENABLE bit.

Fixes: 14fceff477 ("net: dsa: Add Lantiq / Intel DSA driver for vrx200")
Suggested-by: "Russell King (Oracle)" <linux@armlinux.org.uk>
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Link: https://patch.msgid.link/4ebd72a29edc1e4059b9666a26a0bb5d906a829a.1765241054.git.daniel@makrotopia.org
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-12-18 12:53:21 +01:00
Gal Pressman
7b07be1ff1 ethtool: Avoid overflowing userspace buffer on stats query
The ethtool -S command operates across three ioctl calls:
ETHTOOL_GSSET_INFO for the size, ETHTOOL_GSTRINGS for the names, and
ETHTOOL_GSTATS for the values.

If the number of stats changes between these calls (e.g., due to device
reconfiguration), userspace's buffer allocation will be incorrect,
potentially leading to buffer overflow.

Drivers are generally expected to maintain stable stat counts, but some
drivers (e.g., mlx5, bnx2x, bna, ksz884x) use dynamic counters, making
this scenario possible.

Some drivers try to handle this internally:
- bnad_get_ethtool_stats() returns early in case stats.n_stats is not
  equal to the driver's stats count.
- micrel/ksz884x also makes sure not to write anything beyond
  stats.n_stats and overflow the buffer.

However, both use stats.n_stats which is already assigned with the value
returned from get_sset_count(), hence won't solve the issue described
here.

Change ethtool_get_strings(), ethtool_get_stats(),
ethtool_get_phy_stats() to not return anything in case of a mismatch
between userspace's size and get_sset_size(), to prevent buffer
overflow.
The returned n_stats value will be equal to zero, to reflect that
nothing has been returned.

This could result in one of two cases when using upstream ethtool,
depending on when the size change is detected:
1. When detected in ethtool_get_strings():
    # ethtool -S eth2
    no stats available

2. When detected in get stats, all stats will be reported as zero.

Both cases are presumably transient, and a subsequent ethtool call
should succeed.

Other than the overflow avoidance, these two cases are very evident (no
output/cleared stats), which is arguably better than presenting
incorrect/shifted stats.
I also considered returning an error instead of a "silent" response, but
that seems more destructive towards userspace apps.

Notes:
- This patch does not claim to fix the inherent race, it only makes sure
  that we do not overflow the userspace buffer, and makes for a more
  predictable behavior.

- RTNL lock is held during each ioctl, the race window exists between
  the separate ioctl calls when the lock is released.

- Userspace ethtool always fills stats.n_stats, but it is likely that
  these stats ioctls are implemented in other userspace applications
  which might not fill it. The added code checks that it's not zero,
  to prevent any regressions.

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Gal Pressman <gal@nvidia.com>
Link: https://patch.msgid.link/20251208121901.3203692-1-gal@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-12-18 12:24:25 +01:00
Florian Westphal
fec7b07955 selftests: netfilter: packetdrill: avoid failure on HZ=100 kernel
packetdrill --ip_version=ipv4 --mtu=1500 --tolerance_usecs=1000000 --non_fatal packet conntrack_syn_challenge_ack.pkt
conntrack v1.4.8 (conntrack-tools): 1 flow entries have been shown.
conntrack_syn_challenge_ack.pkt:32: error executing `conntrack -f $NFCT_IP_VERSION \
-L -p tcp --dport 8080 | grep UNREPLIED | grep -q SYN_SENT` command: non-zero status 1

Affected kernel had CONFIG_HZ=100; reset packet was still sitting in
backlog.

Reported-by: Yi Chen <yiche@redhat.com>
Fixes: a8a388c2aa ("selftests: netfilter: add packetdrill based conntrack tests")
Signed-off-by: Florian Westphal <fw@strlen.de>
2025-12-15 15:04:04 +01:00
Florian Westphal
7e7a817f2d netfilter: nf_tables: avoid softlockup warnings in nft_chain_validate
This reverts commit
314c828416 ("netfilter: nf_tables: can't schedule in nft_chain_validate"):
Since commit a60a5abe19 ("netfilter: nf_tables: allow iter callbacks to sleep")
the iterator callback is invoked without rcu read lock held, so this
cond_resched() is now valid.

Signed-off-by: Florian Westphal <fw@strlen.de>
2025-12-15 15:04:04 +01:00
Florian Westphal
8e1a1bc4f5 netfilter: nf_tables: avoid chain re-validation if possible
Hamza Mahfooz reports cpu soft lock-ups in
nft_chain_validate():

 watchdog: BUG: soft lockup - CPU#1 stuck for 27s! [iptables-nft-re:37547]
[..]
 RIP: 0010:nft_chain_validate+0xcb/0x110 [nf_tables]
[..]
  nft_immediate_validate+0x36/0x50 [nf_tables]
  nft_chain_validate+0xc9/0x110 [nf_tables]
  nft_immediate_validate+0x36/0x50 [nf_tables]
  nft_chain_validate+0xc9/0x110 [nf_tables]
  nft_immediate_validate+0x36/0x50 [nf_tables]
  nft_chain_validate+0xc9/0x110 [nf_tables]
  nft_immediate_validate+0x36/0x50 [nf_tables]
  nft_chain_validate+0xc9/0x110 [nf_tables]
  nft_immediate_validate+0x36/0x50 [nf_tables]
  nft_chain_validate+0xc9/0x110 [nf_tables]
  nft_immediate_validate+0x36/0x50 [nf_tables]
  nft_chain_validate+0xc9/0x110 [nf_tables]
  nft_table_validate+0x6b/0xb0 [nf_tables]
  nf_tables_validate+0x8b/0xa0 [nf_tables]
  nf_tables_commit+0x1df/0x1eb0 [nf_tables]
[..]

Currently nf_tables will traverse the entire table (chain graph), starting
from the entry points (base chains), exploring all possible paths
(chain jumps).  But there are cases where we could avoid revalidation.

Consider:
1  input -> j2 -> j3
2  input -> j2 -> j3
3  input -> j1 -> j2 -> j3

Then the second rule does not need to revalidate j2, and, by extension j3,
because this was already checked during validation of the first rule.
We need to validate it only for rule 3.

This is needed because chain loop detection also ensures we do not exceed
the jump stack: Just because we know that j2 is cycle free, its last jump
might now exceed the allowed stack size.  We also need to update all
reachable chains with the new largest observed call depth.

Care has to be taken to revalidate even if the chain depth won't be an
issue: chain validation also ensures that expressions are not called from
invalid base chains.  For example, the masquerade expression can only be
called from NAT postrouting base chains.

Therefore we also need to keep record of the base chain context (type,
hooknum) and revalidate if the chain becomes reachable from a different
hook location.

Reported-by: Hamza Mahfooz <hamzamahfooz@linux.microsoft.com>
Closes: https://lore.kernel.org/netfilter-devel/20251118221735.GA5477@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net/
Tested-by: Hamza Mahfooz <hamzamahfooz@linux.microsoft.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
2025-12-15 15:02:44 +01:00
Pablo Neira Ayuso
a67fd55f6a netfilter: nf_tables: remove redundant chain validation on register store
This validation predates the introduction of the state machine that
determines when to enter slow path validation for error reporting.

Currently, table validation is perform when:

- new rule contains expressions that need validation.
- new set element with jump/goto verdict.

Validation on register store skips most checks with no basechains, still
this walks the graph searching for loops and ensuring expressions are
called from the right hook. Remove this.

Fixes: a654de8fdc ("netfilter: nf_tables: fix chain dependency validation")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
2025-12-11 13:08:43 +01:00
Florian Westphal
5ec8ca26fe netfilter: nf_nat: remove bogus direction check
Jakub reports spurious failures of the 'conntrack_reverse_clash.sh'
selftest.  A bogus test makes nat core resort to port rewrite even
though there is no need for this.

When the test is made, nf_nat_used_tuple() would already have caused us
to return if no other CPU had added a colliding entry.
Moreover, nf_nat_used_tuple() would have ignored the colliding entry if
their origin tuples had been the same.

All that is left to check is if the colliding entry in the hash table
is subject to NAT, and, if its not, if our entry matches in the reverse
direction, e.g. hash table has

addr1:1234 -> addr2:80, and we want to commit
addr2:80   -> addr1:1234.

Because we already checked that neither the new nor the committed entry is
subject to NAT we only have to check origin vs. reply tuple:
for non-nat entries, the reply tuple is always the inverted original.

Just in case there are more problems extend the error reporting
in the selftest while at it and dump conntrack table/stats on error.

Reported-by: Jakub Kicinski <kuba@kernel.org>
Closes: https://lore.kernel.org/netdev/20251206175135.4a56591b@kernel.org/
Fixes: d8f84a9bc7 ("netfilter: nf_nat: don't try nat source port reallocation for reverse dir clash")
Signed-off-by: Florian Westphal <fw@strlen.de>
2025-12-11 13:08:37 +01:00
Jozsef Kadlecsik
99c6931fe1 MAINTAINERS: Remove Jozsef Kadlecsik from MAINTAINERS file
I'm retiring from maintaining netfilter. I'll still keep an
eye on ipset and respond to anything related to it.

Thank you!

Signed-off-by: Jozsef Kadlecsik <kadlec@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
2025-12-11 12:40:46 +01:00
Dan Carpenter
885bebac99 nfc: pn533: Fix error code in pn533_acr122_poweron_rdr()
Set the error code if "transferred != sizeof(cmd)" instead of
returning success.

Fixes: dbafc28955 ("NFC: pn533: don't send USB data off of the stack")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Link: https://patch.msgid.link/aTfIJ9tZPmeUF4W1@stanley.mountain
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-12-11 01:40:00 -08:00
Victor Nogueira
5914428e0e selftests/tc-testing: Create tests to exercise ets classes active list misplacements
Add a test case for a bug fixed by Jamal [1] and for scenario where an
ets drr class is inserted into the active list twice.

- Try to delete ets drr class' qdisc while still keeping it in the
  active list
- Try to add ets class to the active list twice

[1] https://lore.kernel.org/netdev/20251128151919.576920-1-jhs@mojatatu.com/

Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Victor Nogueira <victor@mojatatu.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Link: https://patch.msgid.link/20251208190125.1868423-2-victor@mojatatu.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-12-11 01:36:40 -08:00
Victor Nogueira
b1e125ae42 net/sched: ets: Remove drr class from the active list if it changes to strict
Whenever a user issues an ets qdisc change command, transforming a
drr class into a strict one, the ets code isn't checking whether that
class was in the active list and removing it. This means that, if a
user changes a strict class (which was in the active list) back to a drr
one, that class will be added twice to the active list [1].

Doing so with the following commands:

tc qdisc add dev lo root handle 1: ets bands 2 strict 1
tc qdisc add dev lo parent 1:2 handle 20: \
    tbf rate 8bit burst 100b latency 1s
tc filter add dev lo parent 1: basic classid 1:2
ping -c1 -W0.01 -s 56 127.0.0.1
tc qdisc change dev lo root handle 1: ets bands 2 strict 2
tc qdisc change dev lo root handle 1: ets bands 2 strict 1
ping -c1 -W0.01 -s 56 127.0.0.1

Will trigger the following splat with list debug turned on:

[   59.279014][  T365] ------------[ cut here ]------------
[   59.279452][  T365] list_add double add: new=ffff88801d60e350, prev=ffff88801d60e350, next=ffff88801d60e2c0.
[   59.280153][  T365] WARNING: CPU: 3 PID: 365 at lib/list_debug.c:35 __list_add_valid_or_report+0x17f/0x220
[   59.280860][  T365] Modules linked in:
[   59.281165][  T365] CPU: 3 UID: 0 PID: 365 Comm: tc Not tainted 6.18.0-rc7-00105-g7e9f13163c13-dirty #239 PREEMPT(voluntary)
[   59.281977][  T365] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[   59.282391][  T365] RIP: 0010:__list_add_valid_or_report+0x17f/0x220
[   59.282842][  T365] Code: 89 c6 e8 d4 b7 0d ff 90 0f 0b 90 90 31 c0 e9 31 ff ff ff 90 48 c7 c7 e0 a0 22 9f 48 89 f2 48 89 c1 4c 89 c6 e8 b2 b7 0d ff 90 <0f> 0b 90 90 31 c0 e9 0f ff ff ff 48 89 f7 48 89 44 24 10 4c 89 44
...
[   59.288812][  T365] Call Trace:
[   59.289056][  T365]  <TASK>
[   59.289224][  T365]  ? srso_alias_return_thunk+0x5/0xfbef5
[   59.289546][  T365]  ets_qdisc_change+0xd2b/0x1e80
[   59.289891][  T365]  ? __lock_acquire+0x7e7/0x1be0
[   59.290223][  T365]  ? __pfx_ets_qdisc_change+0x10/0x10
[   59.290546][  T365]  ? srso_alias_return_thunk+0x5/0xfbef5
[   59.290898][  T365]  ? __mutex_trylock_common+0xda/0x240
[   59.291228][  T365]  ? __pfx___mutex_trylock_common+0x10/0x10
[   59.291655][  T365]  ? srso_alias_return_thunk+0x5/0xfbef5
[   59.291993][  T365]  ? srso_alias_return_thunk+0x5/0xfbef5
[   59.292313][  T365]  ? trace_contention_end+0xc8/0x110
[   59.292656][  T365]  ? srso_alias_return_thunk+0x5/0xfbef5
[   59.293022][  T365]  ? srso_alias_return_thunk+0x5/0xfbef5
[   59.293351][  T365]  tc_modify_qdisc+0x63a/0x1cf0

Fix this by always checking and removing an ets class from the active list
when changing it to strict.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/tree/net/sched/sch_ets.c?id=ce052b9402e461a9aded599f5b47e76bc727f7de#n663

Fixes: cd9b50adc6 ("net/sched: ets: fix crash when flipping from 'strict' to 'quantum'")
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Victor Nogueira <victor@mojatatu.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Link: https://patch.msgid.link/20251208190125.1868423-1-victor@mojatatu.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-12-11 01:36:40 -08:00
Junrui Luo
8a11ff0948 caif: fix integer underflow in cffrml_receive()
The cffrml_receive() function extracts a length field from the packet
header and, when FCS is disabled, subtracts 2 from this length without
validating that len >= 2.

If an attacker sends a malicious packet with a length field of 0 or 1
to an interface with FCS disabled, the subtraction causes an integer
underflow.

This can lead to memory exhaustion and kernel instability, potential
information disclosure if padding contains uninitialized kernel memory.

Fix this by validating that len >= 2 before performing the subtraction.

Reported-by: Yuhao Jiang <danisjiang@gmail.com>
Reported-by: Junrui Luo <moonafterrain@outlook.com>
Fixes: b482cd2053 ("net-caif: add CAIF core protocol stack")
Signed-off-by: Junrui Luo <moonafterrain@outlook.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/SYBPR01MB7881511122BAFEA8212A1608AFA6A@SYBPR01MB7881.ausprd01.prod.outlook.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-12-11 01:35:41 -08:00
Marcus Hughes
71cfa7c893 net: sfp: extend Potron XGSPON quirk to cover additional EEPROM variant
Some Potron SFP+ XGSPON ONU sticks are shipped with different EEPROM
vendor ID and vendor name strings, but are otherwise functionally
identical to the existing "Potron SFP+ XGSPON ONU Stick" handled by
sfp_quirk_potron().

These modules, including units distributed under the "Better Internet"
branding, use the same UART pin assignment and require the same
TX_FAULT/LOS behaviour and boot delay. Re-use the existing Potron
quirk for this EEPROM variant.

Signed-off-by: Marcus Hughes <marcus.hughes@betterinternet.ltd>
Link: https://patch.msgid.link/20251207210355.333451-1-marcus.hughes@betterinternet.ltd
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-12-11 01:25:14 -08:00
Jakub Kicinski
95f3013e88 Merge tag 'linux-can-fixes-for-6.19-20251210' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can
Marc Kleine-Budde says:

====================
pull-request: can 2025-12-10

Arnd Bergmann's patch fixes a build dependency with the CAN protocols
and drivers introduced in the current development cycle.

The last patch is by me and fixes the error handling cleanup in the
gs_usb driver.

* tag 'linux-can-fixes-for-6.19-20251210' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can:
  can: gs_usb: gs_can_open(): fix error handling
  can: fix build dependency
====================

Link: https://patch.msgid.link/20251210083448.2116869-1-mkl@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-12-11 01:02:30 -08:00
Jakub Kicinski
898ae76a70 Merge tag 'nf-25-12-10' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
Florian Westphal says:

====================
netfilter: updates for net

1) Fix refcount leaks in nf_conncount, from Fernando Fernandez Mancera.
   This addresses a recent regression that came in the last -next
   pull request.

2) Fix a null dereference in route error handling in IPVS, from Slavin
   Liu.  This is an ancient issue dating back to 5.1 days.

3) Always set ifindex in route tuple in the flowtable output path, from
   Lorenzo Bianconi.  This bug came in with the recent output path refactoring.

4) Prefer 'exit $ksft_xfail' over 'exit $ksft_skip' when we fail to
   trigger a nat race condition to exercise the clash resolution path in
   selftest infra, $ksft_skip should be reserved for missing tooling,
   From myself.

* tag 'nf-25-12-10' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
  selftests: netfilter: prefer xfail in case race wasn't triggered
  netfilter: always set route tuple out ifindex
  ipvs: fix ipv4 null-ptr-deref in route error path
  netfilter: nf_conncount: fix leaked ct in error paths
====================

Link: https://patch.msgid.link/20251210110754.22620-1-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-12-11 00:58:38 -08:00
Jakub Kicinski
237c1e152b Merge branch 'selftests-forwarding-vxlan_bridge_1q_mc_ul-fix-flakiness'
Petr Machata says:

====================
selftests: forwarding: vxlan_bridge_1q_mc_ul: Fix flakiness

The net/forwarding/vxlan_bridge_1q_mc_ul selftest runs an overlay traffic,
forwarded over a multicast-routed VXLAN underlay. In order to determine
whether packets reach their intended destination, it uses a TC match. For
convenience, it uses a flower match, which however does not allow matching
on the encapsulated packet. So various service traffic ends up being
indistinguishable from the test packets, and ends up confusing the test. To
alleviate the problem, the test uses sleep to allow the necessary service
traffic to run and clear the channel, before running the test traffic. This
worked for a while, but lately we have nevertheless seen flakiness of the
test in the CI.

In this patchset, first generalize tc_rule_stats_get() to support u32 in
patch #1, then in patch #2 convert the test to use u32 to allow parsing
deeper into the packet, and in #3 drop the now-unnecessary sleep.
====================

Link: https://patch.msgid.link/cover.1765289566.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-12-11 00:53:19 -08:00
Petr Machata
514520b34b selftests: forwarding: vxlan_bridge_1q_mc_ul: Drop useless sleeping
After fixing traffic matching in the previous patch, the test does not need
to use the sleep anymore. So drop vx_wait() altogether, migrate all callers
of vx{10,20}_create_wait() to the corresponding _create(), and drop the now
unused _create_wait() helpers.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/eabfe4fa12ae788cf3b8c5c876a989de81dfc3d3.1765289566.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-12-11 00:53:16 -08:00
Petr Machata
0c8b9a68b3 selftests: forwarding: vxlan_bridge_1q_mc_ul: Fix flakiness
This test runs an overlay traffic, forwarded over a multicast-routed VXLAN
underlay. In order to determine whether packets reach their intended
destination, it uses a TC match. For convenience, it uses a flower match,
which however does not allow matching on the encapsulated packet. So
various service traffic ends up being indistinguishable from the test
packets, and ends up confusing the test. To alleviate the problem, the test
uses sleep to allow the necessary service traffic to run and clear the
channel, before running the test traffic. This worked for a while, but
lately we have nevertheless seen flakiness of the test in the CI.

Fix the issue by using u32 to match the encapsulated packet as well. The
confusing packets seem to always be IPv6 multicast listener reports.
Realistically they could be ARP or other ICMP6 traffic as well. Therefore
look for ethertype IPv4 in the IPv4 traffic test, and for IPv6 / UDP
combination in the IPv6 traffic test.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/6438cb1613a2a667d3ff64089eb5994778f247af.1765289566.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-12-11 00:53:15 -08:00
Petr Machata
0842e34849 selftests: net: lib: tc_rule_stats_get(): Don't hard-code array index
Flower is commonly used to match on packets in many bash-based selftests.
A dump of a flower filter including statistics looks something like this:

 [
   {
     "protocol": "all",
     "pref": 49152,
     "kind": "flower",
     "chain": 0
   },
   {
     ...
     "options": {
       ...
       "actions": [
         {
	   ...
           "stats": {
             "bytes": 0,
             "packets": 0,
             "drops": 0,
             "overlimits": 0,
             "requeues": 0,
             "backlog": 0,
             "qlen": 0
           }
         }
       ]
     }
   }
 ]

The JQ query in the helper function tc_rule_stats_get() assumes this form
and looks for the second element of the array.

However, a dump of a u32 filter looks like this:

 [
   {
     "protocol": "all",
     "pref": 49151,
     "kind": "u32",
     "chain": 0
   },
   {
     "protocol": "all",
     "pref": 49151,
     "kind": "u32",
     "chain": 0,
     "options": {
       "fh": "800:",
       "ht_divisor": 1
     }
   },
   {
     ...
     "options": {
       ...
       "actions": [
         {
	   ...
           "stats": {
             "bytes": 0,
             "packets": 0,
             "drops": 0,
             "overlimits": 0,
             "requeues": 0,
             "backlog": 0,
             "qlen": 0
           }
         }
       ]
     }
   },
 ]

There's an extra element which the JQ query ends up choosing.

Instead of hard-coding a particular index, look for the entry on which a
selector .options.actions yields anything.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/12982a44471c834511a0ee6c1e8f57e3a5307105.1765289566.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-12-11 00:53:15 -08:00
Florian Westphal
b8a81b0ce5 selftests: netfilter: prefer xfail in case race wasn't triggered
Jakub says: "We try to reserve SKIP for tests skipped because tool is
missing in env, something isn't built into the kernel etc."

use xfail, we can't force the race condition to appear at will
so its expected that the test 'fails' occasionally.

Fixes: 78a5883635 ("selftests: netfilter: add conntrack clash resolution test case")
Reported-by: Jakub Kicinski <kuba@kernel.org>
Closes: https://lore.kernel.org/netdev/20251206175647.5c32f419@kernel.org/
Signed-off-by: Florian Westphal <fw@strlen.de>
2025-12-10 11:55:59 +01:00
Lorenzo Bianconi
2bdc536c9d netfilter: always set route tuple out ifindex
Always set nf_flow_route tuple out ifindex even if the indev is not one
of the flowtable configured devices since otherwise the outdev lookup in
nf_flow_offload_ip_hook() or nf_flow_offload_ipv6_hook() for
FLOW_OFFLOAD_XMIT_NEIGH flowtable entries will fail.
The above issue occurs in the following configuration since IP6IP6
tunnel does not support flowtable acceleration yet:

$ip addr show
5: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 00:11:22:33:22:55 brd ff:ff:ff:ff:ff:ff link-netns ns1
    inet6 2001:db8:1::2/64 scope global nodad
       valid_lft forever preferred_lft forever
    inet6 fe80::211:22ff:fe33:2255/64 scope link tentative proto kernel_ll
       valid_lft forever preferred_lft forever
6: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 00:22:22:33:22:55 brd ff:ff:ff:ff:ff:ff link-netns ns3
    inet6 2001:db8:2::1/64 scope global nodad
       valid_lft forever preferred_lft forever
    inet6 fe80::222:22ff:fe33:2255/64 scope link tentative proto kernel_ll
       valid_lft forever preferred_lft forever
7: tun0@NONE: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 1452 qdisc noqueue state UNKNOWN group default qlen 1000
    link/tunnel6 2001:db8:2::1 peer 2001:db8:2::2 permaddr a85:e732:2c37::
    inet6 2002:db8:1::1/64 scope global nodad
       valid_lft forever preferred_lft forever
    inet6 fe80::885:e7ff:fe32:2c37/64 scope link proto kernel_ll
       valid_lft forever preferred_lft forever

$ip -6 route show
2001:db8:1::/64 dev eth0 proto kernel metric 256 pref medium
2001:db8:2::/64 dev eth1 proto kernel metric 256 pref medium
2002:db8:1::/64 dev tun0 proto kernel metric 256 pref medium
default via 2002:db8:1::2 dev tun0 metric 1024 pref medium

$nft list ruleset
table inet filter {
        flowtable ft {
                hook ingress priority filter
                devices = { eth0, eth1 }
        }

        chain forward {
                type filter hook forward priority filter; policy accept;
                meta l4proto { tcp, udp } flow add @ft
        }
}

Fixes: b5964aac51 ("netfilter: flowtable: consolidate xmit path")
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
2025-12-10 11:55:58 +01:00
Slavin Liu
ad891bb3d0 ipvs: fix ipv4 null-ptr-deref in route error path
The IPv4 code path in __ip_vs_get_out_rt() calls dst_link_failure()
without ensuring skb->dev is set, leading to a NULL pointer dereference
in fib_compute_spec_dst() when ipv4_link_failure() attempts to send
ICMP destination unreachable messages.

The issue emerged after commit ed0de45a10 ("ipv4: recompile ip options
in ipv4_link_failure") started calling __ip_options_compile() from
ipv4_link_failure(). This code path eventually calls fib_compute_spec_dst()
which dereferences skb->dev. An attempt was made to fix the NULL skb->dev
dereference in commit 0113d9c9d1 ("ipv4: fix null-deref in
ipv4_link_failure"), but it only addressed the immediate dev_net(skb->dev)
dereference by using a fallback device. The fix was incomplete because
fib_compute_spec_dst() later in the call chain still accesses skb->dev
directly, which remains NULL when IPVS calls dst_link_failure().

The crash occurs when:
1. IPVS processes a packet in NAT mode with a misconfigured destination
2. Route lookup fails in __ip_vs_get_out_rt() before establishing a route
3. The error path calls dst_link_failure(skb) with skb->dev == NULL
4. ipv4_link_failure() → ipv4_send_dest_unreach() →
   __ip_options_compile() → fib_compute_spec_dst()
5. fib_compute_spec_dst() dereferences NULL skb->dev

Apply the same fix used for IPv6 in commit 326bf17ea5 ("ipvs: fix
ipv6 route unreach panic"): set skb->dev from skb_dst(skb)->dev before
calling dst_link_failure().

KASAN: null-ptr-deref in range [0x0000000000000328-0x000000000000032f]
CPU: 1 PID: 12732 Comm: syz.1.3469 Not tainted 6.6.114 #2
RIP: 0010:__in_dev_get_rcu include/linux/inetdevice.h:233
RIP: 0010:fib_compute_spec_dst+0x17a/0x9f0 net/ipv4/fib_frontend.c:285
Call Trace:
  <TASK>
  spec_dst_fill net/ipv4/ip_options.c:232
  spec_dst_fill net/ipv4/ip_options.c:229
  __ip_options_compile+0x13a1/0x17d0 net/ipv4/ip_options.c:330
  ipv4_send_dest_unreach net/ipv4/route.c:1252
  ipv4_link_failure+0x702/0xb80 net/ipv4/route.c:1265
  dst_link_failure include/net/dst.h:437
  __ip_vs_get_out_rt+0x15fd/0x19e0 net/netfilter/ipvs/ip_vs_xmit.c:412
  ip_vs_nat_xmit+0x1d8/0xc80 net/netfilter/ipvs/ip_vs_xmit.c:764

Fixes: ed0de45a10 ("ipv4: recompile ip options in ipv4_link_failure")
Signed-off-by: Slavin Liu <slavin452@gmail.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Florian Westphal <fw@strlen.de>
2025-12-10 11:55:58 +01:00
Fernando Fernandez Mancera
2e2a720766 netfilter: nf_conncount: fix leaked ct in error paths
There are some situations where ct might be leaked as error paths are
skipping the refcounted check and return immediately. In order to solve
it make sure that the check is always called.

Fixes: be102eb6a0 ("netfilter: nf_conncount: rework API to use sk_buff directly")
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
Signed-off-by: Florian Westphal <fw@strlen.de>
2025-12-10 11:55:58 +01:00
Jakub Kicinski
6bcb7727d9 Merge branch 'inet-frags-flush-pending-skbs-in-fqdir_pre_exit'
Jakub Kicinski says:

====================
inet: frags: flush pending skbs in fqdir_pre_exit()

Fix the issue reported by NIPA starting on Sep 18th [1], where
pernet_ops_rwsem is constantly held by a reader, preventing writers
from grabbing it (specifically driver modules from loading).

The fact that reports started around that time seems coincidental.
The issue seems to be skbs queued for defrag preventing conntrack
from exiting.

First patch fixes another theoretical issue, it's mostly a leftover
from an attempt to get rid of the inet_frag_queue refcnt, which
I gave up on (still think it's doable but a bit of a time sink).
Second patch is a minor refactor.

The real fix is in the third patch. It's the simplest fix I can
think of which is to flush the frag queues. Perhaps someone has
a better suggestion?

Last patch adds an explicit warning for conntrack getting stuck,
as this seems like something that can easily happen if bugs sneak in.
The warning will hopefully save us the first 20% of the investigation
effort.

Link: https://lore.kernel.org/20251001082036.0fc51440@kernel.org # [1]
====================

Link: https://patch.msgid.link/20251207010942.1672972-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-12-10 01:15:33 -08:00
Jakub Kicinski
92df4c56cf netfilter: conntrack: warn when cleanup is stuck
nf_conntrack_cleanup_net_list() calls schedule() so it does not
show up as a hung task. Add an explicit check to make debugging
leaked skbs/conntack references more obvious.

Acked-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20251207010942.1672972-5-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-12-10 01:15:27 -08:00
Jakub Kicinski
006a5035b4 inet: frags: flush pending skbs in fqdir_pre_exit()
We have been seeing occasional deadlocks on pernet_ops_rwsem since
September in NIPA. The stuck task was usually modprobe (often loading
a driver like ipvlan), trying to take the lock as a Writer.
lockdep does not track readers for rwsems so the read wasn't obvious
from the reports.

On closer inspection the Reader holding the lock was conntrack looping
forever in nf_conntrack_cleanup_net_list(). Based on past experience
with occasional NIPA crashes I looked thru the tests which run before
the crash and noticed that the crash follows ip_defrag.sh. An immediate
red flag. Scouring thru (de)fragmentation queues reveals skbs sitting
around, holding conntrack references.

The problem is that since conntrack depends on nf_defrag_ipv6,
nf_defrag_ipv6 will load first. Since nf_defrag_ipv6 loads first its
netns exit hooks run _after_ conntrack's netns exit hook.

Flush all fragment queue SKBs during fqdir_pre_exit() to release
conntrack references before conntrack cleanup runs. Also flush
the queues in timer expiry handlers when they discover fqdir->dead
is set, in case packet sneaks in while we're running the pre_exit
flush.

The commit under Fixes is not exactly the culprit, but I think
previously the timer firing would eventually unblock the spinning
conntrack.

Fixes: d5dd88794a ("inet: fix various use-after-free in defrags units")
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20251207010942.1672972-4-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-12-10 01:15:27 -08:00
Jakub Kicinski
1231eec699 inet: frags: add inet_frag_queue_flush()
Instead of exporting inet_frag_rbtree_purge() which requires that
caller takes care of memory accounting, add a new helper. We will
need to call it from a few places in the next patch.

Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20251207010942.1672972-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-12-10 01:15:27 -08:00
Jakub Kicinski
8ef522c8a5 inet: frags: avoid theoretical race in ip_frag_reinit()
In ip_frag_reinit() we want to move the frag timeout timer into
the future. If the timer fires in the meantime we inadvertently
scheduled it again, and since the timer assumes a ref on frag_queue
we need to acquire one to balance things out.

This is technically racy, we should have acquired the reference
_before_ we touch the timer, it may fire again before we take the ref.
Avoid this entire dance by using mod_timer_pending() which only modifies
the timer if its pending (and which exists since Linux v2.6.30)

Note that this was the only place we ever took a ref on frag_queue
since Eric's conversion to RCU. So we could potentially replace
the whole refcnt field with an atomic flag and a bit more RCU.

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20251207010942.1672972-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-12-10 01:15:27 -08:00
Jakub Kicinski
2f6e056e95 Merge branch 'selftests-fix-build-warnings-and-errors' (part)
Guenter Roeck says:

====================
selftests: Fix build warnings and errors

This series fixes build warnings and errors observed when trying to build
selftests.
====================

Link: https://patch.msgid.link/20251205171010.515236-1-linux@roeck-us.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-12-10 01:11:12 -08:00
Guenter Roeck
91dc09a609 selftests: net: tfo: Fix build warning
Fix

tfo.c: In function ‘run_server’:
tfo.c:84:9: warning: ignoring return value of ‘read’ declared with attribute ‘warn_unused_result’

by evaluating the return value from read() and displaying an error message
if it reports an error.

Fixes: c65b5bb232 ("selftests: net: add passive TFO test binary")
Cc: David Wei <dw@davidwei.uk>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Link: https://patch.msgid.link/20251205171010.515236-14-linux@roeck-us.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-12-10 01:11:12 -08:00
Guenter Roeck
59546e8744 selftests: net: Fix build warnings
Fix

ksft.h: In function ‘ksft_ready’:
ksft.h:27:9: warning: ignoring return value of ‘write’ declared with attribute ‘warn_unused_result’

ksft.h: In function ‘ksft_wait’:
ksft.h:51:9: warning: ignoring return value of ‘read’ declared with attribute ‘warn_unused_result’

by checking the return value of the affected functions and displaying
an error message if an error is seen.

Fixes: 2b6d490b82 ("selftests: drv-net: Factor out ksft C helpers")
Cc: Joe Damato <jdamato@fastly.com>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Link: https://patch.msgid.link/20251205171010.515236-11-linux@roeck-us.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-12-10 01:11:12 -08:00