Commit Graph

82400 Commits

Author SHA1 Message Date
Kuniyuki Iwashima
06d6322280 neighbour: Use RCU list helpers for neigh_parms.list writers.
We will convert RTM_GETNEIGHTBL to RCU soon, where we traverse
tbl->parms_list under RCU in neightbl_dump_info().

Let's use RCU list helper for neigh_parms in neigh_parms_alloc()
and neigh_parms_release().

neigh_table_init() uses the plain list_add() for the default
neigh_parm that is embedded in the table and not yet published.

Note that neigh_parms_release() already uses call_rcu() to free
neigh_parms.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20251022054004.2514876-2-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-10-24 17:57:20 -07:00
Johan Hovold
91d35ec9b3 Bluetooth: rfcomm: fix modem control handling
The RFCOMM driver confuses the local and remote modem control signals,
which specifically means that the reported DTR and RTS state will
instead reflect the remote end (i.e. DSR and CTS).

This issue dates back to the original driver (and a follow-on update)
merged in 2002, which resulted in a non-standard implementation of
TIOCMSET that allowed controlling also the TS07.10 IC and DV signals by
mapping them to the RI and DCD input flags, while TIOCMGET failed to
return the actual state of DTR and RTS.

Note that the bogus control of input signals in tiocmset() is just
dead code as those flags will have been masked out by the tty layer
since 2003.

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Cc: stable@vger.kernel.org
Signed-off-by: Johan Hovold <johan@kernel.org>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-10-24 10:32:19 -04:00
Luiz Augusto von Dentz
751463ceef Bluetooth: hci_core: Fix tracking of periodic advertisement
Periodic advertising enabled flag cannot be tracked by the enabled
flag since advertising and periodic advertising each can be
enabled/disabled separately from one another causing the states to be
inconsistent when for example an advertising set is disabled its
enabled flag is set to false which is then used for periodic which has
not being disabled.

Fixes: eca0ae4aea ("Bluetooth: Add initial implementation of BIS connections")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-10-24 10:31:59 -04:00
Luiz Augusto von Dentz
857eb0fabc Bluetooth: hci_conn: Fix connection cleanup with BIG with 2 or more BIS
This fixes bis_cleanup not considering connections in BT_OPEN state
before attempting to remove the BIG causing the following error:

btproxy[20110]: < HCI Command: LE Terminate Broadcast Isochronous Group (0x08|0x006a) plen 2
        BIG Handle: 0x01
        Reason: Connection Terminated By Local Host (0x16)
> HCI Event: Command Status (0x0f) plen 4
      LE Terminate Broadcast Isochronous Group (0x08|0x006a) ncmd 1
        Status: Unknown Advertising Identifier (0x42)

Fixes: fa224d0c09 ("Bluetooth: ISO: Reassociate a socket with an active BIS")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
2025-10-24 10:31:41 -04:00
Luiz Augusto von Dentz
c403da5e98 Bluetooth: ISO: Fix another instance of dst_type handling
Socket dst_type cannot be directly assigned to hci_conn->type since
there domain is different which may lead to the wrong address type being
used.

Fixes: 6a5ad251b7 ("Bluetooth: ISO: Fix possible circular locking dependency")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-10-24 10:30:50 -04:00
Frédéric Danis
76e20da0bd Revert "Bluetooth: L2CAP: convert timeouts to secs_to_jiffies()"
This reverts commit c9d84da18d. It
replaces in L2CAP calls to msecs_to_jiffies() to secs_to_jiffies()
and updates the constants accordingly. But the constants are also
used in LCAP Configure Request and L2CAP Configure Response which
expect values in milliseconds.
This may prevent correct usage of L2CAP channel.

To fix it, keep those constants in milliseconds and so revert this
change.

Fixes: c9d84da18d ("Bluetooth: L2CAP: convert timeouts to secs_to_jiffies()")
Signed-off-by: Frédéric Danis <frederic.danis@collabora.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-10-24 10:30:30 -04:00
Pauli Virtanen
e8785404de Bluetooth: MGMT: fix crash in set_mesh_sync and set_mesh_complete
There is a BUG: KASAN: stack-out-of-bounds in set_mesh_sync due to
memcpy from badly declared on-stack flexible array.

Another crash is in set_mesh_complete() due to double list_del via
mgmt_pending_valid + mgmt_pending_remove.

Use DEFINE_FLEX to declare the flexible array right, and don't memcpy
outside bounds.

As mgmt_pending_valid removes the cmd from list, use mgmt_pending_free,
and also report status on error.

Fixes: 302a1f674c ("Bluetooth: MGMT: Fix possible UAFs")
Signed-off-by: Pauli Virtanen <pav@iki.fi>
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-10-24 10:21:37 -04:00
Luiz Augusto von Dentz
0d92808024 Bluetooth: HCI: Fix tracking of advertisement set/instance 0x00
This fixes the state tracking of advertisement set/instance 0x00 which
is considered a legacy instance and is not tracked individually by
adv_instances list, previously it was assumed that hci_dev itself would
track it via HCI_LE_ADV but that is a global state not specifc to
instance 0x00, so to fix it a new flag is introduced that only tracks the
state of instance 0x00.

Fixes: 1488af7b8b ("Bluetooth: hci_sync: Fix hci_resume_advertising_sync")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-10-24 10:21:07 -04:00
Luiz Augusto von Dentz
f0c200a4a5 Bluetooth: ISO: Fix BIS connection dst_type handling
Socket dst_type cannot be directly assigned to hci_conn->type since
there domain is different which may lead to the wrong address type being
used.

Fixes: 6a5ad251b7 ("Bluetooth: ISO: Fix possible circular locking dependency")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-10-24 10:20:34 -04:00
Cen Zhang
09b0cd1297 Bluetooth: hci_sync: fix race in hci_cmd_sync_dequeue_once
hci_cmd_sync_dequeue_once() does lookup and then cancel
the entry under two separate lock sections. Meanwhile,
hci_cmd_sync_work() can also delete the same entry,
leading to double list_del() and "UAF".

Fix this by holding cmd_sync_work_lock across both
lookup and cancel, so that the entry cannot be removed
concurrently.

Fixes: 505ea2b295 ("Bluetooth: hci_sync: Add helper functions to manipulate cmd_sync queue")
Reported-by: Cen Zhang <zzzccc427@163.com>
Signed-off-by: Cen Zhang <zzzccc427@163.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-10-24 10:20:15 -04:00
Eric Biggers
05774d7e42 tcp: Remove unnecessary null check in tcp_inbound_md5_hash()
The 'if (!key && hash_location)' check in tcp_inbound_md5_hash() implies
that hash_location might be null.  However, later code in the function
dereferences hash_location anyway, without checking for null first.
Fortunately, there is no real bug, since tcp_inbound_md5_hash() is
called only with non-null values of hash_location.

Therefore, remove the unnecessary and misleading null check of
hash_location.  This silences a Smatch static checker warning
(https://lore.kernel.org/netdev/aPi4b6aWBbBR52P1@stanley.mountain/)

Also fix the related comment at the beginning of the function.

Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Dmitry Safonov <0x7f454c46@gmail.com>
Link: https://patch.msgid.link/20251022221209.19716-1-ebiggers@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-10-23 18:49:55 -07:00
Sunday Adelodun
ec538867a3 net: unix: remove outdated BSD behavior comment in unix_release_sock()
Remove the long-standing comment in unix_release_sock() that described a
behavioral difference between Linux and BSD regarding when ECONNRESET is
sent to connected UNIX sockets upon closure.

As confirmed by testing on macOS (similar to BSD behavior), ECONNRESET
is only observed for SOCK_DGRAM sockets, not for SOCK_STREAM. Meanwhile,
Linux already returns ECONNRESET in cases where a socket is closed with
unread data or is not yet accept()ed. This means the previous comment no
longer accurately describes current behavior and is misleading.

Suggested-by: Kuniyuki Iwashima <kuniyu@google.com>
Signed-off-by: Sunday Adelodun <adelodunolaoluwa@yahoo.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20251021195906.20389-1-adelodunolaoluwa@yahoo.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-10-23 17:26:16 -07:00
Jakub Kicinski
a83155cc4e Merge tag 'wireless-2025-10-23' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless
Johannes Berg says:

====================
First set of fixes:
 - brcmfmac: long-standing crash when used w/o P2P
 - iwlwifi: fix for a use-after-free bug
 - mac80211: key tailroom accounting bug could leave
             allocation overhead and cause a warning
 - ath11k: add a missing platform,
           fix key flag operations
 - bcma: skip devices disabled in OF/DT
 - various (potential) memory leaks

* tag 'wireless-2025-10-23' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless:
  wifi: nl80211: call kfree without a NULL check
  wifi: mac80211: fix key tailroom accounting leak
  wifi: brcmfmac: fix crash while sending Action Frames in standalone AP Mode
  MAINTAINERS: wcn36xx: Add linux-wireless list
  bcma: don't register devices disabled in OF
  wifi: mac80211: reset FILS discovery and unsol probe resp intervals
  wifi: iwlwifi: fix potential use after free in iwl_mld_remove_link()
  wifi: ath11k: avoid bit operation on key flags
  wifi: ath12k: free skb during idr cleanup callback
  wifi: ath11k: Add missing platform IDs for quirk table
  wifi: ath10k: Fix memory leak on unsupported WMI command
====================

Link: https://patch.msgid.link/20251023180604.626946-3-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-10-23 17:15:48 -07:00
Jakub Kicinski
2b7553db91 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.18-rc3).

No conflicts or adjacent changes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-10-23 10:53:08 -07:00
Linus Torvalds
ab431bc397 Merge tag 'net-6.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
 "Including fixes from can. Slim pickings, I'm guessing people haven't
  really started testing.

  Current release - new code bugs:

   - eth: mlx5e:
       - psp: avoid 'accel' NULL pointer dereference
       - skip PPHCR register query for FEC histogram if not supported

  Previous releases - regressions:

   - bonding: update the slave array for broadcast mode

   - rtnetlink: re-allow deleting FDB entries in user namespace

   - eth: dpaa2: fix the pointer passed to PTR_ALIGN on Tx path

  Previous releases - always broken:

   - can: drop skb on xmit if device is in listen-only mode

   - gro: clear skb_shinfo(skb)->hwtstamps in napi_reuse_skb()

   - eth: mlx5e
       - RX, fix generating skb from non-linear xdp_buff if program
         trims frags
       - make devcom init failures non-fatal, fix races with IPSec

  Misc:

   - some documentation formatting 'fixes'"

* tag 'net-6.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (47 commits)
  net/mlx5: Fix IPsec cleanup over MPV device
  net/mlx5: Refactor devcom to return NULL on failure
  net/mlx5e: Skip PPHCR register query if not supported by the device
  net/mlx5: Add PPHCR to PCAM supported registers mask
  virtio-net: zero unused hash fields
  net: phy: micrel: always set shared->phydev for LAN8814
  vsock: fix lock inversion in vsock_assign_transport()
  ovpn: use datagram_poll_queue for socket readiness in TCP
  espintcp: use datagram_poll_queue for socket readiness
  net: datagram: introduce datagram_poll_queue for custom receive queues
  net: bonding: fix possible peer notify event loss or dup issue
  net: hsr: prevent creation of HSR device with slaves from another netns
  sctp: avoid NULL dereference when chunk data buffer is missing
  ptp: ocp: Fix typo using index 1 instead of i in SMA initialization loop
  net: ravb: Ensure memory write completes before ringing TX doorbell
  net: ravb: Enforce descriptor type ordering
  net: hibmcge: select FIXED_PHY
  net: dlink: use dev_kfree_skb_any instead of dev_kfree_skb
  Documentation: networking: ax25: update the mailing list info.
  net: gro_cells: fix lock imbalance in gro_cells_receive()
  ...
2025-10-23 07:03:18 -10:00
Stefano Garzarella
f7c877e753 vsock: fix lock inversion in vsock_assign_transport()
Syzbot reported a potential lock inversion deadlock between
vsock_register_mutex and sk_lock-AF_VSOCK when vsock_linger() is called.

The issue was introduced by commit 687aa0c558 ("vsock: Fix
transport_* TOCTOU") which added vsock_register_mutex locking in
vsock_assign_transport() around the transport->release() call, that can
call vsock_linger(). vsock_assign_transport() can be called with sk_lock
held. vsock_linger() calls sk_wait_event() that temporarily releases and
re-acquires sk_lock. During this window, if another thread hold
vsock_register_mutex while trying to acquire sk_lock, a circular
dependency is created.

Fix this by releasing vsock_register_mutex before calling
transport->release() and vsock_deassign_transport(). This is safe
because we don't need to hold vsock_register_mutex while releasing the
old transport, and we ensure the new transport won't disappear by
obtaining a module reference first via try_module_get().

Reported-by: syzbot+10e35716f8e4929681fa@syzkaller.appspotmail.com
Tested-by: syzbot+10e35716f8e4929681fa@syzkaller.appspotmail.com
Fixes: 687aa0c558 ("vsock: Fix transport_* TOCTOU")
Cc: mhal@rbox.co
Cc: stable@vger.kernel.org
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Link: https://patch.msgid.link/20251021121718.137668-1-sgarzare@redhat.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-10-23 16:07:58 +02:00
Ralf Lici
0fc3e32c2c espintcp: use datagram_poll_queue for socket readiness
espintcp uses a custom queue (ike_queue) to deliver packets to
userspace. The polling logic relies on datagram_poll, which checks
sk_receive_queue, which can lead to false readiness signals when that
queue contains non-userspace packets.

Switch espintcp_poll to use datagram_poll_queue with ike_queue, ensuring
poll only signals readiness when userspace data is actually available.

Fixes: e27cca96cd ("xfrm: add espintcp (RFC 8229)")
Signed-off-by: Ralf Lici <ralf@mandelbit.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://patch.msgid.link/20251021100942.195010-3-ralf@mandelbit.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-10-23 15:46:04 +02:00
Ralf Lici
f6ceec6434 net: datagram: introduce datagram_poll_queue for custom receive queues
Some protocols using TCP encapsulation (e.g., espintcp, openvpn) deliver
userspace-bound packets through a custom skb queue rather than the
standard sk_receive_queue.

Introduce datagram_poll_queue that accepts an explicit receive queue,
and convert datagram_poll into a wrapper around datagram_poll_queue.
This allows protocols with custom skb queues to reuse the core polling
logic without relying on sk_receive_queue.

Cc: Sabrina Dubroca <sd@queasysnail.net>
Cc: Antonio Quartulli <antonio@openvpn.net>
Signed-off-by: Ralf Lici <ralf@mandelbit.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Antonio Quartulli <antonio@openvpn.net>
Link: https://patch.msgid.link/20251021100942.195010-2-ralf@mandelbit.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-10-23 15:46:04 +02:00
Fernando Fernandez Mancera
c0178eec88 net: hsr: prevent creation of HSR device with slaves from another netns
HSR/PRP driver does not handle correctly having slaves/interlink devices
in a different net namespace. Currently, it is possible to create a HSR
link in a different net namespace than the slaves/interlink with the
following command:

 ip link add hsr0 netns hsr-ns type hsr slave1 eth1 slave2 eth2

As there is no use-case on supporting this scenario, enforce that HSR
device link matches netns defined by IFLA_LINK_NETNSID.

The iproute2 command mentioned above will throw the following error:

 Error: hsr: HSR slaves/interlink must be on the same net namespace than HSR link.

Fixes: f421436a59 ("net/hsr: Add support for the High-availability Seamless Redundancy protocol (HSRv0)")
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
Link: https://patch.msgid.link/20251020135533.9373-1-fmancera@suse.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-10-22 19:22:22 -07:00
Alexey Simakov
441f0647f7 sctp: avoid NULL dereference when chunk data buffer is missing
chunk->skb pointer is dereferenced in the if-block where it's supposed
to be NULL only.

chunk->skb can only be NULL if chunk->head_skb is not. Check for frag_list
instead and do it just before replacing chunk->skb. We're sure that
otherwise chunk->skb is non-NULL because of outer if() condition.

Fixes: 90017accff ("sctp: Add GSO support")
Signed-off-by: Alexey Simakov <bigalex934@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Link: https://patch.msgid.link/20251021130034.6333-1-bigalex934@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-10-22 19:19:31 -07:00
David Yang
ca4709843b net: dsa: tag_yt921x: add support for Motorcomm YT921x tags
Add support for Motorcomm YT921x tags, which includes a proper
configurable ethertype field (default to 0x9988).

Signed-off-by: David Yang <mmyangfl@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20251017060859.326450-3-mmyangfl@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-10-21 18:25:30 -07:00
Hangbin Liu
0152747a52 net: bridge: use common function to compute the features
Previously, bridge ignored all features propagation and DST retention,
only handling explicitly the GSO limits.

By switching to the new helper netdev_compute_master_upper_features(), the bridge
now expose additional features, depending on the lowers capabilities.

Since br_set_gso_limits() is already covered by the helper, it can be
removed safely.

Bridge has it's own way to update needed_headroom. So we don't need to
update it in the helper.

Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://patch.msgid.link/20251017034155.61990-5-liuhangbin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-10-21 18:08:23 -07:00
Hangbin Liu
28098defc7 net: add a common function to compute features for upper devices
Some high level software drivers need to compute features from lower
devices. But each has their own implementations and may lost some
feature compute. Let's use one common function to compute features
for kinds of these devices.

The new helper uses the current bond implementation as the reference
one, as the latter already handles all the relevant aspects: netdev
features, TSO limits and dst retention.

Suggested-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://patch.msgid.link/20251017034155.61990-2-liuhangbin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-10-21 18:08:23 -07:00
Eric Dumazet
c5394b8b7a net: gro_cells: fix lock imbalance in gro_cells_receive()
syzbot found that the local_unlock_nested_bh() call was
missing in some cases.

WARNING: possible recursive locking detected
syzkaller #0 Not tainted
--------------------------------------------
syz.2.329/7421 is trying to acquire lock:
 ffffe8ffffd48888 ((&cell->bh_lock)){+...}-{3:3}, at: spin_lock include/linux/spinlock_rt.h:44 [inline]
 ffffe8ffffd48888 ((&cell->bh_lock)){+...}-{3:3}, at: gro_cells_receive+0x404/0x790 net/core/gro_cells.c:30

but task is already holding lock:
 ffffe8ffffd48888 ((&cell->bh_lock)){+...}-{3:3}, at: spin_lock include/linux/spinlock_rt.h:44 [inline]
 ffffe8ffffd48888 ((&cell->bh_lock)){+...}-{3:3}, at: gro_cells_receive+0x404/0x790 net/core/gro_cells.c:30

other info that might help us debug this:
 Possible unsafe locking scenario:

       CPU0
       ----
  lock((&cell->bh_lock));
  lock((&cell->bh_lock));

 *** DEADLOCK ***

Given the introduction of @have_bh_lock variable, it seems the author
intent was to have the local_unlock_nested_bh() after the @unlock label.

Fixes: 25718fdcbd ("net: gro_cells: Use nested-BH locking for gro_cell")
Reported-by: syzbot+f9651b9a8212e1c8906f@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/netdev/68f65eb9.a70a0220.205af.0034.GAE@google.com/T/#u
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20251020161114.1891141-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-10-21 17:41:09 -07:00
Alok Tiwari
0364ca3309 devlink: region: correct port region lookup to use port_ops
The function devlink_port_region_get_by_name() incorrectly uses
region->ops->name to compare the region name. as it is not any critical
impact as ops and port_ops define as union for devlink_region but as per
code logic it should refer port_ops here.

No functional impact as ops and port_ops are part of same union,
and name is the first member of both.

Update it to use region->port_ops->name to properly reference
the name of the devlink port region.

Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://patch.msgid.link/20251020170916.1741808-1-alok.a.tiwari@oracle.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-10-21 17:40:26 -07:00
Matthieu Baerts (NGI0)
e84cb860ac mptcp: pm: in-kernel: C-flag: handle late ADD_ADDR
The special C-flag case expects the ADD_ADDR to be received when
switching to 'fully-established'. But for various reasons, the ADD_ADDR
could be sent after the "4th ACK", and the special case doesn't work.

On NIPA, the new test validating this special case for the C-flag failed
a few times, e.g.

  102 default limits, server deny join id 0
        syn rx                 [FAIL] got 0 JOIN[s] syn rx expected 2

  Server ns stats
  (...)
  MPTcpExtAddAddrTx  1
  MPTcpExtEchoAdd    1

  Client ns stats
  (...)
  MPTcpExtAddAddr    1
  MPTcpExtEchoAddTx  1

        synack rx              [FAIL] got 0 JOIN[s] synack rx expected 2
        ack rx                 [FAIL] got 0 JOIN[s] ack rx expected 2
        join Rx                [FAIL] see above
        syn tx                 [FAIL] got 0 JOIN[s] syn tx expected 2
        join Tx                [FAIL] see above

I had a suspicion about what the issue could be: the ADD_ADDR might have
been received after the switch to the 'fully-established' state. The
issue was not easy to reproduce. The packet capture shown that the
ADD_ADDR can indeed be sent with a delay, and the client would not try
to establish subflows to it as expected.

A simple fix is not to mark the endpoints as 'used' in the C-flag case,
when looking at creating subflows to the remote initial IP address and
port. In this case, there is no need to try.

Note: newly added fullmesh endpoints will still continue to be used as
expected, thanks to the conditions behind mptcp_pm_add_addr_c_flag_case.

Fixes: 4b1ff850e0 ("mptcp: pm: in-kernel: usable client side with C-flag")
Cc: stable@vger.kernel.org
Reviewed-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20251020-net-mptcp-c-flag-late-add-addr-v1-1-8207030cb0e8@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-10-21 17:36:45 -07:00
Eric Dumazet
3ff9bcecce net: avoid extra access to sk->sk_wmem_alloc in sock_wfree()
UDP TX packets destructor is sock_wfree().

It suffers from a cache line bouncing in sock_def_write_space_wfree().

Instead of reading sk->sk_wmem_alloc after we just did an atomic RMW
on it, use __refcount_sub_and_test() to get the old value for free,
and pass the new value to sock_def_write_space_wfree().

Add __sock_writeable() helper.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20251017133712.2842665-1-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-10-21 15:56:21 +02:00
Aswin Karuvally
38516e3fa4 s390/iucv: Convert sprintf/snprintf to scnprintf
Convert sprintf/snprintf calls to scnprintf to better align with the
kernel development community practices [1].

Link: https://lwn.net/Articles/69419 [1]
Reviewed-by: Alexandra Winter <wintera@linux.ibm.com>
Signed-off-by: Aswin Karuvally <aswin@linux.ibm.com>
Signed-off-by: Alexandra Winter <wintera@linux.ibm.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20251017094954.1402684-1-wintera@linux.ibm.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-10-20 18:35:21 -07:00
Wang Liang
f584239a9e net/smc: fix general protection fault in __smc_diag_dump
The syzbot report a crash:

  Oops: general protection fault, probably for non-canonical address 0xfbd5a5d5a0000003: 0000 [#1] SMP KASAN NOPTI
  KASAN: maybe wild-memory-access in range [0xdead4ead00000018-0xdead4ead0000001f]
  CPU: 1 UID: 0 PID: 6949 Comm: syz.0.335 Not tainted syzkaller #0 PREEMPT(full)
  Hardware name: Google Compute Engine/Google Compute Engine, BIOS Google 08/18/2025
  RIP: 0010:smc_diag_msg_common_fill net/smc/smc_diag.c:44 [inline]
  RIP: 0010:__smc_diag_dump.constprop.0+0x3ca/0x2550 net/smc/smc_diag.c:89
  Call Trace:
   <TASK>
   smc_diag_dump_proto+0x26d/0x420 net/smc/smc_diag.c:217
   smc_diag_dump+0x27/0x90 net/smc/smc_diag.c:234
   netlink_dump+0x539/0xd30 net/netlink/af_netlink.c:2327
   __netlink_dump_start+0x6d6/0x990 net/netlink/af_netlink.c:2442
   netlink_dump_start include/linux/netlink.h:341 [inline]
   smc_diag_handler_dump+0x1f9/0x240 net/smc/smc_diag.c:251
   __sock_diag_cmd net/core/sock_diag.c:249 [inline]
   sock_diag_rcv_msg+0x438/0x790 net/core/sock_diag.c:285
   netlink_rcv_skb+0x158/0x420 net/netlink/af_netlink.c:2552
   netlink_unicast_kernel net/netlink/af_netlink.c:1320 [inline]
   netlink_unicast+0x5a7/0x870 net/netlink/af_netlink.c:1346
   netlink_sendmsg+0x8d1/0xdd0 net/netlink/af_netlink.c:1896
   sock_sendmsg_nosec net/socket.c:714 [inline]
   __sock_sendmsg net/socket.c:729 [inline]
   ____sys_sendmsg+0xa95/0xc70 net/socket.c:2614
   ___sys_sendmsg+0x134/0x1d0 net/socket.c:2668
   __sys_sendmsg+0x16d/0x220 net/socket.c:2700
   do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
   do_syscall_64+0xcd/0x4e0 arch/x86/entry/syscall_64.c:94
   entry_SYSCALL_64_after_hwframe+0x77/0x7f
   </TASK>

The process like this:

               (CPU1)              |             (CPU2)
  ---------------------------------|-------------------------------
  inet_create()                    |
    // init clcsock to NULL        |
    sk = sk_alloc()                |
                                   |
    // unexpectedly change clcsock |
    inet_init_csk_locks()          |
                                   |
    // add sk to hash table        |
    smc_inet_init_sock()           |
      smc_sk_init()                |
        smc_hash_sk()              |
                                   | // traverse the hash table
                                   | smc_diag_dump_proto
                                   |   __smc_diag_dump()
                                   |     // visit wrong clcsock
                                   |     smc_diag_msg_common_fill()
    // alloc clcsock               |
    smc_create_clcsk               |
      sock_create_kern             |

With CONFIG_DEBUG_LOCK_ALLOC=y, the smc->clcsock is unexpectedly changed
in inet_init_csk_locks(). The INET_PROTOSW_ICSK flag is no need by smc,
just remove it.

After removing the INET_PROTOSW_ICSK flag, this patch alse revert
commit 6fd27ea183 ("net/smc: fix lacks of icsk_syn_mss with IPPROTO_SMC")
to avoid casting smc_sock to inet_connection_sock.

Reported-by: syzbot+f775be4458668f7d220e@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=f775be4458668f7d220e
Tested-by: syzbot+f775be4458668f7d220e@syzkaller.appspotmail.com
Fixes: d25a92ccae ("net/smc: Introduce IPPROTO_SMC")
Signed-off-by: Wang Liang <wangliang74@huawei.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: D. Wythe <alibuda@linux.alibaba.com>
Link: https://patch.msgid.link/20251017024827.3137512-1-wangliang74@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-10-20 17:46:06 -07:00
Eric Dumazet
a5cd3a60aa net: shrink napi_skb_cache_{put,get}() and napi_skb_cache_get_bulk()
Following loop in napi_skb_cache_put() is unrolled by the compiler
even if CONFIG_KASAN is not enabled:

for (i = NAPI_SKB_CACHE_HALF; i < NAPI_SKB_CACHE_SIZE; i++)
	kasan_mempool_unpoison_object(nc->skb_cache[i],
				kmem_cache_size(net_hotdata.skbuff_cache));

We have 32 times this sequence, for a total of 384 bytes.

	48 8b 3d 00 00 00 00 	net_hotdata.skbuff_cache,%rdi
	e8 00 00 00 00       	call   kmem_cache_size

This is because kmem_cache_size() is not an inline and not const,
and kasan_unpoison_object_data() is an inline function.

Cache kmem_cache_size() result in a variable, so that
the compiler can remove dead code (and variable) when/if
CONFIG_KASAN is unset.

After this patch, napi_skb_cache_put() is inlined in its callers,
and we avoid one kmem_cache_size() call in napi_skb_cache_get()
and napi_skb_cache_get_bulk().

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://patch.msgid.link/20251016182911.1132792-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-10-20 17:37:57 -07:00
Eric Dumazet
f8a55d5e71 net: add a fast path in __netif_schedule()
Cpus serving NIC interrupts and specifically TX completions are often
trapped in also restarting a busy qdisc (because qdisc was stopped by BQL
or the driver's own flow control).

When they call netdev_tx_completed_queue() or netif_tx_wake_queue(),
they call __netif_schedule() so that the queue can be run
later from net_tx_action() (involving NET_TX_SOFTIRQ)

Quite often, by the time the cpu reaches net_tx_action(), another cpu
grabbed the qdisc spinlock from __dev_xmit_skb(), and we spend too much
time spinning on this lock.

We can detect in __netif_schedule() if a cpu is already at a specific
point in __dev_xmit_skb() where we have the guarantee the queue will
be run.

This patch gives a 13 % increase of throughput on an IDPF NIC (200Gbit),
32 TX qeues, sending UDP packets of 120 bytes.

This also helps __qdisc_run() to not force a NET_TX_SOFTIRQ
if another thread is waiting in __dev_xmit_skb()

Before:

sar -n DEV 5 5|grep eth1|grep Average
Average:         eth1   1496.44 52191462.56    210.00 13369396.90      0.00      0.00      0.00     54.76

After:

sar -n DEV 5 5|grep eth1|grep Average
Average:         eth1   1457.88 59363099.96    205.08 15206384.35      0.00      0.00      0.00     62.29

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20251017145334.3016097-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-10-20 16:45:25 -07:00
Emmanuel Grumbach
249e1443e3 wifi: nl80211: call kfree without a NULL check
Coverity is unhappy because we may leak old_radio_rts_threshold. Since
this pointer is only valid in the context of the function and kfree is
NULL pointer safe, don't check and just call kfree.
Note that somehow, we were checking old_rts_threshold to free
old_radio_rts_threshold which is a bit odd.

Fixes: 264637941c ("wifi: cfg80211: Add Support to Set RTS Threshold for each Radio")
Reviewed-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Link: https://patch.msgid.link/20251020075745.44168-1-emmanuel.grumbach@intel.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-10-20 13:57:26 +02:00
Johannes Berg
ed6a47346e wifi: mac80211: fix key tailroom accounting leak
For keys added by ieee80211_gtk_rekey_add(), we assume that
they're already present in the hardware and set the flag
KEY_FLAG_UPLOADED_TO_HARDWARE. However, setting this flag
needs to be paired with decrementing the tailroom needed,
which was missed.

Fixes: f52a0b408e ("wifi: mac80211: mark keys as uploaded when added by the driver")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20251019115358.c88eafb4083e.I69e9d4d78a756a133668c55b5570cf15a4b0e6a4@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-10-20 13:57:10 +02:00
Aloka Dixit
6078447614 wifi: mac80211: reset FILS discovery and unsol probe resp intervals
When ieee80211_stop_ap() deletes the FILS discovery and unsolicited
broadcast probe response templates, the associated interval values
are not reset. This can lead to drivers subsequently operating with
the non-zero values, leading to unexpected behavior.

Trigger repeated retrieval attempts of the FILS discovery template in
ath12k, resulting in excessive log messages such as:

mac vdev 0 failed to retrieve FILS discovery template
mac vdev 4 failed to retrieve FILS discovery template

Fix this by resetting the intervals in ieee80211_stop_ap() to ensure
proper cleanup of FILS discovery and unsolicited broadcast probe
response templates.

Fixes: 295b02c4be ("mac80211: Add FILS discovery support")
Fixes: 632189a018 ("mac80211: Unsolicited broadcast probe response support")
Signed-off-by: Aloka Dixit <aloka.dixit@oss.qualcomm.com>
Signed-off-by: Aaradhana Sahu <aaradhana.sahu@oss.qualcomm.com>
Link: https://patch.msgid.link/20250924130014.2575533-1-aaradhana.sahu@oss.qualcomm.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-10-20 13:53:06 +02:00
Linus Torvalds
d303caf5ca Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Pull bpf fixes from Alexei Starovoitov:

 - Replace bpf_map_kmalloc_node() with kmalloc_nolock() to fix kmemleak
   imbalance in tracking of bpf_async_cb structures (Alexei Starovoitov)

 - Make selftests/bpf arg_parsing.c more robust to errors (Andrii
   Nakryiko)

 - Fix redefinition of 'off' as different kind of symbol when I40E
   driver is builtin (Brahmajit Das)

 - Do not disable preemption in bpf_test_run (Sahil Chandna)

 - Fix memory leak in __lookup_instance error path (Shardul Bankar)

 - Ensure test data is flushed to disk before reading it (Xing Guo)

* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  selftests/bpf: Fix redefinition of 'off' as different kind of symbol
  bpf: Do not disable preemption in bpf_test_run().
  bpf: Fix memory leak in __lookup_instance error path
  selftests: arg_parsing: Ensure data is flushed to disk before reading.
  bpf: Replace bpf_map_kmalloc_node() with kmalloc_nolock() to allocate bpf_async_cb structures.
  selftests/bpf: make arg_parsing.c more robust to crashes
  bpf: test_run: Fix ctx leak in bpf_prog_test_run_xdp error path
2025-10-18 08:00:43 -10:00
Jakub Kicinski
e90576829c Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Martin KaFai Lau says:

====================
pull-request: bpf-next 2025-10-16

We've added 6 non-merge commits during the last 1 day(s) which contain
a total of 18 files changed, 577 insertions(+), 38 deletions(-).

The main changes are:

1) Bypass the global per-protocol memory accounting either by setting
   a netns sysctl or using bpf_setsockopt in a bpf program,
   from Kuniyuki Iwashima.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next:
  selftests/bpf: Add test for sk->sk_bypass_prot_mem.
  bpf: Introduce SK_BPF_BYPASS_PROT_MEM.
  bpf: Support bpf_setsockopt() for BPF_CGROUP_INET_SOCK_CREATE.
  net: Introduce net.core.bypass_prot_mem sysctl.
  net: Allow opt-out from global protocol memory accounting.
  tcp: Save lock_sock() for memcg in inet_csk_accept().
====================

Link: https://patch.msgid.link/20251016204539.773707-1-martin.lau@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-10-17 17:20:42 -07:00
Eric Biggers
37a183d3b7 tcp: Convert tcp-md5 to use MD5 library instead of crypto_ahash
Make tcp-md5 use the MD5 library API (added in 6.18) instead of the
crypto_ahash API.  This is much simpler and also more efficient:

- The library API just operates on struct md5_ctx.  Just allocate this
  struct on the stack instead of using a pool of pre-allocated
  crypto_ahash and ahash_request objects.

- The library API accepts standard pointers and doesn't require
  scatterlists.  So, for hashing the headers just use an on-stack buffer
  instead of a pool of pre-allocated kmalloc'ed scratch buffers.

- The library API never fails.  Therefore, checking for MD5 hashing
  errors is no longer necessary.  Update tcp_v4_md5_hash_skb(),
  tcp_v6_md5_hash_skb(), tcp_v4_md5_hash_hdr(), tcp_v6_md5_hash_hdr(),
  tcp_md5_hash_key(), tcp_sock_af_ops::calc_md5_hash, and
  tcp_request_sock_ops::calc_md5_hash to return void instead of int.

- The library API provides direct access to the MD5 code, eliminating
  unnecessary overhead such as indirect function calls and scatterlist
  management.  Microbenchmarks of tcp_v4_md5_hash_skb() on x86_64 show a
  speedup from 7518 to 7041 cycles (6% fewer) with skb->len == 1440, or
  from 1020 to 678 cycles (33% fewer) with skb->len == 140.

Since tcp_sigpool_hash_skb_data() can no longer be used, add a function
tcp_md5_hash_skb_data() which is specialized to MD5.  Of course, to the
extent that this duplicates any code, it's well worth it.

To preserve the existing behavior of TCP-MD5 support being disabled when
the kernel is booted with "fips=1", make tcp_md5_do_add() check
fips_enabled itself.  Previously it relied on the error from
crypto_alloc_ahash("md5") being bubbled up.  I don't know for sure that
this is actually needed, but this preserves the existing behavior.

Tested with bidirectional TCP-MD5, both IPv4 and IPv6, between a kernel
that includes this commit and a kernel that doesn't include this commit.

(Side note: please don't use TCP-MD5!  It's cryptographically weak.  But
as long as Linux supports it, it might as well be implemented properly.)

Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Link: https://patch.msgid.link/20251014215836.115616-1-ebiggers@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-10-17 17:14:54 -07:00
Florian Westphal
2af8ff1e47 net: Kconfig: discourage drop_monitor enablement
Quoting Eric Dumazet:
"I do not understand the fascination with net/core/drop_monitor.c [..]
misses all the features, flexibility, scalability  that 'perf',
eBPF tracing, bpftrace, .... have today."

Reword DROP_MONITOR kconfig help text to clearly state that its not
related to perf-based drop monitoring and that its safe to disable
this unless support for the older netlink-based tools is needed.

Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20251016115147.18503-1-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-10-17 16:29:26 -07:00
Xuanqiang Luo
b8ec80b130 inet: Avoid ehash lookup race in inet_twsk_hashdance_schedule()
Since ehash lookups are lockless, if another CPU is converting sk to tw
concurrently, fetching the newly inserted tw with tw->tw_refcnt == 0 cause
lookup failure.

The call trace map is drawn as follows:
   CPU 0                                CPU 1
   -----                                -----
				     inet_twsk_hashdance_schedule()
				     spin_lock()
				     inet_twsk_add_node_rcu(tw, ...)
__inet_lookup_established()
(find tw, failure due to tw_refcnt = 0)
				     __sk_nulls_del_node_init_rcu(sk)
				     refcount_set(&tw->tw_refcnt, 3)
				     spin_unlock()

By replacing sk with tw atomically via hlist_nulls_replace_init_rcu() after
setting tw_refcnt, we ensure that tw is either fully initialized or not
visible to other CPUs, eliminating the race.

It's worth noting that we held lock_sock() before the replacement, so
there's no need to check if sk is hashed. Thanks to Kuniyuki Iwashima!

Fixes: 3ab5aee7fe ("net: Convert TCP & DCCP hash tables to use RCU / hlist_nulls")
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Signed-off-by: Xuanqiang Luo <luoxuanqiang@kylinos.cn>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20251015020236.431822-4-xuanqiang.luo@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-10-17 16:08:43 -07:00
Xuanqiang Luo
1532ed0d07 inet: Avoid ehash lookup race in inet_ehash_insert()
Since ehash lookups are lockless, if one CPU performs a lookup while
another concurrently deletes and inserts (removing reqsk and inserting sk),
the lookup may fail to find the socket, an RST may be sent.

The call trace map is drawn as follows:
   CPU 0                           CPU 1
   -----                           -----
				inet_ehash_insert()
                                spin_lock()
                                sk_nulls_del_node_init_rcu(osk)
__inet_lookup_established()
	(lookup failed)
                                __sk_nulls_add_node_rcu(sk, list)
                                spin_unlock()

As both deletion and insertion operate on the same ehash chain, this patch
introduces a new sk_nulls_replace_node_init_rcu() helper functions to
implement atomic replacement.

Fixes: 5e0724d027 ("tcp/dccp: fix hashdance race for passive sessions")
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Signed-off-by: Xuanqiang Luo <luoxuanqiang@kylinos.cn>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20251015020236.431822-3-xuanqiang.luo@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-10-17 16:08:43 -07:00
Kuniyuki Iwashima
1c17f4373d ipv6: Move ipv6_fl_list from ipv6_pinfo to inet_sock.
In {tcp6,udp6,raw6}_sock, struct ipv6_pinfo is always placed at
the beginning of a new cache line because

  1. __alignof__(struct tcp_sock) is 64 due to ____cacheline_aligned
     of __cacheline_group_begin(tcp_sock_write_tx)

  2. __alignof__(struct udp_sock) is 64 due to ____cacheline_aligned
     of struct numa_drop_counters

  3. in raw6_sock, struct numa_drop_counters is placed before
     struct ipv6_pinfo

.  struct ipv6_pinfo is 136 bytes, but the last cache line is
only used by ipv6_fl_list:

  $ pahole -C ipv6_pinfo vmlinux
  struct ipv6_pinfo {
  ...
  	/* --- cacheline 2 boundary (128 bytes) --- */
  	struct ipv6_fl_socklist *  ipv6_fl_list;         /*   128     8 */

  	/* size: 136, cachelines: 3, members: 23 */

Let's move ipv6_fl_list from struct ipv6_pinfo to struct inet_sock
to save a full cache line for {tcp6,udp6,raw6}_sock.

Now, struct ipv6_pinfo is 128 bytes, and {tcp6,udp6,raw6}_sock have
64 bytes less, while {tcp,udp,raw}_sock retain the same size.

Before:

  # grep -E "^(RAW|UDP[^L\-]|TCP)" /proc/slabinfo | awk '{print $1, "\t", $4}'
  RAWv6 	 1408
  UDPv6 	 1472
  TCPv6 	 2560
  RAW 		 1152
  UDP	 	 1280
  TCP 		 2368

After:

  # grep -E "^(RAW|UDP[^L\-]|TCP)" /proc/slabinfo | awk '{print $1, "\t", $4}'
  RAWv6 	 1344
  UDPv6 	 1408
  TCPv6 	 2496
  RAW 		 1152
  UDP	 	 1280
  TCP 		 2368

Also, ipv6_fl_list and inet_flags (SNDFLOW bit) are placed in the
same cache line.

  $ pahole -C inet_sock vmlinux
  ...
  	/* --- cacheline 11 boundary (704 bytes) was 56 bytes ago --- */
  	struct ipv6_pinfo *        pinet6;               /*   760     8 */
  	/* --- cacheline 12 boundary (768 bytes) --- */
  	struct ipv6_fl_socklist *  ipv6_fl_list;         /*   768     8 */
  	unsigned long              inet_flags;           /*   776     8 */

Doc churn is due to the insufficient Type column (only 1 space short).

Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20251014224210.2964778-1-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-10-17 16:06:52 -07:00
Sahil Chandna
7c33e97a6e bpf: Do not disable preemption in bpf_test_run().
The timer mode is initialized to NO_PREEMPT mode by default,
this disables preemption and force execution in atomic context
causing issue on PREEMPT_RT configurations when invoking
spin_lock_bh(), leading to the following warning:

BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:48
in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 6107, name: syz.0.17
preempt_count: 1, expected: 0
RCU nest depth: 1, expected: 1
Preemption disabled at:
[<ffffffff891fce58>] bpf_test_timer_enter+0xf8/0x140 net/bpf/test_run.c:42

Fix this, by removing NO_PREEMPT/NO_MIGRATE mode check.
Also, the test timer context no longer needs explicit calls to
migrate_disable()/migrate_enable() with rcu_read_lock()/rcu_read_unlock().
Use helpers rcu_read_lock_dont_migrate() and rcu_read_unlock_migrate()
instead.

Reported-by: syzbot+1f1fbecb9413cdbfbef8@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=1f1fbecb9413cdbfbef8
Suggested-by: Yonghong Song <yonghong.song@linux.dev>
Suggested-by: Menglong Dong <menglong.dong@linux.dev>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Tested-by: syzbot+1f1fbecb9413cdbfbef8@syzkaller.appspotmail.com
Co-developed-by: Brahmajit Das <listout@listout.xyz>
Signed-off-by: Brahmajit Das <listout@listout.xyz>
Signed-off-by: Sahil Chandna <chandna.sahil@gmail.com>
Link: https://lore.kernel.org/r/20251014185635.10300-1-chandna.sahil@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-10-17 11:29:35 -07:00
Sven Eckelmann
ed5730f3f7 batman-adv: use skb_crc32c() instead of skb_seq_read()
Make batadv_bla_check_duplist() just use the new function skb_crc32c(),
instead of calling skb_seq_read() with crc32c(). This is faster and
simpler.

Suggested-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2025-10-17 16:30:43 +02:00
Simon Wunderlich
e5ae07b2ef batman-adv: Start new development cycle
This version will contain all the (major or even only minor) changes for
Linux 6.19.

The version number isn't a semantic version number with major and minor
information. It is just encoding the year of the expected publishing as
Linux -rc1 and the number of published versions this year (starting at 0).

Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2025-10-17 16:30:43 +02:00
Eric Dumazet
100dfa74ca net: dev_queue_xmit() llist adoption
Remove busylock spinlock and use a lockless list (llist)
to reduce spinlock contention to the minimum.

Idea is that only one cpu might spin on the qdisc spinlock,
while others simply add their skb in the llist.

After this patch, we get a 300 % improvement on heavy TX workloads.
- Sending twice the number of packets per second.
- While consuming 50 % less cycles.

Note that this also allows in the future to submit batches
to various qdisc->enqueue() methods.

Tested:

- Dual Intel(R) Xeon(R) 6985P-C  (480 hyper threads).
- 100Gbit NIC, 30 TX queues with FQ packet scheduler.
- echo 64 >/sys/kernel/slab/skbuff_small_head/cpu_partial (avoid contention in mm)
- 240 concurrent "netperf -t UDP_STREAM -- -m 120 -n"

Before:

16 Mpps (41 Mpps if each thread is pinned to a different cpu)

vmstat 2 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
243  0      0 2368988672  51036 1100852    0    0   146     1  242   60  0  9 91  0  0
244  0      0 2368988672  51036 1100852    0    0   536    10 487745 14718  0 52 48  0  0
244  0      0 2368988672  51036 1100852    0    0   512     0 503067 46033  0 52 48  0  0
244  0      0 2368988672  51036 1100852    0    0   512     0 494807 12107  0 52 48  0  0
244  0      0 2368988672  51036 1100852    0    0   702    26 492845 10110  0 52 48  0  0

Lock contention (1 second sample taken on 8 cores)
perf lock record -C0-7 sleep 1; perf lock contention
 contended   total wait     max wait     avg wait         type   caller

    442111      6.79 s     162.47 ms     15.35 us     spinlock   dev_hard_start_xmit+0xcd
      5961      9.57 ms      8.12 us      1.60 us     spinlock   __dev_queue_xmit+0x3a0
       244    560.63 us      7.63 us      2.30 us     spinlock   do_softirq+0x5b
        13     25.09 us      3.21 us      1.93 us     spinlock   net_tx_action+0xf8

If netperf threads are pinned, spinlock stress is very high.
perf lock record -C0-7 sleep 1; perf lock contention
 contended   total wait     max wait     avg wait         type   caller

    964508      7.10 s     147.25 ms      7.36 us     spinlock   dev_hard_start_xmit+0xcd
       201    268.05 us      4.65 us      1.33 us     spinlock   __dev_queue_xmit+0x3a0
        12     26.05 us      3.84 us      2.17 us     spinlock   do_softirq+0x5b

@__dev_queue_xmit_ns:
[256, 512)            21 |                                                    |
[512, 1K)            631 |                                                    |
[1K, 2K)           27328 |@                                                   |
[2K, 4K)          265392 |@@@@@@@@@@@@@@@@                                    |
[4K, 8K)          417543 |@@@@@@@@@@@@@@@@@@@@@@@@@@                          |
[8K, 16K)         826292 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[16K, 32K)        733822 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@      |
[32K, 64K)         19055 |@                                                   |
[64K, 128K)        17240 |@                                                   |
[128K, 256K)       25633 |@                                                   |
[256K, 512K)           4 |                                                    |

After:

29 Mpps (57 Mpps if each thread is pinned to a different cpu)

vmstat 2 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
78  0      0 2369573632  32896 1350988    0    0    22     0  331  254  0  8 92  0  0
75  0      0 2369573632  32896 1350988    0    0    22    50 425713 280199  0 23 76  0  0
104  0      0 2369573632  32896 1350988    0    0   290     0 430238 298247  0 23 76  0  0
86  0      0 2369573632  32896 1350988    0    0   132     0 428019 291865  0 24 76  0  0
90  0      0 2369573632  32896 1350988    0    0   502     0 422498 278672  0 23 76  0  0

perf lock record -C0-7 sleep 1; perf lock contention
 contended   total wait     max wait     avg wait         type   caller

      2524    116.15 ms    486.61 us     46.02 us     spinlock   __dev_queue_xmit+0x55b
      5821    107.18 ms    371.67 us     18.41 us     spinlock   dev_hard_start_xmit+0xcd
      2377      9.73 ms     35.86 us      4.09 us     spinlock   ___slab_alloc+0x4e0
       923      5.74 ms     20.91 us      6.22 us     spinlock   ___slab_alloc+0x5c9
       121      3.42 ms    193.05 us     28.24 us     spinlock   net_tx_action+0xf8
         6    564.33 us    167.60 us     94.05 us     spinlock   do_softirq+0x5b

If netperf threads are pinned (~54 Mpps)
perf lock record -C0-7 sleep 1; perf lock contention
     32907    316.98 ms    195.98 us      9.63 us     spinlock   dev_hard_start_xmit+0xcd
      4507     61.83 ms    212.73 us     13.72 us     spinlock   __dev_queue_xmit+0x554
      2781     23.53 ms     40.03 us      8.46 us     spinlock   ___slab_alloc+0x5c9
      3554     18.94 ms     34.69 us      5.33 us     spinlock   ___slab_alloc+0x4e0
       233      9.09 ms    215.70 us     38.99 us     spinlock   do_softirq+0x5b
       153    930.66 us     48.67 us      6.08 us     spinlock   net_tx_action+0xfd
        84    331.10 us     14.22 us      3.94 us     spinlock   ___slab_alloc+0x5c9
       140    323.71 us      9.94 us      2.31 us     spinlock   ___slab_alloc+0x4e0

@__dev_queue_xmit_ns:
[128, 256)       1539830 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                  |
[256, 512)       2299558 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[512, 1K)         483936 |@@@@@@@@@@                                          |
[1K, 2K)          265345 |@@@@@@                                              |
[2K, 4K)          145463 |@@@                                                 |
[4K, 8K)           54571 |@                                                   |
[8K, 16K)          10270 |                                                    |
[16K, 32K)          9385 |                                                    |
[32K, 64K)          7749 |                                                    |
[64K, 128K)        26799 |                                                    |
[128K, 256K)        2665 |                                                    |
[256K, 512K)         665 |                                                    |

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Tested-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://patch.msgid.link/20251014171907.3554413-7-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-10-16 16:25:10 -07:00
Eric Dumazet
178ca30889 Revert "net/sched: Fix mirred deadlock on device recursion"
This reverts commits 0f022d32c3
and 44180feacc.

Prior patch in this series implemented loop detection
in act_mirred, we can remove q->owner to save some cycles
in the fast path.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Reviewed-by: Victor Nogueira <victor@mojatatu.com>
Tested-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://patch.msgid.link/20251014171907.3554413-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-10-16 16:25:10 -07:00
Eric Dumazet
fe946a751d net/sched: act_mirred: add loop detection
Commit 0f022d32c3 ("net/sched: Fix mirred deadlock on device recursion")
added code in the fast path, even when act_mirred is not used.

Prepare its revert by implementing loop detection in act_mirred.

Adds an array of device pointers in struct netdev_xmit.

tcf_mirred_is_act_redirect() can detect if the array
already contains the target device.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Tested-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://patch.msgid.link/20251014171907.3554413-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-10-16 16:25:10 -07:00
Eric Dumazet
5b2b7dec05 net: add add indirect call wrapper in skb_release_head_state()
While stress testing UDP senders on a host with expensive indirect
calls, I found cpus processing TX completions where showing
a very high cost (20%) in sock_wfree() due to
CONFIG_MITIGATION_RETPOLINE=y.

Take care of TCP and UDP TX destructors and use INDIRECT_CALL_3() macro.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Tested-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://patch.msgid.link/20251014171907.3554413-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-10-16 16:25:09 -07:00
Johannes Wiesböck
bf29555f5b rtnetlink: Allow deleting FDB entries in user namespace
Creating FDB entries is possible from a non-initial user namespace when
having CAP_NET_ADMIN, yet, when deleting FDB entries, processes receive
an EPERM because the capability is always checked against the initial
user namespace. This restricts the FDB management from unprivileged
containers.

Drop the netlink_capable check in rtnl_fdb_del as it was originally
dropped in c5c351088a and reintroduced in 1690be63a2 without
intention.

This patch was tested using a container on GyroidOS, where it was
possible to delete FDB entries from an unprivileged user namespace and
private network namespace.

Fixes: 1690be63a2 ("bridge: Add vlan support to static neighbors")
Reviewed-by: Michael Weiß <michael.weiss@aisec.fraunhofer.de>
Tested-by: Harshal Gohel <hg@simonwunderlich.de>
Signed-off-by: Johannes Wiesböck <johannes.wiesboeck@aisec.fraunhofer.de>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20251015201548.319871-1-johannes.wiesboeck@aisec.fraunhofer.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-10-16 16:09:56 -07:00
Jan Vaclav
f18c231fb1 net/hsr: add interlink to fill_info output
Currently, it is possible to configure the interlink
port, but no way to read it back from userspace.

Add it to the output of hsr_fill_info(), so it can be
read from userspace, for example:

$ ip -d link show hsr0
12: hsr0: <BROADCAST,MULTICAST> mtu ...
...
hsr slave1 veth0 slave2 veth1 interlink veth2 ...

Signed-off-by: Jan Vaclav <jvaclav@redhat.com>
Reviewed-by: Fernando Fernandez Mancera <fmancera@suse.de>
Link: https://patch.msgid.link/20251015101001.25670-2-jvaclav@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-10-16 15:52:10 -07:00