Commit Graph

18795 Commits

Author SHA1 Message Date
Roopni Devanathan
7cc986c04a wifi: cfg80211: Add debugfs support for multi-radio wiphy
In multi-radio wiphy architecture, where a single wiphy can have
multiple radios tied to it, radio specific configuration parameters
and global wiphy parameters are maintained for the entire physical
device and common to all radios. But, each radio in a wiphy can have
different values for each radio configuration parameter, like RTS
threshold. With the current debugfs directory structure, the values
of global wiphy configuration parameters can be viewed, but, values
of individual radio configuration parameters cannot be viewed, as
radio specific configuration parameters are not maintained, separately.

To address this, in addition to maintaining global wiphy configuration
parameters common to all radios, create separate debugfs directories
for each radio in a wiphy to maintain parameters corresponding to that
radio in this directory.

In implementation, maintain a dentry structure in wiphy_radio_cfg, a
structure  containing radio configurations of a wiphy. This struct is
maintained to denote per-radio configurations of a wiphy. Create
separate directories representing each radio within phy#X directory in
debugfs during wiphy registration.

Sample directory structure with this change:
ls /sys/kernel/debug/ieee80211/phy0/radio
radio0/ radio1/ radio2/

Signed-off-by: Roopni Devanathan <quic_rdevanat@quicinc.com>
Link: https://patch.msgid.link/20251024044649.483557-2-quic_rdevanat@quicinc.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-10-27 09:18:41 +01:00
Ryder Lee
a392cde88d wifi: cfg80211/mac80211: validate radio frequency range for monitor mode
In multi-radio devices, it is possible to have an MLD AP and a monitor
interface active at the same time. In such cases, monitor mode may not
be able to specify a fixed channel and could end up capturing frames
from all radios, including those outside the intended frequency bands.

This patch adds frequency validation for monitor mode. Received frames
are now only processed if their frequency fall within the allowed ranges
of the radios specified by the interface's radio_mask.

This prevents monitor mode from capturing frames outside the supported radio.

Signed-off-by: Ryder Lee <ryder.lee@mediatek.com>
Link: https://patch.msgid.link/700b8284e845d96654eb98431f8eeb5a81503862.1758647858.git.ryder.lee@mediatek.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-10-27 09:16:23 +01:00
Jakub Kicinski
55db64ddd6 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.18-rc2).

No conflicts or adjacent changes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-10-16 11:06:28 -07:00
Eric Dumazet
e5b670e543 net: remove obsolete WARN_ON(refcount_read(&sk->sk_refcnt) == 1)
sk->sk_refcnt has been converted to refcount_t in 2017.

__sock_put(sk) being refcount_dec(&sk->sk_refcnt), it will complain
loudly if the current refcnt is 1 (or less) in a non racy way.

We can remove four WARN_ON() in favor of the generic refcount_dec()
check.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Xuanqiang Luo<luoxuanqiang@kylinos.cn>
Link: https://patch.msgid.link/20251014140605.2982703-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-10-15 17:18:38 -07:00
Eric Dumazet
4a7708443d net: allow busy connected flows to switch tx queues
This is a followup of commit 726e9e8b94 ("tcp: refine
skb->ooo_okay setting") and of prior commit in this series
("net: control skb->ooo_okay from skb_set_owner_w()")

skb->ooo_okay might never be set for bulk flows that always
have at least one skb in a qdisc queue of NIC queue,
especially if TX completion is delayed because of a stressed cpu.

The so-called "strange attractors" has caused many performance
issues (see for instance 9b462d02d6 ("tcp: TCP Small Queues
and strange attractors")), we need to do better.

We have tried very hard to avoid reorders because TCP was
not dealing with them nicely a decade ago.

Use the new net.core.txq_reselection_ms sysctl to let
flows follow XPS and select a more efficient queue.

After this patch, we no longer have to make sure threads
are pinned to cpus, they now can be migrated without
adding too much spinlock/qdisc/TX completion pressure anymore.

TX completion part was problematic, because it added false sharing
on various socket fields, but also added false sharing and spinlock
contention in mm layers. Calling skb_orphan() from ndo_start_xmit()
is not an option unfortunately.

Note for later:

1) move sk->sk_tx_queue_mapping closer
to sk_tx_queue_mapping_jiffies for better cache locality.

2) Study if 9b462d02d6 ("tcp: TCP Small Queues
and strange attractors") could be revised.

Tested:

Used a host with 32 TX queues, shared by groups of 8 cores.
XPS setup :

echo ff >/sys/class/net/eth1/queue/tx-0/xps_cpus
echo ff00 >/sys/class/net/eth1/queue/tx-1/xps_cpus
echo ff0000 >/sys/class/net/eth1/queue/tx-2/xps_cpus
echo ff000000 >/sys/class/net/eth1/queue/tx-3/xps_cpus
echo ff,00000000 >/sys/class/net/eth1/queue/tx-4/xps_cpus
echo ff00,00000000 >/sys/class/net/eth1/queue/tx-5/xps_cpus
echo ff0000,00000000 >/sys/class/net/eth1/queue/tx-6/xps_cpus
echo ff000000,00000000 >/sys/class/net/eth1/queue/tx-7/xps_cpus
...

Launched a tcp_stream with 15 threads and 1000 flows, initially affined to core 0-15

taskset -c 0-15 tcp_stream -T15 -F1000 -l1000 -c -H target_host

Checked that only queues 0 and 1 are used as instructed by XPS :
tc -s qdisc show dev eth1|grep backlog|grep -v "backlog 0b 0p"
 backlog 123489410b 1890p
 backlog 69809026b 1064p
 backlog 52401054b 805p

Then force each thread to run on cpu 1,9,17,25,33,41,49,57,65,73,81,89,97,105,113,121

C=1;PID=`pidof tcp_stream`;for P in `ls /proc/$PID/task`; do taskset -pc $C $P; C=$(($C + 8));done

Set txq_reselection_ms to 1000
echo 1000 > /proc/sys/net/core/txq_reselection_ms

Check that the flows have migrated nicely:

tc -s qdisc show dev eth1|grep backlog|grep -v "backlog 0b 0p"
 backlog 130508314b 1916p
 backlog 8584380b 126p
 backlog 8584380b 126p
 backlog 8379990b 123p
 backlog 8584380b 126p
 backlog 8487484b 125p
 backlog 8584380b 126p
 backlog 8448120b 124p
 backlog 8584380b 126p
 backlog 8720640b 128p
 backlog 8856900b 130p
 backlog 8584380b 126p
 backlog 8652510b 127p
 backlog 8448120b 124p
 backlog 8516250b 125p
 backlog 7834950b 115p

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20251013152234.842065-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-10-15 09:04:22 -07:00
Eric Dumazet
2ddef3462b net: add /proc/sys/net/core/txq_reselection_ms control
Add a new sysctl to control how often a queue reselection
can happen even if a flow has a persistent queue of skbs
in a Qdisc or NIC queue.

A value of zero means the feature is disabled.

Default is 1000 (1 second).

This sysctl is used in the following patch.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20251013152234.842065-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-10-15 09:04:21 -07:00
Eric Dumazet
6ddb811a57 net: add SK_WMEM_ALLOC_BIAS constant
sk->sk_wmem_alloc is initialized to 1, and sk_wmem_alloc_get()
takes care of this initial value.

Add SK_WMEM_ALLOC_BIAS define to not spread this magic value.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20251013152234.842065-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-10-15 09:04:21 -07:00
Eric Dumazet
1c51450f1a tcp: better handle TCP_TX_DELAY on established flows
Some applications uses TCP_TX_DELAY socket option after TCP flow
is established.

Some metrics need to be updated, otherwise TCP might take time to
adapt to the new (emulated) RTT.

This patch adjusts tp->srtt_us, tp->rtt_min, icsk_rto
and sk->sk_pacing_rate.

This is best effort, and for instance icsk_rto is reset
without taking backoff into account.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20251013145926.833198-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-10-15 08:56:30 -07:00
Byungchul Park
53615ad26e netmem: replace __netmem_clear_lsb() with netmem_to_nmdesc()
Now that we have struct netmem_desc, it'd better access the pp fields
via struct netmem_desc rather than struct net_iov.

Introduce netmem_to_nmdesc() for safely converting netmem_ref to
netmem_desc regardless of the type underneath e.i. netmem_desc, net_iov.

While at it, remove __netmem_clear_lsb() and make netmem_to_nmdesc()
used instead.

Suggested-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Byungchul Park <byungchul@sk.com>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Link: https://patch.msgid.link/20251013044133.69472-1-byungchul@sk.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-10-14 13:37:26 +02:00
Dmitry Safonov
21f4d45eba net/ip6_tunnel: Prevent perpetual tunnel growth
Similarly to ipv4 tunnel, ipv6 version updates dev->needed_headroom, too.
While ipv4 tunnel headroom adjustment growth was limited in
commit 5ae1e9922b ("net: ip_tunnel: prevent perpetual headroom growth"),
ipv6 tunnel yet increases the headroom without any ceiling.

Reflect ipv4 tunnel headroom adjustment limit on ipv6 version.

Credits to Francesco Ruggeri, who was originally debugging this issue
and wrote local Arista-specific patch and a reproducer.

Fixes: 8eb30be035 ("ipv6: Create ip6_tnl_xmit")
Cc: Florian Westphal <fw@strlen.de>
Cc: Francesco Ruggeri <fruggeri05@gmail.com>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Link: https://patch.msgid.link/20251009-ip6_tunnel-headroom-v2-1-8e4dbd8f7e35@arista.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-10-13 17:43:46 -07:00
Jakub Kicinski
7a0f94361f net: psp: don't assume reply skbs will have a socket
Rx path may be passing around unreferenced sockets, which means
that skb_set_owner_edemux() may not set skb->sk and PSP will crash:

  KASAN: null-ptr-deref in range [0x0000000000000010-0x0000000000000017]
  RIP: 0010:psp_reply_set_decrypted (./include/net/psp/functions.h:132 net/psp/psp_sock.c:287)
    tcp_v6_send_response.constprop.0 (net/ipv6/tcp_ipv6.c:979)
    tcp_v6_send_reset (net/ipv6/tcp_ipv6.c:1140 (discriminator 1))
    tcp_v6_do_rcv (net/ipv6/tcp_ipv6.c:1683)
    tcp_v6_rcv (net/ipv6/tcp_ipv6.c:1912)

Fixes: 659a2899a5 ("tcp: add datapath logic for PSP with inline key exchange")
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20251001022426.2592750-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-10-03 10:23:50 -07:00
Linus Torvalds
07fdad3a93 Merge tag 'net-next-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Paolo Abeni:
 "Core & protocols:

   - Improve drop account scalability on NUMA hosts for RAW and UDP
     sockets and the backlog, almost doubling the Pps capacity under DoS

   - Optimize the UDP RX performance under stress, reducing contention,
     revisiting the binary layout of the involved data structs and
     implementing NUMA-aware locking. This improves UDP RX performance
     by an additional 50%, even more under extreme conditions

   - Add support for PSP encryption of TCP connections; this mechanism
     has some similarities with IPsec and TLS, but offers superior HW
     offloads capabilities

   - Ongoing work to support Accurate ECN for TCP. AccECN allows more
     than one congestion notification signal per RTT and is a building
     block for Low Latency, Low Loss, and Scalable Throughput (L4S)

   - Reorganize the TCP socket binary layout for data locality, reducing
     the number of touched cachelines in the fastpath

   - Refactor skb deferral free to better scale on large multi-NUMA
     hosts, this improves TCP and UDP RX performances significantly on
     such HW

   - Increase the default socket memory buffer limits from 256K to 4M to
     better fit modern link speeds

   - Improve handling of setups with a large number of nexthop, making
     dump operating scaling linearly and avoiding unneeded
     synchronize_rcu() on delete

   - Improve bridge handling of VLAN FDB, storing a single entry per
     bridge instead of one entry per port; this makes the dump order of
     magnitude faster on large switches

   - Restore IP ID correctly for encapsulated packets at GSO
     segmentation time, allowing GRO to merge packets in more scenarios

   - Improve netfilter matching performance on large sets

   - Improve MPTCP receive path performance by leveraging recently
     introduced core infrastructure (skb deferral free) and adopting
     recent TCP autotuning changes

   - Allow bridges to redirect to a backup port when the bridge port is
     administratively down

   - Introduce MPTCP 'laminar' endpoint that con be used only once per
     connection and simplify common MPTCP setups

   - Add RCU safety to dst->dev, closing a lot of possible races

   - A significant crypto library API for SCTP, MPTCP and IPv6 SR,
     reducing code duplication

   - Supports pulling data from an skb frag into the linear area of an
     XDP buffer

  Things we sprinkled into general kernel code:

   - Generate netlink documentation from YAML using an integrated YAML
     parser

  Driver API:

   - Support using IPv6 Flow Label in Rx hash computation and RSS queue
     selection

   - Introduce API for fetching the DMA device for a given queue,
     allowing TCP zerocopy RX on more H/W setups

   - Make XDP helpers compatible with unreadable memory, allowing more
     easily building DevMem-enabled drivers with a unified XDP/skbs
     datapath

   - Add a new dedicated ethtool callback enabling drivers to provide
     the number of RX rings directly, improving efficiency and clarity
     in RX ring queries and RSS configuration

   - Introduce a burst period for the health reporter, allowing better
     handling of multiple errors due to the same root cause

   - Support for DPLL phase offset exponential moving average,
     controlling the average smoothing factor

  Device drivers:

   - Add a new Huawei driver for 3rd gen NIC (hinic3)

   - Add a new SpacemiT driver for K1 ethernet MAC

   - Add a generic abstraction for shared memory communication
     devices (dibps)

   - Ethernet high-speed NICs:
      - nVidia/Mellanox:
         - Use multiple per-queue doorbell, to avoid MMIO contention
           issues
         - support adjacent functions, allowing them to delegate their
           SR-IOV VFs to sibling PFs
         - support RSS for IPSec offload
         - support exposing raw cycle counters in PTP and mlx5
         - support for disabling host PFs.
      - Intel (100G, ice, idpf):
         - ice: support for SRIOV VFs over an Active-Active link
           aggregate
         - ice: support for firmware logging via debugfs
         - ice: support for Earliest TxTime First (ETF) hardware offload
         - idpf: support basic XDP functionalities and XSk
      - Broadcom (bnxt):
         - support Hyper-V VF ID
         - dynamic SRIOV resource allocations for RoCE
      - Meta (fbnic):
         - support queue API, zero-copy Rx and Tx
         - support basic XDP functionalities
         - devlink health support for FW crashes and OTP mem corruptions
         - expand hardware stats coverage to FEC, PHY, and Pause
      - Wangxun:
         - support ethtool coalesce options
         - support for multiple RSS contexts

   - Ethernet virtual:
      - Macsec:
         - replace custom netlink attribute checks with policy-level
           checks
      - Bonding:
         - support aggregator selection based on port priority
      - Microsoft vNIC:
         - use page pool fragments for RX buffers instead of full pages
           to improve memory efficiency

   - Ethernet NICs consumer, and embedded:
      - Qualcomm: support Ethernet function for IPQ9574 SoC
      - Airoha: implement wlan offloading via NPU
      - Freescale
         - enetc: add NETC timer PTP driver and add PTP support
         - fec: enable the Jumbo frame support for i.MX8QM
      - Renesas (R-Car S4):
         - support HW offloading for layer 2 switching
         - support for RZ/{T2H, N2H} SoCs
      - Cadence (macb): support TAPRIO traffic scheduling
      - TI:
         - support for Gigabit ICSS ethernet SoC (icssm-prueth)
      - Synopsys (stmmac): a lot of cleanups

   - Ethernet PHYs:
      - Support 10g-qxgmi phy-mode for AQR412C, Felix DSA and Lynx PCS
        driver
      - Support bcm63268 GPHY power control
      - Support for Micrel lan8842 PHY and PTP
      - Support for Aquantia AQR412 and AQR115

   - CAN:
      - a large CAN-XL preparation work
      - reorganize raw_sock and uniqframe struct to minimize memory
        usage
      - rcar_canfd: update the CAN-FD handling

   - WiFi:
      - extended Neighbor Awareness Networking (NAN) support
      - S1G channel representation cleanup
      - improve S1G support

   - WiFi drivers:
      - Intel (iwlwifi):
         - major refactor and cleanup
      - Broadcom (brcm80211):
         - support for AP isolation
      - RealTek (rtw88/89) rtw88/89:
         - preparation work for RTL8922DE support
      - MediaTek (mt76):
         - HW restart improvements
         - MLO support
      - Qualcomm/Atheros (ath10k):
         - GTK rekey fixes

   - Bluetooth drivers:
      - btusb: support for several new IDs for MT7925
      - btintel: support for BlazarIW core
      - btintel_pcie: support for _suspend() / _resume()
      - btintel_pcie: support for Scorpious, Panther Lake-H484 IDs"

* tag 'net-next-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1536 commits)
  net: stmmac: Add support for Allwinner A523 GMAC200
  dt-bindings: net: sun8i-emac: Add A523 GMAC200 compatible
  Revert "Documentation: net: add flow control guide and document ethtool API"
  octeontx2-pf: fix bitmap leak
  octeontx2-vf: fix bitmap leak
  net/mlx5e: Use extack in set rxfh callback
  net/mlx5e: Introduce mlx5e_rss_params for RSS configuration
  net/mlx5e: Introduce mlx5e_rss_init_params
  net/mlx5e: Remove unused mdev param from RSS indir init
  net/mlx5: Improve QoS error messages with actual depth values
  net/mlx5e: Prevent entering switchdev mode with inconsistent netns
  net/mlx5: HWS, Generalize complex matchers
  net/mlx5: Improve write-combining test reliability for ARM64 Grace CPUs
  selftests/net: add tcp_port_share to .gitignore
  Revert "net/mlx5e: Update and set Xon/Xoff upon MTU set"
  net: add NUMA awareness to skb_attempt_defer_free()
  net: use llist for sd->defer_list
  net: make softnet_data.defer_count an atomic
  selftests: drv-net: psp: add tests for destroying devices
  selftests: drv-net: psp: add test for auto-adjusting TCP MSS
  ...
2025-10-02 15:17:01 -07:00
Paolo Abeni
f1455695d2 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.17-rc8).

Conflicts:

tools/testing/selftests/drivers/net/bonding/Makefile
  87951b5664 selftests: bonding: add test for passive LACP mode
  c2377f1763 selftests: bonding: add test for LACP actor port priority

Adjacent changes:

drivers/net/ethernet/cadence/macb.h
  fca3dc859b net: macb: remove illusion about TBQPH/RBQPH being per-queue
  89934dbf16 net: macb: Add TAPRIO traffic scheduling support

drivers/net/ethernet/cadence/macb_main.c
  fca3dc859b net: macb: remove illusion about TBQPH/RBQPH being per-queue
  89934dbf16 net: macb: Add TAPRIO traffic scheduling support

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-10-01 10:14:49 +02:00
Linus Torvalds
ae28ed4578 Merge tag 'bpf-next-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Pull bpf updates from Alexei Starovoitov:

 - Support pulling non-linear xdp data with bpf_xdp_pull_data() kfunc
   (Amery Hung)

   Applied as a stable branch in bpf-next and net-next trees.

 - Support reading skb metadata via bpf_dynptr (Jakub Sitnicki)

   Also a stable branch in bpf-next and net-next trees.

 - Enforce expected_attach_type for tailcall compatibility (Daniel
   Borkmann)

 - Replace path-sensitive with path-insensitive live stack analysis in
   the verifier (Eduard Zingerman)

   This is a significant change in the verification logic. More details,
   motivation, long term plans are in the cover letter/merge commit.

 - Support signed BPF programs (KP Singh)

   This is another major feature that took years to materialize.

   Algorithm details are in the cover letter/marge commit

 - Add support for may_goto instruction to s390 JIT (Ilya Leoshkevich)

 - Add support for may_goto instruction to arm64 JIT (Puranjay Mohan)

 - Fix USDT SIB argument handling in libbpf (Jiawei Zhao)

 - Allow uprobe-bpf program to change context registers (Jiri Olsa)

 - Support signed loads from BPF arena (Kumar Kartikeya Dwivedi and
   Puranjay Mohan)

 - Allow access to union arguments in tracing programs (Leon Hwang)

 - Optimize rcu_read_lock() + migrate_disable() combination where it's
   used in BPF subsystem (Menglong Dong)

 - Introduce bpf_task_work_schedule*() kfuncs to schedule deferred
   execution of BPF callback in the context of a specific task using the
   kernel’s task_work infrastructure (Mykyta Yatsenko)

 - Enforce RCU protection for KF_RCU_PROTECTED kfuncs (Kumar Kartikeya
   Dwivedi)

 - Add stress test for rqspinlock in NMI (Kumar Kartikeya Dwivedi)

 - Improve the precision of tnum multiplier verifier operation
   (Nandakumar Edamana)

 - Use tnums to improve is_branch_taken() logic (Paul Chaignon)

 - Add support for atomic operations in arena in riscv JIT (Pu Lehui)

 - Report arena faults to BPF error stream (Puranjay Mohan)

 - Search for tracefs at /sys/kernel/tracing first in bpftool (Quentin
   Monnet)

 - Add bpf_strcasecmp() kfunc (Rong Tao)

 - Support lookup_and_delete_elem command in BPF_MAP_STACK_TRACE (Tao
   Chen)

* tag 'bpf-next-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (197 commits)
  libbpf: Replace AF_ALG with open coded SHA-256
  selftests/bpf: Add stress test for rqspinlock in NMI
  selftests/bpf: Add test case for different expected_attach_type
  bpf: Enforce expected_attach_type for tailcall compatibility
  bpftool: Remove duplicate string.h header
  bpf: Remove duplicate crypto/sha2.h header
  libbpf: Fix error when st-prefix_ops and ops from differ btf
  selftests/bpf: Test changing packet data from kfunc
  selftests/bpf: Add stacktrace map lookup_and_delete_elem test case
  selftests/bpf: Refactor stacktrace_map case with skeleton
  bpf: Add lookup_and_delete_elem for BPF_MAP_STACK_TRACE
  selftests/bpf: Fix flaky bpf_cookie selftest
  selftests/bpf: Test changing packet data from global functions with a kfunc
  bpf: Emit struct bpf_xdp_sock type in vmlinux BTF
  selftests/bpf: Task_work selftest cleanup fixes
  MAINTAINERS: Delete inactive maintainers from AF_XDP
  bpf: Mark kfuncs as __noclone
  selftests/bpf: Add kprobe multi write ctx attach test
  selftests/bpf: Add kprobe write ctx attach test
  selftests/bpf: Add uprobe context ip register change test
  ...
2025-09-30 17:58:11 -07:00
Eric Dumazet
5628f3fe3b net: add NUMA awareness to skb_attempt_defer_free()
Instead of sharing sd->defer_list & sd->defer_count with
many cpus, add one pair for each NUMA node.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250928084934.3266948-4-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-09-30 15:45:53 +02:00
Hangbin Liu
5b66169f6b bonding: fix xfrm offload feature setup on active-backup mode
The active-backup bonding mode supports XFRM ESP offload. However, when
a bond is added using command like `ip link add bond0 type bond mode 1
miimon 100`, the `ethtool -k` command shows that the XFRM ESP offload is
disabled. This occurs because, in bond_newlink(), we change bond link
first and register bond device later. So the XFRM feature update in
bond_option_mode_set() is not called as the bond device is not yet
registered, leading to the offload feature not being set successfully.

To resolve this issue, we can modify the code order in bond_newlink() to
ensure that the bond device is registered first before changing the bond
link parameters. This change will allow the XFRM ESP offload feature to be
correctly enabled.

Fixes: 007ab53455 ("bonding: fix feature flag setting at init time")
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Link: https://patch.msgid.link/20250925023304.472186-1-liuhangbin@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-09-30 09:55:11 +02:00
Eric Dumazet
7d452516b6 Revert "net: group sk_backlog and sk_receive_queue"
This reverts commit 4effb335b5.

This was a benefit for UDP flood case, which was later greatly improved
with commits 6471658dc6 ("udp: use skb_attempt_defer_free()")
and b650bf0977 ("udp: remove busylock and add per NUMA queues").

Apparently blamed commit added a regression for RAW sockets, possibly
because they do not use the dual RX queue strategy that UDP has.

sock_queue_rcv_skb_reason() and RAW recvmsg() compete for sk_receive_buf
and sk_rmem_alloc changes, and them being in the same
cache line reduce performance.

Fixes: 4effb335b5 ("net: group sk_backlog and sk_receive_queue")
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202509281326.f605b4eb-lkp@intel.com
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: David Ahern <dsahern@kernel.org>
Cc: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250929182112.824154-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-29 18:30:32 -07:00
Paolo Abeni
a755677974 tcp: make tcp_rcvbuf_grow() accessible to mptcp code
To leverage the auto-tuning improvements brought by commit 2da35e4b4d
("Merge branch 'tcp-receive-side-improvements'"), the MPTCP stack need
to access the mentioned helper.

Acked-by: Geliang Tang <geliang@kernel.org>
Acked-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250927-net-next-mptcp-rcv-path-imp-v1-2-5da266aa9c1a@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-29 18:23:35 -07:00
Linus Torvalds
18b19abc37 Merge tag 'namespace-6.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull namespace updates from Christian Brauner:
 "This contains a larger set of changes around the generic namespace
  infrastructure of the kernel.

  Each specific namespace type (net, cgroup, mnt, ...) embedds a struct
  ns_common which carries the reference count of the namespace and so
  on.

  We open-coded and cargo-culted so many quirks for each namespace type
  that it just wasn't scalable anymore. So given there's a bunch of new
  changes coming in that area I've started cleaning all of this up.

  The core change is to make it possible to correctly initialize every
  namespace uniformly and derive the correct initialization settings
  from the type of the namespace such as namespace operations, namespace
  type and so on. This leaves the new ns_common_init() function with a
  single parameter which is the specific namespace type which derives
  the correct parameters statically. This also means the compiler will
  yell as soon as someone does something remotely fishy.

  The ns_common_init() addition also allows us to remove ns_alloc_inum()
  and drops any special-casing of the initial network namespace in the
  network namespace initialization code that Linus complained about.

  Another part is reworking the reference counting. The reference
  counting was open-coded and copy-pasted for each namespace type even
  though they all followed the same rules. This also removes all open
  accesses to the reference count and makes it private and only uses a
  very small set of dedicated helpers to manipulate them just like we do
  for e.g., files.

  In addition this generalizes the mount namespace iteration
  infrastructure introduced a few cycles ago. As reminder, the vfs makes
  it possible to iterate sequentially and bidirectionally through all
  mount namespaces on the system or all mount namespaces that the caller
  holds privilege over. This allow userspace to iterate over all mounts
  in all mount namespaces using the listmount() and statmount() system
  call.

  Each mount namespace has a unique identifier for the lifetime of the
  systems that is exposed to userspace. The network namespace also has a
  unique identifier working exactly the same way. This extends the
  concept to all other namespace types.

  The new nstree type makes it possible to lookup namespaces purely by
  their identifier and to walk the namespace list sequentially and
  bidirectionally for all namespace types, allowing userspace to iterate
  through all namespaces. Looking up namespaces in the namespace tree
  works completely locklessly.

  This also means we can move the mount namespace onto the generic
  infrastructure and remove a bunch of code and members from struct
  mnt_namespace itself.

  There's a bunch of stuff coming on top of this in the future but for
  now this uses the generic namespace tree to extend a concept
  introduced first for pidfs a few cycles ago. For a while now we have
  supported pidfs file handles for pidfds. This has proven to be very
  useful.

  This extends the concept to cover namespaces as well. It is possible
  to encode and decode namespace file handles using the common
  name_to_handle_at() and open_by_handle_at() apis.

  As with pidfs file handles, namespace file handles are exhaustive,
  meaning it is not required to actually hold a reference to nsfs in
  able to decode aka open_by_handle_at() a namespace file handle.
  Instead the FD_NSFS_ROOT constant can be passed which will let the
  kernel grab a reference to the root of nsfs internally and thus decode
  the file handle.

  Namespaces file descriptors can already be derived from pidfds which
  means they aren't subject to overmount protection bugs. IOW, it's
  irrelevant if the caller would not have access to an appropriate
  /proc/<pid>/ns/ directory as they could always just derive the
  namespace based on a pidfd already.

  It has the same advantage as pidfds. It's possible to reliably and for
  the lifetime of the system refer to a namespace without pinning any
  resources and to compare them trivially.

  Permission checking is kept simple. If the caller is located in the
  namespace the file handle refers to they are able to open it otherwise
  they must hold privilege over the owning namespace of the relevant
  namespace.

  The namespace file handle layout is exposed as uapi and has a stable
  and extensible format. For now it simply contains the namespace
  identifier, the namespace type, and the inode number. The stable
  format means that userspace may construct its own namespace file
  handles without going through name_to_handle_at() as they are already
  allowed for pidfs and cgroup file handles"

* tag 'namespace-6.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (65 commits)
  ns: drop assert
  ns: move ns type into struct ns_common
  nstree: make struct ns_tree private
  ns: add ns_debug()
  ns: simplify ns_common_init() further
  cgroup: add missing ns_common include
  ns: use inode initializer for initial namespaces
  selftests/namespaces: verify initial namespace inode numbers
  ns: rename to __ns_ref
  nsfs: port to ns_ref_*() helpers
  net: port to ns_ref_*() helpers
  uts: port to ns_ref_*() helpers
  ipv4: use check_net()
  net: use check_net()
  net-sysfs: use check_net()
  user: port to ns_ref_*() helpers
  time: port to ns_ref_*() helpers
  pid: port to ns_ref_*() helpers
  ipc: port to ns_ref_*() helpers
  cgroup: port to ns_ref_*() helpers
  ...
2025-09-29 11:20:29 -07:00
Linus Torvalds
722df25ddf Merge tag 'kernel-6.18-rc1.clone3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull copy_process updates from Christian Brauner:
 "This contains the changes to enable support for clone3() on nios2
  which apparently is still a thing.

  The more exciting part of this is that it cleans up the inconsistency
  in how the 64-bit flag argument is passed from copy_process() into the
  various other copy_*() helpers"

[ Fixed up rv ltl_monitor 32-bit support as per Sasha Levin in the merge ]

* tag 'kernel-6.18-rc1.clone3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  nios2: implement architecture-specific portion of sys_clone3
  arch: copy_thread: pass clone_flags as u64
  copy_process: pass clone_flags as u64 across calltree
  copy_sighand: Handle architectures where sizeof(unsigned long) < sizeof(u64)
2025-09-29 10:36:50 -07:00
Gustavo A. R. Silva
be812ace03 Bluetooth: Avoid a couple dozen -Wflex-array-member-not-at-end warnings
-Wflex-array-member-not-at-end was introduced in GCC-14, and we are
getting ready to enable it, globally.

Use the __struct_group() helper to fix 31 instances of the following
type of warnings:

30 net/bluetooth/mgmt_config.c:16:33: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end]
1 net/bluetooth/mgmt_config.c:22:33: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end]

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-09-27 11:37:43 -04:00
Luiz Augusto von Dentz
9eb1433188 Bluetooth: Add function and line information to bt_dbg
When enabling debug via CONFIG_BT_FEATURE_DEBUG include function and
line information by default otherwise it is hard to make any sense of
which function the logs comes from.

Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-09-27 11:37:01 -04:00
Luiz Augusto von Dentz
c9beb36c14 Bluetooth: hci_core: Detect if an ISO link has stalled
This attempts to detect if an ISO link has been waiting for an ISO
buffer for longer than the maximum allowed transport latency then
proceed to use hci_link_tx_to which prints an error and disconnects.

Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-09-27 11:37:01 -04:00
Luiz Augusto von Dentz
339a87883a Bluetooth: ISO: Use sk_sndtimeo as conn_timeout
This aligns the usage of socket sk_sndtimeo as conn_timeout when
initiating a connection and then use it when scheduling the
resulting HCI command, similar to what has been done in bf98feea5b
("Bluetooth: hci_conn: Always use sk_timeo as conn_timeout").

Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-09-27 11:37:01 -04:00
Thorsten Blum
d4e99db3d9 Bluetooth: Annotate struct hci_drv_rp_read_info with __counted_by_le()
Add the __counted_by_le() compiler attribute to the flexible array
member 'supported_commands' to improve access bounds-checking via
CONFIG_UBSAN_BOUNDS and CONFIG_FORTIFY_SOURCE.

Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-09-27 11:37:00 -04:00
Jakub Kicinski
94aced6ed9 Merge tag 'wireless-next-2025-09-25' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next
Johannes Berg says:

====================
Quite a bit more things, including pull requests from drivers:

 - mt76: MLO support, HW restart improvements
 - rtw88/89: small features, prep for RTL8922DE support
 - ath10k: GTK rekey fixes
 - cfg80211/mac80211:
   - additions for more NAN support
   - S1G channel representation cleanup

* tag 'wireless-next-2025-09-25' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (167 commits)
  wifi: libertas: add WQ_UNBOUND to alloc_workqueue users
  Revert "wifi: libertas: WQ_PERCPU added to alloc_workqueue users"
  wifi: libertas: WQ_PERCPU added to alloc_workqueue users
  wifi: cfg80211: fix width unit in cfg80211_radio_chandef_valid()
  wifi: ath11k: HAL SRNG: don't deinitialize and re-initialize again
  wifi: ath12k: enforce CPU endian format for all QMI data
  wifi: ath12k: Use 1KB Cache Flush Command for QoS TID Descriptors
  wifi: ath12k: Fix flush cache failure during RX queue update
  wifi: ath12k: Add Retry Mechanism for REO RX Queue Update Failures
  wifi: ath12k: Refactor REO command to use ath12k_dp_rx_tid_rxq
  wifi: ath12k: Refactor RX TID buffer cleanup into helper function
  wifi: ath12k: Refactor RX TID deletion handling into helper function
  wifi: ath12k: Increase DP_REO_CMD_RING_SIZE to 256
  wifi: cfg80211: remove IEEE80211_CHAN_{1,2,4,8,16}MHZ flags
  wifi: rtw89: avoid circular locking dependency in ser_state_run()
  wifi: rtw89: fix leak in rtw89_core_send_nullfunc()
  wifi: rtw89: avoid possible TX wait initialization race
  wifi: rtw89: fix use-after-free in rtw89_core_tx_kick_off_and_wait()
  wifi: ath12k: Fix peer lookup in ath12k_dp_mon_rx_deliver_msdu()
  wifi: mac80211: fix Rx packet handling when pubsta information is not available
  ...
====================

Link: https://patch.msgid.link/20250925232341.4544-3-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-26 14:27:28 -07:00
Jakub Kicinski
203e3beb73 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.17-rc8).

Conflicts:

drivers/net/can/spi/hi311x.c
  6b69680847 ("can: hi311x: fix null pointer dereference when resuming from sleep before interface was enabled")
  27ce71e1ce ("net: WQ_PERCPU added to alloc_workqueue users")
https://lore.kernel.org/72ce7599-1b5b-464a-a5de-228ff9724701@kernel.org

net/smc/smc_loopback.c
drivers/dibs/dibs_loopback.c
  a35c04de25 ("net/smc: fix warning in smc_rx_splice() when calling get_page()")
  cc21191b58 ("dibs: Move data path to dibs layer")
https://lore.kernel.org/74368a5c-48ac-4f8e-a198-40ec1ed3cf5f@kernel.org

Adjacent changes:

drivers/net/dsa/lantiq/lantiq_gswip.c
  c0054b25e2 ("net: dsa: lantiq_gswip: move gswip_add_single_port_br() call to port_setup()")
  7a1eaef0a7 ("net: dsa: lantiq_gswip: support model-specific mac_select_pcs()")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-25 11:00:59 -07:00
Richard Gobert
f095a358fa net: gro: remove unnecessary df checks
Currently, packets with fixed IDs will be merged only if their
don't-fragment bit is set. This restriction is unnecessary since
packets without the don't-fragment bit will be forwarded as-is even
if they were merged together. The merged packets will be segmented
into their original forms before being forwarded, either by GSO or
by TSO. The IDs will also remain identical unless NETIF_F_TSO_MANGLEID
is set, in which case the IDs can become incrementing, which is also fine.

Clean up the code by removing the unnecessary don't-fragment checks.

Signed-off-by: Richard Gobert <richardbgobert@gmail.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250923085908.4687-5-richardbgobert@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-09-25 12:42:49 +02:00
Richard Gobert
21f7484220 net: gro: only merge packets with incrementing or fixed outer ids
Only merge encapsulated packets if their outer IDs are either
incrementing or fixed, just like for inner IDs and IDs of non-encapsulated
packets.

Add another ip_fixedid bit for a total of two bits: one for outer IDs (and
for unencapsulated packets) and one for inner IDs.

This commit preserves the current behavior of GSO where only the IDs of the
inner-most headers are restored correctly.

Signed-off-by: Richard Gobert <richardbgobert@gmail.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250923085908.4687-3-richardbgobert@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-09-25 12:42:49 +02:00
Richard Gobert
25c550464a net: gro: remove is_ipv6 from napi_gro_cb
Remove is_ipv6 from napi_gro_cb and use sk->sk_family instead.
This frees up space for another ip_fixedid bit that will be added
in the next commit.

udp_sock_create always creates either a AF_INET or a AF_INET6 socket,
so using sk->sk_family is reliable. In IPv6-FOU, cfg->ipv6_v6only is
always enabled.

Signed-off-by: Richard Gobert <richardbgobert@gmail.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250923085908.4687-2-richardbgobert@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-09-25 12:42:49 +02:00
Jakub Kicinski
5e3fee34f6 Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Martin KaFai Lau says:

====================
pull-request: bpf-next 2025-09-23

We've added 9 non-merge commits during the last 33 day(s) which contain
a total of 10 files changed, 480 insertions(+), 53 deletions(-).

The main changes are:

1) A new bpf_xdp_pull_data kfunc that supports pulling data from
   a frag into the linear area of a xdp_buff, from Amery Hung.

   This includes changes in the xdp_native.bpf.c selftest, which
   Nimrod's future work depends on.

   It is a merge from a stable branch 'xdp_pull_data' which has
   also been merged to bpf-next.

   There is a conflict with recent changes in 'include/net/xdp.h'
   in the net-next tree that will need to be resolved.

2) A compiler warning fix when CONFIG_NET=n in the recent dynptr
   skb_meta support, from Jakub Sitnicki.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next:
  selftests: drv-net: Pull data before parsing headers
  selftests/bpf: Test bpf_xdp_pull_data
  bpf: Support specifying linear xdp packet data size for BPF_PROG_TEST_RUN
  bpf: Make variables in bpf_prog_test_run_xdp less confusing
  bpf: Clear packet pointers after changing packet data in kfuncs
  bpf: Support pulling non-linear xdp data
  bpf: Allow bpf_xdp_shrink_data to shrink a frag from head and tail
  bpf: Clear pfmemalloc flag when freeing all fragments
  bpf: Return an error pointer for skb metadata when CONFIG_NET=n
====================

Link: https://patch.msgid.link/20250924050303.2466356-1-martin.lau@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-24 10:22:37 -07:00
Eric Dumazet
b650bf0977 udp: remove busylock and add per NUMA queues
busylock was protecting UDP sockets against packet floods,
but unfortunately was not protecting the host itself.

Under stress, many cpus could spin while acquiring the busylock,
and NIC had to drop packets. Or packets would be dropped
in cpu backlog if RPS/RFS were in place.

This patch replaces the busylock by intermediate
lockless queues. (One queue per NUMA node).

This means that fewer number of cpus have to acquire
the UDP receive queue lock.

Most of the cpus can either:
- immediately drop the packet.
- or queue it in their NUMA aware lockless queue.

Then one of the cpu is chosen to process this lockless queue
in a batch.

The batch only contains packets that were cooked on the same
NUMA node, thus with very limited latency impact.

Tested:

DDOS targeting a victim UDP socket, on a platform with 6 NUMA nodes
(Intel(R) Xeon(R) 6985P-C)

Before:

nstat -n ; sleep 1 ; nstat | grep Udp
Udp6InDatagrams                 1004179            0.0
Udp6InErrors                    3117               0.0
Udp6RcvbufErrors                3117               0.0

After:
nstat -n ; sleep 1 ; nstat | grep Udp
Udp6InDatagrams                 1116633            0.0
Udp6InErrors                    14197275           0.0
Udp6RcvbufErrors                14197275           0.0

We can see this host can now proces 14.2 M more packets per second
while under attack, and the victim socket can receive 11 % more
packets.

I used a small bpftrace program measuring time (in us) spent in
__udp_enqueue_schedule_skb().

Before:

@udp_enqueue_us[398]:
[0]                24901 |@@@                                                 |
[1]                63512 |@@@@@@@@@                                           |
[2, 4)            344827 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[4, 8)            244673 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                |
[8, 16)            54022 |@@@@@@@@                                            |
[16, 32)          222134 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                   |
[32, 64)          232042 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                  |
[64, 128)           4219 |                                                    |
[128, 256)           188 |                                                    |

After:

@udp_enqueue_us[398]:
[0]              5608855 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[1]              1111277 |@@@@@@@@@@                                          |
[2, 4)            501439 |@@@@                                                |
[4, 8)            102921 |                                                    |
[8, 16)            29895 |                                                    |
[16, 32)           43500 |                                                    |
[32, 64)           31552 |                                                    |
[64, 128)            979 |                                                    |
[128, 256)            13 |                                                    |

Note that the remaining bottleneck for this platform is in
udp_drops_inc() because we limited struct numa_drop_counters
to only two nodes so far.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250922104240.2182559-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-23 16:38:39 -07:00
Martin KaFai Lau
34f033a6c9 Merge branch 'bpf-next/xdp_pull_data' into 'bpf-next/master'
Merge the xdp_pull_data stable branch into the master branch. No conflict.

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2025-09-23 16:23:58 -07:00
Martin KaFai Lau
55d5a5154d Merge branch 'bpf-next/xdp_pull_data' into 'bpf-next/net'
Merge the xdp_pull_data stable branch into the net branch. No conflict.

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2025-09-23 15:46:52 -07:00
Amery Hung
dea1526fba bpf: Allow bpf_xdp_shrink_data to shrink a frag from head and tail
Move skb_frag_t adjustment into bpf_xdp_shrink_data() and extend its
functionality to be able to shrink an xdp fragment from both head and
tail. In a later patch, bpf_xdp_pull_data() will reuse it to shrink an
xdp fragment from head.

Additionally, in bpf_xdp_frags_shrink_tail(), breaking the loop when
bpf_xdp_shrink_data() returns false (i.e., not releasing the current
fragment) is not necessary as the loop condition, offset > 0, has the
same effect. Remove the else branch to simplify the code.

Signed-off-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://patch.msgid.link/20250922233356.3356453-3-ameryhung@gmail.com
2025-09-23 13:35:12 -07:00
Amery Hung
8f12d1137c bpf: Clear pfmemalloc flag when freeing all fragments
It is possible for bpf_xdp_adjust_tail() to free all fragments. The
kfunc currently clears the XDP_FLAGS_HAS_FRAGS bit, but not
XDP_FLAGS_FRAGS_PF_MEMALLOC. So far, this has not caused a issue when
building sk_buff from xdp_buff since all readers of xdp_buff->flags
use the flag only when there are fragments. Clear the
XDP_FLAGS_FRAGS_PF_MEMALLOC bit as well to make the flags correct.

Signed-off-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://patch.msgid.link/20250922233356.3356453-2-ameryhung@gmail.com
2025-09-23 13:35:11 -07:00
Julian Ruess
a612dbe8d0 dibs: Move event handling to dibs layer
Add defines for all event types and subtypes an ism device is known to
produce as it can be helpful for debugging purposes.

Introduces a generic 'struct dibs_event' and adopt ism device driver
and smc-d client accordingly. Tolerate and ignore other type and subtype
values to enable future device extensions.

SMC-D and ISM are now independent.
struct ism_dev can be moved to drivers/s390/net/ism.h.

Note that in smc, the term 'ism' is still used. Future patches could
replace that with 'dibs' or 'smc-d' as appropriate.

Signed-off-by: Julian Ruess <julianr@linux.ibm.com>
Co-developed-by: Alexandra Winter <wintera@linux.ibm.com>
Signed-off-by: Alexandra Winter <wintera@linux.ibm.com>
Link: https://patch.msgid.link/20250918110500.1731261-15-wintera@linux.ibm.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-09-23 11:13:22 +02:00
Alexandra Winter
cc21191b58 dibs: Move data path to dibs layer
Use struct dibs_dmb instead of struct smc_dmb and move the corresponding
client tables to dibs_dev. Leave driver specific implementation details
like sba in the device drivers.

Register and unregister dmbs via dibs_dev_ops. A dmb is dedicated to a
single client, but a dibs device can have dmbs for more than one client.

Trigger dibs clients via dibs_client_ops->handle_irq(), when data is
received into a dmb. For dibs_loopback replace scheduling an smcd receive
tasklet with calling dibs_client_ops->handle_irq().

For loopback devices attach_dmb(), detach_dmb() and move_data() need to
access the dmb tables, so move those to dibs_dev_ops in this patch as well.

Remove remaining definitions of smc_loopback as they are no longer
required, now that everything is in dibs_loopback.

Note that struct ism_client and struct ism_dev are still required in smc
until a follow-on patch moves event handling to dibs. (Loopback does not
use events).

Signed-off-by: Alexandra Winter <wintera@linux.ibm.com>
Link: https://patch.msgid.link/20250918110500.1731261-14-wintera@linux.ibm.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-09-23 11:13:22 +02:00
Alexandra Winter
719c3b67bb dibs: Move query_remote_gid() to dibs_dev_ops
Provide the dibs_dev_ops->query_remote_gid() in ism and dibs_loopback
dibs_devices. And call it in smc dibs_client.

Reviewed-by: Julian Ruess <julianr@linux.ibm.com>
Signed-off-by: Alexandra Winter <wintera@linux.ibm.com>
Link: https://patch.msgid.link/20250918110500.1731261-13-wintera@linux.ibm.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-09-23 11:13:22 +02:00
Alexandra Winter
92a0f7bb08 dibs: Move vlan support to dibs_dev_ops
It can be debated how much benefit definition of vlan ids for dibs devices
brings, as the dmbs are accessible only by a single peer anyhow. But ism
provides vlan support and smcd exploits it, so move it to dibs layer as an
optional feature.

smcd_loopback simply ignores all vlan settings, do the same in
dibs_loopback.

SMC-D and ISM have a method to use the invalid VLAN ID 1FFF
(ISM_RESERVED_VLANID), to indicate that both communication peers support
routable SMC-Dv2. Tolerate it in dibs, but move it to SMC only.

Signed-off-by: Alexandra Winter <wintera@linux.ibm.com>
Link: https://patch.msgid.link/20250918110500.1731261-12-wintera@linux.ibm.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-09-23 11:13:22 +02:00
Alexandra Winter
05e68d8ded dibs: Local gid for dibs devices
Define a uuid_t GID attribute to identify a dibs device.

SMC uses 64 Bit and 128 Bit Global Identifiers (GIDs) per device, that
need to be sent via the SMC protocol. Because the smc code uses integers,
network endianness and host endianness need to be considered. Avoid this
in the dibs layer by using uuid_t byte arrays. Future patches could change
SMC to use uuid_t. For now conversion helper functions are introduced.

ISM devices provide 64 Bit GIDs. Map them to dibs uuid_t GIDs like this:
 _________________________________________
| 64 Bit ISM-vPCI GID | 00000000_00000000 |
 -----------------------------------------
If interpreted as UUID [1], this would be interpreted as the UIID variant,
that is reserved for NCS backward compatibility. So it will not collide
with UUIDs that were generated according to the standard.

smc_loopback already uses version 4 UUIDs as 128 Bit GIDs, move that to
dibs loopback. A temporary change to smc_lo_query_rgid() is required,
that will be moved to dibs_loopback with a follow-on patch.

Provide gid of a dibs device as sysfs read-only attribute.

Link: https://datatracker.ietf.org/doc/html/rfc4122 [1]
Signed-off-by: Alexandra Winter <wintera@linux.ibm.com>
Reviewed-by: Julian Ruess <julianr@linux.ibm.com>
Reviewed-by: Mahanta Jambigi <mjambigi@linux.ibm.com>
Link: https://patch.msgid.link/20250918110500.1731261-11-wintera@linux.ibm.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-09-23 11:13:22 +02:00
Julian Ruess
845c334a01 dibs: Move struct device to dibs_dev
Move struct device from ism_dev and smc_lo_dev to dibs_dev, and define a
corresponding release function. Free ism_dev in ism_remove() and smc_lo_dev
in smc_lo_dev_remove().

Replace smcd->ops->get_dev(smcd) by using dibs->dev directly.

An alternative design would be to embed dibs_dev as a field in ism_dev and
do the same for other dibs device driver specific structs. However that
would have the disadvantage that each dibs device driver needs to allocate
dibs_dev and each dibs device driver needs a different device release
function. The advantage would be that ism_dev and other device driver
specific structs would be covered by device reference counts.

Signed-off-by: Julian Ruess <julianr@linux.ibm.com>
Co-developed-by: Alexandra Winter <wintera@linux.ibm.com>
Signed-off-by: Alexandra Winter <wintera@linux.ibm.com>
Reviewed-by: Mahanta Jambigi <mjambigi@linux.ibm.com>
Link: https://patch.msgid.link/20250918110500.1731261-9-wintera@linux.ibm.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-09-23 11:13:22 +02:00
Alexandra Winter
69baaac936 dibs: Define dibs_client_ops and dibs_dev_ops
Move the device add() and remove() functions from ism_client to
dibs_client_ops and call add_dev()/del_dev() for ism devices and
dibs_loopback devices. dibs_client_ops->add_dev() = smcd_register_dev() for
the smc_dibs_client. This is the first step to handle ism and loopback
devices alike (as dibs devices) in the smc dibs client.

Define dibs_dev->ops and move smcd_ops->get_chid to
dibs_dev_ops->get_fabric_id() for ism and loopback devices. See below for
why this needs to be in the same patch as dibs_client_ops->add_dev().

The following changes contain intermediate steps, that will be obsoleted by
follow-on patches, once more functionality has been moved to dibs:

Use different smcd_ops and max_dmbs for ism and loopback. Follow-on patches
will change SMC-D to directly use dibs_ops instead of smcd_ops.

In smcd_register_dev() it is now necessary to identify a dibs_loopback
device before smcd_dev and smcd_ops->get_chid() are available. So provide
dibs_dev_ops->get_fabric_id() in this patch and evaluate it in
smc_ism_is_loopback().

Call smc_loopback_init() in smcd_register_dev() and call
smc_loopback_exit() in smcd_unregister_dev() to handle the functionality
that is still in smc_loopback. Follow-on patches will move all smc_loopback
code to dibs_loopback.

In smcd_[un]register_dev() use only ism device name, this will be replaced
by dibs device name by a follow-on patch.

End of changes with intermediate parts.

Allocate an smcd event workqueue for all dibs devices, although
dibs_loopback does not generate events.

Use kernel memory instead of devres memory for smcd_dev and smcd->conn.
Since commit a72178cfe8 ("net/smc: Fix dependency of SMC on ISM") an ism
device and its driver can have a longer lifetime than the smc module, so
smc should not rely on devres to free its resources [1]. It is now the
responsibility of the smc client to free smcd and smcd->conn for all dibs
devices, ism devices as well as loopback. Call client->ops->del_dev() for
all existing dibs devices in dibs_unregister_client(), so all device
related structures can be freed in the client.

When dibs_unregister_client() is called in the context of smc_exit() or
smc_core_reboot_event(), these functions have already called
smc_lgrs_shutdown() which calls smc_smcd_terminate_all(smcd) and sets
going_away. This is done a second time in smcd_unregister_dev(). This is
analogous to how smcr is handled in these functions, by calling first
smc_lgrs_shutdown() and then smc_ib_unregister_client() >
smc_ib_remove_dev(), so leave it that way. It may be worth investigating,
whether smc_lgrs_shutdown() is still required or useful.

Remove CONFIG_SMC_LO. CONFIG_DIBS_LO now controls whether a dibs loopback
device exists or not.

Link: https://www.kernel.org/doc/Documentation/driver-model/devres.txt [1]
Signed-off-by: Alexandra Winter <wintera@linux.ibm.com>
Reviewed-by: Mahanta Jambigi <mjambigi@linux.ibm.com>
Link: https://patch.msgid.link/20250918110500.1731261-8-wintera@linux.ibm.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-09-23 11:13:22 +02:00
Alexandra Winter
884eee8e43 net/smc: Remove error handling of unregister_dmb()
smcd_buf_free() calls smc_ism_unregister_dmb(lgr->smcd, buf_desc) and
then unconditionally frees buf_desc.

Remove the cleaning up of fields of buf_desc in
smc_ism_unregister_dmb(), because it is not helpful.

This removes the only usage of ISM_ERROR from the smc module. So move it
to drivers/s390/net/ism.h.

Signed-off-by: Alexandra Winter <wintera@linux.ibm.com>
Reviewed-by: Mahanta Jambigi <mjambigi@linux.ibm.com>
Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
Link: https://patch.msgid.link/20250918110500.1731261-2-wintera@linux.ibm.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-09-23 11:13:21 +02:00
Jakub Sitnicki
d57f4b8749 tcp: Update bind bucket state on port release
Today, once an inet_bind_bucket enters a state where fastreuse >= 0 or
fastreuseport >= 0 after a socket is explicitly bound to a port, it remains
in that state until all sockets are removed and the bucket is destroyed.

In this state, the bucket is skipped during ephemeral port selection in
connect(). For applications using a reduced ephemeral port
range (IP_LOCAL_PORT_RANGE socket option), this can cause faster port
exhaustion since blocked buckets are excluded from reuse.

The reason the bucket state isn't updated on port release is unclear.
Possibly a performance trade-off to avoid scanning bucket owners, or just
an oversight.

Fix it by recalculating the bucket state when a socket releases a port. To
limit overhead, each inet_bind2_bucket stores its own (fastreuse,
fastreuseport) state. On port release, only the relevant port-addr bucket
is scanned, and the overall state is derived from these.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250917-update-bind-bucket-state-on-unhash-v5-1-57168b661b47@cloudflare.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-09-23 10:12:15 +02:00
Eric Dumazet
649091ef59 tcp: reclaim 8 bytes in struct request_sock_queue
synflood_warned had to be u32 for xchg(), but ensuring
atomicity is not really needed.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250919204856.2977245-9-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-22 17:55:25 -07:00
Eric Dumazet
9303c3ced1 net: move sk->sk_err_soft and sk->sk_sndbuf
sk->sk_sndbuf is read-mostly in tx path, so move it from
sock_write_tx group to more appropriate sock_read_tx.

sk->sk_err_soft was not identified previously, but
is used from tcp_ack().

Move it to sock_write_tx group for better cache locality.

Also change tcp_ack() to clear sk->sk_err_soft only if needed.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250919204856.2977245-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-22 17:55:24 -07:00
Eric Dumazet
17b14d235f net: move sk_uid and sk_protocol to sock_read_tx
sk_uid and sk_protocol are read from inet6_csk_route_socket()
for each TCP transmit.

Also read from udpv6_sendmsg(), udp_sendmsg() and others.

Move them to sock_read_tx for better cache locality.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250919204856.2977245-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-22 17:55:24 -07:00
Kuniyuki Iwashima
0ac44301e3 tcp: Remove inet6_hash().
inet_hash() and inet6_hash() are exactly the same.

Also, we do not need to export inet6_hash().

Let's consolidate the two into __inet_hash() and rename it to inet_hash().

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250919083706.1863217-3-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-22 11:38:43 -07:00
Kuniyuki Iwashima
6445bb832d tcp: Remove osk from __inet_hash() arg.
__inet_hash() is called from inet_hash() and inet6_hash with osk NULL.

Let's remove the 2nd arg from __inet_hash().

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250919083706.1863217-2-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-22 11:38:42 -07:00