Commit Graph

1426259 Commits

Author SHA1 Message Date
Dimitri Daskalakis
19c3a2a81d selftests: drv-net: rss: Generate unique ports for RSS context tests
The RSS ctx tests rely on NFC rules with unique ports to steer packets
to the correct ctx. This updates the test to use the new rand_ports()
helper to guarantee the ports are unique.

Manual testing shows that generating 32 ports with the existing method
would result in at least one duplicate 4% of the time.

Signed-off-by: Dimitri Daskalakis <dimitri.daskalakis1@gmail.com>
Link: https://patch.msgid.link/20260224224659.1507082-3-dimitri.daskalakis1@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-25 19:42:02 -08:00
Dimitri Daskalakis
b0249c0d41 selftests: net: py: Add rand_ports helper method
Certain tests need a unique set of ports. Successive calls to the
existing rand_port method may return a duplicate port, resulting in test
flakiness. The new helper keeps sockets open while building a list of
ephemeral ports, thus the kernel enforces their uniqueness.

Signed-off-by: Dimitri Daskalakis <dimitri.daskalakis1@gmail.com>
Link: https://patch.msgid.link/20260224224659.1507082-2-dimitri.daskalakis1@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-25 19:42:02 -08:00
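
The idea described in the commit above (hold every socket open until the whole list is built) can be sketched as follows. This is a minimal illustration, not the selftest library's code: the name rand_ports_sketch, the use of AF_INET and the loopback address are all assumptions made for the example.

```python
import socket


def rand_ports_sketch(count, stype=socket.SOCK_STREAM):
    """Return `count` distinct ephemeral ports (hypothetical helper)."""
    socks = []
    try:
        for _ in range(count):
            s = socket.socket(socket.AF_INET, stype)
            # Port 0 asks the kernel to pick a free ephemeral port.
            s.bind(("127.0.0.1", 0))
            socks.append(s)
        # Every socket is still bound at this point, so the kernel
        # cannot have handed out the same port twice.
        return [s.getsockname()[1] for s in socks]
    finally:
        for s in socks:
            s.close()


ports = rand_ports_sketch(32)
```

As with any helper of this kind, the ports are only reserved while the sockets are open; another process may claim one of them after the function returns.
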
Jakub Kicinski
2cd63825c7 Merge branch 'netfilter-updates-for-net-next'
Florian Westphal says:

====================
netfilter: updates for net-next

including IPVS updates from and via Julian Anastasov.

First updates for IPVS. From Julian's cover letter:

* Convert the global __ip_vs_mutex to per-net service_mutex and
  switch the service tables to be per-net, joint work by Jiejian Wu and
  Dust Li

* Convert some code that walks the service lists to use RCU instead of
  the service_mutex

* We used two tables for services (non-fwmark and fwmark), merge them
  into single svc_table

* The list for unavailable destinations (dest_trash) holds dsts and
  thus dev references causing extra work for the ip_vs_dst_event() dev
  notifier handler. Change this by dropping the reference when dest
  is removed and saved into dest_trash. The dest_trash will need more
  changes to make it light for lookups. TODO.

* On new connection we can do multiple lookups for services by trying
  different fallback options. Add more counters for service types, so
  that we can avoid unneeded lookups for services.

* The no_cport and dropentry counters can be per-net and also we can
  avoid extra conn lookups

Then, a few cleanups for nf_tables:

* keep BH enabled during nft_set_rbtree inserts, this is possible because
  the root lock is now only taken from control plane.
* toss a few EXPORT_SYMBOLs from nf_tables; these were historic
  leftovers from back in the day when e.g. set backends were still
  residing in their own modules.
* remove the register tracking infra from nftables.  It was disabled
  years ago in 5.18 and there are no plans to salvage this work; the
  idea was good (remove redundant register stores), but there are just
  too many pitfalls, and better rule structuring (verdict maps)
  largely avoids the scenarios where this would have helped.
====================

Link: https://patch.msgid.link/20260224205048.4718-1-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-25 19:36:29 -08:00
Florian Westphal
6b94d081f8 netfilter: nf_tables: remove register tracking infrastructure
This facility was disabled in commit
9e539c5b6d ("netfilter: nf_tables: disable expression reduction infra"),
because not all nft_exprs guarantee they will update the destination
register: some may set NFT_BREAK instead to cancel evaluation of the
rule.

This has been dead code ever since.
There are no plans to salvage this at this time, so remove this.

Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://patch.msgid.link/20260224205048.4718-10-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-25 19:36:26 -08:00
Florian Westphal
b6461103e0 netfilter: nf_tables: drop obsolete EXPORT_SYMBOLs
These are no longer required, calling objects are nowadays
baked into nf_tables.ko itself.

Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://patch.msgid.link/20260224205048.4718-9-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-25 19:36:26 -08:00
Florian Westphal
3aea466a43 netfilter: nft_set_rbtree: don't disable bh when acquiring tree lock
As of commit 7e43e0a114
("netfilter: nft_set_rbtree: translate rbtree to array for binary search")
the lock is only taken from control plane, no need to disable BH anymore.

Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://patch.msgid.link/20260224205048.4718-8-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-25 19:36:26 -08:00
Julian Anastasov
09b71fb459 ipvs: no_cport and dropentry counters can be per-net
Change the no_cport counters to be per-net and address family.
This should reduce the extra conn lookups done during present
NO_CPORT connections.

By changing from global to per-net dropentry counters, one net
will not affect the drop rate of another net.

Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://patch.msgid.link/20260224205048.4718-7-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-25 19:36:26 -08:00
Julian Anastasov
c59bd9e62e ipvs: use more counters to avoid service lookups
When a new connection is created we can look up services multiple
times to support fallback options. We already have some counters
to skip specific lookups because it costs CPU cycles for hash
calculation, etc.

Add more counters for fwmark/non-fwmark services (fwm_services and
nonfwm_services) and make all counters per address family.

Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://patch.msgid.link/20260224205048.4718-6-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-25 19:36:26 -08:00
Julian Anastasov
40fb72209f ipvs: do not keep dest_dst after dest is removed
Until now, dest->dest_dst was not released when a server was moved into
the dest_trash list after removal. As a result, we could keep dst/dev
references for a long time without actively using them.

It is better to avoid walking the dest_trash list when
ip_vs_dst_event() receives dev events. So, make sure we do not
hold dev references in dest_trash list. As packets can be flying
while server is being removed, check the IP_VS_DEST_F_AVAILABLE
flag in slow path to ensure we do not save new dev references to
removed servers.

Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://patch.msgid.link/20260224205048.4718-5-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-25 19:36:26 -08:00
Julian Anastasov
b24ae1a387 ipvs: use single svc table
fwmark based services and non-fwmark based services can be hashed
in same service table. This reduces the burden of working with two
tables.

Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://patch.msgid.link/20260224205048.4718-4-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-25 19:36:25 -08:00
Julian Anastasov
3de0ec2873 ipvs: some service readers can use RCU
Some places walk the services under mutex but they can just use RCU:

* ip_vs_dst_event() uses ip_vs_forget_dev() which uses its own lock
  to modify dest
* ip_vs_genl_dump_services(): ip_vs_genl_fill_service() just fills skb
* ip_vs_genl_parse_service(): move RCU lock to callers
  ip_vs_genl_set_cmd(), ip_vs_genl_dump_dests() and ip_vs_genl_get_cmd()
* ip_vs_genl_dump_dests(): just fill skb

Signed-off-by: Julian Anastasov <ja@ssi.bg>
Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://patch.msgid.link/20260224205048.4718-3-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-25 19:36:25 -08:00
Jiejian Wu
74455a5b43 ipvs: make ip_vs_svc_table and ip_vs_svc_fwm_table per netns
Current ipvs uses one global mutex "__ip_vs_mutex" to keep the global
"ip_vs_svc_table" and "ip_vs_svc_fwm_table" safe. But when there are
tens of thousands of services from different netns in the table, it
takes a long time to look up the table, for example, using "ipvsadm
-ln" from different netns simultaneously.

We make "ip_vs_svc_table" and "ip_vs_svc_fwm_table" per netns, and we
add "service_mutex" per netns to keep these two tables safe instead of
the global "__ip_vs_mutex" in current version. To this end, looking up
services from different netns simultaneously will not get stuck,
shortening the time consumption in large-scale deployment. It can be
reproduced using the simple scripts below.

init.sh: #!/bin/bash
for((i=1;i<=4;i++));do
        ip netns add ns$i
        ip netns exec ns$i ip link set dev lo up
        ip netns exec ns$i sh add-services.sh
done

add-services.sh: #!/bin/bash
for((i=0;i<30000;i++)); do
        ipvsadm -A  -t 10.10.10.10:$((80+$i)) -s rr
done

runtest.sh: #!/bin/bash
for((i=1;i<4;i++));do
        ip netns exec ns$i ipvsadm -ln > /dev/null &
done
ip netns exec ns4 ipvsadm -ln > /dev/null

Run "sh init.sh" to initiate the network environment. Then run "time
./runtest.sh" to evaluate the time consumption. Our testbed is a 4-core
Intel Xeon ECS. The result of the original version is around 8 seconds,
while the result of the modified version is only 0.8 seconds.

Signed-off-by: Jiejian Wu <jiejian@linux.alibaba.com>
Co-developed-by: Dust Li <dust.li@linux.alibaba.com>
Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://patch.msgid.link/20260224205048.4718-2-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-25 19:36:25 -08:00
Eric Woudstra
7717fbb140 net: pppoe: avoid zero-length arrays in struct pppoe_hdr
Jakub Kicinski reported the following issue in upcoming patches:

W=1 C=1 GCC build gives us:

net/bridge/netfilter/nf_conntrack_bridge.c: note: in included file (through
../include/linux/if_pppox.h, ../include/uapi/linux/netfilter_bridge.h,
../include/linux/netfilter_bridge.h): include/uapi/linux/if_pppox.h:
153:29: warning: array of flexible structures

sparse doesn't like that hdr has a zero-length array which overlaps
proto. The kernel code doesn't currently need those arrays.

PPPoE connection is functional after applying this patch.

Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Reviewed-by: Kees Cook <kees@kernel.org>
Signed-off-by: Eric Woudstra <ericwouds@gmail.com>
Link: https://patch.msgid.link/20260224155030.106918-1-ericwouds@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-25 19:25:08 -08:00
Abhilekh Deka
8debe7a223 net/ibmveth: fix comment typos in ibmveth.c
Correct spelling mistakes in comments:
- Fix misspelling of gro_receive
- Fix misspelling of Partition

Signed-off-by: Abhilekh Deka <abhindeka@gmail.com>
Reviewed-by: Dave Marquardt <davemarq@linux.ibm.com>
Link: https://patch.msgid.link/20260224153601.17534-1-abhindeka@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-25 19:23:04 -08:00
Nicolai Buchwitz
45ce4b753a net: cadence: macb: add ethtool nway_reset support
Wire phy_ethtool_nway_reset() as the .nway_reset ethtool operation,
allowing userspace to restart PHY autonegotiation via 'ethtool -r'.

Signed-off-by: Nicolai Buchwitz <nb@tipi-net.de>
Link: https://patch.msgid.link/20260224145723.49450-1-nb@tipi-net.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-25 19:22:32 -08:00
Jakub Kicinski
23a611b9b3 Merge branch 'net-stmmac-fix-interrupt-coalescing'
Russell King says:

====================
net: stmmac: fix interrupt coalescing

While cleaning up the descriptor handling, I noticed that the accounting
of transmit "packets" for interrupt coalescing was buggy in that it
takes the difference of the two indexes into the circular list of
transmit descriptors and merely subtracts one from the other without
regard for the indexes wrapping.

This can result in a negative number or very large positive number
which would have the effect of either reducing tx_q->tx_count_frames
or making that very large.

Either way, the result is numerically incorrect, and could trigger
interrupts or not trigger interrupts when required.

This series converts stmmac to use the circ_buf helpers, and then fixes
this problem.
====================

Link: https://patch.msgid.link/aZ1o2dmfpeiubCik@shell.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-25 19:12:36 -08:00
Russell King (Oracle)
dd53a0e859 net: stmmac: fix transmit interrupt coalescing
The accounting for transmit frames does not count the descriptors
correctly. It uses:

	tx_packets = (tx_q->cur_tx + 1) - first_tx;

however, these are indexes into a circular buffer, so cur_tx can be
less than first_tx, and when that happens, tx_packets becomes a very
large unsigned integer. When this is added to tx_q->tx_count_frames,
it has the effect of reducing the count of frames, possibly causing
it to also wrap to a very large unsigned integer.

Fix this by using CIRC_CNT() to calculate the number of descriptors
used.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/E1vuoIl-0000000Aouz-0ttb@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-25 19:12:34 -08:00
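
The wrap-around problem described in the commit above can be reproduced with plain integer arithmetic. The sketch below models C's unsigned 32-bit subtraction in Python and compares it with a mirror of the CIRC_CNT() macro from include/linux/circ_buf.h. The ring size of 512, the index values, and the way the arguments are passed to circ_cnt() are assumptions chosen for illustration; CIRC_CNT() itself requires a power-of-two size.

```python
U32 = 0xFFFFFFFF  # mask used to model C unsigned 32-bit arithmetic


def naive_count(cur_tx, first_tx):
    """The old expression, (cur_tx + 1) - first_tx, as an unsigned int."""
    return (cur_tx + 1 - first_tx) & U32


def circ_cnt(head, tail, size):
    """Mirror of CIRC_CNT(); `size` must be a power of two."""
    return (head - tail) & (size - 1)


RING = 512  # hypothetical descriptor ring size

# No wrap: descriptors 10, 11, 12, 13 were used.
no_wrap_naive = naive_count(13, 10)
no_wrap_circ = circ_cnt(13 + 1, 10, RING)

# Wrap: descriptors 510, 511, 0, 1 were used, so cur_tx < first_tx.
wrap_naive = naive_count(1, 510)
wrap_circ = circ_cnt(1 + 1, 510, RING)
```

Both methods agree on 4 when the indexes do not wrap. Once they wrap, the naive subtraction yields 4294966788 while the masked count still yields 4.
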
Russell King (Oracle)
819101c3c1 net: stmmac: use circ_buf helpers for descriptors
The stmmac descriptor queues are circular buffers, operated as far as
the hardware is concerned as either a ring, or a chain that loops back
on itself. From the software perspective, it forms a circular buffer.

We have a few places which calculate the number of in-use and free
entries in these circular buffers, for which we have macros.
Use CIRC_CNT() and CIRC_SPACE() as appropriate to calculate these
values.

Validating, for stmmac_tx_avail(), which uses CIRC_SPACE():

  dirty_tx = 1, cur_tx = 0 -> 0
  dirty_tx = 0, cur_tx = 0 -> dma_tx_size - 1
  dirty_tx = 0, cur_tx = 1 -> dma_tx_size - 2

dirty_tx passed as end, reduced by one. cur_tx passed as start.
Output on sane computers is identical.

For stmmac_rx_dirty(), which uses CIRC_CNT():

  dirty_rx = 1, cur_rx = 0 -> dma_rx_size - 1
  dirty_rx = 0, cur_rx = 0 -> 0
  dirty_rx = 0, cur_rx = 1 -> 1

dirty_rx passed as start, cur_rx passed as end. Output is identical.

Same validation performed on the is_last_segment calculation, which
also gets converted to CIRC_CNT().

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/E1vuoIg-0000000Aout-0LyS@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-25 19:12:34 -08:00
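
The validation tables in the commit above can be checked mechanically. The sketch below mirrors CIRC_CNT() and CIRC_SPACE() from include/linux/circ_buf.h in Python; the wrapper names tx_avail and rx_dirty, the ring size of 512, and the argument order (cur_* as head, dirty_* as tail) are assumptions, chosen because they reproduce the values listed in the message.

```python
def circ_cnt(head, tail, size):
    """Mirror of CIRC_CNT(); `size` must be a power of two."""
    return (head - tail) & (size - 1)


def circ_space(head, tail, size):
    """Mirror of CIRC_SPACE(), defined as CIRC_CNT(tail, head + 1, size)."""
    return circ_cnt(tail, head + 1, size)


SIZE = 512  # hypothetical value for dma_tx_size and dma_rx_size


def tx_avail(dirty_tx, cur_tx):
    """Free transmit descriptors (illustrative wrapper)."""
    return circ_space(cur_tx, dirty_tx, SIZE)


def rx_dirty(dirty_rx, cur_rx):
    """Receive descriptors awaiting refill (illustrative wrapper)."""
    return circ_cnt(cur_rx, dirty_rx, SIZE)


tx_table = [tx_avail(1, 0), tx_avail(0, 0), tx_avail(0, 1)]
rx_table = [rx_dirty(1, 0), rx_dirty(0, 0), rx_dirty(0, 1)]
```

With SIZE = 512, tx_table evaluates to 0, SIZE - 1 and SIZE - 2, and rx_table to SIZE - 1, 0 and 1, matching the three rows given for each function.
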
kexinsun
51432958b5 rds: update outdated comment
The function rds_send_reset() was subsumed by rds_send_path_reset()
by commit d769ef81d5 ("RDS: Update rds_conn_shutdown to work with
rds_conn_path").  Update the comment accordingly.

Signed-off-by: kexinsun <kexinsun@smail.nju.edu.cn>
Link: https://patch.msgid.link/20260224020720.1174-1-kexinsun@smail.nju.edu.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-25 19:03:55 -08:00
Rosen Penev
dc2a1facbd net: fs_enet: allow nvmem to override MAC address
NVMEM typically loads after the ethernet driver and
of_get_ethdev_address returns -EPROBE_DEFER. Return in such a case to
allow NVMEM to work.

Signed-off-by: Rosen Penev <rosenp@gmail.com>
Link: https://patch.msgid.link/20260224014607.353378-1-rosenp@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-25 19:01:24 -08:00
Jakub Kicinski
6698d6ce6a selftests: hw-net: tso: set a TCP window clamp to avoid spurious drops
The TSO test wants to make sure that there aren't a lot of retransmits,
because that could indicate that the device has a buggy TSO implementation.
On debug kernels, however, we're likely to see significant packet loss
because we simply overwhelm the receiver.

In a QEMU loop with virtio devices we see ~10% false positive rate
with an occasional run hitting the threshold of 25% packet loss.

Since we're only sending 4MB of data, set a TCP_WINDOW_CLAMP to 200k.
This seems to make virtio happy while having little impact since we're
primarily interested in testing the sender, and the test doesn't
currently enable BIG TCP.

Running socat over virtio loop for 2 sec on a debug kernel shows:

  TcpOutSegs                      27327              0.0
  TcpRetransSegs                  83                 0.0

  TcpOutSegs                      30012              0.0
  TcpRetransSegs                  80                 0.0

  TcpOutSegs                      28767              0.0
  TcpRetransSegs                  77                 0.0

But with the clamp the 3 attempts show no retransmit:

  TcpOutSegs                      31537              0.0
  TcpRetransSegs                  0                  0.0

  TcpOutSegs                      30323              0.0
  TcpRetransSegs                  0                  0.0

  TcpOutSegs                      28700              0.0
  TcpRetransSegs                  0                  0.0

Since we expect no receiver-related drops now we can significantly
increase test's sensitivity to drops.

All the testing we do in NIPA uses cubic.

Link: https://patch.msgid.link/20260223204030.4142884-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-25 18:59:57 -08:00
Russell King (Oracle)
8215d7cbfb net: stmmac: fix EEE supportable interfaces
According to the dwmac v3.74a databook, only MII, GMII and RGMII dwmac
interface modes are supported for EEE. Restrict EEE to these modes, or
the modes supported by a PCS other than the GMAC's integrated PCS.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1vuUsD-0000000Afci-0XxO@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-25 18:51:19 -08:00
Rosen Penev
d2adf01780 net: freescale: ucc_geth: call of_node_put once
Move it up to avoid placing it in both the error and success paths.

Signed-off-by: Rosen Penev <rosenp@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260224014141.352642-1-rosenp@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-24 18:26:30 -08:00
Jakub Kicinski
7235555e9a Merge branch 'selftests-net-py-improve-bkg-error-reporting'
Jakub Kicinski says:

====================
selftests: net: py: improve bkg() error reporting

bkg() is a helper for running commands in the background.
When the init or body of a with block fails, check if the bkg()
process already exited and report its status (including
stdout/stderr). This significantly improves debuggability.
====================

Link: https://patch.msgid.link/20260223202633.4126087-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-24 18:25:32 -08:00
Jakub Kicinski
6e4dff2002 selftests: net: py: add cmd info for ksft_wait failure
Gal recently complained:

  When [ksft_wait failure] happens, the test fails with a cryptic
  message:
    # Exception| Exception: Did not receive ready message

Let's try to include the stdout/stderr of the command we tried
to start. E.g. for cmd("false", ksft_wait=True):

    # Exception| lib.py.utils.CmdInitFailure: Did not receive ready message
    # Exception| CMD: false
    # Exception|   EXIT: 1

We need to factor out _process_terminate() otherwise the exit
path may try to write to already disconnected self.ksft_term_fd.

Reviewed-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260223202633.4126087-4-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-24 18:25:29 -08:00
Jakub Kicinski
04abab18e1 selftests: net: py: use repr(cmd) for failure exceptions
Reuse repr(cmd) instead of manually formatting a similar string.

Before:
  # Exception| lib.py.utils.CmdExitFailure: Command failed: false
  # Exception| STDOUT: b''
  # Exception| STDERR: b''

After:
  # Exception| lib.py.utils.CmdExitFailure: Command failed
  # Exception| CMD: false
  # Exception|   EXIT: 1

Reviewed-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260223202633.4126087-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-24 18:25:29 -08:00
Jakub Kicinski
d99aa5912c selftests: net: py: avoid masking exceptions in bkg() failures
bkg() failures are currently quite hard to debug and spot.
Often we have code along the lines of:

  with bkg("./cmd_rx_something -p PORT"):
       wait_port_listen(PORT)
       cmd("./cmd_tx_something", host=remote)

When wait_port_listen() fails we don't get to see the exit status
of bkg(). Even though very often it's a failure in the bkg() command
that's actually to blame. Try not to interfere with the bkg()
command error checking.

With:

   with bkg("false", exit_wait=True):
        time.sleep(0.01)  # let the 'false' cmd run
        raise Exception("bla")

Before:

  .. stack trace ..
  # Exception| Exception: bla

After:

  .. stack trace ..
  # Exception| Exception: bla
  # Exception|
  # Exception| During handling of the above exception, another exception occurred:
  .. stack trace ..
  # Exception| lib.py.utils.CmdExitFailure: Command failed: false
  # Exception| STDOUT: b''
  # Exception| STDERR: b''

Reviewed-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260223202633.4126087-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-24 18:25:29 -08:00
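
The "After" output in the commit above relies on standard Python behaviour: an exception raised from __exit__() while the with body's exception is being handled is chained to it, so both are reported. The sketch below shows that mechanism in isolation. BkgSketch and the local CmdExitFailure class are hypothetical stand-ins, not the selftest library's bkg() implementation, and the failing command is a Python one-liner used in place of "false".

```python
import subprocess
import sys


class CmdExitFailure(Exception):
    """Stand-in for the selftest library's exception of the same name."""


class BkgSketch:
    """Hypothetical, stripped-down background-command context manager."""

    def __init__(self, argv):
        self.proc = subprocess.Popen(argv)

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        ret = self.proc.wait()
        if ret != 0:
            # Raised even when the body already failed. Python records the
            # body's exception as __context__, so neither one is lost.
            raise CmdExitFailure(f"Command failed, exit status {ret}")
        return False  # do not swallow the body's exception


failing_cmd = [sys.executable, "-c", "import sys; sys.exit(1)"]
caught = None
try:
    with BkgSketch(failing_cmd):
        raise RuntimeError("bla")
except CmdExitFailure as err:
    caught = err
```

Here caught is the command failure and caught.__context__ is the original RuntimeError("bla"), which is what produces the "During handling of the above exception, another exception occurred" section in a traceback.
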
Jakub Kicinski
6c32b07650 eth: bnxt: rename ring_err_stats -> ring_drv_stats
We recently added GRO stats to bnxt, which are maintained
by the driver. Having "err" in the name of the struct for
ring stats no longer makes sense (as pointed out by Michael,
see Link).

Rename them to "drv" stats, as these are all maintained
by the driver (even if partially based on info from descriptors).
Michael suggested calling these misc, happy to go back to
that. IMHO "drv" is a bit more meaningful than "misc".

Pure rename using sed, no functional changes.

Link: https://lore.kernel.org/CACKFLimgibJ0qkM1AacZVh8MKKy-pE_AAc4KPKZ7GUqebmXW9A@mail.gmail.com
Reviewed-by: Michael Chan <michael.chan@broadcom.com>
Link: https://patch.msgid.link/20260223203702.4137801-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-24 18:24:46 -08:00
Kuniyuki Iwashima
fc1f97929a bonding: Optimise is_netpoll_tx_blocked().
bond_start_xmit() spends some cycles in is_netpoll_tx_blocked():

  if (unlikely(is_netpoll_tx_blocked(dev)))
      return NETDEV_TX_BUSY;

because of the "pushf;pop reg" sequence (aka irqs_disabled()).

Let's swap the conditions in is_netpoll_tx_blocked() and
convert netpoll_block_tx to a static key.

Before:

   1.23 │       mov    %gs:0x28,%rax
   1.24 │       mov    %rax,0x18(%rsp)
  29.45 │       pushfq
   0.50 │       pop    %rax
   0.47 │       test   $0x200,%eax
        │     ↓ je     1b4
   0.49 │ 32:   lea    0x980(%rsi),%rbx

After:

   0.72 │       mov    %gs:0x28,%rax
   0.81 │       mov    %rax,0x18(%rsp)
   0.82 │       nop
   2.77 │ 2a:   lea    0x980(%rsi),%rbx

Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260223230749.2376145-1-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-24 18:13:38 -08:00
Eric Dumazet
64db5933c7 icmp: increase net.ipv4.icmp_msgs_{per_sec,burst}
These sysctls were added in 4cdf507d54 ("icmp: add a global rate
limitation") and their default values might be too small.

Some network tools send probes to closed UDP ports from many hosts
to estimate proportion of packet drops on a particular target.

This patch sets both sysctls to 10000.

Note the per-peer rate-limit (as described in RFC 4443 2.4 (f))
intent is still enforced.

This also increases security, see b38e7819ca
("icmp: randomize the global rate limiter") for reference.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260223161742.929830-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-24 17:50:12 -08:00
Eric Dumazet
539a6cf084 tcp: move inet6_csk_update_pmtu() to tcp_ipv6.c
This function is only called from tcp_v6_mtu_reduced() and can be
(auto)inlined by the compiler.

Note that inet6_csk_route_socket() is no longer (auto)inlined,
which is a good thing as it is slow path.

$ scripts/bloat-o-meter -t vmlinux.0 vmlinux.1

add/remove: 0/2 grow/shrink: 2/0 up/down: 93/-129 (-36)
Function                                     old     new   delta
tcp_v6_mtu_reduced                           139     228     +89
inet6_csk_route_socket                       486     490      +4
__pfx_inet6_csk_update_pmtu                   16       -     -16
inet6_csk_update_pmtu                        113       -    -113
Total: Before=25076512, After=25076476, chg -0.00%

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260223153047.886683-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-24 17:47:27 -08:00
Eric Dumazet
fca59a2dd0 tcp: reduce calls to tcp_schedule_loss_probe()
For RPC workloads, we alternate tcp_schedule_loss_probe() calls from
output path and from input path, with tp->packets_out value
oscillating between !zero and zero, leading to poor branch prediction.

Move tp->packets_out check from tcp_schedule_loss_probe() to
tcp_set_xmit_timer().

We avoid one call to tcp_schedule_loss_probe() from tcp_ack()
path for typical RPC workloads, while improving branch prediction.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260223113501.4070245-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-24 17:44:33 -08:00
Jakub Kicinski
a09eb622f3 Merge branch 'net-stmmac-qcom-ethqos-cleanups-and-re-organise-serdes-handling'
Russell King says:

====================
net: stmmac: qcom-ethqos: cleanups and re-organise SerDes handling

As the last series had issues with stability, I've changed the approach
in this series to concentrate on keeping much of the SerDes related
code within the qcom-ethqos driver rather than trying to move it out at
this stage. This means it should be possible to bisect these patches and
pinpoint exactly the code movement that causes any instability.

This series starts with various cleanups to qcom-ethqos (the first four
patches) before beginning to move code, passing phylink's phy interface
(which will change) to the fix_mac_speed() method, and then using that
to configure the serdes and inband setting before moving the SerDes
code.

This patch set has been tested.
====================

Link: https://patch.msgid.link/aZwfAFJQcp9f0niI@shell.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-24 17:43:25 -08:00
Russell King (Oracle)
9192320a65 net: stmmac: qcom-ethqos: convert to set_clk_tx_rate() method
Set the RGMII link clock using the set_clk_tx_rate() method rather than
coding it into the .fix_mac_speed() method. This simplifies ethqos's
ethqos_fix_mac_speed().

Tested-by: Mohd Ayaan Anwar <mohd.anwar@oss.qualcomm.com>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1vuSLF-0000000ASci-42kh@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-24 17:43:23 -08:00
Russell King (Oracle)
fb42f19e67 net: stmmac: qcom-ethqos: move SerDes speed configuration
Move the SerDes speed configuration to phylink's .mac_finish() stage
so that the SerDes is appropriately configured for the interface mode
prior to the link coming up.

Reviewed-by: Mohd Ayaan Anwar <mohd.anwar@oss.qualcomm.com>
Tested-by: Mohd Ayaan Anwar <mohd.anwar@oss.qualcomm.com>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1vuSLA-0000000AScc-3RFf@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-24 17:43:23 -08:00
Russell King (Oracle)
b8ab32315e net: stmmac: qcom-ethqos: use phy interface mode for inband
qcom-ethqos currently forces inband to be enabled for the Cisco SGMII
speeds (1G, 100M and 10M) but not for 2500BASE-X (2.5G).

Rather than using the speed to determine the forced inband state, use
phylink's PHY interface mode which will switch between SGMII for the
10M, 100M and 1G speeds, and 2500BASE-X for 2.5G.

Reviewed-by: Mohd Ayaan Anwar <mohd.anwar@oss.qualcomm.com>
Tested-by: Mohd Ayaan Anwar <mohd.anwar@oss.qualcomm.com>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1vuSL5-0000000AScX-2wuM@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-24 17:43:23 -08:00
Russell King (Oracle)
b560938163 net: stmmac: qcom-ethqos: pass phy interface mode to configs
Pass the current phylink phy interface mode to the RGMII and "SGMII"
configuration functions.

Reviewed-by: Mohd Ayaan Anwar <mohd.anwar@oss.qualcomm.com>
Tested-by: Mohd Ayaan Anwar <mohd.anwar@oss.qualcomm.com>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1vuSL0-0000000AScM-2TN0@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-24 17:43:22 -08:00
Russell King (Oracle)
cd0aa65153 net: stmmac: pass interface mode into fix_mac_speed() method
Pass the current interface mode reported by phylink into the
fix_mac_speed() method. This will be used by qcom-ethqos for its
"SGMII" configuration.

Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Tested-by: Mohd Ayaan Anwar <mohd.anwar@oss.qualcomm.com>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1vuSKv-0000000AScG-1zv6@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-24 17:43:22 -08:00
Russell King (Oracle)
834c72ca30 net: stmmac: qcom-ethqos: move loopback disable to .mac_finish()
Loopback is enabled to allow the dwmac soft reset to succeed. This
is enabled when clocks are enabled in ethqos_clks_config(), which
happens at driver probe and runtime PM resume - e.g. when the
network device is administratively brought up.

Currently, the loopback is disabled when the link comes up (via
.mac_link_up() calling this driver's .fix_mac_speed().)

Move the qcom_ethqos_set_sgmii_loopback() call which disables
loopback from ethqos_fix_mac_speed() into ethqos' SerDes specific
.mac_finish() method so that loopback is disabled a little earlier
after reset has completed, and dwmac setup has completed.

Reviewed-by: Mohd Ayaan Anwar <mohd.anwar@oss.qualcomm.com>
Tested-by: Mohd Ayaan Anwar <mohd.anwar@oss.qualcomm.com>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1vuSKq-0000000AScA-1Wh3@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-24 17:43:22 -08:00
Russell King (Oracle)
3baa791f19 net: stmmac: qcom-ethqos: move qcom_ethqos_set_sgmii_loopback() up
ethqos_set_func_clk_en() configures both SGMII loopback and the RGMII
functional clock setting. qcom_ethqos_set_sgmii_loopback() is only
called from within ethqos_set_func_clk_en(), and checks for
PHY_INTERFACE_MODE_2500BASEX.

Move qcom_ethqos_set_sgmii_loopback() to the callers of
ethqos_set_func_clk_en() except for ethqos_configure_rgmii() where we
know that ethqos->phy_mode will not be PHY_INTERFACE_MODE_2500BASEX.

Reviewed-by: Mohd Ayaan Anwar <mohd.anwar@oss.qualcomm.com>
Tested-by: Mohd Ayaan Anwar <mohd.anwar@oss.qualcomm.com>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1vuSKl-0000000ASc1-18ka@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-24 17:43:22 -08:00
Russell King (Oracle)
649a00c392 net: stmmac: qcom-ethqos: change ethqos_configure*() to return void
The ethqos_configure*() family of functions always return zero, and the
return value is never checked. Change the int return type to void.

Reviewed-by: Mohd Ayaan Anwar <mohd.anwar@oss.qualcomm.com>
Tested-by: Mohd Ayaan Anwar <mohd.anwar@oss.qualcomm.com>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1vuSKg-0000000ASbv-0iWL@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-24 17:43:22 -08:00
Russell King (Oracle)
e6f43a41ba net: stmmac: qcom-ethqos: remove register field value obfuscations
Convert the register field values to something more human readable.

For example, using (BIT(29) | BIT(27)) to update a register field that
consists of bits 29:27 is an obfuscated way of writing decimal 5 for
this field. The comment above needs to explain that this value is 5.

Worse still is BIT(12) | GENMASK(9, 8), which is used to hide the
decimal value 19 for the bitfield 16:8.

Fix these, and a few others, by using FIELD_PREP(). While it means we
have bare numeric constants, this is preferable to having the
obfuscation.
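
The two equivalences quoted above can be checked in userspace. The
following is a minimal sketch only: BIT(), GENMASK() and field_prep() here
are simplified 32-bit stand-ins written for illustration (an assumption),
not the kernel's definitions from <linux/bits.h> and <linux/bitfield.h>.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified 32-bit stand-ins for the kernel macros (illustrative only). */
#define BIT(n)         (UINT32_C(1) << (n))
#define GENMASK(h, l)  ((UINT32_MAX >> (31 - (h))) & (UINT32_MAX << (l)))

/* Place val into the field described by mask, as FIELD_PREP() would. */
static inline uint32_t field_prep(uint32_t mask, uint32_t val)
{
	uint32_t lsb;

	assert(mask != 0);
	lsb = mask & (~mask + 1);	/* isolate the lowest set bit of mask */
	return (val * lsb) & mask;	/* shift val into place, clip to field */
}

/* Returns 1 when both obfuscated forms equal their readable forms. */
static inline int field_values_match(void)
{
	return (BIT(29) | BIT(27)) == field_prep(GENMASK(29, 27), 5) &&
	       (BIT(12) | GENMASK(9, 8)) == field_prep(GENMASK(16, 8), 19);
}
```

Calling field_values_match() from a small test program is enough to
confirm that 5 in bits 29:27 and 19 in bits 16:8 are the values the old
expressions encoded.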

Reviewed-by: Mohd Ayaan Anwar <mohd.anwar@oss.qualcomm.com>
Tested-by: Mohd Ayaan Anwar <mohd.anwar@oss.qualcomm.com>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1vuSKa-0000000ASbo-2zQg@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-24 17:43:22 -08:00
Russell King (Oracle)
ebfc2be12e net: stmmac: qcom-ethqos: rename "por" members to "rgmii_por"
Rename the "por" and "num_por" members to indicate that they are for
RGMII mode only as ethqos_configure_rgmii() is the only place that the
values are programmed into the registers.

Reviewed-by: Mohd Ayaan Anwar <mohd.anwar@oss.qualcomm.com>
Tested-by: Mohd Ayaan Anwar <mohd.anwar@oss.qualcomm.com>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1vuSKV-0000000ASbg-28JK@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-24 17:43:21 -08:00
Jakub Kicinski
583706230e Merge branch 'net-ethernet-enic-add-vic-ids-and-link-modes'
Satish Kharat says:

====================
eth: enic: add VIC ids and link modes

Add VIC subsystem ids and their supported/advertised media types so ethtool
reflects the hardware capabilities for the VIC variants.
====================

Link: https://patch.msgid.link/20260223-enic-cscwi36355-v2-0-63488194a974@cisco.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-24 17:20:13 -08:00
Satish Kharat
426f1f5b87 net:ethernet:enic: map ethtool link modes by VIC type
Report supported media types based on the VIC subsystem ID so ethtool
reflects the hardware capabilities.

Signed-off-by: Satish Kharat <satishkh@cisco.com>
Link: https://patch.msgid.link/20260223-enic-cscwi36355-v2-2-63488194a974@cisco.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-24 17:20:11 -08:00
Satish Kharat
472e079f8c net:ethernet:enic: add VIC subsystem ids
Add VIC subsystem ids for the 12xx, 13xx, 14xx and 15xxx series.

Signed-off-by: Satish Kharat <satishkh@cisco.com>
Link: https://patch.msgid.link/20260223-enic-cscwi36355-v2-1-63488194a974@cisco.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-24 17:20:11 -08:00
Gabriel Goller
3197cce4d4 docs: net: document neigh gc_stale_time sysctl
Add missing documentation for a neighbor table garbage collector sysctl
parameter in ip-sysctl.rst:

neigh/default/gc_stale_time: controls how long an unused neighbor entry
is kept before becoming eligible for garbage collection (default: 60
seconds)
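
For reference, the knob can be read from userspace through procfs. This
is a minimal sketch, assuming the IPv4 default table is exposed at
/proc/sys/net/ipv4/neigh/default/gc_stale_time; read_gc_stale_time() is a
hypothetical helper name, not part of this patch.

```c
#include <assert.h>
#include <stdio.h>

/* Assumed procfs location of the IPv4 default neighbour-table knob. */
#define GC_STALE_TIME_PATH "/proc/sys/net/ipv4/neigh/default/gc_stale_time"

/* Returns the value in seconds, or -1 if the file cannot be read or parsed. */
static inline long read_gc_stale_time(const char *path)
{
	long secs = -1;
	FILE *f;

	assert(path != NULL);
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (fscanf(f, "%ld", &secs) != 1)
		secs = -1;
	fclose(f);
	return secs;
}
```

On a system with default settings, read_gc_stale_time(GC_STALE_TIME_PATH)
would be expected to return the documented default; the helper returns -1
where procfs is unavailable.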

Signed-off-by: Gabriel Goller <g.goller@proxmox.com>
Link: https://patch.msgid.link/20260223101257.47563-1-g.goller@proxmox.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-24 17:17:06 -08:00
Jakub Kicinski
54ef3e6bbe Merge branch 'tcp-rework-tcp_v-4-6-_send_check'
Eric Dumazet says:

====================
tcp: rework tcp_v{4,6}_send_check()

tcp_v{4,6}_send_check() are only called from __tcp_transmit_skb().

They currently are in different files (tcp_ipv4.c and tcp_ipv6.c)
thus out of line.

This series moves them close to their caller so that the compiler
can inline them.

For all patches in the series:

$ scripts/bloat-o-meter -t vmlinux.0 vmlinux.3
add/remove: 0/2 grow/shrink: 1/3 up/down: 102/-178 (-76)
Function                                     old     new   delta
__tcp_transmit_skb                          3321    3423    +102
tcp_v4_send_check                            136     132      -4
__tcp_v4_send_check                          130     121      -9
mptcp_subflow_init                           777     763     -14
__pfx_tcp_v6_send_check                       16       -     -16
tcp_v6_send_check                            135       -    -135
Total: Before=25143100, After=25143024, chg -0.00%
====================
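
The summary line of the table above can be cross-checked from the
per-function deltas. A small sketch, with the numbers copied from the
table; the helper and array names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Per-function deltas copied from the bloat-o-meter table above. */
static const long bloat_deltas[] = { 102, -4, -9, -14, -16, -135 };

/*
 * Split the deltas into total growth (*up) and total shrinkage (*down),
 * and return the net change in bytes.
 */
static inline long bloat_totals(const long *d, size_t n, long *up, long *down)
{
	assert(d != NULL && up != NULL && down != NULL);
	*up = 0;
	*down = 0;
	for (size_t i = 0; i < n; i++) {
		if (d[i] > 0)
			*up += d[i];
		else
			*down += d[i];
	}
	return *up + *down;
}
```

With these inputs the helper yields up/down of 102/-178 and a net change
of -76 bytes, which matches both the "up/down" summary and the difference
between the Before and After totals (25143100 and 25143024).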

Link: https://patch.msgid.link/20260223100729.3761597-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-24 17:16:19 -08:00
Eric Dumazet
fcd3d039fa tcp: make tcp_v{4,6}_send_check() static
tcp_v{4,6}_send_check() are only called from tcp_output.c
and should be made static so that the compiler does not need
to emit an out-of-line copy of them.

Remove (struct inet_connection_sock_af_ops) send_check field
and use @net_header_len instead.

Move @net_header_len close to @queue_xmit for data locality
as both are used in TCP tx fast path.
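
One way to read "use @net_header_len instead" is that the caller picks
the IPv4 or IPv6 helper by comparing the header length rather than
calling through a function pointer. The userspace mock below is a sketch
of that idea only: every name in it is hypothetical, and it is an
assumption about the approach, not the code from this patch.

```c
#include <assert.h>
#include <stddef.h>

/* Fixed header sizes: 20 bytes for IPv4 without options, 40 for IPv6. */
enum { MOCK_IPV4_HDR_LEN = 20, MOCK_IPV6_HDR_LEN = 40 };

/* Hypothetical stand-in for the af_ops structure; only the length is kept. */
struct mock_af_ops {
	unsigned short net_header_len;
};

/* Static helpers in the same file, so the compiler is free to inline them. */
static inline int mock_v4_send_check(void) { return 4; }
static inline int mock_v6_send_check(void) { return 6; }

/* Direct calls selected by header length replace the indirect call. */
static inline int mock_send_check(const struct mock_af_ops *ops)
{
	assert(ops != NULL);
	if (ops->net_header_len == MOCK_IPV6_HDR_LEN)
		return mock_v6_send_check();
	return mock_v4_send_check();
}
```

Because both helpers are static and called directly, no out-of-line copy
is required, which is consistent with tcp_v6_send_check disappearing from
the symbol table in the output below.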

$ scripts/bloat-o-meter -t vmlinux.2 vmlinux.3
add/remove: 0/2 grow/shrink: 0/3 up/down: 0/-172 (-172)
Function                                     old     new   delta
__tcp_transmit_skb                          3426    3423      -3
tcp_v4_send_check                            136     132      -4
mptcp_subflow_init                           777     763     -14
__pfx_tcp_v6_send_check                       16       -     -16
tcp_v6_send_check                            135       -    -135
Total: Before=25143196, After=25143024, chg -0.00%

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260223100729.3761597-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-24 17:16:09 -08:00
Eric Dumazet
255688652b tcp: move tcp_v6_send_check() to tcp_output.c
Move tcp_v6_send_check() so that __tcp_transmit_skb() can inline it.

$ scripts/bloat-o-meter -t vmlinux.1 vmlinux.2
add/remove: 0/0 grow/shrink: 1/0 up/down: 105/0 (105)
Function                                     old     new   delta
__tcp_transmit_skb                          3321    3426    +105
Total: Before=25143091, After=25143196, chg +0.00%

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260223100729.3761597-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-24 17:16:09 -08:00