The RSS ctx tests rely on NFC rules with unique ports to steer packets
to the correct ctx. This updates the test to use the new rand_ports()
helper to guarantee the ports are unique.
Manual testing shows that generating 32 ports with the existing method
would result in at least one duplicate 4% of the time.
Signed-off-by: Dimitri Daskalakis <dimitri.daskalakis1@gmail.com>
Link: https://patch.msgid.link/20260224224659.1507082-3-dimitri.daskalakis1@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Florian Westphal says:
====================
netfilter: updates for net-next
including IPVS updates from and via Julian Anastasov.
First updates for IPVS. From Julians cover-letter:
* Convert the global __ip_vs_mutex to per-net service_mutex and
switch the service tables to be per-net, cowork by Jiejian Wu and
Dust Li
* Convert some code that walks the service lists to use RCU instead of
the service_mutex
* We used two tables for services (non-fwmark and fwmark), merge them
into single svc_table
* The list for unavailable destinations (dest_trash) holds dsts and
thus dev references causing extra work for the ip_vs_dst_event() dev
notifier handler. Change this by dropping the reference when dest
is removed and saved into dest_trash. The dest_trash will need more
changes to make it light for lookups. TODO.
* On new connection we can do multiple lookups for services by trying
different fallback options. Add more counters for service types, so
that we can avoid unneeded lookups for services.
* The no_cport and dropentry counters can be per-net and also we can
avoid extra conn lookups
Then, a few cleanups for nf_tables:
* keep BH enabled during nft_set_rbtree inserts, this is possible because
the root lock is now only taken from control plane.
* toss a few EXPORT_SYMBOLs from nf_tables; these were historic
leftovers from back in the day when e.g. set backends were still
residing in their own modules.
* remove the register tracking infra from nftables. It was disabled
years ago in 5.18 and there are no plans to salvage this work; the
idea was good (remove redundant register stores), but there is just
one too many pitfalls, and better rule structuring (verdict maps)
largely avoids the scenarios where this would have helped.
====================
Link: https://patch.msgid.link/20260224205048.4718-1-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This facility was disabled in commit
9e539c5b6d ("netfilter: nf_tables: disable expression reduction infra"),
because not all nft_exprs guarantee they will update the destination
register: some may set NFT_BREAK instead to cancel evaluation of the
rule.
This has been dead code ever since.
There are no plans to salvage this at this time, so remove this.
Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://patch.msgid.link/20260224205048.4718-10-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Change the no_cport counters to be per-net and address family.
This should reduce the extra conn lookups done during present
NO_CPORT connections.
By changing from global to per-net dropentry counters, one net
will not affect the drop rate of another net.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://patch.msgid.link/20260224205048.4718-7-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When new connection is created we can lookup for services multiple
times to support fallback options. We already have some counters
to skip specific lookups because it costs CPU cycles for hash
calculation, etc.
Add more counters for fwmark/non-fwmark services (fwm_services and
nonfwm_services) and make all counters per address family.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://patch.msgid.link/20260224205048.4718-6-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Before now dest->dest_dst is not released when server is moved into
dest_trash list after removal. As result, we can keep dst/dev
references for long time without actively using them.
It is better to avoid walking the dest_trash list when
ip_vs_dst_event() receives dev events. So, make sure we do not
hold dev references in dest_trash list. As packets can be flying
while server is being removed, check the IP_VS_DEST_F_AVAILABLE
flag in slow path to ensure we do not save new dev references to
removed servers.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://patch.msgid.link/20260224205048.4718-5-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Some places walk the services under mutex but they can just use RCU:
* ip_vs_dst_event() uses ip_vs_forget_dev() which uses its own lock
to modify dest
* ip_vs_genl_dump_services(): ip_vs_genl_fill_service() just fills skb
* ip_vs_genl_parse_service(): move RCU lock to callers
ip_vs_genl_set_cmd(), ip_vs_genl_dump_dests() and ip_vs_genl_get_cmd()
* ip_vs_genl_dump_dests(): just fill skb
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://patch.msgid.link/20260224205048.4718-3-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Current ipvs uses one global mutex "__ip_vs_mutex" to keep the global
"ip_vs_svc_table" and "ip_vs_svc_fwm_table" safe. But when there are
tens of thousands of services from different netns in the table, it
takes a long time to look up the table, for example, using "ipvsadm
-ln" from different netns simultaneously.
We make "ip_vs_svc_table" and "ip_vs_svc_fwm_table" per netns, and we
add "service_mutex" per netns to keep these two tables safe instead of
the global "__ip_vs_mutex" in current version. To this end, looking up
services from different netns simultaneously will not get stuck,
shortening the time consumption in large-scale deployment. It can be
reproduced using the simple scripts below.
init.sh: #!/bin/bash
for((i=1;i<=4;i++));do
ip netns add ns$i
ip netns exec ns$i ip link set dev lo up
ip netns exec ns$i sh add-services.sh
done
add-services.sh: #!/bin/bash
for((i=0;i<30000;i++)); do
ipvsadm -A -t 10.10.10.10:$((80+$i)) -s rr
done
runtest.sh: #!/bin/bash
for((i=1;i<4;i++));do
ip netns exec ns$i ipvsadm -ln > /dev/null &
done
ip netns exec ns4 ipvsadm -ln > /dev/null
Run "sh init.sh" to initiate the network environment. Then run "time
./runtest.sh" to evaluate the time consumption. Our testbed is a 4-core
Intel Xeon ECS. The result of the original version is around 8 seconds,
while the result of the modified version is only 0.8 seconds.
Signed-off-by: Jiejian Wu <jiejian@linux.alibaba.com>
Co-developed-by: Dust Li <dust.li@linux.alibaba.com>
Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://patch.msgid.link/20260224205048.4718-2-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski reported following issue in upcoming patches:
W=1 C=1 GCC build gives us:
net/bridge/netfilter/nf_conntrack_bridge.c: note: in included file (through
../include/linux/if_pppox.h, ../include/uapi/linux/netfilter_bridge.h,
../include/linux/netfilter_bridge.h): include/uapi/linux/if_pppox.h:
153:29: warning: array of flexible structures
sparse doesn't like that hdr has a zero-length array which overlaps
proto. The kernel code doesn't currently need those arrays.
PPPoE connection is functional after applying this patch.
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Reviewed-by: Kees Cook <kees@kernel.org>
Signed-off-by: Eric Woudstra <ericwouds@gmail.com>
Link: https://patch.msgid.link/20260224155030.106918-1-ericwouds@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Russell King says:
====================
net: stmmac: fix interrupt coalescing
While cleaning up the descriptor handling, I noticed that the accounting
of transmit "packets" for interrupt coalescing was buggy in that it
takes the difference of the two indexes into the circular list of
transmit discriptors and merely subtracts one from the other without
regard for the indexes wrapping.
This can result in a negative number or very large positive number
which would have the effect of either reducing tx_q->tx_count_frames
or making that very large.
Either way, the result is numerically incorrect, and could trigger
interrupts or not trigger interrupts when required.
This series converts stmmac to use the circ_buf helpers, and then fixes
this problem.
====================
Link: https://patch.msgid.link/aZ1o2dmfpeiubCik@shell.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The accounting for transmit frames does not count the descriptors
correctly. It uses:
tx_packets = (tx_q->cur_tx + 1) - first_tx;
however, these are indexes into a circular buffer, so cur_tx can be
less than first_tx, and when that happens, tx_packets becomes a very
large unsigned integer. When this is added to tx_q->tx_count_frames,
it has the effect of reducing the count of frames, possibly causing
it to also wrap to a very large unsigned integer.
Fix this by using CIRC_CNT() to calculate the number of descriptors
used.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/E1vuoIl-0000000Aouz-0ttb@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The stmmac descriptor queues are circular buffers, operated as far as
the hardware is concerned as either a ring, or a chain that loops back
on itself. From the software perspective, it forms a circular buffer.
We have a few places which calculate the number of in-use and free
entries in these circular buffers, for which we have macros for.
Use CIRC_CNT() and CIRC_SPACE() as appropriate to calculate these
values.
Validating, for stmmac_tx_avail(), which uses CIRC_SPACE():
dirty_tx = 1, cur_tx = 0 -> 0
dirty_tx = 0, cur_tx = 0 -> dma_tx_size - 1
dirty_tx = 0, cur_tx = 1 -> dma_tx_size - 2
dirty_tx passed as end, reduced by one. cur_tx passed as start.
Output on sane computers is identical.
For stmmac_rx_dirty(), which uses CIRC_CNT():
dirty_rx = 1, cur_rx = 0 -> dma_rx_size - 1
dirty_rx = 0, cur_rx = 0 -> 0
dirty_rx = 0, cur_rx = 1 -> 1
dirty_rx passed as start, cur_rx passed as end. Output is identical.
Same validation performed on the is_last_segment calculation, which
also gets converted to CIRC_CNT().
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/E1vuoIg-0000000Aout-0LyS@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The TSO test wants to make sure that there isn't a lot of retransmits,
because that could indicate that device has a buggy TSO implementation.
On debug kernels, however, we're likely to see significant packet loss
because we simply overwhelm the receiver.
In a QEMU loop with virtio devices we see ~10% false positive rate
with occasional run hitting the threshold of 25% packet loss.
Since we're only sending 4MB of data, set a TCP_WINDOW_CLAMP to 200k.
This seems to make virtio happy while having little impact since we're
primarily interested in testing the sender, and the test doesn't
currently enable BIG TCP.
Running socat over virtio loop for 2 sec on a debug kernel shows:
TcpOutSegs 27327 0.0
TcpRetransSegs 83 0.0
TcpOutSegs 30012 0.0
TcpRetransSegs 80 0.0
TcpOutSegs 28767 0.0
TcpRetransSegs 77 0.0
But with the clamp the 3 attempts show no retransmit:
TcpOutSegs 31537 0.0
TcpRetransSegs 0 0.0
TcpOutSegs 30323 0.0
TcpRetransSegs 0 0.0
TcpOutSegs 28700 0.0
TcpRetransSegs 0 0.0
Since we expect no receiver-related drops now we can significantly
increase test's sensitivity to drops.
All the testing we do in NIPA uses cubic.
Link: https://patch.msgid.link/20260223204030.4142884-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski says:
====================
selftests: net: py: improve bkg() error reporting
bkg() is a helper for running commands in the background.
When init or body of a with() block fails check if the bkg()
process already exited and report its status (including stdout/
/stderr). This significantly improves debugability.
====================
Link: https://patch.msgid.link/20260223202633.4126087-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Gal recently complained:
When [ksft_wait failure] happens, the test fails with a cryptic
message:
# Exception| Exception: Did not receive ready message
Let's try to include the stdout/stderr of the command we tried
to start. E.g. for cmd("false", ksft_wait=True):
# Exception| lib.py.utils.CmdInitFailure: Did not receive ready message
# Exception| CMD: false
# Exception| EXIT: 1
We need to factor out _process_terminate() otherwise the exit
path may try to write to already disconnected self.ksft_term_fd.
Reviewed-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260223202633.4126087-4-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
bkg() failures are currently quite hard to debug and spot.
Often we have code along the lines of:
with bkg("./cmd_rx_something -p PORT"):
wait_port_listen(PORT)
cmd("./cmd_tx_something", host=remote)
When wait_port_listen() fails we don't get to see the exit status
of bkg(). Even tho very often it's a failure in the bkg() command
that's actually to blame. Try not to interfere with the bkg()
command error checking.
With:
with bkg("false", exit_wait=True):
time.sleep(0.01) # let the 'false' cmd run
raise Exception("bla")
Before:
.. stack trace ..
# Exception| Exception: bla
After:
.. stack trace ..
# Exception| Exception: bla
# Exception|
# Exception| During handling of the above exception, another exception occurred:
.. stack trace ..
# Exception| lib.py.utils.CmdExitFailure: Command failed: false
# Exception| STDOUT: b''
# Exception| STDERR: b''
Reviewed-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260223202633.4126087-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
These sysctls were added in 4cdf507d54 ("icmp: add a global rate
limitation") and their default values might be too small.
Some network tools send probes to closed UDP ports from many hosts
to estimate proportion of packet drops on a particular target.
This patch sets both sysctls to 10000.
Note the per-peer rate-limit (as described in RFC 4443 2.4 (f))
intent is still enforced.
This also increases security, see b38e7819ca
("icmp: randomize the global rate limiter") for reference.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260223161742.929830-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This function is only called from tcp_v6_mtu_reduced() and can be
(auto)inlined by the compiler.
Note that inet6_csk_route_socket() is no longer (auto)inlined,
which is a good thing as it is slow path.
$ scripts/bloat-o-meter -t vmlinux.0 vmlinux.1
add/remove: 0/2 grow/shrink: 2/0 up/down: 93/-129 (-36)
Function old new delta
tcp_v6_mtu_reduced 139 228 +89
inet6_csk_route_socket 486 490 +4
__pfx_inet6_csk_update_pmtu 16 - -16
inet6_csk_update_pmtu 113 - -113
Total: Before=25076512, After=25076476, chg -0.00%
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260223153047.886683-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
For RPC workloads, we alternate tcp_schedule_loss_probe() calls from
output path and from input path, with tp->packets_out value
oscillating between !zero and zero, leading to poor branch prediction.
Move tp->packets_out check from tcp_schedule_loss_probe() to
tcp_set_xmit_timer().
We avoid one call to tcp_schedule_loss_probe() from tcp_ack()
path for typical RPC workloads, while improving branch prediction.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260223113501.4070245-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Russell King says:
====================
net: stmmac: qcom-ethqos: cleanups and re-organise SerDes handling
As the last series had issues with stability, I've changed the approach
in this series to concentrate on keeping much of the SerDes related
code within the qcom-ethqos driver rather than trying to move it out at
this stage. This means it should be possible to bisect these patches and
pinpoint exactly the code movement that causes any instability.
This series starts with various cleanups to qcom-ethqos (the first four
patches) before beginning to move code, passing phylink's phy interface
(which will change) to the fix_mac_speed() method, and then using that
to configure the serdes and inband setting before moving the SerDes
code.
This patch set has been tested.
====================
Link: https://patch.msgid.link/aZwfAFJQcp9f0niI@shell.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Loopback is enabled to allow the dwmac soft reset to succeed. This
is enabled when clocks are enabled in ethqos_clks_config(), which
happens at driver probe and runtime PM resume - e.g. when the
network device is administratively brought up.
Currently, the loopback is disabled when the link comes up (via
.mac_link_up() calling this driver's .fix_mac_speed().)
Move the qcom_ethqos_set_sgmii_loopback() call which disables
loopback from ethqos_fix_mac_speed() into ethqos' SerDes specific
.mac_finish() method so that loopback is disabled a little earlier
after reset has completed, and dwmac setup has completed.
Reviewed-by: Mohd Ayaan Anwar <mohd.anwar@oss.qualcomm.com>
Tested-by: Mohd Ayaan Anwar <mohd.anwar@oss.qualcomm.com>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1vuSKq-0000000AScA-1Wh3@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
ethqos_set_func_clk_en() configures both SGMII loopback and the RGMII
functional clock setting. qcom_ethqos_set_sgmii_loopback() is only
called from within ethqos_set_func_clk_en(), and checks for
PHY_INTERFACE_MODE_2500BASEX.
Move qcom_ethqos_set_sgmii_loopback() to the callers of
ethqos_set_func_clk_en() except for ethqos_configure_rgmii() where we
know that ethqos->phy_mode will not be PHY_INTERFACE_MODE_2500BASEX.
Reviewed-by: Mohd Ayaan Anwar <mohd.anwar@oss.qualcomm.com>
Tested-by: Mohd Ayaan Anwar <mohd.anwar@oss.qualcomm.com>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1vuSKl-0000000ASc1-18ka@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Convert the register field values to something more human readable.
For example, using (BIT(29) | BIT(27)) to update a register field that
consists of bits 29:27 is an obfuscated way of writing decimal 5 for
this field. The comment above needs to explain that this value is 5.
Worse still is BIT(12) | GENMASK(9, 8), which is used to hide the
decimal value 19 for the bitfield 16:8.
Fix these, and a few others by using FIELD_PREP(). While it means we
have bare numeric constants, this is more preferable than having the
obfuscation.
Reviewed-by: Mohd Ayaan Anwar <mohd.anwar@oss.qualcomm.com>
Tested-by: Mohd Ayaan Anwar <mohd.anwar@oss.qualcomm.com>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1vuSKa-0000000ASbo-2zQg@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add missing documentation for a neighbor table garbage collector sysctl
parameter in ip-sysctl.rst:
neigh/default/gc_stale_time: controls how long an unused neighbor entry
is kept before becoming eligible for garbage collection (default: 60
seconds)
Signed-off-by: Gabriel Goller <g.goller@proxmox.com>
Link: https://patch.msgid.link/20260223101257.47563-1-g.goller@proxmox.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Eric Dumazet says:
====================
tcp: rework tcp_v{4,6}_send_check()
tcp_v{4,6}_send_check() are only called from __tcp_transmit_skb()
They currently are in different files (tcp_ipv4.c and tcp_ipv6.c)
thus out of line.
This series move them close to their caller so that compiler
can inline them.
For all patches in the series:
$ scripts/bloat-o-meter -t vmlinux.0 vmlinux.3
add/remove: 0/2 grow/shrink: 1/3 up/down: 102/-178 (-76)
Function old new delta
__tcp_transmit_skb 3321 3423 +102
tcp_v4_send_check 136 132 -4
__tcp_v4_send_check 130 121 -9
mptcp_subflow_init 777 763 -14
__pfx_tcp_v6_send_check 16 - -16
tcp_v6_send_check 135 - -135
Total: Before=25143100, After=25143024, chg -0.00%
====================
Link: https://patch.msgid.link/20260223100729.3761597-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
tcp_v{4,6}_send_check() are only called from tcp_output.c
and should be made static so that the compiler does not need
to put an out of line copy of them.
Remove (struct inet_connection_sock_af_ops) send_check field
and use instead @net_header_len.
Move @net_header_len close to @queue_xmit for data locality
as both are used in TCP tx fast path.
$ scripts/bloat-o-meter -t vmlinux.2 vmlinux.3
add/remove: 0/2 grow/shrink: 0/3 up/down: 0/-172 (-172)
Function old new delta
__tcp_transmit_skb 3426 3423 -3
tcp_v4_send_check 136 132 -4
mptcp_subflow_init 777 763 -14
__pfx_tcp_v6_send_check 16 - -16
tcp_v6_send_check 135 - -135
Total: Before=25143196, After=25143024, chg -0.00%
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260223100729.3761597-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>