If nvmem loads after the ethernet driver, MAC address assignments will
not take effect. of_get_ethdev_address() returns -EPROBE_DEFER in such a
case, so we need to handle that to avoid falling back to
eth_hw_addr_random().
Add an extra goto label that frees only the stats, as they are allocated
right above.
Signed-off-by: Rosen Penev <rosenp@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260307031709.640141-1-rosenp@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The test generates 16 flows, and verifies that traffic is distributed
across two queues via the NIC's RSS indirection table. The likelihood of the
flows skewing to a single queue is high, so we retry sending traffic up to
3 times.
Alternatively, we could increase the number of generated flows. But
debug kernels may struggle to ramp this many flows.
During manual testing, the test passed for 10,000 consecutive runs.
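The retry logic described above can be sketched roughly as follows. This is
an illustration only, not the selftest's code: send_traffic is a
hypothetical callable that generates the flows once and returns a
{queue: packet_count} mapping, and the names and defaults are assumptions.

```python
def traffic_is_spread(send_traffic, queues=(0, 1), attempts=3):
    """Return True once every expected queue has seen traffic.

    send_traffic is a hypothetical callable (an assumption, not part of
    the selftest) that sends the flows and returns {queue: packet_count}.
    """
    for _ in range(attempts):
        counts = send_traffic()
        # Pass as soon as no queue in the expected set is idle.
        if all(counts.get(q, 0) > 0 for q in queues):
            return True
    # Every attempt skewed onto a subset of the queues.
    return False
```

A single skewed attempt is tolerated; the check only fails when all
attempts leave at least one queue idle.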
Signed-off-by: Dimitri Daskalakis <dimitri.daskalakis1@gmail.com>
Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Link: https://patch.msgid.link/20260309204215.2110486-1-dimitri.daskalakis1@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
With the current port selection algorithm, ports after a reserved port
range or a long-used port are selected more often than others [1]. This
causes an uneven port usage distribution. This combines with cloud
environments blocking connections between the application server and the
database server if there was a previous connection with the same source
port, leading to connectivity problems between applications on cloud
environments.
The real issue here is that these firewalls cannot cope with
standards-compliant port reuse. This is a workaround for such situations
and an improvement on the distribution of ports selected.
The proposed solution is to implement a variant of RFC 6056 Algorithm 5.
The step size is selected randomly on every connect() call, ensuring it
is coprime with the size of the range of ports we want to
scan. This way, we can ensure that all ports within the range are
scanned before returning an error. To enable this algorithm, the user
must configure the new sysctl option "net.ipv4.ip_local_port_step_width".
In addition, the generated graphs show that the distribution of
source ports is more even with the proposed approach. [2]
[1] https://0xffsoftware.com/port_graph_current_alg.html
[2] https://0xffsoftware.com/port_graph_random_step_alg.html
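Why a coprime step guarantees full coverage can be illustrated with a
small user-space sketch. It is not the kernel implementation: the function
name, the rejection-sampling loop used to find a coprime step, and the
random starting offset are all assumptions made for illustration.

```python
import math
import random


def port_scan_order(low, high, rng=random):
    """Yield every port in [low, high] exactly once, in strided order."""
    n = high - low + 1
    if n <= 0:
        raise ValueError("empty port range")
    offset = rng.randrange(n)
    # Redraw until the step is coprime with the range size. A step of 1
    # always qualifies, so a valid step exists for every n.
    step = rng.randrange(1, n + 1)
    while math.gcd(step, n) != 1:
        step = rng.randrange(1, n + 1)
    # Because gcd(step, n) == 1, i * step mod n hits every residue once.
    for i in range(n):
        yield low + (offset + i * step) % n
```

Since the step is coprime with the range size, the sequence is a
permutation of the range, so an error is only reported after every port
has been tried.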
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
Link: https://patch.msgid.link/20260309023946.5473-2-fmancera@suse.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Allison Henderson says:
====================
selftests: rds: ksft cleanups
This set addresses a few rds selftests clean ups and bugs encountered
when running in the ksft framework. The first patch is a clean up
patch that addresses pylint warnings, but otherwise no functional
changes. The next patch moves the test time out to a ksft settings
file so that the time out is set appropriately. And lastly we fix a
tcpdump segfault caused by a deprecated os.fork() call.
====================
Link: https://patch.msgid.link/20260308055835.1338257-1-achender@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
net/rds/test.py sees a segfault in tcpdump when executed through the
ksft runner.
[ 21.903713] tcpdump[1469]: segfault at 0 ip 000072100e99126d
sp 00007ffccf740fd0 error 4
[ 21.903721] in libc.so.6[16a26d,7798b149a000+188000]
[ 21.905074] in libc.so.6[16a26d,72100e84f000+188000] likely on
CPU 5 (core 5, socket 0)
[ 21.905084] Code: 00 0f 85 a0 00 00 00 48 83 c4 38 89 d8 5b 41 5c
41 5d 41 5e 41 5f 5d c3 0f 1f 44 00 00 48 8b 05 91 8b 09 00 8b 4d ac
64 89 08 <41> 0f b6 07 83 e8 2b a8 fd 0f 84 54 ff ff ff 49 8b 36 4c 89
ff e8
[ 21.906760] likely on CPU 9 (core 9, socket 0)
[ 21.913469] Code: 00 0f 85 a0 00 00 00 48 83 c4 38 89 d8 5b 41 5c 41
5d 41 5e 41 5f 5d c3 0f 1f 44 00 00 48 8b 05 91 8b 09 00 8b 4d ac 64 89
08 <41> 0f b6 07 83 e8 2b a8 fd 0f 84 54 ff ff ff 49 8b 36 4c 89 ff e8
The os.fork() call creates extra complexity because it forks the entire
process including the python interpreter. ip() then calls cmd() which
creates a subprocess.Popen. We can avoid the extra layering by simply
calling subprocess.Popen directly. Track the process handles directly
and terminate them at cleanup rather than relying on killall. Further,
tcpdump's -Z flag attempts to change savefile ownership, which is not
supported by the 9p protocol. Fix this by writing pcap captures to
"/tmp" during the test and move them to the log directory after tcpdump
exits.
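The approach can be sketched as below. This is a simplified illustration
under stated assumptions, not the code in test.py: start_capture and
stop_captures are hypothetical helper names, and the directories and the
"*.pcap" naming are placeholders.

```python
import shutil
import subprocess
from pathlib import Path


def start_capture(argv):
    """Start a capture command directly, with no intermediate os.fork()."""
    return subprocess.Popen(argv, stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)


def stop_captures(procs, tmp_dir, log_dir, timeout=5):
    """Terminate the tracked processes, then move captures to log_dir."""
    for proc in procs:
        proc.terminate()
    for proc in procs:
        try:
            proc.wait(timeout=timeout)
        except subprocess.TimeoutExpired:
            # Fall back to SIGKILL if the process ignores SIGTERM.
            proc.kill()
            proc.wait()
    Path(log_dir).mkdir(parents=True, exist_ok=True)
    moved = []
    # Move files only after the writers have exited.
    for pcap in sorted(Path(tmp_dir).glob("*.pcap")):
        moved.append(shutil.move(str(pcap), str(Path(log_dir) / pcap.name)))
    return moved
```

Keeping the Popen handles means cleanup only touches processes the test
started, which killall cannot guarantee.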
Signed-off-by: Allison Henderson <achender@kernel.org>
Link: https://patch.msgid.link/20260308055835.1338257-4-achender@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
NIPA tries to make sure that HW tests don't modify system state.
It dumps some well known configs before and after the test and
compares the outputs.
Make sure that YNL json output is stable. Converting sets to lists
with a naive list(o) results in a random order.
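One way to get a stable order is to sort sets while serializing. The sketch
below shows the general technique with the standard json module; it is not
the YNL code, and stable_default is a hypothetical name.

```python
import json


def stable_default(o):
    """json.dumps fallback that serializes sets in sorted order."""
    if isinstance(o, (set, frozenset)):
        # sorted() gives a deterministic order; list(o) does not.
        return sorted(o)
    raise TypeError(f"{type(o).__name__} is not JSON serializable")


def dump_stable(obj):
    # sort_keys also makes dictionary key order independent of insertion.
    return json.dumps(obj, default=stable_default, sort_keys=True)
```

With this, dumping the same state before and after a test yields
byte-identical output, so a plain text comparison works.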
Link: https://patch.msgid.link/20260307175916.1652518-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Kyoji Ogasawara says:
====================
smc-sysctl formatting and missing entries
Update SMC sysctl documentation in two small steps.
- patch 1 fixes indentation in the smcr_buf_type section
- patch 2 documents missing sysctl parameters limit_smc_hs and hs_ctrl,
including values/defaults and hs_ctrl usage notes
====================
Link: https://patch.msgid.link/20260309124541.22723-1-sawara04.o@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Mike Marciniszyn says:
====================
eth fbnic: Add fbnic self tests
From: "Mike Marciniszyn (Meta)" <mike.marciniszyn@gmail.com>
This series adds self tests to test the registers, the
msix interrupts, the tlv, and the firmware mailbox.
This series assumes that the
[PATCH net-next 0/2] Add debugfs hooks [1]
is present.
When the self tests are run with ethtool -t:
ethtool -t eth0
The test result is PASS
The test extra info:
Register test (offline) 0
MSI-X Interrupt test (offline) 0
FW mailbox test (on/offline) 0
====================
Link: https://patch.msgid.link/20260307105847.1438-1-mike.marciniszyn@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The register test will be used to verify hardware is behaving as expected.
The test itself will have us writing to registers that should have no
side effects due to us resetting after the test has been completed.
While the test is being run the interface should be offline.
This patch counts on the first patch of this series to export netif_open()
and also ensures that the half close calls netif_close() to
avoid deadlock.
Signed-off-by: Mike Marciniszyn (Meta) <mike.marciniszyn@gmail.com>
Link: https://patch.msgid.link/20260307105847.1438-3-mike.marciniszyn@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
As a part of MANA hardening for CVM, add validation for the doorbell
ID (db_id) received from hardware in the GDMA_REGISTER_DEVICE response
to prevent out-of-bounds memory access when calculating the doorbell
page address.
In mana_gd_ring_doorbell(), the doorbell page address is calculated as:
addr = db_page_base + db_page_size * db_index
= (bar0_va + db_page_off) + db_page_size * db_index
The hardware could return values that cause this address to fall outside
the BAR0 MMIO region. In Confidential VM environments, hardware responses
cannot be fully trusted.
Add the following validations:
- Store the BAR0 size (bar0_size) in gdma_context during probe.
- Validate the doorbell page offset (db_page_off) read from device
registers does not exceed bar0_size during initialization, converting
mana_gd_init_registers() to return an error code.
- Validate db_id from GDMA_REGISTER_DEVICE response against the
maximum number of doorbell pages that fit within BAR0.
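The bounds checks listed above amount to the following arithmetic, shown
here as a small model rather than driver code. The function name is
hypothetical and the exact comparisons used by the driver are an
assumption; only the address formula comes from this commit message.

```python
def doorbell_is_valid(bar0_size, db_page_off, db_page_size, db_id):
    """Model of the checks: offset within BAR0, db_id within the pages left."""
    if db_page_size <= 0:
        return False
    # The doorbell page offset must not exceed the BAR0 size.
    if db_page_off > bar0_size:
        return False
    # Number of whole doorbell pages that fit after the offset.
    max_pages = (bar0_size - db_page_off) // db_page_size
    # addr = bar0_va + db_page_off + db_page_size * db_id must stay in BAR0.
    return 0 <= db_id < max_pages
```

For example, with a 64 KiB BAR0, a 4 KiB offset and 4 KiB pages, 15 pages
fit, so db_id values 0 through 14 are accepted and 15 is rejected.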
Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
Link: https://patch.msgid.link/20260306211212.543376-1-ernis@linux.microsoft.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Russell King says:
====================
net: stmmac: further ptp cleanups
The first uses a local variable when setting n_ext_ts which is a minor
simplification of the code. The second removes the now unnecessary
"available" flag for the PPS outputs.
====================
Link: https://patch.msgid.link/aawDiK7DjcSXSs1X@shell.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
priv->pps[].available is set in stmmac_ptp_register() for all PPS
outputs reported by hardware up to STMMAC_PPS_MAX.
Since we now set priv->ptp_clock_ops.n_per_out to the number of PPS
outputs that both the hardware and driver can support to prevent
array overflow in stmmac_enable(), this makes priv->pps[].available
redundant. Remove this struct member.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1vypHc-0000000CSbl-1X6v@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
tcp_chrono_start() is small enough, and used in TCP sendmsg()
fast path (from tcp_skb_entail()).
Note clang is already inlining it from functions in tcp_output.c.
Inlining it improves performance and reduces bloat :
$ scripts/bloat-o-meter -t vmlinux.old vmlinux.new
add/remove: 0/2 grow/shrink: 1/0 up/down: 1/-84 (-83)
Function old new delta
tcp_skb_entail 280 281 +1
__pfx_tcp_chrono_start 16 - -16
tcp_chrono_start 68 - -68
Total: Before=25192434, After=25192351, chg -0.00%
Note that tcp_chrono_stop() is too big.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Link: https://patch.msgid.link/20260308123549.2924460-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
chrono_type is currently in tcp_sock_read_txrx group, which
is supposed to hold read-mostly fields.
But chrono_type is mostly written in the tx path, so it should
be moved to tcp_sock_write_tx group, close to other
chrono fields (chrono_stat[], chrono_start).
Note this adds holes, but data locality is far more important.
Use a full u8 for the time being, so that the compiler can generate
more efficient code.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Link: https://patch.msgid.link/20260308122302.2895067-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Some modern CPUs disable the X86_FEATURE_RETPOLINE feature,
even if a direct call can still be beneficial.
Even when IBRS is present, an indirect call is more expensive
than a direct one:
Direct Calls:
Compilers can perform powerful optimizations like inlining,
where the function body is directly inserted at the call site,
eliminating call overhead entirely.
Indirect Calls:
Inlining is much harder, if not impossible, because the compiler
doesn't know the target function at compile time.
Techniques like Indirect Call Promotion can help by using
profile-guided optimization to turn frequently taken indirect calls
into conditional direct calls, but they still add complexity
and potential overhead compared to a truly direct call.
In this patch, I split tc_skip_wrapper in two different
static keys, one for tc_act() (tc_skip_wrapper_act)
and one for tc_classify() (tc_skip_wrapper_cls).
Then I enable the tc_skip_wrapper_cls only if the count
of builtin classifiers is above one.
I enable tc_skip_wrapper_act only if the count of builtin
actions is above one.
In our production kernels, we only have CONFIG_NET_CLS_BPF=y
and CONFIG_NET_ACT_BPF=y. Others are modules or are not compiled.
Tested on AMD Turin cpus, cls_bpf_classify() cost went
from 1% down to 0.18 %, and FDO will be able to inline
it in tcf_classify() for further gains.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Pedro Tammela <pctammela@mojatatu.com>
Reviewed-by: Victor Nogueira <victor@mojatatu.com>
Link: https://patch.msgid.link/20260307133601.3863071-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Commit 18fd64d254 ("netns-ipv4: reorganize netns_ipv4 fast path
variables") missed that __tcp_select_window() is reading
net->ipv4.sysctl_tcp_shrink_window.
Move this field to netns_ipv4_read_txrx group, as __tcp_select_window()
is used both in tx and rx paths.
Saves a potential cache line miss.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260307092214.2433548-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
fib6_nexthop() retrieves the link-local address for two interfaces used
in the test. However, both lldummy and llv1 are obtained from dummy0.
llv1 is expected to be retrieved from veth1, which is the interface used
later in the test. The subsequent check and error message also expect
the address to be retrieved from veth1.
Fix this by retrieving llv1 from veth1.
Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com>
Link: https://patch.msgid.link/20260306180830.2329477-1-alok.a.tiwari@oracle.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Bartosz Golaszewski says:
====================
Immutable branch between GPIO and net
Convert remaining users of of_gpio.h to using GPIO descriptors and
remove the header.
* tag 'ib-gpio-remove-of-gpio-h-for-v7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux:
gpio: remove of_get_named_gpio() and <linux/of_gpio.h>
nfc: nfcmrvl: convert to gpio descriptors
nfc: s3fwrn5: convert to gpio descriptors
====================
Link: https://patch.msgid.link/20260309093153.10446-1-bartosz.golaszewski@oss.qualcomm.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently, ppp_input_error() indicates an error by allocating a 0-length
skb and calling ppp_do_recv(). It takes an error code argument, which is
stored in skb->cb, but not used by ppp_receive_frame().
Simplify the error handling by removing the unused parameter and the
unnecessary skb allocation. Instead, call ppp_receive_error() directly
from ppp_input_error() under the recv lock, and the length check in
ppp_receive_frame() can be removed.
Signed-off-by: Qingfang Deng <dqfext@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The driver uses packet-type (RX/TX), PTP-message type and PTP-sequence
number to identify a matching timestamp packet for a skb. If the same
PTP packet arrives on both ports (as in a PRP environment) then it is
not obvious which event belongs to which skb.
The event also contains the port number on which it was received.
Instead of masking it out, use it for matching.
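The effect of adding the port to the match can be modeled as below. This
is a conceptual sketch, not the driver's data structures: TsKey,
match_event and the field names are hypothetical.

```python
from typing import NamedTuple


class TsKey(NamedTuple):
    """Fields used to pair a timestamp event with an skb (names assumed)."""
    direction: str  # "rx" or "tx"
    msg_type: int   # PTP message type
    seq_id: int     # PTP sequence number
    port: int       # port the event was received on


def match_event(pending, key):
    """Pop and return the skb waiting for this event, or None."""
    return pending.pop(key, None)
```

With the port in the key, the same PTP packet arriving on two ports (as
with PRP) produces two distinct keys, so each event resolves to the skb
from its own port.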
Tested-by: Chintan Vankar <c-vankar@ti.com>
Reviewed-by: Martin Kaistra <martin.kaistra@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://patch.msgid.link/20260306144439.cVwaaopR@linutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The following typical script is extremely disruptive,
because each graft operation calls dev_deactivate()
which resets all the queues of the device.
QPARAM="limit 100000 flow_limit 1000 buckets 4096"
TXQS=64
for ETH in eth1
do
  tc qd del dev $ETH root 2>/dev/null
  tc qd add dev $ETH root handle 1: mq
  for i in `seq 1 $TXQS`
  do
    slot=$( printf %x $(( i )) )
    tc qd add dev $ETH parent 1:$slot fq $QPARAM
  done
done
One can add "ip link set dev $ETH down/up" to reduce the disruption time:
QPARAM="limit 100000 flow_limit 1000 buckets 4096"
TXQS=64
for ETH in eth1
do
  ip link set dev $ETH down
  tc qd del dev $ETH root 2>/dev/null
  tc qd add dev $ETH root handle 1: mq
  for i in `seq 1 $TXQS`
  do
    slot=$( printf %x $(( i )) )
    tc qd add dev $ETH parent 1:$slot fq $QPARAM
  done
  ip link set dev $ETH up
done
Or we can add a @reset_needed flag to dev_deactivate() and
dev_deactivate_many().
This flag is set to true at device dismantle or linkwatch_do_dev(),
and to false for graft operations.
In the future, we might only stop one queue instead of the whole
device, ie call dev_deactivate_queue() instead of dev_deactivate().
I think the problem (quadratic behavior) was added in commit
2fb541c862 ("net: sch_generic: aviod concurrent reset and enqueue op
for lockless qdisc") but this does not look serious enough to deserve
risky backports.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yunsheng Lin <linyunsheng@huawei.com>
Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Reviewed-by: Victor Nogueira <victor@mojatatu.com>
Link: https://patch.msgid.link/20260307163430.470644-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add SPDX-License-Identifier lines to several source
files under the network sub-directory. Work on files
in the core, dns_resolver, ipv4, ipv6 and
netfilter sub-dirs. Remove boilerplate
and license reference text to avoid ambiguity.
Rusty Russell has expressed that his contributions
were intended to be GPL-2.0-or-later.
Signed-off-by: Tim Bird <tim.bird@sony.com>
Link: https://patch.msgid.link/20260305004724.87469-1-tim.bird@sony.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski says:
====================
tools: ynl: convert samples into selftests
The "samples" were always poor man's tests, used to manually
confirm that C YNL works as expected. Since a proper tests/
directory now exists, move the samples and use the kselftest
harness to turn them into selftests outputting KTAP.
====================
Link: https://patch.msgid.link/20260307033630.1396085-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Convert rt-route.c to use kselftest_harness.h with FIXTURE/TEST_F.
This is the last test to convert so clean up the Makefile.
Validate that the connected routes for 192.168.1.0/24 and
2001:db8::/64 appear in the dump.
Output:
TAP version 13
1..1
# Starting 1 tests from 1 test cases.
# RUN rt_route.dump ...
# oif: nsim0 dst: 192.168.1.0/24
# oif: lo dst: ::1/128
# oif: nsim0 dst: 2001:db8::1/128
# oif: nsim0 dst: 2001:db8::/64
# oif: nsim0 dst: fe80::/64
# oif: nsim0 dst: ff00::/8
# OK rt_route.dump
ok 1 rt_route.dump
# PASSED: 1 / 1 tests passed.
# Totals: pass:1 fail:0 xfail:0 xpass:0 skip:0 error:0
Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Tested-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20260307033630.1396085-11-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Convert rt-addr.c to use kselftest_harness.h with FIXTURE/TEST_F.
Validate that the addresses configured by the wrapper (192.168.1.1
and 2001:db8::1) appear in the dump.
Output:
TAP version 13
1..1
# Starting 1 tests from 1 test cases.
# RUN rt_addr.dump ...
# lo: 127.0.0.1
# nsim0: 192.168.1.1
# lo: ::1
# nsim0: 2001:db8::1
# nsim0: fe80::7c66:c9ff:fe5f:bf01
# OK rt_addr.dump
ok 1 rt_addr.dump
# PASSED: 1 / 1 tests passed.
# Totals: pass:1 fail:0 xfail:0 xpass:0 skip:0 error:0
Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Tested-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20260307033630.1396085-10-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Convert ethtool.c to use kselftest_harness.h with FIXTURE/TEST_F.
Move ethtool from BINS to TEST_GEN_FILES and add ethtool.sh wrapper
which sets up a netdevsim device before running the test binary.
Output:
TAP version 13
1..2
# Starting 2 tests from 1 test cases.
# RUN ethtool.channels ...
# nsim0: combined 1
# OK ethtool.channels
ok 1 ethtool.channels
# RUN ethtool.rings ...
# nsim0: rx 512 tx 512
# OK ethtool.rings
ok 2 ethtool.rings
# PASSED: 2 / 2 tests passed.
# Totals: pass:2 fail:0 xfail:0 xpass:0 skip:0 error:0
Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Tested-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20260307033630.1396085-9-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Convert devlink.c to use kselftest_harness.h with FIXTURE/TEST_F.
Move devlink from BINS to TEST_GEN_FILES in the Makefile since
it's invoked via the devlink.sh wrapper which sets up netdevsim.
Output:
TAP version 13
1..2
# Starting 2 tests from 1 test cases.
# RUN devlink.dump ...
# netdevsim/netdevsim1337
# OK devlink.dump
ok 1 devlink.dump
# RUN devlink.info ...
# netdevsim/netdevsim1337:
# driver: netdevsim
# running fw:
# fw.mgmt: 10.20.30
# OK devlink.info
ok 2 devlink.info
# PASSED: 2 / 2 tests passed.
# Totals: pass:2 fail:0 xfail:0 xpass:0 skip:0 error:0
Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Tested-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20260307033630.1396085-8-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>