Commit Graph

1354201 Commits

Author SHA1 Message Date
Phil Sutter
e225376d78 netfilter: nf_tables: Introduce nft_hook_find_ops{,_rcu}()
Also a pretty dull wrapper around the hook->ops.dev comparison for now.
Will search the embedded nf_hook_ops list in future. The ugly cast to
eliminate the const qualifier will vanish then, too.

Since this future list will be RCU-protected, also introduce an _rcu()
variant here.

Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-05-23 13:57:12 +02:00
Phil Sutter
75e20bcdce netfilter: nf_tables: Introduce functions freeing nft_hook objects
Pointless wrappers around kfree() for now, prep work for an embedded
list of nf_hook_ops.

Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-05-23 13:57:12 +02:00
Florian Westphal
7e5c6aa67e netfilter: nf_tables: add packets conntrack state to debug trace info
Add the minimal relevant info needed for userspace ("nftables monitor
trace") to provide the conntrack view of the packet:

- state (new, related, established)
- direction (original, reply)
- status (e.g., if connection is subject to dnat)
- id (allows to query ctnetlink for remaining conntrack state info)

Example:
trace id a62 inet filter PRE_RAW packet: iif "enp0s3" ether [..]
  [..]
trace id a62 inet filter PRE_MANGLE conntrack: ct direction original ct state new ct id 32
trace id a62 inet filter PRE_MANGLE packet: [..]
 [..]
trace id a62 inet filter IN conntrack: ct direction original ct state new ct status dnat-done ct id 32
 [..]

In this case one can see that while NAT is active, the new connection
isn't subject to a translation.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-05-23 13:57:12 +02:00
Florian Westphal
90869f43d0 netfilter: conntrack: make nf_conntrack_id callable without a module dependency
While nf_conntrack_id() doesn't need any functionaliy from conntrack, it
does reside in nf_conntrack_core.c -- callers add a module
dependency on conntrack.

Followup patch will need to compute the conntrack id from nf_tables_trace.c
to include it in nf_trace messages emitted to userspace via netlink.

I don't want to introduce a module dependency between nf_tables and
conntrack for this.

Since trace is slowpath, the added indirection is ok.

One alternative is to move nf_conntrack_id to the netfilter/core.c,
but I don't see a compelling reason so far.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-05-23 13:57:12 +02:00
Sebastian Andrzej Siewior
f37ad91270 netfilter: nf_dup_netdev: Move the recursion counter struct netdev_xmit
nf_dup_skb_recursion is a per-CPU variable and relies on disabled BH for its
locking. Without per-CPU locking in local_bh_disable() on PREEMPT_RT
this data structure requires explicit locking.

Move nf_dup_skb_recursion to struct netdev_xmit, provide wrappers.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-05-23 13:57:12 +02:00
Sebastian Andrzej Siewior
ba36fada9a netfilter: nft_inner: Use nested-BH locking for nft_pcpu_tun_ctx
nft_pcpu_tun_ctx is a per-CPU variable and relies on disabled BH for its
locking. Without per-CPU locking in local_bh_disable() on PREEMPT_RT
this data structure requires explicit locking.

Make a struct with a nft_inner_tun_ctx member (original
nft_pcpu_tun_ctx) and a local_lock_t and use local_lock_nested_bh() for
locking. This change adds only lockdep coverage and does not alter the
functional behaviour for !PREEMPT_RT.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-05-23 13:57:12 +02:00
Sebastian Andrzej Siewior
a1f1acb9c5 netfilter: nf_dup{4, 6}: Move duplication check to task_struct
nf_skb_duplicated is a per-CPU variable and relies on disabled BH for its
locking. Without per-CPU locking in local_bh_disable() on PREEMPT_RT
this data structure requires explicit locking.

Due to the recursion involved, the simplest change is to make it a
per-task variable.

Move the per-CPU variable nf_skb_duplicated to task_struct and name it
in_nf_duplicate. Add it to the existing bitfield so it doesn't use
additional memory.

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Valentin Schneider <vschneid@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-05-23 13:57:12 +02:00
Fernando Fernandez Mancera
22a9613de4 netfilter: nft_tunnel: fix geneve_opt dump
When dumping a nft_tunnel with more than one geneve_opt configured the
netlink attribute hierarchy should be as follow:

 NFTA_TUNNEL_KEY_OPTS
 |
 |--NFTA_TUNNEL_KEY_OPTS_GENEVE
 |  |
 |  |--NFTA_TUNNEL_KEY_GENEVE_CLASS
 |  |--NFTA_TUNNEL_KEY_GENEVE_TYPE
 |  |--NFTA_TUNNEL_KEY_GENEVE_DATA
 |
 |--NFTA_TUNNEL_KEY_OPTS_GENEVE
 |  |
 |  |--NFTA_TUNNEL_KEY_GENEVE_CLASS
 |  |--NFTA_TUNNEL_KEY_GENEVE_TYPE
 |  |--NFTA_TUNNEL_KEY_GENEVE_DATA
 |
 |--NFTA_TUNNEL_KEY_OPTS_GENEVE
 ...

Otherwise, userspace tools won't be able to fetch the geneve options
configured correctly.

Fixes: 925d844696 ("netfilter: nft_tunnel: add support for geneve opts")
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-05-23 13:57:12 +02:00
Florian Westphal
996d62ece0 selftests: netfilter: nft_fib.sh: add type and oif tests with and without VRFs
Replace the existing VRF test with a more comprehensive one.

It tests following combinations:
 - fib type (returns address type, e.g. unicast)
 - fib oif (route output interface index
 - both with and without 'iif' keyword (changes result, e.g.
  'fib daddr type local' will be true when the destination address
  is configured on the local machine, but
  'fib daddr . iif type local' will only be true when the destination
  address is configured on the incoming interface.

Add all types of addresses to test with for both ipv4 and ipv6:
- local address on the incoming interface
- local address on another interface
- local address on another interface thats part of a vrf
- address on another host

The ruleset stores obtained results from 'fib' in nftables sets and
then queries the sets to check that it has the expected results.

Perform one pass while packets are coming in on interface NOT part of
a VRF and then again when it was added and make sure fib returns the
expected routes and address types for the various addresses in the
setup.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-05-23 13:57:12 +02:00
Florian Westphal
9a119669fb netfilter: nf_tables: nft_fib: consistent l3mdev handling
fib has two modes:
1. Obtain output device according to source or destination address
2. Obtain the type of the address, e.g. local, unicast, multicast.

'fib daddr type' should return 'local' if the address is configured
in this netns or unicast otherwise.

'fib daddr . iif type' should return 'local' if the address is configured
on the input interface or unicast otherwise, i.e. more restrictive.

However, if the interface is part of a VRF, then 'fib daddr type'
returns unicast even if the address is configured on the incoming
interface.

This is broken for both ipv4 and ipv6.

In the ipv4 case, inet_dev_addr_type must only be used if the
'iif' or 'oif' (strict mode) was requested.

Else inet_addr_type_dev_table() needs to be used and the correct
dev argument must be passed as well so the correct fib (vrf) table
is used.

In the ipv6 case, the bug is similar, without strict mode, dev is NULL
so .flowi6_l3mdev will be set to 0.

Add a new 'nft_fib_l3mdev_master_ifindex_rcu()' helper and use that
to init the .l3mdev structure member.

For ipv6, use it from nft_fib6_flowi_init() which gets called from
both the 'type' and the 'route' mode eval functions.

This provides consistent behaviour for all modes for both ipv4 and ipv6:
If strict matching is requested, the input respectively output device
of the netfilter hooks is used.

Otherwise, use skb->dev to obtain the l3mdev ifindex.

Without this, most type checks in updated nft_fib.sh selftest fail:

  FAIL: did not find veth0 . 10.9.9.1 . local in fibtype4
  FAIL: did not find veth0 . dead:1::1 . local in fibtype6
  FAIL: did not find veth0 . dead:9::1 . local in fibtype6
  FAIL: did not find tvrf . 10.0.1.1 . local in fibtype4
  FAIL: did not find tvrf . 10.9.9.1 . local in fibtype4
  FAIL: did not find tvrf . dead:1::1 . local in fibtype6
  FAIL: did not find tvrf . dead:9::1 . local in fibtype6
  FAIL: fib expression address types match (iif in vrf)

(fib errounously returns 'unicast' for all of them, even
 though all of these addresses are local to the vrf).

Fixes: f6d0cbcf09 ("netfilter: nf_tables: add fib expression")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-05-23 13:57:09 +02:00
Florian Westphal
8b53f46eb4 netfilter: nf_tables: nft_fib_ipv6: fix VRF ipv4/ipv6 result discrepancy
With a VRF, ipv4 and ipv6 FIB expression behave differently.

   fib daddr . iif oif

Will return the input interface name for ipv4, but the real device
for ipv6.  Example:

If VRF device name is tvrf and real (incoming) device is veth0.
First round is ok, both ipv4 and ipv6 will yield 'veth0'.

But in the second round (incoming device will be set to "tvrf"), ipv4
will yield "tvrf" whereas ipv6 returns "veth0" for the second round too.

This makes ipv6 behave like ipv4.

A followup patch will add a test case for this, without this change
it will fail with:
  get element inet t fibif6iif { tvrf . dead:1::99 . tvrf }
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  FAIL: did not find tvrf . dead:1::99 . tvrf in fibif6iif

Alternatively we could either not do anything at all or change
ipv4 to also return the lower/real device, however, nft (userspace)
doc says "iif: if fib lookup provides a route then check its output
interface is identical to the packets input interface." which is what
the nft fib ipv4 behaviour is.

Fixes: f6d0cbcf09 ("netfilter: nf_tables: add fib expression")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-05-22 17:16:02 +02:00
Florian Westphal
98287045c9 selftests: netfilter: move fib vrf test to nft_fib.sh
It was located in conntrack_vrf.sh because that already had the VRF bits.
Lets not add to this and move it to nft_fib.sh where this belongs.

No functional changes for the subtest intended.
The subtest is limited, it only covered 'fib oif'
(route output interface query) when the incoming interface is part
of a VRF.

Next we can extend it to cover 'fib type' for VRFs and also check fib
results when there is an unrelated VRF in same netns.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-05-22 17:16:02 +02:00
Florian Westphal
839340f7c7 selftests: netfilter: nft_fib.sh: add 'type' mode tests
fib can either lookup the interface id/name of the output interface that
would be used for the given address, or it can check for the type of the
address according to the fib, e.g. local, unicast, multicast and so on.

This can be used to e.g. make a locally configured address only reachable
through its interface.

Example: given eth0:10.1.1.1 and eth1:10.1.2.1 then 'fib daddr type' for
10.1.1.1 arriving on eth1 will be 'local', but 'fib daddr . iif type' is
expected to return 'unicast', whereas 'fib daddr' and 'fib daddr . iif'
are expected to indicate 'local' if such a packet arrives on eth0.

So far nft_fib.sh only covered oif/oifname, not type.

Repeat tests both with default and a policy (ip rule) based setup.

Also try to run all remaining tests even if a subtest has failed.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-05-22 17:16:02 +02:00
Florian Westphal
c38eb2973c netfilter: xtables: support arpt_mark and ipv6 optstrip for iptables-nft only builds
Its now possible to build a kernel that has no support for the classic
xtables get/setsockopt interfaces and builtin tables.

In this case, we have CONFIG_IP6_NF_MANGLE=n and
CONFIG_IP_NF_ARPTABLES=n.

For optstript, the ipv6 code is so small that we can enable it if
netfilter ipv6 support exists. For mark, check if either classic
arptables or NFT_ARP_COMPAT is set.

Fixes: a9525c7f62 ("netfilter: xtables: allow xtables-nft only builds")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-05-22 17:16:02 +02:00
Florian Westphal
d31c1cafc4 selftests: netfilter: nft_concat_range.sh: add coverage for 4bit group representation
Pipapo supports a more compact '4 bit group' format that is chosen when
the memory needed for the default exceeds a threshold (2mb).

Add coverage for those code paths, the existing tests use small sets that
are handled by the default representation.

This comes with a test script run-time increase, but I think its ok:

 normal: 2m35s -> 3m9s
 debug:  3m24s -> 5m29s (with KSFT_MACHINE_SLOW=yes).

Cc: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-05-22 17:16:01 +02:00
Jakub Kicinski
f685204c57 Merge branch 'queue_api-reduce-risk-of-name-collision-over-txq'
Gur Stavi says:

====================
queue_api: reduce risk of name collision over txq

Rename local variable in macros from txq to _txq.
When macro parameter get_desc is expended it is likely to have a txq
token that refers to a different txq variable at the caller's site.
====================

Link: https://patch.msgid.link/cover.1747559621.git.gur.stavi@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-19 20:09:07 -07:00
Gur Stavi
84b21e61eb queue_api: reduce risk of name collision over txq
Rename local variable in macros from txq to _txq.
When macro parameter get_desc is expended it is likely to have a txq
token that refers to a different txq variable at the caller's site.

Signed-off-by: Gur Stavi <gur.stavi@huawei.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://patch.msgid.link/95b60d218f004308486d92ed17c8cc6f28bac09d.1747559621.git.gur.stavi@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-19 20:09:02 -07:00
Jakub Kicinski
e41703aca2 Merge branch '200GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue
Tony Nguyen says:

====================
idpf: add initial PTP support

Milena Olech says:

This patch series introduces support for Precision Time Protocol (PTP) to
Intel(R) Infrastructure Data Path Function (IDPF) driver. PTP feature is
supported when the PTP capability is negotiated with the Control
Plane (CP). IDPF creates a PTP clock and sets a set of supported
functions.

During the PTP initialization, IDPF requests a set of PTP capabilities
and receives a writeback from the CP with the set of supported options.
These options are:
- get time of the PTP clock
- set the time of the PTP clock
- adjust the PTP clock
- Tx timestamping

Each feature is considered to have direct access, where the operations
on PCIe BAR registers are allowed, or the mailbox access, where the
virtchnl messages are used to perform any PTP action. Mailbox access
means that PTP requests are sent to the CP through dedicated secondary
mailbox and the CP reads/writes/modifies desired resource - PTP Clock
or Tx timestamp registers.

Tx timestamp capabilities are negotiated only for vports that have
UPLINK_VPORT flag set by the CP. Capabilities provide information about
the number of available Tx timestamp latches, their indexes and size of
the Tx timestamp value. IDPF requests Tx timestamp by setting the
TSYN bit and the requested timestamp index in the context descriptor for
the PTP packets. When the completion tag for that packet is received,
IDPF schedules a worker to read the Tx timestamp value.

* '200GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
  idpf: add support for Rx timestamping
  idpf: add Tx timestamp flows
  idpf: add Tx timestamp capabilities negotiation
  idpf: add PTP clock configuration
  idpf: add mailbox access to read PTP clock time
  idpf: negotiate PTP capabilities and get PTP clock
  idpf: move virtchnl structures to the header file
  virtchnl: add PTP virtchnl definitions
  idpf: add initial PTP support
  idpf: change the method for mailbox workqueue allocation
====================

Link: https://patch.msgid.link/20250516170645.1172700-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-19 20:07:41 -07:00
Sumanth Gavini
c6a957d067 selftests: drv-net: Fix "envirnoments" to "environments"
Fix misspelling reported by codespell

Signed-off-by: Sumanth Gavini <sumanth.gavini@yahoo.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250516225156.1122058-1-sumanth.gavini@yahoo.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-19 16:19:01 -07:00
Johannes Berg
a462903fa2 net: netlink: reduce extack cookie size
Seems like the extack cookie hasn't found any users outside
of wireless, which always uses nl_set_extack_cookie_u64().
Thus, allocating 20 bytes for it is pointless, reduce that
to 8 bytes, and add a BUILD_BUG_ON() to ensure it's enough
(obviously it is, for a u64, but in case it changes again.)

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Link: https://patch.msgid.link/20250516115927.38209-2-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-19 16:06:42 -07:00
David S. Miller
a8ae8a0e84 Merge tag 'ovpn-net-next-20250515' of https://github.com/OpenVPN/ovpn-net-next
Antonio Quartulli says:

====================
ovpn: pull request for net-next: ovpn 2025-05-15

this is a new version of the previous pull request.
These time I have removed the fixes that we are still discussing,
so that we don't hold the entire series back.

There is a new fix though: it's about properly checking the return value
of skb_to_sgvec_nomark(). I spotted the issue while testing pings larger
than the iface's MTU on a TCP VPN connection.

I have added various Closes and Link tags where applicable, so
that we have references to GitHub tickets and other public discussions.

Since I have resent the PR, I have also added Andrew's Reviewed-by to
the first patch.

Please pull or let me know if something should be changed!
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

Patchset highlights:
- update MAINTAINERS entry for ovpn
- extend selftest with more cases
- avoid crash in selftest in case of getaddrinfo() failure
- fix ndo_start_xmit return value on error
- set ignore_df flag for IPv6 packets
- drop useless reg_state check in keepalive worker
- retain skb's dst when entering xmit function
- fix check on skb_to_sgvec_nomark() return value
2025-05-19 12:10:43 +01:00
Jakub Kicinski
b8fa067c4a Merge branch 'vsock-test-improve-sigpipe-test-reliability'
Stefano Garzarella says:

====================
vsock/test: improve sigpipe test reliability

Running the tests continuously I noticed that sometimes the sigpipe
test would fail due to a race between the control message of the test
and the vsock transport messages.

While I was at it I also improved the test by checking the errno we
expect.

v1: https://lore.kernel.org/20250508142005.135857-1-sgarzare@redhat.com
====================

Link: https://patch.msgid.link/20250514141927.159456-1-sgarzare@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-16 18:01:34 -07:00
Stefano Garzarella
3c6abbe85b vsock/test: check also expected errno on sigpipe test
In the sigpipe test, we expect send() to fail, but we do not check if
send() fails with the errno we expect (EPIPE).

Add this check and repeat the send() in case of EINTR as we do in other
tests.

Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Link: https://patch.msgid.link/20250514141927.159456-4-sgarzare@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-16 18:01:32 -07:00
Stefano Garzarella
135a8a4d25 vsock/test: retry send() to avoid occasional failure in sigpipe test
When the other peer calls shutdown(SHUT_RD), there is a chance that
the send() call could occur before the message carrying the close
information arrives over the transport. In such cases, the send()
might still succeed. To avoid this race, let's retry the send() call
a few times, ensuring the test is more reliable.

Sleep a little before trying again to avoid flooding the other peer
and filling its receive buffer, causing false-negative.

Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Link: https://patch.msgid.link/20250514141927.159456-3-sgarzare@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-16 18:01:32 -07:00
Stefano Garzarella
a7262ed4b1 vsock/test: add timeout_usleep() to allow sleeping in timeout sections
The timeout API uses signals, so we have documented not to use sleep(),
but we can use nanosleep(2) since POSIX.1 explicitly specifies that it
does not interact with signals.

Let's provide timeout_usleep() for that.

Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Link: https://patch.msgid.link/20250514141927.159456-2-sgarzare@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-16 18:01:32 -07:00
Jakub Kicinski
9e1f7a3119 Merge branch 'tools-ynl-gen-support-sub-messages-and-rt-link'
Jakub Kicinski says:

====================
tools: ynl-gen: support sub-messages and rt-link

Sub-messages are how we express "polymorphism" in YNL. Donald added
the support to specs and Python a while back, support them in C, too.
Sub-message is a nest, but the interpretation of the attribute types
within that nest depends on a value of another attribute. For example
in rt-link the "kind" attribute contains the link type (veth, bonding,
etc.) and based on that the right enum has to be applied to interpret
link-specific attributes.

The last message is probably the most interesting to look at, as it
adds a fairly advanced sample.

This patch only contains enough support for rtnetlink, we will need
a little more complexity to support TC, where sub-messages may contain
fixed headers, and where the selector may be in a different nest than
the submessage.
====================

Link: https://patch.msgid.link/20250515231650.1325372-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-16 16:32:08 -07:00
Jakub Kicinski
d5d1813b28 tools: ynl: add a sample for rt-link
Add a fairly complete example of rt-link usage. If run without any
arguments it simply lists the interfaces and some of their attrs.
If run with an arg it tries to create and delete a netkit device.

 1 # ./tools/net/ynl/samples/rt-link 1
 2 Trying to create a Netkit interface
 3 Testing error message for policy being bad:
 4     Kernel error: 'Provided default xmit policy not supported' (bad attribute: .linkinfo.data(netkit).policy)
 5   1:               lo: mtu 65536
 6   2:           wlp0s1: mtu  1500
 7   3:          enp0s13: mtu  1500
 8   4:           dummy0: mtu  1500  kind dummy     altname one two
 9   5:              nk0: mtu  1500  kind netkit    primary 0  policy forward
10   6:              nk1: mtu  1500  kind netkit    primary 1  policy blackhole
11 Trying to delete a Netkit interface (ifindex 6)

Sample creates the device first, it sets an invalid value for a netkit
attribute to trigger reverse parsing. Line 4 shows the error with the
attribute path correctly generated by YNL.

Then sample fixes the bad attribute and re-issues the request, with
NLM_F_ECHO set. This flag causes the notification to be looped back
to the initiating socket (our socket). Sample parses this notification
to save the ifindex of the created netkit.

Sample then proceeds to list the devices. Line 8 above shows a dummy
device with two alt names. Lines 9 and 10 show the netkit devices
the sample itself created.

The "primary" and "policy" attrs are from inside the netkit submsg.
The string values are auto-generated for the enums by YNL.

To clean up sample deletes the interface it created (line 11).

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20250515231650.1325372-10-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-16 16:32:06 -07:00
Jakub Kicinski
6bab77ced3 tools: ynl: enable codegen for all rt- families
Switch from including Classic netlink families one by one to excluding.

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20250515231650.1325372-9-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-16 16:32:06 -07:00
Jakub Kicinski
0939a418b3 tools: ynl: submsg: reverse parse / error reporting
Reverse parsing lets YNL convert bad and missing attr pointers
from extack into a string like "missing attribute nest1.nest2.attr_name".
It's a feature that's unique to YNL C AFAIU (even the Python YNL
can't do nested reverse parsing). Add support for reverse-parsing
of sub-messages.

To simplify the logic and the code annotate the type policies
with extra metadata. Mark the selectors and the messages with
the information we need. We assume that key / selector always
precedes the sub-message while parsing (and also if there are
multiple sub-messages like in rt-link they are interleaved
selector 1 ... submsg 1 ... selector 2 .. submsg 2, not
selector 1 ... selector 2 ... submsg 1 ... submsg 2).

The rt-link sample in a subsequent changes shows reverse parsing
of sub-messages in action.

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20250515231650.1325372-8-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-16 16:32:06 -07:00
Jakub Kicinski
b9e03e2636 tools: ynl-gen: submsg: support parsing and rendering sub-messages
Adjust parsing and rendering appropriately to make sub-messages work.
Rendering is pretty trivial, as the submsg -> netlink conversion looks
like rendering a nest in which only one attr was set. Only trick
is that we use the enum value of the sub-message rather than the nest
as the type, and effectively skip one layer of nesting. A real double
nested struct would look like this:

  [SELECTOR]
  [SUBMSG]
    [NEST]
      [MSG1-ATTR]

A submsg "is" the nest so by skipping I mean:

  [SELECTOR]
  [SUBMSG]
    [MSG1-ATTR]

There is no extra validation in YNL if caller has set the selector
matching the submsg type (e.g. link type = "macvlan" but the nest
attrs are set to carry "veth"). Let the kernel handle that.

Parsing side is a little more specialized as we need to render and
insert a new kind of function which switches between what to parse
based on the selector. But code isn't too complicated.

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20250515231650.1325372-7-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-16 16:32:06 -07:00
Jakub Kicinski
6366d26778 tools: ynl-gen: submsg: render the structs
The easiest (or perhaps only sane) way to support submessages in C
is to treat them as if they were nests. Build fake attributes to
that effect in the codegen. Render the submsg as a big nest of all
possible values.

With this in place the main missing part is to hook in the switch
which selects how to parse based on the key.

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20250515231650.1325372-6-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-16 16:32:05 -07:00
Jakub Kicinski
3186a8e55a tools: ynl-gen: submsg: plumb thru an empty type
Hook in handling of sub-messages, for now treat them as ignored attrs.

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20250515231650.1325372-5-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-16 16:32:05 -07:00
Jakub Kicinski
99b76908a7 tools: ynl-gen: prepare for submsg structs
Prepare for constructing Struct() instances which represent
sub-messages rather than nested attributes.
Restructure the code / indentation to more easily insert
a case where nested reference comes from annotation other
than the 'nested-attributes' property. Make sure we don't
construct the Struct() object from scratch in multiple
places as the constructor will soon have more arguments.

This should cause no functional change.

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20250515231650.1325372-4-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-16 16:32:05 -07:00
Jakub Kicinski
c9c048993d tools: ynl-gen: factor out the annotation of pure nested struct
We're about to add some code here for sub-messages.
Factor out the nest-related logic to make the code readable.
No functional change.

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20250515231650.1325372-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-16 16:32:05 -07:00
Jakub Kicinski
dc3f63bc3e netlink: specs: rt-link: add C naming info for ovpn
C naming info for OVPN which was added since I adjusted
the existing attrs. Also add missing reference to a header needed
for a bridge struct.

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20250515231650.1325372-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-16 16:32:05 -07:00
Oleksij Rempel
622b91e0f9 net: phy: microchip: document where the LAN88xx PHYs are used
The driver uses the name LAN88xx for PHYs with phy_id = 0x0007c132. But
with this placeholder name no documentation can be found on the net.

Document the fact that these PHYs are build into the LAN7800 and LAN7850
USB/Ethernet controllers.

Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20250515082051.2644450-1-o.rempel@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-16 16:27:54 -07:00
Heiner Kallweit
7b151e4efd net: phy: fixed_phy: remove fixed_phy_register_with_gpiod
Since its introduction 6 yrs ago this functions has never had a user.
So remove it.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Link: https://patch.msgid.link/ccbeef28-65ae-4e28-b1db-816c44338dee@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-16 16:08:07 -07:00
Eric Dumazet
9cd5ef0b8c net: rfs: add sock_rps_delete_flow() helper
RFS can exhibit lower performance for workloads using short-lived
flows and a small set of 4-tuple.

This is often the case for load-testers, using a pair of hosts,
if the server has a single listener port.

Typical use case :

Server : tcp_crr -T128 -F1000 -6 -U -l30 -R 14250
Client : tcp_crr -T128 -F1000 -6 -U -l30 -c -H server | grep local_throughput

This is because RFS global hash table contains stale information,
when the same RSS key is recycled for another socket and another cpu.

Make sure to undo the changes and go back to initial state when
a flow is disconnected.

Performance of the above test is increased by 22 %,
going from 372604 transactions per second to 457773.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Octavian Purdila <tavip@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Link: https://patch.msgid.link/20250515100354.3339920-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-16 16:03:48 -07:00
ChunHao Lin
f24f7b2f3a r8169: add support for RTL8127A
This adds support for 10Gbs chip RTL8127A.

Signed-off-by: ChunHao Lin <hau@realtek.com>
Reviewed-by: Heiner Kallweit <hkallweit1@gmail.com>
Link: https://patch.msgid.link/20250515095303.3138-1-hau@realtek.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-16 16:02:34 -07:00
Moon Yeounsu
12889ce926 net: dlink: add synchronization for stats update
This patch synchronizes code that accesses from both user-space
and IRQ contexts. The `get_stats()` function can be called from both
context.

`dev->stats.tx_errors` and `dev->stats.collisions` are also updated
in the `tx_errors()` function. Therefore, these fields must also be
protected by synchronized.

There is no code that accessses `dev->stats.tx_errors` between the
previous and updated lines, so the updating point can be moved.

Signed-off-by: Moon Yeounsu <yyyynoom@gmail.com>
Link: https://patch.msgid.link/20250515075333.48290-1-yyyynoom@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-16 15:58:13 -07:00
Carolina Jubran
b66b76a82c net/mlx5e: Reuse per-RQ XDP buffer to avoid stack zeroing overhead
CONFIG_INIT_STACK_ALL_ZERO introduces a performance cost by
zero-initializing all stack variables on function entry. The mlx5 XDP
RX path previously allocated a struct mlx5e_xdp_buff on the stack per
received CQE, resulting in measurable performance degradation under
this config.

This patch reuses a mlx5e_xdp_buff stored in the mlx5e_rq struct,
avoiding per-CQE stack allocations and repeated zeroing.

With this change, XDP_DROP and XDP_TX performance matches that of
kernels built without CONFIG_INIT_STACK_ALL_ZERO.

Performance was measured on a ConnectX-6Dx using a single RX channel
(1 CPU at 100% usage) at ~50 Mpps. The baseline results were taken from
net-next-6.15.

Stack zeroing disabled:
- XDP_DROP:
    * baseline:                     31.47 Mpps
    * baseline + per-RQ allocation: 32.31 Mpps (+2.68%)

- XDP_TX:
    * baseline:                     12.41 Mpps
    * baseline + per-RQ allocation: 12.95 Mpps (+4.30%)

Stack zeroing enabled:
- XDP_DROP:
    * baseline:                     24.32 Mpps
    * baseline + per-RQ allocation: 32.27 Mpps (+32.7%)

- XDP_TX:
    * baseline:                     11.80 Mpps
    * baseline + per-RQ allocation: 12.24 Mpps (+3.72%)

Reported-by: Sebastiano Miano <mianosebastiano@gmail.com>
Reported-by: Samuel Dobron <sdobron@redhat.com>
Link: https://lore.kernel.org/all/CAMENy5pb8ea+piKLg5q5yRTMZacQqYWAoVLE1FE9WhQPq92E0g@mail.gmail.com/
Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Link: https://patch.msgid.link/1747253032-663457-1-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-16 15:46:07 -07:00
Frank Wunderlich
15d7b3dfaf net: phy: mediatek: do not require syscon compatible for pio property
Current implementation requires syscon compatible for pio property
which is used for driving the switch leds on mt7988.

Replace syscon_regmap_lookup_by_phandle with of_parse_phandle and
device_node_to_regmap to get the regmap already assigned by pinctrl
driver.

Signed-off-by: Frank Wunderlich <frank-w@public-files.de>
Link: https://patch.msgid.link/20250510174933.154589-1-linux@fw-web.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-16 15:40:51 -07:00
Milena Olech
494565a745 idpf: add support for Rx timestamping
Add Rx timestamp function when the Rx timestamp value is read directly
from the Rx descriptor. In order to extend the Rx timestamp value to 64
bit in hot path, the PHC time is cached in the receive groups.
Add supported Rx timestamp modes.

Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Milena Olech <milena.olech@intel.com>
Tested-by: YiFei Zhu <zhuyifei@google.com>
Tested-by: Mina Almasry <almasrymina@google.com>
Tested-by: Samuel Salin <Samuel.salin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-05-16 09:13:02 -07:00
Milena Olech
1a49cf814f idpf: add Tx timestamp flows
Add functions to request Tx timestamp for the PTP packets, read the Tx
timestamp when the completion tag for that packet is being received,
extend the Tx timestamp value and set the supported timestamping modes.

Tx timestamp is requested for the PTP packets by setting a TSYN bit and
index value in the Tx context descriptor. The driver assumption is that
the Tx timestamp value is ready to be read when the completion tag is
received. Then the driver schedules delayed work and the Tx timestamp
value read is requested through virtchnl message. At the end, the Tx
timestamp value is extended to 64-bit and provided back to the skb.

Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Co-developed-by: Josh Hay <joshua.a.hay@intel.com>
Signed-off-by: Josh Hay <joshua.a.hay@intel.com>
Signed-off-by: Milena Olech <milena.olech@intel.com>
Tested-by: Samuel Salin <Samuel.salin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-05-16 09:13:02 -07:00
Milena Olech
4901e83a94 idpf: add Tx timestamp capabilities negotiation
Tx timestamp capabilities are negotiated for the uplink Vport.
Driver receives information about the number of available Tx timestamp
latches, the size of Tx timestamp value and the set of indexes used
for Tx timestamping.

Add function to get the Tx timestamp capabilities and parse the uplink
vport flag.

Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Co-developed-by: Emil Tantilov <emil.s.tantilov@intel.com>
Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com>
Co-developed-by: Pavan Kumar Linga <pavan.kumar.linga@intel.com>
Signed-off-by: Pavan Kumar Linga <pavan.kumar.linga@intel.com>
Signed-off-by: Milena Olech <milena.olech@intel.com>
Tested-by: Samuel Salin <Samuel.salin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-05-16 09:13:02 -07:00
Milena Olech
d5dba8f720 idpf: add PTP clock configuration
PTP clock configuration operations - set time, adjust time and adjust
frequency are required to control the clock and maintain synchronization
process.

Extend get PTP capabilities function to request for the clock adjustments
and add functions to enable these actions using dedicated virtchnl
messages.

Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Milena Olech <milena.olech@intel.com>
Tested-by: Mina Almasry <almasrymina@google.com>
Tested-by: Samuel Salin <Samuel.salin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-05-16 09:13:02 -07:00
Milena Olech
5a27503d38 idpf: add mailbox access to read PTP clock time
When the access to read PTP clock is specified as mailbox, the driver
needs to send virtchnl message to perform PTP actions. Message is sent
using idpf_mbq_opc_send_msg_to_peer_drv mailbox opcode, with the parameters
received during PTP capabilities negotiation.

Add functions to recognize PTP messages, move them to dedicated secondary
mailbox, read the PTP clock time and cross timestamp using mailbox
messages.

Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Milena Olech <milena.olech@intel.com>
Tested-by: Samuel Salin <Samuel.salin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-05-16 09:13:02 -07:00
Milena Olech
5cb8805d23 idpf: negotiate PTP capabilities and get PTP clock
PTP capabilities are negotiated using virtchnl command. Add get
capabilities function, direct access to read the PTP clock.
Set initial PTP capabilities exposed to the stack.

Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Milena Olech <milena.olech@intel.com>
Tested-by: Willem de Bruijn <willemb@google.com>
Tested-by: Mina Almasry <almasrymina@google.com>
Tested-by: Samuel Salin <Samuel.salin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-05-16 09:13:02 -07:00
Milena Olech
c5d0607f42 idpf: move virtchnl structures to the header file
Move virtchnl structures to the header file to expose them for the PTP
virtchnl file.

Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Milena Olech <milena.olech@intel.com>
Tested-by: Mina Almasry <almasrymina@google.com>
Tested-by: Samuel Salin <Samuel.salin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-05-16 09:13:02 -07:00
Milena Olech
bf27283ba5 virtchnl: add PTP virtchnl definitions
PTP capabilities are negotiated using virtchnl commands. There are two
available modes of the PTP support: direct and mailbox. When the direct
access to PTP resources is negotiated, virtchnl messages returns a set
of registers that allow read/write directly. When the mailbox access to
PTP resources is negotiated, virtchnl messages are used to access
PTP clock and to read the timestamp values.

Virtchnl API covers both modes and exposes a set of PTP capabilities.

Using virtchnl API, the driver recognizes also HW abilities - maximum
adjustment of the clock and the basic increment value.

Additionally, API allows to configure the secondary mailbox, dedicated
exclusively for PTP purposes.

Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Milena Olech <milena.olech@intel.com>
Tested-by: Mina Almasry <almasrymina@google.com>
Tested-by: Samuel Salin <Samuel.salin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-05-16 09:13:02 -07:00