Ido Schimmel says:
====================
net: fib_rules: Add port mask support
In some deployments, users would like to encode path information into
certain bits of the IPv6 flow label, the UDP source port and the DSCP
field, and use this information to route packets accordingly.
Redirecting traffic to a routing table based on specific bits in the UDP
source port is not currently possible, as FIB rules only support exact
match and range.
This patchset extends FIB rules to match on layer 4 ports with an
optional mask. The mask is not supported when matching on a range. A
future patchset will add support for matching on the DSCP field with an
optional mask.
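To illustrate the semantics, a masked rule compares only the bits
selected by the mask. A minimal sketch (not the kernel's exact helper):

  /* Sketch: match an L4 port against a rule's value/mask pair. */
  static bool port_masked_match(__be16 rule_port, __be16 rule_mask,
                                __be16 flow_port)
  {
      /* Only the bits selected by the mask must match. */
      return (flow_port & rule_mask) == rule_port;
  }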
Patches #1-#6 gradually extend FIB rules to match on layer 4 ports with
an optional mask.
Patches #7-#8 add test cases for FIB rule port matching.
iproute2 support can be found here [1].
[1] https://github.com/idosch/iproute2/tree/submit/fib_rule_mask_v1
====================
Link: https://patch.msgid.link/20250217134109.311176-1-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add tests for FIB rules that match on source and destination ports with
a mask. Test both good and bad flows.
# ./fib_rule_tests.sh
IPv6 FIB rule tests
[...]
TEST: rule6 check: sport and dport redirect to table [ OK ]
TEST: rule6 check: sport and dport no redirect to table [ OK ]
TEST: rule6 del by pref: sport and dport redirect to table [ OK ]
TEST: rule6 check: sport and dport range redirect to table [ OK ]
TEST: rule6 check: sport and dport range no redirect to table [ OK ]
TEST: rule6 del by pref: sport and dport range redirect to table [ OK ]
TEST: rule6 check: sport and dport masked redirect to table [ OK ]
TEST: rule6 check: sport and dport masked no redirect to table [ OK ]
TEST: rule6 del by pref: sport and dport masked redirect to table [ OK ]
[...]
Tests passed: 292
Tests failed: 0
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20250217134109.311176-9-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add attributes that allow matching on source and destination ports with
a mask. Matching on the source port with a mask is needed in deployments
where users encode path information into certain bits of the UDP source
port.
Temporarily set the type of the attributes to 'NLA_REJECT' while support
is being added.
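The placeholder pattern looks roughly like this (a sketch; the FRA_*
names are this series', the policy excerpt is illustrative):

  static const struct nla_policy fib_rule_policy[] = {
      /* ... existing attributes ... */
      [FRA_SPORT_MASK] = { .type = NLA_REJECT },
      [FRA_DPORT_MASK] = { .type = NLA_REJECT },
  };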
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20250217134109.311176-2-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Roger Quadros says:
====================
net: ethernet: ti: am65-cpsw: drop multiple functions and code cleanup
We have 2 tx completion functions to handle single-port vs multi-port
variants. Merge them into one function to make maintenance easier.
We also have 2 functions to handle TX completion for SKB vs XDP.
Get rid of them too.
Also do some minor cleanups.
====================
Signed-off-by: Roger Quadros <rogerq@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Drop the separate TX completion functions for SKB and XDP. To do that,
use the SW_DATA mechanism to store ndev and skb/xdpf for TX packets.
Use BUILD_BUG_ON_MSG() to fail the build if the SW_DATA size exceeds
what's available, i.e. AM65_CPSW_NAV_SW_DATA_SIZE.
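Roughly, as a sketch (the driver's actual struct and check may differ
in detail; the BUILD_BUG_ON_MSG() sits in a function body):

  /* Wrapper stored in the descriptor's software data area. */
  struct am65_cpsw_swdata {
      struct net_device *ndev;
      union {
          struct sk_buff *skb;
          struct xdp_frame *xdpf;
      };
  };

  BUILD_BUG_ON_MSG(sizeof(struct am65_cpsw_swdata) >
                   AM65_CPSW_NAV_SW_DATA_SIZE,
                   "SW_DATA size exceeds AM65_CPSW_NAV_SW_DATA_SIZE");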
Signed-off-by: Roger Quadros <rogerq@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This allows us to re-use am65_cpsw_run_xdp() for the zero-copy case.
Add an AM65_CPSW_XDP_TX case for successful XDP_TX so we don't free the
page while it is in flight.
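The RX path then keys page handling off the verdict, along these lines
(a sketch with illustrative surroundings; only the AM65_CPSW_XDP_TX
case is new):

  switch (am65_cpsw_run_xdp(...)) {
  case AM65_CPSW_XDP_PASS:
      break;                        /* build an skb from the page */
  case AM65_CPSW_XDP_TX:            /* queued on our own TX ring */
  case AM65_CPSW_XDP_REDIRECT:      /* handed to another device */
      return;                       /* page in flight, don't free it */
  case AM65_CPSW_XDP_CONSUMED:
      page_pool_recycle_direct(pool, page);
      return;
  }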
Signed-off-by: Roger Quadros <rogerq@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
In am65_cpsw_run_xdp() instead of goto followed by return, simply return.
Signed-off-by: Roger Quadros <rogerq@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
am65_cpsw_run_xdp() can figure out the cpu id itself.
No need to pass it around 2 functions so drop it.
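i.e., inside am65_cpsw_run_xdp(), something like (sketch):

  /* Called from NAPI (softirq) context, so this is stable here. */
  int cpu = smp_processor_id();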
Signed-off-by: Roger Quadros <rogerq@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The only difference between am65_cpsw_nuss_tx_compl_packets_2g() and
am65_cpsw_nuss_tx_compl_packets() is the usage of spin_lock() and of
netdev_tx_completed_queue() + am65_cpsw_nuss_tx_wake() at every packet
in the latter.
Instead of having 2 separate functions for TX completion, merge them
into one. This reduces code duplication and makes maintenance easier.
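The merged loop can branch on the variant per packet, along these lines
(a sketch with hypothetical helper names, not the driver's exact code):

  while ((desc = tx_pop_completed(tx_chn)) != NULL) {
      ndev = tx_free_one(tx_chn, desc);       /* skb or xdpf */
      if (single_port) {
          total_bytes += desc->len;           /* batched accounting */
      } else {
          netdev_tx_completed_queue(txq, 1, desc->len);
          tx_try_wake(tx_chn, ndev);
      }
      num_tx++;
  }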
Signed-off-by: Roger Quadros <rogerq@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Shradha Gupta says:
====================
net: Enable Big TCP for MANA devices
Allow the max gso/gro aggregated pkt size to go up to GSO_MAX_SIZE for
the MANA NIC. On Azure, this is not possible without allowing the same
for the netvsc NIC (as the NICs are bonded together). Therefore, we use
netif_set_tso_max_size() to set the max aggregated pkt size to the VF's
tso_max_size for netvsc too, when the data path is switched over to the
VF.
The first patch allows MANA to configure an aggregated pkt size of up
to GSO_MAX_SIZE.
The second patch enables the same on the netvsc NIC, if the data path
for the bonded NIC is switched to the VF.
---
Changes in v3
* Add ipv6_hopopt_jumbo_remove() while sending Big TCP packets
---
Changes in v2
* Use the words 'aggregated pkt size' instead of 'tcp segment'
throughout the patch
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
On Azure, increasing the VF's gso/gro packet size up to GSO_MAX_SIZE
is not possible without allowing the same for the netvsc NIC
(as the NICs are bonded together). For bonded NICs, the min of the max
aggregated pkt size of the members is propagated in the stack.
Therefore, we use netif_set_tso_max_size() to set the max aggregated
pkt size to the VF's packet size for netvsc too, when the data path is
switched over to the VF.
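Concretely, on switch-over the synthetic NIC can inherit the VF's limit
(sketch; netif_set_tso_max_size() is the real helper, the surrounding
code is illustrative):

  if (vf_netdev)  /* data path now goes through the VF */
      netif_set_tso_max_size(ndev, vf_netdev->tso_max_size);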
Tested on Azure env with Accelerated Networking enabled and disabled.
Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Willem de Bruijn says:
====================
net: deduplicate cookie logic
Reuse standard sk, ip and ipv6 cookie init handlers where possible.
Avoid repeated open coding of the same logic.
Harmonize feature sets across protocols.
Make IPv4 and IPv6 logic more alike.
Simplify adding future new fields with a single init point.
====================
Link: https://patch.msgid.link/20250214222720.3205500-1-willemdebruijn.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This initializes tclass and dontfrag before cmsg parsing, removing the
need for explicit checks against -1 in each caller.
Leave hlimit set to -1, because its full initialization
(in ip6_sk_dst_hoplimit) requires more state (dst, flowi6, ...).
This also prepares for calling sockcm_init in a follow-on patch.
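The resulting init is roughly (a sketch; the exact accessors in
ipcm6_init_sk() may differ):

  ipc6->tclass   = READ_ONCE(np->tclass);
  ipc6->dontfrag = inet6_test_bit(DONTFRAG, sk);
  ipc6->hlimit   = -1;  /* deferred: ip6_sk_dst_hoplimit() needs dst */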
Signed-off-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20250214222720.3205500-7-willemdebruijn.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Initialize the ip cookie tos field when initializing the cookie, in
ipcm_init_sk.
The existing code inverts the standard pattern for initializing cookie
fields. Default is to initialize the field from the sk, then possibly
overwrite that when parsing cmsgs (the unlikely case).
This field inverted that pattern: it was set to an illegal value, and
after cmsg parsing the code checked whether the value was still illegal
and thus should be overridden.
Be careful to always apply mask INET_DSCP_MASK, as before.
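With that, the field follows the standard pattern (sketch; details
illustrative):

  /* Default from the socket; cmsg parsing may overwrite, also masked. */
  ipc->tos = READ_ONCE(inet->tos) & INET_DSCP_MASK;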
Signed-off-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20250214222720.3205500-5-willemdebruijn.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Avoid open coding initialization of sockcm fields.
Avoid reading the sk_priority field twice.
This ensures all callers, existing and future, will correctly try a
cmsg passed mark before sk_mark.
This patch extends support for cmsg mark to packet_spkt, packet_tpacket
and net/can/raw.c, and support for cmsg priority to packet_spkt and
packet_tpacket.
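The shared helper reads each field once, and cmsg parsing overwrites
the defaults on top (a sketch close to sockcm_init(); treat details as
illustrative):

  static inline void sockcm_init(struct sockcm_cookie *sockc,
                                 const struct sock *sk)
  {
      *sockc = (struct sockcm_cookie) {
          .mark     = READ_ONCE(sk->sk_mark),
          .tsflags  = READ_ONCE(sk->sk_tsflags),
          .priority = READ_ONCE(sk->sk_priority),
      };
  }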
Signed-off-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20250214222720.3205500-3-willemdebruijn.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
As kernel test robot reported, the following warning occurs:
cocci warnings: (new ones prefixed by >>)
>> drivers/net/ethernet/stmicro/stmmac/dwmac1000_core.c:582:6-8: opportunity for str_enabled_disabled(on)
Replace ternary (condition ? "enabled" : "disabled") with
str_enabled_disabled() from string_choices.h to improve readability,
maintain uniform string usage, and reduce binary size through linker
deduplication.
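I.e., at an illustrative call site:

  - pr_info("Link %s\n", on ? "enabled" : "disabled");
  + pr_info("Link %s\n", str_enabled_disabled(on));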
Reviewed-by: Huacai Chen <chenhuacai@loongson.cn>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: Yu-Chun Lin <eleanor15x@gmail.com>
Link: https://patch.msgid.link/20250217155833.3105775-1-eleanor15x@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The old_flags variable is declared twice in __dev_change_flags(),
causing a shadow variable warning:
net/core/dev.c:9225:16: warning: declaration shadows a local variable [-Wshadow]
9225 | unsigned int old_flags = dev->flags;
| ^
net/core/dev.c:9185:15: note: previous declaration is here
9185 | unsigned int old_flags = dev->flags;
| ^
1 warning generated.
Remove the redundant inner declaration and reuse the existing old_flags
variable since its value is not needed outside the if block, and it is
safe to reuse the variable. This eliminates the warning while
maintaining the same functionality.
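In diff form, the shape of the fix (a sketch, not the exact hunk):

      unsigned int old_flags = dev->flags;      /* function scope */
      ...
  -       unsigned int old_flags = dev->flags;  /* shadows the outer one */
  +       old_flags = dev->flags;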
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com>
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Reviewed-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20250217-old_flags-v2-1-4cda3b43a35f@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When we terminate the dump, the callback is no longer running, so
cb_running should be set to false to be logically consistent.
cb_running signifies whether a dump is ongoing. It is set to true in
cb->start(), and is checked in netlink_dump() to be true initially.
After the dump, it is set to false in the same function.
This is just a cleanup, no path should access this field on a closed
socket.
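For reference, the flag's lifecycle (illustrative excerpts):

  nlk->cb_running = true;   /* set when the dump starts (cb->start()) */
  ...
  if (!nlk->cb_running)     /* checked on entry to netlink_dump() */
      goto errout;
  ...
  nlk->cb_running = false;  /* dump finished -- or, now, terminated */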
Signed-off-by: Siddh Raman Pant <siddh.raman.pant@oracle.com>
Link: https://patch.msgid.link/aff028e3eb2b768b9895fa6349fa1981ae22f098.camel@oracle.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Report standard statistics using the dedicated callbacks instead of
get_ethtool_stats.
OCTTX is split over two registers. Accumulating these registers
separately in gem_stats just means we need to combine them again later.
Instead, combine these stats before saving them, as is done for
ethtool_stats.
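I.e., combine the halves at read time (sketch; accessors follow macb's
gem_readl() style, names illustrative):

  u64 octtx;

  octtx  = gem_readl(bp, OCTTXL);
  octtx |= (u64)gem_readl(bp, OCTTXH) << 32;
  stats->tx_octets += octtx;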
Signed-off-by: Sean Anderson <sean.anderson@linux.dev>
Link: https://patch.msgid.link/20250214212703.2618652-3-sean.anderson@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski says:
====================
eth: mlx4: use the page pool for Rx buffers
Convert mlx4 to page pool. I've been sitting on these patches for
over a year, and Jonathan Lemon had a similar series years before.
We never deployed it or sent upstream because it didn't really show
much perf win under normal load (admittedly I think the real testing
was done before Ilias's work on recycling).
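The conversion itself is the standard page_pool recipe, one pool per RX
ring (a sketch; parameter choices are illustrative, not necessarily
what this series picked):

  struct page_pool_params pp = {
      .flags     = PP_FLAG_DMA_MAP,   /* pool maps pages once */
      .order     = 0,
      .pool_size = ring_size,
      .nid       = node,
      .dev       = &mdev->pdev->dev,
      .dma_dir   = DMA_BIDIRECTIONAL, /* XDP may write to the page */
  };

  ring->pp = page_pool_create(&pp);
  if (IS_ERR(ring->pp))
      return PTR_ERR(ring->pp);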
During the v6.9 kernel rollout Meta's CDN team noticed that machines
with CX3 Pro (mlx4) are prone to overloads (double digit % of CPU time
spent mapping buffers in the IOMMU). The problem does not occur with
modern NICs, so I dusted off this series and reportedly it still works.
And it makes the problem go away, no overloads, perf back in line with
older kernels. Something must have changed in IOMMU code, I guess.
This series is very simple, and can very likely be optimized further.
Thing is, I don't have access to any CX3 Pro NICs. They only exist
in CDN locations which haven't had a HW refresh for a while. So I can
say this series survives a week under traffic w/ XDP enabled, but
my ability to iterate and improve is a bit limited.
v2: https://lore.kernel.org/20250211192141.619024-1-kuba@kernel.org
v1: https://lore.kernel.org/20250205031213.358973-1-kuba@kernel.org
====================
Link: https://patch.msgid.link/20250213010635.1354034-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>