Report device_stall_critical_watermark_cnt as tx_pause_storm_events in
the ethtool_pause_stats struct. This counter tracks pause storm error
events which indicate the NIC has been sending pause frames for an
extended period due to a stall.
The ethtool_pause_stats struct reports these stalls as a single value,
whereas the device supports tracking them per priority. Aggregate the
counter across all priority classes to capture stalls on all priorities.
Note that the stats are fetched from the device for each priority via
mlx5_core_access_reg().
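The aggregation step can be sketched in userspace (the helper name and the fixed priority count are illustrative assumptions; the real driver reads each priority's counters via mlx5_core_access_reg()):

```c
#include <stdint.h>

/* Hypothetical helper: sum device_stall_critical_watermark_cnt across
 * all priority classes into the single tx_pause_storm_events value.
 * NUM_PRIOS and the array layout are assumptions for illustration. */
#define NUM_PRIOS 8

static uint64_t aggregate_pause_storm_events(const uint64_t per_prio[NUM_PRIOS])
{
	uint64_t total = 0;

	for (int i = 0; i < NUM_PRIOS; i++)
		total += per_prio[i];
	return total;
}
```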
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Mohsin Bashir <mohsin.bashr@gmail.com>
Link: https://patch.msgid.link/20260302230149.1580195-6-mohsin.bashr@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
With pause storm protection in place, track the occurrence of pause
storm events. Since there is a one-to-one mapping between pause storm
interrupts and events, use the interrupt count to track this metric.
./ethtool -I -a eth0
Pause parameters for eth0:
Autonegotiate: off
RX: off
TX: on
Statistics:
tx_pause_frames: 759657
rx_pause_frames: 0
tx_pause_storm_events: 219
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Mohsin Bashir <mohsin.bashr@gmail.com>
Link: https://patch.msgid.link/20260302230149.1580195-5-mohsin.bashr@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Add protection against TX pause storms. A pause storm occurs when a
device fails to send received packets up to the stack. When a pause
storm is detected (pause state persists beyond the configured timeout),
the device stops sending the pause frames and begins dropping packets
instead of back-pressuring.
The timeout is configurable via ethtool tunable (pfc-prevention-tout)
with a maximum value of 10485ms and a default value of 500ms.
Once the device transitions to the storm-detected state, the service
task periodically attempts recovery, returning the device to normal
operation to handle any subsequent pause storm episodes.
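The detect/recover cycle described above can be modeled as a small state machine (names and the polling model are my assumptions, not the driver's actual implementation):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical sketch of the storm detection and recovery logic. */
enum tx_pause_state {
	TX_PAUSE_NORMAL,	/* back-pressure via pause frames */
	TX_PAUSE_STORM,		/* timeout exceeded: drop instead */
};

struct tx_pause_ctx {
	enum tx_pause_state state;
	uint64_t pause_start_ms;	/* when the current pause began */
	uint64_t timeout_ms;		/* pfc-prevention-tout */
	uint64_t storm_events;		/* reported as tx_pause_storm_events */
};

/* Called periodically (e.g. from a service task) with the current time
 * and whether the device is still asserting pause. */
static void tx_pause_poll(struct tx_pause_ctx *ctx, uint64_t now_ms,
			  bool pause_asserted)
{
	switch (ctx->state) {
	case TX_PAUSE_NORMAL:
		if (pause_asserted &&
		    now_ms - ctx->pause_start_ms > ctx->timeout_ms) {
			/* Storm detected: stop pausing, start dropping. */
			ctx->state = TX_PAUSE_STORM;
			ctx->storm_events++;
		} else if (!pause_asserted) {
			ctx->pause_start_ms = now_ms;
		}
		break;
	case TX_PAUSE_STORM:
		if (!pause_asserted) {
			/* Recovered: return to normal so a later storm
			 * episode can be detected (and counted) again. */
			ctx->state = TX_PAUSE_NORMAL;
			ctx->pause_start_ms = now_ms;
		}
		break;
	}
}
```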
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Mohsin Bashir <mohsin.bashr@gmail.com>
Link: https://patch.msgid.link/20260302230149.1580195-4-mohsin.bashr@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
With TX pause enabled, if a device is unable to pass packets up to the
stack (e.g., the CPU is hung), the device can cause a pause storm. Given
that devices can have native support to protect the neighbor from such
flooding, such events need some tracking. Add support for tracking TX
pause storm events for better observability.
Reviewed-by: Oleksij Rempel <o.rempel@pengutronix.de>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Mohsin Bashir <mohsin.bashr@gmail.com>
Link: https://patch.msgid.link/20260302230149.1580195-2-mohsin.bashr@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Ankit Garg says:
====================
gve: optimize and enable HW GRO for DQO
The DQO device has always performed HW GRO, not LRO. This series updates
the feature bit and modifies the RX path to enhance support. It sets
gso_segs correctly so the software stack can continue coalescing, and
pulls network headers into the skb linear space to avoid multiple small
memory copies when header-split is disabled.
We also enable HW GRO by default on supported devices.
====================
Link: https://patch.msgid.link/20260303195549.2679070-1-joshwash@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Currently, in DQO mode with hw-gro enabled, the entire received packet
is placed into skb fragments when header-split is disabled. This leaves
the skb linear part empty, forcing the networking stack to do multiple
small memory copies to access the Ethernet, IP and TCP headers.
This patch adds a single memcpy to put all headers into linear portion
before packet reaches the SW GRO stack; thus eliminating multiple
smaller memcpy calls.
Additionally, the criteria for calling napi_gro_frags() was updated.
Since skb->head is now populated, we instead check if the SKB is the
cached NAPI scratchpad to ensure we continue using the zero-allocation
path.
Signed-off-by: Ankit Garg <nktgrg@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Harshitha Ramamurthy <hramamurthy@google.com>
Signed-off-by: Joshua Washington <joshwash@google.com>
Link: https://patch.msgid.link/20260303195549.2679070-4-joshwash@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Leaving gso_segs unpopulated on hardware GRO packets prevents further
coalescing by the software stack: the kernel's GRO logic marks the
SKB for flush because the expected total length of all segments doesn't
match the actual payload length.
Setting gso_segs correctly results in significantly more segments being
coalesced as measured by the result of dev_gro_receive().
gso_segs is derived from the payload length. When header-split is
enabled, the payload is in the non-linear portion of the skb. And when
header-split is disabled, we have to parse the headers to determine the
payload length.
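A sketch of the two cases under my own naming (not gve's actual code): gso_segs is conventionally the payload length divided by the MSS (gso_size), rounded up.

```c
#include <stdint.h>

/* DIV_ROUND_UP, as in the kernel */
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* Hypothetical: gso_segs from payload length and MSS (gso_size). */
static uint32_t gso_segs_for(uint32_t payload_len, uint32_t mss)
{
	return DIV_ROUND_UP(payload_len, mss);
}

/* header-split on: payload sits in the frags, so payload_len is the
 * non-linear length (data_len); header-split off: payload_len is the
 * total length minus the header length obtained by parsing. */
static uint32_t payload_len_for(uint32_t total_len, uint32_t data_len,
				uint32_t parsed_hdr_len, int header_split)
{
	return header_split ? data_len : total_len - parsed_hdr_len;
}
```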
Signed-off-by: Ankit Garg <nktgrg@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jordan Rhee <jordanrhee@google.com>
Reviewed-by: Harshitha Ramamurthy <hramamurthy@google.com>
Signed-off-by: Joshua Washington <joshwash@google.com>
Link: https://patch.msgid.link/20260303195549.2679070-3-joshwash@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Currently, ppp->xmit_pending is used in ppp_send_frame() to pass an skb
to ppp_push(), and holds the skb when a PPP channel cannot immediately
transmit it. This state is redundant because the transmit queue
(ppp->file.xq) can already handle the backlog. Furthermore, during
normal operation, an skb is queued in file.xq only to be immediately
dequeued, causing unnecessary overhead.
Refactor the transmit path to avoid stashing the skb when possible:
- Remove ppp->xmit_pending.
- Rename ppp_send_frame() to ppp_prepare_tx_skb(), and don't call
ppp_push() in it. It returns 1 if the skb is consumed
(dropped/handled) or 0 if it can be passed to ppp_push().
- Update ppp_push() to accept the skb. It returns 1 if the skb is
consumed, or 0 if the channel is busy.
- Optimize __ppp_xmit_process():
- Fastpath: If the queue is empty, attempt to send the skb directly
via ppp_push(). If busy, queue it.
- Slowpath: If the queue is not empty, process the backlog in
file.xq. Split dequeuing loop into a separate function
ppp_xmit_flush() so ppp_channel_push() uses that directly instead of
passing a NULL skb to __ppp_xmit_process().
This simplifies the states and reduces locking in the fastpath.
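The fastpath/slowpath split above can be modeled with a toy FIFO and channel (names and the fixed-size ring are mine, not the driver's); the point is that an empty backlog lets packets bypass the queue entirely while ordering is preserved once a backlog exists:

```c
#include <stdbool.h>
#include <stddef.h>

#define QLEN 8

struct txq { int pkts[QLEN]; size_t head, tail; };
struct channel { bool busy; int sent[QLEN]; size_t nsent; };

static bool q_empty(const struct txq *q) { return q->head == q->tail; }
static void q_put(struct txq *q, int p)  { q->pkts[q->tail++ % QLEN] = p; }
static int  q_peek(const struct txq *q)  { return q->pkts[q->head % QLEN]; }
static void q_drop(struct txq *q)        { q->head++; }

/* ppp_push() analogue: true if consumed, false if the channel is busy */
static bool chan_push(struct channel *c, int pkt)
{
	if (c->busy)
		return false;
	c->sent[c->nsent++] = pkt;
	return true;
}

/* ppp_xmit_flush() analogue: drain the backlog until the channel stalls */
static void xmit_flush(struct txq *q, struct channel *c)
{
	while (!q_empty(q) && chan_push(c, q_peek(q)))
		q_drop(q);
}

/* __ppp_xmit_process() analogue */
static void xmit(struct txq *q, struct channel *c, int pkt)
{
	if (q_empty(q) && chan_push(c, pkt))
		return;		/* fastpath: sent without queueing */
	q_put(q, pkt);		/* channel busy or backlog exists */
	xmit_flush(q, c);	/* slowpath: process the backlog */
}
```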
Signed-off-by: Qingfang Deng <dqfext@gmail.com>
Link: https://patch.msgid.link/20260303093219.234403-1-dqfext@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Florian Westphal says:
====================
netfilter: updates for net-next
The following patchset contains Netfilter updates for *net-next*,
including changes to IPv6 stack and updates to IPVS from Julian Anastasov.
1) ipv6: export fib6_lookup for nft_fib_ipv6 module
2) factor out ipv6_anycast_destination logic so it's usable without
dst_entry. These are dependencies for patch 3.
3) switch nft_fib_ipv6 module to no longer need temporary dst_entry
object allocations by using fib6_lookup() + RCU.
This gets us ~13% higher packet rate in my tests.
Patches 4 to 8, from Eric Dumazet, zap sk_callback_lock usage in
netfilter. Patch 9 removes another sk_callback_lock instance.
Remaining patches, from Julian Anastasov, improve IPVS. Quoting Julian:
* Add infrastructure for resizable hash tables based on hlist_bl.
* Change the 256-bucket service hash table to be resizable.
* Change the global connection table to be per-net and resizable.
* Make connection hashing more secure for setups with multiple services.
netfilter pull request nf-next-26-03-04
* tag 'nf-next-26-03-04' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
ipvs: use more keys for connection hashing
ipvs: switch to per-net connection table
ipvs: use resizable hash table for services
ipvs: add resizable hash tables
rculist_bl: add hlist_bl_for_each_entry_continue_rcu
netfilter: nfnetlink_queue: remove locking in nfqnl_get_sk_secctx
netfilter: nfnetlink_queue: no longer acquire sk_callback_lock
netfilter: nfnetlink_log: no longer acquire sk_callback_lock
netfilter: nft_meta: no longer acquire sk_callback_lock in nft_meta_get_eval_skugid()
netfilter: xt_owner: no longer acquire sk_callback_lock in mt_owner()
netfilter: nf_log_syslog: no longer acquire sk_callback_lock in nf_log_dump_sk_uid_gid()
netfilter: nft_fib_ipv6: switch to fib6_lookup
ipv6: make ipv6_anycast_destination logic usable without dst_entry
ipv6: export fib6_lookup for nft_fib_ipv6
====================
Link: https://patch.msgid.link/20260304114921.31042-1-fw@strlen.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Raju Rangoju says:
====================
amd-xgbe: add support for P100a platform
This patch series adds support for the AMD P100a platform featuring
the ethernet controller PCI device ID 0x1122.
The P100a platform uses different register access patterns and speed
encoding compared to previous generation hardware (Yellow Carp, etc.).
Key differences include:
1. Different XPCS window offset calculation due to changed memory mapping
2. 2.5G speed uses XGMII mode (ss=0x06) instead of GMII (ss=0x02)
3. Extended port speed bits (6-bit instead of 5-bit) for 5G support
The series is organized as follows:
Patch 1: Defines macros for MAC version numbers and speed select values
to replace hardcoded magic numbers
Patch 2: Adds the core P100a platform support with PCI ID,
register configuration, and version-specific behavior
Tested on AMD P100a platform verifying:
- 10G/2.5G/1G/100M link establishment
- PHY initialization and auto-negotiation
- No register access errors
====================
Link: https://patch.msgid.link/20260302044409.1388430-1-Raju.Rangoju@amd.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Add hardware support for the AMD P100a platform featuring the ethernet
controller PCI device ID 0x1122.
Platform-specific changes include:
1. PCI device ID and register configuration:
- Add XGBE_P100a_PCI_DEVICE_ID (0x1122) for recognition
- Configure platform-specific XPCS window registers
- Disable CDR workaround and RRC for this platform
2. XPCS window offset calculation fix:
The P100a platform uses a different memory mapping scheme for XPCS
register access. The offset calculation differs between platforms:
- Older platforms (YC): offset = base + (addr & mask)
The address is masked first, then added to the window base.
- P100a: offset = (base + addr) & mask
The full address is added to base first, then masked.
This is critical because using the wrong calculation causes register
reads/writes to access incorrect addresses, leading to incorrect
behaviour.
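The two formulas can be written side by side (variable names are illustrative, not the driver's); for the same base, address and mask they can produce different offsets, which is why using the wrong one misdirects register access:

```c
#include <stdint.h>

/* Older platforms (Yellow Carp): mask the address first, then add it
 * to the window base. */
static uint32_t xpcs_offset_yc(uint32_t base, uint32_t addr, uint32_t mask)
{
	return base + (addr & mask);
}

/* P100a: add the full address to the base first, then mask. */
static uint32_t xpcs_offset_p100a(uint32_t base, uint32_t addr, uint32_t mask)
{
	return (base + addr) & mask;
}
```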
3. 2.5G speed mode handling:
P100a uses XGMII mode (ss=0x06) for 2.5G instead of GMII mode
(ss=0x02) used by older platforms. The MAC version check determines
which mode to use.
4. Port speed bits extended:
Extend XP_PROP_0_PORT_SPEEDS from 5 bits to 6 bits to support the
additional 5G speed capability.
5. Rx adaptation disabled:
Rx adaptation is disabled for P100a (MAC version 0x33) as this
feature requires further development for this platform.
6. Rate change command for 2.5G:
Use XGBE_MB_SUBCMD_2_5G_KX subcommand for 2.5G mode on P100a
instead of XGBE_MB_SUBCMD_NONE used on older platforms.
Signed-off-by: Raju Rangoju <Raju.Rangoju@amd.com>
Link: https://patch.msgid.link/20260302044634.1388661-2-Raju.Rangoju@amd.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Tony Nguyen says:
====================
Intel Wired LAN Driver Updates 2026-03-02 (ice, i40e, ixgbe)
For ice:
Simon Horman adds const modifier to read only member of a struct.
For i40e:
Yury Norov removes an unneeded check of bitmap_weight().
Andy Shevchenko adds a missing include.
For ixgbe:
Aleksandr changes declaration of a bitmap to utilize DECLARE_BITMAP()
macro.
* '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
ixgbe: refactor: use DECLARE_BITMAP for ring state field
i40e: Add missing wordpart.h header
i40e: drop useless bitmap_weight() call in i40e_set_rxfh_fields()
ice: Make name member of struct ice_cgu_pin_desc const
====================
Link: https://patch.msgid.link/20260304000800.3536872-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Instead of using struct timespec64 in scm_timestamping_internal,
use ktime_t, saving 24 bytes of kernel stack.
This makes tcp_update_recv_tstamps() small enough to be inlined.
The ktime_t -> timespec64 conversions happen after socket lock
has been released in tcp_recvmsg(), and only if the application
requested them.
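The 24-byte saving follows from the type sizes on 64-bit; a userspace mirror of the layouts (these struct definitions are simplified stand-ins for the kernel's, assuming a 64-bit ABI):

```c
#include <stdint.h>

typedef int64_t ktime_t;				/* 8 bytes */
struct timespec64 { int64_t tv_sec; long tv_nsec; };	/* 16 bytes on 64-bit */

/* scm_timestamping_internal holds three timestamps */
struct tstamps_old { struct timespec64 ts[3]; };	/* 48 bytes */
struct tstamps_new { ktime_t ts[3]; };			/* 24 bytes */
```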
$ scripts/bloat-o-meter -t vmlinux.0 vmlinux
add/remove: 0/2 grow/shrink: 5/4 up/down: 146/-277 (-131)
Function old new delta
tcp_zerocopy_receive 2383 2425 +42
mptcp_recvmsg 1565 1607 +42
tcp_recvmsg_locked 3797 3823 +26
put_cmsg_scm_timestamping64 131 149 +18
put_cmsg_scm_timestamping 131 149 +18
__pfx_tcp_update_recv_tstamps 16 - -16
do_tcp_getsockopt 4024 4006 -18
tcp_recv_timestamp 474 430 -44
tcp_zc_handle_leftover 417 371 -46
__sock_recv_timestamp 1087 1031 -56
tcp_update_recv_tstamps 97 - -97
Total: Before=25223788, After=25223657, chg -0.00%
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Link: https://patch.msgid.link/20260304012747.881644-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Fix some kernel-doc warnings in openvswitch.h:
Mark enum placeholders that are not used as "private" so that kernel-doc
comments are not needed for them.
Correct names for 2 enum values:
Warning: include/uapi/linux/openvswitch.h:300 Excess enum value
'@OVS_VPORT_UPCALL_SUCCESS' description in 'ovs_vport_upcall_attr'
Warning: include/uapi/linux/openvswitch.h:300 Excess enum value
'@OVS_VPORT_UPCALL_FAIL' description in 'ovs_vport_upcall_attr'
Convert one comment from "/**" kernel-doc to a plain C "/*" comment:
Warning: include/uapi/linux/openvswitch.h:638 This comment starts with
'/**', but isn't a kernel-doc comment.
* Omit attributes for notifications.
Add more kernel-doc:
- add kernel-doc for kernel-only enums;
- add missing kernel-doc for enum ovs_datapath_attr;
- add missing kernel-doc for enum ovs_flow_attr;
- add missing kernel-doc for enum ovs_sample_attr;
- add kernel-doc for enum ovs_check_pkt_len_attr;
- add kernel-doc for enum ovs_action_attr;
- add kernel-doc for enum ovs_action_push_eth;
- add kernel-doc for enum ovs_vport_attr;
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Acked-by: Ilya Maximets <i.maximets@ovn.org>
Link: https://patch.msgid.link/20260304012437.469151-1-rdunlap@infradead.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
tcp_do_parse_auth_options() fast path user is tcp_inbound_hash().
Move tcp_do_parse_auth_options() right before tcp_inbound_hash()
so that it can be (auto)inlined by the compiler.
As a bonus, stack canary is removed from tcp_inbound_hash().
Also use EXPORT_IPV6_MOD(tcp_do_parse_auth_options).
$ scripts/bloat-o-meter -t vmlinux.0 vmlinux
add/remove: 0/0 grow/shrink: 1/0 up/down: 131/0 (131)
Function old new delta
tcp_inbound_hash 565 696 +131
Total: Before=25223788, After=25223919, chg +0.00%
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260303191243.557245-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Eric Dumazet says:
====================
rfs: use high-order allocations for hash tables
This series adds rps_tag_ptr which encodes both a pointer
and a size of a power-of-two hash table in a single long word.
RFS hash tables (global and per rx-queue) are converted to rps_tag_ptr.
This removes a cache line miss, and allows high-order allocations.
The global hash table can benefit from huge pages.
====================
Link: https://patch.msgid.link/20260302181432.1836150-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Instead of storing the @log at the beginning of rps_dev_flow_table
use 5 low order bits of the rps_tag_ptr to store the log of the size.
This removes a potential cache line miss (for light traffic).
This allows us to switch to one high-order allocation instead of vmalloc()
when CONFIG_RFS_ACCEL is not set.
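The tag scheme can be sketched as follows (names and layout are my assumptions, not the kernel's actual rps_tag_ptr helpers): pack the table pointer and log2(size) into one unsigned long, relying on the allocation being at least 32-byte aligned so the 5 low bits are free.

```c
#include <stdint.h>

#define RPS_LOG_BITS 5
#define RPS_LOG_MASK ((1UL << RPS_LOG_BITS) - 1)

static unsigned long rps_tag_encode(void *table, unsigned int log)
{
	/* assumes table is at least 32-byte aligned */
	return (unsigned long)table | (log & RPS_LOG_MASK);
}

static void *rps_tag_table(unsigned long tag)
{
	return (void *)(tag & ~RPS_LOG_MASK);
}

static unsigned long rps_tag_mask(unsigned long tag)
{
	/* the table has 1 << log entries; the lookup mask is size - 1 */
	return (1UL << (tag & RPS_LOG_MASK)) - 1;
}
```

Both the pointer and the table size come out of a single load, which is what removes the extra cache line miss.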
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260302181432.1836150-8-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Instead of storing the @mask at the beginning of rps_sock_flow_table,
use 5 low order bits of the rps_tag_ptr to store the log of the size.
This removes a potential cache line miss to fetch @mask.
More importantly, we can switch to vmalloc_huge() without wasting memory.
Tested with:
numactl --interleave=all bash -c "echo 4194304 >/proc/sys/net/core/rps_sock_flow_entries"
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260302181432.1836150-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
udp_flow_src_port() and psp_write_headers() use ip_local_port_range.
ip_local_port_range is inclusive: all ports between min and max
can be used.
Before this patch, if ip_local_port_range was set to 40000-40001,
40001 would not be used as a source port.
Use reciprocal_scale() to help code readability.
Not tagged for stable trees, as this change could break user
expectations.
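For reference, reciprocal_scale(x, n) maps a 32-bit value into [0, n) as (u64)x * n >> 32, so an inclusive range [min, max] needs n = max - min + 1. A sketch (the port-picking helper name is mine):

```c
#include <stdint.h>

/* Same definition as the kernel's reciprocal_scale(). */
static uint32_t reciprocal_scale(uint32_t val, uint32_t ep_ro)
{
	return (uint32_t)(((uint64_t)val * ep_ro) >> 32);
}

/* Inclusive range: max - min + 1 possible ports, so both ends
 * of the range are reachable. */
static uint16_t pick_src_port(uint32_t hash, uint16_t min, uint16_t max)
{
	return min + reciprocal_scale(hash, max - min + 1);
}
```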
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20260302163933.1754393-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski says:
====================
tools: ynl: tests: adjust Makefile to mimic ksft
Make a few minor adjustments to tools/net/ynl/tests/Makefile
to align its behavior more with how real kselftests behave.
This series allows running the YNL tests in NIPA with little
extra integration effort.
If anyone already integrated these tests into their CI minor
adjustments to the integration may be needed (due to patch 2).
====================
Link: https://patch.msgid.link/20260303163504.2084981-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Until commit 790792ebc9 ("tools: ynl: don't install tests")
YNL selftests were installed with all the other YNL outputs.
That's no longer the case, as tests are not really production
artifacts. Let's not install them in /usr/bin at all, and
mirror kselftest format more closely:
For: make -C tools/net/ynl/tests/ install DESTDIR=tmp
tmp/usr/share/kselftest
├── ktap_helpers.sh
└── ynl
├── test_ynl_cli.sh
└── test_ynl_ethtool.sh
Acked-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Reviewed-by: Hangbin Liu <liuhangbin@gmail.com>
Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20260303163504.2084981-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Johannes Berg says:
====================
Notable features this time:
- cfg80211/mac80211
- finished assoc frame encryption/EPPKE/802.1X-over-auth
(also hwsim)
- radar detection improvements
- 6 GHz incumbent signal detection APIs
- multi-link support for FILS, probe response
templates and client probing
- ath12k:
- monitor mode support on IPQ5332
- basic hwmon temperature reporting
* tag 'wireless-next-2026-03-04' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (38 commits)
wifi: UHR: define DPS/DBE/P-EDCA elements and fix size parsing
wifi: mac80211_hwsim: change hwsim_class to a const struct
wifi: mac80211: give the AP more time for EPPKE as well
wifi: ath12k: Remove the unused argument from the Rx data path
wifi: ath12k: Enable monitor mode support on IPQ5332
wifi: ath12k: Set up MLO after SSR
wifi: ath11k: Silence remoteproc probe deferral prints
wifi: cfg80211: support key installation on non-netdev wdevs
wifi: cfg80211: make cluster id an array
wifi: mac80211: update outdated comment
wifi: mac80211: Advertise IEEE 802.1X authentication support
wifi: mac80211: Add support for IEEE 802.1X authentication protocol in non-AP STA mode
wifi: cfg80211: add support for IEEE 802.1X Authentication Protocol
wifi: mac80211: Advertise EPPKE support based on driver capabilities
wifi: mac80211_hwsim: Advertise support for (Re)Association frame encryption
wifi: mac80211: Fix AAD/Nonce computation for management frames with MLO
wifi: rt2x00: use generic nvmem_cell_get
wifi: mac80211: fetch unsolicited probe response template by link ID
wifi: mac80211: fetch FILS discovery template by link ID
wifi: nl80211: don't allow DFS channels for NAN
...
====================
Link: https://patch.msgid.link/20260304113707.175181-3-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add UHR Operation and Capability definitions and parsing helpers:
- Define ieee80211_uhr_dps_info, ieee80211_uhr_dbe_info,
ieee80211_uhr_p_edca_info with masks.
- Update ieee80211_uhr_oper_size_ok() to account for optional
DPS/DBE/P-EDCA blocks.
- Move NPCA pointer position after DPS Operation Parameter if it is
present in ieee80211_uhr_oper_size_ok().
- Move NPCA pointer position after DPS info if it is present in
ieee80211_uhr_npca_info().
Signed-off-by: Karthikeyan Kathirvel <karthikeyan.kathirvel@oss.qualcomm.com>
Link: https://patch.msgid.link/20260304085343.1093993-2-karthikeyan.kathirvel@oss.qualcomm.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Simon Kirby reported a long time ago that IPVS connection hashing
based only on the client address/port (caddr, cport) as hash keys
is not suitable for setups that accept traffic on multiple virtual
IPs and ports. It can happen for multiple VIP:VPORT services, for
single or many fwmark service(s) that match multiple virtual IPs
and ports, or even for passive FTP with persistence in DR/TUN mode
where we expect traffic on multiple ports for the virtual IP.
Fix it by adding virtual addresses and ports to the hash function.
This causes the traffic from NAT real servers to clients to use a
second hashing for the in->out direction.
As a result:
- the IN direction from client will use hash node hn0 where
the source/dest addresses and ports used by client will be used
as hash keys
- the OUT direction from NAT real servers will use hash node hn1
for the traffic from real server to client
- the persistence templates are hashed only with parameters based on
the IN direction, so they now will also use the virtual address,
port and fwmark from the service.
OLD:
- all methods: c_list node: proto, caddr:cport
- persistence templates: c_list node: proto, caddr_net:0
- persistence engine templates: c_list node: per-PE, PE-SIP uses jhash
NEW:
- all methods: hn0 node (dir 0): proto, caddr:cport -> vaddr:vport
- MASQ method: hn1 node (dir 1): proto, daddr:dport -> caddr:cport
- persistence templates: hn0 node (dir 0):
proto, caddr_net:0 -> vaddr:vport_or_0
proto, caddr_net:0 -> fwmark:0
- persistence engine templates: hn0 node (dir 0): as before
Also reorder the ip_vs_conn fields, so that hash nodes are on same
read-mostly cache line while write-mostly fields are on separate
cache line.
Reported-by: Simon Kirby <sim@hostway.ca>
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Florian Westphal <fw@strlen.de>
Use per-net resizable hash table for connections. The global table
is slow to walk when using many namespaces.
The table can be resized in the range of [256 - ip_vs_conn_tab_size].
Table is attached only while services are present. Resizing is done
by delayed work based on load (the number of connections).
Add a hash_key field into the connection to store the table ID in
the highest bit and the entry's hash value in the lowest bits. The
lowest part of the hash value is used as the bucket ID; the
remaining part is used to filter the entries in the bucket before
matching the keys and, as a result, helps the lookup access only
one cache line. By knowing the table ID and bucket ID for an entry,
we can unlink it without calculating the hash value and doing
lookup by keys. We need only to validate the saved hash_key under
lock.
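The hash_key layout described above can be sketched like this (names are mine, not IPVS's; the point is that the saved value both selects the bucket and cheaply pre-filters entries before the cache-cold key comparison):

```c
#include <stdbool.h>
#include <stdint.h>

#define HK_TABLE_ID (1U << 31)	/* table ID in the highest bit */

static uint32_t hash_key_make(bool table_id, uint32_t hash)
{
	return (table_id ? HK_TABLE_ID : 0) | (hash & ~HK_TABLE_ID);
}

/* The lowest log bits of the hash select the bucket (log <= 30, so
 * the table-ID bit never leaks into the bucket index). */
static uint32_t hash_key_bucket(uint32_t hk, unsigned int log)
{
	return hk & ((1U << log) - 1);
}

/* Cheap compare of the full saved value before touching the entry's
 * keys; also lets us unlink by saved hash_key without a re-lookup. */
static bool hash_key_prefilter(uint32_t saved_hk, uint32_t lookup_hk)
{
	return saved_hk == lookup_hk;
}
```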
For better security switch from jhash to siphash for the default
connection hashing but the persistence engines may use their own
function. Keeping the hash table loaded with entries below the
size (12%) allows avoiding collisions for 96+% of the conns.
ip_vs_conn_fill_cport() now will rehash the connection with proper
locking because unhash+hash is not safe for RCU readers.
To invalidate the templates, just setting dport to 0xffff is enough;
no need to rehash them. As a result, ip_vs_conn_unhash() is now
unused and removed.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Florian Westphal <fw@strlen.de>
Make the hash table for services resizable in the bit range of 4-20.
Table is attached only while services are present. Resizing is done
by delayed work based on load (the number of hashed services).
Table grows when load increases 2+ times (above 12.5% with lfactor=-3)
and shrinks 8+ times when load decreases 16+ times (below 0.78%).
Switch to jhash hashing to reduce the collisions for multiple
services.
Add a hash_key field into the service to store the table ID in
the highest bit and the entry's hash value in the lowest bits. The
lowest part of the hash value is used as the bucket ID; the
remaining part is used to filter the entries in the bucket before
matching the keys and, as a result, helps the lookup access only
one cache line. By knowing the table ID and bucket ID for an entry,
we can unlink it without calculating the hash value and doing
lookup by keys. We need only to validate the saved hash_key under
lock.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Florian Westphal <fw@strlen.de>
Add infrastructure for resizable hash tables based on hlist_bl
which we will use in followup patches.
The tables allow RCU lookups during resizing, bucket modifications
are protected with per-bucket bit lock and additional custom locking,
the tables are resized when load reaches thresholds determined based
on load factor parameter.
Compared to other implementations we rely on:
* fast entry removal by using node unlinking without pre-lookup
* entry rehashing when hash key changes
* entries can contain multiple hash nodes
* custom locking depending on different contexts
* adjustable load factor to customize the grow/shrink process
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Florian Westphal <fw@strlen.de>
Change the old hlist_bl_first_rcu to hlist_bl_first_rcu_dereference
to indicate that it is an RCU dereference.
Add hlist_bl_next_rcu and hlist_bl_first_rcu to use RCU pointers
and use them to fix sparse warnings.
Add hlist_bl_for_each_entry_continue_rcu.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Florian Westphal <fw@strlen.de>
After commit 983512f3a8 ("net: Drop the lock in skb_may_tx_timestamp()")
from Sebastian Andrzej Siewior, apply the same logic in
nfqnl_put_sk_uidgid() to avoid touching sk->sk_callback_lock.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
After commit 983512f3a8 ("net: Drop the lock in skb_may_tx_timestamp()")
from Sebastian Andrzej Siewior, apply the same logic in
__build_packet_message() to avoid touching sk->sk_callback_lock.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>