Commit Graph

18679 Commits

Author SHA1 Message Date
Hangbin Liu
e5a6643435 bonding: support aggregator selection based on port priority
Add a new ad_select policy 'port_priority' that uses the per-port
actor priority values (set via ad_actor_port_prio) to determine
aggregator selection.

This allows administrators to influence which ports are preferred
for aggregation by assigning different priority values, providing
more flexible load balancing control in LACP configurations.

Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Link: https://patch.msgid.link/20250902064501.360822-3-liuhangbin@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-09-09 10:56:02 +02:00
Hangbin Liu
6b6dc81ee7 bonding: add support for per-port LACP actor priority
Introduce a new netlink attribute 'actor_port_prio' to allow setting
the LACP actor port priority on a per-slave basis. This extends the
existing bonding infrastructure to support more granular control over
LACP negotiations.

The priority value is embedded in LACPDU packets and will be used by
subsequent patches to influence aggregator selection policies.

Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Link: https://patch.msgid.link/20250902064501.360822-2-liuhangbin@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-09-09 10:56:02 +02:00
Eric Dumazet
20d3d26815 net: snmp: remove SNMP_MIB_SENTINEL
No more user of SNMP_MIB_SENTINEL, we can remove it.

Also remove snmp_get_cpu_field[64]_batch() helpers.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://patch.msgid.link/20250905165813.1470708-10-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-08 18:06:21 -07:00
Eric Dumazet
ceac1fb229 ipv6: snmp: do not use SNMP_MIB_SENTINEL anymore
Use ARRAY_SIZE(), so that we know the limit at compile time.

Following patch needs this preliminary change.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://patch.msgid.link/20250905165813.1470708-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-08 18:06:20 -07:00
Jakub Kicinski
5ef04a7b06 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.17-rc5).

No conflicts.

Adjacent changes:

include/net/sock.h
  c51613fa27 ("net: add sk->sk_drop_counters")
  5d6b58c932 ("net: lockless sock_i_ino()")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-04 13:33:00 -07:00
Jakub Kicinski
3ceb08838b net: add helper to pre-check if PP for an Rx queue will be unreadable
mlx5 pokes into the rxq state to check if the queue has a memory
provider, and therefore whether it may produce unreadable mem.
Add a helper for doing this in the page pool API. fbnic will want
a similar thing (tho, for a slightly different reason).

Reviewed-by: Mina Almasry <almasrymina@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Link: https://patch.msgid.link/20250901211214.1027927-11-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-09-04 10:19:17 +02:00
Juraj Šarinay
21f82062d0 net: nfc: nci: Increase NCI_DATA_TIMEOUT to 3000 ms
An exchange with a NFC target must complete within NCI_DATA_TIMEOUT.
A delay of 700 ms is not sufficient for cryptographic operations on smart
cards. CardOS 6.0 may need up to 1.3 seconds to perform 256-bit ECDH
or 3072-bit RSA. To prevent brute-force attacks, passports and similar
documents introduce even longer delays into access control protocols
(BAC/PACE).

The timeout should be higher, but not too much. The expiration allows
us to detect that a NFC target has disappeared.

Signed-off-by: Juraj Šarinay <juraj@sarinay.com>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Link: https://patch.msgid.link/20250902113630.62393-1-juraj@sarinay.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-03 17:02:12 -07:00
Eric Dumazet
5d6b58c932 net: lockless sock_i_ino()
Followup of commit c51da3f7a1 ("net: remove sock_i_uid()")

A recent syzbot report was the trigger for this change.

Over the years, we had many problems caused by the
read_lock[_bh](&sk->sk_callback_lock) in sock_i_uid().

We could fix smc_diag_dump_proto() or make a more radical move:

Instead of waiting for new syzbot reports, cache the socket
inode number in sk->sk_ino, so that we no longer
need to acquire sk->sk_callback_lock in sock_i_ino().

This makes socket dumps faster (one less cache line miss,
and two atomic ops avoided).

Prior art:

commit 25a9c8a443 ("netlink: Add __sock_i_ino() for __netlink_diag_dump().")
commit 4f9bf2a2f5 ("tcp: Don't acquire inet_listen_hashbucket::lock with disabled BH.")
commit efc3dbc374 ("rds: Make rds_sock_lock BH rather than IRQ safe.")

Fixes: d2d6422f8b ("x86: Allow to enable PREEMPT_RT.")
Reported-by: syzbot+50603c05bbdf4dfdaffa@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/netdev/68b73804.050a0220.3db4df.01d8.GAE@google.com/T/#u
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://patch.msgid.link/20250902183603.740428-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-03 16:08:24 -07:00
Jakub Kicinski
24ee9feeb3 Merge tag 'nf-next-25-09-02' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next
Florian Westphal says:

====================
netfilter: updates for net-next

1) prefer vmalloc_array in ebtables, from  Qianfeng Rong.

2) Use csum_replace4 instead of open-coding it, from Christophe Leroy.

3+4) Get rid of GFP_ATOMIC in transaction object allocations, those
     cause silly failures with large sets under memory pressure, from
     myself.

5) Remove test for AVX cpu feature in nftables pipapo set type,
   testing for AVX2 feature is sufficient.

6) Unexport a few function in nf_reject infra: no external callers.

7) Extend payload offset to u16, this was restricted to values <=255
   so far, from Fernando Fernandez Mancera.

* tag 'nf-next-25-09-02' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
  netfilter: nft_payload: extend offset to 65535 bytes
  netfilter: nf_reject: remove unneeded exports
  netfilter: nft_set_pipapo: remove redundant test for avx feature bit
  netfilter: nf_tables: all transaction allocations can now sleep
  netfilter: nf_tables: allow iter callbacks to sleep
  netfilter: nft_payload: Use csum_replace4() instead of opencoding
  netfilter: ebtables: Use vmalloc_array() to improve code
====================

Link: https://patch.msgid.link/20250902133549.15945-1-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-03 16:06:45 -07:00
Asbjørn Sloth Tønnesen
017bda80fd genetlink: fix typo in comment
In this context "not that ..." should properly be "note that ...".

Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20250902154640.759815-4-ast@fiberby.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-03 15:16:49 -07:00
Christoph Paasch
929324913e net: Add rfs_needed() helper
Add a helper to check if RFS is needed or not. Allows to make the code a
bit cleaner and the next patch to have MPTCP use this helper to decide
whether or not to iterate over the subflows.

tun_flow_update() was calling sock_rps_record_flow_hash() regardless of
the state of rfs_needed. This was not really a bug as sock_flow_table
simply ends up being NULL and thus everything will be fine.
This commit here thus also implicitly makes tun_flow_update() respect
the state of rfs_needed.

Suggested-by: Matthieu Baerts <matttbe@kernel.org>
Signed-off-by: Christoph Paasch <cpaasch@openai.com>
Acked-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250902-net-next-mptcp-misc-feat-6-18-v2-3-fa02bb3188b1@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-03 15:08:20 -07:00
Eric Dumazet
5d14bbf9d1 net_sched: act: remove tcfa_qstats
tcfa_qstats is currently only used to hold drops and overlimits counters.

tcf_action_inc_drop_qstats() and tcf_action_inc_overlimit_qstats()
currently acquire a->tcfa_lock to increment these counters.

Switch to two atomic_t to get lock-free accounting.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://patch.msgid.link/20250901093141.2093176-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-02 15:52:24 -07:00
Jay Vosburgh
23a6037ce7 bonding: Remove support for use_carrier
Remove the implementation of use_carrier, the link monitoring
method that utilizes ethtool or ioctl to determine the link state of an
interface in a bond.  Bonding will always behaves as if use_carrier=1,
which relies on netif_carrier_ok() to determine the link state of
interfaces.

	To avoid acquiring RTNL many times per second, bonding inspects
link state under RCU, but not under RTNL.  However, ethtool
implementations in drivers may sleep, and therefore this strategy is
unsuitable for use with calls into driver ethtool functions.

	The use_carrier option was introduced in 2003, to provide
backwards compatibility for network device drivers that did not support
the then-new netif_carrier_ok/on/off system.  Device drivers are now
expected to support netif_carrier_*, and the use_carrier backwards
compatibility logic is no longer necessary.

	The option itself remains, but when queried always returns 1,
and may only be set to 1.

Link: https://lore.kernel.org/000000000000eb54bf061cfd666a@google.com
Link: https://lore.kernel.org/20240718122017.d2e33aaac43a.I10ab9c9ded97163aef4e4de10985cd8f7de60d28@changeid
Signed-off-by: Jay Vosburgh <jv@jvosburgh.net>
Reported-by: syzbot+b8c48ea38ca27d150063@syzkaller.appspotmail.com
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/2029487.1756512517@famine
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-02 14:01:54 -07:00
Fernando Fernandez Mancera
077dc4a275 netfilter: nft_payload: extend offset to 65535 bytes
In some situations 255 bytes offset is not enough to match or manipulate
the desired packet field. Increase the offset limit to 65535 or U16_MAX.

In addition, the nla policy maximum value is not set anymore as it is
limited to s16. Instead, the maximum value is checked during the payload
expression initialization function.

Tested with the nft command line tool.

table ip filter {
	chain output {
		@nh,2040,8 set 0xff
		@nh,524280,8 set 0xff
		@nh,524280,8 0xff
		@nh,2040,8 0xff
	}
}

Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
Signed-off-by: Florian Westphal <fw@strlen.de>
2025-09-02 15:28:18 +02:00
Florian Westphal
f4f9e05904 netfilter: nf_reject: remove unneeded exports
These functions have no external callers and can be static.

Signed-off-by: Florian Westphal <fw@strlen.de>
2025-09-02 15:28:17 +02:00
Florian Westphal
a60a5abe19 netfilter: nf_tables: allow iter callbacks to sleep
Quoting Sven Auhagen:
  we do see on occasions that we get the following error message, more so on
  x86 systems than on arm64:

  Error: Could not process rule: Cannot allocate memory delete table inet filter

  It is not a consistent error and does not happen all the time.
  We are on Kernel 6.6.80, seems to me like we have something along the lines
  of the nf_tables: allow clone callbacks to sleep problem using GFP_ATOMIC.

As hinted at by Sven, this is because of GFP_ATOMIC allocations during
set flush.

When set is flushed, all elements are deactivated. This triggers a set
walk and each element gets added to the transaction list.

The rbtree and rhashtable sets don't allow the iter callback to sleep:
rbtree walk acquires read side of an rwlock with bh disabled, rhashtable
walk happens with rcu read lock held.

Rbtree is simple enough to resolve:
When the walk context is ITER_READ, no change is needed (the iter
callback must not deactivate elements; we're not in a transaction).

When the iter type is ITER_UPDATE, the rwlock isn't needed because the
caller holds the transaction mutex, this prevents any and all changes to
the ruleset, including add/remove of set elements.

Rhashtable is slightly more complex.
When the iter type is ITER_READ, no change is needed, like rbtree.

For ITER_UPDATE, we hold transaction mutex which prevents elements from
getting free'd, even outside of rcu read lock section.

So build a temporary list of all elements while doing the rcu iteration
and then call the iterator in a second pass.

The disadvantage is the need to iterate twice, but this cost comes with
the benefit to allow the iter callback to use GFP_KERNEL allocations in
a followup patch.

The new list based logic makes it necessary to catch recursive calls to
the same set earlier.

Such walk -> iter -> walk recursion for the same set can happen during
ruleset validation in case userspace gave us a bogus (cyclic) ruleset
where verdict map m jumps to chain that sooner or later also calls
"vmap @m".

Before the new ->in_update_walk test, the ruleset is rejected because the
infinite recursion causes ctx->level to exceed the allowed maximum.

But with the new logic added here, elements would get skipped:

nft_rhash_walk_update would see elements that are on the walk_list of
an older stack frame.

As all recursive calls into same map results in -EMLINK, we can avoid this
problem by using the new in_update_walk flag and reject immediately.

Next patch converts the problematic GFP_ATOMIC allocations.

Reported-by: Sven Auhagen <Sven.Auhagen@belden.com>
Closes: https://lore.kernel.org/netfilter-devel/BY1PR18MB5874110CAFF1ED098D0BC4E7E07BA@BY1PR18MB5874.namprd18.prod.outlook.com/
Signed-off-by: Florian Westphal <fw@strlen.de>
2025-09-02 15:28:17 +02:00
Eric Dumazet
689adb36bd inet: ping: make ping_port_rover per netns
Provide isolation between netns for ping idents.

Randomize initial ping_port_rover value at netns creation.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20250829153054.474201-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-01 13:15:14 -07:00
Eric Dumazet
10343e7e6c inet: ping: remove ping_hash()
There is no point in keeping ping_hash().

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Yue Haibing <yuehaibing@huawei.com>
Link: https://patch.msgid.link/20250829153054.474201-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-01 13:15:14 -07:00
Kuniyuki Iwashima
7051b54fb5 tcp: Remove sk->sk_prot->orphan_count.
TCP tracks the number of orphaned (SOCK_DEAD but not yet destructed)
sockets in tcp_orphan_count.

In some code that was shared with DCCP, tcp_orphan_count is referenced
via sk->sk_prot->orphan_count.

Let's reference tcp_orphan_count directly.

inet_csk_prepare_for_destroy_sock() is moved to inet_connection_sock.c
due to header dependency.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250829215641.711664-1-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-01 12:52:09 -07:00
Eric Dumazet
99a2ace61b net: use dst_dev_rcu() in sk_setup_caps()
Use RCU to protect accesses to dst->dev from sk_setup_caps()
and sk_dst_gso_max_size().

Also use dst_dev_rcu() in ip6_dst_mtu_maybe_forward(),
and ip_dst_mtu_maybe_forward().

ip4_dst_hoplimit() can use dst_dev_net_rcu().

Fixes: 4a6ce2b6f2 ("net: introduce a new function dst_dev_put()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20250828195823.3958522-6-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-29 19:36:32 -07:00
Eric Dumazet
caedcc5b6d net: dst: introduce dst->dev_rcu
Followup of commit 88fe14253e ("net: dst: add four helpers
to annotate data-races around dst->dev").

We want to gradually add explicit RCU protection to dst->dev,
including lockdep support.

Add an union to alias dst->dev_rcu and dst->dev.

Add dst_dev_net_rcu() helper.

Fixes: 4a6ce2b6f2 ("net: introduce a new function dst_dev_put()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20250828195823.3958522-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-29 19:36:31 -07:00
Jakub Kicinski
d23ad54de7 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.17-rc4).

No conflicts.

Adjacent changes:

drivers/net/ethernet/intel/idpf/idpf_txrx.c
  02614eee26 ("idpf: do not linearize big TSO packets")
  6c4e684802 ("idpf: remove obsolete stashing code")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-29 11:48:01 -07:00
Eric Dumazet
53df77e785 net_sched: act_skbmod: use RCU in tcf_skbmod_dump()
Also storing tcf_action into struct tcf_skbmod_params
makes sure there is no discrepancy in tcf_skbmod_act().

No longer block BH in tcf_skbmod_init() when acquiring tcf_lock.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250827125349.3505302-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-28 16:46:23 -07:00
Eric Dumazet
e97ae74297 net_sched: act_tunnel_key: use RCU in tunnel_key_dump()
Also storing tcf_action into struct tcf_tunnel_key_params
makes sure there is no discrepancy in tunnel_key_act().

No longer block BH in tunnel_key_init() when acquiring tcf_lock.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250827125349.3505302-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-28 16:46:23 -07:00
Eric Dumazet
48b5e5dbdb net_sched: act_vlan: use RCU in tcf_vlan_dump()
Also storing tcf_action into struct tcf_vlan_params
makes sure there is no discrepancy in tcf_vlan_act().

No longer block BH in tcf_vlan_init() when acquiring tcf_lock.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250827125349.3505302-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-28 16:46:23 -07:00
Dragos Tatulea
13d8e05adf queue_api: add support for fetching per queue DMA dev
For zerocopy (io_uring, devmem), there is an assumption that the
parent device can do DMA. However that is not always the case:
- Scalable Function netdevs [1] have the DMA device in the grandparent.
- For Multi-PF netdevs [2] queues can be associated to different DMA
devices.

This patch introduces the a queue based interface for allowing drivers
to expose a different DMA device for zerocopy.

[1] Documentation/networking/device_drivers/ethernet/mellanox/mlx5/switchdev.rst
[2] Documentation/networking/multi-pf-netdev.rst

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Link: https://patch.msgid.link/20250827144017.1529208-3-dtatulea@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-28 16:05:31 -07:00
Eric Dumazet
b81aa23234 inet: raw: add drop_counters to raw sockets
When a packet flood hits one or more RAW sockets, many cpus
have to update sk->sk_drops.

This slows down other cpus, because currently
sk_drops is in sock_write_rx group.

Add a socket_drop_counters structure to raw sockets.

Using dedicated cache lines to hold drop counters
makes sure that consumers no longer suffer from
false sharing if/when producers only change sk->sk_drops.

This adds 128 bytes per RAW socket.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250826125031.1578842-6-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-08-28 13:14:50 +02:00
Eric Dumazet
51132b99f0 udp: add drop_counters to udp socket
When a packet flood hits one or more UDP sockets, many cpus
have to update sk->sk_drops.

This slows down other cpus, because currently
sk_drops is in sock_write_rx group.

Add a socket_drop_counters structure to udp sockets.

Using dedicated cache lines to hold drop counters
makes sure that consumers no longer suffer from
false sharing if/when producers only change sk->sk_drops.

This adds 128 bytes per UDP socket.

Tested with the following stress test, sending about 11 Mpps
to a dual socket AMD EPYC 7B13 64-Core.

super_netperf 20 -t UDP_STREAM -H DUT -l10 -- -n -P,1000 -m 120
Note: due to socket lookup, only one UDP socket is receiving
packets on DUT.

Then measure receiver (DUT) behavior. We can see both
consumer and BH handlers can process more packets per second.

Before:

nstat -n ; sleep 1 ; nstat | grep Udp
Udp6InDatagrams                 615091             0.0
Udp6InErrors                    3904277            0.0
Udp6RcvbufErrors                3904277            0.0

After:

nstat -n ; sleep 1 ; nstat | grep Udp
Udp6InDatagrams                 816281             0.0
Udp6InErrors                    7497093            0.0
Udp6RcvbufErrors                7497093            0.0

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250826125031.1578842-5-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-08-28 13:14:50 +02:00
Eric Dumazet
c51613fa27 net: add sk->sk_drop_counters
Some sockets suffer from heavy false sharing on sk->sk_drops,
and fields in the same cache line.

Add sk->sk_drop_counters to:

- move the drop counter(s) to dedicated cache lines.
- Add basic NUMA awareness to these drop counter(s).

Following patches will use this infrastructure for UDP and RAW sockets.

sk_clone_lock() is not yet ready, it would need to properly
set newsk->sk_drop_counters if we plan to use this for TCP sockets.

v2: used Paolo suggestion from https://lore.kernel.org/netdev/8f09830a-d83d-43c9-b36b-88ba0a23e9b2@redhat.com/

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250826125031.1578842-4-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-08-28 13:14:50 +02:00
Eric Dumazet
cb4d5a6eb6 net: add sk_drops_skbadd() helper
Existing sk_drops_add() helper is renamed to sk_drops_skbadd().

Add sk_drops_add() and convert sk_drops_inc() to use it.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250826125031.1578842-3-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-08-28 13:14:50 +02:00
Eric Dumazet
f86f42ed2c net: add sk_drops_read(), sk_drops_inc() and sk_drops_reset() helpers
We want to split sk->sk_drops in the future to reduce
potential contention on this field.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250826125031.1578842-2-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-08-28 13:14:50 +02:00
Krishna Kumar
97bcc5b6f4 net: Prevent RPS table overwrite of active flows
This patch fixes an issue where two different flows on the same RXq
produce the same hash resulting in continuous flow overwrites.

Flow #1: A packet for Flow #1 comes in, kernel calls the steering
         function. The driver gives back a filter id. The kernel saves
	 this filter id in the selected slot. Later, the driver's
	 service task checks if any filters have expired and then
	 installs the rule for Flow #1.
Flow #2: A packet for Flow #2 comes in. It goes through the same steps.
         But this time, the chosen slot is being used by Flow #1. The
	 driver gives a new filter id and the kernel saves it in the
	 same slot. When the driver's service task runs, it runs through
	 all the flows, checks if Flow #1 should be expired, the kernel
	 returns True as the slot has a different filter id, and then
	 the driver installs the rule for Flow #2.
Flow #1: Another packet for Flow #1 comes in. The same thing repeats.
         The slot is overwritten with a new filter id for Flow #1.

This causes a repeated cycle of flow programming for missed packets,
wasting CPU cycles while not improving performance. This problem happens
at higher rates when the RPS table is small, but tests show it still
happens even with 12,000 connections and an RPS size of 16K per queue
(global table size = 144x16K = 64K).

This patch prevents overwriting an rps_dev_flow entry if it is active.
The intention is that it is better to do aRFS for the first flow instead
of hurting all flows on the same hash. Without this, two (or more) flows
on one RX queue with the same hash can keep overwriting each other. This
causes the driver to reprogram the flow repeatedly.

Changes:
  1. Add a new 'hash' field to struct rps_dev_flow.
  2. Add rps_flow_is_active(): a helper function to check if a flow is
     active or not, extracted from rps_may_expire_flow(). It is further
     simplified as per reviewer feedback.
  3. In set_rps_cpu():
     - Avoid overwriting by programming a new filter if:
        - The slot is not in use, or
        - The slot is in use but the flow is not active, or
        - The slot has an active flow with the same hash, but target CPU
          differs.
     - Save the hash in the rps_dev_flow entry.
  4. rps_may_expire_flow(): Use earlier extracted rps_flow_is_active().

Testing & results:
  - Driver: ice (E810 NIC), Kernel: net-next
  - #CPUs = #RXq = 144 (1:1)
  - Number of flows: 12K
  - Eight RPS settings from 256 to 32768. Though RPS=256 is not ideal,
    it is still sufficient to cover 12K flows (256*144 rx-queues = 64K
    global table slots)
  - Global Table Size = 144 * RPS (effectively equal to 256 * RPS)
  - Each RPS test duration = 8 mins (org code) + 8 mins (new code).
  - Metrics captured on client

Legend for following tables:
Steer-C: #times ndo_rx_flow_steer() was Called by set_rps_cpu()
Steer-L: #times ice_arfs_flow_steer() Looped over aRFS entries
Add:     #times driver actually programmed aRFS (ice_arfs_build_entry())
Del:     #times driver deleted the flow (ice_arfs_del_flow_rules())
Units:   K = 1,000 times, M = 1 million times

  |-------|---------|------|     Org Code    |---------|---------|
  | RPS   | Latency | CPU  | Add    |  Del   | Steer-C | Steer-L |
  |-------|---------|------|--------|--------|---------|---------|
  | 256   | 227.0   | 93.2 | 1.6M   | 1.6M   | 121.7M  | 267.6M  |
  | 512   | 225.9   | 94.1 | 11.5M  | 11.2M  | 65.7M   | 199.6M  |
  | 1024  | 223.5   | 95.6 | 16.5M  | 16.5M  | 27.1M   | 187.3M  |
  | 2048  | 222.2   | 96.3 | 10.5M  | 10.5M  | 12.5M   | 115.2M  |
  | 4096  | 223.9   | 94.1 | 5.5M   | 5.5M   | 7.2M    | 65.9M   |
  | 8192  | 224.7   | 92.5 | 2.7M   | 2.7M   | 3.0M    | 29.9M   |
  | 16384 | 223.5   | 92.5 | 1.3M   | 1.3M   | 1.4M    | 13.9M   |
  | 32768 | 219.6   | 93.2 | 838.1K | 838.1K | 965.1K  | 8.9M    |
  |-------|---------|------|   New Code      |---------|---------|
  | 256   | 201.5   | 99.1 | 13.4K  | 5.0K   | 13.7K   | 75.2K   |
  | 512   | 202.5   | 98.2 | 11.2K  | 5.9K   | 11.2K   | 55.5K   |
  | 1024  | 207.3   | 93.9 | 11.5K  | 9.7K   | 11.5K   | 59.6K   |
  | 2048  | 207.5   | 96.7 | 11.8K  | 11.1K  | 15.5K   | 79.3K   |
  | 4096  | 206.9   | 96.6 | 11.8K  | 11.7K  | 11.8K   | 63.2K   |
  | 8192  | 205.8   | 96.7 | 11.9K  | 11.8K  | 11.9K   | 63.9K   |
  | 16384 | 200.9   | 98.2 | 11.9K  | 11.9K  | 11.9K   | 64.2K   |
  | 32768 | 202.5   | 98.0 | 11.9K  | 11.9K  | 11.9K   | 64.2K   |
  |-------|---------|------|--------|--------|---------|---------|

Some observations:
  1. Overall Latency improved: (1790.19-1634.94)/1790.19*100 = 8.67%
  2. Overall CPU increased:    (777.32-751.49)/751.45*100    = 3.44%
  3. Flow Management (add/delete) remained almost constant at ~11K
     compared to values in millions.

Signed-off-by: Krishna Kumar <krikku@gmail.com>
Link: https://patch.msgid.link/20250825031005.3674864-2-krikku@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-27 18:24:13 -07:00
Takamitsu Iwai
d860d1faa6 net: rose: convert 'use' field to refcount_t
The 'use' field in struct rose_neigh is used as a reference counter but
lacks atomicity. This can lead to race conditions where a rose_neigh
structure is freed while still being referenced by other code paths.

For example, when rose_neigh->use becomes zero during an ioctl operation
via rose_rt_ioctl(), the structure may be removed while its timer is
still active, potentially causing use-after-free issues.

This patch changes the type of 'use' from unsigned short to refcount_t and
updates all code paths to use rose_neigh_hold() and rose_neigh_put() which
operate reference counts atomically.

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Takamitsu Iwai <takamitz@amazon.co.jp>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250823085857.47674-3-takamitz@amazon.co.jp
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-27 07:43:08 -07:00
Takamitsu Iwai
dcb3465902 net: rose: split remove and free operations in rose_remove_neigh()
The current rose_remove_neigh() performs two distinct operations:
1. Removes rose_neigh from rose_neigh_list
2. Frees the rose_neigh structure

Split these operations into separate functions to improve maintainability
and prepare for upcoming refcount_t conversion. The timer cleanup remains
in rose_remove_neigh() because free operations can be called from timer
itself.

This patch introduce rose_neigh_put() to handle the freeing of rose_neigh
structures and modify rose_remove_neigh() to handle removal only.

Signed-off-by: Takamitsu Iwai <takamitz@amazon.co.jp>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250823085857.47674-2-takamitz@amazon.co.jp
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-27 07:43:08 -07:00
Eric Biggers
fe60065689 ipv6: sr: Prepare HMAC key ahead of time
Prepare the HMAC key when it is added to the kernel, instead of
preparing it implicitly for every packet.  This significantly improves
the performance of seg6_hmac_compute().  A microbenchmark on x86_64
shows seg6_hmac_compute() (with HMAC-SHA256) dropping from ~1978 cycles
to ~1419 cycles, a 28% improvement.

The size of 'struct seg6_hmac_info' increases by 128 bytes, but that
should be fine, since there should not be a massive number of keys.

Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Link: https://patch.msgid.link/20250824013644.71928-3-ebiggers@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-26 18:11:29 -07:00
Eric Biggers
095928e7d8 ipv6: sr: Use HMAC-SHA1 and HMAC-SHA256 library functions
Use the HMAC-SHA1 and HMAC-SHA256 library functions instead of
crypto_shash.  This is simpler and faster.  Pre-allocating per-CPU hash
transformation objects and descriptors is no longer needed, and a
microbenchmark on x86_64 shows seg6_hmac_compute() (with HMAC-SHA256)
dropping from ~2494 cycles to ~1978 cycles, a 20% improvement.

Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Link: https://patch.msgid.link/20250824013644.71928-2-ebiggers@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-26 18:11:29 -07:00
Guillaume Nault
1bec9d0c00 ipv4: Convert ->flowi4_tos to dscp_t.
Convert the ->flowic_tos field of struct flowi_common from __u8 to
dscp_t, rename it ->flowic_dscp and propagate these changes to struct
flowi and struct flowi4.

We've had several bugs in the past where ECN bits could interfere with
IPv4 routing, because these bits were not properly cleared when setting
->flowi4_tos. These bugs should be fixed now and the dscp_t type has
been introduced to ensure that variables carrying DSCP values don't
accidentally have any ECN bits set. Several variables and structure
fields have been converted to dscp_t already, but the main IPv4 routing
structure, struct flowi4, is still using a __u8. To avoid any future
regression, this patch converts it to dscp_t.

There are many users to convert at once. Fortunately, around half of
->flowi4_tos users already have a dscp_t value at hand, which they
currently convert to __u8 using inet_dscp_to_dsfield(). For all of
these users, we just need to drop that conversion.

But, although we try to do the __u8 <-> dscp_t conversions at the
boundaries of the network or of user space, some places still store
TOS/DSCP variables as __u8 in core networking code. Those can hardly be
converted either because the data structure is part of UAPI or because
the same variable or field is also used for handling ECN in other parts
of the code. In all of these cases where we don't have a dscp_t
variable at hand, we need to use inet_dsfield_to_dscp() when
interacting with ->flowi4_dscp.

Changes since v1:
  * Fix space alignment in __bpf_redirect_neigh_v4() (Ido).

Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/29acecb45e911d17446b9a3dbdb1ab7b821ea371.1756128932.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-26 17:34:31 -07:00
Shahar Shitrit
6a06d8c405 devlink: Introduce burst period for health reporter
Currently, the devlink health reporter starts the grace period
immediately after handling an error, blocking any further recoveries
until it finished.

However, when a single root cause triggers multiple errors in a short
time frame, it is desirable to treat them as a bulk of errors and to
allow their recoveries, avoiding premature blocking of subsequent
related errors, and reducing the risk of inconsistent or incomplete
error handling.

To address this, introduce a configurable burst period for devlink
health reporter. Start this period when the first error is handled,
and allow recovery attempts for reported errors during this window.
Once burst period expires, begin the grace period to block further
recoveries until it concludes.

Timeline summary:

----|--------|------------------------------/----------------------/--
error is  error is       burst period           grace period
reported  recovered  (recoveries allowed)    (recoveries blocked)

For calculating the burst period duration, use the same
last_recovery_ts as the grace period. Update it on recovery only
when the burst period is inactive (either disabled or at the
first error).

This patch implements the framework for the burst period and
effectively sets its value to 0 at reporter creation, so the current
behavior remains unchanged, which ensures backward compatibility.

A downstream patch will make the burst period configurable.

Signed-off-by: Shahar Shitrit <shshitrit@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/20250824084354.533182-4-mbloch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-26 17:24:16 -07:00
Shahar Shitrit
d2b0073745 devlink: Move graceful period parameter to reporter ops
Move the default graceful period from a parameter to
devlink_health_reporter_create() to a field in the
devlink_health_reporter_ops structure.

This change improves consistency, as the graceful period is inherently
tied to the reporter's behavior and recovery policy. It simplifies the
signature of devlink_health_reporter_create() and its internal helper
functions. It also centralizes the reporter configuration at the ops
structure, preparing the groundwork for a downstream patch that will
introduce a devlink health reporter burst period attribute whose
default value will similarly be provided by the driver via the ops
structure.

Signed-off-by: Shahar Shitrit <shshitrit@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/20250824084354.533182-2-mbloch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-26 17:24:16 -07:00
Kuniyuki Iwashima
cb16f4b6c7 tcp: Don't pass hashinfo to socket lookup helpers.
These socket lookup functions required struct inet_hashinfo because
they are shared by TCP and DCCP.

  * __inet_lookup_established()
  * __inet_lookup_listener()
  * __inet6_lookup_established()
  * inet6_lookup_listener()

DCCP has gone, and we don't need to pass hashinfo down to them.

Let's fetch net->ipv4.tcp_death_row.hashinfo directly in the above
4 functions.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250822190803.540788-5-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-25 17:53:35 -07:00
Kuniyuki Iwashima
2d842b6c67 tcp: Remove timewait_sock_ops.twsk_destructor().
Since DCCP has been removed, sk->sk_prot->twsk_prot->twsk_destructor
is always tcp_twsk_destructor().

Let's call tcp_twsk_destructor() directly in inet_twsk_free() and
remove ->twsk_destructor().

While at it, tcp_twsk_destructor() is un-exported.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250822190803.540788-3-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-25 17:53:35 -07:00
Pavel Shpakovskiy
6bbd0d3f0c Bluetooth: hci_sync: fix set_local_name race condition
Function set_name_sync() uses hdev->dev_name field to send
HCI_OP_WRITE_LOCAL_NAME command, but copying from data to hdev->dev_name
is called after mgmt cmd was queued, so it is possible that function
set_name_sync() will read old name value.

This change adds name as a parameter for function hci_update_name_sync()
to avoid race condition.

Fixes: 6f6ff38a1e ("Bluetooth: hci_sync: Convert MGMT_OP_SET_LOCAL_NAME")
Signed-off-by: Pavel Shpakovskiy <pashpakovskii@salutedevices.com>
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-08-22 13:57:31 -04:00
Jakub Kicinski
a9af709fda Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.17-rc3).

No conflicts or adjacent changes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-21 11:33:15 -07:00
Jakub Kicinski
07cf71bf25 net: page_pool: add page_pool_get()
There is a page_pool_put() function but no get equivalent.
Having multiple references to a page pool is quite useful.
It avoids branching in create / destroy paths in drivers
which support memory providers.

Use the new helper in bnxt.

Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Link: https://patch.msgid.link/20250820025704.166248-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-21 08:03:54 -07:00
Hangbin Liu
b64d035f77 bonding: update LACP activity flag after setting lacp_active
The port's actor_oper_port_state activity flag should be updated immediately
after changing the lacp_active option to reflect the current mode correctly.

Fixes: 3a755cd8b7 ("bonding: add new option lacp_active")
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Link: https://patch.msgid.link/20250815062000.22220-2-liuhangbin@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-08-21 09:35:20 +02:00
Eric Dumazet
a6d4f25888 net: set net.core.rmem_max and net.core.wmem_max to 4 MB
SO_RCVBUF and SO_SNDBUF have limited range today, unless
distros or system admins change rmem_max and wmem_max.

Even iproute2 uses 1 MB SO_RCVBUF which is capped by
the kernel.

Decouple [rw]mem_max and [rw]mem_default and increase
[rw]mem_max to 4 MB.

Before:

$ sysctl net.core.rmem_default net.core.rmem_max net.core.wmem_default net.core.wmem_max
net.core.rmem_default = 212992
net.core.rmem_max = 212992
net.core.wmem_default = 212992
net.core.wmem_max = 212992

After:

$ sysctl net.core.rmem_default net.core.rmem_max net.core.wmem_default net.core.wmem_max
net.core.rmem_default = 212992
net.core.rmem_max = 4194304
net.core.wmem_default = 212992
net.core.wmem_max = 4194304

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Link: https://patch.msgid.link/20250819174030.1986278-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-20 19:35:00 -07:00
Eric Biggers
2f3dd6ec90 sctp: Convert cookie authentication to use HMAC-SHA256
Convert SCTP cookies to use HMAC-SHA256, instead of the previous choice
of the legacy algorithms HMAC-MD5 and HMAC-SHA1.  Simplify and optimize
the code by using the HMAC-SHA256 library instead of crypto_shash, and
by preparing the HMAC key when it is generated instead of per-operation.

This doesn't break compatibility, since the cookie format is an
implementation detail, not part of the SCTP protocol itself.

Note that the cookie size doesn't change either.  The HMAC field was
already 32 bytes, even though previously at most 20 bytes were actually
compared.  32 bytes exactly fits an untruncated HMAC-SHA256 value.  So,
although we could safely truncate the MAC to something slightly shorter,
for now just keep the cookie size the same.

I also considered SipHash, but that would generate only 8-byte MACs.  An
8-byte MAC *might* suffice here.  However, there's quite a lot of
information in the SCTP cookies: more than in TCP SYN cookies.  So
absent an analysis that occasional forgeries of all that information is
okay in SCTP, I errored on the side of caution.

Remove HMAC-MD5 and HMAC-SHA1 as options, since the new HMAC-SHA256
option is just better.  It's faster as well as more secure.  For
example, benchmarking on x86_64, cookie authentication is now nearly 3x
as fast as the previous default choice and implementation of HMAC-MD5.

Also just make the kernel always support cookie authentication if SCTP
is supported at all, rather than making it optional in the build.  (It
was sort of optional before, but it didn't really work properly.  E.g.,
a kernel with CONFIG_SCTP_COOKIE_HMAC_MD5=n still supported HMAC-MD5
cookie authentication if CONFIG_CRYPTO_HMAC and CONFIG_CRYPTO_MD5
happened to be enabled in the kconfig for other reasons.)

Acked-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Link: https://patch.msgid.link/20250818205426.30222-5-ebiggers@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-19 19:36:26 -07:00
Eric Biggers
bf40785fa4 sctp: Use HMAC-SHA1 and HMAC-SHA256 library for chunk authentication
For SCTP chunk authentication, use the HMAC-SHA1 and HMAC-SHA256 library
functions instead of crypto_shash.  This is simpler and faster.  There's
no longer any need to pre-allocate 'crypto_shash' objects; the SCTP code
now simply calls into the HMAC code directly.

As part of this, make SCTP always support both HMAC-SHA1 and
HMAC-SHA256.  Previously, it only guaranteed support for HMAC-SHA1.
However, HMAC-SHA256 tended to be supported too anyway, as it was
supported if CONFIG_CRYPTO_SHA256 was enabled elsewhere in the kconfig.

Acked-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Link: https://patch.msgid.link/20250818205426.30222-4-ebiggers@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-19 19:36:25 -07:00
Kuniyuki Iwashima
bf64002c94 net: Define sk_memcg under CONFIG_MEMCG.
Except for sk_clone_lock(), all accesses to sk->sk_memcg
is done under CONFIG_MEMCG.

As a bonus, let's define sk->sk_memcg under CONFIG_MEMCG.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Link: https://patch.msgid.link/20250815201712.1745332-11-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-19 19:20:59 -07:00
Kuniyuki Iwashima
b2ffd10cdd net-memcg: Pass struct sock to mem_cgroup_sk_under_memory_pressure().
We will store a flag in the lowest bit of sk->sk_memcg.

Then, we cannot pass the raw pointer to mem_cgroup_under_socket_pressure().

Let's pass struct sock to it and rename the function to match other
functions starting with mem_cgroup_sk_.

Note that the helper is moved to sock.h to use mem_cgroup_from_sk().

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Link: https://patch.msgid.link/20250815201712.1745332-10-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-19 19:20:59 -07:00