Commit Graph

82231 Commits

Author SHA1 Message Date
Matt Johnston
a58893aa17 net: mctp: Fix bad kfree_skb in bind lookup test
The kunit test's skb_pkt is consumed by mctp_dst_input() so shouldn't be
freed separately.

Fixes: e6d8e7dbc5 ("net: mctp: Add bind lookup test")
Reported-by: Alexandre Ghiti <alex@ghiti.fr>
Closes: https://lore.kernel.org/all/734b02a3-1941-49df-a0da-ec14310d41e4@ghiti.fr/
Signed-off-by: Matt Johnston <matt@codeconstruct.com.au>
Tested-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Link: https://patch.msgid.link/20250812-fix-mctp-bind-test-v1-1-5e2128664eb3@codeconstruct.com.au
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-13 17:07:34 -07:00
Wang Liang
4d18083d6b vsock: use sizeof(struct sockaddr_storage) instead of magic value
Previous commit 230b183921 ("net: Use standard structures for generic
socket address structures.") use 'struct sockaddr_storage address;'
to replace 'char address[MAX_SOCK_ADDR];'.

The macro MAX_SOCK_ADDR is removed by commit 01893c82b4 ("net: Remove
MAX_SOCK_ADDR constant").

The comment in vsock_getname() is outdated, use sizeof(struct
sockaddr_storage) instead of magic value 128.

Signed-off-by: Wang Liang <wangliang74@huawei.com>
Link: https://patch.msgid.link/20250812015929.1419896-1-wangliang74@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-13 17:05:21 -07:00
Pablo Neira Ayuso
cf5fb87fcd netfilter: nf_tables: reject duplicate device on updates
A chain/flowtable update with duplicated devices in the same batch is
possible. Unfortunately, netdev event path only removes the first
device that is found, leaving unregistered the hook of the duplicated
device.

Check if a duplicated device exists in the transaction batch, bail out
with EEXIST in such case.

WARNING is hit when unregistering the hook:

 [49042.221275] WARNING: CPU: 4 PID: 8425 at net/netfilter/core.c:340 nf_hook_entry_head+0xaa/0x150
 [49042.221375] CPU: 4 UID: 0 PID: 8425 Comm: nft Tainted: G S                  6.16.0+ #170 PREEMPT(full)
 [...]
 [49042.221382] RIP: 0010:nf_hook_entry_head+0xaa/0x150

Fixes: 78d9f48f7f ("netfilter: nf_tables: add devices to existing flowtable")
Fixes: b9703ed44f ("netfilter: nf_tables: support for adding new devices to an existing netdev chain")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
2025-08-13 08:34:55 +02:00
Frederic Weisbecker
c0a23bbc98 ipvs: Fix estimator kthreads preferred affinity
The estimator kthreads' affinity are defined by sysctl overwritten
preferences and applied through a plain call to the scheduler's affinity
API.

However since the introduction of managed kthreads preferred affinity,
such a practice shortcuts the kthreads core code which eventually
overwrites the target to the default unbound affinity.

Fix this with using the appropriate kthread's API.

Fixes: d1a8919758 ("kthread: Default affine kthread to its preferred NUMA node")
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Florian Westphal <fw@strlen.de>
2025-08-13 08:34:33 +02:00
Florian Westphal
30c1d25b98 netfilter: nft_set_pipapo: fix null deref for empty set
Blamed commit broke the check for a null scratch map:
  -  if (unlikely(!m || !*raw_cpu_ptr(m->scratch)))
  +  if (unlikely(!raw_cpu_ptr(m->scratch)))

This should have been "if (!*raw_ ...)".
Use the pattern of the avx2 version which is more readable.

This can only be reproduced if avx2 support isn't available.

Fixes: d8d871a35c ("netfilter: nft_set_pipapo: merge pipapo_get/lookup")
Signed-off-by: Florian Westphal <fw@strlen.de>
2025-08-13 08:34:32 +02:00
Jakub Kicinski
6db015fc4b tls: handle data disappearing from under the TLS ULP
TLS expects that it owns the receive queue of the TCP socket.
This cannot be guaranteed in case the reader of the TCP socket
entered before the TLS ULP was installed, or uses some non-standard
read API (eg. zerocopy ones). Replace the WARN_ON() and a buggy
early exit (which leaves anchor pointing to a freed skb) with real
error handling. Wipe the parsing state and tell the reader to retry.

We already reload the anchor every time we (re)acquire the socket lock,
so the only condition we need to avoid is an out of bounds read
(not having enough bytes in the socket for previously parsed record len).

If some data was read from under TLS but there's enough in the queue
we'll reload and decrypt what is most likely not a valid TLS record.
Leading to some undefined behavior from TLS perspective (corrupting
a stream? missing an alert? missing an attack?) but no kernel crash
should take place.

Reported-by: William Liu <will@willsroot.io>
Reported-by: Savino Dicanosa <savy@syst3mfailure.io>
Link: https://lore.kernel.org/tFjq_kf7sWIG3A7CrCg_egb8CVsT_gsmHAK0_wxDPJXfIzxFAMxqmLwp3MlU5EHiet0AwwJldaaFdgyHpeIUCS-3m3llsmRzp9xIOBR4lAI=@syst3mfailure.io
Fixes: 84c61fe1a7 ("tls: rx: do not use the standard strparser")
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250807232907.600366-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-12 18:59:05 -07:00
Thorsten Blum
b3ba7d929c net/sched: Remove redundant memset(0) call in reset_policy()
The call to nla_strscpy() already zero-pads the tail of the destination
buffer which makes the additional memset(0) call redundant. Remove it.

Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Reviewed-by: Joe Damato <joe@dama.to>
Link: https://patch.msgid.link/20250811164039.43250-1-thorsten.blum@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-12 17:13:29 -07:00
Eric Dumazet
86e3d52bd3 phonet: add __rcu annotations
Removes following sparse errors.

make C=2 net/phonet/socket.o net/phonet/af_phonet.o
  CHECK   net/phonet/socket.c
net/phonet/socket.c:619:14: error: incompatible types in comparison expression (different address spaces):
net/phonet/socket.c:619:14:    struct sock [noderef] __rcu *
net/phonet/socket.c:619:14:    struct sock *
net/phonet/socket.c:642:17: error: incompatible types in comparison expression (different address spaces):
net/phonet/socket.c:642:17:    struct sock [noderef] __rcu *
net/phonet/socket.c:642:17:    struct sock *
net/phonet/socket.c:658:17: error: incompatible types in comparison expression (different address spaces):
net/phonet/socket.c:658:17:    struct sock [noderef] __rcu *
net/phonet/socket.c:658:17:    struct sock *
net/phonet/socket.c:677:25: error: incompatible types in comparison expression (different address spaces):
net/phonet/socket.c:677:25:    struct sock [noderef] __rcu *
net/phonet/socket.c:677:25:    struct sock *
net/phonet/socket.c:726:21: warning: context imbalance in 'pn_res_seq_start' - wrong count at exit
net/phonet/socket.c:741:13: warning: context imbalance in 'pn_res_seq_stop' - wrong count at exit
  CHECK   net/phonet/af_phonet.c
net/phonet/af_phonet.c:35:14: error: incompatible types in comparison expression (different address spaces):
net/phonet/af_phonet.c:35:14:    struct phonet_protocol const [noderef] __rcu *
net/phonet/af_phonet.c:35:14:    struct phonet_protocol const *
net/phonet/af_phonet.c:474:17: error: incompatible types in comparison expression (different address spaces):
net/phonet/af_phonet.c:474:17:    struct phonet_protocol const [noderef] __rcu *
net/phonet/af_phonet.c:474:17:    struct phonet_protocol const *
net/phonet/af_phonet.c:486:9: error: incompatible types in comparison expression (different address spaces):
net/phonet/af_phonet.c:486:9:    struct phonet_protocol const [noderef] __rcu *
net/phonet/af_phonet.c:486:9:    struct phonet_protocol const *

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Rémi Denis-Courmont <courmisch@gmail.com>
Link: https://patch.msgid.link/20250811145252.1007242-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-12 14:15:24 -07:00
Thorsten Blum
63fe077c21 caif: Replace memset(0) + strscpy() with strscpy_pad()
Replace memset(0) followed by strscpy() with strscpy_pad() to improve
cfctrl_linkup_request(). This avoids zeroing the memory before copying
the string and ensures the destination buffer is only written to once,
simplifying the code and improving efficiency.

Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Link: https://patch.msgid.link/20250811093442.5075-2-thorsten.blum@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-12 14:08:56 -07:00
Qianfeng Rong
7792232a4e RDS: remove redundant __GFP_NOWARN
GFP_NOWAIT already includes __GFP_NOWARN, so let's remove the redundant
__GFP_NOWARN.

Signed-off-by: Qianfeng Rong <rongqianfeng@vivo.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Link: https://patch.msgid.link/20250810072944.438574-3-rongqianfeng@vivo.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-12 14:05:44 -07:00
Qianfeng Rong
526c2530cb tcp: cdg: remove redundant __GFP_NOWARN
GFP_NOWAIT already includes __GFP_NOWARN, so let's remove the redundant
__GFP_NOWARN.

Signed-off-by: Qianfeng Rong <rongqianfeng@vivo.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250810072944.438574-2-rongqianfeng@vivo.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-12 14:05:43 -07:00
Jedrzej Jagielski
c5ec7f49b4 devlink: let driver opt out of automatic phys_port_name generation
Currently when adding devlink port, phys_port_name is automatically
generated within devlink port initialization flow. As a result adding
devlink port support to driver may result in forced changes of interface
names, which breaks already existing network configs.

This is an expected behavior but in some scenarios it would not be
preferable to provide such limitation for legacy driver not being able to
keep 'pre-devlink' interface name.

Add flag no_phys_port_name to devlink_port_attrs struct which indicates
if devlink should not alter name of interface.

Suggested-by: Jiri Pirko <jiri@resnulli.us>
Link: https://lore.kernel.org/all/nbwrfnjhvrcduqzjl4a2jafnvvud6qsbxlvxaxilnryglf4j7r@btuqrimnfuly/
Signed-off-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-08-12 13:23:39 -07:00
Martin KaFai Lau
9e293d47bf Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Cross merge bpf/master after 6.17-rc1.

No conflict.

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2025-08-12 12:14:02 -07:00
Paolo Abeni
c04fdca8a9 Merge tag 'ipsec-2025-08-11' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec
Steffen Klassert says:

====================
pull request (net): ipsec 2025-08-11

1) Fix flushing of all states in xfrm_state_fini.
   From Sabrina Dubroca.

2) Fix some IPsec software offload features. These
   got lost with some recent HW offload changes.
   From Sabrina Dubroca.

Please pull or let me know if there are problems.

* tag 'ipsec-2025-08-11' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec:
  udp: also consider secpath when evaluating ipsec use for checksumming
  xfrm: bring back device check in validate_xmit_xfrm
  xfrm: restore GSO for SW crypto
  xfrm: flush all states in xfrm_state_fini
====================

Link: https://patch.msgid.link/20250811092008.731573-1-steffen.klassert@secunet.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-08-12 15:01:09 +02:00
Jakub Kicinski
b3fc08ab9a net: prevent deadlocks when enabling NAPIs with mixed kthread config
The following order of calls currently deadlocks if:
 - device has threaded=1; and
 - NAPI has persistent config with threaded=0.

  netif_napi_add_weight_config()
    dev->threaded == 1
      napi_kthread_create()

  napi_enable()
    napi_restore_config()
      napi_set_threaded(0)
        napi_stop_kthread()
	  while (NAPIF_STATE_SCHED)
	    msleep(20)

We deadlock because disabled NAPI has STATE_SCHED set.
Creating a thread in netif_napi_add() just to destroy it in
napi_disable() is fairly ugly in the first place. Let's read
both the device config and the NAPI config in netif_napi_add().

Fixes: e6d7626881 ("net: Update threaded state in napi config in netif_set_threaded")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Joe Damato <joe@dama.to>
Link: https://patch.msgid.link/20250809001205.1147153-4-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-08-12 14:43:05 +02:00
Jakub Kicinski
ccba9f6baa net: update NAPI threaded config even for disabled NAPIs
We have to make sure that all future NAPIs will have the right threaded
state when the state is configured on the device level.
We chose not to have an "unset" state for threaded, and not to wipe
the NAPI config clean when channels are explicitly disabled.
This means the persistent config structs "exist" even when their NAPIs
are not instantiated.

Differently put - the NAPI persistent state lives in the net_device
(ncfg == struct napi_config):

    ,--- [napi 0] - [napi 1]
 [dev]      |          |
    `--- [ncfg 0] - [ncfg 1]

so say we a device with 2 queues but only 1 enabled:

    ,--- [napi 0]
 [dev]      |
    `--- [ncfg 0] - [ncfg 1]

now we set the device to threaded=1:

    ,---------- [napi 0 (thr:1)]
 [dev(thr:1)]      |
    `---------- [ncfg 0 (thr:1)] - [ncfg 1 (thr:?)]

Since [ncfg 1] was not attached to a NAPI during configuration we
skipped it. If we create a NAPI for it later it will have the old
setting (presumably disabled). One could argue if this is right
or not "in principle", but it's definitely not how things worked
before per-NAPI config..

Fixes: 2677010e77 ("Add support to set NAPI threaded for individual NAPI")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Joe Damato <joe@dama.to>
Link: https://patch.msgid.link/20250809001205.1147153-3-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-08-12 14:43:05 +02:00
Linus Torvalds
53e760d894 Merge tag 'nfsd-6.17-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull nfsd fixes from Chuck Lever:

 - A correctness fix for delegated timestamps

 - Address an NFSD shutdown hang when LOCALIO is in use

 - Prevent a remotely exploitable crasher when TLS is in use

* tag 'nfsd-6.17-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
  sunrpc: fix handling of server side tls alerts
  nfsd: avoid ref leak in nfsd_open_local_fh()
  nfsd: don't set the ctime on delegated atime updates
2025-08-11 07:38:55 -07:00
Linus Torvalds
ccc1ead23c Merge tag 'nfs-for-6.17-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client updates from Trond Myklebust:
 "Highlights include:

  Stable fixes:
   - don't inherit NFS filesystem capabilities when crossing from one
     filesystem to another

  Bugfixes:
   - NFS wakeup of __nfs_lookup_revalidate() needs memory barriers
   - NFS improve bounds checking in nfs_fh_to_dentry()
   - NFS Fix allocation errors when writing to a NFS file backed
     loopback device
   - NFSv4: More listxattr fixes
   - SUNRPC: fix client handling of TLS alerts
   - pNFS block/scsi layout fix for an uninitialised pointer
     dereference
   - pNFS block/scsi layout fixes for the extent encoding, stripe
     mapping, and disk offset overflows
   - pNFS layoutcommit work around for RPC size limitations
   - pNFS/flexfiles avoid looping when handling fatal errors after
     layoutget
   - localio: fix various race conditions

  Features and cleanups:
   - Add NFSv4 support for retrieving the btime
   - NFS: Allow folio migration for the case of mode == MIGRATE_SYNC
   - NFS: Support using a kernel keyring to store TLS certificates
   - NFSv4: Speed up delegation lookup using a hash table
   - Assorted cleanups to remove unused variables and struct fields
   - Assorted new tracepoints to improve debugging"

* tag 'nfs-for-6.17-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (44 commits)
  NFS/localio: nfs_uuid_put() fix the wake up after unlinking the file
  NFS/localio: nfs_uuid_put() fix races with nfs_open/close_local_fh()
  NFS/localio: nfs_close_local_fh() fix check for file closed
  NFSv4: Remove duplicate lookups, capability probes and fsinfo calls
  NFS: Fix the setting of capabilities when automounting a new filesystem
  sunrpc: fix client side handling of tls alerts
  nfs/localio: use read_seqbegin() rather than read_seqbegin_or_lock()
  NFS: Fixup allocation flags for nfsiod's __GFP_NORETRY
  NFSv4.2: another fix for listxattr
  NFS: Fix filehandle bounds checking in nfs_fh_to_dentry()
  SUNRPC: Silence warnings about parameters not being described
  NFS: Clean up pnfs_put_layout_hdr()/pnfs_destroy_layout_final()
  NFS: Fix wakeup of __nfs_lookup_revalidate() in unblock_revalidate()
  NFS: use a hash table for delegation lookup
  NFS: track active delegations per-server
  NFS: move the delegation_watermark module parameter
  NFS: cleanup nfs_inode_reclaim_delegation
  NFS: cleanup error handling in nfs4_server_common_setup
  pNFS/flexfiles: don't attempt pnfs on fatal DS errors
  NFS: drop __exit from nfs_exit_keyring
  ...
2025-08-09 07:20:44 +03:00
Xin Long
fd60d8a086 sctp: linearize cloned gso packets in sctp_rcv
A cloned head skb still shares these frag skbs in fraglist with the
original head skb. It's not safe to access these frag skbs.

syzbot reported two use-of-uninitialized-memory bugs caused by this:

  BUG: KMSAN: uninit-value in sctp_inq_pop+0x15b7/0x1920 net/sctp/inqueue.c:211
   sctp_inq_pop+0x15b7/0x1920 net/sctp/inqueue.c:211
   sctp_assoc_bh_rcv+0x1a7/0xc50 net/sctp/associola.c:998
   sctp_inq_push+0x2ef/0x380 net/sctp/inqueue.c:88
   sctp_backlog_rcv+0x397/0xdb0 net/sctp/input.c:331
   sk_backlog_rcv+0x13b/0x420 include/net/sock.h:1122
   __release_sock+0x1da/0x330 net/core/sock.c:3106
   release_sock+0x6b/0x250 net/core/sock.c:3660
   sctp_wait_for_connect+0x487/0x820 net/sctp/socket.c:9360
   sctp_sendmsg_to_asoc+0x1ec1/0x1f00 net/sctp/socket.c:1885
   sctp_sendmsg+0x32b9/0x4a80 net/sctp/socket.c:2031
   inet_sendmsg+0x25a/0x280 net/ipv4/af_inet.c:851
   sock_sendmsg_nosec net/socket.c:718 [inline]

and

  BUG: KMSAN: uninit-value in sctp_assoc_bh_rcv+0x34e/0xbc0 net/sctp/associola.c:987
   sctp_assoc_bh_rcv+0x34e/0xbc0 net/sctp/associola.c:987
   sctp_inq_push+0x2a3/0x350 net/sctp/inqueue.c:88
   sctp_backlog_rcv+0x3c7/0xda0 net/sctp/input.c:331
   sk_backlog_rcv+0x142/0x420 include/net/sock.h:1148
   __release_sock+0x1d3/0x330 net/core/sock.c:3213
   release_sock+0x6b/0x270 net/core/sock.c:3767
   sctp_wait_for_connect+0x458/0x820 net/sctp/socket.c:9367
   sctp_sendmsg_to_asoc+0x223a/0x2260 net/sctp/socket.c:1886
   sctp_sendmsg+0x3910/0x49f0 net/sctp/socket.c:2032
   inet_sendmsg+0x269/0x2a0 net/ipv4/af_inet.c:851
   sock_sendmsg_nosec net/socket.c:712 [inline]

This patch fixes it by linearizing cloned gso packets in sctp_rcv().

Fixes: 90017accff ("sctp: Add GSO support")
Reported-by: syzbot+773e51afe420baaf0e2b@syzkaller.appspotmail.com
Reported-by: syzbot+70a42f45e76bede082be@syzkaller.appspotmail.com
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Link: https://patch.msgid.link/dd7dc337b99876d4132d0961f776913719f7d225.1754595611.git.lucien.xin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-08 13:08:06 -07:00
Budimir Markovic
aba0c94f61 vsock: Do not allow binding to VMADDR_PORT_ANY
It is possible for a vsock to autobind to VMADDR_PORT_ANY. This can
cause a use-after-free when a connection is made to the bound socket.
The socket returned by accept() also has port VMADDR_PORT_ANY but is not
on the list of unbound sockets. Binding it will result in an extra
refcount decrement similar to the one fixed in fcdd2242c0 (vsock: Keep
the binding until socket destruction).

Modify the check in __vsock_bind_connectible() to also prevent binding
to VMADDR_PORT_ANY.

Fixes: d021c34405 ("VSOCK: Introduce VM Sockets")
Reported-by: Budimir Markovic <markovicbudimir@gmail.com>
Signed-off-by: Budimir Markovic <markovicbudimir@gmail.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Link: https://patch.msgid.link/20250807041811.678-1-markovicbudimir@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-08 12:55:00 -07:00
Jakub Kicinski
64fdaa94bf net: page_pool: allow enabling recycling late, fix false positive warning
Page pool can have pages "directly" (locklessly) recycled to it,
if the NAPI that owns the page pool is scheduled to run on the same CPU.
To make this safe we check that the NAPI is disabled while we destroy
the page pool. In most cases NAPI and page pool lifetimes are tied
together so this happens naturally.

The queue API expects the following order of calls:
 -> mem_alloc
    alloc new pp
 -> stop
    napi_disable
 -> start
    napi_enable
 -> mem_free
    free old pp

Here we allocate the page pool in ->mem_alloc and free in ->mem_free.
But the NAPIs are only stopped between ->stop and ->start. We created
page_pool_disable_direct_recycling() to safely shut down the recycling
in ->stop. This way the page_pool_destroy() call in ->mem_free doesn't
have to worry about recycling any more.

Unfortunately, the page_pool_disable_direct_recycling() is not enough
to deal with failures which necessitate freeing the _new_ page pool.
If we hit a failure in ->mem_alloc or ->stop the new page pool has
to be freed while the NAPI is active (assuming driver attaches the
page pool to an existing NAPI instance and doesn't reallocate NAPIs).

Freeing the new page pool is technically safe because it hasn't been
used for any packets, yet, so there can be no recycling. But the check
in napi_assert_will_not_race() has no way of knowing that. We could
check if page pool is empty but that'd make the check much less likely
to trigger during development.

Add page_pool_enable_direct_recycling(), pairing with
page_pool_disable_direct_recycling(). It will allow us to create the new
page pools in "disabled" state and only enable recycling when we know
the reconfig operation will not fail.

Coincidentally it will also let us re-enable the recycling for the old
pool, if the reconfig failed:

 -> mem_alloc (new)
 -> stop (old)
    # disables direct recycling for old
 -> start (new)
    # fail!!
 -> start (old)
    # go back to old pp but direct recycling is lost :(
 -> mem_free (new)

The new helper is idempotent to make the life easier for drivers,
which can operate in HDS mode and support zero-copy Rx.
The driver can call the helper twice whether there are two pools
or it has multiple references to a single pool.

Fixes: 40eca00ae6 ("bnxt_en: unlink page pool when stopping Rx queue")
Tested-by: David Wei <dw@davidwei.uk>
Link: https://patch.msgid.link/20250805003654.2944974-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-08 12:54:42 -07:00
Jakub Kicinski
f6a2a31043 Merge tag 'nf-25-08-07' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for net:

1) Reinstantiate Florian Westphal as a Netfilter maintainer.

2) Depend on both NETFILTER_XTABLES and NETFILTER_XTABLES_LEGACY,
   from Arnd Bergmann.

3) Use id to annotate last conntrack/expectation visited to resume
   netlink dump, patches from Florian Westphal.

4) Fix bogus element in nft_pipapo avx2 lookup, introduced in
   the last nf-next batch of updates, also from Florian.

5) Return 0 instead of recycling ret variable in
   nf_conntrack_log_invalid_sysctl(), introduced in the last
   nf-next batch of updates, from Dan Carpenter.

6) Fix WARN_ON_ONCE triggered by syzbot with larger cgroup level
   in nft_socket.

* tag 'nf-25-08-07' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
  netfilter: nft_socket: remove WARN_ON_ONCE with huge level value
  netfilter: conntrack: clean up returns in nf_conntrack_log_invalid_sysctl()
  netfilter: nft_set_pipapo: don't return bogus extension pointer
  netfilter: ctnetlink: remove refcounting in expectation dumpers
  netfilter: ctnetlink: fix refcount leak on table dump
  netfilter: add back NETFILTER_XTABLES dependencies
  MAINTAINERS: resurrect my netfilter maintainer entry
====================

Link: https://patch.msgid.link/20250807112948.1400523-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-08 11:45:14 -07:00
Linus Torvalds
3781648824 Merge tag 'net-6.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
  Previous releases - regressions:

   - netlink: avoid infinite retry looping in netlink_unicast()

  Previous releases - always broken:

   - packet: fix a race in packet_set_ring() and packet_notifier()

   - ipv6: reject malicious packets in ipv6_gso_segment()

   - sched: mqprio: fix stack out-of-bounds write in tc entry parsing

   - net: drop UFO packets (injected via virtio) in udp_rcv_segment()

   - eth: mlx5: correctly set gso_segs when LRO is used, avoid false
     positive checksum validation errors

   - netpoll: prevent hanging NAPI when netcons gets enabled

   - phy: mscc: fix parsing of unicast frames for PTP timestamping

   - a number of device tree / OF reference leak fixes"

* tag 'net-6.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (44 commits)
  pptp: fix pptp_xmit() error path
  net: ti: icssg-prueth: Fix skb handling for XDP_PASS
  net: Update threaded state in napi config in netif_set_threaded
  selftests: netdevsim: Xfail nexthop test on slow machines
  eth: fbnic: Lock the tx_dropped update
  eth: fbnic: Fix tx_dropped reporting
  eth: fbnic: remove the debugging trick of super high page bias
  net: ftgmac100: fix potential NULL pointer access in ftgmac100_phy_disconnect
  dt-bindings: net: Replace bouncing Alexandru Tachici emails
  dpll: zl3073x: ZL3073X_I2C and ZL3073X_SPI should depend on NET
  net/sched: mqprio: fix stack out-of-bounds write in tc entry parsing
  Revert "net: mdio_bus: Use devm for getting reset GPIO"
  selftests: net: packetdrill: xfail all problems on slow machines
  net/packet: fix a race in packet_set_ring() and packet_notifier()
  benet: fix BUG when creating VFs
  net: airoha: npu: Add missing MODULE_FIRMWARE macros
  net: devmem: fix DMA direction on unmapping
  ipa: fix compile-testing with qcom-mdt=m
  eth: fbnic: unlink NAPIs from queues on error to open
  net: Add locking to protect skb->dev access in ip_output
  ...
2025-08-08 07:03:25 +03:00
Li Jun
fa47913284 bpf: Standardize function declaration style
'noinlne' after 'int' cause
"ERROR: inline keyword should sit between storage class and type"
by checkpatch.pl

- Standardize function declaration style by moving 'noinline' modifier
- Fix asm volatile statement formatting

Signed-off-by: Li Jun <lijun01@kylinos.cn>
Link: https://lore.kernel.org/r/20250730105019.436235-1-lijun01@kylinos.cn
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-08-07 19:18:03 -07:00
Pablo Neira Ayuso
1dee968d22 netfilter: nft_socket: remove WARN_ON_ONCE with huge level value
syzbot managed to reach this WARN_ON_ONCE by passing a huge level
value, remove it.

  WARNING: CPU: 0 PID: 5853 at net/netfilter/nft_socket.c:220 nft_socket_init+0x2f4/0x3d0 net/netfilter/nft_socket.c:220

Reported-by: syzbot+a225fea35d7baf8dbdc3@syzkaller.appspotmail.com
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-08-07 13:19:26 +02:00
Dan Carpenter
f54186df80 netfilter: conntrack: clean up returns in nf_conntrack_log_invalid_sysctl()
Smatch complains that these look like error paths with missing error
codes, especially the one where we return if nf_log_is_registered() is
true:

    net/netfilter/nf_conntrack_standalone.c:575 nf_conntrack_log_invalid_sysctl()
    warn: missing error code? 'ret'

In fact, all these return zero deliberately.  Change them to return a
literal instead which helps readability as well as silencing the warning.

Fixes: e89a680466 ("netfilter: load nf_log_syslog on enabling nf_conntrack_log_invalid")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Acked-by: Lance Yang <lance.yang@linux.dev>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-08-07 13:19:26 +02:00
Florian Westphal
c8a7c2c608 netfilter: nft_set_pipapo: don't return bogus extension pointer
Dan Carpenter says:
Commit 17a20e09f0 ("netfilter: nft_set: remove one argument from
lookup and update functions") [..] leads to the following Smatch
static checker warning:

 net/netfilter/nft_set_pipapo_avx2.c:1269 nft_pipapo_avx2_lookup()
 error: uninitialized symbol 'ext'.

Fix this by initing ext to NULL and set it only once we've found
a match.

Fixes: 17a20e09f0 ("netfilter: nft_set: remove one argument from lookup and update functions")
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/netfilter-devel/aJBzc3V5wk-yPOnH@stanley.mountain/
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-08-07 13:19:26 +02:00
Florian Westphal
1492e3dcb2 netfilter: ctnetlink: remove refcounting in expectation dumpers
Same pattern as previous patch: do not keep the expectation object
alive via refcount, only store a cookie value and then use that
as the skip hint for dump resumption.

AFAICS this has the same issue as the one resolved in the conntrack
dumper, when we do
  if (!refcount_inc_not_zero(&exp->use))

to increment the refcount, there is a chance that exp == last, which
causes a double-increment of the refcount and subsequent memory leak.

Fixes: cf6994c2b9 ("[NETFILTER]: nf_conntrack_netlink: sync expectation dumping with conntrack table dumping")
Fixes: e844a92843 ("netfilter: ctnetlink: allow to dump expectation per master conntrack")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-08-07 13:19:26 +02:00
Florian Westphal
de788b2e62 netfilter: ctnetlink: fix refcount leak on table dump
There is a reference count leak in ctnetlink_dump_table():
      if (res < 0) {
                nf_conntrack_get(&ct->ct_general); // HERE
                cb->args[1] = (unsigned long)ct;
                ...

While its very unlikely, its possible that ct == last.
If this happens, then the refcount of ct was already incremented.
This 2nd increment is never undone.

This prevents the conntrack object from being released, which in turn
keeps prevents cnet->count from dropping back to 0.

This will then block the netns dismantle (or conntrack rmmod) as
nf_conntrack_cleanup_net_list() will wait forever.

This can be reproduced by running conntrack_resize.sh selftest in a loop.
It takes ~20 minutes for me on a preemptible kernel on average before
I see a runaway kworker spinning in nf_conntrack_cleanup_net_list.

One fix would to change this to:
        if (res < 0) {
		if (ct != last)
	                nf_conntrack_get(&ct->ct_general);

But this reference counting isn't needed in the first place.
We can just store a cookie value instead.

A followup patch will do the same for ctnetlink_exp_dump_table,
it looks to me as if this has the same problem and like
ctnetlink_dump_table, we only need a 'skip hint', not the actual
object so we can apply the same cookie strategy there as well.

Fixes: d205dc4079 ("[NETFILTER]: ctnetlink: fix deadlock in table dumping")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-08-07 13:19:25 +02:00
Arnd Bergmann
25a8b88f00 netfilter: add back NETFILTER_XTABLES dependencies
Some Kconfig symbols were changed to depend on the 'bool' symbol
NETFILTER_XTABLES_LEGACY, which means they can now be set to built-in
when the xtables code itself is in a loadable module:

x86_64-linux-ld: vmlinux.o: in function `arpt_unregister_table_pre_exit':
(.text+0x1831987): undefined reference to `xt_find_table'
x86_64-linux-ld: vmlinux.o: in function `get_info.constprop.0':
arp_tables.c:(.text+0x1831aab): undefined reference to `xt_request_find_table_lock'
x86_64-linux-ld: arp_tables.c:(.text+0x1831bea): undefined reference to `xt_table_unlock'
x86_64-linux-ld: vmlinux.o: in function `do_arpt_get_ctl':
arp_tables.c:(.text+0x183205d): undefined reference to `xt_find_table_lock'
x86_64-linux-ld: arp_tables.c:(.text+0x18320c1): undefined reference to `xt_table_unlock'
x86_64-linux-ld: arp_tables.c:(.text+0x183219a): undefined reference to `xt_recseq'

Change these to depend on both NETFILTER_XTABLES and
NETFILTER_XTABLES_LEGACY.

Fixes: 9fce66583f ("netfilter: Exclude LEGACY TABLES on PREEMPT_RT.")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Florian Westphal <fw@strlen.de>
Tested-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-08-07 13:19:25 +02:00
Sabrina Dubroca
1118aaa3b3 udp: also consider secpath when evaluating ipsec use for checksumming
Commit b40c5f4fde ("udp: disable inner UDP checksum offloads in
IPsec case") tried to fix checksumming in UFO when the packets are
going through IPsec, so that we can't rely on offloads because the UDP
header and payload will be encrypted.

But when doing a TCP test over VXLAN going through IPsec transport
mode with GSO enabled (esp4_offload module loaded), I'm seeing broken
UDP checksums on the encap after successful decryption.

The skbs get to udp4_ufo_fragment/__skb_udp_tunnel_segment via
__dev_queue_xmit -> validate_xmit_skb -> skb_gso_segment and at this
point we've already dropped the dst (unless the device sets
IFF_XMIT_DST_RELEASE, which is not common), so need_ipsec is false and
we proceed with checksum offload.

Make need_ipsec also check the secpath, which is not dropped on this
callpath.

Fixes: b40c5f4fde ("udp: disable inner UDP checksum offloads in IPsec case")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2025-08-07 08:07:15 +02:00
Sabrina Dubroca
65f079a6c4 xfrm: bring back device check in validate_xmit_xfrm
This is partial revert of commit d53dda291b.

This change causes traffic using GSO with SW crypto running through a
NIC capable of HW offload to no longer get segmented during
validate_xmit_xfrm, and is unrelated to the bonding use case mentioned
in the commit.

Fixes: d53dda291b ("xfrm: Remove unneeded device check from validate_xmit_xfrm")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2025-08-07 08:07:01 +02:00
Sabrina Dubroca
234d1eff5d xfrm: restore GSO for SW crypto
Commit 49431af6c4 incorrectly assumes that the GSO path is only used
by HW offload, but it's also useful for SW crypto.

This patch re-enables GSO for SW crypto. It's not an exact revert to
preserve the other changes made to xfrm_dev_offload_ok afterwards, but
it reverts all of its effects.

Fixes: 49431af6c4 ("xfrm: rely on XFRM offload")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2025-08-07 08:05:10 +02:00
Olga Kornievskaia
bee47cb026 sunrpc: fix handling of server side tls alerts
Scott Mayhew discovered a security exploit in NFS over TLS in
tls_alert_recv() due to its assumption it can read data from
the msg iterator's kvec..

kTLS implementation splits TLS non-data record payload between
the control message buffer (which includes the type such as TLS
aler or TLS cipher change) and the rest of the payload (say TLS
alert's level/description) which goes into the msg payload buffer.

This patch proposes to rework how control messages are setup and
used by sock_recvmsg().

If no control message structure is setup, kTLS layer will read and
process TLS data record types. As soon as it encounters a TLS control
message, it would return an error. At that point, NFS can setup a
kvec backed msg buffer and read in the control message such as a
TLS alert. Msg iterator can advance the kvec pointer as a part of
the copy process thus we need to revert the iterator before calling
into the tls_alert_recv.

Reported-by: Scott Mayhew <smayhew@redhat.com>
Fixes: 5e052dda12 ("SUNRPC: Recognize control messages in server-side TCP socket code")
Suggested-by: Trond Myklebust <trondmy@hammerspace.com>
Cc: stable@vger.kernel.org
Signed-off-by: Olga Kornievskaia <okorniev@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-08-06 09:57:50 -04:00
Sabrina Dubroca
42e42562c9 xfrm: flush all states in xfrm_state_fini
While reverting commit f75a2804da ("xfrm: destroy xfrm_state
synchronously on net exit path"), I incorrectly changed
xfrm_state_flush's "proto" argument back to IPSEC_PROTO_ANY. This
reverts some of the changes in commit dbb2483b2a ("xfrm: clean up
xfrm protocol checks"), and leads to some states not being removed
when we exit the netns.

Pass 0 instead of IPSEC_PROTO_ANY from both xfrm_state_fini
xfrm6_tunnel_net_exit, so that xfrm_state_flush deletes all states.

Fixes: 2a198bbec6 ("Revert "xfrm: destroy xfrm_state synchronously on net exit path"")
Reported-by: syzbot+6641a61fe0e2e89ae8c5@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=6641a61fe0e2e89ae8c5
Tested-by: syzbot+6641a61fe0e2e89ae8c5@syzkaller.appspotmail.com
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2025-08-06 09:23:38 +02:00
Samiullah Khawaja
e6d7626881 net: Update threaded state in napi config in netif_set_threaded
Commit 2677010e77 ("Add support to set NAPI threaded for individual
NAPI") added support to enable/disable threaded napi using netlink. This
also extended the napi config save/restore functionality to set the napi
threaded state. This breaks netdev reset for drivers that use napi
threaded at device level and also use napi config save/restore on
napi_disable/napi_enable. Basically on netdev with napi threaded enabled
at device level, a napi_enable call will get stuck trying to stop the
napi kthread. This is because the napi->config->threaded is set to
disabled when threaded is enabled at device level.

The issue can be reproduced on virtio-net device using qemu. To
reproduce the issue run following,

  echo 1 > /sys/class/net/threaded
  ethtool -L eth0 combined 1

Update the threaded state in napi config in netif_set_threaded and add a
new test that verifies this scenario.

Tested on qemu with virtio-net:
 NETIF=eth0 ./tools/testing/selftests/drivers/net/napi_threaded.py
 TAP version 13
 1..2
 ok 1 napi_threaded.change_num_queues
 ok 2 napi_threaded.enable_dev_threaded_disable_napi_threaded
 # Totals: pass:2 fail:0 xfail:0 xpass:0 skip:0 error:0

Fixes: 2677010e77 ("Add support to set NAPI threaded for individual NAPI")
Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
Link: https://patch.msgid.link/20250804164457.2494390-1-skhawaja@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-05 17:46:15 -07:00
Maher Azzouzi
ffd2dc4c6c net/sched: mqprio: fix stack out-of-bounds write in tc entry parsing
TCA_MQPRIO_TC_ENTRY_INDEX is validated using
NLA_POLICY_MAX(NLA_U32, TC_QOPT_MAX_QUEUE), which allows the value
TC_QOPT_MAX_QUEUE (16). This leads to a 4-byte out-of-bounds stack
write in the fp[] array, which only has room for 16 elements (0–15).

Fix this by changing the policy to allow only up to TC_QOPT_MAX_QUEUE - 1.

Fixes: f62af20bed ("net/sched: mqprio: allow per-TC user input of FP adminStatus")
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Maher Azzouzi <maherazz04@gmail.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://patch.msgid.link/20250802001857.2702497-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-04 17:22:20 -07:00
Quang Le
01d3c8417b net/packet: fix a race in packet_set_ring() and packet_notifier()
When packet_set_ring() releases po->bind_lock, another thread can
run packet_notifier() and process an NETDEV_UP event.

This race and the fix are both similar to that of commit 15fe076ede
("net/packet: fix a race in packet_bind() and packet_notifier()").

There too the packet_notifier NETDEV_UP event managed to run while a
po->bind_lock critical section had to be temporarily released. And
the fix was similarly to temporarily set po->num to zero to keep
the socket unhooked until the lock is retaken.

The po->bind_lock in packet_set_ring and packet_notifier precede the
introduction of git history.

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Cc: stable@vger.kernel.org
Signed-off-by: Quang Le <quanglex97@gmail.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250801175423.2970334-1-willemdebruijn.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-04 17:21:27 -07:00
Jakub Kicinski
fa516c0d8b net: devmem: fix DMA direction on unmapping
Looks like we always unmap the DMA_BUF with DMA_FROM_DEVICE direction.
While at it unexport __net_devmem_dmabuf_binding_free(), it's internal.

Found by code inspection.

Fixes: bd61848900 ("net: devmem: Implement TX path")
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Link: https://patch.msgid.link/20250801011335.2267515-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-04 17:15:38 -07:00
Olga Kornievskaia
cc5d59081f sunrpc: fix client side handling of tls alerts
A security exploit was discovered in NFS over TLS in tls_alert_recv
due to its assumption that there is valid data in the msghdr's
iterator's kvec.

Instead, this patch proposes the rework how control messages are
setup and used by sock_recvmsg().

If no control message structure is setup, kTLS layer will read and
process TLS data record types. As soon as it encounters a TLS control
message, it would return an error. At that point, NFS can setup a kvec
backed control buffer and read in the control message such as a TLS
alert. Scott found that a msg iterator can advance the kvec pointer
as a part of the copy process thus we need to revert the iterator
before calling into the tls_alert_recv.

Fixes: dea034b963 ("SUNRPC: Capture CMSG metadata on client-side receive")
Suggested-by: Trond Myklebust <trondmy@hammerspace.com>
Suggested-by: Scott Mayhew <smayhew@redhat.com>
Signed-off-by: Olga Kornievskaia <okorniev@redhat.com>
Link: https://lore.kernel.org/r/20250731180058.4669-3-okorniev@redhat.com
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2025-08-03 12:45:47 -07:00
Linus Torvalds
a6923c06a3 Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Pull bpf fixes from Alexei Starovoitov:

 - Fix kCFI failures in JITed BPF code on arm64 (Sami Tolvanen, Puranjay
   Mohan, Mark Rutland, Maxwell Bland)

 - Disallow tail calls between BPF programs that use different cgroup
   local storage maps to prevent out-of-bounds access (Daniel Borkmann)

 - Fix unaligned access in flow_dissector and netfilter BPF programs
   (Paul Chaignon)

 - Avoid possible use of uninitialized mod_len in libbpf (Achill
   Gilgenast)

* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  selftests/bpf: Test for unaligned flow_dissector ctx access
  bpf: Improve ctx access verifier error message
  bpf: Check netfilter ctx accesses are aligned
  bpf: Check flow_dissector ctx accesses are aligned
  arm64/cfi,bpf: Support kCFI + BPF on arm64
  cfi: Move BPF CFI types and helpers to generic code
  cfi: add C CFI type macro
  libbpf: Avoid possible use of uninitialized mod_len
  bpf: Fix oob access in cgroup local storage
  bpf: Move cgroup iterator helpers to bpf.h
  bpf: Move bpf map owner out of common struct
  bpf: Add cookie object to bpf maps
2025-08-01 17:13:26 -07:00
Sharath Chandra Vurukala
1dbf1d590d net: Add locking to protect skb->dev access in ip_output
In ip_output() skb->dev is updated from the skb_dst(skb)->dev
this can become invalid when the interface is unregistered and freed,

Introduced new skb_dst_dev_rcu() function to be used instead of
skb_dst_dev() within rcu_locks in ip_output.This will ensure that
all the skb's associated with the dev being deregistered will
be transnmitted out first, before freeing the dev.

Given that ip_output() is called within an rcu_read_lock()
critical section or from a bottom-half context, it is safe to introduce
an RCU read-side critical section within it.

Multiple panic call stacks were observed when UL traffic was run
in concurrency with device deregistration from different functions,
pasting one sample for reference.

[496733.627565][T13385] Call trace:
[496733.627570][T13385] bpf_prog_ce7c9180c3b128ea_cgroupskb_egres+0x24c/0x7f0
[496733.627581][T13385] __cgroup_bpf_run_filter_skb+0x128/0x498
[496733.627595][T13385] ip_finish_output+0xa4/0xf4
[496733.627605][T13385] ip_output+0x100/0x1a0
[496733.627613][T13385] ip_send_skb+0x68/0x100
[496733.627618][T13385] udp_send_skb+0x1c4/0x384
[496733.627625][T13385] udp_sendmsg+0x7b0/0x898
[496733.627631][T13385] inet_sendmsg+0x5c/0x7c
[496733.627639][T13385] __sys_sendto+0x174/0x1e4
[496733.627647][T13385] __arm64_sys_sendto+0x28/0x3c
[496733.627653][T13385] invoke_syscall+0x58/0x11c
[496733.627662][T13385] el0_svc_common+0x88/0xf4
[496733.627669][T13385] do_el0_svc+0x2c/0xb0
[496733.627676][T13385] el0_svc+0x2c/0xa4
[496733.627683][T13385] el0t_64_sync_handler+0x68/0xb4
[496733.627689][T13385] el0t_64_sync+0x1a4/0x1a8

Changes in v3:
- Replaced WARN_ON() with  WARN_ON_ONCE(), as suggested by Willem de Bruijn.
- Dropped legacy lines mistakenly pulled in from an outdated branch.

Changes in v2:
- Addressed review comments from Eric Dumazet
- Used READ_ONCE() to prevent potential load/store tearing
- Added skb_dst_dev_rcu() and used along with rcu_read_lock() in ip_output

Signed-off-by: Sharath Chandra Vurukala <quic_sharathv@quicinc.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250730105118.GA26100@hu-sharathv-hyd.qualcomm.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-01 15:17:52 -07:00
Takamitsu Iwai
ae8508b25d net/sched: taprio: enforce minimum value for picos_per_byte
Syzbot reported a WARNING in taprio_get_start_time().

When link speed is 470,589 or greater, q->picos_per_byte becomes too
small, causing length_to_duration(q, ETH_ZLEN) to return zero.

This zero value leads to validation failures in fill_sched_entry() and
parse_taprio_schedule(), allowing arbitrary values to be assigned to
entry->interval and cycle_time. As a result, sched->cycle can become zero.

Since SPEED_800000 is the largest defined speed in
include/uapi/linux/ethtool.h, this issue can occur in realistic scenarios.

To ensure length_to_duration() returns a non-zero value for minimum-sized
Ethernet frames (ETH_ZLEN = 60), picos_per_byte must be at least 17
(60 * 17 > PSEC_PER_NSEC which is 1000).

This patch enforces a minimum value of 17 for picos_per_byte when the
calculated value would be lower, and adds a warning message to inform
users that scheduling accuracy may be affected at very high link speeds.

Fixes: fb66df20a7 ("net/sched: taprio: extend minimum interval restriction to entire cycle too")
Reported-by: syzbot+398e1ee4ca2cac05fddb@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=398e1ee4ca2cac05fddb
Signed-off-by: Takamitsu Iwai <takamitz@amazon.co.jp>
Link: https://patch.msgid.link/20250728173149.45585-1-takamitz@amazon.co.jp
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-01 15:15:28 -07:00
Eric Dumazet
d45cf1e7d7 ipv6: reject malicious packets in ipv6_gso_segment()
syzbot was able to craft a packet with very long IPv6 extension headers
leading to an overflow of skb->transport_header.

This 16bit field has a limited range.

Add skb_reset_transport_header_careful() helper and use it
from ipv6_gso_segment()

WARNING: CPU: 0 PID: 5871 at ./include/linux/skbuff.h:3032 skb_reset_transport_header include/linux/skbuff.h:3032 [inline]
WARNING: CPU: 0 PID: 5871 at ./include/linux/skbuff.h:3032 ipv6_gso_segment+0x15e2/0x21e0 net/ipv6/ip6_offload.c:151
Modules linked in:
CPU: 0 UID: 0 PID: 5871 Comm: syz-executor211 Not tainted 6.16.0-rc6-syzkaller-g7abc678e3084 #0 PREEMPT(full)
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 07/12/2025
 RIP: 0010:skb_reset_transport_header include/linux/skbuff.h:3032 [inline]
 RIP: 0010:ipv6_gso_segment+0x15e2/0x21e0 net/ipv6/ip6_offload.c:151
Call Trace:
 <TASK>
  skb_mac_gso_segment+0x31c/0x640 net/core/gso.c:53
  nsh_gso_segment+0x54a/0xe10 net/nsh/nsh.c:110
  skb_mac_gso_segment+0x31c/0x640 net/core/gso.c:53
  __skb_gso_segment+0x342/0x510 net/core/gso.c:124
  skb_gso_segment include/net/gso.h:83 [inline]
  validate_xmit_skb+0x857/0x11b0 net/core/dev.c:3950
  validate_xmit_skb_list+0x84/0x120 net/core/dev.c:4000
  sch_direct_xmit+0xd3/0x4b0 net/sched/sch_generic.c:329
  __dev_xmit_skb net/core/dev.c:4102 [inline]
  __dev_queue_xmit+0x17b6/0x3a70 net/core/dev.c:4679

Fixes: d1da932ed4 ("ipv6: Separate ipv6 offload support")
Reported-by: syzbot+af43e647fd835acc02df@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/netdev/688a1a05.050a0220.5d226.0008.GAE@google.com/T/#u
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Dawid Osuchowski <dawid.osuchowski@linux.intel.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250730131738.3385939-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-01 14:40:53 -07:00
Linus Torvalds
821c9e515d Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost
Pull virtio updates from Michael Tsirkin:

 - vhost can now support legacy threading if enabled in Kconfig

 - vsock memory allocation strategies for large buffers have been
   improved, reducing pressure on kmalloc

 - vhost now supports the in-order feature. guest bits missed the merge
   window.

 - fixes, cleanups all over the place

* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (30 commits)
  vsock/virtio: Allocate nonlinear SKBs for handling large transmit buffers
  vsock/virtio: Rename virtio_vsock_skb_rx_put()
  vhost/vsock: Allocate nonlinear SKBs for handling large receive buffers
  vsock/virtio: Move SKB allocation lower-bound check to callers
  vsock/virtio: Rename virtio_vsock_alloc_skb()
  vsock/virtio: Resize receive buffers so that each SKB fits in a 4K page
  vsock/virtio: Move length check to callers of virtio_vsock_skb_rx_put()
  vsock/virtio: Validate length in packet header before skb_put()
  vhost/vsock: Avoid allocating arbitrarily-sized SKBs
  vhost_net: basic in_order support
  vhost: basic in order support
  vhost: fail early when __vhost_add_used() fails
  vhost: Reintroduce kthread API and add mode selection
  vdpa: Fix IDR memory leak in VDUSE module exit
  vdpa/mlx5: Fix release of uninitialized resources on error path
  vhost-scsi: Fix check for inline_sg_cnt exceeding preallocated limit
  virtio: virtio_dma_buf: fix missing parameter documentation
  vhost: Fix typos
  vhost: vringh: Remove unused functions
  vhost: vringh: Remove unused iotlb functions
  ...
2025-08-01 14:17:48 -07:00
Paul Chaignon
9e6448f7b1 bpf: Check netfilter ctx accesses are aligned
Similarly to the previous patch fixing the flow_dissector ctx accesses,
nf_is_valid_access also doesn't check that ctx accesses are aligned.
Contrary to flow_dissector programs, netfilter programs don't have
context conversion. The unaligned ctx accesses are therefore allowed by
the verifier.

Fixes: fd9c663b9a ("bpf: minimal support for programs hooked into netfilter framework")
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/853ae9ed5edaa5196e8472ff0f1bb1cc24059214.1754039605.git.paul.chaignon@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-08-01 09:22:44 -07:00
Paul Chaignon
ead3d7b2b6 bpf: Check flow_dissector ctx accesses are aligned
flow_dissector_is_valid_access doesn't check that the context access is
aligned. As a consequence, an unaligned access within one of the exposed
field is considered valid and later rejected by
flow_dissector_convert_ctx_access when we try to convert it.

The later rejection is problematic because it's reported as a verifier
bug with a kernel warning and doesn't point to the right instruction in
verifier logs.

Fixes: d58e468b11 ("flow_dissector: implements flow dissector BPF hook")
Reported-by: syzbot+ccac90e482b2a81d74aa@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=ccac90e482b2a81d74aa
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/cc1b036be484c99be45eddf48bd78cc6f72839b1.1754039605.git.paul.chaignon@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-08-01 09:22:44 -07:00
Will Deacon
6693731487 vsock/virtio: Allocate nonlinear SKBs for handling large transmit buffers
When transmitting a vsock packet, virtio_transport_send_pkt_info() calls
virtio_transport_alloc_linear_skb() to allocate and fill SKBs with the
transmit data. Unfortunately, these are always linear allocations and
can therefore result in significant pressure on kmalloc() considering
that the maximum packet size (VIRTIO_VSOCK_MAX_PKT_BUF_SIZE +
VIRTIO_VSOCK_SKB_HEADROOM) is a little over 64KiB, resulting in a 128KiB
allocation for each packet.

Rework the vsock SKB allocation so that, for sizes with page order
greater than PAGE_ALLOC_COSTLY_ORDER, a nonlinear SKB is allocated
instead with the packet header in the SKB and the transmit data in the
fragments. Note that this affects both the vhost and virtio transports.

Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Will Deacon <will@kernel.org>
Message-Id: <20250717090116.11987-10-will@kernel.org>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2025-08-01 09:11:09 -04:00
Will Deacon
8ca76151d2 vsock/virtio: Rename virtio_vsock_skb_rx_put()
In preparation for using virtio_vsock_skb_rx_put() when populating SKBs
on the vsock TX path, rename virtio_vsock_skb_rx_put() to
virtio_vsock_skb_put().

No functional change.

Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Will Deacon <will@kernel.org>
Message-Id: <20250717090116.11987-9-will@kernel.org>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2025-08-01 09:11:09 -04:00
Will Deacon
2304c64a28 vsock/virtio: Rename virtio_vsock_alloc_skb()
In preparation for nonlinear allocations for large SKBs, rename
virtio_vsock_alloc_skb() to virtio_vsock_alloc_linear_skb() to indicate
that it returns linear SKBs unconditionally and switch all callers over
to this new interface for now.

No functional change.

Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Will Deacon <will@kernel.org>
Message-Id: <20250717090116.11987-6-will@kernel.org>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2025-08-01 09:11:09 -04:00