We have an application that uses almost the same code for TCP and
AF_UNIX (SOCK_STREAM).
TCP can use TCP_INQ, but AF_UNIX doesn't have it and requires an
extra syscall, ioctl(SIOCINQ) or getsockopt(SO_MEMINFO) as an
alternative.
Let's introduce the generic version of TCP_INQ.
If SO_INQ is enabled, recvmsg() will put a cmsg of SCM_INQ that
contains the exact value of ioctl(SIOCINQ). The cmsg is also
included when msg->msg_get_inq is non-zero to make sockets
io_uring-friendly.
Note that SOCK_CUSTOM_SOCKOPT is flagged only for SOCK_STREAM to
override setsockopt() for SOL_SOCKET.
By having the flag in struct unix_sock, instead of struct sock, we
can later add SO_INQ support for TCP and reuse tcp_sk(sk)->recvmsg_inq.
Note also that supporting custom getsockopt() for SOL_SOCKET will need
preparation for other SOCK_CUSTOM_SOCKOPT users (UDP, vsock, MPTCP).
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250702223606.1054680-7-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Compared to TCP, ioctl(SIOCINQ) for AF_UNIX SOCK_STREAM socket is more
expensive, as unix_inq_len() requires iterating through the receive queue
and accumulating skb->len.
Let's cache the value for SOCK_STREAM to a new field during sendmsg()
and recvmsg().
The field is protected by the receive queue lock.
Note that ioctl(SIOCINQ) for SOCK_DGRAM returns the length of the first
skb in the queue.
SOCK_SEQPACKET still requires iterating through the queue because we do
not touch functions shared with unix_dgram_ops. But, if really needed,
we can support it by switching __skb_try_recv_datagram() to a custom
version.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250702223606.1054680-5-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
unix_stream_read_skb() calls skb_recv_datagram() with MSG_DONTWAIT,
which is mostly equivalent to sock_error(sk) + skb_dequeue().
In the following patch, we will add a new field to cache the number
of bytes in the receive queue. Then, we want to avoid introducing
atomic ops in the fast path, so we will reuse the receive queue lock.
As a preparation for the change, let's not use skb_recv_datagram()
in unix_stream_read_skb().
Note that sock_error() is now moved out of the u->iolock mutex as
the mutex does not synchronise the peer's close() at all.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250702223606.1054680-4-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
unix_stream_read_skb() checks SOCK_DEAD only when the dequeued skb is
OOB skb.
unix_stream_read_skb() is called for a SOCK_STREAM socket in SOCKMAP
when data is sent to it.
The function is invoked via sk_psock_verdict_data_ready(), which is
set to sk->sk_data_ready().
During sendmsg(), we check if the receiver has SOCK_DEAD, so there
is no point in checking it again later in ->read_skb().
Also, unix_read_skb() for SOCK_DGRAM does not have the test either.
Let's remove the SOCK_DEAD test in unix_stream_read_skb().
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250702223606.1054680-3-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When __skb_try_recv_datagram() returns NULL in __unix_dgram_recvmsg(),
we hold unix_state_lock() unconditionally.
This is because SOCK_SEQPACKET sk needs to return EOF in case its peer
has been close()d concurrently.
This behaviour totally depends on the timing of the peer's close() and
reading sk->sk_shutdown, and taking the lock does not play a role.
Let's drop the lock from __unix_dgram_recvmsg() and use READ_ONCE().
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250702223606.1054680-2-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Lee Trager says:
====================
eth: fbnic: Add firmware logging support
Firmware running on fbnic generates device logs. These logs contain useful
information about the device which may or may not be related to the host.
Logs are stored in a ring buffer and accessible through DebugFS.
====================
Link: https://patch.msgid.link/20250702192207.697368-1-lee@trager.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The firmware log buffer is enabled during probe and freed during remove.
Early versions of firmware do not support sending logs. Once the mailbox is
up driver will enable logging when supported firmware versions are detected.
Logging is disabled before the mailbox is freed.
Signed-off-by: Lee Trager <lee@trager.us>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20250702192207.697368-6-lee@trager.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
By default firmware will not send logs to the host. This must be explicitly
enabled by the driver. The mailbox has the concept of a flag which is a u32
used as a boolean. Lack of flag defaults to a value of false. When enabling
logging historical logs may be optionally requested. These are log messages
generated by the NIC before the driver was loaded. The driver also sends a
log version to support changing the logging format in the future.
[SEND_LOGS_REQ] = {
[SEND_LOGS] /* flag to request log reporting */
[SEND_LOGS_HISTORY] /* flag to request historical logs */
[SEND_LOGS_VERSION] /* u32 indicating the log format version */
}
Logs may be sent to the user either one at a time, or when historical logs
are requested in bulk. Firmware may not send more than 14 messages in bulk
to prevent flooding the mailbox.
[LOG_MSG] = {
[LOG_INDEX] /* entry 0 - u64 index of log */
[LOG_MSEC] /* entry 0 - u32 timestamp of log */
[LOG_MSG] /* entry 0 - char log message up to 256 */
[LOG_LENGTH] /* u32 of remaining log items in arrays */
[LOG_INDEX_ARRAY] = {
[LOG_INDEX] /* entry 1 - u64 index of log */
[LOG_INDEX] /* entry 2 - u64 index of log */
...
}
[LOG_MSEC_ARRAY] = {
[LOG_MSEC] /* entry 1 - u32 timestamp of log */
[LOG_MSEC] /* entry 2 - u32 timestamp of log */
...
}
[LOG_MSG_ARRAY] = {
[LOG_MSG] /* entry 1 - char log message up to 256 */
[LOG_MSG] /* entry 2 - char log message up to 256 */
...
}
}
Signed-off-by: Lee Trager <lee@trager.us>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20250702192207.697368-5-lee@trager.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The full minimum version is 0.10.6-0. The six is now correctly defined as
patch and shifted appropriately. 0.10.6-0 is a preproduction version of
firmware which was released over a year and a half ago. All production
devices meet this requirement.
Signed-off-by: Lee Trager <lee@trager.us>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20250702192207.697368-2-lee@trager.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Tariq Toukan says:
====================
mlx5-next updates 2025-07-08
The following pull-request contains common mlx5 updates
for your *net-next* tree.
v2: https://lore.kernel.org/1751574385-24672-1-git-send-email-tariqt@nvidia.com
* 'mlx5-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux:
net/mlx5: Check device memory pointer before usage
net/mlx5: fs, fix RDMA TRANSPORT init cleanup flow
net/mlx5: Add IFC bits for PCIe Congestion Event object
net/mlx5: Small refactor for general object capabilities
net/mlx5: fs, add multiple prios to RDMA TRANSPORT steering domain
====================
Link: https://patch.msgid.link/1752002102-11316-1-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Convert mlx5 to dedicated RXFH ops. This is a fairly shallow
conversion, TBH, most of the driver code stays as is, but we
let the core allocate the context ID for the driver.
mlx5e_rx_res_rss_get_rxfh() and friends are made void, since
core only calls the driver for context 0. The second call
is right after context creation so it must exist (tm).
Tested with drivers/net/hw/rss_ctx.py on MCX6.
Reviewed-by: Gal Pressman <gal@nvidia.com>
Link: https://patch.msgid.link/20250707184115.2285277-4-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
ICE appears to have some odd form of rss_context use plumbed
in for .get_rxfh. The .set_rxfh side does not support creating
contexts, however, so this must be dead code. For at least a year
now (since commit 7964e78846 ("net: ethtool: use the tracking
array for get_rxfh on custom RSS contexts")) we have not been
calling .get_rxfh with a non-zero rss_context. We just get
the info from the RSS XArray under dev->ethtool.
Remove what must be dead code in the driver, clear the support flags.
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250707184115.2285277-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
otx2 only supports additional indirection tables (no separate keys
etc.) so the conversion to dedicated callbacks and core-allocated
context is mostly removing the code which stores the extra tables
in the driver. Core already stores the indirection tables for
additional contexts, and doesn't call .get for them.
One subtle change here is that we'll now start with the table
covering all queues, not directing all traffic to queue 0.
This is what core expects if the user doesn't pass the initial
indir table explicitly (there's a WARN_ON() in the core trying
to make sure driver authors don't forget to populate ctx to
defaults).
Drivers implementing .create_rxfh_context don't have to set
cap_rss_ctx_supported, so remove it.
Tested-by: Geetha Sowjanya <gakula@marvell.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250707184115.2285277-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently, the driver uses phylib to operate PHY by default.
On some boards, the PHY device is separated from the MAC device.
As a result, the hibmcge driver cannot operate the PHY device.
In this patch, the driver determines whether a PHY is available
based on register configuration. If no PHY is available,
the driver will use fixed_phy to register fake phydev.
Signed-off-by: Jijie Shao <shaojijie@huawei.com>
Reviewed-by: Larysa Zaremba <larysa.zaremba@intel.com>
Link: https://patch.msgid.link/20250702125716.2875169-2-shaojijie@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Since its introduction in commit 6fa01ccd88 ("skbuff: Add pskb_extract()
helper function"), pskb_carve_frag_list() never used the argument @skb.
Drop it and adapt the only caller.
No functional change intended.
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Michal Luczaj <mhal@rbox.co>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Since its introduction in commit ce098da149 ("skbuff: Introduce
slab_build_skb()"), __slab_build_skb() never used the @skb argument. Remove
it and adapt both callers.
No functional change intended.
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Michal Luczaj <mhal@rbox.co>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Use the newly added of_reserved_mem_region_to_resource{_byname}()
functions to handle "memory-region" properties.
The error handling is a bit different for mtk_wed_mcu_load_firmware().
A failed match of the "memory-region-names" would skip the entry, but
then other errors in the lookup and retrieval of the address would not
skip the entry. However, that distinction is not really important.
Either the region is available and usable or it is not. So now, errors
from of_reserved_mem_region_to_resource() are ignored so the region is
simply skipped.
Signed-off-by: Rob Herring (Arm) <robh@kernel.org>
Link: https://patch.msgid.link/20250703183459.2074381-1-robh@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
tc_action_net_exit() got an rtnl exclusion in commit
a159d3c4b8 ("net_sched: acquire RTNL in tc_action_net_exit()")
Since then, commit 16af606739 ("net: sched: implement reference
counted action release") made this RTNL exclusion obsolete for
most cases.
Only tcf_action_offload_del() might still require it.
Move the rtnl locking into tcf_idrinfo_destroy() when
an offload action is found.
Most netns do not have actions, yet deleting them is adding a lot
of pressure on RTNL, which is for many the most contended mutex
in the kernel.
We are moving to a per-netns 'rtnl', so tc_action_net_exit()
will not be able to grab 'rtnl' a single time for a batch of netns.
Before the patch:
perf probe -a rtnl_lock
perf record -e probe:rtnl_lock -a /bin/bash -c 'unshare -n "/bin/true"; sleep 1'
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.305 MB perf.data (25 samples) ]
After the patch:
perf record -e probe:rtnl_lock -a /bin/bash -c 'unshare -n "/bin/true"; sleep 1'
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.304 MB perf.data (9 samples) ]
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Vlad Buslov <vladbu@nvidia.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Link: https://patch.msgid.link/20250702071230.1892674-1-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Jeremy Kerr says:
====================
net: mctp: Add support for gateway routing
This series adds a gateway route type for the MCTP core, allowing
non-local EIDs as the match for a route.
Example setup using the mctp tools:
mctp route add 9 via mctpi2c0
mctp neigh add 9 dev mctpi2c0 lladdr 0x1d
mctp route add 10 gw 9
- will route packets to eid 10 through mctpi2c0, using a dest lladdr
of 0x1d (ie, that of the directly-attached eid 9).
The core change to support this is the introduction of a struct
mctp_dst, which represents the result of a route lookup. Since this
involves a bit of surgery through the routing code, we add a few tests
along the way.
We're introducing an ABI change in the new RTM_{NEW,GET,DEL}ROUTE
netlink formats, with the support for a RTA_GATEWAY attribute. Because
we need a network ID specified to fully-qualify a gateway EID, the
RTA_GATEWAY attribute carries the (net, eid) tuple in full:
struct mctp_fq_addr {
unsigned int net;
mctp_eid_t eid;
}
Of course, any questions, comments etc are most welcome.
Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
====================
Link: https://patch.msgid.link/20250702-dev-forwarding-v5-0-1468191da8a4@codeconstruct.com.au
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
This change allows for gateway routing, where a route table entry
may reference a routable endpoint (by network and EID), instead of
routing directly to a netdevice.
We add support for a RTM_GATEWAY attribute for netlink route updates,
with an attribute format of:
struct mctp_fq_addr {
unsigned int net;
mctp_eid_t eid;
}
- we need the net here to uniquely identify the target EID, as we no
longer have the device reference directly (which would provide the net
id in the case of direct routes).
This makes route lookups recursive, as a route lookup that returns a
gateway route must be resolved into a direct route (ie, to a device)
eventually. We provide a limit to the route lookups, to prevent infinite
loop routing.
The route lookup populates a new 'nexthop' field in the dst structure,
which now specifies the key for the neighbour table lookup on device
output, rather than using the packet destination address directly.
Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Link: https://patch.msgid.link/20250702-dev-forwarding-v5-13-1468191da8a4@codeconstruct.com.au
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The netlink route parsing functions end up setting a bunch of output
variables from the rt attributes. This will get messy when the routes
become more complex.
So, split the rt parsing into two types: a lookup (returning route
target data suitable for a route lookup, like when deleting a route) and
a populate (setting fields of a struct mctp_route).
In doing this, we need to separate the route allocation from
mctp_route_add, so add some comments on the lifetime semantics for the
latter.
Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Link: https://patch.msgid.link/20250702-dev-forwarding-v5-12-1468191da8a4@codeconstruct.com.au
Signed-off-by: Paolo Abeni <pabeni@redhat.com>