Sashiko reports:
> mctp_dst_from_route() increments the device reference count by calling
> mctp_dev_hold(). When a valid route is found and dst is NULL, the
> structure copy is bypassed and rc is set to 0.
Instead of optimistically creating a dst from the final route (then
releasing it if the saddr is invalid), perform the saddr check first.
This means we don't have an unuecessary hold/release on the dev, which
could leak if the dst pointer is NULL. No caller passes a NULL dst at
present though (so the leak is not possible), but this is an intended
use of mctp_dst_from_route().
Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260403-dev-mctp-dst-defer-v1-1-9c2c55faf9e9@codeconstruct.com.au
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently, if we receive a physically-addressed packet for the local
stack, we perform a route_lookup_null to find a matching device-based
route. If a route is present, it will always have the ->output fn set to
mctp_dst_input, which provides our delivery mechanism.
However, if we don't yet have any local addresses assigned, we won't
have any local routes to lookup, so this will fail. One of the use-cases
for physical addressing is initial address assignment on endpoint nodes,
where we would have no addresses, and therefore no local routes.
Instead of iterating routes (looking for one matching the dev), just
create a suitable mctp_dst for the device directly.
Add a testcase for the no-route case too.
Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Link: https://patch.msgid.link/20260331-dev-mctp-null-eids-v1-3-b4d047372eaf@codeconstruct.com.au
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Rather than querying the output device for its address in
mctp_local_output, set up the source address when we're populating the
dst structure. If no address is assigned, use MCTP_ADDR_NULL.
This will allow us more flexibility when routing for NULL-source-eid
cases. For now though, we still reject a NULL source address in the
output path.
We need to update the tests a little, so that addresses are assigned
before we do the dst lookups.
Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Link: https://patch.msgid.link/20260331-dev-mctp-null-eids-v1-1-b4d047372eaf@codeconstruct.com.au
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Currently, we incorrectly send messages that are within 4 bytes (a
struct mctp_hdr) smaller than the MTU through mctp_do_fragment_route().
This has no effect on the actual fragmentation, as we will still send as
one packet, but unnecessarily copies the original skb into a new
single-fragment skb.
Instead of having the MTU comparisons in both mctp_local_output() and
mctp_do_fragment_route(), feed all local messages through the latter,
and add the single-packet optimisation there.
This means we can coalesce the routing path of mctp_local_output, so our
out_release path is now solely for errors, so rename the label
accordingly.
Include a check in the route tests for the single-packet case too.
Reported-by: yuanzhaoming <yuanzm2@lenovo.com>
Closes: 269936db5e (r175836230)
Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260324-dev-mtu-copy-v1-1-7af6bd7027d3@codeconstruct.com.au
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
mctp_flow_prepare_output() checks key->dev and may call
mctp_dev_set_key(), but it does not hold key->lock while doing so.
mctp_dev_set_key() and mctp_dev_release_key() are annotated with
__must_hold(&key->lock), so key->dev access is intended to be
serialized by key->lock. The mctp_sendmsg() transmit path reaches
mctp_flow_prepare_output() via mctp_local_output() -> mctp_dst_output()
without holding key->lock, so the check-and-set sequence is racy.
Example interleaving:
CPU0 CPU1
---- ----
mctp_flow_prepare_output(key, devA)
if (!key->dev) // sees NULL
mctp_flow_prepare_output(
key, devB)
if (!key->dev) // still NULL
mctp_dev_set_key(devB, key)
mctp_dev_hold(devB)
key->dev = devB
mctp_dev_set_key(devA, key)
mctp_dev_hold(devA)
key->dev = devA // overwrites devB
Now both devA and devB references were acquired, but only the final
key->dev value is tracked for release. One reference can be lost,
causing a resource leak as mctp_dev_release_key() would only decrease
the reference on one dev.
Fix by taking key->lock around the key->dev check and
mctp_dev_set_key() call.
Fixes: 67737c4572 ("mctp: Pass flow data & flow release events to drivers")
Signed-off-by: Chengfeng Ye <dg573847474@gmail.com>
Link: https://patch.msgid.link/20260306031402.857224-1-dg573847474@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
This was done entirely with mindless brute force, using
git grep -l '\<k[vmz]*alloc_objs*(.*, GFP_KERNEL)' |
xargs sed -i 's/\(alloc_objs*(.*\), GFP_KERNEL)/\1)/'
to convert the new alloc_obj() users that had a simple GFP_KERNEL
argument to just drop that argument.
Note that due to the extreme simplicity of the scripting, any slightly
more complex cases spread over multiple lines would not be triggered:
they definitely exist, but this covers the vast bulk of the cases, and
the resulting diff is also then easier to check automatically.
For the same reason the 'flex' versions will be done as a separate
conversion.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is the result of running the Coccinelle script from
scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to
avoid scalar types (which need careful case-by-case checking), and
instead replace kmalloc-family calls that allocate struct or union
object instances:
Single allocations: kmalloc(sizeof(TYPE), ...)
are replaced with: kmalloc_obj(TYPE, ...)
Array allocations: kmalloc_array(COUNT, sizeof(TYPE), ...)
are replaced with: kmalloc_objs(TYPE, COUNT, ...)
Flex array allocations: kmalloc(struct_size(PTR, FAM, COUNT), ...)
are replaced with: kmalloc_flex(*PTR, FAM, COUNT, ...)
(where TYPE may also be *VAR)
The resulting allocations no longer return "void *", instead returning
"TYPE *".
Signed-off-by: Kees Cook <kees@kernel.org>
Syed Faraz Abrar (@farazsth98) from Zellic, and Pumpkin (@u1f383) from
DEVCORE Research Team working with Trend Micro Zero Day Initiative
report that a RTM_GETNEIGH will return uninitalised data in the pad
bytes of the ndmsg data.
Ensure we're initialising the netlink data to zero, in the link, addr
and neigh response messages.
Fixes: 831119f887 ("mctp: Add neighbour netlink interface")
Fixes: 06d2f4c583 ("mctp: Add netlink route management")
Fixes: 583be982d9 ("mctp: Add device handling and netlink interface")
Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260209-dev-mctp-nlmsg-v1-1-f1e30c346a43@codeconstruct.com.au
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
To capture TX packets during a test, we are currently intercepting the
dst->output with an implementation that adds the transmitted packet to
a skb queue attached to the test-specific mock dst. The netdev itself is
not involved in the test TX path.
Instead, we can just use our test device to stash TXed packets for later
inspection by the test. This means we can include the actual
mctp_dst_output() implementation in the test (by setting dst.output in
the test case), and don't need to be creating fake dst objects, or their
corresponding skb queues.
We need to ensure that the netdev is up to allow delivery to
ndo_start_xmit, but the tests assume active devices at present anyway.
Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Link: https://patch.msgid.link/20251126-dev-mctp-test-tx-queue-v2-1-4e5bbd1d6c57@codeconstruct.com.au
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
On transmit, we are currently relying on skb->dev being set by
mctp_local_output() when we first set up the skb destination fields.
However, forwarded skbs do not use the local_output path, so will retain
their incoming netdev as their ->dev on tx. This does not work when
we're forwarding between interfaces.
Set skb->dev unconditionally in the transmit path, to allow for proper
forwarding.
We keep the skb->dev initialisation in mctp_local_output(), as we use it
for fragmentation.
Fixes: 269936db5e ("net: mctp: separate routing database from routing operations")
Suggested-by: Vince Chang <vince_chang@aspeedtech.com>
Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Link: https://patch.msgid.link/20251125-dev-forward-v1-1-54ecffcd0616@codeconstruct.com.au
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Update all struct proto_ops connect() callback function prototypes from
"struct sockaddr *" to "struct sockaddr_unsized *" to avoid lying to the
compiler about object sizes. Calls into struct proto handlers gain casts
that will be removed in the struct proto conversion patch.
No binary changes expected.
Signed-off-by: Kees Cook <kees@kernel.org>
Link: https://patch.msgid.link/20251104002617.2752303-3-kees@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Update all struct proto_ops bind() callback function prototypes from
"struct sockaddr *" to "struct sockaddr_unsized *" to avoid lying to the
compiler about object sizes. Calls into struct proto handlers gain casts
that will be removed in the struct proto conversion patch.
No binary changes expected.
Signed-off-by: Kees Cook <kees@kernel.org>
Link: https://patch.msgid.link/20251104002617.2752303-2-kees@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In mctp_getsockopt(), unrecognized options currently return -EINVAL.
In contrast, mctp_setsockopt() returns -ENOPROTOOPT for unknown
options.
Update mctp_getsockopt() to also return -ENOPROTOOPT for unknown
options. This aligns the behavior of getsockopt() and setsockopt(),
and matches the standard kernel socket API convention for handling
unsupported options.
Fixes: 99ce45d5e7 ("mctp: Implement extended addressing")
Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com>
Link: https://patch.msgid.link/20250902102059.1370008-1-alok.a.tiwari@oracle.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Prior to calling bind() a program may call connect() on a socket to
restrict to a remote peer address.
Using connect() is the normal mechanism to specify a remote network
peer, so we use that here. In MCTP connect() is only used for bound
sockets - send() is not available for MCTP since a tag must be provided
for each message.
The smctp_type must match between connect() and bind() calls.
Signed-off-by: Matt Johnston <matt@codeconstruct.com.au>
Link: https://patch.msgid.link/20250710-mctp-bind-v4-6-8ec2f6460c56@codeconstruct.com.au
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
When a specific EID is passed as a bind address, it only makes sense to
interpret with an actual network ID, so resolve that to the default
network at bind time.
For bind address of MCTP_ADDR_ANY, we want to be able to capture traffic
to any network and address, so keep the current behaviour of matching
traffic from any network interface (don't interpret MCTP_NET_ANY as
the default network ID).
Signed-off-by: Matt Johnston <matt@codeconstruct.com.au>
Link: https://patch.msgid.link/20250710-mctp-bind-v4-3-8ec2f6460c56@codeconstruct.com.au
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Disallow bind() calls that have the same arguments as existing bound
sockets. Previously multiple sockets could bind() to the same
type/local address, with an arbitrary socket receiving matched messages.
This is only a partial fix, a future commit will define precedence order
for MCTP_ADDR_ANY versus specific EID bind(), which are allowed to exist
together.
Signed-off-by: Matt Johnston <matt@codeconstruct.com.au>
Link: https://patch.msgid.link/20250710-mctp-bind-v4-2-8ec2f6460c56@codeconstruct.com.au
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
This change allows for gateway routing, where a route table entry
may reference a routable endpoint (by network and EID), instead of
routing directly to a netdevice.
We add support for a RTM_GATEWAY attribute for netlink route updates,
with an attribute format of:
struct mctp_fq_addr {
unsigned int net;
mctp_eid_t eid;
}
- we need the net here to uniquely identify the target EID, as we no
longer have the device reference directly (which would provide the net
id in the case of direct routes).
This makes route lookups recursive, as a route lookup that returns a
gateway route must be resolved into a direct route (ie, to a device)
eventually. We provide a limit to the route lookups, to prevent infinite
loop routing.
The route lookup populates a new 'nexthop' field in the dst structure,
which now specifies the key for the neighbour table lookup on device
output, rather than using the packet destination address directly.
Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Link: https://patch.msgid.link/20250702-dev-forwarding-v5-13-1468191da8a4@codeconstruct.com.au
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The netlink route parsing functions end up setting a bunch of output
variables from the rt attributes. This will get messy when the routes
become more complex.
So, split the rt parsing into two types: a lookup (returning route
target data suitable for a route lookup, like when deleting a route) and
a populate (setting fields of a struct mctp_route).
In doing this, we need to separate the route allocation from
mctp_route_add, so add some comments on the lifetime semantics for the
latter.
Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Link: https://patch.msgid.link/20250702-dev-forwarding-v5-12-1468191da8a4@codeconstruct.com.au
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
This change adds a struct mctp_dst, representing the result of a routing
lookup. This decouples the struct mctp_route from the actual
implementation of a routing operation.
This will allow for future routing changes which may require more
involved lookup logic, such as gateway routing - which may require
multiple traversals of the routing table.
Since we only use the struct mctp_route at lookup time, we no longer
hold routes over a routing operation, as we only need it to populate the
dst. However, we do hold the dev while the dst is active.
This requires some changes to the route test infrastructure, as we no
longer have a mock route to handle the route output operation, and
transient dsts are created by the routing code, so we can't override
them as easily.
Instead, we use kunit->priv to stash a packet queue, and a custom
dst_output function queues into that packet queue, which we can use for
later expectations.
Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Link: https://patch.msgid.link/20250702-dev-forwarding-v5-3-1468191da8a4@codeconstruct.com.au
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
In our input_cloned_frag test, we currently allocate our test buffers
arbitrarily-sized at 100 bytes.
We only expect to receive a max of 15 bytes from the socket, so reduce
to a more appropriate size. There are some upcoming changes to the
routing code which hit a frame-size limit on s390, so reduce the usage
before that lands.
Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Link: https://patch.msgid.link/20250702-dev-forwarding-v5-2-1468191da8a4@codeconstruct.com.au
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Jakub suggests:
> I have a different request :) Matt, once this ends up in net-next
> (end of this week) could you refactor it to use nlmsg_payload() ?
> It doesn't exist in net but this is exactly why it was added.
This refactors the additions to both mctp_dump_addrinfo(), and
mctp_rtm_getneigh() - two cases where we're calling nlh_data() on an
an incoming netlink message, without a prior nlmsg_parse().
For the neigh.c case, we cannot hit the failure where the nlh does not
contain a full ndmsg at present, as the core handler
(net/core/neighbour.c, neigh_get()) has already validated the size
through neigh_valid_req_get(), and would have failed the get operation
before the MCTP hander is called.
However, relying on that is a bit fragile, so apply the nlmsg_payload
refector here too.
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Link: https://patch.msgid.link/20250521-mctp-nlmsg-payload-v2-1-e85df160c405@codeconstruct.com.au
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
mctp_flow_prepare_output() is called in mctp_route_output(), which
places outbound packets onto a given interface. The packet may represent
a message fragment, in which case we provoke an unbalanced reference
count to the underlying device. This causes trouble if we ever attempt
to remove the interface:
[ 48.702195] usb 1-1: USB disconnect, device number 2
[ 58.883056] unregister_netdevice: waiting for mctpusb0 to become free. Usage count = 2
[ 69.022548] unregister_netdevice: waiting for mctpusb0 to become free. Usage count = 2
[ 79.172568] unregister_netdevice: waiting for mctpusb0 to become free. Usage count = 2
...
Predicate the invocation of mctp_dev_set_key() in
mctp_flow_prepare_output() on not already having associated the device
with the key. It's not yet realistic to uphold the property that the key
maintains only one device reference earlier in the transmission sequence
as the route (and therefore the device) may not be known at the time the
key is associated with the socket.
Fixes: 67737c4572 ("mctp: Pass flow data & flow release events to drivers")
Acked-by: Jeremy Kerr <jk@codeconstruct.com.au>
Signed-off-by: Andrew Jeffery <andrew@codeconstruct.com.au>
Link: https://patch.msgid.link/20250508-mctp-dev-refcount-v1-1-d4f965c67bb5@codeconstruct.com.au
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
timer_delete[_sync]() replaces del_timer[_sync](). Convert the whole tree
over and remove the historical wrapper inlines.
Conversion was done with coccinelle plus manual fixups where necessary.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Ensure that the frag_list used for reassembly isn't shared with other
packets. This avoids incorrect reassembly when packets are cloned, and
prevents a memory leak due to circular references between fragments and
their skb_shared_info.
The upcoming MCTP-over-USB driver uses skb_clone which can trigger the
problem - other MCTP drivers don't share SKBs.
A kunit test is added to reproduce the issue.
Signed-off-by: Matt Johnston <matt@codeconstruct.com.au>
Fixes: 4a992bbd36 ("mctp: Implement message fragmentation & reassembly")
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250306-matt-mctp-usb-v1-1-085502b3dd28@codeconstruct.com.au
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Currently, we don't use the return value from sock_queue_rcv_skb, which
means we may leak skbs if a message is not successfully queued to a
socket.
Instead, ensure that we're freeing the skb where the sock hasn't
otherwise taken ownership of the skb by adding checks on the
sock_queue_rcv_skb() to invoke a kfree on failure.
In doing so, rather than using the 'rc' value to trigger the
kfree_skb(), use the skb pointer itself, which is more explicit.
Also, add a kunit test for the sock delivery failure cases.
Fixes: 4a992bbd36 ("mctp: Implement message fragmentation & reassembly")
Cc: stable@vger.kernel.org
Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Link: https://patch.msgid.link/20241218-mctp-next-v2-1-1c1729645eaa@codeconstruct.com.au
Signed-off-by: Paolo Abeni <pabeni@redhat.com>