In preparations of leftover.c split to individual files, avoid need to
have object structures exposed in devl_internal.h and allow to have them
maintained in object files.
The register/unregister notifications need to know the structures
to iterate lists. To avoid the need, introduce per-object
register/unregister notification helpers and use them.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20230828061657.300667-2-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
My recent patch forgot to change error handling for IP_TRANSPARENT
socket option.
WARNING: bad unlock balance detected!
6.5.0-rc7-syzkaller-01717-g59da9885767a #0 Not tainted
-------------------------------------
syz-executor151/5028 is trying to release lock (sk_lock-AF_INET) at:
[<ffffffff88213983>] sockopt_release_sock+0x53/0x70 net/core/sock.c:1073
but there are no more locks to release!
other info that might help us debug this:
1 lock held by syz-executor151/5028:
stack backtrace:
CPU: 0 PID: 5028 Comm: syz-executor151 Not tainted 6.5.0-rc7-syzkaller-01717-g59da9885767a #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 07/26/2023
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0xd9/0x1b0 lib/dump_stack.c:106
__lock_release kernel/locking/lockdep.c:5438 [inline]
lock_release+0x4b5/0x680 kernel/locking/lockdep.c:5781
sock_release_ownership include/net/sock.h:1824 [inline]
release_sock+0x175/0x1b0 net/core/sock.c:3527
sockopt_release_sock+0x53/0x70 net/core/sock.c:1073
do_ip_setsockopt+0x12c1/0x3640 net/ipv4/ip_sockglue.c:1364
ip_setsockopt+0x59/0xe0 net/ipv4/ip_sockglue.c:1419
raw_setsockopt+0x218/0x290 net/ipv4/raw.c:833
__sys_setsockopt+0x2cd/0x5b0 net/socket.c:2305
__do_sys_setsockopt net/socket.c:2316 [inline]
__se_sys_setsockopt net/socket.c:2313 [inline]
Fixes: 4bd0623f04 ("inet: move inet->transparent to inet->inet_flags")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Simon Horman <horms@kernel.org>
Cc: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
While looking at TC_ACT_* handling, the TC_ACT_CONSUMED is only handled in
sch_handle_ingress but not sch_handle_egress. This was added via cd11b16407
("net/tc: introduce TC_ACT_REINSERT.") and e5cf1baf92 ("act_mirred: use
TC_ACT_REINSERT when possible") and later got renamed into TC_ACT_CONSUMED
via 720f22fed8 ("net: sched: refactor reinsert action").
The initial work was targeted for ovs back then and only needed on ingress,
and the mirred action module also restricts it to only that. However, given
it's an API contract it would still make sense to make this consistent to
sch_handle_ingress and handle it on egress side in the same way, that is,
setting return code to "success" and returning NULL back to the caller as
otherwise an action module sitting on egress returning TC_ACT_CONSUMED could
lead to an UAF when untreated.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix a memory leak for the tc egress path with TC_ACT_{STOLEN,QUEUED,TRAP}:
[...]
unreferenced object 0xffff88818bcb4f00 (size 232):
comm "softirq", pid 0, jiffies 4299085078 (age 134.028s)
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00 80 70 61 81 88 ff ff 00 41 31 14 81 88 ff ff ..pa.....A1.....
backtrace:
[<ffffffff9991b938>] kmem_cache_alloc_node+0x268/0x400
[<ffffffff9b3d9231>] __alloc_skb+0x211/0x2c0
[<ffffffff9b3f0c7e>] alloc_skb_with_frags+0xbe/0x6b0
[<ffffffff9b3bf9a9>] sock_alloc_send_pskb+0x6a9/0x870
[<ffffffff9b6b3f00>] __ip_append_data+0x14d0/0x3bf0
[<ffffffff9b6ba24e>] ip_append_data+0xee/0x190
[<ffffffff9b7e1496>] icmp_push_reply+0xa6/0x470
[<ffffffff9b7e4030>] icmp_reply+0x900/0xa00
[<ffffffff9b7e42e3>] icmp_echo.part.0+0x1a3/0x230
[<ffffffff9b7e444d>] icmp_echo+0xcd/0x190
[<ffffffff9b7e9566>] icmp_rcv+0x806/0xe10
[<ffffffff9b699bd1>] ip_protocol_deliver_rcu+0x351/0x3d0
[<ffffffff9b699f14>] ip_local_deliver_finish+0x2b4/0x450
[<ffffffff9b69a234>] ip_local_deliver+0x174/0x1f0
[<ffffffff9b69a4b2>] ip_sublist_rcv_finish+0x1f2/0x420
[<ffffffff9b69ab56>] ip_sublist_rcv+0x466/0x920
[...]
I was able to reproduce this via:
ip link add dev dummy0 type dummy
ip link set dev dummy0 up
tc qdisc add dev eth0 clsact
tc filter add dev eth0 egress protocol ip prio 1 u32 match ip protocol 1 0xff action mirred egress redirect dev dummy0
ping 1.1.1.1
<stolen>
After the fix, there are no kmemleak reports with the reproducer. This is
in line with what is also done on the ingress side, and from debugging the
skb_unref(skb) on dummy xmit and sch_handle_egress() side, it is visible
that these are two different skbs with both skb_unref(skb) as true. The two
seen skbs are due to mirred doing a skb_clone() internally as use_reinsert
is false in tcf_mirred_act() for egress. This was initially reported by Gal.
Fixes: e420bed025 ("bpf: Add fd-based tcx multi-prog infra with link support")
Reported-by: Gal Pressman <gal@nvidia.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/bdfc2640-8f65-5b56-4472-db8e2b161aab@nvidia.com
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Hariprasad Kelam says:
====================
octeontx2-af: misc MAC block changes
This series of patches adds recent changes added in MAC (CGX/RPM) block.
Patch1: Adds new LMAC mode supported by CN10KB silicon
Patch2: In a scenario where system boots with no cgx devices, currently
AF driver treats this as error as a result no interfaces will work.
This patch relaxes this check, such that non cgx mapped netdev
devices will work.
Patch3: This patch adds required lmac validation in MAC block APIs.
Patch4: Prints error message incase, no netdev is mapped with given
cgx,lmac pair.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
During AF driver initialization, it creates a mapping between pf to
cgx,lmac pair. Whenever there is a physical link change, using this
mapping driver forwards the message to the associated netdev.
This patch prints error message incase of cgx,lmac pair is not
associated with any pf netdev.
Signed-off-by: Hariprasad Kelam <hkelam@marvell.com>
Signed-off-by: Sunil Kovvuri Goutham <sgoutham@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
With the addition of new MAC blocks like CN10K RPM and CN10KB
RPM_USX, LMACs are noncontiguous. Though in most of the functions,
lmac validation checks exist but in few functions they are missing.
This patch adds the same.
Signed-off-by: Hariprasad Kelam <hkelam@marvell.com>
Signed-off-by: Sunil Kovvuri Goutham <sgoutham@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Don't treat lack of CGX LMACs on the system as a error.
Instead ignore it so that LBK VFs are created and can be used.
Signed-off-by: Sunil Goutham <sgoutham@marvell.com>
Signed-off-by: Hariprasad Kelam <hkelam@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Upon physical link change, firmware reports to the kernel about the
change along with the details like speed, lmac_type_id, etc.
Kernel derives lmac_type based on lmac_type_id received from firmware.
This patch extends current lmac list with new USGMII mode supported
by CN10KB RPM block.
Signed-off-by: Hariprasad Kelam <hkelam@marvell.com>
Signed-off-by: Sunil Kovvuri Goutham <sgoutham@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Convert the Xilinx GMII to RGMII Converter device tree binding
documentation to json schema.
This converter is usually used as gem <---> gmii2rgmii <---> external phy
and, it's phy-handle should point to the phandle of the external phy.
Signed-off-by: Pranavi Somisetty <pranavi.somisetty@amd.com>
Signed-off-by: Harini Katakam <harini.katakam@amd.com>
Reviewed-by: Conor Dooley <conor.dooley@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sabrina Dubroca says:
====================
tls: expand tls_cipher_size_desc to simplify getsockopt/setsockopt
Commit 2d2c5ea242 ("net/tls: Describe ciphers sizes by const
structs") introduced tls_cipher_size_desc to describe the size of the
fields of the per-cipher crypto_info structs, and commit ea7a9d88ba
("net/tls: Use cipher sizes structs") used it, but only in
tls_device.c and tls_device_fallback.c, and skipped converting similar
code in tls_main.c and tls_sw.c.
This series expands tls_cipher_size_desc (renamed to tls_cipher_desc
to better fit this expansion) to fully describe a cipher:
- offset of the fields within the per-cipher crypto_info
- size of the full struct (for copies to/from userspace)
- offload flag
- algorithm name used by SW crypto
With these additions, we can remove ~350L of
switch (crypto_info->cipher_type) { ... }
from tls_set_device_offload, tls_sw_fallback_init,
do_tls_getsockopt_conf, do_tls_setsockopt_conf, tls_set_sw_offload
(mainly do_tls_getsockopt_conf and tls_set_sw_offload).
This series also adds the ARIA ciphers to the tls selftests, and some
more getsockopt/setsockopt tests to cover more of the code changed by
this series.
====================
Link: https://lore.kernel.org/r/cover.1692977948.git.sd@queasysnail.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
tls_cipher_size_desc indexes ciphers by their type, but we're not
using indices 0..50 of the array. Each struct tls_cipher_size_desc is
20B, so that's a lot of unused memory. We can reindex the array
starting at the lowest used cipher_type.
Introduce the get_cipher_size_desc helper to find the right item and
avoid out-of-bounds accesses, and make tls_cipher_size_desc's size
explicit so that gcc reminds us to update TLS_CIPHER_MIN/MAX when we
add a new cipher.
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://lore.kernel.org/r/5e054e370e240247a5d37881a1cd93a67c15f4ca.1692977948.git.sd@queasysnail.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Donald Hunter says:
====================
tools/net/ynl: Add support for netlink-raw families
This patchset adds support for netlink-raw families such as rtnetlink.
Patch 1 fixes a typo in existing schemas
Patch 2 contains the schema definition
Patches 3 & 4 update the schema documentation
Patches 5 - 9 extends ynl
Patches 10 - 12 add several netlink-raw specs
The netlink-raw schema is very similar to genetlink-legacy and I thought
about making the changes there and symlinking to it. On balance I
thought that might be problematic for accurate schema validation.
rtnetlink doesn't seem to fit into unified or directional message
enumeration models. It seems like an 'explicit' model would be useful,
to force the schema author to specify the message ids directly.
There is not yet support for notifications because ynl currently doesn't
support defining 'event' properties on a 'do' operation. The message ids
are shared so ops need to be both sync and async. I plan to look at this
in a future patch.
The link and route messages contain different nested attributes
dependent on the type of link or route. Decoding these will need some
kind of attr-space selection that uses the value of another attribute as
the selector key. These nested attributes have been left with type
'binary' for now.
====================
Link: https://lore.kernel.org/r/20230825122756.7603-1-donald.hunter@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add schema for rt link with support for newlink, dellink, getlink,
setlink and getstats.
A dummy link can be created like this:
sudo ./tools/net/ynl/cli.py \
--spec Documentation/netlink/specs/rt_link.yaml \
--do newlink --create \
--json '{"ifname": "dummy0", "linkinfo": {"kind": "dummy"}}'
For example, offload stats can be fetched like this:
./tools/net/ynl/cli.py \
--spec Documentation/netlink/specs/rt_link.yaml \
--dump getstats --json '{ "filter-mask": 8 }'
Signed-off-by: Donald Hunter <donald.hunter@gmail.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://lore.kernel.org/r/20230825122756.7603-12-donald.hunter@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This schema is largely a copy of the genetlink-legacy schema with the
following modifications:
- change the schema id to netlink-raw
- add a top-level protonum property, e.g. 0 (for NETLINK_ROUTE)
- change the protocol enumeration to netlink-raw, removing the
genetlink options.
- replace doc references to generic netlink with raw netlink
- add a value property to mcast-group definitions
Signed-off-by: Donald Hunter <donald.hunter@gmail.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://lore.kernel.org/r/20230825122756.7603-3-donald.hunter@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Saeed Mahameed says:
====================
{devlink,mlx5}: Add port function attributes for ipsec
From Dima:
Introduce hypervisor-level control knobs to set the functionality of PCI
VF devices passed through to guests. The administrator of a hypervisor
host may choose to change the settings of a port function from the
defaults configured by the device firmware.
The software stack has two types of IPsec offload - crypto and packet.
Specifically, the ip xfrm command has sub-commands for "state" and
"policy" that have an "offload" parameter. With ip xfrm state, both
crypto and packet offload types are supported, while ip xfrm policy can
only be offloaded in packet mode.
The series introduces two new boolean attributes of a port function:
ipsec_crypto and ipsec_packet. The goal is to provide a similar level of
granularity for controlling VF IPsec offload capabilities, which would
be aligned with the software model. This will allow users to decide if
they want both types of offload enabled for a VF, just one of them, or
none at all (which is the default).
At a high level, the difference between the two knobs is that with
ipsec_crypto, only XFRM state can be offloaded. Specifically, only the
crypto operation (Encrypt/Decrypt) is offloaded. With ipsec_packet, both
XFRM state and policy can be offloaded. Furthermore, in addition to
crypto operation offload, IPsec encapsulation is also offloaded. For
XFRM state, choosing between crypto and packet offload types is
possible. From the HW perspective, different resources may be required
for each offload type.
Examples of when a user prefers to enable IPsec packet offload for a VF
when using switchdev mode:
$ devlink port show pci/0000:06:00.0/1
pci/0000:06:00.0/1: type eth netdev enp6s0pf0vf0 flavour pcivf pfnum 0 vfnum 0
function:
hw_addr 00:00:00:00:00:00 roce enable migratable disable ipsec_crypto disable ipsec_packet disable
$ devlink port function set pci/0000:06:00.0/1 ipsec_packet enable
$ devlink port show pci/0000:06:00.0/1
pci/0000:06:00.0/1: type eth netdev enp6s0pf0vf0 flavour pcivf pfnum 0 vfnum 0
function:
hw_addr 00:00:00:00:00:00 roce enable migratable disable ipsec_crypto disable ipsec_packet enable
This enables the corresponding IPsec capability of the function before
it's enumerated, so when the driver reads the capability from the device
firmware, it is enabled. The driver is then able to configure
corresponding features and ops of the VF net device to support IPsec
state and policy offloading.
v2: https://lore.kernel.org/netdev/20230421104901.897946-1-dchumak@nvidia.com/
====================
Link: https://lore.kernel.org/r/20230825062836.103744-1-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Implement devlink port function commands to enable / disable IPsec
packet offloads. This is used to control the IPsec capability of the
device.
When ipsec_offload is enabled for a VF, it prevents adding IPsec packet
offloads on the PF, because the two cannot be active simultaneously due
to HW constraints. Conversely, if there are any active IPsec packet
offloads on the PF, it's not allowed to enable ipsec_packet on a VF,
until PF IPsec offloads are cleared.
Signed-off-by: Dima Chumak <dchumak@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Link: https://lore.kernel.org/r/20230825062836.103744-9-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>