linux

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2026-06-02 00:32:31 -04:00

Author	SHA1	Message	Date
David Carlier	dd66b42854	octeon_ep_vf: add NULL check for napi_build_skb() napi_build_skb() can return NULL on allocation failure. In __octep_vf_oq_process_rx(), the result is used directly without a NULL check in both the single-buffer and multi-fragment paths, leading to a NULL pointer dereference. Add NULL checks after both napi_build_skb() calls, properly advancing descriptors and consuming remaining fragments on failure. Fixes: `1cd3b40797` ("octeon_ep_vf: add Tx/Rx processing and interrupt support") Cc: stable@vger.kernel.org Signed-off-by: David Carlier <devnexen@gmail.com> Link: https://patch.msgid.link/20260409184009.930359-3-devnexen@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-13 15:40:49 -07:00
David Carlier	4e5bc3ff06	octeon_ep_vf: introduce octep_vf_oq_next_idx() helper Introduce octep_vf_oq_next_idx() to consolidate the repeated ring index advance and wraparound pattern in __octep_vf_oq_process_rx(). No functional change intended. Signed-off-by: David Carlier <devnexen@gmail.com> Link: https://patch.msgid.link/20260409184009.930359-2-devnexen@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-13 15:40:49 -07:00
Jakub Kicinski	9ad24ba408	Merge branch 'net-qrtr-ns-a-bunch-of-fixs' Manivannan Sadhasivam says: ==================== net: qrtr: ns: A bunch of fixs This series fixes a bunch of possible memory exhaustion issues in the QRTR nameserver. ==================== Link: https://patch.msgid.link/20260409-qrtr-fix-v3-0-00a8a5ff2b51@oss.qualcomm.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-13 15:34:09 -07:00
Manivannan Sadhasivam	7809fea20c	net: qrtr: ns: Fix use-after-free in driver remove() In the remove callback, if a packet arrives after destroy_workqueue() is called, but before sock_release(), the qrtr_ns_data_ready() callback will try to queue the work, causing use-after-free issue. Fix this issue by saving the default 'sk_data_ready' callback during qrtr_ns_init() and use it to replace the qrtr_ns_data_ready() callback at the start of remove(). This ensures that even if a packet arrives after destroy_workqueue(), the work struct will not be dereferenced. Note that it is also required to ensure that the RX threads are completed before destroying the workqueue, because the threads could be using the qrtr_ns_data_ready() callback. Cc: stable@vger.kernel.org Fixes: `0c2204a4ad` ("net: qrtr: Migrate nameservice to kernel from userspace") Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@oss.qualcomm.com> Link: https://patch.msgid.link/20260409-qrtr-fix-v3-5-00a8a5ff2b51@oss.qualcomm.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-13 15:34:07 -07:00
Manivannan Sadhasivam	27d5e84e81	net: qrtr: ns: Limit the total number of nodes Currently, the nameserver doesn't limit the number of nodes it handles. This can be an attack vector if a malicious client starts registering random nodes, leading to memory exhaustion. Hence, limit the maximum number of nodes to 64. Note that, limit of 64 is chosen based on the current platform requirements. If requirement changes in the future, this limit can be increased. Cc: stable@vger.kernel.org Fixes: `0c2204a4ad` ("net: qrtr: Migrate nameservice to kernel from userspace") Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@oss.qualcomm.com> Link: https://patch.msgid.link/20260409-qrtr-fix-v3-4-00a8a5ff2b51@oss.qualcomm.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-13 15:34:06 -07:00
Manivannan Sadhasivam	68efba3644	net: qrtr: ns: Free the node during ctrl_cmd_bye() A node sends the BYE packet when it is about to go down. So the nameserver should advertise the removal of the node to all remote and local observers and free the node finally. But currently, the nameserver doesn't free the node memory even after processing the BYE packet. This causes the node memory to leak. Hence, remove the node from Xarray list and free the node memory during both success and failure case of ctrl_cmd_bye(). Cc: stable@vger.kernel.org Fixes: `0c2204a4ad` ("net: qrtr: Migrate nameservice to kernel from userspace") Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@oss.qualcomm.com> Link: https://patch.msgid.link/20260409-qrtr-fix-v3-3-00a8a5ff2b51@oss.qualcomm.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-13 15:34:06 -07:00
Manivannan Sadhasivam	5640227d9a	net: qrtr: ns: Limit the maximum number of lookups Current code does no bound checking on the number of lookups a client can perform. Though the code restricts the lookups to local clients, there is still a possibility of a malicious local client sending a flood of NEW_LOOKUP messages over the same socket. Fix this issue by limiting the maximum number of lookups to 64 globally. Since the nameserver allows only atmost one local observer, this global lookup count will ensure that the lookups stay within the limit. Note that, limit of 64 is chosen based on the current platform requirements. If requirement changes in the future, this limit can be increased. Cc: stable@vger.kernel.org Fixes: `0c2204a4ad` ("net: qrtr: Migrate nameservice to kernel from userspace") Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@oss.qualcomm.com> Link: https://patch.msgid.link/20260409-qrtr-fix-v3-2-00a8a5ff2b51@oss.qualcomm.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-13 15:34:06 -07:00
Manivannan Sadhasivam	d5ee2ff983	net: qrtr: ns: Limit the maximum server registration per node Current code does no bound checking on the number of servers added per node. A malicious client can flood NEW_SERVER messages and exhaust memory. Fix this issue by limiting the maximum number of server registrations to 256 per node. If the NEW_SERVER message is received for an old port, then don't restrict it as it will get replaced. While at it, also rate limit the error messages in the failure path of qrtr_ns_worker(). Note that the limit of 256 is chosen based on the current platform requirements. If requirement changes in the future, this limit can be increased. Cc: stable@vger.kernel.org Fixes: `0c2204a4ad` ("net: qrtr: Migrate nameservice to kernel from userspace") Reported-by: Yiming Qian <yimingqian591@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@oss.qualcomm.com> Link: https://patch.msgid.link/20260409-qrtr-fix-v3-1-00a8a5ff2b51@oss.qualcomm.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-13 15:34:06 -07:00
Maxime Chevallier	e7a62edd34	net: phy: qcom: at803x: Use the correct bit to disable extended next page As noted in the blamed commit, the AR8035 and other PHYs from this family advertise the Extended Next Page support by default, which may be understood by some partners as this PHY being multi-gig capable. The fix is to disable XNP advertising, which is done by setting bit 12 of the Auto-Negotiation Advertisement Register (MII_ADVERTISE). The blamed commit incorrectly uses MDIO_AN_CTRL1_XNP, which is bit 13 as per 802.3 : 45.2.7.1 AN control register (Register 7.0) BIT 12 in MII_ADVERTISE is wrapped by ADVERTISE_RESV, used by some drivers such as the aquantia one. 802.3 Clause 28 defines bit 12 as Extended Next Page ability, at least in recent versions of the standard. Let's add a define for it and use it in the at803x driver. Fixes: `3c51fa5d2a` ("net: phy: ar803x: disable extended next page bit") Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://patch.msgid.link/20260410171021.1277138-1-maxime.chevallier@bootlin.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-13 14:36:22 -07:00
Daniel Golle	2dddb34dd0	net: ethernet: mtk_eth_soc: initialize PPE per-tag-layer MTU registers The PPE enforces output frame size limits via per-tag-layer VLAN_MTU registers that the driver never initializes. The hardware defaults do not account for PPPoE overhead, causing the PPE to punt encapsulated frames back to the CPU instead of forwarding them. Initialize the registers at PPE start and on MTU changes using the maximum GMAC MTU. This is a conservative approximation -- the actual per-PPE requirement depends on egress path, but using the global maximum ensures the limits are never too small. Fixes: `ba37b7caf1` ("net: ethernet: mtk_eth_soc: add support for initializing the PPE") Signed-off-by: Daniel Golle <daniel@makrotopia.org> Link: https://patch.msgid.link/ec995ab8ce8be423267a1cc093147a74d2eb9d82.1775789829.git.daniel@makrotopia.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-12 15:22:58 -07:00
Jakub Kicinski	5e700c7730	Merge branch 'mlx5-misc-fixes-2026-04-09' Tariq Toukan says: ==================== mlx5 misc fixes 2026-04-09 This small patchset provides misc bug fixes from Gal to the mlx5 Eth driver. ==================== Link: https://patch.msgid.link/20260409202852.158059-1-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-12 14:37:10 -07:00
Gal Pressman	edccdd1eb9	net/mlx5e: IPsec, fix ASO poll timeout with read_poll_timeout_atomic() The do-while poll loop uses jiffies for its timeout: expires = jiffies + msecs_to_jiffies(10); jiffies is sampled at an arbitrary point within the current tick, so the first partial tick contributes anywhere from a full tick down to nearly zero real time. For small msecs_to_jiffies() results this is significant, the effective poll window can be much shorter than the requested 10ms, and in the worst case the loop exits after a single iteration (e.g., when HZ=100), well before the device has delivered the CQE. Replace the loop with read_poll_timeout_atomic(), which counts elapsed time via udelay() accounting rather than jiffies, guaranteeing the full poll window regardless of HZ. Additionally, read_poll_timeout_atomic() executes the poll operation one more time after the timeout has expired, giving the CQE a final chance to be detected. The old do-while loop could exit without a final poll if the timeout expired during the udelay() between iterations. Fixes: `76e463f650` ("net/mlx5e: Overcome slow response for first IPsec ASO WQE") Signed-off-by: Gal Pressman <gal@nvidia.com> Reviewed-by: Jianbo Liu <jianbol@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20260409202852.158059-3-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-12 14:37:08 -07:00
Gal Pressman	9994ad4df8	net/mlx5e: Fix features not applied during netdev registration mlx5e_fix_features() returns early when the netdevice is not present. This is correct during profile transitions where priv is cleared, but it also incorrectly blocks feature fixups during register_netdev(), when the device is also not yet present. It is not trivial to distinguish between both cases as we cannot use priv to carry state, and in both cases reg_state == NETREG_REGISTERED. Force a netdev features update after register_netdev() completes, where the device is present and fix_features() can actually work. This is not a pretty solution, as it results in an additional features update call (register_netdevice() already calls __netdev_update_features() internally), but it is the simplest, cleanest, and most robust way I found to fix this issue after multiple attempts. This fixes an issue on systems where CQE compression is enabled by default, RXHASH remains enabled after registration despite the two features being mutually exclusive. Fixes: `ab4b01bfda` ("net/mlx5e: Verify dev is present for fix features ndo") Signed-off-by: Gal Pressman <gal@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20260409202852.158059-2-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-12 14:37:08 -07:00
Norbert Szetei	d114bfdc9b	vsock: fix buffer size clamping order In vsock_update_buffer_size(), the buffer size was being clamped to the maximum first, and then to the minimum. If a user sets a minimum buffer size larger than the maximum, the minimum check overrides the maximum check, inverting the constraint. This breaks the intended socket memory boundaries by allowing the vsk->buffer_size to grow beyond the configured vsk->buffer_max_size. Fix this by checking the minimum first, and then the maximum. This ensures the buffer size never exceeds the buffer_max_size. Fixes: `b9f2b0ffde` ("vsock: handle buffer_size sockopts in the core") Suggested-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Norbert Szetei <norbert@doyensec.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Link: https://patch.msgid.link/180118C5-8BCF-4A63-A305-4EE53A34AB9C@doyensec.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-12 14:31:50 -07:00
Charles Perry	6f533abe7b	net: phy: fix a return path in get_phy_c45_ids() The return value of phy_c45_probe_present() is stored in "ret", not "phy_reg", fix this. "phy_reg" always has a positive value if we reach this return path (since it would have returned earlier otherwise), which means that the original goal of the patch of not considering -ENODEV fatal wasn't achieved. Fixes: `17b4475394` ("net: phy: c45 scanning: Don't consider -ENODEV fatal") Signed-off-by: Charles Perry <charles.perry@microchip.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Link: https://patch.msgid.link/20260409133654.3203336-1-charles.perry@microchip.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-12 14:16:05 -07:00
Jakub Kicinski	e4df9f0264	Merge branch 'more-fixes-for-the-ipa-driver' Luca Weiss says: ==================== More fixes for the IPA driver Two more fixes for the Qualcomm IPA driver. ==================== Link: https://patch.msgid.link/20260409-ipa-fixes-v1-0-a817c30678ac@fairphone.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-12 13:49:34 -07:00
Luca Weiss	1335b903cf	net: ipa: Fix decoding EV_PER_EE for IPA v5.0+ Initially 'reg' and 'val' are assigned from HW_PARAM_2. But since IPA v5.0+ takes EV_PER_EE from HW_PARAM_4 (instead of NUM_EV_PER_EE from HW_PARAM_2), we not only need to re-assign 'reg' but also read the register value of that register into 'val' so that reg_decode() works on the correct value. Fixes: `f651334e1e` ("net: ipa: add HW_PARAM_4 GSI register") Link: https://sashiko.dev/#/patchset/20260403-milos-ipa-v1-0-01e9e4e03d3e%40fairphone.com?part=2 Signed-off-by: Luca Weiss <luca.weiss@fairphone.com> Reviewed-by: Konrad Dybcio <konrad.dybcio@oss.qualcomm.com> Link: https://patch.msgid.link/20260409-ipa-fixes-v1-2-a817c30678ac@fairphone.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-12 13:49:33 -07:00
Luca Weiss	de08f95856	net: ipa: Fix programming of QTIME_TIMESTAMP_CFG The 'val' variable gets overwritten multiple times, discarding previous values. Looking at the git log shows these should be combined with \|= instead. Fixes: `9265a4f0f0` ("net: ipa: define even more IPA register fields") Link: https://sashiko.dev/#/patchset/20260403-milos-ipa-v1-0-01e9e4e03d3e%40fairphone.com?part=4 Signed-off-by: Luca Weiss <luca.weiss@fairphone.com> Reviewed-by: Konrad Dybcio <konrad.dybcio@oss.qualcomm.com> Link: https://patch.msgid.link/20260409-ipa-fixes-v1-1-a817c30678ac@fairphone.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-12 13:49:32 -07:00
Taegu Ha	2bb6379416	ppp: require CAP_NET_ADMIN in target netns for unattached ioctls /dev/ppp open is currently authorized against file->f_cred->user_ns, while unattached administrative ioctls operate on current->nsproxy->net_ns. As a result, a local unprivileged user can create a new user namespace with CLONE_NEWUSER, gain CAP_NET_ADMIN only in that new user namespace, and still issue PPPIOCNEWUNIT, PPPIOCATTACH, or PPPIOCATTCHAN against an inherited network namespace. Require CAP_NET_ADMIN in the user namespace that owns the target network namespace before handling unattached PPP administrative ioctls. This preserves normal pppd operation in the network namespace it is actually privileged in, while rejecting the userns-only inherited-netns case. Fixes: `273ec51dd7` ("net: ppp_generic - introduce net-namespace functionality v2") Signed-off-by: Taegu Ha <hataegu0826@gmail.com> Link: https://patch.msgid.link/20260409071117.4354-1-hataegu0826@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-12 13:47:45 -07:00
Jakub Kicinski	e802cd8cb0	Merge branch 'net-rds-fix-use-after-free-in-rds-ib-for-non-init-namespaces' Allison Henderson says: ==================== net/rds: Fix use-after-free in RDS/IB for non-init namespaces This series fixes syzbot bug da8e060735ae02c8f3d1 https://syzkaller.appspot.com/bug?extid=da8e060735ae02c8f3d1 The report finds a use-after-free bug where ib connections access an invalid network namespace after it has been freed. The stack is: rds_rdma_cm_event_handler_cmn rds_conn_path_drop rds_destroy_pending check_net() <-- use-after-free This is initially introduced in: `d5a8ac28a7` ("RDS-TCP: Make RDS-TCP work correctly when it is set up in a netns other than init_net"). Here, we made RDS aware of the namespace by storing a net pointer in each connection. But it is not explicitly restricted to init_net in the case of ib. The RDS/TCP transport has its own pernet exit handler (rds_tcp_exit_net) that destroys connections when a namespace is torn down. But RDS/IB does not support more than the initial namespace and has no such handler. The initial namespace is statically allocated, and never torn down, so it always has at least one reference. Allowing non init namespaces that do not have a persistent reference means that when their refcounts drop to zero, they are released through cleanup_net(). Which would call any registered pernet clean up handlers if it had any, but since they don't in this case, the extra rds_connections remain with stale c_net pointers. Which are then accessed later causing the use-after-free bug. So, the simple fix is to disallow more than the initial namespace to be created in the case of ib connections. Fixes are ported from UEK patches found here: https://github.com/oracle/linux-uek/commit/8ed9a82376b7 Patch 1 is a prerequisite optimization to rds_ib_laddr_check() that avoids excessive rdma_bind_addr() calls during transport probing by first checking rds_ib_get_device(). This is needed because patch 2 adds a namespace check at the top of the same function. UEK: 8ed9a82376b7 ("rds: ib: Optimize rds_ib_laddr_check") https://github.com/oracle/linux-uek/commit/bd9489a08004 Patch 2 restricts RDS/IB to the initial network namespace. It adds checks in both rds_ib_laddr_check() and rds_set_transport() to reject IB use from non-init namespaces with -EPROTOTYPE. This prevents the use-after-free by ensuring IB connections cannot exist in namespaces that may be torn down. UEK: bd9489a08004 ("net/rds: Restrict use of RDS/IB to the initial network namespace") Questions, comments and feedback appreciated! ==================== Link: https://patch.msgid.link/20260408080420.540032-1-achender@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-12 13:33:21 -07:00
Greg Jumper	ebf71dd4af	net/rds: Restrict use of RDS/IB to the initial network namespace Prevent using RDS/IB in network namespaces other than the initial one. The existing RDS/IB code will not work properly in non-initial network namespaces. Fixes: `d5a8ac28a7` ("RDS-TCP: Make RDS-TCP work correctly when it is set up in a netns other than init_net") Reported-by: syzbot+da8e060735ae02c8f3d1@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=da8e060735ae02c8f3d1 Signed-off-by: Greg Jumper <greg.jumper@oracle.com> Signed-off-by: Allison Henderson <achender@kernel.org> Link: https://patch.msgid.link/20260408080420.540032-3-achender@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-12 13:33:19 -07:00
Håkon Bugge	236f718ac8	net/rds: Optimize rds_ib_laddr_check rds_ib_laddr_check() creates a CM_ID and attempts to bind the address in question to it. This in order to qualify the allegedly local address as a usable IB/RoCE address. In the field, ExaWatcher runs rds-ping to all ports in the fabric from all local ports. This using all active ToS'es. In a full rack system, we have 14 cell servers and eight db servers. Typically, 6 ToS'es are used. This implies 528 rds-ping invocations per ExaWatcher's "RDSinfo" interval. Adding to this, each rds-ping invocation creates eight sockets and binds the local address to them: socket(AF_RDS, SOCK_SEQPACKET, 0) = 3 bind(3, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("192.168.36.2")}, 16) = 0 socket(AF_RDS, SOCK_SEQPACKET, 0) = 4 bind(4, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("192.168.36.2")}, 16) = 0 socket(AF_RDS, SOCK_SEQPACKET, 0) = 5 bind(5, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("192.168.36.2")}, 16) = 0 socket(AF_RDS, SOCK_SEQPACKET, 0) = 6 bind(6, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("192.168.36.2")}, 16) = 0 socket(AF_RDS, SOCK_SEQPACKET, 0) = 7 bind(7, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("192.168.36.2")}, 16) = 0 socket(AF_RDS, SOCK_SEQPACKET, 0) = 8 bind(8, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("192.168.36.2")}, 16) = 0 socket(AF_RDS, SOCK_SEQPACKET, 0) = 9 bind(9, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("192.168.36.2")}, 16) = 0 socket(AF_RDS, SOCK_SEQPACKET, 0) = 10 bind(10, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("192.168.36.2")}, 16) = 0 So, at every interval ExaWatcher executes rds-ping's, 4224 CM_IDs are allocated, considering this full-rack system. After the a CM_ID has been allocated, rdma_bind_addr() is called, with the port number being zero. This implies that the CMA will attempt to search for an un-used ephemeral port. Simplified, the algorithm is to start at a random position in the available port space, and then if needed, iterate until an un-used port is found. The book-keeping of used ports uses the idr system, which again uses slab to allocate new struct idr_layer's. The size is 2092 bytes and slab tries to reduce the wasted space. Hence, it chooses an order:3 allocation, for which 15 idr_layer structs will fit and only 1388 bytes are wasted per the 32KiB order:3 chunk. Although this order:3 allocation seems like a good space/speed trade-off, it does not resonate well with how it used by the CMA. The combination of the randomized starting point in the port space (which has close to zero spatial locality) and the close proximity in time of the 4224 invocations of the rds-ping's, creates a memory hog for order:3 allocations. These costly allocations may need reclaims and/or compaction. At worst, they may fail and produce a stack trace such as (from uek4): [<ffffffff811a72d5>] __inc_zone_page_state+0x35/0x40 [<ffffffff811c2e97>] page_add_file_rmap+0x57/0x60 [<ffffffffa37ca1df>] remove_migration_pte+0x3f/0x3c0 [ksplice_6cn872bt_vmlinux_new] [<ffffffff811c3de8>] rmap_walk+0xd8/0x340 [<ffffffff811e8860>] remove_migration_ptes+0x40/0x50 [<ffffffff811ea83c>] migrate_pages+0x3ec/0x890 [<ffffffff811afa0d>] compact_zone+0x32d/0x9a0 [<ffffffff811b00ed>] compact_zone_order+0x6d/0x90 [<ffffffff811b03b2>] try_to_compact_pages+0x102/0x270 [<ffffffff81190e56>] __alloc_pages_direct_compact+0x46/0x100 [<ffffffff8119165b>] __alloc_pages_nodemask+0x74b/0xaa0 [<ffffffff811d8411>] alloc_pages_current+0x91/0x110 [<ffffffff811e3b0b>] new_slab+0x38b/0x480 [<ffffffffa41323c7>] __slab_alloc+0x3b7/0x4a0 [ksplice_s0dk66a8_vmlinux_new] [<ffffffff811e42ab>] kmem_cache_alloc+0x1fb/0x250 [<ffffffff8131fdd6>] idr_layer_alloc+0x36/0x90 [<ffffffff8132029c>] idr_get_empty_slot+0x28c/0x3d0 [<ffffffff813204ad>] idr_alloc+0x4d/0xf0 [<ffffffffa051727d>] cma_alloc_port+0x4d/0xa0 [rdma_cm] [<ffffffffa0517cbe>] rdma_bind_addr+0x2ae/0x5b0 [rdma_cm] [<ffffffffa09d8083>] rds_ib_laddr_check+0x83/0x2c0 [ksplice_6l2xst5i_rds_rdma_new] [<ffffffffa05f892b>] rds_trans_get_preferred+0x5b/0xa0 [rds] [<ffffffffa05f09f2>] rds_bind+0x212/0x280 [rds] [<ffffffff815b4016>] SYSC_bind+0xe6/0x120 [<ffffffff815b4d3e>] SyS_bind+0xe/0x10 [<ffffffff816b031a>] system_call_fastpath+0x18/0xd4 To avoid these excessive calls to rdma_bind_addr(), we optimize rds_ib_laddr_check() by simply checking if the address in question has been used before. The rds_rdma module keeps track of addresses associated with IB devices, and the function rds_ib_get_device() is used to determine if the address already has been qualified as a valid local address. If not found, we call the legacy rds_ib_laddr_check(), now renamed to rds_ib_laddr_check_cm(). Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com> Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com> Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com> Signed-off-by: Allison Henderson <achender@kernel.org> Link: https://patch.msgid.link/20260408080420.540032-2-achender@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-12 13:33:19 -07:00
Jakub Kicinski	2654557112	Merge branch 'net-hamradio-fix-missing-input-validation-in-bpqether-and-scc' Mashiro Chen says: ==================== net: hamradio: fix missing input validation in bpqether and scc This series fixes two missing input validation bugs in the hamradio drivers. Both patches were reviewed by Joerg Reuter (hamradio maintainer). ==================== Link: https://patch.msgid.link/20260409024927.24397-1-mashiro.chen@mailbox.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-12 13:19:08 -07:00
Mashiro Chen	8263e484d6	net: hamradio: scc: validate bufsize in SIOCSCCSMEM ioctl The SIOCSCCSMEM ioctl copies a scc_mem_config from user space and assigns its bufsize field directly to scc->stat.bufsize without any range validation: scc->stat.bufsize = memcfg.bufsize; If a privileged user (CAP_SYS_RAWIO) sets bufsize to 0, the receive interrupt handler later calls dev_alloc_skb(0) and immediately writes a KISS type byte via skb_put_u8() into a zero-capacity socket buffer, corrupting the adjacent skb_shared_info region. Reject bufsize values smaller than 16; this is large enough to hold at least one KISS header byte plus useful data. Signed-off-by: Mashiro Chen <mashiro.chen@mailbox.org> Acked-by: Joerg Reuter <jreuter@yaina.de> Link: https://patch.msgid.link/20260409024927.24397-3-mashiro.chen@mailbox.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-12 13:19:03 -07:00
Mashiro Chen	6183bd8723	net: hamradio: bpqether: validate frame length in bpq_rcv() The BPQ length field is decoded as: len = skb->data[0] + skb->data[1] * 256 - 5; If the sender sets bytes [0..1] to values whose combined value is less than 5, len becomes negative. Passing a negative int to skb_trim() silently converts to a huge unsigned value, causing the function to be a no-op. The frame is then passed up to AX.25 with its original (untrimmed) payload, delivering garbage beyond the declared frame boundary. Additionally, a negative len corrupts the 64-bit rx_bytes counter through implicit sign-extension. Add a bounds check before pulling the length bytes: reject frames where len is negative or exceeds the remaining skb data. Acked-by: Joerg Reuter <jreuter@yaina.de> Signed-off-by: Mashiro Chen <mashiro.chen@mailbox.org> Link: https://patch.msgid.link/20260409024927.24397-2-mashiro.chen@mailbox.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-12 13:19:03 -07:00
Mashiro Chen	2835750dd6	net: rose: reject truncated CLEAR_REQUEST frames in state machines All five ROSE state machines (states 1-5) handle ROSE_CLEAR_REQUEST by reading the cause and diagnostic bytes directly from skb->data[3] and skb->data[4] without verifying that the frame is long enough: rose_disconnect(sk, ..., skb->data[3], skb->data[4]); The entry-point check in rose_route_frame() only enforces ROSE_MIN_LEN (3 bytes), so a remote peer on a ROSE network can send a syntactically valid but truncated CLEAR_REQUEST (3 or 4 bytes) while a connection is open in any state. Processing such a frame causes a one- or two-byte out-of-bounds read past the skb data, leaking uninitialized heap content as the cause/diagnostic values returned to user space via getsockopt(ROSE_GETCAUSE). Add a single length check at the rose_process_rx_frame() dispatch point, before any state machine is entered, to drop frames that carry the CLEAR_REQUEST type code but are too short to contain the required cause and diagnostic fields. Signed-off-by: Mashiro Chen <mashiro.chen@mailbox.org> Link: https://patch.msgid.link/20260408172551.281486-1-mashiro.chen@mailbox.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-12 13:09:45 -07:00
Jakub Kicinski	ba69b788ed	Merge branch 'bpf-fix-sock_ops_get_sk-same-register-oob-read-in-sock_ops-and-add-selftest' Jiayuan Chen says: ==================== bpf: Fix SOCK_OPS_GET_SK same-register OOB read in sock_ops and add selftest When a BPF sock_ops program accesses ctx fields with dst_reg == src_reg, the SOCK_OPS_GET_SK() and SOCK_OPS_GET_FIELD() macros fail to zero the destination register in the !fullsock / !locked_tcp_sock path, leading to OOB read (GET_SK) and kernel pointer leak (GET_FIELD). Patch 1: Fix both macros by adding BPF_MOV64_IMM(si->dst_reg, 0) in the !fullsock landing pad. Patch 2: Add selftests covering same-register and different-register cases for both GET_SK and GET_FIELD. [1] https://lore.kernel.org/bpf/6fe1243e-149b-4d3b-99c7-fcc9e2f75787@std.uestc.edu.cn/T/#u ==================== Link: https://patch.msgid.link/20260407022720.162151-1-jiayuan.chen@linux.dev Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-12 12:28:08 -07:00
Jiayuan Chen	04013c3ca0	selftests/bpf: Add tests for sock_ops ctx access with same src/dst register Add selftests to verify SOCK_OPS_GET_SK() and SOCK_OPS_GET_FIELD() correctly return NULL/zero when dst_reg == src_reg and is_fullsock == 0. Three subtests are included: - get_sk: ctx->sk with same src/dst register (SOCK_OPS_GET_SK) - get_field: ctx->snd_cwnd with same src/dst register (SOCK_OPS_GET_FIELD) - get_sk_diff_reg: ctx->sk with different src/dst register (baseline) Each BPF program uses inline asm (__naked) to force specific register allocation, reads is_fullsock first, then loads the field using the same (or different) register. The test triggers TCP_NEW_SYN_RECV via a TCP handshake and checks that the result is NULL/zero when is_fullsock == 0. Reviewed-by: Sun Jian <sun.jian.kdev@gmail.com> Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev> Acked-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/20260407022720.162151-3-jiayuan.chen@linux.dev Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-12 12:28:05 -07:00
Jiayuan Chen	10f86a2a5c	bpf: Fix same-register dst/src OOB read and pointer leak in sock_ops When a BPF sock_ops program accesses ctx fields with dst_reg == src_reg, the SOCK_OPS_GET_SK() and SOCK_OPS_GET_FIELD() macros fail to zero the destination register in the !fullsock / !locked_tcp_sock path. Both macros borrow a temporary register to check is_fullsock / is_locked_tcp_sock when dst_reg == src_reg, because dst_reg holds the ctx pointer. When the check is false (e.g., TCP_NEW_SYN_RECV state with a request_sock), dst_reg should be zeroed but is not, leaving the stale ctx pointer: - SOCK_OPS_GET_SK: dst_reg retains the ctx pointer, passes NULL checks as PTR_TO_SOCKET_OR_NULL, and can be used as a bogus socket pointer, leading to stack-out-of-bounds access in helpers like bpf_skc_to_tcp6_sock(). - SOCK_OPS_GET_FIELD: dst_reg retains the ctx pointer which the verifier believes is a SCALAR_VALUE, leaking a kernel pointer. Fix both macros by: - Changing JMP_A(1) to JMP_A(2) in the fullsock path to skip the added instruction. - Adding BPF_MOV64_IMM(si->dst_reg, 0) after the temp register restore in the !fullsock path, placed after the restore because dst_reg == src_reg means we need src_reg intact to read ctx->temp. Fixes: `fd09af0107` ("bpf: sock_ops ctx access may stomp registers in corner case") Fixes: `84f44df664` ("bpf: sock_ops sk access may stomp registers when dst_reg = src_reg") Reported-by: Quan Sun <2022090917019@std.uestc.edu.cn> Reported-by: Yinhao Hu <dddddd@hust.edu.cn> Reported-by: Kaiyan Mei <M202472210@hust.edu.cn> Reported-by: Dongliang Mu <dzm91@hust.edu.cn> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Closes: https://lore.kernel.org/bpf/6fe1243e-149b-4d3b-99c7-fcc9e2f75787@std.uestc.edu.cn/T/#u Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev> Acked-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/20260407022720.162151-2-jiayuan.chen@linux.dev Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-12 12:28:05 -07:00
Greg Kroah-Hartman	46ce8be2ce	NFC: digital: Bounds check NFC-A cascade depth in SDD response handler The NFC-A anti-collision cascade in digital_in_recv_sdd_res() appends 3 or 4 bytes to target->nfcid1 on each round, but the number of cascade rounds is controlled entirely by the peer device. The peer sets the cascade tag in the SDD_RES (deciding 3 vs 4 bytes) and the cascade-incomplete bit in the SEL_RES (deciding whether another round follows). ISO 14443-3 limits NFC-A to three cascade levels and target->nfcid1 is sized accordingly (NFC_NFCID1_MAXSIZE = 10), but nothing in the driver actually enforces this. This means a malicious peer can keep the cascade running, writing past the heap-allocated nfc_target with each round. Fix this by rejecting the response when the accumulated UID would exceed the buffer. Commit `e329e71013` ("NFC: nci: Bounds check struct nfc_target arrays") fixed similar missing checks against the same field on the NCI path. Cc: Simon Horman <horms@kernel.org> Cc: Kees Cook <kees@kernel.org> Cc: Thierry Escande <thierry.escande@linux.intel.com> Cc: Samuel Ortiz <sameo@linux.intel.com> Fixes: `2c66daecc4` ("NFC Digital: Add NFC-A technology support") Cc: stable <stable@kernel.org> Assisted-by: gregkh_clanker_t1000 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Link: https://patch.msgid.link/2026040913-figure-seducing-bd3f@gregkh Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-12 11:40:45 -07:00
Fernando Fernandez Mancera	a6bd339dbb	net_sched: fix skb memory leak in deferred qdisc drops When the network stack cleans up the deferred list via qdisc_run_end(), it operates on the root qdisc. If the root qdisc do not implement the TCQ_F_DEQUEUE_DROPS flag the packets queue to free are never freed and gets stranded on the child's local to_free list. Fix this by making qdisc_dequeue_drop() aware of the root qdisc. It fetches the root qdisc and check for the TCQ_F_DEQUEUE_DROPS flag. If the flag is present, the packet is appended directly to the root's to_free list. Otherwise, drop it directly as it was done before the optimization was implemented. Fixes: `a6efc273ab` ("net_sched: use qdisc_dequeue_drop() in cake, codel, fq_codel") Reported-by: Damilola Bello <damilola@aterlo.com> Closes: https://lore.kernel.org/netdev/CAPgFtOLaedBMU0f_BxV2bXftTJSmJr018Q5uozOo5vVo6b9tjw@mail.gmail.com/ Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260408100044.4530-1-fmancera@suse.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-12 11:38:18 -07:00
Lorenzo Bianconi	656121b155	net: airoha: Add missing RX_CPU_IDX() configuration in airoha_qdma_cleanup_rx_queue() When the descriptor index written in REG_RX_CPU_IDX() is equal to the one stored in REG_RX_DMA_IDX(), the hw will stop since the QDMA RX ring is empty. Add missing REG_RX_CPU_IDX() configuration in airoha_qdma_cleanup_rx_queue routine during QDMA RX ring cleanup. Fixes: `514aac3599` ("net: airoha: Add missing cleanup bits in airoha_qdma_cleanup_rx_queue()") Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Link: https://patch.msgid.link/20260408-airoha-cpu-idx-airoha_qdma_cleanup_rx_queue-v1-1-8efa64844308@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-12 11:25:17 -07:00
Jakub Kicinski	ed45d380c5	Merge branch 'net-mana-fix-debugfs-directory-naming-and-file-lifecycle' Erni Sri Satya Vennela says: ==================== net: mana: Fix debugfs directory naming and file lifecycle This series fixes two pre-existing debugfs issues in the MANA driver. Patch 1 fixes the per-device debugfs directory naming to use the unique PCI BDF address via pci_name(), avoiding a potential NULL pointer dereference when pdev->slot is NULL (e.g. VFIO passthrough, nested KVM) and preventing name collisions across multiple PFs or VFs. Patch 2 moves the current_speed debugfs file creation from mana_probe_port() to mana_init_port() so it survives detach/attach cycles triggered by MTU changes or XDP program changes. ==================== Link: https://patch.msgid.link/20260408081224.302308-1-ernis@linux.microsoft.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-12 11:22:56 -07:00
Erni Sri Satya Vennela	3b7c7fc97a	net: mana: Move current_speed debugfs file to mana_init_port() Move the current_speed debugfs file creation from mana_probe_port() to mana_init_port(). The file was previously created only during initial probe, but mana_cleanup_port_context() removes the entire vPort debugfs directory during detach/attach cycles. Since mana_init_port() recreates the directory on re-attach, moving current_speed here ensures it survives these cycles. Fixes: `75cabb4693` ("net: mana: Add support for net_shaper_ops") Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260408081224.302308-3-ernis@linux.microsoft.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-12 11:22:54 -07:00
Erni Sri Satya Vennela	c116f07ab9	net: mana: Use pci_name() for debugfs directory naming Use pci_name(pdev) for the per-device debugfs directory instead of hardcoded "0" for PFs and pci_slot_name(pdev->slot) for VFs. The previous approach had two issues: 1. pci_slot_name() dereferences pdev->slot, which can be NULL for VFs in environments like generic VFIO passthrough or nested KVM, causing a NULL pointer dereference. 2. Multiple PFs would all use "0", and VFs across different PCI domains or buses could share the same slot name, leading to -EEXIST errors from debugfs_create_dir(). pci_name(pdev) returns the unique BDF address, is always valid, and is unique across the system. Fixes: `6607c17c6c` ("net: mana: Enable debugfs files for MANA device") Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260408081224.302308-2-ernis@linux.microsoft.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-12 11:22:54 -07:00
Junxi Qian	2b5dd46329	nfc: llcp: add missing return after LLCP_CLOSED checks In nfc_llcp_recv_hdlc() and nfc_llcp_recv_disc(), when the socket state is LLCP_CLOSED, the code correctly calls release_sock() and nfc_llcp_sock_put() but fails to return. Execution falls through to the remainder of the function, which calls release_sock() and nfc_llcp_sock_put() again. This results in a double release_sock() and a refcount underflow via double nfc_llcp_sock_put(), leading to a use-after-free. Add the missing return statements after the LLCP_CLOSED branches in both functions to prevent the fall-through. Fixes: `d646960f79` ("NFC: Initial LLCP support") Signed-off-by: Junxi Qian <qjx1298677004@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260408081006.3723-1-qjx1298677004@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-12 11:12:44 -07:00
Jamal Hadi Salim	f462dca0c8	net/sched: act_ct: Only release RCU read lock after ct_ft When looking up a flow table in act_ct in tcf_ct_flow_table_get(), rhashtable_lookup_fast() internally opens and closes an RCU read critical section before returning ct_ft. The tcf_ct_flow_table_cleanup_work() can complete before refcount_inc_not_zero() is invoked on the returned ct_ft resulting in a UAF on the already freed ct_ft object. This vulnerability can lead to privilege escalation. Analysis from zdi-disclosures@trendmicro.com: When initializing act_ct, tcf_ct_init() is called, which internally triggers tcf_ct_flow_table_get(). static int tcf_ct_flow_table_get(struct net net, struct tcf_ct_params params) { struct zones_ht_key key = { .net = net, .zone = params->zone }; struct tcf_ct_flow_table ct_ft; int err = -ENOMEM; mutex_lock(&zones_mutex); ct_ft = rhashtable_lookup_fast(&zones_ht, &key, zones_params); // [1] if (ct_ft && refcount_inc_not_zero(&ct_ft->ref)) // [2] goto out_unlock; ... } static __always_inline void rhashtable_lookup_fast( struct rhashtable ht, const void key, const struct rhashtable_params params) { void obj; rcu_read_lock(); obj = rhashtable_lookup(ht, key, params); rcu_read_unlock(); return obj; } At [1], rhashtable_lookup_fast() looks up and returns the corresponding ct_ft from zones_ht . The lookup is performed within an RCU read critical section through rcu_read_lock() / rcu_read_unlock(), which prevents the object from being freed. However, at the point of function return, rcu_read_unlock() has already been called, and there is nothing preventing ct_ft from being freed before reaching refcount_inc_not_zero(&ct_ft->ref) at [2]. This interval becomes the race window, during which ct_ft can be freed. Free Process: tcf_ct_flow_table_put() is executed through the path tcf_ct_cleanup() call_rcu() tcf_ct_params_free_rcu() tcf_ct_params_free() tcf_ct_flow_table_put(). static void tcf_ct_flow_table_put(struct tcf_ct_flow_table ct_ft) { if (refcount_dec_and_test(&ct_ft->ref)) { rhashtable_remove_fast(&zones_ht, &ct_ft->node, zones_params); INIT_RCU_WORK(&ct_ft->rwork, tcf_ct_flow_table_cleanup_work); // [3] queue_rcu_work(act_ct_wq, &ct_ft->rwork); } } At [3], tcf_ct_flow_table_cleanup_work() is scheduled as RCU work static void tcf_ct_flow_table_cleanup_work(struct work_struct work) { struct tcf_ct_flow_table ct_ft; struct flow_block *block; ct_ft = container_of(to_rcu_work(work), struct tcf_ct_flow_table, rwork); nf_flow_table_free(&ct_ft->nf_ft); block = &ct_ft->nf_ft.flow_block; down_write(&ct_ft->nf_ft.flow_block_lock); WARN_ON(!list_empty(&block->cb_list)); up_write(&ct_ft->nf_ft.flow_block_lock); kfree(ct_ft); // [4] module_put(THIS_MODULE); } tcf_ct_flow_table_cleanup_work() frees ct_ft at [4]. When this function executes between [1] and [2], UAF occurs. This race condition has a very short race window, making it generally difficult to trigger. Therefore, to trigger the vulnerability an msleep(100) was inserted after[1] Fixes: `138470a9b2` ("net/sched: act_ct: fix lockdep splat in tcf_ct_flow_table_get") Reported-by: zdi-disclosures@trendmicro.com Tested-by: Victor Nogueira <victor@mojatatu.com> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://patch.msgid.link/20260410111627.46611-1-jhs@mojatatu.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-12 09:26:15 -07:00
Jakub Kicinski	372169fd95	Merge tag 'linux-can-fixes-for-7.0-20260409' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can Marc Kleine-Budde says: ==================== pull-request: can 2026-04-09 Johan Hovold's patch fixes the a devres lifetime in the ucan driver. The last patch is by Samuel Page and fixes a use-after-free in raw_rcv() in the CAN_RAW protocol. * tag 'linux-can-fixes-for-7.0-20260409' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can: can: raw: fix ro->uniq use-after-free in raw_rcv() can: ucan: fix devres lifetime ==================== Link: https://patch.msgid.link/20260409165942.588421-1-mkl@pengutronix.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-12 09:24:16 -07:00
Davide Caratti	65782b2db7	net/sched: cls_fw: fix NULL dereference of "old" filters before change() Like pointed out by Sashiko [1], since commit `ed76f5edcc` ("net: sched: protect filter_chain list with filter_chain_lock mutex") TC filters are added to a shared block and published to datapath before their ->change() function is called. This is a problem for cls_fw: an invalid filter created with the "old" method can still classify some packets before it is destroyed by the validation logic added by Xiang. Therefore, insisting with repeated runs of the following script: # ip link add dev crash0 type dummy # ip link set dev crash0 up # mausezahn crash0 -c 100000 -P 10 \ > -A 4.3.2.1 -B 1.2.3.4 -t udp "dp=1234" -q & # sleep 1 # tc qdisc add dev crash0 egress_block 1 clsact # tc filter add block 1 protocol ip prio 1 matchall \ > action skbedit mark 65536 continue # tc filter add block 1 protocol ip prio 2 fw # ip link del dev crash0 can still make fw_classify() hit the WARN_ON() in [2]: WARNING: ./include/net/pkt_cls.h:88 at fw_classify+0x244/0x250 [cls_fw], CPU#18: mausezahn/1399 Modules linked in: cls_fw(E) act_skbedit(E) CPU: 18 UID: 0 PID: 1399 Comm: mausezahn Tainted: G E 7.0.0-rc6-virtme #17 PREEMPT(full) Tainted: [E]=UNSIGNED_MODULE Hardware name: Red Hat KVM, BIOS 1.16.3-2.el9 04/01/2014 RIP: 0010:fw_classify+0x244/0x250 [cls_fw] Code: 5c 49 c7 45 00 00 00 00 00 41 5d 41 5e 41 5f 5d c3 cc cc cc cc 5b b8 ff ff ff ff 41 5c 41 5d 41 5e 41 5f 5d c3 cc cc cc cc 90 <0f> 0b 90 eb a0 0f 1f 80 00 00 00 00 90 90 90 90 90 90 90 90 90 90 RSP: 0018:ffffd1b7026bf8a8 EFLAGS: 00010202 RAX: ffff8c5ac9c60800 RBX: ffff8c5ac99322c0 RCX: 0000000000000004 RDX: 0000000000000001 RSI: ffff8c5b74d7a000 RDI: ffff8c5ac8284f40 RBP: ffffd1b7026bf8d0 R08: 0000000000000000 R09: ffffd1b7026bf9b0 R10: 00000000ffffffff R11: 0000000000000000 R12: 0000000000010000 R13: ffffd1b7026bf930 R14: ffff8c5ac8284f40 R15: 0000000000000000 FS: 00007fca40c37740(0000) GS:ffff8c5b74d7a000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fca40e822a0 CR3: 0000000005ca0001 CR4: 0000000000172ef0 Call Trace: <TASK> tcf_classify+0x17d/0x5c0 tc_run+0x9d/0x150 __dev_queue_xmit+0x2ab/0x14d0 ip_finish_output2+0x340/0x8f0 ip_output+0xa4/0x250 raw_sendmsg+0x147d/0x14b0 __sys_sendto+0x1cc/0x1f0 __x64_sys_sendto+0x24/0x30 do_syscall_64+0x126/0xf80 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7fca40e822ba Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 f3 0f 1e fa 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 15 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 7e c3 0f 1f 44 00 00 41 54 48 83 ec 30 44 89 RSP: 002b:00007ffc248a42c8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c RAX: ffffffffffffffda RBX: 000055ef233289d0 RCX: 00007fca40e822ba RDX: 000000000000001e RSI: 000055ef23328c30 RDI: 0000000000000003 RBP: 000055ef233289d0 R08: 00007ffc248a42d0 R09: 0000000000000010 R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000001e R13: 00000000000186a0 R14: 0000000000000000 R15: 00007fca41043000 </TASK> irq event stamp: 1045778 hardirqs last enabled at (1045784): [<ffffffff864ec042>] __up_console_sem+0x52/0x60 hardirqs last disabled at (1045789): [<ffffffff864ec027>] __up_console_sem+0x37/0x60 softirqs last enabled at (1045426): [<ffffffff874d48c7>] __alloc_skb+0x207/0x260 softirqs last disabled at (1045434): [<ffffffff874fe8f8>] __dev_queue_xmit+0x78/0x14d0 Then, because of the value in the packet's mark, dereference on 'q->handle' with NULL 'q' occurs: BUG: kernel NULL pointer dereference, address: 0000000000000038 [...] RIP: 0010:fw_classify+0x1fe/0x250 [cls_fw] [...] Skip "old-style" classification on shared blocks, so that the NULL dereference is fixed and WARN_ON() is not hit anymore in the short lifetime of invalid cls_fw "old-style" filters. [1] https://sashiko.dev/#/patchset/20260331050217.504278-1-xmei5%40asu.edu [2] https://elixir.bootlin.com/linux/v7.0-rc6/source/include/net/pkt_cls.h#L86 Fixes: `faeea8bbf6` ("net/sched: cls_fw: fix NULL pointer dereference on shared blocks") Fixes: `ed76f5edcc` ("net: sched: protect filter_chain list with filter_chain_lock mutex") Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Davide Caratti <dcaratti@redhat.com> Link: https://patch.msgid.link/e39cbd3103a337f1e515d186fe697b4459d24757.1775661704.git.dcaratti@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-12 08:49:13 -07:00
Lorenzo Bianconi	02f7296439	net: airoha: Fix FE_PSE_BUF_SET configuration if PPE2 is available airoha_fe_set routine is used to set specified bits to 1 in the selected register. In the FE_PSE_BUF_SET case this can due to a overestimation of the required buffers for I/O queues since we can miss to set some bits of PSE_ALLRSV_MASK subfield to 0. Fix the issue relying on airoha_fe_rmw routine instead. Fixes: `8e38e08f2c` ("net: airoha: fix PSE memory configuration in airoha_fe_pse_ports_init()") Tested-by: Xuegang Lu <xuegang.lu@airoha.com> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Link: https://patch.msgid.link/20260408-airoha-reg_fe_pse_buf_set-v1-1-0c4fa8f4d1d9@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-10 16:05:05 -07:00
Mashiro Chen	bf9a38803b	net: hamradio: 6pack: fix uninit-value in sixpack_receive_buf sixpack_receive_buf() does not properly skip bytes with TTY error flags. The while loop iterates through the flags buffer but never advances the data pointer (cp), and passes the original count (including error bytes) to sixpack_decode(). This causes sixpack_decode() to process bytes that should have been skipped due to TTY errors. The TTY layer does not guarantee that cp[i] holds a meaningful value when fp[i] is set, so passing those positions to sixpack_decode() results in KMSAN reporting an uninit-value read. Fix this by processing bytes one at a time, advancing cp on each iteration, and only passing valid (non-error) bytes to sixpack_decode(). This matches the pattern used by slip_receive_buf() and mkiss_receive_buf() for the same purpose. Reported-by: syzbot+ecdb8c9878a81eb21e54@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=ecdb8c9878a81eb21e54 Fixes: `1da177e4c3` ("Linux-2.6.12-rc2") Signed-off-by: Mashiro Chen <mashiro.chen@mailbox.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260407173101.107352-1-mashiro.chen@mailbox.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-10 15:25:59 -07:00
Mohsin Bashir	12ff2a4aee	eth: fbnic: Use wake instead of start fbnic_up() calls netif_tx_start_all_queues(), which only clears __QUEUE_STATE_DRV_XOFF. If qdisc backlog has accumulated on any TX queue before the reconfiguration (e.g. ring resize via ethtool -G), start does not call __netif_schedule() to kick the qdisc, so the pending backlog is never drained and the queue stalls. Switch to netif_tx_wake_all_queues(), which clears DRV_XOFF and also calls __netif_schedule() on every queue, ensuring any backlog that built up before the down/up cycle is promptly dequeued. Fixes: `bc6107771b` ("eth: fbnic: Allocate a netdevice and napi vectors with queues") Signed-off-by: Mohsin Bashir <hmohsin@meta.com> Link: https://patch.msgid.link/20260408002415.2963915-1-mohsin.bashr@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-09 20:36:26 -07:00
Lorenzo Bianconi	4ae0604a06	net: airoha: Add dma_rmb() and READ_ONCE() in airoha_qdma_rx_process() Add missing dma_rmb() in airoha_qdma_rx_process routine to make sure the DMA read operations are completed when the NIC reports the processing on the current descriptor is done. Moreover, add missing READ_ONCE() in airoha_qdma_rx_process() for DMA descriptor control fields in order to avoid any compiler reordering. Fixes: `23020f0493` ("net: airoha: Introduce ethernet support for EN7581 SoC") Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Link: https://patch.msgid.link/20260407-airoha_qdma_rx_process-fix-reordering-v3-1-91c36e9da31f@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-09 20:35:15 -07:00
Jiawen Wu	e159f05e12	net: txgbe: fix RTNL assertion warning when remove module For the copper NIC with external PHY, the driver called phylink_connect_phy() during probe and phylink_disconnect_phy() during remove. It caused an RTNL assertion warning in phylink_disconnect_phy() upon module remove. To fix this, add rtnl_lock() and rtnl_unlock() around the phylink_disconnect_phy() in remove function. ------------[ cut here ]------------ RTNL: assertion failed at drivers/net/phy/phylink.c (2351) WARNING: drivers/net/phy/phylink.c:2351 at phylink_disconnect_phy+0xd8/0xf0 [phylink], CPU#0: rmmod/4464 Modules linked in: ... CPU: 0 UID: 0 PID: 4464 Comm: rmmod Kdump: loaded Not tainted 7.0.0-rc4+ Hardware name: Micro-Star International Co., Ltd. MS-7E16/X670E GAMING PLUS WIFI (MS-7E16), BIOS 1.90 12/31/2024 RIP: 0010:phylink_disconnect_phy+0xe4/0xf0 [phylink] Code: 5b 41 5c 41 5d 41 5e 41 5f 5d 31 c0 31 d2 31 f6 31 ff e9 3a 38 8f e7 48 8d 3d 48 87 e2 ff ba 2f 09 00 00 48 c7 c6 c1 22 24 c0 <67> 48 0f b9 3a e9 34 ff ff ff 66 90 90 90 90 90 90 90 90 90 90 90 RSP: 0018:ffffce7288363ac0 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffff89654b2a1a00 RCX: 0000000000000000 RDX: 000000000000092f RSI: ffffffffc02422c1 RDI: ffffffffc0239020 RBP: ffffce7288363ae8 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: ffff8964c4022000 R13: ffff89654fce3028 R14: ffff89654ebb4000 R15: ffffffffc0226348 FS: 0000795e80d93780(0000) GS:ffff896c52857000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00005b528b592000 CR3: 0000000170d0f000 CR4: 0000000000f50ef0 PKRU: 55555554 Call Trace: <TASK> txgbe_remove_phy+0xbb/0xd0 [txgbe] txgbe_remove+0x4c/0xb0 [txgbe] pci_device_remove+0x41/0xb0 device_remove+0x43/0x80 device_release_driver_internal+0x206/0x270 driver_detach+0x4a/0xa0 bus_remove_driver+0x83/0x120 driver_unregister+0x2f/0x60 pci_unregister_driver+0x40/0x90 txgbe_driver_exit+0x10/0x850 [txgbe] __do_sys_delete_module.isra.0+0x1c3/0x2f0 __x64_sys_delete_module+0x12/0x20 x64_sys_call+0x20c3/0x2390 do_syscall_64+0x11c/0x1500 ? srso_alias_return_thunk+0x5/0xfbef5 ? do_syscall_64+0x15a/0x1500 ? srso_alias_return_thunk+0x5/0xfbef5 ? do_fault+0x312/0x580 ? srso_alias_return_thunk+0x5/0xfbef5 ? __handle_mm_fault+0x9d5/0x1040 ? srso_alias_return_thunk+0x5/0xfbef5 ? count_memcg_events+0x101/0x1d0 ? srso_alias_return_thunk+0x5/0xfbef5 ? handle_mm_fault+0x1e8/0x2f0 ? srso_alias_return_thunk+0x5/0xfbef5 ? do_user_addr_fault+0x2f8/0x820 ? srso_alias_return_thunk+0x5/0xfbef5 ? irqentry_exit+0xb2/0x600 ? srso_alias_return_thunk+0x5/0xfbef5 ? exc_page_fault+0x92/0x1c0 entry_SYSCALL_64_after_hwframe+0x76/0x7e Fixes: `02b2a6f91b` ("net: txgbe: support copper NIC with external PHY") Cc: stable@vger.kernel.org Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com> Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Link: https://patch.msgid.link/8B47A5872884147D+20260407094041.4646-1-jiawenwu@trustnetic.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-09 20:33:43 -07:00
Jakub Kicinski	8832e5791d	Merge branch 'net-bcmgenet-fix-queue-lock-up' Justin Chen says: ==================== net: bcmgenet: fix queue lock up We have been seeing reports of logs like this. [ 41.761198] bcmgenet 1001300000.ethernet eth0: NETDEV WATCHDOG: CPU: 0: transmit queue 2 timed out 10039 ms [ 43.745198] bcmgenet 1001300000.ethernet eth0: NETDEV WATCHDOG: CPU: 0: transmit queue 2 timed out 12023 ms [ 45.729198] bcmgenet 1001300000.ethernet eth0: NETDEV WATCHDOG: CPU: 0: transmit queue 2 timed out 14007 ms We have two issues. The persistent queue timeouts and the eventual lock up of the entire transmit. We address the lock up issue first. The queue timeouts are due to a fundamental design issue not a bug perse. Timeouts still persist, but we should no longer lock up. ==================== Link: https://patch.msgid.link/20260406175756.134567-1-justin.chen@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-09 20:19:48 -07:00
Justin Chen	5393b2b5be	net: bcmgenet: fix racing timeout handler The bcmgenet_timeout handler tries to take down all tx queues when a single queue times out. This is over zealous and causes many race conditions with queues that are still chugging along. Instead lets only restart the timed out queue. Fixes: `13ea657806` ("net: bcmgenet: improve TX timeout") Signed-off-by: Justin Chen <justin.chen@broadcom.com> Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com> Reviewed-by: Nicolai Buchwitz <nb@tipi-net.de> Tested-by: Nicolai Buchwitz <nb@tipi-net.de> Link: https://patch.msgid.link/20260406175756.134567-4-justin.chen@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-09 20:19:45 -07:00
Justin Chen	3f3168300e	net: bcmgenet: fix leaking free_bds While reclaiming the tx queue we fast forward the write pointer to drop any data in flight. These dropped frames are not added back to the pool of free bds. We also need to tell the netdev that we are dropping said data. Fixes: `f1bacae8b6` ("net: bcmgenet: support reclaiming unsent Tx packets") Signed-off-by: Justin Chen <justin.chen@broadcom.com> Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com> Reviewed-by: Nicolai Buchwitz <nb@tipi-net.de> Tested-by: Nicolai Buchwitz <nb@tipi-net.de> Link: https://patch.msgid.link/20260406175756.134567-3-justin.chen@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-09 20:19:45 -07:00
Justin Chen	57f3f53d2c	net: bcmgenet: fix off-by-one in bcmgenet_put_txcb The write_ptr points to the next open tx_cb. We want to return the tx_cb that gets rewinded, so we must rewind the pointer first then return the tx_cb that it points to. That way the txcb can be correctly cleaned up. Fixes: `876dbadd53` ("net: bcmgenet: Fix unmapping of fragments in bcmgenet_xmit()") Signed-off-by: Justin Chen <justin.chen@broadcom.com> Reviewed-by: Nicolai Buchwitz <nb@tipi-net.de> Link: https://patch.msgid.link/20260406175756.134567-2-justin.chen@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-09 20:19:45 -07:00
Jakub Kicinski	b02e3c4c80	Merge branch 'macsec-add-support-for-vlan-filtering-in-offload-mode' Cosmin Ratiu says: ==================== macsec: Add support for VLAN filtering in offload mode This short series adds support for VLANs in MACsec devices when offload mode is enabled. This allows VLAN netdevs on top of MACsec netdevs to function, which accidentally used to be the case in the past, but was broken. This series adds back proper support. As part of this, the existing nsim-only MACsec offload tests were translated to Python so they can run against real HW and new traffic-based tests were added for VLAN filter propagation, since there's currently no uAPI to check VLAN filters. ==================== Link: https://patch.msgid.link/20260408115240.1636047-1-cratiu@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-09 19:38:45 -07:00
Cosmin Ratiu	a363b1c8be	macsec: Support VLAN-filtering lower devices VLAN-filtering is done through two netdev features (NETIF_F_HW_VLAN_CTAG_FILTER and NETIF_F_HW_VLAN_STAG_FILTER) and two netdev ops (ndo_vlan_rx_add_vid and ndo_vlan_rx_kill_vid). Implement these and advertise the features if the lower device supports them. This allows proper VLAN filtering to work on top of MACsec devices, when the lower device is capable of VLAN filtering. As a concrete example, having this chain of interfaces now works: vlan_filtering_capable_dev(1) -> macsec_dev(2) -> macsec_vlan_dev(3) Before the mentioned commit this used to accidentally work because the MACsec device (and thus the lower device) was put in promiscuous mode and the VLAN filter was not used. But after commit [1] correctly made the macsec driver expose the IFF_UNICAST_FLT flag, promiscuous mode was no longer used and VLAN filters on dev 1 kicked in. Without support in dev 2 for propagating VLAN filters down, the register_vlan_dev -> vlan_vid_add -> __vlan_vid_add -> vlan_add_rx_filter_info call from dev 3 is silently eaten (because vlan_hw_filter_capable returns false and vlan_add_rx_filter_info silently succeeds). For MACsec, VLAN filters are only relevant for offload, otherwise the VLANs are encrypted and the lower devices don't care about them. So VLAN filters are only passed on to lower devices in offload mode. Flipping between offload modes now needs to offload/unoffload the filters with vlan_{get,drop}_rx_*_filter_info(). To avoid the back-and-forth filter updating during rollback, the setting of macsec->offload is moved after the add/del secy ops. This is safe since none of the code called from those requires macsec->offload. In case adding the filters fails, the added ones are rolled back and an error is returned to the operation toggling the offload state. Fixes: `0349659fd7` ("macsec: set IFF_UNICAST_FLT priv flag") Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Link: https://patch.msgid.link/20260408115240.1636047-5-cratiu@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-09 19:38:42 -07:00

1 2 3 4 5 ...

1429987 Commits