linux

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2026-04-04 21:42:31 -04:00

Author	SHA1	Message	Date
Eric Dumazet	0170d7f47c	net_sched: add tcf_kfree_skb_list() helper Using kfree_skb_list_reason() to free list of skbs from qdisc operations seems wrong as each skb might have a different drop reason. Cleanup __dev_xmit_skb() to call tcf_kfree_skb_list() once in preparation of the following patch. Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20251121083256.674562-13-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-11-25 16:10:32 +01:00
Eric Dumazet	4792c3a4c1	net: annotate a data-race in __dev_xmit_skb() q->limit is read locklessly, add a READ_ONCE(). Fixes: `100dfa74ca` ("net: dev_queue_xmit() llist adoption") Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20251121083256.674562-12-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-11-25 16:10:32 +01:00
Eric Dumazet	b2e9821cff	net: prefech skb->priority in __dev_xmit_skb() Most qdiscs need to read skb->priority at enqueue() time. In commit `100dfa74ca` ("net: dev_queue_xmit() llist adoption") I added a prefetch(next), let's add another one for the second half of skb. Note that skb->priority and skb->hash share a common cache line, so this patch helps qdiscs needing both fields. Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20251121083256.674562-11-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-11-25 16:10:32 +01:00
Eric Dumazet	2f9babc04d	net_sched: sch_fq: prefetch one skb ahead in dequeue() prefetch the skb that we are likely to dequeue at the next dequeue(). Also call fq_dequeue_skb() a bit sooner in fq_dequeue(). This reduces the window between read of q.qlen and changes of fields in the cache line that could be dirtied by another cpu trying to queue a packet. Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20251121083256.674562-10-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-11-25 16:10:32 +01:00
Eric Dumazet	3c1100f042	net_sched: sch_fq: move qdisc_bstats_update() to fq_dequeue_skb() Group together changes to qdisc fields to reduce chances of false sharing if another cpu attempts to acquire the qdisc spinlock. qdisc_qstats_backlog_dec(sch, skb); sch->q.qlen--; qdisc_bstats_update(sch, skb); Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20251121083256.674562-9-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-11-25 16:10:32 +01:00
Eric Dumazet	ad50d5a3fc	net_sched: add Qdisc_read_mostly and Qdisc_write groups It is possible to reorg Qdisc to avoid always dirtying 2 cache lines in fast path by reducing this to a single dirtied cache line. In current layout, we change only four/six fields in the first cache line: - q.spinlock - q.qlen - bstats.bytes - bstats.packets - some Qdisc also change q.next/q.prev In the second cache line we change in the fast path: - running - state - qstats.backlog
    /* --- cacheline 2 boundary (128 bytes) --- */
    struct sk_buff_head gso_skb __attribute__((__aligned__(64))); /* 0x80 0x18 */
    struct qdisc_skb_head q; /* 0x98 0x18 */
    struct gnet_stats_basic_sync bstats __attribute__((__aligned__(16))); /* 0xb0 0x10 */
    /* --- cacheline 3 boundary (192 bytes) --- */
    struct gnet_stats_queue qstats; /* 0xc0 0x14 */
    bool running; /* 0xd4 0x1 */
    /* XXX 3 bytes hole, try to pack */
    unsigned long state; /* 0xd8 0x8 */
    struct Qdisc *next_sched; /* 0xe0 0x8 */
    struct sk_buff_head skb_bad_txq; /* 0xe8 0x18 */
    /* --- cacheline 4 boundary (256 bytes) --- */
Reorganize things to have a first cache line mostly read, then a mostly written one. This gives a ~3% increase of performance under tx stress. Note that there is an additional hole because @qstats now spans over a third cache line.
    /* --- cacheline 2 boundary (128 bytes) --- */
    __u8 __cacheline_group_begin__Qdisc_read_mostly[0] __attribute__((__aligned__(64))); /* 0x80 0 */
    struct sk_buff_head gso_skb; /* 0x80 0x18 */
    struct Qdisc *next_sched; /* 0x98 0x8 */
    struct sk_buff_head skb_bad_txq; /* 0xa0 0x18 */
    __u8 __cacheline_group_end__Qdisc_read_mostly[0]; /* 0xb8 0 */
    /* XXX 8 bytes hole, try to pack */
    /* --- cacheline 3 boundary (192 bytes) --- */
    __u8 __cacheline_group_begin__Qdisc_write[0] __attribute__((__aligned__(64))); /* 0xc0 0 */
    struct qdisc_skb_head q; /* 0xc0 0x18 */
    unsigned long state; /* 0xd8 0x8 */
    struct gnet_stats_basic_sync bstats __attribute__((__aligned__(16))); /* 0xe0 0x10 */
    bool running; /* 0xf0 0x1 */
    /* XXX 3 bytes hole, try to pack */
    struct gnet_stats_queue qstats; /* 0xf4 0x14 */
    /* --- cacheline 4 boundary (256 bytes) was 8 bytes ago --- */
    __u8 __cacheline_group_end__Qdisc_write[0]; /* 0x108 0 */
    /* XXX 56 bytes hole, try to pack */
Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20251121083256.674562-8-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-11-25 16:10:32 +01:00
Eric Dumazet	c5d34f4583	net_sched: cake: use qdisc_pkt_segs() Use new qdisc_pkt_segs() to avoid a cache line miss in cake_enqueue() for non GSO packets. cake_overhead() does not have to recompute it. Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20251121083256.674562-7-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-11-25 16:10:32 +01:00
Eric Dumazet	2773cb0b31	net_sched: use qdisc_skb_cb(skb)->pkt_segs in bstats_update() Avoid up to two cache line misses in qdisc dequeue() to fetch skb_shinfo(skb)->gso_segs/gso_size while qdisc spinlock is held. This gives a 5 % improvement in a TX intensive workload. Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20251121083256.674562-6-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-11-25 16:10:32 +01:00
Eric Dumazet	f9e00e51e3	net: use qdisc_pkt_len_segs_init() in sch_handle_ingress() sch_handle_ingress() sets qdisc_skb_cb(skb)->pkt_len. We also need to initialize qdisc_skb_cb(skb)->pkt_segs. Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20251121083256.674562-5-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-11-25 16:10:31 +01:00
Eric Dumazet	874c1928d3	net_sched: initialize qdisc_skb_cb(skb)->pkt_segs in qdisc_pkt_len_init() qdisc_pkt_len_init() is currently initializing qdisc_skb_cb(skb)->pkt_len. Add qdisc_skb_cb(skb)->pkt_segs initialization and rename this function to qdisc_pkt_len_segs_init(). Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20251121083256.674562-4-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-11-25 16:10:31 +01:00
Eric Dumazet	be1b70ab21	net: init shinfo->gso_segs from qdisc_pkt_len_init() Qdisc use shinfo->gso_segs for their pkts stats in bstats_update(), but this field needs to be initialized for SKB_GSO_DODGY users. Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20251121083256.674562-3-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-11-25 16:10:31 +01:00
Eric Dumazet	b2a38f6df9	net_sched: make room for (struct qdisc_skb_cb)->pkt_segs Add a new u16 field, next to pkt_len: pkt_segs. This will cache shinfo->gso_segs to speed up qdisc dequeue(). Move slave_dev_queue_mapping at the end of qdisc_skb_cb, and move three bits from tc_skb_cb: - post_ct - post_ct_snat - post_ct_dnat Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20251121083256.674562-2-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-11-25 16:10:31 +01:00
Jacky Chou	e3daf0e7fe	dt-bindings: net: aspeed: add AST2700 MDIO compatible Add "aspeed,ast2700-mdio" compatible to the binding schema with a fallback to "aspeed,ast2600-mdio". Although the MDIO controller on AST2700 is functionally the same as the one on AST2600, it's good practice to add a SoC-specific compatible for new silicon. This allows future driver updates to handle any 2700-specific integration issues without requiring devicetree changes or complex runtime detection logic. For now, the driver continues to bind via the existing "aspeed,ast2600-mdio" compatible, so no driver changes are needed. Acked-by: Conor Dooley <conor.dooley@microchip.com> Signed-off-by: Jacky Chou <jacky_chou@aspeedtech.com> Link: https://patch.msgid.link/20251120-aspeed_mdio_ast2700-v2-1-0d722bfb2c54@aspeedtech.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-11-25 12:11:47 +01:00
Jakub Kicinski	cc1b62512a	Merge branch 'mptcp-memcg-accounting-for-passive-sockets-backlog-processing' Matthieu Baerts says: ==================== mptcp: memcg accounting for passive sockets & backlog processing This series is split in two: the 4 first patches are linked to memcg accounting for passive sockets, and the rest introduce the backlog processing. They are sent together, because the first one appeared to be needed to get the second one fully working. The second part includes RX path improvement built around backlog processing. The main goals are improving the RX performances _and_ increase the long term maintainability. - Patches 1-3: preparation work to ease the introduction of the next patch. - Patch 4: fix memcg accounting for passive sockets. Note that this is a (non-urgent) fix, but it depends on material that is currently only in net-next, e.g. commit `4a997d49d9` ("tcp: Save lock_sock() for memcg in inet_csk_accept()."). - Patches 5-6: preparation of the stack for backlog processing, removing assumptions that will not hold true any more after the backlog introduction. - Patches 7,8,10,11,12 are more cleanups that will make the backlog patch a little less huge. - Patch 9: somewhat an unrelated cleanup, included here not to forget about it. - Patches 13-14: The real work is done by them. Patch 13 introduces the helpers needed to manipulate the msk-level backlog, and the data struct itself, without any actual functional change. Patch 14 finally uses the backlog for RX skb processing. Note that MPTCP can't use the sk_backlog, as the MPTCP release callback can also release and re-acquire the msk-level spinlock and core backlog processing works under the assumption that such event is not possible. A relevant point is memory accounts for skbs in the backlog. It's somewhat "original" due to MPTCP constraints. 
Such skbs use space from the incoming subflow receive buffer, do not use explicitly any forward allocated memory, as we can't update the msk fwd mem while enqueuing, nor we want to acquire again the ssk socket lock while processing the skbs. Instead the msk borrows memory from the subflow and reserve it for the backlog, see patch 5 and 14 for the gory details. ==================== Link: https://patch.msgid.link/20251121-net-next-mptcp-memcg-backlog-imp-v1-0-1f34b6c1e0b1@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-11-24 20:23:43 -08:00
Paolo Abeni	6228efe0cc	mptcp: leverage the backlog for RX packet processing When the msk socket is owned or the msk receive buffer is full, move the incoming skbs in a msk level backlog list. This avoids traversing the joined subflows and acquiring the subflow level socket lock at reception time, improving the RX performances. When processing the backlog, use the fwd alloc memory borrowed from the incoming subflow. skbs exceeding the msk receive space are not dropped; instead they are kept in the backlog until the receive buffer is freed. Dropping packets already acked at the TCP level is explicitly discouraged by the RFC and would corrupt the data stream for fallback sockets. Special care is needed to avoid adding skbs to the backlog of a closed msk and to avoid leaving dangling references into the backlog at subflow closing time. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20251121-net-next-mptcp-memcg-backlog-imp-v1-14-1f34b6c1e0b1@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-11-24 19:49:43 -08:00
Paolo Abeni	ee458a3f31	mptcp: introduce mptcp-level backlog We are soon using it for incoming data processing. MPTCP can't leverage the sk_backlog, as the latter is processed before the release callback, and such callback for MPTCP releases and re-acquire the socket spinlock, breaking the sk_backlog processing assumption. Add a skb backlog list inside the mptcp sock struct, and implement basic helper to transfer packet to and purge such list. Packets in the backlog are memory accounted and still use the incoming subflow receive memory, to allow back-pressure. The backlog size is implicitly bounded to the sum of subflows rcvbuf. When a subflow is closed, references from the backlog to such sock are removed. No packet is currently added to the backlog, so no functional changes intended here. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20251121-net-next-mptcp-memcg-backlog-imp-v1-13-1f34b6c1e0b1@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-11-24 19:49:43 -08:00
Paolo Abeni	9db5b3cec4	mptcp: borrow forward memory from subflow In the MPTCP receive path, we release the subflow allocated fwd memory just to allocate it again shortly after for the msk. That could increase the chance of failure, especially once backlog processing is added, as other actions could consume the just released memory before the msk socket has a chance to do the rcv allocation. Replace the skb_orphan() call with an open-coded variant that explicitly borrows the fwd memory from the subflow socket instead of releasing it. The borrowed memory does not have PAGE_SIZE granularity; rounding to the page size will make the fwd allocated memory higher than what is strictly required and could make the incoming subflow fwd mem consistently negative. Instead, keep track of the accumulated frag and borrow the full page at subflow close time. This allows removing the last drop in the TCP to MPTCP transition and the associated, now unused, MIB. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20251121-net-next-mptcp-memcg-backlog-imp-v1-12-1f34b6c1e0b1@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-11-24 19:49:42 -08:00
Paolo Abeni	0eeb372dee	mptcp: handle first subflow closing consistently Currently, as soon as the PM closes a subflow, the msk stops accepting data from it, even if the TCP socket could be still formally open in the incoming direction, with the notable exception of the first subflow. The root cause of such behavior is that the code currently piggybacks two separate semantics on the subflow->disposable bit: the subflow context must be released and the subflow must stop accepting incoming data. The first subflow is never disposed, so it also never stops accepting incoming data. Use a separate bit to mark the latter status and set such bit in __mptcp_close_ssk() for all subflows. Beyond making per subflow behaviour more consistent this will also simplify the next patch. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20251121-net-next-mptcp-memcg-backlog-imp-v1-11-1f34b6c1e0b1@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-11-24 19:49:42 -08:00
Paolo Abeni	38a4a469c8	mptcp: drop the __mptcp_data_ready() helper It adds little clarity and there is a single user of such helper, just inline it in the caller. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Geliang Tang <geliang@kernel.org> Tested-by: Geliang Tang <geliang@kernel.org> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20251121-net-next-mptcp-memcg-backlog-imp-v1-10-1f34b6c1e0b1@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-11-24 19:49:42 -08:00
Paolo Abeni	9d82959603	mptcp: make mptcp_destroy_common() static Such function is only used inside protocol.c, there is no need to expose it to the whole stack. Note that the function definition must be moved earlier to avoid forward declaration. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Geliang Tang <geliang@kernel.org> Tested-by: Geliang Tang <geliang@kernel.org> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20251121-net-next-mptcp-memcg-backlog-imp-v1-9-1f34b6c1e0b1@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-11-24 19:49:42 -08:00
Paolo Abeni	48a395605e	mptcp: do not miss early first subflow close event notification The MPTCP protocol is not currently emitting the NL event when the first subflow is closed before msk accept() time. By replacing the close helper in use in such scenario, implicitly introduce the missing notification. Note that in such scenario we want to be sure that mptcp_close_ssk() will not trigger any PM work, move the msk state change update earlier, so that the previous patch will offer such guarantee. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Geliang Tang <geliang@kernel.org> Tested-by: Geliang Tang <geliang@kernel.org> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20251121-net-next-mptcp-memcg-backlog-imp-v1-8-1f34b6c1e0b1@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-11-24 19:49:42 -08:00
Paolo Abeni	2ca1b8926f	mptcp: ensure the kernel PM does not take action too late The PM hooks can currently take place when the msk is already shutting down. Subflow creation will fail, thanks to the existing check at join time, but we can entirely avoid starting the to be failed operations. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Geliang Tang <geliang@kernel.org> Tested-by: Geliang Tang <geliang@kernel.org> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20251121-net-next-mptcp-memcg-backlog-imp-v1-7-1f34b6c1e0b1@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-11-24 19:49:42 -08:00
Paolo Abeni	2834f8edd7	mptcp: cleanup fallback dummy mapping generation MPTCP currently accesses ack_seq outside the msk socket lock scope to generate the dummy mapping for fallback socket. Soon we are going to introduce backlog usage and even for fallback socket the ack_seq value will be significantly off outside of the msk socket lock scope. Avoid relying on ack_seq for dummy mapping generation, using instead the subflow sequence number. Note that in case of disconnect() and (re)connect() we must ensure that any previous state is re-set. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Geliang Tang <geliang@kernel.org> Tested-by: Geliang Tang <geliang@kernel.org> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20251121-net-next-mptcp-memcg-backlog-imp-v1-6-1f34b6c1e0b1@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-11-24 19:49:41 -08:00
Paolo Abeni	85f22b8e1e	mptcp: cleanup fallback data fin reception MPTCP currently generates a dummy data_fin for fallback socket when the fallback subflow has completed data reception using the current ack_seq. We are going to introduce backlog usage for the msk soon, even for fallback sockets: the ack_seq value will not match the most recent sequence number seen by the fallback subflow socket, as it will ignore data_seq sitting in the backlog. Instead use the last map sequence number to set the data_fin, as fallback (dummy) map sequences are always in sequence. Reviewed-by: Geliang Tang <geliang@kernel.org> Tested-by: Geliang Tang <geliang@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20251121-net-next-mptcp-memcg-backlog-imp-v1-5-1f34b6c1e0b1@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-11-24 19:49:41 -08:00
Paolo Abeni	68c7c38671	mptcp: fix memcg accounting for passive sockets The passive sockets never got proper memcg accounting: the msk socket is associated with the memcg at accept time, but the passive subflows never got it right. At accept time, traverse the subflows list and associate each of them with the msk memcg, and try to do the same at join completion time, if the msk has been already accepted. Fixes: `cf7da0d66c` ("mptcp: Create SUBFLOW socket for incoming connections") Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/298 Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/597 Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20251121-net-next-mptcp-memcg-backlog-imp-v1-4-1f34b6c1e0b1@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-11-24 19:49:41 -08:00
Paolo Abeni	e777a7fb06	mptcp: grafting MPJ subflow earlier Later patches need to ensure that all MPJ subflows are grafted to the msk socket before accept() completion. Currently the grafting happens under the msk socket lock: potentially at msk release_cb time which make satisfying the above condition a bit tricky. Move the MPJ subflow grafting earlier, under the msk data lock, so that we can use such lock as a synchronization point. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20251121-net-next-mptcp-memcg-backlog-imp-v1-3-1f34b6c1e0b1@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-11-24 19:49:41 -08:00
Paolo Abeni	bd92dd8e03	mptcp: factor-out cgroup data inherit helper MPTCP will soon need the same functionality for passive sockets, factor them out in a common helper. No functional change intended. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Geliang Tang <geliang@kernel.org> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20251121-net-next-mptcp-memcg-backlog-imp-v1-2-1f34b6c1e0b1@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-11-24 19:49:41 -08:00
Paolo Abeni	075b19c211	net: factor-out _sk_charge() helper Move out of __inet_accept() the code dealing charging newly accepted socket to memcg. MPTCP will soon use it to on a per subflow basis, in different contexts. No functional changes intended. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Geliang Tang <geliang@kernel.org> Acked-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20251121-net-next-mptcp-memcg-backlog-imp-v1-1-1f34b6c1e0b1@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-11-24 19:49:40 -08:00
Dmitry Skorodumov	f296b73d17	ipvlan: fix sparse warning about __be32 -> u32 Fixed a sparse warning: ipvlan_core.c:56: warning: incorrect type in argument 1 (different base types) expected unsigned int [usertype] a got restricted __be32 const [usertype] s_addr Force cast the s_addr to u32 Signed-off-by: Dmitry Skorodumov <skorodumov.dmitry@huawei.com> Link: https://patch.msgid.link/20251121155112.4182007-1-skorodumov.dmitry@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-11-24 19:48:49 -08:00
Breno Leitao	20c20f05cf	net: mvpp2: extract GRXRINGS from .get_rxnfc Commit `84eaf4359c` ("net: ethtool: add get_rx_ring_count callback to optimize RX ring queries") added specific support for GRXRINGS callback, simplifying .get_rxnfc. Remove the handling of GRXRINGS in .get_rxnfc() by moving it to the new .get_rx_ring_count() for the mvpp2 driver. This simplifies the RX ring count retrieval and aligns mvpp2 with the new ethtool API for querying RX ring parameters, while keeping the other rxnfc handlers (GRXCLSRLCNT, GRXCLSRULE, GRXCLSRLALL) intact. Signed-off-by: Breno Leitao <leitao@debian.org> Link: https://patch.msgid.link/20251121-marvell-v1-2-8338f3e55a4c@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-11-24 19:45:02 -08:00
Breno Leitao	737e14c5dc	net: mvneta: convert to use .get_rx_ring_count Convert the mvneta driver to use the new .get_rx_ring_count ethtool operation instead of implementing .get_rxnfc solely for handling ETHTOOL_GRXRINGS command. This simplifies the code by removing the switch statement and replacing it with a direct return of the queue count. The new callback provides the same functionality in a more direct way, following the ongoing ethtool API modernization. Signed-off-by: Breno Leitao <leitao@debian.org> Link: https://patch.msgid.link/20251121-marvell-v1-1-8338f3e55a4c@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-11-24 19:45:02 -08:00
Breno Leitao	a8ff4842da	net: hyperv: convert to use .get_rx_ring_count Convert the hyperv netvsc driver to use the new .get_rx_ring_count ethtool operation instead of implementing .get_rxnfc solely for handling ETHTOOL_GRXRINGS command. This simplifies the code by replacing the switch statement with a direct return of the queue count. The new callback provides the same functionality in a more direct way, following the ongoing ethtool API modernization. Signed-off-by: Breno Leitao <leitao@debian.org> Link: https://patch.msgid.link/20251121-hyperv_gxrings-v1-1-31293104953b@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-11-24 19:42:30 -08:00
Eric Dumazet	ec1e48e97f	net: optimize eth_type_trans() vs CONFIG_STACKPROTECTOR_STRONG=y Some platforms exhibit very high costs with CONFIG_STACKPROTECTOR_STRONG=y when a function needs to pass the address of a local variable to external functions. eth_type_trans() (and its callers) is showing this anomaly on AMD EPYC 7B12 platforms (and maybe others). We could : 1) inline eth_type_trans() This would help if its callers also has the same issue, and the canary cost would be paid by the callers already. This is a bit cumbersome because netdev_uses_dsa() is pulling whole <net/dsa.h> definitions. 2) Compile net/ethernet/eth.c with -fno-stack-protector This would weaken security. 3) Hack eth_type_trans() to temporarily use skb->dev as a place holder if skb_header_pointer() needs to pull 2 bytes not present in skb->head. This patch implements 3), and brings a 5% improvement on TX/RX intensive workload (tcp_rr 10,000 flows) on AMD EPYC 7B12. Removing CONFIG_STACKPROTECTOR_STRONG on this platform can improve performance by 25 %. This means eth_type_trans() issue is not an isolated artifact. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20251121061725.206675-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-11-24 19:27:31 -08:00
Jakub Kicinski	e254c212cd	selftests: af_unix: don't use SKIP for expected failures netdev CI reserves SKIP in selftests for cases which can't be executed due to setup issues, like missing or old commands. Tests which are expected to fail must use XFAIL. Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20251123021601.158709-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-11-24 19:07:51 -08:00
Andre Carvalho	00f3b32518	selftests: netconsole: ensure required log level is set on netcons_basic This commit ensures that the required log level is set at the start of the test iteration. Part of the cleanup performed at the end of each test iteration resets the log level (do_cleanup in lib_netcons.sh) to the values defined at the time test script started. This may cause further test iterations to fail if the default values are not sufficient. Signed-off-by: Andre Carvalho <asantostc@gmail.com> Reviewed-by: Breno Leitao <leitao@debian.org> Link: https://patch.msgid.link/20251121-netcons-basic-loglevel-v1-1-577f8586159c@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-11-24 18:52:20 -08:00
Jakub Kicinski	d350d28350	Merge branch 'selftests-hw-net-toeplitz-read-config-from-the-nic-directly' Jakub Kicinski says: ==================== selftests: hw-net: toeplitz: read config from the NIC directly First patch here tries to auto-disable building the iouring sample. Our CI will still run the iouring test(s), of course, but it looks like the liburing updates aren't very quick in distroes and having to hack around it when developing unrelated tests is a bit annoying. Remaining 4 patches iron out running the Toeplitz hash test against real NICs. I tested mlx5, bnxt and fbnic, they all pass now. I switched to using YNL directly in the C code, can't see a reason to get the info in Python and pass it to C via argv. The old code likely did this because it predates YNL. ==================== Link: https://patch.msgid.link/20251121040259.3647749-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-11-24 18:51:44 -08:00
Jakub Kicinski	5aadc15584	selftests: hw-net: toeplitz: give the test up to 4 seconds Increase the receiver timeout. When running between machines in different geographic regions the test needs more than a second to SSH across and send the frames. The bkg() command that runs the receiver defaults to 5 sec timeout, so using 4 sec sounds like a reasonable value for the receiver itself. Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20251121040259.3647749-6-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-11-24 18:51:41 -08:00
Jakub Kicinski	c0105ffc50	selftests: hw-net: toeplitz: read indirection table from the device Replace the simple modulo math with the real indirection table read from the device. This makes the tests pass for mlx5 and bnxt NICs. Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20251121040259.3647749-5-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-11-24 18:51:41 -08:00
Jakub Kicinski	aa91dbf3ed	selftests: hw-net: toeplitz: read the RSS key directly from C Now that we have YNL support for RSS accessing the RSS info from C is very easy. Instead of passing the RSS key from Python do it directly in the C code. Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20251121040259.3647749-4-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-11-24 18:51:40 -08:00
Jakub Kicinski	27c512af19	selftests: hw-net: toeplitz: make sure NICs have pure Toeplitz configured Make sure that the NIC under test is configured for pure Toeplitz hashing, and no input key transform (no symmetric hashing). Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20251121040259.3647749-3-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-11-24 18:51:40 -08:00
Jakub Kicinski	f81171fecd	selftests: hw-net: auto-disable building the iouring C code Looks like the liburing is not updated by distros very aggressively. Presumably because a lot of packages depend on it. I just updated to Fedora 43 and it's still on liburing 2.9. The test is 9mo old, at this stage I think this warrants handling the build failure more gracefully. Detect if iouring is recent enough and if not print a warning and exclude the C prog from build. The Python test will just fail since the binary won't exist. But it removes the major annoyance of having to update liburing from sources when developing other tests. Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20251121040259.3647749-2-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-11-24 18:51:40 -08:00
Dan Carpenter	ef0b78b5b6	i40e: delete a stray tab This return statement is indented one tab too far. Delete a tab. Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Reviewed-by: Tony Nguyen <anthony.l.nguyen@intel.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Link: https://patch.msgid.link/aSBqjtA8oF25G1OG@stanley.mountain Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-11-24 18:51:06 -08:00
Jakub Kicinski	e05021a829	Merge branch 'net-stmmac-qcon-ethqos-rgmii-accessor-cleanups' Russell King says: ==================== net: stmmac: qcon-ethqos: "rgmii" accessor cleanups This series cleans up the "rgmii" accessors in qcom-ethqos. readl() and writel() return and take a u32 for the value. Rather than implicitly casting this to an int, keep it as a u32. Add set/clear functions to reduce the code and make it easier to read. Finally, convert the open-coded poll loops to use the iopoll helpers. Note that patch 1 has a checkpatch warning concerning "volatile" - I'm changing the type here, and the "volatile" is removed in patch 3. I do not feel it is appropriate to remove it in patch 1. ==================== Link: https://patch.msgid.link/aR76i0HjXitfl7xk@shell.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-11-21 18:13:46 -08:00
Russell King (Oracle)	9b60ba512c	net: stmmac: qcom-ethqos: use read_poll_timeout_atomic() Use read_poll_timeout_atomic() to poll the rgmii registers rather than open-coding the polling. Reviewed-by: Konrad Dybcio <konrad.dybcio@oss.qualcomm.com> Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Link: https://patch.msgid.link/E1vM2n1-0000000FRTu-0js9@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-11-21 18:13:44 -08:00
Russell King (Oracle)	819212185a	net: stmmac: qcom-ethqos: add rgmii set/clear functions The driver has a lot of bit manipulation of the RGMII registers. Add a pair of helpers to set bits and clear bits, converting the various calls to rgmii_updatel() as appropriate. Most of the change was done via this sed script: /rgmii_updatel/ { N /,$/N /mask, / ! { s\|rgmii_updatel\(([^,],\s+([^,]),\s+)\2,\s+\|rgmii_setmask(\1\| s\|rgmii_updatel\(([^,],\s+([^,]),\s+)0,\s+\|rgmii_clrmask(\1\| s\|^\s+$\|\| } } and then formatting tweaked where necessary. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Reviewed-by: Konrad Dybcio <konrad.dybcio@oss.qualcomm.com> Link: https://patch.msgid.link/E1vM2mw-0000000FRTo-0End@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-11-21 18:13:44 -08:00
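The set/clear helpers described in the commit above can be pictured as thin wrappers around a read-modify-write update. The sketch below is a user-space approximation only: `fake_reg`, `reg_read()`, `reg_write()`, `reg_update()`, `reg_setmask()` and `reg_clrmask()` are hypothetical names standing in for the driver's register accessors, and a plain variable replaces the memory-mapped register.

```c
#include <stdint.h>

/* Hypothetical stand-in for a memory-mapped 32-bit register. */
static uint32_t fake_reg;

static uint32_t reg_read(void)
{
	return fake_reg;
}

static void reg_write(uint32_t val)
{
	fake_reg = val;
}

/* Read-modify-write: replace the bits selected by mask with those of val. */
static void reg_update(uint32_t mask, uint32_t val)
{
	uint32_t tmp = reg_read();

	tmp = (tmp & ~mask) | (val & mask);
	reg_write(tmp);
}

/* Set every bit in mask; equivalent to reg_update(mask, mask). */
static void reg_setmask(uint32_t mask)
{
	reg_update(mask, mask);
}

/* Clear every bit in mask; equivalent to reg_update(mask, 0). */
static void reg_clrmask(uint32_t mask)
{
	reg_update(mask, 0);
}
```

Callers that only set or only clear bits no longer repeat the mask as the value argument, which is the duplication the sed script in the commit message rewrites away.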
Russell King (Oracle)	f54bbd390f	net: stmmac: qcom-ethqos: use u32 for rgmii read/write/update readl() returns a u32, and writel() takes a "u32" for the value. These are used in rgmii_readl() and rgmii_writel(), but the value and return are "int". As these are 32-bit register values which are not signed, use "u32". These changes do not cause generated code changes. Update rgmii_updatel() to use u32 for mask and val. Changing "mask" to "u32" also does not cause generated code changes. However, changing "val" causes the generated assembly to be re-ordered for aarch64. Update the temporary variables used with the rgmii functions to use u32. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Reviewed-by: Konrad Dybcio <konrad.dybcio@oss.qualcomm.com> Link: https://patch.msgid.link/E1vM2mq-0000000FRTi-3y5F@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-11-21 18:13:44 -08:00
Slark Xiao	501253b61d	net: wwan: t7xx: Make local function static This function was used in t7xx_hif_cldma.c only. Make it static as it should be. Signed-off-by: Slark Xiao <slark_xiao@163.com> Reviewed-by: Loic Poulain <loic.poulain@qualcomm.com> Link: https://patch.msgid.link/20251120115208.345578-1-slark_xiao@163.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-11-21 18:09:43 -08:00
Jakub Kicinski	e2c20036a8	Merge branch 'devlink-net-mlx5-implement-swp_l4_csum_mode-via-devlink-params' Daniel Zahka says: ==================== devlink: net/mlx5: implement swp_l4_csum_mode via devlink params This series introduces a new devlink feature for querying param default values, and resetting params to their default values. This feature is then used to implement a new mlx5 driver param. The series starts with two pure refactor patches: one that passes through the extack to devlink_param::get() implementations. And a second small refactor that prepares the netlink tlv handling code in the devlink_param::get() path to better handle default parameter values. The third patch introduces the uapi and driver api for default parameter values. The driver api is opt-in, and both the uapi and driver api preserve existing behavior when not used by drivers or userspace. The fourth patch introduces a new mlx5 driver param, swp_l4_csum_mode, for controlling tx csum behavior. The "l4_only" value of this param is a dependency for PSP initialization on CX7 NICs. Lastly, the series introduces a new driver param with cmode runtime to netdevsim, and then uses this param in a new testcase for netdevsim devlink params. Here are some examples of using the default param uapi with the devlink cli. 
Note the devlink cli binary I am using has changes which I am posting in accompanying series targeting iproute2-next: # netdevsim ./devlink dev param show netdevsim/netdevsim0 netdevsim/netdevsim0: name max_macs type generic values: cmode driverinit value 32 default 32 name test1 type driver-specific values: cmode driverinit value true default true # set to false ./devlink dev param set netdevsim/netdevsim0 name test1 value false cmode driverinit ./devlink dev param show netdevsim/netdevsim0 netdevsim/netdevsim0: name max_macs type generic values: cmode driverinit value 32 default 32 name test1 type driver-specific values: cmode driverinit value false default true # set back to default ./devlink dev param set netdevsim/netdevsim0 name test1 default cmode driverinit ./devlink dev param show netdevsim/netdevsim0 netdevsim/netdevsim0: name max_macs type generic values: cmode driverinit value 32 default 32 name test1 type driver-specific values: cmode driverinit value true default true # mlx5 params on cx7 ./devlink dev param show pci/0000:01:00.0 pci/0000:01:00.0: name max_macs type generic values: cmode driverinit value 128 default 128 ... name swp_l4_csum_mode type driver-specific values: cmode permanent value default default default # set to l4_only ./devlink dev param set pci/0000:01:00.0 name swp_l4_csum_mode value l4_only cmode permanent ./devlink dev param show pci/0000:01:00.0 name swp_l4_csum_mode pci/0000:01:00.0: name swp_l4_csum_mode type driver-specific values: cmode permanent value l4_only default default # reset to default ./devlink dev param set pci/0000:01:00.0 name swp_l4_csum_mode default cmode permanent ./devlink dev param show pci/0000:01:00.0 name swp_l4_csum_mode pci/0000:01:00.0: name swp_l4_csum_mode type driver-specific values: cmode permanent value default default default ==================== Link: https://patch.msgid.link/20251119025038.651131-1-daniel.zahka@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-11-20 19:01:25 -08:00
Daniel Zahka	8be656cfb9	selftest: netdevsim: test devlink default params Test querying default values and resetting to default values for netdevsim devlink params. This should cover the basic paths of interest: driverinit and non-driverinit cmodes, as well as bool and non-bool value type. Default param values of type bool are encoded with u8 netlink type as opposed to flag type, so that userspace can distinguish "not-present" from false. Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com> Link: https://patch.msgid.link/20251119025038.651131-7-daniel.zahka@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-11-20 19:01:23 -08:00
Daniel Zahka	72924056eb	netdevsim: register a new devlink param with default value interface Create a new devlink param, test2, that supports default param actions via the devlink_param::get_default() and devlink_param::reset_default() functions. Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com> Link: https://patch.msgid.link/20251119025038.651131-6-daniel.zahka@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-11-20 19:01:23 -08:00

1399497 Commits