This patch refines and strengthens the statistics collection of TX queue
wake/stop events introduced by commit c39add9b24 ("virtio_net: Add TX
stopped and wake counters").
Previously, the driver only recorded partial wake/stop statistics
for TX queues. Some wake events triggered by 'skb_xmit_done()' or resume
operations were not counted, which made the per-queue metrics incomplete.
Signed-off-by: Liming Wu <liming.wu@jaguarmicro.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Link: https://patch.msgid.link/20251120015320.1418-1-liming.wu@jaguarmicro.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Eric Dumazet says:
====================
tcp: provide better locality for retransmit timer
TCP stack uses three timers per flow, currently spread this way:
- sk->sk_timer : keepalive timer
- icsk->icsk_retransmit_timer : retransmit timer
- icsk->icsk_delack_timer : delayed ack timer
This series moves the retransmit timer to sk->sk_timer location,
to increase data locality in TX paths.
keepalive timers are not often used, this change should be neutral for them.
After the series we have following fields:
- sk->tcp_retransmit_timer : retransmit timer, in sock_write_tx group
- icsk->icsk_delack_timer : delayed ack timer
- icsk->icsk_keepalive_timer : keepalive timer
Moving icsk_delack_timer in a beter location would also be welcomed.
====================
Link: https://patch.msgid.link/20251124175013.1473655-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
sk->sk_timer has been used for TCP keepalives.
Keepalive timers are not in fast path, we want to use sk->sk_timer
storage for retransmit timers, for better cache locality.
Create icsk->icsk_keepalive_timer and change keepalive
code to no longer use sk->sk_timer.
Added space is reclaimed in the following patch.
This includes changes to MPTCP, which was also using sk_timer.
Alias icsk->mptcp_tout_timer and icsk->icsk_keepalive_timer
for inet_sk_diag_fill() sake.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20251124175013.1473655-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Since the tagged commit, ice stopped respecting Rx buffer length
passed from VFs.
At that point, the buffer length was hardcoded in ice, so VFs still
worked up to some point (until, for example, a VF wanted an MTU
larger than its PF).
The next commit 93f53db9f9 ("ice: switch to Page Pool"), broke
Rx on VFs completely since ice started accounting per-queue buffer
lengths again, but now VF queues always had their length zeroed, as
ice was already ignoring what iavf was passing to it.
Restore the line that initializes the buffer length on VF queues
basing on the virtchnl messages.
Fixes: 3a4f419f75 ("ice: drop page splitting and recycling")
Reported-by: Jakub Slepecki <jakub.slepecki@intel.com>
Reviewed-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Tested-by: Jakub Slepecki <jakub.slepecki@intel.com>
Link: https://patch.msgid.link/20251124170735.3077425-1-aleksander.lobakin@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-Wflex-array-member-not-at-end was introduced in GCC-14, and we are
getting ready to enable it, globally.
Use the `DEFINE_RAW_FLEX()` helper for on-stack definitions of
a flexible structure where the size of the flexible-array member
is known at compile-time, and refactor the rest of the code,
accordingly.
So, with these changes, fix the following warning:
drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_io.c:163:36: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end]
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Link: https://patch.msgid.link/aSQocKoJGkN0wzEj@kspp
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Asbjørn Sloth Tønnesen says:
====================
tools: ynl-gen: regeneration comment + function prefix
It looks like these two patches are the last ones needed
for YNL, before the WireGuard patches can go in.
These patches was both requested by Jason, during review
of the WireGuard YNL conversion patchset[1].
====================
Link: https://patch.msgid.link/20251120174429.390574-1-ast@fiberby.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch adds a new CLI argument for overriding the default
function prefix, as used for naming the doit/dumpit functions
in the generated kernel code.
When not specified the default "$(FAMILY)-nl" is used.
This can also be specified persistently in generated files:
/* YNL-ARG --function-prefix wg */
In the above example it causes the following changes:
wireguard_nl_get_device_dumpit() -> wg_get_device_dumpit()
wireguard_nl_get_device_doit() -> wg_get_device_doit()
The variable name fn_prefix, was chosen as it relates to op_prefix
which is used to prefix the UAPI commands enum entries.
Link: https://lore.kernel.org/r/aRvWzC8qz3iXDAb3@zx2c4.com/
Suggested-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
Link: https://patch.msgid.link/20251120174429.390574-2-ast@fiberby.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Andy Shevchenko says:
====================
ptp: ocp: A fix and refactoring
Here is the fix for incorrect use of %ptT with the associated
refactoring and additional cleanups.
Note, %ptS, which is introduced in another series that is already
applied to PRINTK tree, doesn't fit here, that's why this fix
is separated from that series.
====================
Link: https://patch.msgid.link/20251124084816.205035-1-andriy.shevchenko@linux.intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Wei Fang says:
====================
net: enetc: add port MDIO support for both i.MX94 and i.MX95
The NETC IP has one external master MDIO interface (eMDIO) for managing
external PHYs, all ENETC ports share this eMDIO. The EMDIO function and
the ENETC port MDIO are the virtual ports of this eMDIO, ENETC can use
these virtual ports to access their PHYs. The difference is that EMDIO
function is a 'global port', it can access all the PHYs on the eMDIO, so
it provides a means for different software modules to share a single set
of MDIO signals to access their PHYs.
The ENETC port MDIO can only access its own external PHY. Furthermore,
its PHY address must be set to its corresponding LaBCR register in IERB
module, which is is a 64 KB size page containing registers that are used
for pre-boot initialization for all NETC PCIe functions. And this IERB
is owned by the host OS and it will be locked after the initialization,
so it cannot be configured at running time any more. The port MDIO can
only work properly when the PHY address accessed by it matches the value
of its corresponding LaBCR[MDIO_PHYAD_PRTAD]. Otherwise, the MDIO access
by the port MDIO will not take effect.
Note that the same PHY is either controlled by port MDIO or by the EMDIO
function. The netc-blk-ctrl driver will only set the PHY address in the
LaBCR register corresponding to the ENETC when the ENETC node contains
an mdio child node, and the ENETC driver will only create the port MDIO
bus then. An example in DTS is as follows, the EMDIO function will not\
access this PHY.
enetc_port0 {
phy-handle = <ðphy0>;
phy-mode = "rgmii-id";
mdio {
#address-cells = <1>;
#size-cells = <0>;
ethphy0: ethernet-phy@1 {
reg = <1>;
};
};
};
If users want to use EMDIO funtion to manage the PHY, they only need to
place the PHY node in the emdio node. The same PHY must not be placed
simultaneously within the ENETC node. An example in DTS to use EMDIO
is as below.
netc_emdio {
ethphy0: ethernet-phy@1 {
reg = <1>;
};
ethphy2: ethernet-phy@8 {
reg = <8>;
};
};
In the host OS, when there are multiple ENETCs, they can all access their
PHYs using their own port MDIO, or they can all access their PHYs using
the EMDIO function, or they can partially use port MDIO and partially use
the EMDIO function.
Another typical use case of port MDIO is the Jailhouse usage. An ENETC is
assigned to a guest OS. The EMDIO function will be unavailable in the
guest OS because EMDIO is controlled by the host OS. Therefore, the ENETC
can use its port MDIO to manage its external PHY in this situation. In
this use case, the host OS's root dtb will disable the ENETC node, so the
host OS's ENETC driver will not probe the ENETC and its PHY.
In addition, this series also adds the internal MDIO bus support, each
ENETC has an internal MDIO interface for managing on-die PHY (PCS) if it
has PCS layer.
====================
Link: https://patch.msgid.link/20251119102557.1041881-1-wei.fang@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Each ENETC has a set of external MDIO registers to access its external
PHY based on its port EMDIO bus, these registers are used for MDIO bus
access, such as setting the PHY address, PHY register address and value,
read or write operations, C22 or C45 format, etc. The base address of
this set of registers has been modified in ENETC v4 and is different
from that in ENETC v1. So the base address needs to be updated so that
ENETC v4 can use port MDIO to manage its own external PHY.
Additionally, if ENETC has the PCS layer, it also has a set of internal
MDIO registers for managing its on-die PHY (PCS/Serdes). The base address
of this set of registers is also different from that of ENETC v1, so the
base address also needs to be updated so that ENETC v4 can support the
management of on-die PHY through the internal MDIO bus.
Signed-off-by: Wei Fang <wei.fang@nxp.com>
Reviewed-by: Claudiu Manoil <claudiu.manoil@nxp.com>
Link: https://patch.msgid.link/20251119102557.1041881-4-wei.fang@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
NETC IP has only one external master MDIO interface (eMDIO) for managing
the external PHYs. ENETC can use the interfaces provided by the EMDIO
function or its port MDIO to access and manage its external PHY. Both
the EMDIO function and the port MDIO are all virtual ports of the eMDIO.
The difference is that the EMDIO function is a 'global port', it can
access all the PHYs on the eMDIO, but port MDIO can only access its own
PHY. To ensure that ENETC can only access its own PHY through port MDIO,
LaBCR[MDIO_PHYAD_PRTAD] needs to be set, which represents the address of
the external PHY connected to ENETC. If the accessed PHY address is not
consistent with LaBCR[MDIO_PHYAD_PRTAD], then the MDIO access initiated
by port MDIO will be invalid.
Signed-off-by: Wei Fang <wei.fang@nxp.com>
Reviewed-by: Claudiu Manoil <claudiu.manoil@nxp.com>
Link: https://patch.msgid.link/20251119102557.1041881-3-wei.fang@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The ENETC supports managing its own external PHY through its port MDIO
functionality. To use this function, the PHY address needs be set in the
corresponding LaBCR register in the Integrated Endpoint Register Block
(IERB), which is used for pre-boot initialization of NETC PCIe functions.
The port MDIO can only work properly when the PHY address accessed by the
port MDIO matches the corresponding LaBCR[MDIO_PHYAD_PRTAD] value.
Because the ENETC driver only registers the MDIO bus (port MDIO bus) when
it detects an MDIO child node in its node, similarly, the netc-blk-ctrl
driver only resolves the PHY address and sets it in the corresponding
LaBCR when it detects an MDIO child node in the ENETC node.
Co-developed-by: Aziz Sellami <aziz.sellami@nxp.com>
Signed-off-by: Aziz Sellami <aziz.sellami@nxp.com>
Signed-off-by: Wei Fang <wei.fang@nxp.com>
Reviewed-by: Claudiu Manoil <claudiu.manoil@nxp.com>
Link: https://patch.msgid.link/20251119102557.1041881-2-wei.fang@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Vladimir Oltean says:
====================
Improvements over DSA conduit ethtool ops
DSA interceps 'ethtool -S eth0', where eth0 is the host port of the
switch (called 'conduit'). It does this because otherwise there is no
way to report port counters for the CPU port, which is a MAC like any
other of that switch, except Linux exposes no net_device for it, thus no
ethtool hook.
Having understood all downsides of this debugging interface, when we
need it we needed, so the proposed changes here are to make it more
useful by dumping more counters in it: not just the switch CPU port,
but all other switch ports in the tree which lack a net_device. Not
reinventing any wheel, just putting more output in an existing command.
That is patch 3/3. The other 2 are cleanup.
====================
Link: https://patch.msgid.link/20251122112311.138784-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently there is no way to see packet counters on cascade ports, and
no clarity on how the API for that would look like.
Because it's something that is currently needed, just extend the hack
where ethtool -S on the conduit interface dumps CPU port counters, and
also use it to dump counters of cascade ports.
Note that the "pXX_" naming convention changes to "sXX_pYY", to
distinguish between ports having the same index but belonging to
different switches. This has a slight chance of causing regressions to
existing tooling:
- grepping for "p04_counter_name" still works, but might return more
than one string now
- grepping for " p04_counter_name" no longer works
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20251122112311.138784-4-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In theory this would have been seen by now, but it seems that all
drivers used as DSA conduit interfaces thus far have had ethtool_ops
set, and it's hard to even find modern Ethernet drivers (and not VF
ones) which don't use ethtool.
Here is the unfiltered list of drivers which register any sort of
net_device but don't set its ethtool_ops pointer. I don't think any of
them 'risks' being used as a DSA conduit, maybe except for moxart,
rnpbge and icssm, I'm not sure.
- drivers/net/can/dev/dev.c
- drivers/net/wwan/qcom_bam_dmux.c
- drivers/net/wwan/t7xx/t7xx_netdev.c
- drivers/net/arcnet/arcnet.c
- drivers/net/hamradio/
- drivers/net/slip/slip.c
- drivers/net/ethernet/ezchip/nps_enet.c
- drivers/net/ethernet/moxa/moxart_ether.c
- drivers/net/ethernet/wangxun/txgbevf/txgbevf_main.c
- drivers/net/ethernet/wangxun/ngbevf/ngbevf_main.c
- drivers/net/ethernet/huawei/hinic3/hinic3_main.c
- drivers/net/ethernet/i825xx/
- drivers/net/ethernet/ti/icssm/icssm_prueth.c
- drivers/net/ethernet/seeq/
- drivers/net/ethernet/litex/litex_liteeth.c
- drivers/net/ethernet/sunplus/spl2sw_driver.c
- drivers/net/ethernet/mucse/rnpgbe/rnpgbe_main.c
- drivers/net/ipa/
- drivers/net/wireless/microchip/wilc1000/
- drivers/net/wireless/mediatek/mt76/dma.c
- drivers/net/wireless/ath/ath12k/
- drivers/net/wireless/ath/ath11k/
- drivers/net/wireless/ath/ath6kl/
- drivers/net/wireless/ath/ath10k/
- drivers/net/wireless/intel/iwlwifi/pcie/gen1_2/trans.c
- drivers/net/wireless/virtual/mac80211_hwsim.c
- drivers/net/wireless/quantenna/qtnfmac/pcie/pcie.c
- drivers/net/wireless/realtek/rtw89/core.c
- drivers/net/wireless/realtek/rtw88/pci.c
- drivers/net/caif/
- drivers/net/plip/
- drivers/net/wan/
- drivers/net/mctp/
- drivers/net/ppp/
- drivers/net/thunderbolt/
Nonetheless, it's good for the framework not to make such assumptions,
and not panic when coming across such kind of host device in the future.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20251122112311.138784-2-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
drivers/net/ethernet/chelsio/cxgb4/sched.h declares a sched_class
struct which has a type name clash with struct sched_class
in kernel/sched/sched.h (a type used in a field in task_struct).
When cxgb4 is a builtin we end up with both sched_class types,
and as a result of this we wind up with DWARF (and derived from
that BTF) with a duplicate incorrect task_struct representation.
When cxgb4 is built-in this type clash can cause kernel builds to
fail as resolve_btfids will fail when confused which task_struct
to use. See [1] for more details.
As such, renaming sched_class to ch_sched_class (in line with
other structs like ch_sched_flowc) makes sense.
[1] https://lore.kernel.org/bpf/2412725b-916c-47bd-91c3-c2d57e3e6c7b@acm.org/
Reported-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
Acked-by: Potnuri Bharat Teja <bharat@chelsio.com>
Link: https://patch.msgid.link/20251121181231.64337-1-alan.maguire@oracle.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Some qdisc like cake, codel, fq_codel might drop packets
in their dequeue() method.
This is currently problematic because dequeue() runs with
the qdisc spinlock held. Freeing skbs can be extremely expensive.
Add qdisc_dequeue_drop() method and a new TCQ_F_DEQUEUE_DROPS
so that these qdiscs can opt-in to defer the skb frees
after the socket spinlock is released.
TCQ_F_DEQUEUE_DROPS is an attempt to not penalize other qdiscs
with an extra cache line miss.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20251121083256.674562-14-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Most qdiscs need to read skb->priority at enqueue time().
In commit 100dfa74ca ("net: dev_queue_xmit() llist adoption")
I added a prefetch(next), lets add another one for the second
half of skb.
Note that skb->priority and skb->hash share a common cache line,
so this patch helps qdiscs needing both fields.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20251121083256.674562-11-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
prefetch the skb that we are likely to dequeue at the next dequeue().
Also call fq_dequeue_skb() a bit sooner in fq_dequeue().
This reduces the window between read of q.qlen and
changes of fields in the cache line that could be dirtied
by another cpu trying to queue a packet.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20251121083256.674562-10-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Add "aspeed,ast2700-mdio" compatible to the binding schema with a fallback
to "aspeed,ast2600-mdio".
Although the MDIO controller on AST2700 is functionally the same as the
one on AST2600, it's good practice to add a SoC-specific compatible for
new silicon. This allows future driver updates to handle any 2700-specific
integration issues without requiring devicetree changes or complex
runtime detection logic.
For now, the driver continues to bind via the existing
"aspeed,ast2600-mdio" compatible, so no driver changes are needed.
Acked-by: Conor Dooley <conor.dooley@microchip.com>
Signed-off-by: Jacky Chou <jacky_chou@aspeedtech.com>
Link: https://patch.msgid.link/20251120-aspeed_mdio_ast2700-v2-1-0d722bfb2c54@aspeedtech.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Matthieu Baerts says:
====================
mptcp: memcg accounting for passive sockets & backlog processing
This series is split in two: the 4 first patches are linked to memcg
accounting for passive sockets, and the rest introduce the backlog
processing. They are sent together, because the first one appeared to be
needed to get the second one fully working.
The second part includes RX path improvement built around backlog
processing. The main goals are improving the RX performances _and_
increase the long term maintainability.
- Patches 1-3: preparation work to ease the introduction of the next
patch.
- Patch 4: fix memcg accounting for passive sockets. Note that this is a
(non-urgent) fix, but it depends on material that is currently only in
net-next, e.g. commit 4a997d49d9 ("tcp: Save lock_sock() for memcg
in inet_csk_accept().").
- Patches 5-6: preparation of the stack for backlog processing, removing
assumptions that will not hold true any more after the backlog
introduction.
- Patches 7,8,10,11,12 are more cleanups that will make the backlog
patch a little less huge.
- Patch 9: somewhat an unrelated cleanup, included here not to forget
about it.
- Patches 13-14: The real work is done by them. Patch 13 introduces the
helpers needed to manipulate the msk-level backlog, and the data
struct itself, without any actual functional change. Patch 14 finally
uses the backlog for RX skb processing. Note that MPTCP can't use the
sk_backlog, as the MPTCP release callback can also release and
re-acquire the msk-level spinlock and core backlog processing works
under the assumption that such event is not possible.
A relevant point is memory accounts for skbs in the backlog. It's
somewhat "original" due to MPTCP constraints. Such skbs use space from
the incoming subflow receive buffer, do not use explicitly any forward
allocated memory, as we can't update the msk fwd mem while enqueuing,
nor we want to acquire again the ssk socket lock while processing the
skbs. Instead the msk borrows memory from the subflow and reserve it
for the backlog, see patch 5 and 14 for the gory details.
====================
Link: https://patch.msgid.link/20251121-net-next-mptcp-memcg-backlog-imp-v1-0-1f34b6c1e0b1@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When the msk socket is owned or the msk receive buffer is full,
move the incoming skbs in a msk level backlog list. This avoid
traversing the joined subflows and acquiring the subflow level
socket lock at reception time, improving the RX performances.
When processing the backlog, use the fwd alloc memory borrowed from
the incoming subflow. skbs exceeding the msk receive space are
not dropped; instead they are kept into the backlog until the receive
buffer is freed. Dropping packets already acked at the TCP level is
explicitly discouraged by the RFC and would corrupt the data stream
for fallback sockets.
Special care is needed to avoid adding skbs to the backlog of a closed
msk and to avoid leaving dangling references into the backlog
at subflow closing time.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20251121-net-next-mptcp-memcg-backlog-imp-v1-14-1f34b6c1e0b1@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>