sk->sk_sndbuf is read-mostly in tx path, so move it from
sock_write_tx group to more appropriate sock_read_tx.
sk->sk_err_soft was not identified previously, but
is used from tcp_ack().
Move it to sock_write_tx group for better cache locality.
Also change tcp_ack() to clear sk->sk_err_soft only if needed.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250919204856.2977245-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Bhargava Marreddy says:
====================
Add more functionality to BNGE
This patch series adds the infrastructure to make the netdevice
functional. It allocates data structures for core resources,
followed by their initialisation and registration with the firmware.
The core resources include the RX, TX, AGG, CMPL, and NQ rings,
as well as the VNIC. RX/TX functionality will be introduced in the
next patch series to keep this one at a reviewable size.
====================
Link: https://patch.msgid.link/20250919174742.24969-1-bhargava.marreddy@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Marco Crivellari says:
====================
net: replace wq users and add WQ_PERCPU to alloc_workqueue() users
Below is a summary of a discussion about the Workqueue API and cpu isolation
considerations. Details and more information are available here:
"workqueue: Always use wq_select_unbound_cpu() for WORK_CPU_UNBOUND."
Link: https://lore.kernel.org/20250221112003.1dSuoGyc@linutronix.de
=== Current situation: problems ===
Let's consider a nohz_full system with isolated CPUs: wq_unbound_cpumask is
set to the housekeeping CPUs, for !WQ_UNBOUND the local CPU is selected.
This leads to different scenarios if a work item is scheduled on an isolated
CPU where "delay" value is 0 or greater then 0:
schedule_delayed_work(, 0);
This will be handled by __queue_work() that will queue the work item on the
current local (isolated) CPU, while:
schedule_delayed_work(, 1);
Will move the timer on an housekeeping CPU, and schedule the work there.
Currently if a user enqueue a work item using schedule_delayed_work() the
used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use
WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to
schedule_work() that is using system_wq and queue_work(), that makes use
again of WORK_CPU_UNBOUND.
This lack of consistentcy cannot be addressed without refactoring the API.
=== Plan and future plans ===
This patchset is the first stone on a refactoring needed in order to
address the points aforementioned; it will have a positive impact also
on the cpu isolation, in the long term, moving away percpu workqueue in
favor to an unbound model.
These are the main steps:
1) API refactoring (that this patch is introducing)
- Make more clear and uniform the system wq names, both per-cpu and
unbound. This to avoid any possible confusion on what should be
used.
- Introduction of WQ_PERCPU: this flag is the complement of WQ_UNBOUND,
introduced in this patchset and used on all the callers that are not
currently using WQ_UNBOUND.
WQ_UNBOUND will be removed in a future release cycle.
Most users don't need to be per-cpu, because they don't have
locality requirements, because of that, a next future step will be
make "unbound" the default behavior.
2) Check who really needs to be per-cpu
- Remove the WQ_PERCPU flag when is not strictly required.
3) Add a new API (prefer local cpu)
- There are users that don't require a local execution, like mentioned
above; despite that, local execution yeld to performance gain.
This new API will prefer the local execution, without requiring it.
=== Introduced Changes by this series ===
1) [P 1-2] Replace use of system_wq and system_unbound_wq
system_wq is a per-CPU workqueue, but his name is not clear.
system_unbound_wq is to be used when locality is not required.
Because of that, system_wq has been renamed in system_percpu_wq, and
system_unbound_wq has been renamed in system_dfl_wq.
2) [P 3] add WQ_PERCPU to remaining alloc_workqueue() users
Every alloc_workqueue() caller should use one among WQ_PERCPU or
WQ_UNBOUND.
WQ_UNBOUND will be removed in a next release cycle.
====================
Link: https://patch.msgid.link/20250918142427.309519-1-marco.crivellari@suse.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently if a user enqueue a work item using schedule_delayed_work() the
used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use
WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to
schedule_work() that is using system_wq and queue_work(), that makes use
again of WORK_CPU_UNBOUND.
This lack of consistentcy cannot be addressed without refactoring the API.
alloc_workqueue() treats all queues as per-CPU by default, while unbound
workqueues must opt-in via WQ_UNBOUND.
This default is suboptimal: most workloads benefit from unbound queues,
allowing the scheduler to place worker threads where they’re needed and
reducing noise when CPUs are isolated.
This change adds a new WQ_PERCPU flag at the network subsystem, to explicitly
request the use of the per-CPU behavior. Both flags coexist for one release
cycle to allow callers to transition their calls.
Once migration is complete, WQ_UNBOUND can be removed and unbound will
become the implicit default.
With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND),
any alloc_workqueue() caller that doesn’t explicitly specify WQ_UNBOUND
must now use WQ_PERCPU.
All existing users have been updated accordingly.
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Link: https://patch.msgid.link/20250918142427.309519-4-marco.crivellari@suse.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently if a user enqueue a work item using schedule_delayed_work() the
used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use
WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to
schedule_work() that is using system_wq and queue_work(), that makes use
again of WORK_CPU_UNBOUND.
This lack of consistentcy cannot be addressed without refactoring the API.
system_unbound_wq should be the default workqueue so as not to enforce
locality constraints for random work whenever it's not required.
Adding system_dfl_wq to encourage its use when unbound work should be used.
The old system_unbound_wq will be kept for a few release cycles.
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Link: https://patch.msgid.link/20250918142427.309519-3-marco.crivellari@suse.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently if a user enqueue a work item using schedule_delayed_work() the
used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use
WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to
schedule_work() that is using system_wq and queue_work(), that makes use
again of WORK_CPU_UNBOUND.
This lack of consistentcy cannot be addressed without refactoring the API.
system_unbound_wq should be the default workqueue so as not to enforce
locality constraints for random work whenever it's not required.
Adding system_dfl_wq to encourage its use when unbound work should be used.
The old system_unbound_wq will be kept for a few release cycles.
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Link: https://patch.msgid.link/20250918142427.309519-2-marco.crivellari@suse.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Tony Nguyen says:
====================
Intel Wired LAN Driver Updates 2025-09-19 (ice, idpf, iavf, ixgbevf, fm10k)
Paul adds support for Earliest TxTime First (ETF) hardware offload
for E830 devices on ice. ETF is configured per-queue using tc-etf Qdisc;
a new Tx flow mechanism utilizes a dedicated timestamp ring alongside
the standard Tx ring. The timestamp ring contains descriptors that
specify when hardware should transmit packets; up to 2048 Tx queues can
be supported.
Additional info: https://lore.kernel.org/intel-wired-lan/20250818132257.21720-1-paul.greenwalt@intel.com/
Dave removes excess cleanup call to ice_lag_move_new_vf_nodes() in error
path.
Milena adds reporting of timestamping statistics to idpf.
Alex changes error variable type for code clarity for iavf and ixgbevf.
Brahmajit Das removes unused parameter from fm10k_unbind_hw_stats_q().
* '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
net: intel: fm10k: Fix parameter idx set but not used
ixgbevf: fix proper type for error code in ixgbevf_resume()
iavf: fix proper type for error code in iavf_resume()
idpf: add HW timestamping statistics
ice: Remove deprecated ice_lag_move_new_vf_nodes() call
ice: add E830 Earliest TxTime First Offload support
ice: move ice_qp_[ena|dis] for reuse
====================
Link: https://patch.msgid.link/20250919175412.653707-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add missing "Return:" sections to kernel-doc comments for four functions:
- axienet_calc_cr()
- axienet_device_reset()
- axienet_free_tx_chain()
- axienet_dim_coalesce_count_rx()
Also standardize the return documentation format by replacing inline
"Returns" text with proper "Return:" tags as per kernel documentation
guidelines.
Fixes below kernel-doc warnings:
- Warning: No description found for return value of 'axienet_calc_cr'
- Warning: No description found for return value of 'axienet_device_reset'
- Warning: No description found for return value of 'axienet_free_tx_chain'
- Warning: No description found for return value of
'axienet_dim_coalesce_count_rx'
Signed-off-by: Suraj Gupta <suraj.gupta2@amd.com>
Link: https://patch.msgid.link/20250919103754.434711-1-suraj.gupta2@amd.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Bastien Curutchet says:
====================
net: dsa: microchip: Add strap description to set SPI as interface bus
At reset, the KSZ8463 uses a strap-based configuration to set SPI as
interface bus. If the required pull-ups/pull-downs are missing
(by mistake or by design to save power) the pins may float and the
configuration can go wrong preventing any communication with the switch.
This small series aims to allow to configure the KSZ8463 switch at
reset when the hardware straps are missing.
PATCH 0 and 1 add a new property to the bindings that describes the GPIOs
to be set during reset in order to configure the switch properly.
PATCH 2 implements the use of these properties in the driver.
====================
Link: https://patch.msgid.link/20250918-ksz-strap-pins-v3-0-16662e881728@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
At reset, the KSZ8463 uses a strap-based configuration to set SPI as
bus interface. SPI is the only bus supported by the driver. If the
required pull-ups/pull-downs are missing (by mistake or by design to
save power) the pins may float and the configuration can go wrong
preventing any communication with the switch.
Introduce a ksz8463_configure_straps_spi() function called during the
device reset. It relies on the 'straps-rxd-gpios' OF property and the
'reset' pinmux configuration to enforce SPI as bus interface.
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Bastien Curutchet (Schneider Electric) <bastien.curutchet@bootlin.com>
Link: https://patch.msgid.link/20250918-ksz-strap-pins-v3-3-16662e881728@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
At reset, KSZ8463 uses a strap-based configuration to set SPI as
interface bus. If the required pull-ups/pull-downs are missing (by
mistake or by design to save power) the pins may float and the
configuration can go wrong preventing any communication with the switch.
Add a 'reset' pinmux state
Add a KSZ8463 specific strap description that can be used by the driver
to drive the strap pins during reset. Two GPIOs are used. Users must
describe either both of them or none of them.
Signed-off-by: Bastien Curutchet (Schneider Electric) <bastien.curutchet@bootlin.com>
Reviewed-by: Rob Herring (Arm) <robh@kernel.org>
Link: https://patch.msgid.link/20250918-ksz-strap-pins-v3-2-16662e881728@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Russell King says:
====================
net: rework SFP capability parsing and quirks
The original SPF module parsing was implemented prior to gaining any
quirks, and was designed such that the upstream calls the parsing
functions to get the translated capabilities of the module.
SFP quirks were then added to cope with modules that didn't correctly
fill out their ID EEPROM. The quirk function was called from
sfp_parse_support() to allow quirks to modify the ethtool link mode
masks.
Using just ethtool link mode masks eventually lead to difficulties
determining the correct phy_interface_t mode, so a bitmap of these
modes were added - needing both the upstream API and quirks to be
updated.
We have had significantly more SFP module quirks added since, some
which are modifying the ID EEPROM as a way of influencing the data
we provide to the upstream - for example, sfp_fixup_10gbaset_30m()
changes id.base.connector so we report PORT_TP. This could be done
more cleanly if the quirks had access to the parsed SFP port.
In order to improve flexibility, and to simplify some of the upstream
code, we group all module capabilities into a single structure that
the upstream can access via sfp_module_get_caps(). This will allow
the module capabilities to be expanded if required without reworking
all the infrastructure and upstreams again.
In this series, we rework the SFP code to use the capability structure
and then rework all the upstream implementations, finally removing the
old kernel internal APIs.
====================
Link: https://patch.msgid.link/aMnaoPjIuzEAsESZ@shell.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Matthieu Baerts says:
====================
mptcp: pm: netlink: announce server-side flag
Now that the 'flags' attribute is used, it seems interesting to add one
flag for 'server-side', a boolean value.
Here are a few patches related to the 'server-side' attribute:
- Patch 1: only announce this attribute on the server side.
- Patch 2: announce the 'server-side' flag when this is the case.
- Patch 3: deprecate the 'server-side' attribute.
- Patch 4: use the 'server-side' flag in the selftests.
- Patches 5, 6: small cleanups when working on code around.
====================
Link: https://patch.msgid.link/20250919-net-next-mptcp-server-side-flag-v1-0-a97a5d561a8b@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This attribute is a boolean. No need to add it to set it to 'false'.
Indeed, the default value when this attribute is not set is naturally
'false'. A few bytes can then be saved by not adding this attribute if
the connection is not on the server side.
This prepares the future deprecation of its attribute, in favour of a
new flag.
Reviewed-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250919-net-next-mptcp-server-side-flag-v1-1-a97a5d561a8b@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
While most of the statistics functions (emac_get_stats64() and such) are
called with softirqs enabled, emac_stats_timer() is, as its name
suggests, also called from a timer, i.e. called in softirq context.
All of these take stats_lock. Therefore, make stats_lock softirq-safe by
changing spin_lock() into spin_lock_bh() for the functions that get
statistics.
Also, instead of directly calling emac_stats_timer() in emac_up() and
emac_resume(), set the timer to trigger instead, so that
emac_stats_timer() is only called from the timer. It will keep using
spin_lock().
This fixes a lockdep warning, and potential deadlock when stats_timer is
triggered in the middle of getting statistics.
Fixes: bfec6d7f20 ("net: spacemit: Add K1 Ethernet MAC")
Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
Closes: https://lore.kernel.org/all/a52c0cf5-0444-41aa-b061-a0a1d72b02fe@samsung.com/
Signed-off-by: Vivian Wang <wangruikang@iscas.ac.cn>
Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
Link: https://patch.msgid.link/20250919-k1-ethernet-fix-lock-v1-1-c8b700aa4954@iscas.ac.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The commit 61f132ca8c ("ptp: add helpers to get the phc_index by
of_node or dev") has added two generic interfaces to get the phc_index
of the PTP clock. This eliminates the need for PTP device drivers to
provide custom APIs for consumers to retrieve the phc_index. This has
already been implemented for ENETC v4 and is also applicable to ENETC
v1. Therefore, the global variable enetc_phc_index is removed from the
driver. ENETC v1 now uses the same interface as v4 to get phc_index.
Signed-off-by: Wei Fang <wei.fang@nxp.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Tested-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://patch.msgid.link/20250919084509.1846513-3-wei.fang@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Kuniyuki Iwashima says:
====================
tcp: Clean up inet_hash() and inet_unhash().
While reviewing the ehash fix series from Xuanqiang Luo [0],
I noticed that inet_twsk_hashdance_schedule() checks the
retval of __sk_nulls_del_node_init_rcu(), which looks confusing.
The test exists from the pre-git era:
$ git blame -L:tcp_tw_hashdance net/ipv4/tcp_minisocks.c e48c414ee61f4~
Patch 3 is to clarify that the retval check is unnecessary in
inet_twsk_hashdance_schedule(), but I'll delegate its removal
to the Xuanqiang's series.
Patch 1 & 2 are minor cleanups.
[0]: https://lore.kernel.org/netdev/20250916103054.719584-4-xuanqiang.luo@linux.dev/
====================
Link: https://patch.msgid.link/20250919083706.1863217-1-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
inet_unhash() checks sk_unhashed() twice at the entry and after locking
ehash/lhash bucket.
The former was somehow added redundantly by commit 4f9bf2a2f5 ("tcp:
Don't acquire inet_listen_hashbucket::lock with disabled BH.").
inet_unhash() is called for the full socket from 4 places, and it is
always under lock_sock() or the socket is not yet published to other
threads:
1. __sk_prot_rehash()
-> called from inet_sk_reselect_saddr(), which has
lockdep_sock_is_held()
2. sk_common_release()
-> called when inet_create() or inet6_create() fail, then the
socket is not yet published
3. tcp_set_state()
-> calls tcp_call_bpf_2arg(), and tcp_call_bpf() has
sock_owned_by_me()
4. inet_ctl_sock_create()
-> creates a kernel socket and unhashes it immediately, but TCP
socket is not hashed in sock_create_kern() (only SOCK_RAW is)
So we do not need to check sk_unhashed() twice before/after ehash/lhash
lock in inet_unhash().
Let's remove the 2nd one.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250919083706.1863217-4-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>