Commit Graph

2982 Commits

Author SHA1 Message Date
Matthieu Baerts (NGI0)
1f24a24097 doc: mptcp: fix Netlink specs link
The Netlink specs RST files are no longer generated inside the source
tree.

In other words, the path to mptcp_pm.rst has changed, and needs to be
updated to the new location.

Fixes: 1ce4da3dd9 ("docs: use parser_yaml extension to handle Netlink specs")
Reported-by: Kory Maincent <kory.maincent@bootlin.com>
Closes: https://lore.kernel.org/20250828185037.07873d04@kmaincent-XPS-13-7390
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250909-net-next-mptcp-pm-link-v1-1-0f1c4b8439c6@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-10 18:28:41 -07:00
Dragos Tatulea
cdc492746e net/mlx5e: Add stale counter for PCIe congestion events
This ethtool counter is meant to help with observing how many times the
congestion event was triggered but on query there was no state change.

This would help to indicate when a work item was scheduled to run too
late and in the meantime the congestion state changed back to previous
state.

While at it, do a driveby typo fix in documentation for
pci_bw_inbound_high.

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/1757237976-531416-3-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-09 19:21:30 -07:00
Dragos Tatulea
f4053490a6 net/mlx5e: Make PCIe congestion event thresholds configurable
Add devlink driverinit parameters for configuring the thresholds for
PCIe congestion events. These parameters are registered only when the
firmware supports this feature.

Update the mlx5 devlink docs as well on these new params.

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/1757237976-531416-2-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-09 19:21:30 -07:00
Vlad Dumitrescu
a4c49611cf net/mlx5: Implement devlink total_vfs parameter
Some devices support both symmetric (same value for all PFs) and
asymmetric, while others only support symmetric configuration. This
implementation prefers asymmetric, since it is closer to the devlink
model (per function settings), but falls back to symmetric when needed.

Example usage:
  devlink dev param set pci/0000:01:00.0 name total_vfs value <u16> cmode permanent
  devlink dev reload pci/0000:01:00.0 action fw_activate
  echo 1 >/sys/bus/pci/devices/0000:01:00.0/remove
  echo 1 >/sys/bus/pci/rescan
  cat /sys/bus/pci/devices/0000:01:00.0/sriov_totalvfs

Signed-off-by: Vlad Dumitrescu <vdumitrescu@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Tested-by: Kamal Heib <kheib@redhat.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250907012953.301746-5-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-09 19:14:24 -07:00
Vlad Dumitrescu
95a0af146d net/mlx5: Implement devlink enable_sriov parameter
Example usage:
  devlink dev param set pci/0000:01:00.0 name enable_sriov value {true, false} cmode permanent
  devlink dev reload pci/0000:01:00.0 action fw_activate
  echo 1 >/sys/bus/pci/devices/0000:01:00.0/remove
  echo 1 >/sys/bus/pci/rescan
  grep ^ /sys/bus/pci/devices/0000:01:00.0/sriov_*

Signed-off-by: Vlad Dumitrescu <vdumitrescu@nvidia.com>
Tested-by: Kamal Heib <kheib@redhat.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250907012953.301746-4-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-09 19:14:24 -07:00
Saeed Mahameed
bf2da4799f net/mlx5: Implement cqe_compress_type via devlink params
Selects which algorithm should be used by the NIC in order to decide rate of
CQE compression dependeng on PCIe bus conditions.

Supported values:

1) balanced, merges fewer CQEs, resulting in a moderate compression ratio
   but maintaining a balance between bandwidth savings and performance
2) aggressive, merges more CQEs into a single entry, achieving a higher
   compression rate and maximizing performance, particularly under high
   traffic loads.

Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250907012953.301746-3-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-09 19:14:23 -07:00
Vlad Dumitrescu
ce0b015e26 devlink: Add 'total_vfs' generic device param
NICs are typically configured with total_vfs=0, forcing users to rely
on external tools to enable SR-IOV (a widely used and essential feature).

Add total_vfs parameter to devlink for SR-IOV max VF configurability.
Enables standard kernel tools to manage SR-IOV, addressing the need for
flexible VF configuration.

Signed-off-by: Vlad Dumitrescu <vdumitrescu@nvidia.com>
Tested-by: Kamal Heib <kheib@redhat.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250907012953.301746-2-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-09 19:14:23 -07:00
Geliang Tang
30549eebc4 mptcp: make ADD_ADDR retransmission timeout adaptive
Currently the ADD_ADDR option is retransmitted with a fixed timeout. This
patch makes the retransmission timeout adaptive by using the maximum RTO
among all the subflows, while still capping it at the configured maximum
value (add_addr_timeout_max). This improves responsiveness when
establishing new subflows.

Specifically:
1. Adds mptcp_adjust_add_addr_timeout() helper to compute the adaptive
timeout.
2. Uses maximum subflow RTO (icsk_rto) when available.
3. Applies exponential backoff based on retransmission count.
4. Maintains fallback to configured max timeout when no RTO data exists.

This slightly changes the behaviour of the MPTCP "add_addr_timeout"
sysctl knob to be used as a maximum instead of a fixed value. But this
is seen as an improvement: the ADD_ADDR might be sent quicker than
before to improve the overall MPTCP connection. Also, the default
value is set to 2 min, which was already way too long, and caused the
ADD_ADDR not to be retransmitted for connections shorter than 2 minutes.

Suggested-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/576
Reviewed-by: Christoph Paasch <cpaasch@openai.com>
Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250907-net-next-mptcp-add_addr-retrans-adapt-v1-1-824cc805772b@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-09 18:57:45 -07:00
Hangbin Liu
e5a6643435 bonding: support aggregator selection based on port priority
Add a new ad_select policy 'port_priority' that uses the per-port
actor priority values (set via ad_actor_port_prio) to determine
aggregator selection.

This allows administrators to influence which ports are preferred
for aggregation by assigning different priority values, providing
more flexible load balancing control in LACP configurations.

Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Link: https://patch.msgid.link/20250902064501.360822-3-liuhangbin@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-09-09 10:56:02 +02:00
Hangbin Liu
6b6dc81ee7 bonding: add support for per-port LACP actor priority
Introduce a new netlink attribute 'actor_port_prio' to allow setting
the LACP actor port priority on a per-slave basis. This extends the
existing bonding infrastructure to support more granular control over
LACP negotiations.

The priority value is embedded in LACPDU packets and will be used by
subsequent patches to influence aggregator selection policies.

Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Link: https://patch.msgid.link/20250902064501.360822-2-liuhangbin@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-09-09 10:56:02 +02:00
Jakub Kicinski
5ef04a7b06 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.17-rc5).

No conflicts.

Adjacent changes:

include/net/sock.h
  c51613fa27 ("net: add sk->sk_drop_counters")
  5d6b58c932 ("net: lockless sock_i_ino()")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-04 13:33:00 -07:00
Jay Vosburgh
23a6037ce7 bonding: Remove support for use_carrier
Remove the implementation of use_carrier, the link monitoring
method that utilizes ethtool or ioctl to determine the link state of an
interface in a bond.  Bonding will always behaves as if use_carrier=1,
which relies on netif_carrier_ok() to determine the link state of
interfaces.

	To avoid acquiring RTNL many times per second, bonding inspects
link state under RCU, but not under RTNL.  However, ethtool
implementations in drivers may sleep, and therefore this strategy is
unsuitable for use with calls into driver ethtool functions.

	The use_carrier option was introduced in 2003, to provide
backwards compatibility for network device drivers that did not support
the then-new netif_carrier_ok/on/off system.  Device drivers are now
expected to support netif_carrier_*, and the use_carrier backwards
compatibility logic is no longer necessary.

	The option itself remains, but when queried always returns 1,
and may only be set to 1.

Link: https://lore.kernel.org/000000000000eb54bf061cfd666a@google.com
Link: https://lore.kernel.org/20240718122017.d2e33aaac43a.I10ab9c9ded97163aef4e4de10985cd8f7de60d28@changeid
Signed-off-by: Jay Vosburgh <jv@jvosburgh.net>
Reported-by: syzbot+b8c48ea38ca27d150063@syzkaller.appspotmail.com
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/2029487.1756512517@famine
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-02 14:01:54 -07:00
Kohei Enju
b434a3772d docs: remove obsolete description about threaded NAPI
Commit 2677010e77 ("Add support to set NAPI threaded for individual
NAPI") introduced threaded NAPI configuration per individual NAPI
instance, however obsolete description that threaded NAPI is per device
has remained.

Remove the old description and clarify that only NAPI instances running
in threaded mode spawn kernel threads by changing "Each NAPI instance"
to "Each threaded NAPI instance".

Signed-off-by: Kohei Enju <enjuk@amazon.com>
Reviewed-by: Samiullah Khawaja <skhawaja@google.com>
Link: https://patch.msgid.link/20250829064857.51503-1-enjuk@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-01 13:09:08 -07:00
Shahar Shitrit
da0e219764 devlink: Make health reporter burst period configurable
Enable configuration of the burst period — a time window starting
from the first error recovery, during which the reporter allows
recovery attempts for each reported error.

This feature is helpful when a single underlying issue causes multiple
errors, as it delays the start of the grace period to allow sufficient
time for recovering all related errors. For example, if multiple TX
queues time out simultaneously, a sufficient burst period could allow
all affected TX queues to be recovered within that window. Without this
period, only the first TX queue that reports a timeout will undergo
recovery, while the remaining TX queues will be blocked once the grace
period begins.

Configuration example:
$ devlink health set pci/0000:00:09.0 reporter tx burst_period 500

Configuration example with ynl:
./tools/net/ynl/pyynl/cli.py \
 --spec Documentation/netlink/specs/devlink.yaml \
 --do health-reporter-set --json '{
  "bus-name": "auxiliary",
  "dev-name": "mlx5_core.eth.0",
  "port-index": 65535,
  "health-reporter-name": "tx",
  "health-reporter-burst-period": 500
}'

Signed-off-by: Shahar Shitrit <shshitrit@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Carolina Jubran <cjubran@nvidia.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/20250824084354.533182-5-mbloch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-26 17:24:16 -07:00
Jakub Kicinski
a9af709fda Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.17-rc3).

No conflicts or adjacent changes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-21 11:33:15 -07:00
Lei Wei
6b9f301985 docs: networking: Add PPE driver documentation for Qualcomm IPQ9574 SoC
Add description and high-level diagram for PPE, driver overview and
module enable/debug information.

Signed-off-by: Lei Wei <quic_leiwei@quicinc.com>
Signed-off-by: Luo Jie <quic_luoj@quicinc.com>
Link: https://patch.msgid.link/20250818-qcom_ipq_ppe-v8-2-1d4ff641fce9@quicinc.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-08-21 12:38:41 +02:00
Eric Dumazet
a6d4f25888 net: set net.core.rmem_max and net.core.wmem_max to 4 MB
SO_RCVBUF and SO_SNDBUF have limited range today, unless
distros or system admins change rmem_max and wmem_max.

Even iproute2 uses 1 MB SO_RCVBUF which is capped by
the kernel.

Decouple [rw]mem_max and [rw]mem_default and increase
[rw]mem_max to 4 MB.

Before:

$ sysctl net.core.rmem_default net.core.rmem_max net.core.wmem_default net.core.wmem_max
net.core.rmem_default = 212992
net.core.rmem_max = 212992
net.core.wmem_default = 212992
net.core.wmem_max = 212992

After:

$ sysctl net.core.rmem_default net.core.rmem_max net.core.wmem_default net.core.wmem_max
net.core.rmem_default = 212992
net.core.rmem_max = 4194304
net.core.wmem_default = 212992
net.core.wmem_max = 4194304

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Link: https://patch.msgid.link/20250819174030.1986278-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-20 19:35:00 -07:00
Eric Biggers
d5a253702a sctp: Stop accepting md5 and sha1 for net.sctp.cookie_hmac_alg
The upgrade of the cookie authentication algorithm to HMAC-SHA256 kept
some backwards compatibility for the net.sctp.cookie_hmac_alg sysctl by
still accepting the values 'md5' and 'sha1'.  Those algorithms are no
longer actually used, but rather those values were just treated as
requests to enable cookie authentication.

As requested at
https://lore.kernel.org/netdev/CADvbK_fmCRARc8VznH8cQa-QKaCOQZ6yFbF=1-VDK=zRqv_cXw@mail.gmail.com/
and https://lore.kernel.org/netdev/20250818084345.708ac796@kernel.org/ ,
go further and start rejecting 'md5' and 'sha1' completely.

Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Link: https://patch.msgid.link/20250818205426.30222-6-ebiggers@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-19 19:36:26 -07:00
Eric Biggers
2f3dd6ec90 sctp: Convert cookie authentication to use HMAC-SHA256
Convert SCTP cookies to use HMAC-SHA256, instead of the previous choice
of the legacy algorithms HMAC-MD5 and HMAC-SHA1.  Simplify and optimize
the code by using the HMAC-SHA256 library instead of crypto_shash, and
by preparing the HMAC key when it is generated instead of per-operation.

This doesn't break compatibility, since the cookie format is an
implementation detail, not part of the SCTP protocol itself.

Note that the cookie size doesn't change either.  The HMAC field was
already 32 bytes, even though previously at most 20 bytes were actually
compared.  32 bytes exactly fits an untruncated HMAC-SHA256 value.  So,
although we could safely truncate the MAC to something slightly shorter,
for now just keep the cookie size the same.

I also considered SipHash, but that would generate only 8-byte MACs.  An
8-byte MAC *might* suffice here.  However, there's quite a lot of
information in the SCTP cookies: more than in TCP SYN cookies.  So
absent an analysis that occasional forgeries of all that information is
okay in SCTP, I errored on the side of caution.

Remove HMAC-MD5 and HMAC-SHA1 as options, since the new HMAC-SHA256
option is just better.  It's faster as well as more secure.  For
example, benchmarking on x86_64, cookie authentication is now nearly 3x
as fast as the previous default choice and implementation of HMAC-MD5.

Also just make the kernel always support cookie authentication if SCTP
is supported at all, rather than making it optional in the build.  (It
was sort of optional before, but it didn't really work properly.  E.g.,
a kernel with CONFIG_SCTP_COOKIE_HMAC_MD5=n still supported HMAC-MD5
cookie authentication if CONFIG_CRYPTO_HMAC and CONFIG_CRYPTO_MD5
happened to be enabled in the kconfig for other reasons.)

Acked-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Link: https://patch.msgid.link/20250818205426.30222-5-ebiggers@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-19 19:36:26 -07:00
Mohsin Bashir
5213ff0863 eth: fbnic: Collect packet statistics for XDP
Add support for XDP statistics collection and reporting via rtnl_link
and netdev_queue API.

For XDP programs without frags support, fbnic requires MTU to be less
than the HDS threshold. If an over-sized frame is received, the frame
is dropped and recorded as rx_length_errors reported via ip stats to
highlight that this is an error.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Mohsin Bashir <mohsin.bashr@gmail.com>
Link: https://patch.msgid.link/20250813221319.3367670-9-mohsin.bashr@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-08-19 10:51:16 +02:00
Geliang Tang
f5ce071462 mptcp: disable add_addr retransmission when timeout is 0
When add_addr_timeout was set to 0, this caused the ADD_ADDR to be
retransmitted immediately, which looks like a buggy behaviour. Instead,
interpret 0 as "no retransmissions needed".

The documentation is updated to explicitly state that setting the timeout
to 0 disables retransmission.

Fixes: 93f323b9cc ("mptcp: add a new sysctl add_addr_timeout")
Cc: stable@vger.kernel.org
Suggested-by: Matthieu Baerts <matttbe@kernel.org>
Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250815-net-mptcp-misc-fixes-6-17-rc2-v1-5-521fe9957892@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-18 17:39:58 -07:00
Jakub Kicinski
c4f72d3747 Merge tag 'docs/v6.17-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-docs
Mauro Carvalho Chehab says:

====================
add a generic yaml parser integrated with Netlink specs generation

- An YAML parser Sphinx plugin, integrated with Netlink YAML doc
  parser.

The patch content is identical to my v10 submission:
https://lore.kernel.org/cover.1753718185.git.mchehab+huawei@kernel.org

* tag 'docs/v6.17-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-docs:
  sphinx: parser_yaml.py: fix line numbers information
  docs: parser_yaml.py: fix backward compatibility with old docutils
  docs: parser_yaml.py: add support for line numbers from the parser
  tools: netlink_yml_parser.py: add line numbers to parsed data
  MAINTAINERS: add netlink_yml_parser.py to linux-doc
  docs: netlink: remove obsolete .gitignore from unused directory
  tools: ynl_gen_rst.py: drop support for generating index files
  docs: uapi: netlink: update netlink specs link
  docs: use parser_yaml extension to handle Netlink specs
  docs: sphinx: add a parser for yaml files for Netlink specs
  tools: ynl_gen_rst.py: cleanup coding style
  docs: netlink: index.rst: add a netlink index file
  tools: ynl_gen_rst.py: Split library from command line tool
  docs: netlink: netlink-raw.rst: use :ref: instead of :doc:
====================

Link: https://patch.msgid.link/20250812113329.356c93c2@foz.lan
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-14 17:26:38 -07:00
Jordan Rife
e93f7af148 docs: Fix name for net.ipv4.udp_child_hash_entries
udp_child_ehash_entries -> udp_child_hash_entries

Fixes: 9804985bf2 ("udp: Introduce optional per-netns hash table.")
Signed-off-by: Jordan Rife <jordan@jrife.io>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250808185800.1189042-1-jordan@jrife.io
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-08-12 13:03:13 +02:00
Mauro Carvalho Chehab
6abc773747 docs: netlink: remove obsolete .gitignore from unused directory
The previous code was generating source rst files
under Documentation/networking/netlink_spec/. With the
Sphinx YAML parser, this is now gone. So, stop ignoring
*.rst files inside netlink specs directory.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
2025-08-12 07:47:30 +02:00
Mauro Carvalho Chehab
1ce4da3dd9 docs: use parser_yaml extension to handle Netlink specs
Instead of manually calling ynl_gen_rst.py, use a Sphinx extension.
This way, no .rst files would be written to the Kernel source
directories.

We are using here a toctree with :glob: property. This way, there
is no need to touch the netlink/specs/index.rst file every time
a new Netlink spec is added/renamed/removed.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
2025-08-12 07:47:30 +02:00
Linus Torvalds
b1cce98493 Merge tag 'docs-6.17' of git://git.lwn.net/linux
Pull documentation updates from Jonathan Corbet:
 "It has been a relatively busy cycle for docs, especially the build
  system:

   - The Perl kernel-doc script was added to 2.3.52pre1 just after the
     turn of the millennium. Over the following 25 years, it accumulated
     a vast amount of cruft, all in a language few people want to deal
     with anymore. Mauro's Python replacement in 6.16 faithfully
     reproduced all of the cruft in the hope of avoiding regressions.

     Now that we have a more reasonable code base, though, we can work
     on cleaning it up; many of the changes this time around are toward
     that end.

   - A reorganization of the ext4 docs into the usual TOC format.

   - Various Chinese translations and updates.

   - A new script from Mauro to help with docs-build testing.

   - A new document for linked lists

   - A sweep through MAINTAINERS fixing broken GitHub git:// repository
     links.

  ...and lots of fixes and updates"

* tag 'docs-6.17' of git://git.lwn.net/linux: (147 commits)
  scripts: add origin commit identification based on specific patterns
  sphinx: kernel_abi: fix performance regression with O=<dir>
  Documentation: core-api: entry: Replace deprecated KVM entry/exit functions
  docs: fault-injection: drop reference to md-faulty
  docs: document linked lists
  scripts: kdoc: make it backward-compatible with Python 3.7
  docs: kernel-doc: emit warnings for ancient versions of Python
  Documentation/rtla: Describe exit status
  Documentation/rtla: Add include common_appendix.rst
  docs: kernel: Clarify printk_ratelimit_burst reset behavior
  Documentation: ioctl-number: Don't repeat macro names
  Documentation: ioctl-number: Shorten macros table
  Documentation: ioctl-number: Correct full path to papr-physical-attestation.h
  Documentation: ioctl-number: Extend "Include File" column width
  Documentation: ioctl-number: Fix linuxppc-dev mailto link
  overlayfs.rst: fix typos
  docs: kdoc: emit a warning for ancient versions of Python
  docs: kdoc: clean up check_sections()
  docs: kdoc: directly access the always-there KdocItem fields
  docs: kdoc: straighten up dump_declaration()
  ...
2025-07-31 08:36:51 -07:00
Jakub Kicinski
ecc383e5fe Merge tag 'linux-can-next-for-6.17-20250725' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next
Marc Kleine-Budde says:

====================
pull-request: can-next 2025-07-25

The first patch is by Khaled Elnaggar and converts the janz-ican3
driver's fwinfo_show() to sysfs_emit().

Vincent Mailhol contributes 3 patches that first fix a warning in the
ti_hecc driver and then add missing COMPILE_TEST more compile
coverage to the ti_hecc and tscan1 driver.

Randy Dunlap's patch let's the tscan1 driver depend on PC104.

A patch by Luis Felipe Hernandez fixes a kernel-doc error in the
ctucanfd driver.

Jimmy Assarsson contributes 10 patches for the kvaser_pciefd and 11
for the kvaser_usb driver. Both series simplify the identification of
physical the CAN interfaces and add devlink support to get information
about the running firmware.

* tag 'linux-can-next-for-6.17-20250725' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next: (27 commits)
  Documentation: devlink: add devlink documentation for the kvaser_usb driver
  can: kvaser_usb: Add devlink port support
  can: kvaser_usb: Expose device information via devlink info_get()
  can: kvaser_usb: Add devlink support
  can: kvaser_usb: Store additional device information
  can: kvaser_usb: Store the different firmware version components in a struct
  can: kvaser_usb: Move comment regarding max_tx_urbs
  can: kvaser_usb: Add intermediate variables
  can: kvaser_usb: Assign netdev.dev_port based on device channel index
  can: kvaser_usb: Add support for ethtool set_phys_id()
  can: kvaser_usb: Add support to control CAN LEDs on device
  Documentation: devlink: add devlink documentation for the kvaser_pciefd driver
  can: kvaser_pciefd: Add devlink port support
  can: kvaser_pciefd: Expose device firmware version via devlink info_get()
  can: kvaser_pciefd: Add devlink support
  can: kvaser_pciefd: Split driver into C-file and header-file.
  can: kvaser_pciefd: Store device channel index
  can: kvaser_pciefd: Store the different firmware version components in a struct
  can: kvaser_pciefd: Add intermediate variable for device struct in probe()
  can: kvaser_pciefd: Add support for ethtool set_phys_id()
  ...
====================

Link: https://patch.msgid.link/20250725161327.4165174-1-mkl@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-25 16:27:19 -07:00
Gabriel Goller
f24987ef69 ipv6: add force_forwarding sysctl to enable per-interface forwarding
It is currently impossible to enable ipv6 forwarding on a per-interface
basis like in ipv4. To enable forwarding on an ipv6 interface we need to
enable it on all interfaces and disable it on the other interfaces using
a netfilter rule. This is especially cumbersome if you have lots of
interfaces and only want to enable forwarding on a few. According to the
sysctl docs [0] the `net.ipv6.conf.all.forwarding` enables forwarding
for all interfaces, while the interface-specific
`net.ipv6.conf.<interface>.forwarding` configures the interface
Host/Router configuration.

Introduce a new sysctl flag `force_forwarding`, which can be set on every
interface. The ip6_forwarding function will then check if the global
forwarding flag OR the force_forwarding flag is active and forward the
packet.

To preserve backwards-compatibility reset the flag (on all interfaces)
to 0 if the net.ipv6.conf.all.forwarding flag is set to 0.

Add a short selftest that checks if a packet gets forwarded with and
without `force_forwarding`.

[0]: https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt

Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Gabriel Goller <g.goller@proxmox.com>
Link: https://patch.msgid.link/20250722081847.132632-1-g.goller@proxmox.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-25 13:06:19 -07:00
Jimmy Assarsson
6304c4c847 Documentation: devlink: add devlink documentation for the kvaser_usb driver
List the version information reported by the kvaser_usb driver
through devlink.

Suggested-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr>
Reviewed-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr>
Signed-off-by: Jimmy Assarsson <extja@kvaser.com>
Link: https://patch.msgid.link/20250725123452.41-12-extja@kvaser.com
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2025-07-25 18:01:22 +02:00
Jimmy Assarsson
fed552478e Documentation: devlink: add devlink documentation for the kvaser_pciefd driver
List the version information reported by the kvaser_pciefd driver
through devlink.

Suggested-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr>
Reviewed-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr>
Signed-off-by: Jimmy Assarsson <extja@kvaser.com>
Link: https://patch.msgid.link/20250725123230.8-11-extja@kvaser.com
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2025-07-25 17:55:46 +02:00
Samiullah Khawaja
8e7583a4f6 net: define an enum for the napi threaded state
Instead of using '0' and '1' for napi threaded state use an enum with
'disabled' and 'enabled' states.

Tested:
 ./tools/testing/selftests/net/nl_netdev.py
 TAP version 13
 1..7
 ok 1 nl_netdev.empty_check
 ok 2 nl_netdev.lo_check
 ok 3 nl_netdev.page_pool_check
 ok 4 nl_netdev.napi_list_check
 ok 5 nl_netdev.dev_set_threaded
 ok 6 nl_netdev.napi_set_threaded
 ok 7 nl_netdev.nsim_rxq_reset_down
 # Totals: pass:7 fail:0 xfail:0 xpass:0 skip:0 error:0

Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
Link: https://patch.msgid.link/20250723013031.2911384-4-skhawaja@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-24 18:34:55 -07:00
Jakub Kicinski
fbe09277fa ethtool: rss: support removing contexts via Netlink
Implement removing additional RSS contexts via Netlink.
Technically it'd be possible to shoehorn the delete operation
into ethnl_request_ops-compatible handler. The code ends
up longer than open coded version, and I think we'll need
a custom way of sending notifications at some stage (if we
allow tying the context lifetime to the netlink socket, in
the future).

Link: https://patch.msgid.link/20250717234343.2328602-8-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-21 18:21:19 -07:00
Jakub Kicinski
a166ab7816 ethtool: rss: support creating contexts via Netlink
Support creating contexts via Netlink. Setting flow hashing
fields on the new context is not supported at this stage,
it can be added later.

An empty indirection table is not supported. This is a carry
over from the IOCTL interface where empty indirection table
meant delete. We can repurpose empty indirection table in
Netlink but for now to avoid confusion reject it using the
policy.

Support letting user choose the ID for the new context. This was
not possible in IOCTL since the context ID field for the create
action had to be set to the ETH_RXFH_CONTEXT_ALLOC magic value.

Link: https://patch.msgid.link/20250717234343.2328602-7-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-21 18:20:43 -07:00
Stanislav Fomichev
5d4d84618e net: s/dev_set_threaded/netif_set_threaded/
Commit cc34acd577 ("docs: net: document new locking reality")
introduced netif_ vs dev_ function semantics: the former expects locked
netdev, the latter takes care of the locking. We don't strictly
follow this semantics on either side, but there are more dev_xxx handlers
now that don't fit. Rename them to netif_xxx where appropriate.

Note that one dev_set_threaded call still remains in mt76 for debugfs file.

Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250717172333.1288349-7-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-18 17:27:47 -07:00
Jakub Kicinski
ffe5aedc43 Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Martin KaFai Lau says:

====================
pull-request: bpf-next 2025-07-17

We've added 13 non-merge commits during the last 20 day(s) which contain
a total of 4 files changed, 712 insertions(+), 84 deletions(-).

The main changes are:

1) Avoid skipping or repeating a sk when using a TCP bpf_iter,
   from Jordan Rife.

2) Clarify the driver requirement on using the XDP metadata,
   from Song Yoong Siang

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next:
  doc: xdp: Clarify driver implementation for XDP Rx metadata
  selftests/bpf: Add tests for bucket resume logic in established sockets
  selftests/bpf: Create iter_tcp_destroy test program
  selftests/bpf: Create established sockets in socket iterator tests
  selftests/bpf: Make ehash buckets configurable in socket iterator tests
  selftests/bpf: Allow for iteration over multiple states
  selftests/bpf: Allow for iteration over multiple ports
  selftests/bpf: Add tests for bucket resume logic in listening sockets
  bpf: tcp: Avoid socket skips and repeats during iteration
  bpf: tcp: Use bpf_tcp_iter_batch_item for bpf_tcp_iter_state batch items
  bpf: tcp: Get rid of st_bucket_done
  bpf: tcp: Make sure iter->batch always contains a full bucket snapshot
  bpf: tcp: Make mem flags configurable through bpf_iter_tcp_realloc_batch
====================

Link: https://patch.msgid.link/20250717191731.4142326-1-martin.lau@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17 18:07:37 -07:00
Jakub Kicinski
2f70251112 ethtool: rss: support setting flow hashing fields
Add support for ETHTOOL_SRXFH (setting hashing fields) in RSS_SET.

The tricky part is dealing with symmetric hashing. In netlink user
can change the hashing fields and symmetric hash in one request,
in IOCTL the two used to be set via different uAPI requests.
Since fields and hash function config are still separate driver
callbacks - changes to the two are not atomic. Keep things simple
and validate the settings against both pre- and post- change ones.
Meaning that we will reject the config request if user tries
to correct the flow fields and set input_xfrm in one request,
or disables input_xfrm and makes flow fields non-symmetric.

We can adjust it later if there's a real need. Starting simple feels
right, and potentially partially applying the settings isn't nice,
either.

Reviewed-by: Gal Pressman <gal@nvidia.com>
Link: https://patch.msgid.link/20250716000331.1378807-11-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17 16:13:59 -07:00
Jakub Kicinski
d3e2c7bab1 ethtool: rss: support setting input-xfrm via Netlink
Support configuring symmetric hashing via Netlink.
We have the flow field config prepared as part of SET handling,
so scan it for conflicts instead of querying the driver again.

Reviewed-by: Gal Pressman <gal@nvidia.com>
Link: https://patch.msgid.link/20250716000331.1378807-10-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17 16:13:59 -07:00
Jakub Kicinski
51798c519a ethtool: rss: support setting hkey via Netlink
Support setting RSS hashing key via ethtool Netlink.
Use the Netlink policy to make sure user doesn't pass
an empty key, "resetting" the key is not a thing.

Reviewed-by: Gal Pressman <gal@nvidia.com>
Link: https://patch.msgid.link/20250716000331.1378807-7-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17 16:13:58 -07:00
Jakub Kicinski
82ae67cbc4 ethtool: rss: support setting hfunc via Netlink
Support setting RSS hash function / algo via ethtool Netlink.
Like IOCTL we don't validate that the function is within the
range known to the kernel. The drivers do a pretty good job
validating the inputs, and the IDs are technically "dynamically
queried" rather than part of uAPI.

Only change should be that in Netlink we don't support user
explicitly passing ETH_RSS_HASH_NO_CHANGE (0), if no change
is requested the attribute should be absent.

The ETH_RSS_HASH_NO_CHANGE is retained in driver-facing
API for consistency (not that I see a strong reason for it).

Reviewed-by: Gal Pressman <gal@nvidia.com>
Link: https://patch.msgid.link/20250716000331.1378807-6-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17 16:13:58 -07:00
Jakub Kicinski
c0ae03588b ethtool: rss: initial RSS_SET (indirection table handling)
Add initial support for RSS_SET, for now only operations on
the indirection table are supported.

Unlike the ioctl don't check if at least one parameter is
being changed. This is how other ethtool-nl ops behave,
so pick the ethtool-nl consistency vs copying ioctl behavior.

There are two special cases here:
 1) resetting the table to defaults;
 2) support for tables of different size.

For (1) I use an empty Netlink attribute (array of size 0).

(2) may require some background. AFAICT a lot of modern devices
allow allocating RSS tables of different sizes. mlx5 can upsize
its tables, bnxt has some "table size calculation", and Intel
folks asked about RSS table sizing in context of resource allocation
in the past. The ethtool IOCTL API has a concept of table size,
but right now the user is expected to provide a table exactly
the size the device requests. Some drivers may change the table
size at runtime (in response to queue count changes) but the
user is not in control of this. What's not great is that all
RSS contexts share the same table size. For example a device
with 128 queues enabled, 16 RSS contexts 8 queues in each will
likely have 256 entry tables for each of the 16 contexts,
while 32 would be more than enough given each context only has
8 queues. To address this the Netlink API should avoid enforcing
table size at the uAPI level, and should allow the user to express
the min table size they expect.

To fully solve (2) we will need more driver plumbing but
at the uAPI level this patch allows the user to specify
a table size smaller than what the device advertises. The device
table size must be a multiple of the user requested table size.
We then replicate the user-provided table to fill the full device
size table. This addresses the "allow the user to express the min
table size" objective, while not enforcing any fixed size.
From Netlink perspective .get_rxfh_indir_size() is now de facto
the "max" table size supported by the device.

We may choose to support table replication in ethtool, too,
when we actually plumb this thru the device APIs.

Initially I was considering moving full pattern generation
to the kernel (which queues to use, at which frequency and
what min sequence length). I don't think this complexity
would buy us much and most if not all devices have pow-2
table sizes, which simplifies the replication a lot.

Reviewed-by: Gal Pressman <gal@nvidia.com>
Link: https://patch.msgid.link/20250716000331.1378807-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17 16:13:58 -07:00
Dragos Tatulea
8890ee6dcf net/mlx5e: Add device PCIe congestion ethtool stats
Implement the PCIe Congestion Event notifier which triggers a work item
to query the PCIe Congestion Event object. The result of the congestion
state is reflected in the new ethtool stats:

* pci_bw_inbound_high: the device has crossed the high threshold for
  inbound PCIe traffic.
* pci_bw_inbound_low: the device has crossed the low threshold for
  inbound PCIe traffic
* pci_bw_outbound_high: the device has crossed the high threshold for
  outbound PCIe traffic.
* pci_bw_outbound_low: the device has crossed the low threshold for
  outbound PCIe traffic

The high and low thresholds are currently configured at 90% and 75%.
These are hysteresis thresholds which help to check if the
PCI bus on the device side is in a congested state.

If low + 1 = high then the device is in a congested state. If low == high
then the device is not in a congested state.

The counters are also documented.

A follow-up patch will make the thresholds configurable.

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/1752589821-145787-3-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-16 17:56:32 -07:00
Song Yoong Siang
ef57dc6f52 doc: xdp: Clarify driver implementation for XDP Rx metadata
Clarify that drivers must remove device-reserved metadata from the
data_meta area before passing frames to XDP programs.

Additionally, expand the explanation of how userspace and BPF programs
should coordinate the use of METADATA_SIZE, and add a detailed diagram
to illustrate pointer adjustments and metadata layout.

Also describe the requirements and constraints enforced by
bpf_xdp_adjust_meta().

Signed-off-by: Song Yoong Siang <yoong.siang.song@intel.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://lore.kernel.org/r/20250716154846.3513575-1-yoong.siang.song@intel.com
2025-07-16 16:36:11 -07:00
Eric Dumazet
6c758062c6 tcp: add LINUX_MIB_BEYOND_WINDOW
Add a new SNMP MIB : LINUX_MIB_BEYOND_WINDOW

Incremented when an incoming packet is received beyond the
receiver window.

nstat -az | grep TcpExtBeyondWindow

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250711114006.480026-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-14 18:41:42 -07:00
Samiullah Khawaja
2677010e77 Add support to set NAPI threaded for individual NAPI
A net device has a threaded sysctl that can be used to enable threaded
NAPI polling on all of the NAPI contexts under that device. Allow
enabling threaded NAPI polling at individual NAPI level using netlink.

Extend the netlink operation `napi-set` and allow setting the threaded
attribute of a NAPI. This will enable the threaded polling on a NAPI
context.

Add a test in `nl_netdev.py` that verifies various cases of threaded
NAPI being set at NAPI and at device level.

Tested
 ./tools/testing/selftests/net/nl_netdev.py
 TAP version 13
 1..7
 ok 1 nl_netdev.empty_check
 ok 2 nl_netdev.lo_check
 ok 3 nl_netdev.page_pool_check
 ok 4 nl_netdev.napi_list_check
 ok 5 nl_netdev.dev_set_threaded
 ok 6 nl_netdev.napi_set_threaded
 ok 7 nl_netdev.nsim_rxq_reset_down
 # Totals: pass:7 fail:0 xfail:0 xpass:0 skip:0 error:0

Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250710211203.3979655-1-skhawaja@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-14 18:02:37 -07:00
Jakub Kicinski
0f26870a98 Merge tag 'nf-next-25-07-10' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next
Pablo Neira Ayuso says:

====================
Netfilter updates for net-next (v2)

The following series contains an initial small batch of Netfilter
updates for net-next:

1) Remove DCCP conntrack support, keep DCCP matches around in order to
   avoid breakage when loading ruleset, add Kconfig to wrap the code
   so it can be disabled by distributors.

2) Remove buggy code aiming at shrinking netlink deletion event, then
   re-add it correctly in another patch. This is to prevent -stable to
   pick up on a fix that breaks old userspace. From Phil Sutter.

3) Missing WARN_ON_ONCE() to check for lockdep_commit_lock_is_held()
   to uncover bugs. From Fedor Pchelkin.

* tag 'nf-next-25-07-10' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
  netfilter: nf_tables: adjust lockdep assertions handling
  netfilter: nf_tables: Reintroduce shortened deletion notifications
  netfilter: nf_tables: Drop dead code from fill_*_info routines
  netfilter: conntrack: remove DCCP protocol support
====================

Link: https://patch.msgid.link/20250710010706.2861281-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-10 18:18:41 -07:00
Jakub Kicinski
178331743c ethtool: rss: report which fields are configured for hashing
Implement ETHTOOL_GRXFH over Netlink. The number of flow types is
reasonable (around 20) so report all of them at once for simplicity.

Do not maintain the flow ID mapping with ioctl at the uAPI level.
This gives us a chance to clean up the confusion that come from
RxNFC vs RxFH (flow direction vs hashing) in the ioctl.
Try to align with the names used in ethtool CLI, they seem to have
stood the test of time just fine. One annoyance is that we still
call L4 ports the weird names, but I guess they also apply to IPSec
(where they cover the SPI) so it is what it is.

 $ ynl --family ethtool --dump rss-get
 {
    "header": {
	"dev-index": 1,
	"dev-name": "enp1s0"
    },
    "hfunc": 1,
    "hkey": b"...",
    "indir": [0, 1, ...],
    "flow-hash": {
        "ether": {"l2da"},
	"ah-esp4": {"ip-src", "ip-dst"},
        "ah-esp6": {"ip-src", "ip-dst"},
        "ah4": {"ip-src", "ip-dst"},
        "ah6": {"ip-src", "ip-dst"},
        "esp4": {"ip-src", "ip-dst"},
        "esp6": {"ip-src", "ip-dst"},
        "ip4": {"ip-src", "ip-dst"},
        "ip6": {"ip-src", "ip-dst"},
        "sctp4": {"ip-src", "ip-dst"},
        "sctp6": {"ip-src", "ip-dst"},
        "udp4": {"ip-src", "ip-dst"},
        "udp6": {"ip-src", "ip-dst"}
        "tcp4": {"l4-b-0-1", "l4-b-2-3", "ip-src", "ip-dst"},
        "tcp6": {"l4-b-0-1", "l4-b-2-3", "ip-src", "ip-dst"},
    },
 }

Link: https://patch.msgid.link/20250708220640.2738464-5-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-10 17:57:49 -07:00
Jason Xing
45e359be1c net: xsk: introduce XDP_MAX_TX_SKB_BUDGET setsockopt
This patch provides a setsockopt method to let applications leverage to
adjust how many descs to be handled at most in one send syscall. It
mitigates the situation where the default value (32) that is too small
leads to higher frequency of triggering send syscall.

Considering the prosperity/complexity the applications have, there is no
absolutely ideal suggestion fitting all cases. So keep 32 as its default
value like before.

The patch does the following things:
- Add XDP_MAX_TX_SKB_BUDGET socket option.
- Set max_tx_budget to 32 by default in the initialization phase as a
  per-socket granular control.
- Set the range of max_tx_budget as [32, xs->tx->nentries].

The idea behind this comes out of real workloads in production. We use a
user-level stack with xsk support to accelerate sending packets and
minimize triggering syscalls. When the packets are aggregated, it's not
hard to hit the upper bound (namely, 32). The moment user-space stack
fetches the -EAGAIN error number passed from sendto(), it will loop to try
again until all the expected descs from tx ring are sent out to the driver.
Enlarging the XDP_MAX_TX_SKB_BUDGET value contributes to less frequency of
sendto() and higher throughput/PPS.

Here is what I did in production, along with some numbers as follows:
For one application I saw lately, I suggested using 128 as max_tx_budget
because I saw two limitations without changing any default configuration:
1) XDP_MAX_TX_SKB_BUDGET, 2) socket sndbuf which is 212992 decided by
net.core.wmem_default. As to XDP_MAX_TX_SKB_BUDGET, the scenario behind
this was I counted how many descs are transmitted to the driver at one
time of sendto() based on [1] patch and then I calculated the
possibility of hitting the upper bound. Finally I chose 128 as a
suitable value because 1) it covers most of the cases, 2) a higher
number would not bring evident results. After twisting the parameters,
a stable improvement of around 4% for both PPS and throughput and less
resources consumption were found to be observed by strace -c -p xxx:
1) %time was decreased by 7.8%
2) error counter was decreased from 18367 to 572

[1]: https://lore.kernel.org/all/20250619093641.70700-1-kerneljasonxing@gmail.com/

Signed-off-by: Jason Xing <kernelxing@tencent.com>
Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://patch.msgid.link/20250704160138.48677-1-kerneljasonxing@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-07-10 14:48:29 +02:00
Kamil Horák - 2N
67c0170566 net: phy: MII-Lite PHY interface mode
Some Broadcom PHYs are capable to operate in simplified MII mode,
without TXER, RXER, CRS and COL signals as defined for the MII.
The MII-Lite mode can be used on most Ethernet controllers with full
MII interface by just leaving the input signals (RXER, CRS, COL)
inactive. The absence of COL signal makes half-duplex link modes
impossible but does not interfere with BroadR-Reach link modes on
Broadcom PHYs, because they are all full-duplex only.

Add MII-Lite interface mode, especially for Broadcom two-wire PHYs.

Signed-off-by: Kamil Horák - 2N <kamilh@axis.com>
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/20250708090140.61355-2-kamilh@axis.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-09 19:32:30 -07:00
Jason Xing
819802e25a Documentation: xsk: correct the obsolete references and examples
The modified lines are mainly related to the following commits[1][2]
which remove those tests and examples. Since samples/bpf has been
deprecated, we can refer to more examples that are easily searched
in the various xdp-projects, like the following link:
https://github.com/xdp-project/bpf-examples/tree/main/AF_XDP-example

[1]
commit f366006342 ("libbpf: move xsk.{c,h} into selftests/bpf")
[2]
commit cfb5a2dbf1 ("bpf, samples: Remove AF_XDP samples")

Signed-off-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250708062907.11557-1-kerneljasonxing@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-09 19:29:32 -07:00
Ivan Vecera
2df8e64e01 dpll: Add basic Microchip ZL3073x support
Microchip Azurite ZL3073x represents chip family providing DPLL
and optionally PHC (PTP) functionality. The chips can be connected
be connected over I2C or SPI bus.

They have the following characteristics:
* up to 5 separate DPLL units (channels)
* 5 synthesizers
* 10 input pins (references)
* 10 outputs
* 20 output pins (output pin pair shares one output)
* Each reference and output can operate in either differential or
  single-ended mode (differential mode uses 2 pins)
* Each output is connected to one of the synthesizers
* Each synthesizer is driven by one of the DPLL unit

The device uses 7-bit addresses and 8-bits values. It exposes 8-, 16-,
32- and 48-bits registers in address range <0x000,0x77F>. Due to 7bit
addressing, the range is organized into pages of 128 bytes, with each
page containing a page selector register at address 0x7F.
For reading/writing multi-byte registers, the device supports bulk
transfers.

Add basic functionality to access device registers, probe functionality
both I2C and SPI cases and add devlink support to provide info and
to set clock ID parameter.

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://patch.msgid.link/20250704182202.1641943-6-ivecera@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-09 19:08:52 -07:00