Pull nfsd updates from Chuck Lever:
"Neil Brown and Jeff Layton contributed a dynamic thread pool sizing
mechanism for NFSD. The sunrpc layer now tracks minimum and maximum
thread counts per pool, and NFSD adjusts running thread counts based
on workload: idle threads exit after a timeout when the pool exceeds
its minimum, and new threads spawn automatically when all threads are
busy. Administrators control this behavior via the nfsdctl netlink
interface.
Rick Macklem, FreeBSD NFS maintainer, generously contributed server-
side support for the POSIX ACL extension to NFSv4, as specified in
draft-ietf-nfsv4-posix-acls. This extension allows NFSv4 clients to
get and set POSIX access and default ACLs using native NFSv4
operations, eliminating the need for sideband protocols. The feature
is gated by a Kconfig option since the IETF draft has not yet been
ratified.
Chuck Lever delivered numerous improvements to the xdrgen tool. Error
reporting now covers parsing, AST transformation, and invalid
declarations. Generated enum decoders validate incoming values against
valid enumerator lists. New features include pass-through line support
for embedding C directives in XDR specifications, 16-bit integer
types, and program number definitions. Several code generation issues
were also addressed.
When an administrator revokes NFSv4 state for a filesystem via the
unlock_fs interface, ongoing async COPY operations referencing that
filesystem are now cancelled, with CB_OFFLOAD callbacks notifying
affected clients.
The remaining patches in this pull request are clean-ups and minor
optimizations. Sincere thanks to all contributors, reviewers, testers,
and bug reporters who participated in the v7.0 NFSD development cycle"
* tag 'nfsd-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (45 commits)
NFSD: Add POSIX ACL file attributes to SUPPATTR bitmasks
NFSD: Add POSIX draft ACL support to the NFSv4 SETATTR operation
NFSD: Add support for POSIX draft ACLs for file creation
NFSD: Add support for XDR decoding POSIX draft ACLs
NFSD: Refactor nfsd_setattr()'s ACL error reporting
NFSD: Do not allow NFSv4 (N)VERIFY to check POSIX ACL attributes
NFSD: Add nfsd4_encode_fattr4_posix_access_acl
NFSD: Add nfsd4_encode_fattr4_posix_default_acl
NFSD: Add nfsd4_encode_fattr4_acl_trueform_scope
NFSD: Add nfsd4_encode_fattr4_acl_trueform
Add RPC language definition of NFSv4 POSIX ACL extension
NFSD: Add a Kconfig setting to enable support for NFSv4 POSIX ACLs
xdrgen: Implement pass-through lines in specifications
nfsd: cancel async COPY operations when admin revokes filesystem state
nfsd: add controls to set the minimum number of threads per pool
nfsd: adjust number of running nfsd threads based on activity
sunrpc: allow svc_recv() to return -ETIMEDOUT and -EBUSY
sunrpc: split new thread creation into a separate function
sunrpc: introduce the concept of a minimum number of threads per pool
sunrpc: track the max number of requested threads in a pool
...
The following warnings were visible:
$ ./scripts/kernel-doc -Wall -none \
net/mptcp/ include/net/mptcp.h include/uapi/linux/mptcp*.h \
include/trace/events/mptcp.h
Warning: net/mptcp/token.c:108 No description found for return value of 'mptcp_token_new_request'
Warning: net/mptcp/token.c:151 No description found for return value of 'mptcp_token_new_connect'
Warning: net/mptcp/token.c:246 No description found for return value of 'mptcp_token_get_sock'
Warning: net/mptcp/token.c:298 No description found for return value of 'mptcp_token_iter_next'
Warning: net/mptcp/protocol.c:4431 No description found for return value of 'mptcp_splice_read'
Warning: include/uapi/linux/mptcp_pm.h:13 missing initial short description on line:
* enum mptcp_event_type
Address all of them: either by using the 'Return:' keyword, or by adding
a missing initial short description.
The MPTCP CI will soon report issues with kdoc to avoid introducing new
issues and being flagged by the Netdev CI.
Reviewed-by: Geliang Tang <geliang@kernel.org>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260205-net-mptcp-misc-fixes-6-19-rc8-v2-3-c2720ce75c34@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently, the dpll subsystem exports the fractional frequency offset
(FFO) in parts per million (ppm). This granularity is insufficient for
high-precision synchronization scenarios which often require parts per
trillion (ppt) resolution.
Add a new netlink attribute DPLL_A_PIN_FRACTIONAL_FREQUENCY_OFFSET_PPT
to expose the FFO in ppt.
Update the dpll netlink core to expect the driver-provided FFO value
to be in ppt. To maintain backward compatibility with existing userspace
tools, populate the legacy DPLL_A_PIN_FRACTIONAL_FREQUENCY_OFFSET
attribute by dividing the new ppt value by 1,000,000.
Update the zl3073x and mlx5 drivers to provide the FFO value in ppt:
- zl3073x: adjust the fixed-point calculation to produce ppt (10^12)
instead of ppm (10^6).
- mlx5: scale the existing ppm value by 1,000,000.
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Reviewed-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://patch.msgid.link/20260126162253.27890-1-ivecera@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add a new "min_threads" variable to the nfsd_net, along with the
corresponding netlink interface, to set that value from userland.
Pass that value to svc_set_pool_threads() and svc_set_num_threads().
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Pull networking fixes from Jakub Kicinski:
"Including fixes from CAN and wireless.
Pretty big, but hard to make up any cohesive story that would explain
it, a random collection of fixes. The two reverts of bad patches from
this release here feel like stuff that'd normally show up by rc5 or
rc6. Perhaps obvious thing to say, given the holiday timing.
That said, no active investigations / regressions. Let's see what the
next week brings.
Current release - fix to a fix:
- can: alloc_candev_mqs(): add missing default CAN capabilities
Current release - regressions:
- usbnet: fix crash due to missing BQL accounting after resume
- Revert "net: wwan: mhi_wwan_mbim: Avoid -Wflex-array-member-not ...
Previous releases - regressions:
- Revert "nfc/nci: Add the inconsistency check between the input ...
Previous releases - always broken:
- number of driver fixes for incorrect use of seqlocks on stats
- rxrpc: fix recvmsg() unconditional requeue, don't corrupt rcv queue
when MSG_PEEK was set
- ipvlan: make the addrs_lock be per port avoid races in the port
hash table
- sched: enforce that teql can only be used as root qdisc
- virtio: coalesce only linear skb
- wifi: ath12k: fix dead lock while flushing management frames
- eth: igc: reduce TSN TX packet buffer from 7KB to 5KB per queue"
* tag 'net-6.19-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (96 commits)
Octeontx2-af: Add proper checks for fwdata
dpll: Prevent duplicate registrations
net/sched: act_ife: avoid possible NULL deref
hinic3: Fix netif_queue_set_napi queue_index input parameter error
vsock/test: add stream TX credit bounds test
vsock/virtio: cap TX credit to local buffer size
vsock/test: fix seqpacket message bounds test
vsock/virtio: fix potential underflow in virtio_transport_get_credit()
net: fec: account for VLAN header in frame length calculations
net: openvswitch: fix data race in ovs_vport_get_upcall_stats
octeontx2-af: Fix error handling
net: pcs: pcs-mtk-lynxi: report in-band capability for 2500Base-X
rxrpc: Fix data-race warning and potential load/store tearing
net: dsa: fix off-by-one in maximum bridge ID determination
net: bcmasp: Fix network filter wake for asp-3.0
bonding: provide a net pointer to __skb_flow_dissect()
selftests: net: amt: wait longer for connection before sending packets
be2net: Fix NULL pointer dereference in be_cmd_get_mac_from_list
Revert "net: wwan: mhi_wwan_mbim: Avoid -Wflex-array-member-not-at-end warning"
netrom: fix double-free in nr_route_frame()
...
This reverts commit 77b9c4a438, reversing
changes made to 4515ec4ad5:
931420a2fc ("selftests/net: Add netkit container tests")
ab771c938d ("selftests/net: Make NetDrvContEnv support queue leasing")
6be87fbb27 ("selftests/net: Add env for container based tests")
61d99ce3df ("selftests/net: Add bpf skb forwarding program")
920da36341 ("netkit: Add xsk support for af_xdp applications")
eef51113f8 ("netkit: Add netkit notifier to check for unregistering devices")
b5ef109d22 ("netkit: Implement rtnl_link_ops->alloc and ndo_queue_create")
b5c3fa4a0b ("netkit: Add single device mode for netkit")
0073d2fd67 ("xsk: Proxy pool management for leased queues")
1ecea95dd3 ("xsk: Extend xsk_rcv_check validation")
804bf334d0 ("net: Proxy netdev_queue_get_dma_dev for leased queues")
0caa9a8dde ("net: Proxy net_mp_{open,close}_rxq for leased queues")
ff8889ff91 ("net, ethtool: Disallow leased real rxqs to be resized")
9e2103f361 ("net: Add lease info to queue-get response")
31127dedde ("net: Implement netdev_nl_queue_create_doit")
a5546e18f7 ("net: Add queue-create operation")
The series will conflict with io_uring work, and the code needs more
polish.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add a ynl netdev family operation called queue-create that creates a
new queue on a netdevice:
name: queue-create
attribute-set: queue
flags: [admin-perm]
do:
request:
attributes:
- ifindex
- type
- lease
reply: &queue-create-op
attributes:
- id
This is a generic operation such that it can be extended for various
use cases in future. Right now it is mandatory to specify ifindex,
the queue type which is enforced to rx and a lease. The newly created
queue id is returned to the caller.
A queue from a virtual device can have a lease which refers to another
queue from a physical device. This is useful for memory providers
and AF_XDP operations which take an ifindex and queue id to allow
applications to bind against virtual devices in containers. The lease
couples both queues together and allows to proxy the operations from
a virtual device in a container to the physical device.
In future, the nested lease attribute can be lifted and made optional
for other use-cases such as dynamic queue creation for physical
netdevs. The lack of lease and the specification of the physical
device as an ifindex will imply that we need a real queue to be
allocated. Similarly, the queue type enforcement to rx can then be
lifted as well to support tx.
An early implementation had only driver-specific integration [0], but
in order for other virtual devices to reuse, it makes sense to have
this as a generic API in core net.
For leasing queues, the virtual netdev must have real_num_rx_queue
less than num_rx_queues at the time of calling queue-create. The
queue-type must be rx as only rx queues are supported for leasing
for now. We also enforce that the queue-create ifindex must point
to a virtual device, and that the nested lease attribute's ifindex
must point to a physical device. The nested lease attribute set
contains a netns-id attribute which is currently only intended for
dumping as part of the queue-get operation. Also, it is modeled as
an s32 type similarly as done elsewhere in the stack.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Co-developed-by: David Wei <dw@davidwei.uk>
Signed-off-by: David Wei <dw@davidwei.uk>
Link: https://bpfconf.ebpf.io/bpfconf2025/bpfconf2025_material/lsfmmbpf_2025_netkit_borkmann.pdf [0]
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20260115082603.219152-2-daniel@iogearbox.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Currently, userspace can retrieve the DPLL working mode but cannot
configure it. This prevents changing the device operation, such as
switching from manual to automatic mode and vice versa.
Add a new callback .mode_set() to struct dpll_device_ops. Extend
the netlink policy and device-set command handling to process
the DPLL_A_MODE attribute. Update the netlink YAML specification
to include the mode attribute in the device-set operation.
Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Link: https://patch.msgid.link/20260114122726.120303-3-ivecera@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Merge fixes related to the energy model management for 6.19-rc6:
- Fix a memory leak in em_create_pd() error path (Malaya Kumar Rout)
- Fix stale description of the cost field in struct em_perf_state to
reflect the current code (Yaxiong Tian)
- Fix and revamp the energy model YNL specification added recently
along with the energy model netlink interface (Changwoo Min)
* pm-em:
PM: EM: Add dump to get-perf-domains in the EM YNL spec
PM: EM: Change cpus' type from string to u64 array in the EM YNL spec
PM: EM: Rename em.yaml to dev-energymodel.yaml
PM: EM: Fix yamllint warnings in the EM YNL spec
PM: EM: Fix memory leak in em_create_pd() error path
PM: EM: Fix incorrect description of the cost field in struct em_perf_state
This commit adds shared shaper state across the cake instances beneath a
cake_mq qdisc. It works by periodically tracking the number of active
instances, and scaling the configured rate by the number of active
queues.
The scan is lockless and simply reads the qlen and the last_active state
variable of each of the instances configured beneath the parent cake_mq
instance. Locking is not required since the values are only updated by
the owning instance, and eventual consistency is sufficient for the
purpose of estimating the number of active queues.
The interval for scanning the number of active queues is set to 200 us.
We found this to be a good tradeoff between overhead and response time.
For a detailed analysis of this aspect see the Netdevconf talk:
https://netdevconf.info/0x19/docs/netdev-0x19-paper16-talk-paper.pdf
Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Jonas Köppeler <j.koeppeler@tu-berlin.de>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://patch.msgid.link/20260109-mq-cake-sub-qdisc-v8-5-8d613fece5d8@redhat.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Add dump to get-perf-domains, so that a user can fetch either information
about a specific performance domain with do or information about all
performance domains with dump. Share the reply format of do and dump using
perf-domain-attrs, so remove perf-domains. The YNL spec, autogenerated
files, and the do implementation are updated, and the dump implementation
is added.
Suggested-by: Donald Hunter <donald.hunter@gmail.com>
Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Link: https://patch.msgid.link/20260108053212.642478-5-changwoo@igalia.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The EM YNL specification used many acronyms, including ‘em’, ‘pd’,
‘ps’, etc. While the acronyms are short and convenient, they could be
confusing. So, let’s spell them out to be more specific. The following
changes were made in the spec. Note that the protocol name cannot exceed
GENL_NAMSIZ (16).
em -> dev-energymodel
pds -> perf-domains
pd -> perf-domain
pd-id -> perf-domain-id
pd-table -> perf-table
ps -> perf-state
get-pds -> get-perf-domains
get-pd-table -> get-perf-table
pd-created -> perf-domain-created
pd-updated -> perf-domain-updated
pd-deleted -> perf-domain-deleted
In addition. doc strings were added to the spec. based on the comments in
energy_model.h. Two flag attributes (perf-state-flags and
perf-domain-flags) were added for easily interpreting the bit flags.
Finally, the autogenerated files and em_netlink.c were updated accordingly
to reflect the name changes.
Suggested-by: Donald Hunter <donald.hunter@gmail.com>
Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Link: https://patch.msgid.link/20260108053212.642478-3-changwoo@igalia.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Pull networking updates from Jakub Kicinski:
"Core & protocols:
- Replace busylock at the Tx queuing layer with a lockless list.
Resulting in a 300% (4x) improvement on heavy TX workloads, sending
twice the number of packets per second, for half the cpu cycles.
- Allow constantly busy flows to migrate to a more suitable CPU/NIC
queue.
Normally we perform queue re-selection when flow comes out of idle,
but under extreme circumstances the flows may be constantly busy.
Add sysctl to allow periodic rehashing even if it'd risk packet
reordering.
- Optimize the NAPI skb cache, make it larger, use it in more paths.
- Attempt returning Tx skbs to the originating CPU (like we already
did for Rx skbs).
- Various data structure layout and prefetch optimizations from Eric.
- Remove ktime_get() from the recvmsg() fast path, ktime_get() is
sadly quite expensive on recent AMD machines.
- Extend threaded NAPI polling to allow the kthread busy poll for
packets.
- Make MPTCP use Rx backlog processing. This lowers the lock
pressure, improving the Rx performance.
- Support memcg accounting of MPTCP socket memory.
- Allow admin to opt sockets out of global protocol memory accounting
(using a sysctl or BPF-based policy). The global limits are a poor
fit for modern container workloads, where limits are imposed using
cgroups.
- Improve heuristics for when to kick off AF_UNIX garbage collection.
- Allow users to control TCP SACK compression, and default to 33% of
RTT.
- Add tcp_rcvbuf_low_rtt sysctl to let datacenter users avoid
unnecessarily aggressive rcvbuf growth and overshot when the
connection RTT is low.
- Preserve skb metadata space across skb_push / skb_pull operations.
- Support for IPIP encapsulation in the nftables flowtable offload.
- Support appending IP interface information to ICMP messages (RFC
5837).
- Support setting max record size in TLS (RFC 8449).
- Remove taking rtnl_lock from RTM_GETNEIGHTBL and RTM_SETNEIGHTBL.
- Use a dedicated lock (and RCU) in MPLS, instead of rtnl_lock.
- Let users configure the number of write buffers in SMC.
- Add new struct sockaddr_unsized for sockaddr of unknown length,
from Kees.
- Some conversions away from the crypto_ahash API, from Eric Biggers.
- Some preparations for slimming down struct page.
- YAML Netlink protocol spec for WireGuard.
- Add a tool on top of YAML Netlink specs/lib for reporting commonly
computed derived statistics and summarized system state.
Driver API:
- Add CAN XL support to the CAN Netlink interface.
- Add uAPI for reporting PHY Mean Square Error (MSE) diagnostics, as
defined by the OPEN Alliance's "Advanced diagnostic features for
100BASE-T1 automotive Ethernet PHYs" specification.
- Add DPLL phase-adjust-gran pin attribute (and implement it in
zl3073x).
- Refactor xfrm_input lock to reduce contention when NIC offloads
IPsec and performs RSS.
- Add info to devlink params whether the current setting is the
default or a user override. Allow resetting back to default.
- Add standard device stats for PSP crypto offload.
- Leverage DSA frame broadcast to implement simple HSR frame
duplication for a lot of switches without dedicated HSR offload.
- Add uAPI defines for 1.6Tbps link modes.
Device drivers:
- Add Motorcomm YT921x gigabit Ethernet switch support.
- Add MUCSE driver for N500/N210 1GbE NIC series.
- Convert drivers to support dedicated ops for timestamping control,
and away from the direct IOCTL handling. While at it support GET
operations for PHY timestamping.
- Add (and convert most drivers to) a dedicated ethtool callback for
reading the Rx ring count.
- Significant refactoring efforts in the STMMAC driver, which
supports Synopsys turn-key MAC IP integrated into a ton of SoCs.
- Ethernet high-speed NICs:
- Broadcom (bnxt):
- support PPS in/out on all pins
- Intel (100G, ice, idpf):
- ice: implement standard ethtool and timestamping stats
- i40e: support setting the max number of MAC addresses per VF
- iavf: support RSS of GTP tunnels for 5G and LTE deployments
- nVidia/Mellanox (mlx5):
- reduce downtime on interface reconfiguration
- disable being an XDP redirect target by default (same as
other drivers) to avoid wasting resources if feature is
unused
- Meta (fbnic):
- add support for Linux-managed PCS on 25G, 50G, and 100G links
- Wangxun:
- support Rx descriptor merge, and Tx head writeback
- support Rx coalescing offload
- support 25G SPF and 40G QSFP modules
- Ethernet virtual:
- Google (gve):
- allow ethtool to configure rx_buf_len
- implement XDP HW RX Timestamping support for DQ descriptor
format
- Microsoft vNIC (mana):
- support HW link state events
- handle hardware recovery events when probing the device
- Ethernet NICs consumer, and embedded:
- usbnet: add support for Byte Queue Limits (BQL)
- AMD (amd-xgbe):
- add device selftests
- NXP (enetc):
- add i.MX94 support
- Broadcom integrated MACs (bcmgenet, bcmasp):
- bcmasp: add support for PHY-based Wake-on-LAN
- Broadcom switches (b53):
- support port isolation
- support BCM5389/97/98 and BCM63XX ARL formats
- Lantiq/MaxLinear switches:
- support bridge FDB entries on the CPU port
- use regmap for register access
- allow user to enable/disable learning
- support Energy Efficient Ethernet
- support configuring RMII clock delays
- add tagging driver for MaxLinear GSW1xx switches
- Synopsys (stmmac):
- support using the HW clock in free running mode
- add Eswin EIC7700 support
- add Rockchip RK3506 support
- add Altera Agilex5 support
- Cadence (macb):
- cleanup and consolidate descriptor and DMA address handling
- add EyeQ5 support
- TI:
- icssg-prueth: support AF_XDP
- Airoha access points:
- add missing Ethernet stats and link state callback
- add AN7583 support
- support out-of-order Tx completion processing
- Power over Ethernet:
- pd692x0: preserve PSE configuration across reboots
- add support for TPS23881B devices
- Ethernet PHYs:
- Open Alliance OATC14 10BASE-T1S PHY cable diagnostic support
- Support 50G SerDes and 100G interfaces in Linux-managed PHYs
- micrel:
- support for non PTP SKUs of lan8814
- enable in-band auto-negotiation on lan8814
- realtek:
- cable testing support on RTL8224
- interrupt support on RTL8221B
- motorcomm: support for PHY LEDs on YT853
- microchip: support for LAN867X Rev.D0 PHYs w/ SQI and cable diag
- mscc: support for PHY LED control
- CAN drivers:
- m_can: add support for optional reset and system wake up
- remove can_change_mtu() obsoleted by core handling
- mcp251xfd: support GPIO controller functionality
- Bluetooth:
- add initial support for PASTa
- WiFi:
- split ieee80211.h file, it's way too big
- improvements in VHT radiotap reporting, S1G, Channel Switch
Announcement handling, rate tracking in mesh networks
- improve multi-radio monitor mode support, and add a cfg80211
debugfs interface for it
- HT action frame handling on 6 GHz
- initial chanctx work towards NAN
- MU-MIMO sniffer improvements
- WiFi drivers:
- RealTek (rtw89):
- support USB devices RTL8852AU and RTL8852CU
- initial work for RTL8922DE
- improved injection support
- Intel:
- iwlwifi: new sniffer API support
- MediaTek (mt76):
- WED support for >32-bit DMA
- airoha NPU support
- regdomain improvements
- continued WiFi7/MLO work
- Qualcomm/Atheros:
- ath10k: factory test support
- ath11k: TX power insertion support
- ath12k: BSS color change support
- ath12k: statistics improvements
- brcmfmac: Acer A1 840 tablet quirk
- rtl8xxxu: 40 MHz connection fixes/support"
* tag 'net-next-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1381 commits)
net: page_pool: sanitise allocation order
net: page pool: xa init with destroy on pp init
net/mlx5e: Support XDP target xmit with dummy program
net/mlx5e: Update XDP features in switch channels
selftests/tc-testing: Test CAKE scheduler when enqueue drops packets
net/sched: sch_cake: Fix incorrect qlen reduction in cake_drop
wireguard: netlink: generate netlink code
wireguard: uapi: generate header with ynl-gen
wireguard: uapi: move flag enums
wireguard: uapi: move enum wg_cmd
wireguard: netlink: add YNL specification
selftests: drv-net: Fix tolerance calculation in devlink_rate_tc_bw.py
selftests: drv-net: Fix and clarify TC bandwidth split in devlink_rate_tc_bw.py
selftests: drv-net: Set shell=True for sysfs writes in devlink_rate_tc_bw.py
selftests: drv-net: Use Iperf3Runner in devlink_rate_tc_bw.py
selftests: drv-net: introduce Iperf3Runner for measurement use cases
selftests: drv-net: Add devlink_rate_tc_bw.py to TEST_PROGS
net: ps3_gelic_net: Use napi_alloc_skb() and napi_gro_receive()
Documentation: net: dsa: mention simple HSR offload helpers
Documentation: net: dsa: mention availability of RedBox
...
This patch adds a near[1] complete YNL specification for WireGuard,
documenting the protocol in a machine-readable format, rather than
comments in wireguard.h, and eases usage from C and non-C programming
languages alike.
The generated C library will be featured in a later patch, so in
this patch I will use the in-kernel python client for examples.
This makes the documentation in the UAPI header redundant, it is
therefore removed. The in-line documentation in the spec is based
on the existing comment in wireguard.h, and once released it will
be available in the kernel documentation at:
https://docs.kernel.org/netlink/specs/wireguard.html
(until then run: make htmldocs)
Generate wireguard.rst from this spec:
$ make -C tools/net/ynl/generated/ wireguard.rst
Query wireguard interface through pyynl:
$ sudo ./tools/net/ynl/pyynl/cli.py --family wireguard \
--dump get-device \
--json '{"ifindex":3}'
[{'fwmark': 0,
'ifindex': 3,
'ifname': 'wg-test',
'listen-port': 54318,
'peers': [{0: {'allowedips': [{0: {'cidr-mask': 0,
'family': 2,
'ipaddr': '0.0.0.0'}},
{0: {'cidr-mask': 0,
'family': 10,
'ipaddr': '::'}}],
'endpoint': b'[...]',
'last-handshake-time': {'nsec': 42, 'sec': 42},
'persistent-keepalive-interval': 42,
'preshared-key': '[...]',
'protocol-version': 1,
'public-key': '[...]',
'rx-bytes': 42,
'tx-bytes': 42}}],
'private-key': '[...]',
'public-key': '[...]'}]
Add another allowed IP prefix:
$ sudo ./tools/net/ynl/pyynl/cli.py --family wireguard \
--do set-device --json '{"ifindex":3,"peers":[
{"public-key":"6a df b1 83 a4 ..","allowedips":[
{"cidr-mask":0,"family":10,"ipaddr":"::"}]}]}'
[1] As can be seen above, the "endpoint" is only dumped as binary data,
as it can't be fully described in YNL. It's either a struct
sockaddr_in or struct sockaddr_in6 depending on the attribute length.
Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Fix two schema check errors that have lurked since the attribute name
validation was made more strict:
not ok 2 conntrack.yaml schema validation
'labels mask' does not match '^[0-9a-z-]+$'
not ok 13 nftables.yaml schema validation
'set id' does not match '^[0-9a-z-]+$'
Signed-off-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20251127123502.89142-5-donald.hunter@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Merge energy model management updates and operating performance points
(OPP) library changes for 6.19-rc1:
- Add support for sending netlink notifications to user space on energy
model updates (Changwoo Mini, Peng Fan)
- Minor improvements to the Rust OPP interface (Tamir Duberstein)
- Fixes to scope-based pointers in the OPP library (Viresh Kumar)
* pm-em:
PM: EM: Add to em_pd_list only when no failure
PM: EM: Notify an event when the performance domain changes
PM: EM: Implement em_notify_pd_created/updated()
PM: EM: Implement em_notify_pd_deleted()
PM: EM: Implement em_nl_get_pd_table_doit()
PM: EM: Implement em_nl_get_pds_doit()
PM: EM: Add an iterator and accessor for the performance domain
PM: EM: Add a skeleton code for netlink notification
PM: EM: Add em.yaml and autogen files
PM: EM: Expose the ID of a performance domain via debugfs
PM: EM: Assign a unique ID when creating a performance domain
* pm-opp:
rust: opp: simplify callers of `to_c_str_array`
OPP: Initialize scope-based pointers inline
rust: opp: fix broken rustdoc link
The fix commit converted several IPv4 address attributes from binary
to u32, but forgot to specify byte-order: big-endian. Without this,
YNL tools display IPv4 addresses incorrectly due to host-endian
interpretation.
Add the missing byte-order: big-endian to all affected u32 IPv4
address fields to ensure correct parsing and display.
Fixes: 1064d521d1 ("netlink: specs: support ipv4-or-v6 for dual-stack fields")
Reported-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Reviewed-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
Link: https://patch.msgid.link/20251125112048.37631-1-liuhangbin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Support querying and resetting to default param values.
Introduce two new devlink netlink attrs:
DEVLINK_ATTR_PARAM_VALUE_DEFAULT and
DEVLINK_ATTR_PARAM_RESET_DEFAULT. The former is used to contain an
optional parameter value inside of the param_value nested
attribute. The latter is used in param-set requests from userspace to
indicate that the driver should reset the param to its default value.
To implement this, two new functions are added to the devlink driver
api: devlink_param::get_default() and
devlink_param::reset_default(). These callbacks allow drivers to
implement default param actions for runtime and permanent cmodes. For
driverinit params, the core latches the last value set by a driver via
devl_param_driverinit_value_set(), and uses that as the default value
for a param.
Because default parameter values are optional, it would be impossible
to discern whether or not a param of type bool has default value of
false or not provided if the default value is encoded using a netlink
flag type. For this reason, when a DEVLINK_PARAM_TYPE_BOOL has an
associated default value, the default value is encoded using a u8
type.
Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com>
Link: https://patch.msgid.link/20251119025038.651131-4-daniel.zahka@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Since commit 1b255e1bea ("tools: ynl: add ipv4-or-v6 display hint"), we
can display either IPv4 or IPv6 addresses for a single field based on the
address family. However, most dual-stack fields still use the ipv4 display
hint. This update changes them to use the new ipv4-or-v6 display hint and
converts IPv4-only fields to use the u32 type.
Field changes:
- v4-or-v6
- IFA_ADDRESS, IFA_LOCAL
- IFLA_GRE_LOCAL, IFLA_GRE_REMOTE
- IFLA_VTI_LOCAL, IFLA_VTI_REMOTE
- IFLA_IPTUN_LOCAL, IFLA_IPTUN_REMOTE
- NDA_DST
- RTA_DST, RTA_SRC, RTA_GATEWAY, RTA_PREFSRC
- FRA_SRC, FRA_DST
- ipv4
- IFA_BROADCAST
- IFLA_GENEVE_REMOTE
- IFLA_IPTUN_6RD_RELAY_PREFIX
Reviewed-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Link: https://patch.msgid.link/20251117024457.3034-3-liuhangbin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Adds DEVLINK_ESWITCH_MODE_SWITCHDEV_INACTIVE attribute to UAPI and
documentation.
Before having traffic flow through an eswitch, a user may want to have the
ability to block traffic towards the FDB until FDB is fully programmed and
the user is ready to send traffic to it. For example: when two eswitches
are present for vports in a multi-PF setup, one eswitch may take over the
traffic from the other when the user chooses.
Before this take over, a user may want to first program the inactive
eswitch and then once ready redirect traffic to this new eswitch.
switchdev modes transition semantics:
legacy->switchdev_inactive: Create switchdev mode normally, traffic not
allowed to flow yet.
switchdev_inactive->switchdev: Enable traffic to flow.
switchdev->switchdev_inactive: Block traffic on the FDB, FDB and
representros state and content is preserved.
When eswitch is configured to this mode, traffic is ignored/dropped on
this eswitch FDB, while current configuration is kept, e.g FDB rules and
netdev representros are kept available, FDB programming is allowed.
Example:
# start inactive switchdev
devlink dev eswitch set pci/0000:08:00.1 mode switchdev_inactive
# setup TC rules, representors etc ..
# activate
devlink dev eswitch set pci/0000:08:00.1 mode switchdev
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://patch.msgid.link/20251108070404.1551708-2-saeed@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Introduce the userspace entry point for PHY MSE diagnostics via
ethtool netlink. This exposes the core API added previously and
returns both capability information and one or more snapshots.
Userspace sends ETHTOOL_MSG_MSE_GET. The reply carries:
- ETHTOOL_A_MSE_CAPABILITIES: scale limits and timing information
- ETHTOOL_A_MSE_CHANNEL_* nests: one or more snapshots (per-channel
if available, otherwise WORST, otherwise LINK)
Link down returns -ENETDOWN.
Changes:
- YAML: add attribute sets (mse, mse-capabilities, mse-snapshot)
and the mse-get operation
- UAPI (generated): add ETHTOOL_A_MSE_* enums and message IDs,
ETHTOOL_MSG_MSE_GET/REPLY
- ethtool core: add net/ethtool/mse.c implementing the request,
register genl op, and hook into ethnl dispatch
- docs: document MSE_GET in ethtool-netlink.rst
The include/uapi/linux/ethtool_netlink_generated.h is generated
from Documentation/netlink/specs/ethtool.yaml.
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Link: https://patch.msgid.link/20251027122801.982364-3-o.rempel@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add a new state NAPI_STATE_THREADED_BUSY_POLL to the NAPI state enum to
enable and disable threaded busy polling.
When threaded busy polling is enabled for a NAPI, enable
NAPI_STATE_THREADED also.
When the threaded NAPI is scheduled, set NAPI_STATE_IN_BUSY_POLL to
signal napi_complete_done not to rearm interrupts.
Whenever NAPI_STATE_THREADED_BUSY_POLL is unset, the
NAPI_STATE_IN_BUSY_POLL will be unset, napi_complete_done unsets the
NAPI_STATE_SCHED_THREADED bit also, which in turn will make the kthread
go to sleep.
Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Acked-by: Martin Karsten <mkarsten@uwaterloo.ca>
Tested-by: Martin Karsten <mkarsten@uwaterloo.ca>
Link: https://patch.msgid.link/20251028203007.575686-2-skhawaja@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The dpll.yaml spec incorrectly omitted module-name and clock-id from the
pin-get operation reply specification, even though the kernel DPLL
implementation has always included these attributes in pin-get responses
since the initial implementation.
This spec inconsistency caused issues with the C YNL code generator.
The generated dpll_pin_get_rsp structure was missing these fields.
Fix the spec by adding module-name and clock-id to the pin-attrs reply
specification to match the actual kernel behavior.
Fixes: 3badff3a25 ("dpll: spec: Add Netlink spec in YAML")
Signed-off-by: Petr Oros <poros@redhat.com>
Reviewed-by: Ivan Vecera <ivecera@redhat.com>
Link: https://patch.msgid.link/20251024185512.363376-1-poros@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add a generic netlink spec in YAML format and autogenerate boilerplate
code using ynl-regen.sh to introduce a generic netlink for the energy
model. It allows a userspace program to read the performance domain and
its energy model. It notifies the userspace program when a performance
domain is created or deleted or its energy model is updated through a
multicast interface.
Specifically, it supports two commands:
- EM_CMD_GET_PDS: Get the list of information for all performance
domains.
- EM_CMD_GET_PD_TABLE: Get the energy model table of a performance
domain.
Also, it supports three notification events:
- EM_CMD_PD_CREATED: When a performance domain is created.
- EM_CMD_PD_DELETED: When a performance domain is deleted.
- EM_CMD_PD_UPDATED: When the energy model table of a performance domain
is updated.
Finally, update MAINTAINERS to include new files.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
Link: https://patch.msgid.link/20251020220914.320832-4-changwoo@igalia.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Pull Char/Misc/IIO/Binder updates from Greg KH:
"Here is the big set of char/misc/iio and other driver subsystem
changes for 6.18-rc1.
Loads of different stuff in here, it was a busy development cycle in
lots of different subsystems, with over 27k new lines added to the
tree.
Included in here are:
- IIO updates including new drivers, reworking of existing apis, and
other goodness in the sensor subsystems
- MEI driver updates and additions
- NVMEM driver updates
- slimbus removal for an unused driver and some other minor updates
- coresight driver updates and additions
- MHI driver updates
- comedi driver updates and fixes
- extcon driver updates
- interconnect driver additions
- eeprom driver updates and fixes
- minor UIO driver updates
- tiny W1 driver updates
But the majority of new code is in the rust bindings and additions,
which includes:
- misc driver rust binding updates for read/write support, we can now
write "normal" misc drivers in rust fully, and the sample driver
shows how this can be done.
- Initial framework for USB driver rust bindings, which are disabled
for now in the build, due to limited support, but coming in through
this tree due to dependencies on other rust binding changes that
were in here. I'll be enabling these back on in the build in the
usb.git tree after -rc1 is out so that developers can continue to
work on these in linux-next over the next development cycle.
- Android Binder driver implemented in Rust.
This is the big one, and was driving a huge majority of the rust
binding work over the past years. Right now there are two binder
drivers in the kernel, selected only at build time as to which one
to use as binder wants to be included in the system at boot time.
The binder C maintainers all agreed on this, as eventually, they
want the C code to be removed from the tree, but it will take a few
releases to get there while both are maintained to ensure that the
rust implementation is fully stable and compliant with the existing
userspace apis.
All of these have been in linux-next for a while"
* tag 'char-misc-6.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (320 commits)
rust: usb: keep usb::Device private for now
rust: usb: don't retain device context for the interface parent
USB: disable rust bindings from the build for now
samples: rust: add a USB driver sample
rust: usb: add basic USB abstractions
coresight: Add label sysfs node support
dt-bindings: arm: Add label in the coresight components
coresight: tnoc: add new AMBA ID to support Trace Noc V2
coresight: Fix incorrect handling for return value of devm_kzalloc
coresight: tpda: fix the logic to setup the element size
coresight: trbe: Return NULL pointer for allocation failures
coresight: Refactor runtime PM
coresight: Make clock sequence consistent
coresight: Refactor driver data allocation
coresight: Consolidate clock enabling
coresight: Avoid enable programming clock duplicately
coresight: Appropriately disable trace bus clocks
coresight: Appropriately disable programming clocks
coresight: etm4x: Support atclk
coresight: catu: Support atclk
...
Introduce a new document, flow_control.rst, to provide a comprehensive
guide on Ethernet Flow Control in Linux. The guide explains how flow
control works, how autonegotiation resolves pause capabilities, and how
to configure it using ethtool and Netlink.
In parallel, document the pause and pause-stat attributes in the
ethtool.yaml netlink spec. This enables the ynl tool to generate
kernel-doc comments for the corresponding enums in the UAPI header,
making the C interface self-documenting.
Finally, replace the legacy flow control section in phy.rst with a
reference to the new document and add pointers in the relevant C source
files.
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Link: https://patch.msgid.link/20250924120241.724850-1-o.rempel@pengutronix.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
This attribute is a boolean. No need to add it to set it to 'false'.
Indeed, the default value when this attribute is not set is naturally
'false'. A few bytes can then be saved by not adding this attribute if
the connection is not on the server side.
This prepares the future deprecation of its attribute, in favour of a
new flag.
Reviewed-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250919-net-next-mptcp-server-side-flag-v1-1-a97a5d561a8b@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Various network interface types make use of needed_{head,tail}room values
to efficiently reserve buffer space for additional encapsulation headers,
such as VXLAN, Geneve, IPSec, etc. However, it is not currently possible
to query these values in a generic way.
Introduce ability to query the needed_{head,tail}room values of a network
device via rtnetlink, such that applications that may wish to use these
values can do so.
For example, Cilium agent iterates over present devices based on user config
(direct routing, vxlan, geneve, wireguard etc.) and in future will configure
netkit in order to expose the needed_{head,tail}room into K8s pods. See
b9ed315d3c ("netkit: Allow for configuring needed_{head,tail}room").
Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alasdair McWilliam <alasdair@mcwilliam.dev>
Reviewed-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://patch.msgid.link/20250917095543.14039-1-alasdair@mcwilliam.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Cross-merge networking fixes after downstream PR (net-6.17-rc7).
No conflicts.
Adjacent changes:
drivers/net/ethernet/mellanox/mlx5/core/en/fs.h
9536fbe10c ("net/mlx5e: Add PSP steering in local NIC RX")
7601a0a462 ("net/mlx5e: Add a miss level for ipsec crypto offload")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add a netlink family for PSP and allow drivers to register support.
The "PSP device" is its own object. This allows us to perform more
flexible reference counting / lifetime control than if PSP information
was part of net_device. In the future we should also be able
to "delegate" PSP access to software devices, such as *vlan, veth
or netkit more easily.
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250917000954.859376-3-daniel.zahka@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
I'm trying to generate Rust bindings for netlink using the yaml spec.
It looks like there's a typo in conntrack spec: attribute set conntrack-attrs
defines attributes "counters-{orig,reply}" (plural), while get operation
references "counter-{orig,reply}" (singular). The latter should be fixed, as it
denotes multiple counters (packet and byte). The corresonding C define is
CTA_COUNTERS_ORIG.
Also, dump request references "nfgen-family" attribute, which neither exists in
conntrack-attrs attrset nor ctattr_type enum. There's member of nfgenmsg struct
with the same name, which is where family value is actually taken from.
> static int ctnetlink_dump_exp_ct(struct net *net, struct sock *ctnl,
> struct sk_buff *skb,
> const struct nlmsghdr *nlh,
> const struct nlattr * const cda[],
> struct netlink_ext_ack *extack)
> {
> int err;
> struct nfgenmsg *nfmsg = nlmsg_data(nlh);
> u_int8_t u3 = nfmsg->nfgen_family;
^^^^^^^^^^^^
Signed-off-by: Remy D. Farley <one-d-wide@protonmail.com>
Fixes: 23fc9311a5 ("netlink: specs: add conntrack dump and stats dump support")
Link: https://patch.msgid.link/20250913140515.1132886-1-one-d-wide@protonmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>