linux

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2026-05-11 01:53:20 -04:00

Author	SHA1	Message	Date
Simon Horman	b91c0e4d63	ixgbe: spelling corrections Correct spelling as flagged by codespell. Signed-off-by: Simon Horman <horms@kernel.org> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>	2025-07-03 09:39:04 -07:00
Radoslaw Tyl	1a3ebc59f7	ixgbe: turn off MDD while modifying SRRCTL Modifying SRRCTL register can generate MDD event. Turn MDD off during SRRCTL register write to prevent generating MDD. Fix RCT in ixgbe_set_rx_drop_en(). Reviewed-by: Marcin Szycik <marcin.szycik@linux.intel.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Radoslaw Tyl <radoslawx.tyl@intel.com> Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>	2025-07-03 09:39:04 -07:00
Slawomir Mrozowicz	b11aa9614d	ixgbe: add Tx hang detection unhandled MDD Add Tx Hang detection due to an unhandled MDD Event. Previously, a malicious VF could disable the entire port causing TX to hang on the E610 card. Those events that caused PF to freeze were not detected as an MDD event and usually required a Tx Hang watchdog timer to catch the suspension, and perform a physical function reset. Implement flows in the affected PF driver in such a way to check the cause of the hang, detect it as an MDD event and log an entry of the malicious VF that caused the Hang. The PF blocks the malicious VF, if it continues to be the source of several MDD events. Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Reviewed-by: Marcin Szycik <marcin.szycik@linux.intel.com> Signed-off-by: Slawomir Mrozowicz <slawomirx.mrozowicz@intel.com> Co-developed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>	2025-07-03 09:39:03 -07:00
Don Skidmore	da3ab95f9b	ixgbe: check for MDD events When an event is detected it is logged and, for the time being, the queue is immediately re-enabled. This is due to the lack of an API to the hypervisor so it could deal with it as it chooses. Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Reviewed-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com> Reviewed-by: Marcin Szycik <marcin.szycik@linux.intel.com> Signed-off-by: Don Skidmore <donald.c.skidmore@intel.com> Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>	2025-07-03 09:39:03 -07:00
Paul Greenwalt	d5e3152037	ixgbe: add MDD support Add malicious driver detection to ixgbe driver. The supported devices are E610 and X550. Handling MDD events is enabled while VFs are created and turned off when they are disabled. There is no runtime command to enable or disable MDD independently. MDD event is logged when malicious VF driver is detected. For example VF can try to send incorrect Tx descriptor (TSO on, but length field not correct). It can be reproduced by manipulating the driver, or using driver with incorrect descriptor values. Example log: "Malicious event on VF 0 tx:128 rx:128" Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Reviewed-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com> Reviewed-by: Marcin Szycik <marcin.szycik@linux.intel.com> Signed-off-by: Paul Greenwalt <paul.greenwalt@intel.com> Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>	2025-07-03 09:39:03 -07:00
Vladimir Oltean	8cc2497877	i40e: convert to ndo_hwtstamp_get() and ndo_hwtstamp_set() New timestamping API was introduced in commit `66f7223039` ("net: add NDOs for configuring hardware timestamping") from kernel v6.6. It is time to convert the Intel i40e driver to the new API, so that timestamping configuration can be removed from the ndo_eth_ioctl() path completely. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev> Reviewed-by: Milena Olech <milena.olech@intel.com> Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>	2025-07-03 09:39:00 -07:00
Vladimir Oltean	8f3f4995e8	ixgbe: convert to ndo_hwtstamp_get() and ndo_hwtstamp_set() New timestamping API was introduced in commit `66f7223039` ("net: add NDOs for configuring hardware timestamping") from kernel v6.6. It is time to convert the Intel ixgbe driver to the new API, so that timestamping configuration can be removed from the ndo_eth_ioctl() path completely. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev> Reviewed-by: Milena Olech <milena.olech@intel.com> Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>	2025-07-03 09:38:50 -07:00
Vladimir Oltean	b88428d3fc	igb: convert to ndo_hwtstamp_get() and ndo_hwtstamp_set() New timestamping API was introduced in commit `66f7223039` ("net: add NDOs for configuring hardware timestamping") from kernel v6.6. It is time to convert the Intel igb driver to the new API, so that timestamping configuration can be removed from the ndo_eth_ioctl() path completely. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev> Reviewed-by: Milena Olech <milena.olech@intel.com> Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>	2025-07-03 09:38:46 -07:00
Vladimir Oltean	033d0bcf4a	igc: convert to ndo_hwtstamp_get() and ndo_hwtstamp_set() New timestamping API was introduced in commit `66f7223039` ("net: add NDOs for configuring hardware timestamping") from kernel v6.6. It is time to convert the Intel igc driver to the new API, so that timestamping configuration can be removed from the ndo_eth_ioctl() path completely. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Vitaly Lifshits <vitaly.lifshits@intel.com> Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev> Reviewed-by: Milena Olech <milena.olech@intel.com> Tested-by: Mor Bar-Gabay <morx.bar.gabay@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>	2025-07-03 09:38:33 -07:00
Vladimir Oltean	23ddacab4e	ice: convert to ndo_hwtstamp_get() and ndo_hwtstamp_set() New timestamping API was introduced in commit `66f7223039` ("net: add NDOs for configuring hardware timestamping") from kernel v6.6. It is time to convert the Intel ice driver to the new API, so that timestamping configuration can be removed from the ndo_eth_ioctl() path completely. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Acked-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev> Reviewed-by: Milena Olech <milena.olech@intel.com> Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>	2025-07-03 09:37:49 -07:00
Yue Haibing	5f712c3877	ipv6: Cleanup fib6_drop_pcpu_from() Since commit `0e23387491` ("ipv6: fix races in ip6_dst_destroy()"), 'table' is unused in __fib6_drop_pcpu_from(), no need pass it from fib6_drop_pcpu_from(). Signed-off-by: Yue Haibing <yuehaibing@huawei.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250701041235.1333687-1-yuehaibing@huawei.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-07-03 16:00:50 +02:00
Paolo Abeni	129676952e	Merge branch 'another-ip-sysctl-docs-cleanup' Bagas Sanjaya says: ==================== Another ip-sysctl docs cleanup Inspired by Abdelrahman's cleanup [1]. This time, mostly formatting conversion to bullet lists. [1]: https://lore.kernel.org/linux-doc/20250624150923.40590-1-abdelrahmanfekry375@gmail.com/ ==================== Link: https://patch.msgid.link/20250701031300.19088-1-bagasdotme@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-07-03 15:51:47 +02:00
Bagas Sanjaya	2f1fa26eef	net: ip-sysctl: Add link to SCTP IPv4 scoping draft addr_scope_policy description contains pointer to SCTP IPv4 scoping draft but not its IETF Datatracker link. Add it. Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250701031300.19088-6-bagasdotme@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-07-03 15:51:45 +02:00
Bagas Sanjaya	82b0566000	net: ip-sysctl: Format SCTP-related memory parameters description as bullet list The description for vector elements of SCTP-related memory usage parameters (sctp{r,w,}mem) is formatted as normal paragraphs rather than bullet list. Convert the description to the latter. Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250701031300.19088-5-bagasdotme@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-07-03 15:51:45 +02:00
Bagas Sanjaya	98bc1d41f2	net: ip-sysctl: Format pf_{enable,expose} boolean lists as bullet lists These lists' items were separated by newlines but without bullet list marker. Turn the lists into proper bullet list. While at it, also reword values description for pf_expose to not repeat mentioning SCTP_PEER_ADDR_CHANGE and SCTP_GET_PEER_ADDR_INFO sockopt. Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250701031300.19088-4-bagasdotme@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-07-03 15:51:45 +02:00
Bagas Sanjaya	2040058db3	net: ip-sysctl: Format possible value range of ioam6_id{,_wide} as bullet list Format possible value range bounds of ioam6_id and ioam6_id_wide as bullet list instead of running paragraph. Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250701031300.19088-3-bagasdotme@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-07-03 15:51:44 +02:00
Bagas Sanjaya	501aeb1ef4	net: ip-sysctl: Format Private VLAN proxy arp aliases as bullet list Alias names list for private VLAN proxy arp technology is formatted as indented paragraph instead. Make it bullet list as it is better fit for this purpose. Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250701031300.19088-2-bagasdotme@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-07-03 15:51:44 +02:00
Paolo Abeni	792eacd324	Merge branch 'ptp-provide-support-for-auxiliary-clocks-for-ptp_sys_offset_extended' Thomas Gleixner says: ==================== ptp: Provide support for auxiliary clocks for PTP_SYS_OFFSET_EXTENDED This is a follow up to the V1 series, which can be found here: https://lore.kernel.org/all/20250626124327.667087805@linutronix.de to address the merge logistics problem, which I created myself. Changes vs. V1: - Make patch 1, which provides the timestamping function temporarily define CLOCK_AUX* if undefined so that it can be merged independently, - Add a missing check for CONFIG_POSIX_AUX_CLOCK in the PTP IOCTL - Picked up tags Merge logistics if agreed on: 1) Patch #1 is applied to the tip tree on top of plain v6.16-rc1 and tagged 2) That tag is merged into tip:timers/ptp and the temporary CLOCK_AUX define is removed in a subsequent commit 3) Network folks merge the tag and apply patches #2 + #3 So the only fallout from this are the extra merges in both trees and the cleanup commit in the tip tree. But that way there are no dependencies and no duplicate commits with different SHAs. Thoughts? Due to the above constraints there is no branch offered to pull from right now. Sorry for the inconveniance. Should have thought about that earlier. ==================== Link: https://patch.msgid.link/20250701130923.579834908@linutronix.de Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-07-03 15:36:13 +02:00
Thomas Gleixner	17c395bba1	ptp: Enable auxiliary clocks for PTP_SYS_OFFSET_EXTENDED Allow ioctl(PTP_SYS_OFFSET_EXTENDED*) to select CLOCK_AUX clock ids for generating the pre and post hardware readout timestamps. Aside of adding these clocks to the clock ID validation, this also requires to check the timestamp to be valid, i.e. the seconds value being greater than or equal zero. This is necessary because AUX clocks can be asynchronously enabled or disabled, so there is no way to validate the availability upfront. The same could have been achieved by handing the return value of ktime_get_aux_ts64() all the way down to the IOCTL call site, but that'd require to modify all existing ptp::gettimex64() callbacks and their inner call chains. The timestamp check achieves the same with less churn and less complicated code all over the place. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev> Link: https://patch.msgid.link/20250701132628.491315452@linutronix.de Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-07-03 15:36:06 +02:00
Thomas Gleixner	4c09a4cebd	ptp: Use ktime_get_clock_ts64() for timestamping The inlined ptp_read_system_[pre\|post]ts() switch cases expand to a copious amount of text in drivers, e.g. ~500 bytes in e1000e. Adding auxiliary clock support to the inlines would increase it further. Replace the inline switch case with a call to ktime_get_clock_ts64(), which reduces the code size in drivers and allows to access auxiliary clocks once they are enabled in the IOCTL parameter filter. No functional change. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev> Acked-by: John Stultz <jstultz@google.com> Link: https://patch.msgid.link/20250701132628.426168092@linutronix.de Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-07-03 15:36:04 +02:00
Paolo Abeni	4f38a6db7b	Merge tag 'ktime-get-clock-ts64-for-ptp' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Base implementation for PTP with a temporary CLOCK_AUX* workaround to allow integration of depending changes into the networking tree. Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-07-03 15:35:07 +02:00
Seth Forshee (DigitalOcean)	135faae632	bonding: don't force LACPDU tx to ~333 ms boundaries The timer which ensures that no more than 3 LACPDUs are transmitted in a second rearms itself every 333ms regardless of whether an LACPDU is transmitted when the timer expires. This causes LACPDU tx to be delayed until the next expiration of the timer, which effectively aligns LACPDUs to ~333ms boundaries. This results in a variable amount of jitter in the timing of periodic LACPDUs. Change this to only rearm the timer when an LACPDU is actually sent, allowing tx at any point after the timer has expired. Signed-off-by: Seth Forshee (DigitalOcean) <sforshee@kernel.org> Reviewed-by: Carlos Bilbao <carlos.bilbao@kernel.org> Link: https://patch.msgid.link/20250625-fix-lacpdu-jitter-v1-1-4d0ee627e1ba@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-07-03 15:24:40 +02:00
Thomas Gleixner	5b605dbee0	timekeeping: Provide ktime_get_clock_ts64() PTP implements an inline switch case for taking timestamps from various POSIX clock IDs, which already consumes quite some text space. Expanding it for auxiliary clocks really becomes too big for inlining. Provide a out of line version. The function invalidates the timestamp in case the clock is invalid. The invalidation allows to implement a validation check without the need to propagate a return value through deep existing call chains. Due to merge logistics this temporarily defines CLOCK_AUX[_LAST] if undefined, so that the plain branch, which does not contain any of the core timekeeper changes, can be pulled into the networking tree as prerequisite for the PTP side changes. These temporary defines are removed after that branch is merged into the tip::timers/ptp branch. That way the result in -next or upstream in the next merge window has zero dependencies. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev> Acked-by: John Stultz <jstultz@google.com> Link: https://lore.kernel.org/all/20250701132628.357686408@linutronix.de	2025-07-03 14:18:06 +02:00
Chenguang Zhao	8b98f34ce1	net: ipv6: Fix spelling mistake change 'Maximium' to 'Maximum' Signed-off-by: Chenguang Zhao <zhaochenguang@kylinos.cn> Reviewed-by: Simon Horman <horms@kernel.org> Acked-by: Paul Moore <paul@paul-moore.com> Link: https://patch.msgid.link/20250702055820.112190-1-zhaochenguang@kylinos.cn Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-07-02 15:42:29 -07:00
Jakub Kicinski	19b323e932	Merge branch 'support-rate-management-on-traffic-classes-in-devlink-and-mlx5' Mark Bloch says: ==================== Support rate management on traffic classes in devlink and mlx5 This patch series extends the devlink-rate API to support traffic class (TC) bandwidth management, enabling more granular control over traffic shaping and rate limiting across multiple TCs. The API now allows users to specify bandwidth proportions for different traffic classes in a single command. This is particularly useful for managing Enhanced Transmission Selection (ETS) for groups of Virtual Functions (VFs), allowing precise bandwidth allocation across traffic classes. Additionally the series refines the QoS handling in net/mlx5 to support TC arbitration and bandwidth management on vports and rate nodes. Discussions on traffic class shaping in net-shapers began in V5 [1], where we discussed with maintainers whether net-shapers should support traffic classes and how this could be implemented. Later, after further conversations with Paolo Abeni and Simon Horman, Cosmin provided an update [2], confirming that net-shapers' tree-based hierarchy aligns well with traffic classes when treated as distinct subsets of netdev queues. Since mlx5 enforces a 1:1 mapping between TX queues and traffic classes, this approach seems feasible, though some open questions remain regarding queue reconfiguration and certain mlx5 scheduling behaviors. Building on that discussion, Cosmin has now shared a concrete implementation plan on the netdev mailing list [3]. The plan, developed in collaboration with Paolo and Simon, outlines how net-shapers can be extended to support the same use cases currently covered by devlink-rate, with the eventual goal of aligning both and simplifying the shaping infrastructure in the kernel. This work was presented at Netdev 0x19 in Zagreb [4]. There we presented how TC scheduling is enforced in mlx5 hardware, which led to discussions on the mailing list. A summary of how things work: Classification means labeling a packet with a traffic class based on the packet's DSCP or VLAN PCP field, then treating packets with different traffic classes differently during transmit processing. In a virtualized setup, VFs are untrusted and do not control classification or shaping. Classification is done by the hardware using a prio-to-TC mapping set by the hypervisor. VFs only select which send queue to use and are expected to respect the classification logic by sending each traffic class on its dedicated queue. As stated in the net-shapers plan [3], each transmit queue should carry only a single traffic class. Mixing classes in a single queue can lead to HOL blocking. In the mlx5 implementation, if the queue used does not match the classified traffic class, the hardware moves the queue to the correct TC scheduler. This movement is not a reclassification; it’s a necessary enforcement step to ensure traffic class isolation is maintained. Extend devlink-rate API to support rate management on TCs: - devlink: Extend the devlink rate API to support traffic class bandwidth management Introduce a no-op implementation: - net/mlx5: Add no-op implementation for setting tc-bw on rate objects Add support for enabling and disabling TC QoS on vports and nodes: - net/mlx5: Add support for setting tc-bw on nodes - net/mlx5: Add traffic class scheduling support for vport QoS Support for setting tc-bw on rate objects: - net/mlx5: Manage TC arbiter nodes and implement full support for tc-bw [1] https://lore.kernel.org/netdev/20241204220931.254964-1-tariqt@nvidia.com/ [2] https://lore.kernel.org/netdev/67df1a562614b553dcab043f347a0d7c5393ff83.camel@nvidia.com/ [3] https://lore.kernel.org/netdev/d9831d0c940a7b77419abe7c7330e822bbfd1cfb.camel@nvidia.com/T/ [4] https://netdevconf.info/0x19/sessions/talk/optimizing-bandwidth-allocation-with-ets-and-traffic-classes.html ==================== Link: https://patch.msgid.link/20250629142138.361537-1-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-07-02 15:39:08 -07:00
Carolina Jubran	23ca32e4ea	selftests: drv-net: Add test for devlink-rate traffic class bandwidth distribution This test suite validates the functionality of the devlink-rate API for traffic class (TC) bandwidth allocation. It ensures that bandwidth can be distributed between different traffic classes as configured, and verifies that explicit TC-to-queue mapping is required for the allocation to be effective. The first test (test_no_tc_mapping_bandwidth) is marked as expected failure on mlx5, since the hardware automatically enforces traffic class separation by dynamically moving queues to the correct TC scheduler, even without explicit TC-to-queue mapping configuration. Test output on mlx5: 1..2 # Created VF interface: eth5 # Created VLAN eth5.101 on eth5 with tc 3 and IP 198.51.100.2 # Created VLAN eth5.102 on eth5 with tc 4 and IP 198.51.100.10 # Set representor eth4 up and added to bridge # Bandwidth check results without TC mapping: # TC 3: 0.19 Gbits/sec # TC 4: 0.76 Gbits/sec # Total bandwidth: 0.95 Gbits/sec # TC 3 percentage: 20.0% # TC 4 percentage: 80.0% ok 1 devlink_rate_tc_bw.test_no_tc_mapping_bandwidth # XFAIL Bandwidth matched 80/20 split without TC mapping # Created VF interface: eth5 # Created VLAN eth5.101 on eth5 with tc 3 and IP 198.51.100.2 # Created VLAN eth5.102 on eth5 with tc 4 and IP 198.51.100.10 # Set representor eth4 up and added to bridge # Bandwidth check results with TC mapping: # TC 3: 0.21 Gbits/sec # TC 4: 0.78 Gbits/sec # Total bandwidth: 0.98 Gbits/sec # TC 3 percentage: 21.1% # TC 4 percentage: 78.9% # Bandwidth is distributed as 80/20 with TC mapping ok 2 devlink_rate_tc_bw.test_tc_mapping_bandwidth # Totals: pass:1 fail:0 xfail:1 xpass:0 skip:0 error:0 Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Reviewed-by: Nimrod Oren <noren@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/20250629142138.361537-9-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-07-02 15:39:06 -07:00
Carolina Jubran	cf7e73770d	net/mlx5: Manage TC arbiter nodes and implement full support for tc-bw Introduce support for managing Traffic Class (TC) arbiter nodes and associated vports TC nodes within the E-Switch QoS hierarchy. This patch adds support for the new scheduling node type, `SCHED_NODE_TYPE_VPORTS_TC_TSAR`, and implements full support for setting tc-bw on both vports and nodes. Key changes include: - Introduced the new scheduling node type, `SCHED_NODE_TYPE_VPORTS_TC_TSAR`, for managing vports within the TC arbiter node. - New helper functions for creating and destroying vports TC nodes under the TC arbiter. - Updated the minimum rate normalization function to skip nodes of type `SCHED_NODE_TYPE_VPORTS_TC_TSAR`. Vports TC TSARs have bandwidth shares configured on them but not minimum rates, so their `min_rate` cannot be normalized. - Implementation of `esw_qos_tc_arbiter_scheduling_setup()` and `esw_qos_tc_arbiter_scheduling_teardown()` for initializing and cleaning up TC arbiter scheduling elements. These functions now fully support tc-bw configuration on TC arbiter nodes. - Introduced a new helper `esw_qos_calculate_tc_bw_divider()` to compute the total TC bandwidth share, which is used as a divider for normalizing each TC's share. - Added `esw_qos_tc_arbiter_get_bw_shares()` and `esw_qos_set_tc_arbiter_bw_shares()` to handle the settings of bandwidth shares for vports traffic class TSARs. - `esw_qos_set_tc_arbiter_bw_shares()` normalizes each TC share based on the total and the firmware's maximum allowed TSAR bandwidth share. - Refactored `mlx5_esw_devlink_rate_node_tc_bw_set()` and `mlx5_esw_devlink_rate_leaf_tc_bw_set()` to fully support configuring tc-bw on devlink rate nodes and vports, respectively. - Refactored `mlx5_esw_qos_node_update_parent()` to ensure that tc-bw configuration remains compatible with setting a parent on a rate node, preserving level hierarchy functionality. - Refactored `esw_qos_calc_bw_share()` to generalize its input so it can be used for both minimum rate and bandwidth share calculations. Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/20250629142138.361537-8-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-07-02 15:39:05 -07:00
Carolina Jubran	97733d1e00	net/mlx5: Add traffic class scheduling support for vport QoS Introduce support for traffic class (TC) scheduling on vports by allowing the vport to own multiple TC scheduling nodes. This patch enables more granular control of QoS by defining three distinct QoS states for vports, each providing unique scheduling behavior: 1. Regular QoS: The `sched_node` represents the vport directly, handling QoS as a single scheduling entity. 2. TC QoS on the vport: The `sched_node` acts as a TC arbiter, enabling TC scheduling directly on the vport. 3. TC QoS on the parent node: The `sched_node` functions as a rate limiter, with TC arbitration enabled at the parent level, associating multiple scheduling nodes with each vport. Key changes include: - Added support for new scheduling elements, vport traffic class and rate limiter. - New helper functions for creating, destroying, and restoring vport TC scheduling nodes, handling transitions between regular QoS and TC arbitration states. - Updated `esw_qos_vport_enable()` and `esw_qos_vport_disable()` to support both regular QoS and TC arbitration states, ensuring consistent transitions between scheduling modes. - Introduced a `sched_nodes` array under `vport->qos` to store multiple TC scheduling nodes per vport, enabling finer control over per-TC QoS. - Enhanced `esw_qos_vport_update_parent()` to handle transitions between the three QoS states based on the current and new parent node types. This patch lays the groundwork for future support for configuring tc-bw on vports. Although the infrastructure is in place, full support for tc-bw is not yet implemented; attempts to set tc-bw on vports will return `-EOPNOTSUPP`. No functional changes are introduced at this stage. Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/20250629142138.361537-7-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-07-02 15:39:05 -07:00
Carolina Jubran	96619c485f	net/mlx5: Add support for setting tc-bw on nodes Introduce support for enabling and disabling Traffic Class (TC) arbitration for existing devlink rate nodes. This patch adds support for a new scheduling node type, `SCHED_NODE_TYPE_TC_ARBITER_TSAR`. Key changes include: - New helper functions for transitioning existing rate nodes to TC arbiter nodes and vice versa. These functions handle the allocation of TC arbiter nodes, copying of child nodes, and restoring vport QoS settings when TC arbitration is disabled. - Implementation of `mlx5_esw_devlink_rate_node_tc_bw_set()` to manage tc-bw configuration on nodes. - Introduced stubs for `esw_qos_tc_arbiter_scheduling_setup()` and `esw_qos_tc_arbiter_scheduling_teardown()`, which will be extended in future patches to provide full support for tc-bw on devlink rate objects. - Validation functions for tc-bw settings, allowing graceful handling of unsupported traffic class bandwidth configurations. - Updated `__esw_qos_alloc_node()` to insert the new node into the parent’s children list only if the parent is not NULL. For the root TSAR, the new node is inserted directly after the allocation call. - Don't allow `tc-bw` configuration for nodes containing non-leaf children. This patch lays the groundwork for future support for configuring tc-bw on devlink rate nodes. Although the infrastructure is in place, full support for tc-bw is not yet implemented; attempts to set tc-bw on nodes will return `-EOPNOTSUPP`. No functional changes are introduced at this stage. Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/20250629142138.361537-6-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-07-02 15:39:05 -07:00
Carolina Jubran	7109282124	net/mlx5: Add no-op implementation for setting tc-bw on rate objects Introduce `mlx5_esw_devlink_rate_node_tc_bw_set()` and `mlx5_esw_devlink_rate_leaf_tc_bw_set()` with no-op logic. Future patches will add support for setting traffic class bandwidth on rate objects. Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/20250629142138.361537-5-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-07-02 15:39:05 -07:00
Carolina Jubran	236156d80d	selftest: netdevsim: Add devlink rate tc-bw test Test verifies that netdevsim correctly implements devlink ops callbacks that set tc-bw on leaf or node rate object. Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/20250629142138.361537-4-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-07-02 15:39:05 -07:00
Carolina Jubran	566e8f108f	devlink: Extend devlink rate API with traffic classes bandwidth management Introduce support for specifying relative bandwidth shares between traffic classes (TC) in the devlink-rate API. This new option allows users to allocate bandwidth across multiple traffic classes in a single command. This feature provides a more granular control over traffic management, especially for scenarios requiring Enhanced Transmission Selection. Users can now define a relative bandwidth share for each traffic class. For example, assigning share values of 20 to TC0 (TCP/UDP) and 80 to TC5 (RoCE) will result in TC0 receiving 20% and TC5 receiving 80% of the total bandwidth. The actual percentage each class receives depends on the ratio of its share value to the sum of all shares. Example: DEV=pci/0000:08:00.0 $ devlink port function rate add $DEV/vfs_group tx_share 10Gbit \ tx_max 50Gbit tc-bw 0:20 1:0 2:0 3:0 4:0 5:80 6:0 7:0 $ devlink port function rate set $DEV/vfs_group \ tc-bw 0:20 1:0 2:0 3:0 4:0 5:20 6:60 7:0 Example usage with ynl: ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/devlink.yaml \ --do rate-set --json '{ "bus-name": "pci", "dev-name": "0000:08:00.0", "port-index": 1, "rate-tc-bws": [ {"rate-tc-index": 0, "rate-tc-bw": 50}, {"rate-tc-index": 1, "rate-tc-bw": 50}, {"rate-tc-index": 2, "rate-tc-bw": 0}, {"rate-tc-index": 3, "rate-tc-bw": 0}, {"rate-tc-index": 4, "rate-tc-bw": 0}, {"rate-tc-index": 5, "rate-tc-bw": 0}, {"rate-tc-index": 6, "rate-tc-bw": 0}, {"rate-tc-index": 7, "rate-tc-bw": 0} ] }' ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/devlink.yaml \ --do rate-get --json '{ "bus-name": "pci", "dev-name": "0000:08:00.0", "port-index": 1 }' output for rate-get: {'bus-name': 'pci', 'dev-name': '0000:08:00.0', 'port-index': 1, 'rate-tc-bws': [{'rate-tc-bw': 50, 'rate-tc-index': 0}, {'rate-tc-bw': 50, 'rate-tc-index': 1}, {'rate-tc-bw': 0, 'rate-tc-index': 2}, {'rate-tc-bw': 0, 'rate-tc-index': 3}, {'rate-tc-bw': 0, 'rate-tc-index': 4}, {'rate-tc-bw': 0, 'rate-tc-index': 5}, {'rate-tc-bw': 0, 'rate-tc-index': 6}, {'rate-tc-bw': 0, 'rate-tc-index': 7}], 'rate-tx-max': 0, 'rate-tx-priority': 0, 'rate-tx-share': 0, 'rate-tx-weight': 0, 'rate-type': 'leaf'} Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/20250629142138.361537-3-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-07-02 15:39:05 -07:00
Carolina Jubran	42401c4238	netlink: introduce type-checking attribute iteration for nlmsg Add the nlmsg_for_each_attr_type() macro to simplify iteration over attributes of a specific type in a Netlink message. Convert existing users in vxlan and nfsd to use the new macro. Suggested-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/20250629142138.361537-2-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-07-02 15:39:04 -07:00
Jason Wang	97b2409f28	vhost-net: reduce one userspace copy when building XDP buff We used to do twice copy_from_iter() to copy virtio-net and packet separately. This introduce overheads for userspace access hardening as well as SMAP (for x86 it's stac/clac). So this patch tries to use one copy_from_iter() to copy them once and move the virtio-net header afterwards to reduce overheads. Testpmd + vhost_net shows 10% improvement from 5.45Mpps to 6.0Mpps. Signed-off-by: Jason Wang <jasowang@redhat.com> Link: https://patch.msgid.link/20250701010352.74515-2-jasowang@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-07-02 15:29:46 -07:00
Jason Wang	4d313f2bd2	tun: remove unnecessary tun_xdp_hdr structure With f95f0f95cfb7("net, xdp: Introduce xdp_init_buff utility routine"), buffer length could be stored as frame size so there's no need to have a dedicated tun_xdp_hdr structure. We can simply store virtio net header instead. Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Link: https://patch.msgid.link/20250701010352.74515-1-jasowang@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-07-02 15:29:46 -07:00
Jakub Kicinski	285c895fba	Merge branch 'preserve-msg_zerocopy-with-forwarding' Willem de Bruijn says: ==================== preserve MSG_ZEROCOPY with forwarding Avoid false positive copying of zerocopy skb frags when entering the ingress path if the skb is not queued locally but forwarded. Patch 1 for more details and feature. Patch 2 converts the existing selftest to a pass/fail test and adds coverage for this new feature. ==================== Link: https://patch.msgid.link/20250630194312.1571410-1-willemdebruijn.kernel@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-07-02 15:07:18 -07:00
Willem de Bruijn	81d572a551	selftest: net: extend msg_zerocopy test with forwarding Zerocopy skbs are converted to regular copy skbs when data is queued to a local socket. This happens in the existing test with a sender and receiver communicating over a veth device. Zerocopy skbs are sent without copying if egressing a device. Verify that this behavior is maintained even in the common container setup where data is forwarded over a veth to the physical device. Update msg_zerocopy.sh to 1. Have a dummy network device to simulate a physical device. 2. Have forwarding enabled between veth and dummy. 3. Add a tx-only test that sends out dummy via the forwarding path. 4. Verify the exitcode of the sender, which signals zerocopy success. As dummy drops all packets, this cannot be a TCP connection. Test the new case with unconnected UDP only. Update msg_zerocopy.c to - Accept an argument whether send with zerocopy is expected. - Return an exitcode whether behavior matched that expectation. Signed-off-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20250630194312.1571410-3-willemdebruijn.kernel@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-07-02 15:07:16 -07:00
Willem de Bruijn	d2527ad3a9	net: preserve MSG_ZEROCOPY with forwarding MSG_ZEROCOPY data must be copied before data is queued to local sockets, to avoid indefinite timeout until memory release. This test is performed by skb_orphan_frags_rx, which is called when looping an egress skb to packet sockets, error queue or ingress path. To preserve zerocopy for skbs that are looped to ingress but are then forwarded to an egress device rather than delivered locally, defer this last check until an skb enters the local IP receive path. This is analogous to existing behavior of skb_clear_delivery_time. Signed-off-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20250630194312.1571410-2-willemdebruijn.kernel@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-07-02 15:07:16 -07:00
Jakub Kicinski	04b1d18c5b	Merge branch 'vsock-test-check-for-null-ptr-deref-when-transport-changes' Luigi Leonardi says: ==================== vsock/test: check for null-ptr-deref when transport changes This series introduces a new test that checks for a null pointer dereference that may happen when there is a transport change[1]. This bug was fixed in [2]. Note that this test cannot fail, it hangs if it triggers a kernel oops. The intended use-case is to run it and then check if there is any oops in the dmesg. This test is based on Hyunwoo Kim's[3] and Michal's python reproducers[4]. [1]https://lore.kernel.org/netdev/Z2LvdTTQR7dBmPb5@v4bel-B760M-AORUS-ELITE-AX/ [2]https://lore.kernel.org/netdev/20250110083511.30419-1-sgarzare@redhat.com/ [3]https://lore.kernel.org/netdev/Z2LvdTTQR7dBmPb5@v4bel-B760M-AORUS-ELITE-AX/#t [4]https://lore.kernel.org/netdev/2b3062e3-bdaa-4c94-a3c0-2930595b9670@rbox.co/ v4: https://lore.kernel.org/20250624-test_vsock-v4-1-087c9c8e25a2@redhat.com v3: https://lore.kernel.org/20250611-test_vsock-v3-1-8414a2d4df62@redhat.com v2: https://lore.kernel.org/20250314-test_vsock-v2-1-3c0a1d878a6d@redhat.com v1: https://lore.kernel.org/20250306-test_vsock-v1-0-0320b5accf92@redhat.com ==================== Link: https://patch.msgid.link/20250630-test_vsock-v5-0-2492e141e80b@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-07-02 15:05:25 -07:00
Luigi Leonardi	3a764d9338	vsock/test: Add test for null ptr deref when transport changes Add a new test to ensure that when the transport changes a null pointer dereference does not occur. The bug was reported upstream [1] and fixed with commit `2cb7c756f6` ("vsock/virtio: discard packets if the transport changes"). KASAN: null-ptr-deref in range [0x0000000000000060-0x0000000000000067] CPU: 2 UID: 0 PID: 463 Comm: kworker/2:3 Not tainted Workqueue: vsock-loopback vsock_loopback_work RIP: 0010:vsock_stream_has_data+0x44/0x70 Call Trace: virtio_transport_do_close+0x68/0x1a0 virtio_transport_recv_pkt+0x1045/0x2ae4 vsock_loopback_work+0x27d/0x3f0 process_one_work+0x846/0x1420 worker_thread+0x5b3/0xf80 kthread+0x35a/0x700 ret_from_fork+0x2d/0x70 ret_from_fork_asm+0x1a/0x30 Note that this test may not fail in a kernel without the fix, but it may hang on the client side if it triggers a kernel oops. This works by creating a socket, trying to connect to a server, and then executing a second connect operation on the same socket but to a different CID (0). This triggers a transport change. If the connect operation is interrupted by a signal, this could cause a null-ptr-deref. Since this bug is non-deterministic, we need to try several times. It is reasonable to assume that the bug will show up within the timeout period. If there is a G2H transport loaded in the system, the bug is not triggered and this test will always pass. This is because `vsock_assign_transport`, when using CID 0, like in this case, sets vsk->transport to `transport_g2h` that is not NULL if a G2H transport is available. [1]https://lore.kernel.org/netdev/Z2LvdTTQR7dBmPb5@v4bel-B760M-AORUS-ELITE-AX/ Suggested-by: Hyunwoo Kim <v4bel@theori.io> Suggested-by: Michal Luczaj <mhal@rbox.co> Signed-off-by: Luigi Leonardi <leonardi@redhat.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Link: https://patch.msgid.link/20250630-test_vsock-v5-2-2492e141e80b@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-07-02 15:05:23 -07:00
Luigi Leonardi	e84b20b25d	vsock/test: Add macros to identify transports Add three new macros: TRANSPORTS_G2H, TRANSPORTS_H2G and TRANSPORTS_LOCAL. They can be used to identify the type of the transport(s) loaded when using the `get_transports()` function. Suggested-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Luigi Leonardi <leonardi@redhat.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Link: https://patch.msgid.link/20250630-test_vsock-v5-1-2492e141e80b@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-07-02 15:05:22 -07:00
Matthew Gerlach	6d359cf464	dt-bindings: net: Convert socfpga-dwmac bindings to yaml Convert the bindings for socfpga-dwmac to yaml. Since the original text contained descriptions for two separate nodes, two separate yaml files were created. Signed-off-by: Mun Yew Tham <mun.yew.tham@altera.com> Signed-off-by: Matthew Gerlach <matthew.gerlach@altera.com> Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Reviewed-by: Rob Herring (Arm) <robh@kernel.org> Link: https://patch.msgid.link/20250630213748.71919-1-matthew.gerlach@altera.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-07-02 15:04:58 -07:00
Raju Rangoju	9e2a7ad4ae	amd-xgbe: add support for giant packet size AMD XGBE hardware supports giant Ethernet frames up to 16K bytes. Add support for configuring and enabling giant packet handling in the driver. - Define new register fields and macros for giant packet support. - Update the jumbo frame configuration logic to enable giant packet mode when MTU exceeds the jumbo threshold. Acked-by: Shyam Sundar S K <Shyam-sundar.S-k@amd.com> Signed-off-by: Raju Rangoju <Raju.Rangoju@amd.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250701121929.319690-1-Raju.Rangoju@amd.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-07-02 14:50:07 -07:00
Eric Dumazet	7d2dabaa17	net: ifb: support BIG TCP packets Set the driver limit to GSO_MAX_SIZE (512 KB). This allows the admin/user to set a GSO limit up to this value, to avoid segmenting too large GRO packets in the netem -> ifb path. Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250701084540.459261-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-07-02 14:46:19 -07:00
Jakub Kicinski	7945fe4858	Merge branch 'net-add-data-race-annotations-around-dst-fields' Eric Dumazet says: ==================== net: add data-race annotations around dst fields Add annotations around various dst fields, which can change under us. Add four helpers to prepare better dst->dev protection, and start using them. More to come later. v1: https://lore.kernel.org/20250627112526.3615031-1-edumazet@google.com ==================== Link: https://patch.msgid.link/20250630121934.3399505-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-07-02 14:32:32 -07:00
Eric Dumazet	46a94e44b9	ipv6: ip6_mc_input() and ip6_mr_input() cleanups Both functions are always called under RCU. We remove the extra rcu_read_lock()/rcu_read_unlock(). We use skb_dst_dev_net_rcu() instead of skb_dst_dev_net(). We use dev_net_rcu() instead of dev_net(). Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250630121934.3399505-11-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-07-02 14:32:30 -07:00
Eric Dumazet	93d1cff35a	ipv6: adopt skb_dst_dev() and skb_dst_dev_net[_rcu]() helpers Use the new helpers as a step to deal with potential dst->dev races. v2: fix typo in ipv6_rthdr_rcv() (kernel test robot <lkp@intel.com>) Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250630121934.3399505-10-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-07-02 14:32:30 -07:00
Eric Dumazet	1caf272972	ipv6: adopt dst_dev() helper Use the new helper as a step to deal with potential dst->dev races. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250630121934.3399505-9-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-07-02 14:32:30 -07:00
Eric Dumazet	a74fc62eec	ipv4: adopt dst_dev, skb_dst_dev and skb_dst_dev_net[_rcu] Use the new helpers as a first step to deal with potential dst->dev races. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250630121934.3399505-8-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-07-02 14:32:30 -07:00
Eric Dumazet	88fe14253e	net: dst: add four helpers to annotate data-races around dst->dev dst->dev is read locklessly in many contexts, and written in dst_dev_put(). Fixing all the races is going to need many changes. We probably will have to add full RCU protection. Add three helpers to ease this painful process. static inline struct net_device dst_dev(const struct dst_entry dst) { return READ_ONCE(dst->dev); } static inline struct net_device skb_dst_dev(const struct sk_buff skb) { return dst_dev(skb_dst(skb)); } static inline struct net skb_dst_dev_net(const struct sk_buff skb) { return dev_net(skb_dst_dev(skb)); } static inline struct net skb_dst_dev_net_rcu(const struct sk_buff skb) { return dev_net_rcu(skb_dst_dev(skb)); } Fixes: `4a6ce2b6f2` ("net: introduce a new function dst_dev_put()") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250630121934.3399505-7-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-07-02 14:32:30 -07:00

1 2 3 4 5 ...

1368912 Commits