linux

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2026-05-02 08:39:08 -04:00

Author	SHA1	Message	Date
DENG Qingfang	1ca8a193ca	net: dsa: mt7530: manually set up VLAN ID 0 The driver was relying on dsa_slave_vlan_rx_add_vid to add VLAN ID 0. After the blamed commit, VLAN ID 0 won't be set up anymore, breaking software bridging fallback on VLAN-unaware bridges. Manually set up VLAN ID 0 to fix this. Fixes: `06cfb2df7e` ("net: dsa: don't advertise 'rx-vlan-filter' when not needed") Signed-off-by: DENG Qingfang <dqfext@gmail.com> Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-08-25 11:09:31 +01:00
David S. Miller	e93826d35c	Merge branch 'mana-EQ-sharing' Haiyang Zhang says: ==================== net: mana: Add support for EQ sharing The existing code uses (1 + #vPorts * #Queues) MSIXs, which may exceed the device limit. Support EQ sharing, so that multiple vPorts can share the same set of MSIXs. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2021-08-25 11:06:54 +01:00
Haiyang Zhang	c1a3e9f98d	net: mana: Add WARN_ON_ONCE in case of CQE read overflow This is not an expected case normally. Add WARN_ON_ONCE in case of CQE read overflow, instead of failing silently. Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-08-25 11:06:54 +01:00
Haiyang Zhang	1e2d0824a9	net: mana: Add support for EQ sharing The existing code uses (1 + #vPorts * #Queues) MSIXs, which may exceed the device limit. Support EQ sharing, so that multiple vPorts (NICs) can share the same set of MSIXs. And, report the EQ-sharing capability bit to the host, which means the host can potentially offer more vPorts and queues to the VM. Also update the resource limit checking and error handling for better robustness. Now, we support up to 256 virtual ports per VF (it was 16/VF), and support up to 64 queues per vPort (it was 16). Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-08-25 11:06:54 +01:00
Haiyang Zhang	e1b5683ff6	net: mana: Move NAPI from EQ to CQ The existing code has NAPI threads polling on EQ directly. To prepare for EQ sharing among vPorts, move NAPI from EQ to CQ so that one EQ can serve multiple CQs from different vPorts. The "arm bit" is only set when CQ processing is completed to reduce the number of EQ entries, which in turn reduce the number of interrupts on EQ. Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-08-25 11:06:54 +01:00
Shaokun Zhang	807d1032e0	netxen_nic: Remove the repeated declaration Function 'netxen_rom_fast_read' is declared twice, so remove the repeated declaration. Cc: Manish Chopra <manishc@marvell.com> Cc: Rahul Verma <rahulv@marvell.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Shaokun Zhang <zhangshaokun@hisilicon.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-08-25 11:04:53 +01:00
Nathan Chancellor	bc4f128d86	cxgb4: Properly revert VPD changes Clang warns: drivers/net/ethernet/chelsio/cxgb4/t4_hw.c:2785:2: error: variable 'kw_offset' is uninitialized when used here [-Werror,-Wuninitialized] FIND_VPD_KW(i, "RV"); ^~~~~~~~~~~~~~~~~~~~ drivers/net/ethernet/chelsio/cxgb4/t4_hw.c:2776:39: note: expanded from macro 'FIND_VPD_KW' var = pci_vpd_find_info_keyword(vpd, kw_offset, vpdr_len, name); \ ^~~~~~~~~ drivers/net/ethernet/chelsio/cxgb4/t4_hw.c:2748:34: note: initialize the variable 'kw_offset' to silence this warning unsigned int vpdr_len, kw_offset, id_len; ^ = 0 drivers/net/ethernet/chelsio/cxgb4/t4_hw.c:2785:2: error: variable 'vpdr_len' is uninitialized when used here [-Werror,-Wuninitialized] FIND_VPD_KW(i, "RV"); ^~~~~~~~~~~~~~~~~~~~ drivers/net/ethernet/chelsio/cxgb4/t4_hw.c:2776:50: note: expanded from macro 'FIND_VPD_KW' var = pci_vpd_find_info_keyword(vpd, kw_offset, vpdr_len, name); \ ^~~~~~~~ drivers/net/ethernet/chelsio/cxgb4/t4_hw.c:2748:23: note: initialize the variable 'vpdr_len' to silence this warning unsigned int vpdr_len, kw_offset, id_len; ^ = 0 2 errors generated. The series "PCI/VPD: Convert more users to the new VPD API functions" was applied to net-next when it should have been applied to the PCI tree because of build errors. However, commit `82e34c8a9b` ("Revert "Revert "cxgb4: Search VPD with pci_vpd_find_ro_info_keyword()""") reapplied a change, resulting in the warning above. Properly revert commit `8d63ee602d` ("cxgb4: Search VPD with pci_vpd_find_ro_info_keyword()") to fix the warning and restore proper functionality. This also reverts commit `3a93bedea0` ("cxgb4: Remove unused vpd_param member ec") to avoid future merge conflicts, as that change has been applied to the PCI tree. Link: https://lore.kernel.org/r/20210823120929.7c6f7a4f@canb.auug.org.au/ Link: https://lore.kernel.org/r/1ca29408-7bc7-4da5-59c7-87893c9e0442@gmail.com/ Signed-off-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-08-25 11:03:13 +01:00
David S. Miller	cb0f8b034c	Merge branch 'mptcp-next' Mat Martineau says: ==================== mptcp: Optimize output options and add MP_FAIL This patch set contains two groups of changes that we've been testing in the MPTCP tree. The first optimizes the code path and data structure for populating MPTCP option headers when transmitting. Patch 1 reorganizes code to reduce the number of conditionals that need to be evaluated in common cases. Patch 2 rearranges struct mptcp_out_options to save 80 bytes (on x86_64). The next five patches add partial support for the MP_FAIL option as defined in RFC 8684. MP_FAIL is an option header used to cleanly handle MPTCP checksum failures. When the MPTCP checksum detects an error in the MPTCP DSS header or the data mapped by that header, the receiver uses a TCP RST with MP_FAIL to close the subflow that experienced the error and provide associated MPTCP sequence number information to the peer. RFC 8684 also describes how a single-subflow connection can discard corrupt data and remain connected under certain conditions using MP_FAIL, but that feature is not implemented here. Patches 3-5 implement MP_FAIL transmit and receive, and integrates with checksum validation. Patches 6 & 7 add MP_FAIL selftests and the MIBs required for those tests. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2021-08-25 11:02:35 +01:00
Geliang Tang	6bb3ab4913	selftests: mptcp: add MP_FAIL mibs check This patch added a function chk_fail_nr to check the mibs for MP_FAIL. Signed-off-by: Geliang Tang <geliangtang@xiaomi.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-08-25 11:02:35 +01:00
Geliang Tang	eb7f33654d	mptcp: add the mibs for MP_FAIL This patch added the mibs for MP_FAIL: MPTCP_MIB_MPFAILTX and MPTCP_MIB_MPFAILRX. Signed-off-by: Geliang Tang <geliangtang@xiaomi.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-08-25 11:02:35 +01:00
Geliang Tang	478d770008	mptcp: send out MP_FAIL when data checksum fails When a bad checksum is detected, set the send_mp_fail flag to send out the MP_FAIL option. Add a new function mptcp_has_another_subflow() to check whether there's only a single subflow. When multiple subflows are in use, close the affected subflow with a RST that includes an MP_FAIL option and discard the data with the bad checksum. Set the sk_state of the subsocket to TCP_CLOSE, then the flag MPTCP_WORK_CLOSE_SUBFLOW will be set in subflow_sched_work_if_closed, and the subflow will be closed. When a single subfow is in use, temporarily handled by sending MP_FAIL with a RST too. Signed-off-by: Geliang Tang <geliangtang@xiaomi.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-08-25 11:02:35 +01:00
Geliang Tang	5580d41b75	mptcp: MP_FAIL suboption receiving This patch added handling for receiving MP_FAIL suboption. Add a new members mp_fail and fail_seq in struct mptcp_options_received. When MP_FAIL suboption is received, set mp_fail to 1 and save the sequence number to fail_seq. Then invoke mptcp_pm_mp_fail_received to deal with the MP_FAIL suboption. Signed-off-by: Geliang Tang <geliangtang@xiaomi.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-08-25 11:02:34 +01:00
Geliang Tang	c25aeb4e09	mptcp: MP_FAIL suboption sending This patch added the MP_FAIL suboption sending support. Add a new flag named send_mp_fail in struct mptcp_subflow_context. If this flag is set, send out MP_FAIL suboption. Add a new member fail_seq in struct mptcp_out_options to save the data sequence number to put into the MP_FAIL suboption. An MP_FAIL option could be included in a RST or on the subflow-level ACK. Suggested-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Geliang Tang <geliangtang@xiaomi.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-08-25 11:02:34 +01:00
Paolo Abeni	d7b2690837	mptcp: shrink mptcp_out_options struct After the previous patch we can alias with a union several fields in mptcp_out_options. Such struct is stack allocated and memset() for each plain TCP out packet. Every saved byted counts. Before: pahole -EC mptcp_out_options # ... /* size: 136, cachelines: 3, members: 17 / After: pahole -EC mptcp_out_options # ... / size: 56, cachelines: 1, members: 9 */ Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-08-25 11:02:34 +01:00
Paolo Abeni	1bff1e43a3	mptcp: optimize out option generation Currently we have several protocol constraints on MPTCP options generation (e.g. MPC and MPJ subopt are mutually exclusive) and some additional ones required by our implementation (e.g. almost all ADD_ADDR variant are mutually exclusive with everything else). We can leverage the above to optimize the out option generation: we check DSS/MPC/MPJ presence in a mutually exclusive way, avoiding many unneeded conditionals in the common cases. Additionally extend the existing constraints on ADD_ADDR opt on all subvariants, so that it becomes fully mutually exclusive with the above and we can skip another conditional statement for the common case. This change is also needed by the next patch. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-08-25 11:02:34 +01:00
David S. Miller	d484dc2b21	Merge branch '1GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue Tony Nguyen says: ==================== 1GbE Intel Wired LAN Driver Updates 2021-08-24 Vinicius Costa Gomes says: This adds support for PCIe PTM (Precision Time Measurement) to the igc driver. PCIe PTM allows the NIC and Host clocks to be compared more precisely, improving the clock synchronization accuracy. Patch 1/4 reverts a commit that made pci_enable_ptm() private to the PCI subsystem, reverting makes it possible for it to be called from the drivers. Patch 2/4 adds the pcie_ptm_enabled() helper. Patch 3/4 calls pci_enable_ptm() from the igc driver. Patch 4/4 implements the PCIe PTM support. Exposing it via the .getcrosststamp() API implies that the time measurements are made synchronously with the ioctl(). The hardware was implemented so the most convenient way to retrieve that information would be asynchronously. So, to follow the expectations of the ioctl() we have to use less convenient ways, triggering an PCIe PTM dialog every time a ioctl() is received. Some questions are raised (also pointed out in the commit message): 1. Using convert_art_ns_to_tsc() is too x86 specific, there should be a common way to create a 'system_counterval_t' from a timestamp. 2. convert_art_ns_to_tsc() says that it should only be used when X86_FEATURE_TSC_KNOWN_FREQ is true, but during tests it works even when it returns false. Should that check be done? ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2021-08-25 10:57:19 +01:00
David S. Miller	38cbd6e77f	Merge branch 'lan7800-improvements' John Efstathiades says: ==================== LAN7800 driver improvements This patch set introduces a number of improvements and fixes for problems found during testing of a modification to add a NAPI-style approach to packet handling to improve performance. NOTE: the NAPI changes are not part of this patch set and the issues fixed by this patch set are not coupled to the NAPI changes. Patch 1 fixes white space and style issues Patch 2 removes an unused timer Patch 3 introduces macros to set the internal packet FIFO flow control levels, which makes it easier to update the levels in future. Patch 4 removes an unused queue Patch 5 (updated for v2) introduces function return value checks and error propagation to various parts of the driver where a return code was captured but then ignored. This patch is completely different to patch 5 in version 1 of this patch set. The changes in the v1 patch 5 are being set aside for the time being. Patch 6 updates the LAN7800 MAC reset code to ensure there is no PHY register access in progress when the MAC is reset. This change prevents a kernel exception that can otherwise occur. Patch 7 fixes problems with system suspend and resume handling while the device is transmitting and receiving data. Patch 8 fixes problems with auto-suspend and resume handling and depends on changes introduced by patch 7. Patch 9 fixes problems with device disconnect handling that can result in kernel exceptions and/or hang. Patch 10 limits the rate at which driver warning messages are emitted. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2021-08-25 10:55:43 +01:00
John Efstathiades	df0d6f7a34	lan78xx: Limit number of driver warning messages Device removal can result in a large burst of driver warning messages (20 - 30) sent to the kernel log. Most of these are register read/write failures. This change limits the rate at which these messages are emitted. Signed-off-by: John Efstathiades <john.efstathiades@pebblebay.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-08-25 10:55:43 +01:00
John Efstathiades	77dfff5bb7	lan78xx: Fix race condition in disconnect handling If there is a device disconnect at roughly the same time as a deferred PHY link reset there is a race condition that can result in a kernel lock up due to a null pointer dereference in the driver's deferred work handling routine lan78xx_delayedwork(). The following changes fix this problem. Add new status flag EVENT_DEV_DISCONNECT to indicate when the device has been removed and use it to prevent operations, such as register access, that will fail once the device is removed. Stop processing of deferred work items when the driver's USB disconnect handler is invoked. Disconnect the PHY only after the network device has been unregistered and all delayed work has been cancelled. Signed-off-by: John Efstathiades <john.efstathiades@pebblebay.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-08-25 10:55:43 +01:00
John Efstathiades	5f4cc6e251	lan78xx: Fix race conditions in suspend/resume handling If the interface is given an IP address while the device is suspended (as a result of an auto-suspend event) there is a race between lan78xx_resume() and lan78xx_open() that can result in an exception or failure to handle incoming packets. The following changes fix this problem. Introduce a mutex to serialise operations in the network interface open and stop entry points with respect to the USB driver suspend and resume entry points. Move Tx and Rx data path start/stop to lan78xx_start() and lan78xx_stop() respectively and flush the packet FIFOs before starting the Tx and Rx data paths. This prevents the MAC and FIFOs getting out of step and delivery of malformed packets to the network stack. Stop processing of received packets before disconnecting the PHY from the MAC to prevent a kernel exception caused by handling packets after the PHY device has been removed. Refactor device auto-suspend code to make it consistent with the the system suspend code and make the suspend handler easier to read. Add new code to stop wake-on-lan packets or PHY events resuming the host or device from suspend if the device has not been opened (typically after an IP address is assigned). This patch is dependent on changes to lan78xx_suspend() and lan78xx_resume() introduced in the previous patch of this patch set. Signed-off-by: John Efstathiades <john.efstathiades@pebblebay.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-08-25 10:55:43 +01:00
John Efstathiades	e1210fe63b	lan78xx: Fix partial packet errors on suspend/resume The MAC can get out of step with the internal packet FIFOs if the system goes to sleep when the link is active, especially at high data rates. This can result in partial frames in the packet FIFOs that in result in malformed frames being delivered to the host. This occurs because the driver does not enable/disable the internal packet FIFOs in step with the corresponding MAC data path. The following changes fix this problem. Update code that enables/disables the MAC receiver and transmitter to the more general Rx and Tx data path, where the data path in each direction consists of both the MAC function (Tx or Rx) and the corresponding packet FIFO. In the receive path the packet FIFO must be enabled before the MAC receiver but disabled after the MAC receiver. In the transmit path the opposite is true: the packet FIFO must be enabled after the MAC transmitter but disabled before the MAC transmitter. The packet FIFOs can be flushed safely once the corresponding data path is stopped. Signed-off-by: John Efstathiades <john.efstathiades@pebblebay.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-08-25 10:55:43 +01:00
John Efstathiades	b1f6696daa	lan78xx: Fix exception on link speed change An exception is sometimes seen when the link speed is changed from auto-negotiation to a fixed speed, or vice versa. The exception occurs when the MAC is reset (due to the link speed change) at the same time as the PHY state machine is accessing a PHY register. The following changes fix this problem. Rework the MAC reset to ensure there is no outstanding MDIO register transaction before the reset and then wait until the reset is complete before allowing any further MAC register access. Signed-off-by: John Efstathiades <john.efstathiades@pebblebay.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-08-25 10:55:43 +01:00
John Efstathiades	3415f6baad	lan78xx: Add missing return code checks There are many places in the driver where the return code from a function call is captured but without a subsequent test of the return code and appropriate action taken. This patch adds the missing return code tests and action. In most cases the action is an early exit from the calling function. The function lan78xx_set_suspend() was also updated to make it consistent with lan78xx_suspend(). Signed-off-by: John Efstathiades <john.efstathiades@pebblebay.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-08-25 10:55:43 +01:00
John Efstathiades	40b8452fa8	lan78xx: Remove unused pause frame queue Remove the pause frame queue from the driver. It is initialised but not actually used. Signed-off-by: John Efstathiades <john.efstathiades@pebblebay.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-08-25 10:55:43 +01:00
John Efstathiades	dc35f8548e	lan78xx: Set flow control threshold to prevent packet loss Set threshold at which flow control is triggered to 3/4 full of the internal Rx packet FIFO to prevent packet drops at high data rates. The new setting reduces the number of dropped UDP frames and TCP retransmit requests especially on less capable CPUs. Signed-off-by: John Efstathiades <john.efstathiades@pebblebay.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-08-25 10:55:42 +01:00
John Efstathiades	3bef6b9e98	lan78xx: Remove unused timer Remove kernel timer that is not used by the driver. Signed-off-by: John Efstathiades <john.efstathiades@pebblebay.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-08-25 10:55:42 +01:00
John Efstathiades	9ceec7d33a	lan78xx: Fix white space and style issues Fix white space and code style issues identified by checkpatch. Signed-off-by: John Efstathiades <john.efstathiades@pebblebay.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-08-25 10:55:42 +01:00
David S. Miller	fbd029df29	Merge branch 'xen-harden-netfront' Juergen Gross says: ==================== xen: harden netfront against malicious backends Xen backends of para-virtualized devices can live in dom0 kernel, dom0 user land, or in a driver domain. This means that a backend might reside in a less trusted environment than the Xen core components, so a backend should not be able to do harm to a Xen guest (it can still mess up I/O data, but it shouldn't be able to e.g. crash a guest by other means or cause a privilege escalation in the guest). Unfortunately netfront in the Linux kernel is fully trusting its backend. This series is fixing netfront in this regard. It was discussed to handle this as a security problem, but the topic was discussed in public before, so it isn't a real secret. It should be mentioned that a similar series has been posted some years ago by Marek Marczykowski-Górecki, but this series has not been applied due to a Xen header not having been available in the Xen git repo at that time. Additionally my series is fixing some more DoS cases. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2021-08-25 10:43:22 +01:00
Juergen Gross	a884daa61a	xen/netfront: don't trust the backend response data blindly Today netfront will trust the backend to send only sane response data. In order to avoid privilege escalations or crashes in case of malicious backends verify the data to be within expected limits. Especially make sure that the response always references an outstanding request. Note that only the tx queue needs special id handling, as for the rx queue the id is equal to the index in the ring page. Introduce a new indicator for the device whether it is broken and let the device stop working when it is set. Set this indicator in case the backend sets any weird data. Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-08-25 10:43:21 +01:00
Juergen Gross	21631d2d74	xen/netfront: disentangle tx_skb_freelist The tx_skb_freelist elements are in a single linked list with the request id used as link reference. The per element link field is in a union with the skb pointer of an in use request. Move the link reference out of the union in order to enable a later reuse of it for requests which need a populated skb pointer. Rename add_id_to_freelist() and get_id_from_freelist() to add_id_to_list() and get_id_from_list() in order to prepare using those for other lists as well. Define ~0 as value to indicate the end of a list and place that value into the link for a request not being on the list. When freeing a skb zero the skb pointer in the request. Use a NULL value of the skb pointer instead of skb_entry_is_link() for deciding whether a request has a skb linked to it. Remove skb_entry_set_link() and open code it instead as it is really trivial now. Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-08-25 10:43:21 +01:00
Juergen Gross	162081ec33	xen/netfront: don't read data from request on the ring page In order to avoid a malicious backend being able to influence the local processing of a request build the request locally first and then copy it to the ring page. Any reading from the request influencing the processing in the frontend needs to be done on the local instance. Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-08-25 10:43:21 +01:00
Juergen Gross	8446066bf8	xen/netfront: read response from backend only once In order to avoid problems in case the backend is modifying a response on the ring page while the frontend has already seen it, just read the response into a local buffer in one go and then operate on that buffer only. Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-08-25 10:43:21 +01:00
Alok Prasad	755f905340	qed: Enable automatic recovery on error condition. This patch enables automatic recovery by default in case of various error condition like fw assert , hardware error etc. This also ensure driver can handle multiple iteration of assertion conditions. Signed-off-by: Ariel Elior <aelior@marvell.com> Signed-off-by: Shai Malin <smalin@marvell.com> Signed-off-by: Igor Russkikh <irusskikh@marvell.com> Signed-off-by: Alok Prasad <palok@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-08-25 10:38:16 +01:00
Gilad Naaman	406f42fa0d	net-next: When a bond have a massive amount of VLANs with IPv6 addresses, performance of changing link state, attaching a VRF, changing an IPv6 address, etc. go down dramtically. The source of most of the slow down is the `dev_addr_lists.c` module, which mainatins a linked list of HW addresses. When using IPv6, this list grows for each IPv6 address added on a VLAN, since each IPv6 address has a multicast HW address associated with it. When performing any modification to the involved links, this list is traversed many times, often for nothing, all while holding the RTNL lock. Instead, this patch adds an auxilliary rbtree which cuts down traversal time significantly. Performance can be seen with the following script: #!/bin/bash ip netns del test \|\| true 2>/dev/null ip netns add test echo 1 \| ip netns exec test tee /proc/sys/net/ipv6/conf/all/keep_addr_on_down > /dev/null set -e ip -n test link add foo type veth peer name bar ip -n test link add b1 type bond ip -n test link add florp type vrf table 10 ip -n test link set bar master b1 ip -n test link set foo up ip -n test link set bar up ip -n test link set b1 up ip -n test link set florp up VLAN_COUNT=1500 BASE_DEV=b1 echo Creating vlans ip netns exec test time -p bash -c "for i in \$(seq 1 $VLAN_COUNT); do ip -n test link add link $BASE_DEV name foo.\$i type vlan id \$i; done" echo Bringing them up ip netns exec test time -p bash -c "for i in \$(seq 1 $VLAN_COUNT); do ip -n test link set foo.\$i up; done" echo Assiging IPv6 Addresses ip netns exec test time -p bash -c "for i in \$(seq 1 $VLAN_COUNT); do ip -n test address add dev foo.\$i 2000::\$i/64; done" echo Attaching to VRF ip netns exec test time -p bash -c "for i in \$(seq 1 $VLAN_COUNT); do ip -n test link set foo.\$i master florp; done" On an Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz machine, the performance before the patch is (truncated): Creating vlans real 108.35 Bringing them up real 4.96 Assiging IPv6 Addresses real 19.22 Attaching to VRF real 458.84 After the patch: Creating vlans real 5.59 Bringing them up real 5.07 Assiging IPv6 Addresses real 5.64 Attaching to VRF real 25.37 Cc: David S. Miller <davem@davemloft.net> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Lu Wei <luwei32@huawei.com> Cc: Xiongfeng Wang <wangxiongfeng2@huawei.com> Cc: Taehee Yoo <ap420073@gmail.com> Signed-off-by: Gilad Naaman <gnaaman@drivenets.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-08-25 10:29:07 +01:00
Kangmin Park	a37c5c2669	net: bridge: change return type of br_handle_ingress_vlan_tunnel br_handle_ingress_vlan_tunnel() is only referenced in br_handle_frame(). If br_handle_ingress_vlan_tunnel() is called and return non-zero value, goto drop in br_handle_frame(). But, br_handle_ingress_vlan_tunnel() always return 0. So, the routines that check the return value and goto drop has no meaning. Therefore, change return type of br_handle_ingress_vlan_tunnel() to void and remove if statement of br_handle_frame(). Signed-off-by: Kangmin Park <l4stpr0gr4m@gmail.com> Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com> Link: https://lore.kernel.org/r/20210823102118.17966-1-l4stpr0gr4m@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-08-24 16:51:09 -07:00
Po-Hsu Lin	7844ec21a9	selftests/net: Use kselftest skip code for skipped tests There are several test cases in the net directory are still using exit 0 or exit 1 when they need to be skipped. Use kselftest framework skip code instead so it can help us to distinguish the return status. Criterion to filter out what should be fixed in net directory: grep -r "exit [01]" -B1 \| grep -i skip This change might cause some false-positives if people are running these test scripts directly and only checking their return codes, which will change from 0 to 4. However I think the impact should be small as most of our scripts here are already using this skip code. And there will be no such issue if running them with the kselftest framework. Signed-off-by: Po-Hsu Lin <po-hsu.lin@canonical.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Tested-by: Ido Schimmel <idosch@nvidia.com> Link: https://lore.kernel.org/r/20210823085854.40216-1-po-hsu.lin@canonical.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-08-24 16:49:09 -07:00
Vinicius Costa Gomes	a90ec84837	igc: Add support for PTP getcrosststamp() i225 supports PCIe Precision Time Measurement (PTM), allowing us to support the PTP_SYS_OFFSET_PRECISE ioctl() in the driver via the getcrosststamp() function. The easiest way to expose the PTM registers would be to configure the PTM dialogs to run periodically, but the PTP_SYS_OFFSET_PRECISE ioctl() semantics are more aligned to using a kind of "one-shot" way of retrieving the PTM timestamps. But this causes a bit more code to be written: the trigger registers for the PTM dialogs are not cleared automatically. i225 can be configured to send "fake" packets with the PTM information, adding support for handling these types of packets is left for the future. PTM improves the accuracy of time synchronization, for example, using phc2sys, while a simple application is sending packets as fast as possible. First, without .getcrosststamp(): phc2sys[191.382]: enp4s0 sys offset -959 s2 freq -454 delay 4492 phc2sys[191.482]: enp4s0 sys offset 798 s2 freq +1015 delay 4069 phc2sys[191.583]: enp4s0 sys offset 962 s2 freq +1418 delay 3849 phc2sys[191.683]: enp4s0 sys offset 924 s2 freq +1669 delay 3753 phc2sys[191.783]: enp4s0 sys offset 664 s2 freq +1686 delay 3349 phc2sys[191.883]: enp4s0 sys offset 218 s2 freq +1439 delay 2585 phc2sys[191.983]: enp4s0 sys offset 761 s2 freq +2048 delay 3750 phc2sys[192.083]: enp4s0 sys offset 756 s2 freq +2271 delay 4061 phc2sys[192.183]: enp4s0 sys offset 809 s2 freq +2551 delay 4384 phc2sys[192.283]: enp4s0 sys offset -108 s2 freq +1877 delay 2480 phc2sys[192.383]: enp4s0 sys offset -1145 s2 freq +807 delay 4438 phc2sys[192.484]: enp4s0 sys offset 571 s2 freq +2180 delay 3849 phc2sys[192.584]: enp4s0 sys offset 241 s2 freq +2021 delay 3389 phc2sys[192.684]: enp4s0 sys offset 405 s2 freq +2257 delay 3829 phc2sys[192.784]: enp4s0 sys offset 17 s2 freq +1991 delay 3273 phc2sys[192.884]: enp4s0 sys offset 152 s2 freq +2131 delay 3948 phc2sys[192.984]: enp4s0 sys offset -187 s2 freq +1837 delay 3162 phc2sys[193.084]: enp4s0 sys offset -1595 s2 freq +373 delay 4557 phc2sys[193.184]: enp4s0 sys offset 107 s2 freq +1597 delay 3740 phc2sys[193.284]: enp4s0 sys offset 199 s2 freq +1721 delay 4010 phc2sys[193.385]: enp4s0 sys offset -169 s2 freq +1413 delay 3701 phc2sys[193.485]: enp4s0 sys offset -47 s2 freq +1484 delay 3581 phc2sys[193.585]: enp4s0 sys offset -65 s2 freq +1452 delay 3778 phc2sys[193.685]: enp4s0 sys offset 95 s2 freq +1592 delay 3888 phc2sys[193.785]: enp4s0 sys offset 206 s2 freq +1732 delay 4445 phc2sys[193.885]: enp4s0 sys offset -652 s2 freq +936 delay 2521 phc2sys[193.985]: enp4s0 sys offset -203 s2 freq +1189 delay 3391 phc2sys[194.085]: enp4s0 sys offset -376 s2 freq +955 delay 2951 phc2sys[194.185]: enp4s0 sys offset -134 s2 freq +1084 delay 3330 phc2sys[194.285]: enp4s0 sys offset -22 s2 freq +1156 delay 3479 phc2sys[194.386]: enp4s0 sys offset 32 s2 freq +1204 delay 3602 phc2sys[194.486]: enp4s0 sys offset 122 s2 freq +1303 delay 3731 Statistics for this run (total of 2179 lines), in nanoseconds: average: -1.12 stdev: 634.80 max: 1551 min: -2215 With .getcrosststamp() via PCIe PTM: phc2sys[367.859]: enp4s0 sys offset 6 s2 freq +1727 delay 0 phc2sys[367.959]: enp4s0 sys offset -2 s2 freq +1721 delay 0 phc2sys[368.059]: enp4s0 sys offset 5 s2 freq +1727 delay 0 phc2sys[368.160]: enp4s0 sys offset -1 s2 freq +1723 delay 0 phc2sys[368.260]: enp4s0 sys offset -4 s2 freq +1719 delay 0 phc2sys[368.360]: enp4s0 sys offset -5 s2 freq +1717 delay 0 phc2sys[368.460]: enp4s0 sys offset 1 s2 freq +1722 delay 0 phc2sys[368.560]: enp4s0 sys offset -3 s2 freq +1718 delay 0 phc2sys[368.660]: enp4s0 sys offset 5 s2 freq +1725 delay 0 phc2sys[368.760]: enp4s0 sys offset -1 s2 freq +1721 delay 0 phc2sys[368.860]: enp4s0 sys offset 0 s2 freq +1721 delay 0 phc2sys[368.960]: enp4s0 sys offset 0 s2 freq +1721 delay 0 phc2sys[369.061]: enp4s0 sys offset 4 s2 freq +1725 delay 0 phc2sys[369.161]: enp4s0 sys offset 1 s2 freq +1724 delay 0 phc2sys[369.261]: enp4s0 sys offset 4 s2 freq +1727 delay 0 phc2sys[369.361]: enp4s0 sys offset 8 s2 freq +1732 delay 0 phc2sys[369.461]: enp4s0 sys offset 7 s2 freq +1733 delay 0 phc2sys[369.561]: enp4s0 sys offset 4 s2 freq +1733 delay 0 phc2sys[369.661]: enp4s0 sys offset 1 s2 freq +1731 delay 0 phc2sys[369.761]: enp4s0 sys offset 1 s2 freq +1731 delay 0 phc2sys[369.861]: enp4s0 sys offset -5 s2 freq +1725 delay 0 phc2sys[369.961]: enp4s0 sys offset -4 s2 freq +1725 delay 0 phc2sys[370.062]: enp4s0 sys offset 2 s2 freq +1730 delay 0 phc2sys[370.162]: enp4s0 sys offset -7 s2 freq +1721 delay 0 phc2sys[370.262]: enp4s0 sys offset -3 s2 freq +1723 delay 0 phc2sys[370.362]: enp4s0 sys offset 1 s2 freq +1726 delay 0 phc2sys[370.462]: enp4s0 sys offset -3 s2 freq +1723 delay 0 phc2sys[370.562]: enp4s0 sys offset -1 s2 freq +1724 delay 0 phc2sys[370.662]: enp4s0 sys offset -4 s2 freq +1720 delay 0 phc2sys[370.762]: enp4s0 sys offset -7 s2 freq +1716 delay 0 phc2sys[370.862]: enp4s0 sys offset -2 s2 freq +1719 delay 0 Statistics for this run (total of 2179 lines), in nanoseconds: average: 0.14 stdev: 5.03 max: 48 min: -27 For reference, the statistics for runs without PCIe congestion show that the improvements from enabling PTM are less dramatic. For two runs of 16466 entries: without PTM: avg -0.04 stdev 10.57 max 39 min -42 with PTM: avg 0.01 stdev 4.20 max 19 min -16 One possible explanation is that when PTM is not enabled, and there's a lot of traffic in the PCIe fabric, some register reads will take more time than the others because of congestion on the PCIe fabric. When PTM is enabled, even if the PTM dialogs take more time to complete under heavy traffic, the time measurements do not depend on the time to read the registers. This was implemented following the i225 EAS version 0.993. Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Tested-by: Dvora Fuxbrumer <dvorax.fuxbrumer@linux.intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>	2021-08-24 12:01:34 -07:00
Vinicius Costa Gomes	1b5d73fb86	igc: Enable PCIe PTM Enables PCIe PTM (Precision Time Measurement) support in the igc driver. Notifies the PCI devices that PCIe PTM should be enabled. PCIe PTM is similar protocol to PTP (Precision Time Protocol) running in the PCIe fabric, it allows devices to report time measurements from their internal clocks and the correlation with the PCIe root clock. The i225 NIC exposes some registers that expose those time measurements, those registers will be used, in later patches, to implement the PTP_SYS_OFFSET_PRECISE ioctl(). Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Tested-by: Dvora Fuxbrumer <dvorax.fuxbrumer@linux.intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>	2021-08-24 11:49:21 -07:00
Vinicius Costa Gomes	014408cd62	PCI: Add pcie_ptm_enabled() Add a predicate that returns if PCIe PTM (Precision Time Measurement) is enabled. It will only return true if it's enabled in all the ports in the path from the device to the root. Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Acked-by: Bjorn Helgaas <bhelgaas@google.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>	2021-08-24 11:36:13 -07:00
Vinicius Costa Gomes	1d71eb53e4	Revert "PCI: Make pci_enable_ptm() private" Make pci_enable_ptm() accessible from the drivers. Exposing this to the driver enables the driver to use the 'ptm_enabled' field of 'pci_dev' to check if PTM is enabled or not. This reverts commit `ac6c26da29` ("PCI: Make pci_enable_ptm() private"). Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Acked-by: Bjorn Helgaas <bhelgaas@google.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>	2021-08-24 10:49:53 -07:00
Jakub Kicinski	3a62c33349	Merge branch 'ethtool-extend-coalesce-uapi' Yufeng Mo says: ==================== ethtool: extend coalesce uAPI In order to support some configuration in coalesce uAPI, this series extend coalesce uAPI and add support for CQE mode. Below is some test result with HNS3 driver: 1. old ethtool(ioctl) + new kernel: estuary:/$ ethtool -c eth0 Coalesce parameters for eth0: Adaptive RX: on TX: on stats-block-usecs: 0 sample-interval: 0 pkt-rate-low: 0 pkt-rate-high: 0 rx-usecs: 20 rx-frames: 0 rx-usecs-irq: 0 rx-frames-irq: 0 tx-usecs: 20 tx-frames: 0 tx-usecs-irq: 0 tx-frames-irq: 0 rx-usecs-low: 0 rx-frame-low: 0 tx-usecs-low: 0 tx-frame-low: 0 rx-usecs-high: 0 rx-frame-high: 0 tx-usecs-high: 0 tx-frame-high: 0 2. ethtool(netlink with cqe mode) + kernel without cqe mode: estuary:/$ ethtool -c eth0 Coalesce parameters for eth0: Adaptive RX: on TX: on stats-block-usecs: n/a sample-interval: n/a pkt-rate-low: n/a pkt-rate-high: n/a rx-usecs: 20 rx-frames: 0 rx-usecs-irq: n/a rx-frames-irq: n/a tx-usecs: 20 tx-frames: 0 tx-usecs-irq: n/a tx-frames-irq: n/a rx-usecs-low: n/a rx-frame-low: n/a tx-usecs-low: n/a tx-frame-low: n/a rx-usecs-high: 0 rx-frame-high: n/a tx-usecs-high: 0 tx-frame-high: n/a CQE mode RX: n/a TX: n/a 3. ethool(netlink with cqe mode) + kernel with cqe mode: estuary:/$ ethtool -c eth0 Coalesce parameters for eth0: Adaptive RX: on TX: on stats-block-usecs: n/a sample-interval: n/a pkt-rate-low: n/a pkt-rate-high: n/a rx-usecs: 20 rx-frames: 0 rx-usecs-irq: n/a rx-frames-irq: n/a tx-usecs: 20 tx-frames: 0 tx-usecs-irq: n/a tx-frames-irq: n/a rx-usecs-low: n/a rx-frame-low: n/a tx-usecs-low: n/a tx-frame-low: n/a rx-usecs-high: 0 rx-frame-high: n/a tx-usecs-high: 0 tx-frame-high: n/a CQE mode RX: off TX: off 4. ethool(netlink without cqe mode) + kernel with cqe mode: estuary:/$ ethtool -c eth0 Coalesce parameters for eth0: Adaptive RX: on TX: on stats-block-usecs: n/a sample-interval: n/a pkt-rate-low: n/a pkt-rate-high: n/a rx-usecs: 20 rx-frames: 0 rx-usecs-irq: n/a rx-frames-irq: n/a tx-usecs: 20 tx-frames: 0 tx-usecs-irq: n/a tx-frames-irq: n/a rx-usecs-low: n/a rx-frame-low: n/a tx-usecs-low: n/a tx-frame-low: n/a rx-usecs-high: 0 rx-frame-high: n/a tx-usecs-high: 0 tx-frame-high: n/a Change log: V2 -> V3: fix some warning on W=1 builds in #2 V1 -> V2: 1. fix compile error using allmodconfig in #2 2. move some property-related modifications from #2 to #1 for better review suggested by Jakub Kicinski. Change log from RFC: V3 -> V4: add document explaining the difference between CQE and EQE in #1 suggested by Jakub Kicinski. V2 -> V3: 1. split #1 into adding new parameter and adding new attributes. 2. use NLA_POLICY_MAX(NLA_U8, 1) instead of NLA_U8. 3. modify the description of CQE in Document. V1 -> V2: refactor #1&#2 in V1 suggestted by Jakub Kicinski. ==================== Link: https://lore.kernel.org/r/1629444920-25437-1-git-send-email-moyufeng@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-08-24 07:38:30 -07:00
Yufeng Mo	cce1689eb5	net: hns3: add ethtool support for CQE/EQE mode configuration Add support in ethtool for switching EQE/CQE mode. Signed-off-by: Yufeng Mo <moyufeng@huawei.com> Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-08-24 07:38:29 -07:00
Yufeng Mo	9f0c6f4b74	net: hns3: add support for EQE/CQE mode configuration For device whose version is above V3(include V3), the GL can select EQE or CQE mode, so adds support for it. In CQE mode, the coalesced timer will restart when the first new completion occurs, while in EQE mode, the timer will not restart. Signed-off-by: Yufeng Mo <moyufeng@huawei.com> Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-08-24 07:38:29 -07:00
Yufeng Mo	f3ccfda193	ethtool: extend coalesce setting uAPI with CQE mode In order to support more coalesce parameters through netlink, add two new parameter kernel_coal and extack for .set_coalesce and .get_coalesce, then some extra info can return to user with the netlink API. Signed-off-by: Yufeng Mo <moyufeng@huawei.com> Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-08-24 07:38:29 -07:00
Yufeng Mo	029ee6b143	ethtool: add two coalesce attributes for CQE mode Currently, there are many drivers who support CQE mode configuration, some configure it as a fixed when initialized, some provide an interface to change it by ethtool private flags. In order to make it more generic, add two new 'ETHTOOL_A_COALESCE_USE_CQE_TX' and 'ETHTOOL_A_COALESCE_USE_CQE_RX' coalesce attributes, then these parameters can be accessed by ethtool netlink coalesce uAPI. Also add an new structure kernel_ethtool_coalesce, then the new parameter can be added into this struct. Signed-off-by: Yufeng Mo <moyufeng@huawei.com> Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-08-24 07:38:28 -07:00
Jakub Kicinski	95d1d2490c	netdevice: move xdp_rxq within netdev_rx_queue Both struct netdev_rx_queue and struct xdp_rxq_info are cacheline aligned. This causes extra padding before and after the xdp_rxq member. Move the member upfront, so that it's naturally aligned. Before: /* size: 256, cachelines: 4, members: 6 / / sum members: 160, holes: 1, sum holes: 40 / / padding: 56 / / paddings: 1, sum paddings: 36 / / forced alignments: 1, forced holes: 1, sum forced holes: 40 / After: / size: 192, cachelines: 3, members: 6 / / padding: 32 / / paddings: 1, sum paddings: 36 / / forced alignments: 1 */ Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Link: https://lore.kernel.org/r/20210823180135.1153608-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-08-24 07:38:21 -07:00
Heiner Kallweit	18a9eae240	r8169: enable ASPM L0s state ASPM is disabled completely because we've seen different types of problems in the past. However it seems these problems occurred with L1 or L1 sub-states only. On all the chip versions I've seen the acceptable L0s exit latency is 512ns. This should be short enough not to cause problems. If the actual L0s exit latency of the PCIe link is bigger than 512ns then the PCI core will disable L0s anyway. So let's give it a try and disable L1 and L1 sub-states only. Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-08-24 10:50:26 +01:00
Yunsheng Lin	7fb9b66dc9	page_pool: use relaxed atomic for release side accounting There is no need to synchronize the account updating, so use the relaxed atomic to avoid some memory barrier in the data path. Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-08-24 10:46:31 +01:00
David S. Miller	669f047ec1	Merge branch 'dsa-sw-bridging' Vladimir Oltean says: ==================== Plug holes in DSA's software bridging support Changes in v2: - Make sure that leaving an unoffloaded bridge works well too - Remove a set but unused variable - Tweak a commit message This series addresses some oddities reported by Alvin while he was working on the new rtl8365mb driver (a driver which does not implement bridge offloading for now, and relies on software bridging). First is that DSA behaves, in the lack of a .port_bridge_join method, as if the operation succeeds, and does not kick off its internal procedures for software bridging (the same procedures that were written for indirect software bridging, meaning bridging with an unoffloaded software LAG). Second is that even after being patched to treat ports with software bridging as standalone, we still don't get rid of bridge VLANs, even though we have code to ignore them, that code manages to get bypassed. This is in fact a recurring issue which was brought up by Tobias Waldekranz a while ago, but the solution never made it to the git tree. After debugging with Florian the last time: https://patchwork.kernel.org/project/netdevbpf/patch/20210320225928.2481575-3-olteanv@gmail.com/ I became very concerned about sending these patches to stable kernels. They are relatively large reworks, and they are only tested properly on net-next. A few commands on my test vehicle which has ds->vlan_filtering_is_global set to true: \| Nothing is committed to hardware when we add VLAN 100 on a standalone \| port $ ip link add link sw0p2 name sw0p2.100 type vlan id 100 \| When a neighbor port joins a VLAN-aware bridge, VLAN filtering gets \| enabled globally on the switch. This replays the VLAN 100 from \| sw0p2.100 and also installs VLAN 1 from the bridge on sw0p0. $ ip link add br0 type bridge vlan_filtering 1 && ip link set sw0p0 master br0 [ 97.948087] sja1105 spi2.0: Reset switch and programmed static config. Reason: VLAN filtering [ 97.957989] sja1105 spi2.0: sja1105_bridge_vlan_add: port 2 vlan 100 [ 97.964442] sja1105 spi2.0: sja1105_bridge_vlan_add: port 4 vlan 100 [ 97.971202] device sw0p0 entered promiscuous mode [ 97.976129] sja1105 spi2.0: sja1105_bridge_vlan_add: port 0 vlan 1 [ 97.982640] sja1105 spi2.0: sja1105_bridge_vlan_add: port 4 vlan 1 \| We can see that sw0p2, the standalone port, is now filtering because \| of the bridge $ ethtool -k sw0p2 \| grep vlan rx-vlan-filter: on [fixed] \| When we make the bridge VLAN-unaware, the 8021q upper sw0p2.100 is \| uncomitted from hardware. The VLANs managed by the bridge still remain \| committed to hardware, because they are managed by the bridge. $ ip link set br0 type bridge vlan_filtering 0 [ 134.218869] sja1105 spi2.0: Reset switch and programmed static config. Reason: VLAN filtering [ 134.228913] sja1105 spi2.0: sja1105_bridge_vlan_del: port 2 vlan 100 \| And now the standalone port is not filtering anymore. ethtool -k sw0p2 \| grep vlan rx-vlan-filter: off [fixed] The same test with .port_bridge_join and .port_bridge_leave commented out from this driver: \| Not a flinch $ ip link add link sw0p2 name sw0p2.100 type vlan id 100 $ ip link add br0 type bridge vlan_filtering 1 && ip link set sw0p0 master br0 Warning: dsa_core: Offloading not supported. $ ethtool -k sw0p2 \| grep vlan rx-vlan-filter: off [fixed] $ ip link set br0 type bridge vlan_filtering 0 $ ethtool -k sw0p2 \| grep vlan rx-vlan-filter: off [fixed] ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2021-08-24 09:30:58 +01:00
Vladimir Oltean	58adf9dcb1	net: dsa: let drivers state that they need VLAN filtering while standalone As explained in commit `e358bef7c3` ("net: dsa: Give drivers the chance to veto certain upper devices"), the hellcreek driver uses some tricks to comply with the network stack expectations: it enforces port separation in standalone mode using VLANs. For untagged traffic, bridging between ports is prevented by using different PVIDs, and for VLAN-tagged traffic, it never accepts 8021q uppers with the same VID on two ports, so packets with one VLAN cannot leak from one port to another. That is almost fine, and has worked because hellcreek relied on an implicit behavior of the DSA core that was changed by the previous patch: the standalone ports declare the 'rx-vlan-filter' feature as 'on [fixed]'. Since most of the DSA drivers are actually VLAN-unaware in standalone mode, that feature was actually incorrectly reflecting the hardware/driver state, so there was a desire to fix it. This leaves the hellcreek driver in a situation where it has to explicitly request this behavior from the DSA framework. We configure the ports as follows: - Standalone: 'rx-vlan-filter' is on. An 8021q upper on top of a standalone hellcreek port will go through dsa_slave_vlan_rx_add_vid and will add a VLAN to the hardware tables, giving the driver the opportunity to refuse it through .port_prechangeupper. - Bridged with vlan_filtering=0: 'rx-vlan-filter' is off. An 8021q upper on top of a bridged hellcreek port will not go through dsa_slave_vlan_rx_add_vid, because there will not be any attempt to offload this VLAN. The driver already disables VLAN awareness, so that upper should receive the traffic it needs. - Bridged with vlan_filtering=1: 'rx-vlan-filter' is on. An 8021q upper on top of a bridged hellcreek port will call dsa_slave_vlan_rx_add_vid, and can again be vetoed through .port_prechangeupper. It is not actually completely fine, because if I follow through correctly, we can have the following situation: ip link add br0 type bridge vlan_filtering 0 ip link set lan0 master br0 # lan0 now becomes VLAN-unaware ip link set lan0 nomaster # lan0 fails to become VLAN-aware again, therefore breaking isolation This patch fixes that corner case by extending the DSA core logic, based on this requested attribute, to change the VLAN awareness state of the switch (port) when it leaves the bridge. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Acked-by: Kurt Kanzenbach <kurt@linutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-08-24 09:30:58 +01:00

1 2 3 4 5 ...

1032621 Commits