With modern NIC drivers shifting to full page allocations per
received frame, we face the following issue:
TCP has one per-netns sysctl used to tweak how to translate
a memory use into an expected payload (RWIN), in RX path.
tcp_win_from_space() implementation is limited to few cases.
For hosts dealing with various MSS, we either under estimate
or over estimate the RWIN we send to the remote peers.
For instance with the default sysctl_tcp_adv_win_scale value,
we expect to store 50% of payload per allocated chunk of memory.
For the typical use of MTU=1500 traffic, and order-0 pages allocations
by NIC drivers, we are sending too big RWIN, leading to potential
tcp collapse operations, which are extremely expensive and source
of latency spikes.
This patch makes sysctl_tcp_adv_win_scale obsolete, and instead
uses a per socket scaling factor, so that we can precisely
adjust the RWIN based on effective skb->len/skb->truesize ratio.
This patch alone can double TCP receive performance when receivers
are too slow to drain their receive queue, or by allowing
a bigger RWIN when MSS is close to PAGE_SIZE.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Link: https://lore.kernel.org/r/20230717152917.751987-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
It's inefficient to ring the doorbell page every time a WQE is posted to
the received queue. Excessive MMIO writes result in CPU spending more
time waiting on LOCK instructions (atomic operations), resulting in
poor scaling performance.
Move the code for ringing doorbell page to where after we have posted all
WQEs to the receive queue during a callback from napi_poll().
With this change, tests showed an improvement from 120G/s to 160G/s on a
200G physical link, with 16 or 32 hardware queues.
Tests showed no regression in network latency benchmarks on single
connection.
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: Long Li <longli@microsoft.com>
Link: https://lore.kernel.org/r/1689622539-5334-2-git-send-email-longli@linuxonhyperv.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
As per the comment above debugfs_create_dir(), it is not expected to
return an error, so an extra error check is not needed.
Drop the return check of debugfs_create_dir() in
mvpp2_dbgfs_c2_entry_init(), mvpp2_dbgfs_flow_tbl_entry_init()
and mvpp2_dbgfs_cls_init().
Signed-off-by: Minjie Du <duminjie@vivo.com>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://lore.kernel.org/r/20230717025538.2848-1-duminjie@vivo.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The old way to do LAN reset is sending reset command to firmware. Once
firmware performs reset, it reconfigures what it needs.
In the new firmware versions, veto bit is introduced for NCSI/LLDP to
block PHY domain in LAN reset. At this point, writing register of LAN
reset directly makes the same effect as the old way. And it does not
reset MNG domain, so that veto bit does not change.
Since veto bit was never used, the old firmware is compatible with the
driver before and after this change. The new firmware needs to use with
the driver after this change if it wants to implement the new feature,
otherwise it is the same as the old firmware.
Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Link: https://lore.kernel.org/r/20230717021333.94181-1-jiawenwu@trustnetic.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add TransmissionOverrun as per defined by IEEE 802.1Q Bridges.
TransmissionOverrun counter shall be incremented if the implementation
detects that a frame from a given queue is still being transmitted by
the MAC when that gate-close event for that queue occurs.
This counter is utilised by the Certification conformance test to
inform the user application whether any packets are currently being
transmitted on a particular queue during a gate-close event.
Intel Discrete I225/I226 have a mechanism to not transmit a packets if
the gate open time is insufficient for the packet transmission by setting
the Strict_End bit. Thus, it is expected for this counter to be always
zero at this moment.
Inspired from enetc_taprio_stats() and enetc_taprio_queue_stats(), now
driver also report the tx_overruns counter per traffic class.
User can get this counter by using below command:
1) tc -s qdisc show dev <interface> root
2) tc -s class show dev <interface>
Test Result (Before):
class mq :1 root
Sent 1289 bytes 20 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
class mq :2 root
Sent 124 bytes 2 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
class mq :3 root
Sent 46028 bytes 86 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
class mq :4 root
Sent 2596 bytes 14 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
Test Result (After):
class taprio 100:1 root
Sent 8491 bytes 38 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
Transmit overruns: 0
class taprio 100:2 root
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
Transmit overruns: 0
class taprio 100:3 root
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
Transmit overruns: 0
class taprio 100:4 root
Sent 994 bytes 11 pkt (dropped 0, overlimits 0 requeues 1)
backlog 0b 0p requeues 1
Transmit overruns: 0
Signed-off-by: Muhammad Husaini Zulkifli <muhammad.husaini.zulkifli@intel.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Tested-by: Naama Meir <naamax.meir@linux.intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Link: https://lore.kernel.org/r/20230714201428.1718097-1-anthony.l.nguyen@intel.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The DT of_device.h and of_platform.h date back to the separate
of_platform_bus_type before it as merged into the regular platform bus.
As part of that merge prepping Arm DT support 13 years ago, they
"temporarily" include each other. They also include platform_device.h
and of.h. As a result, there's a pretty much random mix of those include
files used throughout the tree. In order to detangle these headers and
replace the implicit includes with struct declarations, users need to
explicitly include the correct includes.
Signed-off-by: Rob Herring <robh@kernel.org>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Link: https://lore.kernel.org/r/20230714174922.4063153-1-robh@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Create a new netconsole runtime option that prepends the kernel version in
the netconsole message. This is useful to map kernel messages to kernel
version in a simple way, i.e., without checking somewhere which kernel
version the host that sent the message is using.
If this option is selected, then the "<release>," is prepended before the
netconsole message. This is an example of a netconsole output, with
release feature enabled:
6.4.0-01762-ga1ba2ffe946e;12,426,112883998,-;this is a test
Cc: Dave Jones <davej@codemonkey.org.uk>
Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://lore.kernel.org/r/20230714111330.3069605-1-leitao@debian.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Russell King says:
====================
Remove some unused phylink legacy
I believe we are now in a position where some of the legacy phylink code
can be removed!
I believe that all DSA drivers do not make use of any pre-March 2020
phylink behaviour - all drivers now seem to set legacy_pre_march2020 to
false, and the conditions that DSA sets it to true are no longer
satisifed by any driver.
Moreover, no one uses the .mac_an_restart() method, so this can also be
removed.
====================
Link: https://lore.kernel.org/r/ZLERQ2OBrv44Ppyc@shell.armlinux.org.uk
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The mac_an_restart() method is now completely unused, and has been
superseded by phylink_pcs support. Remove this method.
Since phylink_pcs_mac_an_restart() now only deals with the PCS, rename
the function to remove the _mac infix.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Since DSA no longer marks anything as phylink-legacy, there is now no
need for DSA drivers to set this member to false. Remove all instances
of this.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
All drivers are now updated for the March 2020 changes, and no longer
make use of the mac_pcs_get_state() or mac_an_restart() operations,
which are now NULL across all DSA drivers. All DSA drivers don't look
at speed, duplex, pause or advertisement in their phylink_mac_config()
method either.
Remove support for these operations from DSA, and stop marking DSA as
a legacy driver by default.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Luo Jie says:
====================
net: phy: at803x: support qca8081 1G version chip
This patch series add supporting qca8081 1G version chip, the 1G version
chip can be identified by the register mmd7.0x901d bit0.
In addition, qca8081 does not support 1000BaseX mode and the sgmii fifo
reset is added on the link changed, which assert the fifo on the link
down, deassert the fifo on the link up.
Changes in v1:
* switch to use genphy_c45_pma_read_abilities.
* remove the patch [remove 1000BaseX mode of qca8081].
* move the sgmii fifo reset to link_change_notify.
Changes in v2:
* split the qca8081 1G chip support patch.
* improve the slave seed config, disable it if master preferred.
Changes in v3:
* fix the comments.
* add the help function qca808x_has_fast_retrain_or_slave_seed.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The qca8081 sgmii fifo needs to be reset on link down and
released on the link up in case of any abnormal issue
such as the packet blocked on the PHY.
Signed-off-by: Luo Jie <quic_luoj@quicinc.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
The fast retrain and slave seed configs are only applicable when the 2.5G
ability is supported.
Signed-off-by: Luo Jie <quic_luoj@quicinc.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
The qca8081 1G chip version does not support 2.5 capability, which
is distinguished from qca8081 2.5G chip according to the bit0 of
register mmd7.0x901d, the 1G version chip also has the same PHY ID
as the normal qca8081 2.5G chip.
Signed-off-by: Luo Jie <quic_luoj@quicinc.com>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
qca8081 is the single port PHY, the slave prefer mode is used
by default.
if the phy master perfer mode is configured, the slave seed
configuration should not be enabled, since the slave seed
enablement is for making PHY linked as slave mode easily.
disable slave seed if the master mode is preferred.
Signed-off-by: Luo Jie <quic_luoj@quicinc.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
merge the seed enablement and seed value configuration into
one function, since the random seed value is needed to be
configured when the seed is enabled.
Signed-off-by: Luo Jie <quic_luoj@quicinc.com>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
qca8081 PHY supports to use genphy_c45_pma_read_abilities for
getting the PHY features supported except for the autoneg ability
but autoneg ability exists in MDIO_STAT1 instead of MMD7.1, add it
manually after calling genphy_c45_pma_read_abilities.
Signed-off-by: Luo Jie <quic_luoj@quicinc.com>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vignesh Viswanathan says:
====================
net: qrtr: Few fixes in QRTR
Add fixes in QRTR ns to change server and nodes radix tree to xarray to
avoid a use-after-free while iterating through the server or nodes
radix tree.
Also fix the destination port value for IPCR control buffer on older
targets.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The destination port value in the IPCR control buffer on older
targets is 0xFFFF. Handle the same by updating the dst_port to
QRTR_PORT_CTRL.
Signed-off-by: Vignesh Viswanathan <quic_viswanat@quicinc.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is a use after free scenario while iterating through the nodes
radix tree despite the ns being a single threaded process. This can
happen when the radix tree APIs are not synchronized with the
rcu_read_lock() APIs.
Convert the radix tree for nodes to xarray to take advantage of the
built in rcu lock usage provided by xarray.
Signed-off-by: Chris Lew <quic_clew@quicinc.com>
Signed-off-by: Vignesh Viswanathan <quic_viswanat@quicinc.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is a use after free scenario while iterating through the servers
radix tree despite the ns being a single threaded process. This can
happen when the radix tree APIs are not synchronized with the
rcu_read_lock() APIs.
Convert the radix tree for servers to xarray to take advantage of the
built in rcu lock usage provided by xarray.
Signed-off-by: Chris Lew <quic_clew@quicinc.com>
Signed-off-by: Vignesh Viswanathan <quic_viswanat@quicinc.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Justin Chen says:
====================
Brcm ASP 2.0 Ethernet Controller
Add support for the Broadcom ASP 2.0 Ethernet controller which is first
introduced with 72165.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support for wake on network filters. The max match is 256 bytes.
Signed-off-by: Justin Chen <justin.chen@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support for Wake-On-Lan magic packet and magic packet with password.
Signed-off-by: Justin Chen <justin.chen@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support for the Broadcom ASP 2.0 Ethernet controller which is first
introduced with 72165. This controller features two distinct Ethernet
ports that can be independently operated.
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Florian Fainelli <florian.fainelli@broadcom.com>
Signed-off-by: Justin Chen <justin.chen@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rename local `struct fec_enet_private *adapter` to `fep` in `fec_ptp_gettime()` to match the rest of the driver
Signed-off-by: Csókás Bence <csokas.bence@prolan.hu>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Spotted this trivial spell mistake while casually reading
the google GVE driver code.
Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Petr Machata says:
====================
mlxsw: Manage RIF across PVID changes
The mlxsw driver currently makes the assumption that the user applies
configuration in a bottom-up manner. Thus netdevices need to be added to
the bridge before IP addresses are configured on that bridge or SVI added
on top of it. Enslaving a netdevice to another netdevice that already has
uppers is in fact forbidden by mlxsw for this reason. Despite this safety,
it is rather easy to get into situations where the offloaded configuration
is just plain wrong.
As an example, take a front panel port, configure an IP address: it gets a
RIF. Now enslave the port to the bridge, and the RIF is gone. Remove the
port from the bridge again, but the RIF never comes back. There is a number
of similar situations, where changing the configuration there and back
utterly breaks the offload.
The situation is going to be made better by implementing a range of replays
and post-hoc offloads.
In this patch set, address the ordering issues related to creation of
bridge RIFs. Currently, mlxsw has several shortcomings with regards to RIF
handling due to PVID changes:
- In order to cause RIF for a bridge device to be created, the user is
expected first to set PVID, then to add an IP address. The reverse
ordering is disallowed, which is not very user-friendly.
- When such bridge gets a VLAN upper whose VID was the same as the existing
PVID, and this VLAN netdevice gets an IP address, a RIF is created for
this netdevice. The new RIF is then assigned to the 802.1Q FID for the
given VID. This results in a working configuration. However, then, when
the VLAN netdevice is removed again, the RIF for the bridge itself is
never reassociated to the PVID.
- PVID cannot be changed once the bridge has uppers. Presumably this is
because the driver does not manage RIFs properly in face of PVID changes.
However, as the previous point shows, it is still possible to get into
invalid configurations.
This patch set addresses these issues and relaxes some of the ordering
requirements that mlxsw had. The patch set proceeds as follows:
- In patch #1, pass extack to mlxsw_sp_br_ban_rif_pvid_change()
- To relax ordering between setting PVID and adding an IP address to a
bridge, mlxsw must be able to request that a RIF is created with a given
VLAN ID, instead of trying to deduce it from the current netdevice
settings, which do not reflect the user-requested values yet. This is
done in patches #2 and #3.
- Similarly, mlxsw_sp_inetaddr_bridge_event() will need to make decisions
based on the user-requested value of PVID, not the current value. Thus in
patches #4 and #5, add a new argument which carries the requested PVID
value.
- Finally in patch #6 relax the ban on PVID changes when a bridge has
uppers. Instead, add the logic necessary for creation of a RIF as a
result of PVID change.
- Relevant selftests are presented afterwards. In patch #7 a preparatory
helper is added to lib.sh. Patches #8, #9, #10 and #11 include selftests
themselves.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This tests whether addition and deletion of a VLAN upper that coincides
with the current PVID setting throws off forwarding.
This selftests is specifically geared towards offloading drivers. In
particular, mlxsw used to fail this selftest, and an earlier patch in this
patchset fixes the issue. However, there's nothing HW-specific in the test
itself (it absolutely is supposed to pass on SW datapath), and therefore it
is put into the generic forwarding directory.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Danielle Ratson <danieller@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>