Commit Graph

1156170 Commits

Author SHA1 Message Date
Maxim Mikityanskiy
79efecb41f net/mlx5e: Trigger NAPI after activating an SQ
If an SQ is deactivated and reactivated again, some packets could be
sent after MLX5E_SQ_STATE_ENABLED is cleared, but before
netif_tx_stop_queue, meaning that NAPI might miss some completions. In
order to handle them, make sure to trigger NAPI after SQ activation in
all cases where it can be relevant. Regular SQs, XDP SQs and XSK SQs are
good. Missing cases added: after recovery, after activating HTB SQs and
after activating PTP SQs.

Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-02-04 02:07:04 -08:00
Raed Salem
a7385187a3 net/mlx5e: IPsec, support upper protocol selector field offload
Add support for offloading the policy/state upper protocol selector field.
This makes it possible to select traffic for IPsec operation based on the L4
protocol (TCP/UDP) with a specific source/destination port.

Signed-off-by: Raed Salem <raeds@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-02-04 02:07:04 -08:00
Dragos Tatulea
ce231772da net/mlx5e: IPoIB, Add support for XDR speed
Add XDR IB PTYS coding and XDR speed 200Gbps.

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Gal Pressman <gal@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-02-04 02:07:04 -08:00
Jack Morgenstein
7eef93003e net/mlx5: Enhance debug print in page allocation failure
Provide more details to aid debugging.

Fixes: bf0bf77f65 ("mlx5: Support communicating arbitrary host page size to firmware")
Signed-off-by: Eran Ben Elisha <eranbe@nvidia.com>
Signed-off-by: Majd Dibbiny <majd@nvidia.com>
Signed-off-by: Jack Morgenstein <jackm@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-02-04 02:07:03 -08:00
Rahul Rameshbabu
b63636b6c1 net/mlx5: Add firmware support for MTUTC scaled_ppm frequency adjustments
When device is capable of handling scaled ppm values for adjusting
frequency, conversion to ppb will not be done by the driver. Instead, the
scaled ppm value will be passed directly to the device for the frequency
adjustment operation.

Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
Reviewed-by: Gal Pressman <gal@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-02-04 02:07:03 -08:00
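
Illustrative sketch, not taken from the patch: scaled ppm is parts per million with a 16-bit binary fraction, so ppb = scaled_ppm * 1000 / 2^16, which simplifies to scaled_ppm * 125 / 2^13. The Python below models the choice described in the commit above between converting in the driver and passing the value through. `adjust_frequency` and `device_takes_scaled_ppm` are hypothetical names used only for this example.

```python
def scaled_ppm_to_ppb(scaled_ppm: int) -> int:
    """Convert scaled ppm (ppm with a 16-bit binary fraction) to ppb.

    ppb = scaled_ppm * 1000 / 2**16 = scaled_ppm * 125 / 2**13.
    Python's >> floors, which matches an arithmetic right shift.
    """
    return (scaled_ppm * 125) >> 13


def adjust_frequency(scaled_ppm: int, device_takes_scaled_ppm: bool):
    """Return the (unit, value) pair a hypothetical driver would send."""
    if device_takes_scaled_ppm:
        # Capable device: hand over the raw value, no conversion in the driver.
        return ("scaled_ppm", scaled_ppm)
    # Otherwise the driver converts to ppb first.
    return ("ppb", scaled_ppm_to_ppb(scaled_ppm))
```

For example, a scaled ppm value of 65536 is exactly 1 ppm, which the fallback path turns into 1000 ppb.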
Rahul Rameshbabu
04937a0f68 net/mlx5: Document support for RoCE HCA disablement capability
Some mlx5 devices are capable of disabling RoCE. In this situation,
disablement does not need to be handled at the driver level.

Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
Reviewed-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-02-04 02:07:03 -08:00
Rahul Rameshbabu
8ce3b586fa net/mlx5: Add counter information to mlx5 driver documentation
Update rst file to contain general information about statistics counters
for the mlx5 driver. Add specifics about individual counters in list
tables.

Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
Reviewed-by: Gal Pressman <gal@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-02-04 02:07:03 -08:00
Rahul Rameshbabu
e12ebbf0cc net/mlx5: Document previously implemented mlx5 tracepoints
Tracepoints were previously implemented but not documented till this patch
series.

Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
Reviewed-by: Gal Pressman <gal@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-02-04 02:07:03 -08:00
Rahul Rameshbabu
a12ba19269 net/mlx5: Update Kconfig parameter documentation
Provide information for Kconfig flags defined but not documented till this
patch series for the mlx5 driver.

Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
Reviewed-by: Gal Pressman <gal@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-02-04 02:07:03 -08:00
Rahul Rameshbabu
f2d51e5793 net/mlx5: Separate mlx5 driver documentation into multiple pages
The mlx5 device driver documentation page has grown in size and should be
split into multiple subpages. This change also contains a table of contents
for these new subpages.

Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
Reviewed-by: Gal Pressman <gal@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-02-04 02:07:02 -08:00
Roi Dayan
199abf33f4 net/mlx5: Lag, Move mpesw related definitions to mpesw.h
mpesw definitions should be in mpesw.h and not lag.h.

Signed-off-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Maor Dickman <maord@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-02-04 02:07:02 -08:00
Mark Bloch
6a80313d24 net/mlx5: Lag, Use flag to check for shared FDB mode
It's redundant and incorrect to check that lag is also in sriov mode.

Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-02-04 02:07:02 -08:00
Roi Dayan
b399b066e2 net/mlx5: Lag, Remove redundant bool allocation on the stack
There is no need to allocate the bool variable; the function can just return the value.

Signed-off-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Maor Dickman <maord@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-02-04 02:07:02 -08:00
Roi Dayan
9a49a64ea7 net/mlx5: Lag, Use mlx5_lag_dev() instead of dereferencing pointers
Use the existing wrapper mlx5_lag_dev() to access the lag object from
dev for better maintainability and consistent code.

Signed-off-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Maor Dickman <maord@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-02-04 02:07:02 -08:00
Roi Dayan
2afcfae77a net/mlx5: Lag, Update multiport eswitch check to log an error
Update the function to log an error to the user if failing to offload
the rule and while there add correct prefix for the function name.

Signed-off-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Maor Dickman <maord@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-02-04 02:07:02 -08:00
David S. Miller
042b7858d5 Merge branch 'net-smc-parallelism'
D. Wythe says:

====================
net/smc: optimize the parallelism of SMC-R connections

This patch set attempts to optimize the parallelism of SMC-R connections,
mainly to reduce unnecessary blocking on locks, and to fix exceptions that
occur after those optimizations.

According to the off-CPU graph, the SMC workers' off-CPU time breaks down as follows:

smc_close_passive_work                  (1.09%)
        smcr_buf_unuse                  (1.08%)
                smc_llc_flow_initiate   (1.02%)

smc_listen_work                         (48.17%)
        __mutex_lock.isra.11            (47.96%)

An ideal SMC-R connection process should only block on the IO events
of the network, but it's quite clear that the SMC-R connection is
currently queued on locks most of the time.

The goal of this patchset is to achieve our ideal situation, where a
connection blocks only on network IO events for the majority of its lifetime.

There are three big locks here:

1. smc_client_lgr_pending & smc_server_lgr_pending

2. llc_conf_mutex

3. rmbs_lock & sndbufs_lock

And an implementation issue:

1. confirm/delete rkey messages can't be sent concurrently, although the
protocol allows it.

Unfortunately, the above problems together affect the parallelism of
SMC-R connections. If any of them is not solved, our goal cannot
be achieved.

After this patch set, we get a nearly ideal off-CPU graph, as
follows:

smc_close_passive_work                                  (41.58%)
        smcr_buf_unuse                                  (41.57%)
                smc_llc_do_delete_rkey                  (41.57%)

smc_listen_work                                         (39.10%)
        smc_clc_wait_msg                                (13.18%)
                tcp_recvmsg_locked                      (13.18%)
        smc_listen_find_device                          (25.87%)
                smcr_lgr_reg_rmbs                       (25.87%)
                        smc_llc_do_confirm_rkey         (25.87%)

We can see that most of the waiting time is now spent waiting for network
IO events. This also brings a certain performance improvement on our
short-lived connection wrk/nginx benchmark test:

+--------------+------+------+-------+--------+------+--------+
|conns/qps     |c4    | c8   |  c16  |  c32   | c64  |  c200  |
+--------------+------+------+-------+--------+------+--------+
|SMC-R before  |9.7k  | 10k  |  10k  |  9.9k  | 9.1k |  8.9k  |
+--------------+------+------+-------+--------+------+--------+
|SMC-R now     |13k   | 19k  |  18k  |  16k   | 15k  |  12k   |
+--------------+------+------+-------+--------+------+--------+
|TCP           |15k   | 35k  |  51k  |  80k   | 100k |  162k  |
+--------------+------+------+-------+--------+------+--------+

The benefit becomes less obvious as the number of connections increases,
and the reason is the workqueue. If we change the workqueue to UNBOUND,
we can obtain at least a 4-5 times performance improvement, reaching up to
half of TCP. However, this is not an elegant solution, and optimizing it
properly will be much more complicated. In any case, we will submit the
relevant optimization patches as soon as possible.

Please note that the premise here is that the lock related problem
must be solved first, otherwise, no matter how we optimize the workqueue,
there won't be much improvement.

Because there are a lot of related changes to the code, if you have
any questions or suggestions, please let me know.

Thanks
D. Wythe

v1 -> v2:

1. Fix panic in SMC-D scenario
2. Fix lnkc related hashfn calculation exception, caused by operator
priority
3. Only wake up one connection if the lnk is not active
4. Delete obsolete unlock logic in smc_listen_work()
5. PATCH format, do Reverse Christmas tree
6. PATCH format, change all xxx_lnk_xxx function to xxx_link_xxx
7. PATCH format, add correct fix tag for the patches for fixes.
8. PATCH format, fix some spelling errors
9. PATCH format, rename slow to do_slow

v2 -> v3:

1. add SMC-D support, remove the concept of link cluster since SMC-D has
no link at all. Replace it by lgr decision maker, who provides suggestions
to SMC-D and SMC-R on whether to create new link group.

2. Fix the corruption problem described by PATCH 'fix application
data exception' on SMC-D.

v3 -> v4:

1. Fix panic caused by uninitialized map.

v4 -> v5:

1. Make SMC-D buf creation serial to avoid potential errors
2. Add a flag to synchronize the success of the first contact
with the readiness of the link group, including SMC-D and SMC-R.
3. Fixed possible reference count leak in smc_llc_flow_start().
4. Reorder the patches so that the bugfix patches come first.

v5 -> v6:

1. Separate the bugfix patches to make them independent.
2. Merge patch 'fix SMC_CLC_DECL_ERR_REGRMB without smc_server_lgr_pending'
with patch 'remove locks smc_client_lgr_pending and smc_server_lgr_pending'
3. Format code styles, including alignment and reverse christmas tree
style.
4. Fix a possible memory leak in smc_llc_rmt_delete_rkey()
and smc_llc_rmt_conf_rkey().

v6 -> v7:

1. Discard patch attempting to remove global locks
2. Discard patch attempting to make the confirm/delete rkey process concurrent
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2023-02-04 09:48:19 +00:00
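
Illustrative arithmetic on the benchmark table in the cover letter above, written as a small Python sketch. The figures are copied from the table (QPS in thousands); the helper name `ratios` is made up for this example, and the rounding to two decimals is an arbitrary choice.

```python
# QPS figures copied from the cover letter's table, in thousands.
conns = ["c4", "c8", "c16", "c32", "c64", "c200"]
before = [9.7, 10, 10, 9.9, 9.1, 8.9]
now = [13, 19, 18, 16, 15, 12]
tcp = [15, 35, 51, 80, 100, 162]


def ratios(numerators, denominators):
    """Element-wise ratio, rounded to two decimals."""
    return [round(n / d, 2) for n, d in zip(numerators, denominators)]


# How much faster SMC-R became at each connection count.
speedup = dict(zip(conns, ratios(now, before)))
# How close the new SMC-R numbers get to plain TCP.
share_of_tcp = dict(zip(conns, ratios(now, tcp)))
```

Read this way, the gain peaks at c8 (1.9x) and shrinks toward c200 (about 1.35x), while the share of TCP throughput falls from roughly 0.87 at c4 to about 0.07 at c200, which is consistent with the cover letter's remark that the benefit becomes less obvious at higher connection counts.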
D. Wythe
aff7bfed90 net/smc: replace mutex rmbs_lock and sndbufs_lock with rw_semaphore
It's clear that rmbs_lock and sndbufs_lock are meant to protect the
rmbs list and the sndbufs list.

During connection establishment, smc_buf_get_slot() will always
be invoked, and it only performs read semantics on the rmbs list and
the sndbufs list.

Based on the above considerations, we replace the mutex with rw_semaphore.
Only smc_buf_get_slot() uses down_read(), which allows smc_buf_get_slot()
to run concurrently; the other parts use down_write() to keep exclusive
semantics.

Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-02-04 09:48:19 +00:00
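
Illustrative sketch, not kernel code: a tiny readers-writer lock in Python that models why replacing a mutex with an rw_semaphore helps in the commit above. Any number of readers may hold the lock together, while a writer needs it exclusively. The class `RWLock` and its `try_*` methods are invented for this example; unlike down_read()/down_write(), they return immediately instead of sleeping, so the behaviour is easy to check.

```python
import threading


class RWLock:
    """Minimal, unfair readers-writer lock used only as a model."""

    def __init__(self):
        self._state = threading.Lock()  # protects the two fields below
        self._readers = 0
        self._writer = False

    def try_read(self) -> bool:
        """Succeeds unless a writer currently holds the lock."""
        with self._state:
            if self._writer:
                return False
            self._readers += 1
            return True

    def read_unlock(self) -> None:
        with self._state:
            self._readers -= 1

    def try_write(self) -> bool:
        """Succeeds only when there is no reader and no writer."""
        with self._state:
            if self._writer or self._readers:
                return False
            self._writer = True
            return True

    def write_unlock(self) -> None:
        with self._state:
            self._writer = False
```

With a plain mutex the second `try_read()` in a row would have to wait; here two lookups can proceed together and only list modifications are serialized.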
D. Wythe
4da687448d net/smc: reduce unnecessary blocking in smcr_lgr_reg_rmbs()
Unlike smc_buf_create() and smcr_buf_unuse(), smcr_lgr_reg_rmbs() is
exclusive when the assigned rmb_desc has not been registered, although it
can be executed in parallel when the assigned rmb_desc has already been
registered and it only performs read semantics on it. Hence, we cannot
simply replace it with a read semaphore.

The idea here is that if the assigned rmb_desc has already been registered,
a read semaphore protects the critical section; if the assigned
rmb_desc has not been registered, keep using the write semaphore
to preserve its exclusivity.

Thanks to the reusable nature of rmb_desc, this allows us to execute
in parallel in most cases.

Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-02-04 09:48:19 +00:00
D. Wythe
f6421014e8 net/smc: use read semaphores to reduce unnecessary blocking in smc_buf_create() & smcr_buf_unuse()
The following is part of the off-CPU graph during frequent processing of
short-lived SMC-R connections:

process_one_work				(51.19%)
smc_close_passive_work			(28.36%)
	smcr_buf_unuse				(28.34%)
	rwsem_down_write_slowpath		(28.22%)

smc_listen_work				(22.83%)
	smc_clc_wait_msg			(1.84%)
	smc_buf_create				(20.45%)
		smcr_buf_map_usable_links
		rwsem_down_write_slowpath	(20.43%)
	smcr_lgr_reg_rmbs			(0.53%)
		rwsem_down_write_slowpath	(0.43%)
		smc_llc_do_confirm_rkey		(0.08%)

We can clearly see that during connection establishment, connections
spend their waiting time not on IO, but on llc_conf_mutex.

More importantly, the core critical areas (smcr_buf_unuse() and
smc_buf_create()) only perform read semantics on links, so we can
easily protect them with a read semaphore.

Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-02-04 09:48:19 +00:00
D. Wythe
b5dd4d6981 net/smc: llc_conf_mutex refactor, replace it with rw_semaphore
llc_conf_mutex was used to protect links and link related configurations
in the same link group, for example, add or delete links. However,
in most cases, the protected critical area has only read semantics and
with no write semantics at all, such as obtaining a usable link or an
available rmb_desc.

This patch is a simple code refactoring: replace mutex with rw_semaphore,
replace mutex_lock with down_write and replace mutex_unlock with
up_write.

Theoretically, this replacement is equivalent, but after this patch,
we can distinguish lock granularity according to different semantics
of critical areas.

Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-02-04 09:48:19 +00:00
Jakub Kicinski
88c940cccb Merge branch 'updates-to-enetc-txq-management'
Vladimir Oltean says:

====================
Updates to ENETC TXQ management

The set ensures that the number of TXQs given by enetc to the network
stack (mqprio or TX hashing) + the number of TXQs given to XDP never
exceeds the number of available TXQs.

These are the first 4 patches of series "[v5,net-next,00/17] ENETC
mqprio/taprio cleanup" from here:
https://patchwork.kernel.org/project/netdevbpf/cover/20230202003621.2679603-1-vladimir.oltean@nxp.com/

There is no change in this version compared to there. I split them off
because this contains a fix for net-next and it would be good if it
could go in quickly. I also did it to reduce the patch count of that
other series, if I need to respin it again.
====================

Link: https://lore.kernel.org/r/20230203001116.3814809-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-03 20:05:59 -08:00
Vladimir Oltean
800db2d125 net: enetc: ensure we always have a minimum number of TXQs for stack
Currently it can happen that an mqprio qdisc is installed with num_tc 8,
and this will reserve 8 (out of 8) TXQs for the network stack. Then we
can attach an XDP program, and this will crop 2 TXQs, leaving just 6 for
mqprio. That's not what the user requested, and we should fail it.

On the other hand, if mqprio isn't requested, we still give the 8 TXQs
to the network stack (with hashing among a single traffic class), but
then, cropping 2 TXQs for XDP is fine, because the user didn't
explicitly ask for any number of TXQs, so no expectations are violated.

Simply put, the logic that mqprio should impose a minimum number of TXQs
for the network never existed. Let's say (more or less arbitrarily) that
without mqprio, the driver expects a minimum number of TXQs equal to the
number of CPUs (on NXP LS1028A, that is either 1, or 2). And with mqprio,
mqprio gives the minimum required number of TXQs.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-03 20:05:57 -08:00
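
Illustrative sketch of the rule described in the commit above, in Python rather than driver code. The function and parameter names are hypothetical: `mqprio_txqs` stands for the number of TXQs an installed mqprio qdisc asked for (or `None` when mqprio is not in use), and `xdp_txqs` for the number of TXQs that attaching an XDP program would take away.

```python
from typing import Optional


def min_stack_txqs(mqprio_txqs: Optional[int], num_cpus: int) -> int:
    """Minimum number of TXQs that must stay with the network stack."""
    if mqprio_txqs:
        # The user asked for this many queues explicitly; honor it.
        return mqprio_txqs
    # No explicit request: fall back to one queue per CPU.
    return num_cpus


def can_attach_xdp(total_txqs: int, xdp_txqs: int,
                   mqprio_txqs: Optional[int], num_cpus: int) -> bool:
    """True if the stack keeps at least its minimum after XDP takes its share."""
    remaining = total_txqs - xdp_txqs
    return remaining >= min_stack_txqs(mqprio_txqs, num_cpus)
```

With the numbers from the commit message, 8 TXQs and an mqprio request for all 8, cropping 2 for XDP is refused; without mqprio the same attach is accepted because only the per-CPU minimum applies.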
Vladimir Oltean
4ea1dd743e net: enetc: recalculate num_real_tx_queues when XDP program attaches
Since the blamed net-next commit, enetc_setup_xdp_prog() no longer goes
through enetc_open(), and therefore, the function which was supposed to
detect whether a BPF program exists (in order to crop some TX queues
from network stack usage), enetc_num_stack_tx_queues(), no longer gets
called.

We can move the netif_set_real_num_rx_queues() call to enetc_alloc_msix()
(probe time), since it is a runtime invariant. We can do the same thing
with netif_set_real_num_tx_queues(), and let enetc_reconfigure_xdp_cb()
explicitly recalculate and change the number of stack TX queues.

Fixes: c33bfaf91c ("net: enetc: set up XDP program under enetc_reconfigure()")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-03 20:05:57 -08:00
Vladimir Oltean
46a0ecf93b net: enetc: allow the enetc_reconfigure() callback to fail
enetc_reconfigure() was modified in commit c33bfaf91c ("net: enetc:
set up XDP program under enetc_reconfigure()") to take an optional
callback that runs while the netdev is down, but this callback currently
cannot fail.

Code up the error handling so that the interface is restarted with the
old resources if the callback fails.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-03 20:05:57 -08:00
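
Illustrative sketch of the error handling described in the commit above: run an optional callback while the interface is down, and bring the interface back up with the old resources if the callback fails. This is a Python model with invented names (`reconfigure`, the `dev` dictionary and its keys); the real driver's ordering of allocation and teardown may differ.

```python
def reconfigure(dev: dict, new_resources, callback=None) -> int:
    """Swap in new_resources, rolling back if the callback reports an error.

    Returns 0 on success or the callback's non-zero error code.
    """
    old_resources = dev["resources"]

    dev["running"] = False            # interface goes down
    dev["resources"] = new_resources

    err = callback(dev) if callback else 0
    if err:
        # Callback failed: restore what was there before.
        dev["resources"] = old_resources

    dev["running"] = True             # interface is restarted either way
    return err
```

The point of the design is that a failing callback leaves the interface running in its previous configuration instead of leaving it down.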
Vladimir Oltean
1c81a9b3aa net: enetc: simplify enetc_num_stack_tx_queues()
We keep a pointer to the xdp_prog in the private netdev structure as
well; what's replicated per RX ring is done so just for more convenient
access from the NAPI poll procedure.

Simplify enetc_num_stack_tx_queues() by looking at priv->xdp_prog rather
than iterating through the information replicated per RX ring.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-03 20:05:57 -08:00
Jakub Kicinski
8788260e8f Merge branch 'raw-add-drop-reasons-and-use-another-hash-function'
Eric Dumazet says:

====================
raw: add drop reasons and use another hash function

The first two patches add drop reasons to raw input processing.

Last patch spreads RAW sockets in the shared hash tables
to avoid long hash buckets in some cases.
====================

Link: https://lore.kernel.org/r/20230202094100.3083177-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-03 19:56:26 -08:00
Eric Dumazet
6579f5bacc raw: use net_hash_mix() in hash function
Some applications seem to rely on RAW sockets.

If they use private netns, we can avoid piling all RAW
sockets bound to a given protocol into a single bucket.

Also place (struct raw_hashinfo).lock into its own
cache line to limit false sharing.

Alternative would be to have per-netns hashtables,
but this seems too expensive for most netns
where RAW sockets are not used.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-03 19:56:23 -08:00
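
Illustrative sketch of why mixing a per-netns value into the hash spreads RAW sockets across buckets, as the commit above describes. This is a Python model, not the kernel function: the table size (256 buckets), the multiplicative hash constant and the "protocol number only" form of the old hash are assumptions made for the example, and `netns_mix` stands in for whatever net_hash_mix() returns for a namespace.

```python
GOLDEN_RATIO_32 = 0x61C88647   # assumed multiplicative hash constant
RAW_HTABLE_LOG = 8             # assumed: 2**8 = 256 buckets


def hash_32(val: int, bits: int) -> int:
    """Multiplicative hash keeping the top `bits` bits of a 32-bit product."""
    return ((val * GOLDEN_RATIO_32) & 0xFFFFFFFF) >> (32 - bits)


def old_bucket(proto: int) -> int:
    """Bucket chosen from the protocol alone: identical in every netns."""
    return proto & ((1 << RAW_HTABLE_LOG) - 1)


def new_bucket(netns_mix: int, proto: int) -> int:
    """Bucket chosen from the protocol mixed with a per-netns value."""
    return hash_32((netns_mix ^ proto) & 0xFFFFFFFF, RAW_HTABLE_LOG)
```

In this model every namespace's ICMP sockets (protocol 1) share bucket 1 under the old scheme, whereas under the new one different `netns_mix` values land in different buckets.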
Eric Dumazet
42186e6c00 ipv4: raw: add drop reasons
Use existing helpers and drop reason codes for RAW input path.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-03 19:56:23 -08:00
Eric Dumazet
8d8ebd77f5 ipv6: raw: add drop reasons
Use existing helpers and drop reason codes for RAW input path.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-03 19:56:23 -08:00
Jakub Kicinski
dfefcb0c04 Merge branch 'devlink-move-devlink-dev-code-to-a-separate-file'
Moshe Shemesh says:

====================
devlink: Move devlink dev code to a separate file

This patchset is moving code from the file leftover.c to new file dev.c.
About 1.3K lines are moved by this patchset covering most of the devlink
dev object callbacks and functionality: reload, eswitch, info, flash and
selftest.
====================

Link: https://lore.kernel.org/r/1675349226-284034-1-git-send-email-moshe@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-03 19:25:28 -08:00
Moshe Shemesh
7c976c7cfc devlink: Move devlink dev selftest code to dev
Move devlink dev selftest callbacks and related code from leftover.c to
file dev.c. No functional change in this patch.

Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-03 19:25:26 -08:00
Moshe Shemesh
ec4a0ce92e devlink: Move devlink_info_req struct to be local
As all users of the struct devlink_info_req are already in dev.c, move
this struct from devl_internal.c to be local in dev.c.

Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-03 19:25:26 -08:00
Moshe Shemesh
a13aab66cb devlink: Move devlink dev flash code to dev
Move devlink dev flash callbacks, helpers and other related code from
leftover.c to dev.c. No functional change in this patch.

Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-03 19:25:26 -08:00
Moshe Shemesh
d60191c46e devlink: Move devlink dev info code to dev
Move devlink dev info callbacks, related drivers helpers functions and
other related code from leftover.c to dev.c. No functional change in
this patch.

Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-03 19:25:26 -08:00
Moshe Shemesh
af2f8c1f82 devlink: Move devlink dev eswitch code to dev
Move devlink dev eswitch callbacks and related code from leftover.c to
file dev.c. No functional change in this patch.

Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-03 19:25:25 -08:00
Moshe Shemesh
c6ed7d6ef9 devlink: Move devlink dev reload code to dev
Move devlink dev reload callback and related code from leftover.c to
file dev.c. No functional change in this patch.

Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-03 19:25:25 -08:00
Moshe Shemesh
dbeeca81bd devlink: Split out dev get and dump code
Move devlink dev get and dump callbacks and related dev code to new file
dev.c. This file shall include all callbacks that are specific to the
devlink dev object.

No functional change in this patch.

Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-03 19:25:25 -08:00
Vladimir Oltean
d795527d50 net: dsa: use NL_SET_ERR_MSG_WEAK_MOD() more consistently
Now that commit 028fb19c6b ("netlink: provide an ability to set
default extack message") provides a weak function that doesn't override
an existing extack message provided by the driver, it makes sense to use
it also for LAG and HSR offloading, not just for bridge offloading.

Also consistently put the message string on a separate line, to reduce
line length from 92 to 84 characters.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Link: https://lore.kernel.org/r/20230202140354.3158129-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-03 19:23:32 -08:00
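
Illustrative sketch of the "weak" message semantics the commit above relies on: a weak setter fills in a default error message only when no message has been set yet, so a more specific message from the driver survives. This is a Python model; the dictionary standing in for the extack and both function names are invented for the example.

```python
def set_err_msg(extack: dict, msg: str) -> None:
    """Unconditional setter: always overwrites the current message."""
    extack["msg"] = msg


def set_err_msg_weak(extack: dict, msg: str) -> None:
    """Weak setter: only fills in a message if none is present."""
    if not extack.get("msg"):
        extack["msg"] = msg
```

A caller can therefore apply the weak setter after every offload attempt without checking whether the driver already explained the failure.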
David S. Miller
8065c0e13f Merge branch 'yt8531-support'
Frank Sae says:

====================
net: add dts for yt8521 and yt8531s, add driver for yt8531

Add dts for yt8521 and yt8531s, add driver for yt8531.
 These patches have been verified on our AM335x platform (motherboard)
 which has one integrated yt8521 and one RGMII interface.
 It can connect to daughter boards like yt8531s or yt8531 board.

 v5:
 - change the compatible of yaml
 - change the maintainers of yaml from "frank sae" to "Frank Sae"

 v4:
 - change default tx delay from 150ps to 1950ps
 - add compatible for yaml

 v3:
 - change default rx delay from 1900ps to 1950ps
 - moved ytphy_rgmii_clk_delay_config_with_lock from yt8521's patch to yt8531's patch
 - removed unnecessary checks of phydev->attached_dev->dev_addr

 v2:
 - split BIT macro as one patch
 - split "dts for yt8521/yt8531s ... " patch as two patches
 - use standard rx-internal-delay-ps and tx-internal-delay-ps, removed motorcomm,sds-tx-amplitude
 - removed ytphy_parse_dt, ytphy_probe_helper and ytphy_config_init_helper
 - not store dts arg to yt8521_priv
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2023-02-03 09:34:51 +00:00
Frank Sae
4ac94f728a net: phy: Add driver for Motorcomm yt8531 gigabit ethernet phy
Add a driver for the motorcomm yt8531 gigabit ethernet phy. We have
 verified the driver on AM335x platform with yt8531 board. On the
 board, yt8531 gigabit ethernet phy works in utp mode, RGMII
 interface, supports 1000M/100M/10M speeds, and WoL (magic packet).

Signed-off-by: Frank Sae <Frank.Sae@motor-comm.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-02-03 09:34:51 +00:00
Frank Sae
36152f87dd net: phy: Add dts support for Motorcomm yt8531s gigabit ethernet phy
Add dts support for Motorcomm yt8531s gigabit ethernet phy.
 Change yt8521_probe to support clk config of yt8531s. Because
 yt8521_probe does everything that yt8531s needs, the separate
 yt8531s function is removed.
 This patch has been verified on AM335x platform with yt8531s board.

Signed-off-by: Frank Sae <Frank.Sae@motor-comm.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-02-03 09:34:51 +00:00
Frank Sae
a6e68f0f87 net: phy: Add dts support for Motorcomm yt8521 gigabit ethernet phy
Add dts support for Motorcomm yt8521 gigabit ethernet phy.
 Add ytphy_rgmii_clk_delay_config function to support dts config for
 the delay of rgmii clk. This function is common for yt8521, yt8531s
 and yt8531.
 This patch has been verified on AM335x platform.

Signed-off-by: Frank Sae <Frank.Sae@motor-comm.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-02-03 09:34:51 +00:00
Frank Sae
4869a146cd net: phy: Add BIT macro for Motorcomm yt8521/yt8531 gigabit ethernet phy
Add BIT macro for Motorcomm yt8521/yt8531 gigabit ethernet phy.
 This is a preparatory patch. Add BIT macro for 0xA012 reg, and
 supplement for 0xA001 and 0xA003 reg. These will be used to support dts.

Signed-off-by: Frank Sae <Frank.Sae@motor-comm.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-02-03 09:34:50 +00:00
Frank Sae
cf08dfe8ae dt-bindings: net: Add Motorcomm yt8xxx ethernet phy
Add a YAML binding document for the Motorcomm yt8xxx Ethernet phy.

Signed-off-by: Frank Sae <Frank.Sae@motor-comm.com>
Reviewed-by: Rob Herring <robh@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-02-03 09:34:50 +00:00
David S. Miller
18390581d0 Merge branch 'act_ct-UDP-NEW'
Vlad Buslov says:

====================
net: Allow offloading of UDP NEW connections via act_ct

Currently only bidirectional established connections can be offloaded
via act_ct. This approach allowed a lot of assumptions to be hardcoded into
the act_ct, flow_table and flow_offload intermediate layer code. In order
to enable offloading of unidirectional UDP NEW connections, start by
incrementally changing the following assumptions:

- Drivers assume that only established connections are offloaded and
  don't support updating existing connections. Extract ctinfo from meta
  action cookie and refuse offloading of new connections in the drivers.

- Fix flow_table offload fixup algorithm to calculate flow timeout
  according to current connection state instead of hardcoded
  "established" value.

- Add new flow_table flow flag that designates bidirectional connections
  instead of assuming it and hardcoding hardware offload of every flow
  in both directions.

- Add new flow_table flow flag that designates connections that are
  offloaded to hardware as "established" instead of assuming it. This
  allows some optimizations in act_ct and prevents spamming the
  flow_table workqueue with redundant tasks.

With all the necessary infrastructure in place modify act_ct to offload
UDP NEW as unidirectional connection. Pass reply direction traffic to CT
and promote connection to bidirectional when UDP connection state
changes to "assured". Rely on refresh mechanism to propagate connection
state change to supporting drivers.

Note that the early drop algorithm, which is designed to free up some space
in the connection tracking table when it becomes full (by randomly deleting
up to 5% of non-established connections), currently ignores connections
marked as "offloaded". Now, with UDP NEW connections becoming
"offloaded", this could allow a malicious user to perform a DoS attack by
filling the table with non-droppable UDP NEW connections by sending just
one packet in a single direction. To prevent such a scenario, change the
early drop algorithm to also consider "offloaded" connections for deletion.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2023-02-03 09:31:25 +00:00
Vlad Buslov
df25455e5a netfilter: nf_conntrack: allow early drop of offloaded UDP conns
Both the synchronous early drop algorithm and the asynchronous gc worker
completely ignore connections with the IPS_OFFLOAD_BIT status bit set. With
the new functionality that enables UDP NEW connection offload in action CT,
a malicious user can flood the conntrack table with offloaded UDP connections
by just sending a single packet per 5tuple, because such connections can no
longer be deleted by the early drop algorithm.

To mitigate the issue allow both early drop and gc to consider offloaded
UDP connections for deletion.

Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-02-03 09:31:24 +00:00
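
Illustrative sketch of the policy change described in the commit above: before, offloaded connections were never candidates for early drop or gc; after, offloaded UDP connections are considered too. This Python model uses an invented `Conn` type and deliberately ignores every other criterion the real code applies.

```python
from dataclasses import dataclass


@dataclass
class Conn:
    proto: str        # e.g. "udp" or "tcp"
    offloaded: bool   # models the IPS_OFFLOAD_BIT status bit


def droppable_before(conn: Conn) -> bool:
    """Old policy: anything marked offloaded is skipped."""
    return not conn.offloaded


def droppable_after(conn: Conn) -> bool:
    """New policy: offloaded UDP connections are candidates as well."""
    return not conn.offloaded or conn.proto == "udp"
```

Under the old policy a flood of offloaded UDP entries could not be evicted; under the new one they are eligible, while offloaded connections of other protocols are still skipped.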
Vlad Buslov
6a9bad0069 net/sched: act_ct: offload UDP NEW connections
Modify the offload algorithm of UDP connections to the following:

- Offload NEW connection as unidirectional.

- When connection state changes to ESTABLISHED also update the hardware
flow. However, in order to prevent act_ct from spamming the offload 'add' wq
for every packet coming in reply direction in this state, verify whether
the connection has already been updated to ESTABLISHED in the drivers. If
that is the case, then skip flow_table and let conntrack handle such packets,
which will also allow conntrack to potentially promote the connection to
ASSURED.

- When connection state changes to ASSURED set the flow_table flow
NF_FLOW_HW_BIDIRECTIONAL flag which will cause refresh mechanism to offload
the reply direction.

All other protocols have their offload algorithm preserved and are always
offloaded as bidirectional.

Note that this change tries to minimize the load on the flow_table add
workqueue. First, it tracks the last ctinfo that was offloaded by using the
new flow 'NF_FLOW_HW_ESTABLISHED' flag and doesn't schedule the refresh for
reply direction packets when the offloads have already been updated with
the current ctinfo. Second, when the 'add' task executes on the workqueue it
always updates the offload with the current flow state (by checking the
'bidirectional' flow flag and obtaining the actual ctinfo/cookie through the
meta action instead of caching any of these from the moment of scheduling the
'add' work), avoiding the need to schedule more updates if the state changed
concurrently while the 'add' work was pending on the workqueue.

Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-02-03 09:31:24 +00:00
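
Illustrative sketch of the decision logic listed in the commit above, as a small Python function. The state names follow the commit message; the action strings, the function name and the `hw_established` parameter (modelling whether the drivers have already been updated to ESTABLISHED, that is, the NF_FLOW_HW_ESTABLISHED flag) are invented for this example.

```python
def offload_action(proto: str, state: str, hw_established: bool = False) -> str:
    """Pick the offload step for a packet, following the commit's bullet list."""
    if proto != "udp":
        # All other protocols keep their existing bidirectional offload.
        return "offload_bidirectional"

    if state == "NEW":
        return "offload_unidirectional"
    if state == "ESTABLISHED":
        if hw_established:
            # Hardware already knows; let conntrack see the packet so it
            # can potentially promote the connection to ASSURED.
            return "pass_to_conntrack"
        return "update_hw_flow"
    if state == "ASSURED":
        # Setting the bidirectional flag makes the refresh mechanism
        # offload the reply direction.
        return "set_bidirectional_and_refresh"
    raise ValueError(f"unexpected state: {state}")
```

The `hw_established` check is what keeps reply-direction packets from scheduling redundant work once the hardware flow is up to date.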
Vlad Buslov
d5774cb6c5 net/sched: act_ct: set ctinfo in meta action depending on ct state
Currently tcf_ct_flow_table_fill_actions() function assumes that only
established connections can be offloaded and always sets ctinfo to either
IP_CT_ESTABLISHED or IP_CT_ESTABLISHED_REPLY strictly based on direction
without checking actual connection state. To enable UDP NEW connection
offload set the ctinfo, metadata cookie and NF_FLOW_HW_ESTABLISHED
flow_offload flags bit based on ct->status value.

Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-02-03 09:31:24 +00:00
Vlad Buslov
1a441a9b8b netfilter: flowtable: cache info of last offload
Modify flow table offload to cache the last ct info status that was passed
to the driver offload callbacks by extending enum nf_flow_flags with new
"NF_FLOW_HW_ESTABLISHED" flag. Set the flag if ctinfo was 'established'
during last act_ct meta actions fill call. This infrastructure change is
necessary to optimize promoting of UDP connections from 'new' to
'established' in following patches in this series.

Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-02-03 09:31:24 +00:00
Vlad Buslov
8f84780b84 netfilter: flowtable: allow unidirectional rules
Modify flow table offload to support unidirectional connections by
extending enum nf_flow_flags with new "NF_FLOW_HW_BIDIRECTIONAL" flag. Only
offload reply direction when the flag is set. This infrastructure change is
necessary to support offloading UDP NEW connections in original direction
in following patches in series.

Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-02-03 09:31:24 +00:00