There are almost the same code in ovs_skb_network_trim() and
tcf_ct_skb_network_trim(), this patch extracts them into a function
nf_ct_skb_network_trim() and moves the function to nf_conntrack_ovs.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Aaron Conole <aconole@redhat.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Similar to nf_nat_ovs created by Commit ebddb14049 ("net: move the
nat function to nf_nat_ovs for ovs and tc"), this patch is to create
nf_conntrack_ovs to get these functions shared by OVS and TC only.
There are nf_ct_helper() and nf_ct_add_helper() from nf_conntrak_helper
in this patch, and will be more in the following patches.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Aaron Conole <aconole@redhat.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
skbuff_head_cache is misnamed (perhaps for historical reasons?)
because it does not hold heads. Head is the buffer which skb->data
points to, and also where shinfo lives. struct sk_buff is a metadata
structure, not the head.
Eric recently added skb_small_head_cache (which allocates actual
head buffers), let that serve as an excuse to finally clean this up :)
Leave the user-space visible name intact, it could possibly be uAPI.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alex Elder says:
====================
net: ipa: prepare for GSI register updtaes
An upcoming series (or two) will convert the definitions of GSI
registers used by IPA so they use the "IPA reg" mechanism to specify
register offsets and their fields. This will simplify implementing
the fairly large number of changes required in GSI registers to
support more than 32 GSI channels (introduced in IPA v5.0).
A few minor problems and inconsistencies were found, and they're
fixed here. The last three patches in this series change the
"ipa_reg" code to separate the IPA-specific part (the base virtual
address, basically) from the generic register part, and the now-
generic code is renamed to use just "reg_" or "REG_" as a prefix
rather than "ipa_reg" or "IPA_REG_".
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Rename functions related to register fields so they don't appear to
be IPA-specific, and move their definitions into "reg.h":
ipa_reg_fmask() -> reg_fmask()
ipa_reg_bit() -> reg_bit()
ipa_reg_field_max() -> reg_field_max()
ipa_reg_encode() -> reg_encode()
ipa_reg_decode() -> reg_decode()
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rename ipa_reg_offset() to be reg_offset() and move its definition
to "reg.h". Rename ipa_reg_n_offset() to be reg_n_offset() also.
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
IPA register definitions have evolved with each new version. The
changes required to support more than 32 endpoints in IPA v5.0 made
it best to define a unified mechanism for defining registers and
their fields.
GSI register definitions, meanwhile, have remained fairly stable.
And even as the total number of IPA endpoints goes beyond 32, the
number of GSI channels on a given EE that underly endpoints still
remains 32 or less.
Despite that, GSI v3.0 (which is used with IPA v5.0) extends the
number of channels (and events) it supports to be about 256, and as
a result, many GSI register definitions must change significantly.
To address this, we'll use the same "ipa_reg" mechanism to define
the GSI registers.
As a first step in generalizing the "ipa_reg" to also support GSI
registers, isolate the definitions of the "ipa_reg" and "ipa_regs"
structure types (and some supporting macros) into a new header file,
and remove the "ipa_" and "IPA_" from symbol names.
Separate the IPA register ID validity checking from the generic
check that a register ID is in range. Aside from that, this is
intended to have no functional effect on the code.
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move some static inline function definitions out of "gsi_reg.h" and
into "gsi.c", which is the only place they're used. Rename them so
their names identify the register they're associated with.
Move the gsi_channel_type enumerated type definition below the
offset and field definitions for the CH_C_CNTXT_0 register where
it's used.
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are seven GSI interrupt types that can be signaled by a single
GSI IRQ. These are represented in a bitmask, and the gsi_irq_type_id
enumerated type defines what each bit position represents.
Similarly, the global and general GSI interrupt types each has a set
of conditions it signals, and both types have an enumerated type
that defines which bit that represents each condition.
When used, these enumerated values are passed as an argument to BIT()
in *all* cases. So clean up the code a little bit by defining the
enumerated type values as one-bit masks rather than bit positions.
Rename gsi_general_id to be gsi_general_irq_id for consistency.
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
When checking the validity of an IPA register ID, compare it against
all possible ipa_reg_id values.
Rename the function ipa_reg_id_valid() to be specific about what's
being checked.
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Soon IPA v5.0+ will be supported, and when that happens we will be
able to enable support for the SDX65 (IPA v5.0), SM8450 (IPA v5.1),
and SM8550 (IPA v5.5).
Fix the comment about the GSI version used for IPA v3.1.
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The reg_addr field in the IPA structure is set but never used.
Get rid of it.
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Starting at IPA v4.11, the GSI_GENERIC_COMMAND GSI register got a
new PARAMS field. The code that encodes a value into that field
sets it unconditionally, which is wrong.
We currently only provide 0 as the field's value, so this error has
no real effect. Still, it's a bug, so let's fix it.
Fix an (unrelated) incorrect comment as well. Fields in the
ERROR_LOG GSI register actually *are* defined for IPA versions
prior to v3.5.1.
Fixes: fe68c43ce3 ("net: ipa: support enhanced channel flow control")
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add the following TC flower filter keys to lan966x for IS2:
- ipv4_addr (sip and dip)
- ipv6_addr (sip and dip)
- control (IPv4 fragments)
- portnum (tcp and udp port numbers)
- basic (L3 and L4 protocol)
- vlan (outer vlan tag info)
- tcp (tcp flags)
- ip (tos field)
As the parsing of these keys is similar between lan966x and sparx5, move
the code in a separate file to be shared by these 2 chips. And put the
specific parsing outside of the common functions.
Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David Howells says:
====================
rxrpc development
Here are some miscellaneous changes for rxrpc:
(1) Use consume_skb() rather than kfree_skb_reason().
(2) Fix unnecessary waking when poking and already-poked call.
(3) Add ack.rwind to the rxrpc_tx_ack tracepoint as this indicates how
many incoming DATA packets we're telling the peer that we are
currently willing to accept on this call.
(4) Reduce duplicate ACK transmission. We send ACKs to let the peer know
that we're increasing the receive window (ack.rwind) as we consume
packets locally. Normal ACK transmission is triggered in three places
and that leads to duplicates being sent.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Add IPsec offloading support for NFP3800. Include data
plane and control plane.
Data plane: add IPsec packet process flow in NFP3800
datapath (NFDk).
Control plane: add an algorithm support distinction flow
in xfrm hook function xdo_dev_state_add(), as NFP3800 has
a different set of IPsec algorithm support.
This matches existing support for the NFP6000/NFP4000 and
their NFD3 datapath.
In addition, fixup the md_bytes calculation for NFD3 datapath
to make sure the two datapahts are keept in sync.
Signed-off-by: Huanhuan Wang <huanhuan.wang@corigine.com>
Reviewed-by: Niklas Söderlund <niklas.soderlund@corigine.com>
Signed-off-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Link: https://lore.kernel.org/r/20230208091000.4139974-1-simon.horman@corigine.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Paolo Abeni says:
====================
net: introduce rps_default_mask
Real-time setups try hard to ensure proper isolation between time
critical applications and e.g. network processing performed by the
network stack in softirq and RPS is used to move the softirq
activity away from the isolated core.
If the network configuration is dynamic, with netns and devices
routinely created at run-time, enforcing the correct RPS setting
on each newly created device allowing to transient bad configuration
became complex.
Additionally, when multi-queue devices are involved, configuring rps
in user-space on each queue easily becomes very expensive, e.g.
some setups use veths with per cpu queues.
These series try to address the above, introducing a new
sysctl knob: rps_default_mask. The new sysctl entry allows
configuring a netns-wide RPS mask, to be enforced since receive
queue creation time without any fourther per device configuration
required.
Additionally, a simple self-test is introduced to check the
rps_default_mask behavior.
====================
Link: https://lore.kernel.org/r/cover.1675789134.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Ensure that RPS default mask changes take place on
all newly created netns/devices and don't affect
existing ones.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
If RPS is enabled, this allows configuring a default rps
mask, which is effective since receive queue creation time.
A default RPS mask allows the system admin to ensure proper
isolation, avoiding races at network namespace or device
creation time.
The default RPS mask is initially empty, and can be
modified via a newly added sysctl entry.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Will be used by the following patch to avoid code
duplication. No functional changes intended.
The only difference is that now flow_limit_cpu_sysctl() will
always compute the flow limit mask on each read operation,
even when read() will not return any byte to user-space.
Note that the new helper is placed under a new #ifdef at
the file start to better fit the usage in the later patch
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pull networking fixes from Paolo Abeni:
"Including fixes from can and ipsec subtrees.
Current release - regressions:
- sched: fix off by one in htb_activate_prios()
- eth: mana: fix accessing freed irq affinity_hint
- eth: ice: fix out-of-bounds KASAN warning in virtchnl
Current release - new code bugs:
- eth: mtk_eth_soc: enable special tag when any MAC uses DSA
Previous releases - always broken:
- core: fix sk->sk_txrehash default
- neigh: make sure used and confirmed times are valid
- mptcp: be careful on subflow status propagation on errors
- xfrm: prevent potential spectre v1 gadget in xfrm_xlate32_attr()
- phylink: move phy_device_free() to correctly release phy device
- eth: mlx5:
- fix crash unsetting rx-vlan-filter in switchdev mode
- fix hang on firmware reset
- serialize module cleanup with reload and remove"
* tag 'net-6.2-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (57 commits)
selftests: forwarding: lib: quote the sysctl values
net: mscc: ocelot: fix all IPv6 getting trapped to CPU when PTP timestamping is used
rds: rds_rm_zerocopy_callback() use list_first_entry()
net: txgbe: Update support email address
selftests: Fix failing VXLAN VNI filtering test
selftests: mptcp: stop tests earlier
selftests: mptcp: allow more slack for slow test-case
mptcp: be careful on subflow status propagation on errors
mptcp: fix locking for in-kernel listener creation
mptcp: fix locking for setsockopt corner-case
mptcp: do not wait for bare sockets' timeout
net: ethernet: mtk_eth_soc: fix DSA TX tag hwaccel for switch port 0
nfp: ethtool: fix the bug of setting unsupported port speed
txhash: fix sk->sk_txrehash default
net: ethernet: mtk_eth_soc: fix wrong parameters order in __xdp_rxq_info_reg()
net: ethernet: mtk_eth_soc: enable special tag when any MAC uses DSA
net: sched: sch: Fix off by one in htb_activate_prios()
igc: Add ndo_tx_timeout support
net: mana: Fix accessing freed irq affinity_hint
hv_netvsc: Allocate memory in netvsc_dma_map() with GFP_ATOMIC
...
Pull HID fixes from Benjamin Tissoires:
- fix potential infinite loop with a badly crafted HID device (Xin
Zhao)
- fix regression from 6.1 in USB logitech devices potentially making
their mouse wheel not working (Bastien Nocera)
- clean up in AMD sensors, which fixes a long time resume bug (Mario
Limonciello)
- few device small fixes and quirks
* tag 'for-linus-2023020901' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid:
HID: Ignore battery for ELAN touchscreen 29DF on HP
HID: amd_sfh: if no sensors are enabled, clean up
HID: logitech: Disable hi-res scrolling on USB
HID: core: Fix deadloop in hid_apply_multiplier.
HID: Ignore battery for Elan touchscreen on Asus TP420IA
HID: elecom: add support for TrackBall 056E:011C
Pull cifx fix from Steve French:
"Small fix for use after free"
* tag '6.2-rc8-smb3-client-fix' of git://git.samba.org/sfrench/cifs-2.6:
cifs: Fix use-after-free in rdata->read_into_pages()
While running this selftest which usually passes:
~/selftests/drivers/net/dsa# ./local_termination.sh eno0 swp0
TEST: swp0: Unicast IPv4 to primary MAC address [ OK ]
TEST: swp0: Unicast IPv4 to macvlan MAC address [ OK ]
TEST: swp0: Unicast IPv4 to unknown MAC address [ OK ]
TEST: swp0: Unicast IPv4 to unknown MAC address, promisc [ OK ]
TEST: swp0: Unicast IPv4 to unknown MAC address, allmulti [ OK ]
TEST: swp0: Multicast IPv4 to joined group [ OK ]
TEST: swp0: Multicast IPv4 to unknown group [ OK ]
TEST: swp0: Multicast IPv4 to unknown group, promisc [ OK ]
TEST: swp0: Multicast IPv4 to unknown group, allmulti [ OK ]
TEST: swp0: Multicast IPv6 to joined group [ OK ]
TEST: swp0: Multicast IPv6 to unknown group [ OK ]
TEST: swp0: Multicast IPv6 to unknown group, promisc [ OK ]
TEST: swp0: Multicast IPv6 to unknown group, allmulti [ OK ]
if I start PTP timestamping then run it again (debug prints added by me),
the unknown IPv6 MC traffic is seen by the CPU port even when it should
have been dropped:
~/selftests/drivers/net/dsa# ptp4l -i swp0 -2 -P -m
ptp4l[225.410]: selected /dev/ptp1 as PTP clock
[ 225.445746] mscc_felix 0000:00:00.5: ocelot_l2_ptp_trap_add: port 0 adding L2 PTP trap
[ 225.453815] mscc_felix 0000:00:00.5: ocelot_ipv4_ptp_trap_add: port 0 adding IPv4 PTP event trap
[ 225.462703] mscc_felix 0000:00:00.5: ocelot_ipv4_ptp_trap_add: port 0 adding IPv4 PTP general trap
[ 225.471768] mscc_felix 0000:00:00.5: ocelot_ipv6_ptp_trap_add: port 0 adding IPv6 PTP event trap
[ 225.480651] mscc_felix 0000:00:00.5: ocelot_ipv6_ptp_trap_add: port 0 adding IPv6 PTP general trap
ptp4l[225.488]: port 1: INITIALIZING to LISTENING on INIT_COMPLETE
ptp4l[225.488]: port 0: INITIALIZING to LISTENING on INIT_COMPLETE
^C
~/selftests/drivers/net/dsa# ./local_termination.sh eno0 swp0
TEST: swp0: Unicast IPv4 to primary MAC address [ OK ]
TEST: swp0: Unicast IPv4 to macvlan MAC address [ OK ]
TEST: swp0: Unicast IPv4 to unknown MAC address [ OK ]
TEST: swp0: Unicast IPv4 to unknown MAC address, promisc [ OK ]
TEST: swp0: Unicast IPv4 to unknown MAC address, allmulti [ OK ]
TEST: swp0: Multicast IPv4 to joined group [ OK ]
TEST: swp0: Multicast IPv4 to unknown group [ OK ]
TEST: swp0: Multicast IPv4 to unknown group, promisc [ OK ]
TEST: swp0: Multicast IPv4 to unknown group, allmulti [ OK ]
TEST: swp0: Multicast IPv6 to joined group [ OK ]
TEST: swp0: Multicast IPv6 to unknown group [FAIL]
reception succeeded, but should have failed
TEST: swp0: Multicast IPv6 to unknown group, promisc [ OK ]
TEST: swp0: Multicast IPv6 to unknown group, allmulti [ OK ]
The PGID_MCIPV6 is configured correctly to not flood to the CPU,
I checked that.
Furthermore, when I disable back PTP RX timestamping (ptp4l doesn't do
that when it exists), packets are RX filtered again as they should be:
~/selftests/drivers/net/dsa# hwstamp_ctl -i swp0 -r 0
[ 218.202854] mscc_felix 0000:00:00.5: ocelot_l2_ptp_trap_del: port 0 removing L2 PTP trap
[ 218.212656] mscc_felix 0000:00:00.5: ocelot_ipv4_ptp_trap_del: port 0 removing IPv4 PTP event trap
[ 218.222975] mscc_felix 0000:00:00.5: ocelot_ipv4_ptp_trap_del: port 0 removing IPv4 PTP general trap
[ 218.233133] mscc_felix 0000:00:00.5: ocelot_ipv6_ptp_trap_del: port 0 removing IPv6 PTP event trap
[ 218.242251] mscc_felix 0000:00:00.5: ocelot_ipv6_ptp_trap_del: port 0 removing IPv6 PTP general trap
current settings:
tx_type 1
rx_filter 12
new settings:
tx_type 1
rx_filter 0
~/selftests/drivers/net/dsa# ./local_termination.sh eno0 swp0
TEST: swp0: Unicast IPv4 to primary MAC address [ OK ]
TEST: swp0: Unicast IPv4 to macvlan MAC address [ OK ]
TEST: swp0: Unicast IPv4 to unknown MAC address [ OK ]
TEST: swp0: Unicast IPv4 to unknown MAC address, promisc [ OK ]
TEST: swp0: Unicast IPv4 to unknown MAC address, allmulti [ OK ]
TEST: swp0: Multicast IPv4 to joined group [ OK ]
TEST: swp0: Multicast IPv4 to unknown group [ OK ]
TEST: swp0: Multicast IPv4 to unknown group, promisc [ OK ]
TEST: swp0: Multicast IPv4 to unknown group, allmulti [ OK ]
TEST: swp0: Multicast IPv6 to joined group [ OK ]
TEST: swp0: Multicast IPv6 to unknown group [ OK ]
TEST: swp0: Multicast IPv6 to unknown group, promisc [ OK ]
TEST: swp0: Multicast IPv6 to unknown group, allmulti [ OK ]
So it's clear that something in the PTP RX trapping logic went wrong.
Looking a bit at the code, I can see that there are 4 typos, which
populate "ipv4" VCAP IS2 key filter fields for IPv6 keys.
VCAP IS2 keys of type OCELOT_VCAP_KEY_IPV4 and OCELOT_VCAP_KEY_IPV6 are
handled by is2_entry_set(). OCELOT_VCAP_KEY_IPV4 looks at
&filter->key.ipv4, and OCELOT_VCAP_KEY_IPV6 at &filter->key.ipv6.
Simply put, when we populate the wrong key field, &filter->key.ipv6
fields "proto.mask" and "proto.value" remain all zeroes (or "don't care").
So is2_entry_set() will enter the "else" of this "if" condition:
if (msk == 0xff && (val == IPPROTO_TCP || val == IPPROTO_UDP))
and proceed to ignore the "proto" field. The resulting rule will match
on all IPv6 traffic, trapping it to the CPU.
This is the reason why the local_termination.sh selftest sees it,
because control traps are stronger than the PGID_MCIPV6 used for
flooding (from the forwarding data path).
But the problem is in fact much deeper. We trap all IPv6 traffic to the
CPU, but if we're bridged, we set skb->offload_fwd_mark = 1, so software
forwarding will not take place and IPv6 traffic will never reach its
destination.
The fix is simple - correct the typos.
I was intentionally inaccurate in the commit message about the breakage
occurring when any PTP timestamping is enabled. In fact it only happens
when L4 timestamping is requested (HWTSTAMP_FILTER_PTP_V2_EVENT or
HWTSTAMP_FILTER_PTP_V2_L4_EVENT). But ptp4l requests a larger RX
timestamping filter than it needs for "-2": HWTSTAMP_FILTER_PTP_V2_EVENT.
I wanted people skimming through git logs to not think that the bug
doesn't affect them because they only use ptp4l in L2 mode.
Fixes: 96ca08c058 ("net: mscc: ocelot: set up traps for PTP packets")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Link: https://lore.kernel.org/r/20230207183117.1745754-1-vladimir.oltean@nxp.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Steffen Klassert says:
====================
ipsec 2023-02-08
1) Fix policy checks for nested IPsec tunnels when using
xfrm interfaces. From Benedict Wong.
2) Fix netlink message expression on 32=>64-bit
messages translators. From Anastasia Belova.
3) Prevent potential spectre v1 gadget in xfrm_xlate32_attr.
From Eric Dumazet.
4) Always consistently use time64_t in xfrm_timer_handler.
From Eric Dumazet.
5) Fix KCSAN reported bug: Multiple cpus can update use_time
at the same time. From Eric Dumazet.
6) Fix SCP copy from IPv4 to IPv6 on interfamily tunnel.
From Christian Hopps.
* tag 'ipsec-2023-02-08' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec:
xfrm: fix bug with DSCP copy to v6 from v4 tunnel
xfrm: annotate data-race around use_time
xfrm: consistently use time64_t in xfrm_timer_handler()
xfrm/compat: prevent potential spectre v1 gadget in xfrm_xlate32_attr()
xfrm: compat: change expression for switch in xfrm_xlate64
Fix XFRM-I support for nested ESP tunnels
====================
Link: https://lore.kernel.org/r/20230208114322.266510-1-steffen.klassert@secunet.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Marc Kleine-Budde says:
====================
can-next 2023-02-08
The 1st patch is by Oliver Hartkopp and cleans up the CAN_RAW's
raw_setsockopt() for CAN_RAW_FD_FRAMES.
The 2nd patch is by me and fixes the compilation if
CONFIG_CAN_CALC_BITTIMING is disabled. (Problem introduced in last
pull request to next-next.)
* tag 'linux-can-next-for-6.3-20230208' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next:
can: bittiming: can_calc_bittiming(): add missing parameter to no-op function
can: raw: use temp variable instead of rolling back config
====================
Link: https://lore.kernel.org/r/20230208210014.3169347-1-mkl@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Saeed Mahameed says:
====================
mlx5-next-netdev-deadlock
This series from Jiri solves a deadlock when removing a network namespace
with mlx5 devlink instance being in it.
The deadlock is between:
1) mlx5_ib->unregister_netdevice_notifier()
AND
2) mlx5_core->devlink_reload->cleanup_net()
To slove this introduced mlx5 netdev added/removed events to track uplink
netdev to be used for register_netdevice_notifier_dev_net() purposes.
* tag 'mlx5-next-netdev-deadlock' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux:
RDMA/mlx5: Track netdev to avoid deadlock during netdev notifier unregister
net/mlx5e: Propagate an internal event in case uplink netdev changes
net/mlx5e: Fix trap event handling
net/mlx5: Introduce CQE error syndrome
====================
Link: https://lore.kernel.org/r/20230208005626.72930-1-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When removing a network namespace with mlx5 devlink instance being in
it, following callchain is performed:
cleanup_net (takes down_read(&pernet_ops_rwsem)
devlink_pernet_pre_exit()
devlink_reload()
mlx5_devlink_reload_down()
mlx5_unload_one_devl_locked()
mlx5_detach_device()
del_adev()
mlx5r_remove()
__mlx5_ib_remove()
mlx5_ib_roce_cleanup()
mlx5_remove_netdev_notifier()
unregister_netdevice_notifier (takes down_write(&pernet_ops_rwsem)
This deadlocks.
Resolve this by converting to register_netdevice_notifier_dev_net()
which does not take pernet_ops_rwsem and moves the notifier block around
according to netdev it takes as arg.
Use previously introduced netdev added/removed events to track uplink
netdev to be used for register_netdevice_notifier_dev_net() purposes.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Whenever uplink netdev is set/cleared, propagate newly introduced event
to inform notifier blocks netdev was added/removed.
Move the set() helper to core.c from header, introduce clear() and
netdev_added_event_replay() helpers. The last one is going to be called
from rdma driver, so export it.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Current code does not return correct return value from event handler.
Fix it by returning NOTIFY_* and propagate err over newly introduce ctx
structure.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Saeed Mahameed says:
====================
mlx5 fixes 2023-02-07
This series provides bug fixes to mlx5 driver.
* tag 'mlx5-fixes-2023-02-07' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux:
net/mlx5: Serialize module cleanup with reload and remove
net/mlx5: fw_tracer, Zero consumer index when reloading the tracer
net/mlx5: fw_tracer, Clear load bit when freeing string DBs buffers
net/mlx5: Expose SF firmware pages counter
net/mlx5: Store page counters in a single array
net/mlx5e: IPoIB, Show unknown speed instead of error
net/mlx5e: Fix crash unsetting rx-vlan-filter in switchdev mode
net/mlx5: Bridge, fix ageing of peer FDB entries
net/mlx5: DR, Fix potential race in dr_rule_create_rule_nic
net/mlx5e: Update rx ring hw mtu upon each rx-fcs flag change
====================
Link: https://lore.kernel.org/r/20230208030302.95378-1-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In commit 286c0e09e8 ("can: bittiming: can_changelink() pass extack
down callstack") a new parameter was added to can_calc_bittiming(),
however the static inline no-op (which is used if
CONFIG_CAN_CALC_BITTIMING is disabled) wasn't converted.
Add the new parameter to the static inline no-op of
can_calc_bittiming().
Fixes: 286c0e09e8 ("can: bittiming: can_changelink() pass extack down callstack")
Reported-by: kernel test robot <lkp@intel.com>
Link: https://lore.kernel.org/20230207201734.2905618-1-mkl@pengutronix.de
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Vladimir Oltean says:
====================
taprio automatic queueMaxSDU and new TXQ selection procedure
This patch set addresses 2 design limitations in the taprio software scheduler:
1. Software scheduling fundamentally prioritizes traffic incorrectly,
in a way which was inspired from Intel igb/igc drivers and does not
follow the inputs user space gives (traffic classes and TC to TXQ
mapping). Patch 05/15 handles this, 01/15 - 04/15 are preparations
for this work.
2. Software scheduling assumes that the gate for a traffic class closes
as soon as the next interval begins. But this isn't true.
If consecutive schedule entries have that traffic class gate open,
there is no "gate close" event and taprio should keep dequeuing from
that TC without interruptions. Patches 06/15 - 15/15 handle this.
Patch 10/15 is a generic Qdisc change required for this to work.
Future development directions which depend on this patch set are:
- Propagating the automatic queueMaxSDU calculation down to offloading
device drivers, instead of letting them calculate this, as
vsc9959_tas_guard_bands_update() does today.
- A software data path for tc-taprio with preemptible traffic and
Hold/Release events.
v1 at:
https://patchwork.kernel.org/project/netdevbpf/cover/20230128010719.2182346-1-vladimir.oltean@nxp.com/
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Improve commit 497cc00224 ("taprio: Handle short intervals and large
packets") to only perform segmentation when skb->len exceeds what
taprio_dequeue() expects.
In practice, this will make the biggest difference when a traffic class
gate is always open in the schedule. This is because the max_frm_len
will be U32_MAX, and such large skb->len values as Kurt reported will be
sent just fine unsegmented.
What I don't seem to know how to handle is how to make sure that the
segmented skbs themselves are smaller than the maximum frame size given
by the current queueMaxSDU[tc]. Nonetheless, we still need to drop
those, otherwise the Qdisc will hang.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>