Deletion of the last rule referencing a given idletimer may happen at
the same time as a read of its file in sysfs:
| ======================================================
| WARNING: possible circular locking dependency detected
| 6.12.0-rc7-01692-g5e9a28f41134-dirty #594 Not tainted
| ------------------------------------------------------
| iptables/3303 is trying to acquire lock:
| ffff8881057e04b8 (kn->active#48){++++}-{0:0}, at: __kernfs_remove+0x20
|
| but task is already holding lock:
| ffffffffa0249068 (list_mutex){+.+.}-{3:3}, at: idletimer_tg_destroy_v]
|
| which lock already depends on the new lock.
A simple reproducer is:
| #!/bin/bash
|
| while true; do
| iptables -A INPUT -i foo -j IDLETIMER --timeout 10 --label "testme"
| iptables -D INPUT -i foo -j IDLETIMER --timeout 10 --label "testme"
| done &
| while true; do
| cat /sys/class/xt_idletimer/timers/testme >/dev/null
| done
Avoid this by freeing list_mutex right after deleting the element from
the list, then continuing with the teardown.
Fixes: 0902b469bd ("netfilter: xtables: idletimer target implementation")
Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
On some systems, neighbor discoveries from ns1 for fec0:42::1 (i.e., the
martian trap address) would happen at the wrong time and cause
false-negative test result.
Problem analysis also discovered that IPv6 martian ping test was broken
in that sent neighbor discoveries, not echo requests were inadvertently
trapped
Avoid the race condition by introducing the neighbors to each other
upfront. Also pin down the firewall rules to matching on echo requests
only.
Fixes: efb056e5f1 ("netfilter: ip6t_rpfilter: Fix regression with VRF interfaces")
Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for net:
1) Fix esoteric undefined behaviour due to uninitialized stack access
in ip_vs_protocol_init(), from Jinghao Jia.
2) Fix iptables xt_LED slab-out-of-bounds due to incorrect sanitization
of the led string identifier, reported by syzbot. Patch from
Dmitry Antipov.
3) Remove WARN_ON_ONCE reachable from userspace to check for the maximum
cgroup level, nft_socket cgroup matching is restricted to 255 levels,
but cgroups allow for INT_MAX levels by default. Reported by syzbot.
4) Fix nft_inner incorrect use of percpu area to store tunnel parser
context with softirqs, resulting in inconsistent inner header
offsets that could lead to bogus rule mismatches, reported by syzbot.
5) Grab module reference on ipset core while requesting set type modules,
otherwise kernel crash is possible by removing ipset core module,
patch from Phil Sutter.
6) Fix possible double-free in nft_hash garbage collector due to unstable
walk interator that can provide twice the same element. Use a sequence
number to skip expired/dead elements that have been already scheduled
for removal. Based on patch from Laurent Fasnach
netfilter pull request 24-12-05
* tag 'nf-24-12-05' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
netfilter: nft_set_hash: skip duplicated elements pending gc run
netfilter: ipset: Hold module reference while requesting a module
netfilter: nft_inner: incorrect percpu area handling under softirq
netfilter: nft_socket: remove WARN_ON_ONCE on maximum cgroup level
netfilter: x_tables: fix LED ID check in led_tg_check()
ipvs: fix UB due to uninitialized stack access in ip_vs_protocol_init()
====================
Link: https://patch.msgid.link/20241205002854.162490-1-pablo@netfilter.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Konstantin Shkolnyy says:
====================
vsock/test: fix wrong setsockopt() parameters
Parameters were created using wrong C types, which caused them to be of
wrong size on some architectures, causing problems.
The problem with SO_RCVLOWAT was found on s390 (big endian), while x86-64
didn't show it. After the fix, all tests pass on s390.
Then Stefano Garzarella pointed out that SO_VM_SOCKETS_* calls might have
a similar problem, which turned out to be true, hence, the second patch.
Changes for v8:
- Fix whitespace warnings from "checkpatch.pl --strict"
- Add maintainers to Cc:
Changes for v7:
- Rebase on top of https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git
- Add the "net" tags to the subjects
Changes for v6:
- rework the patch #3 to avoid creating a new file for new functions,
and exclude vsock_perf from calling the new functions.
- add "Reviewed-by:" to the patch #2.
Changes for v5:
- in the patch #2 replace the introduced uint64_t with unsigned long long
to match documentation
- add a patch #3 that verifies every setsockopt() call.
Changes for v4:
- add "Reviewed-by:" to the first patch, and add a second patch fixing
SO_VM_SOCKETS_* calls, which depends on the first one (hence, it's now
a patch series.)
Changes for v3:
- fix the same problem in vsock_perf and update commit message
Changes for v2:
- add "Fixes:" lines to the commit message
====================
Link: https://patch.msgid.link/20241203150656.287028-1-kshk@linux.ibm.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Replace setsockopt() calls with calls to functions that follow
setsockopt() with getsockopt() and check that the returned value and its
size are the same as have been set. (Except in vsock_perf.)
Signed-off-by: Konstantin Shkolnyy <kshk@linux.ibm.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Change parameters of SO_VM_SOCKETS_* to unsigned long long as documented
in the vm_sockets.h, because the corresponding kernel code requires them
to be at least 64-bit, no matter what architecture. Otherwise they are
too small on 32-bit machines.
Fixes: 5c338112e4 ("test/vsock: rework message bounds test")
Fixes: 685a21c314 ("test/vsock: add big message test")
Fixes: 542e893fba ("vsock/test: two tests to check credit update logic")
Fixes: 8abbffd27c ("test/vsock: vsock_perf utility")
Signed-off-by: Konstantin Shkolnyy <kshk@linux.ibm.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
This happens on 64-bit big-endian machines.
SO_RCVLOWAT requires an int parameter. However, instead of int, the test
uses unsigned long in one place and size_t in another. Both are 8 bytes
long on 64-bit machines. The kernel, having received the 8 bytes, doesn't
test for the exact size of the parameter, it only cares that it's >=
sizeof(int), and casts the 4 lower-addressed bytes to an int, which, on
a big-endian machine, contains 0. 0 doesn't trigger an error, SO_RCVLOWAT
returns with success and the socket stays with the default SO_RCVLOWAT = 1,
which results in vsock_test failures, while vsock_perf doesn't even notice
that it's failed to change it.
Fixes: b1346338fb ("vsock_test: POLLIN + SO_RCVLOWAT test")
Fixes: 542e893fba ("vsock/test: two tests to check credit update logic")
Fixes: 8abbffd27c ("test/vsock: vsock_perf utility")
Signed-off-by: Konstantin Shkolnyy <kshk@linux.ibm.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Tony Nguyen says:
====================
Intel Wired LAN Driver Updates 2024-12-03 (ice, idpf, ixgbe, ixgbevf, igb)
This series contains updates to ice, idpf, ixgbe, ixgbevf, and igb
drivers.
For ice:
Arkadiusz corrects search for determining whether PHY clock recovery is
supported on the device.
Przemyslaw corrects mask used for PHY timestamps on ETH56G devices.
Wojciech adds missing virtchnl ops which caused NULL pointer
dereference.
Marcin fixes VLAN filter settings for uplink VSI in switchdev mode.
For idpf:
Josh restores setting of completion tag for empty buffers.
For ixgbevf:
Jake removes incorrect initialization/support of IPSEC for mailbox
version 1.5.
For ixgbe:
Jake rewords and downgrades misleading message when negotiation
of VF mailbox version is not supported.
Tore Amundsen corrects value for BASE-BX10 capability.
For igb:
Yuan Can adds proper teardown on failed pci_register_driver() call.
* '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue:
igb: Fix potential invalid memory access in igb_init_module()
ixgbe: Correct BASE-BX10 compliance code
ixgbe: downgrade logging of unsupported VF API version to debug
ixgbevf: stop attempting IPSEC offload on Mailbox API 1.5
idpf: set completion tag for "empty" bufs associated with a packet
ice: Fix VLAN pruning in switchdev mode
ice: Fix NULL pointer dereference in switchdev
ice: fix PHY timestamp extraction for ETH56G
ice: fix PHY Clock Recovery availability check
====================
Link: https://patch.msgid.link/20241203215521.1646668-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Previously a workaround was added to avoid syndrome 0xcdb051. It is
triggered when offload a rule with tunnel encapsulation, and
forwarding to another table, but not matching on the internal port in
firmware steering mode. The original workaround skips internal tunnel
port logic, which is not correct as not all cases are considered. As
an example, if vlan is configured on the uplink port, traffic can't
pass because vlan header is not added with this workaround. Besides,
there is no such issue for software steering. So, this patch removes
that, and returns error directly if trying to offload such rule for
firmware steering.
Fixes: 06b4eac9c4 ("net/mlx5e: Don't offload internal port if filter device is out device")
Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Tested-by: Frode Nordahl <frode.nordahl@canonical.com>
Reviewed-by: Chris Mi <cmi@nvidia.com>
Reviewed-by: Ariel Levkovich <lariel@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20241203204920.232744-7-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In a multi-PF netdev, each traffic channel creates its own resources
against a specific PF.
In the cited commit, where this support was added, the channel_param
logic was mistakenly kept unchanged, so it always used the primary PF
which is found at priv->mdev.
In this patch we fix this by moving the logic to be per-channel, and
passing the correct mdev instance.
This bug happened to be usually harmless, as the resulting cparam
structures would be the same for all channels, due to identical FW logic
and decisions.
However, in some use cases, like fwreset, this gets broken.
This could lead to different symptoms. Example:
Error cqe on cqn 0x428, ci 0x0, qn 0x10a9, opcode 0xe, syndrome 0x4,
vendor syndrome 0x32
Fixes: e4f9686bde ("net/mlx5e: Let channels be SD-aware")
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Lama Kayal <lkayal@nvidia.com>
Reviewed-by: Gal Pressman <gal@nvidia.com>
Link: https://patch.msgid.link/20241203204920.232744-6-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Fix the mentioned commit change for MPV mode, since in MPV mode the IB
device is shared between different core devices, so under this change
when moving both devices simultaneously to switchdev mode the IB device
removal and re-addition can race with itself causing unexpected behavior.
In such case do rescan_drivers() only once in order to add the ethernet
representor auxiliary device, and skip adding and removing IB devices.
Fixes: ab85ebf437 ("net/mlx5: E-switch, refactor eswitch mode change")
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20241203204920.232744-5-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
David Wei says:
====================
bnxt_en: support header page pool in queue API
Commit 7ed816be35 ("eth: bnxt: use page pool for head frags") added a
separate page pool for header frags. Now, frags are allocated from this
header page pool e.g. rxr->tpa_info.data.
The queue API did not properly handle rxr->tpa_info and so using the
queue API to i.e. reset any queues will result in pages being returned
to the incorrect page pool, causing inflight != 0 warnings.
Fix this bug by properly allocating/freeing tpa_info and copying/freeing
head_pool in the queue API implementation.
The 1st patch is a prep patch that refactors helpers out to be used by
the implementation patch later.
The 2nd patch is a drive-by refactor. Happy to take it out and re-send
to net-next if there are any objections.
The 3rd patch is the implementation patch that will properly alloc/free
rxr->tpa_info.
====================
Link: https://patch.msgid.link/20241204041022.56512-1-dw@davidwei.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Commit 7ed816be35 ("eth: bnxt: use page pool for head frags") added a
page pool for header frags, which may be distinct from the existing pool
for the aggregation ring. Prior to this change, frags used in the TPA
ring rx_tpa were allocated from system memory e.g. napi_alloc_frag()
meaning their lifetimes were not associated with a page pool. They can
be returned at any time and so the queue API did not alloc or free
rx_tpa.
But now frags come from a separate head_pool which may be different to
page_pool. Without allocating and freeing rx_tpa, frags allocated from
the old head_pool may be returned to a different new head_pool which
causes a mismatch between the pp hold/release count.
Fix this problem by properly freeing and allocating rx_tpa in the queue
API implementation.
Signed-off-by: David Wei <dw@davidwei.uk>
Reviewed-by: Michael Chan <michael.chan@broadcom.com>
Link: https://patch.msgid.link/20241204041022.56512-4-dw@davidwei.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Refactor bnxt_rx_ring_info->tpa_info operations into helpers that work
on a single tpa_info in prep for queue API using them.
There are 2 pairs of operations:
* bnxt_alloc_one_tpa_info()
* bnxt_free_one_tpa_info()
These alloc/free the tpa_info array itself.
* bnxt_alloc_one_tpa_info_data()
* bnxt_free_one_tpa_info_data()
These alloc/free the frags stored in tpa_info array.
Reviewed-by: Somnath Kotur <somnath.kotur@broadcom.com>
Signed-off-by: David Wei <dw@davidwei.uk>
Reviewed-by: Michael Chan <michael.chan@broadcom.com>
Link: https://patch.msgid.link/20241204041022.56512-2-dw@davidwei.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The driver is currently using an ACL key block that is not supported by
Spectrum-4. This works because the driver is only using a single field
from this key block which is located in the same offset in the
equivalent Spectrum-4 key block.
The issue was discovered when the firmware started rejecting the use of
the unsupported key block. The change has been reverted to avoid
breaking users that only update their firmware.
Nonetheless, fix the issue by using the correct key block.
Fixes: 07ff135958 ("mlxsw: Introduce flex key elements for Spectrum-4")
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Link: https://patch.msgid.link/35e72c97bdd3bc414fb8e4d747e5fb5d26c29658.1733237440.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
A bitset without mask in a _SET request means we want exactly the bits in
the bitset to be set. This works correctly for compact format but when
verbose format is parsed, ethnl_update_bitset32_verbose() only sets the
bits present in the request bitset but does not clear the rest. The commit
6699170376 ("ethtool: fix application of verbose no_mask bitset") fixes
this issue by clearing the whole target bitmap before we start iterating.
The solution proposed brought an issue with the behavior of the mod
variable. As the bitset is always cleared the old value will always
differ to the new value.
Fix it by adding a new function to compare bitmaps and a temporary variable
which save the state of the old bitmap.
Fixes: 6699170376 ("ethtool: fix application of verbose no_mask bitset")
Signed-off-by: Kory Maincent <kory.maincent@bootlin.com>
Link: https://patch.msgid.link/20241202153358.1142095-1-kory.maincent@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
rhashtable does not provide stable walk, duplicated elements are
possible in case of resizing. I considered that checking for errors when
calling rhashtable_walk_next() was sufficient to detect the resizing.
However, rhashtable_walk_next() returns -EAGAIN only at the end of the
iteration, which is too late, because a gc work containing duplicated
elements could have been already scheduled for removal to the worker.
Add a u32 gc worker sequence number per set, bump it on every workqueue
run. Annotate gc worker sequence number on the expired element. Use it
to skip those already seen in this gc workqueue run.
Note that this new field is never reset in case gc transaction fails, so
next gc worker run on the expired element overrides it. Wraparound of gc
worker sequence number should not be an issue with stale gc worker
sequence number in the element, that would just postpone the element
removal in one gc run.
Note that it is not possible to use flags to annotate that element is
pending gc run to detect duplicates, given that gc transaction can be
invalidated in case of update from the control plane, therefore, not
allowing to clear such flag.
On x86_64, pahole reports no changes in the size of nft_rhash_elem.
Fixes: f6c383b8c3 ("netfilter: nf_tables: adapt set backend to use GC transaction API")
Reported-by: Laurent Fasnacht <laurent.fasnacht@proton.ch>
Tested-by: Laurent Fasnacht <laurent.fasnacht@proton.ch>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
User space may unload ip_set.ko while it is itself requesting a set type
backend module, leading to a kernel crash. The race condition may be
provoked by inserting an mdelay() right after the nfnl_unlock() call.
Fixes: a7b4f989a6 ("netfilter: ipset: IP set core support")
Signed-off-by: Phil Sutter <phil@nwl.cc>
Acked-by: Jozsef Kadlecsik <kadlec@netfilter.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Changes to sch->q.qlen around qdisc_tree_reduce_backlog() need to happen
_before_ a call to said function because otherwise it may fail to notify
parent qdiscs when the child is about to become empty.
Signed-off-by: Lion Ackermann <nnamrec@gmail.com>
Acked-by: Toke Høiland-Jørgensen <toke@toke.dk>
Signed-off-by: David S. Miller <davem@davemloft.net>
When matching erspan_opt in cls_flower, only the (version, dir, hwid)
fields are relevant. However, in fl_set_erspan_opt() it initializes
all bits of erspan_opt and its mask to 1. This inadvertently requires
packets to match not only the (version, dir, hwid) fields but also the
other fields that are unexpectedly set to 1.
This patch resolves the issue by ensuring that only the (version, dir,
hwid) fields are configured in fl_set_erspan_opt(), leaving the other
fields to 0 in erspan_opt.
Fixes: 79b1011cb3 ("net: sched: allow flower to match erspan options")
Reported-by: Shuang Li <shuali@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make napi_hash_lock IRQ safe. It is used during the control path, and is
taken and released in napi_hash_add and napi_hash_del, which will
typically be called by calls to napi_enable and napi_disable.
This change avoids a deadlock in pcnet32 (and other any other drivers
which follow the same pattern):
CPU 0:
pcnet32_open
spin_lock_irqsave(&lp->lock, ...)
napi_enable
napi_hash_add <- before this executes, CPU 1 proceeds
spin_lock(napi_hash_lock)
[...]
spin_unlock_irqrestore(&lp->lock, flags);
CPU 1:
pcnet32_close
napi_disable
napi_hash_del
spin_lock(napi_hash_lock)
< INTERRUPT >
pcnet32_interrupt
spin_lock(lp->lock) <- DEADLOCK
Changing the napi_hash_lock to be IRQ safe prevents the IRQ from firing
on CPU 1 until napi_hash_lock is released, preventing the deadlock.
Cc: stable@vger.kernel.org
Fixes: 86e25f40aa ("net: napi: Add napi_config")
Reported-by: Guenter Roeck <linux@roeck-us.net>
Closes: https://lore.kernel.org/netdev/85dd4590-ea6b-427d-876a-1d8559c7ad82@roeck-us.net/
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Joe Damato <jdamato@fastly.com>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20241202182103.363038-1-jdamato@fastly.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Softirq can interrupt ongoing packet from process context that is
walking over the percpu area that contains inner header offsets.
Disable bh and perform three checks before restoring the percpu inner
header offsets to validate that the percpu area is valid for this
skbuff:
1) If the NFT_PKTINFO_INNER_FULL flag is set on, then this skbuff
has already been parsed before for inner header fetching to
register.
2) Validate that the percpu area refers to this skbuff using the
skbuff pointer as a cookie. If there is a cookie mismatch, then
this skbuff needs to be parsed again.
3) Finally, validate if the percpu area refers to this tunnel type.
Only after these three checks the percpu area is restored to a on-stack
copy and bh is enabled again.
After inner header fetching, the on-stack copy is stored back to the
percpu area.
Fixes: 3a07327d10 ("netfilter: nft_inner: support for inner tunnel header matching")
Reported-by: syzbot+84d0441b9860f0d63285@syzkaller.appspotmail.com
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The pci_register_driver() can fail and when this happened, the dca_notifier
needs to be unregistered, otherwise the dca_notifier can be called when
igb fails to install, resulting to invalid memory access.
Fixes: bbd98fe48a ("igb: Fix DCA errors and do not use context index for 82576")
Signed-off-by: Yuan Can <yuancan@huawei.com>
Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
SFF-8472 (section 5.4 Transceiver Compliance Codes) defines bit 6 as
BASE-BX10. Bit 6 means a value of 0x40 (decimal 64).
The current value in the source code is 0x64, which appears to be a
mix-up of hex and decimal values. A value of 0x64 (binary 01100100)
incorrectly sets bit 2 (1000BASE-CX) and bit 5 (100BASE-FX) as well.
Fixes: 1b43e0d20f ("ixgbe: Add 1000BASE-BX support")
Signed-off-by: Tore Amundsen <tore@amundsen.org>
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
Acked-by: Ernesto Castellotti <ernesto@castellotti.net>
Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
The ixgbe PF driver logs an info message when a VF attempts to negotiate an
API version which it does not support:
VF 0 requested invalid api version 6
The ixgbevf driver attempts to load with mailbox API v1.5, which is
required for best compatibility with other hosts such as the ESX VMWare PF.
The Linux PF only supports API v1.4, and does not currently have support
for the v1.5 API.
The logged message can confuse users, as the v1.5 API is valid, but just
happens to not currently be supported by the Linux PF.
Downgrade the info message to a debug message, and fix the language to
use 'unsupported' instead of 'invalid' to improve message clarity.
Long term, we should investigate whether the improvements in the v1.5 API
make sense for the Linux PF, and if so implement them properly. This may
require yet another API version to resolve issues with negotiating IPSEC
offload support.
Fixes: 339f289641 ("ixgbevf: Add support for new mailbox communication between PF and VF")
Reported-by: Yifei Liu <yifei.l.liu@oracle.com>
Link: https://lore.kernel.org/intel-wired-lan/20240301235837.3741422-1-yifei.l.liu@oracle.com/
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Commit 339f289641 ("ixgbevf: Add support for new mailbox communication
between PF and VF") added support for v1.5 of the PF to VF mailbox
communication API. This commit mistakenly enabled IPSEC offload for API
v1.5.
No implementation of the v1.5 API has support for IPSEC offload. This
offload is only supported by the Linux PF as mailbox API v1.4. In fact, the
v1.5 API is not implemented in any Linux PF.
Attempting to enable IPSEC offload on a PF which supports v1.5 API will not
work. Only the Linux upstream ixgbe and ixgbevf support IPSEC offload, and
only as part of the v1.4 API.
Fix the ixgbevf Linux driver to stop attempting IPSEC offload when
the mailbox API does not support it.
The existing API design choice makes it difficult to support future API
versions, as other non-Linux hosts do not implement IPSEC offload. If we
add support for v1.5 to the Linux PF, then we lose support for IPSEC
offload.
A full solution likely requires a new mailbox API with a proper negotiation
to check that IPSEC is actually supported by the host.
Fixes: 339f289641 ("ixgbevf: Add support for new mailbox communication between PF and VF")
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Commit d9028db618 ("idpf: convert to libeth Tx buffer completion")
inadvertently removed code that was necessary for the tx buffer cleaning
routine to iterate over all buffers associated with a packet.
When a frag is too large for a single data descriptor, it will be split
across multiple data descriptors. This means the frag will span multiple
buffers in the buffer ring in order to keep the descriptor and buffer
ring indexes aligned. The buffer entries in the ring are technically
empty and no cleaning actions need to be performed. These empty buffers
can precede other frags associated with the same packet. I.e. a single
packet on the buffer ring can look like:
buf[0]=skb0.frag0
buf[1]=skb0.frag1
buf[2]=empty
buf[3]=skb0.frag2
The cleaning routine iterates through these buffers based on a matching
completion tag. If the completion tag is not set for buf2, the loop will
end prematurely. Frag2 will be left uncleaned and next_to_clean will be
left pointing to the end of packet, which will break the cleaning logic
for subsequent cleans. This consequently leads to tx timeouts.
Assign the empty bufs the same completion tag for the packet to ensure
the cleaning routine iterates over all of the buffers associated with
the packet.
Fixes: d9028db618 ("idpf: convert to libeth Tx buffer completion")
Signed-off-by: Joshua Hay <joshua.a.hay@intel.com>
Acked-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Madhu chittim <madhu.chittim@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Krishneil Singh <krishneil.k.singh@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
In switchdev mode the uplink VSI should receive all unmatched packets,
including VLANs. Therefore, VLAN pruning should be disabled if uplink is
in switchdev mode. It is already being done in ice_eswitch_setup_env(),
however the addition of ice_up() in commit 44ba608db5 ("ice: do
switchdev slow-path Rx using PF VSI") caused VLAN pruning to be
re-enabled after disabling it.
Add a check to ice_set_vlan_filtering_features() to ensure VLAN
filtering will not be enabled if uplink is in switchdev mode. Note that
ice_is_eswitch_mode_switchdev() is being used instead of
ice_is_switchdev_running(), as the latter would only return true after
the whole switchdev setup completes.
Fixes: 44ba608db5 ("ice: do switchdev slow-path Rx using PF VSI")
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Signed-off-by: Marcin Szycik <marcin.szycik@linux.intel.com>
Tested-by: Priya Singh <priyax.singh@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
To check if PHY Clock Recovery mechanic is available for a device, there
is a need to verify if given PHY is available within the netlist, but the
netlist node type used for the search is wrong, also the search context
shall be specified.
Modify the search function to allow specifying the context in the
search.
Use the PHY node type instead of CLOCK CONTROLLER type, also use proper
search context which for PHY search is PORT, as defined in E810
Datasheet [1] ('3.3.8.2.4 Node Part Number and Node Options (0x0003)' and
'Table 3-105. Program Topology Device NVM Admin Command').
[1] https://cdrdv2.intel.com/v1/dl/getContent/613875?explicitVersion=true
Fixes: 91e43ca009 ("ice: fix linking when CONFIG_PTP_1588_CLOCK=n")
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>
Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Currently rtnl_link_get_net_ifla() gets called twice when we create
peer devices, once in rtnl_add_peer_net() and once in each ->newlink()
implementation.
This looks safer, however, it leads to a classic Time-of-Check to
Time-of-Use (TOCTOU) bug since IFLA_NET_NS_PID is very dynamic. And
because of the lack of checking error pointer of the second call, it
also leads to a kernel crash as reported by syzbot.
Fix this by getting rid of the second call, which already becomes
redudant after Kuniyuki's work. We have to propagate the result of the
first rtnl_link_get_net_ifla() down to each ->newlink().
Reported-by: syzbot+21ba4d5adff0b6a7cfc6@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=21ba4d5adff0b6a7cfc6
Fixes: 0eb87b02a7 ("veth: Set VETH_INFO_PEER to veth_link_ops.peer_type.")
Fixes: 6b84e558e9 ("vxcan: Set VXCAN_INFO_PEER to vxcan_link_ops.peer_type.")
Fixes: fefd5d0821 ("netkit: Set IFLA_NETKIT_PEER_INFO to netkit_link_ops.peer_type.")
Cc: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20241129212519.825567-1-xiyou.wangcong@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Commit 43645ce03e ("qed: Populate nvm image attribute shadow.")
added support for populating flash image attributes, notably
"num_images". However, some cards were not able to return this
information. In such cases, the driver would return EINVAL, causing the
driver to exit.
Add check to return EOPNOTSUPP instead of EINVAL when the card is not
able to return these information. The caller function already handles
EOPNOTSUPP without error.
Fixes: 43645ce03e ("qed: Populate nvm image attribute shadow.")
Co-developed-by: Florian Forestier <florian@forestier.re>
Signed-off-by: Florian Forestier <florian@forestier.re>
Signed-off-by: Louis Leseur <louis.leseur@gmail.com>
Link: https://patch.msgid.link/20241128083633.26431-1-louis.leseur@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
We encountered a LGR/link use-after-free issue, which manifested as
the LGR/link refcnt reaching 0 early and entering the clear process,
making resource access unsafe.
refcount_t: addition on 0; use-after-free.
WARNING: CPU: 14 PID: 107447 at lib/refcount.c:25 refcount_warn_saturate+0x9c/0x140
Workqueue: events smc_lgr_terminate_work [smc]
Call trace:
refcount_warn_saturate+0x9c/0x140
__smc_lgr_terminate.part.45+0x2a8/0x370 [smc]
smc_lgr_terminate_work+0x28/0x30 [smc]
process_one_work+0x1b8/0x420
worker_thread+0x158/0x510
kthread+0x114/0x118
or
refcount_t: underflow; use-after-free.
WARNING: CPU: 6 PID: 93140 at lib/refcount.c:28 refcount_warn_saturate+0xf0/0x140
Workqueue: smc_hs_wq smc_listen_work [smc]
Call trace:
refcount_warn_saturate+0xf0/0x140
smcr_link_put+0x1cc/0x1d8 [smc]
smc_conn_free+0x110/0x1b0 [smc]
smc_conn_abort+0x50/0x60 [smc]
smc_listen_find_device+0x75c/0x790 [smc]
smc_listen_work+0x368/0x8a0 [smc]
process_one_work+0x1b8/0x420
worker_thread+0x158/0x510
kthread+0x114/0x118
It is caused by repeated release of LGR/link refcnt. One suspect is that
smc_conn_free() is called repeatedly because some smc_conn_free() from
server listening path are not protected by sock lock.
e.g.
Calls under socklock | smc_listen_work
-------------------------------------------------------
lock_sock(sk) | smc_conn_abort
smc_conn_free | \- smc_conn_free
\- smcr_link_put | \- smcr_link_put (duplicated)
release_sock(sk)
So here add sock lock protection in smc_listen_work() path, making it
exclusive with other connection operations.
Fixes: 3b2dec2603 ("net/smc: restructure client and server code in af_smc")
Co-developed-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>
Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>
Co-developed-by: Kai <KaiShen@linux.alibaba.com>
Signed-off-by: Kai <KaiShen@linux.alibaba.com>
Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
We encountered a warning that close_work was canceled before
initialization.
WARNING: CPU: 7 PID: 111103 at kernel/workqueue.c:3047 __flush_work+0x19e/0x1b0
Workqueue: events smc_lgr_terminate_work [smc]
RIP: 0010:__flush_work+0x19e/0x1b0
Call Trace:
? __wake_up_common+0x7a/0x190
? work_busy+0x80/0x80
__cancel_work_timer+0xe3/0x160
smc_close_cancel_work+0x1a/0x70 [smc]
smc_close_active_abort+0x207/0x360 [smc]
__smc_lgr_terminate.part.38+0xc8/0x180 [smc]
process_one_work+0x19e/0x340
worker_thread+0x30/0x370
? process_one_work+0x340/0x340
kthread+0x117/0x130
? __kthread_cancel_work+0x50/0x50
ret_from_fork+0x22/0x30
This is because when smc_close_cancel_work is triggered, e.g. the RDMA
driver is rmmod and the LGR is terminated, the conn->close_work is
flushed before initialization, resulting in WARN_ON(!work->func).
__smc_lgr_terminate | smc_connect_{rdma|ism}
-------------------------------------------------------------
| smc_conn_create
| \- smc_lgr_register_conn
for conn in lgr->conns_all |
\- smc_conn_kill |
\- smc_close_active_abort |
\- smc_close_cancel_work |
\- cancel_work_sync |
\- __flush_work |
(close_work) |
| smc_close_init
| \- INIT_WORK(&close_work)
So fix this by initializing close_work before establishing the
connection.
Fixes: 46c28dbd4c ("net/smc: no socket state changes in tasklet context")
Fixes: 413498440e ("net/smc: add SMC-D support in af_smc")
Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com>
Reviewed-by: Alexandra Winter <wintera@linux.ibm.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>