Florian Westphal says:
====================
netfilter updates for next
First 5 patches, from Phil Sutter, clean up nftables dumpers to
use the context buffer in the netlink_callback structure rather
than a kmalloc'd buffer.
Patch 6, from myself, zaps dead code and replaces the helper function
with a small inlined helper.
Patch 7, also from myself, removes another pr_debug and replaces it
with the existing nf_log-based debug helpers.
Last patch, from George Guo, gets nft_table comments back in
sync with the structure members.
* tag 'nf-next-23-10-10' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
netfilter: cleanup struct nft_table
netfilter: conntrack: prefer tcp_error_log to pr_debug
netfilter: conntrack: simplify nf_conntrack_alter_reply
netfilter: nf_tables: Don't allocate nft_rule_dump_ctx
netfilter: nf_tables: Carry s_idx in nft_rule_dump_ctx
netfilter: nf_tables: Carry reset flag in nft_rule_dump_ctx
netfilter: nf_tables: Drop pointless memset when dumping rules
netfilter: nf_tables: Always allocate nft_rule_dump_ctx
====================
Link: https://lore.kernel.org/r/20231010145343.12551-1-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Rework network interface logic. Before this change, the code flow was:
1. Disable interrupt
2. Try to schedule a NAPI
3. Check if it was possible (NAPI is not already scheduled)
4. emit BUG() if we receive interrupt while a NAPI is scheduled
If some application busy poll or set gro_flush_timeout low enough, it's
possible to reach the BUG() condition. Given that the condition may
happen and it wouldn't be a bug, rework the logic to permit such case
and prevent stall with interrupt never enabled again.
Disable the interrupt only if the NAPI can be scheduled (aka it's not
already scheduled) and drop the printk and BUG() call. With these
change, in the event of a NAPI already scheduled, the interrupt is
simply ignored with nothing done.
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Christian Marangi <ansuelsmth@gmail.com>
Link: https://lore.kernel.org/r/20231009133754.9834-4-ansuelsmth@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jiri added more careful handling of output of the code generator
to avoid wiping out existing files in
commit f65f305ae0 ("tools: ynl-gen: use temporary file for rendering")
Make use of the -o option in the Makefiles, it is already used
by ynl-regen.sh.
Link: https://lore.kernel.org/r/20231010202714.4045168-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
There is no good reason to specify the version for new protocols.
Forbid it in genetlink schema.
If the future proves me wrong, this restriction could be easily lifted.
Move the version definition in between legacy properties
in genetlink-legacy.
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20231010074810.191177-1-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently hns3 driver supports vf fault detect feature. Several ras caused
by VF resources don't need to do PF function reset for recovery. The driver
only needs to reset the specified VF.
So this patch adds process in ras module. New process will get detailed
information about ras and do the most correct measures based on these
accurate information.
Signed-off-by: Jie Wang <wangjie125@huawei.com>
Signed-off-by: Jijie Shao <shaojijie@huawei.com>
Link: https://lore.kernel.org/r/20231007031215.1067758-3-shaojijie@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Willem de Bruijn says:
====================
add skb_segment kunit coverage
As discussed at netconf last week. Some kernel code is exercised in
many different ways. skb_segment is a prime example. This ~350 line
function has 49 different patches in git blame with 28 different
authors.
When making a change, e.g., to fix a bug in one specific use case,
it is hard to establish through analysis alone that the change does
not break the many other paths through the code. It is impractical to
exercise all code paths through regression testing from userspace.
Add the minimal infrastructure needed to add KUnit tests to networking,
and add code coverage for this function.
Patch 1 adds the infra and the first simple test case: a linear skb
Patch 2 adds variants with frags[]
Patch 3 adds variants with frag_list skbs
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Expand the test with these variants that use skb frag_list:
- GSO_TEST_FRAG_LIST: frag_skb length is gso_size
- GSO_TEST_FRAG_LIST_PURE: same, data exclusively in frag skbs
- GSO_TEST_FRAG_LIST_NON_UNIFORM: frag_skb length may vary
- GSO_TEST_GSO_BY_FRAGS: frag_skb length defines gso_size,
i.e., segs may have varying sizes.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Expand the test with variants
- GSO_TEST_NO_GSO: payload size less than or equal to gso_size
- GSO_TEST_FRAGS: payload in both linear and page frags
- GSO_TEST_FRAGS_PURE: payload exclusively in page frags
- GSO_TEST_GSO_PARTIAL: produce one gso segment of multiple of gso_size,
plus optionally one non-gso trailer segment
Define a test struct that encodes the input gso skb and output segs.
Input in terms of linear and fragment lengths. Output as length of
each segment.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add unit testing for skb segment. This function is exercised by many
different code paths, such as GSO_PARTIAL or GSO_BY_FRAGS, linear
(with or without head_frag), frags or frag_list skbs, etc.
It is infeasible to manually run tests that cover all code paths when
making changes. The long and complex function also makes it hard to
establish through analysis alone that a patch has no unintended
side-effects.
Add code coverage through kunit regression testing. Introduce kunit
infrastructure for tests under net/core, and add this first test.
This first skb_segment test exercises a simple case: a linear skb.
Follow-on patches will parametrize the test and add more variants.
Tested: Built and ran the test with
make ARCH=um mrproper
./tools/testing/kunit/kunit.py run \
--kconfig_add CONFIG_NET=y \
--kconfig_add CONFIG_DEBUG_KERNEL=y \
--kconfig_add CONFIG_DEBUG_INFO=y \
--kconfig_add=CONFIG_DEBUG_INFO_DWARF_TOOLCHAIN_DEFAULT=y \
net_core_gso
Signed-off-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Although there is a kfree_skb_reason() helper function that can be used to
find the reason why this skb is dropped, but most callers didn't increase
one of rx_dropped, tx_dropped, rx_nohandler and rx_otherhost_dropped.
For the users, people are more concerned about why the dropped in ip
is increasing.
Introduce netdev_core_stats_inc() for trace the caller of
dev_core_stats_*_inc().
Also, add __code to netdev_core_stats_alloc(), as it's called with small
probability. And add noinline make sure netdev_core_stats_inc was never
inlined.
Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
Suggested-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Russell King says:
====================
net: dsa: remove validate method
These three patches remove DSA's phylink .validate method which becomes
unnecessary once the last two drivers provide phylink capabilities,
which this patch set adds. Both of these are best guesses.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
As all drivers now provide phylink capabilities (including MAC), the
if() condition in dsa_port_phylink_validate() will always be true. We
will always use the generic validator, which phylink will call itself
if the .validate method isn't populated. Thus, there is now no need to
implement the .validate method, so this implementation can be removed.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add phylink capabilities for dsa_loop, which I believe being a software
construct means that it supports essentially all interface types and
all speeds.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add phylink capabilities for vsc73xx. Although this switch driver does
populates the .adjust_link method, dsa_slave_phy_setup() will still be
used to create phylink instances for the LAN ports, although phylink
won't be used for shared links.
There are two different classes of switch - 5+1 and 8 port. The 5+1
port switches uses port indicies 0-4 for the user interfaces and 6 for
the CPU port. The 8 port is confusing - some comments in the driver
imply that port index 7 is used, but the driver actually still uses 6,
so that is what we go with. Also, there appear to be no DTs in the
kernel tree that are using the 8 port variety.
It also looks like port 5 is always skipped.
The switch supports 10M, 100M and 1G speeds. It is not clear whether
all these speeds are supported on the CPU interface. It also looks like
symmetric pause is supported, whether asymmetric pause is as well is
unclear. However, it looks like the pause configuration is entirely
static, and doesn't depend on negotiation results.
So, let's do the best effort we can based on the information found in
the driver when creating vsc73xx_phylink_get_caps().
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Linus Walleij <linus.walleij@linaro.org>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The switch statement in netvsc_send_completion() is incorrectly validating
the length of incoming network packets by falling through to the next case.
Avoid the fallthrough. Instead break after a case match and then process
the complete() call.
The current code has not caused any known failures. But nonetheless, the
code should be corrected as a different ordering of the switch cases might
cause a length check to fail when it should not.
Signed-off-by: Sonia Sharma <sonia.sharma@linux.microsoft.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Heng Qi says:
====================
virtio-net: Fix and update interrupt moderation
The setting of virtio coalescing parameters involves all-queues and
per queue, so we must be careful to synchronize the two.
Regarding napi_tx switching, this patch set is not only
compatible with the previous way of using tx-frames to switch napi_tx,
but also improves the user experience when setting interrupt parameters.
This patch set has been tested and was part of the previous netdim patch
set[1] and is now being split to be rolled out in steps.
[1] https://lore.kernel.org/all/20230811065512.22190-1-hengqi@linux.alibaba.com/
---
v2->v3:
1. Fix a tiny comment.
v1->v2:
1. Fix some minor comments and add ack tags.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Update a comment because virtio-net now supports both
VIRTIO_NET_F_NOTF_COAL and VIRTIO_NET_F_VQ_NOTF_COAL.
Signed-off-by: Heng Qi <hengqi@linux.alibaba.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
According to the definition of virtqueue coalescing spec[1]:
Upon disabling and re-enabling a transmit virtqueue, the device MUST set
the coalescing parameters of the virtqueue to those configured through the
VIRTIO_NET_CTRL_NOTF_COAL_TX_SET command, or, if the driver did not set
any TX coalescing parameters, to 0.
Upon disabling and re-enabling a receive virtqueue, the device MUST set
the coalescing parameters of the virtqueue to those configured through the
VIRTIO_NET_CTRL_NOTF_COAL_RX_SET command, or, if the driver did not set
any RX coalescing parameters, to 0.
We need to add this setting for vq resize (ethtool -G) where vq_reset happens.
[1] https://lists.oasis-open.org/archives/virtio-dev/202303/msg00415.html
Fixes: 394bd87764 ("virtio_net: support per queue interrupt coalesce command")
Cc: Gavin Li <gavinl@nvidia.com>
Signed-off-by: Heng Qi <hengqi@linux.alibaba.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When using .set_coalesce interface to set all queue coalescing
parameters, we need to update both per-queue and global save values.
Fixes: 394bd87764 ("virtio_net: support per queue interrupt coalesce command")
Cc: Gavin Li <gavinl@nvidia.com>
Signed-off-by: Heng Qi <hengqi@linux.alibaba.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since virtio-net allows switching napi_tx for per txq, we have to
get the specific txq's result now.
Fixes: 394bd87764 ("virtio_net: support per queue interrupt coalesce command")
Cc: Gavin Li <gavinl@nvidia.com>
Signed-off-by: Heng Qi <hengqi@linux.alibaba.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Background:
1. Commit 0c465be183 ("virtio_net: ethtool tx napi configuration") uses
tx-frames to toggle napi_tx (0 off and 1 on) if notification coalescing
is not supported.
2. Commit 31c03aef9b ("virtio_net: enable napi_tx by default") enables
napi_tx for all txqs by default.
Status:
When virtio-net supports notification coalescing, after initialization,
tx-frames is 0 and napi_tx is true.
Problem:
When the user only wants to set rx coalescing params using
ethtool -C eth0 rx-usecs 10, or
ethtool -Q eth0 queue_mask 0x1 -C rx-usecs 10,
these cmds will carry tx-frames as 0, causing the napi_tx switching condition
is satisfied. Then the user gets:
netlink error: Device or resource busy.
The same happens when trying to set rx-frames, adaptive_rx, adaptive_tx...
How to fix:
When notification coalescing feature is negotiated, initially make the
value of tx-frames to be consistent with napi_tx.
For compatibility with the past, it is still supported to use tx-frames
to toggle napi_tx.
Reported-by: Xiaoming Zhao <zxm377917@alibaba-inc.com>
Signed-off-by: Heng Qi <hengqi@linux.alibaba.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add comments for nlpid, family, udlen and udata in struct nft_table, and
afinfo is no longer a member of struct nft_table, so remove the comment
for it.
Signed-off-by: George Guo <guodongtai@kylinos.cn>
Signed-off-by: Florian Westphal <fw@strlen.de>
pr_debug doesn't provide any information other than that a packet
did not match existing state but also was found to not create a new
connection.
Replaces this with tcp_error_log, which will also dump packets'
content so one can see if this is a stray FIN or RST.
Signed-off-by: Florian Westphal <fw@strlen.de>
nf_conntrack_alter_reply doesn't do helper reassignment anymore.
Remove the comments that make this claim.
Furthermore, remove dead code from the function and place ot
in nf_conntrack.h.
Signed-off-by: Florian Westphal <fw@strlen.de>
Since struct netlink_callback::args is not used by rule dumpers anymore,
use it to hold nft_rule_dump_ctx. Add a build-time check to make sure it
won't ever exceed the available space.
Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Florian Westphal <fw@strlen.de>
In order to move the context into struct netlink_callback's scratch
area, the latter must be unused first.
Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Florian Westphal <fw@strlen.de>
This relieves the dump callback from having to check nlmsg_type upon
each call and instead performs the check once in .start callback.
Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Florian Westphal <fw@strlen.de>
None of the dump callbacks uses netlink_callback::args beyond the first
element, no need to zero the data.
Fixes: 96518518cc ("netfilter: add nftables")
Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Florian Westphal <fw@strlen.de>
It will move into struct netlink_callback's scratch area later, just put
nf_tables_dump_rules_start in shape to reduce churn later.
Suggested-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Florian Westphal <fw@strlen.de>
Petr Machata says:
====================
mlxsw: Fix -Wformat-truncation warnings
Ido Schimmel writes:
Commit 6d4ab2e97d ("extrawarn: enable format and stringop overflow
warnings in W=1") enabled format warnings as part of W=1 builds,
resulting in two new warnings in mlxsw. Fix both and target at net-next
as the warnings are not indicative of actual bugs.
====================
Link: https://lore.kernel.org/r/cover.1696600763.git.petrm@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Ethtool stats strings cannot be longer than 32 characters
('ETH_GSTRING_LEN'), including the terminating null byte. The format
string '%.29s_%.1d' can exceed this limitation if the per-TC counter
name exceeds 28 characters. Together with the underscore, the two digits
of the TC (bounded at 16) and the terminating null byte, more than 32
characters will be used.
Fix this by bounding the counter name at 28 characters which suppresses
the following build warning [1]. This does not affect ethtool output
since the longest counter name does not exceed this limitation.
[1]
drivers/net/ethernet/mellanox/mlxsw/spectrum_ethtool.c: In function ‘mlxsw_sp_port_get_strings’:
drivers/net/ethernet/mellanox/mlxsw/spectrum_ethtool.c:622:58: error: ‘snprintf’ output may be truncated before the last format character [-Werror=format-truncation=]
622 | snprintf(*p, ETH_GSTRING_LEN, "%.29s_%.1d",
| ^
In function ‘mlxsw_sp_port_get_tc_strings’,
inlined from ‘mlxsw_sp_port_get_strings’ at drivers/net/ethernet/mellanox/mlxsw/spectrum_ethtool.c:677:4:
drivers/net/ethernet/mellanox/mlxsw/spectrum_ethtool.c:622:17: note: ‘snprintf’ output between 3 and 33 bytes into a destination of size 32
622 | snprintf(*p, ETH_GSTRING_LEN, "%.29s_%.1d",
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
623 | mlxsw_sp_port_hw_tc_stats[i].str, tc);
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
drivers/net/ethernet/mellanox/mlxsw/spectrum_ethtool.c: In function ‘mlxsw_sp_port_get_strings’:
drivers/net/ethernet/mellanox/mlxsw/spectrum_ethtool.c:622:58: error: ‘snprintf’ output may be truncated before the last format character [-Werror=format-truncation=]
622 | snprintf(*p, ETH_GSTRING_LEN, "%.29s_%.1d",
| ^
In function ‘mlxsw_sp_port_get_tc_strings’,
inlined from ‘mlxsw_sp_port_get_strings’ at drivers/net/ethernet/mellanox/mlxsw/spectrum_ethtool.c:677:4:
drivers/net/ethernet/mellanox/mlxsw/spectrum_ethtool.c:622:17: note: ‘snprintf’ output between 3 and 33 bytes into a destination of size 32
622 | snprintf(*p, ETH_GSTRING_LEN, "%.29s_%.1d",
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
623 | mlxsw_sp_port_hw_tc_stats[i].str, tc);
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Simon Horman <horms@kernel.org> # build-tested
Signed-off-by: Paolo Abeni <pabeni@redhat.com>