This work adds a new, minimal BPF-programmable device called "netkit"
(former PoC code-name "meta") we recently presented at LSF/MM/BPF. The
core idea is that BPF programs are executed within the drivers xmit routine
and therefore e.g. in case of containers/Pods moving BPF processing closer
to the source.
One of the goals was that in case of Pod egress traffic, this allows to
move BPF programs from hostns tcx ingress into the device itself, providing
earlier drop or forward mechanisms, for example, if the BPF program
determines that the skb must be sent out of the node, then a redirect to
the physical device can take place directly without going through per-CPU
backlog queue. This helps to shift processing for such traffic from softirq
to process context, leading to better scheduling decisions/performance (see
measurements in the slides).
In this initial version, the netkit device ships as a pair, but we plan to
extend this further so it can also operate in single device mode. The pair
comes with a primary and a peer device. Only the primary device, typically
residing in hostns, can manage BPF programs for itself and its peer. The
peer device is designated for containers/Pods and cannot attach/detach
BPF programs. Upon the device creation, the user can set the default policy
to 'pass' or 'drop' for the case when no BPF program is attached.
Additionally, the device can be operated in L3 (default) or L2 mode. The
management of BPF programs is done via bpf_mprog, so that multi-attach is
supported right from the beginning with similar API and dependency controls
as tcx. For details on the latter see commit 053c8e1f23 ("bpf: Add generic
attach/detach/query API for multi-progs"). tc BPF compatibility is provided,
so that existing programs can be easily migrated.
Going forward, we plan to use netkit devices in Cilium as the main device
type for connecting Pods. They will be operated in L3 mode in order to
simplify a Pod's neighbor management and the peer will operate in default
drop mode, so that no traffic is leaving between the time when a Pod is
brought up by the CNI plugin and programs attached by the agent.
Additionally, the programs we attach via tcx on the physical devices are
using bpf_redirect_peer() for inbound traffic into netkit device, hence the
latter is also supporting the ndo_get_peer_dev callback. Similarly, we use
bpf_redirect_neigh() for the way out, pushing from netkit peer to phys device
directly. Also, BIG TCP is supported on netkit device. For the follow-up
work in single device mode, we plan to convert Cilium's cilium_host/_net
devices into a single one.
An extensive test suite for checking device operations and the BPF program
and link management API comes as BPF selftests in this series.
Co-developed-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://github.com/borkmann/iproute2/tree/pr/netkit
Link: http://vger.kernel.org/bpfconf2023_material/tcx_meta_netdev_borkmann.pdf (24ff.)
Link: https://lore.kernel.org/r/20231024214904.29825-2-daniel@iogearbox.net
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Compiler warns about a possible format-overflow in tsnep_request_irq():
drivers/net/ethernet/engleder/tsnep_main.c:884:55: warning: 'sprintf' may write a terminating nul past the end of the destination [-Wformat-overflow=]
sprintf(queue->name, "%s-rx-%d", name,
^
drivers/net/ethernet/engleder/tsnep_main.c:881:55: warning: 'sprintf' may write a terminating nul past the end of the destination [-Wformat-overflow=]
sprintf(queue->name, "%s-tx-%d", name,
^
drivers/net/ethernet/engleder/tsnep_main.c:878:49: warning: '-txrx-' directive writing 6 bytes into a region of size between 5 and 25 [-Wformat-overflow=]
sprintf(queue->name, "%s-txrx-%d", name,
^~~~~~
Actually overflow cannot happen. Name is limited to IFNAMSIZ, because
netdev_name() is called during ndo_open(). queue_index is single char,
because less than 10 queues are supported.
Fix warning with snprintf(). Additionally increase buffer to 32 bytes,
because those 7 additional bytes were unused anyway.
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202310182028.vmDthIUa-lkp@intel.com/
Signed-off-by: Gerhard Engleder <gerhard@engleder-embedded.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://lore.kernel.org/r/20231023183856.58373-1-gerhard@engleder-embedded.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Call skb_gso_validate_network_len() to check if packet is over PMTU.
Fixes: 459aa660eb ("gtp: add initial driver for datapath of GPRS Tunneling Protocol (GTP-U)")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Currently there is a device tree entry called "local-mac-address"
which can be filled by the bootloader or manually set.This is
useful when the user does not want to use the MAC address
programmed into the SoC.
Currently, the davinci_emac reads the MAC from the DT, copies
it from pdata->mac_addr to priv->mac_addr, then blindly overwrites
it by reading from registers in the SoC, and falls back to a
random MAC if it's still not valid. This completely ignores any
MAC address in the device tree.
In order to use the local-mac-address, check to see if the contents
of priv->mac_addr are valid before falling back to reading from the
SoC when the MAC address is not valid.
Signed-off-by: Adam Ford <aford173@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://lore.kernel.org/r/20231022151911.4279-1-aford173@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Currently all RX frames are timestamped which results in a performance
penalty when timestamping is not needed. The default is now being
changed to not timestamp any Rx frames (HWTSTAMP_FILTER_NONE), but
support has been added to allow changing the desired RX timestamping
mode (HWTSTAMP_FILTER_ALL - which was the previous setting and
HWTSTAMP_FILTER_PTP_V2_EVENT are now supported) using
SIOCSHWTSTAMP. All settings were tested using the hwstamp_ctl application.
It is also noted that ptp4l, when started, preconfigures the device to
timestamp using HWTSTAMP_FILTER_PTP_V2_EVENT, so this driver continues
to work properly "out of the box".
Test setup: x64 PC with LAN7430 ---> x64 PC as partner
iperf3 with - Timestamp all incoming packets:
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-5.05 sec 517 MBytes 859 Mbits/sec 0 sender
[ 5] 0.00-5.00 sec 515 MBytes 864 Mbits/sec receiver
iperf Done.
iperf3 with - Timestamp only PTP packets:
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-5.04 sec 563 MBytes 937 Mbits/sec 0 sender
[ 5] 0.00-5.00 sec 561 MBytes 941 Mbits/sec receiver
Signed-off-by: Vishvambar Panth S <vishvambarpanth.s@microchip.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://lore.kernel.org/r/20231020185801.25649-1-vishvambarpanth.s@microchip.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
strncpy() is deprecated for use on NUL-terminated destination strings
[1] and as such we should prefer more robust and less ambiguous string
interfaces.
wl->chip.phy_fw_ver_str is obviously intended to be NUL-terminated by
the deliberate comment telling us as much. Furthermore, its only use is
drivers/net/wireless/ti/wlcore/debugfs.c shows us it should be
NUL-terminated since its used in scnprintf:
492 | DRIVER_STATE_PRINT_STR(chip.phy_fw_ver_str);
which is defined as:
| #define DRIVER_STATE_PRINT_STR(x) DRIVER_STATE_PRINT(x, "%s")
...
| #define DRIVER_STATE_PRINT(x, fmt) \
| (res += scnprintf(buf + res, DRIVER_STATE_BUF_LEN - res,\
| #x " = " fmt "\n", wl->x))
We can also see that NUL-padding is not required.
Considering the above, a suitable replacement is `strscpy` [2] due to
the fact that it guarantees NUL-termination on the destination buffer
without unnecessarily NUL-padding.
The very fact that a plain-english comment had to be made alongside a
manual NUL-byte assignment for such a simple purpose shows why strncpy
is faulty. It has non-obvious behavior that has to be clarified every
time it is used (and if it isn't then the reader suffers).
Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings [1]
Link: https://manpages.debian.org/testing/linux-manual-4.8/strscpy.9.en.html [2]
Link: https://github.com/KSPP/linux/issues/90
Cc: linux-hardening@vger.kernel.org
Signed-off-by: Justin Stitt <justinstitt@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Kalle Valo <kvalo@kernel.org>
Link: https://lore.kernel.org/r/20231018-strncpy-drivers-net-wireless-ti-wl18xx-main-c-v2-1-ab828a491ce5@google.com
strncpy() is deprecated for use on NUL-terminated destination strings
[1] and as such we should prefer more robust and less ambiguous string
interfaces.
Based on other assignments of similar fw_version fields we can see that
NUL-termination is required but not NUL-padding:
ethernet/intel/ixgbe/ixgbe_ethtool.c
1111: strscpy(drvinfo->fw_version, adapter->eeprom_id,
1112: sizeof(drvinfo->fw_version));
ethernet/intel/igc/igc_ethtool.c
147: scnprintf(adapter->fw_version,
148: sizeof(adapter->fw_version),
153: strscpy(drvinfo->fw_version, adapter->fw_version,
154: sizeof(drvinfo->fw_version));
wireless/broadcom/brcm80211/brcmfmac/core.c
569: strscpy(info->fw_version, drvr->fwver, sizeof(info->fw_version));
wireless/broadcom/brcm80211/brcmsmac/main.c
7867: snprintf(wlc->wiphy->fw_version,
7868: sizeof(wlc->wiphy->fw_version), "%u.%u", rev, patch);
wireless/broadcom/b43legacy/main.c
1765: snprintf(wiphy->fw_version, sizeof(wiphy->fw_version), "%u.%u",
wireless/broadcom/b43/main.c
2730: snprintf(wiphy->fw_version, sizeof(wiphy->fw_version), "%u.%u",
wireless/intel/iwlwifi/dvm/main.c
1465: snprintf(priv->hw->wiphy->fw_version,
1466: sizeof(priv->hw->wiphy->fw_version),
wireless/intel/ipw2x00/ipw2100.c
5905: snprintf(info->fw_version, sizeof(info->fw_version), "%s:%d:%s",
A suitable replacement is `strscpy` due to the fact that it guarantees
NUL-termination on the destination buffer without unnecessarily
NUL-padding.
Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings [1]
Link: https://github.com/KSPP/linux/issues/90
Cc: linux-hardening@vger.kernel.org
Signed-off-by: Justin Stitt <justinstitt@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Kalle Valo <kvalo@kernel.org>
Link: https://lore.kernel.org/r/20231018-strncpy-drivers-net-wireless-ti-wl1251-main-c-v2-1-67b63dfcb1b8@google.com
strncpy() is deprecated for use on NUL-terminated destination strings
[1] and as such we should prefer more robust and less ambiguous string
interfaces.
`extra` is intended to be NUL-terminated which is evident by the manual
assignment of a NUL-byte as well as its immediate usage with strlen().
Moreover, many of these getters and setters are NUL-padding buffers with
memset():
2439 | memset(&tx_power, 0, sizeof(tx_power));
9998 | memset(sys_config, 0, sizeof(struct ipw_sys_config));
10084 | memset(tfd, 0, sizeof(*tfd));
10261 | memset(&dummystats, 0, sizeof(dummystats));
... let's maintain this behavior and NUL-pad our destination buffer.
Considering the above, a suitable replacement is `strscpy_pad` due to
the fact that it guarantees both NUL-termination and NUL-padding on the
destination buffer.
To be clear, there is no bug in the current implementation as
MAX_WX_STRING is much larger than the size of the string literals being
copied from. Also, strncpy() does NUL-pad the destination buffer and
using strscpy_pad() simply matches that behavior. All in all, there
should be no functional change but we are one step closer to eliminating
usage of strncpy().
Do note that we cannot use the more idiomatic strscpy invocation of
(dest, src, sizeof(dest)) as the destination buffer cannot have its size
determined at compile time. So, let's stick with (dest, src, LEN).
Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings [1]
Link: https://github.com/KSPP/linux/issues/90
Cc: linux-hardening@vger.kernel.org
Signed-off-by: Justin Stitt <justinstitt@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Kalle Valo <kvalo@kernel.org>
Link: https://lore.kernel.org/r/20231017-strncpy-drivers-net-wireless-intel-ipw2x00-ipw2200-c-v2-1-465e10dc817c@google.com
The watchdog function is broken on rt2800 series SoCs. This patch
fixes the incorrect watchdog logic to make it work again.
1. Update current wdt queue index if it's not equal to the previous
index. Watchdog compares the current and previous queue index to
judge if the queue hung.
2. Make sure hung_{rx,tx} 'true' status won't be override by the
normal queue. Any queue hangs should trigger a reset action.
3. Clear the watchdog counter of all queues before resetting the
hardware. This change may help to avoid the reset loop.
4. Change hang check function return type to bool as we only need
to return two status, yes or no.
Signed-off-by: Shiji Yang <yangshiji66@outlook.com>
Acked-by: Stanislaw Gruszka <stf_xl@wp.pl>
Signed-off-by: Kalle Valo <kvalo@kernel.org>
Link: https://lore.kernel.org/r/TYAP286MB0315BC1D83D31154924F0D39BCD1A@TYAP286MB0315.JPNP286.PROD.OUTLOOK.COM
On v6.6-rc4 with GCC 13.2 I see:
drivers/net/wireless/ath/ath9k/hif_usb.c:1223:42: warning: '.0.fw' directive output may be truncated writing 5 bytes into a region of size between 4 and 11 [-Wformat-truncation=]
drivers/net/wireless/ath/ath9k/hif_usb.c:1222:17: note: 'snprintf' output between 27 and 34 bytes into a destination of size 32
Fix it by increasing the size of the fw_name field to 64 bytes.
Compile tested only.
Signed-off-by: Kalle Valo <kvalo@kernel.org>
Link: https://lore.kernel.org/r/20231012135854.3473332-3-kvalo@kernel.org
On v6.6-rc4 with GCC 13.2 I see:
drivers/net/wireless/intel/ipw2x00/ipw2100.c:5905:63: warning: '%s' directive output may be truncated writing up to 63 bytes into a region of size 32 [-Wformat-truncation=]
drivers/net/wireless/intel/ipw2x00/ipw2100.c:5905:9: note: 'snprintf' output between 4 and 140 bytes into a destination of size 32
drivers/net/wireless/intel/ipw2x00/ipw2200.c:10392:63: warning: '%s' directive output may be truncated writing up to 63 bytes into a region of size 32 [-Wformat-truncation=]
drivers/net/wireless/intel/ipw2x00/ipw2200.c:10392:9: note: 'snprintf' output between 4 and 98 bytes into a destination of size 32
Fix this by copying only the firmware version and not providing any extra
information via ethtool. This is an ancient driver anyway and most likely
removed soon so it doesn't really matter.
Compile tested only.
Signed-off-by: Kalle Valo <kvalo@kernel.org>
Link: https://lore.kernel.org/r/20231012135854.3473332-2-kvalo@kernel.org
On v6.6-rc4 with GCC 13.2 I see:
drivers/net/wireless/broadcom/brcm80211/brcmfmac/firmware.c:262:52: warning: '%d' directive output may be truncated writing between 1 and 5 bytes into a region of size 4 [-Wformat-truncation=]
drivers/net/wireless/broadcom/brcm80211/brcmfmac/firmware.c:262:46: note: directive argument in the range [0, 65535]
drivers/net/wireless/broadcom/brcm80211/brcmfmac/firmware.c:262:46: note: directive argument in the range [0, 65535]
drivers/net/wireless/broadcom/brcm80211/brcmfmac/firmware.c:262:9: note: 'snprintf' output between 9 and 17 bytes into a destination of size 9
drivers/net/wireless/broadcom/brcm80211/brcmfmac/firmware.c:265:55: warning: '%d' directive output may be truncated writing between 1 and 5 bytes into a region of size 4 [-Wformat-truncation=]
drivers/net/wireless/broadcom/brcm80211/brcmfmac/firmware.c:265:48: note: directive argument in the range [0, 65535]
drivers/net/wireless/broadcom/brcm80211/brcmfmac/firmware.c:265:48: note: directive argument in the range [0, 65535]
drivers/net/wireless/broadcom/brcm80211/brcmfmac/firmware.c:265:9: note: 'snprintf' output between 10 and 18 bytes into a destination of size 10
drivers/net/wireless/broadcom/brcm80211/brcmfmac/firmware.c:342:50: warning: '/' directive output may be truncated writing 1 byte into a region of size between 0 and 4 [-Wformat-truncation=]
drivers/net/wireless/broadcom/brcm80211/brcmfmac/firmware.c:342:42: note: directive argument in the range [0, 65535]
drivers/net/wireless/broadcom/brcm80211/brcmfmac/firmware.c:342:9: note: 'snprintf' output between 10 and 18 bytes into a destination of size 10
Fix these by increasing the buffer sizes to 20 bytes to make sure there's enough space.
Compile tested only.
Signed-off-by: Kalle Valo <kvalo@kernel.org>
Link: https://lore.kernel.org/r/20231012135854.3473332-1-kvalo@kernel.org
Commit a243ecc323 ("net: mdio: xgene: Use device_get_match_data()")
dropped the unconditional use of xgene_mdio_of_match resulting in this
warning:
drivers/net/mdio/mdio-xgene.c:303:34: warning: unused variable 'xgene_mdio_of_match' [-Wunused-const-variable]
The fix is to drop of_match_ptr() which is not necessary because DT is
always used for this driver (well, it could in theory support ACPI only,
but CONFIG_OF is always enabled for arm64).
Fixes: a243ecc323 ("net: mdio: xgene: Use device_get_match_data()")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202310170832.xnVXw1bb-lkp@intel.com/
Signed-off-by: Rob Herring <robh@kernel.org>
Link: https://lore.kernel.org/r/20231019182345.833136-1-robh@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When the vif is in MLD mode, we'll get a vif links change from
non-zero to zero on disassociation, which removes all links in
the firmware and adds the 'deflink' the driver/mac80211 has.
This causes the firmware to clear some internal state.
However, in non-MLD mode, this doesn't happen, and causes some
state to be left around in firmware, which can particularly
cause trouble with the ref-BSSID in multi-BSSID, leading to an
assert later if immediately making a new multi-BSSID connection
with a different ref-BSSID.
Fix this by removing/re-adding the link in the non-MLD case
when the channel is removed from the vif. This way, all of the
state will get cleared out, even if we need the deflink, which
is more for software architecture purposes than otherwise.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Gregory Greenman <gregory.greenman@intel.com>
Link: https://lore.kernel.org/r/20231022173519.90c82837ba4d.I341fa30c480f7673b14b48a0e29a2241472c2e13@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
If a TX queue has no space for new TX frames, the driver will keep
these frames in the overflow queue, and during reclaim flow it
will retry to send the frames from that queue.
But if the reclaim flow was invoked from TX queue flush, we will also
TX these frames, which is wrong as we don't want to TX anything
after flush.
This might also cause assert 0x125F when removing the queue,
saying that the driver removes a non-empty queue
Fix this by TXing the overflow queue's frames only if we are
not in flush queue flow.
Fixes: a445098058 ("iwlwifi: move reclaim flows to the queue file")
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Signed-off-by: Gregory Greenman <gregory.greenman@intel.com>
Link: https://lore.kernel.org/r/20231022173519.caf06c8709d9.Ibf664ccb3f952e836f8fa461ea58fc08e5c46e88@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
In order to get regulatory domain, driver sends MCC_UPDATE_CMD to the
FW. One of the parameters in the response is the status which can tell
if the regdomain has changed or not.
When iwl_mvm_init_mcc() is called during iwl_op_mode_mvm_start(), then
sband is still NULL and channel parameters (i.e. chan->flags) cannot be
initialized. When, further in the flow, iwl_mvm_update_mcc() is called
during iwl_mvm_up(), it first checks if the regdomain has changed and
then skips the update if it remains the same. But, since channel
parameters weren't initialized yet, the update should be forced in this
codepath. Fix that by adding a corresponding parameter to
iwl_mvm_init_fw_regd().
Signed-off-by: Gregory Greenman <gregory.greenman@intel.com>
Link: https://lore.kernel.org/r/20231017115047.78b2c5b891b0.Iac49d52e0bfc0317372015607c63ea9276bbb188@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The firmware / hardware of devices supporting RSS is able to report
duplicates and packets that time out inside the reoder buffer. We can
now remove all the complex logic that was implemented to keep all the Rx
queues more the less synchronized: we used to send a message to all the
queues through the firmware to teach the different queues about what is
the current SSN every 2048 packets.
Now that we rely on the firmware / hardware to detect duplicates, we can
completely remove the code that did that in the driver and it has been
reported that this code was spuriously dropping legit packets.
Suggested-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Gregory Greenman <gregory.greenman@intel.com>
Link: https://lore.kernel.org/r/20231017115047.54cf4d3d5956.Ic06a08c9fb1e1ec315a4b49d632b78b8474dab79@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Multi rx queue allows to spread the load of the Rx streams on different
CPUs. 9000 series required complex synchronization mechanisms from the
driver side since the hardware / firmware is not able to provide
information about duplicate packets and timeouts inside the reordering
buffer.
Users have complained that for newer devices, all those synchronization
mechanisms have caused spurious packet drops. Those packet drops
disappeared if we simplify the code, but unfortunately, we can't have
RSS enabled on 9000 series without this complex code.
Remove support for RSS on 9000 so that we can make the code much simpler
for newer devices and fix the bugs for them.
The down side of this patch is a that all the Rx path will be routed to
a single CPU, but this has never been an issue, the modern CPUs are just
fast enough to cope with all the traffic.
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Gregory Greenman <gregory.greenman@intel.com>
Link: https://lore.kernel.org/r/20231017115047.2917eb8b7af9.Iddd7dcf335387ba46fcbbb6067ef4ff9cd3755a7@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>