For drivers to support partial offload of a filter's action list,
add support for action miss to specify an action instance to
continue from in sw.
CT action in particular can't be fully offloaded, as new connections
need to be handled in software. This imposes other limitations on
the actions that can be offloaded together with the CT action, such
as packet modifications.
Assign each action on a filter's action list a unique miss_cookie
which drivers can then use to fill action_miss part of the tc skb
extension. On getting back this miss_cookie, find the action
instance with relevant cookie and continue classifying from there.
Signed-off-by: Paul Blakey <paulb@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Stefan Schmidt says:
====================
pull-request: ieee802154-next 2023-02-20
Miquel Raynal build upon his earlier work and introduced two new
features into the ieee802154 stack. Beaconing to announce existing
PAN's and passive scanning to discover the beacons and associated
PAN's. The matching changes to the userspace configuration tool
have been posted as well and will be released together with the
kernel release.
Arnd Bergmann and Dmitry Torokhov worked on converting the
at86rf230 and cc2520 drivers away from the unused platform_data
usage and towards the new gpiod API. (I had to add a revert as
Dmitry found a regression on an already pushed tree on my side).
Changes since v1 (pull request 2023-02-02)
- Netlink API extack and NLA_POLICY* usage as suggested by Jakub
- Removed always true condition found by kernel test robot
- Simplify device removal with running background job for scanning
- Fix problems with beacon sending in some cases by using the MLME
tx path
* tag 'ieee802154-for-net-next-2023-02-20' of git://git.kernel.org/pub/scm/linux/kernel/git/sschmidt/wpan-next:
ieee802154: Drop device trackers
mac802154: Fix an always true condition
mac802154: Send beacons using the MLME Tx path
ieee802154: Change error code on monitor scan netlink request
ieee802154: Convert scan error messages to extack
ieee802154: Use netlink policies when relevant on scan parameters
ieee802154: at86rf230: switch to using gpiod API
ieee802154: at86rf230: drop support for platform data
Revert "at86rf230: convert to gpio descriptors"
cc2520: move to gpio descriptors
mac802154: Avoid superfluous endianness handling
at86rf230: convert to gpio descriptors
mac802154: Handle basic beaconing
ieee802154: Add support for user beaconing requests
mac802154: Handle passive scanning
mac802154: Add MLME Tx locked helpers
mac802154: Prepare forcing specific symbol duration
ieee802154: Introduce a helper to validate a channel
ieee802154: Define a beacon frame header
ieee802154: Add support for user scanning requests
====================
Link: https://lore.kernel.org/r/20230220213749.386451-1-stefan@datenfreihafen.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The call "skb_copy_from_linear_data(skb, inl + 1, spc)" triggers a FORTIFY
memcpy() warning on ppc64 platform:
In function ‘fortify_memcpy_chk’,
inlined from ‘skb_copy_from_linear_data’ at ./include/linux/skbuff.h:4029:2,
inlined from ‘build_inline_wqe’ at drivers/net/ethernet/mellanox/mlx4/en_tx.c:722:4,
inlined from ‘mlx4_en_xmit’ at drivers/net/ethernet/mellanox/mlx4/en_tx.c:1066:3:
./include/linux/fortify-string.h:513:25: error: call to ‘__write_overflow_field’ declared with
attribute warning: detected write beyond size of field (1st parameter); maybe use struct_group()?
[-Werror=attribute-warning]
513 | __write_overflow_field(p_size_field, size);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Same behaviour on x86 you can get if you use "__always_inline" instead of
"inline" for skb_copy_from_linear_data() in skbuff.h
The call here copies data into inlined tx destricptor, which has 104
bytes (MAX_INLINE) space for data payload. In this case "spc" is known
in compile-time but the destination is used with hidden knowledge
(real structure of destination is different from that the compiler
can see). That cause the fortify warning because compiler can check
bounds, but the real bounds are different. "spc" can't be bigger than
64 bytes (MLX4_INLINE_ALIGN), so the data can always fit into inlined
tx descriptor. The fact that "inl" points into inlined tx descriptor is
determined earlier in mlx4_en_xmit().
Avoid confusing the compiler with "inl + 1" constructions to get to past
the inl header by introducing a flexible array "data" to the struct so
that the compiler can see that we are not dealing with an array of inl
structs, but rather, arbitrary data following the structure. There are
no changes to the structure layout reported by pahole, and the resulting
machine code is actually smaller.
Reported-by: Josef Oskera <joskera@redhat.com>
Link: https://lore.kernel.org/lkml/20230217094541.2362873-1-joskera@redhat.com
Fixes: f68f2ff915 ("fortify: Detect struct member overflows in memcpy() at compile-time")
Cc: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/20230218183842.never.954-kees@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Daniel Borkmann says:
====================
pull-request: bpf-next 2023-02-17
We've added 64 non-merge commits during the last 7 day(s) which contain
a total of 158 files changed, 4190 insertions(+), 988 deletions(-).
The main changes are:
1) Add a rbtree data structure following the "next-gen data structure"
precedent set by recently-added linked-list, that is, by using
kfunc + kptr instead of adding a new BPF map type, from Dave Marchevsky.
2) Add a new benchmark for hashmap lookups to BPF selftests,
from Anton Protopopov.
3) Fix bpf_fib_lookup to only return valid neighbors and add an option
to skip the neigh table lookup, from Martin KaFai Lau.
4) Add cgroup.memory=nobpf kernel parameter option to disable BPF memory
accouting for container environments, from Yafang Shao.
5) Batch of ice multi-buffer and driver performance fixes,
from Alexander Lobakin.
6) Fix a bug in determining whether global subprog's argument is
PTR_TO_CTX, which is based on type names which breaks kprobe progs,
from Andrii Nakryiko.
7) Prep work for future -mcpu=v4 LLVM option which includes usage of
BPF_ST insn. Thus improve BPF_ST-related value tracking in verifier,
from Eduard Zingerman.
8) More prep work for later building selftests with Memory Sanitizer
in order to detect usages of undefined memory, from Ilya Leoshkevich.
9) Fix xsk sockets to check IFF_UP earlier to avoid a NULL pointer
dereference via sendmsg(), from Maciej Fijalkowski.
10) Implement BPF trampoline for RV64 JIT compiler, from Pu Lehui.
11) Fix BPF memory allocator in combination with BPF hashtab where it could
corrupt special fields e.g. used in bpf_spin_lock, from Hou Tao.
12) Fix LoongArch BPF JIT to always use 4 instructions for function
address so that instruction sequences don't change between passes,
from Hengqi Chen.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (64 commits)
selftests/bpf: Add bpf_fib_lookup test
bpf: Add BPF_FIB_LOOKUP_SKIP_NEIGH for bpf_fib_lookup
riscv, bpf: Add bpf trampoline support for RV64
riscv, bpf: Add bpf_arch_text_poke support for RV64
riscv, bpf: Factor out emit_call for kernel and bpf context
riscv: Extend patch_text for multiple instructions
Revert "bpf, test_run: fix &xdp_frame misplacement for LIVE_FRAMES"
selftests/bpf: Add global subprog context passing tests
selftests/bpf: Convert test_global_funcs test to test_loader framework
bpf: Fix global subprog context argument resolution logic
LoongArch, bpf: Use 4 instructions for function address in JIT
bpf: bpf_fib_lookup should not return neigh in NUD_FAILED state
bpf: Disable bh in bpf_test_run for xdp and tc prog
xsk: check IFF_UP earlier in Tx path
Fix typos in selftest/bpf files
selftests/bpf: Use bpf_{btf,link,map,prog}_get_info_by_fd()
samples/bpf: Use bpf_{btf,link,map,prog}_get_info_by_fd()
bpftool: Use bpf_{btf,link,map,prog}_get_info_by_fd()
libbpf: Use bpf_{btf,link,map,prog}_get_info_by_fd()
libbpf: Introduce bpf_{btf,link,map,prog}_get_info_by_fd()
...
====================
Link: https://lore.kernel.org/r/20230217221737.31122-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
That really was meant to be a per netns attribute from the beginning.
The idea is that once proper isolation is in place in the main
namespace, additional demux in the child namespaces will be redundant.
Let's make child netns default rps mask empty by default.
To avoid bloating the netns with a possibly large cpumask, allocate
it on-demand during the first write operation.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter/IPVS updates for net-next
The following patchset contains Netfilter updates for net-next:
1) Add safeguard to check for NULL tupe in objects updates via
NFT_MSG_NEWOBJ, this should not ever happen. From Alok Tiwari.
2) Incorrect pointer check in the new destroy rule command,
from Yang Yingliang.
3) Incorrect status bitcheck in nf_conntrack_udp_packet(),
from Florian Westphal.
4) Simplify seq_print_acct(), from Ilia Gavrilov.
5) Use 2-arg optimal variant of kfree_rcu() in IPVS,
from Julian Anastasov.
6) TCP connection enters CLOSE state in conntrack for locally
originated TCP reset packet from the reject target,
from Florian Westphal.
The fixes#2 and #3 in this series address issues from the previous pull
nf-next request in this net-next cycle.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
iptables/nftables support responding to tcp packets with tcp resets.
The generated tcp reset packet passes through both output and postrouting
netfilter hooks, but conntrack will never see them because the generated
skb has its ->nfct pointer copied over from the packet that triggered the
reset rule.
If the reset rule is used for established connections, this
may result in the conntrack entry to be around for a very long
time (default timeout is 5 days).
One way to avoid this would be to not copy the nf_conn pointer
so that the rest packet passes through conntrack too.
Problem is that output rules might not have the same conntrack
zone setup as the prerouting ones, so its possible that the
reset skb won't find the correct entry. Generating a template
entry for the skb seems error prone as well.
Add an explicit "closing" function that switches a confirmed
conntrack entry to closed state and wire this up for tcp.
If the entry isn't confirmed, no action is needed because
the conntrack entry will never be committed to the table.
Reported-by: Russel King <linux@armlinux.org.uk>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pull drm fixes from Dave Airlie:
"Just a final collection of misc fixes, the biggest disables the
recently added dynamic debugging support, it has a regression that
needs some bigger fixes.
Otherwise a bunch of fixes across the board, vc4, amdgpu and vmwgfx
mostly, with some smaller i915 and ast fixes.
drm:
- dynamic debug disable for now
fbdev:
- deferred i/o device close fix
amdgpu:
- Fix GC11.x suspend warning
- Fix display warning
vc4:
- YUV planes fix
- hdmi display fix
- crtc reduced blanking fix
ast:
- fix start address computation
vmwgfx:
- fix bo/handle races
i915:
- gen11 WA fix"
* tag 'drm-fixes-2023-02-17' of git://anongit.freedesktop.org/drm/drm:
drm/amd/display: Fail atomic_check early on normalize_zpos error
drm/amd/amdgpu: fix warning during suspend
drm/vmwgfx: Do not drop the reference to the handle too soon
drm/vmwgfx: Stop accessing buffer objects which failed init
drm/i915/gen11: Wa_1408615072/Wa_1407596294 should be on GT list
drm: Disable dynamic debug as broken
drm/ast: Fix start address computation
fbdev: Fix invalid page access after closing deferred I/O devices
drm/vc4: crtc: Increase setup cost in core clock calculation to handle extreme reduced blanking
drm/vc4: hdmi: Always enable GCP with AVMUTE cleared
drm/vc4: Fix YUV plane handling when planes are in different buffers
Pull networking fixes from Jakub Kicinski:
"Fixes from the main networking tree only, probably because all
sub-trees have backed off and haven't submitted their changes.
None of the fixes here are particularly scary and no outstanding
regressions. In an ideal world the "current release" sections would be
empty at this stage but that never happens.
Current release - regressions:
- fix unwanted sign extension in netdev_stats_to_stats64()
Current release - new code bugs:
- initialize net->notrefcnt_tracker earlier
- devlink: fix netdev notifier chain corruption
- nfp: make sure mbox accesses in IPsec code are atomic
- ice: fix check for weight and priority of a scheduling node
Previous releases - regressions:
- ice: xsk: fix cleaning of XDP_TX frame, prevent inf loop
- igb: fix I2C bit banging config with external thermal sensor
Previous releases - always broken:
- sched: tcindex: update imperfect hash filters respecting rcu
- mpls: fix stale pointer if allocation fails during device rename
- dccp/tcp: avoid negative sk_forward_alloc by ipv6_pinfo.pktoptions
- remove WARN_ON_ONCE(sk->sk_forward_alloc) from
sk_stream_kill_queues()
- af_key: fix heap information leak
- ipv6: fix socket connection with DSCP (correct interpretation of
the tclass field vs fib rule matching)
- tipc: fix kernel warning when sending SYN message
- vmxnet3: read RSS information from the correct descriptor (eop)"
* tag 'net-6.2-final' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (35 commits)
devlink: Fix netdev notifier chain corruption
igb: conditionalize I2C bit banging on external thermal sensor support
net: mpls: fix stale pointer if allocation fails during device rename
net/sched: tcindex: search key must be 16 bits
tipc: fix kernel warning when sending SYN message
igb: Fix PPS input and output using 3rd and 4th SDP
net: use a bounce buffer for copying skb->mark
ixgbe: add double of VLAN header when computing the max MTU
i40e: add double of VLAN header when computing the max MTU
ixgbe: allow to increase MTU to 3K with XDP enabled
net: stmmac: Restrict warning on disabling DMA store and fwd mode
net/sched: act_ctinfo: use percpu stats
net: stmmac: fix order of dwmac5 FlexPPS parametrization sequence
ice: fix lost multicast packets in promisc mode
ice: Fix check for weight and priority of a scheduling node
bnxt_en: Fix mqprio and XDP ring checking logic
net: Fix unwanted sign extension in netdev_stats_to_stats64()
net/usb: kalmia: Don't pass act_len in usb_bulk_msg error path
net: openvswitch: fix possible memory leak in ovs_meter_cmd_set()
af_key: Fix heap information leak
...
Johannes Berg says:
====================
Major stack changes:
* EHT channel puncturing support (client & AP)
* some support for AP MLD without mac80211
* fixes for A-MSDU on mesh connections
Major driver changes:
iwlwifi
* EHT rate reporting
* Bump FW API to 74 for AX devices
* STEP equalizer support: transfer some STEP (connection to radio
on platforms with integrated wifi) related parameters from the
BIOS to the firmware
mt76
* switch to using page pool allocator
* mt7996 EHT (Wi-Fi 7) support
* Wireless Ethernet Dispatch (WED) reset support
libertas
* WPS enrollee support
brcmfmac
* Rename Cypress 89459 to BCM4355
* BCM4355 and BCM4377 support
mwifiex
* SD8978 chipset support
rtl8xxxu
* LED support
ath12k
* new driver for Qualcomm Wi-Fi 7 devices
ath11k
* IPQ5018 support
* Fine Timing Measurement (FTM) responder role support
* channel 177 support
ath10k
* store WLAN firmware version in SMEM image table
* tag 'wireless-next-2023-03-16' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (207 commits)
wifi: brcmfmac: p2p: Introduce generic flexible array frame member
wifi: mac80211: add documentation for amsdu_mesh_control
wifi: cfg80211: remove gfp parameter from cfg80211_obss_color_collision_notify description
wifi: mac80211: always initialize link_sta with sta
wifi: mac80211: pass 'sta' to ieee80211_rx_data_set_sta()
wifi: cfg80211: Set SSID if it is not already set
wifi: rtw89: move H2C of del_pkt_offload before polling FW status ready
wifi: rtw89: use readable return 0 in rtw89_mac_cfg_ppdu_status()
wifi: rtw88: usb: drop now unnecessary URB size check
wifi: rtw88: usb: send Zero length packets if necessary
wifi: rtw88: usb: Set qsel correctly
wifi: mac80211: fix off-by-one link setting
wifi: mac80211: Fix for Rx fragmented action frames
wifi: mac80211: avoid u32_encode_bits() warning
wifi: mac80211: Don't translate MLD addresses for multicast
wifi: cfg80211: call reg_notifier for self managed wiphy from driver hint
wifi: cfg80211: get rid of gfp in cfg80211_bss_color_notify
wifi: nl80211: Allow authentication frames and set keys on NAN interface
wifi: mac80211: fix non-MLO station association
wifi: mac80211: Allow NSS change only up to capability
...
====================
Link: https://lore.kernel.org/r/20230216105406.208416-1-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Cited commit changed devlink to register its netdev notifier block on
the global netdev notifier chain instead of on the per network namespace
one.
However, when changing the network namespace of the devlink instance,
devlink still tries to unregister its notifier block from the chain of
the old namespace and register it on the chain of the new namespace.
This results in corruption of the notifier chains, as the same notifier
block is registered on two different chains: The global one and the per
network namespace one. In turn, this causes other problems such as the
inability to dismantle namespaces due to netdev reference count issues.
Fix by preventing devlink from moving its notifier block between
namespaces.
Reproducer:
# echo "10 1" > /sys/bus/netdevsim/new_device
# ip netns add test123
# devlink dev reload netdevsim/netdevsim10 netns test123
# ip netns del test123
[ 71.935619] unregister_netdevice: waiting for lo to become free. Usage count = 2
[ 71.938348] leaked reference.
Fixes: 565b4824c3 ("devlink: change port event netdev notifier from per-net to global")
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230215073139.1360108-1-idosch@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Currently the freed element in bpf memory allocator may be immediately
reused, for htab map the reuse will reinitialize special fields in map
value (e.g., bpf_spin_lock), but lookup procedure may still access
these special fields, and it may lead to hard-lockup as shown below:
NMI backtrace for cpu 16
CPU: 16 PID: 2574 Comm: htab.bin Tainted: G L 6.1.0+ #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
RIP: 0010:queued_spin_lock_slowpath+0x283/0x2c0
......
Call Trace:
<TASK>
copy_map_value_locked+0xb7/0x170
bpf_map_copy_value+0x113/0x3c0
__sys_bpf+0x1c67/0x2780
__x64_sys_bpf+0x1c/0x20
do_syscall_64+0x30/0x60
entry_SYSCALL_64_after_hwframe+0x46/0xb0
......
</TASK>
For htab map, just like the preallocated case, these is no need to
initialize these special fields in map value again once these fields
have been initialized. For preallocated htab map, these fields are
initialized through __GFP_ZERO in bpf_map_area_alloc(), so do the
similar thing for non-preallocated htab in bpf memory allocator. And
there is no need to use __GFP_ZERO for per-cpu bpf memory allocator,
because __alloc_percpu_gfp() does it implicitly.
Fixes: 0fd7c5d433 ("bpf: Optimize call_rcu in non-preallocated hash map.")
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20230215082132.3856544-2-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Add IPSec flow steering priorities in RDMA namespaces. This allows
adding tables/rules to forward RoCEv2 traffic to the IPSec crypto
tables in NIC_TX domain, and accept RoCEv2 traffic from NIC_RX domain.
Signed-off-by: Mark Zhang <markzhang@nvidia.com>
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Maor Gottlieb <maorg@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Implement new destination type to support flow transition between
different table types.
e.g. from NIC_RX to RDMA_RX or from RDMA_TX to NIC_TX.
The new destination is described in the tracepoint as follows:
"mlx5_fs_add_rule: rule=00000000d53cd0ed fte=0000000048a8a6ed index=0 sw_action=<> [dst] flow_table_type=7 id:262152"
Signed-off-by: Mark Zhang <markzhang@nvidia.com>
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
This new destination type supports flow transition between different
table types, e.g. from NIC_RX to RDMA_RX or from RDMA_TX to NIC_TX.
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
No need to have dl_port which is tightly coupled with mlx5e code
in mlx5 core code. Move it to struct mlx5e_dev and loose
mlx5e_devlink_get_dl_port() helper.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
In MultiPort E-Switch mode a single RDMA is created. This device has multiple
RDMA ports that represent the uplink ports that are connected to the E-Switch.
Account for this when creating the RDMA device so it has an additional port for
the non native uplink.
As a side effect of this patch, use shared fdb in multiport eswitch mode.
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
In a follow-up commit multiport eswitch mode will use a shared fdb.
In shared fdb there is a single eswitch fdb and traffic could come from any
port. to distinguish between the ports set a different metadata per uplink port.
Signed-off-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Maor Dickman <maord@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Extend the action stats callback implementation to update stats for actions
that are associated with hw counters.
Note that the callback may be called from tc action utility or from tc
flower. Both apis expect the driver to return the stats difference from
the last update. As such, query the raw counter value and maintain
the diff from the last api call in the tc layer, instead of the fs_core
layer.
Signed-off-by: Oz Shlomo <ozsh@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
pskb_may_pull() can fail for two different reasons.
Provide pskb_may_pull_reason() helper to distinguish
between these reasons.
It returns:
SKB_NOT_DROPPED_YET : Success
SKB_DROP_REASON_PKT_TOO_SMALL : packet too small
SKB_DROP_REASON_NOMEM : skb->head could not be resized
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch adds special BPF_RB_{ROOT,NODE} btf_field_types similar to
BPF_LIST_{HEAD,NODE}, adds the necessary plumbing to detect the new
types, and adds bpf_rb_root_free function for freeing bpf_rb_root in
map_values.
structs bpf_rb_root and bpf_rb_node are opaque types meant to
obscure structs rb_root_cached rb_node, respectively.
btf_struct_access will prevent BPF programs from touching these special
fields automatically now that they're recognized.
btf_check_and_fixup_fields now groups list_head and rb_root together as
"graph root" fields and {list,rb}_node as "graph node", and does same
ownership cycle checking as before. Note that this function does _not_
prevent ownership type mixups (e.g. rb_root owning list_node) - that's
handled by btf_parse_graph_root.
After this patch, a bpf program can have a struct bpf_rb_root in a
map_value, but not add anything to nor do anything useful with it.
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Link: https://lore.kernel.org/r/20230214004017.2534011-2-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Pull misc fixes from Andrew Morton:
"Twelve hotfixes, mostly against mm/.
Five of these fixes are cc:stable"
* tag 'mm-hotfixes-stable-2023-02-13-13-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
of: reserved_mem: Have kmemleak ignore dynamically allocated reserved mem
scripts/gdb: fix 'lx-current' for x86
lib: parser: optimize match_NUMBER apis to use local array
mm: shrinkers: fix deadlock in shrinker debugfs
mm: hwpoison: support recovery from ksm_might_need_to_copy()
kasan: fix Oops due to missing calls to kasan_arch_is_ready()
revert "squashfs: harden sanity check in squashfs_read_xattr_id_table"
fsdax: dax_unshare_iter() should return a valid length
mm/gup: add folio to list when folio_isolate_lru() succeed
aio: fix mremap after fork null-deref
mailmap: add entry for Alexander Mikhalitsyn
mm: extend max struct page size for kmsan
This patch introduces non-owning reference semantics to the verifier,
specifically linked_list API kfunc handling. release_on_unlock logic for
refs is refactored - with small functional changes - to implement these
semantics, and bpf_list_push_{front,back} are migrated to use them.
When a list node is pushed to a list, the program still has a pointer to
the node:
n = bpf_obj_new(typeof(*n));
bpf_spin_lock(&l);
bpf_list_push_back(&l, n);
/* n still points to the just-added node */
bpf_spin_unlock(&l);
What the verifier considers n to be after the push, and thus what can be
done with n, are changed by this patch.
Common properties both before/after this patch:
* After push, n is only a valid reference to the node until end of
critical section
* After push, n cannot be pushed to any list
* After push, the program can read the node's fields using n
Before:
* After push, n retains the ref_obj_id which it received on
bpf_obj_new, but the associated bpf_reference_state's
release_on_unlock field is set to true
* release_on_unlock field and associated logic is used to implement
"n is only a valid ref until end of critical section"
* After push, n cannot be written to, the node must be removed from
the list before writing to its fields
* After push, n is marked PTR_UNTRUSTED
After:
* After push, n's ref is released and ref_obj_id set to 0. NON_OWN_REF
type flag is added to reg's type, indicating that it's a non-owning
reference.
* NON_OWN_REF flag and logic is used to implement "n is only a
valid ref until end of critical section"
* n can be written to (except for special fields e.g. bpf_list_node,
timer, ...)
Summary of specific implementation changes to achieve the above:
* release_on_unlock field, ref_set_release_on_unlock helper, and logic
to "release on unlock" based on that field are removed
* The anonymous active_lock struct used by bpf_verifier_state is
pulled out into a named struct bpf_active_lock.
* NON_OWN_REF type flag is introduced along with verifier logic
changes to handle non-owning refs
* Helpers are added to use NON_OWN_REF flag to implement non-owning
ref semantics as described above
* invalidate_non_owning_refs - helper to clobber all non-owning refs
matching a particular bpf_active_lock identity. Replaces
release_on_unlock logic in process_spin_lock.
* ref_set_non_owning - set NON_OWN_REF type flag after doing some
sanity checking
* ref_convert_owning_non_owning - convert owning reference w/
specified ref_obj_id to non-owning references. Set NON_OWN_REF
flag for each reg with that ref_obj_id and 0-out its ref_obj_id
* Update linked_list selftests to account for minor semantic
differences introduced by this patch
* Writes to a release_on_unlock node ref are not allowed, while
writes to non-owning reference pointees are. As a result the
linked_list "write after push" failure tests are no longer scenarios
that should fail.
* The test##missing_lock##op and test##incorrect_lock##op
macro-generated failure tests need to have a valid node argument in
order to have the same error output as before. Otherwise
verification will fail early and the expected error output won't be seen.
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Link: https://lore.kernel.org/r/20230212092715.1422619-2-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The Marvell SD8978 (aka NXP IW416) uses identical registers as SD8987,
so reuse the existing mwifiex_reg_sd8987 definition.
Note that mwifiex_reg_sd8977 and mwifiex_reg_sd8997 are likewise
identical, save for the fw_dump_ctrl register: They define it as 0xf0
whereas mwifiex_reg_sd8987 defines it as 0xf9. I've verified that
0xf9 is the correct value on SD8978. NXP's out-of-tree driver uses
0xf9 for all of them, so there's a chance that 0xf0 is not correct
in the mwifiex_reg_sd8977 and mwifiex_reg_sd8997 definitions. I cannot
test that for lack of hardware, hence am leaving it as is.
NXP has only released a firmware which runs Bluetooth over UART.
Perhaps Bluetooth over SDIO is unsupported by this chipset.
Consequently, only an "sdiouart" firmware image is referenced, not an
alternative "sdsd" image.
Signed-off-by: Lukas Wunner <lukas@wunner.de>
Signed-off-by: Kalle Valo <kvalo@kernel.org>
Link: https://lore.kernel.org/r/536b4f17a72ca460ad1b07045757043fb0778988.1674827105.git.lukas@wunner.de
Add replacement for phy_ethtool_get/set_eee() functions.
Current phy_ethtool_get/set_eee() implementation is great and it is
possible to make it even better:
- this functionality is for devices implementing parts of IEEE 802.3
specification beyond Clause 22. The better place for this code is
phy-c45.c
- currently it is able to do read/write operations on PHYs with
different abilities to not existing registers. It is better to
use stored supported_eee abilities to avoid false read/write
operations.
- the eee_active detection will provide wrong results on not supported
link modes. It is better to validate speed/duplex properties against
supported EEE link modes.
- it is able to support only limited amount of link modes. We have more
EEE link modes...
By refactoring this code I address most of this point except of the last
one. Adding additional EEE link modes will need more work.
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add generic function for EEE abilities defined by IEEE 802.3
specification. For now following registers are supported:
- IEEE 802.3-2018 45.2.3.10 EEE control and capability 1 (Register 3.20)
- IEEE 802.3cg-2019 45.2.1.186b 10BASE-T1L PMA status register
(Register 1.2295)
Since I was not able to find any flag signaling support of these
registers, we should detect link mode abilities first and then based on
these abilities doing EEE link modes detection.
Results of EEE ability detection will be stored into new variable
phydev->supported_eee.
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Similar to what was done for TX_PUSH, add an RX_PUSH concept
to the ethtool interfaces.
Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull tracing fix from Steven Rostedt:
"Fix showing of TASK_COMM_LEN instead of its value
The TASK_COMM_LEN was converted from a macro into an enum so that BTF
would have access to it. But this unfortunately caused TASK_COMM_LEN
to display in the format fields of trace events, as they are created
by the TRACE_EVENT() macro and such, macros convert to their values,
where as enums do not.
To handle this, instead of using the field itself to be display, save
the value of the array size as another field in the trace_event_fields
structure, and use that instead.
Not only does this fix the issue, but also converts the other trace
events that have this same problem (but were not breaking tooling).
With this change, the original work around b3bc8547d3 ("tracing:
Have TRACE_DEFINE_ENUM affect trace event types as well") could be
reverted (but that should be done in the merge window)"
* tag 'trace-v6.2-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing: Fix TASK_COMM_LEN in trace event format file
Introduce new helper bpf_map_kvcalloc() for the memory allocation in
bpf_local_storage(). Then the allocation will charge the memory from the
map instead of from current, though currently they are the same thing as
it is only used in map creation path now. By charging map's memory into
the memcg from the map, it will be more clear.
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Link: https://lore.kernel.org/r/20230210154734.4416-3-laoar.shao@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
====================
pull-request: bpf-next 2023-02-11
We've added 96 non-merge commits during the last 14 day(s) which contain
a total of 152 files changed, 4884 insertions(+), 962 deletions(-).
There is a minor conflict in drivers/net/ethernet/intel/ice/ice_main.c
between commit 5b246e533d ("ice: split probe into smaller functions")
from the net-next tree and commit 66c0e13ad2 ("drivers: net: turn on
XDP features") from the bpf-next tree. Remove the hunk given ice_cfg_netdev()
is otherwise there a 2nd time, and add XDP features to the existing
ice_cfg_netdev() one:
[...]
ice_set_netdev_features(netdev);
netdev->xdp_features = NETDEV_XDP_ACT_BASIC | NETDEV_XDP_ACT_REDIRECT |
NETDEV_XDP_ACT_XSK_ZEROCOPY;
ice_set_ops(netdev);
[...]
Stephen's merge conflict mail:
https://lore.kernel.org/bpf/20230207101951.21a114fa@canb.auug.org.au/
The main changes are:
1) Add support for BPF trampoline on s390x which finally allows to remove many
test cases from the BPF CI's DENYLIST.s390x, from Ilya Leoshkevich.
2) Add multi-buffer XDP support to ice driver, from Maciej Fijalkowski.
3) Add capability to export the XDP features supported by the NIC.
Along with that, add a XDP compliance test tool,
from Lorenzo Bianconi & Marek Majtyka.
4) Add __bpf_kfunc tag for marking kernel functions as kfuncs,
from David Vernet.
5) Add a deep dive documentation about the verifier's register
liveness tracking algorithm, from Eduard Zingerman.
6) Fix and follow-up cleanups for resolve_btfids to be compiled
as a host program to avoid cross compile issues,
from Jiri Olsa & Ian Rogers.
7) Batch of fixes to the BPF selftest for xdp_hw_metadata which resulted
when testing on different NICs, from Jesper Dangaard Brouer.
8) Fix libbpf to better detect kernel version code on Debian, from Hao Xiang.
9) Extend libbpf to add an option for when the perf buffer should
wake up, from Jon Doron.
10) Follow-up fix on xdp_metadata selftest to just consume on TX
completion, from Stanislav Fomichev.
11) Extend the kfuncs.rst document with description on kfunc
lifecycle & stability expectations, from David Vernet.
12) Fix bpftool prog profile to skip attaching to offline CPUs,
from Tonghao Zhang.
====================
Link: https://lore.kernel.org/r/20230211002037.8489-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When a fbdev with deferred I/O is once opened and closed, the dirty
pages still remain queued in the pageref list, and eventually later
those may be processed in the delayed work. This may lead to a
corruption of pages, hitting an Oops.
This patch makes sure to cancel the delayed work and clean up the
pageref list at closing the device for addressing the bug. A part of
the cleanup code is factored out as a new helper function that is
called from the common fb_release().
Reviewed-by: Patrik Jakobsson <patrik.r.jakobsson@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Tested-by: Miko Larsson <mikoxyzzz@gmail.com>
Fixes: 56c134f7f1 ("fbdev: Track deferred-I/O pages in pageref struct")
Reviewed-by: Thomas Zimmermann <tzimmermann@suse.de>
Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
Link: https://patchwork.freedesktop.org/patch/msgid/20230129082856.22113-1-tiwai@suse.de
skbuff_head_cache is misnamed (perhaps for historical reasons?)
because it does not hold heads. Head is the buffer which skb->data
points to, and also where shinfo lives. struct sk_buff is a metadata
structure, not the head.
Eric recently added skb_small_head_cache (which allocates actual
head buffers), let that serve as an excuse to finally clean this up :)
Leave the user-space visible name intact, it could possibly be uAPI.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If RPS is enabled, this allows configuring a default rps
mask, which is effective since receive queue creation time.
A default RPS mask allows the system admin to ensure proper
isolation, avoiding races at network namespace or device
creation time.
The default RPS mask is initially empty, and can be
modified via a newly added sysctl entry.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The debugfs_remove_recursive() is invoked by unregister_shrinker(), which
is holding the write lock of shrinker_rwsem. It will waits for the
handler of debugfs file complete. The handler also needs to hold the read
lock of shrinker_rwsem to do something. So it may cause the following
deadlock:
CPU0 CPU1
debugfs_file_get()
shrinker_debugfs_count_show()/shrinker_debugfs_scan_write()
unregister_shrinker()
--> down_write(&shrinker_rwsem);
debugfs_remove_recursive()
// wait for (A)
--> wait_for_completion();
// wait for (B)
--> down_read_killable(&shrinker_rwsem)
debugfs_file_put() -- (A)
up_write() -- (B)
The down_read_killable() can be killed, so that the above deadlock can be
recovered. But it still requires an extra kill action, otherwise it will
block all subsequent shrinker-related operations, so it's better to fix
it.
[akpm@linux-foundation.org: fix CONFIG_SHRINKER_DEBUG=n stub]
Link: https://lkml.kernel.org/r/20230202105612.64641-1-zhengqi.arch@bytedance.com
Fixes: 5035ebc644 ("mm: shrinkers: introduce debugfs interface for memory shrinkers")
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Pull networking fixes from Paolo Abeni:
"Including fixes from can and ipsec subtrees.
Current release - regressions:
- sched: fix off by one in htb_activate_prios()
- eth: mana: fix accessing freed irq affinity_hint
- eth: ice: fix out-of-bounds KASAN warning in virtchnl
Current release - new code bugs:
- eth: mtk_eth_soc: enable special tag when any MAC uses DSA
Previous releases - always broken:
- core: fix sk->sk_txrehash default
- neigh: make sure used and confirmed times are valid
- mptcp: be careful on subflow status propagation on errors
- xfrm: prevent potential spectre v1 gadget in xfrm_xlate32_attr()
- phylink: move phy_device_free() to correctly release phy device
- eth: mlx5:
- fix crash unsetting rx-vlan-filter in switchdev mode
- fix hang on firmware reset
- serialize module cleanup with reload and remove"
* tag 'net-6.2-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (57 commits)
selftests: forwarding: lib: quote the sysctl values
net: mscc: ocelot: fix all IPv6 getting trapped to CPU when PTP timestamping is used
rds: rds_rm_zerocopy_callback() use list_first_entry()
net: txgbe: Update support email address
selftests: Fix failing VXLAN VNI filtering test
selftests: mptcp: stop tests earlier
selftests: mptcp: allow more slack for slow test-case
mptcp: be careful on subflow status propagation on errors
mptcp: fix locking for in-kernel listener creation
mptcp: fix locking for setsockopt corner-case
mptcp: do not wait for bare sockets' timeout
net: ethernet: mtk_eth_soc: fix DSA TX tag hwaccel for switch port 0
nfp: ethtool: fix the bug of setting unsupported port speed
txhash: fix sk->sk_txrehash default
net: ethernet: mtk_eth_soc: fix wrong parameters order in __xdp_rxq_info_reg()
net: ethernet: mtk_eth_soc: enable special tag when any MAC uses DSA
net: sched: sch: Fix off by one in htb_activate_prios()
igc: Add ndo_tx_timeout support
net: mana: Fix accessing freed irq affinity_hint
hv_netvsc: Allocate memory in netvsc_dma_map() with GFP_ATOMIC
...
Marc Kleine-Budde says:
====================
can-next 2023-02-08
The 1st patch is by Oliver Hartkopp and cleans up the CAN_RAW's
raw_setsockopt() for CAN_RAW_FD_FRAMES.
The 2nd patch is by me and fixes the compilation if
CONFIG_CAN_CALC_BITTIMING is disabled. (Problem introduced in last
pull request to next-next.)
* tag 'linux-can-next-for-6.3-20230208' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next:
can: bittiming: can_calc_bittiming(): add missing parameter to no-op function
can: raw: use temp variable instead of rolling back config
====================
Link: https://lore.kernel.org/r/20230208210014.3169347-1-mkl@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Saeed Mahameed says:
====================
mlx5-next-netdev-deadlock
This series from Jiri solves a deadlock when removing a network namespace
with mlx5 devlink instance being in it.
The deadlock is between:
1) mlx5_ib->unregister_netdevice_notifier()
AND
2) mlx5_core->devlink_reload->cleanup_net()
To slove this introduced mlx5 netdev added/removed events to track uplink
netdev to be used for register_netdevice_notifier_dev_net() purposes.
* tag 'mlx5-next-netdev-deadlock' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux:
RDMA/mlx5: Track netdev to avoid deadlock during netdev notifier unregister
net/mlx5e: Propagate an internal event in case uplink netdev changes
net/mlx5e: Fix trap event handling
net/mlx5: Introduce CQE error syndrome
====================
Link: https://lore.kernel.org/r/20230208005626.72930-1-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Whenever uplink netdev is set/cleared, propagate newly introduced event
to inform notifier blocks netdev was added/removed.
Move the set() helper to core.c from header, introduce clear() and
netdev_added_event_replay() helpers. The last one is going to be called
from rdma driver, so export it.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
In commit 286c0e09e8 ("can: bittiming: can_changelink() pass extack
down callstack") a new parameter was added to can_calc_bittiming(),
however the static inline no-op (which is used if
CONFIG_CAN_CALC_BITTIMING is disabled) wasn't converted.
Add the new parameter to the static inline no-op of
can_calc_bittiming().
Fixes: 286c0e09e8 ("can: bittiming: can_changelink() pass extack down callstack")
Reported-by: kernel test robot <lkp@intel.com>
Link: https://lore.kernel.org/20230207201734.2905618-1-mkl@pengutronix.de
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>