The TCP option parser in cake qdisc (cake_get_tcpopt and
cake_tcph_may_drop) could read one byte out of bounds. When the length
is 1, the execution flow gets into the loop, reads one byte of the
opcode, and if the opcode is neither TCPOPT_EOL nor TCPOPT_NOP, it reads
one more byte, which exceeds the length of 1.
This fix is inspired by commit 9609dad263 ("ipv4: tcp_input: fix stack
out of bounds when parsing TCP options.").
v2 changes:
Added doff validation in cake_get_tcphdr to avoid parsing garbage as TCP
header. Although it wasn't strictly an out-of-bounds access (memory was
allocated), garbage values could be read where CAKE expected the TCP
header if doff was smaller than 5.
Cc: Young Xiao <92siuyang@gmail.com>
Fixes: 8b7138814f ("sch_cake: Add optional ACK filter")
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Acked-by: Toke Høiland-Jørgensen <toke@toke.dk>
Signed-off-by: David S. Miller <davem@davemloft.net>
The TCP option parser in mptcp (mptcp_get_options) could read one byte
out of bounds. When the length is 1, the execution flow gets into the
loop, reads one byte of the opcode, and if the opcode is neither
TCPOPT_EOL nor TCPOPT_NOP, it reads one more byte, which exceeds the
length of 1.
This fix is inspired by commit 9609dad263 ("ipv4: tcp_input: fix stack
out of bounds when parsing TCP options.").
Cc: Young Xiao <92siuyang@gmail.com>
Fixes: cec37a6e41 ("mptcp: Handle MP_CAPABLE options for outgoing connections")
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The TCP option parser in synproxy (synproxy_parse_options) could read
one byte out of bounds. When the length is 1, the execution flow gets
into the loop, reads one byte of the opcode, and if the opcode is
neither TCPOPT_EOL nor TCPOPT_NOP, it reads one more byte, which exceeds
the length of 1.
This fix is inspired by commit 9609dad263 ("ipv4: tcp_input: fix stack
out of bounds when parsing TCP options.").
v2 changes:
Added an early return when length < 0 to avoid calling
skb_header_pointer with negative length.
Cc: Young Xiao <92siuyang@gmail.com>
Fixes: 48b1de4c11 ("netfilter: add SYNPROXY core/target")
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is a known race in packet_sendmsg(), addressed
in commit 32d3182cd2 ("net/packet: fix race in tpacket_snd()")
Now we have data_race(), we can use it to avoid a future KCSAN warning,
as syzbot loves stressing af_packet sockets :)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
UDP sendmsg() path can be lockless, it is possible for another
thread to re-connect an change sk->sk_txhash under us.
There is no serious impact, but we can use READ_ONCE()/WRITE_ONCE()
pair to document the race.
BUG: KCSAN: data-race in __ip4_datagram_connect / skb_set_owner_w
write to 0xffff88813397920c of 4 bytes by task 30997 on cpu 1:
sk_set_txhash include/net/sock.h:1937 [inline]
__ip4_datagram_connect+0x69e/0x710 net/ipv4/datagram.c:75
__ip6_datagram_connect+0x551/0x840 net/ipv6/datagram.c:189
ip6_datagram_connect+0x2a/0x40 net/ipv6/datagram.c:272
inet_dgram_connect+0xfd/0x180 net/ipv4/af_inet.c:580
__sys_connect_file net/socket.c:1837 [inline]
__sys_connect+0x245/0x280 net/socket.c:1854
__do_sys_connect net/socket.c:1864 [inline]
__se_sys_connect net/socket.c:1861 [inline]
__x64_sys_connect+0x3d/0x50 net/socket.c:1861
do_syscall_64+0x4a/0x90 arch/x86/entry/common.c:47
entry_SYSCALL_64_after_hwframe+0x44/0xae
read to 0xffff88813397920c of 4 bytes by task 31039 on cpu 0:
skb_set_hash_from_sk include/net/sock.h:2211 [inline]
skb_set_owner_w+0x118/0x220 net/core/sock.c:2101
sock_alloc_send_pskb+0x452/0x4e0 net/core/sock.c:2359
sock_alloc_send_skb+0x2d/0x40 net/core/sock.c:2373
__ip6_append_data+0x1743/0x21a0 net/ipv6/ip6_output.c:1621
ip6_make_skb+0x258/0x420 net/ipv6/ip6_output.c:1983
udpv6_sendmsg+0x160a/0x16b0 net/ipv6/udp.c:1527
inet6_sendmsg+0x5f/0x80 net/ipv6/af_inet6.c:642
sock_sendmsg_nosec net/socket.c:654 [inline]
sock_sendmsg net/socket.c:674 [inline]
____sys_sendmsg+0x360/0x4d0 net/socket.c:2350
___sys_sendmsg net/socket.c:2404 [inline]
__sys_sendmmsg+0x315/0x4b0 net/socket.c:2490
__do_sys_sendmmsg net/socket.c:2519 [inline]
__se_sys_sendmmsg net/socket.c:2516 [inline]
__x64_sys_sendmmsg+0x53/0x60 net/socket.c:2516
do_syscall_64+0x4a/0x90 arch/x86/entry/common.c:47
entry_SYSCALL_64_after_hwframe+0x44/0xae
value changed: 0xbca3c43d -> 0xfdb309e0
Reported by Kernel Concurrency Sanitizer on:
CPU: 0 PID: 31039 Comm: syz-executor.2 Not tainted 5.13.0-rc3-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Nikolay Aleksandrov says:
====================
net: bridge: vlan tunnel egress path fixes
These two fixes take care of tunnel_dst problems in the vlan tunnel egress
path. Patch 01 fixes a null ptr deref due to the lockless use of tunnel_dst
pointer without checking it first, and patch 02 fixes a use-after-free
issue due to wrong dst refcounting (dst_clone() -> dst_hold_safe()).
Both fix the same commit and should be queued for stable backports:
Fixes: 11538d039a ("bridge: vlan dst_metadata hooks in ingress and egress paths")
v2: no changes, added stable list to CC
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes a tunnel_dst null pointer dereference due to lockless
access in the tunnel egress path. When deleting a vlan tunnel the
tunnel_dst pointer is set to NULL without waiting a grace period (i.e.
while it's still usable) and packets egressing are dereferencing it
without checking. Use READ/WRITE_ONCE to annotate the lockless use of
tunnel_id, use RCU for accessing tunnel_dst and make sure it is read
only once and checked in the egress path. The dst is already properly RCU
protected so we don't need to do anything fancy than to make sure
tunnel_id and tunnel_dst are read only once and checked in the egress path.
Cc: stable@vger.kernel.org
Fixes: 11538d039a ("bridge: vlan dst_metadata hooks in ingress and egress paths")
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Function 'ping_queue_rcv_skb' not always return success, which will
also return fail. If not check the wrong return value of it, lead to function
`ping_rcv` return success.
Signed-off-by: Zheng Yongjun <zhengyongjun3@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
msg_zerocopy signals if a send operation required copying with a flag
in serr->ee.ee_code.
This field can be incorrect as of the below commit, as a result of
both structs uarg and serr pointing into the same skb->cb[].
uarg->zerocopy must be read before skb->cb[] is reinitialized to hold
serr. Similar to other fields len, hi and lo, use a local variable to
temporarily hold the value.
This was not a problem before, when the value was passed as a function
argument.
Fixes: 75518851a2 ("skbuff: Push status and refcounts into sock_zerocopy_callback")
Reported-by: Talal Ahmad <talalahmad@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The device is able to offload either the outer header csum or inner
header csum. The driver utilizes the inner csum offload. So, prohibit
setting of tx-gre-csum-segmentation and let it be: off[fixed].
Fixes: 2729984149 ("net/mlx5e: Support TSO and TX checksum offloads for GRE tunnels")
Signed-off-by: Aya Levin <ayal@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
The device is able to offload either the outer header csum or inner
header csum. The driver utilizes the inner csum offload. Hence, block
setting of tx-udp_tnl-csum-segmentation and set it to off[fixed].
Fixes: b49663c8fb ("net/mlx5e: Add support for UDP tunnel segmentation with outer checksum offload")
Signed-off-by: Aya Levin <ayal@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
In the scenario described below, an EQ can remain in FIRED state which
can result in missing an interrupt generation.
The scenario:
device mlx5_core driver
------ ----------------
EQ1.eqe generated
EQ1.MSI-X sent
EQ1.state = FIRED
EQ2.eqe generated
mlx5_irq()
polls - eq1_eqes()
arm eq1
polls - eq2_eqes()
arm eq2
EQ2.MSI-X sent
EQ2.state = FIRED
mlx5_irq()
polls - eq2_eqes() -- no eqes found
driver skips EQ arming;
->EQ2 remains fired, misses generating interrupt.
Hence, always arm the EQ by reverting the cited commit in fixes tag.
Fixes: d894892dda ("net/mlx5: Arm only EQs with EQEs")
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Steering packets to PTP-SQ should be done only if the SKB has
SKBTX_HW_TSTAMP set in the tx_flags. While here, take the function into
a header and inline it.
Set the whole condition to select the PTP-SQ to unlikely.
Fixes: 24c22dd091 ("net/mlx5e: Add states to PTP channel")
Signed-off-by: Aya Levin <ayal@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Since the driver opens the PTP-RQ under channel 0, it appears to the
stack as if the SKB was received on rxq0. So from thew stack POV there
are still the same number of RX queues.
Fixes: 960fbfe222 ("net/mlx5e: Allow coexistence of CQE compression and HW TS PTP")
Signed-off-by: Aya Levin <ayal@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
SW steering uses RC QP to write/read to/from ICM, hence it's not
supported when RoCE is not supported as well.
Fixes: 70605ea545 ("net/mlx5: DR, Expose APIs for direct rule managing")
Signed-off-by: Maor Gottlieb <maorg@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Check if RoCE is supported by the device before enable it in
the vport context and create all the RDMA steering objects.
Fixes: 80f09dfc23 ("net/mlx5: Eswitch, enable RoCE loopback traffic")
Signed-off-by: Maor Gottlieb <maorg@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Currently, IPsec feature is disabled because mlx5e_build_nic_netdev
is required to be called after mlx5e_ipsec_init. This requirement is
invalid as mlx5e_build_nic_netdev and mlx5e_ipsec_init initialize
independent resources.
Remove ipsec pointer check in mlx5e_build_nic_netdev so that the
two functions can be called at any order.
Fixes: 547eede070 ("net/mlx5e: IPSec, Innova IPSec offload infrastructure")
Signed-off-by: Huy Nguyen <huyn@nvidia.com>
Reviewed-by: Raed Salem <raeds@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Function mlx5e_rep_neigh_update() wasn't updated to accommodate rtnl lock
removal from TC filter update path and properly handle concurrent encap
entry insertion/deletion which can lead to following use-after-free:
[23827.464923] ==================================================================
[23827.469446] BUG: KASAN: use-after-free in mlx5e_encap_take+0x72/0x140 [mlx5_core]
[23827.470971] Read of size 4 at addr ffff8881d132228c by task kworker/u20:6/21635
[23827.472251]
[23827.472615] CPU: 9 PID: 21635 Comm: kworker/u20:6 Not tainted 5.13.0-rc3+ #5
[23827.473788] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
[23827.475639] Workqueue: mlx5e mlx5e_rep_neigh_update [mlx5_core]
[23827.476731] Call Trace:
[23827.477260] dump_stack+0xbb/0x107
[23827.477906] print_address_description.constprop.0+0x18/0x140
[23827.478896] ? mlx5e_encap_take+0x72/0x140 [mlx5_core]
[23827.479879] ? mlx5e_encap_take+0x72/0x140 [mlx5_core]
[23827.480905] kasan_report.cold+0x7c/0xd8
[23827.481701] ? mlx5e_encap_take+0x72/0x140 [mlx5_core]
[23827.482744] kasan_check_range+0x145/0x1a0
[23827.493112] mlx5e_encap_take+0x72/0x140 [mlx5_core]
[23827.494054] ? mlx5e_tc_tun_encap_info_equal_generic+0x140/0x140 [mlx5_core]
[23827.495296] mlx5e_rep_neigh_update+0x41e/0x5e0 [mlx5_core]
[23827.496338] ? mlx5e_rep_neigh_entry_release+0xb80/0xb80 [mlx5_core]
[23827.497486] ? read_word_at_a_time+0xe/0x20
[23827.498250] ? strscpy+0xa0/0x2a0
[23827.498889] process_one_work+0x8ac/0x14e0
[23827.499638] ? lockdep_hardirqs_on_prepare+0x400/0x400
[23827.500537] ? pwq_dec_nr_in_flight+0x2c0/0x2c0
[23827.501359] ? rwlock_bug.part.0+0x90/0x90
[23827.502116] worker_thread+0x53b/0x1220
[23827.502831] ? process_one_work+0x14e0/0x14e0
[23827.503627] kthread+0x328/0x3f0
[23827.504254] ? _raw_spin_unlock_irq+0x24/0x40
[23827.505065] ? __kthread_bind_mask+0x90/0x90
[23827.505912] ret_from_fork+0x1f/0x30
[23827.506621]
[23827.506987] Allocated by task 28248:
[23827.507694] kasan_save_stack+0x1b/0x40
[23827.508476] __kasan_kmalloc+0x7c/0x90
[23827.509197] mlx5e_attach_encap+0xde1/0x1d40 [mlx5_core]
[23827.510194] mlx5e_tc_add_fdb_flow+0x397/0xc40 [mlx5_core]
[23827.511218] __mlx5e_add_fdb_flow+0x519/0xb30 [mlx5_core]
[23827.512234] mlx5e_configure_flower+0x191c/0x4870 [mlx5_core]
[23827.513298] tc_setup_cb_add+0x1d5/0x420
[23827.514023] fl_hw_replace_filter+0x382/0x6a0 [cls_flower]
[23827.514975] fl_change+0x2ceb/0x4a51 [cls_flower]
[23827.515821] tc_new_tfilter+0x89a/0x2070
[23827.516548] rtnetlink_rcv_msg+0x644/0x8c0
[23827.517300] netlink_rcv_skb+0x11d/0x340
[23827.518021] netlink_unicast+0x42b/0x700
[23827.518742] netlink_sendmsg+0x743/0xc20
[23827.519467] sock_sendmsg+0xb2/0xe0
[23827.520131] ____sys_sendmsg+0x590/0x770
[23827.520851] ___sys_sendmsg+0xd8/0x160
[23827.521552] __sys_sendmsg+0xb7/0x140
[23827.522238] do_syscall_64+0x3a/0x70
[23827.522907] entry_SYSCALL_64_after_hwframe+0x44/0xae
[23827.523797]
[23827.524163] Freed by task 25948:
[23827.524780] kasan_save_stack+0x1b/0x40
[23827.525488] kasan_set_track+0x1c/0x30
[23827.526187] kasan_set_free_info+0x20/0x30
[23827.526968] __kasan_slab_free+0xed/0x130
[23827.527709] slab_free_freelist_hook+0xcf/0x1d0
[23827.528528] kmem_cache_free_bulk+0x33a/0x6e0
[23827.529317] kfree_rcu_work+0x55f/0xb70
[23827.530024] process_one_work+0x8ac/0x14e0
[23827.530770] worker_thread+0x53b/0x1220
[23827.531480] kthread+0x328/0x3f0
[23827.532114] ret_from_fork+0x1f/0x30
[23827.532785]
[23827.533147] Last potentially related work creation:
[23827.534007] kasan_save_stack+0x1b/0x40
[23827.534710] kasan_record_aux_stack+0xab/0xc0
[23827.535492] kvfree_call_rcu+0x31/0x7b0
[23827.536206] mlx5e_tc_del_fdb_flow+0x577/0xef0 [mlx5_core]
[23827.537305] mlx5e_flow_put+0x49/0x80 [mlx5_core]
[23827.538290] mlx5e_delete_flower+0x6d1/0xe60 [mlx5_core]
[23827.539300] tc_setup_cb_destroy+0x18e/0x2f0
[23827.540144] fl_hw_destroy_filter+0x1d2/0x310 [cls_flower]
[23827.541148] __fl_delete+0x4dc/0x660 [cls_flower]
[23827.541985] fl_delete+0x97/0x160 [cls_flower]
[23827.542782] tc_del_tfilter+0x7ab/0x13d0
[23827.543503] rtnetlink_rcv_msg+0x644/0x8c0
[23827.544257] netlink_rcv_skb+0x11d/0x340
[23827.544981] netlink_unicast+0x42b/0x700
[23827.545700] netlink_sendmsg+0x743/0xc20
[23827.546424] sock_sendmsg+0xb2/0xe0
[23827.547084] ____sys_sendmsg+0x590/0x770
[23827.547850] ___sys_sendmsg+0xd8/0x160
[23827.548606] __sys_sendmsg+0xb7/0x140
[23827.549303] do_syscall_64+0x3a/0x70
[23827.549969] entry_SYSCALL_64_after_hwframe+0x44/0xae
[23827.550853]
[23827.551217] The buggy address belongs to the object at ffff8881d1322200
[23827.551217] which belongs to the cache kmalloc-256 of size 256
[23827.553341] The buggy address is located 140 bytes inside of
[23827.553341] 256-byte region [ffff8881d1322200, ffff8881d1322300)
[23827.555747] The buggy address belongs to the page:
[23827.556847] page:00000000898762aa refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x1d1320
[23827.558651] head:00000000898762aa order:2 compound_mapcount:0 compound_pincount:0
[23827.559961] flags: 0x2ffff800010200(slab|head|node=0|zone=2|lastcpupid=0x1ffff)
[23827.561243] raw: 002ffff800010200 dead000000000100 dead000000000122 ffff888100042b40
[23827.562653] raw: 0000000000000000 0000000000200020 00000001ffffffff 0000000000000000
[23827.564112] page dumped because: kasan: bad access detected
[23827.565439]
[23827.565932] Memory state around the buggy address:
[23827.566917] ffff8881d1322180: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[23827.568485] ffff8881d1322200: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[23827.569818] >ffff8881d1322280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[23827.571143] ^
[23827.571879] ffff8881d1322300: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[23827.573283] ffff8881d1322380: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[23827.574654] ==================================================================
Most of the necessary logic is already correctly implemented by
mlx5e_get_next_valid_encap() helper that is used in neigh stats update
handler. Make the handler generic by renaming it to
mlx5e_get_next_matching_encap() and use callback to test whether flow is
matching instead of hardcoded check for 'valid' flag value. Implement
mlx5e_get_next_valid_encap() by calling mlx5e_get_next_matching_encap()
with callback that tests encap MLX5_ENCAP_ENTRY_VALID flag. Implement new
mlx5e_get_next_init_encap() helper by calling
mlx5e_get_next_matching_encap() with callback that tests encap completion
result to be non-error and use it in mlx5e_rep_neigh_update() to safely
iterate over nhe->encap_list.
Remove encap completion logic from mlx5e_rep_update_flows() since the encap
entries passed to this function are already guaranteed to be properly
initialized by similar code in mlx5e_get_next_init_encap().
Fixes: 2a1f1768fa ("net/mlx5e: Refactor neigh update for concurrent execution")
Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
When the code execute 'if (!priv->fs.arfs->wq)', the value of err is 0.
So, we use -ENOMEM to indicate that the function
create_singlethread_workqueue() return NULL.
Clean up smatch warning:
drivers/net/ethernet/mellanox/mlx5/core/en_arfs.c:373
mlx5e_arfs_create_tables() warn: missing error code 'err'.
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Fixes: f6755b80d6 ("net/mlx5e: Dynamic alloc arfs table for netdev when needed")
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Tony Nguyen says:
====================
Intel Wired LAN Driver Updates 2021-06-09
This series contains updates to ice driver only.
Maciej informs the user when XDP is not supported due to the driver
being in the 'safe mode' state. He also adds a parameter to Tx queue
configuration to resolve an issue in configuring XDP queues as it cannot
rely on using the number Tx or Rx queues.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This this the counterpart of 8aa7b526dc ("openvswitch: handle DNAT
tuple collision") for act_ct. From that commit changelog:
"""
With multiple DNAT rules it's possible that after destination
translation the resulting tuples collide.
...
Netfilter handles this case by allocating a null binding for SNAT at
egress by default. Perform the same operation in openvswitch for DNAT
if no explicit SNAT is requested by the user and allocate a null binding
for SNAT for packets in the "original" direction.
"""
Fixes: 95219afbb9 ("act_ct: support asymmetric conntrack")
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Cited commit started returning errors when notification info is not
filled by the bridge driver, resulting in the following regression:
# ip link add name br1 type bridge vlan_filtering 1
# bridge vlan add dev br1 vid 555 self pvid untagged
RTNETLINK answers: Invalid argument
As long as the bridge driver does not fill notification info for the
bridge device itself, an empty notification should not be considered as
an error. This is explained in commit 59ccaaaa49 ("bridge: dont send
notification when skb->len == 0 in rtnl_bridge_notify").
Fix by removing the error and add a comment to avoid future bugs.
Fixes: a8db57c1d2 ("rtnetlink: Fix missing error code in rtnl_bridge_notify()")
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Johannes berg says:
====================
A fair number of fixes:
* fix more fallout from RTNL locking changes
* fixes for some of the bugs found by syzbot
* drop multicast fragments in mac80211 to align
with the spec and what drivers are doing now
* fix NULL-ptr deref in radiotap injection
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Kaustubh reported and diagnosed a panic in udp_lib_lookup().
The root cause is udp_abort() racing with close(). Both
racing functions acquire the socket lock, but udp{v6}_destroy_sock()
release it before performing destructive actions.
We can't easily extend the socket lock scope to avoid the race,
instead use the SOCK_DEAD flag to prevent udp_abort from doing
any action when the critical race happens.
Diagnosed-and-tested-by: Kaustubh Pandey <kapandey@codeaurora.org>
Fixes: 5d77dca828 ("net: diag: support SOCK_DESTROY for UDP sockets")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Both functions are known to be racy when reading inet_num
as we do not want to grab locks for the common case the socket
has been bound already. The race is resolved in inet_autobind()
by reading again inet_num under the socket lock.
syzbot reported:
BUG: KCSAN: data-race in inet_send_prepare / udp_lib_get_port
write to 0xffff88812cba150e of 2 bytes by task 24135 on cpu 0:
udp_lib_get_port+0x4b2/0xe20 net/ipv4/udp.c:308
udp_v6_get_port+0x5e/0x70 net/ipv6/udp.c:89
inet_autobind net/ipv4/af_inet.c:183 [inline]
inet_send_prepare+0xd0/0x210 net/ipv4/af_inet.c:807
inet6_sendmsg+0x29/0x80 net/ipv6/af_inet6.c:639
sock_sendmsg_nosec net/socket.c:654 [inline]
sock_sendmsg net/socket.c:674 [inline]
____sys_sendmsg+0x360/0x4d0 net/socket.c:2350
___sys_sendmsg net/socket.c:2404 [inline]
__sys_sendmmsg+0x315/0x4b0 net/socket.c:2490
__do_sys_sendmmsg net/socket.c:2519 [inline]
__se_sys_sendmmsg net/socket.c:2516 [inline]
__x64_sys_sendmmsg+0x53/0x60 net/socket.c:2516
do_syscall_64+0x4a/0x90 arch/x86/entry/common.c:47
entry_SYSCALL_64_after_hwframe+0x44/0xae
read to 0xffff88812cba150e of 2 bytes by task 24132 on cpu 1:
inet_send_prepare+0x21/0x210 net/ipv4/af_inet.c:806
inet6_sendmsg+0x29/0x80 net/ipv6/af_inet6.c:639
sock_sendmsg_nosec net/socket.c:654 [inline]
sock_sendmsg net/socket.c:674 [inline]
____sys_sendmsg+0x360/0x4d0 net/socket.c:2350
___sys_sendmsg net/socket.c:2404 [inline]
__sys_sendmmsg+0x315/0x4b0 net/socket.c:2490
__do_sys_sendmmsg net/socket.c:2519 [inline]
__se_sys_sendmmsg net/socket.c:2516 [inline]
__x64_sys_sendmmsg+0x53/0x60 net/socket.c:2516
do_syscall_64+0x4a/0x90 arch/x86/entry/common.c:47
entry_SYSCALL_64_after_hwframe+0x44/0xae
value changed: 0x0000 -> 0x9db4
Reported by Kernel Concurrency Sanitizer on:
CPU: 1 PID: 24132 Comm: syz-executor.2 Not tainted 5.13.0-rc4-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Several ethtool functions leave heap uncleared (potentially) by
drivers. This will leave the unused portion of heap unchanged and
might copy the full contents back to userspace.
Signed-off-by: Austin Kim <austindh.kim@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit ae15e0ba1b ("ice: Change number of XDP Tx queues to match
number of Rx queues") tried to address the incorrect setting of XDP
queue count that was based on the Tx queue count, whereas in theory we
should provide the XDP queue per Rx queue. However, the routines that
setup and destroy the set of Tx resources are still based on the
vsi->num_txq.
Ice supports the asynchronous Tx/Rx queue count, so for a setup where
vsi->num_txq > vsi->num_rxq, ice_vsi_stop_tx_rings and ice_vsi_cfg_txqs
will be accessing the vsi->xdp_rings out of the bounds.
Parameterize two mentioned functions so they get the size of Tx resources
array as the input.
Fixes: ae15e0ba1b ("ice: Change number of XDP Tx queues to match number of Rx queues")
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Tested-by: Kiran Bhandare <kiranx.bhandare@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
ice driver requires a programmable pipeline firmware package in order to
have a support for advanced features. Otherwise, driver falls back to so
called 'safe mode'. For that mode, ndo_bpf callback is not exposed and
when user tries to load XDP program, the following happens:
$ sudo ./xdp1 enp179s0f1
libbpf: Kernel error message: Underlying driver does not support XDP in native mode
link set xdp fd failed
which is sort of confusing, as there is a native XDP support, but not in
the current mode. Improve the user experience by providing the specific
ndo_bpf callback dedicated for safe mode which will make use of extack
to explicitly let the user know that the DDP package is missing and
that's the reason that the XDP can't be loaded onto interface currently.
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Fixes: efc2214b60 ("ice: Add support for XDP")
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Tested-by: Kiran Bhandare <kiranx.bhandare@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
When I moved around the code here, I neglected that we could still
call register_netdev() or similar without the wiphy mutex held,
which then calls cfg80211_register_wdev() - that's also done from
cfg80211_register_netdevice(), but the phy80211 symlink creation
was only there. Now, the symlink isn't needed for a *pure* wdev,
but a netdev not registered via cfg80211_register_wdev() should
still have the symlink, so move the creation to the right place.
Cc: stable@vger.kernel.org
Fixes: 2fe8ef1062 ("cfg80211: change netdev registration/unregistration semantics")
Link: https://lore.kernel.org/r/20210608113226.a5dc4c1e488c.Ia42fe663cefe47b0883af78c98f284c5555bbe5d@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This patch fixes TX hangs with threaded NAPI enabled. The scheduled
NAPI seems to be executed in parallel with the interrupt on second
thread. Sometimes it happens that ltq_dma_disable_irq() is executed
after xrx200_tx_housekeeping(). The symptom is that TX interrupts
are disabled in the DMA controller. As a result, the TX hangs after
a few seconds of the iperf test. Scheduling NAPI after disabling
interrupts fixes this issue.
Tested on Lantiq xRX200 (BT Home Hub 5A).
Fixes: 9423361da5 ("net: lantiq: Disable IRQs only if NAPI gets scheduled ")
Signed-off-by: Aleksander Jan Bajkowski <olek2@wp.pl>
Acked-by: Hauke Mehrtens <hauke@hauke-m.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes several bugs found when (DMA/LLQ) mapping a packet for
transmission. The mapping procedure makes the transmitted packet
accessible by the device.
When using LLQ, this requires copying the packet's header to push header
(which would be passed to LLQ) and creating DMA mapping for the payload
(if the packet doesn't fit the maximum push length).
When not using LLQ, we map the whole packet with DMA.
The following bugs are fixed in the code:
1. Add support for non-LLQ machines:
The ena_xdp_tx_map_frame() function assumed that LLQ is
supported, and never mapped the whole packet using DMA. On some
instances, which don't support LLQ, this causes loss of traffic.
2. Wrong DMA buffer length passed to device:
When using LLQ, the first 'tx_max_header_size' bytes of the
packet would be copied to push header. The rest of the packet
would be copied to a DMA'd buffer.
3. Freeing the XDP buffer twice in case of a mapping error:
In case a buffer DMA mapping fails, the function uses
xdp_return_frame_rx_napi() to free the RX buffer and returns from
the function with an error. XDP frames that fail to xmit get
freed by the kernel and so there is no need for this call.
Fixes: 548c4940b9 ("net: ena: Implement XDP_TX action")
Signed-off-by: Shay Agroskin <shayagr@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Because flow control is set up statically in ocelot_init_port(), and not
in phylink_mac_link_up(), what happens is that after the blamed commit,
the flow control remains disabled after the port flushing procedure.
Fixes: eb4733d7cf ("net: dsa: felix: implement port flushing on .phylink_mac_link_down")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Syzbot reported memory leak in rds. The problem
was in unputted refcount in case of error.
int rds_recvmsg(struct socket *sock, struct msghdr *msg, size_t size,
int msg_flags)
{
...
if (!rds_next_incoming(rs, &inc)) {
...
}
After this "if" inc refcount incremented and
if (rds_cmsg_recv(inc, msg, rs)) {
ret = -EFAULT;
goto out;
}
...
out:
return ret;
}
in case of rds_cmsg_recv() fail the refcount won't be
decremented. And it's easy to see from ftrace log, that
rds_inc_addref() don't have rds_inc_put() pair in
rds_recvmsg() after rds_cmsg_recv()
1) | rds_recvmsg() {
1) 3.721 us | rds_inc_addref();
1) 3.853 us | rds_message_inc_copy_to_user();
1) + 10.395 us | rds_cmsg_recv();
1) + 34.260 us | }
Fixes: bdbe6fbc6a ("RDS: recv.c")
Reported-and-tested-by: syzbot+5134cdf021c4ed5aaa5f@syzkaller.appspotmail.com
Signed-off-by: Pavel Skripkin <paskripkin@gmail.com>
Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Simon Wunderlich says:
====================
Here is a batman-adv bugfix:
- Avoid WARN_ON timing related checks, by Sven Eckelmann
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
My initial goal was to fix the default MTU, which is set to 65536, ie above
the maximum defined in the driver: 65535 (ETH_MAX_MTU).
In fact, it's seems more consistent, wrt min_mtu, to set the max_mtu to
IP6_MAX_MTU (65535 + sizeof(struct ipv6hdr)) and use it by default.
Let's also, for consistency, set the mtu in vrf_setup(). This function
calls ether_setup(), which set the mtu to 1500. Thus, the whole mtu config
is done in the same function.
Before the patch:
$ ip link add blue type vrf table 1234
$ ip link list blue
9: blue: <NOARP,MASTER> mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether fa:f5:27:70:24:2a brd ff:ff:ff:ff:ff:ff
$ ip link set dev blue mtu 65535
$ ip link set dev blue mtu 65536
Error: mtu greater than device maximum.
Fixes: 5055376a3b ("net: vrf: Fix ping failed when vrf mtu is set to 0")
CC: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The preposition "for" should be changed to preposition "of".
Signed-off-by: gushengxian <gushengxian@yulong.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When 'nla_parse_nested_deprecated' failed, it's no need to
BUG() here, return -EINVAL is ok.
Signed-off-by: Zheng Yongjun <zhengyongjun3@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Syzbot reports that when you have AP_VLAN interfaces that are up
and close the AP interface they belong to, we get a deadlock. No
surprise - since we dev_close() them with the wiphy mutex held,
which goes back into the netdev notifier in cfg80211 and tries to
acquire the wiphy mutex there.
To fix this, we need to do two things:
1) prevent changing iftype while AP_VLANs are up, we can't
easily fix this case since cfg80211 already calls us with
the wiphy mutex held, but change_interface() is relatively
rare in drivers anyway, so changing iftype isn't used much
(and userspace has to fall back to down/change/up anyway)
2) pull the dev_close() loop over VLANs out of the wiphy mutex
section in the normal stop case
Cc: stable@vger.kernel.org
Reported-by: syzbot+452ea4fbbef700ff0a56@syzkaller.appspotmail.com
Fixes: a05829a722 ("cfg80211: avoid holding the RTNL when calling the driver")
Link: https://lore.kernel.org/r/20210517160322.9b8f356c0222.I392cb0e2fa5a1a94cf2e637555d702c7e512c1ff@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
IFF_POINTOPOINT interfaces use NUD_NOARP entries for IPv6. It's possible to
fill up the neighbour table with enough entries that it will overflow for
valid connections after that.
This behaviour is more prevalent after commit 58956317c8 ("neighbor:
Improve garbage collection") is applied, as it prevents removal from
entries that are not NUD_FAILED, unless they are more than 5s old.
Fixes: 58956317c8 (neighbor: Improve garbage collection)
Reported-by: Kasper Dupont <kasperd@gjkwv.06.feb.2021.kasperd.net>
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
In commit c47cc30499 ("net: kcm: fix memory leak in kcm_sendmsg")
I misunderstood the root case of the memory leak and came up with
completely broken fix.
So, simply revert this commit to avoid GPF reported by
syzbot.
Im so sorry for this situation.
Fixes: c47cc30499 ("net: kcm: fix memory leak in kcm_sendmsg")
Reported-by: syzbot+65badd5e74ec62cb67dc@syzkaller.appspotmail.com
Signed-off-by: Pavel Skripkin <paskripkin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>