goto free_skb if an unexpected result is returned by pskb_tirm()
in erspan_xmit().
Signed-off-by: Yuanjun Gong <ruc_gongyuanjun@163.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
goto err_free_skb if an unexpected result is returned by pskb_tirm()
in erspan_fb_xmit().
Signed-off-by: Yuanjun Gong <ruc_gongyuanjun@163.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
goto tx_err if an unexpected result is returned by pskb_tirm()
in ip6erspan_tunnel_xmit().
Fixes: 5a963eb61b ("ip6_gre: Add ERSPAN native tunnel support")
Signed-off-by: Yuanjun Gong <ruc_gongyuanjun@163.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
key might contain private part of the key, so better use
kfree_sensitive to free it.
Fixes: 38320c70d2 ("[IPSEC]: Use crypto_aead and authenc in ESP")
Signed-off-by: Wang Ming <machel@vivo.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Marc Kleine-Budde says:
====================
pull-request: can 2023-07-17
The 1st patch is by Ziyang Xuan and fixes a possible memory leak in
the receiver handling in the CAN RAW protocol.
YueHaibing contributes a use after free in bcm_proc_show() of the
Broad Cast Manager (BCM) CAN protocol.
The next 2 patches are by me and fix a possible null pointer
dereference in the RX path of the gs_usb driver with activated
hardware timestamps and the candlelight firmware.
The last patch is by Fedor Ross, Marek Vasut and me and targets the
mcp251xfd driver. The polling timeout of __mcp251xfd_chip_set_mode()
is increased to fix bus joining on busy CAN buses and very low bit
rate.
* tag 'linux-can-fixes-for-6.5-20230717' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can:
can: mcp251xfd: __mcp251xfd_chip_set_mode(): increase poll timeout
can: gs_usb: fix time stamp counter initialization
can: gs_usb: gs_can_open(): improve error handling
can: bcm: Fix UAF in bcm_proc_show()
can: raw: fix receiver memory leak
====================
Link: https://lore.kernel.org/r/20230717180938.230816-1-mkl@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Got kmemleak errors with the following ltp can_filter testcase:
for ((i=1; i<=100; i++))
do
./can_filter &
sleep 0.1
done
==============================================================
[<00000000db4a4943>] can_rx_register+0x147/0x360 [can]
[<00000000a289549d>] raw_setsockopt+0x5ef/0x853 [can_raw]
[<000000006d3d9ebd>] __sys_setsockopt+0x173/0x2c0
[<00000000407dbfec>] __x64_sys_setsockopt+0x61/0x70
[<00000000fd468496>] do_syscall_64+0x33/0x40
[<00000000b7e47d51>] entry_SYSCALL_64_after_hwframe+0x61/0xc6
It's a bug in the concurrent scenario of unregister_netdevice_many()
and raw_release() as following:
cpu0 cpu1
unregister_netdevice_many(can_dev)
unlist_netdevice(can_dev) // dev_get_by_index() return NULL after this
net_set_todo(can_dev)
raw_release(can_socket)
dev = dev_get_by_index(, ro->ifindex); // dev == NULL
if (dev) { // receivers in dev_rcv_lists not free because dev is NULL
raw_disable_allfilters(, dev, );
dev_put(dev);
}
...
ro->bound = 0;
...
call_netdevice_notifiers(NETDEV_UNREGISTER, )
raw_notify(, NETDEV_UNREGISTER, )
if (ro->bound) // invalid because ro->bound has been set 0
raw_disable_allfilters(, dev, ); // receivers in dev_rcv_lists will never be freed
Add a net_device pointer member in struct raw_sock to record bound
can_dev, and use rtnl_lock to serialize raw_socket members between
raw_bind(), raw_release(), raw_setsockopt() and raw_notify(). Use
ro->dev to decide whether to free receivers in dev_rcv_lists.
Fixes: 8d0caedb75 ("can: bcm/raw/isotp: use per module netdevice notifier")
Reviewed-by: Oliver Hartkopp <socketcan@hartkopp.net>
Acked-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com>
Link: https://lore.kernel.org/all/20230711011737.1969582-1-william.xuanziyang@huawei.com
Cc: stable@vger.kernel.org
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
If TCA_FLOWER_CLASSID is specified in the netlink message, the code will
call tcf_bind_filter. However, if any error occurs after that, the code
should undo this by calling tcf_unbind_filter.
Fixes: 77b9900ef5 ("tc: introduce Flower classifier")
Signed-off-by: Victor Nogueira <victor@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Pedro Tammela <pctammela@mojatatu.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If cls_bpf_offload errors out, we must also undo tcf_bind_filter that
was done before the error.
Fix that by calling tcf_unbind_filter in errout_parms.
Fixes: eadb41489f ("net: cls_bpf: add support for marking filters as hardware-only")
Signed-off-by: Victor Nogueira <victor@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Pedro Tammela <pctammela@mojatatu.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the case of an update, when TCA_U32_LINK is set, u32_set_parms will
decrement the refcount of the ht_down (struct tc_u_hnode) pointer
present in the older u32 filter which we are replacing. However, if
u32_replace_hw_knode errors out, the update command fails and that
ht_down pointer continues decremented. To fix that, when
u32_replace_hw_knode fails, check if ht_down's refcount was decremented
and undo the decrement.
Fixes: d34e3e1813 ("net: cls_u32: Add support for skip-sw flag to tc u32 classifier.")
Signed-off-by: Victor Nogueira <victor@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Pedro Tammela <pctammela@mojatatu.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When u32_replace_hw_knode fails, we need to undo the tcf_bind_filter
operation done at u32_set_parms.
Fixes: d34e3e1813 ("net: cls_u32: Add support for skip-sw flag to tc u32 classifier.")
Signed-off-by: Victor Nogueira <victor@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Pedro Tammela <pctammela@mojatatu.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In case an error occurred after mall_set_parms executed successfully, we
must undo the tcf_bind_filter call it issues.
Fix that by calling tcf_unbind_filter in err_replace_hw_filter label.
Fixes: ec2507d2a3 ("net/sched: cls_matchall: Fix error path")
Signed-off-by: Victor Nogueira <victor@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Pedro Tammela <pctammela@mojatatu.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull ceph fix from Ilya Dryomov:
"A fix to prevent a potential buffer overrun in the messenger, marked
for stable"
* tag 'ceph-for-6.5-rc2' of https://github.com/ceph/ceph-client:
libceph: harden msgr2.1 frame segment length checks
When we create an L2 loop on a bridge in netns, we will see packets storm
even if STP is enabled.
# unshare -n
# ip link add br0 type bridge
# ip link add veth0 type veth peer name veth1
# ip link set veth0 master br0 up
# ip link set veth1 master br0 up
# ip link set br0 type bridge stp_state 1
# ip link set br0 up
# sleep 30
# ip -s link show br0
2: br0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
link/ether b6:61:98:1c:1c:b5 brd ff:ff:ff:ff:ff:ff
RX: bytes packets errors dropped missed mcast
956553768 12861249 0 0 0 12861249 <-. Keep
TX: bytes packets errors dropped carrier collsns | increasing
1027834 11951 0 0 0 0 <-' rapidly
This is because llc_rcv() drops all packets in non-root netns and BPDU
is dropped.
Let's add extack warning when enabling STP in netns.
# unshare -n
# ip link add br0 type bridge
# ip link set br0 type bridge stp_state 1
Warning: bridge: STP does not work in non-root netns.
Note this commit will be reverted later when we namespacify the whole LLC
infra.
Fixes: e730c15519 ("[NET]: Make packet reception network namespace safe")
Suggested-by: Harry Coin <hcoin@quietfountain.com>
Link: https://lore.kernel.org/netdev/0f531295-e289-022d-5add-5ceffa0df9bc@quietfountain.com/
Suggested-by: Ido Schimmel <idosch@idosch.org>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ceph_frame_desc::fd_lens is an int array. decode_preamble() thus
effectively casts u32 -> int but the checks for segment lengths are
written as if on unsigned values. While reading in HELLO or one of the
AUTH frames (before authentication is completed), arithmetic in
head_onwire_len() can get duped by negative ctrl_len and produce
head_len which is less than CEPH_PREAMBLE_LEN but still positive.
This would lead to a buffer overrun in prepare_read_control() as the
preamble gets copied to the newly allocated buffer of size head_len.
Cc: stable@vger.kernel.org
Fixes: cd1a677cad ("libceph, ceph: implement msgr2.1 protocol (crc and secure modes)")
Reported-by: Thelford Williams <thelford@google.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Lion says:
-------
In the QFQ scheduler a similar issue to CVE-2023-31436
persists.
Consider the following code in net/sched/sch_qfq.c:
static int qfq_enqueue(struct sk_buff *skb, struct Qdisc *sch,
struct sk_buff **to_free)
{
unsigned int len = qdisc_pkt_len(skb), gso_segs;
// ...
if (unlikely(cl->agg->lmax < len)) {
pr_debug("qfq: increasing maxpkt from %u to %u for class %u",
cl->agg->lmax, len, cl->common.classid);
err = qfq_change_agg(sch, cl, cl->agg->class_weight, len);
if (err) {
cl->qstats.drops++;
return qdisc_drop(skb, sch, to_free);
}
// ...
}
Similarly to CVE-2023-31436, "lmax" is increased without any bounds
checks according to the packet length "len". Usually this would not
impose a problem because packet sizes are naturally limited.
This is however not the actual packet length, rather the
"qdisc_pkt_len(skb)" which might apply size transformations according to
"struct qdisc_size_table" as created by "qdisc_get_stab()" in
net/sched/sch_api.c if the TCA_STAB option was set when modifying the qdisc.
A user may choose virtually any size using such a table.
As a result the same issue as in CVE-2023-31436 can occur, allowing heap
out-of-bounds read / writes in the kmalloc-8192 cache.
-------
We can create the issue with the following commands:
tc qdisc add dev $DEV root handle 1: stab mtu 2048 tsize 512 mpu 0 \
overhead 999999999 linklayer ethernet qfq
tc class add dev $DEV parent 1: classid 1:1 htb rate 6mbit burst 15k
tc filter add dev $DEV parent 1: matchall classid 1:1
ping -I $DEV 1.1.1.2
This is caused by incorrectly assuming that qdisc_pkt_len() returns a
length within the QFQ_MIN_LMAX < len < QFQ_MAX_LMAX.
Fixes: 462dbc9101 ("pkt_sched: QFQ Plus: fair-queueing service at DRR cost")
Reported-by: Lion <nnamrec@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
25369891fc deletes a check for the case where no 'lmax' is
specified which 3037933448 previously fixed as 'lmax'
could be set to the device's MTU without any bound checking
for QFQ_LMAX_MIN and QFQ_LMAX_MAX. Therefore, reintroduce the check.
Fixes: 25369891fc ("net/sched: sch_qfq: refactor parsing of netlink parameters")
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Alexei Starovoitov says:
====================
pull-request: bpf 2023-07-12
We've added 5 non-merge commits during the last 7 day(s) which contain
a total of 7 files changed, 93 insertions(+), 28 deletions(-).
The main changes are:
1) Fix max stack depth check for async callbacks, from Kumar.
2) Fix inconsistent JIT image generation, from Björn.
3) Use trusted arguments in XDP hints kfuncs, from Larysa.
4) Fix memory leak in cpu_map_update_elem, from Pu.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
xdp: use trusted arguments in XDP hints kfuncs
bpf: cpumap: Fix memory leak in cpu_map_update_elem
riscv, bpf: Fix inconsistent JIT image generation
selftests/bpf: Add selftest for check_stack_max_depth bug
bpf: Fix max stack depth check for async callbacks
====================
Link: https://lore.kernel.org/r/20230712223045.40182-1-alexei.starovoitov@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The kernel does not currently validate that both the minimum and maximum
ports of a port range are specified. This can lead user space to think
that a filter matching on a port range was successfully added, when in
fact it was not. For example, with a patched (buggy) iproute2 that only
sends the minimum port, the following commands do not return an error:
# tc filter add dev swp1 ingress pref 1 proto ip flower ip_proto udp src_port 100-200 action pass
# tc filter add dev swp1 ingress pref 1 proto ip flower ip_proto udp dst_port 100-200 action pass
# tc filter show dev swp1 ingress
filter protocol ip pref 1 flower chain 0
filter protocol ip pref 1 flower chain 0 handle 0x1
eth_type ipv4
ip_proto udp
not_in_hw
action order 1: gact action pass
random type none pass val 0
index 1 ref 1 bind 1
filter protocol ip pref 1 flower chain 0 handle 0x2
eth_type ipv4
ip_proto udp
not_in_hw
action order 1: gact action pass
random type none pass val 0
index 2 ref 1 bind 1
Fix by returning an error unless both ports are specified:
# tc filter add dev swp1 ingress pref 1 proto ip flower ip_proto udp src_port 100-200 action pass
Error: Both min and max source ports must be specified.
We have an error talking to the kernel
# tc filter add dev swp1 ingress pref 1 proto ip flower ip_proto udp dst_port 100-200 action pass
Error: Both min and max destination ports must be specified.
We have an error talking to the kernel
Fixes: 5c72299fba ("net: sched: cls_flower: Classify packets using port ranges")
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, verifier does not reject XDP programs that pass NULL pointer to
hints functions. At the same time, this case is not handled in any driver
implementation (including veth). For example, changing
bpf_xdp_metadata_rx_timestamp(ctx, ×tamp);
to
bpf_xdp_metadata_rx_timestamp(ctx, NULL);
in xdp_metadata test successfully crashes the system.
Add KF_TRUSTED_ARGS flag to hints kfunc definitions, so driver code
does not have to worry about getting invalid pointers.
Fixes: 3d76a4d3d4 ("bpf: XDP metadata RX kfuncs")
Reported-by: Stanislav Fomichev <sdf@google.com>
Closes: https://lore.kernel.org/bpf/ZKWo0BbpLfkZHbyE@google.com/
Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20230711105930.29170-1-larysa.zaremba@intel.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
When ip_vti device is set to the qdisc of the sfb type, the cb field
of the sent skb may be modified during enqueuing. Then,
slab-use-after-free may occur when ip_vti device sends IPv6 packets.
As commit f855691975 ("xfrm6: Fix the nexthdr offset in
_decode_session6.") showed, xfrm_decode_session was originally intended
only for the receive path. IP6CB(skb)->nhoff is not set during
transmission. Therefore, set the cb field in the skb to 0 before
sending packets.
Fixes: f855691975 ("xfrm6: Fix the nexthdr offset in _decode_session6.")
Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
When ipv6_vti device is set to the qdisc of the sfb type, the cb field
of the sent skb may be modified during enqueuing. Then,
slab-use-after-free may occur when ipv6_vti device sends IPv6 packets.
The stack information is as follows:
BUG: KASAN: slab-use-after-free in decode_session6+0x103f/0x1890
Read of size 1 at addr ffff88802e08edc2 by task swapper/0/0
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.4.0-next-20230707-00001-g84e2cad7f979 #410
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-1.fc33 04/01/2014
Call Trace:
<IRQ>
dump_stack_lvl+0xd9/0x150
print_address_description.constprop.0+0x2c/0x3c0
kasan_report+0x11d/0x130
decode_session6+0x103f/0x1890
__xfrm_decode_session+0x54/0xb0
vti6_tnl_xmit+0x3e6/0x1ee0
dev_hard_start_xmit+0x187/0x700
sch_direct_xmit+0x1a3/0xc30
__qdisc_run+0x510/0x17a0
__dev_queue_xmit+0x2215/0x3b10
neigh_connected_output+0x3c2/0x550
ip6_finish_output2+0x55a/0x1550
ip6_finish_output+0x6b9/0x1270
ip6_output+0x1f1/0x540
ndisc_send_skb+0xa63/0x1890
ndisc_send_rs+0x132/0x6f0
addrconf_rs_timer+0x3f1/0x870
call_timer_fn+0x1a0/0x580
expire_timers+0x29b/0x4b0
run_timer_softirq+0x326/0x910
__do_softirq+0x1d4/0x905
irq_exit_rcu+0xb7/0x120
sysvec_apic_timer_interrupt+0x97/0xc0
</IRQ>
Allocated by task 9176:
kasan_save_stack+0x22/0x40
kasan_set_track+0x25/0x30
__kasan_slab_alloc+0x7f/0x90
kmem_cache_alloc_node+0x1cd/0x410
kmalloc_reserve+0x165/0x270
__alloc_skb+0x129/0x330
netlink_sendmsg+0x9b1/0xe30
sock_sendmsg+0xde/0x190
____sys_sendmsg+0x739/0x920
___sys_sendmsg+0x110/0x1b0
__sys_sendmsg+0xf7/0x1c0
do_syscall_64+0x39/0xb0
entry_SYSCALL_64_after_hwframe+0x63/0xcd
Freed by task 9176:
kasan_save_stack+0x22/0x40
kasan_set_track+0x25/0x30
kasan_save_free_info+0x2b/0x40
____kasan_slab_free+0x160/0x1c0
slab_free_freelist_hook+0x11b/0x220
kmem_cache_free+0xf0/0x490
skb_free_head+0x17f/0x1b0
skb_release_data+0x59c/0x850
consume_skb+0xd2/0x170
netlink_unicast+0x54f/0x7f0
netlink_sendmsg+0x926/0xe30
sock_sendmsg+0xde/0x190
____sys_sendmsg+0x739/0x920
___sys_sendmsg+0x110/0x1b0
__sys_sendmsg+0xf7/0x1c0
do_syscall_64+0x39/0xb0
entry_SYSCALL_64_after_hwframe+0x63/0xcd
The buggy address belongs to the object at ffff88802e08ed00
which belongs to the cache skbuff_small_head of size 640
The buggy address is located 194 bytes inside of
freed 640-byte region [ffff88802e08ed00, ffff88802e08ef80)
As commit f855691975 ("xfrm6: Fix the nexthdr offset in
_decode_session6.") showed, xfrm_decode_session was originally intended
only for the receive path. IP6CB(skb)->nhoff is not set during
transmission. Therefore, set the cb field in the skb to 0 before
sending packets.
Fixes: f855691975 ("xfrm6: Fix the nexthdr offset in _decode_session6.")
Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
When the xfrm device is set to the qdisc of the sfb type, the cb field
of the sent skb may be modified during enqueuing. Then,
slab-use-after-free may occur when the xfrm device sends IPv6 packets.
The stack information is as follows:
BUG: KASAN: slab-use-after-free in decode_session6+0x103f/0x1890
Read of size 1 at addr ffff8881111458ef by task swapper/3/0
CPU: 3 PID: 0 Comm: swapper/3 Not tainted 6.4.0-next-20230707 #409
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-1.fc33 04/01/2014
Call Trace:
<IRQ>
dump_stack_lvl+0xd9/0x150
print_address_description.constprop.0+0x2c/0x3c0
kasan_report+0x11d/0x130
decode_session6+0x103f/0x1890
__xfrm_decode_session+0x54/0xb0
xfrmi_xmit+0x173/0x1ca0
dev_hard_start_xmit+0x187/0x700
sch_direct_xmit+0x1a3/0xc30
__qdisc_run+0x510/0x17a0
__dev_queue_xmit+0x2215/0x3b10
neigh_connected_output+0x3c2/0x550
ip6_finish_output2+0x55a/0x1550
ip6_finish_output+0x6b9/0x1270
ip6_output+0x1f1/0x540
ndisc_send_skb+0xa63/0x1890
ndisc_send_rs+0x132/0x6f0
addrconf_rs_timer+0x3f1/0x870
call_timer_fn+0x1a0/0x580
expire_timers+0x29b/0x4b0
run_timer_softirq+0x326/0x910
__do_softirq+0x1d4/0x905
irq_exit_rcu+0xb7/0x120
sysvec_apic_timer_interrupt+0x97/0xc0
</IRQ>
<TASK>
asm_sysvec_apic_timer_interrupt+0x1a/0x20
RIP: 0010:intel_idle_hlt+0x23/0x30
Code: 1f 84 00 00 00 00 00 f3 0f 1e fa 41 54 41 89 d4 0f 1f 44 00 00 66 90 0f 1f 44 00 00 0f 00 2d c4 9f ab 00 0f 1f 44 00 00 fb f4 <fa> 44 89 e0 41 5c c3 66 0f 1f 44 00 00 f3 0f 1e fa 41 54 41 89 d4
RSP: 0018:ffffc90000197d78 EFLAGS: 00000246
RAX: 00000000000a83c3 RBX: ffffe8ffffd09c50 RCX: ffffffff8a22d8e5
RDX: 0000000000000001 RSI: ffffffff8d3f8080 RDI: ffffe8ffffd09c50
RBP: ffffffff8d3f8080 R08: 0000000000000001 R09: ffffed1026ba6d9d
R10: ffff888135d36ceb R11: 0000000000000001 R12: 0000000000000001
R13: ffffffff8d3f8100 R14: 0000000000000001 R15: 0000000000000000
cpuidle_enter_state+0xd3/0x6f0
cpuidle_enter+0x4e/0xa0
do_idle+0x2fe/0x3c0
cpu_startup_entry+0x18/0x20
start_secondary+0x200/0x290
secondary_startup_64_no_verify+0x167/0x16b
</TASK>
Allocated by task 939:
kasan_save_stack+0x22/0x40
kasan_set_track+0x25/0x30
__kasan_slab_alloc+0x7f/0x90
kmem_cache_alloc_node+0x1cd/0x410
kmalloc_reserve+0x165/0x270
__alloc_skb+0x129/0x330
inet6_ifa_notify+0x118/0x230
__ipv6_ifa_notify+0x177/0xbe0
addrconf_dad_completed+0x133/0xe00
addrconf_dad_work+0x764/0x1390
process_one_work+0xa32/0x16f0
worker_thread+0x67d/0x10c0
kthread+0x344/0x440
ret_from_fork+0x1f/0x30
The buggy address belongs to the object at ffff888111145800
which belongs to the cache skbuff_small_head of size 640
The buggy address is located 239 bytes inside of
freed 640-byte region [ffff888111145800, ffff888111145a80)
As commit f855691975 ("xfrm6: Fix the nexthdr offset in
_decode_session6.") showed, xfrm_decode_session was originally intended
only for the receive path. IP6CB(skb)->nhoff is not set during
transmission. Therefore, set the cb field in the skb to 0 before
sending packets.
Fixes: f855691975 ("xfrm6: Fix the nexthdr offset in _decode_session6.")
Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
After the elimination of inner modes, a couple of warnings that
were previously unreachable can now be triggered by malformed
inbound packets.
Fix this by:
1. Moving the setting of skb->protocol into the decap functions.
2. Returning -EINVAL when unexpected protocol is seen.
Reported-by: Maciej Żenczykowski<maze@google.com>
Fixes: 5f24f41e8e ("xfrm: Remove inner/outer modes from input path")
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Reviewed-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Now in addrconf_mod_rs_timer(), reference idev depends on whether
rs_timer is not pending. Then modify rs_timer timeout.
There is a time gap in [1], during which if the pending rs_timer
becomes not pending. It will miss to hold idev, but the rs_timer
is activated. Thus rs_timer callback function addrconf_rs_timer()
will be executed and put idev later without holding idev. A refcount
underflow issue for idev can be caused by this.
if (!timer_pending(&idev->rs_timer))
in6_dev_hold(idev);
<--------------[1]
mod_timer(&idev->rs_timer, jiffies + when);
To fix the issue, hold idev if mod_timer() return 0.
Fixes: b7b1bfce0b ("ipv6: split duplicate address detection and router solicitation timer")
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The tracepoint has existed for 12 years, but it only covered udp
over the legacy IPv4 protocol. Having it enabled for udp6 removes
the unnecessary difference in error visibility.
Signed-off-by: Ivan Babrou <ivan@cloudflare.com>
Fixes: 296f7ea75b ("udp: add tracepoints for queueing skb to rcvbuf")
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the event of a failure in tcf_change_indev(), fw_set_parms() will
immediately return an error after incrementing or decrementing
reference counter in tcf_bind_filter(). If attacker can control
reference counter to zero and make reference freed, leading to
use after free.
In order to prevent this, move the point of possible failure above the
point where the TC_FW_CLASSID is handled.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Reported-by: M A Ramdhan <ramdhan@starlabs.sg>
Signed-off-by: M A Ramdhan <ramdhan@starlabs.sg>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Pedro Tammela <pctammela@mojatatu.com>
Message-ID: <20230705161530.52003-1-ramdhan@starlabs.sg>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for net:
1) Fix missing overflow use refcount checks in nf_tables.
2) Do not set IPS_ASSURED for IPS_NAT_CLASH entries in GRE tracker,
from Florian Westphal.
3) Bail out if nf_ct_helper_hash is NULL before registering helper,
from Florent Revest.
4) Use siphash() instead siphash_4u64() to fix performance regression,
also from Florian.
5) Do not allow to add rules to removed chains via ID,
from Thadeu Lima de Souza Cascardo.
6) Fix oob read access in byteorder expression, also from Thadeu.
netfilter pull request 23-07-06
====================
Link: https://lore.kernel.org/r/20230705230406.52201-1-pablo@netfilter.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Pull networking fixes from Jakub Kicinski:
"Including fixes from bluetooth, bpf and wireguard.
Current release - regressions:
- nvme-tcp: fix comma-related oops after sendpage changes
Current release - new code bugs:
- ptp: make max_phase_adjustment sysfs device attribute invisible
when not supported
Previous releases - regressions:
- sctp: fix potential deadlock on &net->sctp.addr_wq_lock
- mptcp:
- ensure subflow is unhashed before cleaning the backlog
- do not rely on implicit state check in mptcp_listen()
Previous releases - always broken:
- net: fix net_dev_start_xmit trace event vs skb_transport_offset()
- Bluetooth:
- fix use-bdaddr-property quirk
- L2CAP: fix multiple UaFs
- ISO: use hci_sync for setting CIG parameters
- hci_event: fix Set CIG Parameters error status handling
- hci_event: fix parsing of CIS Established Event
- MGMT: fix marking SCAN_RSP as not connectable
- wireguard: queuing: use saner cpu selection wrapping
- sched: act_ipt: various bug fixes for iptables <> TC interactions
- sched: act_pedit: add size check for TCA_PEDIT_PARMS_EX
- dsa: fixes for receiving PTP packets with 8021q and sja1105 tagging
- eth: sfc: fix null-deref in devlink port without MAE access
- eth: ibmvnic: do not reset dql stats on NON_FATAL err
Misc:
- xsk: honor SO_BINDTODEVICE on bind"
* tag 'net-6.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (70 commits)
nfp: clean mc addresses in application firmware when closing port
selftests: mptcp: pm_nl_ctl: fix 32-bit support
selftests: mptcp: depend on SYN_COOKIES
selftests: mptcp: userspace_pm: report errors with 'remove' tests
selftests: mptcp: userspace_pm: use correct server port
selftests: mptcp: sockopt: return error if wrong mark
selftests: mptcp: sockopt: use 'iptables-legacy' if available
selftests: mptcp: connect: fail if nft supposed to work
mptcp: do not rely on implicit state check in mptcp_listen()
mptcp: ensure subflow is unhashed before cleaning the backlog
s390/qeth: Fix vipa deletion
octeontx-af: fix hardware timestamp configuration
net: dsa: sja1105: always enable the send_meta options
net: dsa: tag_sja1105: fix MAC DA patching from meta frames
net: Replace strlcpy with strscpy
pptp: Fix fib lookup calls.
mlxsw: spectrum_router: Fix an IS_ERR() vs NULL check
net/sched: act_pedit: Add size check for TCA_PEDIT_PARMS_EX
xsk: Honor SO_BINDTODEVICE on bind
ptp: Make max_phase_adjustment sysfs device attribute invisible when not supported
...
Daniel Borkmann says:
====================
pull-request: bpf 2023-07-05
We've added 2 non-merge commits during the last 1 day(s) which contain
a total of 3 files changed, 16 insertions(+), 4 deletions(-).
The main changes are:
1) Fix BTF to warn but not returning an error for a NULL BTF to still be
able to load modules under CONFIG_DEBUG_INFO_BTF, from SeongJae Park.
2) Fix xsk sockets to honor SO_BINDTODEVICE in bind(), from Ilya Maximets.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
xsk: Honor SO_BINDTODEVICE on bind
bpf, btf: Warn but return no error for NULL btf from __register_btf_kfunc_id_set()
====================
Link: https://lore.kernel.org/r/20230705171716.6494-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Originally this used jhash2() over tuple and folded the zone id,
the pernet hash value, destination port and l4 protocol number into the
32bit seed value.
When the switch to siphash was done, I used an on-stack temporary
buffer to build a suitable key to be hashed via siphash().
But this showed up as performance regression, so I got rid of
the temporary copy and collected to-be-hashed data in 4 u64 variables.
This makes it easy to build tuples that produce the same hash, which isn't
desirable even though chain lengths are limited.
Switch back to plain siphash, but just like with jhash2(), take advantage
of the fact that most of to-be-hashed data is already in a suitable order.
Use an empty struct as annotation in 'struct nf_conntrack_tuple' to mark
last member that can be used as hash input.
The only remaining data that isn't present in the tuple structure are the
zone identifier and the pernet hash: fold those into the key.
Fixes: d2c806abcf ("netfilter: conntrack: use siphash_4u64")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
If nf_conntrack_init_start() fails (for example due to a
register_nf_conntrack_bpf() failure), the nf_conntrack_helper_fini()
clean-up path frees the nf_ct_helper_hash map.
When built with NF_CONNTRACK=y, further netfilter modules (e.g:
netfilter_conntrack_ftp) can still be loaded and call
nf_conntrack_helpers_register(), independently of whether nf_conntrack
initialized correctly. This accesses the nf_ct_helper_hash dangling
pointer and causes a uaf, possibly leading to random memory corruption.
This patch guards nf_conntrack_helper_register() from accessing a freed
or uninitialized nf_ct_helper_hash pointer and fixes possible
uses-after-free when loading a conntrack module.
Cc: stable@vger.kernel.org
Fixes: 12f7a50533 ("netfilter: add user-space connection tracking helper infrastructure")
Signed-off-by: Florent Revest <revest@chromium.org>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Now that conntrack core is allowd to insert clashing entries, make sure
GRE won't set assured flag on NAT_CLASH entries, just like UDP.
Doing so prevents early_drop logic for these entries.
Fixes: d671fd82ea ("netfilter: conntrack: allow insertion clash of gre protocol")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Overflow use refcount checks are not complete.
Add helper function to deal with object reference counter tracking.
Report -EMFILE in case UINT_MAX is reached.
nft_use_dec() splats in case that reference counter underflows,
which should not ever happen.
Add nft_use_inc_restore() and nft_use_dec_restore() which are used
to restore reference counter from error and abort paths.
Use u32 in nft_flowtable and nft_object since helper functions cannot
work on bitfields.
Remove the few early incomplete checks now that the helper functions
are in place and used to check for refcount overflow.
Fixes: 96518518cc ("netfilter: add nftables")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Since the blamed commit, closing the first subflow resets the first
subflow socket state to SS_UNCONNECTED.
The current mptcp listen implementation relies only on such
state to prevent touching not-fully-disconnected sockets.
Incoming mptcp fastclose (or paired endpoint removal) unconditionally
closes the first subflow.
All the above allows an incoming fastclose followed by a listen() call
to successfully race with a blocking recvmsg(), potentially causing the
latter to hit a divide by zero bug in cleanup_rbuf/__tcp_select_window().
Address the issue explicitly checking the msk socket state in
mptcp_listen(). An alternative solution would be moving the first
subflow socket state update into mptcp_disconnect(), but in the long
term the first subflow socket should be removed: better avoid relaying
on it for internal consistency check.
Fixes: b29fcfb54c ("mptcp: full disconnect implementation")
Cc: stable@vger.kernel.org
Reported-by: Christoph Paasch <cpaasch@apple.com>
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/414
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
While tacking care of the mptcp-level listener I unintentionally
moved the subflow level unhash after the subflow listener backlog
cleanup.
That could cause some nasty race and makes the code harder to read.
Address the issue restoring the proper order of operations.
Fixes: 57fc0f1cea ("mptcp: ensure listener is unhashed before updating the sk status")
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
incl_srcpt has the limitation, mentioned in commit b4638af888 ("net:
dsa: sja1105: always enable the INCL_SRCPT option"), that frames with a
MAC DA of 01:80:c2:xx:yy:zz will be received as 01:80:c2:00:00:zz unless
PTP RX timestamping is enabled.
The incl_srcpt option was initially unconditionally enabled, then that
changed with commit 42824463d3 ("net: dsa: sja1105: Limit use of
incl_srcpt to bridge+vlan mode"), then again with b4638af888 ("net:
dsa: sja1105: always enable the INCL_SRCPT option"). Bottom line is that
it now needs to be always enabled, otherwise the driver does not have a
reliable source of information regarding source_port and switch_id for
link-local traffic (tag_8021q VLANs may be imprecise since now they
identify an entire bridging domain when ports are not standalone).
If we accept that PTP RX timestamping (and therefore, meta frame
generation) is always enabled in hardware, then that limitation could be
avoided and packets with any MAC DA can be properly received, because
meta frames do contain the original bytes from the MAC DA of their
associated link-local packet.
This change enables meta frame generation unconditionally, which also
has the nice side effects of simplifying the switch control path
(a switch reset is no longer required on hwtstamping settings change)
and the tagger data path (it no longer needs to be informed whether to
expect meta frames or not - it always does).
Fixes: 227d07a07e ("net: dsa: sja1105: Add support for traffic through standalone ports")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The SJA1105 manual says that at offset 4 into the meta frame payload we
have "MAC destination byte 2" and at offset 5 we have "MAC destination
byte 1". These are counted from the LSB, so byte 1 is h_dest[ETH_HLEN-2]
aka h_dest[4] and byte 2 is h_dest[ETH_HLEN-3] aka h_dest[3].
The sja1105_meta_unpack() function decodes these the other way around,
so a frame with MAC DA 01:80:c2:11:22:33 is received by the network
stack as having 01:80:c2:22:11:33.
Fixes: e53e18a6fe ("net: dsa: sja1105: Receive and decode meta frames")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The attribute TCA_PEDIT_PARMS_EX is not be included in pedit_policy and
one malicious user could fake a TCA_PEDIT_PARMS_EX whose length is
smaller than the intended sizeof(struct tc_pedit). Hence, the
dereference in tcf_pedit_init() could access dirty heap data.
static int tcf_pedit_init(...)
{
// ...
pattr = tb[TCA_PEDIT_PARMS]; // TCA_PEDIT_PARMS is included
if (!pattr)
pattr = tb[TCA_PEDIT_PARMS_EX]; // but this is not
// ...
parm = nla_data(pattr);
index = parm->index; // parm is able to be smaller than 4 bytes
// and this dereference gets dirty skb_buff
// data created in netlink_sendmsg
}
This commit adds TCA_PEDIT_PARMS_EX length in pedit_policy which avoid
the above case, just like the TCA_PEDIT_PARMS.
Fixes: 71d0ed7079 ("net/act_pedit: Support using offset relative to the conventional network headers")
Signed-off-by: Lin Ma <linma@zju.edu.cn>
Reviewed-by: Pedro Tammela <pctammela@mojatatu.com>
Link: https://lore.kernel.org/r/20230703110842.590282-1-linma@zju.edu.cn
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Initial creation of an AF_XDP socket requires CAP_NET_RAW capability. A
privileged process might create the socket and pass it to a non-privileged
process for later use. However, that process will be able to bind the socket
to any network interface. Even though it will not be able to receive any
traffic without modification of the BPF map, the situation is not ideal.
Sockets already have a mechanism that can be used to restrict what interface
they can be attached to. That is SO_BINDTODEVICE.
To change the SO_BINDTODEVICE binding the process will need CAP_NET_RAW.
Make xsk_bind() honor the SO_BINDTODEVICE in order to allow safer workflow
when non-privileged process is using AF_XDP.
The intended workflow is following:
1. First process creates a bare socket with socket(AF_XDP, ...).
2. First process loads the XSK program to the interface.
3. First process adds the socket fd to a BPF map.
4. First process ties socket fd to a particular interface using
SO_BINDTODEVICE.
5. First process sends socket fd to a second process.
6. Second process allocates UMEM.
7. Second process binds socket to the interface with bind(...).
8. Second process sends/receives the traffic.
All the steps above are possible today if the first process is privileged
and the second one has sufficient RLIMIT_MEMLOCK and no capabilities.
However, the second process will be able to bind the socket to any interface
it wants on step 7 and send traffic from it. With the proposed change, the
second process will be able to bind the socket only to a specific interface
chosen by the first process at step 4.
Fixes: 965a990984 ("xsk: add support for bind for Rx")
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Link: https://lore.kernel.org/bpf/20230703175329.3259672-1-i.maximets@ovn.org