Set the error code if set_branch_dest_ft() fails.
Fixes: ccbe33003b ("net/mlx5e: TC, Don't offload post action rule if not supported")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Preserve the error code if esw_add_restore_rule() fails. Don't return
success.
Fixes: 6702782845 ("net/mlx5e: TC, Set CT miss to the specific ct action instance")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
While handling new traces, to verify it is not the first block being
written, last_timestamp is checked. But instead of checking it is non
zero it is verified to be zero. Fix to verify last_timestamp is not
zero.
Fixes: c71ad41ccb ("net/mlx5: FW tracer, events handling")
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Feras Daoud <ferasda@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
XDP transmits fragmented packets that are larger than MTU size instead of
dropping those packets. The drop check that checks whether a packet is larger
than MTU is comparing MTU size against the linear part length only.
Adjust the drop check to compare MTU size against both linear and non-linear
part lengths to avoid transmitting fragmented packets larger than MTU size.
Fixes: 39a1665d16 ("net/mlx5e: Implement sending multi buffer XDP frames")
Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
The cited commit increases num_block_tc when unblock tc offload.
Actually should decrease it.
Fixes: c8e350e62f ("net/mlx5e: Make TC and IPsec offloads mutually exclusive on a netdev")
Signed-off-by: Chris Mi <cmi@nvidia.com>
Reviewed-by: Jianbo Liu <jianbol@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Coverity Scan reports the following issue. But it's impossible that
mlx5_get_dev_index returns 7 for PF, even if the index is calculated
from PCI FUNC ID. So add the checking to make coverity slience.
CID 610894 (#2 of 2): Out-of-bounds write (OVERRUN)
Overrunning array esw->fdb_table.offloads.peer_miss_rules of 4 8-byte
elements at element index 7 (byte offset 63) using index
mlx5_get_dev_index(peer_dev) (which evaluates to 7).
Fixes: 9bee385a6e ("net/mlx5: E-switch, refactor FDB miss rule add/remove")
Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
When kcalloc() for ft->g succeeds but kvzalloc() for in fails,
fs_udp_create_groups() will free ft->g. However, its caller
fs_udp_create_table() will free ft->g again through calling
mlx5e_destroy_flow_table(), which will lead to a double-free.
Fix this by setting ft->g to NULL in fs_udp_create_groups().
Fixes: 1c80bd6843 ("net/mlx5e: Introduce Flow Steering UDP API")
Signed-off-by: Dinghao Liu <dinghao.liu@zju.edu.cn>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Fix a cmd->ent use after free due to a race on command entry.
Such race occurs when one of the commands releases its last refcount and
frees its index and entry while another process running command flush
flow takes refcount to this command entry. The process which handles
commands flush may see this command as needed to be flushed if the other
process allocated a ent->idx but didn't set ent to cmd->ent_arr in
cmd_work_handler(). Fix it by moving the assignment of cmd->ent_arr into
the spin lock.
[70013.081955] BUG: KASAN: use-after-free in mlx5_cmd_trigger_completions+0x1e2/0x4c0 [mlx5_core]
[70013.081967] Write of size 4 at addr ffff88880b1510b4 by task kworker/26:1/1433361
[70013.081968]
[70013.082028] Workqueue: events aer_isr
[70013.082053] Call Trace:
[70013.082067] dump_stack+0x8b/0xbb
[70013.082086] print_address_description+0x6a/0x270
[70013.082102] kasan_report+0x179/0x2c0
[70013.082173] mlx5_cmd_trigger_completions+0x1e2/0x4c0 [mlx5_core]
[70013.082267] mlx5_cmd_flush+0x80/0x180 [mlx5_core]
[70013.082304] mlx5_enter_error_state+0x106/0x1d0 [mlx5_core]
[70013.082338] mlx5_try_fast_unload+0x2ea/0x4d0 [mlx5_core]
[70013.082377] remove_one+0x200/0x2b0 [mlx5_core]
[70013.082409] pci_device_remove+0xf3/0x280
[70013.082439] device_release_driver_internal+0x1c3/0x470
[70013.082453] pci_stop_bus_device+0x109/0x160
[70013.082468] pci_stop_and_remove_bus_device+0xe/0x20
[70013.082485] pcie_do_fatal_recovery+0x167/0x550
[70013.082493] aer_isr+0x7d2/0x960
[70013.082543] process_one_work+0x65f/0x12d0
[70013.082556] worker_thread+0x87/0xb50
[70013.082571] kthread+0x2e9/0x3a0
[70013.082592] ret_from_fork+0x1f/0x40
The logical relationship of this error is as follows:
aer_recover_work | ent->work
-------------------------------------------+------------------------------
aer_recover_work_func |
|- pcie_do_recovery |
|- report_error_detected |
|- mlx5_pci_err_detected |cmd_work_handler
|- mlx5_enter_error_state | |- cmd_alloc_index
|- enter_error_state | |- lock cmd->alloc_lock
|- mlx5_cmd_flush | |- clear_bit
|- mlx5_cmd_trigger_completions| |- unlock cmd->alloc_lock
|- lock cmd->alloc_lock |
|- vector = ~dev->cmd.vars.bitmask
|- for_each_set_bit |
|- cmd_ent_get(cmd->ent_arr[i]) (UAF)
|- unlock cmd->alloc_lock | |- cmd->ent_arr[ent->idx]=ent
The cmd->ent_arr[ent->idx] assignment and the bit clearing are not
protected by the cmd->alloc_lock in cmd_work_handler().
Fixes: 50b2412b7e ("net/mlx5: Avoid possible free of command entry while timeout comp handler")
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Shifeng Li <lishifeng@sangfor.com.cn>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Cited commit introduced potential double free since encap_header can be
destroyed twice in some cases - once by error cleanup sequence in
mlx5e_tc_tun_{create|update}_header_ipv{4|6}(), once by generic
mlx5e_encap_put() that user calls as a result of getting an error from
tunnel create|update. At the same time the point where e->encap_header is
assigned can't be delayed because the function can still return non-error
code 0 as a result of checking for NUD_VALID flag, which will cause
neighbor update to dereference NULL encap_header.
Fix the issue by:
- Nulling local encap_header variables in
mlx5e_tc_tun_{create|update}_header_ipv{4|6}() to make kfree(encap_header)
call in error cleanup sequence noop after that point.
- Assigning reformat_params.data from e->encap_header instead of local
variable encap_header that was set to NULL pointer by previous step. Also
assign reformat_params.size from e->encap_size for uniformity and in order
to make the code less error-prone in the future.
Fixes: d589e785ba ("net/mlx5e: Allow concurrent creation of encap entries")
Reported-by: Dust Li <dust.li@linux.alibaba.com>
Reported-by: Cruz Zhao <cruzzhao@linux.alibaba.com>
Reported-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
This reverts commit 6f9b1a0731.
This patch is causing a null ptr issue, the proper fix is in the next
patch.
Fixes: 6f9b1a0731 ("net/mlx5e: fix double free of encap_header")
Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
This reverts commit 3a4aa3cb83.
This patch is causing a null ptr issue, the proper fix is in the next
patch.
Fixes: 3a4aa3cb83 ("net/mlx5e: fix double free of encap_header in update funcs")
Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Yanteng Si says:
====================
stmmac: Some bug fixes
* Put Krzysztof's patch into my thread, pick Conor's Reviewed-by
tag and Jiaxun's Acked-by tag.(prev version is RFC patch)
* I fixed an Oops related to mdio, mainly to ensure that
mdio is initialized before use, because it will be used
in a series of patches I am working on.
see <https://lore.kernel.org/loongarch/cover.1699533745.git.siyanteng@loongson.cn/T/#t>
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Device binds to proper PCI ID (LOONGSON, 0x7a03), already listed in DTS,
so checking for some other compatible does not make sense. It cannot be
bound to unsupported platform.
Drop useless, incorrect (space in between) and undocumented compatible.
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Signed-off-by: Yanteng Si <siyanteng@loongson.cn>
Reviewed-by: Conor Dooley <conor.dooley@microchip.com>
Acked-by: Jiaxun Yang <jiaxun.yang@flygoat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Device binds to proper PCI ID (LOONGSON, 0x7a03), already listed in DTS,
so checking for some other compatible does not make sense. It cannot be
bound to unsupported platform.
Drop useless, incorrect (space in between) and undocumented compatible.
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Signed-off-by: Yanteng Si <siyanteng@loongson.cn>
Reviewed-by: Conor Dooley <conor.dooley@microchip.com>
Acked-by: Jiaxun Yang <jiaxun.yang@flygoat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Generic code will use mdio. If it is not initialized before use,
the kernel will Oops.
Fixes: 30bba69d7d ("stmmac: pci: Add dwmac support for Loongson")
Signed-off-by: Yanteng Si <siyanteng@loongson.cn>
Signed-off-by: Feiyang Chen <chenfeiyang@loongson.cn>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Based on the tcp man page, if TCP_NODELAY is set, it disables Nagle's algorithm
and packets are sent as soon as possible. However in the `tcp_push` function
where autocorking is evaluated the `nonagle` value set by TCP_NODELAY is not
considered which can trigger unexpected corking of packets and induce delays.
For example, if two packets are generated as part of a server's reply, if the
first one is not transmitted on the wire quickly enough, the second packet can
trigger the autocorking in `tcp_push` and be delayed instead of sent as soon as
possible. It will either wait for additional packets to be coalesced or an ACK
from the client before transmitting the corked packet. This can interact badly
if the receiver has tcp delayed acks enabled, introducing 40ms extra delay in
completion times. It is not always possible to control who has delayed acks
set, but it is possible to adjust when and how autocorking is triggered.
Patch prevents autocorking if the TCP_NODELAY flag is set on the socket.
Patch has been tested using an AWS c7g.2xlarge instance with Ubuntu 22.04 and
Apache Tomcat 9.0.83 running the basic servlet below:
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
public class HelloWorldServlet extends HttpServlet {
@Override
protected void doGet(HttpServletRequest request, HttpServletResponse response)
throws ServletException, IOException {
response.setContentType("text/html;charset=utf-8");
OutputStreamWriter osw = new OutputStreamWriter(response.getOutputStream(),"UTF-8");
String s = "a".repeat(3096);
osw.write(s,0,s.length());
osw.flush();
}
}
Load was applied using wrk2 (https://github.com/kinvolk/wrk2) from an AWS
c6i.8xlarge instance. With the current auto-corking behavior and TCP_NODELAY
set an additional 40ms latency from P99.99+ values are observed. With the
patch applied we see no occurrences of 40ms latencies. The patch has also been
tested with iperf and uperf benchmarks and no regression was observed.
# No patch with tcp_autocorking=1 and TCP_NODELAY set on all sockets
./wrk -t32 -c128 -d40s --latency -R10000 http://172.31.49.177:8080/hello/hello'
...
50.000% 0.91ms
75.000% 1.12ms
90.000% 1.46ms
99.000% 1.73ms
99.900% 1.96ms
99.990% 43.62ms <<< 40+ ms extra latency
99.999% 48.32ms
100.000% 49.34ms
# With patch
./wrk -t32 -c128 -d40s --latency -R10000 http://172.31.49.177:8080/hello/hello'
...
50.000% 0.89ms
75.000% 1.13ms
90.000% 1.44ms
99.000% 1.67ms
99.900% 1.78ms
99.990% 2.27ms <<< no 40+ ms extra latency
99.999% 3.71ms
100.000% 4.57ms
Fixes: f54b311142 ("tcp: auto corking")
Signed-off-by: Salvatore Dipietro <dipiets@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When sending TX packets, the meta descriptor can be all zeroes
as no meta information is required (as in XDP).
This patch removes the validity check, as when
`disable_meta_caching` is enabled, such TX packets will be
dropped otherwise.
Fixes: 0e3a3f6dac ("net: ena: support new LLQ acceleration mode")
Signed-off-by: Shay Agroskin <shayagr@amazon.com>
Signed-off-by: David Arinzon <darinzon@amazon.com>
Link: https://lore.kernel.org/r/20231211062801.27891-5-darinzon@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch fixes two issues:
Issue 1
-------
Description
```````````
Current code does not call dma_sync_single_for_cpu() to sync data from
the device side memory to the CPU side memory before the XDP code path
uses the CPU side data.
This causes the XDP code path to read the unset garbage data in the CPU
side memory, resulting in incorrect handling of the packet by XDP.
Solution
````````
1. Add a call to dma_sync_single_for_cpu() before the XDP code starts to
use the data in the CPU side memory.
2. The XDP code verdict can be XDP_PASS, in which case there is a
fallback to the non-XDP code, which also calls
dma_sync_single_for_cpu().
To avoid calling dma_sync_single_for_cpu() twice:
2.1. Put the dma_sync_single_for_cpu() in the code in such a place where
it happens before XDP and non-XDP code.
2.2. Remove the calls to dma_sync_single_for_cpu() in the non-XDP code
for the first buffer only (rx_copybreak and non-rx_copybreak
cases), since the new call that was added covers these cases.
The call to dma_sync_single_for_cpu() for the second buffer and on
stays because only the first buffer is handled by the newly added
dma_sync_single_for_cpu(). And there is no need for special
handling of the second buffer and on for the XDP path since
currently the driver supports only single buffer packets.
Issue 2
-------
Description
```````````
In case the XDP code forwarded the packet (ENA_XDP_FORWARDED),
ena_unmap_rx_buff_attrs() is called with attrs set to 0.
This means that before unmapping the buffer, the internal function
dma_unmap_page_attrs() will also call dma_sync_single_for_cpu() on
the whole buffer (not only on the data part of it).
This sync is both wasteful (since a sync was already explicitly
called before) and also causes a bug, which will be explained
using the below diagram.
The following diagram shows the flow of events causing the bug.
The order of events is (1)-(4) as shown in the diagram.
CPU side memory area
(3)convert_to_xdp_frame() initializes the
headroom with xdpf metadata
||
\/
___________________________________
| |
0 | V 4K
---------------------------------------------------------------------
| xdpf->data | other xdpf | < data > | tailroom ||...|
| | fields | | GARBAGE || |
---------------------------------------------------------------------
/\ /\
|| ||
(4)ena_unmap_rx_buff_attrs() calls (2)dma_sync_single_for_cpu()
dma_sync_single_for_cpu() on the copies data from device
whole buffer page, overwriting side to CPU side memory
the xdpf->data with GARBAGE. ||
0 4K
---------------------------------------------------------------------
| headroom | < data > | tailroom ||...|
| GARBAGE | | GARBAGE || |
---------------------------------------------------------------------
Device side memory area /\
||
(1) device writes RX packet data
After the call to ena_unmap_rx_buff_attrs() in (4), the xdpf->data
becomes corrupted, and so when it is later accessed in
ena_clean_xdp_irq()->xdp_return_frame(), it causes a page fault,
crashing the kernel.
Solution
````````
Explicitly tell ena_unmap_rx_buff_attrs() not to call
dma_sync_single_for_cpu() by passing it the ENA_DMA_ATTR_SKIP_CPU_SYNC
flag.
Fixes: f7d625adeb ("net: ena: Add dynamic recycling mechanism for rx buffers")
Signed-off-by: Arthur Kiyanovski <akiyano@amazon.com>
Signed-off-by: David Arinzon <darinzon@amazon.com>
Link: https://lore.kernel.org/r/20231211062801.27891-4-darinzon@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Current xdp code drops packets larger than ENA_XDP_MAX_MTU.
This is an incorrect condition since the problem is not the
size of the packet, rather the number of buffers it contains.
This commit:
1. Identifies and drops XDP multi-buffer packets at the
beginning of the function.
2. Increases the xdp drop statistic when this drop occurs.
3. Adds a one-time print that such drops are happening to
give better indication to the user.
Fixes: 838c93dc54 ("net: ena: implement XDP drop support")
Signed-off-by: Arthur Kiyanovski <akiyano@amazon.com>
Signed-off-by: David Arinzon <darinzon@amazon.com>
Link: https://lore.kernel.org/r/20231211062801.27891-3-darinzon@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The ena_setup_and_create_all_xdp_queues() function freed all the
resources upon failure, after creating only xdp_num_queues queues,
instead of freeing just the created ones.
In this patch, the only resources that are freed, are the ones
allocated right before the failure occurs.
Fixes: 548c4940b9 ("net: ena: Implement XDP_TX action")
Signed-off-by: Shahar Itzko <itzko@amazon.com>
Signed-off-by: David Arinzon <darinzon@amazon.com>
Link: https://lore.kernel.org/r/20231211062801.27891-2-darinzon@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
syzkaller report:
kernel BUG at net/core/skbuff.c:3452!
invalid opcode: 0000 [#1] PREEMPT SMP KASAN PTI
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.7.0-rc4-00009-gbee0e7762ad2-dirty #135
RIP: 0010:skb_copy_and_csum_bits (net/core/skbuff.c:3452)
Call Trace:
icmp_glue_bits (net/ipv4/icmp.c:357)
__ip_append_data.isra.0 (net/ipv4/ip_output.c:1165)
ip_append_data (net/ipv4/ip_output.c:1362 net/ipv4/ip_output.c:1341)
icmp_push_reply (net/ipv4/icmp.c:370)
__icmp_send (./include/net/route.h:252 net/ipv4/icmp.c:772)
ip_fragment.constprop.0 (./include/linux/skbuff.h:1234 net/ipv4/ip_output.c:592 net/ipv4/ip_output.c:577)
__ip_finish_output (net/ipv4/ip_output.c:311 net/ipv4/ip_output.c:295)
ip_output (net/ipv4/ip_output.c:427)
__ip_queue_xmit (net/ipv4/ip_output.c:535)
__tcp_transmit_skb (net/ipv4/tcp_output.c:1462)
__tcp_retransmit_skb (net/ipv4/tcp_output.c:3387)
tcp_retransmit_skb (net/ipv4/tcp_output.c:3404)
tcp_retransmit_timer (net/ipv4/tcp_timer.c:604)
tcp_write_timer (./include/linux/spinlock.h:391 net/ipv4/tcp_timer.c:716)
The panic issue was trigered by tcp simultaneous initiation.
The initiation process is as follows:
TCP A TCP B
1. CLOSED CLOSED
2. SYN-SENT --> <SEQ=100><CTL=SYN> ...
3. SYN-RECEIVED <-- <SEQ=300><CTL=SYN> <-- SYN-SENT
4. ... <SEQ=100><CTL=SYN> --> SYN-RECEIVED
5. SYN-RECEIVED --> <SEQ=100><ACK=301><CTL=SYN,ACK> ...
// TCP B: not send challenge ack for ack limit or packet loss
// TCP A: close
tcp_close
tcp_send_fin
if (!tskb && tcp_under_memory_pressure(sk))
tskb = skb_rb_last(&sk->tcp_rtx_queue); //pick SYN_ACK packet
TCP_SKB_CB(tskb)->tcp_flags |= TCPHDR_FIN; // set FIN flag
6. FIN_WAIT_1 --> <SEQ=100><ACK=301><END_SEQ=102><CTL=SYN,FIN,ACK> ...
// TCP B: send challenge ack to SYN_FIN_ACK
7. ... <SEQ=301><ACK=101><CTL=ACK> <-- SYN-RECEIVED //challenge ack
// TCP A: <SND.UNA=101>
8. FIN_WAIT_1 --> <SEQ=101><ACK=301><END_SEQ=102><CTL=SYN,FIN,ACK> ... // retransmit panic
__tcp_retransmit_skb //skb->len=0
tcp_trim_head
len = tp->snd_una - TCP_SKB_CB(skb)->seq // len=101-100
__pskb_trim_head
skb->data_len -= len // skb->len=-1, wrap around
... ...
ip_fragment
icmp_glue_bits //BUG_ON
If we use tcp_trim_head() to remove acked SYN from packet that contains data
or other flags, skb->len will be incorrectly decremented. We can remove SYN
flag that has been acked from rtx_queue earlier than tcp_trim_head(), which
can fix the problem mentioned above.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Co-developed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Dong Chenchen <dongchenchen2@huawei.com>
Link: https://lore.kernel.org/r/20231210020200.1539875-1-dongchenchen2@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
qed_ilt_shadow_alloc() will call qed_ilt_shadow_free() to
free p_hwfn->p_cxt_mngr->ilt_shadow on error. However,
qed_cxt_tables_alloc() accesses the freed pointer on failure
of qed_ilt_shadow_alloc() through calling qed_cxt_mngr_free(),
which may lead to use-after-free. Fix this issue by setting
p_mngr->ilt_shadow to NULL in qed_ilt_shadow_free().
Fixes: fe56b9e6a8 ("qed: Add module with basic common support")
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Dinghao Liu <dinghao.liu@zju.edu.cn>
Link: https://lore.kernel.org/r/20231210045255.21383-1-dinghao.liu@zju.edu.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Because rose_ioctl() accesses sk->sk_receive_queue
without holding a sk->sk_receive_queue.lock, it can
cause a race with rose_accept().
A use-after-free for skb occurs with the following flow.
```
rose_ioctl() -> skb_peek()
rose_accept() -> skb_dequeue() -> kfree_skb()
```
Add sk->sk_receive_queue.lock to rose_ioctl() to fix this issue.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Hyunwoo Kim <v4bel@theori.io>
Link: https://lore.kernel.org/r/20231209100538.GA407321@v4bel-B760M-AORUS-ELITE-AX
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Because do_vcc_ioctl() accesses sk->sk_receive_queue
without holding a sk->sk_receive_queue.lock, it can
cause a race with vcc_recvmsg().
A use-after-free for skb occurs with the following flow.
```
do_vcc_ioctl() -> skb_peek()
vcc_recvmsg() -> skb_recv_datagram() -> skb_free_datagram()
```
Add sk->sk_receive_queue.lock to do_vcc_ioctl() to fix this issue.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Hyunwoo Kim <v4bel@theori.io>
Link: https://lore.kernel.org/r/20231209094210.GA403126@v4bel-B760M-AORUS-ELITE-AX
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The current implementation's default Pause Forward setting is causing
unnecessary network traffic. This patch disables Pause Forward to
address this issue.
Fixes: 1121f6b02e ("octeontx2-af: Priority flow control configuration support")
Signed-off-by: Hariprasad Kelam <hkelam@marvell.com>
Signed-off-by: Sunil Kovvuri Goutham <sgoutham@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Hariprasad Kelam says:
====================
octeontx2: Fix issues with promisc/allmulti mode
When interface is configured in promisc/all multi mode, low network
performance observed. This series patches address the same.
Patch1: Change the promisc/all multi mcam entry action to unicast if
there are no trusted vfs associated with PF.
Patch2: Configures RSS flow algorithm in promisc/all multi mcam entries
to address flow distribution issues.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The RSS flow algorithm is not set up correctly for promiscuous or all
multi MCAM entries. This has an impact on flow distribution.
This patch fixes the issue by updating flow algorithm index in above
mentioned MCAM entries.
Fixes: 967db3529e ("octeontx2-af: add support for multicast/promisc packet replication feature")
Signed-off-by: Hariprasad Kelam <hkelam@marvell.com>
Signed-off-by: Sunil Kovvuri Goutham <sgoutham@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Current implementation is such that, promisc mcam entry action
is set as multicast even when there are no trusted VFs. multicast
action causes the hardware to copy packet data, which reduces
the performance.
This patch fixes this issue by setting the promisc mcam entry action to
unicast instead of multicast when there are no trusted VFs. The same
change is made for the 'allmulti' mcam entry action.
Fixes: ffd2f89ad0 ("octeontx2-pf: Enable promisc/allmulti match MCAM entries.")
Signed-off-by: Hariprasad Kelam <hkelam@marvell.com>
Signed-off-by: Sunil Kovvuri Goutham <sgoutham@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The firmware ready value is 1, and get firmware ready status
function should explicitly test for that value. The firmware
ready value read will be 2 after driver load, and on unbind
till firmware rewrites the firmware ready back to 0, the value
seen by driver will be 2, which should be regarded as not ready.
Fixes: 10c073e404 ("octeon_ep: defer probe if firmware not ready")
Signed-off-by: Shinas Rasheed <srasheed@marvell.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The rvu_dl will be freed in rvu_nix_health_reporters_destroy(rvu_dl)
after the create_workqueue fails, and after that free, the rvu_dl will
be translate back through the following call chain:
rvu_nix_health_reporters_destroy
|-> rvu_nix_health_reporters_create
|-> rvu_health_reporters_create
|-> rvu_register_dl (label err_dl_health)
Finally. in the err_dl_health label, rvu_dl being freed again in
rvu_health_reporters_destroy(rvu) by rvu_nix_health_reporters_destroy.
In the second calls of rvu_nix_health_reporters_destroy, however,
it uses rvu_dl->rvu_nix_health_reporter, which is already freed at
the end of rvu_nix_health_reporters_destroy in the first call.
So this patch prevents the first destroy by instantly returning -ENONMEN
when create_workqueue fails. In addition, since the failure of
create_workqueue is the only entrence of label err, it has been
integrated into the error-handling path of create_workqueue.
Fixes: 5ed66306ea ("octeontx2-af: Add devlink health reporters for NIX")
Signed-off-by: Zhipeng Lu <alexious@zju.edu.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
The old implementation extracted VLAN TCI info from the payload
before the VLAN tag has been pushed in the payload.
Another problem was that the VLAN TCI was extracted even if the
packet did not have VLAN protocol header.
This resulted in invalid VLAN TCI and as a consequence a random
queue was computed.
This patch fixes the above issues and use the VLAN TCI from the
skb if it is present or VLAN TCI from payload if present. If no
VLAN header is present queue 0 is selected.
Fixes: 52c4a1a85f ("net: fec: add ndo_select_queue to fix TX bandwidth fluctuations")
Signed-off-by: Radu Bulie <radu-andrei.bulie@nxp.com>
Signed-off-by: Wei Fang <wei.fang@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
File reference cycles have caused lots of problems for io_uring
in the past, and it still doesn't work exactly right and races with
unix_stream_read_generic(). The safest fix would be to completely
disallow sending io_uring files via sockets via SCM_RIGHT, so there
are no possible cycles invloving registered files and thus rendering
SCM accounting on the io_uring side unnecessary.
Cc: stable@vger.kernel.org
Fixes: 0091bfc817 ("io_uring/af_unix: defer registered files gc to io_uring release")
Reported-and-suggested-by: Jann Horn <jannh@google.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
As &card->tx_queue_lock is acquired under softirq context along the
following call chain from solos_bh(), other acquisition of the same
lock inside process context should disable at least bh to avoid double
lock.
<deadlock #2>
pclose()
--> spin_lock(&card->tx_queue_lock)
<interrupt>
--> solos_bh()
--> fpga_tx()
--> spin_lock(&card->tx_queue_lock)
This flaw was found by an experimental static analysis tool I am
developing for irq-related deadlock.
To prevent the potential deadlock, the patch uses spin_lock_bh()
on &card->tx_queue_lock under process context code consistently to
prevent the possible deadlock scenario.
Fixes: 213e85d389 ("solos-pci: clean up pclose() function")
Signed-off-by: Chengfeng Ye <dg573847474@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As &card->cli_queue_lock is acquired under softirq context along the
following call chain from solos_bh(), other acquisition of the same
lock inside process context should disable at least bh to avoid double
lock.
<deadlock #1>
console_show()
--> spin_lock(&card->cli_queue_lock)
<interrupt>
--> solos_bh()
--> spin_lock(&card->cli_queue_lock)
This flaw was found by an experimental static analysis tool I am
developing for irq-related deadlock.
To prevent the potential deadlock, the patch uses spin_lock_bh()
on the card->cli_queue_lock under process context code consistently
to prevent the possible deadlock scenario.
Fixes: 9c54004ea7 ("atm: Driver for Solos PCI ADSL2+ card.")
Signed-off-by: Chengfeng Ye <dg573847474@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When the chip is configured to timestamp all receive packets, the
timestamp in the RX completion is only valid if the metadata
present flag is not set for packets received on the wire. In
addition, internal loopback packets will never have a valid timestamp
and the timestamp field will always be zero. We must exclude
any 0 value in the timestamp field because there is no way to
determine if it is a loopback packet or not.
Add a new function bnxt_rx_ts_valid() to check for all timestamp
valid conditions.
Fixes: 66ed81dced ("bnxt_en: Enable packet timestamping for all RX packets")
Reviewed-by: Andy Gospodarek <andrew.gospodarek@broadcom.com>
Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Link: https://lore.kernel.org/r/20231208001658.14230-5-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The wait_event_interruptible_timeout() function returns 0
if the timeout elapsed, -ERESTARTSYS if it was interrupted
by a signal, and the remaining jiffies otherwise if the
condition evaluated to true before the timeout elapsed.
Driver should have checked for zero return value instead of
a positive value.
MChan: Print a warning for -ERESTARTSYS. The close operation
will proceed anyway when wait_event_interruptible_timeout()
returns for any reason. Since we do the close no matter what,
we should not return this error code to the caller. Change
bnxt_close_nic() to a void function and remove all error
handling from some of the callers.
Fixes: c0c050c58d ("bnxt_en: New Broadcom ethernet driver.")
Reviewed-by: Andy Gospodarek <andrew.gospodarek@broadcom.com>
Reviewed-by: Vikas Gupta <vikas.gupta@broadcom.com>
Reviewed-by: Somnath Kotur <somnath.kotur@broadcom.com>
Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Link: https://lore.kernel.org/r/20231208001658.14230-4-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
After commit 939463016b ("tcp: change data receiver flowlabel after one dup")
we noticed an increase of TCPACKSkippedPAWS events.
Neal Cardwell tracked the issue to tcp_disordered_ack() assumption
about remote peer TS clock.
RFC 1323 & 7323 are suggesting the following:
"timestamp clock frequency in the range 1 ms to 1 sec per tick
between 1ms and 1sec."
This has to be adjusted for 1 MHz clock frequency.
This hints at reorders of SACK packets on send side,
this might deserve a future patch.
(skb->ooo_okay is always set for pure ACK packets)
Fixes: 614e8316aa ("tcp: add support for usec resolution in TCP TS values")
Co-developed-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: David Morley <morleyd@google.com>
Link: https://lore.kernel.org/r/20231207181342.525181-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Stefan Wahren says:
====================
qca_spi: collection of major fixes
This series contains a collection of major fixes for the qca_spi driver,
which has been recently discovered.
====================
Link: https://lore.kernel.org/r/20231206141222.52029-1-wahrenst@gmx.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In case of a reset triggered by the QCA7000 itself, the behavior of the
qca_spi driver was not quite correct:
- in case of a pending RX frame decoding the drop counter must be
incremented and decoding state machine reseted
- also the reset counter must always be incremented regardless of sync
state
Fixes: 291ab06ecf ("net: qualcomm: new Ethernet over SPI driver for QCA7000")
Signed-off-by: Stefan Wahren <wahrenst@gmx.net>
Link: https://lore.kernel.org/r/20231206141222.52029-4-wahrenst@gmx.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
After calling ethtool -g it was not possible to adjust the TX ring
size again:
# ethtool -g eth1
Ring parameters for eth1:
Pre-set maximums:
RX: 4
RX Mini: n/a
RX Jumbo: n/a
TX: 10
Current hardware settings:
RX: 4
RX Mini: n/a
RX Jumbo: n/a
TX: 10
# ethtool -G eth1 tx 8
netlink error: Invalid argument
The reason for this is that the readonly setting rx_pending get
initialized and after that the range check in qcaspi_set_ringparam()
fails regardless of the provided parameter. So fix this by accepting
the exposed RX defaults. Instead of adding another magic number
better use a new define here.
Fixes: 291ab06ecf ("net: qualcomm: new Ethernet over SPI driver for QCA7000")
Suggested-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Stefan Wahren <wahrenst@gmx.net>
Link: https://lore.kernel.org/r/20231206141222.52029-3-wahrenst@gmx.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>