Replace kfree_skb() with kfree_skb_reason() in __udp_queue_rcv_skb().
Following new drop reasons are introduced:
SKB_DROP_REASON_SOCKET_RCVBUFF
SKB_DROP_REASON_PROTO_MEM
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace kfree_skb() with kfree_skb_reason() in ip_protocol_deliver_rcu().
Following new drop reasons are introduced:
SKB_DROP_REASON_XFRM_POLICY
SKB_DROP_REASON_IP_NOPROTO
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace kfree_skb() with kfree_skb_reason() in ip_rcv_finish_core(),
following drop reasons are introduced:
SKB_DROP_REASON_IP_RPFILTER
SKB_DROP_REASON_UNICAST_IN_L2_MULTICAST
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace kfree_skb() with kfree_skb_reason() in ip_rcv_core(). Three new
drop reasons are introduced:
SKB_DROP_REASON_OTHERHOST
SKB_DROP_REASON_IP_CSUM
SKB_DROP_REASON_IP_INHDR
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace kfree_skb() with kfree_skb_reason() in nf_hook_slow() when
skb is dropped by reason of NF_DROP. Following new drop reasons
are introduced:
SKB_DROP_REASON_NETFILTER_DROP
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add document for following existing drop reasons:
SKB_DROP_REASON_NOT_SPECIFIED
SKB_DROP_REASON_NO_SOCKET
SKB_DROP_REASON_PKT_TOO_SMALL
SKB_DROP_REASON_TCP_CSUM
SKB_DROP_REASON_SOCKET_FILTER
SKB_DROP_REASON_UDP_CSUM
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
While testing a patch that will follow later
("net: add netns refcount tracker to struct nsproxy")
I found that devtmpfs_init() was called before init_net
was initialized.
This is a bug, because devtmpfs_setup() calls
ksys_unshare(CLONE_NEWNS);
This has the effect of increasing init_net refcount,
which will be later overwritten to 1, as part of setup_net(&init_net)
We had too many prior patches [1] trying to work around the root cause.
Really, make sure init_net is in BSS section, and that net_ns_init()
is called earlier at boot time.
Note that another patch ("vfs: add netns refcount tracker
to struct fs_context") also will need net_ns_init() being called
before vfs_caches_init()
As a bonus, this patch saves around 4KB in .data section.
[1]
f8c46cb390 ("netns: do not call pernet ops for not yet set up init_net namespace")
b5082df801 ("net: Initialise init_net.count to 1")
734b65417b ("net: Statically initialize init_net.dev_base_head")
v2: fixed a build error reported by kernel build bots (CONFIG_NET=n)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We have plans for increasing MAX_SKB_FRAGS, but sk_msg_sg::copy
is currently an unsigned long, limiting MAX_SKB_FRAGS to 30 on 32bit arches.
Convert it to a bitmap, as Jakub suggested.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We are still chasing some syzbot reports where we think a rogue dev_put()
is called with no corresponding prior dev_hold().
Unfortunately it eats a reference on dev->dev_refcnt taken by innocent
dev_hold_track(), meaning that the refcount saturation splat comes
too late to be useful.
Make sure that 'not tracked' dev_put() and dev_hold() better use
CONFIG_NET_DEV_REFCNT_TRACKER=y debug infrastructure:
Prior patch in the series allowed ref_tracker_alloc() and ref_tracker_free()
to be called with a NULL @trackerp parameter, and to use a separate refcount
only to detect too many put() even in the following case:
dev_hold_track(dev, tracker_1, GFP_ATOMIC);
dev_hold(dev);
dev_put(dev);
dev_put(dev); // Should complain loudly here.
dev_put_track(dev, tracker_1); // instead of here
Add clarification about netdev_tracker_alloc() role.
v2: I replaced the dev_put() in linkwatch_do_dev()
with __dev_put() because callers called netdev_tracker_free().
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We are still chasing a netdev refcount imbalance, and we suspect
we have one rogue dev_put() that is consuming a reference taken
from a dev_hold_track()
To detect this case, allow ref_tracker_alloc() and ref_tracker_free()
to be called with a NULL @trackerp parameter, and use a dedicated
refcount_t just for them.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Whenever ref_tracker_dir_init() is called, mark the struct ref_tracker_dir
as dead.
Test the dead status from ref_tracker_alloc() and ref_tracker_free()
This should detect buggy dev_put()/dev_hold() happening too late
in netdevice dismantle process.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This fixes minor data-races in ip6_mc_input() and
batadv_mcast_mla_rtr_flags_softif_get_ipv6()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
phylink_set_10g_modes() is no longer used with the conversion of
drivers to phylink_generic_validate(), so we can remove it.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
While inspecting some perf report, I noticed that the compiler
emits suboptimal code for the napi CB initialization, fetching
and storing multiple times the memory for flags bitfield.
This is with gcc 10.3.1, but I observed the same with older compiler
versions.
We can help the compiler to do a nicer work clearing several
fields at once using an u32 alias. The generated code is quite
smaller, with the same number of conditional.
Before:
objdump -t net/core/gro.o | grep " F .text"
0000000000000bb0 l F .text 0000000000000357 dev_gro_receive
After:
0000000000000bb0 l F .text 000000000000033c dev_gro_receive
v1 -> v2:
- use struct_group (Alexander and Alex)
RFC -> v1:
- use __struct_group to delimit the zeroed area (Alexander)
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently tc skb extension is used to send miss info from
tc to ovs datapath module, and driver to tc. For the tc to ovs
miss it is currently always allocated even if it will not
be used by ovs datapath (as it depends on a requested feature).
Export the static key which is used by openvswitch module to
guard this code path as well, so it will be skipped if ovs
datapath doesn't need it. Enable this code path once
ovs datapath needs it.
Signed-off-by: Paul Blakey <paulb@nvidia.com>
Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add the insertion frequency uapi for IOAM lwtunnels.
Signed-off-by: Justin Iurman <justin.iurman@uliege.be>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
TLS recvmsg() passes user pages as destination for decrypt.
The decrypt operation is repeated record by record, each
record being 16kB, max. TLS allocates an sg_table and uses
iov_iter_get_pages() to populate it with enough pages to
fit the decrypted record.
Even though we decrypt a single message at a time we size
the sg_table based on the entire length of the iovec.
This leads to unnecessarily large allocations, risking
triggering OOM conditions.
Use iov_iter_truncate() / iov_iter_reexpand() to construct
a "capped" version of iov_iter_npages(). Alternatively we
could parametrize iov_iter_npages() to take the size as
arg instead of using i->count, or do something else..
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tony Nguyen says:
====================
40GbE Intel Wired LAN Driver Updates 2022-02-03
This series contains updates to the i40e client header file and driver.
Mateusz disables HW TC offload by default.
Joe Damato removes a no longer used statistic.
Jakub Kicinski removes an unused enum from the client header file.
Jedrzej changes some admin queue commands to occur under atomic context
and adds new functions for admin queue MAC VLAN filters to avoid a
potential race that could occur due storing results in a structure that
could be overwritten by the next admin queue call.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from Jakub Kicinski:
"Including fixes from bpf, netfilter, and ieee802154.
Current release - regressions:
- Partially revert "net/smc: Add netlink net namespace support", fix
uABI breakage
- netfilter:
- nft_ct: fix use after free when attaching zone template
- nft_byteorder: track register operations
Previous releases - regressions:
- ipheth: fix EOVERFLOW in ipheth_rcvbulk_callback
- phy: qca8081: fix speeds lower than 2.5Gb/s
- sched: fix use-after-free in tc_new_tfilter()
Previous releases - always broken:
- tcp: fix mem under-charging with zerocopy sendmsg()
- tcp: add missing tcp_skb_can_collapse() test in
tcp_shift_skb_data()
- neigh: do not trigger immediate probes on NUD_FAILED from
neigh_managed_work, avoid a deadlock
- bpf: use VM_MAP instead of VM_ALLOC for ringbuf, avoid KASAN
false-positives
- netfilter: nft_reject_bridge: fix for missing reply from prerouting
- smc: forward wakeup to smc socket waitqueue after fallback
- ieee802154:
- return meaningful error codes from the netlink helpers
- mcr20a: fix lifs/sifs periods
- at86rf230, ca8210: stop leaking skbs on error paths
- macsec: add missing un-offload call for NETDEV_UNREGISTER of parent
- ax25: add refcount in ax25_dev to avoid UAF bugs
- eth: mlx5e:
- fix SFP module EEPROM query
- fix broken SKB allocation in HW-GRO
- IPsec offload: fix tunnel mode crypto for non-TCP/UDP flows
- eth: amd-xgbe:
- fix skb data length underflow
- ensure reset of the tx_timer_active flag, avoid Tx timeouts
- eth: stmmac: fix runtime pm use in stmmac_dvr_remove()
- eth: e1000e: handshake with CSME starts from Alder Lake platforms"
* tag 'net-5.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (69 commits)
ax25: fix reference count leaks of ax25_dev
net: stmmac: ensure PTP time register reads are consistent
net: ipa: request IPA register values be retained
dt-bindings: net: qcom,ipa: add optional qcom,qmp property
tools/resolve_btfids: Do not print any commands when building silently
bpf: Use VM_MAP instead of VM_ALLOC for ringbuf
net, neigh: Do not trigger immediate probes on NUD_FAILED from neigh_managed_work
tcp: add missing tcp_skb_can_collapse() test in tcp_shift_skb_data()
net: sparx5: do not refer to skb after passing it on
Partially revert "net/smc: Add netlink net namespace support"
net/mlx5e: Avoid field-overflowing memcpy()
net/mlx5e: Use struct_group() for memcpy() region
net/mlx5e: Avoid implicit modify hdr for decap drop rule
net/mlx5e: IPsec: Fix tunnel mode crypto offload for non TCP/UDP traffic
net/mlx5e: IPsec: Fix crypto offload for non TCP/UDP encapsulated traffic
net/mlx5e: Don't treat small ceil values as unlimited in HTB offload
net/mlx5: E-Switch, Fix uninitialized variable modact
net/mlx5e: Fix handling of wrong devices during bond netevent
net/mlx5e: Fix broken SKB allocation in HW-GRO
net/mlx5e: Fix wrong calculation of header index in HW_GRO
...
The previous commit d01ffb9eee ("ax25: add refcount in ax25_dev
to avoid UAF bugs") introduces refcount into ax25_dev, but there
are reference leak paths in ax25_ctl_ioctl(), ax25_fwd_ioctl(),
ax25_rt_add(), ax25_rt_del() and ax25_rt_opt().
This patch uses ax25_dev_put() and adjusts the position of
ax25_addr_ax25dev() to fix reference cout leaks of ax25_dev.
Fixes: d01ffb9eee ("ax25: add refcount in ax25_dev to avoid UAF bugs")
Signed-off-by: Duoming Zhou <duoming@zju.edu.cn>
Reviewed-by: Dan Carpenter <dan.carpenter@oracle.com>
Link: https://lore.kernel.org/r/20220203150811.42256-1-duoming@zju.edu.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This reverts commit 774a1221e8.
We need to finish all async code before the module init sequence is
done. In the reverted commit the PF_USED_ASYNC flag was added to mark a
thread that called async_schedule(). Then the PF_USED_ASYNC flag was
used to determine whether or not async_synchronize_full() needs to be
invoked. This works when modprobe thread is calling async_schedule(),
but it does not work if module dispatches init code to a worker thread
which then calls async_schedule().
For example, PCI driver probing is invoked from a worker thread based on
a node where device is attached:
if (cpu < nr_cpu_ids)
error = work_on_cpu(cpu, local_pci_probe, &ddi);
else
error = local_pci_probe(&ddi);
We end up in a situation where a worker thread gets the PF_USED_ASYNC
flag set instead of the modprobe thread. As a result,
async_synchronize_full() is not invoked and modprobe completes without
waiting for the async code to finish.
The issue was discovered while loading the pm80xx driver:
(scsi_mod.scan=async)
modprobe pm80xx worker
...
do_init_module()
...
pci_call_probe()
work_on_cpu(local_pci_probe)
local_pci_probe()
pm8001_pci_probe()
scsi_scan_host()
async_schedule()
worker->flags |= PF_USED_ASYNC;
...
< return from worker >
...
if (current->flags & PF_USED_ASYNC) <--- false
async_synchronize_full();
Commit 21c3c5d280 ("block: don't request module during elevator init")
fixed the deadlock issue which the reverted commit 774a1221e8
("module, async: async_synchronize_full() on module init iff async is
used") tried to fix.
Since commit 0fdff3ec6d ("async, kmod: warn on synchronous
request_module() from async workers") synchronous module loading from
async is not allowed.
Given that the original deadlock issue is fixed and it is no longer
allowed to call synchronous request_module() from async we can remove
PF_USED_ASYNC flag to make module init consistently invoke
async_synchronize_full() unless async module probe is requested.
Signed-off-by: Igor Pylypiv <ipylypiv@google.com>
Reviewed-by: Changyuan Lyu <changyuanl@google.com>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Given that standalone ports are now configured to bypass the ATU and
forward all frames towards the upstream port, extend the ATU bypass to
multichip systems.
Load VID 0 (standalone) into the VTU with the policy bit set. Since
VID 4095 (bridged) is already loaded, we now know that all VIDs in use
are always available in all VTUs. Therefore, we can safely enable
802.1Q on DSA ports.
Setting the DSA ports' VTU policy to TRAP means that all incoming
frames on VID 0 will be classified as MGMT - as a result, the ATU is
bypassed on all subsequent switches.
With this isolation in place, we are able to support configurations
that are simultaneously very quirky and very useful. Quirky because it
involves looping cables between local switchports like in this
example:
CPU
| .------.
.---0---. | .----0----.
| sw0 | | | sw1 |
'-1-2-3-' | '-1-2-3-4-'
$ @ '---' $ @ % %
We have three physically looped pairs ($, @, and %).
This is very useful because it allows us to run the kernel's
kselftests for the bridge on mv88e6xxx hardware.
Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Clear MapDA on standalone ports to bypass any ATU lookup that might
point the packet in the wrong direction. This means that all packets
are flooded using the PVT config. So make sure that standalone ports
are only allowed to communicate with the local upstream port.
Here is a scenario in which this is needed:
CPU
| .----.
.---0---. | .--0--.
| sw0 | | | sw1 |
'-1-2-3-' | '-1-2-'
'---'
- sw0p1 and sw1p1 are bridged
- sw0p2 and sw1p2 are in standalone mode
- Learning must be enabled on sw0p3 in order for hardware forwarding
to work properly between bridged ports
1. A packet with SA :aa comes in on sw1p2
1a. Egresses sw1p0
1b. Ingresses sw0p3, ATU adds an entry for :aa towards port 3
1c. Egresses sw0p0
2. A packet with DA :aa comes in on sw0p2
2a. If an ATU lookup is done at this point, the packet will be
incorrectly forwarded towards sw0p3. With this change in place,
the ATU is bypassed and the packet is forwarded in accordance
with the PVT, which only contains the CPU port.
Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This change is meant to permit a driver to perform "fragmenting" of the
page from within the driver instead of the current model which requires
pre-partitioning the page. The main motivation behind this is to support
use cases where the page will be split up by the driver after DMA instead
of before.
With this change it becomes possible to start using page pool to replace
some of the existing use cases where multiple references were being used
for a single page, but the number needed was unknown as the size could be
dynamic.
For example, with this code it would be possible to do something like
the following to handle allocation:
page = page_pool_alloc_pages();
if (!page)
return NULL;
page_pool_fragment_page(page, DRIVER_PAGECNT_BIAS_MAX);
rx_buf->page = page;
rx_buf->pagecnt_bias = DRIVER_PAGECNT_BIAS_MAX;
Then we would process a received buffer by handling it with:
rx_buf->pagecnt_bias--;
Once the page has been fully consumed we could then flush the remaining
instances with:
if (page_pool_defrag_page(page, rx_buf->pagecnt_bias))
continue;
page_pool_put_defragged_page(pool, page -1, !!budget);
The general idea is that we want to have the ability to allocate a page
with excess fragment count and then trim off the unneeded fragments.
Signed-off-by: Alexander Duyck <alexanderduyck@fb.com>
Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
syzkaller was able to trigger a deadlock for NTF_MANAGED entries [0]:
kworker/0:16/14617 is trying to acquire lock:
ffffffff8d4dd370 (&tbl->lock){++-.}-{2:2}, at: ___neigh_create+0x9e1/0x2990 net/core/neighbour.c:652
[...]
but task is already holding lock:
ffffffff8d4dd370 (&tbl->lock){++-.}-{2:2}, at: neigh_managed_work+0x35/0x250 net/core/neighbour.c:1572
The neighbor entry turned to NUD_FAILED state, where __neigh_event_send()
triggered an immediate probe as per commit cd28ca0a3d ("neigh: reduce
arp latency") via neigh_probe() given table lock was held.
One option to fix this situation is to defer the neigh_probe() back to
the neigh_timer_handler() similarly as pre cd28ca0a3d. For the case
of NTF_MANAGED, this deferral is acceptable given this only happens on
actual failure state and regular / expected state is NUD_VALID with the
entry already present.
The fix adds a parameter to __neigh_event_send() in order to communicate
whether immediate probe is allowed or disallowed. Existing call-sites
of neigh_event_send() default as-is to immediate probe. However, the
neigh_managed_work() disables it via use of neigh_event_send_probe().
[0] <TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
print_deadlock_bug kernel/locking/lockdep.c:2956 [inline]
check_deadlock kernel/locking/lockdep.c:2999 [inline]
validate_chain kernel/locking/lockdep.c:3788 [inline]
__lock_acquire.cold+0x149/0x3ab kernel/locking/lockdep.c:5027
lock_acquire kernel/locking/lockdep.c:5639 [inline]
lock_acquire+0x1ab/0x510 kernel/locking/lockdep.c:5604
__raw_write_lock_bh include/linux/rwlock_api_smp.h:202 [inline]
_raw_write_lock_bh+0x2f/0x40 kernel/locking/spinlock.c:334
___neigh_create+0x9e1/0x2990 net/core/neighbour.c:652
ip6_finish_output2+0x1070/0x14f0 net/ipv6/ip6_output.c:123
__ip6_finish_output net/ipv6/ip6_output.c:191 [inline]
__ip6_finish_output+0x61e/0xe90 net/ipv6/ip6_output.c:170
ip6_finish_output+0x32/0x200 net/ipv6/ip6_output.c:201
NF_HOOK_COND include/linux/netfilter.h:296 [inline]
ip6_output+0x1e4/0x530 net/ipv6/ip6_output.c:224
dst_output include/net/dst.h:451 [inline]
NF_HOOK include/linux/netfilter.h:307 [inline]
ndisc_send_skb+0xa99/0x17f0 net/ipv6/ndisc.c:508
ndisc_send_ns+0x3a9/0x840 net/ipv6/ndisc.c:650
ndisc_solicit+0x2cd/0x4f0 net/ipv6/ndisc.c:742
neigh_probe+0xc2/0x110 net/core/neighbour.c:1040
__neigh_event_send+0x37d/0x1570 net/core/neighbour.c:1201
neigh_event_send include/net/neighbour.h:470 [inline]
neigh_managed_work+0x162/0x250 net/core/neighbour.c:1574
process_one_work+0x9ac/0x1650 kernel/workqueue.c:2307
worker_thread+0x657/0x1110 kernel/workqueue.c:2454
kthread+0x2e9/0x3a0 kernel/kthread.c:377
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295
</TASK>
Fixes: 7482e3841d ("net, neigh: Add NTF_MANAGED flag for managed neighbor entries")
Reported-by: syzbot+5239d0e1778a500d477a@syzkaller.appspotmail.com
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Roopa Prabhu <roopa@nvidia.com>
Tested-by: syzbot+5239d0e1778a500d477a@syzkaller.appspotmail.com
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20220201193942.5055-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The change of sizeof(struct smc_diag_linkinfo) by commit 79d39fc503
("net/smc: Add netlink net namespace support") introduced an ABI
regression: since struct smc_diag_lgrinfo contains an object of
type "struct smc_diag_linkinfo", offset of all subsequent members
of struct smc_diag_lgrinfo was changed by that change.
As result, applications compiled with the old version
of struct smc_diag_linkinfo will receive garbage in
struct smc_diag_lgrinfo.role if the kernel implements
this new version of struct smc_diag_linkinfo.
Fix this regression by reverting the part of commit 79d39fc503 that
changes struct smc_diag_linkinfo. After all, there is SMC_GEN_NETLINK
interface which is good enough, so there is probably no need to touch
the smc_diag ABI in the first place.
Fixes: 79d39fc503 ("net/smc: Add netlink net namespace support")
Signed-off-by: Dmitry V. Levin <ldv@altlinux.org>
Reviewed-by: Karsten Graul <kgraul@linux.ibm.com>
Link: https://lore.kernel.org/r/20220202030904.GA9742@altlinux.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When setting RTO through BPF program, some SYN ACK packets were unaffected
and continued to use TCP_TIMEOUT_INIT constant. This patch adds timeout
option to struct request_sock. Option is initialized with TCP_TIMEOUT_INIT
and is reassigned through BPF using tcp_timeout_init call. SYN ACK
retransmits now use newly added timeout option.
Signed-off-by: Akhmat Karakotov <hmukos@yandex-team.ru>
Acked-by: Martin KaFai Lau <kafai@fb.com>
v2:
- Add timeout option to struct request_sock. Do not call
tcp_timeout_init on every syn ack retransmit.
v3:
- Use unsigned long for min. Bound tcp_timeout_init to TCP_RTO_MAX.
v4:
- Refactor duplicate code by adding reqsk_timeout function.
Signed-off-by: David S. Miller <davem@davemloft.net>
Add connect/disconnect helper to assign private struct to the DSA switch.
Add support for Ethernet mgmt and MIB if the DSA driver provide an handler
to correctly parse and elaborate the data.
Signed-off-by: Ansuel Smith <ansuelsmth@gmail.com>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add all the required define to prepare support for mgmt read/write in
Ethernet packet. Any packet of this type has to be dropped as the only
use of these special packet is receive ack for an mgmt write request or
receive data for an mgmt read request.
A struct is used that emulates the Ethernet header but is used for a
different purpose.
Signed-off-by: Ansuel Smith <ansuelsmth@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move tag_qca define to include dir linux/dsa as the qca8k require access
to the tagger define to support in-band mdio read/write using ethernet
packet.
Signed-off-by: Ansuel Smith <ansuelsmth@gmail.com>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Certain drivers may need to send management traffic to the switch for
things like register access, FDB dump, etc, to accelerate what their
slow bus (SPI, I2C, MDIO) can already do.
Ethernet is faster (especially in bulk transactions) but is also more
unreliable, since the user may decide to bring the DSA master down (or
not bring it up), therefore severing the link between the host and the
attached switch.
Drivers needing Ethernet-based register access already should have
fallback logic to the slow bus if the Ethernet method fails, but that
fallback may be based on a timeout, and the I/O to the switch may slow
down to a halt if the master is down, because every Ethernet packet will
have to time out. The driver also doesn't have the option to turn off
Ethernet-based I/O momentarily, because it wouldn't know when to turn it
back on.
Which is where this change comes in. By tracking NETDEV_CHANGE,
NETDEV_UP and NETDEV_GOING_DOWN events on the DSA master, we should know
the exact interval of time during which this interface is reliably
available for traffic. Provide this information to switches so they can
use it as they wish.
An helper is added dsa_port_master_is_operational() to check if a master
port is operational.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Ansuel Smith <ansuelsmth@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In preparation for FORTIFY_SOURCE performing compile-time and run-time
field bounds checking for memcpy(), memmove(), and memset(), avoid
intentionally writing across neighboring fields.
Use struct_group() in struct vlan_ethhdr around members h_dest and
h_source, so they can be referenced together. This will allow memcpy()
and sizeof() to more easily reason about sizes, improve readability,
and avoid future warnings about writing beyond the end of h_dest.
"pahole" shows no size nor member offset changes to struct vlan_ethhdr.
"objdump -d" shows no object code changes.
Fixes: 34802a42b3 ("net/mlx5e: Do not modify the TX SKB")
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Pull unicode cleanup from Gabriel Krisman Bertazi:
"A fix from Christoph Hellwig merging the CONFIG_UNICODE_UTF8_DATA into
the previous CONFIG_UNICODE. It is -rc material since we don't want to
expose the former symbol on 5.17.
This has been living on linux-next for the past week"
* tag 'unicode-for-next-5.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/krisman/unicode:
unicode: clean up the Kconfig symbol confusion
Add the SO_TXREHASH socket option to control hash rethink behavior per socket.
When default mode is set, sockets disable rehash at initialization and use
sysctl option when entering listen state. setsockopt() overrides default
behavior.
Signed-off-by: Akhmat Karakotov <hmukos@yandex-team.ru>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a per ns sysctl that controls the txhash rethink behavior:
net.core.txrehash. When enabled, the same behavior is retained,
when disabled, rethink is not performed. Sysctl is enabled by default.
Signed-off-by: Akhmat Karakotov <hmukos@yandex-team.ru>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After following the call tree of phy_set_max_speed(), it became clear
that this function never returns anything but 0, so we can change its
result type to *void* and drop the result checks from the three drivers
that actually bothered to do it...
Found by Linux Verification Center (linuxtesting.org) with the SVACE static
analysis tool.
Signed-off-by: Sergey Shtylyov <s.shtylyov@omp.ru>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
When CONFIG_CGROUPS is disabled psi code generates the following
warnings:
kernel/sched/psi.c:1112:21: warning: no previous prototype for 'psi_trigger_create' [-Wmissing-prototypes]
1112 | struct psi_trigger *psi_trigger_create(struct psi_group *group,
| ^~~~~~~~~~~~~~~~~~
kernel/sched/psi.c:1182:6: warning: no previous prototype for 'psi_trigger_destroy' [-Wmissing-prototypes]
1182 | void psi_trigger_destroy(struct psi_trigger *t)
| ^~~~~~~~~~~~~~~~~~~
kernel/sched/psi.c:1249:10: warning: no previous prototype for 'psi_trigger_poll' [-Wmissing-prototypes]
1249 | __poll_t psi_trigger_poll(void **trigger_ptr,
| ^~~~~~~~~~~~~~~~
Change the declarations of these functions in the header to provide the
prototypes even when they are unused.
Link: https://lkml.kernel.org/r/20220119223940.787748-2-surenb@google.com
Fixes: 0e94682b73 ("psi: introduce psi monitor")
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reported-by: kernel test robot <lkp@intel.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since v2.5.44 and addition of ip_options_fragment()
ip_options_build() does not render headers for fragments
directly. @is_frag is always 0.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull tty/serial driver fixes from Greg KH:
"Here are some small bug fixes and reverts for reported problems with
the tty core and drivers. They include:
- revert the fifo use for the 8250 console mode. It caused too many
regressions and problems, and had a bug in it as well. This is
being reworked and should show up in a later -rc1 release, but it's
not ready for 5.17
- rpmsg tty race fix
- restore the cyclades.h uapi header file. Turns out a compiler test
suite used it for some unknown reason. Bring it back just for the
parts that are used by the builder test so they continue to build.
No functionality is restored as no one actually has this hardware
anymore, nor is it really tested.
- stm32 driver fixes
- n_gsm flow control fixes
- pl011 driver fix
- rs485 initialization fix
All of these have been in linux-next this week with no reported
problems"
* tag 'tty-5.17-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
kbuild: remove include/linux/cyclades.h from header file check
serial: core: Initialize rs485 RTS polarity already on probe
serial: pl011: Fix incorrect rs485 RTS polarity on set_mctrl
serial: stm32: fix software flow control transfer
serial: stm32: prevent TDR register overwrite when sending x_char
tty: n_gsm: fix SW flow control encoding/handling
serial: 8250: of: Fix mapped region size when using reg-offset property
tty: rpmsg: Fix race condition releasing tty port
tty: Partially revert the removal of the Cyclades public API
tty: Add support for Brainboxes UC cards.
Revert "tty: serial: Use fifo in 8250 console driver"
Pull USB driver fixes from Greg KH:
"Here are some small USB driver fixes for 5.17-rc2 that resolve a
number of reported problems. These include:
- typec driver fixes
- xhci platform driver fixes for suspending
- ulpi core fix
- role.h build fix
- new device ids
- syzbot-reported bugfixes
- gadget driver fixes
- dwc3 driver fixes
- other small fixes
All of these have been in linux-next this week with no reported
issues"
* tag 'usb-5.17-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
usb: cdnsp: Fix segmentation fault in cdns_lost_power function
usb: dwc2: gadget: don't try to disable ep0 in dwc2_hsotg_suspend
usb: gadget: at91_udc: fix incorrect print type
usb: dwc3: xilinx: Fix error handling when getting USB3 PHY
usb: dwc3: xilinx: Skip resets and USB3 register settings for USB2.0 mode
usb: xhci-plat: fix crash when suspend if remote wake enable
usb: common: ulpi: Fix crash in ulpi_match()
usb: gadget: f_sourcesink: Fix isoc transfer for USB_SPEED_SUPER_PLUS
ucsi_ccg: Check DEV_INT bit only when starting CCG4
USB: core: Fix hang in usb_kill_urb by adding memory barriers
usb-storage: Add unusual-devs entry for VL817 USB-SATA bridge
usb: typec: tcpm: Do not disconnect when receiving VSAFE0V
usb: typec: tcpm: Do not disconnect while receiving VBUS off
usb: typec: Don't try to register component master without components
usb: typec: Only attempt to link USB ports if there is fwnode
usb: typec: tcpci: don't touch CC line if it's Vconn source
usb: roles: fix include/linux/usb/role.h compile issue
Pull block fixes from Jens Axboe:
- NVMe pull request
- add the IGNORE_DEV_SUBNQN quirk for Intel P4500/P4600 SSDs (Wu
Zheng)
- remove the unneeded ret variable in nvmf_dev_show (Changcheng
Deng)
- Fix for a hang regression introduced with a patch in the merge
window, where low queue depth devices would not always get woken
correctly (Laibin)
- Small series fixing an IO accounting issue with bio backed dm devices
(Mike, Yu)
* tag 'block-5.17-2022-01-28' of git://git.kernel.dk/linux-block:
dm: properly fix redundant bio-based IO accounting
dm: revert partial fix for redundant bio-based IO accounting
block: add bio_start_io_acct_time() to control start_time
blk-mq: Fix wrong wakeup batch configuration which will cause hang
nvme-fabrics: remove the unneeded ret variable in nvmf_dev_show
nvme-pci: add the IGNORE_DEV_SUBNQN quirk for Intel P4500/P4600 SSDs
blk-mq: fix missing blk_account_io_done() in error path
block: fix memory leak in disk_register_independent_access_ranges
Pull security sybsystem fix from James Morris:
"Fix NULL pointer crash in LSM via Ceph, from Vivek Goyal"
* tag 'fixes-v5.17-lsm-ceph-null' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security:
security, lsm: dentry_init_security() Handle multi LSM registration