Accesses to the PTP_SUBNANOSEC_RATE register are done through a
hardcoded address which doesn't match with the KSZ8463's register
layout.
Add a new entry for the PTP_SUBNANOSEC_RATE register in the regs[]
tables.
Use the regs[] table to retrieve the PTP_SUBNANOSEC_RATE register
address when accessing it.
Remove the macro defining the address to prevent further use.
Signed-off-by: Bastien Curutchet (Schneider Electric) <bastien.curutchet@bootlin.com>
Link: https://patch.msgid.link/20260105-ksz-rework-v1-7-a68df7f57375@bootlin.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Accesses to the PTP_RTC_SUB_NANOSEC register are done through a
hardcoded address which doesn't match with the KSZ8463's register
layout.
Add a new entry for the PTP_RTC_SUB_NANOSEC register in the regs[]
tables.
Use the regs[] table to retrieve the PTP_RTC_SUB_NANOSEC register
address when accessing it.
Remove the macro defining the address to prevent further use.
Signed-off-by: Bastien Curutchet (Schneider Electric) <bastien.curutchet@bootlin.com>
Link: https://patch.msgid.link/20260105-ksz-rework-v1-6-a68df7f57375@bootlin.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Accesses to the PTP_RTC_SEC register are done through a hardcoded
address which doesn't match with the KSZ8463's register layout.
Add a new entry for the PTP_RTC_SEC register in the regs[] tables.
Use the regs[] table to retrieve the PTP_RTC_SEC register address
when accessing it.
Remove the macro defining the address to prevent further use.
Signed-off-by: Bastien Curutchet (Schneider Electric) <bastien.curutchet@bootlin.com>
Link: https://patch.msgid.link/20260105-ksz-rework-v1-5-a68df7f57375@bootlin.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Accesses to the PTP_RTC_NANOSEC register are done through a hardcoded
address which doesn't match with the KSZ8463's register layout.
Add a new entry for the PTP_RTC_NANOSEC register in the regs[] tables.
Use the regs[] table to retrieve the PTP_RTC_NANOSEC register address
when accessing it.
Remove the macro defining the address to prevent further use.
Signed-off-by: Bastien Curutchet (Schneider Electric) <bastien.curutchet@bootlin.com>
Link: https://patch.msgid.link/20260105-ksz-rework-v1-4-a68df7f57375@bootlin.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Accesses to the PTP_CLK_CTRL register are done through a hardcoded
address which doesn't match with the KSZ8463's register layout.
Add a new entry for the PTP_CLK_CTRL register in the regs[] tables.
Use the regs[] table to retrieve the PTP_CLK_CTRL register address
when accessing it.
Remove the macro defining the address to prevent further use.
Signed-off-by: Bastien Curutchet (Schneider Electric) <bastien.curutchet@bootlin.com>
Link: https://patch.msgid.link/20260105-ksz-rework-v1-3-a68df7f57375@bootlin.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The IRQ logic of the KSZ8463 differs from that of other KSZ switches.
It doesn't have a 'mask' register but an 'enable' one instead. The
common IRQ framework can still be used though as soon as we reverse
the logic (using '1' to enable interrupts instead of '0') for KSZ8463
cases.
Move the initialization of the kirq->masked outside of
ksz_irq_common_setup() to keep this function truly common when
IRQ support for the KSZ8463 is added.
Signed-off-by: Bastien Curutchet (Schneider Electric) <bastien.curutchet@bootlin.com>
Link: https://patch.msgid.link/20260105-ksz-rework-v1-1-a68df7f57375@bootlin.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Introduce minimal tests. These can serve as simple illustrative
examples, and as templates when writing new tests.
When adding new cases, it can be easier to extend an existing base
test rather than start from scratch. The existing tests all focus on
real, often non-trivial, features. It is not obvious which to take as
starting point, and arguably none really qualify.
Add two tests
- the client test performs the active open and initial close
- the server test implements the passive open and final close
Signed-off-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20260105172529.3514786-1-willemdebruijn.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The optional and required hints in the tcp_congestion_ops are information
for the user of this interface to signalize its importance when
implementing these functions.
However, cong_avoid comment incorrectly tells that it is required,
in reality congestion control must provide one of either cong_avoid or
cong_control.
In addition, min_tso_segs has not had any comment optional/required
hints. So mark it as optional since it is used only in BBR.
Co-developed-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Daniel Sedlak <daniel.sedlak@cdn77.com>
Link: https://patch.msgid.link/20260105115533.1151442-1-daniel.sedlak@cdn77.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Convert the Google Virtual Ethernet (GVE) driver to use the new
.get_rx_ring_count ethtool operation instead of handling
ETHTOOL_GRXRINGS in .get_rxnfc. This simplifies the code by moving the
ring count query to a dedicated callback.
The new callback provides the same functionality in a more direct way,
following the ongoing ethtool API modernization.
Reviewed-by: Subbaraya Sundeep <sbhatta@marvell.com>
Reviewed-by: Harshitha Ramamurthy <hramamurthy@google.com>
Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://patch.msgid.link/20260105-gxring_google-v2-1-e7cfe924d429@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Some arches (like x86) do not inline spin_unlock_irqrestore().
backlog_unlock_irq_restore() is in RPS/RFS critical path,
we prefer using spin_unlock() + local_irq_restore() for
optimal performance.
Also change backlog_unlock_irq_restore() second argument
to avoid a pointless dereference.
No difference in net/core/dev.o code size.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260105163054.13698-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Replace rio_probe1() printk(KERN_INFO) messages with netdev_{info,dbg}().
Keep one netdev_info() line for device identification; move the rest to
netdev_dbg() to avoid spamming the kernel log.
Log rx_timeout on a separate line since netdev_*() prefixes each
message and the multi-line formatting looks broken otherwise.
No functional change intended.
Tested-on: D-Link DGE-550T Rev-A3
Signed-off-by: Yeounsu Moon <yyyynoom@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20260105130552.8721-2-yyyynoom@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Use DEFINE_RAW_FLEX() to avoid thousands of -Wflex-array-member-not-at-end
warnings.
Remove struct ip_options_data, and adjust the rest of the code so that
flexible-array member struct ip_options_rcu::opt.__data[] ends last
in struct icmp_bxm.
Compensate for this by using the DEFINE_RAW_FLEX() helper to define each
on-stack struct instance that contained struct ip_options_data as a member,
and to define struct ip_options_rcu with a fixed on-stack size for its
nested flexible-array member opt.__data[].
Also, add a couple of code comments to prevent people from adding members
to a struct after another member that contains a flexible array.
With these changes, fix 2600 warnings of the following type:
include/net/inet_sock.h:65:33: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end]
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Link: https://patch.msgid.link/aVteBadWA6AbTp7X@kspp
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The test currently SKIPs if the symmetric RSS xfrm is not enabled
by default. This leads to spurious SKIPs in the Intel CI reporting
results to NIPA.
Testing on CX7:
# ./drivers/net/hw/rss_input_xfrm.py
TAP version 13
1..2
ok 1 rss_input_xfrm.test_rss_input_xfrm_ipv4 # SKIP Test requires IPv4 connectivity
# Sym input xfrm already enabled: {'sym-or-xor'}
ok 2 rss_input_xfrm.test_rss_input_xfrm_ipv6
# Totals: pass:1 fail:0 xfail:0 xpass:0 skip:1 error:0
# ethtool -X eth0 xfrm none
# ./drivers/net/hw/rss_input_xfrm.py
TAP version 13
1..2
ok 1 rss_input_xfrm.test_rss_input_xfrm_ipv4 # SKIP Test requires IPv4 connectivity
# Sym input xfrm configured: {'sym-or-xor'}
ok 2 rss_input_xfrm.test_rss_input_xfrm_ipv6
# Totals: pass:1 fail:0 xfail:0 xpass:0 skip:1 error:0
Link: https://patch.msgid.link/20260104184600.795280-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
IPv6 addresses with the same scope are returned in reverse insertion
order, unlike IPv4. For example, when adding a -> b -> c, the list is
reported as c -> b -> a, while IPv4 preserves the original order.
This behavior causes:
a. When using `ip -6 a save` and `ip -6 a restore`, addresses are restored
in the opposite order from which they were saved. See example below
showing addresses added as 1::1, 1::2, 1::3 but displayed and saved
in reverse order.
# ip -6 a a 1::1 dev x
# ip -6 a a 1::2 dev x
# ip -6 a a 1::3 dev x
# ip -6 a s dev x
2: x: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
inet6 1::3/128 scope global tentative
valid_lft forever preferred_lft forever
inet6 1::2/128 scope global tentative
valid_lft forever preferred_lft forever
inet6 1::1/128 scope global tentative
valid_lft forever preferred_lft forever
# ip -6 a save > dump
# ip -6 a d 1::1 dev x
# ip -6 a d 1::2 dev x
# ip -6 a d 1::3 dev x
# ip a d ::1 dev lo
# ip a restore < dump
# ip -6 a s dev x
2: x: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
inet6 1::1/128 scope global tentative
valid_lft forever preferred_lft forever
inet6 1::2/128 scope global tentative
valid_lft forever preferred_lft forever
inet6 1::3/128 scope global tentative
valid_lft forever preferred_lft forever
# ip a showdump < dump
if1:
inet6 ::1/128 scope host proto kernel_lo
valid_lft forever preferred_lft forever
if2:
inet6 1::3/128 scope global tentative
valid_lft forever preferred_lft forever
if2:
inet6 1::2/128 scope global tentative
valid_lft forever preferred_lft forever
if2:
inet6 1::1/128 scope global tentative
valid_lft forever preferred_lft forever
b. Addresses in pasta to appear in reversed order compared to host
addresses.
The ipv6 addresses were added in reverse order by commit e55ffac601
("[IPV6]: order addresses by scope"), then it was changed by commit
502a2ffd73 ("ipv6: convert idev_list to list macros"), and restored by
commit b54c9b98bb ("ipv6: Preserve pervious behavior in
ipv6_link_dev_addr()."). However, this reverse ordering within the same
scope causes inconsistency with IPv4 and the issues described above.
This patch aligns IPv6 address ordering with IPv4 for consistency
by changing the comparison from >= to > when inserting addresses
into the address list. Also updates the ioam6 selftest to reflect
the new address ordering behavior. Combine these two changes into
one patch for bisectability.
Link: https://bugs.passt.top/show_bug.cgi?id=175
Suggested-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Yumei Huang <yuhuang@redhat.com>
Acked-by: Justin Iurman <justin.iurman@gmail.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20260104032357.38555-1-yuhuang@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Tamir Duberstein says:
====================
rust: net: replace `kernel::c_str!` with C-Strings
C-String literals were added in Rust 1.77. Replace instances of
`kernel::c_str!` with C-String literals where possible.
====================
Link: https://patch.msgid.link/20260103-cstr-net-v2-0-8688f504b85d@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
prestera_ldr_wait_buf() returns the result of readl_poll_timeout(),
which is 0 on success or -ETIMEDOUT on failure. Its current return
type is u32.
Assigning this u32 value to an int variable works today because the
bit pattern of (u32)-ETIMEDOUT (0xffffff92) is correctly interpreted
as -ETIMEDOUT when stored in an int. However, keeping the function
return type as u32 is misleading and fragile.
Change the return type to int to reflect the actual semantic of
returning negative error codes and to avoid potential issues with
unsigned casts. There is no functional change.
Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20260103174313.1172197-1-alok.a.tiwari@oracle.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
At the current moment, the logs for martian packets are as follows:
```
martian source {DST} from {SRC}, on dev {DEV}
martian destination {DST} from {SRC}, dev {DEV}
```
These messages feel rather hard to understand in production, especially
the "martian source" one, mostly because it is grammatically ambitious
to parse which part is now the source address and which part is the
destination address. For example, "{DST}" may there be interpreted as
the actual source address due to following the word "source", thereby
implying the actual source address to be the destination one.
Personally, I discovered this bug while toying around with TUN
interfaces and using them as a tunnel (receiving packets via a TUN
interface and sending them over a TCP stream; receiving packets from a
TCP stream and writing them to a TUN).[^1]
When these IP addresses contained local IPs (i.e. 10.0.0.0/8 in source
and destination), everything worked fine. However, sending them to a
real routable IP address on the internet led to them being treated as a
martian packet, obviously. Using a few sysctl(8) and iptables(8)
settings[^2] fixed it, but while debugging I found the log message
starting with "martian source" rather confusing, as I was unsure on
whether the packet that gets dropped was the packet originating from me
or the response from the endpoint, as "martian source <ROUTABLE IP>"
could also be falsely interpreted as the response packet being martian,
due to the word "source" followed by the routable IP address, implying
the source address of that packet is set to this IP, as explained above.
In the end, I had to look into the source code of the kernel on where
this error message gets generated, which is usually an indicator of
there being room for improvement with regard to this error message.
In terms of improvement, this commit changes the error messages for
martian source and martian destination packets as follows:
```
martian source (src={SRC}, dst={DST}, dev={DEV})
martian destination (src={SRC}, dst={DST}, dev={DEV})
```
These new wordings leave pretty much no room for ambiguity as all
parameters are prefixed with a respective key explaining their semantic
meaning.
See also the following thread on LKML.[^3]
[^1]: <https://backreference.org/2010/03/26/tuntap-interface-tutorial>
[^2]: sysctl net.ipv4.ip_forward=1 && \
iptables -A INPUT -i tun0 -j ACCEPT && \
iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
[^3]: <https://lore.kernel.org/all/aSd4Xj8rHrh-krjy@4944566b5c925f79/>
Signed-off-by: Clara Engler <cve@cve.cx>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20260101125114.2608-1-cve@cve.cx
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pull networking fixes from Paolo Abeni:
"Including fixes from Bluetooth and WiFi. Notably this includes the fix
for the iwlwifi issue you reported.
Current release - regressions:
- core: avoid prefetching NULL pointers
- wifi:
- iwlwifi: implement settime64 as stub for MVM/MLD PTP
- mac80211: fix list iteration in ieee80211_add_virtual_monitor()
- handshake: fix null-ptr-deref in handshake_complete()
- eth: mana: fix use-after-free in reset service rescan path
Previous releases - regressions:
- openvswitch: avoid needlessly taking the RTNL on vport destroy
- dsa: properly keep track of conduit reference
- ipv4:
- fix error route reference count leak with nexthop objects
- fib: restore ECMP balance from loopback
- mptcp: ensure context reset on disconnect()
- bluetooth: fix potential UaF in btusb
- nfc: fix deadlock between nfc_unregister_device and
rfkill_fop_write
- eth:
- gve: defer interrupt enabling until NAPI registration
- i40e: fix scheduling in set_rx_mode
- macb: relocate mog_init_rings() callback from macb_mac_link_up()
to macb_open()
- rtl8150: fix memory leak on usb_submit_urb() failure
- wifi: 8192cu: fix tid out of range in rtl92cu_tx_fill_desc()
Previous releases - always broken:
- ip6_gre: make ip6gre_header() robust
- ipv6: fix a BUG in rt6_get_pcpu_route() under PREEMPT_RT
- af_unix: don't post cmsg for SO_INQ unless explicitly asked for
- phy: mediatek: fix nvmem cell reference leak in
mt798x_phy_calibration
- wifi: mac80211: discard beacon frames to non-broadcast address
- eth:
- iavf: fix off-by-one issues in iavf_config_rss_reg()
- stmmac: fix the crash issue for zero copy XDP_TX action
- team: fix check for port enabled when priority changes"
* tag 'net-6.19-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (64 commits)
ipv6: fix a BUG in rt6_get_pcpu_route() under PREEMPT_RT
net: rose: fix invalid array index in rose_kill_by_device()
net: enetc: do not print error log if addr is 0
net: macb: Relocate mog_init_rings() callback from macb_mac_link_up() to macb_open()
selftests: fib_test: Add test case for ipv4 multi nexthops
net: fib: restore ECMP balance from loopback
selftests: fib_nexthops: Add test cases for error routes deletion
ipv4: Fix reference count leak when using error routes with nexthop objects
net: usb: sr9700: fix incorrect command used to write single register
ipv6: BUG() in pskb_expand_head() as part of calipso_skbuff_setattr()
usbnet: avoid a possible crash in dql_completed()
gve: defer interrupt enabling until NAPI registration
net: stmmac: fix the crash issue for zero copy XDP_TX action
octeontx2-pf: fix "UBSAN: shift-out-of-bounds error"
af_unix: don't post cmsg for SO_INQ unless explicitly asked for
net: mana: Fix use-after-free in reset service rescan path
net: avoid prefetching NULL pointers
net: bridge: Describe @tunnel_hash member in net_bridge_vlan_group struct
net: nfc: fix deadlock between nfc_unregister_device and rfkill_fop_write
net: usb: asix: validate PHY address before use
...
On PREEMPT_RT kernels, after rt6_get_pcpu_route() returns NULL, the
current task can be preempted. Another task running on the same CPU
may then execute rt6_make_pcpu_route() and successfully install a
pcpu_rt entry. When the first task resumes execution, its cmpxchg()
in rt6_make_pcpu_route() will fail because rt6i_pcpu is no longer
NULL, triggering the BUG_ON(prev). It's easy to reproduce it by adding
mdelay() after rt6_get_pcpu_route().
Using preempt_disable/enable is not appropriate here because
ip6_rt_pcpu_alloc() may sleep.
Fix this by handling the cmpxchg() failure gracefully on PREEMPT_RT:
free our allocation and return the existing pcpu_rt installed by
another task. The BUG_ON is replaced by WARN_ON_ONCE for non-PREEMPT_RT
kernels where such races should not occur.
Link: https://syzkaller.appspot.com/bug?extid=9b35e9bc0951140d13e6
Fixes: d2d6422f8b ("x86: Allow to enable PREEMPT_RT.")
Reported-by: syzbot+9b35e9bc0951140d13e6@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/6918cd88.050a0220.1c914e.0045.GAE@google.com/T/
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Link: https://patch.msgid.link/20251223051413.124687-1-jiayuan.chen@linux.dev
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
rose_kill_by_device() collects sockets into a local array[] and then
iterates over them to disconnect sockets bound to a device being brought
down.
The loop mistakenly indexes array[cnt] instead of array[i]. For cnt <
ARRAY_SIZE(array), this reads an uninitialized entry; for cnt ==
ARRAY_SIZE(array), it is an out-of-bounds read. Either case can lead to
an invalid socket pointer dereference and also leaks references taken
via sock_hold().
Fix the index to use i.
Fixes: 64b8bc7d5f ("net/rose: fix races in rose_kill_by_device()")
Co-developed-by: Fatma Alwasmi <falwasmi@purdue.edu>
Signed-off-by: Fatma Alwasmi <falwasmi@purdue.edu>
Signed-off-by: Pwnverse <stanksal@purdue.edu>
Link: https://patch.msgid.link/20251222212227.4116041-1-ritviktanksalkar@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
In the non-RT kernel, local_bh_disable() merely disables preemption,
whereas it maps to an actual spin lock in the RT kernel. Consequently,
when attempting to refill RX buffers via netdev_alloc_skb() in
macb_mac_link_up(), a deadlock scenario arises as follows:
WARNING: possible circular locking dependency detected
6.18.0-08691-g2061f18ad76e #39 Not tainted
------------------------------------------------------
kworker/0:0/8 is trying to acquire lock:
ffff00080369bbe0 (&bp->lock){+.+.}-{3:3}, at: macb_start_xmit+0x808/0xb7c
but task is already holding lock:
ffff000803698e58 (&queue->tx_ptr_lock){+...}-{3:3}, at: macb_start_xmit
+0x148/0xb7c
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #3 (&queue->tx_ptr_lock){+...}-{3:3}:
rt_spin_lock+0x50/0x1f0
macb_start_xmit+0x148/0xb7c
dev_hard_start_xmit+0x94/0x284
sch_direct_xmit+0x8c/0x37c
__dev_queue_xmit+0x708/0x1120
neigh_resolve_output+0x148/0x28c
ip6_finish_output2+0x2c0/0xb2c
__ip6_finish_output+0x114/0x308
ip6_output+0xc4/0x4a4
mld_sendpack+0x220/0x68c
mld_ifc_work+0x2a8/0x4f4
process_one_work+0x20c/0x5f8
worker_thread+0x1b0/0x35c
kthread+0x144/0x200
ret_from_fork+0x10/0x20
-> #2 (_xmit_ETHER#2){+...}-{3:3}:
rt_spin_lock+0x50/0x1f0
sch_direct_xmit+0x11c/0x37c
__dev_queue_xmit+0x708/0x1120
neigh_resolve_output+0x148/0x28c
ip6_finish_output2+0x2c0/0xb2c
__ip6_finish_output+0x114/0x308
ip6_output+0xc4/0x4a4
mld_sendpack+0x220/0x68c
mld_ifc_work+0x2a8/0x4f4
process_one_work+0x20c/0x5f8
worker_thread+0x1b0/0x35c
kthread+0x144/0x200
ret_from_fork+0x10/0x20
-> #1 ((softirq_ctrl.lock)){+.+.}-{3:3}:
lock_release+0x250/0x348
__local_bh_enable_ip+0x7c/0x240
__netdev_alloc_skb+0x1b4/0x1d8
gem_rx_refill+0xdc/0x240
gem_init_rings+0xb4/0x108
macb_mac_link_up+0x9c/0x2b4
phylink_resolve+0x170/0x614
process_one_work+0x20c/0x5f8
worker_thread+0x1b0/0x35c
kthread+0x144/0x200
ret_from_fork+0x10/0x20
-> #0 (&bp->lock){+.+.}-{3:3}:
__lock_acquire+0x15a8/0x2084
lock_acquire+0x1cc/0x350
rt_spin_lock+0x50/0x1f0
macb_start_xmit+0x808/0xb7c
dev_hard_start_xmit+0x94/0x284
sch_direct_xmit+0x8c/0x37c
__dev_queue_xmit+0x708/0x1120
neigh_resolve_output+0x148/0x28c
ip6_finish_output2+0x2c0/0xb2c
__ip6_finish_output+0x114/0x308
ip6_output+0xc4/0x4a4
mld_sendpack+0x220/0x68c
mld_ifc_work+0x2a8/0x4f4
process_one_work+0x20c/0x5f8
worker_thread+0x1b0/0x35c
kthread+0x144/0x200
ret_from_fork+0x10/0x20
other info that might help us debug this:
Chain exists of:
&bp->lock --> _xmit_ETHER#2 --> &queue->tx_ptr_lock
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&queue->tx_ptr_lock);
lock(_xmit_ETHER#2);
lock(&queue->tx_ptr_lock);
lock(&bp->lock);
*** DEADLOCK ***
Call trace:
show_stack+0x18/0x24 (C)
dump_stack_lvl+0xa0/0xf0
dump_stack+0x18/0x24
print_circular_bug+0x28c/0x370
check_noncircular+0x198/0x1ac
__lock_acquire+0x15a8/0x2084
lock_acquire+0x1cc/0x350
rt_spin_lock+0x50/0x1f0
macb_start_xmit+0x808/0xb7c
dev_hard_start_xmit+0x94/0x284
sch_direct_xmit+0x8c/0x37c
__dev_queue_xmit+0x708/0x1120
neigh_resolve_output+0x148/0x28c
ip6_finish_output2+0x2c0/0xb2c
__ip6_finish_output+0x114/0x308
ip6_output+0xc4/0x4a4
mld_sendpack+0x220/0x68c
mld_ifc_work+0x2a8/0x4f4
process_one_work+0x20c/0x5f8
worker_thread+0x1b0/0x35c
kthread+0x144/0x200
ret_from_fork+0x10/0x20
Notably, invoking the mog_init_rings() callback upon link establishment
is unnecessary. Instead, we can exclusively call mog_init_rings() within
the ndo_open() callback. This adjustment resolves the deadlock issue.
Furthermore, since MACB_CAPS_MACB_IS_EMAC cases do not use mog_init_rings()
when opening the network interface via at91ether_open(), moving
mog_init_rings() to macb_open() also eliminates the MACB_CAPS_MACB_IS_EMAC
check.
Fixes: 633e98a711 ("net: macb: use resolved link config in mac_link_up()")
Cc: stable@vger.kernel.org
Suggested-by: Kevin Hao <kexin.hao@windriver.com>
Signed-off-by: Xiaolei Wang <xiaolei.wang@windriver.com>
Link: https://patch.msgid.link/20251222015624.1994551-1-xiaolei.wang@windriver.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Preference of nexthop with source address broke ECMP for packets with
source addresses which are not in the broadcast domain, but rather added
to loopback/dummy interfaces. Original behaviour was to balance over
nexthops while now it uses the latest nexthop from the group. To fix the
issue introduce next hop scoring system where next hops with source
address equal to requested will always have higher priority.
For the case with 198.51.100.1/32 assigned to dummy0 and routed using
192.0.2.0/24 and 203.0.113.0/24 networks:
2: dummy0: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
link/ether d6:54:8a:ff:78:f5 brd ff:ff:ff:ff:ff:ff
inet 198.51.100.1/32 scope global dummy0
valid_lft forever preferred_lft forever
7: veth1@if6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 06:ed:98:87:6d:8a brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet 192.0.2.2/24 scope global veth1
valid_lft forever preferred_lft forever
inet6 fe80::4ed:98ff:fe87:6d8a/64 scope link proto kernel_ll
valid_lft forever preferred_lft forever
9: veth3@if8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether ae:75:23:38:a0:d2 brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet 203.0.113.2/24 scope global veth3
valid_lft forever preferred_lft forever
inet6 fe80::ac75:23ff:fe38:a0d2/64 scope link proto kernel_ll
valid_lft forever preferred_lft forever
~ ip ro list:
default
nexthop via 192.0.2.1 dev veth1 weight 1
nexthop via 203.0.113.1 dev veth3 weight 1
192.0.2.0/24 dev veth1 proto kernel scope link src 192.0.2.2
203.0.113.0/24 dev veth3 proto kernel scope link src 203.0.113.2
before:
for i in {1..255} ; do ip ro get 10.0.0.$i; done | grep veth | awk ' {print $(NF-2)}' | sort | uniq -c:
255 veth3
after:
for i in {1..255} ; do ip ro get 10.0.0.$i; done | grep veth | awk ' {print $(NF-2)}' | sort | uniq -c:
122 veth1
133 veth3
Fixes: 32607a332c ("ipv4: prefer multipath nexthop that matches source address")
Signed-off-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20251221192639.3911901-1-vadim.fedorenko@linux.dev
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
When a nexthop object is deleted, it is marked as dead and then
fib_table_flush() is called to flush all the routes that are using the
dead nexthop.
The current logic in fib_table_flush() is to only flush error routes
(e.g., blackhole) when it is called as part of network namespace
dismantle (i.e., with flush_all=true). Therefore, error routes are not
flushed when their nexthop object is deleted:
# ip link add name dummy1 up type dummy
# ip nexthop add id 1 dev dummy1
# ip route add 198.51.100.1/32 nhid 1
# ip route add blackhole 198.51.100.2/32 nhid 1
# ip nexthop del id 1
# ip route show
blackhole 198.51.100.2 nhid 1 dev dummy1
As such, they keep holding a reference on the nexthop object which in
turn holds a reference on the nexthop device, resulting in a reference
count leak:
# ip link del dev dummy1
[ 70.516258] unregister_netdevice: waiting for dummy1 to become free. Usage count = 2
Fix by flushing error routes when their nexthop is marked as dead.
IPv6 does not suffer from this problem.
Fixes: 493ced1ac4 ("ipv4: Allow routes to use nexthop objects")
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Closes: https://lore.kernel.org/netdev/d943f806-4da6-4970-ac28-b9373b0e63ac@I-love.SAKURA.ne.jp/
Reported-by: syzbot+881d65229ca4f9ae8c84@syzkaller.appspotmail.com
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20251221144829.197694-1-idosch@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Pull Kbuild fixes from Nicolas Schier:
- Revert commit "scripts/clang-tools: Handle included .c files in
gen_compile_commands" which is reported to cause false entries for
some files.
- Fix compilation of dtb specified on command-line without make rule
- mcb: Add missing modpost build support
* tag 'kbuild-fixes-6.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kbuild/linux:
mcb: Add missing modpost build support
kbuild: fix compilation of dtb specified on command-line without make rule
Revert "scripts/clang-tools: Handle included .c files in gen_compile_commands"
Pull misc fixes from Andrew Morton:
"27 hotfixes. 12 are cc:stable, 18 are MM.
There's a patch series from Jiayuan Chen which fixes some
issues with KASAN and vmalloc. Apart from that it's the usual
shower of singletons - please see the respective changelogs
for details"
* tag 'mm-hotfixes-stable-2025-12-28-21-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (27 commits)
mm/ksm: fix pte_unmap_unlock of wrong address in break_ksm_pmd_entry
mm/page_owner: fix memory leak in page_owner_stack_fops->release()
mm/memremap: fix spurious large folio warning for FS-DAX
MAINTAINERS: notify the "Device Memory" community of memory hotplug changes
sparse: update MAINTAINERS info
mm/page_alloc: report 1 as zone_batchsize for !CONFIG_MMU
mm: consider non-anon swap cache folios in folio_expected_ref_count()
rust: maple_tree: rcu_read_lock() in destructor to silence lockdep
mm: memcg: fix unit conversion for K() macro in OOM log
mm: fixup pfnmap memory failure handling to use pgoff
tools/mm/page_owner_sort: fix timestamp comparison for stable sorting
selftests/mm: fix thread state check in uffd-unit-tests
kernel/kexec: fix IMA when allocation happens in CMA area
kernel/kexec: change the prototype of kimage_map_segment()
MAINTAINERS: add ABI headers to KHO and LIVE UPDATE
.mailmap: remove one of the entries for WangYuli
mm/damon/vaddr: fix missing pte_unmap_unlock in damos_va_migrate_pmd_entry()
MAINTAINERS: update one straggling entry for Bartosz Golaszewski
mm/page_alloc: change all pageblocks migrate type on coalescing
mm: leafops.h: correct kernel-doc function param. names
...
There exists a kernel oops caused by a BUG_ON(nhead < 0) at
net/core/skbuff.c:2232 in pskb_expand_head().
This bug is triggered as part of the calipso_skbuff_setattr()
routine when skb_cow() is passed headroom > INT_MAX
(i.e. (int)(skb_headroom(skb) + len_delta) < 0).
The root cause of the bug is due to an implicit integer cast in
__skb_cow(). The check (headroom > skb_headroom(skb)) is meant to ensure
that delta = headroom - skb_headroom(skb) is never negative, otherwise
we will trigger a BUG_ON in pskb_expand_head(). However, if
headroom > INT_MAX and delta <= -NET_SKB_PAD, the check passes, delta
becomes negative, and pskb_expand_head() is passed a negative value for
nhead.
Fix the trigger condition in calipso_skbuff_setattr(). Avoid passing
"negative" headroom sizes to skb_cow() within calipso_skbuff_setattr()
by only using skb_cow() to grow headroom.
PoC:
Using `netlabelctl` tool:
netlabelctl map del default
netlabelctl calipso add pass doi:7
netlabelctl map add default address:0::1/128 protocol:calipso,7
Then run the following PoC:
int fd = socket(AF_INET6, SOCK_DGRAM, IPPROTO_UDP);
// setup msghdr
int cmsg_size = 2;
int cmsg_len = 0x60;
struct msghdr msg;
struct sockaddr_in6 dest_addr;
struct cmsghdr * cmsg = (struct cmsghdr *) calloc(1,
sizeof(struct cmsghdr) + cmsg_len);
msg.msg_name = &dest_addr;
msg.msg_namelen = sizeof(dest_addr);
msg.msg_iov = NULL;
msg.msg_iovlen = 0;
msg.msg_control = cmsg;
msg.msg_controllen = cmsg_len;
msg.msg_flags = 0;
// setup sockaddr
dest_addr.sin6_family = AF_INET6;
dest_addr.sin6_port = htons(31337);
dest_addr.sin6_flowinfo = htonl(31337);
dest_addr.sin6_addr = in6addr_loopback;
dest_addr.sin6_scope_id = 31337;
// setup cmsghdr
cmsg->cmsg_len = cmsg_len;
cmsg->cmsg_level = IPPROTO_IPV6;
cmsg->cmsg_type = IPV6_HOPOPTS;
char * hop_hdr = (char *)cmsg + sizeof(struct cmsghdr);
hop_hdr[1] = 0x9; //set hop size - (0x9 + 1) * 8 = 80
sendmsg(fd, &msg, 0);
Fixes: 2917f57b6b ("calipso: Allow the lsm to label the skbuff directly.")
Suggested-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: Will Rosenberg <whrosenb@asu.edu>
Acked-by: Paul Moore <paul@paul-moore.com>
Link: https://patch.msgid.link/20251219173637.797418-1-whrosenb@asu.edu
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
syzbot reported a crash [1] in dql_completed() after recent usbnet
BQL adoption.
The reason for the crash is that netdev_reset_queue() is called too soon.
It should be called after cancel_work_sync(&dev->bh_work) to make
sure no more TX completion can happen.
[1]
kernel BUG at lib/dynamic_queue_limits.c:99 !
Oops: invalid opcode: 0000 [#1] SMP KASAN PTI
CPU: 1 UID: 0 PID: 5197 Comm: udevd Tainted: G L syzkaller #0 PREEMPT(full)
Tainted: [L]=SOFTLOCKUP
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/25/2025
RIP: 0010:dql_completed+0xbe1/0xbf0 lib/dynamic_queue_limits.c:99
Call Trace:
<IRQ>
netdev_tx_completed_queue include/linux/netdevice.h:3864 [inline]
netdev_completed_queue include/linux/netdevice.h:3894 [inline]
usbnet_bh+0x793/0x1020 drivers/net/usb/usbnet.c:1601
process_one_work kernel/workqueue.c:3257 [inline]
process_scheduled_works+0xad1/0x1770 kernel/workqueue.c:3340
bh_worker+0x2b1/0x600 kernel/workqueue.c:3611
tasklet_action+0xc/0x70 kernel/softirq.c:952
handle_softirqs+0x27d/0x850 kernel/softirq.c:622
__do_softirq kernel/softirq.c:656 [inline]
invoke_softirq kernel/softirq.c:496 [inline]
__irq_exit_rcu+0xca/0x1f0 kernel/softirq.c:723
irq_exit_rcu+0x9/0x30 kernel/softirq.c:739
Fixes: 7ff14c5204 ("usbnet: Add support for Byte Queue Limits (BQL)")
Reported-by: syzbot+5b55e49f8bbd84631a9c@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/netdev/6945644f.a70a0220.207337.0113.GAE@google.com/T/#u
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Simon Schippers <simon.schippers@tu-dortmund.de>
Link: https://patch.msgid.link/20251219144459.692715-1-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Currently, interrupts are automatically enabled immediately upon
request. This allows interrupt to fire before the associated NAPI
context is fully initialized and cause failures like below:
[ 0.946369] Call Trace:
[ 0.946369] <IRQ>
[ 0.946369] __napi_poll+0x2a/0x1e0
[ 0.946369] net_rx_action+0x2f9/0x3f0
[ 0.946369] handle_softirqs+0xd6/0x2c0
[ 0.946369] ? handle_edge_irq+0xc1/0x1b0
[ 0.946369] __irq_exit_rcu+0xc3/0xe0
[ 0.946369] common_interrupt+0x81/0xa0
[ 0.946369] </IRQ>
[ 0.946369] <TASK>
[ 0.946369] asm_common_interrupt+0x22/0x40
[ 0.946369] RIP: 0010:pv_native_safe_halt+0xb/0x10
Use the `IRQF_NO_AUTOEN` flag when requesting interrupts to prevent auto
enablement and explicitly enable the interrupt in NAPI initialization
path (and disable it during NAPI teardown).
This ensures that interrupt lifecycle is strictly coupled with
readiness of NAPI context.
Cc: stable@vger.kernel.org
Fixes: 1dfc2e4611 ("gve: Refactor napi add and remove functions")
Signed-off-by: Ankit Garg <nktgrg@google.com>
Reviewed-by: Jordan Rhee <jordanrhee@google.com>
Reviewed-by: Joshua Washington <joshwash@google.com>
Signed-off-by: Harshitha Ramamurthy <hramamurthy@google.com>
Link: https://patch.msgid.link/20251219102945.2193617-1-hramamurthy@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
There is a crash issue when running zero copy XDP_TX action, the crash
log is shown below.
[ 216.122464] Unable to handle kernel paging request at virtual address fffeffff80000000
[ 216.187524] Internal error: Oops: 0000000096000144 [#1] SMP
[ 216.301694] Call trace:
[ 216.304130] dcache_clean_poc+0x20/0x38 (P)
[ 216.308308] __dma_sync_single_for_device+0x1bc/0x1e0
[ 216.313351] stmmac_xdp_xmit_xdpf+0x354/0x400
[ 216.317701] __stmmac_xdp_run_prog+0x164/0x368
[ 216.322139] stmmac_napi_poll_rxtx+0xba8/0xf00
[ 216.326576] __napi_poll+0x40/0x218
[ 216.408054] Kernel panic - not syncing: Oops: Fatal exception in interrupt
For XDP_TX action, the xdp_buff is converted to xdp_frame by
xdp_convert_buff_to_frame(). The memory type of the resulting xdp_frame
depends on the memory type of the xdp_buff. For page pool based xdp_buff
it produces xdp_frame with memory type MEM_TYPE_PAGE_POOL. For zero copy
XSK pool based xdp_buff it produces xdp_frame with memory type
MEM_TYPE_PAGE_ORDER0. However, stmmac_xdp_xmit_back() does not check the
memory type and always uses the page pool type, this leads to invalid
mappings and causes the crash. Therefore, check the xdp_buff memory type
in stmmac_xdp_xmit_back() to fix this issue.
Fixes: bba2556efa ("net: stmmac: Enable RX via AF_XDP zero-copy")
Signed-off-by: Wei Fang <wei.fang@nxp.com>
Reviewed-by: Hariprasad Kelam <hkelam@marvell.com>
Link: https://patch.msgid.link/20251204071332.1907111-1-wei.fang@nxp.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Johannes Berg says:
====================
Various fixes all over, most are recent regressions but
also some long-standing issues:
- cfg80211:
- fix an issue with overly long SSIDs
- mac80211:
- long-standing beacon protection issue on some devices
- for for a multi-BSSID AP-side issue
- fix a syzbot warning on OCB (not really used in practice)
- remove WARN on connections using disabled channels,
as that can happen due to changes in the disable flag
- fix monitor mode list iteration
- iwlwifi:
- fix firmware loading on certain (really old) devices
- add settime64 to PTP clock to avoid a warning and clock
registration failure, but it's not actually supported
- rtw88:
- remove WQ_UNBOUND since it broke USB adapters
(because it can't be used with WQ_BH)
- fix SDIO issues with certain devices
- rtl8192cu: fix TID array out-of-bounds (since 6.9)
- wlcore (TI): add missing skb push headroom increase
* tag 'wireless-2025-12-17' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless:
wifi: iwlwifi: Implement settime64 as stub for MVM/MLD PTP
wifi: iwlwifi: Fix firmware version handling
wifi: mac80211: ocb: skip rx_no_sta when interface is not joined
wifi: mac80211: do not use old MBSSID elements
wifi: mac80211: don't WARN for connections on invalid channels
wifi: wlcore: ensure skb headroom before skb_push
wifi: cfg80211: sme: store capped length in __cfg80211_connect_result()
wifi: mac80211: fix list iteration in ieee80211_add_virtual_monitor()
wifi: mac80211: Discard Beacon frames to non-broadcast address
Revert "wifi: rtw88: add WQ_UNBOUND to alloc_workqueue users"
wifi: rtlwifi: 8192cu: fix tid out of range in rtl92cu_tx_fill_desc()
wifi: rtw88: limit indirect IO under powered off for RTL8822CS
====================
Link: https://patch.msgid.link/20251217201441.59876-3-johannes@sipsolutions.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Pull sched_ext fixes from Tejun Heo:
- Fix uninitialized @ret on alloc_percpu() failure leading to
ERR_PTR(0)
- Fix PREEMPT_RT warning when bypass load balancer sends IPI to offline
CPU by using resched_cpu() instead of resched_curr()
- Fix comment referring to renamed function
- Update scx_show_state.py for scx_root and scx_aborting changes
* tag 'sched_ext-for-6.19-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
tools/sched_ext: update scx_show_state.py for scx_aborting change
tools/sched_ext: fix scx_show_state.py for scx_root change
sched_ext: Use the resched_cpu() to replace resched_curr() in the bypass_lb_node()
sched_ext: Fix some comments in ext.c
sched_ext: fix uninitialized ret on alloc_percpu() failure
Pull cgroup fix from Tejun Heo:
- Fix a spurious cpuset warning when disabling remote partition after
CPU hotplug leaves subpartitions_cpus empty. Guard the warning and
invalidate affected partitions.
* tag 'cgroup-for-6.19-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cpuset: fix warning when disabling remote partition