Johannes Berg says:
====================
Just a handful of fixes, notably
- iwlwifi: various build warning fixes (e.g. PM_SLEEP)
- iwlwifi: fix operation when FW reset handshake times out
- mac80211: drop pending frames on interface down
* tag 'wireless-2025-04-11' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless:
Revert "wifi: mac80211: Update skb's control block key in ieee80211_tx_dequeue()"
wifi: iwlwifi: mld: Restart firmware on iwl_mld_no_wowlan_resume() error
wifi: iwlwifi: pcie: set state to no-FW before reset handshake
wifi: wl1251: fix memory leak in wl1251_tx_work
wifi: brcmfmac: fix memory leak in brcmf_get_module_param
wifi: iwlwifi: mld: silence uninitialized variable warning
wifi: mac80211: Purge vif txq in ieee80211_do_stop()
wifi: mac80211: Update skb's control block key in ieee80211_tx_dequeue()
wifi: at76c50x: fix use after free access in at76_disconnect
wifi: add wireless list to MAINTAINERS
iwlwifi: mld: fix building with CONFIG_PM_SLEEP disabled
wifi: iwlwifi: mld: fix PM_SLEEP -Wundef warning
wifi: iwlwifi: mld: reduce scope for uninitialized variable
====================
Link: https://patch.msgid.link/20250411142354.24419-3-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The reset handshake attempts to kill the firmware, and it'll go
into a pretty much dead state once we do that. However, if it
times out, then we'll attempt to dump the firmware to be able
to see why it didn't respond. During this dump, we cannot treat
it as if it was still running, since we just tried to kill it,
otherwise dumping will attempt to send a DBGC stop command. As
this command will time out, we'll go into a reset loop.
For now, fix this by setting the trans->state to say firmware
isn't running before doing the reset handshake. In the longer
term, we should clean up the way this state is handled.
It's not entirely clear but it seems likely that this issue was
introduced by my rework of the error handling, prior to that it
would've been synchronous at that point and (I think) not have
attempted to reset since it was already doing down.
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219967
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219968
Fixes: 7391b2a4f7 ("wifi: iwlwifi: rework firmware error handling")
Reviewed-by: Miriam Rachel Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20250411104054.63aa4f56894d.Ife70cfe997db03f0d07fdef2b164695739a05a63@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Commit 5eb902b8e7 ("net/ipv6: Remove expired routes with a separated list
of routes.") introduced a separated list for managing route expiration via
the GC timer.
However, it missed adding exception routes (created by ip6_rt_update_pmtu()
and rt6_do_redirect()) to this GC list. As a result, these exceptions were
never considered for expiration and removal, leading to stale entries
persisting in the routing table.
This patch fixes the issue by calling fib6_add_gc_list() in
rt6_insert_exception(), ensuring that exception routes are properly tracked
and garbage collected when expired.
Fixes: 5eb902b8e7 ("net/ipv6: Remove expired routes with a separated list of routes.")
Reported-by: Jianlin Shi <jishi@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/837e7506ffb63f47faa2b05d9b85481aad28e1a4.1744134377.git.lucien.xin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
For bnxt when the agg ring is used then tcp-data-split is automatically
reported to be enabled, but __net_mp_open_rxq() requires tcp-data-split
to be explicitly enabled by the user.
Enable tcp-data-split explicitly in io_uring zc rx selftest.
Signed-off-by: David Wei <dw@davidwei.uk>
Link: https://patch.msgid.link/20250409163153.2747918-1-dw@davidwei.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This prints an error message if the FW Dump trigger command fails. This
scenario is mainly observed in legacy chipsets 8987 and 8997 and also
IW416, where this feature is unavailable due to memory constraints.
Fixes: 998e447f44 ("Bluetooth: btnxpuart: Add support for HCI coredump feature")
Signed-off-by: Neeraj Sanjay Kale <neeraj.sanjaykale@nxp.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
This reverts the change baudrate logic in nxp_shutdown.
Earlier, when the driver was removed, it restored the controller
baudrate to fw_init_baudrate, so that on re-loading the driver, things
work fine.
However, if the driver was removed while hci0 interface is down, the
change baudrate vendor command could not be sent by the driver. When the
driver was re-loaded, host and controller baudrate would be mismatched
and hci initialization would fail. The only way to recover would be to
reboot the system.
This issue was fixed by moving the restore baudrate logic from
nxp_serdev_remove() to nxp_shutdown().
This fix however caused another issue with the command "hciconfig hci0
reset", which makes hci0 DOWN and UP immediately.
Running "bluetoothctl power off" and "bluetoothctl power on" in a tight
loop works fine.
To maintain support for "hciconfig reset" command, the above mentioned fix
is reverted.
Fixes: 6fca6781d1 ("Bluetooth: btnxpuart: Move vendor specific initialization to .post_init")
Signed-off-by: Neeraj Sanjay Kale <neeraj.sanjaykale@nxp.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Documentation/networking/timestamping.rst implies TX timestamping OPT_ID
tskey increments for each sendmsg. In practice: TCP socket increments
it for all sendmsg, timestamping on or off, but UDP only when
timestamping is on. The user-visible counter resets when OPT_ID is
turned on, so difference can be seen only if timestamping is enabled for
some packets only (eg. via SO_TIMESTAMPING CMSG).
Fix BT sockets to work in the same way: stream sockets increment tskey
for all sendmsg (seqpacket already increment for timestamped only).
Fixes: 134f4b39df ("Bluetooth: add support for skb TX SND/COMPLETION timestamping")
Signed-off-by: Pauli Virtanen <pav@iki.fi>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
The btrtl_initialize() function checks that rtl_load_file() either
had an error or it loaded a zero length file. However, if it loaded
a zero length file then the error code is not set correctly. It
results in an error pointer vs NULL bug, followed by a NULL pointer
dereference. This was detected by Smatch:
drivers/bluetooth/btrtl.c:592 btrtl_initialize() warn: passing zero to 'ERR_PTR'
Fixes: 26503ad25d ("Bluetooth: btrtl: split the device initialization into smaller parts")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
This fixes sending MGMT_EV_DEVICE_FOUND for invalid address
(00:00:00:00:00:00) which is a regression introduced by
a2ec905d1e ("Bluetooth: fix kernel oops in store_pending_adv_report")
since in the attempt to skip storing data for extended advertisement it
actually made the code to skip the entire if statement supposed to send
MGMT_EV_DEVICE_FOUND without attempting to use the last_addr_adv which
is garanteed to be invalid for extended advertisement since we never
store anything on it.
Link: https://github.com/bluez/bluez/issues/1157
Link: https://github.com/bluez/bluez/issues/1149#issuecomment-2767215658
Fixes: a2ec905d1e ("Bluetooth: fix kernel oops in store_pending_adv_report")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Pull networking fixes from Paolo Abeni:
"Including fixes from netfilter.
Current release - regressions:
- core: hold instance lock during NETDEV_CHANGE
- rtnetlink: fix bad unlock balance in do_setlink()
- ipv6:
- fix null-ptr-deref in addrconf_add_ifaddr()
- align behavior across nexthops during path selection
Previous releases - regressions:
- sctp: prevent transport UaF in sendmsg
- mptcp: only inc MPJoinAckHMacFailure for HMAC failures
Previous releases - always broken:
- sched:
- make ->qlen_notify() idempotent
- ensure sufficient space when sending filter netlink notifications
- sch_sfq: really don't allow 1 packet limit
- netfilter: fix incorrect avx2 match of 5th field octet
- tls: explicitly disallow disconnect
- eth: octeontx2-pf: fix VF root node parent queue priority"
* tag 'net-6.15-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (38 commits)
ethtool: cmis_cdb: Fix incorrect read / write length extension
selftests: netfilter: add test case for recent mismatch bug
nft_set_pipapo: fix incorrect avx2 match of 5th field octet
net: ppp: Add bound checking for skb data on ppp_sync_txmung
net: Fix null-ptr-deref by sock_lock_init_class_and_name() and rmmod.
ipv6: Align behavior across nexthops during path selection
net: phy: allow MDIO bus PM ops to start/stop state machine for phylink-controlled PHY
net: phy: move phy_link_change() prior to mdio_bus_phy_may_suspend()
selftests/tc-testing: sfq: check that a derived limit of 1 is rejected
net_sched: sch_sfq: move the limit validation
net_sched: sch_sfq: use a temporary work area for validating configuration
net: libwx: handle page_pool_dev_alloc_pages error
selftests: mptcp: validate MPJoin HMacFailure counters
mptcp: only inc MPJoinAckHMacFailure for HMAC failures
rtnetlink: Fix bad unlock balance in do_setlink().
net: ethtool: Don't call .cleanup_data when prepare_data fails
tc: Ensure we have enough buffer space when sending filter netlink notifications
net: libwx: Fix the wrong Rx descriptor field
octeontx2-pf: qos: fix VF root node parent queue index
selftests: tls: check that disconnect does nothing
...
Pull xen fixes from Juergen Gross:
- A simple fix adding the module description of the Xenbus frontend
module
- A fix correcting the xen-acpi-processor Kconfig dependency for PVH
Dom0 support
- A fix for the Xen balloon driver when running as Xen Dom0 in PVH mode
- A fix for PVH Dom0 in order to avoid problems with CPU idle and
frequency drivers conflicting with Xen
* tag 'for-linus-6.15a-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
x86/xen: disable CPU idle and frequency drivers for PVH dom0
x86/xen: fix balloon target initialization for PVH dom0
xen: Change xen-acpi-processor dom0 dependency
xenbus: add module description
Pull block fixes from Jens Axboe:
- Add a missing ublk selftest script, from test additions added last
week
- Two fixes for ublk error recovery and reissue
- Cleanup of ublk argument passing
* tag 'block-6.15-20250410' of git://git.kernel.dk/linux:
ublk: pass ublksrv_ctrl_cmd * instead of io_uring_cmd *
ublk: don't fail request for recovery & reissue in case of ubq->canceling
ublk: fix handling recovery & reissue in ublk_abort_queue()
selftests: ublk: fix test_stripe_04
Pull io_uring fixes from Jens Axboe:
- Reject zero sized legacy provided buffers upfront. No ill side
effects from this one, only really done to shut up a silly syzbot
test case.
- Fix for a regression in tag posting for registered files or buffers,
where the tag would be posted even when the registration failed.
- two minor zcrx cleanups for code added this merge window.
* tag 'io_uring-6.15-20250410' of git://git.kernel.dk/linux:
io_uring/kbuf: reject zero sized provided buffers
io_uring/zcrx: separate niov number from pages
io_uring/zcrx: put refill data into separate cache line
io_uring: don't post tag CQEs on file/buffer registration failure
Pull gpio fixes from Bartosz Golaszewski:
- fix resource handling in gpio-tegra186
- fix wakeup source leaks in gpio-mpc8xxx and gpio-zynq
- fix minor issues with some GPIO OF quirks
- deprecate GPIOD_FLAGS_BIT_NONEXCLUSIVE and devm_gpiod_unhinge()
symbols and add a TODO task to track replacing them with a better
solution
* tag 'gpio-fixes-for-v6.15-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux:
gpiolib: of: Move Atmel HSMCI quirk up out of the regulator comment
gpiolib: of: Fix the choice for Ingenic NAND quirk
gpio: zynq: Fix wakeup source leaks on device unbind
gpio: mpc8xxx: Fix wakeup source leaks on device unbind
gpio: TODO: track the removal of regulator-related workarounds
MAINTAINERS: add more keywords for the GPIO subsystem entry
gpio: deprecate devm_gpiod_unhinge()
gpio: deprecate the GPIOD_FLAGS_BIT_NONEXCLUSIVE flag
gpio: tegra186: fix resource handling in ACPI probe path
Pull mtd fixes from Miquel Raynal:
"Two important fixes: the build of the SPI NAND layer with old GCC
versions as well as the fix of the Qpic Makefile which was wrong in
the first place.
There are also two smaller fixes about a missing error and status
check"
* tag 'mtd/fixes-for-6.15-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/mtd/linux:
mtd: spinand: Fix build with gcc < 7.5
mtd: rawnand: Add status chack in r852_ready()
mtd: inftlcore: Add error check for inftl_read_oob()
mtd: nand: Drop explicit test for built-in CONFIG_SPI_QPIC_SNAND
The 'read_write_len_ext' field in 'struct ethtool_cmis_cdb_cmd_args'
stores the maximum number of bytes that can be read from or written to
the Local Payload (LPL) page in a single multi-byte access.
Cited commit started overwriting this field with the maximum number of
bytes that can be read from or written to the Extended Payload (LPL)
pages in a single multi-byte access. Transceiver modules that support
auto paging can advertise a number larger than 255 which is problematic
as 'read_write_len_ext' is a 'u8', resulting in the number getting
truncated and firmware flashing failing [1].
Fix by ignoring the maximum EPL access size as the kernel does not
currently support auto paging (even if the transceiver module does) and
will not try to read / write more than 128 bytes at once.
[1]
Transceiver module firmware flashing started for device enp177s0np0
Transceiver module firmware flashing in progress for device enp177s0np0
Progress: 0%
Transceiver module firmware flashing encountered an error for device enp177s0np0
Status message: Write FW block EPL command failed, LPL length is longer
than CDB read write length extension allows.
Fixes: 9a3b0d078b ("net: ethtool: Add support for writing firmware blocks using EPL payload")
Reported-by: Damodharam Ammepalli <damodharam.ammepalli@broadcom.com>
Closes: https://lore.kernel.org/netdev/20250402183123.321036-3-michael.chan@broadcom.com/
Tested-by: Damodharam Ammepalli <damodharam.ammepalli@broadcom.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Damodharam Ammepalli <damodharam.ammepalli@broadcom.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Link: https://patch.msgid.link/20250409112440.365672-1-idosch@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following batch contains a Netfilter fix and improved test coverage:
1) Fix AVX2 matching in nft_pipapo, from Florian Westphal.
2) Extend existing test to improve coverage for the aforementioned bug,
also from Florian.
netfilter pull request 25-04-10
* tag 'nf-25-04-10' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
selftests: netfilter: add test case for recent mismatch bug
nft_set_pipapo: fix incorrect avx2 match of 5th field octet
====================
Link: https://patch.msgid.link/20250410103647.1030244-1-pablo@netfilter.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Without 'nft_set_pipapo: fix incorrect avx2 match of 5th field octet"
this fails:
TEST: reported issues
Add two elements, flush, re-add 1s [ OK ]
net,mac with reload 0s [ OK ]
net,port,proto 3s [ OK ]
avx2 false match 0s [FAIL]
False match for fe80:dead:01fe:0a02:0b03:6007:8009:a001
Other tests do not detect the kernel bug as they only alter parts in
the /64 netmask.
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Given a set element like:
icmpv6 . dead:beef:00ff::1
The value of 'ff' is irrelevant, any address will be matched
as long as the other octets are the same.
This is because of too-early register clobbering:
ymm7 is reloaded with new packet data (pkt[9]) but it still holds data
of an earlier load that wasn't processed yet.
The existing tests in nft_concat_range.sh selftests do exercise this code
path, but do not trigger incorrect matching due to the network prefix
limitation.
Fixes: 7400b06396 ("nft_set_pipapo: Introduce AVX2-based lookup implementation")
Reported-by: sontu mazumdar <sontu21@gmail.com>
Closes: https://lore.kernel.org/netfilter/CANgxkqwnMH7fXra+VUfODT-8+qFLgskq3set1cAzqqJaV4iEZg@mail.gmail.com/T/#t
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
When I ran the repro [0] and waited a few seconds, I observed two
LOCKDEP splats: a warning immediately followed by a null-ptr-deref. [1]
Reproduction Steps:
1) Mount CIFS
2) Add an iptables rule to drop incoming FIN packets for CIFS
3) Unmount CIFS
4) Unload the CIFS module
5) Remove the iptables rule
At step 3), the CIFS module calls sock_release() for the underlying
TCP socket, and it returns quickly. However, the socket remains in
FIN_WAIT_1 because incoming FIN packets are dropped.
At this point, the module's refcnt is 0 while the socket is still
alive, so the following rmmod command succeeds.
# ss -tan
State Recv-Q Send-Q Local Address:Port Peer Address:Port
FIN-WAIT-1 0 477 10.0.2.15:51062 10.0.0.137:445
# lsmod | grep cifs
cifs 1159168 0
This highlights a discrepancy between the lifetime of the CIFS module
and the underlying TCP socket. Even after CIFS calls sock_release()
and it returns, the TCP socket does not die immediately in order to
close the connection gracefully.
While this is generally fine, it causes an issue with LOCKDEP because
CIFS assigns a different lock class to the TCP socket's sk->sk_lock
using sock_lock_init_class_and_name().
Once an incoming packet is processed for the socket or a timer fires,
sk->sk_lock is acquired.
Then, LOCKDEP checks the lock context in check_wait_context(), where
hlock_class() is called to retrieve the lock class. However, since
the module has already been unloaded, hlock_class() logs a warning
and returns NULL, triggering the null-ptr-deref.
If LOCKDEP is enabled, we must ensure that a module calling
sock_lock_init_class_and_name() (CIFS, NFS, etc) cannot be unloaded
while such a socket is still alive to prevent this issue.
Let's hold the module reference in sock_lock_init_class_and_name()
and release it when the socket is freed in sk_prot_free().
Note that sock_lock_init() clears sk->sk_owner for svc_create_socket()
that calls sock_lock_init_class_and_name() for a listening socket,
which clones a socket by sk_clone_lock() without GFP_ZERO.
[0]:
CIFS_SERVER="10.0.0.137"
CIFS_PATH="//${CIFS_SERVER}/Users/Administrator/Desktop/CIFS_TEST"
DEV="enp0s3"
CRED="/root/WindowsCredential.txt"
MNT=$(mktemp -d /tmp/XXXXXX)
mount -t cifs ${CIFS_PATH} ${MNT} -o vers=3.0,credentials=${CRED},cache=none,echo_interval=1
iptables -A INPUT -s ${CIFS_SERVER} -j DROP
for i in $(seq 10);
do
umount ${MNT}
rmmod cifs
sleep 1
done
rm -r ${MNT}
iptables -D INPUT -s ${CIFS_SERVER} -j DROP
[1]:
DEBUG_LOCKS_WARN_ON(1)
WARNING: CPU: 10 PID: 0 at kernel/locking/lockdep.c:234 hlock_class (kernel/locking/lockdep.c:234 kernel/locking/lockdep.c:223)
Modules linked in: cifs_arc4 nls_ucs2_utils cifs_md4 [last unloaded: cifs]
CPU: 10 UID: 0 PID: 0 Comm: swapper/10 Not tainted 6.14.0 #36
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
RIP: 0010:hlock_class (kernel/locking/lockdep.c:234 kernel/locking/lockdep.c:223)
...
Call Trace:
<IRQ>
__lock_acquire (kernel/locking/lockdep.c:4853 kernel/locking/lockdep.c:5178)
lock_acquire (kernel/locking/lockdep.c:469 kernel/locking/lockdep.c:5853 kernel/locking/lockdep.c:5816)
_raw_spin_lock_nested (kernel/locking/spinlock.c:379)
tcp_v4_rcv (./include/linux/skbuff.h:1678 ./include/net/tcp.h:2547 net/ipv4/tcp_ipv4.c:2350)
...
BUG: kernel NULL pointer dereference, address: 00000000000000c4
PF: supervisor read access in kernel mode
PF: error_code(0x0000) - not-present page
PGD 0
Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
CPU: 10 UID: 0 PID: 0 Comm: swapper/10 Tainted: G W 6.14.0 #36
Tainted: [W]=WARN
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
RIP: 0010:__lock_acquire (kernel/locking/lockdep.c:4852 kernel/locking/lockdep.c:5178)
Code: 15 41 09 c7 41 8b 44 24 20 25 ff 1f 00 00 41 09 c7 8b 84 24 a0 00 00 00 45 89 7c 24 20 41 89 44 24 24 e8 e1 bc ff ff 4c 89 e7 <44> 0f b6 b8 c4 00 00 00 e8 d1 bc ff ff 0f b6 80 c5 00 00 00 88 44
RSP: 0018:ffa0000000468a10 EFLAGS: 00010046
RAX: 0000000000000000 RBX: ff1100010091cc38 RCX: 0000000000000027
RDX: ff1100081f09ca48 RSI: 0000000000000001 RDI: ff1100010091cc88
RBP: ff1100010091c200 R08: ff1100083fe6e228 R09: 00000000ffffbfff
R10: ff1100081eca0000 R11: ff1100083fe10dc0 R12: ff1100010091cc88
R13: 0000000000000001 R14: 0000000000000000 R15: 00000000000424b1
FS: 0000000000000000(0000) GS:ff1100081f080000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000000000c4 CR3: 0000000002c4a003 CR4: 0000000000771ef0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
<IRQ>
lock_acquire (kernel/locking/lockdep.c:469 kernel/locking/lockdep.c:5853 kernel/locking/lockdep.c:5816)
_raw_spin_lock_nested (kernel/locking/spinlock.c:379)
tcp_v4_rcv (./include/linux/skbuff.h:1678 ./include/net/tcp.h:2547 net/ipv4/tcp_ipv4.c:2350)
ip_protocol_deliver_rcu (net/ipv4/ip_input.c:205 (discriminator 1))
ip_local_deliver_finish (./include/linux/rcupdate.h:878 net/ipv4/ip_input.c:234)
ip_sublist_rcv_finish (net/ipv4/ip_input.c:576)
ip_list_rcv_finish (net/ipv4/ip_input.c:628)
ip_list_rcv (net/ipv4/ip_input.c:670)
__netif_receive_skb_list_core (net/core/dev.c:5939 net/core/dev.c:5986)
netif_receive_skb_list_internal (net/core/dev.c:6040 net/core/dev.c:6129)
napi_complete_done (./include/linux/list.h:37 ./include/net/gro.h:519 ./include/net/gro.h:514 net/core/dev.c:6496)
e1000_clean (drivers/net/ethernet/intel/e1000/e1000_main.c:3815)
__napi_poll.constprop.0 (net/core/dev.c:7191)
net_rx_action (net/core/dev.c:7262 net/core/dev.c:7382)
handle_softirqs (kernel/softirq.c:561)
__irq_exit_rcu (kernel/softirq.c:596 kernel/softirq.c:435 kernel/softirq.c:662)
irq_exit_rcu (kernel/softirq.c:680)
common_interrupt (arch/x86/kernel/irq.c:280 (discriminator 14))
</IRQ>
<TASK>
asm_common_interrupt (./arch/x86/include/asm/idtentry.h:693)
RIP: 0010:default_idle (./arch/x86/include/asm/irqflags.h:37 ./arch/x86/include/asm/irqflags.h:92 arch/x86/kernel/process.c:744)
Code: 4c 01 c7 4c 29 c2 e9 72 ff ff ff 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa eb 07 0f 00 2d c3 2b 15 00 fb f4 <fa> c3 cc cc cc cc 66 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90
RSP: 0018:ffa00000000ffee8 EFLAGS: 00000202
RAX: 000000000000640b RBX: ff1100010091c200 RCX: 0000000000061aa4
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff812f30c5
RBP: 000000000000000a R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000002 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
? do_idle (kernel/sched/idle.c:186 kernel/sched/idle.c:325)
default_idle_call (./include/linux/cpuidle.h:143 kernel/sched/idle.c:118)
do_idle (kernel/sched/idle.c:186 kernel/sched/idle.c:325)
cpu_startup_entry (kernel/sched/idle.c:422 (discriminator 1))
start_secondary (arch/x86/kernel/smpboot.c:315)
common_startup_64 (arch/x86/kernel/head_64.S:421)
</TASK>
Modules linked in: cifs_arc4 nls_ucs2_utils cifs_md4 [last unloaded: cifs]
CR2: 00000000000000c4
Fixes: ed07536ed6 ("[PATCH] lockdep: annotate nfs/nfsd in-kernel sockets")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20250407163313.22682-1-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
A nexthop is only chosen when the calculated multipath hash falls in the
nexthop's hash region (i.e., the hash is smaller than the nexthop's hash
threshold) and when the nexthop is assigned a non-negative score by
rt6_score_route().
Commit 4d0ab3a688 ("ipv6: Start path selection from the first
nexthop") introduced an unintentional difference between the first
nexthop and the rest when the score is negative.
When the first nexthop matches, but has a negative score, the code will
currently evaluate subsequent nexthops until one is found with a
non-negative score. On the other hand, when a different nexthop matches,
but has a negative score, the code will fallback to the nexthop with
which the selection started ('match').
Align the behavior across all nexthops and fallback to 'match' when the
first nexthop matches, but has a negative score.
Fixes: 3d709f69a3 ("ipv6: Use hash-threshold instead of modulo-N")
Fixes: 4d0ab3a688 ("ipv6: Start path selection from the first nexthop")
Reported-by: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Closes: https://lore.kernel.org/netdev/67efef607bc41_1ddca82948c@willemb.c.googlers.com.notmuch/
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20250408084316.243559-1-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
DSA has 2 kinds of drivers:
1. Those who call dsa_switch_suspend() and dsa_switch_resume() from
their device PM ops: qca8k-8xxx, bcm_sf2, microchip ksz
2. Those who don't: all others. The above methods should be optional.
For type 1, dsa_switch_suspend() calls dsa_user_suspend() -> phylink_stop(),
and dsa_switch_resume() calls dsa_user_resume() -> phylink_start().
These seem good candidates for setting mac_managed_pm = true because
that is essentially its definition [1], but that does not seem to be the
biggest problem for now, and is not what this change focuses on.
Talking strictly about the 2nd category of DSA drivers here (which
do not have MAC managed PM, meaning that for their attached PHYs,
mdio_bus_phy_suspend() and mdio_bus_phy_resume() should run in full),
I have noticed that the following warning from mdio_bus_phy_resume() is
triggered:
WARN_ON(phydev->state != PHY_HALTED && phydev->state != PHY_READY &&
phydev->state != PHY_UP);
because the PHY state machine is running.
It's running as a result of a previous dsa_user_open() -> ... ->
phylink_start() -> phy_start() having been initiated by the user.
The previous mdio_bus_phy_suspend() was supposed to have called
phy_stop_machine(), but it didn't. So this is why the PHY is in state
PHY_NOLINK by the time mdio_bus_phy_resume() runs.
mdio_bus_phy_suspend() did not call phy_stop_machine() because for
phylink, the phydev->adjust_link function pointer is NULL. This seems a
technicality introduced by commit fddd91016d ("phylib: fix PAL state
machine restart on resume"). That commit was written before phylink
existed, and was intended to avoid crashing with consumer drivers which
don't use the PHY state machine - phylink always does, when using a PHY.
But phylink itself has historically not been developed with
suspend/resume in mind, and apparently not tested too much in that
scenario, allowing this bug to exist unnoticed for so long. Plus, prior
to the WARN_ON(), it would have likely been invisible.
This issue is not in fact restricted to type 2 DSA drivers (according to
the above ad-hoc classification), but can be extrapolated to any MAC
driver with phylink and MDIO-bus-managed PHY PM ops. DSA is just where
the issue was reported. Assuming mac_managed_pm is set correctly, a
quick search indicates the following other drivers might be affected:
$ grep -Zlr PHYLINK_NETDEV drivers/ | xargs -0 grep -L mac_managed_pm
drivers/net/ethernet/atheros/ag71xx.c
drivers/net/ethernet/microchip/sparx5/sparx5_main.c
drivers/net/ethernet/microchip/lan966x/lan966x_main.c
drivers/net/ethernet/freescale/dpaa2/dpaa2-mac.c
drivers/net/ethernet/freescale/fs_enet/fs_enet-main.c
drivers/net/ethernet/freescale/dpaa/dpaa_eth.c
drivers/net/ethernet/freescale/ucc_geth.c
drivers/net/ethernet/freescale/enetc/enetc_pf_common.c
drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c
drivers/net/ethernet/marvell/mvneta.c
drivers/net/ethernet/marvell/prestera/prestera_main.c
drivers/net/ethernet/mediatek/mtk_eth_soc.c
drivers/net/ethernet/altera/altera_tse_main.c
drivers/net/ethernet/wangxun/txgbe/txgbe_phy.c
drivers/net/ethernet/meta/fbnic/fbnic_phylink.c
drivers/net/ethernet/tehuti/tn40_phy.c
drivers/net/ethernet/mscc/ocelot_net.c
Make the existing conditions dependent on the PHY device having a
phydev->phy_link_change() implementation equal to the default
phy_link_change() provided by phylib. Otherwise, we implicitly know that
the phydev has the phylink-provided phylink_phy_change() callback, and
when phylink is used, the PHY state machine always needs to be stopped/
started on the suspend/resume path. The code is structured as such that
if phydev->phy_link_change() is absent, it is a matter of time until the
kernel will crash - no need to further complicate the test.
Thus, for the situation where the PM is not managed by the MAC, we will
make the MDIO bus PM ops treat identically the phylink-controlled PHYs
with the phylib-controlled PHYs where an adjust_link() callback is
supplied. In both cases, the MDIO bus PM ops should stop and restart the
PHY state machine.
[1] https://lore.kernel.org/netdev/Z-1tiW9zjcoFkhwc@shell.armlinux.org.uk/
Fixes: 744d23c71a ("net: phy: Warn about incorrect mdio_bus_phy_resume() state")
Reported-by: Wei Fang <wei.fang@nxp.com>
Tested-by: Wei Fang <wei.fang@nxp.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://patch.msgid.link/20250407094042.2155633-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In an upcoming change, mdio_bus_phy_may_suspend() will need to
distinguish a phylib-based PHY client from a phylink PHY client.
For that, it will need to compare the phydev->phy_link_change() function
pointer with the eponymous phy_link_change() provided by phylib.
To avoid forward function declarations, the default PHY link state
change method should be moved upwards. There is no functional change
associated with this patch, it is only to reduce the noise from a real
bug fix.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/20250407093900.2155112-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pull kselftest fixes from Shuah Khan:
- Fixes tpm2, futex, and mincore tests
- Create a dedicated .gitignore for tpm2 tests
* tag 'linux_kselftest-fixes-6.15-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
selftests/mincore: Allow read-ahead pages to reach the end of the file
selftests/futex: futex_waitv wouldblock test should fail
selftests: tpm2: test_smoke: use POSIX-conformant expression operator
selftests: tpm2: create a dedicated .gitignore
Commit 8284066946 ("ublk: grab request reference when the request is handled
by userspace") doesn't grab request reference in case of recovery reissue.
Then the request can be requeued & re-dispatch & failed when canceling
uring command.
If it is one zc request, the request can be freed before io_uring
returns the zc buffer back, then cause kernel panic:
[ 126.773061] BUG: kernel NULL pointer dereference, address: 00000000000000c8
[ 126.773657] #PF: supervisor read access in kernel mode
[ 126.774052] #PF: error_code(0x0000) - not-present page
[ 126.774455] PGD 0 P4D 0
[ 126.774698] Oops: Oops: 0000 [#1] SMP NOPTI
[ 126.775034] CPU: 13 UID: 0 PID: 1612 Comm: kworker/u64:55 Not tainted 6.14.0_blk+ #182 PREEMPT(full)
[ 126.775676] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-1.fc39 04/01/2014
[ 126.776275] Workqueue: iou_exit io_ring_exit_work
[ 126.776651] RIP: 0010:ublk_io_release+0x14/0x130 [ublk_drv]
Fixes it by always grabbing request reference for aborting the request.
Reported-by: Caleb Sander Mateos <csander@purestorage.com>
Closes: https://lore.kernel.org/linux-block/CADUfDZodKfOGUeWrnAxcZiLT+puaZX8jDHoj_sfHZCOZwhzz6A@mail.gmail.com/
Fixes: 8284066946 ("ublk: grab request reference when the request is handled by userspace")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250409011444.2142010-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Octavian Purdila says:
====================
net_sched: sch_sfq: reject a derived limit of 1
Because sfq parameters can influence each other there can be
situations where although the user sets a limit of 2 it can be lowered
to 1:
$ tc qdisc add dev dummy0 handle 1: root sfq limit 2 flows 1 depth 1
$ tc qdisc show dev dummy0
qdisc sfq 1: dev dummy0 root refcnt 2 limit 1p quantum 1514b depth 1 divisor 1024
$ tc qdisc add dev dummy0 handle 1: root sfq limit 2 flows 10 depth 1 divisor 1
$ tc qdisc show dev dummy0
qdisc sfq 2: root refcnt 2 limit 1p quantum 1514b depth 1 divisor 1
As a limit of 1 is invalid, this patch series moves the limit
validation to after all configuration changes have been done. To do
so, the configuration is done in a temporary work area then applied to
the internal state.
The patch series also adds new test cases.
v3:
- remove a couple of unnecessary comments
- rearrange local variables to use reverse Christmas tree style
declaration order
v2: https://lore.kernel.org/all/20250402162750.1671155-1-tavip@google.com/
- remove tmp struct and directly use local variables
v1: https://lore.kernel.org/all/20250328201634.3876474-1-tavip@google.com/
===================
Signed-off-by: David S. Miller <davem@davemloft.net>
Because the limit is updated indirectly when other parameters are
updated, there are cases where even though the user requests a limit
of 2 it can actually be set to 1.
Add the following test cases to check that the kernel rejects them:
- limit 2 depth 1 flows 1
- limit 2 depth 1 divisor 1
Signed-off-by: Octavian Purdila <tavip@google.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It is not sufficient to directly validate the limit on the data that
the user passes as it can be updated based on how the other parameters
are changed.
Move the check at the end of the configuration update process to also
catch scenarios where the limit is indirectly updated, for example
with the following configurations:
tc qdisc add dev dummy0 handle 1: root sfq limit 2 flows 1 depth 1
tc qdisc add dev dummy0 handle 1: root sfq limit 2 flows 1 divisor 1
This fixes the following syzkaller reported crash:
------------[ cut here ]------------
UBSAN: array-index-out-of-bounds in net/sched/sch_sfq.c:203:6
index 65535 is out of range for type 'struct sfq_head[128]'
CPU: 1 UID: 0 PID: 3037 Comm: syz.2.16 Not tainted 6.14.0-rc2-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 12/27/2024
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:94 [inline]
dump_stack_lvl+0x201/0x300 lib/dump_stack.c:120
ubsan_epilogue lib/ubsan.c:231 [inline]
__ubsan_handle_out_of_bounds+0xf5/0x120 lib/ubsan.c:429
sfq_link net/sched/sch_sfq.c:203 [inline]
sfq_dec+0x53c/0x610 net/sched/sch_sfq.c:231
sfq_dequeue+0x34e/0x8c0 net/sched/sch_sfq.c:493
sfq_reset+0x17/0x60 net/sched/sch_sfq.c:518
qdisc_reset+0x12e/0x600 net/sched/sch_generic.c:1035
tbf_reset+0x41/0x110 net/sched/sch_tbf.c:339
qdisc_reset+0x12e/0x600 net/sched/sch_generic.c:1035
dev_reset_queue+0x100/0x1b0 net/sched/sch_generic.c:1311
netdev_for_each_tx_queue include/linux/netdevice.h:2590 [inline]
dev_deactivate_many+0x7e5/0xe70 net/sched/sch_generic.c:1375
Reported-by: syzbot <syzkaller@googlegroups.com>
Fixes: 10685681ba ("net_sched: sch_sfq: don't allow 1 packet limit")
Signed-off-by: Octavian Purdila <tavip@google.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Many configuration parameters have influence on others (e.g. divisor
-> flows -> limit, depth -> limit) and so it is difficult to correctly
do all of the validation before applying the configuration. And if a
validation error is detected late it is difficult to roll back a
partially applied configuration.
To avoid these issues use a temporary work area to update and validate
the configuration and only then apply the configuration to the
internal state.
Signed-off-by: Octavian Purdila <tavip@google.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull kunit fixes from Shuah Khan:
- Fix the tool to report test count in case of a late test plan when
tests are specified before the test plan
- Fix spelling error
* tag 'linux_kselftest-kunit-6.15-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
kunit: Spelling s/slowm/slow/
kunit: tool: fix count of tests if late test plan
page_pool_dev_alloc_pages could return NULL. There was a WARN_ON(!page)
but it would still proceed to use the NULL pointer and then crash.
This is similar to commit 001ba09020
("net: fec: handle page_pool_dev_alloc_pages error").
This is found by our static analysis tool KNighter.
Signed-off-by: Chenyuan Yang <chenyuan0y@gmail.com>
Fixes: 3c47e8ae11 ("net: libwx: Support to receive packets in NAPI")
Reviewed-by: Joe Damato <jdamato@fastly.com>
Link: https://patch.msgid.link/20250407184952.2111299-1-chenyuan0y@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Matthieu Baerts says:
====================
mptcp: only inc MPJoinAckHMacFailure for HMAC failures
Recently, during a debugging session using local MPTCP connections, I
noticed MPJoinAckHMacFailure was strangely not zero on the server side.
The first patch fixes this issue -- present since v5.9 -- and the second
one validates it in the selftests.
====================
Link: https://patch.msgid.link/20250407-net-mptcp-hmac-failure-mib-v1-0-3c9ecd0a3a50@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Recently, during a debugging session using local MPTCP connections, I
noticed MPJoinAckHMacFailure was not zero on the server side. The
counter was in fact incremented when the PM rejected new subflows,
because the 'subflow' limit was reached.
The fix is easy, simply dissociating the two cases: only the HMAC
validation check should increase MPTCP_MIB_JOINACKMAC counter.
Fixes: 4cf8b7e48a ("subflow: introduce and use mptcp_can_accept_new_subflow()")
Cc: stable@vger.kernel.org
Reviewed-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250407-net-mptcp-hmac-failure-mib-v1-1-3c9ecd0a3a50@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When running the mincore_selftest on a system with an XFS file system, it
failed the "check_file_mmap" test case due to the read-ahead pages reaching
the end of the file. The failure log is as below:
RUN global.check_file_mmap ...
mincore_selftest.c:264:check_file_mmap:Expected i (1024) < vec_size (1024)
mincore_selftest.c:265:check_file_mmap:Read-ahead pages reached the end of the file
check_file_mmap: Test failed
FAIL global.check_file_mmap
This is because the read-ahead window size of the XFS file system on this
machine is 4 MB, which is larger than the size from the #PF address to the
end of the file. As a result, all the pages for this file are populated.
blockdev --getra /dev/nvme0n1p5
8192
blockdev --getbsz /dev/nvme0n1p5
512
This issue can be fixed by extending the current FILE_SIZE 4MB to a larger
number, but it will still fail if the read-ahead window size of the file
system is larger enough. Additionally, in the real world, read-ahead pages
reaching the end of the file can happen and is an expected behavior.
Therefore, allowing read-ahead pages to reach the end of the file is a
better choice for the "check_file_mmap" test case.
Link: https://lore.kernel.org/r/20250311080940.21413-1-qiuxu.zhuo@intel.com
Reported-by: Yi Lai <yi1.lai@intel.com>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>