This reverts commit 484a54c2e5. The CoDel
parameter change essentially disables CoDel on slow stations, with some
questionable assumptions, as Dave pointed out in [0]. Quoting from
there:
But here are my pithy comments as to why this part of mac80211 is so
wrong...
static void sta_update_codel_params(struct sta_info *sta, u32 thr)
{
- if (thr && thr < STA_SLOW_THRESHOLD * sta->local->num_sta) {
1) sta->local->num_sta is the number of associated, rather than
active, stations. "Active" stations in the last 50ms or so, might have
been a better thing to use, but as most people have far more than that
associated, we end up with really lousy codel parameters, all the
time. Mistake numero uno!
2) The STA_SLOW_THRESHOLD was completely arbitrary in 2016.
- sta->cparams.target = MS2TIME(50);
This, by itself, was probably not too bad. 30ms might have been
better, at the time, when we were battling powersave etc, but 20ms was
enough, really, to cover most scenarios, even where we had low rate
2Ghz multicast to cope with. Even then, codel has a hard time finding
any sane drop rate at all, with a target this high.
- sta->cparams.interval = MS2TIME(300);
But this was horrible, a total mistake, that is leading to codel being
completely ineffective in almost any scenario on clients or APS.
100ms, even 80ms, here, would be vastly better than this insanity. I'm
seeing 5+seconds of delay accumulated in a bunch of otherwise happily
fq-ing APs....
100ms of observed jitter during a flow is enough. Certainly (in 2016)
there were interactions with powersave that I did not understand, and
still don't, but if you are transmitting in the first place, powersave
shouldn't be a problemmmm.....
- sta->cparams.ecn = false;
At the time we were pretty nervous about ecn, I'm kind of sanguine
about it now, and reliably indicating ecn seems better than turning it
off for any reason.
[...]
In production, on p2p wireless, I've had 8ms and 80ms for target and
interval for years now, and it works great.
I think Dave's arguments above are basically sound on the face of it,
and various experimentation with tighter CoDel parameters in the OpenWrt
community has shown promising results[1]. So I don't think there's any
reason to keep this parameter fiddling; hence this revert.
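To make Dave's first point concrete, here is a minimal userspace sketch of the pre-revert decision. The 50 ms / 300 ms / no-ECN values come from the removed lines quoted above; the value of STA_SLOW_THRESHOLD and the 20 ms / 100 ms / ECN defaults are assumptions made for illustration, and the struct and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: the quoted diff shows only the name, not the value. */
#define STA_SLOW_THRESHOLD 6000u /* kbit/s */

struct codel_params_sketch {
	uint32_t target_ms;
	uint32_t interval_ms;
	bool ecn;
};

/*
 * Mirrors the removed condition: the "slow" cut-off grows with the number
 * of associated stations, whether or not they are active.
 */
struct codel_params_sketch pick_params(uint32_t thr_kbps, uint32_t num_sta)
{
	if (thr_kbps && thr_kbps < STA_SLOW_THRESHOLD * num_sta)
		return (struct codel_params_sketch){ 50, 300, false };

	/* Assumed defaults used when the station is not considered slow. */
	return (struct codel_params_sketch){ 20, 100, true };
}
```

Under these assumptions, with 10 associated stations a station estimated at 50 Mbit/s is already classified as slow, which is the effect the revert removes.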
[0] https://lore.kernel.org/linux-wireless/CAA93jw6NJ2cmLmMauz0xAgC2MGbBq6n0ZiZzAdkK0u4b+O2yXg@mail.gmail.com/
[1] https://forum.openwrt.org/t/reducing-multiplexing-latencies-still-further-in-wifi/133605/130
Suggested-By: Dave Taht <dave.taht@gmail.com>
In-memory-of: Dave Taht <dave.taht@gmail.com>
Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
Link: https://patch.msgid.link/20250403183930.197716-1-toke@toke.dk
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
-Wflex-array-member-not-at-end was introduced in GCC-14, and we are
getting ready to enable it, globally.
Use the `DEFINE_RAW_FLEX()` helper for an on-stack definition of
a flexible structure where the size of the flexible-array member
is known at compile-time, and refactor the rest of the code,
accordingly.
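As a rough userspace illustration of the idea (not the kernel macro itself), the on-stack storage can be modelled as a union of a byte array sized for the header plus a compile-time element count, and the flexible structure. STACK_FLEX_SKETCH, struct sample_msg and sum_sample_msg() are hypothetical names used only for this sketch.

```c
#include <stddef.h>
#include <string.h>

struct sample_msg {
	unsigned int len;
	unsigned char data[];	/* flexible-array member */
};

/* Hypothetical stand-in: reserves room for 'count' trailing elements. */
#define STACK_FLEX_SKETCH(type, name, member, count)			\
	union {								\
		unsigned char bytes[offsetof(type, member) +		\
				    (count) * sizeof(((type *)0)->member[0])]; \
		type obj;						\
	} name##_u = { { 0 } };						\
	type *name = &name##_u.obj

unsigned int sum_sample_msg(void)
{
	STACK_FLEX_SKETCH(struct sample_msg, msg, data, 4);
	static const unsigned char payload[4] = { 1, 2, 3, 4 };
	unsigned int i, sum = 0;

	/* The object is accessed through a pointer, as with the real helper. */
	msg->len = sizeof(payload);
	memcpy(msg->data, payload, msg->len);
	for (i = 0; i < msg->len; i++)
		sum += msg->data[i];

	return sum;
}
```

Because the flexible structure is no longer embedded in the middle of another structure, nothing follows the flexible-array member, which is what the warning checks for.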
So, with these changes, fix the following warning:
drivers/net/wireless/intel/iwlwifi/mvm/mac80211.c:6430:41: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end]
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Acked-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/Z-SV8gb6MuZJmmhe@kspp
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
-Wflex-array-member-not-at-end was introduced in GCC-14, and we are
getting ready to enable it, globally.
Use the `DEFINE_RAW_FLEX()` helper for on-stack definitions of
a flexible structure where the size of the flexible-array member
is known at compile-time, and refactor the rest of the code,
accordingly.
So, with these changes, fix the following warnings:
net/mac80211/spectmgmt.c:151:47: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end]
net/mac80211/spectmgmt.c:155:48: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end]
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Reviewed-by: Kees Cook <kees@kernel.org>
Link: https://patch.msgid.link/Z-SQdHZljwAgIlp9@kspp
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The ethernet-controller schema specifies "mac-address" and
"local-mac-address" but other network devices such as wireless network
adapters use mac addresses as well.
The Devicetree Specification, Release v0.3 specifies in section 4.3.1
a generic "Network Class Binding" with "address-bits", "mac-address",
"local-mac-address" and "max-frame-size". This schema specifies the
"address-bits" property and moves the remaining properties over from
the ethernet-controller.yaml schema.
The "max-frame-size" property is used to describe the maximal payload
size despite its name. Keep the description from ethernet-controller
specifying this property as MTU. The contradictory description in the
Devicetree Specification is ignored.
Signed-off-by: Janne Grunau <j@jannau.net>
Signed-off-by: David Heidelberg <david@ixit.cz>
Reviewed-by: Rob Herring (Arm) <robh@kernel.org>
Link: https://patch.msgid.link/20250324-dt-bindings-network-class-v5-1-f5c3fe00e8f0@ixit.cz
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
irq_domain_add_linear() is obsolete and is going away. Switch to the
preferred irq_domain_create_linear(). It differs in the first
parameter: it takes a more generic struct fwnode_handle instead of
struct device_node. Therefore, the argument is wrapped in
of_fwnode_handle().
Note that some of the users can likely use dev->fwnode directly instead
of the indirect of_fwnode_handle(dev->of_node). But dev->fwnode is not
guaranteed to be set for all of them, so this has to be investigated on
a case-by-case basis (by people who can actually test with the HW).
Signed-off-by: Jiri Slaby (SUSE) <jirislaby@kernel.org>
Cc: Michael Buesch <m@bues.ch>
Cc: linux-wireless@vger.kernel.org
Link: https://patch.msgid.link/20250319092951.37667-38-jirislaby@kernel.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Pull networking fixes from Jakub Kicinski:
"Including fixes from Bluetooth, CAN and Netfilter.
Current release - regressions:
- two fixes for the netdev per-instance locking
- batman-adv: fix double-hold of meshif when getting enabled
Current release - new code bugs:
- Bluetooth: increment TX timestamping tskey always for stream
sockets
- wifi: static analysis and build fixes for the new Intel sub-driver
Previous releases - regressions:
- net: fib_rules: fix iif / oif matching on L3 master (VRF) device
- ipv6: add exception routes to GC list in rt6_insert_exception()
- netfilter: conntrack: fix erroneous removal of offload bit
- Bluetooth:
- fix sending MGMT_EV_DEVICE_FOUND for invalid address
- l2cap: process valid commands in too long frame
- btnxpuart: Revert baudrate change in nxp_shutdown
Previous releases - always broken:
- ethtool: fix memory corruption during SFP FW flashing
- eth:
- hibmcge: fixes for link and MTU handling, pause frames etc
- igc: fixes for PTM (PCIe timestamping)
- dsa: b53: enable BPDU reception for management port
Misc:
- fixes for Netlink protocol schemas"
* tag 'net-6.15-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (81 commits)
net: ethernet: mtk_eth_soc: revise QDMA packet scheduler settings
net: ethernet: mtk_eth_soc: correct the max weight of the queue limit for 100Mbps
net: ethernet: mtk_eth_soc: reapply mdc divider on reset
net: ti: icss-iep: Fix possible NULL pointer dereference for perout request
net: ti: icssg-prueth: Fix possible NULL pointer dereference inside emac_xmit_xdp_frame()
net: ti: icssg-prueth: Fix kernel warning while bringing down network interface
netfilter: conntrack: fix erronous removal of offload bit
net: don't try to ops lock uninitialized devs
ptp: ocp: fix start time alignment in ptp_ocp_signal_set
net: dsa: avoid refcount warnings when ds->ops->tag_8021q_vlan_del() fails
net: dsa: free routing table on probe failure
net: dsa: clean up FDB, MDB, VLAN entries on unbind
net: dsa: mv88e6xxx: fix -ENOENT when deleting VLANs and MST is unsupported
net: dsa: mv88e6xxx: avoid unregistering devlink regions which were never registered
net: txgbe: fix memory leak in txgbe_probe() error path
net: bridge: switchdev: do not notify new brentries as changed
net: b53: enable BPDU reception for management port
netlink: specs: rt-neigh: prefix struct nfmsg members with ndm
netlink: specs: rt-link: adjust mctp attribute naming
netlink: specs: rtnetlink: attribute naming corrections
...
Pull xen fix from Juergen Gross:
"Just a single fix for the Xen multicall driver avoiding a percpu
variable referencing initdata by its initializer"
* tag 'for-linus-6.15a-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
xen: fix multicall debug feature
Pull fwctl fixes from Jason Gunthorpe:
"Three small changes from further build testing:
- Don't rely on the userspace uuid.h for the uapi header
- Fix sparse warnings in pds
- Typo in log message"
* tag 'for-linus-fwctl' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
fwctl: Fix repeated device word in log message
pds_fwctl: Fix type and endian complaints
fwctl/cxl: Fix uuid_t usage in uapi
Pull sound fixes from Takashi Iwai:
"A collection of small fixes. All are device-specific like quirks, new
IDs, and other safe (or rather boring) changes"
* tag 'sound-6.15-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
firmware: cs_dsp: test_bin_error: Fix uninitialized data used as fw version
ASoC: codecs: Add of_match_table for aw888081 driver
ASoC: fsl: fsl_qmc_audio: Reset audio data pointers on TRIGGER_START event
mailmap: Add entry for Srinivas Kandagatla
MAINTAINERS: use kernel.org alias
ASoC: cs42l43: Reset clamp override on jack removal
ALSA: hda/realtek - Fixed ASUS platform headset Mic issue
ALSA: hda/cirrus_scodec_test: Don't select dependencies
ALSA: azt2320: Replace deprecated strcpy() with strscpy()
ASoC: hdmi-codec: use RTD ID instead of DAI ID for ELD entry
ASoC: Intel: avs: Constrain path based on BE capabilities
ALSA: hda/tas2781: Remove unnecessary NULL check before release_firmware()
ASoC: Intel: avs: Fix null-ptr-deref in avs_component_probe()
ASoC: fsl_asrc_dma: get codec or cpu dai from backend
ASoC: qcom: Fix sc7280 lpass potential buffer overflow
ASoC: dwc: always enable/disable i2s irqs
ASoC: Intel: sof_sdw: Add quirk for Asus Zenbook S16
ASoC: codecs:lpass-wsa-macro: Fix logic of enabling vi channels
ASoC: codecs:lpass-wsa-macro: Fix vi feedback rate
Pull SCSI fixes from James Bottomley:
"Small drivers fixes, except for ufs which has two large updates, one
for exposing the device level feature, which is a new addition to the
device spec and the other reworking the exynos driver to fix coherence
issues on some Android phones"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: megaraid_sas: Driver version update to 07.734.00.00-rc1
scsi: megaraid_sas: Block zero-length ATA VPD inquiry
scsi: scsi_transport_srp: Replace min/max nesting with clamp()
scsi: ufs: core: Add device level exception support
scsi: ufs: core: Rename ufshcd_wb_presrv_usrspc_keep_vcc_on()
scsi: smartpqi: Use is_kdump_kernel() to check for kdump
scsi: pm80xx: Set phy_attached to zero when device is gone
scsi: ufs: exynos: gs101: Put UFS device in reset on .suspend()
scsi: ufs: exynos: Move phy calls to .exit() callback
scsi: ufs: exynos: Enable PRDT pre-fetching with UFSHCD_CAP_CRYPTO
scsi: ufs: exynos: Ensure consistent phy reference counts
scsi: ufs: exynos: Disable iocc if dma-coherent property isn't set
scsi: ufs: exynos: Move UFS shareability value to drvdata
scsi: ufs: exynos: Ensure pre_link() executes before exynos_ufs_phy_init()
scsi: iscsi: Fix missing scsi_host_put() in error path
scsi: ufs: core: Fix a race condition related to device commands
scsi: hisi_sas: Fix I/O errors caused by hardware port ID changes
scsi: hisi_sas: Enable force phy when SATA disk directly connected
Pull ata fix from Damien Le Moal:
- Fix how sense data from the sense data for successful NCQ commands
log page is used to fully initialize the result_tf of a completed
command, so that the sense data returned to the scsi layer is fully
initialized with all the device provided information (from Niklas)
* tag 'ata-6.15-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/libata/linux:
ata: libata-sata: Save all fields from sense data descriptor
Pull XFS fixes from Carlos Maiolino:
"This mostly includes fixes and documentation for the zoned allocator
feature merged during previous merge window, but it also adds a sysfs
tunable for the zone garbage collector.
There is also a fix for a regression to the RT device that we'd like
to fix ASAP now that we're getting more users on the RT zoned
allocator"
* tag 'xfs-fixes-6.15-rc3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: document zoned rt specifics in admin-guide
xfs: fix fsmap for internal zoned devices
xfs: Fix spelling mistake "drity" -> "dirty"
xfs: compute buffer address correctly in xmbuf_map_backing_mem
xfs: add tunable threshold parameter for triggering zone GC
xfs: mark xfs_buf_free as might_sleep()
xfs: remove the leftover xfs_{set,clear}_li_failed infrastructure
Pull btrfs fixes from David Sterba:
- handle encoded read ioctl returning EAGAIN so it does not mistakenly
free the work structure
- escape subvolume path in mount option list so it cannot be wrongly
parsed when the path contains ","
- remove folio size assertions when writing super block to device with
enabled large folios
* tag 'for-6.15-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: remove folio order ASSERT()s in super block writeback path
btrfs: correctly escape subvol in btrfs_show_options()
btrfs: ioctl: don't free iov when btrfs_encoded_read() returns -EAGAIN
Pull slab fix from Vlastimil Babka:
- Stable fix adding zero initialization of slab->obj_ext to prevent
crashes with allocation profiling (Suren Baghdasaryan)
* tag 'slab-for-6.15-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab:
slab: ensure slab->obj_exts is clear in a newly allocated slab page
Pablo Neira Ayuso says:
====================
Netfilter fix for net
The following batch contains one Netfilter fix for net:
1) conntrack offload bit is erroneously unset in a race scenario,
from Florian Westphal.
netfilter pull request 25-04-17
* tag 'nf-25-04-17' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
netfilter: conntrack: fix erronous removal of offload bit
====================
Link: https://patch.msgid.link/20250417102847.16640-1-pablo@netfilter.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Luiz Augusto von Dentz says:
====================
bluetooth pull request for net:
- l2cap: Process valid commands in too long frame
- vhci: Avoid needless snprintf() calls
* tag 'for-net-2025-04-16' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth:
Bluetooth: vhci: Avoid needless snprintf() calls
Bluetooth: l2cap: Process valid commands in too long frame
====================
Link: https://patch.msgid.link/20250416210126.2034212-1-luiz.dentz@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Meghana Malladi says:
====================
Bug fixes from XDP and perout series
This patch series consists of bug fixes from the XDP series:
1. Fixes a kernel warning that occurs when bringing down the
network interface.
2. Resolves a potential NULL pointer dereference in the
emac_xmit_xdp_frame() function.
3. Resolves a potential NULL pointer dereference in the
icss_iep_perout_enable() function
v3: https://lore.kernel.org/all/20250328102403.2626974-1-m-malladi@ti.com/
====================
Link: https://patch.msgid.link/20250415090543.717991-1-m-malladi@ti.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
During network interface initialization, the NIC driver needs to register
its Rx queue with the XDP, to ensure the incoming XDP buffer carries a
pointer reference to this info and is stored inside xdp_rxq_info.
While this struct isn't tied to XDP prog, if there are any changes in
Rx queue, the NIC driver needs to stop the Rx queue by unregistering
with XDP before purging and reallocating memory. Drop the page_pool
destroy during Rx channel reset, as this is already handled by XDP
during xdp_rxq_info_unreg() (Rx queue unregister); failing to do so
causes the following warning:
warning logs: https://gist.github.com/MeghanaMalladiTI/eb627e5dc8de24e42d7d46572c13e576
Fixes: 46eeb90f03 ("net: ti: icssg-prueth: Use page_pool API for RX buffer allocation")
Signed-off-by: Meghana Malladi <m-malladi@ti.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Roger Quadros <rogerq@kernel.org>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20250415090543.717991-2-m-malladi@ti.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The blamed commit exposes a possible issue with flow_offload_teardown():
We might remove the offload bit of a conntrack entry that has been
offloaded again.
1. conntrack entry c1 is offloaded via flow f1 (f1->ct == c1).
2. f1 times out and is pushed back to slowpath, c1 offload bit is
removed. Due to bug, f1 is not unlinked from rhashtable right away.
3. a new packet arrives for the flow and re-offload is triggered, i.e.
f2->ct == c1. This is because lookup in the flowtable skips entries
with the teardown bit set.
4. Next flowtable gc cycle finds f1 again
5. flow_offload_teardown() is called again for f1 and c1 offload bit is
removed again, even though we have f2 referencing the same entry.
This is harmless, but clearly not correct.
Fix the bug that exposes this: set 'teardown = true' to have the gc
callback unlink the flowtable entry from the table right away instead of
the unintentional defer to the next round.
Also prevent flow_offload_teardown() from fixing up the ct state more than
once: We could also be called from the data path or a notifier, not only
from the flowtable gc callback.
NF_FLOW_TEARDOWN can never be unset, so we can use it as a
synchronization point: if we did not observe a 0 -> 1 transition, then
another CPU is already doing the ct state fixups for us.
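The synchronization idea can be sketched in userspace with C11 atomics, where atomic_fetch_or() stands in for the kernel's test-and-set bit operation. The struct, flag value and function name below are hypothetical; only the "first 0 -> 1 transition does the fixups" rule is taken from the text above.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define FLOW_TEARDOWN_BIT (1u << 0)	/* hypothetical flag layout */

struct flow_sketch {
	atomic_uint flags;
	int ct_fixups;	/* how often the conntrack state was fixed up */
};

/* Returns true only for the caller that performs the 0 -> 1 transition. */
bool flow_teardown_once(struct flow_sketch *flow)
{
	unsigned int old = atomic_fetch_or(&flow->flags, FLOW_TEARDOWN_BIT);

	if (old & FLOW_TEARDOWN_BIT)
		return false;	/* bit was already set: fixups belong to someone else */

	flow->ct_fixups++;	/* stand-in for the one-time ct state fixups */
	return true;
}
```

Since the bit is never cleared, a second call from the gc callback, the data path or a notifier becomes a no-op for the conntrack state.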
Fixes: 03428ca5ce ("netfilter: conntrack: rework offload nf_conn timeout extension logic")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Document the lifetime, nolifetime and max_open_zones mount options
added for zoned rt file systems.
Also add documentation describing the max_open_zones sysfs attribute
exposed in /sys/fs/xfs/<dev>/zoned/
Fixes: 4e4d520755 ("xfs: add the zoned space allocator")
Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Pull misc hotfixes from Andrew Morton:
"31 hotfixes.
9 are cc:stable and the remainder address post-6.14 issues or aren't
considered necessary for -stable kernels.
22 patches are for MM, 9 are otherwise"
* tag 'mm-hotfixes-stable-2025-04-16-19-59' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (31 commits)
MAINTAINERS: update HUGETLB reviewers
mm: fix apply_to_existing_page_range()
selftests/mm: fix compiler -Wmaybe-uninitialized warning
alloc_tag: handle incomplete bulk allocations in vm_module_tags_populate
mailmap: add entry for Jean-Michel Hautbois
mm: (un)track_pfn_copy() fix + doc improvements
mm: fix filemap_get_folios_contig returning batches of identical folios
mm/hugetlb: add a line break at the end of the format string
selftests: mincore: fix tmpfs mincore test failure
mm/hugetlb: fix set_max_huge_pages() when there are surplus pages
mm/cma: report base address of single range correctly
mm: page_alloc: speed up fallbacks in rmqueue_bulk()
kunit: slub: add module description
mm/kasan: add module decription
ucs2_string: add module description
zlib: add module description
fpga: tests: add module descriptions
samples/livepatch: add module descriptions
ASN.1: add module description
mm/vma: add give_up_on_oom option on modify/merge, use in uffd release
...
We need to be careful when operating on dev while in rtnl_create_link().
Some devices (vxlan) initialize netdev_ops only later, in ->newlink.
Avoid using netdev_lock_ops(), the device isn't registered so we
cannot legally call its ops or generate any notifications for it.
netdev_ops_assert_locked_or_invisible() is safe to use, it checks
registration status first.
Reported-by: syzbot+de1c7d68a10e3f123bdd@syzkaller.appspotmail.com
Fixes: 04efcee6ef ("net: hold instance lock during NETDEV_CHANGE")
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250415151552.768373-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In ptp_ocp_signal_set, the start time for periodic signals is not
aligned to the next period boundary. The current code rounds up the
start time and divides by the period but fails to multiply back by
the period, causing misaligned signal starts. Fix this by multiplying
the rounded-up value by the period to ensure the start time is the
closest next period.
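The arithmetic can be sketched as follows, with a plain round-up division in place of the kernel's DIV64_U64_ROUND_UP(). The function name is hypothetical, and the sketch assumes period_ns is non-zero and that the sums do not overflow.

```c
#include <stdint.h>

/*
 * Round start_ns up to the next multiple of period_ns.
 * The division alone yields a period count; multiplying back
 * turns that count into an aligned timestamp.
 */
uint64_t align_start_to_period(uint64_t start_ns, uint64_t period_ns)
{
	uint64_t periods = (start_ns + period_ns - 1) / period_ns;

	return periods * period_ns;
}
```

For example, a start of 10 with a period of 4 rounds up to 3 periods; without the multiplication the start would be 3, with it the start is 12.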
Fixes: 4bd46bb037 ("ptp: ocp: Use DIV64_U64_ROUND_UP for rounding.")
Signed-off-by: Sagi Maimon <maimon.sagi@gmail.com>
Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Link: https://patch.msgid.link/20250415053131.129413-1-maimon.sagi@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Vladimir Oltean says:
====================
Collection of DSA bug fixes
Prompted by Russell King's 3 DSA bug reports from Friday (linked in
their respective patches: 1, 2 and 3), I am providing fixes to those, as
well as flushing the queue with 2 other bug fixes I had.
1: fix NULL pointer dereference during mv88e6xxx driver unbind, on old
switch models which lack PVT and/or STU. Seen on the ZII dev board
rev B.
2: fix failure to delete bridge port VLANs on old mv88e6xxx chips which
lack STU. Seen on the same board.
3: fix WARN_ON() and resource leak in DSA core on driver unbind. Seen on
the same board but is a much more widespread issue.
4: fix use-after-free during probing of DSA trees with >= 3 switches,
if -EPROBE_DEFER exists. In principle the issue also exists for the
ZII board; I reproduced it on Turris MOX.
5: fix incorrect use of refcount API in DSA core for those switches
which use tag_8021q (felix, sja1105, vsc73xx). Returning an error
when attempting to delete a tag_8021q VLAN prints a WARN_ON(), which
is harmless but might be problematic with CONFIG_PANIC_ON_OOPS.
====================
Link: https://patch.msgid.link/20250414212708.2948164-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
If complete = true in dsa_tree_setup(), it means that we are the last
switch of the tree which is successfully probing, and we should be
setting up all switches from our probe path.
After "complete" becomes true, dsa_tree_setup_cpu_ports() or any
subsequent function may fail. If that happens, the entire tree setup is
in limbo: the first N-1 switches have successfully finished probing
(doing nothing but having allocated persistent memory in the tree's
dst->ports, and maybe dst->rtable), and switch N failed to probe, ending
the tree setup process before anything is tangible from the user's PoV.
If switch N fails to probe, its memory (ports) will be freed and removed
from dst->ports. However, the dst->rtable elements pointing to its ports,
as created by dsa_link_touch(), will remain there, and will lead to
use-after-free if dereferenced.
If dsa_tree_setup_switches() returns -EPROBE_DEFER, which is entirely
possible because that is where ds->ops->setup() is, we get a kasan
report like this:
==================================================================
BUG: KASAN: slab-use-after-free in mv88e6xxx_setup_upstream_port+0x240/0x568
Read of size 8 at addr ffff000004f56020 by task kworker/u8:3/42
Call trace:
__asan_report_load8_noabort+0x20/0x30
mv88e6xxx_setup_upstream_port+0x240/0x568
mv88e6xxx_setup+0xebc/0x1eb0
dsa_register_switch+0x1af4/0x2ae0
mv88e6xxx_register_switch+0x1b8/0x2a8
mv88e6xxx_probe+0xc4c/0xf60
mdio_probe+0x78/0xb8
really_probe+0x2b8/0x5a8
__driver_probe_device+0x164/0x298
driver_probe_device+0x78/0x258
__device_attach_driver+0x274/0x350
Allocated by task 42:
__kasan_kmalloc+0x84/0xa0
__kmalloc_cache_noprof+0x298/0x490
dsa_switch_touch_ports+0x174/0x3d8
dsa_register_switch+0x800/0x2ae0
mv88e6xxx_register_switch+0x1b8/0x2a8
mv88e6xxx_probe+0xc4c/0xf60
mdio_probe+0x78/0xb8
really_probe+0x2b8/0x5a8
__driver_probe_device+0x164/0x298
driver_probe_device+0x78/0x258
__device_attach_driver+0x274/0x350
Freed by task 42:
__kasan_slab_free+0x48/0x68
kfree+0x138/0x418
dsa_register_switch+0x2694/0x2ae0
mv88e6xxx_register_switch+0x1b8/0x2a8
mv88e6xxx_probe+0xc4c/0xf60
mdio_probe+0x78/0xb8
really_probe+0x2b8/0x5a8
__driver_probe_device+0x164/0x298
driver_probe_device+0x78/0x258
__device_attach_driver+0x274/0x350
The simplest way to fix the bug is to delete the routing table in its
entirety. dsa_tree_setup_routing_table() has no problem in regenerating
it even if we deleted links between ports other than those of switch N,
because dsa_link_touch() first checks whether the port pair already
exists in dst->rtable, allocating if not.
The deletion of the routing table in its entirety already exists in
dsa_tree_teardown(), so refactor that into a function that can also be
called from the tree setup error path.
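A simplified userspace model of why deleting the whole table is safe: if the "touch" operation looks up the pair before allocating, the table can be dropped and regenerated at any time. All names below are hypothetical and the list is a plain singly linked list, not the kernel's list_head.

```c
#include <stdlib.h>

struct link_sketch {
	int from, to;
	struct link_sketch *next;
};

struct tree_sketch {
	struct link_sketch *rtable;
};

/* Returns the existing link for the pair, allocating only if it is missing. */
struct link_sketch *link_touch(struct tree_sketch *t, int from, int to)
{
	struct link_sketch *l;

	for (l = t->rtable; l; l = l->next)
		if (l->from == from && l->to == to)
			return l;

	l = malloc(sizeof(*l));
	if (!l)
		return NULL;

	l->from = from;
	l->to = to;
	l->next = t->rtable;
	t->rtable = l;
	return l;
}

/* Drops every link; usable from both teardown and the setup error path. */
void tree_free_rtable(struct tree_sketch *t)
{
	while (t->rtable) {
		struct link_sketch *l = t->rtable;

		t->rtable = l->next;
		free(l);
	}
}
```

After tree_free_rtable(), no stale link can point at freed ports, and the next setup attempt simply recreates whatever links it needs.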
In my analysis of the commit to blame, it is the one which added
dsa_link elements to dst->rtable. Prior to that, each switch had its own
ds->rtable which is freed when the switch fails to probe. But the tree
is potentially persistent memory.
Fixes: c5f51765a1 ("net: dsa: list DSA links in the fabric")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://patch.msgid.link/20250414213001.2957964-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
As explained in many places such as commit b117e1e8a8 ("net: dsa:
delete dsa_legacy_fdb_add and dsa_legacy_fdb_del"), DSA is written given
the assumption that higher layers have balanced additions/deletions.
As such, it only makes sense to be extremely vocal when those
assumptions are violated and the driver unbinds with entries still
present.
But Ido Schimmel points out a very simple situation where that is wrong:
https://lore.kernel.org/netdev/ZDazSM5UsPPjQuKr@shredder/
(also briefly discussed by me in the aforementioned commit).
Basically, while the bridge bypass operations are not something that DSA
explicitly documents, and for the majority of DSA drivers this API
simply causes them to go to promiscuous mode, that isn't the case for
all drivers. Some have the necessary requirements for bridge bypass
operations to do something useful - see dsa_switch_supports_uc_filtering().
Although in tools/testing/selftests/net/forwarding/local_termination.sh,
we made an effort to popularize better mechanisms to manage address
filters on DSA interfaces from user space - namely macvlan for unicast,
and setsockopt(IP_ADD_MEMBERSHIP) - through mtools - for multicast, the
fact is that 'bridge fdb add ... self static local' also exists as
kernel UAPI, and might be useful to someone, even if only for a quick
hack.
It seems counter-productive to block that path by implementing shim
.ndo_fdb_add and .ndo_fdb_del operations which just return -EOPNOTSUPP
in order to prevent the ndo_dflt_fdb_add() and ndo_dflt_fdb_del() from
running, although we could do that.
Accepting that cleanup is necessary seems to be the only option.
Especially since we appear to be coming back at this from a different
angle as well. Russell King is noticing that the WARN_ON() triggers even
for VLANs:
https://lore.kernel.org/netdev/Z_li8Bj8bD4-BYKQ@shell.armlinux.org.uk/
What happens in the bug report above is that dsa_port_do_vlan_del() fails,
then the VLAN entry lingers on, and then we warn on unbind and leak it.
This is not a straight revert of the blamed commit, but we now add an
informational print to the kernel log (to still have a way to see
that bugs exist), and some extra comments gathered from past years'
experience, to justify the logic.
Fixes: 0832cd9f1f ("net: dsa: warn if port lists aren't empty in dsa_port_teardown")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://patch.msgid.link/20250414212930.2956310-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Russell King reports that on the ZII dev rev B, deleting a bridge VLAN
from a user port fails with -ENOENT:
https://lore.kernel.org/netdev/Z_lQXNP0s5-IiJzd@shell.armlinux.org.uk/
This comes from mv88e6xxx_port_vlan_leave() -> mv88e6xxx_mst_put(),
which tries to find an MST entry in &chip->msts associated with the SID,
but fails and returns -ENOENT as such.
But we know that this chip does not support MST at all, so that is not
surprising. The question is why does the guard in mv88e6xxx_mst_put()
not exit early:
if (!sid)
return 0;
And the answer seems to be simple: the sid comes from vlan.sid which
supposedly was previously populated by mv88e6xxx_vtu_get().
But some chip->info->ops->vtu_getnext() implementations do not populate
vlan.sid, for example see mv88e6185_g1_vtu_getnext(). In that case,
later in mv88e6xxx_port_vlan_leave() we are using a garbage sid which is
just residual stack memory.
Testing for sid == 0 covers all cases of a non-bridge VLAN or a bridge
VLAN mapped to the default MSTI. For some chips, SID 0 is valid and
installed by mv88e6xxx_stu_setup(). A chip which does not support the
STU would implicitly only support mapping all VLANs to the default MSTI,
so although SID 0 is not valid, it would be sufficient, if we were to
zero-initialize the vlan structure, to fix the bug, due to the
coincidence that a test for vlan.sid == 0 already exists and leads to
the same (correct) behavior.
Another option which would be sufficient would be to add a test for
mv88e6xxx_has_stu() inside mv88e6xxx_mst_put(), symmetric to the one
which already exists in mv88e6xxx_mst_get(). But that placement means
the caller will have to dereference vlan.sid, which means it will access
uninitialized memory, which is not nice even if it ignores it later.
So we end up making both modifications, in order to not rely just on the
sid == 0 coincidence, but also to avoid having uninitialized structure
fields which might get temporarily accessed.
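Both modifications can be modelled in a few lines of userspace C. The structure layout and all function names are hypothetical; the -ENOENT return simply stands in for a failed lookup in an empty MST list.

```c
#include <errno.h>

struct vtu_entry_sketch {
	unsigned short vid;
	unsigned char sid;
	int valid;
};

/* Stand-in for a getnext implementation that never writes .sid. */
void getnext_without_sid(struct vtu_entry_sketch *e, unsigned short vid)
{
	e->vid = vid;
	e->valid = 1;
}

/* Guarded both by the STU capability check and by the sid == 0 test. */
int mst_put_sketch(int has_stu, unsigned char sid)
{
	if (!has_stu)
		return 0;
	if (!sid)
		return 0;

	return -ENOENT;	/* models "no MST entry found for this SID" */
}

int vlan_leave_sketch(int has_stu, unsigned short vid)
{
	/* Zero-initialize so that fields getnext skips read as 0, not stack garbage. */
	struct vtu_entry_sketch vlan = { 0 };

	getnext_without_sid(&vlan, vid);
	return mst_put_sketch(has_stu, vlan.sid);
}
```

Either guard alone would avoid the error in this model; having both means the result does not depend on the sid == 0 coincidence, and no uninitialized field is ever read.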
Fixes: acaf4d2e36 ("net: dsa: mv88e6xxx: MST Offloading")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://patch.msgid.link/20250414212913.2955253-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Russell King reports that a system with mv88e6xxx dereferences a NULL
pointer when unbinding this driver:
https://lore.kernel.org/netdev/Z_lRkMlTJ1KQ0kVX@shell.armlinux.org.uk/
The crash seems to be in devlink_region_destroy(), which is not NULL
tolerant but is given a NULL devlink global region pointer.
At least on some chips, some devlink regions are conditionally registered
since the blamed commit, see mv88e6xxx_setup_devlink_regions_global():
if (cond && !cond(chip))
continue;
These are MV88E6XXX_REGION_STU and MV88E6XXX_REGION_PVT. If the chip
does not have an STU or PVT, it should crash like this.
To fix the issue, avoid unregistering those regions which are NULL, i.e.
were skipped at mv88e6xxx_setup_devlink_regions_global() time.
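The shape of the fix, reduced to a userspace sketch with hypothetical names: setup leaves optional slots NULL when the hardware feature is absent, and teardown skips those slots. Note that free() tolerates NULL while devlink_region_destroy() does not, so the check here only models the kernel-side requirement.

```c
#include <stddef.h>
#include <stdlib.h>

#define NUM_REGIONS 3

struct region_sketch {
	int id;
};

struct chip_sketch {
	int has_optional_hw;	/* models "chip has an STU / PVT" */
	struct region_sketch *regions[NUM_REGIONS];
};

int setup_regions(struct chip_sketch *chip)
{
	int i;

	for (i = 0; i < NUM_REGIONS; i++) {
		/* Regions 1 and 2 are conditional; skipped slots stay NULL. */
		if (i > 0 && !chip->has_optional_hw)
			continue;

		chip->regions[i] = malloc(sizeof(*chip->regions[i]));
		if (!chip->regions[i])
			return -1;
		chip->regions[i]->id = i;
	}

	return 0;
}

/* Returns the number of regions actually destroyed. */
int teardown_regions(struct chip_sketch *chip)
{
	int i, destroyed = 0;

	for (i = 0; i < NUM_REGIONS; i++) {
		if (!chip->regions[i])
			continue;	/* never registered, nothing to destroy */

		free(chip->regions[i]);
		chip->regions[i] = NULL;
		destroyed++;
	}

	return destroyed;
}
```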
Fixes: 836021a2d0 ("net: dsa: mv88e6xxx: Export cross-chip PVT as devlink region")
Tested-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://patch.msgid.link/20250414212850.2953957-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>