Previously, the driver would replenish the rx pool if the polling
function consumed less than the budget. The logic being that the driver
did not exhaust its budget so that must mean that the driver is not busy
and has cycles to spare for replenishing the pool.
So pool replenishment happens on every poll which did not consume
the budget. This can very costly during request-response tests.
In fact, an extra ~100pps can be seen in TCP_RR_150 tests when we remove
this conditional. Trace results (ftrace, graph-time=1) for the poll
function are below:
Previous results:
ibmvnic_poll = 64951846.0 us / 4167628.0 hits = AVG 15.58
replenish_rx_pool = 17602846.0 us / 4710437.0 hits = AVG 3.74
Now:
ibmvnic_poll = 57673941.0 us / 4791737.0 hits = AVG 12.04
replenish_rx_pool = 3938171.6 us / 4314.0 hits = AVG 912.88
While the replenish function takes longer, it is hit less frequently
meaning the ibmvnic_poll function, on average, is faster.
Furthermore, this change does not have a negative effect on
performance bandwidth/latency measurements.
Signed-off-by: Nick Child <nnac123@linux.ibm.com>
Link: https://patch.msgid.link/20240807211809.1259563-2-nnac123@linux.ibm.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Building fs_enet on powerpc e500 leads to following warning:
CC drivers/net/ethernet/freescale/fs_enet/mac-scc.o
In file included from ./include/linux/build_bug.h:5,
from ./include/linux/container_of.h:5,
from ./include/linux/list.h:5,
from ./include/linux/module.h:12,
from drivers/net/ethernet/freescale/fs_enet/mac-scc.c:15:
drivers/net/ethernet/freescale/fs_enet/mac-scc.c: In function 'allocate_bd':
./include/linux/err.h:28:49: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
28 | #define IS_ERR_VALUE(x) unlikely((unsigned long)(void *)(x) >= (unsigned long)-MAX_ERRNO)
| ^
./include/linux/compiler.h:77:45: note: in definition of macro 'unlikely'
77 | # define unlikely(x) __builtin_expect(!!(x), 0)
| ^
drivers/net/ethernet/freescale/fs_enet/mac-scc.c:138:13: note: in expansion of macro 'IS_ERR_VALUE'
138 | if (IS_ERR_VALUE(fep->ring_mem_addr))
| ^~~~~~~~~~~~
This is due to fep->ring_mem_addr not being a pointer but a DMA
address which is 64 bits on that platform while pointers are
32 bits as this is a 32 bits platform with wider physical bus.
However, using fep->ring_mem_addr is just wrong because
cpm_muram_alloc() returns an offset within the muram and not
a physical address directly. So use fpi->dpram_offset instead.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/ec67ea3a3bef7e58b8dc959f7c17d405af0d27e4.1723101144.git.christophe.leroy@csgroup.eu
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The usbnet_link_change function is not called, if the link has not changed.
...
[16913.807393][ 3] cdc_ether 1-2:2.0 enx00e0995fd1ac: kevent 12 may have been dropped
[16913.822266][ 2] cdc_ether 1-2:2.0 enx00e0995fd1ac: kevent 12 may have been dropped
[16913.826296][ 2] cdc_ether 1-2:2.0 enx00e0995fd1ac: kevent 11 may have been dropped
...
kevent 11 is scheduled too frequently and may affect other event schedules.
Signed-off-by: zhangxiangqian <zhangxiangqian@kylinos.cn>
Acked-by: Oliver Neukum <oneukum@suse.com>
Link: https://patch.msgid.link/1723109985-11996-1-git-send-email-zhangxiangqian@kylinos.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently ethtool_set_channel calls separate functions to check whether
the new channel number violates rss configuration or flow steering
configuration.
Very soon we need to check whether the new channel number violates
memory provider configuration as well.
To do all 3 checks cleanly, add a wrapper around
ethtool_get_max_rxnfc_channel() and ethtool_get_max_rxfh_channel(),
which does both checks. We can later extend this wrapper to add the
memory provider check in one place.
Note that in the current code, we put a descriptive genl error message
when we run into issues. To preserve the error message, we pass the
genl_info* to the common helper. The ioctl calls can pass NULL instead.
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Mina Almasry <almasrymina@google.com>
Link: https://patch.msgid.link/20240808205345.2141858-1-almasrymina@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Allison Henderson says:
====================
selftests: rds selftest
This series is a new selftest that Vegard, Chuck and myself have been
working on to provide some test coverage for rds. I've modified the
scripts to include the feedback from the last version, but let me know
if there's anything missed. Questions and comments appreciated.
Thanks everyone!
Allison
Changes in v2:
- Removed qemu vm creation and related code
- Updated README.txt with examples of running the test with virtme
- Removed init.sh. run.sh now directly calls test.py
- Some clean up done with the return code handling since there is no
vm between the scripts anymore
- Imported ip python function in
tools/testing/selftests/net/lib/py/utils.py into test.py
- Adapted test.py to use the imported ip function, and removed the
local ip wrapper
- Some line wrap clean up
- Link to v1:
https://lore.kernel.org/netdev/20240626012834.5678-3-allison.henderson@oracle.com/T
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
To better our unit tests we need code coverage to be part of the kernel.
This patch borrows heavily from how CONFIG_GCOV_PROFILE_FTRACE is
implemented
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Slaby advises me that the preferred mechanism for declaring
string constants is static char arrays, so use that here.
This mostly reverts
commit 1692b9775e ("net: stmmac: xgmac: use #define for string constants")
That commit was a fix for
commit 46eba193d0 ("net: stmmac: xgmac: fix handling of DPP safety error for DMA channels").
The fix being replacing const char * with #defines in order to address
compilation failures observed on GCC 6 through 10.
Compile tested only.
No functional change intended.
Suggested-by: Jiri Slaby <jirislaby@kernel.org>
Link: https://lore.kernel.org/netdev/485dbc5a-a04b-40c2-9481-955eaa5ce2e2@kernel.org/
Signed-off-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
'struct mii_phy_def' are not modified in this driver.
Constifying these structures moves some data to a read-only section, so
increase overall security.
While at it fix the checkpatch warning related to this patch (some missing
newlines and spaces around *)
On a x86_64, with allmodconfig:
Before:
======
27709 928 0 28637 6fdd drivers/net/sungem_phy.o
After:
=====
text data bss dec hex filename
28157 476 0 28633 6fd9 drivers/net/sungem_phy.o
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/54c3b30930f80f4895e6fa2f4234714fdea4ef4e.1723033266.git.christophe.jaillet@wanadoo.fr
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Provide declaration of dmae_reg_go_c in header.
This symbol is defined in bnx2x_main.c.
And used in that file and bnx2x_stats.c.
However, Sparse complains that there is no declaration
of the symbol in dmae_reg_go_c nor is the symbol static.
.../bnx2x_main.c:291:11: warning: symbol 'dmae_reg_go_c' was not declared. Should it be static?
Address this by moving the declaration from bnx2x_stats.c to bnx2x_reg.h.
No functional change intended.
Compile tested only.
Signed-off-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20240806-bnx2x-dec-v1-1-ae844ec785e4@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pull networking fixes from Jakub Kicinski:
"Including fixes from bluetooth.
Current release - regressions:
- eth: bnxt_en: fix memory out-of-bounds in bnxt_fill_hw_rss_tbl() on
older chips
Current release - new code bugs:
- ethtool: fix off-by-one error / kdoc contradicting the code for max
RSS context IDs
- Bluetooth: hci_qca:
- QCA6390: fix support on non-DT platforms
- QCA6390: don't call pwrseq_power_off() twice
- fix a NULL-pointer derefence at shutdown
- eth: ice: fix incorrect assigns of FEC counters
Previous releases - regressions:
- mptcp: fix handling endpoints with both 'signal' and 'subflow'
flags set
- virtio-net: fix changing ring count when vq IRQ coalescing not
supported
- eth: gve: fix use of netif_carrier_ok() during reconfig / reset
Previous releases - always broken:
- eth: idpf: fix bugs in queue re-allocation on reconfig / reset
- ethtool: fix context creation with no parameters
Misc:
- linkwatch: use system_unbound_wq to ease RTNL contention"
* tag 'net-6.11-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (41 commits)
net: dsa: microchip: disable EEE for KSZ8567/KSZ9567/KSZ9896/KSZ9897.
ethtool: Fix context creation with no parameters
net: ethtool: fix off-by-one error in max RSS context IDs
net: pse-pd: tps23881: include missing bitfield.h header
net: fec: Stop PPS on driver remove
net: bcmgenet: Properly overlay PHY and MAC Wake-on-LAN capabilities
l2tp: fix lockdep splat
net: stmmac: dwmac4: fix PCS duplex mode decode
idpf: fix UAFs when destroying the queues
idpf: fix memleak in vport interrupt configuration
idpf: fix memory leaks and crashes while performing a soft reset
bnxt_en : Fix memory out-of-bounds in bnxt_fill_hw_rss_tbl()
net: dsa: bcm_sf2: Fix a possible memory leak in bcm_sf2_mdio_register()
net/smc: add the max value of fallback reason count
Bluetooth: hci_sync: avoid dup filtering when passive scanning with adv monitor
Bluetooth: l2cap: always unlock channel in l2cap_conless_channel()
Bluetooth: hci_qca: fix a NULL-pointer derefence at shutdown
Bluetooth: hci_qca: fix QCA6390 support on non-DT platforms
Bluetooth: hci_qca: don't call pwrseq_power_off() twice for QCA6390
ice: Fix incorrect assigns of FEC counts
...
Pull tracing fixes from Steven Rostedt:
- Have reading of event format files test if the metadata still exists.
When a event is freed, a flag (EVENT_FILE_FL_FREED) in the metadata
is set to state that it is to prevent any new references to it from
happening while waiting for existing references to close. When the
last reference closes, the metadata is freed. But the "format" was
missing a check to this flag (along with some other files) that
allowed new references to happen, and a use-after-free bug to occur.
- Have the trace event meta data use the refcount infrastructure
instead of relying on its own atomic counters.
- Have tracefs inodes use alloc_inode_sb() for allocation instead of
using kmem_cache_alloc() directly.
- Have eventfs_create_dir() return an ERR_PTR instead of NULL as the
callers expect a real object or an ERR_PTR.
- Have release_ei() use call_srcu() and not call_rcu() as all the
protection is on SRCU and not RCU.
- Fix ftrace_graph_ret_addr() to use the task passed in and not
current.
- Fix overflow bug in get_free_elt() where the counter can overflow the
integer and cause an infinite loop.
- Remove unused function ring_buffer_nr_pages()
- Have tracefs freeing use the inode RCU infrastructure instead of
creating its own.
When the kernel had randomize structure fields enabled, the rcu field
of the tracefs_inode was overlapping the rcu field of the inode
structure, and corrupting it. Instead, use the destroy_inode()
callback to do the initial cleanup of the code, and then have
free_inode() free it.
* tag 'trace-v6.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracefs: Use generic inode RCU for synchronizing freeing
ring-buffer: Remove unused function ring_buffer_nr_pages()
tracing: Fix overflow in get_free_elt()
function_graph: Fix the ret_stack used by ftrace_graph_ret_addr()
eventfs: Use SRCU for freeing eventfs_inodes
eventfs: Don't return NULL in eventfs_create_dir()
tracefs: Fix inode allocation
tracing: Use refcount for trace_event_file reference counter
tracing: Have format file honor EVENT_FILE_FL_FREED
Pull bcachefs fixes from Kent Overstreet:
"Assorted little stuff:
- lockdep fixup for lockdep_set_notrack_class()
- we can now remove a device when using erasure coding without
deadlocking, though we still hit other issues
- the 'allocator stuck' timeout is now configurable, and messages are
ratelimited. The default timeout has been increased from 10 seconds
to 30"
* tag 'bcachefs-2024-08-08' of git://evilpiepirate.org/bcachefs:
bcachefs: Use bch2_wait_on_allocator() in btree node alloc path
bcachefs: Make allocator stuck timeout configurable, ratelimit messages
bcachefs: Add missing path_traverse() to btree_iter_next_node()
bcachefs: ec should not allocate from ro devs
bcachefs: Improved allocator debugging for ec
bcachefs: Add missing bch2_trans_begin() call
bcachefs: Add a comment for bucket helper types
bcachefs: Don't rely on implicit unsigned -> signed integer conversion
lockdep: Fix lockdep_set_notrack_class() for CONFIG_LOCK_STAT
bcachefs: Fix double free of ca->buckets_nouse
Russell King reported that the arm cbc(aes) crypto module hangs when
loaded, and Herbert Xu bisected it to commit 9b9879fc03 ("modules:
catch concurrent module loads, treat them as idempotent"), and noted:
"So what's happening here is that the first modprobe tries to load a
fallback CBC implementation, in doing so it triggers a load of the
exact same module due to module aliases.
IOW we're loading aes-arm-bs which provides cbc(aes). However, this
needs a fallback of cbc(aes) to operate, which is made out of the
generic cbc module + any implementation of aes, or ecb(aes). The
latter happens to also be provided by aes-arm-cb so that's why it
tries to load the same module again"
So loading the aes-arm-bs module ends up wanting to recursively load
itself, and the recursive load then ends up waiting for the original
module load to complete.
This is a regression, in that it used to be that we just tried to load
the module multiple times, and then as we went on to install it the
second time we would instead just error out because the module name
already existed.
That is actually also exactly what the original "catch concurrent loads"
patch did in commit 9828ed3f69 ("module: error out early on concurrent
load of the same module file"), but it turns out that it ends up being
racy, in that erroring out before the module has been fully initialized
will cause failures in dependent module loading.
See commit ac2263b588 (which was the revert of that "error out early")
commit for details about why erroring out before the module has been
initialized is actually fundamentally racy.
Now, for the actual recursive module load (as opposed to just
concurrently loading the same module twice), the race is not an issue.
At the same time it's hard for the kernel to see that this is recursion,
because the module load is always done from a usermode helper, so the
recursion is not some simple callchain within the kernel.
End result: this is not the real fix, but this at least adds a warning
for the situation (admittedly much too late for all the debugging pain
that Russell and Herbert went through) and if we can come to a
resolution on how to detect the recursion properly, this re-organizes
the code to make that easier.
Link: https://lore.kernel.org/all/ZrFHLqvFqhzykuYw@shell.armlinux.org.uk/
Reported-by: Russell King <linux@armlinux.org.uk>
Debugged-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull LoongArch fixes from Huacai Chen:
"Enable general EFI poweroff method to make poweroff usable on
hardwares which lack ACPI S5, use accessors to page table entries
instead of direct dereference to avoid potential problems, and two
trivial kvm cleanups"
* tag 'loongarch-fixes-6.11-1' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson:
LoongArch: KVM: Remove undefined a6 argument comment for kvm_hypercall()
LoongArch: KVM: Remove unnecessary definition of KVM_PRIVATE_MEM_SLOTS
LoongArch: Use accessors to page table entries instead of direct dereference
LoongArch: Enable general EFI poweroff method
Tony Nguyen says:
====================
Intel Wired LAN Driver Updates 2024-08-07 (ice)
This series contains updates to ice driver only.
Grzegorz adds IRQ synchronization call before performing reset and
prevents writing to hardware when it is resetting.
Mateusz swaps incorrect assignment of FEC statistics.
* '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue:
ice: Fix incorrect assigns of FEC counts
ice: Skip PTP HW writes during PTP reset procedure
ice: Fix reset handler
====================
Link: https://patch.msgid.link/20240807224521.3819189-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Both ethtool_ops.rxfh_max_context_id and the default value used when
it's not specified are supposed to be exclusive maxima (the former
is documented as such; the latter, U32_MAX, cannot be used as an ID
since it equals ETH_RXFH_CONTEXT_ALLOC), but xa_alloc() expects an
inclusive maximum.
Subtract one from 'limit' to produce an inclusive maximum, and pass
that to xa_alloc().
Increase bnxt's max by one to prevent a (very minor) regression, as
BNXT_MAX_ETH_RSS_CTX is an inclusive max. This is safe since bnxt
is not actually hard-limited; BNXT_MAX_ETH_RSS_CTX is just a
leftover from old driver code that managed context IDs itself.
Rename rxfh_max_context_id to rxfh_max_num_contexts to make its
semantics (hopefully) more obvious.
Fixes: 847a8ab186 ("net: ethtool: let the core choose RSS context IDs")
Signed-off-by: Edward Cree <ecree.xilinx@gmail.com>
Link: https://patch.msgid.link/5a2d11a599aa5b0cc6141072c01accfb7758650c.1723045898.git.ecree.xilinx@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Using FIELD_GET() fails in configurations that don't already include
the header file indirectly:
drivers/net/pse-pd/tps23881.c: In function 'tps23881_i2c_probe':
drivers/net/pse-pd/tps23881.c:755:13: error: implicit declaration of function 'FIELD_GET' [-Wimplicit-function-declaration]
755 | if (FIELD_GET(TPS23881_REG_DEVID_MASK, ret) != TPS23881_DEVICE_ID) {
| ^~~~~~~~~
Fixes: 89108cb5c2 ("net: pse-pd: tps23881: Fix the device ID check")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Oleksij Rempel <o.rempel@pengutronix.de>
Link: https://patch.msgid.link/20240807075455.2055224-1-arnd@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pull misc fixes from Andrew Morton:
"Nine hotfixes. Five are cc:stable, the others either pertain to
post-6.10 material or aren't considered necessary for earlier kernels.
Five are MM and four are non-MM. No identifiable theme here - please
see the individual changelogs"
* tag 'mm-hotfixes-stable-2024-08-07-18-32' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
padata: Fix possible divide-by-0 panic in padata_mt_helper()
mailmap: update entry for David Heidelberg
memcg: protect concurrent access to mem_cgroup_idr
mm: shmem: fix incorrect aligned index when checking conflicts
mm: shmem: avoid allocating huge pages larger than MAX_PAGECACHE_ORDER for shmem
mm: list_lru: fix UAF for memory cgroup
kcov: properly check for softirq context
MAINTAINERS: Update LTP members and web
selftests: mm: add s390 to ARCH check
Luiz Augusto von Dentz says:
====================
bluetooth pull request for net:
- hci_sync: avoid dup filtering when passive scanning with adv monitor
- hci_qca: don't call pwrseq_power_off() twice for QCA6390
- hci_qca: fix QCA6390 support on non-DT platforms
- hci_qca: fix a NULL-pointer derefence at shutdown
- l2cap: always unlock channel in l2cap_conless_channel()
* tag 'for-net-2024-08-07' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth:
Bluetooth: hci_sync: avoid dup filtering when passive scanning with adv monitor
Bluetooth: l2cap: always unlock channel in l2cap_conless_channel()
Bluetooth: hci_qca: fix a NULL-pointer derefence at shutdown
Bluetooth: hci_qca: fix QCA6390 support on non-DT platforms
Bluetooth: hci_qca: don't call pwrseq_power_off() twice for QCA6390
====================
Link: https://patch.msgid.link/20240807210103.142483-1-luiz.dentz@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Tony Nguyen says:
====================
idpf: fix 3 bugs revealed by the Chapter I
Alexander Lobakin says:
The libeth conversion revealed 2 serious issues which lead to sporadic
crashes or WARNs under certain configurations. Additional one was found
while debugging these two with kmemleak.
This one is targeted stable, the rest can be backported manually later
if needed. They can be reproduced only after the conversion is applied
anyway.
====================
Link: https://patch.msgid.link/20240806220923.3359860-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The second tagged commit started sometimes (very rarely, but possible)
throwing WARNs from
net/core/page_pool.c:page_pool_disable_direct_recycling().
Turned out idpf frees interrupt vectors with embedded NAPIs *before*
freeing the queues making page_pools' NAPI pointers lead to freed
memory before these pools are destroyed by libeth.
It's not clear whether there are other accesses to the freed vectors
when destroying the queues, but anyway, we usually free queue/interrupt
vectors only when the queues are destroyed and the NAPIs are guaranteed
to not be referenced anywhere.
Invert the allocation and freeing logic making queue/interrupt vectors
be allocated first and freed last. Vectors don't require queues to be
present, so this is safe. Additionally, this change allows to remove
that useless queue->q_vector pointer cleanup, as vectors are still
valid when freeing the queues (+ both are freed within one function,
so it's not clear why nullify the pointers at all).
Fixes: 1c325aac10 ("idpf: configure resources for TX queues")
Fixes: 90912f9f4f ("idpf: convert header split mode to libeth + napi_build_skb()")
Reported-by: Michal Kubiak <michal.kubiak@intel.com>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Krishneil Singh <krishneil.k.singh@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Link: https://patch.msgid.link/20240806220923.3359860-4-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The initialization of vport interrupt consists of two functions:
1) idpf_vport_intr_init() where a generic configuration is done
2) idpf_vport_intr_req_irq() where the irq for each q_vector is
requested.
The first function used to create a base name for each interrupt using
"kasprintf()" call. Unfortunately, although that call allocated memory
for a text buffer, that memory was never released.
Fix this by removing creating the interrupt base name in 1).
Instead, always create a full interrupt name in the function 2), because
there is no need to create a base name separately, considering that the
function 2) is never called out of idpf_vport_intr_init() context.
Fixes: d4d5587182 ("idpf: initialize interrupts and enable vport")
Cc: stable@vger.kernel.org # 6.7
Signed-off-by: Michal Kubiak <michal.kubiak@intel.com>
Reviewed-by: Pavan Kumar Linga <pavan.kumar.linga@intel.com>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Krishneil Singh <krishneil.k.singh@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Link: https://patch.msgid.link/20240806220923.3359860-3-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The second tagged commit introduced a UAF, as it removed restoring
q_vector->vport pointers after reinitializating the structures.
This is due to that all queue allocation functions are performed here
with the new temporary vport structure and those functions rewrite
the backpointers to the vport. Then, this new struct is freed and
the pointers start leading to nowhere.
But generally speaking, the current logic is very fragile. It claims
to be more reliable when the system is low on memory, but in fact, it
consumes two times more memory as at the moment of running this
function, there are two vports allocated with their queues and vectors.
Moreover, it claims to prevent the driver from running into "bad state",
but in fact, any error during the rebuild leaves the old vport in the
partially allocated state.
Finally, if the interface is down when the function is called, it always
allocates a new queue set, but when the user decides to enable the
interface later on, vport_open() allocates them once again, IOW there's
a clear memory leak here.
Just don't allocate a new queue set when performing a reset, that solves
crashes and memory leaks. Readd the old queue number and reopen the
interface on rollback - that solves limbo states when the device is left
disabled and/or without HW queues enabled.
Fixes: 02cbfba1ad ("idpf: add ethtool callbacks")
Fixes: e4891e4687 ("idpf: split &idpf_queue into 4 strictly-typed queue structures")
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Krishneil Singh <krishneil.k.singh@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Link: https://patch.msgid.link/20240806220923.3359860-2-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Increase size of queue_name buffer from 30 to 31 to accommodate
the largest string written to it. This avoids truncation in
the possibly unlikely case where the string is name is the
maximum size.
Flagged by gcc-14:
.../mvpp2_main.c: In function 'mvpp2_probe':
.../mvpp2_main.c:7636:32: warning: 'snprintf' output may be truncated before the last format character [-Wformat-truncation=]
7636 | "stats-wq-%s%s", netdev_name(priv->port_list[0]->dev),
| ^
.../mvpp2_main.c:7635:9: note: 'snprintf' output between 10 and 31 bytes into a destination of size 30
7635 | snprintf(priv->queue_name, sizeof(priv->queue_name),
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
7636 | "stats-wq-%s%s", netdev_name(priv->port_list[0]->dev),
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
7637 | priv->port_count > 1 ? "+" : "");
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Introduced by commit 118d6298f6 ("net: mvpp2: add ethtool GOP statistics").
I am not flagging this as a bug as I am not aware that it is one.
Compile tested only.
Signed-off-by: Simon Horman <horms@kernel.org>
Reviewed-by: Marcin Wojtas <marcin.s.wojtas@gmail.com>
Link: https://patch.msgid.link/20240806-mvpp2-namelen-v1-1-6dc773653f2f@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add netkit support to rt_link.yaml. Only forward(PASS) and
blackhole(DROP) policies are allowed to be set by user-space so I've
added only them to the yaml to avoid confusion.
Example:
$ ./tools/net/ynl/cli.py \
--spec Documentation/netlink/specs/rt_link.yaml \
--do getlink --json '{"ifname": "netkit0"}' --output-json | jq
...
"linkinfo": {
"kind": "netkit",
"data": {
"primary": 1,
"policy": "blackhole",
"mode": "l2",
"peer-policy": "forward"
}
},
...
Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20240806104531.3296718-1-razor@blackwall.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Recently I noticed that both gcc-14 and clang-18 report that passing
a non-string literal as the format argument of alloc_ordered_workqueue
is potentially insecure.
F.e. clang-18 says:
.../bond_main.c:6384:37: warning: format string is not a string literal (potentially insecure) [-Wformat-security]
6384 | bond->wq = alloc_ordered_workqueue(bond_dev->name, WQ_MEM_RECLAIM);
| ^~~~~~~~~~~~~~
.../workqueue.h:524:18: note: expanded from macro 'alloc_ordered_workqueue'
524 | alloc_workqueue(fmt, WQ_UNBOUND | __WQ_ORDERED | (flags), 1, ##args)
| ^~~
.../bond_main.c:6384:37: note: treat the string as an argument to avoid this
6384 | bond->wq = alloc_ordered_workqueue(bond_dev->name, WQ_MEM_RECLAIM);
| ^
| "%s",
..../workqueue.h:524:18: note: expanded from macro 'alloc_ordered_workqueue'
524 | alloc_workqueue(fmt, WQ_UNBOUND | __WQ_ORDERED | (flags), 1, ##args)
| ^
Perhaps it is always the case where the contents of bond_dev->name is
safe to pass as the format argument. That is, in my understanding, it
never contains any format escape sequences.
But, it seems better to be safe than sorry. And, as a bonus, compiler
output becomes less verbose by addressing this issue as suggested by
clang-18.
Signed-off-by: Simon Horman <horms@kernel.org>
Acked-by: Jay Vosburgh <jv@jvosburgh.net>
Link: https://patch.msgid.link/20240806-bonding-fmt-v1-1-e75027e45775@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
A recent commit has modified the code in __bnxt_reserve_rings() to
set the default RSS indirection table to default only when the number
of RX rings is changing. While this works for newer firmware that
requires RX ring reservations, it causes the regression on older
firmware not requiring RX ring resrvations (BNXT_NEW_RM() returns
false).
With older firmware, RX ring reservations are not required and so
hw_resc->resv_rx_rings is not always set to the proper value. The
comparison:
if (old_rx_rings != bp->hw_resc.resv_rx_rings)
in __bnxt_reserve_rings() may be false even when the RX rings are
changing. This will cause __bnxt_reserve_rings() to skip setting
the default RSS indirection table to default to match the current
number of RX rings. This may later cause bnxt_fill_hw_rss_tbl() to
use an out-of-range index.
We already have bnxt_check_rss_tbl_no_rmgr() to handle exactly this
scenario. We just need to move it up in bnxt_need_reserve_rings()
to be called unconditionally when using older firmware. Without the
fix, if the TX rings are changing, we'll skip the
bnxt_check_rss_tbl_no_rmgr() call and __bnxt_reserve_rings() may also
skip the bnxt_set_dflt_rss_indir_tbl() call for the reason explained
in the last paragraph. Without setting the default RSS indirection
table to default, it causes the regression:
BUG: KASAN: slab-out-of-bounds in __bnxt_hwrm_vnic_set_rss+0xb79/0xe40
Read of size 2 at addr ffff8881c5809618 by task ethtool/31525
Call Trace:
__bnxt_hwrm_vnic_set_rss+0xb79/0xe40
bnxt_hwrm_vnic_rss_cfg_p5+0xf7/0x460
__bnxt_setup_vnic_p5+0x12e/0x270
__bnxt_open_nic+0x2262/0x2f30
bnxt_open_nic+0x5d/0xf0
ethnl_set_channels+0x5d4/0xb30
ethnl_default_set_doit+0x2f1/0x620
Reported-by: Breno Leitao <leitao@debian.org>
Closes: https://lore.kernel.org/netdev/ZrC6jpghA3PWVWSB@gmail.com/
Fixes: 98ba1d931f ("bnxt_en: Fix RSS logic in __bnxt_reserve_rings()")
Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Reviewed-by: Somnath Kotur <somnath.kotur@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Tested-by: Breno Leitao <leitao@debian.org>
Link: https://patch.msgid.link/20240806053742.140304-1-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Replace SET_RUNTIME_PM_OPS()/SET SYSTEM_SLEEP_PM_OPS() with their modern
RUNTIME_PM_OPS() and SYSTEM_SLEEP_PM_OPS() alternatives.
The combined usage of pm_ptr() and RUNTIME_PM_OPS/SYSTEM_SLEEP_PM_OPS()
allows the compiler to evaluate if the runtime suspend/resume() functions
are used at build time or are simply dead code.
This allows removing the __maybe_unused notation from the runtime
suspend/resume() functions.
Signed-off-by: Fabio Estevam <festevam@denx.de>
Reviewed-by: Peng Fan <peng.fan@nxp.com>
Link: https://patch.msgid.link/20240806021628.2524089-1-festevam@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
bcm_sf2_mdio_register() calls of_phy_find_device() and then
phy_device_remove() in a loop to remove existing PHY devices.
of_phy_find_device() eventually calls bus_find_device(), which calls
get_device() on the returned struct device * to increment the refcount.
The current implementation does not decrement the refcount, which causes
memory leak.
This commit adds the missing phy_device_free() call to decrement the
refcount via put_device() to balance the refcount.
Fixes: 771089c2a4 ("net: dsa: bcm_sf2: Ensure that MDIO diversion is used")
Signed-off-by: Joe Hattori <joe@pf.is.s.u-tokyo.ac.jp>
Tested-by: Florian Fainelli <florian.fainelli@broadcom.com>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Link: https://patch.msgid.link/20240806011327.3817861-1-joe@pf.is.s.u-tokyo.ac.jp
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Use the `DEFINE_RAW_FLEX()` helper for an on-stack definition of
a flexible structure where the size of the flexible-array member
is known at compile-time, and refactor the rest of the code,
accordingly.
So, with these changes, fix the following warning:
drivers/net/ethernet/fungible/funcore/fun_dev.c:550:43: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end]
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Link: https://patch.msgid.link/ZrDwEugW7DR/FlP5@cute
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We are hit with a not easily reproducible divide-by-0 panic in padata.c at
bootup time.
[ 10.017908] Oops: divide error: 0000 1 PREEMPT SMP NOPTI
[ 10.017908] CPU: 26 PID: 2627 Comm: kworker/u1666:1 Not tainted 6.10.0-15.el10.x86_64 #1
[ 10.017908] Hardware name: Lenovo ThinkSystem SR950 [7X12CTO1WW]/[7X12CTO1WW], BIOS [PSE140J-2.30] 07/20/2021
[ 10.017908] Workqueue: events_unbound padata_mt_helper
[ 10.017908] RIP: 0010:padata_mt_helper+0x39/0xb0
:
[ 10.017963] Call Trace:
[ 10.017968] <TASK>
[ 10.018004] ? padata_mt_helper+0x39/0xb0
[ 10.018084] process_one_work+0x174/0x330
[ 10.018093] worker_thread+0x266/0x3a0
[ 10.018111] kthread+0xcf/0x100
[ 10.018124] ret_from_fork+0x31/0x50
[ 10.018138] ret_from_fork_asm+0x1a/0x30
[ 10.018147] </TASK>
Looking at the padata_mt_helper() function, the only way a divide-by-0
panic can happen is when ps->chunk_size is 0. The way that chunk_size is
initialized in padata_do_multithreaded(), chunk_size can be 0 when the
min_chunk in the passed-in padata_mt_job structure is 0.
Fix this divide-by-0 panic by making sure that chunk_size will be at least
1 no matter what the input parameters are.
Link: https://lkml.kernel.org/r/20240806174647.1050398-1-longman@redhat.com
Fixes: 004ed42638 ("padata: add basic support for multithreaded jobs")
Signed-off-by: Waiman Long <longman@redhat.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Waiman Long <longman@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Commit 73f576c04b ("mm: memcontrol: fix cgroup creation failure after
many small jobs") decoupled the memcg IDs from the CSS ID space to fix the
cgroup creation failures. It introduced IDR to maintain the memcg ID
space. The IDR depends on external synchronization mechanisms for
modifications. For the mem_cgroup_idr, the idr_alloc() and idr_replace()
happen within css callback and thus are protected through cgroup_mutex
from concurrent modifications. However idr_remove() for mem_cgroup_idr
was not protected against concurrency and can be run concurrently for
different memcgs when they hit their refcnt to zero. Fix that.
We have been seeing list_lru based kernel crashes at a low frequency in
our fleet for a long time. These crashes were in different part of
list_lru code including list_lru_add(), list_lru_del() and reparenting
code. Upon further inspection, it looked like for a given object (dentry
and inode), the super_block's list_lru didn't have list_lru_one for the
memcg of that object. The initial suspicions were either the object is
not allocated through kmem_cache_alloc_lru() or somehow
memcg_list_lru_alloc() failed to allocate list_lru_one() for a memcg but
returned success. No evidence were found for these cases.
Looking more deeply, we started seeing situations where valid memcg's id
is not present in mem_cgroup_idr and in some cases multiple valid memcgs
have same id and mem_cgroup_idr is pointing to one of them. So, the most
reasonable explanation is that these situations can happen due to race
between multiple idr_remove() calls or race between
idr_alloc()/idr_replace() and idr_remove(). These races are causing
multiple memcgs to acquire the same ID and then offlining of one of them
would cleanup list_lrus on the system for all of them. Later access from
other memcgs to the list_lru cause crashes due to missing list_lru_one.
Link: https://lkml.kernel.org/r/20240802235822.1830976-1-shakeel.butt@linux.dev
Fixes: 73f576c04b ("mm: memcontrol: fix cgroup creation failure after many small jobs")
Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: Muchun Song <muchun.song@linux.dev>
Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>