Most function arguments that are passed in as unsigned int or unsigned
long are better displayed as hexadecimal than normal integer. For example,
the functions:
static void __create_object(unsigned long ptr, size_t size,
int min_count, gfp_t gfp, unsigned int objflags);
static bool stack_access_ok(struct unwind_state *state, unsigned long _addr,
size_t len);
void __local_bh_disable_ip(unsigned long ip, unsigned int cnt);
Show up in the trace as:
__create_object(ptr=-131387050520576, size=4096, min_count=1, gfp=3264, objflags=0) <-kmem_cache_alloc_noprof
stack_access_ok(state=0xffffc9000233fc98, _addr=-60473102566256, len=8) <-unwind_next_frame
__local_bh_disable_ip(ip=-2127311112, cnt=256) <-handle_softirqs
Instead, by displaying unsigned as hexadecimal, they look more like this:
__create_object(ptr=0xffff8881028d2080, size=0x280, min_count=1, gfp=0x82820, objflags=0x0) <-kmem_cache_alloc_node_noprof
stack_access_ok(state=0xffffc90000003938, _addr=0xffffc90000003930, len=0x8) <-unwind_next_frame
__local_bh_disable_ip(ip=0xffffffff8133cef8, cnt=0x100) <-handle_softirqs
Which is much easier to understand as most unsigned longs are usually just
pointers. Even the "unsigned int cnt" in __local_bh_disable_ip() looks
better as hexadecimal as a lot of flags are passed as unsigned.
Changes since v2: https://lore.kernel.org/20250801111453.01502861@gandalf.local.home
- Use btf_int_encoding() instead of open coding it (Martin KaFai Lau)
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Douglas Raillard <douglas.raillard@arm.com>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Link: https://lore.kernel.org/20250801165601.7770d65c@gandalf.local.home
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Pull CXL updates from Dave Jiang:
"The most significant changes in this pull request is the series that
introduces ACQUIRE() and ACQUIRE_ERR() macros to replace conditional
locking and ease the pain points of scoped_cond_guard().
The series also includes follow on changes that refactor the CXL
sub-system to utilize the new macros.
Detail summary:
- Add documentation template for CXL conventions to document CXL
platform quirks
- Replace mutex_lock_io() with mutex_lock() for mailbox
- Add location limit for fake CFMWS range for cxl_test, ARM platform
enabling
- CXL documentation typo and clarity fixes
- Use correct format specifier for function cxl_set_ecs_threshold()
- Make cxl_bus_type constant
- Introduce new helper cxl_resource_contains_addr() to check address
availability
- Fix wrong DPA checking for PPR operation
- Remove core/acpi.c and CXL core dependency on ACPI
- Introduce ACQUIRE() and ACQUIRE_ERR() for conditional locks
- Add CXL updates utilizing ACQUIRE() macro to remove gotos and
improve readability
- Add return for the dummy version of cxl_decoder_detach() without
CONFIG_CXL_REGION
- CXL events updates for spec r3.2
- Fix return of __cxl_decoder_detach() error path
- CXL debugfs documentation fix"
* tag 'cxl-for-6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl: (28 commits)
Documentation/ABI/testing/debugfs-cxl: Add 'cxl' to clear_poison path
cxl/region: Fix an ERR_PTR() vs NULL bug
cxl/events: Trace Memory Sparing Event Record
cxl/events: Add extra validity checks for CVME count in DRAM Event Record
cxl/events: Add extra validity checks for corrected memory error count in General Media Event Record
cxl/events: Update Common Event Record to CXL spec rev 3.2
cxl: Fix -Werror=return-type in cxl_decoder_detach()
cleanup: Fix documentation build error for ACQUIRE updates
cxl: Convert to ACQUIRE() for conditional rwsem locking
cxl/region: Consolidate cxl_decoder_kill_region() and cxl_region_detach()
cxl/region: Move ready-to-probe state check to a helper
cxl/region: Split commit_store() into __commit() and queue_reset() helpers
cxl/decoder: Drop pointless locking
cxl/decoder: Move decoder register programming to a helper
cxl/mbox: Convert poison list mutex to ACQUIRE()
cleanup: Introduce ACQUIRE() and ACQUIRE_ERR() for conditional locks
cxl: Remove core/acpi.c and cxl core dependency on ACPI
cxl/core: Using cxl_resource_contains_addr() to check address availability
cxl/edac: Fix wrong dpa checking for PPR operation
cxl/core: Introduce a new helper cxl_resource_contains_addr()
...
In ip_output() skb->dev is updated from the skb_dst(skb)->dev
this can become invalid when the interface is unregistered and freed,
Introduced new skb_dst_dev_rcu() function to be used instead of
skb_dst_dev() within rcu_locks in ip_output.This will ensure that
all the skb's associated with the dev being deregistered will
be transnmitted out first, before freeing the dev.
Given that ip_output() is called within an rcu_read_lock()
critical section or from a bottom-half context, it is safe to introduce
an RCU read-side critical section within it.
Multiple panic call stacks were observed when UL traffic was run
in concurrency with device deregistration from different functions,
pasting one sample for reference.
[496733.627565][T13385] Call trace:
[496733.627570][T13385] bpf_prog_ce7c9180c3b128ea_cgroupskb_egres+0x24c/0x7f0
[496733.627581][T13385] __cgroup_bpf_run_filter_skb+0x128/0x498
[496733.627595][T13385] ip_finish_output+0xa4/0xf4
[496733.627605][T13385] ip_output+0x100/0x1a0
[496733.627613][T13385] ip_send_skb+0x68/0x100
[496733.627618][T13385] udp_send_skb+0x1c4/0x384
[496733.627625][T13385] udp_sendmsg+0x7b0/0x898
[496733.627631][T13385] inet_sendmsg+0x5c/0x7c
[496733.627639][T13385] __sys_sendto+0x174/0x1e4
[496733.627647][T13385] __arm64_sys_sendto+0x28/0x3c
[496733.627653][T13385] invoke_syscall+0x58/0x11c
[496733.627662][T13385] el0_svc_common+0x88/0xf4
[496733.627669][T13385] do_el0_svc+0x2c/0xb0
[496733.627676][T13385] el0_svc+0x2c/0xa4
[496733.627683][T13385] el0t_64_sync_handler+0x68/0xb4
[496733.627689][T13385] el0t_64_sync+0x1a4/0x1a8
Changes in v3:
- Replaced WARN_ON() with WARN_ON_ONCE(), as suggested by Willem de Bruijn.
- Dropped legacy lines mistakenly pulled in from an outdated branch.
Changes in v2:
- Addressed review comments from Eric Dumazet
- Used READ_ONCE() to prevent potential load/store tearing
- Added skb_dst_dev_rcu() and used along with rcu_read_lock() in ip_output
Signed-off-by: Sharath Chandra Vurukala <quic_sharathv@quicinc.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250730105118.GA26100@hu-sharathv-hyd.qualcomm.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Syzbot reported a WARNING in taprio_get_start_time().
When link speed is 470,589 or greater, q->picos_per_byte becomes too
small, causing length_to_duration(q, ETH_ZLEN) to return zero.
This zero value leads to validation failures in fill_sched_entry() and
parse_taprio_schedule(), allowing arbitrary values to be assigned to
entry->interval and cycle_time. As a result, sched->cycle can become zero.
Since SPEED_800000 is the largest defined speed in
include/uapi/linux/ethtool.h, this issue can occur in realistic scenarios.
To ensure length_to_duration() returns a non-zero value for minimum-sized
Ethernet frames (ETH_ZLEN = 60), picos_per_byte must be at least 17
(60 * 17 > PSEC_PER_NSEC which is 1000).
This patch enforces a minimum value of 17 for picos_per_byte when the
calculated value would be lower, and adds a warning message to inform
users that scheduling accuracy may be affected at very high link speeds.
Fixes: fb66df20a7 ("net/sched: taprio: extend minimum interval restriction to entire cycle too")
Reported-by: syzbot+398e1ee4ca2cac05fddb@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=398e1ee4ca2cac05fddb
Signed-off-by: Takamitsu Iwai <takamitz@amazon.co.jp>
Link: https://patch.msgid.link/20250728173149.45585-1-takamitz@amazon.co.jp
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When sending a packet with virtio_net_hdr to tun device, if the gso_type
in virtio_net_hdr is SKB_GSO_UDP and the gso_size is less than udphdr
size, below crash may happen.
------------[ cut here ]------------
kernel BUG at net/core/skbuff.c:4572!
Oops: invalid opcode: 0000 [#1] SMP NOPTI
CPU: 0 UID: 0 PID: 62 Comm: mytest Not tainted 6.16.0-rc7 #203 PREEMPT(voluntary)
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
RIP: 0010:skb_pull_rcsum+0x8e/0xa0
Code: 00 00 5b c3 cc cc cc cc 8b 93 88 00 00 00 f7 da e8 37 44 38 00 f7 d8 89 83 88 00 00 00 48 8b 83 c8 00 00 00 5b c3 cc cc cc cc <0f> 0b 0f 0b 66 66 2e 0f 1f 84 00 000
RSP: 0018:ffffc900001fba38 EFLAGS: 00000297
RAX: 0000000000000004 RBX: ffff8880040c1000 RCX: ffffc900001fb948
RDX: ffff888003e6d700 RSI: 0000000000000008 RDI: ffff88800411a062
RBP: ffff8880040c1000 R08: 0000000000000000 R09: 0000000000000001
R10: ffff888003606c00 R11: 0000000000000001 R12: 0000000000000000
R13: ffff888004060900 R14: ffff888004050000 R15: ffff888004060900
FS: 000000002406d3c0(0000) GS:ffff888084a19000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020000040 CR3: 0000000004007000 CR4: 00000000000006f0
Call Trace:
<TASK>
udp_queue_rcv_one_skb+0x176/0x4b0 net/ipv4/udp.c:2445
udp_queue_rcv_skb+0x155/0x1f0 net/ipv4/udp.c:2475
udp_unicast_rcv_skb+0x71/0x90 net/ipv4/udp.c:2626
__udp4_lib_rcv+0x433/0xb00 net/ipv4/udp.c:2690
ip_protocol_deliver_rcu+0xa6/0x160 net/ipv4/ip_input.c:205
ip_local_deliver_finish+0x72/0x90 net/ipv4/ip_input.c:233
ip_sublist_rcv_finish+0x5f/0x70 net/ipv4/ip_input.c:579
ip_sublist_rcv+0x122/0x1b0 net/ipv4/ip_input.c:636
ip_list_rcv+0xf7/0x130 net/ipv4/ip_input.c:670
__netif_receive_skb_list_core+0x21d/0x240 net/core/dev.c:6067
netif_receive_skb_list_internal+0x186/0x2b0 net/core/dev.c:6210
napi_complete_done+0x78/0x180 net/core/dev.c:6580
tun_get_user+0xa63/0x1120 drivers/net/tun.c:1909
tun_chr_write_iter+0x65/0xb0 drivers/net/tun.c:1984
vfs_write+0x300/0x420 fs/read_write.c:593
ksys_write+0x60/0xd0 fs/read_write.c:686
do_syscall_64+0x50/0x1c0 arch/x86/entry/syscall_64.c:63
</TASK>
To trigger gso segment in udp_queue_rcv_skb(), we should also set option
UDP_ENCAP_ESPINUDP to enable udp_sk(sk)->encap_rcv. When the encap_rcv
hook return 1 in udp_queue_rcv_one_skb(), udp_csum_pull_header() will try
to pull udphdr, but the skb size has been segmented to gso size, which
leads to this crash.
Previous commit cf329aa42b ("udp: cope with UDP GRO packet misdirection")
introduces segmentation in UDP receive path only for GRO, which was never
intended to be used for UFO, so drop UFO packets in udp_rcv_segment().
Link: https://lore.kernel.org/netdev/20250724083005.3918375-1-wangliang74@huawei.com/
Link: https://lore.kernel.org/netdev/20250729123907.3318425-1-wangliang74@huawei.com/
Fixes: cf329aa42b ("udp: cope with UDP GRO packet misdirection")
Suggested-by: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Signed-off-by: Wang Liang <wangliang74@huawei.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250730101458.3470788-1-wangliang74@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pull remoteproc updates from Bjorn Andersson:
- Make the Xilinx remoteproc driver support running on only a single
core, disable still unsupported remoteproc features, and stop the
remoteproc on shutdown to facilitate kexec.
- Conclude the renaming of the Qualcomm ADSP driver to "PAS" that was
started many years ago.
* tag 'rproc-v6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/remoteproc/linux:
remoteproc: xlnx: Fix kernel-doc warnings
remoteproc: xlnx: Disable unsupported features
remoteproc: xlnx: Add shutdown callback
remoteproc: xlnx: Allow single core use in split mode
dt-bindings: remoteproc: qcom,sa8775p-pas: Correct the interrupt number
remoteproc: Don't use %pK through printk
dt-bindings: remoteproc: qcom,sm8150-pas: Document QCS615 remoteproc
remoteproc: qcom: pas: Conclude the rename from adsp
When the parent clock is a gated clock which has multiple parents, the
clock provider (clk-scmi typically) might return a rate of 0 since there
is not one of those particular parent clocks that should be chosen for
returning a rate. Prior to ee975351cf ("net: mdio: mdio-bcm-unimac:
Manage clock around I/O accesses"), we would not always be passing a
clock reference depending upon how mdio-bcm-unimac was instantiated. In
that case, we would take the fallback path where the rate is hard coded
to 250MHz.
Make sure that we still fallback to using a fixed rate for the divider
calculation, otherwise we simply ignore the desired MDIO bus clock
frequency which can prevent us from interfacing with Ethernet PHYs
properly.
Fixes: ee975351cf ("net: mdio: mdio-bcm-unimac: Manage clock around I/O accesses")
Signed-off-by: Florian Fainelli <florian.fainelli@broadcom.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250730202533.3463529-1-florian.fainelli@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
syzbot was able to craft a packet with very long IPv6 extension headers
leading to an overflow of skb->transport_header.
This 16bit field has a limited range.
Add skb_reset_transport_header_careful() helper and use it
from ipv6_gso_segment()
WARNING: CPU: 0 PID: 5871 at ./include/linux/skbuff.h:3032 skb_reset_transport_header include/linux/skbuff.h:3032 [inline]
WARNING: CPU: 0 PID: 5871 at ./include/linux/skbuff.h:3032 ipv6_gso_segment+0x15e2/0x21e0 net/ipv6/ip6_offload.c:151
Modules linked in:
CPU: 0 UID: 0 PID: 5871 Comm: syz-executor211 Not tainted 6.16.0-rc6-syzkaller-g7abc678e3084 #0 PREEMPT(full)
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 07/12/2025
RIP: 0010:skb_reset_transport_header include/linux/skbuff.h:3032 [inline]
RIP: 0010:ipv6_gso_segment+0x15e2/0x21e0 net/ipv6/ip6_offload.c:151
Call Trace:
<TASK>
skb_mac_gso_segment+0x31c/0x640 net/core/gso.c:53
nsh_gso_segment+0x54a/0xe10 net/nsh/nsh.c:110
skb_mac_gso_segment+0x31c/0x640 net/core/gso.c:53
__skb_gso_segment+0x342/0x510 net/core/gso.c:124
skb_gso_segment include/net/gso.h:83 [inline]
validate_xmit_skb+0x857/0x11b0 net/core/dev.c:3950
validate_xmit_skb_list+0x84/0x120 net/core/dev.c:4000
sch_direct_xmit+0xd3/0x4b0 net/sched/sch_generic.c:329
__dev_xmit_skb net/core/dev.c:4102 [inline]
__dev_queue_xmit+0x17b6/0x3a70 net/core/dev.c:4679
Fixes: d1da932ed4 ("ipv6: Separate ipv6 offload support")
Reported-by: syzbot+af43e647fd835acc02df@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/netdev/688a1a05.050a0220.5d226.0008.GAE@google.com/T/#u
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Dawid Osuchowski <dawid.osuchowski@linux.intel.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250730131738.3385939-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently, the user is always asked about the Microchip Azurite
DPLL/PTP/SyncE core driver, even when I2C and SPI are disabled, and thus
the driver cannot be used at all.
Fix this by making the Kconfig symbol for the core driver invisible
(unless compile-testing), and selecting it by the bus glue sub-drivers.
Drop the modular defaults, as drivers should not default to enabled.
Fixes: 2df8e64e01 ("dpll: Add basic Microchip ZL3073x support")
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Link: https://patch.msgid.link/97804163aeb262f0e0706d00c29d9bb751844454.1753874405.git.geert+renesas@glider.be
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When gso_segs is left at 0, a number of assumptions will end up being
incorrect throughout the stack.
For example, in the GRO-path, we set NAPI_GRO_CB()->count to gso_segs.
So, if a non-LRO'ed packet followed by an LRO'ed packet is being
processed in GRO, the first one will have NAPI_GRO_CB()->count set to 1 and
the next one to 0 (in dev_gro_receive()).
Since commit 531d0d32de
("net/mlx5: Correctly set gso_size when LRO is used")
these packets will get merged (as their gso_size now matches).
So, we end up in gro_complete() with NAPI_GRO_CB()->count == 1 and thus
don't call inet_gro_complete(). Meaning, checksum-validation in
tcp_checksum_complete() will fail with a "hw csum failure".
Even before the above mentioned commit, incorrect gso_segs means that other
things like TCP's accounting of incoming packets (tp->segs_in,
data_segs_in, rcv_ooopack) will be incorrect. Which means that if one
does bytes_received/data_segs_in, the result will be bigger than the
MTU.
Fix this by initializing gso_segs correctly when LRO is used.
Fixes: e586b3b0ba ("net/mlx5: Ethernet Datapath files")
Reported-by: Gal Pressman <gal@nvidia.com>
Closes: https://lore.kernel.org/netdev/6583783f-f0fb-4fb1-a415-feec8155bc69@nvidia.com/
Signed-off-by: Christoph Paasch <cpaasch@openai.com>
Reviewed-by: Gal Pressman <gal@nvidia.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250729-mlx5_gso_segs-v1-1-b48c480c1c12@openai.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pull virtio updates from Michael Tsirkin:
- vhost can now support legacy threading if enabled in Kconfig
- vsock memory allocation strategies for large buffers have been
improved, reducing pressure on kmalloc
- vhost now supports the in-order feature. guest bits missed the merge
window.
- fixes, cleanups all over the place
* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (30 commits)
vsock/virtio: Allocate nonlinear SKBs for handling large transmit buffers
vsock/virtio: Rename virtio_vsock_skb_rx_put()
vhost/vsock: Allocate nonlinear SKBs for handling large receive buffers
vsock/virtio: Move SKB allocation lower-bound check to callers
vsock/virtio: Rename virtio_vsock_alloc_skb()
vsock/virtio: Resize receive buffers so that each SKB fits in a 4K page
vsock/virtio: Move length check to callers of virtio_vsock_skb_rx_put()
vsock/virtio: Validate length in packet header before skb_put()
vhost/vsock: Avoid allocating arbitrarily-sized SKBs
vhost_net: basic in_order support
vhost: basic in order support
vhost: fail early when __vhost_add_used() fails
vhost: Reintroduce kthread API and add mode selection
vdpa: Fix IDR memory leak in VDUSE module exit
vdpa/mlx5: Fix release of uninitialized resources on error path
vhost-scsi: Fix check for inline_sg_cnt exceeding preallocated limit
virtio: virtio_dma_buf: fix missing parameter documentation
vhost: Fix typos
vhost: vringh: Remove unused functions
vhost: vringh: Remove unused iotlb functions
...
Pull PCI updates from Bjorn Helgaas:
"Enumeration:
- Allow built-in drivers, not just modular drivers, to use async
initial probing (Lukas Wunner)
- Support Immediate Readiness even on devices with no PM Capability
(Sean Christopherson)
- Consolidate definition of PCIE_RESET_CONFIG_WAIT_MS (100ms), the
required delay between a reset and sending config requests to a
device (Niklas Cassel)
- Add pci_is_display() to check for "Display" base class and use it
in ALSA hda, vfio, vga_switcheroo, vt-d (Mario Limonciello)
- Allow 'isolated PCI functions' (multi-function devices without a
function 0) for LoongArch, similar to s390 and jailhouse (Huacai
Chen)
Power control:
- Add ability to enable optional slot clock for cases where the PCIe
host controller and the slot are supplied by different clocks
(Marek Vasut)
PCIe native device hotplug:
- Fix runtime PM ref imbalance on Hot-Plug Capable ports caused by
misinterpreting a config read failure after a device has been
removed (Lukas Wunner)
- Avoid creating a useless PCIe port service device for pciehp if the
slot is handled by the ACPI hotplug driver (Lukas Wunner)
- Ignore ACPI hotplug slots when calculating depth of pciehp hotplug
ports (Lukas Wunner)
Virtualization:
- Save VF resizable BAR state and restore it after reset (Michał
Winiarski)
- Allow IOV resources (VF BARs) to be resized (Michał Winiarski)
- Add pci_iov_vf_bar_set_size() so drivers can control VF BAR size
(Michał Winiarski)
Endpoint framework:
- Add RC-to-EP doorbell support using platform MSI controller,
including a test case (Frank Li)
- Allow BAR assignment via configfs so platforms have flexibility in
determining BAR usage (Jerome Brunet)
Native PCIe controller drivers:
- Convert amazon,al-alpine-v[23]-pcie, apm,xgene-pcie,
axis,artpec6-pcie, marvell,armada-3700-pcie, st,spear1340-pcie to
DT schema format (Rob Herring)
- Use dev_fwnode() instead of of_fwnode_handle() to remove OF
dependency in altera (fixes an unused variable), designware-host,
mediatek, mediatek-gen3, mobiveil, plda, xilinx, xilinx-dma,
xilinx-nwl (Jiri Slaby, Arnd Bergmann)
- Convert aardvark, altera, brcmstb, designware-host, iproc,
mediatek, mediatek-gen3, mobiveil, plda, rcar-host, vmd, xilinx,
xilinx-dma, xilinx-nwl from using pci_msi_create_irq_domain() to
using msi_create_parent_irq_domain() instead; this makes the
interrupt controller per-PCI device, allows dynamic allocation of
vectors after initialization, and allows support of IMS (Nam Cao)
APM X-Gene PCIe controller driver:
- Rewrite MSI handling to MSI CPU affinity, drop useless CPU hotplug
bits, use device-managed memory allocations, and clean things up
(Marc Zyngier)
- Probe xgene-msi as a standard platform driver rather than a
subsys_initcall (Marc Zyngier)
Broadcom STB PCIe controller driver:
- Add optional DT 'num-lanes' property and if present, use it to
override the Maximum Link Width advertised in Link Capabilities
(Jim Quinlan)
Cadence PCIe controller driver:
- Use PCIe Message routing types from the PCI core rather than
defining private ones (Hans Zhang)
Freescale i.MX6 PCIe controller driver:
- Add IMX8MQ_EP third 64-bit BAR in epc_features (Richard Zhu)
- Add IMX8MM_EP and IMX8MP_EP fixed 256-byte BAR 4 in epc_features
(Richard Zhu)
- Configure LUT for MSI/IOMMU in Endpoint mode so Root Complex can
trigger doorbel on Endpoint (Frank Li)
- Remove apps_reset (LTSSM_EN) from
imx_pcie_{assert,deassert}_core_reset(), which fixes a hotplug
regression on i.MX8MM (Richard Zhu)
- Delay Endpoint link start until configfs 'start' written (Richard
Zhu)
Intel VMD host bridge driver:
- Add Intel Panther Lake (PTL)-H/P/U Vendor ID (George D Sworo)
Qualcomm PCIe controller driver:
- Add DT binding and driver support for SA8255p, which supports ECAM
for Configuration Space access (Mayank Rana)
- Update DT binding and driver to describe PHYs and per-Root Port
resets in a Root Port stanza and deprecate describing them in the
host bridge; this makes it possible to support multiple Root Ports
in the future (Krishna Chaitanya Chundru)
- Add Qualcomm QCS615 to SM8150 DT binding (Ziyue Zhang)
- Add Qualcomm QCS8300 to SA8775p DT binding (Ziyue Zhang)
- Drop TBU and ref clocks from Qualcomm SM8150 and SC8180x DT
bindings (Konrad Dybcio)
- Document 'link_down' reset in Qualcomm SA8775P DT binding (Ziyue
Zhang)
- Add required PCIE_RESET_CONFIG_WAIT_MS delay after Link up IRQ
(Niklas Cassel)
Rockchip PCIe controller driver:
- Drop unused PCIe Message routing and code definitions (Hans Zhang)
- Remove several unused header includes (Hans Zhang)
- Use standard PCIe config register definitions instead of
rockchip-specific redefinitions (Geraldo Nascimento)
- Set Target Link Speed to 5.0 GT/s before retraining so we have a
chance to train at a higher speed (Geraldo Nascimento)
Rockchip DesignWare PCIe controller driver:
- Prevent race between link training and register update via DBI by
inhibiting link training after hot reset and link down (Wilfred
Mallawa)
- Add required PCIE_RESET_CONFIG_WAIT_MS delay after Link up IRQ
(Niklas Cassel)
Sophgo PCIe controller driver:
- Add DT binding and driver for Sophgo SG2044 PCIe controller driver
in Root Complex mode (Inochi Amaoto)
Synopsys DesignWare PCIe controller driver:
- Add required PCIE_RESET_CONFIG_WAIT_MS after waiting for Link up on
Ports that support > 5.0 GT/s. Slower Ports still rely on the
not-quite-correct PCIE_LINK_WAIT_SLEEP_MS 90ms default delay while
waiting for the Link (Niklas Cassel)"
* tag 'pci-v6.17-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci: (116 commits)
dt-bindings: PCI: qcom,pcie-sa8775p: Document 'link_down' reset
dt-bindings: PCI: Remove 83xx-512x-pci.txt
dt-bindings: PCI: Convert amazon,al-alpine-v[23]-pcie to DT schema
dt-bindings: PCI: Convert marvell,armada-3700-pcie to DT schema
dt-bindings: PCI: Convert apm,xgene-pcie to DT schema
dt-bindings: PCI: Convert axis,artpec6-pcie to DT schema
dt-bindings: PCI: Convert st,spear1340-pcie to DT schema
PCI: Move is_pciehp check out of pciehp_is_native()
PCI: pciehp: Use is_pciehp instead of is_hotplug_bridge
PCI/portdrv: Use is_pciehp instead of is_hotplug_bridge
PCI/ACPI: Fix runtime PM ref imbalance on Hot-Plug Capable ports
selftests: pci_endpoint: Add doorbell test case
misc: pci_endpoint_test: Add doorbell test case
PCI: endpoint: pci-epf-test: Add doorbell test support
PCI: endpoint: Add pci_epf_align_inbound_addr() helper for inbound address alignment
PCI: endpoint: pci-ep-msi: Add checks for MSI parent and mutability
PCI: endpoint: Add RC-to-EP doorbell support using platform MSI controller
PCI: dwc: Add Sophgo SG2044 PCIe controller driver in Root Complex mode
PCI: vmd: Switch to msi_create_parent_irq_domain()
PCI: vmd: Convert to lock guards
...
The purpose of the "Periodic garbage collection" test case is to make
sure that "extern_valid" neighbors are not flushed during periodic
garbage collection, unlike regular neighbor entries.
The test case is currently doing the following:
1. Changing the base reachable time to 10 seconds so that periodic
garbage collection will run every 5 seconds.
2. Changing the garbage collection stale time to 5 seconds so that
neighbors that have not been used in the last 5 seconds will be
considered for removal.
3. Waiting for the base reachable time change to take effect.
4. Adding an "extern_valid" neighbor, a non-"extern_valid" neighbor and
a bunch of other neighbors so that the threshold ("thresh1") will be
crossed and stale neighbors will be flushed during garbage
collection.
5. Waiting for 10 seconds to give garbage collection a chance to run.
6. Checking that the "extern_valid" neighbor was not flushed and that
the non-"extern_valid" neighbor was flushed.
The test sometimes fails in the netdev CI because the non-"extern_valid"
neighbor was not flushed. I am unable to reproduce this locally, but my
theory that since we do not know exactly when the periodic garbage
collection runs, it is possible for it to run at a time when the
non-"extern_valid" neighbor is still not considered stale.
Fix by moving the addition of the two neighbors before step 3 and by
reducing the garbage collection stale time to 1 second, to ensure that
both neighbors are considered stale when garbage collection runs.
Fixes: 171f2ee31a ("selftests: net: Add a selftest for externally validated neighbor entries")
Reported-by: Jakub Kicinski <kuba@kernel.org>
Closes: https://lore.kernel.org/netdev/20250728093504.4ebbd73c@kernel.org/
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20250731110914.506890-1-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Some calls to the tracing ring buffer can happen when the ring buffer is
already being written to by the same context (for example, a
trace_printk() in between a ring_buffer_lock_reserve() and a
ring_buffer_unlock_commit()).
In order to not trigger the recursion detection, these functions use
ring_buffer_nest_start() and ring_buffer_nest_end(). Create a guard() for
these functions so that their use cases can be simplified and not need to
use goto for the release.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://lore.kernel.org/20250801203857.710501021@kernel.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Pull dmaengine updates from Vinod Koul:
"Core:
- Managed API for dma channel request
New support:
- Sophgo CV18XX/SG200X dmamux driver
- Qualcomm Milos GPI, sc8280xp GPI support
Updates:
- Conversion of brcm,iproc-sba and marvell,orion-xor binding
- Unused code cleanup across drivers"
* tag 'dmaengine-6.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vkoul/dmaengine: (23 commits)
dt-bindings: dma: fsl-mxs-dma: allow interrupt-names for fsl,imx23-dma-apbx
dmaengine: xdmac: make it selectable for ARCH_MICROCHIP
dt-bindings: dma: Convert marvell,orion-xor to DT schema
dt-bindings: dma: Convert brcm,iproc-sba to DT schema
dmaengine: nbpfaxi: Add missing check after DMA map
dmaengine: mv_xor: Fix missing check after DMA map and missing unmap
dt-bindings: dma: qcom,gpi: document the Milos GPI DMA Engine
dmaengine: idxd: Remove __packed from structures
dmaengine: ti: Do not enable by default during compile testing
dmaengine: sh: Do not enable SH_DMAE_BASE by default during compile testing
dmaengine: idxd: Fix warning for deadcode.deadstore
dmaengine: mmp: Fix again Wvoid-pointer-to-enum-cast warning
dmaengine: fsl-qdma: Add missing fsl_qdma_format kerneldoc
dmaengine: qcom: gpi: Drop unused gpi_write_reg_field()
dmaengine: fsl-dpaa2-qdma: Drop unused mc_enc()
dmaengine: dw-edma: Drop unused dchan2dev() and chan2dev()
dmaengine: stm32: Don't use %pK through printk
dmaengine: stm32-dma: configure next sg only if there are more than 2 sgs
dmaengine: sun4i: Simplify error handling in probe()
dt-bindings: dma: qcom,gpi: Document the sc8280xp GPI DMA engine
...
Pull more sound updates from Takashi Iwai:
"For catching up the remaining stuff for 6.17: only small updates and
the rest are mostly small fixes.
- Fixes in HD-audio codec driver Kconfig, so that configurations can
be more easily/safely carried between different versions
- Fixes in ASoC SDCA, FSL xcvr, AW88399
- ASoC IMX WM8524 support
- HD-audio and USB-audio quirks and fixes
- A minor selftest fix"
* tag 'sound-6.17-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: (21 commits)
ALSA: usb: scarlett2: Fix missing NULL check
mips: Update HD-audio configs again
LoongArch: Update HD-audio codec configs
arm: Update HD-audio configs again
selftests: ALSA: fix memory leak in utimer test
ALSA: usb-audio: Add DSD support for Comtrue USB Audio device
ALSA: hda/hdmi: Enable drivers as default
ALSA: hda/cirrus: Enable drivers as default
ALSA: hda/realtek: Enable drivers as default
ALSA: hda/realtek - Fix mute LED for HP Victus 16-d1xxx (MB 8A26)
ALSA: hda/realtek - Fix mute LED for HP Victus 16-s0xxx
ALSA: hda: Fix the wrong register was used for DVC of TAS2770
ALSA: scarlett2: Add retry on -EPROTO from scarlett2_usb_tx()
ALSA: hda/realtek - Fix mute LED for HP Victus 16-r1xxx
ASoC: codecs: Add acpi_match_table for aw88399 driver
ASoC: dt-bindings: atmel,at91-ssc: add microchip,sam9x7-ssc
ASoC: imx-card: Add WM8524 support
ASoC: fsl_xcvr: get channel status data with firmware exists
ASoC: fsl_xcvr: get channel status data when PHY is not exists
ASoC: SDCA: Add support for -cn- value properties
...
Pull soundwire updates from Vinod Koul:
"A couple of small core changes and driver updates:
- Core: handling of nesting irqs to outside the lock, stream
parameters handing on port prep failures.
- AMD driver support for ACP 7.2 platforms and improved handing of
slave alerts and resume sequences
- Qualcomm updating driver debug spew
- Intel BPT message length limitations, rt721 codec as wake capable
etc"
* tag 'soundwire-6.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vkoul/soundwire:
soundwire: amd: Add support for acp7.2 platform
soundwire: stream: restore params when prepare ports fail
soundwire: debugfs: move debug statement outside of error handling
soundwire: amd: add check for status update registers
soundwire: intel_auxdevice: add rt721 codec to wake_capable_list
soundwire: Correct some property names
soundwire: update Intel BPT message length limitation
soundwire: intel_ace2.x: Use str_read_write() helper
soundwire: amd: cancel pending slave status handling workqueue during remove sequence
soundwire: amd: serialize amd manager resume sequence during pm_prepare
soundwire: qcom: demote probe registration printk
ASoC: cs42l43: Remove unnecessary work functions
soundwire: Move handle_nested_irq outside of sdw_dev_lock
MAINTAINERS: Remove Sanyog Kale as reviewer on SoundWire
Pull tracing updates from Steven Rostedt:
- Deprecate auto-mounting tracefs to /sys/kernel/debug/tracing
When tracefs was first introduced back in 2014, the directory
/sys/kernel/tracing was added and is the designated location to mount
tracefs. To keep backward compatibility, tracefs was auto-mounted in
/sys/kernel/debug/tracing as well.
All distros now mount tracefs on /sys/kernel/tracing. Having it seen
in two different locations has lead to various issues and
inconsistencies.
The VFS folks have to also maintain debugfs_create_automount() for
this single user.
It's been over 10 years. Tooling and scripts should start replacing
the debugfs location with the tracefs one. The reason tracefs was
created in the first place was to allow access to the tracing
facilities without the need to configure debugfs into the kernel.
Using tracefs should now be more robust.
A new config is created: CONFIG_TRACEFS_AUTOMOUNT_DEPRECATED which is
default y, so that the kernel is still built with the automount. This
config allows those that want to remove the automount from debugfs to
do so.
When tracefs is accessed from /sys/kernel/debug/tracing, the
following printk is triggerd:
pr_warn("NOTICE: Automounting of tracing to debugfs is deprecated and will be removed in 2030\n");
This gives users another 5 years to fix their scripts.
- Use queue_rcu_work() instead of call_rcu() for freeing event filters
The number of filters to be free can be many depending on the number
of events within an event system. Freeing them from softirq context
can potentially cause undesired latency. Use the RCU workqueue to
free them instead.
- Remove pointless memory barriers in latency code
Memory barriers were added to some of the latency code a long time
ago with the idea of "making them visible", but that's not what
memory barriers are for. They are to synchronize access between
different variables. There was no synchronization here making them
pointless.
- Remove "__attribute__()" from the type field of event format
When LLVM is used to compile the kernel with CONFIG_DEBUG_INFO_BTF=y
and PAHOLE_HAS_BTF_TAG=y, some of the format fields get expanded with
the following:
field:const char * filename; offset:24; size:8; signed:0;
Turns into:
field:const char __attribute__((btf_type_tag("user"))) * filename; offset:24; size:8; signed:0;
This confuses parsers. Add code to strip these tags from the strings.
- Add eprobe config option CONFIG_EPROBE_EVENTS
Eprobes were added back in 5.15 but were only enabled when another
probe was enabled (kprobe, fprobe, uprobe, etc). The eprobes had no
config option of their own. Add one as they should be a separate
entity.
It's default y to keep with the old kernels but still has
dependencies on TRACING and HAVE_REGS_AND_STACK_ACCESS_API.
- Add eprobe documentation
When eprobes were added back in 5.15 no documentation was added to
describe them. This needs to be rectified.
- Replace open coded cpumask_next_wrap() in move_to_next_cpu()
- Have preemptirq_delay_run() use off-stack CPU mask
- Remove obsolete comment about pelt_cfs event
DECLARE_TRACE() appends "_tp" to trace events now, but the comment
above pelt_cfs still mentioned appending it manually.
- Remove EVENT_FILE_FL_SOFT_MODE flag
The SOFT_MODE flag was required when the soft enabling and disabling
of trace events was first introduced. But there was a bug with this
approach as it only worked for a single instance. When multiple users
required soft disabling and disabling the code was changed to have a
ref count. The SOFT_MODE flag is now set iff the ref count is non
zero. This is redundant and just reading the ref count is good
enough.
- Fix typo in comment
* tag 'trace-v6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
Documentation: tracing: Add documentation about eprobes
tracing: Have eprobes have their own config option
tracing: Remove "__attribute__()" from the type field of event format
tracing: Deprecate auto-mounting tracefs in debugfs
tracing: Fix comment in trace_module_remove_events()
tracing: Remove EVENT_FILE_FL_SOFT_MODE flag
tracing: Remove pointless memory barriers
tracing/sched: Remove obsolete comment on suffixes
kernel: trace: preemptirq_delay_test: use offstack cpu mask
tracing: Use queue_rcu_work() to free filters
tracing: Replace opencoded cpumask_next_wrap() in move_to_next_cpu()
Pull tracing tools updates from Steven Rostedt:
- Introduce enum timerlat_tracing_mode
Now that BPF based sampling has been added to timerlat, add an enum
to represent which mode timerlat is running in
- Add action on timelat threshold feature
A new option, --on-threshold, is added, taking an argument that
further specifies the action. Actions added in this patch are:
- trace[,file=<filename>]: Saves tracefs buffer, optionally taking a
filename
- signal,num=<sig>,pid=<pid>: Sends signal to process. "parent" might
be specified instead of number to send signal to parent process
- shell,command=<command>: Execute shell command
- Allow resuming tracing in timerlat bpf
rtla-timerlat BPF program uses a global variable stored in a .bss
section to store whether tracing has been stopped. Map it to allow it
to resume tracing after it has been stopped
- Add continue action to timerlat
Introduce option to resume tracing after a latency threshold
overflow. The option is implemented as an action named "continue"
- Add action on end feature to timerlat
Implement actions on end next to actions on threshold. A new option,
--on-end is added, parallel to --on-threshold. Instead of being
executed whenever a latency threshold is reached, it is executed at
the end of the measurement
- Have rtla tests check output with grep
Add argument to the check command in the test suite that takes a
regular expression that the output of rtla command is checked
against. This allows testing for specific information in rtla output
in addition to checking the return value
- Add tests for timerlat actions
- Update the documentation for the new features
* tag 'trace-tools-v6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
rtla/tests: Test timerlat -P option using actions
rtla/tests: Add grep checks for base test cases
Documentation/rtla: Add actions feature
rtla/tests: Limit duration to maximum of 10s
rtla/tests: Add tests for actions
rtla/tests: Check rtla output with grep
rtla/timerlat: Add action on end feature
rtla/timerlat: Add continue action
rtla/timerlat_bpf: Allow resuming tracing
rtla/timerlat: Add action on threshold feature
rtla/timerlat: Introduce enum timerlat_tracing_mode
Pull initial deferred unwind infrastructure from Steven Rostedt:
"This is the core infrastructure for the deferred unwinder that is
required for sframes[1]. Several other patch series are based on this
work although those patch series are not dependent on each other. In
order to simplify the development, having this core series upstream
will allow the other series to be worked on in parallel. The other
series are:
- The two patches to implement x86 support [2] [3]
- The s390 work [4]
- The perf work [5]
- The ftrace work [6]
- The sframe work [7]
And more is on the way.
The core infrastructure adds the following in kernel APIs:
- int unwind_user_faultable(struct unwind_stacktrace *trace);
Performs a user space stack trace that may fault user pages in.
- int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func);
Allows a tracer to register with the unwind deferred
infrastructure.
- int unwind_deferred_request(struct unwind_work *work, u64 *cookie);
Used when a tracer request a deferred trace. Can be called from
interrupt or NMI context.
- void unwind_deferred_cancel(struct unwind_work *work);
Called by a tracer to unregister from the deferred unwind
infrastructure.
- void unwind_deferred_task_exit(struct task_struct *task);
Called by task exit code to flush any pending unwind requests.
- void unwind_task_init(struct task_struct *task);
Called by do_fork() to initialize the task struct for the
deferred unwinder.
- void unwind_task_free(struct task_struct *task);
Called by do_exit() to free up any resources used by the
deferred unwinder.
None of the above is actually compiled unless an architecture enables it,
which none currently do"
Link: https://sourceware.org/binutils/wiki/sframe [1]
Link: https://lore.kernel.org/linux-trace-kernel/20250717004958.260781923@kernel.org/ [2]
Link: https://lore.kernel.org/linux-trace-kernel/20250717004958.432327787@kernel.org/ [3]
Link: https://lore.kernel.org/linux-trace-kernel/20250710163522.3195293-1-jremus@linux.ibm.com/ [4]
Link: https://lore.kernel.org/linux-trace-kernel/20250718164119.089692174@kernel.org/ [5]
Link: https://lore.kernel.org/linux-trace-kernel/20250424192612.505622711@goodmis.org/ [6]
Link: https://lore.kernel.org/linux-trace-kernel/20250717012848.927473176@kernel.org/ [7]
* tag 'trace-deferred-unwind-v6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
unwind: Finish up unwind when a task exits
unwind deferred: Use SRCU unwind_deferred_task_work()
unwind: Add USED bit to only have one conditional on way back to user space
unwind deferred: Add unwind_completed mask to stop spurious callbacks
unwind deferred: Use bitmask to determine which callbacks to call
unwind_user/deferred: Make unwind deferral requests NMI-safe
unwind_user/deferred: Add deferred unwinding interface
unwind_user/deferred: Add unwind cache
unwind_user/deferred: Add unwind_user_faultable()
unwind_user: Add user space unwinding API with frame pointer support
When transmitting a vsock packet, virtio_transport_send_pkt_info() calls
virtio_transport_alloc_linear_skb() to allocate and fill SKBs with the
transmit data. Unfortunately, these are always linear allocations and
can therefore result in significant pressure on kmalloc() considering
that the maximum packet size (VIRTIO_VSOCK_MAX_PKT_BUF_SIZE +
VIRTIO_VSOCK_SKB_HEADROOM) is a little over 64KiB, resulting in a 128KiB
allocation for each packet.
Rework the vsock SKB allocation so that, for sizes with page order
greater than PAGE_ALLOC_COSTLY_ORDER, a nonlinear SKB is allocated
instead with the packet header in the SKB and the transmit data in the
fragments. Note that this affects both the vhost and virtio transports.
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Will Deacon <will@kernel.org>
Message-Id: <20250717090116.11987-10-will@kernel.org>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
In preparation for using virtio_vsock_skb_rx_put() when populating SKBs
on the vsock TX path, rename virtio_vsock_skb_rx_put() to
virtio_vsock_skb_put().
No functional change.
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Will Deacon <will@kernel.org>
Message-Id: <20250717090116.11987-9-will@kernel.org>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
When receiving a packet from a guest, vhost_vsock_handle_tx_kick()
calls vhost_vsock_alloc_linear_skb() to allocate and fill an SKB with
the receive data. Unfortunately, these are always linear allocations and
can therefore result in significant pressure on kmalloc() considering
that the maximum packet size (VIRTIO_VSOCK_MAX_PKT_BUF_SIZE +
VIRTIO_VSOCK_SKB_HEADROOM) is a little over 64KiB, resulting in a 128KiB
allocation for each packet.
Rework the vsock SKB allocation so that, for sizes with page order
greater than PAGE_ALLOC_COSTLY_ORDER, a nonlinear SKB is allocated
instead with the packet header in the SKB and the receive data in the
fragments. Finally, add a debug warning if virtio_vsock_skb_rx_put() is
ever called on an SKB with a non-zero length, as this would be
destructive for the nonlinear case.
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Will Deacon <will@kernel.org>
Message-Id: <20250717090116.11987-8-will@kernel.org>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
virtio_vsock_alloc_linear_skb() checks that the requested size is at
least big enough for the packet header (VIRTIO_VSOCK_SKB_HEADROOM).
Of the three callers of virtio_vsock_alloc_linear_skb(), only
vhost_vsock_alloc_skb() can potentially pass a packet smaller than the
header size and, as it already has a check against the maximum packet
size, extend its bounds checking to consider the minimum packet size
and remove the check from virtio_vsock_alloc_linear_skb().
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Will Deacon <will@kernel.org>
Message-Id: <20250717090116.11987-7-will@kernel.org>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
In preparation for nonlinear allocations for large SKBs, rename
virtio_vsock_alloc_skb() to virtio_vsock_alloc_linear_skb() to indicate
that it returns linear SKBs unconditionally and switch all callers over
to this new interface for now.
No functional change.
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Will Deacon <will@kernel.org>
Message-Id: <20250717090116.11987-6-will@kernel.org>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
When allocating receive buffers for the vsock virtio RX virtqueue, an
SKB is allocated with a 4140 data payload (the 44-byte packet header +
VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE). Even when factoring in the SKB
overhead, the resulting 8KiB allocation thanks to the rounding in
kmalloc_reserve() is wasteful (~3700 unusable bytes) and results in a
higher-order page allocation on systems with 4KiB pages just for the
sake of a few hundred bytes of packet data.
Limit the vsock virtio RX buffers to 4KiB per SKB, resulting in much
better memory utilisation and removing the need to allocate higher-order
pages entirely.
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Will Deacon <will@kernel.org>
Message-Id: <20250717090116.11987-5-will@kernel.org>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
virtio_vsock_skb_rx_put() only calls skb_put() if the length in the
packet header is not zero even though skb_put() handles this case
gracefully.
Remove the functionally redundant check from virtio_vsock_skb_rx_put()
and, on the assumption that this is a worthwhile optimisation for
handling credit messages, augment the existing length checks in
virtio_transport_rx_work() to elide the call for zero-length payloads.
Since the callers all have the length, extend virtio_vsock_skb_rx_put()
to take it as an additional parameter rather than fish it back out of
the packet header.
Note that the vhost code already has similar logic in
vhost_vsock_alloc_skb().
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Will Deacon <will@kernel.org>
Message-Id: <20250717090116.11987-4-will@kernel.org>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
When receiving a vsock packet in the guest, only the virtqueue buffer
size is validated prior to virtio_vsock_skb_rx_put(). Unfortunately,
virtio_vsock_skb_rx_put() uses the length from the packet header as the
length argument to skb_put(), potentially resulting in SKB overflow if
the host has gone wonky.
Validate the length as advertised by the packet header before calling
virtio_vsock_skb_rx_put().
Cc: <stable@vger.kernel.org>
Fixes: 71dc9ec9ac ("virtio/vsock: replace virtio_vsock_pkt with sk_buff")
Signed-off-by: Will Deacon <will@kernel.org>
Message-Id: <20250717090116.11987-3-will@kernel.org>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
vhost_vsock_alloc_skb() returns NULL for packets advertising a length
larger than VIRTIO_VSOCK_MAX_PKT_BUF_SIZE in the packet header. However,
this is only checked once the SKB has been allocated and, if the length
in the packet header is zero, the SKB may not be freed immediately.
Hoist the size check before the SKB allocation so that an iovec larger
than VIRTIO_VSOCK_MAX_PKT_BUF_SIZE + the header size is rejected
outright. The subsequent check on the length field in the header can
then simply check that the allocated SKB is indeed large enough to hold
the packet.
Cc: <stable@vger.kernel.org>
Fixes: 71dc9ec9ac ("virtio/vsock: replace virtio_vsock_pkt with sk_buff")
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Will Deacon <will@kernel.org>
Message-Id: <20250717090116.11987-2-will@kernel.org>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
This patch introduces basic in-order support for vhost-net. By
recording the number of batched buffers in an array when calling
`vhost_add_used_and_signal_n()`, we can reduce the number of userspace
accesses. Note that the vhost-net batching logic is kept as we still
count the number of buffers there.
Testing Results:
With testpmd:
- TX: txonly mode + vhost_net with XDP_DROP on TAP shows a 17.5%
improvement, from 4.75 Mpps to 5.35 Mpps.
- RX: No obvious improvements were observed.
With virtio-ring in-order experimental code in the guest:
- TX: pktgen in the guest + XDP_DROP on TAP shows a 19% improvement,
from 5.2 Mpps to 6.2 Mpps.
- RX: pktgen on TAP with vhost_net + XDP_DROP in the guest achieves a
6.1% improvement, from 3.47 Mpps to 3.61 Mpps.
Acked-by: Jonah Palmer <jonah.palmer@oracle.com>
Acked-by: Eugenio Pérez <eperezma@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Message-Id: <20250714084755.11921-4-jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Tested-by: Lei Yang <leiyang@redhat.com>
This patch adds basic in order support for vhost. Two optimizations
are implemented in this patch:
1) Since driver uses descriptor in order, vhost can deduce the next
avail ring head by counting the number of descriptors that has been
used in next_avail_head. This eliminate the need to access the
available ring in vhost.
2) vhost_add_used_and_singal_n() is extended to accept the number of
batched buffers per used elem. While this increases the times of
userspace memory access but it helps to reduce the chance of
used ring access of both the driver and vhost.
Vhost-net will be the first user for this.
Acked-by: Jonah Palmer <jonah.palmer@oracle.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Message-Id: <20250714084755.11921-3-jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Tested-by: Lei Yang <leiyang@redhat.com>
Since commit 6e890c5d50 ("vhost: use vhost_tasks for worker threads"),
the vhost uses vhost_task and operates as a child of the
owner thread. This is required for correct CPU usage accounting,
especially when using containers.
However, this change has caused confusion for some legacy
userspace applications, and we didn't notice until it's too late.
Unfortunately, it's too late to revert - we now have userspace
depending both on old and new behaviour :(
To address the issue, reintroduce kthread mode for vhost workers and
provide a configuration to select between kthread and task worker.
- Add 'fork_owner' parameter to vhost_dev to let users select kthread
or task mode. Default mode is task mode(VHOST_FORK_OWNER_TASK).
- Reintroduce kthread mode support:
* Bring back the original vhost_worker() implementation,
and renamed to vhost_run_work_kthread_list().
* Add cgroup support for the kthread
* Introduce struct vhost_worker_ops:
- Encapsulates create / stop / wake‑up callbacks.
- vhost_worker_create() selects the proper ops according to
inherit_owner.
- Userspace configuration interface:
* New IOCTLs:
- VHOST_SET_FORK_FROM_OWNER lets userspace select task mode
(VHOST_FORK_OWNER_TASK) or kthread mode (VHOST_FORK_OWNER_KTHREAD)
- VHOST_GET_FORK_FROM_OWNER reads the current worker mode
* Expose module parameter 'fork_from_owner_default' to allow system
administrators to configure the default mode for vhost workers
* Kconfig option CONFIG_VHOST_ENABLE_FORK_OWNER_CONTROL controls whether
these IOCTLs and the parameter are available
- The VHOST_NEW_WORKER functionality requires fork_owner to be set
to true, with validation added to ensure proper configuration
This partially reverts or improves upon:
commit 6e890c5d50 ("vhost: use vhost_tasks for worker threads")
commit 1cdaafa1b8 ("vhost: replace single worker pointer with xarray")
Fixes: 6e890c5d50 ("vhost: use vhost_tasks for worker threads"),
Signed-off-by: Cindy Lu <lulu@redhat.com>
Message-Id: <20250714071333.59794-2-lulu@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Tested-by: Lei Yang <leiyang@redhat.com>
Add missing idr_destroy() call in vduse_exit() to properly free the
vduse_idr radix tree nodes. Without this, module load/unload cycles leak
576-byte radix tree node allocations, detectable by kmemleak as:
unreferenced object (size 576):
backtrace:
[<ffffffff81234567>] radix_tree_node_alloc+0xa0/0xf0
[<ffffffff81234568>] idr_get_free+0x128/0x280
The vduse_idr is initialized via DEFINE_IDR() at line 136 and used throughout
the VDUSE (vDPA Device in Userspace) driver for device ID management. The fix
follows the documented pattern in lib/idr.c and matches the cleanup approach
used by other drivers.
This leak was discovered through comprehensive module testing with cumulative
kmemleak detection across 10 load/unload iterations per module.
Fixes: c8a6153b6c ("vduse: Introduce VDUSE - vDPA Device in Userspace")
Signed-off-by: Anders Roxell <anders.roxell@linaro.org>
Message-Id: <20250704125335.1084649-1-anders.roxell@linaro.org>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
The commit in the fixes tag made sure that mlx5_vdpa_free()
is the single entrypoint for removing the vdpa device resources
added in mlx5_vdpa_dev_add(), even in the cleanup path of
mlx5_vdpa_dev_add().
This means that all functions from mlx5_vdpa_free() should be able to
handle uninitialized resources. This was not the case though:
mlx5_vdpa_destroy_mr_resources() and mlx5_cmd_cleanup_async_ctx()
were not able to do so. This caused the splat below when adding
a vdpa device without a MAC address.
This patch fixes these remaining issues:
- Makes mlx5_vdpa_destroy_mr_resources() return early if called on
uninitialized resources.
- Moves mlx5_cmd_init_async_ctx() early on during device addition
because it can't fail. This means that mlx5_cmd_cleanup_async_ctx()
also can't fail. To mirror this, move the call site of
mlx5_cmd_cleanup_async_ctx() in mlx5_vdpa_free().
An additional comment was added in mlx5_vdpa_free() to document
the expectations of functions called from this context.
Splat:
mlx5_core 0000:b5:03.2: mlx5_vdpa_dev_add:3950:(pid 2306) warning: No mac address provisioned?
------------[ cut here ]------------
WARNING: CPU: 13 PID: 2306 at kernel/workqueue.c:4207 __flush_work+0x9a/0xb0
[...]
Call Trace:
<TASK>
? __try_to_del_timer_sync+0x61/0x90
? __timer_delete_sync+0x2b/0x40
mlx5_vdpa_destroy_mr_resources+0x1c/0x40 [mlx5_vdpa]
mlx5_vdpa_free+0x45/0x160 [mlx5_vdpa]
vdpa_release_dev+0x1e/0x50 [vdpa]
device_release+0x31/0x90
kobject_cleanup+0x37/0x130
mlx5_vdpa_dev_add+0x327/0x890 [mlx5_vdpa]
vdpa_nl_cmd_dev_add_set_doit+0x2c1/0x4d0 [vdpa]
genl_family_rcv_msg_doit+0xd8/0x130
genl_family_rcv_msg+0x14b/0x220
? __pfx_vdpa_nl_cmd_dev_add_set_doit+0x10/0x10 [vdpa]
genl_rcv_msg+0x47/0xa0
? __pfx_genl_rcv_msg+0x10/0x10
netlink_rcv_skb+0x53/0x100
genl_rcv+0x24/0x40
netlink_unicast+0x27b/0x3b0
netlink_sendmsg+0x1f7/0x430
__sys_sendto+0x1fa/0x210
? ___pte_offset_map+0x17/0x160
? next_uptodate_folio+0x85/0x2b0
? percpu_counter_add_batch+0x51/0x90
? filemap_map_pages+0x515/0x660
__x64_sys_sendto+0x20/0x30
do_syscall_64+0x7b/0x2c0
? do_read_fault+0x108/0x220
? do_pte_missing+0x14a/0x3e0
? __handle_mm_fault+0x321/0x730
? count_memcg_events+0x13f/0x180
? handle_mm_fault+0x1fb/0x2d0
? do_user_addr_fault+0x20c/0x700
? syscall_exit_work+0x104/0x140
entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7f0c25b0feca
[...]
---[ end trace 0000000000000000 ]---
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Fixes: 83e445e64f ("vdpa/mlx5: Fix error path during device add")
Reported-by: Wenli Quan <wquan@redhat.com>
Closes: https://lore.kernel.org/virtualization/CADZSLS0r78HhZAStBaN1evCSoPqRJU95Lt8AqZNJ6+wwYQ6vPQ@mail.gmail.com/
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Message-Id: <20250708120424.2363354-2-dtatulea@nvidia.com>
Tested-by: Wenli Quan <wquan@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
The condition comparing ret to VHOST_SCSI_PREALLOC_SGLS was incorrect,
as ret holds the result of kstrtouint() (typically 0 on success),
not the parsed value. Update the check to use cnt, which contains the
actual user-provided value.
prevents silently accepting values exceeding the maximum inline_sg_cnt.
Fixes: bca939d5bc ("vhost-scsi: Dynamically allocate scatterlists")
Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com>
Reviewed-by: Mike Christie <michael.christie@oracle.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-Id: <20250628183405.3979538-1-alok.a.tiwari@oracle.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Add missing parameter documentation for virtio_dma_buf_attach()
function to fix kernel-doc warnings:
Warning: drivers/virtio/virtio_dma_buf.c:41 function parameter 'dma_buf' not described in 'virtio_dma_buf_attach'
Warning: drivers/virtio/virtio_dma_buf.c:41 function parameter 'attach' not described in 'virtio_dma_buf_attach'
The function documentation was missing descriptions for both the
'dma_buf' and 'attach' parameters. Add proper parameter documentation
following kernel-doc format.
Signed-off-by: WangYuli <wangyuli@uniontech.com>
Message-Id: <241C7118259DA110+20250623065210.270237-1-wangyuli@uniontech.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>