Johannes Berg says:
====================
Aside from various small improvements/cleanups, not much:
- cfg80211/mac80211: S1G and UHR improvements
- hwsim: incumbent signal report test support
* tag 'wireless-next-2026-03-19' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (31 commits)
qtnfmac: use alloc_netdev macro for single queue devices
wifi: libertas: don't kill URBs in interrupt context
wifi: libertas: use USB anchors for tracking in-flight URBs
wifi: nl80211: use int for band coming from netlink
wifi: rsi_91x_usb: do not pause rfkill polling when stopping mac80211
wifi: mac80211: fix STA link removal during link removal
wifi: nl80211: reject S1G/60G with HT chantype
wifi: ieee80211: fix definition of EHT-MCS 15 in MRU
wifi: cfg80211: check non-S1G width with S1G chandef
wifi: cfg80211: restrict cfg80211_chandef_create() to only HT-based bands
wifi: mac80211: don't use cfg80211_chandef_create() for default chandef
wifi: mac80211: Remove deleted sta links in ieee80211_ml_reconf_work()
wifi: b43: use register definitions in nphy_op_software_rfkill
wifi: cfg80211: split control freq check from chandef check
wifi: mac80211: always use full chanctx compatible check
wifi: mac80211: refactor chandef tracing macros
wifi: mac80211: validate HE 6 GHz operation when EHT is used
wifi: nl80211: split out UHR operation information
wifi: mwifiex: drop redundant device reference
wifi: rt2x00: drop redundant device reference
...
====================
Link: https://patch.msgid.link/20260319082439.79875-3-johannes@sipsolutions.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Antonio Quartulli says:
====================
Included features:
* use bitops.h API when possible
* send netlink notification in case of client float event
* implement support for asymmetric peer IDs
* consolidate memory allocations during crypto operations
* add netlink notification check in selftests
* add FW mark check in selftest
* tag 'ovpn-net-next-20260317' of https://github.com/OpenVPN/ovpn-net-next:
ovpn: consolidate crypto allocations in one chunk
selftests: ovpn: add test for the FW mark feature
selftests: ovpn: check asymmetric peer-id
ovpn: add support for asymmetric peer IDs
selftests: ovpn: add notification parsing and matching
ovpn: notify userspace on client float event
ovpn: pktid: use bitops.h API
ovpn: use correct array size to parse nested attributes in ovpn_nl_key_swap_doit
selftests: ovpn: allow compiling ovpn-cli.c with mbedtls3
====================
Link: https://patch.msgid.link/20260317104023.192548-1-antonio@openvpn.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Add two parameters for drivers supporting Rx CQE coalescing /
descriptor writeback.
ETHTOOL_A_COALESCE_RX_CQE_FRAMES:
Maximum number of frames that can be coalesced into a CQE or
writeback.
ETHTOOL_A_COALESCE_RX_CQE_NSECS:
Max time in nanoseconds after the first packet arrival in a
coalesced CQE or writeback to be sent.
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Link: https://patch.msgid.link/20260317191826.1346111-2-haiyangz@linux.microsoft.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In order to support the multipeer architecture, upon connection setup
each side of a tunnel advertises a unique ID that the other side must
include in packets sent to them. Therefore when transmitting a packet, a
peer inserts the recipient's advertised ID for that specific tunnel into
the peer ID field. When receiving a packet, a peer expects to find its
own unique receive ID for that specific tunnel in the peer ID field.
Add support for the TX peer ID and embed it into transmitting packets.
If no TX peer ID is specified, fallback to using the same peer ID both
for RX and TX in order to be compatible with the non-multipeer compliant
peers.
Cc: horms@kernel.org
Cc: donald.hunter@gmail.com
Signed-off-by: Ralf Lici <ralf@mandelbit.com>
Signed-off-by: Antonio Quartulli <antonio@openvpn.net>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Devlink instances without a backing device use bus_name
"devlink_index" and dev_name set to the decimal index string.
When user space sends this handle, detect the pattern and perform
a direct xarray lookup by index instead of iterating all instances.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Link: https://patch.msgid.link/20260312100407.551173-6-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
UDP-Lite supports variable-length checksum and has two socket
options, UDPLITE_SEND_CSCOV and UDPLITE_RECV_CSCOV, to control
the checksum coverage.
Let's remove the support.
setsockopt(UDPLITE_SEND_CSCOV / UDPLITE_RECV_CSCOV) was only
available for UDP-Lite and returned -ENOPROTOOPT for UDP.
Now, the options are handled in ip_setsockopt() and
ipv6_setsockopt(), which still return the same error.
getsockopt(UDPLITE_SEND_CSCOV / UDPLITE_RECV_CSCOV) was available
for UDP and always returned 0, meaning full checksum, but now
-ENOPROTOOPT is returned.
Given that getsockopt() is meaningless for UDP and even the options
are not defined under include/uapi/, this should not be a problem.
$ man 7 udplite
...
BUGS
Where glibc support is missing, the following definitions
are needed:
#define IPPROTO_UDPLITE 136
#define UDPLITE_SEND_CSCOV 10
#define UDPLITE_RECV_CSCOV 11
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20260311052020.1213705-10-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
inet_sk_diag_fill() populates r->idiag_timer with the following
precedence order:
1 - Retransmit timer.
4 - Probe0 timer.
2 - Keepalive timer.
This patch adds a new value, last in the list, if other timers
are not active.
5 - Delayed ACK timer.
A corresponding iproute2 patch will follow to replace "unknown"
with "delack":
ESTAB 10 0 [2002:a05:6830:1f86::]:12875 [2002:a05:6830:1f85::]:50438
timer:(unknown,003ms,0) ino:152178 sk:3004 cgroup:unreachable:189 <->
skmem:(r1344,rb12780520,t0,tb262144,f2752,w0,o250,bl0,d0) ts usec_ts
...
Also add the following enum in uapi/linux/inet_diag.h
as suggested by David Ahern.
enum {
IDIAG_TIMER_OFF,
IDIAG_TIMER_ON,
IDIAG_TIMER_KEEPALIVE,
IDIAG_TIMER_TIMEWAIT,
IDIAG_TIMER_PROBE0,
IDIAG_TIMER_DELACK,
};
Neal Cardwell suggested to test for ICSK_ACK_TIMER:
inet_csk_clear_xmit_timer() does not call sk_stop_timer()
because INET_CSK_CLEAR_TIMERS is unset.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260305114829.2163276-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pull drm fixes from Dave Airlie:
"Weekly fixes pull.
There is one mm fix in here for a HMM livelock triggered by the xe
driver tests. Otherwise it's a pretty wide range of fixes across the
board, ttm UAF regression fix, amdgpu fixes, nouveau doesn't crash my
laptop anymore fix, and a fair bit of misc.
Seems about right for rc3.
mm:
- mm: Fix a hmm_range_fault() livelock / starvation problem
pagemap:
- Revert "drm/pagemap: Disable device-to-device migration"
ttm:
- fix function return breaking reclaim
- fix build failure on PREEMPT_RT
- fix bo->resource UAF
dma-buf:
- include ioctl.h in uapi header
sched:
- fix kernel doc warning
amdgpu:
- LUT fixes
- VCN5 fix
- Dispclk fix
- SMU 13.x fix
- Fix race in VM acquire
- PSP 15.x fix
- UserQ fix
amdxdna:
- fix invalid payload for failed command
- fix NULL ptr dereference
- fix major fw version check
- avoid inconsistent fw state on error
i915/display:
- Fix for Lenovo T14 G7 display not refreshing
xe:
- Do not preempt fence signaling CS instructions
- Some leak and finalization fixes
- Workaround fix
nouveau:
- avoid runtime suspend oops when using dp aux
panthor:
- fix gem_sync argument ordering
solomon:
- fix incorrect display output
renesas:
- fix DSI divider programming
ethosu:
- fix job submit error clean-up refcount
- fix NPU_OP_ELEMENTWISE validation
- handle possible underflows in IFM size calcs"
* tag 'drm-fixes-2026-03-07' of https://gitlab.freedesktop.org/drm/kernel: (38 commits)
accel: ethosu: Handle possible underflow in IFM size calculations
accel: ethosu: Fix NPU_OP_ELEMENTWISE validation with scalar
accel: ethosu: Fix job submit error clean-up refcount underflows
accel/amdxdna: Split mailbox channel create function
drm/panthor: Correct the order of arguments passed to gem_sync
Revert "drm/syncobj: Fix handle <-> fd ioctls with dirty stack"
drm/ttm: Fix bo resource use-after-free
nouveau/dpcd: return EBUSY for aux xfer if the device is asleep
accel/amdxdna: Fix major version check on NPU1 platform
drm/amdgpu/userq: refcount userqueues to avoid any race conditions
drm/amdgpu/userq: Consolidate wait ioctl exit path
drm/amdgpu/psp: Use Indirect access address for GFX to PSP mailbox
drm/amdgpu: Fix use-after-free race in VM acquire
drm/amd/pm: remove invalid gpu_metrics.energy_accumulator on smu v13.0.x
drm/xe: Fix memory leak in xe_vm_madvise_ioctl
drm/xe/reg_sr: Fix leak on xa_store failure
drm/xe/xe2_hpg: Correct implementation of Wa_16025250150
drm/xe/gsc: Fix GSC proxy cleanup on early initialization failure
Revert "drm/pagemap: Disable device-to-device migration"
drm/i915/psr: Fix for Panel Replay X granularity DPCD register handling
...
Pull io_uring fixes from Jens Axboe:
- Fix a typo in the mock_file help text
- Fix a comment regarding IORING_SETUP_TASKRUN_FLAG in the
io_uring.h UAPI header
- Use READ_ONCE() for reading refill queue entries
- Reject SEND_VECTORIZED for fixed buffer sends, as it isn't
implemented. Currently this flag is silently ignored
This is in preparation for making these work, but first we
need a fixup so that older kernels will correctly reject them
- Ensure "0" means default for the rx page size
* tag 'io_uring-7.0-20260305' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
io_uring/zcrx: use READ_ONCE with user shared RQEs
io_uring/mock: Fix typo in help text
io_uring/net: reject SEND_VECTORIZED when unsupported
io_uring: correct comment for IORING_SETUP_TASKRUN_FLAG
io_uring/zcrx: don't set rx_page_size when not requested
A return type fix for ttm, a display fix for solomon, several misc fixes
for amdxdna, a DSI clock rate fix for rz-du, a uapi fix for syncobj, a
possible build failure fix for dma-buf, a doc warning fix for sched, a
build failure fix for ttm tests, and a crash fix when suspended for
nouveau.
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Maxime Ripard <mripard@redhat.com>
Link: https://patch.msgid.link/20260305-ludicrous-quirky-raven-7cdafd@houat
Cross-merge networking fixes after downstream PR (net-7.0-rc3).
No conflicts.
Adjacent changes:
net/netfilter/nft_set_rbtree.c
fb7fb40163 ("netfilter: nf_tables: clone set on flush only")
3aea466a43 ("netfilter: nft_set_rbtree: don't disable bh when acquiring tree lock")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
With TX pause enabled, if a device is unable to pass packets up to the
stack (e.g., CPU is hanged), the device can cause pause storm. Given
that devices can have native support to protect the neighbor from such
flooding, such events need some tracking. This support is to track TX
pause storm events for better observability.
Reviewed-by: Oleksij Rempel <o.rempel@pengutronix.de>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Mohsin Bashir <mohsin.bashr@gmail.com>
Link: https://patch.msgid.link/20260302230149.1580195-2-mohsin.bashr@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Fix some kernel-doc warnings in openvswitch.h:
Mark enum placeholders that are not used as "private" so that kernel-doc
comments are not needed for them.
Correct names for 2 enum values:
Warning: include/uapi/linux/openvswitch.h:300 Excess enum value
'@OVS_VPORT_UPCALL_SUCCESS' description in 'ovs_vport_upcall_attr'
Warning: include/uapi/linux/openvswitch.h:300 Excess enum value
'@OVS_VPORT_UPCALL_FAIL' description in 'ovs_vport_upcall_attr'
Convert one comment from "/**" kernel-doc to a plain C "/*" comment:
Warning: include/uapi/linux/openvswitch.h:638 This comment starts with
'/**', but isn't a kernel-doc comment.
* Omit attributes for notifications.
Add more kernel-doc:
- add kernel-doc for kernel-only enums;
- add missing kernel-doc for enum ovs_datapath_attr;
- add missing kernel-doc for enum ovs_flow_attr;
- add missing kernel-doc for enum ovs_sample_attr;
- add kernel-doc for enum ovs_check_pkt_len_attr;
- add kernel-doc for enum ovs_action_attr;
- add kernel-doc for enum ovs_action_push_eth;
- add kernel-doc for enum ovs_vport_attr;
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Acked-by: Ilya Maximets <i.maximets@ovn.org>
Link: https://patch.msgid.link/20260304012437.469151-1-rdunlap@infradead.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Johannes Berg says:
====================
Notable features this time:
- cfg80211/mac80211
- finished assoc frame encryption/EPPKE/802.1X-over-auth
(also hwsim)
- radar detection improvements
- 6 GHz incumbent signal detection APIs
- multi-link support for FILS, probe response
templates and client probling
- ath12k:
- monitor mode support on IPQ5332
- basic hwmon temperature reporting
* tag 'wireless-next-2026-03-04' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (38 commits)
wifi: UHR: define DPS/DBE/P-EDCA elements and fix size parsing
wifi: mac80211_hwsim: change hwsim_class to a const struct
wifi: mac80211: give the AP more time for EPPKE as well
wifi: ath12k: Remove the unused argument from the Rx data path
wifi: ath12k: Enable monitor mode support on IPQ5332
wifi: ath12k: Set up MLO after SSR
wifi: ath11k: Silence remoteproc probe deferral prints
wifi: cfg80211: support key installation on non-netdev wdevs
wifi: cfg80211: make cluster id an array
wifi: mac80211: update outdated comment
wifi: mac80211: Advertise IEEE 802.1X authentication support
wifi: mac80211: Add support for IEEE 802.1X authentication protocol in non-AP STA mode
wifi: cfg80211: add support for IEEE 802.1X Authentication Protocol
wifi: mac80211: Advertise EPPKE support based on driver capabilities
wifi: mac80211_hwsim: Advertise support for (Re)Association frame encryption
wifi: mac80211: Fix AAD/Nonce computation for management frames with MLO
wifi: rt2x00: use generic nvmem_cell_get
wifi: mac80211: fetch unsolicited probe response template by link ID
wifi: mac80211: fetch FILS discovery template by link ID
wifi: nl80211: don't allow DFS channels for NAN
...
====================
Link: https://patch.msgid.link/20260304113707.175181-3-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add an extended feature flag NL80211_EXT_FEATURE_IEEE8021X_AUTH to
allow a driver to indicate support for the IEEE 802.1X authentication
protocol in non-AP STA mode, as defined in
"IEEE P802.11bi/D4.0, 12.16.5".
In case of SME in userspace, the Authentication frame body is prepared
in userspace while the driver finalizes the Authentication frame once
it receives the required fields and elements. The driver indicates
support for IEEE 802.1X authentication using the extended feature flag
so that userspace can initiate IEEE 802.1X authentication.
When the feature flag is set, process IEEE 802.1X Authentication frames
from userspace in non-AP STA mode. If the flag is not set, reject
IEEE 802.1X Authentication frames.
Define a new authentication type NL80211_AUTHTYPE_IEEE8021X for
IEEE 802.1X authentication.
Signed-off-by: Kavita Kavita <kavita.kavita@oss.qualcomm.com>
Link: https://patch.msgid.link/20260226185553.1516290-4-kavita.kavita@oss.qualcomm.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Pull scheduler fixes from Ingo Molnar:
- Fix zero_vruntime tracking when there's a single task running
- Fix slice protection logic
- Fix the ->vprot logic for reniced tasks
- Fix lag clamping in mixed slice workloads
- Fix objtool uaccess warning (and bug) in the
!CONFIG_RSEQ_SLICE_EXTENSION case caused by unexpected un-inlining,
which triggers with older compilers
- Fix a comment in the rseq registration rseq_size bound check code
- Fix a legacy RSEQ ABI quirk that handled 32-byte area sizes
differently, which special size we now reached naturally and want to
avoid. The visible ugliness of the new reserved field will be avoided
the next time the RSEQ area is extended.
* tag 'sched-urgent-2026-03-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
rseq: slice ext: Ensure rseq feature size differs from original rseq size
rseq: Clarify rseq registration rseq_size bound check comment
sched/core: Fix wakeup_preempt's next_class tracking
rseq: Mark rseq_arm_slice_extension_timer() __always_inline
sched/fair: Fix lag clamp
sched/eevdf: Update se->vprot in reweight_entity()
sched/fair: Only set slice protection at pick time
sched/fair: Fix zero_vruntime tracking
Sync with a recent liburing fix, which corrects the comment explaining
when the IORING_SETUP_TASKRUN_FLAG setup flag is valid to use. May be
use with COOP_TASKRUN or DEFER_TASKRUN, not useful without either of
this task_work mechanisms being used.
Link: https://github.com/axboe/liburing/pull/1543
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Repair some of the comments:
- use the correct enum names
- don't use "/**" for a non-kernel-doc comment
to fix these warnings:
Warning: include/uapi/linux/nfc.h:127 Excess enum value
'@NFC_EVENT_DEVICE_DEACTIVATED' description in 'nfc_commands'
Warning: include/uapi/linux/nfc.h:204 Excess enum value
'@NFC_ATTR_APDU' description in 'nfc_attrs'
Warning: include/uapi/linux/nfc.h:302 expecting prototype for Pseudo().
Prototype was for NFC_RAW_HEADER_SIZE() instead
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Link: https://patch.msgid.link/20260226221004.1037909-1-rdunlap@infradead.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pull pci fixes from Bjorn Helgaas:
- Update MAINTAINERS email address (Shawn Guo)
- Refresh cached Endpoint driver MSI Message Address to fix a v7.0
regression when kernel changes the address after firmware has
configured it (Niklas Cassel)
- Flush Endpoint MSI-X writes so they complete before the outbound ATU
entry is unmapped (Niklas Cassel)
- Correct the PCI_CAP_EXP_ENDPOINT_SIZEOF_V2 value, which broke VMM use
of PCI capabilities (Bjorn Helgaas)
* tag 'pci-v7.0-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci:
PCI: Correct PCI_CAP_EXP_ENDPOINT_SIZEOF_V2 value
PCI: dwc: ep: Flush MSI-X write before unmapping its ATU entry
PCI: dwc: ep: Refresh MSI Message Address cache on change
MAINTAINERS: Update Shawn Guo's address for HiSilicon PCIe controller driver
fb82437fdd ("PCI: Change capability register offsets to hex") incorrectly
converted the PCI_CAP_EXP_ENDPOINT_SIZEOF_V2 value from decimal 52 to hex
0x32:
-#define PCI_CAP_EXP_ENDPOINT_SIZEOF_V2 52 /* v2 endpoints with link end here */
+#define PCI_CAP_EXP_ENDPOINT_SIZEOF_V2 0x32 /* end of v2 EPs w/ link */
This broke PCI capabilities in a VMM because subsequent ones weren't
DWORD-aligned.
Change PCI_CAP_EXP_ENDPOINT_SIZEOF_V2 to the correct value of 0x34.
fb82437fdd was from Baruch Siach <baruch@tkos.co.il>, but this was not
Baruch's fault; it's a mistake I made when applying the patch.
Fixes: fb82437fdd ("PCI: Change capability register offsets to hex")
Reported-by: David Woodhouse <dwmw2@infradead.org>
Closes: https://lore.kernel.org/all/3ae392a0158e9d9ab09a1d42150429dd8ca42791.camel@infradead.org
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Krzysztof Wilczyński <kwilczynski@kernel.org>
Cross-merge networking fixes after downstream PR (net-7.0-rc2).
Conflicts:
tools/testing/selftests/drivers/net/hw/rss_ctx.py
19c3a2a81d ("selftests: drv-net: rss: Generate unique ports for RSS context tests")
ce5a0f4612 ("selftests: drv-net: rss_ctx: test RSS contexts persist after ifdown/up")
include/net/inet_connection_sock.h
858d2a4f67 ("tcp: fix potential race in tcp_v6_syn_recv_sock()")
fcd3d039fa ("tcp: make tcp_v{4,6}_send_check() static")
https://lore.kernel.org/aZ8PSFLzBrEU3I89@sirena.org.uk
drivers/net/ethernet/mellanox/mlx5/core/en/xsk/setup.c
drivers/net/ethernet/mellanox/mlx5/core/en/xsk/pool.c
69050f8d6d ("treewide: Replace kmalloc with kmalloc_obj for non-scalar types")
bf4afc53b7 ("Convert 'alloc_obj' family to use the new default GFP_KERNEL argument")
8a96b9144f ("net/mlx5e: Alloc xsk channel param out of mlx5e_open_xsk()")
Adjacent changes:
net/netfilter/ipvs/ip_vs_ctl.c
c59bd9e62e ("ipvs: use more counters to avoid service lookups")
bf4afc53b7 ("Convert 'alloc_obj' family to use the new default GFP_KERNEL argument")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The kernel-mode PPPoE relay feature and its two associated ioctls
(PPPOEIOCSFWD and PPPOEIOCDFWD) are not used by any existing userspace
PPPoE implementations. The most commonly-used package, RP-PPPoE [1],
handles the relaying entirely in userspace.
This legacy code has remained in the driver since its introduction in
kernel 2.3.99-pre7 for over two decades, but has served no practical
purpose.
Remove the unused relay code.
[1] https://dianne.skoll.ca/projects/rp-pppoe/
Signed-off-by: Qingfang Deng <dqfext@gmail.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Link: https://patch.msgid.link/20260224015053.42472-1-dqfext@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Jakub Kicinski reported following issue in upcoming patches:
W=1 C=1 GCC build gives us:
net/bridge/netfilter/nf_conntrack_bridge.c: note: in included file (through
../include/linux/if_pppox.h, ../include/uapi/linux/netfilter_bridge.h,
../include/linux/netfilter_bridge.h): include/uapi/linux/if_pppox.h:
153:29: warning: array of flexible structures
sparse doesn't like that hdr has a zero-length array which overlaps
proto. The kernel code doesn't currently need those arrays.
PPPoE connection is functional after applying this patch.
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Reviewed-by: Kees Cook <kees@kernel.org>
Signed-off-by: Eric Woudstra <ericwouds@gmail.com>
Link: https://patch.msgid.link/20260224155030.106918-1-ericwouds@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Before rseq became extensible, its original size was 32 bytes even
though the active rseq area was only 20 bytes. This had the following
impact in terms of userspace ecosystem evolution:
* The GNU libc between 2.35 and 2.39 expose a __rseq_size symbol set
to 32, even though the size of the active rseq area is really 20.
* The GNU libc 2.40 changes this __rseq_size to 20, thus making it
express the active rseq area.
* Starting from glibc 2.41, __rseq_size corresponds to the
AT_RSEQ_FEATURE_SIZE from getauxval(3).
This means that users of __rseq_size can always expect it to
correspond to the active rseq area, except for the value 32, for
which the active rseq area is 20 bytes.
Exposing a 32 bytes feature size would make life needlessly painful
for userspace. Therefore, add a reserved field at the end of the
rseq area to bump the feature size to 33 bytes. This reserved field
is expected to be replaced with whatever field will come next,
expecting that this field will be larger than 1 byte.
The effect of this change is to increase the size from 32 to 64 bytes
before we actually have fields using that memory.
Clarify the allocation size and alignment requirements in the struct
rseq uapi comment.
Change the value returned by getauxval(AT_RSEQ_ALIGN) to return the
value of the active rseq area size rounded up to next power of 2, which
guarantees that the rseq structure will always be aligned on the nearest
power of two large enough to contain it, even as it grows. Change the
alignment check in the rseq registration accordingly.
This will minimize the amount of ABI corner-cases we need to document
and require userspace to play games with. The rule stays simple when
__rseq_size != 32:
#define rseq_field_available(field) (__rseq_size >= offsetofend(struct rseq_abi, field))
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260220200642.1317826-3-mathieu.desnoyers@efficios.com
Pull Hyper-V updates from Wei Liu:
- Debugfs support for MSHV statistics (Nuno Das Neves)
- Support for the integrated scheduler (Stanislav Kinsburskii)
- Various fixes for MSHV memory management and hypervisor status
handling (Stanislav Kinsburskii)
- Expose more capabilities and flags for MSHV partition management
(Anatol Belski, Muminul Islam, Magnus Kulke)
- Miscellaneous fixes to improve code quality and stability (Carlos
López, Ethan Nelson-Moore, Li RongQing, Michael Kelley, Mukesh
Rathor, Purna Pavan Chandra Aekkaladevi, Stanislav Kinsburskii, Uros
Bizjak)
- PREEMPT_RT fixes for vmbus interrupts (Jan Kiszka)
* tag 'hyperv-next-signed-20260218' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux: (34 commits)
mshv: Handle insufficient root memory hypervisor statuses
mshv: Handle insufficient contiguous memory hypervisor status
mshv: Introduce hv_deposit_memory helper functions
mshv: Introduce hv_result_needs_memory() helper function
mshv: Add SMT_ENABLED_GUEST partition creation flag
mshv: Add nested virtualization creation flag
Drivers: hv: vmbus: Simplify allocation of vmbus_evt
mshv: expose the scrub partition hypercall
mshv: Add support for integrated scheduler
mshv: Use try_cmpxchg() instead of cmpxchg()
x86/hyperv: Fix error pointer dereference
x86/hyperv: Reserve 3 interrupt vectors used exclusively by MSHV
Drivers: hv: vmbus: Use kthread for vmbus interrupts on PREEMPT_RT
x86/hyperv: Remove ASM_CALL_CONSTRAINT with VMMCALL insn
x86/hyperv: Use savesegment() instead of inline asm() to save segment registers
mshv: fix SRCU protection in irqfd resampler ack handler
mshv: make field names descriptive in a header struct
x86/hyperv: Update comment in hyperv_cleanup()
mshv: clear eventfd counter on irqfd shutdown
x86/hyperv: Use memremap()/memunmap() instead of ioremap_cache()/iounmap()
...
Pull networking fixes from Jakub Kicinski:
"Including fixes from Netfilter.
Current release - new code bugs:
- net: fix backlog_unlock_irq_restore() vs CONFIG_PREEMPT_RT
- eth: mlx5e: XSK, Fix unintended ICOSQ change
- phy_port: correctly recompute the port's linkmodes
- vsock: prevent child netns mode switch from local to global
- couple of kconfig fixes for new symbols
Previous releases - regressions:
- nfc: nci: fix false-positive parameter validation for packet data
- net: do not delay zero-copy skbs in skb_attempt_defer_free()
Previous releases - always broken:
- mctp: ensure our nlmsg responses to user space are zero-initialised
- ipv6: ioam: fix heap buffer overflow in __ioam6_fill_trace_data()
- fixes for ICMP rate limiting
Misc:
- intel: fix PCI device ID conflict between i40e and ipw2200"
* tag 'net-7.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (85 commits)
net: nfc: nci: Fix parameter validation for packet data
net/mlx5e: Use unsigned for mlx5e_get_max_num_channels
net/mlx5e: Fix deadlocks between devlink and netdev instance locks
net/mlx5e: MACsec, add ASO poll loop in macsec_aso_set_arm_event
net/mlx5: Fix misidentification of write combining CQE during poll loop
net/mlx5e: Fix misidentification of ASO CQE during poll loop
net/mlx5: Fix multiport device check over light SFs
bonding: alb: fix UAF in rlb_arp_recv during bond up/down
bnge: fix reserving resources from FW
eth: fbnic: Advertise supported XDP features.
rds: tcp: fix uninit-value in __inet_bind
net/rds: Fix NULL pointer dereference in rds_tcp_accept_one
octeontx2-af: Fix default entries mcam entry action
net/mlx5e: XSK, Fix unintended ICOSQ change
ipv6: icmp: icmpv6_xrlim_allow() optimization if net.ipv6.icmp.ratelimit is zero
ipv4: icmp: icmpv4_xrlim_allow() optimization if net.ipv4.icmp_ratelimit is zero
ipv6: icmp: remove obsolete code in icmpv6_xrlim_allow()
inet: move icmp_global_{credit,stamp} to a separate cache line
icmp: prevent possible overflow in icmp_global_allow()
selftests/net: packetdrill: add ipv4-mapped-ipv6 tests
...
Add support for HV_PARTITION_CREATION_FLAG_SMT_ENABLED_GUEST
to allow userspace VMMs to enable SMT for guest partitions.
Expose this via new MSHV_PT_BIT_SMT_ENABLED_GUEST flag in the UAPI.
Without this flag, the hypervisor schedules guest VPs incorrectly,
causing SMT unusable.
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Introduce HV_PARTITION_CREATION_FLAG_NESTED_VIRTUALIZATION_CAPABLE to
indicate support for nested virtualization during partition creation.
This enables clearer configuration and capability checks for nested
virtualization scenarios.
Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
Signed-off-by: Muminul Islam <muislam@microsoft.com>
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Pull dmaengine updates from Vinod Koul:
"Core:
- Add Frank Li as susbstem reviewer to help with reviews
New Support:
- Mediatek support for Dimensity 6300 and 9200 controller
- Qualcomm Kaanapali and Glymur GPI DMA engine
- Synopsis DW AXI Agilex5
- Renesas RZ/V2N SoC
- Atmel microchip lan9691-dma
- Tegra ADMA tegra264
Updates:
- sg_nents_for_dma() helper use in subsystem
- pm_runtime_mark_last_busy() redundant call update for subsystem
- Residue support for xilinx AXIDMA driver
- Intel Max SGL Size Support and capabilities for DSA3.0
- AXI dma larger than 32bits address support"
* tag 'dmaengine-7.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vkoul/dmaengine: (64 commits)
dmaengine: add Frank Li as reviewer
dt-bindings: dma: qcom,gpi: Update max interrupts lines to 16
dmaengine: fsl-edma: don't explicitly disable clocks in .remove()
dmaengine: xilinx: xdma: use sg_nents_for_dma() helper
dmaengine: sh: use sg_nents_for_dma() helper
dmaengine: sa11x0: use sg_nents_for_dma() helper
dmaengine: qcom: bam_dma: use sg_nents_for_dma() helper
dmaengine: qcom: adm: use sg_nents_for_dma() helper
dmaengine: pxa-dma: use sg_nents_for_dma() helper
dmaengine: lgm: use sg_nents_for_dma() helper
dmaengine: k3dma: use sg_nents_for_dma() helper
dmaengine: dw-axi-dmac: use sg_nents_for_dma() helper
dmaengine: bcm2835-dma: use sg_nents_for_dma() helper
dmaengine: axi-dmac: use sg_nents_for_dma() helper
dmaengine: altera-msgdma: use sg_nents_for_dma() helper
scatterlist: introduce sg_nents_for_dma() helper
dmaengine: idxd: Add Max SGL Size Support for DSA3.0
dmaengine: idxd: Expose DSA3.0 capabilities through sysfs
dmaengine: sh: rz-dmac: Make channel irq local
dmaengine: pl08x: Fix comment stating the difference between PL080 and PL081
...
Pull char/misc/IIO driver updates from Greg KH:
"Here is the big set of char/misc/iio and other smaller driver
subsystem changes for 7.0-rc1. Lots of little things in here,
including:
- Loads of iio driver changes and updates and additions
- gpib driver updates
- interconnect driver updates
- i3c driver updates
- hwtracing (coresight and intel) driver updates
- deletion of the obsolete mwave driver
- binder driver updates (rust and c versions)
- mhi driver updates (causing a merge conflict, see below)
- mei driver updates
- fsi driver updates
- eeprom driver updates
- lots of other small char and misc driver updates and cleanups
All of these have been in linux-next for a while, with no reported
issues"
* tag 'char-misc-7.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (297 commits)
mux: mmio: fix regmap leak on probe failure
rust_binder: return p from rust_binder_transaction_target_node()
drivers: android: binder: Update ARef imports from sync::aref
rust_binder: fix needless borrow in context.rs
iio: magn: mmc5633: Fix Kconfig for combination of I3C as module and driver builtin
iio: sca3000: Fix a resource leak in sca3000_probe()
iio: proximity: rfd77402: Add interrupt handling support
iio: proximity: rfd77402: Document device private data structure
iio: proximity: rfd77402: Use devm-managed mutex initialization
iio: proximity: rfd77402: Use kernel helper for result polling
iio: proximity: rfd77402: Align polling timeout with datasheet
iio: cros_ec: Allow enabling/disabling calibration mode
iio: frequency: ad9523: correct kernel-doc bad line warning
iio: buffer: buffer_impl.h: fix kernel-doc warnings
iio: gyro: itg3200: Fix unchecked return value in read_raw
MAINTAINERS: add entry for ADE9000 driver
iio: accel: sca3000: remove unused last_timestamp field
iio: accel: adxl372: remove unused int2_bitmask field
iio: adc: ad7766: Use iio_trigger_generic_data_rdy_poll()
iio: magnetometer: Remove IRQF_ONESHOT
...
Pull more io_uring updates from Jens Axboe:
"This is a mix of cleanups and fixes. No major fixes in here, just a
bunch of little fixes. Some of them marked for stable as it fixes
behavioral issues
- Fix an issue with SOCKET_URING_OP_SETSOCKOPT for netlink sockets,
due to a too restrictive check on it having an ioctl handler
- Remove a redundant SQPOLL check in ring creation
- Kill dead accounting for zero-copy send, which doesn't use ->buf
or ->len post the initial setup
- Fix missing clamp of the allocation hint, which could cause
allocations to fall outside of the range the application asked
for. Still within the allowed limits.
- Fix for IORING_OP_PIPE's handling of direct descriptors
- Tweak to the API for the newly added BPF filters, making them
more future proof in terms of how applications deal with them
- A few fixes for zcrx, fixing a few error handling conditions
- Fix for zcrx request flag checking
- Add support for querying the zcrx page size
- Improve the NO_SQARRAY static branch inc/dec, avoiding busy
conditions causing too much traffic
- Various little cleanups"
* tag 'io_uring-7.0-20260216' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
io_uring/bpf_filter: pass in expected filter payload size
io_uring/bpf_filter: move filter size and populate helper into struct
io_uring/cancel: de-unionize file and user_data in struct io_cancel_data
io_uring/rsrc: improve regbuf iov validation
io_uring: remove unneeded io_send_zc accounting
io_uring/cmd_net: fix too strict requirement on ioctl
io_uring: delay sqarray static branch disablement
io_uring/query: add query.h copyright notice
io_uring/query: return support for custom rx page size
io_uring/zcrx: check unsupported flags on import
io_uring/zcrx: fix post open error handling
io_uring/zcrx: fix sgtable leak on mapping failures
io_uring: use the right type for creds iteration
io_uring/openclose: fix io_pipe_fixed() slot tracking for specific slots
io_uring/filetable: clamp alloc_hint to the configured alloc range
io_uring/rsrc: replace reg buffer bit field with flags
io_uring/zcrx: improve types for size calculation
io_uring/tctx: avoid modifying loop variable in io_ring_add_registered_file
io_uring: simplify IORING_SETUP_DEFER_TASKRUN && !SQPOLL check
Musl defines its own struct ethhdr and thus defines __UAPI_DEF_ETHHDR to
zero. To avoid struct redefinition errors, user space is therefore
supposed to include netinet/if_ether.h before (or instead of)
linux/if_ether.h. To relieve them from this burden, include the libc
header here if not building for kernel space.
Reported-by: Alyssa Ross <hi@alyssa.is>
Suggested-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Florian Westphal <fw@strlen.de>
It's quite possible that opcodes that have payloads attached to them,
like IORING_OP_OPENAT/OPENAT2 or IORING_OP_SOCKET, that these paylods
can change over time. For example, on the openat/openat2 side, the
struct open_how argument is extensible, and could be extended in the
future to allow further arguments to be passed in.
Allow registration of a cBPF filter to give the size of the filter as
seen by userspace. If that filter is for an opcode that takes extra
payload data, allow it if the application payload expectation is the
same size than the kernels. If that is the case, the kernel supports
filtering on the payload that the application expects. If the size
differs, the behavior depends on the IO_URING_BPF_FILTER_SZ_STRICT flag:
1) If IO_URING_BPF_FILTER_SZ_STRICT is set and the size expectation
differs, fail the attempt to load the filter.
2) If IO_URING_BPF_FILTER_SZ_STRICT isn't set, allow the filter if
the userspace pdu size is smaller than what the kernel offers.
3) Regardless if IO_URING_BPF_FILTER_SZ_STRICT, fail loading the filter
if the userspace pdu size is bigger than what the kernel supports.
An attempt to load a filter due to sizing will error with -EMSGSIZE.
For that error, the registration struct will have filter->pdu_size
populated with the pdu size that the kernel uses.
Reported-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull more misc vfs updates from Christian Brauner:
"Features:
- Optimize close_range() from O(range size) to O(active FDs) by using
find_next_bit() on the open_fds bitmap instead of linearly scanning
the entire requested range. This is a significant improvement for
large-range close operations on sparse file descriptor tables.
- Add FS_XFLAG_VERITY file attribute for fs-verity files, retrievable
via FS_IOC_FSGETXATTR and file_getattr(). The flag is read-only.
Add tracepoints for fs-verity enable and verify operations,
replacing the previously removed debug printk's.
- Prevent nfsd from exporting special kernel filesystems like pidfs
and nsfs. These filesystems have custom ->open() and ->permission()
export methods that are designed for open_by_handle_at(2) only and
are incompatible with nfsd. Update the exportfs documentation
accordingly.
Fixes:
- Fix KMSAN uninit-value in ovl_fill_real() where strcmp() was used
on a non-null-terminated decrypted directory entry name from
fscrypt. This triggered on encrypted lower layers when the
decrypted name buffer contained uninitialized tail data.
The fix also adds VFS-level name_is_dot(), name_is_dotdot(), and
name_is_dot_dotdot() helpers, replacing various open-coded "." and
".." checks across the tree.
- Fix read-only fsflags not being reset together with xflags in
vfs_fileattr_set(). Currently harmless since no read-only xflags
overlap with flags, but this would cause inconsistencies for any
future shared read-only flag
- Return -EREMOTE instead of -ESRCH from PIDFD_GET_INFO when the
target process is in a different pid namespace. This lets userspace
distinguish "process exited" from "process in another namespace",
matching glibc's pidfd_getpid() behavior
Cleanups:
- Use C-string literals in the Rust seq_file bindings, replacing the
kernel::c_str!() macro (available since Rust 1.77)
- Fix typo in d_walk_ret enum comment, add porting notes for the
readlink_copy() calling convention change"
* tag 'vfs-7.0-rc1.misc.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
fs: add porting notes about readlink_copy()
pidfs: return -EREMOTE when PIDFD_GET_INFO is called on another ns
nfsd: do not allow exporting of special kernel filesystems
exportfs: clarify the documentation of open()/permission() expotrfs ops
fsverity: add tracepoints
fs: add FS_XFLAG_VERITY for fs-verity files
rust: seq_file: replace `kernel::c_str!` with C-Strings
fs: dcache: fix typo in enum d_walk_ret comment
ovl: use name_is_dot* helpers in readdir code
fs: add helpers name_is_dot{,dot,_dotdot}
ovl: Fix uninit-value in ovl_fill_real
fs: reset read-only fsflags together with xflags
fs/file: optimize close_range() complexity from O(N) to O(Sparse)
Add an ability to query if the zcrx rx page size setting is available.
Note, even when the API is supported by io_uring, the registration can
still get rejected for various reasons, e.g. when the NIC or the driver
doesn't support it, when the particular specified size is unsupported,
when the memory area doesn't satisfy all requirements, etc.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull virtio updates from Michael Tsirkin:
- in-order support in virtio core
- multiple address space support in vduse
- fixes, cleanups all over the place, notably dma alignment fixes for
non-cache-coherent systems
* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (59 commits)
vduse: avoid adding implicit padding
vhost: fix caching attributes of MMIO regions by setting them explicitly
vdpa/mlx5: update MAC address handling in mlx5_vdpa_set_attr()
vdpa/mlx5: reuse common function for MAC address updates
vdpa/mlx5: update mlx_features with driver state check
crypto: virtio: Replace package id with numa node id
crypto: virtio: Remove duplicated virtqueue_kick in virtio_crypto_skcipher_crypt_req
crypto: virtio: Add spinlock protection with virtqueue notification
Documentation: Add documentation for VDUSE Address Space IDs
vduse: bump version number
vduse: add vq group asid support
vduse: merge tree search logic of IOTLB_GET_FD and IOTLB_GET_INFO ioctls
vduse: take out allocations from vduse_dev_alloc_coherent
vduse: remove unused vaddr parameter of vduse_domain_free_coherent
vduse: refactor vdpa_dev_add for goto err handling
vhost: forbid change vq groups ASID if DRIVER_OK is set
vdpa: document set_group_asid thread safety
vduse: return internal vq group struct as map token
vduse: add vq group support
vduse: add v1 API definition
...
Pull KVM updates from Paolo Bonzini:
"Loongarch:
- Add more CPUCFG mask bits
- Improve feature detection
- Add lazy load support for FPU and binary translation (LBT) register
state
- Fix return value for memory reads from and writes to in-kernel
devices
- Add support for detecting preemption from within a guest
- Add KVM steal time test case to tools/selftests
ARM:
- Add support for FEAT_IDST, allowing ID registers that are not
implemented to be reported as a normal trap rather than as an UNDEF
exception
- Add sanitisation of the VTCR_EL2 register, fixing a number of
UXN/PXN/XN bugs in the process
- Full handling of RESx bits, instead of only RES0, and resulting in
SCTLR_EL2 being added to the list of sanitised registers
- More pKVM fixes for features that are not supposed to be exposed to
guests
- Make sure that MTE being disabled on the pKVM host doesn't give it
the ability to attack the hypervisor
- Allow pKVM's host stage-2 mappings to use the Force Write Back
version of the memory attributes by using the "pass-through'
encoding
- Fix trapping of ICC_DIR_EL1 on GICv5 hosts emulating GICv3 for the
guest
- Preliminary work for guest GICv5 support
- A bunch of debugfs fixes, removing pointless custom iterators
stored in guest data structures
- A small set of FPSIMD cleanups
- Selftest fixes addressing the incorrect alignment of page
allocation
- Other assorted low-impact fixes and spelling fixes
RISC-V:
- Fixes for issues discoverd by KVM API fuzzing in
kvm_riscv_aia_imsic_has_attr(), kvm_riscv_aia_imsic_rw_attr(), and
kvm_riscv_vcpu_aia_imsic_update()
- Allow Zalasr, Zilsd and Zclsd extensions for Guest/VM
- Transparent huge page support for hypervisor page tables
- Adjust the number of available guest irq files based on MMIO
register sizes found in the device tree or the ACPI tables
- Add RISC-V specific paging modes to KVM selftests
- Detect paging mode at runtime for selftests
s390:
- Performance improvement for vSIE (aka nested virtualization)
- Completely new memory management. s390 was a special snowflake that
enlisted help from the architecture's page table management to
build hypervisor page tables, in particular enabling sharing the
last level of page tables. This however was a lot of code (~3K
lines) in order to support KVM, and also blocked several features.
The biggest advantages is that the page size of userspace is
completely independent of the page size used by the guest:
userspace can mix normal pages, THPs and hugetlbfs as it sees fit,
and in fact transparent hugepages were not possible before. It's
also now possible to have nested guests and guests with huge pages
running on the same host
- Maintainership change for s390 vfio-pci
- Small quality of life improvement for protected guests
x86:
- Add support for giving the guest full ownership of PMU hardware
(contexted switched around the fastpath run loop) and allowing
direct access to data MSRs and PMCs (restricted by the vPMU model).
KVM still intercepts access to control registers, e.g. to enforce
event filtering and to prevent the guest from profiling sensitive
host state. This is more accurate, since it has no risk of
contention and thus dropped events, and also has significantly less
overhead.
For more information, see the commit message for merge commit
bf2c3138ae ("Merge tag 'kvm-x86-pmu-6.20' ...")
- Disallow changing the virtual CPU model if L2 is active, for all
the same reasons KVM disallows change the model after the first
KVM_RUN
- Fix a bug where KVM would incorrectly reject host accesses to PV
MSRs when running with KVM_CAP_ENFORCE_PV_FEATURE_CPUID enabled,
even if those were advertised as supported to userspace,
- Fix a bug with protected guest state (SEV-ES/SNP and TDX) VMs,
where KVM would attempt to read CR3 configuring an async #PF entry
- Fail the build if EXPORT_SYMBOL_GPL or EXPORT_SYMBOL is used in KVM
(for x86 only) to enforce usage of EXPORT_SYMBOL_FOR_KVM_INTERNAL.
Only a few exports that are intended for external usage, and those
are allowed explicitly
- When checking nested events after a vCPU is unblocked, ignore
-EBUSY instead of WARNing. Userspace can sometimes put the vCPU
into what should be an impossible state, and spurious exit to
userspace on -EBUSY does not really do anything to solve the issue
- Also throw in the towel and drop the WARN on INIT/SIPI being
blocked when vCPU is in Wait-For-SIPI, which also resulted in
playing whack-a-mole with syzkaller stuffing architecturally
impossible states into KVM
- Add support for new Intel instructions that don't require anything
beyond enumerating feature flags to userspace
- Grab SRCU when reading PDPTRs in KVM_GET_SREGS2
- Add WARNs to guard against modifying KVM's CPU caps outside of the
intended setup flow, as nested VMX in particular is sensitive to
unexpected changes in KVM's golden configuration
- Add a quirk to allow userspace to opt-in to actually suppress EOI
broadcasts when the suppression feature is enabled by the guest
(currently limited to split IRQCHIP, i.e. userspace I/O APIC).
Sadly, simply fixing KVM to honor Suppress EOI Broadcasts isn't an
option as some userspaces have come to rely on KVM's buggy behavior
(KVM advertises Supress EOI Broadcast irrespective of whether or
not userspace I/O APIC supports Directed EOIs)
- Clean up KVM's handling of marking mapped vCPU pages dirty
- Drop a pile of *ancient* sanity checks hidden behind in KVM's
unused ASSERT() macro, most of which could be trivially triggered
by the guest and/or user, and all of which were useless
- Fold "struct dest_map" into its sole user, "struct rtc_status", to
make it more obvious what the weird parameter is used for, and to
allow fropping these RTC shenanigans if CONFIG_KVM_IOAPIC=n
- Bury all of ioapic.h, i8254.h and related ioctls (including
KVM_CREATE_IRQCHIP) behind CONFIG_KVM_IOAPIC=y
- Add a regression test for recent APICv update fixes
- Handle "hardware APIC ISR", a.k.a. SVI, updates in
kvm_apic_update_apicv() to consolidate the updates, and to
co-locate SVI updates with the updates for KVM's own cache of ISR
information
- Drop a dead function declaration
- Minor cleanups
x86 (Intel):
- Rework KVM's handling of VMCS updates while L2 is active to
temporarily switch to vmcs01 instead of deferring the update until
the next nested VM-Exit.
The deferred updates approach directly contributed to several bugs,
was proving to be a maintenance burden due to the difficulty in
auditing the correctness of deferred updates, and was polluting
"struct nested_vmx" with a growing pile of booleans
- Fix an SGX bug where KVM would incorrectly try to handle EPCM page
faults, and instead always reflect them into the guest. Since KVM
doesn't shadow EPCM entries, EPCM violations cannot be due to KVM
interference and can't be resolved by KVM
- Fix a bug where KVM would register its posted interrupt wakeup
handler even if loading kvm-intel.ko ultimately failed
- Disallow access to vmcb12 fields that aren't fully supported,
mostly to avoid weirdness and complexity for FRED and other
features, where KVM wants enable VMCS shadowing for fields that
conditionally exist
- Print out the "bad" offsets and values if kvm-intel.ko refuses to
load (or refuses to online a CPU) due to a VMCS config mismatch
x86 (AMD):
- Drop a user-triggerable WARN on nested_svm_load_cr3() failure
- Add support for virtualizing ERAPS. Note, correct virtualization of
ERAPS relies on an upcoming, publicly announced change in the APM
to reduce the set of conditions where hardware (i.e. KVM) *must*
flush the RAP
- Ignore nSVM intercepts for instructions that are not supported
according to L1's virtual CPU model
- Add support for expedited writes to the fast MMIO bus, a la VMX's
fastpath for EPT Misconfig
- Don't set GIF when clearing EFER.SVME, as GIF exists independently
of SVM, and allow userspace to restore nested state with GIF=0
- Treat exit_code as an unsigned 64-bit value through all of KVM
- Add support for fetching SNP certificates from userspace
- Fix a bug where KVM would use vmcb02 instead of vmcb01 when
emulating VMLOAD or VMSAVE on behalf of L2
- Misc fixes and cleanups
x86 selftests:
- Add a regression test for TPR<=>CR8 synchronization and IRQ masking
- Overhaul selftest's MMU infrastructure to genericize stage-2 MMU
support, and extend x86's infrastructure to support EPT and NPT
(for L2 guests)
- Extend several nested VMX tests to also cover nested SVM
- Add a selftest for nested VMLOAD/VMSAVE
- Rework the nested dirty log test, originally added as a regression
test for PML where KVM logged L2 GPAs instead of L1 GPAs, to
improve test coverage and to hopefully make the test easier to
understand and maintain
guest_memfd:
- Remove kvm_gmem_populate()'s preparation tracking and half-baked
hugepage handling. SEV/SNP was the only user of the tracking and it
can do it via the RMP
- Retroactively document and enforce (for SNP) that
KVM_SEV_SNP_LAUNCH_UPDATE and KVM_TDX_INIT_MEM_REGION require the
source page to be 4KiB aligned, to avoid non-trivial complexity for
something that no known VMM seems to be doing and to avoid an API
special case for in-place conversion, which simply can't support
unaligned sources
- When populating guest_memfd memory, GUP the source page in common
code and pass the refcounted page to the vendor callback, instead
of letting vendor code do the heavy lifting. Doing so avoids a
looming deadlock bug with in-place due an AB-BA conflict betwee
mmap_lock and guest_memfd's filemap invalidate lock
Generic:
- Fix a bug where KVM would ignore the vCPU's selected address space
when creating a vCPU-specific mapping of guest memory. Actually
this bug could not be hit even on x86, the only architecture with
multiple address spaces, but it's a bug nevertheless"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (267 commits)
KVM: s390: Increase permitted SE header size to 1 MiB
MAINTAINERS: Replace backup for s390 vfio-pci
KVM: s390: vsie: Fix race in acquire_gmap_shadow()
KVM: s390: vsie: Fix race in walk_guest_tables()
KVM: s390: Use guest address to mark guest page dirty
irqchip/riscv-imsic: Adjust the number of available guest irq files
RISC-V: KVM: Transparent huge page support
RISC-V: KVM: selftests: Add Zalasr extensions to get-reg-list test
RISC-V: KVM: Allow Zalasr extensions for Guest/VM
KVM: riscv: selftests: Add riscv vm satp modes
KVM: riscv: selftests: add Zilsd and Zclsd extension to get-reg-list test
riscv: KVM: allow Zilsd and Zclsd extensions for Guest/VM
RISC-V: KVM: Skip IMSIC update if vCPU IMSIC state is not initialized
RISC-V: KVM: Fix null pointer dereference in kvm_riscv_aia_imsic_rw_attr()
RISC-V: KVM: Fix null pointer dereference in kvm_riscv_aia_imsic_has_attr()
RISC-V: KVM: Remove unnecessary 'ret' assignment
KVM: s390: Add explicit padding to struct kvm_s390_keyop
KVM: LoongArch: selftests: Add steal time test case
LoongArch: KVM: Add paravirt vcpu_is_preempted() support in guest side
LoongArch: KVM: Add paravirt preempt feature in hypervisor side
...