There are 2 types of logical links with a NAN peer:
- management (NMI), which is used for Tx/Rx of NAN management frames.
- data (NDI), which is used for Tx/Rx of data frames, or non-NAN
management frames.
The NMI station has two roles:
- representation of the NAN peer - for example, the peer's schedule
and the HT, VHT, HE capabilities - belong to the NMI station, and not to
the NDI ones.
- Tx/Rx of NAN management frames to/from the peer.
The NDI station is used for Tx/Rx data frames of a specific NDP that was
established with the NAN peer.
Note that a peer can choose to reuse its NMI address as the NDI address.
In that case, it is expected that two stations will be added even though
they will have the same address.
- An NDI station can only be added after the corresponding NMI station
was configured with capabilities.
- All the NDI stations will be removed before the NDI interface is brought
down.
- All NMI stations will be removed before NAN is stopped.
- Before NMI sta removal, all corresponding NDI stations will be removed
Add support for adding, removing, and changing NMI and NDI stations.
Reviewed-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20260219114327.d280936ee832.I6d859eee759bb5824a9ffd2984410faf879ba00e@changeid
Link: https://patch.msgid.link/20260318123926.206536-6-miriam.rachel.korenblit@intel.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
In NAN, unlike in other modes, there is only one set of (HT, VHT, HE)
capabilities that is used for all channels (and bands) used in the NAN
data path.
This set of capabilities will have to be a special one, for example - have
the minimum of (HT-for-5 GHz, HT-for-2.4 GHz), careful handling of the
bits that have a different meaning for each band, etc.
While we could use the exiting sband/iftype capabilities, and require
identical capabilities for all bands (makes no sense since this means
that we will have VHT capabilities in the 2.4 GHz slot),
or require that only one of the sbands will be set,
or have logic to extract the minimum and handle the conflicting bits -
it seems simpler to add a dedicated set of capabilities which is special
for NAN, and is band agnostic, to be populated by the driver.
That way we also let the driver decide how it wants to handle the
conflicting bits.
Add this special set of these capabilities to wiphy:nan_capabilities, to be
populated by the driver.
Send it to user space.
Reviewed-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20260219114327.4b6f3e4a81b4.I45422adc0df3ad4101d857a92e83f0de5cf241e1@changeid
Link: https://patch.msgid.link/20260318123926.206536-5-miriam.rachel.korenblit@intel.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Add an nl80211 API to allow user space to configure the local NAN
schedule.
The local schedule consists of a list of channel definitions and a schedule
map, in which each element covers a time slot and indicates on what
channel the device should be in that time slot.
Channels can be added to schedule even without being scheduled, for
reservation purposes.
A schedule can be configured either immedietally or be deferred, in case
there are already connected peers.
When the deferred flag is set, the command is a request from the device
to perform an announced schedule update: send the updated NAN
Availability - as set in this command - to the peers, and do the
actual switch to the new schedule on the right time (i.e. at the end of
the slot after the slot in which the update was sent to the peers).
In addition, a notification will be sent to indicate a deferred update
completion.
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20260219114327.ecca178a2de0.Ic977ab08b4ed5cf9b849e55d3a59b01ad3fbd08e@changeid
Link: https://patch.msgid.link/20260318123926.206536-2-miriam.rachel.korenblit@intel.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Add support for querying per-process buffer object (BO) memory
usage through the amdxdna GET_ARRAY UAPI.
Introduce a new query type, DRM_AMDXDNA_BO_USAGE, along with
struct amdxdna_drm_bo_usage to report BO memory usage statistics,
including heap, total, and internal usage.
Track BO memory usage on a per-client basis by maintaining counters
in GEM open/close and heap allocation/free paths. This ensures the
reported statistics reflect the current memory footprint of each
process.
Wire the new query into the GET_ARRAY implementation to expose
the usage information to userspace.
Link: 0546f2aaad
Signed-off-by: Max Zhen <max.zhen@amd.com>
Reviewed-by: Lizhi Hou <lizhi.hou@amd.com>
Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org>
Signed-off-by: Lizhi Hou <lizhi.hou@amd.com>
Link: https://patch.msgid.link/20260324163159.2425461-1-lizhi.hou@amd.com
The commit f35dbac694 ("ring-buffer: Fix to update per-subbuf entries of
persistent ring buffer") was a fix and merged upstream. It is needed for
some other work in the ring buffer. The current branch has the remote
buffer code that is shared with the Arm64 subsystem and can't be rebased.
Merge in the upstream commit to allow continuing of the ring buffer work.
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
This structure definition is used outside the kernel proper.
For example in kmod and the kernel build environment.
To allow reuse, move it to a new UAPI header.
While it is not a true UAPI, it is a common practice to have
non-UAPI interface definitions in the kernel's UAPI headers.
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Reviewed-by: Petr Pavlu <petr.pavlu@suse.com>
Reviewed-by: Nicolas Schier <nsc@kernel.org>
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Add a description of struct reader to resolve a kernel-doc warning:
Warning: include/uapi/linux/trace_mmap.h:43 struct member 'reader' not described in 'trace_buffer_meta'
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
>From hardware implementation perspective, a guest tegra241-cmdqv hardware
is different than the host hardware:
- Host HW is backed by a VINTF (HYP_OWN=1)
- Guest HW is backed by a VINTF (HYP_OWN=0)
The kernel driver has an implementation requirement of the HYP_OWN bit in
the VM. So, VMM must follow that to allow the same copy of Linux to work.
Add this requirement to the uAPI, which is currently missing.
Fixes: 4dc0d12474 ("iommu/tegra241-cmdqv: Add user-space use support")
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Will Deacon <will@kernel.org>
The struct pidfd_info currently exposes in a field called coredump_signal the
signal number (si_signo) that triggered the dump (for example, 11 for SIGSEGV).
However, it is also valuable to understand the reason why that signal was sent.
This additional context is provided by the signal code (si_code), such as 2 for
SEGV_ACCERR.
Add a new field to struct pidfd_info called coredump_code with the value of
si_code for the benefit of sysadmins who pipe core dumps to user-space programs
for later analysis. The following snippet illustrates a simplified C program
that consumes coredump_signal and coredump_code, and then logs core dump
signals and codes to a file:
int pidfd = (int)atoi(argv[1]);
struct pidfd_info info = {
.mask = PIDFD_INFO_EXIT | PIDFD_INFO_COREDUMP,
};
if (ioctl(pidfd, PIDFD_GET_INFO, &info) == 0)
if (info.mask & PIDFD_INFO_COREDUMP)
fprintf(f, "PID=%d, si_signo: %d si_code: %d\n",
info.pid, info.coredump_signal, info.coredump_code);
Assuming the program is installed under /usr/local/bin/core-logger, core dump
processing can be enabled by setting /proc/sys/kernel/core_pattern to
'|/usr/local/bin/dumpstuff %F'.
systemd-coredump(8) already uses pidfds to process core dumps, and it could be
extended to include the values of coredump_code too.
Signed-off-by: Emanuele Rocca <emanuele.rocca@arm.com>
Link: https://patch.msgid.link/acE52HIFivNZN3nE@NH27D9T0LF
Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
When set, starting xe3p_lpg, the L2 flush optimization
feature will control whether L2 is in Persistent or
Transient mode through monitoring of media activity.
To enable L2 flush optimization include new feature flag
GUC_CTL_ENABLE_L2FLUSH_OPT for Novalake platforms when
media type is detected.
Tighten UAPI validation to restrict userptr, svm and
dmabuf mappings to be either 2WAY or XA+1WAY
V5(Thomas): logic correction
V4(MattA): Modify uapi doc and commit
V3(MattA): check valid op and pat_index value
V2(MattA): validate dma-buf bos and madvise pat-index
Acked-by: José Roberto de Souza <jose.souza@intel.com>
Acked-by: Michal Mrozek <michal.mrozek@intel.com>
Acked-by: Carl Zhang <carl.zhang@intel.com>
Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Link: https://patch.msgid.link/20260305121902.1892593-9-tejas.upadhyay@intel.com
Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
drm-misc-next for v7.1:
UAPI Changes:
math:
- provide __KERNEL_DIV_ROUND_CLOSEST() in UAPI
mode:
- provide DRM_ARGB_GET*() macros for reading color components
Cross-subsystem Changes:
math:
- implement DIV_ROUND_CLOSEST() with __KERNEL_DIV_ROUND_CLOSEST()
Core Changes:
atomic:
- fix handling of colorop state in atomic updates
- provide CRTC background color
ttm:
- improve tests and doumentation
Driver Changes:
amdxdna:
- allow forcing DMA through IOMMU IOVA
- improve debugging
bridge:
- Support Lontium LT8713SX DP MST bridge plus DT bindings
imx:
- support planes behind the primary plane
- fix bus-format selection
ivpu:
- perform engine reset on TDR error
panel:
- novatek-nt36672a: Use mipi_dsi_*_multi() functions
- panel-edp: Support BOE NV153WUM-N42, CMN N153JCA-ELK, CSW MNF307QS3-2
renesas:
- rz-du: clean up
rockchip:
- support CRTC background color
sun4i:
- fix leak in init code
- clean up
tildc
- clean up
v3d:
- improve handling of struct v3d_stats
- improve error handling
- clean up
vkms:
- support CRTC background color
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Thomas Zimmermann <tzimmermann@suse.de>
Link: https://patch.msgid.link/20260320082604.GA17867@linux.fritz.box
Repair all kernel-doc warnings in um_timetravel.h:
- add one enum description
- mark "reserve" as private
- use a leading '@' on current_time
Warning: include/uapi/linux/um_timetravel.h:59 Enum value
'UM_TIMETRAVEL_SHARED_MAX_FDS' not described in enum
'um_timetravel_shared_mem_fds'
Warning: include/uapi/linux/um_timetravel.h:245 union member 'reserve'
not described in 'um_timetravel_schedshm_client'
Warning: include/uapi/linux/um_timetravel.h:288 struct member
'current_time' not described in 'um_timetravel_schedshm'
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Link: https://patch.msgid.link/20260226221112.1042008-1-rdunlap@infradead.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Refactor amdxdna GEM buffer object (BO) handling to simplify address
management and unify BO type semantics.
Introduce helper APIs to retrieve commonly used BO addresses:
- User virtual address (UVA)
- Kernel virtual address (KVA)
- Device address (IOVA/PA)
These helpers centralize address lookup logic and avoid duplicating
BO-specific handling across submission and execution paths. This also
improves readability and reduces the risk of inconsistent address
handling in future changes.
As part of the refactor:
- Rename SHMEM BO type to SHARE to better reflect its usage.
- Merge CMD BO handling into SHARE, removing special-case logic for
command buffers.
- Consolidate BO type handling paths to reduce code duplication and
simplify maintenance.
No functional change is intended. The refactor prepares the driver for
future enhancements by providing a cleaner abstraction for BO address
management.
Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org>
Signed-off-by: Max Zhen <max.zhen@amd.com>
Signed-off-by: Lizhi Hou <lizhi.hou@amd.com>
Link: https://patch.msgid.link/20260320210615.1973016-1-lizhi.hou@amd.com
As currently defined, initial_bytes is monotonically decreasing and
precedes dirty_bytes when reading from the saving file descriptor.
The transition from initial_bytes to dirty_bytes is unidirectional and
irreversible.
The initial_bytes are considered as critical data that is highly
recommended to be transferred to the target as part of PRE_COPY, without
this data, the PRE_COPY phase would be ineffective.
We come to solve the case when a new chunk of critical data is
introduced during the PRE_COPY phase and the driver would like to report
an entirely new value for the initial_bytes.
For that, we extend the VFIO_MIG_GET_PRECOPY_INFO ioctl with an output
flag named VFIO_PRECOPY_INFO_REINIT to allow drivers reporting a new
initial_bytes value during the PRE_COPY phase.
Currently, existing VFIO_MIG_GET_PRECOPY_INFO implementations don't
assign info.flags before copy_to_user(), this effectively echoes
userspace-provided flags back as output, preventing the field from being
used to report new reliable data from the drivers.
Reliable use of the new VFIO_PRECOPY_INFO_REINIT flag requires userspace
to explicitly opt in by enabling the
VFIO_DEVICE_FEATURE_MIG_PRECOPY_INFOv2 device feature.
When the caller opts in, the driver may report an entirely new
value for initial_bytes. It may be larger, it may be smaller, it may
include the previous unread initial_bytes, it may discard the previous
unread initial_bytes, up to the driver logic and state.
The presence of the VFIO_PRECOPY_INFO_REINIT output flag set by the
driver indicates that new initial data is present on the stream.
Once the caller sees this flag, the initial_bytes value should be
re-evaluated relative to the readiness state for transition to
STOP_COPY.
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://lore.kernel.org/r/20260317161753.18964-2-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
The file contains a spelling error in a source comment (succes).
Typos in comments reduce readability and make text searches less reliable
for developers and maintainers.
Replace 'succes' with 'success' in the affected comment. This is a
comment-only cleanup and does not change behavior.
Signed-off-by: Joseph Salisbury <joseph.salisbury@oracle.com>
Link: https://lore.kernel.org/r/20260316185617.166414-1-joseph.salisbury@oracle.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
Johannes Berg says:
====================
Aside from various small improvements/cleanups, not much:
- cfg80211/mac80211: S1G and UHR improvements
- hwsim: incumbent signal report test support
* tag 'wireless-next-2026-03-19' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (31 commits)
qtnfmac: use alloc_netdev macro for single queue devices
wifi: libertas: don't kill URBs in interrupt context
wifi: libertas: use USB anchors for tracking in-flight URBs
wifi: nl80211: use int for band coming from netlink
wifi: rsi_91x_usb: do not pause rfkill polling when stopping mac80211
wifi: mac80211: fix STA link removal during link removal
wifi: nl80211: reject S1G/60G with HT chantype
wifi: ieee80211: fix definition of EHT-MCS 15 in MRU
wifi: cfg80211: check non-S1G width with S1G chandef
wifi: cfg80211: restrict cfg80211_chandef_create() to only HT-based bands
wifi: mac80211: don't use cfg80211_chandef_create() for default chandef
wifi: mac80211: Remove deleted sta links in ieee80211_ml_reconf_work()
wifi: b43: use register definitions in nphy_op_software_rfkill
wifi: cfg80211: split control freq check from chandef check
wifi: mac80211: always use full chanctx compatible check
wifi: mac80211: refactor chandef tracing macros
wifi: mac80211: validate HE 6 GHz operation when EHT is used
wifi: nl80211: split out UHR operation information
wifi: mwifiex: drop redundant device reference
wifi: rt2x00: drop redundant device reference
...
====================
Link: https://patch.msgid.link/20260319082439.79875-3-johannes@sipsolutions.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Antonio Quartulli says:
====================
Included features:
* use bitops.h API when possible
* send netlink notification in case of client float event
* implement support for asymmetric peer IDs
* consolidate memory allocations during crypto operations
* add netlink notification check in selftests
* add FW mark check in selftest
* tag 'ovpn-net-next-20260317' of https://github.com/OpenVPN/ovpn-net-next:
ovpn: consolidate crypto allocations in one chunk
selftests: ovpn: add test for the FW mark feature
selftests: ovpn: check asymmetric peer-id
ovpn: add support for asymmetric peer IDs
selftests: ovpn: add notification parsing and matching
ovpn: notify userspace on client float event
ovpn: pktid: use bitops.h API
ovpn: use correct array size to parse nested attributes in ovpn_nl_key_swap_doit
selftests: ovpn: allow compiling ovpn-cli.c with mbedtls3
====================
Link: https://patch.msgid.link/20260317104023.192548-1-antonio@openvpn.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Add two parameters for drivers supporting Rx CQE coalescing /
descriptor writeback.
ETHTOOL_A_COALESCE_RX_CQE_FRAMES:
Maximum number of frames that can be coalesced into a CQE or
writeback.
ETHTOOL_A_COALESCE_RX_CQE_NSECS:
Max time in nanoseconds after the first packet arrival in a
coalesced CQE or writeback to be sent.
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Link: https://patch.msgid.link/20260317191826.1346111-2-haiyangz@linux.microsoft.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Some display controllers can be hardware programmed to show non-black
colors for pixels that are either not covered by any plane or are
exposed through transparent regions of higher planes. This feature can
help reduce memory bandwidth usage, e.g. in compositors managing a UI
with a solid background color while using smaller planes to render the
remaining content.
To support this capability, introduce the BACKGROUND_COLOR standard DRM
mode property, which can be attached to a CRTC through the
drm_crtc_attach_background_color_property() helper function.
Additionally, define a 64-bit ARGB format value to be built with the
help of a couple of dedicated DRM_ARGB64_PREP*() helpers. Individual
color components can be extracted with desired precision using the
corresponding DRM_ARGB64_GET*() macros.
Co-developed-by: Matt Roper <matthew.d.roper@intel.com>
Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
Reviewed-by: Nícolas F. R. A. Prado <nfraprado@collabora.com>
Tested-by: Diederik de Haas <diederik@cknow-tech.com>
Signed-off-by: Cristian Ciocaltea <cristian.ciocaltea@collabora.com>
Link: https://patch.msgid.link/20260303-rk3588-bgcolor-v8-2-fee377037ad1@collabora.com
Signed-off-by: Daniel Stone <daniels@collabora.com>
If the IOMMU driver reports that ATS is not supported for a device, set
the IOMMU_HW_CAP_PCI_ATS_NOT_SUPPORTED flag in the returned hardware
capabilities.
This uses a negative flag for UAPI compatibility. Existing userspace
assumes ATS is supported if no flag is present. This also ensures that
new userspace works correctly on both old and new kernels, where a
zero value implies ATS support.
When this flag is set, ATS cannot be used for the device. When it is
clear, ATS may be enabled when an appropriate HWPT is attached.
Reviewed-by: Samiullah Khawaja <skhawaja@google.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
In order to support the multipeer architecture, upon connection setup
each side of a tunnel advertises a unique ID that the other side must
include in packets sent to them. Therefore when transmitting a packet, a
peer inserts the recipient's advertised ID for that specific tunnel into
the peer ID field. When receiving a packet, a peer expects to find its
own unique receive ID for that specific tunnel in the peer ID field.
Add support for the TX peer ID and embed it into transmitting packets.
If no TX peer ID is specified, fallback to using the same peer ID both
for RX and TX in order to be compatible with the non-multipeer compliant
peers.
Cc: horms@kernel.org
Cc: donald.hunter@gmail.com
Signed-off-by: Ralf Lici <ralf@mandelbit.com>
Signed-off-by: Antonio Quartulli <antonio@openvpn.net>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Pull kvm fixes from Paolo Bonzini:
"Quite a large pull request, partly due to skipping last week and
therefore having material from ~all submaintainers in this one. About
a fourth of it is a new selftest, and a couple more changes are large
in number of files touched (fixing a -Wflex-array-member-not-at-end
compiler warning) or lines changed (reformatting of a table in the API
documentation, thanks rST).
But who am I kidding---it's a lot of commits and there are a lot of
bugs being fixed here, some of them on the nastier side like the
RISC-V ones.
ARM:
- Correctly handle deactivation of interrupts that were activated
from LRs. Since EOIcount only denotes deactivation of interrupts
that are not present in an LR, start EOIcount deactivation walk
*after* the last irq that made it into an LR
- Avoid calling into the stubs to probe for ICH_VTR_EL2.TDS when pKVM
is already enabled -- not only thhis isn't possible (pKVM will
reject the call), but it is also useless: this can only happen for
a CPU that has already booted once, and the capability will not
change
- Fix a couple of low-severity bugs in our S2 fault handling path,
affecting the recently introduced LS64 handling and the even more
esoteric handling of hwpoison in a nested context
- Address yet another syzkaller finding in the vgic initialisation,
where we would end-up destroying an uninitialised vgic with nasty
consequences
- Address an annoying case of pKVM failing to boot when some of the
memblock regions that the host is faulting in are not page-aligned
- Inject some sanity in the NV stage-2 walker by checking the limits
against the advertised PA size, and correctly report the resulting
faults
PPC:
- Fix a PPC e500 build error due to a long-standing wart that was
exposed by the recent conversion to kmalloc_obj(); rip out all the
ugliness that led to the wart
RISC-V:
- Prevent speculative out-of-bounds access using array_index_nospec()
in APLIC interrupt handling, ONE_REG regiser access, AIA CSR
access, float register access, and PMU counter access
- Fix potential use-after-free issues in kvm_riscv_gstage_get_leaf(),
kvm_riscv_aia_aplic_has_attr(), and kvm_riscv_aia_imsic_has_attr()
- Fix potential null pointer dereference in
kvm_riscv_vcpu_aia_rmw_topei()
- Fix off-by-one array access in SBI PMU
- Skip THP support check during dirty logging
- Fix error code returned for Smstateen and Ssaia ONE_REG interface
- Check host Ssaia extension when creating AIA irqchip
x86:
- Fix cases where CPUID mitigation features were incorrectly marked
as available whenever the kernel used scattered feature words for
them
- Validate _all_ GVAs, rather than just the first GVA, when
processing a range of GVAs for Hyper-V's TLB flush hypercalls
- Fix a brown paper bug in add_atomic_switch_msr()
- Use hlist_for_each_entry_srcu() when traversing mask_notifier_list,
to fix a lockdep warning; KVM doesn't hold RCU, just irq_srcu
- Ensure AVIC VMCB fields are initialized if the VM has an in-kernel
local APIC (and AVIC is enabled at the module level)
- Update CR8 write interception when AVIC is (de)activated, to fix a
bug where the guest can run in perpetuity with the CR8 intercept
enabled
- Add a quirk to skip the consistency check on FREEZE_IN_SMM, i.e. to
allow L1 hypervisors to set FREEZE_IN_SMM. This reverts (by
default) an unintentional tightening of userspace ABI in 6.17, and
provides some amount of backwards compatibility with hypervisors
who want to freeze PMCs on VM-Entry
- Validate the VMCS/VMCB on return to a nested guest from SMM,
because either userspace or the guest could stash invalid values in
memory and trigger the processor's consistency checks
Generic:
- Remove a subtle pseudo-overlay of kvm_stats_desc, which, aside from
being unnecessary and confusing, triggered compiler warnings due to
-Wflex-array-member-not-at-end
- Document that vcpu->mutex is take outside of kvm->slots_lock and
kvm->slots_arch_lock, which is intentional and desirable despite
being rather unintuitive
Selftests:
- Increase the maximum number of NUMA nodes in the guest_memfd
selftest to 64 (from 8)"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (43 commits)
KVM: selftests: Verify SEV+ guests can read and write EFER, CR0, CR4, and CR8
Documentation: kvm: fix formatting of the quirks table
KVM: x86: clarify leave_smm() return value
selftests: kvm: add a test that VMX validates controls on RSM
selftests: kvm: extract common functionality out of smm_test.c
KVM: SVM: check validity of VMCB controls when returning from SMM
KVM: VMX: check validity of VMCS controls when returning from SMM
KVM: SVM: Set/clear CR8 write interception when AVIC is (de)activated
KVM: SVM: Initialize AVIC VMCB fields if AVIC is enabled with in-kernel APIC
KVM: x86: Introduce KVM_X86_QUIRK_VMCS12_ALLOW_FREEZE_IN_SMM
KVM: x86: Fix SRCU list traversal in kvm_fire_mask_notifiers()
KVM: VMX: Fix a wrong MSR update in add_atomic_switch_msr()
KVM: x86: hyper-v: Validate all GVAs during PV TLB flush
KVM: x86: synthesize CPUID bits only if CPU capability is set
KVM: PPC: e500: Rip out "struct tlbe_ref"
KVM: PPC: e500: Fix build error due to using kmalloc_obj() with wrong type
KVM: selftests: Increase 'maxnode' for guest_memfd tests
KVM: arm64: pkvm: Don't reprobe for ICH_VTR_EL2.TDS on CPU hotplug
KVM: arm64: vgic: Pick EOIcount deactivations from AP-list tail
KVM: arm64: Remove the redundant ISB in __kvm_at_s1e2()
...
Devlink instances without a backing device use bus_name
"devlink_index" and dev_name set to the decimal index string.
When user space sends this handle, detect the pattern and perform
a direct xarray lookup by index instead of iterating all instances.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Link: https://patch.msgid.link/20260312100407.551173-6-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
UDP-Lite supports variable-length checksum and has two socket
options, UDPLITE_SEND_CSCOV and UDPLITE_RECV_CSCOV, to control
the checksum coverage.
Let's remove the support.
setsockopt(UDPLITE_SEND_CSCOV / UDPLITE_RECV_CSCOV) was only
available for UDP-Lite and returned -ENOPROTOOPT for UDP.
Now, the options are handled in ip_setsockopt() and
ipv6_setsockopt(), which still return the same error.
getsockopt(UDPLITE_SEND_CSCOV / UDPLITE_RECV_CSCOV) was available
for UDP and always returned 0, meaning full checksum, but now
-ENOPROTOOPT is returned.
Given that getsockopt() is meaningless for UDP and even the options
are not defined under include/uapi/, this should not be a problem.
$ man 7 udplite
...
BUGS
Where glibc support is missing, the following definitions
are needed:
#define IPPROTO_UDPLITE 136
#define UDPLITE_SEND_CSCOV 10
#define UDPLITE_RECV_CSCOV 11
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20260311052020.1213705-10-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Commit 94dfc73e7c ("treewide: uapi: Replace zero-length arrays with
flexible-array members") broke the userspace API for C++.
These structures ending in VLAs are typically a *header*, which can be
followed by an arbitrary number of entries. Userspace typically creates
a larger structure with some non-zero number of entries, for example in
QEMU's kvm_arch_get_supported_msr_feature():
struct {
struct kvm_msrs info;
struct kvm_msr_entry entries[1];
} msr_data = {};
While that works in C, it fails in C++ with an error like:
flexible array member 'kvm_msrs::entries' not at end of 'struct msr_data'
Fix this by using __DECLARE_FLEX_ARRAY() for the VLA, which uses [0]
for C++ compilation.
Fixes: 94dfc73e7c ("treewide: uapi: Replace zero-length arrays with flexible-array members")
Cc: stable@vger.kernel.org
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Link: https://patch.msgid.link/3abaf6aefd6e5efeff3b860ac38421d9dec908db.camel@infradead.org
[sean: tag for stable@]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Add a new keysym type KT_CSI that generates CSI tilde sequences with
automatic modifier encoding. The keysym value encodes the CSI parameter
number, producing sequences like ESC [ <value> ~ or ESC [ <value> ; <mod> ~
when Shift, Alt, or Ctrl modifiers are held.
This allows navigation keys (Home, End, Insert, Delete, PgUp, PgDn) and
function keys to generate modifier-aware escape sequences without
consuming string table entries for each modifier combination.
Define key symbols for navigation keys (K_CSI_HOME, K_CSI_END, etc.)
and function keys (K_CSI_F1 through K_CSI_F20) using standard xterm
CSI parameter values.
The modifier encoding follows the xterm convention:
mod = 1 + (shift ? 1 : 0) + (alt ? 2 : 0) + (ctrl ? 4 : 0)
Allowed CSI parameter values range from 0 to 99.
Note: The Linux console historically uses a non-standard double-bracket
format for F1-F5 (ESC [ [ A through ESC [ [ E) rather than the xterm
tilde format (ESC [ 11 ~ through ESC [ 15 ~). The K_CSI_F1 through
K_CSI_F5 definitions use the xterm format. Converting F1-F5 to KT_CSI
would require updating the "linux" terminfo entry to match. Navigation
keys and F6-F20 already use the tilde format and are fully compatible.
Signed-off-by: Nicolas Pitre <npitre@baylibre.com>
Link: https://patch.msgid.link/20260203045457.1049793-3-nico@fluxnic.net
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Add support for creating a mount namespace that contains only a copy of
the root mount from the caller's mount namespace, with none of the
child mounts. This is useful for containers and sandboxes that want to
start with a minimal mount table and populate it from scratch rather
than inheriting and then tearing down the full mount tree.
Two new flags are introduced:
- CLONE_EMPTY_MNTNS for clone3(), using the 64-bit flag space.
- UNSHARE_EMPTY_MNTNS for unshare(), reusing the
CLONE_PARENT_SETTID bit which has no meaning for unshare.
Both flags imply CLONE_NEWNS. For the unshare path,
UNSHARE_EMPTY_MNTNS is converted to CLONE_EMPTY_MNTNS in
unshare_nsproxy_namespaces() before it reaches copy_mnt_ns(), so the
mount namespace code only needs to handle a single flag.
In copy_mnt_ns(), when CLONE_EMPTY_MNTNS is set, clone_mnt() is used
instead of copy_tree() to clone only the root mount. The caller's root
and working directory are both reset to the root dentry of the new
mount.
The cleanup variables are changed from vfsmount pointers with
__free(mntput) to struct path with __free(path_put) because the empty
mount namespace path needs to release both mount and dentry references
when replacing the caller's root and pwd. In the normal (non-empty)
path only the mount component is set, and dput(NULL) is a no-op so
path_put remains correct there as well.
Link: https://patch.msgid.link/20260306-work-empty-mntns-consolidated-v1-1-6eb30529bbb0@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
Add FSMOUNT_NAMESPACE flag to fsmount() that creates a new mount
namespace with the newly created filesystem attached to a copy of the
real rootfs. This returns a namespace file descriptor instead of an
O_PATH mount fd, similar to how OPEN_TREE_NAMESPACE works for open_tree().
This allows creating a new filesystem and immediately placing it in a
new mount namespace in a single operation, which is useful for container
runtimes and other namespace-based isolation mechanisms.
The rootfs mount is created before copying the real rootfs for the new
namespace meaning that the mount namespace id for the mount of the root
of the namespace is bigger than the child mounted on top of it. We've
never explicitly given the guarantee for such ordering and I doubt
anyone relies on it. Accepting that lets us avoid copying the mount
again and also avoids having to massage may_copy_tree() to grant an
exception for fsmount->mnt->mnt_ns being NULL.
Link: https://patch.msgid.link/20260122-work-fsmount-namespace-v1-3-5ef0a886e646@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
Add a new VM_BIND flag, DRM_XE_VM_BIND_FLAG_DECOMPRESS, that lets userspace
express intent for the driver to perform on-device in-place decompression
for the GPU mapping created by a MAP bind operation.
This flag is used by subsequent driver changes to trigger scheduling of
GPU work that resolves compressed VRAM pages into an uncompressed PAT
VM mapping.
Behavior and semantics:
- Valid only for DRM_XE_VM_BIND_OP_MAP. IOCTLs using this flag on other ops
are rejected (-EINVAL).
- The bind's pat_index must select the device "no-compression" PAT entry;
otherwise the ioctl is rejected (-EINVAL).
- Only meaningful for VRAM-backed BOs on devices that support Flat CCS and
the required hardware generation (driver will return -EOPNOTSUPP if not).
- On success the driver schedules a migrate/resolve and installs the
returned dma_fence into the BO's kernel reservation
(DMA_RESV_USAGE_KERNEL).
Compute PR: https://github.com/intel/compute-runtime/pull/898
v3: Rebase on latest drm-tip and add compute pr info
v2: Add kernel doc (Matt)
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: Mrozek, Michal <michal.mrozek@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Nitin Gote <nitin.r.gote@intel.com>
Acked-by: Michal Mrozek <michal.mrozek@intel.com>
Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Link: https://patch.msgid.link/20260304123758.3050386-6-nitin.r.gote@intel.com
Add a new clone3() flag CLONE_PIDFD_AUTOKILL that ties a child's
lifetime to the pidfd returned from clone3(). When the last reference to
the struct file created by clone3() is closed the kernel sends SIGKILL
to the child. A pidfd obtained via pidfd_open() for the same process
does not keep the child alive and does not trigger autokill - only the
specific struct file from clone3() has this property.
This is useful for container runtimes, service managers, and sandboxed
subprocess execution - any scenario where the child must die if the
parent crashes or abandons the pidfd.
CLONE_PIDFD_AUTOKILL requires both CLONE_PIDFD (the whole point is tying
lifetime to the pidfd file) and CLONE_AUTOREAP (a killed child with no
one to reap it would become a zombie). CLONE_THREAD is rejected because
autokill targets a process not a thread.
The clone3 pidfd is identified by the PIDFD_AUTOKILL file flag set on
the struct file at clone3() time. The pidfs .release handler checks this
flag and sends SIGKILL via do_send_sig_info(SIGKILL, SEND_SIG_PRIV, ...)
only when it is set. Files from pidfd_open() or open_by_handle_at() are
distinct struct files that do not carry this flag. dup()/fork() share the
same struct file so they extend the child's lifetime until the last
reference drops.
CLONE_PIDFD_AUTOKILL uses a privilege model based on CLONE_NNP: without
CLONE_NNP the child could escalate privileges via setuid/setgid exec
after being spawned, so the caller must have CAP_SYS_ADMIN in its user
namespace. With CLONE_NNP the child can never gain new privileges so
unprivileged usage is allowed. This is a deliberate departure from the
pdeath_signal model which is reset during secureexec and commit_creds()
rendering it useless for container runtimes that need to deprivilege
themselves.
Link: https://patch.msgid.link/20260226-work-pidfs-autoreap-v5-3-d148b984a989@kernel.org
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Add a new clone3() flag CLONE_NNP that sets no_new_privs on the child
process at clone time. This is analogous to prctl(PR_SET_NO_NEW_PRIVS)
but applied at process creation rather than requiring a separate step
after the child starts running.
CLONE_NNP is rejected with CLONE_THREAD. It's conceptually a lot simpler
if the whole thread-group is forced into NNP and not have single threads
running around with NNP.
Link: https://patch.msgid.link/20260226-work-pidfs-autoreap-v5-2-d148b984a989@kernel.org
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>