Pull namespace updates from Christian Brauner:
"This contains substantial namespace infrastructure changes including a new
system call, active reference counting, and extensive header cleanups.
The branch depends on the shared kbuild branch for -fms-extensions support.
Features:
- listns() system call
Add a new listns() system call that allows userspace to iterate
through namespaces in the system. This provides a programmatic
interface to discover and inspect namespaces, addressing
longstanding limitations:
Currently, there is no direct way for userspace to enumerate
namespaces. Applications must resort to scanning /proc/*/ns/ across
all processes, which is:
- Inefficient - requires iterating over all processes
- Incomplete - misses namespaces not attached to any running
process but kept alive by file descriptors, bind mounts, or
parent references
- Permission-heavy - requires access to /proc for many processes
- No ordering or ownership information
- No filtering per namespace type
The listns() system call solves these problems:
ssize_t listns(const struct ns_id_req *req, u64 *ns_ids,
size_t nr_ns_ids, unsigned int flags);
struct ns_id_req {
__u32 size;
__u32 spare;
__u64 ns_id;
struct /* listns */ {
__u32 ns_type;
__u32 spare2;
__u64 user_ns_id;
};
};
Features include:
- Pagination support for large namespace sets
- Filtering by namespace type (MNT_NS, NET_NS, USER_NS, etc.)
- Filtering by owning user namespace
- Permission checks respecting namespace isolation
- Active Reference Counting
Introduce an active reference count that tracks namespace
visibility to userspace. A namespace is visible in the following
cases:
- The namespace is in use by a task
- The namespace is persisted through a VFS object (namespace file
descriptor or bind-mount)
- The namespace is a hierarchical type and is the parent of child
namespaces
The active reference count does not regulate lifetime (that's still
done by the normal reference count) - it only regulates visibility
to namespace file handles and listns().
This prevents resurrection of namespaces that are pinned only for
internal kernel reasons (e.g., user namespaces held by
file->f_cred, lazy TLB references on idle CPUs, etc.) which should
not be accessible via (1)-(3).
- Unified Namespace Tree
Introduce a unified tree structure for all namespaces with:
- Fixed IDs assigned to initial namespaces
- Lookup based solely on inode number
- Maintained list of owned namespaces per user namespace
- Simplified rbtree comparison helpers
Cleanups
- Header Reorganization:
- Move namespace types into separate header (ns_common_types.h)
- Decouple nstree from ns_common header
- Move nstree types into separate header
- Switch to new ns_tree_{node,root} structures with helper functions
- Use guards for ns_tree_lock
- Initial Namespace Reference Count Optimization
- Make all reference counts on initial namespaces a nop to avoid
pointless cacheline ping-pong for namespaces that can never go
away
- Drop custom reference count initialization for initial namespaces
- Add NS_COMMON_INIT() macro and use it for all namespaces
- pid: rely on common reference count behavior
- Miscellaneous Cleanups
- Rename exit_task_namespaces() to exit_nsproxy_namespaces()
- Rename is_initial_namespace() and make argument const
- Use boolean to indicate anonymous mount namespace
- Simplify owner list iteration in nstree
- nsfs: raise SB_I_NODEV, SB_I_NOEXEC, and DCACHE_DONTCACHE explicitly
- nsfs: use inode_just_drop()
- pidfs: raise DCACHE_DONTCACHE explicitly
- pidfs: simplify PIDFD_GET__NAMESPACE ioctls
- libfs: allow to specify s_d_flags
- cgroup: add cgroup namespace to tree after owner is set
- nsproxy: fix free_nsproxy() and simplify create_new_namespaces()
Fixes:
- setns(pidfd, ...) race condition
Fix a subtle race when using pidfds with setns(). When the target
task exits after prepare_nsset() but before commit_nsset(), the
namespace's active reference count might have been dropped. If
setns() then installs the namespaces, it would bump the active
reference count from zero without taking the required reference on
the owner namespace, leading to underflow when later decremented.
The fix resurrects the ownership chain if necessary - if the caller
succeeded in grabbing passive references, the setns() should
succeed even if the target task exits or gets reaped.
- Return EFAULT on put_user() error instead of success
- Make sure references are dropped outside of RCU lock (some
namespaces like mount namespace sleep when putting the last
reference)
- Don't skip active reference count initialization for network
namespace
- Add asserts for active refcount underflow
- Add asserts for initial namespace reference counts (both passive
and active)
- ipc: enable is_ns_init_id() assertions
- Fix kernel-doc comments for internal nstree functions
- Selftests
- 15 active reference count tests
- 9 listns() functionality tests
- 7 listns() permission tests
- 12 inactive namespace resurrection tests
- 3 threaded active reference count tests
- commit_creds() active reference tests
- Pagination and stress tests
- EFAULT handling test
- nsid tests fixes"
* tag 'namespace-6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (103 commits)
pidfs: simplify PIDFD_GET_<type>_NAMESPACE ioctls
nstree: fix kernel-doc comments for internal functions
nsproxy: fix free_nsproxy() and simplify create_new_namespaces()
selftests/namespaces: fix nsid tests
ns: drop custom reference count initialization for initial namespaces
pid: rely on common reference count behavior
ns: add asserts for initial namespace active reference counts
ns: add asserts for initial namespace reference counts
ns: make all reference counts on initial namespace a nop
ipc: enable is_ns_init_id() assertions
fs: use boolean to indicate anonymous mount namespace
ns: rename is_initial_namespace()
ns: make is_initial_namespace() argument const
nstree: use guards for ns_tree_lock
nstree: simplify owner list iteration
nstree: switch to new structures
nstree: add helper to operate on struct ns_tree_{node,root}
nstree: move nstree types into separate header
nstree: decouple from ns_common header
ns: move namespace types into separate header
...
Pull misc vfs updates from Christian Brauner:
"Features:
- Cheaper MAY_EXEC handling for path lookup. This elides MAY_WRITE
permission checks during path lookup and adds the
IOP_FASTPERM_MAY_EXEC flag so filesystems like btrfs can avoid
expensive permission work.
- Hide dentry_cache behind runtime const machinery.
- Add German Maglione as virtiofs co-maintainer.
Cleanups:
- Tidy up and inline step_into() and walk_component() for improved
code generation.
- Re-enable IOCB_NOWAIT writes to files. This refactors file
timestamp update logic, fixing a layering bypass in btrfs when
updating timestamps on device files and improving FMODE_NOCMTIME
handling in VFS now that nfsd started using it.
- Path lookup optimizations extracting slowpaths into dedicated
routines and adding branch prediction hints for mntput_no_expire(),
fd_install(), lookup_slow(), and various other hot paths.
- Enable clang's -fms-extensions flag, requiring a JFS rename to
avoid conflicts.
- Remove spurious exports in fs/file_attr.c.
- Stop duplicating union pipe_index declaration. This depends on the
shared kbuild branch that brings in -fms-extensions support which
is merged into this branch.
- Use MD5 library instead of crypto_shash in ecryptfs.
- Use largest_zero_folio() in iomap_dio_zero().
- Replace simple_strtol/strtoul with kstrtoint/kstrtouint in init and
initrd code.
- Various typo fixes.
Fixes:
- Fix emergency sync for btrfs. Btrfs requires an explicit sync_fs()
call with wait == 1 to commit super blocks. The emergency sync path
never passed this, leaving btrfs data uncommitted during emergency
sync.
- Use local kmap in watch_queue's post_one_notification().
- Add hint prints in sb_set_blocksize() for LBS dependency on THP"
* tag 'vfs-6.19-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (35 commits)
MAINTAINERS: add German Maglione as virtiofs co-maintainer
fs: inline step_into() and walk_component()
fs: tidy up step_into() & friends before inlining
orangefs: use inode_update_timestamps directly
btrfs: fix the comment on btrfs_update_time
btrfs: use vfs_utimes to update file timestamps
fs: export vfs_utimes
fs: lift the FMODE_NOCMTIME check into file_update_time_flags
fs: refactor file timestamp update logic
include/linux/fs.h: trivial fix: regualr -> regular
fs/splice.c: trivial fix: pipes -> pipe's
fs: mark lookup_slow() as noinline
fs: add predicts based on nd->depth
fs: move mntput_no_expire() slowpath into a dedicated routine
fs: remove spurious exports in fs/file_attr.c
watch_queue: Use local kmap in post_one_notification()
fs: touch up predicts in path lookup
fs: move fd_install() slowpath into a dedicated routine and provide commentary
fs: hide dentry_cache behind runtime const machinery
fs: touch predicts in do_dentry_open()
...
Pull timer fix from Borislav Petkov:
- Have timekeeping aux clocks sysfs interface setup function return an
error code on failure instead of success
* tag 'timers_urgent_for_v6.18_rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
timekeeping: Fix error code in tk_aux_sysfs_init()
Pull dma-mapping fixes from Marek Szyprowski:
"Two last minute fixes for the recently modified DMA API infrastructure:
- proper handling of DMA_ATTR_MMIO in dma_iova_unlink() function (me)
- regression fix for the code refactoring related to P2PDMA (Pranjal
Shrivastava)"
* tag 'dma-mapping-6.18-2025-11-27' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux:
dma-direct: Fix missing sg_dma_len assignment in P2PDMA bus mappings
iommu/dma: add missing support for DMA_ATTR_MMIO for dma_iova_unlink()
Pull ring-buffer fix from Steven Rostedt:
- Do not allow mmapped ring buffer to be split
When the ring buffer VMA is split by a partial munmap or a MAP_FIXED,
the kernel calls vm_ops->close() on each portion. This causes the
ring_buffer_unmap() to be called multiple times. This causes
subsequent calls to return -ENODEV and triggers a warning.
There's no reason to allow user space to split up memory mapping of
the ring buffer. Have it return -EINVAL when that happens.
* tag 'trace-ringbuffer-v6.18-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing: Fix WARN_ON in tracing_buffers_mmap_close for split VMAs
Prior to commit a25e7962db ("PCI/P2PDMA: Refactor the p2pdma mapping
helpers"), P2P segments were mapped using the pci_p2pdma_map_segment()
helper. This helper was responsible for populating sg->dma_address,
marking the bus address, and also setting sg_dma_len(sg).
The refactor[1] removed this helper and moved the mapping logic directly
into the callers. While iommu_dma_map_sg() was correctly updated to set
the length in the new flow, it was missed in dma_direct_map_sg().
Thus, in dma_direct_map_sg(), the PCI_P2PDMA_MAP_BUS_ADDR case sets the
dma_address and marks the segment, but immediately executes 'continue',
which causes the loop to skip the standard assignment logic at the end:
sg_dma_len(sg) = sg->length;
As a result, when CONFIG_NEED_SG_DMA_LENGTH is enabled, the dma_length
field remains uninitialized (zero) for P2P bus address mappings. This
breaks upper-layer drivers (for e.g. RDMA/IB) that rely on sg_dma_len()
to determine the transfer size.
Fix this by explicitly setting the DMA length in the
PCI_P2PDMA_MAP_BUS_ADDR case before continuing to the next scatterlist
entry.
Fixes: a25e7962db ("PCI/P2PDMA: Refactor the p2pdma mapping helpers")
Reported-by: Jacob Moroni <jmoroni@google.com>
Signed-off-by: Pranjal Shrivastava <praan@google.com>
[1]
https://lore.kernel.org/all/ac14a0e94355bf898de65d023ccf8a2ad22a3ece.1746424934.git.leon@kernel.org/
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Shivaji Kant <shivajikant@google.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Link: https://lore.kernel.org/r/20251126114112.3694469-1-praan@google.com
If kobject_create_and_add() fails on the first iteration, then the error
code is set to -ENOMEM which is correct. But if it fails in subsequent
iterations then "ret" is zero, which means success, but it should be
-ENOMEM.
Set the error code to -ENOMEM correctly.
Fixes: 7b5ab04f03 ("timekeeping: Fix resource leak in tk_aux_sysfs_init() error paths")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Malaya Kumar Rout <mrout@redhat.com>
Link: https://patch.msgid.link/aSW1R8q5zoY_DgQE@stanley.mountain
Pull timer fixes from Ingo Molnar:
- Fix a race in timer->function clearing in timer_shutdown_sync()
- Fix a timekeeper sysfs-setup resource leak in error paths
- Fix the NOHZ report_idle_softirq() syslog rate-limiting
logic to have no side effects on the return value
* tag 'timers-urgent-2025-11-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
timers: Fix NULL function pointer race in timer_shutdown_sync()
timekeeping: Fix resource leak in tk_aux_sysfs_init() error paths
tick/sched: Fix bogus condition in report_idle_softirq()
There is a race condition between timer_shutdown_sync() and timer
expiration that can lead to hitting a WARN_ON in expire_timers().
The issue occurs when timer_shutdown_sync() clears the timer function
to NULL while the timer is still running on another CPU. The race
scenario looks like this:
CPU0 CPU1
<SOFTIRQ>
lock_timer_base()
expire_timers()
base->running_timer = timer;
unlock_timer_base()
[call_timer_fn enter]
mod_timer()
...
timer_shutdown_sync()
lock_timer_base()
// For now, will not detach the timer but only clear its function to NULL
if (base->running_timer != timer)
ret = detach_if_pending(timer, base, true);
if (shutdown)
timer->function = NULL;
unlock_timer_base()
[call_timer_fn exit]
lock_timer_base()
base->running_timer = NULL;
unlock_timer_base()
...
// Now timer is pending while its function set to NULL.
// next timer trigger
<SOFTIRQ>
expire_timers()
WARN_ON_ONCE(!fn) // hit
...
lock_timer_base()
// Now timer will detach
if (base->running_timer != timer)
ret = detach_if_pending(timer, base, true);
if (shutdown)
timer->function = NULL;
unlock_timer_base()
The problem is that timer_shutdown_sync() clears the timer function
regardless of whether the timer is currently running. This can leave a
pending timer with a NULL function pointer, which triggers the
WARN_ON_ONCE(!fn) check in expire_timers().
Fix this by only clearing the timer function when actually detaching the
timer. If the timer is running, leave the function pointer intact, which is
safe because the timer will be properly detached when it finishes running.
Fixes: 0cc04e8045 ("timers: Add shutdown mechanism to the internal functions")
Signed-off-by: Yipeng Zou <zouyipeng@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20251122093942.301559-1-zouyipeng@huawei.com
Pull sched_ext fix from Tejun Heo:
"One low risk and obvious fix: scx_enable() was dereferencing an error
pointer on helper kthread creation failure. Fixed"
* tag 'sched_ext-for-6.18-rc6-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
sched_ext: Fix scx_enable() crash on helper kthread creation failure
A crash was observed when the sched_ext selftests runner was
terminated with Ctrl+\ while test 15 was running:
NIP [c00000000028fa58] scx_enable.constprop.0+0x358/0x12b0
LR [c00000000028fa2c] scx_enable.constprop.0+0x32c/0x12b0
Call Trace:
scx_enable.constprop.0+0x32c/0x12b0 (unreliable)
bpf_struct_ops_link_create+0x18c/0x22c
__sys_bpf+0x23f8/0x3044
sys_bpf+0x2c/0x6c
system_call_exception+0x124/0x320
system_call_vectored_common+0x15c/0x2ec
kthread_run_worker() returns an ERR_PTR() on failure rather than NULL,
but the current code in scx_alloc_and_add_sched() only checks for a NULL
helper. Incase of failure on SIGQUIT, the error is not handled in
scx_alloc_and_add_sched() and scx_enable() ends up dereferencing an
error pointer.
Error handling is fixed in scx_alloc_and_add_sched() to propagate
PTR_ERR() into ret, so that scx_enable() jumps to the existing error
path, avoiding random dereference on failure.
Fixes: bff3b5aec1 ("sched_ext: Move disable machinery into scx_sched")
Cc: stable@vger.kernel.org # v6.16+
Reported-and-tested-by: Samir Mulani <samir@linux.ibm.com>
Signed-off-by: Saket Kumar Bhaskar <skb99@linux.ibm.com>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Reviewed-by: Vishal Chourasia <vishalc@linux.ibm.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
tk_aux_sysfs_init() returns immediately on error during the auxiliary clock
initialization loop without cleaning up previously allocated kobjects and
sysfs groups.
If kobject_create_and_add() or sysfs_create_group() fails during loop
iteration, the parent kobjects (tko and auxo) and any previously created
child kobjects are leaked.
Fix this by adding proper error handling with goto labels to ensure all
allocated resources are cleaned up on failure. kobject_put() on the
parent kobjects will handle cleanup of their children.
Fixes: 7b95663a3d ("timekeeping: Provide interface to control auxiliary clocks")
Signed-off-by: Malaya Kumar Rout <mrout@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://patch.msgid.link/20251120150213.246777-1-mrout@redhat.com
Currently cpu-clock event always returns 0 count, e.g.,
perf stat -e cpu-clock -- sleep 1
Performance counter stats for 'sleep 1':
0 cpu-clock # 0.000 CPUs utilized
1.002308394 seconds time elapsed
The root cause is the commit 'bc4394e5e79c ("perf: Fix the throttle
error of some clock events")' adds PERF_EF_UPDATE flag check before
calling cpu_clock_event_update() to update the count, however the
PERF_EF_UPDATE flag is never set when the cpu-clock event is stopped in
counting mode (pmu->dev() -> cpu_clock_event_del() ->
cpu_clock_event_stop()). This leads to the cpu-clock event count is
never updated.
To fix this issue, force to set PERF_EF_UPDATE flag for cpu-clock event
just like what task-clock does.
Fixes: bc4394e5e7 ("perf: Fix the throttle error of some clock events")
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ian Rogers <irogers@google.com>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Link: https://patch.msgid.link/20251112080526.3971392-1-dapeng1.mi@linux.intel.com
In commit 0345691b24 ("tick/rcu: Stop allowing RCU_SOFTIRQ in idle") the
new function report_idle_softirq() was created by breaking code out of the
existing can_stop_idle_tick() for kernels v5.18 and newer.
In doing so, the code essentially went from this form:
if (A) {
static int ratelimit;
if (ratelimit < 10 && !C && A&D) {
pr_warn("NOHZ tick-stop error: ...");
ratelimit++;
}
return false;
}
to a new function:
static bool report_idle_softirq(void)
{
static int ratelimit;
if (likely(!A))
return false;
if (ratelimit < 10)
return false;
...
pr_warn("NOHZ tick-stop error: local softirq work is pending, handler #%02x!!!\n",
pending);
ratelimit++;
return true;
}
commit a7e282c777 ("tick/rcu: Fix bogus ratelimit condition") realized
ratelimit was essentially set to zero instead of ten, and hence *no*
softirq pending messages would ever be issued, but "fixed" it as:
- if (ratelimit < 10)
+ if (ratelimit >= 10)
return false;
However, this fix introduced another issue:
When ratelimit is greater than or equal 10, even if A is true, it will
directly return false. While ratelimit in the original code was only used
to control printing and will not affect the return value.
Restore the original logic and restrict ratelimit to control the printk and
not the return value.
Fixes: 0345691b24 ("tick/rcu: Stop allowing RCU_SOFTIRQ in idle")
Fixes: a7e282c777 ("tick/rcu: Fix bogus ratelimit condition")
Signed-off-by: Wen Yang <wen.yang@linux.dev>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://patch.msgid.link/20251119174525.29470-1-wen.yang@linux.dev
Pull vfs fixes from Christian Brauner:
- Fix unitialized variable in statmount_string()
- Fix hostfs mounting when passing host root during boot
- Fix dynamic lookup to fail on cell lookup failure
- Fix missing file type when reading bfs inodes from disk
- Enforce checking of sb_min_blocksize() calls and update all callers
accordingly
- Restore write access before closing files opened by open_exec() in
binfmt_misc
- Always freeze efivarfs during suspend/hibernate cycles
- Fix statmount()'s and listmount()'s grab_requested_mnt_ns() helper to
actually allow mount namespace file descriptor in addition to mount
namespace ids
- Fix tmpfs remount when noswap is specified
- Switch Landlock to iput_not_last() to remove false-positives from
might_sleep() annotations in iput()
- Remove dead node_to_mnt_ns() code
- Ensure that per-queue kobjects are successfully created
* tag 'vfs-6.18-rc7.fixes' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs:
landlock: fix splats from iput() after it started calling might_sleep()
fs: add iput_not_last()
shmem: fix tmpfs reconfiguration (remount) when noswap is set
fs/namespace: correctly handle errors returned by grab_requested_mnt_ns
power: always freeze efivarfs
binfmt_misc: restore write access before closing files opened by open_exec()
block: add __must_check attribute to sb_min_blocksize()
virtio-fs: fix incorrect check for fsvq->kobj
xfs: check the return value of sb_min_blocksize() in xfs_fs_fill_super
isofs: check the return value of sb_min_blocksize() in isofs_fill_super
exfat: check return value of sb_min_blocksize in exfat_read_boot_sector
vfat: fix missing sb_min_blocksize() return value checks
mnt: Remove dead code which might prevent from building
bfs: Reconstruct file type when loading from disk
afs: Fix dynamic lookup to fail on cell lookup failure
hostfs: Fix only passing host root in boot stage with new mount
fs: Fix uninitialized 'offp' in statmount_string()
Pull sched_ext fixes from Tejun Heo:
"Five fixes addressing PREEMPT_RT compatibility and locking issues.
Three commits fix potential deadlocks and sleeps in atomic contexts on
RT kernels by converting locks to raw spinlocks and ensuring IRQ work
runs in hard-irq context. The remaining two fix unsafe locking in the
debug dump path and a variable dereference typo"
* tag 'sched_ext-for-6.18-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
sched_ext: Use IRQ_WORK_INIT_HARD() to initialize rq->scx.kick_cpus_irq_work
sched_ext: Fix possible deadlock in the deferred_irq_workfn()
sched/ext: convert scx_tasks_lock to raw spinlock
sched_ext: Fix unsafe locking in the scx_dump_state()
sched_ext: Fix use of uninitialized variable in scx_bpf_cpuperf_set()
For PREEMPT_RT kernels, the kick_cpus_irq_workfn() be invoked in
the per-cpu irq_work/* task context and there is no rcu-read critical
section to protect. this commit therefore use IRQ_WORK_INIT_HARD() to
initialize the per-cpu rq->scx.kick_cpus_irq_work in the
init_sched_ext_class().
Signed-off-by: Zqiang <qiang.zhang@linux.dev>
Signed-off-by: Tejun Heo <tj@kernel.org>
Pull misc fixes from Andrew Morton:
"7 hotfixes. 5 are cc:stable, 4 are against mm/
All are singletons - please see the respective changelogs for details"
* tag 'mm-hotfixes-stable-2025-11-16-10-40' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
mm, swap: fix potential UAF issue for VMA readahead
selftests/user_events: fix type cast for write_index packed member in perf_test
lib/test_kho: check if KHO is enabled
mm/huge_memory: fix folio split check for anon folios in swapcache
MAINTAINERS: update David Hildenbrand's email address
crash: fix crashkernel resource shrink
mm: fix MAX_FOLIO_ORDER on powerpc configs with hugetlb
When crashkernel is configured with a high reservation, shrinking its
value below the low crashkernel reservation causes two issues:
1. Invalid crashkernel resource objects
2. Kernel crash if crashkernel shrinking is done twice
For example, with crashkernel=200M,high, the kernel reserves 200MB of high
memory and some default low memory (say 256MB). The reservation appears
as:
cat /proc/iomem | grep -i crash
af000000-beffffff : Crash kernel
433000000-43f7fffff : Crash kernel
If crashkernel is then shrunk to 50MB (echo 52428800 >
/sys/kernel/kexec_crash_size), /proc/iomem still shows 256MB reserved:
af000000-beffffff : Crash kernel
Instead, it should show 50MB:
af000000-b21fffff : Crash kernel
Further shrinking crashkernel to 40MB causes a kernel crash with the
following trace (x86):
BUG: kernel NULL pointer dereference, address: 0000000000000038
PGD 0 P4D 0
Oops: 0000 [#1] PREEMPT SMP NOPTI
<snip...>
Call Trace: <TASK>
? __die_body.cold+0x19/0x27
? page_fault_oops+0x15a/0x2f0
? search_module_extables+0x19/0x60
? search_bpf_extables+0x5f/0x80
? exc_page_fault+0x7e/0x180
? asm_exc_page_fault+0x26/0x30
? __release_resource+0xd/0xb0
release_resource+0x26/0x40
__crash_shrink_memory+0xe5/0x110
crash_shrink_memory+0x12a/0x190
kexec_crash_size_store+0x41/0x80
kernfs_fop_write_iter+0x141/0x1f0
vfs_write+0x294/0x460
ksys_write+0x6d/0xf0
<snip...>
This happens because __crash_shrink_memory()/kernel/crash_core.c
incorrectly updates the crashk_res resource object even when
crashk_low_res should be updated.
Fix this by ensuring the correct crashkernel resource object is updated
when shrinking crashkernel memory.
Link: https://lkml.kernel.org/r/20251101193741.289252-1-sourabhjain@linux.ibm.com
Fixes: 16c6006af4 ("kexec: enable kexec_crash_size to support two crash kernel regions")
Signed-off-by: Sourabh Jain <sourabhjain@linux.ibm.com>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Zhen Lei <thunder.leizhen@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Pull timer fix from Ingo Molnar:
"Fix a memory leak in the posix timer creation logic"
* tag 'timers-urgent-2025-11-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
posix-timers: Plug potential memory leak in do_timer_create()
Pull bpf fixes from Alexei Starovoitov:
- Fix interaction between livepatch and BPF fexit programs (Song Liu)
With Steven and Masami acks.
- Fix stack ORC unwind from BPF kprobe_multi (Jiri Olsa)
With Steven and Masami acks.
- Fix out of bounds access in widen_imprecise_scalars() in the verifier
(Eduard Zingerman)
- Fix conflicts between MPTCP and BPF sockmap (Jiayuan Chen)
- Fix net_sched storage collision with BPF data_meta/data_end (Eric
Dumazet)
- Add _impl suffix to BPF kfuncs with implicit args to avoid breaking
them in bpf-next when KF_IMPLICIT_ARGS is added (Mykyta Yatsenko)
* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
selftests/bpf: Test widen_imprecise_scalars() with different stack depth
bpf: account for current allocated stack depth in widen_imprecise_scalars()
bpf: Add bpf_prog_run_data_pointers()
selftests/bpf: Add mptcp test with sockmap
mptcp: Fix proto fallback detection with BPF
mptcp: Disallow MPTCP subflows from sockmap
selftests/bpf: Add stacktrace ips test for raw_tp
selftests/bpf: Add stacktrace ips test for kprobe_multi/kretprobe_multi
x86/fgraph,bpf: Fix stack ORC unwind from kprobe_multi return probe
Revert "perf/x86: Always store regs->ip in perf_callchain_kernel()"
bpf: add _impl suffix for bpf_stream_vprintk() kfunc
bpf:add _impl suffix for bpf_task_work_schedule* kfuncs
selftests/bpf: Add tests for livepatch + bpf trampoline
ftrace: bpf: Fix IPMODIFY + DIRECT in modify_ftrace_direct()
ftrace: Fix BPF fexit with livepatch
The usage pattern for widen_imprecise_scalars() looks as follows:
prev_st = find_prev_entry(env, ...);
queued_st = push_stack(...);
widen_imprecise_scalars(env, prev_st, queued_st);
Where prev_st is an ancestor of the queued_st in the explored states
tree. This ancestor is not guaranteed to have same allocated stack
depth as queued_st. E.g. in the following case:
def main():
for i in 1..2:
foo(i) // same callsite, differnt param
def foo(i):
if i == 1:
use 128 bytes of stack
iterator based loop
Here, for a second 'foo' call prev_st->allocated_stack is 128,
while queued_st->allocated_stack is much smaller.
widen_imprecise_scalars() needs to take this into account and avoid
accessing bpf_verifier_state->frame[*]->stack out of bounds.
Fixes: 2793a8b015 ("bpf: exact states comparison for iterator convergence checks")
Reported-by: Emil Tsalapatis <emil@etsalapatis.com>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20251114025730.772723-1-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Documentation build reported:
Warning: kernel/nstree.c:325 function parameter 'ns_tree' not described in '__ns_tree_adjoined_rcu'
Warning: kernel/nstree.c:325 expecting prototype for ns_tree_adjoined_rcu(). Prototype was for __ns_tree_adjoined_rcu() instead
Warning: kernel/nstree.c:353 expecting prototype for ns_tree_gen_id(). Prototype was for __ns_tree_gen_id() instead
The kernel-doc comments for `__ns_tree_adjoined_rcu()` and
`__ns_tree_gen_id()` had mismatched function names and a missing
parameter description. This patch updates the function names in the
kernel-doc headers and adds the missing `@ns_tree` parameter description
for `__ns_tree_adjoined_rcu()`.
Fixes: 885fc8ac0a ("nstree: make iterator generic")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202511061542.0LO7xKs8-lkp@intel.com
Signed-off-by: Kriish Sharma <kriish.sharma2006@gmail.com>
Link: https://patch.msgid.link/20251111112533.2254432-1-kriish.sharma2006@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
Pull power management fixes from Rafael Wysocki:
"These fix issues related to the handling of compressed hibernation
images and a recent intel_pstate driver regression:
- Fix issues related to using inadequate data types and incorrect use
of atomic variables in the compressed hibernation images handling
code that were introduced during the 6.9 development cycle (Mario
Limonciello)
- Move a X86_FEATURE_IDA check from turbo_is_disabled() to the places
where a new value for MSR_IA32_PERF_CTL is computed in intel_pstate
to address a regression preventing users from enabling turbo
frequencies post-boot (Srinivas Pandruvada)"
* tag 'pm-6.18-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
cpufreq: intel_pstate: Check IDA only before MSR_IA32_PERF_CTL writes
PM: hibernate: Fix style issues in save_compressed_image()
PM: hibernate: Use atomic64_t for compressed_size variable
PM: hibernate: Emit an error when image writing fails
Merge fixes for issues related to the handling of compressed hibernation
images that were introduced during the 6.9 development cycle.
* pm-sleep:
PM: hibernate: Fix style issues in save_compressed_image()
PM: hibernate: Use atomic64_t for compressed_size variable
PM: hibernate: Emit an error when image writing fails
For PREEMPT_RT=y kernels, the deferred_irq_workfn() is executed in
the per-cpu irq_work/* task context and not disable-irq, if the rq
returned by container_of() is current CPU's rq, the following scenarios
may occur:
lock(&rq->__lock);
<Interrupt>
lock(&rq->__lock);
This commit use IRQ_WORK_INIT_HARD() to replace init_irq_work() to
initialize rq->scx.deferred_irq_work, make the deferred_irq_workfn()
is always invoked in hard-irq context.
Signed-off-by: Zqiang <qiang.zhang@linux.dev>
Signed-off-by: Tejun Heo <tj@kernel.org>
Update scx_task_locks so that it's safe to lock/unlock in a
non-sleepable context in PREEMPT_RT kernels. scx_task_locks is
(non-raw) spinlock used to protect the list of tasks under SCX.
This list is updated during from finish_task_switch(), which
cannot sleep. Regular spinlocks can be locked in such a context
in non-RT kernels, but are sleepable under when CONFIG_PREEMPT_RT=y.
Convert scx_task_locks into a raw spinlock, which is not sleepable
even on RT kernels.
Sample backtrace:
<TASK>
dump_stack_lvl+0x83/0xa0
__might_resched+0x14a/0x200
rt_spin_lock+0x61/0x1c0
? sched_ext_dead+0x2d/0xf0
? lock_release+0xc6/0x280
sched_ext_dead+0x2d/0xf0
? srso_alias_return_thunk+0x5/0xfbef5
finish_task_switch.isra.0+0x254/0x360
__schedule+0x584/0x11d0
? srso_alias_return_thunk+0x5/0xfbef5
? srso_alias_return_thunk+0x5/0xfbef5
? tick_nohz_idle_exit+0x7e/0x120
schedule_idle+0x23/0x40
cpu_startup_entry+0x29/0x30
start_secondary+0xf8/0x100
common_startup_64+0x13e/0x148
</TASK>
Signed-off-by: Emil Tsalapatis <emil@etsalapatis.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Bring in the shared branch with the kbuild tree to enable
'-fms-extensions' for 6.19. Further namespace cleanup work
requires this extension.
Signed-off-by: Christian Brauner <brauner@kernel.org>
The setns() system call supports:
(1) namespace file descriptors (nsfd)
(2) process file descriptors (pidfd)
When using nsfds the namespaces will remain active because they are
pinned by the vfs. However, when pidfds are used things are more
complicated.
When the target task exits and passes through exit_nsproxy_namespaces()
or is reaped and thus also passes through exit_cred_namespaces() after
the setns()'ing task has called prepare_nsset() but before the active
reference count of the set of namespaces it wants to setns() to might
have been dropped already:
P1 P2
pid_p1 = clone(CLONE_NEWUSER | CLONE_NEWNET | CLONE_NEWNS)
pidfd = pidfd_open(pid_p1)
setns(pidfd, CLONE_NEWUSER | CLONE_NEWNET | CLONE_NEWNS)
prepare_nsset()
exit(0)
// ns->__ns_active_ref == 1
// parent_ns->__ns_active_ref == 1
-> exit_nsproxy_namespaces()
-> exit_cred_namespaces()
// ns_active_ref_put() will also put
// the reference on the owner of the
// namespace. If the only reason the
// owning namespace was alive was
// because it was a parent of @ns
// it's active reference count now goes
// to zero... --------------------------------
// |
// ns->__ns_active_ref == 0 |
// parent_ns->__ns_active_ref == 0 |
| commit_nsset()
-----------------> // If setns()
// now manages to install the namespaces
// it will call ns_active_ref_get()
// on them thus bumping the active reference
// count from zero again but without also
// taking the required reference on the owner.
// Thus we get:
//
// ns->__ns_active_ref == 1
// parent_ns->__ns_active_ref == 0
When later someone does ns_active_ref_put() on @ns it will underflow
parent_ns->__ns_active_ref leading to a splat from our asserts
thinking there are still active references when in fact the counter
just underflowed.
So resurrect the ownership chain if necessary as well. If the caller
succeeded to grab passive references to the set of namespaces the
setns() should simply succeed even if the target task exists or gets
reaped in the meantime and thus has dropped all active references to its
namespaces.
The race is rare and can only be triggered when using pidfs to setns()
to namespaces. Also note that active reference on initial namespaces are
nops.
Since we now always handle parent references directly we can drop
ns_ref_active_get_owner() when adding a namespace to a namespace tree.
This is now all handled uniformly in the places where the new namespaces
actually become active.
Link: https://patch.msgid.link/20251109-namespace-6-19-fixes-v1-5-ae8a4ad5a3b3@kernel.org
Fixes: 3c9820d5c64a ("ns: add active reference count")
Reported-by: syzbot+1957b26299cf3ff7890c@syzkaller.appspotmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
The mount namespace may in fact sleep when putting the last passive
reference so we need to drop the namespace reference outside of the rcu
read lock. Do this by delaying the put until the next iteration where
we've already moved on to the next namespace and legitimized it. Once we
drop the rcu read lock to call put_user() we will also drop the
reference to the previous namespace in the tree.
Link: https://patch.msgid.link/20251109-namespace-6-19-fixes-v1-3-ae8a4ad5a3b3@kernel.org
Fixes: 76b6f5dfb3 ("nstree: add listns()")
Signed-off-by: Christian Brauner <brauner@kernel.org>
kho_vmalloc_unpreserve_chunk() calls __kho_unpreserve() with end_pfn as
pfn + 1. This happens to work for 0-order pages, but leaks higher order
pages.
For example, say order 2 pages back the allocation. During preservation,
they get preserved in the order 2 bitmaps, but
kho_vmalloc_unpreserve_chunk() would try to unpreserve them from the order
0 bitmaps, which should not have these bits set anyway, leaving the order
2 bitmaps untouched. This results in the pages being carried over to the
next kernel. Nothing will free those pages in the next boot, leaking
them.
Fix this by taking the order into account when calculating the end PFN for
__kho_unpreserve().
Link: https://lkml.kernel.org/r/20251103180235.71409-2-pratyush@kernel.org
Fixes: a667300bd5 ("kho: add support for preserving vmalloc allocations")
Signed-off-by: Pratyush Yadav <pratyush@kernel.org>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Alexander Graf <graf@amazon.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>