linux

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2026-07-22 02:17:36 -04:00

Author	SHA1	Message	Date
Thomas Gleixner	042df0c1d4	futex: Add robust futex unlock IP range There will be a VDSO function to unlock robust futexes in user space. The unlock sequence is racy vs. clearing the list_pending_op pointer in the task's robust list head. To plug this race the kernel needs to know the instruction window. As the VDSO is per MM the addresses are stored in mm_struct::futex. Architectures which implement support for this have to update these addresses when the VDSO is (re)mapped and indicate the pending op pointer size which matches the IP. Arguably this could be resolved by chasing mm->context->vdso->image, but that's architecture specific and requires touching quite a few cache lines. Having it in mm::futex reduces the cache line impact and avoids having yet another set of architecture specific functionality. To support multi size robust list applications (gaming) this provides two ranges when COMPAT is enabled. Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: André Almeida <andrealmeid@igalia.com> Link: https://patch.msgid.link/20260602090535.718926819@kernel.org	2026-06-03 11:38:51 +02:00
Thomas Gleixner	3ca9595d9f	futex: Add support for unlocking robust futexes Unlocking robust non-PI futexes happens in user space with the following sequence: 1) robust_list_set_op_pending(mutex); 2) robust_list_remove(mutex); lval = 0; 3) lval = atomic_xchg(lock, lval); 4) if (lval & WAITERS) 5) sys_futex(WAKE,....); 6) robust_list_clear_op_pending(); That opens a window between #3 and #6 where the mutex could be acquired by some other task which observes that it is the last user and: A) unmaps the mutex memory B) maps a different file, which ends up covering the same address When the original task exits before reaching #6 then the kernel robust list handling observes the pending op entry and tries to fix up user space. If the newly mapped data contains the TID of the exiting thread at the address of the mutex/futex, the kernel will set the owner died bit in that memory, thereby corrupting unrelated data. PI futexes have a similar problem both for the non-contended user space unlock and the in kernel unlock: 1) robust_list_set_op_pending(mutex); 2) robust_list_remove(mutex); lval = gettid(); 3) if (!atomic_try_cmpxchg(lock, lval, 0)) 4) sys_futex(UNLOCK_PI,....); 5) robust_list_clear_op_pending(); Address the first part of the problem where the futexes have waiters and need to enter the kernel anyway. Add a new FUTEX_ROBUST_UNLOCK flag, which is valid for the sys_futex() FUTEX_UNLOCK_PI, FUTEX_WAKE, FUTEX_WAKE_BITSET operations. This deliberately omits FUTEX_WAKE_OP from this treatment as it's unclear whether this is needed and there is no usage of it in glibc either to investigate. For the futex2 syscall family this needs to be implemented with a new syscall. The sys_futex() case [ab]uses the @uaddr2 argument to hand the pointer to robust_list_head::list_pending_op into the kernel. This argument is only evaluated when the FUTEX_ROBUST_UNLOCK bit is set and is therefore backward compatible.
This is an explicit argument to avoid the lookup of the robust list pointer and retrieving the pending op pointer from there. User space has the pointer already available so it can just put it into the @uaddr2 argument. Aside from that this allows the usage of multiple robust lists in the future without any changes to the internal functions as they just operate on the provided pointer. This requires a second flag FUTEX_ROBUST_LIST32 which indicates that the robust list pointer points to a u32 and not to a u64. This is required for two reasons: 1) sys_futex() has no compat variant 2) The gaming emulators use both 64-bit and compat 32-bit robust lists in the same 64-bit application As a consequence 32-bit applications have to set this flag unconditionally so they can run on a 64-bit kernel in compat mode unmodified. 32-bit kernels return an error code when the flag is not set. 64-bit kernels will happily clear the full 64 bits if user space fails to set it. In case of FUTEX_UNLOCK_PI this clears the robust list pending op when the unlock succeeded. In case of errors, the user space value is still locked by the caller and therefore the above cannot happen. In case of FUTEX_WAKE* this does the unlock of the futex in the kernel and clears the robust list pending op when the unlock was successful. If not, the user space value is still locked and user space has to deal with the returned error. That means that the unlocking of non-PI robust futexes has to use the same try_cmpxchg() unlock scheme as PI futexes. If the clearing of the pending list op fails (fault) then the kernel clears the registered robust list pointer if it matches to prevent that exit() will try to handle invalid data. That's a valid paranoid decision because the robust list head usually sits in the TLS and if the TLS is no longer accessible then the chance for fixing up the resulting mess is very close to zero. The problem of non-contended unlocks still exists and will be addressed separately.
Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: André Almeida <andrealmeid@igalia.com> Link: https://patch.msgid.link/20260602090535.670514505@kernel.org	2026-06-03 11:38:51 +02:00
Thomas Gleixner	2cb5251d3d	futex: Provide UABI defines for robust list entry modifiers The marker for PI futexes in the robust list is a hardcoded 0x1 which lacks any sensible form of documentation. Provide proper defines for the bit and the mask and fix up the usage sites. Thereby convert the boolean pi argument into a modifier argument, which allows new modifier bits to be trivially added and conveyed. Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Reviewed-by: André Almeida <andrealmeid@igalia.com> Link: https://patch.msgid.link/20260602090535.458758556@kernel.org	2026-06-03 11:38:50 +02:00
Thomas Gleixner	1f7f4816b9	futex: Move futex related mm_struct data into a struct Having all these members in mm_struct along with the required #ifdeffery is annoying, does not allow efficient initializing of the data with memset() and makes extending it tedious. Move it into a data structure and fix up all usage sites. The extra struct for the private hash is intentional to make integration of other conditional mechanisms easier in terms of initialization and separation. Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260602090535.407756793@kernel.org	2026-06-03 11:38:49 +02:00
Thomas Gleixner	d7b3f52c86	futex: Make futex_mm_init() void Nothing fails there. Mop up the leftovers of the early version of this, which did an allocation. While at it clean up the stubs and the #ifdef comments to make the header file readable. Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260602090535.356789395@kernel.org	2026-06-03 11:38:49 +02:00
Thomas Gleixner	c1ffc9c6e4	futex: Move futex task related data into a struct Having all these members in task_struct along with the required #ifdeffery is annoying, does not allow efficient initializing of the data with memset() and makes extending it tedious. Move it into a data structure and fix up all usage sites. Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Reviewed-by: André Almeida <andrealmeid@igalia.com> Link: https://patch.msgid.link/20260602090535.308220888@kernel.org	2026-06-03 11:38:49 +02:00
Dmitry Ilvokhin	08d4a7837f	genirq: Move NULL check into irqdesc_lock guard unlock expression irqdesc_lock uses __DEFINE_UNLOCK_GUARD() directly with a custom constructor that can set .lock to NULL. In preparation for removing the NULL check from __DEFINE_UNLOCK_GUARD(), move the NULL check into the irqdesc_lock unlock expression, making the NULL handling explicit at the call site. No functional change. Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Thomas Gleixner <tglx@kernel.org> Link: https://patch.msgid.link/ab457810653e4356e29b2d74ba616478bd9328ad.1780064327.git.d@ilvokhin.com	2026-06-03 11:38:47 +02:00
Li RongQing	560000d619	dma-mapping: direct: fix missing mapping for THRU_HOST_BRIDGE segments In dma_direct_map_sg(), the case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE incorrectly used 'break' instead of falling through to MAP_NONE. As a result, segments traversing the host bridge skipped the required dma_direct_map_phys() call entirely, leaving sg->dma_address uninitialized and leading to DMA failures. Fix this by using 'fallthrough;'. Fixes: `a25e7962db` ("PCI/P2PDMA: Refactor the p2pdma mapping helpers") Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Li RongQing <lirongqing@baidu.com> Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Link: https://lore.kernel.org/r/20260603013723.2439-1-lirongqing@baidu.com	2026-06-03 08:52:40 +02:00
Rosen Penev	7318301e06	dma: map_benchmark: turn dma_sg_map_param buf into a flexible array The buf pointer was kmalloc_array()'d immediately after the parent struct allocation, with the count (granule, validated to 1..1024 by the ioctl) trivially available beforehand. Move buf to the struct tail as a flexible array member and fold the two allocations into a single kzalloc_flex(), dropping the kfree(params->buf) in both the prepare error path and unprepare. Add __counted_by for extra runtime analysis. Assisted-by: Claude:Opus-4.7 Signed-off-by: Rosen Penev <rosenp@gmail.com> Reviewed-by: Qinxin Xia <xiaqinxin@huawei.com> Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Link: https://lore.kernel.org/r/20260603031758.290538-1-rosenp@gmail.com	2026-06-03 08:20:02 +02:00
Brendan Jackman	9c860d1d5d	mm: introduce for_each_free_list() Patch series "mm: misc cleanups from __GFP_UNMAPPED series". In v2 of the __GFP_UNMAPPED series [0], we realised that some of the patches could potentially be merged as independent cleanups. These are all independent of one another, so if you think some are useful cleanups and others are pointless churn, it should be fine to just pick whatever subset you prefer. No functional change intended. This patch (of 4): There are a couple of places that iterate over the freelists with awareness of the data structures' layout. Ideally, code outside of mm should not be aware of the page allocator's freelists at all. This patch does not hide them completely; it is just a meek incremental step in that direction: provide a macro to iterate over them without needing to be aware of the actual struct fields. Link: https://lore.kernel.org/20260513-page_alloc-unmapped-prep-v1-0-dacdf5402be8@google.com Link: https://lore.kernel.org/20260513-page_alloc-unmapped-prep-v1-1-dacdf5402be8@google.com Link: https://lore.kernel.org/all/20260320-page_alloc-unmapped-v2-0-28bf1bd54f41@google.com/ [0] Signed-off-by: Brendan Jackman <jackmanb@google.com> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Vlastimil Babka (SUSE) <vbabka@kernel.org> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <kasong@tencent.com> Cc: Len Brown <lenb@kernel.org> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Wei Xu <weixugc@google.com> Cc: Yuanchu Xie <yuanchu@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-06-02 15:22:19 -07:00
Tejun Heo	02e545c429	sched_ext: Don't warn on NULL cgrp_moving_from in scx_cgroup_move_task() A WARN fires when systemd's user manager writes "+cpu +memory +pids" to its own subtree_control while a sched_ext scheduler is loaded:
  WARNING: at kernel/sched/ext.c:3227 scx_cgroup_move_task+0xa8/0xb0
    scx_cgroup_move_task+0xa8/0xb0
    sched_move_task+0x134/0x290
    cpu_cgroup_attach+0x39/0x70
    cgroup_migrate_execute+0x37d/0x450
    cgroup_update_dfl_csses+0x1e3/0x270
    cgroup_subtree_control_write+0x3e7/0x440
scx_cgroup_can_attach() arms cgrp_moving_from only when a task's cpu cgroup changes. It can still be NULL when scx_cgroup_move_task() runs, through this sequence:
  Step                               Result
  ---------------------------------  ----------------------------------
  1. cpu enabled on cgroup G         cpu css = A
  2. cpu toggled off then on for G   A killed, B created (same cgroup)
  3. an exiting task keeps A alive   migration skips it, A now stale
  4. +memory migrates G              stale A vs current B pulls cpu in
  5. cpu attach runs for all tasks   hits a live, cpu-unchanged task
  6. scx_cgroup_move_task() on it    cgrp_moving_from NULL -> WARN
The mismatch is that scx_cgroup_can_attach() keys on cgroup identity while migration drives the move on css identity, so a NULL cgrp_moving_from here is a legitimate css-only migration, not a missing prep. The call is already gated on cgrp_moving_from, so just drop the warning. ops.cgroup_prep_move() and ops.cgroup_move() stay paired. Fixes: `8195136669` ("sched_ext: Add cgroup support") Cc: stable@vger.kernel.org # v6.12+ Reported-by: Matt Fleming <mfleming@cloudflare.com> Closes: https://lore.kernel.org/all/20260601124156.2205704-1-mfleming@cloudflare.com/ Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>	2026-06-02 11:27:50 -10:00
Ji'an Zhou	74e144274a	futex/requeue: Prevent NULL pointer dereference in remove_waiter() on self-deadlock When FUTEX_CMP_REQUEUE_PI requeues a non-top waiter that already owns the target PI futex, task_blocks_on_rt_mutex() returns -EDEADLK before setting waiter->task. The subsequent remove_waiter() in rt_mutex_start_proxy_lock() dereferences the NULL waiter->task, causing a kernel crash. Add a self-deadlock check for non-top waiters before calling rt_mutex_start_proxy_lock(), analogous to the top-waiter check in futex_lock_pi_atomic(). Fixes: `3bfdc63936` ("rtmutex: Use waiter::task instead of current in remove_waiter()") Signed-off-by: Ji'an Zhou <eilaimemedsnaimel@gmail.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Cc: stable@vger.kernel.org	2026-06-02 22:27:04 +02:00
Thomas Weißschuh	96942092d5	vdso/treewide: Drop GENERIC_TIME_VSYSCALL This Kconfig symbol is not used anymore, remove it. Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Link: https://patch.msgid.link/20260519-vdso-generic_time_vsyscal-v1-3-5c2a5905d5f5@linutronix.de	2026-06-02 21:41:23 +02:00
Rosen Penev	45b49d7e3a	timers/migration: Turn tmigr_hierarchy level_list into a flexible array The level_list array is allocated separately right after the parent struct. The size of the array is already known. Move level_list to the struct tail as a flexible array member and fold the two allocations into a single kzalloc_flex(). Signed-off-by: Rosen Penev <rosenp@gmail.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Assisted-by: Claude:Opus-4.7 Link: https://patch.msgid.link/20260522231618.41622-1-rosenp@gmail.com	2026-06-02 21:34:03 +02:00
Frederic Weisbecker	d4f198c136	timers/migration: Deactivate per-capacity hierarchies under nohz_full NOHZ_FULL CPUs' global timers are guaranteed to be handled by the timekeeper CPU, which never stops its tick and therefore remains active in the hierarchy. But since the introduction of per-capacity hierarchies, this guarantee is broken because the timekeeper may not belong to the same hierarchy as all the NOHZ_FULL CPUs. Fix it by simply turning off capacity awareness when NOHZ_FULL is running and forcing a single hierarchy. NOHZ_FULL is not exactly optimized powerwise anyway. Fixes: `098cbaad8e` ("timers/migration: Split per-capacity hierarchies") Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Link: https://patch.msgid.link/20260519220926.63437-3-frederic@kernel.org	2026-06-02 21:34:03 +02:00
Frederic Weisbecker	e4a70f5fbd	timers/migration: Fix hotplug migrator selection target on asymmetric capacity machines When a top-level migrator is deactivated, either at CPU down hotplug time or when a CPU is domain isolated, a new migrator is elected among the available CPUs and woken up to take over the migration duty. However that election must happen at the scope of a given hierarchy and not globally, which the introduction of per-capacity hierarchies failed to handle. As a result a given hierarchy may end up without a migrator to handle global timers. Fix it by making sure that the new migrator belongs to the same hierarchy as the outgoing CPU. Fixes: `098cbaad8e` ("timers/migration: Split per-capacity hierarchies") Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Link: https://patch.msgid.link/20260519220926.63437-2-frederic@kernel.org	2026-06-02 21:34:03 +02:00
Frederic Weisbecker	6199f9999a	sched/cputime: Handle dyntick-idle steal time correctly The dyntick-idle steal time is currently accounted when the tick restarts but the stolen idle time is not subtracted from the idle time that was already accounted. This is to avoid observing the idle time going backward as the dyntick-idle cputime accessors can't reliably know in advance the stolen idle time. In order to maintain a forward progressing idle cputime while subtracting idle steal time from it, keep track of the previously accounted idle stolen time and subtract it from _later_ idle cputime accounting. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com> Link: https://patch.msgid.link/20260508131647.43868-16-frederic@kernel.org	2026-06-02 21:27:26 +02:00
Frederic Weisbecker	7198e3927a	sched/cputime: Handle idle irqtime gracefully The dyntick-idle cputime accounting always assumes that interrupt time accounting is enabled and consequently stops elapsing the idle time during dyntick-idle interrupts. This doesn't mix well with disabled interrupt time accounting because then idle interrupts become a cputime blind-spot. Also this feature is disabled on most configurations and the overhead of pausing dyntick-idle accounting while in idle interrupts could then be avoided. Fix the situation by pausing dyntick-idle accounting during idle interrupts only if either native vtime (which does interrupt time accounting) or generic interrupt time accounting is enabled. Also make sure that the accumulated interrupt time is not accidentally subtracted from later accounting. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com> Link: https://patch.msgid.link/20260508131647.43868-15-frederic@kernel.org	2026-06-02 21:27:26 +02:00
Frederic Weisbecker	3b45b4f188	sched/cputime: Provide get_cpu_[idle\|iowait]_time_us() off-case The last reason why get_cpu_idle/iowait_time_us() may return -1 now is if the config doesn't support nohz. The ad-hoc replacement solution by cpufreq is to compute jiffies minus the whole busy cputime. Although the intention should provide a coherent low resolution estimation of the idle and iowait time, the implementation is buggy because jiffies don't start at 0. Just provide instead a real get_cpu_[idle\|iowait]_time_us() offcase. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com> Link: https://patch.msgid.link/20260508131647.43868-14-frederic@kernel.org	2026-06-02 21:27:26 +02:00
Frederic Weisbecker	127b2eb44f	tick/sched: Consolidate idle time fetching APIs Fetching the idle cputime is available through a variety of accessors all over the place depending on the different accounting flavours and needs: - idle vtime generic accounting can be accessed by kcpustat_field(), kcpustat_cpu_fetch(), get_idle/iowait_time() and get_cpu_idle/iowait_time_us() - dynticks-idle accounting can only be accessed by get_idle/iowait_time() or get_cpu_idle/iowait_time_us() - CONFIG_NO_HZ_COMMON=n idle accounting can be accessed by kcpustat_field(), kcpustat_cpu_fetch(), or get_idle/iowait_time() but not by get_cpu_idle/iowait_time_us() Moreover get_idle/iowait_time() relies on get_cpu_idle/iowait_time_us() with a nonsensical conversion to microseconds and back to nanoseconds on the way. Start consolidating the APIs by removing get_idle/iowait_time() and making kcpustat_field() and kcpustat_cpu_fetch() work for all cases. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com> Link: https://patch.msgid.link/20260508131647.43868-13-frederic@kernel.org	2026-06-02 21:27:26 +02:00
Frederic Weisbecker	6a1f6a9dd0	tick/sched: Account tickless idle cputime only when tick is stopped There is no real point in switching to dyntick-idle cputime accounting mode if the tick is not actually stopped. This just adds overhead, notably fetching the GTOD, on each idle exit and each idle IRQ entry for no reason during short idle trips. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com> Link: https://patch.msgid.link/20260508131647.43868-12-frederic@kernel.org	2026-06-02 21:27:26 +02:00
Frederic Weisbecker	29807c524d	tick/sched: Remove unused fields Remove fields after the dyntick-idle cputime migration to scheduler code. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com> Link: https://patch.msgid.link/20260508131647.43868-11-frederic@kernel.org	2026-06-02 21:27:25 +02:00
Frederic Weisbecker	a5fe724e20	tick/sched: Move dyntick-idle cputime accounting to cputime code Although the dynticks-idle cputime accounting is necessarily tied to the tick subsystem, the actual related accounting code has no business residing there and should be part of the scheduler cputime code. Move away the relevant pieces and state machine to where they belong. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com> Link: https://patch.msgid.link/20260508131647.43868-10-frederic@kernel.org	2026-06-02 21:27:25 +02:00
Frederic Weisbecker	bd0c77cd46	tick/sched: Remove nohz disabled special case in cputime fetch Even when nohz is not runtime enabled, the dynticks idle cputime accounting can run and the common idle cputime accessors are still relevant. Remove the nohz disabled special case accordingly. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com> Link: https://patch.msgid.link/20260508131647.43868-9-frederic@kernel.org	2026-06-02 21:27:25 +02:00
Frederic Weisbecker	cf6444c3e1	tick/sched: Unify idle cputime accounting The non-vtime dynticks-idle cputime accounting is a big mess that accumulates within two concurrent statistics, each having their own shortcomings: * The accounting for online CPUs which is based on the delta between tick_nohz_start_idle() and tick_nohz_stop_idle(). Pros: - Works when the tick is off - Has nsecs granularity Cons: - Accounts idle steal time but doesn't subtract it from idle cputime. - Assumes CONFIG_IRQ_TIME_ACCOUNTING by not accounting IRQs but the IRQ time is simply ignored when CONFIG_IRQ_TIME_ACCOUNTING=n - The windows between 1) idle task scheduling and the first call to tick_nohz_start_idle() and 2) idle task between the last tick_nohz_stop_idle() and the rest of the idle time are blindspots wrt. cputime accounting (though mostly insignificant amount) - Relies on private fields outside of kernel stats, with specific accessors. * The accounting for offline CPUs which is based on ticks and the jiffies delta during which the tick was stopped. Pros: - Handles steal time correctly - Handles CONFIG_IRQ_TIME_ACCOUNTING=y and CONFIG_IRQ_TIME_ACCOUNTING=n correctly. - Handles the whole idle task - Accounts directly to kernel stats, without midlayer accumulator. Cons: - Doesn't elapse when the tick is off, which doesn't make it suitable for online CPUs. - Has TICK_NSEC granularity (jiffies) - Needs to track the dyntick-idle ticks that were accounted and subtract them from the total jiffies time spent while the tick was stopped. This is an ugly workaround. Having two different accountings for a single context is not the only problem: since those accountings are of different natures, it is possible to observe the global idle time going backward after a CPU goes offline.
Clean up the situation by introducing a hybrid approach that stays coherent and works for both online and offline CPUs: * Tick based or native vtime accounting operate before the idle loop is entered and resume once the idle loop prepares to exit. * When the idle loop starts, switch to dynticks-idle accounting as is done currently, except that the statistics accumulate directly to the relevant kernel stat fields. * Private dyntick cputime accounting fields are removed. * Works in both the online and offline cases. Further improvement will include: * Only switch to dynticks-idle cputime accounting when the tick actually goes in dynticks mode. * Handle CONFIG_IRQ_TIME_ACCOUNTING=n correctly such that the dynticks-idle accounting still elapses while on IRQs. * Correctly subtract idle steal cputime from idle time Reported-by: Xin Zhao <jackzxcui1989@163.com> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com> Link: https://patch.msgid.link/20260508131647.43868-8-frederic@kernel.org	2026-06-02 21:27:25 +02:00
Frederic Weisbecker	650a59805a	sched/cputime: Correctly support generic vtime idle time Currently whether generic vtime is running or not, the idle cputime is fetched from the nohz accounting. However generic vtime already does its own idle cputime accounting. Only the kernel stat accessors are not plugged to support it. Read the idle generic vtime cputime when it's running, this will allow to later more clearly split nohz and vtime cputime accounting. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com> Link: https://patch.msgid.link/20260508131647.43868-5-frederic@kernel.org	2026-06-02 21:27:25 +02:00
Frederic Weisbecker	080b5c6d95	sched/cputime: Remove superfluous and error prone kcpustat_field() parameter The first parameter to kcpustat_field() is a pointer to the cpu kcpustat to be fetched from. This parameter is error prone because a copy to a kcpustat could be passed by accident instead of the original one. Also the kcpustat structure can already be retrieved with the help of the mandatory CPU argument. Remove the needless parameter. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com> Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com> Link: https://patch.msgid.link/20260508131647.43868-4-frederic@kernel.org	2026-06-02 21:27:25 +02:00
Frederic Weisbecker	0236aaf07b	sched/idle: Handle offlining first in idle loop Offline handling happens from within the inner idle loop, after the beginning of dyntick cputime accounting, nohz idle load balancing and TIF_NEED_RESCHED polling. This is not necessary and even buggy because: * There is no dyntick handling to do. And calling tick_nohz_idle_enter() messes with the struct tick_sched reset that was performed on tick_sched_timer_dying(). * There is no nohz idle balancing to do. * Polling on TIF_NEED_RESCHED is irrelevant at this stage, there are no more tasks allowed to run. * No need to check if need_resched() before offline handling since stop_machine is done and all per-cpu kthreads should be done with their job. Therefore move the offline handling to the beginning of the idle loop. This will also ease the idle cputime unification later by not elapsing idle time while offline through the call to: tick_nohz_idle_enter() -> tick_nohz_start_idle() Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com> Reviewed-by: Rafael J. Wysocki (Intel) <rafael@kernel.org> Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com> Link: https://patch.msgid.link/20260508131647.43868-3-frederic@kernel.org	2026-06-02 21:27:25 +02:00
Frederic Weisbecker	86db4084b4	tick/sched: Fix TOCTOU in nohz idle time fetch When the nohz idle time is fetched, the current clock timestamp is taken outside the seqcount, which can result in a race as reported by Sashiko:
  get_cpu_sleep_time_us()                     tick_nohz_start_idle()
  -----------------------                     ---------------------
  now = ktime_get()
                                              write_seqcount_begin(idle_sleeptime_seq);
                                              idle_entrytime = ktime_get()
                                              tick_sched_flag_set(ts, TS_FLAG_IDLE_ACTIVE);
                                              write_seqcount_end(&ts->idle_sleeptime_seq);
  read_seqcount_begin(idle_sleeptime_seq)
  delta = now - idle_entrytime; //!! But now < idle_entrytime
  idle = *sleeptime + delta;
  read_seqcount_retry(&ts->idle_sleeptime_seq, seq)
Here the read side fetches the timestamp before the write side and its update. As a result the time delta computed on the read side is negative (ktime_t is signed) and breaks the cputime monotonicity guarantee. This could possibly be fixed by reading the current clock timestamp inside the seqcount but the reader overhead might then increase. Also simply checking that the current timestamp is above the idle entry time is enough to prevent any issue of the like. Fixes: `620a30fa0b` ("timers/nohz: Protect idle/iowait sleep time under seqcount") Reported-by: Sashiko Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Link: https://patch.msgid.link/20260508131647.43868-2-frederic@kernel.org	2026-06-02 21:27:24 +02:00
Naveen Kumar Chaudhary	ce4abda5e1	time: Fix off-by-one in settimeofday() usec validation The validation check uses '>' instead of '>=' when comparing tv_usec against USEC_PER_SEC, allowing the value 1000000 through. After conversion to nanoseconds (*= 1000), this produces tv_nsec == NSEC_PER_SEC, violating the timespec invariant that tv_nsec must be less than NSEC_PER_SEC. Use '>=' to reject tv_usec values that are not in the valid range of 0 to 999999. Fixes: `5e0fb1b57b` ("y2038: time: avoid timespec usage in settimeofday()") Signed-off-by: Naveen Kumar Chaudhary <naveen.osdev@gmail.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Acked-by: John Stultz <jstultz@google.com> Link: https://patch.msgid.link/4rikk44zew3s6577dugmx4jyblz7o5c57niuap6ct3td5yfm6w@gh7pcumg7qor	2026-06-02 21:07:55 +02:00
Naveen Kumar Chaudhary	c1ca14ca22	clockevents: Fix duplicate type specifier in stub function parameter The stub for arch_inlined_clockevent_set_next_coupled() has 'u64 u64 cycles' in its parameter list. Since u64 is a typedef, the compiler parses the second 'u64' as the parameter name, making 'cycles' an unused token. Remove the duplicate so the parameter is correctly named. Fixes: `89f951a1e8` ("clockevents: Provide support for clocksource coupled comparators") Signed-off-by: Naveen Kumar Chaudhary <naveen.osdev@gmail.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Link: https://patch.msgid.link/7tostpvxzdn6tobmyow63a5rweatls5kux3scqp2vzhe7mv6uq@ecr746b4hyhf	2026-06-02 21:07:55 +02:00
Maoyi Xie	766e828b01	time/namespace: Export init_time_ns and do_timens_ktime_to_host() The inline timens_ktime_to_host() compares the current time namespace against init_time_ns for the fast path. It calls do_timens_ktime_to_host() for the offset case. Both symbols are needed at link time by any caller of the inline. All current callers are builtin, but ntsync can be built as a module, which prevents it from using the inline. Export both with EXPORT_SYMBOL_GPL. Signed-off-by: Maoyi Xie <maoyixie.tju@gmail.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Link: https://patch.msgid.link/20260528063311.3300393-2-maoyixie.tju@gmail.com	2026-06-02 21:05:36 +02:00
Tejun Heo	a99ce697ea	cgroup: Migrate tasks to the root css when a controller is rebound cgroup_apply_control_disable() defers kill_css_finish() while a css is still populated, relying on css_update_populated() to fire the deferred kill once the populated count reaches zero. This deadlocks when a controller is rebound out of a hierarchy. Mounting an implicit_on_dfl controller such as perf_event as a v1 hierarchy steals it off the default hierarchy, and rebind_subsystems() kills its per-cgroup csses while they are still populated. The migration run in the same step keeps the old css for a controller no longer in the hierarchy's mask, so no task is migrated off the dying csses. Their populated count never reaches zero, the deferred kill_css_finish() never fires, and the next cgroup_lock_and_drain_offline() hangs forever under cgroup_mutex. That migration is already a no-op pass over the rebound subtree. Add cgroup_rebind_ss_mask so find_existing_css_set() resolves the leaving controllers to the root css. Their tasks are migrated there, the per-cgroup csses depopulate, and cgroup_apply_control_disable() kills them synchronously. The deferral stays correct for the rmdir and controller-disable paths it was meant for. Fixes: `1dffd95575` ("cgroup: Defer kill_css_finish() in cgroup_apply_control_disable()") Reported-by: Mark Brown <broonie@kernel.org> Closes: https://lore.kernel.org/all/41cd159c-54e5-45e0-81df-eaf36a6c028e@sirena.org.uk/ Reported-by: Bert Karwatzki <spasswolf@web.de> Closes: https://lore.kernel.org/all/4e986b4ed7e16547805d54b6e67d09120bc4d2f2.camel@web.de/ Tested-by: Mark Brown <broonie@kernel.org> Tested-by: Bert Karwatzki <spasswolf@web.de> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-06-02 08:25:29 -10:00
Uladzislau Rezki (Sony)	e853c1b285	Merge branches 'rcutorture.2026.05.24' and 'misc.2026.05.24' into rcu-merge.2026.05.24 rcutorture.2026.05.24: Torture-test updates misc.2026.05.24: Miscellaneous RCU updates	2026-06-02 19:45:08 +02:00
Arnd Bergmann	002668809b	rcu/nocb: reduce stack usage in nocb_gp_wait() When CONFIG_UBSAN_ALIGNMENT is enabled, the stack usage of nocb_gp_wait() grows above typical warning limits: In file included from kernel/rcu/tree.c:4930: kernel/rcu/tree_nocb.h: In function 'rcu_nocb_gp_kthread': kernel/rcu/tree_nocb.h:866:1: error: the frame size of 1968 bytes is larger than 1280 bytes [-Werror=frame-larger-than=] Apparently, the problem is passing rcu_data from a 'void *' pointer, which gcc assumes may be misaligned. When the function is not inlined into rcu_nocb_gp_kthread(), that is no longer visible to gcc. Add a 'noinline_for_stack' annotation that leads to skipping a lot of the alignment sanitizer checks and keeps the stack usage 60% lower here. Reviewed-by: Kunwu Chan <chentao@kylinos.cn> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>	2026-06-02 19:43:53 +02:00
Daniel Borkmann	3c56ee343f	bpf: Reject exclusive maps for bpf_map_elem iterators Exclusive maps (aka excl_prog_hash) are meant to be reachable only from the single program whose hash matches. This is enforced by check_map_prog_compatibility() when the map is referenced from a program such as signed BPF loaders. A bpf_map_elem iterator, however, binds its target map at attach time in bpf_iter_attach_map() instead of referencing it from the program, so the exclusivity check is never reached. On top of that, the iterator exposes the map value as a writable buffer. Fixes: `baefdbdf68` ("bpf: Implement exclusive map creation") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/20260602133052.423725-2-daniel@iogearbox.net Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-06-02 09:46:52 -07:00
Steven Rostedt	69efd863a7	tracing/eprobes: Allow use of BTF names to dereference pointers Add syntax to the parsing of eprobes to be able to typecast a trace event field that is a pointer to a structure. Currently, a dereference must be a number, where the user has to figure out manually the offset of a member of a structure that they want to dereference. But for event probes that records a field that happens to be a pointer to a structure, it cannot dereference these values with BTF naming, but must use numerical offsets. For example, to find out what device a sk_buff is pointing to in the net_dev_xmit trace event, one must first use gdb to find the offsets of the members of the structures: (gdb) p &((struct sk_buff *)0)->dev $1 = (struct net_device **) 0x10 (gdb) p &((struct net_device *)0)->name $2 = (char (*)[16]) 0x118 And then use the raw numbers to dereference: # echo 'e:xmit net.net_dev_xmit +0x118(+0x10($skbaddr)):string' >> dynamic_events If BTF is in the kernel, then instead, the skbaddr can be typecast to sk_buff and use the normal dereference logic. # echo 'e:xmit net.net_dev_xmit (sk_buff)skbaddr->dev->name:string' >> dynamic_events # echo 1 > events/eprobes/xmit/enable # cat trace [..] sshd-session-1022 [000] b..2. 860.249343: xmit: (net.net_dev_xmit) arg1="enp7s0" sshd-session-1022 [000] b..2. 860.250061: xmit: (net.net_dev_xmit) arg1="enp7s0" sshd-session-1022 [000] b..2. 860.250142: xmit: (net.net_dev_xmit) arg1="enp7s0" sshd-session-1022 [000] b..2. 860.263553: xmit: (net.net_dev_xmit) arg1="enp7s0" sshd-session-1022 [000] b..2. 860.283820: xmit: (net.net_dev_xmit) arg1="enp7s0" sshd-session-1022 [000] b..2. 860.302716: xmit: (net.net_dev_xmit) arg1="enp7s0" sshd-session-1022 [000] b..2. 860.322905: xmit: (net.net_dev_xmit) arg1="enp7s0" sshd-session-1022 [000] b..2. 860.342828: xmit: (net.net_dev_xmit) arg1="enp7s0" sshd-session-1022 [000] b..2. 860.362268: xmit: (net.net_dev_xmit) arg1="enp7s0" sshd-session-1022 [000] b..2. 860.382335: xmit: (net.net_dev_xmit) arg1="enp7s0" sshd-session-1022 [000] b..2. 860.400856: xmit: (net.net_dev_xmit) arg1="enp7s0" sshd-session-1022 [000] b..2. 860.419893: xmit: (net.net_dev_xmit) arg1="enp7s0" The syntax is simply: (STRUCT)(FIELD)->MEMBER[->MEMBER..] Also add comments around the #else and #endif of #ifdef CONFIG_PROBE_EVENTS_BTF_ARGS to know what they are for. Link: https://lore.kernel.org/all/20260601130746.2139d926@gandalf.local.home/ Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>	2026-06-02 23:36:22 +09:00
Naveen Kumar Chaudhary	6a579050f8	printk: fix typos in comments Fix spelling/grammatical errors in printk.c and nbcon.c: - "precation" -> "precautionary" - "othrewise" -> "otherwise" - "An usable" -> "A usable" - "made a progress" -> "made progress" - "preemtible" -> "preemptible" - "mechasism" -> "mechanism" - "ownerhip" -> "ownership" Signed-off-by: Naveen Kumar Chaudhary <naveen.osdev@gmail.com> Link: https://patch.msgid.link/pakfewagyzb7da3yuxnaxdaoma5w4j2c7i3xebmcld3xy4mqs5@zxsx2idpxrdq Reviewed-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Petr Mladek <pmladek@suse.com>	2026-06-02 15:36:06 +02:00
Peter Zijlstra	f666241e6b	sched/fair: Unify cfs_rq throttling via account_cfs_rq_runtime() assign_cfs_rq_runtime() during update_curr() sets the resched indicator and relies on check_cfs_rq_runtime() during pick_next_task() / put_prev_entity() to throttle the hierarchy once the current task is preempted / blocks. Per-task throttle, on the other hand, uses throttle_cfs_rq() to simply propagate the throttle signals, and then relies on task work to individually throttle the runnable tasks on their way out to userspace. Remove check_cfs_rq_runtime() and unify throttling into account_cfs_rq_runtime(), which only sets the cfs_rq->throttled, cfs_rq->throttle_count indicators via throttle_cfs_rq() and optionally adds the task work to the current task (donor) if it is on the throttled hierarchy. throttle_cfs_rq() requests sched_cfs_bandwidth_slice() worth of bandwidth for the current hierarchy, which enables it to continue running uninterrupted when selected. For the rest, it requests a bare minimum of "1" to ensure some bandwidth is available and the "runtime_remaining > 0" checks pass once selected. For SCHED_PROXY_EXEC, a mutex holder cannot exit to userspace without dropping it first and the mutex_unlock() ensures proxy is stopped before the mutex handoff, which preserves the current semantics for running a throttled task until it exits to userspace even if it acts as a donor. [ prateek: rebased on tip, comments, commit message. ] Reviewed-by: Benjamin Segall <bsegall@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Aaron Lu <ziqianlu@bytedance.com> Link: https://patch.msgid.link/20260602071005.11942-1-kprateek.nayak@amd.com	2026-06-02 12:26:13 +02:00
K Prateek Nayak	102a28344a	sched/fair: Move the throttled tasks to a local list in tg_unthrottle_up() An update_curr() during the enqueue of a throttled task will start throttling the hierarchy from a subsequent commit. This can lead to tg_throttle_down() seeing a non-empty throttled_limbo_list for the cfs_rq attaching the task from throttled_limbo_list one by one. For example: R | A / \ B C | rq->curr. B is throttled with tasks on the limbo list. When the tasks are unthrottled via tg_unthrottle_up() and the entity of group B is placed onto A, update_curr() is called to catch up the vruntime and it may throttle group A, causing the subsequent tg_throttle_down() to see the pending tasks on B's limbo list. tg_unthrottle_up() /* --cfs_rq->throttle_count == 0 */ list_for_each_entry_safe(p, cfs_rq->throttled_limbo_list) enqueue_task_fair() enqueue_entity(se /* B->se */) update_curr(cfs_rq /* A->gcfs_rq */) account_cfs_rq_runtime(cfs_rq) throttle_cfs_rq(cfs_rq /* A->gcfs_rq */) tg_throttle_down() /* Reaches B->cfs_rq with throttle_count == 0 */ !!! !list_empty(&cfs_rq->throttled_limbo_list)) !!! Move the tasks from throttled_limbo_list onto a local list before starting the unthrottle to prevent the splat described above. If the hierarchy is throttled again in the middle of an unthrottle, put the pending tasks back onto the limbo list to prevent running them unnecessarily. Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Benjamin Segall <bsegall@google.com> Tested-by: Aaron Lu <ziqianlu@bytedance.com> Link: https://patch.msgid.link/20260602052531.11450-2-kprateek.nayak@amd.com	2026-06-02 12:26:12 +02:00
K Prateek Nayak	28ad542768	sched/fair: Call update_curr() before unthrottling the hierarchy Subsequent commits will allow update_curr() to throttle the hierarchy when the runtime accounting exceeds allocated quota. Call update_curr() before the unthrottle event, and in tg_unthrottle_up() to catch up on any remaining runtime and stabilize the "runtime_remaining" and "throttle_count" for that cfs_rq. Doing an update_curr() early ensures the cfs_rq is not throttled right back up again when the unthrottle is in progress. Since all callers of unthrottle_cfs_rq(), except two, already update the rq_clock and call rq_clock_start_loop_update(), move the update_rq_clock() from unthrottle_cfs_rq() to the callers that don't update the rq_clock. Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Benjamin Segall <bsegall@google.com> Tested-by: Aaron Lu <ziqianlu@bytedance.com> Link: https://patch.msgid.link/20260602052531.11450-1-kprateek.nayak@amd.com	2026-06-02 12:26:12 +02:00
K Prateek Nayak	253edcf543	sched/fair: Use throttled_csd_list for local unthrottle When distribute_cfs_runtime() encounters a local cfs_rq, it adds it to a local list and unthrottles it at the end, when it is done unthrottling other cfs_rq(s) on cfs_b->throttled_cfs_rq until the bandwidth runs out. Instead of using a local list, reuse the local CPU's rq->throttled_csd_list and the __cfsb_csd_unthrottle() path for unthrottle. If this is the first cfs_rq to be queued on the "throttled_csd_list", it prevents the need for a remote CPUs to interrupt this local CPU if they themselves are performing async unthrottle. If this is not the first cfs_rq on the list, there is an async unthrottle operation pending on this local CPU and the unthrottle can be batched together. No functional changes intended. Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Benjamin Segall <bsegall@google.com> Tested-by: Aaron Lu <ziqianlu@bytedance.com> Link: https://patch.msgid.link/20260602050005.11160-3-kprateek.nayak@amd.com	2026-06-02 12:26:12 +02:00
K Prateek Nayak	1abbecd1d2	sched/fair: Convert cfs bandwidth throttling to use guards Routine conversion of rcu_read_lock(), spin_lock*, and rq_lock usage within the cfs bandwidth controller to use class guards. Only notable changes are: - Checking for "cfs_rq->runtime_remaining <= 0" instead of the inverse to spot a throttle and break early. This also saves the need for extra indentation in the unthrottle case. - Reordering of list_del_rcu() against throttled_clock indicator update in unthrottle_cfs_rq(). Both are done with "cfs_b->lock" held after the "cfs_rq->throttled" is cleared which make the reordering safe against concurrent list modifications. No functional changes intended. Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Ben Segall <bsegall@google.com> Tested-by: Aaron Lu <ziqianlu@bytedance.com> Link: https://patch.msgid.link/20260602050005.11160-2-kprateek.nayak@amd.com	2026-06-02 12:26:11 +02:00
Zecheng Li	b8fea7af0e	sched/fair: Allocate cfs_tg_state with percpu allocator To remove the cfs_rq pointer array in task_group, allocate the combined cfs_rq and sched_entity using the per-cpu allocator. This patch implements the following: - Changes task_group->cfs_rq from 'struct cfs_rq **' to 'struct cfs_rq __percpu *'. - Updates memory allocation in alloc_fair_sched_group() and free_fair_sched_group() to use alloc_percpu() and free_percpu() respectively. - Uses the inline accessor tg_cfs_rq(tg, cpu) with per_cpu_ptr() to retrieve the pointer to cfs_rq for the given task group and CPU. - Replaces direct accesses tg->cfs_rq[cpu] with calls to the new tg_cfs_rq(tg, cpu) helper. - Handles the root_task_group: since struct rq is already a per-cpu variable (runqueues), its embedded cfs_rq (rq->cfs) is also per-cpu. Therefore, we assign root_task_group.cfs_rq = &runqueues.cfs. - Cleans up the code that initializes the root task group. This change places each CPU's cfs_rq and sched_entity in its local per-cpu memory area to remove the per-task_group pointer arrays. Signed-off-by: Zecheng Li <zecheng@google.com> Signed-off-by: Zecheng Li <zli94@ncsu.edu> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com> Reviewed-by: Josh Don <joshdon@google.com> Link: https://patch.msgid.link/20260522141623.600235-3-zli94@ncsu.edu	2026-06-02 12:26:11 +02:00
Zecheng Li	89e1f67186	sched/fair: Remove task_group->se pointer array Now that struct sched_entity is co-located with struct cfs_rq for non-root task groups, the task_group->se pointer array is redundant. The associated sched_entity can be loaded directly from the cfs_rq. This patch performs the access conversion with the helpers: - is_root_task_group(tg): checks if a task group is the root task group. It compares the task group's address with the global root_task_group variable. - tg_se(tg, cpu): retrieves the cfs_rq and returns the address of the co-located se. This function checks if tg is the root task group to ensure behaving the same of previous tg->se[cpu]. Replaces all accesses that use the tg->se[cpu] pointer array with calls to the new tg_se(tg, cpu) accessor. - cfs_rq_se(cfs_rq): simplifies access paths like cfs_rq->tg->se[...] to use the co-located sched_entity. This function also checks if tg is the root task group to ensure same behavior. Since tg_se is not in very hot code paths, and the branch is a register comparison with an immediate value (`&root_task_group`), the performance impact is expected to be negligible. Signed-off-by: Zecheng Li <zecheng@google.com> Signed-off-by: Zecheng Li <zli94@ncsu.edu> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com> Reviewed-by: Josh Don <joshdon@google.com> Link: https://patch.msgid.link/20260522141623.600235-3-zli94@ncsu.edu	2026-06-02 12:26:11 +02:00
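The accessor pattern this commit message describes (derive the sched_entity from its co-located cfs_rq, with a special case for the root task group) can be sketched in user space roughly as below. The helper and struct names come from the message; the struct layouts, the field names, and the assumption that the root group's entity is NULL are illustrative guesses, not the kernel's definitions.

```c
#include <stddef.h>

/* Simplified stand-ins; the real kernel structs have many more fields. */
struct task_group { int id; };
struct cfs_rq { struct task_group *tg; };
struct sched_entity { int depth; };

/* Assumed layout: both objects live in one allocation. */
struct cfs_tg_state {
	struct cfs_rq cfs_rq;
	struct sched_entity se;
};

static struct task_group root_task_group;

/* Compare the group's address with the global root task group. */
static int is_root_task_group(const struct task_group *tg)
{
	return tg == &root_task_group;
}

/*
 * Return the sched_entity co-located with this cfs_rq. The root group's
 * cfs_rq is not embedded in a cfs_tg_state, so it is treated as having no
 * entity (assumed here to match the old tg->se[cpu] value for the root).
 */
static struct sched_entity *cfs_rq_se(struct cfs_rq *cfs_rq)
{
	struct cfs_tg_state *state;

	if (is_root_task_group(cfs_rq->tg))
		return NULL;

	/* Step back from the member to its container (container_of idiom). */
	state = (struct cfs_tg_state *)((char *)cfs_rq -
					offsetof(struct cfs_tg_state, cfs_rq));
	return &state->se;
}
```

The root check is a single pointer comparison against a fixed address, which is consistent with the message's expectation that the cost is negligible.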
Zecheng Li	dfcfc97b6d	sched/fair: Co-locate cfs_rq and sched_entity in cfs_tg_state Improve data locality and reduce pointer chasing by allocating struct cfs_rq and struct sched_entity together for non-root task groups. This is achieved by introducing a new combined struct cfs_tg_state that holds both objects in a single allocation. This patch: - Introduces struct cfs_tg_state that embeds cfs_rq, sched_entity, and sched_statistics together in a single structure. - Updates __schedstats_from_se() in stats.h to use cfs_tg_state for accessing sched_statistics from a group sched_entity. - Modifies alloc_fair_sched_group() and free_fair_sched_group() to allocate and free the new struct as a single unit. - Modifies the per-CPU pointers in task_group->se and task_group->cfs_rq to point to the members in the new combined structure. Signed-off-by: Zecheng Li <zecheng@google.com> Signed-off-by: Zecheng Li <zli94@ncsu.edu> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com> Reviewed-by: Josh Don <joshdon@google.com> Link: https://patch.msgid.link/20260522141623.600235-2-zli94@ncsu.edu	2026-06-02 12:26:10 +02:00
Guanyou.Chen	63c1a12bc0	sched: restore timer_slack_ns when resetting RT policy on fork Commit `ed4fb6d7ef` ("hrtimer: Use and report correct timerslack values for realtime tasks") sets timer_slack_ns to 0 for RT tasks in __setscheduler_params(). However, when an RT task with SCHED_RESET_ON_FORK creates child threads, the children inherit timer_slack_ns=0 from the parent. sched_fork() resets the child's policy to SCHED_NORMAL but does not restore timer_slack_ns, leaving the child permanently running with zero slack. Fix this by restoring timer_slack_ns from default_timer_slack_ns in sched_fork() when resetting from RT/DL to NORMAL policy, matching the existing behavior in __setscheduler_params(). Note: this fix alone requires a correct default_timer_slack_ns to be effective. See the following patch for that fix. Fixes: `ed4fb6d7ef` ("hrtimer: Use and report correct timerslack values for realtime tasks") Reported-by: Qiaoting.Lin <linqiaoting@xiaomi.com> Signed-off-by: Guanyou.Chen <chenguanyou@xiaomi.com> Signed-off-by: Chunhui.Li <chunhui.li@mediatek.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260522131000.1664983-2-chenguanyou@xiaomi.com	2026-06-02 12:26:10 +02:00
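The inheritance problem and the fix can be modeled with a small user-space sketch. Everything here is a simplification under assumed names: `struct task`, `enum policy` and `fork_task()` are hypothetical and only mirror the fields named in the commit message, not the kernel's task_struct or sched_fork().

```c
#include <stdint.h>

enum policy { POLICY_NORMAL, POLICY_RT };

/* Hypothetical, heavily reduced model of the per-task state involved. */
struct task {
	enum policy policy;
	int reset_on_fork;
	uint64_t timer_slack_ns;
	uint64_t default_timer_slack_ns;
};

/*
 * Model of the fork path: the child starts as a copy of the parent. When
 * reset-on-fork demotes an RT child to the normal policy, the slack is
 * restored from the default instead of keeping the parent's RT value of 0.
 */
static struct task fork_task(const struct task *parent)
{
	struct task child = *parent;

	if (child.reset_on_fork && child.policy == POLICY_RT) {
		child.policy = POLICY_NORMAL;
		child.timer_slack_ns = child.default_timer_slack_ns;
	}
	/* In this model the flag is consumed by the fork. */
	child.reset_on_fork = 0;

	return child;
}
```

As the message notes, this only helps if `default_timer_slack_ns` itself holds a sensible value, which is the subject of the follow-up patch.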
Peter Zijlstra	56e50ff567	sched: Simplify ttwu_runnable() Note that both proxy and delayed tasks have ->is_blocked set. Use this one condition to guard both paths. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260526113322.714832584%40infradead.org	2026-06-02 12:26:10 +02:00
Peter Zijlstra	c0404dd88d	sched/proxy: Remove superfluous clear_task_blocked_in() Per the discussion here: https://lore.kernel.org/all/20260403112810.GG3738786@noisy.programming.kicks-ass.net/ The reason for this condition is that the signal condition in try_to_block_task() would set_task_blocked_in_waking(). However, it no longer does that, in fact, that path does clear_task_blocked_on(). Further, per the discussions here: https://lore.kernel.org/r/dc61cf77-e541-441d-a708-c40e19aa0db2%40amd.com https://lore.kernel.org/r//9dd1d24d-45d3-4ee2-8e67-8305b34bfb6d%40amd.com there are a few other edge cases that needed this. But they're all variants of PROXY_WAKING leaking out. And since PROXY_WAKING is now gone, this is no longer needed either. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: John Stultz <jstultz@google.com> Link: https://patch.msgid.link/20260526113322.120970670%40infradead.org	2026-06-02 12:26:09 +02:00
K Prateek Nayak	ec9d4f1c42	sched/proxy: Remove PROXY_WAKING Now that the proxy path uses ->is_blocked, use the '->is_blocked && !->blocked_on' state instead of PROXY_WAKING. Notably, this is where a blocked_on relation is broken but the donor task might still need a return migration. Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260526113322.596522894%40infradead.org	2026-06-02 12:26:09 +02:00


52483 Commits