linux

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2026-05-21 09:05:20 -04:00

Author	SHA1	Message	Date
Anna-Maria Behnsen	22c62b9a84	timekeeping: Introduce auxiliary timekeepers Provide timekeepers for auxiliary clocks and initialize them during boot. Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: John Stultz <jstultz@google.com> Link: https://lore.kernel.org/all/20250519083026.350061049@linutronix.de	2025-06-19 14:28:23 +02:00
Thomas Gleixner	6168024604	timekeeping: Add clock_valid flag to timekeeper In preparation for supporting independent auxiliary timekeepers, add a clock valid field and set it to true for the system timekeeper. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: John Stultz <jstultz@google.com> Link: https://lore.kernel.org/all/20250519083026.287145536@linutronix.de	2025-06-19 14:28:23 +02:00
Thomas Gleixner	8c782acd3f	timekeeping: Prepare timekeeping_update_from_shadow() Don't invoke the VDSO and paravirt updates when utilized for auxiliary clocks. This is a temporary workaround until the VDSO and paravirt interfaces have been worked out. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20250519083026.223876435@linutronix.de	2025-06-19 14:28:23 +02:00
Anna-Maria Behnsen	926ad47516	timekeeping: Make __timekeeping_advance() reusable In __timekeeping_advance() the pointer to struct tk_data is hardcoded by the use of &tk_core. As long as there is only a single timekeeper (tk_core), this is not a problem. But when __timekeeping_advance() will be reused for per auxiliary timekeepers, __timekeeping_advance() needs to be generalized. Add a pointer to struct tk_data as function argument of __timekeeping_advance() and adapt all call sites. Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: John Stultz <jstultz@google.com> Link: https://lore.kernel.org/all/20250519083026.160967312@linutronix.de	2025-06-19 14:28:23 +02:00
Thomas Gleixner	c7ebfbc440	ntp: Rename __do_adjtimex() to ntp_adjtimex() Clean up the name space. No functional change. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: John Stultz <jstultz@google.com> Link: https://lore.kernel.org/all/20250519083026.095637820@linutronix.de	2025-06-19 14:28:23 +02:00
Thomas Gleixner	5ffa25f573	ntp: Add timekeeper ID arguments to public functions In preparation for supporting auxiliary POSIX clocks, add a timekeeper ID to the relevant functions. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: John Stultz <jstultz@google.com> Link: https://lore.kernel.org/all/20250519083026.032425931@linutronix.de	2025-06-19 14:28:23 +02:00
Thomas Gleixner	8515714b0f	ntp: Add support for auxiliary timekeepers If auxiliary clocks are enabled, provide an array of NTP data so that the auxiliary timekeepers can be steered independently of the core timekeeper. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: John Stultz <jstultz@google.com> Link: https://lore.kernel.org/all/20250519083025.969000914@linutronix.de	2025-06-19 14:28:22 +02:00
Anna-Maria Behnsen	9094c72c3d	time: Introduce auxiliary POSIX clocks To support auxiliary timekeeping and the related user space interfaces, it's required to define a clock ID range for them. Reserve 8 auxiliary clock IDs after the regular timekeeping clock ID space. This is the maximum number of auxiliary clocks the kernel can support. The actual number of supported clocks depends obviously on the presence of related devices and might be constraint by the available VDSO space. Add the corresponding timekeeper IDs as well. Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: John Stultz <jstultz@google.com> Link: https://lore.kernel.org/all/20250519083025.905800695@linutronix.de	2025-06-19 14:28:22 +02:00
Anna-Maria Behnsen	f12b45862c	timekeeping: Introduce timekeeper ID As long as there is only a single timekeeper, there is no need to clarify which timekeeper is used. But with the upcoming reusage of the timekeeper infrastructure for auxiliary clock timekeepers, an ID is required to differentiate. Introduce an enum for timekeeper IDs, introduce a field in struct tk_data to store this timekeeper id and add also initialization. The id struct field is added at the end of the second cachline, as there is a 4 byte hole anyway. Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: John Stultz <jstultz@google.com> Link: https://lore.kernel.org/all/20250519083025.842476378@linutronix.de	2025-06-19 14:28:22 +02:00
Thomas Gleixner	7e55b6ba1f	timekeeping: Avoid double notification in do_adjtimex() Consolidate do_adjtimex() so that it does not notify about clock changes twice. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20250519083025.779267274@linutronix.de	2025-06-19 14:28:22 +02:00
Thomas Gleixner	506a54a031	timekeeping: Cleanup kernel doc of __ktime_get_real_seconds() Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: John Stultz <jstultz@google.com> Link: https://lore.kernel.org/all/20250519083025.715836017@linutronix.de	2025-06-19 14:28:22 +02:00
Thomas Gleixner	990518eb3a	timekeeping: Remove hardcoded access to tk_core This was overlooked in the initial conversion. Use the provided pointer to access the shadow timekeeper. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: John Stultz <jstultz@google.com> Link: https://lore.kernel.org/all/20250519083025.652611452@linutronix.de	2025-06-19 14:28:22 +02:00
Willem de Bruijn	d4adf1c9ee	bpf: Adjust free target to avoid global starvation of LRU map BPF_MAP_TYPE_LRU_HASH can recycle most recent elements well before the map is full, due to percpu reservations and force shrink before neighbor stealing. Once a CPU is unable to borrow from the global map, it will once steal one elem from a neighbor and after that each time flush this one element to the global list and immediately recycle it. Batch value LOCAL_FREE_TARGET (128) will exhaust a 10K element map with 79 CPUs. CPU 79 will observe this behavior even while its neighbors hold 78 * 127 + 1 * 15 == 9921 free elements (99%). CPUs need not be active concurrently. The issue can appear with affinity migration, e.g., irqbalance. Each CPU can reserve and then hold onto its 128 elements indefinitely. Avoid global list exhaustion by limiting aggregate percpu caches to half of map size, by adjusting LOCAL_FREE_TARGET based on cpu count. This change has no effect on sufficiently large tables. Similar to LOCAL_NR_SCANS and lru->nr_scans, introduce a map variable lru->free_target. The extra field fits in a hole in struct bpf_lru. The cacheline is already warm where read in the hot path. The field is only accessed with the lru lock held. Tested-by: Anton Protopopov <a.s.protopopov@gmail.com> Signed-off-by: Willem de Bruijn <willemb@google.com> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://lore.kernel.org/r/20250618215803.3587312-1-willemdebruijn.kernel@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-06-18 18:50:14 -07:00
Linus Torvalds	74b4cc9b87	Merge tag 'cgroup-for-6.16-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup fix from Tejun Heo: - In cgroup1 freezer, a task migrating into a frozen cgroup might not get frozen immediately due to the wrong operation order. Fix it. * tag 'cgroup-for-6.16-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup,freezer: fix incomplete freezing when attaching tasks	2025-06-18 14:25:50 -07:00
Linus Torvalds	0564e6a8c2	Merge tag 'wq-for-6.16-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq Pull workqueue fix from Tejun Heo: - Fix missed early init of wq_isolated_cpumask * tag 'wq-for-6.16-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: Initialize wq_isolated_cpumask in workqueue_init_early()	2025-06-18 14:22:31 -07:00
Linus Torvalds	4f24bfcc39	Merge tag 'sched_ext-for-6.16-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext fixes from Tejun Heo: - Fix a couple bugs in cgroup cpu.weight support - Add the new sched-ext@lists.linux.dev to MAINTAINERS * tag 'sched_ext-for-6.16-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: sched_ext, sched/core: Don't call scx_group_set_weight() prematurely from sched_create_group() sched_ext: Make scx_group_set_weight() always update tg->scx.weight sched_ext: Update mailing list entry in MAINTAINERS	2025-06-18 14:17:15 -07:00
Chen Ridong	37fb58a727	cgroup,freezer: fix incomplete freezing when attaching tasks An issue was found: # cd /sys/fs/cgroup/freezer/ # mkdir test # echo FROZEN > test/freezer.state # cat test/freezer.state FROZEN # sleep 1000 & [1] 863 # echo 863 > test/cgroup.procs # cat test/freezer.state FREEZING When tasks are migrated to a frozen cgroup, the freezer fails to immediately freeze the tasks, causing the cgroup to remain in the "FREEZING". The freeze_task() function is called before clearing the CGROUP_FROZEN flag. This causes the freezing() check to incorrectly return false, preventing __freeze_task() from being invoked for the migrated task. To fix this issue, clear the CGROUP_FROZEN state before calling freeze_task(). Fixes: `f5d39b0208` ("freezer,sched: Rewrite core freezer logic") Cc: stable@vger.kernel.org # v6.1+ Reported-by: Zhong Jiawei <zhongjiawei1@huawei.com> Signed-off-by: Chen Ridong <chenridong@huawei.com> Acked-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-06-18 09:43:30 -10:00
Thomas Weißschuh	5ea2bcdfbf	printk: ringbuffer: Add KUnit test The KUnit test validates the correct operation of the ringbuffer. A separate dedicated ringbuffer is used so that the global printk ringbuffer is not touched. Co-developed-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Link: https://patch.msgid.link/20250612-printk-ringbuffer-test-v3-1-550c088ee368@linutronix.de Signed-off-by: Petr Mladek <pmladek@suse.com>	2025-06-18 16:42:42 +02:00
Tejun Heo	5bc34be478	sched/core: Reorganize cgroup bandwidth control interface file writes - Move input parameter validation from tg_set_cfs_bandwidth() to the new outer function tg_set_bandwidth(). The outer function handles parameters in usecs, validates them and calls tg_set_cfs_bandwidth() which converts them into nsecs. This matches tg_bandwidth() on the read side. - max/min_cfs_* consts are now used by tg_set_bandwidth(). Relocate, convert into usecs and drop "cfs" from the names. - Reimplement cpu_cfs_{period\|quote\|burst}_write_*() using tg_bandwidth() and tg_set_bandwidth() and replace "cfs" in the names with "bw". - Update cpu_max_write() to use tg_set_bandiwdth(). cpu_period_quota_parse() is updated to drop nsec conversion accordingly. This aligns the behavior with cfs_period_quota_print(). - Drop now unused tg_set_cfs_{period\|quota\|burst}(). - While at it, for consistency, rename default_cfs_period() to default_bw_period_us() and make it return usecs. This is to prepare for adding bandwidth control support to sched_ext. tg_set_bandwidth() will be used as the muxing point. No functional changes intended. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20250614012346.2358261-5-tj@kernel.org	2025-06-18 13:59:57 +02:00
Tejun Heo	43e33f53e2	sched/core: Reorganize cgroup bandwidth control interface file reads - Update tg_get_cfs_() to return u64 values. These are now used as the low level accessors to the fair's bandwidth configuration parameters. Translation to usecs takes place in these functions. - Add tg_bandwidth() which reads all three bandwidth parameters using tg_get_cfs_(). - Reimplement cgroup interface read functions using tg_bandwidth(). Drop cfs from the function names. This is to prepare for adding bandwidth control support to sched_ext. tg_bandwidth() will be used as the muxing point similar to tg_weight(). No functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20250614012346.2358261-4-tj@kernel.org	2025-06-18 13:59:57 +02:00
Tejun Heo	de4c80c696	sched/core: Relocate tg_get_cfs_() and cpu_cfs__read_*() Collect the getters, relocate the trivial interface file wrappers, and put all of them in period, quota, burst order to prepare for future changes. Pure reordering. No functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20250614012346.2358261-3-tj@kernel.org	2025-06-18 13:59:57 +02:00
Tejun Heo	d403a3689a	sched/fair: Move max_cfs_quota_period decl and default_cfs_period() def from fair.c to sched.h max_cfs_quota_period is defined in core.c but has a declaration in fair.c. Move the declaration to kernel/sched/sched.h. Also, move default_cfs_period() from fair.c to sched.h. No functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20250614012346.2358261-2-tj@kernel.org	2025-06-18 13:59:56 +02:00
Steven Rostedt	327e286643	fgraph: Do not enable function_graph tracer when setting funcgraph-args When setting the funcgraph-args option when function graph tracer is net enabled, it incorrectly enables it. Worse, it unregisters itself when it was never registered. Then when it gets enabled again, it will register itself a second time causing a WARNing. ~# echo 1 > /sys/kernel/tracing/options/funcgraph-args ~# head -20 /sys/kernel/tracing/trace # tracer: nop # # entries-in-buffer/entries-written: 813/26317372 #P:8 # # _-----=> irqs-off/BH-disabled # / _----=> need-resched # \| / _---=> hardirq/softirq # \|\| / _--=> preempt-depth # \|\|\| / _-=> migrate-disable # \|\|\|\| / delay # TASK-PID CPU# \|\|\|\|\| TIMESTAMP FUNCTION # \| \| \| \|\|\|\|\| \| \| <idle>-0 [007] d..4. 358.966010: 7) 1.692 us \| fetch_next_timer_interrupt(basej=4294981640, basem=357956000000, base_local=0xffff88823c3ae040, base_global=0xffff88823c3af300, tevt=0xffff888100e47cb8); <idle>-0 [007] d..4. 358.966012: 7) \| tmigr_cpu_deactivate(nextexp=357988000000) { <idle>-0 [007] d..4. 358.966013: 7) \| _raw_spin_lock(lock=0xffff88823c3b2320) { <idle>-0 [007] d..4. 358.966014: 7) 0.981 us \| preempt_count_add(val=1); <idle>-0 [007] d..5. 358.966017: 7) 1.058 us \| do_raw_spin_lock(lock=0xffff88823c3b2320); <idle>-0 [007] d..4. 358.966019: 7) 5.824 us \| } <idle>-0 [007] d..5. 358.966021: 7) \| tmigr_inactive_up(group=0xffff888100cb9000, child=0x0, data=0xffff888100e47bc0) { <idle>-0 [007] d..5. 358.966022: 7) \| tmigr_update_events(group=0xffff888100cb9000, child=0x0, data=0xffff888100e47bc0) { Notice the "tracer: nop" at the top there. The current tracer is the "nop" tracer, but the content is obviously the function graph tracer. Enabling function graph tracing will cause it to register again and trigger a warning in the accounting: ~# echo function_graph > /sys/kernel/tracing/current_tracer -bash: echo: write error: Device or resource busy With the dmesg of: ------------[ cut here ]------------ WARNING: CPU: 7 PID: 1095 at kernel/trace/ftrace.c:3509 ftrace_startup_subops+0xc1e/0x1000 Modules linked in: kvm_intel kvm irqbypass CPU: 7 UID: 0 PID: 1095 Comm: bash Not tainted 6.16.0-rc2-test-00006-gea03de4105d3 #24 PREEMPT Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 RIP: 0010:ftrace_startup_subops+0xc1e/0x1000 Code: 48 b8 22 01 00 00 00 00 ad de 49 89 84 24 88 01 00 00 8b 44 24 08 89 04 24 e9 c3 f7 ff ff c7 04 24 ed ff ff ff e9 b7 f7 ff ff <0f> 0b c7 04 24 f0 ff ff ff e9 a9 f7 ff ff c7 04 24 f4 ff ff ff e9 RSP: 0018:ffff888133cff948 EFLAGS: 00010202 RAX: 0000000000000001 RBX: 1ffff1102679ff31 RCX: 0000000000000000 RDX: 1ffffffff0b27a60 RSI: ffffffff8593d2f0 RDI: ffffffff85941140 RBP: 00000000000c2041 R08: ffffffffffffffff R09: ffffed1020240221 R10: ffff88810120110f R11: ffffed1020240214 R12: ffffffff8593d2f0 R13: ffffffff8593d300 R14: ffffffff85941140 R15: ffffffff85631100 FS: 00007f7ec6f28740(0000) GS:ffff8882b5251000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f7ec6f181c0 CR3: 000000012f1d0005 CR4: 0000000000172ef0 Call Trace: <TASK> ? __pfx_ftrace_startup_subops+0x10/0x10 ? find_held_lock+0x2b/0x80 ? ftrace_stub_direct_tramp+0x10/0x10 ? ftrace_stub_direct_tramp+0x10/0x10 ? trace_preempt_on+0xd0/0x110 ? __pfx_trace_graph_entry_args+0x10/0x10 register_ftrace_graph+0x4d2/0x1020 ? tracing_reset_online_cpus+0x14b/0x1e0 ? __pfx_register_ftrace_graph+0x10/0x10 ? ring_buffer_record_enable+0x16/0x20 ? tracing_reset_online_cpus+0x153/0x1e0 ? __pfx_tracing_reset_online_cpus+0x10/0x10 ? __pfx_trace_graph_return+0x10/0x10 graph_trace_init+0xfd/0x160 tracing_set_tracer+0x500/0xa80 ? __pfx_tracing_set_tracer+0x10/0x10 ? lock_release+0x181/0x2d0 ? _copy_from_user+0x26/0xa0 tracing_set_trace_write+0x132/0x1e0 ? __pfx_tracing_set_trace_write+0x10/0x10 ? ftrace_graph_func+0xcc/0x140 ? ftrace_stub_direct_tramp+0x10/0x10 ? ftrace_stub_direct_tramp+0x10/0x10 ? ftrace_stub_direct_tramp+0x10/0x10 vfs_write+0x1d0/0xe90 ? __pfx_vfs_write+0x10/0x10 Have the setting of the funcgraph-args check if function_graph tracer is the current tracer of the instance, and if not, do nothing, as there's nothing to do (the option is checked when function_graph tracing starts). Cc: stable@vger.kernel.org Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Mark Rutland <mark.rutland@arm.com> Link: https://lore.kernel.org/20250618073801.057ea636@gandalf.local.home Fixes: `c7a60a733c` ("ftrace: Have funcgraph-args take affect during tracing") Closes: https://lore.kernel.org/all/4ab1a7bdd0174ab09c7b0d68cdbff9a4@huawei.com/ Reported-by: Changbin Du <changbin.du@huawei.com> Tested-by: Changbin Du <changbin.du@huawei.com> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2025-06-18 07:43:22 -04:00
James Bottomley	bd07bd12f2	bpf: Fix key serial argument of bpf_lookup_user_key() The underlying lookup_user_key() function uses a signed 32 bit integer for key serial numbers because legitimate serial numbers are positive (and > 3) and keyrings are negative. Using a u32 for the keyring in the bpf function doesn't currently cause any conversion problems but will start to trip the signed to unsigned conversion warnings when the kernel enables them, so convert the argument to signed (and update the tests accordingly) before it acquires more users. Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com> Reviewed-by: Roberto Sassu <roberto.sassu@huawei.com> Link: https://lore.kernel.org/r/84cdb0775254d297d75e21f577089f64abdfbd28.camel@HansenPartnership.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-06-17 18:15:27 -07:00
Al Viro	f5527f0171	bpf: Get rid of redundant 3rd argument of prepare_seq_file() Remove 3rd argument in prepare_seq_file() to clean up the code a bit. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20250615004719.GE3011112@ZenIV Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-06-17 17:19:41 -07:00
Shakeel Butt	6af89c6ca7	cgroup: remove per-cpu per-subsystem locks The rstat update side used to insert the cgroup whose stats are updated in the update tree and the read side flush the update tree to get the latest uptodate stats. The per-cpu per-subsystem locks were used to synchronize the update and flush side. However now the update side does not access update tree but uses per-cpu lockless lists. So there is no need for locks to synchronize update and flush side. Let's remove them. Suggested-by: JP Kobryn <inwardvessel@gmail.com> Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Tested-by: JP Kobryn <inwardvessel@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-06-17 10:01:18 -10:00
Shakeel Butt	36df6e3dbd	cgroup: make css_rstat_updated nmi safe To make css_rstat_updated() able to safely run in nmi context, let's move the rstat update tree creation at the flush side and use per-cpu lockless lists in struct cgroup_subsys to track the css whose stats are updated on that cpu. The struct cgroup_subsys_state now has per-cpu lnode which needs to be inserted into the corresponding per-cpu lhead of struct cgroup_subsys. Since we want the insertion to be nmi safe, there can be multiple inserters on the same cpu for the same lnode. Here multiple inserters are from stacked contexts like softirq, hardirq and nmi. The current llist does not provide function to protect against the scenario where multiple inserters can use the same lnode. So, using llist_node() out of the box is not safe for this scenario. However we can protect against multiple inserters using the same lnode by using the fact llist node points to itself when not on the llist and atomically reset it and select the winner as the single inserter. Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Tested-by: JP Kobryn <inwardvessel@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-06-17 10:01:00 -10:00
Shakeel Butt	1257b8786a	cgroup: support to enable nmi-safe css_rstat_updated Add necessary infrastructure to enable the nmi-safe execution of css_rstat_updated(). Currently css_rstat_updated() takes a per-cpu per-css raw spinlock to add the given css in the per-cpu per-css update tree. However the kernel can not spin in nmi context, so we need to remove the spinning on the raw spinlock in css_rstat_updated(). To support lockless css_rstat_updated(), let's add necessary data structures in the css and ss structures. Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Tested-by: JP Kobryn <inwardvessel@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-06-17 09:58:45 -10:00
Chuyi Zhou	261dce3d64	workqueue: Initialize wq_isolated_cpumask in workqueue_init_early() Now when isolcpus is enabled via the cmdline, wq_isolated_cpumask does not include these isolated CPUs, even wq_unbound_cpumask has already excluded them. It is only when we successfully configure an isolate cpuset partition that wq_isolated_cpumask gets overwritten by workqueue_unbound_exclude_cpumask(), including both the cmdline-specified isolated CPUs and the isolated CPUs within the cpuset partitions. Fix this issue by initializing wq_isolated_cpumask properly in workqueue_init_early(). Fixes: `fe28f631fa` ("workqueue: Add workqueue_unbound_exclude_cpumask() to exclude CPUs from wq_unbound_cpumask") Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-06-17 08:58:29 -10:00
Tejun Heo	f11113d013	Merge branch 'WQ_PERCPU' into for-6.17	2025-06-17 08:52:50 -10:00
Marco Crivellari	128ea9f6cc	workqueue: Add system_percpu_wq and system_dfl_wq Currently, if a user enqueue a work item using schedule_delayed_work() the used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to schedule_work() that is using system_wq and queue_work(), that makes use again of WORK_CPU_UNBOUND. This lack of consistentcy cannot be addressed without refactoring the API. system_wq is a per-CPU worqueue, yet nothing in its name tells about that CPU affinity constraint, which is very often not required by users. Make it clear by adding a system_percpu_wq. system_unbound_wq should be the default workqueue so as not to enforce locality constraints for random work whenever it's not required. Adding system_dfl_wq to encourage its use when unbound work should be used. Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Marco Crivellari <marco.crivellari@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-06-17 08:51:34 -10:00
Tejun Heo	33796b9187	sched_ext, sched/core: Don't call scx_group_set_weight() prematurely from sched_create_group() During task_group creation, sched_create_group() calls scx_group_set_weight() with CGROUP_WEIGHT_DFL to initialize the sched_ext portion. This is premature and ends up calling ops.cgroup_set_weight() with an incorrect @cgrp before ops.cgroup_init() is called. sched_create_group() should just initialize SCX related fields in the new task_group. Fix it by factoring out scx_tg_init() from sched_init() and making sched_create_group() call that function instead of scx_group_set_weight(). v2: Retain CONFIG_EXT_GROUP_SCHED ifdef in sched_init() as removing it leads to build failures on !CONFIG_GROUP_SCHED configs. Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: `8195136669` ("sched_ext: Add cgroup support") Cc: stable@vger.kernel.org # v6.12+	2025-06-17 08:19:55 -10:00
Tejun Heo	c50784e99f	sched_ext: Make scx_group_set_weight() always update tg->scx.weight Otherwise, tg->scx.weight can go out of sync while scx_cgroup is not enabled and ops.cgroup_init() may be called with a stale weight value. Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: `8195136669` ("sched_ext: Add cgroup support") Cc: stable@vger.kernel.org # v6.12+	2025-06-17 08:19:43 -10:00
Song Liu	a766cfbbeb	bpf: Mark dentry->d_inode as trusted_or_null LSM hooks such as security_path_mknod() and security_inode_rename() have access to newly allocated negative dentry, which has NULL d_inode. Therefore, it is necessary to do the NULL pointer check for d_inode. Also add selftests that checks the verifier enforces the NULL pointer check. Signed-off-by: Song Liu <song@kernel.org> Reviewed-by: Matt Bobrowski <mattbobrowski@google.com> Link: https://lore.kernel.org/r/20250613052857.1992233-1-song@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-06-17 08:40:59 -07:00
John Ogness	571c1ea91a	printk: nbcon: Allow reacquire during panic If a console printer is interrupted during panic, it will never be able to reacquire ownership in order to perform and cleanup. That in itself is not a problem, since the non-panic CPU will simply quiesce in an endless loop within nbcon_reacquire_nobuf(). However, in this state, platforms that do not support a true NMI to interrupt the quiesced CPU will not be able to shutdown that CPU from within panic(). This then causes problems for such as being unable to load and run a kdump kernel. Fix this by allowing non-panic CPUs to reacquire ownership using a direct acquire. Then the non-panic CPUs can successfullyl exit the nbcon_reacquire_nobuf() loop and the console driver can perform any necessary cleanup. But more importantly, the CPU is no longer quiesced and is free to process any interrupts necessary for panic() to shutdown the CPU. All other forms of acquire are still not allowed for non-panic CPUs since it is safer to have them avoid gaining console ownership that is not strictly necessary. Reported-by: Michael Kelley <mhklinux@outlook.com> Closes: https://lore.kernel.org/r/SN6PR02MB4157A4C5E8CB219A75263A17D46DA@SN6PR02MB4157.namprd02.prod.outlook.com Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Tested-by: Michael Kelley <mhklinux@outlook.com> Link: https://patch.msgid.link/20250606185549.900611-1-john.ogness@linutronix.de Signed-off-by: Petr Mladek <pmladek@suse.com>	2025-06-17 12:10:29 +02:00
Thomas Weißschuh	fb506e31b3	sysfs: treewide: switch back to attribute_group::bin_attrs The normal bin_attrs field can now handle const pointers. This makes the _new variant unnecessary. Switch all users back. Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Link: https://lore.kernel.org/r/20250530-sysfs-const-bin_attr-final-v3-4-724bfcf05b99@weissschuh.net Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2025-06-17 10:44:15 +02:00
Thomas Weißschuh	2fbe82037a	sysfs: treewide: switch back to bin_attribute::read()/write() The bin_attribute argument of bin_attribute::read() is now const. This makes the _new() callbacks unnecessary. Switch all users back. Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Link: https://lore.kernel.org/r/20250530-sysfs-const-bin_attr-final-v3-3-724bfcf05b99@weissschuh.net Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2025-06-17 10:44:13 +02:00
Richard Guy Briggs	ae1ae11fb2	audit,module: restore audit logging in load failure case The move of the module sanity check to earlier skipped the audit logging call in the case of failure and to a place where the previously used context is unavailable. Add an audit logging call for the module loading failure case and get the module name when possible. Link: https://issues.redhat.com/browse/RHEL-52839 Fixes: `02da2cbab4` ("module: move check_modinfo() early to early_mod_check()") Signed-off-by: Richard Guy Briggs <rgb@redhat.com> Reviewed-by: Petr Pavlu <petr.pavlu@suse.com> Signed-off-by: Paul Moore <paul@paul-moore.com>	2025-06-16 17:00:06 -04:00
Kent Overstreet	fda6add924	workqueue: Basic memory allocation profiling support Hook alloc_workqueue and alloc_workqueue_attrs() so that they're accounted to the callsite. Since we're doing allocations on behalf of another subsystem, this helps when using memory allocation profiling to check for leaks. Cc: Tejun Heo <tj@kernel.org> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-06-16 08:01:45 -10:00
Cheng-Yang Chou	f479fee382	sched_ext: Return NULL in llc_span Use NULL instead of 0 to signal no LLC domain, matching numa_span() and the function comment. No functional change. Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-06-16 07:32:37 -10:00
Christian Brauner	70e3ee3128	coredump: rename do_coredump() to vfs_coredump() Align the naming with the rest of our helpers exposed outside of core vfs. Link: https://lore.kernel.org/20250612-work-coredump-massage-v1-9-315c0c34ba94@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-06-16 17:01:22 +02:00
Yury Norov [NVIDIA]	bfa788dc2d	clocksource: Use cpumask_next_wrap() in clocksource_watchdog() cpumask_next_wrap() is more verbose and efficient comparing to cpumask_next() followed by cpumask_first(). Signed-off-by: Yury Norov [NVIDIA] <yury.norov@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: John Stultz <jstultz@google.com> Link: https://lore.kernel.org/all/20250614155031.340988-3-yury.norov@gmail.com	2025-06-14 20:09:44 +02:00
Yury Norov [NVIDIA]	4fa7d61d5a	clocksource: Use cpumask_any_but() in clocksource_verify_choose_cpus() cpumask_any_but() is more verbose than cpumask_first() followed by cpumask_next(). Use it in clocksource_verify_choose_cpus(). Signed-off-by: Yury Norov [NVIDIA] <yury.norov@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: John Stultz <jstultz@google.com> Link: https://lore.kernel.org/all/20250614155031.340988-2-yury.norov@gmail.com	2025-06-14 20:09:44 +02:00
Cheng-Yang Chou	545b343015	sched_ext: Always use SMP versions in kernel/sched/ext_idle.h Simplify the scheduler by making formerly SMP-only primitives and data structures unconditional. tj: Updated subject for clarity. Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-06-13 14:47:59 -10:00
Cheng-Yang Chou	8834ace4a8	sched_ext: Always use SMP versions in kernel/sched/ext_idle.c Simplify the scheduler by making formerly SMP-only primitives and data structures unconditional. tj: Updated subject for clarity. Fixed stray #else block which wasn't removed causing build failure. Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-06-13 14:47:52 -10:00
Cheng-Yang Chou	6a1cda143c	sched_ext: Always use SMP versions in kernel/sched/ext.h Simplify the scheduler by making formerly SMP-only primitives and data structures unconditional. tj: Updated subject for clarity. Replace #if defined() with #ifdef. Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-06-13 14:30:58 -10:00
Cheng-Yang Chou	165af41516	sched_ext: Always use SMP versions in kernel/sched/ext.c Simplify the scheduler by making formerly SMP-only primitives and data structures unconditional. tj: Updated subject for clarity. Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-06-13 14:29:39 -10:00
Tejun Heo	9ec5e0be0e	Merge branch 'sched/core' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into for-6.17	2025-06-13 14:21:47 -10:00
Luis Gerhorst	f66b4aaff2	bpf: Remove redundant free_verifier_state()/pop_stack() This patch removes duplicated code. Eduard points out [1]: Same cleanup cycles are done in push_stack() and push_async_cb(), both functions are only reachable from do_check_common() via do_check() -> do_check_insn(). Hence, I think that cur state should not be freed in push_() functions and pop_stack() loop there is not needed. This would also fix the 'symptom' for [2], but the issue also has a simpler fix which was sent separately. This fix also makes sure the push_() callers always return an error for which error_recoverable_with_nospec(err) is false. This is required because otherwise we try to recover and access the stale `state`. Moving free_verifier_state() and pop_stack(..., pop_log=false) to happen after the bpf_vlog_reset() call in do_check_common() is fine because the pop_stack() call that is moved does not call bpf_vlog_reset() with the pop_log=false parameter. [1] https://lore.kernel.org/all/b6931bd0dd72327c55287862f821ca6c4c3eb69a.camel@gmail.com/ [2] https://lore.kernel.org/all/68497853.050a0220.33aa0e.036a.GAE@google.com/ Reported-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/all/b6931bd0dd72327c55287862f821ca6c4c3eb69a.camel@gmail.com/ Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Luis Gerhorst <luis.gerhorst@fau.de> Link: https://lore.kernel.org/r/20250613090157.568349-2-luis.gerhorst@fau.de Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-06-13 14:59:30 -07:00
Eduard Zingerman	3157f7e299	bpf: handle jset (if a & b ...) as a jump in CFG computation BPF_JSET is a conditional jump and currently verifier.c:can_jump() does not know about that. This can lead to incorrect live registers and SCC computation. E.g. in the following example: 1: r0 = 1; 2: r2 = 2; 3: if r1 & 0x7 goto +1; 4: exit; 5: r0 = r2; 6: exit; W/o this fix insn_successors(3) will return only (4), a jump to (5) would be missed and r2 won't be marked as alive at (3). Fixes: `14c8552db6` ("bpf: simple DFA-based live registers analysis") Reported-by: syzbot+a36aac327960ff474804@syzkaller.appspotmail.com Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com> Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20250613175331.3238739-1-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-06-13 11:51:19 -07:00

... 21 22 23 24 25 ...

49605 Commits