linux

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2026-04-05 10:45:23 -04:00

Author	SHA1	Message	Date
pengdonglin	f83ac7544f	function_graph: Enable funcgraph-args and funcgraph-retaddr to work simultaneously Currently, the funcgraph-args and funcgraph-retaddr features are mutually exclusive. This patch resolves this limitation by allowing funcgraph-retaddr to have an args array. To verify the change, use perf to trace vfs_write with both options enabled: Before: # perf ftrace -G vfs_write --graph-opts args,retaddr ...... down_read() { /* <-n_tty_write+0xa3/0x540 / __cond_resched(); / <-down_read+0x12/0x160 / preempt_count_add(); / <-down_read+0x3b/0x160 / preempt_count_sub(); / <-down_read+0x8b/0x160 / } After: # perf ftrace -G vfs_write --graph-opts args,retaddr ...... down_read(sem=0xffff8880100bea78) { / <-n_tty_write+0xa3/0x540 / __cond_resched(); / <-down_read+0x12/0x160 / preempt_count_add(val=1); / <-down_read+0x3b/0x160 / preempt_count_sub(val=1); / <-down_read+0x8b/0x160 */ } Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Xiaoqin Zhang <zhangxiaoqin@xiaomi.com> Link: https://patch.msgid.link/20251125093425.2563849-1-dolinux.peng@gmail.com Signed-off-by: pengdonglin <pengdonglin@xiaomi.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2025-11-26 15:13:30 -05:00
Masami Hiramatsu (Google)	20e7168326	tracing: Add boot-time backup of persistent ring buffer Currently, the persistent ring buffer instance needs to be read before using it. This means we have to wait for boot up user space and dump the persistent ring buffer. However, in that case we can not start tracing on it from the kernel cmdline. To solve this limitation, this adds an option which allows to create a trace instance as a backup of the persistent ring buffer at boot. If user specifies trace_instance=<BACKUP>=<PERSIST_RB> then the <BACKUP> instance is made as a copy of the <PERSIST_RB> instance. For example, the below kernel cmdline records all syscalls, scheduler and interrupt events on the persistent ring buffer `boot_map` but before starting the tracing, it makes a `backup` instance from the `boot_map`. Thus, the `backup` instance has the previous boot events. 'reserve_mem=12M:4M:trace trace_instance=boot_map@trace,syscalls:,sched:,irq:* trace_instance=backup=boot_map' As you can see, this just make a copy of entire reserved area and make a backup instance on it. So you can release (or shrink) the backup instance after use it to save the memory usage. /sys/kernel/tracing/instances # free total used free shared buff/cache available Mem: 1999284 55704 1930520 10132 13060 `1914628` Swap: 0 0 0 /sys/kernel/tracing/instances # rmdir backup/ /sys/kernel/tracing/instances # free total used free shared buff/cache available Mem: 1999284 40640 `1945584` 10132 13060 1929692 Swap: 0 0 0 Note: since there is no reason to make a copy of empty buffer, this backup only accepts a persistent ring buffer as the original instance. Also, since this backup is based on vmalloc(), it does not support user-space mmap(). Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/176377150002.219692.9425536150438129267.stgit@devnote2 Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2025-11-26 15:13:30 -05:00
Steven Rostedt	f93a7d0cac	ftrace: Allow tracing of some of the tracing code There is times when tracing the tracing infrastructure can be useful for debugging the tracing code. Currently all files in the tracing directory are set to "notrace" the functions. Add a new config option FUNCTION_SELF_TRACING that will allow some of the files in the tracing infrastructure to be traced. It requires a config to enable because it will add noise to the function tracer if events and other tracing features are enabled. Tracing functions and events together is quite common, so not tracing the event code should be the default. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Tom Zanussi <zanussi@kernel.org> Link: https://patch.msgid.link/20251120181514.736f2d5f@gandalf.local.home Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2025-11-26 15:13:30 -05:00
Steven Rostedt	400ddf1dbe	tracing: Use strim() in trigger_process_regex() instead of skip_spaces() The function trigger_process_regex() is called by a few functions, where only one calls strim() on the buffer passed to it. That leaves the other functions not trimming the end of the buffer passed in and making it a little inconsistent. Remove the strim() from event_trigger_regex_write() and have trigger_process_regex() use strim() instead of skip_spaces(). The buff variable is not passed in as const, so it can be modified. Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Tom Zanussi <zanussi@kernel.org> Link: https://patch.msgid.link/20251125214032.323747707@kernel.org Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2025-11-26 15:13:30 -05:00
Steven Rostedt	61d445af0a	tracing: Add bulk garbage collection of freeing event_trigger_data The event trigger data requires a full tracepoint_synchronize_unregister() call before freeing. That call can take 100s of milliseconds to complete. In order to allow for bulk freeing of the trigger data, it can not call the tracepoint_synchronize_unregister() for every individual trigger data being free. Create a kthread that gets created the first time a trigger data is freed, and have it use the lockless llist to get the list of data to free, run the tracepoint_synchronize_unregister() then free everything in the list. By freeing hundreds of event_trigger_data elements together, it only requires two runs of the synchronization function, and not hundreds of runs. This speeds up the operation by orders of magnitude (milliseconds instead of several seconds). Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Tom Zanussi <zanussi@kernel.org> Link: https://patch.msgid.link/20251125214032.151674992@kernel.org Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2025-11-26 15:13:30 -05:00
Steven Rostedt	78c7051394	tracing: Remove unneeded event_mutex lock in event_trigger_regex_release() In event_trigger_regex_release(), the only code is: mutex_lock(&event_mutex); if (file->f_mode & FMODE_READ) seq_release(inode, file); mutex_unlock(&event_mutex); return 0; There's nothing special about the file->f_mode or the seq_release() that requires any locking. Remove the unnecessary locks. Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Tom Zanussi <zanussi@kernel.org> Link: https://patch.msgid.link/20251125214031.975879283@kernel.org Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2025-11-26 15:13:29 -05:00
Steven Rostedt	b052d70f7c	tracing: Merge struct event_trigger_ops into struct event_command Now that there's pretty much a one to one mapping between the struct event_trigger_ops and struct event_command, there's no reason to have two different structures. Merge the function pointers of event_trigger_ops into event_command. There's one exception in trace_events_hist.c for the event_hist_trigger_named_ops. This has special logic for the init and free function pointers for "named histograms". In this case, allocate the cmd_ops of the event_trigger_data and set it to the proper init and free functions, which are used to initialize and free the event_trigger_data respectively. Have the free function and the init function (on failure) free the cmd_ops of the data element. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: https://patch.msgid.link/20251125200932.446322765@kernel.org Reviewed-by: Tom Zanussi <zanussi@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2025-11-26 15:13:29 -05:00
Steven Rostedt	bdafb4d4cb	tracing: Remove get_trigger_ops() and add count_func() from trigger ops The struct event_command has a callback function called get_trigger_ops(). This callback returns the "trigger_ops" to use for the trigger. These ops define the trigger function, how to init the trigger, how to print the trigger and how to free it. The only reason there's a callback function to get these ops is because some triggers have two types of operations. One is an "always on" operation, and the other is a "count down" operation. If a user passes in a parameter to say how many times the trigger should execute. For example: echo stacktrace:5 > events/kmem/kmem_cache_alloc/trigger It will trigger the stacktrace for the first 5 times the kmem_cache_alloc event is hit. Instead of having two different trigger_ops since the only difference between them is the tigger itself (the print, init and free functions are all the same), just use a single ops that the event_command points to and add a function field to the trigger_ops to have a count_func. When a trigger is added to an event, if there's a count attached to it and the trigger ops has the count_func field, the data allocated to represent this trigger will have a new flag set called COUNT. Then when the trigger executes, it will check if the COUNT data flag is set, and if so, it will call the ops count_func(). If that returns false, it returns without executing the trigger. This removes the need for duplicate event_trigger_ops structures. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: https://patch.msgid.link/20251125200932.274566147@kernel.org Reviewed-by: Tom Zanussi <zanussi@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2025-11-26 15:13:29 -05:00
Masami Hiramatsu (Google)	23c0e9cc76	tracing: Show the tracer options in boot-time created instance Since tracer_init_tracefs_work_func() only updates the tracer options for the global_trace, the instances created by the kernel cmdline do not have those options. Fix to update tracer options for those boot-time created instances to show those options. Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: https://patch.msgid.link/176354112555.2356172.3989277078358802353.stgit@mhiramat.tok.corp.google.com Fixes: `428add559b` ("tracing: Have tracer option be instance specific") Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2025-11-26 15:13:29 -05:00
Menglong Dong	7a6735cc9b	ftrace: Avoid redundant initialization in register_ftrace_direct The FTRACE_OPS_FL_INITIALIZED flag is cleared in register_ftrace_direct, which can make it initialized by ftrace_ops_init() even if it is already initialized. It seems that there is no big deal here, but let's still fix it. Link: https://patch.msgid.link/20251110121808.1559240-1-dongml2@chinatelecom.cn Fixes: `f64dd4627e` ("ftrace: Add multi direct register/unregister interface") Acked-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2025-11-26 15:13:28 -05:00
Steven Rostedt	49c1364c7c	tracing: Remove unused variable in tracing_trace_options_show() The flags and opts used in tracing_trace_options_show() now come directly from the trace array "current_trace_flags" and not the current_trace. The variable "trace" was still being assigned to tr->current_trace but never used. This caused a warning in clang. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20251117120637.43ef995d@gandalf.local.home Reported-by: Andy Shevchenko <andriy.shevchenko@intel.com> Tested-by: Andy Shevchenko <andriy.shevchenko@intel.com> Closes: https://lore.kernel.org/all/aRtHWXzYa8ijUIDa@black.igk.intel.com/ Fixes: `428add559b` ("tracing: Have tracer option be instance specific") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2025-11-26 15:13:28 -05:00
Steven Rostedt	ac87b220a6	fgraph: Make fgraph_no_sleep_time signed The variable fgraph_no_sleep_time changed from being a boolean to being a counter. A check is made to make sure that it never goes below zero. But the variable being unsigned makes the check always fail even if it does go below zero. Make the variable a signed int so that checking it going below zero still works. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20251125104751.4c9c7f28@gandalf.local.home Fixes: `5abb6ccb58` ("tracing: Have function graph tracer option sleep-time be per instance") Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lore.kernel.org/all/aR1yRQxDmlfLZzoo@stanley.mountain/ Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2025-11-26 15:13:28 -05:00
Steven Rostedt	bc089c4725	tracing: Convert function graph set_flags() to use a switch() statement Currently the set_flags() of the function graph tracer has a bunch of: if (bit == FLAG1) { [..] } if (bit == FLAG2) { [..] } To clean it up a bit, convert it over to a switch statement. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: https://patch.msgid.link/20251114192319.117123664@kernel.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2025-11-14 14:30:55 -05:00
Steven Rostedt	5abb6ccb58	tracing: Have function graph tracer option sleep-time be per instance Currently the option to have function graph tracer to ignore time spent when a task is sleeping is global when the interface is per-instance. Changing the value in one instance will affect the results of another instance that is also running the function graph tracer. This can lead to confusing results. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: https://patch.msgid.link/20251114192318.950255167@kernel.org Fixes: `c132be2c4f` ("function_graph: Have the instances use their own ftrace_ops for filtering") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2025-11-14 14:30:55 -05:00
Steven Rostedt	4132886e1b	tracing: Move graph-time out of function graph options The option "graph-time" affects the function profiler when it is using the function graph infrastructure. It has nothing to do with the function graph tracer itself. The option only affects the global function profiler and does nothing to the function graph tracer. Move it out of the function graph tracer options and make it a global option that is only available at the top level instance. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: https://patch.msgid.link/20251114192318.781711154@kernel.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2025-11-14 14:30:55 -05:00
Steven Rostedt	6479325eca	tracing: Have function graph tracer option funcgraph-irqs be per instance Currently the option to trace interrupts in the function graph tracer is global when the interface is per-instance. Changing the value in one instance will affect the results of another instance that is also running the function graph tracer. This can lead to confusing results. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: https://patch.msgid.link/20251114192318.613867934@kernel.org Fixes: `c132be2c4f` ("function_graph: Have the instances use their own ftrace_ops for filtering") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2025-11-14 14:30:54 -05:00
Yongliang Gao	97e047f44d	trace/pid_list: optimize pid_list->lock contention When the system has many cores and task switching is frequent, setting set_ftrace_pid can cause frequent pid_list->lock contention and high system sys usage. For example, in a 288-core VM environment, we observed 267 CPUs experiencing contention on pid_list->lock, with stack traces showing: #4 [ffffa6226fb4bc70] native_queued_spin_lock_slowpath at ffffffff99cd4b7e #5 [ffffa6226fb4bc90] _raw_spin_lock_irqsave at ffffffff99cd3e36 #6 [ffffa6226fb4bca0] trace_pid_list_is_set at ffffffff99267554 #7 [ffffa6226fb4bcc0] trace_ignore_this_task at ffffffff9925c288 #8 [ffffa6226fb4bcd8] ftrace_filter_pid_sched_switch_probe at ffffffff99246efe #9 [ffffa6226fb4bcf0] __schedule at ffffffff99ccd161 Replaces the existing spinlock with a seqlock to allow concurrent readers, while maintaining write exclusivity. Link: https://patch.msgid.link/20251113000252.1058144-1-leonylgao@gmail.com Reviewed-by: Huang Cun <cunhuang@tencent.com> Signed-off-by: Yongliang Gao <leonylgao@tencent.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2025-11-13 15:15:54 -05:00
Steven Rostedt	e29aa918a9	tracing: Have function graph tracer define options per instance Currently the function graph tracer's options are saved via a global mask when it should be per instance. Use the new infrastructure to define a "default_flags" field in the tracer structure that is used for the top level instance as well as new ones. Currently the global mask causes confusion: # cd /sys/kernel/tracing # mkdir instances/foo # echo function_graph > instances/foo/current_tracer # echo 1 > options/funcgraph-args # echo function_graph > current_tracer # cat trace [..] 2) \| _raw_spin_lock_irq(lock=0xffff96b97dea16c0) { 2) 0.422 us \| do_raw_spin_lock(lock=0xffff96b97dea16c0); 7) \| rcu_sched_clock_irq(user=0) { 2) 1.478 us \| } 7) 0.758 us \| rcu_is_cpu_rrupt_from_idle(); 2) 0.647 us \| enqueue_hrtimer(timer=0xffff96b97dea2058, base=0xffff96b97dea1740, mode=0); # cat instances/foo/options/funcgraph-args 1 # cat instances/foo/trace [..] 4) \| __x64_sys_read() { 4) \| ksys_read() { 4) 0.755 us \| fdget_pos(); 4) \| vfs_read() { 4) \| rw_verify_area() { 4) \| security_file_permission() { 4) \| apparmor_file_permission() { 4) \| common_file_perm() { 4) \| aa_file_perm() { 4) \| rcu_read_lock_held() { [..] The above shows that updating the "funcgraph-args" option at the top level instance also updates the "funcgraph-args" option in the instance but because the update is only done by the instance that gets changed (as it should), it's confusing to see that the option is already set in the other instance. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: https://patch.msgid.link/20251111232429.641030027@kernel.org Fixes: `c132be2c4f` ("function_graph: Have the instances use their own ftrace_ops for filtering") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2025-11-13 15:08:17 -05:00
Steven Rostedt	76680d0d28	tracing: Have function tracer define options per instance Currently the function tracer's options are saved via a global mask when it should be per instance. Use the new infrastructure to define a "default_flags" field in the tracer structure that is used for the top level instance as well as new ones. Currently the global mask causes confusion: # cd /sys/kernel/tracing # mkdir instances/foo # echo function > instances/foo/current_tracer # echo 1 > options/func-args # echo function > current_tracer # cat trace [..] <idle>-0 [005] d..3. 1050.656187: rcu_needs_cpu() <-tick_nohz_next_event <idle>-0 [005] d..3. 1050.656188: get_next_timer_interrupt(basej=0x10002dbad, basem=0xf45fd7d300) <-tick_nohz_next_event <idle>-0 [005] d..3. 1050.656189: _raw_spin_lock(lock=0xffff8944bdf5de80) <-__get_next_timer_interrupt <idle>-0 [005] d..4. 1050.656190: do_raw_spin_lock(lock=0xffff8944bdf5de80) <-__get_next_timer_interrupt <idle>-0 [005] d..4. 1050.656191: _raw_spin_lock_nested(lock=0xffff8944bdf5f140, subclass=1) <-__get_next_timer_interrupt # cat instances/foo/options/func-args 1 # cat instances/foo/trace [..] kworker/4:1-88 [004] ...1. 298.127735: next_zone <-refresh_cpu_vm_stats kworker/4:1-88 [004] ...1. 298.127736: first_online_pgdat <-refresh_cpu_vm_stats kworker/4:1-88 [004] ...1. 298.127738: next_online_pgdat <-refresh_cpu_vm_stats kworker/4:1-88 [004] ...1. 298.127739: fold_diff <-refresh_cpu_vm_stats kworker/4:1-88 [004] ...1. 298.127741: round_jiffies_relative <-vmstat_update [..] The above shows that updating the "func-args" option at the top level instance also updates the "func-args" option in the instance but because the update is only done by the instance that gets changed (as it should), it's confusing to see that the option is already set in the other instance. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: https://patch.msgid.link/20251111232429.470883736@kernel.org Fixes: `f20a580627` ("ftrace: Allow instances to use function tracing") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2025-11-12 09:59:54 -05:00
Steven Rostedt	428add559b	tracing: Have tracer option be instance specific Tracers can add specify options to modify them. This logic was added before instances were created and the tracer flags were global variables. After instances were created where a tracer may exist in more than one instance, the flags were not updated from being global into instance specific. This causes confusion with these options. For example, the function tracer has an option to enable function arguments: # cd /sys/kernel/tracing # mkdir instances/foo # echo function > instances/foo/current_tracer # echo 1 > options/func-args # echo function > current_tracer # cat trace [..] <idle>-0 [005] d..3. 1050.656187: rcu_needs_cpu() <-tick_nohz_next_event <idle>-0 [005] d..3. 1050.656188: get_next_timer_interrupt(basej=0x10002dbad, basem=0xf45fd7d300) <-tick_nohz_next_event <idle>-0 [005] d..3. 1050.656189: _raw_spin_lock(lock=0xffff8944bdf5de80) <-__get_next_timer_interrupt <idle>-0 [005] d..4. 1050.656190: do_raw_spin_lock(lock=0xffff8944bdf5de80) <-__get_next_timer_interrupt <idle>-0 [005] d..4. 1050.656191: _raw_spin_lock_nested(lock=0xffff8944bdf5f140, subclass=1) <-__get_next_timer_interrupt # cat instances/foo/options/func-args 1 # cat instances/foo/trace [..] kworker/4:1-88 [004] ...1. 298.127735: next_zone <-refresh_cpu_vm_stats kworker/4:1-88 [004] ...1. 298.127736: first_online_pgdat <-refresh_cpu_vm_stats kworker/4:1-88 [004] ...1. 298.127738: next_online_pgdat <-refresh_cpu_vm_stats kworker/4:1-88 [004] ...1. 298.127739: fold_diff <-refresh_cpu_vm_stats kworker/4:1-88 [004] ...1. 298.127741: round_jiffies_relative <-vmstat_update [..] The above shows that setting "func-args" in the top level instance also set it in the instance "foo", but since the interface of the trace flags are per instance, the update didn't take affect in the "foo" instance. Update the infrastructure to allow tracers to add a "default_flags" field in the tracer structure that can be set instead of "flags" which will make the flags per instance. If a tracer needs to keep the flags global (like blktrace), keeping the "flags" field set will keep the old behavior. This does not update function or the function graph tracers. That will be handled later. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: https://patch.msgid.link/20251111232429.305317942@kernel.org Fixes: `f20a580627` ("ftrace: Allow instances to use function tracing") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2025-11-12 09:59:54 -05:00
Masami Hiramatsu (Google)	7157062bb4	tracing: Report wrong dynamic event command Report wrong dynamic event type in the command via error_log. ----- # echo "z hoge" > /sys/kernel/tracing/dynamic_events sh: write error: Invalid argument # cat /sys/kernel/tracing/error_log [ 22.977022] dynevent: error: No matching dynamic event type Command: z hoge ^ ----- Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/176278970056.343441.10528135217342926645.stgit@devnote2 Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2025-11-10 19:26:14 -05:00
Steven Rostedt	3a0d5bc76f	tracing: Use switch statement instead of ifs in set_tracer_flag() The "mask" passed in to set_trace_flag() has a single bit set. The function then checks if the mask is equal to one of the option masks and performs the appropriate function associated to that option. Instead of having a bunch of "if ()" statement, use a "switch ()" statement instead to make it cleaner and a bit more optimal. No function changes. Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: https://patch.msgid.link/20251106003501.890298562@kernel.org Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2025-11-10 14:33:51 -05:00
Steven Rostedt	9c5053083e	tracing: Exit out immediately after update_marker_trace() The call to update_marker_trace() in set_tracer_flag() performs the update to the tr->trace_flags. There's no reason to perform it again after it is called. Return immediately instead. Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: https://patch.msgid.link/20251106003501.726406870@kernel.org Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2025-11-10 14:33:43 -05:00
Steven Rostedt	5aa0d18df0	tracing: Have add_tracer_options() error pass up to callers The function add_tracer_options() can fail, but currently it is ignored. Pass the status of add_tracer_options() up to adding a new tracer as well as when an instance is created. Have the instance creation fail if the add_tracer_options() fail. Only print a warning for the top level instance, like it does with other failures. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: https://patch.msgid.link/20251105161935.375299297@kernel.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2025-11-10 14:28:31 -05:00
Steven Rostedt	c7bed15ccf	tracing: Remove dummy options and flags When a tracer does not define their own flags, dummy options and flags are used so that the values are always valid. There's not that many locations that reference these values so having dummy versions just complicates the code. Remove the dummy values and just check for NULL when appropriate. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: https://patch.msgid.link/20251105161935.206093132@kernel.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2025-11-10 14:27:04 -05:00
Steven Rostedt	a10e6e6818	tracing: Hide __NR_utimensat and _NR_mq_timedsend when not defined Some architectures (riscv-32) do not define __NR_utimensat and _NR_mq_timedsend, and fails to build when they are used. Hide them in "ifdef"s. Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20251104205310.00a1db9a@batman.local.home Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202511031239.ZigDcWzY-lkp@intel.com/ Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2025-11-10 14:23:53 -05:00
Steven Rostedt	2f294c35c0	Merge branch 'topic/func-profiler-offset' of git://git.kernel.org/pub/scm/linux/kernel/git/mhiramat/linux into trace/trace/core Updates to the function profiler adds new options to tracefs. The options are currently defined by an enum as flags. The added options brings the number of options over 32, which means they can no longer be held in a 32 bit enum. The TRACE_ITER_* flags are converted to a macro TRACE_ITER() to allow the creation of options to still be done by macros. This change is intrusive, as it affects all TRACE_ITER options throughout the trace code. Merge the branch that added these options and converted the TRACE_ITER_* enum into a TRACE_ITER(*) macro, to allow the topic branches to still be developed without conflict. Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2025-11-04 10:12:32 -05:00
Masami Hiramatsu (Google)	1149fcf759	tracing: Add an option to show symbols in _text+offset for function profiler Function profiler shows the hit count of each function using its symbol name. However, there are some same-name local symbols, which we can not distinguish. To solve this issue, this introduces an option to show the symbols in "_text+OFFSET" format. This can avoid exposing the random shift of KASLR. The functions in modules are shown as "MODNAME+OFFSET" where the offset is from ".text". E.g. for the kernel text symbols, specify vmlinux and the output to addr2line, you can find the actual function and source info; $ addr2line -fie vmlinux _text+3078208 __balance_callbacks kernel/sched/core.c:5064 for modules, specify the module file and .text+OFFSET; $ addr2line -fie samples/trace_events/trace-events-sample.ko .text+8224 do_simple_thread_func samples/trace_events/trace-events-sample.c:23 Link: https://lore.kernel.org/all/176187878064.994619.8878296550240416558.stgit@devnote2/ Suggested-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>	2025-11-04 21:44:18 +09:00
Masami Hiramatsu (Google)	bbec8e28ca	tracing: Allow tracer to add more than 32 options Since enum trace_iterator_flags is 32bit, the max number of the option flags is limited to 32 and it is fully used now. To add a new option, we need to expand it. So replace the TRACE_ITER_##flag with TRACE_ITER(flag) macro which is 64bit bitmask. Link: https://lore.kernel.org/all/176187877103.994619.166076000668757232.stgit@devnote2/ Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>	2025-11-04 21:44:00 +09:00
Steven Rostedt	25bd47a592	tracing: Have persistent ring buffer print syscalls normally The persistent ring buffer from a previous boot has to be careful printing events as the print formats of random events can have pointers to strings and such that are not available. Ftrace static events (like the function tracer event) are stable and are printed normally. System call event formats are also stable. Allow them to be printed normally as well: Instead of: <...>-1 [005] ...1. 57.240405: sys_enter_waitid: __syscall_nr=0xf7 (247) which=0x1 (1) upid=0x499 (1177) infop=0x7ffd5294d690 (140725988939408) options=0x5 (5) ru=0x0 (0) <...>-1 [005] ...1. 57.240433: sys_exit_waitid: __syscall_nr=0xf7 (247) ret=0x0 (0) <...>-1 [005] ...1. 57.240437: sys_enter_rt_sigprocmask: __syscall_nr=0xe (14) how=0x2 (2) nset=0x7ffd5294d7c0 (140725988939712) oset=0x0 (0) sigsetsize=0x8 (8) <...>-1 [005] ...1. 57.240438: sys_exit_rt_sigprocmask: __syscall_nr=0xe (14) ret=0x0 (0) <...>-1 [005] ...1. 57.240442: sys_enter_close: __syscall_nr=0x3 (3) fd=0x4 (4) <...>-1 [005] ...1. 57.240463: sys_exit_close: __syscall_nr=0x3 (3) ret=0x0 (0) <...>-1 [005] ...1. 57.240485: sys_enter_openat: __syscall_nr=0x101 (257) dfd=0xffffffffffdfff9c (-2097252) filename=(0xffff8b81639ca01c) flags=0x80000 (524288) mode=0x0 (0) __filename_val=/run/systemd/reboot-param <...>-1 [005] ...1. 57.240555: sys_exit_openat: __syscall_nr=0x101 (257) ret=0xffffffffffdffffe (-2097154) <...>-1 [005] ...1. 57.240571: sys_enter_openat: __syscall_nr=0x101 (257) dfd=0xffffffffffdfff9c (-2097252) filename=(0xffff8b81639ca01c) flags=0x80000 (524288) mode=0x0 (0) __filename_val=/run/systemd/reboot-param <...>-1 [005] ...1. 57.240620: sys_exit_openat: __syscall_nr=0x101 (257) ret=0xffffffffffdffffe (-2097154) <...>-1 [005] ...1. 57.240629: sys_enter_writev: __syscall_nr=0x14 (20) fd=0x3 (3) vec=0x7ffd5294ce50 (140725988937296) vlen=0x7 (7) <...>-1 [005] ...1. 57.242281: sys_exit_writev: __syscall_nr=0x14 (20) ret=0x24 (36) <...>-1 [005] ...1. 57.242286: sys_enter_reboot: __syscall_nr=0xa9 (169) magic1=0xfee1dead (4276215469) magic2=0x28121969 (672274793) cmd=0x1234567 (19088743) arg=0x0 (0) Have: <...>-1 [000] ...1. 91.446011: sys_waitid(which: 1, upid: 0x4d2, infop: 0x7ffdccdadfd0, options: 5, ru: 0) <...>-1 [000] ...1. 91.446042: sys_waitid -> 0x0 <...>-1 [000] ...1. 91.446045: sys_rt_sigprocmask(how: 2, nset: 0x7ffdccdae100, oset: 0, sigsetsize: 8) <...>-1 [000] ...1. 91.446047: sys_rt_sigprocmask -> 0x0 <...>-1 [000] ...1. 91.446051: sys_close(fd: 4) <...>-1 [000] ...1. 91.446073: sys_close -> 0x0 <...>-1 [000] ...1. 91.446095: sys_openat(dfd: 18446744073709551516, filename: 139732544945794 "/run/systemd/reboot-param", flags: O_RDONLY\|O_CLOEXEC) <...>-1 [000] ...1. 91.446165: sys_openat -> 0xfffffffffffffffe <...>-1 [000] ...1. 91.446182: sys_openat(dfd: 18446744073709551516, filename: 139732544945794 "/run/systemd/reboot-param", flags: O_RDONLY\|O_CLOEXEC) <...>-1 [000] ...1. 91.446233: sys_openat -> 0xfffffffffffffffe <...>-1 [000] ...1. 91.446242: sys_writev(fd: 3, vec: 0x7ffdccdad790, vlen: 7) <...>-1 [000] ...1. 91.447877: sys_writev -> 0x24 <...>-1 [000] ...1. 91.447883: sys_reboot(magic1: 0xfee1dead, magic2: 0x28121969, cmd: 0x1234567, arg: 0) Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Takaya Saeki <takayas@google.com> Cc: Tom Zanussi <zanussi@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ian Rogers <irogers@google.com> Cc: Douglas Raillard <douglas.raillard@arm.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Ingo Molnar <mingo@redhat.com> Link: https://lore.kernel.org/20251028231149.097404581@kernel.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2025-10-28 20:11:00 -04:00
Steven Rostedt	b6e5d971fc	tracing: Check for printable characters when printing field dyn strings When the "fields" option is enabled, it prints each trace event field based on its type. But a dynamic array and a dynamic string can both have a "char *" type. Printing it as a string can cause escape characters to be printed and mess up the output of the trace. For dynamic strings, test if there are any non-printable characters, and if so, print both the string with the non printable characters as '.', and the print the hex value of the array. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Takaya Saeki <takayas@google.com> Cc: Tom Zanussi <zanussi@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ian Rogers <irogers@google.com> Cc: Douglas Raillard <douglas.raillard@arm.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Ingo Molnar <mingo@redhat.com> Link: https://lore.kernel.org/20251028231148.929243047@kernel.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2025-10-28 20:10:59 -04:00
Steven Rostedt	64b627c8da	tracing: Add parsing of flags to the sys_enter_openat trace event Add some logic to give the openat system call trace event a bit more human readable information: syscalls:sys_enter_openat: dfd: 0xffffff9c, filename: 0x7f0053dc121c "/etc/ld.so.cache", flags: O_RDONLY\|O_CLOEXEC, mode: 0000 The above is output from "perf script" and now shows the flags used by the openat system call. Since the output from tracing is in the kernel, it can also remove the mode field when not used (when flags does not contain O_CREATE\|O_TMPFILE) touch-1185 [002] ...1. 1291.690154: sys_openat(dfd: 4294967196, filename: 139785545139344 "/usr/lib/locale/locale-archive", flags: O_RDONLY\|O_CLOEXEC) touch-1185 [002] ...1. 1291.690504: sys_openat(dfd: 18446744073709551516, filename: 140733603151330 "/tmp/x", flags: O_WRONLY\|O_CREAT\|O_NOCTTY\|O_NONBLOCK, mode: 0666) As system calls have a fixed ABI, their trace events can be extended. This currently only updates the openat system call, but others may be extended in the future. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Takaya Saeki <takayas@google.com> Cc: Tom Zanussi <zanussi@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ian Rogers <irogers@google.com> Cc: Douglas Raillard <douglas.raillard@arm.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Ingo Molnar <mingo@redhat.com> Link: https://lore.kernel.org/20251028231148.763161484@kernel.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2025-10-28 20:10:59 -04:00
Steven Rostedt	e77ad6da90	tracing: Show printable characters in syscall arrays When displaying the contents of the user space data passed to the kernel, instead of just showing the array values, also print any printable content. Instead of just: bash-1113 [003] ..... 3433.290654: sys_write(fd: 2, buf: 0x555a8deeddb0 (72:6f:6f:74:40:64:65:62:69:61:6e:2d:78:38:36:2d:36:34:3a:7e:23:20), count: 0x16) Display: bash-1113 [003] ..... 3433.290654: sys_write(fd: 2, buf: 0x555a8deeddb0 (72:6f:6f:74:40:64:65:62:69:61:6e:2d:78:38:36:2d:36:34:3a:7e:23:20) "root@debian-x86-64:~# ", count: 0x16) This only affects tracing and does not affect perf, as this only updates the output from the kernel. The output from perf is via user space. This may change by an update to libtraceevent that will then update perf to have this as well. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Takaya Saeki <takayas@google.com> Cc: Tom Zanussi <zanussi@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ian Rogers <irogers@google.com> Cc: Douglas Raillard <douglas.raillard@arm.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Ingo Molnar <mingo@redhat.com> Link: https://lore.kernel.org/20251028231148.429422865@kernel.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2025-10-28 20:10:59 -04:00
Steven Rostedt	299ea67e6a	tracing: Add a config and syscall_user_buf_size file to limit amount written When a system call that can copy user space addresses into the ring buffer, it can copy up to 511 bytes of data. This can waste precious ring buffer space if the user isn't interested in the output. Add a new file "syscall_user_buf_size" that gets initialized to a new config CONFIG_SYSCALL_BUF_SIZE_DEFAULT that defaults to 63. The config also is used to limit how much perf can read from user space. Also lower the max down to 165, as this isn't to record everything that a system call may be passing through to the kernel. 165 is more than enough. The reason for 165 is because adding one for the nul terminating byte, as well as possibly needing to append the "..." string turns it into 170 bytes. As this needs to save up to 3 arguments and 3 * 170 is 510 which fits nicely in 512 bytes (a power of 2). Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Takaya Saeki <takayas@google.com> Cc: Tom Zanussi <zanussi@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ian Rogers <irogers@google.com> Cc: Douglas Raillard <douglas.raillard@arm.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Ingo Molnar <mingo@redhat.com> Link: https://lore.kernel.org/20251028231148.260068913@kernel.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2025-10-28 20:10:59 -04:00
Steven Rostedt	baa031b7bd	tracing: Allow syscall trace events to read more than one user parameter Allow more than one field of a syscall trace event to read user space. Build on top of the user_mask by allowing more than one bit to be set that corresponds to the @args array of the syscall metadata. For each argument in the @args array that is to be read, it will have a dynamic array/string field associated to it. Note that multiple fields to be read from user space is not supported if the user_arg_size field is set in the syscall metada. That field can only be used if only one field is being read from user space as that field is a number representing the size field of the syscall event that holds the size of the data to read from user space. It becomes ambiguous if the system call reads more than one field. Currently this is not an issue. If a syscall event happens to enable two events to read user space and sets the user_arg_size field, it will trigger a warning at boot and the user_arg_size field will be cleared. The per CPU buffer that is used to read the user space addresses is now broken up into 3 sections, each of 168 bytes. The reason for 168 is that it is the biggest portion of 512 bytes divided by 3 that is 8 byte aligned. The max amount copied into the ring buffer from user space is now only 128 bytes, which is plenty. When reading user space, it still reads 167 (168-1) bytes and uses the remaining to know if it should append the extra "..." to the end or not. This will allow the event to look like this: sys_renameat2(olddfd: 0xffffff9c, oldname: 0x7ffe02facdff "/tmp/x", newdfd: 0xffffff9c, newname: 0x7ffe02face06 "/tmp/y", flags: 1) Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Takaya Saeki <takayas@google.com> Cc: Tom Zanussi <zanussi@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ian Rogers <irogers@google.com> Cc: Douglas Raillard <douglas.raillard@arm.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Ingo Molnar <mingo@redhat.com> Link: https://lore.kernel.org/20251028231148.095789277@kernel.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2025-10-28 20:10:58 -04:00
Steven Rostedt	011ea0501d	tracing: Display some syscall arrays as strings Some of the system calls that read a fixed length of memory from the user space address are not arrays but strings. Take a bit away from the nb_args field in the syscall meta data to use as a flag to denote that the system call's user_arg_size is being used as a string. The nb_args should never be more than 6, so 7 bits is plenty to hold that number. When the user_arg_is_str flag that, when set, will display the data array from the user space address as a string and not an array. This will allow the output to look like this: sys_sethostname(name: 0x5584310eb2a0 "debian", len: 6) Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Takaya Saeki <takayas@google.com> Cc: Tom Zanussi <zanussi@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ian Rogers <irogers@google.com> Cc: Douglas Raillard <douglas.raillard@arm.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Ingo Molnar <mingo@redhat.com> Link: https://lore.kernel.org/20251028231147.930550359@kernel.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2025-10-28 20:10:58 -04:00
Steven Rostedt	b4f7624cfc	tracing: Have system call events record user array data For system call events that have a length field, add a "user_arg_size" parameter to the system call meta data that denotes the index of the args array that holds the size of arg that the user_mask field has a bit set for. The "user_mask" has a bit set that denotes the arg that points to an array in the user space address space and if a system call event has the user_mask field set and the user_arg_size set, it will then record the content of that address into the trace event, up to the size defined by SYSCALL_FAULT_BUF_SZ - 1. This allows the output to look like: sys_write(fd: 0xa, buf: 0x5646978d13c0 (01:00:05:00:00:00:00:00:01:87:55:89:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00), count: 0x20) Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Takaya Saeki <takayas@google.com> Cc: Tom Zanussi <zanussi@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ian Rogers <irogers@google.com> Cc: Douglas Raillard <douglas.raillard@arm.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Ingo Molnar <mingo@redhat.com> Link: https://lore.kernel.org/20251028231147.763528474@kernel.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2025-10-28 20:10:58 -04:00
Steven Rostedt	2e82e256df	perf: tracing: Have perf system calls read user space Allow some of the system call events to read user space buffers. Instead of just showing the pointer into user space, allow perf events to also record the content of those pointers. For example: # perf record -e syscalls:sys_enter_openat ls /usr/bin [..] # perf script ls 1024 [005] 52.902721: syscalls:sys_enter_openat: dfd: 0xffffff9c, filename: 0x7fc1dbae321c "/etc/ld.so.cache", flags: 0x00080000, mode: 0x00000000 ls 1024 [005] 52.902899: syscalls:sys_enter_openat: dfd: 0xffffff9c, filename: 0x7fc1dbaae140 "/lib/x86_64-linux-gnu/libselinux.so.1", flags: 0x00080000, mode: 0x00000000 ls 1024 [005] 52.903471: syscalls:sys_enter_openat: dfd: 0xffffff9c, filename: 0x7fc1dbaae690 "/lib/x86_64-linux-gnu/libcap.so.2", flags: 0x00080000, mode: 0x00000000 ls 1024 [005] 52.903946: syscalls:sys_enter_openat: dfd: 0xffffff9c, filename: 0x7fc1dbaaebe0 "/lib/x86_64-linux-gnu/libc.so.6", flags: 0x00080000, mode: 0x00000000 ls 1024 [005] 52.904629: syscalls:sys_enter_openat: dfd: 0xffffff9c, filename: 0x7fc1dbaaf110 "/lib/x86_64-linux-gnu/libpcre2-8.so.0", flags: 0x00080000, mode: 0x00000000 ls 1024 [005] 52.906985: syscalls:sys_enter_openat: dfd: 0xffffffffffffff9c, filename: 0x7fc1dba92904 "/proc/filesystems", flags: 0x00080000, mode: 0x00000000 ls 1024 [005] 52.907323: syscalls:sys_enter_openat: dfd: 0xffffff9c, filename: 0x7fc1dba19490 "/usr/lib/locale/locale-archive", flags: 0x00080000, mode: 0x00000000 ls 1024 [005] 52.907746: syscalls:sys_enter_openat: dfd: 0xffffff9c, filename: 0x556fb888dcd0 "/usr/bin", flags: 0x00090800, mode: 0x00000000 Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Takaya Saeki <takayas@google.com> Cc: Tom Zanussi <zanussi@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ian Rogers <irogers@google.com> Cc: Douglas Raillard <douglas.raillard@arm.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Ingo Molnar <mingo@redhat.com> Link: https://lore.kernel.org/20251028231147.593925979@kernel.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2025-10-28 20:10:58 -04:00
Steven Rostedt	bd1b80fba7	perf: tracing: Simplify perf_sysenter_enable/disable() with guards Use guard(mutex)(&syscall_trace_lock) for perf_sysenter_enable() and perf_sysenter_disable() as well as for the perf_sysexit_enable() and perf_sysexit_disable(). This will make it easier to update these functions with other code that has early exit handling. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Takaya Saeki <takayas@google.com> Cc: Tom Zanussi <zanussi@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ian Rogers <irogers@google.com> Cc: Douglas Raillard <douglas.raillard@arm.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Ingo Molnar <mingo@redhat.com> Link: https://lore.kernel.org/20251028231147.429583335@kernel.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2025-10-28 20:10:58 -04:00
Steven Rostedt	a544d9a66b	tracing: Have syscall trace events read user space string As of commit `654ced4a13` ("tracing: Introduce tracepoint_is_faultable()") system call trace events allow faulting in user space memory. Have some of the system call trace events take advantage of this. Use the trace_user_fault_read() logic to read the user space buffer from user space and instead of just saving the pointer to the buffer in the system call event, also save the string that is passed in. The syscall event has its nb_args shorten from an int to a short (where even u8 is plenty big enough) and the freed two bytes are used for "user_mask". The new "user_mask" field is used to store the index of the "args" field array that has the address to read from user space. This value is set to 0 if the system call event does not need to read user space for a field. This mask can be used to know if the event may fault or not. Only one bit set in user_mask is supported at this time. This allows the output to look like this: sys_access(filename: 0x7f8c55368470 "/etc/ld.so.preload", mode: 4) sys_execve(filename: 0x564ebcf5a6b8 "/usr/bin/emacs", argv: 0x7fff357c0300, envp: 0x564ebc4a4820) Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Takaya Saeki <takayas@google.com> Cc: Tom Zanussi <zanussi@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ian Rogers <irogers@google.com> Cc: Douglas Raillard <douglas.raillard@arm.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Ingo Molnar <mingo@redhat.com> Link: https://lore.kernel.org/20251028231147.261867956@kernel.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2025-10-28 20:10:58 -04:00
Steven Rostedt	a9f1687264	tracing: Make trace_user_fault_read() exposed to rest of tracing The write to the trace_marker file is a critical section where it cannot take locks nor allocate memory. To read from user space, it allocates a per CPU buffer when the trace_marker file is opened, and then when the write system call is performed, it uses the following method to read from user space: preempt_disable(); buffer = per_cpu_ptr(cpu_buffers, cpu); do { cnt = nr_context_switches_cpu(); migrate_disable(); preempt_enable(); ret = copy_from_user(buffer, ptr, len); preempt_disable(); migrate_enable(); } while (!ret && cnt != nr_context_switches_cpu()); if (!ret) ring_buffer_write(buffer); preempt_enable(); It records the number of context switches for the current CPU, enables preemption, copies from user space, disable preemption and then checks if the number of context switches changed. If it did not, then the buffer is valid, otherwise the buffer may have been corrupted and the read from user space must be tried again. The system call trace events are now faultable and have the same restrictions as the trace_marker write. For system calls to read the user space buffer (for example to read the file of the openat system call), it needs the same logic. Instead of copying the code over to the system call trace events, make the code generic to allow the system call trace events to use the same code. The following API is added internally to the tracing sub system (these are only exposed within the tracing subsystem and not to be used outside of it): trace_user_fault_init() - initializes a trace_user_buf_info descriptor that will allocate the per CPU buffers to copy from user space into. trace_user_fault_destroy() - used to free the allocations made by trace_user_fault_init(). trace_user_fault_get() - update the ref count of the info descriptor to allow more than one user to use the same descriptor. trace_user_fault_put() - decrement the ref count. trace_user_fault_read() - performs the above action to read user space into the per CPU buffer. The preempt_disable() is expected before calling this function and preemption must remain disabled while the buffer returned is in use. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Takaya Saeki <takayas@google.com> Cc: Tom Zanussi <zanussi@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ian Rogers <irogers@google.com> Cc: Douglas Raillard <douglas.raillard@arm.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Ingo Molnar <mingo@redhat.com> Link: https://lore.kernel.org/20251028231147.096570057@kernel.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2025-10-28 20:10:57 -04:00
Nam Cao	3d62f95bd8	rv: Make rtapp/pagefault monitor depends on CONFIG_MMU There is no page fault without MMU. Compiling the rtapp/pagefault monitor without CONFIG_MMU fails as page fault tracepoints' definitions are not available. Make rtapp/pagefault monitor depends on CONFIG_MMU. Fixes: `9162620eb6` ("rv: Add rtapp_pagefault monitor") Signed-off-by: Nam Cao <namcao@linutronix.de> Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202509260455.6Z9Vkty4-lkp@intel.com/ Cc: stable@vger.kernel.org Reviewed-by: Gabriele Monaco <gmonaco@redhat.com> Link: https://lore.kernel.org/r/20251002082317.973839-1-namcao@linutronix.de Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>	2025-10-20 12:47:40 +02:00
Nam Cao	103541e6a5	rv: Fully convert enabled_monitors to use list_head as iterator The callbacks in enabled_monitors_seq_ops are inconsistent. Some treat the iterator as struct rv_monitor , while others treat the iterator as struct list_head . This causes a wrong type cast and crashes the system as reported by Nathan. Convert everything to use struct list_head * as iterator. This also makes enabled_monitors consistent with available_monitors. Fixes: `de090d1cca` ("rv: Fix wrong type cast in enabled_monitors_next()") Reported-by: Nathan Chancellor <nathan@kernel.org> Closes: https://lore.kernel.org/linux-trace-kernel/20250923002004.GA2836051@ax162/ Signed-off-by: Nam Cao <namcao@linutronix.de> Cc: stable@vger.kernel.org Reviewed-by: Gabriele Monaco <gmonaco@redhat.com> Link: https://lore.kernel.org/r/20251002082235.973099-1-namcao@linutronix.de Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>	2025-10-20 12:47:40 +02:00
Linus Torvalds	67029a49db	Merge tag 'trace-v6.18-3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing fixes from Steven Rostedt: "The previous fix to trace_marker required updating trace_marker_raw as well. The difference between trace_marker_raw from trace_marker is that the raw version is for applications to write binary structures directly into the ring buffer instead of writing ASCII strings. This is for applications that will read the raw data from the ring buffer and get the data structures directly. It's a bit quicker than using the ASCII version. Unfortunately, it appears that our test suite has several tests that test writes to the trace_marker file, but lacks any tests to the trace_marker_raw file (this needs to be remedied). Two issues came about the update to the trace_marker_raw file that syzbot found: - Fix tracing_mark_raw_write() to use per CPU buffer The fix to use the per CPU buffer to copy from user space was needed for both the trace_maker and trace_maker_raw file. The fix for reading from user space into per CPU buffers properly fixed the trace_marker write function, but the trace_marker_raw file wasn't fixed properly. The user space data was correctly written into the per CPU buffer, but the code that wrote into the ring buffer still used the user space pointer and not the per CPU buffer that had the user space data already written. - Stop the fortify string warning from writing into trace_marker_raw After converting the copy_from_user_nofault() into a memcpy(), another issue appeared. As writes to the trace_marker_raw expects binary data, the first entry is a 4 byte identifier. The entry structure is defined as: struct { struct trace_entry ent; int id; char buf[]; }; The size of this structure is reserved on the ring buffer with: size = sizeof(entry) + cnt; Then it is copied from the buffer into the ring buffer with: memcpy(&entry->id, buf, cnt); This use to be a copy_from_user_nofault(), but now converting it to a memcpy() triggers the fortify-string code, and causes a warning. The allocated space is actually more than what is copied, as the cnt used also includes the entry->id portion. Allocating sizeof(entry) plus cnt is actually allocating 4 bytes more than what is needed. Change the size function to: size = struct_size(entry, buf, cnt - sizeof(entry->id)); And update the memcpy() to unsafe_memcpy()" * tag 'trace-v6.18-3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing: Stop fortify-string from warning in tracing_mark_raw_write() tracing: Fix tracing_mark_raw_write() to use buf and not ubuf	2025-10-11 16:06:04 -07:00
Steven Rostedt	54b91e54b1	tracing: Stop fortify-string from warning in tracing_mark_raw_write() The way tracing_mark_raw_write() records its data is that it has the following structure: struct { struct trace_entry; int id; char buf[]; }; But memcpy(&entry->id, buf, size) triggers the following warning when the size is greater than the id: ------------[ cut here ]------------ memcpy: detected field-spanning write (size 6) of single field "&entry->id" at kernel/trace/trace.c:7458 (size 4) WARNING: CPU: 7 PID: 995 at kernel/trace/trace.c:7458 write_raw_marker_to_buffer.isra.0+0x1f9/0x2e0 Modules linked in: CPU: 7 UID: 0 PID: 995 Comm: bash Not tainted 6.17.0-test-00007-g60b82183e78a-dirty #211 PREEMPT(voluntary) Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.17.0-debian-1.17.0-1 04/01/2014 RIP: 0010:write_raw_marker_to_buffer.isra.0+0x1f9/0x2e0 Code: 04 00 75 a7 b9 04 00 00 00 48 89 de 48 89 04 24 48 c7 c2 e0 b1 d1 b2 48 c7 c7 40 b2 d1 b2 c6 05 2d 88 6a 04 01 e8 f7 e8 bd ff <0f> 0b 48 8b 04 24 e9 76 ff ff ff 49 8d 7c 24 04 49 8d 5c 24 08 48 RSP: 0018:ffff888104c3fc78 EFLAGS: 00010292 RAX: 0000000000000000 RBX: 0000000000000006 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 1ffffffff6b363b4 RDI: 0000000000000001 RBP: ffff888100058a00 R08: ffffffffb041d459 R09: ffffed1020987f40 R10: 0000000000000007 R11: 0000000000000001 R12: ffff888100bb9010 R13: 0000000000000000 R14: 00000000000003e3 R15: ffff888134800000 FS: 00007fa61d286740(0000) GS:ffff888286cad000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000560d28d509f1 CR3: 00000001047a4006 CR4: 0000000000172ef0 Call Trace: <TASK> tracing_mark_raw_write+0x1fe/0x290 ? __pfx_tracing_mark_raw_write+0x10/0x10 ? security_file_permission+0x50/0xf0 ? rw_verify_area+0x6f/0x4b0 vfs_write+0x1d8/0xdd0 ? __pfx_vfs_write+0x10/0x10 ? __pfx_css_rstat_updated+0x10/0x10 ? count_memcg_events+0xd9/0x410 ? fdget_pos+0x53/0x5e0 ksys_write+0x182/0x200 ? __pfx_ksys_write+0x10/0x10 ? do_user_addr_fault+0x4af/0xa30 do_syscall_64+0x63/0x350 entry_SYSCALL_64_after_hwframe+0x76/0x7e RIP: 0033:0x7fa61d318687 Code: 48 89 fa 4c 89 df e8 58 b3 00 00 8b 93 08 03 00 00 59 5e 48 83 f8 fc 74 1a 5b c3 0f 1f 84 00 00 00 00 00 48 8b 44 24 10 0f 05 <5b> c3 0f 1f 80 00 00 00 00 83 e2 39 83 fa 08 75 de e8 23 ff ff ff RSP: 002b:00007ffd87fe0120 EFLAGS: 00000202 ORIG_RAX: 0000000000000001 RAX: ffffffffffffffda RBX: 00007fa61d286740 RCX: 00007fa61d318687 RDX: 0000000000000006 RSI: 0000560d28d509f0 RDI: 0000000000000001 RBP: 0000560d28d509f0 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000006 R13: 00007fa61d4715c0 R14: 00007fa61d46ee80 R15: 0000000000000000 </TASK> ---[ end trace 0000000000000000 ]--- This is because fortify string sees that the size of entry->id is only 4 bytes, but it is writing more than that. But this is OK as the dynamic_array is allocated to handle that copy. The size allocated on the ring buffer was actually a bit too big: size = sizeof(entry) + cnt; But cnt includes the 'id' and the buffer data, so adding cnt to the size of entry actually allocates too much on the ring buffer. Change the allocation to: size = struct_size(entry, buf, cnt - sizeof(entry->id)); and the memcpy() to unsafe_memcpy() with an added justification. Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: https://lore.kernel.org/20251011112032.77be18e4@gandalf.local.home Fixes: `64cf7d058a` ("tracing: Have trace_marker use per-cpu data to read user space") Reported-by: syzbot+9a2ede1643175f350105@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/68e973f5.050a0220.1186a4.0010.GAE@google.com/ Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2025-10-11 11:27:27 -04:00
Steven Rostedt	bda745ee8f	tracing: Fix tracing_mark_raw_write() to use buf and not ubuf The fix to use a per CPU buffer to read user space tested only the writes to trace_marker. But it appears that the selftests are missing tests to the trace_maker_raw file. The trace_maker_raw file is used by applications that writes data structures and not strings into the file, and the tools read the raw ring buffer to process the structures it writes. The fix that reads the per CPU buffers passes the new per CPU buffer to the trace_marker file writes, but the update to the trace_marker_raw write read the data from user space into the per CPU buffer, but then still used then passed the user space address to the function that records the data. Pass in the per CPU buffer and not the user space address. TODO: Add a test to better test trace_marker_raw. Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: https://lore.kernel.org/20251011035243.386098147@kernel.org Fixes: `64cf7d058a` ("tracing: Have trace_marker use per-cpu data to read user space") Reported-by: syzbot+9a2ede1643175f350105@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/68e973f5.050a0220.1186a4.0010.GAE@google.com/ Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2025-10-10 23:58:44 -04:00
Linus Torvalds	5472d60c12	Merge tag 'trace-v6.18-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing clean up and fixes from Steven Rostedt: - Have osnoise tracer use memdup_user_nul() The function osnoise_cpus_write() open codes a kmalloc() and then a copy_from_user() and then adds a nul byte at the end which is the same as simply using memdup_user_nul(). - Fix wakeup and irq tracers when failing to acquire calltime When the wakeup and irq tracers use the function graph tracer for tracing function times, it saves a timestamp into the fgraph shadow stack. It is possible that this could fail to be stored. If that happens, it exits the routine early. These functions also disable nesting of the operations by incremeting the data "disable" counter. But if the calltime exits out early, it never increments the counter back to what it needs to be. Since there's only a couple of lines of code that does work after acquiring the calltime, instead of exiting out early, reverse the if statement to be true if calltime is acquired, and place the code that is to be done within that if block. The clean up will always be done after that. - Fix ring_buffer_map() return value on failure of __rb_map_vma() If __rb_map_vma() fails in ring_buffer_map(), it does not return an error. This means the caller will be working against a bad vma mapping. Have ring_buffer_map() return an error when __rb_map_vma() fails. - Fix regression of writing to the trace_marker file A bug fix was made to change __copy_from_user_inatomic() to copy_from_user_nofault() in the trace_marker write function. The trace_marker file is used by applications to write into it (usually with a file descriptor opened at the start of the program) to record into the tracing system. It's usually used in critical sections so the write to trace_marker is highly optimized. The reason for copying in an atomic section is that the write reserves space on the ring buffer and then writes directly into it. After it writes, it commits the event. The time between reserve and commit must have preemption disabled. The trace marker write does not have any locking nor can it allocate due to the nature of it being a critical path. Unfortunately, converting __copy_from_user_inatomic() to copy_from_user_nofault() caused a regression in Android. Now all the writes from its applications trigger the fault that is rejected by the _nofault() version that wasn't rejected by the _inatomic() version. Instead of getting data, it now just gets a trace buffer filled with: tracing_mark_write: <faulted> To fix this, on opening of the trace_marker file, allocate per CPU buffers that can be used by the write call. Then when entering the write call, do the following: preempt_disable(); cpu = smp_processor_id(); buffer = per_cpu_ptr(cpu_buffers, cpu); do { cnt = nr_context_switches_cpu(cpu); migrate_disable(); preempt_enable(); ret = copy_from_user(buffer, ptr, size); preempt_disable(); migrate_enable(); } while (!ret && cnt != nr_context_switches_cpu(cpu)); if (!ret) ring_buffer_write(buffer); preempt_enable(); This works similarly to seqcount. As it must enabled preemption to do a copy_from_user() into a per CPU buffer, if it gets preempted, the buffer could be corrupted by another task. To handle this, read the number of context switches of the current CPU, disable migration, enable preemption, copy the data from user space, then immediately disable preemption again. If the number of context switches is the same, the buffer is still valid. Otherwise it must be assumed that the buffer may have been corrupted and it needs to try again. Now the trace_marker write can get the user data even if it has to fault it in, and still not grab any locks of its own. * tag 'trace-v6.18-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing: Have trace_marker use per-cpu data to read user space ring buffer: Propagate __rb_map_vma return value to caller tracing: Fix irqoff tracers on failure of acquiring calltime tracing: Fix wakeup tracers on failure of acquiring calltime tracing/osnoise: Replace kmalloc + copy_from_user with memdup_user_nul	2025-10-09 12:18:22 -07:00
Steven Rostedt	64cf7d058a	tracing: Have trace_marker use per-cpu data to read user space It was reported that using __copy_from_user_inatomic() can actually schedule. Which is bad when preemption is disabled. Even though there's logic to check in_atomic() is set, but this is a nop when the kernel is configured with PREEMPT_NONE. This is due to page faulting and the code could schedule with preemption disabled. Link: https://lore.kernel.org/all/20250819105152.2766363-1-luogengkun@huaweicloud.com/ The solution was to change the __copy_from_user_inatomic() to copy_from_user_nofault(). But then it was reported that this caused a regression in Android. There's several applications writing into trace_marker() in Android, but now instead of showing the expected data, it is showing: tracing_mark_write: <faulted> After reverting the conversion to copy_from_user_nofault(), Android was able to get the data again. Writes to the trace_marker is a way to efficiently and quickly enter data into the Linux tracing buffer. It takes no locks and was designed to be as non-intrusive as possible. This means it cannot allocate memory, and must use pre-allocated data. A method that is actively being worked on to have faultable system call tracepoints read user space data is to allocate per CPU buffers, and use them in the callback. The method uses a technique similar to seqcount. That is something like this: preempt_disable(); cpu = smp_processor_id(); buffer = this_cpu_ptr(&pre_allocated_cpu_buffers, cpu); do { cnt = nr_context_switches_cpu(cpu); migrate_disable(); preempt_enable(); ret = copy_from_user(buffer, ptr, size); preempt_disable(); migrate_enable(); } while (!ret && cnt != nr_context_switches_cpu(cpu)); if (!ret) ring_buffer_write(buffer); preempt_enable(); It's a little more involved than that, but the above is the basic logic. The idea is to acquire the current CPU buffer, disable migration, and then enable preemption. At this moment, it can safely use copy_from_user(). After reading the data from user space, it disables preemption again. It then checks to see if there was any new scheduling on this CPU. If there was, it must assume that the buffer was corrupted by another task. If there wasn't, then the buffer is still valid as only tasks in preemptable context can write to this buffer and only those that are running on the CPU. By using this method, where trace_marker open allocates the per CPU buffers, trace_marker writes can access user space and even fault it in, without having to allocate or take any locks of its own. Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Luo Gengkun <luogengkun@huaweicloud.com> Cc: Wattson CI <wattson-external@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/20251008124510.6dba541a@gandalf.local.home Fixes: `3d62ab32df` ("tracing: Fix tracing_marker may trigger page fault during preempt_disable") Reported-by: Runping Lai <runpinglai@google.com> Tested-by: Runping Lai <runpinglai@google.com> Closes: https://lore.kernel.org/linux-trace-kernel/20251007003417.3470979-2-runpinglai@google.com/ Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2025-10-08 21:50:01 -04:00
Ankit Khushwaha	de4cbd7047	ring buffer: Propagate __rb_map_vma return value to caller The return value from `__rb_map_vma()`, which rejects writable or executable mappings (VM_WRITE, VM_EXEC, or !VM_MAYSHARE), was being ignored. As a result the caller of `__rb_map_vma` always returned 0 even when the mapping had actually failed, allowing it to proceed with an invalid VMA. Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/20251008172516.20697-1-ankitkhushwaha.linux@gmail.com Fixes: `117c39200d` ("ring-buffer: Introducing ring-buffer mapping functions") Reported-by: syzbot+ddc001b92c083dbf2b97@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?id=194151be8eaebd826005329b2e123aecae714bdb Signed-off-by: Ankit Khushwaha <ankitkhushwaha.linux@gmail.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2025-10-08 21:48:58 -04:00
Steven Rostedt	c834a97962	tracing: Fix irqoff tracers on failure of acquiring calltime The functions irqsoff_graph_entry() and irqsoff_graph_return() both call func_prolog_dec() that will test if the data->disable is already set and if not, increment it and return. If it was set, it returns false and the caller exits. The caller of this function must decrement the disable counter, but misses doing so if the calltime fails to be acquired. Instead of exiting out when calltime is NULL, change the logic to do the work if it is not NULL and still do the clean up at the end of the function if it is NULL. Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/20251008114943.6f60f30f@gandalf.local.home Fixes: `a485ea9e3e` ("tracing: Fix irqsoff and wakeup latency tracers when using function graph") Reported-by: Sasha Levin <sashal@kernel.org> Closes: https://lore.kernel.org/linux-trace-kernel/20251006175848.1906912-2-sashal@kernel.org/ Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2025-10-08 12:10:44 -04:00

1 2 3 4 5 ...

6866 Commits