Add --off-cpu option to enable the off-cpu profiling with BPF. It'd
use a bpf_output event and rename it to "offcpu-time". Samples will
be synthesized at the end of the record session using data from a BPF
map which contains the aggregated off-cpu time at context switches.
So it needs root privilege to get the off-cpu profiling.
Each sample will have a separate user stacktrace so it will skip
kernel threads. The sample ip will be set from the stacktrace and
other sample data will be updated accordingly. Currently it only
handles some basic sample types.
The sample timestamp is set to a dummy value just not to bother with
other events during the sorting. So it has a very big initial value
and increase it on processing each samples.
Good thing is that it can be used together with regular profiling like
cpu cycles. If you don't want to that, you can use a dummy event to
enable off-cpu profiling only.
Example output:
$ sudo perf record --off-cpu perf bench sched messaging -l 1000
$ sudo perf report --stdio --call-graph=no
# Total Lost Samples: 0
#
# Samples: 41K of event 'cycles'
# Event count (approx.): 42137343851
...
# Samples: 1K of event 'offcpu-time'
# Event count (approx.): 587990831640
#
# Children Self Command Shared Object Symbol
# ........ ........ ............... .................. .........................
#
81.66% 0.00% sched-messaging libc-2.33.so [.] __libc_start_main
81.66% 0.00% sched-messaging perf [.] cmd_bench
81.66% 0.00% sched-messaging perf [.] main
81.66% 0.00% sched-messaging perf [.] run_builtin
81.43% 0.00% sched-messaging perf [.] bench_sched_messaging
40.86% 40.86% sched-messaging libpthread-2.33.so [.] __read
37.66% 37.66% sched-messaging libpthread-2.33.so [.] __write
2.91% 2.91% sched-messaging libc-2.33.so [.] __poll
...
As you can see it spent most of off-cpu time in read and write in
bench_sched_messaging(). The --call-graph=no was added just to make
the output concise here.
It uses perf hooks facility to control BPF program during the record
session rather than adding new BPF/off-cpu specific calls.
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Acked-by: Ian Rogers <irogers@google.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Blake Jones <blakejones@google.com>
Cc: Hao Luo <haoluo@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Milian Wolff <milian.wolff@kdab.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Song Liu <songliubraving@fb.com>
Cc: bpf@vger.kernel.org
Link: https://lore.kernel.org/r/20220518224725.742882-3-namhyung@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Extend --threads option in perf record command line interface.
The option can have a value in the form of masks that specify
CPUs to be monitored with data streaming threads and its layout
in system topology. The masks can be filtered using CPU mask
provided via -C option.
The specification value can be user defined list of masks. Masks
separated by colon define CPUs to be monitored by one thread and
affinity mask of that thread is separated by slash. For example:
<cpus mask 1>/<affinity mask 1>:<cpu mask 2>/<affinity mask 2>
specifies parallel threads layout that consists of two threads
with corresponding assigned CPUs to be monitored.
The specification value can be a string e.g. "cpu", "core" or
"package" meaning creation of data streaming thread for every
CPU or core or package to monitor distinct CPUs or CPUs grouped
by core or package.
The option provided with no or empty value defaults to per-cpu
parallel threads layout creating data streaming thread for every
CPU being monitored.
Document --threads option syntax and parallel data streaming modes
in Documentation/perf-record.txt.
Suggested-by: Jiri Olsa <jolsa@kernel.org>
Suggested-by: Namhyung Kim <namhyung@kernel.org>
Reviewed-by: Riccardo Mancini <rickyman7@gmail.com>
Signed-off-by: Alexey Bayduraev <alexey.v.bayduraev@linux.intel.com>
Tested-by: Jiri Olsa <jolsa@kernel.org>
Tested-by: Riccardo Mancini <rickyman7@gmail.com>
Acked-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Namhyung Kim <namhyung@gmail.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Antonov <alexander.antonov@linux.intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alexei Budankov <abudankov@huawei.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/079e2619be70c465317cf7c9fdaf5fa069728c32.1642440724.git.alexey.v.bayduraev@linux.intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
AMD processors have events with event select codes and unit masks larger
than a byte. The core PMU, for example, uses 12-bit event select codes
split between bits 0-7 and 32-35 of the PERF_CTL MSRs as can be seen
from /sys/bus/event_sources/devices/cpu/format/*.
The Processor Programming Reference (PPR) lists the event codes as
unified 12-bit hexadecimal values instead and the split between the bits
is not apparent to someone who is not aware of the layout of the
PERF_CTL MSRs.
8-bit event select codes continue to work as the layout matches that of
the PERF_CTL MSRs i.e. bits 0-7 for event select and 8-15 for unit mask.
This adds more details in the perf man pages about using
/sys/bus/event_sources/devices/*/format/* for determining the correct
raw event encoding scheme.
E.g. the "op_cache_hit_miss.op_cache_hit" event with code 0x28f and
umask 0x03 can be programmed using its symbolic name as:
$ sudo perf --debug perf-event-open stat -e op_cache_hit_miss.op_cache_hit sleep 1
------------------------------------------------------------
perf_event_attr:
type 4
size 128
config 0x20000038f
sample_type IDENTIFIER
read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
disabled 1
inherit 1
enable_on_exec 1
exclude_guest 1
------------------------------------------------------------
[...]
One might use a simple eventsel+umask combination based on what the
current man pages say and incorrectly program the event as:
$ sudo perf --debug perf-event-open stat -e r0328f sleep 1
------------------------------------------------------------
perf_event_attr:
type 4
size 128
config 0x328f
sample_type IDENTIFIER
read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
disabled 1
inherit 1
enable_on_exec 1
exclude_guest 1
------------------------------------------------------------
[...]
When it should have been based on the format from sysfs:
$ cat /sys/bus/event_source/devices/cpu/format/event
config:0-7,32-35
$ sudo perf --debug perf-event-open stat -e r20000038f sleep 1
------------------------------------------------------------
perf_event_attr:
type 4
size 128
config 0x20000038f
sample_type IDENTIFIER
read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
disabled 1
inherit 1
enable_on_exec 1
exclude_guest 1
------------------------------------------------------------
[...]
Reviewed-by: Kajol Jain <kjain@linux.ibm.com>
Signed-off-by: Sandipan Das <sandipan.das@amd.com>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Cc: Ananth Narayan <ananth.narayan@amd.com>
Cc: Kim Phillips <kim.phillips@amd.com>
Cc: Ravi Bangoria <ravi.bangoria@amd.com>
Cc: Robert Richter <rrichter@amd.com>
Cc: Santosh Shukla <santosh.shukla@amd.com>
Link: https://lore.kernel.org/r/20211123084613.243792-1-sandipan.das@amd.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Add a new 'evlist' control command to display all the evlist events.
When it is received, perf will scan and print current evlist into perf
record terminal.
The interface string for control file is:
evlist [-v|-g|-F]
The syntax follows perf evlist command:
-F Show just the sample frequency used for each event.
-v Show all fields.
-g Show event group information.
Example session:
terminal 1:
# mkfifo control ack
# perf record --control=fifo:control,ack -e '{cycles,instructions}'
terminal 2:
# echo evlist > control
terminal 1:
cycles
instructions
dummy:HG
terminal 2:
# echo 'evlist -v' > control
terminal 1:
cycles: size: 120, { sample_period, sample_freq }: 4000, sample_type: \
IP|TID|TIME|ID|CPU|PERIOD, read_format: ID, disabled: 1, inherit: 1, freq: 1, \
sample_id_all: 1, exclude_guest: 1
instructions: size: 120, config: 0x1, { sample_period, sample_freq }: 4000, \
sample_type: IP|TID|TIME|ID|CPU|PERIOD, read_format: ID, inherit: 1, freq: 1, \
sample_id_all: 1, exclude_guest: 1
dummy:HG: type: 1, size: 120, config: 0x9, { sample_period, sample_freq }: 4000, \
sample_type: IP|TID|TIME|ID|CPU|PERIOD, read_format: ID, inherit: 1, mmap: 1, \
comm: 1, freq: 1, task: 1, sample_id_all: 1, mmap2: 1, comm_exec: 1, ksymbol: 1, \
bpf_event: 1
terminal 2:
# echo 'evlist -g' > control
terminal 1:
{cycles,instructions}
dummy:HG
terminal 2:
# echo 'evlist -F' > control
terminal 1:
cycles: sample_freq=4000
instructions: sample_freq=4000
dummy:HG: sample_freq=4000
This new evlist command is handy to get real event names when
wildcards are used.
Adding evsel_fprintf.c object to python/perf.so build, because
it's now evlist.c dependency.
Adding PYTHON_PERF define for python/perf.so compilation, so we
can use it to compile in only evsel__fprintf from evsel_fprintf.c
object.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alexei Budankov <abudankov@huawei.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Michael Petlan <mpetlan@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Link: http://lore.kernel.org/lkml/20201226232038.390883-3-jolsa@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Adding new control events to enable/disable specific event.
The interface string for control file are:
'enable <EVENT NAME>'
'disable <EVENT NAME>'
when received the command, perf will scan the current evlist
for <EVENT NAME> and if found it's enabled/disabled.
Example session:
terminal 1:
# mkfifo control ack perf.pipe
# perf record --control=fifo:control,ack -D -1 --no-buffering -e 'sched:*' -o - > perf.pipe
terminal 2:
# cat perf.pipe | perf --no-pager script -i -
terminal 1:
Events disabled
NOTE Above message will show only after read side of the pipe ('>')
is started on 'terminal 2'. The 'terminal 1's bash does not execute
perf before that, hence the delyaed perf record message.
terminal 3:
# echo 'enable sched:sched_process_fork' > control
terminal 1:
event sched:sched_process_fork enabled
terminal 2:
bash 33349 [034] 149587.674295: sched:sched_process_fork: comm=bash pid=33349 child_comm=bash child_pid=34056
bash 33349 [034] 149588.239521: sched:sched_process_fork: comm=bash pid=33349 child_comm=bash child_pid=34057
terminal 3:
# echo 'enable sched:sched_wakeup_new' > control
terminal 1:
event sched:sched_wakeup_new enabled
terminal 2:
bash 33349 [034] 149632.228023: sched:sched_process_fork: comm=bash pid=33349 child_comm=bash child_pid=34059
bash 33349 [034] 149632.228050: sched:sched_wakeup_new: bash:34059 [120] success=1 CPU:036
bash 33349 [034] 149633.950005: sched:sched_process_fork: comm=bash pid=33349 child_comm=bash child_pid=34060
bash 33349 [034] 149633.950030: sched:sched_wakeup_new: bash:34060 [120] success=1 CPU:036
Committer testing:
If I use 'sched:*' and then enable all events, I can't get 'perf record'
to react to further commands, so I tested it with:
[root@five ~]# perf record --control=fifo:control,ack -D -1 --no-buffering -e 'sched:sched_process_*' -o - > perf.pipe
Events disabled
Events enabled
Events disabled
And then it works as expected, so we need to fix this pre-existing
problem.
Another issue, we need to check if a event is already enabled or
disabled and change the message to be clearer, i.e.:
[root@five ~]# perf record --control=fifo:control,ack -D -1 --no-buffering -e 'sched:sched_process_*' -o - > perf.pipe
Events disabled
If we receive a 'disable' command, then it should say:
[root@five ~]# perf record --control=fifo:control,ack -D -1 --no-buffering -e 'sched:sched_process_*' -o - > perf.pipe
Events disabled
Events already disabled
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alexei Budankov <abudankov@huawei.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Michael Petlan <mpetlan@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Link: http://lore.kernel.org/lkml/20201226232038.390883-2-jolsa@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Add --buildid-mmap option to enable build id in PERF_RECORD_MMAP2 events.
It will only work if there's kernel support for that and it disables
build id cache (implies --no-buildid).
It's also possible to enable it permanently via config option in
~/.perfconfig file:
[record]
build-id=mmap
Also added build_id bit in the verbose output for perf_event_attr:
# perf record --buildid-mmap -vv
...
perf_event_attr:
type 1
size 120
...
build_id 1
Adding also missing text_poke bit.
Committer testing:
$ perf record -h build
Usage: perf record [<options>] [<command>]
or: perf record [<options>] -- <command> [<options>]
-B, --no-buildid do not collect buildids in perf.data
-N, --no-buildid-cache
do not update the buildid cache
--buildid-all Record build-id of all DSOs regardless of hits
--buildid-mmap Record build-id in map events
$
$ perf record --buildid-mmap sleep 1
Failed: no support to record build id in mmap events, update your kernel.
$
After adding the needed kernel bits in a test kernel:
$ perf record -vv --buildid-mmap sleep 1 |& grep -m1 build
Enabling build id in mmap2 events.
$ perf evlist -v
cycles:u: size: 120, { sample_period, sample_freq }: 4000, sample_type: IP|TID|TIME|PERIOD, read_format: ID, disabled: 1, inherit: 1, exclude_kernel: 1, mmap: 1, comm: 1, freq: 1, enable_on_exec: 1, task: 1, sample_id_all: 1, exclude_guest: 1, mmap2: 1, comm_exec: 1, ksymbol: 1, bpf_event: 1, build_id: 1
$
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Acked-by: Ian Rogers <irogers@google.com>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alexei Budankov <abudankov@huawei.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Michael Petlan <mpetlan@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Song Liu <songliubraving@fb.com>
Cc: Stephane Eranian <eranian@google.com>
Link: http://lore.kernel.org/lkml/20201214105457.543111-16-jolsa@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Add 'snapshot' control command to create an AUX area tracing snapshot
the same as if sending SIGUSR2. The advantage of the FIFO is that access
is governed by access to the FIFO.
Example:
$ mkfifo perf.control
$ mkfifo perf.ack
$ cat perf.ack &
[1] 15235
$ sudo ~/bin/perf record --control fifo:perf.control,perf.ack -S -e intel_pt//u -- sleep 60 &
[2] 15243
$ ps -e | grep perf
15244 pts/1 00:00:00 perf
$ kill -USR2 15244
bash: kill: (15244) - Operation not permitted
$ echo snapshot > perf.control
ack
$
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Acked-by: Alexey Budankov <alexey.budankov@linux.intel.com>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Link: http://lore.kernel.org/lkml/20200901093758.32293-6-adrian.hunter@intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Context switch events are added automatically by Intel PT and Coresight.
Make it possible to suppress them. That is useful for tracing the
scheduler without the disturbance that the switch event processing
creates.
Example:
Prerequisites:
$ which perf
~/bin/perf
$ sudo setcap "cap_sys_rawio,cap_sys_admin,cap_sys_ptrace,cap_syslog,cap_ipc_lock=ep" ~/bin/perf
$ sudo chmod +r /proc/kcore
Before:
$ perf record --no-switch-events --kcore -a -e intel_pt//k -- sleep 0.001
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.938 MB perf.data ]
$ perf script -D | grep PERF_RECORD_SWITCH | wc -l
572
After:
$ perf record --no-switch-events --kcore -a -e intel_pt//k -- sleep 0.001
Warning:
Intel Processor Trace decoding will not be possible except for kernel tracing!
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.838 MB perf.data ]
$ perf script -D | grep PERF_RECORD_SWITCH | wc -l
0
$ sudo chmod go-r /proc/kcore
$ sudo setcap -r ~/bin/perf
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
Link: http://lore.kernel.org/lkml/20200528120859.21604-1-adrian.hunter@intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
To control degree of parallelism of the synthesize_mmap() code which
is scanning /proc/PID/task/PID/maps and can be time consuming.
Mimic perf top way of handling the option.
If not specified will default to 1 thread, i.e. default behavior before
this option.
On a desktop computer the processing of /proc/PID/task/PID/maps isn't
slow enough to warrant parallel processing and the thread creation has
some cost - hence the default of 1. On a loaded server with
>100 cores it is possible to see synthesis times in the order of
seconds and in this case having the option is desirable.
As the processing is a synchronization point, it is legitimate to worry if
Amdahl's law will apply to this patch. Profiling with this patch in
place:
https://lore.kernel.org/lkml/20200415054050.31645-4-irogers@google.com/
shows:
...
- 32.59% __perf_event__synthesize_threads
- 32.54% __event__synthesize_thread
+ 22.13% perf_event__synthesize_mmap_events
+ 6.68% perf_event__get_comm_ids.constprop.0
+ 1.49% process_synthesized_event
+ 1.29% __GI___readdir64
+ 0.60% __opendir
...
That is the processing is 1.49% of execution time and there is plenty to
make parallel. This is shown in the benchmark in this patch:
https://lore.kernel.org/lkml/20200415054050.31645-2-irogers@google.com/
Computing performance of multi threaded perf event synthesis by
synthesizing events on CPU 0:
Number of synthesis threads: 1
Average synthesis took: 127729.000 usec (+- 3372.880 usec)
Average num. events: 21548.600 (+- 0.306)
Average time per event 5.927 usec
Number of synthesis threads: 2
Average synthesis took: 88863.500 usec (+- 385.168 usec)
Average num. events: 21552.800 (+- 0.327)
Average time per event 4.123 usec
Number of synthesis threads: 3
Average synthesis took: 83257.400 usec (+- 348.617 usec)
Average num. events: 21553.200 (+- 0.327)
Average time per event 3.863 usec
Number of synthesis threads: 4
Average synthesis took: 75093.000 usec (+- 422.978 usec)
Average num. events: 21554.200 (+- 0.200)
Average time per event 3.484 usec
Number of synthesis threads: 5
Average synthesis took: 64896.600 usec (+- 353.348 usec)
Average num. events: 21558.000 (+- 0.000)
Average time per event 3.010 usec
Number of synthesis threads: 6
Average synthesis took: 59210.200 usec (+- 342.890 usec)
Average num. events: 21560.000 (+- 0.000)
Average time per event 2.746 usec
Number of synthesis threads: 7
Average synthesis took: 54093.900 usec (+- 306.247 usec)
Average num. events: 21562.000 (+- 0.000)
Average time per event 2.509 usec
Number of synthesis threads: 8
Average synthesis took: 48938.700 usec (+- 341.732 usec)
Average num. events: 21564.000 (+- 0.000)
Average time per event 2.269 usec
Where average time per synthesized event goes from 5.927 usec with 1
thread to 2.269 usec with 8. This isn't a linear speed up as not all of
synthesize code has been made parallel. If the synthesis time was about
10 seconds then using 8 threads may bring this down to less than 4.
Signed-off-by: Stephane Eranian <eranian@google.com>
Reviewed-by: Ian Rogers <irogers@google.com>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alexey Budankov <alexey.budankov@linux.intel.com>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tony Jones <tonyj@suse.de>
Cc: yuzhoujian <yuzhoujian@didichuxing.com>
Link: http://lore.kernel.org/lkml/20200422155038.9380-1-irogers@google.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
To allow individual events to be selected for AUX area sampling, add
aux-sample-size config term. attr.aux_sample_size is updated by
auxtrace_parse_sample_options() so that the existing validation will see
the value. Any event that has a non-zero aux_sample_size will cause AUX
area sampling to be configured, irrespective of the --aux-sample option.
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Link: http://lore.kernel.org/lkml/20191115124225.5247-8-adrian.hunter@intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
The patch adds a new option to limit the output file size, then based on
it, we can create a wrapper of the perf command that uses the option to
avoid exhausting the disk space by the unconscious user.
In order to make the perf.data parsable, we just limit the sample data
size, since the perf.data consists of many headers and sample data and
other data, the actual size of the recorded file will bigger than the
setting value.
Testing it:
# ./perf record -a -g --max-size=10M
Couldn't synthesize bpf events.
[ perf record: perf size limit reached (10249 KB), stopping session ]
[ perf record: Woken up 32 times to write data ]
[ perf record: Captured and wrote 10.133 MB perf.data (71964 samples) ]
# ls -lh perf.data
-rw------- 1 root root 11M Oct 22 14:32 perf.data
# ./perf record -a -g --max-size=10K
[ perf record: perf size limit reached (10 KB), stopping session ]
Couldn't synthesize bpf events.
[ perf record: Woken up 0 times to write data ]
[ perf record: Captured and wrote 1.546 MB perf.data (69 samples) ]
# ls -l perf.data
-rw------- 1 root root 1626952 Oct 22 14:36 perf.data
Committer notes:
Fixed the build in multiple distros by using PRIu64 to print u64 struct
members, fixing this:
builtin-record.c: In function 'record__write':
builtin-record.c:150:5: error: format '%lu' expects argument of type 'long unsigned int', but argument 3 has type 'u64' [-Werror=format=]
rec->bytes_written >> 10);
^
CC /tmp/build/pe
Signed-off-by: Jiwei Sun <jiwei.sun@windriver.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Michael Petlan <mpetlan@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Richard Danter <richard.danter@windriver.com>
Link: http://lore.kernel.org/lkml/20191022080901.3841-1-jiwei.sun@windriver.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
It is sometimes useful to generate a snapshot when perf record exits;
I've been using a wrapper script around the workload that would do a
killall -USR2 perf when the workload exits.
This patch makes it easier and also works when perf record is attached
to a pre-existing task. A new snapshot option 'e' can be specified in
-S to enable this behavior:
root@elsewhere:~# perf record -e intel_pt// -Se sleep 1
[ perf record: Woken up 2 times to write data ]
[ perf record: Captured and wrote 0.085 MB perf.data ]
Co-developed-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20190806144101.62892-1-alexander.shishkin@linux.intel.com
[ Fixed up !HAVE_AUXTRACE_SUPPORT build in builtin-record.c, adding 2 missing __maybe_unused ]
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
One can just record callchains in the kernel or user space with this new
options.
We can use it together with "--all-kernel" options.
This two options is used just like print_stack(sys) or print_ustack(usr)
for systemtap.
Shown below is the usage of this new option combined with "--all-kernel"
options:
1. Configure all used events to run in kernel space and just collect
kernel callchains.
$ perf record -a -g --all-kernel --kernel-callchains
2. Configure all used events to run in kernel space and just collect
user callchains.
$ perf record -a -g --all-kernel --user-callchains
Committer notes:
Improved documentation to state that asking for kernel callchains really
is asking for excluding user callchains, and vice versa.
Further mentioned that using both won't get both, but nothing, as both
will be excluded.
Signed-off-by: yuzhoujian <yuzhoujian@didichuxing.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Milian Wolff <milian.wolff@kdab.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Wang Nan <wangnan0@huawei.com>
Link: http://lkml.kernel.org/r/1559222962-22891-1-git-send-email-ufo19890607@gmail.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Implemented -z,--compression_level[=<n>] option that enables compression
of mmaped kernel data buffers content in runtime during perf record mode
collection. Default option value is 1 (fastest compression).
Compression overhead has been measured for serial and AIO streaming when
profiling matrix multiplication workload:
-------------------------------------------------------------
| SERIAL | AIO-1 |
----------------------------------------------------------------|
|-z | OVH(x) | ratio(x) size(MiB) | OVH(x) | ratio(x) size(MiB) |
|---------------------------------------------------------------|
| 0 | 1,00 | 1,000 179,424 | 1,00 | 1,000 187,527 |
| 1 | 1,04 | 8,427 181,148 | 1,01 | 8,474 188,562 |
| 2 | 1,07 | 8,055 186,953 | 1,03 | 7,912 191,773 |
| 3 | 1,04 | 8,283 181,908 | 1,03 | 8,220 191,078 |
| 5 | 1,09 | 8,101 187,705 | 1,05 | 7,780 190,065 |
| 8 | 1,05 | 9,217 179,191 | 1,12 | 6,111 193,024 |
-----------------------------------------------------------------
OVH = (Execution time with -z N) / (Execution time with -z 0)
ratio - compression ratio
size - number of bytes that was compressed
size ~= trace size x ratio
Committer notes:
Testing it I noticed that it failed to disable build id processing when
compression is enabled, and as we'd have to uncompress everything to
look for the PERF_RECORD_{MMAP,SAMPLE,etc} to figure out which build ids
to read from DSOs, we better disable build id processing when
compression is enabled, logging with pr_debug() when doing so:
Original patch:
# perf record -z2
^C[ perf record: Woken up 1 times to write data ]
0x1746e0 [0x76]: failed to process type: 81 [Invalid argument]
[ perf record: Captured and wrote 1.568 MB perf.data, compressed (original 0.452 MB, ratio is 3.995) ]
#
After auto-disabling build id processing when compression is enabled:
$ perf record -z2 sleep 1
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.001 MB perf.data, compressed (original 0.001 MB, ratio is 2.292) ]
$ perf record -v -z2 sleep 1
Compression enabled, disabling build id collection at the end of the session.
<SNIP extra -v pr_debug() messages>
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.001 MB perf.data, compressed (original 0.001 MB, ratio is 2.305) ]
$
Also, with parts of the patch originally after this one moved to just
before this one we get:
$ perf record -z2 sleep 1
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.001 MB perf.data, compressed (original 0.001 MB, ratio is 2.371) ]
$ perf report -D | grep COMPRESS
0 0x1b8 [0x155]: PERF_RECORD_COMPRESSED: unhandled!
0 0x30d [0x80]: PERF_RECORD_COMPRESSED: unhandled!
COMPRESSED events: 2
COMPRESSED events: 0
$
I.e. when faced with PERF_RECORD_COMPRESSED that we still have no code
to process, we just show it as not being handled, skip them and
continue, while before we had:
$ perf report -D | grep COMPRESS
0x1b8 [0x169]: failed to process type: 81 [Invalid argument]
Error:
failed to process sample
0 0x1b8 [0x169]: PERF_RECORD_COMPRESSED
$
Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
Reviewed-by: Jiri Olsa <jolsa@kernel.org>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/9ff06518-ae63-a908-e44d-5d9e56dd66d9@linux.intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Implement a --mmap-flush option that specifies minimal number of bytes
that is extracted from mmaped kernel buffer to store into a trace. The
default option value is 1 byte what means every time trace writing
thread finds some new data in the mmaped buffer the data is extracted,
possibly compressed and written to a trace.
$ tools/perf/perf record --mmap-flush 1024 -e cycles -- matrix.gcc
$ tools/perf/perf record --aio --mmap-flush 1K -e cycles -- matrix.gcc
The option is independent from -z setting, doesn't vary with compression
level and can serve two purposes.
The first purpose is to increase the compression ratio of a trace data.
Larger data chunks are compressed more effectively so the implemented
option allows specifying data chunk size to compress. Also at some cases
executing more write syscalls with smaller data size can take longer
than executing less write syscalls with bigger data size due to syscall
overhead so extracting bigger data chunks specified by the option value
could additionally decrease runtime overhead.
The second purpose is to avoid self monitoring live-lock issue in system
wide (-a) profiling mode. Profiling in system wide mode with compression
(-a -z) can additionally induce data into the kernel buffers along with
the data from monitored processes. If performance data rate and volume
from the monitored processes is high then trace streaming and
compression activity in the tool is also high. High tool process
activity can lead to subtle live-lock effect when compression of single
new byte from some of mmaped kernel buffer leads to generation of the
next single byte at some mmaped buffer. So perf tool process ends up in
endless self monitoring.
Implemented synch parameter is the mean to force data move independently
from the specified flush threshold value. Despite the provided flush
value the tool needs capability to unconditionally drain memory buffers,
at least in the end of the collection.
Committer testing:
Running with the default value, i.e. as soon as there is something to
read go on consuming, we first write the synthesized events, small
chunks of about 128 bytes:
# perf trace -m 2048 --call-graph dwarf -e write -- perf record
<SNIP>
101.142 ( 0.004 ms): perf/25821 write(fd: 3</root/perf.data>, buf: 0x210db60, count: 120) = 120
__libc_write (/usr/lib64/libpthread-2.28.so)
ion (/home/acme/bin/perf)
record__write (inlined)
process_synthesized_event (/home/acme/bin/perf)
perf_tool__process_synth_event (inlined)
perf_event__synthesize_mmap_events (/home/acme/bin/perf)
Then we move to reading the mmap buffers consuming the events put there
by the kernel perf infrastructure:
107.561 ( 0.005 ms): perf/25821 write(fd: 3</root/perf.data>, buf: 0x7f1befc02000, count: 336) = 336
__libc_write (/usr/lib64/libpthread-2.28.so)
ion (/home/acme/bin/perf)
record__write (inlined)
record__pushfn (/home/acme/bin/perf)
perf_mmap__push (/home/acme/bin/perf)
record__mmap_read_evlist (inlined)
record__mmap_read_all (inlined)
__cmd_record (inlined)
cmd_record (/home/acme/bin/perf)
12919.953 ( 0.136 ms): perf/25821 write(fd: 3</root/perf.data>, buf: 0x7f1befc83150, count: 184984) = 184984
<SNIP same backtrace as in the 107.561 timestamp>
12920.094 ( 0.155 ms): perf/25821 write(fd: 3</root/perf.data>, buf: 0x7f1befc02150, count: 261816) = 261816
<SNIP same backtrace as in the 107.561 timestamp>
12920.253 ( 0.093 ms): perf/25821 write(fd: 3</root/perf.data>, buf: 0x7f1befb81120, count: 170832) = 170832
<SNIP same backtrace as in the 107.561 timestamp>
If we limit it to write only when more than 16MB are available for
reading, it throttles that to a quarter of the --mmap-pages set for
'perf record', which by default get to 528384 bytes, found out using
'record -v':
mmap flush: 132096
mmap size 528384B
With that in place all the writes coming from
record__mmap_read_evlist(), i.e. from the mmap buffers setup by the
kernel perf infrastructure were at least 132096 bytes long.
Trying with a bigger mmap size:
perf trace -e write perf record -v -m 2048 --mmap-flush 16M
74982.928 ( 2.471 ms): perf/26500 write(fd: 3</root/perf.data>, buf: 0x7ff94a6cc000, count: 3580888) = 3580888
74985.406 ( 2.353 ms): perf/26500 write(fd: 3</root/perf.data>, buf: 0x7ff949ecb000, count: 3453256) = 3453256
74987.764 ( 2.629 ms): perf/26500 write(fd: 3</root/perf.data>, buf: 0x7ff9496ca000, count: 3859232) = 3859232
74990.399 ( 2.341 ms): perf/26500 write(fd: 3</root/perf.data>, buf: 0x7ff948ec9000, count: 3769032) = 3769032
74992.744 ( 2.064 ms): perf/26500 write(fd: 3</root/perf.data>, buf: 0x7ff9486c8000, count: 3310520) = 3310520
74994.814 ( 2.619 ms): perf/26500 write(fd: 3</root/perf.data>, buf: 0x7ff947ec7000, count: 4194688) = 4194688
74997.439 ( 2.787 ms): perf/26500 write(fd: 3</root/perf.data>, buf: 0x7ff9476c6000, count: 4029760) = 4029760
Was again limited to a quarter of the mmap size:
mmap flush: 2098176
mmap size 8392704B
A warning about that would be good to have but can be added later,
something like:
"max flush is a quarter of the mmap size, if wanting to bump the mmap
flush further, bump the mmap size as well using -m/--mmap-pages"
Also rename the 'sync' parameters to 'synch' to keep tools/perf building
with older glibcs:
cc1: warnings being treated as errors
builtin-record.c: In function 'record__mmap_read_evlist':
builtin-record.c:775: warning: declaration of 'sync' shadows a global declaration
/usr/include/unistd.h:933: warning: shadowed declaration is here
builtin-record.c: In function 'record__mmap_read_all':
builtin-record.c:856: warning: declaration of 'sync' shadows a global declaration
/usr/include/unistd.h:933: warning: shadowed declaration is here
Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
Reviewed-by: Jiri Olsa <jolsa@kernel.org>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/f6600d72-ecfa-2eb7-7e51-f6954547d500@linux.intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
When doing long term recording and waiting for some event to snapshot
on, we often only care about the last minute or so.
The --switch-output command line option supports rotating the perf.data
file when the size exceeds a threshold. But the disk would still be
filled with unnecessary old files.
Add a new option to only keep a number of rotated files, so that the
disk space usage can be limited.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
LPU-Reference: 20190314225002.30108-3-andi@firstfloor.org
Link: https://lkml.kernel.org/n/tip-y5u2lik0ragt4vlktz6qc9ks@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Multi AIO trace writing allows caching more kernel data into userspace
memory postponing trace writing for the sake of overall profiling data
thruput increase. It could be seen as kernel data buffer extension into
userspace memory.
With an --aio option value different from 0 (default value is 1) the
tool has capability to cache more and more data into user space along
with delegating spill to AIO.
That allows avoiding to suspend at record__aio_sync() between calls of
record__mmap_read_evlist() and increases profiling data thruput at the
cost of userspace memory.
Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
Reviewed-by: Jiri Olsa <jolsa@redhat.com>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/050bb053-e7f3-aa83-fde7-f27ff90be7f6@linux.intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Enable complex event names containing [.:=,] symbols to be encoded into Perf
trace using name= modifier e.g. like this:
perf record -e cpu/name=\'OFFCORE_RESPONSE:request=DEMAND_RFO:response=L3_HIT.SNOOP_HITM\',\
period=0x3567e0,event=0x3c,cmask=0x1/Duk ./futex
Below is how it looks like in the report output. Please note explicit escaped
quoting at cmdline string in the header so that thestring can be directly reused
for another collection in shell:
perf report --header
# ========
...
# cmdline : /root/abudanko/kernel/tip/tools/perf/perf record -v -e cpu/name=\'OFFCORE_RESPONSE:request=DEMAND_RFO:response=L3_HIT.SNOOP_HITM\',period=0x3567e0,event=0x3c,cmask=0x1/Duk ./futex
# event : name = OFFCORE_RESPONSE:request=DEMAND_RFO:response=L3_HIT.SNOOP_HITM, , type = 4, size = 112, config = 0x100003c, { sample_period, sample_freq } = 3500000, sample_type = IP|TID|TIME, disabled = 1, inh
...
# ========
#
#
# Total Lost Samples: 0
#
# Samples: 24K of event 'OFFCORE_RESPONSE:request=DEMAND_RFO:response=L3_HIT.SNOOP_HITM'
# Event count (approx.): 86492000000
#
# Overhead Command Shared Object Symbol
# ........ ....... ................ ..............................................
#
14.75% futex [kernel.vmlinux] [k] __entry_trampoline_start
...
perf stat -e cpu/name=\'CPU_CLK_UNHALTED.THREAD:cmask=0x1\',period=0x3567e0,event=0x3c,cmask=0x1/Duk ./futex
10000000 process context switches in 16678890291ns (1667.9ns/ctxsw)
Performance counter stats for './futex':
88,095,770,571 CPU_CLK_UNHALTED.THREAD:cmask=0x1
16.679542407 seconds time elapsed
Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
Acked-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/c194b060-761d-0d50-3b21-bb4ed680002d@linux.intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
# perf record -F 200000 sleep 1
warning: Maximum frequency rate (15,000 Hz) exceeded, throttling from 200,000 Hz to 15,000 Hz.
The limit can be raised via /proc/sys/kernel/perf_event_max_sample_rate.
The kernel will lower it when perf's interrupts take too long.
Use --strict-freq to disable this throttling, refusing to record.
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.019 MB perf.data (15 samples) ]
# perf evlist -v
cycles:ppp: size: 112, { sample_period, sample_freq }: 15000, sample_type: IP|TID|TIME|PERIOD, disabled: 1, inherit: 1, mmap: 1, comm: 1, freq: 1, enable_on_exec: 1, task: 1, precise_ip: 3, sample_id_all: 1, exclude_guest: 1, mmap2: 1, comm_exec: 1
For those wanting that it fails if the desired frequency can't be used:
# perf record --strict-freq -F 200000 sleep 1
error: Maximum frequency rate (15,000 Hz) exceeded.
Please use -F freq option with a lower value or consider
tweaking /proc/sys/kernel/perf_event_max_sample_rate.
#
Suggested-by: Ingo Molnar <mingo@kernel.org>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Wang Nan <wangnan0@huawei.com>
Link: https://lkml.kernel.org/n/tip-oyebruc44nlja499nqkr1nzn@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>