perf trace has no session unless it is in replay mode, so in host
mode no session can be associated with the evlist. If the evsel__env
call fails, fall back to the host_env that's part of the trace. Remove
errno_to_name: once its argument is turned into a perf_env it becomes
a one-line function with a single caller, so just call
perf_env__arch_strerrno directly.
Signed-off-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20250724163302.596743-19-irogers@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
auxtrace_mmap__read and auxtrace_mmap__read_snapshot end up calling
`evsel__env(NULL)`, which returns the global perf_env variable for the
host. They are only called from perf record. Rather than use the global
variable, pass through the perf_env for `perf record`.
Signed-off-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20250724163302.596743-18-irogers@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
When creating a perf_session, the host perf_env may or may not be
needed. For example, `perf top` uses a host perf_env while `perf
inject` does not. Add a host_env argument to perf_session__new so that
sessions requiring a host perf_env can pass it in. Currently if none
is specified the global perf_env variable is used, but this will
change in later patches.
Signed-off-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20250724163302.596743-14-irogers@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
The perf_env global variable holds the host perf_env data but it is
used inconsistently. Switch to using local perf_env variables and ensure
scoped perf_env__init and perf_env__exit. This loses command line
setting of the perf_env, but this doesn't matter for tests. So that the
perf_env is fully initialized, clear it with memset in perf_env__init.
Signed-off-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20250724163302.596743-13-irogers@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
The session holds a perf_env pointer, env. In UI code container_of is
used to turn the env into a session, but this assumes the session
header's env is in use. Rather than a dubious container_of, hold the
session in the evlist and derive the env from the session with
evsel__env, perf_session__env, etc.
Signed-off-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20250724163302.596743-11-irogers@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
The DSO being generated was accessed through a thread's maps. This is
unnecessary as the dso can be found directly. This avoids
problems with passing a NULL evsel which may be inspected to determine
properties of a callchain when using the buildid DSO marking code.
Signed-off-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20250724163302.596743-8-irogers@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
The dso_id previously contained the major, minor, inode and inode
generation information from a mmap2 event - the inode generation would
be zero when reading from /proc/pid/maps. The build_id was in the
dso. With build ID mmap2 events these fields wouldn't be initialized,
which would largely mean hitting the special empty case where any dso
would match for equality. This isn't desirable: if a dso is replaced we
want the comparison to yield a difference.
To support detecting the difference between DSOs based on build_id,
move the build_id out of the DSO and into the dso_id. The dso_id is
also stored in the DSO so nothing is lost. Capture in the dso_id what
parts have been initialized and rename dso_id__inject to
dso_id__improve_id so that it is clear the dso_id is being improved
upon with additional information. With the build_id in the dso_id, use
memcmp to compare for equality.
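As a rough illustration of the comparison described above, here is a
minimal Python sketch. The DsoId class, its field names, the two
"valid" flags and the order of the checks are assumptions made for the
example; they are not the actual struct dso_id layout or the perf
implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DsoId:
    # Hypothetical stand-in for perf's dso_id; names are illustrative.
    major: int = 0
    minor: int = 0
    ino: int = 0
    ino_generation: int = 0
    build_id: bytes = b""
    mmap2_valid: bool = False     # major/minor/ino/ino_generation were filled in
    build_id_valid: bool = False  # build_id was filled in


def dso_id_equal(a: DsoId, b: DsoId) -> bool:
    """Compare only the parts that both sides have initialized."""
    if a.build_id_valid and b.build_id_valid:
        # Stands in for a memcmp over the build ID bytes.
        return a.build_id == b.build_id
    if a.mmap2_valid and b.mmap2_valid:
        return (a.major, a.minor, a.ino, a.ino_generation) == (
            b.major, b.minor, b.ino, b.ino_generation)
    # Nothing comparable is known on both sides: the "empty" case matches.
    return True
```

With this shape, two ids carrying different build IDs compare as
different even when the mmap2 fields were never filled in.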
Signed-off-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20250724163302.596743-7-irogers@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Thresholds can increase the number of counters a metric needs. The NMI
watchdog can take away a counter (hopefully the buddy watchdog will
become the default and this will no longer be true). Add a new
constraint for the case that a metric and its thresholds would fit in
counters but only if the NMI watchdog isn't enabled. Either the
threshold or the NMI watchdog should be disabled to make the metric
fit. Wire this up into the metric__group_events logic.
Signed-off-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20250719030517.1990983-16-irogers@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Topdown metric events require grouping with a slots event. In perf
metrics this is currently achieved by metrics adding an unnecessary
"0 * tma_info_thread_slots". New TMA metrics trigger optimizations of
the metric expression that remove the event and break the metric due
to the missing but required event. Add a pass, immediately before
sorting and fixing parsed events, that inserts a slots event if one is
missing. Update test expectations to match this.
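The idea of the pass can be sketched in a few lines of Python. This is
only a model: events are plain strings, the set of topdown event names
is an assumption for the example, and the real code works on evsels
rather than on names.

```python
# Assumed names of topdown metric events, for illustration only.
TOPDOWN_METRIC_EVENTS = {
    "topdown-retiring",
    "topdown-bad-spec",
    "topdown-fe-bound",
    "topdown-be-bound",
}


def ensure_slots(events):
    """Return the event list with "slots" first if a topdown event needs it."""
    events = list(events)
    needs_slots = any(e in TOPDOWN_METRIC_EVENTS for e in events)
    if needs_slots and "slots" not in events:
        # The slots event has to be present for the topdown events to group.
        return ["slots"] + events
    return events
```

Lists without topdown metric events, or that already contain slots,
are returned unchanged.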
Signed-off-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20250719030517.1990983-15-irogers@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
The string comparisons were overly broad and could fire for the
incorrect PMU and events. Switch to using the config in the attribute
then add a perf test to confirm the attribute config values match
those of parsed events of that name and don't match others. This
exposed matches for slots events that shouldn't have matched as the
slots fixed counter event, such as topdown.slots_p.
Fixes: fbc798316b ("perf x86/topdown: Refine helper arch_is_topdown_metrics()")
Signed-off-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20250719030517.1990983-14-irogers@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Counting events system-wide with a specified CPU prior to this change
worked:
```
$ perf stat -e 'msr/tsc/,msr/tsc,cpu=cpu_core/,msr/tsc,cpu=cpu_atom/' -a sleep 1
Performance counter stats for 'system wide':
59,393,419,099 msr/tsc/
33,927,965,927 msr/tsc,cpu=cpu_core/
25,465,608,044 msr/tsc,cpu=cpu_atom/
```
However, when counting with a process the counts became system wide:
```
$ perf stat -e 'msr/tsc/,msr/tsc,cpu=cpu_core/,msr/tsc,cpu=cpu_atom/' perf test -F 10
10.1: Basic parsing test : Ok
10.2: Parsing without PMU name : Ok
10.3: Parsing with PMU name : Ok
Performance counter stats for 'perf test -F 10':
59,233,549 msr/tsc/
59,227,556 msr/tsc,cpu=cpu_core/
59,224,053 msr/tsc,cpu=cpu_atom/
```
Make the handling of CPU maps with event parsing clearer. When an
event is parsed creating an evsel the cpus should be either the PMU's
cpumask or user specified CPUs.
Update perf_evlist__propagate_maps so that it doesn't clobber the user
specified CPUs. To make the behavior clearer, first fix up
missing cpumasks. Next, perform sanity checks and adjustments from the
global evlist CPU requests and for the PMU, including simplifying to
the "any CPU" (-1) value. Finally, remove the event if the cpumask is
empty.
So that events are opened with a CPU and a thread, change stat's
create_perf_stat_counter to give both.
With the change things are fixed:
```
$ perf stat --no-scale -e 'msr/tsc/,msr/tsc,cpu=cpu_core/,msr/tsc,cpu=cpu_atom/' perf test -F 10
10.1: Basic parsing test : Ok
10.2: Parsing without PMU name : Ok
10.3: Parsing with PMU name : Ok
Performance counter stats for 'perf test -F 10':
63,704,975 msr/tsc/
47,060,704 msr/tsc,cpu=cpu_core/ (4.62%)
16,640,591 msr/tsc,cpu=cpu_atom/ (2.18%)
```
However, note the "--no-scale" option is used. This is necessary as
the running time for the event on the counter isn't the same as the
enabled time because the thread doesn't necessarily run on the CPUs
specified for the counter. All counter values are scaled with:
scaled_value = value * time_enabled / time_running
and so without --no-scale the scaled_value becomes very large. This
problem already exists on hybrid systems for the same reason. Here are
2 runs of the same code with an instructions event that counts the
same on both types of core; there is no real multiplexing happening on
the event:
```
$ perf stat -e instructions perf test -F 10
...
Performance counter stats for 'perf test -F 10':
87,896,447 cpu_atom/instructions/ (14.37%)
98,171,964 cpu_core/instructions/ (85.63%)
...
$ perf stat --no-scale -e instructions perf test -F 10
...
Performance counter stats for 'perf test -F 10':
13,069,890 cpu_atom/instructions/ (19.32%)
83,460,274 cpu_core/instructions/ (80.68%)
...
```
The scaling has inflated per-PMU instruction counts and the overall
count by 2x.
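To make the effect of the scaling formula concrete, here is a small
Python sketch. The function name and the numbers are hypothetical and
chosen only to show how a short running time relative to the enabled
time inflates the result.

```python
def scaled_value(value, time_enabled, time_running):
    """Apply scaled_value = value * time_enabled / time_running."""
    if time_running == 0:
        # The event never ran, so there is nothing to scale.
        return 0.0
    return value * time_enabled / time_running


# Hypothetical counter: enabled for 100ms of task time, but the task only
# ran on the CPUs covered by the event for 10ms of that.
raw = 1_000_000
print(scaled_value(raw, time_enabled=100_000_000, time_running=10_000_000))
```

Here the raw count is multiplied by 10 even though nothing was
multiplexed, which is the kind of inflation seen above.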
To fix this the kernel needs changing when a task+CPU event (or just
task event on hybrid) is scheduled out. A fix could be that the state
isn't inactive but off for such events, so that time_enabled counts
don't accumulate on them.
Reviewed-by: Thomas Falcon <thomas.falcon@intel.com>
Signed-off-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20250719030517.1990983-13-irogers@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Add evsel__open_per_cpu_and_thread that combines the operation of
evsel__open_per_cpu and evsel__open_per_thread so that an event
without the "any" cpumask can be opened with its cpumask and with
threads it specifies. Change the implementation of evsel__open_per_cpu
and evsel__open_per_thread to use evsel__open_per_cpu_and_thread to
make the implementation of those functions clearer.
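The relationship between the three functions can be modelled as
follows. This Python sketch is not the perf code: it only lists the
(pid, cpu) pairs that would be handed to perf_event_open, using -1 for
"any", and the Python function names are stand-ins for the evsel
helpers.

```python
ANY = -1  # perf_event_open uses -1 for "any CPU" and for "any process"


def open_per_cpu_and_thread(cpus, threads):
    """Return the (pid, cpu) pairs to open, one event per pair."""
    cpus = list(cpus) or [ANY]
    threads = list(threads) or [ANY]
    return [(tid, cpu) for cpu in cpus for tid in threads]


def open_per_cpu(cpus):
    # Per-CPU opening is the combined form with no specific threads.
    return open_per_cpu_and_thread(cpus, [])


def open_per_thread(threads):
    # Per-thread opening is the combined form with the "any" CPU.
    return open_per_cpu_and_thread([], threads)
```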
Reviewed-by: Thomas Falcon <thomas.falcon@intel.com>
Signed-off-by: Ian Rogers <irogers@google.com>
Tested-by: James Clark <james.clark@linaro.org>
Link: https://lore.kernel.org/r/20250719030517.1990983-12-irogers@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
For hybrid metrics it is useful to know the number of p-core or e-core
CPUs. If a cpumask is specified for the num_cpus or num_cpus_online
tool events, compute the value relative to the given mask rather than
for the full system.
```
$ sudo /tmp/perf/perf stat -e 'tool/num_cpus/,tool/num_cpus,cpu=cpu_core/,
tool/num_cpus,cpu=cpu_atom/,tool/num_cpus_online/,tool/num_cpus_online,
cpu=cpu_core/,tool/num_cpus_online,cpu=cpu_atom/' true
Performance counter stats for 'true':
28 tool/num_cpus/
16 tool/num_cpus,cpu=cpu_core/
12 tool/num_cpus,cpu=cpu_atom/
28 tool/num_cpus_online/
16 tool/num_cpus_online,cpu=cpu_core/
12 tool/num_cpus_online,cpu=cpu_atom/
0.000767205 seconds time elapsed
0.000938000 seconds user
0.000000000 seconds sys
```
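The computation amounts to intersecting the system's CPUs with the
given mask and counting the result. A minimal Python sketch follows;
the function name is hypothetical, and the CPU ranges assume cpu_core
covers CPUs 0-15 and cpu_atom covers CPUs 16-27 on the 28 CPU machine
used above.

```python
def num_cpus(all_cpus, cpumask=None):
    """Count CPUs, restricted to cpumask when one is given."""
    all_cpus = set(all_cpus)
    if cpumask is None:
        return len(all_cpus)
    return len(all_cpus & set(cpumask))


system = range(28)        # assumed: CPUs 0-27
cpu_core = range(0, 16)   # assumed p-core mask
cpu_atom = range(16, 28)  # assumed e-core mask

print(num_cpus(system), num_cpus(system, cpu_core), num_cpus(system, cpu_atom))
```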
Reviewed-by: Thomas Falcon <thomas.falcon@intel.com>
Signed-off-by: Ian Rogers <irogers@google.com>
Tested-by: James Clark <james.clark@linaro.org>
Link: https://lore.kernel.org/r/20250719030517.1990983-6-irogers@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
On hybrid systems, events like msr/tsc/ will aggregate counts across
all CPUs. Often metrics only want a value like msr/tsc/ for the cores
on which the metric is being computed. Listing each CPU with terms
cpu=0,cpu=1.. is laborious and would need to be encoded for all
variations of a CPU model.
Allow the cpumask from a PMU to be an argument to the cpu term. For
example, in the following the cpumask of the cstate_pkg PMU selects the
CPUs on which msr/tsc/ is counted:
```
$ cat /sys/bus/event_source/devices/cstate_pkg/cpumask
0
$ perf stat -A -e 'msr/tsc,cpu=cstate_pkg/' -a sleep 0.1
Performance counter stats for 'system wide':
CPU0 252,621,253 msr/tsc,cpu=cstate_pkg/
0.101184092 seconds time elapsed
```
As the cpu term is now also allowed to be a string, allow it to encode
a range of CPUs (a list can't be supported as ',' is already a special
token).
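The accepted forms of the cpu term (a single CPU, a range, or a PMU
name) can be sketched as below. This is an illustrative Python model,
not the parser in perf: parse_cpu_term and the pmu_cpumasks dictionary
are hypothetical, and PMU cpumasks are passed in rather than read from
sysfs.

```python
def parse_cpu_term(value, pmu_cpumasks):
    """Turn a cpu= term value into a set of CPU numbers."""
    if value in pmu_cpumasks:
        # A PMU name selects that PMU's cpumask.
        return set(pmu_cpumasks[value])
    first, sep, last = value.partition("-")
    lo = int(first)
    hi = int(last) if sep else lo
    if hi < lo:
        raise ValueError(f"invalid CPU range: {value}")
    return set(range(lo, hi + 1))


masks = {"cstate_pkg": {0}}  # matches the cpumask shown above
print(parse_cpu_term("cstate_pkg", masks))
print(parse_cpu_term("2-4", masks))
```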
The "event qualifiers" section of the `perf list` man page is updated
to detail the additional behavior. The man page formatting is tidied
up in this section, as it was incorrectly appearing within the
"parameterized events" section.
Reviewed-by: Thomas Falcon <thomas.falcon@intel.com>
Signed-off-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20250719030517.1990983-5-irogers@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Factor requested CPU warning out of evlist and into evsel. At the end
of adding an event, perform the warning check. To avoid repeatedly
testing if the cpu_list is empty, add a local variable.
```
$ perf stat -e cpu_atom/cycles,cpu=1/ -a true
WARNING: A requested CPU in '1' is not supported by PMU 'cpu_atom' (CPUs 16-27) for event 'cpu_atom/cycles/'
Performance counter stats for 'system wide':
<not supported> cpu_atom/cycles/
0.000781511 seconds time elapsed
```
Reviewed-by: Thomas Falcon <thomas.falcon@intel.com>
Signed-off-by: Ian Rogers <irogers@google.com>
Tested-by: James Clark <james.clark@linaro.org>
Link: https://lore.kernel.org/r/20250719030517.1990983-2-irogers@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Force loading all PMUs adds significant cost because DRM and other
PMUs are loaded. It should also not be required if the pmus__
functions are used.
Tested by running perf test, in particular the pfm related tests. Also
`perf list` output is identical before and after.
Before:
$ time ./perf test pfm
54: Test libpfm4 support :
54.1: test of individual --pfm-events : Ok
54.2: test groups of --pfm-events : Ok
103: perf all libpfm4 events test : Ok
real 0m8.933s
user 0m1.824s
sys 0m7.122s
After:
$ time ./perf test pfm
54: Test libpfm4 support :
54.1: test of individual --pfm-events : Ok
54.2: test groups of --pfm-events : Ok
103: perf all libpfm4 events test : Ok
real 0m5.259s
user 0m1.793s
sys 0m3.570s
Signed-off-by: Ian Rogers <irogers@google.com>
Tested-by: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/r/20250722013449.146233-1-irogers@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
When the clang toolchain has stack protection enabled, the bpf
skeletons build fails with:
error: A call to built-in function '__stack_chk_fail' is not supported.
Since stack-protector makes no sense for the BPF bits, just unconditionally
disable it.
See also the similar case in 878625e1c7.
Signed-off-by: Federico Pellegrin <fede@evolware.org>
Link: https://lore.kernel.org/r/20250718041224.12389-1-fede@evolware.org
[ rearrange long lines ]
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Previously when running perf sched timehist --state, the process name
of a process recorded in the zombie state would not be decoded
properly and appeared with just the PID:
1140057.412177 [0006] Mutter Input Th[3139/3104] 0.956 0.019 0.041 S
1140057.412222 [0012] :1248612[1248612] 0.000 0.000 0.332 Z
1140057.412275 [0004] <idle> 0.052 0.052 0.953 I
1140057.412284 [0008] <idle> 0.070 0.070 0.932 I
1140057.412333 [0004] KMS thread[3126/3104] 0.953 0.112 0.058 S
Now some extra processing has been added to decode the process name:
1140057.412177 [0006] Mutter Input Th[3139/3104] 0.956 0.019 0.041 S
1140057.412222 [0012] sleep[1248612] 0.000 0.000 0.332 Z
1140057.412275 [0004] <idle> 0.052 0.052 0.953 I
1140057.412284 [0008] <idle> 0.070 0.070 0.932 I
1140057.412333 [0004] KMS thread[3126/3104] 0.953 0.112 0.058 S
Signed-off-by: Anubhav Shelat <ashelat@redhat.com>
Link: https://lore.kernel.org/r/20250716203914.45772-2-ashelat@redhat.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
In addition to function latency, perf ftrace latency can measure event
latencies. Some kernel tracepoints are paired and it's meaningful to
measure how long it takes between the two events. The latency is
tracked for the same thread.
Currently it only uses BPF to do the work but that restriction can be
lifted later. Instead of having a separate BPF program for each
tracepoint, it only uses generic 'event_begin' and 'event_end'
programs to attach to any (raw) tracepoints.
$ sudo perf ftrace latency -a -b --hide-empty \
-e i915_request_wait_begin,i915_request_wait_end -- sleep 1
# DURATION | COUNT | GRAPH |
256 - 512 us | 4 | ###### |
2 - 4 ms | 2 | ### |
4 - 8 ms | 12 | ################### |
8 - 16 ms | 10 | ################ |
# statistics (in usec)
total time: 194915
avg time: 6961
max time: 12855
min time: 373
count: 28
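The per-thread pairing of begin and end events can be modelled in
user space as follows. This Python sketch is only an illustration of
the pairing idea, not the BPF programs: the event tuples and the
function name are made up for the example.

```python
def pair_latencies(events):
    """events: iterable of (timestamp, tid, kind), kind is "begin" or "end".

    Returns the latency of each begin/end pair seen on the same thread.
    """
    start = {}  # tid -> timestamp of the pending begin event
    latencies = []
    for ts, tid, kind in events:
        if kind == "begin":
            start[tid] = ts
        elif kind == "end" and tid in start:
            latencies.append(ts - start.pop(tid))
        # An end without a matching begin on that thread is ignored.
    return latencies


trace = [(100, 1, "begin"), (150, 2, "begin"), (400, 1, "end"),
         (900, 2, "end"), (950, 3, "end")]
print(pair_latencies(trace))
```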
Reviewed-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20250714052143.342851-1-namhyung@kernel.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
The struct pyrf_evlist embeds the evlist, requiring copying from
things like parsed events. The copying logic handles the leader being
the event itself, but if the group leader is a different event in the
list, the copied evsel will point to the evsel in the list that was
copied from, which is bad. Fix this by adding another pass over the
evlist rewriting leaders, simplified by the introduction of two evlist
helpers.
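The leader fix-up can be pictured with a small Python model. The Evsel
class and clone_evlist below are hypothetical stand-ins, not the perf
python binding code; the point is only the second pass that maps each
leader to the copy at the same position.

```python
class Evsel:
    def __init__(self, name):
        self.name = name
        self.leader = self  # an ungrouped event leads itself


def clone_evlist(src):
    """Copy a list of events, then rewrite leaders to point into the copy."""
    dst = [Evsel(e.name) for e in src]
    position = {id(e): i for i, e in enumerate(src)}
    for old, new in zip(src, dst):
        # Second pass: the leader of the copy is the copy of the old leader.
        new.leader = dst[position[id(old.leader)]]
    return dst


slots, retiring = Evsel("slots"), Evsel("topdown-retiring")
retiring.leader = slots
copy = clone_evlist([slots, retiring])
print(copy[1].leader is copy[0], copy[1].leader is slots)
```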
Signed-off-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20250710235126.1086011-13-irogers@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Tool PMUs assume that stat's process_counter_values is being used to
read the counters. Specifically they hold onto old values in
evsel->prev_raw_counts and give the cumulative count based off of this
value. Update pyrf_evsel__read to allocate counts and prev_raw_counts,
to use evsel__read_counter rather than perf_evsel__read so that tool
PMUs, not just perf_event_open events, can be read, and to make the
returned pyrf_counts_values contain the delta value rather than the
cumulative value.
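The delta computation relies on remembering the previous raw value. A
minimal Python sketch of that bookkeeping is below; DeltaCounter is a
hypothetical class for illustration and does not mirror the evsel
fields beyond the idea of a stored previous count.

```python
class DeltaCounter:
    """Turn cumulative readings into per-read deltas."""

    def __init__(self):
        self.prev_raw = 0  # plays the role of the previous raw count

    def read(self, cumulative):
        delta = cumulative - self.prev_raw
        self.prev_raw = cumulative
        return delta


counter = DeltaCounter()
print(counter.read(10), counter.read(25), counter.read(25))
```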
Fixes: 739621f657 ("perf python: Add evsel read method")
Signed-off-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20250710235126.1086011-12-irogers@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
The evsel__pmu_name helper will internally use evsel__find_pmu that
handles legacy events, extended types, etc. in determining a PMU and
will provide a better value than just trying to access the PMU's name
directly as the PMU may not have been computed.
Signed-off-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20250710235126.1086011-10-irogers@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
If the short and long descriptions are the same then save space and
don't store both of them. When storing the desc in the perf_pmu_alias,
don't duplicate the desc into the long_desc.
By avoiding storing the duplicate the size of the events string in the
binary on x86 is reduced by 29,840 bytes.
Fix tests that expect a duplicated description.
Signed-off-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20250710235126.1086011-9-irogers@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Metrics will fill in the context to have mappings from an event to a
count. When counts are added they replace existing mappings which
generally shouldn't exist with aggregation. Switch to accumulating to
better support cases where perf stat's aggregation isn't used and we
may see a counter more than once.
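The difference between replacing and accumulating can be shown with a
dictionary standing in for the metric context. This is a hedged Python
sketch; add_count and the event name are illustrative and are not the
expr API in perf.

```python
def add_count(ctx, event, value):
    """Accumulate value for event instead of overwriting an existing entry."""
    ctx[event] = ctx.get(event, 0.0) + value
    return ctx


ctx = {}
add_count(ctx, "instructions", 100.0)  # e.g. a reading from one CPU
add_count(ctx, "instructions", 50.0)   # the same event seen again
print(ctx)
```

With replacement the second reading would have discarded the first;
with accumulation the context holds the sum.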
Signed-off-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20250710235126.1086011-8-irogers@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>