On graniterapids the cache home agent (CHA) and memory controller
(IMC) PMUs all have their cpumask set to per-socket information. In
order for per NUMA node aggregation to work correctly the PMUs cpumask
needs to be set to CPUs for the relevant sub-NUMA grouping.
For example, on a 2 socket graniterapids machine with sub NUMA
clustering of 3, for uncore_cha and uncore_imc PMUs the cpumask is
"0,120" leading to aggregation only on NUMA nodes 0 and 3:
```
$ perf stat --per-node -e 'UNC_CHA_CLOCKTICKS,UNC_M_CLOCKTICKS' -a sleep 1
Performance counter stats for 'system wide':
N0 1 277,835,681,344 UNC_CHA_CLOCKTICKS
N0 1 19,242,894,228 UNC_M_CLOCKTICKS
N3 1 277,803,448,124 UNC_CHA_CLOCKTICKS
N3 1 19,240,741,498 UNC_M_CLOCKTICKS
1.002113847 seconds time elapsed
```
By updating the PMUs cpumasks to "0,120", "40,160" and "80,200" then
the correctly 6 NUMA node aggregations are achieved:
```
$ perf stat --per-node -e 'UNC_CHA_CLOCKTICKS,UNC_M_CLOCKTICKS' -a sleep 1
Performance counter stats for 'system wide':
N0 1 92,748,667,796 UNC_CHA_CLOCKTICKS
N0 0 6,424,021,142 UNC_M_CLOCKTICKS
N1 0 92,753,504,424 UNC_CHA_CLOCKTICKS
N1 1 6,424,308,338 UNC_M_CLOCKTICKS
N2 0 92,751,170,084 UNC_CHA_CLOCKTICKS
N2 0 6,424,227,402 UNC_M_CLOCKTICKS
N3 1 92,745,944,144 UNC_CHA_CLOCKTICKS
N3 0 6,423,752,086 UNC_M_CLOCKTICKS
N4 0 92,725,793,788 UNC_CHA_CLOCKTICKS
N4 1 6,422,393,266 UNC_M_CLOCKTICKS
N5 0 92,717,504,388 UNC_CHA_CLOCKTICKS
N5 0 6,421,842,618 UNC_M_CLOCKTICKS
1.003406645 seconds time elapsed
```
In general, having the perf tool adjust cpumasks isn't desirable as
ideally the PMU driver would be advertising the correct cpumask.
Signed-off-by: Ian Rogers <irogers@google.com>
Tested-by: Kan Liang <kan.liang@linux.intel.com>
Tested-by: Weilin Wang <weilin.wang@intel.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ravi Bangoria <ravi.bangoria@amd.com>
Link: https://lore.kernel.org/r/20250515181417.491401-1-irogers@google.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Bunch of IBS kernel fixes went in v6.15-rc1 [1].
The amd-ibs-period test will fail without those kernel patches.
Skip the test on system running kernel older than v6.15 to distinguish
genuine new failures vs known failure due to old kernel.
Since all the related IBS fixes went in -rc1 itself, the ">= 6.15" check
will work for any custom compiled v6.15-* kernel as well.
Reported-by: Arnaldo Carvalho de Melo <acme@kernel.org>
Suggested-by: Arnaldo Carvalho de Melo <acme@kernel.org>
Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com>
Closes: https://lore.kernel.org/r/aCfuGXUnNIbnYo_r@x1
Link: https://lore.kernel.org/r/20250115054438.1021-1-ravi.bangoria@amd.com [1]
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
On systems with many CPUs, recording extra context switch events can be
excessive and unnecessary. Add perf config intel-pt.all-switch-events=false
to control the behaviour.
Example:
# perf config intel-pt.all-switch-events=false
# perf record -eintel_pt//u uname
Linux
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.082 MB perf.data ]
# perf script -D | grep PERF_RECORD_SWITCH | awk '{print $5}' | uniq -c
5 PERF_RECORD_SWITCH
# perf config intel-pt.all-switch-events=true
# perf record -eintel_pt//u uname
Linux
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.102 MB perf.data ]
# perf script -D | grep PERF_RECORD_SWITCH | awk '{print $5}' | uniq -c
180 PERF_RECORD_SWITCH_CPU_WIDE
Committer testing:
While doing a make -j28 allmodconfig:
root@five:~# grep "model name" -m1 /proc/cpuinfo
model name : Intel(R) Core(TM) i7-14700K
root@five:~#
root@five:~# perf config intel-pt.all-switch-events=false
root@five:~# perf record -e intel_pt//u uname
Linux
[ perf record: Woken up 2 times to write data ]
[ perf record: Captured and wrote 0.019 MB perf.data ]
root@five:~# perf report --stats | grep SWITCH_CPU_WIDE
root@five:~#
root@five:~# perf config intel-pt.all-switch-events=true
root@five:~# perf record -e intel_pt//u uname
Linux
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.047 MB perf.data ]
root@five:~# perf report --stats | grep SWITCH_CPU_WIDE
SWITCH_CPU_WIDE events: 542 (96.4%)
root@five:~#
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/r/20250512093932.79854-3-adrian.hunter@intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
To pick up the changes in:
c4a16820d9 fs: add open_tree_attr()
2df1ad0d25 x86/arch_prctl: Simplify sys_arch_prctl()
e632bca07c arm64: generate 64-bit syscall.tbl
This is basically to support the new open_tree_attr syscall. But it
also needs to update asm-generic unistd.h header to get the new syscall
number. And arm64 unistd.h header was converted to use the generic
64-bit header.
Addressing this perf tools build warning:
Warning: Kernel ABI header differences:
diff -u tools/scripts/syscall.tbl scripts/syscall.tbl
diff -u tools/perf/arch/x86/entry/syscalls/syscall_32.tbl arch/x86/entry/syscalls/syscall_32.tbl
diff -u tools/perf/arch/x86/entry/syscalls/syscall_64.tbl arch/x86/entry/syscalls/syscall_64.tbl
diff -u tools/perf/arch/powerpc/entry/syscalls/syscall.tbl arch/powerpc/kernel/syscalls/syscall.tbl
diff -u tools/perf/arch/s390/entry/syscalls/syscall.tbl arch/s390/kernel/syscalls/syscall.tbl
diff -u tools/perf/arch/mips/entry/syscalls/syscall_n64.tbl arch/mips/kernel/syscalls/syscall_n64.tbl
diff -u tools/perf/arch/arm/entry/syscalls/syscall.tbl arch/arm/tools/syscall.tbl
diff -u tools/perf/arch/sh/entry/syscalls/syscall.tbl arch/sh/kernel/syscalls/syscall.tbl
diff -u tools/perf/arch/sparc/entry/syscalls/syscall.tbl arch/sparc/kernel/syscalls/syscall.tbl
diff -u tools/perf/arch/xtensa/entry/syscalls/syscall.tbl arch/xtensa/kernel/syscalls/syscall.tbl
diff -u tools/arch/arm64/include/uapi/asm/unistd.h arch/arm64/include/uapi/asm/unistd.h
diff -u tools/include/uapi/asm-generic/unistd.h include/uapi/asm-generic/unistd.h
Please see tools/include/uapi/README for further details.
Acked-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Cc: linux-arch@vger.kernel.org
Link: https://lore.kernel.org/r/20250410001125.391820-6-namhyung@kernel.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
The definition of "static const char *const syscalltbl[] = {" is done
in a generated syscalls_32.h or syscalls_64.h that is architecture
dependent. In order to include the appropriate file a syscall_table.h
is found via the perf include path and it includes the syscalls_32.h
or syscalls_64.h as appropriate.
To support having multiple syscall tables, one for 32-bit and one for
64-bit, or for different architectures, an include path cannot be
used. Remove syscall_table.h because of this and inline what it does
into syscalltbl.c.
For architectures without a syscall_table.h this will cause a failure
to include either syscalls_32.h or syscalls_64.h rather than a failure
to include syscall_table.h. For architectures that only included one
or other, the behavior matches BITS_PER_LONG as previously done on
architectures supporting both syscalls_32.h and syscalls_64.h.
Signed-off-by: Ian Rogers <irogers@google.com>
Reviewed-by: Howard Chu <howardchu95@gmail.com>
Reviewed-by: Charlie Jenkins <charlie@rivosinc.com>
Reviewed-by: Namhyung Kim <namhyung@kernel.org>
Acked-by: Arnaldo Carvalho de Melo <acme@kernel.org>
Link: https://lore.kernel.org/r/20250319050741.269828-4-irogers@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
When running topdown leader smapling test on Intel hybrid platforms,
such as LNL/ARL, we see the below error.
Topdown leader sampling test
Topdown leader sampling [Failed topdown events not reordered correctly]
It indciates the below command fails.
perf record -o "${perfdata}" -e "{instructions,slots,topdown-retiring}:S" true
The root cause is that perf tool creats a perf event for each PMU type
if it can create.
As for this command, there would be 5 perf events created,
cpu_atom/instructions/,cpu_atom/topdown_retiring/,
cpu_core/slots/,cpu_core/instructions/,cpu_core/topdown-retiring/
For these 5 events, the 2 cpu_atom events are in a group and the other 3
cpu_core events are in another group.
When arch_topdown_sample_read() traverses all these 5 events, events
cpu_atom/instructions/ and cpu_core/slots/ don't have a same group
leade, and then return false directly and lead to cpu_core/slots/ event
is used to sample and this is not allowed by PMU driver.
It's a overkill to return false directly if "evsel->core.leader !=
leader->core.leader" since there could be multiple groups in the event
list.
Just "continue" instead of "return false" to fix this issue.
Fixes: 1e53e9d178 ("perf x86/topdown: Correct leader selection with sample_read enabled")
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Tested-by: Thomas Falcon <thomas.falcon@intel.com>
Tested-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20250307023906.1135613-2-irogers@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
In sysfs, the perf events are all located in
/sys/bus/event_source/devices/ but some places ended up hard-coding the
location to be at the root of /sys/devices/ which could be very risky as
you do not exactly know what type of device you are accessing in sysfs
at that location.
So fix this all up by properly pointing everything at the bus device
list instead of the root of the sysfs devices/ tree.
Cc: stable <stable@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Kan Liang <kan.liang@linux.intel.com>
Link: https://lore.kernel.org/r/2025021955-implant-excavator-179d@gregkh
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
The struct dump_regs contains 512 bytes of cache_regs, meaning the two
values in perf_sample contribute 1088 bytes of its total 1384 bytes
size. Initializing this much memory has a cost reported by Tavian
Barnes <tavianator@tavianator.com> as about 2.5% when running `perf
script --itrace=i0`:
https://lore.kernel.org/lkml/d841b97b3ad2ca8bcab07e4293375fb7c32dfce7.1736618095.git.tavianator@tavianator.com/
Adrian Hunter <adrian.hunter@intel.com> replied that the zero
initialization was necessary and couldn't simply be removed.
This patch aims to strike a middle ground of still zeroing the
perf_sample, but removing 79% of its size by make user_regs and
intr_regs optional pointers to zalloc-ed memory. To support the
allocation accessors are created for user_regs and intr_regs. To
support correct cleanup perf_sample__init and perf_sample__exit
functions are created and added throughout the code base.
Signed-off-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20250113194345.1537821-1-irogers@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
To pick up the changes in this cset:
6140be90ec ("fs/xattr: add *at family syscalls")
This addresses these perf build warnings:
Warning: Kernel ABI header differences:
diff -u tools/include/uapi/asm-generic/unistd.h include/uapi/asm-generic/unistd.h
diff -u tools/perf/arch/x86/entry/syscalls/syscall_32.tbl arch/x86/entry/syscalls/syscall_32.tbl
diff -u tools/perf/arch/x86/entry/syscalls/syscall_64.tbl arch/x86/entry/syscalls/syscall_64.tbl
diff -u tools/perf/arch/powerpc/entry/syscalls/syscall.tbl arch/powerpc/kernel/syscalls/syscall.tbl
diff -u tools/perf/arch/s390/entry/syscalls/syscall.tbl arch/s390/kernel/syscalls/syscall.tbl
diff -u tools/perf/arch/mips/entry/syscalls/syscall_n64.tbl arch/mips/kernel/syscalls/syscall_n64.tbl
The arm64 changes are not included as it requires more changes in the
tools. It'll be worked for the later cycle.
Please see tools/include/uapi/README for further details.
Reviewed-by: James Clark <james.clark@linaro.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christian Brauner <brauner@kernel.org>
CC: x86@kernel.org
CC: linux-mips@vger.kernel.org
CC: linuxppc-dev@lists.ozlabs.org
CC: linux-s390@vger.kernel.org
Link: https://lore.kernel.org/r/20241203035349.1901262-7-namhyung@kernel.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Leverage the existed function perf_pmu__name_from_config() to check if
an event is topdown metrics event. perf_pmu__name_from_config() goes
through the defined formats and figures out the config of pre-defined
topdown events.
This avoids to figure out the config of topdown pre-defined events with
hard-coded format strings "event=" and "umask=" and provides more
flexibility.
Suggested-by: Ian Rogers <irogers@google.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Reviewed-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20241011110207.1032235-2-dapeng1.mi@linux.intel.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
The commit "3b5edc0421e2 (perf x86/topdown: Don't move topdown metric
events in group)" modifies topdown metrics comparator to move topdown
metrics events which are not in same group with previous event. But it
just modifies the 2nd comparator and causes the comparators become
asymmetric.
Thus modify the 1st topdown metrics comparator and make the two
comparators be symmetric, and refine the comments as well.
Suggested-by: Ian Rogers <irogers@google.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Reviewed-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20241011110207.1032235-1-dapeng1.mi@linux.intel.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Add the expr literals like "#smt_on" as tool events, this allows stat
events to give the values. On my laptop with hyperthreading enabled:
```
$ perf stat -e "has_pmem,num_cores,num_cpus,num_cpus_online,num_dies,num_packages,smt_on,system_tsc_freq" true
Performance counter stats for 'true':
0 has_pmem
8 num_cores
16 num_cpus
16 num_cpus_online
1 num_dies
1 num_packages
1 smt_on
2,496,000,000 system_tsc_freq
0.001113637 seconds time elapsed
0.001218000 seconds user
0.000000000 seconds sys
```
And with hyperthreading disabled:
```
$ perf stat -e "has_pmem,num_cores,num_cpus,num_cpus_online,num_dies,num_packages,smt_on,system_tsc_freq" true
Performance counter stats for 'true':
0 has_pmem
8 num_cores
16 num_cpus
8 num_cpus_online
1 num_dies
1 num_packages
0 smt_on
2,496,000,000 system_tsc_freq
0.000802115 seconds time elapsed
0.000000000 seconds user
0.000806000 seconds sys
```
As zero matters for these values, in stat-display
should_skip_zero_counter only skip the zero value if it is not the
first aggregation index.
The tool event implementations are used in expr but not evaluated as
events for simplicity. Also core_wide isn't made a tool event as it
requires command line parameters.
Signed-off-by: Ian Rogers <irogers@google.com>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/r/20241002032016.333748-8-irogers@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Hard coded terms like "config=10" are skipped by perf_pmu__config
assuming they were already applied to a perf_event_attr by parse
event's config_attr function. When doing a reverse number to name
lookup in perf_pmu__name_from_config, as the hardcoded terms aren't
applied the config value is incorrect leading to misses or false
matches. Fix this by adding a parameter to have perf_pmu__config apply
hardcoded terms too (not just in parse event's config_term_common).
Signed-off-by: Ian Rogers <irogers@google.com>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/r/20241002032016.333748-3-irogers@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
when running below perf command, we say error is reported.
perf record -e "{slots,instructions,topdown-retiring}:S" -vv -C0 sleep 1
------------------------------------------------------------
perf_event_attr:
type 4 (cpu)
size 168
config 0x400 (slots)
sample_type IP|TID|TIME|READ|CPU|PERIOD|IDENTIFIER
read_format ID|GROUP|LOST
disabled 1
sample_id_all 1
exclude_guest 1
------------------------------------------------------------
sys_perf_event_open: pid -1 cpu 0 group_fd -1 flags 0x8 = 5
------------------------------------------------------------
perf_event_attr:
type 4 (cpu)
size 168
config 0x8000 (topdown-retiring)
{ sample_period, sample_freq } 4000
sample_type IP|TID|TIME|READ|CPU|PERIOD|IDENTIFIER
read_format ID|GROUP|LOST
freq 1
sample_id_all 1
exclude_guest 1
------------------------------------------------------------
sys_perf_event_open: pid -1 cpu 0 group_fd 5 flags 0x8
sys_perf_event_open failed, error -22
Error:
The sys_perf_event_open() syscall returned with 22 (Invalid argument) for
event (topdown-retiring).
The reason of error is that the events are regrouped and
topdown-retiring event is moved to closely after the slots event and
topdown-retiring event needs to do the sampling, but Intel PMU driver
doesn't support to sample topdown metrics events.
For topdown metrics events, it just requires to be in a group which has
slots event as leader. It doesn't require topdown metrics event must be
closely after slots event. Thus it's a overkill to move topdown metrics
event closely after slots event in events regrouping and furtherly cause
the above issue.
Thus don't move topdown metrics events forward if they are already in a
group.
Reviewed-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Cc: Yongwei Ma <yongwei.ma@intel.com>
Link: https://lore.kernel.org/r/20240913084712.13861-4-dapeng1.mi@linux.intel.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Addresses an issue where, in the absence of a topdown metrics event
within a sampling group, the slots event was incorrectly bypassed as
the sampling leader when sample_read was enabled.
perf record -e '{slots,branches}:S' -c 10000 -vv sleep 1
In this case, the slots event should be sampled as leader but the
branches event is sampled in fact like the verbose output shows.
perf_event_attr:
type 4 (cpu)
size 168
config 0x400 (slots)
sample_type IP|TID|TIME|READ|CPU|IDENTIFIER
read_format ID|GROUP|LOST
disabled 1
sample_id_all 1
exclude_guest 1
------------------------------------------------------------
sys_perf_event_open: pid -1 cpu 0 group_fd -1 flags 0x8 = 5
------------------------------------------------------------
perf_event_attr:
type 0 (PERF_TYPE_HARDWARE)
size 168
config 0x4 (PERF_COUNT_HW_BRANCH_INSTRUCTIONS)
{ sample_period, sample_freq } 10000
sample_type IP|TID|TIME|READ|CPU|IDENTIFIER
read_format ID|GROUP|LOST
sample_id_all 1
exclude_guest 1
The sample period of slots event instead of branches event is reset to
0.
This fix ensures the slots event remains the leader under these
conditions.
Reviewed-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Cc: Yongwei Ma <yongwei.ma@intel.com>
Link: https://lore.kernel.org/r/20240913084712.13861-3-dapeng1.mi@linux.intel.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
It's not complete to check whether an event is a topdown slots or
topdown metrics event by only comparing the event name since user
may assign the event by RAW format, e.g.
perf stat -e '{instructions,cpu/r400/,cpu/r8300/}' sleep 1
Performance counter stats for 'sleep 1':
<not counted> instructions
<not counted> cpu/r400/
<not supported> cpu/r8300/
1.002917796 seconds time elapsed
0.002955000 seconds user
0.000000000 seconds sys
The RAW format slots and topdown-be-bound events are not recognized and
not regroup the events, and eventually cause error.
Thus add two helpers arch_is_topdown_slots()/arch_is_topdown_metrics()
to detect whether an event is topdown slots/metrics event by comparing
the event config directly, and use these two helpers to replace the
original event name comparisons.
Reviewed-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Cc: Yongwei Ma <yongwei.ma@intel.com>
Link: https://lore.kernel.org/r/20240913084712.13861-2-dapeng1.mi@linux.intel.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
In some cases, compilers don't set the location expression in DWARF
precisely. For instance, it may assign a variable to a register after
copying it from a different register. Then it should use the register
for the new type but still uses the old register. This makes hard to
track the type information properly.
This is an example I found in __tcp_transmit_skb(). The first argument
(sk) of this function is a pointer to sock and there's a variable (tp)
for tcp_sock.
static int __tcp_transmit_skb(struct sock *sk, struct sk_buff *skb,
int clone_it, gfp_t gfp_mask, u32 rcv_nxt)
{
...
struct tcp_sock *tp;
BUG_ON(!skb || !tcp_skb_pcount(skb));
tp = tcp_sk(sk);
prior_wstamp = tp->tcp_wstamp_ns;
tp->tcp_wstamp_ns = max(tp->tcp_wstamp_ns, tp->tcp_clock_cache);
...
So it basically calls tcp_sk(sk) to get the tcp_sock pointer from sk.
But it turned out to be the same value because tcp_sock embeds sock as
the first member. The sk is located in reg5 (RDI) and tp is in reg3
(RBX). The offset of tcp_wstamp_ns is 0x748 and tcp_clock_cache is
0x750. So you need to use RBX (reg3) to access the fields in the
tcp_sock. But the code used RDI (reg5) as it has the same value.
$ pahole --hex -C tcp_sock vmlinux | grep -e 748 -e 750
u64 tcp_wstamp_ns; /* 0x748 0x8 */
u64 tcp_clock_cache; /* 0x750 0x8 */
And this is the disassembly of the part of the function.
<__tcp_transmit_skb>:
...
44: mov %rdi, %rbx
47: mov 0x748(%rdi), %rsi
4e: mov 0x750(%rdi), %rax
55: cmp %rax, %rsi
Because compiler put the debug info to RBX, it only knows RDI is a
pointer to sock and accessing those two fields resulted in error
due to offset being beyond the type size.
-----------------------------------------------------------
find data type for 0x748(reg5) at __tcp_transmit_skb+0x63
CU for net/ipv4/tcp_output.c (die:0x817f543)
frame base: cfa=0 fbreg=6
scope: [1/1] (die:81aac3e)
bb: [0 - 30]
var [0] -0x98(stack) type='struct tcp_out_options' size=0x28 (die:0x81af3df)
var [5] reg8 type='unsigned int' size=0x4 (die:0x8180ed6)
var [5] reg2 type='unsigned int' size=0x4 (die:0x8180ed6)
var [5] reg1 type='int' size=0x4 (die:0x818059e)
var [5] reg4 type='struct sk_buff*' size=0x8 (die:0x8181360)
var [5] reg5 type='struct sock*' size=0x8 (die:0x8181a0c) <<<--- the first argument ('sk' at %RDI)
mov [19] reg8 -> -0xa8(stack) type='unsigned int' size=0x4 (die:0x8180ed6)
mov [20] stack canary -> reg0
mov [29] reg0 -> -0x30(stack) stack canary
bb: [36 - 3e]
mov [36] reg4 -> reg15 type='struct sk_buff*' size=0x8 (die:0x8181360)
bb: [44 - 63]
mov [44] reg5 -> reg3 type='struct sock*' size=0x8 (die:0x8181a0c) <<<--- calling tcp_sk()
var [47] reg3 type='struct tcp_sock*' size=0x8 (die:0x819eead) <<<--- new variable ('tp' at %RBX)
var [4e] reg4 type='unsigned long long' size=0x8 (die:0x8180edd)
mov [58] reg4 -> -0xc0(stack) type='unsigned long long' size=0x8 (die:0x8180edd)
chk [63] reg5 offset=0x748 ok=1 kind=1 (struct sock*) : offset bigger than size <<<--- access with old variable
final result: offset bigger than size
While it's a fault in the compiler, we could work around this issue by
using the type of new variable when it's copied directly. So I've added
copied_from field in the register state to track those direct register
to register copies. After that new register gets a new type and the old
register still has the same type, it'll update (copy it back) the type
of the old register.
For example, if we can update type of reg5 at __tcp_transmit_skb+0x47,
we can find the target type of the instruction at 0x63 like below:
-----------------------------------------------------------
find data type for 0x748(reg5) at __tcp_transmit_skb+0x63
...
bb: [44 - 63]
mov [44] reg5 -> reg3 type='struct sock*' size=0x8 (die:0x8181a0c)
var [47] reg3 type='struct tcp_sock*' size=0x8 (die:0x819eead)
var [47] copyback reg5 type='struct tcp_sock*' size=0x8 (die:0x819eead) <<<--- here
mov [47] 0x748(reg5) -> reg4 type='unsigned long long' size=0x8 (die:0x8180edd)
mov [4e] 0x750(reg5) -> reg0 type='unsigned long long' size=0x8 (die:0x8180edd)
mov [58] reg4 -> -0xc0(stack) type='unsigned long long' size=0x8 (die:0x8180edd)
chk [63] reg5 offset=0x748 ok=1 kind=1 (struct tcp_sock*) : Good! <<<--- new type
found by insn track: 0x748(reg5) type-offset=0x748
final result: type='struct tcp_sock' size=0xa98 (die:0x819eeb2)
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20240821232628.353177-5-namhyung@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
In check_matching_type(), it checks the type state of the register in a
wrong order. When it's the percpu pointer, it should check the type for
the pointer, but it checks the CFA bit first and thought it has no type
in the stack slot. This resulted in no type info.
-----------------------------------------------------------
find data type for 0x28(reg1) at hrtimer_reprogram+0x88
CU for kernel/time/hrtimer.c (die:0x18f219f)
frame base: cfa=1 fbreg=7
...
add [72] percpu 0x24500 -> reg1 pointer type='struct hrtimer_cpu_base' size=0x240 (die:0x18f6d46)
bb: [7a - 7e]
bb: [80 - 86] (here)
bb: [88 - 88] vvv
chk [88] reg1 offset=0x28 ok=1 kind=4 cfa : no type information
no type information
Here, instruction at 0x72 found reg1 has a (percpu) pointer and got the
correct type. But when it checks the final result, it wrongly thought
it was stack variable because it checks the cfa bit first.
After changing the order of state check:
-----------------------------------------------------------
find data type for 0x28(reg1) at hrtimer_reprogram+0x88
CU for kernel/time/hrtimer.c (die:0x18f219f)
frame base: cfa=1 fbreg=7
... (here)
vvvvvvvvvv
chk [88] reg1 offset=0x28 ok=1 kind=4 percpu ptr : Good!
found by insn track: 0x28(reg1) type-offset=0x28
final type: type='struct hrtimer_cpu_base' size=0x240 (die:0x18f6d46)
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20240821065408.285548-3-namhyung@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
To pick up the latest perf-tools merge for 6.11, i.e. to have the
current perf tools branch that is getting into 6.11 with the
perf-tools-next that is geared towards 6.12.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
When retire_latency value is used in a metric formula, evsel would fork
a 'perf record' process with "-e" and "-W" options. 'perf record' will
collect required retire_latency values in parallel while 'perf stat' is
collecting counting values.
At the point of time that 'perf stat' stops counting, evsel would stop
'perf record' by sending sigterm signal to 'perf record' process.
Sampled data will be processed to get retire latency value. Another
thread is required to synchronize between 'perf stat' and 'perf record'
when we pass data through pipe.
Retire_latency evsel is not opened for 'perf stat' so that there is no
counter wasted on it. This commit includes code suggested by Namhyung to
adjust reading size for groups that include retire_latency evsels.
In current :R parsing implementation, the parser would recognize events
with retire_latency modifier and insert them into the evlist like a
normal event. Ideally, we need to avoid counting these events.
In this commit, at the time when a retire_latency evsel is read, set the
retire latency value processed from the sampled data to count value.
This sampled retire latency value will be used for metric calculation
and final event count print out. No special metric calculation and event
print out code required for retire_latency events.
Reviewed-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Weilin Wang <weilin.wang@intel.com>
Acked-by: Ian Rogers <irogers@google.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Caleb Biggers <caleb.biggers@intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Perry Taylor <perry.taylor@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Samantha Alt <samantha.alt@intel.com>
Link: https://lore.kernel.org/r/20240720062102.444578-4-weilin.wang@intel.com
[ Squashed the 3rd and 4th commit in the series to keep it building patch by patch ]
[ Constified the 'struct perf_tool' pointer in process_sample_event() ]
[ Use perf_tool__init(&tool, false) to address a segfault I reported and Ian/Weilin diagnosed ]
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>