Occasionally events (e.g., context-switch, sched tracepoints) lose the
tid-to-comm conversion for the sample data associated with a thread. For example:
$ perf record -e sched:sched_switch -c 1 -a -- sleep 5
$ perf script
<selected events shown>
ls 30482 [000] 1379727.583037: sched:sched_switch: prev_comm=ls prev_pid=30482 ...
ls 30482 [000] 1379727.586339: sched:sched_switch: prev_comm=ls prev_pid=30482 ...
:30482 30482 [000] 1379727.589462: sched:sched_switch: prev_comm=ls prev_pid=30482 ...
The last line lost the conversion from tid to comm. If you look at the events
(perf script -D) you can see why: a SAMPLE event is generated after the EXIT:
0 1379727589449774 0x1540b0 [0x38]: PERF_RECORD_EXIT(30482:30482):(30482:30482)
0 1379727589462497 0x1540e8 [0x80]: PERF_RECORD_SAMPLE(IP, 1): 30482/30482: 0xffffffff816416f1 period: 1 addr: 0
... thread: :30482:30482
When perf processes the EXIT event the thread is moved to the dead_threads
list. When the SAMPLE event is processed no thread exists for the pid so a new
one is created by machine__findnew_thread.
This patch addresses the problem by delaying the move to the dead_threads
list until the tid is re-used (per Adrian's suggestion).
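The idea, as a simplified sketch (field and helper names are illustrative,
not the exact diff):

  struct thread {
      /* ... */
      bool    dead;   /* set on PERF_RECORD_EXIT */
  };

  /* On EXIT: keep the thread findable, just flag it, so later samples
   * for the same tid still resolve to the comm. */
  static void machine__remove_thread(struct machine *machine,
                                     struct thread *th)
  {
      th->dead = true;
  }

  /* Only when machine__findnew_thread() sees the tid re-used does the
   * old, dead thread move to machine->dead_threads. */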
With this patch the previous example shows:
ls 30482 [000] 1379727.583037: sched:sched_switch: prev_comm=ls prev_pid=30482 ...
ls 30482 [000] 1379727.586339: sched:sched_switch: prev_comm=ls prev_pid=30482 ...
ls 30482 [000] 1379727.589462: sched:sched_switch: prev_comm=ls prev_pid=30482 ...
and
0 1379727589449774 0x1540b0 [0x38]: PERF_RECORD_EXIT(30482:30482):(30482:30482)
0 1379727589462497 0x1540e8 [0x80]: PERF_RECORD_SAMPLE(IP, 1): 30482/30482: 0xffffffff816416f1 period: 1 addr: 0
... thread: ls:30482
v4: per Arnaldo's request, added a dead flag to the thread struct, set when the task exits
v3: re-done from a time-based check to a delayed move to the dead_threads list
v2: rebased to latest perf/core branch; changed the time comparison to use
a macro which explicitly shows the time basis
Signed-off-by: David Ahern <dsahern@gmail.com>
Acked-by: Adrian Hunter <adrian.hunter@intel.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Link: http://lkml.kernel.org/r/1376491767-84171-1-git-send-email-dsahern@gmail.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
perf stat -a needs 10 open file descriptors per logical CPU;
perf stat -a -dddd needs 20 open fds per logical CPU.
This implies that stat -a doesn't work on any system with the default
ulimit -n 1024 that has more than ~100 CPUs, and that stat -a -dddd
doesn't work on anything with more than 46 CPUs.
Longer term there probably needs to be some way to lower the file
descriptor requirements. This would need some changes in the kernel/user
interface.
But short term this patch just tries to increase the file descriptor
limit in perf itself when it runs into EMFILE.
It first sets the soft limit to the hard limit, and then tries to
increase the hard limit.
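Roughly, the retry logic looks like this (a sketch with an illustrative
function name, not the exact patch):

  #include <sys/resource.h>
  #include <stdbool.h>

  /* Called when opening the events fails with EMFILE: first raise the
   * soft limit to the hard limit, then try to raise the hard limit
   * itself (which needs CAP_SYS_RESOURCE, i.e. usually root).
   * Returns true if the caller should retry the open. */
  static bool increase_nofile_limit(void)
  {
      struct rlimit l;

      if (getrlimit(RLIMIT_NOFILE, &l) < 0)
          return false;

      if (l.rlim_cur < l.rlim_max)
          l.rlim_cur = l.rlim_max;              /* soft -> hard */
      else
          l.rlim_cur = l.rlim_max = l.rlim_max * 2;  /* grow hard limit */

      return setrlimit(RLIMIT_NOFILE, &l) == 0;
  }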
On Fedora systems the default seems to be a soft limit of 1024 and a
hard limit of 4*1024. So even non-root users can support 409 or 186
CPUs, respectively; root can go far higher.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Link: http://lkml.kernel.org/r/1375670486-15480-1-git-send-email-andi@firstfloor.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Freq events may not always be affine to a particular CPU. As such,
account_event_cpu() may crash if we account per cpu a freq event
that has event->cpu == -1.
To solve this, let's account freq events globally. In practice
this doesn't change the picture much because perf tools create
per-task perf events with one event per CPU by default. Profiling a
single CPU is usually a corner case, so there is not much point in
optimizing things that way.
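The change amounts to moving the freq accounting out of the per-cpu
path, along these lines (simplified from the perf core, details
approximate):

  static void account_event_cpu(struct perf_event *event, int cpu)
  {
      /* No longer touches freq accounting here: cpu can be -1
       * for a freq event that isn't affine to a CPU. */
      if (is_cgroup_event(event))
          atomic_inc(&per_cpu(perf_cgroup_events, cpu));
  }

  static void account_event(struct perf_event *event)
  {
      /* ... */
      if (event->attr.freq)
          atomic_inc(&nr_freq_events);  /* global, not per cpu */
      /* ... */
  }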
Reported-by: Jiri Olsa <jolsa@redhat.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Tested-by: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1375460996-16329-3-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When we fail to allocate the callchain buffers, we roll back the refcount
we did and return from get_callchain_buffers().
However we take the refcount and allocate under the callchain lock
but the rollback is done outside the lock.
As a result, while we roll back, some concurrent callchain user may
call get_callchain_buffers(), see the non-zero refcount and give up
because the buffers are NULL without itself retrying the allocation.
The consequences aren't that bad, but that behaviour looks weird enough
that it's better to give subsequent callchain users a chance to retry
the allocation where we failed.
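In effect, the fix keeps the rollback under the same lock as the
refcount and allocation, roughly (a simplified sketch of the kernel
code):

  int get_callchain_buffers(void)
  {
      int err = 0;

      mutex_lock(&callchain_mutex);

      if (atomic_inc_return(&nr_callchain_events) == 1) {
          err = alloc_callchain_buffers();
          if (err)
              /* Roll back while still holding the lock so a
               * concurrent caller sees refcount == 0 and retries
               * the allocation itself. */
              atomic_dec(&nr_callchain_events);
      }

      mutex_unlock(&callchain_mutex);
      return err;
  }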
Reported-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1375460996-16329-2-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This patch adds support for the SNB-EP PCU uncore PMU extra_sel_bit
(bit 21) which is missing from the documentation in Table-2.75 of
the Intel Xeon Processor E5-2600 Product Family Uncore Performance
Monitoring Guide. It is referred to later in Table-2.81. Without
this selection bit explicitly enabled by the kernel, some events
such as COREx_TRANSITION_CYCLES do not count correctly.
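Concretely, the patch boils down to defining the extra select bit and
including it in the PCU event mask, something like (the macro name
follows the snbep uncore code; check the diff for the exact spelling):

  /* bit 21 of the PCU PMON control register: extra event select bit,
   * needed e.g. by the COREx_TRANSITION_CYCLES events */
  #define SNBEP_PCU_MSR_PMON_CTL_EV_SEL_EXT  (1 << 21)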
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1376375382-21350-4-git-send-email-zheng.z.yan@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The QPI uncore boxes have two pairs of MATCH/MASK registers that are
used to filter packet traffic serviced by the QPI link layer. These
registers are in auxiliary PCI devices.
This patch changes the meaning of (struct pci_device_id)->driver_data:
the first 8 bits are the device index within the same uncore type, the
next 8 bits are the uncore type index. The auxiliary PCI device's type
is defined as UNCORE_EXTRA_PCI_DEV (0xff).
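The encoding looks roughly like this (macro names follow the uncore
code; verify against the diff):

  /* driver_data layout: bits 0-7 = device index within the uncore
   * type, bits 8-15 = uncore type index */
  #define UNCORE_PCI_DEV_DATA(type, idx)  ((type) << 8 | (idx))
  #define UNCORE_PCI_DEV_TYPE(data)       (((data) >> 8) & 0xff)
  #define UNCORE_PCI_DEV_IDX(data)        ((data) & 0xff)
  /* auxiliary devices (e.g. the QPI MATCH/MASK functions) use a
   * pseudo type that no regular uncore box claims */
  #define UNCORE_EXTRA_PCI_DEV            0xff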
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1375856245-10717-1-git-send-email-zheng.z.yan@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull perf/core improvements and fixes from Arnaldo Carvalho de Melo:
* Allow specifying syscalls in 'perf trace', a la strace.
* Simplify symbol filtering by doing it at machine class level,
from Adrian Hunter.
* Add option to 'perf kvm' to print only events that exceed a specified time
duration, from David Ahern.
* 'perf sched' improvements, including removing some tracepoints that provide
the same information as the PERF_RECORD_{FORK,EXIT} events.
* Improve stack trace printing, from David Ahern.
* Update documentation with the live command, from David Ahern.
* Fix 'perf test' compile failure on do_sort_something, from David Ahern.
* Improve robustness of topology parsing code, from Stephane Eranian.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Commit b55ae0a9 added code-reading.c which fails to compile on Fedora 16
with compiler version:
$ gcc --version
gcc (GCC) 4.6.3 20120306 (Red Hat 4.6.3-2)
Failure message is:
tests/code-reading.c: In function ‘do_sort_something’:
tests/code-reading.c:305:13: error: stack protector not protecting local variables: variable length buffer [-Werror=stack-protector]
cc1: all warnings being treated as errors
make: *** [/tmp/junk/tests/code-reading.o] Error 1
make: *** Waiting for unfinished jobs....
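The fix replaces the variable-length stack buffer with a fixed-size
one, roughly along these lines (a reconstructed sketch, not the
verbatim diff):

  static void do_sort_something(void)
  {
      /* a compile-time constant size instead of a VLA keeps
       * -fstack-protector happy on gcc 4.6 */
      int buf[40960], i;

      for (i = 0; i < (int)ARRAY_SIZE(buf); i++)
          buf[i] = ARRAY_SIZE(buf) - i - 1;

      qsort(buf, ARRAY_SIZE(buf), sizeof(int), comp);
  }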
v2: changed sizeof to ARRAY_SIZE, as Adrian noticed
Signed-off-by: David Ahern <dsahern@gmail.com>
Acked-by: Adrian Hunter <adrian.hunter@intel.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Link: http://lkml.kernel.org/r/1376454732-83728-1-git-send-email-dsahern@gmail.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
perf_trace_buf_prepare() + perf_trace_buf_submit(task => NULL)
make no sense if hlist_empty(head). Change perf_trace_##call()
to check ->perf_events beforehand and do nothing if it is empty.
This removes the overhead for tasks without events associated
with them. For example, "perf record -e sched:sched_switch -p1"
attaches the counter(s) to the single task, but every task in the
system will do perf_trace_buf_prepare/submit() just to realize
that it was not attached to this event.
However, we can only do this if __task == NULL, so we also add
the __builtin_constant_p(__task) check.
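The added fast path in the generated perf_trace_##call() body looks
roughly like this (a sketch):

  head = this_cpu_ptr(event_call->perf_events);

  /* constant-folded away for tracepoints that pass a real task;
   * for __task == NULL, bail out before any buffer work if nobody
   * is attached on this CPU */
  if (__builtin_constant_p(!__task) && !__task && hlist_empty(head))
      return;

  /* only now pay for perf_fetch_caller_regs() and
   * perf_trace_buf_prepare() */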
With this patch "perf bench sched pipe" shows approximately 4%
improvement when "perf record -p1" runs in parallel, many thanks
to Steven for the testing.
Link: http://lkml.kernel.org/r/20130806160847.GA2746@redhat.com
Tested-by: David Ahern <dsahern@gmail.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The next patch tries to avoid the costly perf_trace_buf_* calls
when possible, but there is a problem. We can only do this if
__task == NULL; perf_tp_event(task != NULL) has the additional
code for this case.
Unfortunately, TP_perf_assign/__perf_xxx, which change the default
values of the __count/__task variables for perf_trace_buf_submit(),
are called "too late", after we have already done
perf_trace_buf_prepare(), so the optimization above can't work.
So this patch simply embeds __perf_xxx() into TP_ARGS(); this way
DECLARE_EVENT_CLASS() can use the result of the assignments hidden in
"args" right after ftrace_get_offsets_##call(), which is mostly
trivial. This allows us to have the fast-path "__task != NULL"
check at the start; see the next patch.
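As an illustration, sched_stat_runtime ends up carrying the annotation
directly in TP_ARGS() (a sketch based on the tracepoint definition):

  DECLARE_EVENT_CLASS(sched_stat_runtime,

      TP_PROTO(struct task_struct *tsk, u64 runtime, u64 vruntime),

      /* __perf_count() is now evaluated as part of TP_ARGS(),
       * right after ftrace_get_offsets_##call(), so its assignment
       * to __count happens before perf_trace_buf_prepare() runs */
      TP_ARGS(tsk, __perf_count(runtime), vruntime),
      /* ... TP_STRUCT__entry etc. unchanged ... */
  );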
Link: http://lkml.kernel.org/r/20130806160844.GA2739@redhat.com
Tested-by: David Ahern <dsahern@gmail.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
To simplify the review of the next patches:
1. We are going to reimplement __perf_task/counter and embed them
into TP_ARGS(). Expand TRACE_EVENT(sched_stat_runtime) into
DECLARE_EVENT_CLASS() + DEFINE_EVENT(), so that they can use
different TP_ARGS's.
2. Change the perf_trace_##call() macro to do perf_fetch_caller_regs()
right before perf_trace_buf_prepare(); see the sketch after this list.
This way it evaluates TP_ARGS() asap; the next patch exploits
this fact.
Note: after 87f44bbc perf_trace_buf_prepare() doesn't need
"struct pt_regs *regs", perhaps it makes sense to remove this
argument. And perhaps we can teach perf_trace_buf_submit()
to accept regs == NULL and do fetch_caller_regs(CALLER_ADDR1)
in this case.
3. Cosmetic, but the typecast from "void *" buys nothing. It just
adds noise; remove it.
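A sketch of the reordered perf_trace_##call() body (items 2 and 3;
simplified from the event template):

  static notrace void perf_trace_##call(void *__data, proto)
  {
      struct ftrace_event_call *event_call = __data;
      struct pt_regs __regs;
      /* ... __entry_size/__data_size/rctx setup ... */

      /* moved up: fetch regs before preparing the buffer, so
       * TP_ARGS() is evaluated as early as possible */
      perf_fetch_caller_regs(&__regs);
      entry = perf_trace_buf_prepare(__entry_size,
                  event_call->event.type, &__regs, &rctx);
      /* note: no "(void *)" cast on the assignment any more */
      /* ... */
  }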
Link: http://lkml.kernel.org/r/20130806160841.GA2736@redhat.com
Acked-by: Peter Zijlstra <peterz@infradead.org>
Tested-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>