struct text_poke_array is an equivalent structure to these global variables:
static struct smp_text_poke_loc tp_vec[TP_VEC_MAX];
static int tp_vec_nr;
Note that we intentionally mirror much of the naming of
'struct text_poke_int3_vec', which will further highlight
the unecessary layering going on in this code, and will
ease its removal.
No change in functionality.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250411054105.2341982-28-mingo@kernel.org
There's this weird hack used by smp_text_poke_batch_finish() to indicate
a 'forced flush':
smp_text_poke_batch_flush(NULL);
Just open-code the vector-flush in a straightforward fashion:
smp_text_poke_batch_process(tp_vec, tp_vec_nr);
tp_vec_nr = 0;
And get rid of !addr hack from text_poke_addr_ordered().
Leave a WARN_ON_ONCE(), just in case some external code learned
to rely on this behavior.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250411054105.2341982-24-mingo@kernel.org
It's possible to escape the text_mutex-held assert in
smp_text_poke_batch_process() if the caller uses a properly
batched and sorted series of patch requests, so add
an explicit lockdep_assert_held() to make sure it's
held by all callers.
All text_poke_int3_*() APIs will call either smp_text_poke_batch_process()
or smp_text_poke_batch_flush() internally.
The text_mutex must be held, because tp_vec and tp_vec_nr et al
are all globals, and the INT3 patching machinery itself relies on
external serialization.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250411054105.2341982-22-mingo@kernel.org
This name is actively confusing as well, because the simple text_poke*()
APIs use MM-switching based code patching, while text_poke_loc_init()
is part of the INT3 based text_poke_int3_*() machinery that is an
additional layer of functionality on top of regular text_poke*() functionality.
Rename it to text_poke_int3_loc_init() to make it clear which layer
it belongs to.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250411054105.2341982-18-mingo@kernel.org
This name is actively confusing as well, because the simple text_poke*()
APIs use MM-switching based code patching, while text_poke_queue()
is part of the INT3 based text_poke_int3_*() machinery that is an
additional layer of functionality on top of regular text_poke*() functionality.
Rename it to smp_text_poke_batch_add() to make it clear which layer
it belongs to.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250411054105.2341982-17-mingo@kernel.org
This name is actively confusing as well, because the simple text_poke*()
APIs use MM-switching based code patching, while text_poke_finish()
is part of the INT3 based text_poke_int3_*() machinery that is an
additional layer of functionality on top of regular text_poke*() functionality.
Rename it to smp_text_poke_batch_finish() to make it clear which layer
it belongs to.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250411054105.2341982-16-mingo@kernel.org
This name is actually actively confusing, because the simple text_poke*()
APIs use MM-switching based code patching, while text_poke_flush()
is part of the INT3 based text_poke_int3_*() machinery that is an
additional layer of functionality on top of regular text_poke*() functionality.
Rename it to smp_text_poke_batch_flush() to make it clear which layer
it belongs to.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250411054105.2341982-15-mingo@kernel.org
So the temp_mm_state_t abstraction used by use_temporary_mm() and
unuse_temporary_mm() is super confusing:
- The whole machinery is about temporarily switching to the
text_poke_mm utility MM that got allocated during bootup
for text-patching purposes alone:
temp_mm_state_t prev;
/*
* Loading the temporary mm behaves as a compiler barrier, which
* guarantees that the PTE will be set at the time memcpy() is done.
*/
prev = use_temporary_mm(text_poke_mm);
- Yet the value that gets saved in the temp_mm_state_t variable
is not the temporary MM ... but the previous MM...
- Ie. we temporarily put the non-temporary MM into a variable
that has the temp_mm_state_t type. This makes no sense whatsoever.
- The confusion continues in unuse_temporary_mm():
static inline void unuse_temporary_mm(temp_mm_state_t prev_state)
Here we unuse an MM that is ... not the temporary MM, but the
previous MM. :-/
Fix up all this confusion by removing the unnecessary layer of
abstraction and using a bog-standard 'struct mm_struct *prev_mm'
variable to save the MM to.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250411054105.2341982-14-mingo@kernel.org
eBPF programs can be run 50,000,000 times per second on busy servers.
Whenever /proc/sys/kernel/bpf_stats_enabled is turned off,
hundreds of calls sites are patched from text_poke_bp_batch()
and we see a huge loss of performance due to false sharing
on bp_desc.refs lasting up to three seconds.
51.30% server_bin [kernel.kallsyms] [k] poke_int3_handler
|
|--46.45%--poke_int3_handler
| exc_int3
| asm_exc_int3
| |
| |--24.26%--cls_bpf_classify
| | tcf_classify
| | __dev_queue_xmit
| | ip6_finish_output2
| | ip6_output
| | ip6_xmit
| | inet6_csk_xmit
| | __tcp_transmit_skb
Fix this by replacing bp_desc.refs with a per-cpu bp_refs.
Before the patch, on a host with 240 cores (480 threads):
$ sysctl -wq kernel.bpf_stats_enabled=0
text_poke_bp_batch(nr_entries=164) : Took 2655300 usec
$ bpftool prog | grep run_time_ns
...
105: sched_cls name hn_egress tag 699fc5eea64144e3 gpl run_time_ns
3009063719 run_cnt 82757845 : average cost is 36 nsec per call
After this patch:
$ sysctl -wq kernel.bpf_stats_enabled=0
text_poke_bp_batch(nr_entries=164) : Took 702 usec
$ bpftool prog | grep run_time_ns
...
105: sched_cls name hn_egress tag 699fc5eea64144e3 gpl run_time_ns
1928223019 run_cnt 67682728 : average cost is 28 nsec per call
Ie. text-patching performance improved 3700x: from 2.65 seconds
to 0.0007 seconds.
Since the atomic_cond_read_acquire(refs, !VAL) spin-loop was not triggered
even once in my tests, add an unlikely() annotation, because this appears
to be the common case.
[ mingo: Improved the changelog some more. ]
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lore.kernel.org/r/20250411054105.2341982-2-mingo@kernel.org
Pull soundwire fix from Vinod Koul:
- add missing config symbol CONFIG_SND_HDA_EXT_CORE required for asoc
driver CONFIG_SND_SOF_SOF_HDA_SDW_BPT
* tag 'soundwire-6.15-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vkoul/soundwire:
ASoC: SOF: Intel: Let SND_SOF_SOF_HDA_SDW_BPT select SND_HDA_EXT_CORE
Support up to 8192 processors
Add cpuidle governor debug telemetry, disabled by default
Update default output to exclude cpuidle invocation counts
Bug fixes
Signed-off-by: Len Brown <len.brown@intel.com>
Create "pct_idle" counter group, the sofware notion of residency
so it can now be singled out, independent of other counter groups.
Create "cpuidle" group, the cpuidle invocation counts.
Disable "cpuidle", by default.
Create "swidle" = "cpuidle" + "pct_idle".
Undocument "sysfs", the old name for "swidle", but keep it working
for backwards compatibilty.
Create "hwidle", all the HW idle counters
Modify "idle", enabled by default
"idle" = "hwidle" + "pct_idle" (and now excludes "cpuidle")
Signed-off-by: Len Brown <len.brown@intel.com>
Pull perf event fix from Ingo Molnar:
"Fix a perf events time accounting bug"
* tag 'perf-urgent-2025-04-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/core: Fix child_total_time_enabled accounting bug at task exit
Pull scheduler fixes from Ingo Molnar:
- Fix a nonsensical Kconfig combination
- Remove an unnecessary rseq-notification
* tag 'sched-urgent-2025-04-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
rseq: Eliminate useless task_work on execve
sched/isolation: Make CONFIG_CPU_ISOLATION depend on CONFIG_SMP
... and don't error out so hard on missing module descriptions.
Before commit 6c6c1fc09d ("modpost: require a MODULE_DESCRIPTION()")
we used to warn about missing module descriptions, but only when
building with extra warnigns (ie 'W=1').
After that commit the warning became an unconditional hard error.
And it turns out not all modules have been converted despite the claims
to the contrary. As reported by Damian Tometzki, the slub KUnit test
didn't have a module description, and apparently nobody ever really
noticed.
The reason nobody noticed seems to be that the slub KUnit tests get
disabled by SLUB_TINY, which also ends up disabling a lot of other code,
both in tests and in slub itself. And so anybody doing full build tests
didn't actually see this failre.
So let's disable SLUB_TINY for build-only tests, since it clearly ends
up limiting build coverage. Also turn the missing module descriptions
error back into a warning, but let's keep it around for non-'W=1'
builds.
Reported-by: Damian Tometzki <damian@riscv-rocks.de>
Link: https://lore.kernel.org/all/01070196099fd059-e8463438-7b1b-4ec8-816d-173874be9966-000000@eu-central-1.amazonses.com/
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Jeff Johnson <jeff.johnson@oss.qualcomm.com>
Fixes: 6c6c1fc09d ("modpost: require a MODULE_DESCRIPTION()")
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Probe cpuidle "sysfs" residency and counts separately,
since soon we will make one disabled on, and the
other disabled off.
Clarify that some BIC (build-in-counters) are actually "groups".
since we're about to re-name some of those groups.
no functional change.
Signed-off-by: Len Brown <len.brown@intel.com>
Do fflush() to discard the buffered data, before each read of the
graphics sysfs knobs.
Fixes: ba99a4fc8c ("tools/power turbostat: Remove unnecessary fflush() call")
Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Document that on Intel Granite Rapids Systems,
Uncore domains 0-2 are CPU domains, and
uncore domains 3-4 are IO domains.
Signed-off-by: Len Brown <len.brown@intel.com>
The CoreThr column displays total thermal throttling events
since boot time.
Change it to report events during the measurement interval.
This is more useful for showing a user the current conditions.
Total events since boot time are still available to the user via
/sys/devices/system/cpu/cpu*/thermal_throttle/*
Document CoreThr on turbostat.8
Fixes: eae97e053f ("turbostat: Support thermal throttle count print")
Reported-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Cc: Chen Yu <yu.c.chen@intel.com>
On systems with >= 1024 cpus (in my case 1152), turbostat fails with the error output:
"turbostat: /sys/fs/cgroup/cpuset.cpus.effective: cpu str malformat 0-1151"
A similar error appears with the use of turbostat --cpu when the inputted cpu
range contains a cpu number >= 1024:
# turbostat -c 1100-1151
"--cpu 1100-1151" malformed
...
Both errors are caused by parse_cpu_str() reaching its limit of CPU_SUBSET_MAXCPUS.
It's a good idea to limit the maximum cpu number being parsed, but 1024 is too low.
For a small increase in compute and allocated memory, increasing CPU_SUBSET_MAXCPUS
brings support for parsing cpu numbers >= 1024.
Increase CPU_SUBSET_MAXCPUS to 8192, a common setting for CONFIG_NR_CPUS on x86_64.
Signed-off-by: Justin Ernst <justin.ernst@hpe.com>
Signed-off-by: Len Brown <len.brown@intel.com>