The memory barrier rmb() in generic idle loop do_idle() function is not
needed, it doesn't order any load instruction, just remove it as needless
rmb() can cause performance impact.
The rmb() was introduced by the tglx/history.git commit f2f1b44c75c4
("[PATCH] Remove RCU abuse in cpu_idle()") to order the loads between
cpu_idle_map and pm_idle. It pairs with wmb() in function cpu_idle_wait().
And then with the removal of cpu_idle_state in function cpu_idle() and
wmb() in function cpu_idle_wait() in commit 783e391b7b ("x86: Simplify
cpu_idle_wait"), rmb() no longer has a reason to exist.
After that, commit d166991234 ("idle: Implement generic idle function")
implemented a generic idle function cpu_idle_loop() which resembles the
functionality found in arch/. And it retained the rmb() in generic idle
loop in file kernel/cpu/idle.c.
And at last, commit cf37b6b484 ("sched/idle: Move cpu/idle.c to
sched/idle.c") moved cpu/idle.c to sched/idle.c. And commit c1de45ca83
("sched/idle: Add support for tasks that inject idle") renamed function
cpu_idle_loop() to do_idle().
History Tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
Signed-off-by: Zhongqiu Han <quic_zhonhan@quicinc.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20241009093745.9504-1-quic_zhonhan@quicinc.com
Sean noted that ever since commit 152e11f6df ("sched/fair: Implement
delayed dequeue") KVM's preemption notifiers have started
mis-classifying preemption vs blocking.
Notably p->on_rq is no longer sufficient to determine if a task is
runnable or blocked -- the aforementioned commit introduces tasks that
remain on the runqueue even through they will not run again, and
should be considered blocked for many cases.
Add the task_is_runnable() helper to classify things and audit all
external users of the p->on_rq state. Also add a few comments.
Fixes: 152e11f6df ("sched/fair: Implement delayed dequeue")
Reported-by: Sean Christopherson <seanjc@google.com>
Tested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20241010091843.GK33184@noisy.programming.kicks-ass.net
Since sched_delayed tasks remain queued even after blocking, the load
balancer can migrate them between runqueues while PSI considers them
to be asleep. As a result, it misreads the migration requeue followed
by a wakeup as a double queue:
psi: inconsistent task state! task=... cpu=... psi_flags=4 clear=. set=4
First, call psi_enqueue() after p->sched_class->enqueue_task(). A
wakeup will clear p->se.sched_delayed while a migration will not, so
psi can use that flag to tell them apart.
Then teach psi to migrate any "sleep" state when delayed-dequeue tasks
are being migrated.
Delayed-dequeue tasks can be revived by ttwu_runnable(), which will
call down with a new ENQUEUE_DELAYED. Instead of further complicating
the wakeup conditional in enqueue_task(), identify migration contexts
instead and default to wakeup handling for all other cases.
It's not just the warning in dmesg, the task state corruption causes a
permanent CPU pressure indication, which messes with workload/machine
health monitoring.
Debugged-by-and-original-fix-by: K Prateek Nayak <kprateek.nayak@amd.com>
Fixes: 152e11f6df ("sched/fair: Implement delayed dequeue")
Closes: https://lore.kernel.org/lkml/20240830123458.3557-1-spasswolf@web.de/
Closes: https://lore.kernel.org/all/cd67fbcd-d659-4822-bb90-7e8fbb40a856@molgen.mpg.de/
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lkml.kernel.org/r/20241010193712.GC181795@cmpxchg.org
Pull RCU fix from Neeraj Upadhyay:
"Fix rcuog kthread wakeup invocation from softirq context on a CPU
which has been marked offline.
This can happen when new callbacks are enqueued from a softirq on an
offline CPU before it calls rcutree_report_cpu_dead(). When this
happens on NOCB configuration, the rcuog wake-up is deferred through
an IPI to an online CPU. This is done to avoid call into the scheduler
which can risk arming the RT-bandwidth after hrtimers have been
migrated out and disabled.
However, doing IPI call from softirq is not allowed: Fix this by
forcing deferred rcuog wakeup through the NOCB timer when the CPU is
offline"
* tag 'rcu.fixes.6.12-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux:
rcu/nocb: Fix rcuog wake-up from offline softirq
Change the LSM data stored in the audit transactions from a secid
to an LSM prop. This is done in struct audit_context and struct
audit_aux_data_pids. Several cases of scaffolding can be removed.
Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
[PM: subj line tweak]
Signed-off-by: Paul Moore <paul@paul-moore.com>
Replace the osid field in the audit_names structure with a
lsm_prop structure. This accommodates the use of an lsm_prop in
security_audit_rule_match() and security_inode_getsecid().
Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
[PM: subj line tweak]
Signed-off-by: Paul Moore <paul@paul-moore.com>
Change the security_inode_getsecid() interface to fill in a
lsm_prop structure instead of a u32 secid. This allows for its
callers to gather data from all registered LSMs. Data is provided
for IMA and audit. Change the name to security_inode_getlsmprop().
Cc: linux-integrity@vger.kernel.org
Cc: selinux@vger.kernel.org
Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
[PM: subj line tweak]
Signed-off-by: Paul Moore <paul@paul-moore.com>
The audit process LSM information is changed from a secid audit_sig_sid
to an lsm_prop in audit_sig_lsm. Update the users of this data
appropriately. Calls to security_secid_to_secctx() are changed to use
security_lsmprop_to_secctx() instead. security_current_getsecid_subj()
is scaffolded. It will be updated in a subsequent patch.
Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
[PM: subject line tweak]
Signed-off-by: Paul Moore <paul@paul-moore.com>
Replace the secid value stored in struct audit_context with a struct
lsm_prop. Change the code that uses this value to accommodate the
change. security_audit_rule_match() expects a lsm_prop, so existing
scaffolding can be removed. A call to security_secid_to_secctx()
is changed to security_lsmprop_to_secctx(). The call to
security_ipc_getsecid() is scaffolded.
A new function lsmprop_is_set() is introduced to identify whether
an lsm_prop contains a non-zero value.
Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
[PM: subject line tweak, fix lsmprop_is_set() typo]
Signed-off-by: Paul Moore <paul@paul-moore.com>
Change the secid parameter of security_audit_rule_match
to a lsm_prop structure pointer. Pass the entry from the
lsm_prop structure for the approprite slot to the LSM hook.
Change the users of security_audit_rule_match to use the
lsm_prop instead of a u32. The scaffolding function lsmprop_init()
fills the structure with the value of the old secid, ensuring that
it is available to the appropriate module hook. The sources of
the secid, security_task_getsecid() and security_inode_getsecid(),
will be converted to use the lsm_prop structure later in the series.
At that point the use of lsmprop_init() is dropped.
Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
[PM: subject line tweak]
Signed-off-by: Paul Moore <paul@paul-moore.com>
Currently, the CONFIG_CSD_LOCK_WAIT_DEBUG code uses sched_clock() to check
for excessive CSD-lock wait times. This works, but does not guarantee
monotonic timestamps on x86 due to the sched_clock() function's use of
the rdtsc instruction, which does not guarantee ordering. This means
that, given successive calls to sched_clock(), the second might return
an earlier time than the second, that is, time might seem to go backwards.
This can (and does!) result in false-positive CSD-lock wait complaints
claiming almost 2^64 nanoseconds of delay.
Therefore, switch from sched_clock() to ktime_get_mono_fast_ns(), which
does guarantee monotonic timestamps via the rdtsc_ordered() function,
which as the name implies, does guarantee ordered timestamps, at least
in the absence of calls from NMI handlers, which are not involved in
this code path.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Rik van Riel <riel@surriel.com>
Cc: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
Cc: Leonardo Bras <leobras@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Commit 2e0199df25 ("sched/fair: Prepare exit/cleanup paths for delayed_dequeue")
and its follow up fixes try to deal with a rather unfortunate
situation where is task is enqueued in a new class, even though it
shouldn't have been. Mostly because the existing ->switched_to/from()
hooks are in the wrong place for this case.
This all led to Paul being able to trigger failures at something like
once per 10k CPU hours of RCU torture.
For now, do the ugly thing and move the code to the right place by
ignoring the switch hooks.
Note: Clean up the whole sched_class::switch*_{to,from}() thing.
Fixes: 2e0199df25 ("sched/fair: Prepare exit/cleanup paths for delayed_dequeue")
Reported-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20241003185037.GA5594@noisy.programming.kicks-ass.net
With KASAN and PREEMPT_RT enabled, calling task_work_add() in
task_tick_mm_cid() may cause the following splat.
[ 63.696416] BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:48
[ 63.696416] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 610, name: modprobe
[ 63.696416] preempt_count: 10001, expected: 0
[ 63.696416] RCU nest depth: 1, expected: 1
This problem is caused by the following call trace.
sched_tick() [ acquire rq->__lock ]
-> task_tick_mm_cid()
-> task_work_add()
-> __kasan_record_aux_stack()
-> kasan_save_stack()
-> stack_depot_save_flags()
-> alloc_pages_mpol_noprof()
-> __alloc_pages_noprof()
-> get_page_from_freelist()
-> rmqueue()
-> rmqueue_pcplist()
-> __rmqueue_pcplist()
-> rmqueue_bulk()
-> rt_spin_lock()
The rq lock is a raw_spinlock_t. We can't sleep while holding
it. IOW, we can't call alloc_pages() in stack_depot_save_flags().
The task_tick_mm_cid() function with its task_work_add() call was
introduced by commit 223baf9d17 ("sched: Fix performance regression
introduced by mm_cid") in v6.4 kernel.
Fortunately, there is a kasan_record_aux_stack_noalloc() variant that
calls stack_depot_save_flags() while not allowing it to allocate
new pages. To allow task_tick_mm_cid() to use task_work without
page allocation, a new TWAF_NO_ALLOC flag is added to enable calling
kasan_record_aux_stack_noalloc() instead of kasan_record_aux_stack()
if set. The task_tick_mm_cid() function is modified to add this new flag.
The possible downside is the missing stack trace in a KASAN report due
to new page allocation required when task_work_add_noallloc() is called
which should be rare.
Fixes: 223baf9d17 ("sched: Fix performance regression introduced by mm_cid")
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20241010014432.194742-1-longman@redhat.com
Previously when retrieving `bpf_link_info.uprobe_multi` with `path` and
`path_size` fields unset, the `path_size` field is not populated
(remains 0). This behavior was inconsistent with how other input/output
string buffer fields work, as the field should be populated in cases
when:
- both buffer and length are set (currently works as expected)
- both buffer and length are unset (not working as expected)
This patch now fills the `path_size` field when `path` and `path_size`
are unset.
Fixes: e56fdbfb06 ("bpf: Add link_info support for uprobe multi link")
Signed-off-by: Tyrone Wu <wudevelops@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20241011000803.681190-1-wudevelops@gmail.com
Iterating with scx_task_iter involves scx_tasks_lock and optionally the rq
lock of the task being iterated. Both locks can be released during iteration
and the iteration can be continued after re-grabbing scx_tasks_lock.
Currently, all lock handling is pushed to the caller which is a bit
cumbersome and makes it difficult to add lock-aware behaviors. Make the
scx_task_iter helpers handle scx_tasks_lock.
- scx_task_iter_init/scx_taks_iter_exit() now grabs and releases
scx_task_lock, respectively. Renamed to
scx_task_iter_start/scx_task_iter_stop() to more clearly indicate that
there are non-trivial side-effects.
- Add __ prefix to scx_task_iter_rq_unlock() to indicate that the function
is internal.
- Add scx_task_iter_unlock/relock(). The former drops both rq lock (if held)
and scx_tasks_lock and the latter re-locks only scx_tasks_lock.
This doesn't cause behavior changes and will be used to implement stall
avoidance.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
Bypass mode was depending on ops.select_cpu() which can't be trusted as with
the rest of the BPF scheduler. Always enable and use scx_select_cpu_dfl() in
bypass mode.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
Move the sanity check from the inner function scx_select_cpu_dfl() to the
exported kfunc scx_bpf_select_cpu_dfl(). This doesn't cause behavior
differences and will allow using scx_select_cpu_dfl() in bypass mode
regardless of scx_builtin_idle_enabled.
Signed-off-by: Tejun Heo <tj@kernel.org>
The disable path caps p->scx.slice to SCX_SLICE_DFL. As the field is already
being ignored at this stage during disable, the only effect this has is that
when the next BPF scheduler is loaded, it won't see unreasonable left-over
slices. Ultimately, this shouldn't matter but it's better to start in a
known state. Drop p->scx.slice capping from the disable path and instead
reset it to SCX_SLICE_DFL in the enable path.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
This reverts commit 6f34d8d382.
Slice length is ignored while bypassing and tasks are switched on every tick
and thus the patch does not make any difference. The perceived difference
was from test noise.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
Pull tracing fix from Steven Rostedt:
"Ring-buffer fix: do not have boot-mapped buffers use CPU hotplug
callbacks
When a ring buffer is mapped to memory assigned at boot, it also
splits it up evenly between the possible CPUs. But the allocation code
still attached a CPU notifier callback to this ring buffer. When a CPU
is added, the callback will happen and another per-cpu buffer is
created for the ring buffer.
But for boot mapped buffers, there is no room to add another one (as
they were all created already). The result of calling the CPU hotplug
notifier on a boot mapped ring buffer is unpredictable and could lead
to a system crash.
If the ring buffer is boot mapped simply do not attach the CPU
notifier to it"
* tag 'trace-ringbuffer-v6.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
ring-buffer: Do not have boot mapped buffers hook to CPU hotplug
The verifier contains a cache for looking up module BTF objects when
calling kfuncs defined in modules. This cache uses a 'struct
bpf_kfunc_btf_tab', which contains a sorted list of BTF objects that
were already seen in the current verifier run, and the BTF objects are
looked up by the offset stored in the relocated call instruction using
bsearch().
The first time a given offset is seen, the module BTF is loaded from the
file descriptor passed in by libbpf, and stored into the cache. However,
there's a bug in the code storing the new entry: it stores a pointer to
the new cache entry, then calls sort() to keep the cache sorted for the
next lookup using bsearch(), and then returns the entry that was just
stored through the stored pointer. However, because sort() modifies the
list of entries in place *by value*, the stored pointer may no longer
point to the right entry, in which case the wrong BTF object will be
returned.
The end result of this is an intermittent bug where, if a BPF program
calls two functions with the same signature in two different modules,
the function from the wrong module may sometimes end up being called.
Whether this happens depends on the order of the calls in the BPF
program (as that affects whether sort() reorders the array of BTF
objects), making it especially hard to track down. Simon, credited as
reporter below, spent significant effort analysing and creating a
reproducer for this issue. The reproducer is added as a selftest in a
subsequent patch.
The fix is straight forward: simply don't use the stored pointer after
calling sort(). Since we already have an on-stack pointer to the BTF
object itself at the point where the function return, just use that, and
populate it from the cache entry in the branch where the lookup
succeeds.
Fixes: 2357672c54 ("bpf: Introduce BPF support for kernel module function calls")
Reported-by: Simon Sundberg <simon.sundberg@kau.se>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/r/20241010-fix-kfunc-btf-caching-for-modules-v2-1-745af6c1af98@redhat.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
After a CPU has set itself offline and before it eventually calls
rcutree_report_cpu_dead(), there are still opportunities for callbacks
to be enqueued, for example from a softirq. When that happens on NOCB,
the rcuog wake-up is deferred through an IPI to an online CPU in order
not to call into the scheduler and risk arming the RT-bandwidth after
hrtimers have been migrated out and disabled.
But performing a synchronized IPI from a softirq is buggy as reported in
the following scenario:
WARNING: CPU: 1 PID: 26 at kernel/smp.c:633 smp_call_function_single
Modules linked in: rcutorture torture
CPU: 1 UID: 0 PID: 26 Comm: migration/1 Not tainted 6.11.0-rc1-00012-g9139f93209d1 #1
Stopper: multi_cpu_stop+0x0/0x320 <- __stop_cpus+0xd0/0x120
RIP: 0010:smp_call_function_single
<IRQ>
swake_up_one_online
__call_rcu_nocb_wake
__call_rcu_common
? rcu_torture_one_read
call_timer_fn
__run_timers
run_timer_softirq
handle_softirqs
irq_exit_rcu
? tick_handle_periodic
sysvec_apic_timer_interrupt
</IRQ>
Fix this with forcing deferred rcuog wake up through the NOCB timer when
the CPU is offline. The actual wake up will happen from
rcutree_report_cpu_dead().
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202409231644.4c55582d-lkp@intel.com
Fixes: 9139f93209 ("rcu/nocb: Fix RT throttling hrtimer armed from offline CPU")
Reviewed-by: "Joel Fernandes (Google)" <joel@joelfernandes.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
pick_next_task_scx() was turned into pick_task_scx() since
commit 753e2836d1 ("sched_ext: Unify regular and core-sched pick
task paths"). Update the outdated message.
Signed-off-by: Honglei Wang <jameshongleiwang@126.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The mgtime_floor value is a global variable for tracking the latest
fine-grained timestamp handed out. Because it's a global, track the
number of times that a new floor value is assigned.
Add a new percpu counter to the timekeeping code to track the number of
floor swap events that have occurred. A later patch will add a debugfs
file to display this counter alongside other stats involving multigrain
timestamps.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Randy Dunlap <rdunlap@infradead.org> # documentation bits
Link: https://lore.kernel.org/all/20241002-mgtime-v10-2-d1c4717f5284@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
Multigrain timestamps allow the kernel to use fine-grained timestamps when
an inode's attributes is being actively observed via ->getattr(). With
this support, it's possible for a file to get a fine-grained timestamp, and
another modified after it to get a coarse-grained stamp that is earlier
than the fine-grained time. If this happens then the files can appear to
have been modified in reverse order, which breaks VFS ordering guarantees
[1].
To prevent this, maintain a floor value for multigrain timestamps.
Whenever a fine-grained timestamp is handed out, record it, and when later
coarse-grained stamps are handed out, ensure they are not earlier than that
value. If the coarse-grained timestamp is earlier than the fine-grained
floor, return the floor value instead.
Add a static singleton atomic64_t into timekeeper.c that is used to keep
track of the latest fine-grained time ever handed out. This is tracked as a
monotonic ktime_t value to ensure that it isn't affected by clock
jumps. Because it is updated at different times than the rest of the
timekeeper object, the floor value is managed independently of the
timekeeper via a cmpxchg() operation, and sits on its own cacheline.
Add two new public interfaces:
- ktime_get_coarse_real_ts64_mg() fills a timespec64 with the later of the
coarse-grained clock and the floor time
- ktime_get_real_ts64_mg() gets the fine-grained clock value, and tries
to swap it into the floor. A timespec64 is filled with the result.
The floor value is global and updated via a single try_cmpxchg(). If
that fails then the operation raced with a concurrent update. Any
concurrent update must be later than the existing floor value, so any
racing tasks can accept any resulting floor value without retrying.
[1]: POSIX requires that files be stamped with realtime clock values, and
makes no provision for dealing with backward clock jumps. If a backward
realtime clock jump occurs, then files can appear to have been modified
in reverse order.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Randy Dunlap <rdunlap@infradead.org> # documentation bits
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/all/20241002-mgtime-v10-1-d1c4717f5284@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
Previously when retrieving `bpf_link_info.perf_event` for
kprobe/uprobe/tracepoint, the `name_len` field was not populated by the
kernel, leaving it to reflect the value initially set by the user. This
behavior was inconsistent with how other input/output string buffer
fields function (e.g. `raw_tracepoint.tp_name_len`).
This patch fills `name_len` with the actual size of the string name.
Fixes: 1b715e1b0e ("bpf: Support ->fill_link_info for perf_event")
Signed-off-by: Tyrone Wu <wudevelops@gmail.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Yafang Shao <laoar.shao@gmail.com>
Link: https://lore.kernel.org/r/20241008164312.46269-1-wudevelops@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The kzmalloc call in bpf_check can fail when memory is very fragmented,
which in turn can lead to an OOM kill.
Use kvzmalloc to fall back to vmalloc when memory is too fragmented to
allocate an order 3 sized bpf verifier environment.
Admittedly this is not a very common case, and only happens on systems
where memory has already been squeezed close to the limit, but this does
not seem like much of a hot path, and it's a simple enough fix.
Signed-off-by: Rik van Riel <riel@surriel.com>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Link: https://lore.kernel.org/r/20241008170735.16766766@imladris.surriel.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
When a tracepoint event is created with attr.freq = 1,
'hwc->period_left' is not initialized correctly. As a result,
in the perf_swevent_overflow() function, when the first time the event occurs,
it calculates the event overflow and the perf_swevent_set_period() returns 3,
this leads to the event are recorded for three duplicate times.
Step to reproduce:
1. Enable the tracepoint event & starting tracing
$ echo 1 > /sys/kernel/tracing/events/module/module_free
$ echo 1 > /sys/kernel/tracing/tracing_on
2. Record with perf
$ perf record -a --strict-freq -F 1 -e "module:module_free"
3. Trigger module_free event.
$ modprobe -i sunrpc
$ modprobe -r sunrpc
Result:
- Trace pipe result:
$ cat trace_pipe
modprobe-174509 [003] ..... 6504.868896: module_free: sunrpc
- perf sample:
modprobe 174509 [003] 6504.868980: module:module_free: sunrpc
modprobe 174509 [003] 6504.868980: module:module_free: sunrpc
modprobe 174509 [003] 6504.868980: module:module_free: sunrpc
By setting period_left via perf_swevent_set_period() as other sw_event did,
This problem could be solved.
After patch:
- Trace pipe result:
$ cat trace_pipe
modprobe 1153096 [068] 613468.867774: module:module_free: xfs
- perf sample
modprobe 1153096 [068] 613468.867794: module:module_free: xfs
Link: https://lore.kernel.org/20240913021347.595330-1-yeoreum.yun@arm.com
Fixes: bd2b5b1284 ("perf_counter: More aggressive frequency adjustment")
Signed-off-by: Levi Yun <yeoreum.yun@arm.com>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
When trying to repeat the btf fields for array of nested struct, it
doesn't check the remaining info_cnt. The following splat will be
reported when the value of ret * nelems is greater than BTF_FIELDS_MAX:
------------[ cut here ]------------
UBSAN: array-index-out-of-bounds in ../kernel/bpf/btf.c:3951:49
index 11 is out of range for type 'btf_field_info [11]'
CPU: 6 UID: 0 PID: 411 Comm: test_progs ...... 6.11.0-rc4+ #1
Tainted: [O]=OOT_MODULE
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ...
Call Trace:
<TASK>
dump_stack_lvl+0x57/0x70
dump_stack+0x10/0x20
ubsan_epilogue+0x9/0x40
__ubsan_handle_out_of_bounds+0x6f/0x80
? kallsyms_lookup_name+0x48/0xb0
btf_parse_fields+0x992/0xce0
map_create+0x591/0x770
__sys_bpf+0x229/0x2410
__x64_sys_bpf+0x1f/0x30
x64_sys_call+0x199/0x9f0
do_syscall_64+0x3b/0xc0
entry_SYSCALL_64_after_hwframe+0x4b/0x53
RIP: 0033:0x7fea56f2cc5d
......
</TASK>
---[ end trace ]---
Fix it by checking the remaining info_cnt in btf_repeat_fields() before
repeating the btf fields.
Fixes: 64e8ee8148 ("bpf: look into the types of the fields of a struct type recursively.")
Signed-off-by: Hou Tao <houtao1@huawei.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20241008071114.3718177-2-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Pull misc fixes from Andrew Morton:
"12 hotfixes, 5 of which are c:stable. All singletons, about half of
which are MM"
* tag 'mm-hotfixes-stable-2024-10-09-15-46' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
mm: zswap: delete comments for "value" member of 'struct zswap_entry'.
CREDITS: sort alphabetically by name
secretmem: disable memfd_secret() if arch cannot set direct map
.mailmap: update Fangrui's email
mm/huge_memory: check pmd_special() only after pmd_present()
resource, kunit: fix user-after-free in resource_test_region_intersects()
fs/proc/kcore.c: allow translation of physical memory addresses
selftests/mm: fix incorrect buffer->mirror size in hmm2 double_map test
device-dax: correct pgoff align in dax_set_mapping()
kthread: unpark only parked kthread
Revert "mm: introduce PF_MEMALLOC_NORECLAIM, PF_MEMALLOC_NOWARN"
bcachefs: do not use PF_MEMALLOC_NORECLAIM
Calling into kthread unparking unconditionally is mostly harmless when
the kthread is already unparked. The wake up is then simply ignored
because the target is not in TASK_PARKED state.
However if the kthread is per CPU, the wake up is preceded by a call
to kthread_bind() which expects the task to be inactive and in
TASK_PARKED state, which obviously isn't the case if it is unparked.
As a result, calling kthread_stop() on an unparked per-cpu kthread
triggers such a warning:
WARNING: CPU: 0 PID: 11 at kernel/kthread.c:525 __kthread_bind_mask kernel/kthread.c:525
<TASK>
kthread_stop+0x17a/0x630 kernel/kthread.c:707
destroy_workqueue+0x136/0xc40 kernel/workqueue.c:5810
wg_destruct+0x1e2/0x2e0 drivers/net/wireguard/device.c:257
netdev_run_todo+0xe1a/0x1000 net/core/dev.c:10693
default_device_exit_batch+0xa14/0xa90 net/core/dev.c:11769
ops_exit_list net/core/net_namespace.c:178 [inline]
cleanup_net+0x89d/0xcc0 net/core/net_namespace.c:640
process_one_work kernel/workqueue.c:3231 [inline]
process_scheduled_works+0xa2c/0x1830 kernel/workqueue.c:3312
worker_thread+0x86d/0xd70 kernel/workqueue.c:3393
kthread+0x2f0/0x390 kernel/kthread.c:389
ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244
</TASK>
Fix this with skipping unecessary unparking while stopping a kthread.
Link: https://lkml.kernel.org/r/20240913214634.12557-1-frederic@kernel.org
Fixes: 5c25b5ff89 ("workqueue: Tag bound workers with KTHREAD_IS_PER_CPU")
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reported-by: syzbot+943d34fa3cf2191e3068@syzkaller.appspotmail.com
Tested-by: syzbot+943d34fa3cf2191e3068@syzkaller.appspotmail.com
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The boot mapped ring buffer has its buffer mapped at a fixed location
found at boot up. It is not dynamic. It cannot grow or be expanded when
new CPUs come online.
Do not hook fixed memory mapped ring buffers to the CPU hotplug callback,
otherwise it can cause a crash when it tries to add the buffer to the
memory that is already fully occupied.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20241008143242.25e20801@gandalf.local.home
Fixes: be68d63a13 ("ring-buffer: Add ring_buffer_alloc_range()")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
When using mutex_acquire_nest() with a nest_lock, lockdep refcounts the
number of acquired lockdep_maps of mutexes of the same class, and also
keeps a pointer to the first acquired lockdep_map of a class. That pointer
is then used for various comparison-, printing- and checking purposes,
but there is no mechanism to actively ensure that lockdep_map stays in
memory. Instead, a warning is printed if the lockdep_map is freed and
there are still held locks of the same lock class, even if the lockdep_map
itself has been released.
In the context of WW/WD transactions that means that if a user unlocks
and frees a ww_mutex from within an ongoing ww transaction, and that
mutex happens to be the first ww_mutex grabbed in the transaction,
such a warning is printed and there might be a risk of a UAF.
Note that this is only problem when lockdep is enabled and affects only
dereferences of struct lockdep_map.
Adjust to this by adding a fake lockdep_map to the acquired context and
make sure it is the first acquired lockdep map of the associated
ww_mutex class. Then hold it for the duration of the WW/WD transaction.
This has the side effect that trying to lock a ww mutex *without* a
ww_acquire_context but where a such context has been acquire, we'd see
a lockdep splat. The test-ww_mutex.c selftest attempts to do that, so
modify that particular test to not acquire a ww_acquire_context if it
is not going to be used.
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20241009092031.6356-1-thomas.hellstrom@linux.intel.com