Commit Graph

1331538 Commits

Author SHA1 Message Date
Tejun Heo
f2c880fc81 tools/sched_ext: Sync with scx repo
Synchronize with https://github.com/sched-ext/scx at d384453984a0 ("kernel:
Sync at ad3b301aa0 ("sched_ext: Provides a sysfs 'events' to expose core
event counters")").

Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-14 08:46:20 -10:00
Changwoo Min
ad3b301aa0 sched_ext: Provides a sysfs 'events' to expose core event counters
Add a sysfs entry at /sys/kernel/sched_ext/root/events to expose core
event counters through the files system interface. Each line of the file
shows the event name and its counter value.

In addition, the format of scx_dump_event() is adjusted as the event name
gets longer.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-14 07:15:41 -10:00
Tejun Heo
3539c6411a sched_ext: Implement SCX_OPS_ALLOW_QUEUED_WAKEUP
A task wakeup can be either processed on the waker's CPU or bounced to the
wakee's previous CPU using an IPI (ttwu_queue). Bouncing to the wakee's CPU
avoids the waker's CPU locking and accessing the wakee's rq which can be
expensive across cache and node boundaries.

When ttwu_queue path is taken, select_task_rq() and thus ops.select_cpu()
may be skipped in some cases (racing against the wakee switching out). As
this confused some BPF schedulers, there wasn't a good way for a BPF
scheduler to tell whether idle CPU selection has been skipped, ops.enqueue()
couldn't insert tasks into foreign local DSQs, and the performance
difference on machines with simple toplogies were minimal, sched_ext
disabled ttwu_queue.

However, this optimization makes noticeable difference on more complex
topologies and a BPF scheduler now has an easy way tell whether
ops.select_cpu() was skipped since 9b671793c7 ("sched_ext, scx_qmap: Add
and use SCX_ENQ_CPU_SELECTED") and can insert tasks into foreign local DSQs
since 5b26f7b920 ("sched_ext: Allow SCX_DSQ_LOCAL_ON for direct
dispatches").

Implement SCX_OPS_ALLOW_QUEUED_WAKEUP which allows BPF schedulers to choose
to enable ttwu_queue optimization.

v2: Update the patch description and comment re. ops.select_cpu() being
    skipped in some cases as opposed to always as per Neel.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Neel Natu <neelnatu@google.com>
Reported-by: Barret Rhoden <brho@google.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
2025-02-13 07:30:46 -10:00
Tejun Heo
78e4690de4 Merge branch 'for-6.14-fixes' into for-6.15
Pull to receive f3f08c3acf ("sched_ext: Fix incorrect assumption about
migration disabled tasks in task_can_run_on_remote_rq()") which conflicts
with 26176116d9 ("sched_ext: Count SCX_EV_DISPATCH_LOCAL_DSQ_OFFLINE in
the right spot") in for-6.15.
2025-02-10 10:45:43 -10:00
Tejun Heo
f3f08c3acf sched_ext: Fix incorrect assumption about migration disabled tasks in task_can_run_on_remote_rq()
While fixing migration disabled task handling, 3296682157 ("sched_ext: Fix
migration disabled handling in targeted dispatches") assumed that a
migration disabled task's ->cpus_ptr would only have the pinned CPU. While
this is eventually true for migration disabled tasks that are switched out,
->cpus_ptr update is performed by migrate_disable_switch() which is called
right before context_switch() in __scheduler(). However, the task is
enqueued earlier during pick_next_task() via put_prev_task_scx(), so there
is a race window where another CPU can see the task on a DSQ.

If the CPU tries to dispatch the migration disabled task while in that
window, task_allowed_on_cpu() will succeed and task_can_run_on_remote_rq()
will subsequently trigger SCHED_WARN(is_migration_disabled()).

  WARNING: CPU: 8 PID: 1837 at kernel/sched/ext.c:2466 task_can_run_on_remote_rq+0x12e/0x140
  Sched_ext: layered (enabled+all), task: runnable_at=-10ms
  RIP: 0010:task_can_run_on_remote_rq+0x12e/0x140
  ...
   <TASK>
   consume_dispatch_q+0xab/0x220
   scx_bpf_dsq_move_to_local+0x58/0xd0
   bpf_prog_84dd17b0654b6cf0_layered_dispatch+0x290/0x1cfa
   bpf__sched_ext_ops_dispatch+0x4b/0xab
   balance_one+0x1fe/0x3b0
   balance_scx+0x61/0x1d0
   prev_balance+0x46/0xc0
   __pick_next_task+0x73/0x1c0
   __schedule+0x206/0x1730
   schedule+0x3a/0x160
   __do_sys_sched_yield+0xe/0x20
   do_syscall_64+0xbb/0x1e0
   entry_SYSCALL_64_after_hwframe+0x77/0x7f

Fix it by converting the SCHED_WARN() back to a regular failure path. Also,
perform the migration disabled test before task_allowed_on_cpu() test so
that BPF schedulers which fail to handle migration disabled tasks can be
noticed easily.

While at it, adjust scx_ops_error() message for !task_allowed_on_cpu() case
for brevity and consistency.

Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 3296682157 ("sched_ext: Fix migration disabled handling in targeted dispatches")
Acked-by: Andrea Righi <arighi@nvidia.com>
Reported-by: Jake Hillion <jakehillion@meta.com>
2025-02-10 10:40:47 -10:00
Changwoo Min
2e7df12bdd tools/sched_ext: Update enum_defs.autogen.h
Add where the script is located to the comment lines of the header file.
This helps anyone re-generate the header file if required.

Note that this is a sync from the PR [1] in the scx repo.

  [1] https://github.com/sched-ext/scx/pull/1322

Signed-off-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-10 07:22:42 -10:00
Li RongQing
1a4e0d8682 sched_ext: Take NUMA node into account when allocating per-CPU cpumasks
per-CPU cpumasks are dominantly accessed from their own local CPUs,
so allocate them node-local to improve performance.

Signed-off-by: Li RongQing <lirongqing@baidu.com>
Acked-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-10 07:20:13 -10:00
Changwoo Min
372033ad9e tools/sched_ext: Compatible testing of SCX_ENQ_CPU_SELECTED
This provides compatible testing of SCX_ENQ_CPU_SELECTED.
More specifically, it handles two cases:

  1. a BPF scheduler is compiled against vmlinux.h where
  SCX_ENQ_CPU_SELECTED is defined, but it runs on a kernel that does not
  have SCX_ENQ_CPU_SELECTED. In this case, the test result of
  'enq_flags & SCX_ENQ_CPU_SELECTED' will always be false. That test result
  is semantically incorrect because the kernel before SCX_ENQ_CPU_SELECTED
  has never skipped select_task_rq_scx(), so the result should be true.

  2. a BPF scheduler is compiling against vmlinux.h where
  SCX_ENQ_CPU_SELECTED is not defined. In this case, directly using
  SCX_ENQ_CPU_SELECTED causes compilation errors.

To hide such complexity, introduce __COMPAT_is_enq_cpu_selected(),
which checks if SCX_ENQ_CPU_SELECTED exists in runtime using BPF CO-RE.
This consists of three parts:

  1. Add enum_defs.autogen.h, which has macros (HAVE_{enum name}) denoting
  whether SCX enums are defined in the vmlinux.h or not.

  2. Implement __COMPAT_is_enq_cpu_selected(), which provide the test of
  SCX_ENQ_CPU_SELECTED in a compatible way.

  3. Use  __COMPAT_is_enq_cpu_selected() in scx_qmap.

Note that this is a sync of the relevant PR [1] in the scx repo.

  [1] https://github.com/sched-ext/scx/pull/1314

Signed-off-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-08 20:42:50 -10:00
Tejun Heo
eace54dff0 sched_ext: Add SCX_EV_ENQ_SKIP_MIGRATION_DISABLED
Count the number of times a migration disabled task is automatically
dispatched to its local DSQ.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Changwoo Min <changwoo@igalia.com>
2025-02-08 20:39:11 -10:00
Tejun Heo
26176116d9 sched_ext: Count SCX_EV_DISPATCH_LOCAL_DSQ_OFFLINE in the right spot
SCX_EV_DISPATCH_LOCAL_DSQ_OFFLINE wasn't quite right in two aspects:

- It counted both migration disabled and offline events.

- It didn't count events from scx_bpf_dsq_move() path.

Fix it by moving the counting into task_can_run_on_remote_rq() which is
shared by both paths and can distinguish the different rejection conditions.
The argument @trigger_error is renamed to @enforce as it now does more than
just triggering error.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Changwoo Min <changwoo@igalia.com>
2025-02-08 20:39:11 -10:00
Tejun Heo
46a0e16158 tool/sched_ext: Event counter dumping updates
- There's no need to dump event counters from both scx_qmap and scx_central.
  Drop counter dumping from scx_central.

- bpf_printk() implies a trailing new line and the explicit new line leads
  to double new lines. Drop the explicit new lines.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Changwoo Min <changwoo@igalia.com>
2025-02-08 20:39:11 -10:00
Tejun Heo
29ef4a2fcf Merge branch 'for-6.14-fixes' into for-6.15
Pull to receive:

- 2fa0fbeb69 ("sched_ext: Implement auto local dispatching of migration disabled tasks")
- 3296682157 ("sched_ext: Fix migration disabled handling in targeted dispatches")

as planned for-6.15 changes depend on them (e.g. adding event counter for
implicit migration disabled task handling).
2025-02-08 20:34:43 -10:00
Tejun Heo
3296682157 sched_ext: Fix migration disabled handling in targeted dispatches
A dispatch operation that can target a specific local DSQ -
scx_bpf_dsq_move_to_local() or scx_bpf_dsq_move() - checks whether the task
can be migrated to the target CPU using task_can_run_on_remote_rq(). If the
task can't be migrated to the targeted CPU, it is bounced through a global
DSQ.

task_can_run_on_remote_rq() assumes that the task is on a CPU that's
different from the targeted CPU but the callers doesn't uphold the
assumption and may call the function when the task is already on the target
CPU. When such task has migration disabled, task_can_run_on_remote_rq() ends
up returning %false incorrectly unnecessarily bouncing the task to a global
DSQ.

Fix it by updating the callers to only call task_can_run_on_remote_rq() when
the task is on a different CPU than the target CPU. As this is a bit subtle,
for clarity and documentation:

- Make task_can_run_on_remote_rq() trigger SCHED_WARN_ON() if the task is on
  the same CPU as the target CPU.

- is_migration_disabled() test in task_can_run_on_remote_rq() cannot trigger
  if the task is on a different CPU than the target CPU as the preceding
  task_allowed_on_cpu() test should fail beforehand. Convert the test into
  SCHED_WARN_ON().

Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 4c30f5ce4f ("sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq()")
Fixes: 0366017e09 ("sched_ext: Use task_can_run_on_remote_rq() test in dispatch_to_local_dsq()")
Cc: stable@vger.kernel.org # v6.12+
2025-02-08 20:33:25 -10:00
Tejun Heo
2fa0fbeb69 sched_ext: Implement auto local dispatching of migration disabled tasks
Migration disabled tasks are special and pinned to their previous CPUs. They
tripped up some unsuspecting BPF schedulers as their ->nr_cpus_allowed may
not agree with the bits set in ->cpus_ptr. Make it easier for BPF schedulers
by automatically dispatching them to the pinned local DSQs by default. If a
BPF scheduler wants to handle migration disabled tasks explicitly, it can
set SCX_OPS_ENQ_MIGRATION_DISABLED.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
2025-02-08 20:32:54 -10:00
Changwoo Min
38d65cd692 sched_ext: Print an event, SCX_EV_ENQ_SLICE_DFL, in scx_qmap/central
Modify the scx_qmap and scx_celtral schedulers
to print the SCX_EV_ENQ_SLICE_DFL event every second.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-07 11:24:59 -10:00
Changwoo Min
6d3f8fb4b2 sched_ext: Add an event, SCX_EV_ENQ_SLICE_DFL
Add a core event, SCX_EV_ENQ_SLICE_DFL, which represents how many
tasks have been enqueued (or pick_task-ed or select_cpu-ed) with
a default time slice (SCX_SLICE_DFL).

Scheduling a task with SCX_SLICE_DFL unintentionally would be a source
of latency spikes because SCX_SLICE_DFL is relatively long (20 msec).
Thus, soaring the SCX_EV_ENQ_SLICE_DFL value would be a sign of BPF
scheduler bugs, causing latency spikes, especially when ops.select_cpu()
is provided.

__scx_add_event() is used since the caller holds an rq lock or p->pi_lock,
so the preemption has already been disabled.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-07 11:24:59 -10:00
Changwoo Min
2494e555fb sched_ext: Print core event count in scx_qmap scheduler
Modify the scx_qmap scheduler to print the core event counter
every second.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-04 10:36:47 -10:00
Changwoo Min
6df93804b7 sched_ext: Print core event count in scx_central scheduler
Modify the scx_central scheduler to print the core event counter
every second.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-04 10:36:47 -10:00
Changwoo Min
9865f31d85 sched_ext: Add scx_bpf_events() and scx_read_event() for BPF schedulers
scx_bpf_events() is added to the header files so the BPF scheduler
can use it. Also, scx_read_event() is added to read an event type in a
compatible way.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-04 10:36:47 -10:00
Changwoo Min
d46457c31c sched_ext: Add an event, SCX_EV_BYPASS_DURATION
Add a core event, SCX_EV_BYPASS_DURATION, which represents the
total duration of bypass modes in nanoseconds.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-04 10:36:47 -10:00
Changwoo Min
5c605cd33c sched_ext: Add an event, SCX_EV_BYPASS_DISPATCH
Add a core event, SCX_EV_BYPASS_DISPATCH, which represents how many
tasks have been dispatched in the bypass mode.

__scx_add_event() is used since the caller holds an rq lock or
p->pi_lock, so the preemption has already been disabled.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-04 10:36:47 -10:00
Changwoo Min
4f7a38c7c9 sched_ext: Add an event, SCX_EV_BYPASS_ACTIVATE
Add a core event, SCX_EV_BYPASS_ACTIVATE, which represents how many
times the bypass mode has been triggered.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-04 10:36:47 -10:00
Changwoo Min
824d4f2dce sched_ext: Add an event, SCX_EV_ENQ_SKIP_EXITING
Add a core event, SCX_EV_ENQ_SKIP_EXITING, which represents how many
times a task is enqueued to a local DSQ when exiting if
SCX_OPS_ENQ_EXITING is not set.

__scx_add_event() is used since the caller holds an rq lock,
so the preemption has already been disabled.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-04 10:36:47 -10:00
Changwoo Min
029b6ce733 sched_ext: Fix incorrect time delta calculation in time_delta()
When (s64)(after - before) > 0, the code returns the result of
(s64)(after - before) > 0 while the intended result should be
(s64)(after - before). That happens because the middle operand of
the ternary operator was omitted incorrectly, returning the result of
(s64)(after - before) > 0. Thus, add the middle operand
-- (s64)(after - before) -- to return the correct time calculation.

Fixes: d07be814fc ("sched_ext: Add time helpers for BPF schedulers")
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-02 07:40:35 -10:00
Changwoo Min
7125660bc1 sched_ext: Add an event, SCX_EV_DISPATCH_KEEP_LAST
Add a core event, SCX_EV_DISPATCH_KEEP_LAST, which represents how many
times a task is continued to run without ops.enqueue() when
SCX_OPS_ENQ_LAST is not set.

__scx_add_event() is used since the caller holds an rq lock,
so the preemption has already been disabled.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-02 07:23:18 -10:00
Changwoo Min
9be0a1b0c8 sched_ext: Add an event, SCX_EV_DISPATCH_LOCAL_DSQ_OFFLINE
Add a core event, SCX_EV_DISPATCH_LOCAL_DSQ_OFFLINE, which represents how
many times a BPF scheduler tries to dispatch to an offlined local DSQ.

__scx_add_event() is used since the caller holds an rq lock,
so the preemption has already been disabled.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-02 07:23:18 -10:00
Changwoo Min
f7f6142107 sched_ext: Add an event, SCX_EV_SELECT_CPU_FALLBACK
Add a core event, SCX_EV_SELECT_CPU_FALLBACK, which represents how many times
ops.select_cpu() returns a CPU that the task can't use.

__scx_add_event() is used since the caller holds an rq lock,
so the preemption has already been disabled.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-02 07:23:18 -10:00
Changwoo Min
17103b8504 sched_ext: Implement event counter infrastructure
Collect the statistics of specific types of behavior in the sched_ext core,
which are not easily visible but still interesting to an scx scheduler.

An event type is defined in 'struct scx_event_stats.' When an event occurs,
its counter is accumulated using 'scx_add_event()' and '__scx_add_event()'
to per-CPU 'struct scx_event_stats' for efficiency. 'scx_bpf_events()'
aggregates all the per-CPU counters and exposes a system-wide counters.

For convenience and readability of the code, 'scx_agg_event()' and
'scx_dump_event()' are provided.

The collected events can be observed after a BPF scheduler is unloaded
beforea new BPF scheduler is loaded so the per-CPU 'struct scx_event_stats'
are reset.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-02 07:23:18 -10:00
Andrea Righi
337d1b354a sched_ext: Move built-in idle CPU selection policy to a separate file
As ext.c is becoming quite large, move the idle CPU selection policy to
separate files (ext_idle.c / ext_idle.h) for better code readability.

Moreover, group together all the idle CPU selection kfunc's to the same
btf_kfunc_id_set block.

No functional changes, this is purely code reorganization.

Suggested-by: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-01-27 12:43:43 -10:00
Andrea Righi
1626e5ef0b sched_ext: Fix lock imbalance in dispatch_to_local_dsq()
While performing the rq locking dance in dispatch_to_local_dsq(), we may
trigger the following lock imbalance condition, in particular when
multiple tasks are rapidly changing CPU affinity (i.e., running a
`stress-ng --race-sched 0`):

[   13.413579] =====================================
[   13.413660] WARNING: bad unlock balance detected!
[   13.413729] 6.13.0-virtme #15 Not tainted
[   13.413792] -------------------------------------
[   13.413859] kworker/1:1/80 is trying to release lock (&rq->__lock) at:
[   13.413954] [<ffffffff873c6c48>] dispatch_to_local_dsq+0x108/0x1a0
[   13.414111] but there are no more locks to release!
[   13.414176]
[   13.414176] other info that might help us debug this:
[   13.414258] 1 lock held by kworker/1:1/80:
[   13.414318]  #0: ffff8b66feb41698 (&rq->__lock){-.-.}-{2:2}, at: raw_spin_rq_lock_nested+0x20/0x90
[   13.414612]
[   13.414612] stack backtrace:
[   13.415255] CPU: 1 UID: 0 PID: 80 Comm: kworker/1:1 Not tainted 6.13.0-virtme #15
[   13.415505] Workqueue:  0x0 (events)
[   13.415567] Sched_ext: dsp_local_on (enabled+all), task: runnable_at=-2ms
[   13.415570] Call Trace:
[   13.415700]  <TASK>
[   13.415744]  dump_stack_lvl+0x78/0xe0
[   13.415806]  ? dispatch_to_local_dsq+0x108/0x1a0
[   13.415884]  print_unlock_imbalance_bug+0x11b/0x130
[   13.415965]  ? dispatch_to_local_dsq+0x108/0x1a0
[   13.416226]  lock_release+0x231/0x2c0
[   13.416326]  _raw_spin_unlock+0x1b/0x40
[   13.416422]  dispatch_to_local_dsq+0x108/0x1a0
[   13.416554]  flush_dispatch_buf+0x199/0x1d0
[   13.416652]  balance_one+0x194/0x370
[   13.416751]  balance_scx+0x61/0x1e0
[   13.416848]  prev_balance+0x43/0xb0
[   13.416947]  __pick_next_task+0x6b/0x1b0
[   13.417052]  __schedule+0x20d/0x1740

This happens because dispatch_to_local_dsq() is racing with
dispatch_dequeue() and, when the latter wins, we incorrectly assume that
the task has been moved to dst_rq.

Fix by properly tracking the currently locked rq.

Fixes: 4d3ca89bdd ("sched_ext: Refactor consume_remote_task()")
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-01-27 12:41:12 -10:00
Andrea Righi
3c7d51b0d2 sched_ext: selftests/dsp_local_on: Fix selftest on UP systems
In UP systems p->migration_disabled is not available. Fix this by using
the portable helper is_migration_disabled(p).

Fixes: e9fe182772 ("sched_ext: selftests/dsp_local_on: Fix sporadic failures")
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-01-27 09:00:09 -10:00
Andrea Righi
5f52bbf2f6 tools/sched_ext: Add helper to check task migration state
Introduce a new helper for BPF schedulers to determine whether a task
can migrate or not (supporting both SMP and UP systems).

Fixes: e9fe182772 ("sched_ext: selftests/dsp_local_on: Fix sporadic failures")
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-01-27 08:59:55 -10:00
Tejun Heo
d6f3e7d564 sched_ext: Fix incorrect autogroup migration detection
scx_move_task() is called from sched_move_task() and tells the BPF scheduler
that cgroup migration is being committed. sched_move_task() is used by both
cgroup and autogroup migrations and scx_move_task() tried to filter out
autogroup migrations by testing the destination cgroup and PF_EXITING but
this is not enough. In fact, without explicitly tagging the thread which is
doing the cgroup migration, there is no good way to tell apart
scx_move_task() invocations for racing migration to the root cgroup and an
autogroup migration.

This led to scx_move_task() incorrectly ignoring a migration from non-root
cgroup to an autogroup of the root cgroup triggering the following warning:

  WARNING: CPU: 7 PID: 1 at kernel/sched/ext.c:3725 scx_cgroup_can_attach+0x196/0x340
  ...
  Call Trace:
  <TASK>
    cgroup_migrate_execute+0x5b1/0x700
    cgroup_attach_task+0x296/0x400
    __cgroup_procs_write+0x128/0x140
    cgroup_procs_write+0x17/0x30
    kernfs_fop_write_iter+0x141/0x1f0
    vfs_write+0x31d/0x4a0
    __x64_sys_write+0x72/0xf0
    do_syscall_64+0x82/0x160
    entry_SYSCALL_64_after_hwframe+0x76/0x7e

Fix it by adding an argument to sched_move_task() that indicates whether the
moving is for a cgroup or autogroup migration. After the change,
scx_move_task() is called only for cgroup migrations and renamed to
scx_cgroup_move_task().

Link: https://github.com/sched-ext/scx/issues/370
Fixes: 8195136669 ("sched_ext: Add cgroup support")
Cc: stable@vger.kernel.org # v6.12+
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-01-27 08:31:50 -10:00
Tejun Heo
e9fe182772 sched_ext: selftests/dsp_local_on: Fix sporadic failures
dsp_local_on has several incorrect assumptions, one of which is that
p->nr_cpus_allowed always tracks p->cpus_ptr. This is not true when a task
is scheduled out while migration is disabled - p->cpus_ptr is temporarily
overridden to the previous CPU while p->nr_cpus_allowed remains unchanged.

This led to sporadic test faliures when dsp_local_on_dispatch() tries to put
a migration disabled task to a different CPU. Fix it by keeping the previous
CPU when migration is disabled.

There are SCX schedulers that make use of p->nr_cpus_allowed. They should
also implement explicit handling for p->migration_disabled.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Ihor Solodrai <ihor.solodrai@pm.me>
Cc: Andrea Righi <arighi@nvidia.com>
Cc: Changwoo Min <changwoo@igalia.com>
2025-01-24 10:48:25 -10:00
Andrea Righi
74ca334338 selftests/sched_ext: Fix enum resolution
All scx enums are now automatically generated from vmlinux.h and they
must be initialized using the SCX_ENUM_INIT() macro.

Fix the scx selftests to use this macro to properly initialize these
values.

Fixes: 8da7bf2cee ("tools/sched_ext: Receive updates from SCX repo")
Reported-by: Ihor Solodrai <ihor.solodrai@pm.me>
Closes: https://lore.kernel.org/all/Z2tNK2oFDX1OPp8C@slm.duckdns.org/
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-01-24 08:09:52 -10:00
Andrea Righi
2279563e3a sched_ext: Include task weight in the error state dump
Report the task weight when dumping the task state during an error exit.
Moreover, adjust the output format to display dsq_vtime, slice, and
weight on the same line.

This can help identify whether certain tasks were excessively
prioritized or de-prioritized due to large niceness gaps.

Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-01-24 08:07:42 -10:00
Atul Kumar Pant
be8ee18152 sched_ext: Fixes typos in comments
Fixes some spelling errors in the comments.

Signed-off-by: Atul Kumar Pant <atulpant.linux@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-01-24 08:06:05 -10:00
Linus Torvalds
ab18b8fff1 Merge tag 'auxdisplay-v6.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/andy/linux-auxdisplay
Pull auxdisplay updates from Andy Shevchenko:

 - A couple of cleanups to img-ascii-lcd driver

* tag 'auxdisplay-v6.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/andy/linux-auxdisplay:
  auxdisplay: img-ascii-lcd: Constify struct img_ascii_lcd_config
  auxdisplay: img-ascii-lcd: Remove an unused field in struct img_ascii_lcd_ctx
2025-01-24 08:03:52 -08:00
Linus Torvalds
2c8d2a510c Merge tag 'sound-6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
Pull sound updates from Takashi Iwai:
 "This was a relatively calm cycle, and most of changes are rather small
  device-specific fixes. Here are highlights:

  Core:
   - Further enhancements of ALSA rawmidi and sequencer APIs for MIDI
     2.0
   - compress-offload API extensions for ASRC support

  ASoC:
   - Allow clocking on each DAI in an audio graph card to be configured
     separately
   - Improved power management for Renesas RZ-SSI
   - KUnit testing for the Cirrus DSP framework
   - Memory to meory operation support for Freescale/NXP platforms
   - Support for pause operations in SOF
   - Support for Allwinner suinv F1C100s, Awinc AW88083, Realtek
     ALC5682I-VE

  HD- and USB-audio:
   - Add support for Focusrite Scarlett 4th Gen 16i16, 18i16, and 18i20
     interfaces via new FCP driver
   - TAS2781 SPI HD-audio sub-codec support
   - Various device-specific quirks as usual"

* tag 'sound-6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: (235 commits)
  ALSA: hda: tas2781-spi: Fix bogus error handling in tas2781_hda_spi_probe()
  ALSA: hda: tas2781-spi: Fix error code in tas2781_read_acpi()
  ALSA: hda: tas2781-spi: Delete some dead code
  ALSA: usb: fcp: Fix return code from poll ops
  ALSA: usb: fcp: Fix incorrect resp->opcode retrieval
  ALSA: usb: fcp: Fix meter_levels type to __le32
  ALSA: hda/realtek: Enable Mute LED on HP Laptop 14s-fq1xxx
  ALSA: hda: tas2781-spi: Fix -Wsometimes-uninitialized in tasdevice_spi_switch_book()
  ALSA: ctxfi: Simplify dao_clear_{left,right}_input() functions
  ALSA: hda: tas2781-spi: select CRC32 instead of CRC32_SARWATE
  ALSA: usb: fcp: Fix hwdep read ops types
  ALSA: scarlett2: Add device_setup option to use FCP driver
  ALSA: FCP: Add Focusrite Control Protocol driver
  ALSA: hda/tas2781: Add tas2781 hda SPI driver
  ALSA: hda/realtek - Fixed headphone distorted sound on Acer Aspire A115-31 laptop
  ASoC: xilinx: xlnx_spdif: Simpify using devm_clk_get_enabled()
  ALSA: hda: Support for Ideapad hotkey mute LEDs
  ASoC: Intel: sof_sdw: Fix DMI match for Lenovo 83JX, 83MC and 83NM
  ASoC: Intel: sof_sdw: Fix DMI match for Lenovo 83LC
  ASoC: dapm: add support for preparing streams
  ...
2025-01-24 07:54:34 -08:00
Linus Torvalds
454cb97726 Merge tag 'v6.14-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto updates from Herbert Xu:
 "API:
   - Remove physical address skcipher walking
   - Fix boot-up self-test race

  Algorithms:
   - Optimisations for x86/aes-gcm
   - Optimisations for x86/aes-xts
   - Remove VMAC
   - Remove keywrap

  Drivers:
   - Remove n2

  Others:
   - Fixes for padata UAF
   - Fix potential rhashtable deadlock by moving schedule_work outside
     lock"

* tag 'v6.14-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (75 commits)
  rhashtable: Fix rhashtable_try_insert test
  dt-bindings: crypto: qcom,inline-crypto-engine: Document the SM8750 ICE
  dt-bindings: crypto: qcom,prng: Document SM8750 RNG
  dt-bindings: crypto: qcom-qce: Document the SM8750 crypto engine
  crypto: asymmetric_keys - Remove unused key_being_used_for[]
  padata: avoid UAF for reorder_work
  padata: fix UAF in padata_reorder
  padata: add pd get/put refcnt helper
  crypto: skcipher - call cond_resched() directly
  crypto: skcipher - optimize initializing skcipher_walk fields
  crypto: skcipher - clean up initialization of skcipher_walk::flags
  crypto: skcipher - fold skcipher_walk_skcipher() into skcipher_walk_virt()
  crypto: skcipher - remove redundant check for SKCIPHER_WALK_SLOW
  crypto: skcipher - remove redundant clamping to page size
  crypto: skcipher - remove unnecessary page alignment of bounce buffer
  crypto: skcipher - document skcipher_walk_done() and rename some vars
  crypto: omap - switch from scatter_walk to plain offset
  crypto: powerpc/p10-aes-gcm - simplify handling of linear associated data
  crypto: bcm - Drop unused setting of local 'ptr' variable
  crypto: hisilicon/qm - support new function communication
  ...
2025-01-24 07:48:10 -08:00
Linus Torvalds
ae2d4fc540 Merge tag 'tpmdd-next-6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jarkko/linux-tpmdd
Pull TPM update from Jarkko Sakkinen.

* tag 'tpmdd-next-6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jarkko/linux-tpmdd:
  tpm: Change to kvalloc() in eventlog/acpi.c
2025-01-24 07:46:33 -08:00
Linus Torvalds
68732c0bf9 Merge tag 'pmdomain-v6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/linux-pm
Pull pmdomain updates from Ulf Hansson:
 "pmdomain core:
   - Add support for naming idlestates through DT

  pmdomain providers:
   - arm: Explicitly request the current state at init for the SCMI PM
     domain
   - mediatek: Add Airoha CPU PM Domain support for CPU frequency
     scaling
   - ti: Add per-device latency constraint management to the ti_sci PM
     domain

  cpuidle-psci:
   - Enable system-wakeup through GENPD_FLAG_ACTIVE_WAKEUP"

* tag 'pmdomain-v6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/linux-pm:
  pmdomain: airoha: Fix compilation error with Clang-20 and Thumb2 mode
  pmdomain: arm: scmi_pm_domain: Send an explicit request to set the current state
  pmdomain: airoha: Add Airoha CPU PM Domain support
  pmdomain: ti_sci: handle wake IRQs for IO daisy chain wakeups
  pmdomain: ti_sci: add wakeup constraint management
  pmdomain: ti_sci: add per-device latency constraint management
  pmdomain: imx-gpcv2: Suppress bind attrs
  pmdomain: imx8m[p]-blk-ctrl: Suppress bind attrs
  pmdomain: core: Support naming idle states
  dt-bindings: power: domain-idle-state: Allow idle-state-name
  cpuidle: psci: Activate GENPD_FLAG_ACTIVE_WAKEUP with OSI
2025-01-24 07:43:35 -08:00
Linus Torvalds
b746043cb3 Merge tag 'pinctrl-v6.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl
Pull pin control updates from Linus Walleij:
 "No core changes this time

  New drivers:

   - New subdriver for the Qualcomm MSM8917 SoC TLMM

   - New subdriver for the Mediatek MT7988 SoC

   - New subdriver for the Rockchip RK3562 SoC

   - New subdriver for the Renesas RZ/G3E SoC

  Improvements:

   - Fix some missing pins in the Qualcomm IPQ5424 TLMM

   - Fix some missing LVDS pins in the Sunxi A100/A133

   - Support Sunxi V853 (simple compatible string)

   - Cleanups in the Samsung driver

   - Fix some AMD suspend behaviour

   - Cleanups"

* tag 'pinctrl-v6.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl: (29 commits)
  dt-bindings: pinctrl: sunxi: add compatible for V853
  pinctrl: Use str_enable_disable-like helpers
  dt-bindings: pinctrl: Correct indentation and style in DTS example
  pinctrl: amd: Take suspend type into consideration which pins are non-wake
  pinctrl: stm32: Add check for clk_enable()
  pinctrl: renesas: rzg2l: Fix PFC_MASK for RZ/V2H and RZ/G3E
  pinctrl: sunxi: add missed lvds pins for a100/a133
  pinctrl: mediatek: Drop mtk_pinconf_bias_set_pd()
  pinctrl: renesas: rzg2l: Add support for RZ/G3E SoC
  pinctrl: renesas: rzg2l: Update r9a09g057_variable_pin_cfg table
  dt-bindings: pinctrl: renesas: Document RZ/G3E SoC
  dt-bindings: pinctrl: renesas: Add alpha-numerical port support for RZ/V2H
  pinctrl: rockchip: add rk3562 support
  dt-bindings: pinctrl: Add rk3562 pinctrl support
  pinctrl: Fix the clean up on pinconf_apply_setting failure
  dt-bindings: pinctrl: add binding for MT7988 SoC
  pinctrl: mediatek: add MT7988 pinctrl driver
  pinctrl: mediatek: add support for MTK_PULL_PD_TYPE
  pinctrl: ocelot: Constify some structures
  pinctrl: renesas: rzg2l: Add audio clock pins on RZ/G3S
  ...
2025-01-24 07:38:50 -08:00
Linus Torvalds
f1c243fc78 Merge tag 'iommu-updates-v6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux
Pull iommu updates from Joerg Roedel:
 "Core changes:
   - PASID support for the blocked_domain

  ARM-SMMU Updates:
   - SMMUv2:
      - Implement per-client prefetcher configuration on Qualcomm SoCs
      - Support for the Adreno SMMU on Qualcomm's SDM670 SOC
   - SMMUv3:
      - Pretty-printing of event records
      - Drop the ->domain_alloc_paging implementation in favour of
        domain_alloc_paging_flags(flags==0)
   - IO-PGTable:
      - Generalisation of the page-table walker to enable external
        walkers (e.g. for debugging unexpected page-faults from the GPU)
      - Minor fix for handling concatenated PGDs at stage-2 with 16KiB
        pages
   - Misc:
      - Clean-up device probing and replace the crufty probe-deferral
        hack with a more robust implementation of
        arm_smmu_get_by_fwnode()
      - Device-tree binding updates for a bunch of Qualcomm platforms

  Intel VT-d Updates:
   - Remove domain_alloc_paging()
   - Remove capability audit code
   - Draining PRQ in sva unbind path when FPD bit set
   - Link cache tags of same iommu unit together

  AMD-Vi Updates:
   - Use CMPXCHG128 to update DTE
   - Cleanups of the domain_alloc_paging() path

  RiscV IOMMU:
   - Platform MSI support
   - Shutdown support

  Rockchip IOMMU:
   - Add DT bindings for Rockchip RK3576

  More smaller fixes and cleanups"

* tag 'iommu-updates-v6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux: (66 commits)
  iommu: Use str_enable_disable-like helpers
  iommu/amd: Fully decode all combinations of alloc_paging_flags
  iommu/amd: Move the nid to pdom_setup_pgtable()
  iommu/amd: Change amd_iommu_pgtable to use enum protection_domain_mode
  iommu/amd: Remove type argument from do_iommu_domain_alloc() and related
  iommu/amd: Remove dev == NULL checks
  iommu/amd: Remove domain_alloc()
  iommu/amd: Remove unused amd_iommu_domain_update()
  iommu/riscv: Fixup compile warning
  iommu/arm-smmu-v3: Add missing #include of linux/string_choices.h
  iommu/arm-smmu-v3: Use str_read_write helper w/ logs
  iommu/io-pgtable-arm: Add way to debug pgtable walk
  iommu/io-pgtable-arm: Re-use the pgtable walk for iova_to_phys
  iommu/io-pgtable-arm: Make pgtable walker more generic
  iommu/arm-smmu: Add ACTLR data and support for qcom_smmu_500
  iommu/arm-smmu: Introduce ACTLR custom prefetcher settings
  iommu/arm-smmu: Add support for PRR bit setup
  iommu/arm-smmu: Refactor qcom_smmu structure to include single pointer
  iommu/arm-smmu: Re-enable context caching in smmu reset operation
  iommu/vt-d: Link cache tags of same iommu unit together
  ...
2025-01-24 07:33:46 -08:00
Linus Torvalds
c9c0543b52 Merge tag 'platform-drivers-x86-v6.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86
Pull x86 platform driver updates from Ilpo Järvinen:
 "acer-wmi:
   - Add support for PH14-51, PH16-72, and Nitro AN515-58
   - Add proper hwmon support
   - Improve error handling when reading "gaming system info"
   - Replace direct EC reads for the current platform profile with WMI
     calls to handle EC address variations
   - Replace custom platform_profile cycling with the generic one

  ACPI:
   - platform_profile: Major refactoring and improvements
   - Support registering multiple platform_profile handlers concurrently
     to avoid the need to quirk which handler takes precedence
   - Support reporting "custom" profile for cases where the current
     profile is ambiguous or when settings tweaks are done outside the
     pre-defined profile
   - Abstract and layer platform_profile API better using the class_dev
     and drvdata
   - Various minor improvements
   - Add Documentation and kerneldoc

  amd/hsmp:
   - Add support for HSMP protocol v7

  amd/pmc:
   - Support AMD 1Ah family 70h
   - Support STB with Ryzen desktop SoCs

  amd/pmf:
   - Support Custom BIOS inputs for PMF TA
   - Support passing SRA sensor data from AMD SFH (HID) to PMF TA

  dell-smo8800:
   - Move SMO88xx quirk away from the generic i2c-i801 driver
   - Add accelerometer support for Dell Latitude E6330/E6430 and XPS
     9550
   - Support probing accelerometer for models yet to be listed in the
     DMI mapping table because ACPI lacks i2c-address for the
     accelerometer (behind a module parameter because probing might be
     dangerous)

  HID:
   - amd_sfh: Add support for exporting SRA sensor data

  hp-wmi:
   - Add fan and thermal support for Victus 16-s1000

  input:
   - Add key for phone linking
   - i8042: Add context for the i8042 filter to enable cleaning up the
     filter related global variables from pdx86 drivers

  lenovo-wmi-camera:
   - Use SW_CAMERA_LENS_COVER instead of KEY_CAMERA_ACCESS

  mellanox mlxbf-pmc:
   - Add support for monitoring cycle count
   - Add Documentation

  thinkpad_acpi:
   - Add support for phone link key

  tools/power/x86/intel-speed-select:
   - Fix Turbo Ratio Limit restore

  x86-android-tables:
   - Add support for Vexia EDU ATLA 10 Bluetooth and EC battery driver

  And miscellaneous cleanups / refactoring / improvements"

* tag 'platform-drivers-x86-v6.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86: (133 commits)
  platform/x86: acer-wmi: Fix initialization of last_non_turbo_profile
  platform/x86: acer-wmi: Ignore AC events
  platform/mellanox: mlxreg-io: use sysfs_emit() instead of sprintf()
  platform/mellanox: mlxreg-hotplug: use sysfs_emit() instead of sprintf()
  platform/mellanox: mlxbf-bootctl: use sysfs_emit() instead of sprintf()
  platform/x86: hp-wmi: Add fan and thermal profile support for Victus 16-s1000
  ACPI: platform_profile: Add a prefix to log messages
  ACPI: platform_profile: Add documentation
  ACPI: platform_profile: Clean platform_profile_handler
  ACPI: platform_profile: Move platform_profile_handler
  ACPI: platform_profile: Remove platform_profile_handler from exported symbols
  platform/x86: thinkpad_acpi: Use devm_platform_profile_register()
  platform/x86: inspur_platform_profile: Use devm_platform_profile_register()
  platform/x86: hp-wmi: Use devm_platform_profile_register()
  platform/x86: ideapad-laptop: Use devm_platform_profile_register()
  platform/x86: dell-pc: Use devm_platform_profile_register()
  platform/x86: asus-wmi: Use devm_platform_profile_register()
  platform/x86: amd: pmf: sps: Use devm_platform_profile_register()
  platform/x86: acer-wmi: Use devm_platform_profile_register()
  platform/surface: surface_platform_profile: Use devm_platform_profile_register()
  ...
2025-01-24 07:18:39 -08:00
Linus Torvalds
113691ce9f Merge tag 'x86_tdx_for_6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 TDX updates from Dave Hansen:
 "Intel Trust Domain updates.

  The existing TDX code needs a _bit_ of metadata from the TDX module.
  But KVM is going to need a bunch more very shortly. Rework the
  interface with the TDX module to be more consistent and handle the new
  higher volume.

  The TDX module has added a few new features. The first is a promise
  not to clobber RBP under any circumstances. Basically the kernel now
  will refuse to use any modules that don't have this promise. Second,
  enable the new "REDUCE_VE" feature. This ensures that the TDX module
  will not send some silly virtualization exceptions that the guest had
  no good way to handle anyway.

   - Centralize global metadata infrastructure

   - Use new TDX module features for exception suppression and RBP
     clobbering"

* tag 'x86_tdx_for_6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/virt/tdx: Require the module to assert it has the NO_RBP_MOD mitigation
  x86/virt/tdx: Switch to use auto-generated global metadata reading code
  x86/virt/tdx: Use dedicated struct members for PAMT entry sizes
  x86/virt/tdx: Use auto-generated code to read global metadata
  x86/virt/tdx: Start to track all global metadata in one structure
  x86/virt/tdx: Rename 'struct tdx_tdmr_sysinfo' to reflect the spec better
  x86/tdx: Dump attributes and TD_CTLS on boot
  x86/tdx: Disable unnecessary virtualization exceptions
2025-01-24 05:58:31 -08:00
Linus Torvalds
5b7f7234ff Merge tag 'x86-boot-2025-01-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 boot updates from Ingo Molnar:

 - A large and involved preparatory series to pave the way to add
   exception handling for relocate_kernel - which will be a debugging
   facility that has aided in the field to debug an exceptionally hard
   to debug early boot bug. Plus assorted cleanups and fixes that were
   discovered along the way, by David Woodhouse:

      - Clean up and document register use in relocate_kernel_64.S
      - Use named labels in swap_pages in relocate_kernel_64.S
      - Only swap pages for ::preserve_context mode
      - Allocate PGD for x86_64 transition page tables separately
      - Copy control page into place in machine_kexec_prepare()
      - Invoke copy of relocate_kernel() instead of the original
      - Move relocate_kernel to kernel .data section
      - Add data section to relocate_kernel
      - Drop page_list argument from relocate_kernel()
      - Eliminate writes through kernel mapping of relocate_kernel page
      - Clean up register usage in relocate_kernel()
      - Mark relocate_kernel page as ROX instead of RWX
      - Disable global pages before writing to control page
      - Ensure preserve_context flag is set on return to kernel
      - Use correct swap page in swap_pages function
      - Fix stack and handling of re-entry point for ::preserve_context
      - Mark machine_kexec() with __nocfi
      - Cope with relocate_kernel() not being at the start of the page
      - Use typedef for relocate_kernel_fn function prototype
      - Fix location of relocate_kernel with -ffunction-sections (fix by Nathan Chancellor)

 - A series to remove the last remaining absolute symbol references from
   .head.text, and enforce this at build time, by Ard Biesheuvel:

      - Avoid WARN()s and panic()s in early boot code
      - Don't hang but terminate on failure to remap SVSM CA
      - Determine VA/PA offset before entering C code
      - Avoid intentional absolute symbol references in .head.text
      - Disable UBSAN in early boot code
      - Move ENTRY_TEXT to the start of the image
      - Move .head.text into its own output section
      - Reject absolute references in .head.text

 - The above build-time enforcement uncovered a handful of bugs of
   essentially non-working code, and a wrokaround for a toolchain bug,
   fixed by Ard Biesheuvel as well:

      - Fix spurious undefined reference when CONFIG_X86_5LEVEL=n, on GCC-12
      - Disable UBSAN on SEV code that may execute very early
      - Disable ftrace branch profiling in SEV startup code

 - And miscellaneous cleanups:

      - kexec_core: Add and update comments regarding the KEXEC_JUMP flow (Rafael J. Wysocki)
      - x86/sysfs: Constify 'struct bin_attribute' (Thomas Weißschuh)"

* tag 'x86-boot-2025-01-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits)
  x86/sev: Disable ftrace branch profiling in SEV startup code
  x86/kexec: Use typedef for relocate_kernel_fn function prototype
  x86/kexec: Cope with relocate_kernel() not being at the start of the page
  kexec_core: Add and update comments regarding the KEXEC_JUMP flow
  x86/kexec: Mark machine_kexec() with __nocfi
  x86/kexec: Fix location of relocate_kernel with -ffunction-sections
  x86/kexec: Fix stack and handling of re-entry point for ::preserve_context
  x86/kexec: Use correct swap page in swap_pages function
  x86/kexec: Ensure preserve_context flag is set on return to kernel
  x86/kexec: Disable global pages before writing to control page
  x86/sev: Don't hang but terminate on failure to remap SVSM CA
  x86/sev: Disable UBSAN on SEV code that may execute very early
  x86/boot/64: Fix spurious undefined reference when CONFIG_X86_5LEVEL=n, on GCC-12
  x86/sysfs: Constify 'struct bin_attribute'
  x86/kexec: Mark relocate_kernel page as ROX instead of RWX
  x86/kexec: Clean up register usage in relocate_kernel()
  x86/kexec: Eliminate writes through kernel mapping of relocate_kernel page
  x86/kexec: Drop page_list argument from relocate_kernel()
  x86/kexec: Add data section to relocate_kernel
  x86/kexec: Move relocate_kernel to kernel .data section
  ...
2025-01-24 05:54:26 -08:00
Linus Torvalds
7685b334d1 Merge tag 'perf-tools-for-v6.14-2025-01-21' of git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools
Pull perf-tools updates from Namhyung Kim:
 "There are a lot of changes in the perf tools in this cycle.

  build:

   - Use generic syscall table to generate syscall numbers on supported
     archs

   - This also enables to get rid of libaudit which was used for syscall
     numbers

   - Remove python2 support as it's deprecated for years

   - Fix issues on static build with libzstd

  perf record:

   - Intel-PT supports "aux-action" config term to pause or resume
     tracing in the aux-buffer. Users can start the intel_pt event as
     "started-paused" and configure other events to control the Intel-PT
     tracing:

         # perf record --kcore -e intel_pt/aux-action=start-paused/   \
             -e syscalls:sys_enter_newuname/aux-action=resume/        \
             -e syscalls:sys_exit_newuname/aux-action=pause/ -- uname

     This requires kernel support (which was added in v6.13)

  perf lock:

   - 'perf lock contention' command has an ability to symbolize locks in
     dynamically allocated objects using slab cache name when it runs
     with BPF. Those dynamic locks would have "&" prefix in the name to
     distinguish them from ordinary (static) locks

        # perf lock con -abl -E 5 sleep 1
           contended   total wait     max wait     avg wait            address   symbol

                   2      1.95 us      1.77 us       975 ns   ffff9d5e852d3498   &task_struct (mutex)
                   1      1.18 us      1.18 us      1.18 us   ffff9d5e852d3538   &task_struct (mutex)
                   4      1.12 us       354 ns       279 ns   ffff9d5e841ca800   &kmalloc-cg-512 (mutex)
                   2       859 ns       617 ns       429 ns   ffffffffa41c3620   delayed_uprobe_lock (mutex)
                   3       691 ns       388 ns       230 ns   ffffffffa41c0940   pack_mutex (mutex)

     This also requires kernel/BPF support (which was added in v6.13)

  perf ftrace:

   - 'perf ftrace latency' command gets a couple of options to support
     linear buckets instead of exponential. Also it's possible to
     specify max and min latency for the linear buckets:

        # perf ftrace latency -abn -T switch_mm_irqs_off --bucket-range=100   \
            --min-latency=200 --max-latency=800 -- sleep 1
        #   DURATION     |      COUNT | GRAPH                                  |
             0 -  200 ns |        186 | ###                                    |
           200 -  300 ns |        256 | #####                                  |
           300 -  400 ns |        364 | #######                                |
           400 -  500 ns |        223 | ####                                   |
           500 -  600 ns |        111 | ##                                     |
           600 -  700 ns |         41 |                                        |
           700 -  800 ns |        141 | ##                                     |
           800 -  ... ns |        169 | ###                                    |

        # statistics  (in nsec)
          total time:              2162212
            avg time:                  967
            max time:                16817
            min time:                  132
               count:                 2236

   - As you can see in the above example, it nows shows the statistics
     at the end so that users can see the avg/max/min latencies easily

   - 'perf ftrace profile' command has --graph-opts option like 'perf
     ftrace trace' so that it can control the tracing behaviors in the
     same way. For example, it can limit the function call depth or
     threshold

  perf script:

   - Improve physical memory resolution in 'mem-phys-addr' script by
     parsing /proc/iomem file

        # perf script mem-phys-addr -- find /
        ...
        Event: mem_inst_retired.all_loads:P
        Memory type                                    count  percentage
        ----------------------------------------  ----------  ----------
        100000000-85f7fffff : System RAM                8929        69.7
          547600000-54785d23f : Kernel data             1240         9.7
          546a00000-5474bdfff : Kernel rodata            490         3.8
          5480ce000-5485fffff : Kernel bss               121         0.9
        0-fff : Reserved                                3860        30.1
        100000-89c01fff : System RAM                      18         0.1
        8a22c000-8df6efff : System RAM                     5         0.0

  Others:

   - 'perf test' gets --runs-per-test option to run the test cases
     repeatedly. This would be helpful to see if it's flaky

   - Add 'parse_events' method to Python perf extension module, so that
     users can use the same event parsing logic in the python code. One
     more step towards implementing perf tools in Python. :)

   - Support opening tracepoint events without libtraceevent. This will
     be helpful if it won't use the tracing data like in 'perf stat'

   - Update ARM Neoverse N2/V2 JSON events and metrics"

* tag 'perf-tools-for-v6.14-2025-01-21' of git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools: (176 commits)
  perf test: Update event_groups test to use instructions
  perf bench: Fix undefined behavior in cmpworker()
  perf annotate: Prefer passing evsel to evsel->core.idx
  perf lock: Rename fields in lock_type_table
  perf lock: Add percpu-rwsem for type filter
  perf lock: Fix parse_lock_type which only retrieve one lock flag
  perf lock: Fix return code for functions in __cmd_contention
  perf hist: Fix width calculation in hpp__fmt()
  perf hist: Fix bogus profiles when filters are enabled
  perf hist: Deduplicate cmp/sort/collapse code
  perf test: Improve verbose documentation
  perf test: Add a runs-per-test flag
  perf test: Fix parallel/sequential option documentation
  perf test: Send list output to stdout rather than stderr
  perf test: Rename functions and variables for better clarity
  perf tools: Expose quiet/verbose variables in Makefile.perf
  perf config: Add a function to set one variable in .perfconfig
  perf test perftool_testsuite: Return correct value for skipping
  perf test perftool_testsuite: Add missing description
  perf test record+probe_libc_inet_pton: Make test resilient
  ...
2025-01-24 05:45:40 -08:00
Linus Torvalds
bc8198dc7e Merge tag 'sched_ext-for-6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext
Pull sched_ext updates from Tejun Heo:

 - scx_bpf_now() added so that BPF scheduler can access the cached
   timestamp in struct rq to avoid reading TSC multiple times within a
   locked scheduling operation.

 - Minor updates to the built-in idle CPU selection logic.

 - tool/sched_ext updates and other misc changes.

* tag 'sched_ext-for-6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
  sched_ext: fix kernel-doc warnings
  sched_ext: Use time helpers in BPF schedulers
  sched_ext: Replace bpf_ktime_get_ns() to scx_bpf_now()
  sched_ext: Add time helpers for BPF schedulers
  sched_ext: Add scx_bpf_now() for BPF scheduler
  sched_ext: Implement scx_bpf_now()
  sched_ext: Relocate scx_enabled() related code
  sched_ext: Add option -l in selftest runner to list all available tests
  sched_ext: Include remaining task time slice in error state dump
  sched_ext: update scx_bpf_dsq_insert() doc for SCX_DSQ_LOCAL_ON
  sched_ext: idle: small CPU iteration refactoring
  sched_ext: idle: introduce check_builtin_idle_enabled() helper
  sched_ext: idle: clarify comments
  sched_ext: idle: use assign_cpu() to update the idle cpumask
  sched_ext: Use str_enabled_disabled() helper in update_selcpu_topology()
  sched_ext: Use sizeof_field for key_len in dsq_hash_params
  tools/sched_ext: Receive updates from SCX repo
  sched_ext: Use the NUMA scheduling domain for NUMA optimizations
2025-01-23 18:49:43 -08:00
Linus Torvalds
606489dbfa Merge tag 'trace-ringbuffer-v6.14-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull trace fing buffer fix from Steven Rostedt:
 "Fix atomic64 operations on some architectures for the tracing ring
  buffer:

   - Have emulating atomic64 use arch_spin_locks instead of
     raw_spin_locks

     The tracing ring buffer events have a small timestamp that holds
     the delta between itself and the event before it. But this can be
     tricky to update when interrupts come in. It originally just set
     the deltas to zero for events that interrupted the adding of
     another event which made all the events in the interrupt have the
     same timestamp as the event it interrupted. This was not suitable
     for many tools, so it was eventually fixed. But that fix required
     adding an atomic64 cmpxchg on the timestamp in cases where an event
     was added while another event was in the process of being added.

     Originally, for 32 bit architectures, the manipulation of the 64
     bit timestamp was done by a structure that held multiple 32bit
     words to hold parts of the timestamp and a counter. But as updates
     to the ring buffer were done, maintaining this became too complex
     and was replaced by the atomic64 generic operations which are now
     used by both 64bit and 32bit architectures. Shortly after that, it
     was reported that riscv32 and other 32 bit architectures that just
     used the generic atomic64 were locking up. This was because the
     generic atomic64 operations defined in lib/atomic64.c uses a
     raw_spin_lock() to emulate an atomic64 operation. The problem here
     was that raw_spin_lock() can also be traced by the function tracer
     (which is commonly used for debugging raw spin locks). Since the
     function tracer uses the tracing ring buffer, which now is being
     traced internally, this was triggering a recursion and setting off
     a warning that the spin locks were recusing.

     There's no reason for the code that emulates atomic64 operations to
     be using raw_spin_locks which have a lot of debugging
     infrastructure attached to them (depending on the config options).
     Instead it should be using the arch_spin_lock() which does not have
     any infrastructure attached to them and is used by low level
     infrastructure like RCU locks, lockdep and of course tracing. Using
     arch_spin_lock()s fixes this issue.

   - Do not trace in NMI if the architecture uses emulated atomic64
     operations

     Another issue with using the emulated atomic64 operations that uses
     spin locks to emulate the atomic64 operations is that they cannot
     be used in NMI context. As an NMI can trigger while holding the
     atomic64 spin locks it can try to take the same lock and cause a
     deadlock.

     Have the ring buffer fail recording events if in NMI context and
     the architecture uses the emulated atomic64 operations"

* tag 'trace-ringbuffer-v6.14-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  atomic64: Use arch_spin_locks instead of raw_spin_locks
  ring-buffer: Do not allow events in NMI with generic atomic64 cmpxchg()
2025-01-23 18:02:55 -08:00