Commit Graph

45245 Commits

Author SHA1 Message Date
Tejun Heo
344576fa6a sched_ext: Improve logging around enable/disable
sched_ext currently doesn't generate messages when the BPF scheduler is
enabled and disabled unless there are errors. It is useful to have paper
trail. Improve logging around enable/disable:

- Generate info messages on enable and non-error disable.

- Update error exit message formatting so that it's consistent with
  non-error message. Also, prefix ei->msg with the BPF scheduler's name to
  make it clear where the message is coming from.

- Shorten scx_exit_reason() strings for SCX_EXIT_UNREG* for brevity and
  consistency.

v2: Use pr_*() instead of KERN_* consistently. (David)

Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-by: Phil Auld <pauld@redhat.com>
Reviewed-by: Phil Auld <pauld@redhat.com>
Acked-by: David Vernet <void@manifault.com>
2024-08-08 13:42:37 -10:00
Tejun Heo
991ef53a48 sched_ext: Make scx_rq_online() also test cpu_active() in addition to SCX_RQ_ONLINE
scx_rq_online() currently only tests SCX_RQ_ONLINE. This isn't fully correct
- e.g. consume_dispatch_q() uses task_run_on_remote_rq() which tests
scx_rq_online() to see whether the current rq can run the task, and, if so,
calls consume_remote_task() to migrate the task to @rq. While the test
itself was done while locking @rq, @rq can be temporarily unlocked by
consume_remote_task() and nothing prevents SCX_RQ_ONLINE from going offline
before the migration takes place.

To address the issue, add cpu_active() test to scx_rq_online(). There is a
synchronize_rcu() between cpu_active() being cleared and the rq going
offline, so if an on-going scheduling operation sees cpu_active(), the
associated rq is guaranteed to not go offline until the scheduling operation
is complete.

Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 60c27fb59f ("sched_ext: Implement sched_ext_ops.cpu_online/offline()")
Acked-by: David Vernet <void@manifault.com>
2024-08-08 13:38:19 -10:00
Tejun Heo
72763ea3d4 sched_ext: Fix unsafe list iteration in process_ddsp_deferred_locals()
process_ddsp_deferred_locals() executes deferred direct dispatches to the
local DSQs of remote CPUs. It iterates the tasks on
rq->scx.ddsp_deferred_locals list, removing and calling
dispatch_to_local_dsq() on each. However, the list is protected by the rq
lock that can be dropped by dispatch_to_local_dsq() temporarily, so the list
can be modified during the iteration, which can lead to oopses and other
failures.

Fix it by popping from the head of the list instead of iterating the list.

Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 5b26f7b920 ("sched_ext: Allow SCX_DSQ_LOCAL_ON for direct dispatches")
Acked-by: David Vernet <void@manifault.com>
2024-08-08 13:38:09 -10:00
Tejun Heo
2c390dda9e sched_ext: Make task_can_run_on_remote_rq() use common task_allowed_on_cpu()
task_can_run_on_remote_rq() is similar to is_cpu_allowed() but there are
subtle differences. It currently open codes all the tests. This is
cumbersome to understand and error-prone in case the intersecting tests need
to be updated.

Factor out the common part - testing whether the task is allowed on the CPU
at all regardless of the CPU state - into task_allowed_on_cpu() and make
both is_cpu_allowed() and SCX's task_can_run_on_remote_rq() use it. As the
code is now linked between the two and each contains only the extra tests
that differ between them, it's less error-prone when the conditions need to
be updated. Also, improve the comment to explain why they are different.

v2: Replace accidental "extern inline" with "static inline" (Peter).

Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: David Vernet <void@manifault.com>
2024-08-06 09:40:11 -10:00
Tejun Heo
9390a923e1 sched_ext: Improve comment on idle_sched_class exception in scx_task_iter_next_locked()
scx_task_iter_next_locked() skips tasks whose sched_class is
idle_sched_class. While it has a short comment explaining why it's testing
the sched_class directly isntead of using is_idle_task(), the comment
doesn't sufficiently explain what's going on and why. Improve the comment.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Acked-by: David Vernet <void@manifault.com>
2024-08-06 09:40:11 -10:00
Tejun Heo
a735d43c7f sched_ext: Simplify UP support by enabling sched_class->balance() in UP
On SMP, SCX performs dispatch from sched_class->balance(). As balance() was
not available in UP, it instead called the internal balance function from
put_prev_task_scx() and pick_next_task_scx() to emulate the effect, which is
rather nasty.

Enabling sched_class->balance() on UP shouldn't cause any meaningful
overhead. Enable balance() on UP and drop the ugly workaround.

Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: David Vernet <void@manifault.com>
2024-08-06 09:40:11 -10:00
Tejun Heo
7799140b6a sched_ext: Use update_curr_common() in update_curr_scx()
update_curr_scx() is open coding runtime updates. Use update_curr_common()
instead and avoid unnecessary deviations.

Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: David Vernet <void@manifault.com>
2024-08-06 09:40:11 -10:00
Tejun Heo
cd01449268 sched_ext: Add scx_enabled() test to @start_class promotion in put_prev_task_balance()
SCX needs its balance() invoked even when waking up from a lower priority
sched class (idle) and put_prev_task_balance() thus has the logic to promote
@start_class if it's lower than ext_sched_class. This is only needed when
SCX is enabled. Add scx_enabled() test to avoid unnecessary overhead when
SCX is disabled.

Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: David Vernet <void@manifault.com>
2024-08-06 09:40:10 -10:00
Tejun Heo
11cc374f46 sched_ext: Simplify scx_can_stop_tick() invocation in sched_can_stop_tick()
The way sched_can_stop_tick() used scx_can_stop_tick() was rather confusing
and the behavior wasn't ideal when SCX is enabled in partial mode. Simplify
it so that:

- scx_can_stop_tick() can say no if scx_enabled().

- CFS tests rq->cfs.nr_running > 1 instead of rq->nr_running.

This is easier to follow and leads to the correct answer whether SCX is
disabled, enabled in partial mode or all tasks are switched to SCX.

Peter, note that this is a bit different from your suggestion where
sched_can_stop_tick() unconditionally returns scx_can_stop_tick() iff
scx_switched_all(). The problem is that in partial mode, tick can be stopped
when there is only one SCX task even if the BPF scheduler didn't ask and
isn't ready for it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: David Vernet <void@manifault.com>
2024-08-06 09:40:10 -10:00
Tejun Heo
0df340ceae Merge branch 'sched/core' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into for-6.12
Pull tip/sched/core to resolve the following four conflicts. While 2-4 are
simple context conflicts, 1 is a bit subtle and easy to resolve incorrectly.

1. 2c8d046d5d ("sched: Add normal_policy()")
   vs.
   faa42d2941 ("sched/fair: Make SCHED_IDLE entity be preempted in strict hierarchy")

The former converts direct test on p->policy to use the helper
normal_policy(). The latter moves the p->policy test to a different
location. Resolve by converting the test on p->plicy in the new location to
use normal_policy().

2. a7a9fc5492 ("sched_ext: Add boilerplate for extensible scheduler class")
   vs.
   a110a81c52 ("sched/deadline: Deferrable dl server")

Both add calls to put_prev_task_idle() and set_next_task_idle(). Simple
context conflict. Resolve by taking changes from both.

3. a7a9fc5492 ("sched_ext: Add boilerplate for extensible scheduler class")
   vs.
   c245910049 ("sched/core: Add clearing of ->dl_server in put_prev_task_balance()")

The former changes for_each_class() itertion to use for_each_active_class().
The latter moves away the adjacent dl_server handling code. Simple context
conflict. Resolve by taking changes from both.

4. 60c27fb59f ("sched_ext: Implement sched_ext_ops.cpu_online/offline()")
   vs.
   31b164e2e4 ("sched/smt: Introduce sched_smt_present_inc/dec() helper")
   2f02735412 ("sched/core: Introduce sched_set_rq_on/offline() helper")

The former adds scx_rq_deactivate() call. The latter two change code around
it. Simple context conflict. Resolve by taking changes from both.

Signed-off-by: Tejun Heo <tj@kernel.org>
2024-08-04 07:36:54 -10:00
Tejun Heo
e99129e5db sched_ext: Allow p->scx.disallow only while loading
From 1232da7eced620537a78f19c8cf3d4a3508e2419 Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj@kernel.org>
Date: Wed, 31 Jul 2024 09:14:52 -1000

p->scx.disallow provides a way for the BPF scheduler to reject certain tasks
from attaching. It's currently allowed for both the load and fork paths;
however, the latter doesn't actually work as p->sched_class is already set
by the time scx_ops_init_task() is called during fork.

This is a convenience feature which is mostly useful from the load path
anyway. Allow it only from the load path.

v2: Trigger scx_ops_error() iff @p->policy == SCHED_EXT to make it a bit
    easier for the BPF scheduler (David).

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: "Zhangqiao (2012 lab)" <zhangqiao22@huawei.com>
Link: http://lkml.kernel.org/r/20240711110720.1285-1-zhangqiao22@huawei.com
Fixes: 7bb6f0810e ("sched_ext: Allow BPF schedulers to disallow specific tasks from joining SCHED_EXT")
Acked-by: David Vernet <void@manifault.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-08-02 08:59:32 -10:00
Tejun Heo
a2f4b16e73 sched_ext: Build fix on !CONFIG_STACKTRACE[_SUPPORT]
scx_dump_task() uses stack_trace_save_tsk() which is only available when
CONFIG_STACKTRACE. Make CONFIG_SCHED_CLASS_EXT select CONFIG_STACKTRACE if
the support is available and skip capturing stack trace if
!CONFIG_STACKTRACE.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202407161844.reewQQrR-lkp@intel.com/
Acked-by: David Vernet <void@manifault.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-08-01 07:08:01 -10:00
David Vernet
298dec19bd scx: Allow calling sleepable kfuncs from BPF_PROG_TYPE_SYSCALL
We currently only allow calling sleepable scx kfuncs (i.e.
scx_bpf_create_dsq()) from BPF_PROG_TYPE_STRUCT_OPS progs. The idea here
was that we'd never have to call scx_bpf_create_dsq() outside of a
sched_ext struct_ops callback, but that might not actually be true. For
example, a scheduler could do something like the following:

1. Open and load (not yet attach) a scheduler skel

2. Synchronously call into a BPF_PROG_TYPE_SYSCALL prog from user space.
   For example, to initialize an LLC domain, or some other global,
   read-only state.

3. Attach the skel, which actually enables the scheduler

The advantage of doing this is that it can preclude having to do pretty
ugly boilerplate like initializing a read-only, statically sized array of
u64[]'s which the kernel consumes literally once at init time to then
create struct bpf_cpumask objects which are actually queried at runtime.

Doing the above is already possible given that we can invoke core BPF
kfuncs, such as bpf_cpumask_create(), from BPF_PROG_TYPE_SYSCALL progs. We
already allow many scx kfuncs to be called from BPF_PROG_TYPE_SYSCALL progs
(e.g. scx_bpf_kick_cpu()). Let's allow the sleepable kfuncs as well.

Signed-off-by: David Vernet <void@manifault.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-07-31 07:45:28 -10:00
Tejun Heo
c8faf11cd1 Merge tag 'v6.11-rc1' into for-6.12
Linux 6.11-rc1
2024-07-30 09:30:11 -10:00
Peter Zijlstra
cea5a3472a sched/fair: Cleanup fair_server
The throttle interaction made my brain hurt, make it consistently
about 0 transitions of h_nr_running.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2024-07-29 12:22:37 +02:00
Peter Zijlstra
5f6bd380c7 sched/rt: Remove default bandwidth control
Now that fair_server exists, we no longer need RT bandwidth control
unless RT_GROUP_SCHED.

Enable fair_server with parameters equivalent to RT throttling.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: "Vineeth Pillai (Google)" <vineeth@bitbyteword.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lore.kernel.org/r/14d562db55df5c3c780d91940743acb166895ef7.1716811044.git.bristot@kernel.org
2024-07-29 12:22:37 +02:00
Joel Fernandes (Google)
c8a85394cf sched/core: Fix picking of tasks for core scheduling with DL server
* Use simple CFS pick_task for DL pick_task

  DL server's pick_task calls CFS's pick_next_task_fair(), this is wrong
  because core scheduling's pick_task only calls CFS's pick_task() for
  evaluation / checking of the CFS task (comparing across CPUs), not for
  actually affirmatively picking the next task. This causes RB tree
  corruption issues in CFS that were found by syzbot.

* Make pick_task_fair clear DL server

  A DL task pick might set ->dl_server, but it is possible the task will
  never run (say the other HT has a stop task). If the CFS task is picked
  in the future directly (say without DL server), ->dl_server will be
  set. So clear it in pick_task_fair().

This fixes the KASAN issue reported by syzbot in set_next_entity().

(DL refactoring suggestions by Vineeth Pillai).

Reported-by: Suleiman Souhlal <suleiman@google.com>
Signed-off-by: "Joel Fernandes (Google)" <joel@joelfernandes.org>
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vineeth Pillai <vineeth@bitbyteword.org>
Tested-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lore.kernel.org/r/b10489ab1f03d23e08e6097acea47442e7d6466f.1716811044.git.bristot@kernel.org
2024-07-29 12:22:37 +02:00
Joel Fernandes (Google)
4b26cfdd39 sched/core: Fix priority checking for DL server picks
In core scheduling, a DL server pick (which is CFS task) should be
given higher priority than tasks in other classes.

Not doing so causes CFS starvation. A kselftest is added later to
demonstrate this.  A CFS task that is competing with RT tasks can
be completely starved without this and the DL server's boosting
completely ignored.

Fix these problems.

Reported-by: Suleiman Souhlal <suleiman@google.com>
Signed-off-by: "Joel Fernandes (Google)" <joel@joelfernandes.org>
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vineeth Pillai <vineeth@bitbyteword.org>
Tested-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lore.kernel.org/r/48b78521d86f3b33c24994d843c1aad6b987dda9.1716811044.git.bristot@kernel.org
2024-07-29 12:22:36 +02:00
Daniel Bristot de Oliveira
d741f297bc sched/fair: Fair server interface
Add an interface for fair server setup on debugfs.

Each CPU has two files under /debug/sched/fair_server/cpu{ID}:

 - runtime: set runtime in ns
 - period:  set period in ns

This then leaves /proc/sys/kernel/sched_rt_{period,runtime}_us to set
bounds on admission control.

The interface also add the server to the dl bandwidth accounting.

Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lore.kernel.org/r/a9ef9fc69bcedb44bddc9bc34f2b313296052819.1716811044.git.bristot@kernel.org
2024-07-29 12:22:36 +02:00
Daniel Bristot de Oliveira
a110a81c52 sched/deadline: Deferrable dl server
Among the motivations for the DL servers is the real-time throttling
mechanism. This mechanism works by throttling the rt_rq after
running for a long period without leaving space for fair tasks.

The base dl server avoids this problem by boosting fair tasks instead
of throttling the rt_rq. The point is that it boosts without waiting
for potential starvation, causing some non-intuitive cases.

For example, an IRQ dispatches two tasks on an idle system, a fair
and an RT. The DL server will be activated, running the fair task
before the RT one. This problem can be avoided by deferring the
dl server activation.

By setting the defer option, the dl_server will dispatch an
SCHED_DEADLINE reservation with replenished runtime, but throttled.

The dl_timer will be set for the defer time at (period - runtime) ns
from start time. Thus boosting the fair rq at defer time.

If the fair scheduler has the opportunity to run while waiting
for defer time, the dl server runtime will be consumed. If
the runtime is completely consumed before the defer time, the
server will be replenished while still in a throttled state. Then,
the dl_timer will be reset to the new defer time

If the fair server reaches the defer time without consuming
its runtime, the server will start running, following CBS rules
(thus without breaking SCHED_DEADLINE). Then the server will
continue the running state (without deferring) until it fair
tasks are able to execute as regular fair scheduler (end of
the starvation).

Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lore.kernel.org/r/dd175943c72533cd9f0b87767c6499204879cc38.1716811044.git.bristot@kernel.org
2024-07-29 12:22:36 +02:00
Peter Zijlstra
557a6bfc66 sched/fair: Add trivial fair server
Use deadline servers to service fair tasks.

This patch adds a fair_server deadline entity which acts as a container
for fair entities and can be used to fix starvation when higher priority
(wrt fair) tasks are monopolizing CPU(s).

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lore.kernel.org/r/b6b0bcefaf25391bcf5b6ecdb9f1218de402d42e.1716811044.git.bristot@kernel.org
2024-07-29 12:22:36 +02:00
Youssef Esmat
a741b82423 sched/core: Clear prev->dl_server in CFS pick fast path
In case the previous pick was a DL server pick, ->dl_server might be
set. Clear it in the fast path as well.

Fixes: 63ba8422f8 ("sched/deadline: Introduce deadline servers")
Signed-off-by: Youssef Esmat <youssefesmat@google.com>
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Juri Lelli <juri.lelli@redhat.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/7f7381ccba09efcb4a1c1ff808ed58385eccc222.1716811044.git.bristot@kernel.org
2024-07-29 12:22:35 +02:00
Joel Fernandes (Google)
c245910049 sched/core: Add clearing of ->dl_server in put_prev_task_balance()
Paths using put_prev_task_balance() need to do a pick shortly
after. Make sure they also clear the ->dl_server on prev as a
part of that.

Fixes: 63ba8422f8 ("sched/deadline: Introduce deadline servers")
Signed-off-by: "Joel Fernandes (Google)" <joel@joelfernandes.org>
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Juri Lelli <juri.lelli@redhat.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/d184d554434bedbad0581cb34656582d78655150.1716811044.git.bristot@kernel.org
2024-07-29 12:22:35 +02:00
Tianchen Ding
faa42d2941 sched/fair: Make SCHED_IDLE entity be preempted in strict hierarchy
Consider the following cgroup:

                       root
                        |
             ------------------------
             |                      |
       normal_cgroup            idle_cgroup
             |                      |
   SCHED_IDLE task_A           SCHED_NORMAL task_B

According to the cgroup hierarchy, A should preempt B. But current
check_preempt_wakeup_fair() treats cgroup se and task separately, so B
will preempt A unexpectedly.
Unify the wakeup logic by {c,p}se_is_idle only. This makes SCHED_IDLE of
a task a relative policy that is effective only within its own cgroup,
similar to the behavior of NICE.

Also fix se_is_idle() definition when !CONFIG_FAIR_GROUP_SCHED.

Fixes: 304000390f ("sched: Cgroup SCHED_IDLE support")
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Josh Don <joshdon@google.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20240626023505.1332596-1-dtcccc@linux.alibaba.com
2024-07-29 12:22:35 +02:00
Phil Auld
a58501fb83 sched: remove HZ_BW feature hedge
As a hedge against unexpected user issues commit 88c56cfeae
("sched/fair: Block nohz tick_stop when cfs bandwidth in use")
included a scheduler feature to disable the new functionality.
It's been a few releases (v6.6) and no screams, so remove it.

Signed-off-by: Phil Auld <pauld@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lore.kernel.org/r/20240515133705.3632915-1-pauld@redhat.com
2024-07-29 12:22:34 +02:00
Chuyi Zhou
2c2d962469 sched/fair: Remove cfs_rq::nr_spread_over and cfs_rq::exec_clock
nr_spread_over tracks the number of instances where the difference
between a scheduling entity's virtual runtime and the minimum virtual
runtime in the runqueue exceeds three times the scheduler latency,
indicating significant disparity in task scheduling.
Commit that removed its usage: 5e963f2bd: sched/fair: Commit to EEVDF

cfs_rq->exec_clock was used to account for time spent executing tasks.
Commit that removed its usage: 5d69eca542 sched: Unify runtime
accounting across classes

cfs_rq::nr_spread_over and cfs_rq::exec_clock are not used anymore in
eevdf. Remove them from struct cfs_rq.

Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Acked-by: Vishal Chourasia <vishalc@linux.ibm.com>
Link: https://lore.kernel.org/r/20240717143342.593262-1-zhouchuyi@bytedance.com
2024-07-29 12:22:34 +02:00
Peilin He
0ec8d5aed4 sched/core: Add WARN_ON_ONCE() to check overflow for migrate_disable()
Background
==========
When repeated migrate_disable() calls are made with missing the
corresponding migrate_enable() calls, there is a risk of
'migration_disabled' going upper overflow because
'migration_disabled' is a type of unsigned short whose max value is
65535.

In PREEMPT_RT kernel, if 'migration_disabled' goes upper overflow, it may
make the migrate_disable() ineffective within local_lock_irqsave(). This
is because, during the scheduling procedure, the value of
'migration_disabled' will be checked, which can trigger CPU migration.
Consequently, the count of 'rcu_read_lock_nesting' may leak due to
local_lock_irqsave() and local_unlock_irqrestore() occurring on different
CPUs.

Usecase
========
For example, When I developed a driver, I encountered a warning like
"WARNING: CPU: 4 PID: 260 at kernel/rcu/tree_plugin.h:315
rcu_note_context_switch+0xa8/0x4e8" warning. It took me half a month
to locate this issue. Ultimately, I discovered that the lack of upper
overflow detection mechanism in migrate_disable() was the root cause,
leading to a significant amount of time spent on problem localization.

If the upper overflow detection mechanism was added to migrate_disable(),
the root cause could be very quickly and easily identified.

Effect
======
Using WARN_ON_ONCE() to check if 'migration_disabled' is upper overflow
can help developers identify the issue quickly.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Peilin He<he.peilin@zte.com.cn>
Signed-off-by: xu xin <xu.xin16@zte.com.cn>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Yunkai Zhang <zhang.yunkai@zte.com.cn>
Reviewed-by: Qiang Tu <tu.qiang35@zte.com.cn>
Reviewed-by: Kun Jiang <jiang.kun2@zte.com.cn>
Reviewed-by: Fan Yu <fan.yu9@zte.com.cn>
Link: https://lkml.kernel.org/r/20240716104244764N2jD8gnBpnsLjCDnQGQ8c@zte.com.cn
2024-07-29 12:22:34 +02:00
Zhang Qiao
c40dd90ac0 sched: Initialize the vruntime of a new task when it is first enqueued
When creating a new task, we initialize vruntime of the newly task at
sched_cgroup_fork(). However, the timing of executing this action is too
early and may not be accurate.

Because it uses current CPU to init the vruntime, but the new task
actually runs on the cpu which be assigned at wake_up_new_task().

To optimize this case, we pass ENQUEUE_INITIAL flag to activate_task()
in wake_up_new_task(), in this way, when place_entity is called in
enqueue_entity(), the vruntime of the new task will be initialized.

In addition, place_entity() in task_fork_fair() was introduced for two
reasons:
1. Previously, the __enqueue_entity() was in task_new_fair(),
in order to provide vruntime for enqueueing the newly task, the
vruntime assignment equation "se->vruntime = cfs_rq->min_vruntime" was
introduced by commit e9acbff648 ("sched: introduce se->vruntime").
This is the initial state of place_entity().

2. commit 4d78e7b656 ("sched: new task placement for vruntime") added
child_runs_first task placement feature which based on vruntime, this
also requires the new task's vruntime value.

After removing the child_runs_first and enqueue_entity() from
task_fork_fair(), this place_entity() no longer makes sense, so remove
it also.

Signed-off-by: Zhang Qiao <zhangqiao22@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20240627133359.1370598-1-zhangqiao22@huawei.com
2024-07-29 12:22:34 +02:00
Yang Yingliang
fe7a11c78d sched/core: Fix unbalance set_rq_online/offline() in sched_cpu_deactivate()
If cpuset_cpu_inactive() fails, set_rq_online() need be called to rollback.

Fixes: 120455c514 ("sched: Fix hotplug vs CPU bandwidth control")
Cc: stable@kernel.org
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20240703031610.587047-5-yangyingliang@huaweicloud.com
2024-07-29 12:22:33 +02:00
Yang Yingliang
2f02735412 sched/core: Introduce sched_set_rq_on/offline() helper
Introduce sched_set_rq_on/offline() helper, so it can be called
in normal or error path simply. No functional changed.

Cc: stable@kernel.org
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20240703031610.587047-4-yangyingliang@huaweicloud.com
2024-07-29 12:22:32 +02:00
Yang Yingliang
e22f910a26 sched/smt: Fix unbalance sched_smt_present dec/inc
I got the following warn report while doing stress test:

jump label: negative count!
WARNING: CPU: 3 PID: 38 at kernel/jump_label.c:263 static_key_slow_try_dec+0x9d/0xb0
Call Trace:
 <TASK>
 __static_key_slow_dec_cpuslocked+0x16/0x70
 sched_cpu_deactivate+0x26e/0x2a0
 cpuhp_invoke_callback+0x3ad/0x10d0
 cpuhp_thread_fun+0x3f5/0x680
 smpboot_thread_fn+0x56d/0x8d0
 kthread+0x309/0x400
 ret_from_fork+0x41/0x70
 ret_from_fork_asm+0x1b/0x30
 </TASK>

Because when cpuset_cpu_inactive() fails in sched_cpu_deactivate(),
the cpu offline failed, but sched_smt_present is decremented before
calling sched_cpu_deactivate(), it leads to unbalanced dec/inc, so
fix it by incrementing sched_smt_present in the error path.

Fixes: c5511d03ec ("sched/smt: Make sched_smt_present track topology")
Cc: stable@kernel.org
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Chen Yu <yu.c.chen@intel.com>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Link: https://lore.kernel.org/r/20240703031610.587047-3-yangyingliang@huaweicloud.com
2024-07-29 12:22:32 +02:00
Yang Yingliang
31b164e2e4 sched/smt: Introduce sched_smt_present_inc/dec() helper
Introduce sched_smt_present_inc/dec() helper, so it can be called
in normal or error path simply. No functional changed.

Cc: stable@kernel.org
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20240703031610.587047-2-yangyingliang@huaweicloud.com
2024-07-29 12:22:32 +02:00
Zheng Zucheng
77baa5bafc sched/cputime: Fix mul_u64_u64_div_u64() precision for cputime
In extreme test scenarios:
the 14th field utime in /proc/xx/stat is greater than sum_exec_runtime,
utime = 18446744073709518790 ns, rtime = 135989749728000 ns

In cputime_adjust() process, stime is greater than rtime due to
mul_u64_u64_div_u64() precision problem.
before call mul_u64_u64_div_u64(),
stime = 175136586720000, rtime = 135989749728000, utime = 1416780000.
after call mul_u64_u64_div_u64(),
stime = 135989949653530

unsigned reversion occurs because rtime is less than stime.
utime = rtime - stime = 135989749728000 - 135989949653530
		      = -199925530
		      = (u64)18446744073709518790

Trigger condition:
  1). User task run in kernel mode most of time
  2). ARM64 architecture
  3). TICK_CPU_ACCOUNTING=y
      CONFIG_VIRT_CPU_ACCOUNTING_NATIVE is not set

Fix mul_u64_u64_div_u64() conversion precision by reset stime to rtime

Fixes: 3dc167ba57 ("sched/cputime: Improve cputime_adjust()")
Signed-off-by: Zheng Zucheng <zhengzucheng@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20240726023235.217771-1-zhengzucheng@huawei.com
2024-07-29 12:22:32 +02:00
Linus Torvalds
5256184b61 Merge tag 'timers-urgent-2024-07-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer migration updates from Thomas Gleixner:
 "Fixes and minor updates for the timer migration code:

   - Stop testing the group->parent pointer as it is not guaranteed to
     be stable over a chain of operations by design.

     This includes a warning which would be nice to have but it produces
     false positives due to the racy nature of the check.

   - Plug a race between CPUs going in and out of idle and a CPU hotplug
     operation. The latter can create and connect a new hierarchy level
     which is missed in the concurrent updates of CPUs which go into
     idle. As a result the events of such a CPU might not be processed
     and timers go stale.

     Cure it by splitting the hotplug operation into a prepare and
     online callback. The prepare callback is guaranteed to run on an
     online and therefore active CPU. This CPU updates the hierarchy and
     being online ensures that there is always at least one migrator
     active which handles the modified hierarchy correctly when going
     idle. The online callback which runs on the incoming CPU then just
     marks the CPU active and brings it into operation.

   - Improve tracing and polish the code further so it is more obvious
     what's going on"

* tag 'timers-urgent-2024-07-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  timers/migration: Fix grammar in comment
  timers/migration: Spare write when nothing changed
  timers/migration: Rename childmask by groupmask to make naming more obvious
  timers/migration: Read childmask and parent pointer in a single place
  timers/migration: Use a single struct for hierarchy walk data
  timers/migration: Improve tracing
  timers/migration: Move hierarchy setup into cpuhotplug prepare callback
  timers/migration: Do not rely always on group->parent
2024-07-27 10:19:55 -07:00
Linus Torvalds
1722389b0d Merge tag 'net-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
 "Including fixes from bpf and netfilter.

  A lot of networking people were at a conference last week, busy
  catching COVID, so relatively short PR.

  Current release - regressions:

   - tcp: process the 3rd ACK with sk_socket for TFO and MPTCP

  Current release - new code bugs:

   - l2tp: protect session IDR and tunnel session list with one lock,
     make sure the state is coherent to avoid a warning

   - eth: bnxt_en: update xdp_rxq_info in queue restart logic

   - eth: airoha: fix location of the MBI_RX_AGE_SEL_MASK field

  Previous releases - regressions:

   - xsk: require XDP_UMEM_TX_METADATA_LEN to actuate tx_metadata_len,
     the field reuses previously un-validated pad

  Previous releases - always broken:

   - tap/tun: drop short frames to prevent crashes later in the stack

   - eth: ice: add a per-VF limit on number of FDIR filters

   - af_unix: disable MSG_OOB handling for sockets in sockmap/sockhash"

* tag 'net-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (34 commits)
  tun: add missing verification for short frame
  tap: add missing verification for short frame
  mISDN: Fix a use after free in hfcmulti_tx()
  gve: Fix an edge case for TSO skb validity check
  bnxt_en: update xdp_rxq_info in queue restart logic
  tcp: process the 3rd ACK with sk_socket for TFO/MPTCP
  selftests/bpf: Add XDP_UMEM_TX_METADATA_LEN to XSK TX metadata test
  xsk: Require XDP_UMEM_TX_METADATA_LEN to actuate tx_metadata_len
  bpf: Fix a segment issue when downgrading gso_size
  net: mediatek: Fix potential NULL pointer dereference in dummy net_device handling
  MAINTAINERS: make Breno the netconsole maintainer
  MAINTAINERS: Update bonding entry
  net: nexthop: Initialize all fields in dumped nexthops
  net: stmmac: Correct byte order of perfect_match
  selftests: forwarding: skip if kernel not support setting bridge fdb learning limit
  tipc: Return non-zero value from tipc_udp_addr2str() on error
  netfilter: nft_set_pipapo_avx2: disable softinterrupts
  ice: Fix recipe read procedure
  ice: Add a per-VF limit on number of FDIR filters
  net: bonding: correctly annotate RCU in bond_should_notify_peers()
  ...
2024-07-25 13:32:25 -07:00
Linus Torvalds
8bf100092d Merge tag 'printk-for-6.11-trivial' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux
Pull printk updates from Petr Mladek:

 - trivial printk changes

The bigger "real" printk work is still being discussed.

* tag 'printk-for-6.11-trivial' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux:
  vsprintf: add missing MODULE_DESCRIPTION() macro
  printk: Rename console_replay_all() and update context
2024-07-25 13:18:41 -07:00
Linus Torvalds
b485625078 Merge tag 'constfy-sysctl-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl
Pull sysctl constification from Joel Granados:
 "Treewide constification of the ctl_table argument of proc_handlers
  using a coccinelle script and some manual code formatting fixups.

  This is a prerequisite to moving the static ctl_table structs into
  read-only data section which will ensure that proc_handler function
  pointers cannot be modified"

* tag 'constfy-sysctl-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl:
  sysctl: treewide: constify the ctl_table argument of proc_handlers
2024-07-25 12:58:36 -07:00
Linus Torvalds
9b21993654 Merge tag 'kgdb-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux
Pull kgdb updates from Daniel Thompson:
 "Three small changes this cycle:

   - Clean up an architecture abstraction that is no longer needed
     because all the architectures have converged.

   - Actually use the prompt argument to kdb_position_cursor() instead
     of ignoring it (functionally this fix is a nop but that was due to
     luck rather than good judgement)

   - Fix a -Wformat-security warning"

* tag 'kgdb-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux:
  kdb: Get rid of redundant kdb_curr_task()
  kdb: Use the passed prompt in kdb_position_cursor()
  kdb: address -Wformat-security warnings
2024-07-25 12:48:42 -07:00
Linus Torvalds
9cf601e865 Merge tag 'dma-mapping-6.11-2024-07-24' of git://git.infradead.org/users/hch/dma-mapping
Pull dma-mapping fix from Christoph Hellwig:

 - fix the order of actions in dmam_free_coherent (Lance Richardson)

* tag 'dma-mapping-6.11-2024-07-24' of git://git.infradead.org/users/hch/dma-mapping:
  dma: fix call order in dmam_free_coherent
2024-07-25 10:10:34 -07:00
Jakub Kicinski
f7578df913 Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:

====================
pull-request: bpf 2024-07-25

We've added 14 non-merge commits during the last 8 day(s) which contain
a total of 19 files changed, 177 insertions(+), 70 deletions(-).

The main changes are:

1) Fix af_unix to disable MSG_OOB handling for sockets in BPF sockmap and
   BPF sockhash. Also add test coverage for this case, from Michal Luczaj.

2) Fix a segmentation issue when downgrading gso_size in the BPF helper
   bpf_skb_adjust_room(), from Fred Li.

3) Fix a compiler warning in resolve_btfids due to a missing type cast,
   from Liwei Song.

4) Fix stack allocation for arm64 to align the stack pointer at a 16 byte
   boundary in the fexit_sleep BPF selftest, from Puranjay Mohan.

5) Fix a xsk regression to require a flag when actuating tx_metadata_len,
   from Stanislav Fomichev.

6) Fix function prototype BTF dumping in libbpf for prototypes that have
   no input arguments, from Andrii Nakryiko.

7) Fix stacktrace symbol resolution in perf script for BPF programs
   containing subprograms, from Hou Tao.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  selftests/bpf: Add XDP_UMEM_TX_METADATA_LEN to XSK TX metadata test
  xsk: Require XDP_UMEM_TX_METADATA_LEN to actuate tx_metadata_len
  bpf: Fix a segment issue when downgrading gso_size
  tools/resolve_btfids: Fix comparison of distinct pointer types warning in resolve_btfids
  bpf, events: Use prog to emit ksymbol event for main program
  selftests/bpf: Test sockmap redirect for AF_UNIX MSG_OOB
  selftests/bpf: Parametrize AF_UNIX redir functions to accept send() flags
  selftests/bpf: Support SOCK_STREAM in unix_inet_redir_to_connected()
  af_unix: Disable MSG_OOB handling for sockets in sockmap/sockhash
  bpftool: Fix typo in usage help
  libbpf: Fix no-args func prototype BTF dumping syntax
  MAINTAINERS: Update powerpc BPF JIT maintainers
  MAINTAINERS: Update email address of Naveen
  selftests/bpf: fexit_sleep: Fix stack allocation for arm64
====================

Link: https://patch.msgid.link/20240725114312.32197-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-07-25 07:40:25 -07:00
Joel Granados
78eb4ea25c sysctl: treewide: constify the ctl_table argument of proc_handlers
const qualify the struct ctl_table argument in the proc_handler function
signatures. This is a prerequisite to moving the static ctl_table
structs into .rodata data which will ensure that proc_handler function
pointers cannot be modified.

This patch has been generated by the following coccinelle script:

```
  virtual patch

  @r1@
  identifier ctl, write, buffer, lenp, ppos;
  identifier func !~ "appldata_(timer|interval)_handler|sched_(rt|rr)_handler|rds_tcp_skbuf_handler|proc_sctp_do_(hmac_alg|rto_min|rto_max|udp_port|alpha_beta|auth|probe_interval)";
  @@

  int func(
  - struct ctl_table *ctl
  + const struct ctl_table *ctl
    ,int write, void *buffer, size_t *lenp, loff_t *ppos);

  @r2@
  identifier func, ctl, write, buffer, lenp, ppos;
  @@

  int func(
  - struct ctl_table *ctl
  + const struct ctl_table *ctl
    ,int write, void *buffer, size_t *lenp, loff_t *ppos)
  { ... }

  @r3@
  identifier func;
  @@

  int func(
  - struct ctl_table *
  + const struct ctl_table *
    ,int , void *, size_t *, loff_t *);

  @r4@
  identifier func, ctl;
  @@

  int func(
  - struct ctl_table *ctl
  + const struct ctl_table *ctl
    ,int , void *, size_t *, loff_t *);

  @r5@
  identifier func, write, buffer, lenp, ppos;
  @@

  int func(
  - struct ctl_table *
  + const struct ctl_table *
    ,int write, void *buffer, size_t *lenp, loff_t *ppos);

```

* Code formatting was adjusted in xfs_sysctl.c to comply with code
  conventions. The xfs_stats_clear_proc_handler,
  xfs_panic_mask_proc_handler and xfs_deprecated_dointvec_minmax where
  adjusted.

* The ctl_table argument in proc_watchdog_common was const qualified.
  This is called from a proc_handler itself and is calling back into
  another proc_handler, making it necessary to change it as part of the
  proc_handler migration.

Co-developed-by: Thomas Weißschuh <linux@weissschuh.net>
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Co-developed-by: Joel Granados <j.granados@samsung.com>
Signed-off-by: Joel Granados <j.granados@samsung.com>
2024-07-24 20:59:29 +02:00
Linus Torvalds
ca83c61cb3 Merge tag 'kbuild-v6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull Kbuild updates from Masahiro Yamada:

 - Remove tristate choice support from Kconfig

 - Stop using the PROVIDE() directive in the linker script

 - Reduce the number of links for the combination of CONFIG_KALLSYMS and
   CONFIG_DEBUG_INFO_BTF

 - Enable the warning for symbol reference to .exit.* sections by
   default

 - Fix warnings in RPM package builds

 - Improve scripts/make_fit.py to generate a FIT image with separate
   base DTB and overlays

 - Improve choice value calculation in Kconfig

 - Fix conditional prompt behavior in choice in Kconfig

 - Remove support for the uncommon EMAIL environment variable in Debian
   package builds

 - Remove support for the uncommon "name <email>" form for the DEBEMAIL
   environment variable

 - Raise the minimum supported GNU Make version to 4.0

 - Remove stale code for the absolute kallsyms

 - Move header files commonly used for host programs to scripts/include/

 - Introduce the pacman-pkg target to generate a pacman package used in
   Arch Linux

 - Clean up Kconfig

* tag 'kbuild-v6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (65 commits)
  kbuild: doc: gcc to CC change
  kallsyms: change sym_entry::percpu_absolute to bool type
  kallsyms: unify seq and start_pos fields of struct sym_entry
  kallsyms: add more original symbol type/name in comment lines
  kallsyms: use \t instead of a tab in printf()
  kallsyms: avoid repeated calculation of array size for markers
  kbuild: add script and target to generate pacman package
  modpost: use generic macros for hash table implementation
  kbuild: move some helper headers from scripts/kconfig/ to scripts/include/
  Makefile: add comment to discourage tools/* addition for kernel builds
  kbuild: clean up scripts/remove-stale-files
  kconfig: recursive checks drop file/lineno
  kbuild: rpm-pkg: introduce a simple changelog section for kernel.spec
  kallsyms: get rid of code for absolute kallsyms
  kbuild: Create INSTALL_PATH directory if it does not exist
  kbuild: Abort make on install failures
  kconfig: remove 'e1' and 'e2' macros from expression deduplication
  kconfig: remove SYMBOL_CHOICEVAL flag
  kconfig: add const qualifiers to several function arguments
  kconfig: call expr_eliminate_yn() at least once in expr_eliminate_dups()
  ...
2024-07-23 14:32:21 -07:00
Linus Torvalds
d2d721e2eb Merge tag 'livepatching-for-6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/livepatching/livepatching
Pull livepatching update from Petr Mladek:

 - show patch->replace flag in sysfs

 - add or improve few selftests

* tag 'livepatching-for-6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/livepatching/livepatching:
  livepatch: Replace snprintf() with sysfs_emit()
  selftests/livepatch: Add selftests for "replace" sysfs attribute
  livepatch: Add "replace" sysfs attribute
  selftests: livepatch: Test atomic replace against multiple modules
  selftests/livepatch: define max test-syscall processes
2024-07-23 11:11:51 -07:00
Linus Torvalds
66ebbdfdeb Merge tag 'irq-msi-2024-07-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull MSI interrupt updates from Thomas Gleixner:
 "Switch ARM/ARM64 over to the modern per device MSI domains.

  This simplifies the handling of platform MSI and wire to MSI
  controllers and removes about 500 lines of legacy code.

  Aside of that it paves the way for ARM/ARM64 to utilize the dynamic
  allocation of PCI/MSI interrupts and to support the upcoming non
  standard IMS (Interrupt Message Store) mechanism on PCIe devices"

* tag 'irq-msi-2024-07-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (25 commits)
  irqchip/gic-v3-its: Correctly fish out the DID for platform MSI
  irqchip/gic-v3-its: Correctly honor the RID remapping
  genirq/msi: Move msi_device_data to core
  genirq/msi: Remove platform MSI leftovers
  irqchip/irq-mvebu-icu: Remove platform MSI leftovers
  irqchip/irq-mvebu-sei: Switch to MSI parent
  irqchip/mvebu-odmi: Switch to parent MSI
  irqchip/mvebu-gicp: Switch to MSI parent
  irqchip/irq-mvebu-icu: Prepare for real per device MSI
  irqchip/imx-mu-msi: Switch to MSI parent
  irqchip/gic-v2m: Switch to device MSI
  irqchip/gic_v3_mbi: Switch over to parent domain
  genirq/msi: Remove platform_msi_create_device_domain()
  irqchip/mbigen: Remove platform_msi_create_device_domain() fallback
  irqchip/gic-v3-its: Switch platform MSI to MSI parent
  irqchip/irq-msi-lib: Prepare for DOMAIN_BUS_WIRED_TO_MSI
  irqchip/mbigen: Prepare for real per device MSI
  irqchip/irq-msi-lib: Prepare for DEVICE MSI to replace platform MSI
  irqchip/gic-v3-its: Provide MSI parent for PCI/MSI[-X]
  irqchip/irq-msi-lib: Prepare for PCI MSI/MSIX
  ...
2024-07-22 14:02:19 -07:00
Linus Torvalds
ac7473a179 Merge tag 'irq-core-2024-07-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull interrupt subsystem updates from Thomas Gleixner:
 "Core:

   - Provide a new mechanism to create interrupt domains. The existing
     interfaces have already too many parameters and it's a pain to
     expand any of this for new required functionality.

     The new function takes a pointer to a data structure as argument.
     The data structure combines all existing parameters and allows for
     easy extension.

     The first extension for this is to handle the instantiation of
     generic interrupt chips at the core level and to allow drivers to
     provide extra init/exit callbacks.

     This is necessary to do the full interrupt chip initialization
     before the new domain is published, so that concurrent usage sites
     won't see a half initialized interrupt domain. Similar problems
     exist on teardown.

     This has turned out to be a real problem due to the deferred and
     parallel probing which was added in recent years.

     Handling this at the core level allows to remove quite some accrued
     boilerplate code in existing drivers and avoids horrible
     workarounds at the driver level.

   - The usual small improvements all over the place

  Drivers:

   - Add support for LAN966x OIC and RZ/Five SoC

   - Split the STM ExtI driver into a microcontroller and a SMP version
     to allow building the latter as a module for multi-platform
     kernels

   - Enable MSI support for Armada 370XP on platforms which do not
     support IPIs

   - The usual small fixes and enhancements all over the place"

* tag 'irq-core-2024-07-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (59 commits)
  irqdomain: Fix the kernel-doc and plug it into Documentation
  genirq: Set IRQF_COND_ONESHOT in request_irq()
  irqchip/imx-irqsteer: Handle runtime power management correctly
  irqchip/gic-v3: Pass #redistributor-regions to gic_of_setup_kvm_info()
  irqchip/bcm2835: Enable SKIP_SET_WAKE and MASK_ON_SUSPEND
  irqchip/gic-v4: Make sure a VPE is locked when VMAPP is issued
  irqchip/gic-v4: Substitute vmovp_lock for a per-VM lock
  irqchip/gic-v4: Always configure affinity on VPE activation
  Revert "irqchip/dw-apb-ictl: Support building as module"
  Revert "Loongarch: Support loongarch avec"
  arm64: Kconfig: Allow build irq-stm32mp-exti driver as module
  ARM: stm32: Allow build irq-stm32mp-exti driver as module
  irqchip/stm32mp-exti: Allow building as module
  irqchip/stm32mp-exti: Rename internal symbols
  irqchip/stm32-exti: Split MCU and MPU code
  arm64: Kconfig: Select STM32MP_EXTI on STM32 platforms
  ARM: stm32: Use different EXTI driver on ARMv7m and ARMv7a
  irqchip/stm32-exti: Add CONFIG_STM32MP_EXTI
  irqchip/dw-apb-ictl: Support building as module
  irqchip/riscv-aplic: Simplify the initialization code
  ...
2024-07-22 13:52:05 -07:00
Anna-Maria Behnsen
f004bf9de0 timers/migration: Fix grammar in comment
Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20240716-tmigr-fixes-v4-8-757baa7803fe@linutronix.de
2024-07-22 18:03:34 +02:00
Anna-Maria Behnsen
2367e28e23 timers/migration: Spare write when nothing changed
The wakeup value is written unconditionally in tmigr_cpu_new_timer(). When
there was no new next timer expiry that needs to be propagated, then the
value that was read before is written. This is not required.

Move the write to the place where wakeup value is changed changed.

Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20240716-tmigr-fixes-v4-7-757baa7803fe@linutronix.de
2024-07-22 18:03:34 +02:00
Anna-Maria Behnsen
835a9a67f5 timers/migration: Rename childmask by groupmask to make naming more obvious
childmask in the group reflects the mask that is required to 'reference'
this group in the parent. When reading childmask, this might be confusing,
as this suggests, that this is the mask of the child of the group.

Clarify this by renaming childmask in the tmigr_group and tmc_group by
groupmask.

Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20240716-tmigr-fixes-v4-6-757baa7803fe@linutronix.de
2024-07-22 18:03:34 +02:00
Anna-Maria Behnsen
d47be58984 timers/migration: Read childmask and parent pointer in a single place
Reading the childmask and parent pointer is required when propagating
changes through the hierarchy. At the moment this reads are spread all over
the place which makes it harder to follow.

Move those reads to a single place to keep code clean.

Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20240716-tmigr-fixes-v4-5-757baa7803fe@linutronix.de
2024-07-22 18:03:34 +02:00
Anna-Maria Behnsen
3ba111032b timers/migration: Use a single struct for hierarchy walk data
Two different structs are defined for propagating data from one to another
level when walking the hierarchy. Several struct members exist in both
structs which makes generalization harder.

Merge those two structs into a single one and use it directly in
walk_groups() and the corresponding function pointers instead of
introducing pointer casting all over the place.

Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20240716-tmigr-fixes-v4-4-757baa7803fe@linutronix.de
2024-07-22 18:03:34 +02:00