Commit Graph

1352327 Commits

Author SHA1 Message Date
Lifeng Zheng
922607a2b4 cpufreq: CPPC: Add support for autonomous selection
Add sysfs interfaces for CPPC autonomous selection in the cppc_cpufreq
driver.

Signed-off-by: Lifeng Zheng <zhenglifeng1@huawei.com>
Reviewed-by: Sumit Gupta <sumitg@nvidia.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Link: https://patch.msgid.link/20250507031941.2812701-1-zhenglifeng1@huawei.com
[ rjw: Subject edits ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-05-21 22:47:44 +02:00
Bowen Yu
1da98dc52b cpufreq: Update sscanf() to kstrtouint()
In store_scaling_setspeed(), sscanf is still used to read to sysfs.
Newer kstrtox provide more features including overflow protection,
better errorhandling and allows for other systems of numeration. It
is therefore better to update sscanf() to kstrtouint().

Signed-off-by: Bowen Yu <yubowen8@huawei.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Link: https://patch.msgid.link/20250519070938.931396-1-yubowen8@huawei.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-05-21 22:27:36 +02:00
Bowen Yu
9c5075fc9d cpufreq: Replace magic number
Setting the length of str_governor with a magic number could cause
overflow when max length increases, it is better to use the defined
macro in this case.

Signed-off-by: Bowen Yu <yubowen8@huawei.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Link: https://patch.msgid.link/20250519070908.930879-1-yubowen8@huawei.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-05-21 22:19:10 +02:00
Rafael J. Wysocki
fd3d883aa5 Merge tag 'amd-pstate-v6.16-2025-05-15' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/superm1/linux
Merge an amd-pstate driver fix for 6.16 (5/15/25) from Mario
Liminciello:

"Fix an error caught with -Werror in amd-pstate-ut."

* tag 'amd-pstate-v6.16-2025-05-15' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/superm1/linux:
  cpufreq/amd-pstate: Avoid shadowing ret in amd_pstate_ut_check_driver()
2025-05-16 18:39:08 +02:00
Seyediman Seyedarab
a0982afa09 cpufreq: drop redundant cpus_read_lock() from store_local_boost()
Lockdep reports a possible circular locking dependency [1] when
cpu_hotplug_lock is acquired inside store_local_boost(), after
policy->rwsem has already been taken by store().

However, the boost update is strictly per-policy and does not
access shared state or iterate over all policies.

Since policy->rwsem is already held, this is enough to serialize
against concurrent topology changes for the current policy.

Remove the cpus_read_lock() to resolve the lockdep warning and
avoid unnecessary locking.

 [1]
 ======================================================
 WARNING: possible circular locking dependency detected
 6.15.0-rc6-debug-gb01fc4eca73c #1 Not tainted
 ------------------------------------------------------
 power-profiles-/588 is trying to acquire lock:
 ffffffffb3a7d910 (cpu_hotplug_lock){++++}-{0:0}, at: store_local_boost+0x56/0xd0

 but task is already holding lock:
 ffff8b6e5a12c380 (&policy->rwsem){++++}-{4:4}, at: store+0x37/0x90

 which lock already depends on the new lock.

 the existing dependency chain (in reverse order) is:

 -> #2 (&policy->rwsem){++++}-{4:4}:
        down_write+0x29/0xb0
        cpufreq_online+0x7e8/0xa40
        cpufreq_add_dev+0x82/0xa0
        subsys_interface_register+0x148/0x160
        cpufreq_register_driver+0x15d/0x260
        amd_pstate_register_driver+0x36/0x90
        amd_pstate_init+0x1e7/0x270
        do_one_initcall+0x68/0x2b0
        kernel_init_freeable+0x231/0x270
        kernel_init+0x15/0x130
        ret_from_fork+0x2c/0x50
        ret_from_fork_asm+0x11/0x20

 -> #1 (subsys mutex#3){+.+.}-{4:4}:
        __mutex_lock+0xc2/0x930
        subsys_interface_register+0x7f/0x160
        cpufreq_register_driver+0x15d/0x260
        amd_pstate_register_driver+0x36/0x90
        amd_pstate_init+0x1e7/0x270
        do_one_initcall+0x68/0x2b0
        kernel_init_freeable+0x231/0x270
        kernel_init+0x15/0x130
        ret_from_fork+0x2c/0x50
        ret_from_fork_asm+0x11/0x20

 -> #0 (cpu_hotplug_lock){++++}-{0:0}:
        __lock_acquire+0x10ed/0x1850
        lock_acquire.part.0+0x69/0x1b0
        cpus_read_lock+0x2a/0xc0
        store_local_boost+0x56/0xd0
        store+0x50/0x90
        kernfs_fop_write_iter+0x132/0x200
        vfs_write+0x2b3/0x590
        ksys_write+0x74/0xf0
        do_syscall_64+0xbb/0x1d0
        entry_SYSCALL_64_after_hwframe+0x56/0x5e

Signed-off-by: Seyediman Seyedarab <ImanDevel@gmail.com>
Link: https://patch.msgid.link/20250513015726.1497-1-ImanDevel@gmail.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-05-14 14:41:15 +02:00
Nathan Chancellor
f6b1eebbdb cpufreq/amd-pstate: Avoid shadowing ret in amd_pstate_ut_check_driver()
Clang warns (or errors with CONFIG_WERROR=y):

  drivers/cpufreq/amd-pstate-ut.c:262:6: error: variable 'ret' is uninitialized when used here [-Werror,-Wuninitialized]
    262 |         if (ret)
        |             ^~~

ret is declared at the top of the function and in the for loop so the
initialization of ret is local to the loop. Remove the declaration in
the for loop so that ret is always used initialized.

Fixes: d26d16438b ("amd-pstate-ut: Reset amd-pstate driver mode after running selftests")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202505072226.g2QclhaR-lkp@intel.com/
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Acked-by: Mario Limonciello <mario.limonciello@amd.com>
Link: https://lore.kernel.org/r/20250512-amd-pstate-ut-uninit-ret-v1-1-fcb4104f502e@kernel.org
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
2025-05-13 11:12:44 -05:00
Rafael J. Wysocki
f20af84c29 cpufreq: intel_pstate: Document hybrid processor support
Describe the support for hybrid processors in intel_pstate, including
the CAS and EAS support, in the admin-guide documentation.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://patch.msgid.link/1935040.CQOukoFCf9@rjwysocki.net
2025-05-13 14:35:25 +02:00
Rafael J. Wysocki
05cf8b8c51 cpufreq: intel_pstate: EAS: Increase cost for CPUs using L3 cache
On some hybrid platforms some efficient CPUs (E-cores) are not connected
to the L3 cache, but there are no other differences between them and the
other E-cores that use L3.  In that case, it is generally more efficient
to run "light" workloads on the E-cores that do not use L3 and allow all
of the cores using L3, including P-cores, to go into idle states.

For this reason, slightly increase the cost for all CPUs sharing the L3
cache to make EAS prefer CPUs that do not use it to the other CPUs of
the same type (if any).

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://patch.msgid.link/2032776.usQuhbGJ8B@rjwysocki.net
2025-05-13 14:35:25 +02:00
Rafael J. Wysocki
7b010f9b90 cpufreq: intel_pstate: EAS support for hybrid platforms
Modify intel_pstate to register EM perf domains for CPUs on hybrid
platforms without SMT which causes EAS to be enabled on them when
schedutil is used as the cpufreq governor (which requires intel_pstate
to operate in the passive mode).

This change is targeting platforms (for example, Lunar Lake) where the
"little" CPUs (E-cores) are always more energy-efficient than the "big"
or "performance" CPUs (P-cores) when run at the same HWP performance
level, so it is sufficient to tell EAS that E-cores are always preferred
(so long as there is enough spare capacity on one of them to run the
given task).  However, migrating tasks between CPUs of the same type
too often is not desirable because it may hurt both performance and
energy efficiency due to leaving warm caches behind.

For this reason, register a separate perf domain for each CPU and choose
the cost values for them so that the cost mostly depends on the CPU type,
but there is also a small component of it depending on the performance
level (utilization) which helps to balance the load between CPUs of the
same type.

The cost component related to the CPU type is computed with the help of
the observation that the IPC metric value for a given CPU is inversely
proportional to its performance-to-frequency scaling factor and the cost
of running code on it can be assumed to be roughly proportional to that
IPC ratio (in principle, the higher the IPC ratio, the more resources
are utilized when running at a given frequency, so the cost should be
higher).

For all CPUs that are online at the system initialization time, EM perf
domains are registered when the driver starts up, after asymmetric
capacity support has been enabled.  For the CPUs that become online
later, EM perf domains are registered after setting the asymmetric
capacity for them.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Christian Loehle <christian.loehle@arm.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://patch.msgid.link/6057101.MhkbZ0Pkbq@rjwysocki.net
2025-05-13 14:35:25 +02:00
Rafael J. Wysocki
0b224fcc89 Merge Energy Model management code changes for 6.16 2025-05-13 14:34:54 +02:00
Rafael J. Wysocki
4a6b1cf0d4 PM: EM: Introduce em_adjust_cpu_capacity()
Add a function for updating the Energy Model for a CPU after its
capacity has changed, which subsequently will be used by the
intel_pstate driver.

An EM_PERF_DOMAIN_ARTIFICIAL check is added to em_recalc_and_update()
to prevent it from calling em_compute_costs() for an "artificial" perf
domain with a NULL cb parameter which would cause it to crash.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
Tested-by: Christian Loehle <christian.loehle@arm.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://patch.msgid.link/3637203.iIbC2pHGDl@rjwysocki.net
2025-05-13 14:34:18 +02:00
Rafael J. Wysocki
3e3ba654d3 PM: EM: Move CPU capacity check to em_adjust_new_capacity()
Move the check of the CPU capacity currently stored in the energy model
against the arch_scale_cpu_capacity() value to em_adjust_new_capacity()
so it will be done regardless of where the latter is called from.

This will be useful when a new em_adjust_new_capacity() caller is added
subsequently.

While at it, move the pd local variable declaration in
em_check_capacity_update() into the loop in which it is used.

No intentional functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
Tested-by: Christian Loehle <christian.loehle@arm.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://patch.msgid.link/7810787.EvYhyI6sBW@rjwysocki.net
2025-05-13 14:34:18 +02:00
Atul Kumar Pant
c9b83cbe46 PM: EM: Documentation: Fix typos in example driver code
Fix the API name to free the allocated table in the example driver code
that modifies the EM. Also fix the passing of correct table when
updating the cost.

Signed-off-by: Atul Kumar Pant <atulpant.linux@gmail.com>
Link: https://patch.msgid.link/20250511071141.13237-1-atulpant.linux@gmail.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-05-13 14:30:01 +02:00
Rafael J. Wysocki
f7e9228532 cpufreq: Drop policy locking from cpufreq_policy_is_good_for_eas()
Policy locking was added to cpufreq_policy_is_good_for_eas() by commit
4854649b1f ("cpufreq/sched: Move cpufreq-specific EAS checks to
cpufreq") to address a theoretical race condition, but it turned out to
introduce a circular locking dependency between the policy rwsem and
sched_domains_mutex via cpuset_mutex. This leads to a board lockup on
OdroidN2 that is based on the ARM64 Amlogic Meson SoC.

Drop the policy locking from cpufreq_policy_is_good_for_eas() to address
this issue.

Fixes: 4854649b1f ("cpufreq/sched: Move cpufreq-specific EAS checks to cpufreq")
Closes: https://lore.kernel.org/linux-pm/1bf3df62-0641-459f-99fc-fd511e564b84@samsung.com/
Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://patch.msgid.link/2806514.mvXUDI8C0e@rjwysocki.net
2025-05-13 14:27:11 +02:00
Moon Hee Lee
7330e002c0 PM: EM: Documentation: fix typo in energy-model.rst
Fixes a grammar issue ("than" -> "then") and changes "re-use" to
"reuse" for consistency with modern spelling.

Signed-off-by: Moon Hee Lee <moonhee.lee.ca@gmail.com>
Link: https://patch.msgid.link/20250506220057.5589-1-moonhee.lee.ca@gmail.com
[ rjw: Subject edits ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-05-09 18:34:51 +02:00
Rafael J. Wysocki
41abd479d1 Merge tag 'amd-pstate-v6.16-2025-05-08' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/superm1/linux
Merge amd-pstate content for 6.16 (5/8/25) from Mario Limonciello:

 - Add support for a new feature on some BIOS that allows setting
   "lowest CPU minimum frequency".

 - Fix the amd-pstate-ut unit tests to restore system settings when
   done.

* tag 'amd-pstate-v6.16-2025-05-08' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/superm1/linux:
  amd-pstate-ut: Reset amd-pstate driver mode after running selftests
  cpufreq/amd-pstate: Add support for the "Requested CPU Min frequency" BIOS option
  cpufreq/amd-pstate: Add offline, online and suspend callbacks for amd_pstate_driver
  cpufreq/amd-pstate: Move max_perf limiting in amd_pstate_update
2025-05-09 15:17:44 +02:00
Ricardo Neri
f1a50492f5 cpufreq: intel_pstate: Populate the cpu_capacity sysfs entries
Intel hybrid processors have CPUs of different capacity. Populate the
interface /sys/devices/system/cpu/cpuN/cpu_capacity.

This interface uses the per-CPU variable `cpu_scale`. On x86 this
variable has no other use besides feeding the sysfs entries. Initialize
it when setting CPU capacity for the scheduler and scale-invariant code.
Feed it with arch_scale_cpu_capacity() as it gives capacity normalized
to the interval [0, SCHED_CAPACITY_SCALE].

Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-05-07 21:56:55 +02:00
Ricardo Neri
6bceea7a1e arch_topology: Relocate cpu_scale to topology.[h|c]
arch_topology.c provides functionality to parse and scale CPU capacity.
It also provides a corresponding sysfs interface. Some architectures
parse and scale CPU capacity differently as per their own needs. On
Intel processors, for instance, it is responsibility of the Intel
P-state driver.

Relocate the implementation of that interface to a common location in
topology.c. Architectures can use the interface and populate it using
their own mechanisms.

An alternative approach would be to compile arch_topology.c even if
not needed only to get this interface. This approach would create
duplicated and conflicting functionality and data structures.

Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Tested-by: Christian Loehle <christian.loehle@arm.com>
Link: https://patch.msgid.link/20250419025504.9760-2-ricardo.neri-calderon@linux.intel.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-05-07 21:56:55 +02:00
Rafael J. Wysocki
4854649b1f cpufreq/sched: Move cpufreq-specific EAS checks to cpufreq
Doing cpufreq-specific EAS checks that require accessing policy
internals directly from sched_is_eas_possible() is a bit unfortunate,
so introduce cpufreq_ready_for_eas() in cpufreq, move those checks
into that new function and make sched_is_eas_possible() call it.

While at it, address a possible race between the EAS governor check
and governor change by doing the former under the policy rwsem.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Tested-by: Christian Loehle <christian.loehle@arm.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://patch.msgid.link/2317800.iZASKD2KPV@rjwysocki.net
2025-05-07 21:17:56 +02:00
Rafael J. Wysocki
f42c8556a0 cpufreq/sched: schedutil: Add helper for governor checks
Add a helper for checking if schedutil is the current governor for
a given cpufreq policy and use it in sched_is_eas_possible() to avoid
accessing cpufreq policy internals directly from there.

No intentional functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Tested-by: Christian Loehle <christian.loehle@arm.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://patch.msgid.link/3365956.44csPzL39Z@rjwysocki.net
2025-05-07 21:17:56 +02:00
Swapnil Sapkal
d26d16438b amd-pstate-ut: Reset amd-pstate driver mode after running selftests
In amd-pstate-ut, one of the basic test is to switch between all
possible mode combinations. After running this test the mode of the
amd-pstate driver is active mode. Store and reset the mode to its original
state.

Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
Signed-off-by: Swapnil Sapkal <swapnil.sapkal@amd.com>
Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>
Link: https://lore.kernel.org/r/20250430064206.7402-1-swapnil.sapkal@amd.com
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
2025-05-05 12:07:42 -05:00
Rafael J. Wysocki
400da808fd Merge back cpufreq material for 6.16 2025-05-04 12:35:42 +02:00
Srinivas Pandruvada
ac4e04d9e3 cpufreq: intel_pstate: Unchecked MSR aceess in legacy mode
When turbo mode is unavailable on a Skylake-X system, executing the
command:

 # echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo

results in an unchecked MSR access error:

 WRMSR to 0x199 (attempted to write 0x0000000100001300).

This issue was reproduced on an OEM (Original Equipment Manufacturer)
system and is not a common problem across all Skylake-X systems.

This error occurs because the MSR 0x199 Turbo Engage Bit (bit 32) is set
when turbo mode is disabled. The issue arises when intel_pstate fails to
detect that turbo mode is disabled. Here intel_pstate relies on
MSR_IA32_MISC_ENABLE bit 38 to determine the status of turbo mode.
However, on this system, bit 38 is not set even when turbo mode is
disabled.

According to the Intel Software Developer's Manual (SDM), the BIOS sets
this bit during platform initialization to enable or disable
opportunistic processor performance operations. Logically, this bit
should be set in such cases. However, the SDM also specifies that "OS
and applications must use CPUID leaf 06H to detect processors with
opportunistic processor performance operations enabled."

Therefore, in addition to checking MSR_IA32_MISC_ENABLE bit 38, verify
that CPUID.06H:EAX[1] is 0 to accurately determine if turbo mode is
disabled.

Fixes: 4521e1a0ce ("cpufreq: intel_pstate: Reflect current no_turbo state correctly")
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: All applicable <stable@vger.kernel.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-04-30 16:27:34 +02:00
Dhananjay Ugwekar
608a76b652 cpufreq/amd-pstate: Add support for the "Requested CPU Min frequency" BIOS option
Initialize lower frequency limit to the "Requested CPU Min frequency"
BIOS option (if it is set) value as part of the driver->init()
callback. The BIOS specified value is passed by the PMFW as min_perf in
CPPC_REQ MSR. To ensure that we don't mistake a stale min_perf value in
CPPC_REQ value as the "Requested CPU Min frequency" during a kexec wakeup,
reset the CPPC_REQ.min_perf value back to the BIOS specified one in the
offline, exit and suspend callbacks.

amd_pstate_target() and amd_pstate_epp_update_limit() which are invoked
as part of the resume() and online() callbacks will take care of restoring
the CPPC_REQ back to the correct values.

Signed-off-by: Dhananjay Ugwekar <dhananjay.ugwekar@amd.com>
Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>
Link: https://lore.kernel.org/r/20250428071623.4309-1-dhananjay.ugwekar@amd.com
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
2025-04-28 16:09:35 -05:00
Dhananjay Ugwekar
98b52c6b0b cpufreq/amd-pstate: Add offline, online and suspend callbacks for amd_pstate_driver
Rename and use the existing amd_pstate_epp callbacks for amd_pstate driver
as well. Remove the debug print in online callback while at it.

These callbacks will be needed to support the "Requested CPU Min Frequency"
BIOS option.

Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Dhananjay Ugwekar <dhananjay.ugwekar@amd.com>
Link: https://lore.kernel.org/r/20250428062520.4997-2-dhananjay.ugwekar@amd.com
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
2025-04-28 16:09:35 -05:00
Yaxiong Tian
179c0c7044 PM: EM: Fix potential division-by-zero error in em_compute_costs()
When the device is of a non-CPU type, table[i].performance won't be
initialized in the previous em_init_performance(), resulting in division
by zero when calculating costs in em_compute_costs().

Since the 'cost' algorithm is only used for EAS energy efficiency
calculations and is currently not utilized by other device drivers, we
should add the _is_cpu_device(dev) check to prevent this division-by-zero
issue.

Fixes: 1b600da510 ("PM: EM: Optimize em_cpu_energy() and remove division")
Signed-off-by: Yaxiong Tian <tianyaxiong@kylinos.cn>
Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
Link: https://patch.msgid.link/tencent_7F99ED4767C1AF7889D0D8AD50F34859CE06@qq.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-04-28 15:28:59 +02:00
Rafael J. Wysocki
b79028039f cpufreq: Fix setting policy limits when frequency tables are used
Commit 7491cdf46b ("cpufreq: Avoid using inconsistent policy->min and
policy->max") overlooked the fact that policy->min and policy->max were
accessed directly in cpufreq_frequency_table_target() and in the
functions called by it.  Consequently, the changes made by that commit
led to problems with setting policy limits.

Address this by passing the target frequency limits to __resolve_freq()
and cpufreq_frequency_table_target() and propagating them to the
functions called by the latter.

Fixes: 7491cdf46b ("cpufreq: Avoid using inconsistent policy->min and policy->max")
Cc: 5.16+ <stable@vger.kernel.org> # 5.16+
Closes: https://lore.kernel.org/linux-pm/aAplED3IA_J0eZN0@linaro.org/
Reported-by: Stephan Gerhold <stephan.gerhold@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Stephan Gerhold <stephan.gerhold@linaro.org>
Reviewed-by: Lifeng Zheng <zhenglifeng1@huawei.com>
Link: https://patch.msgid.link/5896780.DvuYhMxLoT@rjwysocki.net
2025-04-28 14:36:41 +02:00
Linus Torvalds
b4432656b3 Linux 6.15-rc4 v6.15-rc4 2025-04-27 15:19:23 -07:00
Linus Torvalds
5bc1018675 Merge tag 'pci-v6.15-fixes-3' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci
Pull PCI fixes from Bjorn Helgaas:

 - When releasing a start-aligned resource, e.g., a bridge window, save
   start/end/flags for the next assignment attempt; fixes a v6.15-rc1
   regression (Ilpo Järvinen)

 - Move set_pcie_speed.sh from TEST_PROGS to TEST_FILE; fixes a bwctrl
   selftest v6.15-rc1 regression (Ilpo Järvinen)

 - Add Manivannan Sadhasivam as maintainer of native host bridge and
   endpoint drivers (Manivannan Sadhasivam)

 - In endpoint test driver, defer IRQ allocation from .probe() until
   ioctl() to fix a regression on platforms where the Vendor/Device ID
   match doesn't include driver_data (Niklas Cassel)

* tag 'pci-v6.15-fixes-3' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci:
  misc: pci_endpoint_test: Defer IRQ allocation until ioctl(PCITEST_SET_IRQTYPE)
  MAINTAINERS: Move Manivannan Sadhasivam as PCI Native host bridge and endpoint maintainer
  selftests/pcie_bwctrl: Fix test progs list
  PCI: Restore assigned resources fully after release
2025-04-26 13:02:36 -07:00
Linus Torvalds
d22aad29de Merge tag 'nfsd-6.15-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull nfsd fix from Chuck Lever:

 - Revert a v6.15 patch due to a report of SELinux test failures

* tag 'nfsd-6.15-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
  Revert "sunrpc: clean cache_detail immediately when flush is written frequently"
2025-04-26 10:43:03 -07:00
Linus Torvalds
06b31bdbf8 Merge tag 'x86-urgent-2025-04-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull misc x86 fixes from Ingo Molnar:

 - Fix 32-bit kernel boot crash if passed physical memory with more than
   32 address bits

 - Fix Xen PV crash

 - Work around build bug in certain limited build environments

 - Fix CTEST instruction decoding in insn_decoder_test

* tag 'x86-urgent-2025-04-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/insn: Fix CTEST instruction decoding
  x86/boot: Work around broken busybox 'truncate' tool
  x86/mm: Fix _pgd_alloc() for Xen PV mode
  x86/e820: Discard high memory that can't be addressed by 32-bit systems
2025-04-26 09:45:54 -07:00
Linus Torvalds
3d23ef05c3 Merge tag 'sched-urgent-2025-04-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fix from Ingo Molnar:
 "Fix sporadic crashes in dequeue_entities() due to ... bad math.

  [ Arguably if pick_eevdf()/pick_next_entity() was less trusting of
    complex math being correct it could have de-escalated a crash into
    a warning, but that's for a different patch ]"

* tag 'sched-urgent-2025-04-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/eevdf: Fix se->slice being set to U64_MAX and resulting crash
2025-04-26 09:23:20 -07:00
Linus Torvalds
86baa5499c Merge tag 'perf-urgent-2025-04-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull misc perf events fixes from Ingo Molnar:

 - Use POLLERR for events in error state, instead of the ambiguous
   POLLHUP error value

 - Fix non-sampling (counting) events on certain x86 platforms

* tag 'perf-urgent-2025-04-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86: Fix non-sampling (counting) events on certain x86 platforms
  perf/core: Change to POLLERR for pinned events with error
2025-04-26 09:13:09 -07:00
Linus Torvalds
a226e6540b Merge tag 'irq-urgent-2025-04-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fix from Ingo Molnar:
 "Fix crashes in the gic-v2m irqchip driver, caused by an incorrect
  __init annotation"

* tag 'irq-urgent-2025-04-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  irqchip/gic-v2m: Prevent use after free of gicv2m_get_fwnode()
2025-04-26 09:08:45 -07:00
Linus Torvalds
e742bd1990 Merge tag 'loongarch-fixes-6.15-1' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson
Pull LoongArch fixes from Huacai Chen:
 "Add a missing Kconfig option, fix some bugs in exception handlers,
  memory management and KVM"

* tag 'loongarch-fixes-6.15-1' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson:
  LoongArch: KVM: Fix PMU pass-through issue if VM exits to host finally
  LoongArch: KVM: Fully clear some CSRs when VM reboot
  LoongArch: KVM: Fix multiple typos of KVM code
  LoongArch: Return NULL from huge_pte_offset() for invalid PMD
  LoongArch: Remove a bogus reference to ZONE_DMA
  LoongArch: Handle fp, lsx, lasx and lbt assembly symbols
  LoongArch: Make do_xyz() exception handlers more robust
  LoongArch: Make regs_irqs_disabled() more clear
  LoongArch: Select ARCH_USE_MEMTEST
2025-04-26 09:02:41 -07:00
Linus Torvalds
ec0c2d5359 Merge tag 'for-linus' of https://github.com/openrisc/linux
Pull OpenRISC updates from Stafford Horne:

 - Support for cacheinfo API to expose OpenRISC cache info via sysfs,
   this also translated to some cleanups to OpenRISC cache flush and
   invalidate API's

 - Documentation updates for new mailing list and toolchain binaries

* tag 'for-linus' of https://github.com/openrisc/linux:
  Documentation: openrisc: Update toolchain binaries URL
  Documentation: openrisc: Update mailing list
  openrisc: Add cacheinfo support
  openrisc: Introduce new utility functions to flush and invalidate caches
  openrisc: Refactor struct cpuinfo_or1k to reduce duplication
2025-04-26 09:01:13 -07:00
Chuck Lever
831e3f545b Revert "sunrpc: clean cache_detail immediately when flush is written frequently"
Ondrej reports that certain SELinux tests are failing after commit
fc2a169c56 ("sunrpc: clean cache_detail immediately when flush is
written frequently"), merged during the v6.15 merge window.

Reported-by: Ondrej Mosnacek <omosnace@redhat.com>
Fixes: fc2a169c56 ("sunrpc: clean cache_detail immediately when flush is written frequently")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-04-26 12:00:43 -04:00
Linus Torvalds
a16ebe51a6 Merge tag 'move-lib-kunit-v6.15-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull kunit fix from Kees Cook:
 "A single fix for the kunit lib/tests/ relocation:

   - Ensure prime numbers tests are included in KUnit test runs (Mark Brown)"

* tag 'move-lib-kunit-v6.15-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  lib: Ensure prime numbers tests are included in KUnit test runs
2025-04-26 08:55:24 -07:00
Linus Torvalds
fa573aefdf Merge tag 'drm-fixes-2025-04-26' of https://gitlab.freedesktop.org/drm/kernel
Pull drm fixes from Dave Airlie:
 "Weekly drm fixes, mostly amdgpu, with some exynos cleanups and a
  couple of minor fixes, seems a bit quiet, but probably some lag from
  Easter holidays.

  amdgpu:
   - P2P DMA fixes
   - Display reset fixes
   - DCN 3.5 fixes
   - ACPI EDID fix
   - LTTPR fix
   - mode_valid() fix

  exynos:
   - fix spelling error
   - remove redundant error handling in exynos_drm_vidi.c module
   - marks struct decon_data as const in the exynos7_drm_decon driver
     since it is only read
   - Remove unnecessary checking in exynos_drm_drv.c module

  meson:
   - Fix VCLK calculation

  panel:
   - jd9365a: Fix reset polarity"

* tag 'drm-fixes-2025-04-26' of https://gitlab.freedesktop.org/drm/kernel:
  drm/exynos: Fix spelling mistake "enqueu" -> "enqueue"
  drm/exynos: exynos7_drm_decon: Consstify struct decon_data
  drm/exynos: fixed a spelling error
  drm/exynos/vidi: Remove redundant error handling in vidi_get_modes()
  drm/exynos: Remove unnecessary checking
  drm/amd/display: do not copy invalid CRTC timing info
  drm/amd/display: Default IPS to RCG_IN_ACTIVE_IPS2_IN_OFF
  drm/amd/display: Use 16ms AUX read interval for LTTPR with old sinks
  drm/amd/display: Fix ACPI edid parsing on some Lenovo systems
  drm/amdgpu: Allow P2P access through XGMI
  drm/amd/display: Enable urgent latency adjustment on DCN35
  drm/amd/display: Force full update in gpu reset
  drm/amd/display: Fix gpu reset in multidisplay config
  drm/amdgpu: Don't pin VRAM without DMABUF_MOVE_NOTIFY
  drm/amdgpu: Use allowed_domains for pinning dmabufs
  drm: panel: jd9365da: fix reset signal polarity in unprepare
  drm/meson: use unsigned long long / Hz for frequency types
  Revert "drm/meson: vclk: fix calculation of 59.94 fractional rates"
2025-04-26 08:32:29 -07:00
Omar Sandoval
bbce3de72b sched/eevdf: Fix se->slice being set to U64_MAX and resulting crash
There is a code path in dequeue_entities() that can set the slice of a
sched_entity to U64_MAX, which sometimes results in a crash.

The offending case is when dequeue_entities() is called to dequeue a
delayed group entity, and then the entity's parent's dequeue is delayed.
In that case:

1. In the if (entity_is_task(se)) else block at the beginning of
   dequeue_entities(), slice is set to
   cfs_rq_min_slice(group_cfs_rq(se)). If the entity was delayed, then
   it has no queued tasks, so cfs_rq_min_slice() returns U64_MAX.
2. The first for_each_sched_entity() loop dequeues the entity.
3. If the entity was its parent's only child, then the next iteration
   tries to dequeue the parent.
4. If the parent's dequeue needs to be delayed, then it breaks from the
   first for_each_sched_entity() loop _without updating slice_.
5. The second for_each_sched_entity() loop sets the parent's ->slice to
   the saved slice, which is still U64_MAX.

This throws off subsequent calculations with potentially catastrophic
results. A manifestation we saw in production was:

6. In update_entity_lag(), se->slice is used to calculate limit, which
   ends up as a huge negative number.
7. limit is used in se->vlag = clamp(vlag, -limit, limit). Because limit
   is negative, vlag > limit, so se->vlag is set to the same huge
   negative number.
8. In place_entity(), se->vlag is scaled, which overflows and results in
   another huge (positive or negative) number.
9. The adjusted lag is subtracted from se->vruntime, which increases or
   decreases se->vruntime by a huge number.
10. pick_eevdf() calls entity_eligible()/vruntime_eligible(), which
    incorrectly returns false because the vruntime is so far from the
    other vruntimes on the queue, causing the
    (vruntime - cfs_rq->min_vruntime) * load calulation to overflow.
11. Nothing appears to be eligible, so pick_eevdf() returns NULL.
12. pick_next_entity() tries to dereference the return value of
    pick_eevdf() and crashes.

Dumping the cfs_rq states from the core dumps with drgn showed tell-tale
huge vruntime ranges and bogus vlag values, and I also traced se->slice
being set to U64_MAX on live systems (which was usually "benign" since
the rest of the runqueue needed to be in a particular state to crash).

Fix it in dequeue_entities() by always setting slice from the first
non-empty cfs_rq.

Fixes: aef6987d89 ("sched/eevdf: Propagate min_slice up the cgroup hierarchy")
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/f0c2d1072be229e1bdddc73c0703919a8b00c652.1745570998.git.osandov@fb.com
2025-04-26 10:44:36 +02:00
Suzuki K Poulose
3318dc299b irqchip/gic-v2m: Prevent use after free of gicv2m_get_fwnode()
With ACPI in place, gicv2m_get_fwnode() is registered with the pci
subsystem as pci_msi_get_fwnode_cb(), which may get invoked at runtime
during a PCI host bridge probe. But, the call back is wrongly marked as
__init, causing it to be freed, while being registered with the PCI
subsystem and could trigger:

 Unable to handle kernel paging request at virtual address ffff8000816c0400
  gicv2m_get_fwnode+0x0/0x58 (P)
  pci_set_bus_msi_domain+0x74/0x88
  pci_register_host_bridge+0x194/0x548

This is easily reproducible on a Juno board with ACPI boot.

Retain the function for later use.

Fixes: 0644b3daca ("irqchip/gic-v2m: acpi: Introducing GICv2m ACPI support")
Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Cc: stable@vger.kernel.org
2025-04-26 10:17:24 +02:00
Bibo Mao
5add0dbbeb LoongArch: KVM: Fix PMU pass-through issue if VM exits to host finally
In function kvm_pre_enter_guest(), it prepares to enter guest and check
whether there are pending signals or events. And it will not enter guest
if there are, PMU pass-through preparation for guest should be cancelled
and host should own PMU hardware.

Cc: stable@vger.kernel.org
Fixes: f4e40ea9f7 ("LoongArch: KVM: Add PMU support for guest")
Signed-off-by: Bibo Mao <maobibo@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2025-04-26 09:58:13 +08:00
Bibo Mao
9ea86232a5 LoongArch: KVM: Fully clear some CSRs when VM reboot
Some registers such as LOONGARCH_CSR_ESTAT and LOONGARCH_CSR_GINTC are
partly cleared with function _kvm_setcsr(). This comes from the hardware
specification, some bits are read only in VM mode, and however they can
be written in host mode. So they are partly cleared in VM mode, and can
be fully cleared in host mode.

These read only bits show pending interrupt or exception status. When VM
reset, the read-only bits should be cleared, otherwise vCPU will receive
unknown interrupts in boot stage.

Here registers LOONGARCH_CSR_ESTAT/LOONGARCH_CSR_GINTC are fully cleared
in ioctl KVM_REG_LOONGARCH_VCPU_RESET vCPU reset path.

Cc: stable@vger.kernel.org
Signed-off-by: Bibo Mao <maobibo@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2025-04-26 09:58:13 +08:00
Yulong Han
8b2d01fec8 LoongArch: KVM: Fix multiple typos of KVM code
Fix multiple typos inside arch/loongarch/kvm.

Cc: stable@vger.kernel.org
Reviewed-by: Yuli Wang <wangyuli@uniontech.com>
Reviewed-by: Bibo Mao <maobibo@loongson.cn>
Signed-off-by: Yulong Han <wheatfox17@icloud.com>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2025-04-26 09:58:13 +08:00
Ming Wang
bd51834d1c LoongArch: Return NULL from huge_pte_offset() for invalid PMD
LoongArch's huge_pte_offset() currently returns a pointer to a PMD slot
even if the underlying entry points to invalid_pte_table (indicating no
mapping). Callers like smaps_hugetlb_range() fetch this invalid entry
value (the address of invalid_pte_table) via this pointer.

The generic is_swap_pte() check then incorrectly identifies this address
as a swap entry on LoongArch, because it satisfies the "!pte_present()
&& !pte_none()" conditions. This misinterpretation, combined with a
coincidental match by is_migration_entry() on the address bits, leads to
kernel crashes in pfn_swap_entry_to_page().

Fix this at the architecture level by modifying huge_pte_offset() to
check the PMD entry's content using pmd_none() before returning. If the
entry is invalid (i.e., it points to invalid_pte_table), return NULL
instead of the pointer to the slot.

Cc: stable@vger.kernel.org
Acked-by: Peter Xu <peterx@redhat.com>
Co-developed-by: Hongchen Zhang <zhanghongchen@loongson.cn>
Signed-off-by: Hongchen Zhang <zhanghongchen@loongson.cn>
Signed-off-by: Ming Wang <wangming01@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2025-04-26 09:58:12 +08:00
Petr Tesarik
c37325cbd9 LoongArch: Remove a bogus reference to ZONE_DMA
Remove dead code. LoongArch does not have a DMA memory zone (24bit DMA).
The architecture does not even define MAX_DMA_PFN.

Cc: stable@vger.kernel.org
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Petr Tesarik <ptesarik@suse.com>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2025-04-26 09:58:12 +08:00
Tiezhu Yang
2ef174b133 LoongArch: Handle fp, lsx, lasx and lbt assembly symbols
Like the other relevant symbols, export some fp, lsx, lasx and lbt
assembly symbols and put the function declarations in header files
rather than source files.

While at it, use "asmlinkage" for the other existing C prototypes
of assembly functions and also do not use the "extern" keyword with
function declarations according to the document coding-style.rst.

Cc: stable@vger.kernel.org # 6.6+
Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2025-04-26 09:58:12 +08:00
Tiezhu Yang
cc73cc6bcd LoongArch: Make do_xyz() exception handlers more robust
Currently, interrupts need to be disabled before single-step mode is
set, it requires that CSR_PRMD_PIE be cleared in save_local_irqflag()
which is called by setup_singlestep(), this is reasonable.

But in the first kprobe breakpoint exception, if the irq is enabled at
the beginning of do_bp(), it will not be disabled at the end of do_bp()
due to the CSR_PRMD_PIE has been cleared in save_local_irqflag(). So for
this case, it may corrupt exception context when restoring the exception
after do_bp() in handle_bp(), this is not reasonable.

In order to restore exception safely in handle_bp(), it needs to ensure
the irq is disabled at the end of do_bp(), so just add a local variable
to record the original interrupt status in the parent context, then use
it as the check condition to enable and disable irq in do_bp().

While at it, do the similar thing for other do_xyz() exception handlers
to make them more robust.

Fixes: 6d4cc40fb5 ("LoongArch: Add kprobes support")
Suggested-by: Jinyang He <hejinyang@loongson.cn>
Suggested-by: Huacai Chen <chenhuacai@loongson.cn>
Co-developed-by: Tianyang Zhang <zhangtianyang@loongson.cn>
Signed-off-by: Tianyang Zhang <zhangtianyang@loongson.cn>
Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2025-04-26 09:58:12 +08:00
Tiezhu Yang
bb0511d59d LoongArch: Make regs_irqs_disabled() more clear
In the current code, the definition of regs_irqs_disabled() is actually
"!(regs->csr_prmd & CSR_CRMD_IE)" because arch_irqs_disabled_flags() is
defined as "!(flags & CSR_CRMD_IE)", it looks a little strange.

Define regs_irqs_disabled() as !(regs->csr_prmd & CSR_PRMD_PIE) directly
to make it more clear, no functional change.

While at it, the return value of regs_irqs_disabled() is true or false,
so change its type to reflect that and also make it always inline.

Fixes: 803b0fc5c3 ("LoongArch: Add process management")
Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2025-04-26 09:58:12 +08:00
Yuli Wang
fb8e9f59d6 LoongArch: Select ARCH_USE_MEMTEST
As of commit dce4456619 ("mm/memtest: add ARCH_USE_MEMTEST"),
architectures must select ARCH_USE_MEMTESET to enable CONFIG_MEMTEST.

Commit 628c3bb40e ("LoongArch: Add boot and setup routines") added
support for early_memtest but did not select ARCH_USE_MEMTESET.

Fixes: 628c3bb40e ("LoongArch: Add boot and setup routines")
Tested-by: Erpeng Xu <xuerpeng@uniontech.com>
Tested-by: Yuli Wang <wangyuli@uniontech.com>
Signed-off-by: Yuli Wang <wangyuli@uniontech.com>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2025-04-26 09:58:12 +08:00