__sched_setscheduler() goes through an enqueue/dequeue cycle like so:
flags := DEQUEUE_SAVE | DEQUEUE_MOVE | DEQUEUE_NOCLOCK;
prev_class->dequeue_task(rq, p, flags);
new_class->enqueue_task(rq, p, flags);
when prev_class := fair_sched_class, this is followed by:
dequeue_task(rq, p, DEQUEUE_NOCLOCK | DEQUEUE_SLEEP);
the idea being that since the task has switched classes, we need to drop
the sched_delayed logic and have that task be deactivated per its previous
dequeue_task(..., DEQUEUE_SLEEP).
Unfortunately, this leaves the task on_rq. This is missing the tail end of
dequeue_entities() that issues __block_task(), which __sched_setscheduler()
won't have done due to not using DEQUEUE_DELAYED - not that it should, as
it is pretty much a fair_sched_class specific thing.
Make switched_from_fair() properly deactivate sched_delayed tasks upon
class changes via __block_task(), as if a
dequeue_task(..., DEQUEUE_DELAYED)
had been issued.
Fixes: 2e0199df25 ("sched/fair: Prepare exit/cleanup paths for delayed_dequeue")
Reported-by: "Paul E. McKenney" <paulmck@kernel.org>
Reported-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20240829135353.1524260-1-vschneid@redhat.com
In dl_server_start(), when schedstats is enabled, the following
happens:
dl_server_start()
dl_se->dl_server = 1;
enqueue_dl_entity()
update_stats_enqueue_dl()
__schedstats_from_dl_se()
dl_task_of()
BUG_ON(dl_server(dl_se));
Since only tasks have schedstats and internal entries do not, avoid
trying to update stats in this case.
Fixes: 63ba8422f8 ("sched/deadline: Introduce deadline servers")
Signed-off-by: Huang Shijie <shijie@os.amperecomputing.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lkml.kernel.org/r/20240829031111.12142-1-shijie@os.amperecomputing.com
When analyzing a kernel waring message, Peter pointed out that there is a
race condition when the kworker is being frozen and falls into
try_to_freeze() with TASK_INTERRUPTIBLE, which could trigger a
might_sleep() warning in try_to_freeze(). Although the root cause is not
related to freeze()[1], it is still worthy to fix this issue ahead.
One possible race scenario:
CPU 0 CPU 1
----- -----
// kthread_worker_fn
set_current_state(TASK_INTERRUPTIBLE);
suspend_freeze_processes()
freeze_processes
static_branch_inc(&freezer_active);
freeze_kernel_threads
pm_nosig_freezing = true;
if (work) { //false
__set_current_state(TASK_RUNNING);
} else if (!freezing(current)) //false, been frozen
freezing():
if (static_branch_unlikely(&freezer_active))
if (pm_nosig_freezing)
return true;
schedule()
}
// state is still TASK_INTERRUPTIBLE
try_to_freeze()
might_sleep() <--- warning
Fix this by explicitly set the TASK_RUNNING before entering
try_to_freeze().
Link: https://lore.kernel.org/lkml/Zs2ZoAcUsZMX2B%2FI@chenyu5-mobl2/ [1]
Link: https://lkml.kernel.org/r/20240827112308.181081-1-yu.c.chen@intel.com
Fixes: b56c0d8937 ("kthread: implement kthread_worker")
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: David Gow <davidgow@google.com>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Mickaël Salaün <mic@digikod.net>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
When watchdog_hardlockup_probe() is being called by
lockup_detector_delay_init(), an error return of -ENODEV will happen
for the arm64 arch when arch_perf_nmi_is_available() returns false. This
means that NMI is not usable by the hard lockup detector and so has to
be disabled. This can be considered a deficiency in that particular
arm64 chip, but there is nothing we can do about it. That also means
the following error will always be reported when the kernel boot up.
watchdog: Delayed init of the lockup detector failed: -19
The word "failed" itself has a connotation that there is something
wrong with the kernel which is not really the case here. Handle this
special ENODEV case separately and explain the reason behind disabling
hard lockup detector without causing anxiety for those users who read
the above message and wonder about it.
Link: https://lkml.kernel.org/r/20240802151621.617244-1-longman@redhat.com
Signed-off-by: Waiman Long <longman@redhat.com>
Cc: Douglas Anderson <dianders@chromium.org>
Cc: Joel Granados <j.granados@samsung.com>
Cc: Li Zhe <lizhe.67@bytedance.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Thomas Weißschuh <linux@weissschuh.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
On x86_32 Qemu machine with 1GB memory, the cmdline "crashkernel=4G" is ok
as below:
crashkernel reserved: 0x0000000020000000 - 0x0000000120000000 (4096 MB)
It's similar on other architectures, such as ARM32 and RISCV32.
The cause is that the crash_size is parsed and printed with "unsigned long
long" data type which is 8 bytes but allocated used with "phys_addr_t"
which is 4 bytes in memblock_phys_alloc_range().
Fix it by checking if crash_size is greater than system RAM size and
return error if so.
After this patch, there is no above confusing reserve success info.
Link: https://lkml.kernel.org/r/20240729115252.1659112-1-ruanjinjie@huawei.com
Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
Suggested-by: Mike Rapoport <rppt@kernel.org>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Dave Young <dyoung@redhat.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "mm, memcg: cg2 memory{.swap,}.peak write handlers", v7.
This patch (of 2):
Other mechanisms for querying the peak memory usage of either a process or
v1 memory cgroup allow for resetting the high watermark. Restore parity
with those mechanisms, but with a less racy API.
For example:
- Any write to memory.max_usage_in_bytes in a cgroup v1 mount resets
the high watermark.
- writing "5" to the clear_refs pseudo-file in a processes's proc
directory resets the peak RSS.
This change is an evolution of a previous patch, which mostly copied the
cgroup v1 behavior, however, there were concerns about races/ownership
issues with a global reset, so instead this change makes the reset
filedescriptor-local.
Writing any non-empty string to the memory.peak and memory.swap.peak
pseudo-files reset the high watermark to the current usage for subsequent
reads through that same FD.
Notably, following Johannes's suggestion, this implementation moves the
O(FDs that have written) behavior onto the FD write(2) path. Instead, on
the page-allocation path, we simply add one additional watermark to
conditionally bump per-hierarchy level in the page-counter.
Additionally, this takes Longman's suggestion of nesting the
page-charging-path checks for the two watermarks to reduce the number of
common-case comparisons.
This behavior is particularly useful for work scheduling systems that need
to track memory usage of worker processes/cgroups per-work-item. Since
memory can't be squeezed like CPU can (the OOM-killer has opinions), these
systems need to track the peak memory usage to compute system/container
fullness when binpacking workitems.
Most notably, Vimeo's use-case involves a system that's doing global
binpacking across many Kubernetes pods/containers, and while we can use
PSI for some local decisions about overload, we strive to avoid packing
workloads too tightly in the first place. To facilitate this, we track
the peak memory usage. However, since we run with long-lived workers (to
amortize startup costs) we need a way to track the high watermark while a
work-item is executing. Polling runs the risk of missing short spikes
that last for timescales below the polling interval, and peak memory
tracking at the cgroup level is otherwise perfect for this use-case.
As this data is used to ensure that binpacked work ends up with sufficient
headroom, this use-case mostly avoids the inaccuracies surrounding
reclaimable memory.
Link: https://lkml.kernel.org/r/20240730231304.761942-1-davidf@vimeo.com
Link: https://lkml.kernel.org/r/20240729143743.34236-1-davidf@vimeo.com
Link: https://lkml.kernel.org/r/20240729143743.34236-2-davidf@vimeo.com
Signed-off-by: David Finkel <davidf@vimeo.com>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Suggested-by: Waiman Long <longman@redhat.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Zefan Li <lizefan.x@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
As part of the dynamic kernel stack project, we need to know the amount of
data that can be saved by reducing the default kernel stack size [1].
Provide a kernel stack usage histogram to aid in optimizing kernel stack
sizes and minimizing memory waste in large-scale environments. The
histogram divides stack usage into power-of-two buckets and reports the
results in /proc/vmstat. This information is especially valuable in
environments with millions of machines, where even small optimizations can
have a significant impact.
The histogram data is presented in /proc/vmstat with entries like
"kstack_1k", "kstack_2k", and so on, indicating the number of threads that
exited with stack usage falling within each respective bucket.
Example outputs:
Intel:
$ grep kstack /proc/vmstat
kstack_1k 3
kstack_2k 188
kstack_4k 11391
kstack_8k 243
kstack_16k 0
ARM with 64K page_size:
$ grep kstack /proc/vmstat
kstack_1k 1
kstack_2k 340
kstack_4k 25212
kstack_8k 1659
kstack_16k 0
kstack_32k 0
kstack_64k 0
Note: once the dynamic kernel stack is implemented it will depend on the
implementation the usability of this feature: On hardware that supports
faults on kernel stacks, we will have other metrics that show the total
number of pages allocated for stacks. On hardware where faults are not
supported, we will most likely have some optimization where only some
threads are extended, and for those, these metrics will still be very
useful.
[1] https://lwn.net/Articles/974367
Link: https://lkml.kernel.org/r/20240730150158.832783-3-pasha.tatashin@soleen.com
Link: https://lkml.kernel.org/r/20240724203322.2765486-3-pasha.tatashin@soleen.com
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Kent Overstreet <kent.overstreet@linux.dev>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Cc: Li Zhijian <lizhijian@fujitsu.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Pull x86 fixes from Thomas Gleixner:
- x2apic_disable() clears x2apic_state and x2apic_mode unconditionally,
even when the state is X2APIC_ON_LOCKED, which prevents the kernel to
disable it thereby creating inconsistent state.
Reorder the logic so it actually works correctly
- The XSTATE logic for handling LBR is incorrect as it assumes that
XSAVES supports LBR when the CPU supports LBR. In fact both
conditions need to be true. Otherwise the enablement of LBR in the
IA32_XSS MSR fails and subsequently the machine crashes on the next
XRSTORS operation because IA32_XSS is not initialized.
Cache the XSTATE support bit during init and make the related
functions use this cached information and the LBR CPU feature bit to
cure this.
- Cure a long standing bug in KASLR
KASLR uses the full address space between PAGE_OFFSET and vaddr_end
to randomize the starting points of the direct map, vmalloc and
vmemmap regions. It thereby limits the size of the direct map by
using the installed memory size plus an extra configurable margin for
hot-plug memory. This limitation is done to gain more randomization
space because otherwise only the holes between the direct map,
vmalloc, vmemmap and vaddr_end would be usable for randomizing.
The limited direct map size is not exposed to the rest of the kernel,
so the memory hot-plug and resource management related code paths
still operate under the assumption that the available address space
can be determined with MAX_PHYSMEM_BITS.
request_free_mem_region() allocates from (1 << MAX_PHYSMEM_BITS) - 1
downwards. That means the first allocation happens past the end of
the direct map and if unlucky this address is in the vmalloc space,
which causes high_memory to become greater than VMALLOC_START and
consequently causes iounmap() to fail for valid ioremap addresses.
Cure this by exposing the end of the direct map via PHYSMEM_END and
use that for the memory hot-plug and resource management related
places instead of relying on MAX_PHYSMEM_BITS. In the KASLR case
PHYSMEM_END maps to a variable which is initialized by the KASLR
initialization and otherwise it is based on MAX_PHYSMEM_BITS as
before.
- Prevent a data leak in mmio_read(). The TDVMCALL exposes the value of
an initialized variabled on the stack to the VMM. The variable is
only required as output value, so it does not have to exposed to the
VMM in the first place.
- Prevent an array overrun in the resource control code on systems with
Sub-NUMA Clustering enabled because the code failed to adjust the
index by the number of SNC nodes per L3 cache.
* tag 'x86-urgent-2024-09-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/resctrl: Fix arch_mbm_* array overrun on SNC
x86/tdx: Fix data leak in mmio_read()
x86/kaslr: Expose and use the end of the physical memory address space
x86/fpu: Avoid writing LBR bit to IA32_XSS unless supported
x86/apic: Make x2apic_disable() work correctly
Pull locking fix from Thomas Gleixner:
"A single fix for rt_mutex.
The deadlock detection code drops into an infinite scheduling loop
while still holding rt_mutex::wait_lock, which rightfully triggers a
'scheduling in atomic' warning.
Unlock it before that"
* tag 'locking-urgent-2024-08-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
rtmutex: Drop rt_mutex::wait_lock before scheduling
This reverts the following commits:
- 236dec0510 ("kconfig: tinyconfig: provide whole choice blocks to
avoid warnings")
- b0f269728c ("x86/config: Fix warning for 'make ARCH=x86_64
tinyconfig'")
Since commit f79dc03fe6 ("kconfig: refactor choice value calculation"),
it is no longer necessary to disable the remaining options in choice
blocks.
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Since 3cf78c5d01 ("sched_ext: Unpin and repin rq lock from
balance_scx()"), sched_ext's balance path terminates rq_pin in the outermost
function. This is simpler and in line with what other balance functions are
doing but it loses control over rq->clock_update_flags which makes
assert_clock_udpated() trigger if other CPUs pins the rq lock.
The only place this matters is touch_core_sched() which uses the timestamp
to order tasks from sibling rq's. Switch to sched_clock_cpu(). Later, it may
be better to use per-core dispatch sequence number.
v2: Use sched_clock_cpu() instead of ktime_get_ns() per David.
Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 3cf78c5d01 ("sched_ext: Unpin and repin rq lock from balance_scx()")
Acked-by: David Vernet <void@manifault.com>
Cc: Peter Zijlstra <peterz@infradead.org>
When deciding whether a task can be migrated to a CPU,
dispatch_to_local_dsq() was open-coding p->cpus_allowed and scx_rq_online()
tests instead of using task_can_run_on_remote_rq(). This had two problems.
- It was missing is_migration_disabled() check and thus could try to migrate
a task which shouldn't leading to assertion and scheduling failures.
- It was testing p->cpus_ptr directly instead of using task_allowed_on_cpu()
and thus failed to consider ISA compatibility.
Update dispatch_to_local_dsq() to use task_can_run_on_remote_rq():
- Move scx_ops_error() triggering into task_can_run_on_remote_rq().
- When migration isn't allowed, fall back to the global DSQ instead of the
source DSQ by returning DTL_INVALID. This is both simpler and an overall
better behavior.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Acked-by: David Vernet <void@manifault.com>
This patch introduces CONFIG_CPUSETS_V1 and guard cpuset-v1 code under
CONFIG_CPUSETS_V1. The default value of CONFIG_CPUSETS_V1 is N, so that
user who adopted v2 don't have 'pay' for cpuset v1.
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Acked-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Some functions name declared in cpuset-internel.h are generic. To avoid
confilicting with other variables for the same name, rename these
functions with cpuset_/cpuset1_ prefix to make them unique to cpuset.
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Acked-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Move legacy cpuset controller interfaces files and corresponding code
into cpuset-v1.c. 'update_flag', 'cpuset_write_resmask' and
'cpuset_common_seq_show' are also used for v1, so declare them in
cpuset-internal.h.
'cpuset_write_s64', 'cpuset_read_s64' and 'fmeter_getrate' are only used
cpuset-v1.c now, make it static.
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Acked-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The validate_change_legacy functions is used for v1, move it to
cpuset-v1.c. And two micro 'cpuset_for_each_child' and
'cpuset_for_each_descendant_pre' are common for v1 and v2, move them to
cpuset-internal.h.
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Acked-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
There are some differents about hotplug update between cpuset v1 and
cpuset v2. Move the legacy code to cpuset-v1.c.
'update_tasks_cpumask' and 'update_tasks_nodemask' are both used in cpuset
v1 and cpuset v2, declare them in cpuset-internal.h.
The change from original code is that use callback_lock helpers to get
callback_lock lock/unlock.
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Acked-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
To modify cpuset, both cpuset_mutex and callback_lock are needed. Add
helpers for cpuset-v1 to get callback_lock.
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Acked-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
'memory_spread' is only set in cpuset v1. move corresponding code into
cpuset-v1.c.
Currently, 'cpuset_update_task_spread_flags' and 'update_tasks_flags' are
exposed to cpuset.c.
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Acked-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Setting domain level is not supported at cpuset v2, so move corresponding
code into cpuset-v1.c.
The 'cpuset_write_s64' and 'cpuset_read_s64' are only used for setting
domain level, move them to cpuset-v1.c. Currently, expose to cpuset.c.
After cpuset legacy interface files are move to cpuset-v1.c, they can
be static. The 'rebuild_sched_domains_locked' is exposed to cpuset-v1.c.
The change from original code is that using 'cpuset_lock' and
'cpuset_unlock' functions to lock or unlock cpuset_mutex.
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Acked-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Collection of memory_pressure can be enabled by writing 1 to the cpuset
file 'memory_pressure_enabled', which is only for cpuset-v1. Therefore,
move the corresponding code to cpuset-v1.c.
Currently, the 'fmeter_init' and 'fmeter_getrate' functions are called
at cpuset.c, so expose them to cpuset.c.
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Acked-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Move some declarations that will be used for cpuset v1 and v2,
including 'cpuset struct', 'cpuset_flagbits_t', cpuset_filetype_t,etc.
No logical change.
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Acked-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
This patch introduces the cgroup/cpuset-v1.c source file which will be
used for all legacy (cgroup v1) cpuset cgroup code. It also introduces
cgroup/cpuset-internal.h to keep declarations shared between
cgroup/cpuset.c and cpuset/cpuset-v1.c.
As of now, let's compile it if CONFIG_CPUSET is set. Later on it can be
switched to use a separate config option, so that the legacy code won't be
compiled if not required.
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Acked-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
With the "isolcpus" boot command line parameter, we are able to
create isolated CPUs at boot time. These isolated CPUs aren't fully
accounted for in the cpuset code. For instance, the root cgroup's
"cpuset.cpus.isolated" control file does not include the boot time
isolated CPUs. Fix that by looking for pre-isolated CPUs at init time.
The prstate_housekeeping_conflict() function does check the
HK_TYPE_DOMAIN housekeeping cpumask to make sure that CPUs outside of it
can only be used in isolated partition. Given the fact that we are going
to make housekeeping cpumasks dynamic, the current check may not be right
anymore. Save the boot time HK_TYPE_DOMAIN cpumask and check against
it instead of the upcoming dynamic HK_TYPE_DOMAIN housekeeping cpumask.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
According to the documentation, when building a kernel with the C=2
parameter, all source files should be checked. But this does not happen
for the kernel/bpf/ directory.
$ touch kernel/bpf/core.o
$ make C=2 CHECK=true kernel/bpf/core.o
Outputs:
CHECK scripts/mod/empty.c
CALL scripts/checksyscalls.sh
DESCEND objtool
INSTALL libsubcmd_headers
CC kernel/bpf/core.o
As can be seen the compilation is done, but CHECK is not executed. This
happens because kernel/bpf/Makefile has defined its own rule for
compilation and forgotten the macro that does the check.
There is no need to duplicate the build code, and this rule can be
removed to use generic rules.
Acked-by: Masahiro Yamada <masahiroy@kernel.org>
Tested-by: Oleg Nesterov <oleg@redhat.com>
Tested-by: Alan Maguire <alan.maguire@oracle.com>
Signed-off-by: Alexey Gladkov <legion@kernel.org>
Link: https://lore.kernel.org/r/20240830074350.211308-1-legion@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Use perf_allow_kernel() for 'pa_enable' (physical addresses),
'pct_enable' (physical timestamps) and context IDs. This means that
perf_event_paranoid is now taken into account and LSM hooks can be used,
which is more consistent with other perf_event_open calls. For example
PERF_SAMPLE_PHYS_ADDR uses perf_allow_kernel() rather than just
perfmon_capable().
This also indirectly fixes the following error message which is
misleading because perf_event_paranoid is not taken into account by
perfmon_capable():
$ perf record -e arm_spe/pa_enable/
Error:
Access to performance monitoring and observability operations is
limited. Consider adjusting /proc/sys/kernel/perf_event_paranoid
setting ...
Suggested-by: Al Grant <al.grant@arm.com>
Signed-off-by: James Clark <james.clark@linaro.org>
Link: https://lore.kernel.org/r/20240827145113.1224604-1-james.clark@linaro.org
Link: https://lore.kernel.org/all/20240807120039.GD37996@noisy.programming.kicks-ass.net/
Signed-off-by: Will Deacon <will@kernel.org>
In the case where we are forcing the ps.chunk_size to be at least 1,
we are ignoring the caller's alignment.
Move the forcing of ps.chunk_size to be at least 1 before rounding it
up to caller's alignment, so that caller's alignment is honored.
While at it, use max() to force the ps.chunk_size to be at least 1 to
improve readability.
Fixes: 6d45e1c948 ("padata: Fix possible divide-by-0 panic in padata_mt_helper()")
Signed-off-by: Kamlesh Gurudasani <kamlesh@ti.com>
Acked-by: Waiman Long <longman@redhat.com>
Acked-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Currently we cannot pass the pointer returned by iter next method as
argument to KF_TRUSTED_ARGS or KF_RCU kfuncs, because the pointer
returned by iter next method is not "valid".
This patch sets the pointer returned by iter next method to be valid.
This is based on the fact that if the iterator is implemented correctly,
then the pointer returned from the iter next method should be valid.
This does not make NULL pointer valid. If the iter next method has
KF_RET_NULL flag, then the verifier will ask the ebpf program to
check NULL pointer.
KF_RCU_PROTECTED iterator is a special case, the pointer returned by
iter next method should only be valid within RCU critical section,
so it should be with MEM_RCU, not PTR_TRUSTED.
Another special case is bpf_iter_num_next, which returns a pointer with
base type PTR_TO_MEM. PTR_TO_MEM should not be combined with type flag
PTR_TRUSTED (PTR_TO_MEM already means the pointer is valid).
The pointer returned by iter next method of other types of iterators
is with PTR_TRUSTED.
In addition, this patch adds get_iter_from_state to help us get the
current iterator from the current state.
Signed-off-by: Juntong Deng <juntong.deng@outlook.com>
Link: https://lore.kernel.org/r/AM6PR03MB584869F8B448EA1C87B7CDA399962@AM6PR03MB5848.eurprd03.prod.outlook.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This patch adds a .gen_epilogue to the bpf_verifier_ops. It is similar
to the existing .gen_prologue. Instead of allowing a subsystem
to run code at the beginning of a bpf prog, it allows the subsystem
to run code just before the bpf prog exit.
One of the use case is to allow the upcoming bpf qdisc to ensure that
the skb->dev is the same as the qdisc->dev_queue->dev. The bpf qdisc
struct_ops implementation could either fix it up or drop the skb.
Another use case could be in bpf_tcp_ca.c to enforce snd_cwnd
has sane value (e.g. non zero).
The epilogue can do the useful thing (like checking skb->dev) if it
can access the bpf prog's ctx. Unlike prologue, r1 may not hold the
ctx pointer. This patch saves the r1 in the stack if the .gen_epilogue
has returned some instructions in the "epilogue_buf".
The existing .gen_prologue is done in convert_ctx_accesses().
The new .gen_epilogue is done in the convert_ctx_accesses() also.
When it sees the (BPF_JMP | BPF_EXIT) instruction, it will be patched
with the earlier generated "epilogue_buf". The epilogue patching is
only done for the main prog.
Only one epilogue will be patched to the main program. When the
bpf prog has multiple BPF_EXIT instructions, a BPF_JA is used
to goto the earlier patched epilogue. Majority of the archs
support (BPF_JMP32 | BPF_JA): x86, arm, s390, risv64, loongarch,
powerpc and arc. This patch keeps it simple and always
use (BPF_JMP32 | BPF_JA). A new macro BPF_JMP32_A is added to
generate the (BPF_JMP32 | BPF_JA) insn.
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20240829210833.388152-4-martin.lau@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The next patch will add a ctx ptr saving instruction
"(r1 = *(u64 *)(r10 -8)" at the beginning for the main prog
when there is an epilogue patch (by the .gen_epilogue() verifier
ops added in the next patch).
There is one corner case if the bpf prog has a BPF_JMP that jumps
to the 1st instruction. It needs an adjustment such that
those BPF_JMP instructions won't jump to the newly added
ctx saving instruction.
The commit 5337ac4c9b ("bpf: Fix the corner case with may_goto and jump to the 1st insn.")
has the details on this case.
Note that the jump back to 1st instruction is not limited to the
ctx ptr saving instruction. The same also applies to the prologue.
A later test, pro_epilogue_goto_start.c, has a test for the prologue
only case.
Thus, this patch does one adjustment after gen_prologue and
the future ctx ptr saving. It is done by
adjust_jmp_off(env->prog, 0, delta) where delta has the total
number of instructions in the prologue and
the future ctx ptr saving instruction.
The adjust_jmp_off(env->prog, 0, delta) assumes that the
prologue does not have a goto 1st instruction itself.
To accommodate the prologue might have a goto 1st insn itself,
this patch changes the adjust_jmp_off() to skip considering
the instructions between [tgt_idx, tgt_idx + delta).
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20240829210833.388152-3-martin.lau@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This patch moves the 'struct bpf_insn insn_buf[16]' stack usage
to the bpf_verifier_env. A '#define INSN_BUF_SIZE 16' is also added
to replace the ARRAY_SIZE(insn_buf) usages.
Both convert_ctx_accesses() and do_misc_fixup() are changed
to use the env->insn_buf.
It is a refactoring work for adding the epilogue_buf[16] in a later patch.
With this patch, the stack size usage decreased.
Before:
./kernel/bpf/verifier.c:22133:5: warning: stack frame size (2584)
After:
./kernel/bpf/verifier.c:22184:5: warning: stack frame size (2264)
Reviewed-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20240829210833.388152-2-martin.lau@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>