Correct several typos found in comments across various files in the
kernel/time directory.
No functional changes are introduced by these corrections.
Signed-off-by: Haofeng Li <lihaofeng@kylinos.cn>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The check for scancode 0xe0 is always false because earlier on
the scan code is masked with 0x7f so there are never going to
be values greater than 0x7f. Remove the redundant check.
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Signed-off-by: Daniel Thompson (RISCstar) <danielt@kernel.org>
strcpy() is deprecated; use the new helper function kdb_strdup_dequote()
instead. In addition to string duplication similar to kdb_strdup(), it
also trims surrounding quotes from the input string if present.
kdb_strdup_dequote() also checks for a trailing quote in the input
string which was previously not checked.
Link: https://github.com/KSPP/linux/issues/88
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: Daniel Thompson (RISCstar) <danielt@kernel.org>
strcpy() is deprecated; use memcpy() instead.
We can safely use memcpy() because we already know the length of the
source string 'cp' and that it is guaranteed to be NUL-terminated within
the first KDB_GREP_STRLEN bytes.
Link: https://github.com/KSPP/linux/issues/88
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: Daniel Thompson (RISCstar) <danielt@kernel.org>
strcpy() is deprecated and its behavior is undefined when the source and
destination buffers overlap. Use memmove() instead to avoid any
undefined behavior.
Adjust comments for clarity.
Link: https://github.com/KSPP/linux/issues/88
Fixes: 5d5314d679 ("kdb: core for kgdb back end (1 of 2)")
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: Daniel Thompson (RISCstar) <danielt@kernel.org>
sys_get_robust_list() and compat_get_robust_list() use ptrace_may_access()
to check if the calling task is allowed to access another task's
robust_list pointer. This check is racy against a concurrent exec() in the
target process.
During exec(), a task may transition from a non-privileged binary to a
privileged one (e.g., setuid binary) and its credentials/memory mappings
may change. If get_robust_list() performs ptrace_may_access() before
this transition, it may erroneously allow access to sensitive information
after the target becomes privileged.
A racy access allows an attacker to exploit a window during which
ptrace_may_access() passes before a target process transitions to a
privileged state via exec().
For example, consider a non-privileged task T that is about to execute a
setuid-root binary. An attacker task A calls get_robust_list(T) while T
is still unprivileged. Since ptrace_may_access() checks permissions
based on current credentials, it succeeds. However, if T begins exec
immediately afterwards, it becomes privileged and may change its memory
mappings. Because get_robust_list() proceeds to access T->robust_list
without synchronizing with exec() it may read user-space pointers from a
now-privileged process.
This violates the intended post-exec access restrictions and could
expose sensitive memory addresses or be used as a primitive in a larger
exploit chain. Consequently, the race can lead to unauthorized
disclosure of information across privilege boundaries and poses a
potential security risk.
Take a read lock on signal->exec_update_lock prior to invoking
ptrace_may_access() and accessing the robust_list/compat_robust_list.
This ensures that the target task's exec state remains stable during the
check, allowing for consistent and synchronized validation of
credentials.
Suggested-by: Jann Horn <jann@thejh.net>
Signed-off-by: Pranav Tyagi <pranav.tyagi03@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/linux-fsdevel/1477863998-3298-5-git-send-email-jann@thejh.net/
Link: https://github.com/KSPP/linux/issues/119
syzbot managed to trigger the following race:
T1 T2
futex_wait_requeue_pi()
futex_do_wait()
schedule()
futex_requeue()
futex_proxy_trylock_atomic()
futex_requeue_pi_prepare()
requeue_pi_wake_futex()
futex_requeue_pi_complete()
/* preempt */
* timeout/ signal wakes T1 *
futex_requeue_pi_wakeup_sync() // Q_REQUEUE_PI_LOCKED
futex_hash_put()
// back to userland, on stack futex_q is garbage
/* back */
wake_up_state(q->task, TASK_NORMAL);
In this scenario futex_wait_requeue_pi() is able to leave without using
futex_q::lock_ptr for synchronization.
This can be prevented by reading futex_q::task before updating the
futex_q::requeue_state. A reference on the task_struct is not needed
because requeue_pi_wake_futex() is invoked with a spinlock_t held which
implies a RCU read section.
Even if T1 terminates immediately after, the task_struct will remain valid
during T2's wake_up_state(). A READ_ONCE on futex_q::task before
futex_requeue_pi_complete() is enough because it ensures that the variable
is read before the state is updated.
Read futex_q::task before updating the requeue state, use it for the
following wakeup.
Fixes: 07d91ef510 ("futex: Prevent requeue_pi() lock nesting issue on RT")
Reported-by: syzbot+034246a838a10d181e78@syzkaller.appspotmail.com
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Closes: https://lore.kernel.org/all/68b75989.050a0220.3db4df.01dd.GAE@google.com/
A previous patch fixed a bug where new_prs should be assigned before
checking housekeeping conflicts. This patch addresses another potential
issue: the nocpu error check currently uses the xcpus which is not updated.
Although no issue has been observed so far, the check should be performed
using the new effective exclusive cpus.
The comment has been removed because the function returns an error if
nocpu checking fails, which is unrelated to the parent.
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The 'isolcpus' parameter specified at boot time can be assigned to an
isolated partition. While it is valid put the 'isolcpus' in an isolated
partition, attempting to change a member cpuset to an isolated partition
will fail if the cpuset contains any 'isolcpus'.
For example, the system boots with 'isolcpus=9', and the following
configuration works correctly:
# cd /sys/fs/cgroup/
# mkdir test
# echo 1 > test/cpuset.cpus
# echo isolated > test/cpuset.cpus.partition
# cat test/cpuset.cpus.partition
isolated
# echo 9 > test/cpuset.cpus
# cat test/cpuset.cpus.partition
isolated
# cat test/cpuset.cpus
9
However, the following steps to convert a member cpuset to an isolated
partition will fail:
# cd /sys/fs/cgroup/
# mkdir test
# echo 9 > test/cpuset.cpus
# echo isolated > test/cpuset.cpus.partition
# cat test/cpuset.cpus.partition
isolated invalid (partition config conflicts with housekeeping setup)
The issue occurs because the new partition state (new_prs) is used for
validation against housekeeping constraints before it has been properly
updated. To resolve this, move the assignment of new_prs before the
housekeeping validation check when enabling a root partition.
Fixes: 4a74e41888 ("cgroup/cpuset: Check partition conflict with housekeeping setup")
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Converting bpf_insn_successors() to use lookup table makes it ~1.5
times faster.
Also remove unnecessary conditionals:
- `idx + 1 < prog->len` is unnecessary because after check_cfg() all
jump targets are guaranteed to be within a program;
- `i == 0 || succ[0] != dst` is unnecessary because any client of
bpf_insn_successors() can handle duplicate edges:
- compute_live_registers()
- compute_scc()
Moving bpf_insn_successors() to liveness.c allows its inlining in
liveness.c:__update_stack_liveness().
Such inlining speeds up __update_stack_liveness() by ~40%.
bpf_insn_successors() is used in both verifier.c and liveness.c.
perf shows such move does not negatively impact users in verifier.c,
as these are executed only once before main varification pass.
Unlike __update_stack_liveness() which can be triggered multiple
times.
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250918-callchain-sensitive-liveness-v3-10-c3cd27bacc60@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Remove register chain based liveness tracking:
- struct bpf_reg_state->{parent,live} fields are no longer needed;
- REG_LIVE_WRITTEN marks are superseded by bpf_mark_stack_write()
calls;
- mark_reg_read() calls are superseded by bpf_mark_stack_read();
- log.c:print_liveness() is superseded by logging in liveness.c;
- propagate_liveness() is superseded by bpf_update_live_stack();
- no need to establish register chains in is_state_visited() anymore;
- fix a bunch of tests expecting "_w" suffixes in verifier log
messages.
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250918-callchain-sensitive-liveness-v3-9-c3cd27bacc60@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Allocate analysis instance:
- Add bpf_stack_liveness_{init,free}() calls to bpf_check().
Notify the instance about any stack reads and writes:
- Add bpf_mark_stack_write() call at every location where
REG_LIVE_WRITTEN is recorded for a stack slot.
- Add bpf_mark_stack_read() call at every location mark_reg_read() is
called.
- Both bpf_mark_stack_{read,write}() rely on
env->liveness->cur_instance callchain being in sync with
env->cur_state. It is possible to update env->liveness->cur_instance
every time a mark read/write is called, but that costs a hash table
lookup and is noticeable in the performance profile. Hence, manually
reset env->liveness->cur_instance whenever the verifier changes
env->cur_state call stack:
- call bpf_reset_live_stack_callchain() when the verifier enters a
subprogram;
- call bpf_update_live_stack() when the verifier exits a subprogram
(it implies the reset).
Make sure bpf_update_live_stack() is called for a callchain before
issuing liveness queries. And make sure that bpf_update_live_stack()
is called for any callee callchain first:
- Add bpf_update_live_stack() call at every location that processes
BPF_EXIT:
- exit from a subprogram;
- before pop_stack() call.
This makes sure that bpf_update_live_stack() is called for callee
callchains before caller callchains.
Make sure must_write marks are set to zero for instructions that
do not always access the stack:
- Wrap do_check_insn() with bpf_reset_stack_write_marks() /
bpf_commit_stack_write_marks() calls.
Any calls to bpf_mark_stack_write() are accumulated between this
pair of calls. If no bpf_mark_stack_write() calls were made
it means that the instruction does not access stack (at-least
on the current verification path) and it is important to record
this fact.
Finally, use bpf_live_stack_query_init() / bpf_stack_slot_alive()
to query stack liveness info.
The manual tracking of the correct order for callee/caller
bpf_update_live_stack() calls is a bit convoluted and may warrant some
automation in future revisions.
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250918-callchain-sensitive-liveness-v3-7-c3cd27bacc60@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This commit adds a flow-sensitive, context-sensitive, path-insensitive
data flow analysis for live stack slots:
- flow-sensitive: uses program control flow graph to compute data flow
values;
- context-sensitive: collects data flow values for each possible call
chain in a program;
- path-insensitive: does not distinguish between separate control flow
graph paths reaching the same instruction.
Compared to the current path-sensitive analysis, this approach trades
some precision for not having to enumerate every path in the program.
This gives a theoretical capability to run the analysis before main
verification pass. See cover letter for motivation.
The basic idea is as follows:
- Data flow values indicate stack slots that might be read and stack
slots that are definitely written.
- Data flow values are collected for each
(call chain, instruction number) combination in the program.
- Within a subprogram, data flow values are propagated using control
flow graph.
- Data flow values are transferred from entry instructions of callee
subprograms to call sites in caller subprograms.
In other words, a tree of all possible call chains is constructed.
Each node of this tree represents a subprogram. Read and write marks
are collected for each instruction of each node. Live stack slots are
first computed for lower level nodes. Then, information about outer
stack slots that might be read or are definitely written by a
subprogram is propagated one level up, to the corresponding call
instructions of the upper nodes. Procedure repeats until root node is
processed.
In the absence of value range analysis, stack read/write marks are
collected during main verification pass, and data flow computation is
triggered each time verifier.c:states_equal() needs to query the
information.
Implementation details are documented in kernel/bpf/liveness.c.
Quantitative data about verification performance changes and memory
consumption is in the cover letter.
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250918-callchain-sensitive-liveness-v3-6-c3cd27bacc60@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The next patch would require doing postorder traversal of individual
subprograms. Facilitate this by moving env->cfg.insn_postorder
computation from check_cfg() to a separate pass, as check_cfg()
descends into called subprograms (and it needs to, because of
merge_callee_effects() logic).
env->cfg.insn_postorder is used only by compute_live_registers(),
this function does not track cross subprogram dependencies,
thus the change does not affect it's operation.
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250918-callchain-sensitive-liveness-v3-5-c3cd27bacc60@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
And drop ns_free_inum(). Anything common that can be wasted centrally
should be wasted in the new common helper.
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
There's a lot of information that namespace implementers don't need to
know about at all. Encapsulate this all in the initialization helper.
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Assign the reserved MNT_NS_ANON_INO sentinel to anonymous mount
namespaces and cleanup the initial mount ns allocation. This is just a
preparatory patch and the ns->inum check in ns_common_init() will be
dropped in the next patch.
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
It's really awkward spilling the ns common infrastructure into multiple
headers. Move it to a separate file.
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Every namespace type has a container_of(ns, <ns_type>, ns) static inline
function that is currently not exposed in the header. So we have a bunch
of places that open-code it via container_of(). Move it to the headers
so we can use it directly.
Reviewed-by: Aleksa Sarai <cyphar@cyphar.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Support the generic ns lookup infrastructure to support file handles for
namespaces.
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Support the generic ns lookup infrastructure to support file handles for
namespaces.
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Move the namespace iteration infrastructure originally introduced for
mount namespaces into a generic library usable by all namespace types.
Signed-off-by: Christian Brauner <brauner@kernel.org>
Don't cargo-cult the same thing over and over.
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
The capability check should not be audited since it is only being used
to determine the inode permissions. A failed check does not indicate a
violation of security policy but, when an LSM is enabled, a denial audit
message was being generated.
The denial audit message can either lead to the capability being
unnecessarily allowed in a security policy, or being silenced potentially
masking a legitimate capability check at a later point in time.
Similar to commit d6169b0206 ("net: Use ns_capable_noaudit() when
determining net sysctl permissions")
Fixes: 7863dcc72d ("pid: allow pid_max to be set per pid namespace")
CC: Christian Brauner <brauner@kernel.org>
CC: linux-security-module@vger.kernel.org
CC: selinux@vger.kernel.org
Signed-off-by: Christian Göttsche <cgzones@googlemail.com>
Acked-by: Serge Hallyn <serge@hallyn.com>
Reviewed-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Currently only array maps are supported, but the implementation can be
extended for other maps and objects. The hash is memoized only for
exclusive and frozen maps as their content is stable until the exclusive
program modifies the map.
This is required for BPF signing, enabling a trusted loader program to
verify a map's integrity. The loader retrieves
the map's runtime hash from the kernel and compares it against an
expected hash computed at build time.
Signed-off-by: KP Singh <kpsingh@kernel.org>
Link: https://lore.kernel.org/r/20250914215141.15144-7-kpsingh@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Exclusive maps allow maps to only be accessed by program with a
program with a matching hash which is specified in the excl_prog_hash
attr.
For the signing use-case, this allows the trusted loader program
to load the map and verify the integrity
Signed-off-by: KP Singh <kpsingh@kernel.org>
Link: https://lore.kernel.org/r/20250914215141.15144-3-kpsingh@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>