I had to spend some (a lot;) time to understand why pidfd_info_test
(and more) fails with my patch under qemu on my machine ;) Until I
applied the patch below.
I think it is a bad idea to do the things like
#ifndef __NR_clone3
#define __NR_clone3 -1
#endif
because this can hide a problem. My working laptop runs Fedora-23 which
doesn't have __NR_clone3/etc in /usr/include/. So "make" happily succeeds,
but everything fails and it is not clear why.
Link: https://lore.kernel.org/r/20250323174518.GB834@redhat.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
Pull vfs mount updates from Christian Brauner:
- Add a mountinfo program to demonstrate statmount()/listmount()
Add a new "mountinfo" sample userland program that demonstrates how
to use statmount() and listmount() to get at the same info that
/proc/pid/mountinfo provides
- Remove pointless nospec.h include
- Prepend statmount.mnt_opts string with security_sb_mnt_opts()
Currently these mount options aren't accessible via statmount()
- Add new mount namespaces to mount namespace rbtree outside of the
namespace semaphore
- Lockless mount namespace lookup
Currently we take the read lock when looking for a mount namespace to
list mounts in. We can make this lockless. The simple search case can
just use a sequence counter to detect concurrent changes to the
rbtree
For walking the list of mount namespaces sequentially via nsfs we
keep a separate rcu list as rb_prev() and rb_next() aren't usable
safely with rcu. Currently there is no primitive for retrieving the
previous list member. To do this we need a new deletion primitive
that doesn't poison the prev pointer and a corresponding retrieval
helper
Since creating mount namespaces is a relatively rare event compared
with querying mounts in a foreign mount namespace this is worth it.
Once libmount and systemd pick up this mechanism to list mounts in
foreign mount namespaces this will be used very frequently
- Add extended selftests for lockless mount namespace iteration
- Add a sample program to list all mounts on the system, i.e., in
all mount namespaces
- Improve mount namespace iteration performance
Make finding the last or first mount to start iterating the mount
namespace from an O(1) operation and add selftests for iterating the
mount table starting from the first and last mount
- Use an xarray for the old mount id
While the ida does use the xarray internally we can use it explicitly
which allows us to increment the unique mount id under the xa lock.
This allows us to remove the atomic as we're now allocating both ids
in one go
- Use a shared header for vfs sample programs
- Fix build warnings for new sample program to list all mounts
* tag 'vfs-6.14-rc1.mount.v2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
samples/vfs: fix build warnings
samples/vfs: use shared header
samples/vfs/mountinfo: Use __u64 instead of uint64_t
fs: remove useless lockdep assertion
fs: use xarray for old mount id
selftests: add listmount() iteration tests
fs: cache first and last mount
samples: add test-list-all-mounts
selftests: remove unneeded include
selftests: add tests for mntns iteration
seltests: move nsfs into filesystems subfolder
fs: simplify rwlock to spinlock
fs: lockless mntns lookup for nsfs
rculist: add list_bidir_{del,prev}_rcu()
fs: lockless mntns rbtree lookup
fs: add mount namespace to rbtree late
fs: prepend statmount.mnt_opts string with security_sb_mnt_opts()
mount: remove inlude/nospec.h include
samples: add a mountinfo program to demonstrate statmount()/listmount()
An output message:
> # # waitpid WEXITSTATUS=0
will be printed for 30,000+ times in the `pidfd_test` selftest, which
does not seem ideal. This patch removes the print logic in the
`wait_for_pid` function, so each call to this function does not output
a line by default. Any existing call sites where the extra line might
be beneficial have been modified to include extra print statements
outside of the function calls.
Signed-off-by: Ziqi Zhao <astrajoan@yahoo.com>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
When running the pidfd_fdinfo_test on arm64, it fails for me. After some
digging, the reason is that the child exits due to SIGBUS, because it
overflows the 1024 byte stack we've reserved for it.
To fix the issue, increase the stack size to 8192 bytes (this number is
somewhat arbitrary, and was arrived at through experimentation -- I kept
doubling until the failure no longer occurred).
Also, let's make the issue easier to debug. wait_for_pid() returns an
ambiguous value: it may return -1 in all of these cases:
1. waitpid() itself returned -1
2. waitpid() returned success, but we found !WIFEXITED(status).
3. The child process exited, but it did so with a -1 exit code.
There's no way for the caller to tell the difference. So, at least log
which occurred, so the test runner can debug things.
While debugging this, I found that we had !WIFEXITED(), because the
child exited due to a signal. This seems like a reasonably common case,
so also print out whether or not we have WIFSIGNALED(), and the
associated WTERMSIG() (if any). This lets us see the SIGBUS I'm fixing
clearly when it occurs.
Finally, I'm suspicious of allocating the child's stack on our stack.
man clone(2) suggests that the correct way to do this is with mmap(),
and in particular by setting MAP_STACK. So, switch to doing it that way
instead.
Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
Acked-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
Verify that the PIDFD_NONBLOCK flag works with pidfd_open() and that
waitid() with a non-blocking pidfd returns EAGAIN:
TAP version 13
1..3
# Starting 3 tests from 1 test cases.
# RUN global.wait_simple ...
# OK global.wait_simple
ok 1 global.wait_simple
# RUN global.wait_states ...
# OK global.wait_states
ok 2 global.wait_states
# RUN global.wait_nonblock ...
# OK global.wait_nonblock
ok 3 global.wait_nonblock
# PASSED: 3 / 3 tests passed.
# Totals: pass:3 fail:0 xfail:0 xpass:0 skip:0 error:0
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: linux-kselftest@vger.kernel.org
Link: https://lore.kernel.org/r/20200902102130.147672-5-christian.brauner@ubuntu.com
We recently regressed (cf. [1] and its corresponding fix in [2]) returning
ENOMEM when trying to create a process in a pid namespace whose init
process/child subreaper has already died. This has caused confusion at
least once before that (cf. [3]). Let's add a simple regression test to
catch this in the future.
[1]: 49cb2fc42c ("fork: extend clone3() to support setting a PID")
[2]: b26ebfe12f ("pid: Fix error return value in some cases")
[3]: 35f71bc0a0 ("fork: report pid reservation failure properly")
Cc: Corey Minyard <cminyard@mvista.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Adrian Reber <areber@redhat.com>
Cc: Dmitry Safonov <0x7f454c46@gmail.com>
Cc: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>