Evidently, PATH_MAX isn't always defined, at least not via <limits.h>.
Simply remove the use and replace it by a constant 4k. As stat::st_size
is zero for /proc/self/exe we can't even size it automatically, and it
seems unlikely someone's going to try to run UML with such a path.
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202410240553.gYNIXN8i-lkp@intel.com/
Fixes: 031acdcfb5 ("um: restore process name")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The PTRACE_GETREGSET API has now existed since Linux 2.6.33. The XSAVE
CPU feature should also be sufficiently common to be able to rely on it.
With this, define our internal FP state to be the hosts XSAVE data. Add
discovery for the hosts XSAVE size and place the FP registers at the end
of task_struct so that we can adjust the size at runtime.
Next we can implement the regset API on top and update the signal
handling as well as ptrace APIs to use them. Also switch coredump
creation to use the regset API and finally set HAVE_ARCH_TRACEHOOK.
This considerably improves the signal frames. Previously they might not
have contained all the registers (i386) and also did not have the
sizes and magic values set to the correct values to permit userspace to
decode the frame.
As a side effect, this will permit UML to run on hosts with newer CPU
extensions (such as AMX) that need even more register state.
Signed-off-by: Benjamin Berg <benjamin.berg@intel.com>
Link: https://patch.msgid.link/20241023094120.4083426-1-benjamin@sipsolutions.net
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
In time-travel mode userspace can do a lot of work without any time
passing. Unfortunately, this can result in OOM situations as the RCU
core code will never be run.
Work around this by keeping track of userspace processes that do not
yield for a lot of operations. When this happens, insert a jiffie into
the sched_clock clock to account time against the process and cause the
bookkeeping to run.
As sched_clock is used for tracing, it is useful to keep it in sync
between the different VMs. As such, try to remove added ticks again when
the actual clock ticks.
Signed-off-by: Benjamin Berg <benjamin.berg@intel.com>
Link: https://patch.msgid.link/20241010142537.1134685-1-benjamin@sipsolutions.net
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
When a PTE is updated in the page table, the _PAGE_NEWPAGE bit will
always be set. And the corresponding page will always be mapped or
unmapped depending on whether the PTE is present or not. The check
on the _PAGE_NEWPROT bit is not really reachable. Abandoning it will
allow us to simplify the code and remove the unreachable code.
Reviewed-by: Benjamin Berg <benjamin.berg@intel.com>
Signed-off-by: Tiwei Bie <tiwei.btw@antgroup.com>
Link: https://patch.msgid.link/20241011102354.1682626-2-tiwei.btw@antgroup.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The stub_exe could segfault when built with some compilers (e.g. gcc
13.2.0), as SSE instructions which relied on stack alignment could be
generated, but the stack was misaligned.
This seems to be due to the __start entry point being run with a 16-byte
aligned stack, but the x86_64 SYSV ABI wanting the stack to be so
aligned _before_ a function call (so it is misaligned when the function
is entered due to the return address being pushed). The function
prologue then realigns it. Because the entry point is never _called_,
and hence there is no return address, the prologue is therefore actually
misaligning it, and causing the generated movaps instructions to
SIGSEGV. This results in the following error:
start_userspace : expected SIGSTOP, got status = 139
Don't generate this prologue for __start by using
__attribute__((naked)), which resolves the issue.
Fixes: 32e8eaf263 ("um: use execveat to create userspace MMs")
Signed-off-by: David Gow <davidgow@google.com>
Link: https://lore.kernel.org/linux-um/CABVgOS=boUoG6=LHFFhxEd8H8jDP1zOaPKFEjH+iy2n2Q5S2aQ@mail.gmail.com/
Link: https://patch.msgid.link/20241017231007.1500497-2-davidgow@google.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
When automatic variable initialization is enabled via
CONFIG_INIT_STACK_ALL_{PATTERN,ZERO}, clang will insert a call to
memset() to initialize an object created with __builtin_alloca(). This
ultimately breaks the build when linking stub_exe because it is a
standalone executable that does not include or link against memset().
ld: arch/um/kernel/skas/stub_exe.o: in function `_start':
arch/um/kernel/skas/stub_exe.c:83:(.ltext+0x15): undefined reference to `memset'
Disable automatic variable initialization for stub_exe.c by passing the
default value of 'uninitialized' to '-ftrivial-auto-var-init', which
avoids generating the call to memset(). This code is small and runs
quickly as it is just designed to set up an environment, so stack
variable initialization is unnecessary overhead for little gain.
Fixes: 32e8eaf263 ("um: use execveat to create userspace MMs")
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Link: https://patch.msgid.link/20241016-uml-fix-stub_exe-clang-v1-2-3d6381dc5a78@kernel.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
When building stub_exe with clang, there is an error because '-n' is not
a recognized flag by the clang driver (which is being used to invoke the
linker):
clang: error: unknown argument: '-n'
'-n' should be passed along to the linker, as it is the short flag for
'--nmagic', so prefix it with '-Wl,'.
Fixes: 32e8eaf263 ("um: use execveat to create userspace MMs")
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Link: https://patch.msgid.link/20241016-uml-fix-stub_exe-clang-v1-1-3d6381dc5a78@kernel.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
With the change to use execve() we can now safely clear the memory up to
STUB_START as rseq will not be trying to use memory in that region. Also,
on 64 bit the previous changes should mean that there is no usable
memory range above the stub.
Make the change and remove the comment as it is not needed anymore.
Signed-off-by: Benjamin Berg <benjamin.berg@intel.com>
Link: https://patch.msgid.link/20240919124511.282088-10-benjamin@sipsolutions.net
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
When loading the UML binary, the host kernel will place the stack at the
highest possible address. It will then map the program name and
environment variables onto the start of the stack.
As such, an easy way to figure out the host_task_size is to use the
highest pointer to an environment variable as a reference.
Ensure that this works by disabling address layout randomization and
re-executing UML in case it was enabled.
This increases the available TASK_SIZE for 64 bit UML considerably.
Signed-off-by: Benjamin Berg <benjamin.berg@intel.com>
Link: https://patch.msgid.link/20240919124511.282088-9-benjamin@sipsolutions.net
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Instead of using the current stack pointer, we can also use the current
instruction to calculate where the stub data is. With this the stub data
only needs to be aligned to a full page boundary.
Changing this has the advantage that we do not have a hole in the memory
space above the stub data (which would need to be explicitly cleared).
Another motivation to do this is that with the planned addition of a
SECCOMP based userspace the stack pointer may not be fully trustworthy.
Signed-off-by: Benjamin Berg <benjamin.berg@intel.com>
Link: https://patch.msgid.link/20240919124511.282088-7-benjamin@sipsolutions.net
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Using clone will not undo features that have been enabled by libc. An
example of this already happening is rseq, which could cause the kernel
to read/write memory of the userspace process. In the future the
standard library might also use mseal by default to protect itself,
which would also thwart our attempts at unmapping everything.
Solve all this by taking a step back and doing an execve into a tiny
static binary that sets up the minimal environment required for the
stub without using any standard library. That way we have a clean
execution environment that is fully under the control of UML.
Note that this changes things a bit as the FDs are not anymore shared
with the kernel. Instead, we explicitly share the FDs for the physical
memory and all existing iomem regions. Doing this is fine, as iomem
regions cannot be added at runtime.
Signed-off-by: Benjamin Berg <benjamin.berg@intel.com>
Link: https://patch.msgid.link/20240919124511.282088-3-benjamin@sipsolutions.net
[use pipe() instead of pipe2(), remove unneeded close() calls]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
We do not need the extra save/restore of the FP registers when getting
the fault information. This was originally added in commit 2f56debd77
("uml: fix FP register corruption") but at that time the code was not
saving/restoring the FP registers when switching to userspace. This was
fixed in commit fbfe9c847e ("um: Save FPU registers between task
switches") and since then the auxiliary registers have not been useful.
Signed-off-by: Benjamin Berg <benjamin.berg@intel.com>
Link: https://patch.msgid.link/20241004233821.2130874-1-benjamin@sipsolutions.net
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
When switching from userspace to the kernel, all registers including the
FP registers are copied into the kernel and restored later on. As such,
the true source for the FP register state is actually already in the
kernel and they should never be grabbed from the userspace process.
Change the various places to simply copy the data from the internal FP
register storage area. Note that on i386 the format of PTRACE_GETFPREGS
and PTRACE_GETFPXREGS is different enough that conversion would be
needed. With this patch, -EINVAL is returned if the non-native format is
requested.
The upside is, that this patchset fixes setting registers via ptrace
(which simply did not work before) as well as fixing setting floating
point registers using the mcontext on signal return on i386.
Signed-off-by: Benjamin Berg <benjamin.berg@intel.com>
Link: https://patch.msgid.link/20240913133845.964292-1-benjamin@sipsolutions.net
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Pull Kbuild fixes from Masahiro Yamada:
- Move non-boot built-in DTBs to the .rodata section
- Fix Kconfig bugs
- Fix maint scripts in the linux-image Debian package
- Import some list macros to scripts/include/
* tag 'kbuild-fixes-v6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
kbuild: deb-pkg: Remove blank first line from maint scripts
kbuild: fix a typo dt_binding_schema -> dt_binding_schemas
scripts: import more list macros
kconfig: qconf: fix buffer overflow in debug links
kconfig: qconf: move conf_read() before drawing tree pain
kconfig: clear expr::val_is_valid when allocated
kconfig: fix infinite loop in sym_calc_choice()
kbuild: move non-boot built-in DTBs to .rodata section
Pull x86 platform driver fixes from Hans de Goede:
- Intel PMC fix for suspend/resume issues on some Sky and Kaby Lake
laptops
- Intel Diamond Rapids hw-id additions
- Documentation and MAINTAINERS fixes
- Some other small fixes
* tag 'platform-drivers-x86-v6.12-2' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86:
platform/x86: x86-android-tablets: Fix use after free on platform_device_register() errors
platform/x86: wmi: Update WMI driver API documentation
platform/x86: dell-ddv: Fix typo in documentation
platform/x86: dell-sysman: add support for alienware products
platform/x86/intel: power-domains: Add Diamond Rapids support
platform/x86: ISST: Add Diamond Rapids to support list
platform/x86:intel/pmc: Disable ACPI PM Timer disabling on Sky and Kaby Lake
platform/x86: dell-laptop: Do not fail when encountering unsupported batteries
MAINTAINERS: Update Intel In Field Scan(IFS) entry
platform/x86: ISST: Fix the KASAN report slab-out-of-bounds bug
Pull kvm fixes from Paolo Bonzini:
"ARM64:
- Fix pKVM error path on init, making sure we do not change critical
system registers as we're about to fail
- Make sure that the host's vector length is at capped by a value
common to all CPUs
- Fix kvm_has_feat*() handling of "negative" features, as the current
code is pretty broken
- Promote Joey to the status of official reviewer, while James steps
down -- hopefully only temporarly
x86:
- Fix compilation with KVM_INTEL=KVM_AMD=n
- Fix disabling KVM_X86_QUIRK_SLOT_ZAP_ALL when shadow MMU is in use
Selftests:
- Fix compilation on non-x86 architectures"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
x86/reboot: emergency callbacks are now registered by common KVM code
KVM: x86: leave kvm.ko out of the build if no vendor module is requested
KVM: x86/mmu: fix KVM_X86_QUIRK_SLOT_ZAP_ALL for shadow MMU
KVM: arm64: Fix kvm_has_feat*() handling of negative features
KVM: selftests: Fix build on architectures other than x86_64
KVM: arm64: Another reviewer reshuffle
KVM: arm64: Constrain the host to the maximum shared SVE VL with pKVM
KVM: arm64: Fix __pkvm_init_vcpu cptr_el2 error path