By delaying the setting of prev/next VMA until after the write of NULL,
the probability of the prev/next VMA already being in the CPU cache is
significantly increased, especially for larger munmap operations. It
also means that prev/next will be loaded closer to when they are used.
This requires changing the loop type when gathering the VMAs that will
be freed.
Since prev will be set later in the function, it is better to reverse
the splitting direction of the start VMA (modify the new_below argument
to __split_vma).
Using the vma_iter_prev_range() to walk back to the correct location in
the tree will, on the most part, mean walking within the CPU cache.
Usually, this is two steps vs a node reset and a tree re-walk.
Link: https://lkml.kernel.org/r/20230724183157.3939892-16-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Peng Zhang <zhangpeng.00@bytedance.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The majority of the calls to munmap a vm range is within a single vma.
The maple tree is able to store a single entry at 0, with a size of 1 as
a pointer and avoid any allocations. Change do_vmi_align_munmap() to
store the VMAs being munmap()'ed into a tree indexed by the count. This
will leverage the ability to store the first entry without a node
allocation.
Storing the entries into a tree by the count and not the vma start and
end means changing the functions which iterate over the entries. Update
unmap_vmas() and free_pgtables() to take a maple state and a tree end
address to support this functionality.
Passing through the same maple state to unmap_vmas() and free_pgtables()
means the state needs to be reset between calls. This happens in the
static unmap_region() and exit_mmap().
Link: https://lkml.kernel.org/r/20230724183157.3939892-4-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Peng Zhang <zhangpeng.00@bytedance.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "Reduce preallocations for maple tree", v3.
Initial work on preallocations showed no regression in performance during
testing, but recently some users (both on [1] and off [android] list) have
reported that preallocating the worst-case number of nodes has caused some
slow down. This patch set addresses the number of allocations in a few
ways.
During munmap() most munmap() operations will remove a single VMA, so
leverage the fact that the maple tree can place a single pointer at range
0 - 0 without allocating. This is done by changing the index of the VMAs
to be indexed by the count, starting at 0.
Re-introduce the entry argument to mas_preallocate() so that a more
intelligent guess of the node count can be made.
Implement the more intelligent guess of the node count, although there is
more work to be done.
During development of v2 of this patch set, I also noticed that the number
of nodes being allocated for a rebalance was beyond what could possibly be
needed. This is addressed in patch 0008.
This patch (of 15):
Add a way to test the speed of mas_for_each() to the testing code.
Link: https://lkml.kernel.org/r/20230724183157.3939892-1-Liam.Howlett@oracle.com
Link: https://lkml.kernel.org/r/20230724183157.3939892-2-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Peng Zhang <zhangpeng.00@bytedance.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Despite its name, mm_drop_all_locks() does not drop _all_ locks; the mmap
lock is held write-locked by the caller, and the caller is responsible for
dropping the mmap lock at a later point (which will also release the VMA
locks).
Calling vma_end_write_all() here is dangerous because the caller might
have write-locked a VMA with the expectation that it will stay
write-locked until the mmap_lock is released, as usual.
This _almost_ becomes a problem in the following scenario:
An anonymous VMA A and an SGX VMA B are mapped adjacent to each other.
Userspace calls munmap() on a range starting at the start address of A and
ending in the middle of B.
Hypothetical call graph with additional notes in brackets:
do_vmi_align_munmap
[begin first for_each_vma_range loop]
vma_start_write [on VMA A]
vma_mark_detached [on VMA A]
__split_vma [on VMA B]
sgx_vma_open [== new->vm_ops->open]
sgx_encl_mm_add
__mmu_notifier_register [luckily THIS CAN'T ACTUALLY HAPPEN]
mm_take_all_locks
mm_drop_all_locks
vma_end_write_all [drops VMA lock taken on VMA A before]
vma_start_write [on VMA B]
vma_mark_detached [on VMA B]
[end first for_each_vma_range loop]
vma_iter_clear_gfp [removes VMAs from maple tree]
mmap_write_downgrade
unmap_region
mmap_read_unlock
In this hypothetical scenario, while do_vmi_align_munmap() thinks it still
holds a VMA write lock on VMA A, the VMA write lock has actually been
invalidated inside __split_vma().
The call from sgx_encl_mm_add() to __mmu_notifier_register() can't
actually happen here, as far as I understand, because we are duplicating
an existing SGX VMA, but sgx_encl_mm_add() only calls
__mmu_notifier_register() for the first SGX VMA created in a given
process. So this could only happen in fork(), not on munmap(). But in my
view it is just pure luck that this can't happen.
Also, we wouldn't actually have any bad consequences from this in
do_vmi_align_munmap(), because by the time the bug drops the lock on VMA
A, we've already marked VMA A as detached, which makes it completely
ineligible for any VMA-locked page faults. But again, that's just pure
luck.
So remove the vma_end_write_all(), so that VMA write locks are only ever
released on mmap_write_unlock() or mmap_write_downgrade().
Also add comments to document the locking rules established by this patch.
Link: https://lkml.kernel.org/r/20230720193436.454247-1-jannh@google.com
Fixes: eeff9a5d47 ("mm/mmap: prevent pagefault handler from racing with mmu_notifier registration")
Signed-off-by: Jann Horn <jannh@google.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
It is very unclear to me how one is supposed to run all the mm selftests
consistently and get clear results.
Most of the test programs are launched by both run_vmtests.sh and
run_kselftest.sh:
hugepage-mmap
hugepage-shm
map_hugetlb
hugepage-mremap
hugepage-vmemmap
hugetlb-madvise
map_fixed_noreplace
gup_test
gup_longterm
uffd-unit-tests
uffd-stress
compaction_test
on-fault-limit
map_populate
mlock-random-test
mlock2-tests
mrelease_test
mremap_test
thuge-gen
virtual_address_range
va_high_addr_switch
mremap_dontunmap
hmm-tests
madv_populate
memfd_secret
ksm_tests
ksm_functional_tests
soft-dirty
cow
However, of this set, when launched by run_vmtests.sh, some of the
programs are invoked multiple times with different arguments. When
invoked by run_kselftest.sh, they are invoked without arguments (and as
a consequence, some fail immediately).
Some test programs are only launched by run_vmtests.sh:
test_vmalloc.sh
And some test programs and only launched by run_kselftest.sh:
khugepaged
migration
mkdirty
transhuge-stress
split_huge_page_test
mdwe_test
write_to_hugetlbfs
Furthermore, run_vmtests.sh is invoked by run_kselftest.sh, so in this
case all the test programs invoked by both scripts are run twice!
Needless to say, this is a bit of a mess. In the absence of fully
understanding the history here, it looks to me like the best solution is
to launch ALL test programs from run_vmtests.sh, and ONLY invoke
run_vmtests.sh from run_kselftest.sh. This way, we get full control over
the parameters, each program is only invoked the intended number of
times, and regardless of which script is used, the same tests get run in
the same way.
The only drawback is that if using run_kselftest.sh, it's top-level tap
result reporting reports only a single test and it fails if any of the
contained tests fail. I don't see this as a big deal though since we
still see all the nested reporting from multiple layers. The other issue
with this is that all of run_vmtests.sh must execute within a single
kselftest timeout period, so let's increase that to something more
suitable.
In the Makefile, TEST_GEN_PROGS will compile and install the tests and
will add them to the list of tests that run_kselftest.sh will run.
TEST_GEN_FILES will compile and install the tests but will not add them
to the test list. So let's move all the programs from TEST_GEN_PROGS to
TEST_GEN_FILES so that they are built but not executed by
run_kselftest.sh. Note that run_vmtests.sh is added to TEST_PROGS, which
means it ends up in the test list. (the lack of "_GEN" means it won't be
compiled, but simply copied).
Link: https://lkml.kernel.org/r/20230724082522.1202616-9-ryan.roberts@arm.com
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Peter Xu <peterx@redhat.com>
Cc: Florent Revest <revest@chromium.org>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Mark Brown <broonie@kernel.org>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Until now, transhuge-stress runs until its explicitly killed, so when
invoked by run_kselftest.sh, it would run until the test timeout, then it
would be killed and the test would be marked as failed.
Add a new, optional command line parameter that allows the user to specify
the duration in seconds that the program should run. The program exits
after this duration with a success (0) exit code. If the argument is
omitted the old behacvior remains.
On it's own, this doesn't quite solve our problem because run_kselftest.sh
does not allow passing parameters to the program under test. But we will
shortly move this to run_vmtests.sh, which does allow parameter passing.
Link: https://lkml.kernel.org/r/20230724082522.1202616-8-ryan.roberts@arm.com
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Florent Revest <revest@chromium.org>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Mark Brown <broonie@kernel.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The `migration` test currently has a number of robustness problems that
cause it to hang and leak resources.
Timeout: There are 3 tests, which each previously ran for 60 seconds.
However, the timeout in mm/settings for a single test binary was set to 45
seconds. So when run using run_kselftest.sh, the top level timeout would
trigger before the test binary was finished. Solve this by meeting in the
middle; each of the 3 tests now runs for 20 seconds (for a total of 60),
and the top level timeout is set to 90 seconds.
Leaking child processes: the `shared_anon` test fork()s some children but
then an ASSERT() fires before the test kills those children. The assert
causes immediate exit of the parent and leaking of the children.
Furthermore, if run using the run_kselftest.sh wrapper, the wrapper would
get stuck waiting for those children to exit, which never happens. Solve
this by setting the "parent death signal" to SIGHUP in the child, so that
the child is killed automatically if the parent dies.
With these changes, the test binary now runs to completion on arm64, with
2 tests passing and the `shared_anon` test failing.
Link: https://lkml.kernel.org/r/20230724082522.1202616-7-ryan.roberts@arm.com
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Florent Revest <revest@chromium.org>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Mark Brown <broonie@kernel.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
thuge-gen was previously only munmapping part of the mmapped buffer, which
caused us to run out of 1G huge pages for a later part of the test. Fix
this by munmapping the whole buffer. Based on the code, it looks like a
typo rather than an intention to keep some of the buffer mapped.
thuge-gen was also calling mmap with SHM_HUGETLB flag (bit 11 set), which
is actually MAP_DENYWRITE in mmap context. The man page says this flag is
ignored in modern kernels. I'm pretty sure from the context that the
author intended to pass the MAP_HUGETLB flag so I've fixed that up too.
Link: https://lkml.kernel.org/r/20230724082522.1202616-5-ryan.roberts@arm.com
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Florent Revest <revest@chromium.org>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Mark Brown <broonie@kernel.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
arm64 does not support the soft-dirty PTE bit. However, the `soft-dirty`
test suite is currently run unconditionally and therefore generates
spurious test failures on arm64. There are also some tests in
`madv_populate` which assume it is supported.
For `soft-dirty` lets disable the whole suite for arm64; it is no longer
built and run_vmtests.sh will skip it if its not present.
For `madv_populate`, we need a runtime mechanism so that the remaining
tests continue to be run. Unfortunately, the only way to determine if the
soft-dirty dirty bit is supported is to write to a page, then see if the
bit is set in /proc/self/pagemap. But the tests that we want to
conditionally execute are testing precicesly this. So if we introduced
this feature check, we could accedentally turn a real failure (on a system
that claims to support soft-dirty) into a skip. So instead, do the check
based on architecture; for arm64, we report that soft-dirty is not
supported.
Link: https://lkml.kernel.org/r/20230724082522.1202616-3-ryan.roberts@arm.com
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Florent Revest <revest@chromium.org>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Mark Brown <broonie@kernel.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "selftests/mm fixes for arm64", v3.
Given my on-going work on large anon folios and contpte mappings, I
decided it would be a good idea to start running mm selftests to help
guard against regressions. However, it soon became clear that I
couldn't get the suite to run cleanly on arm64 with a vanilla v6.5-rc1
kernel (perhaps I'm just doing it wrong??), so got stuck in a rabbit
hole trying to debug and fix all the issues. Some were down to
misconfigurations, but I also found a number of issues with the tests
and even a couple of issues with the kernel.
This patch (of 8):
The selftests runner pipes the test program's stdout to tap_prefix. The
presence of the pipe means that the test program sets its stdout to be
fully buffered (as aposed to line buffered when directly connected to the
terminal). The block buffering means that there is often content in the
buffer at fork() time, which causes the output to end up duplicated. This
was causing problems for mm:cow where test results were duplicated 20-30x.
Solve this by using `stdbuf`, when available to force the test program to
use line buffered mode. This means previously printf'ed results are
flushed out of the program before any fork().
Additionally, explicitly set line buffer mode in ksft_print_header(),
which means that all test programs that use the ksft framework will
benefit even if stdbuf is not present on the system.
[ryan.roberts@arm.com: add setvbuf() to set buffering mode]
Link: https://lkml.kernel.org/r/20230726070655.2713530-1-ryan.roberts@arm.com
Link: https://lkml.kernel.org/r/20230724082522.1202616-1-ryan.roberts@arm.com
Link: https://lkml.kernel.org/r/20230724082522.1202616-2-ryan.roberts@arm.com
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: Mark Brown <broonie@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Florent Revest <revest@chromium.org>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The invalidate_range() is going to become an architecture specific mmu
notifier used to keep the TLB of secondary MMUs such as an IOMMU in sync
with the CPU page tables. Currently it is called from separate code paths
to the main CPU TLB invalidations. This can lead to a secondary TLB not
getting invalidated when required and makes it hard to reason about when
exactly the secondary TLB is invalidated.
To fix this move the notifier call to the architecture specific TLB
maintenance functions for architectures that have secondary MMUs requiring
explicit software invalidations.
This fixes a SMMU bug on ARM64. On ARM64 PTE permission upgrades require
a TLB invalidation. This invalidation is done by the architecture
specific ptep_set_access_flags() which calls flush_tlb_page() if required.
However this doesn't call the notifier resulting in infinite faults being
generated by devices using the SMMU if it has previously cached a
read-only PTE in it's TLB.
Moving the invalidations into the TLB invalidation functions ensures all
invalidations happen at the same time as the CPU invalidation. The
architecture specific flush_tlb_all() routines do not call the notifier as
none of the IOMMUs require this.
Link: https://lkml.kernel.org/r/0287ae32d91393a582897d6c4db6f7456b1001f2.1690292440.git-series.apopple@nvidia.com
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Suggested-by: Jason Gunthorpe <jgg@ziepe.ca>
Tested-by: SeongJae Park <sj@kernel.org>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Tested-by: Luis Chamberlain <mcgrof@kernel.org>
Cc: Andrew Donnellan <ajd@linux.ibm.com>
Cc: Chaitanya Kumar Borah <chaitanya.kumar.borah@intel.com>
Cc: Frederic Barrat <fbarrat@linux.ibm.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Nicolin Chen <nicolinc@nvidia.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
Cc: Will Deacon <will@kernel.org>
Cc: Zhi Wang <zhi.wang.linux@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "More strict maple tree lockdep", v2.
Linus asked for more strict maple tree lockdep checking [1] and for them
to resume the normal path through Andrews tree.
This series of patches adds checks to ensure the lock is held in write
mode during the write path of the maple tree instead of checking if it's
held at all.
It also reduces the validate_mm() calls by consolidating into commonly
used functions (patch 0001), and removes the necessity of holding the lock
on the detached tree during munmap() operations.
This patch (of 4):
validate_mm() calls are too spread out and duplicated in numerous
locations. Also, now that the stack write is done under the write lock,
it is not necessary to validate the mm prior to write operations.
Add a validate_mm() to the stack expansions, and to vma_complete() so
that numerous others may be dropped.
Note that vma_link() (and also insert_vm_struct() by call path) already
call validate_mm().
vma_merge() also had an unnecessary call to vma_iter_free() since the
logic change to abort earlier if no merging is necessary.
Drop extra validate_mm() calls at the start of functions and error paths
which won't write to the tree.
Relocate the validate_mm() call in the do_brk_flags() to avoid
re-running the same test when vma_complete() is used.
The call within the error path of mmap_region() is left intentionally
because of the complexity of the function and the potential of drivers
modifying the tree.
Link: https://lkml.kernel.org/r/20230714195551.894800-1-Liam.Howlett@oracle.com
Link: https://lkml.kernel.org/r/20230714195551.894800-2-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oliver Sang <oliver.sang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>