linux

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2026-05-07 15:07:13 -04:00

Author	SHA1	Message	Date
Matthew Wilcox (Oracle)	350f6bbca1	mm: allow per-VMA locks on file-backed VMAs Remove the TCP layering violation by allowing per-VMA locks on all VMAs. The fault path will immediately fail in handle_mm_fault(). There may be a small performance reduction from this patch as a little unnecessary work will be done on each page fault. See later patches for the improvement. Link: https://lkml.kernel.org/r/20230724185410.1124082-3-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Cc: Arjun Roy <arjunroy@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Punit Agrawal <punit.agrawal@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-18 10:12:51 -07:00
Matthew Wilcox (Oracle)	284e059204	mm: remove CONFIG_PER_VMA_LOCK ifdefs Patch series "Handle most file-backed faults under the VMA lock", v3. This patchset adds the ability to handle page faults on parts of files which are already in the page cache without taking the mmap lock. This patch (of 10): Provide lock_vma_under_rcu() when CONFIG_PER_VMA_LOCK is not defined to eliminate ifdefs in the users. Link: https://lkml.kernel.org/r/20230724185410.1124082-1-willy@infradead.org Link: https://lkml.kernel.org/r/20230724185410.1124082-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Cc: Punit Agrawal <punit.agrawal@bytedance.com> Cc: Arjun Roy <arjunroy@google.com> Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-18 10:12:50 -07:00
Liam R. Howlett	6935e05255	mm/mmap: change vma iteration order in do_vmi_align_munmap() By delaying the setting of prev/next VMA until after the write of NULL, the probability of the prev/next VMA already being in the CPU cache is significantly increased, especially for larger munmap operations. It also means that prev/next will be loaded closer to when they are used. This requires changing the loop type when gathering the VMAs that will be freed. Since prev will be set later in the function, it is better to reverse the splitting direction of the start VMA (modify the new_below argument to __split_vma). Using the vma_iter_prev_range() to walk back to the correct location in the tree will, on the most part, mean walking within the CPU cache. Usually, this is two steps vs a node reset and a tree re-walk. Link: https://lkml.kernel.org/r/20230724183157.3939892-16-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Peng Zhang <zhangpeng.00@bytedance.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-18 10:12:50 -07:00
Liam R. Howlett	fec2936434	maple_tree: reduce resets during store setup mas_prealloc() may walk partially down the tree before finding that a split or spanning store is needed. When the write occurs, relax the logic on resetting the walk so that partial walks will not restart, but walks that have gone too far (a store that affects beyond the current node) should be restarted. Link: https://lkml.kernel.org/r/20230724183157.3939892-15-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Peng Zhang <zhangpeng.00@bytedance.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-18 10:12:50 -07:00
Liam R. Howlett	17983dc617	maple_tree: refine mas_preallocate() node calculations Calculate the number of nodes based on the pending write action instead of assuming the worst case. This addresses a performance regression introduced in platforms that have longer allocation timing. Link: https://lkml.kernel.org/r/20230724183157.3939892-14-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Peng Zhang <zhangpeng.00@bytedance.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-18 10:12:50 -07:00
Liam R. Howlett	0b8bb544b1	maple_tree: update mas_preallocate() testing Since the mas_preallocate() calculation has been updated to be more precise, the testing must also be updated to check for what is expected. Link: https://lkml.kernel.org/r/20230724183157.3939892-13-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Peng Zhang <zhangpeng.00@bytedance.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-18 10:12:49 -07:00
Liam R. Howlett	a7496ad529	maple_tree: move mas_wr_end_piv() below mas_wr_extend_null() Relocate it and call mas_wr_extend_null() from within mas_wr_end_piv(). Extending the NULL may affect the end pivot value so call mas_wr_endtend_null() from within mas_wr_end_piv() to keep it all together. Link: https://lkml.kernel.org/r/20230724183157.3939892-12-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Peng Zhang <zhangpeng.00@bytedance.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-18 10:12:49 -07:00
Liam R. Howlett	b5df092264	mm: set up vma iterator for vma_iter_prealloc() calls Set the correct limits for vma_iter_prealloc() calls so that the maple tree can be smarter about how many nodes are needed. Link: https://lkml.kernel.org/r/20230724183157.3939892-11-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Peng Zhang <zhangpeng.00@bytedance.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-18 10:12:49 -07:00
Liam R. Howlett	f72cf24a86	mm: use vma_iter_clear_gfp() in nommu Move the definition of vma_iter_clear_gfp() from mmap.c to internal.h so it can be used in the nommu code. This will reduce node preallocations in nommu. Link: https://lkml.kernel.org/r/20230724183157.3939892-10-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Peng Zhang <zhangpeng.00@bytedance.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-18 10:12:49 -07:00
Liam R. Howlett	c108df767f	maple_tree: adjust node allocation on mas_rebalance() mas_rebalance() is called to rebalance an insufficient node into a single node or two sufficient nodes. The preallocation estimate is always too many in this case as the height of the tree will never grow and there is no possibility to have a three way split in this case, so revise the node allocation count. Link: https://lkml.kernel.org/r/20230724183157.3939892-9-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Peng Zhang <zhangpeng.00@bytedance.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-18 10:12:48 -07:00
Liam R. Howlett	da0892547b	maple_tree: re-introduce entry to mas_preallocate() arguments The current preallocation strategy is to preallocate the absolute worst-case allocation for a tree modification. The entry (or NULL) is needed to know how many nodes are needed to write to the tree. Start by adding the argument to the mas_preallocate() definition. Link: https://lkml.kernel.org/r/20230724183157.3939892-8-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Peng Zhang <zhangpeng.00@bytedance.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-18 10:12:48 -07:00
Liam R. Howlett	53bee98d00	mm: remove re-walk from mmap_region() Using vma_iter_set() will reset the tree and cause a re-walk. Use vmi_iter_config() to set the write to a sub-set of the range. Change the file case to also use vmi_iter_config() so that the end is correctly set. Link: https://lkml.kernel.org/r/20230724183157.3939892-7-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Peng Zhang <zhangpeng.00@bytedance.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-18 10:12:48 -07:00
Liam R. Howlett	c1297987cc	maple_tree: introduce __mas_set_range() mas_set_range() resets the node to MAS_START, which will cause a re-walk of the tree to the range. This is unnecessary when the maple state is already at the correct location of the write. Add a function that only sets the range to avoid unnecessary re-walking of the tree. Link: https://lkml.kernel.org/r/20230724183157.3939892-6-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Peng Zhang <zhangpeng.00@bytedance.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-18 10:12:48 -07:00
Liam R. Howlett	445a2ea0ef	mm: remove prev check from do_vmi_align_munmap() If the prev does not exist, the vma iterator will be set to MAS_NONE, which will be treated as a MAS_START when the mas_next or mas_find is used. In this case, the next caller will be the vma iterator, which uses mas_find() under the hood and will now do what the user expects. Link: https://lkml.kernel.org/r/20230724183157.3939892-5-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Peng Zhang <zhangpeng.00@bytedance.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-18 10:12:47 -07:00
Liam R. Howlett	fd892593d4	mm: change do_vmi_align_munmap() tracking of VMAs to remove The majority of the calls to munmap a vm range is within a single vma. The maple tree is able to store a single entry at 0, with a size of 1 as a pointer and avoid any allocations. Change do_vmi_align_munmap() to store the VMAs being munmap()'ed into a tree indexed by the count. This will leverage the ability to store the first entry without a node allocation. Storing the entries into a tree by the count and not the vma start and end means changing the functions which iterate over the entries. Update unmap_vmas() and free_pgtables() to take a maple state and a tree end address to support this functionality. Passing through the same maple state to unmap_vmas() and free_pgtables() means the state needs to be reset between calls. This happens in the static unmap_region() and exit_mmap(). Link: https://lkml.kernel.org/r/20230724183157.3939892-4-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Peng Zhang <zhangpeng.00@bytedance.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-18 10:12:47 -07:00
Liam R. Howlett	8c314f3b55	maple_tree: add benchmarking for mas_prev() Add some benchmarking functions in testing for mas_prev(). This is useful to ensure there are no regressions added during modifications. Link: https://lkml.kernel.org/r/20230724183157.3939892-3-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Peng Zhang <zhangpeng.00@bytedance.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-18 10:12:47 -07:00
Liam R. Howlett	361c678be7	maple_tree: add benchmarking for mas_for_each Patch series "Reduce preallocations for maple tree", v3. Initial work on preallocations showed no regression in performance during testing, but recently some users (both on [1] and off [android] list) have reported that preallocating the worst-case number of nodes has caused some slow down. This patch set addresses the number of allocations in a few ways. During munmap() most munmap() operations will remove a single VMA, so leverage the fact that the maple tree can place a single pointer at range 0 - 0 without allocating. This is done by changing the index of the VMAs to be indexed by the count, starting at 0. Re-introduce the entry argument to mas_preallocate() so that a more intelligent guess of the node count can be made. Implement the more intelligent guess of the node count, although there is more work to be done. During development of v2 of this patch set, I also noticed that the number of nodes being allocated for a rebalance was beyond what could possibly be needed. This is addressed in patch 0008. This patch (of 15): Add a way to test the speed of mas_for_each() to the testing code. Link: https://lkml.kernel.org/r/20230724183157.3939892-1-Liam.Howlett@oracle.com Link: https://lkml.kernel.org/r/20230724183157.3939892-2-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Peng Zhang <zhangpeng.00@bytedance.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-18 10:12:47 -07:00
Jann Horn	90717566f8	mm: don't drop VMA locks in mm_drop_all_locks() Despite its name, mm_drop_all_locks() does not drop _all_ locks; the mmap lock is held write-locked by the caller, and the caller is responsible for dropping the mmap lock at a later point (which will also release the VMA locks). Calling vma_end_write_all() here is dangerous because the caller might have write-locked a VMA with the expectation that it will stay write-locked until the mmap_lock is released, as usual. This _almost_ becomes a problem in the following scenario: An anonymous VMA A and an SGX VMA B are mapped adjacent to each other. Userspace calls munmap() on a range starting at the start address of A and ending in the middle of B. Hypothetical call graph with additional notes in brackets: do_vmi_align_munmap [begin first for_each_vma_range loop] vma_start_write [on VMA A] vma_mark_detached [on VMA A] __split_vma [on VMA B] sgx_vma_open [== new->vm_ops->open] sgx_encl_mm_add __mmu_notifier_register [luckily THIS CAN'T ACTUALLY HAPPEN] mm_take_all_locks mm_drop_all_locks vma_end_write_all [drops VMA lock taken on VMA A before] vma_start_write [on VMA B] vma_mark_detached [on VMA B] [end first for_each_vma_range loop] vma_iter_clear_gfp [removes VMAs from maple tree] mmap_write_downgrade unmap_region mmap_read_unlock In this hypothetical scenario, while do_vmi_align_munmap() thinks it still holds a VMA write lock on VMA A, the VMA write lock has actually been invalidated inside __split_vma(). The call from sgx_encl_mm_add() to __mmu_notifier_register() can't actually happen here, as far as I understand, because we are duplicating an existing SGX VMA, but sgx_encl_mm_add() only calls __mmu_notifier_register() for the first SGX VMA created in a given process. So this could only happen in fork(), not on munmap(). But in my view it is just pure luck that this can't happen. Also, we wouldn't actually have any bad consequences from this in do_vmi_align_munmap(), because by the time the bug drops the lock on VMA A, we've already marked VMA A as detached, which makes it completely ineligible for any VMA-locked page faults. But again, that's just pure luck. So remove the vma_end_write_all(), so that VMA write locks are only ever released on mmap_write_unlock() or mmap_write_downgrade(). Also add comments to document the locking rules established by this patch. Link: https://lkml.kernel.org/r/20230720193436.454247-1-jannh@google.com Fixes: `eeff9a5d47` ("mm/mmap: prevent pagefault handler from racing with mmu_notifier registration") Signed-off-by: Jann Horn <jannh@google.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-18 10:12:46 -07:00
ZhangPeng	98630cfdc4	mm/page_io: convert bio_associate_blkg_from_page() to take in a folio Convert bio_associate_blkg_from_page() to take in a folio. We can remove two implicit calls to compound_head() by taking in a folio. Link: https://lkml.kernel.org/r/20230721034451.16412-11-zhangpeng362@huawei.com Signed-off-by: ZhangPeng <zhangpeng362@huawei.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Nanyong Sun <sunnanyong@huawei.com> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-18 10:12:46 -07:00
ZhangPeng	9b72b134ee	mm/page_io: convert count_swpout_vm_event() to take in a folio Convert count_swpout_vm_event() to take in a folio. We can remove five implicit calls to compound_head() by taking in a folio. Link: https://lkml.kernel.org/r/20230721034451.16412-10-zhangpeng362@huawei.com Signed-off-by: ZhangPeng <zhangpeng362@huawei.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nanyong Sun <sunnanyong@huawei.com> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-18 10:12:46 -07:00
ZhangPeng	2675251d50	mm/page_io: use a folio in swap_writepage_bdev_async() Saves one implicit call to compound_head(). Link: https://lkml.kernel.org/r/20230721034451.16412-9-zhangpeng362@huawei.com Signed-off-by: ZhangPeng <zhangpeng362@huawei.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Nanyong Sun <sunnanyong@huawei.com> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-18 10:12:46 -07:00
ZhangPeng	f54fcaabd3	mm/page_io: use a folio in swap_writepage_bdev_sync() Saves one implicit call to compound_head(). Link: https://lkml.kernel.org/r/20230721034451.16412-8-zhangpeng362@huawei.com Signed-off-by: ZhangPeng <zhangpeng362@huawei.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Nanyong Sun <sunnanyong@huawei.com> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-18 10:12:46 -07:00
ZhangPeng	6a8c068774	mm/page_io: use a folio in sio_read_complete() Saves one implicit call to compound_head(). Link: https://lkml.kernel.org/r/20230721034451.16412-7-zhangpeng362@huawei.com Signed-off-by: ZhangPeng <zhangpeng362@huawei.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nanyong Sun <sunnanyong@huawei.com> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-18 10:12:45 -07:00
ZhangPeng	bc74b53f29	mm/page_io: use a folio in __end_swap_bio_read() Saves one implicit call to compound_head(). Link: https://lkml.kernel.org/r/20230721034451.16412-6-zhangpeng362@huawei.com Signed-off-by: ZhangPeng <zhangpeng362@huawei.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nanyong Sun <sunnanyong@huawei.com> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-18 10:12:45 -07:00
ZhangPeng	a3ed1e9b63	mm/page_io: use a folio in __end_swap_bio_write() Saves two implicit call to compound_head(). Link: https://lkml.kernel.org/r/20230721034451.16412-5-zhangpeng362@huawei.com Signed-off-by: ZhangPeng <zhangpeng362@huawei.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nanyong Sun <sunnanyong@huawei.com> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-18 10:12:45 -07:00
ZhangPeng	6d2790d95d	mm/page_io: introduce bio_first_folio_all() Introduce bio_first_folio_all() to return a folio, which makes it easier to use. Link: https://lkml.kernel.org/r/20230721034451.16412-4-zhangpeng362@huawei.com Signed-off-by: ZhangPeng <zhangpeng362@huawei.com> Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Nanyong Sun <sunnanyong@huawei.com> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-18 10:12:45 -07:00
ZhangPeng	9962ed64bd	mm/page_io: remove unneeded SetPageError() Nobody checks the PageError()/folio_test_error() for the page/folio in __end_swap_bio_read/write() and sio_write_complete(). Therefore, we don't need to set the error flag. Just drop it. Link: https://lkml.kernel.org/r/20230721034451.16412-3-zhangpeng362@huawei.com Signed-off-by: ZhangPeng <zhangpeng362@huawei.com> Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Nanyong Sun <sunnanyong@huawei.com> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-18 10:12:44 -07:00
ZhangPeng	479c330491	mm/page_io: remove unneeded ClearPageUptodate() Patch series "Convert several functions in page_io.c to use a folio", v4. Convert several functions in page_io.c to use a folio, which can remove several implicit calls to compound_head(). This patch (of 10): The VM_BUG_ON_FOLIO in swap_readpage() ensures that the page is already !uptodate in __end_swap_bio_read() and sio_read_complete(). Just remove unneeded ClearPageUptodate(). Link: https://lkml.kernel.org/r/20230721034451.16412-1-zhangpeng362@huawei.com Link: https://lkml.kernel.org/r/20230721034451.16412-2-zhangpeng362@huawei.com Signed-off-by: ZhangPeng <zhangpeng362@huawei.com> Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Nanyong Sun <sunnanyong@huawei.com> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-18 10:12:44 -07:00
Kemeng Shi	3c099a2b0b	mm/compaction: avoid unneeded pageblock_end_pfn when no_set_skip_hint is set Move pageblock_end_pfn after no_set_skip_hint check to avoid unneeded pageblock_end_pfn if no_set_skip_hint is set. Link: https://lkml.kernel.org/r/20230721150957.2058634-3-shikemeng@huawei.com Signed-off-by: Kemeng Shi <shikemeng@huawei.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-18 10:12:44 -07:00
Kemeng Shi	e6bd14eca2	mm/compaction: correct comment of candidate pfn in fast_isolate_freepages Patch series "Two minor cleanups for compaction", v2. This series contains two random cleanups for compaction. This patch (of 2): If no preferred one was not found, we will use candidate page with maximum pfn > min_pfn which is saved in high_pfn. Correct "minimum" to "maximum candidate" in comment. Link: https://lkml.kernel.org/r/20230721150957.2058634-1-shikemeng@huawei.com Link: https://lkml.kernel.org/r/20230721150957.2058634-2-shikemeng@huawei.com Signed-off-by: Kemeng Shi <shikemeng@huawei.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-18 10:12:44 -07:00
Miaohe Lin	eafcb7a972	mm/mprotect: fix obsolete function name in change_pte_range() Since commit `79a1971c5f` ("mm: move the copy_one_pte() pte_present check into the caller"), the explanation of preserving soft-dirtiness is moved into copy_nonpresent_pte(). Update corresponding comment. Link: https://lkml.kernel.org/r/20230723033114.3224409-1-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-18 10:12:43 -07:00
Ryan Roberts	05f1edac80	selftests/mm: run all tests from run_vmtests.sh It is very unclear to me how one is supposed to run all the mm selftests consistently and get clear results. Most of the test programs are launched by both run_vmtests.sh and run_kselftest.sh: hugepage-mmap hugepage-shm map_hugetlb hugepage-mremap hugepage-vmemmap hugetlb-madvise map_fixed_noreplace gup_test gup_longterm uffd-unit-tests uffd-stress compaction_test on-fault-limit map_populate mlock-random-test mlock2-tests mrelease_test mremap_test thuge-gen virtual_address_range va_high_addr_switch mremap_dontunmap hmm-tests madv_populate memfd_secret ksm_tests ksm_functional_tests soft-dirty cow However, of this set, when launched by run_vmtests.sh, some of the programs are invoked multiple times with different arguments. When invoked by run_kselftest.sh, they are invoked without arguments (and as a consequence, some fail immediately). Some test programs are only launched by run_vmtests.sh: test_vmalloc.sh And some test programs and only launched by run_kselftest.sh: khugepaged migration mkdirty transhuge-stress split_huge_page_test mdwe_test write_to_hugetlbfs Furthermore, run_vmtests.sh is invoked by run_kselftest.sh, so in this case all the test programs invoked by both scripts are run twice! Needless to say, this is a bit of a mess. In the absence of fully understanding the history here, it looks to me like the best solution is to launch ALL test programs from run_vmtests.sh, and ONLY invoke run_vmtests.sh from run_kselftest.sh. This way, we get full control over the parameters, each program is only invoked the intended number of times, and regardless of which script is used, the same tests get run in the same way. The only drawback is that if using run_kselftest.sh, it's top-level tap result reporting reports only a single test and it fails if any of the contained tests fail. I don't see this as a big deal though since we still see all the nested reporting from multiple layers. The other issue with this is that all of run_vmtests.sh must execute within a single kselftest timeout period, so let's increase that to something more suitable. In the Makefile, TEST_GEN_PROGS will compile and install the tests and will add them to the list of tests that run_kselftest.sh will run. TEST_GEN_FILES will compile and install the tests but will not add them to the test list. So let's move all the programs from TEST_GEN_PROGS to TEST_GEN_FILES so that they are built but not executed by run_kselftest.sh. Note that run_vmtests.sh is added to TEST_PROGS, which means it ends up in the test list. (the lack of "_GEN" means it won't be compiled, but simply copied). Link: https://lkml.kernel.org/r/20230724082522.1202616-9-ryan.roberts@arm.com Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Peter Xu <peterx@redhat.com> Cc: Florent Revest <revest@chromium.org> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Mark Brown <broonie@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-18 10:12:43 -07:00
Ryan Roberts	e170621027	selftests/mm: optionally pass duration to transhuge-stress Until now, transhuge-stress runs until its explicitly killed, so when invoked by run_kselftest.sh, it would run until the test timeout, then it would be killed and the test would be marked as failed. Add a new, optional command line parameter that allows the user to specify the duration in seconds that the program should run. The program exits after this duration with a success (0) exit code. If the argument is omitted the old behacvior remains. On it's own, this doesn't quite solve our problem because run_kselftest.sh does not allow passing parameters to the program under test. But we will shortly move this to run_vmtests.sh, which does allow parameter passing. Link: https://lkml.kernel.org/r/20230724082522.1202616-8-ryan.roberts@arm.com Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Florent Revest <revest@chromium.org> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Mark Brown <broonie@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-18 10:12:43 -07:00
Ryan Roberts	0003033297	selftests/mm: make migration test robust to failure The `migration` test currently has a number of robustness problems that cause it to hang and leak resources. Timeout: There are 3 tests, which each previously ran for 60 seconds. However, the timeout in mm/settings for a single test binary was set to 45 seconds. So when run using run_kselftest.sh, the top level timeout would trigger before the test binary was finished. Solve this by meeting in the middle; each of the 3 tests now runs for 20 seconds (for a total of 60), and the top level timeout is set to 90 seconds. Leaking child processes: the `shared_anon` test fork()s some children but then an ASSERT() fires before the test kills those children. The assert causes immediate exit of the parent and leaking of the children. Furthermore, if run using the run_kselftest.sh wrapper, the wrapper would get stuck waiting for those children to exit, which never happens. Solve this by setting the "parent death signal" to SIGHUP in the child, so that the child is killed automatically if the parent dies. With these changes, the test binary now runs to completion on arm64, with 2 tests passing and the `shared_anon` test failing. Link: https://lkml.kernel.org/r/20230724082522.1202616-7-ryan.roberts@arm.com Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Florent Revest <revest@chromium.org> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Mark Brown <broonie@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-18 10:12:43 -07:00
Ryan Roberts	49f09526b1	selftests/mm: va_high_addr_switch should skip unsupported arm64 configs va_high_addr_switch has a mechanism to determine if the tests should be run or skipped (supported_arch()). This currently returns unconditionally true for arm64. However, va_high_addr_switch also requires a large virtual address space for the tests to run, otherwise they spuriously fail. Since arm64 can only support VA > 48 bits when the page size is 64K, let's decide whether we should skip the test suite based on the page size. This reduces noise when running on 4K and 16K kernels. Link: https://lkml.kernel.org/r/20230724082522.1202616-6-ryan.roberts@arm.com Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Florent Revest <revest@chromium.org> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Mark Brown <broonie@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-18 10:12:43 -07:00
Ryan Roberts	6e16f51335	selftests/mm: fix thuge-gen test bugs thuge-gen was previously only munmapping part of the mmapped buffer, which caused us to run out of 1G huge pages for a later part of the test. Fix this by munmapping the whole buffer. Based on the code, it looks like a typo rather than an intention to keep some of the buffer mapped. thuge-gen was also calling mmap with SHM_HUGETLB flag (bit 11 set), which is actually MAP_DENYWRITE in mmap context. The man page says this flag is ignored in modern kernels. I'm pretty sure from the context that the author intended to pass the MAP_HUGETLB flag so I've fixed that up too. Link: https://lkml.kernel.org/r/20230724082522.1202616-5-ryan.roberts@arm.com Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Florent Revest <revest@chromium.org> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Mark Brown <broonie@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-18 10:12:42 -07:00
Ryan Roberts	e515bce98d	selftests/mm: enable mrelease_test for arm64 mrelease_test defaults to defining __NR_pidfd_open and __NR_process_mrelease syscall numbers to -1, if they are not defined anywhere else, and the suite would then be marked as skipped as a result. arm64 (at least the stock debian toolchain that I'm using) requires including <sys/syscall.h> to pull in the defines for these syscalls. So let's add this header. With this in place, the test is passing on arm64. Link: https://lkml.kernel.org/r/20230724082522.1202616-4-ryan.roberts@arm.com Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Florent Revest <revest@chromium.org> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Mark Brown <broonie@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-18 10:12:42 -07:00
Ryan Roberts	f6dd4e223d	selftests/mm: skip soft-dirty tests on arm64 arm64 does not support the soft-dirty PTE bit. However, the `soft-dirty` test suite is currently run unconditionally and therefore generates spurious test failures on arm64. There are also some tests in `madv_populate` which assume it is supported. For `soft-dirty` lets disable the whole suite for arm64; it is no longer built and run_vmtests.sh will skip it if its not present. For `madv_populate`, we need a runtime mechanism so that the remaining tests continue to be run. Unfortunately, the only way to determine if the soft-dirty dirty bit is supported is to write to a page, then see if the bit is set in /proc/self/pagemap. But the tests that we want to conditionally execute are testing precicesly this. So if we introduced this feature check, we could accedentally turn a real failure (on a system that claims to support soft-dirty) into a skip. So instead, do the check based on architecture; for arm64, we report that soft-dirty is not supported. Link: https://lkml.kernel.org/r/20230724082522.1202616-3-ryan.roberts@arm.com Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Florent Revest <revest@chromium.org> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Mark Brown <broonie@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-18 10:12:42 -07:00
Ryan Roberts	58e2847ad2	selftests: line buffer test program's stdout Patch series "selftests/mm fixes for arm64", v3. Given my on-going work on large anon folios and contpte mappings, I decided it would be a good idea to start running mm selftests to help guard against regressions. However, it soon became clear that I couldn't get the suite to run cleanly on arm64 with a vanilla v6.5-rc1 kernel (perhaps I'm just doing it wrong??), so got stuck in a rabbit hole trying to debug and fix all the issues. Some were down to misconfigurations, but I also found a number of issues with the tests and even a couple of issues with the kernel. This patch (of 8): The selftests runner pipes the test program's stdout to tap_prefix. The presence of the pipe means that the test program sets its stdout to be fully buffered (as aposed to line buffered when directly connected to the terminal). The block buffering means that there is often content in the buffer at fork() time, which causes the output to end up duplicated. This was causing problems for mm:cow where test results were duplicated 20-30x. Solve this by using `stdbuf`, when available to force the test program to use line buffered mode. This means previously printf'ed results are flushed out of the program before any fork(). Additionally, explicitly set line buffer mode in ksft_print_header(), which means that all test programs that use the ksft framework will benefit even if stdbuf is not present on the system. [ryan.roberts@arm.com: add setvbuf() to set buffering mode] Link: https://lkml.kernel.org/r/20230726070655.2713530-1-ryan.roberts@arm.com Link: https://lkml.kernel.org/r/20230724082522.1202616-1-ryan.roberts@arm.com Link: https://lkml.kernel.org/r/20230724082522.1202616-2-ryan.roberts@arm.com Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: Mark Brown <broonie@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Florent Revest <revest@chromium.org> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Peter Xu <peterx@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-18 10:12:42 -07:00
Miaohe Lin	ea09800bf1	mm: fix obsolete function name above debug_pagealloc_enabled_static() Since commit `04013513cc` ("mm, page_alloc: do not rely on the order of page_poison and init_on_alloc/free parameters"), init_debug_pagealloc() is converted to init_mem_debugging_and_hardening(). Later it's renamed to mem_debugging_and_hardening_init() via commit `f2fc4b44ec` ("mm: move init_mem_debugging_and_hardening() to mm/mm_init.c"). Link: https://lkml.kernel.org/r/20230720112806.3851893-1-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-18 10:12:41 -07:00
Alistair Popple	1af5a81099	mmu_notifiers: rename invalidate_range notifier There are two main use cases for mmu notifiers. One is by KVM which uses mmu_notifier_invalidate_range_start()/end() to manage a software TLB. The other is to manage hardware TLBs which need to use the invalidate_range() callback because HW can establish new TLB entries at any time. Hence using start/end() can lead to memory corruption as these callbacks happen too soon/late during page unmap. mmu notifier users should therefore either use the start()/end() callbacks or the invalidate_range() callbacks. To make this usage clearer rename the invalidate_range() callback to arch_invalidate_secondary_tlbs() and update documention. Link: https://lkml.kernel.org/r/6f77248cd25545c8020a54b4e567e8b72be4dca1.1690292440.git-series.apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Suggested-by: Jason Gunthorpe <jgg@nvidia.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Cc: Andrew Donnellan <ajd@linux.ibm.com> Cc: Chaitanya Kumar Borah <chaitanya.kumar.borah@intel.com> Cc: Frederic Barrat <fbarrat@linux.ibm.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kevin Tian <kevin.tian@intel.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Nicolin Chen <nicolinc@nvidia.com> Cc: Robin Murphy <robin.murphy@arm.com> Cc: Sean Christopherson <seanjc@google.com> Cc: SeongJae Park <sj@kernel.org> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> Cc: Will Deacon <will@kernel.org> Cc: Zhi Wang <zhi.wang.linux@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-18 10:12:41 -07:00
Alistair Popple	ec8832d007	mmu_notifiers: don't invalidate secondary TLBs as part of mmu_notifier_invalidate_range_end() Secondary TLBs are now invalidated from the architecture specific TLB invalidation functions. Therefore there is no need to explicitly notify or invalidate as part of the range end functions. This means we can remove mmu_notifier_invalidate_range_end_only() and some of the ptep_*_notify() functions. Link: https://lkml.kernel.org/r/90d749d03cbab256ca0edeb5287069599566d783.1690292440.git-series.apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Cc: Andrew Donnellan <ajd@linux.ibm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chaitanya Kumar Borah <chaitanya.kumar.borah@intel.com> Cc: Frederic Barrat <fbarrat@linux.ibm.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kevin Tian <kevin.tian@intel.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Nicolin Chen <nicolinc@nvidia.com> Cc: Robin Murphy <robin.murphy@arm.com> Cc: Sean Christopherson <seanjc@google.com> Cc: SeongJae Park <sj@kernel.org> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> Cc: Will Deacon <will@kernel.org> Cc: Zhi Wang <zhi.wang.linux@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-18 10:12:41 -07:00
Alistair Popple	6bbd42e2df	mmu_notifiers: call invalidate_range() when invalidating TLBs The invalidate_range() is going to become an architecture specific mmu notifier used to keep the TLB of secondary MMUs such as an IOMMU in sync with the CPU page tables. Currently it is called from separate code paths to the main CPU TLB invalidations. This can lead to a secondary TLB not getting invalidated when required and makes it hard to reason about when exactly the secondary TLB is invalidated. To fix this move the notifier call to the architecture specific TLB maintenance functions for architectures that have secondary MMUs requiring explicit software invalidations. This fixes a SMMU bug on ARM64. On ARM64 PTE permission upgrades require a TLB invalidation. This invalidation is done by the architecture specific ptep_set_access_flags() which calls flush_tlb_page() if required. However this doesn't call the notifier resulting in infinite faults being generated by devices using the SMMU if it has previously cached a read-only PTE in it's TLB. Moving the invalidations into the TLB invalidation functions ensures all invalidations happen at the same time as the CPU invalidation. The architecture specific flush_tlb_all() routines do not call the notifier as none of the IOMMUs require this. Link: https://lkml.kernel.org/r/0287ae32d91393a582897d6c4db6f7456b1001f2.1690292440.git-series.apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Suggested-by: Jason Gunthorpe <jgg@ziepe.ca> Tested-by: SeongJae Park <sj@kernel.org> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Tested-by: Luis Chamberlain <mcgrof@kernel.org> Cc: Andrew Donnellan <ajd@linux.ibm.com> Cc: Chaitanya Kumar Borah <chaitanya.kumar.borah@intel.com> Cc: Frederic Barrat <fbarrat@linux.ibm.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kevin Tian <kevin.tian@intel.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Nicolin Chen <nicolinc@nvidia.com> Cc: Robin Murphy <robin.murphy@arm.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> Cc: Will Deacon <will@kernel.org> Cc: Zhi Wang <zhi.wang.linux@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-18 10:12:41 -07:00
Alistair Popple	57b037dbba	mmu_notifiers: fixup comment in mmu_interval_read_begin() The comment in mmu_interval_read_begin() refers to a function that doesn't exist and uses the wrong call-back name. The op for mmu interval notifiers is mmu_interval_notifier_ops->invalidate() so fix the comment up to reflect that. Link: https://lkml.kernel.org/r/e7a09081b3ac82a03c189409f1262fc2df91071e.1690292440.git-series.apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Cc: Andrew Donnellan <ajd@linux.ibm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chaitanya Kumar Borah <chaitanya.kumar.borah@intel.com> Cc: Frederic Barrat <fbarrat@linux.ibm.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kevin Tian <kevin.tian@intel.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Nicolin Chen <nicolinc@nvidia.com> Cc: Robin Murphy <robin.murphy@arm.com> Cc: Sean Christopherson <seanjc@google.com> Cc: SeongJae Park <sj@kernel.org> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> Cc: Will Deacon <will@kernel.org> Cc: Zhi Wang <zhi.wang.linux@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-18 10:12:40 -07:00
Alistair Popple	38b14e2e3d	arm64/smmu: use TLBI ASID when invalidating entire range Patch series "Invalidate secondary IOMMU TLB on permission upgrade", v4. The main change is to move secondary TLB invalidation mmu notifier callbacks into the architecture specific TLB flushing functions. This makes secondary TLB invalidation mostly match CPU invalidation while still allowing efficient range based invalidations based on the existing TLB batching code. This patch (of 5): The ARM SMMU has a specific command for invalidating the TLB for an entire ASID. Currently this is used for the IO_PGTABLE API but not for ATS when called from the MMU notifier. The current implementation of notifiers does not attempt to invalidate such a large address range, instead walking each VMA and invalidating each range individually during mmap removal. However in future SMMU TLB invalidations are going to be sent as part of the normal flush_tlb_*() kernel calls. To better deal with that add handling to use TLBI ASID when invalidating the entire address space. Link: https://lkml.kernel.org/r/cover.1eca029b8603ef4eebe5b41eae51facfc5920c41.1690292440.git-series.apopple@nvidia.com Link: https://lkml.kernel.org/r/ba5f0ec5fbc2ab188797524d3687e075e2412a2b.1690292440.git-series.apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Cc: Andrew Donnellan <ajd@linux.ibm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chaitanya Kumar Borah <chaitanya.kumar.borah@intel.com> Cc: Frederic Barrat <fbarrat@linux.ibm.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kevin Tian <kevin.tian@intel.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Nicolin Chen <nicolinc@nvidia.com> Cc: Robin Murphy <robin.murphy@arm.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> Cc: Will Deacon <will@kernel.org> Cc: Zhi Wang <zhi.wang.linux@gmail.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-18 10:12:40 -07:00
Liam R. Howlett	19a462f06e	maple_tree: Be more strict about locking Use lockdep to check the write path in the maple tree holds the lock in write mode. Introduce mt_write_lock_is_held() to check if the lock is held for writing. Update the necessary checks for rcu_dereference_protected() to use the new write lock check. Link: https://lkml.kernel.org/r/20230714195551.894800-5-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oliver Sang <oliver.sang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-18 10:12:40 -07:00
Liam R. Howlett	02fdb25fb4	mm/mmap: change detached vma locking scheme Don't set the lock to the mm lock so that the detached VMA tree does not complain about being unlocked when the mmap_lock is dropped prior to freeing the tree. Introduce mt_on_stack() for setting the external lock to NULL only when LOCKDEP is used. Move the destroying of the detached tree outside the mmap lock all together. Link: https://lkml.kernel.org/r/20230719183142.ktgcmuj2pnlr3h3s@revolver Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oliver Sang <oliver.sang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-18 10:12:40 -07:00
Liam R. Howlett	134d153c93	maple_tree: relax lockdep checks for on-stack trees To support early release of the maple tree locks, do not lockdep check the lock if it is set to NULL. This is intended for the special case on-stack use of tracking entries and not for general use. Link: https://lkml.kernel.org/r/20230714195551.894800-3-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oliver Sang <oliver.sang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-18 10:12:39 -07:00
Liam R. Howlett	2574d5e4df	mm/mmap: clean up validate_mm() calls Patch series "More strict maple tree lockdep", v2. Linus asked for more strict maple tree lockdep checking [1] and for them to resume the normal path through Andrews tree. This series of patches adds checks to ensure the lock is held in write mode during the write path of the maple tree instead of checking if it's held at all. It also reduces the validate_mm() calls by consolidating into commonly used functions (patch 0001), and removes the necessity of holding the lock on the detached tree during munmap() operations. This patch (of 4): validate_mm() calls are too spread out and duplicated in numerous locations. Also, now that the stack write is done under the write lock, it is not necessary to validate the mm prior to write operations. Add a validate_mm() to the stack expansions, and to vma_complete() so that numerous others may be dropped. Note that vma_link() (and also insert_vm_struct() by call path) already call validate_mm(). vma_merge() also had an unnecessary call to vma_iter_free() since the logic change to abort earlier if no merging is necessary. Drop extra validate_mm() calls at the start of functions and error paths which won't write to the tree. Relocate the validate_mm() call in the do_brk_flags() to avoid re-running the same test when vma_complete() is used. The call within the error path of mmap_region() is left intentionally because of the complexity of the function and the potential of drivers modifying the tree. Link: https://lkml.kernel.org/r/20230714195551.894800-1-Liam.Howlett@oracle.com Link: https://lkml.kernel.org/r/20230714195551.894800-2-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oliver Sang <oliver.sang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-18 10:12:39 -07:00
Sidhartha Kumar	affd26b1fb	mm/hugetlb: get rid of page_hstate() Convert the last page_hstate() user to use folio_hstate() so page_hstate() can be safely removed. Link: https://lkml.kernel.org/r/20230719184145.301911-1-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-18 10:12:39 -07:00

1 2 3 4 5 ...

1201068 Commits