Commit e73ad5ff2f ("mm, x86/mm: Make the batched unmap TLB flush API
more generic") moved the trace_tlb_flush out of mm/rmap.c and back into
x86 specific architecture, but it kept the include to the events/tlb.h
file, even though it didn't use that event.
Then another commit came in and added more events to the mm/rmap.c file
and moved the #define CREATE_TRACE_POINTS define from the x86 specific
architecture to the generic mm/rmap.h file to create both the tlb_flush
tracepoint and the new tracepoints.
But since the tlb_flush tracepoint is only x86 specific, it now creates
that tracepoint for all other architectures and this wastes approximately
5K of text and meta data that will not be used.
Remove the events/tlb.h from mm/rmap.c and add the define
CREATE_TRACE_POINTS back in the x86 code.
Link: https://lkml.kernel.org/r/20250612100313.3b9a8b80@batman.local.home
Fixes: e73ad5ff2f ("mm, x86/mm: Make the batched unmap TLB flush API more generic")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Betkov <bp@alien8.de>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleinxer <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Now tmpfs enables i_version by default and tmpfs does not modify it. But
SB_I_VERSION can also be modified via sb_flags, and reconfigure_super()
always overwrites the existing flags with the latest ones. This means
that if tmpfs is remounted without specifying iversion, the default
i_version will be unexpectedly disabled.
To ensure iversion remains enabled, SB_I_VERSION is now always set for
fc->sb_flags in shmem_init_fs_context(), instead of for sb->s_flags in
shmem_fill_super().
Link: https://lkml.kernel.org/r/20250819061803.1496443-1-libaokun@huaweicloud.com
Fixes: 36f05cab0a ("tmpfs: add support for an i_version counter")
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
When reading data exceeding the maximum IO size, the operation is split
into multiple IO requests, but the data isn't immediately copied to
userspace after each IO completion.
For example, when reading 2560k data from a device with 1280k maximum IO
size, the following sequence occurs:
1. read 1280k
2. copy 41 pages and issue read ahead for next 1280k
3. copy 31 pages to user buffer
4. wait the next 1280k
5. copy 8 pages to user buffer
6. copy 20 folios(64k) to user buffer
The 8 pages in step 5 are copied after the second 1280k completes(step 4)
due to waiting for a non-uptodate folio in filemap_update_page. We can
copy the 8 pages before the second 1280k completes(step 4) to reduce the
latency of this read operation.
After applying the patch, these 8 pages will be copied before the next IO
completes:
1. read 1280k
2. copy 41 pages and issue read ahead for next 1280k
3. copy 31 pages to user buffer
4. copy 8 pages to user buffer
5. wait the next 1280k
6. copy 20 folios(64k) to user buffer
This patch drops a setting of IOCB_NOWAIT for AIO, which is fine because
filemap_read will set it again for AIO.
The final solution provided by Matthew Wilcox:
Link: https://lore.kernel.org/linux-fsdevel/aIDy076Sxt544qja@casper.infradead.org/
Link: https://lkml.kernel.org/r/20250728083952.75518-3-chizhiling@163.com
Signed-off-by: Chi Zhiling <chizhiling@kylinos.cn>
Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "Tiny optimization for large read operations".
This series contains two patches,
1. Skip calling is_partially_uptodate for entire folio to save time, I
have reviewed the mpage and iomap implementations and didn't spot any
issues, but this change likely needs more thorough review.
2. Skip calling filemap_uptodate if there are ready folios in the
batch, This might save a few milliseconds in practice, but I didn't
observe measurable improvements in my tests.
This patch (of 2):
When a folio is marked as non-uptodate, it means the folio contains some
non-uptodate data. Therefore, calling is_partially_uptodate() to recheck
the entire folio is redundant.
If all data in a folio is actually up-to-date but the folio lacks the
uptodate flag, it will still be treated as non-uptodate in many other
places. Thus, there should be no special case handling for filemap.
Link: https://lkml.kernel.org/r/20250728083952.75518-1-chizhiling@163.com
Link: https://lkml.kernel.org/r/20250728083952.75518-2-chizhiling@163.com
Signed-off-by: Chi Zhiling <chizhiling@kylinos.cn>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
In the zone_reclaimable_pages() function, if the page counts for
NR_ZONE_INACTIVE_FILE, NR_ZONE_ACTIVE_FILE, NR_ZONE_INACTIVE_ANON, and
NR_ZONE_ACTIVE_ANON are all zero, the function returns the number of free
pages as the result.
In this case, when should_reclaim_retry() calculates reclaimable pages, it
will inadvertently double-count the free pages in its accounting.
static inline bool
should_reclaim_retry(gfp_t gfp_mask, unsigned order,
struct alloc_context *ac, int alloc_flags,
bool did_some_progress, int *no_progress_loops)
{
...
available = reclaimable = zone_reclaimable_pages(zone);
available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
This may result in an increase in the number of retries of
__alloc_pages_slowpath(), causing increased kswapd load.
Link: https://lkml.kernel.org/r/20250812070210.1624218-1-liuqiqi@kylinos.cn
Fixes: 6aaced5abd ("mm: vmscan: account for free pages to prevent infinite Loop in throttle_direct_reclaim()")
Signed-off-by: liuqiqi <liuqiqi@kylinos.cn>
Reviewed-by: Ye Liu <liuye@kylinos.cn>
Cc: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "Add and use memdesc_flags_t".
At some point struct page will be separated from struct slab and struct
folio. This is a step towards that by introducing a type for the 'flags'
word of all three structures. This gives us a certain amount of type
safety by establishing that some of these unsigned longs are different
from other unsigned longs in that they contain things like node ID,
section number and zone number in the upper bits. That lets us have
functions that can be easily called by anyone who has a slab, folio or
page (but not easily by anyone else) to get the node or zone.
There's going to be some unusual merge problems with this as some odd bits
of the kernel decide they want to print out the flags value or something
similar by writing page->flags and now they'll need to write page->flags.f
instead. That's most of the churn here. Maybe we should be removing
these things from the debug output?
This patch (of 11):
Wrap the unsigned long flags in a typedef. In upcoming patches, this will
provide a strong hint that you can't just pass a random unsigned long to
functions which take this as an argument.
[willy@infradead.org: s/flags/flags.f/ in several architectures]
Link: https://lkml.kernel.org/r/aKMgPRLD-WnkPxYm@casper.infradead.org
[nicola.vetrini@gmail.com: mips: fix compilation error]
Link: https://lore.kernel.org/lkml/CA+G9fYvkpmqGr6wjBNHY=dRp71PLCoi2341JxOudi60yqaeUdg@mail.gmail.com/
Link: https://lkml.kernel.org/r/20250825214245.1838158-1-nicola.vetrini@gmail.com
Link: https://lkml.kernel.org/r/20250805172307.1302730-1-willy@infradead.org
Link: https://lkml.kernel.org/r/20250805172307.1302730-2-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Zi Yan <ziy@nvidia.com>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The test will set the global system THP setting to never, madvise or
always depending on the fixture variant and the 2M setting to inherit
before it starts (and reset to original at teardown). The fixture setup
will also test if PR_SET_THP_DISABLE prctl call can be made with
PR_THP_DISABLE_EXCEPT_ADVISED and skip if it fails.
This tests if the process can:
- successfully get the policy to disable THPs expect for madvise.
- get hugepages only on MADV_HUGE and MADV_COLLAPSE if the global policy
is madvise/always and only with MADV_COLLAPSE if the global policy is
never.
- successfully reset the policy of the process.
- after reset, only get hugepages with:
- MADV_COLLAPSE when policy is set to never.
- MADV_HUGE and MADV_COLLAPSE when policy is set to madvise.
- always when policy is set to "always".
- never get a THP with MADV_NOHUGEPAGE.
- repeat the above tests in a forked process to make sure the policy is
carried across forks.
Test results:
./prctl_thp_disable
TAP version 13
1..12
ok 1 prctl_thp_disable_completely.never.nofork
ok 2 prctl_thp_disable_completely.never.fork
ok 3 prctl_thp_disable_completely.madvise.nofork
ok 4 prctl_thp_disable_completely.madvise.fork
ok 5 prctl_thp_disable_completely.always.nofork
ok 6 prctl_thp_disable_completely.always.fork
ok 7 prctl_thp_disable_except_madvise.never.nofork
ok 8 prctl_thp_disable_except_madvise.never.fork
ok 9 prctl_thp_disable_except_madvise.madvise.nofork
ok 10 prctl_thp_disable_except_madvise.madvise.fork
ok 11 prctl_thp_disable_except_madvise.always.nofork
ok 12 prctl_thp_disable_except_madvise.always.fork
[usamaarif642@gmail.com: return after executing test in child process]
Link: https://lkml.kernel.org/r/3dca2de4-9a6a-4efe-a86c-83f9509831fc@gmail.com
Link: https://lkml.kernel.org/r/20250815135549.130506-8-usamaarif642@gmail.com
Signed-off-by: Usama Arif <usamaarif642@gmail.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Jann Horn <jannh@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Mariano Pache <npache@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Yafang <laoar.shao@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
When determining which THP orders are eligible for a VMA mapping, we have
previously specified tva_flags, however it turns out it is really not
necessary to treat these as flags.
Rather, we distinguish between distinct modes.
The only case where we previously combined flags was with
TVA_ENFORCE_SYSFS, but we can avoid this by observing that this is the
default, except for MADV_COLLAPSE or an edge cases in
collapse_pte_mapped_thp() and hugepage_vma_revalidate(), and adding a mode
specifically for this case - TVA_FORCED_COLLAPSE.
We have:
* smaps handling for showing "THPeligible"
* Pagefault handling
* khugepaged handling
* Forced collapse handling: primarily MADV_COLLAPSE, but also for
an edge case in collapse_pte_mapped_thp()
Disregarding the edge cases, we only want to ignore sysfs settings only
when we are forcing a collapse through MADV_COLLAPSE, otherwise we want to
enforce it, hence this patch does the following flag to enum conversions:
* TVA_SMAPS | TVA_ENFORCE_SYSFS -> TVA_SMAPS
* TVA_IN_PF | TVA_ENFORCE_SYSFS -> TVA_PAGEFAULT
* TVA_ENFORCE_SYSFS -> TVA_KHUGEPAGED
* 0 -> TVA_FORCED_COLLAPSE
With this change, we immediately know if we are in the forced collapse
case, which will be valuable next.
Link: https://lkml.kernel.org/r/20250815135549.130506-3-usamaarif642@gmail.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Usama Arif <usamaarif642@gmail.com>
Acked-by: Usama Arif <usamaarif642@gmail.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Barry Song <baohua@kernel.org>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Jann Horn <jannh@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Mariano Pache <npache@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Yafang <laoar.shao@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>