z3fold always does atomic64_dec(&pool->pages_nr) whenever
__release_z3fold_page() is called. Thus we can move the decrement of
pool->pages_nr into __release_z3fold_page() to simplify the code.
This also reduces the size of z3fold.o by about 0.5k (528 bytes).
Without this patch:
   text    data     bss     dec     hex filename
  15444    1376       8   16828    41bc mm/z3fold.o
With this patch:
   text    data     bss     dec     hex filename
  15044    1248       8   16300    3fac mm/z3fold.o
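The refactoring described above can be illustrated with a minimal user-space sketch. This is not the z3fold code: pool_model, page_model and release_page_model() are hypothetical stand-ins, and C11 atomics replace the kernel's atomic64_t. It only shows the idea of dropping the counter inside the one function that every release path reaches.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical user-space stand-ins for the z3fold pool and page header. */
struct pool_model {
    atomic_long pages_nr;   /* number of pages currently owned by the pool */
};

struct page_model {
    struct pool_model *pool;
    bool released;
};

/*
 * Every release path ends here, so the counter is decremented in exactly
 * one place instead of after each call site.
 */
static void release_page_model(struct page_model *page)
{
    page->released = true;
    atomic_fetch_sub(&page->pool->pages_nr, 1);
}
```

With the decrement folded in, callers no longer need a separate statement after each release call, which is presumably where the object-size reduction comes from.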
Link: https://lkml.kernel.org/r/20220308134311.59086-7-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The feature of minimizing the overhead of the struct pages associated with
each HugeTLB page frees its vmemmap pages (used as struct page) to save
memory, which amounts to ~14GB/16GB per 1TB of HugeTLB pages (2MB/1GB
type). In short, when a HugeTLB page is allocated or freed, the vmemmap
array representing the range associated with the page needs to be remapped.
When a page is allocated, vmemmap pages are freed after remapping. When a
page is freed, the previously discarded vmemmap pages must be allocated
before remapping. More implementation details can be found in [1].
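The ~14GB/16GB figures can be reproduced with a short calculation. The sketch below is only illustrative and rests on three assumptions that the text above does not state: 4KB base pages, a 64-byte struct page, and one vmemmap page kept per HugeTLB page. The function name vmemmap_saved_bytes() is hypothetical.

```c
#include <stdint.h>

#define BASE_PAGE_SIZE   4096ULL  /* assumption: 4KB base pages */
#define STRUCT_PAGE_SIZE 64ULL    /* assumption: sizeof(struct page) == 64 */
#define VMEMMAP_KEPT     1ULL     /* assumption: one vmemmap page kept per HugeTLB page */

/* Bytes of vmemmap freed for `total` bytes of HugeTLB pages of size `hsize`. */
static uint64_t vmemmap_saved_bytes(uint64_t total, uint64_t hsize)
{
    uint64_t struct_pages = hsize / BASE_PAGE_SIZE;
    uint64_t vmemmap_pages = struct_pages * STRUCT_PAGE_SIZE / BASE_PAGE_SIZE;
    uint64_t freed_pages = vmemmap_pages - VMEMMAP_KEPT;

    return (total / hsize) * freed_pages * BASE_PAGE_SIZE;
}
```

Under these assumptions a 2MB HugeTLB page has 8 vmemmap pages, 7 of which are freed, giving 14GB per 1TB; a 1GB page has 4096 vmemmap pages, 4095 of which are freed, giving just under 16GB per 1TB.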
The infrastructure for freeing the vmemmap pages associated with each
HugeTLB page is already there, so we can easily enable
HUGETLB_PAGE_FREE_VMEMMAP for arm64. The only thing to be fixed is
flush_dcache_page(). It needs to be adapted to operate on the head page's
flags, since the tail vmemmap pages are mapped read-only after the feature
is enabled (a clear operation is not permitted).
There was some discussion about this in the thread [2], but it reached no
conclusion. I have copied the concerns raised by Anshuman here and explain
why they are unfounded. It is safe to enable the feature for x86_64 as
well as arm64.
1st concern:
'''
But what happens when a hot remove section's vmemmap area (which is
being teared down) is nearby another vmemmap area which is either created
or being destroyed for HugeTLB alloc/free purpose. As you mentioned
HugeTLB pages inside the hot remove section might be safe. But what about
other HugeTLB areas whose vmemmap area shares page table entries with
vmemmap entries for a section being hot removed ? Massive HugeTLB alloc
/use/free test cycle using memory just adjacent to a memory hotplug area,
which is always added and removed periodically, should be able to expose
this problem.
'''
Answer: By the time memory is removed, all HugeTLB pages have either been
migrated away or dissolved, so there is no race between memory hot remove
and free_huge_page_vmemmap(). Therefore, HugeTLB pages inside the hot
remove section are safe. As for your question "what about other
HugeTLB areas whose vmemmap area shares page table entries with vmemmap
entries for a section being hot removed ?", its premise does not hold.
The minimum granularity of hotplug memory is 128MB (on arm64 with 4k base
pages). Any HugeTLB page smaller than 128MB lies within a single section,
so no PTE page tables are shared between HugeTLB pages in this section
and ones in other sections, and a HugeTLB page cannot cross two
sections. In this case, the section cannot be freed. For any HugeTLB page
bigger than 128MB (the section size), the size of its vmemmap pages is an
integer multiple of 2MB (PMD-mapped). As long as:
1) HugeTLBs are naturally aligned, power-of-two sizes
2) The HugeTLB size >= the section size
3) The HugeTLB size >= the vmemmap leaf mapping size
Then a HugeTLB will not share any leaf page table entries with *anything
else*, but will share intermediate entries. In this case, at the time
memory is removed, all HugeTLB pages either have been migrated away or
dissolved. So there is also no race between memory hot remove and
free_huge_page_vmemmap().
2nd concern:
'''
differently, not sure if ptdump would require any synchronization.
Dumping an wrong value is probably okay but crashing because a page table
entry is being freed after ptdump acquired the pointer is bad. On arm64,
ptdump() is protected against hotremove via [get|put]_online_mems().
'''
Answer: ptdump should be fine since vmemmap_remap_free() only
exchanges PTEs or splits the PMD entry (which means allocating a PTE page
table). Neither operation frees any page tables (PTE), so ptdump
cannot run into a UAF on any page table. The worst case is just dumping
a wrong value.
[1] https://lore.kernel.org/all/20210510030027.56044-1-songmuchun@bytedance.com/
[2] https://lore.kernel.org/all/20210518091826.36937-1-songmuchun@bytedance.com/
[songmuchun@bytedance.com: restructure the code comment inside flush_dcache_page()]
Link: https://lkml.kernel.org/r/20220414072646.21910-1-songmuchun@bytedance.com
Link: https://lkml.kernel.org/r/20220331065640.5777-2-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Tested-by: Barry Song <baohua@kernel.org>
Cc: Will Deacon <will@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Bodeddula Balasubramaniam <bodeddub@amazon.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: James Morse <james.morse@arm.com>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Cc: Fam Zheng <fam.zheng@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
vmap() takes struct page *pages as one of its arguments, and the user may
provide an invalid pointer, which may lead to a corrupted translation table.
An example of such behaviour is erroneous usage of virt_to_page():
    vaddr1 = dma_alloc_coherent()
    page = virt_to_page()    // Wrong here
    ...
    vaddr2 = vmap(page)
    memset(vaddr2)           // Faulting here
virt_to_page() returns a wrong pointer if vaddr1 is not a linear kernel
address. The problem is that vmap() successfully populates the pte with a
bad pfn, and the error is much harder to debug at memory access time. This
case would be caught if DEBUG_VIRTUAL were enabled, but it is not
enabled in popular distros.
The kernel already checks the pages against NULL. In the case mentioned
above, however, the address is not NULL, and it is large enough that the
hardware generates an Address Size Abort on arm64:
[ 665.484101] Unhandled fault at 0xffff8000252cd000
[ 665.488807] Mem abort info:
[ 665.491617] ESR = 0x96000043
[ 665.494675] EC = 0x25: DABT (current EL), IL = 32 bits
[ 665.499985] SET = 0, FnV = 0
[ 665.503039] EA = 0, S1PTW = 0
[ 665.506167] Data abort info:
[ 665.509047] ISV = 0, ISS = 0x00000043
[ 665.512882] CM = 0, WnR = 1
[ 665.515851] swapper pgtable: 4k pages, 48-bit VAs, pgdp=00000000818cb000
[ 665.522550] [ffff8000252cd000] pgd=000000affcfff003, pud=000000affcffe003, pmd=0000008fad8c3003, pte=00688000a5217713
[ 665.533160] Internal error: level 3 address size fault: 96000043 [#1] SMP
[ 665.539936] Modules linked in: [...]
[ 665.616212] CPU: 178 PID: 13199 Comm: test Tainted: P OE 5.4.0-84-generic #94~18.04.1-Ubuntu
[ 665.626806] Hardware name: HPE Apollo 70 /C01_APACHE_MB , BIOS L50_5.13_1.0.6 07/10/2018
[ 665.636618] pstate: 80400009 (Nzcv daif +PAN -UAO)
[ 665.641407] pc : __memset+0x38/0x188
[ 665.645146] lr : test+0xcc/0x3f8
[ 665.650184] sp : ffff8000359bb840
[ 665.653486] x29: ffff8000359bb840 x28: 0000000000000000
[ 665.658785] x27: 0000000000000000 x26: 0000000000231000
[ 665.664083] x25: ffff00ae660f6110 x24: ffff00ae668cb800
[ 665.669382] x23: 0000000000000001 x22: ffff00af533e5000
[ 665.674680] x21: 0000000000001000 x20: 0000000000000000
[ 665.679978] x19: ffff00ae66950000 x18: ffffffffffffffff
[ 665.685276] x17: 00000000588636a5 x16: 0000000000000013
[ 665.690574] x15: ffffffffffffffff x14: 000000000007ffff
[ 665.695872] x13: 0000000080000000 x12: 0140000000000000
[ 665.701170] x11: 0000000000000041 x10: ffff8000652cd000
[ 665.706468] x9 : ffff8000252cf000 x8 : ffff8000252cd000
[ 665.711767] x7 : 0303030303030303 x6 : 0000000000001000
[ 665.717065] x5 : ffff8000252cd000 x4 : 0000000000000000
[ 665.722363] x3 : ffff8000252cdfff x2 : 0000000000000001
[ 665.727661] x1 : 0000000000000003 x0 : ffff8000252cd000
[ 665.732960] Call trace:
[ 665.735395] __memset+0x38/0x188
[...]
Interestingly, this abort happens even if copy_from_kernel_nofault() is
used, which is quite inconvenient for debugging purposes.
This patch adds a pfn_valid() check to the vmap() path, so that an invalid
mapping will not be created; WARN_ON() is used to let client code know
that something has gone wrong, and that it is not a regular EINVAL situation.
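The effect of the check can be modelled in user space. The sketch below is not the kernel change itself: pfn_valid_model(), map_pfns_model() and MAX_VALID_PFN are hypothetical names, and a plain array stands in for the page table. It only shows the idea of validating each PFN before an entry is written, so a bad page pointer fails at mapping time instead of at the first access.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: PFNs in [0, MAX_VALID_PFN) are backed by real memory. */
#define MAX_VALID_PFN 1024UL

static bool pfn_valid_model(unsigned long pfn)
{
    return pfn < MAX_VALID_PFN;
}

/*
 * Fill `ptes` from `pfns`, refusing to install an entry for an invalid PFN.
 * Returns 0 on success or -EINVAL, with *mapped set to the number of
 * entries written before the failure.
 */
static int map_pfns_model(unsigned long *ptes, const unsigned long *pfns,
                          size_t nr, size_t *mapped)
{
    size_t i;

    for (i = 0; i < nr; i++) {
        if (!pfn_valid_model(pfns[i])) {
            *mapped = i;
            return -EINVAL; /* the kernel patch additionally WARNs here */
        }
        ptes[i] = pfns[i];
    }
    *mapped = nr;
    return 0;
}
```

A caller that passes a PFN outside the valid range gets -EINVAL back immediately, which is far easier to trace than a later abort inside memset().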
Link: https://lkml.kernel.org/r/20220422220410.1308706-1-yury.norov@gmail.com
Signed-off-by: Yury Norov (NVIDIA) <yury.norov@gmail.com>
Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Alexey Klimov <aklimov@redhat.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Ding Tianhong <dingtianhong@huawei.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
After commit 0e4b01df86 ("mm, memcg: throttle allocators when failing
reclaim over memory.high") allocating memory over memory.high became very
time consuming. But it's exactly what the memory.high test from cgroup
kselftests is doing: it tries to allocate 100M with 30M memory.high value.
It takes forever to complete.
In order to keep it passing (or failing) in a reasonable amount of time
let's try to allocate only a little over 30M: 31M to be precise.
With this change test_memcontrol finishes in a reasonable amount of
time:
$ time ./test_memcontrol
ok 1 test_memcg_subtree_control
ok 2 test_memcg_current
ok 3 test_memcg_min
ok 4 test_memcg_low
ok 5 test_memcg_high
ok 6 test_memcg_max
ok 7 test_memcg_oom_events
ok 8 test_memcg_swap_max
ok 9 test_memcg_sock
ok 10 test_memcg_oom_group_leaf_events
ok 11 test_memcg_oom_group_parent_events
ok 12 test_memcg_oom_group_score_events
real 0m2.273s
user 0m0.064s
sys 0m0.739s
Link: https://lkml.kernel.org/r/20220415000133.3955987-3-roman.gushchin@linux.dev
Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
Reviewed-by: David Vernet <void@manifault.com>
Cc: Chris Down <chris@chrisdown.name>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Zefan Li <lizefan.x@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
An application is suspected of having a memory leak when its memory
consumption is high and keeps increasing. There are several commonly used
memory allocators: slab, cma, vmalloc, etc. Memory leak identification
can be sped up if the page information for each allocator can be analyzed
separately.
This patch adds support for memory allocator labelling for slab,
vmalloc, and cma. Pages allocated by slab and cma can be identified
from the "PFN" line according to the kernel code, and the label of the
vmalloc allocator can be obtained by analyzing the stack trace. Thanks
to Vlastimil Babka for his constructive suggestions.
Based on Yinan Zhang's study, the call chain of vmalloc() is vmalloc() ->
... -> __vmalloc_node_range() -> __vmalloc_area_node().
__vmalloc_area_node() requests memory through the interfaces of the buddy
allocator. In the current version, __vmalloc_area_node() uses
four interfaces: alloc_pages_bulk_array_mempolicy(),
alloc_pages_bulk_array_node(), alloc_pages() and alloc_pages_node(). By
disassembling the code, we find that __vmalloc_area_node() is inlined into
__vmalloc_node_range(), so __vmalloc_area_node() does not appear in the
stack trace. On the test machine, the stack trace of pages allocated by
vmalloc takes one of the following four forms:
    __alloc_pages_bulk+0x230/0x6a0
    __vmalloc_node_range+0x19c/0x598

    alloc_pages_bulk_array_mempolicy+0xbc/0x278
    __vmalloc_node_range+0x1e8/0x598

    __alloc_pages+0x160/0x2b0
    __vmalloc_node_range+0x234/0x598

    alloc_pages+0xac/0x150
    __vmalloc_node_range+0x44c/0x598
Therefore, if in two consecutive lines of a stack trace the first line
contains the string "alloc_pages" and the second line contains the string
"__vmalloc_node_range", the page can be determined to have been allocated
by vmalloc. The function offsets and sizes differ between machines, so
there is no need to match them.
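The matching rule above can be sketched in a few lines of C. This is an illustration of the rule only, not the code added to page_owner_sort; the function name is_vmalloc_pair() is hypothetical.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Return true if two consecutive stack-trace lines match the vmalloc pattern:
 * "alloc_pages" somewhere in the first, "__vmalloc_node_range" in the second.
 * The "+0x.../0x..." offset and size suffix is deliberately ignored.
 */
static bool is_vmalloc_pair(const char *first, const char *second)
{
    return strstr(first, "alloc_pages") != NULL &&
           strstr(second, "__vmalloc_node_range") != NULL;
}
```

Because the test is a substring search, all four forms listed above match, including "__alloc_pages_bulk" and "alloc_pages_bulk_array_mempolicy".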
At the same time, this patch updates the --cull and --sort options to
support allocator-based merge statistics and sorting. The added functions
are fully compatible with the original work. The key can be given as
"allocator" or abbreviated as "ator". Relevant updates have also been
made in the documentation (Documentation/vm/page_owner.rst).
Example:
./page_owner_sort <input> <output> --cull=st,pid,name,allocator
./page_owner_sort <input> <output> --sort=ator,pid,name
This work is coauthored by Jiajian Ye, Yinan Zhang, Shenghong Han,
Chongxi Zhao, Yuhong Feng and Yongqiang Liu.
Link: https://lkml.kernel.org/r/20220410132932.9402-1-caoyixuan2019@email.szu.edu.cn
Signed-off-by: Yixuan Cao <caoyixuan2019@email.szu.edu.cn>
Cc: Chongxi Zhao <zhaochongxi2019@email.szu.edu.cn>
Cc: Haowen Bai <baihaowen@meizu.com>
Cc: Jiajian Ye <yejiajian2018@email.szu.edu.cn>
Cc: Sean Anderson <seanga2@gmail.com>
Cc: Shenghong Han <hanshenghong2019@email.szu.edu.cn>
Cc: Yinan Zhang <zhangyinan2019@email.szu.edu.cn>
Cc: Yongqiang Liu <liuyongqiang13@huawei.com>
Cc: Yuhong Feng <yuhongf@szu.edu.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
When viewing page owner information, we may want to sort blocks of
information by multiple keys, since a single key does not uniquely
identify a block. Therefore, the following adjustments are made:
1. Add a new --sort option to support sorting blocks of information by
multiple keys.
./page_owner_sort <input> <output> --sort=<order>
./page_owner_sort <input> <output> --sort <order>
<order> is a single argument in the form of a comma-separated list,
which specifies the sorting order.
Sorting syntax is [+|-]key[,[+|-]key[,...]]. Ascending or descending
order can be selected by adding the + (ascending, default) or -
(descending) prefix to the key:
./page_owner_sort <input> <output> [option] --sort -key1,+key2,key3...
For example, to sort the blocks first by task command name in lexicographic
order and then by pid in ascending numerical order, use the following:
./page_owner_sort <input> <output> --sort=name,+pid
To sort the blocks first by pid in ascending order and then by timestamp
of the page when it is allocated in descending order, use the following:
./page_owner_sort <input> <output> --sort=pid,-alloc_ts
2. Add explanations of the newly added --sort option to the function usage()
and the documentation (Documentation/vm/page_owner.rst).
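The multi-key ordering described in item 1 can be sketched with a qsort() comparator. This is a minimal illustration, not the code in page_owner_sort: struct block, parse_sort_order() and compare_blocks() are hypothetical names, and only the three keys used in the examples above (name, pid, alloc_ts) are handled.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical block record; the real tool keeps more fields. */
struct block {
    const char *name;
    int pid;
    unsigned long long alloc_ts;
};

enum key { KEY_NAME, KEY_PID, KEY_ALLOC_TS };

struct sort_key {
    enum key key;
    int sign;   /* +1 ascending, -1 descending */
};

#define MAX_KEYS 8
static struct sort_key sort_keys[MAX_KEYS];
static int nr_sort_keys;

/* Parse "[+|-]key[,[+|-]key...]" into sort_keys; returns 0, or -1 on error. */
static int parse_sort_order(const char *order)
{
    char buf[128], *tok;

    if (strlen(order) >= sizeof(buf))
        return -1;
    strcpy(buf, order);
    nr_sort_keys = 0;
    for (tok = strtok(buf, ","); tok; tok = strtok(NULL, ",")) {
        struct sort_key k = { KEY_NAME, 1 };

        if (nr_sort_keys == MAX_KEYS)
            return -1;
        if (*tok == '+' || *tok == '-')
            k.sign = (*tok++ == '-') ? -1 : 1;
        if (!strcmp(tok, "name"))
            k.key = KEY_NAME;
        else if (!strcmp(tok, "pid"))
            k.key = KEY_PID;
        else if (!strcmp(tok, "alloc_ts"))
            k.key = KEY_ALLOC_TS;
        else
            return -1;
        sort_keys[nr_sort_keys++] = k;
    }
    return nr_sort_keys ? 0 : -1;
}

/* qsort() comparator: the first key on which the blocks differ decides. */
static int compare_blocks(const void *pa, const void *pb)
{
    const struct block *a = pa, *b = pb;
    int i, c;

    for (i = 0; i < nr_sort_keys; i++) {
        switch (sort_keys[i].key) {
        case KEY_NAME:
            c = strcmp(a->name, b->name);
            c = (c > 0) - (c < 0);
            break;
        case KEY_PID:
            c = (a->pid > b->pid) - (a->pid < b->pid);
            break;
        default:
            c = (a->alloc_ts > b->alloc_ts) - (a->alloc_ts < b->alloc_ts);
            break;
        }
        if (c)
            return sort_keys[i].sign * c;
    }
    return 0;
}
```

After parse_sort_order("pid,-alloc_ts") succeeds, calling qsort() on an array of struct block with compare_blocks orders the blocks by ascending pid and, within equal pids, by descending allocation timestamp.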
This work is coauthored by
Yixuan Cao
Shenghong Han
Yinan Zhang
Chongxi Zhao
Yuhong Feng
Yongqiang Liu
Link: https://lkml.kernel.org/r/20220401024856.767-3-yejiajian2018@email.szu.edu.cn
Signed-off-by: Jiajian Ye <yejiajian2018@email.szu.edu.cn>
Cc: Chongxi Zhao <zhaochongxi2019@email.szu.edu.cn>
Cc: Shenghong Han <hanshenghong2019@email.szu.edu.cn>
Cc: Yinan Zhang <zhangyinan2019@email.szu.edu.cn>
Cc: Yixuan Cao <caoyixuan2019@email.szu.edu.cn>
Cc: Yongqiang Liu <liuyongqiang13@huawei.com>
Cc: Yuhong Feng <yuhongf@szu.edu.cn>
Cc: Haowen Bai <baihaowen@meizu.com>
Cc: Sean Anderson <seanga2@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
When viewing page owner information, we may want to select blocks whose
PID/TGID/TASK_COMM_NAME appears in a user-specified list for data analysis
and aggregation. Currently, however, page_owner_sort only supports
selecting blocks associated with a single specified PID/TGID/TASK_COMM_NAME.
Therefore, the following adjustments are made to fix the problem:
1. Enhance selecting function to support the selection of multiple
PIDs/TGIDs/TASK_COMM_NAMEs.
The enhanced usages are as follows:
    --pid <pidlist>     Select by pid. This selects the blocks whose PID
                        numbers appear in <pidlist>.
    --tgid <tgidlist>   Select by tgid. This selects the blocks whose
                        TGID numbers appear in <tgidlist>.
    --name <cmdlist>    Select by task command name. This selects the
                        blocks whose task command names appear in <cmdlist>.
Here <pidlist>, <tgidlist> and <cmdlist> are single arguments in the form of
a comma-separated list, which offers a way to specify individual selection
rules.
For example, to select blocks whose tgids are 1, 2 or 3 without this patch,
you have to use 4 commands as follows:
./page_owner_sort <input> <output1> --tgid=1
./page_owner_sort <input> <output2> --tgid=2
./page_owner_sort <input> <output3> --tgid=3
cat <output1> <output2> <output3> > <output>
With this patch, a single command obtains the same result as
above:
./page_owner_sort <input> <output> --tgid=1,2,3
2. Update explanations of --pid, --tgid and --name in the function
usage() and the documentation (Documentation/vm/page_owner.rst).
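The list-based selection in item 1 can be sketched as parsing one comma-separated argument and testing membership per block. This is an illustrative sketch only, not the page_owner_sort implementation: struct id_list, parse_id_list() and id_selected() are hypothetical names, and only numeric lists (pid/tgid) are covered.

```c
#include <stdbool.h>
#include <stdlib.h>

#define MAX_IDS 64

/* Hypothetical selection list built from one comma-separated argument. */
struct id_list {
    long ids[MAX_IDS];
    int nr;
};

/*
 * Parse e.g. "1,2,3" into `list`; returns 0 on success, -1 on bad input.
 * A single trailing comma is tolerated.
 */
static int parse_id_list(const char *arg, struct id_list *list)
{
    const char *p = arg;
    char *end;

    list->nr = 0;
    while (*p) {
        long v = strtol(p, &end, 10);

        if (end == p || list->nr == MAX_IDS)
            return -1;  /* empty field, non-number, or too many entries */
        list->ids[list->nr++] = v;
        if (*end == ',')
            end++;
        else if (*end != '\0')
            return -1;
        p = end;
    }
    return list->nr ? 0 : -1;
}

/* True if `id` appears in the list, i.e. the block should be selected. */
static bool id_selected(const struct id_list *list, long id)
{
    int i;

    for (i = 0; i < list->nr; i++)
        if (list->ids[i] == id)
            return true;
    return false;
}
```

Parsing "--tgid=1,2,3" once and calling id_selected() for each block replaces the three separate runs and the final cat shown in the example above.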
This work is coauthored by
Yixuan Cao
Shenghong Han
Yinan Zhang
Chongxi Zhao
Yuhong Feng
Yongqiang Liu
Link: https://lkml.kernel.org/r/20220401024856.767-2-yejiajian2018@email.szu.edu.cn
Signed-off-by: Jiajian Ye <yejiajian2018@email.szu.edu.cn>
Cc: Chongxi Zhao <zhaochongxi2019@email.szu.edu.cn>
Cc: Shenghong Han <hanshenghong2019@email.szu.edu.cn>
Cc: Yinan Zhang <zhangyinan2019@email.szu.edu.cn>
Cc: Yixuan Cao <caoyixuan2019@email.szu.edu.cn>
Cc: Yongqiang Liu <liuyongqiang13@huawei.com>
Cc: Yuhong Feng <yuhongf@szu.edu.cn>
Cc: Haowen Bai <baihaowen@meizu.com>
Cc: Sean Anderson <seanga2@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Pull drm fixes from Dave Airlie:
"Another relatively quiet week, amdgpu leads the way, some i915 display
fixes, and a single sunxi fix.
amdgpu:
- Runtime pm fix
- DCN memory leak fix in error path
- SI DPM deadlock fix
- S0ix fix
amdkfd:
- GWS fix
- GWS support for CRIU
i915:
- Fix #5284: Backlight control regression on XMG Core 15 e21
- Fix black display plane on Acer One AO532h
- Two smaller display fixes
sunxi:
- Single fix to avoid applying PHYS_OFFSET twice"
* tag 'drm-fixes-2022-04-29' of git://anongit.freedesktop.org/drm/drm:
drm/amdgpu: keep mmhub clock gating being enabled during s2idle suspend
drm/amd/pm: fix the deadlock issue observed on SI
drm/amd/display: Fix memory leak in dcn21_clock_source_create
drm/amdgpu: don't runtime suspend if there are displays attached (v3)
drm/amdkfd: CRIU add support for GWS queues
drm/amdkfd: Fix GWS queue count
drm/sun4i: Remove obsolete references to PHYS_OFFSET
drm/i915/fbc: Consult hw.crtc instead of uapi.crtc
drm/i915: Fix SEL_FETCH_PLANE_*(PIPE_B+) register addresses
drm/i915: Check EDID for HDR static metadata when choosing blc
drm/i915: Fix DISP_POS_Y and DISP_HEIGHT defines