Memory sections are combined into "memory block" chunks. These chunks
are the units upon which memory can be added and removed.
On x86, the new memory may be added after the end of the boot memory,
therefore, if block size does not align with end of boot memory, memory
hot-plugging/hot-removing can be broken.
Memory sections are combined into "memory block" chunks. These chunks
are the units upon which memory can be added and removed.
On x86 the new memory may be added after the end of the boot memory,
therefore, if block size does not align with end of boot memory, memory
hotplugging/hotremoving can be broken.
Currently, whenever machine is booted with more than 64G the block size
is unconditionally increased to 2G from the base 128M. This is done in
order to reduce number of memory device files in sysfs:
/sys/devices/system/memory/memoryXXX
We must use the largest allowed block size that aligns to the next
address to be able to hotplug the next block of memory.
So, when memory is larger or equal to 64G, we check the end address and
find the largest block size that is still power of two but smaller or
equal to 2G.
Before, the fix:
Run qemu with:
-m 64G,slots=2,maxmem=66G -object memory-backend-ram,id=mem1,size=2G
(qemu) device_add pc-dimm,id=dimm1,memdev=mem1
Block size [0x80000000] unaligned hotplug range: start 0x1040000000,
size 0x80000000
acpi PNP0C80:00: add_memory failed
acpi PNP0C80:00: acpi_memory_enable_device() error
acpi PNP0C80:00: Enumeration failure
With the fix memory is added successfully as the block size is set to
1G, and therefore aligns with start address 0x1040000000.
[pasha.tatashin@oracle.com: v4]
Link: http://lkml.kernel.org/r/20180215165920.8570-3-pasha.tatashin@oracle.com
Link: http://lkml.kernel.org/r/20180213193159.14606-3-pasha.tatashin@oracle.com
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Steven Sistare <steven.sistare@oracle.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Bharata B Rao <bharata@linux.vnet.ibm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Baoquan He <bhe@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "optimize memory hotplug", v3.
This patchset:
- Improves hotplug performance by eliminating a number of struct page
traverses during memory hotplug.
- Fixes some issues with hotplugging, where boundaries were not
properly checked. And on x86 block size was not properly aligned with
end of memory
- Also, potentially improves boot performance by eliminating condition
from __init_single_page().
- Adds robustness by verifying that that struct pages are correctly
poisoned when flags are accessed.
The following experiments were performed on Xeon(R) CPU E7-8895 v3 @
2.60GHz with 1T RAM:
booting in qemu with 960G of memory, time to initialize struct pages:
no-kvm:
TRY1 TRY2
BEFORE: 39.433668 39.39705
AFTER: 36.903781 36.989329
with-kvm:
BEFORE: 10.977447 11.103164
AFTER: 10.929072 10.751885
Hotplug 896G memory:
no-kvm:
TRY1 TRY2
BEFORE: 848.740000 846.910000
AFTER: 783.070000 786.560000
with-kvm:
TRY1 TRY2
BEFORE: 34.410000 33.57
AFTER: 29.810000 29.580000
This patch (of 6):
Start qemu with the following arguments:
-m 64G,slots=2,maxmem=66G -object memory-backend-ram,id=mem1,size=2G
Which: boots machine with 64G, and adds a device mem1 with 2G which can
be hotplugged later.
Also make sure that config has the following turned on:
CONFIG_MEMORY_HOTPLUG
CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE
CONFIG_ACPI_HOTPLUG_MEMORY
Using the qemu monitor hotplug the memory (make sure config has (qemu)
device_add pc-dimm,id=dimm1,memdev=mem1
The operation will fail with the following trace:
WARNING: CPU: 0 PID: 91 at drivers/base/memory.c:205
pages_correctly_reserved+0xe6/0x110
Modules linked in:
CPU: 0 PID: 91 Comm: systemd-udevd Not tainted 4.16.0-rc1_pt_master #29
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS rel-1.11.0-0-g63451fca13-prebuilt.qemu-project.org 04/01/2014
RIP: 0010:pages_correctly_reserved+0xe6/0x110
Call Trace:
memory_subsys_online+0x44/0xa0
device_online+0x51/0x80
store_mem_state+0x5e/0xe0
kernfs_fop_write+0xfa/0x170
__vfs_write+0x2e/0x150
vfs_write+0xa8/0x1a0
SyS_write+0x4d/0xb0
do_syscall_64+0x5d/0x110
entry_SYSCALL_64_after_hwframe+0x21/0x86
---[ end trace 6203bc4f1a5d30e8 ]---
The problem is detected in: drivers/base/memory.c
static bool pages_correctly_reserved(unsigned long start_pfn)
205 if (WARN_ON_ONCE(!pfn_valid(pfn)))
This function loops through every section in the newly added memory
block and verifies that the first pfn is valid, meaning section exists,
has mapping (struct page array), and is online.
The block size on x86 is usually 128M, but when machine is booted with
more than 64G of memory, the block size is changed to 2G: $ cat
/sys/devices/system/memory/block_size_bytes 80000000
or
$ dmesg | grep "block size"
[ 0.086469] x86/mm: Memory block size: 2048MB
During memory hotplug, and hotremove we verify that the range is section
size aligned, but we actually must verify that it is block size aligned,
because that is the proper unit for hotplug operations. See:
Documentation/memory-hotplug.txt
So, when the start_pfn of newly added memory is not block size aligned,
we can get a memory block that has only part of it with properly
populated sections.
In our case the start_pfn starts from the last_pfn (end of physical
memory).
$ dmesg | grep last_pfn
[ 0.000000] e820: last_pfn = 0x1040000 max_arch_pfn = 0x400000000
0x1040000 == 65G, and so is not 2G aligned!
The fix is to enforce that memory that is hotplugged and hotremoved is
block size aligned.
With this fix, running the above sequence yield to the following result:
(qemu) device_add pc-dimm,id=dimm1,memdev=mem1
Block size [0x80000000] unaligned hotplug range: start 0x1040000000,
size 0x80000000
acpi PNP0C80:00: add_memory failed
acpi PNP0C80:00: acpi_memory_enable_device() error
acpi PNP0C80:00: Enumeration failure
Link: http://lkml.kernel.org/r/20180213193159.14606-2-pasha.tatashin@oracle.com
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Bharata B Rao <bharata@linux.vnet.ibm.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Steven Sistare <steven.sistare@oracle.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The kasan quarantine is designed to delay freeing slab objects to catch
use-after-free. The quarantine can be large (several percent of machine
memory size). When kmem_caches are deleted related objects are flushed
from the quarantine but this requires scanning the entire quarantine
which can be very slow. We have seen the kernel busily working on this
while holding slab_mutex and badly affecting cache_reaper, slabinfo
readers and memcg kmem cache creations.
It can easily reproduced by following script:
yes . | head -1000000 | xargs stat > /dev/null
for i in `seq 1 10`; do
seq 500 | (cd /cg/memory && xargs mkdir)
seq 500 | xargs -I{} sh -c 'echo $BASHPID > \
/cg/memory/{}/tasks && exec stat .' > /dev/null
seq 500 | (cd /cg/memory && xargs rmdir)
done
The busy stack:
kasan_cache_shutdown
shutdown_cache
memcg_destroy_kmem_caches
mem_cgroup_css_free
css_free_rwork_fn
process_one_work
worker_thread
kthread
ret_from_fork
This patch is based on the observation that if the kmem_cache to be
destroyed is empty then there should not be any objects of this cache in
the quarantine.
Without the patch the script got stuck for couple of hours. With the
patch the script completed within a second.
Link: http://lkml.kernel.org/r/20180327230603.54721-1-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
I have noticed on debug kernel with SLAB, the size of some non-root
slabs were larger than their corresponding root slabs.
e.g. for radix_tree_node:
$cat /proc/slabinfo | grep radix
name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> ...
radix_tree_node 15052 15075 4096 1 1 ...
$cat /cgroup/memory/temp/memory.kmem.slabinfo | grep radix
name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> ...
radix_tree_node 1581 158 4120 1 2 ...
However for SLUB in debug kernel, the sizes were same. On further
inspection it is found that SLUB always use kmem_cache.object_size to
measure the kmem_cache.size while SLAB use the given kmem_cache.size.
In the debug kernel the slab's size can be larger than its object_size.
Thus in the creation of non-root slab, the SLAB uses the root's size as
base to calculate the non-root slab's size and thus non-root slab's size
can be larger than the root slab's size. For SLUB, the non-root slab's
size is measured based on the root's object_size and thus the size will
remain same for root and non-root slab.
This patch makes slab's object_size the default base to measure the
slab's size.
Link: http://lkml.kernel.org/r/20180313165428.58699-1-shakeelb@google.com
Fixes: 794b1248be ("memcg, slab: separate memcg vs root cache creation paths")
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
kmalloc_index() return index into an array of kmalloc kmem caches,
therefore should be unsigned.
Space savings with SLUB on trimmed down .config:
add/remove: 0/1 grow/shrink: 6/56 up/down: 85/-557 (-472)
Function old new delta
calculate_sizes 924 983 +59
on_freelist 589 604 +15
init_cache_random_seq 122 127 +5
ext4_mb_init 1206 1210 +4
slab_pad_check.part 270 271 +1
cpu_partial_store 112 113 +1
usersize_show 28 27 -1
...
new_slab 1871 1837 -34
slab_order 204 - -204
This patch start a series of converting SLUB (mostly) to "unsigned int".
1) Most integers in the code are in fact unsigned entities: array
indexes, lengths, buffer sizes, allocation orders. It is therefore
better to use unsigned variables
2) Some integers in the code are either "size_t" or "unsigned long" for
no reason.
size_t usually comes from people trying to maintain type correctness
and figuring out that "sizeof" operator returns size_t or
memset/memcpy takes size_t so should everything passed to it.
However the number of 4GB+ objects in the kernel is very small. Most,
if not all, dynamically allocated objects with kmalloc() or
kmem_cache_create() aren't actually big. Maintaining wide types
doesn't do anything.
64-bit ops are bigger than 32-bit on our beloved x86_64,
so try to not use 64-bit where it isn't necessary
(read: everywhere where integers are integers not pointers)
3) in case of SLAB allocators, there are additional limitations
*) page->inuse, page->objects are only 16-/15-bit,
*) cache size was always 32-bit
*) slab orders are small, order 20 is needed to go 64-bit on x86_64
(PAGE_SIZE << order)
Basically everything is 32-bit except kmalloc(1ULL<<32) which gets
shortcut through page allocator.
Christoph said:
:
: That changes with large base page size on power and ARM64 f.e. but then
: we do not want to encourage larger allocations through slab anyways.
Link: http://lkml.kernel.org/r/20180305200730.15812-2-adobriyan@gmail.com
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When SLUB_DEBUG catches some issues, it prints all the required debug
info. However, in a few cases where allocation and free of the object
has happened in a very short time, 'age' might be misleading. See the
example below:
=============================================================================
BUG kmalloc-256 (Tainted: G W O ): Poison overwritten
-----------------------------------------------------------------------------
...
INFO: Allocated in binder_transaction+0x4b0/0x2448 age=731 cpu=3 pid=5314
...
INFO: Freed in binder_free_transaction+0x2c/0x58 age=735 cpu=6 pid=2079
...
Object fffffff14956a870: 6b 6b 6b 6b 6b 6b 6b 6b 67 6b 6b 6b 6b 6b 6b a5 kkkkkkkkgkkkk
In this case, object got freed later but 'age' shows otherwise. This
could be because, while printing this info, we print allocation traces
first and free traces thereafter. In between, if we get schedule out or
jiffies increment, (jiffies - t->when) could become meaningless.
Use the jitter free reference to calculate age.
New output will exactly be same. 'age' is still staying with single
jiffies ref in both prints.
Change-Id: I0846565807a4229748649bbecb1ffb743d71fcd8
Link: http://lkml.kernel.org/r/1520492010-19389-1-git-send-email-cpandya@codeaurora.org
Signed-off-by: Chintan Pandya <cpandya@codeaurora.org>
Acked-by: Christoph Lameter <cl@linux.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If the exported filesystem dir on 9p server doesn't maintain accurate
i_nlink count, e.g. always reports i_nlink as 1, then 9p should not
maintain nlink count either, otherwise drop_link would report warning
with i_nlink being zero.
For example:
- overlayfs sets nlink to 1 for merged dir
- ext4 (with dir_nlink feature enabled) sets nlink to 1 if a dir has
more than EXT4_LINK_MAX (65000) links.
In this case, everytime a stat(2) call (getattr) on such exported dirs
on 9p client side, the i_nlink gets reset to 1, then operations like
rmdir(2), unlink(2) and rename(2) would cause the dir nlink to go to
zero (then negative), which results in warnings in drop_nlink() and/or
inc_nlink() calls.
This can be reproduced easily as the following steps:
- export a merged overlayfs dir via qemu virtfs to guest
- mount the exported virtfs in guest
- create two sub-directories in the root dir of the mounted 9pfs
- stat the root dir of 9pfs, this resets nlink to 1
- remove all subdirs, the second unlink/rmdir would trigger warning
------------[ cut here ]------------
WARNING: CPU: 3 PID: 1284 at fs/inode.c:282 drop_nlink+0x3e/0x50
...
Call Trace:
dump_stack+0x63/0x81
__warn+0xcb/0xf0
warn_slowpath_null+0x1d/0x20
drop_nlink+0x3e/0x50
v9fs_remove+0xaa/0x130 [9p]
v9fs_vfs_rmdir+0x13/0x20 [9p]
vfs_rmdir+0xb7/0x130
do_rmdir+0x1b8/0x230
SyS_unlinkat+0x22/0x30
do_syscall_64+0x67/0x180
---[ end trace 43758d8ba91e603b ]---
Fix it by leaving i_nlink to be 1 and don't drop nlink if a directory
has nlink <= 2, which indicates that the underlying exported fs doesn't
maintain nlink count accurately. This follows what ext4 does in
ext4_dec_count().
Link: http://lkml.kernel.org/r/20180312053829.4367-1-eguan@linux.alibaba.com
Signed-off-by: Eryu Guan <eguan@linux.alibaba.com>
Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com>
Tested-by: Roman Kapl <code@rkapl.cz>
Cc: Caspar Zhang <caspar@linux.alibaba.com>
Cc: Eric Van Hensbergen <ericvh@gmail.com>
Cc: Ron Minnich <rminnich@sandia.gov>
Cc: Latchesar Ionkov <lucho@ionkov.net>
Cc: <v9fs-developer@lists.sourceforge.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If it was interrupted by a signal, the 9p client may need to send some
more requests to the server for cleanup before returning to userspace.
To avoid such a last minute request to be interrupted right away, the
client memorizes if a signal is pending, clears TIF_SIGPENDING, handles
the request and calls recalc_sigpending() before returning.
Unfortunately, if the transmission of this cleanup request fails for any
reason, the transport returns an error and the client propagates it
right away, without calling recalc_sigpending().
This ends up with -ERESTARTSYS from the initially interrupted request
crawling up to syscall exit, with TIF_SIGPENDING cleared by the cleanup
request. The specific signal handling code, which is responsible for
converting -ERESTARTSYS to -EINTR is not called, and userspace receives
the confusing errno value:
open: Unknown error 512 (512)
This is really hard to hit in real life. I discovered the issue while
working on hot-unplug of a virtio-9p-pci device with an instrumented
QEMU allowing to control request completion.
Both p9_client_zc_rpc() and p9_client_rpc() functions have this buggy
error path actually. Their code flow is a bit obscure and the best
thing to do would probably be a full rewrite: to really ensure this
situation of clearing TIF_SIGPENDING and returning -ERESTARTSYS can
never happen.
But given the general lack of interest for the 9p code, I won't risk
breaking more things. So this patch simply fixes the buggy paths in
both functions with a trivial label+goto.
Thanks to Laurent Dufour for his help and suggestions on how to find the
root cause and how to fix it.
Link: http://lkml.kernel.org/r/152062809886.10599.7361006774123053312.stgit@bahia.lan
Signed-off-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com>
Cc: Eric Van Hensbergen <ericvh@gmail.com>
Cc: Ron Minnich <rminnich@sandia.gov>
Cc: Latchesar Ionkov <lucho@ionkov.net>
Cc: David Miller <davem@davemloft.net>
Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>