alloc_bucket_locks allocation pattern is quite unusual. We are
preferring vmalloc when CONFIG_NUMA is enabled. The rationale is that
vmalloc will respect the memory policy of the current process and so the
backing memory will get distributed over multiple nodes if the requester
is configured properly. At least that is the intention, in reality
rhastable is shrunk and expanded from a kernel worker so no mempolicy
can be assumed.
Let's just simplify the code and use kvmalloc helper, which is a
transparent way to use kmalloc with vmalloc fallback, if the caller is
allowed to block and use the flag otherwise.
Link: http://lkml.kernel.org/r/20170306103032.2540-4-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Tom Herbert <tom@herbertland.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
vhost code uses __GFP_REPEAT when allocating vhost_virtqueue resp.
vhost_vsock because it would really like to prefer kmalloc to the
vmalloc fallback - see 23cc5a991c ("vhost-net: extend device
allocation to vmalloc") for more context. Michael Tsirkin has also
noted:
"__GFP_REPEAT overhead is during allocation time. Using vmalloc means
all accesses are slowed down. Allocation is not on data path, accesses
are."
The similar applies to other vhost_kvzalloc users.
Let's teach kvmalloc_node to handle __GFP_REPEAT properly. There are
two things to be careful about. First we should prevent from the OOM
killer and so have to involve __GFP_NORETRY by default and secondly
override __GFP_REPEAT for !costly order requests as the __GFP_REPEAT is
ignored for !costly orders.
Supporting __GFP_REPEAT like semantic for !costly request is possible it
would require changes in the page allocator. This is out of scope of
this patch.
This patch shouldn't introduce any functional change.
Link: http://lkml.kernel.org/r/20170306103032.2540-3-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
__vmalloc_node_flags used to be static inline but this has changed by
"mm: introduce kv[mz]alloc helpers" because kvmalloc_node needs to use
it as well and the code is outside of the vmalloc proper. I haven't
realized that changing this will lead to a subtle bug though. The
function is responsible to track the caller as well. This caller is
then printed by /proc/vmallocinfo. If __vmalloc_node_flags is not
inline then we would get only direct users of __vmalloc_node_flags as
callers (e.g. v[mz]alloc) which reduces usefulness of this debugging
feature considerably. It simply doesn't help to see that the given
range belongs to vmalloc as a caller:
0xffffc90002c79000-0xffffc90002c7d000 16384 vmalloc+0x16/0x18 pages=3 vmalloc N0=3
0xffffc90002c81000-0xffffc90002c85000 16384 vmalloc+0x16/0x18 pages=3 vmalloc N1=3
0xffffc90002c8d000-0xffffc90002c91000 16384 vmalloc+0x16/0x18 pages=3 vmalloc N1=3
0xffffc90002c95000-0xffffc90002c99000 16384 vmalloc+0x16/0x18 pages=3 vmalloc N1=3
We really want to catch the _caller_ of the vmalloc function. Fix this
issue by making __vmalloc_node_flags static inline again.
Link: http://lkml.kernel.org/r/20170502134657.12381-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "kvmalloc", v5.
There are many open coded kmalloc with vmalloc fallback instances in the
tree. Most of them are not careful enough or simply do not care about
the underlying semantic of the kmalloc/page allocator which means that
a) some vmalloc fallbacks are basically unreachable because the kmalloc
part will keep retrying until it succeeds b) the page allocator can
invoke a really disruptive steps like the OOM killer to move forward
which doesn't sound appropriate when we consider that the vmalloc
fallback is available.
As it can be seen implementing kvmalloc requires quite an intimate
knowledge if the page allocator and the memory reclaim internals which
strongly suggests that a helper should be implemented in the memory
subsystem proper.
Most callers, I could find, have been converted to use the helper
instead. This is patch 6. There are some more relying on __GFP_REPEAT
in the networking stack which I have converted as well and Eric Dumazet
was not opposed [2] to convert them as well.
[1] http://lkml.kernel.org/r/20170130094940.13546-1-mhocko@kernel.org
[2] http://lkml.kernel.org/r/1485273626.16328.301.camel@edumazet-glaptop3.roam.corp.google.com
This patch (of 9):
Using kmalloc with the vmalloc fallback for larger allocations is a
common pattern in the kernel code. Yet we do not have any common helper
for that and so users have invented their own helpers. Some of them are
really creative when doing so. Let's just add kv[mz]alloc and make sure
it is implemented properly. This implementation makes sure to not make
a large memory pressure for > PAGE_SZE requests (__GFP_NORETRY) and also
to not warn about allocation failures. This also rules out the OOM
killer as the vmalloc is a more approapriate fallback than a disruptive
user visible action.
This patch also changes some existing users and removes helpers which
are specific for them. In some cases this is not possible (e.g.
ext4_kvmalloc, libcfs_kvzalloc) because those seems to be broken and
require GFP_NO{FS,IO} context which is not vmalloc compatible in general
(note that the page table allocation is GFP_KERNEL). Those need to be
fixed separately.
While we are at it, document that __vmalloc{_node} about unsupported gfp
mask because there seems to be a lot of confusion out there.
kvmalloc_node will warn about GFP_KERNEL incompatible (which are not
superset) flags to catch new abusers. Existing ones would have to die
slowly.
[sfr@canb.auug.org.au: f2fs fixup]
Link: http://lkml.kernel.org/r/20170320163735.332e64b7@canb.auug.org.au
Link: http://lkml.kernel.org/r/20170306103032.2540-2-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Reviewed-by: Andreas Dilger <adilger@dilger.ca> [ext4 part]
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Assign 'struct kern_ipc_perm' its own cacheline to avoid false sharing
with sysv ipc calls.
While the structure itself is rather read-mostly throughout the lifespan
of ipc, the spinlock causes most of the invalidations. One example is
commit 31a7c4746e ("ipc/sem.c: cacheline align the ipc spinlock for
semaphores"). Therefore, extend this to all ipc.
The effect of cacheline alignment on sems can be seen in sembench, which
deals mostly with semtimedop wait/wakes is seen to improve raw
throughput (worker loops) between 8 to 12% on a 24-core x86 with over 4
threads.
Link: http://lkml.kernel.org/r/1486673582-6979-4-git-send-email-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
sys_newlstat is a system call implementation that is meant for user
space, and that copies kernel-internal data structure to the user
format, which is not needed for in-kernel users.
Further, as we rearrange the system call implementation so we can extend
it with 64-bit time_t, the prototype for sys_newlstat changes.
This changes the initramfs code to use vfs_lstat directly, to get it out
of the way of the time_t changes, and make it slightly more efficient in
the process. Along the same lines we also replace sys_stat and
sys_stat64 with vfs_stat.
Link: http://lkml.kernel.org/r/20170314214932.4052842-1-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Many "embedded" architectures provide CMDLINE_FORCE to allow the kernel
to override the command line provided by an inflexible bootloader.
However there is currrently no way for the kernel to override the
initramfs image provided by the bootloader meaning there are still ways
for bootloaders to make things difficult for us.
Fix this by introducing INITRAMFS_FORCE which can prevent the kernel
from loading the bootloader supplied image.
We use CMDLINE_FORCE (and its friend CMDLINE_EXTEND) to imply that the
system has an inflexible bootloader. This allow us to avoid presenting
this config option to users of systems where inflexible bootloaders
aren't usually a problem.
Link: http://lkml.kernel.org/r/20170217121940.30126-1-daniel.thompson@linaro.org
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
smatch says:
WARNING: please, no spaces at the start of a line
#30: FILE: lib/zlib_inflate/inftrees.c:112:
+ for (min = 1; min < MAXBITS; min++)$
total: 0 errors, 1 warnings, 8 lines checked
NOTE: For some of the reported defects, checkpatch may be able to
mechanically convert to the typical style using --fix or --fix-inplace.
./patches/zlib-inflate-fix-potential-buffer-overflow.patch has style problems, please review.
NOTE: If any of the errors are false positives, please report
them to the maintainer, see CHECKPATCH in MAINTAINERS.
Please run checkpatch prior to sending patches
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "Expose task pid_ns_for_children to userspace".
pid_ns_for_children set by a task is known only to the task itself, and
it's impossible to identify it from outside.
It's a big problem for checkpoint/restore software like CRIU, because it
can't correctly handle tasks, that do setns(CLONE_NEWPID) in proccess of
their work. If they have a custom pid_ns_for_children before dump, they
must have the same ns after restore. Otherwise, restored task bumped
into enviroment it does not expect.
This patchset solves the problem. It exposes pid_ns_for_children to ns
directory in standard way with the name "pid_for_children":
~# ls /proc/5531/ns -l | grep pid
lrwxrwxrwx 1 root root 0 Jan 14 16:38 pid -> pid:[4026531836]
lrwxrwxrwx 1 root root 0 Jan 14 16:38 pid_for_children -> pid:[4026532286]
This patch (of 2):
Make possible to have link content prefix yyy different from the link
name xxx:
$ readlink /proc/[pid]/ns/xxx
yyy:[4026531838]
This will be used in next patch.
Link: http://lkml.kernel.org/r/149201120318.6007.7362655181033883000.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Andrei Vagin <avagin@virtuozzo.com>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul Moore <paul@paul-moore.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
fadump supports specifying memory to reserve for fadump's crash kernel
with fadump_reserve_mem kernel parameter. This parameter currently
supports passing a fixed memory size, like fadump_reserve_mem=<size>
only. This patch aims to add support for other syntaxes like
range-based memory size
<range1>:<size1>[,<range2>:<size2>,<range3>:<size3>,...] which allows
using the same parameter to boot the kernel with different system RAM
sizes.
As crashkernel parameter already supports the above mentioned syntaxes,
this patch deprecates fadump_reserve_mem parameter and reuses
crashkernel parameter instead, to specify memory for fadump's crash
kernel memory reservation as well. If any offset is provided in
crashkernel parameter, it will be ignored in case of fadump, as fadump
reserves memory at end of RAM.
Advantages using crashkernel parameter instead of fadump_reserve_mem
parameter are one less kernel parameter overall, code reuse and support
for multiple syntaxes to specify memory.
Suggested-by: Dave Young <dyoung@redhat.com>
Link: http://lkml.kernel.org/r/149035346749.6881.911095631212975718.stgit@hbathini.in.ibm.com
Signed-off-by: Hari Bathini <hbathini@linux.vnet.ibm.com>
Reviewed-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "kexec/fadump: remove dependency with CONFIG_KEXEC and
reuse crashkernel parameter for fadump", v4.
Traditionally, kdump is used to save vmcore in case of a crash. Some
architectures like powerpc can save vmcore using architecture specific
support instead of kexec/kdump mechanism. Such architecture specific
support also needs to reserve memory, to be used by dump capture kernel.
crashkernel parameter can be a reused, for memory reservation, by such
architecture specific infrastructure.
This patchset removes dependency with CONFIG_KEXEC for crashkernel
parameter and vmcoreinfo related code as it can be reused without kexec
support. Also, crashkernel parameter is reused instead of
fadump_reserve_mem to reserve memory for fadump.
The first patch moves crashkernel parameter parsing and vmcoreinfo
related code under CONFIG_CRASH_CORE instead of CONFIG_KEXEC_CORE. The
second patch reuses the definitions of append_elf_note() & final_note()
functions under CONFIG_CRASH_CORE in IA64 arch code. The third patch
removes dependency on CONFIG_KEXEC for firmware-assisted dump (fadump)
in powerpc. The next patch reuses crashkernel parameter for reserving
memory for fadump, instead of the fadump_reserve_mem parameter. This
has the advantage of using all syntaxes crashkernel parameter supports,
for fadump as well. The last patch updates fadump kernel documentation
about use of crashkernel parameter.
This patch (of 5):
Traditionally, kdump is used to save vmcore in case of a crash. Some
architectures like powerpc can save vmcore using architecture specific
support instead of kexec/kdump mechanism. Such architecture specific
support also needs to reserve memory, to be used by dump capture kernel.
crashkernel parameter can be a reused, for memory reservation, by such
architecture specific infrastructure.
But currently, code related to vmcoreinfo and parsing of crashkernel
parameter is built under CONFIG_KEXEC_CORE. This patch introduces
CONFIG_CRASH_CORE and moves the above mentioned code under this config,
allowing code reuse without dependency on CONFIG_KEXEC. There is no
functional change with this patch.
Link: http://lkml.kernel.org/r/149035338104.6881.4550894432615189948.stgit@hbathini.in.ibm.com
Signed-off-by: Hari Bathini <hbathini@linux.vnet.ibm.com>
Acked-by: Dave Young <dyoung@redhat.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Bit searching functions accept "unsigned long" indices but
"nr_cpumask_bits" is "int" which is signed, so inevitable sign
extensions occur on x86_64. Those MOVSX are #1 MOVSX bloat by number of
uses across whole kernel.
Change "nr_cpumask_bits" to unsigned, this number can't be negative
after all. It allows to do implicit zero-extension on x86_64 without
MOVSX.
Change signed comparisons into unsigned comparisons where necessary.
Other uses looks fine because it is either argument passed to a function
or comparison is already unsigned.
Net win on allyesconfig type of kernel: ~2.8 KB (!)
add/remove: 0/0 grow/shrink: 8/725 up/down: 93/-2926 (-2833)
function old new delta
xen_exit_mmap 691 735 +44
qstat_read 426 440 +14
__cpufreq_cooling_register 1678 1687 +9
trace_rb_cpu_prepare 447 455 +8
vermagic 54 60 +6
nfp_driver_version 54 60 +6
rcu_torture_stats_print 1147 1151 +4
find_next_push_cpu 267 269 +2
xen_irq_resume 961 960 -1
...
init_vp_index 946 906 -40
od_set_powersave_bias 328 281 -47
power_cpu_exit 193 139 -54
arch_show_interrupts 3538 3484 -54
select_idle_sibling 1558 1471 -87
Total: Before=158358910, After=158356077, chg -0.00%
Same arguments apply to "nr_cpu_ids" but I haven't yet found enough
courage to delve into this issue (and proper fix may require new type
"cpu_t" which is whole separate story).
Link: http://lkml.kernel.org/r/20170309205322.GA1728@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The current SUSPECT_CODE_INDENT test does not recognize several
defective code style defects where code following a logical test is
inappropriately indented.
Before this patch, for code like:
if (foo)
bar();
checkpatch would not emit a warning.
Improve the test to warn when code after a logical test has the same
indentation as the logical test.
Perform the same indentation test for "else" blocks too.
Link: http://lkml.kernel.org/r/df2374b68c4a68af2b7ef08afe486584811f610a.1493683942.git.joe@perches.com
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The existing behavior relies on patch context to identify function
declarations. Add the ability to find function declarations when there
is an open brace in column 1.
This finds function declarations only in specific single line forms
where the function name is on a single line like:
int foo(args...)
{
and
int
foo(args...)
{
It does not recognize function declarations like:
int foo(int bar,
int baz)
{
Link: http://lkml.kernel.org/r/738d74bbbe1a06b80f11ed504818107c68903095.1488155636.git.joe@perches.com
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When I was running my testcase which may block hundreds of threads on fs
locks, I got lockup due to output from debug_show_all_locks() added by
commit b2d4c2edb2 ("locking/hung_task: Show all locks").
For example, if 1000 threads were blocked in TASK_UNINTERRUPTIBLE state
and 500 out of 1000 threads hold some lock, debug_show_all_locks() from
for_each_process_thread() loop will report locks held by 500 threads for
1000 times. This is a too much noise.
In order to make sure rcu_lock_break() is called frequently, we should
avoid calling debug_show_all_locks() from for_each_process_thread() loop
because debug_show_all_locks() effectively calls for_each_process_thread()
loop. Let's defer calling debug_show_all_locks() till before panic() or
leaving for_each_process_thread() loop.
Link: http://lkml.kernel.org/r/1489296834-60436-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reviewed-by: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
jiffies_64 is defined in kernel/time/timer.c with
____cacheline_aligned_in_smp, however this macro is not part of the
declaration of jiffies and jiffies_64 in jiffies.h.
As a result clang generates the following warning:
kernel/time/timer.c:57:26: error: section does not match previous declaration [-Werror,-Wsection]
__visible u64 jiffies_64 __cacheline_aligned_in_smp = INITIAL_JIFFIES;
^
include/linux/cache.h:39:36: note: expanded from macro '__cacheline_aligned_in_smp'
^
include/linux/cache.h:34:4: note: expanded from macro '__cacheline_aligned'
__section__(".data..cacheline_aligned")))
^
include/linux/jiffies.h:77:12: note: previous attribute is here
extern u64 __jiffy_data jiffies_64;
^
include/linux/jiffies.h:70:38: note: expanded from macro '__jiffy_data'
Link: http://lkml.kernel.org/r/20170403190200.70273-1-mka@chromium.org
Signed-off-by: Matthias Kaehlcke <mka@chromium.org>
Cc: "Jason A . Donenfeld" <Jason@zx2c4.com>
Cc: Grant Grundler <grundler@chromium.org>
Cc: Michael Davidson <md@google.com>
Cc: Greg Hackmann <ghackmann@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The main goal of direct compaction is to form a high-order page for
allocation, but it should also help against long-term fragmentation when
possible.
Most lower-than-pageblock-order compactions are for non-movable
allocations, which means that if we compact in a movable pageblock and
terminate as soon as we create the high-order page, it's unlikely that
the fallback heuristics will claim the whole block. Instead there might
be a single unmovable page in a pageblock full of movable pages, and the
next unmovable allocation might pick another pageblock and increase
long-term fragmentation.
To help against such scenarios, this patch changes the termination
criteria for compaction so that the current pageblock is finished even
though the high-order page already exists. Note that it might be
possible that the high-order page formed elsewhere in the zone due to
parallel activity, but this patch doesn't try to detect that.
This is only done with sync compaction, because async compaction is
limited to pageblock of the same migratetype, where it cannot result in
a migratetype fallback. (Async compaction also eagerly skips
order-aligned blocks where isolation fails, which is against the goal of
migrating away as much of the pageblock as possible.)
As a result of this patch, long-term memory fragmentation should be
reduced.
In testing based on 4.9 kernel with stress-highalloc from mmtests
configured for order-4 GFP_KERNEL allocations, this patch has reduced
the number of unmovable allocations falling back to movable pageblocks
by 20%. The number
Link: http://lkml.kernel.org/r/20170307131545.28577-9-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The migrate scanner in async compaction is currently limited to
MIGRATE_MOVABLE pageblocks. This is a heuristic intended to reduce
latency, based on the assumption that non-MOVABLE pageblocks are
unlikely to contain movable pages.
However, with the exception of THP's, most high-order allocations are
not movable. Should the async compaction succeed, this increases the
chance that the non-MOVABLE allocations will fallback to a MOVABLE
pageblock, making the long-term fragmentation worse.
This patch attempts to help the situation by changing async direct
compaction so that the migrate scanner only scans the pageblocks of the
requested migratetype. If it's a non-MOVABLE type and there are such
pageblocks that do contain movable pages, chances are that the
allocation can succeed within one of such pageblocks, removing the need
for a fallback. If that fails, the subsequent sync attempt will ignore
this restriction.
In testing based on 4.9 kernel with stress-highalloc from mmtests
configured for order-4 GFP_KERNEL allocations, this patch has reduced
the number of unmovable allocations falling back to movable pageblocks
by 30%. The number of movable allocations falling back is reduced by
12%.
Link: http://lkml.kernel.org/r/20170307131545.28577-8-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>