From 04fd61a4e01028210a91f0efc408c8bc61a3018c Mon Sep 17 00:00:00 2001 From: Yafang Shao Date: Wed, 13 May 2020 17:50:34 -0700 Subject: [PATCH 1/7] mm, memcg: fix inconsistent oom event behavior A recent commit 9852ae3fe529 ("mm, memcg: consider subtrees in memory.events") changed the behavior of memcg events, which will now consider subtrees in memory.events. But oom_kill event is a special one as it is used in both cgroup1 and cgroup2. In cgroup1, it is displayed in memory.oom_control. The file memory.oom_control is in both root memcg and non root memcg, that is different with memory.event as it only in non-root memcg. That commit is okay for cgroup2, but it is not okay for cgroup1 as it will cause inconsistent behavior between root memcg and non-root memcg. Here's an example on why this behavior is inconsistent in cgroup1. root memcg / memcg foo / memcg bar Suppose there's an oom_kill in memcg bar, then the oon_kill will be root memcg : memory.oom_control(oom_kill) 0 / memcg foo : memory.oom_control(oom_kill) 1 / memcg bar : memory.oom_control(oom_kill) 1 For the non-root memcg, its memory.oom_control(oom_kill) includes its descendants' oom_kill, but for root memcg, it doesn't include its descendants' oom_kill. That means, memory.oom_control(oom_kill) has different meanings in different memcgs. That is inconsistent. Then the user has to know whether the memcg is root or not. If we can't fully support it in cgroup1, for example by adding memory.events.local into cgroup1 as well, then let's don't touch its original behavior. Fixes: 9852ae3fe529 ("mm, memcg: consider subtrees in memory.events") Reported-by: Randy Dunlap Signed-off-by: Yafang Shao Signed-off-by: Andrew Morton Reviewed-by: Shakeel Butt Acked-by: Johannes Weiner Acked-by: Chris Down Acked-by: Michal Hocko Cc: Link: http://lkml.kernel.org/r/20200502141055.7378-1-laoar.shao@gmail.com Signed-off-by: Linus Torvalds --- include/linux/memcontrol.h | 2 ++ 1 file changed, 2 insertions(+) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index d275c72c4f8e..977edd3b7bd8 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -783,6 +783,8 @@ static inline void memcg_memory_event(struct mem_cgroup *memcg, atomic_long_inc(&memcg->memory_events[event]); cgroup_file_notify(&memcg->events_file); + if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) + break; if (cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_LOCAL_EVENTS) break; } while ((memcg = parent_mem_cgroup(memcg)) && From 65759097d804d2a9ad2b687db436319704ba7019 Mon Sep 17 00:00:00 2001 From: Roman Penyaev Date: Wed, 13 May 2020 17:50:38 -0700 Subject: [PATCH 2/7] epoll: call final ep_events_available() check under the lock There is a possible race when ep_scan_ready_list() leaves ->rdllist and ->obflist empty for a short period of time although some events are pending. It is quite likely that ep_events_available() observes empty lists and goes to sleep. Since commit 339ddb53d373 ("fs/epoll: remove unnecessary wakeups of nested epoll") we are conservative in wakeups (there is only one place for wakeup and this is ep_poll_callback()), thus ep_events_available() must always observe correct state of two lists. The easiest and correct way is to do the final check under the lock. This does not impact the performance, since lock is taken anyway for adding a wait entry to the wait queue. The discussion of the problem can be found here: https://lore.kernel.org/linux-fsdevel/a2f22c3c-c25a-4bda-8339-a7bdaf17849e@akamai.com/ In this patch barrierless __set_current_state() is used. This is safe since waitqueue_active() is called under the same lock on wakeup side. Short-circuit for fatal signals (i.e. fatal_signal_pending() check) is moved to the line just before actual events harvesting routine. This is fully compliant to what is said in the comment of the patch where the actual fatal_signal_pending() check was added: c257a340ede0 ("fs, epoll: short circuit fetching events if thread has been killed"). Fixes: 339ddb53d373 ("fs/epoll: remove unnecessary wakeups of nested epoll") Reported-by: Jason Baron Reported-by: Randy Dunlap Signed-off-by: Roman Penyaev Signed-off-by: Andrew Morton Reviewed-by: Jason Baron Cc: Khazhismel Kumykov Cc: Alexander Viro Cc: Link: http://lkml.kernel.org/r/20200505145609.1865152-1-rpenyaev@suse.de Signed-off-by: Linus Torvalds --- fs/eventpoll.c | 56 ++++++++++++++++++++++++++++---------------------- 1 file changed, 32 insertions(+), 24 deletions(-) diff --git a/fs/eventpoll.c b/fs/eventpoll.c index aba03ee749f8..12eebcdea9c8 100644 --- a/fs/eventpoll.c +++ b/fs/eventpoll.c @@ -1879,34 +1879,33 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events, * event delivery. */ init_wait(&wait); + write_lock_irq(&ep->lock); - __add_wait_queue_exclusive(&ep->wq, &wait); + /* + * Barrierless variant, waitqueue_active() is called under + * the same lock on wakeup ep_poll_callback() side, so it + * is safe to avoid an explicit barrier. + */ + __set_current_state(TASK_INTERRUPTIBLE); + + /* + * Do the final check under the lock. ep_scan_ready_list() + * plays with two lists (->rdllist and ->ovflist) and there + * is always a race when both lists are empty for short + * period of time although events are pending, so lock is + * important. + */ + eavail = ep_events_available(ep); + if (!eavail) { + if (signal_pending(current)) + res = -EINTR; + else + __add_wait_queue_exclusive(&ep->wq, &wait); + } write_unlock_irq(&ep->lock); - /* - * We don't want to sleep if the ep_poll_callback() sends us - * a wakeup in between. That's why we set the task state - * to TASK_INTERRUPTIBLE before doing the checks. - */ - set_current_state(TASK_INTERRUPTIBLE); - /* - * Always short-circuit for fatal signals to allow - * threads to make a timely exit without the chance of - * finding more events available and fetching - * repeatedly. - */ - if (fatal_signal_pending(current)) { - res = -EINTR; + if (eavail || res) break; - } - - eavail = ep_events_available(ep); - if (eavail) - break; - if (signal_pending(current)) { - res = -EINTR; - break; - } if (!schedule_hrtimeout_range(to, slack, HRTIMER_MODE_ABS)) { timed_out = 1; @@ -1927,6 +1926,15 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events, } send_events: + if (fatal_signal_pending(current)) { + /* + * Always short-circuit for fatal signals to allow + * threads to make a timely exit without the chance of + * finding more events available and fetching + * repeatedly. + */ + res = -EINTR; + } /* * Try to transfer events to user space. In case we get 0 events and * there's still timeout left over, we go trying again in search of From 475f4dfc021c5fde69f3b7d3287bde0a50477b05 Mon Sep 17 00:00:00 2001 From: Peter Xu Date: Wed, 13 May 2020 17:50:41 -0700 Subject: [PATCH 3/7] mm/gup: fix fixup_user_fault() on multiple retries This part was overlooked when reworking the gup code on multiple retries. When we get the 2nd+ retry, we'll be with TRIED flag set. Current code will bail out on the 2nd retry because the !TRIED check will fail so the retry logic will be skipped. What's worse is that, it will also return zero which errornously hints the caller that the page is faulted in while it's not. The !TRIED flag check seems to not be needed even before the mutliple retries change because if we get a VM_FAULT_RETRY, it must be the 1st retry, and we should not have TRIED set for that. Fix it by removing the !TRIED check, at the meantime check against fatal signals properly before the page fault so we can still properly respond to the user killing the process during retries. Fixes: 4426e945df58 ("mm/gup: allow VM_FAULT_RETRY for multiple times") Reported-by: Randy Dunlap Signed-off-by: Peter Xu Signed-off-by: Andrew Morton Cc: Alex Williamson Cc: Brian Geffon Link: http://lkml.kernel.org/r/20200502003523.8204-1-peterx@redhat.com Signed-off-by: Linus Torvalds --- mm/gup.c | 12 +++++++----- 1 file changed, 7 insertions(+), 5 deletions(-) diff --git a/mm/gup.c b/mm/gup.c index 50681f0286de..87a6a59fe667 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -1218,6 +1218,10 @@ int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm, if (!vma_permits_fault(vma, fault_flags)) return -EFAULT; + if ((fault_flags & FAULT_FLAG_KILLABLE) && + fatal_signal_pending(current)) + return -EINTR; + ret = handle_mm_fault(vma, address, fault_flags); major |= ret & VM_FAULT_MAJOR; if (ret & VM_FAULT_ERROR) { @@ -1230,11 +1234,9 @@ int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm, if (ret & VM_FAULT_RETRY) { down_read(&mm->mmap_sem); - if (!(fault_flags & FAULT_FLAG_TRIED)) { - *unlocked = true; - fault_flags |= FAULT_FLAG_TRIED; - goto retry; - } + *unlocked = true; + fault_flags |= FAULT_FLAG_TRIED; + goto retry; } if (tsk) { From d1564926066115fb47ca06b2b9d23ed506ca9608 Mon Sep 17 00:00:00 2001 From: Brian Geffon Date: Wed, 13 May 2020 17:50:44 -0700 Subject: [PATCH 4/7] userfaultfd: fix remap event with MREMAP_DONTUNMAP A user is not required to set a new address when using MREMAP_DONTUNMAP as it can be used without MREMAP_FIXED. When doing so the remap event will use new_addr which may not have been set and we didn't propagate it back other then in the return value of remap_to. Because ret is always the new address it's probably more correct to use it rather than new_addr on the remap_event_complete call, and it resolves this bug. Fixes: e346b3813067d4b ("mm/mremap: add MREMAP_DONTUNMAP to mremap()") Reported-by: Randy Dunlap Signed-off-by: Brian Geffon Signed-off-by: Andrew Morton Cc: Lokesh Gidra Cc: Minchan Kim Cc: Kirill A. Shutemov Cc: Vlastimil Babka Cc: "Michael S . Tsirkin" Cc: Andrea Arcangeli Cc: Sonny Rao Cc: Joel Fernandes Link: http://lkml.kernel.org/r/20200506172158.218366-1-bgeffon@google.com Signed-off-by: Linus Torvalds --- mm/mremap.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/mremap.c b/mm/mremap.c index c881abeba0bf..6aa6ea605068 100644 --- a/mm/mremap.c +++ b/mm/mremap.c @@ -794,7 +794,7 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len, if (locked && new_len > old_len) mm_populate(new_addr + old_len, new_len - old_len); userfaultfd_unmap_complete(mm, &uf_unmap_early); - mremap_userfaultfd_complete(&uf, addr, new_addr, old_len); + mremap_userfaultfd_complete(&uf, addr, ret, old_len); userfaultfd_unmap_complete(mm, &uf_unmap); return ret; } From 5e698222c70257d13ae0816720dde57c56f81e15 Mon Sep 17 00:00:00 2001 From: Vasily Averin Date: Wed, 13 May 2020 17:50:48 -0700 Subject: [PATCH 5/7] ipc/util.c: sysvipc_find_ipc() incorrectly updates position index Commit 89163f93c6f9 ("ipc/util.c: sysvipc_find_ipc() should increase position index") is causing this bug (seen on 5.6.8): # ipcs -q ------ Message Queues -------- key msqid owner perms used-bytes messages # ipcmk -Q Message queue id: 0 # ipcs -q ------ Message Queues -------- key msqid owner perms used-bytes messages 0x82db8127 0 root 644 0 0 # ipcmk -Q Message queue id: 1 # ipcs -q ------ Message Queues -------- key msqid owner perms used-bytes messages 0x82db8127 0 root 644 0 0 0x76d1fb2a 1 root 644 0 0 # ipcrm -q 0 # ipcs -q ------ Message Queues -------- key msqid owner perms used-bytes messages 0x76d1fb2a 1 root 644 0 0 0x76d1fb2a 1 root 644 0 0 # ipcmk -Q Message queue id: 2 # ipcrm -q 2 # ipcs -q ------ Message Queues -------- key msqid owner perms used-bytes messages 0x76d1fb2a 1 root 644 0 0 0x76d1fb2a 1 root 644 0 0 # ipcmk -Q Message queue id: 3 # ipcrm -q 1 # ipcs -q ------ Message Queues -------- key msqid owner perms used-bytes messages 0x7c982867 3 root 644 0 0 0x7c982867 3 root 644 0 0 0x7c982867 3 root 644 0 0 0x7c982867 3 root 644 0 0 Whenever an IPC item with a low id is deleted, the items with higher ids are duplicated, as if filling a hole. new_pos should jump through hole of unused ids, pos can be updated inside "for" cycle. Fixes: 89163f93c6f9 ("ipc/util.c: sysvipc_find_ipc() should increase position index") Reported-by: Andreas Schwab Reported-by: Randy Dunlap Signed-off-by: Vasily Averin Signed-off-by: Andrew Morton Acked-by: Waiman Long Cc: NeilBrown Cc: Steven Rostedt Cc: Ingo Molnar Cc: Peter Oberparleiter Cc: Davidlohr Bueso Cc: Manfred Spraul Cc: Link: http://lkml.kernel.org/r/4921fe9b-9385-a2b4-1dc4-1099be6d2e39@virtuozzo.com Signed-off-by: Linus Torvalds --- ipc/util.c | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/ipc/util.c b/ipc/util.c index 7acccfded7cb..cfa0045e748d 100644 --- a/ipc/util.c +++ b/ipc/util.c @@ -764,21 +764,21 @@ static struct kern_ipc_perm *sysvipc_find_ipc(struct ipc_ids *ids, loff_t pos, total++; } - *new_pos = pos + 1; + ipc = NULL; if (total >= ids->in_use) - return NULL; + goto out; for (; pos < ipc_mni; pos++) { ipc = idr_find(&ids->ipcs_idr, pos); if (ipc != NULL) { rcu_read_lock(); ipc_lock_object(ipc); - return ipc; + break; } } - - /* Out of range - return NULL to terminate iteration */ - return NULL; +out: + *new_pos = pos + 1; + return ipc; } static void *sysvipc_proc_next(struct seq_file *s, void *it, loff_t *pos) From 8a16c09edc58982d56c49ab577fdcdf830fbc3a5 Mon Sep 17 00:00:00 2001 From: Andrey Konovalov Date: Wed, 13 May 2020 17:50:51 -0700 Subject: [PATCH 6/7] kasan: consistently disable debugging features KASAN is incompatible with some kernel debugging/tracing features. There's been multiple patches that disable those feature for some of KASAN files one by one. Instead of prolonging that, disable these features for all KASAN files at once. Reported-by: Randy Dunlap Signed-off-by: Andrey Konovalov Signed-off-by: Andrew Morton Cc: Andrey Ryabinin Cc: Alexander Potapenko Cc: Dmitry Vyukov Cc: Leon Romanovsky Link: http://lkml.kernel.org/r/29bd753d5ff5596425905b0b07f51153e2345cc1.1589297433.git.andreyknvl@google.com Signed-off-by: Linus Torvalds --- mm/kasan/Makefile | 15 ++++++++++----- 1 file changed, 10 insertions(+), 5 deletions(-) diff --git a/mm/kasan/Makefile b/mm/kasan/Makefile index 08b43de2383b..434d503a6525 100644 --- a/mm/kasan/Makefile +++ b/mm/kasan/Makefile @@ -1,23 +1,28 @@ # SPDX-License-Identifier: GPL-2.0 KASAN_SANITIZE := n -UBSAN_SANITIZE_common.o := n -UBSAN_SANITIZE_generic.o := n -UBSAN_SANITIZE_generic_report.o := n -UBSAN_SANITIZE_tags.o := n +UBSAN_SANITIZE := n KCOV_INSTRUMENT := n +# Disable ftrace to avoid recursion. CFLAGS_REMOVE_common.o = $(CC_FLAGS_FTRACE) CFLAGS_REMOVE_generic.o = $(CC_FLAGS_FTRACE) CFLAGS_REMOVE_generic_report.o = $(CC_FLAGS_FTRACE) +CFLAGS_REMOVE_init.o = $(CC_FLAGS_FTRACE) +CFLAGS_REMOVE_quarantine.o = $(CC_FLAGS_FTRACE) +CFLAGS_REMOVE_report.o = $(CC_FLAGS_FTRACE) CFLAGS_REMOVE_tags.o = $(CC_FLAGS_FTRACE) +CFLAGS_REMOVE_tags_report.o = $(CC_FLAGS_FTRACE) # Function splitter causes unnecessary splits in __asan_load1/__asan_store1 # see: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63533 - CFLAGS_common.o := $(call cc-option, -fno-conserve-stack -fno-stack-protector) CFLAGS_generic.o := $(call cc-option, -fno-conserve-stack -fno-stack-protector) CFLAGS_generic_report.o := $(call cc-option, -fno-conserve-stack -fno-stack-protector) +CFLAGS_init.o := $(call cc-option, -fno-conserve-stack -fno-stack-protector) +CFLAGS_quarantine.o := $(call cc-option, -fno-conserve-stack -fno-stack-protector) +CFLAGS_report.o := $(call cc-option, -fno-conserve-stack -fno-stack-protector) CFLAGS_tags.o := $(call cc-option, -fno-conserve-stack -fno-stack-protector) +CFLAGS_tags_report.o := $(call cc-option, -fno-conserve-stack -fno-stack-protector) obj-$(CONFIG_KASAN) := common.o init.o report.o obj-$(CONFIG_KASAN_GENERIC) += generic.o generic_report.o quarantine.o From 13cf04880235cd70d408d70426ada31408ded2ee Mon Sep 17 00:00:00 2001 From: Andrey Konovalov Date: Wed, 13 May 2020 17:50:54 -0700 Subject: [PATCH 7/7] kasan: add missing functions declarations to kasan.h KASAN is currently missing declarations for __asan_report* and __hwasan* functions. This can lead to compiler warnings. Reported-by: Leon Romanovsky Reported-by: Randy Dunlap Signed-off-by: Andrey Konovalov Signed-off-by: Andrew Morton Tested-by: Leon Romanovsky Cc: Andrey Ryabinin Cc: Alexander Potapenko Cc: Dmitry Vyukov Link: http://lkml.kernel.org/r/45b445a76a79208918f0cc44bfabebaea909b54d.1589297433.git.andreyknvl@google.com Signed-off-by: Linus Torvalds --- mm/kasan/kasan.h | 34 ++++++++++++++++++++++++++++++++-- 1 file changed, 32 insertions(+), 2 deletions(-) diff --git a/mm/kasan/kasan.h b/mm/kasan/kasan.h index e8f37199d885..cfade6413528 100644 --- a/mm/kasan/kasan.h +++ b/mm/kasan/kasan.h @@ -212,8 +212,6 @@ static inline const void *arch_kasan_set_tag(const void *addr, u8 tag) asmlinkage void kasan_unpoison_task_stack_below(const void *watermark); void __asan_register_globals(struct kasan_global *globals, size_t size); void __asan_unregister_globals(struct kasan_global *globals, size_t size); -void __asan_loadN(unsigned long addr, size_t size); -void __asan_storeN(unsigned long addr, size_t size); void __asan_handle_no_return(void); void __asan_alloca_poison(unsigned long addr, size_t size); void __asan_allocas_unpoison(const void *stack_top, const void *stack_bottom); @@ -228,6 +226,8 @@ void __asan_load8(unsigned long addr); void __asan_store8(unsigned long addr); void __asan_load16(unsigned long addr); void __asan_store16(unsigned long addr); +void __asan_loadN(unsigned long addr, size_t size); +void __asan_storeN(unsigned long addr, size_t size); void __asan_load1_noabort(unsigned long addr); void __asan_store1_noabort(unsigned long addr); @@ -239,6 +239,21 @@ void __asan_load8_noabort(unsigned long addr); void __asan_store8_noabort(unsigned long addr); void __asan_load16_noabort(unsigned long addr); void __asan_store16_noabort(unsigned long addr); +void __asan_loadN_noabort(unsigned long addr, size_t size); +void __asan_storeN_noabort(unsigned long addr, size_t size); + +void __asan_report_load1_noabort(unsigned long addr); +void __asan_report_store1_noabort(unsigned long addr); +void __asan_report_load2_noabort(unsigned long addr); +void __asan_report_store2_noabort(unsigned long addr); +void __asan_report_load4_noabort(unsigned long addr); +void __asan_report_store4_noabort(unsigned long addr); +void __asan_report_load8_noabort(unsigned long addr); +void __asan_report_store8_noabort(unsigned long addr); +void __asan_report_load16_noabort(unsigned long addr); +void __asan_report_store16_noabort(unsigned long addr); +void __asan_report_load_n_noabort(unsigned long addr, size_t size); +void __asan_report_store_n_noabort(unsigned long addr, size_t size); void __asan_set_shadow_00(const void *addr, size_t size); void __asan_set_shadow_f1(const void *addr, size_t size); @@ -247,4 +262,19 @@ void __asan_set_shadow_f3(const void *addr, size_t size); void __asan_set_shadow_f5(const void *addr, size_t size); void __asan_set_shadow_f8(const void *addr, size_t size); +void __hwasan_load1_noabort(unsigned long addr); +void __hwasan_store1_noabort(unsigned long addr); +void __hwasan_load2_noabort(unsigned long addr); +void __hwasan_store2_noabort(unsigned long addr); +void __hwasan_load4_noabort(unsigned long addr); +void __hwasan_store4_noabort(unsigned long addr); +void __hwasan_load8_noabort(unsigned long addr); +void __hwasan_store8_noabort(unsigned long addr); +void __hwasan_load16_noabort(unsigned long addr); +void __hwasan_store16_noabort(unsigned long addr); +void __hwasan_loadN_noabort(unsigned long addr, size_t size); +void __hwasan_storeN_noabort(unsigned long addr, size_t size); + +void __hwasan_tag_memory(unsigned long addr, u8 tag, unsigned long size); + #endif