mirror of
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
synced 2026-03-22 13:23:20 -04:00
cdc2fcfea79b9873bb63159f8ed973f4046018c8
118212 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
cdc2fcfea7 |
hugetlb_cgroup: add hugetlb_cgroup reservation counter
These counters will track hugetlb reservations rather than hugetlb memory faulted in. This patch only adds the counter, following patches add the charging and uncharging of the counter. This is patch 1 of an 9 patch series. Problem: Currently tasks attempting to reserve more hugetlb memory than is available get a failure at mmap/shmget time. This is thanks to Hugetlbfs Reservations [1]. However, if a task attempts to reserve more hugetlb memory than its hugetlb_cgroup limit allows, the kernel will allow the mmap/shmget call, but will SIGBUS the task when it attempts to fault in the excess memory. We have users hitting their hugetlb_cgroup limits and thus we've been looking at this failure mode. We'd like to improve this behavior such that users violating the hugetlb_cgroup limits get an error on mmap/shmget time, rather than getting SIGBUS'd when they try to fault the excess memory in. This gives the user an opportunity to fallback more gracefully to non-hugetlbfs memory for example. The underlying problem is that today's hugetlb_cgroup accounting happens at hugetlb memory *fault* time, rather than at *reservation* time. Thus, enforcing the hugetlb_cgroup limit only happens at fault time, and the offending task gets SIGBUS'd. Proposed Solution: A new page counter named 'hugetlb.xMB.rsvd.[limit|usage|max_usage]_in_bytes'. This counter has slightly different semantics than 'hugetlb.xMB.[limit|usage|max_usage]_in_bytes': - While usage_in_bytes tracks all *faulted* hugetlb memory, rsvd.usage_in_bytes tracks all *reserved* hugetlb memory and hugetlb memory faulted in without a prior reservation. - If a task attempts to reserve more memory than limit_in_bytes allows, the kernel will allow it to do so. But if a task attempts to reserve more memory than rsvd.limit_in_bytes, the kernel will fail this reservation. This proposal is implemented in this patch series, with tests to verify functionality and show the usage. Alternatives considered: 1. A new cgroup, instead of only a new page_counter attached to the existing hugetlb_cgroup. Adding a new cgroup seemed like a lot of code duplication with hugetlb_cgroup. Keeping hugetlb related page counters under hugetlb_cgroup seemed cleaner as well. 2. Instead of adding a new counter, we considered adding a sysctl that modifies the behavior of hugetlb.xMB.[limit|usage]_in_bytes, to do accounting at reservation time rather than fault time. Adding a new page_counter seems better as userspace could, if it wants, choose to enforce different cgroups differently: one via limit_in_bytes, and another via rsvd.limit_in_bytes. This could be very useful if you're transitioning how hugetlb memory is partitioned on your system one cgroup at a time, for example. Also, someone may find usage for both limit_in_bytes and rsvd.limit_in_bytes concurrently, and this approach gives them the option to do so. Testing: - Added tests passing. - Used libhugetlbfs for regression testing. [1]: https://www.kernel.org/doc/html/latest/vm/hugetlbfs_reserv.html Signed-off-by: Mina Almasry <almasrymina@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Greg Thelen <gthelen@google.com> Cc: Sandipan Das <sandipan@linux.ibm.com> Link: http://lkml.kernel.org/r/20200211213128.73302-1-almasrymina@google.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
c0d0381ade |
hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization
Patch series "hugetlbfs: use i_mmap_rwsem for more synchronization", v2. While discussing the issue with huge_pte_offset [1], I remembered that there were more outstanding hugetlb races. These issues are: 1) For shared pmds, huge PTE pointers returned by huge_pte_alloc can become invalid via a call to huge_pmd_unshare by another thread. 2) hugetlbfs page faults can race with truncation causing invalid global reserve counts and state. A previous attempt was made to use i_mmap_rwsem in this manner as described at [2]. However, those patches were reverted starting with [3] due to locking issues. To effectively use i_mmap_rwsem to address the above issues it needs to be held (in read mode) during page fault processing. However, during fault processing we need to lock the page we will be adding. Lock ordering requires we take page lock before i_mmap_rwsem. Waiting until after taking the page lock is too late in the fault process for the synchronization we want to do. To address this lock ordering issue, the following patches change the lock ordering for hugetlb pages. This is not too invasive as hugetlbfs processing is done separate from core mm in many places. However, I don't really like this idea. Much ugliness is contained in the new routine hugetlb_page_mapping_lock_write() of patch 1. The only other way I can think of to address these issues is by catching all the races. After catching a race, cleanup, backout, retry ... etc, as needed. This can get really ugly, especially for huge page reservations. At one time, I started writing some of the reservation backout code for page faults and it got so ugly and complicated I went down the path of adding synchronization to avoid the races. Any other suggestions would be welcome. [1] https://lore.kernel.org/linux-mm/1582342427-230392-1-git-send-email-longpeng2@huawei.com/ [2] https://lore.kernel.org/linux-mm/20181222223013.22193-1-mike.kravetz@oracle.com/ [3] https://lore.kernel.org/linux-mm/20190103235452.29335-1-mike.kravetz@oracle.com [4] https://lore.kernel.org/linux-mm/1584028670.7365.182.camel@lca.pw/ [5] https://lore.kernel.org/lkml/20200312183142.108df9ac@canb.auug.org.au/ This patch (of 2): While looking at BUGs associated with invalid huge page map counts, it was discovered and observed that a huge pte pointer could become 'invalid' and point to another task's page table. Consider the following: A task takes a page fault on a shared hugetlbfs file and calls huge_pte_alloc to get a ptep. Suppose the returned ptep points to a shared pmd. Now, another task truncates the hugetlbfs file. As part of truncation, it unmaps everyone who has the file mapped. If the range being truncated is covered by a shared pmd, huge_pmd_unshare will be called. For all but the last user of the shared pmd, huge_pmd_unshare will clear the pud pointing to the pmd. If the task in the middle of the page fault is not the last user, the ptep returned by huge_pte_alloc now points to another task's page table or worse. This leads to bad things such as incorrect page map/reference counts or invalid memory references. To fix, expand the use of i_mmap_rwsem as follows: - i_mmap_rwsem is held in read mode whenever huge_pmd_share is called. huge_pmd_share is only called via huge_pte_alloc, so callers of huge_pte_alloc take i_mmap_rwsem before calling. In addition, callers of huge_pte_alloc continue to hold the semaphore until finished with the ptep. - i_mmap_rwsem is held in write mode whenever huge_pmd_unshare is called. One problem with this scheme is that it requires taking i_mmap_rwsem before taking the page lock during page faults. This is not the order specified in the rest of mm code. Handling of hugetlbfs pages is mostly isolated today. Therefore, we use this alternative locking order for PageHuge() pages. mapping->i_mmap_rwsem hugetlb_fault_mutex (hugetlbfs specific page fault mutex) page->flags PG_locked (lock_page) To help with lock ordering issues, hugetlb_page_mapping_lock_write() is introduced to write lock the i_mmap_rwsem associated with a page. In most cases it is easy to get address_space via vma->vm_file->f_mapping. However, in the case of migration or memory errors for anon pages we do not have an associated vma. A new routine _get_hugetlb_page_mapping() will use anon_vma to get address_space in these cases. Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Prakash Sangappa <prakash.sangappa@oracle.com> Link: http://lkml.kernel.org/r/20200316205756.146666-2-mike.kravetz@oracle.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
20ca87f22b |
mm/mempolicy: check hugepage migration is supported by arch in vma_migratable()
vma_migratable() is called to check if pages in vma can be migrated before go ahead to further actions. Currently it is used in below code path: - task_numa_work - mbind - move_pages For hugetlb mapping, whether vma is migratable or not is determined by: - CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION - arch_hugetlb_migration_supported Issue: current code only checks for CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION alone, and no code should use it directly. (note that current code in vma_migratable don't cause failure or bug because unmap_and_move_huge_page() will catch unsupported hugepage and handle it properly) This patch checks the two factors by hugepage_migration_supported for impoving code logic and robustness. It will enable early bail out of hugepage migration procedure, but because currently all architecture supporting hugepage migration is able to support all page size, we would not see performance gain with this patch applied. vma_migratable() is moved to mm/mempolicy.c, because of the circular reference of mempolicy.h and hugetlb.h cause defining it as inline not feasible. Signed-off-by: Li Xinhai <lixinhai.lxh@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Link: http://lkml.kernel.org/r/1579786179-30633-1-git-send-email-lixinhai.lxh@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
8cceeff48f |
kasan: detect negative size in memory operation function
Patch series "fix the missing underflow in memory operation function", v4. The patchset helps to produce a KASAN report when size is negative in memory operation functions. It is helpful for programmer to solve an undefined behavior issue. Patch 1 based on Dmitry's review and suggestion, patch 2 is a test in order to verify the patch 1. [1]https://bugzilla.kernel.org/show_bug.cgi?id=199341 [2]https://lore.kernel.org/linux-arm-kernel/20190927034338.15813-1-walter-zh.wu@mediatek.com/ This patch (of 2): KASAN missed detecting size is a negative number in memset(), memcpy(), and memmove(), it will cause out-of-bounds bug. So needs to be detected by KASAN. If size is a negative number, then it has a reason to be defined as out-of-bounds bug type. Casting negative numbers to size_t would indeed turn up as a large size_t and its value will be larger than ULONG_MAX/2, so that this can qualify as out-of-bounds. KASAN report is shown below: BUG: KASAN: out-of-bounds in kmalloc_memmove_invalid_size+0x70/0xa0 Read of size 18446744073709551608 at addr ffffff8069660904 by task cat/72 CPU: 2 PID: 72 Comm: cat Not tainted 5.4.0-rc1-next-20191004ajb-00001-gdb8af2f372b2-dirty #1 Hardware name: linux,dummy-virt (DT) Call trace: dump_backtrace+0x0/0x288 show_stack+0x14/0x20 dump_stack+0x10c/0x164 print_address_description.isra.9+0x68/0x378 __kasan_report+0x164/0x1a0 kasan_report+0xc/0x18 check_memory_region+0x174/0x1d0 memmove+0x34/0x88 kmalloc_memmove_invalid_size+0x70/0xa0 [1] https://bugzilla.kernel.org/show_bug.cgi?id=199341 [cai@lca.pw: fix -Wdeclaration-after-statement warn] Link: http://lkml.kernel.org/r/1583509030-27939-1-git-send-email-cai@lca.pw [peterz@infradead.org: fix objtool warning] Link: http://lkml.kernel.org/r/20200305095436.GV2596@hirez.programming.kicks-ass.net Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dmitry Vyukov <dvyukov@google.com> Suggested-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Walter Wu <walter-zh.wu@mediatek.com> Signed-off-by: Qian Cai <cai@lca.pw> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Alexander Potapenko <glider@google.com> Link: http://lkml.kernel.org/r/20191112065302.7015-1-walter-zh.wu@mediatek.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
e03d1f7834 |
mm/sparse: rename pfn_present() to pfn_in_present_section()
After introducing mem sub section concept, pfn_present() loses its literal meaning, and will not be necessary a truth on partial populated mem section. Since all of the callers use it to judge an absent section, it is better to rename pfn_present() as pfn_in_present_section(). Signed-off-by: Pingfan Liu <kernelfans@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc] Cc: Dan Williams <dan.j.williams@intel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Leonardo Bras <leonardo@linux.ibm.com> Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com> Cc: Nathan Lynch <nathanl@linux.ibm.com> Link: http://lkml.kernel.org/r/1581919110-29575-1-git-send-email-kernelfans@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
e346b38130 |
mm/mremap: add MREMAP_DONTUNMAP to mremap()
When remapping an anonymous, private mapping, if MREMAP_DONTUNMAP is set, the source mapping will not be removed. The remap operation will be performed as it would have been normally by moving over the page tables to the new mapping. The old vma will have any locked flags cleared, have no pagetables, and any userfaultfds that were watching that range will continue watching it. For a mapping that is shared or not anonymous, MREMAP_DONTUNMAP will cause the mremap() call to fail. Because MREMAP_DONTUNMAP always results in moving a VMA you MUST use the MREMAP_MAYMOVE flag, it's not possible to resize a VMA while also moving with MREMAP_DONTUNMAP so old_len must always be equal to the new_len otherwise it will return -EINVAL. We hope to use this in Chrome OS where with userfaultfd we could write an anonymous mapping to disk without having to STOP the process or worry about VMA permission changes. This feature also has a use case in Android, Lokesh Gidra has said that "As part of using userfaultfd for GC, We'll have to move the physical pages of the java heap to a separate location. For this purpose mremap will be used. Without the MREMAP_DONTUNMAP flag, when I mremap the java heap, its virtual mapping will be removed as well. Therefore, we'll require performing mmap immediately after. This is not only time consuming but also opens a time window where a native thread may call mmap and reserve the java heap's address range for its own usage. This flag solves the problem." [bgeffon@google.com: v6] Link: http://lkml.kernel.org/r/20200218173221.237674-1-bgeffon@google.com [bgeffon@google.com: v7] Link: http://lkml.kernel.org/r/20200221174248.244748-1-bgeffon@google.com Signed-off-by: Brian Geffon <bgeffon@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Lokesh Gidra <lokeshgidra@google.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: "Michael S . Tsirkin" <mst@redhat.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Will Deacon <will@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Sonny Rao <sonnyrao@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Yu Zhao <yuzhao@google.com> Cc: Jesse Barnes <jsbarnes@google.com> Cc: Nathan Chancellor <natechancellor@gmail.com> Cc: Florian Weimer <fweimer@redhat.com> Link: http://lkml.kernel.org/r/20200207201856.46070-1-bgeffon@google.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
df529cabb7 |
mm: mmap: add trace point of vm_unmapped_area
Even on 64 bit kernel, the mmap failure can happen for a 32 bit task. Virtual memory space shortage of a task on mmap is reported to userspace as -ENOMEM. It can be confused as physical memory shortage of overall system. The vm_unmapped_area can be called to by some drivers or other kernel core system like filesystem. In my platform, GPU driver calls to vm_unmapped_area and the driver returns -ENOMEM even in GPU side shortage. It can be hard to distinguish which code layer returns the -ENOMEM. Create mmap trace file and add trace point of vm_unmapped_area. i.e.) 277.156599: vm_unmapped_area: addr=77e0d03000 err=0 total_vm=0x17014b flags=0x1 len=0x400000 lo=0x8000 hi=0x7878c27000 mask=0x0 ofs=0x1 342.838740: vm_unmapped_area: addr=0 err=-12 total_vm=0xffb08 flags=0x0 len=0x100000 lo=0x40000000 hi=0xfffff000 mask=0x0 ofs=0x22 [akpm@linux-foundation.org: prefix address printk with 0x, per Matthew] Signed-off-by: Jaewon Kim <jaewon31.kim@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Borislav Petkov <bp@suse.de> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michel Lespinasse <walken@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Link: http://lkml.kernel.org/r/20200320055823.27089-3-jaewon31.kim@samsung.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
baceaf1c8b |
mmap: remove inline of vm_unmapped_area
Patch series "mm: mmap: add mmap trace point", v3. Create mmap trace file and add trace point of vm_unmapped_area(). This patch (of 2): In preparation for next patch remove inline of vm_unmapped_area and move code to mmap.c. There is no logical change. Also remove unmapped_area[_topdown] out of mm.h, there is no code calling to them. Signed-off-by: Jaewon Kim <jaewon31.kim@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michel Lespinasse <walken@google.com> Cc: Borislav Petkov <bp@suse.de> Link: http://lkml.kernel.org/r/20200320055823.27089-2-jaewon31.kim@samsung.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
4064b98270 |
mm: allow VM_FAULT_RETRY for multiple times
The idea comes from a discussion between Linus and Andrea [1].
Before this patch we only allow a page fault to retry once. We achieved
this by clearing the FAULT_FLAG_ALLOW_RETRY flag when doing
handle_mm_fault() the second time. This was majorly used to avoid
unexpected starvation of the system by looping over forever to handle the
page fault on a single page. However that should hardly happen, and after
all for each code path to return a VM_FAULT_RETRY we'll first wait for a
condition (during which time we should possibly yield the cpu) to happen
before VM_FAULT_RETRY is really returned.
This patch removes the restriction by keeping the FAULT_FLAG_ALLOW_RETRY
flag when we receive VM_FAULT_RETRY. It means that the page fault handler
now can retry the page fault for multiple times if necessary without the
need to generate another page fault event. Meanwhile we still keep the
FAULT_FLAG_TRIED flag so page fault handler can still identify whether a
page fault is the first attempt or not.
Then we'll have these combinations of fault flags (only considering
ALLOW_RETRY flag and TRIED flag):
- ALLOW_RETRY and !TRIED: this means the page fault allows to
retry, and this is the first try
- ALLOW_RETRY and TRIED: this means the page fault allows to
retry, and this is not the first try
- !ALLOW_RETRY and !TRIED: this means the page fault does not allow
to retry at all
- !ALLOW_RETRY and TRIED: this is forbidden and should never be used
In existing code we have multiple places that has taken special care of
the first condition above by checking against (fault_flags &
FAULT_FLAG_ALLOW_RETRY). This patch introduces a simple helper to detect
the first retry of a page fault by checking against both (fault_flags &
FAULT_FLAG_ALLOW_RETRY) and !(fault_flag & FAULT_FLAG_TRIED) because now
even the 2nd try will have the ALLOW_RETRY set, then use that helper in
all existing special paths. One example is in __lock_page_or_retry(), now
we'll drop the mmap_sem only in the first attempt of page fault and we'll
keep it in follow up retries, so old locking behavior will be retained.
This will be a nice enhancement for current code [2] at the same time a
supporting material for the future userfaultfd-writeprotect work, since in
that work there will always be an explicit userfault writeprotect retry
for protected pages, and if that cannot resolve the page fault (e.g., when
userfaultfd-writeprotect is used in conjunction with swapped pages) then
we'll possibly need a 3rd retry of the page fault. It might also benefit
other potential users who will have similar requirement like userfault
write-protection.
GUP code is not touched yet and will be covered in follow up patch.
Please read the thread below for more information.
[1] https://lore.kernel.org/lkml/20171102193644.GB22686@redhat.com/
[2] https://lore.kernel.org/lkml/20181230154648.GB9832@redhat.com/
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Suggested-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Brian Geffon <bgeffon@google.com>
Cc: Bobby Powers <bobbypowers@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Martin Cracauer <cracauer@cons.org>
Cc: Marty McFadden <mcfadden8@llnl.gov>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Maya Gokhale <gokhale2@llnl.gov>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Link: http://lkml.kernel.org/r/20200220160246.9790-1-peterx@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
||
|
|
c270a7eedc |
mm: introduce FAULT_FLAG_INTERRUPTIBLE
handle_userfaultfd() is currently the only one place in the kernel page fault procedures that can respond to non-fatal userspace signals. It was trying to detect such an allowance by checking against USER & KILLABLE flags, which was "un-official". In this patch, we introduced a new flag (FAULT_FLAG_INTERRUPTIBLE) to show that the fault handler allows the fault procedure to respond even to non-fatal signals. Meanwhile, add this new flag to the default fault flags so that all the page fault handlers can benefit from the new flag. With that, replacing the userfault check to this one. Since the line is getting even longer, clean up the fault flags a bit too to ease TTY users. Although we've got a new flag and applied it, we shouldn't have any functional change with this patch so far. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Brian Geffon <bgeffon@google.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Bobby Powers <bobbypowers@gmail.com> Cc: Denis Plotnikov <dplotnikov@virtuozzo.com> Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Martin Cracauer <cracauer@cons.org> Cc: Marty McFadden <mcfadden8@llnl.gov> Cc: Matthew Wilcox <willy@infradead.org> Cc: Maya Gokhale <gokhale2@llnl.gov> Cc: Mel Gorman <mgorman@suse.de> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@openvz.org> Link: http://lkml.kernel.org/r/20200220195348.16302-1-peterx@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
dde1607248 |
mm: introduce FAULT_FLAG_DEFAULT
Although there're tons of arch-specific page fault handlers, most of them are still sharing the same initial value of the page fault flags. Say, merely all of the page fault handlers would allow the fault to be retried, and they also allow the fault to respond to SIGKILL. Let's define a default value for the fault flags to replace those initial page fault flags that were copied over. With this, it'll be far easier to introduce new fault flag that can be used by all the architectures instead of touching all the archs. Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Brian Geffon <bgeffon@google.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Bobby Powers <bobbypowers@gmail.com> Cc: Denis Plotnikov <dplotnikov@virtuozzo.com> Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Martin Cracauer <cracauer@cons.org> Cc: Marty McFadden <mcfadden8@llnl.gov> Cc: Matthew Wilcox <willy@infradead.org> Cc: Maya Gokhale <gokhale2@llnl.gov> Cc: Mel Gorman <mgorman@suse.de> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@openvz.org> Link: http://lkml.kernel.org/r/20200220160238.9694-1-peterx@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
8b9a65fd28 |
mm: return faster for non-fatal signals in user mode faults
The idea comes from the upstream discussion between Linus and Andrea: https://lore.kernel.org/lkml/20171102193644.GB22686@redhat.com/ A summary to the issue: there was a special path in handle_userfault() in the past that we'll return a VM_FAULT_NOPAGE when we detected non-fatal signals when waiting for userfault handling. We did that by reacquiring the mmap_sem before returning. However that brings a risk in that the vmas might have changed when we retake the mmap_sem and even we could be holding an invalid vma structure. This patch is a preparation of removing that special path by allowing the page fault to return even faster if we were interrupted by a non-fatal signal during a user-mode page fault handling routine. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Suggested-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Brian Geffon <bgeffon@google.com> Cc: Bobby Powers <bobbypowers@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: Denis Plotnikov <dplotnikov@virtuozzo.com> Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Martin Cracauer <cracauer@cons.org> Cc: Marty McFadden <mcfadden8@llnl.gov> Cc: Matthew Wilcox <willy@infradead.org> Cc: Maya Gokhale <gokhale2@llnl.gov> Cc: Mel Gorman <mgorman@suse.de> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@openvz.org> Link: http://lkml.kernel.org/r/20200220160230.9598-1-peterx@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
4ef873226c |
mm: introduce fault_signal_pending()
For most architectures, we've got a quick path to detect fatal signal after a handle_mm_fault(). Introduce a helper for that quick path. It cleans the current codes a bit so we don't need to duplicate the same check across archs. More importantly, this will be an unified place that we handle the signal immediately right after an interrupted page fault, so it'll be much easier for us if we want to change the behavior of handling signals later on for all the archs. Note that currently only part of the archs are using this new helper, because some archs have their own way to handle signals. In the follow up patches, we'll try to apply this helper to all the rest of archs. Another note is that the "regs" parameter in the new helper is not used yet. It'll be used very soon. Now we kept it in this patch only to avoid touching all the archs again in the follow up patches. [peterx@redhat.com: fix sparse warnings] Link: http://lkml.kernel.org/r/20200311145921.GD479302@xz-x1 Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Brian Geffon <bgeffon@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Bobby Powers <bobbypowers@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: Denis Plotnikov <dplotnikov@virtuozzo.com> Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Martin Cracauer <cracauer@cons.org> Cc: Marty McFadden <mcfadden8@llnl.gov> Cc: Matthew Wilcox <willy@infradead.org> Cc: Maya Gokhale <gokhale2@llnl.gov> Cc: Mel Gorman <mgorman@suse.de> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@openvz.org> Link: http://lkml.kernel.org/r/20200220155353.8676-4-peterx@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
767e5ee54e |
mm: add pagemap.h to the fine documentation
The documentation currently does not include the deathless prose written to describe functions in pagemap.h because it's not included in any rst file. Fix up the mismatches between parameter names and the documentation and add the file to mm-api. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Cc: Jonathan Corbet <corbet@lwn.net> Link: http://lkml.kernel.org/r/20200221220045.24989-1-willy@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
222100eed2 |
mm/vma: make is_vma_temporary_stack() available for general use
Currently the declaration and definition for is_vma_temporary_stack() are scattered. Lets make is_vma_temporary_stack() helper available for general use and also drop the declaration from (include/linux/huge_mm.h) which is no longer required. While at this, rename this as vma_is_temporary_stack() in line with existing helpers. This should not cause any functional change. Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Ingo Molnar <mingo@redhat.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Mackerras <paulus@samba.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1582782965-3274-4-git-send-email-anshuman.khandual@arm.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
7969f2264f |
mm/vma: make vma_is_foreign() available for general use
Idea of a foreign VMA with respect to the present context is very generic. But currently there are two identical definitions for this in powerpc and x86 platforms. Lets consolidate those redundant definitions while making vma_is_foreign() available for general use later. This should not cause any functional change. Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Link: http://lkml.kernel.org/r/1582782965-3274-3-git-send-email-anshuman.khandual@arm.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
b44437723c |
mm/vma: move VM_NO_KHUGEPAGED into generic header
Patch series "mm/vma: some more minor changes", v2. The motivation here is to consolidate VMA flags and helpers in generic memory header and reduce code duplication when ever applicable. If there are other possible similar instances which might be missing here, please do let me me know. I will be happy to incorporate them. This patch (of 3): Move VM_NO_KHUGEPAGED into generic header (include/linux/mm.h). This just makes sure that no VMA flag is scattered in individual function files any longer. While at this, fix an old comment which is no longer valid. This should not cause any functional change. Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Ingo Molnar <mingo@redhat.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Mackerras <paulus@samba.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1582782965-3274-2-git-send-email-anshuman.khandual@arm.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
8a931f8013 |
mm: memcontrol: recursive memory.low protection
Right now, the effective protection of any given cgroup is capped by its
own explicit memory.low setting, regardless of what the parent says. The
reasons for this are mostly historical and ease of implementation: to make
delegation of memory.low safe, effective protection is the min() of all
memory.low up the tree.
Unfortunately, this limitation makes it impossible to protect an entire
subtree from another without forcing the user to make explicit protection
allocations all the way to the leaf cgroups - something that is highly
undesirable in real life scenarios.
Consider memory in a data center host. At the cgroup top level, we have a
distinction between system management software and the actual workload the
system is executing. Both branches are further subdivided into individual
services, job components etc.
We want to protect the workload as a whole from the system management
software, but that doesn't mean we want to protect and prioritize
individual workload wrt each other. Their memory demand can vary over
time, and we'd want the VM to simply cache the hottest data within the
workload subtree. Yet, the current memory.low limitations force us to
allocate a fixed amount of protection to each workload component in order
to get protection from system management software in general. This
results in very inefficient resource distribution.
Another concern with mandating downward allocation is that, as the
complexity of the cgroup tree grows, it gets harder for the lower levels
to be informed about decisions made at the host-level. Consider a
container inside a namespace that in turn creates its own nested tree of
cgroups to run multiple workloads. It'd be extremely difficult to
configure memory.low parameters in those leaf cgroups that on one hand
balance pressure among siblings as the container desires, while also
reflecting the host-level protection from e.g. rpm upgrades, that lie
beyond one or more delegation and namespacing points in the tree.
It's highly unusual from a cgroup interface POV that nested levels have to
be aware of and reflect decisions made at higher levels for them to be
effective.
To enable such use cases and scale configurability for complex trees, this
patch implements a resource inheritance model for memory that is similar
to how the CPU and the IO controller implement work-conserving resource
allocations: a share of a resource allocated to a subree always applies to
the entire subtree recursively, while allowing, but not mandating,
children to further specify distribution rules.
That means that if protection is explicitly allocated among siblings,
those configured shares are being followed during page reclaim just like
they are now. However, if the memory.low set at a higher level is not
fully claimed by the children in that subtree, the "floating" remainder is
applied to each cgroup in the tree in proportion to its size. Since
reclaim pressure is applied in proportion to size as well, each child in
that tree gets the same boost, and the effect is neutral among siblings -
with respect to each other, they behave as if no memory control was
enabled at all, and the VM simply balances the memory demands optimally
within the subtree. But collectively those cgroups enjoy a boost over the
cgroups in neighboring trees.
E.g. a leaf cgroup with a memory.low setting of 0 no longer means that
it's not getting a share of the hierarchically assigned resource, just
that it doesn't claim a fixed amount of it to protect from its siblings.
This allows us to recursively protect one subtree (workload) from another
(system management), while letting subgroups compete freely among each
other - without having to assign fixed shares to each leaf, and without
nested groups having to echo higher-level settings.
The floating protection composes naturally with fixed protection.
Consider the following example tree:
A A: low = 2G
/ \ A1: low = 1G
A1 A2 A2: low = 0G
As outside pressure is applied to this tree, A1 will enjoy a fixed
protection from A2 of 1G, but the remaining, unclaimed 1G from A is split
evenly among A1 and A2, coming out to 1.5G and 0.5G.
There is a slight risk of regressing theoretical setups where the
top-level cgroups don't know about the true budgeting and set bogusly high
"bypass" values that are meaningfully allocated down the tree. Such
setups would rely on unclaimed protection to be discarded, and
distributing it would change the intended behavior. Be safe and hide the
new behavior behind a mount option, 'memory_recursiveprot'.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Roman Gushchin <guro@fb.com>
Acked-by: Chris Down <chris@chrisdown.name>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Michal Koutný <mkoutny@suse.com>
Link: http://lkml.kernel.org/r/20200227195606.46212-4-hannes@cmpxchg.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
||
|
|
4b13f64de2 |
mm: kmem: rename (__)memcg_kmem_(un)charge_memcg() to __memcg_kmem_(un)charge()
Drop the _memcg suffix from (__)memcg_kmem_(un)charge functions. It's shorter and more obvious. These are the most basic functions which are just (un)charging the given cgroup with the given amount of pages. Also fix up the corresponding comments. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Link: http://lkml.kernel.org/r/20200109202659.752357-7-guro@fb.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
92d0510c35 |
mm: kmem: switch to nr_pages in (__)memcg_kmem_charge_memcg()
These functions are charging the given number of kernel pages to the given memory cgroup. The number doesn't have to be a power of two. Let's make them to take the unsigned int nr_pages as an argument instead of the page order. It makes them look consistent with the corresponding uncharge functions and functions like: mem_cgroup_charge_skmem(memcg, nr_pages). Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Link: http://lkml.kernel.org/r/20200109202659.752357-5-guro@fb.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
f4b00eab50 |
mm: kmem: rename memcg_kmem_(un)charge() into memcg_kmem_(un)charge_page()
Rename (__)memcg_kmem_(un)charge() into (__)memcg_kmem_(un)charge_page() to better reflect what they are actually doing: 1) call __memcg_kmem_(un)charge_memcg() to actually charge or uncharge the current memcg 2) set or clear the PageKmemcg flag Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Link: http://lkml.kernel.org/r/20200109202659.752357-4-guro@fb.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
50591183fa |
mm: kmem: cleanup memcg_kmem_uncharge_memcg() arguments
Drop the unused page argument and put the memcg pointer at the first place. This make the function consistent with its peers: __memcg_kmem_uncharge_memcg(), memcg_kmem_charge_memcg(), etc. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Link: http://lkml.kernel.org/r/20200109202659.752357-3-guro@fb.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
10eaec2f63 |
mm: kmem: cleanup (__)memcg_kmem_charge_memcg() arguments
Patch series "mm: memcg: kmem API cleanup", v2. This patchset aims to clean up the kernel memory charging API. It doesn't bring any functional changes, just removes unused arguments, renames some functions and fixes some comments. Currently it's not obvious which functions are most basic (memcg_kmem_(un)charge_memcg()) and which are based on them (memcg_kmem_(un)charge()). The patchset renames these functions and removes unused arguments: TL;DR: was: memcg_kmem_charge_memcg(page, gfp, order, memcg) memcg_kmem_uncharge_memcg(memcg, nr_pages) memcg_kmem_charge(page, gfp, order) memcg_kmem_uncharge(page, order) now: memcg_kmem_charge(memcg, gfp, nr_pages) memcg_kmem_uncharge(memcg, nr_pages) memcg_kmem_charge_page(page, gfp, order) memcg_kmem_uncharge_page(page, order) This patch (of 6): The first argument of memcg_kmem_charge_memcg() and __memcg_kmem_charge_memcg() is the page pointer and it's not used. Let's drop it. Memcg pointer is passed as the last argument. Move it to the first place for consistency with other memcg functions, e.g. __memcg_kmem_uncharge_memcg() or try_charge(). Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Link: http://lkml.kernel.org/r/20200109202659.752357-2-guro@fb.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
1eb6234e52 |
mm: swap: make page_evictable() inline
When backporting commit |
||
|
|
f28d43636d |
mm/gup/writeback: add callbacks for inaccessible pages
With the introduction of protected KVM guests on s390 there is now a concept of inaccessible pages. These pages need to be made accessible before the host can access them. While cpu accesses will trigger a fault that can be resolved, I/O accesses will just fail. We need to add a callback into architecture code for places that will do I/O, namely when writeback is started or when a page reference is taken. This is not only to enable paging, file backing etc, it is also necessary to protect the host against a malicious user space. For example a bad QEMU could simply start direct I/O on such protected memory. We do not want userspace to be able to trigger I/O errors and thus the logic is "whenever somebody accesses that page (gup) or does I/O, make sure that this page can be accessed". When the guest tries to access that page we will wait in the page fault handler for writeback to have finished and for the page_ref to be the expected value. On s390x the function is not supposed to fail, so it is ok to use a WARN_ON on failure. If we ever need some more finegrained handling we can tackle this when we know the details. Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Acked-by: Will Deacon <will@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Link: http://lkml.kernel.org/r/20200306132537.783769-3-imbrenda@linux.ibm.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
1970dc6f52 |
mm/gup: /proc/vmstat: pin_user_pages (FOLL_PIN) reporting
Now that pages are "DMA-pinned" via pin_user_page*(), and unpinned via
unpin_user_pages*(), we need some visibility into whether all of this is
working correctly.
Add two new fields to /proc/vmstat:
nr_foll_pin_acquired
nr_foll_pin_released
These are documented in Documentation/core-api/pin_user_pages.rst. They
represent the number of pages (since boot time) that have been pinned
("nr_foll_pin_acquired") and unpinned ("nr_foll_pin_released"), via
pin_user_pages*() and unpin_user_pages*().
In the absence of long-running DMA or RDMA operations that hold pages
pinned, the above two fields will normally be equal to each other.
Also: update Documentation/core-api/pin_user_pages.rst, to remove an
earlier (now confirmed untrue) claim about a performance problem with
/proc/vmstat.
Also: update Documentation/core-api/pin_user_pages.rst to rename the new
/proc/vmstat entries, to the names listed here.
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Link: http://lkml.kernel.org/r/20200211001536.1027652-9-jhubbard@nvidia.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
||
|
|
47e29d32af |
mm/gup: page->hpage_pinned_refcount: exact pin counts for huge pages
For huge pages (and in fact, any compound page), the GUP_PIN_COUNTING_BIAS scheme tends to overflow too easily, each tail page increments the head page->_refcount by GUP_PIN_COUNTING_BIAS (1024). That limits the number of huge pages that can be pinned. This patch removes that limitation, by using an exact form of pin counting for compound pages of order > 1. The "order > 1" is required because this approach uses the 3rd struct page in the compound page, and order 1 compound pages only have two pages, so that won't work there. A new struct page field, hpage_pinned_refcount, has been added, replacing a padding field in the union (so no new space is used). This enhancement also has a useful side effect: huge pages and compound pages (of order > 1) do not suffer from the "potential false positives" problem that is discussed in the page_dma_pinned() comment block. That is because these compound pages have extra space for tracking things, so they get exact pin counts instead of overloading page->_refcount. Documentation/core-api/pin_user_pages.rst is updated accordingly. Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: John Hubbard <jhubbard@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Link: http://lkml.kernel.org/r/20200211001536.1027652-8-jhubbard@nvidia.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
3faa52c03f |
mm/gup: track FOLL_PIN pages
Add tracking of pages that were pinned via FOLL_PIN. This tracking is
implemented via overloading of page->_refcount: pins are added by adding
GUP_PIN_COUNTING_BIAS (1024) to the refcount. This provides a fuzzy
indication of pinning, and it can have false positives (and that's OK).
Please see the pre-existing Documentation/core-api/pin_user_pages.rst for
details.
As mentioned in pin_user_pages.rst, callers who effectively set FOLL_PIN
(typically via pin_user_pages*()) are required to ultimately free such
pages via unpin_user_page().
Please also note the limitation, discussed in pin_user_pages.rst under the
"TODO: for 1GB and larger huge pages" section. (That limitation will be
removed in a following patch.)
The effect of a FOLL_PIN flag is similar to that of FOLL_GET, and may be
thought of as "FOLL_GET for DIO and/or RDMA use".
Pages that have been pinned via FOLL_PIN are identifiable via a new
function call:
bool page_maybe_dma_pinned(struct page *page);
What to do in response to encountering such a page, is left to later
patchsets. There is discussion about this in [1], [2], [3], and [4].
This also changes a BUG_ON(), to a WARN_ON(), in follow_page_mask().
[1] Some slow progress on get_user_pages() (Apr 2, 2019):
https://lwn.net/Articles/784574/
[2] DMA and get_user_pages() (LPC: Dec 12, 2018):
https://lwn.net/Articles/774411/
[3] The trouble with get_user_pages() (Apr 30, 2018):
https://lwn.net/Articles/753027/
[4] LWN kernel index: get_user_pages():
https://lwn.net/Kernel/Index/#Memory_management-get_user_pages
[jhubbard@nvidia.com: add kerneldoc]
Link: http://lkml.kernel.org/r/20200307021157.235726-1-jhubbard@nvidia.com
[imbrenda@linux.ibm.com: if pin fails, we need to unpin, a simple put_page will not be enough]
Link: http://lkml.kernel.org/r/20200306132537.783769-2-imbrenda@linux.ibm.com
[akpm@linux-foundation.org: fix put_compound_head defined but not used]
Suggested-by: Jan Kara <jack@suse.cz>
Suggested-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Link: http://lkml.kernel.org/r/20200211001536.1027652-7-jhubbard@nvidia.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
||
|
|
566d774a11 |
mm: introduce page_ref_sub_return()
An upcoming patch requires subtracting a large chunk of refcounts from a page, and checking what the resulting refcount is. This is a little different than the usual "check for zero refcount" that many of the page ref functions already do. However, it is similar to a few other routines that (like this one) are generally useful for things such as 1-based refcounting. Add page_ref_sub_return(), that subtracts a chunk of refcounts atomically, and returns an atomic snapshot of the result. Signed-off-by: John Hubbard <jhubbard@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Link: http://lkml.kernel.org/r/20200211001536.1027652-4-jhubbard@nvidia.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
ec84821507 |
include/linux/pagemap.h: rename arguments to find_subpage
This isn't just a random struct page, it's known to be a head page, and calling it head makes the function better self-documenting. The pgoff_t is less confusing if it's named index instead of offset. Also add a couple of comments to explain why we're doing various things. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Link: http://lkml.kernel.org/r/20200318140253.6141-3-willy@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
667c790169 |
revert "topology: add support for node_to_mem_node() to determine the fallback node"
This reverts commit |
||
|
|
630f289b71 |
asm-generic: make more kernel-space headers mandatory
Change a header to mandatory-y if both of the following are met:
[1] At least one architecture (except um) specifies it as generic-y in
arch/*/include/asm/Kbuild
[2] Every architecture (except um) either has its own implementation
(arch/*/include/asm/*.h) or specifies it as generic-y in
arch/*/include/asm/Kbuild
This commit was generated by the following shell script.
----------------------------------->8-----------------------------------
arches=$(cd arch; ls -1 | sed -e '/Kconfig/d' -e '/um/d')
tmpfile=$(mktemp)
grep "^mandatory-y +=" include/asm-generic/Kbuild > $tmpfile
find arch -path 'arch/*/include/asm/Kbuild' |
xargs sed -n 's/^generic-y += \(.*\)/\1/p' | sort -u |
while read header
do
mandatory=yes
for arch in $arches
do
if ! grep -q "generic-y += $header" arch/$arch/include/asm/Kbuild &&
! [ -f arch/$arch/include/asm/$header ]; then
mandatory=no
break
fi
done
if [ "$mandatory" = yes ]; then
echo "mandatory-y += $header" >> $tmpfile
for arch in $arches
do
sed -i "/generic-y += $header/d" arch/$arch/include/asm/Kbuild
done
fi
done
sed -i '/^mandatory-y +=/d' include/asm-generic/Kbuild
LANG=C sort $tmpfile >> include/asm-generic/Kbuild
----------------------------------->8-----------------------------------
One obvious benefit is the diff stat:
25 files changed, 52 insertions(+), 557 deletions(-)
It is tedious to list generic-y for each arch that needs it.
So, mandatory-y works like a fallback default (by just wrapping
asm-generic one) when arch does not have a specific header
implementation.
See the following commits:
|
||
|
|
98c985d7da |
kthread: mark timer used by delayed kthread works as IRQ safe
The timer used by delayed kthread works are IRQ safe because the used kthread_delayed_work_timer_fn() is IRQ safe. It is properly marked when initialized by KTHREAD_DELAYED_WORK_INIT(). But TIMER_IRQSAFE flag is missing when initialized by kthread_init_delayed_work(). The missing flag might trigger invalid warning from del_timer_sync() when kthread_mod_delayed_work() is called with interrupts disabled. This patch is result of a discussion about using the API, see https://lkml.kernel.org/r/cfa886ad-e3b7-c0d2-3ff8-58d94170eab5@ti.com Reported-by: Grygorii Strashko <grygorii.strashko@ti.com> Signed-off-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Grygorii Strashko <grygorii.strashko@ti.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20200217120709.1974-1-pmladek@suse.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
29d9f30d4c |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from David Miller:
"Highlights:
1) Fix the iwlwifi regression, from Johannes Berg.
2) Support BSS coloring and 802.11 encapsulation offloading in
hardware, from John Crispin.
3) Fix some potential Spectre issues in qtnfmac, from Sergey
Matyukevich.
4) Add TTL decrement action to openvswitch, from Matteo Croce.
5) Allow paralleization through flow_action setup by not taking the
RTNL mutex, from Vlad Buslov.
6) A lot of zero-length array to flexible-array conversions, from
Gustavo A. R. Silva.
7) Align XDP statistics names across several drivers for consistency,
from Lorenzo Bianconi.
8) Add various pieces of infrastructure for offloading conntrack, and
make use of it in mlx5 driver, from Paul Blakey.
9) Allow using listening sockets in BPF sockmap, from Jakub Sitnicki.
10) Lots of parallelization improvements during configuration changes
in mlxsw driver, from Ido Schimmel.
11) Add support to devlink for generic packet traps, which report
packets dropped during ACL processing. And use them in mlxsw
driver. From Jiri Pirko.
12) Support bcmgenet on ACPI, from Jeremy Linton.
13) Make BPF compatible with RT, from Thomas Gleixnet, Alexei
Starovoitov, and your's truly.
14) Support XDP meta-data in virtio_net, from Yuya Kusakabe.
15) Fix sysfs permissions when network devices change namespaces, from
Christian Brauner.
16) Add a flags element to ethtool_ops so that drivers can more simply
indicate which coalescing parameters they actually support, and
therefore the generic layer can validate the user's ethtool
request. Use this in all drivers, from Jakub Kicinski.
17) Offload FIFO qdisc in mlxsw, from Petr Machata.
18) Support UDP sockets in sockmap, from Lorenz Bauer.
19) Fix stretch ACK bugs in several TCP congestion control modules,
from Pengcheng Yang.
20) Support virtual functiosn in octeontx2 driver, from Tomasz
Duszynski.
21) Add region operations for devlink and use it in ice driver to dump
NVM contents, from Jacob Keller.
22) Add support for hw offload of MACSEC, from Antoine Tenart.
23) Add support for BPF programs that can be attached to LSM hooks,
from KP Singh.
24) Support for multiple paths, path managers, and counters in MPTCP.
From Peter Krystad, Paolo Abeni, Florian Westphal, Davide Caratti,
and others.
25) More progress on adding the netlink interface to ethtool, from
Michal Kubecek"
* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2121 commits)
net: ipv6: rpl_iptunnel: Fix potential memory leak in rpl_do_srh_inline
cxgb4/chcr: nic-tls stats in ethtool
net: dsa: fix oops while probing Marvell DSA switches
net/bpfilter: remove superfluous testing message
net: macb: Fix handling of fixed-link node
net: dsa: ksz: Select KSZ protocol tag
netdevsim: dev: Fix memory leak in nsim_dev_take_snapshot_write
net: stmmac: add EHL 2.5Gbps PCI info and PCI ID
net: stmmac: add EHL PSE0 & PSE1 1Gbps PCI info and PCI ID
net: stmmac: create dwmac-intel.c to contain all Intel platform
net: dsa: bcm_sf2: Support specifying VLAN tag egress rule
net: dsa: bcm_sf2: Add support for matching VLAN TCI
net: dsa: bcm_sf2: Move writing of CFP_DATA(5) into slicing functions
net: dsa: bcm_sf2: Check earlier for FLOW_EXT and FLOW_MAC_EXT
net: dsa: bcm_sf2: Disable learning for ASP port
net: dsa: b53: Deny enslaving port 7 for 7278 into a bridge
net: dsa: b53: Prevent tagged VLAN on port 7 for 7278
net: dsa: b53: Restore VLAN entries upon (re)configuration
net: dsa: bcm_sf2: Fix overflow checks
hv_netvsc: Remove unnecessary round_up for recv_completion_cnt
...
|
||
|
|
1f944f976d |
Merge tag 'tty-5.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
Pull tty/serial updates from Greg KH: "Here is the big set of TTY / Serial patches for 5.7-rc1 Lots of console fixups and reworking in here, serial core tweaks (doesn't that ever get old, why are we still creating new serial devices?), serial driver updates, line-protocol driver updates, and some vt cleanups and fixes included in here as well. All have been in linux-next with no reported issues" * tag 'tty-5.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (161 commits) serial: 8250: Optimize irq enable after console write serial: 8250: Fix rs485 delay after console write vt: vt_ioctl: fix use-after-free in vt_in_use() vt: vt_ioctl: fix VT_DISALLOCATE freeing in-use virtual console tty: serial: make SERIAL_SPRD depend on COMMON_CLK tty: serial: fsl_lpuart: fix return value checking tty: serial: fsl_lpuart: move dma_request_chan() ARM: dts: tango4: Make /serial compatible with ns16550a ARM: dts: mmp*: Make the serial ports compatible with xscale-uart ARM: dts: mmp*: Fix serial port names ARM: dts: mmp2-brownstone: Don't redeclare phandle references ARM: dts: pxa*: Make the serial ports compatible with xscale-uart ARM: dts: pxa*: Fix serial port names ARM: dts: pxa*: Don't redeclare phandle references serial: omap: drop unused dt-bindings header serial: 8250: 8250_omap: Add DMA support for UARTs on K3 SoCs serial: 8250: 8250_omap: Work around errata causing spurious IRQs with DMA serial: 8250: 8250_omap: Extend driver data to pass FIFO trigger info serial: 8250: 8250_omap: Move locking out from __dma_rx_do_complete() serial: 8250: 8250_omap: Account for data in flight during DMA teardown ... |
||
|
|
dfabb077d6 |
Merge tag 'mmc-v5.7' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc
Pull MMC updates from Ulf Hansson:
"MMC core:
- Add support for host software queue for (e)MMC/SD
- Throttle polling rate for CMD6
- Update CMD13 busy condition check for CMD6 commands
- Improve busy detect polling for erase/trim/discard/HPI
- Fixup support for HW busy detection for HPI commands
- Re-work and improve support for eMMC sanitize commands
MMC host:
- mmci:
* Add support for sdmmc variant revision 2.0
- mmci_sdmmc:
* Improve support for busyend detection
* Fixup support for signal voltage switch
* Add support for tuning with delay block
- mtk-sd:
* Fix another SDIO irq issue
- sdhci:
* Disable native card detect when GPIO based type exist
- sdhci:
* Add option to defer request completion
- sdhci_am654:
* Add support to set a tap value per speed mode
- sdhci-esdhc-imx:
* Add support for i.MX8MM based variant
* Fixup support for standard tuning on i.MX8 usdhc
* Optimize for strobe/clock dll settings
* Fixup support for system and runtime suspend/resume
- sdhci-iproc:
* Update regulator/bus-voltage management for bcm2711
- sdhci-msm:
* Prevent clock gating with PWRSAVE_DLL on broken variants
* Fix management of CQE during SDHCI reset
- sdhci-of-arasan:
* Add support for auto tuning on ZynqMP based platforms
- sdhci-omap:
* Add support for system suspend/resume
- sdhci-sprd:
* Add support for HW busy detection
* Enable support host software queue
- sdhci-tegra:
* Add support for HW busy detection
- tmio/renesas_sdhi:
* Enforce retune after runtime suspend
- renesas_sdhi:
* Use manual tap correction for HS400 on some variants
* Add support for manual correction of tap values for tunings"
* tag 'mmc-v5.7' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc: (86 commits)
mmc: cavium-octeon: remove nonsense variable coercion
mmc: mediatek: fix SDIO irq issue
mmc: mmci_sdmmc: Fix clear busyd0end irq flag
dt-bindings: mmc: Fix node name in an example
mmc: core: Re-work the code for eMMC sanitize
mmc: sdhci: use FIELD_GET for preset value bit masks
mmc: sdhci-of-at91: Display clock changes for debug purpose only
mmc: sdhci: iproc: Add custom set_power() callback for bcm2711
mmc: sdhci: am654: Use sdhci_set_power_and_voltage()
mmc: sdhci: at91: Use sdhci_set_power_and_voltage()
mmc: sdhci: milbeaut: Use sdhci_set_power_and_voltage()
mmc: sdhci: arasan: Use sdhci_set_power_and_voltage()
mmc: sdhci: Introduce sdhci_set_power_and_bus_voltage()
mmc: vub300: Use scnprintf() for avoiding potential buffer overflow
dt-bindings: mmc: synopsys-dw-mshc: fix clock-freq-min-max in example
sdhci: tegra: Enable MMC_CAP_WAIT_WHILE_BUSY host capability
sdhci: tegra: Implement Tegra specific set_timeout callback
mmc: sdhci-omap: Add Support for Suspend/Resume
mmc: renesas_sdhi: simplify execute_tuning
mmc: renesas_sdhi: Use BITS_PER_LONG helper
...
|
||
|
|
5b67fbfc32 |
Merge tag 'kbuild-v5.7' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull Kbuild updates from Masahiro Yamada:
"Build system:
- add CONFIG_UNUSED_KSYMS_WHITELIST, which will be useful to define a
fixed set of export symbols for Generic Kernel Image (GKI)
- allow to run 'make dt_binding_check' without .config
- use full schema for checking DT examples in *.yaml files
- make modpost fail for missing MODULE_IMPORT_NS(), which makes more
sense because we know the produced modules are never loadable
- Remove unused 'AS' variable
Kconfig:
- sanitize DEFCONFIG_LIST, and remove ARCH_DEFCONFIG from Kconfig
files
- relax the 'imply' behavior so that symbols implied by 'y' can
become 'm'
- make 'imply' obey 'depends on' in order to make 'imply' really weak
Misc:
- add documentation on building the kernel with Clang/LLVM
- revive __HAVE_ARCH_STRLEN for 32bit sparc to use optimized strlen()
- fix warning from deb-pkg builds when CONFIG_DEBUG_INFO=n
- various script and Makefile cleanups"
* tag 'kbuild-v5.7' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (34 commits)
Makefile: Update kselftest help information
kbuild: deb-pkg: fix warning when CONFIG_DEBUG_INFO is unset
kbuild: add outputmakefile to no-dot-config-targets
kbuild: remove AS variable
net: wan: wanxl: refactor the firmware rebuild rule
net: wan: wanxl: use $(M68KCC) instead of $(M68KAS) for rebuilding firmware
net: wan: wanxl: use allow to pass CROSS_COMPILE_M68k for rebuilding firmware
kbuild: add comment about grouped target
kbuild: add -Wall to KBUILD_HOSTCXXFLAGS
kconfig: remove unused variable in qconf.cc
sparc: revive __HAVE_ARCH_STRLEN for 32bit sparc
kbuild: refactor Makefile.dtbinst more
kbuild: compute the dtbs_install destination more simply
Makefile: disallow data races on gcc-10 as well
kconfig: make 'imply' obey the direct dependency
kconfig: allow symbols implied by y to become m
net: drop_monitor: use IS_REACHABLE() to guard net_dm_hw_report()
modpost: return error if module is missing ns imports and MODULE_ALLOW_MISSING_NAMESPACE_IMPORTS=n
modpost: rework and consolidate logging interface
kbuild: allow to run dt_binding_check without kernel configuration
...
|
||
|
|
a16298439b |
Merge branch 'next-general' of git://git.kernel.org:/pub/scm/linux/kernel/git/jmorris/linux-security
Pull security subsystem updates from James Morris: "Two minor updates for the core security subsystem: - kernel-doc warning fixes from Randy Dunlap - header cleanup from YueHaibing" * 'next-general' of git://git.kernel.org:/pub/scm/linux/kernel/git/jmorris/linux-security: security: remove duplicated include from security.h security: <linux/lsm_hooks.h>: fix all kernel-doc warnings |
||
|
|
b3aa112d57 |
Merge tag 'selinux-pr-20200330' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux
Pull SELinux updates from Paul Moore:
"We've got twenty SELinux patches for the v5.7 merge window, the
highlights are below:
- Deprecate setting /sys/fs/selinux/checkreqprot to 1.
This flag was originally created to deal with legacy userspace and
the READ_IMPLIES_EXEC personality flag. We changed the default from
1 to 0 back in Linux v4.4 and now we are taking the next step of
deprecating it, at some point in the future we will take the final
step of rejecting 1.
- Allow kernfs symlinks to inherit the SELinux label of the parent
directory. In order to preserve backwards compatibility this is
protected by the genfs_seclabel_symlinks SELinux policy capability.
- Optimize how we store filename transitions in the kernel, resulting
in some significant improvements to policy load times.
- Do a better job calculating our internal hash table sizes which
resulted in additional policy load improvements and likely general
SELinux performance improvements as well.
- Remove the unused initial SIDs (labels) and improve how we handle
initial SIDs.
- Enable per-file labeling for the bpf filesystem.
- Ensure that we properly label NFS v4.2 filesystems to avoid a
temporary unlabeled condition.
- Add some missing XFS quota command types to the SELinux quota
access controls.
- Fix a problem where we were not updating the seq_file position
index correctly in selinuxfs.
- We consolidate some duplicated code into helper functions.
- A number of list to array conversions.
- Update Stephen Smalley's email address in MAINTAINERS"
* tag 'selinux-pr-20200330' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux:
selinux: clean up indentation issue with assignment statement
NFS: Ensure security label is set for root inode
MAINTAINERS: Update my email address
selinux: avtab_init() and cond_policydb_init() return void
selinux: clean up error path in policydb_init()
selinux: remove unused initial SIDs and improve handling
selinux: reduce the use of hard-coded hash sizes
selinux: Add xfs quota command types
selinux: optimize storage of filename transitions
selinux: factor out loop body from filename_trans_read()
security: selinux: allow per-file labeling for bpffs
selinux: generalize evaluate_cond_node()
selinux: convert cond_expr to array
selinux: convert cond_av_list to array
selinux: convert cond_list to array
selinux: sel_avc_get_stat_idx should increase position index
selinux: allow kernfs symlinks to inherit parent directory context
selinux: simplify evaluate_cond_node()
Documentation,selinux: deprecate setting checkreqprot to 1
selinux: move status variables out of selinux_ss
|
||
|
|
15c981d16d |
Merge tag 'for-5.7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"A number of core changes that make things work better in general, code
is simpler and cleaner.
Core changes:
- per-inode file extent tree, for in memory tracking of contiguous
extent ranges to make sure i_size adjustments are accurate
- tree root structures are protected by reference counts, replacing
SRCU that did not cover some cases
- leak detector for tree root structures
- per-transaction pinned extent tracking
- buffer heads are replaced by bios for super block access
- speedup of extent back reference resolution, on an example test
scenario the runtime of send went down from a hour to minutes
- factor out locking scheme used for subvolume writer and NOCOW
exclusion, abstracted as DREW lock, double reader-writer exclusion
(allow either readers or writers)
- cleanup and abstract extent allocation policies, preparation for
zoned device support
- make reflink/clone_range work on inline extents
- add more cancellation point for relocation, improves long response
from 'balance cancel'
- add page migration callback for data pages
- switch to guid for uuids, with additional cleanups of the interface
- make ranged full fsyncs more efficient
- removal of obsolete ioctl flag BTRFS_SUBVOL_CREATE_ASYNC
- remove b-tree readahead from delayed refs paths, avoiding seek and
read unnecessary blocks
Features:
- v2 of ioctl to delete subvolumes, allowing to delete by id and more
future extensions
Fixes:
- fix qgroup rescan worker that could block umount
- fix crash during unmount due to race with delayed inode workers
- fix dellaloc flushing logic that could create unnecessary chunks
under heavy load
- fix missing file extent item for hole after ranged fsync
- several fixes in relocation error handling
Other:
- more documentation of relocation, device replace, space
reservations
- many random cleanups"
* tag 'for-5.7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (210 commits)
btrfs: fix missing semaphore unlock in btrfs_sync_file
btrfs: use nofs allocations for running delayed items
btrfs: sysfs: Use scnprintf() instead of snprintf()
btrfs: do not resolve backrefs for roots that are being deleted
btrfs: track reloc roots based on their commit root bytenr
btrfs: restart relocate_tree_blocks properly
btrfs: reloc: reorder reservation before root selection
btrfs: do not readahead in build_backref_tree
btrfs: do not use readahead for running delayed refs
btrfs: Remove async_transid from btrfs_mksubvol/create_subvol/create_snapshot
btrfs: Remove transid argument from btrfs_ioctl_snap_create_transid
btrfs: Remove BTRFS_SUBVOL_CREATE_ASYNC support
btrfs: kill the subvol_srcu
btrfs: make btrfs_cleanup_fs_roots use the radix tree lock
btrfs: don't take an extra root ref at allocation time
btrfs: hold a ref on the root on the dead roots list
btrfs: make inodes hold a ref on their roots
btrfs: move the root freeing stuff into btrfs_put_root
btrfs: move ino_cache_inode dropping out of btrfs_free_fs_root
btrfs: make the extent buffer leak check per fs info
...
|
||
|
|
1455c69900 |
Merge tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt
Pull fscrypt updates from Eric Biggers: "Add an ioctl FS_IOC_GET_ENCRYPTION_NONCE which retrieves a file's encryption nonce. This makes it easier to write automated tests which verify that fscrypt is doing the encryption correctly" * tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt: ubifs: wire up FS_IOC_GET_ENCRYPTION_NONCE f2fs: wire up FS_IOC_GET_ENCRYPTION_NONCE ext4: wire up FS_IOC_GET_ENCRYPTION_NONCE fscrypt: add FS_IOC_GET_ENCRYPTION_NONCE ioctl |
||
|
|
fdf5563a72 |
Merge branch 'x86-cleanups-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 cleanups from Ingo Molnar:
"This topic tree contains more commits than usual:
- most of it are uaccess cleanups/reorganization by Al
- there's a bunch of prototype declaration (--Wmissing-prototypes)
cleanups
- misc other cleanups all around the map"
* 'x86-cleanups-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (36 commits)
x86/mm/set_memory: Fix -Wmissing-prototypes warnings
x86/efi: Add a prototype for efi_arch_mem_reserve()
x86/mm: Mark setup_emu2phys_nid() static
x86/jump_label: Move 'inline' keyword placement
x86/platform/uv: Add a missing prototype for uv_bau_message_interrupt()
kill uaccess_try()
x86: unsafe_put-style macro for sigmask
x86: x32_setup_rt_frame(): consolidate uaccess areas
x86: __setup_rt_frame(): consolidate uaccess areas
x86: __setup_frame(): consolidate uaccess areas
x86: setup_sigcontext(): list user_access_{begin,end}() into callers
x86: get rid of put_user_try in __setup_rt_frame() (both 32bit and 64bit)
x86: ia32_setup_rt_frame(): consolidate uaccess areas
x86: ia32_setup_frame(): consolidate uaccess areas
x86: ia32_setup_sigcontext(): lift user_access_{begin,end}() into the callers
x86/alternatives: Mark text_poke_loc_init() static
x86/cpu: Fix a -Wmissing-prototypes warning for init_ia32_feat_ctl()
x86/mm: Drop pud_mknotpresent()
x86: Replace setup_irq() by request_irq()
x86/configs: Slightly reduce defconfigs
...
|
||
|
|
97cddfc345 |
Merge branch 'x86-build-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 build updates from Ingo Molnar: "A handful of updates: two linker script cleanups and a stock defconfig+allmodconfig bootability fix" * 'x86-build-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/vdso: Discard .note.gnu.property sections in vDSO x86, vmlinux.lds: Add RUNTIME_DISCARD_EXIT to generic DISCARDS x86/Kconfig: Make CMDLINE_OVERRIDE depend on non-empty CMDLINE |
||
|
|
3cd86a58f7 |
Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 updates from Catalin Marinas:
"The bulk is in-kernel pointer authentication, activity monitors and
lots of asm symbol annotations. I also queued the sys_mremap() patch
commenting the asymmetry in the address untagging.
Summary:
- In-kernel Pointer Authentication support (previously only offered
to user space).
- ARM Activity Monitors (AMU) extension support allowing better CPU
utilisation numbers for the scheduler (frequency invariance).
- Memory hot-remove support for arm64.
- Lots of asm annotations (SYM_*) in preparation for the in-kernel
Branch Target Identification (BTI) support.
- arm64 perf updates: ARMv8.5-PMU 64-bit counters, refactoring the
PMU init callbacks, support for new DT compatibles.
- IPv6 header checksum optimisation.
- Fixes: SDEI (software delegated exception interface) double-lock on
hibernate with shared events.
- Minor clean-ups and refactoring: cpu_ops accessor,
cpu_do_switch_mm() converted to C, cpufeature finalisation helper.
- sys_mremap() comment explaining the asymmetric address untagging
behaviour"
* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (81 commits)
mm/mremap: Add comment explaining the untagging behaviour of mremap()
arm64: head: Convert install_el2_stub to SYM_INNER_LABEL
arm64: Introduce get_cpu_ops() helper function
arm64: Rename cpu_read_ops() to init_cpu_ops()
arm64: Declare ACPI parking protocol CPU operation if needed
arm64: move kimage_vaddr to .rodata
arm64: use mov_q instead of literal ldr
arm64: Kconfig: verify binutils support for ARM64_PTR_AUTH
lkdtm: arm64: test kernel pointer authentication
arm64: compile the kernel with ptrauth return address signing
kconfig: Add support for 'as-option'
arm64: suspend: restore the kernel ptrauth keys
arm64: __show_regs: strip PAC from lr in printk
arm64: unwind: strip PAC from kernel addresses
arm64: mask PAC bits of __builtin_return_address
arm64: initialize ptrauth keys for kernel booting task
arm64: initialize and switch ptrauth kernel keys
arm64: enable ptrauth earlier
arm64: cpufeature: handle conflicts based on capability
arm64: cpufeature: Move cpu capability helpers inside C file
...
|
||
|
|
cad18da0af |
Merge tag 'please-pull-ia64_for_5.7' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux
Pull ia64 updates from Tony Luck: "Couple of cleanup patches" * tag 'please-pull-ia64_for_5.7' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux: tty/serial: cleanup after ioc*_serial driver removal ia64: replace setup_irq() by request_irq() |
||
|
|
58233ccf94 |
Merge tag 'm68k-for-v5.7-tag1' of git://git.kernel.org/pub/scm/linux/kernel/git/geert/linux-m68k
Pull m68k updates from Geert Uytterhoeven: - pagetable layout rewrite, to facilitate global READ_ONCE() rework - Zorro (Amiga) and DIO (HP 9000/300) bus cleanups - defconfig updates - minor cleanups and fixes * tag 'm68k-for-v5.7-tag1' of git://git.kernel.org/pub/scm/linux/kernel/git/geert/linux-m68k: (23 commits) m68k: defconfig: Update defconfigs for v5.6-rc4 zorro: Replace zero-length array with flexible-array member m68k: Switch to asm-generic/hardirq.h fbdev: c2p: Use BUILD_BUG() instead of custom solution dio: Remove unused dio_dev_driver() dio: Fix dio_bus_match() kerneldoc dio: Make dio_match_device() static zorro: Move zorro_bus_type to bus-private header file zorro: Remove unused zorro_dev_driver() zorro: Use zorro_match_device() helper in zorro_bus_match() zorro: Fix zorro_bus_match() kerneldoc zorro: Make zorro_match_device() static m68k: Fix Kconfig indentation m68k: mm: Change ColdFire pgtable_t m68k: mm: Fully initialize the page-table allocator m68k: mm: Extend table allocator for multiple sizes m68k: mm: Use table allocator for pgtables m68k: mm: Improve kernel_page_table() m68k: mm: Restructure Motorola MMU page-table layout m68k: mm: Move the pointer table allocator to motorola.c ... |
||
|
|
ed52f2c608 |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Signed-off-by: David S. Miller <davem@davemloft.net> |
||
|
|
d9679cd985 |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says: ==================== Netfilter/IPVS updates for net-next The following patchset contains Netfilter/IPVS updates for net-next: 1) Add support to specify a stateful expression in set definitions, this allows users to specify e.g. counters per set elements. 2) Flowtable software counter support. 3) Flowtable hardware offload counter support, from wenxu. 3) Parallelize flowtable hardware offload requests, from Paul Blakey. This includes a patch to add one work entry per offload command. 4) Several patches to rework nf_queue refcount handling, from Florian Westphal. 4) A few fixes for the flowtable tunnel offload: Fix crash if tunneling information is missing and set up indirect flow block as TC_SETUP_FT, patch from wenxu. 5) Stricter netlink attribute sanity check on filters, from Romain Bellan and Florent Fourcot. 5) Annotations to make sparse happy, from Jules Irenge. 6) Improve icmp errors in debugging information, from Haishuang Yan. 7) Fix warning in IPVS icmp error debugging, from Haishuang Yan. 8) Fix endianess issue in tcp extension header, from Sergey Marinkevich. ==================== Signed-off-by: David S. Miller <davem@davemloft.net> |
||
|
|
d5f744f9a2 |
Merge tag 'x86-entry-2020-03-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 entry code updates from Thomas Gleixner:
- Convert the 32bit syscalls to be pt_regs based which removes the
requirement to push all 6 potential arguments onto the stack and
consolidates the interface with the 64bit variant
- The first small portion of the exception and syscall related entry
code consolidation which aims to address the recently discovered
issues vs. RCU, int3, NMI and some other exceptions which can
interrupt any context. The bulk of the changes is still work in
progress and aimed for 5.8.
- A few lockdep namespace cleanups which have been applied into this
branch to keep the prerequisites for the ongoing work confined.
* tag 'x86-entry-2020-03-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (35 commits)
x86/entry: Fix build error x86 with !CONFIG_POSIX_TIMERS
lockdep: Rename trace_{hard,soft}{irq_context,irqs_enabled}()
lockdep: Rename trace_softirqs_{on,off}()
lockdep: Rename trace_hardirq_{enter,exit}()
x86/entry: Rename ___preempt_schedule
x86: Remove unneeded includes
x86/entry: Drop asmlinkage from syscalls
x86/entry/32: Enable pt_regs based syscalls
x86/entry/32: Use IA32-specific wrappers for syscalls taking 64-bit arguments
x86/entry/32: Rename 32-bit specific syscalls
x86/entry/32: Clean up syscall_32.tbl
x86/entry: Remove ABI prefixes from functions in syscall tables
x86/entry/64: Add __SYSCALL_COMMON()
x86/entry: Remove syscall qualifier support
x86/entry/64: Remove ptregs qualifier from syscall table
x86/entry: Move max syscall number calculation to syscallhdr.sh
x86/entry/64: Split X32 syscall table into its own file
x86/entry/64: Move sys_ni_syscall stub to common.c
x86/entry/64: Use syscall wrappers for x32_rt_sigreturn
x86/entry: Refactor SYS_NI macros
...
|
||
|
|
dbb381b619 |
Merge tag 'timers-core-2020-03-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timekeeping and timer updates from Thomas Gleixner:
"Core:
- Consolidation of the vDSO build infrastructure to address the
difficulties of cross-builds for ARM64 compat vDSO libraries by
restricting the exposure of header content to the vDSO build.
This is achieved by splitting out header content into separate
headers. which contain only the minimaly required information which
is necessary to build the vDSO. These new headers are included from
the kernel headers and the vDSO specific files.
- Enhancements to the generic vDSO library allowing more fine grained
control over the compiled in code, further reducing architecture
specific storage and preparing for adopting the generic library by
PPC.
- Cleanup and consolidation of the exit related code in posix CPU
timers.
- Small cleanups and enhancements here and there
Drivers:
- The obligatory new drivers: Ingenic JZ47xx and X1000 TCU support
- Correct the clock rate of PIT64b global clock
- setup_irq() cleanup
- Preparation for PWM and suspend support for the TI DM timer
- Expand the fttmr010 driver to support ast2600 systems
- The usual small fixes, enhancements and cleanups all over the
place"
* tag 'timers-core-2020-03-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (80 commits)
Revert "clocksource/drivers/timer-probe: Avoid creating dead devices"
vdso: Fix clocksource.h macro detection
um: Fix header inclusion
arm64: vdso32: Enable Clang Compilation
lib/vdso: Enable common headers
arm: vdso: Enable arm to use common headers
x86/vdso: Enable x86 to use common headers
mips: vdso: Enable mips to use common headers
arm64: vdso32: Include common headers in the vdso library
arm64: vdso: Include common headers in the vdso library
arm64: Introduce asm/vdso/processor.h
arm64: vdso32: Code clean up
linux/elfnote.h: Replace elf.h with UAPI equivalent
scripts: Fix the inclusion order in modpost
common: Introduce processor.h
linux/ktime.h: Extract common header for vDSO
linux/jiffies.h: Extract common header for vDSO
linux/time64.h: Extract common header for vDSO
linux/time32.h: Extract common header for vDSO
linux/time.h: Extract common header for vDSO
...
|