Commit Graph

1216076 Commits

Author SHA1 Message Date
Joel Fernandes (Google)
8ed873d8e5 selftests: mm: add a test for mutually aligned moves > PMD size
This patch adds a test case to check if a PMD-alignment optimization
successfully happens.

I add support to make sure there is some room before the source mapping,
otherwise the optimization to trigger PMD-aligned move will be disabled as
the kernel will detect that a mapping before the source exists and such
optimization becomes impossible.

Link: https://lkml.kernel.org/r/20230903151328.2981432-5-joel@joelfernandes.org
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-04 10:32:20 -07:00
Joel Fernandes (Google)
99eb26d59c selftests: mm: fix failure case when new remap region was not found
When a valid remap region could not be found, the source mapping is not
cleaned up.  Fix the goto statement such that the clean up happens.

Link: https://lkml.kernel.org/r/20230903151328.2981432-4-joel@joelfernandes.org
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-04 10:32:20 -07:00
Joel Fernandes (Google)
b1e5a3dee2 mm/mremap: allow moves within the same VMA for stack moves
For the stack move happening in shift_arg_pages(), the move is happening
within the same VMA which spans the old and new ranges.

In case the aligned address happens to fall within that VMA, allow such
moves and don't abort the mremap alignment optimization.

In the regular non-stack mremap case, we cannot allow any such moves as
will end up destroying some part of the mapping (either the source of the
move, or part of the existing mapping).  So just avoid it for stack moves.

Link: https://lkml.kernel.org/r/20230903151328.2981432-3-joel@joelfernandes.org
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-04 10:32:20 -07:00
Joel Fernandes (Google)
af8ca1c149 mm/mremap: optimize the start addresses in move_page_tables()
Patch series "Optimize mremap during mutual alignment within PMD", v6.

This patchset optimizes the start addresses in move_page_tables() and
tests the changes.  It addresses a warning [1] that occurs due to a
downward, overlapping move on a mutually-aligned offset within a PMD
during exec.  By initiating the copy process at the PMD level when such
alignment is present, we can prevent this warning and speed up the copying
process at the same time.  Linus Torvalds suggested this idea.  Check the
individual patches for more details.  [1]
https://lore.kernel.org/all/ZB2GTBD%2FLWTrkOiO@dhcp22.suse.cz/


This patch (of 7):

Recently, we see reports [1] of a warning that triggers due to
move_page_tables() doing a downward and overlapping move on a
mutually-aligned offset within a PMD.  By mutual alignment, I mean the
source and destination addresses of the mremap are at the same offset
within a PMD.

This mutual alignment along with the fact that the move is downward is
sufficient to cause a warning related to having an allocated PMD that does
not have PTEs in it.

This warning will only trigger when there is mutual alignment in the move
operation.  A solution, as suggested by Linus Torvalds [2], is to initiate
the copy process at the PMD level whenever such alignment is present. 
Implementing this approach will not only prevent the warning from being
triggered, but it will also optimize the operation as this method should
enhance the speed of the copy process whenever there's a possibility to
start copying at the PMD level.

Some more points:
a. The optimization can be done only when both the source and
destination of the mremap do not have anything mapped below it up to a
PMD boundary. I add support to detect that.

b. #1 is not a problem for the call to move_page_tables() from exec.c as
nothing is expected to be mapped below the source. However, for
non-overlapping mutually aligned moves as triggered by mremap(2), I
added support for checking such cases.

c. I currently only optimize for PMD moves, in the future I/we can build
on this work and do PUD moves as well if there is a need for this. But I
want to take it one step at a time.

d. We need to be careful about mremap of ranges within the VMA itself.
For this purpose, I added checks to determine if the address after
alignment falls within its VMA itself.

[1] https://lore.kernel.org/all/ZB2GTBD%2FLWTrkOiO@dhcp22.suse.cz/
[2] https://lore.kernel.org/all/CAHk-=whd7msp8reJPfeGNyt0LiySMT0egExx3TVZSX3Ok6X=9g@mail.gmail.com/

Link: https://lkml.kernel.org/r/20230903151328.2981432-1-joel@joelfernandes.org
Link: https://lkml.kernel.org/r/20230903151328.2981432-2-joel@joelfernandes.org
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-04 10:32:20 -07:00
Yuan Can
2eaa6c2abb mm: hugetlb_vmemmap: fix hugetlb page number decrease failed on movable nodes
The decreasing of hugetlb pages number failed with the following message
given:

 sh: page allocation failure: order:0, mode:0x204cc0(GFP_KERNEL|__GFP_RETRY_MAYFAIL|__GFP_THISNODE)
 CPU: 1 PID: 112 Comm: sh Not tainted 6.5.0-rc7-... #45
 Hardware name: linux,dummy-virt (DT)
 Call trace:
  dump_backtrace.part.6+0x84/0xe4
  show_stack+0x18/0x24
  dump_stack_lvl+0x48/0x60
  dump_stack+0x18/0x24
  warn_alloc+0x100/0x1bc
  __alloc_pages_slowpath.constprop.107+0xa40/0xad8
  __alloc_pages+0x244/0x2d0
  hugetlb_vmemmap_restore+0x104/0x1e4
  __update_and_free_hugetlb_folio+0x44/0x1f4
  update_and_free_hugetlb_folio+0x20/0x68
  update_and_free_pages_bulk+0x4c/0xac
  set_max_huge_pages+0x198/0x334
  nr_hugepages_store_common+0x118/0x178
  nr_hugepages_store+0x18/0x24
  kobj_attr_store+0x18/0x2c
  sysfs_kf_write+0x40/0x54
  kernfs_fop_write_iter+0x164/0x1dc
  vfs_write+0x3a8/0x460
  ksys_write+0x6c/0x100
  __arm64_sys_write+0x1c/0x28
  invoke_syscall+0x44/0x100
  el0_svc_common.constprop.1+0x6c/0xe4
  do_el0_svc+0x38/0x94
  el0_svc+0x28/0x74
  el0t_64_sync_handler+0xa0/0xc4
  el0t_64_sync+0x174/0x178
 Mem-Info:
  ...

The reason is that the hugetlb pages being released are allocated from
movable nodes, and with hugetlb_optimize_vmemmap enabled, vmemmap pages
need to be allocated from the same node during the hugetlb pages
releasing. With GFP_KERNEL and __GFP_THISNODE set, allocating from movable
node is always failed. Fix this problem by removing __GFP_THISNODE.

Link: https://lkml.kernel.org/r/20230905124503.24899-1-yuancan@huawei.com
Fixes: ad2fa3717b ("mm: hugetlb: alloc the vmemmap pages associated with each HugeTLB page")
Signed-off-by: Yuan Can <yuancan@huawei.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-04 10:32:20 -07:00
Uros Bizjak
77cd814835 mm/vmstat: use this_cpu_try_cmpxchg in mod_{zone,node}_state
Use this_cpu_try_cmpxchg instead of this_cpu_cmpxchg (*ptr, old, new) ==
old in mod_zone_state and mod_node_state.  x86 CMPXCHG instruction returns
success in ZF flag, so this change saves a compare after cmpxchg (and
related move instruction in front of cmpxchg).

Also, try_cmpxchg implicitly assigns old *ptr value to "old" when cmpxchg
fails.  There is no need to re-read the value in the loop.

No functional change intended.

Link: https://lkml.kernel.org/r/20230904150917.8318-1-ubizjak@gmail.com
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-04 10:32:20 -07:00
Matthew Wilcox (Oracle)
91e79d22be mm: convert DAX lock/unlock page to lock/unlock folio
The one caller of DAX lock/unlock page already calls compound_head(), so
use page_folio() instead, then use a folio throughout the DAX code to
remove uses of page->mapping and page->index.

[jane.chu@oracle.com: add comment to mf_generic_kill_procss(), simplify mf_generic_kill_procs:folio initialization]
  Link: https://lkml.kernel.org/r/20230908222336.186313-1-jane.chu@oracle.com
Link: https://lkml.kernel.org/r/20230822231314.349200-1-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Jane Chu <jane.chu@oracle.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Jane Chu <jane.chu@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-04 10:32:20 -07:00
Mateusz Guzik
bc0c335760 mm: remove remnants of SPLIT_RSS_COUNTING
The feature got retired in f1a7941243 ("mm: convert mm's rss stats into
percpu_counter"), but the patch failed to fully clean it up.

Link: https://lkml.kernel.org/r/20230823170556.2281747-1-mjguzik@gmail.com
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-04 10:32:20 -07:00
Vern Hao
97144ce008 mm/vmscan: use folio_migratetype() instead of get_pageblock_migratetype()
In skip_cma(), we can use folio_migratetype() to replace
get_pageblock_migratetype().

Link: https://lkml.kernel.org/r/20230825075735.52436-1-user@VERNHAO-MC1
Signed-off-by: Vern Hao <vernhao@tencent.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-04 10:32:19 -07:00
Lorenzo Stoakes
80e4a765a7 mm: refactor si_mem_available()
si_mem_available() needlessly places LRU statistics into an array before
retrieving only two of them, simply access those directly.

In addition, refactor the code so that the blocks of code which calculate
the page cache and reclaimable components each resemble one another to
clearly indicate we cap both against wmark_low in the same fashion.

Link: https://lkml.kernel.org/r/20230827110848.43510-1-lstoakes@gmail.com
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-04 10:32:19 -07:00
Xueshi Hu
b72b3c9c34 mm/hugetlb: fix nodes huge page allocation when there are surplus pages
In set_nr_huge_pages(), local variable "count" is used to record
persistent_huge_pages(), but when it cames to nodes huge page allocation,
the semantics changes to nr_huge_pages.  When there exists surplus huge
pages and using the interface under
/sys/devices/system/node/node*/hugepages to change huge page pool size,
this difference can result in the allocation of an unexpected number of
huge pages.

Steps to reproduce the bug:

Starting with:

				  Node 0          Node 1    Total
	HugePages_Total             0.00            0.00     0.00
	HugePages_Free              0.00            0.00     0.00
	HugePages_Surp              0.00            0.00     0.00

create 100 huge pages in Node 0 and consume it, then set Node 0 's
nr_hugepages to 0.

yields:

				  Node 0          Node 1    Total
	HugePages_Total           200.00            0.00   200.00
	HugePages_Free              0.00            0.00     0.00
	HugePages_Surp            200.00            0.00   200.00

write 100 to Node 1's nr_hugepages

		echo 100 > /sys/devices/system/node/node1/\
	hugepages/hugepages-2048kB/nr_hugepages

gets:

				  Node 0          Node 1    Total
	HugePages_Total           200.00          400.00   600.00
	HugePages_Free              0.00          400.00   400.00
	HugePages_Surp            200.00            0.00   200.00

Kernel is expected to create only 100 huge pages and it gives 200.

Link: https://lkml.kernel.org/r/20230829033343.467779-1-xueshi.hu@smartx.com
Fixes: 9a30523066 ("hugetlb: add per node hstate attributes")
Signed-off-by: Xueshi Hu <xueshi.hu@smartx.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Muchun Song <muchun.song@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-04 10:32:19 -07:00
Mike Kravetz
d8f5f7e445 hugetlb: set hugetlb page flag before optimizing vmemmap
Currently, vmemmap optimization of hugetlb pages is performed before the
hugetlb flag (previously hugetlb destructor) is set identifying it as a
hugetlb folio.  This means there is a window of time where an ordinary
folio does not have all associated vmemmap present.  The core mm only
expects vmemmap to be potentially optimized for hugetlb and device dax. 
This can cause problems in code such as memory error handling that may
want to write to tail struct pages.

There is only one call to perform hugetlb vmemmap optimization today.  To
fix this issue, simply set the hugetlb flag before that call.

There was a similar issue in the free hugetlb path that was previously
addressed.  The two routines that optimize or restore hugetlb vmemmap
should only be passed hugetlb folios/pages.  To catch any callers not
following this rule, add VM_WARN_ON calls to the routines.  In the hugetlb
free code paths, some calls could be made to restore vmemmap after
clearing the hugetlb flag.  This was 'safe' as in these cases vmemmap was
already present and the call was a NOOP.  However, for consistency these
calls where eliminated so that we can add the VM_WARN_ON checks.

Link: https://lkml.kernel.org/r/20230829213734.69673-1-mike.kravetz@oracle.com
Fixes: f41f2ed43c ("mm: hugetlb: free the vmemmap pages associated with each HugeTLB page")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Cc: James Houghton <jthoughton@google.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
Cc: Usama Arif <usama.arif@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-04 10:32:19 -07:00
Anthony Yznaga
dd34d9fe3b mm: fix unaccount of memory on vma_link() failure
Fix insert_vm_struct() so that only accounted memory is unaccounted if
vma_link() fails.

Link: https://lkml.kernel.org/r/20230830004324.16101-1-anthony.yznaga@oracle.com
Fixes: d4af56c5c7 ("mm: start tracking VMAs with maple tree")
Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-04 10:32:19 -07:00
Anthony Yznaga
954652b9f3 mm/mremap: fix unaccount of memory on vma_merge() failure
Fix mremap so that only accounted memory is unaccounted if the mapping is
expandable but vma_merge() fails.

Link: https://lkml.kernel.org/r/20230830004549.16131-1-anthony.yznaga@oracle.com
Fixes: fdbef61491 ("mm/mremap: don't account pages in vma_to_resize()")
Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-04 10:32:19 -07:00
Ding Xiang
b6afcb94ce selftests/mm: gup_longterm: fix a resource leak
The opened file should be closed in run_with_tmpfile(), otherwise resource
leak will occur

Link: https://lkml.kernel.org/r/20230831093144.7520-1-dingxiang@cmss.chinamobile.com
Signed-off-by: Ding Xiang <dingxiang@cmss.chinamobile.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-04 10:32:19 -07:00
Kemeng Shi
e19a3f595a mm/compaction: factor out code to test if we should run compaction for target order
We always do zone_watermark_ok check and compaction_suitable check
together to test if compaction for target order should be ran.  Factor
these code out to remove repeat code.

Link: https://lkml.kernel.org/r/20230901155141.249860-7-shikemeng@huaweicloud.com
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-04 10:32:19 -07:00
Kemeng Shi
9cc17ede51 mm/compaction: improve comment of is_via_compact_memory
We do proactive compaction with order == -1 via
1. /proc/sys/vm/compact_memory
2. /sys/devices/system/node/nodex/compact
3. /proc/sys/vm/compaction_proactiveness
Add missed situation in which order == -1.

Link: https://lkml.kernel.org/r/20230901155141.249860-6-shikemeng@huaweicloud.com
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: David Hildenbrand <david@redhat.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-04 10:32:19 -07:00
Kemeng Shi
8df4e28c64 mm/compaction: remove repeat compact_blockskip_flush check in reset_isolation_suitable
We have compact_blockskip_flush check in __reset_isolation_suitable, just
remove repeat check before __reset_isolation_suitable in
compact_blockskip_flush.

Link: https://lkml.kernel.org/r/20230901155141.249860-5-shikemeng@huaweicloud.com
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: David Hildenbrand <david@redhat.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-04 10:32:19 -07:00
Kemeng Shi
3da0272a4c mm/compaction: correctly return failure with bogus compound_order in strict mode
In strict mode, we should return 0 if there is any hole in pageblock.  If
we successfully isolated pages at beginning at pageblock and then have a
bogus compound_order outside pageblock in next page.  We will abort search
loop with blockpfn > end_pfn.  Although we will limit blockpfn to end_pfn,
we will treat it as a successful isolation in strict mode as blockpfn is
not < end_pfn and return partial isolated pages.  Then
isolate_freepages_range may success unexpectly with hole in isolated
range.

Link: https://lkml.kernel.org/r/20230901155141.249860-4-shikemeng@huaweicloud.com
Fixes: 9fcd6d2e05 ("mm, compaction: skip compound pages by order in free scanner")
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: David Hildenbrand <david@redhat.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-04 10:32:19 -07:00
Kemeng Shi
4c17989116 mm/compaction: call list_is_{first}/{last} more intuitively in move_freelist_{head}/{tail}
We use move_freelist_head after list_for_each_entry_reverse to skip recent
pages.  And there is no need to do actual move if all freepages are
searched in list_for_each_entry_reverse, e.g.  freepage point to first
page in freelist.  It's more intuitively to call list_is_first with list
entry as the first argument and list head as the second argument to check
if list entry is the first list entry instead of call list_is_last with
list entry and list head passed in reverse.

Similarly, call list_is_last in move_freelist_tail is more intuitively.

Link: https://lkml.kernel.org/r/20230901155141.249860-3-shikemeng@huaweicloud.com
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: David Hildenbrand <david@redhat.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-04 10:32:19 -07:00
Kemeng Shi
bbefa0fc04 mm/compaction: use correct list in move_freelist_{head}/{tail}
Patch series "Fixes and cleanups to compaction", v3.

This is a series to do fix and clean up to compaction.
Patch 1-2 fix and clean up freepage list operation.
Patch 3-4 fix and clean up isolation of freepages
Patch 7 factor code to check if compaction is needed for allocation order.

More details can be found in respective patches. 


This patch (of 6):

The freepage is chained with buddy_list in freelist head. Use buddy_list
instead of lru to correct the list operation.

Link: https://lkml.kernel.org/r/20230901155141.249860-1-shikemeng@huaweicloud.com
Link: https://lkml.kernel.org/r/20230901155141.249860-2-shikemeng@huaweicloud.com
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: David Hildenbrand <david@redhat.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-04 10:32:19 -07:00
Linus Torvalds
8a749fd1a8 Linux 6.6-rc4 v6.6-rc4 2023-10-01 14:15:13 -07:00
Linus Torvalds
e81a2dabc3 Merge tag 'kbuild-fixes-v6.6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull Kbuild fixes from Masahiro Yamada:

 - Fix the module compression with xz so the in-kernel decompressor
   works

 - Document a kconfig idiom to express an optional dependency between
   modules

 - Make modpost, when W=1 is given, detect broken drivers that reference
   .exit.* sections

 - Remove unused code

* tag 'kbuild-fixes-v6.6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
  kbuild: remove stale code for 'source' symlink in packaging scripts
  modpost: Don't let "driver"s reference .exit.*
  vmlinux.lds.h: remove unused CPU_KEEP and CPU_DISCARD macros
  modpost: add missing else to the "of" check
  Documentation: kbuild: explain handling optional dependencies
  kbuild: Use CRC32 and a 1MiB dictionary for XZ compressed modules
2023-10-01 13:48:46 -07:00
Linus Torvalds
d2c5231581 Merge tag 'mm-hotfixes-stable-2023-10-01-08-34' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
 "Fourteen hotfixes, eleven of which are cc:stable. The remainder
  pertain to issues which were introduced after 6.5"

* tag 'mm-hotfixes-stable-2023-10-01-08-34' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  Crash: add lock to serialize crash hotplug handling
  selftests/mm: fix awk usage in charge_reserved_hugetlb.sh and hugetlb_reparenting_test.sh that may cause error
  mm: mempolicy: keep VMA walk if both MPOL_MF_STRICT and MPOL_MF_MOVE are specified
  mm/damon/vaddr-test: fix memory leak in damon_do_test_apply_three_regions()
  mm, memcg: reconsider kmem.limit_in_bytes deprecation
  mm: zswap: fix potential memory corruption on duplicate store
  arm64: hugetlb: fix set_huge_pte_at() to work with all swap entries
  mm: hugetlb: add huge page size param to set_huge_pte_at()
  maple_tree: add MAS_UNDERFLOW and MAS_OVERFLOW states
  maple_tree: add mas_is_active() to detect in-tree walks
  nilfs2: fix potential use after free in nilfs_gccache_submit_read_data()
  mm: abstract moving to the next PFN
  mm: report success more often from filemap_map_folio_range()
  fs: binfmt_elf_efpic: fix personality for ELF-FDPIC
2023-10-01 13:33:25 -07:00
Linus Torvalds
8f63336941 Merge tag 'char-misc-6.6-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
Pull misc driver fix from Greg KH:
 "Here is a single, much requested, fix for a set of misc drivers to
  resolve a much reported regression in the -rc series that has also
  propagated back to the stable releases. Sorry for the delay, lots of
  conference travel for a few weeks put me very far behind in patch
  wrangling.

  It has been reported by many to resolve the reported problem, and has
  been in linux-next with no reported issues"

* tag 'char-misc-6.6-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
  misc: rtsx: Fix some platforms can not boot and move the l1ss judgment to probe
2023-10-01 12:50:04 -07:00
Linus Torvalds
3abd15e25f Merge tag 'tty-6.6-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
Pull tty / serial driver fixes from Greg KH:
 "Here are two tty/serial driver fixes for 6.6-rc4 that resolve some
  reported regressions:

   - revert a n_gsm change that ended up causing problems

   - 8250_port fix for irq data

  both have been in linux-next for over a week with no reported
  problems"

* tag 'tty-6.6-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
  Revert "tty: n_gsm: fix UAF in gsm_cleanup_mux"
  serial: 8250_port: Check IRQ data before use
2023-10-01 12:44:45 -07:00
Linus Torvalds
ec8c298121 Merge tag 'x86-urgent-2023-10-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Ingo Molnar:
 "Misc fixes: a kerneldoc build warning fix, add SRSO mitigation for
  AMD-derived Hygon processors, and fix a SGX kernel crash in the page
  fault handler that can trigger when ksgxd races to reclaim the SECS
  special page, by making the SECS page unswappable"

* tag 'x86-urgent-2023-10-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/sgx: Resolves SECS reclaim vs. page fault for EAUG race
  x86/srso: Add SRSO mitigation for Hygon processors
  x86/kgdb: Fix a kerneldoc warning when build with W=1
2023-10-01 09:50:58 -07:00
Linus Torvalds
373ceff28e Merge tag 'timers-urgent-2023-10-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fix from Ingo Molnar:
 "Fix a spurious kernel warning during CPU hotplug events that may
  trigger when timer/hrtimer softirqs are pending, which are otherwise
  hotplug-safe and don't merit a warning"

* tag 'timers-urgent-2023-10-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  timers: Tag (hr)timer softirq as hotplug safe
2023-10-01 09:41:58 -07:00
Linus Torvalds
c5ecffe6d3 Merge tag 'sched-urgent-2023-10-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fix from Ingo Molnar:
 "Fix a RT tasks related lockup/live-lock during CPU offlining"

* tag 'sched-urgent-2023-10-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/rt: Fix live lock between select_fallback_rq() and RT push
2023-10-01 09:38:05 -07:00
Linus Torvalds
3a38c57a87 Merge tag 'perf-urgent-2023-10-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf event fixes from Ingo Molnar:
 "Misc fixes: work around an AMD microcode bug on certain models, and
  fix kexec kernel PMI handlers on AMD systems that get loaded on older
  kernels that have an unexpected register state"

* tag 'perf-urgent-2023-10-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86/amd: Do not WARN() on every IRQ
  perf/x86/amd/core: Fix overflow reset on hotplug
2023-10-01 09:34:53 -07:00
Masahiro Yamada
2d7d1bc119 kbuild: remove stale code for 'source' symlink in packaging scripts
Since commit d8131c2965 ("kbuild: remove $(MODLIB)/source symlink"),
modules_install does not create the 'source' symlink.

Remove the stale code from builddeb and kernel.spec.

Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2023-10-01 23:06:06 +09:00
Uwe Kleine-König
f177cd0c15 modpost: Don't let "driver"s reference .exit.*
Drivers must not reference functions marked with __exit as these likely
are not available when the code is built-in.

There are few creative offenders uncovered for example in ARCH=amd64
allmodconfig builds. So only trigger the section mismatch warning for
W=1 builds.

The dual rule that drivers must not reference .init.* is implemented
since commit 0db2524523 ("modpost: don't allow *driver to reference
.init.*") which however missed that .exit.* should be handled in the
same way.

Thanks to Masahiro Yamada and Arnd Bergmann who gave valuable hints to
find this improvement.

Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2023-10-01 14:55:30 +09:00
Masahiro Yamada
15e86643d5 vmlinux.lds.h: remove unused CPU_KEEP and CPU_DISCARD macros
Remove the left-over of commit e24f662881 ("modpost: remove all
traces of cpuinit/cpuexit sections").

Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Acked-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2023-10-01 14:55:23 +09:00
Mauricio Faria de Oliveira
cbc3d00cf8 modpost: add missing else to the "of" check
Without this 'else' statement, an "usb" name goes into two handlers:
the first/previous 'if' statement _AND_ the for-loop over 'devtable',
but the latter is useless as it has no 'usb' device_id entry anyway.

Tested with allmodconfig before/after patch; no changes to *.mod.c:

    git checkout v6.6-rc3
    make -j$(nproc) allmodconfig
    make -j$(nproc) olddefconfig

    make -j$(nproc)
    find . -name '*.mod.c' | cpio -pd /tmp/before

    # apply patch

    make -j$(nproc)
    find . -name '*.mod.c' | cpio -pd /tmp/after

    diff -r /tmp/before/ /tmp/after/
    # no difference

Fixes: acbef7b766 ("modpost: fix module autoloading for OF devices with generic compatible property")
Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2023-10-01 14:24:34 +09:00
Linus Torvalds
e402b08634 Merge tag 'soc-fixes-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
Pull ARM SoC fixes from Arnd Bergmann:
 "These are the latest bug fixes that have come up in the soc tree. Most
  of these are fairly minor. Most notably, the majority of changes this
  time are not for dts files as usual.

   - Updates to the addresses of the broadcom and aspeed entries in the
     MAINTAINERS file.

   - Defconfig updates to address a regression on samsung and a build
     warning from an unknown Kconfig symbol

   - Build fixes for the StrongARM and Uniphier platforms

   - Code fixes for SCMI and FF-A firmware drivers, both of which had a
     simple bug that resulted in invalid data, and a lesser fix for the
     optee firmware driver

   - Multiple fixes for the recently added loongson/loongarch "guts" soc
     driver

   - Devicetree fixes for RISC-V on the startfive platform, addressing
     issues with NOR flash, usb and uart.

   - Multiple fixes for NXP i.MX8/i.MX9 dts files, fixing problems with
     clock, gpio, hdmi settings and the Makefile

   - Bug fixes for i.MX firmware code and the OCOTP soc driver

   - Multiple fixes for the TI sysc bus driver

   - Minor dts updates for TI omap dts files, to address boot time
     warnings and errors"

* tag 'soc-fixes-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (35 commits)
  MAINTAINERS: Fix Florian Fainelli's email address
  arm64: defconfig: enable syscon-poweroff driver
  ARM: locomo: fix locomolcd_power declaration
  soc: loongson: loongson2_guts: Remove unneeded semicolon
  soc: loongson: loongson2_guts: Convert to devm_platform_ioremap_resource()
  soc: loongson: loongson_pm2: Populate children syscon nodes
  dt-bindings: soc: loongson,ls2k-pmc: Allow syscon-reboot/syscon-poweroff as child
  soc: loongson: loongson_pm2: Drop useless of_device_id compatible
  dt-bindings: soc: loongson,ls2k-pmc: Use fallbacks for ls2k-pmc compatible
  soc: loongson: loongson_pm2: Add dependency for INPUT
  arm64: defconfig: remove CONFIG_COMMON_CLK_NPCM8XX=y
  ARM: uniphier: fix cache kernel-doc warnings
  MAINTAINERS: aspeed: Update Andrew's email address
  MAINTAINERS: aspeed: Update git tree URL
  firmware: arm_ffa: Don't set the memory region attributes for MEM_LEND
  arm64: dts: imx: Add imx8mm-prt8mm.dtb to build
  arm64: dts: imx8mm-evk: Fix hdmi@3d node
  soc: imx8m: Enable OCOTP clock for imx8mm before reading registers
  arm64: dts: imx8mp-beacon-kit: Fix audio_pll2 clock
  arm64: dts: imx8mp: Fix SDMA2/3 clocks
  ...
2023-09-30 18:41:37 -07:00
Linus Torvalds
3b347e4032 Merge tag 'trace-v6.6-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fixes from Steven Rostedt:

 - Make sure 32-bit applications using user events have aligned access
   when running on a 64-bit kernel.

 - Add cond_resched in the loop that handles converting enums in
   print_fmt string is trace events.

 - Fix premature wake ups of polling processes in the tracing ring
   buffer. When a task polls waiting for a percentage of the ring buffer
   to be filled, the writer still will wake it up at every event. Add
   the polling's percentage to the "shortest_full" list to tell the
   writer when to wake it up.

 - For eventfs dir lookups on dynamic events, an event system's only
   event could be removed, leaving its dentry with no children. This is
   totally legitimate. But in eventfs_release() it must not access the
   children array, as it is only allocated when the dentry has children.

* tag 'trace-v6.6-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  eventfs: Test for dentries array allocated in eventfs_release()
  tracing/user_events: Align set_bit() address for all archs
  tracing: relax trace_event_eval_update() execution with cond_resched()
  ring-buffer: Update "shortest_full" in polling
2023-09-30 18:19:02 -07:00
Steven Rostedt (Google)
2598bd3ca8 eventfs: Test for dentries array allocated in eventfs_release()
The dcache_dir_open_wrapper() could be called when a dynamic event is
being deleted leaving a dentry with no children. In this case the
dlist->dentries array will never be allocated. This needs to be checked
for in eventfs_release(), otherwise it will trigger a NULL pointer
dereference.

Link: https://lore.kernel.org/linux-trace-kernel/20230930090106.1c3164e9@rorschach.local.home

Cc: Mark Rutland <mark.rutland@arm.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Fixes: ef36b4f928 ("eventfs: Remember what dentries were created on dir open")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-09-30 16:26:04 -04:00
Beau Belgrave
2de9ee9405 tracing/user_events: Align set_bit() address for all archs
All architectures should use a long aligned address passed to set_bit().
User processes can pass either a 32-bit or 64-bit sized value to be
updated when tracing is enabled when on a 64-bit kernel. Both cases are
ensured to be naturally aligned, however, that is not enough. The
address must be long aligned without affecting checks on the value
within the user process which require different adjustments for the bit
for little and big endian CPUs.

Add a compat flag to user_event_enabler that indicates when a 32-bit
value is being used on a 64-bit kernel. Long align addresses and correct
the bit to be used by set_bit() to account for this alignment. Ensure
compat flags are copied during forks and used during deletion clears.

Link: https://lore.kernel.org/linux-trace-kernel/20230925230829.341-2-beaub@linux.microsoft.com
Link: https://lore.kernel.org/linux-trace-kernel/20230914131102.179100-1-cleger@rivosinc.com/

Cc: stable@vger.kernel.org
Fixes: 7235759084 ("tracing/user_events: Use remote writes for event enablement")
Reported-by: Clément Léger <cleger@rivosinc.com>
Suggested-by: Clément Léger <cleger@rivosinc.com>
Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-09-30 16:25:41 -04:00
Clément Léger
23cce5f254 tracing: relax trace_event_eval_update() execution with cond_resched()
When kernel is compiled without preemption, the eval_map_work_func()
(which calls trace_event_eval_update()) will not be preempted up to its
complete execution. This can actually cause a problem since if another
CPU call stop_machine(), the call will have to wait for the
eval_map_work_func() function to finish executing in the workqueue
before being able to be scheduled. This problem was observe on a SMP
system at boot time, when the CPU calling the initcalls executed
clocksource_done_booting() which in the end calls stop_machine(). We
observed a 1 second delay because one CPU was executing
eval_map_work_func() and was not preempted by the stop_machine() task.

Adding a call to cond_resched() in trace_event_eval_update() allows
other tasks to be executed and thus continue working asynchronously
like before without blocking any pending task at boot time.

Link: https://lore.kernel.org/linux-trace-kernel/20230929191637.416931-1-cleger@rivosinc.com

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Clément Léger <cleger@rivosinc.com>
Tested-by: Atish Patra <atishp@rivosinc.com>
Reviewed-by: Atish Patra <atishp@rivosinc.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-09-30 16:24:55 -04:00
Steven Rostedt (Google)
1e0cb399c7 ring-buffer: Update "shortest_full" in polling
It was discovered that the ring buffer polling was incorrectly stating
that read would not block, but that's because polling did not take into
account that reads will block if the "buffer-percent" was set. Instead,
the ring buffer polling would say reads would not block if there was any
data in the ring buffer. This was incorrect behavior from a user space
point of view. This was fixed by commit 42fb0a1e84 by having the polling
code check if the ring buffer had more data than what the user specified
"buffer percent" had.

The problem now is that the polling code did not register itself to the
writer that it wanted to wait for a specific "full" value of the ring
buffer. The result was that the writer would wake the polling waiter
whenever there was a new event. The polling waiter would then wake up, see
that there's not enough data in the ring buffer to notify user space and
then go back to sleep. The next event would wake it up again.

Before the polling fix was added, the code would wake up around 100 times
for a hackbench 30 benchmark. After the "fix", due to the constant waking
of the writer, it would wake up over 11,0000 times! It would never leave
the kernel, so the user space behavior was still "correct", but this
definitely is not the desired effect.

To fix this, have the polling code add what it's waiting for to the
"shortest_full" variable, to tell the writer not to wake it up if the
buffer is not as full as it expects to be.

Note, after this fix, it appears that the waiter is now woken up around 2x
the times it was before (~200). This is a tremendous improvement from the
11,000 times, but I will need to spend some time to see why polling is
more aggressive in its wakeups than the read blocking code.

Link: https://lore.kernel.org/linux-trace-kernel/20230929180113.01c2cae3@rorschach.local.home

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Fixes: 42fb0a1e84 ("tracing/ring-buffer: Have polling block on watermark")
Reported-by: Julia Lawall <julia.lawall@inria.fr>
Tested-by: Julia Lawall <julia.lawall@inria.fr>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-09-30 16:17:34 -04:00
Linus Torvalds
3b517966c5 Merge tag 'dma-mapping-6.6-2023-09-30' of git://git.infradead.org/users/hch/dma-mapping
Pull dma-mapping fixes from Christoph Hellwig:

 - fix the narea calculation in swiotlb initialization (Ross Lagerwall)

 - fix the check whether a device has used swiotlb (Petr Tesarik)

* tag 'dma-mapping-6.6-2023-09-30' of git://git.infradead.org/users/hch/dma-mapping:
  swiotlb: fix the check whether a device has used software IO TLB
  swiotlb: use the calculated number of areas
2023-09-30 11:07:26 -07:00
Linus Torvalds
25d48d570e Merge tag 'iomap-6.6-fixes-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull iomap fixes from Darrick Wong:

 - Handle a race between writing and shrinking block devices by
   returning EIO

 - Fix a typo in a comment

* tag 'iomap-6.6-fixes-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  iomap: Spelling s/preceeding/preceding/g
  iomap: add a workaround for racy i_size updates on block devices
2023-09-30 11:01:38 -07:00
Linus Torvalds
cefc06e4de Merge tag 'i2c-for-6.6-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux
Pull i2c fixes from Wolfram Sang:
 "Usual business: a driver fix, a DT fix, a minor core fix"

* tag 'i2c-for-6.6-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
  i2c: npcm7xx: Fix callback completion ordering
  i2c: mux: Avoid potential false error message in i2c_mux_add_adapter
  dt-bindings: i2c: mxs: Pass ref and 'unevaluatedProperties: false'
2023-09-30 10:07:33 -07:00
Linus Torvalds
830380e317 Merge tag 'acpi-6.6-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull ACPI fix from Rafael Wysocki:
 "Fix a possible NULL pointer dereference in the error path of
  acpi_video_bus_add() resulting from recent changes (Dinghao Liu)"

* tag 'acpi-6.6-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  ACPI: video: Fix NULL pointer dereference in acpi_video_bus_add()
2023-09-30 09:59:37 -07:00
Linus Torvalds
1c9d831221 Merge tag 'powerpc-6.6-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:

 - Fix arch_stack_walk_reliable(), used by live patching

 - Fix powerpc selftests to work with run_kselftest.sh

Thanks to Joe Lawrence and Petr Mladek.

* tag 'powerpc-6.6-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  selftests/powerpc: Fix emit_tests to work with run_kselftest.sh
  powerpc/stacktrace: Fix arch_stack_walk_reliable()
2023-09-30 09:53:09 -07:00
Linus Torvalds
ae21363998 Merge tag 'nfsd-6.6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull nfsd fix from Chuck Lever:

 - Fix NFSv4 READ corner case

* tag 'nfsd-6.6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
  NFSD: Fix zero NFSv4 READ results when RQ_SPLICE_OK is not set
2023-09-30 09:44:48 -07:00
Linus Torvalds
ba77f7a63f Merge tag '6.6-rc3-smb3-client-fix' of git://git.samba.org/sfrench/cifs-2.6
Pull smb client fix from Steve French:
 "Fix for password freeing potential oops (also for stable)"

* tag '6.6-rc3-smb3-client-fix' of git://git.samba.org/sfrench/cifs-2.6:
  fs/smb/client: Reset password pointer to NULL
2023-09-30 09:39:23 -07:00
Baoquan He
e2a8f20dd8 Crash: add lock to serialize crash hotplug handling
Eric reported that handling corresponding crash hotplug event can be
failed easily when many memory hotplug event are notified in a short
period.  They failed because failing to take __kexec_lock.

=======
[   78.714569] Fallback order for Node 0: 0
[   78.714575] Built 1 zonelists, mobility grouping on.  Total pages: 1817886
[   78.717133] Policy zone: Normal
[   78.724423] crash hp: kexec_trylock() failed, elfcorehdr may be inaccurate
[   78.727207] crash hp: kexec_trylock() failed, elfcorehdr may be inaccurate
[   80.056643] PEFILE: Unsigned PE binary
=======

The memory hotplug events are notified very quickly and very many, while
the handling of crash hotplug is much slower relatively.  So the atomic
variable __kexec_lock and kexec_trylock() can't guarantee the
serialization of crash hotplug handling.

Here, add a new mutex lock __crash_hotplug_lock to serialize crash hotplug
handling specifically.  This doesn't impact the usage of __kexec_lock.

Link: https://lkml.kernel.org/r/20230926120905.392903-1-bhe@redhat.com
Fixes: 2472627561 ("crash: add generic infrastructure for crash hotplug support")
Signed-off-by: Baoquan He <bhe@redhat.com>
Tested-by: Eric DeVolder <eric.devolder@oracle.com>
Reviewed-by: Eric DeVolder <eric.devolder@oracle.com>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Cc: Sourabh Jain <sourabhjain@linux.ibm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-09-29 17:20:48 -07:00
Juntong Deng
bbe246f875 selftests/mm: fix awk usage in charge_reserved_hugetlb.sh and hugetlb_reparenting_test.sh that may cause error
According to the awk manual, the -e option does not need to be specified
in front of 'program' (unless you need to mix program-file).

The redundant -e option can cause error when users use awk tools other
than gawk (for example, mawk does not support the -e option).

Error Example:
awk: not an option: -e

Link: https://lkml.kernel.org/r/VI1P193MB075228810591AF2FDD7D42C599C3A@VI1P193MB0752.EURP193.PROD.OUTLOOK.COM
Signed-off-by: Juntong Deng <juntong.deng@outlook.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-09-29 17:20:48 -07:00
Yang Shi
24526268f4 mm: mempolicy: keep VMA walk if both MPOL_MF_STRICT and MPOL_MF_MOVE are specified
When calling mbind() with MPOL_MF_{MOVE|MOVEALL} | MPOL_MF_STRICT, kernel
should attempt to migrate all existing pages, and return -EIO if there is
misplaced or unmovable page.  Then commit 6f4576e368 ("mempolicy: apply
page table walker on queue_pages_range()") messed up the return value and
didn't break VMA scan early ianymore when MPOL_MF_STRICT alone.  The
return value problem was fixed by commit a7f40cfe3b ("mm: mempolicy:
make mbind() return -EIO when MPOL_MF_STRICT is specified"), but it broke
the VMA walk early if unmovable page is met, it may cause some pages are
not migrated as expected.

The code should conceptually do:

 if (MPOL_MF_MOVE|MOVEALL)
     scan all vmas
     try to migrate the existing pages
     return success
 else if (MPOL_MF_MOVE* | MPOL_MF_STRICT)
     scan all vmas
     try to migrate the existing pages
     return -EIO if unmovable or migration failed
 else /* MPOL_MF_STRICT alone */
     break early if meets unmovable and don't call mbind_range() at all
 else /* none of those flags */
     check the ranges in test_walk, EFAULT without mbind_range() if discontig.

Fixed the behavior.

Link: https://lkml.kernel.org/r/20230920223242.3425775-1-yang@os.amperecomputing.com
Fixes: a7f40cfe3b ("mm: mempolicy: make mbind() return -EIO when MPOL_MF_STRICT is specified")
Signed-off-by: Yang Shi <yang@os.amperecomputing.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Rafael Aquini <aquini@redhat.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: David Rientjes <rientjes@google.com>
Cc: <stable@vger.kernel.org>	[4.9+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-09-29 17:20:48 -07:00