Commit Graph

152046 Commits

Author SHA1 Message Date
Andrey Konovalov
108be8def4 lib/stackdepot: allow users to evict stack traces
Add stack_depot_put, a function that decrements the reference counter on a
stack record and removes it from the stack depot once the counter reaches
0.

Internally, when removing a stack record, the function unlinks it from the
hash table bucket and returns it to the freelist.

With this change, the users of stack depot can call stack_depot_put when
keeping a stack trace in the stack depot is not needed anymore.  This
avoids polluting the stack depot with irrelevant stack traces and thus
leaves more space to store the relevant ones before the stack depot
reaches its capacity.
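
The get/put lifecycle described above can be sketched in user-space C.
This is a toy model under stated assumptions, not the kernel
implementation: the toy_* names, the single hash bucket and the use of a
plain integer as a stand-in for a stack trace are all hypothetical.

```c
#include <stddef.h>

/* Hypothetical toy model; not the kernel's stack depot data structures. */
struct toy_record {
	struct toy_record *next;	/* bucket chain or freelist link */
	unsigned int refcount;		/* number of outstanding references */
	unsigned long hash;		/* stand-in for the stack trace contents */
};

struct toy_depot {
	struct toy_record *bucket;	/* a single hash bucket, for brevity */
	struct toy_record *freelist;	/* records available for reuse */
};

/* Take a reference, reusing an existing record when the hash matches. */
static struct toy_record *toy_get(struct toy_depot *d, unsigned long hash)
{
	struct toy_record *r;

	for (r = d->bucket; r; r = r->next) {
		if (r->hash == hash) {
			r->refcount++;
			return r;
		}
	}

	r = d->freelist;
	if (!r)
		return NULL;		/* the depot has reached its capacity */
	d->freelist = r->next;

	r->hash = hash;
	r->refcount = 1;
	r->next = d->bucket;
	d->bucket = r;
	return r;
}

/* Drop a reference; on the last put, unlink and return to the freelist. */
static void toy_put(struct toy_depot *d, struct toy_record *rec)
{
	struct toy_record **pp;

	if (--rec->refcount > 0)
		return;

	for (pp = &d->bucket; *pp; pp = &(*pp)->next) {
		if (*pp == rec) {
			*pp = rec->next;
			break;
		}
	}
	rec->next = d->freelist;
	d->freelist = rec;
}
```

In this sketch a depot with a single slot rejects a second, different
trace until the last reference to the first one has been put, which is
the capacity effect the paragraph above refers to.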

Link: https://lkml.kernel.org/r/1d1ad5692ee43d4fc2b3fd9d221331d30b36123f.1700502145.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Evgenii Stepanov <eugenis@google.com>
Cc: Marco Elver <elver@google.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-10 16:51:47 -08:00
Andrey Konovalov
410b764f89 lib/stackdepot: add refcount for records
Add a reference counter for how many times a stack record has been
added to stack depot.

Add a new STACK_DEPOT_FLAG_GET flag to stack_depot_save_flags that
instructs the stack depot to increment the refcount.

Do not yet decrement the refcount; this is implemented in one of the
following patches.

Do not yet enable any users to use the flag to avoid overflowing the
refcount.

This is a preparatory patch for implementing the eviction of stack records
from the stack depot.

Link: https://lkml.kernel.org/r/a3fc14a2359d019d2a008d4ff8b46a665371ffee.1700502145.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-by: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Evgenii Stepanov <eugenis@google.com>
Cc: Marco Elver <elver@google.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-10 16:51:46 -08:00
Andrey Konovalov
022012dcf4 lib/stackdepot, kasan: add flags to __stack_depot_save and rename
Change the bool can_alloc argument of __stack_depot_save to a u32
argument that accepts a set of flags.

The following patch will add another flag to stack_depot_save_flags
besides the existing STACK_DEPOT_FLAG_CAN_ALLOC.

Also rename the function to stack_depot_save_flags, as
__stack_depot_save is a cryptic name.
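
As a rough illustration of why a flags word scales better than a bool
argument, here is a minimal sketch.  The TOY_* names and bit values are
hypothetical and are not the kernel's definitions; the second bit only
stands in for the kind of flag a later patch could add.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bits; the real values live in the kernel headers. */
#define TOY_FLAG_CAN_ALLOC	0x1u
#define TOY_FLAG_GET		0x2u
#define TOY_FLAGS_MASK		(TOY_FLAG_CAN_ALLOC | TOY_FLAG_GET)

/* Map the old bool argument onto the new flags word. */
static uint32_t toy_flags_from_bool(bool can_alloc)
{
	return can_alloc ? TOY_FLAG_CAN_ALLOC : 0;
}

/* Reject unknown bits so that future flags cannot be passed by accident. */
static bool toy_flags_valid(uint32_t flags)
{
	return (flags & ~TOY_FLAGS_MASK) == 0;
}

/* Each option is tested independently of the others. */
static bool toy_may_allocate(uint32_t flags)
{
	return (flags & TOY_FLAG_CAN_ALLOC) != 0;
}
```

With this shape, adding an option means adding a bit and extending the
mask, and the function signature stays the same.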

Link: https://lkml.kernel.org/r/645fa15239621eebbd3a10331e5864b718839512.1700502145.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-by: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Evgenii Stepanov <eugenis@google.com>
Cc: Marco Elver <elver@google.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-10 16:51:46 -08:00
Matthew Wilcox (Oracle)
af7628d6ec fs: convert error_remove_page to error_remove_folio
There were already assertions that we were not passing a tail page to
error_remove_page(), so make the compiler enforce that by converting
everything to pass and use a folio.

Link: https://lkml.kernel.org/r/20231117161447.2461643-7-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-10 16:51:42 -08:00
Matthew Wilcox (Oracle)
16f5dfbc85 gfp: include __GFP_NOWARN in GFP_NOWAIT
GFP_NOWAIT callers are always prepared for their allocations to fail
because they fail so frequently.  Forcing the callers to remember to add
__GFP_NOWARN is just annoying and leads to an endless stream of patches
for the places where we forgot to add it.

We can now remove __GFP_NOWARN from all the callers which specify
GFP_NOWAIT, but I'd rather wait a cycle and send patches to each
maintainer instead of creating a big pile of merge conflicts.

Link: https://lkml.kernel.org/r/20231109211507.2262419-1-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-10 16:51:39 -08:00
Matthew Wilcox (Oracle)
b5612c3686 mm: return void from folio_start_writeback() and related functions
Nobody now checks the return value from any of these functions, so
add an assertion at the beginning of the function and return void.

Link: https://lkml.kernel.org/r/20231108204605.745109-5-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Steve French <sfrench@samba.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-10 16:51:37 -08:00
Matthew Wilcox (Oracle)
c36f9d3d2c mm: remove test_set_page_writeback()
Patch series "Make folio_start_writeback return void".

Most of the folio flag-setting functions return void. 
folio_start_writeback is gratuitously different; the only two filesystems
that do anything with the return value emit debug messages if it's already
set, and we can (and should) do that internally without bothering the
filesystem to do it.


This patch (of 4):

There are no more callers of this wrapper.

Link: https://lkml.kernel.org/r/20231108204605.745109-1-willy@infradead.org
Link: https://lkml.kernel.org/r/20231108204605.745109-2-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Steve French <sfrench@samba.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-10 16:51:36 -08:00
Matthew Wilcox (Oracle)
6eaa266b54 mm: add folio_fill_tail() and use it in iomap
The iomap code was limited to PAGE_SIZE bytes; generalise it to cover
an arbitrary-sized folio, and move it to be a common helper.

[akpm@linux-foundation.org: fix folio_fill_tail(), per Andreas Gruenbacher]
Link: https://lkml.kernel.org/r/20231107212643.3490372-3-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Darrick J. Wong <djwong@kernel.org>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-10 16:51:36 -08:00
Matthew Wilcox (Oracle)
a4fc4a0c45 mm: add folio_zero_tail() and use it in ext4
Patch series "Add folio_zero_tail() and folio_fill_tail()".

I'm trying to make it easier for filesystems with tailpacking / stuffing /
inline data to use folios.  The primary function here is
folio_fill_tail().  You give it a pointer to memory where the data
currently is, and it takes care of copying it into the folio at that
offset.  That works for gfs2 & iomap.  Then There's Ext4.  Rather than gin
up some kind of specialist "Here's two pointers to two blocks of memory"
routine, just let it do its current thing, and let it call
folio_zero_tail(), which is also called by folio_fill_tail().

Other filesystems can be converted later; these ones seemed like good
examples as they're already partly or completely converted to folios.


This patch (of 3):

Instead of unmapping the folio after copying the data to it, then mapping
it again to zero the tail, provide folio_zero_tail() to zero the tail of
an already-mapped folio.

[akpm@linux-foundation.org: fix kerneldoc argument ordering]
Link: https://lkml.kernel.org/r/20231107212643.3490372-1-willy@infradead.org
Link: https://lkml.kernel.org/r/20231107212643.3490372-2-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Darrick J. Wong <djwong@kernel.org>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-10 16:51:36 -08:00
Andrei Vagin
e6a9a2cbc1 fs/proc/task_mmu: report SOFT_DIRTY bits through the PAGEMAP_SCAN ioctl
The PAGEMAP_SCAN ioctl returns information regarding page table entries. 
It is more efficient compared to reading pagemap files.  CRIU can start to
utilize this ioctl, but it needs info about soft-dirty bits to track
memory changes.

We are aware of a new method for tracking memory changes implemented in
the PAGEMAP_SCAN ioctl.  For CRIU, the primary advantage of this method is
its usability by unprivileged users.  However, it is not feasible to
transparently replace the soft-dirty tracker with the new one.  The main
problem here is userfault descriptors that have to be preserved between
pre-dump iterations.  This means CRIU continues to support the soft-dirty
method to avoid breakage for current users.  The new method will be
implemented as a separate feature.

[avagin@google.com: update tools/include/uapi/linux/fs.h]
  Link: https://lkml.kernel.org/r/20231107164139.576046-1-avagin@google.com
Link: https://lkml.kernel.org/r/20231106220959.296568-1-avagin@google.com
Signed-off-by: Andrei Vagin <avagin@google.com>
Reviewed-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
Cc: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-10 16:51:35 -08:00
Liam Ni
ff6c3d81f2 NUMA: optimize detection of memory with no node id assigned by firmware
The sanity check that makes sure the nodes cover all memory loops over
numa_meminfo to count the pages that have a node id assigned by the
firmware, then loops again over memblock.memory to find the total amount
of memory and in the end checks that the difference between the total
memory and the memory that is covered by nodes is less than some threshold.
Worse, the loop over numa_meminfo calls __absent_pages_in_range() that
also partially traverses memblock.memory.

It's much simpler and more efficient to have a single traversal of
memblock.memory that verifies that the amount of memory not covered by
nodes is less than a threshold.

Introduce memblock_validate_numa_coverage() that does exactly that and use
it instead of numa_meminfo_cover_memory().
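
The single-traversal idea can be sketched as follows.  This is a toy
model under stated assumptions: struct toy_region, TOY_NO_NODE and
toy_validate_numa_coverage() are hypothetical names, and the threshold is
expressed in pages here purely for simplicity.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_NO_NODE (-1)	/* hypothetical marker: no node id assigned */

/* Hypothetical stand-in for one memory region; not the memblock type. */
struct toy_region {
	unsigned long start_pfn;
	unsigned long end_pfn;	/* exclusive */
	int nid;
};

/* Single pass: sum the pages without a node id, then check the threshold. */
static bool toy_validate_numa_coverage(const struct toy_region *regions,
				       size_t count,
				       unsigned long threshold_pages)
{
	unsigned long uncovered = 0;
	size_t i;

	for (i = 0; i < count; i++) {
		if (regions[i].nid == TOY_NO_NODE)
			uncovered += regions[i].end_pfn - regions[i].start_pfn;
	}
	return uncovered < threshold_pages;
}
```

Only the uncovered regions contribute to the sum, so there is no need to
compute total memory and covered memory separately and subtract them.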

Link: https://lkml.kernel.org/r/20231026020329.327329-1-zhiguangni01@gmail.com
Signed-off-by: Liam Ni <zhiguangni01@gmail.com>
Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Bibo Mao <maobibo@loongson.cn>
Cc: Binbin Zhou <zhoubinbin@loongson.cn>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Feiyang Chen <chenfeiyang@loongson.cn>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: WANG Xuerui <kernel@xen0n.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-10 16:51:34 -08:00
Peng Zhang
d240629148 fork: use __mt_dup() to duplicate maple tree in dup_mmap()
In dup_mmap(), using __mt_dup() to duplicate the old maple tree and then
directly replacing the entries of VMAs in the new maple tree can result in
better performance.  __mt_dup() uses DFS pre-order to duplicate the maple
tree, so it is efficient.

The average time complexity of __mt_dup() is O(n), where n is the number
of VMAs.  The proof of the time complexity is provided in the commit log
that introduces __mt_dup().  After duplicating the maple tree, each
element is traversed and replaced (ignoring the cases of deletion, which
are rare).  Since it is only a replacement operation for each element,
this process is also O(n).

Analyzing the exact time complexity of the previous algorithm is
challenging because each insertion can involve appending to a node,
pushing data to adjacent nodes, or even splitting nodes.  The frequency of
each action is difficult to calculate.  The worst-case scenario for a
single insertion is when the tree undergoes splitting at every level.  If
we consider each insertion as the worst-case scenario, we can determine
that the upper bound of the time complexity is O(n*log(n)), although this
is a loose upper bound.  However, based on the test data, it appears that
the actual time complexity is likely to be O(n).

As the entire maple tree is duplicated using __mt_dup(), if dup_mmap()
fails, there will be a portion of VMAs that have not been duplicated in
the maple tree.  To handle this, we mark the failure point with
XA_ZERO_ENTRY.  In exit_mmap(), if this marker is encountered, stop
releasing VMAs that have not been duplicated after this point.

There is a "spawn" in byte-unixbench[1], which can be used to test the
performance of fork().  I modified it slightly to make it work with
different numbers of VMAs.

Below are the test results.  The first row shows the number of VMAs.  The
second and third rows show the number of fork() calls per ten seconds,
corresponding to next-20231006 and this patchset, respectively.  The
fourth row shows the relative improvement of this patchset over
next-20231006.  The test results were obtained with CPU binding to avoid
scheduler load balancing that could cause unstable results.  There are
still some fluctuations in the test results, but at least they are better
than the original performance.

21     121   221    421    821    1621   3221   6421   12821  25621  51221
112100 76261 54227  34035  20195  11112  6017   3161   1606   802    393
114558 83067 65008  45824  28751  16072  8922   4747   2436   1233   599
2.19%  8.92% 19.88% 34.64% 42.37% 44.64% 48.28% 50.17% 51.68% 53.74% 52.42%
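
The last row of the table appears to be the relative gain of the third
row over the second.  A small helper (toy_gain_percent() is a
hypothetical name) shows the arithmetic that reproduces it from the
numbers above:

```c
/* Relative gain of 'after' over 'before', in percent. */
static double toy_gain_percent(double before, double after)
{
	return (after - before) / before * 100.0;
}
```

For the first column, toy_gain_percent(112100, 114558) is about 2.19, and
for the last column, toy_gain_percent(393, 599) is about 52.42, matching
the table.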

[1] https://github.com/kdlucas/byte-unixbench/tree/master

Link: https://lkml.kernel.org/r/20231027033845.90608-11-zhangpeng.00@bytedance.com
Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
Suggested-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Mike Christie <michael.christie@oracle.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-10 16:51:34 -08:00
Peng Zhang
fd32e4e9b7 maple_tree: introduce interfaces __mt_dup() and mtree_dup()
Introduce interfaces __mt_dup() and mtree_dup(), which are used to
duplicate a maple tree.  They duplicate a maple tree in Depth-First Search
(DFS) pre-order traversal.  They use memcpy() to copy nodes in the source
tree and allocate new child nodes in non-leaf nodes.  The new node is
exactly the same as the source node except for all the addresses stored in
it.  This is faster than traversing all elements in the source tree and
inserting them one by one into the new tree.  The time complexity of these
two functions is O(n).

The difference between __mt_dup() and mtree_dup() is that mtree_dup()
handles locks internally.

Analysis of the average time complexity of this algorithm:

For simplicity, let's assume that the maximum branching factor of all
non-leaf nodes is 16 (in allocation mode, it is 10), and the tree is a
full tree.

Under the given conditions, if there is a maple tree with n elements, the
number of its leaves is n/16.  From bottom to top, the number of nodes in
each level is 1/16 of the number of nodes in the level below.  So the
total number of nodes in the entire tree is given by the sum of n/16 +
n/16^2 + n/16^3 + ...  + 1.  This is a geometric series, and it has log(n)
terms with base 16.  According to the formula for the sum of a geometric
series, the sum of this series can be calculated as (n-1)/15.  Each node
has only one parent node pointer, which can be considered as an edge.  In
total, there are (n-1)/15-1 edges.

This algorithm consists of two operations:

1. Traversing all nodes in DFS order.
2. For each node, making a copy and performing necessary modifications
   to create a new node.

For the first part, DFS traversal will visit each edge twice.  Let
T(ascend) represent the cost of taking one step upwards, and T(descend)
represent the cost of taking one step downwards.  Both of them are
constants (although mas_ascend() may not be, as it contains a loop, but
here we ignore it and treat it as a constant).  So the time spent on the
first part can be represented as ((n-1)/15-1) * (T(ascend) + T(descend)).

For the second part, each node will be copied, and the cost of copying a
node is denoted as T(copy_node).  For each non-leaf node, it is necessary
to reallocate all child nodes, and the cost of this operation is denoted
as T(dup_alloc).  The behavior behind memory allocation is complex and not
specific to the maple tree operation.  Here, we assume that the time
required for a single allocation is constant.  Since the size of a node is
fixed, both of these symbols are also constants.  We can calculate that
the time spent on the second part is ((n-1)/15) * T(copy_node) + ((n-1)/15
- n/16) * T(dup_alloc).

Adding both parts together, the total time spent by the algorithm can be
represented as:

((n-1)/15) * (T(ascend) + T(descend) + T(copy_node) + T(dup_alloc)) -
n/16 * T(dup_alloc) - (T(ascend) + T(descend))

Let C1 = T(ascend) + T(descend) + T(copy_node) + T(dup_alloc)
Let C2 = T(dup_alloc)
Let C3 = T(ascend) + T(descend)

Finally, the expression can be simplified as:
((16 * C1 - 15 * C2) / (15 * 16)) * n - (C1 / 15 + C3).

This is a linear function, so the average time complexity is O(n).
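
The derivation above can be checked numerically for a full tree with
n = 16^k elements.  The sketch below is only an arithmetic check under
the same simplifying assumptions as the analysis (branching factor 16,
constant per-step costs); the toy_* names are hypothetical and the code
does not touch a real maple tree.

```c
#include <stdint.h>

/* Count nodes level by level: n/16 leaves, then 1/16 of the level below. */
static uint64_t toy_count_nodes(uint64_t n)
{
	uint64_t total = 0;

	for (n /= 16; n >= 1; n /= 16)
		total += n;
	return total;
}

/* Cost summed directly from the two parts of the algorithm. */
static uint64_t toy_cost_direct(uint64_t n, uint64_t t_ascend,
				uint64_t t_descend, uint64_t t_copy,
				uint64_t t_alloc)
{
	uint64_t nodes = toy_count_nodes(n);
	uint64_t leaves = n / 16;
	uint64_t edges = nodes - 1;

	return edges * (t_ascend + t_descend) + nodes * t_copy +
	       (nodes - leaves) * t_alloc;
}

/* 240 times the simplified expression, to stay in integer arithmetic. */
static uint64_t toy_cost_closed_x240(uint64_t n, uint64_t t_ascend,
				     uint64_t t_descend, uint64_t t_copy,
				     uint64_t t_alloc)
{
	uint64_t c1 = t_ascend + t_descend + t_copy + t_alloc;
	uint64_t c2 = t_alloc;
	uint64_t c3 = t_ascend + t_descend;

	return (16 * c1 - 15 * c2) * n - 16 * c1 - 240 * c3;
}
```

For n = 4096 the node count is 256 + 16 + 1 = 273 = (4096-1)/15, and
240 times the direct cost equals the scaled closed form for any choice of
the four constants, which is consistent with the linear bound.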

Link: https://lkml.kernel.org/r/20231027033845.90608-4-zhangpeng.00@bytedance.com
Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
Suggested-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Mike Christie <michael.christie@oracle.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-10 16:51:32 -08:00
Peng Zhang
b2472efe43 maple_tree: introduce {mtree,mas}_lock_nested()
In some cases, nested locks may be needed, so {mtree,mas}_lock_nested is
introduced.  For example, when duplicating a maple tree, we need to hold the
locks of two trees, in which case nested locks are needed.

At the same time, add the definition of spin_lock_nested() in tools for
testing.

Link: https://lkml.kernel.org/r/20231027033845.90608-3-zhangpeng.00@bytedance.com
Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Mike Christie <michael.christie@oracle.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-10 16:51:31 -08:00
Li Zhijian
23e9f01389 mm/vmstat: move pgdemote_* to per-node stats
Demotion will migrate pages across nodes.  Previously, only the global
demotion statistics were accounted for.  Change them to per-node
statistics, making it easier to observe where demotion occurs on each
node.

This will help to identify which nodes are under pressure.

This patch also puts pgdemote_* behind CONFIG_NUMA_BALANCING, since
demotion is not available for !CONFIG_NUMA_BALANCING.

With this patch, here is a sample where node0 and node1 are DRAM and
node3 is PMEM:
Global stats:
$ grep demote /proc/vmstat
pgdemote_kswapd 254288
pgdemote_direct 113497
pgdemote_khugepaged 0

Per-node stats:
$ grep demote /sys/devices/system/node/node0/vmstat # demotion source
pgdemote_kswapd 68454
pgdemote_direct 83431
pgdemote_khugepaged 0
$ grep demote /sys/devices/system/node/node1/vmstat # demotion source
pgdemote_kswapd 185834
pgdemote_direct 30066
pgdemote_khugepaged 0
$ grep demote /sys/devices/system/node/node3/vmstat # demotion target
pgdemote_kswapd 0
pgdemote_direct 0
pgdemote_khugepaged 0

Link: https://lkml.kernel.org/r/20231103031450.1456523-1-lizhijian@fujitsu.com
Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
Acked-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-10 16:51:31 -08:00
Andrew Morton
0c92218f4e Merge branch 'master' into mm-hotfixes-stable 2023-12-06 17:03:50 -08:00
Su Hui
73424d00dc highmem: fix a memory copy problem in memcpy_from_folio
Clang static checker complains that value stored to 'from' is never read.
And memcpy_from_folio() only copies the last chunk of memory from the
folio to the destination.  Replace 'from += chunk' with 'to += chunk' to
fix this typo.
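
The shape of the bug can be sketched with a user-space chunked copy.
This is a simplified model, not the kernel function: TOY_CHUNK and
toy_copy_chunked() are hypothetical, and the per-iteration source pointer
stands in for a mapping that is set up afresh for every chunk.

```c
#include <stddef.h>
#include <string.h>

#define TOY_CHUNK 4	/* hypothetical stand-in for a page-sized chunk */

/* Copy len bytes from a source that is read one chunk at a time. */
static void toy_copy_chunked(char *to, const char *src, size_t len)
{
	size_t offset = 0;

	while (len > 0) {
		size_t chunk = len < TOY_CHUNK ? len : TOY_CHUNK;
		/* Recomputed every iteration, like a fresh mapping. */
		const char *from = src + offset;

		memcpy(to, from, chunk);
		/*
		 * Advancing 'from' here instead would be a dead store, and
		 * every chunk would overwrite the start of the destination.
		 */
		to += chunk;
		offset += chunk;
		len -= chunk;
	}
}
```

Because the source pointer is rebuilt from the offset on each pass, the
destination pointer is the only one that has to be advanced by hand.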

Link: https://lkml.kernel.org/r/20231130034017.1210429-1-suhui@nfschina.com
Fixes: b23d03ef7a ("highmem: add memcpy_to_folio() and memcpy_from_folio()")
Signed-off-by: Su Hui <suhui@nfschina.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jiaqi Yan <jiaqiyan@google.com>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Peter Collingbourne <pcc@google.com>
Cc: Tom Rix <trix@redhat.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-06 16:12:49 -08:00
Andy Shevchenko
8e92157d7f units: add missing header
BITS_PER_BYTE is defined in bits.h.

Link: https://lkml.kernel.org/r/20231128174404.393393-1-andriy.shevchenko@linux.intel.com
Fixes: e8eed5f736 ("units: Add BYTES_PER_*BIT")
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Damian Muszynski <damian.muszynski@intel.com>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-06 16:12:48 -08:00
Mike Kravetz
187da0f825 hugetlb: fix null-ptr-deref in hugetlb_vma_lock_write
The routine __vma_private_lock tests for the existence of a reserve map
associated with a private hugetlb mapping.  A pointer to the reserve map
is in vma->vm_private_data.  __vma_private_lock was checking the pointer
for NULL.  However, it is possible that the low bits of the pointer could
be used as flags.  In such instances, vm_private_data is not NULL and not
a valid pointer.  This results in the null-ptr-deref reported by syzbot:

general protection fault, probably for non-canonical address 0xdffffc000000001d: 0000 [#1] PREEMPT SMP KASAN
KASAN: null-ptr-deref in range [0x00000000000000e8-0x00000000000000ef]
CPU: 0 PID: 5048 Comm: syz-executor139 Not tainted 6.6.0-rc7-syzkaller-00142-g888cf78c29e2 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/09/2023
RIP: 0010:__lock_acquire+0x109/0x5de0 kernel/locking/lockdep.c:5004
...
Call Trace:
 <TASK>
 lock_acquire kernel/locking/lockdep.c:5753 [inline]
 lock_acquire+0x1ae/0x510 kernel/locking/lockdep.c:5718
 down_write+0x93/0x200 kernel/locking/rwsem.c:1573
 hugetlb_vma_lock_write mm/hugetlb.c:300 [inline]
 hugetlb_vma_lock_write+0xae/0x100 mm/hugetlb.c:291
 __hugetlb_zap_begin+0x1e9/0x2b0 mm/hugetlb.c:5447
 hugetlb_zap_begin include/linux/hugetlb.h:258 [inline]
 unmap_vmas+0x2f4/0x470 mm/memory.c:1733
 exit_mmap+0x1ad/0xa60 mm/mmap.c:3230
 __mmput+0x12a/0x4d0 kernel/fork.c:1349
 mmput+0x62/0x70 kernel/fork.c:1371
 exit_mm kernel/exit.c:567 [inline]
 do_exit+0x9ad/0x2a20 kernel/exit.c:861
 __do_sys_exit kernel/exit.c:991 [inline]
 __se_sys_exit kernel/exit.c:989 [inline]
 __x64_sys_exit+0x42/0x50 kernel/exit.c:989
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x38/0xb0 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x63/0xcd

Mask off low bit flags before checking for NULL pointer.  In addition, the
reserve map only 'belongs' to the OWNER (parent in parent/child
relationships) so also check for the OWNER flag.
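
A minimal sketch of the check described above, using a pointer-sized
integer whose low bits carry flags.  The TOY_* names and bit positions
are hypothetical and chosen for illustration; they are not taken from
mm/hugetlb.c.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bits stored in the low bits of an aligned pointer. */
#define TOY_RESV_OWNER		((uintptr_t)1 << 0)
#define TOY_RESV_UNMAPPED	((uintptr_t)1 << 1)
#define TOY_RESV_MASK		(TOY_RESV_OWNER | TOY_RESV_UNMAPPED)

/* Naive check: treats "flags set, no pointer" as a valid pointer. */
static bool toy_naive_check(uintptr_t private_data)
{
	return private_data != 0;
}

/* Mask off the flag bits before the NULL test, and require the owner flag. */
static bool toy_owns_resv_map(uintptr_t private_data)
{
	uintptr_t ptr = private_data & ~TOY_RESV_MASK;

	return ptr != 0 && (private_data & TOY_RESV_OWNER) != 0;
}
```

A value that holds only a flag bit passes the naive test but is rejected
once the flags are masked off, which is the case behind the report.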

Link: https://lkml.kernel.org/r/20231114012033.259600-1-mike.kravetz@oracle.com
Reported-by: syzbot+6ada951e7c0f7bc8a71e@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/linux-mm/00000000000078d1e00608d7878b@google.com/
Fixes: bf4916922c ("hugetlbfs: extend hugetlb_vma_lock to private VMAs")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Rik van Riel <riel@surriel.com>
Cc: Edward Adam Davis <eadavis@qq.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Tom Rix <trix@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-06 16:12:43 -08:00
Linus Torvalds
17b17be28d Merge tag 'vfio-v6.7-rc4' of https://github.com/awilliam/linux-vfio
Pull vfio fixes from Alex Williamson:

 - Fix the lifecycle of a mutex in the pds variant driver such that a
   reset prior to opening the device won't find it uninitialized.
   Implement the release path to symmetrically destroy the mutex. Also
   switch a different lock from spinlock to mutex as the code path has
   the potential to sleep and doesn't need the spinlock context
   otherwise (Brett Creeley)

 - Fix an issue detected via randconfig where KVM tries to symbol_get an
   undeclared function. The symbol is temporarily declared
   unconditionally here, which resolves the problem and avoids churn
   relative to a series pending for the next merge window which resolves
   some of this symbol ugliness, but also fixes Kconfig dependencies
   (Sean Christopherson)

* tag 'vfio-v6.7-rc4' of https://github.com/awilliam/linux-vfio:
  vfio: Drop vfio_file_iommu_group() stub to fudge around a KVM wart
  vfio/pds: Fix possible sleep while in atomic context
  vfio/pds: Fix mutex lock->magic != lock warning
2023-12-03 08:37:39 +09:00
Linus Torvalds
669fc83452 Merge tag 'probes-fixes-v6.7-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull probes fixes from Masami Hiramatsu:

 - objpool: Fix objpool overrun case on memory/cache access delay
   especially on the big.LITTLE SoC. The objpool uses a local copy of
   the object slot index in its internal loop, but the slot index can be
   changed on another processor in parallel. In that case, the
   difference between the local copy of 'head' and the 'slot->last'
   index will be bigger than the local slot size, and we need to re-read
   slot::head to update the copy.

 - kretprobe: Fix to use appropriate rcu API for kretprobe holder. Since
   kretprobe_holder::rp is RCU managed, it should use
   rcu_assign_pointer() and rcu_dereference_check() correctly. Also
   add the __rcu tag so that sparse can find wrong usage.

 - rethook: Fix to use appropriate rcu API for rethook::handler. The
   same as kretprobe, rethook::handler is RCU managed and it should use
   rcu_assign_pointer() and rcu_dereference_check(). This also adds
   __rcu tag for finding wrong usage by sparse.

* tag 'probes-fixes-v6.7-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  rethook: Use __rcu pointer for rethook::handler
  kprobes: consistent rcu api usage for kretprobe holder
  lib: objpool: fix head overrun on RK3588 SBC
2023-12-03 08:02:49 +09:00
Linus Torvalds
815fb87b75 Merge tag 'pm-6.7-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fixes from Rafael Wysocki:
 "These fix issues in two cpufreq drivers, in the AMD P-state driver and
  in the power-capping DTPM framework.

  Specifics:

   - Fix the AMD P-state driver's EPP sysfs interface in the cases when
     the performance governor is in use (Ayush Jain)

   - Make the ->fast_switch() callback in the AMD P-state driver return
     the target frequency as expected (Gautham R. Shenoy)

   - Allow user space to control the range of frequencies to use via
     scaling_min_freq and scaling_max_freq when AMD P-state driver is in
     use (Wyes Karny)

   - Prevent power domains needed for wakeup signaling from being turned
     off during system suspend on Qualcomm systems and prevent
     performance states votes from runtime-suspended devices from being
     lost across a system suspend-resume cycle in qcom-cpufreq-nvmem
     (Stephan Gerhold)

   - Fix disabling the 792 MHz OPP in the imx6q cpufreq driver for the
     i.MX6ULL types that can run at that frequency (Christoph
     Niedermaier)

   - Eliminate unnecessary and harmful conversions to uW from the DTPM
     (dynamic thermal and power management) framework (Lukasz Luba)"

* tag 'pm-6.7-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  cpufreq/amd-pstate: Only print supported EPP values for performance governor
  cpufreq/amd-pstate: Fix scaling_min_freq and scaling_max_freq update
  powercap: DTPM: Fix unneeded conversions to micro-Watts
  cpufreq/amd-pstate: Fix the return value of amd_pstate_fast_switch()
  pmdomain: qcom: rpmpd: Set GENPD_FLAG_ACTIVE_WAKEUP
  cpufreq: qcom-nvmem: Preserve PM domain votes in system suspend
  cpufreq: qcom-nvmem: Enable virtual power domain devices
  cpufreq: imx6q: Don't disable 792 Mhz OPP unnecessarily
2023-12-02 09:01:00 +09:00
Linus Torvalds
ce474ae7d0 Merge tag 'acpi-6.7-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull ACPI fixes from Rafael Wysocki:
 "This fixes a recently introduced build issue on ARM32 and a NULL
  pointer dereference in the ACPI backlight driver due to a design issue
  exposed by a recent change in the ACPI bus type code.

  Specifics:

   - Fix a recently introduced build issue on ARM32 platforms caused by
     an inadvertent header file breakage (Dave Jiang)

   - Eliminate questionable usage of acpi_driver_data() in the ACPI
     backlight cooling device code that leads to NULL pointer
     dereferences after recent ACPI core changes (Hans de Goede)"

* tag 'acpi-6.7-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  ACPI: video: Use acpi_video_device for cooling-dev driver data
  ACPI: Fix ARM32 platforms compile issue introduced by fw_table changes
2023-12-02 08:52:20 +09:00
Linus Torvalds
1a2b418566 Merge tag 'iommu-fixes-v6.7-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu
Pull iommu fixes from Joerg Roedel:

 - Fix race conditions in device probe path

 - Handle ERR_PTR() returns in __iommu_domain_alloc() path

 - Update MAINTAINERS entry for Qualcomm IOMMUs

 - Printk argument fix in device tree specific code

 - Several Intel VT-d fixes from Lu Baolu:
     - Do not support enforcing cache coherency for non-empty domains
     - Avoid devTLB invalidation if iommu is off
     - Disable PCI ATS in legacy passthrough mode
     - Support non-PCI devices when clearing context
     - Fix incorrect cache invalidation for mm notification
     - Add MTL to quirk list to skip TE disabling
     - Set variable intel_dirty_ops to static

* tag 'iommu-fixes-v6.7-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu:
  iommu: Fix printk arg in of_iommu_get_resv_regions()
  iommu/vt-d: Set variable intel_dirty_ops to static
  iommu/vt-d: Fix incorrect cache invalidation for mm notification
  iommu/vt-d: Add MTL to quirk list to skip TE disabling
  iommu/vt-d: Make context clearing consistent with context mapping
  iommu/vt-d: Disable PCI ATS in legacy passthrough mode
  iommu/vt-d: Omit devTLB invalidation requests when TES=0
  iommu/vt-d: Support enforce_cache_coherency only for empty domains
  iommu: Avoid more races around device probe
  MAINTAINERS: list all Qualcomm IOMMU drivers in the QUALCOMM IOMMU entry
  iommu: Flow ERR_PTR out from __iommu_domain_alloc()
2023-12-02 08:42:39 +09:00
Linus Torvalds
06a3c59f9c Merge tag 'sound-6.7-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
 "No surprise here, including only a collection of HD-audio
  device-specific small fixes"

* tag 'sound-6.7-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
  ALSA: hda: Disable power-save on KONTRON SinglePC
  ALSA: hda/realtek: Add supported ALC257 for ChromeOS
  ALSA: hda/realtek: Headset Mic VREF to 100%
  ALSA: hda: intel-nhlt: Ignore vbps when looking for DMIC 32 bps format
  ALSA: hda: cs35l56: Enable low-power hibernation mode on SPI
  ALSA: cs35l41: Fix for old systems which do not support command
  ALSA: hda: cs35l41: Remove unnecessary boolean state variable firmware_running
  ALSA: hda - Fix speaker and headset mic pin config for CHUWI CoreBook XPro
2023-12-02 08:33:29 +09:00
Linus Torvalds
b1e51588aa Merge tag 'drm-fixes-2023-12-01' of git://anongit.freedesktop.org/drm/drm
Pull drm fixes from Dave Airlie:
 "Weekly fixes, mostly amdgpu fixes with a scattering of nouveau, i915,
  and a couple of reverts. Hopefully it will quieten down in coming
  weeks.

  drm:
   - Revert unexport of prime helpers for fd/handle conversion

  dma_resv:
   - Do not double add fences in dma_resv_add_fence.

  gpuvm:
   - Fix GPUVM license identifier.

  i915:
   - Mark internal GSC engine with reserved uabi class
   - Take VGA converters into account in eDP probe
   - Fix intel_pre_plane_updates() call to ensure workarounds get applied

  panel:
   - Revert panel fixes as they require exporting device_is_dependent.

  nouveau:
   - fix oversized allocations in new vm path
   - fix zero-length array
   - remove a stray lock

  nt36523:
   - Fix error check for nt36523.

  amdgpu:
   - DMUB fix
   - DCN 3.5 fixes
   - XGMI fix
   - DCN 3.2 fixes
   - Vangogh suspend fix
   - NBIO 7.9 fix
   - GFX11 golden register fix
   - Backlight fix
   - NBIO 7.11 fix
   - IB test overflow fix
   - DCN 3.1.4 fixes
   - fix a runtime pm ref count
   - Retimer fix
   - ABM fix
   - DCN 3.1.5 fix
   - Fix AGP addressing
   - Fix possible memory leak in SMU error path
   - Make sure PME is enabled in D3
   - Fix possible NULL pointer dereference in debugfs
   - EEPROM fix
   - GC 9.4.3 fix

  amdkfd:
   - IP version check fix
   - Fix memory leak in pqm_uninit()"

* tag 'drm-fixes-2023-12-01' of git://anongit.freedesktop.org/drm/drm: (53 commits)
  Revert "drm/prime: Unexport helpers for fd/handle conversion"
  drm/amdgpu: Use another offset for GC 9.4.3 remap
  drm/amd/display: Fix some HostVM parameters in DML
  drm/amdkfd: Free gang_ctx_bo and wptr_bo in pqm_uninit
  drm/amdgpu: Update EEPROM I2C address for smu v13_0_0
  drm/amd/display: Allow DTBCLK disable for DCN35
  drm/amdgpu: Fix cat debugfs amdgpu_regs_didt causes kernel null pointer
  drm/amd: Enable PCIe PME from D3
  drm/amd/pm: fix a memleak in aldebaran_tables_init
  drm/amdgpu: fix AGP addressing when GART is not at 0
  drm/amd/display: update dcn315 lpddr pstate latency
  drm/amd/display: fix ABM disablement
  drm/amd/display: Fix black screen on video playback with embedded panel
  drm/amd/display: Fix conversions between bytes and KB
  drm/amdkfd: Use common function for IP version check
  drm/amd/display: Remove config update
  drm/amd/display: Update DCN35 clock table policy
  drm/amd/display: force toggle rate wa for first link training for a retimer
  drm/amdgpu: correct the amdgpu runtime dereference usage count
  drm/amd/display: Update min Z8 residency time to 2100 for DCN314
  ...
2023-12-02 08:18:59 +09:00
Linus Torvalds
c9a925b7bc Merge tag 'io_uring-6.7-2023-11-30' of git://git.kernel.dk/linux
Pull io_uring fixes from Jens Axboe:

 - Fix an issue with discontig page checking for IORING_SETUP_NO_MMAP

 - Fix an issue with not allowing IORING_SETUP_NO_MMAP also disallowing
   mmap'ed buffer rings

 - Fix an issue with deferred release of memory mapped pages

 - Fix a lockdep issue with IORING_SETUP_NO_MMAP

 - Use fget/fput consistently, even from our sync system calls. No real
   issue here, but if we were ever to allow closing io_uring descriptors
   it would be required. Let's play it safe and just use the full ref
   counted versions upfront. Most uses of io_uring are threaded anyway,
   and hence already doing the full version underneath.

* tag 'io_uring-6.7-2023-11-30' of git://git.kernel.dk/linux:
  io_uring: use fget/fput consistently
  io_uring: free io_buffer_list entries via RCU
  io_uring/kbuf: prune deferred locked cache when tearing down
  io_uring/kbuf: recycle freed mapped buffer ring entries
  io_uring/kbuf: defer release of mapped buffer rings
  io_uring: enable io_mem_alloc/free to be used in other parts
  io_uring: don't guard IORING_OFF_PBUF_RING with SETUP_NO_MMAP
  io_uring: don't allow discontig pages for IORING_SETUP_NO_MMAP
2023-12-02 06:47:32 +09:00
Linus Torvalds
ee0c8a9b34 Merge tag 'block-6.7-2023-12-01' of git://git.kernel.dk/linux
Pull block fixes from Jens Axboe:

 - NVMe pull request via Keith:
     - Invalid namespace identification error handling (Maurizio, Ewan,
       Keith)
     - Fabrics keep-alive tuning (Mark)

 - Fix for a bad error check regression in bcache (Markus)

 - Fix for a performance regression with O_DIRECT (Ming)

 - Fix for a flush related deadlock (Ming)

 - Make the read-only warn on per-partition (Yu)

* tag 'block-6.7-2023-12-01' of git://git.kernel.dk/linux:
  nvme-core: check for too small lba shift
  blk-mq: don't count completed flush data request as inflight in case of quiesce
  block: Document the role of the two attribute groups
  block: warn once for each partition in bio_check_ro()
  block: move .bd_inode into 1st cacheline of block_device
  nvme: check for valid nvme_identify_ns() before using it
  nvme-core: fix a memory leak in nvme_ns_info_from_identify()
  nvme: fine-tune sending of first keep-alive
  bcache: revert replacing IS_ERR_OR_NULL with IS_ERR
2023-12-02 06:39:30 +09:00
Linus Torvalds
ff4a9f4905 Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
 "Three small fixes, one in drivers.

  The core changes are to the internal representation of flags in
  scsi_devices which removes space wasting bools in favour of single bit
  flags and to add a flag to force a runtime resume which is used by ATA
  devices"
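
As a rough illustration of the bool-to-bitfield change described above, the sketch below compares a struct of separate bool members with the same flags stored as single-bit fields. The struct and field names are hypothetical and are not the actual scsi_device layout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical device state: one bool (typically one byte) per flag. */
struct dev_flags_bool {
	bool removable;
	bool changed;
	bool busy;
	bool lockable;
	bool locked;
	bool borken;
	bool manage_start_stop;
	bool force_resume;	/* illustrative name for a newly added flag */
};

/* The same flags as single-bit fields sharing one unsigned int. */
struct dev_flags_bits {
	unsigned removable:1;
	unsigned changed:1;
	unsigned busy:1;
	unsigned lockable:1;
	unsigned locked:1;
	unsigned borken:1;
	unsigned manage_start_stop:1;
	unsigned force_resume:1;
};

/* Bitfield members have no address, so callers go through accessors. */
static inline void dev_set_busy(struct dev_flags_bits *f, bool on)
{
	f->busy = on;
}

static inline bool dev_is_busy(const struct dev_flags_bits *f)
{
	return f->busy;
}
```

On common ABIs the bitfield variant is smaller, at the cost that neighbouring bits share a storage unit and so cannot be updated independently without some form of serialization.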

* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
  scsi: sd: Fix system start for ATA devices
  scsi: Change SCSI device boolean fields to single bit flags
  scsi: ufs: core: Clear cmd if abort succeeds in MCQ mode
2023-12-02 06:27:20 +09:00
Linus Torvalds
e6861be452 Merge tag 'bcachefs-2023-11-29' of https://evilpiepirate.org/git/bcachefs
Pull more bcachefs bugfixes from Kent Overstreet:

 - bcache & bcachefs were broken with CFI enabled; patch for closures to
   fix type punning

 - mark erasure coding as extra-experimental; there are incompatible
   disk space accounting changes coming for erasure coding, and I'm
   still seeing checksum errors in some tests

 - several fixes for durability-related issues (durability is a device
   specific setting where we can tell bcachefs that data on a given
   device should be counted as replicated x times)

 - a fix for a rare livelock when a btree node merge then updates a
   parent node that is almost full

 - fix a race in the device removal path, where dropping a pointer in a
   btree node to a device would be clobbered by an in flight btree write
   updating the btree node key on completion

 - fix one SRCU lock hold time warning in the btree gc code - there's
   still a bunch more of these to fix

 - fix a rare race where we'd start copygc before initializing the "are
   we rw" percpu refcount; copygc would think we were already ro and die
   immediately

* tag 'bcachefs-2023-11-29' of https://evilpiepirate.org/git/bcachefs: (23 commits)
  bcachefs: Extra kthread_should_stop() calls for copygc
  bcachefs: Convert gc_alloc_start() to for_each_btree_key2()
  bcachefs: Fix race between btree writes and metadata drop
  bcachefs: move journal seq assertion
  bcachefs: -EROFS doesn't count as move_extent_start_fail
  bcachefs: trace_move_extent_start_fail() now includes errcode
  bcachefs: Fix split_race livelock
  bcachefs: Fix bucket data type for stripe buckets
  bcachefs: Add missing validation for jset_entry_data_usage
  bcachefs: Fix zstd compress workspace size
  bcachefs: bpos is misaligned on big endian
  bcachefs: Fix ec + durability calculation
  bcachefs: Data update path won't accidentaly grow replicas
  bcachefs: deallocate_extra_replicas()
  bcachefs: Proper refcounting for journal_keys
  bcachefs: preserve device path as device name
  bcachefs: Fix an endianness conversion
  bcachefs: Start gc, copygc, rebalance threads after initing writes ref
  bcachefs: Don't stop copygc thread on device resize
  bcachefs: Make sure bch2_move_ratelimit() also waits for move_ops
  ...
2023-12-02 06:02:16 +09:00
Rafael J. Wysocki
7d4c44a53d Merge branch 'acpi-tables'
Merge a fix for a recently introduced build issue on ARM32 platforms
caused by an inadvertent header file breakage (Dave Jiang).

* acpi-tables:
  ACPI: Fix ARM32 platforms compile issue introduced by fw_table changes
2023-12-01 21:32:19 +01:00
Masami Hiramatsu (Google)
a1461f1fd6 rethook: Use __rcu pointer for rethook::handler
Since rethook::handler is an RCU-managed pointer that lets readers
notice whether the rethook has been stopped (unregistered), it should
be an __rcu pointer and be accessed through the appropriate functions,
which use the appropriate memory barriers. OTOH, rethook::data is never
changed, so we don't need to check it in get_kretprobe().

NOTE: To avoid sparse warning, rethook::handler is defined by a raw
function pointer type with __rcu instead of rethook_handler_t.

Link: https://lore.kernel.org/all/170126066201.398836.837498688669005979.stgit@devnote2/

Fixes: 54ecbe6f1e ("rethook: Add a generic return hook")
Cc: stable@vger.kernel.org
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202311241808.rv9ceuAh-lkp@intel.com/
Tested-by: JP Kobryn <inwardvessel@gmail.com>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2023-12-01 14:53:56 +09:00
JP Kobryn
d839a656d0 kprobes: consistent rcu api usage for kretprobe holder
It seems that the pointer-to-kretprobe "rp" within the kretprobe_holder is
RCU-managed, based on the (non-rethook) implementation of get_kretprobe().
The thought behind this patch is to make use of the RCU API where possible
when accessing this pointer so that the needed barriers are always in place
and to self-document the code.

The __rcu annotation to "rp" allows for sparse RCU checking. Plain writes
done to the "rp" pointer are changed to make use of the RCU macro for
assignment. For the single read, the implementation of get_kretprobe()
is simplified by making use of an RCU macro which accomplishes the same,
but note that the log warning text will be more generic.

I did find that there is a difference in assembly generated between the
usage of the RCU macros vs without. For example, on arm64, when using
rcu_assign_pointer(), the corresponding store instruction is a
store-release (STLR) which has an implicit barrier. When normal assignment
is done, a regular store (STR) is found. In the macro case, this seems to
be a result of rcu_assign_pointer() using smp_store_release() when the
value to write is not NULL.
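
The store-release behaviour described above can be approximated in userspace with C11 atomics. This is only an analogue under stated assumptions: the struct names and helpers are hypothetical, and the kernel's rcu_assign_pointer()/rcu_dereference() are not implemented this way verbatim (the reader side here uses an acquire load as the closest portable equivalent).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical payload standing in for a kretprobe. */
struct probe {
	int id;
};

/* Hypothetical holder whose pointer is published to concurrent readers. */
struct holder {
	_Atomic(struct probe *) rp;
};

/*
 * Publish a pointer with release ordering: all earlier writes that
 * initialized *p become visible before the pointer itself does. On
 * arm64 a compiler typically emits STLR for this store.
 */
static void holder_publish(struct holder *h, struct probe *p)
{
	atomic_store_explicit(&h->rp, p, memory_order_release);
}

/* Read the pointer with acquire ordering; NULL means "unregistered". */
static struct probe *holder_get(struct holder *h)
{
	return atomic_load_explicit(&h->rp, memory_order_acquire);
}
```

A plain assignment to the pointer would compile to an ordinary store with no ordering guarantee, which is the difference in generated code that the message observes.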

Link: https://lore.kernel.org/all/20231122132058.3359-1-inwardvessel@gmail.com/

Fixes: d741bf41d7 ("kprobes: Remove kretprobe hash")
Cc: stable@vger.kernel.org
Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2023-12-01 14:53:55 +09:00
Linus Torvalds
994d5c58e5 Merge tag 'hardening-v6.7-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull hardening fixes from Kees Cook:

 - struct_group: propagate attributes to top-level union (Dmitry
   Antipov)

 - gcc-plugins: randstruct: Update code comment in relayout_struct
   (Gustavo A. R. Silva)

 - MAINTAINERS: refresh LLVM support (Nick Desaulniers)

* tag 'hardening-v6.7-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  gcc-plugins: randstruct: Update code comment in relayout_struct()
  uapi: propagate __struct_group() attributes to the container union
  MAINTAINERS: refresh LLVM support
2023-12-01 14:17:54 +09:00
Dave Airlie
908f606424 Merge tag 'amd-drm-fixes-6.7-2023-11-30' of https://gitlab.freedesktop.org/agd5f/linux into drm-fixes
amd-drm-fixes-6.7-2023-11-30:

amdgpu:
- DMUB fix
- DCN 3.5 fixes
- XGMI fix
- DCN 3.2 fixes
- Vangogh suspend fix
- NBIO 7.9 fix
- GFX11 golden register fix
- Backlight fix
- NBIO 7.11 fix
- IB test overflow fix
- DCN 3.1.4 fixes
- fix a runtime pm ref count
- Retimer fix
- ABM fix
- DCN 3.1.5 fix
- Fix AGP addressing
- Fix possible memory leak in SMU error path
- Make sure PME is enabled in D3
- Fix possible NULL pointer dereference in debugfs
- EEPROM fix
- GC 9.4.3 fix

amdkfd:
- IP version check fix
- Fix memory leak in pqm_uninit()

drm:
- Revert unexport of prime helpers for fd/handle conversion

Signed-off-by: Dave Airlie <airlied@redhat.com>

From: Alex Deucher <alexander.deucher@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20231130213135.5083-1-alexander.deucher@amd.com
2023-12-01 13:57:11 +10:00
Linus Torvalds
6172a5180f Merge tag 'net-6.7-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
 "Including fixes from bpf and wifi.

  Current release - regressions:

   - neighbour: fix __randomize_layout crash in struct neighbour

   - r8169: fix deadlock on RTL8125 in jumbo mtu mode

  Previous releases - regressions:

   - wifi:
       - mac80211: fix warning at station removal time
       - cfg80211: fix CQM for non-range use

   - tools: ynl-gen: fix unexpected response handling

   - octeontx2-af: fix possible buffer overflow

   - dpaa2: recycle the RX buffer only after all processing done

   - rswitch: fix missing dev_kfree_skb_any() in error path

  Previous releases - always broken:

   - ipv4: fix uaf issue when receiving igmp query packet

   - wifi: mac80211: fix debugfs deadlock at device removal time

   - bpf:
       - sockmap: af_unix stream sockets need to hold ref for pair sock
       - netdevsim: don't accept device bound programs

   - selftests: fix a char signedness issue

   - dsa: mv88e6xxx: fix marvell 6350 probe crash

   - octeontx2-pf: restore TC ingress police rules when interface is up

   - wangxun: fix memory leak on msix entry

   - ravb: keep reverse order of operations in ravb_remove()"

* tag 'net-6.7-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (51 commits)
  net: ravb: Keep reverse order of operations in ravb_remove()
  net: ravb: Stop DMA in case of failures on ravb_open()
  net: ravb: Start TX queues after HW initialization succeeded
  net: ravb: Make write access to CXR35 first before accessing other EMAC registers
  net: ravb: Use pm_runtime_resume_and_get()
  net: ravb: Check return value of reset_control_deassert()
  net: libwx: fix memory leak on msix entry
  ice: Fix VF Reset paths when interface in a failed over aggregate
  bpf, sockmap: Add af_unix test with both sockets in map
  bpf, sockmap: af_unix stream sockets need to hold ref for pair sock
  tools: ynl-gen: always construct struct ynl_req_state
  ethtool: don't propagate EOPNOTSUPP from dumps
  ravb: Fix races between ravb_tx_timeout_work() and net related ops
  r8169: prevent potential deadlock in rtl8169_close
  r8169: fix deadlock on RTL8125 in jumbo mtu mode
  neighbour: Fix __randomize_layout crash in struct neighbour
  octeontx2-pf: Restore TC ingress police rules when interface is up
  octeontx2-pf: Fix adding mbox work queue entry when num_vfs > 64
  net: stmmac: xgmac: Disable FPE MMC interrupts
  octeontx2-af: Fix possible buffer overflow
  ...
2023-12-01 08:24:46 +09:00
Dave Airlie
a74229bcaf Merge tag 'drm-misc-fixes-2023-11-29' of git://anongit.freedesktop.org/drm/drm-misc into drm-fixes
Fixes for v6.7-rc4:
- Revert panel fixes as they require exporting device_is_dependent.
- Do not double add fences in dma_resv_add_fence.
- Fix GPUVM license identifier.
- Assorted nouveau fixes.
- Fix error check for nt36523.

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/561f807e-f9d3-43c1-80d3-8b41ba83c9ec@linux.intel.com
2023-12-01 08:05:31 +10:00
Felix Kuehling
0514f63cff Revert "drm/prime: Unexport helpers for fd/handle conversion"
This reverts commit 71a7974ac7.

These helper functions are needed for KFD to export and import DMABufs
the right way without duplicating the tracking of DMABufs associated with
GEM objects while ensuring that move notifier callbacks are working as
intended.

CC: Christian König <christian.koenig@amd.com>
CC: Thomas Zimmermann <tzimmermann@suse.de>
Acked-by: Thomas Zimmermann <tzimmermann@suse.de>
Acked-by: Daniel Vetter <daniel@ffwll.ch>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-11-30 14:46:04 -05:00
Sean Christopherson
4ea95c04fa vfio: Drop vfio_file_iommu_group() stub to fudge around a KVM wart
Drop the vfio_file_iommu_group() stub and instead unconditionally declare
the function to fudge around a KVM wart where KVM tries to do symbol_get()
on vfio_file_iommu_group() (and other VFIO symbols) even if CONFIG_VFIO=n.

Ensuring the symbol is always declared fixes a PPC build error when
modules are also disabled, in which case symbol_get() simply points at the
address of the symbol (with some attributes shenanigans).  Because KVM
does symbol_get() instead of directly depending on VFIO, the lack of a
fully defined symbol is not problematic (ugly, but "fine").

   arch/powerpc/kvm/../../../virt/kvm/vfio.c:89:7:
   error: attribute declaration must precede definition [-Werror,-Wignored-attributes]
           fn = symbol_get(vfio_file_iommu_group);
                ^
   include/linux/module.h:805:60: note: expanded from macro 'symbol_get'
   #define symbol_get(x) ({ extern typeof(x) x __attribute__((weak,visibility("hidden"))); &(x); })
                                                              ^
   include/linux/vfio.h:294:35: note: previous definition is here
   static inline struct iommu_group *vfio_file_iommu_group(struct file *file)
                                     ^
   arch/powerpc/kvm/../../../virt/kvm/vfio.c:89:7:
   error: attribute declaration must precede definition [-Werror,-Wignored-attributes]
           fn = symbol_get(vfio_file_iommu_group);
                ^
   include/linux/module.h:805:65: note: expanded from macro 'symbol_get'
   #define symbol_get(x) ({ extern typeof(x) x __attribute__((weak,visibility("hidden"))); &(x); })
                                                                   ^
   include/linux/vfio.h:294:35: note: previous definition is here
   static inline struct iommu_group *vfio_file_iommu_group(struct file *file)
                                     ^
   2 errors generated.

Although KVM is firmly in the wrong (there is zero reason for KVM to build
virt/kvm/vfio.c when VFIO is disabled), fudge around the error in VFIO as
the stub is unnecessary and doesn't serve its intended purpose (KVM is the
only external user of vfio_file_iommu_group()), and there is an in-flight
series to clean up the entire KVM<->VFIO interaction, i.e. fixing this in
KVM would result in more churn in the long run, and the stub needs to go
away regardless.

Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202308251949.5IiaV0sz-lkp@intel.com
Closes: https://lore.kernel.org/oe-kbuild-all/202309030741.82aLACDG-lkp@intel.com
Closes: https://lore.kernel.org/oe-kbuild-all/202309110914.QLH0LU6L-lkp@intel.com
Link: https://lore.kernel.org/all/0-v1-08396538817d+13c5-vfio_kvm_kconfig_jgg@nvidia.com
Link: https://lore.kernel.org/all/20230916003118.2540661-1-seanjc@google.com
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Tested-by: Michael Ellerman <mpe@ellerman.id.au>
Fixes: c1cce6d079 ("vfio: Compile vfio_group infrastructure optionally")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20231130001000.543240-1-seanjc@google.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2023-11-30 11:27:17 -07:00
Jakub Kicinski
300fbb247e Merge tag 'wireless-2023-11-29' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless
Johannes Berg says:

====================
wireless fixes:
 - debugfs had a deadlock (removal vs. use of files),
   fixes going through wireless ACKed by Greg
 - support for HT STAs on 320 MHz channels, even if it's
   not clear that should ever happen (that's 6 GHz), best
   not to WARN()
 - fix for the previous CQM fix that broke most cases
 - various wiphy locking fixes
 - various small driver fixes

* tag 'wireless-2023-11-29' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless:
  wifi: mac80211: use wiphy locked debugfs for sdata/link
  wifi: mac80211: use wiphy locked debugfs helpers for agg_status
  wifi: cfg80211: add locked debugfs wrappers
  debugfs: add API to allow debugfs operations cancellation
  debugfs: annotate debugfs handlers vs. removal with lockdep
  debugfs: fix automount d_fsdata usage
  wifi: mac80211: handle 320 MHz in ieee80211_ht_cap_ie_to_sta_ht_cap
  wifi: avoid offset calculation on NULL pointer
  wifi: cfg80211: hold wiphy mutex for send_interface
  wifi: cfg80211: lock wiphy mutex for rfkill poll
  wifi: cfg80211: fix CQM for non-range use
  wifi: mac80211: do not pass AP_VLAN vif pointer to drivers during flush
  wifi: iwlwifi: mvm: fix an error code in iwl_mvm_mld_add_sta()
  wifi: mt76: mt7925: fix typo in mt7925_init_he_caps
  wifi: mt76: mt7921: fix 6GHz disabled by the missing default CLC config
====================

Link: https://lore.kernel.org/r/20231129150809.31083-3-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-11-29 19:43:34 -08:00
Jakub Kicinski
0d47fa5cc9 Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:

====================
pull-request: bpf 2023-11-30

We've added 5 non-merge commits during the last 7 day(s) which contain
a total of 10 files changed, 66 insertions(+), 15 deletions(-).

The main changes are:

1) Fix AF_UNIX splat from use after free in BPF sockmap,
   from John Fastabend.

2) Fix a syzkaller splat in netdevsim by properly handling offloaded
   programs (and not device-bound ones), from Stanislav Fomichev.

3) Fix bpf_mem_cache_alloc_flags() to initialize the allocation hint,
   from Hou Tao.

4) Fix netkit by rejecting IFLA_NETKIT_PEER_INFO in changelink,
   from Daniel Borkmann.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  bpf, sockmap: Add af_unix test with both sockets in map
  bpf, sockmap: af_unix stream sockets need to hold ref for pair sock
  netkit: Reject IFLA_NETKIT_PEER_INFO in netkit_change_link
  bpf: Add missed allocation hint for bpf_mem_cache_alloc_flags()
  netdevsim: Don't accept device bound programs
====================

Link: https://lore.kernel.org/r/20231129234916.16128-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-11-29 19:40:04 -08:00
John Fastabend
8866730aed bpf, sockmap: af_unix stream sockets need to hold ref for pair sock
AF_UNIX stream sockets come in pairs, so sending on one socket of a pair
looks up the paired socket as part of the send operation. It is possible,
however, to put just one socket of the pair in a BPF map. This currently increments
the refcnt on the sock in the sockmap to ensure it is not free'd by the
stack before sockmap cleans up its state and stops any skbs being sent/recv'd
to that socket.

But we missed a case. If the peer socket is closed it will be free'd by the
stack. However, the paired socket can still be referenced from BPF sockmap
side because we hold a reference there. Then if we are sending traffic through
BPF sockmap to that socket it will try to dereference the free'd pair in its
send logic, creating a use after free and the following splat:

   [59.900375] BUG: KASAN: slab-use-after-free in sk_wake_async+0x31/0x1b0
   [59.901211] Read of size 8 at addr ffff88811acbf060 by task kworker/1:2/954
   [...]
   [59.905468] Call Trace:
   [59.905787]  <TASK>
   [59.906066]  dump_stack_lvl+0x130/0x1d0
   [59.908877]  print_report+0x16f/0x740
   [59.910629]  kasan_report+0x118/0x160
   [59.912576]  sk_wake_async+0x31/0x1b0
   [59.913554]  sock_def_readable+0x156/0x2a0
   [59.914060]  unix_stream_sendmsg+0x3f9/0x12a0
   [59.916398]  sock_sendmsg+0x20e/0x250
   [59.916854]  skb_send_sock+0x236/0xac0
   [59.920527]  sk_psock_backlog+0x287/0xaa0

To fix this, let BPF sockmap hold a refcnt on both the socket in the sockmap and its
paired socket. It wasn't obvious how to contain the fix to bpf_unix logic. The
primary problem with keeping this logic in bpf_unix was: In the sock close()
we could handle the deref by having a close handler. But, when we are destroying
the psock through a map delete operation we wouldn't have gotten any signal
through the proto struct other than it being replaced. If we do the deref from
the proto replace it's too early because we need to deref the sk_pair after the
backlog worker has been stopped.

Given all this it seems best to just cache it at the end of the psock and eat 8B
for the af_unix and vsock users. Notice dgram sockets are OK because they handle
locking already.

Fixes: 94531cfcbe ("af_unix: Add unix_stream_proto for sockmap")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20231129012557.95371-2-john.fastabend@gmail.com
2023-11-30 00:25:16 +01:00
Wyes Karny
febab20cae cpufreq/amd-pstate: Fix scaling_min_freq and scaling_max_freq update
When amd_pstate is running, writing to scaling_min_freq and
scaling_max_freq has no effect. These values are only passed to the
policy level, but not to the platform level. This means that the
platform does not know about the frequency limits set by the user.

To fix this, update the min_perf and max_perf values at the platform
level whenever the user changes the scaling_min_freq and scaling_max_freq
values.

Fixes: ffa5096a7c ("cpufreq: amd-pstate: implement Pstate EPP support for the AMD processors")
Acked-by: Huang Rui <ray.huang@amd.com>
Signed-off-by: Wyes Karny <wyes.karny@amd.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2023-11-29 17:40:16 +01:00
Yu Kuai
67d995e069 block: warn once for each partition in bio_check_ro()
Commit 1b0a151c10 ("blk-core: use pr_warn_ratelimited() in
bio_check_ro()") fixed a message storm by limiting the rate; however, there
will still be lots of messages in the long term. Fix it better by warning
once for each partition.
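
A minimal sketch of the warn-once-per-partition idea, assuming a per-partition flag that records whether the warning has already been printed. The struct, field, and function names here are hypothetical and do not reproduce the actual block layer code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical partition descriptor with a "already warned" flag. */
struct part_sketch {
	const char *name;
	bool read_only;
	bool ro_warned;
};

/*
 * Check a write against a partition. Returns true if a warning was
 * emitted, so each read-only partition warns at most once no matter
 * how many writes follow.
 */
static bool check_ro_write(struct part_sketch *p)
{
	if (!p->read_only || p->ro_warned)
		return false;

	p->ro_warned = true;
	fprintf(stderr, "Trying to write to read-only block-device %s\n",
		p->name);
	return true;
}
```

Unlike rate limiting, which still lets a message through every interval, the flag bounds the total number of messages by the number of partitions.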

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20231128123027.971610-3-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-11-28 12:11:08 -07:00
Ming Lei
fad907cffd block: move .bd_inode into 1st cacheline of block_device
The .bd_inode field of block_device is used in the IO fast path of
blkdev_write_iter() and blkdev_llseek(), so it is more efficient to keep
it in the 1st cacheline.

.bd_openers is only touched in open()/close(), and .bd_size_lock is only
for updating bdev capacity, which is in slow path too.

So swap .bd_inode layout with .bd_openers & .bd_size_lock to move
.bd_inode into the 1st cache line.
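
The layout argument above can be checked at compile time with offsetof(). The sketch below uses a hypothetical struct (not the real block_device) and assumes a 64-byte cache line; actual line sizes are per-architecture.

```c
#include <assert.h>
#include <stddef.h>

#define CACHELINE_BYTES 64	/* assumed line size for this sketch */

/* Hypothetical layout: hot fields first, slow-path fields later. */
struct bdev_sketch {
	unsigned long start_sect;
	unsigned long nr_sectors;
	void *inode;		/* read in the IO fast path */
	void *queue;
	char cold_padding[64];	/* stands in for other colder members */
	int openers;		/* only touched in open()/close() */
	int size_lock;		/* only taken when capacity changes */
};

/* True if a member at 'offset' of 'size' bytes fits in the first line. */
static inline int in_first_cacheline(size_t offset, size_t size)
{
	return offset + size <= CACHELINE_BYTES;
}

/* Fail the build if a later reshuffle pushes the hot field out. */
static_assert(offsetof(struct bdev_sketch, inode) + sizeof(void *)
	      <= CACHELINE_BYTES,
	      "inode must stay in the first cache line");
```

A compile-time assertion like this documents the intent of the field ordering and catches accidental regressions when members are added in front of the hot field.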

Cc: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20231128123027.971610-2-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-11-28 12:11:08 -07:00
Jens Axboe
c392cbecd8 io_uring/kbuf: defer release of mapped buffer rings
If a provided buffer ring is set up with IOU_PBUF_RING_MMAP, then the
kernel allocates the memory for it and the application is expected to
mmap(2) this memory. However, io_uring uses remap_pfn_range() for this
operation, so we cannot rely on normal munmap/release to free them
for us.

Stash an io_buf_free entry away for each of these, if any, and provide
a helper to free them post ->release().

Cc: stable@vger.kernel.org
Fixes: c56e022c0a ("io_uring: add support for user mapped provided buffer ring")
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-11-28 07:56:16 -07:00
Gustavo A. R. Silva
45b3fae467 neighbour: Fix __randomize_layout crash in struct neighbour
Previously, one-element and zero-length arrays were treated as true
flexible arrays, even though they are actually "fake" flex arrays.
The __randomize_layout would leave them untouched at the end of the
struct, similarly to proper C99 flex-array members.

However, this approach changed with commit 1ee60356c2 ("gcc-plugins:
randstruct: Only warn about true flexible arrays"). Now, only C99
flexible-array members will remain untouched at the end of the struct,
while one-element and zero-length arrays will be subject to randomization.

Fix a `__randomize_layout` crash in `struct neighbour` by transforming
zero-length array `primary_key` into a proper C99 flexible-array member.
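
For reference, a proper C99 flexible-array member looks like the sketch below. The struct is a hypothetical stand-in, not the real struct neighbour, and the allocation helper is illustrative only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a struct with a trailing variable-size key. */
struct neigh_sketch {
	int refcnt;
	unsigned char key_len;
	/*
	 * C99 flexible-array member: declared with [] and required to be
	 * the last member. A zero-length "[0]" array is a GNU extension
	 * that the compiler does not treat as a true flexible array.
	 */
	unsigned char primary_key[];
};

/* Allocate the header plus key_len trailing bytes and copy the key in. */
static struct neigh_sketch *neigh_sketch_alloc(const void *key,
					       unsigned char key_len)
{
	struct neigh_sketch *n = malloc(sizeof(*n) + key_len);

	if (!n)
		return NULL;
	n->refcnt = 1;
	n->key_len = key_len;
	memcpy(n->primary_key, key, key_len);
	return n;
}
```

Because the language guarantees that a flexible-array member is last, tooling that reorders members can recognize it and leave it in place.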

Fixes: 1ee60356c2 ("gcc-plugins: randstruct: Only warn about true flexible arrays")
Closes: https://lore.kernel.org/linux-hardening/20231124102458.GB1503258@e124191.cambridge.arm.com/
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Tested-by: Joey Gouly <joey.gouly@arm.com>
Link: https://lore.kernel.org/r/ZWJoRsJGnCPdJ3+2@work
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-11-28 12:18:29 +01:00
Thomas Hellström
b9c02e1052 drm/gpuvm: Fix deprecated license identifier
"GPL-2.0-only" in the license header was incorrectly changed to the
now deprecated "GPL-2.0". Fix.

Cc: Maxime Ripard <mripard@kernel.org>
Cc: Danilo Krummrich <dakr@redhat.com>
Reported-by: David Edelsohn <dje.gcc@gmail.com>
Closes: https://lore.kernel.org/dri-devel/5lfrhdpkwhpgzipgngojs3tyqfqbesifzu5nf4l5q3nhfdhcf2@25nmiq7tfrew/T/#m5c356d68815711eea30dd94cc6f7ea8cd4344fe3
Fixes: f7749a549b ("drm/gpuvm: Dual-licence the drm_gpuvm code GPL-2.0 OR MIT")
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Acked-by: Maxime Ripard <mripard@kernel.org>
Acked-by: Danilo Krummrich <dakr@redhat.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20231106114827.62492-1-thomas.hellstrom@linux.intel.com
2023-11-28 11:19:26 +01:00
Linus Torvalds
d095b18f3e Merge tag 'media/v6.7-2' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media
Pull media fixes from Mauro Carvalho Chehab.

* tag 'media/v6.7-2' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media:
  media: pci: mgb4: add COMMON_CLK dependency
  media: v4l2-subdev: Fix a 64bit bug
  media: mgb4: Added support for T200 card variant
  media: vsp1: Remove unbalanced .s_stream(0) calls
2023-11-27 16:26:10 -08:00
Dmitry Antipov
4e86f32a13 uapi: propagate __struct_group() attributes to the container union
Recently the kernel test robot has reported an ARM-specific BUILD_BUG_ON()
in an old and unmaintained wil6210 wireless driver. The problem comes from
the structure packing rules of old ARM ABI ('-mabi=apcs-gnu'). For example,
the following structure is packed to 18 bytes instead of 16:

struct poorly_packed {
        unsigned int a;
        unsigned int b;
        unsigned short c;
        union {
                struct {
                        unsigned short d;
                        unsigned int e;
                } __attribute__((packed));
                struct {
                        unsigned short d;
                        unsigned int e;
                } __attribute__((packed)) inner;
        };
} __attribute__((packed));

To fit it into 16 bytes, the packed attribute must be added to the
container union as well:

struct poorly_packed {
        unsigned int a;
        unsigned int b;
        unsigned short c;
        union {
                struct {
                        unsigned short d;
                        unsigned int e;
                } __attribute__((packed));
                struct {
                        unsigned short d;
                        unsigned int e;
                } __attribute__((packed)) inner;
        } __attribute__((packed));
} __attribute__((packed));

Thanks to Andrew Pinski of the GCC team for sorting things out at
https://gcc.gnu.org/pipermail/gcc/2023-November/242888.html.
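
The expected size can be pinned down with a compile-time check. The sketch below repeats the fully packed layout under a different, hypothetical struct name and assumes a 4-byte unsigned int and a 2-byte unsigned short.

```c
#include <assert.h>
#include <stddef.h>

/* Same shape as the second example: the container union is packed too. */
struct fully_packed {
	unsigned int a;
	unsigned int b;
	unsigned short c;
	union {
		struct {
			unsigned short d;
			unsigned int e;
		} __attribute__((packed));
		struct {
			unsigned short d;
			unsigned int e;
		} __attribute__((packed)) inner;
	} __attribute__((packed));
} __attribute__((packed));

/* 4 + 4 + 2 bytes of header plus a 6-byte union, with no padding. */
static_assert(sizeof(struct fully_packed) == 16,
	      "packed layout should be exactly 16 bytes");

/* The anonymous and the named view must alias the same bytes. */
static_assert(offsetof(struct fully_packed, d) ==
	      offsetof(struct fully_packed, inner),
	      "both views of the group start at the same offset");
```

With the union packed as well, the size no longer depends on the ABI's structure alignment rules, which is what the old ARM ABI case tripped over.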

Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202311150821.cI4yciFE-lkp@intel.com
Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru>
Link: https://lore.kernel.org/r/20231120110607.98956-1-dmantipov@yandex.ru
Fixes: 50d7bd38c3 ("stddef: Introduce struct_group() helper macro")
Signed-off-by: Kees Cook <keescook@chromium.org>
2023-11-27 16:24:56 -08:00