linux

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2026-05-23 02:05:46 -04:00

Author	SHA1	Message	Date
Christoph Hellwig	8d4bb46ba7	ntfs3: stop using write_cache_pages Patch series "remove write_cache_pages()". Kill off write_cache_pages() after converting the last two users to the iterator. This patch (of 3): Stop using the obsolete write_cache_pages and use writeback_iter directly. Link: https://lkml.kernel.org/r/20250818061017.1526853-1-hch@lst.de Link: https://lkml.kernel.org/r/20250818061017.1526853-2-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Konstantin Komarov <almaz.alexandrovich@paragon-software.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-13 16:55:13 -07:00
Matthew Wilcox (Oracle)	53fbef56e0	mm: introduce memdesc_flags_t Patch series "Add and use memdesc_flags_t". At some point struct page will be separated from struct slab and struct folio. This is a step towards that by introducing a type for the 'flags' word of all three structures. This gives us a certain amount of type safety by establishing that some of these unsigned longs are different from other unsigned longs in that they contain things like node ID, section number and zone number in the upper bits. That lets us have functions that can be easily called by anyone who has a slab, folio or page (but not easily by anyone else) to get the node or zone. There's going to be some unusual merge problems with this as some odd bits of the kernel decide they want to print out the flags value or something similar by writing page->flags and now they'll need to write page->flags.f instead. That's most of the churn here. Maybe we should be removing these things from the debug output? This patch (of 11): Wrap the unsigned long flags in a typedef. In upcoming patches, this will provide a strong hint that you can't just pass a random unsigned long to functions which take this as an argument. [willy@infradead.org: s/flags/flags.f/ in several architectures] Link: https://lkml.kernel.org/r/aKMgPRLD-WnkPxYm@casper.infradead.org [nicola.vetrini@gmail.com: mips: fix compilation error] Link: https://lore.kernel.org/lkml/CA+G9fYvkpmqGr6wjBNHY=dRp71PLCoi2341JxOudi60yqaeUdg@mail.gmail.com/ Link: https://lkml.kernel.org/r/20250825214245.1838158-1-nicola.vetrini@gmail.com Link: https://lkml.kernel.org/r/20250805172307.1302730-1-willy@infradead.org Link: https://lkml.kernel.org/r/20250805172307.1302730-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Zi Yan <ziy@nvidia.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-13 16:55:07 -07:00
David Hildenbrand	1f1c061089	mm/huge_memory: convert "tva_flags" to "enum tva_type" When determining which THP orders are eligible for a VMA mapping, we have previously specified tva_flags, however it turns out it is really not necessary to treat these as flags. Rather, we distinguish between distinct modes. The only case where we previously combined flags was with TVA_ENFORCE_SYSFS, but we can avoid this by observing that this is the default, except for MADV_COLLAPSE or an edge cases in collapse_pte_mapped_thp() and hugepage_vma_revalidate(), and adding a mode specifically for this case - TVA_FORCED_COLLAPSE. We have: * smaps handling for showing "THPeligible" * Pagefault handling * khugepaged handling * Forced collapse handling: primarily MADV_COLLAPSE, but also for an edge case in collapse_pte_mapped_thp() Disregarding the edge cases, we only want to ignore sysfs settings only when we are forcing a collapse through MADV_COLLAPSE, otherwise we want to enforce it, hence this patch does the following flag to enum conversions: * TVA_SMAPS \| TVA_ENFORCE_SYSFS -> TVA_SMAPS * TVA_IN_PF \| TVA_ENFORCE_SYSFS -> TVA_PAGEFAULT * TVA_ENFORCE_SYSFS -> TVA_KHUGEPAGED * 0 -> TVA_FORCED_COLLAPSE With this change, we immediately know if we are in the forced collapse case, which will be valuable next. Link: https://lkml.kernel.org/r/20250815135549.130506-3-usamaarif642@gmail.com Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Usama Arif <usamaarif642@gmail.com> Acked-by: Usama Arif <usamaarif642@gmail.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Barry Song <baohua@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Mariano Pache <npache@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Rik van Riel <riel@surriel.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: SeongJae Park <sj@kernel.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yafang <laoar.shao@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-13 16:55:05 -07:00
David Hildenbrand	9dc21bbd62	prctl: extend PR_SET_THP_DISABLE to optionally exclude VM_HUGEPAGE Patch series "prctl: extend PR_SET_THP_DISABLE to only provide THPs when advised", v5. This will allow individual processes to opt-out of THP = "always" into THP = "madvise", without affecting other workloads on the system. This has been extensively discussed on the mailing list and has been summarized very well by David in the first patch which also includes the links to alternatives, please refer to the first patch commit message for the motivation for this series. Patch 1 adds the PR_THP_DISABLE_EXCEPT_ADVISED flag to implement this, along with the MMF changes. Patch 2 is a cleanup patch for tva_flags that will allow the forced collapse case to be transmitted to vma_thp_disabled (which is done in patch 3). Patch 4 adds documentation for PR_SET_THP_DISABLE/PR_GET_THP_DISABLE. Patches 6-7 implement the selftests for PR_SET_THP_DISABLE for completely disabling THPs (old behaviour) and only enabling it at advise (PR_THP_DISABLE_EXCEPT_ADVISED). This patch (of 7): People want to make use of more THPs, for example, moving from the "never" system policy to "madvise", or from "madvise" to "always". While this is great news for every THP desperately waiting to get allocated out there, apparently there are some workloads that require a bit of care during that transition: individual processes may need to opt-out from this behavior for various reasons, and this should be permitted without needing to make all other workloads on the system similarly opt-out. The following scenarios are imaginable: (1) Switch from "none" system policy to "madvise"/"always", but keep THPs disabled for selected workloads. (2) Stay at "none" system policy, but enable THPs for selected workloads, making only these workloads use the "madvise" or "always" policy. (3) Switch from "madvise" system policy to "always", but keep the "madvise" policy for selected workloads: allocate THPs only when advised. (4) Stay at "madvise" system policy, but enable THPs even when not advised for selected workloads -- "always" policy. Once can emulate (2) through (1), by setting the system policy to "madvise"/"always" while disabling THPs for all processes that don't want THPs. It requires configuring all workloads, but that is a user-space problem to sort out. (4) can be emulated through (3) in a similar way. Back when (1) was relevant in the past, as people started enabling THPs, we added PR_SET_THP_DISABLE, so relevant workloads that were not ready yet (i.e., used by Redis) were able to just disable THPs completely. Redis still implements the option to use this interface to disable THPs completely. With PR_SET_THP_DISABLE, we added a way to force-disable THPs for a workload -- a process, including fork+exec'ed process hierarchy. That essentially made us support (1): simply disable THPs for all workloads that are not ready for THPs yet, while still enabling THPs system-wide. The quest for handling (3) and (4) started, but current approaches (completely new prctl, options to set other policies per process, alternatives to prctl -- mctrl, cgroup handling) don't look particularly promising. Likely, the future will use bpf or something similar to implement better policies, in particular to also make better decisions about THP sizes to use, but this will certainly take a while as that work just started. Long story short: a simple enable/disable is not really suitable for the future, so we're not willing to add completely new toggles. While we could emulate (3)+(4) through (1)+(2) by simply disabling THPs completely for these processes, this is a step backwards, because these processes can no longer allocate THPs in regions where THPs were explicitly advised: regions flagged as VM_HUGEPAGE. Apparently, that imposes a problem for relevant workloads, because "not THPs" is certainly worse than "THPs only when advised". Could we simply relax PR_SET_THP_DISABLE, to "disable THPs unless not explicitly advised by the app through MAD_HUGEPAGE"? maybe, but this would change the documented semantics quite a bit, and the versatility to use it for debugging purposes, so I am not 100% sure that is what we want -- although it would certainly be much easier. So instead, as an easy way forward for (3) and (4), add an option to make PR_SET_THP_DISABLE disable less THPs for a process. In essence, this patch: (A) Adds PR_THP_DISABLE_EXCEPT_ADVISED, to be used as a flag in arg3 of prctl(PR_SET_THP_DISABLE) when disabling THPs (arg2 != 0). prctl(PR_SET_THP_DISABLE, 1, PR_THP_DISABLE_EXCEPT_ADVISED). (B) Makes prctl(PR_GET_THP_DISABLE) return 3 if PR_THP_DISABLE_EXCEPT_ADVISED was set while disabling. Previously, it would return 1 if THPs were disabled completely. Now it returns the set flags as well: 3 if PR_THP_DISABLE_EXCEPT_ADVISED was set. (C) Renames MMF_DISABLE_THP to MMF_DISABLE_THP_COMPLETELY, to express the semantics clearly. Fortunately, there are only two instances outside of prctl() code. (D) Adds MMF_DISABLE_THP_EXCEPT_ADVISED to express "no THP except for VMAs with VM_HUGEPAGE" -- essentially "thp=madvise" behavior Fortunately, we only have to extend vma_thp_disabled(). (E) Indicates "THP_enabled: 0" in /proc/pid/status only if THPs are disabled completely Only indicating that THPs are disabled when they are really disabled completely, not only partially. For now, we don't add another interface to obtained whether THPs are disabled partially (PR_THP_DISABLE_EXCEPT_ADVISED was set). If ever required, we could add a new entry. The documented semantics in the man page for PR_SET_THP_DISABLE "is inherited by a child created via fork(2) and is preserved across execve(2)" is maintained. This behavior, for example, allows for disabling THPs for a workload through the launching process (e.g., systemd where we fork() a helper process to then exec()). For now, MADV_COLLAPSE will fail in regions without VM_HUGEPAGE and VM_NOHUGEPAGE. As MADV_COLLAPSE is a clear advise that user space thinks a THP is a good idea, we'll enable that separately next (requiring a bit of cleanup first). There is currently not way to prevent that a process will not issue PR_SET_THP_DISABLE itself to re-enable THP. There are not really known users for re-enabling it, and it's against the purpose of the original interface. So if ever required, we could investigate just forbidding to re-enable them, or make this somehow configurable. Link: https://lkml.kernel.org/r/20250815135549.130506-1-usamaarif642@gmail.com Link: https://lkml.kernel.org/r/20250815135549.130506-2-usamaarif642@gmail.com Acked-by: Zi Yan <ziy@nvidia.com> Acked-by: Usama Arif <usamaarif642@gmail.com> Tested-by: Usama Arif <usamaarif642@gmail.com> Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Usama Arif <usamaarif642@gmail.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Mariano Pache <npache@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Rik van Riel <riel@surriel.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: SeongJae Park <sj@kernel.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yafang <laoar.shao@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-13 16:55:05 -07:00
Lorenzo Stoakes	d14d3f535e	mm: convert remaining users to mm_flags_() accessors As part of the effort to move to mm->flags becoming a bitmap field, convert existing users to making use of the mm_flags_() accessors which will, when the conversion is complete, be the only means of accessing mm_struct flags. No functional change intended. Link: https://lkml.kernel.org/r/cc67a56f9a8746a8ec7d9791853dc892c1c33e0b.1755012943.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Ben Segall <bsegall@google.com> Cc: Borislav Betkov <bp@alien8.de> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christian Brauner <brauner@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: David S. Miller <davem@davemloft.net> Cc: Dev Jain <dev.jain@arm.com> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ian Rogers <irogers@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jiri Olsa <jolsa@kernel.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Kees Cook <kees@kernel.org> Cc: Marc Rutland <mark.rutland@arm.com> Cc: Mariano Pache <npache@redhat.com> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mel Gorman <mgorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Namhyung kim <namhyung@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: xu xin <xu.xin16@zte.com.cn> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-13 16:54:58 -07:00
Lorenzo Stoakes	39f8049cd4	mm: update coredump logic to correctly use bitmap mm flags The coredump logic is slightly different from other users in that it both stores mm flags and additionally sets and gets using masks. Since the MMF_DUMPABLE_* flags must remain as they are for uABI reasons, and of course these are within the first 32-bits of the flags, it is reasonable to provide access to these in the same fashion so this logic can all still keep working as it has been. Therefore, introduce coredump-specific helpers __mm_flags_get_dumpable() and __mm_flags_set_mask_dumpable() for this purpose, and update all core dump users of mm flags to use these. [lorenzo.stoakes@oracle.com: abstract set_mask_bits() invocation to mm_types.h to satisfy ARC] Link: https://lkml.kernel.org/r/0e7ad263-1ff7-446d-81fe-97cff9c0e7ed@lucifer.local Link: https://lkml.kernel.org/r/2a5075f7e3c5b367d988178c79a3063d12ee53a9.1755012943.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Christian Brauner <brauner@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Ben Segall <bsegall@google.com> Cc: Borislav Betkov <bp@alien8.de> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: David Rientjes <rientjes@google.com> Cc: David S. Miller <davem@davemloft.net> Cc: Dev Jain <dev.jain@arm.com> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ian Rogers <irogers@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jiri Olsa <jolsa@kernel.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Kees Cook <kees@kernel.org> Cc: Marc Rutland <mark.rutland@arm.com> Cc: Mariano Pache <npache@redhat.com> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mel Gorman <mgorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Namhyung kim <namhyung@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: xu xin <xu.xin16@zte.com.cn> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-13 16:54:57 -07:00
David Hildenbrand	b0f86aaebe	fs/dax: use vmf_insert_folio_pmd() to insert the huge zero folio Let's convert to vmf_insert_folio_pmd(). There is a theoretical change in behavior: in the unlikely case there is already something mapped, we'll now still call trace_dax_pmd_load_hole() and return VM_FAULT_NOPAGE. Previously, we would have returned VM_FAULT_FALLBACK, and the caller would have zapped the PMD to try a PTE fault. However, that behavior was different to other PTE+PMD faults, when there would already be something mapped, and it's not even clear if it could be triggered. Assuming the huge zero folio is already mapped, all good, no need to fallback to PTEs. Assuming there is already a leaf page table ... the behavior would be just like when trying to insert a PMD mapping a folio through dax_fault_iter()->vmf_insert_folio_pmd(). Assuming there is already something else mapped as PMD? It sounds like a BUG, and the behavior would be just like when trying to insert a PMD mapping a folio through dax_fault_iter()->vmf_insert_folio_pmd(). So, it sounds reasonable to not handle huge zero folios differently to inserting PMDs mapping folios when there already is something mapped. Link: https://lkml.kernel.org/r/20250811112631.759341-5-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Vrabel <david.vrabel@citrix.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Juegren Gross <jgross@suse.com> Cc: Lance Yang <lance.yang@linux.dev> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Mariano Pache <npache@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-13 16:54:51 -07:00
David Hildenbrand	fb49a4425c	treewide: remove MIGRATEPAGE_SUCCESS At this point MIGRATEPAGE_SUCCESS is misnamed for all folio users, and now that we remove MIGRATEPAGE_UNMAP, it's really the only "success" return value that the code uses and expects. Let's just get rid of MIGRATEPAGE_SUCCESS completely and just use "0" for success. Link: https://lkml.kernel.org/r/20250811143949.1117439-3-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Zi Yan <ziy@nvidia.com> [mm] Acked-by: Dave Kleikamp <dave.kleikamp@oracle.com> [jfs] Acked-by: David Sterba <dsterba@suse.com> [btrfs] Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Byungchul Park <byungchul@sk.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: Chris Mason <clm@fb.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Kleikamp <shaggy@kernel.org> Cc: Eugenio Pé rez <eperezma@redhat.com> Cc: Gregory Price <gourry@gourry.net> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Jan Kara <jack@suse.cz> Cc: Jason Wang <jasowang@redhat.com> Cc: Jerrin Shaji George <jerrin.shaji-george@broadcom.com> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Mathew Brost <matthew.brost@intel.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Cc: Lance Yang <lance.yang@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-13 16:54:50 -07:00
Suren Baghdasaryan	d9d1c2d817	fs/proc/task_mmu: execute PROCMAP_QUERY ioctl under per-vma locks Utilize per-vma locks to stabilize vma after lookup without taking mmap_lock during PROCMAP_QUERY ioctl execution. If vma lock is contended, we fall back to mmap_lock but take it only momentarily to lock the vma and release the mmap_lock. In a very unlikely case of vm_refcnt overflow, this fall back path will fail and ioctl is done under mmap_lock protection. This change is designed to reduce mmap_lock contention and prevent PROCMAP_QUERY ioctl calls from blocking address space updates. Link: https://lkml.kernel.org/r/20250808152850.2580887-4-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Acked-by: SeongJae Park <sj@kernel.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: David Hildenbrand <david@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: "Paul E . McKenney" <paulmck@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Thomas Weißschuh <linux@weissschuh.net> Cc: T.J. Mercier <tjmercier@google.com> Cc: Ye Bin <yebin10@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-13 16:54:48 -07:00
Suren Baghdasaryan	ee737a5a10	fs/proc/task_mmu: factor out proc_maps_private fields used by PROCMAP_QUERY Refactor struct proc_maps_private so that the fields used by PROCMAP_QUERY ioctl are moved into a separate structure. In the next patch this allows ioctl to reuse some of the functions used for reading /proc/pid/maps without using file->private_data. This prevents concurrent modification of file->private_data members by ioctl and /proc/pid/maps readers. The change is pure code refactoring and has no functional changes. Link: https://lkml.kernel.org/r/20250808152850.2580887-3-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: SeongJae Park <sj@kernel.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: David Hildenbrand <david@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: "Paul E . McKenney" <paulmck@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Thomas Weißschuh <linux@weissschuh.net> Cc: T.J. Mercier <tjmercier@google.com> Cc: Ye Bin <yebin10@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-13 16:54:48 -07:00
Vitaly Wool	2cd8231796	mm/slub: allow to set node and align in k[v]realloc Reimplement k[v]realloc_node() to be able to set node and alignment should a user need to do so. In order to do that while retaining the maximal backward compatibility, add k[v]realloc_node_align() functions and redefine the rest of API using these new ones. While doing that, we also keep the number of _noprof variants to a minimum, which implies some changes to the existing users of older _noprof functions, that basically being bcachefs. With that change we also provide the ability for the Rust part of the kernel to set node and alignment in its K[v]xxx [re]allocations. Link: https://lkml.kernel.org/r/20250806124147.1724658-1-vitaly.wool@konsulko.se Signed-off-by: Vitaly Wool <vitaly.wool@konsulko.se> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Danilo Krummrich <dakr@kernel.org> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Jann Horn <jannh@google.com> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-13 16:54:45 -07:00
Linus Torvalds	8026aed072	Merge tag 'mm-hotfixes-stable-2025-09-01-17-20' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "17 hotfixes. 13 are cc:stable and the remainder address post-6.16 issues or aren't considered necessary for -stable kernels. 11 of these fixes are for MM. This includes a three-patch series from Harry Yoo which fixes an intermittent boot failure which can occur on x86 systems. And a two-patch series from Alexander Gordeev which fixes a KASAN crash on S390 systems" * tag 'mm-hotfixes-stable-2025-09-01-17-20' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: mm: fix possible deadlock in kmemleak x86/mm/64: define ARCH_PAGE_TABLE_SYNC_MASK and arch_sync_kernel_mappings() mm: introduce and use {pgd,p4d}_populate_kernel() mm: move page table sync declarations to linux/pgtable.h proc: fix missing pde_set_flags() for net proc files mm: fix accounting of memmap pages mm/damon/core: prevent unnecessary overflow in damos_set_effective_quota() kexec: add KEXEC_FILE_NO_CMA as a legal flag kasan: fix GCC mem-intrinsic prefix with sw tags mm/kasan: avoid lazy MMU mode hazards mm/kasan: fix vmalloc shadow memory (de-)population races kunit: kasan_test: disable fortify string checker on kasan_strings() test selftests/mm: fix FORCE_READ to read input value correctly mm/userfaultfd: fix kmap_local LIFO ordering for CONFIG_HIGHPTE ocfs2: prevent release journal inode after journal shutdown rust: mm: mark VmaNew as transparent of_numa: fix uninitialized memory nodes causing kernel panic	2025-09-02 13:18:00 -07:00
Linus Torvalds	e3c94a539e	Merge tag 'for-6.17-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - fix a few races related to inode link count - fix inode leak on failure to add link to inode - move transaction aborts closer to where they happen * tag 'for-6.17-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: avoid load/store tearing races when checking if an inode was logged btrfs: fix race between setting last_dir_index_offset and inode logging btrfs: fix race between logging inode and checking if it was logged before btrfs: simplify error handling logic for btrfs_link() btrfs: fix inode leak on failure to add link to inode btrfs: abort transaction on failure to add link to inode	2025-09-02 13:13:22 -07:00
Linus Torvalds	fb679c832b	Merge tag 'efi-fixes-for-v6.17-1' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi Pull EFI fixes from Ard Biesheuvel: - Assorted fixes for the OP-TEE based pseudo-EFI variable store - Fix for an OOB access when looking up the same non-existing efivarfs entry multiple times in parallel * tag 'efi-fixes-for-v6.17-1' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi: efivarfs: Fix slab-out-of-bounds in efivarfs_d_compare efi: stmm: Drop unneeded null pointer check efi: stmm: Drop unused EFI error from setup_mm_hdr arguments efi: stmm: Do not return EFI_OUT_OF_RESOURCES on internal errors efi: stmm: Fix incorrect buffer allocation method	2025-08-29 09:15:46 -07:00
Linus Torvalds	2575e638e2	Merge tag 'v6.17-rc3-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6 Pull smb client fixes from Steve French: - Fix possible refcount leak in compound operations - Fix remap_file_range() return code mapping, found by generic/157 * tag 'v6.17-rc3-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6: fs/smb: Fix inconsistent refcnt update smb3 client: fix return code mapping of remap_file_range	2025-08-29 08:51:34 -07:00
Linus Torvalds	469447200a	Merge tag 'xfs-fixes-6.17-rc4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux Pull xfs fixes from Carlos Maiolino: "The highlight I'd like to point here is related to the XFS_RT Kconfig, which has been updated to be enabled by default now if CONFIG_BLK_DEV_ZONED is enabled. This also contains a few fixes for zoned devices support in XFS, specially related to swapon requests in inodes belonging to the zoned FS. A null-ptr dereference fix in the xattr data, due to a mishandling of medium errors generated by block devices is also included" * tag 'xfs-fixes-6.17-rc4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: do not propagate ENODATA disk errors into xattr code xfs: reject swapon for inodes on a zoned file system earlier xfs: kick off inodegc when failing to reserve zoned blocks xfs: remove xfs_last_used_zone xfs: Default XFS_RT to Y if CONFIG_BLK_DEV_ZONED is enabled	2025-08-29 08:09:34 -07:00
Li Nan	a6358f8cf6	efivarfs: Fix slab-out-of-bounds in efivarfs_d_compare Observed on kernel 6.6 (present on master as well): BUG: KASAN: slab-out-of-bounds in memcmp+0x98/0xd0 Call trace: kasan_check_range+0xe8/0x190 __asan_loadN+0x1c/0x28 memcmp+0x98/0xd0 efivarfs_d_compare+0x68/0xd8 __d_lookup_rcu_op_compare+0x178/0x218 __d_lookup_rcu+0x1f8/0x228 d_alloc_parallel+0x150/0x648 lookup_open.isra.0+0x5f0/0x8d0 open_last_lookups+0x264/0x828 path_openat+0x130/0x3f8 do_filp_open+0x114/0x248 do_sys_openat2+0x340/0x3c0 __arm64_sys_openat+0x120/0x1a0 If dentry->d_name.len < EFI_VARIABLE_GUID_LEN , 'guid' can become negative, leadings to oob. The issue can be triggered by parallel lookups using invalid filename: T1 T2 lookup_open ->lookup simple_lookup d_add // invalid dentry is added to hash list lookup_open d_alloc_parallel __d_lookup_rcu __d_lookup_rcu_op_compare hlist_bl_for_each_entry_rcu // invalid dentry can be retrieved ->d_compare efivarfs_d_compare // oob Fix it by checking 'guid' before cmp. Fixes: `da27a24383` ("efivarfs: guid part of filenames are case-insensitive") Signed-off-by: Li Nan <linan122@huawei.com> Signed-off-by: Wu Guanghao <wuguanghao3@huawei.com> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>	2025-08-28 08:39:49 +02:00
wangzijie	2ce3d282bd	proc: fix missing pde_set_flags() for net proc files To avoid potential UAF issues during module removal races, we use pde_set_flags() to save proc_ops flags in PDE itself before proc_register(), and then use pde_has_proc_() helpers instead of directly dereferencing pde->proc_ops->. However, the pde_set_flags() call was missing when creating net related proc files. This omission caused incorrect behavior which FMODE_LSEEK was being cleared inappropriately in proc_reg_open() for net proc files. Lars reported it in this link[1]. Fix this by ensuring pde_set_flags() is called when register proc entry, and add NULL check for proc_ops in pde_set_flags(). [wangzijie1@honor.com: stash pde->proc_ops in a local const variable, per Christian] Link: https://lkml.kernel.org/r/20250821105806.1453833-1-wangzijie1@honor.com Link: https://lkml.kernel.org/r/20250818123102.959595-1-wangzijie1@honor.com Link: https://lore.kernel.org/all/20250815195616.64497967@chagall.paradoxon.rec/ [1] Fixes: `ff7ec8dc1b` ("proc: use the same treatment to check proc_lseek as ones for proc_read_iter et.al") Signed-off-by: wangzijie <wangzijie1@honor.com> Reported-by: Lars Wendler <polynomial-c@gmx.de> Tested-by: Stefano Brivio <sbrivio@redhat.com> Tested-by: Petr Vaněk <pv@excello.cz> Tested by: Lars Wendler <polynomial-c@gmx.de> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: "Edgecombe, Rick P" <rick.p.edgecombe@intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jiri Slaby <jirislaby@kernel.org> Cc: Kirill A. Shutemov <k.shutemov@gmail.com> Cc: wangzijie <wangzijie1@honor.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-08-27 22:45:44 -07:00
Edward Adam Davis	f46e8ef8bb	ocfs2: prevent release journal inode after journal shutdown Before calling ocfs2_delete_osb(), ocfs2_journal_shutdown() has already been executed in ocfs2_dismount_volume(), so osb->journal must be NULL. Therefore, the following calltrace will inevitably fail when it reaches jbd2_journal_release_jbd_inode(). ocfs2_dismount_volume()-> ocfs2_delete_osb()-> ocfs2_free_slot_info()-> __ocfs2_free_slot_info()-> evict()-> ocfs2_evict_inode()-> ocfs2_clear_inode()-> jbd2_journal_release_jbd_inode(osb->journal->j_journal, Adding osb->journal checks will prevent null-ptr-deref during the above execution path. Link: https://lkml.kernel.org/r/tencent_357489BEAEE4AED74CBD67D246DBD2C4C606@qq.com Fixes: `da5e7c8782` ("ocfs2: cleanup journal init and shutdown") Signed-off-by: Edward Adam Davis <eadavis@qq.com> Reported-by: syzbot+47d8cb2f2cc1517e515a@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=47d8cb2f2cc1517e515a Tested-by: syzbot+47d8cb2f2cc1517e515a@syzkaller.appspotmail.com Reviewed-by: Mark Tinguely <mark.tinguely@oracle.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Jun Piao <piaojun@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-08-27 22:45:41 -07:00
Shuhao Fu	ab529e6ca1	fs/smb: Fix inconsistent refcnt update A possible inconsistent update of refcount was identified in `smb2_compound_op`. Such inconsistent update could lead to possible resource leaks. Why it is a possible bug: 1. In the comment section of the function, it clearly states that the reference to `cfile` should be dropped after calling this function. 2. Every control flow path would check and drop the reference to `cfile`, except the patched one. 3. Existing callers would not handle refcount update of `cfile` if -ENOMEM is returned. To fix the bug, an extra goto label "out" is added, to make sure that the cleanup logic would always be respected. As the problem is caused by the allocation failure of `vars`, the cleanup logic between label "finished" and "out" can be safely ignored. According to the definition of function `is_replayable_error`, the error code of "-ENOMEM" is not recoverable. Therefore, the replay logic also gets ignored. Signed-off-by: Shuhao Fu <sfual@cse.ust.hk> Acked-by: Paulo Alcantara (Red Hat) <pc@manguebit.org> Cc: stable@vger.kernel.org Signed-off-by: Steve French <stfrench@microsoft.com>	2025-08-27 14:59:06 -05:00
Eric Sandeen	ae668cd567	xfs: do not propagate ENODATA disk errors into xattr code ENODATA (aka ENOATTR) has a very specific meaning in the xfs xattr code; namely, that the requested attribute name could not be found. However, a medium error from disk may also return ENODATA. At best, this medium error may escape to userspace as "attribute not found" when in fact it's an IO (disk) error. At worst, we may oops in xfs_attr_leaf_get() when we do: error = xfs_attr_leaf_hasname(args, &bp); if (error == -ENOATTR) { xfs_trans_brelse(args->trans, bp); return error; } because an ENODATA/ENOATTR error from disk leaves us with a null bp, and the xfs_trans_brelse will then null-deref it. As discussed on the list, we really need to modify the lower level IO functions to trap all disk errors and ensure that we don't let unique errors like this leak up into higher xfs functions - many like this should be remapped to EIO. However, this patch directly addresses a reported bug in the xattr code, and should be safe to backport to stable kernels. A larger-scope patch to handle more unique errors at lower levels can follow later. (Note, prior to `07120f1abd` we did not oops, but we did return the wrong error code to userspace.) Signed-off-by: Eric Sandeen <sandeen@redhat.com> Fixes: `07120f1abd` ("xfs: Add xfs_has_attr and subroutines") Cc: stable@vger.kernel.org # v5.9+ Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2025-08-26 11:00:33 +02:00
Steve French	0e08fa789d	smb3 client: fix return code mapping of remap_file_range We were returning -EOPNOTSUPP for various remap_file_range cases but for some of these the copy_file_range_syscall() requires -EINVAL to be returned (e.g. where source and target file ranges overlap when source and target are the same file). This fixes xfstest generic/157 which was expecting EINVAL for that (and also e.g. for when the src offset is beyond end of file). Cc: stable@vger.kernel.org Acked-by: Paulo Alcantara (Red Hat) <pc@manguebit.org> Signed-off-by: Steve French <stfrench@microsoft.com>	2025-08-25 09:40:38 -05:00
Linus Torvalds	52025b8fc9	Merge tag 'driver-core-6.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core Pull driver core fixes from Danilo Krummrich: - Fix swapped handling of lru_gen and lru_gen_full debugfs files in vmscan - Fix debugfs mount options (uid, gid, mode) being silently ignored - Fix leak of devres action in the unwind path of Devres::new() - Documentation: - Expand and fix documentation of (outdated) Device, DeviceContext and generic driver infrastructure - Fix C header link of faux device abstractions - Clarify expected interaction with the security team - Smooth text flow in the security bug reporting process documentation * tag 'driver-core-6.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core: Documentation: smooth the text flow in the security bug reporting process Documentation: clarify the expected collaboration with security bugs reporters debugfs: fix mount options not being applied rust: devres: fix leaking call to devm_add_action() rust: faux: fix C header link driver: rust: expand documentation for driver infrastructure device: rust: expand documentation for Device device: rust: expand documentation for DeviceContext mm/vmscan: fix inverted polarity in lru_gen_seq_show()	2025-08-23 09:04:32 -04:00
Linus Torvalds	3cfcd57def	Merge tag '6.17-rc2-smb3-client-fix' of git://git.samba.org/sfrench/cifs-2.6 Pull smb client fix from Steve French: "Fix for netfs smb3 oops" * tag '6.17-rc2-smb3-client-fix' of git://git.samba.org/sfrench/cifs-2.6: cifs: Fix oops due to uninitialised variable	2025-08-22 09:02:32 -04:00
Linus Torvalds	e86ba12cf8	Merge tag 'nfs-for-6.17-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs Pull NFS client fix from Trond Myklebust: - NFS: Fix a data corrupting race when updating an existing write * tag 'nfs-for-6.17-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: NFS: Fix a race when updating an existing write	2025-08-22 08:58:58 -04:00
Linus Torvalds	6eba757ce9	Merge tag 'mm-hotfixes-stable-2025-08-21-18-17' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "20 hotfixes. 10 are cc:stable and the remainder address post-6.16 issues or aren't considered necessary for -stable kernels. 17 of these fixes are for MM. As usual, singletons all over the place, apart from a three-patch series of KHO followup work from Pasha which is actually also a bunch of singletons" * tag 'mm-hotfixes-stable-2025-08-21-18-17' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: mm/mremap: fix WARN with uffd that has remap events disabled mm/damon/sysfs-schemes: put damos dests dir after removing its files mm/migrate: fix NULL movable_ops if CONFIG_ZSMALLOC=m mm/damon/core: fix damos_commit_filter not changing allow mm/memory-failure: fix infinite UCE for VM_PFNMAP pfn MAINTAINERS: mark MGLRU as maintained mm: rust: add page.rs to MEMORY MANAGEMENT - RUST iov_iter: iterate_folioq: fix handling of offset >= folio size selftests/damon: fix selftests by installing drgn related script .mailmap: add entry for Easwar Hariharan selftests/mm: add test for invalid multi VMA operations mm/mremap: catch invalid multi VMA moves earlier mm/mremap: allow multi-VMA move when filesystem uses thp_get_unmapped_area mm/damon/core: fix commit_ops_filters by using correct nth function tools/testing: add linux/args.h header and fix radix, VMA tests mm/debug_vm_pgtable: clear page table entries at destroy_args() squashfs: fix memory leak in squashfs_fill_super kho: warn if KHO is disabled due to an error kho: mm: don't allow deferred struct page with KHO kho: init new_physxa->phys_bits to fix lockdep	2025-08-22 08:54:34 -04:00
Filipe Manana	986bf6ed44	btrfs: avoid load/store tearing races when checking if an inode was logged At inode_logged() we do a couple lockless checks for ->logged_trans, and these are generally safe except the second one in case we get a load or store tearing due to a concurrent call updating ->logged_trans (either at btrfs_log_inode() or later at inode_logged()). In the first case it's safe to compare to the current transaction ID since once ->logged_trans is set the current transaction, we never set it to a lower value. In the second case, where we check if it's greater than zero, we are prone to load/store tearing races, since we can have a concurrent task updating to the current transaction ID with store tearing for example, instead of updating with a single 64 bits write, to update with two 32 bits writes or four 16 bits writes. In that case the reading side at inode_logged() could see a positive value that does not match the current transaction and then return a false negative. Fix this by doing the second check while holding the inode's spinlock, add some comments about it too. Also add the data_race() annotation to the first check to avoid any reports from KCSAN (or similar tools) and comment about it. Fixes: `0f8ce49821` ("btrfs: avoid inode logging during rename and link when possible") Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-08-22 00:58:55 +02:00
Filipe Manana	59a0dd4ab9	btrfs: fix race between setting last_dir_index_offset and inode logging At inode_logged() if we find that the inode was not logged before we update its ->last_dir_index_offset to (u64)-1 with the goal that the next directory log operation will see the (u64)-1 and then figure out it must check what was the index of the last logged dir index key and update ->last_dir_index_offset to that key's offset (this is done in update_last_dir_index_offset()). This however has a possibility for a time window where a race can happen and lead to directory logging skipping dir index keys that should be logged. The race happens like this: 1) Task A calls inode_logged(), sees ->logged_trans as 0 and then checks that the inode item was logged before, but before it sets the inode's ->last_dir_index_offset to (u64)-1... 2) Task B is at btrfs_log_inode() which calls inode_logged() early, and that has set ->last_dir_index_offset to (u64)-1; 3) Task B then enters log_directory_changes() which calls update_last_dir_index_offset(). There it sees ->last_dir_index_offset is (u64)-1 and that the inode was logged before (ctx->logged_before is true), and so it searches for the last logged dir index key in the log tree and it finds that it has an offset (index) value of N, so it sets ->last_dir_index_offset to N, so that we can skip index keys that are less than or equal to N (later at process_dir_items_leaf()); 4) Task A now sets ->last_dir_index_offset to (u64)-1, undoing the update that task B just did; 5) Task B will now skip every index key when it enters process_dir_items_leaf(), since ->last_dir_index_offset is (u64)-1. Fix this by making inode_logged() not touch ->last_dir_index_offset and initializing it to 0 when an inode is loaded (at btrfs_alloc_inode()) and then having update_last_dir_index_offset() treat a value of 0 as meaning we must check the log tree and update with the index of the last logged index key. This is fine since the minimum possible value for ->last_dir_index_offset is 1 (BTRFS_DIR_START_INDEX - 1 = 2 - 1 = 1). This also simplifies the management of ->last_dir_index_offset and now all accesses to it are done under the inode's log_mutex. Fixes: `0f8ce49821` ("btrfs: avoid inode logging during rename and link when possible") Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-08-22 00:58:47 +02:00
Filipe Manana	ef07b74e1b	btrfs: fix race between logging inode and checking if it was logged before There's a race between checking if an inode was logged before and logging an inode that can cause us to mark an inode as not logged just after it was logged by a concurrent task: 1) We have inode X which was not logged before neither in the current transaction not in past transaction since the inode was loaded into memory, so it's ->logged_trans value is 0; 2) We are at transaction N; 3) Task A calls inode_logged() against inode X, sees that ->logged_trans is 0 and there is a log tree and so it proceeds to search in the log tree for an inode item for inode X. It doesn't see any, but before it sets ->logged_trans to N - 1... 3) Task B calls btrfs_log_inode() against inode X, logs the inode and sets ->logged_trans to N; 4) Task A now sets ->logged_trans to N - 1; 5) At this point anyone calling inode_logged() gets 0 (inode not logged) since ->logged_trans is greater than 0 and less than N, but our inode was really logged. As a consequence operations like rename, unlink and link that happen afterwards in the current transaction end up not updating the log when they should. Fix this by ensuring inode_logged() only updates ->logged_trans in case the inode item is not found in the log tree if after tacking the inode's lock (spinlock struct btrfs_inode::lock) the ->logged_trans value is still zero, since the inode lock is what protects setting ->logged_trans at btrfs_log_inode(). Fixes: `0f8ce49821` ("btrfs: avoid inode logging during rename and link when possible") Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-08-22 00:58:43 +02:00
Filipe Manana	5bb00879cb	btrfs: simplify error handling logic for btrfs_link() Instead of incrementing the inode's link count and refcount early before adding the link, updating the inode and deleting orphan item, do it after all those steps succeeded right before calling d_instantiate(). This makes the error handling logic simpler by avoiding the need for the 'drop_inode' variable to signal if we need to undo the link count increment and the inode refcount increase under the 'fail' label. This also reduces the level of indentation by one, making the code easier to read. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-08-22 00:58:28 +02:00
Filipe Manana	e87e953bb2	btrfs: fix inode leak on failure to add link to inode If we fail to update the inode or delete the orphan item we leak the inode since we update its refcount with the ihold() call to account for the d_instantiate() call which never happens in case we fail those steps. Fix this by setting 'drop_inode' to true in case we fail those steps. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-08-22 00:58:27 +02:00
Filipe Manana	2b3979624c	btrfs: abort transaction on failure to add link to inode If we fail to update the inode or delete the orphan item, we must abort the transaction to prevent persisting an inconsistent state. For example if we fail to update the inode item, we have the inconsistency of having a persisted inode item with a link count of N but we have N + 1 inode ref items and N + 1 directory entries pointing to our inode in case the transaction gets committed. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-08-22 00:58:25 +02:00
Linus Torvalds	eb4a0992dd	Merge tag '6.17-rc2-ksmbd-server-fixes' of git://git.samba.org/ksmbd Pull smb server fixes from Steve French: - fix refcount issue that can cause memory leak - rate limit repeated connections from IPv6, not just IPv4 addresses - fix potential null pointer access of smb direct work queue * tag '6.17-rc2-ksmbd-server-fixes' of git://git.samba.org/ksmbd: ksmbd: fix refcount leak causing resource not released ksmbd: extend the connection limiting mechanism to support IPv6 smb: server: split ksmbd_rdma_stop_listening() out of ksmbd_rdma_destroy()	2025-08-21 04:48:41 -07:00
Phillip Lougher	b64700d41b	squashfs: fix memory leak in squashfs_fill_super If sb_min_blocksize returns 0, squashfs_fill_super exits without freeing allocated memory (sb->s_fs_info). Fix this by moving the call to sb_min_blocksize to before memory is allocated. Link: https://lkml.kernel.org/r/20250811223740.110392-1-phillip@squashfs.org.uk Fixes: `734aa85390` ("Squashfs: check return result of sb_min_blocksize") Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk> Reported-by: Scott GUO <scottzhguo@tencent.com> Closes: https://lore.kernel.org/all/20250811061921.3807353-1-scott_gzh@163.com Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-08-19 16:35:53 -07:00
Trond Myklebust	76d2e3890f	NFS: Fix a race when updating an existing write After nfs_lock_and_join_requests() tests for whether the request is still attached to the mapping, nothing prevents a call to nfs_inode_remove_request() from succeeding until we actually lock the page group. The reason is that whoever called nfs_inode_remove_request() doesn't necessarily have a lock on the page group head. So in order to avoid races, let's take the page group lock earlier in nfs_lock_and_join_requests(), and hold it across the removal of the request in nfs_inode_remove_request(). Reported-by: Jeff Layton <jlayton@kernel.org> Tested-by: Joe Quanaim <jdq@meta.com> Tested-by: Andrew Steffen <aksteffen@meta.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Fixes: `bd37d6fce1` ("NFSv4: Convert nfs_lock_and_join_requests() to use nfs_page_find_head_request()") Cc: stable@vger.kernel.org Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2025-08-19 11:16:02 -07:00
Linus Torvalds	b19a97d57c	Merge tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull mount fixes from Al Viro: "Fixes for several recent mount-related regressions" * tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: change_mnt_propagation(): calculate propagation source only if we'll need it use uniform permission checks for all mount propagation changes propagate_umount(): only surviving overmounts should be reparented fix the softlockups in attach_recursive_mnt()	2025-08-19 10:12:10 -07:00
Linus Torvalds	7cca555b94	Merge tag 'ovl-fixes-6.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/overlayfs/vfs Pull overlayfs fixes from Amir Goldstein: "Fixes for two fallouts from Neil's directory locking changes" * tag 'ovl-fixes-6.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/overlayfs/vfs: ovl: fix possible double unlink ovl: use I_MUTEX_PARENT when locking parent in ovl_create_temp()	2025-08-19 10:08:05 -07:00
Linus Torvalds	055f213075	Merge tag 'vfs-6.17-rc3.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs fixes from Christian Brauner: - Fix two memory leaks in pidfs - Prevent changing the idmapping of an already idmapped mount without OPEN_TREE_CLONE through open_tree_attr() - Don't fail listing extended attributes in kernfs when no extended attributes are set - Fix the return value in coredump_parse() - Fix the error handling for unbuffered writes in netfs - Fix broken data integrity guarantees for O_SYNC writes via iomap - Fix UAF in __mark_inode_dirty() - Keep inode->i_blkbits constant in fuse - Fix coredump selftests - Fix get_unused_fd_flags() usage in do_handle_open() - Rename EXPORT_SYMBOL_GPL_FOR_MODULES to EXPORT_SYMBOL_FOR_MODULES - Fix use-after-free in bh_read() - Fix incorrect lflags value in the move_mount() syscall * tag 'vfs-6.17-rc3.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: signal: Fix memory leak for PIDFD_SELF* sentinels kernfs: don't fail listing extended attributes coredump: Fix return value in coredump_parse() fs/buffer: fix use-after-free when call bh_read() helper pidfs: Fix memory leak in pidfd_info() netfs: Fix unbuffered write error handling fhandle: do_handle_open() should get FD with user flags module: Rename EXPORT_SYMBOL_GPL_FOR_MODULES to EXPORT_SYMBOL_FOR_MODULES fs: fix incorrect lflags value in the move_mount syscall selftests/coredump: Remove the read() that fails the test fuse: keep inode->i_blkbits constant iomap: Fix broken data integrity guarantees for O_SYNC writes selftests/mount_setattr: add smoke tests for open_tree_attr(2) bug open_tree_attr: do not allow id-mapping changes without OPEN_TREE_CLONE fs: writeback: fix use-after-free in __mark_inode_dirty()	2025-08-19 09:54:47 -07:00
David Howells	453a6d2a68	cifs: Fix oops due to uninitialised variable Fix smb3_init_transform_rq() to initialise buffer to NULL before calling netfs_alloc_folioq_buffer() as netfs assumes it can append to the buffer it is given. Setting it to NULL means it should start a fresh buffer, but the value is currently undefined. Fixes: `a2906d3316` ("cifs: Switch crypto buffer to use a folio_queue rather than an xarray") Signed-off-by: David Howells <dhowells@redhat.com> cc: Steve French <sfrench@samba.org> cc: Paulo Alcantara <pc@manguebit.org> cc: linux-cifs@vger.kernel.org cc: linux-fsdevel@vger.kernel.org Signed-off-by: Steve French <stfrench@microsoft.com>	2025-08-19 11:16:14 -05:00
Al Viro	fb924b7b86	change_mnt_propagation(): calculate propagation source only if we'll need it We only need it when mount in question was sending events downstream (then recepients need to switch to new master) or the mount is being turned into slave (then we need a new master for it). That wouldn't be a big deal, except that it causes quite a bit of work when umount_tree() is taking a large peer group out. Adding a trivial "don't bother calling propagation_source() unless we are going to use its results" logics improves the things quite a bit. We are still doing unnecessary work on bulk removals from propagation graph, but the full solution for that will have to wait for the next merge window. Fixes: `955336e204` "do_make_slave(): choose new master sanely" Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-08-19 12:05:59 -04:00
Al Viro	cffd044187	use uniform permission checks for all mount propagation changes do_change_type() and do_set_group() are operating on different aspects of the same thing - propagation graph. The latter asks for mounts involved to be mounted in namespace(s) the caller has CAP_SYS_ADMIN for. The former is a mess - originally it didn't even check that mount is mounted. That got fixed, but the resulting check turns out to be too strict for userland - in effect, we check that mount is in our namespace, having already checked that we have CAP_SYS_ADMIN there. What we really need (in both cases) is * only touch mounts that are mounted. That's a must-have constraint - data corruption happens if it get violated. * don't allow to mess with a namespace unless you already have enough permissions to do so (i.e. CAP_SYS_ADMIN in its userns). That's an equivalent of what do_set_group() does; let's extract that into a helper (may_change_propagation()) and use it in both do_set_group() and do_change_type(). Fixes: `12f147ddd6` "do_change_type(): refuse to operate on unmounted/not ours mounts" Acked-by: Andrei Vagin <avagin@gmail.com> Reviewed-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com> Tested-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com> Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-08-19 12:03:23 -04:00
Al Viro	da025cdb97	propagate_umount(): only surviving overmounts should be reparented ... as the comments in reparent() clearly say. As it is, we reparent all overmounts of the mounts being taken out, including those that are taken out themselves. It's not only a potentially massive slowdown (on a pathological setup we might end up with O(N^2) time for N mounts being kicked out), it can end up with incorrect ->overmount in the surviving mounts. Fixes: `f0d0ba1998` "Rewrite of propagate_umount()" Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-08-19 12:00:07 -04:00
Al Viro	0ddfb62f5d	fix the softlockups in attach_recursive_mnt() In case when we mounting something on top of a large stack of overmounts, all of them being peers of each other, we get quadratic time by the depth of overmount stack. Easily fixed by doing commit_tree() before reparenting the overmount; simplifies commit_tree() as well - it doesn't need to skip the already mounted stuff that had been reparented on top of the new mounts. Since we are holding mount_lock through both reparenting and call of commit_tree(), the order does not matter from the mount hash point of view. Reported-by: "Lai, Yi" <yi1.lai@linux.intel.com> Tested-by: "Lai, Yi" <yi1.lai@linux.intel.com> Reviewed-by: Christian Brauner <brauner@kernel.org> Fixes: `663206854f` "copy_tree(): don't link the mounts via mnt_list" Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-08-19 11:58:18 -04:00
Christoph Hellwig	8e5a2441e1	xfs: reject swapon for inodes on a zoned file system earlier No point in going down into the iomap mapping loop when we know it will be rejected. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2025-08-19 14:38:39 +02:00
Christoph Hellwig	7d523255f5	xfs: kick off inodegc when failing to reserve zoned blocks XFS processes truncating unlinked inodes asynchronously and thus the free space pool only sees them with a delay. The non-zoned write path thus calls into inodegc to accelerate this processing before failing an allocation due the lack of free blocks. Do the same for the zoned space reservation. Fixes: `0bb2193056` ("xfs: add support for zoned space reservations") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2025-08-19 14:37:07 +02:00
Christoph Hellwig	d004d70d6c	xfs: remove xfs_last_used_zone This was my first attempt at caching the last used zone. But it turns out for O_DIRECT or RWF_DONTCACHE that operate concurrently or in very short sequence, the bmap btree does not record a written extent yet, so it fails. Because it then still finds the last written zone it can lead to a weird ping-pong around a few zones with writers seeing different values. Remove it entirely as the later added xfs_cached_zone actually does a much better job enforcing the locality as the zone is associated with the inode in the MRU cache as soon as the zone is selected. Fixes: `4e4d520755` ("xfs: add the zoned space allocator") Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2025-08-19 14:37:07 +02:00
Damien Le Moal	9ce43caa4b	xfs: Default XFS_RT to Y if CONFIG_BLK_DEV_ZONED is enabled XFS support for zoned block devices requires the realtime subvolume support (XFS_RT) to be enabled. Change the default configuration value of XFS_RT from N to CONFIG_BLK_DEV_ZONED to align with this requirement. This change still allows the user to disable XFS_RT if this feature is not desired for the user use case. Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2025-08-19 14:23:37 +02:00
Christian Brauner	c237aa9884	kernfs: don't fail listing extended attributes Userspace doesn't expect a failure to list extended attributes: $ ls -lA /sys/ ls: /sys/: No data available ls: /sys/kernel: No data available ls: /sys/power: No data available ls: /sys/class: No data available ls: /sys/devices: No data available ls: /sys/dev: No data available ls: /sys/hypervisor: No data available ls: /sys/fs: No data available ls: /sys/bus: No data available ls: /sys/firmware: No data available ls: /sys/block: No data available ls: /sys/module: No data available total 0 drwxr-xr-x 2 root root 0 Jan 1 1970 block drwxr-xr-x 52 root root 0 Jan 1 1970 bus drwxr-xr-x 88 root root 0 Jan 1 1970 class drwxr-xr-x 4 root root 0 Jan 1 1970 dev drwxr-xr-x 11 root root 0 Jan 1 1970 devices drwxr-xr-x 3 root root 0 Jan 1 1970 firmware drwxr-xr-x 10 root root 0 Jan 1 1970 fs drwxr-xr-x 2 root root 0 Jul 2 09:43 hypervisor drwxr-xr-x 14 root root 0 Jan 1 1970 kernel drwxr-xr-x 251 root root 0 Jan 1 1970 module drwxr-xr-x 3 root root 0 Jul 2 09:43 power Fix it by simply reporting success when no extended attributes are available instead of reporting ENODATA. Link: https://lore.kernel.org/78b13bcdae82ade95e88f315682966051f461dde.camel@linaro.org Fixes: `d1f4e90260` ("kernfs: remove iattr_mutex") # mainline only Reported-by: André Draszik <andre.draszik@linaro.org> Link: https://lore.kernel.org/20250819-ahndung-abgaben-524a535f8101@brauner Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-08-19 13:51:28 +02:00
Dan Carpenter	589c12edcd	coredump: Fix return value in coredump_parse() The coredump_parse() function is bool type. It should return true on success and false on failure. The cn_printf() returns zero on success or negative error codes. This mismatch means that when "return err;" here, it is treated as success instead of failure. Change it to return false instead. Fixes: `a5715af549` ("coredump: make coredump_parse() return bool") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Link: https://lore.kernel.org/aKRGu14w5vPSZLgv@stanley.mountain Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-08-19 13:51:28 +02:00
Ye Bin	7375f22495	fs/buffer: fix use-after-free when call bh_read() helper There's issue as follows: BUG: KASAN: stack-out-of-bounds in end_buffer_read_sync+0xe3/0x110 Read of size 8 at addr ffffc9000168f7f8 by task swapper/3/0 CPU: 3 UID: 0 PID: 0 Comm: swapper/3 Not tainted 6.16.0-862.14.0.6.x86_64 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996) Call Trace: <IRQ> dump_stack_lvl+0x55/0x70 print_address_description.constprop.0+0x2c/0x390 print_report+0xb4/0x270 kasan_report+0xb8/0xf0 end_buffer_read_sync+0xe3/0x110 end_bio_bh_io_sync+0x56/0x80 blk_update_request+0x30a/0x720 scsi_end_request+0x51/0x2b0 scsi_io_completion+0xe3/0x480 ? scsi_device_unbusy+0x11e/0x160 blk_complete_reqs+0x7b/0x90 handle_softirqs+0xef/0x370 irq_exit_rcu+0xa5/0xd0 sysvec_apic_timer_interrupt+0x6e/0x90 </IRQ> Above issue happens when do ntfs3 filesystem mount, issue may happens as follows: mount IRQ ntfs_fill_super read_cache_page do_read_cache_folio filemap_read_folio mpage_read_folio do_mpage_readpage ntfs_get_block_vbo bh_read submit_bh wait_on_buffer(bh); blk_complete_reqs scsi_io_completion scsi_end_request blk_update_request end_bio_bh_io_sync end_buffer_read_sync __end_buffer_read_notouch unlock_buffer wait_on_buffer(bh);--> return will return to caller put_bh --> trigger stack-out-of-bounds In the mpage_read_folio() function, the stack variable 'map_bh' is passed to ntfs_get_block_vbo(). Once unlock_buffer() unlocks and wait_on_buffer() returns to continue processing, the stack variable is likely to be reclaimed. Consequently, during the end_buffer_read_sync() process, calling put_bh() may result in stack overrun. If the bh is not allocated on the stack, it belongs to a folio. Freeing a buffer head which belongs to a folio is done by drop_buffers() which will fail to free buffers which are still locked. So it is safe to call put_bh() before __end_buffer_read_notouch(). Fixes: `1da177e4c3` ("Linux-2.6.12-rc2") Signed-off-by: Ye Bin <yebin10@huawei.com> Link: https://lore.kernel.org/20250811141830.343774-1-yebin@huaweicloud.com Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-08-19 13:51:28 +02:00

1 2 3 4 5 ...

100805 Commits