Commit Graph

101997 Commits

Author SHA1 Message Date
Lorenzo Stoakes
5dba5cc2e0 mm: introduce VM_MAYBE_GUARD and make visible in /proc/$pid/smaps
Patch series "introduce VM_MAYBE_GUARD and make it sticky", v4.

Currently, guard regions are not visible to users except through
/proc/$pid/pagemap, with no explicit visibility at the VMA level.

This makes the feature less useful, as it isn't entirely apparent which
VMAs may have these entries present, especially when performing actions
which walk through memory regions such as those performed by CRIU.

This series addresses this issue by introducing the VM_MAYBE_GUARD flag
which fulfils this role, updating the smaps logic to display an entry for
these.

The semantics of this flag are that a guard region MAY be present if set
(we cannot be sure, as we can't efficiently track whether an
MADV_GUARD_REMOVE finally removes all the guard regions in a VMA) - but if
not set the VMA definitely does NOT have any guard regions present.

It's problematic to establish this flag without further action, because
that means that VMAs with guard regions in them become non-mergeable with
adjacent VMAs for no especially good reason.

To work around this, this series also introduces the concept of 'sticky'
VMA flags - that is flags which:

a. if set in one VMA and not in another still permit those VMAs to be
   merged (if otherwise compatible).

b. When they are merged, the resultant VMA must have the flag set.

The VMA logic is updated to propagate these flags correctly.

Additionally, VM_MAYBE_GUARD being an explicit VMA flag allows us to solve
an issue with file-backed guard regions - previously these established an
anon_vma object for file-backed mappings solely to have vma_needs_copy()
correctly propagate guard region mappings to child processes.

We introduce a new flag alias VM_COPY_ON_FORK (which currently only
specifies VM_MAYBE_GUARD) and update vma_needs_copy() to check explicitly
for this flag and to copy page tables if it is present, which resolves
this issue.

Additionally, we add the ability for allow-listed VMA flags to be
atomically writable with only mmap/VMA read locks held.

The only flag we allow so far is VM_MAYBE_GUARD, which we carefully ensure
does not cause any races by being allowed to do so.

This allows us to maintain guard region installation as a read-locked
operation and not endure the overhead of obtaining a write lock here.

Finally we introduce extensive VMA userland tests to assert that the
sticky VMA logic behaves correctly as well as guard region self tests to
assert that smaps visibility is correctly implemented.


This patch (of 9):

Currently, if a user needs to determine if guard regions are present in a
range, they have to scan all VMAs (or have knowledge of which ones might
have guard regions).

Since commit 8e2f2aeb8b ("fs/proc/task_mmu: add guard region bit to
pagemap") and the related commit a516403787 ("fs/proc: extend the
PAGEMAP_SCAN ioctl to report guard regions"), users can use either
/proc/$pid/pagemap or the PAGEMAP_SCAN functionality to perform this
operation at a virtual address level.

This is not ideal, and it gives no visibility at a /proc/$pid/smaps level
that guard regions exist in ranges.

This patch remedies the situation by establishing a new VMA flag,
VM_MAYBE_GUARD, to indicate that a VMA may contain guard regions (it is
uncertain because we cannot reasonably determine whether a
MADV_GUARD_REMOVE call has removed all of the guard regions in a VMA, and
additionally VMAs may change across merge/split).

We utilise 0x800 for this flag which makes it available to 32-bit
architectures also, a flag that was previously used by VM_DENYWRITE, which
was removed in commit 8d0920bde5 ("mm: remove VM_DENYWRITE") and hasn't
bee reused yet.

We also update the smaps logic and documentation to identify these VMAs.

Another major use of this functionality is that we can use it to identify
that we ought to copy page tables on fork.

We do not actually implement usage of this flag in mm/madvise.c yet as we
need to allow some VMA flags to be applied atomically under mmap/VMA read
lock in order to avoid the need to acquire a write lock for this purpose.

Link: https://lkml.kernel.org/r/cover.1763460113.git.lorenzo.stoakes@oracle.com
Link: https://lkml.kernel.org/r/cf8ef821eba29b6c5b5e138fffe95d6dcabdedb9.1763460113.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Pedro Falcato <pfalcato@suse.de>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
Reviewed-by: Lance Yang <lance.yang@linux.dev>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nico Pache <npache@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-20 13:43:58 -08:00
Harry Yoo
ad8b2e0961 treewide: include linux/pgalloc.h instead of asm/pgalloc.h
For now, including <asm/pgalloc.h> instead of <linux/pgalloc.h> is
technically fine unless the .c file calls p*d_populate_kernel() helper
functions.

But it is a better practice to always include <linux/pgalloc.h>.  Include
<linux/pgalloc.h> instead of <asm/pgalloc.h> outside arch/.

Link: https://lkml.kernel.org/r/20251024113047.119058-3-harry.yoo@oracle.com
Signed-off-by: Harry Yoo <harry.yoo@oracle.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-16 17:28:25 -08:00
Lorenzo Stoakes
8247e2600e mm: update resctl to use mmap_prepare
Make use of the ability to specify a remap action within mmap_prepare to
update the resctl pseudo-lock to use mmap_prepare in favour of the
deprecated mmap hook.

Link: https://lkml.kernel.org/r/95b28b066f37ca25f56fa9460a9367f1a866f88b.1760959442.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Dave Martin <dave.martin@arm.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Dmitriy Vyukov <dvyukov@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guo Ren <guoren@kernel.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: James Morse <james.morse@arm.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Nicolas Pitre <nico@fluxnic.net>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pedro Falcato <pfalcato@suse.de>
Cc: Robin Murohy <robin.murphy@arm.com>
Cc: Sumanth Korikkar <sumanthk@linux.ibm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: "Uladzislau Rezki (Sony)" <urezki@gmail.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-16 17:28:14 -08:00
Lorenzo Stoakes
ea52cb24cd mm/hugetlbfs: update hugetlbfs to use mmap_prepare
Since we can now perform actions after the VMA is established via
mmap_prepare, use desc->action_success_hook to set up the hugetlb lock
once the VMA is setup.

We also make changes throughout hugetlbfs to make this possible.

Note that we must hide newly established hugetlb VMAs from the rmap until
the operation is entirely complete as we establish a hugetlb lock during
VMA setup that can be raced by rmap users.

Link: https://lkml.kernel.org/r/b1afa16d3cfa585a03df9ae215ae9f905b3f0ed7.1760959442.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Tested-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Chatre, Reinette <reinette.chatre@intel.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Dave Martin <dave.martin@arm.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Dmitriy Vyukov <dvyukov@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guo Ren <guoren@kernel.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: James Morse <james.morse@arm.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Nicolas Pitre <nico@fluxnic.net>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pedro Falcato <pfalcato@suse.de>
Cc: Robin Murohy <robin.murphy@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: "Uladzislau Rezki (Sony)" <urezki@gmail.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-16 17:28:13 -08:00
Lorenzo Stoakes
54c58a2f5f mm: add vma_desc_size(), vma_desc_pages() helpers
It's useful to be able to determine the size of a VMA descriptor range
used on f_op->mmap_prepare, expressed both in bytes and pages, so add
helpers for both and update code that could make use of it to do so.

Link: https://lkml.kernel.org/r/74ef338203c9ff08a9ace73a8f1f6116a79112a0.1760959442.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Pedro Falcato <pfalcato@suse.de>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Chatre, Reinette <reinette.chatre@intel.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Dave Martin <dave.martin@arm.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Dmitriy Vyukov <dvyukov@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guo Ren <guoren@kernel.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: James Morse <james.morse@arm.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Nicolas Pitre <nico@fluxnic.net>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Robin Murohy <robin.murphy@arm.com>
Cc: Sumanth Korikkar <sumanthk@linux.ibm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: "Uladzislau Rezki (Sony)" <urezki@gmail.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-16 17:28:11 -08:00
Ryan Roberts
9ac09bb9fe mm: consistently use current->mm in mm_get_unmapped_area()
mm_get_unmapped_area() is a wrapper around arch_get_unmapped_area() /
arch_get_unmapped_area_topdown(), both of which search current->mm for
some free space.  Neither take an mm_struct - they implicitly operate on
current->mm.

But the wrapper takes an mm_struct and uses it to decide whether to search
bottom up or top down.  All callers pass in current->mm for this, so
everything is working consistently.  But it feels like an accident waiting
to happen; eventually someone will call that function with a different mm,
expecting to find free space in it, but what gets returned is free space
in the current mm.

So let's simplify by removing the parameter and have the wrapper use
current->mm to decide which end to start at.  Now everything is consistent
and self-documenting.

Link: https://lkml.kernel.org/r/20251003155306.2147572-1-ryan.roberts@arm.com
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Dev Jain <dev.jain@arm.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-16 17:27:57 -08:00
Edward Adam Davis
9a6b60cb14 nilfs2: avoid having an active sc_timer before freeing sci
Because kthread_stop did not stop sc_task properly and returned -EINTR,
the sc_timer was not properly closed, ultimately causing the problem [1]
reported by syzbot when freeing sci due to the sc_timer not being closed.

Because the thread sc_task main function nilfs_segctor_thread() returns 0
when it succeeds, when the return value of kthread_stop() is not 0 in
nilfs_segctor_destroy(), we believe that it has not properly closed
sc_timer.

We use timer_shutdown_sync() to sync wait for sc_timer to shutdown, and
set the value of sc_task to NULL under the protection of lock
sc_state_lock, so as to avoid the issue caused by sc_timer not being
properly shutdowned.

[1]
ODEBUG: free active (active state 0) object: 00000000dacb411a object type: timer_list hint: nilfs_construction_timeout
Call trace:
 nilfs_segctor_destroy fs/nilfs2/segment.c:2811 [inline]
 nilfs_detach_log_writer+0x668/0x8cc fs/nilfs2/segment.c:2877
 nilfs_put_super+0x4c/0x12c fs/nilfs2/super.c:509

Link: https://lkml.kernel.org/r/20251029225226.16044-1-konishi.ryusuke@gmail.com
Fixes: 3f66cc261c ("nilfs2: use kthread_create and kthread_stop for the log writer thread")
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Reported-by: syzbot+24d8b70f039151f65590@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=24d8b70f039151f65590
Tested-by: syzbot+24d8b70f039151f65590@syzkaller.appspotmail.com
Signed-off-by: Edward Adam Davis <eadavis@qq.com>
Cc: <stable@vger.kernel.org>	[6.12+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-09 21:19:46 -08:00
Wei Yang
895b4c0c79 fs/proc: fix uaf in proc_readdir_de()
Pde is erased from subdir rbtree through rb_erase(), but not set the node
to EMPTY, which may result in uaf access.  We should use RB_CLEAR_NODE()
set the erased node to EMPTY, then pde_subdir_next() will return NULL to
avoid uaf access.

We found an uaf issue while using stress-ng testing, need to run testcase
getdent and tun in the same time.  The steps of the issue is as follows:

1) use getdent to traverse dir /proc/pid/net/dev_snmp6/, and current
   pde is tun3;

2) in the [time windows] unregister netdevice tun3 and tun2, and erase
   them from rbtree.  erase tun3 first, and then erase tun2.  the
   pde(tun2) will be released to slab;

3) continue to getdent process, then pde_subdir_next() will return
   pde(tun2) which is released, it will case uaf access.

CPU 0                                      |    CPU 1
-------------------------------------------------------------------------
traverse dir /proc/pid/net/dev_snmp6/      |   unregister_netdevice(tun->dev)   //tun3 tun2
sys_getdents64()                           |
  iterate_dir()                            |
    proc_readdir()                         |
      proc_readdir_de()                    |     snmp6_unregister_dev()
        pde_get(de);                       |       proc_remove()
        read_unlock(&proc_subdir_lock);    |         remove_proc_subtree()
                                           |           write_lock(&proc_subdir_lock);
        [time window]                      |           rb_erase(&root->subdir_node, &parent->subdir);
                                           |           write_unlock(&proc_subdir_lock);
        read_lock(&proc_subdir_lock);      |
        next = pde_subdir_next(de);        |
        pde_put(de);                       |
        de = next;    //UAF                |

rbtree of dev_snmp6
                        |
                    pde(tun3)
                     /    \
                  NULL  pde(tun2)

Link: https://lkml.kernel.org/r/20251025024233.158363-1-albin_yang@163.com
Signed-off-by: Wei Yang <albinwyang@tencent.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: wangzijie <wangzijie1@honor.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-09 21:19:43 -08:00
Linus Torvalds
7bb4d65125 Merge tag 'v6.18rc4-SMB-client-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull smb client fixes from Steve French:

 - Fix change notify packet validation check

 - Refcount fix (e.g. rename error paths)

 - Fix potential UAF due to missing locks on directory lease refcount

* tag 'v6.18rc4-SMB-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  smb: client: validate change notify buffer before copy
  smb: client: fix refcount leak in smb2_set_path_attr
  smb: client: fix potential UAF in smb2_close_cached_fid()
2025-11-08 10:17:30 -08:00
Linus Torvalds
e284d5118a Merge tag 'xfs-fixes-6.18-rc5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fixes from Carlos Maiolino:
 "This contain fixes for the RT and zoned allocator, and a few fixes for
  atomic writes"

* tag 'xfs-fixes-6.18-rc5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: free xfs_busy_extents structure when no RT extents are queued
  xfs: fix zone selection in xfs_select_open_zone_mru
  xfs: fix a rtgroup leak when xfs_init_zone fails
  xfs: fix various problems in xfs_atomic_write_cow_iomap_begin
  xfs: fix delalloc write failures in software-provided atomic writes
2025-11-08 08:43:01 -08:00
Joshua Rogers
4012abe8a7 smb: client: validate change notify buffer before copy
SMB2_change_notify called smb2_validate_iov() but ignored the return
code, then kmemdup()ed using server provided OutputBufferOffset/Length.

Check the return of smb2_validate_iov() and bail out on error.

Discovered with help from the ZeroPath security tooling.

Signed-off-by: Joshua Rogers <linux@joshua.hu>
Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
Cc: stable@vger.kernel.org
Fixes: e3e9463414 ("smb3: improve SMB3 change notification support")
Signed-off-by: Steve French <stfrench@microsoft.com>
2025-11-07 10:15:43 -06:00
Linus Torvalds
cff0a1be08 Merge tag 'v6.18-rc4-smb-server-fixes' of git://git.samba.org/ksmbd
Pull smb server fixes from Steve French:

 - More safely detect RDMA capable devices correctly

* tag 'v6.18-rc4-smb-server-fixes' of git://git.samba.org/ksmbd:
  ksmbd: detect RDMA capable netdevs include IPoIB
  ksmbd: detect RDMA capable lower devices when bridge and vlan netdev is used
2025-11-07 07:39:57 -08:00
Linus Torvalds
c668da99b9 Merge tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/linux
Pull fscrypt fix from Eric Biggers:
 "Fix an UBSAN warning that started occurring when the block layer
  started supporting logical_block_size > PAGE_SIZE"

* tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/linux:
  fscrypt: fix left shift underflow when inode->i_blkbits > PAGE_SHIFT
2025-11-06 12:45:15 -08:00
Christoph Hellwig
d8a823c6f0 xfs: free xfs_busy_extents structure when no RT extents are queued
kmemleak occasionally reports leaking xfs_busy_extents structure
from xfs_scrub calls after running xfs/528 (but attributed to following
tests), which seems to be caused by not freeing the xfs_busy_extents
structure when tr.queued is 0 and xfs_trim_rtgroup_extents breaks out
of the main loop.  Free the structure in this case.

Fixes: a3315d1130 ("xfs: use rtgroup busy extent list for FITRIM")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-11-06 08:59:19 +01:00
Christoph Hellwig
21ab5179aa xfs: fix zone selection in xfs_select_open_zone_mru
xfs_select_open_zone_mru needs to pass XFS_ZONE_ALLOC_OK to
xfs_try_use_zone because we only want to tightly pack into zones of the
same or a compatible temperature instead of any available zone.

This got broken in commit 0301dae732 ("xfs: refactor hint based zone
allocation"), which failed to update this particular caller when
switching to an enum.  xfs/638 sometimes, but not reliably fails due to
this change.

Fixes: 0301dae732 ("xfs: refactor hint based zone allocation")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-11-05 16:54:38 +01:00
Christoph Hellwig
f5714a3c1a xfs: fix a rtgroup leak when xfs_init_zone fails
Drop the rtgrop reference when xfs_init_zone fails for a conventional
device.

Fixes: 4e4d520755 ("xfs: add the zoned space allocator")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-11-05 16:53:49 +01:00
Darrick J. Wong
8d7bba1e83 xfs: fix various problems in xfs_atomic_write_cow_iomap_begin
I think there are several things wrong with this function:

A) xfs_bmapi_write can return a much larger unwritten mapping than what
   the caller asked for.  We convert part of that range to written, but
   return the entire written mapping to iomap even though that's
   inaccurate.

B) The arguments to xfs_reflink_convert_cow_locked are wrong -- an
   unwritten mapping could be *smaller* than the write range (or even
   the hole range).  In this case, we convert too much file range to
   written state because we then return a smaller mapping to iomap.

C) It doesn't handle delalloc mappings.  This I covered in the patch
   that I already sent to the list.

D) Reassigning count_fsb to handle the hole means that if the second
   cmap lookup attempt succeeds (due to racing with someone else) we
   trim the mapping more than is strictly necessary.  The changing
   meaning of count_fsb makes this harder to notice.

E) The tracepoint is kinda wrong because @length is mutated.  That makes
   it harder to chase the data flows through this function because you
   can't just grep on the pos/bytecount strings.

F) We don't actually check that the br_state = XFS_EXT_NORM assignment
   is accurate, i.e that the cow fork actually contains a written
   mapping for the range we're interested in

G) Somewhat inadequate documentation of why we need to xfs_trim_extent
   so aggressively in this function.

H) Not sure why xfs_iomap_end_fsb is used here, the vfs already clamped
   the write range to s_maxbytes.

Fix these issues, and then the atomic writes regressions in generic/760,
generic/617, generic/091, generic/263, and generic/521 all go away for
me.

Cc: stable@vger.kernel.org # v6.16
Fixes: bd1d2c21d5 ("xfs: add xfs_atomic_write_cow_iomap_begin()")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-11-05 16:52:49 +01:00
Darrick J. Wong
8d54eacd82 xfs: fix delalloc write failures in software-provided atomic writes
With the 20 Oct 2025 release of fstests, generic/521 fails for me on
regular (aka non-block-atomic-writes) storage:

QA output created by 521
dowrite: write: Input/output error
LOG DUMP (8553 total operations):
1(  1 mod 256): SKIPPED (no operation)
2(  2 mod 256): WRITE    0x7e000 thru 0x8dfff	(0x10000 bytes) HOLE
3(  3 mod 256): READ     0x69000 thru 0x79fff	(0x11000 bytes)
4(  4 mod 256): FALLOC   0x53c38 thru 0x5e853	(0xac1b bytes) INTERIOR
5(  5 mod 256): COPY 0x55000 thru 0x59fff	(0x5000 bytes) to 0x25000 thru 0x29fff
6(  6 mod 256): WRITE    0x74000 thru 0x88fff	(0x15000 bytes)
7(  7 mod 256): ZERO     0xedb1 thru 0x11693	(0x28e3 bytes)

with a warning in dmesg from iomap about XFS trying to give it a
delalloc mapping for a directio write.  Fix the software atomic write
iomap_begin code to convert the reservation into a written mapping.
This doesn't fix the data corruption problems reported by generic/760,
but it's a start.

Cc: stable@vger.kernel.org # v6.16
Fixes: bd1d2c21d5 ("xfs: add xfs_atomic_write_cow_iomap_begin()")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-11-05 16:52:49 +01:00
Yongpeng Yang
1e39da974c fscrypt: fix left shift underflow when inode->i_blkbits > PAGE_SHIFT
When simulating an nvme device on qemu with both logical_block_size and
physical_block_size set to 8 KiB, an error trace appears during
partition table reading at boot time. The issue is caused by
inode->i_blkbits being larger than PAGE_SHIFT, which leads to a left
shift of -1 and triggering a UBSAN warning.

[    2.697306] ------------[ cut here ]------------
[    2.697309] UBSAN: shift-out-of-bounds in fs/crypto/inline_crypt.c:336:37
[    2.697311] shift exponent -1 is negative
[    2.697315] CPU: 3 UID: 0 PID: 274 Comm: (udev-worker) Not tainted 6.18.0-rc2+ #34 PREEMPT(voluntary)
[    2.697317] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[    2.697320] Call Trace:
[    2.697324]  <TASK>
[    2.697325]  dump_stack_lvl+0x76/0xa0
[    2.697340]  dump_stack+0x10/0x20
[    2.697342]  __ubsan_handle_shift_out_of_bounds+0x1e3/0x390
[    2.697351]  bh_get_inode_and_lblk_num.cold+0x12/0x94
[    2.697359]  fscrypt_set_bio_crypt_ctx_bh+0x44/0x90
[    2.697365]  submit_bh_wbc+0xb6/0x190
[    2.697370]  block_read_full_folio+0x194/0x270
[    2.697371]  ? __pfx_blkdev_get_block+0x10/0x10
[    2.697375]  ? __pfx_blkdev_read_folio+0x10/0x10
[    2.697377]  blkdev_read_folio+0x18/0x30
[    2.697379]  filemap_read_folio+0x40/0xe0
[    2.697382]  filemap_get_pages+0x5ef/0x7a0
[    2.697385]  ? mmap_region+0x63/0xd0
[    2.697389]  filemap_read+0x11d/0x520
[    2.697392]  blkdev_read_iter+0x7c/0x180
[    2.697393]  vfs_read+0x261/0x390
[    2.697397]  ksys_read+0x71/0xf0
[    2.697398]  __x64_sys_read+0x19/0x30
[    2.697399]  x64_sys_call+0x1e88/0x26a0
[    2.697405]  do_syscall_64+0x80/0x670
[    2.697410]  ? __x64_sys_newfstat+0x15/0x20
[    2.697414]  ? x64_sys_call+0x204a/0x26a0
[    2.697415]  ? do_syscall_64+0xb8/0x670
[    2.697417]  ? irqentry_exit_to_user_mode+0x2e/0x2a0
[    2.697420]  ? irqentry_exit+0x43/0x50
[    2.697421]  ? exc_page_fault+0x90/0x1b0
[    2.697422]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[    2.697425] RIP: 0033:0x75054cba4a06
[    2.697426] Code: 5d e8 41 8b 93 08 03 00 00 59 5e 48 83 f8 fc 75 19 83 e2 39 83 fa 08 75 11 e8 26 ff ff ff 66 0f 1f 44 00 00 48 8b 45 10 0f 05 <48> 8b 5d f8 c9 c3 0f 1f 40 00 f3 0f 1e fa 55 48 89 e5 48 83 ec 08
[    2.697427] RSP: 002b:00007fff973723a0 EFLAGS: 00000202 ORIG_RAX: 0000000000000000
[    2.697430] RAX: ffffffffffffffda RBX: 00005ea9a2c02760 RCX: 000075054cba4a06
[    2.697432] RDX: 0000000000002000 RSI: 000075054c190000 RDI: 000000000000001b
[    2.697433] RBP: 00007fff973723c0 R08: 0000000000000000 R09: 0000000000000000
[    2.697434] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000000
[    2.697434] R13: 00005ea9a2c027c0 R14: 00005ea9a2be5608 R15: 00005ea9a2be55f0
[    2.697436]  </TASK>
[    2.697436] ---[ end trace ]---

This situation can happen for block devices because when
CONFIG_TRANSPARENT_HUGEPAGE is enabled, the maximum logical_block_size
is 64 KiB. set_init_blocksize() then sets the block device
inode->i_blkbits to 13, which is within this limit.

File I/O does not trigger this problem because for filesystems that do
not support the FS_LBS feature, sb_set_blocksize() prevents
sb->s_blocksize_bits from being larger than PAGE_SHIFT. During inode
allocation, alloc_inode()->inode_init_always() assigns inode->i_blkbits
from sb->s_blocksize_bits. Currently, only xfs_fs_type has the FS_LBS
flag, and since xfs I/O paths do not reach submit_bh_wbc(), it does not
hit the left-shift underflow issue.

Signed-off-by: Yongpeng Yang <yangyongpeng@xiaomi.com>
Fixes: 47dd675323 ("block/bdev: lift block size restrictions to 64k")
Cc: stable@vger.kernel.org
[EB: use folio_pos() and consolidate the two shifts by i_blkbits]
Link: https://lore.kernel.org/r/20251105003642.42796-1-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
2025-11-04 16:37:38 -08:00
Shuhao Fu
b540de9e3b smb: client: fix refcount leak in smb2_set_path_attr
Fix refcount leak in `smb2_set_path_attr` when path conversion fails.

Function `cifs_get_writable_path` returns `cfile` with its reference
counter `cfile->count` increased on success. Function `smb2_compound_op`
would decrease the reference counter for `cfile`, as stated in its
comment. By calling `smb2_rename_path`, the reference counter of `cfile`
would leak if `cifs_convert_path_to_utf16` fails in `smb2_set_path_attr`.

Fixes: 8de9e86c67 ("cifs: create a helper to find a writeable handle by path name")
Acked-by: Henrique Carvalho <henrique.carvalho@suse.com>
Signed-off-by: Shuhao Fu <sfual@cse.ust.hk>
Signed-off-by: Steve French <stfrench@microsoft.com>
2025-11-04 16:03:56 -06:00
Henrique Carvalho
734e99623c smb: client: fix potential UAF in smb2_close_cached_fid()
find_or_create_cached_dir() could grab a new reference after kref_put()
had seen the refcount drop to zero but before cfid_list_lock is acquired
in smb2_close_cached_fid(), leading to use-after-free.

Switch to kref_put_lock() so cfid_release() is called with
cfid_list_lock held, closing that gap.

Fixes: ebe98f1447 ("cifs: enable caching of directories for which a lease is held")
Cc: stable@vger.kernel.org
Reported-by: Jay Shin <jaeshin@redhat.com>
Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
Signed-off-by: Henrique Carvalho <henrique.carvalho@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2025-11-04 08:53:28 -06:00
Namjae Jeon
e6187655ac ksmbd: detect RDMA capable netdevs include IPoIB
Current ksmbd_rdma_capable_netdev fails to mark certain RDMA-capable
inerfaces such as IPoIB as RDMA capable after reverting GUID matching code
due to layer violation.
This patch check the ARPHRD_INFINIBAND type safely identifies an IPoIB
interface without introducing a layer violation, ensuring RDMA
functionality is correctly enabled for these interfaces.

Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2025-11-04 08:47:55 -06:00
Namjae Jeon
d24822e147 ksmbd: detect RDMA capable lower devices when bridge and vlan netdev is used
If user set bridge interface as actual RDMA-capable NICs are lower devices,
ksmbd can not detect as RDMA capable. This patch can detect the RDMA
capable lower devices from bridge master or VLAN. With this change, ksmbd
can accept both TCP and RDMA connections through the same bridge IP
address, allowing mixed transport operation without requiring separate
interfaces.

Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2025-11-04 08:47:55 -06:00
Linus Torvalds
c9cfc122f0 Merge tag 'for-6.18-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:

 - fix memory leak in qgroup relation ioctl when qgroup levels are
   invalid

 - don't write back dirty metadata on filesystem with errors

 - properly log renamed links

 - properly mark prealloc extent range beyond inode size as dirty (when
   no-noles is not enabled)

* tag 'for-6.18-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: mark dirty extent range for out of bound prealloc extents
  btrfs: set inode flag BTRFS_INODE_COPY_EVERYTHING when logging new name
  btrfs: fix memory leak of qgroup_list in btrfs_add_qgroup_relation
  btrfs: ensure no dirty metadata is written back for an fs with errors
2025-11-04 14:25:38 +09:00
Linus Torvalds
9db0d7c5a5 Merge tag 'xfs-fixes-6.18-rc4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fixes from Carlos Maiolino:
 "Just a single bug fix (and documentation for the issue)"

* tag 'xfs-fixes-6.18-rc4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: document another racy GC case in xfs_zoned_map_extent
  xfs: prevent gc from picking the same zone twice
2025-11-01 10:04:35 -07:00
Linus Torvalds
b909d77359 Merge tag '6.18-rc3-smb-client-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull smb client fixes from Steve French:

 - fix potential UAF in statfs

 - DFS fix for expired referrals

 - fix minor modinfo typo

 - small improvement to reconnect for smbdirect

* tag '6.18-rc3-smb-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  smb: client: call smbd_destroy() in the same splace as kernel_sock_shutdown()/sock_release()
  smb: client: handle lack of IPC in dfs_cache_refresh()
  smb: client: fix potential cfid UAF in smb2_query_info_compound
  cifs: fix typo in enable_gcm_256 module parameter
2025-10-31 09:34:21 -07:00
Christoph Hellwig
0db22d7ee4 xfs: document another racy GC case in xfs_zoned_map_extent
Besides blocks being invalidated, there is another case when the original
mapping could have changed between querying the rmap for GC and calling
xfs_zoned_map_extent.  Document it there as it took us quite some time
to figure out what is going on while developing the multiple-GC
protection fix.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-10-31 12:06:03 +01:00
Christoph Hellwig
83bac569c7 xfs: prevent gc from picking the same zone twice
When we are picking a zone for gc it might already be in the pipeline
which can lead to us moving the same data twice resulting in in write
amplification and a very unfortunate case where we keep on garbage
collecting the zone we just filled with migrated data stopping all
forward progress.

Fix this by introducing a count of on-going GC operations on a zone, and
skip any zone with ongoing GC when picking a new victim.

Fixes: 080d01c41 ("xfs: implement zoned garbage collection")
Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Co-developed-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Tested-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-10-31 12:06:03 +01:00
austinchang
3b1a4a59a2 btrfs: mark dirty extent range for out of bound prealloc extents
In btrfs_fallocate(), when the allocated range overlaps with a prealloc
extent and the extent starts after i_size, the range doesn't get marked
dirty in file_extent_tree. This results in persisting an incorrect
disk_i_size for the inode when not using the no-holes feature.

This is reproducible since commit 41a2ee75aa ("btrfs: introduce
per-inode file extent tree"), then became hidden since commit 3d7db6e8bd
("btrfs: don't allocate file extent tree for non regular files") and then
visible again after commit 8679d2687c ("btrfs: initialize
inode::file_extent_tree after i_mode has been set"), which fixes the
previous commit.

The following reproducer triggers the problem:

$ cat test.sh

MNT=/mnt/test
DEV=/dev/vdb

mkdir -p $MNT

mkfs.btrfs -f -O ^no-holes $DEV
mount $DEV $MNT

touch $MNT/file1
fallocate -n -o 1M -l 2M $MNT/file1

umount $MNT
mount $DEV $MNT

len=$((1 * 1024 * 1024))

fallocate -o 1M -l $len $MNT/file1

du --bytes $MNT/file1

umount $MNT
mount $DEV $MNT

du --bytes $MNT/file1

umount $MNT

Running the reproducer gives the following result:

$ ./test.sh
(...)
2097152 /mnt/test/file1
1048576 /mnt/test/file1

The difference is exactly 1048576 as we assigned.

Fix by adding a call to btrfs_inode_set_file_extent_range() in
btrfs_fallocate_update_isize().

Fixes: 41a2ee75aa ("btrfs: introduce per-inode file extent tree")
Signed-off-by: austinchang <austinchang@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-10-30 19:18:18 +01:00
Filipe Manana
953902e4fb btrfs: set inode flag BTRFS_INODE_COPY_EVERYTHING when logging new name
If we are logging a new name make sure our inode has the runtime flag
BTRFS_INODE_COPY_EVERYTHING set so that at btrfs_log_inode() we will find
new inode refs/extrefs in the subvolume tree and copy them into the log
tree.

We are currently doing it when adding a new link but we are missing it
when renaming.

An example where this makes a new name not persisted:

  1) create symlink with name foo in directory A
  2) fsync directory A, which persists the symlink
  3) rename the symlink from foo to bar
  4) fsync directory A to persist the new symlink name

Step 4 isn't working correctly as it's not logging the new name and also
leaving the old inode ref in the log tree, so after a power failure the
symlink still has the old name of "foo". This is because when we first
fsync directoy A we log the symlink's inode (as it's a new entry) and at
btrfs_log_inode() we set the log mode to LOG_INODE_ALL and then because
we are using that mode and the inode has the runtime flag
BTRFS_INODE_NEEDS_FULL_SYNC set, we clear that flag as well as the flag
BTRFS_INODE_COPY_EVERYTHING. That means the next time we log the inode,
during the rename through the call to btrfs_log_new_name() (calling
btrfs_log_inode_parent() and then btrfs_log_inode()), we will not search
the subvolume tree for new refs/extrefs and jump directory to the
'log_extents' label.

Fix this by making sure we set BTRFS_INODE_COPY_EVERYTHING on an inode
when we are about to log a new name. A test case for fstests will follow
soon.

Reported-by: Vyacheslav Kovalevsky <slava.kovalevskiy.2014@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/ac949c74-90c2-4b9a-b7fd-1ffc5c3175c7@gmail.com/
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-10-30 19:17:33 +01:00
Shardul Bankar
f260c6aff0 btrfs: fix memory leak of qgroup_list in btrfs_add_qgroup_relation
When btrfs_add_qgroup_relation() is called with invalid qgroup levels
(src >= dst), the function returns -EINVAL directly without freeing the
preallocated qgroup_list structure passed by the caller. This causes a
memory leak because the caller unconditionally sets the pointer to NULL
after the call, preventing any cleanup.

The issue occurs because the level validation check happens before the
mutex is acquired and before any error handling path that would free
the prealloc pointer. On this early return, the cleanup code at the
'out' label (which includes kfree(prealloc)) is never reached.

In btrfs_ioctl_qgroup_assign(), the code pattern is:

    prealloc = kzalloc(sizeof(*prealloc), GFP_KERNEL);
    ret = btrfs_add_qgroup_relation(trans, sa->src, sa->dst, prealloc);
    prealloc = NULL;  // Always set to NULL regardless of return value
    ...
    kfree(prealloc);  // This becomes kfree(NULL), does nothing

When the level check fails, 'prealloc' is never freed by either the
callee or the caller, resulting in a 64-byte memory leak per failed
operation. This can be triggered repeatedly by an unprivileged user
with access to a writable btrfs mount, potentially exhausting kernel
memory.

Fix this by freeing prealloc before the early return, ensuring prealloc
is always freed on all error paths.

Fixes: 4addc1ffd6 ("btrfs: qgroup: preallocate memory before adding a relation")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Shardul Bankar <shardulsb08@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-10-30 19:16:06 +01:00
Qu Wenruo
2618849f31 btrfs: ensure no dirty metadata is written back for an fs with errors
[BUG]
During development of a minor feature (make sure all btrfs_bio::end_io()
is called in task context), I noticed a crash in generic/388, where
metadata writes triggered new works after btrfs_stop_all_workers().

It turns out that it can even happen without any code modification, just
using RAID5 for metadata and the same workload from generic/388 is going
to trigger the use-after-free.

[CAUSE]
If btrfs hits an error, the fs is marked as error, no new
transaction is allowed thus metadata is in a frozen state.

But there are some metadata modifications before that error, and they are
still in the btree inode page cache.

Since there will be no real transaction commit, all those dirty folios
are just kept as is in the page cache, and they can not be invalidated
by invalidate_inode_pages2() call inside close_ctree(), because they are
dirty.

And finally after btrfs_stop_all_workers(), we call iput() on btree
inode, which triggers writeback of those dirty metadata.

And if the fs is using RAID56 metadata, this will trigger RMW and queue
new works into rmw_workers, which is already stopped, causing warning
from queue_work() and use-after-free.

[FIX]
Add a special handling for write_one_eb(), that if the fs is already in
an error state, immediately mark the bbio as failure, instead of really
submitting them.

Then during close_ctree(), iput() will just discard all those dirty
tree blocks without really writing them back, thus no more new jobs for
already stopped-and-freed workqueues.

The extra discard in write_one_eb() also acts as an extra safenet.
E.g. the transaction abort is triggered by some extent/free space
tree corruptions, and since extent/free space tree is already corrupted
some tree blocks may be allocated where they shouldn't be (overwriting
existing tree blocks). In that case writing them back will further
corrupting the fs.

CC: stable@vger.kernel.org # 6.6+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-10-30 19:16:01 +01:00
Stefan Metzmacher
895ad6f708 smb: client: call smbd_destroy() in the same splace as kernel_sock_shutdown()/sock_release()
With commit b0432201a1 ("smb: client: let destroy_mr_list() keep
smbdirect_mr_io memory if registered") the changes from commit
214bab4484 ("cifs: Call MID callback before destroying transport") and
commit 1d2a4f57ce ("cifs:smbd When reconnecting to server, call
smbd_destroy() after all MIDs have been called") are no longer needed.

And it's better to use the same logic flow, so that
the chance of smbdirect related problems is smaller.

Fixes: 214bab4484 ("cifs: Call MID callback before destroying transport")
Fixes: 1d2a4f57ce ("cifs:smbd When reconnecting to server, call smbd_destroy() after all MIDs have been called")
Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Long Li <longli@microsoft.com>
Cc: Namjae Jeon <linkinjeon@kernel.org>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2025-10-29 20:13:13 -05:00
Paulo Alcantara
fac56c4651 smb: client: handle lack of IPC in dfs_cache_refresh()
In very rare cases, DFS mounts could end up with SMB sessions without
any IPC connections.  These mounts are only possible when having
unexpired cached DFS referrals, hence not requiring any IPC
connections during the mount process.

Try to establish those missing IPC connections when refreshing DFS
referrals.  If the server is still rejecting it, then simply ignore
and leave expired cached DFS referral for any potential DFS failovers.

Reported-by: Jay Shin <jaeshin@redhat.com>
Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
Cc: David Howells <dhowells@redhat.com>
Cc: linux-cifs@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
2025-10-29 20:13:05 -05:00
Linus Torvalds
e53642b87a Merge tag 'v6.18-rc3-smb-server-fixes' of git://git.samba.org/ksmbd
Pull smb server fixes from Steve French:

 - Improve check for malformed payload

 - Fix free transport smbdirect potential race

 - Fix potential race in credit allocation during smbdirect negotiation

* tag 'v6.18-rc3-smb-server-fixes' of git://git.samba.org/ksmbd:
  smb: server: let smb_direct_cm_handler() call ib_drain_qp() after smb_direct_disconnect_rdma_work()
  smb: server: call smb_direct_post_recv_credits() when the negotiation is done
  ksmbd: transport_ipc: validate payload size before reading handle
2025-10-28 18:00:40 -07:00
Linus Torvalds
8eefed8f65 Merge tag 'nfsd-6.18-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull nfsd fixes from Chuck Lever:
 "Regression fixes:

   - Revert the patch that removed the cap on MAX_OPS_PER_COMPOUND

   - Address a kernel build issue

  Stable fixes:

   - Fix crash when a client queries new attributes on forechannel

   - Fix rare NFSD crash when tracing is enabled"

* tag 'nfsd-6.18-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
  Revert "NFSD: Remove the cap on number of operations per NFSv4 COMPOUND"
  nfsd: Avoid strlen conflict in nfsd4_encode_components_esc()
  NFSD: Fix crash in nfsd4_read_release()
  NFSD: Define actions for the new time_deleg FATTR4 attributes
2025-10-28 12:13:20 -07:00
Henrique Carvalho
5c76f9961c smb: client: fix potential cfid UAF in smb2_query_info_compound
When smb2_query_info_compound() retries, a previously allocated cfid may
have been freed in the first attempt.
Because cfid wasn't reset on replay, later cleanup could act on a stale
pointer, leading to a potential use-after-free.

Reinitialize cfid to NULL under the replay label.

Example trace (trimmed):

refcount_t: underflow; use-after-free.
WARNING: CPU: 1 PID: 11224 at ../lib/refcount.c:28 refcount_warn_saturate+0x9c/0x110
[...]
RIP: 0010:refcount_warn_saturate+0x9c/0x110
[...]
Call Trace:
 <TASK>
 smb2_query_info_compound+0x29c/0x5c0 [cifs f90b72658819bd21c94769b6a652029a07a7172f]
 ? step_into+0x10d/0x690
 ? __legitimize_path+0x28/0x60
 smb2_queryfs+0x6a/0xf0 [cifs f90b72658819bd21c94769b6a652029a07a7172f]
 smb311_queryfs+0x12d/0x140 [cifs f90b72658819bd21c94769b6a652029a07a7172f]
 ? kmem_cache_alloc+0x18a/0x340
 ? getname_flags+0x46/0x1e0
 cifs_statfs+0x9f/0x2b0 [cifs f90b72658819bd21c94769b6a652029a07a7172f]
 statfs_by_dentry+0x67/0x90
 vfs_statfs+0x16/0xd0
 user_statfs+0x54/0xa0
 __do_sys_statfs+0x20/0x50
 do_syscall_64+0x58/0x80

Cc: stable@kernel.org
Fixes: 4f1fffa237 ("cifs: commands that are retried should have replay flag set")
Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
Acked-by: Shyam Prasad N <sprasad@microsoft.com>
Reviewed-by: Enzo Matsumiya <ematsumiya@suse.de>
Signed-off-by: Henrique Carvalho <henrique.carvalho@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2025-10-28 08:41:36 -05:00
Stefan Metzmacher
f574069c5c smb: server: let smb_direct_cm_handler() call ib_drain_qp() after smb_direct_disconnect_rdma_work()
All handlers triggered by ib_drain_qp() should already see the
broken connection.

smb_direct_cm_handler() is called under a mutex of the rdma_cm,
we should make sure ib_drain_qp() and all rdma layer logic completes
and unlocks the mutex.

It means free_transport() will also already see the connection
as SMBDIRECT_SOCKET_DISCONNECTED, so we need to call
crdma_[un]lock_handler(sc->rdma.cm_id) around
ib_drain_qp(), rdma_destroy_qp(), ib_free_cq() and ib_dealloc_pd().

Otherwise we free resources while the ib_drain_qp() within
smb_direct_cm_handler() is still running.

We have to unlock before rdma_destroy_id() as it locks again.

Fixes: 141fa9824c ("ksmbd: call ib_drain_qp when disconnected")
Fixes: 4c564f03e2 ("smb: server: make use of common smbdirect_socket")
Cc: Namjae Jeon <linkinjeon@kernel.org>
Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2025-10-26 20:47:32 -05:00
Stefan Metzmacher
975f05a764 smb: server: call smb_direct_post_recv_credits() when the negotiation is done
We now activate sc->recv_io.posted.refill_work and sc->idle.immediate_work
only after a successful negotiation, before sending the negotiation
response.

It means the queue_work(sc->workqueue, &sc->recv_io.posted.refill_work)
in put_recvmsg() of the negotiate request, is a no-op now.

It also means our explicit smb_direct_post_recv_credits() will
have queue_work(sc->workqueue, &sc->idle.immediate_work) as no-op.

This should make sure we don't have races and post any immediate
data_transfer message that tries to grant credits to the peer,
before we send the negotiation response, as that will grant
the initial credits to the peer.

Fixes: 0626e6641f ("cifsd: add server handler for central processing and tranport layers")
Fixes: 1cde0a74a7 ("smb: server: don't use delayed_work for post_recv_credits_work")
Cc: Namjae Jeon <linkinjeon@kernel.org>
Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2025-10-26 20:47:32 -05:00
Qianchang Zhao
6f40e50ceb ksmbd: transport_ipc: validate payload size before reading handle
handle_response() dereferences the payload as a 4-byte handle without
verifying that the declared payload size is at least 4 bytes. A malformed
or truncated message from ksmbd.mountd can lead to a 4-byte read past the
declared payload size. Validate the size before dereferencing.

This is a minimal fix to guard the initial handle read.

Fixes: 0626e6641f ("cifsd: add server handler for central processing and tranport layers")
Cc: stable@vger.kernel.org
Reported-by: Qianchang Zhao <pioooooooooip@gmail.com>
Signed-off-by: Qianchang Zhao <pioooooooooip@gmail.com>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2025-10-26 20:47:32 -05:00
Steve French
f765fdfcd8 cifs: fix typo in enable_gcm_256 module parameter
Fix typo in description of enable_gcm_256 module parameter

Suggested-by: Thomas Spear <speeddymon@gmail.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2025-10-26 18:59:36 -05:00
Linus Torvalds
dbfc6422a3 Merge tag 'x86_urgent_for_v6.18_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Borislav Petkov:

 - Remove dead code leftovers after a recent mitigations cleanup which
   fail a Clang build

 - Make sure a Retbleed mitigation message is printed only when
   necessary

 - Correct the last Zen1 microcode revision for which Entrysign sha256
   check is needed

 - Fix a NULL ptr deref when mounting the resctrl fs on a system which
   supports assignable counters but where L3 total and local bandwidth
   monitoring has been disabled at boot

* tag 'x86_urgent_for_v6.18_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/bugs: Remove dead code which might prevent from building
  x86/bugs: Qualify RETBLEED_INTEL_MSG
  x86/microcode: Fix Entrysign revision check for Zen1/Naples
  x86,fs/resctrl: Fix NULL pointer dereference with events force-disabled in mbm_event mode
2025-10-26 09:57:18 -07:00
Linus Torvalds
72761a7e31 Merge tag 'driver-core-6.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core
Pull driver core fixes from Danilo Krummrich:

 - In Device::parent(), do not make any assumptions on the device
   context of the parent device

 - Check visibility before changing ownership of a sysfs attribute
   group

 - In topology_parse_cpu_capacity(), replace an incorrect usage of
   PTR_ERR_OR_ZERO() with IS_ERR_OR_NULL()

 - In devcoredump, fix a circular locking dependency between
   struct devcd_entry::mutex and kernfs

 - Do not warn about a pending fw_devlink sync state

* tag 'driver-core-6.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core:
  arch_topology: Fix incorrect error check in topology_parse_cpu_capacity()
  rust: device: fix device context of Device::parent()
  sysfs: check visibility before changing group attribute ownership
  devcoredump: Fix circular locking dependency with devcd->mutex.
  driver core: fw_devlink: Don't warn about sync_state() pending
2025-10-25 11:03:46 -07:00
Linus Torvalds
27c0b5c4f6 Merge tag 'xfs-fixes-6.18-rc3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fixes from Carlos Maiolino:
 "The main highlight here is a fix for a bug brought in by the removal
  of attr2 mount option, where some installations might actually have
  'attr2' explicitly configured in fstab preventing system to boot by
  not being able to remount the rootfs as RW.

  Besides that there are a couple fix to the zonefs implementation,
  changing XFS_ONLINE_SCRUB_STATS to depend on DEBUG_FS (was select
  before), and some other minor changes"

* tag 'xfs-fixes-6.18-rc3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: fix locking in xchk_nlinks_collect_dir
  xfs: loudly complain about defunct mount options
  xfs: always warn about deprecated mount options
  xfs: don't set bt_nr_sectors to a negative number
  xfs: don't use __GFP_NOFAIL in xfs_init_fs_context
  xfs: cache open zone in inode->i_private
  xfs: avoid busy loops in GCD
  xfs: XFS_ONLINE_SCRUB_STATS should depend on DEBUG_FS
  xfs: do not tightly pack-write large files
  xfs: Improve CONFIG_XFS_RT Kconfig help
2025-10-25 09:31:13 -07:00
Linus Torvalds
566771afc7 Merge tag 'v6.18-rc2-smb-server-fixes' of git://git.samba.org/ksmbd
Pull smb server fixes from Steve French:
 "smbdirect (RDMA) fixes in order avoid potential submission queue
  overflows:

   - free transport teardown fix

   - credit related fixes (five server related, one client related)"

* tag 'v6.18-rc2-smb-server-fixes' of git://git.samba.org/ksmbd:
  smb: server: let free_transport() wait for SMBDIRECT_SOCKET_DISCONNECTED
  smb: client: make use of smbdirect_socket.send_io.lcredits.*
  smb: server: make use of smbdirect_socket.send_io.lcredits.*
  smb: server: simplify sibling_list handling in smb_direct_flush_send_list/send_done
  smb: server: smb_direct_disconnect_rdma_connection() already wakes all waiters on error
  smb: smbdirect: introduce smbdirect_socket.send_io.lcredits.*
  smb: server: allocate enough space for RW WRs and ib_drain_qp()
2025-10-24 18:50:15 -07:00
Linus Torvalds
f2b2465726 Merge tag '6.18-rc2-smb-client-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull smb client fixes from Steve French:

 - add missing tracepoints

 - smbdirect (RDMA) fix

 - fix potential issue with credits underflow

 - rename fix

 - improvement to calc_signature and additional cleanup patch

* tag '6.18-rc2-smb-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: #include cifsglob.h before trace.h to allow structs in tracepoints
  cifs: Call the calc_signature functions directly
  smb: client: get rid of d_drop() in cifs_do_rename()
  cifs: Fix TCP_Server_Info::credits to be signed
  cifs: Add a couple of missing smb3_rw_credits tracepoints
  smb: client: allocate enough space for MR WRs and ib_drain_qp()
2025-10-24 15:48:08 -07:00
Stefan Metzmacher
dd6940f5c7 smb: server: let free_transport() wait for SMBDIRECT_SOCKET_DISCONNECTED
We should wait for the rdma_cm to become SMBDIRECT_SOCKET_DISCONNECTED!

At least on the client side (with similar code)
wait_event_interruptible() often returns with -ERESTARTSYS instead of
waiting for SMBDIRECT_SOCKET_DISCONNECTED.
We should use wait_event() here too, which makes the code be identical
in client and server, which will help when moving to common functions.

Fixes: b31606097d ("smb: server: move smb_direct_disconnect_rdma_work() into free_transport()")
Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2025-10-23 20:58:51 -05:00
Linus Torvalds
942048d46a Merge tag 'for-6.18-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:

 - in send, fix duplicated rmdir operations when using extrefs
   (hardlinks), receive can fail with ENOENT

 - fixup of error check when reading extent root in ref-verify and
   damaged roots are allowed by mount option (found by smatch)

 - fix freeing partially initialized fs info (found by syzkaller)

 - fix use-after-free when printing ref_tracking status of delayed
   inodes

* tag 'for-6.18-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: ref-verify: fix IS_ERR() vs NULL check in btrfs_build_ref_tree()
  btrfs: fix delayed_node ref_tracker use after free
  btrfs: send: fix duplicated rmdir operations when using extrefs
  btrfs: directly free partially initialized fs_info in btrfs_check_leaked_roots()
2025-10-23 06:44:43 -10:00
David Howells
64c9471aa9 cifs: #include cifsglob.h before trace.h to allow structs in tracepoints
Make cifs #include cifsglob.h in advance of #including trace.h so that the
structures defined in cifsglob.h can be accessed directly by the cifs
tracepoints rather than the callers having to manually pass in the bits and
pieces.

This should allow the tracepoints to be made more efficient to use as well
as easier to read in the code.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Paulo Alcantara <pc@manguebit.org>
cc: linux-cifs@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
2025-10-23 02:47:20 -05:00
David Howells
4b1d7f6222 cifs: Call the calc_signature functions directly
As the SMB1 and SMB2/3 calc_signature functions are called from separate
sign and verify paths, just call them directly rather than using a function
pointer.  The SMB3 calc_signature then jumps to the SMB2 variant if
necessary.

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Enzo Matsumiya <ematsumiya@suse.de>
cc: Paulo Alcantara <pc@manguebit.org>
cc: Shyam Prasad N <sprasad@microsoft.com>
cc: Tom Talpey <tom@talpey.com>
cc: linux-cifs@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
2025-10-23 02:47:20 -05:00