Patch series "Add and use memdesc_flags_t".
At some point struct page will be separated from struct slab and struct
folio. This is a step towards that by introducing a type for the 'flags'
word of all three structures. This gives us a certain amount of type
safety by establishing that some of these unsigned longs are different
from other unsigned longs in that they contain things like node ID,
section number and zone number in the upper bits. That lets us have
functions that can be easily called by anyone who has a slab, folio or
page (but not easily by anyone else) to get the node or zone.
There's going to be some unusual merge problems with this as some odd bits
of the kernel decide they want to print out the flags value or something
similar by writing page->flags and now they'll need to write page->flags.f
instead. That's most of the churn here. Maybe we should be removing
these things from the debug output?
This patch (of 11):
Wrap the unsigned long flags in a typedef. In upcoming patches, this will
provide a strong hint that you can't just pass a random unsigned long to
functions which take this as an argument.
[willy@infradead.org: s/flags/flags.f/ in several architectures]
Link: https://lkml.kernel.org/r/aKMgPRLD-WnkPxYm@casper.infradead.org
[nicola.vetrini@gmail.com: mips: fix compilation error]
Link: https://lore.kernel.org/lkml/CA+G9fYvkpmqGr6wjBNHY=dRp71PLCoi2341JxOudi60yqaeUdg@mail.gmail.com/
Link: https://lkml.kernel.org/r/20250825214245.1838158-1-nicola.vetrini@gmail.com
Link: https://lkml.kernel.org/r/20250805172307.1302730-1-willy@infradead.org
Link: https://lkml.kernel.org/r/20250805172307.1302730-2-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Zi Yan <ziy@nvidia.com>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
When determining which THP orders are eligible for a VMA mapping, we have
previously specified tva_flags, however it turns out it is really not
necessary to treat these as flags.
Rather, we distinguish between distinct modes.
The only case where we previously combined flags was with
TVA_ENFORCE_SYSFS, but we can avoid this by observing that this is the
default, except for MADV_COLLAPSE or an edge cases in
collapse_pte_mapped_thp() and hugepage_vma_revalidate(), and adding a mode
specifically for this case - TVA_FORCED_COLLAPSE.
We have:
* smaps handling for showing "THPeligible"
* Pagefault handling
* khugepaged handling
* Forced collapse handling: primarily MADV_COLLAPSE, but also for
an edge case in collapse_pte_mapped_thp()
Disregarding the edge cases, we only want to ignore sysfs settings only
when we are forcing a collapse through MADV_COLLAPSE, otherwise we want to
enforce it, hence this patch does the following flag to enum conversions:
* TVA_SMAPS | TVA_ENFORCE_SYSFS -> TVA_SMAPS
* TVA_IN_PF | TVA_ENFORCE_SYSFS -> TVA_PAGEFAULT
* TVA_ENFORCE_SYSFS -> TVA_KHUGEPAGED
* 0 -> TVA_FORCED_COLLAPSE
With this change, we immediately know if we are in the forced collapse
case, which will be valuable next.
Link: https://lkml.kernel.org/r/20250815135549.130506-3-usamaarif642@gmail.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Usama Arif <usamaarif642@gmail.com>
Acked-by: Usama Arif <usamaarif642@gmail.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Barry Song <baohua@kernel.org>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Jann Horn <jannh@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Mariano Pache <npache@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Yafang <laoar.shao@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "prctl: extend PR_SET_THP_DISABLE to only provide THPs when
advised", v5.
This will allow individual processes to opt-out of THP = "always" into THP
= "madvise", without affecting other workloads on the system. This has
been extensively discussed on the mailing list and has been summarized
very well by David in the first patch which also includes the links to
alternatives, please refer to the first patch commit message for the
motivation for this series.
Patch 1 adds the PR_THP_DISABLE_EXCEPT_ADVISED flag to implement this,
along with the MMF changes.
Patch 2 is a cleanup patch for tva_flags that will allow the forced
collapse case to be transmitted to vma_thp_disabled (which is done in
patch 3).
Patch 4 adds documentation for PR_SET_THP_DISABLE/PR_GET_THP_DISABLE.
Patches 6-7 implement the selftests for PR_SET_THP_DISABLE for completely
disabling THPs (old behaviour) and only enabling it at advise
(PR_THP_DISABLE_EXCEPT_ADVISED).
This patch (of 7):
People want to make use of more THPs, for example, moving from the "never"
system policy to "madvise", or from "madvise" to "always".
While this is great news for every THP desperately waiting to get
allocated out there, apparently there are some workloads that require a
bit of care during that transition: individual processes may need to
opt-out from this behavior for various reasons, and this should be
permitted without needing to make all other workloads on the system
similarly opt-out.
The following scenarios are imaginable:
(1) Switch from "none" system policy to "madvise"/"always", but keep THPs
disabled for selected workloads.
(2) Stay at "none" system policy, but enable THPs for selected
workloads, making only these workloads use the "madvise" or "always"
policy.
(3) Switch from "madvise" system policy to "always", but keep the
"madvise" policy for selected workloads: allocate THPs only when
advised.
(4) Stay at "madvise" system policy, but enable THPs even when not advised
for selected workloads -- "always" policy.
Once can emulate (2) through (1), by setting the system policy to
"madvise"/"always" while disabling THPs for all processes that don't want
THPs. It requires configuring all workloads, but that is a user-space
problem to sort out.
(4) can be emulated through (3) in a similar way.
Back when (1) was relevant in the past, as people started enabling THPs,
we added PR_SET_THP_DISABLE, so relevant workloads that were not ready yet
(i.e., used by Redis) were able to just disable THPs completely. Redis
still implements the option to use this interface to disable THPs
completely.
With PR_SET_THP_DISABLE, we added a way to force-disable THPs for a
workload -- a process, including fork+exec'ed process hierarchy. That
essentially made us support (1): simply disable THPs for all workloads
that are not ready for THPs yet, while still enabling THPs system-wide.
The quest for handling (3) and (4) started, but current approaches
(completely new prctl, options to set other policies per process,
alternatives to prctl -- mctrl, cgroup handling) don't look particularly
promising. Likely, the future will use bpf or something similar to
implement better policies, in particular to also make better decisions
about THP sizes to use, but this will certainly take a while as that work
just started.
Long story short: a simple enable/disable is not really suitable for the
future, so we're not willing to add completely new toggles.
While we could emulate (3)+(4) through (1)+(2) by simply disabling THPs
completely for these processes, this is a step backwards, because these
processes can no longer allocate THPs in regions where THPs were
explicitly advised: regions flagged as VM_HUGEPAGE. Apparently, that
imposes a problem for relevant workloads, because "not THPs" is certainly
worse than "THPs only when advised".
Could we simply relax PR_SET_THP_DISABLE, to "disable THPs unless not
explicitly advised by the app through MAD_HUGEPAGE"? *maybe*, but this
would change the documented semantics quite a bit, and the versatility to
use it for debugging purposes, so I am not 100% sure that is what we want
-- although it would certainly be much easier.
So instead, as an easy way forward for (3) and (4), add an option to
make PR_SET_THP_DISABLE disable *less* THPs for a process.
In essence, this patch:
(A) Adds PR_THP_DISABLE_EXCEPT_ADVISED, to be used as a flag in arg3
of prctl(PR_SET_THP_DISABLE) when disabling THPs (arg2 != 0).
prctl(PR_SET_THP_DISABLE, 1, PR_THP_DISABLE_EXCEPT_ADVISED).
(B) Makes prctl(PR_GET_THP_DISABLE) return 3 if
PR_THP_DISABLE_EXCEPT_ADVISED was set while disabling.
Previously, it would return 1 if THPs were disabled completely. Now
it returns the set flags as well: 3 if PR_THP_DISABLE_EXCEPT_ADVISED
was set.
(C) Renames MMF_DISABLE_THP to MMF_DISABLE_THP_COMPLETELY, to express
the semantics clearly.
Fortunately, there are only two instances outside of prctl() code.
(D) Adds MMF_DISABLE_THP_EXCEPT_ADVISED to express "no THP except for VMAs
with VM_HUGEPAGE" -- essentially "thp=madvise" behavior
Fortunately, we only have to extend vma_thp_disabled().
(E) Indicates "THP_enabled: 0" in /proc/pid/status only if THPs are
disabled completely
Only indicating that THPs are disabled when they are really disabled
completely, not only partially.
For now, we don't add another interface to obtained whether THPs
are disabled partially (PR_THP_DISABLE_EXCEPT_ADVISED was set). If
ever required, we could add a new entry.
The documented semantics in the man page for PR_SET_THP_DISABLE "is
inherited by a child created via fork(2) and is preserved across
execve(2)" is maintained. This behavior, for example, allows for
disabling THPs for a workload through the launching process (e.g., systemd
where we fork() a helper process to then exec()).
For now, MADV_COLLAPSE will *fail* in regions without VM_HUGEPAGE and
VM_NOHUGEPAGE. As MADV_COLLAPSE is a clear advise that user space thinks
a THP is a good idea, we'll enable that separately next (requiring a bit
of cleanup first).
There is currently not way to prevent that a process will not issue
PR_SET_THP_DISABLE itself to re-enable THP. There are not really known
users for re-enabling it, and it's against the purpose of the original
interface. So if ever required, we could investigate just forbidding to
re-enable them, or make this somehow configurable.
Link: https://lkml.kernel.org/r/20250815135549.130506-1-usamaarif642@gmail.com
Link: https://lkml.kernel.org/r/20250815135549.130506-2-usamaarif642@gmail.com
Acked-by: Zi Yan <ziy@nvidia.com>
Acked-by: Usama Arif <usamaarif642@gmail.com>
Tested-by: Usama Arif <usamaarif642@gmail.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Signed-off-by: Usama Arif <usamaarif642@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Jann Horn <jannh@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Mariano Pache <npache@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Yafang <laoar.shao@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Pull misc fixes from Andrew Morton:
"17 hotfixes. 13 are cc:stable and the remainder address post-6.16
issues or aren't considered necessary for -stable kernels. 11 of these
fixes are for MM.
This includes a three-patch series from Harry Yoo which fixes an
intermittent boot failure which can occur on x86 systems. And a
two-patch series from Alexander Gordeev which fixes a KASAN crash on
S390 systems"
* tag 'mm-hotfixes-stable-2025-09-01-17-20' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
mm: fix possible deadlock in kmemleak
x86/mm/64: define ARCH_PAGE_TABLE_SYNC_MASK and arch_sync_kernel_mappings()
mm: introduce and use {pgd,p4d}_populate_kernel()
mm: move page table sync declarations to linux/pgtable.h
proc: fix missing pde_set_flags() for net proc files
mm: fix accounting of memmap pages
mm/damon/core: prevent unnecessary overflow in damos_set_effective_quota()
kexec: add KEXEC_FILE_NO_CMA as a legal flag
kasan: fix GCC mem-intrinsic prefix with sw tags
mm/kasan: avoid lazy MMU mode hazards
mm/kasan: fix vmalloc shadow memory (de-)population races
kunit: kasan_test: disable fortify string checker on kasan_strings() test
selftests/mm: fix FORCE_READ to read input value correctly
mm/userfaultfd: fix kmap_local LIFO ordering for CONFIG_HIGHPTE
ocfs2: prevent release journal inode after journal shutdown
rust: mm: mark VmaNew as transparent
of_numa: fix uninitialized memory nodes causing kernel panic
Pull btrfs fixes from David Sterba:
- fix a few races related to inode link count
- fix inode leak on failure to add link to inode
- move transaction aborts closer to where they happen
* tag 'for-6.17-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: avoid load/store tearing races when checking if an inode was logged
btrfs: fix race between setting last_dir_index_offset and inode logging
btrfs: fix race between logging inode and checking if it was logged before
btrfs: simplify error handling logic for btrfs_link()
btrfs: fix inode leak on failure to add link to inode
btrfs: abort transaction on failure to add link to inode
Pull EFI fixes from Ard Biesheuvel:
- Assorted fixes for the OP-TEE based pseudo-EFI variable store
- Fix for an OOB access when looking up the same non-existing efivarfs
entry multiple times in parallel
* tag 'efi-fixes-for-v6.17-1' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi:
efivarfs: Fix slab-out-of-bounds in efivarfs_d_compare
efi: stmm: Drop unneeded null pointer check
efi: stmm: Drop unused EFI error from setup_mm_hdr arguments
efi: stmm: Do not return EFI_OUT_OF_RESOURCES on internal errors
efi: stmm: Fix incorrect buffer allocation method
Pull smb client fixes from Steve French:
- Fix possible refcount leak in compound operations
- Fix remap_file_range() return code mapping, found by generic/157
* tag 'v6.17-rc3-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
fs/smb: Fix inconsistent refcnt update
smb3 client: fix return code mapping of remap_file_range
Pull xfs fixes from Carlos Maiolino:
"The highlight I'd like to point here is related to the XFS_RT
Kconfig, which has been updated to be enabled by default now if
CONFIG_BLK_DEV_ZONED is enabled.
This also contains a few fixes for zoned devices support in XFS,
specially related to swapon requests in inodes belonging to the zoned
FS.
A null-ptr dereference fix in the xattr data, due to a mishandling of
medium errors generated by block devices is also included"
* tag 'xfs-fixes-6.17-rc4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: do not propagate ENODATA disk errors into xattr code
xfs: reject swapon for inodes on a zoned file system earlier
xfs: kick off inodegc when failing to reserve zoned blocks
xfs: remove xfs_last_used_zone
xfs: Default XFS_RT to Y if CONFIG_BLK_DEV_ZONED is enabled
Observed on kernel 6.6 (present on master as well):
BUG: KASAN: slab-out-of-bounds in memcmp+0x98/0xd0
Call trace:
kasan_check_range+0xe8/0x190
__asan_loadN+0x1c/0x28
memcmp+0x98/0xd0
efivarfs_d_compare+0x68/0xd8
__d_lookup_rcu_op_compare+0x178/0x218
__d_lookup_rcu+0x1f8/0x228
d_alloc_parallel+0x150/0x648
lookup_open.isra.0+0x5f0/0x8d0
open_last_lookups+0x264/0x828
path_openat+0x130/0x3f8
do_filp_open+0x114/0x248
do_sys_openat2+0x340/0x3c0
__arm64_sys_openat+0x120/0x1a0
If dentry->d_name.len < EFI_VARIABLE_GUID_LEN , 'guid' can become
negative, leadings to oob. The issue can be triggered by parallel
lookups using invalid filename:
T1 T2
lookup_open
->lookup
simple_lookup
d_add
// invalid dentry is added to hash list
lookup_open
d_alloc_parallel
__d_lookup_rcu
__d_lookup_rcu_op_compare
hlist_bl_for_each_entry_rcu
// invalid dentry can be retrieved
->d_compare
efivarfs_d_compare
// oob
Fix it by checking 'guid' before cmp.
Fixes: da27a24383 ("efivarfs: guid part of filenames are case-insensitive")
Signed-off-by: Li Nan <linan122@huawei.com>
Signed-off-by: Wu Guanghao <wuguanghao3@huawei.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
A possible inconsistent update of refcount was identified in `smb2_compound_op`.
Such inconsistent update could lead to possible resource leaks.
Why it is a possible bug:
1. In the comment section of the function, it clearly states that the
reference to `cfile` should be dropped after calling this function.
2. Every control flow path would check and drop the reference to
`cfile`, except the patched one.
3. Existing callers would not handle refcount update of `cfile` if
-ENOMEM is returned.
To fix the bug, an extra goto label "out" is added, to make sure that the
cleanup logic would always be respected. As the problem is caused by the
allocation failure of `vars`, the cleanup logic between label "finished"
and "out" can be safely ignored. According to the definition of function
`is_replayable_error`, the error code of "-ENOMEM" is not recoverable.
Therefore, the replay logic also gets ignored.
Signed-off-by: Shuhao Fu <sfual@cse.ust.hk>
Acked-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
Cc: stable@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
ENODATA (aka ENOATTR) has a very specific meaning in the xfs xattr code;
namely, that the requested attribute name could not be found.
However, a medium error from disk may also return ENODATA. At best,
this medium error may escape to userspace as "attribute not found"
when in fact it's an IO (disk) error.
At worst, we may oops in xfs_attr_leaf_get() when we do:
error = xfs_attr_leaf_hasname(args, &bp);
if (error == -ENOATTR) {
xfs_trans_brelse(args->trans, bp);
return error;
}
because an ENODATA/ENOATTR error from disk leaves us with a null bp,
and the xfs_trans_brelse will then null-deref it.
As discussed on the list, we really need to modify the lower level
IO functions to trap all disk errors and ensure that we don't let
unique errors like this leak up into higher xfs functions - many
like this should be remapped to EIO.
However, this patch directly addresses a reported bug in the xattr
code, and should be safe to backport to stable kernels. A larger-scope
patch to handle more unique errors at lower levels can follow later.
(Note, prior to 07120f1abd we did not oops, but we did return the
wrong error code to userspace.)
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Fixes: 07120f1abd ("xfs: Add xfs_has_attr and subroutines")
Cc: stable@vger.kernel.org # v5.9+
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
We were returning -EOPNOTSUPP for various remap_file_range cases
but for some of these the copy_file_range_syscall() requires -EINVAL
to be returned (e.g. where source and target file ranges overlap when
source and target are the same file). This fixes xfstest generic/157
which was expecting EINVAL for that (and also e.g. for when the src
offset is beyond end of file).
Cc: stable@vger.kernel.org
Acked-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Pull driver core fixes from Danilo Krummrich:
- Fix swapped handling of lru_gen and lru_gen_full debugfs files in
vmscan
- Fix debugfs mount options (uid, gid, mode) being silently ignored
- Fix leak of devres action in the unwind path of Devres::new()
- Documentation:
- Expand and fix documentation of (outdated) Device, DeviceContext
and generic driver infrastructure
- Fix C header link of faux device abstractions
- Clarify expected interaction with the security team
- Smooth text flow in the security bug reporting process
documentation
* tag 'driver-core-6.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core:
Documentation: smooth the text flow in the security bug reporting process
Documentation: clarify the expected collaboration with security bugs reporters
debugfs: fix mount options not being applied
rust: devres: fix leaking call to devm_add_action()
rust: faux: fix C header link
driver: rust: expand documentation for driver infrastructure
device: rust: expand documentation for Device
device: rust: expand documentation for DeviceContext
mm/vmscan: fix inverted polarity in lru_gen_seq_show()
Pull smb client fix from Steve French:
"Fix for netfs smb3 oops"
* tag '6.17-rc2-smb3-client-fix' of git://git.samba.org/sfrench/cifs-2.6:
cifs: Fix oops due to uninitialised variable
Pull NFS client fix from Trond Myklebust:
- NFS: Fix a data corrupting race when updating an existing write
* tag 'nfs-for-6.17-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
NFS: Fix a race when updating an existing write
Pull misc fixes from Andrew Morton:
"20 hotfixes. 10 are cc:stable and the remainder address post-6.16
issues or aren't considered necessary for -stable kernels. 17 of these
fixes are for MM.
As usual, singletons all over the place, apart from a three-patch
series of KHO followup work from Pasha which is actually also a bunch
of singletons"
* tag 'mm-hotfixes-stable-2025-08-21-18-17' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
mm/mremap: fix WARN with uffd that has remap events disabled
mm/damon/sysfs-schemes: put damos dests dir after removing its files
mm/migrate: fix NULL movable_ops if CONFIG_ZSMALLOC=m
mm/damon/core: fix damos_commit_filter not changing allow
mm/memory-failure: fix infinite UCE for VM_PFNMAP pfn
MAINTAINERS: mark MGLRU as maintained
mm: rust: add page.rs to MEMORY MANAGEMENT - RUST
iov_iter: iterate_folioq: fix handling of offset >= folio size
selftests/damon: fix selftests by installing drgn related script
.mailmap: add entry for Easwar Hariharan
selftests/mm: add test for invalid multi VMA operations
mm/mremap: catch invalid multi VMA moves earlier
mm/mremap: allow multi-VMA move when filesystem uses thp_get_unmapped_area
mm/damon/core: fix commit_ops_filters by using correct nth function
tools/testing: add linux/args.h header and fix radix, VMA tests
mm/debug_vm_pgtable: clear page table entries at destroy_args()
squashfs: fix memory leak in squashfs_fill_super
kho: warn if KHO is disabled due to an error
kho: mm: don't allow deferred struct page with KHO
kho: init new_physxa->phys_bits to fix lockdep
At inode_logged() we do a couple lockless checks for ->logged_trans, and
these are generally safe except the second one in case we get a load or
store tearing due to a concurrent call updating ->logged_trans (either at
btrfs_log_inode() or later at inode_logged()).
In the first case it's safe to compare to the current transaction ID since
once ->logged_trans is set the current transaction, we never set it to a
lower value.
In the second case, where we check if it's greater than zero, we are prone
to load/store tearing races, since we can have a concurrent task updating
to the current transaction ID with store tearing for example, instead of
updating with a single 64 bits write, to update with two 32 bits writes or
four 16 bits writes. In that case the reading side at inode_logged() could
see a positive value that does not match the current transaction and then
return a false negative.
Fix this by doing the second check while holding the inode's spinlock, add
some comments about it too. Also add the data_race() annotation to the
first check to avoid any reports from KCSAN (or similar tools) and comment
about it.
Fixes: 0f8ce49821 ("btrfs: avoid inode logging during rename and link when possible")
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At inode_logged() if we find that the inode was not logged before we
update its ->last_dir_index_offset to (u64)-1 with the goal that the
next directory log operation will see the (u64)-1 and then figure out
it must check what was the index of the last logged dir index key and
update ->last_dir_index_offset to that key's offset (this is done in
update_last_dir_index_offset()).
This however has a possibility for a time window where a race can happen
and lead to directory logging skipping dir index keys that should be
logged. The race happens like this:
1) Task A calls inode_logged(), sees ->logged_trans as 0 and then checks
that the inode item was logged before, but before it sets the inode's
->last_dir_index_offset to (u64)-1...
2) Task B is at btrfs_log_inode() which calls inode_logged() early, and
that has set ->last_dir_index_offset to (u64)-1;
3) Task B then enters log_directory_changes() which calls
update_last_dir_index_offset(). There it sees ->last_dir_index_offset
is (u64)-1 and that the inode was logged before (ctx->logged_before is
true), and so it searches for the last logged dir index key in the log
tree and it finds that it has an offset (index) value of N, so it sets
->last_dir_index_offset to N, so that we can skip index keys that are
less than or equal to N (later at process_dir_items_leaf());
4) Task A now sets ->last_dir_index_offset to (u64)-1, undoing the update
that task B just did;
5) Task B will now skip every index key when it enters
process_dir_items_leaf(), since ->last_dir_index_offset is (u64)-1.
Fix this by making inode_logged() not touch ->last_dir_index_offset and
initializing it to 0 when an inode is loaded (at btrfs_alloc_inode()) and
then having update_last_dir_index_offset() treat a value of 0 as meaning
we must check the log tree and update with the index of the last logged
index key. This is fine since the minimum possible value for
->last_dir_index_offset is 1 (BTRFS_DIR_START_INDEX - 1 = 2 - 1 = 1).
This also simplifies the management of ->last_dir_index_offset and now
all accesses to it are done under the inode's log_mutex.
Fixes: 0f8ce49821 ("btrfs: avoid inode logging during rename and link when possible")
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's a race between checking if an inode was logged before and logging
an inode that can cause us to mark an inode as not logged just after it
was logged by a concurrent task:
1) We have inode X which was not logged before neither in the current
transaction not in past transaction since the inode was loaded into
memory, so it's ->logged_trans value is 0;
2) We are at transaction N;
3) Task A calls inode_logged() against inode X, sees that ->logged_trans
is 0 and there is a log tree and so it proceeds to search in the log
tree for an inode item for inode X. It doesn't see any, but before
it sets ->logged_trans to N - 1...
3) Task B calls btrfs_log_inode() against inode X, logs the inode and
sets ->logged_trans to N;
4) Task A now sets ->logged_trans to N - 1;
5) At this point anyone calling inode_logged() gets 0 (inode not logged)
since ->logged_trans is greater than 0 and less than N, but our inode
was really logged. As a consequence operations like rename, unlink and
link that happen afterwards in the current transaction end up not
updating the log when they should.
Fix this by ensuring inode_logged() only updates ->logged_trans in case
the inode item is not found in the log tree if after tacking the inode's
lock (spinlock struct btrfs_inode::lock) the ->logged_trans value is still
zero, since the inode lock is what protects setting ->logged_trans at
btrfs_log_inode().
Fixes: 0f8ce49821 ("btrfs: avoid inode logging during rename and link when possible")
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of incrementing the inode's link count and refcount early before
adding the link, updating the inode and deleting orphan item, do it after
all those steps succeeded right before calling d_instantiate(). This makes
the error handling logic simpler by avoiding the need for the 'drop_inode'
variable to signal if we need to undo the link count increment and the
inode refcount increase under the 'fail' label.
This also reduces the level of indentation by one, making the code easier
to read.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we fail to update the inode or delete the orphan item we leak the inode
since we update its refcount with the ihold() call to account for the
d_instantiate() call which never happens in case we fail those steps. Fix
this by setting 'drop_inode' to true in case we fail those steps.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we fail to update the inode or delete the orphan item, we must abort
the transaction to prevent persisting an inconsistent state. For example
if we fail to update the inode item, we have the inconsistency of having
a persisted inode item with a link count of N but we have N + 1 inode ref
items and N + 1 directory entries pointing to our inode in case the
transaction gets committed.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Pull smb server fixes from Steve French:
- fix refcount issue that can cause memory leak
- rate limit repeated connections from IPv6, not just IPv4 addresses
- fix potential null pointer access of smb direct work queue
* tag '6.17-rc2-ksmbd-server-fixes' of git://git.samba.org/ksmbd:
ksmbd: fix refcount leak causing resource not released
ksmbd: extend the connection limiting mechanism to support IPv6
smb: server: split ksmbd_rdma_stop_listening() out of ksmbd_rdma_destroy()
After nfs_lock_and_join_requests() tests for whether the request is
still attached to the mapping, nothing prevents a call to
nfs_inode_remove_request() from succeeding until we actually lock the
page group.
The reason is that whoever called nfs_inode_remove_request() doesn't
necessarily have a lock on the page group head.
So in order to avoid races, let's take the page group lock earlier in
nfs_lock_and_join_requests(), and hold it across the removal of the
request in nfs_inode_remove_request().
Reported-by: Jeff Layton <jlayton@kernel.org>
Tested-by: Joe Quanaim <jdq@meta.com>
Tested-by: Andrew Steffen <aksteffen@meta.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Fixes: bd37d6fce1 ("NFSv4: Convert nfs_lock_and_join_requests() to use nfs_page_find_head_request()")
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Pull mount fixes from Al Viro:
"Fixes for several recent mount-related regressions"
* tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
change_mnt_propagation(): calculate propagation source only if we'll need it
use uniform permission checks for all mount propagation changes
propagate_umount(): only surviving overmounts should be reparented
fix the softlockups in attach_recursive_mnt()
Pull overlayfs fixes from Amir Goldstein:
"Fixes for two fallouts from Neil's directory locking changes"
* tag 'ovl-fixes-6.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/overlayfs/vfs:
ovl: fix possible double unlink
ovl: use I_MUTEX_PARENT when locking parent in ovl_create_temp()
Pull vfs fixes from Christian Brauner:
- Fix two memory leaks in pidfs
- Prevent changing the idmapping of an already idmapped mount without
OPEN_TREE_CLONE through open_tree_attr()
- Don't fail listing extended attributes in kernfs when no extended
attributes are set
- Fix the return value in coredump_parse()
- Fix the error handling for unbuffered writes in netfs
- Fix broken data integrity guarantees for O_SYNC writes via iomap
- Fix UAF in __mark_inode_dirty()
- Keep inode->i_blkbits constant in fuse
- Fix coredump selftests
- Fix get_unused_fd_flags() usage in do_handle_open()
- Rename EXPORT_SYMBOL_GPL_FOR_MODULES to EXPORT_SYMBOL_FOR_MODULES
- Fix use-after-free in bh_read()
- Fix incorrect lflags value in the move_mount() syscall
* tag 'vfs-6.17-rc3.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
signal: Fix memory leak for PIDFD_SELF* sentinels
kernfs: don't fail listing extended attributes
coredump: Fix return value in coredump_parse()
fs/buffer: fix use-after-free when call bh_read() helper
pidfs: Fix memory leak in pidfd_info()
netfs: Fix unbuffered write error handling
fhandle: do_handle_open() should get FD with user flags
module: Rename EXPORT_SYMBOL_GPL_FOR_MODULES to EXPORT_SYMBOL_FOR_MODULES
fs: fix incorrect lflags value in the move_mount syscall
selftests/coredump: Remove the read() that fails the test
fuse: keep inode->i_blkbits constant
iomap: Fix broken data integrity guarantees for O_SYNC writes
selftests/mount_setattr: add smoke tests for open_tree_attr(2) bug
open_tree_attr: do not allow id-mapping changes without OPEN_TREE_CLONE
fs: writeback: fix use-after-free in __mark_inode_dirty()
Fix smb3_init_transform_rq() to initialise buffer to NULL before calling
netfs_alloc_folioq_buffer() as netfs assumes it can append to the buffer it
is given. Setting it to NULL means it should start a fresh buffer, but the
value is currently undefined.
Fixes: a2906d3316 ("cifs: Switch crypto buffer to use a folio_queue rather than an xarray")
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Steve French <sfrench@samba.org>
cc: Paulo Alcantara <pc@manguebit.org>
cc: linux-cifs@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
We only need it when mount in question was sending events downstream (then
recepients need to switch to new master) or the mount is being turned into
slave (then we need a new master for it).
That wouldn't be a big deal, except that it causes quite a bit of work
when umount_tree() is taking a large peer group out. Adding a trivial
"don't bother calling propagation_source() unless we are going to use
its results" logics improves the things quite a bit.
We are still doing unnecessary work on bulk removals from propagation graph,
but the full solution for that will have to wait for the next merge window.
Fixes: 955336e204 "do_make_slave(): choose new master sanely"
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
do_change_type() and do_set_group() are operating on different
aspects of the same thing - propagation graph. The latter
asks for mounts involved to be mounted in namespace(s) the caller
has CAP_SYS_ADMIN for. The former is a mess - originally it
didn't even check that mount *is* mounted. That got fixed,
but the resulting check turns out to be too strict for userland -
in effect, we check that mount is in our namespace, having already
checked that we have CAP_SYS_ADMIN there.
What we really need (in both cases) is
* only touch mounts that are mounted. That's a must-have
constraint - data corruption happens if it get violated.
* don't allow to mess with a namespace unless you already
have enough permissions to do so (i.e. CAP_SYS_ADMIN in its userns).
That's an equivalent of what do_set_group() does; let's extract that
into a helper (may_change_propagation()) and use it in both
do_set_group() and do_change_type().
Fixes: 12f147ddd6 "do_change_type(): refuse to operate on unmounted/not ours mounts"
Acked-by: Andrei Vagin <avagin@gmail.com>
Reviewed-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Tested-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
... as the comments in reparent() clearly say. As it is, we reparent
*all* overmounts of the mounts being taken out, including those that
are taken out themselves. It's not only a potentially massive slowdown
(on a pathological setup we might end up with O(N^2) time for N mounts
being kicked out), it can end up with incorrect ->overmount in the
surviving mounts.
Fixes: f0d0ba1998 "Rewrite of propagate_umount()"
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
In case when we mounting something on top of a large stack of overmounts,
all of them being peers of each other, we get quadratic time by the
depth of overmount stack. Easily fixed by doing commit_tree() before
reparenting the overmount; simplifies commit_tree() as well - it doesn't
need to skip the already mounted stuff that had been reparented on top
of the new mounts.
Since we are holding mount_lock through both reparenting and call of
commit_tree(), the order does not matter from the mount hash point
of view.
Reported-by: "Lai, Yi" <yi1.lai@linux.intel.com>
Tested-by: "Lai, Yi" <yi1.lai@linux.intel.com>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Fixes: 663206854f "copy_tree(): don't link the mounts via mnt_list"
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
No point in going down into the iomap mapping loop when we know it
will be rejected.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
XFS processes truncating unlinked inodes asynchronously and thus the free
space pool only sees them with a delay. The non-zoned write path thus
calls into inodegc to accelerate this processing before failing an
allocation due the lack of free blocks. Do the same for the zoned space
reservation.
Fixes: 0bb2193056 ("xfs: add support for zoned space reservations")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
This was my first attempt at caching the last used zone. But it turns out
for O_DIRECT or RWF_DONTCACHE that operate concurrently or in very short
sequence, the bmap btree does not record a written extent yet, so it fails.
Because it then still finds the last written zone it can lead to a weird
ping-pong around a few zones with writers seeing different values.
Remove it entirely as the later added xfs_cached_zone actually does a
much better job enforcing the locality as the zone is associated with the
inode in the MRU cache as soon as the zone is selected.
Fixes: 4e4d520755 ("xfs: add the zoned space allocator")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
XFS support for zoned block devices requires the realtime subvolume
support (XFS_RT) to be enabled. Change the default configuration value
of XFS_RT from N to CONFIG_BLK_DEV_ZONED to align with this requirement.
This change still allows the user to disable XFS_RT if this feature is
not desired for the user use case.
Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Userspace doesn't expect a failure to list extended attributes:
$ ls -lA /sys/
ls: /sys/: No data available
ls: /sys/kernel: No data available
ls: /sys/power: No data available
ls: /sys/class: No data available
ls: /sys/devices: No data available
ls: /sys/dev: No data available
ls: /sys/hypervisor: No data available
ls: /sys/fs: No data available
ls: /sys/bus: No data available
ls: /sys/firmware: No data available
ls: /sys/block: No data available
ls: /sys/module: No data available
total 0
drwxr-xr-x 2 root root 0 Jan 1 1970 block
drwxr-xr-x 52 root root 0 Jan 1 1970 bus
drwxr-xr-x 88 root root 0 Jan 1 1970 class
drwxr-xr-x 4 root root 0 Jan 1 1970 dev
drwxr-xr-x 11 root root 0 Jan 1 1970 devices
drwxr-xr-x 3 root root 0 Jan 1 1970 firmware
drwxr-xr-x 10 root root 0 Jan 1 1970 fs
drwxr-xr-x 2 root root 0 Jul 2 09:43 hypervisor
drwxr-xr-x 14 root root 0 Jan 1 1970 kernel
drwxr-xr-x 251 root root 0 Jan 1 1970 module
drwxr-xr-x 3 root root 0 Jul 2 09:43 power
Fix it by simply reporting success when no extended attributes are
available instead of reporting ENODATA.
Link: https://lore.kernel.org/78b13bcdae82ade95e88f315682966051f461dde.camel@linaro.org
Fixes: d1f4e90260 ("kernfs: remove iattr_mutex") # mainline only
Reported-by: André Draszik <andre.draszik@linaro.org>
Link: https://lore.kernel.org/20250819-ahndung-abgaben-524a535f8101@brauner
Signed-off-by: Christian Brauner <brauner@kernel.org>
The coredump_parse() function is bool type. It should return true on
success and false on failure. The cn_printf() returns zero on success
or negative error codes. This mismatch means that when "return err;"
here, it is treated as success instead of failure. Change it to return
false instead.
Fixes: a5715af549 ("coredump: make coredump_parse() return bool")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Link: https://lore.kernel.org/aKRGu14w5vPSZLgv@stanley.mountain
Signed-off-by: Christian Brauner <brauner@kernel.org>
There's issue as follows:
BUG: KASAN: stack-out-of-bounds in end_buffer_read_sync+0xe3/0x110
Read of size 8 at addr ffffc9000168f7f8 by task swapper/3/0
CPU: 3 UID: 0 PID: 0 Comm: swapper/3 Not tainted 6.16.0-862.14.0.6.x86_64
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
Call Trace:
<IRQ>
dump_stack_lvl+0x55/0x70
print_address_description.constprop.0+0x2c/0x390
print_report+0xb4/0x270
kasan_report+0xb8/0xf0
end_buffer_read_sync+0xe3/0x110
end_bio_bh_io_sync+0x56/0x80
blk_update_request+0x30a/0x720
scsi_end_request+0x51/0x2b0
scsi_io_completion+0xe3/0x480
? scsi_device_unbusy+0x11e/0x160
blk_complete_reqs+0x7b/0x90
handle_softirqs+0xef/0x370
irq_exit_rcu+0xa5/0xd0
sysvec_apic_timer_interrupt+0x6e/0x90
</IRQ>
Above issue happens when do ntfs3 filesystem mount, issue may happens
as follows:
mount IRQ
ntfs_fill_super
read_cache_page
do_read_cache_folio
filemap_read_folio
mpage_read_folio
do_mpage_readpage
ntfs_get_block_vbo
bh_read
submit_bh
wait_on_buffer(bh);
blk_complete_reqs
scsi_io_completion
scsi_end_request
blk_update_request
end_bio_bh_io_sync
end_buffer_read_sync
__end_buffer_read_notouch
unlock_buffer
wait_on_buffer(bh);--> return will return to caller
put_bh
--> trigger stack-out-of-bounds
In the mpage_read_folio() function, the stack variable 'map_bh' is
passed to ntfs_get_block_vbo(). Once unlock_buffer() unlocks and
wait_on_buffer() returns to continue processing, the stack variable
is likely to be reclaimed. Consequently, during the end_buffer_read_sync()
process, calling put_bh() may result in stack overrun.
If the bh is not allocated on the stack, it belongs to a folio. Freeing
a buffer head which belongs to a folio is done by drop_buffers() which
will fail to free buffers which are still locked. So it is safe to call
put_bh() before __end_buffer_read_notouch().
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Ye Bin <yebin10@huawei.com>
Link: https://lore.kernel.org/20250811141830.343774-1-yebin@huaweicloud.com
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>