Commit Graph

1368284 Commits

Author SHA1 Message Date
Al Viro
e0f9396e24 propagate_one(): separate the "what should be the master for this copy" part
When we create the first copy for a peer group, it becomes a slave of
one of the existing copies; take that logics into a separate helper -
find_master(parent, last_copy, original).

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29 19:03:30 -04:00
Al Viro
15e710b8bb propagate_one(): separate the "do we need secondary here?" logics
take the checks into separate helper - need_secondary(mount, mountpoint).

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29 19:03:29 -04:00
Al Viro
2b2a34793d propagate_mnt(): handle all peer groups in the same loop
the only difference is that for the original group we want to skip
the first element; not worth having the logics twice...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29 19:03:29 -04:00
Al Viro
25776a09d8 propagate_one(): get rid of dest_master
propagate_mnt() takes the subtree we are about to attach and creates
its copies, setting the propagation between those.  Each copy is cloned
either from the original or from one of the already created copies.
The tricky part is choosing the right copy to serve as a master when we
are starting a new peer group.

The algorithm for doing that selection puts temporary marks on the masters
of mountpoints that already got a copy created for them; since the initial
peer group might have no master at all, we need to special-case that when
looking for the mark.  Currently we do that by memorizing the master of
original peer group.  It works, but we get yet another piece of data to
pass from propagate_mnt() to propagate_one().

Alternative is to mark the master of original peer group if not NULL,
turning the check into "master is NULL or marked".  Less data to pass
around and memory safety is more obvious that way...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29 19:03:29 -04:00
Al Viro
406fea7999 mount: separate the flags accessed only under namespace_sem
Several flags are updated and checked only under namespace_sem; we are
already making use of that when we are checking them without mount_lock,
but we have to hold mount_lock for all updates, which makes things
clumsier than they have to be.

Take MNT_SHARED, MNT_UNBINDABLE, MNT_MARKED and MNT_UMOUNT_CANDIDATE
into a separate field (->mnt_t_flags), renaming them to T_SHARED,
etc. to avoid confusion.  All accesses must be under namespace_sem.

That changes locking requirements for mnt_change_propagation() and
set_mnt_shared() - only namespace_sem is needed now.  The same goes
for SET_MNT_MARKED et.al.

There might be more flags moved from ->mnt_flags to that field;
this is just the initial set.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29 19:03:29 -04:00
Al Viro
493a4bebf5 don't have mounts pin their parents
Simplify the rules for mount refcounts.  Current rules include:
	* being a namespace root => +1
	* being someone's child => +1
	* being someone's child => +1 to parent's refcount, unless you've
				   already been through umount_tree().

The last part is not needed at all.  It makes for more places where need
to decrement refcounts and it creates an asymmetry between the situations
for something that has never been a part of a namespace and something that
left one, both for no good reason.

If mount's refcount has additions from its children, we know that
	* it's either someone's child itself (and will remain so
until umount_tree(), at which point contributions from children
will disappear), or
	* or is the root of namespace (and will remain such until
it either becomes someone's child in another namespace or goes through
umount_tree()), or
	* it is the root of some tree copy, and is currently pinned
by the caller of copy_tree() (and remains such until it either gets
into namespace, or goes to umount_tree()).
In all cases we already have contribution(s) to refcount that will last
as long as the contribution from children remains.  In other words, the
lifetime is not affected by refcount contributions from children.

It might be useful for "is it busy" checks, but those are actually
no harder to express without it.

NB: propagate_mnt_busy() part is an equivalent transformation, ugly as it
is; the current logics is actually wrong and may give false negatives,
but fixing that is for a separate patch (probably earlier in the queue).

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29 18:13:42 -04:00
Al Viro
d72c773237 get rid of mountpoint->m_count
struct mountpoint has an odd kinda-sorta refcount in it.  It's always
either equal to or one above the number of mounts attached to that
mountpoint.

"One above" happens when a function takes a temporary reference to
mountpoint.  Things get simpler if we express that as inserting
a local object into ->m_list and removing it to drop the reference.

New calling conventions:

1) lock_mount(), do_lock_mount(), get_mountpoint() and lookup_mountpoint()
take an extra struct pinned_mountpoint * argument and returns 0/-E...
(or true/false in case of lookup_mountpoint()) instead of returning
struct mountpoint pointers.  In case of success, the struct mountpoint *
we used to get can be found as pinned_mountpoint.mp

2) unlock_mount() (always paired with lock_mount()/do_lock_mount()) takes
an address of struct pinned_mountpoint - the same that had been passed to
lock_mount()/do_lock_mount().

3) put_mountpoint() for a temporary reference (paired with get_mountpoint()
or lookup_mountpoint()) is replaced with unpin_mountpoint(), which takes
the address of pinned_mountpoint we passed to matching {get,lookup}_mountpoint().

4) all instances of pinned_mountpoint are local variables; they always live on
stack.  {} is used for initializer, after successful {get,lookup}_mountpoint()
we must make sure to call unpin_mountpoint() before leaving the scope and
after successful {do_,}lock_mount() we must make sure to call unlock_mount()
before leaving the scope.

5) all manipulations of ->m_count are gone, along with ->m_count itself.
struct mountpoint lives while its ->m_list is non-empty.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29 18:13:42 -04:00
Al Viro
86f6398096 combine __put_mountpoint() with unhash_mnt()
A call of unhash_mnt() is immediately followed by passing its return
value to __put_mountpoint(); the shrink list given to __put_mountpoint()
will be ex_mountpoints when called from umount_mnt() and list when called
from mntput_no_expire().

Replace with __umount_mnt(mount, shrink_list), moving the call of
__put_mountpoint() into it (and returning nothing), adjust the
callers.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29 18:13:42 -04:00
Al Viro
e30da2a20e pivot_root(): reorder tree surgeries, collapse unhash_mnt() and put_mountpoint()
attach new_mnt *before* detaching root_mnt; that way we don't need to keep hold
on the mountpoint and one more pair of unhash_mnt()/put_mountpoint() gets
folded together into umount_mnt().

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29 18:13:42 -04:00
Al Viro
ec3265a245 take ->mnt_expire handling under mount_lock [read_seqlock_excl]
Doesn't take much massage, and we no longer need to make sure that
by the time of final mntput() the victim has been removed from the
list.  Makes life safer for ->d_automount() instances...

Rules:
	* all ->mnt_expire accesses are under mount_lock.
	* insertion into the list is done by mnt_set_expiry(), and
caller (->d_automount() instance) must hold a reference to mount
in question.  It shouldn't be done more than once for a mount.
	* if a mount on an expiry list is not yet mounted, it will
be ignored by anything that walks that list.
	* if the final mntput() finds its victim still on an expiry
list (in which case it must've never been mounted - umount_tree()
would've taken it out), it will remove the victim from the list.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29 18:13:42 -04:00
Al Viro
a8c764e1a5 attach_recursive_mnt(): remove from expiry list on move
... rather than doing that in do_move_mount().  That's the main
obstacle to moving the protection of ->mnt_expire from namespace_sem
to mount_lock (spinlock-only), which would simplify several failure
exits.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29 18:13:42 -04:00
Al Viro
ee1ee33ccc do_move_mount(): get rid of 'attached' flag
'attached' serves as a proxy for "source is a subtree of our namespace
and not the entirety of anon namespace"; finish massaging it away.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29 18:13:42 -04:00
Al Viro
761de25854 do_move_mount(): take dropping the old mountpoint into attach_recursive_mnt()
... and fold it with unhash_mnt() there - there's no need to retain a reference
to old_mp beyond that point, since by then all mountpoints we were going to add
are either explicitly pinned by get_mountpoint() or have stuff already added
to them.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29 18:13:42 -04:00
Al Viro
86b1da96c5 attach_recursive_mnt(): get rid of flags entirely
move vs. attach is trivially detected as mnt_has_parent(source_mnt)...

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29 18:13:42 -04:00
Al Viro
18959bf585 attach_recursive_mnt(): pass destination mount in all cases
... and 'beneath' is no longer used there

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29 18:13:42 -04:00
Al Viro
96f5d2e051 attach_recursive_mnt(): unify the mnt_change_mountpoint() logics
The logics used for tucking under existing mount differs for original
and copies; copies do a mount hash lookup to see if mountpoint to be is
already overmounted, while the original is told explicitly.

But the same logics that is used for copies works for the original,
at which point the only place where we get very close to eliminating
the need of passing 'beneath' flag to attach_recursive_mnt().

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29 18:13:41 -04:00
Al Viro
7c6fb47b2b make commit_tree() usable in same-namespace move case
Once attach_recursive_mnt() has created all copies of original subtree,
it needs to put them in place(s).

Steps needed for those are slightly different:
	1) in 'move' case, original copy doesn't need any rbtree
manipulations (everything's already in the same namespace where it will
be), but it needs to be detached from the current location
	2) in 'attach' case, original may be in anon namespace; if it is,
all those mounts need to removed from their current namespace before
insertion into the target one
	3) additional copies have a couple of extra twists - in case
of cross-userns propagation we need to lock everything other the root of
subtree and in case when we end up inserting under an existing mount,
that mount needs to be found (for original copy we have it explicitly
passed by the caller).

Quite a bit of that can be unified; as the first step, make commit_tree()
helper (inserting mounts into namespace, hashing the root of subtree
and marking the namespace as updated) usable in all cases; (2) and (3)
are already using it and for (1) we only need to make the insertion of
mounts into namespace conditional.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29 18:13:41 -04:00
Al Viro
f0d0ba1998 Rewrite of propagate_umount()
The variant currently in the tree has problems; trying to prove
correctness has caught at least one class of bugs (reparenting
that ends up moving the visible location of reparented mount, due
to not excluding some of the counterparts on propagation that
should've been included).

I tried to prove that it's the only bug there; I'm still not sure
whether it is.  If anyone can reconstruct and write down an analysis
of the mainline implementation, I'll gladly review it; as it is,
I ended up doing a different implementation.  Candidate collection
phase is similar, but trimming the set down until it satisfies the
constraints turned out pretty different.

I hoped to do transformation as a massage series, but that turns out
to be too convoluted.  So it's a single patch replacing propagate_umount()
and friends in one go, with notes and analysis in D/f/propagate_umount.txt
(in addition to inline comments).

As far I can tell, it is provably correct and provably linear by the number
of mounts we need to look at in order to decide what should be unmounted.
It even builds and seems to survive testing...

Another nice thing that fell out of that is that ->mnt_umounting is no longer
needed.

Compared to the first version:
	* explicit MNT_UMOUNT_CANDIDATE flag for is_candidate()
	* trim_ancestors() only clears that flag, leaving the suckers on list
	* trim_one() and handle_locked() take the stuff with flag cleared off
the list.  That allows to iterate with list_for_each_entry_safe() when calling
trim_one() - it removes at most one element from the list now.
	* no globals - I didn't bother with any kind of context, not worth it.

	* Notes updated accordingly; I have not touch the terms yet.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29 18:13:41 -04:00
Al Viro
24368a744b sanitize handling of long-term internal mounts
Original rationale for those had been the reduced cost of mntput()
for the stuff that is mounted somewhere.  Mount refcount increments and
decrements are frequent; what's worse, they tend to concentrate on the
same instances and cacheline pingpong is quite noticable.

As the result, mount refcounts are per-cpu; that allows a very cheap
increment.  Plain decrement would be just as easy, but decrement-and-test
is anything but (we need to add the components up, with exclusion against
possible increment-from-zero, etc.).

Fortunately, there is a very common case where we can tell that decrement
won't be the final one - if the thing we are dropping is currently
mounted somewhere.  We have an RCU delay between the removal from mount
tree and dropping the reference that used to pin it there, so we can
just take rcu_read_lock() and check if the victim is mounted somewhere.
If it is, we can go ahead and decrement without and further checks -
the reference we are dropping is not the last one.  If it isn't, we
get all the fun with locking, carefully adding up components, etc.,
but the majority of refcount decrements end up taking the fast path.

There is a major exception, though - pipes and sockets.  Those live
on the internal filesystems that are not going to be mounted anywhere.
They are not going to be _un_mounted, of course, so having to take the
slow path every time a pipe or socket gets closed is really obnoxious.
Solution had been to mark them as long-lived ones - essentially faking
"they are mounted somewhere" indicator.

With minor modification that works even for ones that do eventually get
dropped - all it takes is making sure we have an RCU delay between
clearing the "mounted somewhere" indicator and dropping the reference.

There are some additional twists (if you want to drop a dozen of such
internal mounts, you'd be better off with clearing the indicator on
all of them, doing an RCU delay once, then dropping the references),
but in the basic form it had been
	* use kern_mount() if you want your internal mount to be
a long-term one.
	* use kern_unmount() to undo that.

Unfortunately, the things did rot a bit during the mount API reshuffling.
In several cases we have lost the "fake the indicator" part; kern_unmount()
on the unmount side remained (it doesn't warn if you use it on a mount
without the indicator), but all benefits regaring mntput() cost had been
lost.

To get rid of that bitrot, let's add a new helper that would work
with fs_context-based API: fc_mount_longterm().  It's a counterpart
of fc_mount() that does, on success, mark its result as long-term.
It must be paired with kern_unmount() or equivalents.

Converted:
	1) mqueue (it used to use kern_mount_data() and the umount side
is still as it used to be)
	2) hugetlbfs (used to use kern_mount_data(), internal mount is
never unmounted in this one)
	3) i915 gemfs (used to be kern_mount() + manual remount to set
options, still uses kern_unmount() on umount side)
	4) v3d gemfs (copied from i915)

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29 18:13:41 -04:00
Al Viro
c93ff74ff1 do_umount(): simplify the "is it still mounted" checks
Calls of do_umount() are always preceded by can_umount(), where we'd
done a racy check for mount belonging to our namespace; if it wasn't,
can_unmount() would've failed with -EINVAL and we wouldn't have
reached do_umount() at all.

That check needs to be redone once we have acquired namespace_sem
and in do_umount() we do that.  However, that's done in a very odd
way; we check that mount is still in rbtree of _some_ namespace or
its mnt_list is not empty.  It is equivalent to check_mnt(mnt) -
we know that earlier mnt was mounted in our namespace; if it has
stayed there, it's going to remain in rbtree of our namespace.
OTOH, if it ever had been removed from out namespace, it would be
removed from rbtree and it never would've re-added to a namespace
afterwards.  As for ->mnt_list, for something that had been mounted
in a namespace we'll never observe non-empty ->mnt_list while holding
namespace_sem - it does temporarily become non-empty during
umount_tree(), but that doesn't outlast the call of umount_tree(),
let alone dropping namespace_sem.

Things get much easier to follow if we replace that with (equivalent)
check_mnt(mnt) there.  What's more, currently we treat a failure of
that test as "quietly do nothing"; we might as well pretend that we'd
lost the race and fail on that the same way can_umount() would have.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29 18:13:41 -04:00
Al Viro
49acacdc7c clone_mnt(): simplify the propagation-related logics
The underlying rules are simple:
	* MNT_SHARED should be set iff ->mnt_group_id of new mount ends up
non-zero.
	* mounts should be on the same ->mnt_share cyclic list iff they have
the same non-zero ->mnt_group_id value.
	* CL_PRIVATE is mutually exclusive with MNT_SHARED, MNT_SLAVE,
MNT_SHARED_TO_SLAVE and MNT_EXPIRE; the whole point of that thing is to
get a clone of old mount that would *not* be on any namespace-related
lists.

The above allows to make the logics more straightforward; what's more,
it makes the proof that invariants are maintained much simpler.
The variant in mainline is safe (aside of a very narrow race with
unsafe modification of mnt_flags right after we had the mount exposed
in superblock's ->s_mounts; theoretically it can race with ro remount
of the original, but it's not easy to hit), but proof of its correctness
is really unpleasant.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29 18:13:41 -04:00
Al Viro
d08fa7f44a don't set MNT_LOCKED on parentless mounts
Originally MNT_LOCKED meant only one thing - "don't let this mount to
be peeled off its parent, we don't want to have its mountpoint exposed".
Accordingly, it had only been set on mounts that *do* have a parent.
Later it got overloaded with another use - setting it on the absolute
root had given free protection against umount(2) of absolute root
(was possible to trigger, oopsed).  Not a bad trick, but it ended
up costing more than it bought us.  Unfortunately, the cost included
both hard-to-reason-about logics and a subtle race between
mount -o remount,ro and mount --[r]bind - lockless &= ~MNT_LOCKED in
the end of __do_loopback() could race with sb_prepare_remount_readonly()
setting and clearing MNT_HOLD_WRITE (under mount_lock, as it should
be).  The race wouldn't be much of a problem (there are other ways to
deal with it), but the subtlety is.

Turns out that nobody except umount(2) had ever made use of having
MNT_LOCKED set on absolute root.  So let's give up on that trick,
clever as it had been, add an explicit check in do_umount() and
return to using MNT_LOCKED only for mounts that have a parent.

It means that
	* clone_mnt() no longer copies MNT_LOCKED
	* copy_tree() sets it on submounts if their counterparts had
been marked such, and does that right next to attach_mnt() in there,
in the same mount_lock scope.
	* __do_loopback() no longer needs to strip MNT_LOCKED off the
root of subtree it's about to return; no store, no race.
	* init_mount_tree() doesn't bother setting MNT_LOCKED on absolute
root.
	* lock_mnt_tree() does not set MNT_LOCKED on the subtree's root;
accordingly, its caller (loop in attach_recursive_mnt()) does not need to
bother stripping that MNT_LOCKED on root.  Note that lock_mnt_tree() setting
MNT_LOCKED on submounts happens in the same mount_lock scope as __attach_mnt()
(from commit_tree()) that makes them reachable.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29 18:13:41 -04:00
Al Viro
1a867d729f __attach_mnt(): lose the second argument
It's always ->mnt_parent of the first one.  What the function does is
making a mount (with already set parent and mountpoint) visible - in
mount hash and in the parent's list of children.

IOW, it takes the existing rootwards linkage and sets the matching
crownwards linkage.

Renamed to make_visible(), while we are at it.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29 18:13:41 -04:00
Al Viro
9ed4b9eaea dissolve_on_fput(): use anon_ns_root()
that's the condition we are actually trying to check there...

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29 18:13:41 -04:00
Al Viro
05da054d43 new predicate: anon_ns_root(mount)
checks if mount is the root of an anonymouns namespace.
Switch open-coded equivalents to using it.

For mounts that belong to anon namespace !mnt_has_parent(mount)
is the same as mount == ns->root, and intent is more obvious in
the latter form.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29 18:13:41 -04:00
Al Viro
e031251cb2 constify is_local_mountpoint()
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29 18:13:41 -04:00
Al Viro
9cb79ed60e new predicate: mount_is_ancestor()
mount_is_ancestor(p1, p2) returns true iff there is a possibly
empty ancestry chain from p1 to p2.

Convert the open-coded checks.  Unlike those open-coded variants
it does not depend upon p1 not being root...

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29 18:13:41 -04:00
Al Viro
592238c03e pnode: lift peers() into pnode.h
it's going to be useful both in pnode.c and namespace.c

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29 18:13:41 -04:00
Al Viro
0e84653ea5 constify mnt_has_parent()
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29 18:13:41 -04:00
Al Viro
cf53a2d423 copy_tree(): don't set ->mnt_mountpoint on the root of copy
It never made any sense - neither when copy_tree() had been introduced
(2.4.11-pre5), nor at any point afterwards.  Mountpoint is meaningless
without parent mount and the root of copied tree has no parent until we get
around to attaching it somewhere.  At that time we'll have mountpoint set;
before that we have no idea which dentry will be used as mountpoint.
IOW, copy_tree() should just leave the default value.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29 18:13:41 -04:00
Al Viro
ffdc52fbbd prevent mount hash conflicts
Currently it's still possible to run into a pathological situation when
two hashed mounts share both parent and mountpoint.  That does not work
well, for obvious reasons.

We are not far from getting rid of that; the only remaining gap is
attach_recursive_mnt() not being careful enough when sliding a tree
under existing mount (for propagated copies or in 'beneath' case for
the original one).

To deal with that cleanly we need to be able to find overmounts
(i.e. mounts on top of parent's root); we could do hash lookups or scan
the list of children but either would be costly.  Since one of the results
we get from that will be prevention of multiple parallel overmounts, let's
just bite the bullet and store a (non-counting) reference to overmount
in struct mount.

With that done, closing the hole in attach_recursive_mnt() becomes easy
- we just need to follow the chain of overmounts before we change the
mountpoint of the mount we are sliding things under.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29 18:13:41 -04:00
Al Viro
431cc1d8e2 get rid of mnt_set_mountpoint_beneath()
mnt_set_mountpoint_beneath() consists of attaching new mount side-by-side
with the one we want to mount beneath (by mnt_set_mountpoint()), followed
by mnt_change_mountpoint() shifting the the top mount onto the new one
(by mnt_change_mountpoint()).

Both callers of mnt_set_mountpoint_beneath (both in attach_recursive_mnt())
have the same form - in 'beneath' case we call mnt_set_mountpoint_beneath(),
otherwise - mnt_set_mountpoint().

The thing is, expressing that as unconditional mnt_set_mountpoint(),
followed, in 'beneath' case, by mnt_change_mountpoint() is just as easy.
And these mnt_change_mountpoint() callers are similar to the ones we
do when it comes to attaching propagated copies, which will allow more
cleanups in the next commits.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29 18:13:41 -04:00
Al Viro
8c6ce8e86d attach_mnt(): expand in attach_recursive_mnt(), then lose the flag argument
simpler that way - all but one caller pass false as 'beneath' argument,
and that one caller is actually happier with the call expanded - the
logics with choice of mountpoint is identical for 'moving' and 'attaching'
cases, and now that is no longer hidden.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29 18:13:41 -04:00
Linus Torvalds
d0b3b7b22d Linux 6.16-rc4 v6.16-rc4 2025-06-29 13:09:04 -07:00
Linus Torvalds
afa9a6f4f5 Merge tag 'staging-6.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging
Pull staging driver fix from Greg KH:
 "Here is a single staging driver fix for 6.16-rc4. It resolves a build
  error in the rtl8723bs driver for some versions of clang on arm64 when
  checking the frame size with -Wframe-larger-than.

  It has been in linux-next for a while now with no reported issues"

* tag 'staging-6.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
  staging: rtl8723bs: Avoid memset() in aes_cipher() and aes_decipher()
2025-06-29 09:25:55 -07:00
Linus Torvalds
798804b69f Merge tag 'tty-6.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
Pull tty/serial driver fixes from Greg KH:
 "Here are five small serial and tty and vt fixes for 6.16-rc4. Included
  in here are:

   - kerneldoc fixes for recent vt changes

   - imx serial driver fix

   - of_node sysfs fix for a regression

   - vt missing notification fix

   - 8250 dt bindings fix

  All of these have been in linux-next for a while with no reported issues"

* tag 'tty-6.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
  dt-bindings: serial: 8250: Make clocks and clock-frequency exclusive
  serial: imx: Restore original RXTL for console to fix data loss
  serial: core: restore of_node information in sysfs
  vt: fix kernel-doc warnings in ucs_get_fallback()
  vt: add missing notification when switching back to text mode
2025-06-29 09:21:27 -07:00
Linus Torvalds
3b1890e4b2 Merge tag 'edac_urgent_for_v6.16_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras
Pull EDAC fix from Borislav Petkov:

 - Consider secondary address mask registers in amd64_edac in order to
   get the correct total memory size of the system

* tag 'edac_urgent_for_v6.16_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras:
  EDAC/amd64: Fix size calculation for Non-Power-of-Two DIMMs
2025-06-29 08:43:54 -07:00
Linus Torvalds
cc69ac7a65 Merge tag 'x86_urgent_for_v6.16_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Borislav Petkov:

 - Make sure DR6 and DR7 are initialized to their architectural values
   and not accidentally cleared, leading to misconfigurations

* tag 'x86_urgent_for_v6.16_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/traps: Initialize DR7 by writing its architectural reset value
  x86/traps: Initialize DR6 by writing its architectural reset value
2025-06-29 08:28:24 -07:00
Linus Torvalds
2fc18d0b89 Merge tag 'perf_urgent_for_v6.16_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fix from Borislav Petkov:

 - Make sure an AUX perf event is really disabled when it overruns

* tag 'perf_urgent_for_v6.16_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/aux: Fix pending disable flow when the AUX ring buffer overruns
2025-06-29 08:16:02 -07:00
Linus Torvalds
753a0f61b9 Merge tag 'locking_urgent_for_v6.16_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking fix from Borislav Petkov:

 - Make sure the new futex phash is not copied during fork in order to
   avoid a double-free

* tag 'locking_urgent_for_v6.16_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  futex: Initialize futex_phash_new during fork().
2025-06-29 08:09:13 -07:00
Linus Torvalds
dfba48a70c Merge tag 'i2c-for-6.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux
Pull i2c fixes from Wolfram Sang:

 - imx: fix SMBus protocol compliance during block read

 - omap: fix error handling path in probe

 - robotfuzz, tiny-usb: prevent zero-length reads

 - x86, designware, amdisp: fix build error when modules are disabled
   (agreed to go in via i2c)

 - scx200_acb: fix build error because of missing HAS_IOPORT

* tag 'i2c-for-6.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
  i2c: scx200_acb: depends on HAS_IOPORT
  i2c: omap: Fix an error handling path in omap_i2c_probe()
  platform/x86: Use i2c adapter name to fix build errors
  i2c: amd-isp: Initialize unique adapter name
  i2c: designware: Initialize adapter name only when not set
  i2c: tiny-usb: disable zero-length read messages
  i2c: robotfuzz-osif: disable zero-length read messages
  i2c: imx: fix emulated smbus block read
2025-06-28 15:23:17 -07:00
Linus Torvalds
ded779017a Merge tag 'trace-v6.16-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fix from Steven Rostedt:

 - Fix possible UAF on error path in filter_free_subsystem_filters()

   When freeing a subsystem filter, the filter for the subsystem is
   passed in to be freed and all the events within the subsystem will
   have their filter freed too. In order to free without waiting for RCU
   synchronization, list items are allocated to hold what is going to be
   freed to free it via a call_rcu(). If the allocation of these items
   fails, it will call the synchronization directly and free after that
   (causing a bit of delay for the user).

   The subsystem filter is first added to this list and then the filters
   for all the events under the subsystem. The bug is if one of the
   allocations of the list items for the event filters fail to allocate,
   it jumps to the "free_now" label which will free the subsystem
   filter, then all the items on the allocated list, and then the event
   filters that were not added to the list yet. But because the
   subsystem filter was added first, it gets freed twice.

   The solution is to add the subsystem filter after the events, and
   then if any of the allocations fail it will not try to free any of
   them twice

* tag 'trace-v6.16-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing: Fix filter logic error
2025-06-28 11:39:24 -07:00
Linus Torvalds
3a3de75a68 Merge tag 'loongarch-fixes-6.16-1' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson
Pull LoongArch fixes from Huacai Chen:

 - replace __ASSEMBLY__ with __ASSEMBLER__ in headers like others

 - fix build warnings about export.h

 - reserve the EFI memory map region for kdump

 - handle __init vs inline mismatches

 - fix some KVM bugs

* tag 'loongarch-fixes-6.16-1' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson:
  LoongArch: KVM: Disable updating of "num_cpu" and "feature"
  LoongArch: KVM: Check validity of "num_cpu" from user space
  LoongArch: KVM: Check interrupt route from physical CPU
  LoongArch: KVM: Fix interrupt route update with EIOINTC
  LoongArch: KVM: Add address alignment check for IOCSR emulation
  LoongArch: KVM: Avoid overflow with array index
  LoongArch: Handle KCOV __init vs inline mismatches
  LoongArch: Reserve the EFI memory map region
  LoongArch: Fix build warnings about export.h
  LoongArch: Replace __ASSEMBLY__ with __ASSEMBLER__ in headers
2025-06-28 11:35:11 -07:00
Linus Torvalds
aaf724ed69 Merge tag 'v6.16-rc3-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull smb client fixes from Steve French:

 - Multichannel reconnect lock ordering deadlock fix

 - Fix for regression in handling native Windows symlinks

 - Three smbdirect fixes:
     - oops in RDMA response processing
     - smbdirect memcpy issue
     - fix smbdirect regression with large writes (smbdirect test cases
       now all passing)

 - Fix for "FAILED_TO_PARSE" warning in trace-cmd report output

* tag 'v6.16-rc3-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: Fix reading into an ITER_FOLIOQ from the smbdirect code
  cifs: Fix the smbd_response slab to allow usercopy
  smb: client: fix potential deadlock when reconnecting channels
  smb: client: remove \t from TP_printk statements
  smb: client: let smbd_post_send_iter() respect the peers max_send_size and transmit all data
  smb: client: fix regression with native SMB symlinks
2025-06-27 20:38:05 -07:00
Linus Torvalds
0fd39af24e Merge tag 'mm-hotfixes-stable-2025-06-27-16-56' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
 "16 hotfixes.

  6 are cc:stable and the remainder address post-6.15 issues or aren't
  considered necessary for -stable kernels. 5 are for MM"

* tag 'mm-hotfixes-stable-2025-06-27-16-56' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  MAINTAINERS: add Lorenzo as THP co-maintainer
  mailmap: update Duje Mihanović's email address
  selftests/mm: fix validate_addr() helper
  crashdump: add CONFIG_KEYS dependency
  mailmap: correct name for a historical account of Zijun Hu
  mailmap: add entries for Zijun Hu
  fuse: fix runtime warning on truncate_folio_batch_exceptionals()
  scripts/gdb: fix dentry_name() lookup
  mm/damon/sysfs-schemes: free old damon_sysfs_scheme_filter->memcg_path on write
  mm/alloc_tag: fix the kmemleak false positive issue in the allocation of the percpu variable tag->counters
  lib/group_cpus: fix NULL pointer dereference from group_cpus_evenly()
  mm/hugetlb: remove unnecessary holding of hugetlb_lock
  MAINTAINERS: add missing files to mm page alloc section
  MAINTAINERS: add tree entry to mm init block
  mm: add OOM killer maintainer structure
  fs/proc/task_mmu: fix PAGE_IS_PFNZERO detection for the huge zero folio
2025-06-27 20:34:10 -07:00
Linus Torvalds
867b9987a3 Merge tag 'riscv-for-linus-5.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux
Pull RISC-V Fixes for 5.16-rc4

 - .rodata is no longer linkd into PT_DYNAMIC.

   It was not supposed to be there in the first place and resulted in
   invalid (but unused) entries. This manifests as at least warnings in
   llvm-readelf

 - A fix for runtime constants with all-0 upper 32-bits. This should
   only manifest on MMU=n kernels

 - A fix for context save/restore on systems using the T-Head vector
   extensions

 - A fix for a conflicting "+r"/"r" register constraint in the VDSO
   getrandom syscall wrapper, which is undefined behavior in clang

 - A fix for a missing register clobber in the RVV raid6 implementation.

   This manifests as a NULL pointer reference on some compilers, but
   could trigger in other ways

 - Misaligned accesses from userspace at faulting addresses are now
   handled correctly

 - A fix for an incorrect optimization that allowed access_ok() to mark
   invalid addresses as accessible, which can result in userspace
   triggering BUG()s

 - A few fixes for build warnings, and an update to Drew's email address

* tag 'riscv-for-linus-5.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
  riscv: export boot_cpu_hartid
  Revert "riscv: Define TASK_SIZE_MAX for __access_ok()"
  riscv: Fix sparse warning in vendor_extensions/sifive.c
  Revert "riscv: misaligned: fix sleeping function called during misaligned access handling"
  MAINTAINERS: Update Drew Fustini's email address
  RISC-V: uaccess: Wrap the get_user_8 uaccess macro
  raid6: riscv: Fix NULL pointer dereference caused by a missing clobber
  RISC-V: vDSO: Correct inline assembly constraints in the getrandom syscall wrapper
  riscv: vector: Fix context save/restore with xtheadvector
  riscv: fix runtime constant support for nommu kernels
  riscv: vdso: Exclude .rodata from the PT_DYNAMIC segment
2025-06-27 20:22:18 -07:00
Linus Torvalds
fa33adcaf8 Merge tag 'pci-v6.16-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci
Pull PCI fix from Bjorn Helgaas:

 - Fix a PTM debugfs build error with CONFIG_DEBUG_FS=n &&
   CONFIG_PCIE_PTM=y (Manivannan Sadhasivam)

* tag 'pci-v6.16-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci:
  PCI/PTM: Build debugfs code only if CONFIG_DEBUG_FS is enabled
2025-06-27 20:17:48 -07:00
Linus Torvalds
7abdafd234 Merge tag 'drm-fixes-2025-06-28' of https://gitlab.freedesktop.org/drm/kernel
Pull drm fixes from Dave Airlie:
 "Regular weekly drm updates, nothing out of the ordinary, amdgpu, xe,
  i915 and a few misc bits. Seems about right for this time in the
  release cycle.

  core:
   - fix drm_writeback_connector_cleanup function signature
   - use correct HDMI audio bridge in drm_connector_hdmi_audio_init

  bridge:
   - SN65DSI86: fix HPD

  amdgpu:
   - Cleaner shader support for additional GFX9 GPUs
   - MES firmware compatibility fixes
   - Discovery error reporting fixes
   - SDMA6/7 userq fixes
   - Backlight fix
   - EDID sanity check

  i915:
   - Fix for SNPS PHY HDMI for 1080p@120Hz
   - Correct DP AUX DPCD probe address
   - Followup build fix for GCOV and AutoFDO enabled config

  xe:
   - Missing error check
   - Fix xe_hwmon_power_max_write
   - Move flushes
   - Explicitly exit CT safe mode on unwind
   - Process deferred GGTT node removals on device unwind"

* tag 'drm-fixes-2025-06-28' of https://gitlab.freedesktop.org/drm/kernel:
  drm/xe: Process deferred GGTT node removals on device unwind
  drm/xe/guc: Explicitly exit CT safe mode on unwind
  drm/xe: move DPT l2 flush to a more sensible place
  drm/xe: Move DSB l2 flush to a more sensible place
  drm/bridge: ti-sn65dsi86: Add HPD for DisplayPort connector type
  drm/i915: fix build error some more
  drm/xe/hwmon: Fix xe_hwmon_power_max_write
  drm/xe/display: Add check for alloc_ordered_workqueue()
  drm/amd/display: Add sanity checks for drm_edid_raw()
  drm/amd/display: Fix AMDGPU_MAX_BL_LEVEL value
  drm/amdgpu/sdma7: add ucode version checks for userq support
  drm/amdgpu/sdma6: add ucode version checks for userq support
  drm/amd: Adjust output for discovery error handling
  drm/amdgpu/mes: add compatibility checks for set_hw_resource_1
  drm/amdgpu/gfx9: Add Cleaner Shader Support for GFX9.x GPUs
  drm/bridge-connector: Fix bridge in drm_connector_hdmi_audio_init()
  drm/dp: Change AUX DPCD probe address from DPCD_REV to LANE0_1_STATUS
  drm/i915/snps_hdmi_pll: Fix 64-bit divisor truncation by using div64_u64
  drm: writeback: Fix drm_writeback_connector_cleanup signature
2025-06-27 19:38:36 -07:00
Linus Torvalds
26fd9f7b7f Merge tag 'cxl-fixes-6.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl
Pull Compute Express Link (CXL) fixes from Dave Jiang:
 "These fixes address a few issues in the CXL subsystem, including
  dealing with some bugs in the CXL EDAC and RAS drivers:

   - Fix return value of cxlctl_validate_set_features()

   - Fix min_scrub_cycle of a region miscaculation and add additional
     documentation

   - Fix potential memory leak issues for CXL EDAC

   - Fix CPER handler device confusion for CXL RAS

   - Fix using wrong repair type to check DRAM event record"

* tag 'cxl-fixes-6.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl:
  cxl/edac: Fix using wrong repair type to check dram event record
  cxl/ras: Fix CPER handler device confusion
  cxl/edac: Fix potential memory leak issues
  cxl/Documentation: Add more description about min/max scrub cycle
  cxl/edac: Fix the min_scrub_cycle of a region miscalculation
  cxl: fix return value in cxlctl_validate_set_features()
2025-06-27 17:58:32 -07:00
Linus Torvalds
5683cd63a3 Merge tag 'libcrypto-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux
Pull crypto library fix from Eric Biggers:
 "Fix a regression where the purgatory code sometimes fails to build"

* tag 'libcrypto-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux:
  lib/crypto: sha256: Mark sha256_choose_blocks as __always_inline
2025-06-27 17:32:30 -07:00