Repeating automatically selected options in Kconfig files is redundant, so
let's delete repeated "select NETFS_SUPPORT" that was added accidentally.
Fixes: 69c3c023af ("cifs: Implement netfslib hooks")
Signed-off-by: Dragan Simic <dsimic@manjaro.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Replace default hardcoded value for cifsAttrs with ATTR_ARCHIVE macro
Use SMB2_LEASE_KEY_SIZE macro for leasekey size in smb2_lease_break
Signed-off-by: Bharath SM <bharathsm@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Encoding file handles is usually performed by a filesystem >encode_fh()
method that may fail for various reasons.
The legacy users of exportfs_encode_fh(), namely, nfsd and
name_to_handle_at(2) syscall are ready to cope with the possibility
of failure to encode a file handle.
There are a few other users of exportfs_encode_{fh,fid}() that
currently have a WARN_ON() assertion when ->encode_fh() fails.
Relax those assertions because they are wrong.
The second linked bug report states commit 16aac5ad1f ("ovl: support
encoding non-decodable file handles") in v6.6 as the regressing commit,
but this is not accurate.
The aforementioned commit only increases the chances of the assertion
and allows triggering the assertion with the reproducer using overlayfs,
inotify and drop_caches.
Triggering this assertion was always possible with other filesystems and
other reasons of ->encode_fh() failures and more particularly, it was
also possible with the exact same reproducer using overlayfs that is
mounted with options index=on,nfs_export=on also on kernels < v6.6.
Therefore, I am not listing the aforementioned commit as a Fixes commit.
Backport hint: this patch will have a trivial conflict applying to
v6.6.y, and other trivial conflicts applying to stable kernels < v6.6.
Reported-by: syzbot+ec07f6f5ce62b858579f@syzkaller.appspotmail.com
Tested-by: syzbot+ec07f6f5ce62b858579f@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/linux-unionfs/671fd40c.050a0220.4735a.024f.GAE@google.com/
Reported-by: Dmitry Safonov <dima@arista.com>
Closes: https://lore.kernel.org/linux-fsdevel/CAGrbwDTLt6drB9eaUagnQVgdPBmhLfqqxAf3F+Juqy_o6oP8uw@mail.gmail.com/
Cc: stable@vger.kernel.org
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Link: https://lore.kernel.org/r/20241219115301.465396-1-amir73il@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
When block_invalidatepage was converted to block_invalidate_folio, the
fallback to block_invalidatepage in folio_invalidate() if the
address_space_operations method invalidatepage (currently
invalidate_folio) was not set, was removed.
Unfortunately, some pseudo-inodes in nilfs2 use empty_aops set by
inode_init_always_gfp() as is, or explicitly set it to
address_space_operations. Therefore, with this change,
block_invalidatepage() is no longer called from folio_invalidate(), and as
a result, the buffer_head structures attached to these pages/folios are no
longer freed via try_to_free_buffers().
Thus, these buffer heads are now leaked by truncate_inode_pages(), which
cleans up the page cache from inode evict(), etc.
Three types of caches use empty_aops: gc inode caches and the DAT shadow
inode used by GC, and b-tree node caches. Of these, b-tree node caches
explicitly call invalidate_mapping_pages() during cleanup, which involves
calling try_to_free_buffers(), so the leak was not visible during normal
operation but worsened when GC was performed.
Fix this issue by using address_space_operations with invalidate_folio set
to block_invalidate_folio instead of empty_aops, which will ensure the
same behavior as before.
Link: https://lkml.kernel.org/r/20241212164556.21338-1-konishi.ryusuke@gmail.com
Fixes: 7ba13abbd3 ("fs: Turn block_invalidatepage into block_invalidate_folio")
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Cc: <stable@vger.kernel.org> [5.18+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
syzbot reported a WARNING in nilfs_rmdir. [1]
Because the inode bitmap is corrupted, an inode with an inode number that
should exist as a ".nilfs" file was reassigned by nilfs_mkdir for "file0",
causing an inode duplication during execution. And this causes an
underflow of i_nlink in rmdir operations.
The inode is used twice by the same task to unmount and remove directories
".nilfs" and "file0", it trigger warning in nilfs_rmdir.
Avoid to this issue, check i_nlink in nilfs_iget(), if it is 0, it means
that this inode has been deleted, and iput is executed to reclaim it.
[1]
WARNING: CPU: 1 PID: 5824 at fs/inode.c:407 drop_nlink+0xc4/0x110 fs/inode.c:407
...
Call Trace:
<TASK>
nilfs_rmdir+0x1b0/0x250 fs/nilfs2/namei.c:342
vfs_rmdir+0x3a3/0x510 fs/namei.c:4394
do_rmdir+0x3b5/0x580 fs/namei.c:4453
__do_sys_rmdir fs/namei.c:4472 [inline]
__se_sys_rmdir fs/namei.c:4470 [inline]
__x64_sys_rmdir+0x47/0x50 fs/namei.c:4470
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
Link: https://lkml.kernel.org/r/20241209065759.6781-1-konishi.ryusuke@gmail.com
Fixes: d25006523d ("nilfs2: pathname operations")
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Reported-by: syzbot+9260555647a5132edd48@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=9260555647a5132edd48
Tested-by: syzbot+9260555647a5132edd48@syzkaller.appspotmail.com
Signed-off-by: Edward Adam Davis <eadavis@qq.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
In current kernel, hugetlb_no_page() calls folio_zero_user() with the
fault address. Where the fault address may be not aligned with the huge
page size. Then, folio_zero_user() may call clear_gigantic_page() with
the address, while clear_gigantic_page() requires the address to be huge
page size aligned. So, this may cause memory corruption or information
leak, addtional, use more obvious naming 'addr_hint' instead of 'addr' for
clear_gigantic_page().
Link: https://lkml.kernel.org/r/20241028145656.932941-1-wangkefeng.wang@huawei.com
Fixes: 78fefd04c1 ("mm: memory: convert clear_huge_page() to folio_zero_user()")
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "Revert ocfs2 commit dfe6c5692f and provide a new fix".
SUSE QA team detected a mistake in my commit dfe6c5692f ("ocfs2: fix the
la space leak when unmounting an ocfs2 volume"). I am very sorry for my
error. (If my eyes are correct) From the mailling list mails, this patch
shouldn't be applied to 4.19 5.4 5.10 5.15 6.1 6.6, and these branches
should perform a revert operation.
Reason for revert:
In commit dfe6c5692f, I mistakenly wrote: "This bug has existed since
the initial OCFS2 code.". The statement is wrong. The correct
introduction commit is 30dd3478c3. IOW, if the branch doesn't include
30dd3478c3, dfe6c5692f should also not be included.
This reverts commit dfe6c5692f ("ocfs2: fix the la space leak when
unmounting an ocfs2 volume").
In commit dfe6c5692f, the commit log "This bug has existed since the
initial OCFS2 code." is wrong. The correct introduction commit is
30dd3478c3 ("ocfs2: correctly use ocfs2_find_next_zero_bit()").
The influence of commit dfe6c5692f is that it provides a correct fix for
the latest kernel. however, it shouldn't be pushed to stable branches.
Let's use this commit to revert all branches that include dfe6c5692f and
use a new fix method to fix commit 30dd3478c3.
Link: https://lkml.kernel.org/r/20241205104835.18223-1-heming.zhao@suse.com
Link: https://lkml.kernel.org/r/20241205104835.18223-2-heming.zhao@suse.com
Fixes: dfe6c5692f ("ocfs2: fix the la space leak when unmounting an ocfs2 volume")
Signed-off-by: Heming Zhao <heming.zhao@suse.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Add a new AT_EXECVE_CHECK flag to execveat(2) to check if a file would
be allowed for execution. The main use case is for script interpreters
and dynamic linkers to check execution permission according to the
kernel's security policy. Another use case is to add context to access
logs e.g., which script (instead of interpreter) accessed a file. As
any executable code, scripts could also use this check [1].
This is different from faccessat(2) + X_OK which only checks a subset of
access rights (i.e. inode permission and mount options for regular
files), but not the full context (e.g. all LSM access checks). The main
use case for access(2) is for SUID processes to (partially) check access
on behalf of their caller. The main use case for execveat(2) +
AT_EXECVE_CHECK is to check if a script execution would be allowed,
according to all the different restrictions in place. Because the use
of AT_EXECVE_CHECK follows the exact kernel semantic as for a real
execution, user space gets the same error codes.
An interesting point of using execveat(2) instead of openat2(2) is that
it decouples the check from the enforcement. Indeed, the security check
can be logged (e.g. with audit) without blocking an execution
environment not yet ready to enforce a strict security policy.
LSMs can control or log execution requests with
security_bprm_creds_for_exec(). However, to enforce a consistent and
complete access control (e.g. on binary's dependencies) LSMs should
restrict file executability, or measure executed files, with
security_file_open() by checking file->f_flags & __FMODE_EXEC.
Because AT_EXECVE_CHECK is dedicated to user space interpreters, it
doesn't make sense for the kernel to parse the checked files, look for
interpreters known to the kernel (e.g. ELF, shebang), and return ENOEXEC
if the format is unknown. Because of that, security_bprm_check() is
never called when AT_EXECVE_CHECK is used.
It should be noted that script interpreters cannot directly use
execveat(2) (without this new AT_EXECVE_CHECK flag) because this could
lead to unexpected behaviors e.g., `python script.sh` could lead to Bash
being executed to interpret the script. Unlike the kernel, script
interpreters may just interpret the shebang as a simple comment, which
should not change for backward compatibility reasons.
Because scripts or libraries files might not currently have the
executable permission set, or because we might want specific users to be
allowed to run arbitrary scripts, the following patch provides a dynamic
configuration mechanism with the SECBIT_EXEC_RESTRICT_FILE and
SECBIT_EXEC_DENY_INTERACTIVE securebits.
This is a redesign of the CLIP OS 4's O_MAYEXEC:
f5cb330d6b/1901_open_mayexec.patch
This patch has been used for more than a decade with customized script
interpreters. Some examples can be found here:
https://github.com/clipos-archive/clipos4_portage-overlay/search?q=O_MAYEXEC
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Acked-by: Paul Moore <paul@paul-moore.com>
Reviewed-by: Serge Hallyn <serge@hallyn.com>
Reviewed-by: Jeff Xu <jeffxu@chromium.org>
Tested-by: Jeff Xu <jeffxu@chromium.org>
Link: https://docs.python.org/3/library/io.html#io.open_code [1]
Signed-off-by: Mickaël Salaün <mic@digikod.net>
Link: https://lore.kernel.org/r/20241212174223.389435-2-mic@digikod.net
Signed-off-by: Kees Cook <kees@kernel.org>
Pull btrfs fixes from David Sterba:
- tree-checker catches invalid number of inline extent references
- zoned mode fixes:
- enhance zone append IO command so it also detects emulated writes
- handle bio splitting at sectorsize boundary
- when deleting a snapshot, fix a condition for visiting nodes in reloc
trees
* tag 'for-6.13-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: tree-checker: reject inline extent items with 0 ref count
btrfs: split bios to the fs sector size boundary
btrfs: use bio_is_zone_append() in the completion handler
btrfs: fix improper generation check in snapshot delete
Convert the first page passed to ni_write_frame() to a folio and use
folio_pos() on that instead of open-coding the access to folio->index,
cast & shift.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
Currently the pending_async_copies count is decremented just
before a struct nfsd4_copy is destroyed. After commit aa0ebd21df
("NFSD: Add nfsd4_copy time-to-live") nfsd4_copy structures sticks
around for 10 lease periods after the COPY itself has completed,
the pending_async_copies count stays high for a long time. This
causes NFSD to avoid the use of background copy even though the
actual background copy workload might no longer be running.
In this patch, decrement pending_async_copies once async copy thread
is done processing the copy work.
Fixes: aa0ebd21df ("NFSD: Add nfsd4_copy time-to-live")
Signed-off-by: Olga Kornievskaia <okorniev@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
[BUG]
There is a bug report in the mailing list where btrfs_run_delayed_refs()
failed to drop the ref count for logical 25870311358464 num_bytes
2113536.
The involved leaf dump looks like this:
item 166 key (25870311358464 168 2113536) itemoff 10091 itemsize 50
extent refs 1 gen 84178 flags 1
ref#0: shared data backref parent 32399126528000 count 0 <<<
ref#1: shared data backref parent 31808973717504 count 1
Notice the count number is 0.
[CAUSE]
There is no concrete evidence yet, but considering 0 -> 1 is also a
single bit flipped, it's possible that hardware memory bitflip is
involved, causing the on-disk extent tree to be corrupted.
[FIX]
To prevent us reading such corrupted extent item, or writing such
damaged extent item back to disk, enhance the handling of
BTRFS_EXTENT_DATA_REF_KEY and BTRFS_SHARED_DATA_REF_KEY keys for both
inlined and key items, to detect such 0 ref count and reject them.
CC: stable@vger.kernel.org # 5.4+
Link: https://lore.kernel.org/linux-btrfs/7c69dd49-c346-4806-86e7-e6f863a66f48@app.fastmail.com/
Reported-by: Frankie Fisher <frankie@terrorise.me.uk>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Btrfs like other file systems can't really deal with I/O not aligned to
it's internal block size (which strangely is called sector size in
btrfs, for historical reasons), but the block layer split helper doesn't
even know about that.
Round down the split boundary so that all I/Os are aligned.
Fixes: d5e4377d50 ("btrfs: split zone append bios in btrfs_submit_bio")
CC: stable@vger.kernel.org # 6.12
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Otherwise it won't catch bios turned into regular writes by the block
level zone write plugging. The additional test it adds is for emulated
zone append.
Fixes: 9b1ce7f0c6 ("block: Implement zone append emulation")
CC: stable@vger.kernel.org # 6.12
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have been using the following check
if (generation <= root->root_key.offset)
to make decisions about whether or not to visit a node during snapshot
delete. This is because for normal subvolumes this is set to 0, and for
snapshots it's set to the creation generation. The idea being that if
the generation of the node is less than or equal to our creation
generation then we don't need to visit that node, because it doesn't
belong to us, we can simply drop our reference and move on.
However reloc roots don't have their generation stored in
root->root_key.offset, instead that is the objectid of their
corresponding fs root. This means we can incorrectly not walk into
nodes that need to be dropped when deleting a reloc root.
There are a variety of consequences to making the wrong choice in two
distinct areas.
visit_node_for_delete()
1. False positive. We think we are newer than the block when we really
aren't. We don't visit the node and drop our reference to the node
and carry on. This would result in leaked space.
2. False negative. We do decide to walk down into a block that we
should have just dropped our reference to. However this means that
the child node will have refs > 1, so we will switch to
UPDATE_BACKREF, and then the subsequent walk_down_proc() will notice
that btrfs_header_owner(node) != root->root_key.objectid and it'll
break out of the loop, and then walk_up_proc() will drop our reference,
so this appears to be ok.
do_walk_down()
1. False positive. We are in UPDATE_BACKREF and incorrectly decide that
we are done and don't need to update the backref for our lower nodes.
This is another case that simply won't happen with relocation, as we
only have to do UPDATE_BACKREF if the node below us was shared and
didn't have FULL_BACKREF set, and since we don't own that node
because we're a reloc root we actually won't end up in this case.
2. False negative. Again this is tricky because as described above, we
simply wouldn't be here from relocation, because we don't own any of
the nodes because we never set btrfs_header_owner() to the reloc root
objectid, and we always use FULL_BACKREF, we never actually need to
set FULL_BACKREF on any children.
Having spent a lot of time stressing relocation/snapshot delete recently
I've not seen this pop in practice. But this is objectively incorrect,
so fix this to get the correct starting generation based on the root
we're dropping to keep me from thinking there's a problem here.
CC: stable@vger.kernel.org
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Pull erofs fixes from Gao Xiang:
"The first one fixes a syzbot UAF report caused by a commit introduced
in this cycle, but it also addresses a longstanding memory leak. The
second one resolves a PSI memstall mis-accounting issue.
The remaining patches switch file-backed mounts to use buffered I/Os
by default instead of direct I/Os, since the page cache of underlay
files is typically valid and maybe even dirty. This change also aligns
with the default policy of loopback devices. A mount option has been
added to try to use direct I/Os explicitly.
Summary:
- Fix (pcluster) memory leak and (sbi) UAF after umounting
- Fix a case of PSI memstall mis-accounting
- Use buffered I/Os by default for file-backed mounts"
* tag 'erofs-for-6.13-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
erofs: use buffered I/O for file-backed mounts by default
erofs: reference `struct erofs_device_info` for erofs_map_dev
erofs: use `struct erofs_device_info` for the primary device
erofs: add erofs_sb_free() helper
MAINTAINERS: erofs: update Yue Hu's email address
erofs: fix PSI memstall accounting
erofs: fix rare pcluster memory leak after unmounting
fs/nfs/super.c should include fs/nfs/nfs4idmap.h for
declaration of nfs_idmap_cache_timeout. This fixes the sparse warning:
fs/nfs/super.c:1397:14: warning: symbol 'nfs_idmap_cache_timeout' was not declared. Should it be static?
Signed-off-by: Zhang Kunbo <zhangkunbo@huawei.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
When the server is recalling a layout, we should ignore the count of
outstanding layoutget calls, since the server is expected to return
either NFS4ERR_RECALLCONFLICT or NFS4ERR_RETURNCONFLICT for as long as
the recall is outstanding.
Currently, we may end up livelocking, causing the layout to eventually
be forcibly revoked.
Fixes: bf0291dd22 ("pNFS: Ensure LAYOUTGET and LAYOUTRETURN are properly serialised")
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
This reverts commit f8c989a0c8.
Before this commit, svc_export_put or expkey_put will call path_put with
sync mode. After this commit, path_put will be called with async mode.
And this can lead the unexpected results show as follow.
mkfs.xfs -f /dev/sda
echo "/ *(rw,no_root_squash,fsid=0)" > /etc/exports
echo "/mnt *(rw,no_root_squash,fsid=1)" >> /etc/exports
exportfs -ra
service nfs-server start
mount -t nfs -o vers=4.0 127.0.0.1:/mnt /mnt1
mount /dev/sda /mnt/sda
touch /mnt1/sda/file
exportfs -r
umount /mnt/sda # failed unexcepted
The touch will finally call nfsd_cross_mnt, add refcount to mount, and
then add cache_head. Before this commit, exportfs -r will call
cache_flush to cleanup all cache_head, and path_put in
svc_export_put/expkey_put will be finished with sync mode. So, the
latter umount will always success. However, after this commit, path_put
will be called with async mode, the latter umount may failed, and if
we add some delay, umount will success too. Personally I think this bug
and should be fixed. We first revert before bugfix patch, and then fix
the original bug with a different way.
Fixes: f8c989a0c8 ("nfsd: release svc_expkey/svc_export with rcu_work")
Signed-off-by: Yang Erkun <yangerkun@huawei.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
The new pid inode number allocation scheme is neat but I overlooked a
possible, even though unlikely, attack that can be used to trigger an
overflow on both 32bit and 64bit.
An unique 64 bit identifier was constructed for each struct pid by two
combining a 32 bit idr with a 32 bit generation number. A 32bit number
was allocated using the idr_alloc_cyclic() infrastructure. When the idr
wrapped around a 32 bit wraparound counter was incremented. The 32 bit
wraparound counter served as the upper 32 bits and the allocated idr
number as the lower 32 bits.
Since the idr can only allocate up to INT_MAX entries everytime a
wraparound happens INT_MAX - 1 entries are lost (Ignoring that numbering
always starts at 2 to avoid theoretical collisions with the root inode
number.).
If userspace fully populates the idr such that and puts itself into
control of two entries such that one entry is somewhere in the middle
and the other entry is the INT_MAX entry then it is possible to overflow
the wraparound counter. That is probably difficult to pull off but the
mere possibility is annoying.
The problem could be contained to 32 bit by switching to a data
structure such as the maple tree that allows allocating 64 bit numbers
on 64 bit machines. That would leave 32 bit in a lurch but that probably
doesn't matter that much. The other problem is that removing entries
form the maple tree is somewhat non-trivial because the removal code can
be called under the irq write lock of tasklist_lock and
irq{save,restore} code.
Instead, allocate unique identifiers for struct pid by simply
incrementing a 64 bit counter and insert each struct pid into the rbtree
so it can be looked up to decode file handles avoiding to leak actual
pids across pid namespaces in file handles.
On both 64 bit and 32 bit the same 64 bit identifier is used to lookup
struct pid in the rbtree. On 64 bit the unique identifier for struct pid
simply becomes the inode number. Comparing two pidfds continues to be as
simple as comparing inode numbers.
On 32 bit the 64 bit number assigned to struct pid is split into two 32
bit numbers. The lower 32 bits are used as the inode number and the
upper 32 bits are used as the inode generation number. Whenever a
wraparound happens on 32 bit the 64 bit number will be incremented by 2
so inode numbering starts at 2 again.
When a wraparound happens on 32 bit multiple pidfds with the same inode
number are likely to exist. This isn't a problem since before pidfs
pidfds used the anonymous inode meaning all pidfds had the same inode
number. On 32 bit sserspace can thus reconstruct the 64 bit identifier
by retrieving both the inode number and the inode generation number to
compare, or use file handles. This gives the same guarantees on both 32
bit and 64 bit.
Link: https://lore.kernel.org/r/20241214-gekoppelt-erdarbeiten-a1f9a982a5a6@brauner
Signed-off-by: Christian Brauner <brauner@kernel.org>
On 64-bit platforms, userspace can read the pidfd's inode in order to
get a never-repeated PID identifier. On 32-bit platforms this identifier
is not exposed, as inodes are limited to 32 bits. Instead expose the
identifier via export_fh, which makes it available to userspace via
name_to_handle_at.
In addition we implement fh_to_dentry, which allows userspace to
recover a pidfd from a pidfs file handle.
Signed-off-by: Erin Shepherd <erin.shepherd@e43.eu>
[brauner: patch heavily rewritten]
Link: https://lore.kernel.org/r/20241129-work-pidfs-file_handle-v1-6-87d803a42495@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Co-Developed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Zbigniew mentioned at Linux Plumber's that systemd is interested in
switching to execveat() for service execution, but can't, because the
contents of /proc/pid/comm are the file descriptor which was used,
instead of the path to the binary[1]. This makes the output of tools like
top and ps useless, especially in a world where most fds are opened
CLOEXEC so the number is truly meaningless.
When the filename passed in is empty (e.g. with AT_EMPTY_PATH), use the
dentry's filename for "comm" instead of using the useless numeral from
the synthetic fdpath construction. This way the actual exec machinery
is unchanged, but cosmetically the comm looks reasonable to admins
investigating things.
Instead of adding TASK_COMM_LEN more bytes to bprm, use one of the unused
flag bits to indicate that we need to set "comm" from the dentry.
Suggested-by: Zbigniew Jędrzejewski-Szmek <zbyszek@in.waw.pl>
Suggested-by: Tycho Andersen <tandersen@netflix.com>
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://github.com/uapi-group/kernel-features#set-comm-field-before-exec [1]
Reviewed-by: Aleksa Sarai <cyphar@cyphar.com>
Tested-by: Zbigniew Jędrzejewski-Szmek <zbyszek@in.waw.pl>
Signed-off-by: Kees Cook <kees@kernel.org>
Using strscpy() meant that the final character in task->comm may be
non-NUL for a moment before the "string too long" truncation happens.
Instead of adding a new use of the ambiguous strncpy(), we'd want to
use memtostr_pad() which enforces being able to check at compile time
that sizes are sensible, but this requires being able to see string
buffer lengths. Instead of trying to inline __set_task_comm() (which
needs to call trace and perf functions), just open-code it. But to
make sure we're always safe, add compile-time checking like we already
do for get_task_comm().
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Suggested-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Kees Cook <kees@kernel.org>
If mounted with sparseread option, ceph_direct_read_write() ends up
making an unnecessarily allocation for O_DIRECT writes.
Fixes: 03bc06c7b0 ("ceph: add new mount option to enable sparse reads")
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Alex Markuze <amarkuze@redhat.com>
The bvecs array which is allocated in iter_get_bvecs_alloc() is leaked
and pages remain pinned if ceph_alloc_sparse_ext_map() fails.
There is no need to delay the allocation of sparse_ext map until after
the bvecs array is set up, so fix this by moving sparse_ext allocation
a bit earlier. Also, make a similar adjustment in __ceph_sync_read()
for consistency (a leak of the same kind in __ceph_sync_read() has been
addressed differently).
Cc: stable@vger.kernel.org
Fixes: 03bc06c7b0 ("ceph: add new mount option to enable sparse reads")
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Alex Markuze <amarkuze@redhat.com>
This patch refines the read logic in __ceph_sync_read() to ensure more
predictable and efficient behavior in various edge cases.
- Return early if the requested read length is zero or if the file size
(`i_size`) is zero.
- Initialize the index variable (`idx`) where needed and reorder some
code to ensure it is always set before use.
- Improve error handling by checking for negative return values earlier.
- Remove redundant encrypted file checks after failures. Only attempt
filesystem-level decryption if the read succeeded.
- Simplify leftover calculations to correctly handle cases where the
read extends beyond the end of the file or stops short. This can be
hit by continuously reading a file while, on another client, we keep
truncating and writing new data into it.
- This resolves multiple issues caused by integer and consequent buffer
overflow (`pages` array being accessed beyond `num_pages`):
- https://tracker.ceph.com/issues/67524
- https://tracker.ceph.com/issues/68980
- https://tracker.ceph.com/issues/68981
Cc: stable@vger.kernel.org
Fixes: 1065da21e5 ("ceph: stop copying to iter at EOF on sync reads")
Reported-by: Luis Henriques (SUSE) <luis.henriques@linux.dev>
Signed-off-by: Alex Markuze <amarkuze@redhat.com>
Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
It becomes a path component, so it shouldn't exceed NAME_MAX
characters. This was hardened in commit c152737be2 ("ceph: Use
strscpy() instead of strcpy() in __get_snap_name()"), but no actual
check was put in place.
Cc: stable@vger.kernel.org
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Alex Markuze <amarkuze@redhat.com>
If the full path to be built by ceph_mdsc_build_path() happens to be
longer than PATH_MAX, then this function will enter an endless (retry)
loop, effectively blocking the whole task. Most of the machine
becomes unusable, making this a very simple and effective DoS
vulnerability.
I cannot imagine why this retry was ever implemented, but it seems
rather useless and harmful to me. Let's remove it and fail with
ENAMETOOLONG instead.
Cc: stable@vger.kernel.org
Reported-by: Dario Weißer <dario@cure53.de>
Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
Reviewed-by: Alex Markuze <amarkuze@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
In two `break` statements, the call to ceph_release_page_vector() was
missing, leaking the allocation from ceph_alloc_page_vector().
Instead of adding the missing ceph_release_page_vector() calls, the
Ceph maintainers preferred to transfer page ownership to the
`ceph_osd_request` by passing `own_pages=true` to
osd_req_op_extent_osd_data_pages(). This requires postponing the
ceph_osdc_put_request() call until after the block that accesses the
`pages`.
Cc: stable@vger.kernel.org
Fixes: 03bc06c7b0 ("ceph: add new mount option to enable sparse reads")
Fixes: f0fe1e54cf ("ceph: plumb in decryption during reads")
Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
In SSR mode, the segment selected for allocation might be the same as
the target segment of the GC triggered by ioctl, resulting in the GC
moving the CURSEG_I(sbi, type)->segno.
Thread A Thread B or Thread A
- f2fs_ioc_gc_range
- __f2fs_ioc_gc_range(.victim_segno=segno#N)
- f2fs_gc
- __get_victim
- f2fs_get_victim
: segno#N is valid, return segno#N as source segment of GC
- f2fs_allocate_data_block
- need_new_seg
- get_ssr_segment
- f2fs_get_victim
: get segno #N as destination segment
- change_curseg
Fixes: e066b83c9b ("f2fs: add ioctl to flush data from faster device to cold area")
Signed-off-by: Yongpeng Yang <yangyongpeng1@oppo.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
While traversing dir entries in dentry page, it's better to refresh current
accessed page in lru list by using FGP_ACCESSED flag, otherwise, such page
may has less chance to survive during memory reclaim, result in causing
additional IO when revisiting dentry page.
Signed-off-by: zangyangyang1 <zangyangyang1@xiaomi.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
All folios that f2fs sees belong to f2fs and not to the swapcache
so it can dereference folio->mapping directly like all other
filesystems do.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Remove the last call to page_file_mapping() as both callers can now pass
in a folio.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Remove a call to compound_head(). We can call bio_add_folio_nofail()
here because we just allocated the bio, so we know it can't fail and
thus the error path can never be taken.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Cache the result of page_folio(fio->page) in a local variable so
we don't have to keep calling it. Saves a couple of calls to
compound_head() and removes an access to page->mapping.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Use bio_for_each_folio_all() to iterate over each folio in the bio.
This lets us use folio_end_read() which saves an atomic operation and
memory barrier compared to marking the folio uptodate and unlocking
it as two separate operations. This also removes a few hidden calls
to compound_head().
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This is the folio equivalent of F2FS_P_SB(). Removes a call to
page_file_mapping() as we know folios seen by f2fs are never part of
the swap cache.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Remove accesses to page->index and page->mapping as well as
unnecessary calls to page_file_mapping().
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Remove accesses to page->index and an unnecessary reference to
page->mapping.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Convert the incoming page to a folio and use it throughout.
Removes an access to page->index.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Remove references to page->index and use folio_test_uptodate()
instead of PageUptodate().
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
For many use cases (e.g. container images are just fetched from remote),
performance will be impacted if underlay page cache is up-to-date but
direct i/o flushes dirty pages first.
Instead, let's use buffered I/O by default to keep in sync with loop
devices and add a (re)mount option to explicitly give a try to use
direct I/O if supported by the underlying files.
The container startup time is improved as below:
[workload] docker.io/library/workpress:latest
unpack 1st run non-1st runs
EROFS snapshotter buffered I/O file 4.586404265s 0.308s 0.198s
EROFS snapshotter direct I/O file 4.581742849s 2.238s 0.222s
EROFS snapshotter loop 4.596023152s 0.346s 0.201s
Overlayfs snapshotter 5.382851037s 0.206s 0.214s
Fixes: fb17675026 ("erofs: add file-backed mount support")
Cc: Derek McGowan <derek@mcg.dev>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20241212134336.2059899-1-hsiangkao@linux.alibaba.com