f2fs_zero_post_eof_page() may cuase more overhead due to invalidate_lock
and page lookup, change as below to mitigate its overhead:
- check new_size before grabbing invalidate_lock
- lookup and invalidate pages only in range of [old_size, new_size]
Fixes: ba8dac350f ("f2fs: fix to zero post-eof page")
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
syzbot reports a bug as below:
loop0: detected capacity change from 0 to 40427
F2FS-fs (loop0): Wrong SSA boundary, start(3584) end(4096) blocks(3072)
F2FS-fs (loop0): Can't find valid F2FS filesystem in 1th superblock
F2FS-fs (loop0): invalid crc value
F2FS-fs (loop0): f2fs_convert_inline_folio: corrupted inline inode ino=3, i_addr[0]:0x1601, run fsck to fix.
------------[ cut here ]------------
kernel BUG at fs/inode.c:753!
RIP: 0010:clear_inode+0x169/0x190 fs/inode.c:753
Call Trace:
<TASK>
evict+0x504/0x9c0 fs/inode.c:810
f2fs_fill_super+0x5612/0x6fa0 fs/f2fs/super.c:5047
get_tree_bdev_flags+0x40e/0x4d0 fs/super.c:1692
vfs_get_tree+0x8f/0x2b0 fs/super.c:1815
do_new_mount+0x2a2/0x9e0 fs/namespace.c:3808
do_mount fs/namespace.c:4136 [inline]
__do_sys_mount fs/namespace.c:4347 [inline]
__se_sys_mount+0x317/0x410 fs/namespace.c:4324
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0xfa/0x3b0 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
During f2fs_evict_inode(), clear_inode() detects that we missed to truncate
all page cache before destorying inode, that is because in below path, we
will create page #0 in cache, but missed to drop it in error path, let's fix
it.
- evict
- f2fs_evict_inode
- f2fs_truncate
- f2fs_convert_inline_inode
- f2fs_grab_cache_folio
: create page #0 in cache
- f2fs_convert_inline_folio
: sanity check failed, return -EFSCORRUPTED
- clear_inode detects that inode->i_data.nrpages is not zero
Fixes: 92dffd0179 ("f2fs: convert inline_data when i_size becomes large")
Reported-by: syzbot+90266696fe5daacebd35@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/linux-f2fs-devel/68c09802.050a0220.3c6139.000e.GAE@google.com
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Pull f2fs updates from Jaegeuk Kim:
"Three main updates: folio conversion by Matthew, switch to a new mount
API by Hongbo and Eric, and several sysfs entries to tune GCs for ZUFS
with finer granularity by Daeho.
There are also patches to address bugs and issues in the existing
features such as GCs, file pinning, write-while-dio-read, contingous
block allocation, and memory access violations.
Enhancements:
- switch to new mount API and folio conversion
- add sysfs nodes to controle F2FS GCs for ZUFS
- improve performance on the nat entry cache
- drop inode from the donation list when the last file is closed
- avoid splitting bio when reading multiple pages
Bug fixes:
- fix to trigger foreground gc during f2fs_map_blocks() in lfs mode
- make sure zoned device GC to use FG_GC in shortage of free section
- fix to calculate dirty data during has_not_enough_free_secs()
- fix to update upper_p in __get_secs_required() correctly
- wait for inflight dio completion, excluding pinned files read using dio
- don't break allocation when crossing contiguous sections
- vm_unmap_ram() may be called from an invalid context
- fix to avoid out-of-boundary access in dnode page
- fix to avoid panic in f2fs_evict_inode
- fix to avoid UAF in f2fs_sync_inode_meta()
- fix to use f2fs_is_valid_blkaddr_raw() in do_write_page()
- fix UAF of f2fs_inode_info in f2fs_free_dic
- fix to avoid invalid wait context issue
- fix bio memleak when committing super block
- handle nat.blkaddr corruption in f2fs_get_node_info()
In addition, there are also clean-ups and minor bug fixes"
* tag 'f2fs-for-6.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (109 commits)
f2fs: drop inode from the donation list when the last file is closed
f2fs: add gc_boost_gc_greedy sysfs node
f2fs: add gc_boost_gc_multiple sysfs node
f2fs: fix to trigger foreground gc during f2fs_map_blocks() in lfs mode
f2fs: fix to calculate dirty data during has_not_enough_free_secs()
f2fs: fix to update upper_p in __get_secs_required() correctly
f2fs: directly add newly allocated pre-dirty nat entry to dirty set list
f2fs: avoid redundant clean nat entry move in lru list
f2fs: zone: wait for inflight dio completion, excluding pinned files read using dio
f2fs: ignore valid ratio when free section count is low
f2fs: don't break allocation when crossing contiguous sections
f2fs: remove unnecessary tracepoint enabled check
f2fs: merge the two conditions to avoid code duplication
f2fs: vm_unmap_ram() may be called from an invalid context
f2fs: fix to avoid out-of-boundary access in dnode page
f2fs: switch to the new mount api
f2fs: introduce fs_context_operation structure
f2fs: separate the options parsing and options checking
f2fs: Add f2fs_fs_context to record the mount options
f2fs: Allow sbi to be NULL in f2fs_printk
...
Let's drop the inode from the donation list when there is no other
open file.
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Pull fileattr updates from Christian Brauner:
"This introduces the new file_getattr() and file_setattr() system calls
after lengthy discussions.
Both system calls serve as successors and extensible companions to
the FS_IOC_FSGETXATTR and FS_IOC_FSSETXATTR system calls which have
started to show their age in addition to being named in a way that
makes it easy to conflate them with extended attribute related
operations.
These syscalls allow userspace to set filesystem inode attributes on
special files. One of the usage examples is the XFS quota projects.
XFS has project quotas which could be attached to a directory. All new
inodes in these directories inherit project ID set on parent
directory.
The project is created from userspace by opening and calling
FS_IOC_FSSETXATTR on each inode. This is not possible for special
files such as FIFO, SOCK, BLK etc. Therefore, some inodes are left
with empty project ID. Those inodes then are not shown in the quota
accounting but still exist in the directory. This is not critical but
in the case when special files are created in the directory with
already existing project quota, these new inodes inherit extended
attributes. This creates a mix of special files with and without
attributes. Moreover, special files with attributes don't have a
possibility to become clear or change the attributes. This, in turn,
prevents userspace from re-creating quota project on these existing
files.
In addition, these new system calls allow the implementation of
additional attributes that we couldn't or didn't want to fit into the
legacy ioctls anymore"
* tag 'vfs-6.17-rc1.fileattr' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
fs: tighten a sanity check in file_attr_to_fileattr()
tree-wide: s/struct fileattr/struct file_kattr/g
fs: introduce file_getattr and file_setattr syscalls
fs: prepare for extending file_get/setattr()
fs: make vfs_fileattr_[get|set] return -EOPNOTSUPP
selinux: implement inode_file_[g|s]etattr hooks
lsm: introduce new hooks for setting/getting inode fsxattr
fs: split fileattr related helpers into separate file
Pull mmap_prepare updates from Christian Brauner:
"Last cycle we introduce f_op->mmap_prepare() in c84bf6dd2b ("mm:
introduce new .mmap_prepare() file callback").
This is preferred to the existing f_op->mmap() hook as it does require
a VMA to be established yet, thus allowing the mmap logic to invoke
this hook far, far earlier, prior to inserting a VMA into the virtual
address space, or performing any other heavy handed operations.
This allows for much simpler unwinding on error, and for there to be a
single attempt at merging a VMA rather than having to possibly
reattempt a merge based on potentially altered VMA state.
Far more importantly, it prevents inappropriate manipulation of
incompletely initialised VMA state, which is something that has been
the cause of bugs and complexity in the past.
The intent is to gradually deprecate f_op->mmap, and in that vein this
series coverts the majority of file systems to using f_op->mmap_prepare.
Prerequisite steps are taken - firstly ensuring all checks for mmap
capabilities use the file_has_valid_mmap_hooks() helper rather than
directly checking for f_op->mmap (which is now not a valid check) and
secondly updating daxdev_mapping_supported() to not require a VMA
parameter to allow ext4 and xfs to be converted.
Commit bb666b7c27 ("mm: add mmap_prepare() compatibility layer for
nested file systems") handles the nasty edge-case of nested file
systems like overlayfs, which introduces a compatibility shim to allow
f_op->mmap_prepare() to be invoked from an f_op->mmap() callback.
This allows for nested filesystems to continue to function correctly
with all file systems regardless of which callback is used. Once we
finally convert all file systems, this shim can be removed.
As a result, ecryptfs, fuse, and overlayfs remain unaltered so they
can nest all other file systems.
We additionally do not update resctl - as this requires an update to
remap_pfn_range() (or an alternative to it) which we defer to a later
series, equally we do not update cramfs which needs a mixed mapping
insertion with the same issue, nor do we update procfs, hugetlbfs,
syfs or kernfs all of which require VMAs for internal state and hooks.
We shall return to all of these later"
* tag 'vfs-6.17-rc1.mmap_prepare' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
doc: update porting, vfs documentation to describe mmap_prepare()
fs: replace mmap hook with .mmap_prepare for simple mappings
fs: convert most other generic_file_*mmap() users to .mmap_prepare()
fs: convert simple use of generic_file_*_mmap() to .mmap_prepare()
mm/filemap: introduce generic_file_*_mmap_prepare() helpers
fs/xfs: transition from deprecated .mmap hook to .mmap_prepare
fs/ext4: transition from deprecated .mmap hook to .mmap_prepare
fs/dax: make it possible to check dev dax support without a VMA
fs: consistently use can_mmap_file() helper
mm/nommu: use file_has_valid_mmap_hooks() helper
mm: rename call_mmap/mmap_prepare to vfs_mmap/mmap_prepare
There is no extra work before trace_f2fs_[dataread|datawrite]_end(),
so there is no need to check trace_<tracepoint>_enabled().
Signed-off-by: Sheng Yong <shengyong1@xiaomi.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Name these new functions folio_test_f2fs_*(), folio_set_f2fs_*() and
folio_clear_f2fs_*(). Convert all callers which currently have a folio
and cast back to a page.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
All callers now have a folio so pass it in. Also make it const to help
the compiler.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Let's return errors caught by the generic checks. This fixes generic/494 where
it expects to see EBUSY by setattr_prepare instead of EINVAL by f2fs for active
swapfile.
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
To prevent scattered pin block generation, don't allow non-section aligned truncation
to smaller or equal size on pinned file. But for truncation to larger size, after
commit 3fdd89b452c2("f2fs: prevent writing without fallocate() for pinned files"),
we only support overwrite IO to pinned file, so we don't need to consider
attr->ia_size > i_size case.
Signed-off-by: wangzijie <wangzijie1@honor.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Introduce sbi in f2fs_setattr() and convert F2FS_I_SB to it. No logic
change, just cleanup and prepare to get CAP_BLKS_PER_SEC(sbi).
Signed-off-by: wangzijie <wangzijie1@honor.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch introduces /sys/fs/f2fs/<dev>/reserved_pin_section for tuning
@needed parameter of has_not_enough_free_secs(), if we configure it w/
zero, it can avoid f2fs_gc() as much as possible while fallocating on
pinned file.
Signed-off-by: Chao Yu <chao@kernel.org>
Reviewed-by: wangzijie <wangzijie1@honor.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Since commit c84bf6dd2b ("mm: introduce new .mmap_prepare() file
callback"), the f_op->mmap() hook has been deprecated in favour of
f_op->mmap_prepare().
This callback is invoked in the mmap() logic far earlier, so error handling
can be performed more safely without complicated and bug-prone state
unwinding required should an error arise.
This hook also avoids passing a pointer to a not-yet-correctly-established
VMA avoiding any issues with referencing this data structure.
It rather provides a pointer to the new struct vm_area_desc descriptor type
which contains all required state and allows easy setting of required
parameters without any consideration needing to be paid to locking or
reference counts.
Note that nested filesystems like overlayfs are compatible with an
.mmap_prepare() callback since commit bb666b7c27 ("mm: add mmap_prepare()
compatibility layer for nested file systems").
In this patch we apply this change to file systems with relatively simple
mmap() hook logic - exfat, ceph, f2fs, bcachefs, zonefs, btrfs, ocfs2,
orangefs, nilfs2, romfs, ramfs and aio.
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Link: https://lore.kernel.org/f528ac4f35b9378931bd800920fee53fc0c5c74d.1750099179.git.lorenzo.stoakes@oracle.com
Acked-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
fstest reports a f2fs bug:
generic/363 42s ... [failed, exit status 1]- output mismatch (see /share/git/fstests/results//generic/363.out.bad)
--- tests/generic/363.out 2025-01-12 21:57:40.271440542 +0800
+++ /share/git/fstests/results//generic/363.out.bad 2025-05-19 19:55:58.000000000 +0800
@@ -1,2 +1,78 @@
QA output created by 363
fsx -q -S 0 -e 1 -N 100000
+READ BAD DATA: offset = 0xd6fb, size = 0xf044, fname = /mnt/f2fs/junk
+OFFSET GOOD BAD RANGE
+0x1540d 0x0000 0x2a25 0x0
+operation# (mod 256) for the bad data may be 37
+0x1540e 0x0000 0x2527 0x1
...
(Run 'diff -u /share/git/fstests/tests/generic/363.out /share/git/fstests/results//generic/363.out.bad' to see the entire diff)
Ran: generic/363
Failures: generic/363
Failed 1 of 1 tests
The root cause is user can update post-eof page via mmap [1], however, f2fs
missed to zero post-eof page in below operations, so, once it expands i_size,
then it will include dummy data locates previous post-eof page, so during
below operations, we need to zero post-eof page.
Operations which can include dummy data after previous i_size after expanding
i_size:
- write
- mapwrite [1]
- truncate
- fallocate
* preallocate
* zero_range
* insert_range
* collapse_range
- clone_range (doesn’t support in f2fs)
- copy_range (doesn’t support in f2fs)
[1] https://man7.org/linux/man-pages/man2/mmap.2.html 'BUG section'
Cc: stable@kernel.org
Signed-off-by: Chao Yu <chao@kernel.org>
Reviewed-by: Zhiguo Niu <zhiguo.niu@unisoc.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Since commits 7ff0104a80 ("f2fs: Remove f2fs_write_node_page()") and
3b47398d98 ("f2fs: Remove f2fs_write_meta_page()'), f2fs can't be
called from reclaim context any more. Remove all code keyed of the
wbc->for_reclaim flag, which is now only set for writing out swap or
shmem pages inside the swap code, but never passed to file systems.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In cases of removing memory donation, we need to handle some error cases
like ENOENT and EACCES (indicating the range already has been donated).
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
All callers except __get_inode_rdev() and __set_inode_rdev() now have a
folio, but the only callers of those two functions do have a folio, so
pass the folio to them and then into get_dnode_addr().
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
All assignments to this struct member are conversions from a folio
so convert it to be a folio and convert all users. At the same time,
convert data_blkaddr() to take a folio as all callers now have a folio.
Remove eight calls to compound_head().
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Support large folios & simplify the loops in redirty_blocks().
Use the folio APIs and remove four calls to compound_head().
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Fetch a folio from the pagecache instead of a page. Removes two
calls to compound_head()
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
All callers now have a folio, so pass it in. Removes a call to
compound_head().
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
There were some missing conversions from f2fs_wait_on_page_writeback()
to f2fs_folio_wait_writeback(). Saves a call to compound_head() at each
callsite.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch uses i_sem to protect access/update on f2fs_inode_info.flag
in finish_preallocate_blocks(), it avoids grabbing inode_lock() in
each open().
Signed-off-by: Chao Yu <chao@kernel.org>
Reviewed-by: Zhiguo Niu <zhiguo.niu@unisoc.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If checkpoint is disabled, GC can not reclaim any segments, we need
to detect such condition and bail out from fallocate() of a pinfile,
rather than letting allocator running out of free segment, which may
cause f2fs to be shutdown.
reproducer:
mkfs.f2fs -f /dev/vda 16777216
mount -o checkpoint=disable:10% /dev/vda /mnt/f2fs
for ((i=0;i<4096;i++)) do { dd if=/dev/zero of=/mnt/f2fs/$i bs=1M count=1; } done
sync
for ((i=0;i<4096;i+=2)) do { rm /mnt/f2fs/$i; } done
sync
touch /mnt/f2fs/pinfile
f2fs_io pinfile set /mnt/f2fs/pinfile
f2fs_io fallocate 0 0 4201644032 /mnt/f2fs/pinfile
cat /sys/kernel/debug/f2fs/status
output:
- Free: 0 (0)
Fixes: f5a53edcf0 ("f2fs: support aligned pinned file")
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch introduces a new wrapper f2fs_get_inode_page(), then, caller
can use it to load inode block to page cache, meanwhile it will do sanity
check on inode footer.
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Retrieve a folio from the page cache and use it throughout.
Saves five hidden calls to compound_head().
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch records POSIX_FADV_NOREUSE ranges for users to reclaim the caches
instantly off from LRU.
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
syzbot reports a f2fs bug as below:
------------[ cut here ]------------
kernel BUG at fs/f2fs/segment.c:2746!
CPU: 0 UID: 0 PID: 5323 Comm: syz.0.0 Not tainted 6.13.0-rc2-syzkaller-00018-g7cb1b4663150 #0
RIP: 0010:get_new_segment fs/f2fs/segment.c:2746 [inline]
RIP: 0010:new_curseg+0x1f52/0x1f70 fs/f2fs/segment.c:2876
Call Trace:
<TASK>
__allocate_new_segment+0x1ce/0x940 fs/f2fs/segment.c:3210
f2fs_allocate_new_section fs/f2fs/segment.c:3224 [inline]
f2fs_allocate_pinning_section+0xfa/0x4e0 fs/f2fs/segment.c:3238
f2fs_expand_inode_data+0x696/0xca0 fs/f2fs/file.c:1830
f2fs_fallocate+0x537/0xa10 fs/f2fs/file.c:1940
vfs_fallocate+0x569/0x6e0 fs/open.c:327
do_vfs_ioctl+0x258c/0x2e40 fs/ioctl.c:885
__do_sys_ioctl fs/ioctl.c:904 [inline]
__se_sys_ioctl+0x80/0x170 fs/ioctl.c:892
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
Concurrent pinfile allocation may run out of free section, result in
panic in get_new_segment(), let's expand pin_sem lock coverage to
include f2fs_gc(), so that we can make sure to reclaim enough free
space for following allocation.
In addition, do below changes to enhance error path handling:
- call f2fs_bug_on() only in non-pinfile allocation path in
get_new_segment().
- call reset_curseg_fields() to reset all fields of curseg in
new_curseg()
Fixes: f5a53edcf0 ("f2fs: support aligned pinned file")
Reported-by: syzbot+15669ec8c35ddf6c3d43@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/linux-f2fs-devel/675cd64e.050a0220.37aaf.00bb.GAE@google.com
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch adds an ioctl to give a per-file priority hint to attach
REQ_PRIO.
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Function f2fs_invalidate_blocks() can process consecutive
blocks at a time, so f2fs_truncate_data_blocks_range() is
optimized to use the new functionality of
f2fs_invalidate_blocks().
Add two variables @blkstart and @blklen, @blkstart records
the first address of the consecutive blocks, and @blkstart
records the number of consecutive blocks.
Signed-off-by: Yi Sun <yi.sun@unisoc.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
New function can process some consecutive blocks at a time.
Function f2fs_invalidate_blocks()->down_write() and up_write()
are very time-consuming, so if f2fs_invalidate_blocks() can
process consecutive blocks at one time, it will save a lot of time.
Signed-off-by: Yi Sun <yi.sun@unisoc.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Pull f2fs updates from Jaegeuk Kim:
"This series introduces a device aliasing feature where user can carve
out partitions but reclaim the space back by deleting aliased file in
root dir.
In addition to that, there're numerous minor bug fixes in zoned device
support, checkpoint=disable, extent cache management, fiemap, and
lazytime mount option. The full list of noticeable changes can be
found below.
Enhancements:
- introduce device aliasing file
- add stats in debugfs to show multiple devices
- add a sysfs node to limit max read extent count per-inode
- modify f2fs_is_checkpoint_ready logic to allow more data to be
written with the CP disable
- decrease spare area for pinned files for zoned devices
Fixes:
- Revert "f2fs: remove unreachable lazytime mount option parsing"
- adjust unusable cap before checkpoint=disable mode
- fix to drop all discards after creating snapshot on lvm device
- fix to shrink read extent node in batches
- fix changing cursegs if recovery fails on zoned device
- fix to adjust appropriate length for fiemap
- fix fiemap failure issue when page size is 16KB
- fix to avoid forcing direct write to use buffered IO on inline_data
inode
- fix to map blocks correctly for direct write
- fix to account dirty data in __get_secs_required()
- fix null-ptr-deref in f2fs_submit_page_bio()
- fix inconsistent update of i_blocks in release_compress_blocks and
reserve_compress_blocks"
* tag 'f2fs-for-6.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (40 commits)
f2fs: fix to drop all discards after creating snapshot on lvm device
f2fs: add a sysfs node to limit max read extent count per-inode
f2fs: fix to shrink read extent node in batches
f2fs: print message if fscorrupted was found in f2fs_new_node_page()
f2fs: clear SBI_POR_DOING before initing inmem curseg
f2fs: fix changing cursegs if recovery fails on zoned device
f2fs: adjust unusable cap before checkpoint=disable mode
f2fs: fix to requery extent which cross boundary of inquiry
f2fs: fix to adjust appropriate length for fiemap
f2fs: clean up w/ F2FS_{BLK_TO_BYTES,BTYES_TO_BLK}
f2fs: fix to do cast in F2FS_{BLK_TO_BYTES, BTYES_TO_BLK} to avoid overflow
f2fs: replace deprecated strcpy with strscpy
Revert "f2fs: remove unreachable lazytime mount option parsing"
f2fs: fix to avoid forcing direct write to use buffered IO on inline_data inode
f2fs: fix to map blocks correctly for direct write
f2fs: fix race in concurrent f2fs_stop_gc_thread
f2fs: fix fiemap failure issue when page size is 16KB
f2fs: remove redundant atomic file check in defragment
f2fs: fix to convert log type to segment data type correctly
f2fs: clean up the unused variable additional_reserved_segments
...
Pull 'struct fd' class updates from Al Viro:
"The bulk of struct fd memory safety stuff
Making sure that struct fd instances are destroyed in the same scope
where they'd been created, getting rid of reassignments and passing
them by reference, converting to CLASS(fd{,_pos,_raw}).
We are getting very close to having the memory safety of that stuff
trivial to verify"
* tag 'pull-fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (28 commits)
deal with the last remaing boolean uses of fd_file()
css_set_fork(): switch to CLASS(fd_raw, ...)
memcg_write_event_control(): switch to CLASS(fd)
assorted variants of irqfd setup: convert to CLASS(fd)
do_pollfd(): convert to CLASS(fd)
convert do_select()
convert vfs_dedupe_file_range().
convert cifs_ioctl_copychunk()
convert media_request_get_by_fd()
convert spu_run(2)
switch spufs_calls_{get,put}() to CLASS() use
convert cachestat(2)
convert do_preadv()/do_pwritev()
fdget(), more trivial conversions
fdget(), trivial conversions
privcmd_ioeventfd_assign(): don't open-code eventfd_ctx_fdget()
o2hb_region_dev_store(): avoid goto around fdget()/fdput()
introduce "fd_pos" class, convert fdget_pos() users to it.
fdget_raw() users: switch to CLASS(fd_raw)
convert vmsplice() to CLASS(fd)
...
Jinsu Lee reported a performance regression issue, after commit
5c8764f867 ("f2fs: fix to force buffered IO on inline_data
inode"), we forced direct write to use buffered IO on inline_data
inode, it will cause performace regression due to memory copy
and data flush.
It's fine to not force direct write to use buffered IO, as it
can convert inline inode before committing direct write IO.
Fixes: 5c8764f867 ("f2fs: fix to force buffered IO on inline_data inode")
Reported-by: Jinsu Lee <jinsu1.lee@samsung.com>
Closes: https://lore.kernel.org/linux-f2fs-devel/af03dd2c-e361-4f80-b2fd-39440766cf6e@kernel.org
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In my test case, concurrent calls to f2fs shutdown report the following
stack trace:
Oops: general protection fault, probably for non-canonical address 0xc6cfff63bb5513fc: 0000 [#1] PREEMPT SMP PTI
CPU: 0 UID: 0 PID: 678 Comm: f2fs_rep_shutdo Not tainted 6.12.0-rc5-next-20241029-g6fb2fa9805c5-dirty #85
Call Trace:
<TASK>
? show_regs+0x8b/0xa0
? __die_body+0x26/0xa0
? die_addr+0x54/0x90
? exc_general_protection+0x24b/0x5c0
? asm_exc_general_protection+0x26/0x30
? kthread_stop+0x46/0x390
f2fs_stop_gc_thread+0x6c/0x110
f2fs_do_shutdown+0x309/0x3a0
f2fs_ioc_shutdown+0x150/0x1c0
__f2fs_ioctl+0xffd/0x2ac0
f2fs_ioctl+0x76/0xe0
vfs_ioctl+0x23/0x60
__x64_sys_ioctl+0xce/0xf0
x64_sys_call+0x2b1b/0x4540
do_syscall_64+0xa7/0x240
entry_SYSCALL_64_after_hwframe+0x76/0x7e
The root cause is a race condition in f2fs_stop_gc_thread() called from
different f2fs shutdown paths:
[CPU0] [CPU1]
---------------------- -----------------------
f2fs_stop_gc_thread f2fs_stop_gc_thread
gc_th = sbi->gc_thread
gc_th = sbi->gc_thread
kfree(gc_th)
sbi->gc_thread = NULL
< gc_th != NULL >
kthread_stop(gc_th->f2fs_gc_task) //UAF
The commit c7f114d864 ("f2fs: fix to avoid use-after-free in
f2fs_stop_gc_thread()") attempted to fix this issue by using a read
semaphore to prevent races between shutdown and remount threads, but
it fails to prevent all race conditions.
Fix it by converting to write lock of s_umount in f2fs_do_shutdown().
Fixes: 7950e9ac63 ("f2fs: stop gc/discard thread after fs shutdown")
Signed-off-by: Long Li <leo.lilong@huawei.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
all failure exits prior to fdget() leave the scope, all matching fdput()
are immediately followed by leaving the scope.
[xfs_ioc_commit_range() chunk moved here as well]
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
f2fs_is_atomic_file(inode) is checked in f2fs_defragment_range,
so remove the redundant checking in f2fs_ioc_defragment.
Signed-off-by: Zhiguo Niu <zhiguo.niu@unisoc.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In __get_segment_type(), __get_segment_type_6() may return
CURSEG_COLD_DATA_PINNED or CURSEG_ALL_DATA_ATGC log type, but
following f2fs_get_segment_temp() can only handle persistent
log type, fix it.
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
F2FS should understand how the device aliasing file works and support
deleting the file after use. A device aliasing file can be created by
mkfs.f2fs tool and it can map the whole device with an extent, not
using node blocks. The file space should be pinned and normally used for
read-only usages.
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Signed-off-by: Chao Yu <chao@kernel.org>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>