if filename is encrypted, filename could have no printable characters.
so remove it.
Signed-off-by: Hyunchul Lee <cheol.lee@lge.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
- has_not_enough_free_secs
node_secs: 0 dent_secs: 0 freed:0 free_segments:103 reserved:104
- f2fs_gc
- get_victim_by_default
alloc_mode 0, gc_mode 1, max_search 2672, offset 4654, ofs_unit 1
- do_garbage_collect
start_segno 3976, end_segno 3977 type 0
- is_alive
nid 22797, blkaddr 2131882, ofs_in_node 0, version 0x8/0x0
- gc_data_segment 766, segno 3976, block 512/426 not alive
So, this patch fixes subtle corrupted case where node version does not match
to summary version which results in infinite loop by gc.
Reported-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In order to give more spatial locality, this patch changes the block allocation
policy which assigns beginning of partition for small and hot data/node blocks.
In order to do this, we set noheap allocation by default and introduce another
mount option, heap, to reset it back.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
generic_permission() presently checks CAP_DAC_OVERRIDE prior to
CAP_DAC_READ_SEARCH. This can cause misleading audit messages when
using a LSM such as SELinux or AppArmor, since CAP_DAC_OVERRIDE
may not be required for the operation. Flip the order of the
tests so that CAP_DAC_OVERRIDE is only checked when required for
the operation.
Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov>
Acked-by: John Johansen <john.johansen@canonical.com>
Reviewed-by: Serge Hallyn <serge@hallyn.com>
Acked-by: James Morris <james.l.morris@oracle.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
This isn't super serious because you need CAP_ADMIN to run this code.
I added this integer overflow check last year but apparently I am
rubbish at writing integer overflow checks... There are two issues.
First, access_ok() works on unsigned long type and not u64 so on 32 bit
systems the access_ok() could be checking a truncated size. The other
issue is that we should be using a stricter limit so we don't overflow
the kzalloc() setting ctx->clone_roots later in the function after the
access_ok():
alloc_size = sizeof(struct clone_root) * (arg->clone_sources_count + 1);
sctx->clone_roots = kzalloc(alloc_size, GFP_KERNEL | __GFP_NOWARN);
Fixes: f5ecec3ce2 ("btrfs: send: silence an integer overflow warning")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ added comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
Using an int value is causing qg->reserved to become negative and
exclusive -EDQUOT to be reached prematurely.
This affects exclusive qgroups only.
TEST CASE:
DEVICE=/dev/vdb
MOUNTPOINT=/mnt
SUBVOL=$MOUNTPOINT/tmp
umount $SUBVOL
umount $MOUNTPOINT
mkfs.btrfs -f $DEVICE
mount /dev/vdb $MOUNTPOINT
btrfs quota enable $MOUNTPOINT
btrfs subvol create $SUBVOL
umount $MOUNTPOINT
mount /dev/vdb $MOUNTPOINT
mount -o subvol=tmp $DEVICE $SUBVOL
btrfs qgroup limit -e 3G $SUBVOL
btrfs quota rescan /mnt -w
for i in `seq 1 44000`; do
dd if=/dev/zero of=/mnt/tmp/test_$i bs=10k count=1
if [[ $? > 0 ]]; then
btrfs qgroup show -pcref $SUBVOL
exit 1
fi
done
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
[ add reproducer to changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
Commit 20a7db8ab3 ("btrfs: add dummy callback for readpage_io_failed
and drop checks") made a cleanup around readpage_io_failed_hook, and
it was supposed to keep the original sematics, but it also
unexpectedly disabled repair during read for dup, raid1 and raid10.
This fixes the problem by letting data's inode call the generic
readpage_io_failed callback by returning -EAGAIN from its
readpage_io_failed_hook in order to notify end_bio_extent_readpage to
do the rest. We don't call it directly because the generic one takes
an offset from end_bio_extent_readpage() to calculate the index in the
checksum array and inode's readpage_io_failed_hook doesn't offer that
offset.
Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ keep the const function attribute ]
Signed-off-by: David Sterba <dsterba@suse.com>
There is an include loop between netdevice.h, dsa.h, devlink.h because
of NETDEV_ALIGN, making it impossible to use devlink structures in
dsa.h.
Break this loop by taking dsa.h out of netdevice.h, add a forward
declaration of dsa_switch_tree and netdev_set_default_ethtool_ops()
function, which is what netdevice.h requires.
No longer having dsa.h in netdevice.h means the includes in dsa.h no
longer get included. This breaks a few other files which depend on
these includes. Add these directly in the affected file.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes missing increased max cost caused by a patch that we increased
cose of data segments in greedy algorithm.
Cc: <stable@vger.kernel.org> # v4.10+
Fixes: b9cd20619 "f2fs: node segment is prior to data segment selected victim"
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The inline directory verifiers should be called on the inode fork data,
which means after iformat_local on the read side, and prior to
ifork_flush on the write side. This makes the fork verifier more
consistent with the way buffer verifiers work -- i.e. they will operate
on the memory buffer that the code will be reading and writing directly.
Furthermore, revise the verifier function to return -EFSCORRUPTED so
that we don't flood the logs with corruption messages and assert
notices. This has been a particular problem with xfs/348, which
triggers the XFS_WANT_CORRUPTED_RETURN assertions, which halts the
kernel when CONFIG_XFS_DEBUG=y. Disk corruption isn't supposed to do
that, at least not in a verifier.
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
v2: get the inode d_ops the proper way
v3: describe the bug that this patch fixes; no code changes
Fix a filelayout GETDEVICEINFO call hang triggered from the LAYOUTGET
pnfs_layout_process where the GETDEVICEINFO call is waiting for a
session slot, and the LAYOUGET call is waiting for pnfs_layout_process
to complete before freeing the slot GETDEVICEINFO is waiting for..
This occurs in testing against the pynfs pNFS server where the
the on-wire reply highest_slotid and slot id are zero, and the
target high slot id is 8 (negotiated in CREATE_SESSION).
The internal fore channel slot table max_slotid, the maximum allowed
table slotid value, has been reduced via nfs41_set_max_slotid_locked
from 8 to 1. Thus there is one slot (slotid 0) available for use but
it has not been freed by LAYOUTGET proir to the GETDEVICEINFO request.
In order to ensure that layoutrecall callbacks are processed in the
correct order, nfs4_proc_layoutget processing needs to be finished
e.g. pnfs_layout_process) before giving up the slot that identifies
the layoutget (see referring_call_exists).
Move the filelayout_check_layout nfs4_find_get_device call outside of
the pnfs_layout_process call tree.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
In preparation for moving the filelayout getdeviceinfo call from
filelayout_alloc_lseg called by pnfs_process_layout
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Now that nfs_rename()'s d_move has moved within the RPC task's rpc_call_done
callback, rehashing new_dentry will actually rehash the old dentry's name
in nfs_rename(). d_move() is going to rehash the new dentry for us anyway,
so doing it again here is unnecessary.
Reported-by: Chuck Lever <chuck.lever@oracle.com>
Fixes: 920b4530fb ("NFS: nfs_rename() handle -ERESTARTSYS dentry left behind")
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Pull driver core fix from Greg KH:
"Here is a single kernfs fix for 4.11-rc4 that resolves a reported
issue.
It has been in linux-next with no reported issues"
* tag 'driver-core-4.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
kernfs: Check KERNFS_HAS_RELEASE before calling kernfs_release_file()
Pull ext4 fixes from Ted Ts'o:
"Fix a memory leak on an error path, and two races when modifying
inodes relating to the inline_data and metadata checksum features"
* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: fix two spelling nits
ext4: lock the xattr block before checksuming it
jbd2: don't leak memory if setting up journal fails
ext4: mark inode dirty after converting inline directory
Pull fscrypto fixes from Ted Ts'o:
"A code cleanup and bugfix for fs/crypto"
* tag 'fscrypt-for-linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/fscrypt:
fscrypt: eliminate ->prepare_context() operation
fscrypt: remove broken support for detecting keyring key revocation
This patch allow write data to normal file when writting
new checkpoint.
We relax three limitations for write_begin path:
1. data allocation
2. node allocation
3. variables in checkpoint
Signed-off-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch adds busy poll support to epoll. The implementation is meant to
be opportunistic in that it will take the NAPI ID from the last socket
that is added to the ready list that contains a valid NAPI ID and it will
use that for busy polling until the ready list goes empty. Once the ready
list goes empty the NAPI ID is reset and busy polling is disabled until a
new socket is added to the ready list.
In addition when we insert a new socket into the epoll we record the NAPI
ID and assume we are going to receive events on it. If that doesn't occur
it will be evicted as the active NAPI ID and we will resume normal
behavior.
An application can use SO_INCOMING_CPU or SO_REUSEPORT_ATTACH_C/EBPF socket
options to spread the incoming connections to specific worker threads
based on the incoming queue. This enables epoll for each worker thread
to have only sockets that receive packets from a single queue. So when an
application calls epoll_wait() and there are no events available to report,
busy polling is done on the associated queue to pull the packets.
Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch flips the logic we were using to determine if the busy polling
has timed out. The main motivation for this is that we will need to
support two different possible timeout values in the future and by
recording the start time rather than when we would want to end we can focus
on making the end_time specific to the task be it epoll or socket based
polling.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds to show the max number of volatile operations which are
conducting concurrently.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In below concurrent case, allocated nid can be loaded into free nid cache
and be allocated again.
Thread A Thread B
- f2fs_create
- f2fs_new_inode
- alloc_nid
- __insert_nid_to_list(ALLOC_NID_LIST)
- f2fs_balance_fs_bg
- build_free_nids
- __build_free_nids
- scan_nat_page
- add_free_nid
- __lookup_nat_cache
- f2fs_add_link
- init_inode_metadata
- new_inode_page
- new_node_page
- set_node_addr
- alloc_nid_done
- __remove_nid_from_list(ALLOC_NID_LIST)
- __insert_nid_to_list(FREE_NID_LIST)
This patch makes nat cache lookup and free nid list operation being atomical
to avoid this race condition.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Use set_page_private marcro instead of operte page struct
directly
Signed-off-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
When doing garbage collection, we try to record segment offset which
locates at next one of last victim, using it as the start offset in
next searching.
But in some corner cases, recorded offset may cross the end of main
segment area, it will cause incorrectly searching in dirty_segmap
bitmap. This patch adds modular operation to avoid this issue.
Reported-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Pull btrfs fixes from Chris Mason:
"Zygo tracked down a very old bug with inline compressed extents.
I didn't tag this one for stable because I want to do individual
tested backports. It's a little tricky and I'd rather do some extra
testing on it along the way"
* 'for-linus-4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
btrfs: add missing memset while reading compressed inline extents
Btrfs: fix regression in lock_delalloc_pages
btrfs: remove btrfs_err_str function from uapi/linux/btrfs.h
The latest gcc-7.0.1 snapshot warns about an unintialized variable use:
In file included from fs/reiserfs/lbalance.c:8:0:
fs/reiserfs/lbalance.c: In function 'leaf_item_bottle.isra.3':
fs/reiserfs/reiserfs.h:1279:13: error: '*((void *)&n_ih+8).v' may be used uninitialized in this function [-Werror=maybe-uninitialized]
v2->v = (v2->v & cpu_to_le64(15ULL << 60)) | cpu_to_le64(offset);
~~^~~
fs/reiserfs/reiserfs.h:1279:13: error: '*((void *)&n_ih+8).v' may be used uninitialized in this function [-Werror=maybe-uninitialized]
v2->v = (v2->v & cpu_to_le64(15ULL << 60)) | cpu_to_le64(offset);
This happens because the offset/type pair that is stored in
ih.key.u.k_offset_v2 is actually uninitialized when we call
set_le_ih_k_offset() and set_le_ih_k_type(). After we have called both,
all data is correct, but the first of the two reads uninitialized data
for the type field and writes it back before it gets overwritten.
This works around the warning by initializing the k_offset_v2 through
the slightly larger memcpy().
[JK: Remove now unused define and make it obvious we initialize the key]
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jan Kara <jack@suse.cz>
When block device is closed, we call inode_detach_wb() in __blkdev_put()
which sets inode->i_wb to NULL. That is contrary to expectations that
inode->i_wb stays valid once set during the whole inode's lifetime and
leads to oops in wb_get() in locked_inode_to_wb_and_lock_list() because
inode_to_wb() returned NULL.
The reason why we called inode_detach_wb() is not valid anymore though.
BDI is guaranteed to stay along until we call bdi_put() from
bdev_evict_inode() so we can postpone calling inode_detach_wb() to that
moment.
Also add a warning to catch if someone uses inode_detach_wb() in a
dangerous way.
Reported-by: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
When disk->fops->open() in __blkdev_get() returns -ERESTARTSYS, we
restart the process of opening the block device. However we forget to
switch bdev->bd_bdi back to noop_backing_dev_info and as a result bdev
inode will be pointing to a stale bdi. Fix the problem by setting
bdev->bd_bdi later when __blkdev_get() is already guaranteed to succeed.
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
The memory size of f2fs_stat_info also should be calculated.
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
After filemap_write_and_wait_range fail, the FI_ATOMIC_FILE flags is removed,
so that f2fs should not increase the stat of atomic_write.
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The crc_offset towards or beyond the end of block is wrong,
sanity check it.
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
As discuss with Jaegeuk and Chao,
"Once checkpoint is done, f2fs doesn't need to update there-in filename at all."
The disk-level filename is used only one case,
1. create a file A under a dir
2. sync A
3. godown
4. umount
5. mount (roll_forward)
Only the rename/cross_rename changes the filename, if it happens,
a. between step 1 and 2, the sync A will caused checkpoint, so that,
the roll_forward at step 5 never happens.
b. after step 2, the roll_forward happens, file A will roll forward
to the result as after step 1.
So that, any updating the disk filename is useless, just cleanup it.
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
free_nid_bitmap and free_nid_count in update_free_nid_bitmap should be
updated atomically, use nid_list_lock cover them to avoid race in
concurrent scenario.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Reviewed-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
For f2fs_read_data_pages, the f2fs_mpage_readpages gets "page == NULL",
so that, the prefetchw(&page->flags) is operated on NULL.
Fixes: f1e8866016 ("f2fs: expose f2fs_mpage_readpages")
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Clear FI_DATA_EXIST flag atomically in truncate_inline_inode, and
the return value from truncate_inline_inode isn't used, remove it.
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
It's needless of mnt_want_write_file for arguments checking.
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The inode_newsize_ok is better than only checking the maxbytes,
eg. the rlimit etc.
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If move file range return error, the data copied to user-space is duplicate.
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>