[BUG]
If run_delalloc_range() returns error and there is already some ordered
extents created, btrfs will be hanged with the following backtrace:
Call Trace:
__schedule+0x2d4/0xae0
schedule+0x3d/0x90
btrfs_start_ordered_extent+0x160/0x200 [btrfs]
? wake_atomic_t_function+0x60/0x60
btrfs_run_ordered_extent_work+0x25/0x40 [btrfs]
btrfs_scrubparity_helper+0x1c1/0x620 [btrfs]
btrfs_flush_delalloc_helper+0xe/0x10 [btrfs]
process_one_work+0x2af/0x720
? process_one_work+0x22b/0x720
worker_thread+0x4b/0x4f0
kthread+0x10f/0x150
? process_one_work+0x720/0x720
? kthread_create_on_node+0x40/0x40
ret_from_fork+0x2e/0x40
[CAUSE]
|<------------------ delalloc range --------------------------->|
| OE 1 | OE 2 | ... | OE n |
|<>| |<---------- cleanup range --------->|
||
\_=> First page handled by end_extent_writepage() in __extent_writepage()
The problem is caused by error handler of run_delalloc_range(), which
doesn't handle any created ordered extents, leaving them waiting on
btrfs_finish_ordered_io() to finish.
However after run_delalloc_range() returns error, __extent_writepage()
won't submit bio, so btrfs_writepage_end_io_hook() won't be triggered
except the first page, and btrfs_finish_ordered_io() won't be triggered
for created ordered extents either.
So OE 2~n will hang forever, and if OE 1 is larger than one page, it
will also hang.
[FIX]
Introduce btrfs_cleanup_ordered_extents() function to cleanup created
ordered extents and finish them manually.
The function is based on existing
btrfs_endio_direct_write_update_ordered() function, and modify it to
act just like btrfs_writepage_endio_hook() but handles specified range
other than one page.
After fix, delalloc error will be handled like:
|<------------------ delalloc range --------------------------->|
| OE 1 | OE 2 | ... | OE n |
|<>|<-------- ----------->|<------ old error handler --------->|
|| ||
|| \_=> Cleaned up by cleanup_ordered_extents()
\_=> First page handled by end_extent_writepage() in __extent_writepage()
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
[BUG]
When btrfs_reloc_clone_csum() reports error, it can underflow metadata
and leads to kernel assertion on outstanding extents in
run_delalloc_nocow() and cow_file_range().
BTRFS info (device vdb5): relocating block group 12582912 flags data
BTRFS info (device vdb5): found 1 extents
assertion failed: inode->outstanding_extents >= num_extents, file: fs/btrfs//extent-tree.c, line: 5858
Currently, due to another bug blocking ordered extents, the bug is only
reproducible under certain block group layout and using error injection.
a) Create one data block group with one 4K extent in it.
To avoid the bug that hangs btrfs due to ordered extent which never
finishes
b) Make btrfs_reloc_clone_csum() always fail
c) Relocate that block group
[CAUSE]
run_delalloc_nocow() and cow_file_range() handles error from
btrfs_reloc_clone_csum() wrongly:
(The ascii chart shows a more generic case of this bug other than the
bug mentioned above)
|<------------------ delalloc range --------------------------->|
| OE 1 | OE 2 | ... | OE n |
|<----------- cleanup range --------------->|
|<----------- ----------->|
\/
btrfs_finish_ordered_io() range
So error handler, which calls extent_clear_unlock_delalloc() with
EXTENT_DELALLOC and EXTENT_DO_ACCOUNT bits, and btrfs_finish_ordered_io()
will both cover OE n, and free its metadata, causing metadata under flow.
[Fix]
The fix is to ensure after calling btrfs_add_ordered_extent(), we only
call error handler after increasing the iteration offset, so that
cleanup range won't cover any created ordered extent.
|<------------------ delalloc range --------------------------->|
| OE 1 | OE 2 | ... | OE n |
|<----------- ----------->|<---------- cleanup range --------->|
\/
btrfs_finish_ordered_io() range
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
The optimization for opaque dir create was wrongly being applied
also to non-dir create.
Fixes: 97c684cc91 ("ovl: create directories inside merged parent opaque")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: <stable@vger.kernel.org> # v4.10
A null check followed by a return is being performed already, so block
is always non-null at the second check on block, hence we can remove
this redundant null-check (Detected by PVS-Studio). Also re-work
comment to clean up a check-patch warning.
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
consider the sequence of commands:
mkdir -p /import/nfs /import/bind /import/etc
mount --bind / /import/bind
mount --make-private /import/bind
mount --bind /import/etc /import/bind/etc
exportfs -o rw,no_root_squash,crossmnt,async,no_subtree_check localhost:/
mount -o vers=4 localhost:/ /import/nfs
ls -l /import/nfs/etc
You would not expect this to report a stale file handle.
Yet it does.
The manipulations under /import/bind cause the dentry for
/etc to get the DCACHE_MOUNTED flag set, even though nothing
is mounted on /etc. This causes nfsd to call
nfsd_cross_mnt() even though there is no mountpoint. So an
upcall to mountd for "/etc" is performed.
The 'crossmnt' flag on the export of / causes mountd to
report that /etc is exported as it is a descendant of /. It
assumes the kernel wouldn't ask about something that wasn't
a mountpoint. The filehandle returned identifies the
filesystem and the inode number of /etc.
When this filehandle is presented to rpc.mountd, via
"nfsd.fh", the inode cannot be found associated with any
name in /etc/exports, or with any mountpoint listed by
getmntent(). So rpc.mountd says the filehandle doesn't
exist. Hence ESTALE.
This is fixed by teaching nfsd not to trust DCACHE_MOUNTED
too much. It is just a hint, not a guarantee.
Change nfsd_mountpoint() to return '1' for a certain mountpoint,
'2' for a possible mountpoint, and 0 otherwise.
Then change nfsd_crossmnt() to check if follow_down()
actually found a mountpount and, if not, to avoid performing
a lookup if the location is not known to certainly require
an export-point.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
kstrdup() already checks for NULL.
(Brought to our attention by Jason Yann noticing (from sparse output)
that it should have been declared static.)
Signed-off-by: NeilBrown <neilb@suse.com>
Reported-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
A client can append random data to the end of an NFSv2 or NFSv3 RPC call
without our complaining; we'll just stop parsing at the end of the
expected data and ignore the rest.
Encoded arguments and replies are stored together in an array of pages,
and if a call is too large it could leave inadequate space for the
reply. This is normally OK because NFS RPC's typically have either
short arguments and long replies (like READ) or long arguments and short
replies (like WRITE). But a client that sends an incorrectly long reply
can violate those assumptions. This was observed to cause crashes.
So, insist that the argument not be any longer than we expect.
Also, several operations increment rq_next_page in the decode routine
before checking the argument size, which can leave rq_next_page pointing
well past the end of the page array, causing trouble later in
svc_free_pages.
As followup we may also want to rewrite the encoding routines to check
more carefully that they aren't running off the end of the page array.
Reported-by: Tuomas Haanpää <thaan@synopsys.com>
Reported-by: Ari Kauppi <ari@synopsys.com>
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
This patch start to enable 4K granularity small discard by default
when realtime discard is on, so, in seriously fragmented space,
small size discard can be issued in time to avoid useless storage
space occupying of invalid filesystem's data, then performance of
flash storage can be recovered.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
It's better to delay awaking discard thread while queuing discard commands
in checkpoint, it will help to give more chances for merging big and small
discard.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch seperate nat page read io from nat_tree_lock.
-lock_page
-get_node_info()
-current_nat_addr
...... -> write_checkpoint
-get_meta_page
Because we lock node page, we can make sure no other threads
modify this nid concurrently. So we just obtain current_nat_addr
under nat_tree_lock, node info is always same in both nat pack.
Signed-off-by: Yunlei He <heyunlei@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Commit 88c5c13a50 (f2fs: fix multiple f2fs_add_link() calls having
same name) does not cover the scenario where inline dentry is enabled.
In that case, F2FS_I(dir)->task will be NULL, and __f2fs_add_link will
lookup dentries one more time.
This patch fixes it by moving the assigment of current task to a upper
level to cover both normal and inline dentry.
Cc: <stable@vger.kernel.org>
Fixes: 88c5c13a50 (f2fs: fix multiple f2fs_add_link() calls having same name)
Signed-off-by: Sheng Yong <shengyong1@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The NFSv2/v3 code does not systematically check whether we decode past
the end of the buffer. This generally appears to be harmless, but there
are a few places where we do arithmetic on the pointers involved and
don't account for the possibility that a length could be negative. Add
checks to catch these.
Reported-by: Tuomas Haanpää <thaan@synopsys.com>
Reported-by: Ari Kauppi <ari@synopsys.com>
Reviewed-by: NeilBrown <neilb@suse.com>
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
A client can append random data to the end of an NFSv2 or NFSv3 RPC call
without our complaining; we'll just stop parsing at the end of the
expected data and ignore the rest.
Encoded arguments and replies are stored together in an array of pages,
and if a call is too large it could leave inadequate space for the
reply. This is normally OK because NFS RPC's typically have either
short arguments and long replies (like READ) or long arguments and short
replies (like WRITE). But a client that sends an incorrectly long reply
can violate those assumptions. This was observed to cause crashes.
Also, several operations increment rq_next_page in the decode routine
before checking the argument size, which can leave rq_next_page pointing
well past the end of the page array, causing trouble later in
svc_free_pages.
So, following a suggestion from Neil Brown, add a central check to
enforce our expectation that no NFSv2/v3 call has both a large call and
a large reply.
As followup we may also want to rewrite the encoding routines to check
more carefully that they aren't running off the end of the page array.
We may also consider rejecting calls that have any extra garbage
appended. That would be safer, and within our rights by spec, but given
the age of our server and the NFS protocol, and the fact that we've
never enforced this before, we may need to balance that against the
possibility of breaking some oddball client.
Reported-by: Tuomas Haanpää <thaan@synopsys.com>
Reported-by: Ari Kauppi <ari@synopsys.com>
Cc: stable@vger.kernel.org
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Now that all the producers and consumers of dax interfaces have been
converted to using dax_operations on a dax_device, remove the block
device direct_access enabling.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Now that a dax_device is plumbed through all dax-capable drivers we can
switch from block_device_operations to dax_operations for invoking
->direct_access.
This also lets us kill off some usages of struct blk_dax_ctl on the way
to its eventual removal.
Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
commit d1a5f2b4d8 ("block: use DAX for partition table reads") was
part of a stalled effort to allow dax mappings of block devices. Since
then the device-dax mechanism has filled the role of dax-mapping static
device ranges.
Now that we are moving ->direct_access() from a block_device operation
to a dax_inode operation we would need block devices to map and carry
their own dax_inode reference.
Unless / until we decide to revive dax mapping of raw block devices
through the dax_inode scheme, there is no need to carry
read_dax_sector(). Its removal in turn allows for the removal of
bdev_direct_access() and should have been included in commit
2237570168 ("block_dev: remove DAX leftovers").
Cc: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
In preparation for converting fs/dax.c to use dax_direct_access()
instead of bdev_direct_access(), add the plumbing to retrieve the
dax_device associated with a given block_device.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
If the server has already returned a fatal write error that the user
has not yet received on this file, then don't write back the other pages.
Instead, act as if they have been sent, and have returned with the same
error.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
The assumption should be that if the caller returns PNFS_ATTEMPTED, then hdr
has been consumed, and so we should not be testing hdr->task.tk_status.
If the caller returns PNFS_TRY_AGAIN, then we need to recoalesce and
free hdr.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
ceph_set_acl() calls __ceph_setattr() if the setacl operation needs
to modify inode's i_mode. __ceph_setattr() updates inode's i_mode,
then calls posix_acl_chmod().
The problem is that __ceph_setattr() calls posix_acl_chmod() before
sending the setattr request. The get_acl() call in posix_acl_chmod()
can trigger a getxattr request. The reply of the getxattr request
can restore inode's i_mode to its old value. The set_acl() call in
posix_acl_chmod() sees old value of inode's i_mode, so it calls
__ceph_setattr() again.
Cc: stable@vger.kernel.org # needs backporting for < 4.9
Link: http://tracker.ceph.com/issues/19688
Reported-by: Jerry Lee <leisurelysw24@gmail.com>
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Tested-by: Luis Henriques <lhenriques@suse.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Use ASSERTs on the log intent item refcounts so that we fail noisily if
anyone tries to double-free the item.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
The quotacheck error handling of the delwri buffer list assumes the
resident buffers are locked and doesn't clear the _XBF_DELWRI_Q flag
on the buffers that are dequeued. This can lead to assert failures
on buffer release and possibly other locking problems.
Move this code to a delwri queue cancel helper function to
encapsulate the logic required to properly release buffers from a
delwri queue. Update the helper to clear the delwri queue flag and
call it from quotacheck.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
xfs_iflush_done uses an on-stack variable length array to pass the log
items to be deleted to xfs_trans_ail_delete_bulk. On-stack VLAs are a
nasty gcc extension that can lead to unbounded stack allocations, but
fortunately we can easily avoid them by simply open coding
xfs_trans_ail_delete_bulk in xfs_iflush_done, which is the only caller
of it except for the single-item xfs_trans_ail_delete.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
At the end of a getfsmap call, we will set FMR_OF_LAST in the last
struct fsmap that was handed in by userspace if we've truly run out of
space mapping record (as opposed to simply running out of space in the
user array). Unfortunately, fmh_entries is the wrong check for whether
or not we've filled out anything in the user array because the ioctl
provides that fmh_count==0 sets fmh_entries without filling out the user
array. Therefore we end up writing things into user memory areas that we
weren't given, and kaboom.
Since Christoph amended the getfsmap structure to track the number of
fsmap entries we've actually filled out, use that as part of deciding if
we have to set the OF_LAST flag.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
By passing the whole fsmap_head structure and an index we can get the
user point annotations right for the embedded variable sized array
in struct fsmap_head.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
[darrick: change idx to unsigned int]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
At least if we want to be able to recognize the pattern. Add a missing
byte swap to the corruption injection case in xlog_sync.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
XFS only supports the unwritten extent bit in the data fork, and only if
the file system has a version 5 superblock or the unwritten extent
feature bit.
We currently have two routines that validate the invariant:
xfs_check_nostate_extents which return -EFSCORRUPTED when it's not met,
and xfs_validate_extent that triggers and assert in debug build.
Both of them iterate over all extents of an inode fork when called,
which isn't very efficient.
This patch instead adds a new helper that verifies the invariant one
extent at a time, and calls it from the places where we iterate over
all extents to converted them from or two the in-memory format. The
callers then return -EFSCORRUPTED when reading invalid extents from
disk, or trigger an assert when writing them to disk.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
We only ever use the normal and unwritten states. And the actual
ondisk format (this enum isn't despite being in xfs_format.h) only
has space for the unwritten bit anyway.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
On some architectures do_div does the pointer compare
trick to make sure that we've sent it an unsigned 64-bit
number. (Why unsigned? I don't know.)
Fix up the few places that squawk about this; in
xfs_bmap_wants_extents() we just used a bare int64_t so change
that to unsigned.
In xfs_adjust_extent_unmap_boundaries() all we wanted was the
mod, and we have an xfs-specific function to handle that w/o
side effects, which includes proper casting for do_div.
In xfs_daddr_to_ag[b]no, we were using the wrong type anyway;
XFS_BB_TO_FSBT returns a block in the filesystem, so use
xfs_rfsblock_t not xfs_daddr_t, and gain the unsignedness
from that type as a bonus.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The kbuild test robot caught this; in debug code we have another
caller of do_div with a 32-bit dividend (j) which is caught now
that we are using the kernel-supplied do_div.
None of the values used here are 64-bit; just use simple division.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The trailing newlines wil lead to extra newlines in the trace file
which looks like the following output, so remove them.
>kworker/4:1H-1508 [004] .... 47879.101608: xfs_discard_extent: dev 8:0
>
>kworker/u16:2-238 [004] .... 47879.101725: xfs_extent_busy_clear: dev 8:0
Signed-off-by: Hou Tao <houtao1@huawei.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
[darrick: fix the getfsmap tracepoints too]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Directory block readahead uses a complex iteration mechanism to map
between high-level directory blocks and underlying physical extents.
This mechanism attempts to traverse the higher-level dir blocks in a
manner that handles multi-fsb directory blocks and simultaneously
maintains a reference to the corresponding physical blocks.
This logic doesn't handle certain (discontiguous) physical extent
layouts correctly with multi-fsb directory blocks. For example,
consider the case of a 4k FSB filesystem with a 2 FSB (8k) directory
block size and a directory with the following extent layout:
EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL
0: [0..7]: 88..95 0 (88..95) 8
1: [8..15]: 80..87 0 (80..87) 8
2: [16..39]: 168..191 0 (168..191) 24
3: [40..63]: 5242952..5242975 1 (72..95) 24
Directory block 0 spans physical extents 0 and 1, dirblk 1 lies
entirely within extent 2 and dirblk 2 spans extents 2 and 3. Because
extent 2 is larger than the directory block size, the readahead code
erroneously assumes the block is contiguous and issues a readahead
based on the physical mapping of the first fsb of the dirblk. This
results in read verifier failure and a spurious corruption or crc
failure, depending on the filesystem format.
Further, the subsequent readahead code responsible for walking
through the physical table doesn't correctly advance the physical
block reference for dirblk 2. Instead of advancing two physical
filesystem blocks, the first iteration of the loop advances 1 block
(correctly), but the subsequent iteration advances 2 more physical
blocks because the next physical extent (extent 3, above) happens to
cover more than dirblk 2. At this point, the higher-level directory
block walking is completely off the rails of the actual physical
layout of the directory for the respective mapping table.
Update the contiguous dirblock logic to consider the current offset
in the physical extent to avoid issuing directory readahead to
unrelated blocks. Also, update the mapping table advancing code to
consider the current offset within the current dirblock to avoid
advancing the mapping reference too far beyond the dirblock.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Carlos had a case where "find" seemed to start spinning
forever and never return.
This was on a filesystem with non-default multi-fsb (8k)
directory blocks, and a fragmented directory with extents
like this:
0:[0,133646,2,0]
1:[2,195888,1,0]
2:[3,195890,1,0]
3:[4,195892,1,0]
4:[5,195894,1,0]
5:[6,195896,1,0]
6:[7,195898,1,0]
7:[8,195900,1,0]
8:[9,195902,1,0]
9:[10,195908,1,0]
10:[11,195910,1,0]
11:[12,195912,1,0]
12:[13,195914,1,0]
...
i.e. the first extent is a contiguous 2-fsb dir block, but
after that it is fragmented into 1 block extents.
At the top of the readdir path, we allocate a mapping array
which (for this filesystem geometry) can hold 10 extents; see
the assignment to map_info->map_size. During readdir, we are
therefore able to map extents 0 through 9 above into the array
for readahead purposes. If we count by 2, we see that the last
mapped index (9) is the first block of a 2-fsb directory block.
At the end of xfs_dir2_leaf_readbuf() we have 2 loops to fill
more readahead; the outer loop assumes one full dir block is
processed each loop iteration, and an inner loop that ensures
that this is so by advancing to the next extent until a full
directory block is mapped.
The problem is that this inner loop may step past the last
extent in the mapping array as it tries to reach the end of
the directory block. This will read garbage for the extent
length, and as a result the loop control variable 'j' may
become corrupted and never fail the loop conditional.
The number of valid mappings we have in our array is stored
in map->map_valid, so stop this inner loop based on that limit.
There is an ASSERT at the top of the outer loop for this
same condition, but we never made it out of the inner loop,
so the ASSERT never fired.
Huge appreciation for Carlos for debugging and isolating
the problem.
Debugged-and-analyzed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Tested-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
On a ppc64 machine executing overlayfs/019 with xfs as the lower and
upper filesystem causes the following call trace,
WARNING: CPU: 2 PID: 8034 at /root/repos/linux/fs/iomap.c:765 .iomap_dio_actor+0xcc/0x420
Modules linked in:
CPU: 2 PID: 8034 Comm: fsstress Tainted: G L 4.11.0-rc5-next-20170405 #100
task: c000000631314880 task.stack: c0000003915d4000
NIP: c00000000035a72c LR: c00000000035a6f4 CTR: c00000000035a660
REGS: c0000003915d7570 TRAP: 0700 Tainted: G L (4.11.0-rc5-next-20170405)
MSR: 800000000282b032 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI>
CR: 24004284 XER: 00000000
CFAR: c0000000006f7190 SOFTE: 1
GPR00: c00000000035a6f4 c0000003915d77f0 c0000000015a3f00 000000007c22f600
GPR04: 000000000022d000 0000000000002600 c0000003b2d56360 c0000003915d7960
GPR08: c0000003915d7cd0 0000000000000002 0000000000002600 c000000000521cc0
GPR12: 0000000024004284 c00000000fd80a00 000000004b04ae64 ffffffffffffffff
GPR16: 000000001000ca70 0000000000000000 c0000003b2d56380 c00000000153d2b8
GPR20: 0000000000000010 c0000003bc87bac8 0000000000223000 000000000022f5ff
GPR24: c0000003b2d56360 000000000000000c 0000000000002600 000000000022d000
GPR28: 0000000000000000 c0000003915d7960 c0000003b2d56360 00000000000001ff
NIP [c00000000035a72c] .iomap_dio_actor+0xcc/0x420
LR [c00000000035a6f4] .iomap_dio_actor+0x94/0x420
Call Trace:
[c0000003915d77f0] [c00000000035a6f4] .iomap_dio_actor+0x94/0x420 (unreliable)
[c0000003915d78f0] [c00000000035b9f4] .iomap_apply+0xf4/0x1f0
[c0000003915d79d0] [c00000000035c320] .iomap_dio_rw+0x230/0x420
[c0000003915d7ae0] [c000000000512a14] .xfs_file_dio_aio_read+0x84/0x160
[c0000003915d7b80] [c000000000512d24] .xfs_file_read_iter+0x104/0x130
[c0000003915d7c10] [c0000000002d6234] .__vfs_read+0x114/0x1a0
[c0000003915d7cf0] [c0000000002d7a8c] .vfs_read+0xac/0x1a0
[c0000003915d7d90] [c0000000002d96b8] .SyS_read+0x58/0x100
[c0000003915d7e30] [c00000000000b8e0] system_call+0x38/0xfc
Instruction dump:
78630020 7f831b78 7ffc07b4 7c7ce039 40820360 a13d0018 2f890003 419e0288
2f890004 419e00a0 2f890001 419e02a8 <0fe00000> 3b80fffb 38210100 7f83e378
The above problem can also be recreated on a regular xfs filesystem
using the command,
$ fsstress -d /mnt -l 1000 -n 1000 -p 1000
The reason for the call trace is,
1. When 'reserving' blocks for delayed allocation , XFS reserves more
blocks (i.e. past file's current EOF) than required. This is done
because XFS assumes that userspace might write more data and hence
'reserving' more blocks might lead to the file's new data being
stored contiguously on disk.
2. The in-memory 'struct xfs_bmbt_irec' mapping the file's last extent would
then cover the prealloc-ed EOF blocks in addition to the regular blocks.
3. When flushing the dirty blocks to disk, we only flush data till the
file's EOF. But before writing out the dirty data, we allocate blocks
on the disk for holding the file's new data. This allocation includes
the blocks that are part of the 'prealloc EOF blocks'.
4. Later, when the last reference to the inode is being closed, XFS frees the
unused 'prealloc EOF blocks' in xfs_inactive().
In step 3 above, When allocating space on disk for the delayed allocation
range, the space allocator might sometimes allocate less blocks than
required. If such an allocation ends right at the current EOF of the
file, We will not be able to clear the "delayed allocation" flag for the
'prealloc EOF blocks', since we won't have dirty buffer heads associated
with that range of the file.
In such a situation if a Direct I/O read operation is performed on file
range [X, Y] (where X < EOF and Y > EOF), we flush dirty data in the
range [X, Y] and invalidate page cache for that range (Refer to
iomap_dio_rw()). Later for performing the Direct I/O read, XFS obtains
the extent items (which are still cached in memory) for the file
range. When doing so we are not supposed to get an extent item with
IOMAP_DELALLOC flag set, since the previous "flush" operation should
have converted any delayed allocation data in the range [X, Y]. Hence we
end up hitting a WARN_ON_ONCE(1) statement in iomap_dio_actor().
This commit fixes the bug by preventing the read operation from going
beyond iomap_dio->i_size.
Reported-by: Santhosh G <santhog4@linux.vnet.ibm.com>
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Now that reflink operations don't set the firstblock value we don't
need the workarounds for non-NULL firstblock values without a prior
allocation.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The main thing that xfs_bmap_remap_alloc does is fixing the AGFL, similar
to what we do in the space allocator. But the reflink code doesn't touch
the allocation btree unlike the normal space allocator, so we couldn't
care less about the state of the AGFL.
So remove xfs_bmap_remap_alloc and just handle the di_nblocks update in
the caller.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Add a new helper to be used for reflink extent list additions instead of
funneling them through xfs_bmapi_write and overloading the firstblock
member in struct xfs_bmalloca and struct xfs_alloc_args.
With some small changes to xfs_bmap_remap_alloc this also means we do
not need a xfs_bmalloca structure for this case at all.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
For the reflink case we'd much rather pass the required arguments than
faking up a struct xfs_bmalloca.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
We never do COW operations for the attr fork, so don't pretend we handle
them.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
bno should be a xfs_fsblock_t, which is 64-bit wides instead of a
xfs_aglock_t, which truncates the value to 32 bits.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
If we have a layout segment cached in pgio->pg_lseg, we should check it
for validity before reusing it in a new RPC request. Otherwise, if we
recoalesce, we can end up looping forever.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
When delivering an event to userspace for a file on an NFS share,
if the file is deleted on server side before user reads the event,
user will not get the event.
If the event queue contained several events, the stale event is
quietly dropped and read() returns to user with events read so far
in the buffer.
If the event queue contains a single stale event or if the stale
event is a permission event, read() returns to user with the kernel
internal error code 518 (EOPENSTALE), which is not a POSIX error code.
Check the internal return value -EOPENSTALE in fanotify_read(), just
the same as it is checked in path_openat() and drop the event in the
cases that it is not already dropped.
This is a reproducer from Marko Rauhamaa:
Just take the example program listed under "man fanotify" ("fantest")
and follow these steps:
==============================================================
NFS Server NFS Client(1) NFS Client(2)
==============================================================
# echo foo >/nfsshare/bar.txt
# cat /nfsshare/bar.txt
foo
# ./fantest /nfsshare
Press enter key to terminate.
Listening for events.
# rm -f /nfsshare/bar.txt
# cat /nfsshare/bar.txt
read: Unknown error 518
cat: /nfsshare/bar.txt: Operation not permitted
==============================================================
where NFS Client (1) and (2) are two terminal sessions on a single NFS
Client machine.
Reported-by: Marko Rauhamaa <marko.rauhamaa@f-secure.com>
Tested-by: Marko Rauhamaa <marko.rauhamaa@f-secure.com>
Cc: <linux-api@vger.kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Async request may be throttled in block layer, so page for async may keep WRITE_BACK
for a long time.
For encrytped inode, we need wait on page writeback no matter if the device supports
BDI_CAP_STABLE_WRITES. This may result in a higher waiting page writeback time for
async encrypted inode page.
This patch skips IPU for encrypted inode's updating write.
Signed-off-by: Hou Pengyang <houpengyang@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>