This patch checks the parameter range passed by ioctl to void that range
exceeds the max_file_blocks limit.
Signed-off-by: Sheng Yong <shengyong1@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch add a function to punch discard command if one segment
reuse before discard. Split this segment from multi-segments discard
range, and discard the left bigger range.
Signed-off-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The nat entry is listed from the set list for freeing,
it's duplicate to do radix tree lookup again.
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
[Jaegeuk Kim: remove unnecessary f2fs_bug_on]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The root device's issue flush trace is missing,
add it and tracing the result from submit.
Fixes d50aaeec90 ("f2fs: show actual device info in tracepoints")
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Now f2fs only supports volatile writes for journal db regular file.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The previous one was not a proper location to inject an error, since there
is no point to get errors. Instead, we can emulate EIO during truncation,
and the below logic should handle it correctly.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch fixes missing le16 conversion, reported by kbuild test robot.
Fixes: 5f35a2cd5 ("f2fs: Don't update the xattr data that same as the exist")
Reviewed-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If f2fs_new_inode() is failed, the bad inode will invalidate 0'th node page
during f2fs_evict_inode(), which doesn't need to do.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch fix a error return value in truncate_partial_data_page
Signed-off-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Credit for this patch goes is shared with Dan Williams [1]. I've
taken things one step further to make the helper function more
useful and clean up calling code.
There's a common pattern in the kernel whereby a struct cdev is placed
in a structure along side a struct device which manages the life-cycle
of both. In the naive approach, the reference counting is broken and
the struct device can free everything before the chardev code
is entirely released.
Many developers have solved this problem by linking the internal kobjs
in this fashion:
cdev.kobj.parent = &parent_dev.kobj;
The cdev code explicitly gets and puts a reference to it's kobj parent.
So this seems like it was intended to be used this way. Dmitrty Torokhov
first put this in place in 2012 with this commit:
2f0157f char_dev: pin parent kobject
and the first instance of the fix was then done in the input subsystem
in the following commit:
4a215aa Input: fix use-after-free introduced with dynamic minor changes
Subsequently over the years, however, this issue seems to have tripped
up multiple developers independently. For example, see these commits:
0d5b7da iio: Prevent race between IIO chardev opening and IIO device
(by Lars-Peter Clausen in 2013)
ba0ef85 tpm: Fix initialization of the cdev
(by Jason Gunthorpe in 2015)
5b28dde [media] media: fix use-after-free in cdev_put() when app exits
after driver unbind
(by Shauh Khan in 2016)
This technique is similarly done in at least 15 places within the kernel
and probably should have been done so in another, at least, 5 places.
The kobj line also looks very suspect in that one would not expect
drivers to have to mess with kobject internals in this way.
Even highly experienced kernel developers can be surprised by this
code, as seen in [2].
To help alleviate this situation, and hopefully prevent future
wasted effort on this problem, this patch introduces a helper function
to register a char device along with its parent struct device.
This creates a more regular API for tying a char device to its parent
without the developer having to set members in the underlying kobject.
This patch introduce cdev_device_add and cdev_device_del which
replaces a common pattern including setting the kobj parent, calling
cdev_add and then calling device_add. It also introduces cdev_set_parent
for the few cases that set the kobject parent without using device_add.
[1] https://lkml.org/lkml/2017/2/13/700
[2] https://lkml.org/lkml/2017/2/10/370
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Hans Verkuil <hans.verkuil@cisco.com>
Reviewed-by: Alexandre Belloni <alexandre.belloni@free-electrons.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This patch adds to account free nids for each NAT blocks, and while
scanning all free nid bitmap, do check count and skip lookuping in
full NAT block.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This is to avoid build warning reported by kbuild test robot.
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch fixes that SSR can overwrite previous warm node block consisting of
a node chain since the last checkpoint.
Fixes: 5b6c6be2d8 ("f2fs: use SSR for warm node as well")
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Pull NFS client fixes from Anna Schumaker:
"We have a handful of stable fixes to fix kernel warnings and other
bugs that have been around for a while. We've also found a few other
reference counting bugs and memory leaks since the initial 4.11 pull.
Stable Bugfixes:
- Fix decrementing nrequests in NFS v4.2 COPY to fix kernel warnings
- Prevent a double free in async nfs4_exchange_id()
- Squelch a kbuild sparse complaint for xprtrdma
Other Bugfixes:
- Fix a typo (NFS_ATTR_FATTR_GROUP_NAME) that causes a memory leak
- Fix a reference leak that causes kernel warnings
- Make nfs4_cb_sv_ops static to fix a sparse warning
- Respect a server's max size in CREATE_SESSION
- Handle errors from nfs4_pnfs_ds_connect
- Flexfiles layout shouldn't mark devices as unavailable"
* tag 'nfs-for-4.11-2' of git://git.linux-nfs.org/projects/anna/linux-nfs:
pNFS/flexfiles: never nfs4_mark_deviceid_unavailable
pNFS: return status from nfs4_pnfs_ds_connect
NFSv4.1 respect server's max size in CREATE_SESSION
NFS prevent double free in async nfs4_exchange_id
nfs: make nfs4_cb_sv_ops static
xprtrdma: Squelch kbuild sparse complaint
NFS: fix the fault nrequests decreasing for nfs_inode COPY
NFSv4: fix a reference leak caused WARNING messages
nfs4: fix a typo of NFS_ATTR_FATTR_GROUP_NAME
This is a story about 4 distinct (and very old) btrfs bugs.
Commit c8b978188c ("Btrfs: Add zlib compression support") added
three data corruption bugs for inline extents (bugs #1-3).
Commit 93c82d5750 ("Btrfs: zero page past end of inline file items")
fixed bug #1: uncompressed inline extents followed by a hole and more
extents could get non-zero data in the hole as they were read. The fix
was to add a memset in btrfs_get_extent to zero out the hole.
Commit 166ae5a418 ("btrfs: fix inline compressed read err corruption")
fixed bug #2: compressed inline extents which contained non-zero bytes
might be replaced with zero bytes in some cases. This patch removed an
unhelpful memset from uncompress_inline, but the case where memset is
required was missed.
There is also a memset in the decompression code, but this only covers
decompressed data that is shorter than the ram_bytes from the extent
ref record. This memset doesn't cover the region between the end of the
decompressed data and the end of the page. It has also moved around a
few times over the years, so there's no single patch to refer to.
This patch fixes bug #3: compressed inline extents followed by a hole
and more extents could get non-zero data in the hole as they were read
(i.e. bug #3 is the same as bug #1, but s/uncompressed/compressed/).
The fix is the same: zero out the hole in the compressed case too,
by putting a memset back in uncompress_inline, but this time with
correct parameters.
The last and oldest bug, bug #0, is the cause of the offending inline
extent/hole/extent pattern. Bug #0 is a subtle and mostly-harmless quirk
of behavior somewhere in the btrfs write code. In a few special cases,
an inline extent and hole are allowed to persist where they normally
would be combined with later extents in the file.
A fast reproducer for bug #0 is presented below. A few offending extents
are also created in the wild during large rsync transfers with the -S
flag. A Linux kernel build (git checkout; make allyesconfig; make -j8)
will produce a handful of offending files as well. Once an offending
file is created, it can present different content to userspace each
time it is read.
Bug #0 is at least 4 and possibly 8 years old. I verified every vX.Y
kernel back to v3.5 has this behavior. There are fossil records of this
bug's effects in commits all the way back to v2.6.32. I have no reason
to believe bug #0 wasn't present at the beginning of btrfs compression
support in v2.6.29, but I can't easily test kernels that old to be sure.
It is not clear whether bug #0 is worth fixing. A fix would likely
require injecting extra reads into currently write-only paths, and most
of the exceptional cases caused by bug #0 are already handled now.
Whether we like them or not, bug #0's inline extents followed by holes
are part of the btrfs de-facto disk format now, and we need to be able
to read them without data corruption or an infoleak. So enough about
bug #0, let's get back to bug #3 (this patch).
An example of on-disk structure leading to data corruption found in
the wild:
item 61 key (606890 INODE_ITEM 0) itemoff 9662 itemsize 160
inode generation 50 transid 50 size 47424 nbytes 49141
block group 0 mode 100644 links 1 uid 0 gid 0
rdev 0 flags 0x0(none)
item 62 key (606890 INODE_REF 603050) itemoff 9642 itemsize 20
inode ref index 3 namelen 10 name: DB_File.so
item 63 key (606890 EXTENT_DATA 0) itemoff 8280 itemsize 1362
inline extent data size 1341 ram 4085 compress(zlib)
item 64 key (606890 EXTENT_DATA 4096) itemoff 8227 itemsize 53
extent data disk byte 5367308288 nr 20480
extent data offset 0 nr 45056 ram 45056
extent compression(zlib)
Different data appears in userspace during each read of the 11 bytes
between 4085 and 4096. The extent in item 63 is not long enough to
fill the first page of the file, so a memset is required to fill the
space between item 63 (ending at 4085) and item 64 (beginning at 4096)
with zero.
Here is a reproducer from Liu Bo, which demonstrates another method
of creating the same inline extent and hole pattern:
Using 'page_poison=on' kernel command line (or enable
CONFIG_PAGE_POISONING) run the following:
# touch foo
# chattr +c foo
# xfs_io -f -c "pwrite -W 0 1000" foo
# xfs_io -f -c "falloc 4 8188" foo
# od -x foo
# echo 3 >/proc/sys/vm/drop_caches
# od -x foo
This produce the following on my box:
Correct output: file contains 1000 data bytes followed
by zeros:
0000000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd
*
0001740 cdcd cdcd cdcd cdcd 0000 0000 0000 0000
0001760 0000 0000 0000 0000 0000 0000 0000 0000
*
0020000
Actual output: the data after the first 1000 bytes
will be different each run:
0000000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd
*
0001740 cdcd cdcd cdcd cdcd 6c63 7400 635f 006d
0001760 5f74 6f43 7400 435f 0053 5f74 7363 7400
0002000 435f 0056 5f74 6164 7400 645f 0062 5f74
(...)
Signed-off-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Chris Mason <clm@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
The bug is a regression after commit
(da2c7009f6 "btrfs: teach __process_pages_contig about PAGE_LOCK operation")
and commit
(76c0021db8 "Btrfs: use helper to simplify lock/unlock pages").
So if the dirty pages which are under writeback got truncated partially
before we lock the dirty pages, we couldn't find all pages mapping to the
delalloc range, and the bug didn't return an error so it kept going on and
found that the delalloc range got truncated and got to unlock the dirty
pages, and then the ASSERT could caught the error, and showed
-----------------------------------------------------------------------------
assertion failed: page_ops & PAGE_LOCK, file: fs/btrfs/extent_io.c, line: 1716
-----------------------------------------------------------------------------
This fixes the bug by returning the proper -EAGAIN.
Cc: David Sterba <dsterba@suse.com>
Reported-by: Dave Jones <davej@codemonkey.org.uk>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The flexfiles layout should never mark a device unavailable.
Move nfs4_mark_deviceid_unavailable out of nfs4_pnfs_ds_connect and call
directly from files layout where it's still needed.
The flexfiles driver still handles marked devices in error paths, but will
now print a rate limited warning.
Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
The nfs4_pnfs_ds_connect path can call rpc_create which can fail or it
can wait on another context to reach the same failure.
This checks that the rpc_create succeeded and returns the error to the
caller.
When an error is returned, both the files and flexfiles layouts will return
NULL from _prepare_ds(). The flexfiles layout will also return the layout
with the error NFS4ERR_NXIO.
Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Currently client doesn't respect max sizes server returns in CREATE_SESSION.
nfs4_session_set_rwsize() gets called and server->rsize, server->wsize are 0
so they never get set to the sizes returned by the server.
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Since rpc_task is async, the release function should be called which
will free the impl_id, scope, and owner.
Trond pointed at 2 more problems:
-- use of client pointer after free in the nfs4_exchangeid_release() function
-- cl_count mismatch if rpc_run_task() isn't run
Fixes: 8d89bd70bc ("NFS setup async exchange_id")
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Cc: stable@vger.kernel.org # 4.9
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Fixes the following sparse warning:
fs/nfs/callback.c:235:21: warning: symbol 'nfs4_cb_sv_ops' was not
declared. Should it be static?
Signed-off-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Pull AFS fixes from David Howells:
"Fixes to the AFS filesystem in the kernel.
They fix a variety of bugs. These include some issues fixed for
consistency with other AFS implementations:
- handle AFS mode bits better
- use the client mtime rather than the server mtime in the protocol
- handle the server returning more or less data than was requested in
a FetchData call
- distinguish mountpoints from symlinks based on the mode bits rather
than preemptively reading every symlink to find out what it
actually represents
One other notable change for the user is that files are now flushed on
close analogously with other network filesystems"
* tag 'afs-20170316' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs: (28 commits)
afs: Don't wait for page writeback with the page lock held
afs: ->writepage() shouldn't call clear_page_dirty_for_io()
afs: Fix abort on signal while waiting for call completion
afs: Fix an off-by-one error in afs_send_pages()
afs: Fix afs_kill_pages()
afs: Fix page leak in afs_write_begin()
afs: Don't set PG_error on local EINTR or ENOMEM when filling a page
afs: Populate and use client modification time
afs: Better abort and net error handling
afs: Invalid op ID should abort with RXGEN_OPCODE
afs: Fix the maths in afs_fs_store_data()
afs: Use a bvec rather than a kvec in afs_send_pages()
afs: Make struct afs_read::remain 64-bit
afs: Fix AFS read bug
afs: Prevent callback expiry timer overflow
afs: Migrate vlocation fields to 64-bit
afs: security: Replace rcu_assign_pointer() with RCU_INIT_POINTER()
afs: inode: Replace rcu_assign_pointer() with RCU_INIT_POINTER()
afs: Distinguish mountpoints from symlinks by file mode alone
afs: Flush outstanding writes when an fd is closed
...
Recently started seeing a kernel oops when a module tries removing a
memory mapped sysfs bin_attribute. On closer investigation the root
cause seems to be kernfs_release_file() trying to call
kernfs_op.release() callback that's NULL for such sysfs
bin_attributes. The oops occurs when kernfs_release_file() is called from
kernfs_drain_open_files() to cleanup any open handles with active
memory mappings.
The patch fixes this by checking for flag KERNFS_HAS_RELEASE before
calling kernfs_release_file() in function kernfs_drain_open_files().
On ppc64-le arch with cxl module the oops back-trace is of the
form below:
[ 861.381126] Unable to handle kernel paging request for instruction fetch
[ 861.381360] Faulting instruction address: 0x00000000
[ 861.381428] Oops: Kernel access of bad area, sig: 11 [#1]
....
[ 861.382481] NIP: 0000000000000000 LR: c000000000362c60 CTR:
0000000000000000
....
Call Trace:
[c000000f1680b750] [c000000000362c34] kernfs_drain_open_files+0x104/0x1d0 (unreliable)
[c000000f1680b790] [c00000000035fa00] __kernfs_remove+0x260/0x2c0
[c000000f1680b820] [c000000000360da0] kernfs_remove_by_name_ns+0x60/0xe0
[c000000f1680b8b0] [c0000000003638f4] sysfs_remove_bin_file+0x24/0x40
[c000000f1680b8d0] [c00000000062a164] device_remove_bin_file+0x24/0x40
[c000000f1680b8f0] [d000000009b7b22c] cxl_sysfs_afu_remove+0x144/0x170 [cxl]
[c000000f1680b940] [d000000009b7c7e4] cxl_remove+0x6c/0x1a0 [cxl]
[c000000f1680b990] [c00000000052f694] pci_device_remove+0x64/0x110
[c000000f1680b9d0] [c0000000006321d4] device_release_driver_internal+0x1f4/0x2b0
[c000000f1680ba20] [c000000000525cb0] pci_stop_bus_device+0xa0/0xd0
[c000000f1680ba60] [c000000000525e80] pci_stop_and_remove_bus_device+0x20/0x40
[c000000f1680ba90] [c00000000004a6c4] pci_hp_remove_devices+0x84/0xc0
[c000000f1680bad0] [c00000000004a688] pci_hp_remove_devices+0x48/0xc0
[c000000f1680bb10] [c0000000009dfda4] eeh_reset_device+0xb0/0x290
[c000000f1680bbb0] [c000000000032b4c] eeh_handle_normal_event+0x47c/0x530
[c000000f1680bc60] [c000000000032e64] eeh_handle_event+0x174/0x350
[c000000f1680bd10] [c000000000033228] eeh_event_handler+0x1e8/0x1f0
[c000000f1680bdc0] [c0000000000d384c] kthread+0x14c/0x190
[c000000f1680be30] [c00000000000b5a0] ret_from_kernel_thread+0x5c/0xbc
Fixes: f83f3c5156 ("kernfs: fix locking around kernfs_ops->release() callback")
Signed-off-by: Vaibhav Jain <vaibhav@linux.vnet.ibm.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Pull xfs fix from Darrick Wong:
"Here's a single fix for -rc3 to improve input validation on inline
directory data to prevent buffer overruns due to corrupt metadata"
* tag 'xfs-4.11-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: verify inline directory data forks
Before this patch i_no_addr was not initialized until after the
return from allocating its block. That meant the i_no_addr was
temporarily uninitialized storage. Ordinarily that's not a concern,
but if inplace_reserve can't find space, it can call try_rgrp_unlink
which references i_no_addr as a block to avoid. That can result in
unpredictable behavior. More importantly, the trace point in
gfs2_alloc_blocks references ip->i_no_addr before it is set, which
is misleading when reading the kernel traces. This patch makes it
look like the new dinode block was assigned in the name of inode 0
rather than a random inode that's completely unrelated.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
The ->writepage() op shouldn't call clear_page_dirty_for_io() as that has
already been called by the caller.
Fix afs_writepage() by moving the call out of
afs_write_back_from_locked_page() to afs_writepages_region() where it is
needed.
Signed-off-by: David Howells <dhowells@redhat.com>
Fix the way in which a call that's in progress and being waited for is
aborted in the case that EINTR is detected. We should be sending
RX_USER_ABORT rather than RX_CALL_DEAD as the abort code.
Note that since the only two ways out of the loop are if the call completes
or if a signal happens, the kill-the-call clause after the loop has
finished can only happen in the case of EINTR. This means that we only
have one abort case to deal with, not two, and the "KWC" case can never
happen and so can be deleted.
Note further that simply aborting the call isn't necessarily the best thing
here since at this point: the request has been entirely sent and it's
likely the server will do the operation anyway - whether we abort it or
not. In future, we should punt the handling of the remainder of the call
off to a background thread.
Reported-by: Marc Dionne <marc.c.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
afs_send_pages() should only put the call into the AFS_CALL_AWAIT_REPLY
state if it has sent all the pages - but the check it makes is incorrect
and sometimes it will finish the loop early.
Signed-off-by: David Howells <dhowells@redhat.com>
Fix afs_kill_pages() in two ways:
(1) If a writeback has been partially flushed, then if we try and kill the
pages it contains, some of them may no longer be undergoing writeback
and end_page_writeback() will assert.
Fix this by checking to see whether the page in question is actually
undergoing writeback before ending that writeback.
(2) The loop that scans for pages to kill doesn't increase the first page
index, and so the loop may not terminate, but it will try to process
the same pages over and over again.
Fix this by increasing the first page index to one after the last page
we processed.
Signed-off-by: David Howells <dhowells@redhat.com>
afs_write_begin() leaks a ref and a lock on a page if afs_fill_page()
fails. Fix the leak by unlocking and releasing the page in the error path.
Signed-off-by: David Howells <dhowells@redhat.com>
The inode timestamps should be set from the client time
in the status received from the server, rather than the
server time which is meant for internal server use.
Set AFS_SET_MTIME and populate the mtime for operations
that take an input status, such as file/dir creation
and StoreData. If an input time is not provided the
server will set the vnode times based on the current server
time.
In a situation where the server has some skew with the
client, this could lead to the client seeing a timestamp
in the future for a file that it just created or wrote.
Signed-off-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>