Commit Graph

50492 Commits

Author SHA1 Message Date
Christoph Hellwig
3398a4005f xfs: remove XFS_HSIZE
XFS_HSIZE is an extremly confusing way to calculate the size of handle_t.
Given that handle_t always only had two sizes, and one of them isn't
even covered by XFS_HSIZE to start with just remove the macro and use
a constant sizeof expression.

Note that XFS_HSIZE isn't used in xfsprogs, xfsdump or xfstests either.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-06-19 08:59:10 -07:00
Brian Foster
d4ca1d550d xfs: dump transaction usage details on log reservation overrun
If a transaction log reservation overrun occurs, the ticket data
associated with the reservation is dumped in xfs_log_commit_cil().
This occurs long after the transaction items and details have been
removed from the transaction and effectively lost. This limited set
of ticket data provides very little information to support debugging
transaction overruns based on the typical report.

To improve transaction log reservation overrun reporting, create a
helper to dump transaction details such as log items, log vector
data, etc., as well as the underlying ticket data for the
transaction. Move the overrun detection from xfs_log_commit_cil() to
xlog_cil_insert_items() so it occurs prior to migration of the
logged items to the CIL. Call the new helper such that it is able to
dump this transaction data before it is lost.

Also, warn on overrun to provide callstack context for the offending
transaction and include a few additional messages from
xlog_cil_insert_items() to display the reservation consumed locally
for overhead such as log vector headers, split region headers and
the context ticket. This provides a complete general breakdown of
the reservation consumption of a transaction when/if it happens to
overrun the reservation.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-06-19 08:59:10 -07:00
Brian Foster
e2f2342639 xfs: refactor xlog_cil_insert_items() to facilitate transaction dump
Transaction reservation overrun detection currently occurs too late
to print useful information about the offending transaction.
Ideally, the transaction data is printed before the associated log
items are moved from the transaction to the CIL, which occurs in
xlog_cil_insert_items(), such that details of the items logged by
the transaction are available for analysis.

Refactor xlog_cil_insert_items() to facilitate moving tx overrun
detection to this function. Update the function to track each bit of
extra log reservation stolen from the transaction (i.e., such as for
the CIL context ticket) and perform the log item migration as the
last operation before the CIL lock is released. This creates a
context where the transaction reservation consumption has been fully
calculated when the log items are moved to the CIL. This patch makes
no functional changes.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-06-19 08:59:10 -07:00
Brian Foster
7d2d565346 xfs: separate shutdown from ticket reservation print helper
xlog_print_tic_res() pre-dates delayed logging and the committed
items list (CIL) and thus retains some factoring warts, such as hard
coded function names in the output and the fact that it induces a
shutdown.

In preparation for more detailed logging of regular transaction
overrun situations, refactor xlog_print_tic_res() to be slightly
more generic. Reword some of the warning messages and pull the
shutdown into the callers.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-06-19 08:59:10 -07:00
Brian Foster
1040960efa xfs: define fatal assert build time tunable
While configurable at runtime, the DEBUG mode assert failure
behavior is usually either desired or not for a particular
situation. For example, developers using kernel modules may prefer
for fatal asserts to remain disabled across module reloads while QE
engineers doing broad regression testing may prefer to have fatal
asserts enabled on boot to facilitate data collection for bug
reports.

To provide a compromise/convenience for developers, create a Kconfig
option that sets the default value of the DEBUG mode 'bug_on_assert'
sysfs tunable. The default behavior remains to trigger kernel BUGs
on assert failures to preserve existing behavior across kernel
configuration updates with DEBUG mode enabled.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-06-19 08:59:10 -07:00
Brian Foster
ccdab3d6e8 xfs: define bug_on_assert debug mode sysfs tunable
In DEBUG mode, assert failures unconditionally trigger a kernel BUG.
This is useful in diagnostic situations to panic a system and
collect detailed state information at the time of a failure.

This can also cause problems in cases where DEBUG mode code is
desired but it is preferable not trigger kernel BUGs on assert
failure. For example, during development of new code or during
certain xfstests tests that intentionally cause corruption and test
the kernel for survival (but otherwise may expect to trigger assert
failures).

To provide additional flexibility, create the
<sysfs>/fs/xfs/debug/bug_on_assert tunable to configure assert
failure behavior at runtime. This tunable is only available in DEBUG
mode and is enabled by default to preserve existing default
behavior. When disabled, assert failures in DEBUG mode result in
kernel warnings.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-06-19 08:59:10 -07:00
Darrick J. Wong
e1a4e37cc7 xfs: try to avoid blowing out the transaction reservation when bunmaping a shared extent
In a pathological scenario where we are trying to bunmapi a single
extent in which every other block is shared, it's possible that trying
to unmap the entire large extent in a single transaction can generate so
many EFIs that we overflow the transaction reservation.

Therefore, use a heuristic to guess at the number of blocks we can
safely unmap from a reflink file's data fork in an single transaction.
This should prevent problems such as the log head slamming into the tail
and ASSERTs that trigger because we've exceeded the transaction
reservation.

Note that since bunmapi can fail to unmap the entire range, we must also
teach the deferred unmap code to roll into a new transaction whenever we
get low on reservation.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
[hch: random edits, all bugs are my fault]
Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-06-19 08:59:10 -07:00
Darrick J. Wong
d205a7d0ec xfs: refactor dir2 leaf readahead shadow buffer cleverness
Currently, the dir2 leaf block getdents function uses a complex state
tracking mechanism to create a shadow copy of the block mappings and
then uses the shadow copy to schedule readahead.  Since the read and
readahead functions are perfectly capable of reading the mappings
themselves, we can tear all that out in favor of a simpler function that
simply keeps pushing the readahead window further out.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2017-06-19 08:59:10 -07:00
Brian Foster
7912e7fef2 xfs: push buffer of flush locked dquot to avoid quotacheck deadlock
Reclaim during quotacheck can lead to deadlocks on the dquot flush
lock:

 - Quotacheck populates a local delwri queue with the physical dquot
   buffers.
 - Quotacheck performs the xfs_qm_dqusage_adjust() bulkstat and
   dirties all of the dquots.
 - Reclaim kicks in and attempts to flush a dquot whose buffer is
   already queud on the quotacheck queue. The flush succeeds but
   queueing to the reclaim delwri queue fails as the backing buffer is
   already queued. The flush unlock is now deferred to I/O completion
   of the buffer from the quotacheck queue.
 - The dqadjust bulkstat continues and dirties the recently flushed
   dquot once again.
 - Quotacheck proceeds to the xfs_qm_flush_one() walk which requires
   the flush lock to update the backing buffers with the in-core
   recalculated values. It deadlocks on the redirtied dquot as the
   flush lock was already acquired by reclaim, but the buffer resides
   on the local delwri queue which isn't submitted until the end of
   quotacheck.

This is reproduced by running quotacheck on a filesystem with a
couple million inodes in low memory (512MB-1GB) situations. This is
a regression as of commit 43ff2122e6 ("xfs: on-stack delayed write
buffer lists"), which removed a trylock and buffer I/O submission
from the quotacheck dquot flush sequence.

Quotacheck first resets and collects the physical dquot buffers in a
delwri queue. Then, it traverses the filesystem inodes via bulkstat,
updates the in-core dquots, flushes the corrected dquots to the
backing buffers and finally submits the delwri queue for I/O. Since
the backing buffers are queued across the entire quotacheck
operation, dquot reclaim cannot possibly complete a dquot flush
before quotacheck completes.

Therefore, quotacheck must submit the buffer for I/O in order to
cycle the flush lock and flush the dirty in-core dquot to the
buffer. Add a delwri queue buffer push mechanism to submit an
individual buffer for I/O without losing the delwri queue status and
use it from quotacheck to avoid the deadlock. This restores
quotacheck behavior to as before the regression was introduced.

Reported-by: Martin Svec <martin.svec@zoner.cz>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-06-19 08:59:10 -07:00
Hugh Dickins
1be7107fbe mm: larger stack guard gap, between vmas
Stack guard page is a useful feature to reduce a risk of stack smashing
into a different mapping. We have been using a single page gap which
is sufficient to prevent having stack adjacent to a different mapping.
But this seems to be insufficient in the light of the stack usage in
userspace. E.g. glibc uses as large as 64kB alloca() in many commonly
used functions. Others use constructs liks gid_t buffer[NGROUPS_MAX]
which is 256kB or stack strings with MAX_ARG_STRLEN.

This will become especially dangerous for suid binaries and the default
no limit for the stack size limit because those applications can be
tricked to consume a large portion of the stack and a single glibc call
could jump over the guard page. These attacks are not theoretical,
unfortunatelly.

Make those attacks less probable by increasing the stack guard gap
to 1MB (on systems with 4k pages; but make it depend on the page size
because systems with larger base pages might cap stack allocations in
the PAGE_SIZE units) which should cover larger alloca() and VLA stack
allocations. It is obviously not a full fix because the problem is
somehow inherent, but it should reduce attack space a lot.

One could argue that the gap size should be configurable from userspace,
but that can be done later when somebody finds that the new 1MB is wrong
for some special case applications.  For now, add a kernel command line
option (stack_guard_gap) to specify the stack gap size (in page units).

Implementation wise, first delete all the old code for stack guard page:
because although we could get away with accounting one extra page in a
stack vma, accounting a larger gap can break userspace - case in point,
a program run with "ulimit -S -v 20000" failed when the 1MB gap was
counted for RLIMIT_AS; similar problems could come with RLIMIT_MLOCK
and strict non-overcommit mode.

Instead of keeping gap inside the stack vma, maintain the stack guard
gap as a gap between vmas: using vm_start_gap() in place of vm_start
(or vm_end_gap() in place of vm_end if VM_GROWSUP) in just those few
places which need to respect the gap - mainly arch_get_unmapped_area(),
and and the vma tree's subtree_gap support for that.

Original-patch-by: Oleg Nesterov <oleg@redhat.com>
Original-patch-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Tested-by: Helge Deller <deller@gmx.de> # parisc
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-06-19 21:50:20 +08:00
NeilBrown
011067b056 blk: replace bioset_create_nobvec() with a flags arg to bioset_create()
"flags" arguments are often seen as good API design as they allow
easy extensibility.
bioset_create_nobvec() is implemented internally as a variation in
flags passed to __bioset_create().

To support future extension, make the internal structure part of the
API.
i.e. add a 'flags' argument to bioset_create() and discard
bioset_create_nobvec().

Note that the bio_split allocations in drivers/md/raid* do not need
the bvec mempool - they should have used bioset_create_nobvec().

Suggested-by: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-18 12:40:59 -06:00
Linus Torvalds
6e20350659 Merge tag 'ceph-for-4.12-rc6' of git://github.com/ceph/ceph-client
Pull ceph fixes from Ilya Dryomov:
 "A fix for an old ceph ->fh_to_* bug from Luis and two timestamp fixups
  from Zheng, prompted by the ongoing y2038 work"

* tag 'ceph-for-4.12-rc6' of git://github.com/ceph/ceph-client:
  ceph: unify inode i_ctime update
  ceph: use current_kernel_time() to get request time stamp
  ceph: check i_nlink while converting a file handle to dentry
2017-06-18 08:23:02 +09:00
Al Viro
77e9ce327d ufs: fix the logics for tail relocation
* original hysteresis loop got broken by typo back in 2002; now
it never switches out of OPTTIME state.  Fixed.
* critical levels for switching from OPTTIME to OPTSPACE and back
ought to be calculated once, at mount time.
* we should use mul_u64_u32_div() for those calculations, now that
->s_dsize is 64bit.
* to quote Kirk McKusick (in 1995 FreeBSD commit message):
    The threshold for switching from time-space and space-time is too small
    when minfree is 5%...so make it stay at space in this case.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-17 17:22:42 -04:00
Al Viro
c0ef65d292 ufs_iget(): fail with -ESTALE on deleted inode
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-17 12:25:58 -04:00
Al Viro
23ac7cba73 fix signedness of timestamps on ufs1
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-17 12:25:13 -04:00
Linus Torvalds
adc311034c Merge tag 'xfs-4.12-fixes-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fix from Darrick Wong:
 "One more bugfix for you for 4.12-rc6 to fix something that came up in
  an earlier rc:

   - Fix some bogus ASSERT failures on CONFIG_SMP=n and CONFIG_XFS_DEBUG=y"

* tag 'xfs-4.12-fixes-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: fix spurious spin_is_locked() assert failures on non-smp kernels
2017-06-17 17:34:41 +09:00
Linus Torvalds
c8636b90a0 Merge branch 'ufs-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull ufs fixes from Al Viro:
 "Fix assorted ufs bugs: a couple of deadlocks, fs corruption in
  truncate(), oopsen on tail unpacking and truncate when racing with
  vmscan, mild fs corruption (free blocks stats summary buggered, *BSD
  fsck would complain and fix), several instances of broken logics
  around reserved blocks (starting with "check almost never triggers
  when it should" and then there are issues with sufficiently large
  UFS2)"

[ Note: ufs hasn't gotten any loving in a long time, because nobody
  really seems to use it. These ufs fixes are triggered by people
  actually caring now, not some sudden influx of new bugs.  - Linus ]

* 'ufs-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  ufs_truncate_blocks(): fix the case when size is in the last direct block
  ufs: more deadlock prevention on tail unpacking
  ufs: avoid grabbing ->truncate_mutex if possible
  ufs_get_locked_page(): make sure we have buffer_heads
  ufs: fix s_size/s_dsize users
  ufs: fix reserved blocks check
  ufs: make ufs_freespace() return signed
  ufs: fix logics in "ufs: make fsck -f happy"
2017-06-17 17:30:07 +09:00
Linus Torvalds
ccd3d905f7 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs fixes from Al Viro:
 "A couple of fixes; a leak in mntns_install() caught by Andrei (this
  cycle regression) + d_invalidate() softlockup fix - that had been
  reported by a bunch of people lately, but the problem is pretty old"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fs: don't forget to put old mntns in mntns_install
  Hang/soft lockup in d_invalidate with simultaneous calls
2017-06-17 17:26:53 +09:00
Andrea Arcangeli
64c2b20301 userfaultfd: shmem: handle coredumping in handle_userfault()
Anon and hugetlbfs handle FOLL_DUMP set by get_dump_page() internally to
__get_user_pages().

shmem as opposed has no special FOLL_DUMP handling there so
handle_mm_fault() is invoked without mmap_sem and ends up calling
handle_userfault() that isn't expecting to be invoked without mmap_sem
held.

This makes handle_userfault() fail immediately if invoked through
shmem_vm_ops->fault during coredumping and solves the problem.

The side effect is a BUG_ON with no lock held triggered by the
coredumping process which exits.  Only 4.11 is affected, pre-4.11 anon
memory holes are skipped in __get_user_pages by checking FOLL_DUMP
explicitly against empty pagetables (mm/gup.c:no_page_table()).

It's zero cost as we already had a check for current->flags to prevent
futex to trigger userfaults during exit (PF_EXITING).

Link: http://lkml.kernel.org/r/20170615214838.27429-1-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: <stable@vger.kernel.org>	[4.11+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-06-17 06:37:05 +09:00
Linus Torvalds
ab2789b72d Merge tag 'configfs-for-4.12' of git://git.infradead.org/users/hch/configfs
Pull configfs updates from Christoph Hellwig:
 "A fix from Nic for a race seen in production (including a stable tag).

  And while I'm sending you this I'm also sneaking in a trivial new
  helper from Bart so that we don't need inter-tree dependencies for the
  next merge window"

* tag 'configfs-for-4.12' of git://git.infradead.org/users/hch/configfs:
  configfs: Introduce config_item_get_unless_zero()
  configfs: Fix race between create_link and configfs_rmdir
2017-06-16 18:45:47 +09:00
Christoph Hellwig
20223f0f39 fs: pass on flags in compat_writev
Fixes: 793b80ef14 ("vfs: pass a flags argument to vfs_readv/vfs_writev")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-06-16 18:40:51 +09:00
Dan Williams
81f558701a x86, dax: replace clear_pmem() with open coded memset + dax_ops->flush
The clear_pmem() helper simply combines a memset() plus a cache flush.
Now that the flush routine is optionally provided by the dax device
driver we can avoid unnecessary cache management on dax devices fronting
volatile memory.

With clear_pmem() gone we can follow on with a patch to make pmem cache
management completely defined within the pmem driver.

Cc: <x86@kernel.org>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-06-15 14:35:24 -07:00
Dan Williams
6318770a7d filesystem-dax: convert to dax_flush()
Filesystem-DAX flushes caches whenever it writes to the address returned
through dax_direct_access() and when writing back dirty radix entries.
That flushing is only required in the pmem case, so the dax_flush()
helper skips cache management work when the underlying driver does not
specify a flush method.

We still do all the dirty tracking since the radix entry will already be
there for locking purposes. However, the work to clean the entry will be
a nop for some dax drivers.

Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-06-15 14:35:24 -07:00
Dan Williams
fec53774fd filesystem-dax: convert to dax_copy_from_iter()
Now that all possible providers of the dax_operations copy_from_iter
method are implemented, switch filesytem-dax to call the driver rather
than copy_to_iter_pmem.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-06-15 14:34:59 -07:00
David S. Miller
0ddead90b2 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
The conflicts were two cases of overlapping changes in
batman-adv and the qed driver.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-15 11:59:32 -04:00
Andrei Vagin
4068367c9c fs: don't forget to put old mntns in mntns_install
Fixes: 4f757f3cbf ("make sure that mntns_install() doesn't end up with referral for root")
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrei Vagin <avagin@openvz.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-15 06:53:05 -04:00
Al Viro
81be24d263 Hang/soft lockup in d_invalidate with simultaneous calls
It's not hard to trigger a bunch of d_invalidate() on the same
dentry in parallel.  They end up fighting each other - any
dentry picked for removal by one will be skipped by the rest
and we'll go for the next iteration through the entire
subtree, even if everything is being skipped.  Morevoer, we
immediately go back to scanning the subtree.  The only thing
we really need is to dissolve all mounts in the subtree and
as soon as we've nothing left to do, we can just unhash the
dentry and bugger off.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-15 06:52:09 -04:00
Linus Torvalds
54ed0f71f0 Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto fix from Herbert Xu:
 "This fixes a bug on sparc where we may dereference freed stack memory"

* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
  crypto: Work around deallocated stack frame reference gcc bug on sparc.
2017-06-15 17:54:51 +09:00
Al Viro
a8fad98483 ufs_truncate_blocks(): fix the case when size is in the last direct block
The logics when deciding whether we need to do anything with direct blocks
is broken when new size is within the last direct block.  It's better to
find the path to the last byte _not_ to be removed and use that instead
of the path to the beginning of the first block to be freed...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-15 03:57:46 -04:00
Al Viro
289dec5b89 ufs: more deadlock prevention on tail unpacking
->s_lock is not needed for ufs_change_blocknr()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-15 00:42:56 -04:00
Al Viro
09bf4f5b6e ufs: avoid grabbing ->truncate_mutex if possible
tail unpacking is done in a wrong place; the deadlocks galore
is best dealt with by doing that in ->write_iter() (and switching
to iomap, while we are at it), but that's rather painful to
backport.  The trouble comes from grabbing pages that cover
the beginning of tail from inside of ufs_new_fragments(); ongoing
pageout of any of those is going to deadlock on ->truncate_mutex
with process that got around to extending the tail holding that
and waiting for page to get unlocked, while ->writepage() on
that page is waiting on ->truncate_mutex.

The thing is, we don't need ->truncate_mutex when the fragment
we are trying to map is within the tail - the damn thing is
allocated (tail can't contain holes).

Let's do a plain lookup and if the fragment is present, we can
just pretend that we'd won the race in almost all cases.  The
only exception is a fragment between the end of tail and the
end of block containing tail.

Protect ->i_lastfrag with ->meta_lock - read_seqlock_excl() is
sufficient.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-15 00:41:18 -04:00
Al Viro
267309f394 ufs_get_locked_page(): make sure we have buffer_heads
callers rely upon that, but find_lock_page() racing with attempt of
page eviction by memory pressure might have left us with
	* try_to_free_buffers() successfully done
	* __remove_mapping() failed, leaving the page in our mapping
	* find_lock_page() returning an uptodate page with no
buffer_heads attached.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-14 23:32:19 -04:00
Al Viro
c596961d1b ufs: fix s_size/s_dsize users
For UFS2 we need 64bit variants; we even store them in uspi, but
use 32bit ones instead.  One wrinkle is in handling of reserved
space - recalculating it every time had been stupid all along, but
now it would become really ugly.  Just calculate it once...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-14 16:43:03 -04:00
Al Viro
b451cec4bb ufs: fix reserved blocks check
a) honour ->s_minfree; don't just go with default (5)
b) don't bother with capability checks until we know we'll need them

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-14 15:46:05 -04:00
Al Viro
fffd70f588 ufs: make ufs_freespace() return signed
as it is, checking that its return value is <= 0 is useless and
that's how it's being used.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-14 15:36:31 -04:00
Al Viro
96ecff1422 ufs: fix logics in "ufs: make fsck -f happy"
Storing stats _only_ at new locations is wrong for UFS1; old
locations should always be kept updated.  The check for "has
been converted to use of new locations" is also wrong - it
should be "->fs_maxbsize is equal to ->fs_bsize".

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-14 15:17:32 -04:00
Yan, Zheng
4ca2fea6f8 ceph: unify inode i_ctime update
Current __ceph_setattr() can set inode's i_ctime to current_time(),
req->r_stamp or attr->ia_ctime. These time stamps may have minor
differences. It may cause potential problem.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-06-14 19:37:23 +02:00
Yan, Zheng
56199016e8 ceph: use current_kernel_time() to get request time stamp
ceph uses ktime_get_real_ts() to get request time stamp. In most
other cases, current_kernel_time() is used to get time stamp for
filesystem operations (called by current_time()).

There is granularity difference between ktime_get_real_ts() and
current_kernel_time(). The later one can be up to one jiffy behind
the former one. This can causes inode's ctime to go back.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-06-14 19:33:23 +02:00
Luis Henriques
03f219041f ceph: check i_nlink while converting a file handle to dentry
Converting a file handle to a dentry can be done call after the inode
unlink.  This means that __fh_to_dentry() requires an extra check to
verify the number of links is not 0.

The issue can be easily reproduced using xfstest generic/426, which does
something like:

    name_to_handle_at(&fh)
    echo 3 > /proc/sys/vm/drop_caches
    unlink()
    open_by_handle_at(&fh)

The call to open_by_handle_at() should fail, as the file doesn't exist
anymore.

Link: http://tracker.ceph.com/issues/19958
Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-06-14 19:32:43 +02:00
Jeff Layton
f73127356f fs/fcntl: return -ESRCH in f_setown when pid/pgid can't be found
The current implementation of F_SETOWN doesn't properly vet the argument
passed in and only returns an error if INT_MIN is passed in. If the
argument doesn't specify a valid pid/pgid, then we just end up cleaning
out the file->f_owner structure.

What we really want is to only clean that out only in the case where
userland passed in an argument of 0. For anything else, we want to
return ESRCH if it doesn't refer to a valid pid.

The relevant POSIX spec page is here:

    http://pubs.opengroup.org/onlinepubs/9699919799/functions/fcntl.html

Cc: Jiri Slaby <jslaby@suse.cz>
Cc: zhong jiang <zhongjiang@huawei.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
2017-06-14 09:11:54 -04:00
Jiri Slaby
fc3dc67471 fs/fcntl: f_setown, avoid undefined behaviour
fcntl(0, F_SETOWN, 0x80000000) triggers:
UBSAN: Undefined behaviour in fs/fcntl.c:118:7
negation of -2147483648 cannot be represented in type 'int':
CPU: 1 PID: 18261 Comm: syz-executor Not tainted 4.8.1-0-syzkaller #1
...
Call Trace:
...
 [<ffffffffad8f0868>] ? f_setown+0x1d8/0x200
 [<ffffffffad8f19a9>] ? SyS_fcntl+0x999/0xf30
 [<ffffffffaed1fb00>] ? entry_SYSCALL_64_fastpath+0x23/0xc1

Fix that by checking the arg parameter properly (against INT_MAX) before
"who = -who". And return immediatelly with -EINVAL in case it is wrong.
Note that according to POSIX we can return EINVAL:
    http://pubs.opengroup.org/onlinepubs/9699919799/functions/fcntl.html

    [EINVAL]
        The cmd argument is F_SETOWN and the value of the argument
        is not valid as a process or process group identifier.

[v2] returns an error, v1 used to fail silently
[v3] implement proper check for the bad value INT_MIN

Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Cc: Jeff Layton <jlayton@poochiereds.net>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Jeff Layton <jlayton@redhat.com>
2017-06-14 08:46:45 -04:00
Jiri Slaby
393cc3f511 fs/fcntl: f_setown, allow returning error
Allow f_setown to return an error value. We will fail in the next patch
with EINVAL for bad input to f_setown, so tile the path for the later
patch.

Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Cc: Jeff Layton <jlayton@poochiereds.net>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Jeff Layton <jlayton@redhat.com>
2017-06-14 08:46:36 -04:00
Jan Kara
fd3cfad374 udf: Convert udf_disk_stamp_to_time() to use mktime64()
Convert udf_disk_stamp_to_time() to use mktime64() to simplify the code.
As a bonus we get working timestamp conversion for dates before epoch
and after 2038 (both of which are allowed by UDF standard).

Signed-off-by: Jan Kara <jack@suse.cz>
2017-06-14 11:21:02 +02:00
Jan Kara
3c399fa40f udf: Use time64_to_tm for timestamp conversion
UDF on-disk time stamp is stored in a form very similar to struct tm.
Use time64_to_tm() for conversion of seconds since epoch to year, month,
... format and then just copy this as necessary to UDF on-disk
structure to simplify the code.

Signed-off-by: Jan Kara <jack@suse.cz>
2017-06-14 11:21:02 +02:00
Jan Kara
f2e9535589 udf: Fix deadlock between writeback and udf_setsize()
udf_setsize() called truncate_setsize() with i_data_sem held. Thus
truncate_pagecache() called from truncate_setsize() could lock a page
under i_data_sem which can deadlock as page lock ranks below
i_data_sem - e. g. writeback can hold page lock and try to acquire
i_data_sem to map a block.

Fix the problem by moving truncate_setsize() calls from under
i_data_sem. It is safe for us to change i_size without holding
i_data_sem as all the places that depend on i_size being stable already
hold inode_lock.

CC: stable@vger.kernel.org
Fixes: 7e49b6f248
Signed-off-by: Jan Kara <jack@suse.cz>
2017-06-14 11:21:01 +02:00
Jan Kara
146c4ad6ec udf: Use i_size_read() in udf_adinicb_writepage()
We don't hold inode_lock in udf_adinicb_writepage() so use i_size_read()
to get i_size. This cannot cause real problems is i_size is guaranteed
to be small but let's be careful.

Signed-off-by: Jan Kara <jack@suse.cz>
2017-06-14 11:21:01 +02:00
Jan Kara
9795e0e8ac udf: Fix races with i_size changes during readpage
__udf_adinicb_readpage() uses i_size several times. When truncate
changes i_size while the function is running, it can observe several
different values and thus e.g. expose uninitialized parts of page to
userspace. Also use i_size_read() in the function since it does not hold
inode_lock. Since i_size is guaranteed to be small, this cannot really
cause any issues even on 32-bit archs but let's be careful.

CC: stable@vger.kernel.org
Fixes: 9c2fc0de1a
Signed-off-by: Jan Kara <jack@suse.cz>
2017-06-14 11:21:01 +02:00
Jan Kara
a247f7236d udf: Remove unused UDF_DEFAULT_BLOCKSIZE
The define is unused. Remove it.

Signed-off-by: Jan Kara <jack@suse.cz>
2017-06-13 14:59:14 +02:00
Christoph Hellwig
fdd050b5b3 Merge branch 'uuid-types' of bombadil.infradead.org:public_git/uuid into nvme-base 2017-06-13 11:45:14 +02:00
Bob Peterson
df68f20f56 GFS2: Remove gl_list from glock structure
The gl_list is no longer used nor needed in the glock structure,
so this patch eliminates it.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2017-06-12 14:39:12 -05:00