Commit Graph

738561 Commits

Author SHA1 Message Date
Christoph Hellwig
5bcffe300c xfs: fix the check for COW extents in xfs_swap_extents
i_cnextents does not include delayed allocated extents, so switch
to the inode fork size check that we already use in other places
instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-03-15 10:31:38 -07:00
Christoph Hellwig
e6b9657056 xfs: refactor xfs_log_force
Streamline the conditionals so that it is more obvious which specific case
form the top of the function comments is being handled.  Use gotos only
for early returns.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-03-14 11:12:52 -07:00
Christoph Hellwig
656de4ffaf xfs: merge _xfs_log_force_lsn and xfs_log_force_lsn
Switch to a single interface for flushing the log to a specific LSN, which
gives consistent trace point coverage and a less confusing interface.

The was only a single user of the previous xfs_log_force_lsn function,
which now also passes a NULL log_flushed argument.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-03-14 11:12:52 -07:00
Christoph Hellwig
60e5bb7844 xfs: merge _xfs_log_force and xfs_log_force
Switch to a single interface for flushing the whole log, which gives
consistent trace point coverage, and removes the unused log_flushed
argument for the previous _xfs_log_force callers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-03-14 11:12:52 -07:00
Christoph Hellwig
2b56c2857f xfs: remove the unused log_flushed variable in xfs_extent_busy_flush
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-03-14 11:12:52 -07:00
Christoph Hellwig
4f7aa2fd8c xfs: remove an outdated comment for xfs_inode_item_committing
The function now does something, and that something is central to our
inode logging scheme.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-03-14 11:12:51 -07:00
Christoph Hellwig
2022ab36fe xfs: remove misleading comment text on xfs_inode_item_unlock
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-03-14 11:12:51 -07:00
Brian Foster
0ab32086d0 xfs: account only rmapbt-used blocks against rmapbt perag res
The rmapbt perag metadata reservation reserves blocks for the
reverse mapping btree (rmapbt). Since the rmapbt uses blocks from
the agfl and perag accounting is updated as blocks are allocated
from the allocation btrees, the reservation actually accounts blocks
as they are allocated to (or freed from) the agfl rather than the
rmapbt itself.

While this works for blocks that are eventually used for the rmapbt,
not all agfl blocks are destined for the rmapbt. Blocks that are
allocated to the agfl (and thus "reserved" for the rmapbt) but then
used by another structure leads to a growing inconsistency over time
between the runtime tracking of rmapbt usage vs. actual rmapbt
usage. Since the runtime tracking thinks all agfl blocks are rmapbt
blocks, it essentially believes that less future reservation is
required to satisfy the rmapbt than what is actually necessary.

The inconsistency is rectified across mount cycles because the perag
reservation is initialized based on the actual rmapbt usage at mount
time. The problem, however, is that the excessive drain of the
reservation at runtime opens a window to allocate blocks for other
purposes that might be required for the rmapbt on a subsequent
mount. This problem can be demonstrated by a simple test that runs
an allocation workload to consume agfl blocks over time and then
observe the difference in the agfl reservation requirement across an
unmount/mount cycle:

  mount ...: xfs_ag_resv_init: ... resv 3193 ask 3194 len 3194
  ...
  ...      : xfs_ag_resv_alloc_extent: ... resv 2957 ask 3194 len 1
  umount...: xfs_ag_resv_free: ... resv 2956 ask 3194 len 0
  mount ...: xfs_ag_resv_init: ... resv 3052 ask 3194 len 3194

As the above tracepoints show, the reservation requirement reduces
from 3194 blocks to 2956 blocks as the workload runs.  Without any
other changes in the filesystem, the same reservation requirement
jumps from 2956 to 3052 blocks over a umount/mount cycle.

To address this divergence, update the RMAPBT reservation to account
blocks used for the rmapbt only rather than all blocks filled into
the agfl. This patch makes several high-level changes toward that
end:

1.) Reintroduce an AGFL reservation type to serve as an accounting
    no-op for blocks allocated to (or freed from) the AGFL.
2.) Invoke RMAPBT usage accounting from the actual rmapbt block
    allocation path rather than the AGFL allocation path.

The first change is required because agfl blocks are considered free
blocks throughout their lifetime. The perag reservation subsystem is
invoked unconditionally by the allocation subsystem, so we need a
way to tell the perag subsystem (via the allocation subsystem) to
not make any accounting changes for blocks filled into the AGFL.

The second change causes the in-core RMAPBT reservation usage
accounting to remain consistent with the on-disk state at all times
and eliminates the risk of leaving the rmapbt reservation
underfilled.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-03-11 20:27:57 -07:00
Brian Foster
2159286335 xfs: rename agfl perag res type to rmapbt
The AGFL perag reservation type accounts all allocations that feed
into (or are released from) the allocation group free list (agfl).
The purpose of the reservation is to support worst case conditions
for the reverse mapping btree (rmapbt). As such, the agfl
reservation usage accounting only considers rmapbt usage when the
in-core counters are initialized at mount time.

This implementation inconsistency leads to divergence of the in-core
and on-disk usage accounting over time. In preparation to resolve
this inconsistency and adjust the AGFL reservation into an rmapbt
specific reservation, rename the AGFL reservation type and
associated accounting fields to something more rmapbt-specific. Also
fix up a couple tracepoints that incorrectly use the AGFL
reservation type to pass the agfl state of the associated extent
where the raw reservation type is expected.

Note that this patch does not change perag reservation behavior.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-03-11 20:27:57 -07:00
Brian Foster
b3fed43482 xfs: account format bouncing into rmapbt swapext tx reservation
The extent swap mechanism requires a unique implementation for
rmapbt enabled filesystems. Because the rmapbt tracks extent owner
information, extent swap must individually unmap and remap each
extent between the two inodes.

The rmapbt extent swap transaction block reservation currently
accounts for the worst case bmapbt block and rmapbt block
consumption based on the extent count of each inode. There is a
corner case that exists due to the extent swap implementation that
is not covered by this reservation, however.

If one of the associated inodes is just over the max extent count
used for extent format inodes (i.e., the inode is in btree format by
a single extent), the unmap/remap cycle of the extent swap can
bounce the inode between extent and btree format multiple times,
almost as many times as there are extents in the inode (if the
opposing inode happens to have one less, for example). Each back and
forth cycle involves a block free and allocation, which isn't a
problem except for that the initial transaction reservation must
account for the total number of block allocations performed by the
chain of deferred operations. If not, a block reservation overrun
occurs and the filesystem shuts down.

Update the rmapbt extent swap block reservation to check for this
situation and add some block reservation slop to ensure the entire
operation succeeds. We'd never likely require reservation for both
inodes as fsr wouldn't defrag the file in that case, but the
additional reservation is constrained by the data fork size so be
cautious and check for both.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-03-11 20:27:57 -07:00
Brian Foster
3e78b9a468 xfs: shutdown if block allocation overruns tx reservation
The ->t_blk_res_used field tracks how many blocks have been used in
the current transaction. This should never exceed the block
reservation (->t_blk_res) for a particular transaction. We currently
assert this condition in the transaction block accounting code, but
otherwise take no additional action should this situation occur.

The overrun generally has no effect if space ends up being available
and the associated transaction commits. If the transaction is
duplicated, however, the current block usage is used to determine
the remaining block reservation to be transferred to the new
transaction. If usage exceeds reservation, this calculation
underflows and creates a transaction with an invalid and excessive
reservation. When the second transaction commits, the release of
unused blocks corrupts the in-core free space counters. With lazy
superblock accounting enabled, this inconsistency eventually
trickles to the on-disk superblock and corrupts the filesystem.

Replace the transaction block usage accounting assert with an
explicit overrun check. If the transaction overruns the reservation,
shutdown the filesystem immediately to prevent corruption. Add a new
assert to xfs_trans_dup() to catch any callers that might induce
this invalid state in the future.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-03-11 20:27:57 -07:00
Matthew Wilcox
57e8095611 xfs: Rename xa_ elements to ail_
This is a simple rename, except that xa_ail becomes ail_head.

Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-03-11 20:27:56 -07:00
Dave Chinner
ae23395d88 inode: don't memset the inode address space twice
Noticed when looking at why cycling 600k inodes/s through the inode
cache was taking a total of 8% cpu in memset() during inode
initialisation.  There is no need to zero the inode.i_data structure
twice.

This increases single threaded bulkstat throughput from ~200,000
inodes/s to ~220,000 inodes/s, so we save a substantial amount of
CPU time per inode init by doing this.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-03-11 20:27:56 -07:00
Dave Chinner
a78ee256c3 xfs: convert XFS_AGFL_SIZE to a helper function
The AGFL size calculation is about to get more complex, so lets turn
the macro into a function first and remove the macro.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
[darrick: forward port to newer kernel, simplify the helper]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-03-11 20:27:56 -07:00
Darrick J. Wong
6231848c3a xfs: check for cow blocks before trying to clear them
There's no point in allocating a transaction and locking the inode in
preparation to clear cow blocks if there actually are any cow fork
extents.  Therefore, move the xfs_reflink_cancel_cow_range hunk to
xfs_inactive and check the cow ifp first.  This makes inode reclamation
run faster.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-03-11 20:27:56 -07:00
Darrick J. Wong
3f883f5bb1 xfs: convert a few more directory asserts to corruption
Yet another round of playing whack-a-mole with directory code that
asserts on corrupt on-disk metadata when it really should be returning
-EFSCORRUPTED instead of ASSERTing.  Found by a xfs/391 crash while
lastbit fuzzing of ltail.bestcount.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-03-11 20:27:56 -07:00
Darrick J. Wong
8241f7f983 xfs: don't iunlock the quota ip when quota block
In xfs_qm_dqalloc, we join the locked quota inode to the transaction we
use to allocate blocks.  If the allocation or mapping fails, we're not
allowed to unlock the inode because the transaction code is in charge of
unlocking it for us.  Therefore, remove the iunlock call to avoid
blowing asserts about unbalanced locking + mount hang.

Found by corrupting the AGF and allocating space in the filesystem
(quotacheck) immediately after mount.  The upcoming agfl wrapping fixup
test will trigger this scenario.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-03-11 20:27:56 -07:00
Vratislav Bendel
19957a1816 xfs: Correctly invert xfs_buftarg LRU isolation logic
Due to an inverted logic mistake in xfs_buftarg_isolate()
the xfs_buffers with zero b_lru_ref will take another trip
around LRU, while isolating buffers with non-zero b_lru_ref.

Additionally those isolated buffers end up right back on the LRU
once they are released, because b_lru_ref remains elevated.

Fix that circuitous route by leaving them on the LRU
as originally intended.

Signed-off-by: Vratislav Bendel <vbendel@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-03-11 20:27:56 -07:00
Dave Chinner
4df0f7f145 xfs: fix transaction allocation deadlock in IO path
xfs_trans_alloc() does GFP_KERNEL allocation, and we can call it
while holding pages locked for writeback in the ->writepages path.
The memory allocation is allowed to wait on pages under writeback,
and so can wait on pages that are tagged as writeback by the
caller.

This affects both pre-IO submission and post-IO submission paths.
Hence xfs_setsize_trans_alloc(), xfs_reflink_end_cow(),
xfs_iomap_write_unwritten() and xfs_reflink_cancel_cow_range().
xfs_iomap_write_unwritten() already does the right thing, but the
others don't. Fix them.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Fixes: 281627df3e ("xfs: log file size updates at I/O completion time")
Fixes: 43caeb187d ("xfs: move mappings from cow fork to data fork after copy-write)"
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-03-11 20:27:55 -07:00
Christoph Hellwig
c3b1b13190 xfs: implement the lazytime mount option
Use the VFS dirty inode tracking for lazytime inodes only, and just
log them in ->dirty_inode.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-03-11 20:27:55 -07:00
Christoph Hellwig
0d07e5573f fs: don't clear I_DIRTY_TIME before calling mark_inode_dirty_sync
__mark_inode_dirty already takes care of that, and for the XFS lazytime
implementation we need to know that ->dirty_inode was called because
I_DIRTY_TIME was set.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-03-11 20:27:55 -07:00
Nikolay Borisov
bcab2ebfa1 xfs: Remove dead code from inode recover function
The memcpy is guarded by a check which is performed a right before we
call xfs_log_dinode_to_disk. At this point we are sure this check will
always be false otherwise we would have errored out. So let's remove
this dead weight.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-03-11 20:27:55 -07:00
Carlos Maiolino
e157ebdcb3 Cleanup old XFS_BTREE_* traces
Remove unused legacy btree traces from IRIX era.

Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-03-11 20:27:55 -07:00
Eric Sandeen
4603fa744c xfs: remove unused m_dmevmask from xfs_mount struct
The dmevmask structure member is a dmapi leftover; it's
set here and there but never actually used.  Remove it.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-03-11 20:27:55 -07:00
Dave Chinner
cb0a8d2302 xfs: fall back to vmalloc when allocation log vector buffers
When using large directory blocks, we regularly see memory
allocations of >64k being made for the shadow log vector buffer.
When we are under memory pressure, kmalloc() may not be able to find
contiguous memory chunks large enough to satisfy these allocations
easily, and if memory is fragmented we can potentially stall here.

TO avoid this problem, switch the log vector buffer allocation to
use kmem_alloc_large(). This will allow failed allocations to fall
back to vmalloc and so remove the dependency on large contiguous
regions of memory being available. This should prevent slowdowns
and potential stalls when memory is low and/or fragmented.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-03-11 20:27:55 -07:00
Linus Torvalds
0c8efd610b Linux 4.16-rc5 v4.16-rc5 2018-03-11 17:25:09 -07:00
Linus Torvalds
ed58d66f60 Merge branch 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86/pti updates from Thomas Gleixner:
 "Yet another pile of melted spectrum related updates:

   - Drop native vsyscall support finally as it causes more trouble than
     benefit.

   - Make microcode loading more robust. There were a few issues
     especially related to late loading which are now surfacing because
     late loading of the IB* microcodes addressing spectre issues has
     become more widely used.

   - Simplify and robustify the syscall handling in the entry code

   - Prevent kprobes on the entry trampoline code which lead to kernel
     crashes when the probe hits before CR3 is updated

   - Don't check microcode versions when running on hypervisors as they
     are considered as lying anyway.

   - Fix the 32bit objtool build and a coment typo"

* 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/kprobes: Fix kernel crash when probing .entry_trampoline code
  x86/pti: Fix a comment typo
  x86/microcode: Synchronize late microcode loading
  x86/microcode: Request microcode on the BSP
  x86/microcode/intel: Look into the patch cache first
  x86/microcode: Do not upload microcode if CPUs are offline
  x86/microcode/intel: Writeback and invalidate caches before updating microcode
  x86/microcode/intel: Check microcode revision before updating sibling threads
  x86/microcode: Get rid of struct apply_microcode_ctx
  x86/spectre_v2: Don't check microcode versions when running under hypervisors
  x86/vsyscall/64: Drop "native" vsyscalls
  x86/entry/64/compat: Save one instruction in entry_INT80_compat()
  x86/entry: Do not special-case clone(2) in compat entry
  x86/syscalls: Use COMPAT_SYSCALL_DEFINEx() macros for x86-only compat syscalls
  x86/syscalls: Use proper syscall definition for sys_ioperm()
  x86/entry: Remove stale syscall prototype
  x86/syscalls/32: Simplify $entry == $compat entries
  objtool: Fix 32-bit build
2018-03-11 14:59:23 -07:00
Linus Torvalds
1ad5daa653 Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fix from Thomas Gleixner:
 "Just a single fix which adds a missing Kconfig dependency to avoid
  unmet dependency warnings"

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  clocksource/atmel-st: Add 'depends on HAS_IOMEM' to fix unmet dependency
2018-03-11 14:55:15 -07:00
Linus Torvalds
ebb3762e88 Merge branch 'ras-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RAS fixes from Thomas Gleixner:
 "Two small fixes for RAS/MCE:

   - Serialize sysfs changes to avoid concurrent modificaiton of
     underlying data

   - Add microcode revision to Machine Check records. This should have
     been there forever, but now with the broken microcode versions in
     the wild it has become important"

* 'ras-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/MCE: Serialize sysfs changes
  x86/MCE: Save microcode revision in machine check records
2018-03-11 14:52:41 -07:00
Linus Torvalds
8ad4424350 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf updates from Thomas Gleixner:
 "Another set of perf updates:

   - Fix a Skylake Uncore event format declaration

   - Prevent perf pipe mode from crahsing which was caused by a missing
     buffer allocation

   - Make the perf top popup message which tells the user that it uses
     fallback mode on older kernels a debug message.

   - Make perf context rescheduling work correcctly

   - Robustify the jump error drawing in perf browser mode so it does
     not try to create references to NULL initialized offset entries

   - Make trigger_on() robust so it does not enable the trigger before
     everything is set up correctly to handle it

   - Make perf auxtrace respect the --no-itrace option so it does not
     try to queue AUX data for decoding.

   - Prevent having different number of field separators in CVS output
     lines when a counter is not supported.

   - Make the perf kallsyms man page usage behave like it does for all
     other perf commands.

   - Synchronize the kernel headers"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/core: Fix ctx_event_type in ctx_resched()
  perf tools: Fix trigger class trigger_on()
  perf auxtrace: Prevent decoding when --no-itrace
  perf stat: Fix CVS output format for non-supported counters
  tools headers: Sync x86's cpufeatures.h
  tools headers: Sync copy of kvm UAPI headers
  perf record: Fix crash in pipe mode
  perf annotate browser: Be more robust when drawing jump arrows
  perf top: Fix annoying fallback message on older kernels
  perf kallsyms: Fix the usage on the man page
  perf/x86/intel/uncore: Fix Skylake UPI event format
2018-03-11 14:49:49 -07:00
Linus Torvalds
02bf0ef028 Merge branch 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking fix from Thomas Gleixner:
 "rt_mutex_futex_unlock() grew a new irq-off call site, but the function
  assumes that its always called from irq enabled context.

  Use (un)lock_irqsafe() to handle the new call site correctly"

* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  rtmutex: Make rt_mutex_futex_unlock() safe for irq-off callsites
2018-03-11 14:46:54 -07:00
Linus Torvalds
abeb75218a Merge tag 'dmaengine-fix-4.16-rc5' of git://git.infradead.org/users/vkoul/slave-dma
Pull dmaengine fixes from Vinod Koul:
 "Two small fixes are for this cycle:

   - fix max_chunk_size for rcar-dmac for R-Car Gen3

   - fix clock resource of mv_xor_v2"

* tag 'dmaengine-fix-4.16-rc5' of git://git.infradead.org/users/vkoul/slave-dma:
  dmaengine: mv_xor_v2: Fix clock resource by adding a register clock
  dmaengine: rcar-dmac: fix max_chunk_size for R-Car Gen3
2018-03-11 13:07:14 -07:00
Linus Torvalds
d43be80a4a Merge tag 'gpio-v4.16-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio
Pull GPIO fix from Linus Walleij:
 "This is a single GPIO fix for the v4.16 series affecting the Renesas
  driver, and fixes wakeup from external stuff"

* tag 'gpio-v4.16-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio:
  gpio: rcar: Use wakeup_path i.s.o. explicit clock handling
2018-03-11 13:05:15 -07:00
Gregory CLEMENT
3cd2c313f1 dmaengine: mv_xor_v2: Fix clock resource by adding a register clock
On the CP110 components which are present on the Armada 7K/8K SoC we need
to explicitly enable the clock for the registers. However it is not
needed for the AP8xx component, that's why this clock is optional.

With this patch both clock have now a name, but in order to be backward
compatible, the name of the first clock is not used. It allows to still
use this clock with a device tree using the old binding.

Signed-off-by: Gregory CLEMENT <gregory.clement@bootlin.com>
Reviewed-by: Rob Herring <robh@kernel.org>
Signed-off-by: Vinod Koul <vinod.koul@intel.com>
2018-03-11 20:33:27 +05:30
Linus Torvalds
3266b5bd97 Merge tag 'kbuild-fixes-v4.16-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull Kbuild fixes from Masahiro Yamada:

 - make fixdep parse kconfig.h to fix missing rebuild

 - replace hyphens with underscores in builtin DTB label names

 - fix typos

* tag 'kbuild-fixes-v4.16-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
  kbuild: Handle builtin dtb file names containing hyphens
  scripts/bloat-o-meter: fix typos in help
  fixdep: do not ignore kconfig.h
  fixdep: remove some false CONFIG_ matches
  fixdep: remove stale references to uml-config.h
2018-03-10 10:21:07 -08:00
Linus Torvalds
23b33acc5f Merge tag 'linux-watchdog-4.16-fixes-2' of git://www.linux-watchdog.org/linux-watchdog
Pull watchdog fixes from Wim Van Sebroeck:

 - f71808e_wdt: Fix magic close handling

 - sbsa: 32-bit read fix for WCV

 - hpwdt: Remove legacy NMI sourcing

* tag 'linux-watchdog-4.16-fixes-2' of git://www.linux-watchdog.org/linux-watchdog:
  watchdog: hpwdt: Remove legacy NMI sourcing.
  watchdog: sbsa: use 32-bit read for WCV
  watchdog: f71808e_wdt: Fix magic close handling
2018-03-10 10:17:59 -08:00
Linus Torvalds
91a262096e Merge tag 'for-linus-20180309' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:

 - a xen-blkfront fix from Bhavesh with a multiqueue fix when
   detaching/re-attaching

 - a few important NVMe fixes, including a revert for a sysfs fix that
   caused some user space confusion

 - two bcache fixes by way of Michael Lyle

 - a loop regression fix, fixing an issue with lost writes on DAX.

* tag 'for-linus-20180309' of git://git.kernel.dk/linux-block:
  loop: Fix lost writes caused by missing flag
  nvme_fc: rework sqsize handling
  nvme-fabrics: Ignore nr_io_queues option for discovery controllers
  xen-blkfront: move negotiate_mq to cover all cases of new VBDs
  Revert "nvme: create 'slaves' and 'holders' entries for hidden controllers"
  bcache: don't attach backing with duplicate UUID
  bcache: fix crashes in duplicate cache device register
  nvme: pci: pass max vectors as num_possible_cpus() to pci_alloc_irq_vectors
  nvme-pci: Fix EEH failure on ppc
2018-03-10 08:48:01 -08:00
Linus Torvalds
b3b25b1d9e Merge tag 'for-4.16/dm-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper fixes from Mike Snitzer:

 - Fix an uninitialized variable false warning in dm bufio

 - Fix DM's passthrough ioctl support to be race free against an
   underlying device being removed.

 - Fix corner-case of DM raid resync reporting if/when the raid becomes
   degraded during resync; otherwise automated raid repair will fail.

 - A few DM multipath fixes to make non-SCSI optimizations, that were
   introduced during the 4.16 merge, useful for all non-SCSI devices,
   rather than narrowly define this non-SCSI mode in terms of "nvme".

   This allows the removal of "queue_mode nvme" that really didn't need
   to be introduced. Instead DM core will internalize whether
   nvme-specific IO submission optimizations are doable and DM multipath
   will only do SCSI-specific device handler operations if SCSI is in
   use.

* tag 'for-4.16/dm-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm table: allow upgrade from bio-based to specialized bio-based variant
  dm mpath: remove unnecessary NVMe branching in favor of scsi_dh checks
  dm table: fix "nvme" test
  dm raid: fix incorrect sync_ratio when degraded
  dm: use blkdev_get rather than bdgrab when issuing pass-through ioctl
  dm bufio: avoid false-positive Wmaybe-uninitialized warning
2018-03-10 08:45:44 -08:00
Linus Torvalds
2f64e70cd0 Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull rdma fixes from Doug Ledford:

 - Various driver bug fixes in mlx5, mlx4, bnxt_re and qedr, ranging
   from bugs under load to bad error case handling

 - There in one largish patch fixing the locking in bnxt_re to avoid a
   machine hard lock situation

 - A few core bugs on error paths

 - A patch to reduce stack usage in the new CQ API

 - One mlx5 regression introduced in this merge window

 - There were new syzkaller scripts written for the RDMA subsystem and
   we are fixing issues found by the bot

 - One of the commits (aa0de36a40 “RDMA/mlx5: Fix integer overflow
   while resizing CQ”) is missing part of the commit log message and one
   of the SOB lines. The original patch was from Leon Romanovsky, and a
   cut-n-paste separator in the commit message confused patchworks which
   then put the end of message separator in the wrong place in the
   downloaded patch, and I didn’t notice in time. The patch made it into
   the official branch, and the only way to fix it in-place was to
   rebase. Given the pain that a rebase causes, and the fact that the
   patch has relevant tags for stable and syzkaller, a revert of the
   munged patch and a reapplication of the original patch with the log
   message intact was done.

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (25 commits)
  RDMA/mlx5: Fix integer overflow while resizing CQ
  Revert "RDMA/mlx5: Fix integer overflow while resizing CQ"
  RDMA/ucma: Check that user doesn't overflow QP state
  RDMA/mlx5: Fix integer overflow while resizing CQ
  RDMA/ucma: Limit possible option size
  IB/core: Fix possible crash to access NULL netdev
  RDMA/bnxt_re: Avoid Hard lockup during error CQE processing
  RDMA/core: Reduce poll batch for direct cq polling
  IB/mlx5: Fix an error code in __mlx5_ib_modify_qp()
  IB/mlx5: When not in dual port RoCE mode, use provided port as native
  IB/mlx4: Include GID type when deleting GIDs from HW table under RoCE
  IB/mlx4: Fix corruption of RoCEv2 IPv4 GIDs
  RDMA/qedr: Fix iWARP write and send with immediate
  RDMA/qedr: Fix kernel panic when running fio over NFSoRDMA
  RDMA/qedr: Fix iWARP connect with port mapper
  RDMA/qedr: Fix ipv6 destination address resolution
  IB/core : Add null pointer check in addr_resolve
  RDMA/bnxt_re: Fix the ib_reg failure cleanup
  RDMA/bnxt_re: Fix incorrect DB offset calculation
  RDMA/bnxt_re: Unconditionly fence non wire memory operations
  ...
2018-03-10 08:38:01 -08:00
Linus Torvalds
b3337a6c35 Merge tag 'platform-drivers-x86-v4.16-6' of git://git.infradead.org/linux-platform-drivers-x86
Pull x86 platform driver fixes from Darren Hart:
 "Correct a module loading race condition between the DELL_SMBIOS
  backend modules and the first user by converting them to bool features
  of the DELL_SMBIOS driver. Fixup the resulting Kconfig dependency
  issue with DCDBAS"

* tag 'platform-drivers-x86-v4.16-6' of git://git.infradead.org/linux-platform-drivers-x86:
  platform/x86: dell-smbios: Resolve dependency error on DCDBAS
  platform/x86: Allow for SMBIOS backend defaults
  platform/x86: dell-smbios: Link all dell-smbios-* modules together
  platform/x86: dell-smbios: Rename dell-smbios source to dell-smbios-base
  platform/x86: dell-smbios: Correct some style warnings
2018-03-10 08:35:29 -08:00
Linus Torvalds
cdb06e9d8f Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM fixes from Radim Krčmář:
 "PPC:

   - Fix guest time accounting in the host

   - Fix large-page backing for radix guests on POWER9

   - Fix HPT guests on POWER9 backed by 2M or 1G pages

   - Compile fixes for some configs and gcc versions

  s390:

   - Fix random memory corruption when running as guest2 (e.g. KVM in
     LPAR) and starting guest3 (e.g. nested KVM) with many CPUs

   - Export forgotten io interrupt delivery statistics counter"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: s390: fix memory overwrites when not using SCA entries
  KVM: PPC: Book3S HV: Fix guest time accounting with VIRT_CPU_ACCOUNTING_GEN
  KVM: PPC: Book3S HV: Fix VRMA initialization with 2MB or 1GB memory backing
  KVM: PPC: Book3S HV: Fix handling of large pages in radix page fault handler
  KVM: s390: provide io interrupt kvm_stat
  KVM: PPC: Book3S: Fix compile error that occurs with some gcc versions
  KVM: PPC: Fix compile error that occurs when CONFIG_ALTIVEC=n
2018-03-09 16:59:19 -08:00
Linus Torvalds
39614481fb Merge tag 'for-linus-4.16a-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen fix from Juergen Gross:
 "Just one fix for the correct error handling after a failed
  device_register()"

* tag 'for-linus-4.16a-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
  xen: xenbus: use put_device() instead of kfree()
2018-03-09 16:54:18 -08:00
Linus Torvalds
4178802c77 Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 fixes from Catalin Marinas:

 - The SMCCC firmware interface for the spectre variant 2 mitigation has
   been updated to allow the discovery of whether the CPU needs the
   workaround. This pull request relaxes the kernel check on the return
   value from firmware.

 - Fix the commit allowing changing from global to non-global page table
   entries which inadvertently disallowed other safe attribute changes.

 - Fix sleeping in atomic during the arm_perf_teardown_cpu() code.

* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
  arm64: Relax ARM_SMCCC_ARCH_WORKAROUND_1 discovery
  arm_pmu: Use disable_irq_nosync when disabling SPI in CPU teardown hook
  arm64: mm: fix thinko in non-global page table attribute check
2018-03-09 16:49:30 -08:00
Linus Torvalds
ed3c4dff8d Merge tag 'docs-4.16-fix' of git://git.lwn.net/linux
Pull Documentation build fix from Jonathan Corbet:
 "The Sphinx 1.7 release broke the build process for reasons that are
  mostly our fault.

  This is a single fix cherry-picked from docs-next that restores docs
  buildability for all supported Sphinx versions"

* tag 'docs-4.16-fix' of git://git.lwn.net/linux:
  Documentation/sphinx: Fix Directive import error
2018-03-09 16:45:57 -08:00
Linus Torvalds
cfc79ae844 Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
 "8 fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  lib/test_kmod.c: fix limit check on number of test devices created
  selftests/vm/run_vmtests: adjust hugetlb size according to nr_cpus
  mm/page_alloc: fix memmap_init_zone pageblock alignment
  mm/memblock.c: hardcode the end_pfn being -1
  mm/gup.c: teach get_user_pages_unlocked to handle FOLL_NOWAIT
  lib/bug.c: exclude non-BUG/WARN exceptions from report_bug()
  bug: use %pB in BUG and stack protector failure
  hugetlb: fix surplus pages accounting
2018-03-09 16:42:25 -08:00
Luis R. Rodriguez
ac68b1b3b9 lib/test_kmod.c: fix limit check on number of test devices created
As reported by Dan the parentheses is in the wrong place, and since
unlikely() call returns either 0 or 1 it's never less than zero.  The
second issue is that signed integer overflows like "INT_MAX + 1" are
undefined behavior.

Since num_test_devs represents the number of devices, we want to stop
prior to hitting the max, and not rely on the wrap arround at all.  So
just cap at num_test_devs + 1, prior to assigning a new device.

Link: http://lkml.kernel.org/r/20180224030046.24238-1-mcgrof@kernel.org
Fixes: d9c6a72d6f ("kmod: add test driver to stress test the module loader")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Luis R. Rodriguez <mcgrof@kernel.org>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-03-09 16:40:02 -08:00
Li Zhijian
0627be7d3c selftests/vm/run_vmtests: adjust hugetlb size according to nr_cpus
Fix userfaultfd_hugetlb on hosts which have more than 64 cpus.

  ---------------------------
  running userfaultfd_hugetlb
  ---------------------------
  invalid MiB
  Usage: <MiB> <bounces>
  [FAIL]

Via userfaultfd.c we can know, hugetlb_size needs to meet hugetlb_size
>= nr_cpus * hugepage_size.  hugepage_size is often 2M, so when host
cpus > 64, it requires more than 128M.

[zhijianx.li@intel.com: update changelog/comments and variable name]
 Link: http://lkml.kernel.org/r/20180302024356.83359-1-zhijianx.li@intel.com
 Link: http://lkml.kernel.org/r/20180303125027.81638-1-zhijianx.li@intel.com
Link: http://lkml.kernel.org/r/20180302024356.83359-1-zhijianx.li@intel.com
Signed-off-by: Li Zhijian <zhijianx.li@intel.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: SeongJae Park <sj38.park@gmail.com>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-03-09 16:40:01 -08:00
Daniel Vacek
864b75f9d6 mm/page_alloc: fix memmap_init_zone pageblock alignment
Commit b92df1de5d ("mm: page_alloc: skip over regions of invalid pfns
where possible") introduced a bug where move_freepages() triggers a
VM_BUG_ON() on uninitialized page structure due to pageblock alignment.
To fix this, simply align the skipped pfns in memmap_init_zone() the
same way as in move_freepages_block().

Seen in one of the RHEL reports:

  crash> log | grep -e BUG -e RIP -e Call.Trace -e move_freepages_block -e rmqueue -e freelist -A1
  kernel BUG at mm/page_alloc.c:1389!
  invalid opcode: 0000 [#1] SMP
  --
  RIP: 0010:[<ffffffff8118833e>]  [<ffffffff8118833e>] move_freepages+0x15e/0x160
  RSP: 0018:ffff88054d727688  EFLAGS: 00010087
  --
  Call Trace:
   [<ffffffff811883b3>] move_freepages_block+0x73/0x80
   [<ffffffff81189e63>] __rmqueue+0x263/0x460
   [<ffffffff8118c781>] get_page_from_freelist+0x7e1/0x9e0
   [<ffffffff8118caf6>] __alloc_pages_nodemask+0x176/0x420
  --
  RIP  [<ffffffff8118833e>] move_freepages+0x15e/0x160
   RSP <ffff88054d727688>

  crash> page_init_bug -v | grep RAM
  <struct resource 0xffff88067fffd2f8>          1000 -        9bfff	System RAM (620.00 KiB)
  <struct resource 0xffff88067fffd3a0>        100000 -     430bffff	System RAM (  1.05 GiB = 1071.75 MiB = 1097472.00 KiB)
  <struct resource 0xffff88067fffd410>      4b0c8000 -     4bf9cfff	System RAM ( 14.83 MiB = 15188.00 KiB)
  <struct resource 0xffff88067fffd480>      4bfac000 -     646b1fff	System RAM (391.02 MiB = 400408.00 KiB)
  <struct resource 0xffff88067fffd560>      7b788000 -     7b7fffff	System RAM (480.00 KiB)
  <struct resource 0xffff88067fffd640>     100000000 -    67fffffff	System RAM ( 22.00 GiB)

  crash> page_init_bug | head -6
  <struct resource 0xffff88067fffd560>      7b788000 -     7b7fffff	System RAM (480.00 KiB)
  <struct page 0xffffea0001ede200>   1fffff00000000  0 <struct pglist_data 0xffff88047ffd9000> 1 <struct zone 0xffff88047ffd9800> DMA32          4096    1048575
  <struct page 0xffffea0001ede200> 505736 505344 <struct page 0xffffea0001ed8000> 505855 <struct page 0xffffea0001edffc0>
  <struct page 0xffffea0001ed8000>                0  0 <struct pglist_data 0xffff88047ffd9000> 0 <struct zone 0xffff88047ffd9000> DMA               1       4095
  <struct page 0xffffea0001edffc0>   1fffff00000400  0 <struct pglist_data 0xffff88047ffd9000> 1 <struct zone 0xffff88047ffd9800> DMA32          4096    1048575
  BUG, zones differ!

Note that this range follows two not populated sections
68000000-77ffffff in this zone.  7b788000-7b7fffff is the first one
after a gap.  This makes memmap_init_zone() skip all the pfns up to the
beginning of this range.  But this range is not pageblock (2M) aligned.
In fact no range has to be.

  crash> kmem -p 77fff000 78000000 7b5ff000 7b600000 7b787000 7b788000
        PAGE        PHYSICAL      MAPPING       INDEX CNT FLAGS
  ffffea0001e00000  78000000                0        0  0 0
  ffffea0001ed7fc0  7b5ff000                0        0  0 0
  ffffea0001ed8000  7b600000                0        0  0 0	<<<<
  ffffea0001ede1c0  7b787000                0        0  0 0
  ffffea0001ede200  7b788000                0        0  1 1fffff00000000

Top part of page flags should contain nodeid and zonenr, which is not
the case for page ffffea0001ed8000 here (<<<<).

  crash> log | grep -o fffea0001ed[^\ ]* | sort -u
  fffea0001ed8000
  fffea0001eded20
  fffea0001edffc0

  crash> bt -r | grep -o fffea0001ed[^\ ]* | sort -u
  fffea0001ed8000
  fffea0001eded00
  fffea0001eded20
  fffea0001edffc0

Initialization of the whole beginning of the section is skipped up to
the start of the range due to the commit b92df1de5d.  Now any code
calling move_freepages_block() (like reusing the page from a freelist as
in this example) with a page from the beginning of the range will get
the page rounded down to start_page ffffea0001ed8000 and passed to
move_freepages() which crashes on assertion getting wrong zonenr.

  >         VM_BUG_ON(page_zone(start_page) != page_zone(end_page));

Note, page_zone() derives the zone from page flags here.

From similar machine before commit b92df1de5d:

  crash> kmem -p 77fff000 78000000 7b5ff000 7b600000 7b7fe000 7b7ff000
        PAGE        PHYSICAL      MAPPING       INDEX CNT FLAGS
  fffff73941e00000  78000000                0        0  1 1fffff00000000
  fffff73941ed7fc0  7b5ff000                0        0  1 1fffff00000000
  fffff73941ed8000  7b600000                0        0  1 1fffff00000000
  fffff73941edff80  7b7fe000                0        0  1 1fffff00000000
  fffff73941edffc0  7b7ff000 ffff8e67e04d3ae0     ad84  1 1fffff00020068 uptodate,lru,active,mappedtodisk

All the pages since the beginning of the section are initialized.
move_freepages()' not gonna blow up.

The same machine with this fix applied:

  crash> kmem -p 77fff000 78000000 7b5ff000 7b600000 7b7fe000 7b7ff000
        PAGE        PHYSICAL      MAPPING       INDEX CNT FLAGS
  ffffea0001e00000  78000000                0        0  0 0
  ffffea0001e00000  7b5ff000                0        0  0 0
  ffffea0001ed8000  7b600000                0        0  1 1fffff00000000
  ffffea0001edff80  7b7fe000                0        0  1 1fffff00000000
  ffffea0001edffc0  7b7ff000 ffff88017fb13720        8  2 1fffff00020068 uptodate,lru,active,mappedtodisk

At least the bare minimum of pages is initialized preventing the crash
as well.

Customers started to report this as soon as 7.4 (where b92df1de5d was
merged in RHEL) was released.  I remember reports from
September/October-ish times.  It's not easily reproduced and happens on
a handful of machines only.  I guess that's why.  But that does not make
it less serious, I think.

Though there actually is a report here:
  https://bugzilla.kernel.org/show_bug.cgi?id=196443

And there are reports for Fedora from July:
  https://bugzilla.redhat.com/show_bug.cgi?id=1473242
and CentOS:
  https://bugs.centos.org/view.php?id=13964
and we internally track several dozens reports for RHEL bug
  https://bugzilla.redhat.com/show_bug.cgi?id=1525121

Link: http://lkml.kernel.org/r/0485727b2e82da7efbce5f6ba42524b429d0391a.1520011945.git.neelx@redhat.com
Fixes: b92df1de5d ("mm: page_alloc: skip over regions of invalid pfns where possible")
Signed-off-by: Daniel Vacek <neelx@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Paul Burton <paul.burton@imgtec.com>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-03-09 16:40:01 -08:00
Daniel Vacek
379b03b7fa mm/memblock.c: hardcode the end_pfn being -1
This is just a cleanup.  It aids handling the special end case in the
next commit.

[akpm@linux-foundation.org: make it work against current -linus, not against -mm]
[akpm@linux-foundation.org: make it work against current -linus, not against -mm some more]
Link: http://lkml.kernel.org/r/1ca478d4269125a99bcfb1ca04d7b88ac1aee924.1520011944.git.neelx@redhat.com
Signed-off-by: Daniel Vacek <neelx@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Paul Burton <paul.burton@imgtec.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-03-09 16:40:01 -08:00
Andrea Arcangeli
96312e6128 mm/gup.c: teach get_user_pages_unlocked to handle FOLL_NOWAIT
KVM is hanging during postcopy live migration with userfaultfd because
get_user_pages_unlocked is not capable to handle FOLL_NOWAIT.

Earlier FOLL_NOWAIT was only ever passed to get_user_pages.

Specifically faultin_page (the callee of get_user_pages_unlocked caller)
doesn't know that if FAULT_FLAG_RETRY_NOWAIT was set in the page fault
flags, when VM_FAULT_RETRY is returned, the mmap_sem wasn't actually
released (even if nonblocking is not NULL).  So it sets *nonblocking to
zero and the caller won't release the mmap_sem thinking it was already
released, but it wasn't because of FOLL_NOWAIT.

Link: http://lkml.kernel.org/r/20180302174343.5421-2-aarcange@redhat.com
Fixes: ce53053ce3 ("kvm: switch get_user_page_nowait() to get_user_pages_unlocked()")
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Tested-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-03-09 16:40:01 -08:00