Commit Graph

1106009 Commits

Author SHA1 Message Date
Darrick J. Wong
dd81dc0559 Merge tag 'xfs-cil-scale-5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs into xfs-5.20-mergeA
xfs: improve CIL scalability

This series aims to improve the scalability of XFS transaction
commits on large CPU count machines. My 32p machine hits contention
limits in xlog_cil_commit() at about 700,000 transaction commits a
section. It hits this at 16 thread workloads, and 32 thread
workloads go no faster and just burn CPU on the CIL spinlocks.

This patchset gets rid of spinlocks and global serialisation points
in the xlog_cil_commit() path. It does this by moving to a
combination of per-cpu counters, unordered per-cpu lists and
post-ordered per-cpu lists.

This results in transaction commit rates exceeding 1.4 million
commits/s under unlink certain workloads, and while the log lock
contention is largely gone there is still significant lock
contention in the VFS (dentry cache, inode cache and security layers)
at >600,000 transactions/s that still limit scalability.

The changes to the CIL accounting and behaviour, combined with the
structural changes to xlog_write() in prior patchsets make the
per-cpu restructuring possible and sane. This allows us to move to
precalculated reservation requirements that allow for reservation
stealing to be accounted across multiple CPUs accurately.

That is, instead of trying to account for continuation log opheaders
on a "growth" basis, we pre-calculate how many iclogs we'll need to
write out a maximally sized CIL checkpoint and steal that reserveD
that space one commit at a time until the CIL has a full
reservation. If we ever run a commit when we are already at the hard
limit (because post-throttling) we simply take an extra reservation
from each commit that is run when over the limit. Hence we don't
need to do space usage math in the fast path and so never need to
sum the per-cpu counters in this fast path.

Similarly, per-cpu lists have the problem of ordering - we can't
remove an item from a per-cpu list if we want to move it forward in
the CIL. We solve this problem by using an atomic counter to give
every commit a sequence number that is copied into the log items in
that transaction. Hence relogging items just overwrites the sequence
number in the log item, and does not move it in the per-cpu lists.
Once we reaggregate the per-cpu lists back into a single list in the
CIL push work, we can run it through list-sort() and reorder it back
into a globally ordered list. This costs a bit of CPU time, but now
that the CIL can run multiple works and pipelines properly, this is
not a limiting factor for performance. It does increase fsync
latency when the CIL is full, but workloads issuing large numbers of
fsync()s or sync transactions end up with very small CILs and so the
latency impact or sorting is not measurable for such workloads.

OVerall, this pushes the transaction commit bottleneck out to the
lockless reservation grant head updates. These atomic updates don't
start to be a limiting fact until > 1.5 million transactions/s are
being run, at which point the accounting functions start to show up
in profiles as the highest CPU users. Still, this series doubles
transaction throughput without increasing CPU usage before we get
to that cacheline contention breakdown point...
`
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>

* tag 'xfs-cil-scale-5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs:
  xfs: expanding delayed logging design with background material
  xfs: xlog_sync() manually adjusts grant head space
  xfs: avoid cil push lock if possible
  xfs: move CIL ordering to the logvec chain
  xfs: convert log vector chain to use list heads
  xfs: convert CIL to unordered per cpu lists
  xfs: Add order IDs to log items in CIL
  xfs: convert CIL busy extents to per-cpu
  xfs: track CIL ticket reservation in percpu structure
  xfs: implement percpu cil space used calculation
  xfs: introduce per-cpu CIL tracking structure
  xfs: rework per-iclog header CIL reservation
  xfs: lift init CIL reservation out of xc_cil_lock
  xfs: use the CIL space used counter for emptiness checks
2022-07-09 10:55:21 -07:00
Dave Chinner
51a117edff xfs: expanding delayed logging design with background material
I wrote up a description of how transactions, space reservations and
relogging work together in response to a question for background
material on the delayed logging design. Add this to the existing
document for ease of future reference.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2022-07-07 18:56:09 +10:00
Dave Chinner
d9f68777b2 xfs: xlog_sync() manually adjusts grant head space
When xlog_sync() rounds off the tail the iclog that is being
flushed, it manually subtracts that space from the grant heads. This
space is actually reserved by the transaction ticket that covers
the xlog_sync() call from xlog_write(), but we don't plumb the
ticket down far enough for it to account for the space consumed in
the current log ticket.

The grant heads are hot, so we really should be accounting this to
the ticket is we can, rather than adding thousands of extra grant
head updates every CIL commit.

Interestingly, this actually indicates a potential log space overrun
can occur when we force the log. By the time that xfs_log_force()
pushes out an active iclog and consumes the roundoff space, the
reservation for that roundoff space has been returned to the grant
heads and is no longer covered by a reservation. In theory the
roundoff added to log force on an already full log could push the
write head past the tail. In practice, the CIL commit that writes to
the log and needs the iclog pushed will have reserved space for
roundoff, so when it releases the ticket there will still be
physical space for the roundoff to be committed to the log, even
though it is no longer reserved. This roundoff won't be enough space
to allow a transaction to be woken if the log is full, so overruns
should not actually occur in practice.

That said, it indicates that we should not release the CIL context
log ticket until after we've released the commit iclog. It also
means that xlog_sync() still needs the direct grant head
manipulation if we don't provide it with a ticket. Log forces are
rare when we are in fast paths running 1.5 million transactions/s
that make the grant heads hot, so let's optimise the hot case and
pass CIL log tickets down to the xlog_sync() code.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2022-07-07 18:56:09 +10:00
Dave Chinner
1ccb0745a9 xfs: avoid cil push lock if possible
Because now it hurts when the CIL fills up.

  - 37.20% __xfs_trans_commit
      - 35.84% xfs_log_commit_cil
         - 19.34% _raw_spin_lock
            - do_raw_spin_lock
                 19.01% __pv_queued_spin_lock_slowpath
         - 4.20% xfs_log_ticket_ungrant
              0.90% xfs_log_space_wake


Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2022-07-07 18:56:08 +10:00
Dave Chinner
4eb56069cb xfs: move CIL ordering to the logvec chain
Adding a list_sort() call to the CIL push work while the xc_ctx_lock
is held exclusively has resulted in fairly long lock hold times and
that stops all front end transaction commits from making progress.

We can move the sorting out of the xc_ctx_lock if we can transfer
the ordering information to the log vectors as they are detached
from the log items and then we can sort the log vectors.  With these
changes, we can move the list_sort() call to just before we call
xlog_write() when we aren't holding any locks at all.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2022-07-07 18:56:08 +10:00
Dave Chinner
169248536a xfs: convert log vector chain to use list heads
Because the next change is going to require sorting log vectors, and
that requires arbitrary rearrangement of the list which cannot be
done easily with a single linked list.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2022-07-07 18:55:59 +10:00
Dave Chinner
c0fb4765c5 xfs: convert CIL to unordered per cpu lists
So that we can remove the cil_lock which is a global serialisation
point. We've already got ordering sorted, so all we need to do is
treat the CIL list like the busy extent list and reconstruct it
before the push starts.

This is what we're trying to avoid:

 -   75.35%     1.83%  [kernel]            [k] xfs_log_commit_cil
    - 46.35% xfs_log_commit_cil
       - 41.54% _raw_spin_lock
          - 67.30% do_raw_spin_lock
               66.96% __pv_queued_spin_lock_slowpath

Which happens on a 32p system when running a 32-way 'rm -rf'
workload. After this patch:

-   20.90%     3.23%  [kernel]               [k] xfs_log_commit_cil
   - 17.67% xfs_log_commit_cil
      - 6.51% xfs_log_ticket_ungrant
           1.40% xfs_log_space_wake
        2.32% memcpy_erms
      - 2.18% xfs_buf_item_committing
         - 2.12% xfs_buf_item_release
            - 1.03% xfs_buf_unlock
                 0.96% up
              0.72% xfs_buf_rele
        1.33% xfs_inode_item_format
        1.19% down_read
        0.91% up_read
        0.76% xfs_buf_item_format
      - 0.68% kmem_alloc_large
         - 0.67% kmem_alloc
              0.64% __kmalloc
        0.50% xfs_buf_item_size

It kinda looks like the workload is running out of log space all
the time. But all the spinlock contention is gone and the
transaction commit rate has gone from 800k/s to 1.3M/s so the amount
of real work being done has gone up a *lot*.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2022-07-07 18:54:59 +10:00
Dave Chinner
016a23388c xfs: Add order IDs to log items in CIL
Before we split the ordered CIL up into per cpu lists, we need a
mechanism to track the order of the items in the CIL. We need to do
this because there are rules around the order in which related items
must physically appear in the log even inside a single checkpoint
transaction.

An example of this is intents - an intent must appear in the log
before it's intent done record so that log recovery can cancel the
intent correctly. If we have these two records misordered in the
CIL, then they will not be recovered correctly by journal replay.

We also will not be able to move items to the tail of
the CIL list when they are relogged, hence the log items will need
some mechanism to allow the correct log item order to be recreated
before we write log items to the hournal.

Hence we need to have a mechanism for recording global order of
transactions in the log items  so that we can recover that order
from un-ordered per-cpu lists.

Do this with a simple monotonic increasing commit counter in the CIL
context. Each log item in the transaction gets stamped with the
current commit order ID before it is added to the CIL. If the item
is already in the CIL, leave it where it is instead of moving it to
the tail of the list and instead sort the list before we start the
push work.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2022-07-07 18:53:59 +10:00
Dave Chinner
df7a4a2134 xfs: convert CIL busy extents to per-cpu
To get them out from under the CIL lock.

This is an unordered list, so we can simply punt it to per-cpu lists
during transaction commits and reaggregate it back into a single
list during the CIL push work.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2022-07-07 18:52:59 +10:00
Dave Chinner
1dd2a2c18e xfs: track CIL ticket reservation in percpu structure
To get it out from under the cil spinlock.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2022-07-07 18:51:59 +10:00
Dave Chinner
7c8ade2121 xfs: implement percpu cil space used calculation
Now that we have the CIL percpu structures in place, implement the
space used counter as a per-cpu counter.

We have to be really careful now about ensuring that the checks and
updates run without arbitrary delays, which means they need to run
with pre-emption disabled. We do this by careful placement of
the get_cpu_ptr/put_cpu_ptr calls to access the per-cpu structures
for that CPU.

We need to be able to reliably detect that the CIL has reached
the hard limit threshold so we can take extra reservations for the
iclog headers when the space used overruns the original reservation.
hence we factor out xlog_cil_over_hard_limit() from
xlog_cil_push_background().

The global CIL space used is an atomic variable that is backed by
per-cpu aggregation to minimise the number of atomic updates we do
to the global state in the fast path. While we are under the soft
limit, we aggregate only when the per-cpu aggregation is over the
proportion of the soft limit assigned to that CPU. This means that
all CPUs can use all but one byte of their aggregation threshold
and we will not go over the soft limit.

Hence once we detect that we've gone over both a per-cpu aggregation
threshold and the soft limit, we know that we have only
exceeded the soft limit by one per-cpu aggregation threshold. Even
if all CPUs hit this at the same time, we can't be over the hard
limit, so we can run an aggregation back into the atomic counter
at this point and still be under the hard limit.

At this point, we will be over the soft limit and hence we'll
aggregate into the global atomic used space directly rather than the
per-cpu counters, hence providing accurate detection of hard limit
excursion for accounting and reservation purposes.

Hence we get the best of both worlds - lockless, scalable per-cpu
fast path plus accurate, atomic detection of hard limit excursion.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2022-07-07 18:50:59 +10:00
Linus Torvalds
88084a3df1 Linux 5.19-rc5 v5.19-rc5 2022-07-03 15:39:28 -07:00
Linus Torvalds
b8d5109f50 lockref: remove unused 'lockref_get_or_lock()' function
Looking at the conditional lock acquire functions in the kernel due to
the new sparse support (see commit 4a557a5d1a "sparse: introduce
conditional lock acquire function attribute"), it became obvious that
the lockref code has a couple of them, but they don't match the usual
naming convention for the other ones, and their return value logic is
also reversed.

In the other very similar places, the naming pattern is '*_and_lock()'
(eg 'atomic_put_and_lock()' and 'refcount_dec_and_lock()'), and the
function returns true when the lock is taken.

The lockref code is superficially very similar to the refcount code,
only with the special "atomic wrt the embedded lock" semantics.  But
instead of the '*_and_lock()' naming it uses '*_or_lock()'.

And instead of returning true in case it took the lock, it returns true
if it *didn't* take the lock.

Now, arguably the reflock code is quite logical: it really is a "either
decrement _or_ lock" kind of situation - and the return value is about
whether the operation succeeded without any special care needed.

So despite the similarities, the differences do make some sense, and
maybe it's not worth trying to unify the different conditional locking
primitives in this area.

But while looking at this all, it did become obvious that the
'lockref_get_or_lock()' function hasn't actually had any users for
almost a decade.

The only user it ever had was the shortlived 'd_rcu_to_refcount()'
function, and it got removed and replaced with 'lockref_get_not_dead()'
back in 2013 in commits 0d98439ea3 ("vfs: use lockred 'dead' flag to
mark unrecoverably dead dentries") and e5c832d555 ("vfs: fix dentry
RCU to refcounting possibly sleeping dput()")

In fact, that single use was removed less than a week after the whole
function was introduced in commit b3abd80250 ("lockref: add
'lockref_get_or_lock() helper") so this function has been around for a
decade, but only had a user for six days.

Let's just put this mis-designed and unused function out of its misery.

We can think about the naming and semantic oddities of the remaining
'lockref_put_or_lock()' later, but at least that function has users.

And while the naming is different and the return value doesn't match,
that function matches the whole '{atomic,refcount}_dec_and_test()'
pattern much better (ie the magic happens when the count goes down to
zero, not when it is incremented from zero).

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-07-03 14:40:28 -07:00
Linus Torvalds
4a557a5d1a sparse: introduce conditional lock acquire function attribute
The kernel tends to try to avoid conditional locking semantics because
it makes it harder to think about and statically check locking rules,
but we do have a few fundamental locking primitives that take locks
conditionally - most obviously the 'trylock' functions.

That has always been a problem for 'sparse' checking for locking
imbalance, and we've had a special '__cond_lock()' macro that we've used
to let sparse know how the locking works:

    # define __cond_lock(x,c)        ((c) ? ({ __acquire(x); 1; }) : 0)

so that you can then use this to tell sparse that (for example) the
spinlock trylock macro ends up acquiring the lock when it succeeds, but
not when it fails:

    #define raw_spin_trylock(lock)  __cond_lock(lock, _raw_spin_trylock(lock))

and then sparse can follow along the locking rules when you have code like

        if (!spin_trylock(&dentry->d_lock))
                return LRU_SKIP;
	.. sparse sees that the lock is held here..
        spin_unlock(&dentry->d_lock);

and sparse ends up happy about the lock contexts.

However, this '__cond_lock()' use does result in very ugly header files,
and requires you to basically wrap the real function with that macro
that uses '__cond_lock'.  Which has made PeterZ NAK things that try to
fix sparse warnings over the years [1].

To solve this, there is now a very experimental patch to sparse that
basically does the exact same thing as '__cond_lock()' did, but using a
function attribute instead.  That seems to make PeterZ happy [2].

Note that this does not replace existing use of '__cond_lock()', but
only exposes the new proposed attribute and uses it for the previously
unannotated 'refcount_dec_and_lock()' family of functions.

For existing sparse installations, this will make no difference (a
negative output context was ignored), but if you have the experimental
sparse patch it will make sparse now understand code that uses those
functions, the same way '__cond_lock()' makes sparse understand the very
similar 'atomic_dec_and_lock()' uses that have the old '__cond_lock()'
annotations.

Note that in some cases this will silence existing context imbalance
warnings.  But in other cases it may end up exposing new sparse warnings
for code that sparse just didn't see the locking for at all before.

This is a trial, in other words.  I'd expect that if it ends up being
successful, and new sparse releases end up having this new attribute,
we'll migrate the old-style '__cond_lock()' users to use the new-style
'__cond_acquires' function attribute.

The actual experimental sparse patch was posted in [3].

Link: https://lore.kernel.org/all/20130930134434.GC12926@twins.programming.kicks-ass.net/ [1]
Link: https://lore.kernel.org/all/Yr60tWxN4P568x3W@worktop.programming.kicks-ass.net/ [2]
Link: https://lore.kernel.org/all/CAHk-=wjZfO9hGqJ2_hGQG3U_XzSh9_XaXze=HgPdvJbgrvASfA@mail.gmail.com/ [3]
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Alexander Aring <aahringo@redhat.com>
Cc: Luc Van Oostenryck <luc.vanoostenryck@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-07-03 11:32:22 -07:00
Linus Torvalds
20855e4cb3 Merge tag 'xfs-5.19-fixes-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fixes from Darrick Wong:
 "This fixes some stalling problems and corrects the last of the
  problems (I hope) observed during testing of the new atomic xattr
  update feature.

   - Fix statfs blocking on background inode gc workers

   - Fix some broken inode lock assertion code

   - Fix xattr leaf buffer leaks when cancelling a deferred xattr update
     operation

   - Clean up xattr recovery to make it easier to understand.

   - Fix xattr leaf block verifiers tripping over empty blocks.

   - Remove complicated and error prone xattr leaf block bholding mess.

   - Fix a bug where an rt extent crossing EOF was treated as "posteof"
     blocks and cleaned unnecessarily.

   - Fix a UAF when log shutdown races with unmount"

* tag 'xfs-5.19-fixes-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: prevent a UAF when log IO errors race with unmount
  xfs: dont treat rt extents beyond EOF as eofblocks to be cleared
  xfs: don't hold xattr leaf buffers across transaction rolls
  xfs: empty xattr leaf header blocks are not corruption
  xfs: clean up the end of xfs_attri_item_recover
  xfs: always free xattri_leaf_bp when cancelling a deferred op
  xfs: use invalidate_lock to check the state of mmap_lock
  xfs: factor out the common lock flags assert
  xfs: introduce xfs_inodegc_push()
  xfs: bound maximum wait time for inodegc work
2022-07-03 09:42:17 -07:00
Linus Torvalds
69cb6c6556 Merge tag 'nfsd-5.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull nfsd fixes from Chuck Lever:
 "Notable regression fixes:

   - Fix NFSD crash during NFSv4.2 READ_PLUS operation

   - Fix incorrect status code returned by COMMIT operation"

* tag 'nfsd-5.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
  SUNRPC: Fix READ_PLUS crasher
  NFSD: restore EINVAL error translation in nfsd_commit()
2022-07-02 11:20:56 -07:00
Linus Torvalds
34074da542 Merge tag 'for-5.19/parisc-4' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux
Pull parisc architecture fixes from Helge Deller:
 "Two important fixes for bugs in code which was added in 5.18:

   - Fix userspace signal failures on 32-bit kernel due to a bug in vDSO

   - Fix 32-bit load-word unalignment exception handler which returned
     wrong values"

* tag 'for-5.19/parisc-4' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
  parisc: Fix vDSO signal breakage on 32-bit kernel
  parisc/unaligned: Fix emulate_ldw() breakage
2022-07-02 10:23:36 -07:00
Helge Deller
aa78fa905b parisc: Fix vDSO signal breakage on 32-bit kernel
Addition of vDSO support for parisc in kernel v5.18 suddenly broke glibc
signal testcases on a 32-bit kernel.

The trampoline code (sigtramp.S) which is mapped into userspace includes
an offset to the context data on the stack, which is used by gdb and
glibc to get access to registers.

In a 32-bit kernel we used by mistake the offset into the compat context
(which is valid on a 64-bit kernel only) instead of the offset into the
"native" 32-bit context.

Reported-by: John David Anglin <dave.anglin@bell.net>
Tested-by: John David Anglin <dave.anglin@bell.net>
Fixes: 	df24e1783e ("parisc: Add vDSO support")
CC: stable@vger.kernel.org # 5.18
Signed-off-by: Helge Deller <deller@gmx.de>
2022-07-02 18:36:58 +02:00
Linus Torvalds
bb7c512687 Merge tag 'perf-tools-fixes-for-v5.19-2022-07-02' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux
Pull perf tools fixes from Arnaldo Carvalho de Melo:

 - BPF program info linear (BPIL) data is accessed assuming 64-bit
   alignment resulting in undefined behavior as the data is just byte
   aligned. Fix it, Found using -fsanitize=undefined.

 - Fix 'perf offcpu' build on old kernels wrt task_struct's
   state/__state field.

 - Fix perf_event_attr.sample_type setting on the 'offcpu-time' event
   synthesized by the 'perf offcpu' tool.

 - Don't bail out when synthesizing PERF_RECORD_ events for pre-existing
   threads when one goes away while parsing its procfs entries.

 - Don't sort the task scan result from /proc, its not needed and
   introduces bugs when the main thread isn't the first one to be
   processed.

 - Fix uninitialized 'offset' variable on aarch64 in the unwind code.

 - Sync KVM headers with the kernel sources.

* tag 'perf-tools-fixes-for-v5.19-2022-07-02' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux:
  perf synthetic-events: Ignore dead threads during event synthesis
  perf synthetic-events: Don't sort the task scan result from /proc
  perf unwind: Fix unitialized 'offset' variable on aarch64
  tools headers UAPI: Sync linux/kvm.h with the kernel sources
  perf bpf: 8 byte align bpil data
  tools kvm headers arm64: Update KVM headers from the kernel sources
  perf offcpu: Accept allowed sample types only
  perf offcpu: Fix build failure on old kernels
2022-07-02 09:28:36 -07:00
Linus Torvalds
5411de0733 Merge tag 'powerpc-5.19-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:

 - Fix BPF uapi confusion about the correct type of bpf_user_pt_regs_t.

 - Fix virt_addr_valid() when memory is hotplugged above the boot-time
   high_memory value.

 - Fix a bug in 64-bit Book3E map_kernel_page() which would incorrectly
   allocate a PMD page at PUD level.

 - Fix a couple of minor issues found since we enabled KASAN for 64-bit
   Book3S.

Thanks to Aneesh Kumar K.V, Cédric Le Goater, Christophe Leroy, Kefeng
Wang, Liam Howlett, Nathan Lynch, and Naveen N. Rao.

* tag 'powerpc-5.19-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  powerpc/memhotplug: Add add_pages override for PPC
  powerpc/bpf: Fix use of user_pt_regs in uapi
  powerpc/prom_init: Fix kernel config grep
  powerpc/book3e: Fix PUD allocation size in map_kernel_page()
  powerpc/xive/spapr: correct bitmap allocation size
2022-07-02 09:11:44 -07:00
Namhyung Kim
ff898552fb perf synthetic-events: Ignore dead threads during event synthesis
When it synthesize various task events, it scans the list of task
first and then accesses later.  There's a window threads can die
between the two and proc entries may not be available.

Instead of bailing out, we can ignore that thread and move on.

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Acked-by: Ian Rogers <irogers@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lore.kernel.org/lkml/20220701205458.985106-2-namhyung@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2022-07-02 09:22:26 -03:00
Namhyung Kim
363afa3aef perf synthetic-events: Don't sort the task scan result from /proc
It should not sort the result as procfs already returns a proper
ordering of tasks.  Actually sorting the order caused problems that it
doesn't guararantee to process the main thread first.

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Acked-by: Ian Rogers <irogers@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lore.kernel.org/lkml/20220701205458.985106-1-namhyung@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2022-07-02 09:21:41 -03:00
Ivan Babrou
5eb502b2e1 perf unwind: Fix unitialized 'offset' variable on aarch64
Commit dc2cf4ca86 ("perf unwind: Fix segbase for ld.lld linked
objects") uncovered the following issue on aarch64:

    util/unwind-libunwind-local.c: In function 'find_proc_info':
    util/unwind-libunwind-local.c:386:28: error: 'offset' may be used uninitialized in this function [-Werror=maybe-uninitialized]
    386 |                         if (ofs > 0) {
        |                            ^
    util/unwind-libunwind-local.c:199:22: note: 'offset' was declared here
    199 |         u64 address, offset;
        |                      ^~~~~~
    util/unwind-libunwind-local.c:371:20: error: 'offset' may be used uninitialized in this function [-Werror=maybe-uninitialized]
    371 |                 if (ofs <= 0) {
        |                    ^
    util/unwind-libunwind-local.c:199:22: note: 'offset' was declared here
    199 |         u64 address, offset;
        |                      ^~~~~~
    util/unwind-libunwind-local.c:363:20: error: 'offset' may be used uninitialized in this function [-Werror=maybe-uninitialized]
    363 |                 if (ofs <= 0) {
        |                    ^
    util/unwind-libunwind-local.c:199:22: note: 'offset' was declared here
    199 |         u64 address, offset;
        |                      ^~~~~~
    In file included from util/libunwind/arm64.c:37:

Fixes: dc2cf4ca86 ("perf unwind: Fix segbase for ld.lld linked objects")
Signed-off-by: Ivan Babrou <ivan@cloudflare.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Fangrui Song <maskray@google.com>
Cc: Ian Rogers <irogers@google.com>
Cc: James Clark <james.clark@arm.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: kernel-team@cloudflare.com
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lore.kernel.org/lkml/20220701182046.12589-1-ivan@cloudflare.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2022-07-02 09:16:52 -03:00
Linus Torvalds
0898660614 Merge tag 'libnvdimm-fixes-5.19-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm fix from Vishal Verma:

 - Fix a bug in the libnvdimm 'BTT' (Block Translation Table) driver
   where accounting for poison blocks to be cleared was off by one,
   causing a failure to clear the the last badblock in an nvdimm region.

* tag 'libnvdimm-fixes-5.19-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
  nvdimm: Fix badblocks clear off-by-one error
2022-07-01 16:58:19 -07:00
Linus Torvalds
1ce8c443e9 Merge tag 'thermal-5.19-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull thermal control fix from Rafael Wysocki:
 "Add a new CPU ID to the list of supported processors in the
  intel_tcc_cooling driver (Sumeet Pawnikar)"

* tag 'thermal-5.19-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  thermal: intel_tcc_cooling: Add TCC cooling support for RaptorLake
2022-07-01 13:00:47 -07:00
Linus Torvalds
9ee7827668 Merge tag 'pm-5.19-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fixes from Rafael Wysocki:
 "These fix some issues in cpufreq drivers and some issues in devfreq:

   - Fix error code path issues related PROBE_DEFER handling in devfreq
     (Christian Marangi)

   - Revert an editing accident in SPDX-License line in the devfreq
     passive governor (Lukas Bulwahn)

   - Fix refcount leak in of_get_devfreq_events() in the exynos-ppmu
     devfreq driver (Miaoqian Lin)

   - Use HZ_PER_KHZ macro in the passive devfreq governor (Yicong Yang)

   - Fix missing of_node_put for qoriq and pmac32 driver (Liang He)

   - Fix issues around throttle interrupt for qcom driver (Stephen Boyd)

   - Add MT8186 to cpufreq-dt-platdev blocklist (AngeloGioacchino Del
     Regno)

   - Make amd-pstate enable CPPC on resume from S3 (Jinzhou Su)"

* tag 'pm-5.19-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  PM / devfreq: passive: revert an editing accident in SPDX-License line
  PM / devfreq: Fix kernel warning with cpufreq passive register fail
  PM / devfreq: Rework freq_table to be local to devfreq struct
  PM / devfreq: exynos-ppmu: Fix refcount leak in of_get_devfreq_events
  PM / devfreq: passive: Use HZ_PER_KHZ macro in units.h
  PM / devfreq: Fix cpufreq passive unregister erroring on PROBE_DEFER
  PM / devfreq: Mute warning on governor PROBE_DEFER
  PM / devfreq: Fix kernel panic with cpu based scaling to passive gov
  cpufreq: Add MT8186 to cpufreq-dt-platdev blocklist
  cpufreq: pmac32-cpufreq: Fix refcount leak bug
  cpufreq: qcom-hw: Don't do lmh things without a throttle interrupt
  drivers: cpufreq: Add missing of_node_put() in qoriq-cpufreq.c
  cpufreq: amd-pstate: Add resume and suspend callbacks
2022-07-01 12:55:28 -07:00
Rafael J. Wysocki
bc621588ff Merge branch 'pm-cpufreq'
Merge cpufreq fixes for 5.19-rc5, including ARM cpufreq fixes and the
following one:

 - Make amd-pstate enable CPPC on resume from S3 (Jinzhou Su).

* pm-cpufreq:
  cpufreq: Add MT8186 to cpufreq-dt-platdev blocklist
  cpufreq: pmac32-cpufreq: Fix refcount leak bug
  cpufreq: qcom-hw: Don't do lmh things without a throttle interrupt
  drivers: cpufreq: Add missing of_node_put() in qoriq-cpufreq.c
  cpufreq: amd-pstate: Add resume and suspend callbacks
2022-07-01 21:43:08 +02:00
Linus Torvalds
b336ad598a Merge tag 'hwmon-for-v5.19-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging
Pull hwmon fixes from Guenter Roeck:

 - Fix error handling in ibmaem driver initialization

 - Fix bad data reported by occ driver after setting power cap

 - Fix typos in pmbus/ucd9200 driver comments

* tag 'hwmon-for-v5.19-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging:
  hwmon: (ibmaem) don't call platform_device_del() if platform_device_add() fails
  hwmon: (pmbus/ucd9200) fix typos in comments
  hwmon: (occ) Prevent power cap command overwriting poll response
2022-07-01 12:05:27 -07:00
Yang Yingliang
d0e51022a0 hwmon: (ibmaem) don't call platform_device_del() if platform_device_add() fails
If platform_device_add() fails, it no need to call platform_device_del(), split
platform_device_unregister() into platform_device_del/put(), so platform_device_put()
can be called separately.

Fixes: 8808a793f0 ("ibmaem: new driver for power/energy/temp meters in IBM System X hardware")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Link: https://lore.kernel.org/r/20220701074153.4021556-1-yangyingliang@huawei.com
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
2022-07-01 11:53:29 -07:00
Linus Torvalds
d0f67adb79 Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 fix from Catalin Marinas:
 "Restore TLB invalidation for the 'break-before-make' rule on
  contiguous ptes (missed in a recent clean-up)"

* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
  arm64: hugetlb: Restore TLB invalidation for BBM on contiguous ptes
2022-07-01 11:23:21 -07:00
Linus Torvalds
cec84e7547 Merge tag 's390-5.19-5' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 fixes from Alexander Gordeev:

 - Fix purgatory build process so bin2c tool does not get built
   unnecessarily and the Makefile is more consistent with other
   architectures.

 - Return earlier simple design of arch_get_random_seed_long|int() and
   arch_get_random_long|int() callbacks as result of changes in generic
   RNG code.

 - Fix minor comment typos and spelling mistakes.

* tag 's390-5.19-5' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
  s390/qdio: Fix spelling mistake
  s390/sclp: Fix typo in comments
  s390/archrandom: simplify back to earlier design and initialize earlier
  s390/purgatory: remove duplicated build rule of kexec-purgatory.o
  s390/purgatory: hard-code obj-y in Makefile
  s390: remove unneeded 'select BUILD_BIN2C'
2022-07-01 11:19:14 -07:00
Linus Torvalds
76ff294e16 Merge tag 'nfs-for-5.19-3' of git://git.linux-nfs.org/projects/anna/linux-nfs
Pull NFS client fixes from Anna Schumaker:

 - Allocate a fattr for _nfs4_discover_trunking()

 - Fix module reference count leak in nfs4_run_state_manager()

* tag 'nfs-for-5.19-3' of git://git.linux-nfs.org/projects/anna/linux-nfs:
  NFSv4: Add an fattr allocation to _nfs4_discover_trunking()
  NFS: restore module put when manager exits.
2022-07-01 11:11:32 -07:00
Linus Torvalds
6f8693ea2b Merge tag 'ceph-for-5.19-rc5' of https://github.com/ceph/ceph-client
Pull ceph fix from Ilya Dryomov:
 "A ceph filesystem fix, marked for stable.

  There appears to be a deeper issue on the MDS side, but for now we are
  going with this one-liner to avoid busy looping and potential soft
  lockups"

* tag 'ceph-for-5.19-rc5' of https://github.com/ceph/ceph-client:
  ceph: wait on async create before checking caps for syncfs
2022-07-01 11:06:21 -07:00
Linus Torvalds
8300d38030 Merge tag 'for-5.19/dm-fixes-5' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper fixes from Mike Snitzer:
 "Three fixes for invalid memory accesses discovered by using KASAN
  while running the lvm2 testsuite's dm-raid tests. Includes changes to
  MD's raid5.c given the dependency dm-raid has on the MD code"

* tag 'for-5.19/dm-fixes-5' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm raid: fix KASAN warning in raid5_add_disks
  dm raid: fix KASAN warning in raid5_remove_disk
  dm raid: fix accesses beyond end of raid member array
2022-07-01 10:58:39 -07:00
Linus Torvalds
0a35d1622d Merge tag 'io_uring-5.19-2022-07-01' of git://git.kernel.dk/linux-block
Pull io_uring fixes from Jens Axboe:
 "Two minor tweaks:

   - While we still can, adjust the send/recv based flags to be in
     ->ioprio rather than in ->addr2. This is consistent with eg accept,
     and also doesn't waste a full 64-bit field for flags (Pavel)

   - 5.18-stable fix for re-importing provided buffers. Not much real
     world relevance here as it'll only impact non-pollable files gone
     async, which is more of a practical test case rather than something
     that is used in the wild (Dylan)"

* tag 'io_uring-5.19-2022-07-01' of git://git.kernel.dk/linux-block:
  io_uring: fix provided buffer import
  io_uring: keep sendrecv flags in ioprio
2022-07-01 10:52:01 -07:00
Linus Torvalds
d516e221e2 Merge tag 'block-5.19-2022-07-01' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:

 - Fix for batch getting of tags in sbitmap (wuchi)

 - NVMe pull request via Christoph:
      - More quirks (Lamarque Vieira Souza, Pablo Greco)
      - Fix a fabrics disconnect regression (Ruozhu Li)
      - Fix a nvmet-tcp data_digest calculation regression (Sagi
        Grimberg)
      - Fix nvme-tcp send failure handling (Sagi Grimberg)
      - Fix a regression with nvmet-loop and passthrough controllers
        (Alan Adamson)

* tag 'block-5.19-2022-07-01' of git://git.kernel.dk/linux-block:
  nvme-pci: add NVME_QUIRK_BOGUS_NID for ADATA IM2P33F8ABR1
  nvmet: add a clear_ids attribute for passthru targets
  nvme: fix regression when disconnect a recovering ctrl
  nvme-pci: add NVME_QUIRK_BOGUS_NID for ADATA XPG SX6000LNP (AKA SPECTRIX S40G)
  nvme-tcp: always fail a request when sending it failed
  nvmet-tcp: fix regression in data_digest calculation
  lib/sbitmap: Fix invalid loop in __sbitmap_queue_get_batch()
2022-07-01 10:42:10 -07:00
Linus Torvalds
067c227379 Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fix from James Bottomley:
 "One simple driver fix for a dma overrun"

* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
  scsi: hisi_sas: Limit max hw sectors for v3 HW
2022-07-01 10:38:17 -07:00
Linus Torvalds
690685ffcd Merge tag 'ata-5.19-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/libata
Pull ATA fix from Damien Le Moal:

 - Fix a compilation warning with some versions of gcc/sparse when
   compiling the pata_cs5535 driver, from John.

* tag 'ata-5.19-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/libata:
  ata: pata_cs5535: Fix W=1 warnings
2022-07-01 10:31:44 -07:00
Will Deacon
4109823037 arm64: hugetlb: Restore TLB invalidation for BBM on contiguous ptes
Commit fb396bb459 ("arm64/hugetlb: Drop TLB flush from get_clear_flush()")
removed TLB invalidation from get_clear_flush() [now get_clear_contig()]
on the basis that the core TLB invalidation code is aware of hugetlb
mappings backed by contiguous page-table entries and will cover the
correct virtual address range.

However, this change also resulted in the TLB invalidation being removed
from the "break" step in the break-before-make (BBM) sequence used
internally by huge_ptep_set_{access_flags,wrprotect}(), therefore
making the BBM sequence unsafe irrespective of later invalidation.

Although the architecture is desperately unclear about how exactly
contiguous ptes should be updated in a live page-table, restore TLB
invalidation to our BBM sequence under the assumption that BBM is the
right thing to be doing in the first place.

Fixes: fb396bb459 ("arm64/hugetlb: Drop TLB flush from get_clear_flush()")
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Steve Capper <steve.capper@arm.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Marc Zyngier <maz@kernel.org>
Signed-off-by: Will Deacon <will@kernel.org>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Link: https://lore.kernel.org/r/20220629095349.25748-1-will@kernel.org
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2022-07-01 18:29:26 +01:00
Linus Torvalds
9650910d05 Merge tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux
Pull clk fixes from Stephen Boyd:
 "Two small fixes

   - Initialize a spinlock in the stm32 reset code

   - Add dt bindings to the clk maintainer filepattern"

* tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux:
  MAINTAINERS: add include/dt-bindings/clock to COMMON CLK FRAMEWORK
  clk: stm32: rcc_reset: Fix missing spin_lock_init()
2022-07-01 10:01:32 -07:00
Dave Chinner
af1c2146a5 xfs: introduce per-cpu CIL tracking structure
The CIL push lock is highly contended on larger machines, becoming a
hard bottleneck that about 700,000 transaction commits/s on >16p
machines. To address this, start moving the CIL tracking
infrastructure to utilise per-CPU structures.

We need to track the space used, the amount of log reservation space
reserved to write the CIL, the log items in the CIL and the busy
extents that need to be completed by the CIL commit.  This requires
a couple of per-cpu counters, an unordered per-cpu list and a
globally ordered per-cpu list.

Create a per-cpu structure to hold these and all the management
interfaces needed, as well as the hooks to handle hotplug CPUs.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2022-07-02 02:13:52 +10:00
Dave Chinner
31151cc342 xfs: rework per-iclog header CIL reservation
For every iclog that a CIL push will use up, we need to ensure we
have space reserved for the iclog header in each iclog. It is
extremely difficult to do this accurately with a per-cpu counter
without expensive summing of the counter in every commit. However,
we know what the maximum CIL size is going to be because of the
hard space limit we have, and hence we know exactly how many iclogs
we are going to need to write out the CIL.

We are constrained by the requirement that small transactions only
have reservation space for a single iclog header built into them.
At commit time we don't know how much of the current transaction
reservation is made up of iclog header reservations as calculated by
xfs_log_calc_unit_res() when the ticket was reserved. As larger
reservations have multiple header spaces reserved, we can steal
more than one iclog header reservation at a time, but we only steal
the exact number needed for the given log vector size delta.

As a result, we don't know exactly when we are going to steal iclog
header reservations, nor do we know exactly how many we are going to
need for a given CIL.

To make things simple, start by calculating the worst case number of
iclog headers a full CIL push will require. Record this into an
atomic variable in the CIL. Then add a byte counter to the log
ticket that records exactly how much iclog header space has been
reserved in this ticket by xfs_log_calc_unit_res(). This tells us
exactly how much space we can steal from the ticket at transaction
commit time.

Now, at transaction commit time, we can check if the CIL has a full
iclog header reservation and, if not, steal the entire reservation
the current ticket holds for iclog headers. This minimises the
number of times we need to do atomic operations in the fast path,
but still guarantees we get all the reservations we need.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2022-07-02 02:12:52 +10:00
Dave Chinner
12380d237b xfs: lift init CIL reservation out of xc_cil_lock
The xc_cil_lock is the most highly contended lock in XFS now. To
start the process of getting rid of it, lift the initial reservation
of the CIL log space out from under the xc_cil_lock.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2022-07-02 02:11:52 +10:00
Dave Chinner
88591e7f06 xfs: use the CIL space used counter for emptiness checks
In the next patches we are going to make the CIL list itself
per-cpu, and so we cannot use list_empty() to check is the list is
empty. Replace the list_empty() checks with a flag in the CIL to
indicate we have committed at least one transaction to the CIL and
hence the CIL is not empty.

We need this flag to be an atomic so that we can clear it without
holding any locks in the commit fast path, but we also need to be
careful to avoid atomic operations in the fast path. Hence we use
the fact that test_bit() is not an atomic op to first check if the
flag is set and then run the atomic test_and_clear_bit() operation
to clear it and steal the initial unit reservation for the CIL
context checkpoint.

When we are switching to a new context in a push, we place the
setting of the XLOG_CIL_EMPTY flag under the xc_push_lock. THis
allows all the other places that need to check whether the CIL is
empty to use test_bit() and still be serialised correctly with the
CIL context swaps that set the bit.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2022-07-02 02:10:52 +10:00
Darrick J. Wong
7561cea5db xfs: prevent a UAF when log IO errors race with unmount
KASAN reported the following use after free bug when running
generic/475:

 XFS (dm-0): Mounting V5 Filesystem
 XFS (dm-0): Starting recovery (logdev: internal)
 XFS (dm-0): Ending recovery (logdev: internal)
 Buffer I/O error on dev dm-0, logical block 20639616, async page read
 Buffer I/O error on dev dm-0, logical block 20639617, async page read
 XFS (dm-0): log I/O error -5
 XFS (dm-0): Filesystem has been shut down due to log error (0x2).
 XFS (dm-0): Unmounting Filesystem
 XFS (dm-0): Please unmount the filesystem and rectify the problem(s).
 ==================================================================
 BUG: KASAN: use-after-free in do_raw_spin_lock+0x246/0x270
 Read of size 4 at addr ffff888109dd84c4 by task 3:1H/136

 CPU: 3 PID: 136 Comm: 3:1H Not tainted 5.19.0-rc4-xfsx #rc4 8e53ab5ad0fddeb31cee5e7063ff9c361915a9c4
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1 04/01/2014
 Workqueue: xfs-log/dm-0 xlog_ioend_work [xfs]
 Call Trace:
  <TASK>
  dump_stack_lvl+0x34/0x44
  print_report.cold+0x2b8/0x661
  ? do_raw_spin_lock+0x246/0x270
  kasan_report+0xab/0x120
  ? do_raw_spin_lock+0x246/0x270
  do_raw_spin_lock+0x246/0x270
  ? rwlock_bug.part.0+0x90/0x90
  xlog_force_shutdown+0xf6/0x370 [xfs 4ad76ae0d6add7e8183a553e624c31e9ed567318]
  xlog_ioend_work+0x100/0x190 [xfs 4ad76ae0d6add7e8183a553e624c31e9ed567318]
  process_one_work+0x672/0x1040
  worker_thread+0x59b/0xec0
  ? __kthread_parkme+0xc6/0x1f0
  ? process_one_work+0x1040/0x1040
  ? process_one_work+0x1040/0x1040
  kthread+0x29e/0x340
  ? kthread_complete_and_exit+0x20/0x20
  ret_from_fork+0x1f/0x30
  </TASK>

 Allocated by task 154099:
  kasan_save_stack+0x1e/0x40
  __kasan_kmalloc+0x81/0xa0
  kmem_alloc+0x8d/0x2e0 [xfs]
  xlog_cil_init+0x1f/0x540 [xfs]
  xlog_alloc_log+0xd1e/0x1260 [xfs]
  xfs_log_mount+0xba/0x640 [xfs]
  xfs_mountfs+0xf2b/0x1d00 [xfs]
  xfs_fs_fill_super+0x10af/0x1910 [xfs]
  get_tree_bdev+0x383/0x670
  vfs_get_tree+0x7d/0x240
  path_mount+0xdb7/0x1890
  __x64_sys_mount+0x1fa/0x270
  do_syscall_64+0x2b/0x80
  entry_SYSCALL_64_after_hwframe+0x46/0xb0

 Freed by task 154151:
  kasan_save_stack+0x1e/0x40
  kasan_set_track+0x21/0x30
  kasan_set_free_info+0x20/0x30
  ____kasan_slab_free+0x110/0x190
  slab_free_freelist_hook+0xab/0x180
  kfree+0xbc/0x310
  xlog_dealloc_log+0x1b/0x2b0 [xfs]
  xfs_unmountfs+0x119/0x200 [xfs]
  xfs_fs_put_super+0x6e/0x2e0 [xfs]
  generic_shutdown_super+0x12b/0x3a0
  kill_block_super+0x95/0xd0
  deactivate_locked_super+0x80/0x130
  cleanup_mnt+0x329/0x4d0
  task_work_run+0xc5/0x160
  exit_to_user_mode_prepare+0xd4/0xe0
  syscall_exit_to_user_mode+0x1d/0x40
  entry_SYSCALL_64_after_hwframe+0x46/0xb0

This appears to be a race between the unmount process, which frees the
CIL and waits for in-flight iclog IO; and the iclog IO completion.  When
generic/475 runs, it starts fsstress in the background, waits a few
seconds, and substitutes a dm-error device to simulate a disk falling
out of a machine.  If the fsstress encounters EIO on a pure data write,
it will exit but the filesystem will still be online.

The next thing the test does is unmount the filesystem, which tries to
clean the log, free the CIL, and wait for iclog IO completion.  If an
iclog was being written when the dm-error switch occurred, it can race
with log unmounting as follows:

Thread 1				Thread 2

					xfs_log_unmount
					xfs_log_clean
					xfs_log_quiesce
xlog_ioend_work
<observe error>
xlog_force_shutdown
test_and_set_bit(XLOG_IOERROR)
					xfs_log_force
					<log is shut down, nop>
					xfs_log_umount_write
					<log is shut down, nop>
					xlog_dealloc_log
					xlog_cil_destroy
					<wait for iclogs>
spin_lock(&log->l_cilp->xc_push_lock)
<KABOOM>

Therefore, free the CIL after waiting for the iclogs to complete.  I
/think/ this race has existed for quite a few years now, though I don't
remember the ~2014 era logging code well enough to know if it was a real
threat then or if the actual race was exposed only more recently.

Fixes: ac983517ec ("xfs: don't sleep in xlog_cil_force_lsn on shutdown")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-07-01 09:09:52 -07:00
Linus Torvalds
a175eca0f3 Merge tag 'drm-fixes-2022-07-01' of git://anongit.freedesktop.org/drm/drm
Pull drm fixes from Dave Airlie:
 "Bit quieter this week, the main thing is it pulls in the fixes for the
  sysfb resource issue you were seeing. these had been queued for next
  so should have had some decent testing.

  Otherwise amdgpu, i915 and msm each have a few fixes, and vc4 has one.

  fbdev:
   - sysfb fixes/conflicting fb fixes

  amdgpu:
   - GPU recovery fix

   - Fix integer type usage in fourcc header for AMD modifiers

   - KFD TLB flush fix for gfx9 APUs

   - Display fix

  i915:
   - Fix ioctl argument error return

   - Fix d3cold disable to allow PCI upstream bridge D3 transition

   - Fix setting cache_dirty for dma-buf objects on discrete

  msm:
   - Fix to increment vsync_cnt before calling drm_crtc_handle_vblank so
     that userspace sees the value *after* it is incremented if waiting
     for vblank events

   - Fix to reset drm_dev to NULL in dp_display_unbind to avoid a crash
     in probe/bind error paths

   - Fix to resolve the smatch error of de-referencing before NULL check
     in dpu_encoder_phys_wb.c

   - Fix error return to userspace if fence-id allocation fails in
     submit ioctl

  vc4:
   - NULL ptr dereference fix"

* tag 'drm-fixes-2022-07-01' of git://anongit.freedesktop.org/drm/drm:
  Revert "drm/amdgpu/display: set vblank_disable_immediate for DC"
  drm/amdgpu: To flush tlb for MMHUB of RAVEN series
  drm/fourcc: fix integer type usage in uapi header
  drm/amdgpu: fix adev variable used in amdgpu_device_gpu_recover()
  fbdev: Disable sysfb device registration when removing conflicting FBs
  firmware: sysfb: Add sysfb_disable() helper function
  firmware: sysfb: Make sysfb_create_simplefb() return a pdev pointer
  drm/msm/gem: Fix error return on fence id alloc fail
  drm/i915: tweak the ordering in cpu_write_needs_clflush
  drm/i915/dgfx: Disable d3cold at gfx root port
  drm/i915/gem: add missing else
  drm/vc4: perfmon: Fix variable dereferenced before check
  drm/msm/dpu: Fix variable dereferenced before check
  drm/msm/dp: reset drm_dev to NULL at dp_display_unbind()
  drm/msm/dpu: Increment vsync_cnt before waking up userspace
2022-06-30 17:19:19 -07:00
Dave Airlie
b8f0009bc9 Merge tag 'drm-misc-fixes-2022-06-30' of git://anongit.freedesktop.org/drm/drm-misc into drm-fixes
A NULL pointer dereference fix for vc4, and 3 patches to improve the
sysfb device behaviour when removing conflicting framebuffers

Signed-off-by: Dave Airlie <airlied@redhat.com>

From: Maxime Ripard <maxime@cerno.tech>
Link: https://patchwork.freedesktop.org/patch/msgid/20220630072404.2fa4z3nk5h5q34ci@houat
2022-07-01 09:27:55 +10:00
Linus Torvalds
5e8379351d Merge tag 'net-5.19-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
 "Including fixes from netfilter.

  Current release - new code bugs:

   - clear msg_get_inq in __sys_recvfrom() and __copy_msghdr_from_user()

   - mptcp:
      - invoke MP_FAIL response only when needed
      - fix shutdown vs fallback race
      - consistent map handling on failure

   - octeon_ep: use bitwise AND

  Previous releases - regressions:

   - tipc: move bc link creation back to tipc_node_create, fix NPD

  Previous releases - always broken:

   - tcp: add a missing nf_reset_ct() in 3WHS handling to prevent socket
     buffered skbs from keeping refcount on the conntrack module

   - ipv6: take care of disable_policy when restoring routes

   - tun: make sure to always disable and unlink NAPI instances

   - phy: don't trigger state machine while in suspend

   - netfilter: nf_tables: avoid skb access on nf_stolen

   - asix: fix "can't send until first packet is send" issue

   - usb: asix: do not force pause frames support

   - nxp-nci: don't issue a zero length i2c_master_read()

  Misc:

   - ncsi: allow use of proper "mellanox" DT vendor prefix

   - act_api: add a message for user space if any actions were already
     flushed before the error was hit"

* tag 'net-5.19-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (55 commits)
  net: dsa: felix: fix race between reading PSFP stats and port stats
  selftest: tun: add test for NAPI dismantle
  net: tun: avoid disabling NAPI twice
  net: sparx5: mdb add/del handle non-sparx5 devices
  net: sfp: fix memory leak in sfp_probe()
  mlxsw: spectrum_router: Fix rollback in tunnel next hop init
  net: rose: fix UAF bugs caused by timer handler
  net: usb: ax88179_178a: Fix packet receiving
  net: bonding: fix use-after-free after 802.3ad slave unbind
  ipv6: fix lockdep splat in in6_dump_addrs()
  net: phy: ax88772a: fix lost pause advertisement configuration
  net: phy: Don't trigger state machine while in suspend
  usbnet: fix memory allocation in helpers
  selftests net: fix kselftest net fatal error
  NFC: nxp-nci: don't print header length mismatch on i2c error
  NFC: nxp-nci: Don't issue a zero length i2c_master_read()
  net: tipc: fix possible refcount leak in tipc_sk_create()
  nfc: nfcmrvl: Fix irq_of_parse_and_map() return value
  net: ipv6: unexport __init-annotated seg6_hmac_net_init()
  ipv6/sit: fix ipip6_tunnel_get_prl return value
  ...
2022-06-30 15:26:55 -07:00
Amir Goldstein
868f9f2f8e vfs: fix copy_file_range() regression in cross-fs copies
A regression has been reported by Nicolas Boichat, found while using the
copy_file_range syscall to copy a tracefs file.

Before commit 5dae222a5f ("vfs: allow copy_file_range to copy across
devices") the kernel would return -EXDEV to userspace when trying to
copy a file across different filesystems.  After this commit, the
syscall doesn't fail anymore and instead returns zero (zero bytes
copied), as this file's content is generated on-the-fly and thus reports
a size of zero.

Another regression has been reported by He Zhe - the assertion of
WARN_ON_ONCE(ret == -EOPNOTSUPP) can be triggered from userspace when
copying from a sysfs file whose read operation may return -EOPNOTSUPP.

Since we do not have test coverage for copy_file_range() between any two
types of filesystems, the best way to avoid these sort of issues in the
future is for the kernel to be more picky about filesystems that are
allowed to do copy_file_range().

This patch restores some cross-filesystem copy restrictions that existed
prior to commit 5dae222a5f ("vfs: allow copy_file_range to copy across
devices"), namely, cross-sb copy is not allowed for filesystems that do
not implement ->copy_file_range().

Filesystems that do implement ->copy_file_range() have full control of
the result - if this method returns an error, the error is returned to
the user.  Before this change this was only true for fs that did not
implement the ->remap_file_range() operation (i.e.  nfsv3).

Filesystems that do not implement ->copy_file_range() still fall-back to
the generic_copy_file_range() implementation when the copy is within the
same sb.  This helps the kernel can maintain a more consistent story
about which filesystems support copy_file_range().

nfsd and ksmbd servers are modified to fall-back to the
generic_copy_file_range() implementation in case vfs_copy_file_range()
fails with -EOPNOTSUPP or -EXDEV, which preserves behavior of
server-side-copy.

fall-back to generic_copy_file_range() is not implemented for the smb
operation FSCTL_DUPLICATE_EXTENTS_TO_FILE, which is arguably a correct
change of behavior.

Fixes: 5dae222a5f ("vfs: allow copy_file_range to copy across devices")
Link: https://lore.kernel.org/linux-fsdevel/20210212044405.4120619-1-drinkcat@chromium.org/
Link: https://lore.kernel.org/linux-fsdevel/CANMq1KDZuxir2LM5jOTm0xx+BnvW=ZmpsG47CyHFJwnw7zSX6Q@mail.gmail.com/
Link: https://lore.kernel.org/linux-fsdevel/20210126135012.1.If45b7cdc3ff707bc1efa17f5366057d60603c45f@changeid/
Link: https://lore.kernel.org/linux-fsdevel/20210630161320.29006-1-lhenriques@suse.de/
Reported-by: Nicolas Boichat <drinkcat@chromium.org>
Reported-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Luis Henriques <lhenriques@suse.de>
Fixes: 64bf5ff58d ("vfs: no fallback for ->copy_file_range")
Link: https://lore.kernel.org/linux-fsdevel/20f17f64-88cb-4e80-07c1-85cb96c83619@windriver.com/
Reported-by: He Zhe <zhe.he@windriver.com>
Tested-by: Namjae Jeon <linkinjeon@kernel.org>
Tested-by: Luis Henriques <lhenriques@suse.de>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-06-30 15:16:38 -07:00
Chuck Lever
a23dd544de SUNRPC: Fix READ_PLUS crasher
Looks like there are still cases when "space_left - frag1bytes" can
legitimately exceed PAGE_SIZE. Ensure that xdr->end always remains
within the current encode buffer.

Reported-by: Bruce Fields <bfields@fieldses.org>
Reported-by: Zorro Lang <zlang@redhat.com>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=216151
Fixes: 6c254bf3b6 ("SUNRPC: Fix the calculation of xdr->end in xdr_get_next_encode_buffer()")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-06-30 17:41:08 -04:00