Commit Graph

1234426 Commits

Author SHA1 Message Date
Darrick J. Wong
6bb9ea8ecd xfs: log EFIs for all btree blocks being used to stage a btree
We need to log EFIs for every extent that we allocate for the purpose of
staging a new btree so that if we fail then the blocks will be freed
during log recovery.  Use the autoreaping mechanism provided by the
previous patch to attach paused freeing work to the scrub transaction.
We can then mark the EFIs stale if we decide to commit the new btree, or
we can unpause the EFIs if we decide to abort the repair.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2023-12-06 18:45:19 -08:00
Darrick J. Wong
be40841763 xfs: implement block reservation accounting for btrees we're staging
Create a new xrep_newbt structure to encapsulate a fake root for
creating a staged btree cursor as well as to track all the blocks that
we need to reserve in order to build that btree.

As for the particular choice of lowspace thresholds and btree block
slack factors -- at this point one could say that the thresholds in
online repair come from bulkload_estimate_ag_slack in xfs_repair[1].
But that's not the entire story, since the offline btree rebuilding
code in xfs_repair was merged as a retroport of the online btree code
in this patchset!

Before xfs_btree_staging.[ch] came along, xfs_repair determined the
slack factor (aka the number of slots to leave unfilled in each new
btree block) via open-coded logic in repair/phase5.c[2].  At that point
the slack factors were arbitrary quantities per btree.  The rmapbt
automatically left 10 slots free; everything else left zero.

That had a noticeable effect on performance straight after mounting
because adding records to /any/ btree would result in splits.  A few
years ago when this patch was first written, Dave and I decided that
repair should generate btree blocks that were 75% full unless space was
tight, in which case it should try to fill the blocks to nearly full.
We defined tight as ~10% free to avoid repair failures but settled on
3/32 (~9%) to avoid div64.

IOWs, we mostly pulled the thresholds out of thin air.  We've been
QAing with those geometry numbers ever since. ;)

Link: https://git.kernel.org/pub/scm/fs/xfs/xfsprogs-dev.git/tree/repair/bulkload.c?h=v6.5.0#n114
Link: https://git.kernel.org/pub/scm/fs/xfs/xfsprogs-dev.git/tree/repair/phase5.c?h=v4.19.0#n1349
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-12-06 18:45:18 -08:00
Darrick J. Wong
4c8ecd1cfd xfs: remove unused fields from struct xbtree_ifakeroot
Remove these unused fields since nobody uses them.  They should have
been removed years ago in a different cleanup series from Christoph
Hellwig.

Fixes: daf83964a3 ("xfs: move the per-fork nextents fields into struct xfs_ifork")
Fixes: f7e67b20ec ("xfs: move the fork format fields into struct xfs_ifork")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-12-06 18:45:18 -08:00
Darrick J. Wong
e3042be36c xfs: automatic freeing of freshly allocated unwritten space
As mentioned in the previous commit, online repair wants to allocate
space to write out a new metadata structure, and it also wants to hedge
against system crashes during repairs by logging (and later cancelling)
EFIs to free the space if we crash before committing the new data
structure.

Therefore, create a trio of functions to schedule automatic reaping of
freshly allocated unwritten space.  xfs_alloc_schedule_autoreap creates
a paused EFI representing the space we just allocated.  Once the
allocations are made and the autoreaps scheduled, we can start writing
to disk.

If the writes succeed, xfs_alloc_cancel_autoreap marks the EFI work
items as stale and unpauses the pending deferred work item.  Assuming
that's done in the same transaction that commits the new structure into
the filesystem, we guarantee that either the new object is fully
visible, or that all the space gets reclaimed.

If the writes succeed but only part of an extent was used, repair must
call the same _cancel_autoreap function to kill the first EFI and then
log a new EFI to free the unused space.  The first EFI is already
committed, so it cannot be changed.

For full extents that aren't used, xfs_alloc_commit_autoreap will
unpause the EFI, which results in the space being freed during the next
_defer_finish cycle.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2023-12-06 18:45:18 -08:00
Darrick J. Wong
4c88fef3af xfs: remove __xfs_free_extent_later
xfs_free_extent_later is a trivial helper, so remove it to reduce the
amount of thinking required to understand the deferred freeing
interface.  This will make it easier to introduce automatic reaping of
speculative allocations in the next patch.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2023-12-06 18:45:18 -08:00
Darrick J. Wong
4dffb2cbb4 xfs: allow pausing of pending deferred work items
Traditionally, all pending deferred work attached to a transaction is
finished when one of the xfs_defer_finish* functions is called.
However, online repair wants to be able to allocate space for a new data
structure, format a new metadata structure into the allocated space, and
commit that into the filesystem.

As a hedge against system crashes during repairs, we also want to log
some EFI items for the allocated space speculatively, and cancel them if
we elect to commit the new data structure.

Therefore, introduce the idea of pausing a pending deferred work item.
Log intent items are still created for paused items and relogged as
necessary.  However, paused items are pushed onto a side list before we
start calling ->finish_item, and the whole list is reattach to the
transaction afterwards.  New work items are never attached to paused
pending items.

Modify xfs_defer_cancel to clean up pending deferred work items holding
a log intent item but not a log intent done item, since that is now
possible.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2023-12-06 18:45:18 -08:00
Darrick J. Wong
6b12613940 xfs: don't append work items to logged xfs_defer_pending objects
When someone tries to add a deferred work item to xfs_defer_add, it will
try to attach the work item to the most recently added xfs_defer_pending
object attached to the transaction.  However, it doesn't check if the
pending object has a log intent item attached to it.  This is incorrect
behavior because we cannot add more work to an object that has already
been committed to the ondisk log.

Therefore, change the behavior not to append to pending items with a non
null dfp_intent.  In practice this has not been an issue because the
only way xfs_defer_add gets called after log intent items have been
committed is from the defer ops ->finish_item functions themselves, and
the @dop_pending isolation in xfs_defer_finish_noroll protects the
pending items that have already been logged.

However, the next patch will add the ability to pause a deferred extent
free object during online btree rebuilding, and any new extfree work
items need to have their own pending event.

While we're at it, hoist the predicate to its own static inline function
for readability.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2023-12-06 18:45:18 -08:00
Darrick J. Wong
3f113c2739 xfs: make xchk_iget safer in the presence of corrupt inode btrees
When scrub is trying to iget an inode, ensure that it won't end up
deadlocked on a cycle in the inode btree by using an empty transaction
to store all the buffers.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2023-12-06 18:45:17 -08:00
Darrick J. Wong
9c07bca793 xfs: elide ->create_done calls for unlogged deferred work
Extended attribute updates use the deferred work machinery to manage
state across a chain of smaller transactions.  All previous deferred
work users have employed log intent items and log done items to manage
restarting of interrupted operations, which means that ->create_intent
sets dfp_intent to a log intent item and ->create_done uses that item to
create a log intent done item.

However, xattrs have used the INCOMPLETE flag to deal with the lack of
recovery support for an interrupted transaction chain.  Log items are
optional if the xattr update caller didn't set XFS_DA_OP_LOGGED to
require a restartable sequence.

In other words, ->create_intent can return NULL to say that there's no
log intent item.  If that's the case, no log intent done item should be
created.  Clean up xfs_defer_create_done not to do this, so that the
->create_done functions don't have to check for non-null dfp_intent
themselves.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2023-12-06 18:45:17 -08:00
Darrick J. Wong
94da54d582 xfs: document what LARP means
Christoph requested a blurb somewhere explaining exactly what LARP
means.  I don't know of a good place other than the source code (debug
knobs aren't covered in Documentation/), so here it is.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2023-12-06 18:45:17 -08:00
Darrick J. Wong
e14293803f xfs: don't allow overly small or large realtime volumes
Don't allow realtime volumes that are less than one rt extent long.
This has been broken across 4 LTS kernels with nobody noticing, so let's
just disable it.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2023-12-06 18:45:17 -08:00
Darrick J. Wong
cf8f0e6c14 xfs: fix 32-bit truncation in xfs_compute_rextslog
It's quite reasonable that some customer somewhere will want to
configure a realtime volume with more than 2^32 extents.  If they try to
do this, the highbit32() call will truncate the upper bits of the
xfs_rtbxlen_t and produce the wrong value for rextslog.  This in turn
causes the rsumlevels to be wrong, which results in a realtime summary
file that is the wrong length.  Fix that.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2023-12-06 18:45:17 -08:00
Darrick J. Wong
a6a38f309a xfs: make rextslog computation consistent with mkfs
There's a weird discrepancy in xfsprogs dating back to the creation of
the Linux port -- if there are zero rt extents, mkfs will set
sb_rextents and sb_rextslog both to zero:

	sbp->sb_rextslog =
		(uint8_t)(rtextents ?
			libxfs_highbit32((unsigned int)rtextents) : 0);

However, that's not the check that xfs_repair uses for nonzero rtblocks:

	if (sb->sb_rextslog !=
			libxfs_highbit32((unsigned int)sb->sb_rextents))

The difference here is that xfs_highbit32 returns -1 if its argument is
zero.  Unfortunately, this means that in the weird corner case of a
realtime volume shorter than 1 rt extent, xfs_repair will immediately
flag a freshly formatted filesystem as corrupt.  Because mkfs has been
writing ondisk artifacts like this for decades, we have to accept that
as "correct".  TBH, zero rextslog for zero rtextents makes more sense to
me anyway.

Regrettably, the superblock verifier checks created in commit copied
xfs_repair even though mkfs has been writing out such filesystems for
ages.  Fix the superblock verifier to accept what mkfs spits out; the
userspace version of this patch will have to fix xfs_repair as well.

Note that the new helper leaves the zeroday bug where the upper 32 bits
of sb_rextents is ripped off and fed to highbit32.  This leads to a
seriously undersized rt summary file, which immediately breaks mkfs:

$ hugedisk.sh foo /dev/sdc $(( 0x100000080 * 4096))B
$ /sbin/mkfs.xfs -f /dev/sda -m rmapbt=0,reflink=0 -r rtdev=/dev/mapper/foo
meta-data=/dev/sda               isize=512    agcount=4, agsize=1298176 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=0    bigtime=1 inobtcount=1 nrext64=1
data     =                       bsize=4096   blocks=5192704, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=16384, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =/dev/mapper/foo        extsz=4096   blocks=4294967424, rtextents=4294967424
Discarding blocks...Done.
mkfs.xfs: Error initializing the realtime space [117 - Structure needs cleaning]

The next patch will drop support for rt volumes with fewer than 1 or
more than 2^32-1 rt extents, since they've clearly been broken forever.

Fixes: f8e566c0f5 ("xfs: validate the realtime geometry in xfs_validate_sb_common")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2023-12-06 18:45:17 -08:00
Darrick J. Wong
a49c708f9a xfs: move ->iop_relog to struct xfs_defer_op_type
The only log items that need relogging are the ones created for deferred
work operations, and the only part of the code base that relogs log
items is the deferred work machinery.  Move the function pointers.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2023-12-06 18:45:17 -08:00
Darrick J. Wong
8a9aa763e1 xfs: collapse the ->create_done functions
Move the meat of the ->create_done function helpers into ->create_done
to reduce the amount of boilerplate.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2023-12-06 18:45:16 -08:00
Darrick J. Wong
b28852a5bd xfs: hoist xfs_trans_add_item calls to defer ops functions
Remove even more repeated boilerplate.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2023-12-06 18:45:16 -08:00
Darrick J. Wong
3e0958be21 xfs: clean out XFS_LI_DIRTY setting boilerplate from ->iop_relog
Hoist this dirty flag setting to the ->iop_relog callsite to reduce
boilerplate.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2023-12-06 18:45:16 -08:00
Darrick J. Wong
bd3a88f6b7 xfs: use xfs_defer_create_done for the relogging operation
Now that we have a helper to handle creating a log intent done item and
updating all the necessary state flags, use it to reduce boilerplate in
the ->iop_relog implementations.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2023-12-06 18:45:16 -08:00
Darrick J. Wong
f3fd7f6fce xfs: hoist ->create_intent boilerplate to its callsite
Hoist the dirty flag setting code out of each ->create_intent
implementation up to the callsite to reduce boilerplate further.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2023-12-06 18:45:16 -08:00
Darrick J. Wong
e6e5299fcb xfs: collapse the ->finish_item helpers
Each log item's ->finish_item function sets up a small amount of state
and calls another function to do the work.  Collapse that other function
into ->finish_item to reduce the call stack height.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2023-12-06 18:45:16 -08:00
Darrick J. Wong
3dd75c8db1 xfs: hoist intent done flag setting to ->finish_item callsite
Each log intent item's ->finish_item call chain inevitably includes some
code to set the dirty flag of the transaction.  If there's an associated
log intent done item, it also sets the item's dirty flag and the
transaction's INTENT_DONE flag.  This is repeated throughout the
codebase.

Reduce the LOC by moving all that to xfs_defer_finish_one.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2023-12-06 18:45:15 -08:00
Darrick J. Wong
172538beba xfs: don't set XFS_TRANS_HAS_INTENT_DONE when there's no ATTRD log item
XFS_TRANS_HAS_INTENT_DONE is a flag to the CIL that we've added a log
intent done item to the transaction.  This enables an optimization
wherein we avoid writing out log intent and log intent done items if
they would have ended up in the same checkpoint.  This reduces writes to
the ondisk log and speeds up recovery as a result.

However, callers can use the defer ops machinery to modify xattrs
without using the log items.  In this situation, there won't be an
intent done item, so we do not need to set the flag.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2023-12-06 18:45:15 -08:00
Darrick J. Wong
db7ccc0bac xfs: move ->iop_recover to xfs_defer_op_type
Finish off the series by moving the intent item recovery function
pointer to the xfs_defer_op_type struct, since this is really a deferred
work function now.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2023-12-06 18:45:15 -08:00
Darrick J. Wong
e5f1a5146e xfs: use xfs_defer_finish_one to finish recovered work items
Get rid of the open-coded calls to xfs_defer_finish_one.  This also
means that the recovery transaction takes care of cleaning up the dfp,
and we have solved (I hope) all the ownership issues in recovery.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2023-12-06 18:45:15 -08:00
Darrick J. Wong
a51489e140 xfs: dump the recovered xattri log item if corruption happens
If xfs_attri_item_recover receives a corruption error when it tries to
finish a recovered log intent item, it should dump the log item for
debugging, just like all the other log intent items.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2023-12-06 18:45:15 -08:00
Darrick J. Wong
e70fb328d5 xfs: recreate work items when recovering intent items
Recreate work items for each xfs_defer_pending object when we are
recovering intent items.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2023-12-06 18:45:15 -08:00
Darrick J. Wong
deb4cd8ba8 xfs: transfer recovered intent item ownership in ->iop_recover
Now that we pass the xfs_defer_pending object into the intent item
recovery functions, we know exactly when ownership of the sole refcount
passes from the recovery context to the intent done item.  At that
point, we need to null out dfp_intent so that the recovery mechanism
won't release it.  This should fix the UAF problem reported by Long Li.

Note that we still want to recreate the full deferred work state.  That
will be addressed in the next patches.

Fixes: 2e76f188fd ("xfs: cancel intents immediately if process_intents fails")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2023-12-06 18:45:14 -08:00
Darrick J. Wong
a050acdfa8 xfs: pass the xfs_defer_pending object to iop_recover
Now that log intent item recovery recreates the xfs_defer_pending state,
we should pass that into the ->iop_recover routines so that the intent
item can finish the recreation work.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2023-12-06 18:45:14 -08:00
Darrick J. Wong
03f7767c9f xfs: use xfs_defer_pending objects to recover intent items
One thing I never quite got around to doing is porting the log intent
item recovery code to reconstruct the deferred pending work state.  As a
result, each intent item open codes xfs_defer_finish_one in its recovery
method, because that's what the EFI code did before xfs_defer.c even
existed.

This is a gross thing to have left unfixed -- if an EFI cannot proceed
due to busy extents, we end up creating separate new EFIs for each
unfinished work item, which is a change in behavior from what runtime
would have done.

Worse yet, Long Li pointed out that there's a UAF in the recovery code.
The ->commit_pass2 function adds the intent item to the AIL and drops
the refcount.  The one remaining refcount is now owned by the recovery
mechanism (aka the log intent items in the AIL) with the intent of
giving the refcount to the intent done item in the ->iop_recover
function.

However, if something fails later in recovery, xlog_recover_finish will
walk the recovered intent items in the AIL and release them.  If the CIL
hasn't been pushed before that point (which is possible since we don't
force the log until later) then the intent done release will try to free
its associated intent, which has already been freed.

This patch starts to address this mess by having the ->commit_pass2
functions recreate the xfs_defer_pending state.  The next few patches
will fix the recovery functions.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2023-12-06 18:45:14 -08:00
Darrick J. Wong
07bcbdf020 xfs: don't leak recovered attri intent items
If recovery finds an xattr log intent item calling for the removal of an
attribute and the file doesn't even have an attr fork, we know that the
removal is trivially complete.  However, we can't just exit the recovery
function without doing something about the recovered log intent item --
it's still on the AIL, and not logging an attrd item means it stays
there forever.

This has likely not been seen in practice because few people use LARP
and the runtime code won't log the attri for a no-attrfork removexattr
operation.  But let's fix this anyway.

Also we shouldn't really be testing the attr fork presence until we've
taken the ILOCK, though this doesn't matter much in recovery, which is
single threaded.

Fixes: fdaf1bb3ca ("xfs: ATTR_REPLACE algorithm with LARP enabled needs rework")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2023-12-06 18:45:14 -08:00
Linus Torvalds
33cc938e65 Linux 6.7-rc4 v6.7-rc4 2023-12-03 18:52:56 +09:00
Linus Torvalds
968f35f4ab Merge tag 'v6.7-rc3-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull smb client fixes from Steve French:

 - Two fallocate fixes

 - Fix warnings from new gcc

 - Two symlink fixes

* tag 'v6.7-rc3-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  smb: client, common: fix fortify warnings
  cifs: Fix FALLOC_FL_INSERT_RANGE by setting i_size after EOF moved
  cifs: Fix FALLOC_FL_ZERO_RANGE by setting i_size if EOF moved
  smb: client: report correct st_size for SMB and NFS symlinks
  smb: client: fix missing mode bits for SMB symlinks
2023-12-03 09:08:26 +09:00
Linus Torvalds
55abae438c Merge tag 'firewire-fixes-6.7-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394
Pull firewire fix from Takashi Sakamoto:
 "A single patch to fix long-standing issue of memory leak at failure of
  device registration for fw_unit. We rarely encounter the issue, but it
  should be applied to stable releases, since it fixes inappropriate API
  usage"

* tag 'firewire-fixes-6.7-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394:
  firewire: core: fix possible memory leak in create_units()
2023-12-03 09:03:07 +09:00
Linus Torvalds
1b8af6552c Merge tag 'powerpc-6.7-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:

 - Fix corruption of f0/vs0 during FP/Vector save, seen as userspace
   crashes when using io-uring workers (in particular with MariaDB)

 - Fix KVM_RUN potentially clobbering all host userspace FP/Vector
   registers

Thanks to Timothy Pearson, Jens Axboe, and Nicholas Piggin.

* tag 'powerpc-6.7-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  KVM: PPC: Book3S HV: Fix KVM_RUN clobbering FP/VEC user registers
  powerpc: Don't clobber f0/vs0 during fp|altivec register save
2023-12-03 08:43:35 +09:00
Linus Torvalds
17b17be28d Merge tag 'vfio-v6.7-rc4' of https://github.com/awilliam/linux-vfio
Pull vfio fixes from Alex Williamson:

 - Fix the lifecycle of a mutex in the pds variant driver such that a
   reset prior to opening the device won't find it uninitialized.
   Implement the release path to symmetrically destroy the mutex. Also
   switch a different lock from spinlock to mutex as the code path has
   the potential to sleep and doesn't need the spinlock context
   otherwise (Brett Creeley)

 - Fix an issue detected via randconfig where KVM tries to symbol_get an
   undeclared function. The symbol is temporarily declared
   unconditionally here, which resolves the problem and avoids churn
   relative to a series pending for the next merge window which resolves
   some of this symbol ugliness, but also fixes Kconfig dependencies
   (Sean Christopherson)

* tag 'vfio-v6.7-rc4' of https://github.com/awilliam/linux-vfio:
  vfio: Drop vfio_file_iommu_group() stub to fudge around a KVM wart
  vfio/pds: Fix possible sleep while in atomic context
  vfio/pds: Fix mutex lock->magic != lock warning
2023-12-03 08:37:39 +09:00
Linus Torvalds
deb4b9dd3b Merge tag 'for-linus-6.7a-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen fixes from Juergen Gross:

 - A fix for the Xen event driver setting the correct return value when
   experiencing an allocation failure

 - A fix for allocating space for a struct in the percpu area to not
   cross page boundaries (this one is for x86, a similar one for Arm was
   already in the pull request for rc3)

* tag 'for-linus-6.7a-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
  xen/events: fix error code in xen_bind_pirq_msi_to_irq()
  x86/xen: fix percpu vcpu_info allocation
2023-12-03 08:31:53 +09:00
Linus Torvalds
669fc83452 Merge tag 'probes-fixes-v6.7-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull probes fixes from Masami Hiramatsu:

 - objpool: Fix objpool overrun case on memory/cache access delay
   especially on the big.LITTLE SoC. The objpool uses a copy of object
   slot index internal loop, but the slot index can be changed on
   another processor in parallel. In that case, the difference of 'head'
   local copy and the 'slot->last' index will be bigger than local slot
   size. In that case, we need to re-read the slot::head to update it.

 - kretprobe: Fix to use appropriate rcu API for kretprobe holder. Since
   kretprobe_holder::rp is RCU managed, it should use
   rcu_assign_pointer() and rcu_dereference_check() correctly. Also
   adding __rcu tag for finding wrong usage by sparse.

 - rethook: Fix to use appropriate rcu API for rethook::handler. The
   same as kretprobe, rethook::handler is RCU managed and it should use
   rcu_assign_pointer() and rcu_dereference_check(). This also adds
   __rcu tag for finding wrong usage by sparse.

* tag 'probes-fixes-v6.7-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  rethook: Use __rcu pointer for rethook::handler
  kprobes: consistent rcu api usage for kretprobe holder
  lib: objpool: fix head overrun on RK3588 SBC
2023-12-03 08:02:49 +09:00
Linus Torvalds
815fb87b75 Merge tag 'pm-6.7-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fixes from Rafael Wysocki:
 "These fix issues in two cpufreq drivers, in the AMD P-state driver and
  in the power-capping DTPM framework.

  Specifics:

   - Fix the AMD P-state driver's EPP sysfs interface in the cases when
     the performance governor is in use (Ayush Jain)

   - Make the ->fast_switch() callback in the AMD P-state driver return
     the target frequency as expected (Gautham R. Shenoy)

   - Allow user space to control the range of frequencies to use via
     scaling_min_freq and scaling_max_freq when AMD P-state driver is in
     use (Wyes Karny)

   - Prevent power domains needed for wakeup signaling from being turned
     off during system suspend on Qualcomm systems and prevent
     performance states votes from runtime-suspended devices from being
     lost across a system suspend-resume cycle in qcom-cpufreq-nvmem
     (Stephan Gerhold)

   - Fix disabling the 792 Mhz OPP in the imx6q cpufreq driver for the
     i.MX6ULL types that can run at that frequency (Christoph
     Niedermaier)

   - Eliminate unnecessary and harmful conversions to uW from the DTPM
     (dynamic thermal and power management) framework (Lukasz Luba)"

* tag 'pm-6.7-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  cpufreq/amd-pstate: Only print supported EPP values for performance governor
  cpufreq/amd-pstate: Fix scaling_min_freq and scaling_max_freq update
  powercap: DTPM: Fix unneeded conversions to micro-Watts
  cpufreq/amd-pstate: Fix the return value of amd_pstate_fast_switch()
  pmdomain: qcom: rpmpd: Set GENPD_FLAG_ACTIVE_WAKEUP
  cpufreq: qcom-nvmem: Preserve PM domain votes in system suspend
  cpufreq: qcom-nvmem: Enable virtual power domain devices
  cpufreq: imx6q: Don't disable 792 Mhz OPP unnecessarily
2023-12-02 09:01:00 +09:00
Linus Torvalds
ce474ae7d0 Merge tag 'acpi-6.7-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull ACPI fixes from Rafael Wysocki:
 "This fixes a recently introduced build issue on ARM32 and a NULL
  pointer dereference in the ACPI backlight driver due to a design issue
  exposed by a recent change in the ACPI bus type code.

  Specifics:

   - Fix a recently introduced build issue on ARM32 platforms caused by
     an inadvertent header file breakage (Dave Jiang)

   - Eliminate questionable usage of acpi_driver_data() in the ACPI
     backlight cooling device code that leads to NULL pointer
     dereferences after recent ACPI core changes (Hans de Goede)"

* tag 'acpi-6.7-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  ACPI: video: Use acpi_video_device for cooling-dev driver data
  ACPI: Fix ARM32 platforms compile issue introduced by fw_table changes
2023-12-02 08:52:20 +09:00
Linus Torvalds
35f8458480 Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 fix from Catalin Marinas:
 "Fix a regression where the arm64 KPTI ends up enabled even on systems
  that don't need it"

* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
  arm64: Avoid enabling KPTI unnecessarily
2023-12-02 08:48:59 +09:00
Linus Torvalds
1a2b418566 Merge tag 'iommu-fixes-v6.7-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu
Pull iommu fixes from Joerg Roedel:

 - Fix race conditions in device probe path

 - Handle ERR_PTR() returns in __iommu_domain_alloc() path

 - Update MAINTAINERS entry for Qualcom IOMMUs

 - Printk argument fix in device tree specific code

 - Several Intel VT-d fixes from Lu Baolu:
     - Do not support enforcing cache coherency for non-empty domains
     - Avoid devTLB invalidation if iommu is off
     - Disable PCI ATS in legacy passthrough mode
     - Support non-PCI devices when clearing context
     - Fix incorrect cache invalidation for mm notification
     - Add MTL to quirk list to skip TE disabling
     - Set variable intel_dirty_ops to static

* tag 'iommu-fixes-v6.7-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu:
  iommu: Fix printk arg in of_iommu_get_resv_regions()
  iommu/vt-d: Set variable intel_dirty_ops to static
  iommu/vt-d: Fix incorrect cache invalidation for mm notification
  iommu/vt-d: Add MTL to quirk list to skip TE disabling
  iommu/vt-d: Make context clearing consistent with context mapping
  iommu/vt-d: Disable PCI ATS in legacy passthrough mode
  iommu/vt-d: Omit devTLB invalidation requests when TES=0
  iommu/vt-d: Support enforce_cache_coherency only for empty domains
  iommu: Avoid more races around device probe
  MAINTAINERS: list all Qualcomm IOMMU drivers in the QUALCOMM IOMMU entry
  iommu: Flow ERR_PTR out from __iommu_domain_alloc()
2023-12-02 08:42:39 +09:00
Linus Torvalds
06a3c59f9c Merge tag 'sound-6.7-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
 "No surprise here, including only a collection of HD-audio
  device-specific small fixes"

* tag 'sound-6.7-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
  ALSA: hda: Disable power-save on KONTRON SinglePC
  ALSA: hda/realtek: Add supported ALC257 for ChromeOS
  ALSA: hda/realtek: Headset Mic VREF to 100%
  ALSA: hda: intel-nhlt: Ignore vbps when looking for DMIC 32 bps format
  ALSA: hda: cs35l56: Enable low-power hibernation mode on SPI
  ALSA: cs35l41: Fix for old systems which do not support command
  ALSA: hda: cs35l41: Remove unnecessary boolean state variable firmware_running
  ALSA: hda - Fix speaker and headset mic pin config for CHUWI CoreBook XPro
2023-12-02 08:33:29 +09:00
Linus Torvalds
b1e51588aa Merge tag 'drm-fixes-2023-12-01' of git://anongit.freedesktop.org/drm/drm
Pull drm fixes from Dave Airlie:
 "Weekly fixes, mostly amdgpu fixes with a scattering of nouveau, i915,
  and a couple of reverts. Hopefully it will quieten down in coming
  weeks.

  drm:
   - Revert unexport of prime helpers for fd/handle conversion

  dma_resv:
   - Do not double add fences in dma_resv_add_fence.

  gpuvm:
   - Fix GPUVM license identifier.

  i915:
   - Mark internal GSC engine with reserved uabi class
   - Take VGA converters into account in eDP probe
   - Fix intel_pre_plane_updates() call to ensure workarounds get applied

  panel:
   - Revert panel fixes as they require exporting device_is_dependent.

  nouveau:
   - fix oversized allocations in new vm path
   - fix zero-length array
   - remove a stray lock

  nt36523:
   - Fix error check for nt36523.

  amdgpu:
   - DMUB fix
   - DCN 3.5 fixes
   - XGMI fix
   - DCN 3.2 fixes
   - Vangogh suspend fix
   - NBIO 7.9 fix
   - GFX11 golden register fix
   - Backlight fix
   - NBIO 7.11 fix
   - IB test overflow fix
   - DCN 3.1.4 fixes
   - fix a runtime pm ref count
   - Retimer fix
   - ABM fix
   - DCN 3.1.5 fix
   - Fix AGP addressing
   - Fix possible memory leak in SMU error path
   - Make sure PME is enabled in D3
   - Fix possible NULL pointer dereference in debugfs
   - EEPROM fix
   - GC 9.4.3 fix

  amdkfd:
   - IP version check fix
   - Fix memory leak in pqm_uninit()"

* tag 'drm-fixes-2023-12-01' of git://anongit.freedesktop.org/drm/drm: (53 commits)
  Revert "drm/prime: Unexport helpers for fd/handle conversion"
  drm/amdgpu: Use another offset for GC 9.4.3 remap
  drm/amd/display: Fix some HostVM parameters in DML
  drm/amdkfd: Free gang_ctx_bo and wptr_bo in pqm_uninit
  drm/amdgpu: Update EEPROM I2C address for smu v13_0_0
  drm/amd/display: Allow DTBCLK disable for DCN35
  drm/amdgpu: Fix cat debugfs amdgpu_regs_didt causes kernel null pointer
  drm/amd: Enable PCIe PME from D3
  drm/amd/pm: fix a memleak in aldebaran_tables_init
  drm/amdgpu: fix AGP addressing when GART is not at 0
  drm/amd/display: update dcn315 lpddr pstate latency
  drm/amd/display: fix ABM disablement
  drm/amd/display: Fix black screen on video playback with embedded panel
  drm/amd/display: Fix conversions between bytes and KB
  drm/amdkfd: Use common function for IP version check
  drm/amd/display: Remove config update
  drm/amd/display: Update DCN35 clock table policy
  drm/amd/display: force toggle rate wa for first link training for a retimer
  drm/amdgpu: correct the amdgpu runtime dereference usage count
  drm/amd/display: Update min Z8 residency time to 2100 for DCN314
  ...
2023-12-02 08:18:59 +09:00
Linus Torvalds
c9a925b7bc Merge tag 'io_uring-6.7-2023-11-30' of git://git.kernel.dk/linux
Pull io_uring fixes from Jens Axboe:

 - Fix an issue with discontig page checking for IORING_SETUP_NO_MMAP

 - Fix an issue with not allowing IORING_SETUP_NO_MMAP also disallowing
   mmap'ed buffer rings

 - Fix an issue with deferred release of memory mapped pages

 - Fix a lockdep issue with IORING_SETUP_NO_MMAP

 - Use fget/fput consistently, even from our sync system calls. No real
   issue here, but if we were ever to allow closing io_uring descriptors
   it would be required. Let's play it safe and just use the full ref
   counted versions upfront. Most uses of io_uring are threaded anyway,
   and hence already doing the full version underneath.

* tag 'io_uring-6.7-2023-11-30' of git://git.kernel.dk/linux:
  io_uring: use fget/fput consistently
  io_uring: free io_buffer_list entries via RCU
  io_uring/kbuf: prune deferred locked cache when tearing down
  io_uring/kbuf: recycle freed mapped buffer ring entries
  io_uring/kbuf: defer release of mapped buffer rings
  io_uring: enable io_mem_alloc/free to be used in other parts
  io_uring: don't guard IORING_OFF_PBUF_RING with SETUP_NO_MMAP
  io_uring: don't allow discontig pages for IORING_SETUP_NO_MMAP
2023-12-02 06:47:32 +09:00
Linus Torvalds
ee0c8a9b34 Merge tag 'block-6.7-2023-12-01' of git://git.kernel.dk/linux
Pull block fixes from Jens Axboe:

 - NVMe pull request via Keith:
     - Invalid namespace identification error handling (Marizio Ewan,
       Keith)
     - Fabrics keep-alive tuning (Mark)

 - Fix for a bad error check regression in bcache (Markus)

 - Fix for a performance regression with O_DIRECT (Ming)

 - Fix for a flush related deadlock (Ming)

 - Make the read-only warn on per-partition (Yu)

* tag 'block-6.7-2023-12-01' of git://git.kernel.dk/linux:
  nvme-core: check for too small lba shift
  blk-mq: don't count completed flush data request as inflight in case of quiesce
  block: Document the role of the two attribute groups
  block: warn once for each partition in bio_check_ro()
  block: move .bd_inode into 1st cacheline of block_device
  nvme: check for valid nvme_identify_ns() before using it
  nvme-core: fix a memory leak in nvme_ns_info_from_identify()
  nvme: fine-tune sending of first keep-alive
  bcache: revert replacing IS_ERR_OR_NULL with IS_ERR
2023-12-02 06:39:30 +09:00
Linus Torvalds
abd792f330 Merge tag 'dm-6.7/dm-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper fixes from Mike Snitzer:

 - Fix DM verity target's FEC support to always initialize IO before it
   frees it. Also fix alignment of struct dm_verity_fec_io within the
   per-bio-data

 - Fix DM verity target to not FEC failed readahead IO

 - Update DM flakey target to use MAX_ORDER rather than MAX_ORDER - 1

* tag 'dm-6.7/dm-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm-flakey: start allocating with MAX_ORDER
  dm-verity: align struct dm_verity_fec_io properly
  dm verity: don't perform FEC for failed readahead IO
  dm verity: initialize fec io before freeing it
2023-12-02 06:32:29 +09:00
Linus Torvalds
ff4a9f4905 Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
 "Three small fixes, one in drivers.

  The core changes are to the internal representation of flags in
  scsi_devices which removes space wasting bools in favour of single bit
  flags and to add a flag to force a runtime resume which is used by ATA
  devices"

* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
  scsi: sd: Fix system start for ATA devices
  scsi: Change SCSI device boolean fields to single bit flags
  scsi: ufs: core: Clear cmd if abort succeeds in MCQ mode
2023-12-02 06:27:20 +09:00
Linus Torvalds
c1c09da07c Merge tag 'fs_for_v6.7-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull ext2 fix from Jan Kara:
 "Fix an ext2 bug introduced by changes in ext2 & iomap stepping on each
  other toes (apparently ext2 driver does not get much testing in
  linux-next)"

* tag 'fs_for_v6.7-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  ext2: Fix ki_pos update for DIO buffered-io fallback case
2023-12-02 06:19:27 +09:00
Linus Torvalds
e6861be452 Merge tag 'bcachefs-2023-11-29' of https://evilpiepirate.org/git/bcachefs
Pull more bcachefs bugfixes from Kent Overstreet:

 - bcache & bcachefs were broken with CFI enabled; patch for closures to
   fix type punning

 - mark erasure coding as extra-experimental; there are incompatible
   disk space accounting changes coming for erasure coding, and I'm
   still seeing checksum errors in some tests

 - several fixes for durability-related issues (durability is a device
   specific setting where we can tell bcachefs that data on a given
   device should be counted as replicated x times)

 - a fix for a rare livelock when a btree node merge then updates a
   parent node that is almost full

 - fix a race in the device removal path, where dropping a pointer in a
   btree node to a device would be clobbered by an in flight btree write
   updating the btree node key on completion

 - fix one SRCU lock hold time warning in the btree gc code - ther's
   still a bunch more of these to fix

 - fix a rare race where we'd start copygc before initializing the "are
   we rw" percpu refcount; copygc would think we were already ro and die
   immediately

* tag 'bcachefs-2023-11-29' of https://evilpiepirate.org/git/bcachefs: (23 commits)
  bcachefs: Extra kthread_should_stop() calls for copygc
  bcachefs: Convert gc_alloc_start() to for_each_btree_key2()
  bcachefs: Fix race between btree writes and metadata drop
  bcachefs: move journal seq assertion
  bcachefs: -EROFS doesn't count as move_extent_start_fail
  bcachefs: trace_move_extent_start_fail() now includes errcode
  bcachefs: Fix split_race livelock
  bcachefs: Fix bucket data type for stripe buckets
  bcachefs: Add missing validation for jset_entry_data_usage
  bcachefs: Fix zstd compress workspace size
  bcachefs: bpos is misaligned on big endian
  bcachefs: Fix ec + durability calculation
  bcachefs: Data update path won't accidentaly grow replicas
  bcachefs: deallocate_extra_replicas()
  bcachefs: Proper refcounting for journal_keys
  bcachefs: preserve device path as device name
  bcachefs: Fix an endianness conversion
  bcachefs: Start gc, copygc, rebalance threads after initing writes ref
  bcachefs: Don't stop copygc thread on device resize
  bcachefs: Make sure bch2_move_ratelimit() also waits for move_ops
  ...
2023-12-02 06:02:16 +09:00
Rafael J. Wysocki
7d4c44a53d Merge branch 'acpi-tables'
Merge a fix for a recently introduced build issue on ARM32 platforms
caused by an inadvertent header file breakage (Dave Jiang).

* acpi-tables:
  ACPI: Fix ARM32 platforms compile issue introduced by fw_table changes
2023-12-01 21:32:19 +01:00