Commit Graph

1352287 Commits

Author SHA1 Message Date
Linus Torvalds
eef0dc0bd4 Merge tag 'bcachefs-2025-04-24' of git://evilpiepirate.org/bcachefs
Pull bcachefs fixes from Kent Overstreet:

 - Case insensitive directories now work

 - Ciemap now correctly reports on unwritten pagecache data

 - bcachefs tools 1.25.1 was incorrectly picking unaligned bucket sizes;
   fix journal and write path bugs this uncovered

And assorted smaller fixes...

* tag 'bcachefs-2025-04-24' of git://evilpiepirate.org/bcachefs: (24 commits)
  bcachefs: Rework fiemap transaction restart handling
  bcachefs: add fiemap delalloc extent detection
  bcachefs: refactor fiemap processing into extent helper and struct
  bcachefs: track current fiemap offset in start variable
  bcachefs: drop duplicate fiemap sync flag
  bcachefs: Fix btree_iter_peek_prev() at end of inode
  bcachefs: Make btree_iter_peek_prev() assert more precise
  bcachefs: Unit test fixes
  bcachefs: Print mount opts earlier
  bcachefs: unlink: casefold d_invalidate
  bcachefs: Fix casefold lookups
  bcachefs: Casefold is now a regular opts.h option
  bcachefs: Implement fileattr_(get|set)
  bcachefs: Allocator now copes with unaligned buckets
  bcachefs: Start copygc, rebalance threads earlier
  bcachefs: Refactor bch2_run_recovery_passes()
  bcachefs: bch2_copygc_wakeup()
  bcachefs: Fix ref leak in write_super()
  bcachefs: Change __journal_entry_close() assert to ERO
  bcachefs: Ensure journal space is block size aligned
  ...
2025-04-25 09:06:14 -07:00
Alexei Starovoitov
6ae003adc0 Merge branch 'bpf-fix-softlock-condition-in-bpf-hashmap-interation'
Brandon Kammerdiener says:

====================
This patchset fixes an endless loop condition that can occur in
bpf_for_each_hash_elem, causing the core to softlock. My understanding is
that a combination of RCU list deletion and insertion introduces the new
element after the iteration cursor and that there is a chance that an RCU
reader may in fact use this new element in iteration. The patch uses a
_safe variant of the macro which gets the next element to iterate before
executing the loop body for the current element.

I have also added a subtest in the for_each selftest that can trigger this
condition without the fix.

Changes since v2:
- Renaming and additional checks in selftests/bpf/prog_tests/for_each.c

Changes since v1:
- Added missing Signed-off-by lines to both patches
====================

Acked-by: Hou Tao <houtao1@huawei.com>
Link: https://patch.msgid.link/20250424153246.141677-1-brandon.kammerdiener@intel.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-04-25 08:37:21 -07:00
Brandon Kammerdiener
3d9c463f95 selftests/bpf: add test for softlock when modifying hashmap while iterating
Add test that modifies the map while it's being iterated in such a way that
hangs the kernel thread unless the _safe fix is applied to
bpf_for_each_hash_elem.

Signed-off-by: Brandon Kammerdiener <brandon.kammerdiener@intel.com>
Link: https://lore.kernel.org/r/20250424153246.141677-3-brandon.kammerdiener@intel.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Hou Tao <houtao1@huawei.com>
2025-04-25 08:36:59 -07:00
Brandon Kammerdiener
75673fda0c bpf: fix possible endless loop in BPF map iteration
The _safe variant used here gets the next element before running the callback,
avoiding the endless loop condition.

Signed-off-by: Brandon Kammerdiener <brandon.kammerdiener@intel.com>
Link: https://lore.kernel.org/r/20250424153246.141677-2-brandon.kammerdiener@intel.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Hou Tao <houtao1@huawei.com>
2025-04-25 08:36:59 -07:00
Heikki Krogerus
3dfc044527 MAINTAINERS: Assign maintainer for the port controller drivers
Especially the port manager (tcpm.c) is so major driver that
it should have somebody watching over it who really
understands it, and the port controller interface in
general. Assigning Badhri as the designated reviewer and
restoring the status to Maintained from Orphan.

Signed-off-by: Heikki Krogerus <heikki.krogerus@linux.intel.com>
Cc: Badhri Jagan Sridharan <badhri@google.com>
Acked-by: Badhri Jagan Sridharan <badhri@google.com>
Link: https://lore.kernel.org/r/20250407133306.387576-1-heikki.krogerus@linux.intel.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-04-25 13:31:31 +02:00
Jan Kara
f520bed25d fs/xattr: Fix handling of AT_FDCWD in setxattrat(2) and getxattrat(2)
Currently, setxattrat(2) and getxattrat(2) are wrongly handling the
calls of the from setxattrat(AF_FDCWD, NULL, AT_EMPTY_PATH, ...) and
fail with -EBADF error instead of operating on CWD. Fix it.

Fixes: 6140be90ec ("fs/xattr: add *at family syscalls")
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/20250424132246.16822-2-jack@suse.cz
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-04-25 12:11:56 +02:00
Yangtao Li
1d28f25d6a MAINTAINERS: hfs/hfsplus: add myself as maintainer
I used to maintain Allwinner SoC cpufreq and thermal drivers and
have some work experience in the F2FS file system.

I volunteered to maintain the code together with Slava and Adrian.

Signed-off-by: Yangtao Li <frank.li@vivo.com>
Link: https://lore.kernel.org/20250423123423.2062619-1-frank.li@vivo.com
Acked-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-04-25 12:11:56 +02:00
T.J. Mercier
e6f141b332 splice: remove duplicate noinline from pipe_clear_nowait
pipe_clear_nowait has two noinline macros, but we only need one.

I checked the whole tree, and this is the only occurrence:

$ grep -r "noinline .* noinline"
fs/splice.c:static noinline void noinline pipe_clear_nowait(struct file *file)
$

Fixes: 0f99fc513d ("splice: clear FMODE_NOWAIT on file if splice/vmsplice is used")
Signed-off-by: "T.J. Mercier" <tjmercier@google.com>
Link: https://lore.kernel.org/20250423180025.2627670-1-tjmercier@google.com
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-04-25 12:11:56 +02:00
Christoph Hellwig
e079d7c4db devtmpfs: don't use vfs_getattr_nosec to query i_mode
The recent move of the bdev_statx call to the low-level vfs_getattr_nosec
helper caused it being used by devtmpfs, which leads to deadlocks in
md teardown due to the block device lookup and put interfering with the
unusual lifetime rules in md.

But as handle_remove only works on inodes created and owned by devtmpfs
itself there is no need to use vfs_getattr_nosec vs simply reading the
mode from the inode directly.  Switch to that to avoid the bdev lookup
or any other unintentional side effect.

Reported-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Reported-by: Xiao Ni <xni@redhat.com>
Fixes: 777d0961ff ("fs: move the bdex_statx call to vfs_getattr_nosec")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/20250423045941.1667425-1-hch@lst.de
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Xiao Ni <xni@redhat.com>
Tested-by: Ayush Jain <Ayush.jain3@amd.com>
Tested-by: Heiko Carstens <hca@linux.ibm.com>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-04-25 12:11:45 +02:00
Ming Lei
f40139fde5 ublk: fix race between io_uring_cmd_complete_in_task and ublk_cancel_cmd
ublk_cancel_cmd() calls io_uring_cmd_done() to complete uring_cmd, but
we may have scheduled task work via io_uring_cmd_complete_in_task() for
dispatching request, then kernel crash can be triggered.

Fix it by not trying to canceling the command if ublk block request is
started.

Fixes: 216c8f5ef0 ("ublk: replace monitor with cancelable uring_cmd")
Reported-by: Jared Holzman <jholzman@nvidia.com>
Tested-by: Jared Holzman <jholzman@nvidia.com>
Closes: https://lore.kernel.org/linux-block/d2179120-171b-47ba-b664-23242981ef19@nvidia.com/
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250425013742.1079549-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-24 19:52:20 -06:00
Ming Lei
d6aa0c178b ublk: call ublk_dispatch_req() for handling UBLK_U_IO_NEED_GET_DATA
We call io_uring_cmd_complete_in_task() to schedule task_work for handling
UBLK_U_IO_NEED_GET_DATA.

This way is really not necessary because the current context is exactly
the ublk queue context, so call ublk_dispatch_req() directly for handling
UBLK_U_IO_NEED_GET_DATA.

Fixes: 216c8f5ef0 ("ublk: replace monitor with cancelable uring_cmd")
Tested-by: Jared Holzman <jholzman@nvidia.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250425013742.1079549-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-24 19:52:20 -06:00
Kent Overstreet
d1b0f9aa73 bcachefs: Rework fiemap transaction restart handling
Restart handling in the previous patch was incorrect, so: move btree
operations into a separate helper, and run it with a lockrestart_do().

Additionally, clarify whether pagecache or the btree takes precedence.

Right now, the btree takes precedence: this is incorrect, but it's
needed to pass fstests. Add a giant comment explaining why.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-24 19:10:29 -04:00
Brian Foster
b9b0494017 bcachefs: add fiemap delalloc extent detection
bcachefs currently populates fiemap data from the extents btree.
This works correctly when the fiemap sync flag is provided, but if
not, it skips all delalloc extents that have not yet been flushed.
This is because delalloc extents from buffered writes are first
stored as reservation in the pagecache, and only become resident in
the extents btree after writeback completes.

Update the fiemap implementation to process holes between extents by
scanning pagecache for data, via seek data/hole. If a valid data
range is found over a hole in the extent btree, fake up an extent
key and flag the extent as delalloc for reporting to userspace.

Note that this does not necessarily change behavior for the case
where there is dirty pagecache over already written extents, where
when in COW mode, writeback will allocate new blocks for the
underlying ranges. The existing behavior is consistent with btrfs
and it is recommended to use the sync flag for the most up to date
extent state from fiemap.

Signed-off-by: Brian Foster <bfoster@redhat.com>
2025-04-24 19:10:29 -04:00
Brian Foster
2d55a63709 bcachefs: refactor fiemap processing into extent helper and struct
The bulk of the loop in bch2_fiemap() involves processing the
current extent key from the iter, including following indirections
and trimming the extent size and such. This patch makes a few
changes to reduce the size of the loop and facilitate future changes
to support delalloc extents.

Define a new bch_fiemap_extent structure to wrap the bkey buffer
that holds the extent key to report to userspace along with
associated fiemap flags. Update bch2_fill_extent() to take the
bch_fiemap_extent as a param instead of the individual fields.
Finally, lift the bulk of the extent processing into a
bch2_fiemap_extent() helper that takes the current key and formats
the bch_fiemap_extent appropriately for the fill function.

No functional changes intended by this patch.

Signed-off-by: Brian Foster <bfoster@redhat.com>
2025-04-24 19:10:29 -04:00
Brian Foster
d020a9fb11 bcachefs: track current fiemap offset in start variable
Signed-off-by: Brian Foster <bfoster@redhat.com>
2025-04-24 19:10:28 -04:00
Brian Foster
28d2d19ccc bcachefs: drop duplicate fiemap sync flag
FIEMAP_FLAG_SYNC handling was deliberately moved into core code in
commit 45dd052e67 ("fs: handle FIEMAP_FLAG_SYNC in fiemap_prep"),
released in kernel v5.8. Update bcachefs accordingly.

Signed-off-by: Brian Foster <bfoster@redhat.com>
2025-04-24 19:10:28 -04:00
Kent Overstreet
353739f1d1 bcachefs: Fix btree_iter_peek_prev() at end of inode
At the end of the inode, on an extents iterator, peek_slot() has to
advance to the next position to avoid returning a 0 size extent, which
is not allowed.

Changing iter->pos confuses peek_prev(), but we don't need to call
peek_slot() in this case.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-24 19:09:52 -04:00
Kent Overstreet
c4f89a1d35 bcachefs: Make btree_iter_peek_prev() assert more precise
The issue this assert is guarding against is that in
BTREE_ITER_filter_snapshots mode we only want to be iterating within a
single inode number - if we iterate into another inode number with keys
for a different snapshot tree, we'll loop arbitrarily long before
finding a key we can return.

This comes up in the unit tests, where we're using inode 0 for our test
keys.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-24 19:09:52 -04:00
Kent Overstreet
394ef278e1 bcachefs: Unit test fixes
The peek_end() tests expect an empty btree.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-24 19:09:52 -04:00
Kent Overstreet
caab547686 bcachefs: Print mount opts earlier
If we aren't mounting with the correct degraded option, it's helpful to
know that before we fail to mount degraded.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-24 19:09:52 -04:00
Kent Overstreet
7cb85324c4 bcachefs: unlink: casefold d_invalidate
casefolding results in additional aliases on lookup for the
non-casefolded names - these need invalidating on unlink.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-24 19:09:52 -04:00
Kent Overstreet
9cdde3c7aa bcachefs: Fix casefold lookups
Add casefolding to bch2_lookup_trans:

During the delay between when casefolding was written and when it was
merged, the main filesystem lookup path grew self healing - which meant
it was no longer using bch2_dirent_lookup_trans(), where casefolding on
lookups happens.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-24 19:09:52 -04:00
Kent Overstreet
b9e1f873d2 bcachefs: Casefold is now a regular opts.h option
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-24 19:09:00 -04:00
Björn Töpel
7d1d19a11c riscv: uprobes: Add missing fence.i after building the XOL buffer
The XOL (execute out-of-line) buffer is used to single-step the
replaced instruction(s) for uprobes. The RISC-V port was missing a
proper fence.i (i$ flushing) after constructing the XOL buffer, which
can result in incorrect execution of stale/broken instructions.

This was found running the BPF selftests "test_progs:
uprobe_autoattach, attach_probe" on the Spacemit K1/X60, where the
uprobes tests randomly blew up.

Reviewed-by: Guo Ren <guoren@kernel.org>
Fixes: 74784081aa ("riscv: Add uprobes supported")
Signed-off-by: Björn Töpel <bjorn@rivosinc.com>
Link: https://lore.kernel.org/r/20250419111402.1660267-2-bjorn@kernel.org
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2025-04-24 13:20:02 -07:00
Björn Töpel
121f34341d riscv: Replace function-like macro by static inline function
The flush_icache_range() function is implemented as a "function-like
macro with unused parameters", which can result in "unused variables"
warnings.

Replace the macro with a static inline function, as advised by
Documentation/process/coding-style.rst.

Fixes: 08f051eda3 ("RISC-V: Flush I$ when making a dirty page executable")
Signed-off-by: Björn Töpel <bjorn@rivosinc.com>
Link: https://lore.kernel.org/r/20250419111402.1660267-1-bjorn@kernel.org
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2025-04-24 13:20:01 -07:00
Linus Torvalds
02ddfb981d Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
 "The single core change is an obvious bug fix (and falls within the LF
  guidelines for patches from sanctioned entities). The other driver
  changes are a bit larger but likewise pretty obvious"

* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
  scsi: mpi3mr: Add level check to control event logging
  scsi: ufs: core: Add NULL check in ufshcd_mcq_compl_pending_transfer()
  scsi: core: Clear flags for scsi_cmnd that did not complete
  scsi: ufs: Introduce quirk to extend PA_HIBERN8TIME for UFS devices
  scsi: ufs: qcom: Add quirks for Samsung UFS devices
  scsi: target: iscsi: Fix timeout on deleted connection
  scsi: mpi3mr: Reset the pending interrupt flag
  scsi: mpi3mr: Fix pending I/O counter
  scsi: ufs: mcq: Add NULL check in ufshcd_mcq_abort()
2025-04-24 13:01:31 -07:00
Linus Torvalds
30e268185e Merge tag 'landlock-6.15-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/mic/linux
Pull landlock fixes from Mickaël Salaün:
 "Fix some Landlock audit issues, add related tests, and updates
  documentation"

* tag 'landlock-6.15-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/mic/linux:
  landlock: Update log documentation
  landlock: Fix documentation for landlock_restrict_self(2)
  landlock: Fix documentation for landlock_create_ruleset(2)
  selftests/landlock: Add PID tests for audit records
  selftests/landlock: Factor out audit fixture in audit_test
  landlock: Log the TGID of the domain creator
  landlock: Remove incorrect warning
2025-04-24 12:59:05 -07:00
Kirill A. Shutemov
85fd85bc02 x86/insn: Fix CTEST instruction decoding
insn_decoder_test found a problem with decoding APX CTEST instructions:

	Found an x86 instruction decoder bug, please report this.
	ffffffff810021df	62 54 94 05 85 ff    	ctestneq
	objdump says 6 bytes, but insn_get_length() says 5

It happens because x86-opcode-map.txt doesn't specify arguments for the
instruction and the decoder doesn't expect to see ModRM byte.

Fixes: 690ca3a306 ("x86/insn: Add support for APX EVEX instructions to the opcode map")
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: stable@vger.kernel.org # v6.10+
Link: https://lore.kernel.org/r/20250423065815.2003231-1-kirill.shutemov@linux.intel.com
2025-04-24 20:19:17 +02:00
Luo Gengkun
1a97fea9db perf/x86: Fix non-sampling (counting) events on certain x86 platforms
Perf doesn't work at perf stat for hardware events on certain x86 platforms:

 $perf stat -- sleep 1
 Performance counter stats for 'sleep 1':
             16.44 msec task-clock                       #    0.016 CPUs utilized
                 2      context-switches                 #  121.691 /sec
                 0      cpu-migrations                   #    0.000 /sec
                54      page-faults                      #    3.286 K/sec
   <not supported>	cycles
   <not supported>	instructions
   <not supported>	branches
   <not supported>	branch-misses

The reason is that the check in x86_pmu_hw_config() for sampling events is
unexpectedly applied to counting events as well.

It should only impact x86 platforms with limit_period used for non-PEBS
events. For Intel platforms, it should only impact some older platforms,
e.g., HSW, BDW and NHM.

Fixes: 88ec7eedbb ("perf/x86: Fix low freqency setting issue")
Signed-off-by: Luo Gengkun <luogengkun@huaweicloud.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Kan Liang <kan.liang@linux.intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ravi Bangoria <ravi.bangoria@amd.com>
Link: https://lore.kernel.org/r/20250423064724.3716211-1-luogengkun@huaweicloud.com
2025-04-24 20:15:04 +02:00
Paolo Bonzini
2d7124941a Merge tag 'kvmarm-fixes-6.15-2' of https://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 fixes for 6.15, round #2

 - Single fix for broken usage of 'multi-MIDR' infrastructure in PI
   code, adding an open-coded erratum check for everyone's favorite pile
   of sand: Cavium ThunderX
2025-04-24 13:28:53 -04:00
Jens Axboe
edd43f4d6f io_uring: fix 'sync' handling of io_fallback_tw()
A previous commit added a 'sync' parameter to io_fallback_tw(), which if
true, means the caller wants to wait on the fallback thread handling it.
But the logic is somewhat messed up, ensure that ctxs are swapped and
flushed appropriately.

Cc: stable@vger.kernel.org
Fixes: dfbe5561ae ("io_uring: flush offloaded and delayed task_work on exit")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-24 10:32:43 -06:00
Ard Biesheuvel
032ce1ea94 x86/boot: Work around broken busybox 'truncate' tool
The GNU coreutils version of truncate, which is the original, accepts a
% prefix for the -s size argument which means the file in question
should be padded to a multiple of the given size. This is currently used
to pad the setup block of bzImage to a multiple of 4k before appending
the decompressor.

busybox reimplements truncate but does not support this idiom, and
therefore fails the build since commit

  9c54baab44 ("x86/boot: Drop CRC-32 checksum and the build tool that generates it")

Since very little build code within the kernel depends on the 'truncate'
utility, work around this incompatibility by avoiding truncate altogether,
and relying on dd to perform the padding.

Fixes: 9c54baab44 ("x86/boot: Drop CRC-32 checksum and the build tool that generates it")
Reported-by: <phasta@kernel.org>
Tested-by: Philipp Stanner <phasta@kernel.org>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250424101917.1552527-2-ardb+git@google.com
2025-04-24 18:23:27 +02:00
Linus Torvalds
e72e9e6933 Merge tag 'net-6.15-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
 "No fixes from any subtree.

  Current release - regressions:

   - net: fix the missing unlock for detached devices

  Previous releases - regressions:

   - sched: fix UAF vulnerability in HFSC qdisc

   - lwtunnel: disable BHs when required

   - mptcp: pm: defer freeing of MPTCP userspace path manager entries

   - tipc: fix NULL pointer dereference in tipc_mon_reinit_self()

   - eth: virtio-net: disable delayed refill when pausing rx

  Previous releases - always broken:

   - phylink: fix suspend/resume with WoL enabled and link down

   - eth:
       - mlx5: fix null-ptr-deref in mlx5_create_{inner_,}ttc_table()
       - xen-netfront: handle NULL returned by xdp_convert_buff_to_frame()
       - enetc: fix frame corruption on bpf_xdp_adjust_head/tail() and XDP_PASS
       - stmmac: fix dwmac1000 ptp timestamp status offset
       - pds_core: prevent possible adminq overflow/stuck condition

  Misc:

   - a bunch of MAINTAINERS updates"

* tag 'net-6.15-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (32 commits)
  net: stmmac: fix multiplication overflow when reading timestamp
  net: stmmac: fix dwmac1000 ptp timestamp status offset
  net: dp83822: Fix OF_MDIO config check
  pds_core: make wait_context part of q_info
  pds_core: Remove unnecessary check in pds_client_adminq_cmd()
  pds_core: handle unsupported PDS_CORE_CMD_FW_CONTROL result
  pds_core: Prevent possible adminq overflow/stuck condition
  net: dsa: mt7530: sync driver-specific behavior of MT7531 variants
  selftests/tc-testing: Add test for HFSC queue emptying during peek operation
  net_sched: hfsc: Fix a potential UAF in hfsc_dequeue() too
  net_sched: hfsc: Fix a UAF vulnerability in class handling
  selftests: mptcp: diag: use mptcp_lib_get_info_value
  mptcp: pm: Defer freeing of MPTCP userspace path manager entries
  net: ethernet: mtk_eth_soc: net: revise NETSYSv3 hardware configuration
  tipc: fix NULL pointer dereference in tipc_mon_reinit_self()
  virtio-net: disable delayed refill when pausing rx
  net: phy: leds: fix memory leak
  net: phylink: mac_link_(up|down)() clarifications
  net: phylink: fix suspend/resume with WoL enabled and link down
  net: lwtunnel: disable BHs when required
  ...
2025-04-24 09:14:50 -07:00
Linus Torvalds
288537d9c9 Merge tag 'v6.15-p5' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto fixes from Herbert Xu:

 - Revert acomp multibuffer tests which were buggy

 - Fix off-by-one regression in new scomp code

 - Lower quality setting on atmel-sha204a as it may not be random

* tag 'v6.15-p5' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
  crypto: atmel-sha204a - Set hwrng quality to lowest possible
  crypto: scomp - Fix off-by-one bug when calculating last page
  Revert "crypto: testmgr - Add multibuffer acomp testing"
2025-04-24 09:10:01 -07:00
Adrian Hunter
38e93267ca KVM: x86: Do not use kvm_rip_read() unconditionally for KVM_PROFILING
Not all VMs allow access to RIP.  Check guest_state_protected before
calling kvm_rip_read().

This avoids, for example, hitting WARN_ON_ONCE in vt_cache_reg() for
TDX VMs.

Fixes: 81bf912b2c ("KVM: TDX: Implement TDX vcpu enter/exit path")
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Message-ID: <20250415104821.247234-3-adrian.hunter@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-04-24 09:52:32 -04:00
Adrian Hunter
ca4f113b0b KVM: x86: Do not use kvm_rip_read() unconditionally in KVM tracepoints
Not all VMs allow access to RIP.  Check guest_state_protected before
calling kvm_rip_read().

This avoids, for example, hitting WARN_ON_ONCE in vt_cache_reg() for
TDX VMs.

Fixes: 81bf912b2c ("KVM: TDX: Implement TDX vcpu enter/exit path")
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Message-ID: <20250415104821.247234-2-adrian.hunter@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-04-24 09:52:31 -04:00
Sean Christopherson
268cbfe65b KVM: SVM: WARN if an invalid posted interrupt IRTE entry is added
Now that the AMD IOMMU doesn't signal success incorrectly, WARN if KVM
attempts to track an AMD IRTE entry without metadata.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20250404193923.1413163-8-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-04-24 09:52:31 -04:00
Sean Christopherson
aae251a380 iommu/amd: WARN if KVM attempts to set vCPU affinity without posted intrrupts
WARN if KVM attempts to set vCPU affinity when posted interrupts aren't
enabled, as KVM shouldn't try to enable posting when they're unsupported,
and the IOMMU driver darn well should only advertise posting support when
AMD_IOMMU_GUEST_IR_VAPIC() is true.

Note, KVM consumes is_guest_mode only on success.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20250404193923.1413163-7-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-04-24 09:52:31 -04:00
Sean Christopherson
07172206a2 iommu/amd: Return an error if vCPU affinity is set for non-vCPU IRTE
Return -EINVAL instead of success if amd_ir_set_vcpu_affinity() is
invoked without use_vapic; lying to KVM about whether or not the IRTE was
configured to post IRQs is all kinds of bad.

Fixes: d98de49a53 ("iommu/amd: Enable vAPIC interrupt remapping mode by default")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20250404193923.1413163-6-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-04-24 09:52:31 -04:00
Sean Christopherson
f1fb088d9c KVM: x86: Take irqfds.lock when adding/deleting IRQ bypass producer
Take irqfds.lock when adding/deleting an IRQ bypass producer to ensure
irqfd->producer isn't modified while kvm_irq_routing_update() is running.
The only lock held when a producer is added/removed is irqbypass's mutex.

Fixes: 8727688006 ("KVM: x86: select IRQ_BYPASS_MANAGER")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20250404193923.1413163-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-04-24 09:52:31 -04:00
Sean Christopherson
bcda70c56f KVM: x86: Explicitly treat routing entry type changes as changes
Explicitly treat type differences as GSI routing changes, as comparing MSI
data between two entries could get a false negative, e.g. if userspace
changed the type but left the type-specific data as-is.

Fixes: 515a0c79e7 ("kvm: irqfd: avoid update unmodified entries of the routing")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20250404193923.1413163-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-04-24 09:52:31 -04:00
Sean Christopherson
9bcac97dc4 KVM: x86: Reset IRTE to host control if *new* route isn't postable
Restore an IRTE back to host control (remapped or posted MSI mode) if the
*new* GSI route prevents posting the IRQ directly to a vCPU, regardless of
the GSI routing type.  Updating the IRTE if and only if the new GSI is an
MSI results in KVM leaving an IRTE posting to a vCPU.

The dangling IRTE can result in interrupts being incorrectly delivered to
the guest, and in the worst case scenario can result in use-after-free,
e.g. if the VM is torn down, but the underlying host IRQ isn't freed.

Fixes: efc644048e ("KVM: x86: Update IRTE for posted-interrupts")
Fixes: 411b44ba80 ("svm: Implements update_pi_irte hook to setup posted interrupt")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20250404193923.1413163-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-04-24 09:52:31 -04:00
Sean Christopherson
7537deda36 KVM: SVM: Allocate IR data using atomic allocation
Allocate SVM's interrupt remapping metadata using GFP_ATOMIC as
svm_ir_list_add() is called with IRQs are disabled and irqfs.lock held
when kvm_irq_routing_update() reacts to GSI routing changes.

Fixes: 411b44ba80 ("svm: Implements update_pi_irte hook to setup posted interrupt")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20250404193923.1413163-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-04-24 09:52:31 -04:00
Sean Christopherson
6560aff981 KVM: SVM: Don't update IRTEs if APICv/AVIC is disabled
Skip IRTE updates if AVIC is disabled/unsupported, as forcing the IRTE
into remapped mode (kvm_vcpu_apicv_active() will never be true) is
unnecessary and wasteful.  The IOMMU driver is responsible for putting
IRTEs into remapped mode when an IRQ is allocated by a device, long before
that device is assigned to a VM.  I.e. the kernel as a whole has major
issues if the IRTE isn't already in remapped mode.

Opportunsitically kvm_arch_has_irq_bypass() to query for APICv/AVIC, so
so that all checks in KVM x86 incorporate the same information.

Cc: Yosry Ahmed <yosry.ahmed@linux.dev>
Cc: Jim Mattson <jmattson@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20250401161804.842968-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-04-24 09:52:31 -04:00
Paolo Bonzini
5f9e169814 KVM: arm64, x86: make kvm_arch_has_irq_bypass() inline
kvm_arch_has_irq_bypass() is a small function and even though it does
not appear in any *really* hot paths, it's also not entirely rare.
Make it inline---it also works out nicely in preparation for using it in
kvm-intel.ko and kvm-amd.ko, since the function is not currently exported.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-04-24 09:46:58 -04:00
Christoph Hellwig
c4d2519c6a block: don't autoload drivers on blk-cgroup configuration
Loading a driver just to configure blk-cgroup doesn't make sense, as that
assumes and already existing device.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20250423053810.1683309-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-24 07:35:23 -06:00
Christoph Hellwig
5f33b5226c block: don't autoload drivers on stat
blkdev_get_no_open can trigger the legacy autoload of block drivers.  A
simple stat of a block device has not historically done that, so disable
this behavior again.

Fixes: 9abcfbd235 ("block: Add atomic write support for statx")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20250423053810.1683309-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-24 07:35:23 -06:00
Christoph Hellwig
d13b7090b2 block: remove the backing_inode variable in bdev_statx
backing_inode is only used once, so remove it and update the comment
describing the bdev lookup to be a bit more clear.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20250423053810.1683309-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-24 07:35:09 -06:00
Christoph Hellwig
c63202140d block: move blkdev_{get,put} _no_open prototypes out of blkdev.h
These are only to be used by block internal code.  Remove the comment
as we grew more users due to reworking block device node opening.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20250423053810.1683309-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-24 07:33:38 -06:00
Christoph Hellwig
7b720c7202 block: never reduce ra_pages in blk_apply_bdi_limits
When the user increased the read-ahead size through sysfs this value
currently get lost if the device is reprobe, including on a resume
from suspend.

As there is no hardware limitation for the read-ahead size there is
no real need to reset it or track a separate hardware limitation
like for max_sectors.

This restores the pre-atomic queue limit behavior in the sd driver as
sd did not use blk_queue_io_opt and thus never updated the read ahead
size to the value based of the optimal I/O, but changes behavior for
all other drivers.  As the new behavior seems useful and sd is the
driver for which the readahead size tweaks are most useful that seems
like a worthwhile trade off.

Fixes: 804e498e04 ("sd: convert to the atomic queue limits API")
Reported-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20250424082521.1967286-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-24 07:32:17 -06:00