Commit Graph

1368207 Commits

Author SHA1 Message Date
Kent Overstreet
6447544c3d bcachefs: Improve error printing in btree_node_check_topology()
We had a bug report where the errors from btree_node_check_topology()
don't seem to be getting printed; log_fsck_err() does some fancy
ratelimiting-type stuff that we don't want here.

Instead, just use bch2_count_fsck_err(); this is simpler, and modelled
after how we're currently handling bucket ref update errors in
buckets.c.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-31 22:03:17 -04:00
Kent Overstreet
f402d9710b bcachefs: bch2_readdir() now calls str_hash_check_key()
More self healing code: readdir will now notice if there are dirents
hashed incorrectly, and it'll repair them if errors=fix_safe.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-31 22:03:17 -04:00
Kent Overstreet
a592268260 bcachefs: bch2_str_hash_check_key() may now be called without snapshots_seen
We don't track snapshot overwrites outside of fsck, so for this to be
called at runtime outside of fsck we need to create it on demand, when
we have repair to do.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-31 22:03:17 -04:00
Kent Overstreet
cb6f5d0dec bcachefs: __bch2_insert_snapshot_whiteouts() refactoring
Now uses bch2_get_snapshot_overwrites(), and much shorter.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-31 22:03:17 -04:00
Kent Overstreet
801cb2bd6c bcachefs: bch2_get_snapshot_overwrites()
New helper for getting a list of snapshot IDs that have overwritten a
given key.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-31 22:03:17 -04:00
Kent Overstreet
d21262d4e3 bcachefs: bch2_dev_journal_bucket_delete()
Recover from "journal and btree in same bucket".

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-31 22:03:17 -04:00
Kent Overstreet
0224d17d76 bcachefs: Runtime self healing for keys for deleted snapshots
If snapshot deletion incorrectly missing some keys and leaves keys for
deleted snapshots, that causes a bit of a problem for data move - we
can't move an extent for a nonexistent snapshot, because the extent
might have to be fragmented, and maintaining correct visibility in child
snapshots doesn't work if it doesn't have a snapshot.

Previously we'd just skip these keys, but it turns out that causes
copygc to spin.

So we need runtime self healing, i.e. calling check_key_has_snapshot()
from the data move path.

Snapshot deletion v2 included sentinal values for deleted snapshot
nodes, so this is quite safe.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-31 22:03:17 -04:00
Kent Overstreet
f02d153274 bcachefs: Don't unlock trans before data_update_init()
data_update_init() does need to do btree operations, delay doing the
unlock-before-io.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-31 22:03:17 -04:00
Kent Overstreet
642c1aabb0 bcachefs: Use bch2_err_matches() for BCH_ERR_fsck_(fix|ignore)
We'll be adding subtypes of these errors, and new error code tracing.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-31 22:03:16 -04:00
Linus Torvalds
00c010e130 Merge tag 'mm-stable-2025-05-31-14-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:

 - "Add folio_mk_pte()" from Matthew Wilcox simplifies the act of
   creating a pte which addresses the first page in a folio and reduces
   the amount of plumbing which architecture must implement to provide
   this.

 - "Misc folio patches for 6.16" from Matthew Wilcox is a shower of
   largely unrelated folio infrastructure changes which clean things up
   and better prepare us for future work.

 - "memory,x86,acpi: hotplug memory alignment advisement" from Gregory
   Price adds early-init code to prevent x86 from leaving physical
   memory unused when physical address regions are not aligned to memory
   block size.

 - "mm/compaction: allow more aggressive proactive compaction" from
   Michal Clapinski provides some tuning of the (sadly, hard-coded (more
   sadly, not auto-tuned)) thresholds for our invokation of proactive
   compaction. In a simple test case, the reduction of a guest VM's
   memory consumption was dramatic.

 - "Minor cleanups and improvements to swap freeing code" from Kemeng
   Shi provides some code cleaups and a small efficiency improvement to
   this part of our swap handling code.

 - "ptrace: introduce PTRACE_SET_SYSCALL_INFO API" from Dmitry Levin
   adds the ability for a ptracer to modify syscalls arguments. At this
   time we can alter only "system call information that are used by
   strace system call tampering, namely, syscall number, syscall
   arguments, and syscall return value.

   This series should have been incorporated into mm.git's "non-MM"
   branch, but I goofed.

 - "fs/proc: extend the PAGEMAP_SCAN ioctl to report guard regions" from
   Andrei Vagin extends the info returned by the PAGEMAP_SCAN ioctl
   against /proc/pid/pagemap. This permits CRIU to more efficiently get
   at the info about guard regions.

 - "Fix parameter passed to page_mapcount_is_type()" from Gavin Shan
   implements that fix. No runtime effect is expected because
   validate_page_before_insert() happens to fix up this error.

 - "kernel/events/uprobes: uprobe_write_opcode() rewrite" from David
   Hildenbrand basically brings uprobe text poking into the current
   decade. Remove a bunch of hand-rolled implementation in favor of
   using more current facilities.

 - "mm/ptdump: Drop assumption that pxd_val() is u64" from Anshuman
   Khandual provides enhancements and generalizations to the pte dumping
   code. This might be needed when 128-bit Page Table Descriptors are
   enabled for ARM.

 - "Always call constructor for kernel page tables" from Kevin Brodsky
   ensures that the ctor/dtor is always called for kernel pgtables, as
   it already is for user pgtables.

   This permits the addition of more functionality such as "insert hooks
   to protect page tables". This change does result in various
   architectures performing unnecesary work, but this is fixed up where
   it is anticipated to occur.

 - "Rust support for mm_struct, vm_area_struct, and mmap" from Alice
   Ryhl adds plumbing to permit Rust access to core MM structures.

 - "fix incorrectly disallowed anonymous VMA merges" from Lorenzo
   Stoakes takes advantage of some VMA merging opportunities which we've
   been missing for 15 years.

 - "mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE" from
   SeongJae Park optimizes process_madvise()'s TLB flushing.

   Instead of flushing each address range in the provided iovec, we
   batch the flushing across all the iovec entries. The syscall's cost
   was approximately halved with a microbenchmark which was designed to
   load this particular operation.

 - "Track node vacancy to reduce worst case allocation counts" from
   Sidhartha Kumar makes the maple tree smarter about its node
   preallocation.

   stress-ng mmap performance increased by single-digit percentages and
   the amount of unnecessarily preallocated memory was dramaticelly
   reduced.

 - "mm/gup: Minor fix, cleanup and improvements" from Baoquan He removes
   a few unnecessary things which Baoquan noted when reading the code.

 - ""Enhance sysfs handling for memory hotplug in weighted interleave"
   from Rakie Kim "enhances the weighted interleave policy in the memory
   management subsystem by improving sysfs handling, fixing memory
   leaks, and introducing dynamic sysfs updates for memory hotplug
   support". Fixes things on error paths which we are unlikely to hit.

 - "mm/damon: auto-tune DAMOS for NUMA setups including tiered memory"
   from SeongJae Park introduces new DAMOS quota goal metrics which
   eliminate the manual tuning which is required when utilizing DAMON
   for memory tiering.

 - "mm/vmalloc.c: code cleanup and improvements" from Baoquan He
   provides cleanups and small efficiency improvements which Baoquan
   found via code inspection.

 - "vmscan: enforce mems_effective during demotion" from Gregory Price
   changes reclaim to respect cpuset.mems_effective during demotion when
   possible. because presently, reclaim explicitly ignores
   cpuset.mems_effective when demoting, which may cause the cpuset
   settings to violated.

   This is useful for isolating workloads on a multi-tenant system from
   certain classes of memory more consistently.

 - "Clean up split_huge_pmd_locked() and remove unnecessary folio
   pointers" from Gavin Guo provides minor cleanups and efficiency gains
   in in the huge page splitting and migrating code.

 - "Use kmem_cache for memcg alloc" from Huan Yang creates a slab cache
   for `struct mem_cgroup', yielding improved memory utilization.

 - "add max arg to swappiness in memory.reclaim and lru_gen" from
   Zhongkun He adds a new "max" argument to the "swappiness=" argument
   for memory.reclaim MGLRU's lru_gen.

   This directs proactive reclaim to reclaim from only anon folios
   rather than file-backed folios.

 - "kexec: introduce Kexec HandOver (KHO)" from Mike Rapoport is the
   first step on the path to permitting the kernel to maintain existing
   VMs while replacing the host kernel via file-based kexec. At this
   time only memblock's reserve_mem is preserved.

 - "mm: Introduce for_each_valid_pfn()" from David Woodhouse provides
   and uses a smarter way of looping over a pfn range. By skipping
   ranges of invalid pfns.

 - "sched/numa: Skip VMA scanning on memory pinned to one NUMA node via
   cpuset.mems" from Libo Chen removes a lot of pointless VMA scanning
   when a task is pinned a single NUMA mode.

   Dramatic performance benefits were seen in some real world cases.

 - "JFS: Implement migrate_folio for jfs_metapage_aops" from Shivank
   Garg addresses a warning which occurs during memory compaction when
   using JFS.

 - "move all VMA allocation, freeing and duplication logic to mm" from
   Lorenzo Stoakes moves some VMA code from kernel/fork.c into the more
   appropriate mm/vma.c.

 - "mm, swap: clean up swap cache mapping helper" from Kairui Song
   provides code consolidation and cleanups related to the folio_index()
   function.

 - "mm/gup: Cleanup memfd_pin_folios()" from Vishal Moola does that.

 - "memcg: Fix test_memcg_min/low test failures" from Waiman Long
   addresses some bogus failures which are being reported by the
   test_memcontrol selftest.

 - "eliminate mmap() retry merge, add .mmap_prepare hook" from Lorenzo
   Stoakes commences the deprecation of file_operations.mmap() in favor
   of the new file_operations.mmap_prepare().

   The latter is more restrictive and prevents drivers from messing with
   things in ways which, amongst other problems, may defeat VMA merging.

 - "memcg: decouple memcg and objcg stocks"" from Shakeel Butt decouples
   the per-cpu memcg charge cache from the objcg's one.

   This is a step along the way to making memcg and objcg charging
   NMI-safe, which is a BPF requirement.

 - "mm/damon: minor fixups and improvements for code, tests, and
   documents" from SeongJae Park is yet another batch of miscellaneous
   DAMON changes. Fix and improve minor problems in code, tests and
   documents.

 - "memcg: make memcg stats irq safe" from Shakeel Butt converts memcg
   stats to be irq safe. Another step along the way to making memcg
   charging and stats updates NMI-safe, a BPF requirement.

 - "Let unmap_hugepage_range() and several related functions take folio
   instead of page" from Fan Ni provides folio conversions in the
   hugetlb code.

* tag 'mm-stable-2025-05-31-14-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (285 commits)
  mm: pcp: increase pcp->free_count threshold to trigger free_high
  mm/hugetlb: convert use of struct page to folio in __unmap_hugepage_range()
  mm/hugetlb: refactor __unmap_hugepage_range() to take folio instead of page
  mm/hugetlb: refactor unmap_hugepage_range() to take folio instead of page
  mm/hugetlb: pass folio instead of page to unmap_ref_private()
  memcg: objcg stock trylock without irq disabling
  memcg: no stock lock for cpu hot-unplug
  memcg: make __mod_memcg_lruvec_state re-entrant safe against irqs
  memcg: make count_memcg_events re-entrant safe against irqs
  memcg: make mod_memcg_state re-entrant safe against irqs
  memcg: move preempt disable to callers of memcg_rstat_updated
  memcg: memcg_rstat_updated re-entrant safe against irqs
  mm: khugepaged: decouple SHMEM and file folios' collapse
  selftests/eventfd: correct test name and improve messages
  alloc_tag: check mem_profiling_support in alloc_tag_init
  Docs/damon: update titles and brief introductions to explain DAMOS
  selftests/damon/_damon_sysfs: read tried regions directories in order
  mm/damon/tests/core-kunit: add a test for damos_set_filters_default_reject()
  mm/damon/paddr: remove unused variable, folio_list, in damon_pa_stat()
  mm/damon/sysfs-schemes: fix wrong comment on damons_sysfs_quota_goal_metric_strs
  ...
2025-05-31 15:44:16 -07:00
Uday Shankar
08652bd86e Documentation: ublk: document UBLK_F_PER_IO_DAEMON
Explain the restrictions imposed on ublk servers in two cases:
1. When UBLK_F_PER_IO_DAEMON is set (current ublk_drv)
2. When UBLK_F_PER_IO_DAEMON is not set (legacy)

Remove most references to per-queue daemons, as the new
UBLK_F_PER_IO_DAEMON feature renders that concept obsolete.

Signed-off-by: Uday Shankar <ushankar@purestorage.com>
Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250529-ublk_task_per_io-v8-9-e9d3b119336a@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-31 14:38:57 -06:00
Uday Shankar
17574aa2a0 selftests: ublk: add stress test for per io daemons
Add a new test_stress_06 for the per io daemons feature. This is just a
copy of test_stress_01 with the per_io_tasks flag added, with varying
amounts of nthreads. This test is able to reproduce a panic which was
caught manually during development [1]; in the current version of this
patch set, it passes.

Note that this commit also makes all stress tests using the
run_io_and_remove helper more stressful by additionally exercising the
batch submit (queue_rqs) path.

[1] https://lore.kernel.org/linux-block/aDgwGoGCEpwd1mFY@fedora/

Suggested-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Uday Shankar <ushankar@purestorage.com>
Link: https://lore.kernel.org/r/20250529-ublk_task_per_io-v8-8-e9d3b119336a@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-31 14:38:53 -06:00
Uday Shankar
236918d3e9 selftests: ublk: add functional test for per io daemons
Add a new test test_generic_12 which:

- sets up a ublk server with per_io_tasks and a different number of ublk
  server threads and ublk_queues. This is possible now that these
  objects are decoupled
- runs some I/O load from a single CPU
- verifies that all the ublk server threads handle some I/O

Before this changeset, this test fails, since I/O issued from one CPU is
always handled by the one ublk server thread. After this changeset, the
test passes.

In the future, the last check above may be strengthened to "verify that
all ublk server threads handle the same amount of I/O." However, this
requires some adjustments/bugfixes to tag allocation, so this work is
postponed to a followup.

Signed-off-by: Uday Shankar <ushankar@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250529-ublk_task_per_io-v8-7-e9d3b119336a@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-31 14:38:48 -06:00
Uday Shankar
abe54c1603 selftests: ublk: kublk: decouple ublk_queues from ublk server threads
Add support in kublk for decoupled ublk_queues and ublk server threads.
kublk now has two modes of operation:

- (preexisting mode) threads and queues are paired 1:1, and each thread
  services all the I/Os of one queue
- (new mode) thread and queue counts are independently configurable.
  threads service I/Os in a way that balances load across threads even
  if load is not balanced over queues.

The default is the preexisting mode. The new mode is activated by
passing the --per_io_tasks flag.

Signed-off-by: Uday Shankar <ushankar@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250529-ublk_task_per_io-v8-6-e9d3b119336a@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-31 14:38:43 -06:00
Uday Shankar
b9848ca7a7 selftests: ublk: kublk: move per-thread data out of ublk_queue
Towards the goal of decoupling ublk_queues from ublk server threads,
move resources/data that should be per-thread rather than per-queue out
of ublk_queue and into a new struct ublk_thread.

Signed-off-by: Uday Shankar <ushankar@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250529-ublk_task_per_io-v8-5-e9d3b119336a@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-31 14:38:39 -06:00
Uday Shankar
8f75ba28b8 selftests: ublk: kublk: lift queue initialization out of thread
Currently, each ublk server I/O handler thread initializes its own
queue. However, as we move towards decoupled ublk_queues and ublk server
threads, this model does not make sense anymore, as there will no longer
be a concept of a thread having "its own" queue. So lift queue
initialization out of the per-thread ublk_io_handler_fn and into a loop
in ublk_start_daemon (which runs once for each device).

There is a part of ublk_queue_init (ring initialization) which does
actually need to happen on the thread that will use the ring; that is
separated into a separate ublk_thread_init which is still called by each
I/O handler thread.

Signed-off-by: Uday Shankar <ushankar@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250529-ublk_task_per_io-v8-4-e9d3b119336a@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-31 14:38:35 -06:00
Uday Shankar
9773709752 selftests: ublk: kublk: tie sqe allocation to io instead of queue
We currently have a helper ublk_queue_alloc_sqes which the ublk targets
use to allocate SQEs for their own operations. However, as we move
towards decoupled ublk_queues and ublk server threads, this helper does
not make sense anymore. SQEs are allocated from rings, and we will have
one ring per thread to avoid locking. Change the SQE allocation helper
to ublk_io_alloc_sqes. Currently this still allocates SQEs from the io's
queue's ring, but when we fully decouple threads and queues, it will
allocate from the io's thread's ring instead.

Signed-off-by: Uday Shankar <ushankar@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250529-ublk_task_per_io-v8-3-e9d3b119336a@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-31 14:38:30 -06:00
Uday Shankar
bf098d7269 selftests: ublk: kublk: plumb q_id in io_uring user_data
Currently, when we process CQEs, we know which ublk_queue we are working
on because we know which ring we are working on, and ublk_queues and
rings are in 1:1 correspondence. However, as we decouple ublk_queues
from ublk server threads, ublk_queues and rings will no longer be in 1:1
correspondence - each ublk server thread will have a ring, and each
thread may issue commands against more than one ublk_queue. So in order
to know which ublk_queue a CQE refers to, plumb that information in the
associated SQE's user_data.

Signed-off-by: Uday Shankar <ushankar@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250529-ublk_task_per_io-v8-2-e9d3b119336a@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-31 14:38:26 -06:00
Uday Shankar
ab03a61c66 ublk: have a per-io daemon instead of a per-queue daemon
Currently, ublk_drv associates to each hardware queue (hctx) a unique
task (called the queue's ubq_daemon) which is allowed to issue
COMMIT_AND_FETCH commands against the hctx. If any other task attempts
to do so, the command fails immediately with EINVAL. When considered
together with the block layer architecture, the result is that for each
CPU C on the system, there is a unique ublk server thread which is
allowed to handle I/O submitted on CPU C. This can lead to suboptimal
performance under imbalanced load generation. For an extreme example,
suppose all the load is generated on CPUs mapping to a single ublk
server thread. Then that thread may be fully utilized and become the
bottleneck in the system, while other ublk server threads are totally
idle.

This issue can also be addressed directly in the ublk server without
kernel support by having threads dequeue I/Os and pass them around to
ensure even load. But this solution requires inter-thread communication
at least twice for each I/O (submission and completion), which is
generally a bad pattern for performance. The problem gets even worse
with zero copy, as more inter-thread communication would be required to
have the buffer register/unregister calls to come from the correct
thread.

Therefore, address this issue in ublk_drv by allowing each I/O to have
its own daemon task. Two I/Os in the same queue are now allowed to be
serviced by different daemon tasks - this was not possible before.
Imbalanced load can then be balanced across all ublk server threads by
having the ublk server threads issue FETCH_REQs in a round-robin manner.
As a small toy example, consider a system with a single ublk device
having 2 queues, each of depth 4. A ublk server having 4 threads could
issue its FETCH_REQs against this device as follows (where each entry is
the qid,tag pair that the FETCH_REQ targets):

ublk server thread:	T0	T1	T2	T3
			0,0	0,1	0,2	0,3
			1,3	1,0	1,1	1,2

This setup allows for load that is concentrated on one hctx/ublk_queue
to be spread out across all ublk server threads, alleviating the issue
described above.

Add the new UBLK_F_PER_IO_DAEMON feature to ublk_drv, which ublk servers
can use to essentially test for the presence of this change and tailor
their behavior accordingly.

Signed-off-by: Uday Shankar <ushankar@purestorage.com>
Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/20250529-ublk_task_per_io-v8-1-e9d3b119336a@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-31 14:38:21 -06:00
Linus Torvalds
b42966552b Merge tag 'fbdev-for-6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/linux-fbdev
Pull fbdev updates from Helge Deller:
 "Many small but important fixes for special cases in the fbdev, fbcon
  and vgacon code which were found with Syzkaller, Svace and other tools
  by various people and teams (e.g. Linux Verification Center).

  Some smaller code cleanups in the nvidiafb, arkfb, atyfb and viafb
  drivers and two spelling fixes"

* tag 'fbdev-for-6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/linux-fbdev:
  fbdev: Fix fb_set_var to prevent null-ptr-deref in fb_videomode_to_var
  fbdev: Fix do_register_framebuffer to prevent null-ptr-deref in fb_videomode_to_var
  fbdev: sstfb.rst: Fix spelling mistake
  fbdev: core: fbcvt: avoid division by 0 in fb_cvt_hperiod()
  fbcon: Make sure modelist not set on unregistered console
  vgacon: Add check for vc_origin address range in vgacon_scroll()
  fbdev: arkfb: Cast ics5342_init() allocation type
  fbdev: nvidiafb: Correct const string length in nvidiafb_setup()
  fbdev: atyfb: Remove unused PCI vendor ID
  fbdev: carminefb: Fix spelling mistake of CARMINE_TOTAL_DIPLAY_MEM
  fbdev: via: use new GPIO line value setter callbacks
2025-05-31 11:34:27 -07:00
Mark Brown
4cb6c8af85 selftests/filesystems: Fix build of anon_inode_test
The newly added anon_inode_test test fails to build due to attempting to
include a nonexisting overlayfs/wrapper.h:

  anon_inode_test.c:10:10: fatal error: overlayfs/wrappers.h: No such file or directory
     10 | #include "overlayfs/wrappers.h"
        |          ^~~~~~~~~~~~~~~~~~~~~~

This is due to 0bd92b9fe5 ("selftests/filesystems: move wrapper.h out
of overlayfs subdir") which was added in the vfs-6.16.selftests branch
which was based on -rc5 and did not contain the newly added test so once
things were merged into mainline the build started failing - both parent
commits are fine.

Fixes: 3e406741b1 ("Merge tag 'vfs-6.16-rc1.selftests' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs")
Signed-off-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2025-05-31 08:43:53 -07:00
Linus Torvalds
dee264c16a Merge tag 'gcc-minimum-version-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic
Pull compiler version requirement update from Arnd Bergmann:
 "Require gcc-8 and binutils-2.30

  x86 already uses gcc-8 as the minimum version, this changes all other
  architectures to the same version. gcc-8 is used is Debian 10 and Red
  Hat Enterprise Linux 8, both of which are still supported, and
  binutils 2.30 is the oldest corresponding version on those.

  Ubuntu Pro 18.04 and SUSE Linux Enterprise Server 15 both use gcc-7 as
  the system compiler but additionally include toolchains that remain
  supported.

  With the new minimum toolchain versions, a number of workarounds for
  older versions can be dropped, in particular on x86_64 and arm64.
  Importantly, the updated compiler version allows removing two of the
  five remaining gcc plugins, as support for sancov and structeak
  features is already included in modern compiler versions.

  I tried collecting the known changes that are possible based on the
  new toolchain version, but expect that more cleanups will be possible.

  Since this touches multiple architectures, I merged the patches
  through the asm-generic tree."

* tag 'gcc-minimum-version-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic:
  Makefile.kcov: apply needed compiler option unconditionally in CFLAGS_KCOV
  Documentation: update binutils-2.30 version reference
  gcc-plugins: remove SANCOV gcc plugin
  Kbuild: remove structleak gcc plugin
  arm64: drop binutils version checks
  raid6: skip avx512 checks
  kbuild: require gcc-8 and binutils-2.30
2025-05-31 08:16:52 -07:00
Linus Torvalds
31848987f1 Merge tag 'soc-newsoc-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
Pull sophgo SoC devicetree updates from Arnd Bergmann:
 "The Sophgo SG2044 SoC is their second generation server chip with 64
  cores, following the SG2042.

  In addition, there are minor updates for the cv180x SoCs"

* tag 'soc-newsoc-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc:
  riscv: dts: sophgo: switch precise compatible for existed clock device for CV18XX
  riscv: dts: sophgo: Add initial device tree of Sophgo SRD3-10
  dt-bindings: riscv: sophgo: Add SG2044 compatible string
  dt-bindings: interrupt-controller: Add Sophgo SG2044 PLIC
  dt-bindings: interrupt-controller: Add Sophgo SG2044 CLINT mswi
  riscv: dts: sopgho: use SOC_PERIPHERAL_IRQ to calculate interrupt number
  riscv: dts: sophgo: rename header file cv18xx.dtsi to cv180x.dtsi
  riscv: dts: sophgo: Move riscv cpu definition to a separate file
  riscv: dts: sophgo: Move all soc specific device into soc dtsi file
  riscv: sophgo: dts: Add spi controller for SG2042
  riscv: dts: sophgo: sg2042: add pinctrl support
2025-05-31 08:14:37 -07:00
Linus Torvalds
ec71f661a5 Merge tag 'soc-dt-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
Pull SoC devicetree updates from Arnd Bergmann:
 "There are 11 newly supported SoCs, but these are all either new
  variants of existing designs, or straight reuses of the existing chip
  in a new package:

   - RK3562 is a new chip based on the old Cortex-A53 core, apparently a
     low-cost version of the Cortex-A55 based RK3568/RK3566.

   - NXP i.MX94 is a minor variation of i.MX93/i.MX95 with a different
     set of on-chip peripherals.

   - Renesas RZ/V2N (R9A09G056) is a new member of the larger RZ/V2
     family

   - Amlogic S6/S7/S7D

   - Samsung Exynos7870 is an older chip similar to Exynos7885

   - WonderMedia wm8950 is a minor variation on the wm8850 chip

   - Amlogic s805y is almost idential to s805x

   - Allwinner A523 is similar to A527 and T527

   - Qualcomm MSM8926 is a variant of MSM8226

   - Qualcomm Snapdragon X1P42100 is related to R1E80100

  There are also 65 boards, including reference designs for the chips
  above, this includes

   - 12 new boards based on TI K3 series chips, most of them from
     Toradex

   - 10 devices using Rockchips RK35xx and PX30 chips

   - 2 phones and 2 laptops based on Qualcomm Snapdragon designs

   - 10 NXP i.MX8/i.MX9 boards, mostly for embedded/industrial uses

   - 3 Samsung Galaxy phones based on Exynos7870

   - 5 Allwinner based boards using a variety of ARMv8 chips

   - 9 32-bit machines, each based on a different SoC family

  Aside from the new hardware, there is the usual set of cleanups and
  newly added hardware support on existing machines, for a total of 965
  devicetree changesets"

* tag 'soc-dt-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (956 commits)
  MAINTAINERS, mailmap: update Sven Peter's email address
  arm64: dts: renesas: rzg3e-smarc-som: Reduce I2C2 clock frequency
  arm64: dts: nuvoton: Add pinctrl
  ARM: dts: samsung: sp5v210-aries: Align wifi node name with bindings
  arm64: dts: blaize-blzp1600: Enable GPIO support
  dt-bindings: clock: socfpga: convert to yaml
  arm64: dts: rockchip: move rk3562 pinctrl node outside the soc node
  arm64: dts: rockchip: fix rk3562 pcie unit addresses
  arm64: dts: rockchip: move rk3528 pinctrl node outside the soc node
  arm64: dts: rockchip: remove a double-empty line from rk3576 core dtsi
  arm64: dts: rockchip: move rk3576 pinctrl node outside the soc node
  arm64: dts: rockchip: fix rk3576 pcie unit addresses
  arm64: dts: rockchip: Drop assigned-clock* from cpu nodes on rk3588
  arm64: dts: rockchip: Add missing SFC power-domains to rk3576
  Revert "arm64: dts: mediatek: mt8390-genio-common: Add firmware-name for scp0"
  arm64: dts: mediatek: mt8188: Address binding warnings for MDP3 nodes
  arm64: dts: mt6359: Rename RTC node to match binding expectations
  arm64: dts: mt8365-evk: Add goodix touchscreen support
  arm64: dts: mediatek: mt8188: Add missing #reset-cells property
  arm64: dts: airoha: en7581: Add PCIe nodes to EN7581 SoC evaluation board
  ...
2025-05-31 08:08:56 -07:00
Linus Torvalds
f79749a8e4 Merge tag 'soc-defconfig-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
Pull SoC defconfig updates from Arnd Bergmann:
 "The usual defconfig updates enable configuration options for drivers
  that got added. A few SoC specific options are enabled in Kconfig
  files instead, in place of the defconfig files"

* tag 'soc-defconfig-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc:
  arm64: defconfig: enable ACPM protocol and Exynos mailbox
  arm64: defconfig: Enable configs for MediaTek Genio EVK boards
  arm64: defconfig: mediatek: enable PHY drivers
  arm64: defconfig: Enable Rockchip SAI and ES8328
  arm64: defconfig: Add Toradex Embedded Controller config
  arm64: defconfig: Enable TPIC2810 GPIO expander
  riscv: defconfig: spacemit: enable clock controller driver for SpacemiT K1
  arm64: defconfig: Enable TMP102 as module
  arm64: defconfig: Enable hwspinlock and eQEP for K3
  arm64: defconfig: Add CDNS_DSI and CDNS_PHY config
  riscv: defconfig: spacemit: enable gpio support for K1 SoC
  arm64: defconfig: Enable IPQ5424 RDP466 base configs
  riscv: Enable PM_GENERIC_DOMAINS for T-Head SoCs
2025-05-31 08:06:14 -07:00
Linus Torvalds
500af1defb Merge tag 'soc-arm-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
Pull ARM SoC updates from Arnd Bergmann:
 "The main update in size is the removal of the TI DaVinci DA830 SoC
  support. DA830 is similar to DA850, which remain supported, but only
  the reference board was ever supported, and we removed that one 3
  years ago as it had never been converted to devicetree.

  There are some other cleanups for OMAP4 and a few boards using old
  GPIO interfaces"

* tag 'soc-arm-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc:
  ARM: s3c: stop including gpio.h
  ARM: dts: davinci: da850-evm: Increase fifo threshold
  ARM: OMAP2+: Fix l4ls clk domain handling in STANDBY
  ARM: broadcom: MAINTAINERS: Cover bcm2712 files
  bus: ti-sysc: PRUSS OCP configuration
  ARM: davinci: remove support for da830
  ARM: omap: pmic-cpcap: do not mess around without CPCAP or OMAP4
  ARM: omap2plus_defconfig: enable I2C devices of GTA04
  ARM: s3c/gpio: use new line value setter callbacks
  ARM: scoop/gpio: use new line value setter callbacks
  ARM: sa1100/gpio: use new line value setter callbacks
  ARM: orion/gpio: use new line value setter callbacks
2025-05-31 08:03:09 -07:00
Linus Torvalds
297d9111e9 Merge tag 'soc-drivers-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
Pull SoC driver updates from Arnd Bergmann:
 "Updates are across the usual driver subsystems with SoC specific
  drivers:

   - added soc specicific drivers for sophgo cv1800 and sg2044, qualcomm
     sm8750, and amlogic c3 and s4 chips.

   - cache controller updates for sifive chips, plus binding changes for
     other cache descriptions.

   - memory controller drivers for mediatek mt6893, stm32 and cleanups
     for a few more drivers

   - reset controller drivers for T-Head TH1502, Sophgo sg2044 and
     Renesas RZ/V2H(P)

   - SCMI firmware updates to better deal with buggy firmware, plus
     better support for Qualcomm X1E and NXP i.MX specific interfaces

   - a new platform driver for the crypto firmware on Cznic Turris
     Omnia/MOX

   - cleanups for the TEE firmware subsystem and amdtee driver

   - minor updates and fixes for freescale/nxp, qualcomm, google,
     aspeed, wondermedia, ti, nxp, renesas, hisilicon, mediatek,
     broadcom and samsung SoCs"

* tag 'soc-drivers-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (133 commits)
  soc: aspeed: Add NULL check in aspeed_lpc_enable_snoop()
  soc: aspeed: lpc: Fix impossible judgment condition
  ARM: aspeed: Don't select SRAM
  docs: firmware: qcom_scm: Fix kernel-doc warning
  soc: fsl: qe: Consolidate chained IRQ handler install/remove
  firmware: qcom: scm: Allow QSEECOM for HP EliteBook Ultra G1q
  dt-bindings: mfd: qcom,tcsr: Add compatible for ipq5018
  dt-bindings: cache: add QiLai compatible to ax45mp
  memory: stm32_omm: Fix error handling in stm32_omm_disable_child()
  dt-bindings: cache: Convert marvell,tauros2-cache to DT schema
  dt-bindings: cache: Convert marvell,{feroceon,kirkwood}-cache to DT schema
  soc: samsung: exynos-pmu: enable CPU hotplug support for gs101
  MAINTAINERS: Add google,gs101-pmu-intr-gen.yaml binding file
  dt-bindings: soc: samsung: exynos-pmu: gs101: add google,pmu-intr-gen phandle
  dt-bindings: soc: google: Add gs101-pmu-intr-gen binding documentation
  bus: fsl-mc: Use strscpy() instead of strscpy_pad()
  soc: fsl: qbman: Remove const from portal->cgrs allocation type
  bus: fsl_mc: Fix driver_managed_dma check
  bus: fsl-mc: increase MC_CMD_COMPLETION_TIMEOUT_MS value
  bus: fsl-mc: drop useless cleanup
  ...
2025-05-31 07:53:30 -07:00
Helge Deller
213205889d parisc/unaligned: Fix hex output to show 8 hex chars
Change back printk format to 0x%08lx instead of %#08lx, since the latter
does not seem to reliably format the value to 8 hex chars.

Signed-off-by: Helge Deller <deller@gmx.de>
Cc: stable@vger.kernel.org # v5.18+
Fixes: e5e9e7f222 ("parisc/unaligned: Enhance user-space visible output")
2025-05-31 16:50:39 +02:00
Linus Torvalds
9d49da4388 Revert "iommu: make inclusion of arm/arm-smmu-v3 directory conditional"
This reverts commit e436576b02.

That commit is very broken, and seems to have missed the fact that
CONFIG_ARM_SMMU_V3 is not just a yes-or-no thing, but also can be
modular.

So it caused build errors on arm64 allmodconfig setups:

  ERROR: modpost: "arm_smmu_make_cdtable_ste" [drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-test.ko] undefined!
  ERROR: modpost: "arm_smmu_make_s2_domain_ste" [drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-test.ko] undefined!
  ERROR: modpost: "arm_smmu_make_s1_cd" [drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-test.ko] undefined!
  ...

(and six more symbols just the same).

Link: https://lore.kernel.org/all/CAHk-=wh4qRwm7AQ8sBmQj7qECzgAhj4r73RtCDfmHo5SdcN0Jw@mail.gmail.com/
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Rolf Eike Beer <eb@emlix.com>
Cc: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2025-05-31 07:43:16 -07:00
Ian Rogers
a913ef6fd8 perf callchain: Always populate the addr_location map when adding IP
Dropping symbols also meant the callchain maps wasn't populated, but
the callchain map is needed to find the DSO.

Plumb the symbols option better, falling back to thread__find_map()
rather than thread__find_symbol() when symbols are disabled.

Fixes: 02b2705017 ("perf callchain: Allow symbols to be optional when resolving a callchain")
Signed-off-by: Ian Rogers <irogers@google.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Athira Rajeev <atrajeev@linux.ibm.com>
Cc: Ben Gainey <ben.gainey@arm.com>
Cc: Chaitanya S Prakash <chaitanyas.prakash@arm.com>
Cc: Charlie Jenkins <charlie@rivosinc.com>
Cc: Chun-Tse Shao <ctshao@google.com>
Cc: Colin Ian King <colin.i.king@gmail.com>
Cc: Dmitriy Vyukov <dvyukov@google.com>
Cc: Dr. David Alan Gilbert <linux@treblig.org>
Cc: Graham Woodward <graham.woodward@arm.com>
Cc: Howard Chu <howardchu95@gmail.com>
Cc: Ilkka Koskinen <ilkka@os.amperecomputing.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Clark <james.clark@linaro.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: John Garry <john.g.garry@oracle.com>
Cc: Kajol Jain <kjain@linux.ibm.com>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Krzysztof Łopatowski <krzysztof.m.lopatowski@gmail.com>
Cc: Leo Yan <leo.yan@linux.dev>
Cc: Levi Yun <yeoreum.yun@arm.com>
Cc: Li Huafei <lihuafei1@huawei.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Martin Liška <martin.liska@hey.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Matt Fleming <matt@readmodwrite.com>
Cc: Mike Leach <mike.leach@linaro.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ravi Bangoria <ravi.bangoria@amd.com>
Cc: Song Liu <song@kernel.org>
Cc: Steinar H. Gunderson <sesse@google.com>
Cc: Stephen Brennan <stephen.s.brennan@oracle.com>
Cc: Steve Clevenger <scclevenger@os.amperecomputing.com>
Cc: Thomas Falcon <thomas.falcon@intel.com>
Cc: Veronika Molnarova <vmolnaro@redhat.com>
Cc: Weilin Wang <weilin.wang@intel.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yicong Yang <yangyicong@hisilicon.com>
Cc: Yujie Liu <yujie.liu@intel.com>
Cc: Zhongqiu Han <quic_zhonhan@quicinc.com>
Cc: Zixian Cai <fzczx123@gmail.com>
Link: https://lore.kernel.org/r/20250529044000.759937-2-irogers@google.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2025-05-31 08:58:30 -03:00
Namhyung Kim
0df14c1f1e perf lock contention: Reject more than 10ms delays for safety
Delaying kernel operations can be dangerous and the kernel may kill
(non-sleepable) BPF programs running for long in the future.

Limit the max delay to 10ms and update the document about it.

  $ sudo ./perf lock con -abl -J 100000us@cgroup_mutex true
  lock delay is too long: 100000us (> 10ms)

   Usage: perf lock contention [<options>]

      -J, --inject-delay <TIME@FUNC>
                            Inject delays to specific locks

Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20250515181042.555189-1-namhyung@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2025-05-31 08:45:24 -03:00
Murad Masimov
05f6e18387 fbdev: Fix fb_set_var to prevent null-ptr-deref in fb_videomode_to_var
If fb_add_videomode() in fb_set_var() fails to allocate memory for
fb_videomode, later it may lead to a null-ptr dereference in
fb_videomode_to_var(), as the fb_info is registered while not having the
mode in modelist that is expected to be there, i.e. the one that is
described in fb_info->var.

================================================================
general protection fault, probably for non-canonical address 0xdffffc0000000001: 0000 [#1] PREEMPT SMP KASAN NOPTI
KASAN: null-ptr-deref in range [0x0000000000000008-0x000000000000000f]
CPU: 1 PID: 30371 Comm: syz-executor.1 Not tainted 5.10.226-syzkaller #0
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
RIP: 0010:fb_videomode_to_var+0x24/0x610 drivers/video/fbdev/core/modedb.c:901
Call Trace:
 display_to_var+0x3a/0x7c0 drivers/video/fbdev/core/fbcon.c:929
 fbcon_resize+0x3e2/0x8f0 drivers/video/fbdev/core/fbcon.c:2071
 resize_screen drivers/tty/vt/vt.c:1176 [inline]
 vc_do_resize+0x53a/0x1170 drivers/tty/vt/vt.c:1263
 fbcon_modechanged+0x3ac/0x6e0 drivers/video/fbdev/core/fbcon.c:2720
 fbcon_update_vcs+0x43/0x60 drivers/video/fbdev/core/fbcon.c:2776
 do_fb_ioctl+0x6d2/0x740 drivers/video/fbdev/core/fbmem.c:1128
 fb_ioctl+0xe7/0x150 drivers/video/fbdev/core/fbmem.c:1203
 vfs_ioctl fs/ioctl.c:48 [inline]
 __do_sys_ioctl fs/ioctl.c:753 [inline]
 __se_sys_ioctl fs/ioctl.c:739 [inline]
 __x64_sys_ioctl+0x19a/0x210 fs/ioctl.c:739
 do_syscall_64+0x33/0x40 arch/x86/entry/common.c:46
 entry_SYSCALL_64_after_hwframe+0x67/0xd1
================================================================

The reason is that fb_info->var is being modified in fb_set_var(), and
then fb_videomode_to_var() is called. If it fails to add the mode to
fb_info->modelist, fb_set_var() returns error, but does not restore the
old value of fb_info->var. Restore fb_info->var on failure the same way
it is done earlier in the function.

Found by Linux Verification Center (linuxtesting.org) with Syzkaller.

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Cc: stable@vger.kernel.org
Signed-off-by: Murad Masimov <m.masimov@mt-integration.ru>
Signed-off-by: Helge Deller <deller@gmx.de>
2025-05-31 10:24:02 +02:00
Murad Masimov
17186f1f90 fbdev: Fix do_register_framebuffer to prevent null-ptr-deref in fb_videomode_to_var
If fb_add_videomode() in do_register_framebuffer() fails to allocate
memory for fb_videomode, it will later lead to a null-ptr dereference in
fb_videomode_to_var(), as the fb_info is registered while not having the
mode in modelist that is expected to be there, i.e. the one that is
described in fb_info->var.

================================================================
general protection fault, probably for non-canonical address 0xdffffc0000000001: 0000 [#1] PREEMPT SMP KASAN NOPTI
KASAN: null-ptr-deref in range [0x0000000000000008-0x000000000000000f]
CPU: 1 PID: 30371 Comm: syz-executor.1 Not tainted 5.10.226-syzkaller #0
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
RIP: 0010:fb_videomode_to_var+0x24/0x610 drivers/video/fbdev/core/modedb.c:901
Call Trace:
 display_to_var+0x3a/0x7c0 drivers/video/fbdev/core/fbcon.c:929
 fbcon_resize+0x3e2/0x8f0 drivers/video/fbdev/core/fbcon.c:2071
 resize_screen drivers/tty/vt/vt.c:1176 [inline]
 vc_do_resize+0x53a/0x1170 drivers/tty/vt/vt.c:1263
 fbcon_modechanged+0x3ac/0x6e0 drivers/video/fbdev/core/fbcon.c:2720
 fbcon_update_vcs+0x43/0x60 drivers/video/fbdev/core/fbcon.c:2776
 do_fb_ioctl+0x6d2/0x740 drivers/video/fbdev/core/fbmem.c:1128
 fb_ioctl+0xe7/0x150 drivers/video/fbdev/core/fbmem.c:1203
 vfs_ioctl fs/ioctl.c:48 [inline]
 __do_sys_ioctl fs/ioctl.c:753 [inline]
 __se_sys_ioctl fs/ioctl.c:739 [inline]
 __x64_sys_ioctl+0x19a/0x210 fs/ioctl.c:739
 do_syscall_64+0x33/0x40 arch/x86/entry/common.c:46
 entry_SYSCALL_64_after_hwframe+0x67/0xd1
================================================================

Even though fbcon_init() checks beforehand if fb_match_mode() in
var_to_display() fails, it can not prevent the panic because fbcon_init()
does not return error code. Considering this and the comment in the code
about fb_match_mode() returning NULL - "This should not happen" - it is
better to prevent registering the fb_info if its mode was not set
successfully. Also move fb_add_videomode() closer to the beginning of
do_register_framebuffer() to avoid having to do the cleanup on fail.

Found by Linux Verification Center (linuxtesting.org) with Syzkaller.

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Cc: stable@vger.kernel.org
Signed-off-by: Murad Masimov <m.masimov@mt-integration.ru>
Signed-off-by: Helge Deller <deller@gmx.de>
2025-05-31 10:24:02 +02:00
Rujra Bhatt
9c221db509 fbdev: sstfb.rst: Fix spelling mistake
Fix spelling: "tweeks" to "tweaks"

Signed-off-by: Rujra Bhatt <braker.noob.kernel@gmail.com>
Signed-off-by: Helge Deller <deller@gmx.de>
2025-05-31 10:24:02 +02:00
Sergey Shtylyov
3f6dae09fc fbdev: core: fbcvt: avoid division by 0 in fb_cvt_hperiod()
In fb_find_mode_cvt(), iff mode->refresh somehow happens to be 0x80000000,
cvt.f_refresh will become 0 when multiplying it by 2 due to overflow. It's
then passed to fb_cvt_hperiod(), where it's used as a divider -- division
by 0 will result in kernel oops. Add a sanity check for cvt.f_refresh to
avoid such overflow...

Found by Linux Verification Center (linuxtesting.org) with the Svace static
analysis tool.

Fixes: 96fe6a2109 ("[PATCH] fbdev: Add VESA Coordinated Video Timings (CVT) support")
Signed-off-by: Sergey Shtylyov <s.shtylyov@omp.ru>
Signed-off-by: Helge Deller <deller@gmx.de>
2025-05-31 10:24:02 +02:00
Kees Cook
cedc1b6339 fbcon: Make sure modelist not set on unregistered console
It looks like attempting to write to the "store_modes" sysfs node will
run afoul of unregistered consoles:

UBSAN: array-index-out-of-bounds in drivers/video/fbdev/core/fbcon.c:122:28
index -1 is out of range for type 'fb_info *[32]'
...
 fbcon_info_from_console+0x192/0x1a0 drivers/video/fbdev/core/fbcon.c:122
 fbcon_new_modelist+0xbf/0x2d0 drivers/video/fbdev/core/fbcon.c:3048
 fb_new_modelist+0x328/0x440 drivers/video/fbdev/core/fbmem.c:673
 store_modes+0x1c9/0x3e0 drivers/video/fbdev/core/fbsysfs.c:113
 dev_attr_store+0x55/0x80 drivers/base/core.c:2439

static struct fb_info *fbcon_registered_fb[FB_MAX];
...
static signed char con2fb_map[MAX_NR_CONSOLES];
...
static struct fb_info *fbcon_info_from_console(int console)
...
        return fbcon_registered_fb[con2fb_map[console]];

If con2fb_map contains a -1 things go wrong here. Instead, return NULL,
as callers of fbcon_info_from_console() are trying to compare against
existing "info" pointers, so error handling should kick in correctly.

Reported-by: syzbot+a7d4444e7b6e743572f7@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/679d0a8f.050a0220.163cdc.000c.GAE@google.com/
Signed-off-by: Kees Cook <kees@kernel.org>
Signed-off-by: Helge Deller <deller@gmx.de>
2025-05-31 10:24:02 +02:00
GONG Ruiqi
864f9963ec vgacon: Add check for vc_origin address range in vgacon_scroll()
Our in-house Syzkaller reported the following BUG (twice), which we
believed was the same issue with [1]:

==================================================================
BUG: KASAN: slab-out-of-bounds in vcs_scr_readw+0xc2/0xd0 drivers/tty/vt/vt.c:4740
Read of size 2 at addr ffff88800f5bef60 by task syz.7.2620/12393
...
Call Trace:
 <TASK>
 __dump_stack lib/dump_stack.c:88 [inline]
 dump_stack_lvl+0x72/0xa0 lib/dump_stack.c:106
 print_address_description.constprop.0+0x6b/0x3d0 mm/kasan/report.c:364
 print_report+0xba/0x280 mm/kasan/report.c:475
 kasan_report+0xa9/0xe0 mm/kasan/report.c:588
 vcs_scr_readw+0xc2/0xd0 drivers/tty/vt/vt.c:4740
 vcs_write_buf_noattr drivers/tty/vt/vc_screen.c:493 [inline]
 vcs_write+0x586/0x840 drivers/tty/vt/vc_screen.c:690
 vfs_write+0x219/0x960 fs/read_write.c:584
 ksys_write+0x12e/0x260 fs/read_write.c:639
 do_syscall_x64 arch/x86/entry/common.c:51 [inline]
 do_syscall_64+0x59/0x110 arch/x86/entry/common.c:81
 entry_SYSCALL_64_after_hwframe+0x78/0xe2
 ...
 </TASK>

Allocated by task 5614:
 kasan_save_stack+0x20/0x40 mm/kasan/common.c:45
 kasan_set_track+0x25/0x30 mm/kasan/common.c:52
 ____kasan_kmalloc mm/kasan/common.c:374 [inline]
 __kasan_kmalloc+0x8f/0xa0 mm/kasan/common.c:383
 kasan_kmalloc include/linux/kasan.h:201 [inline]
 __do_kmalloc_node mm/slab_common.c:1007 [inline]
 __kmalloc+0x62/0x140 mm/slab_common.c:1020
 kmalloc include/linux/slab.h:604 [inline]
 kzalloc include/linux/slab.h:721 [inline]
 vc_do_resize+0x235/0xf40 drivers/tty/vt/vt.c:1193
 vgacon_adjust_height+0x2d4/0x350 drivers/video/console/vgacon.c:1007
 vgacon_font_set+0x1f7/0x240 drivers/video/console/vgacon.c:1031
 con_font_set drivers/tty/vt/vt.c:4628 [inline]
 con_font_op+0x4da/0xa20 drivers/tty/vt/vt.c:4675
 vt_k_ioctl+0xa10/0xb30 drivers/tty/vt/vt_ioctl.c:474
 vt_ioctl+0x14c/0x1870 drivers/tty/vt/vt_ioctl.c:752
 tty_ioctl+0x655/0x1510 drivers/tty/tty_io.c:2779
 vfs_ioctl fs/ioctl.c:51 [inline]
 __do_sys_ioctl fs/ioctl.c:871 [inline]
 __se_sys_ioctl+0x12d/0x190 fs/ioctl.c:857
 do_syscall_x64 arch/x86/entry/common.c:51 [inline]
 do_syscall_64+0x59/0x110 arch/x86/entry/common.c:81
 entry_SYSCALL_64_after_hwframe+0x78/0xe2

Last potentially related work creation:
 kasan_save_stack+0x20/0x40 mm/kasan/common.c:45
 __kasan_record_aux_stack+0x94/0xa0 mm/kasan/generic.c:492
 __call_rcu_common.constprop.0+0xc3/0xa10 kernel/rcu/tree.c:2713
 netlink_release+0x620/0xc20 net/netlink/af_netlink.c:802
 __sock_release+0xb5/0x270 net/socket.c:663
 sock_close+0x1e/0x30 net/socket.c:1425
 __fput+0x408/0xab0 fs/file_table.c:384
 __fput_sync+0x4c/0x60 fs/file_table.c:465
 __do_sys_close fs/open.c:1580 [inline]
 __se_sys_close+0x68/0xd0 fs/open.c:1565
 do_syscall_x64 arch/x86/entry/common.c:51 [inline]
 do_syscall_64+0x59/0x110 arch/x86/entry/common.c:81
 entry_SYSCALL_64_after_hwframe+0x78/0xe2

Second to last potentially related work creation:
 kasan_save_stack+0x20/0x40 mm/kasan/common.c:45
 __kasan_record_aux_stack+0x94/0xa0 mm/kasan/generic.c:492
 __call_rcu_common.constprop.0+0xc3/0xa10 kernel/rcu/tree.c:2713
 netlink_release+0x620/0xc20 net/netlink/af_netlink.c:802
 __sock_release+0xb5/0x270 net/socket.c:663
 sock_close+0x1e/0x30 net/socket.c:1425
 __fput+0x408/0xab0 fs/file_table.c:384
 task_work_run+0x154/0x240 kernel/task_work.c:239
 exit_task_work include/linux/task_work.h:45 [inline]
 do_exit+0x8e5/0x1320 kernel/exit.c:874
 do_group_exit+0xcd/0x280 kernel/exit.c:1023
 get_signal+0x1675/0x1850 kernel/signal.c:2905
 arch_do_signal_or_restart+0x80/0x3b0 arch/x86/kernel/signal.c:310
 exit_to_user_mode_loop kernel/entry/common.c:111 [inline]
 exit_to_user_mode_prepare include/linux/entry-common.h:328 [inline]
 __syscall_exit_to_user_mode_work kernel/entry/common.c:207 [inline]
 syscall_exit_to_user_mode+0x1b3/0x1e0 kernel/entry/common.c:218
 do_syscall_64+0x66/0x110 arch/x86/entry/common.c:87
 entry_SYSCALL_64_after_hwframe+0x78/0xe2

The buggy address belongs to the object at ffff88800f5be000
 which belongs to the cache kmalloc-2k of size 2048
The buggy address is located 2656 bytes to the right of
 allocated 1280-byte region [ffff88800f5be000, ffff88800f5be500)

...

Memory state around the buggy address:
 ffff88800f5bee00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
 ffff88800f5bee80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>ffff88800f5bef00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
                                                       ^
 ffff88800f5bef80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
 ffff88800f5bf000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
==================================================================

By analyzing the vmcore, we found that vc->vc_origin was somehow placed
one line prior to vc->vc_screenbuf when vc was in KD_TEXT mode, and
further writings to /dev/vcs caused out-of-bounds reads (and writes
right after) in vcs_write_buf_noattr().

Our further experiments show that in most cases, vc->vc_origin equals to
vga_vram_base when the console is in KD_TEXT mode, and it's around
vc->vc_screenbuf for the KD_GRAPHICS mode. But via triggerring a
TIOCL_SETVESABLANK ioctl beforehand, we can make vc->vc_origin be around
vc->vc_screenbuf while the console is in KD_TEXT mode, and then by
writing the special 'ESC M' control sequence to the tty certain times
(depends on the value of `vc->state.y - vc->vc_top`), we can eventually
move vc->vc_origin prior to vc->vc_screenbuf. Here's the PoC, tested on
QEMU:

```
int main() {
	const int RI_NUM = 10; // should be greater than `vc->state.y - vc->vc_top`
	int tty_fd, vcs_fd;
	const char *tty_path = "/dev/tty0";
	const char *vcs_path = "/dev/vcs";
	const char escape_seq[] = "\x1bM";  // ESC + M
	const char trigger_seq[] = "Let's trigger an OOB write.";
	struct vt_sizes vt_size = { 70, 2 };
	int blank = TIOCL_BLANKSCREEN;

	tty_fd = open(tty_path, O_RDWR);

	char vesa_mode[] = { TIOCL_SETVESABLANK, 1 };
	ioctl(tty_fd, TIOCLINUX, vesa_mode);

	ioctl(tty_fd, TIOCLINUX, &blank);
	ioctl(tty_fd, VT_RESIZE, &vt_size);

	for (int i = 0; i < RI_NUM; ++i)
		write(tty_fd, escape_seq, sizeof(escape_seq) - 1);

	vcs_fd = open(vcs_path, O_RDWR);
	write(vcs_fd, trigger_seq, sizeof(trigger_seq));

	close(vcs_fd);
	close(tty_fd);
	return 0;
}
```

To solve this problem, add an address range validation check in
vgacon_scroll(), ensuring vc->vc_origin never precedes vc_screenbuf.

Reported-by: syzbot+9c09fda97a1a65ea859b@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=9c09fda97a1a65ea859b [1]
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Cc: stable@vger.kernel.org
Co-developed-by: Yi Yang <yiyang13@huawei.com>
Signed-off-by: Yi Yang <yiyang13@huawei.com>
Signed-off-by: GONG Ruiqi <gongruiqi1@huawei.com>
Signed-off-by: Helge Deller <deller@gmx.de>
2025-05-31 10:24:02 +02:00
Kees Cook
ede481f6da fbdev: arkfb: Cast ics5342_init() allocation type
In preparation for making the kmalloc family of allocators type aware,
we need to make sure that the returned type from the allocation matches
the type of the variable being assigned. (Before, the allocator would
always return "void *", which can be implicitly cast to any pointer type.)

The assigned type is "struct dac_info *" but the returned type will be
"struct ics5342_info *", which has a larger allocation size. This is
by design, as struct ics5342_info contains struct dac_info as its first
member.
(patch slightly modified by Helge Deller)

Signed-off-by: Kees Cook <kees@kernel.org>
Signed-off-by: Helge Deller <deller@gmx.de>
2025-05-31 10:24:02 +02:00
Zijun Hu
34fe05cd2d fbdev: nvidiafb: Correct const string length in nvidiafb_setup()
The actual length of const string "noaccel" is 7, but the strncmp()
branch in nvidiafb_setup() wrongly hard codes it as 6.

Fix by using actual length 7 as argument of the strncmp().

Signed-off-by: Zijun Hu <quic_zijuhu@quicinc.com>
Signed-off-by: Helge Deller <deller@gmx.de>
2025-05-31 10:24:01 +02:00
Andy Shevchenko
c9b26429c8 fbdev: atyfb: Remove unused PCI vendor ID
The custom definition of PCI vendor ID in video/mach64.h is unused.
Remove it. Note, that the proper one is available in pci_ids.h.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Helge Deller <deller@gmx.de>
2025-05-31 10:24:01 +02:00
Colin Ian King
67ebb5890a fbdev: carminefb: Fix spelling mistake of CARMINE_TOTAL_DIPLAY_MEM
There is a spelling mistake in macro CARMINE_TOTAL_DIPLAY_MEM. Fix it.

Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Signed-off-by: Helge Deller <deller@gmx.de>
2025-05-31 10:24:01 +02:00
Bartosz Golaszewski
2be358790e fbdev: via: use new GPIO line value setter callbacks
struct gpio_chip now has callbacks for setting line values that return
an integer, allowing to indicate failures. Convert the driver to using
them.

Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
Reviewed-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Helge Deller <deller@gmx.de>
2025-05-31 10:24:01 +02:00
Dapeng Mi
86aa94cd50 perf/x86/intel: Fix incorrect MSR index calculations in intel_pmu_config_acr()
The MSR offset calculations in intel_pmu_config_acr() are buggy.

To calculate fixed counter MSR addresses in intel_pmu_config_acr(),
the HW counter index "idx" is subtracted by INTEL_PMC_IDX_FIXED.

This leads to the ACR mask value of fixed counters to be incorrectly
saved to the positions of GP counters in acr_cfg_b[], e.g.

For fixed counter 0, its ACR counter mask should be saved to
acr_cfg_b[32], but it's saved to acr_cfg_b[0] incorrectly.

Fix this issue.

[ mingo: Clarified & improved the changelog. ]

Fixes: ec980e4fac ("perf/x86/intel: Support auto counter reload")
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Kan Liang <kan.liang@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250529080236.2552247-2-dapeng1.mi@linux.intel.com
2025-05-31 10:05:16 +02:00
Steven Rostedt
99850a1c93 x86/fpu: Remove unused trace events
The following trace events are not used and defining them just wastes
memory:

  x86_fpu_before_restore
  x86_fpu_after_restore
  x86_fpu_init_state

Simply remove them.

Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Linux Trace Kernel <linux-trace-kernel@vger.kernel.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/all/20250529130138.544ffec4@gandalf.local.home  # background
Link: https://lore.kernel.org/r/20250529131024.7c2ef96f@gandalf.local.home    # x86 submission
2025-05-31 09:40:40 +02:00
Linus Torvalds
8bf722c684 Merge tag 'trace-ringbuffer-v6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull ring-buffer updates from Steven Rostedt:

 - Allow the persistent ring buffer to be memory mapped

   In the last merge window there was issues with the implementation of
   mapping the persistent ring buffer because it was assumed that the
   persistent memory was just physical memory without being part of the
   kernel virtual address space. But this was incorrect and the
   persistent ring buffer can be mapped the same way as the allocated
   ring buffer is mapped.

   The metadata for the persistent ring buffer is different than the
   normal ring buffer and the organization of mapping it to user space
   is a little different. Make the updates needed to the meta data to
   allow the persistent ring buffer to be mapped to user space.

 - Fix cpus_read_lock() with buffer->mutex and cpu_buffer->mapping_lock

   Mapping the ring buffer to user space uses the
   cpu_buffer->mapping_lock. The buffer->mutex can be taken when the
   mapping_lock is held, giving the locking order of:
   cpu_buffer->mapping_lock -->> buffer->mutex. But there also exists
   the ordering:

       buffer->mutex -->> cpus_read_lock()
       mm->mmap_lock -->> cpu_buffer->mapping_lock
       cpus_read_lock() -->> mm->mmap_lock

   causing a circular chain of:

       cpu_buffer->mapping_lock -> buffer->mutex -->> cpus_read_lock() -->>
          mm->mmap_lock -->> cpu_buffer->mapping_lock

   By moving the cpus_read_lock() outside the buffer->mutex where:
   cpus_read_lock() -->> buffer->mutex, breaks the deadlock chain.

 - Do not trigger WARN_ON() for commit overrun

   When the ring buffer is user space mapped and there's a "commit
   overrun" (where an interrupt preempted an event, and then added so
   many events it filled the buffer having to drop events when it hit
   the preempted event) a WARN_ON() was triggered if this was read via a
   memory mapped buffer.

   This is due to "missed events" being non zero when the reader page
   ended up with the commit page. The idea was, if the writer is on the
   reader page, there's only one page that has been written to and there
   should be no missed events.

   But if a commit overrun is done where the writer is off the commit
   page and looped around to the commit page causing missed events, it
   is possible that the reader page is the commit page with missed
   events.

   Instead of triggering a WARN_ON() when the reader page is the commit
   page with missed events, trigger it when the reader page is the
   tail_page with missed events. That's because the writer is always on
   the tail_page if an event was interrupted (which holds the commit
   event) and continues off the commit page.

 - Reset the persistent buffer if it is fully consumed

   On boot up, if the user fully consumes the last boot buffer of the
   persistent buffer, if it reboots without enabling it, there will
   still be events in the buffer which can cause confusion. Instead,
   reset the buffer when it is fully consumed, so that the data is not
   read again.

 - Clean up some goto out jumps

   There's a few cases that the code jumps to the "out:" label that
   simply returns a value. There used to be more work done at those
   labels but now that they simply return a value use a return instead
   of jumping to a label.

 - Use guard() to simplify some of the code

   Add guard() around some locking instead of jumping to a label to do
   the unlocking.

 - Use free() to simplify some of the code

   Use free(kfree) on variables that will get freed on error and use
   return_ptr() to return the variable when its not freed. There's one
   instance where free(kfree) simplifies the code on a temp variable
   that was allocated just for the function use.

* tag 'trace-ringbuffer-v6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  ring-buffer: Simplify functions with __free(kfree) to free allocations
  ring-buffer: Make ring_buffer_{un}map() simpler with guard(mutex)
  ring-buffer: Simplify ring_buffer_read_page() with guard()
  ring-buffer: Simplify reset_disabled_cpu_buffer() with use of guard()
  ring-buffer: Remove jump to out label in ring_buffer_swap_cpu()
  ring-buffer: Removed unnecessary if() goto out where out is the next line
  tracing: Reset last-boot buffers when reading out all cpu buffers
  ring-buffer: Allow reserve_mem persistent ring buffers to be mmapped
  ring-buffer: Do not trigger WARN_ON() due to a commit_overrun
  ring-buffer: Move cpus_read_lock() outside of buffer->mutex
2025-05-30 21:20:11 -07:00
Linus Torvalds
03ebff0c83 Merge tag 'microblaze-v6.16' of git://git.monstr.eu/linux-2.6-microblaze
Pull microblaze update from Michal Simek:

 - Small OF update

* tag 'microblaze-v6.16' of git://git.monstr.eu/linux-2.6-microblaze:
  microblaze: Use of_property_present() for non-boolean properties
2025-05-30 21:11:21 -07:00
Jakub Kicinski
558428921e Merge branch 'net-fix-inet_proto_csum_replace_by_diff-for-ipv6'
Paul Chaignon says:

====================
net: Fix inet_proto_csum_replace_by_diff for IPv6

This patchset fixes a bug that causes skb->csum to hold an incorrect
value when calling inet_proto_csum_replace_by_diff for an IPv6 packet
in CHECKSUM_COMPLETE state. This bug affects BPF helper
bpf_l4_csum_replace and IPv6 ILA in adj-transport mode.

In those cases, inet_proto_csum_replace_by_diff updates the L4 checksum
field after an IPv6 address change. These two changes cancel each other
in terms of checksum, so skb->csum shouldn't be updated.

v2: https://lore.kernel.org/aCz84JU60wd8etiT@mail.gmail.com
====================

Link: https://patch.msgid.link/cover.1748509484.git.paul.chaignon@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-30 19:53:53 -07:00
Paul Chaignon
ead7f9b8de bpf: Fix L4 csum update on IPv6 in CHECKSUM_COMPLETE
In Cilium, we use bpf_csum_diff + bpf_l4_csum_replace to, among other
things, update the L4 checksum after reverse SNATing IPv6 packets. That
use case is however not currently supported and leads to invalid
skb->csum values in some cases. This patch adds support for IPv6 address
changes in bpf_l4_csum_update via a new flag.

When calling bpf_l4_csum_replace in Cilium, it ends up calling
inet_proto_csum_replace_by_diff:

    1:  void inet_proto_csum_replace_by_diff(__sum16 *sum, struct sk_buff *skb,
    2:                                       __wsum diff, bool pseudohdr)
    3:  {
    4:      if (skb->ip_summed != CHECKSUM_PARTIAL) {
    5:          csum_replace_by_diff(sum, diff);
    6:          if (skb->ip_summed == CHECKSUM_COMPLETE && pseudohdr)
    7:              skb->csum = ~csum_sub(diff, skb->csum);
    8:      } else if (pseudohdr) {
    9:          *sum = ~csum_fold(csum_add(diff, csum_unfold(*sum)));
    10:     }
    11: }

The bug happens when we're in the CHECKSUM_COMPLETE state. We've just
updated one of the IPv6 addresses. The helper now updates the L4 header
checksum on line 5. Next, it updates skb->csum on line 7. It shouldn't.

For an IPv6 packet, the updates of the IPv6 address and of the L4
checksum will cancel each other. The checksums are set such that
computing a checksum over the packet including its checksum will result
in a sum of 0. So the same is true here when we update the L4 checksum
on line 5. We'll update it as to cancel the previous IPv6 address
update. Hence skb->csum should remain untouched in this case.

The same bug doesn't affect IPv4 packets because, in that case, three
fields are updated: the IPv4 address, the IP checksum, and the L4
checksum. The change to the IPv4 address and one of the checksums still
cancel each other in skb->csum, but we're left with one checksum update
and should therefore update skb->csum accordingly. That's exactly what
inet_proto_csum_replace_by_diff does.

This special case for IPv6 L4 checksums is also described atop
inet_proto_csum_replace16, the function we should be using in this case.

This patch introduces a new bpf_l4_csum_replace flag, BPF_F_IPV6,
to indicate that we're updating the L4 checksum of an IPv6 packet. When
the flag is set, inet_proto_csum_replace_by_diff will skip the
skb->csum update.

Fixes: 7d672345ed ("bpf: add generic bpf_csum_diff helper")
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://patch.msgid.link/96a6bc3a443e6f0b21ff7b7834000e17fb549e05.1748509484.git.paul.chaignon@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-30 19:53:51 -07:00
Paul Chaignon
6043b794c7 net: Fix checksum update for ILA adj-transport
During ILA address translations, the L4 checksums can be handled in
different ways. One of them, adj-transport, consist in parsing the
transport layer and updating any found checksum. This logic relies on
inet_proto_csum_replace_by_diff and produces an incorrect skb->csum when
in state CHECKSUM_COMPLETE.

This bug can be reproduced with a simple ILA to SIR mapping, assuming
packets are received with CHECKSUM_COMPLETE:

  $ ip a show dev eth0
  14: eth0@if15: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
      link/ether 62:ae:35:9e:0f:8d brd ff:ff:ff:ff:ff:ff link-netnsid 0
      inet6 3333:0:0:1::c078/64 scope global
         valid_lft forever preferred_lft forever
      inet6 fd00:10:244:1::c078/128 scope global nodad
         valid_lft forever preferred_lft forever
      inet6 fe80::60ae:35ff:fe9e:f8d/64 scope link proto kernel_ll
         valid_lft forever preferred_lft forever
  $ ip ila add loc_match fd00:10:244:1 loc 3333:0:0:1 \
      csum-mode adj-transport ident-type luid dev eth0

Then I hit [fd00:10:244:1::c078]:8000 with a server listening only on
[3333:0:0:1::c078]:8000. With the bug, the SYN packet is dropped with
SKB_DROP_REASON_TCP_CSUM after inet_proto_csum_replace_by_diff changed
skb->csum. The translation and drop are visible on pwru [1] traces:

  IFACE   TUPLE                                                        FUNC
  eth0:9  [fd00:10:244:3::3d8]:51420->[fd00:10:244:1::c078]:8000(tcp)  ipv6_rcv
  eth0:9  [fd00:10:244:3::3d8]:51420->[fd00:10:244:1::c078]:8000(tcp)  ip6_rcv_core
  eth0:9  [fd00:10:244:3::3d8]:51420->[fd00:10:244:1::c078]:8000(tcp)  nf_hook_slow
  eth0:9  [fd00:10:244:3::3d8]:51420->[fd00:10:244:1::c078]:8000(tcp)  inet_proto_csum_replace_by_diff
  eth0:9  [fd00:10:244:3::3d8]:51420->[3333:0:0:1::c078]:8000(tcp)     tcp_v6_early_demux
  eth0:9  [fd00:10:244:3::3d8]:51420->[3333:0:0:1::c078]:8000(tcp)     ip6_route_input
  eth0:9  [fd00:10:244:3::3d8]:51420->[3333:0:0:1::c078]:8000(tcp)     ip6_input
  eth0:9  [fd00:10:244:3::3d8]:51420->[3333:0:0:1::c078]:8000(tcp)     ip6_input_finish
  eth0:9  [fd00:10:244:3::3d8]:51420->[3333:0:0:1::c078]:8000(tcp)     ip6_protocol_deliver_rcu
  eth0:9  [fd00:10:244:3::3d8]:51420->[3333:0:0:1::c078]:8000(tcp)     raw6_local_deliver
  eth0:9  [fd00:10:244:3::3d8]:51420->[3333:0:0:1::c078]:8000(tcp)     ipv6_raw_deliver
  eth0:9  [fd00:10:244:3::3d8]:51420->[3333:0:0:1::c078]:8000(tcp)     tcp_v6_rcv
  eth0:9  [fd00:10:244:3::3d8]:51420->[3333:0:0:1::c078]:8000(tcp)     __skb_checksum_complete
  eth0:9  [fd00:10:244:3::3d8]:51420->[3333:0:0:1::c078]:8000(tcp)     kfree_skb_reason(SKB_DROP_REASON_TCP_CSUM)
  eth0:9  [fd00:10:244:3::3d8]:51420->[3333:0:0:1::c078]:8000(tcp)     skb_release_head_state
  eth0:9  [fd00:10:244:3::3d8]:51420->[3333:0:0:1::c078]:8000(tcp)     skb_release_data
  eth0:9  [fd00:10:244:3::3d8]:51420->[3333:0:0:1::c078]:8000(tcp)     skb_free_head
  eth0:9  [fd00:10:244:3::3d8]:51420->[3333:0:0:1::c078]:8000(tcp)     kfree_skbmem

This is happening because inet_proto_csum_replace_by_diff is updating
skb->csum when it shouldn't. The L4 checksum is updated such that it
"cancels" the IPv6 address change in terms of checksum computation, so
the impact on skb->csum is null.

Note this would be different for an IPv4 packet since three fields
would be updated: the IPv4 address, the IP checksum, and the L4
checksum. Two would cancel each other and skb->csum would still need
to be updated to take the L4 checksum change into account.

This patch fixes it by passing an ipv6 flag to
inet_proto_csum_replace_by_diff, to skip the skb->csum update if we're
in the IPv6 case. Note the behavior of the only other user of
inet_proto_csum_replace_by_diff, the BPF subsystem, is left as is in
this patch and fixed in the subsequent patch.

With the fix, using the reproduction from above, I can confirm
skb->csum is not touched by inet_proto_csum_replace_by_diff and the TCP
SYN proceeds to the application after the ILA translation.

Link: https://github.com/cilium/pwru [1]
Fixes: 65d7ab8de5 ("net: Identifier Locator Addressing module")
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://patch.msgid.link/b5539869e3550d46068504feb02d37653d939c0b.1748509484.git.paul.chaignon@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-30 19:53:51 -07:00
Jakub Kicinski
44abca1fe3 Merge branch 'net-stmmac-prevent-div-by-0'
Alexis Lothoré says:

====================
net: stmmac: prevent div by 0

fix a small splat I am observing on a STM32MP157 platform at boot
(see commit 1) due to a division by 0.

v3: https://lore.kernel.org/20250528-stmmac_tstamp_div-v3-0-b525ecdfd84c@bootlin.com
v2: https://lore.kernel.org/20250527-stmmac_tstamp_div-v2-1-663251b3b542@bootlin.com
v1: https://lore.kernel.org/20250523-stmmac_tstamp_div-v1-1-bca8a5a3a477@bootlin.com
====================

Link: https://patch.msgid.link/20250529-stmmac_tstamp_div-v4-0-d73340a794d5@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-30 19:33:30 -07:00