Commit Graph

49087 Commits

Author SHA1 Message Date
Paolo Bonzini
f02b1bcc73 Merge tag 'kvm-x86-irqs-6.17' of https://github.com/kvm-x86/linux into HEAD
KVM IRQ changes for 6.17

 - Rework irqbypass to track/match producers and consumers via an xarray
   instead of a linked list.  Using a linked list leads to O(n^2) insertion
   times, which is hugely problematic for use cases that create large numbers
   of VMs.  Such use cases typically don't actually use irqbypass, but
   eliminating the pointless registration is a future problem to solve as it
   likely requires new uAPI.

 - Track irqbypass's "token" as "struct eventfd_ctx *" instead of a "void *",
   to avoid making a simple concept unnecessarily difficult to understand.

 - Add CONFIG_KVM_IOAPIC for x86 to allow disabling support for I/O APIC, PIC,
   and PIT emulation at compile time.

 - Drop x86's irq_comm.c, and move a pile of IRQ related code into irq.c.

 - Fix a variety of flaws and bugs in the AVIC device posted IRQ code.

 - Inhibited AVIC if a vCPU's ID is too big (relative to what hardware
   supports) instead of rejecting vCPU creation.

 - Extend enable_ipiv module param support to SVM, by simply leaving IsRunning
   clear in the vCPU's physical ID table entry.

 - Disable IPI virtualization, via enable_ipiv, if the CPU is affected by
   erratum #1235, to allow (safely) enabling AVIC on such CPUs.

 - Dedup x86's device posted IRQ code, as the vast majority of functionality
   can be shared verbatime between SVM and VMX.

 - Harden the device posted IRQ code against bugs and runtime errors.

 - Use vcpu_idx, not vcpu_id, for GA log tag/metadata, to make lookups O(1)
   instead of O(n).

 - Generate GA Log interrupts if and only if the target vCPU is blocking, i.e.
   only if KVM needs a notification in order to wake the vCPU.

 - Decouple device posted IRQs from VFIO device assignment, as binding a VM to
   a VFIO group is not a requirement for enabling device posted IRQs.

 - Clean up and document/comment the irqfd assignment code.

 - Disallow binding multiple irqfds to an eventfd with a priority waiter, i.e.
   ensure an eventfd is bound to at most one irqfd through the entire host,
   and add a selftest to verify eventfd:irqfd bindings are globally unique.
2025-07-29 08:35:46 -04:00
Masami Hiramatsu (Google)
a3e892ab0f tracing: fprobe: Fix infinite recursion using preempt_*_notrace()
Since preempt_count_add/del() are tracable functions, it is not allowed
to use preempt_disable/enable() in ftrace handlers. Without this fix,
probing on `preempt_count_add%return` will cause an infinite recursion
of fprobes.

To fix this problem, use preempt_disable/enable_notrace() in
fprobe_return().

Link: https://lore.kernel.org/all/175374642359.1471729.1054175011228386560.stgit@mhiramat.tok.corp.google.com/

Fixes: 4346ba1604 ("fprobe: Rewrite fprobe on function-graph tracer")
Cc: stable@vger.kernel.org
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2025-07-29 16:19:05 +09:00
Linus Torvalds
53edfecef6 Merge tag 'pm-6.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management updates from Rafael Wysocki:
 "As is tradition, cpufreq is the part with the largest number of
  updates that include core fixes and cleanups as well as updates of
  several assorted drivers, but there are also quite a few updates
  related to system sleep, mostly focused on asynchronous suspend and
  resume of devices and on making the integration of system suspend
  and resume with runtime PM easier.

  Runtime PM is also updated to allow some code duplication in drivers
  to be eliminated going forward and to work more consistently overall
  in some cases.

  Apart from that, there are some driver core updates related to PM
  domains that should help to address ordering issues with devm_ cleanup
  routines relying on PM domains, some assorted devfreq updates
  including core fixes and cleanups, tooling updates, and documentation
  and MAINTAINERS updates.

  Specifics:

   - Fix two initialization ordering issues in the cpufreq core and a
     governor initialization error path in it, and clean it up (Lifeng
     Zheng)

   - Add Granite Rapids support in no-HWP mode to the intel_pstate
     cpufreq driver (Li RongQing)

   - Make intel_pstate always use HWP_DESIRED_PERF when operating in the
     passive mode (Rafael Wysocki)

   - Allow building the tegra124 cpufreq driver as a module (Aaron
     Kling)

   - Do minor cleanups for Rust cpufreq and cpumask APIs and fix
     MAINTAINERS entry for cpu.rs (Abhinav Ananthu, Ritvik Gupta, Lukas
     Bulwahn)

   - Clean up assorted cpufreq drivers (Arnd Bergmann, Dan Carpenter,
     Krzysztof Kozlowski, Sven Peter, Svyatoslav Ryhel, Lifeng Zheng)

   - Add the NEED_UPDATE_LIMITS flag to the CPPC cpufreq driver
     (Prashant Malani)

   - Fix minimum performance state label error in the amd-pstate driver
     documentation (Shouye Liu)

   - Add the CPUFREQ_GOV_STRICT_TARGET flag to the userspace cpufreq
     governor and explain HW coordination influence on it in the
     documentation (Shashank Balaji)

   - Fix opencoded for_each_cpu() in idle_state_valid() in the DT
     cpuidle driver (Yury Norov)

   - Remove info about non-existing QoS interfaces from the PM QoS
     documentation (Ulf Hansson)

   - Use c_* types via kernel prelude in Rust for OPP (Abhinav Ananthu)

   - Add HiSilicon uncore frequency scaling driver to devfreq (Jie Zhan)

   - Allow devfreq drivers to add custom sysfs ABIs (Jie Zhan)

   - Simplify the sun8i-a33-mbus devfreq driver by using more devm
     functions (Uwe Kleine-König)

   - Fix an index typo in trans_stat() in devfreq (Chanwoo Choi)

   - Check devfreq governor before using governor->name (Lifeng Zheng)

   - Remove a redundant devfreq_get_freq_range() call from
     devfreq_add_device() (Lifeng Zheng)

   - Limit max_freq with scaling_min_freq in devfreq (Lifeng Zheng)

   - Replace sscanf() with kstrtoul() in set_freq_store() (Lifeng Zheng)

   - Extend the asynchronous suspend and resume of devices to handle
     suppliers like parents and consumers like children (Rafael Wysocki)

   - Make pm_runtime_force_resume() work for drivers that set the
     DPM_FLAG_SMART_SUSPEND flag and allow PCI drivers and drivers that
     collaborate with the general ACPI PM domain to set it (Rafael
     Wysocki)

   - Add kernel parameter to disable asynchronous suspend/resume of
     devices (Tudor Ambarus)

   - Drop redundant might_sleep() calls from some functions in the
     device suspend/resume core code (Zhongqiu Han)

   - Fix the handling of monitors connected right before waking up the
     system from sleep (tuhaowen)

   - Clean up MAINTAINERS entries for suspend and hibernation (Rafael
     Wysocki)

   - Fix error code path in the KEXEC_JUMP flow and drop a redundant
     pm_restore_gfp_mask() call from it (Rafael Wysocki)

   - Rearrange suspend/resume error handling in the core device suspend
     and resume code (Rafael Wysocki)

   - Fix up white space that does not follow coding style in the
     hibernation core code (Darshan Rathod)

   - Document return values of suspend-related API functions in the
     runtime PM framework (Sakari Ailus)

   - Mark last busy stamp in multiple autosuspend-related functions in
     the runtime PM framework and update its documentation (Sakari
     Ailus)

   - Take active children into account in pm_runtime_get_if_in_use() for
     consistency (Rafael Wysocki)

   - Fix NULL pointer dereference in get_pd_power_uw() in the dtpm_cpu
     power capping driver (Sivan Zohar-Kotzer)

   - Add support for the Bartlett Lake platform to the Intel RAPL power
     capping driver (Qiao Wei)

   - Add PL4 support for Panther Lake to the intel_rapl_msr power
     capping driver (Zhang Rui)

   - Update contact information in the PM ABI docs and maintainer
     information in the power domains DT binding (Rafael Wysocki)

   - Update PM header inclusions to follow the IWYU (Include What You
     Use) principle (Andy Shevchenko)

   - Add flags to specify power on attach/detach for PM domains, make
     the driver core detach PM domains in device_unbind_cleanup(), and
     drop the dev_pm_domain_detach() call from the platform bus type
     (Claudiu Beznea)

   - Improve Python binding's Makefile for cpupower (John B. Wyatt IV)

   - Fix printing of CORE, CPU fields in cpupower-monitor (Gautham
     Shenoy)"

* tag 'pm-6.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (75 commits)
  cpufreq: CPPC: Mark driver with NEED_UPDATE_LIMITS flag
  PM: docs: Use my kernel.org address in ABI docs and DT bindings
  PM: hibernate: Fix up white space that does not follow coding style
  PM: sleep: Rearrange suspend/resume error handling in the core
  Documentation: amd-pstate:fix minimum performance state label error
  PM: runtime: Take active children into account in pm_runtime_get_if_in_use()
  kexec_core: Drop redundant pm_restore_gfp_mask() call
  kexec_core: Fix error code path in the KEXEC_JUMP flow
  PM: sleep: Clean up MAINTAINERS entries for suspend and hibernation
  drivers: cpufreq: add Tegra114 support
  rust: cpumask: Replace `MaybeUninit` and `mem::zeroed` with `Opaque` APIs
  cpufreq: Exit governor when failed to start old governor
  cpufreq: Move the check of cpufreq_driver->get into cpufreq_verify_current_freq()
  cpufreq: Init policy->rwsem before it may be possibly used
  cpufreq: Initialize cpufreq-based frequency-invariance later
  cpufreq: Remove duplicate check in __cpufreq_offline()
  cpufreq: Contain scaling_cur_freq.attr in cpufreq_attrs
  cpufreq: intel_pstate: Add Granite Rapids support in no-HWP mode
  cpufreq: intel_pstate: Always use HWP_DESIRED_PERF in passive mode
  PM / devfreq: Add HiSilicon uncore frequency scaling driver
  ...
2025-07-28 20:13:36 -07:00
KaFai Wan
863aab3d4d bpf: Add log for attaching tracing programs to functions in deny list
Show the rejected function name when attaching tracing programs to
functions in deny list.

With this change, we know why tracing programs can't attach to functions
like __rcu_read_lock() from log.

$ ./fentry
libbpf: prog '__rcu_read_lock': BPF program load failed: -EINVAL
libbpf: prog '__rcu_read_lock': -- BEGIN PROG LOAD LOG --
Attaching tracing programs to function '__rcu_read_lock' is rejected.

Suggested-by: Leon Hwang <leon.hwang@linux.dev>
Signed-off-by: KaFai Wan <kafai.wan@linux.dev>
Acked-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20250724151454.499040-3-kafai.wan@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-07-28 19:39:29 -07:00
KaFai Wan
a5a6b29a70 bpf: Show precise rejected function when attaching fexit/fmod_ret to __noreturn functions
With this change, we know the precise rejected function name when
attaching fexit/fmod_ret to __noreturn functions from log.

$ ./fexit
libbpf: prog 'fexit': BPF program load failed: -EINVAL
libbpf: prog 'fexit': -- BEGIN PROG LOAD LOG --
Attaching fexit/fmod_ret to __noreturn function 'do_exit' is rejected.

Suggested-by: Leon Hwang <leon.hwang@linux.dev>
Signed-off-by: KaFai Wan <kafai.wan@linux.dev>
Acked-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20250724151454.499040-2-kafai.wan@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-07-28 19:39:29 -07:00
Linus Torvalds
e833f7dfe3 Merge tag 'audit-pr-20250725' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit
Pull audit update from Paul Moore:
 "A single audit patch that restores logging of an audit event in the
  module load failure case"

* tag 'audit-pr-20250725' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
  audit,module: restore audit logging in load failure case
2025-07-28 18:31:06 -07:00
Linus Torvalds
13150742b0 Merge tag 'libcrypto-updates-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux
Pull crypto library updates from Eric Biggers:
 "This is the main crypto library pull request for 6.17. The main focus
  this cycle is on reorganizing the SHA-1 and SHA-2 code, providing
  high-quality library APIs for SHA-1 and SHA-2 including HMAC support,
  and establishing conventions for lib/crypto/ going forward:

   - Migrate the SHA-1 and SHA-512 code (and also SHA-384 which shares
     most of the SHA-512 code) into lib/crypto/. This includes both the
     generic and architecture-optimized code. Greatly simplify how the
     architecture-optimized code is integrated. Add an easy-to-use
     library API for each SHA variant, including HMAC support. Finally,
     reimplement the crypto_shash support on top of the library API.

   - Apply the same reorganization to the SHA-256 code (and also SHA-224
     which shares most of the SHA-256 code). This is a somewhat smaller
     change, due to my earlier work on SHA-256. But this brings in all
     the same additional improvements that I made for SHA-1 and SHA-512.

  There are also some smaller changes:

   - Move the architecture-optimized ChaCha, Poly1305, and BLAKE2s code
     from arch/$(SRCARCH)/lib/crypto/ to lib/crypto/$(SRCARCH)/. For
     these algorithms it's just a move, not a full reorganization yet.

   - Fix the MIPS chacha-core.S to build with the clang assembler.

   - Fix the Poly1305 functions to work in all contexts.

   - Fix a performance regression in the x86_64 Poly1305 code.

   - Clean up the x86_64 SHA-NI optimized SHA-1 assembly code.

  Note that since the new organization of the SHA code is much simpler,
  the diffstat of this pull request is negative, despite the addition of
  new fully-documented library APIs for multiple SHA and HMAC-SHA
  variants.

  These APIs will allow further simplifications across the kernel as
  users start using them instead of the old-school crypto API. (I've
  already written a lot of such conversion patches, removing over 1000
  more lines of code. But most of those will target 6.18 or later)"

* tag 'libcrypto-updates-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux: (67 commits)
  lib/crypto: arm64/sha512-ce: Drop compatibility macros for older binutils
  lib/crypto: x86/sha1-ni: Convert to use rounds macros
  lib/crypto: x86/sha1-ni: Minor optimizations and cleanup
  crypto: sha1 - Remove sha1_base.h
  lib/crypto: x86/sha1: Migrate optimized code into library
  lib/crypto: sparc/sha1: Migrate optimized code into library
  lib/crypto: s390/sha1: Migrate optimized code into library
  lib/crypto: powerpc/sha1: Migrate optimized code into library
  lib/crypto: mips/sha1: Migrate optimized code into library
  lib/crypto: arm64/sha1: Migrate optimized code into library
  lib/crypto: arm/sha1: Migrate optimized code into library
  crypto: sha1 - Use same state format as legacy drivers
  crypto: sha1 - Wrap library and add HMAC support
  lib/crypto: sha1: Add HMAC support
  lib/crypto: sha1: Add SHA-1 library functions
  lib/crypto: sha1: Rename sha1_init() to sha1_init_raw()
  crypto: x86/sha1 - Rename conflicting symbol
  lib/crypto: sha2: Add hmac_sha*_init_usingrawkey()
  lib/crypto: arm/poly1305: Remove unneeded empty weak function
  lib/crypto: x86/poly1305: Fix performance regression on short messages
  ...
2025-07-28 17:58:52 -07:00
Linus Torvalds
8e736a2eea Merge tag 'hardening-v6.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull hardening updates from Kees Cook:

 - Introduce and start using TRAILING_OVERLAP() helper for fixing
   embedded flex array instances (Gustavo A. R. Silva)

 - mux: Convert mux_control_ops to a flex array member in mux_chip
   (Thorsten Blum)

 - string: Group str_has_prefix() and strstarts() (Andy Shevchenko)

 - Remove KCOV instrumentation from __init and __head (Ritesh Harjani,
   Kees Cook)

 - Refactor and rename stackleak feature to support Clang

 - Add KUnit test for seq_buf API

 - Fix KUnit fortify test under LTO

* tag 'hardening-v6.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (22 commits)
  sched/task_stack: Add missing const qualifier to end_of_stack()
  kstack_erase: Support Clang stack depth tracking
  kstack_erase: Add -mgeneral-regs-only to silence Clang warnings
  init.h: Disable sanitizer coverage for __init and __head
  kstack_erase: Disable kstack_erase for all of arm compressed boot code
  x86: Handle KCOV __init vs inline mismatches
  arm64: Handle KCOV __init vs inline mismatches
  s390: Handle KCOV __init vs inline mismatches
  arm: Handle KCOV __init vs inline mismatches
  mips: Handle KCOV __init vs inline mismatch
  powerpc/mm/book3s64: Move kfence and debug_pagealloc related calls to __init section
  configs/hardening: Enable CONFIG_INIT_ON_FREE_DEFAULT_ON
  configs/hardening: Enable CONFIG_KSTACK_ERASE
  stackleak: Split KSTACK_ERASE_CFLAGS from GCC_PLUGINS_CFLAGS
  stackleak: Rename stackleak_track_stack to __sanitizer_cov_stack_depth
  stackleak: Rename STACKLEAK to KSTACK_ERASE
  seq_buf: Introduce KUnit tests
  string: Group str_has_prefix() and strstarts()
  kunit/fortify: Add back "volatile" for sizeof() constants
  acpi: nfit: intel: avoid multiple -Wflex-array-member-not-at-end warnings
  ...
2025-07-28 17:16:12 -07:00
Linus Torvalds
d900c4ce63 Merge tag 'execve-v6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull execve updates from Kees Cook:

 - Introduce regular REGSET note macros arch-wide (Dave Martin)

 - Remove arbitrary 4K limitation of program header size (Yin Fengwei)

 - Reorder function qualifiers for copy_clone_args_from_user() (Dishank Jogi)

* tag 'execve-v6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (25 commits)
  fork: reorder function qualifiers for copy_clone_args_from_user
  binfmt_elf: remove the 4k limitation of program header size
  binfmt_elf: Warn on missing or suspicious regset note names
  xtensa: ptrace: Use USER_REGSET_NOTE_TYPE() to specify regset note names
  um: ptrace: Use USER_REGSET_NOTE_TYPE() to specify regset note names
  x86/ptrace: Use USER_REGSET_NOTE_TYPE() to specify regset note names
  sparc: ptrace: Use USER_REGSET_NOTE_TYPE() to specify regset note names
  sh: ptrace: Use USER_REGSET_NOTE_TYPE() to specify regset note names
  s390/ptrace: Use USER_REGSET_NOTE_TYPE() to specify regset note names
  riscv: ptrace: Use USER_REGSET_NOTE_TYPE() to specify regset note names
  powerpc/ptrace: Use USER_REGSET_NOTE_TYPE() to specify regset note names
  parisc: ptrace: Use USER_REGSET_NOTE_TYPE() to specify regset note names
  openrisc: ptrace: Use USER_REGSET_NOTE_TYPE() to specify regset note names
  nios2: ptrace: Use USER_REGSET_NOTE_TYPE() to specify regset note names
  MIPS: ptrace: Use USER_REGSET_NOTE_TYPE() to specify regset note names
  m68k: ptrace: Use USER_REGSET_NOTE_TYPE() to specify regset note names
  LoongArch: ptrace: Use USER_REGSET_NOTE_TYPE() to specify regset note names
  hexagon: ptrace: Use USER_REGSET_NOTE_TYPE() to specify regset note names
  csky: ptrace: Use USER_REGSET_NOTE_TYPE() to specify regset note names
  arm64: ptrace: Use USER_REGSET_NOTE_TYPE() to specify regset note names
  ...
2025-07-28 17:11:40 -07:00
Linus Torvalds
6e11664f14 Merge tag 'for-6.17/block-20250728' of git://git.kernel.dk/linux
Pull block updates from Jens Axboe:

 - MD pull request via Yu:
      - call del_gendisk synchronously (Xiao)
      - cleanup unused variable (John)
      - cleanup workqueue flags (Ryo)
      - fix faulty rdev can't be removed during resync (Qixing)

 - NVMe pull request via Christoph:
      - try PCIe function level reset on init failure (Keith Busch)
      - log TLS handshake failures at error level (Maurizio Lombardi)
      - pci-epf: do not complete commands twice if nvmet_req_init()
        fails (Rick Wertenbroek)
      - misc cleanups (Alok Tiwari)

 - Removal of the pktcdvd driver

   This has been more than a decade coming at this point, and some
   recently revealed breakages that had it causing issues even for cases
   where it isn't required made me re-pull the trigger on this one. It's
   known broken and nobody has stepped up to maintain the code

 - Series for ublk supporting batch commands, enabling the use of
   multishot where appropriate

 - Speed up ublk exit handling

 - Fix for the two-stage elevator fixing which could leak data

 - Convert NVMe to use the new IOVA based API

 - Increase default max transfer size to something more reasonable

 - Series fixing write operations on zoned DM devices

 - Add tracepoints for zoned block device operations

 - Prep series working towards improving blk-mq queue management in the
   presence of isolated CPUs

 - Don't allow updating of the block size of a loop device that is
   currently under exclusively ownership/open

 - Set chunk sectors from stacked device stripe size and use it for the
   atomic write size limit

 - Switch to folios in bcache read_super()

 - Fix for CD-ROM MRW exit flush handling

 - Various tweaks, fixes, and cleanups

* tag 'for-6.17/block-20250728' of git://git.kernel.dk/linux: (94 commits)
  block: restore two stage elevator switch while running nr_hw_queue update
  cdrom: Call cdrom_mrw_exit from cdrom_release function
  sunvdc: Balance device refcount in vdc_port_mpgroup_check
  nvme-pci: try function level reset on init failure
  dm: split write BIOs on zone boundaries when zone append is not emulated
  block: use chunk_sectors when evaluating stacked atomic write limits
  dm-stripe: limit chunk_sectors to the stripe size
  md/raid10: set chunk_sectors limit
  md/raid0: set chunk_sectors limit
  block: sanitize chunk_sectors for atomic write limits
  ilog2: add max_pow_of_two_factor()
  nvmet: pci-epf: Do not complete commands twice if nvmet_req_init() fails
  nvme-tcp: log TLS handshake failures at error level
  docs: nvme: fix grammar in nvme-pci-endpoint-target.rst
  nvme: fix typo in status code constant for self-test in progress
  nvmet: remove redundant assignment of error code in nvmet_ns_enable()
  nvme: fix incorrect variable in io cqes error message
  nvme: fix multiple spelling and grammar issues in host drivers
  block: fix blk_zone_append_update_request_bio() kernel-doc
  md/raid10: fix set but not used variable in sync_request_write()
  ...
2025-07-28 16:43:54 -07:00
Masami Hiramatsu (Google)
133c302a0c tracing: trace_fprobe: Fix typo of the semicolon
Fix a typo that uses ',' instead of ';' for line delimiter.

Link: https://lore.kernel.org/linux-trace-kernel/175366879192.487099.5714468217360139639.stgit@mhiramat.tok.corp.google.com/

Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2025-07-29 08:37:52 +09:00
Linus Torvalds
7e7bc8335b Merge tag 'vfs-6.17-rc1.bpf' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs bpf updates from Christian Brauner:
 "These changes allow bpf to read extended attributes from cgroupfs.

  This is useful in redirecting AF_UNIX socket connections based on
  cgroup membership of the socket. One use-case is the ability to
  implement log namespaces in systemd so services and containers are
  redirected to different journals"

* tag 'vfs-6.17-rc1.bpf' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  selftests/kernfs: test xattr retrieval
  selftests/bpf: Add tests for bpf_cgroup_read_xattr
  bpf: Mark cgroup_subsys_state->cgroup RCU safe
  bpf: Introduce bpf_cgroup_read_xattr to read xattr of cgroup's node
  kernfs: remove iattr_mutex
2025-07-28 14:42:31 -07:00
Linus Torvalds
672dcda246 Merge tag 'vfs-6.17-rc1.pidfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull pidfs updates from Christian Brauner:

 - persistent info

   Persist exit and coredump information independent of whether anyone
   currently holds a pidfd for the struct pid.

   The current scheme allocated pidfs dentries on-demand repeatedly.
   This scheme is reaching it's limits as it makes it impossible to pin
   information that needs to be available after the task has exited or
   coredumped and that should not be lost simply because the pidfd got
   closed temporarily. The next opener should still see the stashed
   information.

   This is also a prerequisite for supporting extended attributes on
   pidfds to allow attaching meta information to them.

   If someone opens a pidfd for a struct pid a pidfs dentry is allocated
   and stashed in pid->stashed. Once the last pidfd for the struct pid
   is closed the pidfs dentry is released and removed from pid->stashed.

   So if 10 callers create a pidfs dentry for the same struct pid
   sequentially, i.e., each closing the pidfd before the other creates a
   new one then a new pidfs dentry is allocated every time.

   Because multiple tasks acquiring and releasing a pidfd for the same
   struct pid can race with each another a task may still find a valid
   pidfs entry from the previous task in pid->stashed and reuse it. Or
   it might find a dead dentry in there and fail to reuse it and so
   stashes a new pidfs dentry. Multiple tasks may race to stash a new
   pidfs dentry but only one will succeed, the other ones will put their
   dentry.

   The current scheme aims to ensure that a pidfs dentry for a struct
   pid can only be created if the task is still alive or if a pidfs
   dentry already existed before the task was reaped and so exit
   information has been was stashed in the pidfs inode.

   That's great except that it's buggy. If a pidfs dentry is stashed in
   pid->stashed after pidfs_exit() but before __unhash_process() is
   called we will return a pidfd for a reaped task without exit
   information being available.

   The pidfds_pid_valid() check does not guard against this race as it
   doens't sync at all with pidfs_exit(). The pid_has_task() check might
   be successful simply because we're before __unhash_process() but
   after pidfs_exit().

   Introduce a new scheme where the lifetime of information associated
   with a pidfs entry (coredump and exit information) isn't bound to the
   lifetime of the pidfs inode but the struct pid itself.

   The first time a pidfs dentry is allocated for a struct pid a struct
   pidfs_attr will be allocated which will be used to store exit and
   coredump information.

   If all pidfs for the pidfs dentry are closed the dentry and inode can
   be cleaned up but the struct pidfs_attr will stick until the struct
   pid itself is freed. This will ensure minimal memory usage while
   persisting relevant information.

   The new scheme has various advantages. First, it allows to close the
   race where we end up handing out a pidfd for a reaped task for which
   no exit information is available. Second, it minimizes memory usage.
   Third, it allows to remove complex lifetime tracking via dentries
   when registering a struct pid with pidfs. There's no need to get or
   put a reference. Instead, the lifetime of exit and coredump
   information associated with a struct pid is bound to the lifetime of
   struct pid itself.

 - extended attributes

   Now that we have a way to persist information for pidfs dentries we
   can start supporting extended attributes on pidfds. This will allow
   userspace to attach meta information to tasks.

   One natural extension would be to introduce a custom pidfs.* extended
   attribute space and allow for the inheritance of extended attributes
   across fork() and exec().

   The first simple scheme will allow privileged userspace to set
   trusted extended attributes on pidfs inodes.

 - Allow autonomous pidfs file handles

   Various filesystems such as pidfs and drm support opening file
   handles without having to require a file descriptor to identify the
   filesystem. The filesystem are global single instances and can be
   trivially identified solely on the information encoded in the file
   handle.

   This makes it possible to not have to keep or acquire a sentinal file
   descriptor just to pass it to open_by_handle_at() to identify the
   filesystem. That's especially useful when such sentinel file
   descriptor cannot or should not be acquired.

   For pidfs this means a file handle can function as full replacement
   for storing a pid in a file. Instead a file handle can be stored and
   reopened purely based on the file handle.

   Such autonomous file handles can be opened with or without specifying
   a a file descriptor. If no proper file descriptor is used the
   FD_PIDFS_ROOT sentinel must be passed. This allows us to define
   further special negative fd sentinels in the future.

   Userspace can trivially test for support by trying to open the file
   handle with an invalid file descriptor.

 - Allow pidfds for reaped tasks with SCM_PIDFD messages

   This is a logical continuation of the earlier work to create pidfds
   for reaped tasks through the SO_PEERPIDFD socket option merged in
   923ea4d448 ("Merge patch series "net, pidfs: enable handing out
   pidfds for reaped sk->sk_peer_pid"").

 - Two minor fixes:

    * Fold fs_struct->{lock,seq} into a seqlock

    * Don't bother with path_{get,put}() in unix_open_file()

* tag 'vfs-6.17-rc1.pidfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (37 commits)
  don't bother with path_get()/path_put() in unix_open_file()
  fold fs_struct->{lock,seq} into a seqlock
  selftests: net: extend SCM_PIDFD test to cover stale pidfds
  af_unix: enable handing out pidfds for reaped tasks in SCM_PIDFD
  af_unix: stash pidfs dentry when needed
  af_unix/scm: fix whitespace errors
  af_unix: introduce and use scm_replace_pid() helper
  af_unix: introduce unix_skb_to_scm helper
  af_unix: rework unix_maybe_add_creds() to allow sleep
  selftests/pidfd: decode pidfd file handles withou having to specify an fd
  fhandle, pidfs: support open_by_handle_at() purely based on file handle
  uapi/fcntl: add FD_PIDFS_ROOT
  uapi/fcntl: add FD_INVALID
  fcntl/pidfd: redefine PIDFD_SELF_THREAD_GROUP
  uapi/fcntl: mark range as reserved
  fhandle: reflow get_path_anchor()
  pidfs: add pidfs_root_path() helper
  fhandle: rename to get_path_anchor()
  fhandle: hoist copy_from_user() above get_path_from_fd()
  fhandle: raise FILEID_IS_DIR in handle_type
  ...
2025-07-28 14:10:15 -07:00
Gabriele Monaco
614384533d rv: Add opid per-cpu monitor
Add a per-cpu monitor as part of the sched model:
* opid: operations with preemption and irq disabled
    Monitor to ensure wakeup and need_resched occur with irq and
    preemption disabled or in irq handlers.

Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tomas Glozar <tglozar@redhat.com>
Cc: Juri Lelli <jlelli@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
Link: https://lore.kernel.org/20250728135022.255578-10-gmonaco@redhat.com
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
Acked-by: Nam Cao <namcao@linutronix.de>
Tested-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-28 16:47:35 -04:00
Gabriele Monaco
e8440a88e5 rv: Add nrp and sssw per-task monitors
Add 2 per-task monitors as part of the sched model:

* nrp: need-resched preempts
    Monitor to ensure preemption requires need resched.
* sssw: set state sleep and wakeup
    Monitor to ensure sched_set_state to sleepable leads to sleeping and
    sleeping tasks require wakeup.

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Tomas Glozar <tglozar@redhat.com>
Cc: Juri Lelli <jlelli@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/20250728135022.255578-9-gmonaco@redhat.com
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
Acked-by: Nam Cao <namcao@linutronix.de>
Tested-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-28 16:47:34 -04:00
Gabriele Monaco
d0096c2f9c rv: Replace tss and sncid monitors with more complete sts
The tss monitor currently guarantees task switches can happen only while
scheduling, whereas the sncid monitor enforces scheduling occurs with
interrupt disabled.

Replace the monitors with a more comprehensive specification which
implies both but also ensures that:
* each scheduler call disable interrupts to switch
* each task switch happens with interrupts disabled

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Nam Cao <namcao@linutronix.de>
Cc: Tomas Glozar <tglozar@redhat.com>
Cc: Juri Lelli <jlelli@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/20250728135022.255578-8-gmonaco@redhat.com
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-28 16:47:34 -04:00
Gabriele Monaco
adcc3bfa88 sched: Adapt sched tracepoints for RV task model
Add the following tracepoint:
* sched_set_need_resched(tsk, cpu, tif)
    Called when a task is set the need resched [lazy] flag

Remove the unused ip parameter from sched_entry and sched_exit and alter
sched_entry to have a value of preempt consistent with the one used in
sched_switch.

Also adapt all monitors using sched_{entry,exit} to avoid breaking build.

These tracepoints are useful to describe the Linux task model and are
adapted from the patches by Daniel Bristot de Oliveira
(https://bristot.me/linux-task-model/).

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Nam Cao <namcao@linutronix.de>
Cc: Tomas Glozar <tglozar@redhat.com>
Cc: Juri Lelli <jlelli@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
Link: https://lore.kernel.org/20250728135022.255578-7-gmonaco@redhat.com
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-28 16:47:34 -04:00
Gabriele Monaco
9d475d80c9 rv: Retry when da monitor detects race conditions
DA monitor can be accessed from multiple cores simultaneously, this is
likely, for instance when dealing with per-task monitors reacting on
events that do not always occur on the CPU where the task is running.
This can cause race conditions where two events change the next state
and we see inconsistent values. E.g.:

  [62] event_srs: 27: sleepable x sched_wakeup -> running (final)
  [63] event_srs: 27: sleepable x sched_set_state_sleepable -> sleepable
  [63] error_srs: 27: event sched_switch_suspend not expected in the state running

In this case the monitor fails because the event on CPU 62 wins against
the one on CPU 63, although the correct state should have been
sleepable, since the task get suspended.

Detect if the current state was modified by using try_cmpxchg while
storing the next value. If it was, try again reading the current state.
After a maximum number of failed retries, react by calling a special
tracepoint, print on the console and reset the monitor.

Remove the functions da_monitor_curr_state() and da_monitor_set_state()
as they only hide the underlying implementation in this case.

Monitors where this type of condition can occur must be able to account
for racing events in any possible order, as we cannot know the winner.

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Tomas Glozar <tglozar@redhat.com>
Cc: Juri Lelli <jlelli@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/20250728135022.255578-6-gmonaco@redhat.com
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
Reviewed-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-28 16:47:34 -04:00
Gabriele Monaco
79de661707 rv: Adjust monitor dependencies
RV monitors relying on the preemptirqs tracepoints are set as dependent
on PREEMPT_TRACER and IRQSOFF_TRACER. In fact, those configurations do
enable the tracepoints but are not the minimal configurations enabling
them, which are TRACE_PREEMPT_TOGGLE and TRACE_IRQFLAGS (not selectable
manually).

Set TRACE_PREEMPT_TOGGLE and TRACE_IRQFLAGS as dependencies for
monitors.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tomas Glozar <tglozar@redhat.com>
Cc: Juri Lelli <jlelli@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
Link: https://lore.kernel.org/20250728135022.255578-5-gmonaco@redhat.com
Fixes: fbe6c09b7e ("rv: Add scpd, snep and sncid per-cpu monitors")
Acked-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-28 16:47:34 -04:00
Gabriele Monaco
7f904ff6e5 rv: Use strings in da monitors tracepoints
Using DA monitors tracepoints with KASAN enabled triggers the following
warning:

 BUG: KASAN: global-out-of-bounds in do_trace_event_raw_event_event_da_monitor+0xd6/0x1a0
 Read of size 32 at addr ffffffffaada8980 by task ...
 Call Trace:
  <TASK>
 [...]
  do_trace_event_raw_event_event_da_monitor+0xd6/0x1a0
  ? __pfx_do_trace_event_raw_event_event_da_monitor+0x10/0x10
  ? trace_event_sncid+0x83/0x200
  trace_event_sncid+0x163/0x200
 [...]
 The buggy address belongs to the variable:
  automaton_snep+0x4e0/0x5e0

This is caused by the tracepoints reading 32 bytes __array instead of
__string from the automata definition. Such strings are literals and
reading 32 bytes ends up in out of bound memory accesses (e.g. the next
automaton's data in this case).
The error is harmless as, while printing the string, we stop at the null
terminator, but it should still be fixed.

Use the __string facilities while defining the tracepoints to avoid
reading out of bound memory.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tomas Glozar <tglozar@redhat.com>
Cc: Juri Lelli <jlelli@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
Link: https://lore.kernel.org/20250728135022.255578-4-gmonaco@redhat.com
Fixes: 792575348f ("rv/include: Add deterministic automata monitor definition via C macros")
Reviewed-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-28 16:47:34 -04:00
Gabriele Monaco
7b70ac4cad rv: Remove trailing whitespace from tracepoint string
RV event tracepoints print a line with the format:
    "event_xyz: S0 x event -> S1 "
    "event_xyz: S1 x event -> S0 (final)"

While printing an event leading to a non-final state, the line
has a trailing white space (visible above before the closing ").

Adapt the format string not to print the trailing whitespace if we are
not printing "(final)".

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tomas Glozar <tglozar@redhat.com>
Cc: Juri Lelli <jlelli@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
Link: https://lore.kernel.org/20250728135022.255578-3-gmonaco@redhat.com
Reviewed-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-28 16:47:34 -04:00
Linus Torvalds
117eab5c6e Merge tag 'vfs-6.17-rc1.coredump' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull coredump updates from Christian Brauner:
 "This contains an extension to the coredump socket and a proper rework
  of the coredump code.

   - This extends the coredump socket to allow the coredump server to
     tell the kernel how to process individual coredumps. This allows
     for fine-grained coredump management. Userspace can decide to just
     let the kernel write out the coredump, or generate the coredump
     itself, or just reject it.

     * COREDUMP_KERNEL
       The kernel will write the coredump data to the socket.

     * COREDUMP_USERSPACE
       The kernel will not write coredump data but will indicate to the
       parent that a coredump has been generated. This is used when
       userspace generates its own coredumps.

     * COREDUMP_REJECT
       The kernel will skip generating a coredump for this task.

     * COREDUMP_WAIT
       The kernel will prevent the task from exiting until the coredump
       server has shutdown the socket connection.

     The flexible coredump socket can be enabled by using the "@@"
     prefix instead of the single "@" prefix for the regular coredump
     socket:

       @@/run/systemd/coredump.socket

   - Cleanup the coredump code properly while we have to touch it
     anyway.

     Split out each coredump mode in a separate helper so it's easy to
     grasp what is going on and make the code easier to follow. The core
     coredump function should now be very trivial to follow"

* tag 'vfs-6.17-rc1.coredump' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (31 commits)
  cleanup: add a scoped version of CLASS()
  coredump: add coredump_skip() helper
  coredump: avoid pointless variable
  coredump: order auto cleanup variables at the top
  coredump: add coredump_cleanup()
  coredump: auto cleanup prepare_creds()
  cred: add auto cleanup method
  coredump: directly return
  coredump: auto cleanup argv
  coredump: add coredump_write()
  coredump: use a single helper for the socket
  coredump: move pipe specific file check into coredump_pipe()
  coredump: split pipe coredumping into coredump_pipe()
  coredump: move core_pipe_count to global variable
  coredump: prepare to simplify exit paths
  coredump: split file coredumping into coredump_file()
  coredump: rename do_coredump() to vfs_coredump()
  selftests/coredump: make sure invalid paths are rejected
  coredump: validate socket path in coredump_parse()
  coredump: don't allow ".." in coredump socket path
  ...
2025-07-28 11:50:36 -07:00
Suchit Karunakaran
5b4c54ac49 bpf: Fix various typos in verifier.c comments
This patch fixes several minor typos in comments within the BPF verifier.
No changes in functionality.

Signed-off-by: Suchit Karunakaran <suchitkarunakaran@gmail.com>
Link: https://lore.kernel.org/r/20250727081754.15986-1-suchitkarunakaran@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-07-28 10:02:57 -07:00
Paul Chaignon
5dbb19b16a bpf: Add third round of bounds deduction
Commit d7f0087381 ("bpf: try harder to deduce register bounds from
different numeric domains") added a second call to __reg_deduce_bounds
in reg_bounds_sync because a single call wasn't enough to converge to a
fixed point in terms of register bounds.

With patch "bpf: Improve bounds when s64 crosses sign boundary" from
this series, Eduard noticed that calling __reg_deduce_bounds twice isn't
enough anymore to converge. The first selftest added in "selftests/bpf:
Test cross-sign 64bits range refinement" highlights the need for a third
call to __reg_deduce_bounds. After instruction 7, reg_bounds_sync
performs the following bounds deduction:

  reg_bounds_sync entry:          scalar(smin=-655,smax=0xeffffeee,smin32=-783,smax32=-146)
  __update_reg_bounds:            scalar(smin=-655,smax=0xeffffeee,smin32=-783,smax32=-146)
  __reg_deduce_bounds:
      __reg32_deduce_bounds:      scalar(smin=-655,smax=0xeffffeee,smin32=-783,smax32=-146,umin32=0xfffffcf1,umax32=0xffffff6e)
      __reg64_deduce_bounds:      scalar(smin=-655,smax=0xeffffeee,smin32=-783,smax32=-146,umin32=0xfffffcf1,umax32=0xffffff6e)
      __reg_deduce_mixed_bounds:  scalar(smin=-655,smax=0xeffffeee,umin=umin32=0xfffffcf1,umax=0xffffffffffffff6e,smin32=-783,smax32=-146,umax32=0xffffff6e)
  __reg_deduce_bounds:
      __reg32_deduce_bounds:      scalar(smin=-655,smax=0xeffffeee,umin=umin32=0xfffffcf1,umax=0xffffffffffffff6e,smin32=-783,smax32=-146,umax32=0xffffff6e)
      __reg64_deduce_bounds:      scalar(smin=-655,smax=smax32=-146,umin=0xfffffffffffffd71,umax=0xffffffffffffff6e,smin32=-783,umin32=0xfffffcf1,umax32=0xffffff6e)
      __reg_deduce_mixed_bounds:  scalar(smin=-655,smax=smax32=-146,umin=0xfffffffffffffd71,umax=0xffffffffffffff6e,smin32=-783,umin32=0xfffffcf1,umax32=0xffffff6e)
  __reg_bound_offset:             scalar(smin=-655,smax=smax32=-146,umin=0xfffffffffffffd71,umax=0xffffffffffffff6e,smin32=-783,umin32=0xfffffcf1,umax32=0xffffff6e,var_off=(0xfffffffffffffc00; 0x3ff))
  __update_reg_bounds:            scalar(smin=-655,smax=smax32=-146,umin=0xfffffffffffffd71,umax=0xffffffffffffff6e,smin32=-783,umin32=0xfffffcf1,umax32=0xffffff6e,var_off=(0xfffffffffffffc00; 0x3ff))

In particular, notice how:
1. In the first call to __reg_deduce_bounds, __reg32_deduce_bounds
   learns new u32 bounds.
2. __reg64_deduce_bounds is unable to improve bounds at this point.
3. __reg_deduce_mixed_bounds derives new u64 bounds from the u32 bounds.
4. In the second call to __reg_deduce_bounds, __reg64_deduce_bounds
   improves the smax and umin bounds thanks to patch "bpf: Improve
   bounds when s64 crosses sign boundary" from this series.
5. Subsequent functions are unable to improve the ranges further (only
   tnums). Yet, a better smin32 bound could be learned from the smin
   bound.

__reg32_deduce_bounds is able to improve smin32 from smin, but for that
we need a third call to __reg_deduce_bounds.

As discussed in [1], there may be a better way to organize the deduction
rules to learn the same information with less calls to the same
functions. Such an optimization requires further analysis and is
orthogonal to the present patchset.

Link: https://lore.kernel.org/bpf/aIKtSK9LjQXB8FLY@mail.gmail.com/ [1]
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Co-developed-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Link: https://lore.kernel.org/r/79619d3b42e5525e0e174ed534b75879a5ba15de.1753695655.git.paul.chaignon@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-07-28 10:02:13 -07:00
Paul Chaignon
00bf8d0c6c bpf: Improve bounds when s64 crosses sign boundary
__reg64_deduce_bounds currently improves the s64 range using the u64
range and vice versa, but only if it doesn't cross the sign boundary.

This patch improves __reg64_deduce_bounds to cover the case where the
s64 range crosses the sign boundary but overlaps with the u64 range on
only one end. In that case, we can improve both ranges. Consider the
following example, with the s64 range crossing the sign boundary:

    0                                                   U64_MAX
    |  [xxxxxxxxxxxxxx u64 range xxxxxxxxxxxxxx]              |
    |----------------------------|----------------------------|
    |xxxxx s64 range xxxxxxxxx]                       [xxxxxxx|
    0                     S64_MAX S64_MIN                    -1

The u64 range overlaps only with positive portion of the s64 range. We
can thus derive the following new s64 and u64 ranges.

    0                                                   U64_MAX
    |  [xxxxxx u64 range xxxxx]                               |
    |----------------------------|----------------------------|
    |  [xxxxxx s64 range xxxxx]                               |
    0                     S64_MAX S64_MIN                    -1

The same logic can probably apply to the s32/u32 ranges, but this patch
doesn't implement that change.

In addition to the selftests, the __reg64_deduce_bounds change was
also tested with Agni, the formal verification tool for the range
analysis [1].

Link: https://github.com/bpfverif/agni [1]
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Link: https://lore.kernel.org/r/933bd9ce1f36ded5559f92fdc09e5dbc823fa245.1753695655.git.paul.chaignon@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-07-28 10:02:12 -07:00
Nam Cao
3cfb9c1a7d rv: Fix wrong type cast in reactors_show() and monitor_reactor_show()
Argument 'p' of reactors_show() and monitor_reactor_show() is not a pointer
to struct rv_reactor, it is actually a pointer to the list_head inside
struct rv_reactor. Therefore it's wrong to cast 'p' to struct rv_reactor *.

This wrong type cast has been there since the beginning. But it still
worked because the list_head was the first field in struct rv_reactor_def.
This is no longer true since commit 3d3c376118 ("rv: Merge struct
rv_reactor_def into struct rv_reactor") moved the list_head, and this wrong
type cast became a functional problem.

Properly use container_of() instead.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Gabriele Monaco <gmonaco@redhat.com>
Link: https://lore.kernel.org/b4febbd6844311209e4c8768b65d508b81bd8c9b.1753625621.git.namcao@linutronix.de
Fixes: 3d3c376118 ("rv: Merge struct rv_reactor_def into struct rv_reactor")
Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-28 10:39:34 -04:00
Nam Cao
e82aea50fe rv: Fix wrong type cast in monitors_show()
Argument 'p' of monitors_show() is not a pointer to struct rv_monitor, it
is actually a pointer to the list_head inside struct rv_monitor. Therefore
it is wrong to cast 'p' to struct rv_monitor *.

This wrong type cast has been there since the beginning. But it still
worked because the list_head was the first field in struct rv_monitor_def.
This is no longer true since commit 24cbfe18d5 ("rv: Merge struct
rv_monitor_def into struct rv_monitor") moved the list_head, and this wrong
type cast became a functional problem.

Properly use container_of() instead.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/35e49e97696007919ceacf73796487a2e15a3d02.1753625621.git.namcao@linutronix.de
Fixes: 24cbfe18d5 ("rv: Merge struct rv_monitor_def into struct rv_monitor")
Signed-off-by: Nam Cao <namcao@linutronix.de>
Reviewed-by: Gabriele Monaco <gmonaco@redhat.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-28 10:39:34 -04:00
Paul Chaignon
5345e64760 bpf: Simplify bounds refinement from s32
During the bounds refinement, we improve the precision of various ranges
by looking at other ranges. Among others, we improve the following in
this order (other things happen between 1 and 2):

  1. Improve u32 from s32 in __reg32_deduce_bounds.
  2. Improve s/u64 from u32 in __reg_deduce_mixed_bounds.
  3. Improve s/u64 from s32 in __reg_deduce_mixed_bounds.

In particular, if the s32 range forms a valid u32 range, we will use it
to improve the u32 range in __reg32_deduce_bounds. In
__reg_deduce_mixed_bounds, under the same condition, we will use the s32
range to improve the s/u64 ranges.

If at (1) we were able to learn from s32 to improve u32, we'll then be
able to use that in (2) to improve s/u64. Hence, as (3) happens under
the same precondition as (1), it won't improve s/u64 ranges further than
(1)+(2) did. Thus, we can get rid of (3).

In addition to the extensive suite of selftests for bounds refinement,
this patch was also tested with the Agni formal verification tool [1].

Additionally, Eduard mentioned:

  The argument appears to be as follows:

  Under precondition `(u32)reg->s32_min <= (u32)reg->s32_max`
  __reg32_deduce_bounds produces:

    reg->u32_min = max_t(u32, reg->s32_min, reg->u32_min);
    reg->u32_max = min_t(u32, reg->s32_max, reg->u32_max);

  And then first part of __reg_deduce_mixed_bounds assigns:

    a. reg->umin umax= (reg->umin & ~0xffffffffULL) | max_t(u32, reg->s32_min, reg->u32_min);
    b. reg->umax umin= (reg->umax & ~0xffffffffULL) | min_t(u32, reg->s32_max, reg->u32_max);

  And then second part of __reg_deduce_mixed_bounds assigns:

    c. reg->umin umax= (reg->umin & ~0xffffffffULL) | (u32)reg->s32_min;
    d. reg->umax umin= (reg->umax & ~0xffffffffULL) | (u32)reg->s32_max;

  But assignment (c) is a noop because:

     max_t(u32, reg->s32_min, reg->u32_min) >= (u32)reg->s32_min

  Hence RHS(a) >= RHS(c) and umin= does nothing.

  Also assignment (d) is a noop because:

    min_t(u32, reg->s32_max, reg->u32_max) <= (u32)reg->s32_max

  Hence RHS(b) <= RHS(d) and umin= does nothing.

  Plus the same reasoning for the part dealing with reg->s{min,max}_value:

    e. reg->smin_value smax= (reg->smin_value & ~0xffffffffULL) | max_t(u32, reg->s32_min_value, reg->u32_min_value);
    f. reg->smax_value smin= (reg->smax_value & ~0xffffffffULL) | min_t(u32, reg->s32_max_value, reg->u32_max_value);

      vs

    g. reg->smin_value smax= (reg->smin_value & ~0xffffffffULL) | (u32)reg->s32_min_value;
    h. reg->smax_value smin= (reg->smax_value & ~0xffffffffULL) | (u32)reg->s32_max_value;

      RHS(e) >= RHS(g) and RHS(f) <= RHS(h), hence smax=,smin= do nothing.

  This appears to be correct.

Also, Shung-Hsi:

  Beside going through the reasoning, I also played with CBMC a bit to
  double check that as far as a single run of __reg_deduce_bounds() is
  concerned (and that the register state matches certain handwavy
  expectations), the change indeed still preserve the original behavior.

Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://github.com/bpfverif/agni [1]
Link: https://lore.kernel.org/bpf/aIJwnFnFyUjNsCNa@mail.gmail.com
2025-07-27 19:23:29 +02:00
Linus Torvalds
b711733e89 Merge tag 'timers-urgent-2025-07-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fix from Thomas Gleixner:
 "A single fix for the PTP systemcounter mechanism:

  The rework of this mechanism added a 'use_nsec' member to struct
  system_counterval. get_device_system_crosststamp() instantiates that
  struct on the stack and hands a pointer to the driver callback.

  Only the drivers which set use_nsec to true, initialize that field,
  but all others ignore it. As get_device_system_crosststamp() does not
  initialize the struct, the use_nsec field contains random stack
  content in those cases. That causes a miscalulation usually resulting
  in a failing range check in the best case.

  Initialize the structure before handing it to the drivers to cure
  that"

* tag 'timers-urgent-2025-07-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  timekeeping: Zero initialize system_counterval when querying time from phc drivers
2025-07-27 09:31:32 -07:00
Kees Cook
6676fd3c99 kstack_erase: Add -mgeneral-regs-only to silence Clang warnings
Once CONFIG_KSTACK_ERASE is enabled with Clang on i386, the build warns:

  kernel/kstack_erase.c:168:2: warning: function with attribute 'no_caller_saved_registers' should only call a function with attribute 'no_caller_saved_registers' or be compiled with '-mgeneral-regs-only' [-Wexcessive-regsave]

Add -mgeneral-regs-only for the kstack_erase handler, to make Clang feel
better (it is effectively a no-op flag for the kernel). No binary
changes encountered.

Build & boot tested with Clang 21 on x86_64, and i386.
Build tested with GCC 14.2.0 on x86_64, i386, arm64, and arm.

Reported-by: Nathan Chancellor <nathan@kernel.org>
Closes: https://lore.kernel.org/all/20250726004313.GA3650901@ax162
Tested-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Kees Cook <kees@kernel.org>
2025-07-26 14:28:35 -07:00
Puranjay Mohan
3ba58312e6 bpf: Move bpf_jit_get_prog_name() to core.c
bpf_jit_get_prog_name() will be used by all JITs when enabling support
for private stack. This function is currently implemented in the x86
JIT.

Move the function to core.c so that other JITs can easily use it in
their implementation of private stack.

Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/bpf/20250724120257.7299-2-puranjay@kernel.org
2025-07-26 21:26:51 +02:00
Thomas Weißschuh
b7b3500bd4 umd: Remove usermode driver framework
The code is unused since 98e20e5e13 ("bpfilter: remove bpfilter"),
therefore remove it.

Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Link: https://lore.kernel.org/bpf/20250721-remove-usermode-driver-v1-2-0d0083334382@linutronix.de
2025-07-26 21:03:04 +02:00
Thomas Weißschuh
2b03164eee bpf/preload: Don't select USERMODE_DRIVER
The usermode driver framework is not used anymore by the BPF
preload code.

Fixes: cb80ddc671 ("bpf: Convert bpf_preload.ko to use light skeleton.")
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/bpf/20250721-remove-usermode-driver-v1-1-0d0083334382@linutronix.de
2025-07-26 21:02:48 +02:00
Nam Cao
b8a7fba39c rv: Remove struct rv_monitor::reacting
The field 'reacting' in struct rv_monitor is set but never used. Delete it.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/a6c16f845d2f1a09c4d0934ab83f3cb14478a71d.1753378331.git.namcao@linutronix.de
Reviewed-by: Gabriele Monaco <gmonaco@redhat.com>
Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-25 09:04:14 -04:00
Nam Cao
3d3800b4f7 rv: Remove rv_reactor's reference counter
rv_reactor has a reference counter to ensure it is not removed while
monitors are still using it.

However, this is futile, as __exit functions are not expected to fail and
will proceed normally despite rv_unregister_reactor() returning an error.

At the moment, reactors do not support being built as modules, therefore
they are never removed and the reference counters are not necessary.

If we support building RV reactors as modules in the future, kernel
module's centralized facilities such as try_module_get(), module_put() or
MODULE_SOFTDEP should be used instead of this custom implementation.

Remove this reference counter.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/bb946398436a5e17fb0f5b842ef3313c02291852.1753378331.git.namcao@linutronix.de
Reviewed-by: Gabriele Monaco <gmonaco@redhat.com>
Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-25 09:04:14 -04:00
Nam Cao
3d3c376118 rv: Merge struct rv_reactor_def into struct rv_reactor
Each struct rv_reactor has a unique struct rv_reactor_def associated with
it. struct rv_reactor is statically allocated, while struct rv_reactor_def
is dynamically allocated.

This makes the code more complicated than it should be:

  - Lookup is required to get the associated rv_reactor_def from rv_reactor

  - Dynamic memory allocation is required for rv_reactor_def. This is
    harder to get right compared to static memory. For instance, there is
    an existing mistake: rv_unregister_reactor() does not free the memory
    allocated by rv_register_reactor(). This is fortunately not a real
    memory leak problem as rv_unregister_reactor() is never called.

Simplify and merge rv_reactor_def into rv_reactor.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/71cb91c86cd40df5b8c492b788787f2a73c3eaa3.1753378331.git.namcao@linutronix.de
Reviewed-by: Gabriele Monaco <gmonaco@redhat.com>
Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-25 09:04:14 -04:00
Nam Cao
24cbfe18d5 rv: Merge struct rv_monitor_def into struct rv_monitor
Each struct rv_monitor has a unique struct rv_monitor_def associated with
it. struct rv_monitor is statically allocated, while struct rv_monitor_def
is dynamically allocated.

This makes the code more complicated than it should be:

  - Lookup is required to get the associated rv_monitor_def from rv_monitor

  - Dynamic memory allocation is required for rv_monitor_def. This is
    harder to get right compared to static memory. For instance, there is
    an existing mistake: rv_unregister_monitor() does not free the memory
    allocated by rv_register_monitor(). This is fortunately not a real
    memory leak problem, as rv_unregister_monitor() is never called.

Simplify and merge rv_monitor_def into rv_monitor.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/194449c00f87945c207aab4c96920c75796a4f53.1753378331.git.namcao@linutronix.de
Reviewed-by: Gabriele Monaco <gmonaco@redhat.com>
Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-25 09:04:14 -04:00
Nam Cao
b0c08dd534 rv: Remove unused field in struct rv_monitor_def
rv_monitor_def::task_monitor is not used. Delete it.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/502d94f2696435690a2b1fdbe80a9e56c96fcabf.1753378331.git.namcao@linutronix.de
Reviewed-by: Gabriele Monaco <gmonaco@redhat.com>
Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-25 09:04:13 -04:00
Linus Torvalds
2942242dde Merge tag 'mm-hotfixes-stable-2025-07-24-18-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
 "11 hotfixes. 9 are cc:stable and the remainder address post-6.15
  issues or aren't considered necessary for -stable kernels.

  7 are for MM"

* tag 'mm-hotfixes-stable-2025-07-24-18-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  sprintf.h requires stdarg.h
  resource: fix false warning in __request_region()
  mm/damon/core: commit damos_quota_goal->nid
  kasan: use vmalloc_dump_obj() for vmalloc error reports
  mm/ksm: fix -Wsometimes-uninitialized from clang-21 in advisor_mode_show()
  mm: update MAINTAINERS entry for HMM
  nilfs2: reject invalid file types when reading inodes
  selftests/mm: fix split_huge_page_test for folio_split() tests
  mailmap: add entry for Senozhatsky
  mm/zsmalloc: do not pass __GFP_MOVABLE if CONFIG_COMPACTION=n
  mm/vmscan: fix hwpoisoned large folio handling in shrink_folio_list
2025-07-24 19:13:30 -07:00
Samiullah Khawaja
71c52411c5 net: Create separate gro_flush_normal function
Move multiple copies of same code snippet doing `gro_flush` and
`gro_normal_list` into separate helper function.

Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250723013031.2911384-2-skhawaja@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-24 18:34:55 -07:00
Steven Rostedt
8c4e53a1a0 tracing: Call trace_ftrace_test_filter() for the event
The trace event filter bootup self test tests a bunch of filter logic
against the ftrace_test_filter event, but does not actually call the
event. Work is being done to cause a warning if an event is defined but
not used. To quiet the warning call the trace event under an if statement
where it is disabled so it doesn't get optimized out.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Nicolas Schier <nicolas.schier@linux.dev>
Cc: Nick Desaulniers <nick.desaulniers+lkml@gmail.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/20250723194212.274458858@kernel.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-24 21:33:35 -04:00
Jakub Kicinski
a4f5759b6f Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Martin KaFai Lau says:

====================
pull-request: bpf-next 2025-07-24

We've added 3 non-merge commits during the last 3 day(s) which contain
a total of 4 files changed, 40 insertions(+), 15 deletions(-).

The main changes are:

1) Improved verifier error message for incorrect narrower load from
   pointer field in ctx, from Paul Chaignon.

2) Disabled migration in nf_hook_run_bpf to address a syzbot report,
   from Kuniyuki Iwashima.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next:
  selftests/bpf: Test invalid narrower ctx load
  bpf: Reject narrower access to pointer ctx fields
  bpf: Disable migration in nf_hook_run_bpf().
====================

Link: https://patch.msgid.link/20250724173306.3578483-1-martin.lau@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-24 18:02:24 -07:00
Akinobu Mita
91a229bb7b resource: fix false warning in __request_region()
A warning is raised when __request_region() detects a conflict with a
resource whose resource.desc is IORES_DESC_DEVICE_PRIVATE_MEMORY.

But this warning is only valid for iomem_resources.
The hmem device resource uses resource.desc as the numa node id, which can
cause spurious warnings.

This warning appeared on a machine with multiple cxl memory expanders. 
One of the NUMA node id is 6, which is the same as the value of
IORES_DESC_DEVICE_PRIVATE_MEMORY.

In this environment it was just a spurious warning, but when I saw the
warning I suspected a real problem so it's better to fix it.

This change fixes this by restricting the warning to only iomem_resource.
This also adds a missing new line to the warning message.

Link: https://lkml.kernel.org/r/20250719112604.25500-1-akinobu.mita@gmail.com
Fixes: 7dab174e2e ("dax/hmem: Move hmem device registration to dax_hmem.ko")
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-24 17:57:59 -07:00
Kees Cook
8245d47cfa x86: Handle KCOV __init vs inline mismatches
GCC appears to have kind of fragile inlining heuristics, in the
sense that it can change whether or not it inlines something based on
optimizations. It looks like the kcov instrumentation being added (or in
this case, removed) from a function changes the optimization results,
and some functions marked "inline" are _not_ inlined. In that case,
we end up with __init code calling a function not marked __init, and we
get the build warnings I'm trying to eliminate in the coming patch that
adds __no_sanitize_coverage to __init functions:

WARNING: modpost: vmlinux: section mismatch in reference: xbc_exit+0x8 (section: .text.unlikely) -> _xbc_exit (section: .init.text)
WARNING: modpost: vmlinux: section mismatch in reference: real_mode_size_needed+0x15 (section: .text.unlikely) -> real_mode_blob_end (section: .init.data)
WARNING: modpost: vmlinux: section mismatch in reference: __set_percpu_decrypted+0x16 (section: .text.unlikely) -> early_set_memory_decrypted (section: .init.text)
WARNING: modpost: vmlinux: section mismatch in reference: memblock_alloc_from+0x26 (section: .text.unlikely) -> memblock_alloc_try_nid (section: .init.text)
WARNING: modpost: vmlinux: section mismatch in reference: acpi_arch_set_root_pointer+0xc (section: .text.unlikely) -> x86_init (section: .init.data)
WARNING: modpost: vmlinux: section mismatch in reference: acpi_arch_get_root_pointer+0x8 (section: .text.unlikely) -> x86_init (section: .init.data)
WARNING: modpost: vmlinux: section mismatch in reference: efi_config_table_is_usable+0x16 (section: .text.unlikely) -> xen_efi_config_table_is_usable (section: .init.text)

This problem is somewhat fragile (though using either __always_inline
or __init will deterministically solve it), but we've tripped over
this before with GCC and the solution has usually been to just use
__always_inline and move on.

For x86 this means forcing several functions to be inline with
__always_inline.

Link: https://lore.kernel.org/r/20250724055029.3623499-2-kees@kernel.org
Signed-off-by: Kees Cook <kees@kernel.org>
2025-07-24 16:55:11 -07:00
Jakub Kicinski
8b5a19b4ff Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.16-rc8).

Conflicts:

drivers/net/ethernet/microsoft/mana/gdma_main.c
  9669ddda18 ("net: mana: Fix warnings for missing export.h header inclusion")
  7553911210 ("net: mana: Allocate MSI-X vectors dynamically")
https://lore.kernel.org/20250711130752.23023d98@canb.auug.org.au

Adjacent changes:

drivers/net/ethernet/ti/icssg/icssg_prueth.h
  6e86fb73de ("net: ti: icssg-prueth: Fix buffer allocation for ICSSG")
  ffe8a49091 ("net: ti: icssg-prueth: Read firmware-names from device tree")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-24 11:10:46 -07:00
Gabriele Monaco
58d5f0d437 rv: Return init error when registering monitors
Monitors generated with dot2k have their registration function (the one
called during monitor initialisation) return always 0, even if the
registration failed on RV side.
This can hide potential errors.

Return the value returned by the RV register function.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Tomas Glozar <tglozar@redhat.com>
Cc: Juri Lelli <jlelli@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
Link: https://lore.kernel.org/20250723161240.194860-6-gmonaco@redhat.com
Reviewed-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-24 10:43:46 -04:00
Gabriele Monaco
560473f2e2 verification/rvgen: Organise Kconfig entries for nested monitors
The current behaviour of rvgen when running with the -a option is to
append the necessary lines at the end of the configuration for Kconfig,
Makefile and tracepoints.
This is not always the desired behaviour in case of nested monitors:
while tracepoints are not affected by nesting and the Makefile's only
requirement is that the parent monitor is built before its children, in
the Kconfig it is better to have children defined right after their
parent, otherwise the result has wrong indentation:

[*]   foo_parent monitor
[*]     foo_child1 monitor
[*]     foo_child2 monitor
[*]   bar_parent monitor
[*]     bar_child1 monitor
[*]     bar_child2 monitor
[*]   foo_child3 monitor
[*]   foo_child4 monitor

Adapt rvgen to look for a different marker for nested monitors in the
Kconfig file and append the line right after the last sibling, instead
of the last monitor.
Also add the marker when creating a new parent monitor.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Tomas Glozar <tglozar@redhat.com>
Cc: Juri Lelli <jlelli@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
Link: https://lore.kernel.org/20250723161240.194860-5-gmonaco@redhat.com
Reviewed-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-24 10:43:46 -04:00
Gabriele Monaco
9efcf59082 tools/dot2c: Fix generated files going over 100 column limit
The dot2c.py script generates all states in a single line. This breaks the
100 column limit when the state machines are non-trivial.

Change dot2c.py to generate the states in separate lines in case the
generated line is going to be too long.

Also adapt existing monitors with line length over the limit.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Tomas Glozar <tglozar@redhat.com>
Cc: Juri Lelli <jlelli@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
Link: https://lore.kernel.org/20250723161240.194860-4-gmonaco@redhat.com
Suggested-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-24 10:43:46 -04:00
Steven Rostedt
dabd3e7dcc tracing: Have eprobes handle arrays
eprobes are dynamic events that can read other events using their fields
to create new events. Currently it doesn't work with arrays. When the new
event field is attached to the old event field, it looks at the size of
the field to determine what type of field the new field should be. For 1
byte fields it's a char, for 2 bytes, it's a short and for 4 bytes it's an
integer. For all other sizes it just defaults to "long". This also reads
the contents of the field for such cases.

For arrays that are bigger than the size of long, return the value of the
address of the content itself. This will allow eprobes to read other
values in the array of the old event.

This is useful when raw_syscalls is enabled but the syscall events are
not. The syscall events are created from the raw_syscalls as they have an
array of "args" that holds the 6 long words passed to the syscall entry
point. To read the value of "filename" from sys_openat, the eprobe could
attach to the raw_syscall and read the second value.

It can then even be passed to a synthetic event and converted back to
another eprobe to get the value of "filename" after it has been read by
the kernel during the system call:

 [
   Create an eprobe called "sys" and attach it to sys_enter.
   Read the id of the system call and the second argument
 ]
 # echo 'e:sys raw_syscalls.sys_enter nr=$id:u32 arg2=+8($args):u64' >> /sys/kernel/tracing/dynamic_events

 [
   Create a synthetic event "path" that will hold the address of the
   sys_openat filename. This is on a 64bit machine, so make it 64 bits
 ]
 # echo 's:path u64 file;' >> /sys/kernel/tracing/dynamic_events

 [
   Add a histogram to the eprobe/sys which tiggers if the "nr" field is
   257 (sys_openat), and save the filename in the "file" variable.
 ]
 # echo 'hist:keys=common_pid:file=arg2 if nr == 257' > /sys/kernel/tracing/events/eprobes/sys/trigger

 [
   Attach a histogram to sys_exit event that triggers the "path" synthetic
   event and records the "filename" that was passed from the sys eprobe.
 ]
 # echo 'hist:keys=common_pid:f=$file:onmatch(eprobes.sys).trace(path,$f)' >> /sys/kernel/tracing/events/raw_syscalls/sys_exit/trigger

 [
   Create another eprobe that dereferences the "file" field as a user
   space string and displays it.
 ]
 # echo 'e:open synthetic.path file=+0($file):ustring' >> /sys/kernel/tracing/dynamic_events

 # echo 1 > /sys/kernel/tracing/events/eprobes/open/enable
 # cat trace_pipe
             cat-1142    [003] ...5.   799.521912: open: (synthetic.path) file="/etc/ld.so.cache"
             cat-1142    [003] ...5.   799.521934: open: (synthetic.path) file="/etc/ld.so.cache"
             cat-1142    [003] ...5.   799.522065: open: (synthetic.path) file="/etc/ld.so.cache"
             cat-1142    [003] ...5.   799.522080: open: (synthetic.path) file="/etc/ld.so.cache"
             cat-1142    [003] ...5.   799.522296: open: (synthetic.path) file="/lib/x86_64-linux-gnu/libc.so.6"
             cat-1142    [003] ...5.   799.522319: open: (synthetic.path) file="/lib/x86_64-linux-gnu/libc.so.6"
            less-1143    [005] ...5.   799.522327: open: (synthetic.path) file="/etc/ld.so.cache"
             cat-1142    [003] ...5.   799.522333: open: (synthetic.path) file="/lib/x86_64-linux-gnu/libc.so.6"
             cat-1142    [003] ...5.   799.522348: open: (synthetic.path) file="/lib/x86_64-linux-gnu/libc.so.6"
            less-1143    [005] ...5.   799.522349: open: (synthetic.path) file="/etc/ld.so.cache"
             cat-1142    [003] ...5.   799.522363: open: (synthetic.path) file="/lib/x86_64-linux-gnu/libc.so.6"
            less-1143    [005] ...5.   799.522477: open: (synthetic.path) file="/etc/ld.so.cache"
             cat-1142    [003] ...5.   799.522489: open: (synthetic.path) file="/lib/x86_64-linux-gnu/libc.so.6"
            less-1143    [005] ...5.   799.522492: open: (synthetic.path) file="/etc/ld.so.cache"
            less-1143    [005] ...5.   799.522720: open: (synthetic.path) file="/lib/x86_64-linux-gnu/libtinfo.so.6"
            less-1143    [005] ...5.   799.522744: open: (synthetic.path) file="/lib/x86_64-linux-gnu/libtinfo.so.6"
            less-1143    [005] ...5.   799.522759: open: (synthetic.path) file="/lib/x86_64-linux-gnu/libtinfo.so.6"
             cat-1142    [003] ...5.   799.522850: open: (synthetic.path) file="/lib/x86_64-linux-gnu/libc.so.6"

Link: https://lore.kernel.org/all/20250723124202.4f7475be@batman.local.home/

Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2025-07-24 22:57:32 +09:00
Paul Chaignon
e09299225d bpf: Reject narrower access to pointer ctx fields
The following BPF program, simplified from a syzkaller repro, causes a
kernel warning:

    r0 = *(u8 *)(r1 + 169);
    exit;

With pointer field sk being at offset 168 in __sk_buff. This access is
detected as a narrower read in bpf_skb_is_valid_access because it
doesn't match offsetof(struct __sk_buff, sk). It is therefore allowed
and later proceeds to bpf_convert_ctx_access. Note that for the
"is_narrower_load" case in the convert_ctx_accesses(), the insn->off
is aligned, so the cnt may not be 0 because it matches the
offsetof(struct __sk_buff, sk) in the bpf_convert_ctx_access. However,
the target_size stays 0 and the verifier errors with a kernel warning:

    verifier bug: error during ctx access conversion(1)

This patch fixes that to return a proper "invalid bpf_context access
off=X size=Y" error on the load instruction.

The same issue affects multiple other fields in context structures that
allow narrow access. Some other non-affected fields (for sk_msg,
sk_lookup, and sockopt) were also changed to use bpf_ctx_range_ptr for
consistency.

Note this syzkaller crash was reported in the "Closes" link below, which
used to be about a different bug, fixed in
commit fce7bd8e38 ("bpf/verifier: Handle BPF_LOAD_ACQ instructions
in insn_def_regno()"). Because syzbot somehow confused the two bugs,
the new crash and repro didn't get reported to the mailing list.

Fixes: f96da09473 ("bpf: simplify narrower ctx access")
Fixes: 0df1a55afa ("bpf: Warn on internal verifier errors")
Reported-by: syzbot+0ef84a7bdf5301d4cbec@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=0ef84a7bdf5301d4cbec
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://patch.msgid.link/3b8dcee67ff4296903351a974ddd9c4dca768b64.1753194596.git.paul.chaignon@gmail.com
2025-07-23 19:33:49 -07:00