Commit Graph

50479 Commits

Author SHA1 Message Date
Linus Torvalds
08df88fa14 Merge tag 'v7.0-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto update from Herbert Xu:
 "API:
   - Fix race condition in hwrng core by using RCU

  Algorithms:
   - Allow authenc(sha224,rfc3686) in fips mode
   - Add test vectors for authenc(hmac(sha384),cbc(aes))
   - Add test vectors for authenc(hmac(sha224),cbc(aes))
   - Add test vectors for authenc(hmac(md5),cbc(des3_ede))
   - Add lz4 support in hisi_zip
   - Only allow clear key use during self-test in s390/{phmac,paes}

  Drivers:
   - Set rng quality to 900 in airoha
   - Add gcm(aes) support for AMD/Xilinx Versal device
   - Allow tfms to share device in hisilicon/trng"

* tag 'v7.0-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (100 commits)
  crypto: img-hash - Use unregister_ahashes in img_{un}register_algs
  crypto: testmgr - Add test vectors for authenc(hmac(md5),cbc(des3_ede))
  crypto: cesa - Simplify return statement in mv_cesa_dequeue_req_locked
  crypto: testmgr - Add test vectors for authenc(hmac(sha224),cbc(aes))
  crypto: testmgr - Add test vectors for authenc(hmac(sha384),cbc(aes))
  hwrng: core - use RCU and work_struct to fix race condition
  crypto: starfive - Fix memory leak in starfive_aes_aead_do_one_req()
  crypto: xilinx - Fix inconsistant indentation
  crypto: rng - Use unregister_rngs in register_rngs
  crypto: atmel - Use unregister_{aeads,ahashes,skciphers}
  hwrng: optee - simplify OP-TEE context match
  crypto: ccp - Add sysfs attribute for boot integrity
  dt-bindings: crypto: atmel,at91sam9g46-sha: add microchip,lan9691-sha
  dt-bindings: crypto: atmel,at91sam9g46-aes: add microchip,lan9691-aes
  dt-bindings: crypto: qcom,inline-crypto-engine: document the Milos ICE
  crypto: caam - fix netdev memory leak in dpaa2_caam_probe
  crypto: hisilicon/qm - increase wait time for mailbox
  crypto: hisilicon/qm - obtain the mailbox configuration at one time
  crypto: hisilicon/qm - remove unnecessary code in qm_mb_write()
  crypto: hisilicon/qm - move the barrier before writing to the mailbox register
  ...
2026-02-10 08:36:42 -08:00
Linus Torvalds
d16738a4e7 Merge tag 'kthread-for-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks
Pull kthread updates from Frederic Weisbecker:
 "The kthread code provides an infrastructure which manages the
  preferred affinity of unbound kthreads (node or custom cpumask)
  against housekeeping (CPU isolation) constraints and CPU hotplug
  events.

  One crucial missing piece is the handling of cpuset: when an isolated
  partition is created, deleted, or its CPUs updated, all the unbound
  kthreads in the top cpuset become indifferently affine to _all_ the
  non-isolated CPUs, possibly breaking their preferred affinity along
  the way.

  Solve this with performing the kthreads affinity update from cpuset to
  the kthreads consolidated relevant code instead so that preferred
  affinities are honoured and applied against the updated cpuset
  isolated partitions.

  The dispatch of the new isolated cpumasks to timers, workqueues and
  kthreads is performed by housekeeping, as per the nice Tejun's
  suggestion.

  As a welcome side effect, HK_TYPE_DOMAIN then integrates both the set
  from boot defined domain isolation (through isolcpus=) and cpuset
  isolated partitions. Housekeeping cpumasks are now modifiable with a
  specific RCU based synchronization. A big step toward making
  nohz_full= also mutable through cpuset in the future"

* tag 'kthread-for-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks: (33 commits)
  doc: Add housekeeping documentation
  kthread: Document kthread_affine_preferred()
  kthread: Comment on the purpose and placement of kthread_affine_node() call
  kthread: Honour kthreads preferred affinity after cpuset changes
  sched/arm64: Move fallback task cpumask to HK_TYPE_DOMAIN
  sched: Switch the fallback task allowed cpumask to HK_TYPE_DOMAIN
  kthread: Rely on HK_TYPE_DOMAIN for preferred affinity management
  kthread: Include kthreadd to the managed affinity list
  kthread: Include unbound kthreads in the managed affinity list
  kthread: Refine naming of affinity related fields
  PCI: Remove superfluous HK_TYPE_WQ check
  sched/isolation: Remove HK_TYPE_TICK test from cpu_is_isolated()
  cpuset: Remove cpuset_cpu_is_isolated()
  timers/migration: Remove superfluous cpuset isolation test
  cpuset: Propagate cpuset isolation update to timers through housekeeping
  cpuset: Propagate cpuset isolation update to workqueue through housekeeping
  PCI: Flush PCI probe workqueue on cpuset isolated partition change
  sched/isolation: Flush vmstat workqueues on cpuset isolated partition change
  sched/isolation: Flush memcg workqueues on cpuset isolated partition change
  cpuset: Update HK_TYPE_DOMAIN cpumask from cpuset
  ...
2026-02-09 19:57:30 -08:00
Linus Torvalds
9b1b3dcd28 Merge tag 'pm-6.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management updates from Rafael Wysocki:
 "By the number of commits, cpufreq is the leading party (again) and the
  most visible change there is the removal of the omap-cpufreq driver
  that has not been used for a long time (good riddance). There are also
  quite a few changes in the cppc_cpufreq driver, mostly related to
  fixing its frequency invariance engine in the case when the CPPC
  registers used by it are not in PCC. In addition to that, support for
  AM62L3 is added to the ti-cpufreq driver and the cpufreq-dt-platdev
  list is updated for some platforms. The remaining cpufreq changes are
  assorted fixes and cleanups.

  Next up is cpuidle and the changes there are dominated by intel_idle
  driver updates, mostly related to the new command line facility
  allowing users to adjust the list of C-states used by the driver.
  There are also a few updates of cpuidle governors, including two menu
  governor fixes and some refinements of the teo governor, and a
  MAINTAINERS update adding Christian Loehle as a cpuidle reviewer.
  [Thanks for stepping up Christian!]

  The most significant update related to system suspend and hibernation
  is the one to stop freezing the PM runtime workqueue during system PM
  transitions which allows some deadlocks to be avoided. There is also a
  fix for possible concurrent bit field updates in the core device
  suspend code and a few other minor fixes.

  Apart from the above, several drivers are updated to discard the
  return value of pm_runtime_put() which is going to be converted to a
  void function as soon as everybody stops using its return value, PL4
  support for Ice Lake is added to the Intel RAPL power capping driver,
  and there are assorted cleanups, documentation fixes, and some
  cpupower utility improvements.

  Specifics:

   - Remove the unused omap-cpufreq driver (Andreas Kemnade)

   - Optimize error handling code in cpufreq_boost_trigger_state() and
     make cpufreq_boost_trigger_state() return -EOPNOTSUPP if no policy
     supports boost (Lifeng Zheng)

   - Update cpufreq-dt-platdev list for tegra, qcom, TI (Aaron Kling,
     Dhruva Gole, and Konrad Dybcio)

   - Minor improvements to the cpufreq and cpumask rust implementation
     (Alexandre Courbot, Alice Ryhl, Tamir Duberstein, and Yilin Chen)

   - Add support for AM62L3 SoC to the ti-cpufreq driver (Dhruva Gole)

   - Update arch_freq_scale in the CPPC cpufreq driver's frequency
     invariance engine (FIE) in scheduler ticks if the related CPPC
     registers are not in PCC (Jie Zhan)

   - Assorted minor cleanups and improvements in ARM cpufreq drivers
     (Juan Martinez, Felix Gu, Luca Weiss, and Sergey Shtylyov)

   - Add generic helpers for sysfs show/store to cppc_cpufreq (Sumit
     Gupta)

   - Make the scaling_setspeed cpufreq sysfs attribute return the actual
     requested frequency to avoid confusion (Pengjie Zhang)

   - Simplify the idle CPU time granularity test in the ondemand cpufreq
     governor (Frederic Weisbecker)

   - Enable asym capacity in intel_pstate only when CPU SMT is not
     possible (Yaxiong Tian)

   - Update the description of rate_limit_us default value in cpufreq
     documentation (Yaxiong Tian)

   - Add a command line option to adjust the C-states table in the
     intel_idle driver, remove the 'preferred_cstates' module parameter
     from it, add C-states validation to it and clean it up (Artem
     Bityutskiy)

   - Make the menu cpuidle governor always check the time till the
     closest timer event when the scheduler tick has been stopped to
     prevent it from mistakenly selecting the deepest available idle
     state (Rafael Wysocki)

   - Update the teo cpuidle governor to avoid making suboptimal
     decisions in certain corner cases and generally improve idle state
     selection accuracy (Rafael Wysocki)

   - Remove an unlikely() annotation on the early-return condition in
     menu_select() that leads to branch misprediction 100% of the time
     on systems with only 1 idle state enabled, like ARM64 servers
     (Breno Leitao)

   - Add Christian Loehle to MAINTAINERS as a cpuidle reviewer
     (Christian Loehle)

   - Stop flagging the PM runtime workqueue as freezable to avoid system
     suspend and resume deadlocks in subsystems that assume asynchronous
     runtime PM to work during system-wide PM transitions (Rafael
     Wysocki)

   - Drop redundant NULL pointer checks before acomp_request_free() from
     the hibernation code handling image saving (Rafael Wysocki)

   - Update wakeup_sources_walk_start() to handle empty lists of wakeup
     sources as appropriate (Samuel Wu)

   - Make dev_pm_clear_wake_irq() check the power.wakeirq value under
     power.lock to avoid race conditions (Gui-Dong Han)

   - Avoid bit field races related to power.work_in_progress in the core
     device suspend code (Xuewen Yan)

   - Make several drivers discard pm_runtime_put() return value in
     preparation for converting that function to a void one (Rafael
     Wysocki)

   - Add PL4 support for Ice Lake to the Intel RAPL power capping driver
     (Daniel Tang)

   - Replace sprintf() with sysfs_emit() in power capping sysfs show
     functions (Sumeet Pawnikar)

   - Make dev_pm_opp_get_level() return value match the documentation
     after a previous update of the latter (Aleks Todorov)

   - Use scoped for each OF child loop in the OPP code (Krzysztof
     Kozlowski)

   - Fix a bug in an example code snippet and correct typos in the
     energy model management documentation (Patrick Little)

   - Fix miscellaneous problems in cpupower (Kaushlendra Kumar):
      * idle_monitor: Fix incorrect value logged after stop
      * Fix inverted APERF capability check
      * Use strcspn() to strip trailing newline
      * Reset errno before strtoull()
      * Show C0 in idle-info dump

   - Improve cpupower installation procedure by making the systemd step
     optional and allowing users to disable the installation of
     systemd's unit file (João Marcos Costa)"

* tag 'pm-6.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (65 commits)
  PM: sleep: core: Avoid bit field races related to work_in_progress
  PM: sleep: wakeirq: harden dev_pm_clear_wake_irq() against races
  cpufreq: Documentation: Update description of rate_limit_us default value
  cpufreq: intel_pstate: Enable asym capacity only when CPU SMT is not possible
  PM: wakeup: Handle empty list in wakeup_sources_walk_start()
  PM: EM: Documentation: Fix bug in example code snippet
  Documentation: Fix typos in energy model documentation
  cpuidle: governors: teo: Refine intercepts-based idle state lookup
  cpuidle: governors: teo: Adjust the classification of wakeup events
  cpufreq: ondemand: Simplify idle cputime granularity test
  cpufreq: userspace: make scaling_setspeed return the actual requested frequency
  PM: hibernate: Drop NULL pointer checks before acomp_request_free()
  cpufreq: CPPC: Add generic helpers for sysfs show/store
  cpufreq: scmi: Fix device_node reference leak in scmi_cpu_domain_id()
  cpufreq: ti-cpufreq: add support for AM62L3 SoC
  cpufreq: dt-platdev: Add ti,am62l3 to blocklist
  cpufreq/amd-pstate: Add comment explaining nominal_perf usage for performance policy
  cpufreq: scmi: correct SCMI explanation
  cpufreq: dt-platdev: Block the driver from probing on more QC platforms
  rust: cpumask: rename methods of Cpumask for clarity and consistency
  ...
2026-02-09 19:00:42 -08:00
Linus Torvalds
d84e173311 Merge tag 'acpi-6.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull ACPI updates from Rafael Wysocki:
 "This one is significantly larger than previous ACPI support pull
  requests because several significant updates have coincided in it.

  First, there is a routine ACPICA code update, to upstream version
  20251212, but this time it covers new ACPI 6.6 material that has not
  been covered yet. Among other things, it includes definitions of a few
  new ACPI tables and updates of some others, like the GICv5 MADT
  structures and ARM IORT IWB node definitions that are used for adding
  GICv5 ACPI probing on ARM (that technically is IRQ subsystem material,
  but it depends on the ACPICA changes, so it is included here). The
  latter alone adds a few hundred lines of new code.

  Second, there is an update of ACPI _OSC handling including a fix that
  prevents failures from occurring in some corner cases due to careless
  handling of _OSC error bits.

  On top of that, the "system resource" ACPI device objects with the
  PNP0C01 and PNP0C02 are now going to be handled by the ACPI core
  device enumeration code instead of handing them over to the legacy PNP
  system driver which causes device enumeration issues to occur. Some of
  those issues have been worked around in device drivers and elsewhere
  and those workarounds should not be necessary any more, so they are
  going away.

  Moreover, the time has come to convert all "core ACPI" device drivers
  that were still using struct acpi_driver objects for device binding
  into proper platform drivers that use struct platform_driver for this
  purpose. These updates are accompanied by some requisite core ACPI
  device enumeration code changes.

  Next, there are ACPI APEI updates, including changes to avoid excess
  overhead in the NMI handler and in SEA on the ARM side, changes to
  unify ACPI-based HW error tracing and logging, and changes to prevent
  APEI code from reaching out of its allocated memory.

  There are also some ACPI power management updates, mostly related to
  the ACPI cpuidle support in the processor driver, suspend-to-idle
  handling on systems with ACPI support and to ACPI PM of devices.

  In addition to the above, bugs are fixed and the code is cleaned up in
  assorted places all over.

  Specifics:

   - Update the ACPICA code in the kernel to upstream version 20251212
     which includes the following changes:
      * Add support for new ACPI table DTPR (Michal Camacho Romero)
      * Release objects with acpi_ut_delete_object_desc() (Zilin Guan)
      * Add UUIDs for Microsoft fan extensions and UUIDs associated with
        TPM 2.0 devices (Armin Wolf)
      * Fix NULL pointer dereference in acpi_ev_address_space_dispatch()
        (Alexey Simakov)
      * Add KEYP ACPI table definition (Dave Jiang)
      * Add support for the Microsoft display mux _OSI string (Armin
        Wolf)
      * Add definitions for the IOVT ACPI table (Xianglai Li)
      * Abort AML bytecode execution on AML_FATAL_OP (Armin Wolf)
      * Include all fields in subtable type1 for PPTT (Ben Horgan)
      * Add GICv5 MADT structures and Arm IORT IWB node definitions
        (Jose Marinho)
      * Update Parameter Block structure for RAS2 and add a new flag in
        Memory Affinity Structure for SRAT (Pawel Chmielewski)
      * Add _VDM (Voltage Domain) object (Pawel Chmielewski)

   - Add support for GICv5 ACPI probing on ARM which is based on the
     GICv5 MADT structures and ARM IORT IWB node definitions recently
     added to ACPICA (Lorenzo Pieralisi)

   - Rework ACPI PM notification setup for PCI root buses and modify the
     ACPI PM setup for devices to register wakeup source objects under
     physical (that is, PCI, platform, etc.) devices instead of doing
     that under their ACPI companions (Rafael Wysocki)

   - Adjust debug messages regarding postponed ACPI PM printed during
     system resume to be more accurate (Rafael Wysocki)

   - Remove dead code from lps0_device_attach() (Gergo Koteles)

   - Start to invoke Microsoft Function 9 (Turn On Display) of the Low-
     Power S0 Idle (LPS0) _DSM in the suspend-to-idle resume flow on
     systems with ACPI LPS0 support to address a functional issue on
     Lenovo Yoga Slim 7i Aura (15ILL9), where system fans and keyboard
     backlights fail to resume after suspend (Jakob Riemenschneider)

   - Add sysfs attribute cid for exposing _CID lists under ACPI device
     objects (Rafael Wysocki)

   - Replace sprintf() with sysfs_emit() in all of the core ACPI sysfs
     interface code (Sumeet Pawnikar)

   - Use acpi_get_local_u64_address() in the code implementing ACPI
     support for PCI to evaluate _ADR instead of evaluating that object
     directly (Andy Shevchenko)

   - Add JWIPC JVC9100 to irq1_level_low_skip_override[] to unbreak
     serial IRQs on that system (Ai Chao)

   - Fix handling of _OSC errors in acpi_run_osc() to avoid failures on
     systems where _OSC error bits are set even though the _OSC return
     buffer contains acknowledged feature bits (Rafael Wysocki)

   - Clean up and rearrange \_SB._OSC handling for general platform
     features and USB4 features to avoid code duplication and
     unnecessary memory management overhead (Rafael Wysocki)

   - Make the ACPI core device enumeration code handle PNP0C01 and
     PNP0C02 ("system resource") device objects directly instead of
     letting the legacy PNP system driver handle them to avoid device
     enumeration issues on systems where PNP0C02 is present in the _CID
     list under ACPI device objects with a _HID matching a proper device
     driver in Linux (Rafael Wysocki)

   - Drop workarounds for the known device enumeration issues related to
     _CID lists containing PNP0C02 (Rafael Wysocki)

   - Drop outdated comment regarding removed function in the ACPI-based
     device enumeration code (Julia Lawall)

   - Make PRP0001 device matching work as expected for ACPI device
     objects using it as a _HID for board development and similar
     purposes (Kartik Rajput)

   - Use async schedule function in acpi_scan_clear_dep_fn() to avoid
     races with user space initialization on some systems (Yicong Yang)

   - Add a piece of documentation explaining why binding drivers
     directly to ACPI device objects is not a good idea in general and
     why it is desirable to convert drivers doing so into proper
     platform drivers that use struct platform_driver for device binding
     (Rafael Wysocki)

   - Convert multiple "core ACPI" drivers, including the NFIT ACPI
     device driver, the generic ACPI button drivers, the generic ACPI
     thermal zone driver, the ACPI hardware event device (HED) driver,
     the ACPI EC driver, the ACPI SMBUS HC driver, the ACPI Smart
     Battery Subsystem (SBS) driver, and the ACPI backlight (video)
     driver to proper platform drivers that use struct platform_driver
     for device binding (Rafael Wysocki)

   - Use acpi_get_local_u64_address() in the ACPI backlight (video)
     driver to evaluate _ADR instead of evaluating that object directly
     (Andy Shevchenko)

   - Convert the generic ACPI battery driver to a proper platform driver
     using struct platform_driver for device binding (Rafael Wysocki)

   - Fix incorrect charging status when current is zero in the generic
     ACPI battery driver (Ata İlhan Köktürk)

   - Use LIST_HEAD() for initializing a stack-allocated list in the
     generic ACPI watchdog device driver (Can Peng)

   - Rework the ACPI idle driver initialization to register it directly
     from the common initialization code instead of doing that from a
     CPU hotplug "online" callback and clean it up (Huisong Li, Rafael
     Wysocki)

   - Fix a possible NULL pointer dereference in
     acpi_processor_errata_piix4() (Tuo Li)

   - Make read-only array non_mmio_desc[] static const (Colin Ian King)

   - Prevent the APEI GHES support code on ARM from accessing memory out
     of bounds or going past the ARM processor CPER record buffer (Mauro
     Carvalho Chehab)

   - Prevent cper_print_fw_err() from dumping the entire memory on
     systems with defective firmware (Mauro Carvalho Chehab)

   - Improve ghes_notify_nmi() status check to avoid unnecessary
     overhead in the NMI handler by carrying out all of the requisite
     preparations and the NMI registration time (Tony Luck)

   - Refactor the GHES driver by extracting common functionality into
     reusable helper functions to reduce code duplication and improve
     the ghes_notify_sea() status check in analogy with the previous
     ghes_notify_nmi() status check improvement (Shuai Xue)

   - Make ELOG and GHES log and trace consistently and support the CPER
     CXL protocol analogously (Fabio De Francesco)

   - Disable KASAN instrumentation in the APEI GHES driver when compile
     testing with clang < 18 (Nathan Chancellor)

   - Let ghes_edac be the preferred driver to load on __ZX__ and _BYO_
     systems by extending the platform detection list in the APEI GHES
     driver (Tony W Wang-oc)

   - Clean up cppc_perf_caps and cppc_perf_ctrls structs and rename EPP
     constants for clarity in the ACPI CPPC library (Sumit Gupta)"

* tag 'acpi-6.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (117 commits)
  ACPI: battery: fix incorrect charging status when current is zero
  ACPI: scan: Use async schedule function in acpi_scan_clear_dep_fn()
  ACPI: x86: s2idle: Invoke Microsoft _DSM Function 9 (Turn On Display)
  ACPI: APEI: GHES: Add ghes_edac support for __ZX__ and _BYO_ systems
  ACPI: APEI: GHES: Disable KASAN instrumentation when compile testing with clang < 18
  ACPI: sysfs: Replace sprintf() with sysfs_emit()
  ACPI: CPPC: Rename EPP constants for clarity
  ACPI: CPPC: Clean up cppc_perf_caps and cppc_perf_ctrls structs
  ACPI: processor: idle: Rework the handling of acpi_processor_ffh_lpi_probe()
  ACPI: processor: idle: Convert acpi_processor_setup_cpuidle_dev() to void
  ACPI: processor: idle: Convert acpi_processor_setup_cpuidle_states() to void
  irqchip/gic-v5: Add ACPI IWB probing
  irqchip/gic-v5: Add ACPI ITS probing
  irqchip/gic-v5: Add ACPI IRS probing
  irqchip/gic-v5: Split IRS probing into OF and generic portions
  PCI/MSI: Make the pci_msi_map_rid_ctlr_node() interface firmware agnostic
  irqdomain: Add parent field to struct irqchip_fwid
  ACPI: PCI: simplify code with acpi_get_local_u64_address()
  ACPI: video: simplify code with acpi_get_local_u64_address()
  ACPI: PM: Adjust messages regarding postponed ACPI PM
  ...
2026-02-09 18:42:47 -08:00
Linus Torvalds
0c00ed308d Merge tag 'for-7.0/block-20260206' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull block updates from Jens Axboe:

 - Support for batch request processing for ublk, improving the
   efficiency of the kernel/ublk server communication. This can yield
   nice 7-12% performance improvements

 - Support for integrity data for ublk

 - Various other ublk improvements and additions, including a ton of
   selftests additions and updated

 - Move the handling of blk-crypto software fallback from below the
   block layer to above it. This reduces the complexity of dealing with
   bio splitting

 - Series fixing a number of potential deadlocks in blk-mq related to
   the queue usage counter and writeback throttling and rq-qos debugfs
   handling

 - Add an async_depth queue attribute, to resolve a performance
   regression that's been around for a qhilw related to the scheduler
   depth handling

 - Only use task_work for IOPOLL completions on NVMe, if it is necessary
   to do so. An earlier fix for an issue resulted in all these
   completions being punted to task_work, to guarantee that completions
   were only run for a given io_uring ring when it was local to that
   ring. With the new changes, we can detect if it's necessary to use
   task_work or not, and avoid it if possible.

 - rnbd fixes:
      - Fix refcount underflow in device unmap path
      - Handle PREFLUSH and NOUNMAP flags properly in protocol
      - Fix server-side bi_size for special IOs
      - Zero response buffer before use
      - Fix trace format for flags
      - Add .release to rnbd_dev_ktype

 - MD pull requests via Yu Kuai
      - Fix raid5_run() to return error when log_init() fails
      - Fix IO hang with degraded array with llbitmap
      - Fix percpu_ref not resurrected on suspend timeout in llbitmap
      - Fix GPF in write_page caused by resize race
      - Fix NULL pointer dereference in process_metadata_update
      - Fix hang when stopping arrays with metadata through dm-raid
      - Fix any_working flag handling in raid10_sync_request
      - Refactor sync/recovery code path, improve error handling for
        badblocks, and remove unused recovery_disabled field
      - Consolidate mddev boolean fields into mddev_flags
      - Use mempool to allocate stripe_request_ctx and make sure
        max_sectors is not less than io_opt in raid5
      - Fix return value of mddev_trylock
      - Fix memory leak in raid1_run()
      - Add Li Nan as mdraid reviewer

 - Move phys_vec definitions to the kernel types, mostly in preparation
   for some VFIO and RDMA changes

 - Improve the speed for secure erase for some devices

 - Various little rust updates

 - Various other minor fixes, improvements, and cleanups

* tag 'for-7.0/block-20260206' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (162 commits)
  blk-mq: ABI/sysfs-block: fix docs build warnings
  selftests: ublk: organize test directories by test ID
  block: decouple secure erase size limit from discard size limit
  block: remove redundant kill_bdev() call in set_blocksize()
  blk-mq: add documentation for new queue attribute async_dpeth
  block, bfq: convert to use request_queue->async_depth
  mq-deadline: covert to use request_queue->async_depth
  kyber: covert to use request_queue->async_depth
  blk-mq: add a new queue sysfs attribute async_depth
  blk-mq: factor out a helper blk_mq_limit_depth()
  blk-mq-sched: unify elevators checking for async requests
  block: convert nr_requests to unsigned int
  block: don't use strcpy to copy blockdev name
  blk-mq-debugfs: warn about possible deadlock
  blk-mq-debugfs: add missing debugfs_mutex in blk_mq_debugfs_register_hctxs()
  blk-mq-debugfs: remove blk_mq_debugfs_unregister_rqos()
  blk-mq-debugfs: make blk_mq_debugfs_register_rqos() static
  blk-rq-qos: fix possible debugfs_mutex deadlock
  blk-mq-debugfs: factor out a helper to register debugfs for all rq_qos
  blk-wbt: fix possible deadlock to nest pcpu_alloc_mutex under q_usage_counter
  ...
2026-02-09 17:57:21 -08:00
Linus Torvalds
591beb0e3a Merge tag 'io_uring-bpf-restrictions.4-20260206' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull io_uring bpf filters from Jens Axboe:
 "This adds support for both cBPF filters for io_uring, as well as task
  inherited restrictions and filters.

  seccomp and io_uring don't play along nicely, as most of the
  interesting data to filter on resides somewhat out-of-band, in the
  submission queue ring.

  As a result, things like containers and systemd that apply seccomp
  filters, can't filter io_uring operations.

  That leaves them with just one choice if filtering is critical -
  filter the actual io_uring_setup(2) system call to simply disallow
  io_uring. That's rather unfortunate, and has limited us because of it.

  io_uring already has some filtering support. It requires the ring to
  be setup in a disabled state, and then a filter set can be applied.
  This filter set is completely bi-modal - an opcode is either enabled
  or it's not. Once a filter set is registered, the ring can be enabled.
  This is very restrictive, and it's not useful at all to systemd or
  containers which really want both broader and more specific control.

  This first adds support for cBPF filters for opcodes, which enables
  tighter control over what exactly a specific opcode may do. As
  examples, specific support is added for IORING_OP_OPENAT/OPENAT2,
  allowing filtering on resolve flags. And another example is added for
  IORING_OP_SOCKET, allowing filtering on domain/type/protocol. These
  are both common use cases. cBPF was chosen rather than eBPF, because
  the latter is often restricted in containers as well.

  These filters are run post the init phase of the request, which allows
  filters to even dip into data that is being passed in struct in user
  memory, as the init side of requests make that data stable by bringing
  it into the kernel. This allows filtering without needing to copy this
  data twice, or have filters etc know about the exact layout of the
  user data. The filters get the already copied and sanitized data
  passed.

  On top of that support is added for per-task filters, meaning that any
  ring created with a task that has a per-task filter will get those
  filters applied when it's created. These filters are inherited across
  fork as well. Once a filter has been registered, any further added
  filters may only further restrict what operations are permitted.

  Filters cannot change the return value of an operation, they can only
  permit or deny it based on the contents"

* tag 'io_uring-bpf-restrictions.4-20260206' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
  io_uring: allow registration of per-task restrictions
  io_uring: add task fork hook
  io_uring/bpf_filter: add ref counts to struct io_bpf_filter
  io_uring/bpf_filter: cache lookup table in ctx->bpf_filters
  io_uring/bpf_filter: allow filtering on contents of struct open_how
  io_uring/net: allow filtering on IORING_OP_SOCKET data
  io_uring: add support for BPF filtering for opcode restrictions
2026-02-09 17:31:17 -08:00
Linus Torvalds
26c9342bb7 Merge tag 'pull-filename' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs 'struct filename' updates from Al Viro:
 "[Mostly] sanitize struct filename handling"

* tag 'pull-filename' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (68 commits)
  sysfs(2): fs_index() argument is _not_ a pathname
  alpha: switch osf_mount() to strndup_user()
  ksmbd: use CLASS(filename_kernel)
  mqueue: switch to CLASS(filename)
  user_statfs(): switch to CLASS(filename)
  statx: switch to CLASS(filename_maybe_null)
  quotactl_block(): switch to CLASS(filename)
  chroot(2): switch to CLASS(filename)
  move_mount(2): switch to CLASS(filename_maybe_null)
  namei.c: switch user pathname imports to CLASS(filename{,_flags})
  namei.c: convert getname_kernel() callers to CLASS(filename_kernel)
  do_f{chmod,chown,access}at(): use CLASS(filename_uflags)
  do_readlinkat(): switch to CLASS(filename_flags)
  do_sys_truncate(): switch to CLASS(filename)
  do_utimes_path(): switch to CLASS(filename_uflags)
  chdir(2): unspaghettify a bit...
  do_fchownat(): unspaghettify a bit...
  fspick(2): use CLASS(filename_flags)
  name_to_handle_at(): use CLASS(filename_uflags)
  vfs_open_tree(): use CLASS(filename_uflags)
  ...
2026-02-09 16:58:28 -08:00
Linus Torvalds
9e355113f0 Merge tag 'vfs-7.0-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull misc vfs updates from Christian Brauner:
 "This contains a mix of VFS cleanups, performance improvements, API
  fixes, documentation, and a deprecation notice.

  Scalability and performance:

   - Rework pid allocation to only take pidmap_lock once instead of
     twice during alloc_pid(), improving thread creation/teardown
     throughput by 10-16% depending on false-sharing luck. Pad the
     namespace refcount to reduce false-sharing

   - Track file lock presence via a flag in ->i_opflags instead of
     reading ->i_flctx, avoiding false-sharing with ->i_readcount on
     open/close hot paths. Measured 4-16% improvement on 24-core
     open-in-a-loop benchmarks

   - Use a consume fence in locks_inode_context() to match the
     store-release/load-consume idiom, eliminating a hardware fence on
     some architectures

   - Annotate cdev_lock with __cacheline_aligned_in_smp to prevent
     false-sharing

   - Remove a redundant DCACHE_MANAGED_DENTRY check in
     __follow_mount_rcu() that never fires since the caller already
     verifies it, eliminating a 100% mispredicted branch

   - Fix a 100% mispredicted likely() in devcgroup_inode_permission()
     that became wrong after a prior code reorder

  Bug fixes and correctness:

   - Make insert_inode_locked() wait for inode destruction instead of
     skipping, fixing a corner case where two matching inodes could
     exist in the hash

   - Move f_mode initialization before file_ref_init() in alloc_file()
     to respect the SLAB_TYPESAFE_BY_RCU ordering contract

   - Add a WARN_ON_ONCE guard in try_to_free_buffers() for folios with
     no buffers attached, preventing a null pointer dereference when
     AS_RELEASE_ALWAYS is set but no release_folio op exists

   - Fix select restart_block to store end_time as timespec64, avoiding
     truncation of tv_sec on 32-bit architectures

   - Make dump_inode() use get_kernel_nofault() to safely access inode
     and superblock fields, matching the dump_mapping() pattern

  API modernization:

   - Make posix_acl_to_xattr() allocate the buffer internally since
     every single caller was doing it anyway. Reduces boilerplate and
     unnecessary error checking across ~15 filesystems

   - Replace deprecated simple_strtoul() with kstrtoul() for the
     ihash_entries, dhash_entries, mhash_entries, and mphash_entries
     boot parameters, adding proper error handling

   - Convert chardev code to use guard(mutex) and __free(kfree) cleanup
     patterns

   - Replace min_t() with min() or umin() in VFS code to avoid silently
     truncating unsigned long to unsigned int

   - Gate LOOKUP_RCU assertions behind CONFIG_DEBUG_VFS since callers
     already check the flag

  Deprecation:

   - Begin deprecating legacy BSD process accounting (acct(2)). The
     interface has numerous footguns and better alternatives exist
     (eBPF)

  Documentation:

   - Fix and complete kernel-doc for struct export_operations, removing
     duplicated documentation between ReST and source

   - Fix kernel-doc warnings for __start_dirop() and ilookup5_nowait()

  Testing:

   - Add a kunit test for initramfs cpio handling of entries with
     filesize > PATH_MAX

  Misc:

   - Add missing <linux/init_task.h> include in fs_struct.c"

* tag 'vfs-7.0-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (28 commits)
  posix_acl: make posix_acl_to_xattr() alloc the buffer
  fs: make insert_inode_locked() wait for inode destruction
  initramfs_test: kunit test for cpio.filesize > PATH_MAX
  fs: improve dump_inode() to safely access inode fields
  fs: add <linux/init_task.h> for 'init_fs'
  docs: exportfs: Use source code struct documentation
  fs: move initializing f_mode before file_ref_init()
  exportfs: Complete kernel-doc for struct export_operations
  exportfs: Mark struct export_operations functions at kernel-doc
  exportfs: Fix kernel-doc output for get_name()
  acct(2): begin the deprecation of legacy BSD process accounting
  device_cgroup: remove branch hint after code refactor
  VFS: fix __start_dirop() kernel-doc warnings
  fs: Describe @isnew parameter in ilookup5_nowait()
  fs/namei: Remove redundant DCACHE_MANAGED_DENTRY check in __follow_mount_rcu
  fs: only assert on LOOKUP_RCU when built with CONFIG_DEBUG_VFS
  select: store end_time as timespec64 in restart block
  chardev: Switch to guard(mutex) and __free(kfree)
  namespace: Replace simple_strtoul with kstrtoul to parse boot params
  dcache: Replace simple_strtoul with kstrtoul in set_dhash_entries
  ...
2026-02-09 15:13:05 -08:00
Linus Torvalds
bcc8fd3e15 Merge tag 'lsm-pr-20260203' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm
Pull lsm updates from Paul Moore:

 - Unify the security_inode_listsecurity() calls in NFSv4

   While looking at security_inode_listsecurity() with an eye towards
   improving the interface, we realized that the NFSv4 code was making
   multiple calls to the LSM hook that could be consolidated into one.

 - Mark the LSM static branch keys as static - this helps resolve some
   sparse warnings

 - Add __rust_helper annotations to the LSM and cred wrapper functions

 - Remove the unsused set_security_override_from_ctx() function

 - Minor fixes to some of the LSM kdoc comment blocks

* tag 'lsm-pr-20260203' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm:
  lsm: make keys for static branch static
  cred: remove unused set_security_override_from_ctx()
  rust: security: add __rust_helper to helpers
  rust: cred: add __rust_helper to helpers
  nfs: unify security_inode_listsecurity() calls
  lsm: fix kernel-doc struct member names
2026-02-09 10:16:48 -08:00
Linus Torvalds
698749164a Merge tag 'audit-pr-20260203' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit
Pull audit updates from Paul Moore:

 - Improve the NETFILTER_PKT audit records

   Add source and destination ports to the NETFILTER_PKT audit records
   while also consolidating a lot of the code into a new, singular
   audit_log_nf_skb() function. This new approach to structuring the
   NETFILTER_PKT record generation should eliminate some unnecessary
   overhead when audit is not built into the kernel.

 - Update the audit syscall classifier code

   Add the listxattrat(), getxattrat(), and fchmodat2() syscall to the
   audit code which classifies syscalls into categories of operations,
   e.g. "read" or "change attributes".

 - Move the syscall classifier declarations into audit_arch.h

   Shuffle around some header file declarations to resolve some sparse
   warnings.

* tag 'audit-pr-20260203' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
  audit: move the compat_xxx_class[] extern declarations to audit_arch.h
  audit: add missing syscalls to read class
  audit: include source and destination ports to NETFILTER_PKT
  audit: add audit_log_nf_skb helper function
  audit: add fchmodat2() to change attributes class
2026-02-09 10:13:03 -08:00
Linus Torvalds
ef852baaf6 Merge tag 'rcu.release.v7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux
Pull RCU updates from Boqun Feng:

 - RCU Tasks Trace:

   Re-implement RCU tasks trace in term of SRCU-fast, not only more than
   500 lines of code are saved because of the reimplementation, a new
   set of API, rcu_read_{,un}lock_tasks_trace(), becomes possible as
   well. Compared to the previous rcu_read_{,un}lock_trace(), the new
   API avoid the task_struct accesses thanks to the SRCU-fast semantics.

   As a result, the old rcu_read{,un}lock_trace() API is now deprecated.

 - RCU Torture Test:
    - Multiple improvements on kvm-series.sh (parallel run and
      progress showing metrics)
    - Add context checks to rcu_torture_timer()
    - Make config2csv.sh properly handle comments in .boot files
    - Include commit discription in testid.txt

 - Miscellaneous RCU changes:
    - Reduce synchronize_rcu() latency by reporting GP kthread's
      CPU QS early
    - Use suitable gfp_flags for the init_srcu_struct_nodes()
    - Fix rcu_read_unlock() deadloop due to softirq
    - Correctly compute probability to invoke ->exp_current()
      in rcutorture
    - Make expedited RCU CPU stall warnings detect stall-end races

 - RCU nocb:
    - Remove unnecessary WakeOvfIsDeferred wake path and callback
      overload handling
    - Extract nocb_defer_wakeup_cancel() helper

* tag 'rcu.release.v7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux: (25 commits)
  rcu/nocb: Extract nocb_defer_wakeup_cancel() helper
  rcu/nocb: Remove dead callback overload handling
  rcu/nocb: Remove unnecessary WakeOvfIsDeferred wake path
  rcu: Reduce synchronize_rcu() latency by reporting GP kthread's CPU QS early
  srcu: Use suitable gfp_flags for the init_srcu_struct_nodes()
  rcu: Fix rcu_read_unlock() deadloop due to softirq
  rcutorture: Correctly compute probability to invoke ->exp_current()
  rcu: Make expedited RCU CPU stall warnings detect stall-end races
  rcutorture: Add --kill-previous option to terminate previous kvm.sh runs
  rcutorture: Prevent concurrent kvm.sh runs on same source tree
  torture: Include commit discription in testid.txt
  torture: Make config2csv.sh properly handle comments in .boot files
  torture: Make kvm-series.sh give run numbers and totals
  torture: Make kvm-series.sh give build numbers and totals
  torture: Parallelize kvm-series.sh guest-OS execution
  rcutorture: Add context checks to rcu_torture_timer()
  rcutorture: Test rcu_tasks_trace_expedite_current()
  srcu: Create an rcu_tasks_trace_expedite_current() function
  checkpatch: Deprecate rcu_read_{,un}lock_trace()
  rcu: Update Requirements.rst for RCU Tasks Trace
  ...
2026-02-09 09:46:26 -08:00
Linus Torvalds
dda5df9823 Merge tag 'sched-urgent-2026-02-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
 "Miscellaneous MMCID fixes to address bugs and performance regressions
  in the recent rewrite of the SCHED_MM_CID management code:

   - Fix livelock triggered by BPF CI testing

   - Fix hard lockup on weakly ordered systems

   - Simplify the dropping of CIDs in the exit path by removing an
     unintended transition phase

   - Fix performance/scalability regression on a thread-pool benchmark
     by optimizing transitional CIDs when scheduling out"

* tag 'sched-urgent-2026-02-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/mmcid: Optimize transitional CIDs when scheduling out
  sched/mmcid: Drop per CPU CID immediately when switching to per task mode
  sched/mmcid: Protect transition on weakly ordered systems
  sched/mmcid: Prevent live lock on task to CPU mode transition
2026-02-07 09:10:42 -08:00
Linus Torvalds
7e0b172c80 Merge tag 'objtool-urgent-2026-02-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull objtool fixes from Ingo Molnar::

 - Bump up the Clang minimum version requirements for livepatch
   builds, due to Clang assembler section handling bugs causing
   silent miscompilations

 - Strip livepatching symbol artifacts from non-livepatch modules

 - Fix livepatch build warnings when certain Clang LTO options
   are enabled

 - Fix livepatch build error when CONFIG_MEM_ALLOC_PROFILING_DEBUG=y

* tag 'objtool-urgent-2026-02-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  objtool/klp: Fix unexported static call key access for manually built livepatch modules
  objtool/klp: Fix symbol correlation for orphaned local symbols
  livepatch: Free klp_{object,func}_ext data after initialization
  livepatch: Fix having __klp_objects relics in non-livepatch modules
  livepatch/klp-build: Require Clang assembler >= 20
2026-02-07 08:21:21 -08:00
Linus Torvalds
bab849a908 Merge tag 'trace-v6.19-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fix from Steven Rostedt:

 - Fix event format field alignments for 32 bit architectures

   The fields in the event format files are used to parse the raw binary
   buffer data by applications. If they are incorrect, then the
   application produces garbage.

   On 32 bit architectures, the function graph 64bit calltime and
   rettime were off by 4bytes. That's because the actual fields are in a
   packed structure but the macros used by the ftrace events did not
   mark them as packed, and instead, gave them their natural alignment
   which made their offsets off by 4 bytes.

   There are macros to have a packed field within an embedded structure
   of an event, but there's no macro for normal fields within a packed
   structure of the event. The macro __field_packed() was used for the
   packed embedded structure field. Rename that to __field_desc_packed()
   (to match the non-packed embedded field macro __field_desc()), and
   make __field_packed() for fields that are in a packed event structure
   (which matches the unpacked __field() macro).

   Switch the calltime and rettime fields of the function graph event to
   use the new __field_packed() and this makes the offsets correct.

* tag 'trace-v6.19-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing: Fix ftrace event field alignments
2026-02-06 12:37:28 -08:00
Linus Torvalds
23b0d2f7c2 Merge tag 'dma-mapping-6.19-2026-02-06' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux
Pull dma-mapping fixes from Marek Szyprowski:
 "Two minor fixes for the DMA-mapping subsystem:

   - check for the rare case of the allocation failure of the global CMA
     pool (Shanker Donthineni)

   - avoid perf buffer overflow when tracing large scatter-gather lists
     (Deepanshu Kartikey)"

* tag 'dma-mapping-6.19-2026-02-06' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux:
  dma: contiguous: Check return value of dma_contiguous_reserve_area()
  tracing/dma: Cap dma_map_sg tracepoint arrays to prevent buffer overflow
2026-02-06 10:27:42 -08:00
Jens Axboe
9fd99788f3 io_uring: add task fork hook
Called when copy_process() is called to copy state to a new child.
Right now this is just a stub, but will be used shortly to properly
handle fork'ing of task based io_uring restrictions.

Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-02-06 07:29:14 -07:00
Petr Pavlu
ab10815472 livepatch: Fix having __klp_objects relics in non-livepatch modules
The linker script scripts/module.lds.S specifies that all input
__klp_objects sections should be consolidated into an output section of
the same name, and start/stop symbols should be created to enable
scripts/livepatch/init.c to locate this data.

This start/stop pattern is not ideal for modules because the symbols are
created even if no __klp_objects input sections are present.
Consequently, a dummy __klp_objects section also appears in the
resulting module. This unnecessarily pollutes non-livepatch modules.

Instead, since modules are relocatable files, the usual method for
locating consolidated data in a module is to read its section table.
This approach avoids the aforementioned problem.

The klp_modinfo already stores a copy of the entire section table with
the final addresses. Introduce a helper function that
scripts/livepatch/init.c can call to obtain the location of the
__klp_objects section from this data.

Fixes: dd590d4d57 ("objtool/klp: Introduce klp diff subcommand for diffing object files")
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
Acked-by: Joe Lawrence <joe.lawrence@redhat.com>
Acked-by: Miroslav Benes <mbenes@suse.cz>
Reviewed-by: Aaron Tomlin <atomlin@atomlin.com>
Link: https://patch.msgid.link/20260123102825.3521961-2-petr.pavlu@suse.com
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
2026-02-05 08:00:44 -08:00
Steven Rostedt
033c55fe2e tracing: Fix ftrace event field alignments
The fields of ftrace specific events (events used to save ftrace internal
events like function traces and trace_printk) are generated similarly to
how normal trace event fields are generated. That is, the fields are added
to a trace_events_fields array that saves the name, offset, size,
alignment and signness of the field. It is used to produce the output in
the format file in tracefs so that tooling knows how to parse the binary
data of the trace events.

The issue is that some of the ftrace event structures are packed. The
function graph exit event structures are one of them. The 64 bit calltime
and rettime fields end up 4 byte aligned, but the algorithm to show to
userspace shows them as 8 byte aligned.

The macros that create the ftrace events has one for embedded structure
fields. There's two macros for theses fields:

  __field_desc() and __field_packed()

The difference of the latter macro is that it treats the field as packed.

Rename that field to __field_desc_packed() and create replace the
__field_packed() to be a normal field that is packed and have the calltime
and rettime use those.

This showed up on 32bit architectures for function graph time fields. It
had:

 ~# cat /sys/kernel/tracing/events/ftrace/funcgraph_exit/format
[..]
        field:unsigned long func;       offset:8;       size:4; signed:0;
        field:unsigned int depth;       offset:12;      size:4; signed:0;
        field:unsigned int overrun;     offset:16;      size:4; signed:0;
        field:unsigned long long calltime;      offset:24;      size:8; signed:0;
        field:unsigned long long rettime;       offset:32;      size:8; signed:0;

Notice that overrun is at offset 16 with size 4, where in the structure
calltime is at offset 20 (16 + 4), but it shows the offset at 24. That's
because it used the alignment of unsigned long long when used as a
declaration and not as a member of a structure where it would be aligned
by word size (in this case 4).

By using the proper structure alignment, the format has it at the correct
offset:

 ~# cat /sys/kernel/tracing/events/ftrace/funcgraph_exit/format
[..]
        field:unsigned long func;       offset:8;       size:4; signed:0;
        field:unsigned int depth;       offset:12;      size:4; signed:0;
        field:unsigned int overrun;     offset:16;      size:4; signed:0;
        field:unsigned long long calltime;      offset:20;      size:8; signed:0;
        field:unsigned long long rettime;       offset:28;      size:8; signed:0;

Cc: stable@vger.kernel.org
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reported-by: "jempty.liang" <imntjempty@163.com>
Link: https://patch.msgid.link/20260204113628.53faec78@gandalf.local.home
Fixes: 04ae87a520 ("ftrace: Rework event_create_dir()")
Closes: https://lore.kernel.org/all/20260130015740.212343-1-imntjempty@163.com/
Closes: https://lore.kernel.org/all/20260202123342.2544795-1-imntjempty@163.com/
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-02-05 09:47:11 -05:00
Linus Torvalds
b20624608f Merge tag 'mm-hotfixes-stable-2026-02-04-15-55' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
 "Five hotfixes.  Two are cc:stable, two are for MM.

  All are singletons - please see the changelogs for details"

* tag 'mm-hotfixes-stable-2026-02-04-15-55' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  Documentation: document liveupdate cmdline parameter
  mm, shmem: prevent infinite loop on truncate race
  mailmap: update Alexander Mikhalitsyn's emails
  liveupdate: luo_file: do not clear serialized_data on unfreeze
  x86/kfence: fix booting on 32bit non-PAE systems
2026-02-04 16:04:00 -08:00
Linus Torvalds
3c7b4d1994 Merge tag 'sched_ext-for-6.19-rc8-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext
Pull sched_ext fix from Tejun Heo:

 - Fix race where sched_class operations (sched_setscheduler() and
   friends) could be invoked on dead tasks after sched_ext_dead()
   already ran, causing invalid SCX task state transitions and NULL
   pointer dereferences.

   This was a regression from the cgroup exit ordering fix which
   moved sched_ext_free() to finish_task_switch().

* tag 'sched_ext-for-6.19-rc8-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
  sched_ext: Short-circuit sched_class operations on dead tasks
2026-02-04 15:11:24 -08:00
Tejun Heo
0eca95cba2 sched_ext: Short-circuit sched_class operations on dead tasks
7900aa699c ("sched_ext: Fix cgroup exit ordering by moving sched_ext_free()
to finish_task_switch()") moved sched_ext_free() to finish_task_switch() and
renamed it to sched_ext_dead() to fix cgroup exit ordering issues. However,
this created a race window where certain sched_class ops may be invoked on
dead tasks leading to failures - e.g. sched_setscheduler() may try to switch a
task which finished sched_ext_dead() back into SCX triggering invalid SCX task
state transitions.

Add task_dead_and_done() which tests whether a task is TASK_DEAD and has
completed its final context switch, and use it to short-circuit sched_class
operations which may be called on dead tasks.

Fixes: 7900aa699c ("sched_ext: Fix cgroup exit ordering by moving sched_ext_free() to finish_task_switch()")
Reported-by: Andrea Righi <arighi@nvidia.com>
Link: http://lkml.kernel.org/r/20260202151341.796959-1-arighi@nvidia.com
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-02-04 12:22:11 -10:00
Rafael J. Wysocki
073dcc0283 Merge branch 'pm-runtime'
Merge updates related to runtime PM for 6.20-rc1/7.0-rc1:

 - Make several drivers discard pm_runtime_put() return value in
   preparation for converting that function to a void one (Rafael
   Wysocki)

* pm-runtime:
  drm: Discard pm_runtime_put() return value
  genirq/chip: Change irq_chip_pm_put() return type to void
  scsi: ufs: core: Discard pm_runtime_put() return values
  platform/chrome: cros_hps_i2c: Discard pm_runtime_put() return value
  coresight: Discard pm_runtime_put() return values
  hwspinlock: omap: Discard pm_runtime_put() return value
  watchdog: rzv2h_wdt: Discard pm_runtime_put() return value
  watchdog: rz: Discard pm_runtime_put() return values
  media: ccs: Discard pm_runtime_put() return value
  drm/imagination: Discard pm_runtime_put() return value
  USB: core: Discard pm_runtime_put() return value
2026-02-04 21:03:18 +01:00
Rafael J. Wysocki
c233403593 Merge branch 'pm-sleep'
Merge updates related to system suspend and hibernation for
6.20-rc1/7.0-rc1:

 - Stop flagging the PM runtime workqueue as freezable to avoid system
   suspend and resume deadlocks in subsystems that assume asynchronous
   runtime PM to work during system-wide PM transitions (Rafael Wysocki)

 - Drop redundant NULL pointer checks before acomp_request_free() from
   the hibernation code handling image saving (Rafael Wysocki)

 - Update wakeup_sources_walk_start() to handle empty lists of wakeup
   sources as appropriate (Samuel Wu)

 - Make dev_pm_clear_wake_irq() check the power.wakeirq value under
   power.lock to avoid race conditions (Gui-Dong Han)

 - Avoid bit field races related to power.work_in_progress in the core
   device suspend code (Xuewen Yan)

* pm-sleep:
  PM: sleep: core: Avoid bit field races related to work_in_progress
  PM: sleep: wakeirq: harden dev_pm_clear_wake_irq() against races
  PM: wakeup: Handle empty list in wakeup_sources_walk_start()
  PM: hibernate: Drop NULL pointer checks before acomp_request_free()
  PM: sleep: Do not flag runtime PM workqueue as freezable
2026-02-04 20:52:09 +01:00
Thomas Gleixner
4463c7aa11 sched/mmcid: Optimize transitional CIDs when scheduling out
During the investigation of the various transition mode issues
instrumentation revealed that the amount of bitmap operations can be
significantly reduced when a task with a transitional CID schedules out
after the fixup function completed and disabled the transition mode.

At that point the mode is stable and therefore it is not required to drop
the transitional CID back into the pool. As the fixup is complete the
potential exhaustion of the CID pool is not longer possible, so the CID can
be transferred to the scheduling out task or to the CPU depending on the
current ownership mode.

The racy snapshot of mm_cid::mode which contains both the ownership state
and the transition bit is valid because runqueue lock is held and the fixup
function of a concurrent mode switch is serialized.

Assigning the ownership right there not only spares the bitmap access for
dropping the CID it also avoids it when the task is scheduled back in as it
directly hits the fast path in both modes when the CID is within the
optimal range. If it's outside the range the next schedule in will need to
converge so dropping it right away is sensible. In the good case this also
allows to go into the fast path on the next schedule in operation.

With a thread pool benchmark which is configured to cross the mode switch
boundaries frequently this reduces the number of bitmap operations by about
30% and increases the fastpath utilization in the low single digit
percentage range.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20260201192835.100194627@kernel.org
2026-02-04 12:21:12 +01:00
Thomas Gleixner
007d84287c sched/mmcid: Drop per CPU CID immediately when switching to per task mode
When a exiting task initiates the switch from per CPU back to per task
mode, it has already dropped its CID and marked itself inactive. But a
leftover from an earlier iteration of the rework then reassigns the per
CPU CID to the exiting task with the transition bit set.

That's wrong as the task is already marked CID inactive, which means it is
inconsistent state. It's harmless because the CID is marked in transit and
therefore dropped back into the pool when the exiting task schedules out
either through preemption or the final schedule().

Simply drop the per CPU CID when the exiting task triggered the transition.

Fixes: fbd0e71dc3 ("sched/mmcid: Provide CID ownership mode fixup functions")
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20260201192835.032221009@kernel.org
2026-02-04 12:21:12 +01:00
Thomas Gleixner
47ee94efcc sched/mmcid: Protect transition on weakly ordered systems
Shrikanth reported a hard lockup which he observed once. The stack trace
shows the following CID related participants:

  watchdog: CPU 23 self-detected hard LOCKUP @ mm_get_cid+0xe8/0x188
  NIP: mm_get_cid+0xe8/0x188
  LR:  mm_get_cid+0x108/0x188
   mm_cid_switch_to+0x3c4/0x52c
   __schedule+0x47c/0x700
   schedule_idle+0x3c/0x64
   do_idle+0x160/0x1b0
   cpu_startup_entry+0x48/0x50
   start_secondary+0x284/0x288
   start_secondary_prolog+0x10/0x14

  watchdog: CPU 11 self-detected hard LOCKUP @ plpar_hcall_norets_notrace+0x18/0x2c
  NIP: plpar_hcall_norets_notrace+0x18/0x2c
  LR:  queued_spin_lock_slowpath+0xd88/0x15d0
   _raw_spin_lock+0x80/0xa0
   raw_spin_rq_lock_nested+0x3c/0xf8
   mm_cid_fixup_cpus_to_tasks+0xc8/0x28c
   sched_mm_cid_exit+0x108/0x22c
   do_exit+0xf4/0x5d0
   make_task_dead+0x0/0x178
   system_call_exception+0x128/0x390
   system_call_vectored_common+0x15c/0x2ec

The task on CPU11 is running the CID ownership mode change fixup function
and is stuck on a runqueue lock. The task on CPU23 is trying to get a CID
from the pool with the same runqueue lock held, but the pool is empty.

After decoding a similar issue in the opposite direction switching from per
task to per CPU mode the tool which models the possible scenarios failed to
come up with a similar loop hole.

This showed up only once, was not reproducible and according to tooling not
related to a overlooked scheduling scenario permutation. But the fact that
it was observed on a PowerPC system gave the right hint: PowerPC is a
weakly ordered architecture.

The transition mechanism does:

    WRITE_ONCE(mm->mm_cid.transit, MM_CID_TRANSIT);
    WRITE_ONCE(mm->mm_cid.percpu, new_mode);

    fixup()

    WRITE_ONCE(mm->mm_cid.transit, 0);

mm_cid_schedin() does:

    if (!READ_ONCE(mm->mm_cid.percpu))
       ...
       cid |= READ_ONCE(mm->mm_cid.transit);

so weakly ordered systems can observe percpu == false and transit == 0 even
if the fixup function has not yet completed. As a consequence the task will
not drop the CID when scheduling out before the fixup is completed, which
means the CID space can be exhausted and the next task scheduling in will
loop in mm_get_cid() and the fixup thread can livelock on the held runqueue
lock as above.

This could obviously be solved by using:
     smp_store_release(&mm->mm_cid.percpu, true);
and
     smp_load_acquire(&mm->mm_cid.percpu);

but that brings a memory barrier back into the scheduler hotpath, which was
just designed out by the CID rewrite.

That can be completely avoided by combining the per CPU mode and the
transit storage into a single mm_cid::mode member and ordering the stores
against the fixup functions to prevent the CPU from reordering them.

That makes the update of both states atomic and a concurrent read observes
always consistent state.

The price is an additional AND operation in mm_cid_schedin() to evaluate
the per CPU or the per task path, but that's in the noise even on strongly
ordered architectures as the actual load can be significantly more
expensive and the conditional branch evaluation is there anyway.

Fixes: fbd0e71dc3 ("sched/mmcid: Provide CID ownership mode fixup functions")
Closes: https://lore.kernel.org/bdfea828-4585-40e8-8835-247c6a8a76b0@linux.ibm.com
Reported-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20260201192834.965217106@kernel.org
2026-02-04 12:21:12 +01:00
Thomas Gleixner
4327fb13fa sched/mmcid: Prevent live lock on task to CPU mode transition
Ihor reported a BPF CI failure which turned out to be a live lock in the
MM_CID management. The scenario is:

A test program creates the 5th thread, which means the MM_CID users become
more than the number of CPUs (four in this example), so it switches to per
CPU ownership mode.

At this point each live task of the program has a CID associated. Assume
thread creation order assignment for simplicity.

   T0     CID0  runs fork() and creates T4
   T1 	  CID1
   T2 	  CID2
   T3 	  CID3
   T4       ---   not visible yet

T0 sets mm_cid::percpu = true and transfers its own CID to CPU0 where it
runs on and then starts the fixup which walks through the threads to
transfer the per task CIDs either to the CPU the task is running on or drop
it back into the pool if the task is not on a CPU.

During that T1 - T3 are free to schedule in and out before the fixup caught
up with them. Going through all possible permutations with a python script
revealed a few problematic cases. The most trivial one is:

   T1 schedules in on CPU1 and observes percpu == true, so it transfers
      its CID to CPU1

   T1 is migrated to CPU2 and schedule in observes percpu == true, but
      CPU2 does not have a CID associated and T1 transferred its own to
      CPU1

      So it has to allocate one with CPU2 runqueue lock held, but the
      pool is empty, so it keeps looping in mm_get_cid().

Now T0 reaches T1 in the thread walk and tries to lock the corresponding
runqueue lock, which is held causing a full live lock.

There is a similar scenario in the reverse direction of switching from per
CPU to task mode which is way more obvious and got therefore addressed by
an intermediate mode. In this mode the CIDs are marked with MM_CID_TRANSIT,
which means that they are neither owned by the CPU nor by the task. When a
task schedules out with a transit CID it drops the CID back into the pool
making it available for others to use temporarily. Once the task which
initiated the mode switch finished the fixup it clears the transit mode and
the process goes back into per task ownership mode.

Unfortunately this insight was not mapped back to the task to CPU mode
switch as the above described scenario was not considered in the analysis.

Apply the same transit mechanism to the task to CPU mode switch to handle
these problematic cases correctly.

As with the CPU to task transition this results in a potential temporary
contention on the CID bitmap, but that's only for the time it takes to
complete the transition. After that it stays in steady mode which does not
touch the bitmap at all.

Fixes: fbd0e71dc3 ("sched/mmcid: Provide CID ownership mode fixup functions")
Closes: https://lore.kernel.org/2b7463d7-0f58-4e34-9775-6e2115cfb971@linux.dev
Reported-by: Ihor Solodrai <ihor.solodrai@linux.dev>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20260201192834.897115238@kernel.org
2026-02-04 12:21:11 +01:00
Frederic Weisbecker
d279138a27 kthread: Document kthread_affine_preferred()
The documentation of this new API has been overlooked during its
introduction. Fill the gap.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Marco Crivellari <marco.crivellari@suse.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
2026-02-03 15:23:35 +01:00
Frederic Weisbecker
60ba9c38b9 kthread: Comment on the purpose and placement of kthread_affine_node() call
It may not appear obvious why kthread_affine_node() is not called before
the kthread creation completion instead of after the first wake-up.

The reason is that kthread_affine_node() applies a default affinity
behaviour that only takes place if no affinity preference have already
been passed by the kthread creation call site.

Add a comment to clarify that.

Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Marco Crivellari <marco.crivellari@suse.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
2026-02-03 15:23:35 +01:00
Frederic Weisbecker
e894f63398 kthread: Honour kthreads preferred affinity after cpuset changes
When cpuset isolated partitions get updated, unbound kthreads get
indifferently affine to all non isolated CPUs, regardless of their
individual affinity preferences.

For example kswapd is a per-node kthread that prefers to be affine to
the node it refers to. Whenever an isolated partition is created,
updated or deleted, kswapd's node affinity is going to be broken if any
CPU in the related node is not isolated because kswapd will be affine
globally.

Fix this with letting the consolidated kthread managed affinity code do
the affinity update on behalf of cpuset.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Waiman Long <longman@redhat.com>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Marco Crivellari <marco.crivellari@suse.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
Cc: cgroups@vger.kernel.org
2026-02-03 15:23:35 +01:00
Frederic Weisbecker
041ee6f372 kthread: Rely on HK_TYPE_DOMAIN for preferred affinity management
Unbound kthreads want to run neither on nohz_full CPUs nor on domain
isolated CPUs. And since nohz_full implies domain isolation, checking
the latter is enough to verify both.

Therefore exclude kthreads from domain isolation.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Waiman Long <longman@redhat.com>
Cc: Marco Crivellari <marco.crivellari@suse.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
2026-02-03 15:23:35 +01:00
Frederic Weisbecker
92a734606e kthread: Include kthreadd to the managed affinity list
The unbound kthreads affinity management performed by cpuset is going to
be imported to the kthread core code for consolidation purposes.

Treat kthreadd just like any other kthread.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Waiman Long <longman@redhat.com>
Cc: Marco Crivellari <marco.crivellari@suse.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
2026-02-03 15:23:35 +01:00
Frederic Weisbecker
5564c12385 kthread: Include unbound kthreads in the managed affinity list
The managed affinity list currently contains only unbound kthreads that
have affinity preferences. Unbound kthreads globally affine by default
are outside of the list because their affinity is automatically managed
by the scheduler (through the fallback housekeeping mask) and by cpuset.

However in order to preserve the preferred affinity of kthreads, cpuset
will delegate the isolated partition update propagation to the
housekeeping and kthread code.

Prepare for that with including all unbound kthreads in the managed
affinity list.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Waiman Long <longman@redhat.com>
Cc: Marco Crivellari <marco.crivellari@suse.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
2026-02-03 15:23:35 +01:00
Frederic Weisbecker
012fef0e48 kthread: Refine naming of affinity related fields
The kthreads preferred affinity related fields use "hotplug" as the base
of their naming because the affinity management was initially deemed to
deal with CPU hotplug.

The scope of this role is going to broaden now and also deal with
cpuset isolated partition updates.

Switch the naming accordingly.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Waiman Long <longman@redhat.com>
Cc: Marco Crivellari <marco.crivellari@suse.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
2026-02-03 15:23:35 +01:00
Frederic Weisbecker
6440966067 cpuset: Remove cpuset_cpu_is_isolated()
The set of cpuset isolated CPUs is now included in HK_TYPE_DOMAIN
housekeeping cpumask. There is no usecase left interested in just
checking what is isolated by cpuset and not by the isolcpus= kernel
boot parameter.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Waiman Long <longman@redhat.com>
Cc: "Michal Koutný" <mkoutny@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Marco Crivellari <marco.crivellari@suse.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
Cc: cgroups@vger.kernel.org
2026-02-03 15:23:34 +01:00
Frederic Weisbecker
0947d018cf timers/migration: Remove superfluous cpuset isolation test
Cpuset isolated partitions are now included in HK_TYPE_DOMAIN. Testing
if a CPU is part of an isolated partition alone is now useless.

Remove the superflous test.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Waiman Long <longman@redhat.com>
2026-02-03 15:23:34 +01:00
Frederic Weisbecker
f5c145ae4f cpuset: Propagate cpuset isolation update to timers through housekeeping
Until now, cpuset would propagate isolated partition changes to
timer migration so that unbound timers don't get migrated to isolated
CPUs.

Since housekeeping now centralizes, synchronize and propagates isolation
cpumask changes, perform the work from that subsystem for consolidation
and consistency purposes.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2026-02-03 15:23:34 +01:00
Frederic Weisbecker
23f09dcc0a cpuset: Propagate cpuset isolation update to workqueue through housekeeping
Until now, cpuset would propagate isolated partition changes to
workqueues so that unbound workers get properly reaffined.

Since housekeeping now centralizes, synchronize and propagates isolation
cpumask changes, perform the work from that subsystem for consolidation
and consistency purposes.

For simplification purpose, the target function is adapted to take the
new housekeeping mask instead of the isolated mask.

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Waiman Long <longman@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: "Michal Koutný" <mkoutny@suse.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Marco Crivellari <marco.crivellari@suse.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
Cc: cgroups@vger.kernel.org
2026-02-03 15:23:34 +01:00
Frederic Weisbecker
29b306c44e PCI: Flush PCI probe workqueue on cpuset isolated partition change
The HK_TYPE_DOMAIN housekeeping cpumask is now modifiable at runtime. In
order to synchronize against PCI probe works and make sure that no
asynchronous probing is still pending or executing on a newly isolated
CPU, the housekeeping subsystem must flush the PCI probe works.

However the PCI probe works can't be flushed easily since they are
queued to the main per-CPU workqueue pool.

Solve this with creating a PCI probe-specific pool and provide and use
the appropriate flushing API.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Cc: Marco Crivellari <marco.crivellari@suse.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
Cc: linux-pci@vger.kernel.org
2026-02-03 15:23:34 +01:00
Frederic Weisbecker
ce84ad5e99 sched/isolation: Flush vmstat workqueues on cpuset isolated partition change
The HK_TYPE_DOMAIN housekeeping cpumask is now modifiable at runtime.
In order to synchronize against vmstat workqueue to make sure
that no asynchronous vmstat work is still pending or executing on a
newly made isolated CPU, the housekeeping susbsystem must flush the
vmstat workqueues.

This involves flushing the whole mm_percpu_wq workqueue, shared with
LRU drain, introducing here a welcome side effect.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Marco Crivellari <marco.crivellari@suse.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
Cc: linux-mm@kvack.org
2026-02-03 15:23:34 +01:00
Frederic Weisbecker
b7eb4edcc3 sched/isolation: Flush memcg workqueues on cpuset isolated partition change
The HK_TYPE_DOMAIN housekeeping cpumask is now modifiable at runtime. In
order to synchronize against memcg workqueue to make sure that no
asynchronous draining is still pending or executing on a newly made
isolated CPU, the housekeeping susbsystem must flush the memcg
workqueues.

However the memcg workqueues can't be flushed easily since they are
queued to the main per-CPU workqueue pool.

Solve this with creating a memcg specific pool and provide and use the
appropriate flushing API.

Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Marco Crivellari <marco.crivellari@suse.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
Cc: cgroups@vger.kernel.org
Cc: linux-mm@kvack.org
2026-02-03 15:23:34 +01:00
Frederic Weisbecker
03ff735101 cpuset: Update HK_TYPE_DOMAIN cpumask from cpuset
Until now, HK_TYPE_DOMAIN used to only include boot defined isolated
CPUs passed through isolcpus= boot option. Users interested in also
knowing the runtime defined isolated CPUs through cpuset must use
different APIs: cpuset_cpu_is_isolated(), cpu_is_isolated(), etc...

There are many drawbacks to that approach:

1) Most interested subsystems want to know about all isolated CPUs, not
  just those defined on boot time.

2) cpuset_cpu_is_isolated() / cpu_is_isolated() are not synchronized with
  concurrent cpuset changes.

3) Further cpuset modifications are not propagated to subsystems

Solve 1) and 2) and centralize all isolated CPUs within the
HK_TYPE_DOMAIN housekeeping cpumask.

Subsystems can rely on RCU to synchronize against concurrent changes.

The propagation mentioned in 3) will be handled in further patches.

[Chen Ridong: Fix cpu_hotplug_lock deadlock and use correct static
branch API]

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Waiman Long <longman@redhat.com>
Reviewed-by: Chen Ridong <chenridong@huawei.com>
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Cc: "Michal Koutný" <mkoutny@suse.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Marco Crivellari <marco.crivellari@suse.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
Cc: cgroups@vger.kernel.org
2026-02-03 15:23:34 +01:00
Frederic Weisbecker
27c3a5967f sched/isolation: Convert housekeeping cpumasks to rcu pointers
HK_TYPE_DOMAIN's cpumask will soon be made modifiable by cpuset.
A synchronization mechanism is then needed to synchronize the updates
with the housekeeping cpumask readers.

Turn the housekeeping cpumasks into RCU pointers. Once a housekeeping
cpumask will be modified, the update side will wait for an RCU grace
period and propagate the change to interested subsystem when deemed
necessary.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Marco Crivellari <marco.crivellari@suse.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
2026-02-03 15:23:33 +01:00
Frederic Weisbecker
a7e546354d cpuset: Provide lockdep check for cpuset lock held
cpuset modifies partitions, including isolated, while holding the cpuset
mutex.

This means that holding the cpuset mutex is safe to synchronize against
housekeeping cpumask changes.

Provide a lockdep check to validate that.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: "Michal Koutný" <mkoutny@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Marco Crivellari <marco.crivellari@suse.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
Cc: cgroups@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
2026-02-03 15:23:33 +01:00
Frederic Weisbecker
622c508bcf cpu: Provide lockdep check for CPU hotplug lock write-held
cpuset modifies partitions, including isolated, while holding the cpu
hotplug lock read-held.

This means that write-holding the CPU hotplug lock is safe to
synchronize against housekeeping cpumask changes.

Provide a lockdep check to validate that.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Marco Crivellari <marco.crivellari@suse.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Cc: linux-kernel@vger.kernel.org
2026-02-03 15:23:33 +01:00
Frederic Weisbecker
b5de34ed87 timers/migration: Prevent from lockdep false positive warning
Testing housekeeping_cpu() will soon require that either the RCU "lock"
is held or the cpuset mutex.

When CPUs get isolated through cpuset, the change is propagated to
timer migration such that isolation is also performed from the migration
tree. However that propagation is done using workqueue which tests if
the target is actually isolated before proceeding.

Lockdep doesn't know that the workqueue caller holds cpuset mutex and
that it waits for the work, making the housekeeping cpumask read safe.

Shut down the future warning by removing this test. It is unecessary
beyond hotplug, the workqueue is already targeted towards isolated CPUs.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Gabriele Monaco <gmonaco@redhat.com>
2026-02-03 15:23:33 +01:00
Frederic Weisbecker
0f4dfdc17b cpuset: Convert boot_hk_cpus to use HK_TYPE_DOMAIN_BOOT
boot_hk_cpus is an ad-hoc copy of HK_TYPE_DOMAIN_BOOT. Remove it and use
the official version.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Reviewed-by: Chen Ridong <chenridong@huawei.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Marco Crivellari <marco.crivellari@suse.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Michal Koutny <mkoutny@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
Cc: cgroups@vger.kernel.org
2026-02-03 15:23:33 +01:00
Frederic Weisbecker
4fca0e550d sched/isolation: Save boot defined domain flags
HK_TYPE_DOMAIN will soon integrate not only boot defined isolcpus= CPUs
but also cpuset isolated partitions.

Housekeeping still needs a way to record what was initially passed
to isolcpus= in order to keep these CPUs isolated after a cpuset
isolated partition is modified or destroyed while containing some of
them.

Create a new HK_TYPE_DOMAIN_BOOT to keep track of those.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Marco Crivellari <marco.crivellari@suse.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
2026-02-03 15:23:33 +01:00
Johannes Thumshirn
ee4784a83f block: don't use strcpy to copy blockdev name
0-day bot flagged the use of strcpy() in blk_trace_setup(), because the
source buffer can theoretically be bigger than the destination buffer.

While none of the current callers pass a string bigger than
BLKTRACE_BDEV_SIZE, use strscpy() to prevent eventual future misuse and
silence the checker warnings.

Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/r/202602020718.GUEIRyG9-lkp@intel.com/
Fixes: 113cbd6282 ("blktrace: pass blk_user_trace2 to setup functions")
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-02-03 07:15:31 -07:00
Pratyush Yadav (Google)
011d4e52a7 liveupdate: luo_file: do not clear serialized_data on unfreeze
Patch series "liveupdate: fixes in error handling".

This series contains some fixes in LUO's error handling paths.

The first patch deals with failed freeze() attempts.  The cleanup path
calls unfreeze, and that clears some data needed by later unpreserve
calls.

The second patch is a bit more involved.  It deals with failed retrieve()
attempts.  To do so properly, it reworks some of the error handling logic
in luo_file core.

Both these fixes are "theoretical" -- in the sense that I have not been
able to reproduce either of them in normal operation.  The only supported
file type right now is memfd, and there is nothing userspace can do right
now to make it fail its retrieve or freeze.  I need to make the retrieve
or freeze fail by artificially injecting errors.  The injected errors
trigger a use-after-free and a double-free.

That said, once more complex file handlers are added or memfd preservation
is used in ways not currently expected or covered by the tests, we will be
able to see them on real systems.


This patch (of 2):

The unfreeze operation is supposed to undo the effects of the freeze
operation.  serialized_data is not set by freeze, but by preserve. 
Consequently, the unpreserve operation needs to access serialized_data to
undo the effects of the preserve operation.  This includes freeing the
serialized data structures for example.

If a freeze callback fails, unfreeze is called for all frozen files.  This
would clear serialized_data for them.  Since live update has failed, it
can be expected that userspace aborts, releasing all sessions.  When the
sessions are released, unpreserve will be called for all files.  The
unfrozen files will see 0 in their serialized_data.  This is not expected
by file handlers, and they might either fail, leaking data and state, or
might even crash or cause invalid memory access.

Do not clear serialized_data on unfreeze so it gets passed on to
unpreserve.  There is no need to clear it on unpreserve since luo_file
will be freed immediately after.

Link: https://lkml.kernel.org/r/20260126230302.2936817-1-pratyush@kernel.org
Link: https://lkml.kernel.org/r/20260126230302.2936817-2-pratyush@kernel.org
Fixes: 7c722a7f44 ("liveupdate: luo_file: implement file systems callbacks")
Signed-off-by: Pratyush Yadav (Google) <pratyush@kernel.org>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-02-02 18:43:55 -08:00