Pull crypto update from Herbert Xu:
"API:
- Fix race condition in hwrng core by using RCU
Algorithms:
- Allow authenc(sha224,rfc3686) in fips mode
- Add test vectors for authenc(hmac(sha384),cbc(aes))
- Add test vectors for authenc(hmac(sha224),cbc(aes))
- Add test vectors for authenc(hmac(md5),cbc(des3_ede))
- Add lz4 support in hisi_zip
- Only allow clear key use during self-test in s390/{phmac,paes}
Drivers:
- Set rng quality to 900 in airoha
- Add gcm(aes) support for AMD/Xilinx Versal device
- Allow tfms to share device in hisilicon/trng"
* tag 'v7.0-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (100 commits)
crypto: img-hash - Use unregister_ahashes in img_{un}register_algs
crypto: testmgr - Add test vectors for authenc(hmac(md5),cbc(des3_ede))
crypto: cesa - Simplify return statement in mv_cesa_dequeue_req_locked
crypto: testmgr - Add test vectors for authenc(hmac(sha224),cbc(aes))
crypto: testmgr - Add test vectors for authenc(hmac(sha384),cbc(aes))
hwrng: core - use RCU and work_struct to fix race condition
crypto: starfive - Fix memory leak in starfive_aes_aead_do_one_req()
crypto: xilinx - Fix inconsistant indentation
crypto: rng - Use unregister_rngs in register_rngs
crypto: atmel - Use unregister_{aeads,ahashes,skciphers}
hwrng: optee - simplify OP-TEE context match
crypto: ccp - Add sysfs attribute for boot integrity
dt-bindings: crypto: atmel,at91sam9g46-sha: add microchip,lan9691-sha
dt-bindings: crypto: atmel,at91sam9g46-aes: add microchip,lan9691-aes
dt-bindings: crypto: qcom,inline-crypto-engine: document the Milos ICE
crypto: caam - fix netdev memory leak in dpaa2_caam_probe
crypto: hisilicon/qm - increase wait time for mailbox
crypto: hisilicon/qm - obtain the mailbox configuration at one time
crypto: hisilicon/qm - remove unnecessary code in qm_mb_write()
crypto: hisilicon/qm - move the barrier before writing to the mailbox register
...
Pull kthread updates from Frederic Weisbecker:
"The kthread code provides an infrastructure which manages the
preferred affinity of unbound kthreads (node or custom cpumask)
against housekeeping (CPU isolation) constraints and CPU hotplug
events.
One crucial missing piece is the handling of cpuset: when an isolated
partition is created, deleted, or its CPUs updated, all the unbound
kthreads in the top cpuset become indifferently affine to _all_ the
non-isolated CPUs, possibly breaking their preferred affinity along
the way.
Solve this with performing the kthreads affinity update from cpuset to
the kthreads consolidated relevant code instead so that preferred
affinities are honoured and applied against the updated cpuset
isolated partitions.
The dispatch of the new isolated cpumasks to timers, workqueues and
kthreads is performed by housekeeping, as per the nice Tejun's
suggestion.
As a welcome side effect, HK_TYPE_DOMAIN then integrates both the set
from boot defined domain isolation (through isolcpus=) and cpuset
isolated partitions. Housekeeping cpumasks are now modifiable with a
specific RCU based synchronization. A big step toward making
nohz_full= also mutable through cpuset in the future"
* tag 'kthread-for-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks: (33 commits)
doc: Add housekeeping documentation
kthread: Document kthread_affine_preferred()
kthread: Comment on the purpose and placement of kthread_affine_node() call
kthread: Honour kthreads preferred affinity after cpuset changes
sched/arm64: Move fallback task cpumask to HK_TYPE_DOMAIN
sched: Switch the fallback task allowed cpumask to HK_TYPE_DOMAIN
kthread: Rely on HK_TYPE_DOMAIN for preferred affinity management
kthread: Include kthreadd to the managed affinity list
kthread: Include unbound kthreads in the managed affinity list
kthread: Refine naming of affinity related fields
PCI: Remove superfluous HK_TYPE_WQ check
sched/isolation: Remove HK_TYPE_TICK test from cpu_is_isolated()
cpuset: Remove cpuset_cpu_is_isolated()
timers/migration: Remove superfluous cpuset isolation test
cpuset: Propagate cpuset isolation update to timers through housekeeping
cpuset: Propagate cpuset isolation update to workqueue through housekeeping
PCI: Flush PCI probe workqueue on cpuset isolated partition change
sched/isolation: Flush vmstat workqueues on cpuset isolated partition change
sched/isolation: Flush memcg workqueues on cpuset isolated partition change
cpuset: Update HK_TYPE_DOMAIN cpumask from cpuset
...
Pull power management updates from Rafael Wysocki:
"By the number of commits, cpufreq is the leading party (again) and the
most visible change there is the removal of the omap-cpufreq driver
that has not been used for a long time (good riddance). There are also
quite a few changes in the cppc_cpufreq driver, mostly related to
fixing its frequency invariance engine in the case when the CPPC
registers used by it are not in PCC. In addition to that, support for
AM62L3 is added to the ti-cpufreq driver and the cpufreq-dt-platdev
list is updated for some platforms. The remaining cpufreq changes are
assorted fixes and cleanups.
Next up is cpuidle and the changes there are dominated by intel_idle
driver updates, mostly related to the new command line facility
allowing users to adjust the list of C-states used by the driver.
There are also a few updates of cpuidle governors, including two menu
governor fixes and some refinements of the teo governor, and a
MAINTAINERS update adding Christian Loehle as a cpuidle reviewer.
[Thanks for stepping up Christian!]
The most significant update related to system suspend and hibernation
is the one to stop freezing the PM runtime workqueue during system PM
transitions which allows some deadlocks to be avoided. There is also a
fix for possible concurrent bit field updates in the core device
suspend code and a few other minor fixes.
Apart from the above, several drivers are updated to discard the
return value of pm_runtime_put() which is going to be converted to a
void function as soon as everybody stops using its return value, PL4
support for Ice Lake is added to the Intel RAPL power capping driver,
and there are assorted cleanups, documentation fixes, and some
cpupower utility improvements.
Specifics:
- Remove the unused omap-cpufreq driver (Andreas Kemnade)
- Optimize error handling code in cpufreq_boost_trigger_state() and
make cpufreq_boost_trigger_state() return -EOPNOTSUPP if no policy
supports boost (Lifeng Zheng)
- Update cpufreq-dt-platdev list for tegra, qcom, TI (Aaron Kling,
Dhruva Gole, and Konrad Dybcio)
- Minor improvements to the cpufreq and cpumask rust implementation
(Alexandre Courbot, Alice Ryhl, Tamir Duberstein, and Yilin Chen)
- Add support for AM62L3 SoC to the ti-cpufreq driver (Dhruva Gole)
- Update arch_freq_scale in the CPPC cpufreq driver's frequency
invariance engine (FIE) in scheduler ticks if the related CPPC
registers are not in PCC (Jie Zhan)
- Assorted minor cleanups and improvements in ARM cpufreq drivers
(Juan Martinez, Felix Gu, Luca Weiss, and Sergey Shtylyov)
- Add generic helpers for sysfs show/store to cppc_cpufreq (Sumit
Gupta)
- Make the scaling_setspeed cpufreq sysfs attribute return the actual
requested frequency to avoid confusion (Pengjie Zhang)
- Simplify the idle CPU time granularity test in the ondemand cpufreq
governor (Frederic Weisbecker)
- Enable asym capacity in intel_pstate only when CPU SMT is not
possible (Yaxiong Tian)
- Update the description of rate_limit_us default value in cpufreq
documentation (Yaxiong Tian)
- Add a command line option to adjust the C-states table in the
intel_idle driver, remove the 'preferred_cstates' module parameter
from it, add C-states validation to it and clean it up (Artem
Bityutskiy)
- Make the menu cpuidle governor always check the time till the
closest timer event when the scheduler tick has been stopped to
prevent it from mistakenly selecting the deepest available idle
state (Rafael Wysocki)
- Update the teo cpuidle governor to avoid making suboptimal
decisions in certain corner cases and generally improve idle state
selection accuracy (Rafael Wysocki)
- Remove an unlikely() annotation on the early-return condition in
menu_select() that leads to branch misprediction 100% of the time
on systems with only 1 idle state enabled, like ARM64 servers
(Breno Leitao)
- Add Christian Loehle to MAINTAINERS as a cpuidle reviewer
(Christian Loehle)
- Stop flagging the PM runtime workqueue as freezable to avoid system
suspend and resume deadlocks in subsystems that assume asynchronous
runtime PM to work during system-wide PM transitions (Rafael
Wysocki)
- Drop redundant NULL pointer checks before acomp_request_free() from
the hibernation code handling image saving (Rafael Wysocki)
- Update wakeup_sources_walk_start() to handle empty lists of wakeup
sources as appropriate (Samuel Wu)
- Make dev_pm_clear_wake_irq() check the power.wakeirq value under
power.lock to avoid race conditions (Gui-Dong Han)
- Avoid bit field races related to power.work_in_progress in the core
device suspend code (Xuewen Yan)
- Make several drivers discard pm_runtime_put() return value in
preparation for converting that function to a void one (Rafael
Wysocki)
- Add PL4 support for Ice Lake to the Intel RAPL power capping driver
(Daniel Tang)
- Replace sprintf() with sysfs_emit() in power capping sysfs show
functions (Sumeet Pawnikar)
- Make dev_pm_opp_get_level() return value match the documentation
after a previous update of the latter (Aleks Todorov)
- Use scoped for each OF child loop in the OPP code (Krzysztof
Kozlowski)
- Fix a bug in an example code snippet and correct typos in the
energy model management documentation (Patrick Little)
- Fix miscellaneous problems in cpupower (Kaushlendra Kumar):
* idle_monitor: Fix incorrect value logged after stop
* Fix inverted APERF capability check
* Use strcspn() to strip trailing newline
* Reset errno before strtoull()
* Show C0 in idle-info dump
- Improve cpupower installation procedure by making the systemd step
optional and allowing users to disable the installation of
systemd's unit file (João Marcos Costa)"
* tag 'pm-6.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (65 commits)
PM: sleep: core: Avoid bit field races related to work_in_progress
PM: sleep: wakeirq: harden dev_pm_clear_wake_irq() against races
cpufreq: Documentation: Update description of rate_limit_us default value
cpufreq: intel_pstate: Enable asym capacity only when CPU SMT is not possible
PM: wakeup: Handle empty list in wakeup_sources_walk_start()
PM: EM: Documentation: Fix bug in example code snippet
Documentation: Fix typos in energy model documentation
cpuidle: governors: teo: Refine intercepts-based idle state lookup
cpuidle: governors: teo: Adjust the classification of wakeup events
cpufreq: ondemand: Simplify idle cputime granularity test
cpufreq: userspace: make scaling_setspeed return the actual requested frequency
PM: hibernate: Drop NULL pointer checks before acomp_request_free()
cpufreq: CPPC: Add generic helpers for sysfs show/store
cpufreq: scmi: Fix device_node reference leak in scmi_cpu_domain_id()
cpufreq: ti-cpufreq: add support for AM62L3 SoC
cpufreq: dt-platdev: Add ti,am62l3 to blocklist
cpufreq/amd-pstate: Add comment explaining nominal_perf usage for performance policy
cpufreq: scmi: correct SCMI explanation
cpufreq: dt-platdev: Block the driver from probing on more QC platforms
rust: cpumask: rename methods of Cpumask for clarity and consistency
...
Pull ACPI updates from Rafael Wysocki:
"This one is significantly larger than previous ACPI support pull
requests because several significant updates have coincided in it.
First, there is a routine ACPICA code update, to upstream version
20251212, but this time it covers new ACPI 6.6 material that has not
been covered yet. Among other things, it includes definitions of a few
new ACPI tables and updates of some others, like the GICv5 MADT
structures and ARM IORT IWB node definitions that are used for adding
GICv5 ACPI probing on ARM (that technically is IRQ subsystem material,
but it depends on the ACPICA changes, so it is included here). The
latter alone adds a few hundred lines of new code.
Second, there is an update of ACPI _OSC handling including a fix that
prevents failures from occurring in some corner cases due to careless
handling of _OSC error bits.
On top of that, the "system resource" ACPI device objects with the
PNP0C01 and PNP0C02 are now going to be handled by the ACPI core
device enumeration code instead of handing them over to the legacy PNP
system driver which causes device enumeration issues to occur. Some of
those issues have been worked around in device drivers and elsewhere
and those workarounds should not be necessary any more, so they are
going away.
Moreover, the time has come to convert all "core ACPI" device drivers
that were still using struct acpi_driver objects for device binding
into proper platform drivers that use struct platform_driver for this
purpose. These updates are accompanied by some requisite core ACPI
device enumeration code changes.
Next, there are ACPI APEI updates, including changes to avoid excess
overhead in the NMI handler and in SEA on the ARM side, changes to
unify ACPI-based HW error tracing and logging, and changes to prevent
APEI code from reaching out of its allocated memory.
There are also some ACPI power management updates, mostly related to
the ACPI cpuidle support in the processor driver, suspend-to-idle
handling on systems with ACPI support and to ACPI PM of devices.
In addition to the above, bugs are fixed and the code is cleaned up in
assorted places all over.
Specifics:
- Update the ACPICA code in the kernel to upstream version 20251212
which includes the following changes:
* Add support for new ACPI table DTPR (Michal Camacho Romero)
* Release objects with acpi_ut_delete_object_desc() (Zilin Guan)
* Add UUIDs for Microsoft fan extensions and UUIDs associated with
TPM 2.0 devices (Armin Wolf)
* Fix NULL pointer dereference in acpi_ev_address_space_dispatch()
(Alexey Simakov)
* Add KEYP ACPI table definition (Dave Jiang)
* Add support for the Microsoft display mux _OSI string (Armin
Wolf)
* Add definitions for the IOVT ACPI table (Xianglai Li)
* Abort AML bytecode execution on AML_FATAL_OP (Armin Wolf)
* Include all fields in subtable type1 for PPTT (Ben Horgan)
* Add GICv5 MADT structures and Arm IORT IWB node definitions
(Jose Marinho)
* Update Parameter Block structure for RAS2 and add a new flag in
Memory Affinity Structure for SRAT (Pawel Chmielewski)
* Add _VDM (Voltage Domain) object (Pawel Chmielewski)
- Add support for GICv5 ACPI probing on ARM which is based on the
GICv5 MADT structures and ARM IORT IWB node definitions recently
added to ACPICA (Lorenzo Pieralisi)
- Rework ACPI PM notification setup for PCI root buses and modify the
ACPI PM setup for devices to register wakeup source objects under
physical (that is, PCI, platform, etc.) devices instead of doing
that under their ACPI companions (Rafael Wysocki)
- Adjust debug messages regarding postponed ACPI PM printed during
system resume to be more accurate (Rafael Wysocki)
- Remove dead code from lps0_device_attach() (Gergo Koteles)
- Start to invoke Microsoft Function 9 (Turn On Display) of the Low-
Power S0 Idle (LPS0) _DSM in the suspend-to-idle resume flow on
systems with ACPI LPS0 support to address a functional issue on
Lenovo Yoga Slim 7i Aura (15ILL9), where system fans and keyboard
backlights fail to resume after suspend (Jakob Riemenschneider)
- Add sysfs attribute cid for exposing _CID lists under ACPI device
objects (Rafael Wysocki)
- Replace sprintf() with sysfs_emit() in all of the core ACPI sysfs
interface code (Sumeet Pawnikar)
- Use acpi_get_local_u64_address() in the code implementing ACPI
support for PCI to evaluate _ADR instead of evaluating that object
directly (Andy Shevchenko)
- Add JWIPC JVC9100 to irq1_level_low_skip_override[] to unbreak
serial IRQs on that system (Ai Chao)
- Fix handling of _OSC errors in acpi_run_osc() to avoid failures on
systems where _OSC error bits are set even though the _OSC return
buffer contains acknowledged feature bits (Rafael Wysocki)
- Clean up and rearrange \_SB._OSC handling for general platform
features and USB4 features to avoid code duplication and
unnecessary memory management overhead (Rafael Wysocki)
- Make the ACPI core device enumeration code handle PNP0C01 and
PNP0C02 ("system resource") device objects directly instead of
letting the legacy PNP system driver handle them to avoid device
enumeration issues on systems where PNP0C02 is present in the _CID
list under ACPI device objects with a _HID matching a proper device
driver in Linux (Rafael Wysocki)
- Drop workarounds for the known device enumeration issues related to
_CID lists containing PNP0C02 (Rafael Wysocki)
- Drop outdated comment regarding removed function in the ACPI-based
device enumeration code (Julia Lawall)
- Make PRP0001 device matching work as expected for ACPI device
objects using it as a _HID for board development and similar
purposes (Kartik Rajput)
- Use async schedule function in acpi_scan_clear_dep_fn() to avoid
races with user space initialization on some systems (Yicong Yang)
- Add a piece of documentation explaining why binding drivers
directly to ACPI device objects is not a good idea in general and
why it is desirable to convert drivers doing so into proper
platform drivers that use struct platform_driver for device binding
(Rafael Wysocki)
- Convert multiple "core ACPI" drivers, including the NFIT ACPI
device driver, the generic ACPI button drivers, the generic ACPI
thermal zone driver, the ACPI hardware event device (HED) driver,
the ACPI EC driver, the ACPI SMBUS HC driver, the ACPI Smart
Battery Subsystem (SBS) driver, and the ACPI backlight (video)
driver to proper platform drivers that use struct platform_driver
for device binding (Rafael Wysocki)
- Use acpi_get_local_u64_address() in the ACPI backlight (video)
driver to evaluate _ADR instead of evaluating that object directly
(Andy Shevchenko)
- Convert the generic ACPI battery driver to a proper platform driver
using struct platform_driver for device binding (Rafael Wysocki)
- Fix incorrect charging status when current is zero in the generic
ACPI battery driver (Ata İlhan Köktürk)
- Use LIST_HEAD() for initializing a stack-allocated list in the
generic ACPI watchdog device driver (Can Peng)
- Rework the ACPI idle driver initialization to register it directly
from the common initialization code instead of doing that from a
CPU hotplug "online" callback and clean it up (Huisong Li, Rafael
Wysocki)
- Fix a possible NULL pointer dereference in
acpi_processor_errata_piix4() (Tuo Li)
- Make read-only array non_mmio_desc[] static const (Colin Ian King)
- Prevent the APEI GHES support code on ARM from accessing memory out
of bounds or going past the ARM processor CPER record buffer (Mauro
Carvalho Chehab)
- Prevent cper_print_fw_err() from dumping the entire memory on
systems with defective firmware (Mauro Carvalho Chehab)
- Improve ghes_notify_nmi() status check to avoid unnecessary
overhead in the NMI handler by carrying out all of the requisite
preparations and the NMI registration time (Tony Luck)
- Refactor the GHES driver by extracting common functionality into
reusable helper functions to reduce code duplication and improve
the ghes_notify_sea() status check in analogy with the previous
ghes_notify_nmi() status check improvement (Shuai Xue)
- Make ELOG and GHES log and trace consistently and support the CPER
CXL protocol analogously (Fabio De Francesco)
- Disable KASAN instrumentation in the APEI GHES driver when compile
testing with clang < 18 (Nathan Chancellor)
- Let ghes_edac be the preferred driver to load on __ZX__ and _BYO_
systems by extending the platform detection list in the APEI GHES
driver (Tony W Wang-oc)
- Clean up cppc_perf_caps and cppc_perf_ctrls structs and rename EPP
constants for clarity in the ACPI CPPC library (Sumit Gupta)"
* tag 'acpi-6.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (117 commits)
ACPI: battery: fix incorrect charging status when current is zero
ACPI: scan: Use async schedule function in acpi_scan_clear_dep_fn()
ACPI: x86: s2idle: Invoke Microsoft _DSM Function 9 (Turn On Display)
ACPI: APEI: GHES: Add ghes_edac support for __ZX__ and _BYO_ systems
ACPI: APEI: GHES: Disable KASAN instrumentation when compile testing with clang < 18
ACPI: sysfs: Replace sprintf() with sysfs_emit()
ACPI: CPPC: Rename EPP constants for clarity
ACPI: CPPC: Clean up cppc_perf_caps and cppc_perf_ctrls structs
ACPI: processor: idle: Rework the handling of acpi_processor_ffh_lpi_probe()
ACPI: processor: idle: Convert acpi_processor_setup_cpuidle_dev() to void
ACPI: processor: idle: Convert acpi_processor_setup_cpuidle_states() to void
irqchip/gic-v5: Add ACPI IWB probing
irqchip/gic-v5: Add ACPI ITS probing
irqchip/gic-v5: Add ACPI IRS probing
irqchip/gic-v5: Split IRS probing into OF and generic portions
PCI/MSI: Make the pci_msi_map_rid_ctlr_node() interface firmware agnostic
irqdomain: Add parent field to struct irqchip_fwid
ACPI: PCI: simplify code with acpi_get_local_u64_address()
ACPI: video: simplify code with acpi_get_local_u64_address()
ACPI: PM: Adjust messages regarding postponed ACPI PM
...
Pull block updates from Jens Axboe:
- Support for batch request processing for ublk, improving the
efficiency of the kernel/ublk server communication. This can yield
nice 7-12% performance improvements
- Support for integrity data for ublk
- Various other ublk improvements and additions, including a ton of
selftests additions and updated
- Move the handling of blk-crypto software fallback from below the
block layer to above it. This reduces the complexity of dealing with
bio splitting
- Series fixing a number of potential deadlocks in blk-mq related to
the queue usage counter and writeback throttling and rq-qos debugfs
handling
- Add an async_depth queue attribute, to resolve a performance
regression that's been around for a qhilw related to the scheduler
depth handling
- Only use task_work for IOPOLL completions on NVMe, if it is necessary
to do so. An earlier fix for an issue resulted in all these
completions being punted to task_work, to guarantee that completions
were only run for a given io_uring ring when it was local to that
ring. With the new changes, we can detect if it's necessary to use
task_work or not, and avoid it if possible.
- rnbd fixes:
- Fix refcount underflow in device unmap path
- Handle PREFLUSH and NOUNMAP flags properly in protocol
- Fix server-side bi_size for special IOs
- Zero response buffer before use
- Fix trace format for flags
- Add .release to rnbd_dev_ktype
- MD pull requests via Yu Kuai
- Fix raid5_run() to return error when log_init() fails
- Fix IO hang with degraded array with llbitmap
- Fix percpu_ref not resurrected on suspend timeout in llbitmap
- Fix GPF in write_page caused by resize race
- Fix NULL pointer dereference in process_metadata_update
- Fix hang when stopping arrays with metadata through dm-raid
- Fix any_working flag handling in raid10_sync_request
- Refactor sync/recovery code path, improve error handling for
badblocks, and remove unused recovery_disabled field
- Consolidate mddev boolean fields into mddev_flags
- Use mempool to allocate stripe_request_ctx and make sure
max_sectors is not less than io_opt in raid5
- Fix return value of mddev_trylock
- Fix memory leak in raid1_run()
- Add Li Nan as mdraid reviewer
- Move phys_vec definitions to the kernel types, mostly in preparation
for some VFIO and RDMA changes
- Improve the speed for secure erase for some devices
- Various little rust updates
- Various other minor fixes, improvements, and cleanups
* tag 'for-7.0/block-20260206' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (162 commits)
blk-mq: ABI/sysfs-block: fix docs build warnings
selftests: ublk: organize test directories by test ID
block: decouple secure erase size limit from discard size limit
block: remove redundant kill_bdev() call in set_blocksize()
blk-mq: add documentation for new queue attribute async_dpeth
block, bfq: convert to use request_queue->async_depth
mq-deadline: covert to use request_queue->async_depth
kyber: covert to use request_queue->async_depth
blk-mq: add a new queue sysfs attribute async_depth
blk-mq: factor out a helper blk_mq_limit_depth()
blk-mq-sched: unify elevators checking for async requests
block: convert nr_requests to unsigned int
block: don't use strcpy to copy blockdev name
blk-mq-debugfs: warn about possible deadlock
blk-mq-debugfs: add missing debugfs_mutex in blk_mq_debugfs_register_hctxs()
blk-mq-debugfs: remove blk_mq_debugfs_unregister_rqos()
blk-mq-debugfs: make blk_mq_debugfs_register_rqos() static
blk-rq-qos: fix possible debugfs_mutex deadlock
blk-mq-debugfs: factor out a helper to register debugfs for all rq_qos
blk-wbt: fix possible deadlock to nest pcpu_alloc_mutex under q_usage_counter
...
Pull io_uring bpf filters from Jens Axboe:
"This adds support for both cBPF filters for io_uring, as well as task
inherited restrictions and filters.
seccomp and io_uring don't play along nicely, as most of the
interesting data to filter on resides somewhat out-of-band, in the
submission queue ring.
As a result, things like containers and systemd that apply seccomp
filters, can't filter io_uring operations.
That leaves them with just one choice if filtering is critical -
filter the actual io_uring_setup(2) system call to simply disallow
io_uring. That's rather unfortunate, and has limited us because of it.
io_uring already has some filtering support. It requires the ring to
be setup in a disabled state, and then a filter set can be applied.
This filter set is completely bi-modal - an opcode is either enabled
or it's not. Once a filter set is registered, the ring can be enabled.
This is very restrictive, and it's not useful at all to systemd or
containers which really want both broader and more specific control.
This first adds support for cBPF filters for opcodes, which enables
tighter control over what exactly a specific opcode may do. As
examples, specific support is added for IORING_OP_OPENAT/OPENAT2,
allowing filtering on resolve flags. And another example is added for
IORING_OP_SOCKET, allowing filtering on domain/type/protocol. These
are both common use cases. cBPF was chosen rather than eBPF, because
the latter is often restricted in containers as well.
These filters are run post the init phase of the request, which allows
filters to even dip into data that is being passed in struct in user
memory, as the init side of requests make that data stable by bringing
it into the kernel. This allows filtering without needing to copy this
data twice, or have filters etc know about the exact layout of the
user data. The filters get the already copied and sanitized data
passed.
On top of that support is added for per-task filters, meaning that any
ring created with a task that has a per-task filter will get those
filters applied when it's created. These filters are inherited across
fork as well. Once a filter has been registered, any further added
filters may only further restrict what operations are permitted.
Filters cannot change the return value of an operation, they can only
permit or deny it based on the contents"
* tag 'io_uring-bpf-restrictions.4-20260206' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
io_uring: allow registration of per-task restrictions
io_uring: add task fork hook
io_uring/bpf_filter: add ref counts to struct io_bpf_filter
io_uring/bpf_filter: cache lookup table in ctx->bpf_filters
io_uring/bpf_filter: allow filtering on contents of struct open_how
io_uring/net: allow filtering on IORING_OP_SOCKET data
io_uring: add support for BPF filtering for opcode restrictions
Pull vfs 'struct filename' updates from Al Viro:
"[Mostly] sanitize struct filename handling"
* tag 'pull-filename' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (68 commits)
sysfs(2): fs_index() argument is _not_ a pathname
alpha: switch osf_mount() to strndup_user()
ksmbd: use CLASS(filename_kernel)
mqueue: switch to CLASS(filename)
user_statfs(): switch to CLASS(filename)
statx: switch to CLASS(filename_maybe_null)
quotactl_block(): switch to CLASS(filename)
chroot(2): switch to CLASS(filename)
move_mount(2): switch to CLASS(filename_maybe_null)
namei.c: switch user pathname imports to CLASS(filename{,_flags})
namei.c: convert getname_kernel() callers to CLASS(filename_kernel)
do_f{chmod,chown,access}at(): use CLASS(filename_uflags)
do_readlinkat(): switch to CLASS(filename_flags)
do_sys_truncate(): switch to CLASS(filename)
do_utimes_path(): switch to CLASS(filename_uflags)
chdir(2): unspaghettify a bit...
do_fchownat(): unspaghettify a bit...
fspick(2): use CLASS(filename_flags)
name_to_handle_at(): use CLASS(filename_uflags)
vfs_open_tree(): use CLASS(filename_uflags)
...
Pull misc vfs updates from Christian Brauner:
"This contains a mix of VFS cleanups, performance improvements, API
fixes, documentation, and a deprecation notice.
Scalability and performance:
- Rework pid allocation to only take pidmap_lock once instead of
twice during alloc_pid(), improving thread creation/teardown
throughput by 10-16% depending on false-sharing luck. Pad the
namespace refcount to reduce false-sharing
- Track file lock presence via a flag in ->i_opflags instead of
reading ->i_flctx, avoiding false-sharing with ->i_readcount on
open/close hot paths. Measured 4-16% improvement on 24-core
open-in-a-loop benchmarks
- Use a consume fence in locks_inode_context() to match the
store-release/load-consume idiom, eliminating a hardware fence on
some architectures
- Annotate cdev_lock with __cacheline_aligned_in_smp to prevent
false-sharing
- Remove a redundant DCACHE_MANAGED_DENTRY check in
__follow_mount_rcu() that never fires since the caller already
verifies it, eliminating a 100% mispredicted branch
- Fix a 100% mispredicted likely() in devcgroup_inode_permission()
that became wrong after a prior code reorder
Bug fixes and correctness:
- Make insert_inode_locked() wait for inode destruction instead of
skipping, fixing a corner case where two matching inodes could
exist in the hash
- Move f_mode initialization before file_ref_init() in alloc_file()
to respect the SLAB_TYPESAFE_BY_RCU ordering contract
- Add a WARN_ON_ONCE guard in try_to_free_buffers() for folios with
no buffers attached, preventing a null pointer dereference when
AS_RELEASE_ALWAYS is set but no release_folio op exists
- Fix select restart_block to store end_time as timespec64, avoiding
truncation of tv_sec on 32-bit architectures
- Make dump_inode() use get_kernel_nofault() to safely access inode
and superblock fields, matching the dump_mapping() pattern
API modernization:
- Make posix_acl_to_xattr() allocate the buffer internally since
every single caller was doing it anyway. Reduces boilerplate and
unnecessary error checking across ~15 filesystems
- Replace deprecated simple_strtoul() with kstrtoul() for the
ihash_entries, dhash_entries, mhash_entries, and mphash_entries
boot parameters, adding proper error handling
- Convert chardev code to use guard(mutex) and __free(kfree) cleanup
patterns
- Replace min_t() with min() or umin() in VFS code to avoid silently
truncating unsigned long to unsigned int
- Gate LOOKUP_RCU assertions behind CONFIG_DEBUG_VFS since callers
already check the flag
Deprecation:
- Begin deprecating legacy BSD process accounting (acct(2)). The
interface has numerous footguns and better alternatives exist
(eBPF)
Documentation:
- Fix and complete kernel-doc for struct export_operations, removing
duplicated documentation between ReST and source
- Fix kernel-doc warnings for __start_dirop() and ilookup5_nowait()
Testing:
- Add a kunit test for initramfs cpio handling of entries with
filesize > PATH_MAX
Misc:
- Add missing <linux/init_task.h> include in fs_struct.c"
* tag 'vfs-7.0-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (28 commits)
posix_acl: make posix_acl_to_xattr() alloc the buffer
fs: make insert_inode_locked() wait for inode destruction
initramfs_test: kunit test for cpio.filesize > PATH_MAX
fs: improve dump_inode() to safely access inode fields
fs: add <linux/init_task.h> for 'init_fs'
docs: exportfs: Use source code struct documentation
fs: move initializing f_mode before file_ref_init()
exportfs: Complete kernel-doc for struct export_operations
exportfs: Mark struct export_operations functions at kernel-doc
exportfs: Fix kernel-doc output for get_name()
acct(2): begin the deprecation of legacy BSD process accounting
device_cgroup: remove branch hint after code refactor
VFS: fix __start_dirop() kernel-doc warnings
fs: Describe @isnew parameter in ilookup5_nowait()
fs/namei: Remove redundant DCACHE_MANAGED_DENTRY check in __follow_mount_rcu
fs: only assert on LOOKUP_RCU when built with CONFIG_DEBUG_VFS
select: store end_time as timespec64 in restart block
chardev: Switch to guard(mutex) and __free(kfree)
namespace: Replace simple_strtoul with kstrtoul to parse boot params
dcache: Replace simple_strtoul with kstrtoul in set_dhash_entries
...
Pull lsm updates from Paul Moore:
- Unify the security_inode_listsecurity() calls in NFSv4
While looking at security_inode_listsecurity() with an eye towards
improving the interface, we realized that the NFSv4 code was making
multiple calls to the LSM hook that could be consolidated into one.
- Mark the LSM static branch keys as static - this helps resolve some
sparse warnings
- Add __rust_helper annotations to the LSM and cred wrapper functions
- Remove the unsused set_security_override_from_ctx() function
- Minor fixes to some of the LSM kdoc comment blocks
* tag 'lsm-pr-20260203' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm:
lsm: make keys for static branch static
cred: remove unused set_security_override_from_ctx()
rust: security: add __rust_helper to helpers
rust: cred: add __rust_helper to helpers
nfs: unify security_inode_listsecurity() calls
lsm: fix kernel-doc struct member names
Pull audit updates from Paul Moore:
- Improve the NETFILTER_PKT audit records
Add source and destination ports to the NETFILTER_PKT audit records
while also consolidating a lot of the code into a new, singular
audit_log_nf_skb() function. This new approach to structuring the
NETFILTER_PKT record generation should eliminate some unnecessary
overhead when audit is not built into the kernel.
- Update the audit syscall classifier code
Add the listxattrat(), getxattrat(), and fchmodat2() syscall to the
audit code which classifies syscalls into categories of operations,
e.g. "read" or "change attributes".
- Move the syscall classifier declarations into audit_arch.h
Shuffle around some header file declarations to resolve some sparse
warnings.
* tag 'audit-pr-20260203' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
audit: move the compat_xxx_class[] extern declarations to audit_arch.h
audit: add missing syscalls to read class
audit: include source and destination ports to NETFILTER_PKT
audit: add audit_log_nf_skb helper function
audit: add fchmodat2() to change attributes class
Pull RCU updates from Boqun Feng:
- RCU Tasks Trace:
Re-implement RCU tasks trace in term of SRCU-fast, not only more than
500 lines of code are saved because of the reimplementation, a new
set of API, rcu_read_{,un}lock_tasks_trace(), becomes possible as
well. Compared to the previous rcu_read_{,un}lock_trace(), the new
API avoid the task_struct accesses thanks to the SRCU-fast semantics.
As a result, the old rcu_read{,un}lock_trace() API is now deprecated.
- RCU Torture Test:
- Multiple improvements on kvm-series.sh (parallel run and
progress showing metrics)
- Add context checks to rcu_torture_timer()
- Make config2csv.sh properly handle comments in .boot files
- Include commit discription in testid.txt
- Miscellaneous RCU changes:
- Reduce synchronize_rcu() latency by reporting GP kthread's
CPU QS early
- Use suitable gfp_flags for the init_srcu_struct_nodes()
- Fix rcu_read_unlock() deadloop due to softirq
- Correctly compute probability to invoke ->exp_current()
in rcutorture
- Make expedited RCU CPU stall warnings detect stall-end races
- RCU nocb:
- Remove unnecessary WakeOvfIsDeferred wake path and callback
overload handling
- Extract nocb_defer_wakeup_cancel() helper
* tag 'rcu.release.v7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux: (25 commits)
rcu/nocb: Extract nocb_defer_wakeup_cancel() helper
rcu/nocb: Remove dead callback overload handling
rcu/nocb: Remove unnecessary WakeOvfIsDeferred wake path
rcu: Reduce synchronize_rcu() latency by reporting GP kthread's CPU QS early
srcu: Use suitable gfp_flags for the init_srcu_struct_nodes()
rcu: Fix rcu_read_unlock() deadloop due to softirq
rcutorture: Correctly compute probability to invoke ->exp_current()
rcu: Make expedited RCU CPU stall warnings detect stall-end races
rcutorture: Add --kill-previous option to terminate previous kvm.sh runs
rcutorture: Prevent concurrent kvm.sh runs on same source tree
torture: Include commit discription in testid.txt
torture: Make config2csv.sh properly handle comments in .boot files
torture: Make kvm-series.sh give run numbers and totals
torture: Make kvm-series.sh give build numbers and totals
torture: Parallelize kvm-series.sh guest-OS execution
rcutorture: Add context checks to rcu_torture_timer()
rcutorture: Test rcu_tasks_trace_expedite_current()
srcu: Create an rcu_tasks_trace_expedite_current() function
checkpatch: Deprecate rcu_read_{,un}lock_trace()
rcu: Update Requirements.rst for RCU Tasks Trace
...
Pull scheduler fixes from Ingo Molnar:
"Miscellaneous MMCID fixes to address bugs and performance regressions
in the recent rewrite of the SCHED_MM_CID management code:
- Fix livelock triggered by BPF CI testing
- Fix hard lockup on weakly ordered systems
- Simplify the dropping of CIDs in the exit path by removing an
unintended transition phase
- Fix performance/scalability regression on a thread-pool benchmark
by optimizing transitional CIDs when scheduling out"
* tag 'sched-urgent-2026-02-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/mmcid: Optimize transitional CIDs when scheduling out
sched/mmcid: Drop per CPU CID immediately when switching to per task mode
sched/mmcid: Protect transition on weakly ordered systems
sched/mmcid: Prevent live lock on task to CPU mode transition
Pull objtool fixes from Ingo Molnar::
- Bump up the Clang minimum version requirements for livepatch
builds, due to Clang assembler section handling bugs causing
silent miscompilations
- Strip livepatching symbol artifacts from non-livepatch modules
- Fix livepatch build warnings when certain Clang LTO options
are enabled
- Fix livepatch build error when CONFIG_MEM_ALLOC_PROFILING_DEBUG=y
* tag 'objtool-urgent-2026-02-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
objtool/klp: Fix unexported static call key access for manually built livepatch modules
objtool/klp: Fix symbol correlation for orphaned local symbols
livepatch: Free klp_{object,func}_ext data after initialization
livepatch: Fix having __klp_objects relics in non-livepatch modules
livepatch/klp-build: Require Clang assembler >= 20
Pull tracing fix from Steven Rostedt:
- Fix event format field alignments for 32 bit architectures
The fields in the event format files are used to parse the raw binary
buffer data by applications. If they are incorrect, then the
application produces garbage.
On 32 bit architectures, the function graph 64bit calltime and
rettime were off by 4bytes. That's because the actual fields are in a
packed structure but the macros used by the ftrace events did not
mark them as packed, and instead, gave them their natural alignment
which made their offsets off by 4 bytes.
There are macros to have a packed field within an embedded structure
of an event, but there's no macro for normal fields within a packed
structure of the event. The macro __field_packed() was used for the
packed embedded structure field. Rename that to __field_desc_packed()
(to match the non-packed embedded field macro __field_desc()), and
make __field_packed() for fields that are in a packed event structure
(which matches the unpacked __field() macro).
Switch the calltime and rettime fields of the function graph event to
use the new __field_packed() and this makes the offsets correct.
* tag 'trace-v6.19-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing: Fix ftrace event field alignments
Pull dma-mapping fixes from Marek Szyprowski:
"Two minor fixes for the DMA-mapping subsystem:
- check for the rare case of the allocation failure of the global CMA
pool (Shanker Donthineni)
- avoid perf buffer overflow when tracing large scatter-gather lists
(Deepanshu Kartikey)"
* tag 'dma-mapping-6.19-2026-02-06' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux:
dma: contiguous: Check return value of dma_contiguous_reserve_area()
tracing/dma: Cap dma_map_sg tracepoint arrays to prevent buffer overflow
Called when copy_process() is called to copy state to a new child.
Right now this is just a stub, but will be used shortly to properly
handle fork'ing of task based io_uring restrictions.
Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The linker script scripts/module.lds.S specifies that all input
__klp_objects sections should be consolidated into an output section of
the same name, and start/stop symbols should be created to enable
scripts/livepatch/init.c to locate this data.
This start/stop pattern is not ideal for modules because the symbols are
created even if no __klp_objects input sections are present.
Consequently, a dummy __klp_objects section also appears in the
resulting module. This unnecessarily pollutes non-livepatch modules.
Instead, since modules are relocatable files, the usual method for
locating consolidated data in a module is to read its section table.
This approach avoids the aforementioned problem.
The klp_modinfo already stores a copy of the entire section table with
the final addresses. Introduce a helper function that
scripts/livepatch/init.c can call to obtain the location of the
__klp_objects section from this data.
Fixes: dd590d4d57 ("objtool/klp: Introduce klp diff subcommand for diffing object files")
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
Acked-by: Joe Lawrence <joe.lawrence@redhat.com>
Acked-by: Miroslav Benes <mbenes@suse.cz>
Reviewed-by: Aaron Tomlin <atomlin@atomlin.com>
Link: https://patch.msgid.link/20260123102825.3521961-2-petr.pavlu@suse.com
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
The fields of ftrace specific events (events used to save ftrace internal
events like function traces and trace_printk) are generated similarly to
how normal trace event fields are generated. That is, the fields are added
to a trace_events_fields array that saves the name, offset, size,
alignment and signness of the field. It is used to produce the output in
the format file in tracefs so that tooling knows how to parse the binary
data of the trace events.
The issue is that some of the ftrace event structures are packed. The
function graph exit event structures are one of them. The 64 bit calltime
and rettime fields end up 4 byte aligned, but the algorithm to show to
userspace shows them as 8 byte aligned.
The macros that create the ftrace events has one for embedded structure
fields. There's two macros for theses fields:
__field_desc() and __field_packed()
The difference of the latter macro is that it treats the field as packed.
Rename that field to __field_desc_packed() and create replace the
__field_packed() to be a normal field that is packed and have the calltime
and rettime use those.
This showed up on 32bit architectures for function graph time fields. It
had:
~# cat /sys/kernel/tracing/events/ftrace/funcgraph_exit/format
[..]
field:unsigned long func; offset:8; size:4; signed:0;
field:unsigned int depth; offset:12; size:4; signed:0;
field:unsigned int overrun; offset:16; size:4; signed:0;
field:unsigned long long calltime; offset:24; size:8; signed:0;
field:unsigned long long rettime; offset:32; size:8; signed:0;
Notice that overrun is at offset 16 with size 4, where in the structure
calltime is at offset 20 (16 + 4), but it shows the offset at 24. That's
because it used the alignment of unsigned long long when used as a
declaration and not as a member of a structure where it would be aligned
by word size (in this case 4).
By using the proper structure alignment, the format has it at the correct
offset:
~# cat /sys/kernel/tracing/events/ftrace/funcgraph_exit/format
[..]
field:unsigned long func; offset:8; size:4; signed:0;
field:unsigned int depth; offset:12; size:4; signed:0;
field:unsigned int overrun; offset:16; size:4; signed:0;
field:unsigned long long calltime; offset:20; size:8; signed:0;
field:unsigned long long rettime; offset:28; size:8; signed:0;
Cc: stable@vger.kernel.org
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reported-by: "jempty.liang" <imntjempty@163.com>
Link: https://patch.msgid.link/20260204113628.53faec78@gandalf.local.home
Fixes: 04ae87a520 ("ftrace: Rework event_create_dir()")
Closes: https://lore.kernel.org/all/20260130015740.212343-1-imntjempty@163.com/
Closes: https://lore.kernel.org/all/20260202123342.2544795-1-imntjempty@163.com/
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Pull misc fixes from Andrew Morton:
"Five hotfixes. Two are cc:stable, two are for MM.
All are singletons - please see the changelogs for details"
* tag 'mm-hotfixes-stable-2026-02-04-15-55' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
Documentation: document liveupdate cmdline parameter
mm, shmem: prevent infinite loop on truncate race
mailmap: update Alexander Mikhalitsyn's emails
liveupdate: luo_file: do not clear serialized_data on unfreeze
x86/kfence: fix booting on 32bit non-PAE systems
Pull sched_ext fix from Tejun Heo:
- Fix race where sched_class operations (sched_setscheduler() and
friends) could be invoked on dead tasks after sched_ext_dead()
already ran, causing invalid SCX task state transitions and NULL
pointer dereferences.
This was a regression from the cgroup exit ordering fix which
moved sched_ext_free() to finish_task_switch().
* tag 'sched_ext-for-6.19-rc8-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
sched_ext: Short-circuit sched_class operations on dead tasks
7900aa699c ("sched_ext: Fix cgroup exit ordering by moving sched_ext_free()
to finish_task_switch()") moved sched_ext_free() to finish_task_switch() and
renamed it to sched_ext_dead() to fix cgroup exit ordering issues. However,
this created a race window where certain sched_class ops may be invoked on
dead tasks leading to failures - e.g. sched_setscheduler() may try to switch a
task which finished sched_ext_dead() back into SCX triggering invalid SCX task
state transitions.
Add task_dead_and_done() which tests whether a task is TASK_DEAD and has
completed its final context switch, and use it to short-circuit sched_class
operations which may be called on dead tasks.
Fixes: 7900aa699c ("sched_ext: Fix cgroup exit ordering by moving sched_ext_free() to finish_task_switch()")
Reported-by: Andrea Righi <arighi@nvidia.com>
Link: http://lkml.kernel.org/r/20260202151341.796959-1-arighi@nvidia.com
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Merge updates related to runtime PM for 6.20-rc1/7.0-rc1:
- Make several drivers discard pm_runtime_put() return value in
preparation for converting that function to a void one (Rafael
Wysocki)
* pm-runtime:
drm: Discard pm_runtime_put() return value
genirq/chip: Change irq_chip_pm_put() return type to void
scsi: ufs: core: Discard pm_runtime_put() return values
platform/chrome: cros_hps_i2c: Discard pm_runtime_put() return value
coresight: Discard pm_runtime_put() return values
hwspinlock: omap: Discard pm_runtime_put() return value
watchdog: rzv2h_wdt: Discard pm_runtime_put() return value
watchdog: rz: Discard pm_runtime_put() return values
media: ccs: Discard pm_runtime_put() return value
drm/imagination: Discard pm_runtime_put() return value
USB: core: Discard pm_runtime_put() return value
Merge updates related to system suspend and hibernation for
6.20-rc1/7.0-rc1:
- Stop flagging the PM runtime workqueue as freezable to avoid system
suspend and resume deadlocks in subsystems that assume asynchronous
runtime PM to work during system-wide PM transitions (Rafael Wysocki)
- Drop redundant NULL pointer checks before acomp_request_free() from
the hibernation code handling image saving (Rafael Wysocki)
- Update wakeup_sources_walk_start() to handle empty lists of wakeup
sources as appropriate (Samuel Wu)
- Make dev_pm_clear_wake_irq() check the power.wakeirq value under
power.lock to avoid race conditions (Gui-Dong Han)
- Avoid bit field races related to power.work_in_progress in the core
device suspend code (Xuewen Yan)
* pm-sleep:
PM: sleep: core: Avoid bit field races related to work_in_progress
PM: sleep: wakeirq: harden dev_pm_clear_wake_irq() against races
PM: wakeup: Handle empty list in wakeup_sources_walk_start()
PM: hibernate: Drop NULL pointer checks before acomp_request_free()
PM: sleep: Do not flag runtime PM workqueue as freezable
During the investigation of the various transition mode issues
instrumentation revealed that the amount of bitmap operations can be
significantly reduced when a task with a transitional CID schedules out
after the fixup function completed and disabled the transition mode.
At that point the mode is stable and therefore it is not required to drop
the transitional CID back into the pool. As the fixup is complete the
potential exhaustion of the CID pool is not longer possible, so the CID can
be transferred to the scheduling out task or to the CPU depending on the
current ownership mode.
The racy snapshot of mm_cid::mode which contains both the ownership state
and the transition bit is valid because runqueue lock is held and the fixup
function of a concurrent mode switch is serialized.
Assigning the ownership right there not only spares the bitmap access for
dropping the CID it also avoids it when the task is scheduled back in as it
directly hits the fast path in both modes when the CID is within the
optimal range. If it's outside the range the next schedule in will need to
converge so dropping it right away is sensible. In the good case this also
allows to go into the fast path on the next schedule in operation.
With a thread pool benchmark which is configured to cross the mode switch
boundaries frequently this reduces the number of bitmap operations by about
30% and increases the fastpath utilization in the low single digit
percentage range.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20260201192835.100194627@kernel.org
When a exiting task initiates the switch from per CPU back to per task
mode, it has already dropped its CID and marked itself inactive. But a
leftover from an earlier iteration of the rework then reassigns the per
CPU CID to the exiting task with the transition bit set.
That's wrong as the task is already marked CID inactive, which means it is
inconsistent state. It's harmless because the CID is marked in transit and
therefore dropped back into the pool when the exiting task schedules out
either through preemption or the final schedule().
Simply drop the per CPU CID when the exiting task triggered the transition.
Fixes: fbd0e71dc3 ("sched/mmcid: Provide CID ownership mode fixup functions")
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20260201192835.032221009@kernel.org
Shrikanth reported a hard lockup which he observed once. The stack trace
shows the following CID related participants:
watchdog: CPU 23 self-detected hard LOCKUP @ mm_get_cid+0xe8/0x188
NIP: mm_get_cid+0xe8/0x188
LR: mm_get_cid+0x108/0x188
mm_cid_switch_to+0x3c4/0x52c
__schedule+0x47c/0x700
schedule_idle+0x3c/0x64
do_idle+0x160/0x1b0
cpu_startup_entry+0x48/0x50
start_secondary+0x284/0x288
start_secondary_prolog+0x10/0x14
watchdog: CPU 11 self-detected hard LOCKUP @ plpar_hcall_norets_notrace+0x18/0x2c
NIP: plpar_hcall_norets_notrace+0x18/0x2c
LR: queued_spin_lock_slowpath+0xd88/0x15d0
_raw_spin_lock+0x80/0xa0
raw_spin_rq_lock_nested+0x3c/0xf8
mm_cid_fixup_cpus_to_tasks+0xc8/0x28c
sched_mm_cid_exit+0x108/0x22c
do_exit+0xf4/0x5d0
make_task_dead+0x0/0x178
system_call_exception+0x128/0x390
system_call_vectored_common+0x15c/0x2ec
The task on CPU11 is running the CID ownership mode change fixup function
and is stuck on a runqueue lock. The task on CPU23 is trying to get a CID
from the pool with the same runqueue lock held, but the pool is empty.
After decoding a similar issue in the opposite direction switching from per
task to per CPU mode the tool which models the possible scenarios failed to
come up with a similar loop hole.
This showed up only once, was not reproducible and according to tooling not
related to a overlooked scheduling scenario permutation. But the fact that
it was observed on a PowerPC system gave the right hint: PowerPC is a
weakly ordered architecture.
The transition mechanism does:
WRITE_ONCE(mm->mm_cid.transit, MM_CID_TRANSIT);
WRITE_ONCE(mm->mm_cid.percpu, new_mode);
fixup()
WRITE_ONCE(mm->mm_cid.transit, 0);
mm_cid_schedin() does:
if (!READ_ONCE(mm->mm_cid.percpu))
...
cid |= READ_ONCE(mm->mm_cid.transit);
so weakly ordered systems can observe percpu == false and transit == 0 even
if the fixup function has not yet completed. As a consequence the task will
not drop the CID when scheduling out before the fixup is completed, which
means the CID space can be exhausted and the next task scheduling in will
loop in mm_get_cid() and the fixup thread can livelock on the held runqueue
lock as above.
This could obviously be solved by using:
smp_store_release(&mm->mm_cid.percpu, true);
and
smp_load_acquire(&mm->mm_cid.percpu);
but that brings a memory barrier back into the scheduler hotpath, which was
just designed out by the CID rewrite.
That can be completely avoided by combining the per CPU mode and the
transit storage into a single mm_cid::mode member and ordering the stores
against the fixup functions to prevent the CPU from reordering them.
That makes the update of both states atomic and a concurrent read observes
always consistent state.
The price is an additional AND operation in mm_cid_schedin() to evaluate
the per CPU or the per task path, but that's in the noise even on strongly
ordered architectures as the actual load can be significantly more
expensive and the conditional branch evaluation is there anyway.
Fixes: fbd0e71dc3 ("sched/mmcid: Provide CID ownership mode fixup functions")
Closes: https://lore.kernel.org/bdfea828-4585-40e8-8835-247c6a8a76b0@linux.ibm.com
Reported-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20260201192834.965217106@kernel.org
Ihor reported a BPF CI failure which turned out to be a live lock in the
MM_CID management. The scenario is:
A test program creates the 5th thread, which means the MM_CID users become
more than the number of CPUs (four in this example), so it switches to per
CPU ownership mode.
At this point each live task of the program has a CID associated. Assume
thread creation order assignment for simplicity.
T0 CID0 runs fork() and creates T4
T1 CID1
T2 CID2
T3 CID3
T4 --- not visible yet
T0 sets mm_cid::percpu = true and transfers its own CID to CPU0 where it
runs on and then starts the fixup which walks through the threads to
transfer the per task CIDs either to the CPU the task is running on or drop
it back into the pool if the task is not on a CPU.
During that T1 - T3 are free to schedule in and out before the fixup caught
up with them. Going through all possible permutations with a python script
revealed a few problematic cases. The most trivial one is:
T1 schedules in on CPU1 and observes percpu == true, so it transfers
its CID to CPU1
T1 is migrated to CPU2 and schedule in observes percpu == true, but
CPU2 does not have a CID associated and T1 transferred its own to
CPU1
So it has to allocate one with CPU2 runqueue lock held, but the
pool is empty, so it keeps looping in mm_get_cid().
Now T0 reaches T1 in the thread walk and tries to lock the corresponding
runqueue lock, which is held causing a full live lock.
There is a similar scenario in the reverse direction of switching from per
CPU to task mode which is way more obvious and got therefore addressed by
an intermediate mode. In this mode the CIDs are marked with MM_CID_TRANSIT,
which means that they are neither owned by the CPU nor by the task. When a
task schedules out with a transit CID it drops the CID back into the pool
making it available for others to use temporarily. Once the task which
initiated the mode switch finished the fixup it clears the transit mode and
the process goes back into per task ownership mode.
Unfortunately this insight was not mapped back to the task to CPU mode
switch as the above described scenario was not considered in the analysis.
Apply the same transit mechanism to the task to CPU mode switch to handle
these problematic cases correctly.
As with the CPU to task transition this results in a potential temporary
contention on the CID bitmap, but that's only for the time it takes to
complete the transition. After that it stays in steady mode which does not
touch the bitmap at all.
Fixes: fbd0e71dc3 ("sched/mmcid: Provide CID ownership mode fixup functions")
Closes: https://lore.kernel.org/2b7463d7-0f58-4e34-9775-6e2115cfb971@linux.dev
Reported-by: Ihor Solodrai <ihor.solodrai@linux.dev>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20260201192834.897115238@kernel.org
It may not appear obvious why kthread_affine_node() is not called before
the kthread creation completion instead of after the first wake-up.
The reason is that kthread_affine_node() applies a default affinity
behaviour that only takes place if no affinity preference have already
been passed by the kthread creation call site.
Add a comment to clarify that.
Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Marco Crivellari <marco.crivellari@suse.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
When cpuset isolated partitions get updated, unbound kthreads get
indifferently affine to all non isolated CPUs, regardless of their
individual affinity preferences.
For example kswapd is a per-node kthread that prefers to be affine to
the node it refers to. Whenever an isolated partition is created,
updated or deleted, kswapd's node affinity is going to be broken if any
CPU in the related node is not isolated because kswapd will be affine
globally.
Fix this with letting the consolidated kthread managed affinity code do
the affinity update on behalf of cpuset.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Waiman Long <longman@redhat.com>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Marco Crivellari <marco.crivellari@suse.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
Cc: cgroups@vger.kernel.org
The managed affinity list currently contains only unbound kthreads that
have affinity preferences. Unbound kthreads globally affine by default
are outside of the list because their affinity is automatically managed
by the scheduler (through the fallback housekeeping mask) and by cpuset.
However in order to preserve the preferred affinity of kthreads, cpuset
will delegate the isolated partition update propagation to the
housekeeping and kthread code.
Prepare for that with including all unbound kthreads in the managed
affinity list.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Waiman Long <longman@redhat.com>
Cc: Marco Crivellari <marco.crivellari@suse.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
The kthreads preferred affinity related fields use "hotplug" as the base
of their naming because the affinity management was initially deemed to
deal with CPU hotplug.
The scope of this role is going to broaden now and also deal with
cpuset isolated partition updates.
Switch the naming accordingly.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Waiman Long <longman@redhat.com>
Cc: Marco Crivellari <marco.crivellari@suse.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
Cpuset isolated partitions are now included in HK_TYPE_DOMAIN. Testing
if a CPU is part of an isolated partition alone is now useless.
Remove the superflous test.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Waiman Long <longman@redhat.com>
Until now, cpuset would propagate isolated partition changes to
timer migration so that unbound timers don't get migrated to isolated
CPUs.
Since housekeeping now centralizes, synchronize and propagates isolation
cpumask changes, perform the work from that subsystem for consolidation
and consistency purposes.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
The HK_TYPE_DOMAIN housekeeping cpumask is now modifiable at runtime. In
order to synchronize against PCI probe works and make sure that no
asynchronous probing is still pending or executing on a newly isolated
CPU, the housekeeping subsystem must flush the PCI probe works.
However the PCI probe works can't be flushed easily since they are
queued to the main per-CPU workqueue pool.
Solve this with creating a PCI probe-specific pool and provide and use
the appropriate flushing API.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Cc: Marco Crivellari <marco.crivellari@suse.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
Cc: linux-pci@vger.kernel.org
The HK_TYPE_DOMAIN housekeeping cpumask is now modifiable at runtime.
In order to synchronize against vmstat workqueue to make sure
that no asynchronous vmstat work is still pending or executing on a
newly made isolated CPU, the housekeeping susbsystem must flush the
vmstat workqueues.
This involves flushing the whole mm_percpu_wq workqueue, shared with
LRU drain, introducing here a welcome side effect.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Marco Crivellari <marco.crivellari@suse.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
Cc: linux-mm@kvack.org
Until now, HK_TYPE_DOMAIN used to only include boot defined isolated
CPUs passed through isolcpus= boot option. Users interested in also
knowing the runtime defined isolated CPUs through cpuset must use
different APIs: cpuset_cpu_is_isolated(), cpu_is_isolated(), etc...
There are many drawbacks to that approach:
1) Most interested subsystems want to know about all isolated CPUs, not
just those defined on boot time.
2) cpuset_cpu_is_isolated() / cpu_is_isolated() are not synchronized with
concurrent cpuset changes.
3) Further cpuset modifications are not propagated to subsystems
Solve 1) and 2) and centralize all isolated CPUs within the
HK_TYPE_DOMAIN housekeeping cpumask.
Subsystems can rely on RCU to synchronize against concurrent changes.
The propagation mentioned in 3) will be handled in further patches.
[Chen Ridong: Fix cpu_hotplug_lock deadlock and use correct static
branch API]
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Waiman Long <longman@redhat.com>
Reviewed-by: Chen Ridong <chenridong@huawei.com>
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Cc: "Michal Koutný" <mkoutny@suse.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Marco Crivellari <marco.crivellari@suse.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
Cc: cgroups@vger.kernel.org
HK_TYPE_DOMAIN's cpumask will soon be made modifiable by cpuset.
A synchronization mechanism is then needed to synchronize the updates
with the housekeeping cpumask readers.
Turn the housekeeping cpumasks into RCU pointers. Once a housekeeping
cpumask will be modified, the update side will wait for an RCU grace
period and propagate the change to interested subsystem when deemed
necessary.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Marco Crivellari <marco.crivellari@suse.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
Testing housekeeping_cpu() will soon require that either the RCU "lock"
is held or the cpuset mutex.
When CPUs get isolated through cpuset, the change is propagated to
timer migration such that isolation is also performed from the migration
tree. However that propagation is done using workqueue which tests if
the target is actually isolated before proceeding.
Lockdep doesn't know that the workqueue caller holds cpuset mutex and
that it waits for the work, making the housekeeping cpumask read safe.
Shut down the future warning by removing this test. It is unecessary
beyond hotplug, the workqueue is already targeted towards isolated CPUs.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Gabriele Monaco <gmonaco@redhat.com>
HK_TYPE_DOMAIN will soon integrate not only boot defined isolcpus= CPUs
but also cpuset isolated partitions.
Housekeeping still needs a way to record what was initially passed
to isolcpus= in order to keep these CPUs isolated after a cpuset
isolated partition is modified or destroyed while containing some of
them.
Create a new HK_TYPE_DOMAIN_BOOT to keep track of those.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Marco Crivellari <marco.crivellari@suse.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
0-day bot flagged the use of strcpy() in blk_trace_setup(), because the
source buffer can theoretically be bigger than the destination buffer.
While none of the current callers pass a string bigger than
BLKTRACE_BDEV_SIZE, use strscpy() to prevent eventual future misuse and
silence the checker warnings.
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/r/202602020718.GUEIRyG9-lkp@intel.com/
Fixes: 113cbd6282 ("blktrace: pass blk_user_trace2 to setup functions")
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Patch series "liveupdate: fixes in error handling".
This series contains some fixes in LUO's error handling paths.
The first patch deals with failed freeze() attempts. The cleanup path
calls unfreeze, and that clears some data needed by later unpreserve
calls.
The second patch is a bit more involved. It deals with failed retrieve()
attempts. To do so properly, it reworks some of the error handling logic
in luo_file core.
Both these fixes are "theoretical" -- in the sense that I have not been
able to reproduce either of them in normal operation. The only supported
file type right now is memfd, and there is nothing userspace can do right
now to make it fail its retrieve or freeze. I need to make the retrieve
or freeze fail by artificially injecting errors. The injected errors
trigger a use-after-free and a double-free.
That said, once more complex file handlers are added or memfd preservation
is used in ways not currently expected or covered by the tests, we will be
able to see them on real systems.
This patch (of 2):
The unfreeze operation is supposed to undo the effects of the freeze
operation. serialized_data is not set by freeze, but by preserve.
Consequently, the unpreserve operation needs to access serialized_data to
undo the effects of the preserve operation. This includes freeing the
serialized data structures for example.
If a freeze callback fails, unfreeze is called for all frozen files. This
would clear serialized_data for them. Since live update has failed, it
can be expected that userspace aborts, releasing all sessions. When the
sessions are released, unpreserve will be called for all files. The
unfrozen files will see 0 in their serialized_data. This is not expected
by file handlers, and they might either fail, leaking data and state, or
might even crash or cause invalid memory access.
Do not clear serialized_data on unfreeze so it gets passed on to
unpreserve. There is no need to clear it on unpreserve since luo_file
will be freed immediately after.
Link: https://lkml.kernel.org/r/20260126230302.2936817-1-pratyush@kernel.org
Link: https://lkml.kernel.org/r/20260126230302.2936817-2-pratyush@kernel.org
Fixes: 7c722a7f44 ("liveupdate: luo_file: implement file systems callbacks")
Signed-off-by: Pratyush Yadav (Google) <pratyush@kernel.org>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>