linux

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2026-05-16 16:01:44 -04:00

Author	SHA1	Message	Date
Linus Torvalds	ee7226b2ae	Merge tag 'io_uring-7.1-20260515' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Pull io_uring fixes from Jens Axboe: - Small series sanitizing the locking done for either modifying or reading a chain of requests - If the application has a pid namespace, ensure that the sqthread pid is correctly printed in fdinfo - Fix for a hashing issue in the io-wq thread pool, which could lead to a use-after-free - Kill dead argument from io_prep_rw_pi() - Fix for a missed validation of the CQ ring head, affecting CQE refill * tag 'io_uring-7.1-20260515' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: io_uring: validate user-controlled cq.head in io_cqe_cache_refill() io-wq: check that the predecessor is hashed in io_wq_remove_pending() io_uring/rw: drop unused attr_type_mask from io_prep_rw_pi() io_uring: hold uring_lock across io_kill_timeouts() in cancel path io_uring: defer linked-timeout chain splice out of hrtimer context io_uring: hold uring_lock when walking link chain in io_wq_free_work() io_uring/fdinfo: translate SqThread PID through caller's pid_ns	2026-05-15 12:34:02 -07:00
Linus Torvalds	78e8370033	Merge tag 'hardening-v7.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull hardening fix from Kees Cook: - gcc-plugins: Fix GCC 16 removal of CONST_CAST macros * tag 'hardening-v7.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: gcc-plugins: Always define CONST_CAST_GIMPLE and CONST_CAST_TREE	2026-05-15 12:27:03 -07:00
Linus Torvalds	36d49bba19	Merge tag 'docs-7.1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/docs/linux Pull documentation fixes from Jonathan Corbet: "This is Willy Tarreau's new document clarifying the definition and handling of security-related bugs, which we're trying to get out there quickly on the theory that some of the bug reporters might actually read and pay attention to it" * tag 'docs-7.1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/docs/linux: docs: threat-model: don't limit root capabilities to CAP_SYS_ADMIN docs: security-bugs: add a link to the threat-model documentation Documentation: security-bugs: clarify requirements for AI-assisted reports Documentation: security-bugs: explain what is and is not a security bug Documentation: security-bugs: do not systematically Cc the security team	2026-05-15 12:24:09 -07:00
Linus Torvalds	4844e7c4c2	Merge tag 'for-linus-7.1b-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip Pull xen fixes from Juergen Gross: - one simple cleanup - a fix for a corner case when running as Xen PV dom0 - a fix of a regression for Xen PV guests, introduced in 7.0 * tag 'for-linus-7.1b-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: x86/xen: Tolerate nested XEN_LAZY_MMU entering/leaving x86/xen: Fix xen_e820_swap_entry_with_ram() xen/arm: Replace __ASSEMBLY__ with __ASSEMBLER__ in interface.h	2026-05-15 11:24:51 -07:00
Linus Torvalds	4c2cd91bff	Merge tag 'platform-drivers-x86-v7.1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86 Pull x86 platform driver fixes from Ilpo Järvinen: - asus-nb-wmi: - Use existing keyboard quirk for ASUS Zenbook Duo UX8407AA - hp-wmi: - Add support for Victus 16-r0xxx (8BC2) - intel/vsec_tpmi: - Move debugfs register before creating devices - Prevent fault during unbind - lenovo-wmi-: - Fix memory leak in lwmi_dev_evaluate_int() - Balance IDA id allocation and free - Balance component bind and unbind - Prevent sending uninitialized WMI arguments to the device - Decouple lenovo-wmi-gamezone and lenovo-wmi-other to simplify module dependency graph - Limit adding attributes to supported devices - samsung-galaxybook: - Handle kbd backlight, mic mute and camera block hotkeys tag 'platform-drivers-x86-v7.1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86: platform/x86: asus-nb-wmi: add DMI quirk for ASUS Zenbook Duo UX8407AA platform/x86: lenovo-wmi-other: Limit adding attributes to supported devices platform/x86: lenovo-wmi-other: Add Attribute ID helper functions platform/x86: lenovo-wmi-helpers: Move gamezone enums to wmi-helpers platform/x86: lenovo: Decouple lenovo-wmi-gamezone and lenovo-wmi-other platform/x86: lenovo-wmi-other: Fix tunable_attr_01 struct members platform/x86: lenovo-wmi-other: Zero initialize WMI arguments platform/x86: lenovo-wmi-other: Balance component bind and unbind platform/x86: lenovo-wmi-other: Balance IDA id allocation and free platform/x86: lenovo-wmi-helpers: Fix memory leak in lwmi_dev_evaluate_int() platform/x86: hp-wmi: Add support for Victus 16-r0xxx (8BC2) platform/x86/intel/tpmi/plr: Prevent fault during unbind platform/x86: intel: Add notifiers support platform/x86: intel: Move debugfs register before creating devices platform/x86: samsung-galaxybook: Handle ACPI hotkey notifications platform/x86: samsung-galaxybook: Refactor camera lens cover input device	2026-05-15 11:12:54 -07:00
Linus Torvalds	fd6b566156	Merge tag 'v7.1-p4' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 Pull crypto fixes from Herbert Xu: - Fix potential dead-lock in rhashtable when used by xattr - Avoid calling kvfree on atomic path in rhashtable * tag 'v7.1-p4' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: rhashtable: Add bucket_table_free_atomic() helper mm/slab: Add kvfree_atomic() helper rhashtable: drop ht->mutex in rhashtable_free_and_destroy()	2026-05-15 10:38:37 -07:00
Linus Torvalds	70eda68668	Merge tag 'hid-for-linus-2026051401' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid Pull HID fixes from Jiri Kosina: - fixes for a few OOB/UAF in several HID drivers (Florian Pradines, Lee Jones, Michael Zaidman, Rosalie Wanders, Sangyun Kim and Tomasz Pakuła) - more general sanitation of input data, dealing with potentially malicious hardware in hid-core (Benjamin Tissoires) - a few device-specific quirks and fixups * tag 'hid-for-linus-2026051401' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid: (22 commits) HID: logitech-hidpp: Add support for newer Bluetooth keyboards HID: pidff: Fix integer overflow in pidff_rescale HID: i2c-hid: add reset quirk for BLTP7853 touchpad HID: core: introduce hid_safe_input_report() HID: pass the buffer size to hid_report_raw_event HID: google: hammer: stop hardware on devres action failure HID: appletb-kbd: run inactivity autodim from workqueues HID: appletb-kbd: fix UAF in inactivity-timer cleanup path HID: playstation: Clamp num_touch_reports HID: magicmouse: Prevent out-of-bounds (OOB) read during DOUBLE_REPORT_ID HID: mcp2221: fix OOB write in mcp2221_raw_event() HID: quirks: really enable the intended work around for appledisplay HID: hid-sjoy: race between init and usage HID: uclogic: Fix regression of input name assignment HID: intel-thc-hid: Intel-quickspi: Fix some error codes HID: hid-lenovo-go-s: restore OS_TYPE after resume from s2idle HID: elan: Add support for ELAN SB974D touchpad HID: sony: add missing size validation for Rock Band 3 Pro instruments HID: sony: add missing size validation for SMK-Link remotes HID: sony: remove unneeded WARN_ON() in sony_leds_init() ...	2026-05-14 14:30:01 -07:00
Linus Torvalds	48f76a1271	Merge tag 'acpi-7.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull ACPI support fixes from Rafael Wysocki: "These fix several platform drivers that use the ACPI companion of the given platform device without checking its presence, which may lead to a NULL pointer dereference or other kind of malfunction if the driver is forced to match a device without an ACPI companion via driver override, and restore debug log level for some messages in the ACPI CPPC library: - Check ACPI_COMPANION() against NULL during probe in several core ACPI device drivers (Rafael Wysocki) - Restore log level of messages in amd_set_max_freq_ratio() (Mario Limonciello)" * tag 'acpi-7.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: ACPI: PAD: xen: Check ACPI_COMPANION() against NULL ACPI: driver: Check ACPI_COMPANION() against NULL during probe Revert "ACPI: CPPC: Adjust debug messages in amd_set_max_freq_ratio() to warn"	2026-05-14 14:06:31 -07:00
Rafael J. Wysocki	af149b667b	Merge branch 'acpi-cppc' Merge a revert of an ACPI CPPC commit that increased the log level of some debug messages which turned out to be a bad idea: - Restore log level of messages in amd_set_max_freq_ratio() (Mario Limonciello) * acpi-cppc: Revert "ACPI: CPPC: Adjust debug messages in amd_set_max_freq_ratio() to warn"	2026-05-14 22:46:33 +02:00
Juergen Gross	4594437880	x86/xen: Tolerate nested XEN_LAZY_MMU entering/leaving With the support of nested lazy mmu sections it can happen that arch_enter_lazy_mmu_mode() is being called twice without a call of arch_leave_lazy_mmu_mode() in between, as the lazy_mmu_*() helpers are not disabling preemption when checking for nested lazy mmu sections. This is a problem when running as a Xen PV guest, as xen_enter_lazy_mmu() and xen_leave_lazy_mmu() don't tolerate this case. Fix that in xen_enter_lazy_mmu() and xen_leave_lazy_mmu() in order not to hurt all other lazy mmu mode users. Fixes: `291b3abed6` ("x86/xen: use lazy_mmu_state when context-switching") Tested-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Signed-off-by: Juergen Gross <jgross@suse.com> Message-ID: <20260508143933.493013-1-jgross@suse.com>	2026-05-14 18:33:05 +02:00
Juergen Gross	28e03f78e6	x86/xen: Fix xen_e820_swap_entry_with_ram() When swapping a not page-aligned E820 map entry with RAM, the start address of the modified entry is calculated wrong (the offset into the page is subtracted instead of being added to the page address). Fixes: `be35d91c88` ("xen: tolerate ACPI NVS memory overlapping with Xen allocated memory") Reported-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Juergen Gross <jgross@suse.com> Message-ID: <20260505102417.208138-1-jgross@suse.com>	2026-05-14 18:33:05 +02:00
Kees Cook	905c559e51	gcc-plugins: Always define CONST_CAST_GIMPLE and CONST_CAST_TREE For gcc-16, the CONST_CAST macro family was removed. Add back what we were using in gcc-common.h, as they are simple wrappers. See GCC commits: c3d96ff9e916c02584aa081f03ab999292efbb50 458c7926d48959abcb2c1adaa22458e27459a551 Suggested-by: Ingo Saitz <ingo@hannover.ccc.de> Link: https://lore.kernel.org/lkml/ab6OKoay0OWkywjK@spatz.zoo Fixes: `6b90bd4ba4` ("GCC plugin infrastructure") Tested-by: Ivan Bulatovic <combuster@archlinux.us> Tested-by: Christopher Cradock <christopher@cradock.myzen.co.uk> Signed-off-by: Kees Cook <kees@kernel.org>	2026-05-14 09:24:32 -07:00
Linus Torvalds	66182ca873	Merge tag 'net-7.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Paolo Abeni: "Including fixes from netfilter. Previous releases - regressions: - ethtool: fix NULL pointer dereference in phy_reply_size - netfilter: - allocate hook ops while under mutex - close dangling table module init race - restore nf_conntrack helper propagation via expectation - tcp: - fix potential UAF in reqsk_timer_handler(). - fix out-of-bounds access for twsk in tcp_ao_established_key(). - vsock: fix empty payload in tap skb for non-linear buffers - hsr: fix NULL pointer dereference in hsr_get_node_data() - eth: - cortina: fix RX drop accounting - ice: fix locking in ice_dcb_rebuild() Previous releases - always broken: - napi: avoid gro timer misfiring at end of busypoll - sched: - dualpi2: initialize timer earlier in dualpi2_init() - sch_cbs: Call qdisc_reset for child qdisc - shaper: - fix ordering issue in net_shaper_commit() - reject handle IDs exceeding internal bit-width - ipv6: flowlabel: enforce per-netns limit for unprivileged callers - tls: fix off-by-one in sg_chain entry count for wrapped sk_msg ring - smc: avoid NULL deref of conn->lnk in smc_msg_event tracepoint - sctp: revalidate list cursor after sctp_sendmsg_to_asoc() in SCTP_SENDALL - batman-adv: - reject new tp_meter sessions during teardown - purge non-released claims - eth: - i40e: cleanup PTP registration on probe failure - idpf: fix double free and use-after-free in aux device error paths - ena: fix potential use-after-free in get_timestamp" * tag 'net-7.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (88 commits) net: phy: DP83TC811: add reading of abilities net: tls: prevent chain-after-chain in plain text SG net: tls: fix off-by-one in sg_chain entry count for wrapped sk_msg ring net/smc: reject CHID-0 ACCEPT that matches an empty ism_dev slot macsec: use rcu_work to defer TX SA crypto cleanup out of softirq macsec: use rcu_work to defer RX SA crypto cleanup out of softirq macsec: introduce dedicated workqueue for SA crypto cleanup net: net_failover: Fix the deadlock in slave register MAINTAINERS: update atlantic driver maintainer selftests/tc-testing: Add QFQ/CBS qlen underflow test net/sched: sch_cbs: Call qdisc_reset for child qdisc FDDI: defza: Sanitise the reset safety timer net: ethernet: ravb: Do not check URAM suspension when WoL is active ethtool: fix ethnl_bitmap32_not_zero() bit interval semantics net/smc: avoid NULL deref of conn->lnk in smc_msg_event tracepoint net/smc: fix sleep-inside-lock in __smc_setsockopt() causing local DoS net: atm: fix skb leak in sigd_send() default branch net: ethtool: phy: avoid NULL deref when PHY driver is unbound net: atlantic: preserve PCI wake-from-D3 on shutdown when WOL enabled net: shaper: reject QUEUE scope handle with missing id ...	2026-05-14 08:57:43 -07:00
Linus Torvalds	eb5441518f	Merge tag 'audit-pr-20260513' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit Pull audit fixes from Paul Moore: - Correctly log the inheritable capabilities - Honor AUDIT_LOCKED in the AUDIT_TRIM and AUDIT_MAKE_EQUIV commands * tag 'audit-pr-20260513' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit: audit: enforce AUDIT_LOCKED for AUDIT_TRIM and AUDIT_MAKE_EQUIV audit: fix incorrect inheritable capability in CAPSET records	2026-05-14 08:53:24 -07:00
Linus Torvalds	31e62c2ebb	ptrace: slightly saner 'get_dumpable()' logic The 'dumpability' of a task is fundamentally about the memory image of the task - the concept comes from whether it can core dump or not - and makes no sense when you don't have an associated mm. And almost all users do in fact use it only for the case where the task has a mm pointer. But we have one odd special case: ptrace_may_access() uses 'dumpable' to check various other things entirely independently of the MM (typically explicitly using flags like PTRACE_MODE_READ_FSCREDS). Including for threads that no longer have a VM (and maybe never did, like most kernel threads). It's not what this flag was designed for, but it is what it is. The ptrace code does check that the uid/gid matches, so you do have to be uid-0 to see kernel thread details, but this means that the traditional "drop capabilities" model doesn't make any difference for this all. Make it all make a bit more sense by saying that if you don't have a MM pointer, we'll use a cached "last dumpability" flag if the thread ever had a MM (it will be zero for kernel threads since it is never set), and require a proper CAP_SYS_PTRACE capability to override. Reported-by: Qualys Security Advisory <qsa@qualys.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Kees Cook <kees@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2026-05-14 08:32:11 -07:00
Sven Schuchmann	c78bdba7b9	net: phy: DP83TC811: add reading of abilities At this time the driver is not listing any speeds it supports. This should be ETHTOOL_LINK_MODE_100baseT1_Full_BIT for DP83TC811. Add the missing call for phylib to read the abilities. Fixes: `b753a9faaf` ("net: phy: DP83TC811: Introduce support for the DP83TC811 phy") Suggested-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Sven Schuchmann <schuchmann@schleissheimer.de> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://patch.msgid.link/20260512071949.6218-1-schuchmann@schleissheimer.de [pabeni@redhat.com: dropped revision history] Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-05-14 15:17:12 +02:00
Jonathan Corbet	f2e65e4e5b	docs: threat-model: don't limit root capabilities to CAP_SYS_ADMIN The threat-model document says that only users with CAP_SYS_ADMIN can carry out a number of admin-level tasks, but there are numerous capabilities that can confer that sort of power. Generalize the text slightly to make it clear that CAP_SYS_ADMIN is not the only all-powerful capability. Acked-by: Willy Tarreau <w@1wt.eu> Signed-off-by: Jonathan Corbet <corbet@lwn.net>	2026-05-14 06:23:44 -06:00
Jonathan Corbet	561458db0d	docs: security-bugs: add a link to the threat-model documentation Rather than make readers search for this document, just a link to it where it is referenced. (While I was at it, I removed the unused and unneeded _threatmodel label from the top of threat-model.rst). Acked-by: Willy Tarreau <w@1wt.eu> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Jonathan Corbet <corbet@lwn.net>	2026-05-14 06:23:06 -06:00
Jakub Kicinski	ff26a0e837	net: tls: prevent chain-after-chain in plain text SG Sashiko points out that if end = 0 (start != 0) the current code will create a chain link to content type right after the wrap link: This would create a chain where the wrap link points directly to another chain link. The scatterlist API sg_next iterator does not recursively resolve consecutive chain links. meaning this is illegal input to crypto. The wrapping link is unnecessary if end = 0. end is the entry after the last one used so end = 0 means there's nothing pushed after the wrap: end start i v v v [ ]...[ ][ d ][ d ][ d ][ d ][rsv for wrap] Skip the wrapping in this case. TLS 1.3 can use the "wrapping slot" for it's chaining if end = 0. This avoids the chain-after-chain. Move the wrap chaining before marking END and chaining off content type, that feels like more logical ordering to me, but should not matter from functional perspective. Reported-by: Sashiko <sashiko-bot@kernel.org> Fixes: `9aaaa56845` ("bpf: Sockmap/tls, skmsg can have wrapped skmsg that needs extra chaining") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Link: https://patch.msgid.link/20260511174920.433155-3-kuba@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-05-14 13:18:40 +02:00
Jakub Kicinski	285943c6e7	net: tls: fix off-by-one in sg_chain entry count for wrapped sk_msg ring When an sk_msg scatterlist ring wraps (sg.end < sg.start), tls_push_record() chains the tail portion of the ring to the head using sg_chain(). An extra entry in the sg array is reserved for this: struct sk_msg_sg { [...] /* The extra two elements: * 1) used for chaining the front and sections when the list becomes * partitioned (e.g. end < start). The crypto APIs require the * chaining; * 2) to chain tailer SG entries after the message. */ struct scatterlist data[MAX_MSG_FRAGS + 2]; The current code uses MAX_SKB_FRAGS + 1 as the ring size: sg_chain(&msg_pl->sg.data[msg_pl->sg.start], MAX_SKB_FRAGS - msg_pl->sg.start + 1, msg_pl->sg.data); This places the chain pointer at sg_chain(data[start], (MAX_SKB_FRAGS - msg_start + 1) .. = &data[start] + (MAX_SKB_FRAGS - msg_start + 1) - 1 = data[start + (MAX_SKB_FRAGS - start + 1) - 1] = data[MAX_SKB_FRAGS] instead of the true last entry. This is likely due to a "race" of the commit under Fixes landing close to commit `031097d9e0` ("bpf: sk_msg, zap ingress queue on psock down") Convert to ARRAY_SIZE and drop the data[start] / - start (as suggested by Sabrina). Reported-by: 钱一铭 <yimingqian591@gmail.com> Fixes: `9aaaa56845` ("bpf: Sockmap/tls, skmsg can have wrapped skmsg that needs extra chaining") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Link: https://patch.msgid.link/20260511174920.433155-2-kuba@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-05-14 13:18:40 +02:00
Xiang Mei	277740023d	net/smc: reject CHID-0 ACCEPT that matches an empty ism_dev slot On the SMC-D client, slot 0 of ini->ism_dev[]/ini->ism_chid[] is reserved for an SMC-Dv1 device. smc_find_ism_v2_device_clnt() populates V2 entries starting at index 1, so when no V1 device is selected slot 0 is left in its kzalloc()'ed state with ism_dev[0] == NULL and ism_chid[0] == 0. smc_v2_determine_accepted_chid() then matches the peer's CHID against the array starting from index 0 using the CHID alone. A malicious peer replying to a SMC-Dv2-only proposal with d1.chid == 0 matches the empty slot, ini->ism_selected becomes 0, and the subsequent ism_dev[0]->lgr_lock dereference in smc_conn_create() faults at offsetof(struct smcd_dev, lgr_lock) == 0x68: BUG: KASAN: null-ptr-deref in _raw_spin_lock_bh+0x79/0xe0 Write of size 4 at addr 0000000000000068 by task exploit/144 Call Trace: _raw_spin_lock_bh smc_conn_create (net/smc/smc_core.c:1997) __smc_connect (net/smc/af_smc.c:1447) smc_connect (net/smc/af_smc.c:1720) __sys_connect __x64_sys_connect do_syscall_64 Require ism_dev[i] to be non-NULL before accepting a CHID match. Fixes: `a7c9c5f4af` ("net/smc: CLC accept / confirm V2") Reported-by: Weiming Shi <bestswngs@gmail.com> Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Xiang Mei <xmei5@asu.edu> Link: https://patch.msgid.link/20260511062138.2839584-1-xmei5@asu.edu Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-05-14 12:20:06 +02:00
Zizhi Wo	f44d38a31f	io_uring: validate user-controlled cq.head in io_cqe_cache_refill() A fuzzing run reproduced an unkillable io_uring task stuck at ~100% CPU: [root@fedora io_uring_stress]# ps -ef \| grep io_uring root 1240 1 99 13:36 ? 00:01:35 [io_uring_stress] <defunct> The task loops inside io_cqring_wait() and never returns to userspace, and SIGKILL has no effect. This is caused by the CQ ring exposing rings->cq.head to userspace as writable, while the authoritative tail lives in kernel-private ctx->cached_cq_tail. io_cqe_cache_refill() computes free space as an unsigned subtraction: free = ctx->cq_entries - min(tail - head, ctx->cq_entries); If userspace keeps head within [0, tail], the subtraction is well defined and min() just acts as a defensive clamp. But if userspace advances head past tail, (tail - head) wraps to a huge value, free becomes 0, and io_cqe_cache_refill() fails. The CQE is pushed onto the overflow list and IO_CHECK_CQ_OVERFLOW_BIT is set. The wait loop in io_cqring_wait() relies on an invariant: refill() only fails when the CQ is physically full, in which case rings->cq.tail has been advanced to iowq->cq_tail and io_should_wake() returns true. The tampered head breaks this: refill() fails while the ring is not full, no OCQE is copied in, rings->cq.tail never catches up, io_should_wake() stays false, and io_cqring_wait_schedule() keeps returning early because IO_CHECK_CQ_OVERFLOW_BIT is still set. The result is a tight retry loop that never returns to userspace. Introduce io_cqring_queued() as the single point that converts the (tail, head) pair into a trustworthy queued count. Since the real head/tail distance is bounded by cq_entries (far below 2^31), a signed comparison reliably detects userspace moving head past tail; in that case treat the queue as empty so callers see the full cache as free and forward progress is preserved. Suggested-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Zizhi Wo <wozizhi@huawei.com> Link: https://patch.msgid.link/20260514021847.4062782-1-wozizhi@huaweicloud.com [axboe: fixup commit message, kill 'queued' var, and keep it all in io_uring.c] Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-05-13 21:44:57 -06:00
Jakub Kicinski	cc21150cde	Merge branch 'macsec-use-rcu_work-to-fix-crypto-cleanup-in-softirq-context' Jinliang Zheng says: ==================== macsec: use rcu_work to fix crypto cleanup in softirq context From: Jinliang Zheng <alexjlzheng@tencent.com> crypto_free_aead() can internally call vunmap() (e.g. via dma_free_attrs() in hardware crypto drivers like hisi_sec2), which must not be invoked from softirq context. Both free_rxsa() and free_txsa() are RCU callbacks that run in softirq, causing a kernel crash on affected hardware. This series fixes the issue by deferring the actual cleanup to a workqueue using rcu_work, which combines the RCU grace period and workqueue dispatch into a single primitive. Two design decisions worth noting: 1. rcu_work instead of schedule_work() + synchronize_rcu() An alternative would be to call schedule_work() directly from macsec_rxsa_put()/macsec_txsa_put(), then call synchronize_rcu() at the start of the work handler to replace the grace period previously provided by call_rcu(). However, synchronize_rcu() blocks the worker thread for the duration of a full RCU grace period. Under high SA churn (e.g. tearing down an interface with many SAs), each SA would occupy a worker thread while waiting, and multiple concurrent calls cannot share the same grace period — leading to unnecessary latency and resource waste. rcu_work uses call_rcu_hurry() internally, which is fully asynchronous: the worker thread is only dispatched after the grace period has elapsed, and multiple concurrent queue_rcu_work() calls naturally batch under the same grace period via the RCU subsystem's existing coalescing mechanism. 2. Dedicated workqueue instead of system_wq Using a dedicated workqueue (macsec_wq) allows macsec_exit() to drain exactly the work items belonging to this module — by calling destroy_workqueue() after rcu_barrier(). If system_wq were used, flush_scheduled_work() would drain all pending work items across the entire system, creating unnecessary coupling with unrelated subsystems and potentially causing unexpected delays. The dedicated workqueue provides a clean, contained teardown path. ==================== Link: https://patch.msgid.link/20260511153102.2640368-1-alexjlzheng@tencent.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-13 19:03:07 -07:00
Jinliang Zheng	552cc2306c	macsec: use rcu_work to defer TX SA crypto cleanup out of softirq free_txsa() is an RCU callback running in softirq context, but calls crypto_free_aead() which can invoke vunmap() internally on hardware crypto drivers (e.g. hisi_sec2), triggering a kernel crash. Use rcu_work to defer the cleanup to a workqueue, for the same reasons as the analogous fix to free_rxsa() in the previous patch. Fixes: `c09440f7dc` ("macsec: introduce IEEE 802.1AE driver") Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Link: https://patch.msgid.link/20260511153102.2640368-4-alexjlzheng@tencent.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-13 19:03:05 -07:00
Jinliang Zheng	6624bba469	macsec: use rcu_work to defer RX SA crypto cleanup out of softirq crypto_free_aead() can internally invoke vunmap() (e.g. via dma_free_attrs() in hardware crypto drivers such as hisi_sec2). vunmap() must not be called from softirq context, but free_rxsa() is an RCU callback that runs in softirq, leading to a kernel crash: vunmap+0x4c/0x70 __iommu_dma_free+0xd0/0x138 dma_free_attrs+0xf4/0x100 sec_aead_exit+0x64/0xb8 [hisi_sec2] crypto_destroy_tfm+0x98/0x110 free_rxsa+0x28/0x50 [macsec] rcu_do_batch+0x184/0x460 rcu_core+0xf4/0x1f8 handle_softirqs+0x118/0x330 Use rcu_work to defer the cleanup to a workqueue. rcu_work dispatches the worker asynchronously after the RCU grace period, so no thread blocks waiting, and concurrent releases of multiple SAs naturally share the same grace period. Fixes: `c09440f7dc` ("macsec: introduce IEEE 802.1AE driver") Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Link: https://patch.msgid.link/20260511153102.2640368-3-alexjlzheng@tencent.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-13 19:03:05 -07:00
Jinliang Zheng	c6690a9030	macsec: introduce dedicated workqueue for SA crypto cleanup Introduce a dedicated ordered workqueue, macsec_wq, which will be used by subsequent patches to defer SA crypto cleanup (crypto_free_aead and related teardown) out of softirq context. Using a dedicated workqueue instead of system_wq allows macsec_exit() to drain exactly the work items belonging to this module via destroy_workqueue(), without interfering with unrelated work items on system_wq or causing unexpected delays elsewhere. rcu_barrier() in macsec_exit() ensures all in-flight rcu_work callbacks have enqueued their work items before destroy_workqueue() drains and destroys the queue, making the two-step teardown correct and complete. The same sequence is kept in the error path of macsec_init() as a precaution, to mirror macsec_exit() and stay safe if work ever becomes queueable before this point in the future. While at it, rename the error labels in macsec_init() from the resource-named style (rtnl:, notifier:, wq:) to the err_xxx: style (err_rtnl:, err_notifier:, err_destroy_wq:) to align with the broader kernel convention. Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Link: https://patch.msgid.link/20260511153102.2640368-2-alexjlzheng@tencent.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-13 19:03:05 -07:00
Faicker Mo	b84c5632c7	net: net_failover: Fix the deadlock in slave register There is netdev_lock_ops() before the NETDEV_REGISTER notifier in register_netdevice(), so use the non-locking functions in net_failover_slave_register(). failover_slave_register() in failover_existing_slave_register() adds lock and unlock ops too. Call Trace: <TASK> __schedule+0x30d/0x7a0 schedule+0x27/0x90 schedule_preempt_disabled+0x15/0x30 __mutex_lock.constprop.0+0x538/0x9e0 __mutex_lock_slowpath+0x13/0x20 mutex_lock+0x3b/0x50 dev_set_mtu+0x40/0xe0 net_failover_slave_register+0x24/0x280 failover_slave_register+0x103/0x1b0 failover_event+0x15e/0x210 ? dropmon_net_event+0xac/0xe0 notifier_call_chain+0x5e/0xe0 raw_notifier_call_chain+0x16/0x30 call_netdevice_notifiers_info+0x52/0xa0 register_netdevice+0x5f4/0x7c0 register_netdev+0x1e/0x40 _mlx5e_probe+0xe2/0x370 [mlx5_core] mlx5e_probe+0x59/0x70 [mlx5_core] ? __pfx_mlx5e_probe+0x10/0x10 [mlx5_core] Fixes: `4c975fd700` ("net: hold instance lock during NETDEV_REGISTER/UP") Signed-off-by: Faicker Mo <faicker.mo@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-13 19:01:03 -07:00
Sukhdeep Singh	9a390d34d5	MAINTAINERS: update atlantic driver maintainer Igor Russkikh and Egor Pomozov have left Marvell. Take over maintenance of the atlantic driver and its PTP subsystem. Signed-off-by: Sukhdeep Singh <sukhdeeps@marvell.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-13 18:09:52 -07:00
Victor Nogueira	59afae2008	selftests/tc-testing: Add QFQ/CBS qlen underflow test Since CBS was not calling reset for its child qdisc, there are scenarios where it could cause an underflow on its parent's qlen/backlog. When the parent is QFQ, a null-ptr deref could occur. Add a test case that reproduces the underflow followed by a null-ptr deref scenario. Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Victor Nogueira <victor@mojatatu.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-13 17:53:39 -07:00
Jamal Hadi Salim	320fb29ea2	net/sched: sch_cbs: Call qdisc_reset for child qdisc During a reset, CBS is not calling reset on its child qdisc, which might cause qlen/backlog accounting issues. For example, if we have CBS with a QFQ parent and a netem child with delay, we can create a scenario where the parent's qlen underflows. QFQ, specifically, uses qlen to check whether it should deference a pointer, so this scenario may cause a null-ptr deref in QFQ: [ 43.875639][ T319] Oops: general protection fault, probably for non-canonical address 0xdffffc0000000009: 0000 [#1] SMP KASAN NOPTI [ 43.876124][ T319] KASAN: null-ptr-deref in range [0x0000000000000048-0x000000000000004f] [ 43.876417][ T319] CPU: 10 UID: 0 PID: 319 Comm: ping Not tainted 7.0.0-13039-ge728258debd5 #773 PREEMPT(full) [ 43.876751][ T319] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 [ 43.876949][ T319] RIP: 0010:qfq_dequeue+0x35c/0x1650 [ 43.877123][ T319] Code: 00 fc ff df 80 3c 02 00 0f 85 17 0e 00 00 4c 8d 73 48 48 89 9d b8 02 00 00 48 b8 00 00 00 00 00 fc ff df 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 76 0c 00 00 48 b8 00 00 00 00 00 fc ff df 4c 8b [ 43.877648][ T319] RSP: 0018:ffff8881017ef4f0 EFLAGS: 00010216 [ 43.877845][ T319] RAX: dffffc0000000000 RBX: 0000000000000000 RCX: dffffc0000000000 [ 43.878073][ T319] RDX: 0000000000000009 RSI: 0000000c40000000 RDI: ffff88810eef02b0 [ 43.878306][ T319] RBP: ffff88810eef0000 R08: ffff88810eef0280 R09: 1ffff1102120fd63 [ 43.878523][ T319] R10: 1ffff1102120fd66 R11: 1ffff1102120fd67 R12: 0000000c40000000 [ 43.878742][ T319] R13: ffff88810eef02b8 R14: 0000000000000048 R15: 0000000020000000 [ 43.878959][ T319] FS: 00007f9c51c47c40(0000) GS:ffff88817a0be000(0000) knlGS:0000000000000000 [ 43.879214][ T319] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 43.879403][ T319] CR2: 000055e69a2230a8 CR3: 000000010c07a000 CR4: 0000000000750ef0 [ 43.879621][ T319] PKRU: 55555554 [ 43.879735][ T319] Call Trace: [ 43.879844][ T319] <TASK> [ 43.879924][ T319] __qdisc_run+0x169/0x1900 [ 43.880075][ T319] ? dev_qdisc_enqueue+0x8b/0x210 [ 43.880222][ T319] __dev_queue_xmit+0x2346/0x37a0 [ 43.880376][ T319] ? register_lock_class+0x3f/0x800 [ 43.880531][ T319] ? srso_alias_return_thunk+0x5/0xfbef5 [ 43.880684][ T319] ? __pfx___dev_queue_xmit+0x10/0x10 [ 43.880834][ T319] ? srso_alias_return_thunk+0x5/0xfbef5 [ 43.880977][ T319] ? __lock_acquire+0x819/0x1df0 [ 43.881124][ T319] ? srso_alias_return_thunk+0x5/0xfbef5 [ 43.881275][ T319] ? srso_alias_return_thunk+0x5/0xfbef5 [ 43.881418][ T319] ? __asan_memcpy+0x3c/0x60 [ 43.881563][ T319] ? srso_alias_return_thunk+0x5/0xfbef5 [ 43.881708][ T319] ? eth_header+0x165/0x1a0 [ 43.881853][ T319] ? lockdep_hardirqs_on_prepare+0xdb/0x1a0 [ 43.882031][ T319] ? srso_alias_return_thunk+0x5/0xfbef5 [ 43.882174][ T319] ? neigh_resolve_output+0x3cc/0x7e0 [ 43.882325][ T319] ? srso_alias_return_thunk+0x5/0xfbef5 [ 43.882471][ T319] ip_finish_output2+0x6b6/0x1e10 Fix this by calling qdisc_reset for CBS' child qdisc. Sashiko caught an issue which could result in a null ptr deref if qdisc_create_dflt() is invoked on an unitialised cbs qdisc which is exposed by this patch. We add an early return if the qdisc is null to address this. This is a similar approach used by two other fixes[1][2]. The proper fix for this specific issue elucidated by sashiko is to remove the call to qdisc_reset when qdisc_create_dflt fails. Since the dflt qdisc isn't attached anywhere yet at that point, calling the reset callback doesn't make much sense (and as stated has been a source of two other bugs). We plan on submitting this fix in a later patch. [1] https://lore.kernel.org/netdev/20221018063201.306474-2-shaozhengchao@huawei.com/ [2] https://lore.kernel.org/netdev/20221018063201.306474-4-shaozhengchao@huawei.com/ Fixes: `585d763af0` ("net/sched: Introduce Credit Based Shaper (CBS) qdisc") Reported-by: Junyoung Jang <graypanda.inzag@gmail.com> Tested-by: Junyoung Jang <graypanda.inzag@gmail.com> Tested-by: Victor Nogueira <victor@mojatatu.com> Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-13 17:53:39 -07:00
Maciej W. Rozycki	4694efc416	FDDI: defza: Sanitise the reset safety timer The reset actions of the DEFZA adapters are exceedingly slow, taking up to 30 seconds to complete by the device spec and typically in the range of 10 seconds in reality, as required for the device RTOS to boot, still quite a lot. Therefore a state machine is used that's interrupt driven, however a safety mechanism is required in case of adapter malfunction, so that if no state change interrupt has arrived in time, then the situation is taken care of. The safety mechanism depends on the origin of the reset. For regular adapter initialisation at the device probe time a sleep is requested. However a reset is also required by the device spec when the adapter has transitioned into the halted state, such as in response to a PC Trace event in the course of ring fault recovery, possibly a common network event. In that case no sleep is possible as a device halt is reported at the hardirq level. A timer is therefore set up to ensure progress in case no adapter state change interrupt has arrived in time, but as from commit `168f6b6ffb` ("timers: Use del_timer_sync() even on UP") a warning is issued as the timer is deleted in the hardirq handler upon an expected state change: defza: v.1.1.4 Oct 6 2018 Maciej W. Rozycki tc2: DEC FDDIcontroller 700 or 700-C at 0x18000000, irq 4 tc2: resetting the board... ------------[ cut here ]------------ WARNING: kernel/time/timer.c:1611 at __timer_delete_sync+0x104/0x120, CPU#0: swapper/0/0 Modules linked in: CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted 7.0.0-dirty #2 VOLUNTARY Stack : 9800000002027d08 00000000140120e0 0000000000000000 ffffffff8089d468 0000000000000000 0000000000000000 ffffffff807ed6b8 ffffffff80897458 ffffffff80897400 9800000002027b88 0000000000000000 7070617773203a6d 0000000000000000 9800000002027ba4 0000000000001000 6465746e69617420 0000000000000000 ffffffff807ed6b8 00000000140120e0 0000000000000009 000000000000064b ffffffff800dd14c 0000000000000036 9800000002184000 0000000000000000 0000000000000020 0000000000000000 ffffffff80910000 ffffffff8085c000 9800000002027c70 0000000000000001 ffffffff80045fa0 0000000000000000 0000000000000000 0000000000000000 0000000000000009 000000000000064b ffffffff800502b8 ffffffff807ed6b8 ffffffff80045fa0 ... Call Trace: [<ffffffff800502b8>] show_stack+0x28/0xf0 [<ffffffff80045fa0>] dump_stack_lvl+0x48/0x7c [<ffffffff80068c98>] __warn+0xa0/0x128 [<ffffffff8004120c>] warn_slowpath_fmt+0x64/0xa4 [<ffffffff800dd14c>] __timer_delete_sync+0x104/0x120 [<ffffffff804934ac>] fza_interrupt+0xc74/0xeb8 [<ffffffff800c6390>] __handle_irq_event_percpu+0x70/0x228 [<ffffffff800c6560>] handle_irq_event_percpu+0x18/0x78 [<ffffffff800cc320>] handle_percpu_irq+0x50/0x80 [<ffffffff800c5970>] generic_handle_irq+0x90/0xd0 [<ffffffff806e956c>] do_IRQ+0x1c/0x30 [<ffffffff8004ad4c>] handle_int+0x148/0x154 [<ffffffff800ab7c0>] do_idle+0x40/0x108 [<ffffffff800abb0c>] cpu_startup_entry+0x2c/0x38 [<ffffffff806dfec8>] kernel_init+0x0/0x108 ---[ end trace 0000000000000000 ]--- tc2: OK tc2: model 700 (DEFZA-AA), MMF PMD, address 08-00-2b-xx-xx-xx tc2: ROM rev. 1.0, firmware rev. 1.2, RMC rev. A, SMT ver. 1 tc2: link unavailable ------------[ cut here ]------------ WARNING: kernel/time/timer.c:1611 at __timer_delete_sync+0x104/0x120, CPU#0: swapper/0/0 Modules linked in: CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Tainted: G W 7.0.0-dirty #2 VOLUNTARY Tainted: [W]=WARN Stack : 9800000002027d08 00000000140120e0 0000000000000000 ffffffff8089d468 0000000000000000 0000000000000000 ffffffff807ed6b8 ffffffff80897458 ffffffff80897400 9800000002027b88 0000000000000000 0000000000000000 0000000000000000 9800000002027ba4 0000000000001000 0000000000000000 0000000000000000 ffffffff807ed6b8 00000000140120e0 0000000000000009 000000000000064b ffffffff800dd14c 0000000000000036 9800000002184000 0000000000000000 0000000000000020 0000000000000000 ffffffff80910000 ffffffff8085c000 9800000002027c70 0000000000000001 ffffffff80045fa0 0000000000000000 0000000000000000 0000000000000000 0000000000000009 000000000000064b ffffffff800502b8 ffffffff807ed6b8 ffffffff80045fa0 ... Call Trace: [<ffffffff800502b8>] show_stack+0x28/0xf0 [<ffffffff80045fa0>] dump_stack_lvl+0x48/0x7c [<ffffffff80068c98>] __warn+0xa0/0x128 [<ffffffff8004120c>] warn_slowpath_fmt+0x64/0xa4 [<ffffffff800dd14c>] __timer_delete_sync+0x104/0x120 [<ffffffff804934ac>] fza_interrupt+0xc74/0xeb8 [<ffffffff800c6390>] __handle_irq_event_percpu+0x70/0x228 [<ffffffff800c6560>] handle_irq_event_percpu+0x18/0x78 [<ffffffff800cc320>] handle_percpu_irq+0x50/0x80 [<ffffffff800c5970>] generic_handle_irq+0x90/0xd0 [<ffffffff806e956c>] do_IRQ+0x1c/0x30 [<ffffffff8004ad4c>] handle_int+0x148/0x154 [<ffffffff806de8a4>] arch_local_irq_disable+0x4/0x28 [<ffffffff800ab7d0>] do_idle+0x50/0x108 [<ffffffff800abb0c>] cpu_startup_entry+0x2c/0x38 [<ffffffff806dfec8>] kernel_init+0x0/0x108 ---[ end trace 0000000000000000 ]--- tc2: registered as fddi0 The immediate origin of the new warning is the switch away from aliasing del_timer_sync() to del_timer() (timer_delete_sync() to timer_delete() in terms of current function names) for UP configurations, which however is the only choice for this driver anyway as no SMP hardware supports the TURBOchannel bus this device interfaces to. Therefore there is a very remote issue only this is a sign of. Specifically if an adapter reset issued upon a transition to the halted state times out and first triggers fza_reset_timer() for another reset assertion, which then schedules fza_reset_timer() for reset deassertion and then that second call is pre-empted after poking at the hardware, but before the timer has been rearmed and owing to high system load causing exceedingly high scheduling latency control is not handed back before a transition to the uninitialised state has caused the timer to be deleted even before it has been started, then fza_reset_timer() will be called yet again and issue another reset even though by then the adapter has already recovered. Prevent this situation from happening by switching to timer_delete() for the transition to the halted state and protect the code region affected with a spinlock, also to make sure add_timer() has not been called twice in a row due to an execution race between the interrupt handler and the timer handler (though it could only happen on SMP, but let's keep the driver clean). It's a very unlikely sequence of events to happen and therefore there's no point in trying to be overly clever about it, such as by placing printk() calls outside the protection. For the transition to the uninitialised state switch to timer_delete_sync_try() instead, so that a timer isn't deleted that's just been rearmed by the timer handler and needs to watch for the device to come out of reset again (again, an SMP scenario only). Retain timer_delete_sync() invocations outside the hardirq context for a stray timer not to fire once device structures have been released. Fixes: `61414f5ec9` ("FDDI: defza: Add support for DEC FDDIcontroller 700 TURBOchannel adapter") Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-13 17:31:01 -07:00
Linus Torvalds	59a62ea458	Merge tag 'sched_ext-for-7.1-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext fixes from Tejun Heo: "The bulk of this is hardening of the new sub-scheduler infrastructure. - UAFs and lifecycle bugs on the sub-sched attach/detach paths: parent sub_kset freed under a racing child, list_del_rcu on an uninitialized list head, ops->priv stomped by concurrent attach/detach, and a UAF in the init-failure error path - Task state-machine reorg closing concurrent enable-vs-dead races: a task exiting during the unlocked init window could trip NULL ops derefs or skip exit_task() cleanup - A scx_link_sched() self-deadlock on scx_sched_lock - isolcpus: stop dereferencing the now-RCU-protected HK_TYPE_DOMAIN cpumask without RCU, and stop rejecting BPF schedulers when only cpuset isolated partitions are active - PREEMPT_RT: disable irq_work runs in hardirq context so dumps show the failing task rather than the irq_work kthread - Assorted !CONFIG_EXT_SUB_SCHED, randconfig, and selftest build fixes" * tag 'sched_ext-for-7.1-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: sched_ext: Use HK_TYPE_DOMAIN_BOOT to detect isolcpus= domain isolation sched_ext: Defer sub_kset base put to scx_sched_free_rcu_work sched_ext: INIT_LIST_HEAD() &sch->all in scx_alloc_and_add_sched() sched_ext: Drop NONE early return in scx_disable_and_exit_task() sched_ext: Avoid UAF in scx_root_enable_workfn() init failure path sched_ext: Clear ops->priv on scx_alloc_and_add_sched() error paths sched_ext: Fix ops->priv clobber on concurrent attach/detach selftests/sched_ext: Fix build error in dequeue selftest sched_ext: Handle SCX_TASK_NONE in disable/switched_from paths sched_ext: Close sub-sched init race with post-init DEAD recheck sched_ext: Close root-enable vs sched_ext_dead() race with SCX_TASK_INIT_BEGIN sched_ext: Replace SCX_TASK_OFF_TASKS flag with SCX_TASK_DEAD state sched_ext: Inline scx_init_task() and move RESET_RUNNABLE_AT into scx_set_task_state() sched_ext: Cleanups in preparation for the SCX_TASK_INIT_BEGIN/DEAD work sched_ext: Use IRQ_WORK_INIT_HARD() to initialize sch->disable_irq_work sched_ext: Fix !CONFIG_EXT_SUB_SCHED build warnings sched_ext: Drop unused scx_find_sub_sched() stub sched_ext: Move scx_error() out of scx_link_sched()'s lock region	2026-05-13 15:00:40 -07:00
Linus Torvalds	0913b580f8	Merge tag 'cgroup-for-7.1-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup fixes from Tejun Heo: - cpuset fixes: - Partition invalidation could return CPUs still in use by sibling partitions, producing overlapping effective_cpus - cpuset_can_attach() over-reserved DL bandwidth on moves that stayed within the same root domain - Pending DL migration state leaked into later attaches when a later can_attach() check failed - Reorder PF_EXITING and __GFP_HARDWALL checks so dying tasks can allocate from any node and exit quickly - dmem: propagate -ENOMEM instead of spinning forever when the fallback pool allocation also fails - selftests/cgroup: percpu test error-path leak, bogus numeric comparison of cpuset strings, and a zero-length read() that silently passed OOM-kill tests * tag 'cgroup-for-7.1-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup/cpuset: Return only actually allocated CPUs during partition invalidation selftests/cgroup: Fix error path leaks in test_percpu_basic cgroup/cpuset: Reserve DL bandwidth only for root-domain moves cgroup/cpuset: Reset DL migration state on can_attach() failure selftests/cgroup: Fix string comparison in write_test selftests/cgroup: Fix cg_read_strcmp() empty string comparison cgroup/dmem: Return -ENOMEM on failed pool preallocation cgroup/cpuset: move PF_EXITING check before __GFP_HARDWALL in cpuset_current_node_allowed()	2026-05-13 14:56:31 -07:00
Linus Torvalds	50599e4c68	Merge tag 'wq-for-7.1-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq Pull workqueue fixes from Tejun Heo: - Plug a wq->cpu_pwq leak on the WQ_UNBOUND allocation failure path - Fix a cancel_delayed_work_sync() livelock against drain_workqueue() caused by the drain/destroy reject path leaving WORK_STRUCT_PENDING set with no owner * tag 'wq-for-7.1-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: Fix wq->cpu_pwq leak in alloc_and_link_pwqs() WQ_UNBOUND path workqueue: Release PENDING in __queue_work() drain/destroy reject path	2026-05-13 14:49:13 -07:00
Andrea Righi	6ae315d379	sched_ext: Use HK_TYPE_DOMAIN_BOOT to detect isolcpus= domain isolation scx_enable() refuses to attach a BPF scheduler when isolcpus=domain is in effect by comparing housekeeping_cpumask(HK_TYPE_DOMAIN) against cpu_possible_mask. Since commit `27c3a5967f` ("sched/isolation: Convert housekeeping cpumasks to rcu pointers"), HK_TYPE_DOMAIN's cpumask is RCU protected and dereferencing it requires either RCU read lock, the cpu_hotplug write lock, or the cpuset lock; scx_enable() holds none of these, so booting with isolcpus=domain and attaching any BPF scheduler triggers the following lockdep splat: ============================= WARNING: suspicious RCU usage ----------------------------- kernel/sched/isolation.c:60 suspicious rcu_dereference_check() usage! 1 lock held by scx_flash/281: #0: ffffffff8379fce0 (update_mutex){+.+.}-{4:4}, at: bpf_struct_ops_link_create+0x134/0x1c0 Call Trace: dump_stack_lvl+0x6f/0xb0 lockdep_rcu_suspicious.cold+0x37/0x70 housekeeping_cpumask+0xcd/0xe0 scx_enable.isra.0+0x17/0x120 bpf_scx_reg+0x5e/0x80 bpf_struct_ops_link_create+0x151/0x1c0 __sys_bpf+0x1e4b/0x33c0 __x64_sys_bpf+0x21/0x30 do_syscall_64+0x117/0xf80 entry_SYSCALL_64_after_hwframe+0x77/0x7f In addition, commit `03ff735101` ("cpuset: Update HK_TYPE_DOMAIN cpumask from cpuset") made HK_TYPE_DOMAIN include cpuset isolated partitions as well, which means the current check also rejects BPF schedulers when a cpuset partition is active. That contradicts the original intent of commit `9f391f94a1` ("sched_ext: Disallow loading BPF scheduler if isolcpus= domain isolation is in effect"), which explicitly noted that cpuset partitions are honored through per-task cpumasks and should not be rejected. Switch to housekeeping_enabled(HK_TYPE_DOMAIN_BOOT), which reads only the housekeeping flag bit (no RCU dereference) and reflects exactly the boot-time isolcpus= configuration that the error message refers to. Fixes: `27c3a5967f` ("sched/isolation: Convert housekeeping cpumasks to rcu pointers") Cc: stable@vger.kernel.org # v7.0+ Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Frederic Weisbecker <frederic@kernel.org>	2026-05-13 10:02:57 -10:00
Nicholas Carlini	d6a2d7b04b	io-wq: check that the predecessor is hashed in io_wq_remove_pending() io_wq_remove_pending() needs to fix up wq->hash_tail[] if the cancelled work was the tail of its hash bucket. When doing this, it checks whether the preceding entry in acct->work_list has the same hash value, but never checks that the predecessor is hashed at all. io_get_work_hash() is simply atomic_read(&work->flags) >> IO_WQ_HASH_SHIFT, and the hash bits are never set for non-hashed work, so it returns 0. Thus, when a hashed bucket-0 work is cancelled while a non-hashed work is its list predecessor, the check spuriously passes and a pointer to the non-hashed io_kiocb is stored in wq->hash_tail[0]. Because non-hashed work is dequeued via the fast path in io_get_next_work(), which never touches hash_tail[], the stale pointer is never cleared. Therefore, after the non-hashed io_kiocb completes and is freed back to req_cachep, wq->hash_tail[0] is a dangling pointer. The io_wq is per-task (tctx->io_wq) and survives ring open/close, so the dangling pointer persists for the lifetime of the task; the next hashed bucket-0 enqueue dereferences it in io_wq_insert_work() and wq_list_add_after() writes through freed memory. Add the missing io_wq_is_hashed() check so a non-hashed predecessor never inherits a hash_tail[] slot. Cc: stable@vger.kernel.org Fixes: `204361a77f` ("io-wq: fix hang after cancelling pending hashed work") Signed-off-by: Nicholas Carlini <nicholas@carlini.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-05-13 13:06:37 -06:00
sunshaojie	345f401666	cgroup/cpuset: Return only actually allocated CPUs during partition invalidation In update_parent_effective_cpumask() with partcmd_invalidate, the CPUs to return to the parent are computed as: adding = cpumask_and(tmp->addmask, xcpus, parent->effective_xcpus); where xcpus = user_xcpus(cs) which returns cs->exclusive_cpus (if set) or cs->cpus_allowed. When exclusive_cpus is not set, user_xcpus(cs) can contain CPUs that were never actually granted to the partition due to sibling exclusion in compute_excpus(). Consequently, the invalidation may return CPUs to the parent that remain in use by sibling partitions, causing overlapping effective_cpus and triggering the WARN_ON_ONCE(1) in generate_sched_domains(). Use cs->effective_xcpus instead, which reflects the CPUs actually granted to this partition. Reproducer (on a 4-CPU machine): cd /sys/fs/cgroup mkdir a1 b1 # a1 becomes partition root with CPUs 0-1 echo "0-1" > a1/cpuset.cpus echo "root" > a1/cpuset.cpus.partition # b1 becomes partition root with CPUs 1-2, but sibling exclusion # reduces its effective_xcpus to CPU 2 only echo "1-2" > b1/cpuset.cpus echo "root" > b1/cpuset.cpus.partition # b1 changes cpus_allowed to 0-1 -> partition invalidation echo "0-1" > b1/cpuset.cpus # Expected: CPUs 2-3 (only CPU 2 returned from b1) # Actual: CPUs 1-3 (CPU 0-1 returned, overlapping with a1) cat cpuset.cpus.effective dmesg will also show a WARNING from generate_sched_domains() reporting overlapping partition root effective_cpus. Fixes: `2a3602030d` ("cgroup/cpuset: Don't invalidate sibling partitions on cpuset.cpus conflict") Cc: stable@vger.kernel.org # v7.0+ Signed-off-by: sunshaojie <sunshaojie@kylinos.cn> Tested-by: Chen Ridong <chenridong@huaweicloud.com> Reviewed-by: Chen Ridong <chenridong@huaweicloud.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-05-13 08:54:53 -10:00
Linus Torvalds	e1914add27	Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm Pull kvm fixes from Paolo Bonzini: "arm64: - Add the pKVM side of the workaround for ARM's erratum 4193714, provided that the EL3 firmware does its part of the job. KVM will refuse to initialise otherwise - Correctly handle 52bit VAs for guest EL2 stage-1 translations when running under NV with E2H==0 - Correctly deal with permission faults in guest_memfd memslots - Fix the steal-time selftest after the infrastructure was reworked - Make sure the host cannot pass a non-sensical clock update to the EL2 tracing infrastructure - Appoint Steffen Eiden as a reviewer in anticipation of the KVM/s390 ability to run arm64 guests, which will inevitably lead to arm64 code being directly used on s390 - Make sure that EL2 is configured with both exception entry and exit being Context Synchronization Events - Handle the current vcpu being NULL on EL2 panic - Fix the selftest_vcpu memcache being empty at the point of donation or sharing - Check that the memcache has enough capacity before engaging on the share/donate path - Fix __deactivate_fgt() to use its parameter rather than a variable in the macro context s390: - Fix array overrun with large amounts of PCI devices x86: - Never use L0's PAUSE loop exiting while L2 is running, since it's unlikely that a nested guest will help solving the hypervisor's spinlock contention - Fix emulation of MOVNTDQA - Fix typo in Xen hypercall tracepoint - Add back an optimization that was left behind when recently fixing a bug - Add module parameter to disable CET, whose implementation seems to have issues. For now it remains enabled by default Generic: - Reject offset causing an unsigned overflow in kvm_reset_dirty_gfn() Documentation: - Update stale links Selftests: - Fix guest_memfd_test with host page size > guest page size" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (22 commits) KVM: VMX: introduce module parameter to disable CET KVM: x86: Swap the dst and src operand for MOVNTDQA KVM: x86: use again the flush argument of __link_shadow_page() KVM: selftests: Ensure gmem file sizes are multiple of host page size Documentation: kvm: update links in the references section of AMD Memory Encryption KVM: nSVM: Never use L0's PAUSE loop exiting while L2 is running KVM: x86: Fix Xen hypercall tracepoint argument assignment KVM: Reject wrapped offset in kvm_reset_dirty_gfn() KVM: arm64: Pre-check vcpu memcache for host->guest donate KVM: arm64: Pre-check vcpu memcache for host->guest share KVM: arm64: Seed pkvm_ownership_selftest vcpu memcache KVM: arm64: Fix __deactivate_fgt macro parameter typo KVM: arm64: Guard against NULL vcpu on VHE hyp panic path KVM: arm64: Make EL2 exception entry and exit context-synchronization events MAINTAINERS: Add Steffen as reviewer for KVM/arm64 KVM: arm64: Remove potential UB on nvhe tracing clock update KVM: selftests: arm64: Fix steal_time test after UAPI refactoring KVM: arm64: Handle permission faults with guest_memfd KVM: arm64: nv: Consider the DS bit when translating TCR_EL2 KVM: arm64: Work around C1-Pro erratum 4193714 for protected guests ...	2026-05-13 11:53:51 -07:00
Yu Miao	7d8f3158a5	selftests/cgroup: Fix error path leaks in test_percpu_basic When cg_name_indexed() returns NULL partway through the child creation loop, the code returned -1 without running cleanup_children and cleanup. That left the `parent` pathname allocation unreleased and did not remove child cgroup directories already created under the parent. Fix by jumping to cleanup_children instead of returning. When cg_create() fails, `child` (the pathname from cg_name_indexed()) was not freed before cleanup_children. Fix by freeing `child` before branching to cleanup_children. Fixes: `90631e1dea` ("kselftests: cgroup: add perpcu memory accounting test") Signed-off-by: Yu Miao <yumiao@kylinos.cn> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-05-13 08:40:52 -10:00
Linus Torvalds	1f63dd8ca0	Merge tag 'fixes-2026-05-13' of git://git.kernel.org/pub/scm/linux/kernel/git/liveupdate/linux Pull liveupdate fixes from Mike Rapoport: "A few fixes for kexec handover and liveupdate: - make sure KHO is skipped for crash kernel - fix error reporting in memfd preservation if it fails mid-loop - don't allow preserving memfds whose page count exceeds UINT_MAX - fix documentation of memfd seals preservation to match the code" * tag 'fixes-2026-05-13' of git://git.kernel.org/pub/scm/linux/kernel/git/liveupdate/linux: mm/memfd_luo: document preservation of file seals mm/memfd_luo: reject memfds whose page count exceeds UINT_MAX mm/memfd_luo: report error when restoring a folio fails mid-loop kho: skip KHO for crash kernel	2026-05-13 08:24:50 -07:00
Paolo Bonzini	2d5d3fc593	KVM: VMX: introduce module parameter to disable CET There have been reports of host hangs caused by CET virtualization. Until these are analyzed further, introduce a module parameter that makes it possible to easily disable it. Link: https://lore.kernel.org/all/85548beb-1486-40f9-beb4-632c78e3360b@proxmox.com/ Cc: David Riley <d.riley@proxmox.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2026-05-13 15:38:22 +02:00
Yang Xiuwei	5f7c7c63ff	io_uring/rw: drop unused attr_type_mask from io_prep_rw_pi() io_prep_rw_pi() never used the attr_type_mask argument. Callers already validate sqe->attr_type_mask before invoking the helper (only IORING_RW_ATTR_FLAG_PI is supported today). Remove the dead parameter to avoid implying further interpretation happens here. Signed-off-by: Yang Xiuwei <yangxiuwei@kylinos.cn> Link: https://patch.msgid.link/20260513094303.866533-1-yangxiuwei@kylinos.cn Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-05-13 06:35:57 -06:00
Niklas Söderlund	f5b2772d14	net: ethernet: ravb: Do not check URAM suspension when WoL is active When updating the driver to match latest datasheet to suspend access to URAM when suspending DMA transfers a corner-case was missed, URAM access will not be suspended if WoL is enabled. This lead to the error message (correctly) being triggered as URAM access is not suspended even tho it's requested as part of stopping DMA. Avoid checking if URAM access is suspended and printing the error message if WoL is enabled when we suspend the system, as we know it will not be. Reported-by: Geert Uytterhoeven <geert@linux-m68k.org> Closes: https://lore.kernel.org/all/CAMuHMdWnjV%3DHGE1o08zLhUfTgOSene5fYx1J5GG10mB%2BToq8qg@mail.gmail.com/ Fixes: `353d8e7989` ("net: ethernet: ravb: Suspend and resume the transmission flow") Signed-off-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se> Reviewed-by: Sai Krishna <saikrishnag@marvell.com> Tested-by: Geert Uytterhoeven <geert+renesas@glider.be> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-12 18:46:19 -07:00
Chenguang Zhao	3d042592eb	ethtool: fix ethnl_bitmap32_not_zero() bit interval semantics ethnl_bitmap32_not_zero() should return true if some bit in [start, end) is set: - Fix inverted memchr_inv() sense: return true when the scan finds a non-zero byte, not when the middle words are all zero. - Return false for an empty interval (end <= start). - When end is 32-bit aligned, indices in [start, end) do not include any bits from map[end_word]; return false after earlier checks found no non-zero data. Fixes: `10b518d4e6` ("ethtool: netlink bitset handling") Signed-off-by: Chenguang Zhao <zhaochenguang@kylinos.cn> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-12 18:45:13 -07:00
Xiang Mei	7bf563badd	net/smc: avoid NULL deref of conn->lnk in smc_msg_event tracepoint The smc_msg_event tracepoint class, shared by smc_tx_sendmsg and smc_rx_recvmsg, unconditionally dereferences smc->conn.lnk: __string(name, smc->conn.lnk->ibname) conn->lnk is only set for SMC-R; for SMC-D it is NULL. Other code on these paths already handles this (e.g. !conn->lnk in SMC_STAT_RMB_TX_SIZE_SMALL()). With the tracepoint enabled, the first sendmsg()/recvmsg() on an SMC-D socket crashes: Oops: general protection fault, probably for non-canonical address KASAN: null-ptr-deref in range [...] RIP: 0010:strlen+0x1e/0xa0 Call Trace: trace_event_raw_event_smc_msg_event (net/smc/smc_tracepoint.h:44) smc_rx_recvmsg (net/smc/smc_rx.c:515) smc_recvmsg (net/smc/af_smc.c:2859) __sys_recvfrom (net/socket.c:2315) __x64_sys_recvfrom (net/socket.c:2326) do_syscall_64 The faulting address 0x3e0 is offsetof(struct smc_link, ibname), confirming the NULL ->lnk deref. Enabling the tracepoint requires root, but the trigger itself is unprivileged: socket(AF_SMC, ...) has no capability check, and SMC-D negotiation needs no admin step on s390 or on x86 with the loopback ISM device loaded. Log an empty device name for SMC-D instead of dereferencing NULL. Fixes: `aff3083f10` ("net/smc: Introduce tracepoints for tx and rx msg") Reported-by: Weiming Shi <bestswngs@gmail.com> Signed-off-by: Xiang Mei <xmei5@asu.edu> Reviewed-by: Dust Li <dust.li@linux.alibaba.com> Reviewed-by: Sidraya Jayagond <sidraya@linux.ibm.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-12 18:44:36 -07:00
Nicolò Coccia	a3fdd924d8	net/smc: fix sleep-inside-lock in __smc_setsockopt() causing local DoS A logic flaw in __smc_setsockopt() allows a local unprivileged user to cause a Denial of Service (DoS) by holding the socket lock indefinitely. The function __smc_setsockopt() calls copy_from_sockptr() while holding lock_sock(sk). By passing a userfaultfd-monitored memory page (or FUSE-backed memory on systems where unprivileged userfaultfd is disabled) as the optval, an attacker can halt execution during the copy operation, keeping the lock held. Combined with asynchronous tear-down operations like shutdown(), this exhausts the kernel wq (kworkers) and triggers the hung task watchdog. [ 240.123456] INFO: task kworker/u8:2 blocked for more than 120 seconds. [ 240.123489] Call Trace: [ 240.123501] smc_shutdown+... [ 240.123512] lock_sock_nested+... This patch moves the user-space copy outside the lock_sock() critical section to prevent the issue. Fixes: `a6a6fe27ba` ("net/smc: Dynamic control handshake limitation by socket options") Signed-off-by: Nicolò Coccia <n.coccia96@gmail.com> Reviewed-by: Dust Li <dust.li@linux.alibaba.com> Tested-by: Dust Li <dust.li@linux.alibaba.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-12 18:43:40 -07:00
Wei Yang	f9e2342046	net: atm: fix skb leak in sigd_send() default branch The default branch in sigd_send() calls sock_put() and returns -EINVAL without freeing the skb, while all other exit paths do so. Add the missing dev_kfree_skb() before sock_put() to fix the leak. Fixes: `1da177e4c3` ("Linux-2.6.12-rc2") Signed-off-by: Wei Yang <albinwyang@tencent.com> Link: https://patch.msgid.link/20260509122358.1102997-1-albin_yang@163.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-12 18:07:02 -07:00
David Carlier	e3adf69f8e	net: ethtool: phy: avoid NULL deref when PHY driver is unbound phydev->drv can become NULL while the phy_device is still attached to its net_device, namely after the PHY driver is unbound via sysfs: echo <mdio_id> > /sys/bus/mdio_bus/drivers/<phy_drv>/unbind phy_remove() clears phydev->drv but doesn't call phy_detach(), so the phy_device stays in the link topology xarray and ethnl_req_get_phydev() still hands it back. ETHTOOL_MSG_PHY_GET then oopses on: rep_data->drvname = kstrdup(phydev->drv->name, GFP_KERNEL); drvname is already treated as optional by phy_reply_size(), phy_fill_reply() and phy_cleanup_data(), so just skip the allocation when there is no driver bound. Fixes: `9dd2ad5e92` ("net: ethtool: phy: Convert the PHY_GET command to generic phy dump") Cc: stable@vger.kernel.org # 6.13.x Signed-off-by: David Carlier <devnexen@gmail.com> Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Tested-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Link: https://patch.msgid.link/20260509215046.107157-1-devnexen@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-12 17:45:12 -07:00
Zoran Ilievski	2c308cf342	net: atlantic: preserve PCI wake-from-D3 on shutdown when WOL enabled The shutdown handler aq_pci_shutdown() unconditionally calls pci_wake_from_d3(pdev, false), clearing the PCI PME_En bit even when wake-on-LAN has been configured. While aq_nic_shutdown() correctly programs the NIC firmware via aq_nic_set_power() to listen for magic packets, the PCI subsystem will not propagate the resulting PME wake event from D3, so the system never wakes after poweroff. WOL from suspend (S3) is unaffected because aq_suspend_common() does not touch pci_wake_from_d3() and relies on the PM core's wake configuration via device_may_wakeup(). This affects all atlantic-supported NICs (AQC107/108/111/112/113); users have reported that WOL works if the atlantic driver is never loaded, but breaks once it has run its shutdown path. Pass the configured WOL state to pci_wake_from_d3() instead of a literal false, so the PCI PME_En bit is preserved when the user has armed WOL via ethtool. Fixes: `90869ddfef` ("net: aquantia: Implement pci shutdown callback") Cc: stable@vger.kernel.org Signed-off-by: Zoran Ilievski <goodboy@rexbytes.com> Reviewed-by: Sukhdeep Singh <sukhdeeps@marvell.com> Link: https://patch.msgid.link/20260511064002.1857-1-goodboy@rexbytes.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-12 16:41:41 -07:00
Tejun Heo	cceb874eee	sched_ext: Defer sub_kset base put to scx_sched_free_rcu_work scx_sub_enable_workfn() pins parent->kobj before dropping scx_sched_lock, but that does not pin parent->sub_kset. Concurrent disable can kset_unregister and free sub_kset before scx_alloc_and_add_sched() dereferences it. Split sub_kset teardown: kobject_del() at disable keeps sysfs removal; defer kobject_put() to scx_sched_free_rcu_work so the memory survives. A racing child sees state_in_sysfs=0 with valid memory, sysfs_create_dir() fails, and the existing exit_kind gate in scx_link_sched() turns it away with -ENOENT. Fixes: `411d3ef1a7` ("sched_ext: Unregister sub_kset on scheduler disable") Signed-off-by: Tejun Heo <tj@kernel.org>	2026-05-12 11:28:56 -10:00

1 2 3 4 5 ...

1445315 Commits