linux

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2026-02-26 06:23:09 -05:00

Author	SHA1	Message	Date
John Ogness	bebd87ae27	printk: Track nbcon consoles Add a global flag @have_nbcon_console to identify if any nbcon consoles are registered. This will be used in follow-up commits to preserve legacy behavior when no nbcon consoles are registered. Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20240820063001.36405-29-john.ogness@linutronix.de Signed-off-by: Petr Mladek <pmladek@suse.com>	2024-08-21 14:56:25 +02:00
John Ogness	60013065fd	printk: Avoid console_lock dance if no legacy or boot consoles Currently the console lock is used to attempt legacy-type printing even if there are no legacy or boot consoles registered. If no such consoles are registered, the console lock does not need to be taken. Add tracking of legacy console registration and use it with boot console tracking to avoid unnecessary code paths, i.e. do not use the console lock if there are no boot consoles and no legacy consoles. Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20240820063001.36405-28-john.ogness@linutronix.de Signed-off-by: Petr Mladek <pmladek@suse.com>	2024-08-21 14:56:24 +02:00
John Ogness	5dde3b7354	printk: nbcon: Add unsafe flushing on panic Add nbcon_atomic_flush_unsafe() to flush all nbcon consoles using the write_atomic() callback and allowing unsafe hostile takeovers. Call this at the end of panic() as a final attempt to flush any pending messages. Note that legacy consoles use unsafe methods for flushing from the beginning of panic (see bust_spinlocks()). Therefore, systems using both legacy and nbcon consoles may still fail to see panic messages due to unsafe legacy console usage. Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20240820063001.36405-27-john.ogness@linutronix.de Signed-off-by: Petr Mladek <pmladek@suse.com>	2024-08-21 14:56:24 +02:00
John Ogness	d2e85ca7a7	printk: Flush nbcon consoles first on panic In console_flush_on_panic(), flush the nbcon consoles before flushing legacy consoles. The legacy write() callbacks are not fully safe when oops_in_progress is set. Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20240820063001.36405-26-john.ogness@linutronix.de Signed-off-by: Petr Mladek <pmladek@suse.com>	2024-08-21 14:56:24 +02:00
John Ogness	8ba77712a7	printk: nbcon: Flush new records on device_release() There may be new records that were added while a driver was holding the nbcon context for non-printing purposes. These new records must be flushed by the nbcon_device_release() context because no other context will do it. If boot consoles are registered, the legacy loop is used (either direct or per irq_work) to handle the flushing. Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20240820063001.36405-25-john.ogness@linutronix.de Signed-off-by: Petr Mladek <pmladek@suse.com>	2024-08-21 14:56:24 +02:00
John Ogness	70411bf8d2	printk: Add is_printk_legacy_deferred() If printk has been explicitly deferred or is called from NMI context, legacy console printing must be deferred to an irq_work context. Introduce a helper function is_printk_legacy_deferred() for a CPU to query if it must defer legacy console printing. In follow-up commits this helper will be needed at other call sites as well. Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20240820063001.36405-24-john.ogness@linutronix.de Signed-off-by: Petr Mladek <pmladek@suse.com>	2024-08-21 14:56:24 +02:00
John Ogness	c158834b22	printk: nbcon: Use nbcon consoles in console_flush_all() Allow nbcon consoles to print messages in the legacy printk() caller context (printing via unlock) by integrating them into console_flush_all(). The write_atomic() callback is used for printing. Provide nbcon_legacy_emit_next_record(), which acts as the nbcon variant of console_emit_next_record(). Call this variant within console_flush_all() for nbcon consoles. Since nbcon consoles use their own @nbcon_seq variable to track the next record to print, this also must be appropriately handled in console_flush_all(). Note that the legacy printing logic uses @handover to detect handovers for printing all consoles. For nbcon consoles, handovers/takeovers occur on a per-console basis and thus do not cause the console_flush_all() loop to abort. Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20240820063001.36405-23-john.ogness@linutronix.de Signed-off-by: Petr Mladek <pmladek@suse.com>	2024-08-21 14:56:24 +02:00
John Ogness	97ea9bccfc	printk: Track registered boot consoles Unfortunately it is not known if a boot console and a regular (legacy or nbcon) console use the same hardware. For this reason they must not be allowed to print simultaneously. For legacy consoles this is not an issue because they are already synchronized with the boot consoles using the console lock. However nbcon consoles can be triggered separately. Add a global flag @have_boot_console to identify if any boot consoles are registered. This will be used in follow-up commits to ensure that boot consoles and nbcon consoles cannot print simultaneously. Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20240820063001.36405-22-john.ogness@linutronix.de Signed-off-by: Petr Mladek <pmladek@suse.com>	2024-08-21 14:56:24 +02:00
Thomas Gleixner	d3a9f82ec5	printk: nbcon: Provide function to flush using write_atomic() Provide nbcon_atomic_flush_pending() to perform flushing of all registered nbcon consoles using their write_atomic() callback. Unlike console_flush_all(), nbcon_atomic_flush_pending() will only flush up through the newest record at the time of the call. This prevents a CPU from printing unbounded when other CPUs are adding records. If new records are added while flushing, it is expected that the dedicated printer threads will print those records. If the printer thread is not available (which is always the case at this point in the rework), nbcon_atomic_flush_pending() _will_ flush all records in the ringbuffer. Unlike console_flush_all(), nbcon_atomic_flush_pending() will fully flush one console before flushing the next. This helps to guarantee that a block of pending records (such as a stack trace in an emergency situation) can be printed atomically at once before releasing console ownership. nbcon_atomic_flush_pending() is safe in any context because it uses write_atomic() and acquires with unsafe_takeover disabled. Co-developed-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: Thomas Gleixner (Intel) <tglx@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20240820063001.36405-21-john.ogness@linutronix.de Signed-off-by: Petr Mladek <pmladek@suse.com>	2024-08-21 14:56:24 +02:00
John Ogness	06683a6649	printk: nbcon: Add helper to assign priority based on CPU state Add a helper function to use the current state of the CPU to determine which priority to assign to the printing context. The EMERGENCY priority handling is added in a follow-up commit. It will use a per-CPU variable. Note: nbcon_device_try_acquire(), which is used by console drivers to acquire the nbcon console for non-printing activities, is hard-coded to always use NORMAL priority. Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20240820063001.36405-20-john.ogness@linutronix.de Signed-off-by: Petr Mladek <pmladek@suse.com>	2024-08-21 14:56:24 +02:00
John Ogness	fc400d5f63	printk: Add @flags argument for console_is_usable() The caller of console_is_usable() usually needs @console->flags for its own checks. Rather than having console_is_usable() read its own copy, make the caller pass in the @flags. This also ensures that the caller saw the same @flags value. Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20240820063001.36405-19-john.ogness@linutronix.de Signed-off-by: Petr Mladek <pmladek@suse.com>	2024-08-21 14:56:24 +02:00
John Ogness	20846d1ce2	printk: Let console_is_usable() handle nbcon The nbcon consoles use a different printing callback. For nbcon consoles, check for the write_atomic() callback instead of write(). Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20240820063001.36405-18-john.ogness@linutronix.de Signed-off-by: Petr Mladek <pmladek@suse.com>	2024-08-21 14:56:24 +02:00
John Ogness	864c25c83d	printk: Make console_is_usable() available to nbcon.c Move console_is_usable() as-is into internal.h so that it can be used by nbcon printing functions as well. Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20240820063001.36405-17-john.ogness@linutronix.de Signed-off-by: Petr Mladek <pmladek@suse.com>	2024-08-21 14:56:23 +02:00
John Ogness	1c17ebb790	printk: nbcon: Do not rely on proxy headers The headers kernel.h, serial_core.h, and console.h allow for the definitions of many types and functions from other headers. Rather than relying on these as proxy headers, explicitly include all headers providing needed definitions. Also sort the list alphabetically to be able to easily detect duplicates. Suggested-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Acked-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20240820063001.36405-16-john.ogness@linutronix.de Signed-off-by: Petr Mladek <pmladek@suse.com>	2024-08-21 14:56:23 +02:00
John Ogness	adf6f37d14	nbcon: Add API to acquire context for non-printing operations Provide functions nbcon_device_try_acquire() and nbcon_device_release() which will try to acquire the nbcon console ownership with NBCON_PRIO_NORMAL and mark it unsafe for handover/takeover. These functions are to be used together with the device-specific locking when performing non-printing activities on the console device. They will allow synchronization against the atomic_write() callback which will be serialized, for higher priority contexts, only by acquiring the console context ownership. Pitfalls: The API requires to be called in a context with migration disabled because it uses per-CPU variables internally. The context is set unsafe for a takeover all the time. It guarantees full serialization against any atomic_write() caller except for the final flush in panic() which might try an unsafe takeover. Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20240820063001.36405-14-john.ogness@linutronix.de Signed-off-by: Petr Mladek <pmladek@suse.com>	2024-08-21 14:56:23 +02:00
John Ogness	e55c3bcf38	printk: nbcon: Use driver synchronization while (un)registering Console drivers typically have to deal with access to the hardware via user input/output (such as an interactive login shell) and output of kernel messages via printk() calls. They use some classic driver-specific locking mechanism in most situations. But console->write_atomic() callbacks, used by nbcon consoles, are synchronized only by acquiring the console context. The synchronization via the console context ownership is possible only when the console driver is registered. It is when a particular device driver is connected with a particular console driver. The two synchronization mechanisms must be synchronized between each other. It is tricky because the console context ownership is quite special. It might be taken over by a higher priority context. Also CPU migration must be disabled. The most tricky part is to (dis)connect these two mechanisms during the console (un)registration. Use the driver-specific locking callbacks: device_lock(), device_unlock(). They allow taking the device-specific lock while the device is being (un)registered by the related console driver. For example, these callbacks lock/unlock the port lock for serial port drivers. Note that the driver-specific locking is only needed during (un)register if it is an nbcon console with the write_atomic() callback implemented. If write_atomic() is not implemented, the driver should never attempt to access the hardware without first acquiring its driver-specific lock. Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20240820063001.36405-10-john.ogness@linutronix.de Signed-off-by: Petr Mladek <pmladek@suse.com>	2024-08-21 14:56:23 +02:00
John Ogness	b7049d88c1	printk: nbcon: Remove return value for write_atomic() The return value of write_atomic() does not provide any useful information. On the contrary, it makes things more complicated for the caller to appropriately deal with the information. Change write_atomic() to not have a return value. If the message did not get printed due to loss of ownership, the caller will notice this on its own. If ownership was not lost, it will be assumed that the driver successfully printed the message and the sequence number for that console will be incremented. Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20240820063001.36405-7-john.ogness@linutronix.de Signed-off-by: Petr Mladek <pmladek@suse.com>	2024-08-21 14:56:23 +02:00
John Ogness	8c9dab2c55	printk: nbcon: Clarify rules of the owner/waiter matching The functions nbcon_owner_matches() and nbcon_waiter_matches() use a minimal set of data to determine if a context matches. The existing kerneldoc and comments were not clear enough and caused the printk folks to re-prove that the functions are indeed reliable in all cases. Update and expand the explanations so that it is clear that the implementations are sufficient for all cases. Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20240820063001.36405-6-john.ogness@linutronix.de Signed-off-by: Petr Mladek <pmladek@suse.com>	2024-08-21 14:56:22 +02:00
Sebastian Andrzej Siewior	0e1d5731d3	printk: Check printk_deferred_enter()/_exit() usage Add validation that printk_deferred_enter()/_exit() are called in non-migration contexts. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20240820063001.36405-5-john.ogness@linutronix.de Signed-off-by: Petr Mladek <pmladek@suse.com>	2024-08-21 14:56:22 +02:00
Petr Mladek	d3ff380d47	printk: Properly deal with nbcon consoles on seq init If a non-boot console is registering and boot consoles exist, the consoles are flushed before being unregistered. This allows the non-boot console to continue where the boot console left off. If for whatever reason flushing fails, the lowest seq found from any of the enabled boot consoles is used. Until now con->seq was checked. However, if it is an nbcon boot console, the function nbcon_seq_read() must be used to read seq because con->seq is not updated for nbcon consoles. Check if it is an nbcon boot console and if so call nbcon_seq_read() to read seq. Also, avoid usage of con->seq as temporary storage of the starting record. Instead, rename console_init_seq() to get_init_console_seq() and just return the value. For nbcon consoles set the sequence via nbcon_seq_force(), for legacy consoles set con->seq. The cleaned design should make sure that the value stays and is set before the console is added to the console list. It also unifies the sequence number initialization for legacy and nbcon consoles. Reviewed-by: John Ogness <john.ogness@linutronix.de> Link: https://lore.kernel.org/r/20240820063001.36405-4-john.ogness@linutronix.de Signed-off-by: Petr Mladek <pmladek@suse.com>	2024-08-21 14:56:22 +02:00
John Ogness	f37b105fae	printk: nbcon: Consolidate alloc() and init() Rather than splitting the nbcon allocation and initialization into two pieces, perform all initialization in nbcon_alloc(). Later, the initial sequence is calculated and can be explicitly set using nbcon_seq_force(). This removes the need for the strong rules of nbcon_init() that even included a BUG_ON(). Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20240820063001.36405-3-john.ogness@linutronix.de Signed-off-by: Petr Mladek <pmladek@suse.com>	2024-08-21 14:56:22 +02:00
John Ogness	eda25860bf	printk: Add notation to console_srcu locking kernel/printk/printk.c:284:5: sparse: sparse: context imbalance in 'console_srcu_read_lock' - wrong count at exit include/linux/srcu.h:301:9: sparse: sparse: context imbalance in 'console_srcu_read_unlock' - unexpected unlock Fixes: `6c4afa7914` ("printk: Prepare for SRCU console list protection") Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Acked-by: Paul E. McKenney <paulmck@kernel.org> Link: https://lore.kernel.org/r/20240820063001.36405-2-john.ogness@linutronix.de Signed-off-by: Petr Mladek <pmladek@suse.com>	2024-08-21 14:56:22 +02:00
Matthew Brost	9b59a85a84	workqueue: Don't call va_start / va_end twice Calling va_start / va_end multiple times is undefined and causes problems with certain compiler / platforms. Change alloc_ordered_workqueue_lockdep_map to a macro and updated __alloc_workqueue to take a va_list argument. Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Tejun Heo <tj@kernel.org> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2024-08-20 09:38:39 -10:00
Yipeng Zou	9ad2861b77	sched_ext: Allow dequeue_task_scx to fail Since dequeue_task() allowed to fail, there is a compile error: kernel/sched/ext.c:3630:19: error: initialization of ‘bool ()(struct rq, struct task_struct , int)’ {aka ‘_Bool ()(struct rq , struct task_struct , int)’} from incompatible pointer type ‘void ()(struct rq, struct task_struct *, int)’ 3630 \| .dequeue_task = dequeue_task_scx, \| ^~~~~~~~~~~~~~~~ Allow dequeue_task_scx to fail too. Fixes: `863ccdbb91` ("sched: Allow sched_class::dequeue_task() to fail") Signed-off-by: Yipeng Zou <zouyipeng@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2024-08-20 09:09:01 -10:00
Tejun Heo	5ac998574f	Merge branch 'tip/sched/core' into for-6.12 To receive `863ccdbb91` ("sched: Allow sched_class::dequeue_task() to fail") which makes sched_class.dequeue_task() return bool instead of void. This leads to compile breakage and will be fixed by a follow-up patch. Signed-off-by: Tejun Heo <tj@kernel.org>	2024-08-20 08:55:26 -10:00
Chen Ridong	3c2acae888	cgroup/cpuset: remove use_parent_ecpus of cpuset use_parent_ecpus is used to track whether the children are using the parent's effective_cpus. When a parent's effective_cpus is changed due to changes in a child partition's effective_xcpus, any child using parent'effective_cpus must call update_cpumasks_hier. However, if a child is not a valid partition, it is sufficient to determine whether to call update_cpumasks_hier based on whether the child's effective_cpus is going to change. To make the code more succinct, it is suggested to remove use_parent_ecpus. Signed-off-by: Chen Ridong <chenridong@huawei.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2024-08-20 08:51:48 -10:00
Chen Ridong	9414f68d45	cgroup/cpuset: remove fetch_xcpus Both fetch_xcpus and user_xcpus functions are used to retrieve the value of exclusive_cpus. If exclusive_cpus is not set, cpus_allowed is the implicit value used as exclusive in a local partition. I can not imagine a scenario where effective_xcpus is not empty when exclusive_cpus is empty. Therefore, I suggest removing the fetch_xcpus function. Signed-off-by: Chen Ridong <chenridong@huawei.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2024-08-20 08:51:34 -10:00
Chen Ridong	e55f45b4ba	cgroup/cpuset: Correct invalid remote parition prs When enable a remote partition, I found that: cd /sys/fs/cgroup/ mkdir test mkdir test/test1 echo +cpuset > cgroup.subtree_control echo +cpuset > test/cgroup.subtree_control echo 3 > test/test1/cpuset.cpus echo root > test/test1/cpuset.cpus.partition cat test/test1/cpuset.cpus.partition root invalid (Parent is not a partition root) The parent of a remote partition could not be a root. This is due to the emtpy effective_xcpus. It would be better to prompt the message "invalid cpu list in cpuset.cpus.exclusive". Signed-off-by: Chen Ridong <chenridong@huawei.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2024-08-20 08:51:16 -10:00
Caleb Sander Mateos	e68ac2b488	softirq: Remove unused 'action' parameter from action callback When soft interrupt actions are called, they are passed a pointer to the struct softirq action which contains the action's function pointer. This pointer isn't useful, as the action callback already knows what function it is. And since each callback handles a specific soft interrupt, the callback also knows which soft interrupt number is running. No soft interrupt action callback actually uses this parameter, so remove it from the function pointer signature. This clarifies that soft interrupt actions are global routines and makes it slightly cheaper to call them. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Jens Axboe <axboe@kernel.dk> Link: https://lore.kernel.org/all/20240815171549.3260003-1-csander@purestorage.com	2024-08-20 17:13:40 +02:00
Matti Vaittinen	24d02c4e53	irqdomain: Always associate interrupts for legacy domains The unification of irq_domain_create_legacy() missed the fact that interrupts must be associated even when the Linux interrupt number provided in the first_irq argument is 0. This breaks all call sites of irq_domain_create_legacy() which supply 0 as the first_irq argument. Enforce the association for legacy domains in __irq_domain_instantiate() to cure this. [ tglx: Massaged it slightly. ] Fixes: `70114e7f75` ("irqdomain: Simplify simple and legacy domain creation") Reported-by: Jiaxun Yang <jiaxun.yang@flygoat.com> Signed-off-by Matti Vaittinen <mazziesaccount@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Jiaxun Yang <jiaxun.yang@flygoat.com> Link: https://lore.kernel.org/all/c3379142-10bc-4f14-b8ac-a46927aeac38@gmail.com	2024-08-20 17:12:43 +02:00
Thomas Gleixner	ea72ce5da2	x86/kaslr: Expose and use the end of the physical memory address space iounmap() on x86 occasionally fails to unmap because the provided valid ioremap address is not below high_memory. It turned out that this happens due to KASLR. KASLR uses the full address space between PAGE_OFFSET and vaddr_end to randomize the starting points of the direct map, vmalloc and vmemmap regions. It thereby limits the size of the direct map by using the installed memory size plus an extra configurable margin for hot-plug memory. This limitation is done to gain more randomization space because otherwise only the holes between the direct map, vmalloc, vmemmap and vaddr_end would be usable for randomizing. The limited direct map size is not exposed to the rest of the kernel, so the memory hot-plug and resource management related code paths still operate under the assumption that the available address space can be determined with MAX_PHYSMEM_BITS. request_free_mem_region() allocates from (1 << MAX_PHYSMEM_BITS) - 1 downwards. That means the first allocation happens past the end of the direct map and if unlucky this address is in the vmalloc space, which causes high_memory to become greater than VMALLOC_START and consequently causes iounmap() to fail for valid ioremap addresses. MAX_PHYSMEM_BITS cannot be changed for that because the randomization does not align with address bit boundaries and there are other places which actually require to know the maximum number of address bits. All remaining usage sites of MAX_PHYSMEM_BITS have been analyzed and found to be correct. Cure this by exposing the end of the direct map via PHYSMEM_END and use that for the memory hot-plug and resource management related places instead of relying on MAX_PHYSMEM_BITS. In the KASLR case PHYSMEM_END maps to a variable which is initialized by the KASLR initialization and otherwise it is based on MAX_PHYSMEM_BITS as before. To prevent future hickups add a check into add_pages() to catch callers trying to add memory above PHYSMEM_END. Fixes: `0483e1fa6e` ("x86/mm: Implement ASLR for kernel memory regions") Reported-by: Max Ramanouski <max8rr8@gmail.com> Reported-by: Alistair Popple <apopple@nvidia.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-By: Max Ramanouski <max8rr8@gmail.com> Tested-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: Kees Cook <kees@kernel.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/all/87ed6soy3z.ffs@tglx	2024-08-20 13:44:57 +02:00
Matteo Croce	7f6287417b	bpf: Allow bpf_current_task_under_cgroup() with BPF_CGROUP_* The helper bpf_current_task_under_cgroup() currently is only allowed for tracing programs, allow its usage also in the BPF_CGROUP_* program types. Move the code from kernel/trace/bpf_trace.c to kernel/bpf/helpers.c, so it compiles also without CONFIG_BPF_EVENTS. This will be used in systemd-networkd to monitor the sysctl writes, and filter it's own writes from others: https://github.com/systemd/systemd/pull/32212 Signed-off-by: Matteo Croce <teknoraver@meta.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20240819162805.78235-3-technoboy85@gmail.com	2024-08-19 15:25:30 -07:00
Matteo Croce	67666479ed	bpf: Enable generic kfuncs for BPF_CGROUP_* programs These kfuncs are enabled even in BPF_PROG_TYPE_TRACING, so they should be safe also in BPF_CGROUP_* programs. Since all BPF_CGROUP_* programs share the same hook, call register_btf_kfunc_id_set() only once. In enum btf_kfunc_hook, rename BTF_KFUNC_HOOK_CGROUP_SKB to a more generic BTF_KFUNC_HOOK_CGROUP, since it's used for all the cgroup related program types. Signed-off-by: Matteo Croce <teknoraver@meta.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20240819162805.78235-2-technoboy85@gmail.com	2024-08-19 15:25:30 -07:00
Chen Ridong	d1a92d2d6c	cgroup: update some statememt about delegation The comment in cgroup_file_write is missing some interfaces, such as 'cgroup.threads'. All delegatable files are listed in '/sys/kernel/cgroup/delegate', so update the comment in cgroup_file_write. Besides, add a statement that files outside the namespace shouldn't be visible from inside the delegated namespace. tj: Reflowed text for consistency. Signed-off-by: Chen Ridong <chenridong@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2024-08-19 12:16:17 -10:00
Petr Pavlu	f94ce04e54	module: Clean up the description of MODULE_SIG_<type> The MODULE_SIG_<type> config choice has an inconsistent prompt styled as a question and lengthy option names. Simplify the prompt and option names to be consistent with other module options. Signed-off-by: Petr Pavlu <petr.pavlu@suse.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>	2024-08-19 15:11:20 -07:00
Petr Pavlu	c7ff693fa2	module: Split modules_install compression and in-kernel decompression The kernel configuration allows specifying a module compression mode. If one is selected then each module gets compressed during 'make modules_install' and additionally one can also enable support for a respective direct in-kernel decompression support. This means that the decompression support cannot be enabled without the automatic compression. Some distributions, such as the (open)SUSE family, use a signer service for modules. A build runs on a worker machine but signing is done by a separate locked-down server that is in possession of the signing key. The build invokes 'make modules_install' to create a modules tree, collects information about the modules, asks the signer service for their signature, appends each signature to the respective module and compresses all modules. When using this arrangment, the 'make modules_install' step produces unsigned+uncompressed modules and the distribution's own build recipe takes care of signing and compression later. The signing support can be currently enabled without automatically signing modules during 'make modules_install'. However, the in-kernel decompression support can be selected only after first enabling automatic compression during this step. To allow only enabling the in-kernel decompression support without the automatic compression during 'make modules_install', separate the compression options similarly to the signing options, as follows: > Enable loadable module support [] Module compression Module compression type (GZIP) ---> [] Automatically compress all modules [ ] Support in-kernel module decompression * "Module compression" (MODULE_COMPRESS) is a new main switch for the compression/decompression support. It replaces MODULE_COMPRESS_NONE. * "Module compression type" (MODULE_COMPRESS_<type>) chooses the compression type, one of GZ, XZ, ZSTD. * "Automatically compress all modules" (MODULE_COMPRESS_ALL) is a new option to enable module compression during 'make modules_install'. It defaults to Y. * "Support in-kernel module decompression" (MODULE_DECOMPRESS) enables in-kernel decompression. Signed-off-by: Petr Pavlu <petr.pavlu@suse.com> Acked-by: Masahiro Yamada <masahiroy@kernel.org> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>	2024-08-19 15:11:20 -07:00
Linus Torvalds	b0da640826	Merge tag 'printk-for-6.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux Pull printk fix from Petr Mladek: - Do not block printk on non-panic CPUs when they are dumping backtraces * tag 'printk-for-6.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux: printk/panic: Allow cpu backtraces to be written into ringbuffer during panic	2024-08-19 09:26:35 -07:00
Linus Torvalds	c3f2d783a4	Merge tag 'mm-hotfixes-stable-2024-08-17-19-34' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "16 hotfixes. All except one are for MM. 10 of these are cc:stable and the others pertain to post-6.10 issues. As usual with these merges, singletons and doubletons all over the place, no identifiable-by-me theme. Please see the lovingly curated changelogs to get the skinny" * tag 'mm-hotfixes-stable-2024-08-17-19-34' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: mm/migrate: fix deadlock in migrate_pages_batch() on large folios alloc_tag: mark pages reserved during CMA activation as not tagged alloc_tag: introduce clear_page_tag_ref() helper function crash: fix riscv64 crash memory reserve dead loop selftests: memfd_secret: don't build memfd_secret test on unsupported arches mm: fix endless reclaim on machines with unaccepted memory selftests/mm: compaction_test: fix off by one in check_compaction() mm/numa: no task_numa_fault() call if PMD is changed mm/numa: no task_numa_fault() call if PTE is changed mm/vmalloc: fix page mapping if vm_area_alloc_pages() with high order fallback to order 0 mm/memory-failure: use raw_spinlock_t in struct memory_failure_cpu mm: don't account memmap per-node mm: add system wide stats items category mm: don't account memmap on failure mm/hugetlb: fix hugetlb vs. core-mm PT locking mseal: fix is_madv_discard()	2024-08-17 19:50:16 -07:00
Linus Torvalds	810996a363	Merge tag 'powerpc-6.11-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc fixes from Michael Ellerman: - Fix crashes on 85xx with some configs since the recent hugepd rework. - Fix boot warning with hugepages and CONFIG_DEBUG_VIRTUAL on some platforms. - Don't enable offline cores when changing SMT modes, to match existing userspace behaviour. Thanks to Christophe Leroy, Dr. David Alan Gilbert, Guenter Roeck, Nysal Jan K.A, Shrikanth Hegde, Thomas Gleixner, and Tyrel Datwyler. * tag 'powerpc-6.11-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: powerpc/topology: Check if a core is online cpu/SMT: Enable SMT only if a core is online powerpc/mm: Fix boot warning with hugepages and CONFIG_DEBUG_VIRTUAL powerpc/mm: Fix size of allocated PGDIR soc: fsl: qbman: remove unused struct 'cgr_comp'	2024-08-17 19:23:02 -07:00
Peter Zijlstra	aef6987d89	sched/eevdf: Propagate min_slice up the cgroup hierarchy In the absence of an explicit cgroup slice configureation, make mixed slice length work with cgroups by propagating the min_slice up the hierarchy. This ensures the cgroup entity gets timely service to service its entities that have this timing constraint set on them. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Valentin Schneider <vschneid@redhat.com> Link: https://lkml.kernel.org/r/20240727105030.948188417@infradead.org	2024-08-17 11:06:46 +02:00
Peter Zijlstra	857b158dc5	sched/eevdf: Use sched_attr::sched_runtime to set request/slice suggestion Allow applications to directly set a suggested request/slice length using sched_attr::sched_runtime. The implementation clamps the value to: 0.1[ms] <= slice <= 100[ms] which is 1/10 the size of HZ=1000 and 10 times the size of HZ=100. Applications should strive to use their periodic runtime at a high confidence interval (95%+) as the target slice. Using a smaller slice will introduce undue preemptions, while using a larger value will increase latency. For all the following examples assume a scheduling quantum of 8, and for consistency all examples have W=4: {A,B,C,D}(w=1,r=8): ABCD... +---+---+---+--- t=0, V=1.5 t=1, V=3.5 A \|------< A \|------< B \|------< B \|------< C \|------< C \|------< D \|------< D \|------< ---+------+-------+--- ---+------+-------+--- t=2, V=5.5 t=3, V=7.5 A \|------< A \|------< B \|------< B \|------< C \|------< C \|------< D \|------< D \|------< ---+------+-------+--- ---+------+-------+--- Note: 4 identical tasks in FIFO order ~~~ {A,B}(w=1,r=16) C(w=2,r=16) AACCBBCC... +---+---+---+--- t=0, V=1.25 t=2, V=5.25 A \|--------------< A \|--------------< B \|--------------< B \|--------------< C \|------< C \|------< ---+------+-------+--- ---+------+-------+--- t=4, V=8.25 t=6, V=12.25 A \|--------------< A \|--------------< B \|--------------< B \|--------------< C \|------< C \|------< ---+--------------+--- ---+-------+------+--- Note: 1 heavy task -- because q=8, double r such that the deadline of the w=2 task doesn't go below q. Note: observe the full schedule becomes: Wmax(r_i/w_i) = 42q = 8q in length. Note: the period of the heavy task is half the full period at: W(r_i/w_i) = 4(2q/2) = 4q ~~~ {A,C,D}(w=1,r=16) B(w=1,r=8): BAACCBDD... +---+---+---+--- t=0, V=1.5 t=1, V=3.5 A \|--------------< A \|---------------< B \|------< B \|------< C \|--------------< C \|--------------< D \|--------------< D \|--------------< ---+------+-------+--- ---+------+-------+--- t=3, V=7.5 t=5, V=11.5 A \|---------------< A \|---------------< B \|------< B \|------< C \|--------------< C \|--------------< D \|--------------< D \|--------------< ---+------+-------+--- ---+-------+------+--- t=6, V=13.5 A \|---------------< B \|------< C \|--------------< D \|--------------< ---+-------+------+--- Note: 1 short task -- again double r so that the deadline of the short task won't be below q. Made B short because its not the leftmost task, but is eligible with the 0,1,2,3 spread. Note: like with the heavy task, the period of the short task observes: W(r_i/w_i) = 4(1q/1) = 4q ~~~ A(w=1,r=16) B(w=1,r=8) C(w=2,r=16) BCCAABCC... +---+---+---+--- t=0, V=1.25 t=1, V=3.25 A \|--------------< A \|--------------< B \|------< B \|------< C \|------< C \|------< ---+------+-------+--- ---+------+-------+--- t=3, V=7.25 t=5, V=11.25 A \|--------------< A \|--------------< B \|------< B \|------< C \|------< C \|------< ---+------+-------+--- ---+-------+------+--- t=6, V=13.25 A \|--------------< B \|------< C \|------< ---+-------+------+--- Note: 1 heavy and 1 short task -- combine them all. Note: both the short and heavy task end up with a period of 4q ~~~ A(w=1,r=16) B(w=2,r=16) C(w=1,r=8) BBCAABBC... +---+---+---+--- t=0, V=1 t=2, V=5 A \|--------------< A \|--------------< B \|------< B \|------< C \|------< C \|------< ---+------+-------+--- ---+------+-------+--- t=3, V=7 t=5, V=11 A \|--------------< A \|--------------< B \|------< B \|------< C \|------< C \|------< ---+------+-------+--- ---+-------+------+--- t=7, V=15 A \|--------------< B \|------< C \|------< ---+-------+------+--- Note: as before but permuted ~~~ From all this it can be deduced that, for the steady state: - the total period (P) of a schedule is: Wmax(r_i/w_i) - the average period of a task is: W*(r_i/w_i) - each task obtains the fair share: w_i/W of each full period P Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Valentin Schneider <vschneid@redhat.com> Link: https://lkml.kernel.org/r/20240727105030.842834421@infradead.org	2024-08-17 11:06:45 +02:00
Peter Zijlstra	85e511df3c	sched/eevdf: Allow shorter slices to wakeup-preempt Part of the reason to have shorter slices is to improve responsiveness. Allow shorter slices to preempt longer slices on wakeup. Task \| Runtime ms \| Switches \| Avg delay ms \| Max delay ms \| Sum delay ms \| 100ms massive_intr 500us cyclictest NO_PREEMPT_SHORT 1 massive_intr:(5) \| 846018.956 ms \| 779188 \| avg: 0.273 ms \| max: 58.337 ms \| sum:212545.245 ms \| 2 massive_intr:(5) \| 853450.693 ms \| 792269 \| avg: 0.275 ms \| max: 71.193 ms \| sum:218263.588 ms \| 3 massive_intr:(5) \| 843888.920 ms \| 771456 \| avg: 0.277 ms \| max: 92.405 ms \| sum:213353.221 ms \| 1 chromium-browse:(8) \| 53015.889 ms \| 131766 \| avg: 0.463 ms \| max: 36.341 ms \| sum:60959.230 ms \| 2 chromium-browse:(8) \| 53864.088 ms \| 136962 \| avg: 0.480 ms \| max: 27.091 ms \| sum:65687.681 ms \| 3 chromium-browse:(9) \| 53637.904 ms \| 132637 \| avg: 0.481 ms \| max: 24.756 ms \| sum:63781.673 ms \| 1 cyclictest:(5) \| 12615.604 ms \| 639689 \| avg: 0.471 ms \| max: 32.272 ms \| sum:301351.094 ms \| 2 cyclictest:(5) \| 12511.583 ms \| 642578 \| avg: 0.448 ms \| max: 44.243 ms \| sum:287632.830 ms \| 3 cyclictest:(5) \| 12545.867 ms \| 635953 \| avg: 0.475 ms \| max: 25.530 ms \| sum:302374.658 ms \| 100ms massive_intr 500us cyclictest PREEMPT_SHORT 1 massive_intr:(5) \| 839843.919 ms \| 837384 \| avg: 0.264 ms \| max: 74.366 ms \| sum:221476.885 ms \| 2 massive_intr:(5) \| 852449.913 ms \| 845086 \| avg: 0.252 ms \| max: 68.162 ms \| sum:212595.968 ms \| 3 massive_intr:(5) \| 839180.725 ms \| 836883 \| avg: 0.266 ms \| max: 69.742 ms \| sum:222812.038 ms \| 1 chromium-browse:(11) \| 54591.481 ms \| 138388 \| avg: 0.458 ms \| max: 35.427 ms \| sum:63401.508 ms \| 2 chromium-browse:(8) \| 52034.541 ms \| 132276 \| avg: 0.436 ms \| max: 31.826 ms \| sum:57732.958 ms \| 3 chromium-browse:(8) \| 55231.771 ms \| 141892 \| avg: 0.469 ms \| max: 27.607 ms \| sum:66538.697 ms \| 1 cyclictest:(5) \| 13156.391 ms \| 667412 \| avg: 0.373 ms \| max: 38.247 ms \| sum:249174.502 ms \| 2 cyclictest:(5) \| 12688.939 ms \| 665144 \| avg: 0.374 ms \| max: 33.548 ms \| sum:248509.392 ms \| 3 cyclictest:(5) \| 13475.623 ms \| 669110 \| avg: 0.370 ms \| max: 37.819 ms \| sum:247673.390 ms \| As per the numbers the, this makes cyclictest (short slice) it's max-delay more consistent and consistency drops the sum-delay. The trade-off is that the massive_intr (long slice) gets more context switches and a slight increase in sum-delay. Chunxin contributed did_preempt_short() where a task that lost slice protection from PREEMPT_SHORT gets rescheduled once it becomes in-eligible. [mike: numbers] Co-Developed-by: Chunxin Zang <zangchunxin@lixiang.com> Signed-off-by: Chunxin Zang <zangchunxin@lixiang.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Valentin Schneider <vschneid@redhat.com> Tested-by: Mike Galbraith <umgwanakikbuti@gmail.com> Link: https://lkml.kernel.org/r/20240727105030.735459544@infradead.org	2024-08-17 11:06:45 +02:00
Peter Zijlstra	82e9d0456e	sched/fair: Avoid re-setting virtual deadline on 'migrations' During OSPM24 Youssef noted that migrations are re-setting the virtual deadline. Notably everything that does a dequeue-enqueue, like setting nice, changing preferred numa-node, and a myriad of other random crap, will cause this to happen. This shouldn't be. Preserve the relative virtual deadline across such dequeue/enqueue cycles. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Tested-by: Valentin Schneider <vschneid@redhat.com> Link: https://lkml.kernel.org/r/20240727105030.625119246@infradead.org	2024-08-17 11:06:45 +02:00
Peter Zijlstra	fc1892becd	sched/eevdf: Fixup PELT vs DELAYED_DEQUEUE Note that tasks that are kept on the runqueue to burn off negative lag, are not in fact runnable anymore, they'll get dequeued the moment they get picked. As such, don't count this time towards runnable. Thanks to Valentin for spotting I had this backwards initially. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Tested-by: Valentin Schneider <vschneid@redhat.com> Link: https://lkml.kernel.org/r/20240727105030.514088302@infradead.org	2024-08-17 11:06:45 +02:00
Peter Zijlstra	54a58a7877	sched/fair: Implement DELAY_ZERO 'Extend' DELAY_DEQUEUE by noting that since we wanted to dequeued them at the 0-lag point, truncate lag (eg. don't let them earn positive lag). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Tested-by: Valentin Schneider <vschneid@redhat.com> Link: https://lkml.kernel.org/r/20240727105030.403750550@infradead.org	2024-08-17 11:06:44 +02:00
Peter Zijlstra	152e11f6df	sched/fair: Implement delayed dequeue Extend / fix `86bfbb7ce4` ("sched/fair: Add lag based placement") by noting that lag is fundamentally a temporal measure. It should not be carried around indefinitely. OTOH it should also not be instantly discarded, doing so will allow a task to game the system by purposefully (micro) sleeping at the end of its time quantum. Since lag is intimately tied to the virtual time base, a wall-time based decay is also insufficient, notably competition is required for any of this to make sense. Instead, delay the dequeue and keep the 'tasks' on the runqueue, competing until they are eligible. Strictly speaking, we only care about keeping them until the 0-lag point, but that is a difficult proposition, instead carry them around until they get picked again, and dequeue them at that point. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Tested-by: Valentin Schneider <vschneid@redhat.com> Link: https://lkml.kernel.org/r/20240727105030.226163742@infradead.org	2024-08-17 11:06:44 +02:00
Peter Zijlstra	e1459a50ba	sched: Teach dequeue_task() about special task states Since special task states must not suffer spurious wakeups, and the proposed delayed dequeue can cause exactly these (under some boundary conditions), propagate this knowledge into dequeue_task() such that it can do the right thing. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Tested-by: Valentin Schneider <vschneid@redhat.com> Link: https://lkml.kernel.org/r/20240727105030.110439521@infradead.org	2024-08-17 11:06:44 +02:00
Peter Zijlstra	a1c446611e	sched,freezer: Mark TASK_FROZEN special The special task states are those that do not suffer spurious wakeups, TASK_FROZEN is very much one of those, mark it as such. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Tested-by: Valentin Schneider <vschneid@redhat.com> Link: https://lkml.kernel.org/r/20240727105029.998329901@infradead.org	2024-08-17 11:06:44 +02:00
Peter Zijlstra	781773e3b6	sched/fair: Implement ENQUEUE_DELAYED Doing a wakeup on a delayed dequeue task is about as simple as it sounds -- remove the delayed mark and enjoy the fact it was actually still on the runqueue. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Tested-by: Valentin Schneider <vschneid@redhat.com> Link: https://lkml.kernel.org/r/20240727105029.888107381@infradead.org	2024-08-17 11:06:43 +02:00
Peter Zijlstra	f12e148892	sched/fair: Prepare pick_next_task() for delayed dequeue Delayed dequeue's natural end is when it gets picked again. Ensure pick_next_task() knows what to do with delayed tasks. Note, this relies on the earlier patch that made pick_next_task() state invariant -- it will restart the pick on dequeue, because obviously the just dequeued task is no longer eligible. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Tested-by: Valentin Schneider <vschneid@redhat.com> Link: https://lkml.kernel.org/r/20240727105029.747330118@infradead.org	2024-08-17 11:06:43 +02:00

... 6 7 8 9 10 ...

45917 Commits