linux

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2026-04-14 06:45:43 -04:00

Author	SHA1	Message	Date
Tejun Heo	e7a6395a88	sched_ext: Make scx_bpf_reenqueue_local() skip tasks that are being migrated When a running task is migrated to another CPU, the stop_task is used to preempt the running task and migrate it. This, expectedly, invokes ops.cpu_release(). If the BPF scheduler then calls scx_bpf_reenqueue_local(), it re-enqueues all tasks on the local DSQ including the task which is being migrated. This creates an unnecessary re-enqueue of a task which is about to be deactivated and re-activated for migration anyway. It can also cause confusion for the BPF scheduler as scx_bpf_task_cpu() of the task and its allowed CPUs may not agree while migration is pending. Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: `245254f708` ("sched_ext: Implement sched_ext_ops.cpu_acquire/release()") Acked-by: David Vernet <void@manifault.com>	2024-07-09 12:30:26 -10:00
Tejun Heo	fd0cf51695	sched_ext: Reimplement scx_bpf_reenqueue_local() scx_bpf_reenqueue_local() is used to re-enqueue tasks on the local DSQ from ops.cpu_release(). Because the BPF scheduler may dispatch tasks to the same local DSQ, to avoid processing the same tasks repeatedly, it first takes the number of queued tasks and processes the task at the head of the queue that number of times. This is incorrect as a task can be dispatched to the same local DSQ with SCX_ENQ_HEAD. Such a task will be processed repeatedly until the count is exhausted and the succeeding tasks won't be processed at all. Fix it by first moving all candidate tasks to a private list and then processing that list. While at it, remove the WARNs. They're rather superflous as later steps will check them anyway. Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: `245254f708` ("sched_ext: Implement sched_ext_ops.cpu_acquire/release()") Acked-by: David Vernet <void@manifault.com>	2024-07-09 12:30:26 -10:00
Paolo Abeni	7b769adc26	Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Daniel Borkmann says: ==================== pull-request: bpf-next 2024-07-08 The following pull-request contains BPF updates for your net-next tree. We've added 102 non-merge commits during the last 28 day(s) which contain a total of 127 files changed, 4606 insertions(+), 980 deletions(-). The main changes are: 1) Support resilient split BTF which cuts down on duplication and makes BTF as compact as possible wrt BTF from modules, from Alan Maguire & Eduard Zingerman. 2) Add support for dumping kfunc prototypes from BTF which enables both detecting as well as dumping compilable prototypes for kfuncs, from Daniel Xu. 3) Batch of s390x BPF JIT improvements to add support for BPF arena and to implement support for BPF exceptions, from Ilya Leoshkevich. 4) Batch of riscv64 BPF JIT improvements in particular to add 12-argument support for BPF trampolines and to utilize bpf_prog_pack for the latter, from Pu Lehui. 5) Extend BPF test infrastructure to add a CHECKSUM_COMPLETE validation option for skbs and add coverage along with it, from Vadim Fedorenko. 6) Inline bpf_get_current_task/_btf() helpers in the arm64 BPF JIT which gives a small 1% performance improvement in micro-benchmarks, from Puranjay Mohan. 7) Extend the BPF verifier to track the delta between linked registers in order to better deal with recent LLVM code optimizations, from Alexei Starovoitov. 8) Fix bpf_wq_set_callback_impl() kfunc signature where the third argument should have been a pointer to the map value, from Benjamin Tissoires. 9) Extend BPF selftests to add regular expression support for test output matching and adjust some of the selftest when compiled under gcc, from Cupertino Miranda. 10) Simplify task_file_seq_get_next() and remove an unnecessary loop which always iterates exactly once anyway, from Dan Carpenter. 11) Add the capability to offload the netfilter flowtable in XDP layer through kfuncs, from Florian Westphal & Lorenzo Bianconi. 12) Various cleanups in networking helpers in BPF selftests to shave off a few lines of open-coded functions on client/server handling, from Geliang Tang. 13) Properly propagate prog->aux->tail_call_reachable out of BPF verifier, so that x86 JIT does not need to implement detection, from Leon Hwang. 14) Fix BPF verifier to add a missing check_func_arg_reg_off() to prevent an out-of-bounds memory access for dynpointers, from Matt Bobrowski. 15) Fix bpf_session_cookie() kfunc to return __u64 instead of long pointer as it might lead to problems on 32-bit archs, from Jiri Olsa. 16) Enhance traffic validation and dynamic batch size support in xsk selftests, from Tushar Vyavahare. bpf-next-for-netdev * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (102 commits) selftests/bpf: DENYLIST.aarch64: Remove fexit_sleep selftests/bpf: amend for wrong bpf_wq_set_callback_impl signature bpf: helpers: fix bpf_wq_set_callback_impl signature libbpf: Add NULL checks to bpf_object__{prev_map,next_map} selftests/bpf: Remove exceptions tests from DENYLIST.s390x s390/bpf: Implement exceptions s390/bpf: Change seen_reg to a mask bpf: Remove unnecessary loop in task_file_seq_get_next() riscv, bpf: Optimize stack usage of trampoline bpf, devmap: Add .map_alloc_check selftests/bpf: Remove arena tests from DENYLIST.s390x selftests/bpf: Add UAF tests for arena atomics selftests/bpf: Introduce __arena_global s390/bpf: Support arena atomics s390/bpf: Enable arena s390/bpf: Support address space cast instruction s390/bpf: Support BPF_PROBE_MEM32 s390/bpf: Land on the next JITed instruction after exception s390/bpf: Introduce pre- and post- probe functions s390/bpf: Get rid of get_probe_mem_regno() ... ==================== Link: https://patch.msgid.link/20240708221438.10974-1-daniel@iogearbox.net Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2024-07-09 17:01:46 +02:00
Sebastian Andrzej Siewior	2b84def990	perf: Split __perf_pending_irq() out of perf_pending_irq() perf_pending_irq() invokes perf_event_wakeup() and __perf_pending_irq(). The former is in charge of waking any tasks which waits to be woken up while the latter disables perf-events. The irq_work perf_pending_irq(), while this an irq_work, the callback is invoked in thread context on PREEMPT_RT. This is needed because all the waking functions (wake_up_all(), kill_fasync()) acquire sleep locks which must not be used with disabled interrupts. Disabling events, as done by __perf_pending_irq(), expects a hardirq context and disabled interrupts. This requirement is not fulfilled on PREEMPT_RT. Split functionality based on perf_event::pending_disable into irq_work named `pending_disable_irq' and invoke it in hardirq context on PREEMPT_RT. Rename the split out callback to perf_pending_disable(). Reported-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Marco Elver <elver@google.com> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Link: https://lore.kernel.org/r/20240704170424.1466941-8-bigeasy@linutronix.de	2024-07-09 13:26:37 +02:00
Sebastian Andrzej Siewior	16b9569df9	perf: Don't disable preemption in perf_pending_task(). perf_pending_task() is invoked in task context and disables preemption because perf_swevent_get_recursion_context() used to access per-CPU variables. The other reason is to create a RCU read section while accessing the perf_event. The recursion counter is no longer a per-CPU accounter so disabling preemption is no longer required. The RCU section is needed and must be created explicit. Replace the preemption-disable section with a explicit RCU-read section. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Marco Elver <elver@google.com> Link: https://lore.kernel.org/r/20240704170424.1466941-7-bigeasy@linutronix.de	2024-07-09 13:26:36 +02:00
Sebastian Andrzej Siewior	0d40a6d83e	perf: Move swevent_htable::recursion into task_struct. The swevent_htable::recursion counter is used to avoid creating an swevent while an event is processed to avoid recursion. The counter is per-CPU and preemption must be disabled to have a stable counter. perf_pending_task() disables preemption to access the counter and then signal. This is problematic on PREEMPT_RT because sending a signal uses a spinlock_t which must not be acquired in atomic on PREEMPT_RT because it becomes a sleeping lock. The atomic context can be avoided by moving the counter into the task_struct. There is a 4 byte hole between futex_state (usually always on) and the following perf pointer (perf_event_ctxp). After the recursion lost some weight it fits perfectly. Move swevent_htable::recursion into task_struct. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Marco Elver <elver@google.com> Link: https://lore.kernel.org/r/20240704170424.1466941-6-bigeasy@linutronix.de	2024-07-09 13:26:36 +02:00
Sebastian Andrzej Siewior	5af42f928f	perf: Shrink the size of the recursion counter. There are four recursion counter, one for each context. The type of the counter is `int' but the counter is used as `bool' since it is only incremented if zero. The main goal here is to shrink the whole struct into 32bit int which can later be added task_struct into an existing hole. Reduce the type of the recursion counter to an unsigned char, keep the increment/ decrement operation. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Marco Elver <elver@google.com> Link: https://lore.kernel.org/r/20240704170424.1466941-5-bigeasy@linutronix.de	2024-07-09 13:26:35 +02:00
Sebastian Andrzej Siewior	c5d93d23a2	perf: Enqueue SIGTRAP always via task_work. A signal is delivered by raising irq_work() which works from any context including NMI. irq_work() can be delayed if the architecture does not provide an interrupt vector. In order not to lose a signal, the signal is injected via task_work during event_sched_out(). Instead going via irq_work, the signal could be added directly via task_work. The signal is sent to current and can be enqueued on its return path to userland. Queue signal via task_work and consider possible NMI context. Remove perf_event::pending_sigtrap and and use perf_event::pending_work instead. Reported-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Marco Elver <elver@google.com> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Link: https://lore.kernel.org/r/20240704170424.1466941-4-bigeasy@linutronix.de	2024-07-09 13:26:35 +02:00
Sebastian Andrzej Siewior	466e4d801c	task_work: Add TWA_NMI_CURRENT as an additional notify mode. Adding task_work from NMI context requires the following: - The kasan_record_aux_stack() is not NMU safe and must be avoided. - Using TWA_RESUME is NMI safe. If the NMI occurs while the CPU is in userland then it will continue in userland and not invoke the `work' callback. Add TWA_NMI_CURRENT as an additional notify mode. In this mode skip kasan and use irq_work in hardirq-mode to for needed interrupt. Set TIF_NOTIFY_RESUME within the irq_work callback due to k[ac]san instrumentation in test_and_set_bit() which does not look NMI safe in case of a report. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20240704170424.1466941-3-bigeasy@linutronix.de	2024-07-09 13:26:34 +02:00
Sebastian Andrzej Siewior	058244c683	perf: Move irq_work_queue() where the event is prepared. Only if perf_event::pending_sigtrap is zero, the irq_work accounted by increminging perf_event::nr_pending. The member perf_event::pending_addr might be overwritten by a subsequent event if the signal was not yet delivered and is expected. The irq_work will not be enqeueued again because it has a check to be only enqueued once. Move irq_work_queue() to where the counter is incremented and perf_event::pending_sigtrap is set to make it more obvious that the irq_work is scheduled once. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Marco Elver <elver@google.com> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Link: https://lore.kernel.org/r/20240704170424.1466941-2-bigeasy@linutronix.de	2024-07-09 13:26:34 +02:00
Frederic Weisbecker	3a5465418f	perf: Fix event leak upon exec and file release The perf pending task work is never waited upon the matching event release. In the case of a child event, released via free_event() directly, this can potentially result in a leaked event, such as in the following scenario that doesn't even require a weak IRQ work implementation to trigger: schedule() prepare_task_switch() =======> <NMI> perf_event_overflow() event->pending_sigtrap = ... irq_work_queue(&event->pending_irq) <======= </NMI> perf_event_task_sched_out() event_sched_out() event->pending_sigtrap = 0; atomic_long_inc_not_zero(&event->refcount) task_work_add(&event->pending_task) finish_lock_switch() =======> <IRQ> perf_pending_irq() //do nothing, rely on pending task work <======= </IRQ> begin_new_exec() perf_event_exit_task() perf_event_exit_event() // If is child event free_event() WARN(atomic_long_cmpxchg(&event->refcount, 1, 0) != 1) // event is leaked Similar scenarios can also happen with perf_event_remove_on_exec() or simply against concurrent perf_event_release(). Fix this with synchonizing against the possibly remaining pending task work while freeing the event, just like is done with remaining pending IRQ work. This means that the pending task callback neither need nor should hold a reference to the event, preventing it from ever beeing freed. Fixes: `517e6a301f` ("perf: Fix perf_pending_task() UaF") Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20240621091601.18227-5-frederic@kernel.org	2024-07-09 13:26:33 +02:00
Frederic Weisbecker	2fd5ad3f31	perf: Fix event leak upon exit When a task is scheduled out, pending sigtrap deliveries are deferred to the target task upon resume to userspace via task_work. However failures while adding an event's callback to the task_work engine are ignored. And since the last call for events exit happen after task work is eventually closed, there is a small window during which pending sigtrap can be queued though ignored, leaking the event refcount addition such as in the following scenario: TASK A ----- do_exit() exit_task_work(tsk); <IRQ> perf_event_overflow() event->pending_sigtrap = pending_id; irq_work_queue(&event->pending_irq); </IRQ> =========> PREEMPTION: TASK A -> TASK B event_sched_out() event->pending_sigtrap = 0; atomic_long_inc_not_zero(&event->refcount) // FAILS: task work has exited task_work_add(&event->pending_task) [...] <IRQ WORK> perf_pending_irq() // early return: event->oncpu = -1 </IRQ WORK> [...] =========> TASK B -> TASK A perf_event_exit_task(tsk) perf_event_exit_event() free_event() WARN(atomic_long_cmpxchg(&event->refcount, 1, 0) != 1) // leak event due to unexpected refcount == 2 As a result the event is never released while the task exits. Fix this with appropriate task_work_add()'s error handling. Fixes: `517e6a301f` ("perf: Fix perf_pending_task() UaF") Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20240621091601.18227-4-frederic@kernel.org	2024-07-09 13:26:33 +02:00
Frederic Weisbecker	f409530e4d	task_work: Introduce task_work_cancel() again Re-introduce task_work_cancel(), this time to cancel an actual callback and not any callback pointing to a given function. This is going to be needed for perf events event freeing. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20240621091601.18227-3-frederic@kernel.org	2024-07-09 13:26:32 +02:00
Frederic Weisbecker	68cbd415dd	task_work: s/task_work_cancel()/task_work_cancel_func()/ A proper task_work_cancel() API that actually cancels a callback and not any callback pointing to a given function is going to be needed for perf events event freeing. Do the appropriate rename to prepare for that. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20240621091601.18227-2-frederic@kernel.org	2024-07-09 13:26:31 +02:00
John Stultz	e81859fe64	locking/rwsem: Add __always_inline annotation to __down_write_common() and inlined callers Apparently despite it being marked inline, the compiler may not inline __down_write_common() which makes it difficult to identify the cause of lock contention, as the wchan of the blocked function will always be listed as __down_write_common(). So add __always_inline annotation to the common function (as well as the inlined helper callers) to force it to be inlined so a more useful blocking function will be listed (via wchan). This mirrors commit `92cc5d00a4` ("locking/rwsem: Add __always_inline annotation to __down_read_common() and inlined callers") which did the same for __down_read_common. I sort of worry that I'm playing wack-a-mole here, and talking with compiler people, they tell me inline means nothing, which makes me want to cry a little. So I'm wondering if we need to replace all the inlines with __always_inline, or remove them because either we mean something by it, or not. Fixes: `c995e638cc` ("locking/rwsem: Fold __down_{read,write}*()") Reported-by: Tim Murray <timmurray@google.com> Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Waiman Long <longman@redhat.com> Link: https://lkml.kernel.org/r/20240709060831.495366-1-jstultz@google.com	2024-07-09 13:26:26 +02:00
Yicong Yang	54624acf88	dma-mapping: benchmark: Don't starve others when doing the test The test thread will start N benchmark kthreads and then schedule out until the test time finished and notify the benchmark kthreads to stop. The benchmark kthreads will keep running until notified to stop. There's a problem with current implementation when the benchmark kthreads number is equal to the CPUs on a non-preemptible kernel: since the scheduler will balance the kthreads across the CPUs and when the test time's out the test thread won't get a chance to be scheduled on any CPU then cannot notify the benchmark kthreads to stop. This can be easily reproduced on a VM (simulated with 16 CPUs) with PREEMPT_VOLUNTARY: estuary:/mnt$ ./dma_map_benchmark -t 16 -s 1 rcu: INFO: rcu_sched self-detected stall on CPU rcu: 10-...!: (5221 ticks this GP) idle=ed24/1/0x4000000000000000 softirq=142/142 fqs=0 rcu: (t=5254 jiffies g=-559 q=45 ncpus=16) rcu: rcu_sched kthread starved for 5255 jiffies! g-559 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=12 rcu: Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior. rcu: RCU grace-period kthread stack dump: task:rcu_sched state:R running task stack:0 pid:16 tgid:16 ppid:2 flags:0x00000008 Call trace __switch_to+0xec/0x138 __schedule+0x2f8/0x1080 schedule+0x30/0x130 schedule_timeout+0xa0/0x188 rcu_gp_fqs_loop+0x128/0x528 rcu_gp_kthread+0x1c8/0x208 kthread+0xec/0xf8 ret_from_fork+0x10/0x20 Sending NMI from CPU 10 to CPUs 0: NMI backtrace for cpu 0 CPU: 0 PID: 332 Comm: dma-map-benchma Not tainted 6.10.0-rc1-vanilla-LSE #8 Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015 pstate: 20400005 (nzCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : arm_smmu_cmdq_issue_cmdlist+0x218/0x730 lr : arm_smmu_cmdq_issue_cmdlist+0x488/0x730 sp : ffff80008748b630 x29: ffff80008748b630 x28: 0000000000000000 x27: ffff80008748b780 x26: 0000000000000000 x25: 000000000000bc70 x24: 000000000001bc70 x23: ffff0000c12af080 x22: 0000000000010000 x21: 000000000000ffff x20: ffff80008748b700 x19: ffff0000c12af0c0 x18: 0000000000010000 x17: 0000000000000001 x16: 0000000000000040 x15: ffffffffffffffff x14: 0001ffffffffffff x13: 000000000000ffff x12: 00000000000002f1 x11: 000000000001ffff x10: 0000000000000031 x9 : ffff800080b6b0b8 x8 : ffff0000c2a48000 x7 : 000000000001bc71 x6 : 0001800000000000 x5 : 00000000000002f1 x4 : 01ffffffffffffff x3 : 000000000009aaf1 x2 : 0000000000000018 x1 : 000000000000000f x0 : ffff0000c12af18c Call trace: arm_smmu_cmdq_issue_cmdlist+0x218/0x730 __arm_smmu_tlb_inv_range+0xe0/0x1a8 arm_smmu_iotlb_sync+0xc0/0x128 __iommu_dma_unmap+0x248/0x320 iommu_dma_unmap_page+0x5c/0xe8 dma_unmap_page_attrs+0x38/0x1d0 map_benchmark_thread+0x118/0x2c0 kthread+0xec/0xf8 ret_from_fork+0x10/0x20 Solve this by adding scheduling point in the kthread loop, so if there're other threads in the system they may have a chance to run, especially the thread to notify the test end. However this may degrade the test concurrency so it's recommended to run this on an idle system. Signed-off-by: Yicong Yang <yangyicong@hisilicon.com> Acked-by: Barry Song <baohua@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>	2024-07-09 07:48:32 +02:00
Tejun Heo	650ba21b13	sched_ext: Implement DSQ iterator DSQs are very opaque in the consumption path. The BPF scheduler has no way of knowing which tasks are being considered and which is picked. This patch adds BPF DSQ iterator. - Allows iterating tasks queued on a DSQ in the dispatch order or reverse from anywhere using bpf_for_each(scx_dsq) or calling the iterator kfuncs directly. - Has ordering guarantee where only tasks which were already queued when the iteration started are visible and consumable during the iteration. v5: - Add a comment to the naked list_empty(&dsq->list) test in consume_dispatch_q() to explain the reasoning behind the lockless test and by extension why nldsq_next_task() isn't used there. - scx_qmap changes separated into its own patch. v4: - bpf_iter_scx_dsq_new() declaration in common.bpf.h was using the wrong type for the last argument (bool rev instead of u64 flags). Fix it. v3: - Alexei pointed out that the iterator is too big to allocate on stack. Added a prep patch to reduce the size of the cursor. Now bpf_iter_scx_dsq is 48 bytes and bpf_iter_scx_dsq_kern is 40 bytes on 64bit. - u32_before() comparison factored out. v2: - scx_bpf_consume_task() is separated out into a separate patch. - DSQ seq and iter flags don't need to be u64. Use u32. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: David Vernet <dvernet@meta.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Cc: bpf@vger.kernel.org	2024-07-08 14:30:55 -10:00
Tejun Heo	d4af01c373	sched_ext: Take out ->priq and ->flags from scx_dsq_node struct scx_dsq_node contains two data structure nodes to link the containing task to a DSQ and a flags field that is protected by the lock of the associated DSQ. One reason why they are grouped into a struct is to use the type independently as a cursor node when iterating tasks on a DSQ. However, when iterating, the cursor only needs to be linked on the FIFO list and the rb_node part ends up inflating the size of the iterator data structure unnecessarily making it potentially too expensive to place it on stack. Take ->priq and ->flags out of scx_dsq_node and put them in sched_ext_entity as ->dsq_priq and ->dsq_flags, respectively. scx_dsq_node is renamed to scx_dsq_list_node and the field names are renamed accordingly. This will help implementing DSQ task iterator that can be allocated on stack. No functional change intended. Signed-off-by: Tejun Heo <tj@kernel.org> Suggested-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Alexei Starovoitov <ast@kernel.org> Cc: David Vernet <void@manifault.com>	2024-07-08 14:30:55 -10:00
Tejun Heo	e196c908f9	sched, sched_ext: Move some declarations from kernel/sched/ext.h to sched.h While sched_ext was out of tree, everything sched_ext specific which can be put in kernel/sched/ext.h was put there to ease forward porting. However, kernel/sched/sched.h is the better location for some of them. Relocate. - struct sched_enq_and_set_ctx, sched_deq_and_put_task() and sched_enq_and_set_task(). - scx_enabled() and scx_switched_all(). - for_active_class_range() and for_each_active_class(). sched_class declarations are moved above the class iterators for this. No functional changes intended. Signed-off-by: Tejun Heo <tj@kernel.org> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: David Vernet <void@manifault.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de>	2024-07-08 09:39:48 -10:00
Tejun Heo	744d83601f	sched, sched_ext: Open code for_balance_class_range() For flexibility, sched_ext allows the BPF scheduler to select the CPU to execute a task on at dispatch time so that e.g. a queue can be shared across multiple CPUs. To enable this, the dispatch path is executed from balance() so that a dispatched task can be hot-migrated to its target CPU. This means that sched_ext needs its balance() method invoked before every pick_next_task() even when the CPU is waking up from SCHED_IDLE. for_balance_class_range() defined in kernel/sched/ext.h implements this selective iteration promotion. However, the indirection obfuscates more than helps. Open code the iteration promotion in put_prev_task_balance() and remove for_balance_class_range(). No functional changes intended. Signed-off-by: Tejun Heo <tj@kernel.org> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Acked-by: David Vernet <void@manifault.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de>	2024-07-08 09:39:48 -10:00
Tejun Heo	6ab228ecc3	sched_ext: Minor cleanups in kernel/sched/ext.h - scx_ops_cpu_preempt is only used in kernel/sched/ext.c and doesn't need to be global. Make it static. - Relocate task_on_scx() so that the inline functions are located together. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: David Vernet <void@manifault.com>	2024-07-08 09:39:48 -10:00
Tejun Heo	9f391f94a1	sched_ext: Disallow loading BPF scheduler if isolcpus= domain isolation is in effect sched_domains regulate the load balancing for sched_classes. A machine can be partitioned into multiple sections that are not load-balanced across using either isolcpus= boot param or cpuset partitions. In such cases, tasks that are in one partition are expected to stay within that partition. cpuset configured partitions are always reflected in each member task's cpumask. As SCX always honors the task cpumasks, the BPF scheduler is automatically in compliance with the configured partitions. However, for isolcpus= domain isolation, the isolated CPUs are simply omitted from the top-level sched_domain[s] without further restrictions on tasks' cpumasks, so, for example, a task currently running in an isolated CPU may have more CPUs in its allowed cpumask while expected to remain on the same CPU. There is no straightforward way to enforce this partitioning preemptively on BPF schedulers and erroring out after a violation can be surprising. isolcpus= domain isolation is being replaced with cpuset partitions anyway, so keep it simple and simply disallow loading a BPF scheduler if isolcpus= domain isolation is in effect. Signed-off-by: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/20240626082342.GY31592@noisy.programming.kicks-ass.net Cc: David Vernet <void@manifault.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Frederic Weisbecker <frederic@kernel.org>	2024-07-08 09:30:13 -10:00
Tejun Heo	e98abd22fb	sched_ext: Account for idle policy when setting p->scx.weight in scx_ops_enable_task() When initializing p->scx.weight, scx_ops_enable_task() wasn't considering whether the task is SCHED_IDLE. Update it to use WEIGHT_IDLEPRIO as the source weight for SCHED_IDLE tasks. This leaves reweight_task_scx() the sole user of set_task_scx_weight(). Open code it. @weight is going to be provided by sched core in the future anyway. v2: Use the newly available @lw->weight to set @p->scx.weight in reweight_task_scx(). Signed-off-by: Tejun Heo <tj@kernel.org> Cc: David Vernet <void@manifault.com> Cc: Peter Zijlstra <peterz@infradead.org>	2024-07-08 09:25:35 -10:00
Tejun Heo	60564acbef	sched, sched_ext: Simplify dl_prio() case handling in sched_fork() sched_fork() returns with -EAGAIN if dl_prio(@p). `a7a9fc5492` ("sched_ext: Add boilerplate for extensible scheduler class") added scx_pre_fork() call before it and then scx_cancel_fork() on the exit path. This is silly as the dl_prio() block can just be moved above the scx_pre_fork() call. Move the dl_prio() block above the scx_pre_fork() call and remove the now unnecessary scx_cancel_fork() invocation. Signed-off-by: Tejun Heo <tj@kernel.org> Suggested-by: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: David Vernet <void@manifault.com>	2024-07-08 08:55:46 -10:00
Hongyan Xia	6203ef73fa	sched/ext: Add BPF function to fetch rq rq contains many useful fields to implement a custom scheduler. For example, various clock signals like clock_task and clock_pelt can be used to track load. It also contains stats in other sched_classes, which are useful to drive scheduling decisions in ext. tj: Put the new helper below scx_bpf_task_*() helpers. Signed-off-by: Hongyan Xia <hongyan.xia2@arm.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2024-07-08 07:10:48 -10:00
Tejun Heo	7b9f6c864a	Merge branch 'sched/core' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into for-6.11 `d329605287` ("sched/fair: set_load_weight() must also call reweight_task() for SCHED_IDLE tasks") applied to sched/core changes how reweight_task() is called causing conflicts with `e83edbf88f` ("sched: Add sched_class->reweight_task()"). Resolve the conflicts by taking set_load_weight() changes from `d329605287` and updating sched_class->reweight_task() to take pointer to struct load_weight instead of int prio. Signed-off-by: Tejun Heo<tj@kernel.org>	2024-07-08 07:06:26 -10:00
Benjamin Tissoires	f56f4d541e	bpf: helpers: fix bpf_wq_set_callback_impl signature I realized this while having a map containing both a struct bpf_timer and a struct bpf_wq: the third argument provided to the bpf_wq callback is not the struct bpf_wq pointer itself, but the pointer to the value in the map. Which means that the users need to double cast the provided "value" as this is not a struct bpf_wq *. This is a change of API, but there doesn't seem to be much users of bpf_wq right now, so we should be able to go with this right now. Fixes: `81f1d7a583` ("bpf: wq: add bpf_wq_set_callback_impl") Signed-off-by: Benjamin Tissoires <bentiss@kernel.org> Link: https://lore.kernel.org/r/20240708-fix-wq-v2-1-667e5c9fbd99@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2024-07-08 10:01:48 -07:00
Dan Carpenter	bc239eb271	bpf: Remove unnecessary loop in task_file_seq_get_next() After commit `0ede61d858` ("file: convert to SLAB_TYPESAFE_BY_RCU") this loop always iterates exactly one time. Delete the for statement and pull the code in a tab. Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Christian Brauner <brauner@kernel.org> Acked-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/bpf/ZoWJF51D4zWb6f5t@stanley.mountain	2024-07-08 16:23:19 +02:00
Masami Hiramatsu (Google)	9d8616034f	tracing/kprobes: Add symbol counting check when module loads Currently, kprobe event checks whether the target symbol name is unique or not, so that it does not put a probe on an unexpected place. But this skips the check if the target is on a module because the module may not be loaded. To fix this issue, this patch checks the number of probe target symbols in a target module when the module is loaded. If the probe is not on the unique name symbols in the module, it will be rejected at that point. Note that the symbol which has a unique name in the target module, it will be accepted even if there are same-name symbols in the kernel or other modules, Link: https://lore.kernel.org/all/172016348553.99543.2834679315611882137.stgit@devnote2/ Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2024-07-06 09:27:47 +09:00
Lai Jiangshan	449b31ad29	workqueue: Init rescuer's affinities as the wq's effective cpumask Make it consistent with apply_wqattrs_commit(). Link: https://lore.kernel.org/lkml/20240203154334.791910-5-longman@redhat.com/ Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Waiman Long <longman@redhat.com> Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2024-07-05 09:14:40 -10:00
Lai Jiangshan	1726a17135	workqueue: Put PWQ allocation and WQ enlistment in the same lock C.S. The PWQ allocation and WQ enlistment are not within the same lock-held critical section; therefore, their states can become out of sync when the user modifies the unbound mask or if CPU hotplug events occur in the interim since those operations only update the WQs that are already in the list. Make the PWQ allocation and WQ enlistment atomic. Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2024-07-05 09:14:40 -10:00
Lai Jiangshan	4e9a37389e	workqueue: Move kthread_flush_worker() out of alloc_and_link_pwqs() kthread_flush_worker() can't be called with wq_pool_mutex held. Prepare for moving wq_pool_mutex and cpu hotplug lock out of alloc_and_link_pwqs(). Cc: Zqiang <qiang.zhang1211@gmail.com> Link: https://lore.kernel.org/lkml/20230920060704.24981-1-qiang.zhang1211@gmail.com/ Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2024-07-05 09:14:40 -10:00
Lai Jiangshan	c5178e6ca6	workqueue: Make rescuer initialization as the last step of the creation of a new wq For early wq allocation, rescuer initialization is the last step of the creation of a new wq. Make the behavior the same for all allocations. Prepare for initializing rescuer's affinities with the default pwq's affinities. Prepare for moving the whole workqueue initializing procedure into wq_pool_mutex and cpu hotplug locks. Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Waiman Long <longman@redhat.com> Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2024-07-05 09:14:40 -10:00
Lai Jiangshan	c3138f3881	workqueue: Register sysfs after the whole creation of the new wq workqueue creation includes adding it to the workqueue list. Prepare for moving the whole workqueue initializing procedure into wq_pool_mutex and cpu hotplug locks. Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2024-07-05 09:14:40 -10:00
Chen Ridong	b824766504	cgroup/rstat: add force idle show helper In the function cgroup_base_stat_cputime_show, there are five instances of #ifdef, which makes the code not concise. To address this, add the function cgroup_force_idle_show to make the code more succinct. Signed-off-by: Chen Ridong <chenridong@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2024-07-05 08:32:08 -10:00
Oleg Nesterov	8ac5dc6659	get_task_mm: check PF_KTHREAD lockless Nowadays PF_KTHREAD is sticky and it was never protected by ->alloc_lock. Move the PF_KTHREAD check outside of task_lock() section to make this code more understandable. Link: https://lkml.kernel.org/r/20240626191017.GA20031@redhat.com Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-07-04 18:05:58 -07:00
Oleg Nesterov	d73d003521	memcg: mm_update_next_owner: move for_each_thread() into try_to_set_owner() mm_update_next_owner() checks the children / real_parent->children to avoid the "everything else" loop in the likely case, but this won't work if a child/sibling has a zombie leader with ->mm == NULL. Move the for_each_thread() logic into try_to_set_owner(), if nothing else this makes the children/siblings/everything searches more consistent. Link: https://lkml.kernel.org/r/20240626152930.GA17936@redhat.com Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jinliang Zheng <alexjlzheng@tencent.com> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Tycho Andersen <tandersen@netflix.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-07-04 18:05:58 -07:00
Oleg Nesterov	2a22b773b1	memcg: mm_update_next_owner: kill the "retry" logic Add the new helper, try_to_set_owner(), which tries to update mm->owner once we see c->mm == mm. This way mm_update_next_owner() doesn't need to restart the list_for_each_entry/for_each_process loops from the very beginning if it races with exit/exec, it can just continue. Unlike the current code, try_to_set_owner() re-checks tsk->mm == mm before it drops tasklist_lock, so it doesn't need get/put_task_struct(). Link: https://lkml.kernel.org/r/20240626152924.GA17933@redhat.com Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jinliang Zheng <alexjlzheng@tencent.com> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Tycho Andersen <tandersen@netflix.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-07-04 18:05:57 -07:00
Jakub Kicinski	76ed626479	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Cross-merge networking fixes after downstream PR. Conflicts: drivers/net/phy/aquantia/aquantia.h `219343755e` ("net: phy: aquantia: add missing include guards") `61578f6793` ("net: phy: aquantia: add support for PHY LEDs") drivers/net/ethernet/wangxun/libwx/wx_hw.c `bd07a98178` ("net: txgbe: remove separate irq request for MSI and INTx") `b501d261a5` ("net: txgbe: add FDIR ATR support") https://lore.kernel.org/all/20240703112936.483c1975@canb.auug.org.au/ include/linux/mlx5/mlx5_ifc.h `048a403648` ("net/mlx5: IFC updates for changing max EQs") `99be56171f` ("net/mlx5e: SHAMPO, Re-enable HW-GRO") https://lore.kernel.org/all/20240701133951.6926b2e3@canb.auug.org.au/ Adjacent changes: drivers/net/wireless/intel/iwlwifi/mvm/mac80211.c `4130c67cd1` ("wifi: iwlwifi: mvm: check vif for NULL/ERR_PTR before dereference") `3f3126515f` ("wifi: iwlwifi: mvm: add mvm-specific guard") include/net/mac80211.h `816c6bec09` ("wifi: mac80211: fix BSS_CHANGED_UNSOL_BCAST_PROBE_RESP") `5a009b42e0` ("wifi: mac80211: track changes in AP's TPE") Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-07-04 14:16:11 -07:00
Paul E. McKenney	02219caa92	Merge branches 'doc.2024.06.06a', 'fixes.2024.07.04a', 'mb.2024.06.28a', 'nocb.2024.06.03a', 'rcu-tasks.2024.06.06a', 'rcutorture.2024.06.06a' and 'srcu.2024.06.18a' into HEAD doc.2024.06.06a: Documentation updates. fixes.2024.07.04a: Miscellaneous fixes. mb.2024.06.28a: Grace-period memory-barrier redundancy removal. nocb.2024.06.03a: No-CB CPU updates. rcu-tasks.2024.06.06a: RCU-Tasks updates. rcutorture.2024.06.06a: Torture-test updates. srcu.2024.06.18a: SRCU polled-grace-period updates.	2024-07-04 13:54:17 -07:00
Frederic Weisbecker	55d4669ef1	rcu: Fix rcu_barrier() VS post CPUHP_TEARDOWN_CPU invocation When rcu_barrier() calls rcu_rdp_cpu_online() and observes a CPU off rnp->qsmaskinitnext, it means that all accesses from the offline CPU preceding the CPUHP_TEARDOWN_CPU are visible to RCU barrier, including callbacks expiration and counter updates. However interrupts can still fire after stop_machine() re-enables interrupts and before rcutree_report_cpu_dead(). The related accesses happening between CPUHP_TEARDOWN_CPU and rnp->qsmaskinitnext clearing are _NOT_ guaranteed to be seen by rcu_barrier() without proper ordering, especially when callbacks are invoked there to the end, making rcutree_migrate_callback() bypass barrier_lock. The following theoretical race example can make rcu_barrier() hang: CPU 0 CPU 1 ----- ----- //cpu_down() smpboot_park_threads() //ksoftirqd is parked now <IRQ> rcu_sched_clock_irq() invoke_rcu_core() do_softirq() rcu_core() rcu_do_batch() // callback storm // rcu_do_batch() returns // before completing all // of them // do_softirq also returns early because of // timeout. It defers to ksoftirqd but // it's parked </IRQ> stop_machine() take_cpu_down() rcu_barrier() spin_lock(barrier_lock) // observes rcu_segcblist_n_cbs(&rdp->cblist) != 0 <IRQ> do_softirq() rcu_core() rcu_do_batch() //completes all pending callbacks //smp_mb() implied _after_ callback number dec </IRQ> rcutree_report_cpu_dead() rnp->qsmaskinitnext &= ~rdp->grpmask; rcutree_migrate_callback() // no callback, early return without locking // barrier_lock //observes !rcu_rdp_cpu_online(rdp) rcu_barrier_entrain() rcu_segcblist_entrain() // Observe rcu_segcblist_n_cbs(rsclp) == 0 // because no barrier between reading // rnp->qsmaskinitnext and rsclp->len rcu_segcblist_add_len() smp_mb__before_atomic() // will now observe the 0 count and empty // list, but too late, we enqueue regardless WRITE_ONCE(rsclp->len, rsclp->len + v); // ignored barrier callback // rcu barrier stall... This could be solved with a read memory barrier, enforcing the message passing between rnp->qsmaskinitnext and rsclp->len, matching the full memory barrier after rsclp->len addition in rcu_segcblist_add_len() performed at the end of rcu_do_batch(). However the rcu_barrier() is complicated enough and probably doesn't need too many more subtleties. CPU down is a slowpath and the barrier_lock seldom contended. Solve the issue with unconditionally locking the barrier_lock on rcutree_migrate_callbacks(). This makes sure that either rcu_barrier() sees the empty queue or its entrained callback will be migrated. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2024-07-04 13:48:57 -07:00
Oleg Nesterov	6f4cec22c3	rcu: Eliminate lockless accesses to rcu_sync->gp_count The rcu_sync structure's ->gp_count field is always accessed under the protection of that same structure's ->rss_lock field, with the exception of a pair of WARN_ON_ONCE() calls just prior to acquiring that lock in functions rcu_sync_exit() and rcu_sync_dtor(). These lockless accesses are unnecessary and impair KCSAN's ability to catch bugs that might be inserted via other lockless accesses. This commit therefore moves those WARN_ON_ONCE() calls under the lock. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2024-07-04 13:48:57 -07:00
Paul E. McKenney	68d124b099	rcu: Add rcutree.nohz_full_patience_delay to reduce nohz_full OS jitter If a CPU is running either a userspace application or a guest OS in nohz_full mode, it is possible for a system call to occur just as an RCU grace period is starting. If that CPU also has the scheduling-clock tick enabled for any reason (such as a second runnable task), and if the system was booted with rcutree.use_softirq=0, then RCU can add insult to injury by awakening that CPU's rcuc kthread, resulting in yet another task and yet more OS jitter due to switching to that task, running it, and switching back. In addition, in the common case where that system call is not of excessively long duration, awakening the rcuc task is pointless. This pointlessness is due to the fact that the CPU will enter an extended quiescent state upon returning to the userspace application or guest OS. In this case, the rcuc kthread cannot do anything that the main RCU grace-period kthread cannot do on its behalf, at least if it is given a few additional milliseconds (for example, given the time duration specified by rcutree.jiffies_till_first_fqs, give or take scheduling delays). This commit therefore adds a rcutree.nohz_full_patience_delay kernel boot parameter that specifies the grace period age (in milliseconds, rounded to jiffies) before which RCU will refrain from awakening the rcuc kthread. Preliminary experimentation suggests a value of 1000, that is, one second. Increasing rcutree.nohz_full_patience_delay will increase grace-period latency and in turn increase memory footprint, so systems with constrained memory might choose a smaller value. Systems with less-aggressive OS-jitter requirements might choose the default value of zero, which keeps the traditional immediate-wakeup behavior, thus avoiding increases in grace-period latency. [ paulmck: Apply Leonardo Bras feedback. ] Link: https://lore.kernel.org/all/20240328171949.743211-1-leobras@redhat.com/ Reported-by: Leonardo Bras <leobras@redhat.com> Suggested-by: Leonardo Bras <leobras@redhat.com> Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Leonardo Bras <leobras@redhat.com>	2024-07-04 13:47:39 -07:00
Peter Zijlstra	0c8ea05e9b	Merge branch 'tip/x86/cpu' The Lunarlake patches rely on the new VFM stuff. Signed-off-by: Peter Zijlstra <peterz@infradead.org>	2024-07-04 16:00:24 +02:00
Adrian Hunter	0ca4da2412	perf: Make rb_alloc_aux() return an error immediately if nr_pages <= 0 rb_alloc_aux() should not be called with nr_pages <= 0. Make it more robust and readable by returning an error immediately in that case. Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20240624201101.60186-8-adrian.hunter@intel.com	2024-07-04 16:00:23 +02:00
Adrian Hunter	43deb76b19	perf: Fix default aux_watermark calculation The default aux_watermark is half the AUX area buffer size. In general, on a 64-bit architecture, the AUX area buffer size could be a bigger than fits in a 32-bit type, but the calculation does not allow for that possibility. However the aux_watermark value is recorded in a u32, so should not be more than U32_MAX either. Fix by doing the calculation in a correctly sized type, and limiting the result to U32_MAX. Fixes: `d68e6799a5` ("perf: Cap allocation order at aux_watermark") Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20240624201101.60186-7-adrian.hunter@intel.com	2024-07-04 16:00:23 +02:00
Adrian Hunter	dbc48c8f41	perf: Prevent passing zero nr_pages to rb_alloc_aux() nr_pages is unsigned long but gets passed to rb_alloc_aux() as an int, and is stored as an int. Only power-of-2 values are accepted, so if nr_pages is a 64_bit value, it will be passed to rb_alloc_aux() as zero. That is not ideal because: 1. the value is incorrect 2. rb_alloc_aux() is at risk of misbehaving, although it manages to return -ENOMEM in that case, it is a result of passing zero to get_order() even though the get_order() result is documented to be undefined in that case. Fix by simply validating the maximum supported value in the first place. Use -ENOMEM error code for consistency with the current error code that is returned in that case. Fixes: `45bfb2e504` ("perf: Add AUX area to ring buffer for raw data streams") Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20240624201101.60186-6-adrian.hunter@intel.com	2024-07-04 16:00:22 +02:00
Adrian Hunter	3df94a5b10	perf: Fix perf_aux_size() for greater-than 32-bit size perf_buffer->aux_nr_pages uses a 32-bit type, so a cast is needed to calculate a 64-bit size. Fixes: `45bfb2e504` ("perf: Add AUX area to ring buffer for raw data streams") Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20240624201101.60186-5-adrian.hunter@intel.com	2024-07-04 16:00:22 +02:00
Tejun Heo	d329605287	sched/fair: set_load_weight() must also call reweight_task() for SCHED_IDLE tasks When a task's weight is being changed, set_load_weight() is called with @update_load set. As weight changes aren't trivial for the fair class, set_load_weight() calls fair.c::reweight_task() for fair class tasks. However, set_load_weight() first tests task_has_idle_policy() on entry and skips calling reweight_task() for SCHED_IDLE tasks. This is buggy as SCHED_IDLE tasks are just fair tasks with a very low weight and they would incorrectly skip load, vlag and position updates. Fix it by updating reweight_task() to take struct load_weight as idle weight can't be expressed with prio and making set_load_weight() call reweight_task() for SCHED_IDLE tasks too when @update_load is set. Fixes: `9059393e4e` ("sched/fair: Use reweight_entity() for set_user_nice()") Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org # v4.15+ Link: http://lkml.kernel.org/r/20240624102331.GI31592@noisy.programming.kicks-ass.net	2024-07-04 15:59:52 +02:00
Tvrtko Ursulin	0ec208ce98	sched/psi: Optimise psi_group_change a bit The current code loops over the psi_states only to call a helper which then resolves back to the action needed for each state using a switch statement. That is effectively creating a double indirection of a kind which, given how all the states need to be explicitly listed and handled anyway, we can simply remove. Both the for loop and the switch statement that is. The benefit is both in the code size and CPU time spent in this function. YMMV but on my Steam Deck, while in a game, the patch makes the CPU usage go from ~2.4% down to ~1.2%. Text size at the same time went from 0x323 to 0x2c1. Signed-off-by: Tvrtko Ursulin <tursulin@ursulin.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Link: https://lkml.kernel.org/r/20240625135000.38652-1-tursulin@igalia.com	2024-07-04 15:59:52 +02:00

... 15 16 17 18 19 ...

45917 Commits