linux

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2026-05-22 02:36:35 -04:00

Author	SHA1	Message	Date
Eduard Zingerman	0fb3cf6110	bpf: use register liveness information for func_states_equal Liveness analysis DFA computes a set of registers live before each instruction. Leverage this information to skip comparison of dead registers in func_states_equal(). This helps with convergence of iterator processing loops, as bpf_reg_state->live marks can't be used when loops are processed. This has a certain performance impact for selftests; here is a veristat listing using `-f "insns_pct>5" -f "!insns<200"` selftests: File Program States (A) States (B) States (DIFF) -------------------- ----------------------------- ---------- ---------- -------------- arena_htab.bpf.o arena_htab_llvm 37 35 -2 (-5.41%) arena_htab_asm.bpf.o arena_htab_asm 37 33 -4 (-10.81%) arena_list.bpf.o arena_list_add 37 22 -15 (-40.54%) dynptr_success.bpf.o test_dynptr_copy 22 16 -6 (-27.27%) dynptr_success.bpf.o test_dynptr_copy_xdp 68 58 -10 (-14.71%) iters.bpf.o checkpoint_states_deletion 918 40 -878 (-95.64%) iters.bpf.o clean_live_states 136 66 -70 (-51.47%) iters.bpf.o iter_nested_deeply_iters 43 37 -6 (-13.95%) iters.bpf.o iter_nested_iters 72 62 -10 (-13.89%) iters.bpf.o iter_pass_iter_ptr_to_subprog 30 26 -4 (-13.33%) iters.bpf.o iter_subprog_iters 68 59 -9 (-13.24%) iters.bpf.o loop_state_deps2 35 32 -3 (-8.57%) iters_css.bpf.o iter_css_for_each 32 29 -3 (-9.38%) pyperf600_iter.bpf.o on_event 286 192 -94 (-32.87%) Total progs: 3578 Old success: 2061 New success: 2061 States diff min: -95.64% States diff max: 0.00% -100 .. -90 %: 1 -55 .. -45 %: 3 -45 .. -35 %: 2 -35 .. -25 %: 5 -20 .. -10 %: 12 -10 .. 0 %: 6 sched_ext: File Program States (A) States (B) States (DIFF) ----------------- ---------------------- ---------- ---------- --------------- bpf.bpf.o lavd_dispatch 8950 7065 -1885 (-21.06%) bpf.bpf.o lavd_init 516 480 -36 (-6.98%) bpf.bpf.o layered_dispatch 662 501 -161 (-24.32%) bpf.bpf.o layered_dump 298 237 -61 (-20.47%) bpf.bpf.o layered_init 523 423 -100 (-19.12%) bpf.bpf.o layered_init_task 24 22 -2 (-8.33%) bpf.bpf.o layered_runnable 151 125 -26 (-17.22%) bpf.bpf.o p2dq_dispatch 66 53 -13 (-19.70%) bpf.bpf.o p2dq_init 170 142 -28 (-16.47%) bpf.bpf.o refresh_layer_cpumasks 120 78 -42 (-35.00%) bpf.bpf.o rustland_init 37 34 -3 (-8.11%) bpf.bpf.o rustland_init 37 34 -3 (-8.11%) bpf.bpf.o rusty_select_cpu 125 108 -17 (-13.60%) scx_central.bpf.o central_dispatch 59 43 -16 (-27.12%) scx_central.bpf.o central_init 39 28 -11 (-28.21%) scx_nest.bpf.o nest_init 58 51 -7 (-12.07%) scx_pair.bpf.o pair_dispatch 142 111 -31 (-21.83%) scx_qmap.bpf.o qmap_dispatch 174 141 -33 (-18.97%) scx_qmap.bpf.o qmap_init 768 654 -114 (-14.84%) Total progs: 216 Old success: 186 New success: 186 States diff min: -35.00% States diff max: 0.00% -35 .. -25 %: 3 -25 .. -20 %: 6 -20 .. -15 %: 6 -15 .. -5 %: 7 -5 .. 0 %: 6 Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20250304195024.2478889-5-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-03-15 11:48:29 -07:00
Eduard Zingerman	14c8552db6	bpf: simple DFA-based live registers analysis Compute may-live registers before each instruction in the program. The register is live before the instruction I if it is read by I or some instruction S following I during program execution and is not overwritten between I and S. This information is used in the next patch as a hint in func_states_equal(). Use a simple algorithm described in [1] to compute this information: - define the following: - I.use : a set of all registers read by instruction I; - I.def : a set of all registers written by instruction I; - I.in : a set of all registers that may be alive before I execution; - I.out : a set of all registers that may be alive after I execution; - I.successors : a set of instructions S that might immediately follow I for some program execution; - associate separate empty sets 'I.in' and 'I.out' with each instruction; - visit each instruction in a postorder and update corresponding 'I.in' and 'I.out' sets as follows: I.out = U [S.in for S in I.successors] I.in = (I.out / I.def) U I.use (where U stands for set union, / stands for set difference) - repeat the computation while I.{in,out} changes for any instruction. On the implementation side keep things as simple as possible: - check_cfg() already marks instructions EXPLORED in post-order, modify it to save the index of each EXPLORED instruction in a vector; - represent I.{in,out,use,def} as bitmasks; - don't split the program into basic blocks and don't maintain the work queue, instead: - do fixed-point computation by visiting each instruction; - maintain a simple 'changed' flag if I.{in,out} for any instruction change; Measurements show that even such a simplistic implementation does not add measurable verification time overhead (for selftests, at least). 
Note on check_cfg() ex_insn_beg/ex_done change: To avoid out of bounds access to env->cfg.insn_postorder array, it should be guaranteed that an instruction transitions to the EXPLORED state only once. Previously this was not the case for incorrect programs with direct calls to exception callbacks. The 'align' selftest needs adjustment to skip computed insn/live registers printout. Otherwise it matches lines from the live registers printout. [1] https://en.wikipedia.org/wiki/Live-variable_analysis Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20250304195024.2478889-4-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-03-15 11:48:29 -07:00
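The fixed-point computation described in the 14c8552db6 message can be sketched as follows. This is a minimal Python illustration of the I.in / I.out equations, not the verifier's C code: the `Insn` class, the register bit constants, and the five-instruction sample program are hypothetical and exist only for this example.

```python
from dataclasses import dataclass

@dataclass
class Insn:
    use: int          # bitmask of registers read by the instruction (I.use)
    defs: int         # bitmask of registers written by the instruction (I.def)
    successors: list  # indices of instructions that may follow immediately
    live_in: int = 0  # I.in, starts empty
    live_out: int = 0 # I.out, starts empty

def compute_live_registers(insns, postorder):
    """Iterate I.out = U S.in, I.in = (I.out / I.def) U I.use to a fixed point."""
    changed = True
    while changed:
        changed = False
        for idx in postorder:
            insn = insns[idx]
            new_out = 0
            for succ in insn.successors:
                new_out |= insns[succ].live_in
            new_in = (new_out & ~insn.defs) | insn.use
            if new_in != insn.live_in or new_out != insn.live_out:
                insn.live_in, insn.live_out = new_in, new_out
                changed = True
    return [insn.live_in for insn in insns]

R0, R1, R2 = 1 << 0, 1 << 1, 1 << 2

# Hypothetical program with a back edge from instruction 3 to instruction 1:
#   0: r0 = r1
#   1: r2 = 0
#   2: r0 += r2
#   3: if r0 != 0 goto 1
#   4: exit (reads r0)
program = [
    Insn(use=R1, defs=R0, successors=[1]),
    Insn(use=0, defs=R2, successors=[2]),
    Insn(use=R0 | R2, defs=R0, successors=[3]),
    Insn(use=R0, defs=0, successors=[1, 4]),
    Insn(use=R0, defs=0, successors=[]),
]

# Post-order of a depth-first walk starting at instruction 0.
live_before = compute_live_registers(program, postorder=[4, 3, 2, 1, 0])
```

Like the approach in the commit message, the sketch uses bitmasks and a single 'changed' flag instead of basic blocks and a work queue.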
Eduard Zingerman	22f8397495	bpf: get_call_summary() utility function Refactor mark_fastcall_pattern_for_call() to extract a utility function get_call_summary(). For a helper or kfunc call this function fills the following information: {num_params, is_void, fastcall}. This function would be used in the next patch in order to get number of parameters of a helper or kfunc call. Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20250304195024.2478889-3-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-03-15 11:48:29 -07:00
Eduard Zingerman	80ca3f1d77	bpf: jmp_offset() and verbose_insn() utility functions Extract two utility functions: - One BPF jump instruction uses .imm field to encode jump offset, while the rest use .off. Encapsulate this detail as jmp_offset() function. - Avoid duplicating instruction printing callback definitions by defining a verbose_insn() function, which disassembles an instruction into the verifier log while hiding this detail. These functions will be used in the next patch. Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20250304195024.2478889-2-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-03-15 11:48:29 -07:00
Peilin Ye	880442305a	bpf: Introduce load-acquire and store-release instructions Introduce BPF instructions with load-acquire and store-release semantics, as discussed in [1]. Define 2 new flags: #define BPF_LOAD_ACQ 0x100 #define BPF_STORE_REL 0x110 A "load-acquire" is a BPF_STX | BPF_ATOMIC instruction with the 'imm' field set to BPF_LOAD_ACQ (0x100). Similarly, a "store-release" is a BPF_STX | BPF_ATOMIC instruction with the 'imm' field set to BPF_STORE_REL (0x110). Unlike existing atomic read-modify-write operations that only support BPF_W (32-bit) and BPF_DW (64-bit) size modifiers, load-acquires and store-releases also support BPF_B (8-bit) and BPF_H (16-bit). As an exception, however, 64-bit load-acquires/store-releases are not supported on 32-bit architectures (to fix a build error reported by the kernel test robot). An 8- or 16-bit load-acquire zero-extends the value before writing it to a 32-bit register, just like ARM64 instruction LDARH and friends. Similar to existing atomic read-modify-write operations, misaligned load-acquires/store-releases are not allowed (even if BPF_F_ANY_ALIGNMENT is set). As an example, consider the following 64-bit load-acquire BPF instruction (assuming little-endian): db 10 00 00 00 01 00 00 r0 = load_acquire((u64 *)(r1 + 0x0)) opcode (0xdb): BPF_ATOMIC | BPF_DW | BPF_STX imm (0x00000100): BPF_LOAD_ACQ Similarly, a 16-bit BPF store-release: cb 21 00 00 10 01 00 00 store_release((u16 *)(r1 + 0x0), w2) opcode (0xcb): BPF_ATOMIC | BPF_H | BPF_STX imm (0x00000110): BPF_STORE_REL In arch/{arm64,s390,x86}/net/bpf_jit_comp.c, have bpf_jit_supports_insn(..., /*in_arena=*/true) return false for the new instructions, until the corresponding JIT compiler supports them in arena. 
[1] https://lore.kernel.org/all/20240729183246.4110549-1-yepeilin@google.com/ Acked-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Ilya Leoshkevich <iii@linux.ibm.com> Cc: kernel test robot <lkp@intel.com> Signed-off-by: Peilin Ye <yepeilin@google.com> Link: https://lore.kernel.org/r/a217f46f0e445fbd573a1a024be5c6bf1d5fe716.1741049567.git.yepeilin@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-03-15 11:48:28 -07:00
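The two encodings quoted in the 880442305a message can be checked by unpacking the eight instruction bytes. The following Python sketch assumes the usual little-endian BPF instruction layout (8-bit opcode, 4-bit dst_reg in the low nibble, 4-bit src_reg in the high nibble, 16-bit off, 32-bit imm); `decode_insn` and the result dictionary are hypothetical names for this illustration, not kernel or libbpf APIs.

```python
import struct

# Opcode field values from the BPF instruction encoding.
BPF_STX = 0x03      # instruction class, low 3 bits of the opcode
BPF_ATOMIC = 0xC0   # mode, high 3 bits of the opcode
SIZE_NAMES = {0x00: "BPF_W", 0x08: "BPF_H", 0x10: "BPF_B", 0x18: "BPF_DW"}
# 'imm' values quoted in the commit message.
IMM_NAMES = {0x100: "BPF_LOAD_ACQ", 0x110: "BPF_STORE_REL"}

def decode_insn(raw):
    """Split one 8-byte little-endian BPF instruction into its fields."""
    opcode, regs, off, imm = struct.unpack("<BBhi", raw)
    return {
        "class_is_stx": (opcode & 0x07) == BPF_STX,
        "mode_is_atomic": (opcode & 0xE0) == BPF_ATOMIC,
        "size": SIZE_NAMES[opcode & 0x18],
        "dst_reg": regs & 0x0F,
        "src_reg": regs >> 4,
        "off": off,
        "imm": IMM_NAMES.get(imm, hex(imm)),
    }

# The two byte sequences from the commit message.
load_acq = decode_insn(bytes.fromhex("db10000000010000"))
store_rel = decode_insn(bytes.fromhex("cb21000010010000"))
```

For the load-acquire, dst_reg 0 and src_reg 1 correspond to `r0 = load_acquire((u64 *)(r1 + 0x0))`; for the store-release, dst_reg 1 and src_reg 2 correspond to `store_release((u16 *)(r1 + 0x0), w2)`.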
Kumar Kartikeya Dwivedi	e723608bf4	bpf: Add verifier support for timed may_goto Implement support in the verifier for replacing may_goto implementation from a counter-based approach to one which samples time on the local CPU to have a bigger loop bound. We implement it by maintaining 16-bytes per-stack frame, and using 8 bytes for maintaining the count for amortizing time sampling, and 8 bytes for the starting timestamp. To minimize overhead, we need to avoid spilling and filling of registers around this sequence, so we push this cost into the time sampling function 'arch_bpf_timed_may_goto'. This is a JIT-specific wrapper around bpf_check_timed_may_goto which returns us the count to store into the stack through BPF_REG_AX. All caller-saved registers (r0-r5) are guaranteed to remain untouched. The loop can be broken by returning count as 0, otherwise we dispatch into the function when the count drops to 0, and the runtime chooses to refresh it (by returning count as BPF_MAX_TIMED_LOOPS) or returning 0 and aborting the loop on next iteration. Since the check for 0 is done right after loading the count from the stack, all subsequent cond_break sequences should immediately break as well, of the same loop or subsequent loops in the program. We pass in the stack_depth of the count (and thus the timestamp, by adding 8 to it) to the arch_bpf_timed_may_goto call so that it can be passed in to bpf_check_timed_may_goto as an argument after r1 is saved, by adding the offset to r10/fp. This adjustment will be arch specific, and the next patch will introduce support for x86. Note that depending on loop complexity, time spent in the loop can be more than the current limit (250 ms), but imposing an upper bound on program runtime is an orthogonal problem which will be addressed when program cancellations are supported. 
The current time afforded by cond_break may not be enough for cases where BPF programs want to implement locking algorithms inline, and use cond_break as a promise to the verifier that they will eventually terminate. Below are some benchmarking numbers on the time taken per-iteration for an empty loop that counts the number of iterations until cond_break fires. For comparison, we compare it against bpf_for/bpf_repeat which is another way to achieve the same number of spins (BPF_MAX_LOOPS). The hardware used for benchmarking was a Sapphire Rapids Intel server with performance governor enabled, mitigations were enabled. +-----------------------------+--------------+--------------+------------------+ | Loop type | Iterations | Time (ms) | Time/iter (ns) | +-----------------------------|--------------+--------------+------------------+ | may_goto | 8388608 | 3 | 0.36 | | timed_may_goto (count=65535)| 589674932 | 250 | 0.42 | | bpf_for | 8388608 | 10 | 1.19 | +-----------------------------+--------------+--------------+------------------+ This gives a good approximation at low overhead while staying close to the current implementation. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20250304003239.2390751-2-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-03-15 11:48:28 -07:00
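The amortized time sampling described in the e723608bf4 message can be modelled roughly as below. This is a hedged Python sketch, not the kernel implementation: the class `TimedMayGoto`, its `_check` method (standing in for the time-sampling function), the `None` sentinel for an unset timestamp, and the fake clock that charges 1000 ns per iteration are all assumptions made for illustration. Only the refresh count of 65535 and the 250 ms limit come from the message.

```python
BPF_MAX_TIMED_LOOPS = 0xFFFF   # 65535, the count shown in the benchmark table
TIME_LIMIT_NS = 250_000_000    # 250 ms limit mentioned in the commit message

class TimedMayGoto:
    """Models the two per-frame slots: an iteration count and a start timestamp."""

    def __init__(self, clock_ns):
        self.clock_ns = clock_ns          # callable returning the current time in ns
        self.count = BPF_MAX_TIMED_LOOPS
        self.timestamp = None             # assumption: unset until the first sample

    def _check(self):
        # Sample time only when the count is exhausted (amortization).
        now = self.clock_ns()
        if self.timestamp is None:
            self.timestamp = now
            return BPF_MAX_TIMED_LOOPS
        if now - self.timestamp > TIME_LIMIT_NS:
            return 0                      # abort the loop on the next iteration
        return BPF_MAX_TIMED_LOOPS        # refresh the count

    def may_continue(self):
        if self.count == 0:
            return False                  # every later check breaks immediately
        self.count -= 1
        if self.count == 0:
            self.count = self._check()
        return True

# Fake clock: each loop iteration is charged 1000 ns.
now_ns = [0]
loop = TimedMayGoto(lambda: now_ns[0])
iterations = 0
while loop.may_continue():
    iterations += 1
    now_ns[0] += 1000
```

With this fake clock the loop stops on a multiple of 65535 iterations, shortly after the simulated 250 ms have elapsed, which mirrors the "count for amortizing time sampling" idea from the message.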
Peilin Ye	a752ba4332	bpf: Factor out check_load_mem() and check_store_reg() Extract BPF_LDX and most non-ATOMIC BPF_STX instruction handling logic in do_check() into helper functions to be used later. While we are here, make that comment about "reserved fields" more specific. Suggested-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Peilin Ye <yepeilin@google.com> Link: https://lore.kernel.org/r/8b39c94eac2bb7389ff12392ca666f939124ec4f.1740978603.git.yepeilin@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-03-15 11:48:26 -07:00
Peilin Ye	2626ffe9f3	bpf: Factor out check_atomic_rmw() Currently, check_atomic() only handles atomic read-modify-write (RMW) instructions. Since we are planning to introduce other types of atomic instructions (i.e., atomic load/store), extract the existing RMW handling logic into its own function named check_atomic_rmw(). Remove the @insn_idx parameter as it is not really necessary. Use 'env->insn_idx' instead, as in other places in verifier.c. Signed-off-by: Peilin Ye <yepeilin@google.com> Link: https://lore.kernel.org/r/6323ac8e73a10a1c8ee547c77ed68cf8eb6b90e1.1740978603.git.yepeilin@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-03-15 11:48:26 -07:00
Peilin Ye	66faaea94e	bpf: Factor out atomic_ptr_type_ok() Factor out atomic_ptr_type_ok() as a helper function to be used later. Signed-off-by: Peilin Ye <yepeilin@google.com> Link: https://lore.kernel.org/r/e5ef8b3116f3fffce78117a14060ddce05eba52a.1740978603.git.yepeilin@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-03-15 11:48:26 -07:00
Eric Dumazet	f8ac5a4e1a	bpf: no longer acquire map_idr_lock in bpf_map_inc_not_zero() bpf_sk_storage_clone() is the only caller of bpf_map_inc_not_zero() and is holding rcu_read_lock(). map_idr_lock does not add any protection, just remove the cost for passive TCP flows. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Kui-Feng Lee <kuifeng@meta.com> Cc: Martin KaFai Lau <martin.lau@kernel.org> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://lore.kernel.org/r/20250301191315.1532629-1-edumazet@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-03-15 11:48:26 -07:00
Kumar Kartikeya Dwivedi	e2d8f560d1	bpf: Summarize sleepable global subprogs The verifier currently does not permit global subprog calls when a lock is held, preemption is disabled, or when IRQs are disabled. This is because we don't know whether the global subprog calls sleepable functions or not. In case of locks, there's an additional reason: functions called by the global subprog may hold additional locks etc. The verifier won't know while verifying the global subprog whether it was called in a context where a spin lock is already held by the program. Perform summarization of the sleepable nature of a global subprog just like changes_pkt_data and then allow calls to global subprogs for non-sleepable ones from atomic context. While making this change, I noticed that RCU read sections had no protection against sleepable global subprog calls; include it in the checks and fix this while we're at it. Care needs to be taken to not allow global subprog calls when a regular bpf_spin_lock is held. When a resilient spin lock is held, we want to potentially have this check relaxed, but not for now. Also make sure extensions freplacing global functions cannot do so in case the target is non-sleepable, but the extension is. The other combination is ok. Tests are included in the next patch to handle all special conditions. Fixes: `9bb00b2895` ("bpf: Add kfunc bpf_rcu_read_lock/unlock()") Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20250301151846.1552362-2-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-03-15 11:48:25 -07:00
Yonghong Song	4b82b181a2	bpf: Allow pre-ordering for bpf cgroup progs Currently for bpf progs in a cgroup hierarchy, the effective prog array is computed from the bottom cgroup to the upper cgroups (post-ordering). For example, the following cgroup hierarchy root cgroup: p1, p2 subcgroup: p3, p4 has BPF_F_ALLOW_MULTI for both cgroup levels. The effective cgroup array ordering looks like p3 p4 p1 p2 and at run time, progs will execute based on that order. But in some cases, it is desirable to have the root prog execute earlier than child progs (pre-ordering). For example, - prog p1 intends to collect original pkt dest addresses. - prog p3 will modify original pkt dest addresses to a proxy address for security reasons. The end result is that prog p1 gets the proxy address, which is not what it wants. Putting p1 into every child cgroup is not desirable either as it will duplicate itself in many child cgroups. And this is exactly a use case we are encountering at Meta. To fix this issue, let us introduce a flag BPF_F_PREORDER. If the flag is specified at attachment time, the prog has higher priority and the ordering with that flag will be from top to bottom (pre-ordering). For example, in the above example, root cgroup: p1, p2 subcgroup: p3, p4 Let us say p2 and p4 are marked with BPF_F_PREORDER. The final effective array ordering will be p2 p4 p3 p1 Suggested-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20250224230116.283071-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-03-15 11:48:25 -07:00
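The two orderings given in the 4b82b181a2 message can be reproduced with a small model. The Python sketch below is illustrative only: `effective_order` and the `(name, preorder)` tuples are hypothetical, and the relative order of several BPF_F_PREORDER programs attached to the same cgroup is an assumption here, since the message does not spell it out.

```python
def effective_order(hierarchy):
    """Return program names in execution order.

    hierarchy: list of cgroups from root to leaf; each cgroup is a list of
    (name, preorder) tuples in attachment order.
    """
    # Programs flagged for pre-ordering run first, walking from the root down.
    pre = [name for cgroup in hierarchy for name, preorder in cgroup if preorder]
    # The remaining programs keep the post-ordering: leaf first, root last.
    post = [name
            for cgroup in reversed(hierarchy)
            for name, preorder in cgroup
            if not preorder]
    return pre + post

# Hierarchy from the commit message without any BPF_F_PREORDER flag.
plain = effective_order([
    [("p1", False), ("p2", False)],   # root cgroup
    [("p3", False), ("p4", False)],   # subcgroup
])

# Same hierarchy with p2 and p4 marked BPF_F_PREORDER.
flagged = effective_order([
    [("p1", False), ("p2", True)],
    [("p3", False), ("p4", True)],
])
```

Here `plain` evaluates to p3 p4 p1 p2 and `flagged` to p2 p4 p3 p1, matching the two examples in the message.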
Mykyta Yatsenko	daec295a70	bpf/helpers: Introduce bpf_dynptr_copy kfunc Introducing bpf_dynptr_copy kfunc allowing copying data from one dynptr to another. This functionality is useful in scenarios such as capturing XDP data to a ring buffer. The implementation consists of 4 branches: * A fast branch for contiguous buffer capacity in both source and destination dynptrs * 3 branches utilizing __bpf_dynptr_read and __bpf_dynptr_write to copy data to/from non-contiguous buffer Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20250226183201.332713-3-mykyta.yatsenko5@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-03-15 11:48:16 -07:00
Mykyta Yatsenko	09206af69c	bpf/helpers: Refactor bpf_dynptr_read and bpf_dynptr_write Refactor bpf_dynptr_read and bpf_dynptr_write helpers: extract code into the static functions namely __bpf_dynptr_read and __bpf_dynptr_write, this allows calling these without compiler warnings. Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20250226183201.332713-2-mykyta.yatsenko5@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-03-15 11:47:51 -07:00
Masahiro Yamada	479fde4965	Revert "kheaders: Ignore silly-rename files" This reverts commit `973b710b88`. As I mentioned in the review [1], I do not believe this was the correct fix. Commit `41a0005128` ("kheaders: prevent `find` from seeing perl temp files") addressed the root cause of the issue. I asked David to test it but received no response. Commit `973b710b88` ("kheaders: Ignore silly-rename files") merely worked around the issue by excluding such files, rather than preventing their creation. I have reverted the latter commit, hoping the issue has already been resolved by the former. If the silly-rename files come back, I will restore this change (or preferably, investigate the root cause). [1]: https://lore.kernel.org/lkml/CAK7LNAQndCMudAtVRAbfSfnV+XhSMDcnP-s1_GAQh8UiEdLBSg@mail.gmail.com/ Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>	2025-03-15 21:22:52 +09:00
Dietmar Eggemann	76f970ce51	Revert "sched/core: Reduce cost of sched_move_task when config autogroup" This reverts commit `eff6c8ce8d`. Hazem reported a 30% drop in UnixBench spawn test with commit `eff6c8ce8d` ("sched/core: Reduce cost of sched_move_task when config autogroup") on a m6g.xlarge AWS EC2 instance with 4 vCPUs and 16 GiB RAM (aarch64) (single level MC sched domain): https://lkml.kernel.org/r/20250205151026.13061-1-hagarhem@amazon.com There is an early bail from sched_move_task() if p->sched_task_group is equal to p's 'cpu cgroup' (sched_get_task_group()). E.g. both are pointing to taskgroup '/user.slice/user-1000.slice/session-1.scope' (Ubuntu '22.04.5 LTS'). So in: do_exit() sched_autogroup_exit_task() sched_move_task() if sched_get_task_group(p) == p->sched_task_group return /* p is enqueued */ dequeue_task() \ sched_change_group() | task_change_group_fair() | detach_task_cfs_rq() | (1) set_task_rq() | attach_task_cfs_rq() | enqueue_task() / (1) isn't called for p anymore. Turns out that the regression is related to sgs->group_util in group_is_overloaded() and group_has_capacity(). If (1) isn't called for all the 'spawn' tasks then sgs->group_util is ~900 and sgs->group_capacity = 1024 (single CPU sched domain) and this leads to group_is_overloaded() returning true (2) and group_has_capacity() false (3) much more often compared to the case when (1) is called. I.e. there are many more cases of 'group_is_overloaded' and 'group_fully_busy' in WF_FORK wakeup sched_balance_find_dst_cpu() which then returns much more often a CPU != smp_processor_id() (5). This isn't good for these extremely short running tasks (FORK + EXIT) and also involves calling sched_balance_find_dst_group_cpu() unnecessarily (single CPU sched domain). Instead if (1) is called for 'p->flags & PF_EXITING' then the path (4),(6) is taken much more often. select_task_rq_fair(..., wake_flags = WF_FORK) cpu = smp_processor_id() new_cpu = sched_balance_find_dst_cpu(..., cpu, ...) 
group = sched_balance_find_dst_group(..., cpu) do { update_sg_wakeup_stats() sgs->group_type = group_classify() if group_is_overloaded() (2) return group_overloaded if !group_has_capacity() (3) return group_fully_busy return group_has_spare (4) } while group if local_sgs.group_type > idlest_sgs.group_type return idlest (5) case group_has_spare: if local_sgs.idle_cpus >= idlest_sgs.idle_cpus return NULL (6) Unixbench Tests './Run -c 4 spawn' on: (a) VM AWS instance (m7gd.16xlarge) with v6.13 ('maxcpus=4 nr_cpus=4') and Ubuntu 22.04.5 LTS (aarch64). Shell & test run in '/user.slice/user-1000.slice/session-1.scope'. w/o patch w/ patch 21005 27120 (b) i7-13700K with tip/sched/core ('nosmt maxcpus=8 nr_cpus=8') and Ubuntu 22.04.5 LTS (x86_64). Shell & test run in '/A'. w/o patch w/ patch 67675 88806 CONFIG_SCHED_AUTOGROUP=y & /sys/proc/kernel/sched_autogroup_enabled equal 0 or 1. Reported-by: Hazem Mohamed Abuelfotoh <abuehaze@amazon.com> Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Tested-by: Hagar Hemdan <hagarhem@amazon.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20250314151345.275739-1-dietmar.eggemann@arm.com	2025-03-15 10:34:27 +01:00
Xuewen Yan	4bc4582414	sched/uclamp: Optimize sched_uclamp_used static key enabling Repeat calls of static_branch_enable() to an already enabled static key introduce overhead, because it calls cpus_read_lock(). Users may frequently set the uclamp value of tasks, triggering the repeat enabling of the sched_uclamp_used static key. Optimize this and avoid repeat calls to static_branch_enable() by checking whether it's enabled already. [ mingo: Rewrote the changelog for legibility ] Signed-off-by: Xuewen Yan <xuewen.yan@unisoc.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Christian Loehle <christian.loehle@arm.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20250219093747.2612-2-xuewen.yan@unisoc.com	2025-03-15 10:28:50 +01:00
Xuewen Yan	5fca5a4cf9	sched/uclamp: Use the uclamp_is_used() helper instead of open-coding it Don't open-code static_branch_unlikely(&sched_uclamp_used), we have the uclamp_is_used() wrapper around it. [ mingo: Clean up the changelog ] Signed-off-by: Xuewen Yan <xuewen.yan@unisoc.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Hongyan Xia <hongyan.xia2@arm.com> Reviewed-by: Christian Loehle <christian.loehle@arm.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20250219093747.2612-1-xuewen.yan@unisoc.com	2025-03-15 10:26:37 +01:00
Masami Hiramatsu (Google)	ac91052f0a	tracing: tprobe-events: Fix leakage of module refcount When enabling the tracepoint at module load, the target module refcount is incremented by find_tracepoint_in_module(). But it is unnecessary because the module is not unloaded while processing module loading callbacks. Moreover, the refcount is not decremented in that function. To make the module refcount handling clear, move the try_module_get() callsite to trace_fprobe_create_internal(), where it is actually required. Link: https://lore.kernel.org/all/174182761071.83274.18334217580449925882.stgit@devnote2/ Fixes: `57a7e6de9e` ("tracing/fprobe: Support raw tracepoints on future loaded modules") Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Cc: stable@vger.kernel.org	2025-03-15 08:37:47 +09:00
Masami Hiramatsu (Google)	0a8bb688aa	tracing: tprobe-events: Fix to clean up tprobe correctly when module unload When unloading a module, the tprobe events are not correctly cleaned up. Thus the event becomes an `fprobe-event` and can never be enabled again, even if the same module is loaded again. For example: # cd /sys/kernel/tracing # modprobe trace_events_sample # echo 't:my_tprobe foo_bar' >> dynamic_events # cat dynamic_events t:tracepoints/my_tprobe foo_bar # rmmod trace_events_sample # cat dynamic_events f:tracepoints/my_tprobe foo_bar As you can see, the second time my_tprobe starts with 'f' instead of 't'. This unregisters the fprobe and tracepoint callback when the module is unloaded but keeps the fprobe-event marked as a tprobe-event. Link: https://lore.kernel.org/all/174158724946.189309.15826571379395619524.stgit@mhiramat.tok.corp.google.com/ Fixes: `57a7e6de9e` ("tracing/fprobe: Support raw tracepoints on future loaded modules") Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>	2025-03-15 08:37:12 +09:00
Linus Torvalds	a22ea738f4	Merge tag 'sched-urgent-2025-03-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fix from Ingo Molnar: "Fix a sleeping-while-atomic bug caused by a recent optimization utilizing static keys that didn't consider that the static_key_disable() call could be triggered in atomic context. Revert the optimization" * tag 'sched-urgent-2025-03-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/clock: Don't define sched_clock_irqtime as static key	2025-03-14 09:56:46 -10:00
Linus Torvalds	28c50999c9	Merge tag 'locking-urgent-2025-03-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull misc locking fixes from Ingo Molnar: - Restrict the Rust runtime from unintended access to dynamically allocated LockClassKeys - KernelDoc annotation fix - Fix a lock ordering bug in semaphore::up(), related to trying to printk() and wake up the console within critical sections * tag 'locking-urgent-2025-03-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: locking/semaphore: Use wake_q to wake up processes outside lock critical section locking/rtmutex: Use the 'struct' keyword in kernel-doc comment rust: lockdep: Remove support for dynamically allocated LockClassKeys	2025-03-14 09:41:36 -10:00
Zhiming Hu	7c035bea94	KVM: TDX: Register TDX host key IDs to cgroup misc controller TDX host key IDs (HKID) are a limited resource in a machine, and the misc cgroup lets the machine owner track their usage and limit the possibility of abusing them outside the owner's control. The cgroup v2 miscellaneous subsystem was introduced to control the resource of AMD SEV & SEV-ES ASIDs. Likewise introduce HKIDs as a misc resource. Signed-off-by: Zhiming Hu <zhiming.hu@intel.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2025-03-14 14:20:51 -04:00
Andrea Righi	e4855fc90e	sched_ext: idle: Refactor scx_select_cpu_dfl() Make scx_select_cpu_dfl() more consistent with the other idle-related APIs by returning a negative value when an idle CPU isn't found. No functional changes, this is purely a refactoring. Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-03-14 08:17:11 -10:00
Andrea Righi	c414c2171c	sched_ext: idle: Honor idle flags in the built-in idle selection policy Enable passing idle flags (%SCX_PICK_IDLE_*) to scx_select_cpu_dfl(), to enforce strict selection criteria, such as selecting an idle CPU strictly within @prev_cpu's node or choosing only a fully idle SMT core. This functionality will be exposed through a dedicated kfunc in a separate patch. Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-03-14 08:17:01 -10:00
Tengda Wu	0b4ffbe488	tracing: Correct the refcount if the hist/hist_debug file fails to open The function event_{hist,hist_debug}_open() maintains the refcount of 'file->tr' and 'file' through tracing_open_file_tr(). However, it does not roll back these counts on subsequent failure paths, resulting in a refcount leak. A very obvious case is that if the hist/hist_debug file belongs to a specific instance, the refcount leak will prevent the deletion of that instance, as it relies on the condition 'tr->ref == 1' within __remove_instance(). Fix this by calling tracing_release_file_tr() on all failure paths in event_{hist,hist_debug}_open() to correct the refcount. Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Zheng Yejian <zhengyejian1@huawei.com> Link: https://lore.kernel.org/20250314065335.1202817-1-wutengda@huaweicloud.com Fixes: `1cc111b9cd` ("tracing: Fix uaf issue when open the hist or hist_debug file") Signed-off-by: Tengda Wu <wutengda@huaweicloud.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2025-03-14 08:29:12 -04:00
Paolo Abeni	941defcea7	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Cross-merge networking fixes after downstream PR (net-6.14-rc6). Conflicts: tools/testing/selftests/drivers/net/ping.py `75cc19c8ff` ("selftests: drv-net: add xdp cases for ping.py") `de94e86974` ("selftests: drv-net: store addresses in dict indexed by ipver") https://lore.kernel.org/netdev/20250311115758.17a1d414@canb.auug.org.au/ net/core/devmem.c `a70f891e0f` ("net: devmem: do not WARN conditionally after netdev_rx_queue_restart()") `1d22d3060b` ("net: drop rtnl_lock for queue_mgmt operations") https://lore.kernel.org/netdev/20250313114929.43744df1@canb.auug.org.au/ Adjacent changes: tools/testing/selftests/net/Makefile `6f50175cca` ("selftests: Add IPv6 link-local address generation tests for GRE devices.") `2e5584e0f9` ("selftests/net: expand cmsg_ipv6.sh with ipv4") drivers/net/ethernet/broadcom/bnxt/bnxt.c `661958552e` ("eth: bnxt: do not use BNXT_VNIC_NTUPLE unconditionally in queue restart logic") `fe96d717d3` ("bnxt_en: Extend queue stop/start for TX rings") Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-03-13 23:08:11 +01:00
Thomas Gleixner	8327df4059	genirq/msi: Rename msi_[un]lock_descs() Now that all abuse is gone and the legit users are converted to guard(msi_descs_lock), rename the lock functions and document them as internal. No functional change. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huwei.com> Link: https://lore.kernel.org/all/20250313130322.027190131@linutronix.de	2025-03-13 18:58:00 +01:00
Thomas Gleixner	5c99e0226e	genirq/msi: Use lock guards for MSI descriptor locking Provide a lock guard for MSI descriptor locking and update the core code accordingly. No functional change intended. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Link: https://lore.kernel.org/all/20250313130321.506045185@linutronix.de	2025-03-13 18:57:59 +01:00
Andrea Righi	97e13ecb02	sched_ext: Skip per-CPU tasks in scx_bpf_reenqueue_local() scx_bpf_reenqueue_local() can be invoked from ops.cpu_release() to give tasks that are queued to the local DSQ a chance to migrate to other CPUs, when a CPU is taken by a higher scheduling class. However, there is no point re-enqueuing tasks that can only run on that particular CPU, as they would simply be re-added to the same local DSQ without any benefit. Therefore, skip per-CPU tasks in scx_bpf_reenqueue_local(). Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-03-13 07:07:27 -10:00
Thomas Gleixner	5376252335	genirq/msi: Make a few functions static None of these functions are used outside of the MSI core. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20250309084110.204054172@linutronix.de	2025-03-13 13:35:33 +01:00
Thomas Gleixner	ec2d0c0462	posix-timers: Provide a mechanism to allocate a given timer ID Checkpoint/Restore in Userspace (CRIU) needs to reconstruct posix timers with the same timer ID on restore. It uses sys_timer_create() and relies on the monotonically increasing timer ID provided by this syscall. It creates and deletes timers until the desired ID is reached. This can loop for a long time when the checkpointed process had a very sparse timer ID range. It has been debated to implement a new syscall to allow the creation of timers with a given timer ID, but that's tedious due to the 32/64bit compat issues of sigevent_t and of dubious value. The restore mechanism of CRIU creates the timers in a state where all threads of the restored process are held on a barrier and cannot issue syscalls. That means the restorer task has exclusive control. This makes it possible to address the issue with a prctl() so that the restorer thread can do:

    if (prctl(PR_TIMER_CREATE_RESTORE_IDS, PR_TIMER_CREATE_RESTORE_IDS_ON))
        goto linear_mode;
    create_timers_with_explicit_ids();
    prctl(PR_TIMER_CREATE_RESTORE_IDS, PR_TIMER_CREATE_RESTORE_IDS_OFF);

This is backwards compatible because the prctl() fails on older kernels and CRIU can fall back to the linear timer ID mechanism. CRIU versions which do not know about the prctl() just work as before. Implement the prctl() and modify timer_create() so that it copies the requested timer ID from userspace by utilizing the existing timer_t pointer, which is used to copy out the allocated timer ID on success. If the prctl() is disabled, which it is by default, timer_create() works as before and does not try to read from the userspace pointer. There is no problem when a broken or rogue user space application enables the prctl(). If the user space pointer does not contain a valid ID, then timer_create() fails. 
If the data is not initialized, but contains a random valid ID, timer_create() will create that random timer ID or fail if the ID is already given out. As CRIU must use the raw syscall to avoid manipulating the internal state of the restored process, this has no library dependencies and can be adopted by CRIU right away. Recreating two timers with IDs 1000000 and 2000000 takes 1.5 seconds with the create/delete method. With the prctl() it takes 3 microseconds. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Cyrill Gorcunov <gorcunov@gmail.com> Tested-by: Cyrill Gorcunov <gorcunov@gmail.com> Link: https://lore.kernel.org/all/87jz8vz0en.ffs@tglx	2025-03-13 12:07:18 +01:00
Thomas Gleixner	451898ea42	posix-timers: Make per process list RCU safe Preparatory change to remove the sighand locking from the /proc/$PID/timers iterator. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/all/20250308155624.403223080@linutronix.de	2025-03-13 12:07:18 +01:00
Thomas Gleixner	5fa75a432f	posix-timers: Avoid false cacheline sharing struct k_itimer has the hlist_node, which is used for lookup in the hash bucket, and the timer lock in the same cache line. That's obviously bad, if one CPU fiddles with a timer and the other is walking the hash bucket on which that timer is queued. Avoid this by restructuring struct k_itimer, so that the read mostly (only modified during setup and teardown) fields are in the first cache line and the lock and the rest of the fields which get written to are in cacheline 2-N. Reduces cacheline contention in a test case of 64 processes creating and accessing 20000 timers each by almost 30% according to perf. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/all/20250308155624.341108067@linutronix.de	2025-03-13 12:07:18 +01:00
Thomas Gleixner	781764e0b4	posix-timers: Switch to jhash32() The hash distribution of hash_32() is suboptimal. jhash32() provides a way better distribution, which evens out the length of the hash bucket lists, which in turn avoids large outliers in list walk times. Due to the sparse ID space (thanks CRIU) there is no guarantee that the timers will be fully evenly distributed over the hash buckets, but the behaviour is way better than with hash_32() even for randomly sparse ID spaces. For a pathological test case with 64 processes creating and accessing 20000 timers each, this results in a runtime reduction of ~10% and a significantly reduced runtime variation. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20250308155624.279080328@linutronix.de	2025-03-13 12:07:17 +01:00
Thomas Gleixner	1535cb8028	posix-timers: Improve hash table performance Eric and Ben reported a significant performance bottleneck on the global hash, which is used to store posix timers for lookup. Eric tried to do a lockless validation of a new timer ID before trying to insert the timer, but that does not solve the problem. For the non-contended case this is a pointless exercise and for the contended case this extra lookup just creates enough interleaving that all tasks can make progress. There are actually two real solutions to the problem: 1) Provide a per process (signal struct) xarray storage 2) Implement a smarter hash like the one in the futex code #1 works perfectly fine for most cases, but the fact that CRIU enforced a linearly increasing timer ID to restore timers makes this problematic. It's easy enough to create a sparse timer ID space, which amounts very fast to a large chunk of memory consumed for the xarray. 2048 timers with an ID offset of 512 consume more than one megabyte of memory for the xarray storage. #2 The main advantage of the futex hash is that it uses per hash bucket locks instead of a global hash lock. Aside from that it is scaled according to the number of CPUs at boot time. Experiments with artificial benchmarks have shown that a scaled hash with per bucket locks comes pretty close to the xarray performance and in some scenarios it performs better. 
Test 1: A single process creates 20000 timers and afterwards invokes timer_getoverrun(2) on each of them:

              mainline      Eric   newhash    xarray
  create         23 ms     23 ms      9 ms      8 ms
  getoverrun     14 ms     14 ms      5 ms      4 ms

Test 2: A single process creates 50000 timers and afterwards invokes timer_getoverrun(2) on each of them:

              mainline      Eric   newhash    xarray
  create         98 ms    219 ms     20 ms     18 ms
  getoverrun     62 ms     62 ms     10 ms      9 ms

Test 3: A single process creates 100000 timers and afterwards invokes timer_getoverrun(2) on each of them:

              mainline      Eric   newhash    xarray
  create        313 ms    750 ms     48 ms     33 ms
  getoverrun    261 ms    260 ms     20 ms     14 ms

Eric's changes create quite some overhead in the create() path due to the double list walk, as the main issue according to perf is the list walk itself. With 100k timers each hash bucket contains ~200 timers, which in the worst case need to be all inspected. The same problem applies for getoverrun() where the lookup has to walk through the hash buckets to find the timer it is looking for. The scaled hash obviously reduces hash collisions and lock contention significantly. This becomes more prominent with concurrency.

Test 4: A process creates 63 threads and all threads wait on a barrier before each instance creates 20000 timers and afterwards invokes timer_getoverrun(2) on each of them. The threads are pinned on separate CPUs to achieve maximum concurrency. The numbers are the average times per thread:

                mainline       Eric   newhash    xarray
  create       180239 ms   38599 ms    579 ms    813 ms
  getoverrun     2645 ms    2642 ms     32 ms      7 ms

Test 5: A process forks 63 times and all forks wait on a barrier before each instance creates 20000 timers and afterwards invokes timer_getoverrun(2) on each of them. The processes are pinned on separate CPUs to achieve maximum concurrency. 
The numbers are the average times per process:

                mainline       Eric   newhash    xarray
  create       157253 ms   40008 ms     83 ms     60 ms
  getoverrun     2611 ms    2614 ms     40 ms      4 ms

So clearly the reduction of lock contention with Eric's changes makes a significant difference for the create() loop, but it does not mitigate the problem of long list walks, which is clearly visible on the getoverrun() side because that is purely dominated by the lookup itself. Once the timer is found, the syscall just reads from the timer structure with no other locks or code paths involved and returns. The reason for the difference between the thread and the fork case for the new hash and the xarray is that both suffer from contention on sighand::siglock and the xarray suffers additionally from contention on the xarray lock on insertion. The only case where the reworked hash slightly outperforms the xarray is a tight loop which creates and deletes timers.

Test 4: A process creates 63 threads and all threads wait on a barrier before each instance runs a loop which creates and deletes a timer 100000 times in a row. The threads are pinned on separate CPUs to achieve maximum concurrency. The numbers are the average times per thread:

          mainline      Eric   newhash    xarray
  loop     5917 ms   5897 ms   5473 ms   7846 ms

Test 5: A process forks 63 times and all forks wait on a barrier before each instance runs a loop which creates and deletes a timer 100000 times in a row. The processes are pinned on separate CPUs to achieve maximum concurrency. The numbers are the average times per process:

          mainline      Eric   newhash    xarray
  loop     5137 ms   7828 ms    891 ms    872 ms

In both tests there is not much contention on the hash, but the ucount accounting for the signal and in the thread case the sighand::siglock contention (plus the xarray locking) contribute dominantly to the overhead. As the memory consumption of the xarray in the sparse ID case is significant, the scaled hash with per bucket locks seems to be the better overall option. 
While the xarray has faster lookup times for a large number of timers, the actual syscall usage, which requires the lookup, is not an extreme hotpath. Most applications utilize signal delivery and all syscalls except timer_getoverrun(2) are anything but cheap. So implement a scaled hash with per bucket locks, which offers the best tradeoff between performance and memory consumption. Reported-by: Eric Dumazet <edumazet@google.com> Reported-by: Benjamin Segall <bsegall@google.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/all/20250308155624.216091571@linutronix.de	2025-03-13 12:07:17 +01:00
Eric Dumazet	feb864ee99	posix-timers: Make signal_struct::next_posix_timer_id an atomic_t The global hash_lock protecting the posix timer hash table can be heavily contended especially when there is an extensive linear search for a timer ID. Timer IDs are handed out by monotonically increasing next_posix_timer_id and then validating that there is no timer with the same ID in the hash table. Both operations happen with the global hash lock held. To reduce the hash lock contention the hash will be reworked to a scaled hash with per bucket locks, which requires to handle the ID counter lockless. Prepare for this by making next_posix_timer_id an atomic_t, which can be used lockless with atomic_inc_return(). [ tglx: Adopted from Eric's series, massaged change log and simplified it ] Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Acked-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/all/20250219125522.2535263-2-edumazet@google.com Link: https://lore.kernel.org/all/20250308155624.151545978@linutronix.de	2025-03-13 12:07:17 +01:00
Peter Zijlstra	538d710ec7	posix-timers: Make lock_timer() use guard() The lookup and locking of posix timers requires the same repeating pattern at all usage sites: tmr = lock_timer(timer_id); if (!tmr) return -EINVAL; .... unlock_timer(tmr); Solve this with a guard implementation, which works in most places out of the box except for those, which need to unlock the timer inside the guard scope. Though the only places where this matters are timer_delete() and timer_settime(). In both cases the timer pointer needs to be preserved across the end of the scope, which is solved by storing the pointer in a variable outside of the scope. timer_settime() also has to protect the timer with RCU before unlocking, which obviously can't use guard(rcu) before leaving the guard scope as that guard is cleaned up before the unlock. Solve this by providing the RCU protection open coded. [ tglx: Made it work and added change log ] Signed-off-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/all/20250224162103.GD11590@noisy.programming.kicks-ass.net Link: https://lore.kernel.org/all/20250308155624.087465658@linutronix.de	2025-03-13 12:07:17 +01:00
Thomas Gleixner	1d25bdd3f3	posix-timers: Rework timer removal sys_timer_delete() and the do_exit() cleanup function itimer_delete() are doing the same thing, but have needlessly different implementations instead of sharing the code. The other oddity of timer deletion is the fact that the timer is not invalidated before the actual deletion happens, which allows concurrent lookups to succeed. That's wrong because a timer which is in the process of being deleted should not be visible and any actions like signal queueing, delivery and rearming should not happen once the task, which invoked timer_delete(), has the timer locked. Rework the code so that: 1) The signal queueing and delivery code ignore timers which are marked invalid 2) The deletion implementation between sys_timer_delete() and itimer_delete() is shared 3) The timer is invalidated and removed from the linked lists before the deletion callback of the relevant clock is invoked. That requires to rework timer_wait_running() as it does a lookup of the timer when relocking it at the end. In case of deletion this lookup would fail due to the preceding invalidation and the wait loop would terminate prematurely. But due to the preceding invalidation the timer cannot be accessed by other tasks anymore, so there is no way that the timer has been freed after the timer lock has been dropped. Move the re-validation out of timer_wait_running() and handle it at the only other usage site, timer_settime(). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/all/87zfht1exf.ffs@tglx	2025-03-13 12:07:17 +01:00
Thomas Gleixner	50f53b23f1	posix-timers: Simplify lock/unlock_timer() Since the integration of sigqueue into the timer struct, lock_timer() is only used in task context. So taking the lock with irqsave() is no longer required. Convert it to use spin_[un]lock_irq(). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/all/20250308155623.959825668@linutronix.de	2025-03-13 12:07:17 +01:00
Thomas Gleixner	a31a300c4d	posix-timers: Use guards in a few places Switch locking and RCU to guards where applicable. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/all/20250308155623.892762130@linutronix.de	2025-03-13 12:07:17 +01:00
Thomas Gleixner	f6d0c3d2eb	posix-timers: Remove SLAB_PANIC from kmem cache There is no need to panic when the posix-timer kmem_cache can't be created. timer_create() will fail with -ENOMEM and that's it. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/all/20250308155623.829215801@linutronix.de	2025-03-13 12:07:16 +01:00
Thomas Gleixner	4c5cd058be	posix-timers: Remove a few paranoid warnings Warnings about a non-initialized timer or non-existing callbacks are just useful for implementing new posix clocks, but there a NULL pointer dereference is expected anyway. :) Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/all/20250308155623.765462334@linutronix.de	2025-03-13 12:07:16 +01:00
Thomas Gleixner	6ad9c3380a	posix-timers: Cleanup includes Remove pointless includes and sort the remaining ones alphabetically. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/all/20250308155623.701301552@linutronix.de	2025-03-13 12:07:16 +01:00
Eric Dumazet	5f2909c6cd	posix-timers: Add cond_resched() to posix_timer_add() search loop With a large number of POSIX timers the search for a valid ID might cause a soft lockup on PREEMPT_NONE/VOLUNTARY kernels. Add cond_resched() to the loop to prevent that. [ tglx: Split out from Eric's series ] Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/all/20250214135911.2037402-2-edumazet@google.com Link: https://lore.kernel.org/all/20250308155623.635612865@linutronix.de	2025-03-13 12:07:16 +01:00
Eric Dumazet	45ece9933d	posix-timers: Initialise timer before adding it to the hash table A timer is only valid in the hashtable when both timer::it_signal and timer::it_id are set to their final values, but timers are added without those values being set. The timer ID is allocated when the timer is added to the hash in invalid state. The ID is taken from a monotonically increasing per process counter which wraps around after reaching INT_MAX. The hash insertion validates that there is no timer with the allocated ID in the hash table which belongs to the same process. That opens a mostly theoretical race condition: If other threads of the same process manage to create/delete timers in rapid succession before the newly created timer is fully initialized and wrap around to the timer ID which was handed out, then a duplicate timer ID will be inserted into the hash table. Prevent this by: 1) Setting timer::it_id before inserting the timer into the hashtable. 2) Storing the signal pointer in timer::it_signal with bit 0 set before inserting it into the hashtable. Bit 0 acts as an invalid bit, which means that the regular lookup for sys_timer_*() will fail the comparison with the signal pointer. But the lookup on insertion masks out bit 0 and can therefore detect a timer which is not yet valid, but allocated in the hash table. Bit 0 in the pointer is cleared once the initialization of the timer completed. [ tglx: Fold ID and signal initialization into one patch and massage change log and comments. ] Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/all/20250219125522.2535263-3-edumazet@google.com Link: https://lore.kernel.org/all/20250308155623.572035178@linutronix.de	2025-03-13 12:07:16 +01:00
Thomas Gleixner	2389c6efd3	posix-timers: Ensure that timer initialization is fully visible Frederic pointed out that the memory operations to initialize the timer are not guaranteed to be visible, when __lock_timer() observes timer::it_signal valid under timer::it_lock:

  T0                                      T1
  ---------                               -----------
  do_timer_create()
    // A
    new_timer->.... = ....
    spin_lock(current->sighand)
    // B
    WRITE_ONCE(new_timer->it_signal, current->signal)
    spin_unlock(current->sighand)
                                          sys_timer_*()
                                            t = __lock_timer()
                                              spin_lock(&timr->it_lock)
                                              // observes B
                                              if (timr->it_signal == current->signal)
                                                return timr;
                                            if (!t)
                                              return;
                                            // Is not guaranteed to observe A

Protect the write of timer::it_signal, which makes the timer valid, with timer::it_lock as well. This guarantees that T1 must observe the initialization A completely, when it observes the valid signal pointer under timer::it_lock. sighand::siglock must still be taken to protect the signal::posix_timers list. Reported-by: Frederic Weisbecker <frederic@kernel.org> Suggested-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/all/20250308155623.507944489@linutronix.de	2025-03-13 12:07:16 +01:00
Thorsten Blum	fc661d0a78	clocksource: Remove unnecessary strscpy() size argument The size argument of strscpy() is only required when the destination pointer is not a fixed sized array or when the copy needs to be smaller than the size of the fixed sized destination array. For fixed sized destination arrays and full copies, strscpy() automatically determines the length of the destination buffer if the size argument is omitted. This makes the explicit sizeof() unnecessary. Remove it. [ tglx: Massaged change log ] Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20250311110624.495718-2-thorsten.blum@linux.dev	2025-03-13 11:37:44 +01:00
Thomas Weißschuh	a52067c24c	timer_list: Don't use %pK through printk() This reverts commit `f590308536` ("timer debug: Hide kernel addresses via %pK in /proc/timer_list") The timer list helper SEQ_printf() uses either the real seq_printf() for procfs output or vprintk() to print to the kernel log, when invoked from SysRq-q. It uses %pK for printing pointers. In the past %pK was preferred over %p as it would not leak raw pointer values into the kernel log. Since commit `ad67b74d24` ("printk: hash addresses printed with %p") the regular %p has been improved to avoid this issue. Furthermore, restricted pointers ("%pK") were never meant to be used through printk(). They can still unintentionally leak raw pointers or acquire sleeping locks in atomic contexts. Switch to the regular pointer formatting which is safer, easier to reason about and sufficient here. Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/lkml/20250113171731-dc10e3c1-da64-4af0-b767-7c7070468023@linutronix.de/ Link: https://lore.kernel.org/all/20250311-restricted-pointers-timer-v1-1-6626b91e54ab@linutronix.de	2025-03-13 08:19:19 +01:00
Linus Torvalds	b7f94fcf55	Merge tag 'sched_ext-for-6.14-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext fix from Tejun Heo: "BPF schedulers could trigger a crash by passing in an invalid CPU to the scx_bpf_select_cpu_dfl() helper. Fix it by verifying input validity" * tag 'sched_ext-for-6.14-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: sched_ext: Validate prev_cpu in scx_bpf_select_cpu_dfl()	2025-03-12 11:52:04 -10:00
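
Illustration for commit 45ece9933d above ("posix-timers: Initialise timer before adding it to the hash table"): the bit-0 invalid marker it describes can be sketched in plain userspace C. This is only a sketch under stated assumptions, not the kernel code. The names struct signal_ctx, struct timer_entry, TIMER_INVALID_BIT, timer_prepare(), timer_publish(), lookup_matches() and id_in_use() are hypothetical, and locking and the hash table itself are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the owning process context (not signal_struct). */
struct signal_ctx {
	int dummy;
};

/* Hypothetical stand-in for a timer; only the two fields that matter here. */
struct timer_entry {
	int id;
	uintptr_t it_signal;	/* owner pointer, bit 0 set while not yet valid */
};

#define TIMER_INVALID_BIT ((uintptr_t)1)

/* Set the ID and store the owner with bit 0 set, before hash insertion. */
static void timer_prepare(struct timer_entry *t, int id, struct signal_ctx *sig)
{
	/* The scheme relies on bit 0 of an aligned pointer being free. */
	assert(((uintptr_t)sig & TIMER_INVALID_BIT) == 0);
	t->id = id;
	t->it_signal = (uintptr_t)sig | TIMER_INVALID_BIT;
}

/* Clear bit 0 once initialization is complete; the timer becomes visible. */
static void timer_publish(struct timer_entry *t)
{
	t->it_signal &= ~TIMER_INVALID_BIT;
}

/* Regular lookup: exact pointer compare, so a marked timer never matches. */
static bool lookup_matches(const struct timer_entry *t, int id,
			   const struct signal_ctx *sig)
{
	return t->id == id && t->it_signal == (uintptr_t)sig;
}

/* Insertion check: mask out bit 0, so a not-yet-valid timer still blocks its ID. */
static bool id_in_use(const struct timer_entry *t, int id,
		      const struct signal_ctx *sig)
{
	return t->id == id &&
	       (t->it_signal & ~TIMER_INVALID_BIT) == (uintptr_t)sig;
}
```

In this sketch a timer that has gone through timer_prepare() but not timer_publish() is invisible to lookup_matches() while id_in_use() still reports its ID as taken, which is the property the commit message relies on to rule out duplicate IDs.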

49605 Commits