linux

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2026-05-09 22:14:36 -04:00

Author	SHA1	Message	Date
Tejun Heo	f0e1a0643a	sched_ext: Implement BPF extensible scheduler class Implement a new scheduler class sched_ext (SCX), which allows scheduling policies to be implemented as BPF programs to achieve the following: 1. Ease of experimentation and exploration: Enabling rapid iteration of new scheduling policies. 2. Customization: Building application-specific schedulers which implement policies that are not applicable to general-purpose schedulers. 3. Rapid scheduler deployments: Non-disruptive swap outs of scheduling policies in production environments. sched_ext leverages BPF’s struct_ops feature to define a structure which exports function callbacks and flags to BPF programs that wish to implement scheduling policies. The struct_ops structure exported by sched_ext is struct sched_ext_ops, and is conceptually similar to struct sched_class. The role of sched_ext is to map the complex sched_class callbacks to the more simple and ergonomic struct sched_ext_ops callbacks. For more detailed discussion on the motivations and overview, please refer to the cover letter. Later patches will also add several example schedulers and documentation. This patch implements the minimum core framework to enable implementation of BPF schedulers. Subsequent patches will gradually add functionalities including safety guarantee mechanisms, nohz and cgroup support. include/linux/sched/ext.h defines struct sched_ext_ops. With the comment on top, each operation should be self-explanatory. The followings are worth noting: - Both "sched_ext" and its shorthand "scx" are used. If the identifier already has "sched" in it, "ext" is used; otherwise, "scx". - In sched_ext_ops, only .name is mandatory. Every operation is optional and if omitted a simple but functional default behavior is provided. - A new policy constant SCHED_EXT is added and a task can select sched_ext by invoking sched_setscheduler(2) with the new policy constant. However, if the BPF scheduler is not loaded, SCHED_EXT is the same as SCHED_NORMAL and the task is scheduled by CFS. When the BPF scheduler is loaded, all tasks which have the SCHED_EXT policy are switched to sched_ext. - To bridge the workflow imbalance between the scheduler core and sched_ext_ops callbacks, sched_ext uses simple FIFOs called dispatch queues (dsq's). By default, there is one global dsq (SCX_DSQ_GLOBAL), and one local per-CPU dsq (SCX_DSQ_LOCAL). SCX_DSQ_GLOBAL is provided for convenience and need not be used by a scheduler that doesn't require it. SCX_DSQ_LOCAL is the per-CPU FIFO that sched_ext pulls from when putting the next task on the CPU. The BPF scheduler can manage an arbitrary number of dsq's using scx_bpf_create_dsq() and scx_bpf_destroy_dsq(). - sched_ext guarantees system integrity no matter what the BPF scheduler does. To enable this, each task's ownership is tracked through p->scx.ops_state and all tasks are put on scx_tasks list. The disable path can always recover and revert all tasks back to CFS. See p->scx.ops_state and scx_tasks. - A task is not tied to its rq while enqueued. This decouples CPU selection from queueing and allows sharing a scheduling queue across an arbitrary subset of CPUs. This adds some complexities as a task may need to be bounced between rq's right before it starts executing. See dispatch_to_local_dsq() and move_task_to_local_dsq(). - One complication that arises from the above weak association between task and rq is that synchronizing with dequeue() gets complicated as dequeue() may happen anytime while the task is enqueued and the dispatch path might need to release the rq lock to transfer the task. Solving this requires a bit of complexity. See the logic around p->scx.sticky_cpu and p->scx.ops_qseq. - Both enable and disable paths are a bit complicated. The enable path switches all tasks without blocking to avoid issues which can arise from partially switched states (e.g. the switching task itself being starved). The disable path can't trust the BPF scheduler at all, so it also has to guarantee forward progress without blocking. See scx_ops_enable() and scx_ops_disable_workfn(). - When sched_ext is disabled, static_branches are used to shut down the entry points from hot paths. v7: - scx_ops_bypass() was incorrectly and unnecessarily trying to grab scx_ops_enable_mutex which can lead to deadlocks in the disable path. Fixed. - Fixed TASK_DEAD handling bug in scx_ops_enable() path which could lead to use-after-free. - Consolidated per-cpu variable usages and other cleanups. v6: - SCX_NR_ONLINE_OPS replaced with SCX_OPI__BEGIN/END so that multiple groups can be expressed. Later CPU hotplug operations are put into their own group. - SCX_OPS_DISABLING state is replaced with the new bypass mechanism which allows temporarily putting the system into simple FIFO scheduling mode bypassing the BPF scheduler. In addition to the shut down path, this will also be used to isolate the BPF scheduler across PM events. Enabling and disabling the bypass mode requires iterating all runnable tasks. rq->scx.runnable_list addition is moved from the later watchdog patch. - ops.prep_enable() is replaced with ops.init_task() and ops.enable/disable() are now called whenever the task enters and leaves sched_ext instead of when the task becomes schedulable on sched_ext and stops being so. A new operation - ops.exit_task() - is called when the task stops being schedulable on sched_ext. - scx_bpf_dispatch() can now be called from ops.select_cpu() too. This removes the need for communicating local dispatch decision made by ops.select_cpu() to ops.enqueue() via per-task storage. SCX_KF_SELECT_CPU is added to support the change. - SCX_TASK_ENQ_LOCAL which told the BPF scheudler that scx_select_cpu_dfl() wants the task to be dispatched to the local DSQ was removed. Instead, scx_bpf_select_cpu_dfl() now dispatches directly if it finds a suitable idle CPU. If such behavior is not desired, users can use scx_bpf_select_cpu_dfl() which returns the verdict in a bool out param. - scx_select_cpu_dfl() was mishandling WAKE_SYNC and could end up queueing many tasks on a local DSQ which makes tasks to execute in order while other CPUs stay idle which made some hackbench numbers really bad. Fixed. - The current state of sched_ext can now be monitored through files under /sys/sched_ext instead of /sys/kernel/debug/sched/ext. This is to enable monitoring on kernels which don't enable debugfs. - sched_ext wasn't telling BPF that ops.dispatch()'s @prev argument may be NULL and a BPF scheduler which derefs the pointer without checking could crash the kernel. Tell BPF. This is currently a bit ugly. A better way to annotate this is expected in the future. - scx_exit_info updated to carry pointers to message buffers instead of embedding them directly. This decouples buffer sizes from API so that they can be changed without breaking compatibility. - exit_code added to scx_exit_info. This is used to indicate different exit conditions on non-error exits and will be used to handle e.g. CPU hotplugs. - The patch "sched_ext: Allow BPF schedulers to switch all eligible tasks into sched_ext" is folded in and the interface is changed so that partial switching is indicated with a new ops flag %SCX_OPS_SWITCH_PARTIAL. This makes scx_bpf_switch_all() unnecessasry and in turn SCX_KF_INIT. ops.init() is now called with SCX_KF_SLEEPABLE. - Code reorganized so that only the parts necessary to integrate with the rest of the kernel are in the header files. - Changes to reflect the BPF and other kernel changes including the addition of bpf_sched_ext_ops.cfi_stubs. v5: - To accommodate 32bit configs, p->scx.ops_state is now atomic_long_t instead of atomic64_t and scx_dsp_buf_ent.qseq which uses load_acquire/store_release is now unsigned long instead of u64. - Fix the bug where bpf_scx_btf_struct_access() was allowing write access to arbitrary fields. - Distinguish kfuncs which can be called from any sched_ext ops and from anywhere. e.g. scx_bpf_pick_idle_cpu() can now be called only from sched_ext ops. - Rename "type" to "kind" in scx_exit_info to make it easier to use on languages in which "type" is a reserved keyword. - Since `cff9b2332a` ("kernel/sched: Modify initial boot task idle setup"), PF_IDLE is not set on idle tasks which haven't been online yet which made scx_task_iter_next_filtered() include those idle tasks in iterations leading to oopses. Update scx_task_iter_next_filtered() to directly test p->sched_class against idle_sched_class instead of using is_idle_task() which tests PF_IDLE. - Other updates to match upstream changes such as adding const to set_cpumask() param and renaming check_preempt_curr() to wakeup_preempt(). v4: - SCHED_CHANGE_BLOCK replaced with the previous sched_deq_and_put_task()/sched_enq_and_set_tsak() pair. This is because upstream is adaopting a different generic cleanup mechanism. Once that lands, the code will be adapted accordingly. - task_on_scx() used to test whether a task should be switched into SCX, which is confusing. Renamed to task_should_scx(). task_on_scx() now tests whether a task is currently on SCX. - scx_has_idle_cpus is barely used anymore and replaced with direct check on the idle cpumask. - SCX_PICK_IDLE_CORE added and scx_pick_idle_cpu() improved to prefer fully idle cores. - ops.enable() now sees up-to-date p->scx.weight value. - ttwu_queue path is disabled for tasks on SCX to avoid confusing BPF schedulers expecting ->select_cpu() call. - Use cpu_smt_mask() instead of topology_sibling_cpumask() like the rest of the scheduler. v3: - ops.set_weight() added to allow BPF schedulers to track weight changes without polling p->scx.weight. - move_task_to_local_dsq() was losing SCX-specific enq_flags when enqueueing the task on the target dsq because it goes through activate_task() which loses the upper 32bit of the flags. Carry the flags through rq->scx.extra_enq_flags. - scx_bpf_dispatch(), scx_bpf_pick_idle_cpu(), scx_bpf_task_running() and scx_bpf_task_cpu() now use the new KF_RCU instead of KF_TRUSTED_ARGS to make it easier for BPF schedulers to call them. - The kfunc helper access control mechanism implemented through sched_ext_entity.kf_mask is improved. Now SCX_CALL_OP() is always used when invoking scx_ops operations. v2: - balance_scx_on_up() is dropped. Instead, on UP, balance_scx() is called from put_prev_taks_scx() and pick_next_task_scx() as necessary. To determine whether balance_scx() should be called from put_prev_task_scx(), SCX_TASK_DEQD_FOR_SLEEP flag is added. See the comment in put_prev_task_scx() for details. - sched_deq_and_put_task() / sched_enq_and_set_task() sequences replaced with SCHED_CHANGE_BLOCK(). - Unused all_dsqs list removed. This was a left-over from previous iterations. - p->scx.kf_mask is added to track and enforce which kfunc helpers are allowed. Also, init/exit sequences are updated to make some kfuncs always safe to call regardless of the current BPF scheduler state. Combined, this should make all the kfuncs safe. - BPF now supports sleepable struct_ops operations. Hacky workaround removed and operations and kfunc helpers are tagged appropriately. - BPF now supports bitmask / cpumask helpers. scx_bpf_get_idle_cpumask() and friends are added so that BPF schedulers can use the idle masks with the generic helpers. This replaces the hacky kfunc helpers added by a separate patch in V1. - CONFIG_SCHED_CLASS_EXT can no longer be enabled if SCHED_CORE is enabled. This restriction will be removed by a later patch which adds core-sched support. - Add MAINTAINERS entries and other misc changes. Signed-off-by: Tejun Heo <tj@kernel.org> Co-authored-by: David Vernet <dvernet@meta.com> Acked-by: Josh Don <joshdon@google.com> Acked-by: Hao Luo <haoluo@google.com> Acked-by: Barret Rhoden <brho@google.com> Cc: Andrea Righi <andrea.righi@canonical.com>	2024-06-18 10:09:17 -10:00
Tejun Heo	a7a9fc5492	sched_ext: Add boilerplate for extensible scheduler class This adds dummy implementations of sched_ext interfaces which interact with the scheduler core and hook them in the correct places. As they're all dummies, this doesn't cause any behavior changes. This is split out to help reviewing. v2: balance_scx_on_up() dropped. This will be handled in sched_ext proper. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: David Vernet <dvernet@meta.com> Acked-by: Josh Don <joshdon@google.com> Acked-by: Hao Luo <haoluo@google.com> Acked-by: Barret Rhoden <brho@google.com>	2024-06-18 10:09:17 -10:00
Tejun Heo	2c8d046d5d	sched: Add normal_policy() A new BPF extensible sched_class will need to dynamically change how a task picks its sched_class. For example, if the loaded BPF scheduler progs fail, the tasks will be forced back on CFS even if the task's policy is set to the new sched_class. To support such mapping, add normal_policy() which wraps testing for %SCHED_NORMAL. This doesn't cause any behavior changes. v2: Update the description with more details on the expected use. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: David Vernet <dvernet@meta.com> Acked-by: Josh Don <joshdon@google.com> Acked-by: Hao Luo <haoluo@google.com> Acked-by: Barret Rhoden <brho@google.com>	2024-06-18 10:09:17 -10:00
Tejun Heo	96fd6c65ef	sched: Factor out update_other_load_avgs() from __update_blocked_others() RT, DL, thermal and irq load and utilization metrics need to be decayed and updated periodically and before consumption to keep the numbers reasonable. This is currently done from __update_blocked_others() as a part of the fair class load balance path. Let's factor it out to update_other_load_avgs(). Pure refactor. No functional changes. This will be used by the new BPF extensible scheduling class to ensure that the above metrics are properly maintained. v2: Refreshed on top of tip:sched/core. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: David Vernet <dvernet@meta.com>	2024-06-18 10:09:17 -10:00
Tejun Heo	4f9c7ca851	sched: Factor out cgroup weight conversion functions Factor out sched_weight_from/to_cgroup() which convert between scheduler shares and cgroup weight. No functional change. The factored out functions will be used by a new BPF extensible sched_class so that the weights can be exposed to the BPF programs in a way which is consistent cgroup weights and easier to interpret. The weight conversions will be used regardless of cgroup usage. It's just borrowing the cgroup weight range as it's more intuitive. CGROUP_WEIGHT_MIN/DFL/MAX constants are moved outside CONFIG_CGROUPS so that the conversion helpers can always be defined. v2: The helpers are now defined regardless of COFNIG_CGROUPS. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: David Vernet <dvernet@meta.com> Acked-by: Josh Don <joshdon@google.com> Acked-by: Hao Luo <haoluo@google.com> Acked-by: Barret Rhoden <brho@google.com>	2024-06-18 10:09:16 -10:00
Tejun Heo	d8c7bc2e20	sched: Add sched_class->switching_to() and expose check_class_changing/changed() When a task switches to a new sched_class, the prev and new classes are notified through ->switched_from() and ->switched_to(), respectively, after the switching is done. A new BPF extensible sched_class will have callbacks that allow the BPF scheduler to keep track of relevant task states (like priority and cpumask). Those callbacks aren't called while a task is on a different sched_class. When a task comes back, we wanna tell the BPF progs the up-to-date state before the task gets enqueued, so we need a hook which is called before the switching is committed. This patch adds ->switching_to() which is called during sched_class switch through check_class_changing() before the task is restored. Also, this patch exposes check_class_changing/changed() in kernel/sched/sched.h. They will be used by the new BPF extensible sched_class to implement implicit sched_class switching which is used e.g. when falling back to CFS when the BPF scheduler fails or unloads. This is a prep patch and doesn't cause any behavior changes. The new operation and exposed functions aren't used yet. v3: Refreshed on top of tip:sched/core. v2: Improve patch description w/ details on planned use. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: David Vernet <dvernet@meta.com> Acked-by: Josh Don <joshdon@google.com> Acked-by: Hao Luo <haoluo@google.com> Acked-by: Barret Rhoden <brho@google.com>	2024-06-18 10:09:16 -10:00
Tejun Heo	e83edbf88f	sched: Add sched_class->reweight_task() Currently, during a task weight change, sched core directly calls reweight_task() defined in fair.c if @p is on CFS. Let's make it a proper sched_class operation instead. CFS's reweight_task() is renamed to reweight_task_fair() and now called through sched_class. While it turns a direct call into an indirect one, set_load_weight() isn't called from a hot path and this change shouldn't cause any noticeable difference. This will be used to implement reweight_task for a new BPF extensible sched_class so that it can keep its cached task weight up-to-date. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: David Vernet <dvernet@meta.com> Acked-by: Josh Don <joshdon@google.com> Acked-by: Hao Luo <haoluo@google.com> Acked-by: Barret Rhoden <brho@google.com>	2024-06-18 10:09:16 -10:00
Tejun Heo	304b3f2bc0	sched: Allow sched_cgroup_fork() to fail and introduce sched_cancel_fork() A new BPF extensible sched_class will need more control over the forking process. It wants to be able to fail from sched_cgroup_fork() after the new task's sched_task_group is initialized so that the loaded BPF program can prepare the task with its cgroup association is established and reject fork if e.g. allocation fails. Allow sched_cgroup_fork() to fail by making it return int instead of void and adding sched_cancel_fork() to undo sched_fork() in the error path. sched_cgroup_fork() doesn't fail yet and this patch shouldn't cause any behavior changes. v2: Patch description updated to detail the expected use. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: David Vernet <dvernet@meta.com> Acked-by: Josh Don <joshdon@google.com> Acked-by: Hao Luo <haoluo@google.com> Acked-by: Barret Rhoden <brho@google.com>	2024-06-18 10:09:16 -10:00
Tejun Heo	df268382ad	sched: Restructure sched_class order sanity checks in sched_init() Currently, sched_init() checks that the sched_class'es are in the expected order by testing each adjacency which is a bit brittle and makes it cumbersome to add optional sched_class'es. Instead, let's verify whether they're in the expected order using sched_class_above() which is what matters. Signed-off-by: Tejun Heo <tj@kernel.org> Suggested-by: Peter Zijlstra <peterz@infradead.org> Reviewed-by: David Vernet <dvernet@meta.com>	2024-06-18 10:09:16 -10:00
Tejun Heo	8cce475922	Merge branch 'bpf/for-next' into sched_ext-base	2024-06-18 10:03:20 -10:00
Andrii Nakryiko	f6afdaf72a	Merge branch 'bpf-support-resilient-split-btf' Alan Maguire says: ==================== bpf: support resilient split BTF Split BPF Type Format (BTF) provides huge advantages in that kernel modules only have to provide type information for types that they do not share with the core kernel; for core kernel types, split BTF refers to core kernel BTF type ids. So for a STRUCT sk_buff, a module that uses that structure (or a pointer to it) simply needs to refer to the core kernel type id, saving the need to define the structure and its many dependents. This cuts down on duplication and makes BTF as compact as possible. However, there is a downside. This scheme requires the references from split BTF to base BTF to be valid not just at encoding time, but at use time (when the module is loaded). Even a small change in kernel types can perturb the type ids in core kernel BTF, and - if the new reproducible BTF option is not used - pahole's parallel processing of compilation units can lead to different type ids for the same kernel if the BTF is regenerated. So we have a robustness problem for split BTF for cases where a module is not always compiled at the same time as the kernel. This problem is particularly acute for distros which generally want module builders to be able to compile a module for the lifetime of a Linux stable-based release, and have it continue to be valid over the lifetime of that release, even as changes in data structures (and hence BTF types) accrue. Today it's not possible to generate BTF for modules that works beyond the initial kernel it is compiled against - kernel bugfixes etc invalidate the split BTF references to vmlinux BTF, and BTF is no longer usable for the module. The goal of this series is to provide options to provide additional context for cases like this. That context comes in the form of distilled base BTF; it stands in for the base BTF, and contains information about the types referenced from split BTF, but not their full descriptions. The modified split BTF will refer to type ids in this .BTF.base section, and when the kernel loads such modules it will use that .BTF.base to map references from split BTF to the equivalent current vmlinux base BTF types. Once this relocation process has succeeded, the module BTF available in /sys/kernel/btf will look exactly as if it was built with the current vmlinux; references to base types will be fixed up etc. A module builder - using this series along with the pahole changes - can then build a module with distilled base BTF via an out-of-tree module build, i.e. make -C . M=path/2/module The module will have a .BTF section (the split BTF) and a .BTF.base section. The latter is small in size - distilled base BTF does not need full struct/union/enum information for named types for example. For 2667 modules built with distilled base BTF, the average size observed was 1556 bytes (stddev 1563). The overall size added to this 2667 modules was 5.3Mb. Note that for the in-tree modules, this approach is not needed as split and base BTF in the case of in-tree modules are always built and re-built together. The series first focuses on generating split BTF with distilled base BTF; then relocation support is added to allow split BTF with an associated distlled base to be relocated with a new base BTF. Next Eduard's patch allows BTF ELF parsing to work with both .BTF and .BTF.base sections; this ensures that bpftool will be able to dump BTF for a module with a .BTF.base section for example, or indeed dump relocated BTF where a module and a "-B vmlinux" is supplied. Then we add support to resolve_btfids to ignore base BTF - i.e. to avoid relocation - if a .BTF.base section is found. This ensures the .BTF.ids section is populated with ids relative to the distilled base (these will be relocated as part of module load). Finally the series supports storage of .BTF.base data/size in modules and supports sharing of relocation code with the kernel to allow relocation of module BTF. For the kernel, this relocation process happens at module load time, and we relocate split BTF references to point at types in the current vmlinux BTF. As part of this, .BTF.ids references need to be mapped also. So concretely, what happens is - we generate split BTF in the .BTF section of a module that refers to types in the .BTF.base section as base types; the latter are not full type descriptions but provide information about the base type. So a STRUCT sk_buff would be represented as a FWD struct sk_buff in distilled base BTF for example. - when the module is loaded, the split BTF is relocated with vmlinux BTF; in the case of the FWD struct sk_buff, we find the STRUCT sk_buff in vmlinux BTF and map all split BTF references to the distilled base FWD sk_buff, replacing them with references to the vmlinux BTF STRUCT sk_buff. A previous approach to this problem [1] utilized standalone BTF for such cases - where the BTF is not defined relative to base BTF so there is no relocation required. The problem with that approach is that from the verifier perspective, some types are special, and having a custom representation of a core kernel type that did not necessarily match the current representation is not tenable. So the approach taken here was to preserve the split BTF model while minimizing the representation of the context needed to relocate split and current vmlinux BTF. To generate distilled .BTF.base sections the associated dwarves patch (to be applied on the "next" branch there) is needed [3] Without it, things will still work but modules will not be built with a .BTF.base section. Changes since v5[4]: - Update search of distilled types to return the first occurrence of a string (or a string+size pair); this allows us to iterate over all matches in distilled base BTF (Andrii, patch 3) - Update to use BTF field iterators (Andrii, patches 1, 3 and 8) - Update tests to cover multiple match and associated error cases (Eduard, patch 4) - Rename elf_sections_info to btf_elf_secs, remove use of libbpf_get_error(), reset btf->owns_base when relocation succeeds (Andrii, patch 5) Changes since v4[5]: - Moved embeddedness, duplicate name checks to relocation time and record struct/union size for all distilled struct/unions instead of using forwards. This allows us to carry out type compatibility checks based on the base BTF we want to relocate with (Eduard, patches 1, 3) - Moved to using qsort() instead of qsort_r() as support for qsort_r() appears to be missing in Android libc (Andrii, patch 3) - Sorting/searching now incorporates size matching depending on BTF kind and embeddedness of struct/union (Eduard, Andrii, patch 3) - Improved naming of various types during relocation to avoid confusion (Andrii, patch 3) - Incorporated Eduard's patch (patch 5) which handles .BTF.base sections internally in btf_parse_elf(). This makes ELF parsing work with split BTF, split BTF with a distilled base, split BTF with a distilled base _and_ base BTF (by relocating) etc. Having this avoids the need for bpftool changes; it will work as-is with .BTF.base sections (Eduard, patch 4) - Updated resolve_btfids to _not_ relocate BTF for modules where a .BTF.base section is present; in that one case we do not want to relocate BTF as the .BTF.ids section should reflect ids in .BTF.base which will later be relocated on module load (Eduard, Andrii, patch 5) Changes since v3[6]: - distill now checks for duplicate-named struct/unions and records them as a sized struct/union to help identify which of the multiple base BTF structs/unions it refers to (Eduard, patch 1) - added test support for multiple name handling (Eduard, patch 2) - simplified the string mapping when updating split BTF to use base BTF instead of distilled base. Since the only string references split BTF can make to base BTF are the names of the base types, create a string map from distilled string offset -> base BTF string offset and update string offsets by visiting all strings in split BTF; this saves having to do costly searches of base BTF (Eduard, patch 7,10) - fixed bpftool manpage and indentation issues (Quentin, patch 11) Also explored Eduard's suggestion of doing an implicit fallback to checking for .BTF.base section in btf__parse() when it is called to get base BTF. However while it is doable, it turned out to be difficult operationally. Since fallback is implicit we do not know the source of the BTF - was it from .BTF or .BTF.base? In bpftool, we want to try first standalone BTF, then split, then split with distilled base. Having a way to explicitly request .BTF.base via btf__parse_opts() fits that model better. Changes since v2[7]: - submitted patch to use --btf_features in Makefile.btf for pahole v1.26 and later separately (Andrii). That has landed in bpf-next now. - distilled base now encodes ENUM64 as fwd ENUM (size 8), eliminating the need for support for ENUM64 in btf__add_fwd (patch 1, Andrii) - moved to distilling only named types, augmenting split BTF with associated reference types; this simplifies greatly the distilled base BTF and the mapping operation between distilled and base BTF when relocating (most of the series changes, Andrii) - relocation now iterates over base BTF, looking for matches based on name in distilled BTF. Distilled BTF is pre-sorted by name (Andrii, patch 8) - removed most redundant compabitiliby checks aside from struct size for base types/embedded structs and kind compatibility (since we only match on name) (Andrii, patch 8) - btf__parse_opts() now replaces btf_parse() internally in libbpf (Eduard, patch 3) Changes since RFC [8]: - updated terminology; we replace clunky "base reference" BTF with distilling base BTF into a .BTF.base section. Similarly BTF reconcilation becomes BTF relocation (Andrii, most patches) - add distilled base BTF by default for out-of-tree modules (Alexei, patch 8) - distill algorithm updated to record size of embedded struct/union by recording it as a 0-vlen STRUCT/UNION with size preserved (Andrii, patch 2) - verify size match on relocation for such STRUCT/UNIONs (Andrii, patch 9) - with embedded STRUCT/UNION recording size, we can have bpftool dump a header representation using .BTF.base + .BTF sections rather than special-casing and refusing to use "format c" for that case (patch 5) - match enum with enum64 and vice versa (Andrii, patch 9) - ensure that resolve_btfids works with BTF without .BTF.base section (patch 7) - update tests to cover embedded types, arrays and function prototypes (patches 3, 12) [1] https://lore.kernel.org/bpf/20231112124834.388735-14-alan.maguire@oracle.com/ [2] https://lore.kernel.org/bpf/20240501175035.2476830-1-alan.maguire@oracle.com/ [3] https://lore.kernel.org/bpf/20240517102714.4072080-1-alan.maguire@oracle.com/ [4] https://lore.kernel.org/bpf/20240528122408.3154936-1-alan.maguire@oracle.com/ [5] https://lore.kernel.org/bpf/20240517102246.4070184-1-alan.maguire@oracle.com/ [6] https://lore.kernel.org/bpf/20240510103052.850012-1-alan.maguire@oracle.com/ [7] https://lore.kernel.org/bpf/20240424154806.3417662-1-alan.maguire@oracle.com/ [8] https://lore.kernel.org/bpf/20240322102455.98558-1-alan.maguire@oracle.com/ ==================== Link: https://lore.kernel.org/r/20240613095014.357981-1-alan.maguire@oracle.com Signed-off-by: Andrii Nakryiko <andrii@kernel.org>	2024-06-17 14:41:45 -07:00
Alan Maguire	6ba77385f3	resolve_btfids: Handle presence of .BTF.base section Now that btf_parse_elf() handles .BTF.base section presence, we need to ensure that resolve_btfids uses .BTF.base when present rather than the vmlinux base BTF passed in via the -B option. Detect .BTF.base section presence and unset the base BTF path to ensure that BTF ELF parsing will do the right thing. Signed-off-by: Alan Maguire <alan.maguire@oracle.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Reviewed-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/bpf/20240613095014.357981-7-alan.maguire@oracle.com	2024-06-17 14:38:31 -07:00
Eduard Zingerman	c86f180ffc	libbpf: Make btf_parse_elf process .BTF.base transparently Update btf_parse_elf() to check if .BTF.base section is present. The logic is as follows: if .BTF.base section exists: distilled_base := btf_new(.BTF.base) if distilled_base: btf := btf_new(.BTF, .base_btf=distilled_base) if base_btf: btf_relocate(btf, base_btf) else: btf := btf_new(.BTF) return btf In other words: - if .BTF.base section exists, load BTF from it and use it as a base for .BTF load; - if base_btf is specified and .BTF.base section exist, relocate newly loaded .BTF against base_btf. Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Alan Maguire <alan.maguire@oracle.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20240613095014.357981-6-alan.maguire@oracle.com	2024-06-17 14:38:31 -07:00
Alan Maguire	affdeb5061	selftests/bpf: Extend distilled BTF tests to cover BTF relocation Ensure relocated BTF looks as expected; in this case identical to original split BTF, with a few duplicate anonymous types added to split BTF by the relocation process. Also add relocation tests for edge cases like missing type in base BTF and multiple types of the same name. Signed-off-by: Alan Maguire <alan.maguire@oracle.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/bpf/20240613095014.357981-5-alan.maguire@oracle.com	2024-06-17 14:38:31 -07:00
Alan Maguire	19e00c897d	libbpf: Split BTF relocation Map distilled base BTF type ids referenced in split BTF and their references to the base BTF passed in, and if the mapping succeeds, reparent the split BTF to the base BTF. Relocation is done by first verifying that distilled base BTF only consists of named INT, FLOAT, ENUM, FWD, STRUCT and UNION kinds; then we sort these to speed lookups. Once sorted, the base BTF is iterated, and for each relevant kind we check for an equivalent in distilled base BTF. When found, the mapping from distilled -> base BTF id and string offset is recorded. In establishing mappings, we need to ensure we check STRUCT/UNION size when the STRUCT/UNION is embedded in a split BTF STRUCT/UNION, and when duplicate names exist for the same STRUCT/UNION. Otherwise size is ignored in matching STRUCT/UNIONs. Once all mappings are established, we can update type ids and string offsets in split BTF and reparent it to the new base. Signed-off-by: Alan Maguire <alan.maguire@oracle.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/bpf/20240613095014.357981-4-alan.maguire@oracle.com	2024-06-17 14:38:31 -07:00
Alan Maguire	eb20e727c4	selftests/bpf: Test distilled base, split BTF generation Test generation of split+distilled base BTF, ensuring that - named base BTF STRUCTs and UNIONs are represented as 0-vlen sized STRUCT/UNIONs - named ENUM[64]s are represented as 0-vlen named ENUM[64]s - anonymous struct/unions are represented in full in split BTF - anonymous enums are represented in full in split BTF - types unreferenced from split BTF are not present in distilled base BTF Also test that with vmlinux BTF and split BTF based upon it, we only represent needed base types referenced from split BTF in distilled base. Signed-off-by: Alan Maguire <alan.maguire@oracle.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/bpf/20240613095014.357981-3-alan.maguire@oracle.com	2024-06-17 14:38:31 -07:00
Alan Maguire	58e185a0dc	libbpf: Add btf__distill_base() creating split BTF with distilled base BTF To support more robust split BTF, adding supplemental context for the base BTF type ids that split BTF refers to is required. Without such references, a simple shuffling of base BTF type ids (without any other significant change) invalidates the split BTF. Here the attempt is made to store additional context to make split BTF more robust. This context comes in the form of distilled base BTF providing minimal information (name and - in some cases - size) for base INTs, FLOATs, STRUCTs, UNIONs, ENUMs and ENUM64s along with modified split BTF that points at that base and contains any additional types needed (such as TYPEDEF, PTR and anonymous STRUCT/UNION declarations). This information constitutes the minimal BTF representation needed to disambiguate or remove split BTF references to base BTF. The rules are as follows: - INT, FLOAT, FWD are recorded in full. - if a named base BTF STRUCT or UNION is referred to from split BTF, it will be encoded as a zero-member sized STRUCT/UNION (preserving size for later relocation checks). Only base BTF STRUCT/UNIONs that are either embedded in split BTF STRUCT/UNIONs or that have multiple STRUCT/UNION instances of the same name will _need_ size checks at relocation time, but as it is possible a different set of types will be duplicates in the later to-be-resolved base BTF, we preserve size information for all named STRUCT/UNIONs. - if an ENUM[64] is named, a ENUM forward representation (an ENUM with no values) of the same size is used. - in all other cases, the type is added to the new split BTF. Avoiding struct/union/enum/enum64 expansion is important to keep the distilled base BTF representation to a minimum size. When successful, new representations of the distilled base BTF and new split BTF that refers to it are returned. Both need to be freed by the caller. So to take a simple example, with split BTF with a type referring to "struct sk_buff", we will generate distilled base BTF with a 0-member STRUCT sk_buff of the appropriate size, and the split BTF will refer to it instead. Tools like pahole can utilize such split BTF to populate the .BTF section (split BTF) and an additional .BTF.base section. Then when the split BTF is loaded, the distilled base BTF can be used to relocate split BTF to reference the current (and possibly changed) base BTF. So for example if "struct sk_buff" was id 502 when the split BTF was originally generated, we can use the distilled base BTF to see that id 502 refers to a "struct sk_buff" and replace instances of id 502 with the current (relocated) base BTF sk_buff type id. Distilled base BTF is small; when building a kernel with all modules using distilled base BTF as a test, overall module size grew by only 5.3Mb total across ~2700 modules. Signed-off-by: Alan Maguire <alan.maguire@oracle.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/bpf/20240613095014.357981-2-alan.maguire@oracle.com	2024-06-17 14:38:31 -07:00
Alexei Starovoitov	dedf56d775	selftests/bpf: Add tests for add_const Improve arena based tests and add several C and asm tests with specific pattern. These tests would have failed without add_const verifier support. Also add several loop_inside_iter*() tests that are not related to add_const, but nice to have. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20240613013815.953-5-alexei.starovoitov@gmail.com	2024-06-14 21:52:40 +02:00
Alexei Starovoitov	6870bdb3f4	bpf: Support can_loop/cond_break on big endian Add big endian support for can_loop/cond_break macros. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/bpf/20240613013815.953-4-alexei.starovoitov@gmail.com	2024-06-14 21:52:40 +02:00
Alexei Starovoitov	98d7ca374b	bpf: Track delta between "linked" registers. Compilers can generate the code r1 = r2 r1 += 0x1 if r2 < 1000 goto ... use knowledge of r2 range in subsequent r1 operations So remember constant delta between r2 and r1 and update r1 after 'if' condition. Unfortunately LLVM still uses this pattern for loops with 'can_loop' construct: for (i = 0; i < 1000 && can_loop; i++) The "undo" pass was introduced in LLVM https://reviews.llvm.org/D121937 to prevent this optimization, but it cannot cover all cases. Instead of fighting middle end optimizer in BPF backend teach the verifier about this pattern. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/bpf/20240613013815.953-3-alexei.starovoitov@gmail.com	2024-06-14 21:52:39 +02:00
Alexei Starovoitov	124e8c2b1b	bpf: Relax tuple len requirement for sk helpers. __bpf_skc_lookup() safely handles incorrect values of tuple len, hence we can allow zero to be passed as tuple len. This patch alone doesn't make an observable verifier difference. It's a trivial improvement that might simplify bpf programs. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/bpf/20240613013815.953-2-alexei.starovoitov@gmail.com	2024-06-14 21:52:39 +02:00
Alexei Starovoitov	cdbde084d1	Merge branch 'bpf-make-trusted-args-nullable' Vadim Fedorenko says: ==================== bpf: make trusted args nullable Current verifier checks for the arg to be nullable after checking for certain pointer types. It prevents programs to pass NULL to kfunc args even if they are marked as nullable. This patchset adjusts verifier and changes bpf crypto kfuncs to allow null for IV parameter which is optional for some ciphers. Benchmark shows ~4% improvements when there is no need to initialise 0-sized dynptr. v3: - add special selftest for nullable parameters v2: - adjust kdoc accordingly ==================== Link: https://lore.kernel.org/r/20240613211817.1551967-1-vadfed@meta.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2024-06-13 16:33:05 -07:00
Vadim Fedorenko	2d45ab1eda	selftests: bpf: add testmod kfunc for nullable params Add special test to be sure that only __nullable BTF params can be replaced by NULL. This patch adds fake kfuncs in bpf_testmod to properly test different params. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Vadim Fedorenko <vadfed@meta.com> Link: https://lore.kernel.org/r/20240613211817.1551967-6-vadfed@meta.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2024-06-13 16:33:04 -07:00
Vadim Fedorenko	9b560751f7	selftests: bpf: crypto: adjust bench to use nullable IV The bench shows some improvements, around 4% faster on decrypt. Before: Benchmark 'crypto-decrypt' started. Iter 0 (325.719us): hits 5.105M/s ( 5.105M/prod), drops 0.000M/s, total operations 5.105M/s Iter 1 (-17.295us): hits 5.224M/s ( 5.224M/prod), drops 0.000M/s, total operations 5.224M/s Iter 2 ( 5.504us): hits 4.630M/s ( 4.630M/prod), drops 0.000M/s, total operations 4.630M/s Iter 3 ( 9.239us): hits 5.148M/s ( 5.148M/prod), drops 0.000M/s, total operations 5.148M/s Iter 4 ( 37.885us): hits 5.198M/s ( 5.198M/prod), drops 0.000M/s, total operations 5.198M/s Iter 5 (-53.282us): hits 5.167M/s ( 5.167M/prod), drops 0.000M/s, total operations 5.167M/s Iter 6 (-17.809us): hits 5.186M/s ( 5.186M/prod), drops 0.000M/s, total operations 5.186M/s Summary: hits 5.092 ± 0.228M/s ( 5.092M/prod), drops 0.000 ±0.000M/s, total operations 5.092 ± 0.228M/s After: Benchmark 'crypto-decrypt' started. Iter 0 (268.912us): hits 5.312M/s ( 5.312M/prod), drops 0.000M/s, total operations 5.312M/s Iter 1 (124.869us): hits 5.354M/s ( 5.354M/prod), drops 0.000M/s, total operations 5.354M/s Iter 2 (-36.801us): hits 5.334M/s ( 5.334M/prod), drops 0.000M/s, total operations 5.334M/s Iter 3 (254.628us): hits 5.334M/s ( 5.334M/prod), drops 0.000M/s, total operations 5.334M/s Iter 4 (-77.691us): hits 5.275M/s ( 5.275M/prod), drops 0.000M/s, total operations 5.275M/s Iter 5 (-164.510us): hits 5.313M/s ( 5.313M/prod), drops 0.000M/s, total operations 5.313M/s Iter 6 (-81.376us): hits 5.346M/s ( 5.346M/prod), drops 0.000M/s, total operations 5.346M/s Summary: hits 5.326 ± 0.029M/s ( 5.326M/prod), drops 0.000 ±0.000M/s, total operations 5.326 ± 0.029M/s Reviewed-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Vadim Fedorenko <vadfed@meta.com> Link: https://lore.kernel.org/r/20240613211817.1551967-5-vadfed@meta.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2024-06-13 16:33:04 -07:00
Vadim Fedorenko	9363dc8ddc	selftests: bpf: crypto: use NULL instead of 0-sized dynptr Adjust selftests to use nullable option for state and IV arg. Reviewed-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Vadim Fedorenko <vadfed@meta.com> Link: https://lore.kernel.org/r/20240613211817.1551967-4-vadfed@meta.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2024-06-13 16:33:04 -07:00
Vadim Fedorenko	65d6d61d25	bpf: crypto: make state and IV dynptr nullable Some ciphers do not require state and IV buffer, but with current implementation 0-sized dynptr is always needed. With adjustment to verifier we can provide NULL instead of 0-sized dynptr. Make crypto kfuncs ready for this. Reviewed-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Vadim Fedorenko <vadfed@meta.com> Link: https://lore.kernel.org/r/20240613211817.1551967-3-vadfed@meta.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2024-06-13 16:33:04 -07:00
Vadim Fedorenko	a90797993a	bpf: verifier: make kfuncs args nullalble Some arguments to kfuncs might be NULL in some cases. But currently it's not possible to pass NULL to any BTF structures because the check for the suffix is located after all type checks. Move it to earlier place to allow nullable args. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Vadim Fedorenko <vadfed@meta.com> Link: https://lore.kernel.org/r/20240613211817.1551967-2-vadfed@meta.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2024-06-13 16:33:04 -07:00
Alexei Starovoitov	373a4e13ab	Merge branch 'fixes-for-kfunc-prototype-generation' Daniel Xu says: ==================== Fixes for kfunc prototype generation This patchset fixes new warnings and errors that kfunc prototype generation caused. ==================== Link: https://lore.kernel.org/r/cover.1718295425.git.dxu@dxuuu.xyz Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2024-06-13 11:18:43 -07:00
Daniel Xu	6a82601477	bpf: selftests: Do not use generated kfunc prototypes for arena progs When selftests are built with a new enough clang, the arena selftests opt-in to use LLVM address_space attribute annotations for arena pointers. These annotations are not emitted by kfunc prototype generation. This causes compilation errors when clang sees conflicting prototypes. Fix by opting arena selftests out of using generated kfunc prototypes. Fixes: `770abbb5a2` ("bpftool: Support dumping kfunc prototypes from BTF") Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/r/202406131810.c1B8hTm8-lkp@intel.com/ Signed-off-by: Daniel Xu <dxu@dxuuu.xyz> Link: https://lore.kernel.org/r/fc59a617439ceea9ad8dfbb4786843c2169496ae.1718295425.git.dxu@dxuuu.xyz Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2024-06-13 11:18:43 -07:00
Daniel Xu	78746f93e9	bpf: Fix bpf_dynptr documentation comments The function argument names were changed but the doc comment was not. Fix htmldocs build warning by updating doc comments. Fixes: `cce4c40b96` ("bpf: treewide: Align kfunc signatures to prog point-of-view") Signed-off-by: Daniel Xu <dxu@dxuuu.xyz> Link: https://lore.kernel.org/r/d0b0eb05f91e12e5795966153b11998d3fc1d433.1718295425.git.dxu@dxuuu.xyz Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2024-06-13 11:18:43 -07:00
Vadim Fedorenko	041c1dc988	selftests/bpf: Validate CHECKSUM_COMPLETE option Adjust skb program test to run with checksum validation. Signed-off-by: Vadim Fedorenko <vadfed@meta.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20240606145851.229116-2-vadfed@meta.com	2024-06-13 14:29:53 +02:00
Vadim Fedorenko	a3cfe84cca	bpf: Add CHECKSUM_COMPLETE to bpf test progs Add special flag to validate that TC BPF program properly updates checksum information in skb. Signed-off-by: Vadim Fedorenko <vadfed@meta.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20240606145851.229116-1-vadfed@meta.com	2024-06-13 14:29:47 +02:00
Alexei Starovoitov	4ff5747158	Merge branch 'bpf-support-dumping-kfunc-prototypes-from-btf' Daniel Xu says: ==================== bpf: Support dumping kfunc prototypes from BTF This patchset enables both detecting as well as dumping compilable prototypes for kfuncs. The first commit instructs pahole to DECL_TAG kfuncs when available. This requires v1.27 which was released on 6/11/24. With it, users will be able to look at BTF inside vmlinux (or modules) and check if the kfunc they want is available. The final commit teaches bpftool how to dump kfunc prototypes. This is done for developer convenience. The rest of the commits are fixups to enable selftests to use the newly dumped kfunc prototypes. With these, selftests will regularly exercise the newly added codepaths. Tested with and without the required pahole changes: * https://github.com/kernel-patches/bpf/pull/7186 * https://github.com/kernel-patches/bpf/pull/7187 === Changelog === From v4: * Change bpf_session_cookie() return type * Only fixup used fentry test kfunc prototypes * Extract out projection detection into shared btf_is_projection_of() * Fix kernel test robot build warnings about doc comments From v3: * Teach selftests to use dumped prototypes From v2: * Update Makefile.btf with pahole flag * More error checking * Output formatting changes * Drop already-merged commit From v1: * Add __weak annotation * Use btf_dump for kfunc prototypes * Update kernel bpf_rdonly_cast() signature ==================== Link: https://lore.kernel.org/r/cover.1718207789.git.dxu@dxuuu.xyz Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2024-06-12 11:01:32 -07:00
Daniel Xu	770abbb5a2	bpftool: Support dumping kfunc prototypes from BTF This patch enables dumping kfunc prototypes from bpftool. This is useful b/c with this patch, end users will no longer have to manually define kfunc prototypes. For the kernel tree, this also means we can optionally drop kfunc prototypes from: tools/testing/selftests/bpf/bpf_kfuncs.h tools/testing/selftests/bpf/bpf_experimental.h Example usage: $ make PAHOLE=/home/dxu/dev/pahole/build/pahole -j30 vmlinux $ ./tools/bpf/bpftool/bpftool btf dump file ./vmlinux format c \| rg "__ksym;" \| head -3 extern void cgroup_rstat_updated(struct cgroup cgrp, int cpu) __weak __ksym; extern void cgroup_rstat_flush(struct cgroup cgrp) __weak __ksym; extern struct bpf_key *bpf_lookup_user_key(u32 serial, u64 flags) __weak __ksym; Signed-off-by: Daniel Xu <dxu@dxuuu.xyz> Link: https://lore.kernel.org/r/bf6c08f9263c4bd9d10a717de95199d766a13f61.1718207789.git.dxu@dxuuu.xyz Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2024-06-12 11:01:32 -07:00
Daniel Xu	c567cba345	bpf: selftests: xfrm: Opt out of using generated kfunc prototypes The xfrm_info selftest locally defines an aliased type such that folks with CONFIG_XFRM_INTERFACE=m/n configs can still build the selftests. See commit `aa67961f32` ("selftests/bpf: Allow building bpf tests with CONFIG_XFRM_INTERFACE=[m\|n]"). Thus, it is simpler if this selftest opts out of using enerated kfunc prototypes. The preprocessor macro this commit uses will be introduced in the final commit. Signed-off-by: Daniel Xu <dxu@dxuuu.xyz> Link: https://lore.kernel.org/r/afe0bb1c50487f52542cdd5230c4aef9e36ce250.1718207789.git.dxu@dxuuu.xyz Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2024-06-12 11:01:31 -07:00
Daniel Xu	f709124dd7	bpf: selftests: nf: Opt out of using generated kfunc prototypes The bpf-nf selftests play various games with aliased types such that folks with CONFIG_NF_CONNTRACK=m/n configs can still build the selftests. See commits: `1058b6a78d` ("selftests/bpf: Do not fail build if CONFIG_NF_CONNTRACK=m/n") `92afc5329a` ("selftests/bpf: Fix build errors if CONFIG_NF_CONNTRACK=m") Thus, it is simpler if these selftests opt out of using generated kfunc prototypes. The preprocessor macro this commit uses will be introduced in the final commit. Signed-off-by: Daniel Xu <dxu@dxuuu.xyz> Link: https://lore.kernel.org/r/044a5b10cb3abd0d71cb1c818ee0bfc4a2239332.1718207789.git.dxu@dxuuu.xyz Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2024-06-12 11:01:31 -07:00
Daniel Xu	cce4c40b96	bpf: treewide: Align kfunc signatures to prog point-of-view Previously, kfunc declarations in bpf_kfuncs.h (and others) used "user facing" types for kfuncs prototypes while the actual kfunc definitions used "kernel facing" types. More specifically: bpf_dynptr vs bpf_dynptr_kern, __sk_buff vs sk_buff, and xdp_md vs xdp_buff. It wasn't an issue before, as the verifier allows aliased types. However, since we are now generating kfunc prototypes in vmlinux.h (in addition to keeping bpf_kfuncs.h around), this conflict creates compilation errors. Fix this conflict by using "user facing" types in kfunc definitions. This results in more casts, but otherwise has no additional runtime cost. Note, similar to `5b268d1ebc` ("bpf: Have bpf_rdonly_cast() take a const pointer"), we also make kfuncs take const arguments where appropriate in order to make the kfunc more permissive. Signed-off-by: Daniel Xu <dxu@dxuuu.xyz> Link: https://lore.kernel.org/r/b58346a63a0e66bc9b7504da751b526b0b189a67.1718207789.git.dxu@dxuuu.xyz Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2024-06-12 11:01:31 -07:00
Daniel Xu	ec209ad863	bpf: verifier: Relax caller requirements for kfunc projection type args Currently, if a kfunc accepts a projection type as an argument (eg struct __sk_buff *), the caller must exactly provide exactly the same type with provable provenance. However in practice, kfuncs that accept projection types _must_ cast to the underlying type before use b/c projection type layouts are completely made up. Thus, it is ok to relax the verifier rules around implicit conversions. We will use this functionality in the next commit when we align kfuncs to user-facing types. Signed-off-by: Daniel Xu <dxu@dxuuu.xyz> Link: https://lore.kernel.org/r/e2c025cb09ccfd4af1ec9e18284dc3cecff7514d.1718207789.git.dxu@dxuuu.xyz Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2024-06-12 11:01:31 -07:00
Daniel Xu	0ce089cbdc	bpf: selftests: Namespace struct_opt callbacks in bpf_dctcp With generated kfunc prototypes, the existing callback names will conflict. Fix by namespacing with a bpf_ prefix. Signed-off-by: Daniel Xu <dxu@dxuuu.xyz> Link: https://lore.kernel.org/r/efe7aadad8a054e5aeeba94b1d2e4502eee09d7a.1718207789.git.dxu@dxuuu.xyz Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2024-06-12 11:01:31 -07:00
Daniel Xu	2b8dd87332	bpf: Make bpf_session_cookie() kfunc return long * We will soon be generating kfunc prototypes from BTF. As part of that, we need to align the manual signatures in bpf_kfuncs.h with the actual kfunc definitions. There is currently a conflicting signature for bpf_session_cookie() w.r.t. return type. The original intent was to return long * and not __u64 *. You can see evidence of that intent in `a3a5113393` ("selftests/bpf: Add kprobe session cookie test"). Fix conflict by changing kfunc definition. Fixes: `5c919acef8` ("bpf: Add support for kprobe session cookie") Signed-off-by: Daniel Xu <dxu@dxuuu.xyz> Link: https://lore.kernel.org/r/7043e1c251ab33151d6e3830f8ea1902ed2604ac.1718207789.git.dxu@dxuuu.xyz Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2024-06-12 11:01:31 -07:00
Daniel Xu	ac42f636dc	bpf: selftests: Fix bpf_map_sum_elem_count() kfunc prototype The prototype in progs/map_percpu_stats.c is not in line with how the actual kfuncs are defined in kernel/bpf/map_iter.c. This causes compilation errors when kfunc prototypes are generated from BTF. Fix by aligning with actual kfunc definitions. Signed-off-by: Daniel Xu <dxu@dxuuu.xyz> Link: https://lore.kernel.org/r/0497e11a71472dcb71ada7c90ad691523ae87c3b.1718207789.git.dxu@dxuuu.xyz Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2024-06-12 11:01:31 -07:00
Daniel Xu	89f0b1abac	bpf: selftests: Fix bpf_cpumask_first_zero() kfunc prototype The prototype in progs/nested_trust_common.h is not in line with how the actual kfuncs are defined in kernel/bpf/cpumask.c. This causes compilation errors when kfunc prototypes are generated from BTF. Fix by aligning with actual kfunc definitions. Signed-off-by: Daniel Xu <dxu@dxuuu.xyz> Link: https://lore.kernel.org/r/437936a4e554b02e04566dd6e3f0a5d08370cc8c.1718207789.git.dxu@dxuuu.xyz Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2024-06-12 11:01:31 -07:00
Daniel Xu	dff96e4f50	bpf: selftests: Fix fentry test kfunc prototypes Some prototypes in progs/get_func_ip_test.c were not in line with how the actual kfuncs are defined in net/bpf/test_run.c. This causes compilation errors when kfunc prototypes are generated from BTF. Fix by aligning with actual kfunc definitions. Also remove two unused prototypes. Signed-off-by: Daniel Xu <dxu@dxuuu.xyz> Link: https://lore.kernel.org/r/1e68870e7626b7b9c6420e65076b307fc404a2f0.1718207789.git.dxu@dxuuu.xyz Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2024-06-12 11:01:30 -07:00
Daniel Xu	718135f5bd	bpf: selftests: Fix bpf_iter_task_vma_new() prototype bpf_iter_task_vma_new() is defined as taking a u64 as its 3rd argument. u64 is a unsigned long long. bpf_experimental.h was defining the prototype as unsigned long. Fix by using __u64. Signed-off-by: Daniel Xu <dxu@dxuuu.xyz> Link: https://lore.kernel.org/r/fab4509bfee914f539166a91c3ff41e949f3df30.1718207789.git.dxu@dxuuu.xyz Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2024-06-12 11:01:30 -07:00
Daniel Xu	ebb79e96f1	kbuild: bpf: Tell pahole to DECL_TAG kfuncs With [0], pahole can now discover kfuncs and inject DECL_TAG into BTF. With this commit, we will start shipping said DECL_TAGs to downstream consumers if pahole supports it. This is useful for feature probing kfuncs as well as generating compilable prototypes. This is particularly important as kfuncs do not have stable ABI. [0]: https://git.kernel.org/pub/scm/devel/pahole/pahole.git/commit/?id=72e88f29c6f7e14201756e65bd66157427a61aaf Signed-off-by: Daniel Xu <dxu@dxuuu.xyz> Link: https://lore.kernel.org/r/324aac5c627bddb80d9968c30df6382846994cc8.1718207789.git.dxu@dxuuu.xyz Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2024-06-12 11:01:30 -07:00
Kenta Tada	98b303c9bf	bpftool: Query only cgroup-related attach types When CONFIG_NETKIT=y, bpftool-cgroup shows error even if the cgroup's path is correct: $ bpftool cgroup tree /sys/fs/cgroup CgroupPath ID AttachType AttachFlags Name Error: can't query bpf programs attached to /sys/fs/cgroup: No such device or address >From strace and kernel tracing, I found netkit returned ENXIO and this command failed. I think this AttachType(BPF_NETKIT_PRIMARY) is not relevant to cgroup. bpftool-cgroup should query just only cgroup-related attach types. v2->v3: - removed an unnecessary check v1->v2: - used an array of cgroup attach types Signed-off-by: Kenta Tada <tadakentaso@gmail.com> Reviewed-by: Quentin Monnet <qmo@kernel.org> Link: https://lore.kernel.org/r/20240607111704.6716-1-tadakentaso@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2024-06-11 11:39:09 -07:00
Jakub Kicinski	bb678f0180	Merge branch 'intel-wired-lan-driver-updates-2024-06-03' Jacob Keller says: ==================== Intel Wired LAN Driver Updates 2024-06-03 This series includes miscellaneous improvements for the ice as well as a cleanup to the Makefiles for all Intel net drivers. Andy fixes all of the Intel net driver Makefiles to use the documented '-y' syntax for specifying object files to link into kernel driver modules, rather than the '-objs' syntax which works but is documented as reserved for user-space host programs. Jacob has a cleanup to refactor rounding logic in the ice driver into a common roundup_u64 helper function. Michal Schmidt replaces irq_set_affinity_hint() to use irq_update_affinity_hint() which behaves better with user-applied affinity settings. v2: https://lore.kernel.org/r/20240605-next-2024-06-03-intel-next-batch-v2-0-39c23963fa78@intel.com v1: https://lore.kernel.org/r/20240603-next-2024-06-03-intel-next-batch-v1-0-e0523b28f325@intel.com ==================== Link: https://lore.kernel.org/r/20240607-next-2024-06-03-intel-next-batch-v3-0-d1470cee3347@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-06-10 19:52:50 -07:00
Michal Schmidt	dee55767dc	ice: use irq_update_affinity_hint() irq_set_affinity_hint() is deprecated. Use irq_update_affinity_hint() instead. This removes the side-effect of actually applying the affinity. The driver does not really need to worry about spreading its IRQs across CPUs. The core code already takes care of that. On the contrary, when the driver applies affinities by itself, it breaks the users' expectations: 1. The user configures irqbalance with IRQBALANCE_BANNED_CPULIST in order to prevent IRQs from being moved to certain CPUs that run a real-time workload. 2. ice reconfigures VSIs at runtime due to a MIB change (ice_dcb_process_lldp_set_mib_change). Reopening a VSI resets the affinity in ice_vsi_req_irq_msix(). 3. ice has no idea about irqbalance's config, so it may move an IRQ to a banned CPU. The real-time workload suffers unacceptable latency. I am not sure if updating the affinity hints is at all useful, because irqbalance ignores them since 2016 ([1]), but at least it's harmless. This ice change is similar to i40e commit `d34c54d173` ("i40e: Use irq_update_affinity_hint()"). [1] https://github.com/Irqbalance/irqbalance/commit/dcc411e7bfdd Signed-off-by: Michal Schmidt <mschmidt@redhat.com> Reviewed-by: Sunil Goutham <sgoutham@marvell.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://lore.kernel.org/r/20240607-next-2024-06-03-intel-next-batch-v3-3-d1470cee3347@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-06-10 19:52:45 -07:00
Jacob Keller	1d4ce389da	ice: add and use roundup_u64 instead of open coding equivalent In ice_ptp_cfg_clkout(), the ice driver needs to calculate the nearest next second of a current time value specified in nanoseconds. It implements this using div64_u64, because the time value is a u64. It could use div_u64 since NSEC_PER_SEC is smaller than 32-bits. Ideally this would be implemented directly with roundup(), but that can't work on all platforms due to a division which requires using the specific macros and functions due to platform restrictions, and to ensure that the most appropriate and fast instructions are used. The kernel doesn't currently provide any 64-bit equivalents for doing roundup. Attempting to use roundup() on a 32-bit platform will result in a link failure due to not having a direct 64-bit division. The closest equivalent for this is DIV64_U64_ROUND_UP, which does a division always rounding up. However, this only computes the division, and forces use of the div64_u64 in cases where the divisor is a 32bit value and could make use of div_u64. Introduce DIV_U64_ROUND_UP based on div_u64, and then use it to implement roundup_u64 which takes a u64 input value and a u32 rounding value. The name roundup_u64 matches the naming scheme of div_u64, and future patches could implement roundup64_u64 if they need to round by a multiple that is greater than 32-bits. Replace the logic in ice_ptp.c which does this equivalent with the newly added roundup_u64. Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://lore.kernel.org/r/20240607-next-2024-06-03-intel-next-batch-v3-2-d1470cee3347@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-06-10 19:52:44 -07:00
Andy Shevchenko	a2fe35df41	net: intel: Use -y instead of -objs in Makefile -objs suffix is reserved rather for (user-space) host programs while usually -y suffix is used for kernel drivers (although -objs works for that purpose for now). Let's correct the old usages of -objs in Makefiles. Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://lore.kernel.org/r/20240607-next-2024-06-03-intel-next-batch-v3-1-d1470cee3347@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-06-10 19:52:44 -07:00

1 2 3 4 5 ...

1280468 Commits