Commit Graph

156854 Commits

Author SHA1 Message Date
Tejun Heo
f0e1a0643a sched_ext: Implement BPF extensible scheduler class
Implement a new scheduler class sched_ext (SCX), which allows scheduling
policies to be implemented as BPF programs to achieve the following:

1. Ease of experimentation and exploration: Enabling rapid iteration of new
   scheduling policies.

2. Customization: Building application-specific schedulers which implement
   policies that are not applicable to general-purpose schedulers.

3. Rapid scheduler deployments: Non-disruptive swap outs of scheduling
   policies in production environments.

sched_ext leverages BPF’s struct_ops feature to define a structure which
exports function callbacks and flags to BPF programs that wish to implement
scheduling policies. The struct_ops structure exported by sched_ext is
struct sched_ext_ops, and is conceptually similar to struct sched_class. The
role of sched_ext is to map the complex sched_class callbacks to the more
simple and ergonomic struct sched_ext_ops callbacks.

For more detailed discussion on the motivations and overview, please refer
to the cover letter.

Later patches will also add several example schedulers and documentation.

This patch implements the minimum core framework to enable implementation of
BPF schedulers. Subsequent patches will gradually add functionalities
including safety guarantee mechanisms, nohz and cgroup support.

include/linux/sched/ext.h defines struct sched_ext_ops. With the comment on
top, each operation should be self-explanatory. The followings are worth
noting:

- Both "sched_ext" and its shorthand "scx" are used. If the identifier
  already has "sched" in it, "ext" is used; otherwise, "scx".

- In sched_ext_ops, only .name is mandatory. Every operation is optional and
  if omitted a simple but functional default behavior is provided.

- A new policy constant SCHED_EXT is added and a task can select sched_ext
  by invoking sched_setscheduler(2) with the new policy constant. However,
  if the BPF scheduler is not loaded, SCHED_EXT is the same as SCHED_NORMAL
  and the task is scheduled by CFS. When the BPF scheduler is loaded, all
  tasks which have the SCHED_EXT policy are switched to sched_ext.

- To bridge the workflow imbalance between the scheduler core and
  sched_ext_ops callbacks, sched_ext uses simple FIFOs called dispatch
  queues (dsq's). By default, there is one global dsq (SCX_DSQ_GLOBAL), and
  one local per-CPU dsq (SCX_DSQ_LOCAL). SCX_DSQ_GLOBAL is provided for
  convenience and need not be used by a scheduler that doesn't require it.
  SCX_DSQ_LOCAL is the per-CPU FIFO that sched_ext pulls from when putting
  the next task on the CPU. The BPF scheduler can manage an arbitrary number
  of dsq's using scx_bpf_create_dsq() and scx_bpf_destroy_dsq().

- sched_ext guarantees system integrity no matter what the BPF scheduler
  does. To enable this, each task's ownership is tracked through
  p->scx.ops_state and all tasks are put on scx_tasks list. The disable path
  can always recover and revert all tasks back to CFS. See p->scx.ops_state
  and scx_tasks.

- A task is not tied to its rq while enqueued. This decouples CPU selection
  from queueing and allows sharing a scheduling queue across an arbitrary
  subset of CPUs. This adds some complexities as a task may need to be
  bounced between rq's right before it starts executing. See
  dispatch_to_local_dsq() and move_task_to_local_dsq().

- One complication that arises from the above weak association between task
  and rq is that synchronizing with dequeue() gets complicated as dequeue()
  may happen anytime while the task is enqueued and the dispatch path might
  need to release the rq lock to transfer the task. Solving this requires a
  bit of complexity. See the logic around p->scx.sticky_cpu and
  p->scx.ops_qseq.

- Both enable and disable paths are a bit complicated. The enable path
  switches all tasks without blocking to avoid issues which can arise from
  partially switched states (e.g. the switching task itself being starved).
  The disable path can't trust the BPF scheduler at all, so it also has to
  guarantee forward progress without blocking. See scx_ops_enable() and
  scx_ops_disable_workfn().

- When sched_ext is disabled, static_branches are used to shut down the
  entry points from hot paths.

v7: - scx_ops_bypass() was incorrectly and unnecessarily trying to grab
      scx_ops_enable_mutex which can lead to deadlocks in the disable path.
      Fixed.

    - Fixed TASK_DEAD handling bug in scx_ops_enable() path which could lead
      to use-after-free.

    - Consolidated per-cpu variable usages and other cleanups.

v6: - SCX_NR_ONLINE_OPS replaced with SCX_OPI_*_BEGIN/END so that multiple
      groups can be expressed. Later CPU hotplug operations are put into
      their own group.

    - SCX_OPS_DISABLING state is replaced with the new bypass mechanism
      which allows temporarily putting the system into simple FIFO
      scheduling mode bypassing the BPF scheduler. In addition to the shut
      down path, this will also be used to isolate the BPF scheduler across
      PM events. Enabling and disabling the bypass mode requires iterating
      all runnable tasks. rq->scx.runnable_list addition is moved from the
      later watchdog patch.

    - ops.prep_enable() is replaced with ops.init_task() and
      ops.enable/disable() are now called whenever the task enters and
      leaves sched_ext instead of when the task becomes schedulable on
      sched_ext and stops being so. A new operation - ops.exit_task() - is
      called when the task stops being schedulable on sched_ext.

    - scx_bpf_dispatch() can now be called from ops.select_cpu() too. This
      removes the need for communicating local dispatch decision made by
      ops.select_cpu() to ops.enqueue() via per-task storage.
      SCX_KF_SELECT_CPU is added to support the change.

    - SCX_TASK_ENQ_LOCAL which told the BPF scheudler that
      scx_select_cpu_dfl() wants the task to be dispatched to the local DSQ
      was removed. Instead, scx_bpf_select_cpu_dfl() now dispatches directly
      if it finds a suitable idle CPU. If such behavior is not desired,
      users can use scx_bpf_select_cpu_dfl() which returns the verdict in a
      bool out param.

    - scx_select_cpu_dfl() was mishandling WAKE_SYNC and could end up
      queueing many tasks on a local DSQ which makes tasks to execute in
      order while other CPUs stay idle which made some hackbench numbers
      really bad. Fixed.

    - The current state of sched_ext can now be monitored through files
      under /sys/sched_ext instead of /sys/kernel/debug/sched/ext. This is
      to enable monitoring on kernels which don't enable debugfs.

    - sched_ext wasn't telling BPF that ops.dispatch()'s @prev argument may
      be NULL and a BPF scheduler which derefs the pointer without checking
      could crash the kernel. Tell BPF. This is currently a bit ugly. A
      better way to annotate this is expected in the future.

    - scx_exit_info updated to carry pointers to message buffers instead of
      embedding them directly. This decouples buffer sizes from API so that
      they can be changed without breaking compatibility.

    - exit_code added to scx_exit_info. This is used to indicate different
      exit conditions on non-error exits and will be used to handle e.g. CPU
      hotplugs.

    - The patch "sched_ext: Allow BPF schedulers to switch all eligible
      tasks into sched_ext" is folded in and the interface is changed so
      that partial switching is indicated with a new ops flag
      %SCX_OPS_SWITCH_PARTIAL. This makes scx_bpf_switch_all() unnecessasry
      and in turn SCX_KF_INIT. ops.init() is now called with
      SCX_KF_SLEEPABLE.

    - Code reorganized so that only the parts necessary to integrate with
      the rest of the kernel are in the header files.

    - Changes to reflect the BPF and other kernel changes including the
      addition of bpf_sched_ext_ops.cfi_stubs.

v5: - To accommodate 32bit configs, p->scx.ops_state is now atomic_long_t
      instead of atomic64_t and scx_dsp_buf_ent.qseq which uses
      load_acquire/store_release is now unsigned long instead of u64.

    - Fix the bug where bpf_scx_btf_struct_access() was allowing write
      access to arbitrary fields.

    - Distinguish kfuncs which can be called from any sched_ext ops and from
      anywhere. e.g. scx_bpf_pick_idle_cpu() can now be called only from
      sched_ext ops.

    - Rename "type" to "kind" in scx_exit_info to make it easier to use on
      languages in which "type" is a reserved keyword.

    - Since cff9b2332a ("kernel/sched: Modify initial boot task idle
      setup"), PF_IDLE is not set on idle tasks which haven't been online
      yet which made scx_task_iter_next_filtered() include those idle tasks
      in iterations leading to oopses. Update scx_task_iter_next_filtered()
      to directly test p->sched_class against idle_sched_class instead of
      using is_idle_task() which tests PF_IDLE.

    - Other updates to match upstream changes such as adding const to
      set_cpumask() param and renaming check_preempt_curr() to
      wakeup_preempt().

v4: - SCHED_CHANGE_BLOCK replaced with the previous
      sched_deq_and_put_task()/sched_enq_and_set_tsak() pair. This is
      because upstream is adaopting a different generic cleanup mechanism.
      Once that lands, the code will be adapted accordingly.

    - task_on_scx() used to test whether a task should be switched into SCX,
      which is confusing. Renamed to task_should_scx(). task_on_scx() now
      tests whether a task is currently on SCX.

    - scx_has_idle_cpus is barely used anymore and replaced with direct
      check on the idle cpumask.

    - SCX_PICK_IDLE_CORE added and scx_pick_idle_cpu() improved to prefer
      fully idle cores.

    - ops.enable() now sees up-to-date p->scx.weight value.

    - ttwu_queue path is disabled for tasks on SCX to avoid confusing BPF
      schedulers expecting ->select_cpu() call.

    - Use cpu_smt_mask() instead of topology_sibling_cpumask() like the rest
      of the scheduler.

v3: - ops.set_weight() added to allow BPF schedulers to track weight changes
      without polling p->scx.weight.

    - move_task_to_local_dsq() was losing SCX-specific enq_flags when
      enqueueing the task on the target dsq because it goes through
      activate_task() which loses the upper 32bit of the flags. Carry the
      flags through rq->scx.extra_enq_flags.

    - scx_bpf_dispatch(), scx_bpf_pick_idle_cpu(), scx_bpf_task_running()
      and scx_bpf_task_cpu() now use the new KF_RCU instead of
      KF_TRUSTED_ARGS to make it easier for BPF schedulers to call them.

    - The kfunc helper access control mechanism implemented through
      sched_ext_entity.kf_mask is improved. Now SCX_CALL_OP*() is always
      used when invoking scx_ops operations.

v2: - balance_scx_on_up() is dropped. Instead, on UP, balance_scx() is
      called from put_prev_taks_scx() and pick_next_task_scx() as necessary.
      To determine whether balance_scx() should be called from
      put_prev_task_scx(), SCX_TASK_DEQD_FOR_SLEEP flag is added. See the
      comment in put_prev_task_scx() for details.

    - sched_deq_and_put_task() / sched_enq_and_set_task() sequences replaced
      with SCHED_CHANGE_BLOCK().

    - Unused all_dsqs list removed. This was a left-over from previous
      iterations.

    - p->scx.kf_mask is added to track and enforce which kfunc helpers are
      allowed. Also, init/exit sequences are updated to make some kfuncs
      always safe to call regardless of the current BPF scheduler state.
      Combined, this should make all the kfuncs safe.

    - BPF now supports sleepable struct_ops operations. Hacky workaround
      removed and operations and kfunc helpers are tagged appropriately.

    - BPF now supports bitmask / cpumask helpers. scx_bpf_get_idle_cpumask()
      and friends are added so that BPF schedulers can use the idle masks
      with the generic helpers. This replaces the hacky kfunc helpers added
      by a separate patch in V1.

    - CONFIG_SCHED_CLASS_EXT can no longer be enabled if SCHED_CORE is
      enabled. This restriction will be removed by a later patch which adds
      core-sched support.

    - Add MAINTAINERS entries and other misc changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Co-authored-by: David Vernet <dvernet@meta.com>
Acked-by: Josh Don <joshdon@google.com>
Acked-by: Hao Luo <haoluo@google.com>
Acked-by: Barret Rhoden <brho@google.com>
Cc: Andrea Righi <andrea.righi@canonical.com>
2024-06-18 10:09:17 -10:00
Tejun Heo
a7a9fc5492 sched_ext: Add boilerplate for extensible scheduler class
This adds dummy implementations of sched_ext interfaces which interact with
the scheduler core and hook them in the correct places. As they're all
dummies, this doesn't cause any behavior changes. This is split out to help
reviewing.

v2: balance_scx_on_up() dropped. This will be handled in sched_ext proper.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: David Vernet <dvernet@meta.com>
Acked-by: Josh Don <joshdon@google.com>
Acked-by: Hao Luo <haoluo@google.com>
Acked-by: Barret Rhoden <brho@google.com>
2024-06-18 10:09:17 -10:00
Tejun Heo
4f9c7ca851 sched: Factor out cgroup weight conversion functions
Factor out sched_weight_from/to_cgroup() which convert between scheduler
shares and cgroup weight. No functional change. The factored out functions
will be used by a new BPF extensible sched_class so that the weights can be
exposed to the BPF programs in a way which is consistent cgroup weights and
easier to interpret.

The weight conversions will be used regardless of cgroup usage. It's just
borrowing the cgroup weight range as it's more intuitive.
CGROUP_WEIGHT_MIN/DFL/MAX constants are moved outside CONFIG_CGROUPS so that
the conversion helpers can always be defined.

v2: The helpers are now defined regardless of COFNIG_CGROUPS.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: David Vernet <dvernet@meta.com>
Acked-by: Josh Don <joshdon@google.com>
Acked-by: Hao Luo <haoluo@google.com>
Acked-by: Barret Rhoden <brho@google.com>
2024-06-18 10:09:16 -10:00
Tejun Heo
304b3f2bc0 sched: Allow sched_cgroup_fork() to fail and introduce sched_cancel_fork()
A new BPF extensible sched_class will need more control over the forking
process. It wants to be able to fail from sched_cgroup_fork() after the new
task's sched_task_group is initialized so that the loaded BPF program can
prepare the task with its cgroup association is established and reject fork
if e.g. allocation fails.

Allow sched_cgroup_fork() to fail by making it return int instead of void
and adding sched_cancel_fork() to undo sched_fork() in the error path.

sched_cgroup_fork() doesn't fail yet and this patch shouldn't cause any
behavior changes.

v2: Patch description updated to detail the expected use.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: David Vernet <dvernet@meta.com>
Acked-by: Josh Don <joshdon@google.com>
Acked-by: Hao Luo <haoluo@google.com>
Acked-by: Barret Rhoden <brho@google.com>
2024-06-18 10:09:16 -10:00
Tejun Heo
8cce475922 Merge branch 'bpf/for-next' into sched_ext-base 2024-06-18 10:03:20 -10:00
Alexei Starovoitov
98d7ca374b bpf: Track delta between "linked" registers.
Compilers can generate the code
  r1 = r2
  r1 += 0x1
  if r2 < 1000 goto ...
  use knowledge of r2 range in subsequent r1 operations

So remember constant delta between r2 and r1 and update r1 after 'if' condition.

Unfortunately LLVM still uses this pattern for loops with 'can_loop' construct:
for (i = 0; i < 1000 && can_loop; i++)

The "undo" pass was introduced in LLVM
https://reviews.llvm.org/D121937
to prevent this optimization, but it cannot cover all cases.
Instead of fighting middle end optimizer in BPF backend teach the verifier
about this pattern.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/20240613013815.953-3-alexei.starovoitov@gmail.com
2024-06-14 21:52:39 +02:00
Vadim Fedorenko
a3cfe84cca bpf: Add CHECKSUM_COMPLETE to bpf test progs
Add special flag to validate that TC BPF program properly updates
checksum information in skb.

Signed-off-by: Vadim Fedorenko <vadfed@meta.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20240606145851.229116-1-vadfed@meta.com
2024-06-13 14:29:47 +02:00
Daniel Xu
cce4c40b96 bpf: treewide: Align kfunc signatures to prog point-of-view
Previously, kfunc declarations in bpf_kfuncs.h (and others) used "user
facing" types for kfuncs prototypes while the actual kfunc definitions
used "kernel facing" types. More specifically: bpf_dynptr vs
bpf_dynptr_kern, __sk_buff vs sk_buff, and xdp_md vs xdp_buff.

It wasn't an issue before, as the verifier allows aliased types.
However, since we are now generating kfunc prototypes in vmlinux.h (in
addition to keeping bpf_kfuncs.h around), this conflict creates
compilation errors.

Fix this conflict by using "user facing" types in kfunc definitions.
This results in more casts, but otherwise has no additional runtime
cost.

Note, similar to 5b268d1ebc ("bpf: Have bpf_rdonly_cast() take a const
pointer"), we also make kfuncs take const arguments where appropriate in
order to make the kfunc more permissive.

Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Link: https://lore.kernel.org/r/b58346a63a0e66bc9b7504da751b526b0b189a67.1718207789.git.dxu@dxuuu.xyz
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-06-12 11:01:31 -07:00
Daniel Xu
ec209ad863 bpf: verifier: Relax caller requirements for kfunc projection type args
Currently, if a kfunc accepts a projection type as an argument (eg
struct __sk_buff *), the caller must exactly provide exactly the same
type with provable provenance.

However in practice, kfuncs that accept projection types _must_ cast to
the underlying type before use b/c projection type layouts are
completely made up. Thus, it is ok to relax the verifier rules around
implicit conversions.

We will use this functionality in the next commit when we align kfuncs
to user-facing types.

Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Link: https://lore.kernel.org/r/e2c025cb09ccfd4af1ec9e18284dc3cecff7514d.1718207789.git.dxu@dxuuu.xyz
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-06-12 11:01:31 -07:00
Jacob Keller
1d4ce389da ice: add and use roundup_u64 instead of open coding equivalent
In ice_ptp_cfg_clkout(), the ice driver needs to calculate the nearest next
second of a current time value specified in nanoseconds. It implements this
using div64_u64, because the time value is a u64. It could use div_u64
since NSEC_PER_SEC is smaller than 32-bits.

Ideally this would be implemented directly with roundup(), but that can't
work on all platforms due to a division which requires using the specific
macros and functions due to platform restrictions, and to ensure that the
most appropriate and fast instructions are used.

The kernel doesn't currently provide any 64-bit equivalents for doing
roundup. Attempting to use roundup() on a 32-bit platform will result in a
link failure due to not having a direct 64-bit division.

The closest equivalent for this is DIV64_U64_ROUND_UP, which does a
division always rounding up. However, this only computes the division, and
forces use of the div64_u64 in cases where the divisor is a 32bit value and
could make use of div_u64.

Introduce DIV_U64_ROUND_UP based on div_u64, and then use it to implement
roundup_u64 which takes a u64 input value and a u32 rounding value.

The name roundup_u64 matches the naming scheme of div_u64, and future
patches could implement roundup64_u64 if they need to round by a multiple
that is greater than 32-bits.

Replace the logic in ice_ptp.c which does this equivalent with the newly
added roundup_u64.

Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://lore.kernel.org/r/20240607-next-2024-06-03-intel-next-batch-v3-2-d1470cee3347@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-10 19:52:44 -07:00
Jakub Kicinski
b1156532bc Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2024-06-06

We've added 54 non-merge commits during the last 10 day(s) which contain
a total of 50 files changed, 1887 insertions(+), 527 deletions(-).

The main changes are:

1) Add a user space notification mechanism via epoll when a struct_ops
   object is getting detached/unregistered, from Kui-Feng Lee.

2) Big batch of BPF selftest refactoring for sockmap and BPF congctl
   tests, from Geliang Tang.

3) Add BTF field (type and string fields, right now) iterator support
   to libbpf instead of using existing callback-based approaches,
   from Andrii Nakryiko.

4) Extend BPF selftests for the latter with a new btf_field_iter
   selftest, from Alan Maguire.

5) Add new kfuncs for a generic, open-coded bits iterator,
   from Yafang Shao.

6) Fix BPF selftests' kallsyms_find() helper under kernels configured
   with CONFIG_LTO_CLANG_THIN, from Yonghong Song.

7) Remove a bunch of unused structs in BPF selftests,
   from David Alan Gilbert.

8) Convert test_sockmap section names into names understood by libbpf
   so it can deduce program type and attach type, from Jakub Sitnicki.

9) Extend libbpf with the ability to configure log verbosity
   via LIBBPF_LOG_LEVEL environment variable, from Mykyta Yatsenko.

10) Fix BPF selftests with regards to bpf_cookie and find_vma flakiness
    in nested VMs, from Song Liu.

11) Extend riscv32/64 JITs to introduce shift/add helpers to generate Zba
    optimization, from Xiao Wang.

12) Enable BPF programs to declare arrays and struct fields with kptr,
    bpf_rb_root, and bpf_list_head, from Kui-Feng Lee.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (54 commits)
  selftests/bpf: Drop useless arguments of do_test in bpf_tcp_ca
  selftests/bpf: Use start_test in test_dctcp in bpf_tcp_ca
  selftests/bpf: Use start_test in test_dctcp_fallback in bpf_tcp_ca
  selftests/bpf: Add start_test helper in bpf_tcp_ca
  selftests/bpf: Use connect_to_fd_opts in do_test in bpf_tcp_ca
  libbpf: Auto-attach struct_ops BPF maps in BPF skeleton
  selftests/bpf: Add btf_field_iter selftests
  selftests/bpf: Fix send_signal test with nested CONFIG_PARAVIRT
  libbpf: Remove callback-based type/string BTF field visitor helpers
  bpftool: Use BTF field iterator in btfgen
  libbpf: Make use of BTF field iterator in BTF handling code
  libbpf: Make use of BTF field iterator in BPF linker code
  libbpf: Add BTF field iterator
  selftests/bpf: Ignore .llvm.<hash> suffix in kallsyms_find()
  selftests/bpf: Fix bpf_cookie and find_vma in nested VM
  selftests/bpf: Test global bpf_list_head arrays.
  selftests/bpf: Test global bpf_rb_root arrays and fields in nested struct types.
  selftests/bpf: Test kptr arrays and kptrs in nested struct fields.
  bpf: limit the number of levels of a nested struct type.
  bpf: look into the types of the fields of a struct type recursively.
  ...
====================

Link: https://lore.kernel.org/r/20240606223146.23020-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-10 18:02:14 -07:00
Jakub Kicinski
93d4e8bb3f Merge tag 'wireless-next-2024-06-07' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next
Kalle Valo says:

====================
wireless-next patches for v6.11

The first "new features" pull request for v6.11 with changes both in
stack and in drivers. Nothing out of ordinary, except that we have
two conflicts this time:

net/mac80211/cfg.c
  https://lore.kernel.org/all/20240531124415.05b25e7a@canb.auug.org.au

drivers/net/wireless/microchip/wilc1000/netdev.c
  https://lore.kernel.org/all/20240603110023.23572803@canb.auug.org.au

Major changes:

cfg80211/mac80211
 * parse Transmit Power Envelope (TPE) data in mac80211 instead of in drivers

wilc1000
 * read MAC address during probe to make it visible to user space

iwlwifi
 * bump FW API to 91 for BZ/SC devices
 * report 64-bit radiotap timestamp
 * enable P2P low latency by default
 * handle Transmit Power Envelope (TPE) advertised by AP
 * start using guard()

rtlwifi
 * RTL8192DU support

ath12k
 * remove unsupported tx monitor handling
 * channel 2 in 6 GHz band support
 * Spatial Multiplexing Power Save (SMPS) in 6 GHz band support
 * multiple BSSID (MBSSID) and Enhanced Multi-BSSID Advertisements (EMA)
   support
 * dynamic VLAN support
 * add panic handler for resetting the firmware state

ath10k
 * add qcom,no-msa-ready-indicator Device Tree property
 * LED support for various chipsets

* tag 'wireless-next-2024-06-07' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (194 commits)
  wifi: ath12k: add hw_link_id in ath12k_pdev
  wifi: ath12k: add panic handler
  wifi: rtw89: chan: Use swap() in rtw89_swap_sub_entity()
  wifi: brcm80211: remove unused structs
  wifi: brcm80211: use sizeof(*pointer) instead of sizeof(type)
  wifi: ath12k: do not process consecutive RDDM event
  dt-bindings: net: wireless: ath11k: Drop "qcom,ipq8074-wcss-pil" from example
  wifi: ath12k: fix memory leak in ath12k_dp_rx_peer_frag_setup()
  wifi: rtlwifi: handle return value of usb init TX/RX
  wifi: rtlwifi: Enable the new rtl8192du driver
  wifi: rtlwifi: Add rtl8192du/sw.c
  wifi: rtlwifi: Constify rtl_hal_cfg.{ops,usb_interface_cfg} and rtl_priv.cfg
  wifi: rtlwifi: Add rtl8192du/dm.{c,h}
  wifi: rtlwifi: Add rtl8192du/fw.{c,h} and rtl8192du/led.{c,h}
  wifi: rtlwifi: Add rtl8192du/rf.{c,h}
  wifi: rtlwifi: Add rtl8192du/trx.{c,h}
  wifi: rtlwifi: Add rtl8192du/phy.{c,h}
  wifi: rtlwifi: Add rtl8192du/hw.{c,h}
  wifi: rtlwifi: Add new members to struct rtl_priv for RTL8192DU
  wifi: rtlwifi: Add rtl8192du/table.{c,h}
  ...

Signed-off-by: Jakub Kicinski <kuba@kernel.org>

====================

Link: https://lore.kernel.org/r/20240607093517.41394C2BBFC@smtp.kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-10 17:40:26 -07:00
Jakub Kicinski
5fbf57a937 net: netlink: remove the cb_mutex "injection" from netlink core
Back in 2007, in commit af65bdfce9 ("[NETLINK]: Switch cb_lock spinlock
to mutex and allow to override it") netlink core was extended to allow
subsystems to replace the dump mutex lock with its own lock.

The mechanism was used by rtnetlink to take rtnl_lock but it isn't
sufficiently flexible for other users. Over the 17 years since
it was added no other user appeared. Since rtnetlink needs conditional
locking now, and doesn't use it either, axe this feature complete.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-10 13:15:40 +01:00
Florian Westphal
f81d0dd2fd tcp: move inet_twsk_schedule helper out of header
Its no longer used outside inet_timewait_sock.c, so move it there.

Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-10 11:54:18 +01:00
Valentin Schneider
b334b924c9 net: tcp/dccp: prepare for tw_timer un-pinning
The TCP timewait timer is proving to be problematic for setups where
scheduler CPU isolation is achieved at runtime via cpusets (as opposed to
statically via isolcpus=domains).

What happens there is a CPU goes through tcp_time_wait(), arming the
time_wait timer, then gets isolated. TCP_TIMEWAIT_LEN later, the timer
fires, causing interference for the now-isolated CPU. This is conceptually
similar to the issue described in commit e02b931248 ("workqueue: Unbind
kworkers before sending them to exit()")

Move inet_twsk_schedule() to within inet_twsk_hashdance(), with the ehash
lock held. Expand the lock's critical section from inet_twsk_kill() to
inet_twsk_deschedule_put(), serializing the scheduling vs descheduling of
the timer. IOW, this prevents the following race:

			     tcp_time_wait()
			       inet_twsk_hashdance()
  inet_twsk_deschedule_put()
    del_timer_sync()
			       inet_twsk_schedule()

Thanks to Paolo Abeni for suggesting to leverage the ehash lock.

This also restores a comment from commit ec94c2696f ("tcp/dccp: avoid
one atomic operation for timewait hashdance") as inet_twsk_hashdance() had
a "Step 1" and "Step 3" comment, but the "Step 2" had gone missing.

inet_twsk_deschedule_put() now acquires the ehash spinlock to synchronize
with inet_twsk_hashdance_schedule().

To ease possible regression search, actual un-pin is done in next patch.

Link: https://lore.kernel.org/all/ZPhpfMjSiHVjQkTk@localhost.localdomain/
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Co-developed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-10 11:54:18 +01:00
Ido Schimmel
97d833ceb2 mlxsw: spectrum_acl_erp: Fix object nesting warning
ACLs in Spectrum-2 and newer ASICs can reside in the algorithmic TCAM
(A-TCAM) or in the ordinary circuit TCAM (C-TCAM). The former can
contain more ACLs (i.e., tc filters), but the number of masks in each
region (i.e., tc chain) is limited.

In order to mitigate the effects of the above limitation, the device
allows filters to share a single mask if their masks only differ in up
to 8 consecutive bits. For example, dst_ip/25 can be represented using
dst_ip/24 with a delta of 1 bit. The C-TCAM does not have a limit on the
number of masks being used (and therefore does not support mask
aggregation), but can contain a limited number of filters.

The driver uses the "objagg" library to perform the mask aggregation by
passing it objects that consist of the filter's mask and whether the
filter is to be inserted into the A-TCAM or the C-TCAM since filters in
different TCAMs cannot share a mask.

The set of created objects is dependent on the insertion order of the
filters and is not necessarily optimal. Therefore, the driver will
periodically ask the library to compute a more optimal set ("hints") by
looking at all the existing objects.

When the library asks the driver whether two objects can be aggregated
the driver only compares the provided masks and ignores the A-TCAM /
C-TCAM indication. This is the right thing to do since the goal is to
move as many filters as possible to the A-TCAM. The driver also forbids
two identical masks from being aggregated since this can only happen if
one was intentionally put in the C-TCAM to avoid a conflict in the
A-TCAM.

The above can result in the following set of hints:

H1: {mask X, A-TCAM} -> H2: {mask Y, A-TCAM} // X is Y + delta
H3: {mask Y, C-TCAM} -> H4: {mask Z, A-TCAM} // Y is Z + delta

After getting the hints from the library the driver will start migrating
filters from one region to another while consulting the computed hints
and instructing the device to perform a lookup in both regions during
the transition.

Assuming a filter with mask X is being migrated into the A-TCAM in the
new region, the hints lookup will return H1. Since H2 is the parent of
H1, the library will try to find the object associated with it and
create it if necessary in which case another hints lookup (recursive)
will be performed. This hints lookup for {mask Y, A-TCAM} will either
return H2 or H3 since the driver passes the library an object comparison
function that ignores the A-TCAM / C-TCAM indication.

This can eventually lead to nested objects which are not supported by
the library [1].

Fix by removing the object comparison function from both the driver and
the library as the driver was the only user. That way the lookup will
only return exact matches.

I do not have a reliable reproducer that can reproduce the issue in a
timely manner, but before the fix the issue would reproduce in several
minutes and with the fix it does not reproduce in over an hour.

Note that the current usefulness of the hints is limited because they
include the C-TCAM indication and represent aggregation that cannot
actually happen. This will be addressed in net-next.

[1]
WARNING: CPU: 0 PID: 153 at lib/objagg.c:170 objagg_obj_parent_assign+0xb5/0xd0
Modules linked in:
CPU: 0 PID: 153 Comm: kworker/0:18 Not tainted 6.9.0-rc6-custom-g70fbc2c1c38b #42
Hardware name: Mellanox Technologies Ltd. MSN3700C/VMOD0008, BIOS 5.11 10/10/2018
Workqueue: mlxsw_core mlxsw_sp_acl_tcam_vregion_rehash_work
RIP: 0010:objagg_obj_parent_assign+0xb5/0xd0
[...]
Call Trace:
 <TASK>
 __objagg_obj_get+0x2bb/0x580
 objagg_obj_get+0xe/0x80
 mlxsw_sp_acl_erp_mask_get+0xb5/0xf0
 mlxsw_sp_acl_atcam_entry_add+0xe8/0x3c0
 mlxsw_sp_acl_tcam_entry_create+0x5e/0xa0
 mlxsw_sp_acl_tcam_vchunk_migrate_one+0x16b/0x270
 mlxsw_sp_acl_tcam_vregion_rehash_work+0xbe/0x510
 process_one_work+0x151/0x370

Fixes: 9069a3817d ("lib: objagg: implement optimization hints assembly and use hints for object creation")
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Amit Cohen <amcohen@nvidia.com>
Tested-by: Alexander Zubkov <green@qrator.net>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-10 11:14:52 +01:00
Jakub Kicinski
62b5bf58b9 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

No conflicts.

Adjacent changes:

drivers/net/ethernet/pensando/ionic/ionic_txrx.c
  d9c0420999 ("ionic: Mark error paths in the data path as unlikely")
  491aee894a ("ionic: fix kernel panic in XDP_TX action")

net/ipv6/ip6_fib.c
  b4cb4a1391 ("net: use unrcu_pointer() helper")
  b01e1c0307 ("ipv6: fix possible race in __fib6_drop_pcpu_from()")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-06 12:06:56 -07:00
Linus Torvalds
d30d0e49da Merge tag 'net-6.10-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
 "Including fixes from BPF and big collection of fixes for WiFi core and
  drivers.

  Current release - regressions:

   - vxlan: fix regression when dropping packets due to invalid src
     addresses

   - bpf: fix a potential use-after-free in bpf_link_free()

   - xdp: revert support for redirect to any xsk socket bound to the
     same UMEM as it can result in a corruption

   - virtio_net:
      - add missing lock protection when reading return code from
        control_buf
      - fix false-positive lockdep splat in DIM
      - Revert "wifi: wilc1000: convert list management to RCU"

   - wifi: ath11k: fix error path in ath11k_pcic_ext_irq_config

  Previous releases - regressions:

   - rtnetlink: make the "split" NLM_DONE handling generic, restore the
     old behavior for two cases where we started coalescing those
     messages with normal messages, breaking sloppily-coded userspace

   - wifi:
      - cfg80211: validate HE operation element parsing
      - cfg80211: fix 6 GHz scan request building
      - mt76: mt7615: add missing chanctx ops
      - ath11k: move power type check to ASSOC stage, fix connecting to
        6 GHz AP
      - ath11k: fix WCN6750 firmware crash caused by 17 num_vdevs
      - rtlwifi: ignore IEEE80211_CONF_CHANGE_RETRY_LIMITS
      - iwlwifi: mvm: fix a crash on 7265

  Previous releases - always broken:

   - ncsi: prevent multi-threaded channel probing, a spec violation

   - vmxnet3: disable rx data ring on dma allocation failure

   - ethtool: init tsinfo stats if requested, prevent unintentionally
     reporting all-zero stats on devices which don't implement any

   - dst_cache: fix possible races in less common IPv6 features

   - tcp: auth: don't consider TCP_CLOSE to be in TCP_AO_ESTABLISHED

   - ax25: fix two refcounting bugs

   - eth: ionic: fix kernel panic in XDP_TX action

  Misc:

   - tcp: count CLOSE-WAIT sockets for TCP_MIB_CURRESTAB"

* tag 'net-6.10-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (107 commits)
  selftests: net: lib: set 'i' as local
  selftests: net: lib: avoid error removing empty netns name
  selftests: net: lib: support errexit with busywait
  net: ethtool: fix the error condition in ethtool_get_phy_stats_ethtool()
  ipv6: fix possible race in __fib6_drop_pcpu_from()
  af_unix: Annotate data-race of sk->sk_shutdown in sk_diag_fill().
  af_unix: Use skb_queue_len_lockless() in sk_diag_show_rqlen().
  af_unix: Use skb_queue_empty_lockless() in unix_release_sock().
  af_unix: Use unix_recvq_full_lockless() in unix_stream_connect().
  af_unix: Annotate data-race of net->unx.sysctl_max_dgram_qlen.
  af_unix: Annotate data-races around sk->sk_sndbuf.
  af_unix: Annotate data-races around sk->sk_state in UNIX_DIAG.
  af_unix: Annotate data-race of sk->sk_state in unix_stream_read_skb().
  af_unix: Annotate data-races around sk->sk_state in sendmsg() and recvmsg().
  af_unix: Annotate data-race of sk->sk_state in unix_accept().
  af_unix: Annotate data-race of sk->sk_state in unix_stream_connect().
  af_unix: Annotate data-races around sk->sk_state in unix_write_space() and poll().
  af_unix: Annotate data-race of sk->sk_state in unix_inq_len().
  af_unix: Annodate data-races around sk->sk_state for writers.
  af_unix: Set sk->sk_state under unix_state_lock() for truly disconencted peer.
  ...
2024-06-06 09:55:27 -07:00
Eric Dumazet
6971d21672 tcp: move reqsk_alloc() to inet_connection_sock.c
reqsk_alloc() has a single caller, no need to expose it
in include/net/request_sock.h.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-06-06 15:18:04 +02:00
Eric Dumazet
c34506406d tcp: small changes in reqsk_put() and reqsk_free()
In reqsk_free(), use DEBUG_NET_WARN_ON_ONCE()
instead of WARN_ON_ONCE() for a condition which never fired.

In reqsk_put() directly call __reqsk_free(), there is no
point checking rsk_refcnt again right after a transition to zero.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-06-06 15:18:04 +02:00
Eric Dumazet
b4cb4a1391 net: use unrcu_pointer() helper
Toke mentioned unrcu_pointer() existence, allowing
to remove some of the ugly casts we have when using
xchg() for rcu protected pointers.

Also make inet_rcv_compat const.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Toke Høiland-Jørgensen <toke@redhat.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/r/20240604111603.45871-1-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-06-06 11:52:52 +02:00
Yoray Zack
99be56171f net/mlx5e: SHAMPO, Re-enable HW-GRO
Add back HW-GRO to the reported features.

As the current implementation of HW-GRO uses KSMs with a
specific fixed buffer size (256B) to map its headers buffer,
we reported the feature only if the NIC is supporting KSM and
the minimum value for buffer size is below the requested one.

iperf3 bandwidth comparison:
+---------+--------+--------+-----------+
| streams | SW GRO | HW GRO | Unit      |
|---------+--------+--------+-----------|
| 1       | 36     | 42     | Gbits/sec |
| 4       | 34     | 39     | Gbits/sec |
| 8       | 31     | 35     | Gbits/sec |
+---------+--------+--------+-----------+

A downstream patch will add skb fragment coalescing which will improve
performance considerably.

Benchmark details:
VM based setup
CPU: Intel(R) Xeon(R) Platinum 8380 CPU, 24 cores
NIC: ConnectX-7 100GbE
iperf3 and irq running on same CPU over a single receive queue

Signed-off-by: Yoray Zack <yorayz@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/20240603212219.1037656-14-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-05 20:20:46 -07:00
Yoray Zack
758191c9ea net/mlx5e: SHAMPO, Use KSMs instead of KLMs
KSM Mkey is KLM Mkey with a fixed buffer size. Due to this fact,
it is a faster mechanism than KLM.

SHAMPO feature used KLMs Mkeys for memory mappings of its headers buffer.
As it used KLMs with the same buffer size for each entry,
we can use KSMs instead.

This commit changes the Mkeys that map the SHAMPO headers buffer
from KLMs to KSMs.

Signed-off-by: Yoray Zack <yorayz@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/20240603212219.1037656-13-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-05 20:20:46 -07:00
Linus Torvalds
553352597d Merge tag 'acpi-6.10-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull ACPI fixes from Rafael Wysocki:
 "These fix the ACPI EC and AC drivers, the ACPI APEI error injection
  driver and build issues related to the dev_is_pnp() macro referring to
  pnp_bus_type that is not exported to modules.

  Specifics:

   - Fix error handling during EC operation region accesses in the ACPI
     EC driver (Armin Wolf)

   - Fix a memory leak in the APEI error injection driver introduced
     during its converion to a platform driver (Dan Williams)

   - Fix build failures related to the dev_is_pnp() macro by redefining
     it as a proper function and exporting it to modules as appropriate
     and unexport pnp_bus_type which need not be exported any more (Andy
     Shevchenko)

   - Update the ACPI AC driver to use power_supply_changed() to let the
     power supply core handle configuration changes properly (Thomas
     Weißschuh)"

* tag 'acpi-6.10-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  ACPI: AC: Properly notify powermanagement core about changes
  PNP: Hide pnp_bus_type from the non-PNP code
  PNP: Make dev_is_pnp() to be a function and export it for modules
  ACPI: EC: Avoid returning AE_OK on errors in address space handler
  ACPI: EC: Abort address space access upon error
  ACPI: APEI: EINJ: Fix einj_dev release leak
2024-06-05 15:19:15 -07:00
Linus Torvalds
64c6a36d79 Merge tag 'pm-6.10-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fixes from Rafael Wysocki:
 "These fix the intel_pstate and amd-pstate cpufreq drivers and the
  cpupower utility.

  Specifics:

   - Fix a recently introduced unchecked HWP MSR access in the
     intel_pstate driver (Srinivas Pandruvada)

   - Add missing conversion from MHz to KHz to amd_pstate_set_boost() to
     address sysfs inteface inconsistency and fix P-state frequency
     reporting on AMD Family 1Ah CPUs in the cpupower utility (Dhananjay
     Ugwekar)

   - Get rid of an excess global header file used by the amd-pstate
     cpufreq driver (Arnd Bergmann)"

* tag 'pm-6.10-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  cpufreq: intel_pstate: Fix unchecked HWP MSR access
  cpufreq: amd-pstate: Fix the inconsistency in max frequency units
  cpufreq: amd-pstate: remove global header file
  tools/power/cpupower: Fix Pstate frequency reporting on AMD Family 1Ah CPUs
2024-06-05 15:12:35 -07:00
Sean Christopherson
c793a62823 sched/core: Drop spinlocks on contention iff kernel is preemptible
Use preempt_model_preemptible() to detect a preemptible kernel when
deciding whether or not to reschedule in order to drop a contended
spinlock or rwlock.  Because PREEMPT_DYNAMIC selects PREEMPTION, kernels
built with PREEMPT_DYNAMIC=y will yield contended locks even if the live
preemption model is "none" or "voluntary".  In short, make kernels with
dynamically selected models behave the same as kernels with statically
selected models.

Somewhat counter-intuitively, NOT yielding a lock can provide better
latency for the relevant tasks/processes.  E.g. KVM x86's mmu_lock, a
rwlock, is often contended between an invalidation event (takes mmu_lock
for write) and a vCPU servicing a guest page fault (takes mmu_lock for
read).  For _some_ setups, letting the invalidation task complete even
if there is mmu_lock contention provides lower latency for *all* tasks,
i.e. the invalidation completes sooner *and* the vCPU services the guest
page fault sooner.

But even KVM's mmu_lock behavior isn't uniform, e.g. the "best" behavior
can vary depending on the host VMM, the guest workload, the number of
vCPUs, the number of pCPUs in the host, why there is lock contention, etc.

In other words, simply deleting the CONFIG_PREEMPTION guard (or doing the
opposite and removing contention yielding entirely) needs to come with a
big pile of data proving that changing the status quo is a net positive.

Opportunistically document this side effect of preempt=full, as yielding
contended spinlocks can have significant, user-visible impact.

Fixes: c597bfddc9 ("sched: Provide Kconfig support for default dynamic preempt mode")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ankur Arora <ankur.a.arora@oracle.com>
Reviewed-by: Chen Yu <yu.c.chen@intel.com>
Link: https://lore.kernel.org/kvm/ef81ff36-64bb-4cfe-ae9b-e3acf47bff24@proxmox.com
2024-06-05 16:52:36 +02:00
Sean Christopherson
f0dc887f21 sched/core: Move preempt_model_*() helpers from sched.h to preempt.h
Move the declarations and inlined implementations of the preempt_model_*()
helpers to preempt.h so that they can be referenced in spinlock.h without
creating a potential circular dependency between spinlock.h and sched.h.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ankur Arora <ankur.a.arora@oracle.com>
Link: https://lkml.kernel.org/r/20240528003521.979836-2-ankur.a.arora@oracle.com
2024-06-05 16:52:36 +02:00
Kevin Yang
f086edef71 tcp: add sysctl_tcp_rto_min_us
Adding a sysctl knob to allow user to specify a default
rto_min at socket init time, other than using the hard
coded 200ms default rto_min.

Note that the rto_min route option has the highest precedence
for configuring this setting, followed by the TCP_BPF_RTO_MIN
socket option, followed by the tcp_rto_min_us sysctl.

Signed-off-by: Kevin Yang <yyd@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Tony Lu <tonylu@linux.alibaba.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-05 13:42:54 +01:00
Jakub Kicinski
5b4b62a169 rtnetlink: make the "split" NLM_DONE handling generic
Jaroslav reports Dell's OMSA Systems Management Data Engine
expects NLM_DONE in a separate recvmsg(), both for rtnl_dump_ifinfo()
and inet_dump_ifaddr(). We already added a similar fix previously in
commit 460b0d33cf ("inet: bring NLM_DONE out to a separate recv() again")

Instead of modifying all the dump handlers, and making them look
different than modern for_each_netdev_dump()-based dump handlers -
put the workaround in rtnetlink code. This will also help us move
the custom rtnl-locking from af_netlink in the future (in net-next).

Note that this change is not touching rtnl_dump_all(). rtnl_dump_all()
is different kettle of fish and a potential problem. We now mix families
in a single recvmsg(), but NLM_DONE is not coalesced.

Tested:

  ./cli.py --dbg-small-recv 4096 --spec netlink/specs/rt_addr.yaml \
           --dump getaddr --json '{"ifa-family": 2}'

  ./cli.py --dbg-small-recv 4096 --spec netlink/specs/rt_route.yaml \
           --dump getroute --json '{"rtm-family": 2}'

  ./cli.py --dbg-small-recv 4096 --spec netlink/specs/rt_link.yaml \
           --dump getlink

Fixes: 3e41af9076 ("rtnetlink: use xarray iterator to implement rtnl_dump_ifinfo()")
Fixes: cdb2f80f1c ("inet: use xa_array iterator to implement inet_dump_ifaddr()")
Reported-by: Jaroslav Pulchart <jaroslav.pulchart@gooddata.com>
Link: https://lore.kernel.org/all/CAK8fFZ7MKoFSEzMBDAOjoUt+vTZRRQgLDNXEOfdCCXSoXXKE0g@mail.gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-05 12:34:54 +01:00
Christophe JAILLET
82dc29b973 devlink: Constify the 'table_ops' parameter of devl_dpipe_table_register()
"struct devlink_dpipe_table_ops" only contains some function pointers.

Update "struct devlink_dpipe_table" and the 'table_ops' parameter of
devl_dpipe_table_register() so that structures in drivers can be
constified.

Constifying these structures will move some data to a read-only section, so
increase overall security.

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-05 10:24:57 +01:00
Dr. David Alan Gilbert
6f49c3fb56 net: caif: remove unused structs
'cfpktq' has been unused since
commit 73d6ac633c ("caif: code cleanup").

'caif_packet_funcs' is declared but never defined.

Remove both of them.

Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-05 10:18:06 +01:00
Jason Xing
61e2bbafb0 net: remove NULL-pointer net parameter in ip_metrics_convert
When I was doing some experiments, I found that when using the first
parameter, namely, struct net, in ip_metrics_convert() always triggers NULL
pointer crash. Then I digged into this part, realizing that we can remove
this one due to its uselessness.

Signed-off-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-05 10:06:00 +01:00
Jakub Kicinski
1be68a87ab tcp: add a helper for setting EOR on tail skb
TLS (and hopefully soon PSP will) use EOR to prevent skbs
with different decrypted state from getting merged, without
adding new tests to the skb handling. In both cases once
the connection switches to an "encrypted" state, all subsequent
skbs will be encrypted, so a single "EOR fence" is sufficient
to prevent mixing.

Add a helper for setting the EOR bit, to make this arrangement
more explicit.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-06-04 13:23:30 +02:00
Jakub Kicinski
0711153018 tcp: wrap mptcp and decrypted checks into tcp_skb_can_collapse_rx()
tcp_skb_can_collapse() checks for conditions which don't make
sense on input. Because of this we ended up sprinkling a few
pairs of mptcp_skb_can_collapse() and skb_cmp_decrypted() calls
on the input path. Group them in a new helper. This should make
it less likely that someone will check mptcp and not decrypted
or vice versa when adding new code.

This implicitly adds a decrypted check early in tcp_collapse().
AFAIU this will very slightly increase our ability to collapse
packets under memory pressure, not a real bug.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-06-04 13:23:30 +02:00
Davide Caratti
1d17568e74 net/sched: cls_flower: add support for matching tunnel control flags
extend cls_flower to match TUNNEL_FLAGS_PRESENT bits in tunnel metadata.

Suggested-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-06-04 11:16:38 +02:00
Davide Caratti
668b6a2ef8 flow_dissector: add support for tunnel control flags
Dissect [no]csum, [no]dontfrag, [no]oam, [no]crit flags from skb metadata.
This is a prerequisite for matching these control flags using TC flower.

Suggested-by: Ilya Maximets <i.maximets@ovn.org>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-06-04 11:16:38 +02:00
Wolfram Sang
c4aff1d1ec Merge tag 'i2c-host-6.10-pt2' of git://git.kernel.org/pub/scm/linux/kernel/git/andi.shyti/linux into i2c/for-current
Removed the SPD class of i2c devices from the device core.

Additionally, a cleanup in the Synquacer code removes the pclk
from the global structure, as it is used only in the probe.
Therefore, it is now declared locally.
2024-06-03 08:51:53 +02:00
Dmitry Safonov
33700a0c9b net/tcp: Don't consider TCP_CLOSE in TCP_AO_ESTABLISHED
TCP_CLOSE may or may not have current/rnext keys and should not be
considered "established". The fast-path for TCP_CLOSE is
SKB_DROP_REASON_TCP_CLOSE. This is what tcp_rcv_state_process() does
anyways. Add an early drop path to not spend any time verifying
segment signatures for sockets in TCP_CLOSE state.

Cc: stable@vger.kernel.org # v6.7
Fixes: 0a3a809089 ("net/tcp: Verify inbound TCP-AO signed segments")
Signed-off-by: Dmitry Safonov <0x7f454c46@gmail.com>
Link: https://lore.kernel.org/r/20240529-tcp_ao-sk_state-v1-1-d69b5d323c52@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-01 16:27:26 -07:00
Jakub Kicinski
69c8b99871 net: qstat: extend kdoc about get_base_stats
mlx5 has a dedicated queue for PTP packets. Clarify that
this sort of queues can also be accounted towards the base.

Reviewed-by: Joe Damato <jdamato@fastly.com>
Link: https://lore.kernel.org/r/20240529162922.3690698-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-01 15:11:52 -07:00
Linus Torvalds
f26ee67a0f Merge tag 'tty-6.10-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
Pull tty fix from Greg KH:
 "Here is a single revert for a much-reported regression in 6.10-rc1
  when it comes to a few older architectures.

  Turns out that the VT ioctls don't work the same across all cpu types
  because of some old compatibility requrements for stuff like alpha and
  powerpc. So revert the change that attempted to have them use the
  _IO() macros and go back to the known-working values instead.

  This has NOT been in linux-next but has had many reports that it fixes
  the issue with 6.10-rc1"

* tag 'tty-6.10-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
  Revert "VT: Use macros to define ioctls"
2024-06-01 08:53:39 -07:00
Greg Kroah-Hartman
7bc4244c88 Revert "VT: Use macros to define ioctls"
This reverts commit 8c467f3300.

Turns out this breaks many architectures as the vt ioctls do not all
match up everywhere due to historical reasons, so the original commit is
invalid for many values.

Reported-by: Nick Bowler <nbowler@draconx.ca>
Reported-by: Arnd Bergmann <arnd@kernel.org>
Reported-by: Jiri Slaby <jirislaby@kernel.org>
Reported-by: Christian Zigotzky <chzigotzky@xenosoft.de>
Reported-by: Michael Ellerman <mpe@ellerman.id.au>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Alexey Gladkov <legion@kernel.org>
Link: https://lore.kernel.org/r/ad4e561c-1d49-4f25-882c-7a36c6b1b5c0@draconx.ca
Link: https://lore.kernel.org/r/0da9785e-ba44-4718-9d08-4e96c1ba7ab2@kernel.org
Link: https://lore.kernel.org/all/34d848f4-670b-4493-bf21-130ef862521b@xenosoft.de/
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-06-01 07:28:21 +02:00
Linus Torvalds
cc8ed4d0a8 Merge tag 'drm-fixes-2024-06-01' of https://gitlab.freedesktop.org/drm/kernel
Pull drm fixes from Dave Airlie:
 "This is the weekly fixes. Lots of small fixes across the board, one
  BUG_ON fix in shmem seems most important, otherwise amdgpu, i915, xe
  mostly with small fixes to all the other drivers.

  shmem:
   - fix BUG_ON in COW handling
   - warn when trying to pin imported objects

  buddy:
   - fix page size handling

  dma-buf:
   - sw-sync: Don't interfere with IRQ handling
   - fix kthreads-handling error path

  i915:
   - fix a race in audio component by registering it later
   - make DPT object unshrinkable to avoid shrinking when framebuffer
     has not shrunk
   - fix CCS id calculation to fix a perf regression
   - fix selftest caching mode
   - fix FIELD_PREP compiler warnings
   - fix indefinite wait for GT wakeref release
   - revert overeager multi-gt pm reference removal

  xe:
   - pcode polling timeout change
   - fix for deadlocks for faulting VMs
   - error-path lock imbalance fix

  amdgpu:
   - RAS fix
   - fix colorspace property for MST connectors
   - fix for PCIe DPM
   - silence UBSAN warning
   - GPUVM robustness fix
   - partition fix
   - drop deprecated I2C_CLASS_SPD

  amdkfd:
   - revert unused changes for certain 11.0.3 devices
   - simplify APU VRAM handling

  lima:
   - fix dma_resv-related deadlock in object pin

  msm:
   - remove build-time dependency on Python 3.9

  nouveau:
   - nvif: Fix possible integer overflow

  panel:
   - lg-sw43408: Select DP helpers; Declare backlight ops as static
   - sitronix-st7789v: Various fixes for jt240mhqs_hwt_ek_e3 panel

  panfrost:
   - fix dma_resv-related deadlock in object pin"

* tag 'drm-fixes-2024-06-01' of https://gitlab.freedesktop.org/drm/kernel: (35 commits)
  drm/msm: remove python 3.9 dependency for compiling msm
  drm/panel: sitronix-st7789v: fix display size for jt240mhqs_hwt_ek_e3 panel
  drm/panel: sitronix-st7789v: tweak timing for jt240mhqs_hwt_ek_e3 panel
  drm/panel: sitronix-st7789v: fix timing for jt240mhqs_hwt_ek_e3 panel
  drm/amd/pm: remove deprecated I2C_CLASS_SPD support from newly added SMU_14_0_2
  drm/amdgpu: Make CPX mode auto default in NPS4
  drm/amdkfd: simplify APU VRAM handling
  Revert "drm/amdkfd: fix gfx_target_version for certain 11.0.3 devices"
  drm/amdgpu: fix dereference null return value for the function amdgpu_vm_pt_parent
  drm/amdgpu: silence UBSAN warning
  drm/amdgpu: Adjust logic in amdgpu_device_partner_bandwidth()
  drm/i915: Fix audio component initialization
  drm/i915/dpt: Make DPT object unshrinkable
  drm/i915/gt: Fix CCS id's calculation for CCS mode setting
  drm/panel/lg-sw43408: mark sw43408_backlight_ops as static
  drm/i915/selftests: Set always_coherent to false when reading from CPU
  drm/panel/lg-sw43408: select CONFIG_DRM_DISPLAY_DP_HELPER
  drm/i915/guc: avoid FIELD_PREP warning
  drm/i915/gt: Disarm breadcrumbs if engines are already idle
  Revert "drm/i915: Remove extra multi-gt pm-references"
  ...
2024-05-31 16:26:48 -07:00
Linus Torvalds
0f9a75179d Merge tag 'block-6.10-20240530' of git://git.kernel.dk/linux
Pull block fixes from Jens Axboe:

 - NVMe fixes via Keith:
      - Removing unused fields (Kanchan)
      - Large folio offsets support (Kundan)
      - Multipath NUMA node initialiazation fix (Nilay)
      - Multipath IO stats accounting fixes (Keith)
      - Circular lockdep fix (Keith)
      - Target race condition fix (Sagi)
      - Target memory leak fix (Sagi)

 - bcache fixes

 - null_blk fixes (Damien)

 - Fix regression in io.max due to throttle low removal (Waiman)

 - DM limit table fixes (Christoph)

 - SCSI and block limit fixes (Christoph)

 - zone fixes (Damien)

 - Misc fixes (Christoph, Hannes, hexue)

* tag 'block-6.10-20240530' of git://git.kernel.dk/linux: (25 commits)
  blk-throttle: Fix incorrect display of io.max
  block: Fix zone write plugging handling of devices with a runt zone
  block: Fix validation of zoned device with a runt zone
  null_blk: Do not allow runt zone with zone capacity smaller then zone size
  nvmet: fix a possible leak when destroy a ctrl during qp establishment
  nvme: use srcu for iterating namespace list
  bcache: code cleanup in __bch_bucket_alloc_set()
  bcache: call force_wake_up_gc() if necessary in check_should_bypass()
  bcache: allow allocator to invalidate bucket in gc
  block: check for max_hw_sectors underflow
  block: stack max_user_sectors
  sd: also set max_user_sectors when setting max_sectors
  null_blk: Print correct max open zones limit in null_init_zoned_dev()
  block: delete redundant function declaration
  null_blk: Fix return value of nullb_device_power_store()
  dm: make dm_set_zones_restrictions work on the queue limits
  dm: remove dm_check_zoned
  dm: move setting zoned_enabled to dm_table_set_restrictions
  block: remove blk_queue_max_integrity_segments
  nvme: adjust multiples of NVME_CTRL_PAGE_SIZE in offset
  ...
2024-05-31 15:31:27 -07:00
Jakub Kicinski
e19de2064f Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

Conflicts:

drivers/net/ethernet/ti/icssg/icssg_classifier.c
  abd5576b9c ("net: ti: icssg-prueth: Add support for ICSSG switch firmware")
  56a5cf538c ("net: ti: icssg-prueth: Fix start counter for ft1 filter")
https://lore.kernel.org/all/20240531123822.3bb7eadf@canb.auug.org.au/

No other adjacent changes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-31 14:10:28 -07:00
Linus Torvalds
7d88cc8ecc Merge tag 'sound-6.10-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
 "Lots of small fixes:

   - A race fix for debugfs handling in ALSA core

   - A series of corrections for MIDI2 core format conversions

   - ASoC Intel fixes for 16 bit DMIC config

   - Updates for missing module parameters in ASoC code

   - HD-audio quirk, Cirrus codec fix, etc minor fixes"

* tag 'sound-6.10-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: (26 commits)
  ALSA: seq: ump: Fix swapped song position pointer data
  ASoC: SOF: ipc4-topology: Adjust the params based on DAI formats
  ASoC: SOF: ipc4-topology: Improve readability of sof_ipc4_prepare_dai_copier()
  ASoC: SOF: ipc4-topology/pcm: Rename sof_ipc4_copier_is_single_format()
  ASoC: SOF: ipc4-topology: Print out the channel count in sof_ipc4_dbg_audio_format
  ASoC: SOF: ipc4-topology: Add support for NHLT with 16-bit only DMIC blob
  ALSA: seq: Fix yet another spot for system message conversion
  ALSA: ump: Set default protocol when not given explicitly
  ALSA: ump: Don't accept an invalid UMP protocol number
  ASoC: SOF: ipc4-topology: Fix input format query of process modules without base extension
  ASoC: Intel: sof-sdw: fix missing SPI_MASTER dependency
  ALSA: pcm: fix typo in comment
  ALSA: ump: Don't clear bank selection after sending a program change
  ALSA: seq: Fix incorrect UMP type for system messages
  ALSA/hda: intel-dsp-config: reduce log verbosity
  ALSA: seq: Don't clear bank selection at event -> UMP MIDI2 conversion
  ALSA: seq: Fix missing bank setup between MIDI1/MIDI2 UMP conversion
  ASoC: SOF: add missing MODULE_DESCRIPTION()
  ASoC: SOF: reorder MODULE_ definitions
  ASoC: SOF: AMD: group all module related information
  ...
2024-05-31 12:11:44 -07:00
Russell King (Oracle)
83f55b01dd net: stmmac: rename xpcs_an_inband to default_an_inband
Rename xpcs_an_inband to default_an_inband to reflect the change in
phylink and its changed functionality.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Halaney <ahalaney@redhat.com>
Link: https://lore.kernel.org/r/E1sCJN6-00EcrD-43@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-30 18:32:16 -07:00
Russell King (Oracle)
02d00dc73d net: phylink: rename ovr_an_inband to default_an_inband
Since ovr_an_inband no longer overrides every MLO_AN_xxx mode, rename
it to reflect what it now does - it changes the default mode from
MLO_AN_PHY to MLO_AN_INBAND. Fix up the two users of this.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Halaney <ahalaney@redhat.com>
Link: https://lore.kernel.org/r/E1sCJMv-00Ecr1-Sk@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-30 18:32:15 -07:00
Hangbin Liu
a79d8fe2ff ipv6: sr: restruct ifdefines
There are too many ifdef in IPv6 segment routing code that may cause logic
problems. like commit 160e9d2752 ("ipv6: sr: fix invalid unregister error
path"). To avoid this, the init functions are redefined for both cases. The
code could be more clear after all fidefs are removed.

Suggested-by: Simon Horman <horms@kernel.org>
Suggested-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240529040908.3472952-1-liuhangbin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-30 18:29:38 -07:00
Kui-Feng Lee
67c3e8353f bpf: export bpf_link_inc_not_zero.
bpf_link_inc_not_zero() will be used by kernel modules.  We will use it in
bpf_testmod.c later.

Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Link: https://lore.kernel.org/r/20240530065946.979330-5-thinker.li@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-05-30 15:34:13 -07:00
Kui-Feng Lee
1adddc97aa bpf: support epoll from bpf struct_ops links.
Add epoll support to bpf struct_ops links to trigger EPOLLHUP event upon
detachment.

This patch implements the "poll" of the "struct file_operations" for BPF
links and introduces a new "poll" operator in the "struct bpf_link_ops". By
implementing "poll" of "struct bpf_link_ops" for the links of struct_ops,
the file descriptor of a struct_ops link can be added to an epoll file
descriptor to receive EPOLLHUP events.

Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Link: https://lore.kernel.org/r/20240530065946.979330-4-thinker.li@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-05-30 15:34:13 -07:00