The sysctl core is preparing to only expose instances of
struct ctl_table as "const".
This will also affect the ctl_table argument of sysctl handlers.
As the function prototype of all sysctl handlers throughout the tree
needs to stay consistent that change will be done in one commit.
To reduce the size of that final commit, switch utility functions which
are not bound by "typedef proc_handler" to "const struct ctl_table".
No functional change.
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Link: https://lore.kernel.org/r/20240527-sysctl-const-handler-net-v1-3-16523767d0b2@weissschuh.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The sysctl core is preparing to only expose instances of
struct ctl_table as "const".
This will also affect the ctl_table argument of sysctl handlers.
As the function prototype of all sysctl handlers throughout the tree
needs to stay consistent that change will be done in one commit.
To reduce the size of that final commit, switch utility functions which
are not bound by "typedef proc_handler" to "const struct ctl_table".
No functional change.
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Link: https://lore.kernel.org/r/20240527-sysctl-const-handler-net-v1-2-16523767d0b2@weissschuh.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The sysctl core is preparing to only expose instances of
struct ctl_table as "const".
This will also affect the ctl_table argument of sysctl handlers.
As the function prototype of all sysctl handlers throughout the tree
needs to stay consistent that change will be done in one commit.
To reduce the size of that final commit, switch utility functions which
are not bound by "typedef proc_handler" to "const struct ctl_table".
No functional change.
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Link: https://lore.kernel.org/r/20240527-sysctl-const-handler-net-v1-1-16523767d0b2@weissschuh.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This early RTL8168b version was the first PCIe chip version, and it's
quite quirky. Last sign of life is from more than 15 yrs ago.
Let's remove detection of this chip version, we'll see whether anybody
complains. If not, support for this chip version can be removed a few
kernel versions later.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Link: https://lore.kernel.org/r/875cdcf4-843c-420a-ad5d-417447b68572@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Daniel Borkmann says:
====================
pull-request: bpf-next 2024-05-28
We've added 23 non-merge commits during the last 11 day(s) which contain
a total of 45 files changed, 696 insertions(+), 277 deletions(-).
The main changes are:
1) Rename skb's mono_delivery_time to tstamp_type for extensibility
and add SKB_CLOCK_TAI type support to bpf_skb_set_tstamp(),
from Abhishek Chauhan.
2) Add netfilter CT zone ID and direction to bpf_ct_opts so that arbitrary
CT zones can be used from XDP/tc BPF netfilter CT helper functions,
from Brad Cowie.
3) Several tweaks to the instruction-set.rst IETF doc to address
the Last Call review comments, from Dave Thaler.
4) Small batch of riscv64 BPF JIT optimizations in order to emit more
compressed instructions to the JITed image for better icache efficiency,
from Xiao Wang.
5) Sort bpftool C dump output from BTF, aiming to simplify vmlinux.h
diffing and forcing more natural type definitions ordering,
from Mykyta Yatsenko.
6) Use DEV_STATS_INC() macro in BPF redirect helpers to silence
a syzbot/KCSAN race report for the tx_errors counter,
from Jiang Yunshui.
7) Un-constify bpf_func_info in bpftool to fix compilation with LLVM 17+
which started treating const structs as constants and thus breaking
full BTF program name resolution, from Ivan Babrou.
8) Fix up BPF program numbers in test_sockmap selftest in order to reduce
some of the test-internal array sizes, from Geliang Tang.
9) Small cleanup in Makefile.btf script to use test-ge check for v1.25-only
pahole, from Alan Maguire.
10) Fix bpftool's make dependencies for vmlinux.h in order to avoid needless
rebuilds in some corner cases, from Artem Savkov.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (23 commits)
bpf, net: Use DEV_STAT_INC()
bpf, docs: Fix instruction.rst indentation
bpf, docs: Clarify call local offset
bpf, docs: Add table captions
bpf, docs: clarify sign extension of 64-bit use of 32-bit imm
bpf, docs: Use RFC 2119 language for ISA requirements
bpf, docs: Move sentence about returning R0 to abi.rst
bpf: constify member bpf_sysctl_kern:: Table
riscv, bpf: Try RVC for reg move within BPF_CMPXCHG JIT
riscv, bpf: Use STACK_ALIGN macro for size rounding up
riscv, bpf: Optimize zextw insn with Zba extension
selftests/bpf: Handle forwarding of UDP CLOCK_TAI packets
net: Add additional bit to support clockid_t timestamp type
net: Rename mono_delivery_time to tstamp_type for scalabilty
selftests/bpf: Update tests for new ct zone opts for nf_conntrack kfuncs
net: netfilter: Make ct zone opts configurable for bpf ct helpers
selftests/bpf: Fix prog numbers in test_sockmap
bpf: Remove unused variable "prev_state"
bpftool: Un-const bpf_func_info to fix it for llvm 17 and newer
bpf: Fix order of args in call to bpf_map_kvcalloc
...
====================
Link: https://lore.kernel.org/r/20240528105924.30905-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
'ne2k_pci_card' is unused since 2.3.99-pre3 in March 2000.
Remove it.
Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
'mlx4_port_config was added by
commit ab9c17a009 ("mlx4_core: Modify driver initialization flow to
accommodate SRIOV for Ethernet")
but remained unused.
Remove it.
Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
'niclist' and 'oct_link_status_resp' are unused since the original
commit f21fb3ed36 ("Add support of Cavium Liquidio ethernet
adapters").
Remove them.
Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
'short_rx_done_desc' and 'basic_rx_done_desc' are unused since
commit fdecea6668 (" [netdrvr starfire] Add GPL'd firmware, remove
compat code").
Remove them.
Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
In commit cdfbabfb2f ("net: Work around lockdep limitation in
sockets that use sockets"), it introduces 'af_kern_callback_keys'
to lockdep-init of sk_callback_lock according to 'sk_kern_sock',
it modifies sock_init_data() only, and sk_clone_lock() calls
sk_init_common() to initialize sk_callback_lock too, so the
lockdep-init of sk_callback_lock should be moved to sk_init_common().
Signed-off-by: Gou Hao <gouhao@uniontech.com>
Link: https://lore.kernel.org/r/20240526145718.9542-2-gouhao@uniontech.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
In the Jump instructions section it explains that the offset is
"relative to the instruction following the jump instruction".
But the program-local section confusingly said "referenced by
offset from the call instruction, similar to JA".
This patch updates that sentence with consistent wording, saying
it's relative to the instruction following the call instruction.
Signed-off-by: Dave Thaler <dthaler1968@gmail.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20240525153332.21355-1-dthaler1968@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
imm is defined as a 32-bit signed integer.
{MOV, K, ALU64} says it does "dst = src" (where src is 'imm') and it
does do dst = (s64)imm, which in that sense does sign extend imm. The MOVSX
instruction is explained as sign extending, so added the example of
{MOV, K, ALU64} to make this more clear.
{JLE, K, JMP} says it does "PC += offset if dst <= src" (where src is 'imm',
and the comparison is unsigned). This was apparently ambiguous to some
readers as to whether the comparison was "dst <= (u64)(u32)imm" or
"dst <= (u64)(s64)imm" so added an example to make this more clear.
v1 -> v2: Address comments from Yonghong
Signed-off-by: Dave Thaler <dthaler1968@googlemail.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/r/20240520215255.10595-1-dthaler1968@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Per IETF convention and discussion at LSF/MM/BPF, use MUST etc.
keywords as requested by IETF Area Director review. Also as
requested, indicate that documenting BTF is out of scope of this
document and will be covered by a separate IETF specification.
Added paragraph about the terminology that is required IETF boilerplate
and must be worded exactly as such.
Signed-off-by: Dave Thaler <dthaler1968@googlemail.com>
Acked-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/r/20240517165855.4688-1-dthaler1968@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
As discussed at LSF/MM/BPF, the sentence about using R0 for returning
values from calls is part of the calling convention and belongs in
abi.rst. Any further additions or clarifications to this text are left
for future patches on abi.rst. The current patch is simply to unblock
progression of instruction-set.rst to a standard.
In contrast, the restriction of register numbers to the range 0-10
is untouched, left in the instruction-set.rst definition of the
src_reg and dst_reg fields.
Signed-off-by: Dave Thaler <dthaler1968@googlemail.com>
Link: https://lore.kernel.org/r/20240517153445.3914-1-dthaler1968@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The sysctl core is preparing to only expose instances of struct ctl_table
as "const". This will also affect the ctl_table argument of sysctl handlers,
for which bpf_sysctl_kern::table is also used.
As the function prototype of all sysctl handlers throughout the tree
needs to stay consistent that change will be done in one commit.
To reduce the size of that final commit, switch this utility type which
is not bound by "typedef proc_handler" to "const struct ctl_table".
No functional change.
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Joel Granados <j.granados@samsung.com>
Link: https://lore.kernel.org/bpf/20240518-sysctl-const-handler-bpf-v1-1-f0d7186743c1@weissschuh.net
Abhishek Chauhan says:
====================
Patch 1 :- This patch takes care of only renaming the mono delivery
timestamp to tstamp_type with no change in functionality of
existing available code in kernel also
Starts assigning tstamp_type with either mono or real and
introduces a new enum in the skbuff.h, again no change in functionality
of the existing available code in kernel , just making the code scalable.
Patch 2 :- Additional bit was added to support tai timestamp type to
avoid tstamp drops in the forwarding path when testing TC-ETF.
Patch is also updating bpf filter.c
Some updates to bpf header files with introduction to BPF_SKB_CLOCK_TAI
and documentation updates stating deprecation of BPF_SKB_TSTAMP_UNSPEC
and BPF_SKB_TSTAMP_DELIVERY_MONO
Patch 3:- Handles forwarding of UDP packets with TAI clock id tstamp_type
type with supported changes for tc_redirect/tc_redirect_dtime
to handle forwarding of UDP packets with TAI tstamp_type
====================
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
tstamp_type is now set based on actual clockid_t compressed
into 2 bits.
To make the design scalable for future needs this commit bring in
the change to extend the tstamp_type:1 to tstamp_type:2 to support
other clockid_t timestamp.
We now support CLOCK_TAI as part of tstamp_type as part of this
commit with existing support CLOCK_MONOTONIC and CLOCK_REALTIME.
Signed-off-by: Abhishek Chauhan <quic_abchauha@quicinc.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20240509211834.3235191-3-quic_abchauha@quicinc.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
mono_delivery_time was added to check if skb->tstamp has delivery
time in mono clock base (i.e. EDT) otherwise skb->tstamp has
timestamp in ingress and delivery_time at egress.
Renaming the bitfield from mono_delivery_time to tstamp_type is for
extensibilty for other timestamps such as userspace timestamp
(i.e. SO_TXTIME) set via sock opts.
As we are renaming the mono_delivery_time to tstamp_type, it makes
sense to start assigning tstamp_type based on enum defined
in this commit.
Earlier we used bool arg flag to check if the tstamp is mono in
function skb_set_delivery_time, Now the signature of the functions
accepts tstamp_type to distinguish between mono and real time.
Also skb_set_delivery_type_by_clockid is a new function which accepts
clockid to determine the tstamp_type.
In future tstamp_type:1 can be extended to support userspace timestamp
by increasing the bitfield.
Signed-off-by: Abhishek Chauhan <quic_abchauha@quicinc.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20240509211834.3235191-2-quic_abchauha@quicinc.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Pull networking fixes from Paolo Abeni:
"Quite smaller than usual. Notably it includes the fix for the unix
regression from the past weeks. The TCP window fix will require some
follow-up, already queued.
Current release - regressions:
- af_unix: fix garbage collection of embryos
Previous releases - regressions:
- af_unix: fix race between GC and receive path
- ipv6: sr: fix missing sk_buff release in seg6_input_core
- tcp: remove 64 KByte limit for initial tp->rcv_wnd value
- eth: r8169: fix rx hangup
- eth: lan966x: remove ptp traps in case the ptp is not enabled
- eth: ixgbe: fix link breakage vs cisco switches
- eth: ice: prevent ethtool from corrupting the channels
Previous releases - always broken:
- openvswitch: set the skbuff pkt_type for proper pmtud support
- tcp: Fix shift-out-of-bounds in dctcp_update_alpha()
Misc:
- a bunch of selftests stabilization patches"
* tag 'net-6.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (25 commits)
r8169: Fix possible ring buffer corruption on fragmented Tx packets.
idpf: Interpret .set_channels() input differently
ice: Interpret .set_channels() input differently
nfc: nci: Fix handling of zero-length payload packets in nci_rx_work()
net: relax socket state check at accept time.
tcp: remove 64 KByte limit for initial tp->rcv_wnd value
net: ti: icssg_prueth: Fix NULL pointer dereference in prueth_probe()
tls: fix missing memory barrier in tls_init
net: fec: avoid lock evasion when reading pps_enable
Revert "ixgbe: Manual AN-37 for troublesome link partners for X550 SFI"
testing: net-drv: use stats64 for testing
net: mana: Fix the extra HZ in mana_hwc_send_request
net: lan966x: Remove ptp traps in case the ptp is not enabled.
openvswitch: Set the skbuff pkt_type for proper pmtud support.
selftest: af_unix: Make SCM_RIGHTS into OOB data.
af_unix: Fix garbage collection of embryos carrying OOB with SCM_RIGHTS
tcp: Fix shift-out-of-bounds in dctcp_update_alpha().
selftests/net: use tc rule to filter the na packet
ipv6: sr: fix memleak in seg6_hmac_init_algo
af_unix: Update unix_sk(sk)->oob_skb under sk_receive_queue lock.
...
Pull tracing fixes from Steven Rostedt:
"Minor last minute fixes:
- Fix a very tight race between the ring buffer readers and resizing
the ring buffer
- Correct some stale comments in the ring buffer code
- Fix kernel-doc in the rv code
- Add a MODULE_DESCRIPTION to preemptirq_delay_test"
* tag 'trace-fixes-v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
rv: Update rv_en(dis)able_monitor doc to match kernel-doc
tracing: Add MODULE_DESCRIPTION() to preemptirq_delay_test
ring-buffer: Fix a race between readers and resize checks
ring-buffer: Correct stale comments related to non-consuming readers
Pull tracing tool fix from Steven Rostedt:
"Fix printf format warnings in latency-collector.
Use the printf format string with %s to take a string instead of
taking in a string directly"
* tag 'trace-tools-v6.10-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tools/latency-collector: Fix -Wformat-security compile warns
Pull tracing cleanup from Steven Rostedt:
"Remove second argument of __assign_str()
The __assign_str() macro logic of the TRACE_EVENT() macro was
optimized so that it no longer needs the second argument. The
__assign_str() is always matched with __string() field that takes a
field name and the source for that field:
__string(field, source)
The TRACE_EVENT() macro logic will save off the source value and then
use that value to copy into the ring buffer via the __assign_str().
Before commit c1fa617cae ("tracing: Rework __assign_str() and
__string() to not duplicate getting the string"), the __assign_str()
needed the second argument which would perform the same logic as the
__string() source parameter did. Not only would this add overhead, but
it was error prone as if the __assign_str() source produced something
different, it may not have allocated enough for the string in the ring
buffer (as the __string() source was used to determine how much to
allocate)
Now that the __assign_str() just uses the same string that was used in
__string() it no longer needs the source parameter. It can now be
removed"
* tag 'trace-assign-str-v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing/treewide: Remove second parameter of __assign_str()
Pull sparc updates from Andreas Larsson:
- Avoid on-stack cpumask variables in a number of places
- Move struct termio to asm/termios.h, matching other architectures and
allowing certain user space applications to build also for sparc
- Fix missing prototype warnings for sparc64
- Fix version generation warnings for sparc32
- Fix bug where non-consecutive CPU IDs lead to some CPUs not starting
- Simplification using swap and cleanup using NULL for pointer
- Convert sparc parport and chmc drivers to use remove callbacks
returning void
* tag 'sparc-for-6.10-tag1' of git://git.kernel.org/pub/scm/linux/kernel/git/alarsson/linux-sparc:
sparc/leon: Remove on-stack cpumask var
sparc/pci_msi: Remove on-stack cpumask var
sparc/of: Remove on-stack cpumask var
sparc/irq: Remove on-stack cpumask var
sparc/srmmu: Remove on-stack cpumask var
sparc: chmc: Convert to platform remove callback returning void
sparc: parport: Convert to platform remove callback returning void
sparc: Compare pointers to NULL instead of 0
sparc: Use swap() to fix Coccinelle warning
sparc32: Fix version generation failed warnings
sparc64: Fix number of online CPUs
sparc64: Fix prototype warning for sched_clock
sparc64: Fix prototype warnings in adi_64.c
sparc64: Fix prototype warning for dma_4v_iotsb_bind
sparc64: Fix prototype warning for uprobe_trap
sparc64: Fix prototype warning for alloc_irqstack_bootmem
sparc64: Fix prototype warning for vmemmap_free
sparc64: Fix prototype warnings in traps_64.c
sparc64: Fix prototype warning for init_vdso_image
sparc: move struct termio to asm/termios.h
Pull arm64 fixes from Will Deacon:
"The major fix here is for a filesystem corruption issue reported on
Apple M1 as a result of buggy management of the floating point
register state introduced in 6.8. I initially reverted one of the
offending patches, but in the end Ard cooked a proper fix so there's a
revert+reapply in the series.
Aside from that, we've got some CPU errata workarounds and misc other
fixes.
- Fix broken FP register state tracking which resulted in filesystem
corruption when dm-crypt is used
- Workarounds for Arm CPU errata affecting the SSBS Spectre
mitigation
- Fix lockdep assertion in DMC620 memory controller PMU driver
- Fix alignment of BUG table when CONFIG_DEBUG_BUGVERBOSE is
disabled"
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
arm64/fpsimd: Avoid erroneous elide of user state reload
Reapply "arm64: fpsimd: Implement lazy restore for kernel mode FPSIMD"
arm64: asm-bug: Add .align 2 to the end of __BUG_ENTRY
perf/arm-dmc620: Fix lockdep assert in ->event_init()
Revert "arm64: fpsimd: Implement lazy restore for kernel mode FPSIMD"
arm64: errata: Add workaround for Arm errata 3194386 and 3312417
arm64: cputype: Add Neoverse-V3 definitions
arm64: cputype: Add Cortex-X4 definitions
arm64: barrier: Restore spec_bar() macro
Pull virtio updates from Michael Tsirkin:
"Several new features here:
- virtio-net is finally supported in vduse
- virtio (balloon and mem) interaction with suspend is improved
- vhost-scsi now handles signals better/faster
And fixes, cleanups all over the place"
* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (48 commits)
virtio-pci: Check if is_avq is NULL
virtio: delete vq in vp_find_vqs_msix() when request_irq() fails
MAINTAINERS: add Eugenio Pérez as reviewer
vhost-vdpa: Remove usage of the deprecated ida_simple_xx() API
vp_vdpa: don't allocate unused msix vectors
sound: virtio: drop owner assignment
fuse: virtio: drop owner assignment
scsi: virtio: drop owner assignment
rpmsg: virtio: drop owner assignment
nvdimm: virtio_pmem: drop owner assignment
wifi: mac80211_hwsim: drop owner assignment
vsock/virtio: drop owner assignment
net: 9p: virtio: drop owner assignment
net: virtio: drop owner assignment
net: caif: virtio: drop owner assignment
misc: nsm: drop owner assignment
iommu: virtio: drop owner assignment
drm/virtio: drop owner assignment
gpio: virtio: drop owner assignment
firmware: arm_scmi: virtio: drop owner assignment
...
Fix the following -Wformat-security compile warnings adding missing
format arguments:
latency-collector.c: In function ‘show_available’:
latency-collector.c:938:17: warning: format not a string literal and
no format arguments [-Wformat-security]
938 | warnx(no_tracer_msg);
| ^~~~~
latency-collector.c:943:17: warning: format not a string literal and
no format arguments [-Wformat-security]
943 | warnx(no_latency_tr_msg);
| ^~~~~
latency-collector.c: In function ‘find_default_tracer’:
latency-collector.c:986:25: warning: format not a string literal and
no format arguments [-Wformat-security]
986 | errx(EXIT_FAILURE, no_tracer_msg);
|
^~~~
latency-collector.c: In function ‘scan_arguments’:
latency-collector.c:1881:33: warning: format not a string literal and
no format arguments [-Wformat-security]
1881 | errx(EXIT_FAILURE, no_tracer_msg);
| ^~~~
Link: https://lore.kernel.org/linux-trace-kernel/20240404011009.32945-1-skhan@linuxfoundation.org
Cc: stable@vger.kernel.org
Fixes: e23db805da ("tracing/tools: Add the latency-collector to tools directory")
Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
An issue was found on the RTL8125b when transmitting small fragmented
packets, whereby invalid entries were inserted into the transmit ring
buffer, subsequently leading to calls to dma_unmap_single() with a null
address.
This was caused by rtl8169_start_xmit() not noticing changes to nr_frags
which may occur when small packets are padded (to work around hardware
quirks) in rtl8169_tso_csum_v2().
To fix this, postpone inspecting nr_frags until after any padding has been
applied.
Fixes: 9020845fb5 ("r8169: improve rtl8169_start_xmit")
Cc: stable@vger.kernel.org
Signed-off-by: Ken Milmore <ken.milmore@gmail.com>
Reviewed-by: Heiner Kallweit <hkallweit1@gmail.com>
Link: https://lore.kernel.org/r/27ead18b-c23d-4f49-a020-1fc482c5ac95@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Jacob Keller says:
====================
intel: Interpret .set_channels() input differently
The ice and idpf drivers can trigger a crash with AF_XDP due to incorrect
interpretation of the asymmetric Tx and Rx parameters in their
.set_channels() implementations:
1. ethtool -l <IFNAME> -> combined: 40
2. Attach AF_XDP to queue 30
3. ethtool -L <IFNAME> rx 15 tx 15
combined number is not specified, so command becomes {rx_count = 15,
tx_count = 15, combined_count = 40}.
4. ethnl_set_channels checks, if there are any AF_XDP of queues from the
new (combined_count + rx_count) to the old one, so from 55 to 40, check
does not trigger.
5. the driver interprets `rx 15 tx 15` as 15 combined channels and deletes
the queue that AF_XDP is attached to.
This is fundamentally a problem with interpreting a request for asymmetric
queues as symmetric combined queues.
Fix the ice and idpf drivers to stop interpreting such requests as a
request for combined queues. Due to current driver design for both ice and
idpf, it is not possible to support requests of the same count of Tx and Rx
queues with independent interrupts, (i.e. ethtool -L <IFNAME> rx 15 tx 15)
so such requests are now rejected.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
====================
Link: https://lore.kernel.org/r/20240521-iwl-net-2024-05-14-set-channels-fixes-v2-0-7aa39e2e99f1@intel.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Unlike ice, idpf does not check, if user has requested at least 1 combined
channel. Instead, it relies on a check in the core code. Unfortunately, the
check does not trigger for us because of the hacky .set_channels()
interpretation logic that is not consistent with the core code.
This naturally leads to user being able to trigger a crash with an invalid
input. This is how:
1. ethtool -l <IFNAME> -> combined: 40
2. ethtool -L <IFNAME> rx 0 tx 0
combined number is not specified, so command becomes {rx_count = 0,
tx_count = 0, combined_count = 40}.
3. ethnl_set_channels checks, if there is at least 1 RX and 1 TX channel,
comparing (combined_count + rx_count) and (combined_count + tx_count)
to zero. Obviously, (40 + 0) is greater than zero, so the core code
deems the input OK.
4. idpf interprets `rx 0 tx 0` as 0 channels and tries to proceed with such
configuration.
The issue has to be solved fundamentally, as current logic is also known to
cause AF_XDP problems in ice [0].
Interpret the command in a way that is more consistent with ethtool
manual [1] (--show-channels and --set-channels) and new ice logic.
Considering that in the idpf driver only the difference between RX and TX
queues forms dedicated channels, change the correct way to set number of
channels to:
ethtool -L <IFNAME> combined 10 /* For symmetric queues */
ethtool -L <IFNAME> combined 8 tx 2 rx 0 /* For asymmetric queues */
[0] https://lore.kernel.org/netdev/20240418095857.2827-1-larysa.zaremba@intel.com/
[1] https://man7.org/linux/man-pages/man8/ethtool.8.html
Fixes: 02cbfba1ad ("idpf: add ethtool callbacks")
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Igor Bagnucki <igor.bagnucki@intel.com>
Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
Tested-by: Krishneil Singh <krishneil.k.singh@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
A bug occurs because a safety check guarding AF_XDP-related queues in
ethnl_set_channels(), does not trigger. This happens, because kernel and
ice driver interpret the ethtool command differently.
How the bug occurs:
1. ethtool -l <IFNAME> -> combined: 40
2. Attach AF_XDP to queue 30
3. ethtool -L <IFNAME> rx 15 tx 15
combined number is not specified, so command becomes {rx_count = 15,
tx_count = 15, combined_count = 40}.
4. ethnl_set_channels checks, if there are any AF_XDP of queues from the
new (combined_count + rx_count) to the old one, so from 55 to 40, check
does not trigger.
5. ice interprets `rx 15 tx 15` as 15 combined channels and deletes the
queue that AF_XDP is attached to.
Interpret the command in a way that is more consistent with ethtool
manual [0] (--show-channels and --set-channels).
Considering that in the ice driver only the difference between RX and TX
queues forms dedicated channels, change the correct way to set number of
channels to:
ethtool -L <IFNAME> combined 10 /* For symmetric queues */
ethtool -L <IFNAME> combined 8 tx 2 rx 0 /* For asymmetric queues */
[0] https://man7.org/linux/man-pages/man8/ethtool.8.html
Fixes: 87324e747f ("ice: Implement ethtool ops for channels")
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
Tested-by: Chandan Kumar Rout <chandanx.rout@intel.com>
Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com>
Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Recently, we had some servers upgraded to the latest kernel and noticed
the indicator from the user side showed worse results than before. It is
caused by the limitation of tp->rcv_wnd.
In 2018 commit a337531b94 ("tcp: up initial rmem to 128KB and SYN rwin
to around 64KB") limited the initial value of tp->rcv_wnd to 65535, most
CDN teams would not benefit from this change because they cannot have a
large window to receive a big packet, which will be slowed down especially
in long RTT. Small rcv_wnd means slow transfer speed, to some extent. It's
the side effect for the latency/time-sensitive users.
To avoid future confusion, current change doesn't affect the initial
receive window on the wire in a SYN or SYN+ACK packet which are set within
65535 bytes according to RFC 7323 also due to the limit in
__tcp_transmit_skb():
th->window = htons(min(tp->rcv_wnd, 65535U));
In one word, __tcp_transmit_skb() already ensures that constraint is
respected, no matter how large tp->rcv_wnd is. The change doesn't violate
RFC.
Let me provide one example if with or without the patch:
Before:
client --- SYN: rwindow=65535 ---> server
client <--- SYN+ACK: rwindow=65535 ---- server
client --- ACK: rwindow=65536 ---> server
Note: for the last ACK, the calculation is 512 << 7.
After:
client --- SYN: rwindow=65535 ---> server
client <--- SYN+ACK: rwindow=65535 ---- server
client --- ACK: rwindow=175232 ---> server
Note: I use the following command to make it work:
ip route change default via [ip] dev eth0 metric 100 initrwnd 120
For the last ACK, the calculation is 1369 << 7.
When we apply such a patch, having a large rcv_wnd if the user tweak this
knob can help transfer data more rapidly and save some rtts.
Fixes: a337531b94 ("tcp: up initial rmem to 128KB and SYN rwin to around 64KB")
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Link: https://lore.kernel.org/r/20240521134220.12510-1-kerneljasonxing@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
In tls_init(), a write memory barrier is missing, and store-store
reordering may cause NULL dereference in tls_{setsockopt,getsockopt}.
CPU0 CPU1
----- -----
// In tls_init()
// In tls_ctx_create()
ctx = kzalloc()
ctx->sk_proto = READ_ONCE(sk->sk_prot) -(1)
// In update_sk_prot()
WRITE_ONCE(sk->sk_prot, tls_prots) -(2)
// In sock_common_setsockopt()
READ_ONCE(sk->sk_prot)->setsockopt()
// In tls_{setsockopt,getsockopt}()
ctx->sk_proto->setsockopt() -(3)
In the above scenario, when (1) and (2) are reordered, (3) can observe
the NULL value of ctx->sk_proto, causing NULL dereference.
To fix it, we rely on rcu_assign_pointer() which implies the release
barrier semantic. By moving rcu_assign_pointer() after ctx->sk_proto is
initialized, we can ensure that ctx->sk_proto are visible when
changing sk->sk_prot.
Fixes: d5bee7374b ("net/tls: Annotate access to sk_prot with READ_ONCE/WRITE_ONCE")
Signed-off-by: Yewon Choi <woni9911@gmail.com>
Signed-off-by: Dae R. Jeong <threeearcat@gmail.com>
Link: https://lore.kernel.org/netdev/ZU4OJG56g2V9z_H7@dragonet/T/
Link: https://lore.kernel.org/r/Zkx4vjSFp0mfpjQ2@libra05
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The assignment of pps_enable is protected by tmreg_lock, but the read
operation of pps_enable is not. So the Coverity tool reports a lock
evasion warning which may cause data race to occur when running in a
multithread environment. Although this issue is almost impossible to
occur, we'd better fix it, at least it seems more logically reasonable,
and it also prevents Coverity from continuing to issue warnings.
Fixes: 278d240478 ("net: fec: ptp: Enable PPS output based on ptp clock")
Signed-off-by: Wei Fang <wei.fang@nxp.com>
Link: https://lore.kernel.org/r/20240521023800.17102-1-wei.fang@nxp.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
This reverts commit 565736048b.
According to the commit, it implements a manual AN-37 for some
"troublesome" Juniper MX5 switches. This appears to be a workaround for a
particular switch.
It has been reported that this causes a severe breakage for other switches,
including a Cisco 3560CX-12PD-S.
The code appears to be a workaround for a specific switch which fails to
link in SFI mode. It expects to see AN-37 auto negotiation in order to
link. The Cisco switch is not expecting AN-37 auto negotiation. When the
device starts the manual AN-37, the Cisco switch decides that the port is
confused and stops attempting to link with it. This persists until a power
cycle. A simple driver unload and reload does not resolve the issue, even
if loading with a version of the driver which lacks this workaround.
The authors of the workaround commit have not responded with
clarifications, and the result of the workaround is complete failure to
connect with other switches.
This appears to be a case where the driver can either "correctly" link with
the Juniper MX5 switch, at the cost of bricking the link with the Cisco
switch, or it can behave properly for the Cisco switch, but fail to link
with the Junipir MX5 switch. I do not know enough about the standards
involved to clearly determine whether either switch is at fault or behaving
incorrectly. Nor do I know whether there exists some alternative fix which
corrects behavior with both switches.
Revert the workaround for the Juniper switch.
Fixes: 565736048b ("ixgbe: Manual AN-37 for troublesome link partners for X550 SFI")
Link: https://lore.kernel.org/netdev/cbe874db-9ac9-42b8-afa0-88ea910e1e99@intel.com/T/
Link: https://forum.proxmox.com/threads/intel-x553-sfp-ixgbe-no-go-on-pve8.135129/#post-612291
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Cc: Jeff Daly <jeffd@silicom-usa.com>
Cc: kernel.org-fo5k2w@ycharbi.fr
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240520-net-2024-05-20-revert-silicom-switch-workaround-v1-1-50f80f261c94@intel.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Testing a network device that has large numbers of bytes/packets may
overflow. Using stats64 when comparing fixes this problem.
I tripped on this while iterating on a qstats patch for mlx5. See below
for confirmation without my added code that this is a bug.
Before this patch (with added debugging output):
$ NETIF=eth0 tools/testing/selftests/drivers/net/stats.py
KTAP version 1
1..4
ok 1 stats.check_pause
ok 2 stats.check_fec
rstat: 481708634 qstat: 666201639514 key: tx-bytes
not ok 3 stats.pkt_byte_sum
ok 4 stats.qstat_by_ifindex
Note the huge delta above ^^^ in the rtnl vs qstats.
After this patch:
$ NETIF=eth0 tools/testing/selftests/drivers/net/stats.py
KTAP version 1
1..4
ok 1 stats.check_pause
ok 2 stats.check_fec
ok 3 stats.pkt_byte_sum
ok 4 stats.qstat_by_ifindex
It looks like rtnl_fill_stats in net/core/rtnetlink.c will attempt to
copy the 64bit stats into a 32bit structure which is probably why this
behavior is occurring.
To show this is happening, you can get the underlying stats that the
stats.py test uses like this:
$ ./cli.py --spec ../../../Documentation/netlink/specs/rt_link.yaml \
--do getlink --json '{"ifi-index": 7}'
And examine the output (heavily snipped to show relevant fields):
'stats': {
'multicast': 3739197,
'rx-bytes': 1201525399,
'rx-packets': 56807158,
'tx-bytes': 492404458,
'tx-packets': 1200285371,
'stats64': {
'multicast': 3739197,
'rx-bytes': 35561263767,
'rx-packets': 56807158,
'tx-bytes': 666212335338,
'tx-packets': 1200285371,
The stats.py test prior to this patch was using the 'stats' structure
above, which matches the failure output on my system.
Comparing side by side, rx-bytes and tx-bytes, and getting ethtool -S
output:
rx-bytes stats: 1201525399
rx-bytes stats64: 35561263767
rx-bytes ethtool: 36203402638
tx-bytes stats: 492404458
tx-bytes stats64: 666212335338
tx-bytes ethtool: 666215360113
Note that the above was taken from a system with an mlx5 NIC, which only
exposes ndo_get_stats64.
Based on the ethtool output and qstat output, it appears that stats.py
should be updated to use the 'stats64' structure for accurate
comparisons when packet/byte counters get very large.
To confirm that this was not related to the qstats code I was iterating
on, I booted a kernel without my driver changes and re-ran the test
which shows the qstats are skipped (as they don't exist for mlx5):
NETIF=eth0 tools/testing/selftests/drivers/net/stats.py
KTAP version 1
1..4
ok 1 stats.check_pause
ok 2 stats.check_fec
ok 3 stats.pkt_byte_sum # SKIP qstats not supported by the device
ok 4 stats.qstat_by_ifindex # SKIP No ifindex supports qstats
But, fetching the stats using the CLI
$ ./cli.py --spec ../../../Documentation/netlink/specs/rt_link.yaml \
--do getlink --json '{"ifi-index": 7}'
Shows the same issue (heavily snipped for relevant fields only):
'stats': {
'multicast': 105489,
'rx-bytes': 530879526,
'rx-packets': 751415,
'tx-bytes': 2510191396,
'tx-packets': 27700323,
'stats64': {
'multicast': 105489,
'rx-bytes': 530879526,
'rx-packets': 751415,
'tx-bytes': 15395093284,
'tx-packets': 27700323,
Comparing side by side with ethtool -S on the unmodified mlx5 driver:
tx-bytes stats: 2510191396
tx-bytes stats64: 15395093284
tx-bytes ethtool: 17718435810
Fixes: f0e6c86e4b ("testing: net-drv: add a driver test for stats reporting")
Signed-off-by: Joe Damato <jdamato@fastly.com>
Link: https://lore.kernel.org/r/20240520235850.190041-1-jdamato@fastly.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Pull more non-mm updates from Andrew Morton:
- A series ("kbuild: enable more warnings by default") from Arnd
Bergmann which enables a number of additional build-time warnings. We
fixed all the fallout which we could find, there may still be a few
stragglers.
- Samuel Holland has developed the series "Unified cross-architecture
kernel-mode FPU API". This does a lot of consolidation of
per-architecture kernel-mode FPU usage and enables the use of newer
AMD GPUs on RISC-V.
- Tao Su has fixed some selftests build warnings in the series
"Selftests: Fix compilation warnings due to missing _GNU_SOURCE
definition".
- This pull also includes a nilfs2 fixup from Ryusuke Konishi.
* tag 'mm-nonmm-stable-2024-05-22-17-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (23 commits)
nilfs2: make block erasure safe in nilfs_finish_roll_forward()
selftests/harness: use 1024 in place of LINE_MAX
Revert "selftests/harness: remove use of LINE_MAX"
selftests/fpu: allow building on other architectures
selftests/fpu: move FP code to a separate translation unit
drm/amd/display: use ARCH_HAS_KERNEL_FPU_SUPPORT
drm/amd/display: only use hard-float, not altivec on powerpc
riscv: add support for kernel-mode FPU
x86: implement ARCH_HAS_KERNEL_FPU_SUPPORT
powerpc: implement ARCH_HAS_KERNEL_FPU_SUPPORT
LoongArch: implement ARCH_HAS_KERNEL_FPU_SUPPORT
lib/raid6: use CC_FLAGS_FPU for NEON CFLAGS
arm64: crypto: use CC_FLAGS_FPU for NEON CFLAGS
arm64: implement ARCH_HAS_KERNEL_FPU_SUPPORT
ARM: crypto: use CC_FLAGS_FPU for NEON CFLAGS
ARM: implement ARCH_HAS_KERNEL_FPU_SUPPORT
arch: add ARCH_HAS_KERNEL_FPU_SUPPORT
x86/fpu: fix asm/fpu/types.h include guard
kbuild: enable -Wcast-function-type-strict unconditionally
kbuild: enable -Wformat-truncation on clang
...