Daniel Borkmann says:
====================
pull-request: bpf-next 2024-04-29
We've added 147 non-merge commits during the last 32 day(s) which contain
a total of 158 files changed, 9400 insertions(+), 2213 deletions(-).
The main changes are:
1) Add an internal-only BPF per-CPU instruction for resolving per-CPU
memory addresses and implement support in x86 BPF JIT. This allows
inlining per-CPU array and hashmap lookups
and the bpf_get_smp_processor_id() helper, from Andrii Nakryiko.
2) Add BPF link support for sk_msg and sk_skb programs, from Yonghong Song.
3) Optimize x86 BPF JIT's emit_mov_imm64, and add support for various
atomics in bpf_arena which can be JITed as a single x86 instruction,
from Alexei Starovoitov.
4) Add support for passing mark with bpf_fib_lookup helper,
from Anton Protopopov.
5) Add a new bpf_wq API for deferring events and refactor sleepable
bpf_timer code to keep common code where possible,
from Benjamin Tissoires.
6) Fix BPF_PROG_TEST_RUN infra with regards to bpf_dummy_struct_ops programs
to check when NULL is passed for non-NULLable parameters,
from Eduard Zingerman.
7) Harden the BPF verifier's and/or/xor value tracking,
from Harishankar Vishwanathan.
8) Introduce crypto kfuncs to make BPF programs able to utilize the kernel
crypto subsystem, from Vadim Fedorenko.
9) Various improvements to the BPF instruction set standardization doc,
from Dave Thaler.
10) Extend libbpf APIs to partially consume items from the BPF ringbuffer,
from Andrea Righi.
11) Bigger batch of BPF selftests refactoring to use common network helpers
and to drop duplicate code, from Geliang Tang.
12) Support bpf_tail_call_static() helper for BPF programs with GCC 13,
from Jose E. Marchesi.
13) Add bpf_preempt_{disable,enable}() kfuncs in order to allow a BPF
program to have code sections where preemption is disabled,
from Kumar Kartikeya Dwivedi.
14) Allow invoking BPF kfuncs from BPF_PROG_TYPE_SYSCALL programs,
from David Vernet.
15) Extend the BPF verifier to allow different input maps for a given
bpf_for_each_map_elem() helper call in a BPF program, from Philo Lu.
16) Add support for PROBE_MEM32 and bpf_addr_space_cast instructions
for riscv64 and arm64 JITs to enable BPF Arena, from Puranjay Mohan.
17) Shut up a false-positive KMSAN splat in interpreter mode by unpoison
the stack memory, from Martin KaFai Lau.
18) Improve xsk selftest coverage with new tests on maximum and minimum
hardware ring size configurations, from Tushar Vyavahare.
19) Various ReST man pages fixes as well as documentation and bash completion
improvements for bpftool, from Rameez Rehman & Quentin Monnet.
20) Fix libbpf with regards to dumping subsequent char arrays,
from Quentin Deslandes.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (147 commits)
bpf, docs: Clarify PC use in instruction-set.rst
bpf_helpers.h: Define bpf_tail_call_static when building with GCC
bpf, docs: Add introduction for use in the ISA Internet Draft
selftests/bpf: extend BPF_SOCK_OPS_RTT_CB test for srtt and mrtt_us
bpf: add mrtt and srtt as BPF_SOCK_OPS_RTT_CB args
selftests/bpf: dummy_st_ops should reject 0 for non-nullable params
bpf: check bpf_dummy_struct_ops program params for test runs
selftests/bpf: do not pass NULL for non-nullable params in dummy_st_ops
selftests/bpf: adjust dummy_st_ops_success to detect additional error
bpf: mark bpf_dummy_struct_ops.test_1 parameter as nullable
selftests/bpf: Add ring_buffer__consume_n test.
bpf: Add bpf_guard_preempt() convenience macro
selftests: bpf: crypto: add benchmark for crypto functions
selftests: bpf: crypto skcipher algo selftests
bpf: crypto: add skcipher to bpf crypto
bpf: make common crypto API for TC/XDP programs
bpf: update the comment for BTF_FIELDS_MAX
selftests/bpf: Fix wq test.
selftests/bpf: Use make_sockaddr in test_sock_addr
selftests/bpf: Use connect_to_addr in test_sock_addr
...
====================
Link: https://lore.kernel.org/r/20240429131657.19423-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently there is no way for user to set what features the driver
should obey or not, it is hard wired in the code.
In order to be able to debug the device behavior in case some feature is
disabled, introduce a debugfs infrastructure with couple of files
allowing user to see what features the device advertises and
to set filter for features used by driver.
Example:
$cat /sys/bus/virtio/devices/virtio0/features
1110010111111111111101010000110010000000100000000000000000000000
$ echo "5" >/sys/kernel/debug/virtio/virtio0/filter_feature_add
$ cat /sys/kernel/debug/virtio/virtio0/filter_features
5
$ echo "virtio0" > /sys/bus/virtio/drivers/virtio_net/unbind
$ echo "virtio0" > /sys/bus/virtio/drivers/virtio_net/bind
$ cat /sys/bus/virtio/devices/virtio0/features
1110000111111111111101010000110010000000100000000000000000000000
Note that sysfs "features" now already exists, this patch does not
touch it.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Tony Nguyen says:
====================
net: intel: start The Great Code Dedup + Page Pool for iavf
Alexander Lobakin says:
Here's a two-shot: introduce {,Intel} Ethernet common library (libeth and
libie) and switch iavf to Page Pool. Details are in the commit messages;
here's a summary:
Not a secret there's a ton of code duplication between two and more Intel
ethernet modules. Before introducing new changes, which would need to be
copied over again, start decoupling the already existing duplicate
functionality into a new module, which will be shared between several
Intel Ethernet drivers. The first name that came to my mind was
"libie" -- "Intel Ethernet common library". Also this sounds like
"lovelie" (-> one word, no "lib I E" pls) and can be expanded as
"lib Internet Explorer" :P
The "generic", pure-software part is placed separately, so that it can be
easily reused in any driver by any vendor without linking to the Intel
pre-200G guts. In a few words, it's something any modern driver does the
same way, but nobody moved it level up (yet).
The series is only the beginning. From now on, adding every new feature
or doing any good driver refactoring will remove much more lines than add
for quite some time. There's a basic roadmap with some deduplications
planned already, not speaking of that touching every line now asks:
"can I share this?". The final destination is very ambitious: have only
one unified driver for at least i40e, ice, iavf, and idpf with a struct
ops for each generation. That's never gonna happen, right? But you still
can at least try.
PP conversion for iavf lands within the same series as these two are tied
closely. libie will support Page Pool model only, so that a driver can't
use much of the lib until it's converted. iavf is only the example, the
rest will eventually be converted soon on a per-driver basis. That is
when it gets really interesting. Stay tech.
* '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
MAINTAINERS: add entry for libeth and libie
iavf: switch to Page Pool
iavf: pack iavf_ring more efficiently
libeth: add Rx buffer management
page_pool: add DMA-sync-for-CPU inline helper
page_pool: constify some read-only function arguments
slab: introduce kvmalloc_array_node() and kvcalloc_node()
iavf: drop page splitting and recycling
iavf: kill "legacy-rx" for good
net: intel: introduce {, Intel} Ethernet common library
====================
Link: https://lore.kernel.org/r/20240424203559.3420468-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Kalle Valo says:
====================
wireless-next patches for v6.10
The second "new features" pull request for v6.10 with changes both in
stack and in drivers. This time the pull request is rather small and
nothing special standing out except maybe that we have several
kernel-doc fixes. Great to see that we are getting warning free
wireless code (until new warnings are added).
Major changes:
rtl8xxxu:
* enable Management Frame Protection (MFP) support
rtw88:
* disable unsupported interface type of mesh point for all chips, and only
support station mode for SDIO chips.
* tag 'wireless-next-2024-04-24' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (63 commits)
wifi: mac80211: handle link ID during management Tx
wifi: mac80211: handle sdata->u.ap.active flag with MLO
wifi: cfg80211: add return docs for regulatory functions
wifi: cfg80211: make some regulatory functions void
wifi: mac80211: add return docs for sta_info_flush()
wifi: mac80211: keep mac80211 consistent on link activation failure
wifi: mac80211: simplify ieee80211_assign_link_chanctx()
wifi: mac80211: reserve chanctx during find
wifi: cfg80211: fix cfg80211 function kernel-doc
wifi: mac80211_hwsim: Use wider regulatory for custom for 6GHz tests
wifi: iwlwifi: mvm: Don't allow EMLSR when the RSSI is low
wifi: iwlwifi: mvm: disable EMLSR when we suspend with wowlan
wifi: iwlwifi: mvm: get periodic statistics in EMLSR
wifi: iwlwifi: mvm: don't recompute EMLSR mode in can_activate_links
wifi: iwlwifi: mvm: implement EMLSR prevention mechanism.
wifi: iwlwifi: mvm: exit EMLSR upon missed beacon
wifi: iwlwifi: mvm: init vif works only once
wifi: iwlwifi: mvm: Add helper functions to update EMLSR status
wifi: iwlwifi: mvm: Implement new link selection algorithm
wifi: iwlwifi: mvm: move EMLSR/links code
...
====================
Link: https://lore.kernel.org/r/20240424100122.217AEC113CE@smtp.kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pull networking fixes from Jakub Kicinski:
"Including fixes from netfilter, wireless and bluetooth.
Nothing major, regression fixes are mostly in drivers, two more of
those are flowing towards us thru various trees. I wish some of the
changes went into -rc5, we'll try to keep an eye on frequency of PRs
from sub-trees.
Also disproportional number of fixes for bugs added in v6.4, strange
coincidence.
Current release - regressions:
- igc: fix LED-related deadlock on driver unbind
- wifi: mac80211: small fixes to recent clean up of the connection
process
- Revert "wifi: iwlwifi: bump FW API to 90 for BZ/SC devices", kernel
doesn't have all the code to deal with that version, yet
- Bluetooth:
- set power_ctrl_enabled on NULL returned by gpiod_get_optional()
- qca: fix invalid device address check, again
- eth: ravb: fix registered interrupt names
Current release - new code bugs:
- wifi: mac80211: check EHT/TTLM action frame length
Previous releases - regressions:
- fix sk_memory_allocated_{add|sub} for architectures where
__this_cpu_{add|sub}* are not IRQ-safe
- dsa: mv88e6xx: fix link setup for 88E6250
Previous releases - always broken:
- ip: validate dev returned from __in_dev_get_rcu(), prevent possible
null-derefs in a few places
- switch number of for_each_rcu() loops using call_rcu() on the
iterator to for_each_safe()
- macsec: fix isolation of broadcast traffic in presence of offload
- vxlan: drop packets from invalid source address
- eth: mlxsw: trap and ACL programming fixes
- eth: bnxt: PCIe error recovery fixes, fix counting dropped packets
- Bluetooth:
- lots of fixes for the command submission rework from v6.4
- qca: fix NULL-deref on non-serdev suspend
Misc:
- tools: ynl: don't ignore errors in NLMSG_DONE messages"
* tag 'net-6.9-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (88 commits)
af_unix: Suppress false-positive lockdep splat for spin_lock() in __unix_gc().
net: b44: set pause params only when interface is up
tls: fix lockless read of strp->msg_ready in ->poll
dpll: fix dpll_pin_on_pin_register() for multiple parent pins
net: ravb: Fix registered interrupt names
octeontx2-af: fix the double free in rvu_npc_freemem()
net: ethernet: ti: am65-cpts: Fix PTPv1 message type on TX packets
ice: fix LAG and VF lock dependency in ice_reset_vf()
iavf: Fix TC config comparison with existing adapter TC config
i40e: Report MFS in decimal base instead of hex
i40e: Do not use WQ_MEM_RECLAIM flag for workqueue
net: ti: icssg-prueth: Fix signedness bug in prueth_init_rx_chns()
net/mlx5e: Advertise mlx5 ethernet driver updates sk_buff md_dst for MACsec
macsec: Detect if Rx skb is macsec-related for offloading devices that update md_dst
ethernet: Add helper for assigning packet type when dest address does not match device address
macsec: Enable devices to advertise whether they update sk_buff md_dst during offloads
net: phy: dp83869: Fix MII mode failure
netfilter: nf_tables: honor table dormant flag from netdev release event path
eth: bnxt: fix counting packets discarded due to OOM and netpoll
igc: Fix LED-related deadlock on driver unbind
...
Add crypto API support to BPF to be able to decrypt or encrypt packets
in TC/XDP BPF programs. Special care should be taken for initialization
part of crypto algo because crypto alloc) doesn't work with preemtion
disabled, it can be run only in sleepable BPF program. Also async crypto
is not supported because of the very same issue - TC/XDP BPF programs
are not sleepable.
Signed-off-by: Vadim Fedorenko <vadfed@meta.com>
Link: https://lore.kernel.org/r/20240422225024.2847039-2-vadfed@meta.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Now that the IAVF driver simply uses dev_alloc_page() + free_page() with
no custom recycling logics, it can easily be switched to using Page
Pool / libeth API instead.
This allows to removing the whole dancing around headroom, HW buffer
size, and page order. All DMA-for-device is now done in the PP core,
for-CPU -- in the libeth helper.
Use skb_mark_for_recycle() to bring back the recycling and restore the
performance. Speaking of performance: on par with the baseline and
faster with the PP optimization series applied. But the memory usage for
1500b MTU is now almost 2x lower (x86_64) thanks to allocating a page
every second descriptor.
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Add NUMA-aware counterparts for kvmalloc_array() and kvcalloc() to be
able to flexibly allocate arrays for a particular node.
Rewrite kvmalloc_array() to kvmalloc_array_node(NUMA_NO_NODE) call.
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Not a secret there's a ton of code duplication between two and more Intel
ethernet modules.
Before introducing new changes, which would need to be copied over again,
start decoupling the already existing duplicate functionality into a new
module, which will be shared between several Intel Ethernet drivers.
Add the lookup table which converts 8/10-bit hardware packet type into
a parsed bitfield structure for easy checking packet format parameters,
such as payload level, IP version, etc. This is currently used by i40e,
ice and iavf and it's all the same in all three drivers.
The only difference introduced in this implementation is that instead of
defining a 256 (or 1024 in case of ice) element array, add unlikely()
condition to limit the input to 154 (current maximum non-reserved packet
type). There's no reason to waste 600 (or even 3600) bytes only to not
hurt very unlikely exception packets.
The hash computation function now takes payload level directly as a
pkt_hash_type. There's a couple cases when non-IP ptypes are marked as
L3 payload and in the previous versions their hash level would be 2, not
3. But skb_set_hash() only sees difference between L4 and non-L4, thus
this won't change anything at all.
The module is behind the hidden Kconfig symbol, which the drivers will
select when needed. The exports are behind 'LIBIE' namespace to limit
the scope of the functions.
Not that non-HW-specific symbols will live in yet another module,
libeth. This is done to easily distinguish pretty generic code ready
for reusing by any other vendor and/or for moving the layer up from
the code useful in Intel's 1-100G drivers only.
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Introduce two new BPF kfuncs, bpf_preempt_disable and
bpf_preempt_enable. These kfuncs allow disabling preemption in BPF
programs. Nesting is allowed, since the intended use cases includes
building native BPF spin locks without kernel helper involvement. Apart
from that, this can be used to per-CPU data structures for cases where
programs (or userspace) may preempt one or the other. Currently, while
per-CPU access is stable, whether it will be consistent is not
guaranteed, as only migration is disabled for BPF programs.
Global functions are disallowed from being called, but support for them
will be added as a follow up not just preempt kfuncs, but rcu_read_lock
kfuncs as well. Static subprog calls are permitted. Sleepable helpers
and kfuncs are disallowed in non-preemptible regions.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20240424031315.2757363-2-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
It is impossible to use init_dummy_netdev together with alloc_netdev()
as the 'setup' argument.
This is because alloc_netdev() initializes some fields in the net_device
structure, and later init_dummy_netdev() memzero them all. This causes
some problems as reported here:
https://lore.kernel.org/all/20240322082336.49f110cc@kernel.org/
Split the init_dummy_netdev() function in two. Create a new function called
init_dummy_netdev_core() that does not memzero the net_device structure.
Then have init_dummy_netdev() memzero-ing and calling
init_dummy_netdev_core(), keeping the old behaviour.
init_dummy_netdev_core() is the new function that could be called as an
argument for alloc_netdev().
Also, create a helper to allocate and initialize dummy net devices,
leveraging init_dummy_netdev_core() as the setup argument. This function
basically simplify the allocation of dummy devices, by allocating and
initializing it. Freeing the device continue to be done through
free_netdev()
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pavel Begunkov says:
====================
implement io_uring notification (ubuf_info) stacking (net part)
To have per request buffer notifications each zerocopy io_uring send
request allocates a new ubuf_info. However, as an skb can carry only
one uarg, it may force the stack to create many small skbs hurting
performance in many ways.
The patchset implements notification, i.e. an io_uring's ubuf_info
extension, stacking. It attempts to link ubuf_info's into a list,
allowing to have multiple of them per skb.
liburing/examples/send-zerocopy shows up 6 times performance improvement
for TCP with 4KB bytes per send, and levels it with MSG_ZEROCOPY. Without
the patchset it requires much larger sends to utilise all potential.
bytes | before | after (Kqps)
1200 | 195 | 1023
4000 | 193 | 1386
8000 | 154 | 1058
====================
Link: https://lore.kernel.org/all/cover.1713369317.git.asml.silence@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
At the moment an skb can only have one ubuf_info associated with it,
which might be a performance problem for zerocopy sends in cases like
TCP via io_uring. Add a callback for assigning ubuf_info to skb, this
way we will implement smarter assignment later like linking ubuf_info
together.
Note, it's an optional callback, which should be compatible with
skb_zcopy_set(), that's because the net stack might potentially decide
to clone an skb and take another reference to ubuf_info whenever it
wishes. Also, a correct implementation should always be able to bind to
an skb without prior ubuf_info, otherwise we could end up in a situation
when the send would not be able to progress.
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/all/b7918aadffeb787c84c9e72e34c729dc04f3a45d.1713369317.git.asml.silence@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pull nfsd fix from Chuck Lever:
- Fix an NFS/RDMA performance regression in v6.9-rc
* tag 'nfsd-6.9-4' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
Revert "svcrdma: Add Write chunk WRs to the RPC's Send WR chain"
To be able to constify instances of struct ctl_tables it is necessary to
remove ways through which non-const versions are exposed from the
sysctl core.
One of these is the ctl_table_arg member of struct ctl_table_header.
Constify this reference as a prerequisite for the full constification of
struct ctl_table instances.
No functional change.
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull char / misc driver fixes from Greg KH:
"Here are some small char/misc and other driver fixes for 6.9-rc5.
Included in here are the following:
- binder driver fix for reported problem
- speakup crash fix
- mei driver fixes for reported problems
- comdei driver fix
- interconnect driver fixes
- rtsx driver fix
- peci.h kernel doc fix
All of these have been in linux-next for over a week with no reported
problems"
* tag 'char-misc-6.9-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
peci: linux/peci.h: fix Excess kernel-doc description warning
binder: check offset alignment in binder_get_object()
comedi: vmk80xx: fix incomplete endpoint checking
mei: vsc: Unregister interrupt handler for system suspend
Revert "mei: vsc: Call wake_up() in the threaded IRQ handler"
misc: rtsx: Fix rts5264 driver status incorrect when card removed
mei: me: disable RPL-S on SPS and IGN firmwares
speakup: Avoid crash on very long word
interconnect: Don't access req_list while it's being manipulated
interconnect: qcom: x1e80100: Remove inexistent ACV_PERF BCM
Pull block fixes from Jens Axboe:
"Just two minor fixes that should go into the 6.9 kernel release, one
fixing a regression with partition scanning errors, and one fixing a
WARN_ON() that can get triggered if we race with a timer"
* tag 'block-6.9-20240420' of git://git.kernel.dk/linux:
blk-iocost: do not WARN if iocg was already offlined
block: propagate partition scanning errors to the BLKRRPART ioctl
Pull clk fixes from Stephen Boyd:
"A couple clk driver fixes, a build fix, and a deadlock fix:
- Mediatek mt7988 has broken PCIe because the wrong parent is used
- Mediatek clk drivers may deadlock when registering their clks
because the clk provider device is repeatedly runtime PM resumed
and suspended during probe and clk registration.
Resuming the clk provider device deadlocks with an ABBA deadlock
due to genpd_lock and the clk prepare_lock. The fix is to keep the
device runtime resumed while registering clks.
- Another runtime PM related deadlock, this time with disabling
unused clks during late init.
We get an ABBA deadlock where a device is runtime PM resuming (or
suspending) while the disabling of unused clks is happening in
parallel. That runtime PM action calls into the clk framework and
tries to grab the clk prepare_lock while the disabling of unused
clks holds the prepare_lock and is waiting for that runtime PM
action to complete.
The fix is to runtime resume all the clk provider devices before
grabbing the clk prepare_lock during disable unused.
- A build fix to provide an empty devm_clk_rate_exclusive_get()
function when CONFIG_COMMON_CLK=n"
* tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux:
clk: mediatek: mt7988-infracfg: fix clocks for 2nd PCIe port
clk: mediatek: Do a runtime PM get on controllers during probe
clk: Get runtime PM before walking tree for clk_summary
clk: Get runtime PM before walking tree during disable_unused
clk: Initialize struct clk_core kref earlier
clk: Don't hold prepare_lock when calling kref_put()
clk: Remove prepare_lock hold assertion in __clk_release()
clk: Provide !COMMON_CLK dummy for devm_clk_rate_exclusive_get()
Performance regression reported with NFS/RDMA using Omnipath,
bisected to commit e084ee673c ("svcrdma: Add Write chunk WRs to
the RPC's Send WR chain").
Tracing on the server reports:
nfsd-7771 [060] 1758.891809: svcrdma_sq_post_err:
cq.id=205 cid=226 sc_sq_avail=13643/851 status=-12
sq_post_err reports ENOMEM, and the rdma->sc_sq_avail (13643) is
larger than rdma->sc_sq_depth (851). The number of available Send
Queue entries is always supposed to be smaller than the Send Queue
depth. That seems like a Send Queue accounting bug in svcrdma.
As it's getting to be late in the 6.9-rc cycle, revert this commit.
It can be revisited in a subsequent kernel release.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=218743
Fixes: e084ee673c ("svcrdma: Add Write chunk WRs to the RPC's Send WR chain")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Pull bootconfig fixes from Masami Hiramatsu:
- Fix potential static_command_line buffer overrun.
Currently we allocate the memory for static_command_line based on
"boot_command_line", but it will copy "command_line" into it. So we
use the length of "command_line" instead of "boot_command_line" (as
we previously did)
- Use memblock_free_late() in xbc_exit() instead of memblock_free()
after the buddy system is initialized
- Fix a kerneldoc warning
* tag 'bootconfig-fixes-v6.9-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
bootconfig: Fix the kerneldoc of _xbc_exit()
bootconfig: use memblock_free_late to free xbc memory to buddy
init/main.c: Fix potential static_command_line memory overflow
Pull misc fixes from Andrew Morton:
"15 hotfixes. 9 are cc:stable and the remainder address post-6.8 issues
or aren't considered suitable for backporting.
There are a significant number of fixups for this cycle's page_owner
changes (series "page_owner: print stacks and their outstanding
allocations"). Apart from that, singleton changes all over, mainly in
MM"
* tag 'mm-hotfixes-stable-2024-04-18-14-41' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
nilfs2: fix OOB in nilfs_set_de_type
MAINTAINERS: update Naoya Horiguchi's email address
fork: defer linking file vma until vma is fully initialized
mm/shmem: inline shmem_is_huge() for disabled transparent hugepages
mm,page_owner: defer enablement of static branch
Squashfs: check the inode number is not the invalid value of zero
mm,swapops: update check in is_pfn_swap_entry for hwpoison entries
mm/memory-failure: fix deadlock when hugetlb_optimize_vmemmap is enabled
mm/userfaultfd: allow hugetlb change protection upon poison entry
mm,page_owner: fix printing of stack records
mm,page_owner: fix accounting of pages when migrating
mm,page_owner: fix refcount imbalance
mm,page_owner: update metadata for tail pages
userfaultfd: change src_folio after ensuring it's unpinned in UFFDIO_MOVE
mm/madvise: make MADV_POPULATE_(READ|WRITE) handle VM_FAULT_RETRY properly
For selecting what link(s) out of the usable ones
to activate, calculate a grade for a given link.
Calculation of a link grade is done as follows:
1. get the estimated throughput according to the RSSI of the link, this
will be the base grade
2. get the channel load from the BSS Load Element, subtracting the load
caused by us. Apply the factor on the grade.
3. puncturing factor: calculate the percentage of the punctured
subchannels (out of the total subchannels). Apply this on the grade.
The link grading will be used by the link selection mechanism in a later
patch.
Also add KUnit tests for it.
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Reviewed-by: Johannes Berg <johannes.berg@intel.com>
Link: https://msgid.link/20240416134215.a6799dbd5643.If137ca6dc443606c7d8c99ec1fc38b325003a7c1@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Integrate the regulator framework to the PSE framework for enhanced
access to features such as voltage, power measurement, and limits, which
are akin to regulators. Additionally, PSE features like port priorities
could potentially enhance the regulator framework. Note that this
integration introduces some implementation complexity, including wrapper
callbacks, but the potential benefits make it worthwhile.
Regulator are using enable counter with specific behavior.
Two calls to regulator_disable will trigger kernel warnings.
If the counter exceeds one, regulator_disable call won't disable the
PSE PI. These behavior isn't suitable for PSE control.
Added a boolean 'enabled' state to prevent multiple calls to
regulator_enable/disable. These calls will only be called from PSE
framework as it won't have any regulator children, therefore no mutex are
needed to safeguards this boolean.
regulator_get needs the consumer device pointer. Use PSE as regulator
provider and consumer device until we have RJ45 ports represented in
the Kernel.
Signed-off-by: Kory Maincent <kory.maincent@bootlin.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/20240417-feature_poe-v9-10-242293fd1900@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The Power Sourcing Equipment Power Interface (PSE PI) plays a pivotal role
in the architecture of Power over Ethernet (PoE) systems. It is essentially
a blueprint that outlines how one or multiple power sources are connected
to the eight-pin modular jack, commonly known as the Ethernet RJ45 port.
This connection scheme is crucial for enabling the delivery of power
alongside data over Ethernet cables.
This patch adds support for getting the PSE controller node through PSE PI
device subnode.
This supports adds a way to get the PSE PI id from the pse_pi devicetree
subnode of a PSE controller node simply by reading the reg property.
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Kory Maincent <kory.maincent@bootlin.com>
Link: https://lore.kernel.org/r/20240417-feature_poe-v9-7-242293fd1900@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pull networking fixes from Jakub Kicinski:
"A little calmer than usual, probably just the timing of sub-tree PRs.
Including fixes from netfilter.
Current release - regressions:
- inet: bring NLM_DONE out to a separate recv() again, fix user space
which assumes multiple recv()s will happen and gets blocked forever
- drv: mlx5:
- restore mistakenly dropped parts in register devlink flow
- use channel mdev reference instead of global mdev instance for
coalescing
- acquire RTNL lock before RQs/SQs activation/deactivation
Previous releases - regressions:
- net: change maximum number of UDP segments to 128, fix virtio
compatibility with Windows peers
- usb: ax88179_178a: avoid writing the mac address before first
reading
Previous releases - always broken:
- sched: fix mirred deadlock on device recursion
- netfilter:
- br_netfilter: skip conntrack input hook for promisc packets
- fixes removal of duplicate elements in the pipapo set backend
- various fixes for abort paths and error handling
- af_unix: don't peek OOB data without MSG_OOB
- drv: flower: fix fragment flags handling in multiple drivers
- drv: ravb: fix jumbo frames and packet stats accounting
Misc:
- kselftest_harness: fix Clang warning about zero-length format
- tun: limit printing rate when illegal packet received by tun dev"
* tag 'net-6.9-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (46 commits)
net: ethernet: ti: am65-cpsw-nuss: cleanup DMA Channels before using them
net: usb: ax88179_178a: avoid writing the mac address before first reading
net: ravb: Fix RX byte accounting for jumbo packets
net: ravb: Fix GbEth jumbo packet RX checksum handling
net: ravb: Allow RX loop to move past DMA mapping errors
net: ravb: Count packets instead of descriptors in R-Car RX path
net: ethernet: mtk_eth_soc: fix WED + wifi reset
net:usb:qmi_wwan: support Rolling modules
selftests: kselftest_harness: fix Clang warning about zero-length format
net/sched: Fix mirred deadlock on device recursion
netfilter: nf_tables: fix memleak in map from abort path
netfilter: nf_tables: restore set elements when delete set fails
netfilter: nf_tables: missing iterator type in lookup walk
s390/ism: Properly fix receive message buffer allocation
net: dsa: mt7530: fix port mirroring for MT7988 SoC switch
net: dsa: mt7530: fix mirroring frames received on local port
tun: limit printing rate when illegal packet received by tun dev
ice: Fix checking for unsupported keys on non-tunnel device
ice: tc: allow zero flags in parsing tc flower
ice: tc: check src_vsi in case of traffic from VF
...
Pull gpio fixes from Bartosz Golaszewski:
- use -ENOTSUPP consistently in Intel GPIO drivers
- don't include dt-bindings headers in gpio-swnode code
- add missing of device table to gpio-lpc32xx and fix autoloading
* tag 'gpio-fixes-for-v6.9-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux:
gpiolib: swnode: Remove wrong header inclusion
gpio: lpc32xx: fix module autoloading
gpio: crystalcove: Use -ENOTSUPP consistently
gpio: wcove: Use -ENOTSUPP consistently
Commit 4601b4b130 ("block: reopen the device in blkdev_reread_part")
lost the propagation of I/O errors from the low-level read of the
partition table to the user space caller of the BLKRRPART.
Apparently some user space relies on, so restore the propagation. This
isn't exactly pretty as other block device open calls explicitly do not
are about these errors, so add a new BLK_OPEN_STRICT_SCAN to opt into
the error propagation.
Fixes: 4601b4b130 ("block: reopen the device in blkdev_reread_part")
Reported-by: Saranya Muruganandam <saranyamohan@google.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Link: https://lore.kernel.org/r/20240417144743.2277601-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The flags in the software node properties are supposed to be
the GPIO lookup flags, which are provided by gpio/machine.h,
as the software nodes are the kernel internal thing and doesn't
need to rely to any of ABIs.
Fixes: e7f9ff5dc9 ("gpiolib: add support for software nodes")
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
In commit 18ff0bcda6 ("ethtool: add interface to interact with Ethernet
Power Equipment"), the 'pse_control_config' structure was introduced,
housing a single member labeled 'admin_cotrol' responsible for maintaining
the operational state of the PoDL PSE functions.
A noticeable typographical error exists in the naming of this field
('cotrol' should be corrected to 'control'), which this commit aims to
rectify.
Furthermore, with upcoming extensions of this structure to encompass PoE
functionalities, the field is being renamed to 'podl_admin_state' to
distinctly indicate that this state is tailored specifically for PoDL."
Reviewed-by: Oleksij Rempel <o.rempel@pengutronix.de>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Kory Maincent <kory.maincent@bootlin.com>
Link: https://lore.kernel.org/r/20240414-feature_poe-v8-3-e4bf1e860da5@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In order to minimize code size (CONFIG_CC_OPTIMIZE_FOR_SIZE=y),
compiler might choose to make a regular function call (out-of-line) for
shmem_is_huge() instead of inlining it. When transparent hugepages are
disabled (CONFIG_TRANSPARENT_HUGEPAGE=n), it can cause compilation
error.
mm/shmem.c: In function `shmem_getattr':
./include/linux/huge_mm.h:383:27: note: in expansion of macro `BUILD_BUG'
383 | #define HPAGE_PMD_SIZE ({ BUILD_BUG(); 0; })
| ^~~~~~~~~
mm/shmem.c:1148:33: note: in expansion of macro `HPAGE_PMD_SIZE'
1148 | stat->blksize = HPAGE_PMD_SIZE;
To prevent the possible error, always inline shmem_is_huge() when
transparent hugepages are disabled.
Link: https://lkml.kernel.org/r/20240409155407.2322714-1-sumanthk@linux.ibm.com
Signed-off-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ilya Leoshkevich <iii@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tony reported that the Machine check recovery was broken in v6.9-rc1, as
he was hitting a VM_BUG_ON when injecting uncorrectable memory errors to
DRAM.
After some more digging and debugging on his side, he realized that this
went back to v6.1, with the introduction of 'commit 0d206b5d2e
("mm/swap: add swp_offset_pfn() to fetch PFN from swap entry")'. That
commit, among other things, introduced swp_offset_pfn(), replacing
hwpoison_entry_to_pfn() in its favour.
The patch also introduced a VM_BUG_ON() check for is_pfn_swap_entry(), but
is_pfn_swap_entry() never got updated to cover hwpoison entries, which
means that we would hit the VM_BUG_ON whenever we would call
swp_offset_pfn() for such entries on environments with CONFIG_DEBUG_VM
set. Fix this by updating the check to cover hwpoison entries as well,
and update the comment while we are it.
Link: https://lkml.kernel.org/r/20240407130537.16977-1-osalvador@suse.de
Fixes: 0d206b5d2e ("mm/swap: add swp_offset_pfn() to fetch PFN from swap entry")
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Reported-by: Tony Luck <tony.luck@intel.com>
Closes: https://lore.kernel.org/all/Zg8kLSl2yAlA3o5D@agluck-desk3/
Tested-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: <stable@vger.kernel.org> [6.1.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
With the previous change, struct dqs->stall_thrs will be in the hot path
(at queue side), even if DQS is disabled.
The other fields accessed in this function (last_obj_cnt and num_queued)
are in the first cache line, let's move this field (stall_thrs) to the
very first cache line, since there is a hole there.
This does not change the structure size, since it moves an short (2
bytes) to 4-bytes whole in the first cache line.
This is the new structure format now:
struct dql {
unsigned int num_queued;
unsigned int last_obj_cnt;
...
short unsigned int stall_thrs;
/* XXX 2 bytes hole, try to pack */
...
/* --- cacheline 1 boundary (64 bytes) --- */
...
/* Longest stall detected, reported to user */
short unsigned int stall_max;
/* XXX 2 bytes hole, try to pack */
};
Also, read the stall_thrs (now in the very first cache line) earlier,
together with dql->num_queued (also in the first cache line).
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://lore.kernel.org/r/20240411192241.2498631-5-leitao@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When Dynamic Queue Limit (DQL) is set, it always populate stall
information through dql_queue_stall(). However, this information is
only necessary if a stall threshold is set, stored in struct
dql->stall_thrs.
dql_queue_stall() is cheap, but not free, since it does have memory
barriers and so forth.
Do not call dql_queue_stall() if there is no stall threshold set, and
save some CPU cycles.
Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://lore.kernel.org/r/20240411192241.2498631-4-leitao@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The dql_queued() function currently handles both queuing object counts
and populating bitmaps for reporting stalls.
This commit splits the bitmap population into a separate function,
allowing for conditional invocation in scenarios where the feature is
disabled.
This refactor maintains functionality while improving code
organization.
Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://lore.kernel.org/r/20240411192241.2498631-3-leitao@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
If the dql_queued() function receives an invalid argument, WARN about it
and continue, instead of crashing the kernel.
This was raised by checkpatch, when I am refactoring this code (see
following patch/commit)
WARNING: Do not crash the kernel unless it is absolutely unavoidable--use WARN_ON_ONCE() plus recovery code (if feasible) instead of BUG() or variants
Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://lore.kernel.org/r/20240411192241.2498631-2-leitao@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pull locking fix from Ingo Molnar:
"Fix a PREEMPT_RT build bug"
* tag 'locking-urgent-2024-04-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
locking: Make rwsem_assert_held_write_nolockdep() build with PREEMPT_RT=y