This helper returns a 32bit TCP TSval from skb->tstamp.
As we are going to support usec or ms units soon, rename it
to tcp_skb_timestamp_ts() and add a boolean to select the unit.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In preparation of usec TCP TS support, remove tcp_time_stamp_raw()
in favor of tcp_clock_ts() helper. This helper will return a suitable
32bit result to feed TS values, depending on a socket field.
Also add tcp_tw_tsval() and tcp_rsk_tsval() helpers to factorize
the details.
We do not yet support usec timestamps.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It delivers current TCP time stamp in ms unit, and is used
in place of confusing tcp_time_stamp_raw()
It is the same family than tcp_clock_ns() and tcp_clock_ms().
tcp_time_stamp_raw() will be replaced later for TSval
contexts with a more descriptive name.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In preparation of adding usec TCP TS values, add tcp_time_stamp_ms()
for contexts needing ms based values.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
cookie_init_timestamp() is supposed to return a 64bit timestamp
suitable for both TSval determination and setting of skb->tstamp.
Unfortunately it uses 32bit fields and overflows after
2^32 * 10^6 nsec (~49 days) of uptime.
Generated TSval are still correct, but skb->tstamp might be set
far away in the past, potentially confusing other layers.
tcp_ns_to_ts() is changed to return a full 64bit value,
ts and ts_now variables are changed to u64 type,
and TSMASK is removed in favor of shifts operations.
While we are at it, change this sequence:
ts >>= TSBITS;
ts--;
ts <<= TSBITS;
ts |= options;
to:
ts -= (1UL << TSBITS);
Fixes: 9a568de481 ("tcp: switch TCP TS option (RFC 7323) to 1ms clock")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We want to make the function more generic so that it can be used by
other UDP tunnel implementations such as geneve and vxlan. To do that,
add the following arguments:
- source and destination UDP port;
- ifindex of the output interface, needed by vxlan;
- the tos, because in some cases it is not taken from struct
ip_tunnel_info (for example, when it's inherited from the inner
packet);
- the dst cache, because not all tunnel types (e.g. vxlan) want to
use the one from struct ip_tunnel_info.
With these parameters, the function no longer needs the full struct
ip_tunnel_info as argument and we can pass only the relevant part of
it (struct ip_tunnel_key).
This is similar to what already done for IPv4 in commit 72fc68c635
("ipv4: add new arguments to udp_tunnel_dst_lookup()").
Suggested-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Beniamino Galvani <b.galvani@gmail.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The function is now UDP-specific, the protocol is always IPPROTO_UDP.
This is similar to what already done for IPv4 in commit 78f3655adc
("ipv4: remove "proto" argument from udp_tunnel_dst_lookup()").
Suggested-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Beniamino Galvani <b.galvani@gmail.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
At the moment ip6_dst_lookup_tunnel() is used only by bareudp.
Ideally, other UDP tunnel implementations should use it, but to do so
the function needs to accept new parameters that are specific for UDP
tunnels, such as the ports.
Prepare for these changes by renaming the function to
udp_tunnel6_dst_lookup() and move it to file
net/ipv6/ip6_udp_tunnel.c.
This is similar to what already done for IPv4 in commit bf3fcbf7e7
("ipv4: rename and move ip_route_output_tunnel()").
Suggested-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Beniamino Galvani <b.galvani@gmail.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 26c5334d34 ("ethtool: Add forced speed to supported link
modes maps") added a dependency between ethtool.h and linkmode.h.
The dependency in the opposite direction already exists so the
new code was inserted in an awkward place.
The reason for ethtool.h to include linkmode.h, is that
ethtool_forced_speed_maps_init() is a static inline helper.
That's not really necessary.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Paul Greenwalt <paul.greenwalt@intel.com>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Reproduce environment:
network with 3 VM linuxs is connected as below:
VM1<---->VM2(latest kernel 6.5.0-rc7)<---->VM3
VM1: eth0 ip: 192.168.122.207 MTU 1500
VM2: eth0 ip: 192.168.122.208, eth1 ip: 192.168.123.224 MTU 1500
VM3: eth0 ip: 192.168.123.240 MTU 1500
Reproduce:
VM1 send 1400 bytes UDP data to VM3 using tools scapy with flags=0.
scapy command:
send(IP(dst="192.168.123.240",flags=0)/UDP()/str('0'*1400),count=1,
inter=1.000000)
Result:
Before IP data is sent.
----------------------------------------------------------------------
root@qemux86-64:~# cat /proc/net/snmp
Ip: Forwarding DefaultTTL InReceives InHdrErrors InAddrErrors
ForwDatagrams InUnknownProtos InDiscards InDelivers OutRequests
OutDiscards OutNoRoutes ReasmTimeout ReasmReqds ReasmOKs ReasmFails
FragOKs FragFails FragCreates
Ip: 1 64 11 0 3 4 0 0 4 7 0 0 0 0 0 0 0 0 0
......
----------------------------------------------------------------------
After IP data is sent.
----------------------------------------------------------------------
root@qemux86-64:~# cat /proc/net/snmp
Ip: Forwarding DefaultTTL InReceives InHdrErrors InAddrErrors
ForwDatagrams InUnknownProtos InDiscards InDelivers OutRequests
OutDiscards OutNoRoutes ReasmTimeout ReasmReqds ReasmOKs ReasmFails
FragOKs FragFails FragCreates
Ip: 1 64 12 0 3 5 0 0 4 8 0 0 0 0 0 0 0 0 0
......
----------------------------------------------------------------------
"ForwDatagrams" increase from 4 to 5 and "OutRequests" also increase
from 7 to 8.
Issue description and patch:
IPSTATS_MIB_OUTPKTS("OutRequests") is counted with IPSTATS_MIB_OUTOCTETS
("OutOctets") in ip_finish_output2().
According to RFC 4293, it is "OutOctets" counted with "OutTransmits" but
not "OutRequests". "OutRequests" does not include any datagrams counted
in "ForwDatagrams".
ipSystemStatsOutOctets OBJECT-TYPE
DESCRIPTION
"The total number of octets in IP datagrams delivered to the
lower layers for transmission. Octets from datagrams
counted in ipIfStatsOutTransmits MUST be counted here.
ipSystemStatsOutRequests OBJECT-TYPE
DESCRIPTION
"The total number of IP datagrams that local IP user-
protocols (including ICMP) supplied to IP in requests for
transmission. Note that this counter does not include any
datagrams counted in ipSystemStatsOutForwDatagrams.
So do patch to define IPSTATS_MIB_OUTPKTS to "OutTransmits" and add
IPSTATS_MIB_OUTREQUESTS for "OutRequests".
Add IPSTATS_MIB_OUTREQUESTS counter in __ip_local_out() for ipv4 and add
IPSTATS_MIB_OUT counter in ip6_finish_output2() for ipv6.
Test result with patch:
Before IP data is sent.
----------------------------------------------------------------------
root@qemux86-64:~# cat /proc/net/snmp
Ip: Forwarding DefaultTTL InReceives InHdrErrors InAddrErrors
ForwDatagrams InUnknownProtos InDiscards InDelivers OutRequests
OutDiscards OutNoRoutes ReasmTimeout ReasmReqds ReasmOKs ReasmFails
FragOKs FragFails FragCreates OutTransmits
Ip: 1 64 9 0 5 1 0 0 3 3 0 0 0 0 0 0 0 0 0 4
......
root@qemux86-64:~# cat /proc/net/netstat
......
IpExt: InNoRoutes InTruncatedPkts InMcastPkts OutMcastPkts InBcastPkts
OutBcastPkts InOctets OutOctets InMcastOctets OutMcastOctets
InBcastOctets OutBcastOctets InCsumErrors InNoECTPkts InECT1Pkts
InECT0Pkts InCEPkts ReasmOverlaps
IpExt: 0 0 0 0 0 0 2976 1896 0 0 0 0 0 9 0 0 0 0
----------------------------------------------------------------------
After IP data is sent.
----------------------------------------------------------------------
root@qemux86-64:~# cat /proc/net/snmp
Ip: Forwarding DefaultTTL InReceives InHdrErrors InAddrErrors
ForwDatagrams InUnknownProtos InDiscards InDelivers OutRequests
OutDiscards OutNoRoutes ReasmTimeout ReasmReqds ReasmOKs ReasmFails
FragOKs FragFails FragCreates OutTransmits
Ip: 1 64 10 0 5 2 0 0 3 3 0 0 0 0 0 0 0 0 0 5
......
root@qemux86-64:~# cat /proc/net/netstat
......
IpExt: InNoRoutes InTruncatedPkts InMcastPkts OutMcastPkts InBcastPkts
OutBcastPkts InOctets OutOctets InMcastOctets OutMcastOctets
InBcastOctets OutBcastOctets InCsumErrors InNoECTPkts InECT1Pkts
InECT0Pkts InCEPkts ReasmOverlaps
IpExt: 0 0 0 0 0 0 4404 3324 0 0 0 0 0 10 0 0 0 0
----------------------------------------------------------------------
"ForwDatagrams" increase from 1 to 2 and "OutRequests" is keeping 3.
"OutTransmits" increase from 4 to 5 and "OutOctets" increase 1428.
Signed-off-by: Heng Guo <heng.guo@windriver.com>
Reviewed-by: Kun Song <Kun.Song@windriver.com>
Reviewed-by: Filip Pudak <filip.pudak@windriver.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
We currently push everyone to use padding to align 64b values
in netlink. Un-padded nla_put_u64() doesn't even exist any more.
The story behind this possibly start with this thread:
https://lore.kernel.org/netdev/20121204.130914.1457976839967676240.davem@davemloft.net/
where DaveM was concerned about the alignment of a structure
containing 64b stats. If user space tries to access such struct
directly:
struct some_stats *stats = nla_data(attr);
printf("A: %llu", stats->a);
lack of alignment may become problematic for some architectures.
These days we most often put every single member in a separate
attribute, meaning that the code above would use a helper like
nla_get_u64(), which can deal with alignment internally.
Even for arches which don't have good unaligned access - access
aligned to 4B should be pretty efficient.
Kernel and well known libraries deal with unaligned input already.
Padded 64b is quite space-inefficient (64b + pad means at worst 16B
per attr vs 32b which takes 8B). It is also more typing:
if (nla_put_u64_pad(rsp, NETDEV_A_SOMETHING_SOMETHING,
value, NETDEV_A_SOMETHING_PAD))
Create a new attribute type which will use 32 bits at netlink
level if value is small enough (probably most of the time?),
and (4B-aligned) 64 bits otherwise. Kernel API is just:
if (nla_put_uint(rsp, NETDEV_A_SOMETHING_SOMETHING, value))
Calling this new type "just" sint / uint with no specific size
will hopefully also make people more comfortable with using it.
Currently telling people "don't use u8, you may need the bits,
and netlink will round up to 4B, anyway" is the #1 comment
we give to newcomers.
In terms of netlink layout it looks like this:
0 4 8 12 16
32b: [nlattr][ u32 ]
64b: [ pad ][nlattr][ u64 ]
uint(32) [nlattr][ u32 ]
uint(64) [nlattr][ u64 ]
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since struct devlink_fmsg retains error by now (see 1st patch of this
series), there is no longer need to keep returning it in each call.
This is a separate commit to allow per-driver conversion to stop using
those return values.
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from Jakub Kicinski:
"Including fixes from bluetooth, netfilter, WiFi.
Feels like an up-tick in regression fixes, mostly for older releases.
The hfsc fix, tcp_disconnect() and Intel WWAN fixes stand out as
fairly clear-cut user reported regressions. The mlx5 DMA bug was
causing strife for 390x folks. The fixes themselves are not
particularly scary, tho. No open investigations / outstanding reports
at the time of writing.
Current release - regressions:
- eth: mlx5: perform DMA operations in the right locations, make
devices usable on s390x, again
- sched: sch_hfsc: upgrade 'rt' to 'sc' when it becomes a inner
curve, previous fix of rejecting invalid config broke some scripts
- rfkill: reduce data->mtx scope in rfkill_fop_open, avoid deadlock
- revert "ethtool: Fix mod state of verbose no_mask bitset", needs
more work
Current release - new code bugs:
- tcp: fix listen() warning with v4-mapped-v6 address
Previous releases - regressions:
- tcp: allow tcp_disconnect() again when threads are waiting, it was
denied to plug a constant source of bugs but turns out .NET depends
on it
- eth: mlx5: fix double-free if buffer refill fails under OOM
- revert "net: wwan: iosm: enable runtime pm support for 7560", it's
causing regressions and the WWAN team at Intel disappeared
- tcp: tsq: relax tcp_small_queue_check() when rtx queue contains a
single skb, fix single-stream perf regression on some devices
Previous releases - always broken:
- Bluetooth:
- fix issues in legacy BR/EDR PIN code pairing
- correctly bounds check and pad HCI_MON_NEW_INDEX name
- netfilter:
- more fixes / follow ups for the large "commit protocol" rework,
which went in as a fix to 6.5
- fix null-derefs on netlink attrs which user may not pass in
- tcp: fix excessive TLP and RACK timeouts from HZ rounding (bless
Debian for keeping HZ=250 alive)
- net: more strict VIRTIO_NET_HDR_GSO_UDP_L4 validation, prevent
letting frankenstein UDP super-frames from getting into the stack
- net: fix interface altnames when ifc moves to a new namespace
- eth: qed: fix the size of the RX buffers
- mptcp: avoid sending RST when closing the initial subflow"
* tag 'net-6.6-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (94 commits)
Revert "ethtool: Fix mod state of verbose no_mask bitset"
selftests: mptcp: join: no RST when rm subflow/addr
mptcp: avoid sending RST when closing the initial subflow
mptcp: more conservative check for zero probes
tcp: check mptcp-level constraints for backlog coalescing
selftests: mptcp: join: correctly check for no RST
net: ti: icssg-prueth: Fix r30 CMDs bitmasks
selftests: net: add very basic test for netdev names and namespaces
net: move altnames together with the netdevice
net: avoid UAF on deleted altname
net: check for altname conflicts when changing netdev's netns
net: fix ifname in netlink ntf during netns move
net: ethernet: ti: Fix mixed module-builtin object
net: phy: bcm7xxx: Add missing 16nm EPHY statistics
ipv4: fib: annotate races around nh->nh_saddr_genid and nh->nh_saddr
tcp_bpf: properly release resources on error paths
net/sched: sch_hfsc: upgrade 'rt' to 'sc' when it becomes a inner curve
net: mdio-mux: fix C45 access returning -EIO after API change
tcp: tsq: relax tcp_small_queue_check() when rtx queue contains a single skb
octeon_ep: update BQL sent bytes before ringing doorbell
...
Pull vfs fix from Christian Brauner:
"An openat() call from io_uring triggering an audit call can apparently
cause the refcount of struct filename to be incremented from multiple
threads concurrently during async execution, triggering a refcount
underflow and hitting a BUG_ON(). That bug has been lurking around
since at least v5.16 apparently.
Switch to an atomic counter to fix that. The underflow check is
downgraded from a BUG_ON() to a WARN_ON_ONCE() but we could easily
remove that check altogether tbh"
* tag 'v6.6-rc7.vfs.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
audit,io_uring: io_uring openat triggers audit reference count underflow
We currently have napi_if_scheduled_mark_missed that can be used to
check if napi is scheduled but that does more thing than simply checking
it and return a bool. Some driver already implement custom function to
check if napi is scheduled.
Drop these custom function and introduce napi_is_scheduled that simply
check if napi is scheduled atomically.
Update any driver and code that implement a similar check and instead
use this new helper.
Signed-off-by: Christian Marangi <ansuelsmth@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Do not store bitmask for enabling AUX_SNAPSHOT0. The previous commit
("net: stmmac: fix PPS capture input index") takes care of calculating
the proper bit mask from the request data's extts.index field, which is
0 if not explicitly specified otherwise.
Signed-off-by: Johannes Zink <j.zink@pengutronix.de>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
When CONFIG_IPV6=n, and building with W=1:
In file included from include/trace/define_trace.h:102,
from include/trace/events/neigh.h:255,
from net/core/net-traces.c:51:
include/trace/events/neigh.h: In function ‘trace_event_raw_event_neigh_create’:
include/trace/events/neigh.h:42:34: error: variable ‘pin6’ set but not used [-Werror=unused-but-set-variable]
42 | struct in6_addr *pin6;
| ^~~~
include/trace/trace_events.h:402:11: note: in definition of macro ‘DECLARE_EVENT_CLASS’
402 | { assign; } \
| ^~~~~~
include/trace/trace_events.h:44:30: note: in expansion of macro ‘PARAMS’
44 | PARAMS(assign), \
| ^~~~~~
include/trace/events/neigh.h:23:1: note: in expansion of macro ‘TRACE_EVENT’
23 | TRACE_EVENT(neigh_create,
| ^~~~~~~~~~~
include/trace/events/neigh.h:41:9: note: in expansion of macro ‘TP_fast_assign’
41 | TP_fast_assign(
| ^~~~~~~~~~~~~~
In file included from include/trace/define_trace.h:103,
from include/trace/events/neigh.h:255,
from net/core/net-traces.c:51:
include/trace/events/neigh.h: In function ‘perf_trace_neigh_create’:
include/trace/events/neigh.h:42:34: error: variable ‘pin6’ set but not used [-Werror=unused-but-set-variable]
42 | struct in6_addr *pin6;
| ^~~~
include/trace/perf.h:51:11: note: in definition of macro ‘DECLARE_EVENT_CLASS’
51 | { assign; } \
| ^~~~~~
include/trace/trace_events.h:44:30: note: in expansion of macro ‘PARAMS’
44 | PARAMS(assign), \
| ^~~~~~
include/trace/events/neigh.h:23:1: note: in expansion of macro ‘TRACE_EVENT’
23 | TRACE_EVENT(neigh_create,
| ^~~~~~~~~~~
include/trace/events/neigh.h:41:9: note: in expansion of macro ‘TP_fast_assign’
41 | TP_fast_assign(
| ^~~~~~~~~~~~~~
Indeed, the variable pin6 is declared and initialized unconditionally,
while it is only used and needlessly re-initialized when support for
IPv6 is enabled.
Fix this by dropping the unused variable initialization, and moving the
variable declaration inside the existing section protected by a check
for CONFIG_IPV6.
Fixes: fc651001d2 ("neighbor: Add tracepoint to __neigh_create")
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Simon Horman <horms@kernel.org> # build-tested
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Westphal says:
====================
netfilter next pull request 2023-10-18
This series contains initial netfilter skb drop_reason support, from
myself.
First few patches fix up a few spots to make sure we won't trip
when followup patches embed error numbers in the upper bits
(we already do this in some places).
Then, nftables and bridge netfilter get converted to call kfree_skb_reason
directly to let tooling pinpoint exact location of packet drops,
rather than the existing NF_DROP catchall in nf_hook_slow().
I would like to eventually convert all netfilter modules, but as some
callers cannot deal with NF_STOLEN (notably act_ct), more preparation
work is needed for this.
Last patch gets rid of an ugly 'de-const' cast in nftables.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The need to map Ethtool forced speeds to Ethtool supported link modes is
common among drivers. To support this, add a common structure for forced
speed maps and a function to init them. This is solution was originally
introduced in commit 1d4e4ecccb ("qede: populate supported link modes
maps on module init") for qede driver.
ethtool_forced_speed_maps_init() should be called during driver init
with an array of struct ethtool_forced_speed_map to populate the mapping.
Definitions for maps themselves are left in the driver code, as the sets
of supported link modes may vary between the devices.
Suggested-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Pawel Chmielewski <pawel.chmielewski@intel.com>
Signed-off-by: Paul Greenwalt <paul.greenwalt@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
net_dropmonitor blames core.c:nf_hook_slow.
Add NF_DROP_REASON() helper and use it in nft_do_chain().
The helper releases the skb, so exact drop location becomes
available. Calling code will observe the NF_STOLEN verdict
instead.
Adjust nf_hook_slow so we can embed an erro value wih
NF_STOLEN verdicts, just like we do for NF_DROP.
After this, drop in nftables can be pinpointed to a drop due
to a rule or the chain policy.
Signed-off-by: Florian Westphal <fw@strlen.de>
Make the net pointer stored in possible_net_t structure annotated as
an RCU pointer. Change the access helpers to treat it as such.
Introduce read_pnet_rcu() helper to allow caller to dereference
the net pointer under RCU read lock.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Saeed Mahameed says:
====================
mlx5-updates-2023-10-10
1) Adham Faris, Increase max supported channels number to 256
2) Leon Romanovsky, Allow IPsec soft/hard limits in bytes
3) Shay Drory, Replace global mlx5_intf_lock with
HCA devcom component lock
4) Wei Zhang, Optimize SF creation flow
During SF creation, HCA state gets changed from INVALID to
IN_USE step by step. Accordingly, FW sends vhca event to
driver to inform about this state change asynchronously.
Each vhca event is critical because all related SW/FW
operations are triggered by it.
Currently there is only a single mlx5 general event handler
which not only handles vhca event but many other events.
This incurs huge bottleneck because all events are forced
to be handled in serial manner.
Moreover, all SFs share same table_lock which inevitably
impacts each other when they are created in parallel.
This series will solve this issue by:
1. A dedicated vhca event handler is introduced to eliminate
the mutual impact with other mlx5 events.
2. Max FW threads work queues are employed in the vhca event
handler to fully utilize FW capability.
3. Redesign SF active work logic to completely remove
table_lock.
With above optimization, SF creation time is reduced by 25%,
i.e. from 80s to 60s when creating 100 SFs.
Patches summary:
Patch 1 - implement dedicated vhca event handler with max FW
cmd threads of work queues.
Patch 2 - remove table_lock by redesigning SF active work
logic.
* tag 'mlx5-updates-2023-10-10' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux:
net/mlx5e: Allow IPsec soft/hard limits in bytes
net/mlx5e: Increase max supported channels number to 256
net/mlx5e: Preparations for supporting larger number of channels
net/mlx5e: Refactor mlx5e_rss_init() and mlx5e_rss_free() API's
net/mlx5e: Refactor mlx5e_rss_set_rxfh() and mlx5e_rss_get_rxfh()
net/mlx5e: Refactor rx_res_init() and rx_res_free() APIs
net/mlx5e: Use PTR_ERR_OR_ZERO() to simplify code
net/mlx5: Use PTR_ERR_OR_ZERO() to simplify code
net/mlx5: fix config name in Kconfig parameter documentation
net/mlx5: Remove unused declaration
net/mlx5: Replace global mlx5_intf_lock with HCA devcom component lock
net/mlx5: Refactor LAG peer device lookout bus logic to mlx5 devcom
net/mlx5: Avoid false positive lockdep warning by adding lock_class_key
net/mlx5: Redesign SF active work to remove table_lock
net/mlx5: Parallelize vhca event handling
====================
Link: https://lore.kernel.org/r/20231014171908.290428-1-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Steffen Klassert says:
====================
pull request (net): ipsec 2023-10-17
1) Fix a slab-use-after-free in xfrm_policy_inexact_list_reinsert.
From Dong Chenchen.
2) Fix data-races in the xfrm interfaces dev->stats fields.
From Eric Dumazet.
3) Fix a data-race in xfrm_gen_index.
From Eric Dumazet.
4) Fix an inet6_dev refcount underflow.
From Zhang Changzhong.
5) Check the return value of pskb_trim in esp_remove_trailer
for esp4 and esp6. From Ma Ke.
6) Fix a data-race in xfrm_lookup_with_ifid.
From Eric Dumazet.
* tag 'ipsec-2023-10-17' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec:
xfrm: fix a data-race in xfrm_lookup_with_ifid()
net: ipv4: fix return value check in esp_remove_trailer
net: ipv6: fix return value check in esp_remove_trailer
xfrm6: fix inet6_dev refcount underflow problem
xfrm: fix a data-race in xfrm_gen_index()
xfrm: interface: use DEV_STATS_INC()
net: xfrm: skip policies marked as dead while reinserting policies
====================
Link: https://lore.kernel.org/r/20231017083723.1364940-1-steffen.klassert@secunet.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Provide a new method, mac_get_caps() to get the MAC capabilities for
the specified interface mode. This is for MACs which have special
requirements, such as not supporting half-duplex in certain interface
modes, and will replace the validate() method.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://lore.kernel.org/r/E1qsPk5-009wiX-G5@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The previous patch added accounting and a limit for the number of
dynamically learned FDB entries per bridge. However it did not provide
means to actually configure those bounds or read back the count. This
patch does that.
Two new netlink attributes are added for the accounting and limit of
dynamically learned FDB entries:
- IFLA_BR_FDB_N_LEARNED (RO) for the number of entries accounted for
a single bridge.
- IFLA_BR_FDB_MAX_LEARNED (RW) for the configured limit of entries for
the bridge.
The new attributes are used like this:
# ip link add name br up type bridge fdb_max_learned 256
# ip link add name v1 up master br type veth peer v2
# ip link set up dev v2
# mausezahn -a rand -c 1024 v2
0.01 seconds (90877 packets per second
# bridge fdb | grep -v permanent | wc -l
256
# ip -d link show dev br
13: br: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 [...]
[...] fdb_n_learned 256 fdb_max_learned 256
Signed-off-by: Johannes Nixdorf <jnixdorf-oss@avm.de>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://lore.kernel.org/r/20231016-fdb_limit-v5-3-32cddff87758@avm.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We discovered from packet traces of slow loss recovery on kernels with
the default HZ=250 setting (and min_rtt < 1ms) that after reordering,
when receiving a SACKed sequence range, the RACK reordering timer was
firing after about 16ms rather than the desired value of roughly
min_rtt/4 + 2ms. The problem is largely due to the RACK reorder timer
calculation adding in TCP_TIMEOUT_MIN, which is 2 jiffies. On kernels
with HZ=250, this is 2*4ms = 8ms. The TLP timer calculation has the
exact same issue.
This commit fixes the TLP transmit timer and RACK reordering timer
floor calculation to more closely match the intended 2ms floor even on
kernels with HZ=250. It does this by adding in a new
TCP_TIMEOUT_MIN_US floor of 2000 us and then converting to jiffies,
instead of the current approach of converting to jiffies and then
adding th TCP_TIMEOUT_MIN value of 2 jiffies.
Our testing has verified that on kernels with HZ=1000, as expected,
this does not produce significant changes in behavior, but on kernels
with the default HZ=250 the latency improvement can be large. For
example, our tests show that for HZ=250 kernels at low RTTs this fix
roughly halves the latency for the RACK reorder timer: instead of
mostly firing at 16ms it mostly fires at 8ms.
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Fixes: bb4d991a28 ("tcp: adjust tail loss probe timeout")
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20231015174700.2206872-1-ncardwell.sw@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pull fbdev fixes and cleanups from Helge Deller:
"Various minor fixes, cleanups and annotations for atyfb, sa1100fb,
omapfb, uvesafb and mmp"
* tag 'fbdev-for-6.6-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/linux-fbdev:
fbdev: core: syscopyarea: fix sloppy typing
fbdev: core: cfbcopyarea: fix sloppy typing
fbdev: uvesafb: Call cn_del_callback() at the end of uvesafb_exit()
fbdev: uvesafb: Remove uvesafb_exec() prototype from include/video/uvesafb.h
fbdev: sa1100fb: mark sa1100fb_init() static
fbdev: omapfb: fix some error codes
fbdev: atyfb: only use ioremap_uc() on i386 and ia64
fbdev: mmp: Annotate struct mmp_path with __counted_by
fbdev: mmp: Annotate struct mmphw_ctrl with __counted_by
Kalle Valo says:
====================
wireless-next patches for v6.7
The second pull request for v6.7, with only driver changes this time.
We have now support for mt7925 PCIe and USB variants, few new features
and of course some fixes.
Major changes:
mt76
- mt7925 support
ath12k
- read board data variant name from SMBIOS
wfx
- Remain-On-Channel (ROC) support
* tag 'wireless-next-2023-10-16' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (109 commits)
wifi: rtw89: mac: do bf_monitor only if WiFi 6 chips
wifi: rtw89: mac: set bf_assoc capabilities according to chip gen
wifi: rtw89: mac: set bfee_ctrl() according to chip gen
wifi: rtw89: mac: add registers of MU-EDCA parameters for WiFi 7 chips
wifi: rtw89: mac: generalize register of MU-EDCA switch according to chip gen
wifi: rtw89: mac: update RTS threshold according to chip gen
wifi: rtlwifi: simplify TX command fill callbacks
wifi: hostap: remove unused ioctl function
wifi: atmel: remove unused ioctl function
wifi: rtw89: coex: add annotation __counted_by() to struct rtw89_btc_btf_set_mon_reg
wifi: rtw89: coex: add annotation __counted_by() for struct rtw89_btc_btf_set_slot_table
wifi: rtw89: add EHT radiotap in monitor mode
wifi: rtw89: show EHT rate in debugfs
wifi: rtw89: parse TX EHT rate selected by firmware from RA C2H report
wifi: rtw89: Add EHT rate mask as parameters of RA H2C command
wifi: rtw89: parse EHT information from RX descriptor and PPDU status packet
wifi: radiotap: add bandwidth definition of EHT U-SIG
wifi: rtlwifi: use convenient list_count_nodes()
wifi: p54: Annotate struct p54_cal_database with __counted_by
wifi: brcmfmac: fweh: Add __counted_by for struct brcmf_fweh_queue_item and use struct_size()
...
====================
Link: https://lore.kernel.org/r/20231016143822.880D8C433C8@smtp.kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Daniel Borkmann says:
====================
pull-request: bpf-next 2023-10-16
We've added 90 non-merge commits during the last 25 day(s) which contain
a total of 120 files changed, 3519 insertions(+), 895 deletions(-).
The main changes are:
1) Add missed stats for kprobes to retrieve the number of missed kprobe
executions and subsequent executions of BPF programs, from Jiri Olsa.
2) Add cgroup BPF sockaddr hooks for unix sockets. The use case is
for systemd to reimplement the LogNamespace feature which allows
running multiple instances of systemd-journald to process the logs
of different services, from Daan De Meyer.
3) Implement BPF CPUv4 support for s390x BPF JIT, from Ilya Leoshkevich.
4) Improve BPF verifier log output for scalar registers to better
disambiguate their internal state wrt defaults vs min/max values
matching, from Andrii Nakryiko.
5) Extend the BPF fib lookup helpers for IPv4/IPv6 to support retrieving
the source IP address with a new BPF_FIB_LOOKUP_SRC flag,
from Martynas Pumputis.
6) Add support for open-coded task_vma iterator to help with symbolization
for BPF-collected user stacks, from Dave Marchevsky.
7) Add libbpf getters for accessing individual BPF ring buffers which
is useful for polling them individually, for example, from Martin Kelly.
8) Extend AF_XDP selftests to validate the SHARED_UMEM feature,
from Tushar Vyavahare.
9) Improve BPF selftests cross-building support for riscv arch,
from Björn Töpel.
10) Add the ability to pin a BPF timer to the same calling CPU,
from David Vernet.
11) Fix libbpf's bpf_tracing.h macros for riscv to use the generic
implementation of PT_REGS_SYSCALL_REGS() to access syscall arguments,
from Alexandre Ghiti.
12) Extend libbpf to support symbol versioning for uprobes, from Hengqi Chen.
13) Fix bpftool's skeleton code generation to guarantee that ELF data
is 8 byte aligned, from Ian Rogers.
14) Inherit system-wide cpu_mitigations_off() setting for Spectre v1/v4
security mitigations in BPF verifier, from Yafang Shao.
15) Annotate struct bpf_stack_map with __counted_by attribute to prepare
BPF side for upcoming __counted_by compiler support, from Kees Cook.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (90 commits)
bpf: Ensure proper register state printing for cond jumps
bpf: Disambiguate SCALAR register state output in verifier logs
selftests/bpf: Make align selftests more robust
selftests/bpf: Improve missed_kprobe_recursion test robustness
selftests/bpf: Improve percpu_alloc test robustness
selftests/bpf: Add tests for open-coded task_vma iter
bpf: Introduce task_vma open-coded iterator kfuncs
selftests/bpf: Rename bpf_iter_task_vma.c to bpf_iter_task_vmas.c
bpf: Don't explicitly emit BTF for struct btf_iter_num
bpf: Change syscall_nr type to int in struct syscall_tp_t
net/bpf: Avoid unused "sin_addr_len" warning when CONFIG_CGROUP_BPF is not set
bpf: Avoid unnecessary audit log for CPU security mitigations
selftests/bpf: Add tests for cgroup unix socket address hooks
selftests/bpf: Make sure mount directory exists
documentation/bpf: Document cgroup unix socket address hooks
bpftool: Add support for cgroup unix socket address hooks
libbpf: Add support for cgroup unix socket address hooks
bpf: Implement cgroup sockaddr hooks for unix sockets
bpf: Add bpf_sock_addr_set_sun_path() to allow writing unix sockaddr from bpf
bpf: Propagate modified uaddrlen from cgroup sockaddr programs
...
====================
Link: https://lore.kernel.org/r/20231016204803.30153-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pull kvm fixes from Paolo Bonzini:
"ARM:
- Fix the handling of the phycal timer offset when FEAT_ECV and
CNTPOFF_EL2 are implemented
- Restore the functionnality of Permission Indirection that was
broken by the Fine Grained Trapping rework
- Cleanup some PMU event sharing code
MIPS:
- Fix W=1 build
s390:
- One small fix for gisa to avoid stalls
x86:
- Truncate writes to PMU counters to the counter's width to avoid
spurious overflows when emulating counter events in software
- Set the LVTPC entry mask bit when handling a PMI (to match
Intel-defined architectural behavior)
- Treat KVM_REQ_PMI as a wake event instead of queueing host IRQ work
to kick the guest out of emulated halt
- Fix for loading XSAVE state from an old kernel into a new one
- Fixes for AMD AVIC
selftests:
- Play nice with %llx when formatting guest printf and assert
statements
- Clean up stale test metadata
- Zero-initialize structures in memslot perf test to workaround a
suspected 'may be used uninitialized' false positives from GCC"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (21 commits)
KVM: arm64: timers: Correctly handle TGE flip with CNTPOFF_EL2
KVM: arm64: POR{E0}_EL1 do not need trap handlers
KVM: arm64: Add nPIR{E0}_EL1 to HFG traps
KVM: MIPS: fix -Wunused-but-set-variable warning
KVM: arm64: pmu: Drop redundant check for non-NULL kvm_pmu_events
KVM: SVM: Fix build error when using -Werror=unused-but-set-variable
x86: KVM: SVM: refresh AVIC inhibition in svm_leave_nested()
x86: KVM: SVM: add support for Invalid IPI Vector interception
x86: KVM: SVM: always update the x2avic msr interception
KVM: selftests: Force load all supported XSAVE state in state test
KVM: selftests: Load XSAVE state into untouched vCPU during state test
KVM: selftests: Touch relevant XSAVE state in guest for state test
KVM: x86: Constrain guest-supported xfeatures only at KVM_GET_XSAVE{2}
x86/fpu: Allow caller to constrain xfeatures when copying to uabi buffer
KVM: selftests: Zero-initialize entire test_result in memslot perf test
KVM: selftests: Remove obsolete and incorrect test case metadata
KVM: selftests: Treat %llx like %lx when formatting guest printf
KVM: x86/pmu: Synthesize at most one PMI per VM-exit
KVM: x86: Mask LVTPC when handling a PMI
KVM: x86/pmu: Truncate counter value to allowed width on write
...
Luiz Augusto von Dentz says:
====================
bluetooth pull request for net:
- Fix race when opening vhci device
- Avoid memcmp() out of bounds warning
- Correctly bounds check and pad HCI_MON_NEW_INDEX name
- Fix using memcmp when comparing keys
- Ignore error return for hci_devcd_register() in btrtl
- Always check if connection is alive before deleting
- Fix a refcnt underflow problem for hci_conn
* tag 'for-net-2023-10-13' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth:
Bluetooth: hci_sock: Correctly bounds check and pad HCI_MON_NEW_INDEX name
Bluetooth: avoid memcmp() out of bounds warning
Bluetooth: hci_sock: fix slab oob read in create_monitor_event
Bluetooth: btrtl: Ignore error return for hci_devcd_register()
Bluetooth: hci_event: Fix coding style
Bluetooth: hci_event: Fix using memcmp when comparing keys
Bluetooth: Fix a refcnt underflow problem for hci_conn
Bluetooth: hci_sync: always check if connection is alive before deleting
Bluetooth: Reject connection with the device which has same BD_ADDR
Bluetooth: hci_event: Ignore NULL link key
Bluetooth: ISO: Fix invalid context error
Bluetooth: vhci: Fix race when opening vhci device
====================
Link: https://lore.kernel.org/r/20231014031336.1664558-1-luiz.dentz@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
A few networking drivers including bnx2x, bnxt, qede, and idpf call
tcp_gro_complete as part of offloading TCP GRO. The function is only
defined if CONFIG_INET is true, since its TCP specific and is meaningless
if the kernel lacks IP networking support.
The combination of trying to use the complex network drivers with
CONFIG_NET but not CONFIG_INET is rather unlikely in practice: most use
cases are going to need IP networking.
The tcp_gro_complete function just sets some data in the socket buffer for
use in processing the TCP packet in the event that the GRO was offloaded to
the device. If the kernel lacks TCP support, such setup will simply go
unused.
The bnx2x, bnxt, and qede drivers wrap their TCP offload support in
CONFIG_INET checks and skip handling on such kernels.
The idpf driver did not check CONFIG_INET and thus fails to link if the
kernel is configured with CONFIG_NET=y, CONFIG_IDPF=(m|y), and
CONFIG_INET=n.
While checking CONFIG_INET does allow the driver to bypass significantly
more instructions in the event that we know TCP networking isn't supported,
the configuration is unlikely to be used widely.
Rather than require driver authors to care about this, stub the
tcp_gro_complete function when CONFIG_INET=n. This allows drivers to be
left as-is. It does mean the idpf driver will perform slightly more work
than strictly necessary when CONFIG_INET=n, since it will still execute
some of the skb setup in idpf_rx_rsc. However, that work would be performed
in the case where CONFIG_INET=y anyways.
I did not change the existing drivers, since they appear to wrap a
significant portion of code when CONFIG_INET=n. There is little benefit in
trashing these drivers just to unwrap and remove the CONFIG_INET check.
Using a stub for tcp_gro_complete is still beneficial, as it means future
drivers no longer need to worry about this case of CONFIG_NET=y and
CONFIG_INET=n, which should reduce noise from buildbots that check such a
configuration.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Tested-by: Randy Dunlap <rdunlap@infradead.org> # build-tested
Link: https://lore.kernel.org/r/20231013185502.1473541-1-jacob.e.keller@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
TCP pingpong threshold is 1 by default. But some applications, like SQL DB
may prefer a higher pingpong threshold to activate delayed acks in quick
ack mode for better performance.
The pingpong threshold and related code were changed to 3 in the year
2019 in:
commit 4a41f453be ("tcp: change pingpong threshold to 3")
And reverted to 1 in the year 2022 in:
commit 4d8f24eeed ("Revert "tcp: change pingpong threshold to 3"")
There is no single value that fits all applications.
Add net.ipv4.tcp_pingpong_thresh sysctl tunable, so it can be tuned for
optimal performance based on the application needs.
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/1697056244-21888-1-git-send-email-haiyangz@microsoft.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
uvesafb_exec() is a static function defined and called only in
drivers/video/fbdev/uvesafb.c, remove the prototype from
include/video/uvesafb.h.
Fixes the warning:
./include/video/uvesafb.h:112:12: warning: 'uvesafb_exec' declared 'static' but never defined [-Wunused-function]
when including '<video/uvesafb.h>' in an external program.
Signed-off-by: Jorge Maidana <jorgem.linux@gmail.com>
Signed-off-by: Helge Deller <deller@gmx.de>
Add an initial user for the newly added tcf_set_drop_reason() helper to set the
drop reason for internal errors leading to TC_ACT_SHOT inside {__,}tcf_classify().
Right now this only adds a very basic SKB_DROP_REASON_TC_ERROR as a generic
fallback indicator to mark drop locations. Where needed, such locations can be
converted to more specific codes, for example, when hitting the reclassification
limit, etc.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Victor Nogueira <victor@mojatatu.com>
Link: https://lore.kernel.org/r/20231009092655.22025-2-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently, the kfree_skb_reason() in sch_handle_{ingress,egress}() can only
express a basic SKB_DROP_REASON_TC_INGRESS or SKB_DROP_REASON_TC_EGRESS reason.
Victor kicked-off an initial proposal to make this more flexible by disambiguating
verdict from return code by moving the verdict into struct tcf_result and
letting tcf_classify() return a negative error. If hit, then two new drop
reasons were added in the proposal, that is SKB_DROP_REASON_TC_INGRESS_ERROR
as well as SKB_DROP_REASON_TC_EGRESS_ERROR. Further analysis of the actual
error codes would have required to attach to tcf_classify via kprobe/kretprobe
to more deeply debug skb and the returned error.
In order to make the kfree_skb_reason() in sch_handle_{ingress,egress}() more
extensible, it can be addressed in a more straight forward way, that is: Instead
of placing the verdict into struct tcf_result, we can just put the drop reason
in there, which does not require changes throughout various classful schedulers
given the existing verdict logic can stay as is.
Then, SKB_DROP_REASON_TC_ERROR{,_*} can be added to the enum skb_drop_reason
to disambiguate between an error or an intentional drop. New drop reason error
codes can be added successively to the tc code base.
For internal error locations which have not yet been annotated with a
SKB_DROP_REASON_TC_ERROR{,_*}, the fallback is SKB_DROP_REASON_TC_INGRESS and
SKB_DROP_REASON_TC_EGRESS, respectively. Generic errors could be marked with a
SKB_DROP_REASON_TC_ERROR code until they are converted to more specific ones
if it is found that they would be useful for troubleshooting.
While drop reasons have infrastructure for subsystem specific error codes which
are currently used by mac80211 and ovs, Jakub mentioned that it is preferred
for tc to use the enum skb_drop_reason core codes given it is a better fit and
currently the tooling support is better, too.
With regards to the latter:
[...] I think Alastair (bpftrace) is working on auto-prettifying enums when
bpftrace outputs maps. So we can do something like:
$ bpftrace -e 'tracepoint:skb:kfree_skb { @[args->reason] = count(); }'
Attaching 1 probe...
^C
@[SKB_DROP_REASON_TC_INGRESS]: 2
@[SKB_CONSUMED]: 34
^^^^^^^^^^^^ names!!
Auto-magically. [...]
Add a small helper tcf_set_drop_reason() which can be used to set the drop reason
into the tcf_result.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Victor Nogueira <victor@mojatatu.com>
Link: https://lore.kernel.org/netdev/20231006063233.74345d36@kernel.org
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20231009092655.22025-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We want to make the function more generic so that it can be used by
other UDP tunnel implementations such as geneve and vxlan. To do that,
add the following arguments:
- source and destination UDP port;
- ifindex of the output interface, needed by vxlan;
- the tos, because in some cases it is not taken from struct
ip_tunnel_info (for example, when it's inherited from the inner
packet);
- the dst cache, because not all tunnel types (e.g. vxlan) want to
use the one from struct ip_tunnel_info.
With these parameters, the function no longer needs the full struct
ip_tunnel_info as argument and we can pass only the relevant part of
it (struct ip_tunnel_key).
Suggested-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Beniamino Galvani <b.galvani@gmail.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
At the moment ip_route_output_tunnel() is used only by bareudp.
Ideally, other UDP tunnel implementations should use it, but to do so
the function needs to accept new parameters that are specific for UDP
tunnels, such as the ports.
Prepare for these changes by renaming the function to
udp_tunnel_dst_lookup() and move it to file
net/ipv4/udp_tunnel_core.c.
Suggested-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Beniamino Galvani <b.galvani@gmail.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
On systems with multiple timestamp event channels, some readers might
want to receive only a subset of those channels.
Add the necessary modifications to support timestamp event channel
filtering, including two IOCTL operations:
- Clear all channels
- Enable one channel
The mask modification operations will be applied exclusively on the
event queue assigned to the file descriptor used on the IOCTL operation,
so the typical procedure to have a reader receiving only a subset of the
enabled channels would be:
- Open device file
- ioctl: clear all channels
- ioctl: enable one channel
- start reading
Calling the enable one channel ioctl more than once will result in
multiple enabled channels.
Signed-off-by: Xabier Marquiegui <reibax@gmail.com>
Suggested-by: Richard Cochran <richardcochran@gmail.com>
Suggested-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add the necessary structure to support custom private-data per
posix-clock user.
The previous implementation of posix-clock assumed all file open
instances need access to the same clock structure on private_data.
The need for individual data structures per file open instance has been
identified when developing support for multiple timestamp event queue
users for ptp_clock.
Signed-off-by: Xabier Marquiegui <reibax@gmail.com>
Suggested-by: Richard Cochran <richardcochran@gmail.com>
Suggested-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Syzbot reported two new paths to hit an internal WARNING using the
new virtio gso type VIRTIO_NET_HDR_GSO_UDP_L4.
RIP: 0010:skb_checksum_help+0x4a2/0x600 net/core/dev.c:3260
skb len=64521 gso_size=344
and
RIP: 0010:skb_warn_bad_offload+0x118/0x240 net/core/dev.c:3262
Older virtio types have historically had loose restrictions, leading
to many entirely impractical fuzzer generated packets causing
problems deep in the kernel stack. Ideally, we would have had strict
validation for all types from the start.
New virtio types can have tighter validation. Limit UDP GSO packets
inserted via virtio to the same limits imposed by the UDP_SEGMENT
socket interface:
1. must use checksum offload
2. checksum offload matches UDP header
3. no more segments than UDP_MAX_SEGMENTS
4. UDP GSO does not take modifier flags, notably SKB_GSO_TCP_ECN
Fixes: 860b7f27b8 ("linux/virtio_net.h: Support USO offload in vnet header.")
Reported-by: syzbot+01cdbc31e9c0ae9b33ac@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/netdev/0000000000005039270605eb0b7f@google.com/
Reported-by: syzbot+c99d835ff081ca30f986@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/netdev/0000000000005426680605eb0b9f@google.com/
Signed-off-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>