Commit Graph

1156694 Commits

Author SHA1 Message Date
Xin Long
67fc5d7ffb net: extract nf_ct_skb_network_trim function to nf_conntrack_ovs
There are almost the same code in ovs_skb_network_trim() and
tcf_ct_skb_network_trim(), this patch extracts them into a function
nf_ct_skb_network_trim() and moves the function to nf_conntrack_ovs.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Aaron Conole <aconole@redhat.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-10 16:23:03 -08:00
Xin Long
c0c3ab63de net: create nf_conntrack_ovs for ovs and tc use
Similar to nf_nat_ovs created by Commit ebddb14049 ("net: move the
nat function to nf_nat_ovs for ovs and tc"), this patch is to create
nf_conntrack_ovs to get these functions shared by OVS and TC only.

There are nf_ct_helper() and nf_ct_add_helper() from nf_conntrak_helper
in this patch, and will be more in the following patches.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Aaron Conole <aconole@redhat.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-10 16:23:03 -08:00
Jakub Kicinski
025a785ff0 net: skbuff: drop the word head from skb cache
skbuff_head_cache is misnamed (perhaps for historical reasons?)
because it does not hold heads. Head is the buffer which skb->data
points to, and also where shinfo lives. struct sk_buff is a metadata
structure, not the head.

Eric recently added skb_small_head_cache (which allocates actual
head buffers), let that serve as an excuse to finally clean this up :)

Leave the user-space visible name intact, it could possibly be uAPI.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-02-10 09:10:28 +00:00
David S. Miller
01a7ee36e7 Merge branch 'net-ipa-GSI'
Alex Elder says:

====================
net: ipa: prepare for GSI register updtaes

An upcoming series (or two) will convert the definitions of GSI
registers used by IPA so they use the "IPA reg" mechanism to specify
register offsets and their fields.  This will simplify implementing
the fairly large number of changes required in GSI registers to
support more than 32 GSI channels (introduced in IPA v5.0).

A few minor problems and inconsistencies were found, and they're
fixed here.  The last three patches in this series change the
"ipa_reg" code to separate the IPA-specific part (the base virtual
address, basically) from the generic register part, and the now-
generic code is renamed to use just "reg_" or "REG_" as a prefix
rather than "ipa_reg" or "IPA_REG_".
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2023-02-10 08:06:33 +00:00
Alex Elder
f1470fd790 net: ipa: generalize register field functions
Rename functions related to register fields so they don't appear to
be IPA-specific, and move their definitions into "reg.h":
    ipa_reg_fmask()	-> reg_fmask()
    ipa_reg_bit()	-> reg_bit()
    ipa_reg_field_max()	-> reg_field_max()
    ipa_reg_encode()	-> reg_encode()
    ipa_reg_decode()	-> reg_decode()

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-02-10 08:06:32 +00:00
Alex Elder
fc4cecf706 net: ipa: generalize register offset functions
Rename ipa_reg_offset() to be reg_offset() and move its definition
to "reg.h".  Rename ipa_reg_n_offset() to be reg_n_offset() also.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-02-10 08:06:32 +00:00
Alex Elder
81772e444d net: ipa: start generalizing "ipa_reg"
IPA register definitions have evolved with each new version.  The
changes required to support more than 32 endpoints in IPA v5.0 made
it best to define a unified mechanism for defining registers and
their fields.

GSI register definitions, meanwhile, have remained fairly stable.
And even as the total number of IPA endpoints goes beyond 32, the
number of GSI channels on a given EE that underly endpoints still
remains 32 or less.

Despite that, GSI v3.0 (which is used with IPA v5.0) extends the
number of channels (and events) it supports to be about 256, and as
a result, many GSI register definitions must change significantly.
To address this, we'll use the same "ipa_reg" mechanism to define
the GSI registers.

As a first step in generalizing the "ipa_reg" to also support GSI
registers, isolate the definitions of the "ipa_reg" and "ipa_regs"
structure types (and some supporting macros) into a new header file,
and remove the "ipa_" and "IPA_" from symbol names.

Separate the IPA register ID validity checking from the generic
check that a register ID is in range.  Aside from that, this is
intended to have no functional effect on the code.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-02-10 08:06:32 +00:00
Alex Elder
0ec573ef2a net: ipa: GSI register cleanup
Move some static inline function definitions out of "gsi_reg.h" and
into "gsi.c", which is the only place they're used.  Rename them so
their names identify the register they're associated with.

Move the gsi_channel_type enumerated type definition below the
offset and field definitions for the CH_C_CNTXT_0 register where
it's used.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-02-10 08:06:32 +00:00
Alex Elder
c5ebba75c7 net: ipa: use bitmasks for GSI IRQ values
There are seven GSI interrupt types that can be signaled by a single
GSI IRQ.  These are represented in a bitmask, and the gsi_irq_type_id
enumerated type defines what each bit position represents.

Similarly, the global and general GSI interrupt types each has a set
of conditions it signals, and both types have an enumerated type
that defines which bit that represents each condition.

When used, these enumerated values are passed as an argument to BIT()
in *all* cases.  So clean up the code a little bit by defining the
enumerated type values as one-bit masks rather than bit positions.

Rename gsi_general_id to be gsi_general_irq_id for consistency.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-02-10 08:06:32 +00:00
Alex Elder
d86603e940 net: ipa: tighten up IPA register validity checking
When checking the validity of an IPA register ID, compare it against
all possible ipa_reg_id values.

Rename the function ipa_reg_id_valid() to be specific about what's
being checked.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-02-10 08:06:32 +00:00
Alex Elder
3aac8ec1c0 net: ipa: add some new IPA versions
Soon IPA v5.0+ will be supported, and when that happens we will be
able to enable support for the SDX65 (IPA v5.0), SM8450 (IPA v5.1),
and SM8550 (IPA v5.5).

Fix the comment about the GSI version used for IPA v3.1.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-02-10 08:06:31 +00:00
Alex Elder
38028e6f39 net: ipa: get rid of ipa->reg_addr
The reg_addr field in the IPA structure is set but never used.
Get rid of it.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-02-10 08:06:31 +00:00
Alex Elder
2df181f09c net: ipa: generic command param fix
Starting at IPA v4.11, the GSI_GENERIC_COMMAND GSI register got a
new PARAMS field.  The code that encodes a value into that field
sets it unconditionally, which is wrong.

We currently only provide 0 as the field's value, so this error has
no real effect.  Still, it's a bug, so let's fix it.

Fix an (unrelated) incorrect comment as well.  Fields in the
ERROR_LOG GSI register actually *are* defined for IPA versions
prior to v3.5.1.

Fixes: fe68c43ce3 ("net: ipa: support enhanced channel flow control")
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-02-10 08:06:31 +00:00
Horatiu Vultur
47400aaea4 net: microchip: vcap: Add tc flower keys for lan966x
Add the following TC flower filter keys to lan966x for IS2:
- ipv4_addr (sip and dip)
- ipv6_addr (sip and dip)
- control (IPv4 fragments)
- portnum (tcp and udp port numbers)
- basic (L3 and L4 protocol)
- vlan (outer vlan tag info)
- tcp (tcp flags)
- ip (tos field)

As the parsing of these keys is similar between lan966x and sparx5, move
the code in a separate file to be shared by these 2 chips. And put the
specific parsing outside of the common functions.

Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-02-10 08:03:09 +00:00
David S. Miller
21119e2c6e Merge tag 'rxrpc-next-20230208' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
David Howells says:

====================
rxrpc development

Here are some miscellaneous changes for rxrpc:

 (1) Use consume_skb() rather than kfree_skb_reason().

 (2) Fix unnecessary waking when poking and already-poked call.

 (3) Add ack.rwind to the rxrpc_tx_ack tracepoint as this indicates how
     many incoming DATA packets we're telling the peer that we are
     currently willing to accept on this call.

 (4) Reduce duplicate ACK transmission.  We send ACKs to let the peer know
     that we're increasing the receive window (ack.rwind) as we consume
     packets locally.  Normal ACK transmission is triggered in three places
     and that leads to duplicates being sent.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2023-02-10 08:00:05 +00:00
Clément Léger
dc8c413201 net: pcs: rzn1-miic: remove unused struct members and use miic variable
Remove unused bulk clocks struct from the miic state and use an already
existing miic variable in miic_config().

Signed-off-by: Clément Léger <clement.leger@bootlin.com>
Link: https://lore.kernel.org/r/20230208161249.329631-1-clement.leger@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-09 22:47:16 -08:00
Andy Shevchenko
5c72b4c644 openvswitch: Use string_is_terminated() helper
Use string_is_terminated() helper instead of cpecific memchr() call.
This shows better the intention of the call.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20230208133153.22528-3-andriy.shevchenko@linux.intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-09 22:30:24 -08:00
Andy Shevchenko
d4545bf9c3 genetlink: Use string_is_terminated() helper
Use string_is_terminated() helper instead of cpecific memchr() call.
This shows better the intention of the call.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20230208133153.22528-2-andriy.shevchenko@linux.intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-09 22:30:24 -08:00
Andy Shevchenko
f1db99c07b string_helpers: Move string_is_valid() to the header
Move string_is_valid() to the header for wider use.

While at it, rename to string_is_terminated() to be
precise about its semantics.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20230208133153.22528-1-andriy.shevchenko@linux.intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-09 22:30:24 -08:00
Horatiu Vultur
a136391ae4 net: micrel: Cable Diagnostics feature for lan8841 PHY
Add support for cable diagnostics in lan8841 PHY. It has the same
registers layout as lan8814 PHY, therefore reuse the functionality.

Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/20230208114406.1666671-1-horatiu.vultur@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-09 22:29:52 -08:00
Huanhuan Wang
436396f26d nfp: support IPsec offloading for NFP3800
Add IPsec offloading support for NFP3800. Include data
plane and control plane.

Data plane: add IPsec packet process flow in NFP3800
datapath (NFDk).

Control plane: add an algorithm support distinction flow
in xfrm hook function xdo_dev_state_add(), as NFP3800 has
a different set of IPsec algorithm support.

This matches existing support for the NFP6000/NFP4000 and
their NFD3 datapath.

In addition, fixup the md_bytes calculation for NFD3 datapath
to make sure the two datapahts are keept in sync.

Signed-off-by: Huanhuan Wang <huanhuan.wang@corigine.com>
Reviewed-by: Niklas Söderlund <niklas.soderlund@corigine.com>
Signed-off-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Link: https://lore.kernel.org/r/20230208091000.4139974-1-simon.horman@corigine.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-09 22:22:19 -08:00
Jakub Kicinski
2894d35309 Merge branch 'net-introduce-rps_default_mask'
Paolo Abeni says:

====================
net: introduce rps_default_mask

Real-time setups try hard to ensure proper isolation between time
critical applications and e.g. network processing performed by the
network stack in softirq and RPS is used to move the softirq
activity away from the isolated core.

If the network configuration is dynamic, with netns and devices
routinely created at run-time, enforcing the correct RPS setting
on each newly created device allowing to transient bad configuration
became complex.

Additionally, when multi-queue devices are involved, configuring rps
in user-space on each queue easily becomes very expensive, e.g.
some setups use veths with per cpu queues.

These series try to address the above, introducing a new
sysctl knob: rps_default_mask. The new sysctl entry allows
configuring a netns-wide RPS mask, to be enforced since receive
queue creation time without any fourther per device configuration
required.

Additionally, a simple self-test is introduced to check the
rps_default_mask behavior.
====================

Link: https://lore.kernel.org/r/cover.1675789134.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-09 17:45:57 -08:00
Paolo Abeni
c12e0d5f26 self-tests: introduce self-tests for RPS default mask
Ensure that RPS default mask changes take place on
all newly created netns/devices and don't affect
existing ones.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-09 17:45:55 -08:00
Paolo Abeni
605cfa1b10 net: introduce default_rps_mask netns attribute
If RPS is enabled, this allows configuring a default rps
mask, which is effective since receive queue creation time.

A default RPS mask allows the system admin to ensure proper
isolation, avoiding races at network namespace or device
creation time.

The default RPS mask is initially empty, and can be
modified via a newly added sysctl entry.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-09 17:45:55 -08:00
Paolo Abeni
370ca718fd net-sysctl: factor-out rpm mask manipulation helpers
Will simplify the following patch. No functional change
intended.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-09 17:45:55 -08:00
Paolo Abeni
135746c61f net-sysctl: factor out cpumask parsing helper
Will be used by the following patch to avoid code
duplication. No functional changes intended.

The only difference is that now flow_limit_cpu_sysctl() will
always compute the flow limit mask on each read operation,
even when read() will not return any byte to user-space.

Note that the new helper is placed under a new #ifdef at
the file start to better fit the usage in the later patch

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-09 17:45:55 -08:00
Jakub Kicinski
8697a258ae Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
net/devlink/leftover.c / net/core/devlink.c:
  565b4824c3 ("devlink: change port event netdev notifier from per-net to global")
  f05bd8ebeb ("devlink: move code to a dedicated directory")
  687125b579 ("devlink: split out core code")
https://lore.kernel.org/all/20230208094657.379f2b1a@canb.auug.org.au/

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-09 12:25:40 -08:00
Eric Dumazet
0b34d68049 net: enable usercopy for skb_small_head_cache
syzbot and other bots reported that we have to enable
user copy to/from skb->head. [1]

We can prevent access to skb_shared_info, which is a nice
improvement over standard kmem_cache.

Layout of these kmem_cache objects is:

< SKB_SMALL_HEAD_HEADROOM >< struct skb_shared_info >

usercopy: Kernel memory overwrite attempt detected to SLUB object 'skbuff_small_head' (offset 32, size 20)!
------------[ cut here ]------------
kernel BUG at mm/usercopy.c:102 !
invalid opcode: 0000 [#1] PREEMPT SMP KASAN
CPU: 1 PID: 1 Comm: swapper/0 Not tainted 6.2.0-rc6-syzkaller-01425-gcb6b2e11a42d #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/12/2023
RIP: 0010:usercopy_abort+0xbd/0xbf mm/usercopy.c:102
Code: e8 ee ad ba f7 49 89 d9 4d 89 e8 4c 89 e1 41 56 48 89 ee 48 c7 c7 20 2b 5b 8a ff 74 24 08 41 57 48 8b 54 24 20 e8 7a 17 fe ff <0f> 0b e8 c2 ad ba f7 e8 7d fb 08 f8 48 8b 0c 24 49 89 d8 44 89 ea
RSP: 0000:ffffc90000067a48 EFLAGS: 00010286
RAX: 000000000000006b RBX: ffffffff8b5b6ea0 RCX: 0000000000000000
RDX: ffff8881401c0000 RSI: ffffffff8166195c RDI: fffff5200000cf3b
RBP: ffffffff8a5b2a60 R08: 000000000000006b R09: 0000000000000000
R10: 0000000080000000 R11: 0000000000000000 R12: ffffffff8bf2a925
R13: ffffffff8a5b29a0 R14: 0000000000000014 R15: ffffffff8a5b2960
FS: 0000000000000000(0000) GS:ffff8880b9900000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 000000000c48e000 CR4: 00000000003506e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
__check_heap_object+0xdd/0x110 mm/slub.c:4761
check_heap_object mm/usercopy.c:196 [inline]
__check_object_size mm/usercopy.c:251 [inline]
__check_object_size+0x1da/0x5a0 mm/usercopy.c:213
check_object_size include/linux/thread_info.h:199 [inline]
check_copy_size include/linux/thread_info.h:235 [inline]
copy_from_iter include/linux/uio.h:186 [inline]
copy_from_iter_full include/linux/uio.h:194 [inline]
memcpy_from_msg include/linux/skbuff.h:3977 [inline]
qrtr_sendmsg+0x65f/0x970 net/qrtr/af_qrtr.c:965
sock_sendmsg_nosec net/socket.c:722 [inline]
sock_sendmsg+0xde/0x190 net/socket.c:745
say_hello+0xf6/0x170 net/qrtr/ns.c:325
qrtr_ns_init+0x220/0x2b0 net/qrtr/ns.c:804
qrtr_proto_init+0x59/0x95 net/qrtr/af_qrtr.c:1296
do_one_initcall+0x141/0x790 init/main.c:1306
do_initcall_level init/main.c:1379 [inline]
do_initcalls init/main.c:1395 [inline]
do_basic_setup init/main.c:1414 [inline]
kernel_init_freeable+0x6f9/0x782 init/main.c:1634
kernel_init+0x1e/0x1d0 init/main.c:1522
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:308
</TASK>

Fixes: bf9f1baa27 ("net: add dedicated kmem_cache for typical/small skb->head")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Tested-by: Ido Schimmel <idosch@nvidia.com>
Reported-by: Linux Kernel Functional Testing <lkft@linaro.org>
Tested-by: Linux Kernel Functional Testing <lkft@linaro.org>
Link: https://lore.kernel.org/linux-next/CA+G9fYs-i-c2KTSA7Ai4ES_ZESY1ZnM=Zuo8P1jN00oed6KHMA@mail.gmail.com
Link: https://lore.kernel.org/r/20230208142508.3278406-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-09 09:57:23 -08:00
Linus Torvalds
35674e7875 Merge tag 'net-6.2-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
 "Including fixes from can and ipsec subtrees.

  Current release - regressions:

   - sched: fix off by one in htb_activate_prios()

   - eth: mana: fix accessing freed irq affinity_hint

   - eth: ice: fix out-of-bounds KASAN warning in virtchnl

  Current release - new code bugs:

   - eth: mtk_eth_soc: enable special tag when any MAC uses DSA

  Previous releases - always broken:

   - core: fix sk->sk_txrehash default

   - neigh: make sure used and confirmed times are valid

   - mptcp: be careful on subflow status propagation on errors

   - xfrm: prevent potential spectre v1 gadget in xfrm_xlate32_attr()

   - phylink: move phy_device_free() to correctly release phy device

   - eth: mlx5:
      - fix crash unsetting rx-vlan-filter in switchdev mode
      - fix hang on firmware reset
      - serialize module cleanup with reload and remove"

* tag 'net-6.2-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (57 commits)
  selftests: forwarding: lib: quote the sysctl values
  net: mscc: ocelot: fix all IPv6 getting trapped to CPU when PTP timestamping is used
  rds: rds_rm_zerocopy_callback() use list_first_entry()
  net: txgbe: Update support email address
  selftests: Fix failing VXLAN VNI filtering test
  selftests: mptcp: stop tests earlier
  selftests: mptcp: allow more slack for slow test-case
  mptcp: be careful on subflow status propagation on errors
  mptcp: fix locking for in-kernel listener creation
  mptcp: fix locking for setsockopt corner-case
  mptcp: do not wait for bare sockets' timeout
  net: ethernet: mtk_eth_soc: fix DSA TX tag hwaccel for switch port 0
  nfp: ethtool: fix the bug of setting unsupported port speed
  txhash: fix sk->sk_txrehash default
  net: ethernet: mtk_eth_soc: fix wrong parameters order in __xdp_rxq_info_reg()
  net: ethernet: mtk_eth_soc: enable special tag when any MAC uses DSA
  net: sched: sch: Fix off by one in htb_activate_prios()
  igc: Add ndo_tx_timeout support
  net: mana: Fix accessing freed irq affinity_hint
  hv_netvsc: Allocate memory in netvsc_dma_map() with GFP_ATOMIC
  ...
2023-02-09 09:17:38 -08:00
Linus Torvalds
0b028189d1 Merge tag 'for-linus-2023020901' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid
Pull HID fixes from Benjamin Tissoires:

 - fix potential infinite loop with a badly crafted HID device (Xin
   Zhao)

 - fix regression from 6.1 in USB logitech devices potentially making
   their mouse wheel not working (Bastien Nocera)

 - clean up in AMD sensors, which fixes a long time resume bug (Mario
   Limonciello)

 - few device small fixes and quirks

* tag 'for-linus-2023020901' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid:
  HID: Ignore battery for ELAN touchscreen 29DF on HP
  HID: amd_sfh: if no sensors are enabled, clean up
  HID: logitech: Disable hi-res scrolling on USB
  HID: core: Fix deadloop in hid_apply_multiplier.
  HID: Ignore battery for Elan touchscreen on Asus TP420IA
  HID: elecom: add support for TrackBall 056E:011C
2023-02-09 09:09:13 -08:00
Linus Torvalds
94a1f56db6 Merge tag '6.2-rc8-smb3-client-fix' of git://git.samba.org/sfrench/cifs-2.6
Pull cifx fix from Steve French:
 "Small fix for use after free"

* tag '6.2-rc8-smb3-client-fix' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: Fix use-after-free in rdata->read_into_pages()
2023-02-09 09:00:26 -08:00
Hangbin Liu
3a082086aa selftests: forwarding: lib: quote the sysctl values
When set/restore sysctl value, we should quote the value as some keys
may have multi values, e.g. net.ipv4.ping_group_range

Fixes: f5ae57784b ("selftests: forwarding: lib: Add sysctl_set(), sysctl_restore()")
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Link: https://lore.kernel.org/r/20230208032110.879205-1-liuhangbin@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-02-09 11:05:38 +01:00
Vladimir Oltean
2fcde9fe25 net: mscc: ocelot: fix all IPv6 getting trapped to CPU when PTP timestamping is used
While running this selftest which usually passes:

~/selftests/drivers/net/dsa# ./local_termination.sh eno0 swp0
TEST: swp0: Unicast IPv4 to primary MAC address                     [ OK ]
TEST: swp0: Unicast IPv4 to macvlan MAC address                     [ OK ]
TEST: swp0: Unicast IPv4 to unknown MAC address                     [ OK ]
TEST: swp0: Unicast IPv4 to unknown MAC address, promisc            [ OK ]
TEST: swp0: Unicast IPv4 to unknown MAC address, allmulti           [ OK ]
TEST: swp0: Multicast IPv4 to joined group                          [ OK ]
TEST: swp0: Multicast IPv4 to unknown group                         [ OK ]
TEST: swp0: Multicast IPv4 to unknown group, promisc                [ OK ]
TEST: swp0: Multicast IPv4 to unknown group, allmulti               [ OK ]
TEST: swp0: Multicast IPv6 to joined group                          [ OK ]
TEST: swp0: Multicast IPv6 to unknown group                         [ OK ]
TEST: swp0: Multicast IPv6 to unknown group, promisc                [ OK ]
TEST: swp0: Multicast IPv6 to unknown group, allmulti               [ OK ]

if I start PTP timestamping then run it again (debug prints added by me),
the unknown IPv6 MC traffic is seen by the CPU port even when it should
have been dropped:

~/selftests/drivers/net/dsa# ptp4l -i swp0 -2 -P -m
ptp4l[225.410]: selected /dev/ptp1 as PTP clock
[  225.445746] mscc_felix 0000:00:00.5: ocelot_l2_ptp_trap_add: port 0 adding L2 PTP trap
[  225.453815] mscc_felix 0000:00:00.5: ocelot_ipv4_ptp_trap_add: port 0 adding IPv4 PTP event trap
[  225.462703] mscc_felix 0000:00:00.5: ocelot_ipv4_ptp_trap_add: port 0 adding IPv4 PTP general trap
[  225.471768] mscc_felix 0000:00:00.5: ocelot_ipv6_ptp_trap_add: port 0 adding IPv6 PTP event trap
[  225.480651] mscc_felix 0000:00:00.5: ocelot_ipv6_ptp_trap_add: port 0 adding IPv6 PTP general trap
ptp4l[225.488]: port 1: INITIALIZING to LISTENING on INIT_COMPLETE
ptp4l[225.488]: port 0: INITIALIZING to LISTENING on INIT_COMPLETE
^C
~/selftests/drivers/net/dsa# ./local_termination.sh eno0 swp0
TEST: swp0: Unicast IPv4 to primary MAC address                     [ OK ]
TEST: swp0: Unicast IPv4 to macvlan MAC address                     [ OK ]
TEST: swp0: Unicast IPv4 to unknown MAC address                     [ OK ]
TEST: swp0: Unicast IPv4 to unknown MAC address, promisc            [ OK ]
TEST: swp0: Unicast IPv4 to unknown MAC address, allmulti           [ OK ]
TEST: swp0: Multicast IPv4 to joined group                          [ OK ]
TEST: swp0: Multicast IPv4 to unknown group                         [ OK ]
TEST: swp0: Multicast IPv4 to unknown group, promisc                [ OK ]
TEST: swp0: Multicast IPv4 to unknown group, allmulti               [ OK ]
TEST: swp0: Multicast IPv6 to joined group                          [ OK ]
TEST: swp0: Multicast IPv6 to unknown group                         [FAIL]
        reception succeeded, but should have failed
TEST: swp0: Multicast IPv6 to unknown group, promisc                [ OK ]
TEST: swp0: Multicast IPv6 to unknown group, allmulti               [ OK ]

The PGID_MCIPV6 is configured correctly to not flood to the CPU,
I checked that.

Furthermore, when I disable back PTP RX timestamping (ptp4l doesn't do
that when it exists), packets are RX filtered again as they should be:

~/selftests/drivers/net/dsa# hwstamp_ctl -i swp0 -r 0
[  218.202854] mscc_felix 0000:00:00.5: ocelot_l2_ptp_trap_del: port 0 removing L2 PTP trap
[  218.212656] mscc_felix 0000:00:00.5: ocelot_ipv4_ptp_trap_del: port 0 removing IPv4 PTP event trap
[  218.222975] mscc_felix 0000:00:00.5: ocelot_ipv4_ptp_trap_del: port 0 removing IPv4 PTP general trap
[  218.233133] mscc_felix 0000:00:00.5: ocelot_ipv6_ptp_trap_del: port 0 removing IPv6 PTP event trap
[  218.242251] mscc_felix 0000:00:00.5: ocelot_ipv6_ptp_trap_del: port 0 removing IPv6 PTP general trap
current settings:
tx_type 1
rx_filter 12
new settings:
tx_type 1
rx_filter 0
~/selftests/drivers/net/dsa# ./local_termination.sh eno0 swp0
TEST: swp0: Unicast IPv4 to primary MAC address                     [ OK ]
TEST: swp0: Unicast IPv4 to macvlan MAC address                     [ OK ]
TEST: swp0: Unicast IPv4 to unknown MAC address                     [ OK ]
TEST: swp0: Unicast IPv4 to unknown MAC address, promisc            [ OK ]
TEST: swp0: Unicast IPv4 to unknown MAC address, allmulti           [ OK ]
TEST: swp0: Multicast IPv4 to joined group                          [ OK ]
TEST: swp0: Multicast IPv4 to unknown group                         [ OK ]
TEST: swp0: Multicast IPv4 to unknown group, promisc                [ OK ]
TEST: swp0: Multicast IPv4 to unknown group, allmulti               [ OK ]
TEST: swp0: Multicast IPv6 to joined group                          [ OK ]
TEST: swp0: Multicast IPv6 to unknown group                         [ OK ]
TEST: swp0: Multicast IPv6 to unknown group, promisc                [ OK ]
TEST: swp0: Multicast IPv6 to unknown group, allmulti               [ OK ]

So it's clear that something in the PTP RX trapping logic went wrong.

Looking a bit at the code, I can see that there are 4 typos, which
populate "ipv4" VCAP IS2 key filter fields for IPv6 keys.

VCAP IS2 keys of type OCELOT_VCAP_KEY_IPV4 and OCELOT_VCAP_KEY_IPV6 are
handled by is2_entry_set(). OCELOT_VCAP_KEY_IPV4 looks at
&filter->key.ipv4, and OCELOT_VCAP_KEY_IPV6 at &filter->key.ipv6.
Simply put, when we populate the wrong key field, &filter->key.ipv6
fields "proto.mask" and "proto.value" remain all zeroes (or "don't care").
So is2_entry_set() will enter the "else" of this "if" condition:

	if (msk == 0xff && (val == IPPROTO_TCP || val == IPPROTO_UDP))

and proceed to ignore the "proto" field. The resulting rule will match
on all IPv6 traffic, trapping it to the CPU.

This is the reason why the local_termination.sh selftest sees it,
because control traps are stronger than the PGID_MCIPV6 used for
flooding (from the forwarding data path).

But the problem is in fact much deeper. We trap all IPv6 traffic to the
CPU, but if we're bridged, we set skb->offload_fwd_mark = 1, so software
forwarding will not take place and IPv6 traffic will never reach its
destination.

The fix is simple - correct the typos.

I was intentionally inaccurate in the commit message about the breakage
occurring when any PTP timestamping is enabled. In fact it only happens
when L4 timestamping is requested (HWTSTAMP_FILTER_PTP_V2_EVENT or
HWTSTAMP_FILTER_PTP_V2_L4_EVENT). But ptp4l requests a larger RX
timestamping filter than it needs for "-2": HWTSTAMP_FILTER_PTP_V2_EVENT.
I wanted people skimming through git logs to not think that the bug
doesn't affect them because they only use ptp4l in L2 mode.

Fixes: 96ca08c058 ("net: mscc: ocelot: set up traps for PTP packets")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Link: https://lore.kernel.org/r/20230207183117.1745754-1-vladimir.oltean@nxp.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-02-09 10:52:44 +01:00
Pietro Borrello
f753a68980 rds: rds_rm_zerocopy_callback() use list_first_entry()
rds_rm_zerocopy_callback() uses list_entry() on the head of a list
causing a type confusion.
Use list_first_entry() to actually access the first element of the
rs_zcookie_queue list.

Fixes: 9426bbc6de ("rds: use list structure to track information for zerocopy completion notification")
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Pietro Borrello <borrello@diag.uniroma1.it>
Link: https://lore.kernel.org/r/20230202-rds-zerocopy-v3-1-83b0df974f9a@diag.uniroma1.it
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-02-09 10:37:26 +01:00
Jakub Kicinski
646be03ec4 Merge tag 'ipsec-2023-02-08' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec
Steffen Klassert says:

====================
ipsec 2023-02-08

1) Fix policy checks for nested IPsec tunnels when using
   xfrm interfaces. From Benedict Wong.

2) Fix netlink message expression on 32=>64-bit
   messages translators. From Anastasia Belova.

3) Prevent potential spectre v1 gadget in xfrm_xlate32_attr.
   From Eric Dumazet.

4) Always consistently use time64_t in xfrm_timer_handler.
   From Eric Dumazet.

5) Fix KCSAN reported bug: Multiple cpus can update use_time
   at the same time. From Eric Dumazet.

6) Fix SCP copy from IPv4 to IPv6 on interfamily tunnel.
   From Christian Hopps.

* tag 'ipsec-2023-02-08' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec:
  xfrm: fix bug with DSCP copy to v6 from v4 tunnel
  xfrm: annotate data-race around use_time
  xfrm: consistently use time64_t in xfrm_timer_handler()
  xfrm/compat: prevent potential spectre v1 gadget in xfrm_xlate32_attr()
  xfrm: compat: change expression for switch in xfrm_xlate64
  Fix XFRM-I support for nested ESP tunnels
====================

Link: https://lore.kernel.org/r/20230208114322.266510-1-steffen.klassert@secunet.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-08 21:35:39 -08:00
Jakub Kicinski
5131a053f2 Merge tag 'linux-can-next-for-6.3-20230208' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next
Marc Kleine-Budde says:

====================
can-next 2023-02-08

The 1st patch is by Oliver Hartkopp and cleans up the CAN_RAW's
raw_setsockopt() for CAN_RAW_FD_FRAMES.

The 2nd patch is by me and fixes the compilation if
CONFIG_CAN_CALC_BITTIMING is disabled. (Problem introduced in last
pull request to next-next.)

* tag 'linux-can-next-for-6.3-20230208' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next:
  can: bittiming: can_calc_bittiming(): add missing parameter to no-op function
  can: raw: use temp variable instead of rolling back config
====================

Link: https://lore.kernel.org/r/20230208210014.3169347-1-mkl@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-08 21:32:20 -08:00
Jakub Kicinski
9245b518c8 Merge tag 'mlx5-next-netdev-deadlock' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux
Saeed Mahameed says:

====================
mlx5-next-netdev-deadlock

This series from Jiri solves a deadlock when removing a network namespace
with mlx5 devlink instance being in it.
The deadlock is between:
1) mlx5_ib->unregister_netdevice_notifier()
AND
2) mlx5_core->devlink_reload->cleanup_net()

To slove this introduced mlx5 netdev added/removed events to track uplink
netdev to be used for register_netdevice_notifier_dev_net() purposes.

* tag 'mlx5-next-netdev-deadlock' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux:
  RDMA/mlx5: Track netdev to avoid deadlock during netdev notifier unregister
  net/mlx5e: Propagate an internal event in case uplink netdev changes
  net/mlx5e: Fix trap event handling
  net/mlx5: Introduce CQE error syndrome
====================

Link: https://lore.kernel.org/r/20230208005626.72930-1-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-08 21:01:17 -08:00
Yang Li
3ca11619a3 net: libwx: Remove unneeded semicolon
./drivers/net/ethernet/wangxun/libwx/wx_lib.c:683:2-3: Unneeded semicolon

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Link: https://bugzilla.openanolis.cn/show_bug.cgi?id=3976
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Link: https://lore.kernel.org/r/20230208004959.47553-1-yang.lee@linux.alibaba.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-08 20:50:31 -08:00
Yang Li
f978fa41f6 net: libwx: clean up one inconsistent indenting
drivers/net/ethernet/wangxun/libwx/wx_lib.c:1835 wx_setup_all_rx_resources() warn: inconsistent indenting

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Link: https://bugzilla.openanolis.cn/show_bug.cgi?id=3981
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Link: https://lore.kernel.org/r/20230208013227.111605-1-yang.lee@linux.alibaba.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-08 20:49:38 -08:00
Jiawen Wu
363d7c2298 net: txgbe: Update support email address
Update new email address for Wangxun 10Gb NIC support team.

Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com>
Link: https://lore.kernel.org/r/20230208023035.3371250-1-jiawenwu@trustnetic.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-08 20:48:37 -08:00
Jiri Pirko
dca55da0a1 RDMA/mlx5: Track netdev to avoid deadlock during netdev notifier unregister
When removing a network namespace with mlx5 devlink instance being in
it, following callchain is performed:

cleanup_net (takes down_read(&pernet_ops_rwsem)
devlink_pernet_pre_exit()
devlink_reload()
mlx5_devlink_reload_down()
mlx5_unload_one_devl_locked()
mlx5_detach_device()
del_adev()
mlx5r_remove()
__mlx5_ib_remove()
mlx5_ib_roce_cleanup()
mlx5_remove_netdev_notifier()
unregister_netdevice_notifier (takes down_write(&pernet_ops_rwsem)

This deadlocks.

Resolve this by converting to register_netdevice_notifier_dev_net()
which does not take pernet_ops_rwsem and moves the notifier block around
according to netdev it takes as arg.

Use previously introduced netdev added/removed events to track uplink
netdev to be used for register_netdevice_notifier_dev_net() purposes.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-02-08 20:40:57 -08:00
Jiri Pirko
c7d4e6ab31 net/mlx5e: Propagate an internal event in case uplink netdev changes
Whenever uplink netdev is set/cleared, propagate newly introduced event
to inform notifier blocks netdev was added/removed.

Move the set() helper to core.c from header, introduce clear() and
netdev_added_event_replay() helpers. The last one is going to be called
from rdma driver, so export it.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-02-08 20:40:57 -08:00
Jiri Pirko
3f26a315f8 net/mlx5e: Fix trap event handling
Current code does not return correct return value from event handler.
Fix it by returning NOTIFY_* and propagate err over newly introduce ctx
structure.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-02-08 20:40:56 -08:00
Jakub Kicinski
ff8ced4eef Merge tag 'mlx5-fixes-2023-02-07' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
Saeed Mahameed says:

====================
mlx5 fixes 2023-02-07

This series provides bug fixes to mlx5 driver.

* tag 'mlx5-fixes-2023-02-07' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux:
  net/mlx5: Serialize module cleanup with reload and remove
  net/mlx5: fw_tracer, Zero consumer index when reloading the tracer
  net/mlx5: fw_tracer, Clear load bit when freeing string DBs buffers
  net/mlx5: Expose SF firmware pages counter
  net/mlx5: Store page counters in a single array
  net/mlx5e: IPoIB, Show unknown speed instead of error
  net/mlx5e: Fix crash unsetting rx-vlan-filter in switchdev mode
  net/mlx5: Bridge, fix ageing of peer FDB entries
  net/mlx5: DR, Fix potential race in dr_rule_create_rule_nic
  net/mlx5e: Update rx ring hw mtu upon each rx-fcs flag change
====================

Link: https://lore.kernel.org/r/20230208030302.95378-1-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-08 19:23:45 -08:00
Jakub Kicinski
7eadc0a013 Merge tag 'mlx5-updates-2023-02-07' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
Saeed Mahameed says:

====================
mlx5-updates-2023-02-07

1) Minor and trivial code Cleanups

2) Minor fixes for net-next

3) From Shay: dynamic FW trace strings update.

* tag 'mlx5-updates-2023-02-07' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux:
  net/mlx5: fw_tracer, Add support for unrecognized string
  net/mlx5: fw_tracer, Add support for strings DB update event
  net/mlx5: fw_tracer, allow 0 size string DBs
  net/mlx5: fw_tracer: Fix debug print
  net/mlx5: fs, Remove redundant assignment of size
  net/mlx5: fs_core, Remove redundant variable err
  net/mlx5: Fix memory leak in error flow of port set buffer
  net/mlx5e: Remove incorrect debugfs_create_dir NULL check in TLS
  net/mlx5e: Remove incorrect debugfs_create_dir NULL check in hairpin
  net/mlx5: fs, Remove redundant vport_number assignment
  net/mlx5e: Remove redundant code for handling vlan actions
  net/mlx5e: Don't listen to remove flows event
  net/mlx5: fw reset: Skip device ID check if PCI link up failed
  net/mlx5: Remove redundant health work lock
  mlx5: reduce stack usage in mlx5_setup_tc
====================

Link: https://lore.kernel.org/r/20230208003712.68386-1-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-08 19:05:58 -08:00
Ido Schimmel
b963d9d5b9 selftests: Fix failing VXLAN VNI filtering test
iproute2 does not recognize the "group6" and "remote6" keywords. Fix by
using "group" and "remote" instead.

Before:

 # ./test_vxlan_vnifiltering.sh
 [...]
 Tests passed:  25
 Tests failed:   2

After:

 # ./test_vxlan_vnifiltering.sh
 [...]
 Tests passed:  27
 Tests failed:   0

Fixes: 3edf5f66c1 ("selftests: add new tests for vxlan vnifiltering")
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Link: https://lore.kernel.org/r/20230207141819.256689-1-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-08 16:54:03 -08:00
Marc Kleine-Budde
65db3d8b52 can: bittiming: can_calc_bittiming(): add missing parameter to no-op function
In commit 286c0e09e8 ("can: bittiming: can_changelink() pass extack
down callstack") a new parameter was added to can_calc_bittiming(),
however the static inline no-op (which is used if
CONFIG_CAN_CALC_BITTIMING is disabled) wasn't converted.

Add the new parameter to the static inline no-op of
can_calc_bittiming().

Fixes: 286c0e09e8 ("can: bittiming: can_changelink() pass extack down callstack")
Reported-by: kernel test robot <lkp@intel.com>
Link: https://lore.kernel.org/20230207201734.2905618-1-mkl@pengutronix.de
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2023-02-08 21:53:24 +01:00
Oliver Hartkopp
f2f527d595 can: raw: use temp variable instead of rolling back config
Introduce a temporary variable to check for an invalid configuration
attempt from user space. Before this patch the value was copied to
the real config variable and rolled back in the case of an error.

Suggested-by: Marc Kleine-Budde <mkl@pengutronix.de>
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Link: https://lore.kernel.org/all/20230203090807.97100-1-socketcan@hartkopp.net
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2023-02-08 21:53:24 +01:00
David S. Miller
e6ebe6c123 Merge branch 'taprio-auto-qmaxsdu-new-tx'
Vladimir Oltean says:

====================
taprio automatic queueMaxSDU and new TXQ selection procedure

This patch set addresses 2 design limitations in the taprio software scheduler:

1. Software scheduling fundamentally prioritizes traffic incorrectly,
   in a way which was inspired from Intel igb/igc drivers and does not
   follow the inputs user space gives (traffic classes and TC to TXQ
   mapping). Patch 05/15 handles this, 01/15 - 04/15 are preparations
   for this work.

2. Software scheduling assumes that the gate for a traffic class closes
   as soon as the next interval begins. But this isn't true.
   If consecutive schedule entries have that traffic class gate open,
   there is no "gate close" event and taprio should keep dequeuing from
   that TC without interruptions. Patches 06/15 - 15/15 handle this.
   Patch 10/15 is a generic Qdisc change required for this to work.

Future development directions which depend on this patch set are:

- Propagating the automatic queueMaxSDU calculation down to offloading
  device drivers, instead of letting them calculate this, as
  vsc9959_tas_guard_bands_update() does today.

- A software data path for tc-taprio with preemptible traffic and
  Hold/Release events.

v1 at:
https://patchwork.kernel.org/project/netdevbpf/cover/20230128010719.2182346-1-vladimir.oltean@nxp.com/
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2023-02-08 09:48:53 +00:00
Vladimir Oltean
39b02d6d10 net/sched: taprio: don't segment unnecessarily
Improve commit 497cc00224 ("taprio: Handle short intervals and large
packets") to only perform segmentation when skb->len exceeds what
taprio_dequeue() expects.

In practice, this will make the biggest difference when a traffic class
gate is always open in the schedule. This is because the max_frm_len
will be U32_MAX, and such large skb->len values as Kurt reported will be
sent just fine unsegmented.

What I don't seem to know how to handle is how to make sure that the
segmented skbs themselves are smaller than the maximum frame size given
by the current queueMaxSDU[tc]. Nonetheless, we still need to drop
those, otherwise the Qdisc will hang.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-02-08 09:48:53 +00:00