Commit Graph

1136659 Commits

Author SHA1 Message Date
Kuniyuki Iwashima
b261eda84e soreuseport: Fix socket selection for SO_INCOMING_CPU.
Kazuho Oku reported that setsockopt(SO_INCOMING_CPU) does not work
with setsockopt(SO_REUSEPORT) since v4.6.

With the combination of SO_REUSEPORT and SO_INCOMING_CPU, we could
build a highly efficient server application.

setsockopt(SO_INCOMING_CPU) associates a CPU with a TCP listener
or UDP socket, and then incoming packets processed on the CPU will
likely be distributed to the socket.  Technically, a socket could
even receive packets handled on another CPU if no sockets in the
reuseport group have the same CPU receiving the flow.

The logic exists in compute_score() so that a socket will get a higher
score if it has the same CPU as the flow.  However, the score gets
ignored after the two blamed commits, which introduced a faster socket
selection algorithm for SO_REUSEPORT.

This patch introduces a counter of sockets with SO_INCOMING_CPU in
a reuseport group to check if we should iterate all sockets to find
a proper one.  We increment the counter when

  * calling listen() if the socket has SO_INCOMING_CPU and SO_REUSEPORT

  * enabling SO_INCOMING_CPU if the socket is in a reuseport group

Also, we decrement it when

  * detaching a socket out of the group to apply SO_INCOMING_CPU to
    migrated TCP requests

  * disabling SO_INCOMING_CPU if the socket is in a reuseport group

When the counter reaches 0, we can get back to the O(1) selection
algorithm.
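
The counter scheme above can be modelled in userspace roughly as follows. This is a simplified sketch, not the kernel code: struct model_sock, struct model_group and both functions are hypothetical names, the modulo stands in for whatever scaling the kernel applies to the hash, and locking is left out entirely.

```c
#include <assert.h>
#include <stddef.h>

#define NO_CPU (-1)

/* Hypothetical userspace model; not the kernel's struct sock. */
struct model_sock {
	int incoming_cpu;		/* NO_CPU when SO_INCOMING_CPU is unset */
};

struct model_group {
	struct model_sock **socks;
	unsigned int num_socks;
	unsigned int incoming_cpu;	/* members that have a CPU set */
};

/* Pick a socket for a flow with hash `hash` being processed on `cpu`. */
struct model_sock *model_select(const struct model_group *g,
				unsigned int hash, int cpu)
{
	struct model_sock *first = NULL;
	unsigned int i, start;

	assert(g && g->num_socks > 0);
	start = hash % g->num_socks;

	/* O(1) path: no member asked for a CPU, so the hashed slot wins. */
	if (g->incoming_cpu == 0)
		return g->socks[start];

	/* O(n) path: walk the group once, starting at the hashed slot. */
	i = start;
	do {
		struct model_sock *sk = g->socks[i];

		if (sk->incoming_cpu == cpu)
			return sk;
		if (!first)
			first = sk;
		i = (i + 1) % g->num_socks;
	} while (i != start);

	return first;	/* no CPU match: fall back to the hashed slot */
}

/* Change a member's CPU and keep the group counter in step. */
void model_set_incoming_cpu(struct model_group *g, struct model_sock *sk,
			    int cpu)
{
	if (sk->incoming_cpu == NO_CPU && cpu != NO_CPU)
		g->incoming_cpu++;
	else if (sk->incoming_cpu != NO_CPU && cpu == NO_CPU)
		g->incoming_cpu--;
	sk->incoming_cpu = cpu;
}
```

Once the last member clears its CPU, the counter returns to 0 and model_select() takes the O(1) path again.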

The overall changes are negligible for the non-SO_INCOMING_CPU case,
and the only notable thing is that we have to update sk_incoming_cpu
under reuseport_lock.  Otherwise, the race prevents transitioning to
the O(n) algorithm and results in the wrong socket selection.

 cpu1 (setsockopt)               cpu2 (listen)
+-----------------+             +-------------+

lock_sock(sk1)                  lock_sock(sk2)

reuseport_update_incoming_cpu(sk1, val)
.
|  /* set CPU as 0 */
|- WRITE_ONCE(sk1->sk_incoming_cpu, val)
|
|                               spin_lock_bh(&reuseport_lock)
|                               reuseport_grow(sk2, reuse)
|                               .
|                               |- more_socks_size = reuse->max_socks * 2U;
|                               |- if (more_socks_size > U16_MAX &&
|                               |       reuse->num_closed_socks)
|                               |  .
|                               |  |- RCU_INIT_POINTER(sk1->sk_reuseport_cb, NULL);
|                               |  `- __reuseport_detach_closed_sock(sk1, reuse)
|                               |     .
|                               |     `- reuseport_put_incoming_cpu(sk1, reuse)
|                               |        .
|                               |        |  /* Read shutdown()ed sk1's sk_incoming_cpu
|                               |        |   * without lock_sock().
|                               |        |   */
|                               |        `- if (sk1->sk_incoming_cpu >= 0)
|                               |           .
|                               |           |  /* decrement not-yet-incremented
|                               |           |   * count, which is never incremented.
|                               |           |   */
|                               |           `- __reuseport_put_incoming_cpu(reuse);
|                               |
|                               `- spin_unlock_bh(&reuseport_lock)
|
|- spin_lock_bh(&reuseport_lock)
|
|- reuse = rcu_dereference_protected(sk1->sk_reuseport_cb, ...)
|- if (!reuse)
|  .
|  |  /* Cannot increment reuse->incoming_cpu. */
|  `- goto out;
|
`- spin_unlock_bh(&reuseport_lock)

Fixes: e32ea7e747 ("soreuseport: fast reuseport UDP socket selection")
Fixes: c125e80b88 ("soreuseport: fast reuseport TCP socket selection")
Reported-by: Kazuho Oku <kazuhooku@gmail.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-10-25 11:35:16 +02:00
Paolo Abeni
71920a773e Merge branch 'net-ipa-validation-cleanup'
Alex Elder says:

====================
net: ipa: validation cleanup

This series gathers a set of IPA driver cleanups, mostly involving
code that ensures certain things are known to be correct *early*
(either at build or initialization time), so they can be assumed good
during normal operation.

The first removes three constant symbols, by making a (reasonable)
assumption that a routing table consists of entries for the modem
followed by entries for the AP, with no unused entries between them.

The second removes two checks that are redundant (they verify the
sizes of two memory regions are in range, which will have been done
earlier for all regions).

The third adds some new checks to routing and filter tables that
can be done at "init time" (without requiring any access to IPA
hardware).

The fourth moves a check that routing and filter table addresses can
be encoded within certain IPA immediate commands, so it's performed
earlier; the checks can be done without touching IPA hardware.  The
fifth moves some other command-related checks earlier, for the same
reason.

The sixth removes the definition of ipa_table_valid(), because what it
does has become redundant.  Finally, the last patch moves two more
validation calls so they're done very early in the probe process.
This will be required by some upcoming patches, which will record
the size of the routing and filter tables at this time so they're
available for subsequent initialization.
====================

Link: https://lore.kernel.org/r/20221021191340.4187935-1-elder@linaro.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-10-25 11:15:21 +02:00
Alex Elder
73da9cac51 net: ipa: check table memory regions earlier
Verify that the sizes of the routing and filter table memory regions
are valid as part of memory initialization, rather than waiting for
table initialization.  The main reason to do this is that upcoming
patches use these memory region sizes to determine the number of
entries in these tables, and we'll want to know these sizes are good
sooner.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-10-25 11:15:19 +02:00
Alex Elder
39ad815244 net: ipa: kill ipa_table_valid()
What ipa_table_valid() (and ipa_table_valid_one(), which it calls)
does is ensure that the memory regions that hold routing and filter
tables have reasonable size.  Specifically, it checks that the size
of a region is sufficient (or rather, exactly the right size) to
hold the maximum number of entries supported by the driver.  (There
is an additional check that's erroneous, but in practice it is never
reached.)

Recently ipa_table_mem_valid() was added, which is called by
ipa_table_init().  That function verifies that all table memory
regions are of sufficient size, and requires hashed tables to have
zero size if hashing is not supported.  It only ensures the filter
table is large enough to hold the number of endpoints that support
filtering, but that is adequate.

Therefore everything that ipa_table_valid() does is redundant, so
get rid of it.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-10-25 11:15:19 +02:00
Alex Elder
7fd10a2aca net: ipa: introduce ipa_cmd_init()
Currently, ipa_cmd_data_valid() is called by ipa_mem_config().
Nothing it does requires access to hardware though, so it can be
done during the init phase of IPA driver startup.

Create a new function ipa_cmd_init(), whose purpose is to do early
initialization related to IPA immediate commands.  It will call the
build-time validation function, then will make the two calls made
previously by ipa_cmd_data_valid().  This makes ipa_cmd_data_valid()
unnecessary, so get rid of it.

Rename ipa_cmd_header_valid() to be ipa_cmd_header_init_local_valid(),
so its name is clearer about which IPA immediate command it is
associated with.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-10-25 11:15:19 +02:00
Alex Elder
5444b0ea99 net: ipa: verify table sizes fit in commands early
We currently verify the table size and offset fit in the immediate
command fields that must encode them in ipa_table_valid_one().  We
can now make this check earlier, in ipa_table_mem_valid().

The non-hashed IPv4 filter and route tables will always exist, and
their sizes will match the IPv6 tables, as well as the hashed tables
(if supported).  So it's sufficient to verify the offset and size of
the IPv4 non-hashed tables fit into these fields.

Rename the function ipa_cmd_table_init_valid(), to reinforce that
it is the TABLE_INIT immediate command fields we're checking.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-10-25 11:15:18 +02:00
Alex Elder
cf13919654 net: ipa: validate IPA table memory earlier
Add checks in ipa_table_init() to ensure the memory regions defined
for IPA filter and routing tables are valid.

For routing tables, the checks ensure:
  - The non-hashed IPv4 and IPv6 routing tables are defined
  - The non-hashed IPv4 and IPv6 routing tables are the same size
  - The number of entries in the non-hashed IPv4 routing table is enough
    to hold the number of entries available to the modem, plus at least
    one usable by the AP.

For filter tables, the checks ensure:
  - The non-hashed IPv4 and IPv6 filter tables are defined
  - The non-hashed IPv4 and IPv6 filter tables are the same size
  - The number of entries in the non-hashed IPv4 filter table is enough
    to hold the endpoint bitmap, plus an entry for each defined
    endpoint that supports filtering.

In addition, for both routing and filter tables:
  - If hashing isn't supported (IPA v4.2), hashed tables are zero size
  - If hashing *is* supported, all hashed tables are the same size as
    their non-hashed counterparts.

When validating the size of routing tables, require the AP to have
at least one entry (in addition to those used by the modem).
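
As a rough sketch of the checks listed above (not the driver's code), the conditions can be written as one predicate over the region sizes. struct table_sizes, table_sizes_valid() and the min_entries parameter are hypothetical names for illustration, and sizes are counted in entries with 0 meaning the region is not defined.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical description of one family of tables (routing or filter). */
struct table_sizes {
	size_t v4, v6;			/* non-hashed tables */
	size_t v4_hashed, v6_hashed;	/* hashed tables */
};

/* Return true if the sizes satisfy the checks in the message.
 * `min_entries` is the smallest acceptable non-hashed table: for routing,
 * the modem's entries plus one for the AP; for filtering, the endpoint
 * bitmap plus one entry per endpoint that supports filtering.
 */
bool table_sizes_valid(const struct table_sizes *t, size_t min_entries,
		       bool hash_supported)
{
	assert(t != NULL);

	/* Non-hashed IPv4 and IPv6 tables must both be defined. */
	if (t->v4 == 0 || t->v6 == 0)
		return false;

	/* ...and must be the same size. */
	if (t->v4 != t->v6)
		return false;

	/* ...and must be large enough. */
	if (t->v4 < min_entries)
		return false;

	/* Hashed tables are zero size without hashing support. */
	if (!hash_supported)
		return t->v4_hashed == 0 && t->v6_hashed == 0;

	/* Otherwise they match their non-hashed counterparts. */
	return t->v4_hashed == t->v4 && t->v6_hashed == t->v6;
}
```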

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-10-25 11:15:18 +02:00
Alex Elder
2554322b31 net: ipa: remove two memory region checks
There's no need to ensure table memory regions fit within the
IPA-local memory range.  And there's no need to ensure the modem
header memory region is in range either.  These are verified for all
memory regions in ipa_mem_size_valid(), once we have settled on the
size of IPA memory.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-10-25 11:15:18 +02:00
Alex Elder
fb4014ac76 net: ipa: kill two constant symbols
The entries in each IPA routing table are divided between the modem
and the AP.  The modem always gets some number of entries located at
the base of the table; the AP gets all those that follow.

There's no reason to think the modem will use anything different
from the first entries in a routing table, so:
  - Get rid of IPA_ROUTE_MODEM_MIN (just assume it's 0)
  - Get rid of IPA_ROUTE_AP_MIN (just assume it's IPA_ROUTE_MODEM_COUNT)
And finally:
  - Open-code IPA_ROUTE_AP_COUNT and remove its definition

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-10-25 11:15:18 +02:00
Paolo Abeni
34802d0662 Merge branch 'extend-action-skbedit-to-rx-queue-mapping'
Amritha Nambiar says:

====================
Extend action skbedit to RX queue mapping

Based on the discussion on
https://lore.kernel.org/netdev/166260012413.81018.8010396115034847972.stgit@anambiarhost.jf.intel.com/ ,
the following series extends skbedit tc action to RX queue mapping.
Currently, skbedit action in tc allows overriding of transmit queue.
Extending this ability of skbedit action supports the selection of
receive queue for incoming packets. On the receive side, this action
is supported only in hardware, so the skip_sw flag is enforced.

Enabled ice driver to offload this type of filter into the hardware
for accepting packets to the device's receive queue.
====================

Link: https://lore.kernel.org/r/166633888716.52141.3425659377117969638.stgit@anambiarhost.jf.intel.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-10-25 10:32:43 +02:00
Amritha Nambiar
d5ae8ecf38 Documentation: networking: TC queue based filtering
Add tc-queue-filters.rst with notes on TC filters for
selecting a set of queues and/or a queue.

Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-10-25 10:32:40 +02:00
Amritha Nambiar
143b86f346 ice: Enable RX queue selection using skbedit action
This patch uses TC skbedit queue_mapping action to support
forwarding packets to a device queue. Such filters with action
forward to queue will be the highest priority switch filter in
HW.
Example:
$ tc filter add dev ens4f0 protocol ip ingress flower\
  dst_ip 192.168.1.12 ip_proto tcp dst_port 5001\
  action skbedit queue_mapping 5 skip_sw

The above command adds an ingress filter; incoming packets
that match it will be accepted into queue 5. The queue
number is in decimal format.

Refactored ice_add_tc_flower_adv_fltr() to consolidate code with
action FWD_TO_VSI and FWD_TO_QUEUE.

Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Reviewed-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-10-25 10:32:40 +02:00
Amritha Nambiar
4a6a676f8c act_skbedit: skbedit queue mapping for receive queue
Add support for skbedit queue mapping action on receive
side. This is supported only in hardware, so the skip_sw
flag is enforced. This enables offloading filters for
receive queue selection in the hardware using the
skbedit action. Traffic arrives on the Rx queue requested
in the skbedit action parameter. A new tc action flag
TCA_ACT_FLAGS_AT_INGRESS is introduced to identify the
traffic direction the action queue_mapping is requested
on during filter addition. This is used to disallow
offloading the skbedit queue mapping action on transmit
side.

Example:
$ tc filter add dev $IFACE ingress protocol ip flower dst_ip $DST_IP\
 action skbedit queue_mapping $rxq_id skip_sw

Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-10-25 10:32:40 +02:00
Jakub Kicinski
6143eca357 Merge branch 'net-sfp-improve-high-power-module-implementation'
Russell King says:

====================
net: sfp: improve high power module implementation

This series aims to improve the power level switching between standard
level 1 and the higher power levels.

The first patch updates the DT binding documentation to include the
minimum and default of 1W, which is the base level that every SFP cage
must support. Hence, it makes sense to document this in the binding.

The second patch enforces a minimum of 1W when parsing the firmware
description, and optimises the code for that case; there's no need to
check for SFF8472 compliance since we will not need to touch the
A2h registers.

Patch 3 validates that the module supports SFF-8472 rev 10.2 before
checking for power level 2 - rev 10.2 is where support for power
levels was introduced, so if the module doesn't support this revision,
it doesn't support power levels. Setting the power level 2 declaration
bit is likely to be spurious.

Patch 4 does the same for power level 3, except this was introduced in
SFF-8472 rev 11.9. The revision code was never updated, so we use the
rev 11.4 to signify this.
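
The revision gating described for patches 3 and 4 might look roughly like this. It is a sketch under assumptions: the enum and its values are hypothetical (only the relative order matters), the two booleans stand in for the power level declaration bits read from the module, and the 1.5W and 2W figures are the ones quoted in the individual patch descriptions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical ordering of SFF-8472 compliance levels; these are not the
 * values a real module reports.
 */
enum sff8472_rev {
	REV_NONE = 0,
	REV_10_2,
	REV_11_4,
};

/* Return the module's declared power in milliwatts, ignoring a
 * declaration bit when the module predates the revision that defined it.
 */
unsigned int module_power_mw(enum sff8472_rev rev, bool level2_bit,
			     bool level3_bit)
{
	assert(rev <= REV_11_4);

	/* Power level 3: only trusted from rev 11.4 onwards. */
	if (rev >= REV_11_4 && level3_bit)
		return 2000;

	/* Power level 2: only trusted from rev 10.2 onwards. */
	if (rev >= REV_10_2 && level2_bit)
		return 1500;

	/* Power level 1: the base level every cage supports. */
	return 1000;
}
```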

Patch 5 cleans up the code - rather than using BIT(0), we now use a
properly named value for the power level select bit.

Patch 6 introduces a read-modify-write helper.

Patch 7 gets rid of the DM7052 hack (which sets a power level
declaration bit but is not compatible with SFF-8472 rev 10.2, and
the module does not implement the A2h I2C address.)

Series tested with my DM7052.

v2: update sff,sfp.yaml with Rob's feedback
====================

Andrew's review tags from v1.

Link: https://lore.kernel.org/r/Y0%2F7dAB8OU3jrbz6@shell.armlinux.org.uk
Link: https://lore.kernel.org/r/Y1K17UtfFopACIi2@shell.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-24 21:06:30 -07:00
Russell King (Oracle)
bd1432f68d net: sfp: get rid of DM7052 hack when enabling high power
Since we no longer mis-detect high-power mode with the DM7052 module,
we no longer need the hack in sfp_module_enable_high_power(), and can
now switch this to use sfp_modify_u8().

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-24 21:06:24 -07:00
Russell King (Oracle)
a3c536fc75 net: sfp: add sfp_modify_u8() helper
Add a helper to modify bits in a single byte in memory space, and use
it when updating the soft tx-disable flag in the module.
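
A read-modify-write helper of this kind can be sketched in userspace as below. This is not the driver's sfp_modify_u8(): struct byte_space and modify_u8() are hypothetical, and a plain array stands in for the module's memory space so the example can run anywhere.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a module's byte-addressable memory space. */
struct byte_space {
	uint8_t *bytes;
	size_t len;
	unsigned int writes;	/* bytes written back, for illustration */
};

/* Replace the bits selected by `mask` at `addr` with those from `val`,
 * leaving the other bits alone.  Skips the write if nothing changes.
 */
void modify_u8(struct byte_space *s, size_t addr, uint8_t mask, uint8_t val)
{
	uint8_t old, updated;

	assert(s && addr < s->len);

	old = s->bytes[addr];
	updated = (uint8_t)((old & ~mask) | (val & mask));
	if (updated == old)
		return;

	s->bytes[addr] = updated;
	s->writes++;
}
```

Bits of `val` outside `mask` are ignored, so a caller can pass a full register value and still touch only the field it names.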

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-24 21:06:20 -07:00
Russell King (Oracle)
3989004984 net: sfp: provide a definition for the power level select bit
Provide a named definition for the power level select bit in the
extended status register, rather than using BIT(0) in the code.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-24 21:06:16 -07:00
Russell King (Oracle)
f8810ca758 net: sfp: ignore power level 3 prior to SFF-8472 Rev 11.4
Power level 3 was included in SFF-8472 revision 11.9, but this does
not have a compliance code. Use revision 11.4 as the minimum
compliance level instead.

This should avoid any spurious indication of 2W modules.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-24 21:06:12 -07:00
Russell King (Oracle)
18cc659e95 net: sfp: ignore power level 2 prior to SFF-8472 Rev 10.2
Power level 2 was introduced by SFF-8472 revision 10.2. Ignore
the power declaration bit for modules that are not compliant with
at least this revision.

This should remove any spurious indication of 1.5W modules.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-24 21:06:06 -07:00
Russell King (Oracle)
02eaf5a791 net: sfp: check firmware provided max power
Check that the firmware provided maximum power is at least 1W, which
is the minimum power level for any SFP module.

Now that we enforce the minimum of 1W, we can exit early from
sfp_module_parse_power() if the module power is 1W or less.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-24 21:05:54 -07:00
Russell King (Oracle)
a272bcb9e5 dt-bindings: net: sff,sfp: update binding
Add a minimum and default for the maximum-power-milliwatt option;
module power levels were originally up to 1W, so this is the default
and the minimum power level we can have for a functional SFP cage.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Rob Herring <robh@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-24 21:05:45 -07:00
Jakub Kicinski
1b3d6ecd41 Merge branch 'bnxt_en-driver-updates'
Michael Chan says:

====================
bnxt_en: Driver updates

This patchset adds .get_module_eeprom_by_page() support and adds
an NVRAM resize step to allow larger firmware images to be flashed
to older firmware.
====================

Link: https://lore.kernel.org/r/1666334243-23866-1-git-send-email-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-24 19:24:20 -07:00
Vikas Gupta
4503422462 bnxt_en: check and resize NVRAM UPDATE entry before flashing
Resize of the UPDATE entry is required if the image to
be flashed is larger than the available space. Add this step,
otherwise flashing larger firmware images by ethtool or devlink
may fail.

Reviewed-by: Andy Gospodarek <andrew.gospodarek@broadcom.com>
Signed-off-by: Vikas Gupta <vikas.gupta@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-24 19:24:14 -07:00
Vikas Gupta
7ef3d3901b bnxt_en: add .get_module_eeprom_by_page() support
Add support for the .get_module_eeprom_by_page() callback, which
implements a generic solution for module EEPROM access.

v3: Add bnxt_get_module_status() to get a more specific extack error
    string.
    Return -EINVAL from bnxt_get_module_eeprom_by_page() when we
    don't want to fall back to the old method.
v2: Simplification suggested by Ido Schimmel

Link: https://lore.kernel.org/netdev/YzVJ%2FvKJugoz15yV@shredder/
Signed-off-by: Vikas Gupta <vikas.gupta@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-24 19:24:14 -07:00
Michael Chan
84a911db83 bnxt_en: Update firmware interface to 1.10.2.118
The main changes are PTM timestamp support, CMIS EEPROM support, and
asymmetric CoS queues support.

Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-24 19:24:14 -07:00
Jakub Kicinski
96917bb3a3 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
include/linux/net.h
  a5ef058dc4 ("net: introduce and use custom sockopt socket flag")
  e993ffe3da ("net: flag sockets supporting msghdr originated zerocopy")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-24 13:44:11 -07:00
Linus Torvalds
337a0a0b63 Merge tag 'net-6.1-rc3-1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
 "Including fixes from bpf.

  The net-memcg fix stands out, the rest is very run-of-the-mill. Maybe
  I'm biased.

  Current release - regressions:

   - eth: fman: re-expose location of the MAC address to userspace,
     apparently some udev scripts depended on the exact value

  Current release - new code bugs:

   - bpf:
       - wait for busy refill_work when destroying bpf memory allocator
       - allow bpf_user_ringbuf_drain() callbacks to return 1
       - fix dispatcher patchable function entry to 5 bytes nop

  Previous releases - regressions:

   - net-memcg: avoid stalls when under memory pressure

   - tcp: fix indefinite deferral of RTO with SACK reneging

   - tipc: fix a null-ptr-deref in tipc_topsrv_accept

   - eth: macb: specify PHY PM management done by MAC

   - tcp: fix a signed-integer-overflow bug in tcp_add_backlog()

  Previous releases - always broken:

   - eth: amd-xgbe: SFP fixes and compatibility improvements

  Misc:

   - docs: netdev: offer performance feedback to contributors"

* tag 'net-6.1-rc3-1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (37 commits)
  net-memcg: avoid stalls when under memory pressure
  tcp: fix indefinite deferral of RTO with SACK reneging
  tcp: fix a signed-integer-overflow bug in tcp_add_backlog()
  net: lantiq_etop: don't free skb when returning NETDEV_TX_BUSY
  net: fix UAF issue in nfqnl_nf_hook_drop() when ops_init() failed
  docs: netdev: offer performance feedback to contributors
  kcm: annotate data-races around kcm->rx_wait
  kcm: annotate data-races around kcm->rx_psock
  net: fman: Use physical address for userspace interfaces
  net/mlx5e: Cleanup MACsec uninitialization routine
  atlantic: fix deadlock at aq_nic_stop
  nfp: only clean `sp_indiff` when application firmware is unloaded
  amd-xgbe: add the bit rate quirk for Molex cables
  amd-xgbe: fix the SFP compliance codes check for DAC cables
  amd-xgbe: enable PLL_CTL for fixed PHY modes only
  amd-xgbe: use enums for mailbox cmd and sub_cmds
  amd-xgbe: Yellow carp devices do not need rrc
  bpf: Use __llist_del_all() whenever possbile during memory draining
  bpf: Wait for busy refill_work when destroying bpf memory allocator
  MAINTAINERS: add keyword match on PTP
  ...
2022-10-24 12:43:51 -07:00
Linus Torvalds
f6602a97a1 Merge tag 'rcu-urgent.2022.10.20a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu
Pull RCU fix from Paul McKenney:
 "Fix a regression caused by commit bf95b2bc3e ("rcu: Switch polled
  grace-period APIs to ->gp_seq_polled"), which could incorrectly leave
  interrupts enabled after an early-boot call to synchronize_rcu().

  Such synchronize_rcu() calls must acquire leaf rcu_node locks in order
  to properly interact with polled grace periods, but the code did not
  take into account the possibility of synchronize_rcu() being invoked
  from the portion of the boot sequence during which interrupts are
  disabled.

  This commit therefore switches the lock acquisition and release from
  irq to irqsave/irqrestore"
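
To illustrate why the irq variants are unsafe when interrupts are already disabled, here is a toy userspace model. It is an assumption-laden sketch: a single flag stands in for the CPU's interrupt state, and all model_* names are hypothetical rather than kernel APIs.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: one flag stands in for the interrupt state. */
static bool irqs_enabled = true;

void model_set_irqs(bool on) { irqs_enabled = on; }
bool model_irqs_enabled(void) { return irqs_enabled; }

/* "irq" flavor: disable on lock, unconditionally enable on unlock. */
void model_lock_irq(void) { irqs_enabled = false; }
void model_unlock_irq(void) { irqs_enabled = true; }

/* "irqsave" flavor: remember the prior state... */
bool model_lock_irqsave(void)
{
	bool flags = irqs_enabled;

	irqs_enabled = false;
	return flags;
}

/* ...and put it back on unlock instead of forcing interrupts on. */
void model_unlock_irqrestore(bool flags)
{
	assert(!irqs_enabled);	/* still inside the critical section */
	irqs_enabled = flags;
}
```

Starting with the flag cleared (as in early boot), a model_lock_irq()/model_unlock_irq() pair leaves it set, while the irqsave/irqrestore pair leaves it cleared.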

* tag 'rcu-urgent.2022.10.20a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu:
  rcu: Keep synchronize_rcu() from enabling irqs in early boot
2022-10-24 12:33:30 -07:00
Linus Torvalds
2a91e897c0 Merge tag 'linux-kselftest-kunit-fixes-6.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
Pull KUnit fixes from Shuah Khan:
 "One single fix to update alloc_string_stream() callers to check for
  IS_ERR() instead of NULL to be in sync with alloc_string_stream()
  returning an ERR_PTR()"
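
The difference between the two checks can be shown with a userspace imitation of the error-pointer convention. This is only a sketch: the model_* helpers are hypothetical stand-ins written for this example, the 4095 bound is an assumption of the model, and model_alloc_stream() is not the real alloc_string_stream().

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define MODEL_MAX_ERRNO 4095

/* Encode a negative errno value in a pointer. */
static inline void *model_err_ptr(long err)
{
	assert(err < 0 && err >= -MODEL_MAX_ERRNO);
	return (void *)(intptr_t)err;
}

/* True if the pointer lies in the range reserved for encoded errors. */
static inline bool model_is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-MODEL_MAX_ERRNO;
}

/* Recover the errno value from an encoded pointer. */
static inline long model_ptr_err(const void *p)
{
	return (long)(intptr_t)p;
}

/* Hypothetical allocator that reports failure through an encoded errno,
 * never through NULL.
 */
void *model_alloc_stream(bool fail)
{
	void *p = fail ? NULL : malloc(16);

	return p ? p : model_err_ptr(-ENOMEM);
}
```

A caller that only tests the result against NULL would treat the failed allocation as valid, which is the class of bug the fix addresses.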

* tag 'linux-kselftest-kunit-fixes-6.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
  kunit: update NULL vs IS_ERR() tests
2022-10-24 12:19:34 -07:00
Linus Torvalds
21c92498e9 Merge tag 'linux-kselftest-fixes-6.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
Pull Kselftest fixes from Shuah Khan:

 - futex, intel_pstate, kexec build fixes

 - ftrace dynamic_events dependency check fix

 - memory-hotplug fix to remove redundant warning from test report

* tag 'linux-kselftest-fixes-6.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
  selftests/ftrace: fix dynamic_events dependency check
  selftests/memory-hotplug: Remove the redundant warning information
  selftests/kexec: fix build for ARCH=x86_64
  selftests/intel_pstate: fix build for ARCH=x86_64
  selftests/futex: fix build for clang
2022-10-24 12:10:55 -07:00
Linus Torvalds
74d5b415a5 Merge tag 'pinctrl-v6.1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl
Pull pin control fixes from Linus Walleij:

 - Fix typos in UART1 and MMC in the Ingenic driver

 - A really well researched glitch bug fix to the Qualcomm driver that
   was tracked down and fixed by Doug Anderson from Chromium. Hats off
   for this one!

 - Revert two patches on the Xilinx ZynqMP driver: this needs a proper
   solution making use of firmware version information to adapt to
   different firmware releases

 - Fix interrupt triggers in the Ocelot driver

* tag 'pinctrl-v6.1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl:
  pinctrl: ocelot: Fix incorrect trigger of the interrupt.
  Revert "dt-bindings: pinctrl-zynqmp: Add output-enable configuration"
  Revert "pinctrl: pinctrl-zynqmp: Add support for output-enable and bias-high-impedance"
  pinctrl: qcom: Avoid glitching lines when we first mux to output
  pinctrl: Ingenic: JZ4755 bug fixes
2022-10-24 11:48:30 -07:00
Jakub Kicinski
720ca52bce net-memcg: avoid stalls when under memory pressure
As Shakeel explains, the commit under Fixes had the unintended
side effect of no longer pre-loading the cached memory allowance.
Even though we previously dropped the first packet received when
over the memory limit, the subsequent ones would get through by using
the cache. The charging was happening in batches of 128kB, so
we'd let in 128kB (truesize) worth of packets per drop.

After the change we no longer force charge, so there are no
cache-filling side effects. This causes significant drops and
connection stalls for workloads which use a lot of page cache,
since we can't reclaim page cache under GFP_NOWAIT.

Some of the latency can be recovered by improving SACK reneg
handling but nowhere near enough to get back to the pre-5.15
performance (the application I'm experimenting with still
sees 5-10x worse latency).

Apply the suggested workaround of using GFP_ATOMIC. We will now
be more permissive than previously as we'll drop _no_ packets
in softirq when under pressure. But I can't think of any good
and simple way to address that within networking.

Link: https://lore.kernel.org/all/20221012163300.795e7b86@kernel.org/
Suggested-by: Shakeel Butt <shakeelb@google.com>
Fixes: 4b1327be9f ("net-memcg: pass in gfp_t mask to mem_cgroup_charge_skmem()")
Acked-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Link: https://lore.kernel.org/r/20221021160304.1362511-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-24 10:35:09 -07:00
Neal Cardwell
3d2af9cce3 tcp: fix indefinite deferral of RTO with SACK reneging
This commit fixes a bug that can cause a TCP data sender to repeatedly
defer RTOs when encountering SACK reneging.

The bug is that when we're in fast recovery in a scenario with SACK
reneging, every time we get an ACK we call tcp_check_sack_reneging()
and it can note the apparent SACK reneging and rearm the RTO timer for
srtt/2 into the future. In some SACK reneging scenarios that can
happen repeatedly until the receive window fills up, at which point
the sender can't send any more, the ACKs stop arriving, and the RTO
fires at srtt/2 after the last ACK. But that can take far too long
(O(10 secs)), since the connection is stuck in fast recovery with a
low cwnd that cannot grow beyond ssthresh, even if more bandwidth is
available.

This fix changes the logic in tcp_check_sack_reneging() to only rearm
the RTO timer if data is cumulatively ACKed, indicating forward
progress. This avoids this kind of nearly infinite loop of RTO timer
re-arming. In addition, this meets the goals of
tcp_check_sack_reneging() in handling Windows TCP behavior that looks
temporarily like SACK reneging but is not really.

Many thanks to Jakub Kicinski and Neil Spring, who reported this issue
and provided critical packet traces that enabled root-causing this
issue. Also, many thanks to Jakub Kicinski for testing this fix.

Fixes: 5ae344c949 ("tcp: reduce spurious retransmits due to transient SACK reneging")
Reported-by: Jakub Kicinski <kuba@kernel.org>
Reported-by: Neil Spring <ntspring@fb.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Tested-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20221021170821.1093930-1-ncardwell.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-24 10:34:48 -07:00
Jakub Kicinski
e28c44450b Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Alexei Starovoitov says:

====================
pull-request: bpf 2022-10-23

We've added 7 non-merge commits during the last 18 day(s) which contain
a total of 8 files changed, 69 insertions(+), 5 deletions(-).

The main changes are:

1) Wait for busy refill_work when destroying bpf memory allocator, from Hou.

2) Allow bpf_user_ringbuf_drain() callbacks to return 1, from David.

3) Fix dispatcher patchable function entry to 5 bytes nop, from Jiri.

4) Prevent decl_tag from being referenced in func_proto, from Stanislav.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  bpf: Use __llist_del_all() whenever possbile during memory draining
  bpf: Wait for busy refill_work when destroying bpf memory allocator
  bpf: Fix dispatcher patchable function entry to 5 bytes nop
  bpf: prevent decl_tag from being referenced in func_proto
  selftests/bpf: Add reproducer for decl_tag in func_proto return type
  selftests/bpf: Make bpf_user_ringbuf_drain() selftest callback return 1
  bpf: Allow bpf_user_ringbuf_drain() callbacks to return 1
====================

Link: https://lore.kernel.org/r/20221023192244.81137-1-alexei.starovoitov@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-24 10:32:01 -07:00
David S. Miller
86d6f77a3c Merge branch 'ptp-ocxp-Oroli-ART-CARD'
Vadim Fedorenko says:

====================
ptp: ocp: add support for Orolia ART-CARD

Orolia created an alternative open source TimeCard. The card's hardware
provides functions similar to those of the OCP card, which is why support
is added to the current driver.

The first patch in the series changes how information about serial ports
is stored and is mostly preparation.

Patches 2 to 4 introduce the actual hardware support.

The last patch removes the fallback from the devlink flashing interface to
protect against flashing the wrong image. This matters now that 2 different
boards are supported and a wrong image can easily ruin the hardware.

v2:
  Address comments from Jonathan Lemon

v3:
  Fix issue reported by kernel test robot <lkp@intel.com>

v4:
  Fix clang build issue

v5:
  Fix warnings and per-patch build errors

v6:
  Fix more style issues
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2022-10-24 13:10:40 +01:00
Vadim Fedorenko
c1fd463d57 ptp: ocp: remove flash image header check fallback
Previously there was a fallback mode to flash a firmware image without a
proper header. Now that different vendors are supported, flashing the
wrong image could destroy the hardware. Remove the fallback mode and force
the header check. Both vendors have published firmware images with headers.

Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Vadim Fedorenko <vadfed@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-10-24 13:10:40 +01:00
Vadim Fedorenko
ee6439aaad ptp: ocp: expose config and temperature for ART card
The Orolia card has a disciplining configuration and a temperature table
stored in EEPROM. This patch exposes them as binary attributes with
read and write access.

Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Co-developed-by: Charles Parent <charles.parent@orolia2s.com>
Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Vadim Fedorenko <vadfed@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-10-24 13:10:40 +01:00
Vadim Fedorenko
9c44a7ac17 ptp: ocp: add serial port of mRO50 MAC on ART card
The ART card provides an interface to access the serial port of the
miniature atomic clock found on the card. Add support for this device and
configure it during the init phase.

Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Co-developed-by: Charles Parent <charles.parent@orolia2s.com>
Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Vadim Fedorenko <vadfed@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-10-24 13:10:40 +01:00
Vadim Fedorenko
69dbe1079c ptp: ocp: add Orolia timecard support
This brings in the Orolia timecard support from the GitHub repository.
The card uses different drivers to provide access to the i2c EEPROM and
the firmware SPI flash. It also has a slightly different EEPROM map, but
other parts of the code are the same and can be reused.

Co-developed-by: Charles Parent <charles.parent@orolia2s.com>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Vadim Fedorenko <vadfed@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-10-24 13:10:40 +01:00
Vadim Fedorenko
895ac5a51f ptp: ocp: upgrade serial line information
Introduce structure to hold serial port line number and the baud rate
it supports.

Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Vadim Fedorenko <vadfed@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-10-24 13:10:40 +01:00
Lu Wei
ec791d8149 tcp: fix a signed-integer-overflow bug in tcp_add_backlog()
The type of sk_rcvbuf and sk_sndbuf in struct sock is int, and in
tcp_add_backlog() the variable limit is calculated by adding
sk_rcvbuf, sk_sndbuf and 64 * 1024. The sum may exceed the maximum
value of int and overflow. This patch reduces the limit budget by
halving the sndbuf to solve this issue, since ACK packets are much
smaller than the payload.
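
The arithmetic can be checked with a small userspace sketch. The helper names are hypothetical, and holding the new sum in an unsigned 32-bit value is an assumption of this sketch rather than something stated in the message above.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Old budget, computed in 64 bits so the out-of-range result can be
 * observed without triggering signed overflow in the sketch itself. */
static int64_t sketch_backlog_limit_old(int rcvbuf, int sndbuf)
{
	return (int64_t)rcvbuf + (int64_t)sndbuf + 64 * 1024;
}

/* Reduced budget: rcvbuf + sndbuf / 2 + 64 KiB, held in a u32
 * (assumption). Even with both inputs at INT_MAX this stays below
 * UINT32_MAX, so it cannot wrap. */
static uint32_t sketch_backlog_limit_new(int rcvbuf, int sndbuf)
{
	uint32_t limit;

	assert(rcvbuf >= 0 && sndbuf >= 0);
	limit = (uint32_t)rcvbuf + ((uint32_t)sndbuf >> 1);
	limit += 64 * 1024;
	return limit;
}
```

With both buffers set to INT_MAX, the old sum is larger than INT_MAX, which is the overflow described above, while the halved budget still fits in 32 bits.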

Fixes: c9c3321257 ("tcp: add tcp_add_backlog()")
Signed-off-by: Lu Wei <luwei32@huawei.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-10-24 13:04:25 +01:00
Yunsheng Lin
4727bab4e9 net: skb: move skb_pp_recycle() to skbuff.c
skb_pp_recycle() is only used by skb_free_head() in
skbuff.c, so move it to skbuff.c.

Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-10-24 13:03:43 +01:00
Zhang Changzhong
9c1eaa27ec net: lantiq_etop: don't free skb when returning NETDEV_TX_BUSY
The ndo_start_xmit() method must not free the skb when returning
NETDEV_TX_BUSY, since the caller will requeue that skb and would
otherwise end up requeueing an already freed skb.
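
The ownership rule can be illustrated with a userspace mock. All types and names below are hypothetical stand-ins for the kernel's skb, ring, and return codes, not the lantiq_etop driver code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for NETDEV_TX_OK / NETDEV_TX_BUSY. */
enum sketch_tx_status { SKETCH_TX_OK = 0, SKETCH_TX_BUSY = 1 };

struct sketch_pkt  { bool consumed; };   /* stands in for an skb */
struct sketch_ring { int free_slots; };  /* stands in for a TX ring */

/* On BUSY the packet is left untouched: the caller still owns it and
 * will requeue it. Only on OK does the driver take ownership. */
static enum sketch_tx_status sketch_start_xmit(struct sketch_pkt *p,
					       struct sketch_ring *r)
{
	assert(p && r);

	if (r->free_slots == 0)
		return SKETCH_TX_BUSY;	/* do not free or consume p here */

	r->free_slots--;
	p->consumed = true;		/* driver now owns the packet */
	return SKETCH_TX_OK;
}
```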

Fixes: 504d4721ee ("MIPS: Lantiq: Add ethernet driver")
Signed-off-by: Zhang Changzhong <zhangchangzhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-10-24 13:02:18 +01:00
Nick Child
127b7218bf ibmveth: Always stop tx queues during close
netif_stop_all_queues must be called before calling H_FREE_LOGICAL_LAN.
As a result, we can remove the pool_config field from the ibmveth
adapter structure.

Some device configuration changes call ibmveth_close in order to free
the current resources held by the device. These functions then make
their changes and call ibmveth_open to reallocate and reserve resources
for the device.

Prior to this commit, the flag pool_config was used to tell ibmveth_close
that it should not halt the transmit queue. pool_config was introduced in
commit 860f242eb5 ("[PATCH] ibmveth change buffer pools dynamically")
to avoid interrupting the tx flow when making rx config changes. Since
then, other commits adopted this approach, even if making tx config
changes.

The issue with this approach was that the hypervisor freed all of
the device's control structures after the hcall H_FREE_LOGICAL_LAN
was performed but the transmit queues were never stopped. So the higher
layers in the network stack would continue transmission but any
H_SEND_LOGICAL_LAN hcall would fail with H_PARAMETER until the
hypervisor's structures for the device were allocated with the
H_REGISTER_LOGICAL_LAN hcall in ibmveth_open. This resulted in
no real networking harm but did cause several of these error
messages to be logged: "h_send_logical_lan failed with rc=-4"

So, instead of trying to keep the transmit queues alive during network
configuration changes, just stop the queues, make necessary changes then
restart the queues.

Signed-off-by: Nick Child <nnac123@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-10-24 13:01:37 +01:00
xu xin
233baf9a1b net: remove useless parameter of __sock_cmsg_send
The parameter 'msg' has never been used by __sock_cmsg_send, so we can remove it
safely.

Reported-by: Zeal Robot <zealci@zte.com.cn>
Signed-off-by: xu xin <xu.xin16@zte.com.cn>
Reviewed-by: Zhang Yunkai <zhang.yunkai@zte.com.cn>
Acked-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-10-24 12:43:46 +01:00
Wei Fang
350749b909 net: fec: Add support for periodic output signal of PPS
This patch adds support for configuring the periodic output signal of
PPS, so that PPS can be output at a specified time and period.

Developers and testers can use the command

  echo <channel> <start.sec> <start.nsec> <period.sec> <period.nsec> > /sys/class/ptp/ptp0/period

to specify the start time and period of the PPS output signal.

Note that the channel can only be set to 0. In addition, the start time
must be later than the current PTP clock time, so users can first run
"phc_ctl /dev/ptp0 -- get" to read the current PTP clock time.

Signed-off-by: Wei Fang <wei.fang@nxp.com>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-10-24 12:40:58 +01:00
Zhengchao Shao
d266935ac4 net: fix UAF issue in nfqnl_nf_hook_drop() when ops_init() failed
When ops_init() is invoked to initialize the net and ops->init() fails,
data is freed, but the pointer to it stored in net->gen is left dangling.
When nfqnl_nf_hook_drop() is later invoked to release the net, it
accesses this invalid address.

The process is as follows:
setup_net()
	ops_init()
		data = kzalloc(...)   ---> alloc "data"
		net_assign_generic()  ---> assign "data" to ptr in net->gen
		...
		ops->init()           ---> failed
		...
		kfree(data);          ---> ptr in net->gen is invalid
	...
	ops_exit_list()
		...
		nfqnl_nf_hook_drop()
			*q = nfnl_queue_pernet(net) ---> q is invalid
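
One way to avoid leaving the stale pointer behind is to clear the slot on the failure path before freeing. The message above does not spell out the fix, so the following userspace sketch is an assumption, with hypothetical types standing in for struct net and net->gen.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for one per-namespace slot in net->gen. */
struct sketch_net { void *gen_ptr; };

static int sketch_ops_init(struct sketch_net *net, int (*init)(void *data))
{
	void *data = calloc(1, 64);	/* stands in for kzalloc() */
	int err;

	if (!data)
		return -1;

	net->gen_ptr = data;		/* stands in for net_assign_generic() */
	err = init(data);
	if (!err)
		return 0;

	/* Failure path: clear the slot first so later exit handlers see
	 * NULL instead of a pointer to freed memory. */
	net->gen_ptr = NULL;
	free(data);
	return err;
}

static int sketch_failing_init(void *data) { (void)data; return -22; }
static int sketch_ok_init(void *data) { (void)data; return 0; }
```

After a failed init the slot reads as NULL, so a later lookup can detect that there is nothing to tear down.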

The following is the Call Trace information:
BUG: KASAN: use-after-free in nfqnl_nf_hook_drop+0x264/0x280
Read of size 8 at addr ffff88810396b240 by task ip/15855
Call Trace:
<TASK>
dump_stack_lvl+0x8e/0xd1
print_report+0x155/0x454
kasan_report+0xba/0x1f0
nfqnl_nf_hook_drop+0x264/0x280
nf_queue_nf_hook_drop+0x8b/0x1b0
__nf_unregister_net_hook+0x1ae/0x5a0
nf_unregister_net_hooks+0xde/0x130
ops_exit_list+0xb0/0x170
setup_net+0x7ac/0xbd0
copy_net_ns+0x2e6/0x6b0
create_new_namespaces+0x382/0xa50
unshare_nsproxy_namespaces+0xa6/0x1c0
ksys_unshare+0x3a4/0x7e0
__x64_sys_unshare+0x2d/0x40
do_syscall_64+0x35/0x80
entry_SYSCALL_64_after_hwframe+0x46/0xb0
</TASK>

Allocated by task 15855:
kasan_save_stack+0x1e/0x40
kasan_set_track+0x21/0x30
__kasan_kmalloc+0xa1/0xb0
__kmalloc+0x49/0xb0
ops_init+0xe7/0x410
setup_net+0x5aa/0xbd0
copy_net_ns+0x2e6/0x6b0
create_new_namespaces+0x382/0xa50
unshare_nsproxy_namespaces+0xa6/0x1c0
ksys_unshare+0x3a4/0x7e0
__x64_sys_unshare+0x2d/0x40
do_syscall_64+0x35/0x80
entry_SYSCALL_64_after_hwframe+0x46/0xb0

Freed by task 15855:
kasan_save_stack+0x1e/0x40
kasan_set_track+0x21/0x30
kasan_save_free_info+0x2a/0x40
____kasan_slab_free+0x155/0x1b0
slab_free_freelist_hook+0x11b/0x220
__kmem_cache_free+0xa4/0x360
ops_init+0xb9/0x410
setup_net+0x5aa/0xbd0
copy_net_ns+0x2e6/0x6b0
create_new_namespaces+0x382/0xa50
unshare_nsproxy_namespaces+0xa6/0x1c0
ksys_unshare+0x3a4/0x7e0
__x64_sys_unshare+0x2d/0x40
do_syscall_64+0x35/0x80
entry_SYSCALL_64_after_hwframe+0x46/0xb0

Fixes: f875bae065 ("net: Automatically allocate per namespace data.")
Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-10-24 12:40:06 +01:00
Eric Dumazet
0cafd77dcd net: add a refcount tracker for kernel sockets
Commit ffa84b5ffb ("net: add netns refcount tracker to struct sock")
added a tracker to sockets, but did not track kernel sockets.

We still have syzbot reports hinting about netns being destroyed
while some kernel TCP sockets had not been dismantled.

This patch tracks kernel sockets, and adds a ref_tracker_dir_print()
call to net_free() right before the netns is freed.

Normally, each layer is responsible for properly releasing its
kernel sockets before last call to net_free().

This debugging facility is enabled with CONFIG_NET_NS_REFCNT_TRACKER=y

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Tested-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-10-24 11:04:43 +01:00
Jakub Kicinski
c5884ef477 docs: netdev: offer performance feedback to contributors
Some of us have gotten used to producing large quantities of peer feedback
at work, every 3 or 6 months. Extending the same courtesy to community
members seems like a logical step. It may be hard for some folks to
get validation of how important their work is internally, especially
at smaller companies which don't employ many kernel experts.

The concept of "peer feedback" may be a hyperscaler / silicon valley
thing so YMMV. Hopefully we can build more context as we go.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-10-24 11:03:44 +01:00
David S. Miller
931ae86f8b Merge branch 'kcm-data-races'
Eric Dumazet says:

====================
kcm: annotate data-races

This series addresses two different syzbot reports for KCM.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2022-10-24 10:57:56 +01:00