Commit Graph

147497 Commits

Author SHA1 Message Date
Parav Pandit
b1f2abcf81 net: Make gro complete function to return void
tcp_gro_complete() function only updates the skb fields related to GRO
and it always returns zero. All the 3 drivers which are using it
do not check for the return value either.

Change it to return void instead which simplifies its callers as
error handing becomes unnecessary.

Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-31 09:50:17 +01:00
Christian Marangi
947acacab5 leds: trigger: netdev: expose netdev trigger modes in linux include
Expose netdev trigger modes to make them accessible by LED driver that
will support netdev trigger for hw control.

Signed-off-by: Christian Marangi <ansuelsmth@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-31 09:42:09 +01:00
Andrew Lunn
052c38eb17 leds: add API to get attached device for LED hw control
Some specific LED triggers blink the LED based on events from a device
or subsystem.
For example, an LED could be blinked to indicate a network device is
receiving packets, or a disk is reading blocks. To correctly enable and
request the hw control of the LED, the trigger has to check if the
network interface or block device configured via a /sys/class/led file
match the one the LED driver provide for hw control for.

Provide an API call to get the device which the LED blinks for.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Christian Marangi <ansuelsmth@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-31 09:42:08 +01:00
Christian Marangi
ed554d3f94 leds: add APIs for LEDs hw control
Add an option to permit LED driver to declare support for a specific
trigger to use hw control and setup the LED to blink based on specific
provided modes.

Add APIs for LEDs hw control. These functions will be used to activate
hardware control where a LED will use the provided flags, from an
unique defined supported trigger, to setup the LED to be driven by
hardware.

Add hw_control_is_supported() to ask the LED driver if the requested
mode by the trigger are supported and the LED can be setup to follow
the requested modes.

Deactivate hardware blink control by setting brightness to LED_OFF via
the brightness_set() callback.

Signed-off-by: Christian Marangi <ansuelsmth@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-31 09:42:08 +01:00
Ido Schimmel
1a432018c0 net/sched: flower: Allow matching on layer 2 miss
Add the 'TCA_FLOWER_L2_MISS' netlink attribute that allows user space to
match on packets that encountered a layer 2 miss. The miss indication is
set as metadata in the tc skb extension by the bridge driver upon FDB or
MDB lookup miss and dissected by the flow dissector to the
'FLOW_DISSECTOR_KEY_META' key.

The use of this skb extension is guarded by the 'tc_skb_ext_tc' static
key. As such, enable / disable this key when filters that match on layer
2 miss are added / deleted.

Tested:

 # cat tc_skb_ext_tc.py
 #!/usr/bin/env -S drgn -s vmlinux

 refcount = prog["tc_skb_ext_tc"].key.enabled.counter.value_()
 print(f"tc_skb_ext_tc reference count is {refcount}")

 # ./tc_skb_ext_tc.py
 tc_skb_ext_tc reference count is 0

 # tc filter add dev swp1 egress proto all handle 101 pref 1 flower src_mac 00:11:22:33:44:55 action drop
 # tc filter add dev swp1 egress proto all handle 102 pref 2 flower src_mac 00:11:22:33:44:55 l2_miss true action drop
 # tc filter add dev swp1 egress proto all handle 103 pref 3 flower src_mac 00:11:22:33:44:55 l2_miss false action drop

 # ./tc_skb_ext_tc.py
 tc_skb_ext_tc reference count is 2

 # tc filter replace dev swp1 egress proto all handle 102 pref 2 flower src_mac 00:01:02:03:04:05 l2_miss false action drop

 # ./tc_skb_ext_tc.py
 tc_skb_ext_tc reference count is 2

 # tc filter del dev swp1 egress proto all handle 103 pref 3 flower
 # tc filter del dev swp1 egress proto all handle 102 pref 2 flower
 # tc filter del dev swp1 egress proto all handle 101 pref 1 flower

 # ./tc_skb_ext_tc.py
 tc_skb_ext_tc reference count is 0

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-30 23:37:00 -07:00
Ido Schimmel
d5ccfd90df flow_dissector: Dissect layer 2 miss from tc skb extension
Extend the 'FLOW_DISSECTOR_KEY_META' key with a new 'l2_miss' field and
populate it from a field with the same name in the tc skb extension.
This field is set by the bridge driver for packets that incur an FDB or
MDB miss.

The next patch will extend the flower classifier to be able to match on
layer 2 misses.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-30 23:37:00 -07:00
Ido Schimmel
7b4858df3b skbuff: bridge: Add layer 2 miss indication
For EVPN non-DF (Designated Forwarder) filtering we need to be able to
prevent decapsulated traffic from being flooded to a multi-homed host.
Filtering of multicast and broadcast traffic can be achieved using the
following flower filter:

 # tc filter add dev bond0 egress pref 1 proto all flower indev vxlan0 dst_mac 01:00:00:00:00:00/01:00:00:00:00:00 action drop

Unlike broadcast and multicast traffic, it is not currently possible to
filter unknown unicast traffic. The classification into unknown unicast
is performed by the bridge driver, but is not visible to other layers
such as tc.

Solve this by adding a new 'l2_miss' bit to the tc skb extension. Clear
the bit whenever a packet enters the bridge (received from a bridge port
or transmitted via the bridge) and set it if the packet did not match an
FDB or MDB entry. If there is no skb extension and the bit needs to be
cleared, then do not allocate one as no extension is equivalent to the
bit being cleared. The bit is not set for broadcast packets as they
never perform a lookup and therefore never incur a miss.

A bit that is set for every flooded packet would also work for the
current use case, but it does not allow us to differentiate between
registered and unregistered multicast traffic, which might be useful in
the future.

To keep the performance impact to a minimum, the marking of packets is
guarded by the 'tc_skb_ext_tc' static key. When 'false', the skb is not
touched and an skb extension is not allocated. Instead, only a
5 bytes nop is executed, as demonstrated below for the call site in
br_handle_frame().

Before the patch:

```
        memset(skb->cb, 0, sizeof(struct br_input_skb_cb));
  c37b09:       49 c7 44 24 28 00 00    movq   $0x0,0x28(%r12)
  c37b10:       00 00

        p = br_port_get_rcu(skb->dev);
  c37b12:       49 8b 44 24 10          mov    0x10(%r12),%rax
        memset(skb->cb, 0, sizeof(struct br_input_skb_cb));
  c37b17:       49 c7 44 24 30 00 00    movq   $0x0,0x30(%r12)
  c37b1e:       00 00
  c37b20:       49 c7 44 24 38 00 00    movq   $0x0,0x38(%r12)
  c37b27:       00 00
```

After the patch (when static key is disabled):

```
        memset(skb->cb, 0, sizeof(struct br_input_skb_cb));
  c37c29:       49 c7 44 24 28 00 00    movq   $0x0,0x28(%r12)
  c37c30:       00 00
  c37c32:       49 8d 44 24 28          lea    0x28(%r12),%rax
  c37c37:       48 c7 40 08 00 00 00    movq   $0x0,0x8(%rax)
  c37c3e:       00
  c37c3f:       48 c7 40 10 00 00 00    movq   $0x0,0x10(%rax)
  c37c46:       00

#ifdef CONFIG_HAVE_JUMP_LABEL_HACK

static __always_inline bool arch_static_branch(struct static_key *key, bool branch)
{
        asm_volatile_goto("1:"
  c37c47:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
        br_tc_skb_miss_set(skb, false);

        p = br_port_get_rcu(skb->dev);
  c37c4c:       49 8b 44 24 10          mov    0x10(%r12),%rax
```

Subsequent patches will extend the flower classifier to be able to match
on the new 'l2_miss' bit and enable / disable the static key when
filters that match on it are added / deleted.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-30 23:37:00 -07:00
Jiri Pirko
216ba9f4ad devlink: move port_del() to devlink_port_ops
Move port_del() from devlink_ops into newly introduced devlink_port_ops.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-30 10:32:20 -07:00
Jiri Pirko
216aa67f3e devlink: move port_fn_state_get/set() to devlink_port_ops
Move port_fn_state_get/set() from devlink_ops into newly introduced
devlink_port_ops.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-30 10:32:20 -07:00
Jiri Pirko
4a490d7154 devlink: move port_fn_migratable_get/set() to devlink_port_ops
Move port_fn_migratable_get/set() from devlink_ops into newly introduced
devlink_port_ops.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-30 10:32:20 -07:00
Jiri Pirko
933c13275c devlink: move port_fn_roce_get/set() to devlink_port_ops
Move port_fn_roce_get/set() from devlink_ops into newly introduced
devlink_port_ops.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-30 10:32:20 -07:00
Jiri Pirko
71c93e37cf devlink: move port_fn_hw_addr_get/set() to devlink_port_ops
Move port_fn_hw_addr_get/set() from devlink_ops into newly introduced
devlink_port_ops.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Acked-by: Martin Habets <habetsm.xilinx@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-30 10:32:20 -07:00
Jiri Pirko
65a4c44bf9 devlink: move port_type_set() op into devlink_port_ops
Move port_type_set() from devlink_ops into newly introduced
devlink_port_ops.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-30 10:32:20 -07:00
Jiri Pirko
f58a3e4dfe devlink: move port_split/unsplit() ops into devlink_port_ops
Move port_split/unsplit() from devlink_ops into newly introduced
devlink_port_ops.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-30 10:32:20 -07:00
Jiri Pirko
6acdf43d8a devlink: introduce port ops placeholder
In devlink, some of the objects have separate ops registered alongside
with the object itself. Port however have ops in devlink_ops structure.
For drivers what register multiple kinds of ports with different ops
this is not convenient. Introduce devlink_port_ops and a set
of functions that allow drivers to pass ops pointer during
port registration.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-30 10:32:19 -07:00
Simon Horman
45402f04c5 devlink: Spelling corrections
Make some minor spelling corrections in comments.

Found by inspection.

Signed-off-by: Simon Horman <horms@kernel.org>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20230526-devlink-spelling-v1-1-9a3e36cdebc8@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-29 21:59:59 -07:00
Russell King (Oracle)
86b5f2d8cd net: pcs: lynx: add lynx_pcs_create_mdiodev()
Add lynx_pcs_create_mdiodev() to simplify the creation of the mdio
device associated with lynx PCS. In order to allow lynx_pcs_destroy()
to clean this up, we need to arrange for lynx_pcs_create() to take a
refcount on the mdiodev, and lynx_pcs_destroy() to put it.

Adding the refcounting to lynx_pcs_create()..lynx_pcs_destroy() will
be transparent to existing users of these interfaces.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Tested-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-29 21:46:53 -07:00
Russell King (Oracle)
9a5d500cff net: pcs: xpcs: add xpcs_create_mdiodev()
Add xpcs_create_mdiodev() to simplify the creation of the mdio device
associated with the XPCS. In order to allow xpcs_destroy() to clean
this up, we need to arrange for xpcs_create() to take a refcount on
the mdiodev, and xpcs_destroy() to put it.

Adding the refcounting to xpcs_create()..xpcs_destroy() will be
transparent to existing users of these interfaces.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-29 21:46:53 -07:00
Russell King (Oracle)
c4933fa88a net: mdio: add mdio_device_get() and mdio_device_put()
Add two new operations for a mdio device to manage the refcount on the
underlying struct device. This will be used by mdio PCS drivers to
simplify the creation and destruction handling, making it easier for
users to get it correct.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-29 21:46:53 -07:00
Jakub Kicinski
75455b906d Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2023-05-26

We've added 54 non-merge commits during the last 10 day(s) which contain
a total of 76 files changed, 2729 insertions(+), 1003 deletions(-).

The main changes are:

1) Add the capability to destroy sockets in BPF through a new kfunc,
   from Aditi Ghag.

2) Support O_PATH fds in BPF_OBJ_PIN and BPF_OBJ_GET commands,
   from Andrii Nakryiko.

3) Add capability for libbpf to resize datasec maps when backed via mmap,
   from JP Kobryn.

4) Move all the test kfuncs for CI out of the kernel and into bpf_testmod,
   from Jiri Olsa.

5) Big batch of xsk selftest improvements to prep for multi-buffer testing,
   from Magnus Karlsson.

6) Show the target_{obj,btf}_id in tracing link's fdinfo and dump it
   via bpftool, from Yafang Shao.

7) Various misc BPF selftest improvements to work with upcoming LLVM 17,
   from Yonghong Song.

8) Extend bpftool to specify netdevice for resolving XDP hints,
   from Larysa Zaremba.

9) Document masking in shift operations for the insn set document,
   from Dave Thaler.

10) Extend BPF selftests to check xdp_feature support for bond driver,
    from Lorenzo Bianconi.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (54 commits)
  bpf: Fix bad unlock balance on freeze_mutex
  libbpf: Ensure FD >= 3 during bpf_map__reuse_fd()
  libbpf: Ensure libbpf always opens files with O_CLOEXEC
  selftests/bpf: Check whether to run selftest
  libbpf: Change var type in datasec resize func
  bpf: drop unnecessary bpf_capable() check in BPF_MAP_FREEZE command
  libbpf: Selftests for resizing datasec maps
  libbpf: Add capability for resizing datasec maps
  selftests/bpf: Add path_fd-based BPF_OBJ_PIN and BPF_OBJ_GET tests
  libbpf: Add opts-based bpf_obj_pin() API and add support for path_fd
  bpf: Support O_PATH FDs in BPF_OBJ_PIN and BPF_OBJ_GET commands
  libbpf: Start v1.3 development cycle
  bpf: Validate BPF object in BPF_OBJ_PIN before calling LSM
  bpftool: Specify XDP Hints ifname when loading program
  selftests/bpf: Add xdp_feature selftest for bond device
  selftests/bpf: Test bpf_sock_destroy
  selftests/bpf: Add helper to get port using getsockname
  bpf: Add bpf_sock_destroy kfunc
  bpf: Add kfunc filter function to 'struct btf_kfunc_id_set'
  bpf: udp: Implement batching for sockets iterator
  ...
====================

Link: https://lore.kernel.org/r/20230526222747.17775-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-26 17:26:01 -07:00
Neal Cardwell
f26f03b303 tcp: remove unused TCP_SYNQ_INTERVAL definition
Currently TCP_SYNQ_INTERVAL is defined but never used.

According to "git log -S TCP_SYNQ_INTERVAL net-next/main" it seems
the last references to TCP_SYNQ_INTERVAL were removed by 2015
commit fa76ce7328 ("inet: get rid of central tcp/dccp listener timer")

Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-26 11:33:28 +01:00
Russell King (Oracle)
dd805cf3e8 net: dsa: add support for mac_prepare() and mac_finish() calls
Add DSA support for the phylink mac_prepare() and mac_finish() calls.
These were introduced as part of the PCS support to allow MACs to
perform preparatory steps prior to configuration, and finalisation
steps after the MAC and PCS has been configured.

Introducing phylink_pcs support to the mv88e6xxx DSA driver needs some
code moved out of its mac_config() stage into the mac_prepare() and
mac_finish() stages, and this commit facilitates such code in DSA
drivers.

Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-26 10:39:40 +01:00
Jakub Kicinski
d6f1e0bfe5 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-25 20:56:43 -07:00
Jakub Kicinski
d4031ec844 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

Conflicts:

net/ipv4/raw.c
  3632679d9e ("ipv{4,6}/raw: fix output xfrm lookup wrt protocol")
  c85be08fc4 ("raw: Stop using RTO_ONLINK.")
https://lore.kernel.org/all/20230525110037.2b532b83@canb.auug.org.au/

Adjacent changes:

drivers/net/ethernet/freescale/fec_main.c
  9025944fdd ("net: fec: add dma_wmb to ensure correct descriptor values")
  144470c88c ("net: fec: using the standard return codes when xdp xmit errors")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-25 19:57:39 -07:00
Linus Torvalds
50fb587e6a Merge tag 'net-6.4-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
 "Including fixes from bluetooth and bpf.

  Current release - regressions:

   - net: fix skb leak in __skb_tstamp_tx()

   - eth: mtk_eth_soc: fix QoS on DSA MAC on non MTK_NETSYS_V2 SoCs

  Current release - new code bugs:

   - handshake:
      - fix sock->file allocation
      - fix handshake_dup() ref counting

   - bluetooth:
      - fix potential double free caused by hci_conn_unlink
      - fix UAF in hci_conn_hash_flush

  Previous releases - regressions:

   - core: fix stack overflow when LRO is disabled for virtual
     interfaces

   - tls: fix strparser rx issues

   - bpf:
      - fix many sockmap/TCP related issues
      - fix a memory leak in the LRU and LRU_PERCPU hash maps
      - init the offload table earlier

   - eth: mlx5e:
      - do as little as possible in napi poll when budget is 0
      - fix using eswitch mapping in nic mode
      - fix deadlock in tc route query code

  Previous releases - always broken:

   - udplite: fix NULL pointer dereference in __sk_mem_raise_allocated()

   - raw: fix output xfrm lookup wrt protocol

   - smc: reset connection when trying to use SMCRv2 fails

   - phy: mscc: enable VSC8501/2 RGMII RX clock

   - eth: octeontx2-pf: fix TSOv6 offload

   - eth: cdc_ncm: deal with too low values of dwNtbOutMaxSize"

* tag 'net-6.4-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (79 commits)
  udplite: Fix NULL pointer dereference in __sk_mem_raise_allocated().
  net: phy: mscc: enable VSC8501/2 RGMII RX clock
  net: phy: mscc: remove unnecessary phydev locking
  net: phy: mscc: add support for VSC8501
  net: phy: mscc: add VSC8502 to MODULE_DEVICE_TABLE
  net/handshake: Enable the SNI extension to work properly
  net/handshake: Unpin sock->file if a handshake is cancelled
  net/handshake: handshake_genl_notify() shouldn't ignore @flags
  net/handshake: Fix uninitialized local variable
  net/handshake: Fix handshake_dup() ref counting
  net/handshake: Remove unneeded check from handshake_dup()
  ipv6: Fix out-of-bounds access in ipv6_find_tlv()
  net: ethernet: mtk_eth_soc: fix QoS on DSA MAC on non MTK_NETSYS_V2 SoCs
  docs: netdev: document the existence of the mail bot
  net: fix skb leak in __skb_tstamp_tx()
  r8169: Use a raw_spinlock_t for the register locks.
  page_pool: fix inconsistency for page_pool_ring_[un]lock()
  bpf, sockmap: Test progs verifier error with latest clang
  bpf, sockmap: Test FIONREAD returns correct bytes in rx buffer with drops
  bpf, sockmap: Test FIONREAD returns correct bytes in rx buffer
  ...
2023-05-25 10:55:26 -07:00
Linus Torvalds
eb03e31813 Merge tag 'for-v6.4-rc' of git://git.kernel.org/pub/scm/linux/kernel/git/sre/linux-power-supply
Pull power supply fixes from Sebastian Reichel:

 - Fix power_supply_get_battery_info for devices without parent devices
   resulting in NULL pointer dereference

 - Fix desktop systems reporting to run on battery once a power-supply
   device with device scope appears (e.g. a HID keyboard with a battery)

 - Ratelimit debug print about driver not providing data

 - Fix race condition related to external_power_changed in multiple
   drivers (ab8500, axp288, bq25890, sc27xx, bq27xxx)

 - Fix LED trigger switching from blinking to solid-on when charging
   finishes

 - Fix multiple races in bq27xxx battery driver

 - mt6360: handle potential ENOMEM from devm_work_autocancel

 - sbs-charger: Fix SBS_CHARGER_STATUS_CHARGE_INHIBITED bit

 - rt9467: avoid passing 0 to dev_err_probe

* tag 'for-v6.4-rc' of git://git.kernel.org/pub/scm/linux/kernel/git/sre/linux-power-supply: (21 commits)
  power: supply: Fix logic checking if system is running from battery
  power: supply: mt6360: add a check of devm_work_autocancel in mt6360_charger_probe
  power: supply: sbs-charger: Fix INHIBITED bit for Status reg
  power: supply: rt9467: Fix passing zero to 'dev_err_probe'
  power: supply: Ratelimit no data debug output
  power: supply: Fix power_supply_get_battery_info() if parent is NULL
  power: supply: bq24190: Call power_supply_changed() after updating input current
  power: supply: bq25890: Call power_supply_changed() after updating input current or voltage
  power: supply: bq27xxx: Use mod_delayed_work() instead of cancel() + schedule()
  power: supply: bq27xxx: After charger plug in/out wait 0.5s for things to stabilize
  power: supply: bq27xxx: Ensure power_supply_changed() is called on current sign changes
  power: supply: bq27xxx: Move bq27xxx_battery_update() down
  power: supply: bq27xxx: Add cache parameter to bq27xxx_battery_current_and_status()
  power: supply: bq27xxx: Fix poll_interval handling and races on remove
  power: supply: bq27xxx: Fix I2C IRQ race on remove
  power: supply: bq27xxx: Fix bq27xxx_battery_update() race condition
  power: supply: leds: Fix blink to LED on transition
  power: supply: sc27xx: Fix external_power_changed race
  power: supply: bq25890: Fix external_power_changed race
  power: supply: axp288_fuel_gauge: Fix external_power_changed race
  ...
2023-05-25 10:26:36 -07:00
Linus Torvalds
029c77f89a Merge tag 'sound-6.4-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
 "A collection of small fixes:

   - HD-audio runtime PM bug fix

   - A couple of HD-audio quirks

   - Fix series of ASoC Intel AVS drivers

   - ASoC DPCM fix for a bug found on new Intel systems

   - A few other ASoC device-specific small fixes"

* tag 'sound-6.4-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
  ALSA: hda/realtek: Enable headset onLenovo M70/M90
  ASoC: dwc: move DMA init to snd_soc_dai_driver probe()
  ASoC: cs35l41: Fix default regmap values for some registers
  ALSA: hda: Fix unhandled register update during auto-suspend period
  ASoC: dt-bindings: tlv320aic32x4: Fix supply names
  ASoC: Intel: avs: Add missing checks on FE startup
  ASoC: Intel: avs: Fix avs_path_module::instance_id size
  ASoC: Intel: avs: Account for UID of ACPI device
  ASoC: Intel: avs: Fix declaration of enum avs_channel_config
  ASoC: Intel: Skylake: Fix declaration of enum skl_ch_cfg
  ASoC: Intel: avs: Access path components under lock
  ASoC: Intel: avs: Fix module lookup
  ALSA: hda/ca0132: add quirk for EVGA X299 DARK
  ASoC: soc-pcm: test if a BE can be prepared
  ASoC: rt5682: Disable jack detection interrupt during suspend
  ASoC: lpass: Fix for KASAN use_after_free out of bounds
2023-05-25 09:48:23 -07:00
Antoine Tenart
c0a8966e2b net: ipv4: use consistent txhash in TIME_WAIT and SYN_RECV
When using IPv4/TCP, skb->hash comes from sk->sk_txhash except in
TIME_WAIT and SYN_RECV where it's not set in the reply skb from
ip_send_unicast_reply. Those packets will have a mismatched hash with
others from the same flow as their hashes will be 0. IPv6 does not have
the same issue as the hash is set from the socket txhash in those cases.

This commits sets the hash in the reply skb from ip_send_unicast_reply,
which makes the IPv4 code behaving like IPv6.

Signed-off-by: Antoine Tenart <atenart@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-05-25 13:20:45 +02:00
Chuck Lever
26fb5480a2 net/handshake: Enable the SNI extension to work properly
Enable the upper layer protocol to specify the SNI peername. This
avoids the need for tlshd to use a DNS lookup, which can return a
hostname that doesn't match the incoming certificate's SubjectName.

Fixes: 2fd5532044 ("net/handshake: Add a kernel API for requesting a TLSv1.3 handshake")
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-24 22:05:24 -07:00
Jakub Kicinski
0c615f1cc3 Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:

====================
pull-request: bpf 2023-05-24

We've added 19 non-merge commits during the last 10 day(s) which contain
a total of 20 files changed, 738 insertions(+), 448 deletions(-).

The main changes are:

1) Batch of BPF sockmap fixes found when running against NGINX TCP tests,
   from John Fastabend.

2) Fix a memleak in the LRU{,_PERCPU} hash map when bucket locking fails,
   from Anton Protopopov.

3) Init the BPF offload table earlier than just late_initcall,
   from Jakub Kicinski.

4) Fix ctx access mask generation for 32-bit narrow loads of 64-bit fields,
   from Will Deacon.

5) Remove a now unsupported __fallthrough in BPF samples,
   from Andrii Nakryiko.

6) Fix a typo in pkg-config call for building sign-file,
   from Jeremy Sowden.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  bpf, sockmap: Test progs verifier error with latest clang
  bpf, sockmap: Test FIONREAD returns correct bytes in rx buffer with drops
  bpf, sockmap: Test FIONREAD returns correct bytes in rx buffer
  bpf, sockmap: Test shutdown() correctly exits epoll and recv()=0
  bpf, sockmap: Build helper to create connected socket pair
  bpf, sockmap: Pull socket helpers out of listen test for general use
  bpf, sockmap: Incorrectly handling copied_seq
  bpf, sockmap: Wake up polling after data copy
  bpf, sockmap: TCP data stall on recv before accept
  bpf, sockmap: Handle fin correctly
  bpf, sockmap: Improved check for empty queue
  bpf, sockmap: Reschedule is now done through backlog
  bpf, sockmap: Convert schedule_work into delayed_work
  bpf, sockmap: Pass skb ownership through read_skb
  bpf: fix a memory leak in the LRU and LRU_PERCPU hash maps
  bpf: Fix mask generation for 32-bit narrow loads of 64-bit fields
  samples/bpf: Drop unnecessary fallthrough
  bpf: netdev: init the offload table earlier
  selftests/bpf: Fix pkg-config call building sign-file
====================

Link: https://lore.kernel.org/r/20230524170839.13905-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-24 21:57:57 -07:00
Russell King (Oracle)
dad987484e net: phylink: add function to resolve clause 73 negotiation
Add a function to resolve clause 73 negotiation according to the
priority resolution function described in clause 73.3.6.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-24 09:13:22 -07:00
Russell King (Oracle)
e9261467ae net: mdio: add clause 73 to ethtool conversion helper
Add a helper to convert a clause 73 advertisement to an ethtool bitmap.

Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-24 09:13:22 -07:00
Jiri Pirko
9277649c66 devlink: pass devlink_port pointer to ops->port_del() instead of index
Historically there was a reason why port_dev() along with for example
port_split() did get port_index instead of the devlink_port pointer.
With the locking changes that were done which ensured devlink instance
mutex is hold for every command, the port ops could get devlink_port
pointer directly. Change the forgotten port_dev() op to be as others
and pass devlink_port pointer instead of port_index.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-24 10:34:26 +01:00
Jiri Pirko
1bb1b57898 devlink: remove no longer true locking comment from port_new/del()
All commands are called holding instance lock. Remove the outdated
comment that says otherwise.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-24 10:34:26 +01:00
Jiri Pirko
c496daeb86 devlink: remove duplicate port notification
The notification about created port is send from devl_port_register()
function called from ops->port_new(). No need to send it again here,
so remove the call and the helper function.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-24 10:34:26 +01:00
David S. Miller
ba46c96db9 Merge tag 'mlx5-fixes-2023-05-22' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
Saeed Mahameed says:

====================
mlx5-fixes-2023-05-22

This series provides bug fixes for the mlx5 driver.
Please pull and let me know if there is any problem.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-24 08:40:14 +01:00
Russell King (Oracle)
59088b5a94 net: phy: avoid kernel warning dump when stopping an errored PHY
When taking a network interface down (or removing a SFP module) after
the PHY has encountered an error, phy_stop() complains incorrectly
that it was called from HALTED state.

The reason this is incorrect is that the network driver will have
called phy_start() when the interface was brought up, and the fact
that the PHY has a problem bears no relationship to the administrative
state of the interface. Taking the interface administratively down
(which calls phy_stop()) is always the right thing to do after a
successful phy_start() call, whether or not the PHY has encountered
an error.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-24 08:28:27 +01:00
Guillaume Nault
c85be08fc4 raw: Stop using RTO_ONLINK.
Use ip_sendmsg_scope() to properly initialise the scope in
flowi4_init_output(), instead of overriding tos with the RTO_ONLINK
flag. The objective is to eventually remove RTO_ONLINK, which will
allow converting .flowi4_tos to dscp_t.

The MSG_DONTROUTE and SOCK_LOCALROUTE cases were already handled by
raw_sendmsg() (SOCK_LOCALROUTE was handled by the RT_CONN_FLAGS*()
macros called by get_rtconn_flags()). However, opt.is_strictroute
wasn't taken into account. Therefore, a side effect of this patch is to
now honour opt.is_strictroute, and thus align raw_sendmsg() with
ping_v4_sendmsg() and udp_sendmsg().

Since raw_sendmsg() was the only user of get_rtconn_flags(), we can now
remove this function.

Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-24 08:22:06 +01:00
Guillaume Nault
726de790f6 ping: Stop using RTO_ONLINK.
Define a new helper to figure out the correct route scope to use on TX,
depending on socket configuration, ancillary data and send flags.

Use this new helper to properly initialise the scope in
flowi4_init_output(), instead of overriding tos with the RTO_ONLINK
flag.

The objective is to eventually remove RTO_ONLINK, which will allow
converting .flowi4_tos to dscp_t.

Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-24 08:22:06 +01:00
David Howells
c49cf26632 ip: Remove ip_append_page()
ip_append_page() is no longer used with the removal of udp_sendpage(), so
remove it.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
cc: David Ahern <dsahern@kernel.org>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-23 20:48:27 -07:00
David Howells
5367f9bbb8 tcp: Fold do_tcp_sendpages() into tcp_sendpage_locked()
Fold do_tcp_sendpages() into its last remaining caller,
tcp_sendpage_locked().

Signed-off-by: David Howells <dhowells@redhat.com>
cc: David Ahern <dsahern@kernel.org>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-23 20:48:27 -07:00
David Howells
e117dcfd64 tls: Inline do_tcp_sendpages()
do_tcp_sendpages() is now just a small wrapper around tcp_sendmsg_locked(),
so inline it, allowing do_tcp_sendpages() to be removed.  This is part of
replacing ->sendpage() with a call to sendmsg() with MSG_SPLICE_PAGES set.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Boris Pismenny <borisp@nvidia.com>
cc: John Fastabend <john.fastabend@gmail.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-23 20:48:27 -07:00
David Howells
2e910b9532 net: Add a function to splice pages into an skbuff for MSG_SPLICE_PAGES
Add a function to handle MSG_SPLICE_PAGES being passed internally to
sendmsg().  Pages are spliced into the given socket buffer if possible and
copied in if not (e.g. they're slab pages or have a zero refcount).

Signed-off-by: David Howells <dhowells@redhat.com>
cc: David Ahern <dsahern@kernel.org>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-23 20:48:27 -07:00
David Howells
96449f9024 net: Pass max frags into skb_append_pagefrags()
Pass the maximum number of fragments into skb_append_pagefrags() rather
than using MAX_SKB_FRAGS so that it can be used from code that wants to
specify sysctl_max_skb_frags.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: David Ahern <dsahern@kernel.org>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-23 20:48:27 -07:00
David Howells
b841b901c4 net: Declare MSG_SPLICE_PAGES internal sendmsg() flag
Declare MSG_SPLICE_PAGES, an internal sendmsg() flag, that hints to a
network protocol that it should splice pages from the source iterator
rather than copying the data if it can.  This flag is added to a list that
is cleared by sendmsg syscalls on entry.

This is intended as a replacement for the ->sendpage() op, allowing a way
to splice in several multipage folios in one go.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-23 20:48:27 -07:00
Yunsheng Lin
368d3cb406 page_pool: fix inconsistency for page_pool_ring_[un]lock()
page_pool_ring_[un]lock() use in_softirq() to decide which
spin lock variant to use, and when they are called in the
context with in_softirq() being false, spin_lock_bh() is
called in page_pool_ring_lock() while spin_unlock() is
called in page_pool_ring_unlock(), because spin_lock_bh()
has disabled the softirq in page_pool_ring_lock(), which
causes inconsistency for spin lock pair calling.

This patch fixes it by returning in_softirq state from
page_pool_producer_lock(), and use it to decide which
spin lock variant to use in page_pool_producer_unlock().

As pool->ring has both producer and consumer lock, so
rename it to page_pool_producer_[un]lock() to reflect
the actual usage. Also move them to page_pool.c as they
are only used there, and remove the 'inline' as the
compiler may have better idea to do inlining or not.

Fixes: 7886244736 ("net: page_pool: Add bulk support for ptr_ring")
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Link: https://lore.kernel.org/r/20230522031714.5089-1-linyunsheng@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-23 20:25:13 -07:00
Andrii Nakryiko
cb8edce280 bpf: Support O_PATH FDs in BPF_OBJ_PIN and BPF_OBJ_GET commands
Current UAPI of BPF_OBJ_PIN and BPF_OBJ_GET commands of bpf() syscall
forces users to specify pinning location as a string-based absolute or
relative (to current working directory) path. This has various
implications related to security (e.g., symlink-based attacks), forces
BPF FS to be exposed in the file system, which can cause races with
other applications.

One of the feedbacks we got from folks working with containers heavily
was that inability to use purely FD-based location specification was an
unfortunate limitation and hindrance for BPF_OBJ_PIN and BPF_OBJ_GET
commands. This patch closes this oversight, adding path_fd field to
BPF_OBJ_PIN and BPF_OBJ_GET UAPI, following conventions established by
*at() syscalls for dirfd + pathname combinations.

This now allows interesting possibilities like working with detached BPF
FS mount (e.g., to perform multiple pinnings without running a risk of
someone interfering with them), and generally making pinning/getting
more secure and not prone to any races and/or security attacks.

This is demonstrated by a selftest added in subsequent patch that takes
advantage of new mount APIs (fsopen, fsconfig, fsmount) to demonstrate
creating detached BPF FS mount, pinning, and then getting BPF map out of
it, all while never exposing this private instance of BPF FS to outside
worlds.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/bpf/20230523170013.728457-4-andrii@kernel.org
2023-05-23 23:31:42 +02:00
John Fastabend
e5c6de5fa0 bpf, sockmap: Incorrectly handling copied_seq
The read_skb() logic is incrementing the tcp->copied_seq which is used for
among other things calculating how many outstanding bytes can be read by
the application. This results in application errors, if the application
does an ioctl(FIONREAD) we return zero because this is calculated from
the copied_seq value.

To fix this we move tcp->copied_seq accounting into the recv handler so
that we update these when the recvmsg() hook is called and data is in
fact copied into user buffers. This gives an accurate FIONREAD value
as expected and improves ACK handling. Before we were calling the
tcp_rcv_space_adjust() which would update 'number of bytes copied to
user in last RTT' which is wrong for programs returning SK_PASS. The
bytes are only copied to the user when recvmsg is handled.

Doing the fix for recvmsg is straightforward, but fixing redirect and
SK_DROP pkts is a bit tricker. Build a tcp_psock_eat() helper and then
call this from skmsg handlers. This fixes another issue where a broken
socket with a BPF program doing a resubmit could hang the receiver. This
happened because although read_skb() consumed the skb through sock_drop()
it did not update the copied_seq. Now if a single reccv socket is
redirecting to many sockets (for example for lb) the receiver sk will be
hung even though we might expect it to continue. The hang comes from
not updating the copied_seq numbers and memory pressure resulting from
that.

We have a slight layer problem of calling tcp_eat_skb even if its not
a TCP socket. To fix we could refactor and create per type receiver
handlers. I decided this is more work than we want in the fix and we
already have some small tweaks depending on caller that use the
helper skb_bpf_strparser(). So we extend that a bit and always set
the strparser bit when it is in use and then we can gate the
seq_copied updates on this.

Fixes: 04919bed94 ("tcp: Introduce tcp_read_skb()")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20230523025618.113937-9-john.fastabend@gmail.com
2023-05-23 16:10:42 +02:00
John Fastabend
405df89dd5 bpf, sockmap: Improved check for empty queue
We noticed some rare sk_buffs were stepping past the queue when system was
under memory pressure. The general theory is to skip enqueueing
sk_buffs when its not necessary which is the normal case with a system
that is properly provisioned for the task, no memory pressure and enough
cpu assigned.

But, if we can't allocate memory due to an ENOMEM error when enqueueing
the sk_buff into the sockmap receive queue we push it onto a delayed
workqueue to retry later. When a new sk_buff is received we then check
if that queue is empty. However, there is a problem with simply checking
the queue length. When a sk_buff is being processed from the ingress queue
but not yet on the sockmap msg receive queue its possible to also recv
a sk_buff through normal path. It will check the ingress queue which is
zero and then skip ahead of the pkt being processed.

Previously we used sock lock from both contexts which made the problem
harder to hit, but not impossible.

To fix instead of popping the skb from the queue entirely we peek the
skb from the queue and do the copy there. This ensures checks to the
queue length are non-zero while skb is being processed. Then finally
when the entire skb has been copied to user space queue or another
socket we pop it off the queue. This way the queue length check allows
bypassing the queue only after the list has been completely processed.

To reproduce issue we run NGINX compliance test with sockmap running and
observe some flakes in our testing that we attributed to this issue.

Fixes: 04919bed94 ("tcp: Introduce tcp_read_skb()")
Suggested-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: William Findlay <will@isovalent.com>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20230523025618.113937-5-john.fastabend@gmail.com
2023-05-23 16:10:11 +02:00
John Fastabend
29173d07f7 bpf, sockmap: Convert schedule_work into delayed_work
Sk_buffs are fed into sockmap verdict programs either from a strparser
(when the user might want to decide how framing of skb is done by attaching
another parser program) or directly through tcp_read_sock. The
tcp_read_sock is the preferred method for performance when the BPF logic is
a stream parser.

The flow for Cilium's common use case with a stream parser is,

 tcp_read_sock()
  sk_psock_verdict_recv
    ret = bpf_prog_run_pin_on_cpu()
    sk_psock_verdict_apply(sock, skb, ret)
     // if system is under memory pressure or app is slow we may
     // need to queue skb. Do this queuing through ingress_skb and
     // then kick timer to wake up handler
     skb_queue_tail(ingress_skb, skb)
     schedule_work(work);

The work queue is wired up to sk_psock_backlog(). This will then walk the
ingress_skb skb list that holds our sk_buffs that could not be handled,
but should be OK to run at some later point. However, its possible that
the workqueue doing this work still hits an error when sending the skb.
When this happens the skbuff is requeued on a temporary 'state' struct
kept with the workqueue. This is necessary because its possible to
partially send an skbuff before hitting an error and we need to know how
and where to restart when the workqueue runs next.

Now for the trouble, we don't rekick the workqueue. This can cause a
stall where the skbuff we just cached on the state variable might never
be sent. This happens when its the last packet in a flow and no further
packets come along that would cause the system to kick the workqueue from
that side.

To fix we could do simple schedule_work(), but while under memory pressure
it makes sense to back off some instead of continue to retry repeatedly. So
instead to fix convert schedule_work to schedule_delayed_work and add
backoff logic to reschedule from backlog queue on errors. Its not obvious
though what a good backoff is so use '1'.

To test we observed some flakes whil running NGINX compliance test with
sockmap we attributed these failed test to this bug and subsequent issue.

>From on list discussion. This commit

 bec217197b41("skmsg: Schedule psock work if the cached skb exists on the psock")

was intended to address similar race, but had a couple cases it missed.
Most obvious it only accounted for receiving traffic on the local socket
so if redirecting into another socket we could still get an sk_buff stuck
here. Next it missed the case where copied=0 in the recv() handler and
then we wouldn't kick the scheduler. Also its sub-optimal to require
userspace to kick the internal mechanisms of sockmap to wake it up and
copy data to user. It results in an extra syscall and requires the app
to actual handle the EAGAIN correctly.

Fixes: 04919bed94 ("tcp: Introduce tcp_read_skb()")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: William Findlay <will@isovalent.com>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20230523025618.113937-3-john.fastabend@gmail.com
2023-05-23 16:09:56 +02:00