Currently if BBR congestion control is initialized after more than 2B
packets have been delivered, depending on the phase of the
tp->delivered counter the tracking of BBR round trips can get stuck.
The bug arises because if tp->delivered is between 2^31 and 2^32 at
the time the BBR congestion control module is initialized, then the
initialization of bbr->next_rtt_delivered to 0 will cause the logic to
believe that the end of the round trip is still billions of packets in
the future. More specifically, the following check will fail
repeatedly:
!before(rs->prior_delivered, bbr->next_rtt_delivered)
and thus the connection will take up to 2B packets delivered before
that check will pass and the connection will set:
bbr->round_start = 1;
This could cause many mechanisms in BBR to fail to trigger, for
example bbr_check_full_bw_reached() would likely never exit STARTUP.
This bug is 5 years old and has not been observed, and as a practical
matter this would likely rarely trigger, since it would require
transferring at least 2B packets, or likely more than 3 terabytes of
data, before switching congestion control algorithms to BBR.
This patch is a stable candidate for kernels as far back as v4.9,
when tcp_bbr.c was added.
Fixes: 0f8782ea14 ("tcp_bbr: add BBR congestion control")
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Kevin Yang <yyd@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20210811024056.235161-1-ncardwell@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
After migrating my laptop from 4.19-LTS to 5.4-LTS a while ago I noticed
that my Ethernet port to which a bond and a VLAN interface are attached
appeared to remain up after resuming from suspend with the cable unplugged
(and that problem still persists with 5.10-LTS).
It happens that the following happens:
- the network driver (e1000e here) prepares to suspend, calls e1000e_down()
which calls netif_carrier_off() to signal that the link is going down.
- netif_carrier_off() adds a link_watch event to the list of events for
this device
- the device is completely stopped.
- the machine suspends
- the cable is unplugged and the machine brought to another location
- the machine is resumed
- the queued linkwatch events are processed for the device
- the device doesn't yet have the __LINK_STATE_PRESENT bit and its events
are silently dropped
- the device is resumed with its link down
- the upper VLAN and bond interfaces are never notified that the link had
been turned down and remain up
- the only way to provoke a change is to physically connect the machine
to a port and possibly unplug it.
The state after resume looks like this:
$ ip -br li | egrep 'bond|eth'
bond0 UP e8:6a:64:64:64:64 <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP>
eth0 DOWN e8:6a:64:64:64:64 <NO-CARRIER,BROADCAST,MULTICAST,SLAVE,UP>
eth0.2@eth0 UP e8:6a:64:64:64:64 <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP>
Placing an explicit call to netdev_state_change() either in the suspend
or the resume code in the NIC driver worked around this but the solution
is not satisfying.
The issue in fact really is in link_watch that loses events while it
ought not to. It happens that the test for the device being present was
added by commit 124eee3f69 ("net: linkwatch: add check for netdevice
being present to linkwatch_do_dev") in 4.20 to avoid an access to
devices that are not present.
Instead of dropping events, this patch proceeds slightly differently by
postponing their handling so that they happen after the device is fully
resumed.
Fixes: 124eee3f69 ("net: linkwatch: add check for netdevice being present to linkwatch_do_dev")
Link: https://lists.openwall.net/netdev/2018/03/15/62
Cc: Heiner Kallweit <hkallweit1@gmail.com>
Cc: Geert Uytterhoeven <geert+renesas@glider.be>
Cc: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>
Link: https://lore.kernel.org/r/20210809160628.22623-1-w@1wt.eu
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
A recent change in LLVM causes module_{c,d}tor sections to appear when
CONFIG_K{A,C}SAN are enabled, which results in orphan section warnings
because these are not handled anywhere:
ld.lld: warning: arch/x86/pci/built-in.a(legacy.o):(.text.asan.module_ctor) is being placed in '.text.asan.module_ctor'
ld.lld: warning: arch/x86/pci/built-in.a(legacy.o):(.text.asan.module_dtor) is being placed in '.text.asan.module_dtor'
ld.lld: warning: arch/x86/pci/built-in.a(legacy.o):(.text.tsan.module_ctor) is being placed in '.text.tsan.module_ctor'
Fangrui explains: "the function asan.module_ctor has the SHF_GNU_RETAIN
flag, so it is in a separate section even with -fno-function-sections
(default)".
Place them in the TEXT_TEXT section so that these technologies continue
to work with the newer compiler versions. All of the KASAN and KCSAN
KUnit tests continue to pass after this change.
Cc: stable@vger.kernel.org
Link: https://github.com/ClangBuiltLinux/linux/issues/1432
Link: 7b78956224
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Reviewed-by: Fangrui Song <maskray@google.com>
Acked-by: Marco Elver <elver@google.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20210731023107.1932981-1-nathan@kernel.org
Pull ARC fixes from Vineet Gupta:
- Fix FPU_STATUS update
- Update my email address
- Other spellos and fixes
* tag 'arc-5.14-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc:
MAINTAINERS: update Vineet's email address
ARC: fp: set FPU_STATUS.FWE to enable FPU_STATUS update on context switch
ARC: Fix CONFIG_STACKDEPOT
arc: Fix spelling mistake and grammar in Kconfig
arc: Prefer unsigned int to bare use of unsigned
The blamed commit added a new field to struct switchdev_notifier_fdb_info,
but did not make sure that all call paths set it to something valid.
For example, a switchdev driver may emit a SWITCHDEV_FDB_ADD_TO_BRIDGE
notifier, and since the 'is_local' flag is not set, it contains junk
from the stack, so the bridge might interpret those notifications as
being for local FDB entries when that was not intended.
To avoid that now and in the future, zero-initialize all
switchdev_notifier_fdb_info structures created by drivers such that all
newly added fields to not need to touch drivers again.
Fixes: 2c4eca3ef7 ("net: bridge: switchdev: include local flag in FDB notifications")
Reported-by: Ido Schimmel <idosch@idosch.org>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Tested-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Karsten Graul <kgraul@linux.ibm.com>
Link: https://lore.kernel.org/r/20210810115024.1629983-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Ignore fdb flags when adding port extern learn entries and always set
BR_FDB_LOCAL flag when adding bridge extern learn entries. This is
closest to the behaviour we had before and avoids breaking any use cases
which were allowed.
This patch fixes iproute2 calls which assume NUD_PERMANENT and were
allowed before, example:
$ bridge fdb add 00:11:22:33:44:55 dev swp1 extern_learn
Extern learn entries are allowed to roam, but do not expire, so static
or dynamic flags make no sense for them.
Also add a comment for future reference.
Fixes: eb100e0e24 ("net: bridge: allow to add externally learned entries from user-space")
Fixes: 0541a62932 ("net: bridge: validate the NUD_PERMANENT bit when adding an extern_learn FDB entry")
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Tested-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://lore.kernel.org/r/20210810110010.43859-1-razor@blackwall.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pull x86 platform driver fixes from Hans de Goede:
"Small set of pdx86 fixes for 5.14"
* tag 'platform-drivers-x86-v5.14-3' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86:
platform/x86: pcengines-apuv2: Add missing terminating entries to gpio-lookup tables
platform/x86: Make dual_accel_detect() KIOX010A + KIOX020A detect more robust
platform/x86: Add and use a dual_accel_detect() helper
Pull overlayfs fixes from Miklos Szeredi:
"Fix several bugs in overlayfs"
* tag 'ovl-fixes-5.14-rc6-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
ovl: prevent private clone if bind mount is not allowed
ovl: fix uninitialized pointer read in ovl_lookup_real_one()
ovl: fix deadlock in splice write
ovl: skip stale entries in merge dir cache iteration
Daniel Borkmann says:
====================
bpf 2021-08-10
We've added 5 non-merge commits during the last 2 day(s) which contain
a total of 7 files changed, 27 insertions(+), 15 deletions(-).
1) Fix missing bpf_read_lock_trace() context for BPF loader progs, from Yonghong Song.
2) Fix corner case where BPF prog retrieves wrong local storage, also from Yonghong Song.
3) Restrict availability of BPF write_user helper behind lockdown, from Daniel Borkmann.
4) Fix multiple kernel-doc warnings in BPF core, from Randy Dunlap.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
bpf, core: Fix kernel-doc notation
bpf: Fix potentially incorrect results with bpf_get_local_storage()
bpf: Add missing bpf_read_[un]lock_trace() for syscall program
bpf: Add lockdown check for probe_write_user helper
bpf: Add _kernel suffix to internal lockdown_bpf_read
====================
Link: https://lore.kernel.org/r/20210810144025.22814-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Vladimir Oltean says:
====================
Fix broken backpressure during FDB dump in DSA drivers
rtnl_fdb_dump() has logic to split a dump of PF_BRIDGE neighbors into
multiple netlink skbs if the buffer provided by user space is too small
(one buffer will typically handle a few hundred FDB entries).
When the current buffer becomes full, nlmsg_put() in
dsa_slave_port_fdb_do_dump() returns -EMSGSIZE and DSA saves the index
of the last dumped FDB entry, returns to rtnl_fdb_dump() up to that
point, and then the dump resumes on the same port with a new skb, and
FDB entries up to the saved index are simply skipped.
Since dsa_slave_port_fdb_do_dump() is pointed to by the "cb" passed to
drivers, then drivers must check for the -EMSGSIZE error code returned
by it. Otherwise, when a netlink skb becomes full, DSA will no longer
save newly dumped FDB entries to it, but the driver will continue
dumping. So FDB entries will be missing from the dump.
DSA is one of the few switchdev drivers that have an .ndo_fdb_dump
implementation, because of the assumption that the hardware and software
FDBs cannot be efficiently kept in sync via SWITCHDEV_FDB_ADD_TO_BRIDGE.
Other drivers with a home-cooked .ndo_fdb_dump implementation are
ocelot and dpaa2-switch. These appear to do the correct thing, as do the
other DSA drivers, so nothing else appears to need fixing.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
rtnl_fdb_dump() has logic to split a dump of PF_BRIDGE neighbors into
multiple netlink skbs if the buffer provided by user space is too small
(one buffer will typically handle a few hundred FDB entries).
When the current buffer becomes full, nlmsg_put() in
dsa_slave_port_fdb_do_dump() returns -EMSGSIZE and DSA saves the index
of the last dumped FDB entry, returns to rtnl_fdb_dump() up to that
point, and then the dump resumes on the same port with a new skb, and
FDB entries up to the saved index are simply skipped.
Since dsa_slave_port_fdb_do_dump() is pointed to by the "cb" passed to
drivers, then drivers must check for the -EMSGSIZE error code returned
by it. Otherwise, when a netlink skb becomes full, DSA will no longer
save newly dumped FDB entries to it, but the driver will continue
dumping. So FDB entries will be missing from the dump.
Fix the broken backpressure by propagating the "cb" return code and
allow rtnl_fdb_dump() to restart the FDB dump with a new skb.
Fixes: 291d1e72b7 ("net: dsa: sja1105: Add support for FDB and MDB management")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
rtnl_fdb_dump() has logic to split a dump of PF_BRIDGE neighbors into
multiple netlink skbs if the buffer provided by user space is too small
(one buffer will typically handle a few hundred FDB entries).
When the current buffer becomes full, nlmsg_put() in
dsa_slave_port_fdb_do_dump() returns -EMSGSIZE and DSA saves the index
of the last dumped FDB entry, returns to rtnl_fdb_dump() up to that
point, and then the dump resumes on the same port with a new skb, and
FDB entries up to the saved index are simply skipped.
Since dsa_slave_port_fdb_do_dump() is pointed to by the "cb" passed to
drivers, then drivers must check for the -EMSGSIZE error code returned
by it. Otherwise, when a netlink skb becomes full, DSA will no longer
save newly dumped FDB entries to it, but the driver will continue
dumping. So FDB entries will be missing from the dump.
Fix the broken backpressure by propagating the "cb" return code and
allow rtnl_fdb_dump() to restart the FDB dump with a new skb.
Fixes: 58c59ef9e9 ("net: dsa: lantiq: Add Forwarding Database access")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
rtnl_fdb_dump() has logic to split a dump of PF_BRIDGE neighbors into
multiple netlink skbs if the buffer provided by user space is too small
(one buffer will typically handle a few hundred FDB entries).
When the current buffer becomes full, nlmsg_put() in
dsa_slave_port_fdb_do_dump() returns -EMSGSIZE and DSA saves the index
of the last dumped FDB entry, returns to rtnl_fdb_dump() up to that
point, and then the dump resumes on the same port with a new skb, and
FDB entries up to the saved index are simply skipped.
Since dsa_slave_port_fdb_do_dump() is pointed to by the "cb" passed to
drivers, then drivers must check for the -EMSGSIZE error code returned
by it. Otherwise, when a netlink skb becomes full, DSA will no longer
save newly dumped FDB entries to it, but the driver will continue
dumping. So FDB entries will be missing from the dump.
Fix the broken backpressure by propagating the "cb" return code and
allow rtnl_fdb_dump() to restart the FDB dump with a new skb.
Fixes: ab335349b8 ("net: dsa: lan9303: Add port_fast_age and port_fdb_dump methods")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
rtnl_fdb_dump() has logic to split a dump of PF_BRIDGE neighbors into
multiple netlink skbs if the buffer provided by user space is too small
(one buffer will typically handle a few hundred FDB entries).
When the current buffer becomes full, nlmsg_put() in
dsa_slave_port_fdb_do_dump() returns -EMSGSIZE and DSA saves the index
of the last dumped FDB entry, returns to rtnl_fdb_dump() up to that
point, and then the dump resumes on the same port with a new skb, and
FDB entries up to the saved index are simply skipped.
Since dsa_slave_port_fdb_do_dump() is pointed to by the "cb" passed to
drivers, then drivers must check for the -EMSGSIZE error code returned
by it. Otherwise, when a netlink skb becomes full, DSA will no longer
save newly dumped FDB entries to it, but the driver will continue
dumping. So FDB entries will be missing from the dump.
Fix the broken backpressure by propagating the "cb" return code and
allow rtnl_fdb_dump() to restart the FDB dump with a new skb.
Fixes: e4b27ebc78 ("net: dsa: Add DSA driver for Hirschmann Hellcreek switches")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Acked-by: Kurt Kanzenbach <kurt@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix kernel-doc warnings in kernel/bpf/core.c (found by scripts/kernel-doc
and W=1 builds). That is, correct a function name in a comment and add
return descriptions for 2 functions.
Fixes these kernel-doc warnings:
kernel/bpf/core.c:1372: warning: expecting prototype for __bpf_prog_run(). Prototype was for ___bpf_prog_run() instead
kernel/bpf/core.c:1372: warning: No description found for return value of '___bpf_prog_run'
kernel/bpf/core.c:1883: warning: No description found for return value of 'bpf_prog_select_runtime'
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20210809215229.7556-1-rdunlap@infradead.org
Ben Hutchings says:
====================
ksz8795 VLAN fixes
This series fixes a number of bugs in the ksz8795 driver that affect
VLAN filtering, tag insertion, and tag removal.
I've tested these on the KSZ8795CLXD evaluation board, and checked the
register usage against the datasheets for the other supported chips.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The magic number 4 in VLAN table lookup was the number of entries we
can read and write at once. Using phy_port_cnt here doesn't make
sense and presumably broke VLAN filtering for 3-port switches. Change
it back to 4.
Fixes: 4ce2a984ab ("net: dsa: microchip: ksz8795: use phy_port_cnt ...")
Signed-off-by: Ben Hutchings <ben.hutchings@mind.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently ksz8_port_vlan_filtering() sets or clears the VLAN Enable
hardware flag. That controls discarding of packets with a VID that
has not been enabled for any port on the switch.
Since it is a global flag, set the dsa_switch::vlan_filtering_is_global
flag so that the DSA core understands this can't be controlled per
port.
When VLAN filtering is enabled, the switch should also discard packets
with a VID that's not enabled on the ingress port. Set or clear each
external port's VLAN Ingress Filter flag in ksz8_port_vlan_filtering()
to make that happen.
Fixes: e66f840c08 ("net: dsa: ksz: Add Microchip KSZ8795 DSA driver")
Signed-off-by: Ben Hutchings <ben.hutchings@mind.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
On the CPU port, we can support both tagged and untagged VLANs at the
same time by doing any necessary untagging in software rather than
hardware. To enable that, keep the CPU port's Remove Tag flag cleared
and set the dsa_switch::untag_bridge_pvid flag.
Fixes: e66f840c08 ("net: dsa: ksz: Add Microchip KSZ8795 DSA driver")
Signed-off-by: Ben Hutchings <ben.hutchings@mind.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a VLAN is deleted from a port, the flags in struct
switchdev_obj_port_vlan are always 0. ksz8_port_vlan_del() copies the
BRIDGE_VLAN_INFO_UNTAGGED flag to the port's Tag Removal flag, and
therefore always clears it.
In case there are multiple VLANs configured as untagged on this port -
which seems useless, but is allowed - deleting one of them changes the
remaining VLANs to be tagged.
It's only ever necessary to change this flag when a VLAN is added to
the port, so leave it unchanged in ksz8_port_vlan_del().
Fixes: e66f840c08 ("net: dsa: ksz: Add Microchip KSZ8795 DSA driver")
Signed-off-by: Ben Hutchings <ben.hutchings@mind.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
The switches supported by ksz8795 only have a per-port flag for Tag
Removal. This means it is not possible to support both tagged and
untagged VLANs on the same port. Reject attempts to add a VLAN that
requires the flag to be changed, unless there are no VLANs currently
configured.
VID 0 is excluded from this check since it is untagged regardless of
the state of the flag.
On the CPU port we could support tagged and untagged VLANs at the same
time. This will be enabled by a later patch.
Fixes: e66f840c08 ("net: dsa: ksz: Add Microchip KSZ8795 DSA driver")
Signed-off-by: Ben Hutchings <ben.hutchings@mind.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
ksz8795 has never actually enabled PVID tag insertion, and it also
programmed the PVID incorrectly. To fix this:
* Allow tag insertion to be controlled per ingress port. On most
chips, set bit 2 in Global Control 19. On KSZ88x3 this control
flag doesn't exist.
* When adding a PVID:
- Set the appropriate register bits to enable tag insertion on
egress at every other port if this was the packet's ingress port.
- Mask *out* the VID from the default tag, before or-ing in the new
PVID.
* When removing a PVID:
- Clear the same control bits to disable tag insertion.
- Don't update the default tag. This wasn't doing anything useful.
Fixes: e66f840c08 ("net: dsa: ksz: Add Microchip KSZ8795 DSA driver")
Signed-off-by: Ben Hutchings <ben.hutchings@mind.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
ksz_read64() currently does some dubious byte-swapping on the two
halves of a 64-bit register, and then only returns the high bits.
Replace this with a straightforward expression.
Fixes: e66f840c08 ("net: dsa: ksz: Add Microchip KSZ8795 DSA driver")
Signed-off-by: Ben Hutchings <ben.hutchings@mind.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
linux-can-fixes-for-5.14-20210810
Marc Kleine-Budde says:
====================
pull-request: can 2021-08-10
this is a pull request of 2 patches for net/master.
Baruch Siach's patch fixes a typo for the Microchip CAN BUS Analyzer
Tool entry in the MAINTAINERS file.
Hussein Alasadi fixes the setting of the M_CAN_DBTP register in the
m_can driver. The regression git mainline in v5.14-rc1, so no backport
to stable is needed.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Saeed Mahameed says:
====================
mlx5 fixes 2021-08-09
This series introduces fixes to mlx5 driver.
Please pull and let me know if there is any problem.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Tony Nguyen says:
====================
Intel Wired LAN Driver Updates 2021-08-09
This series contains updates to ice and iavf drivers.
Ani prevents the ice driver from accidentally being probed to a virtual
function and stops processing of VF messages when VFs are being torn
down.
Brett prevents the ice driver from deleting is own MAC address.
Fahad ensures the RSS LUT and key are always set following reset for
iavf.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit b910eaaaa4 ("bpf: Fix NULL pointer dereference in bpf_get_local_storage()
helper") fixed a bug for bpf_get_local_storage() helper so different tasks
won't mess up with each other's percpu local storage.
The percpu data contains 8 slots so it can hold up to 8 contexts (same or
different tasks), for 8 different program runs, at the same time. This in
general is sufficient. But our internal testing showed the following warning
multiple times:
[...]
warning: WARNING: CPU: 13 PID: 41661 at include/linux/bpf-cgroup.h:193
__cgroup_bpf_run_filter_sock_ops+0x13e/0x180
RIP: 0010:__cgroup_bpf_run_filter_sock_ops+0x13e/0x180
<IRQ>
tcp_call_bpf.constprop.99+0x93/0xc0
tcp_conn_request+0x41e/0xa50
? tcp_rcv_state_process+0x203/0xe00
tcp_rcv_state_process+0x203/0xe00
? sk_filter_trim_cap+0xbc/0x210
? tcp_v6_inbound_md5_hash.constprop.41+0x44/0x160
tcp_v6_do_rcv+0x181/0x3e0
tcp_v6_rcv+0xc65/0xcb0
ip6_protocol_deliver_rcu+0xbd/0x450
ip6_input_finish+0x11/0x20
ip6_input+0xb5/0xc0
ip6_sublist_rcv_finish+0x37/0x50
ip6_sublist_rcv+0x1dc/0x270
ipv6_list_rcv+0x113/0x140
__netif_receive_skb_list_core+0x1a0/0x210
netif_receive_skb_list_internal+0x186/0x2a0
gro_normal_list.part.170+0x19/0x40
napi_complete_done+0x65/0x150
mlx5e_napi_poll+0x1ae/0x680
__napi_poll+0x25/0x120
net_rx_action+0x11e/0x280
__do_softirq+0xbb/0x271
irq_exit_rcu+0x97/0xa0
common_interrupt+0x7f/0xa0
</IRQ>
asm_common_interrupt+0x1e/0x40
RIP: 0010:bpf_prog_1835a9241238291a_tw_egress+0x5/0xbac
? __cgroup_bpf_run_filter_skb+0x378/0x4e0
? do_softirq+0x34/0x70
? ip6_finish_output2+0x266/0x590
? ip6_finish_output+0x66/0xa0
? ip6_output+0x6c/0x130
? ip6_xmit+0x279/0x550
? ip6_dst_check+0x61/0xd0
[...]
Using drgn [0] to dump the percpu buffer contents showed that on this CPU
slot 0 is still available, but slots 1-7 are occupied and those tasks in
slots 1-7 mostly don't exist any more. So we might have issues in
bpf_cgroup_storage_unset().
Further debugging confirmed that there is a bug in bpf_cgroup_storage_unset().
Currently, it tries to unset "current" slot with searching from the start.
So the following sequence is possible:
1. A task is running and claims slot 0
2. Running BPF program is done, and it checked slot 0 has the "task"
and ready to reset it to NULL (not yet).
3. An interrupt happens, another BPF program runs and it claims slot 1
with the *same* task.
4. The unset() in interrupt context releases slot 0 since it matches "task".
5. Interrupt is done, the task in process context reset slot 0.
At the end, slot 1 is not reset and the same process can continue to occupy
slots 2-7 and finally, when the above step 1-5 is repeated again, step 3 BPF
program won't be able to claim an empty slot and a warning will be issued.
To fix the issue, for unset() function, we should traverse from the last slot
to the first. This way, the above issue can be avoided.
The same reverse traversal should also be done in bpf_get_local_storage() helper
itself. Otherwise, incorrect local storage may be returned to BPF program.
[0] https://github.com/osandov/drgn
Fixes: b910eaaaa4 ("bpf: Fix NULL pointer dereference in bpf_get_local_storage() helper")
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210810010413.1976277-1-yhs@fb.com
Add the following checks from __do_loopback() to clone_private_mount() as
well:
- verify that the mount is in the current namespace
- verify that there are no locked children
Reported-by: Alois Wohlschlager <alois1@gmx-topmail.de>
Fixes: c771d683a6 ("vfs: introduce clone_private_mount()")
Cc: <stable@vger.kernel.org> # v3.18
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
One error path can result in release_dentry_name_snapshot() being called
before "name" was initialized by take_dentry_name_snapshot().
Fix by moving the release_dentry_name_snapshot() to immediately after the
only use.
Reported-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
There's possibility of an ABBA deadlock in case of a splice write to an
overlayfs file and a concurrent splice write to a corresponding real file.
The call chain for splice to an overlay file:
-> do_splice [takes sb_writers on overlay file]
-> do_splice_from
-> iter_file_splice_write [takes pipe->mutex]
-> vfs_iter_write
...
-> ovl_write_iter [takes sb_writers on real file]
And the call chain for splice to a real file:
-> do_splice [takes sb_writers on real file]
-> do_splice_from
-> iter_file_splice_write [takes pipe->mutex]
Syzbot successfully bisected this to commit 82a763e61e ("ovl: simplify
file splice").
Fix by reverting the write part of the above commit and by adding missing
bits from ovl_write_iter() into ovl_splice_write().
Fixes: 82a763e61e ("ovl: simplify file splice")
Reported-and-tested-by: syzbot+579885d1a9a833336209@syzkaller.appspotmail.com
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
On the first getdents call, ovl_iterate() populates the readdir cache
with a list of entries, but for upper entries with origin lower inode,
p->ino remains zero.
Following getdents calls traverse the readdir cache list and call
ovl_cache_update_ino() for entries with zero p->ino to lookup the entry
in the overlay and return d_ino that is consistent with st_ino.
If the upper file was unlinked between the first getdents call and the
getdents call that lists the file entry, ovl_cache_update_ino() will not
find the entry and fall back to setting d_ino to the upper real st_ino,
which is inconsistent with how this object was presented to users.
Instead of listing a stale entry with inconsistent d_ino, simply skip
the stale entry, which is better for users.
xfstest overlay/077 is failing without this patch.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Link: https://lore.kernel.org/fstests/CAOQ4uxgR_cLnC_vdU5=seP3fwqVkuZM_-WfD6maFTMbMYq=a9w@mail.gmail.com/
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Commit 79a7f8bdb1 ("bpf: Introduce bpf_sys_bpf() helper and program type.")
added support for syscall program, which is a sleepable program.
But the program run missed bpf_read_lock_trace()/bpf_read_unlock_trace(),
which is needed to ensure proper rcu callback invocations. This patch adds
bpf_read_[un]lock_trace() properly.
Fixes: 79a7f8bdb1 ("bpf: Introduce bpf_sys_bpf() helper and program type.")
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210809235151.1663680-1-yhs@fb.com
Back then, commit 96ae522795 ("bpf: Add bpf_probe_write_user BPF helper
to be called in tracers") added the bpf_probe_write_user() helper in order
to allow to override user space memory. Its original goal was to have a
facility to "debug, divert, and manipulate execution of semi-cooperative
processes" under CAP_SYS_ADMIN. Write to kernel was explicitly disallowed
since it would otherwise tamper with its integrity.
One use case was shown in cf9b1199de ("samples/bpf: Add test/example of
using bpf_probe_write_user bpf helper") where the program DNATs traffic
at the time of connect(2) syscall, meaning, it rewrites the arguments to
a syscall while they're still in userspace, and before the syscall has a
chance to copy the argument into kernel space. These days we have better
mechanisms in BPF for achieving the same (e.g. for load-balancers), but
without having to write to userspace memory.
Of course the bpf_probe_write_user() helper can also be used to abuse
many other things for both good or bad purpose. Outside of BPF, there is
a similar mechanism for ptrace(2) such as PTRACE_PEEK{TEXT,DATA} and
PTRACE_POKE{TEXT,DATA}, but would likely require some more effort.
Commit 96ae522795 explicitly dedicated the helper for experimentation
purpose only. Thus, move the helper's availability behind a newly added
LOCKDOWN_BPF_WRITE_USER lockdown knob so that the helper is disabled under
the "integrity" mode. More fine-grained control can be implemented also
from LSM side with this change.
Fixes: 96ae522795 ("bpf: Add bpf_probe_write_user BPF helper to be called in tracers")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>