Commit Graph

6309 Commits

Author SHA1 Message Date
Jakub Kicinski
c2865b1122 Daniel Borkmann says:
====================
pull-request: bpf-next 2023-04-13

We've added 260 non-merge commits during the last 36 day(s) which contain
a total of 356 files changed, 21786 insertions(+), 11275 deletions(-).

The main changes are:

1) Rework BPF verifier log behavior and implement it as a rotating log
   by default with the option to retain old-style fixed log behavior,
   from Andrii Nakryiko.

2) Adds support for using {FOU,GUE} encap with an ipip device operating
   in collect_md mode and add a set of BPF kfuncs for controlling encap
   params, from Christian Ehrig.

3) Allow BPF programs to detect at load time whether a particular kfunc
   exists or not, and also add support for this in light skeleton,
   from Alexei Starovoitov.

4) Optimize hashmap lookups when key size is multiple of 4,
   from Anton Protopopov.

5) Enable RCU semantics for task BPF kptrs and allow referenced kptr
   tasks to be stored in BPF maps, from David Vernet.

6) Add support for stashing local BPF kptr into a map value via
   bpf_kptr_xchg(). This is useful e.g. for rbtree node creation
   for new cgroups, from Dave Marchevsky.

7) Fix BTF handling of is_int_ptr to skip modifiers to work around
   tracing issues where a program cannot be attached, from Feng Zhou.

8) Migrate a big portion of test_verifier unit tests over to
   test_progs -a verifier_* via inline asm to ease {read,debug}ability,
   from Eduard Zingerman.

9) Several updates to the instruction-set.rst documentation
   which is subject to future IETF standardization
   (https://lwn.net/Articles/926882/), from Dave Thaler.

10) Fix BPF verifier in the __reg_bound_offset's 64->32 tnum sub-register
    known bits information propagation, from Daniel Borkmann.

11) Add skb bitfield compaction work related to BPF with the overall goal
    to make more of the sk_buff bits optional, from Jakub Kicinski.

12) BPF selftest cleanups for build id extraction which stand on its own
    from the upcoming integration work of build id into struct file object,
    from Jiri Olsa.

13) Add fixes and optimizations for xsk descriptor validation and several
    selftest improvements for xsk sockets, from Kal Conley.

14) Add BPF links for struct_ops and enable switching implementations
    of BPF TCP cong-ctls under a given name by replacing backing
    struct_ops map, from Kui-Feng Lee.

15) Remove a misleading BPF verifier env->bypass_spec_v1 check on variable
    offset stack read as earlier Spectre checks cover this,
    from Luis Gerhorst.

16) Fix issues in copy_from_user_nofault() for BPF and other tracers
    to resemble copy_from_user_nmi() from safety PoV, from Florian Lehner
    and Alexei Starovoitov.

17) Add --json-summary option to test_progs in order for CI tooling to
    ease parsing of test results, from Manu Bretelle.

18) Batch of improvements and refactoring to prep for upcoming
    bpf_local_storage conversion to bpf_mem_cache_{alloc,free} allocator,
    from Martin KaFai Lau.

19) Improve bpftool's visual program dump which produces the control
    flow graph in a DOT format by adding C source inline annotations,
    from Quentin Monnet.

20) Fix attaching fentry/fexit/fmod_ret/lsm to modules by extracting
    the module name from BTF of the target and searching kallsyms of
    the correct module, from Viktor Malik.

21) Improve BPF verifier handling of '<const> <cond> <non_const>'
    to better detect whether in particular jmp32 branches are taken,
    from Yonghong Song.

22) Allow BPF TCP cong-ctls to write app_limited of struct tcp_sock.
    A built-in cc or one from a kernel module is already able to write
    to app_limited, from Yixin Shen.

Conflicts:

Documentation/bpf/bpf_devel_QA.rst
  b7abcd9c65 ("bpf, doc: Link to submitting-patches.rst for general patch submission info")
  0f10f647f4 ("bpf, docs: Use internal linking for link to netdev subsystem doc")
https://lore.kernel.org/all/20230307095812.236eb1be@canb.auug.org.au/

include/net/ip_tunnels.h
  bc9d003dc4 ("ip_tunnel: Preserve pointer const in ip_tunnel_info_opts")
  ac931d4cde ("ipip,ip_tunnel,sit: Add FOU support for externally controlled ipip devices")
https://lore.kernel.org/all/20230413161235.4093777-1-broonie@kernel.org/

net/bpf/test_run.c
  e5995bc7e2 ("bpf, test_run: fix crashes due to XDP frame overwriting/corruption")
  294635a816 ("bpf, test_run: fix &xdp_frame misplacement for LIVE_FRAMES")
https://lore.kernel.org/all/20230320102619.05b80a98@canb.auug.org.au/
====================

Link: https://lore.kernel.org/r/20230413191525.7295-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-13 16:43:38 -07:00
Alexei Starovoitov
b7e852a9ec bpf: Remove unused arguments from btf_struct_access().
Remove unused arguments from btf_struct_access() callback.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/bpf/20230404045029.82870-3-alexei.starovoitov@gmail.com
2023-04-04 16:57:10 -07:00
Paul Blakey
9b7c68b391 netfilter: ctnetlink: Support offloaded conntrack entry deletion
Currently, offloaded conntrack entries (flows) can only be deleted
after they are removed from offload, which is either by timeout,
tcp state change or tc ct rule deletion. This can cause issues for
users wishing to manually delete or flush existing entries.

Support deletion of offloaded conntrack entries.

Example usage:
 # Delete all offloaded (and non offloaded) conntrack entries
 # whose source address is 1.2.3.4
 $ conntrack -D -s 1.2.3.4
 # Delete all entries
 $ conntrack -F

Signed-off-by: Paul Blakey <paulb@nvidia.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
2023-03-30 22:20:09 +02:00
Eric Sage
28c1b6df43 netfilter: nfnetlink_queue: enable classid socket info retrieval
This enables associating a socket with a v1 net_cls cgroup. Useful for
applying a per-cgroup policy when processing packets in userspace.

Signed-off-by: Eric Sage <eric_sage@apple.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
2023-03-30 22:20:09 +02:00
Florian Westphal
356e2adb3f netfilter: nfnetlink_log: remove rcu_bh usage
structure is free'd via call_rcu, so its safe to use rcu_read_lock only.

While at it, skip rcu_read_lock for lookup from packet path, its always
called with rcu held.

Signed-off-by: Florian Westphal <fw@strlen.de>
2023-03-30 22:20:09 +02:00
Thomas Gleixner
bc9d3a9f2a net: dst: Switch to rcuref_t reference counting
Under high contention dst_entry::__refcnt becomes a significant bottleneck.

atomic_inc_not_zero() is implemented with a cmpxchg() loop, which goes into
high retry rates on contention.

Switch the reference count to rcuref_t which results in a significant
performance gain. Rename the reference count member to __rcuref to reflect
the change.

The gain depends on the micro-architecture and the number of concurrent
operations and has been measured in the range of +25% to +130% with a
localhost memtier/memcached benchmark which amplifies the problem
massively.

Running the memtier/memcached benchmark over a real (1Gb) network
connection the conversion on top of the false sharing fix for struct
dst_entry::__refcnt results in a total gain in the 2%-5% range over the
upstream baseline.

Reported-by: Wangyang Guo <wangyang.guo@intel.com>
Reported-by: Arjan Van De Ven <arjan.van.de.ven@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20230307125538.989175656@linutronix.de
Link: https://lore.kernel.org/r/20230323102800.215027837@linutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-28 18:52:28 -07:00
David Vernet
fb2211a57c bpf: Remove now-unnecessary NULL checks for KF_RELEASE kfuncs
Now that we're not invoking kfunc destructors when the kptr in a map was
NULL, we no longer require NULL checks in many of our KF_RELEASE kfuncs.
This patch removes those NULL checks.

Signed-off-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/r/20230325213144.486885-3-void@manifault.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-25 16:56:22 -07:00
Florian Westphal
36ce9982ef xtables: move icmp/icmpv6 logic to xt_tcpudp
icmp/icmp6 matches are baked into ip(6)_tables.ko.

This means that even if iptables-nft is used, a rule like
"-p icmp --icmp-type 1" will load the ip(6)tables modules.

Move them to xt_tcpdudp.ko instead to avoid this.

This will also allow to eventually add kconfig knobs to build kernels
that support iptables-nft but not iptables-legacy (old set/getsockopt
interface).

Signed-off-by: Florian Westphal <fw@strlen.de>
2023-03-22 21:48:59 +01:00
Florian Westphal
bde7170a04 netfilter: xtables: disable 32bit compat interface by default
This defaulted to 'y' because before this knob existed the 32bit
compat layer was always compiled in if CONFIG_COMPAT was set.

32bit iptables on 64bit kernel isn't common anymore, so remove
the default-y now.

Signed-off-by: Florian Westphal <fw@strlen.de>
2023-03-22 21:48:59 +01:00
Jeremy Sowden
f6ca5d5ed7 netfilter: nft_masq: deduplicate eval call-backs
nft_masq has separate ipv4 and ipv6 call-backs which share much of their
code, and an inet one switch containing a switch that calls one of the
others based on the family of the packet.  Merge the ipv4 and ipv6 ones
into the inet one in order to get rid of the duplicate code.

Const-qualify the `priv` pointer since we don't need to write through it.

Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
Signed-off-by: Florian Westphal <fw@strlen.de>
2023-03-22 21:48:59 +01:00
Jeremy Sowden
6f56ad1b92 netfilter: nft_redir: use struct nf_nat_range2 throughout and deduplicate eval call-backs
`nf_nat_redirect_ipv4` takes a `struct nf_nat_ipv4_multi_range_compat`,
but converts it internally to a `struct nf_nat_range2`.  Change the
function to take the latter, factor out the code now shared with
`nf_nat_redirect_ipv6`, move the conversion to the xt_REDIRECT module,
and update the ipv4 range initialization in the nft_redir module.

Replace a bare hex constant for 127.0.0.1 with a macro.

Remove `WARN_ON`.  `nf_nat_setup_info` calls `nf_ct_is_confirmed`:

	/* Can't setup nat info for confirmed ct. */
	if (nf_ct_is_confirmed(ct))
		return NF_ACCEPT;

This means that `ct` cannot be null or the kernel will crash, and
implies that `ctinfo` is `IP_CT_NEW` or `IP_CT_RELATED`.

nft_redir has separate ipv4 and ipv6 call-backs which share much of
their code, and an inet one switch containing a switch that calls one of
the others based on the family of the packet.  Merge the ipv4 and ipv6
ones into the inet one in order to get rid of the duplicate code.

Const-qualify the `priv` pointer since we don't need to write through
it.

Assign `priv->flags` to the range instead of OR-ing it in.

Set the `NF_NAT_RANGE_PROTO_SPECIFIED` flag once during init, rather
than on every eval.

Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
Signed-off-by: Florian Westphal <fw@strlen.de>
2023-03-22 21:48:59 +01:00
Jakub Kicinski
1118aa4c70 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
net/wireless/nl80211.c
  b27f07c50a ("wifi: nl80211: fix puncturing bitmap policy")
  cbbaf2bb82 ("wifi: nl80211: add a command to enable/disable HW timestamping")
https://lore.kernel.org/all/20230314105421.3608efae@canb.auug.org.au

tools/testing/selftests/net/Makefile
  62199e3f16 ("selftests: net: Add VXLAN MDB test")
  13715acf8a ("selftest: Add test for bind() conflicts.")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-17 16:29:25 -07:00
Jakub Kicinski
d0928c1c5b Merge branch 'main' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next
Florian Westphal says:

====================
Netfilter updates for net-next

1. nf_tables 'brouting' support, from Sriram Yagnaraman.

2. Update bridge netfilter and ovs conntrack helpers to handle
   IPv6 Jumbo packets properly, i.e. fetch the packet length
   from hop-by-hop extension header, from Xin Long.

   This comes with a test BIG TCP test case, added to
   tools/testing/selftests/net/.

3. Fix spelling and indentation in conntrack, from Jeremy Sowden.

* 'main' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
  netfilter: nat: fix indentation of function arguments
  netfilter: conntrack: fix typo
  selftests: add a selftest for big tcp
  netfilter: use nf_ip6_check_hbh_len in nf_ct_skb_network_trim
  netfilter: move br_nf_check_hbh_len to utils
  netfilter: bridge: move pskb_trim_rcsum out of br_nf_check_hbh_len
  netfilter: bridge: check len before accessing more nh data
  netfilter: bridge: call pskb_may_pull in br_nf_check_hbh_len
  netfilter: bridge: introduce broute meta statement
====================

Link: https://lore.kernel.org/r/20230308193033.13965-1-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-09 23:28:53 -08:00
Jeremy Sowden
b0ca200077 netfilter: nat: fix indentation of function arguments
A couple of arguments to a function call are incorrectly indented.
Fix them.

Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
Signed-off-by: Florian Westphal <fw@strlen.de>
2023-03-08 14:25:44 +01:00
Jeremy Sowden
e5d015a114 netfilter: conntrack: fix typo
There's a spelling mistake in a comment.  Fix it.

Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
Signed-off-by: Florian Westphal <fw@strlen.de>
2023-03-08 14:25:43 +01:00
Xin Long
eaafdaa3e9 netfilter: use nf_ip6_check_hbh_len in nf_ct_skb_network_trim
For IPv6 Jumbo packets, the ipv6_hdr(skb)->payload_len is always 0,
and its real payload_len ( > 65535) is saved in hbh exthdr. With 0
length for the jumbo packets, all data and exthdr will be trimmed
in nf_ct_skb_network_trim().

This patch is to call nf_ip6_check_hbh_len() to get real pkt_len
of the IPv6 packet, similar to br_validate_ipv6().

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Reviewed-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
2023-03-08 14:25:41 +01:00
Xin Long
28e144cf5f netfilter: move br_nf_check_hbh_len to utils
Rename br_nf_check_hbh_len() to nf_ip6_check_hbh_len() and move it
to netfilter utils, so that it can be used by other modules, like
ovs and tc.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Reviewed-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
2023-03-08 14:25:40 +01:00
Jeremy Sowden
493924519b netfilter: nft_redir: correct value of inet type .maxattrs
`nft_redir_inet_type.maxattrs` was being set, presumably because of a
cut-and-paste error, to `NFTA_MASQ_MAX`, instead of `NFTA_REDIR_MAX`.

Fixes: 63ce3940f3 ("netfilter: nft_redir: add inet support")
Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-03-08 12:26:42 +01:00
Jeremy Sowden
1f617b6b4c netfilter: nft_redir: correct length for loading protocol registers
The values in the protocol registers are two bytes wide.  However, when
parsing the register loads, the code currently uses the larger 16-byte
size of a `union nf_inet_addr`.  Change it to use the (correct) size of
a `union nf_conntrack_man_proto` instead.

Fixes: d07db9884a ("netfilter: nf_tables: introduce nft_validate_register_load()")
Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-03-08 12:26:37 +01:00
Jeremy Sowden
ec2c5917eb netfilter: nft_masq: correct length for loading protocol registers
The values in the protocol registers are two bytes wide.  However, when
parsing the register loads, the code currently uses the larger 16-byte
size of a `union nf_inet_addr`.  Change it to use the (correct) size of
a `union nf_conntrack_man_proto` instead.

Fixes: 8a6bf5da1a ("netfilter: nft_masq: support port range")
Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-03-08 12:26:31 +01:00
Jeremy Sowden
068d82e75d netfilter: nft_nat: correct length for loading protocol registers
The values in the protocol registers are two bytes wide.  However, when
parsing the register loads, the code currently uses the larger 16-byte
size of a `union nf_inet_addr`.  Change it to use the (correct) size of
a `union nf_conntrack_man_proto` instead.

Fixes: d07db9884a ("netfilter: nf_tables: introduce nft_validate_register_load()")
Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-03-08 12:25:59 +01:00
Eric Dumazet
c77737b736 netfilter: conntrack: adopt safer max chain length
Customers using GKE 1.25 and 1.26 are facing conntrack issues
root caused to commit c9c3b6811f ("netfilter: conntrack: make
max chain length random").

Even if we assume Uniform Hashing, a bucket often reachs 8 chained
items while the load factor of the hash table is smaller than 0.5

With a limit of 16, we reach load factors of 3.
With a limit of 32, we reach load factors of 11.
With a limit of 40, we reach load factors of 15.
With a limit of 50, we reach load factors of 24.

This patch changes MIN_CHAINLEN to 50, to minimize risks.

Ideally, we could in the future add a cushion based on expected
load factor (2 * nf_conntrack_max / nf_conntrack_buckets),
because some setups might expect unusual values.

Fixes: c9c3b6811f ("netfilter: conntrack: make max chain length random")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-03-07 10:58:06 +01:00
Ivan Delalande
9f7dd42f0d netfilter: ctnetlink: revert to dumping mark regardless of event type
It seems that change was unintentional, we have userspace code that
needs the mark while listening for events like REPLY, DESTROY, etc.
Also include 0-marks in requested dumps, as they were before that fix.

Fixes: 1feeae0715 ("netfilter: ctnetlink: fix compilation warning after data race fixes in ct mark")
Signed-off-by: Ivan Delalande <colona@arista.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-03-06 12:09:23 +01:00
Pablo Neira Ayuso
aabef97a35 netfilter: nft_quota: copy content when cloning expression
If the ruleset contains consumed quota, restore them accordingly.
Otherwise, listing after restoration shows never used items.

Restore the user-defined quota and flags too.

Fixes: ed0a0c60f0 ("netfilter: nft_quota: move stateful fields out of expression data")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-03-01 17:23:23 +01:00
Pablo Neira Ayuso
860e874290 netfilter: nft_last: copy content when cloning expression
If the ruleset contains last timestamps, restore them accordingly.
Otherwise, listing after restoration shows never used items.

Fixes: 33a24de37e ("netfilter: nft_last: move stateful fields out of expression data")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-03-01 17:23:23 +01:00
Jakub Kicinski
fd2a55e74a Merge git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

1) Fix broken listing of set elements when table has an owner.

2) Fix conntrack refcount leak in ctnetlink with related conntrack
   entries, from Hangyu Hua.

3) Fix use-after-free/double-free in ctnetlink conntrack insert path,
   from Florian Westphal.

4) Fix ip6t_rpfilter with VRF, from Phil Sutter.

5) Fix use-after-free in ebtables reported by syzbot, also from Florian.

6) Use skb->len in xt_length to deal with IPv6 jumbo packets,
   from Xin Long.

7) Fix NETLINK_LISTEN_ALL_NSID with ctnetlink, from Florian Westphal.

8) Fix memleak in {ip_,ip6_,arp_}tables in ENOMEM error case,
   from Pavel Tikhomirov.

* git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
  netfilter: x_tables: fix percpu counter block leak on error path when creating new netns
  netfilter: ctnetlink: make event listener tracking global
  netfilter: xt_length: use skb len to match in length_mt6
  netfilter: ebtables: fix table blob use-after-free
  netfilter: ip6t_rpfilter: Fix regression with VRF interfaces
  netfilter: conntrack: fix rmmod double-free race
  netfilter: ctnetlink: fix possible refcount leak in ctnetlink_create_conntrack()
  netfilter: nf_tables: allow to fetch set elements when table has an owner
====================

Link: https://lore.kernel.org/r/20230222092137.88637-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-22 21:25:23 -08:00
Florian Westphal
fdf6491193 netfilter: ctnetlink: make event listener tracking global
pernet tracking doesn't work correctly because other netns might have
set NETLINK_LISTEN_ALL_NSID on its event socket.

In this case its expected that events originating in other net
namespaces are also received.

Making pernet-tracking work while also honoring NETLINK_LISTEN_ALL_NSID
requires much more intrusive changes both in netlink and nfnetlink,
f.e. adding a 'setsockopt' callback that lets nfnetlink know that the
event socket entered (or left) ALL_NSID mode.

Move to global tracking instead: if there is an event socket anywhere
on the system, all net namespaces which have conntrack enabled and
use autobind mode will allocate the ecache extension.

netlink_has_listeners() returns false only if the given group has no
subscribers in any net namespace, the 'net' argument passed to
nfnetlink_has_listeners is only used to derive the protocol (nfnetlink),
it has no other effect.

For proper NETLINK_LISTEN_ALL_NSID-aware pernet tracking of event
listeners a new netlink_has_net_listeners() is also needed.

Fixes: 90d1daa458 ("netfilter: conntrack: add nf_conntrack_events autodetect mode")
Reported-by: Bryce Kahle <bryce.kahle@datadoghq.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-02-22 00:28:47 +01:00
Xin Long
05c07c0c6c netfilter: xt_length: use skb len to match in length_mt6
For IPv6 Jumbo packets, the ipv6_hdr(skb)->payload_len is always 0,
and its real payload_len ( > 65535) is saved in hbh exthdr. With 0
length for the jumbo packets, it may mismatch.

To fix this, we can just use skb->len instead of parsing exthdrs, as
the hbh exthdr parsing has been done before coming to length_mt6 in
ip6_rcv_core() and br_validate_ipv6() and also the packet has been
trimmed according to the correct IPv6 (ext)hdr length there, and skb
len is trustable in length_mt6().

Note that this patch is especially needed after the IPv6 BIG TCP was
supported in kernel, which is using IPv6 Jumbo packets. Besides, to
match the packets greater than 65535 more properly, a v1 revision of
xt_length may be needed to extend "min, max" to u32 in the future,
and for now the IPv6 Jumbo packets can be matched by:

  # ip6tables -m length ! --length 0:65535

Fixes: 7c4e983c4f ("net: allow gso_max_size to exceed 65536")
Fixes: 0fe79f28bf ("net: allow gro_max_size to exceed 65536")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-02-22 00:28:39 +01:00
Florian Westphal
e6d57e9ff0 netfilter: conntrack: fix rmmod double-free race
nf_conntrack_hash_check_insert() callers free the ct entry directly, via
nf_conntrack_free.

This isn't safe anymore because
nf_conntrack_hash_check_insert() might place the entry into the conntrack
table and then delteted the entry again because it found that a conntrack
extension has been removed at the same time.

In this case, the just-added entry is removed again and an error is
returned to the caller.

Problem is that another cpu might have picked up this entry and
incremented its reference count.

This results in a use-after-free/double-free, once by the other cpu and
once by the caller of nf_conntrack_hash_check_insert().

Fix this by making nf_conntrack_hash_check_insert() not fail anymore
after the insertion, just like before the 'Fixes' commit.

This is safe because a racing nf_ct_iterate() has to wait for us
to release the conntrack hash spinlocks.

While at it, make the function return -EAGAIN in the rmmod (genid
changed) case, this makes nfnetlink replay the command (suggested
by Pablo Neira).

Fixes: c56716c69c ("netfilter: extensions: introduce extension genid count")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-02-22 00:19:38 +01:00
Hangyu Hua
ac4893980b netfilter: ctnetlink: fix possible refcount leak in ctnetlink_create_conntrack()
nf_ct_put() needs to be called to put the refcount got by
nf_conntrack_find_get() to avoid refcount leak when
nf_conntrack_hash_check_insert() fails.

Fixes: 7d367e0668 ("netfilter: ctnetlink: fix soft lockup when netlink adds new entries (v2)")
Signed-off-by: Hangyu Hua <hbh25y@gmail.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-02-22 00:06:19 +01:00
David S. Miller
1155a2281d Merge git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next
Pablo Neira Ayuso says:

====================
Netfilter/IPVS updates for net-next

The following patchset contains Netfilter updates for net-next:

1) Add safeguard to check for NULL tupe in objects updates via
   NFT_MSG_NEWOBJ, this should not ever happen. From Alok Tiwari.

2) Incorrect pointer check in the new destroy rule command,
   from Yang Yingliang.

3) Incorrect status bitcheck in nf_conntrack_udp_packet(),
   from Florian Westphal.

4) Simplify seq_print_acct(), from Ilia Gavrilov.

5) Use 2-arg optimal variant of kfree_rcu() in IPVS,
   from Julian Anastasov.

6) TCP connection enters CLOSE state in conntrack for locally
   originated TCP reset packet from the reject target,
   from Florian Westphal.

The fixes #2 and #3 in this series address issues from the previous pull
nf-next request in this net-next cycle.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2023-02-20 10:53:56 +00:00
Florian Westphal
2954fe60e3 netfilter: let reset rules clean out conntrack entries
iptables/nftables support responding to tcp packets with tcp resets.

The generated tcp reset packet passes through both output and postrouting
netfilter hooks, but conntrack will never see them because the generated
skb has its ->nfct pointer copied over from the packet that triggered the
reset rule.

If the reset rule is used for established connections, this
may result in the conntrack entry to be around for a very long
time (default timeout is 5 days).

One way to avoid this would be to not copy the nf_conn pointer
so that the rest packet passes through conntrack too.

Problem is that output rules might not have the same conntrack
zone setup as the prerouting ones, so its possible that the
reset skb won't find the correct entry.  Generating a template
entry for the skb seems error prone as well.

Add an explicit "closing" function that switches a confirmed
conntrack entry to closed state and wire this up for tcp.

If the entry isn't confirmed, no action is needed because
the conntrack entry will never be committed to the table.

Reported-by: Russel King <linux@armlinux.org.uk>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-02-17 13:04:56 +01:00
Jakub Kicinski
de42873367 Daniel Borkmann says:
====================
pull-request: bpf-next 2023-02-11

We've added 96 non-merge commits during the last 14 day(s) which contain
a total of 152 files changed, 4884 insertions(+), 962 deletions(-).

There is a minor conflict in drivers/net/ethernet/intel/ice/ice_main.c
between commit 5b246e533d ("ice: split probe into smaller functions")
from the net-next tree and commit 66c0e13ad2 ("drivers: net: turn on
XDP features") from the bpf-next tree. Remove the hunk given ice_cfg_netdev()
is otherwise there a 2nd time, and add XDP features to the existing
ice_cfg_netdev() one:

        [...]
        ice_set_netdev_features(netdev);
        netdev->xdp_features = NETDEV_XDP_ACT_BASIC | NETDEV_XDP_ACT_REDIRECT |
                               NETDEV_XDP_ACT_XSK_ZEROCOPY;
        ice_set_ops(netdev);
        [...]

Stephen's merge conflict mail:
https://lore.kernel.org/bpf/20230207101951.21a114fa@canb.auug.org.au/

The main changes are:

1) Add support for BPF trampoline on s390x which finally allows to remove many
   test cases from the BPF CI's DENYLIST.s390x, from Ilya Leoshkevich.

2) Add multi-buffer XDP support to ice driver, from Maciej Fijalkowski.

3) Add capability to export the XDP features supported by the NIC.
   Along with that, add a XDP compliance test tool,
   from Lorenzo Bianconi & Marek Majtyka.

4) Add __bpf_kfunc tag for marking kernel functions as kfuncs,
   from David Vernet.

5) Add a deep dive documentation about the verifier's register
   liveness tracking algorithm, from Eduard Zingerman.

6) Fix and follow-up cleanups for resolve_btfids to be compiled
   as a host program to avoid cross compile issues,
   from Jiri Olsa & Ian Rogers.

7) Batch of fixes to the BPF selftest for xdp_hw_metadata which resulted
   when testing on different NICs, from Jesper Dangaard Brouer.

8) Fix libbpf to better detect kernel version code on Debian, from Hao Xiang.

9) Extend libbpf to add an option for when the perf buffer should
   wake up, from Jon Doron.

10) Follow-up fix on xdp_metadata selftest to just consume on TX
    completion, from Stanislav Fomichev.

11) Extend the kfuncs.rst document with description on kfunc
    lifecycle & stability expectations, from David Vernet.

12) Fix bpftool prog profile to skip attaching to offline CPUs,
    from Tonghao Zhang.

====================

Link: https://lore.kernel.org/r/20230211002037.8489-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-10 17:51:27 -08:00
Xin Long
0785407e78 net: extract nf_ct_handle_fragments to nf_conntrack_ovs
Now handle_fragments() in OVS and TC have the similar code, and
this patch removes the duplicate code by moving the function
to nf_conntrack_ovs.

Note that skb_clear_hash(skb) or skb->ignore_df = 1 should be
done only when defrag returns 0, as it does in other places
in kernel.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Aaron Conole <aconole@redhat.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-10 16:23:03 -08:00
Xin Long
67fc5d7ffb net: extract nf_ct_skb_network_trim function to nf_conntrack_ovs
There are almost the same code in ovs_skb_network_trim() and
tcf_ct_skb_network_trim(), this patch extracts them into a function
nf_ct_skb_network_trim() and moves the function to nf_conntrack_ovs.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Aaron Conole <aconole@redhat.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-10 16:23:03 -08:00
Xin Long
c0c3ab63de net: create nf_conntrack_ovs for ovs and tc use
Similar to nf_nat_ovs created by Commit ebddb14049 ("net: move the
nat function to nf_nat_ovs for ovs and tc"), this patch is to create
nf_conntrack_ovs to get these functions shared by OVS and TC only.

There are nf_ct_helper() and nf_ct_add_helper() from nf_conntrak_helper
in this patch, and will be more in the following patches.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Aaron Conole <aconole@redhat.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-10 16:23:03 -08:00
Pablo Neira Ayuso
92f3e96d64 netfilter: nf_tables: allow to fetch set elements when table has an owner
NFT_MSG_GETSETELEM returns -EPERM when fetching set elements that belong
to table that has an owner. This results in empty set/map listing from
userspace.

Fixes: 6001a930ce ("netfilter: nftables: introduce table ownership")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-02-09 17:34:11 +01:00
Vlad Buslov
df25455e5a netfilter: nf_conntrack: allow early drop of offloaded UDP conns
Both synchronous early drop algorithm and asynchronous gc worker completely
ignore connections with IPS_OFFLOAD_BIT status bit set. With new
functionality that enabled UDP NEW connection offload in action CT
malicious user can flood the conntrack table with offloaded UDP connections
by just sending a single packet per 5tuple because such connections can no
longer be deleted by early drop algorithm.

To mitigate the issue allow both early drop and gc to consider offloaded
UDP connections for deletion.

Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-02-03 09:31:24 +00:00
Vlad Buslov
1a441a9b8b netfilter: flowtable: cache info of last offload
Modify flow table offload to cache the last ct info status that was passed
to the driver offload callbacks by extending enum nf_flow_flags with new
"NF_FLOW_HW_ESTABLISHED" flag. Set the flag if ctinfo was 'established'
during last act_ct meta actions fill call. This infrastructure change is
necessary to optimize promoting of UDP connections from 'new' to
'established' in following patches in this series.

Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-02-03 09:31:24 +00:00
Vlad Buslov
8f84780b84 netfilter: flowtable: allow unidirectional rules
Modify flow table offload to support unidirectional connections by
extending enum nf_flow_flags with new "NF_FLOW_HW_BIDIRECTIONAL" flag. Only
offload reply direction when the flag is set. This infrastructure change is
necessary to support offloading UDP NEW connections in original direction
in following patches in series.

Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-02-03 09:31:24 +00:00
Vlad Buslov
0eb5acb164 netfilter: flowtable: fixup UDP timeout depending on ct state
Currently flow_offload_fixup_ct() function assumes that only replied UDP
connections can be offloaded and hardcodes UDP_CT_REPLIED timeout value. To
enable UDP NEW connection offload in following patches extract the actual
connections state from ct->status and set the timeout according to it.

Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-02-03 09:31:24 +00:00
Jakub Kicinski
82b4a9412b Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
net/core/gro.c
  7d2c89b325 ("skb: Do mix page pool and page referenced frags in GRO")
  b1a78b9b98 ("net: add support for ipv4 big tcp")
https://lore.kernel.org/all/20230203094454.5766f160@canb.auug.org.au/

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-02 14:49:55 -08:00
Julian Anastasov
e4d0fe71f5 ipvs: avoid kfree_rcu without 2nd arg
Avoid possible synchronize_rcu() as part from the
kfree_rcu() call when 2nd arg is not provided.

Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-02-02 14:02:01 +01:00
Xin Long
a13fbf5ed5 netfilter: use skb_ip_totlen and iph_totlen
There are also quite some places in netfilter that may process IPv4 TCP
GSO packets, we need to replace them too.

In length_mt(), we have to use u_int32_t/int to accept skb_ip_totlen()
return value, otherwise it may overflow and mismatch. This change will
also help us add selftest for IPv4 BIG TCP in the following patch.

Note that we don't need to replace the one in tcpmss_tg4(), as it will
return if there is data after tcphdr in tcpmss_mangle_packet(). The
same in mangle_contents() in nf_nat_helper.c, it returns false when
skb->len + extra > 65535 in enlarge_skb().

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-01 20:54:27 -08:00
David Vernet
400031e05a bpf: Add __bpf_kfunc tag to all kfuncs
Now that we have the __bpf_kfunc tag, we should use add it to all
existing kfuncs to ensure that they'll never be elided in LTO builds.

Signed-off-by: David Vernet <void@manifault.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/bpf/20230201173016.342758-4-void@manifault.com
2023-02-02 00:25:14 +01:00
Gavrilov Ilia
f6477ec62f netfilter: conntrack: remote a return value of the 'seq_print_acct' function.
The static 'seq_print_acct' function always returns 0.

Change the return value to 'void' and remove unnecessary checks.

Found by InfoTeCS on behalf of Linux Verification Center
(linuxtesting.org) with SVACE.

Fixes: 1ca9e41770 ("netfilter: Remove uses of seq_<foo> return values")
Signed-off-by: Ilia.Gavrilov <Ilia.Gavrilov@infotecs.ru>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-02-01 12:18:52 +01:00
Florian Westphal
28af0f009d netfilter: conntrack: udp: fix seen-reply test
IPS_SEEN_REPLY_BIT is only useful for test_bit() api.

Fixes: 4883ec512c ("netfilter: conntrack: avoid reload of ct->status")
Reported-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-02-01 12:18:51 +01:00
Yang Yingliang
1fb7696ac6 netfilter: nf_tables: fix wrong pointer passed to PTR_ERR()
It should be 'chain' passed to PTR_ERR() in the error path
after calling nft_chain_lookup() in nf_tables_delrule().

Fixes: f80a612dd7 ("netfilter: nf_tables: add support to destroy operation")
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Acked-by: Fernando Fernandez Mancera <ffmancera@riseup.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-02-01 12:18:51 +01:00
Alok Tiwari
dac7f50a45 netfilter: nf_tables: NULL pointer dereference in nf_tables_updobj()
static analyzer detect null pointer dereference case for 'type'
function __nft_obj_type_get() can return NULL value which require to handle
if type is NULL pointer return -ENOENT.

This is a theoretical issue, since an existing object has a type, but
better add this failsafe check.

Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-02-01 12:18:48 +01:00
Florian Westphal
bd0e06f0de Revert "netfilter: conntrack: fix bug in for_each_sctp_chunk"
There is no bug.  If sch->length == 0, this would result in an infinite
loop, but first caller, do_basic_checks(), errors out in this case.

After this change, packets with bogus zero-length chunks are no longer
detected as invalid, so revert & add comment wrt. 0 length check.

Fixes: 98ee007745 ("netfilter: conntrack: fix bug in for_each_sctp_chunk")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-01-31 14:02:48 +01:00