When the state of rep was introduced, it was also designed to prevent
duplicate unloading of the same rep. Considering the following two
flows when an eswitch manager is at switchdev mode with n VF reps loaded.
+--------------------------------------+--------------------------------+
| cpu-0 | cpu-1 |
| -------- | -------- |
| mlx5_ib_remove | mlx5_eswitch_disable_sriov |
| mlx5_ib_unregister_vport_reps | esw_offloads_cleanup |
| mlx5_eswitch_unregister_vport_reps | esw_offloads_unload_all_reps |
| __unload_reps_all_vport | __unload_reps_all_vport |
+--------------------------------------+--------------------------------+
These two flows will try to unload the same rep. Per original design,
once one flow unloads the rep, the state moves to REGISTERED. The 2nd
flow will no longer needs to do the unload and bails out. However, as
read and write of the state is not atomic, when 1st flow is doing the
unload, the state is still LOADED, 2nd flow is able to do the same
unload action. Kernel crash will happen.
To solve this, driver should do atomic test-and-set for the state. So
that only one flow can change the rep state from LOADED to REGISTERED,
and proceed to do the actual unloading.
Since the state is changing to atomic type, all other read/write should
be atomic action as well.
Fixes: f121e0ea95 (net/mlx5: E-Switch, Add state to eswitch vport representors)
Signed-off-by: Bodong Wang <bodong@mellanox.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Vu Pham <vuhuong@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
mlx5_query_nic_vport_vlans() is not used anymore. Hence remove it.
This patch doesn't change any functionality.
Signed-off-by: Bodong Wang <bodong@mellanox.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Add the support to read additional EEPROM information from high pages.
Information for modules such as SFF-8436 and SFF-8636:
1) Application select table
2) User writable EEPROM
3) Thresholds and alarms
Signed-off-by: Erez Alfasi <ereza@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Added max EEPROM length defines for ethtool usage:
#define ETH_MODULE_SFF_8636_MAX_LEN 640
#define ETH_MODULE_SFF_8436_MAX_LEN 640
These definitions are used to determine the EEPROM
data length when reading high eeprom pages.
For example, SFF-8636 EEPROM data from page 03h
needs to be stored at data[512] - data[639].
Signed-off-by: Erez Alfasi <ereza@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
This merge commit includes some misc shared code updates from mlx5-next branch needed
for net-next.
1) From Aya: Enable general events on all physical link types and
restrict general event handling of subtype DELAY_DROP_TIMEOUT in mlx5 rdma
driver to ethernet links only as it was intended.
2) From Eli: Introduce low level bits for prio tag mode
3) From Maor: Low level steering updates to support RDMA RX flow
steering and enables RoCE loopback traffic when switchdev is enabled.
4) From Vu and Parav: Two small mlx5 core cleanups
5) From Yevgeny add HW definitions of geneve offloads
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
The cited commit broke the offsets of hca cap struct, fix it.
While at it, cleanup a white space introduced by the same commit.
Fixes: b169e64a24 ("net/mlx5: Geneve, Add flow table capabilities for Geneve decap with TLV options")
Reported-by: Qian Cai <cai@lca.pw>
Cc: Yevgeny Kliteynik <kliteyn@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
This allows custom setup of IRQ coalescing for platforms using legacy
platform_device. The irq timeout and count parameters can be used for
tuning cpu load vs. latency.
I have maintained the 0x00000400 bit in TX_CHNL_CTRL. It is specified as
unused in the documentation I have available. It does not make any
difference in the hardware I have available, so it is left in to not risk
breaking other platforms where it might be used.
Signed-off-by: Esben Haabendal <esben@geanix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Indirect register access goes through a DCR bus bridge, which
allows only one outstanding transaction. And to make matters
worse, each TEMAC IP block contains two Ethernet interfaces, and
although they seem to have separate registers for indirect access,
they actually share the registers. Or to be more specific, MSW, LSW
and CTL registers are physically shared between Ethernet interfaces
in same TEMAC IP, with RDY register being (almost) specificic to
the Ethernet interface. The 0x10000 bit in RDY reflects combined
bus ready state though.
So we need to take care to synchronize not only within a single
device, but also between devices in same TEMAC IP.
This commit allows to do that with legacy platform devices.
For OF devices, the xlnx,compound parent of the temac node should be
used to find siblings, and setup a shared indirect_mutex between them.
I will leave this work to somebody else, as I don't have hardware to
test that. No regression is introduced by that, as before this commit
using two Ethernet interfaces in same TEMAC block is simply broken.
Signed-off-by: Esben Haabendal <esben@geanix.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace the powerpc specific MMIO register access functions with the
generic big-endian mmio access functions, and add support for
little-endian access depending on configuration.
Big-endian access is maintained as the default, but little-endian can
be configured in device-tree binding or in platform data.
The temac_ior()/temac_iow() functions are replaced with macro wrappers
to avoid modifying existing code more than necessary.
Signed-off-by: Esben Haabendal <esben@geanix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Support initialization with platdata, so the driver can be used on
non-device-tree platforms.
For currently supported device-tree platforms, the driver should behave
as before.
Signed-off-by: Esben Haabendal <esben@geanix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
IEEE 802.1Q-2018 defines the concept of a cycle-time-extension, so the
last entry of a schedule before the start of a new schedule can be
extended, so "too-short" entries can be avoided.
Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
IEEE 802.1Q-2018 defines that a the cycle-time of a schedule may be
overridden, so the schedule is truncated to a determined "width".
Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The IEEE 802.1Q-2018 defines two "types" of schedules, the "Oper" (from
operational?) and "Admin" ones. Up until now, 'taprio' only had
support for the "Oper" one, added when the qdisc is created. This adds
support for the "Admin" one, which allows the .change() operation to
be supported.
Just for clarification, some quick (and dirty) definitions, the "Oper"
schedule is the currently (as in this instant) running one, and it's
read-only. The "Admin" one is the one that the system configurator has
installed, it can be changed, and it will be "promoted" to "Oper" when
it's 'base-time' is reached.
The idea behing this patch is that calling something like the below,
(after taprio is already configured with an initial schedule):
$ tc qdisc change taprio dev IFACE parent root \
base-time X \
sched-entry <CMD> <GATES> <INTERVAL> \
...
Will cause a new admin schedule to be created and programmed to be
"promoted" to "Oper" at instant X. If an "Admin" schedule already
exists, it will be overwritten with the new parameters.
Up until now, there was some code that was added to ease the support
of changing a single entry of a schedule, but was ultimately unused.
Now, that we have support for "change" with more well thought
semantics, updating a single entry seems to be less useful.
So we remove what is in practice dead code, and return a "not
supported" error if the user tries to use it. If changing a single
entry would make the user's life easier we may ressurrect this idea,
but at this point, removing it simplifies the code.
For now, only the schedule specific bits are allowed to be added for a
new schedule, that means that 'clockid', 'num_tc', 'map' and 'queues'
cannot be modified.
Example:
$ tc qdisc change dev IFACE parent root handle 100 taprio \
base-time $BASE_TIME \
sched-entry S 00 500000 \
sched-entry S 0f 500000 \
clockid CLOCK_TAI
The only change in the netlink API introduced by this change is the
introduction of an "admin" type in the response to a dump request,
that type allows userspace to separate the "oper" schedule from the
"admin" schedule. If userspace doesn't support the "admin" type, it
will only display the "oper" schedule.
Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The devlink health reporters create/destroy and user commands currently
use the devlink->lock as a locking mechanism. Different reporters have
different rules in the driver and are being created/destroyed during
different stages of driver load/unload/running. So during execution of a
reporter recover the flow can go through another reporter's destroy and
create. Such flow leads to deadlock trying to lock a mutex already
held.
With the new locking mechanism the different reporters share mutex lock
only to protect access to shared reporters list.
Added refcount per reporter, to protect the reporters from destroy while
being used.
Signed-off-by: Moshe Shemesh <moshe@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that all drivers can be probed using more traditional methods,
remove the legacy probe code.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since different types of hardware may or may not support this setting
per-port, DSA keeps it either in dsa_switch or in dsa_port.
While drivers may know the characteristics of their hardware and
retrieve it from the correct place without the need of helpers, it is
cumbersone to find out an unambigous answer from generic DSA code.
Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
The current behavior is not as obvious as one would assume (which is
that, if the driver set vlan_filtering_is_global = 1, then checking any
dp->vlan_filtering would yield the same result). Only the ports which
are actively enslaved into a bridge would have vlan_filtering set.
This makes it tricky for drivers to check what the global state is.
So fix this and make the struct dsa_switch hold this global setting.
Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
On some switches, the action of whether to parse VLAN frame headers and use
that information for ingress admission is configurable, but not per
port. Such is the case for the Broadcom BCM53xx and the NXP SJA1105
families, for example. In that case, DSA can prevent the bridge core
from trying to apply different VLAN filtering settings on net devices
that belong to the same switch.
Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Suggested-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This allows drivers to query the VLAN setting imposed by the bridge
driver directly from DSA, instead of keeping their own state based on
the .port_vlan_filtering callback.
Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Steffen Klassert says:
====================
pull request (net-next): ipsec-next 2019-04-30
1) A lot of work to remove indirections from the xfrm code.
From Florian Westphal.
2) Support ESP offload in combination with gso partial.
From Boris Pismenny.
3) Remove some duplicated code from vti4.
From Jeremy Sowden.
Please note that there is merge conflict
between commit:
8742dc86d0 ("xfrm4: Fix uninitialized memory read in _decode_session4")
from the ipsec tree and commit:
c53ac41e37 ("xfrm: remove decode_session indirection from afinfo_policy")
from the ipsec-next tree. The merge conflict will appear
when those trees get merged during the merge window.
The conflict can be solved as it is done in linux-next:
https://lkml.org/lkml/2019/4/25/1207
Please pull or let me know if there are problems.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce specification for Geneve decap flow with encapsulation options
and allow creation of rules that are matching on Geneve TLV options.
Reviewed-by: Oz Shlomo <ozsh@mellanox.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Introduce support for Geneve flow specification and allow
the creation of rules that are matching on basic Geneve
protocol fields: VNI, OAM bit, protocol type, options length.
Reviewed-by: Oz Shlomo <ozsh@mellanox.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
When in switchdev mode, we would like to treat loopback RoCE
traffic (on eswitch manager) as RDMA and not as regular
Ethernet traffic
In order to enable it we add flow steering rule that forward RoCE
loopback traffic to the HW RoCE filter (by adding allow rule).
In addition we add RoCE address in GID index 0, which will be
set in the RoCE loopback packet.
Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Acked-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Flow table supports three types of miss action:
1. Default miss action - go to default miss table according to table.
2. Go to specific table.
3. Switch domain - go to the root table of an alternative steering
table domain.
New table miss action was added - switch_domain.
The next domain for RDMA_RX namespace is the NIC RX domain.
Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Add new flow steering namespace - MLX5_FLOW_NAMESPACE_RDMA_RX.
Flow steering rules in this namespace are used to filter
RDMA traffic.
Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Currently mlx5 core stores copy of the PCI device name in a
mlx5_priv structure and uses pr_warn, pr_err helpers.
Get rid of the copy of this name; instead store the parent device
pointer that contains name as well as dma specific parameters.
This also allows to use kernel's well defined dev_warn, dev_err, dev_dbg
device specific print routines.
This is also a preparation patch to access non PCI parent device in
future.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Current ConnectX HW is unable to perform VLAN pop in TX path and VLAN
push on RX path. To workaround that limitation untagged packets will be
tagged with VLAN ID 0x000 (priority tag) and pop/push actions will be
replaced by VLAN re-write actions (which are supported by the HW).
Introduce prio tag mode as a pre-step to controlling the workaround
behavior.
Signed-off-by: Eli Britstein <elibr@mellanox.com>
Reviewed-by: Oz Shlomo <ozsh@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Now that tag drivers dynamically register, we don't need the static
table. Remove it. This also means the tag driver structures can be
made static.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A DSA tag driver module will need to register the tag protocols it
implements with the DSA core. Add macros containing this boiler plate.
The registration/unregistration code is currently just a stub. A Later
patch will add the real implementation.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
v2
Fix indent of #endif
Rewrite to move list pointer into a new structure
v3
Move kdoc next to macro
Fix THIS_MODULE indentation
Signed-off-by: David S. Miller <davem@davemloft.net>
In order that we can match the tagging protocol a switch driver
request to the tagger, we need to know what protocol the tagger
supports. Add this information to the ops structure.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
v2
More tag protocol to end of structure to keep hot members at the beginning.
Signed-off-by: David S. Miller <davem@davemloft.net>
When the tag drivers become modules, we will need to dynamically load
them based on what the switch drivers need. Add aliases to map between
the TAG protocol and the driver.
In order to do this, we need the tag protocol number as something
which the C pre-processor can stringinfy. Only the compiler knows the
value of an enum, CPP cannot use them. So add #defines.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rather than keep a list to map a tagger ops to a name, place the name
into the ops structure. This removes the hard coded list, a step
towards making the taggers more dynamic.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
v2:
Move name to end of structure, keeping the hot entries at the beginning.
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann says:
====================
pull-request: bpf-next 2019-04-28
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) Introduce BPF socket local storage map so that BPF programs can store
private data they associate with a socket (instead of e.g. separate hash
table), from Martin.
2) Add support for bpftool to dump BTF types. This is done through a new
`bpftool btf dump` sub-command, from Andrii.
3) Enable BPF-based flow dissector for skb-less eth_get_headlen() calls which
was currently not supported since skb was used to lookup netns, from Stanislav.
4) Add an opt-in interface for tracepoints to expose a writable context
for attached BPF programs, used here for NBD sockets, from Matt.
5) BPF xadd related arm64 JIT fixes and scalability improvements, from Daniel.
6) Change the skb->protocol for bpf_skb_adjust_room() helper in order to
support tunnels such as sit. Add selftests as well, from Willem.
7) Various smaller misc fixes.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Add options to strictly validate messages and dump messages,
sometimes perhaps validating dump messages non-strictly may
be required, so add an option for that as well.
Since none of this can really be applied to existing commands,
set the options everwhere using the following spatch:
@@
identifier ops;
expression X;
@@
struct genl_ops ops[] = {
...,
{
.cmd = X,
+ .validate = GENL_DONT_VALIDATE_STRICT | GENL_DONT_VALIDATE_DUMP,
...
},
...
};
For new commands one should just not copy the .validate 'opt-out'
flags and thus get strict validation.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Unfortunately, we cannot add strict parsing for all attributes, as
that would break existing userspace. We currently warn about it, but
that's about all we can do.
For new attributes, however, the story is better: nobody is using
them, so we can reject bad sizes.
Also, for new attributes, we need not accept them when the policy
doesn't declare their usage.
David Ahern and I went back and forth on how to best encode this, and
the best way we found was to have a "boundary type", from which point
on new attributes have all possible validation applied, and NLA_UNSPEC
is rejected.
As we didn't want to add another argument to all functions that get a
netlink policy, the workaround is to encode that boundary in the first
entry of the policy array (which is for type 0 and thus probably not
really valid anyway). I put it into the validation union for the rare
possibility that somebody is actually using attribute 0, which would
continue to work fine unless they tried to use the extended validation,
which isn't likely. We also didn't find any in-tree users with type 0.
The reason for setting the "start strict here" attribute is that we
never really need to start strict from 0, which is invalid anyway (or
in legacy families where that isn't true, it cannot be set to strict),
so we can thus reserve the value 0 for "don't do this check" and don't
have to add the tag to all policies right now.
Thus, policies can now opt in to this validation, which we should do
for all existing policies, at least when adding new attributes.
Note that entirely *new* policies won't need to set it, as the use
of that should be using nla_parse()/nlmsg_parse() etc. which anyway
do fully strict validation now, regardless of this.
So in effect, this patch only covers the "existing command with new
attribute" case.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This re-adds the parse and validate functions like nla_parse()
that are now actually strict after the previous rename and were
just split out to make sure everything is converted (and if not
compilation of the previous patch would fail.)
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We currently have two levels of strict validation:
1) liberal (default)
- undefined (type >= max) & NLA_UNSPEC attributes accepted
- attribute length >= expected accepted
- garbage at end of message accepted
2) strict (opt-in)
- NLA_UNSPEC attributes accepted
- attribute length >= expected accepted
Split out parsing strictness into four different options:
* TRAILING - check that there's no trailing data after parsing
attributes (in message or nested)
* MAXTYPE - reject attrs > max known type
* UNSPEC - reject attributes with NLA_UNSPEC policy entries
* STRICT_ATTRS - strictly validate attribute size
The default for future things should be *everything*.
The current *_strict() is a combination of TRAILING and MAXTYPE,
and is renamed to _deprecated_strict().
The current regular parsing has none of this, and is renamed to
*_parse_deprecated().
Additionally it allows us to selectively set one of the new flags
even on old policies. Notably, the UNSPEC flag could be useful in
this case, since it can be arranged (by filling in the policy) to
not be an incompatible userspace ABI change, but would then going
forward prevent forgetting attribute entries. Similar can apply
to the POLICY flag.
We end up with the following renames:
* nla_parse -> nla_parse_deprecated
* nla_parse_strict -> nla_parse_deprecated_strict
* nlmsg_parse -> nlmsg_parse_deprecated
* nlmsg_parse_strict -> nlmsg_parse_deprecated_strict
* nla_parse_nested -> nla_parse_nested_deprecated
* nla_validate_nested -> nla_validate_nested_deprecated
Using spatch, of course:
@@
expression TB, MAX, HEAD, LEN, POL, EXT;
@@
-nla_parse(TB, MAX, HEAD, LEN, POL, EXT)
+nla_parse_deprecated(TB, MAX, HEAD, LEN, POL, EXT)
@@
expression NLH, HDRLEN, TB, MAX, POL, EXT;
@@
-nlmsg_parse(NLH, HDRLEN, TB, MAX, POL, EXT)
+nlmsg_parse_deprecated(NLH, HDRLEN, TB, MAX, POL, EXT)
@@
expression NLH, HDRLEN, TB, MAX, POL, EXT;
@@
-nlmsg_parse_strict(NLH, HDRLEN, TB, MAX, POL, EXT)
+nlmsg_parse_deprecated_strict(NLH, HDRLEN, TB, MAX, POL, EXT)
@@
expression TB, MAX, NLA, POL, EXT;
@@
-nla_parse_nested(TB, MAX, NLA, POL, EXT)
+nla_parse_nested_deprecated(TB, MAX, NLA, POL, EXT)
@@
expression START, MAX, POL, EXT;
@@
-nla_validate_nested(START, MAX, POL, EXT)
+nla_validate_nested_deprecated(START, MAX, POL, EXT)
@@
expression NLH, HDRLEN, MAX, POL, EXT;
@@
-nlmsg_validate(NLH, HDRLEN, MAX, POL, EXT)
+nlmsg_validate_deprecated(NLH, HDRLEN, MAX, POL, EXT)
For this patch, don't actually add the strict, non-renamed versions
yet so that it breaks compile if I get it wrong.
Also, while at it, make nla_validate and nla_parse go down to a
common __nla_validate_parse() function to avoid code duplication.
Ultimately, this allows us to have very strict validation for every
new caller of nla_parse()/nlmsg_parse() etc as re-introduced in the
next patch, while existing things will continue to work as is.
In effect then, this adds fully strict validation for any new command.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rather than using NLA_UNSPEC for this type of thing, use NLA_MIN_LEN
so we can make NLA_UNSPEC be NLA_REJECT under certain conditions for
future attributes.
While at it, also use NLA_EXACT_LEN for the struct example.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After the previous commit, both ipset_nest_start() and ipset_nest_end() are
just aliases for nla_nest_start() and nla_nest_end() so that there is no
need to keep them.
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Acked-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: David S. Miller <davem@davemloft.net>
Even if the NLA_F_NESTED flag was introduced more than 11 years ago, most
netlink based interfaces (including recently added ones) are still not
setting it in kernel generated messages. Without the flag, message parsers
not aware of attribute semantics (e.g. wireshark dissector or libmnl's
mnl_nlmsg_fprintf()) cannot recognize nested attributes and won't display
the structure of their contents.
Unfortunately we cannot just add the flag everywhere as there may be
userspace applications which check nlattr::nla_type directly rather than
through a helper masking out the flags. Therefore the patch renames
nla_nest_start() to nla_nest_start_noflag() and introduces nla_nest_start()
as a wrapper adding NLA_F_NESTED. The calls which add NLA_F_NESTED manually
are rewritten to use nla_nest_start().
Except for changes in include/net/netlink.h, the patch was generated using
this semantic patch:
@@ expression E1, E2; @@
-nla_nest_start(E1, E2)
+nla_nest_start_noflag(E1, E2)
@@ expression E1, E2; @@
-nla_nest_start_noflag(E1, E2 | NLA_F_NESTED)
+nla_nest_start(E1, E2)
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
To avoid a sparse warning byteswap the be32 sequence number
before it's stored in the atomic value. While at it drop
unnecessary brackets and use kernel's u64 type.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There seems to be no reason for tls_ops to be defined in netdevice.h
which is included in a lot of places. Don't wrap the struct/enum
declaration in ifdefs, it trickles down unnecessary ifdefs into
driver code.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
tls_device_sk_destruct being set on a socket used to indicate
that socket is a kTLS device one. That is no longer true -
now we use sk_validate_xmit_skb pointer for that purpose.
Remove the export. tls_device_attach() needs to be moved.
While at it, remove the dead declaration of tls_sk_destruct().
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After allowing a bpf prog to
- directly read the skb->sk ptr
- get the fullsock bpf_sock by "bpf_sk_fullsock()"
- get the bpf_tcp_sock by "bpf_tcp_sock()"
- get the listener sock by "bpf_get_listener_sock()"
- avoid duplicating the fields of "(bpf_)sock" and "(bpf_)tcp_sock"
into different bpf running context.
this patch is another effort to make bpf's network programming
more intuitive to do (together with memory and performance benefit).
When bpf prog needs to store data for a sk, the current practice is to
define a map with the usual 4-tuples (src/dst ip/port) as the key.
If multiple bpf progs require to store different sk data, multiple maps
have to be defined. Hence, wasting memory to store the duplicated
keys (i.e. 4 tuples here) in each of the bpf map.
[ The smallest key could be the sk pointer itself which requires
some enhancement in the verifier and it is a separate topic. ]
Also, the bpf prog needs to clean up the elem when sk is freed.
Otherwise, the bpf map will become full and un-usable quickly.
The sk-free tracking currently could be done during sk state
transition (e.g. BPF_SOCK_OPS_STATE_CB).
The size of the map needs to be predefined which then usually ended-up
with an over-provisioned map in production. Even the map was re-sizable,
while the sk naturally come and go away already, this potential re-size
operation is arguably redundant if the data can be directly connected
to the sk itself instead of proxy-ing through a bpf map.
This patch introduces sk->sk_bpf_storage to provide local storage space
at sk for bpf prog to use. The space will be allocated when the first bpf
prog has created data for this particular sk.
The design optimizes the bpf prog's lookup (and then optionally followed by
an inline update). bpf_spin_lock should be used if the inline update needs
to be protected.
BPF_MAP_TYPE_SK_STORAGE:
-----------------------
To define a bpf "sk-local-storage", a BPF_MAP_TYPE_SK_STORAGE map (new in
this patch) needs to be created. Multiple BPF_MAP_TYPE_SK_STORAGE maps can
be created to fit different bpf progs' needs. The map enforces
BTF to allow printing the sk-local-storage during a system-wise
sk dump (e.g. "ss -ta") in the future.
The purpose of a BPF_MAP_TYPE_SK_STORAGE map is not for lookup/update/delete
a "sk-local-storage" data from a particular sk.
Think of the map as a meta-data (or "type") of a "sk-local-storage". This
particular "type" of "sk-local-storage" data can then be stored in any sk.
The main purposes of this map are mostly:
1. Define the size of a "sk-local-storage" type.
2. Provide a similar syscall userspace API as the map (e.g. lookup/update,
map-id, map-btf...etc.)
3. Keep track of all sk's storages of this "type" and clean them up
when the map is freed.
sk->sk_bpf_storage:
------------------
The main lookup/update/delete is done on sk->sk_bpf_storage (which
is a "struct bpf_sk_storage"). When doing a lookup,
the "map" pointer is now used as the "key" to search on the
sk_storage->list. The "map" pointer is actually serving
as the "type" of the "sk-local-storage" that is being
requested.
To allow very fast lookup, it should be as fast as looking up an
array at a stable-offset. At the same time, it is not ideal to
set a hard limit on the number of sk-local-storage "type" that the
system can have. Hence, this patch takes a cache approach.
The last search result from sk_storage->list is cached in
sk_storage->cache[] which is a stable sized array. Each
"sk-local-storage" type has a stable offset to the cache[] array.
In the future, a map's flag could be introduced to do cache
opt-out/enforcement if it became necessary.
The cache size is 16 (i.e. 16 types of "sk-local-storage").
Programs can share map. On the program side, having a few bpf_progs
running in the networking hotpath is already a lot. The bpf_prog
should have already consolidated the existing sock-key-ed map usage
to minimize the map lookup penalty. 16 has enough runway to grow.
All sk-local-storage data will be removed from sk->sk_bpf_storage
during sk destruction.
bpf_sk_storage_get() and bpf_sk_storage_delete():
------------------------------------------------
Instead of using bpf_map_(lookup|update|delete)_elem(),
the bpf prog needs to use the new helper bpf_sk_storage_get() and
bpf_sk_storage_delete(). The verifier can then enforce the
ARG_PTR_TO_SOCKET argument. The bpf_sk_storage_get() also allows to
"create" new elem if one does not exist in the sk. It is done by
the new BPF_SK_STORAGE_GET_F_CREATE flag. An optional value can also be
provided as the initial value during BPF_SK_STORAGE_GET_F_CREATE.
The BPF_MAP_TYPE_SK_STORAGE also supports bpf_spin_lock. Together,
it has eliminated the potential use cases for an equivalent
bpf_map_update_elem() API (for bpf_prog) in this patch.
Misc notes:
----------
1. map_get_next_key is not supported. From the userspace syscall
perspective, the map has the socket fd as the key while the map
can be shared by pinned-file or map-id.
Since btf is enforced, the existing "ss" could be enhanced to pretty
print the local-storage.
Supporting a kernel defined btf with 4 tuples as the return key could
be explored later also.
2. The sk->sk_lock cannot be acquired. Atomic operations is used instead.
e.g. cmpxchg is done on the sk->sk_bpf_storage ptr.
Please refer to the source code comments for the details in
synchronization cases and considerations.
3. The mem is charged to the sk->sk_omem_alloc as the sk filter does.
Benchmark:
---------
Here is the benchmark data collected by turning on
the "kernel.bpf_stats_enabled" sysctl.
Two bpf progs are tested:
One bpf prog with the usual bpf hashmap (max_entries = 8192) with the
sk ptr as the key. (verifier is modified to support sk ptr as the key
That should have shortened the key lookup time.)
Another bpf prog is with the new BPF_MAP_TYPE_SK_STORAGE.
Both are storing a "u32 cnt", do a lookup on "egress_skb/cgroup" for
each egress skb and then bump the cnt. netperf is used to drive
data with 4096 connected UDP sockets.
BPF_MAP_TYPE_HASH with a modifier verifier (152ns per bpf run)
27: cgroup_skb name egress_sk_map tag 74f56e832918070b run_time_ns 58280107540 run_cnt 381347633
loaded_at 2019-04-15T13:46:39-0700 uid 0
xlated 344B jited 258B memlock 4096B map_ids 16
btf_id 5
BPF_MAP_TYPE_SK_STORAGE in this patch (66ns per bpf run)
30: cgroup_skb name egress_sk_stora tag d4aa70984cc7bbf6 run_time_ns 25617093319 run_cnt 390989739
loaded_at 2019-04-15T13:47:54-0700 uid 0
xlated 168B jited 156B memlock 4096B map_ids 17
btf_id 6
Here is a high-level picture on how are the objects organized:
sk
┌──────┐
│ │
│ │
│ │
│*sk_bpf_storage─────▶ bpf_sk_storage
└──────┘ ┌───────┐
┌───────────┤ list │
│ │ │
│ │ │
│ │ │
│ └───────┘
│
│ elem
│ ┌────────┐
├─▶│ snode │
│ ├────────┤
│ │ data │ bpf_map
│ ├────────┤ ┌─────────┐
│ │map_node│◀─┬─────┤ list │
│ └────────┘ │ │ │
│ │ │ │
│ elem │ │ │
│ ┌────────┐ │ └─────────┘
└─▶│ snode │ │
├────────┤ │
bpf_map │ data │ │
┌─────────┐ ├────────┤ │
│ list ├───────▶│map_node│ │
│ │ └────────┘ │
│ │ │
│ │ elem │
└─────────┘ ┌────────┐ │
┌─▶│ snode │ │
│ ├────────┤ │
│ │ data │ │
│ ├────────┤ │
│ │map_node│◀─┘
│ └────────┘
│
│
│ ┌───────┐
sk └──────────│ list │
┌──────┐ │ │
│ │ │ │
│ │ │ │
│ │ └───────┘
│*sk_bpf_storage───────▶bpf_sk_storage
└──────┘
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This tests that:
* a BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE cannot be attached if it
uses either:
* a variable offset to the tracepoint buffer, or
* an offset beyond the size of the tracepoint buffer
* a tracer can modify the buffer provided when attached to a writable
tracepoint in bpf_prog_test_run
Signed-off-by: Matt Mullins <mmullins@fb.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This adds four tracepoints to nbd, enabling separate tracing of payload
and header sending/receipt.
In the send path for headers that have already been sent, we also
explicitly initialize the handle so it can be referenced by the later
tracepoint.
Signed-off-by: Andrew Hall <hall@fb.com>
Signed-off-by: Matt Mullins <mmullins@fb.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This adds a tracepoint that can both observe the nbd request being sent
to the server, as well as modify that request , e.g., setting a flag in
the request that will cause the server to collect detailed tracing data.
The struct request * being handled is included to permit correlation
with the block tracepoints.
Signed-off-by: Matt Mullins <mmullins@fb.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This is an opt-in interface that allows a tracepoint to provide a safe
buffer that can be written from a BPF_PROG_TYPE_RAW_TRACEPOINT program.
The size of the buffer must be a compile-time constant, and is checked
before allowing a BPF program to attach to a tracepoint that uses this
feature.
The pointer to this buffer will be the first argument of tracepoints
that opt in; the pointer is valid and can be bpf_probe_read() by both
BPF_PROG_TYPE_RAW_TRACEPOINT and BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE
programs that attach to such a tracepoint, but the buffer to which it
points may only be written by the latter.
Signed-off-by: Matt Mullins <mmullins@fb.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Johannes Berg says:
====================
Various updates, notably:
* extended key ID support (from 802.11-2016)
* per-STA TX power control support
* mac80211 TX performance improvements
* HE (802.11ax) updates
* mesh link probing support
* enhancements of multi-BSSID support (also related to HE)
* OWE userspace processing support
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The requirement for mesh link metric refreshing, is that from one
mesh point we be able to send some data frames to other mesh points
which are not currently selected as a primary traffic path, but which
are only 1 hop away. The absence of the primary path to the chosen node
makes it necessary to apply some form of marking on a chosen packet
stream so that the packets can be properly steered to the selected node
for testing, and not by the regular mesh path lookup.
Tested-by: Pradeep Kumar Chitrapu <pradeepc@codeaurora.org>
Signed-off-by: Rajkumar Manoharan <rmanohar@codeaurora.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>