rxrpc has a problem in its congestion management in that it saves the
congestion window size (cwnd) from one call to another, but if this is 0 at
the time it is saved, then the next call may not actually manage to ever
transmit anything.
To fix this:
(1) Don't save cwnd between calls, but rather reset back down to the
initial cwnd and re-enter slow-start if data transmission is idle for
more than an RTT.
(2) Preserve ssthresh instead, as that is a handy estimate of pipe
capacity. Knowing roughly when to stop slow start and enter
congestion avoidance can reduce the tendency to overshoot and drop
larger amounts of packets when probing.
In future, cwnd growth also needs to be constrained when the window isn't
being filled due to being application limited.
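The two points above can be sketched as follows. This is a minimal illustration, not the rxrpc code: struct cong_state, INITIAL_CWND, cong_start_call() and cong_check_idle() are hypothetical names, and the initial window of 4 packets is an assumed value.

```c
#include <stdbool.h>
#include <stdint.h>

#define INITIAL_CWND 4u /* assumed initial window, in packets */

/* Hypothetical per-call congestion state (not the rxrpc structures). */
struct cong_state {
	unsigned int cwnd;     /* congestion window, in packets */
	unsigned int ssthresh; /* slow-start threshold, in packets */
	bool slow_start;       /* true while in slow start */
};

/*
 * Start a new call: cwnd is never inherited from the previous call, but
 * the ssthresh saved from it is kept as an estimate of pipe capacity.
 */
void cong_start_call(struct cong_state *cs, unsigned int saved_ssthresh)
{
	cs->cwnd = INITIAL_CWND;
	cs->ssthresh = saved_ssthresh;
	cs->slow_start = cs->cwnd < cs->ssthresh;
}

/*
 * Called before transmitting. If nothing has been sent for more than one
 * RTT, drop back to the initial cwnd and re-enter slow start. ssthresh is
 * deliberately left untouched.
 */
void cong_check_idle(struct cong_state *cs, uint64_t now_us,
		     uint64_t last_tx_us, uint64_t rtt_us)
{
	if (now_us - last_tx_us > rtt_us) {
		cs->cwnd = INITIAL_CWND;
		cs->slow_start = true;
	}
}
```

Because cwnd always restarts from a non-zero initial value, a new call cannot be stuck with a window of 0 inherited from an earlier call.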
Reported-by: Simon Wilkinson <sxw@auristor.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Improve the tracking of which packets need to be transmitted by saving the
last ACK packet that we receive that has a populated soft-ACK table rather
than marking packets. Then we can step through the soft-ACK table and look
at the packets we've transmitted beyond that to determine which packets we
might want to retransmit.
We also look at the highest serial number that has been acked to try and
guess which packets we've transmitted the peer is likely to have seen. If
necessary, we send a ping to retrieve that number.
One downside that might be a problem is that we can't then compare the
previous acked/unacked state so easily in rxrpc_input_soft_acks() - which
is a potential problem for the slow-start algorithm.
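As a rough sketch of stepping through a saved soft-ACK table, the helper below collects retransmission candidates. It is an assumption-laden illustration: pick_retransmits() and its parameters are hypothetical, one byte is used per soft-ACK entry, and every transmitted packet beyond the end of the table is treated as a candidate, whereas real code would also consult timers and the highest acked serial number.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * acks[i] is nonzero if packet (first_seq + i) was soft-ACK'd in the saved
 * ACK. Packets first_seq..tx_top (inclusive) have been transmitted.
 * Candidate sequence numbers are written to out[]; the count is returned.
 */
size_t pick_retransmits(const uint8_t *acks, size_t nr_acks,
			uint32_t first_seq, uint32_t tx_top,
			uint32_t *out, size_t out_max)
{
	size_t n = 0;
	uint64_t span;

	if (tx_top < first_seq)
		return 0;
	span = (uint64_t)tx_top - first_seq + 1;

	for (uint64_t i = 0; i < span && n < out_max; i++) {
		/* Inside the table and ACK'd: nothing to do. */
		if (i < nr_acks && acks[i])
			continue;
		/* NACK'd in the table, or transmitted beyond it. */
		out[n++] = first_seq + (uint32_t)i;
	}
	return n;
}
```

With a table of {ACK, NACK, ACK} starting at sequence 5 and packets transmitted up to 9, the candidates are 6 (NACK'd) plus 8 and 9 (beyond the table).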
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Change the way the Tx queueing works to make the following ends easier to
achieve:
(1) The filling of packets, the encryption of packets and the transmission
of packets can be handled in parallel by separate threads, rather than
rxrpc_sendmsg() allocating, filling, encrypting and transmitting each
packet before moving onto the next one.
(2) Get rid of the fixed-size ring which sets a hard limit on the number
of packets that can be retained in the ring. This allows the number
of packets to increase without having to allocate a very large ring or
having variable-sized rings.
[Note: the downside of this is that it's then less efficient to locate
a packet for retransmission as we then have to step through a list and
examine each buffer in the list.]
(3) Allow the filler/encrypter to run ahead of the transmission window.
(4) Make it easier to do zero copy UDP from the packet buffers.
(5) Make it easier to do zero copy from userspace to the packet buffers -
and thence to UDP (only for unauthenticated connections).
To that end, the following changes are made:
(1) Use the new rxrpc_txbuf struct instead of sk_buff for keeping packets
to be transmitted in. This allows them to be placed on multiple
queues simultaneously. An sk_buff isn't really necessary as it's
never passed on to lower-level networking code.
(2) Keep the transmissible packets in a linked list on the call struct
rather than in a ring. As a consequence, the annotation buffer isn't
used either; rather a flag is set on the packet to indicate ackedness.
(3) Use the RXRPC_CALL_TX_LAST flag to indicate that the last packet to be
transmitted has been queued. Add RXRPC_CALL_TX_ALL_ACKED to indicate
that all packets up to and including the last got hard acked.
(4) Wire headers are now stored in the txbuf rather than being concocted
on the stack and they're stored immediately before the data, thereby
allowing zerocopy of a single span.
(5) Don't bother with instant-resend on transmission failure; rather,
leave it for a timer or an ACK packet to trigger.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Get rid of the Rx ring and replace it with a pair of queues instead. One
queue gets the packets that are in-sequence and are ready for processing by
recvmsg(); the other queue gets the out-of-sequence packets for addition to
the first queue as the holes get filled.
The annotation ring is removed and replaced with a SACK table. The SACK
table has the bits set that correspond exactly to the sequence number of
the packet being acked. The SACK table is copied when an ACK packet is
being assembled and rotated so that the first ACK is in byte 0.
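The rotated copy can be illustrated like this. It is only a sketch under stated assumptions: SACK_SIZE, sack_set() and sack_copy_rotated() are hypothetical names, the table is shrunk to 8 entries, and one byte per packet is used for readability.

```c
#include <stddef.h>
#include <stdint.h>

#define SACK_SIZE 8u /* assumed table size; kept small for illustration */

/* Mark a sequence number as received; the slot is seq modulo the size. */
void sack_set(uint8_t table[SACK_SIZE], uint32_t seq)
{
	table[seq % SACK_SIZE] = 1;
}

/*
 * Copy the circular table into an ACK body, rotated so that out[0]
 * describes first_seq, the first packet the ACK reports on.
 */
void sack_copy_rotated(const uint8_t table[SACK_SIZE], uint32_t first_seq,
		       uint8_t out[SACK_SIZE])
{
	size_t start = first_seq % SACK_SIZE;

	for (size_t i = 0; i < SACK_SIZE; i++)
		out[i] = table[(start + i) % SACK_SIZE];
}
```

Indexing by the sequence number itself means no per-packet bookkeeping is needed when the window advances; only the copy into the ACK has to account for the rotation.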
Flow control handling is altered so that packets that are moved to the
in-sequence queue are hard-ACK'd even before they're consumed - and then
the Rx window size in the ACK packet (rsize) is shrunk down to compensate
(even going to 0 if the window is full).
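The compensation in the advertised window amounts to a clamped subtraction. In this sketch, advertised_rsize() and its parameter names are hypothetical.

```c
#include <stdint.h>

/*
 * Window to advertise in the next ACK: the configured Rx window minus the
 * packets that were hard-ACK'd on entering the in-sequence queue but have
 * not yet been consumed by recvmsg(). Clamps at 0 when the queue is full.
 */
uint32_t advertised_rsize(uint32_t rx_winsize, uint32_t queued_unconsumed)
{
	if (queued_unconsumed >= rx_winsize)
		return 0;
	return rx_winsize - queued_unconsumed;
}
```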
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Split up received jumbo packets into separate skbuffs by cloning the
original skbuff for each subpacket and setting the offset and length of the
data in that subpacket in the skbuff's private data. The subpackets are
then placed on the recvmsg queue separately. The security class then gets
to revise the offset and length to remove its metadata.
If we fail to clone a packet, we just drop it and let the peer resend it.
The original packet gets used for the final subpacket.
This should make it easier to handle parallel decryption of the subpackets.
It also simplifies the handling of lost or misordered packets in the
queuing/buffering loop as the possibility of overlapping jumbo packets no
longer needs to be considered.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Split the rxrpc_recvmsg tracepoint so that the tracepoints that are about
data packet processing (and which have extra pieces of information) are
separate from the tracepoint that shows the general flow of recvmsg().
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Clean up the rxrpc_propose_ACK() function. If the deferred PING ACK proposal
is split out, it's only really needed for deferred DELAY ACKs. All other
ACKs, bar the terminal IDLE ACK, are sent immediately. The deferred IDLE ACK
submission can be handled by conversion of a DELAY ACK into an IDLE ACK if
there's nothing to be SACK'd.
Also, because there's a delay between an ACK being generated and being
transmitted, it's possible that other ACKs of the same type will be
generated during that interval. Apart from the ACK time and the serial
number responded to, most of the ACK body, including window and SACK
parameters, is not filled out till the point of transmission - so we can
avoid generating a new ACK if there's one pending that will cover the SACK
data we need to convey.
Therefore, don't propose a new DELAY or IDLE ACK for a call if there's one
already pending.
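The resulting rule can be sketched in a few lines. The enum, struct call_ack and both functions are hypothetical stand-ins rather than the rxrpc definitions; the sketch only models deferred DELAY and IDLE ACKs.

```c
#include <stdbool.h>

enum ack_reason { ACK_NONE, ACK_DELAY, ACK_IDLE };

/* Hypothetical per-call record of the deferred ACK that is pending. */
struct call_ack {
	enum ack_reason pending;
};

/*
 * Propose a deferred ACK. Returns true if it was recorded, false if one
 * was already pending (the pending one will carry the SACK data anyway,
 * since the body is only filled in at transmission time).
 */
bool propose_deferred_ack(struct call_ack *c, enum ack_reason why)
{
	if (why == ACK_NONE || c->pending != ACK_NONE)
		return false;
	c->pending = why;
	return true;
}

/*
 * At transmission time, decide what to send: a DELAY ACK with nothing to
 * SACK is converted into an IDLE ACK.
 */
enum ack_reason resolve_at_transmit(struct call_ack *c, bool have_sack_data)
{
	enum ack_reason why = c->pending;

	c->pending = ACK_NONE;
	if (why == ACK_DELAY && !have_sack_data)
		why = ACK_IDLE;
	return why;
}
```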
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Define a struct, rxrpc_txbuf, to carry data to be transmitted instead of a
socket buffer so that it can be placed onto multiple queues at once. This
also allows the data buffer to be in the same allocation as the internal
data.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Remove the flags from the rxrpc_skb tracepoint as we're no longer going to
be using this for the transmission buffers and so marking which are
transmission buffers isn't going to be necessary.
Note that this also removes the rxrpc skb flag that indicates if this is a
transmission buffer and so the count is not updated for the moment.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Add helper support to act_ct for offloading OVS actions=ct(alg=xxx). This
corresponds to commit cae3a26275 ("openvswitch: Allow attaching helpers to
ct action") in the OVS kernel part.
The difference is that, when adding TC actions, the family and proto cannot
be obtained from the filter/match. So, besides the helper name in
tb[TCA_CT_HELPER_NAME], we also need to send the family in
tb[TCA_CT_HELPER_FAMILY] and the proto in tb[TCA_CT_HELPER_PROTO] to the
kernel.
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Move ovs_ct_add_helper from openvswitch to nf_conntrack_helper and
rename as nf_ct_add_helper, so that it can be used in TC act_ct in
the next patch.
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Move ovs_ct_helper from openvswitch to nf_conntrack_helper and rename
as nf_ct_helper so that it can be used in TC act_ct in the next patch.
Note that this also adds checks for the family and proto because, in TC
act_ct, packets are not guaranteed to have the correct family and proto.
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The previous attempt to augment carrier_down (see Link)
was not met with much enthusiasm so let's do the simple
thing of exposing what some devices already maintain.
Add a common ethtool statistic for link going down.
Currently users have to maintain per-driver mapping
to extract the right stat from the vendor-specific ethtool -S
stats. carrier_down does not fit the bill because it counts
a lot of software related false positives.
Add the statistic to the extended link state API to steer
vendors towards implementing all of it.
Implement for bnxt and all Linux-controlled PHYs. mlx5 and (possibly)
enic also have a counter for this but I leave the implementation
to their maintainers.
Link: https://lore.kernel.org/r/20220520004500.2250674-1-kuba@kernel.org
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Michael Chan <michael.chan@broadcom.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20221104190125.684910-1-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Virtually all conventional network drivers are now converted to use
phylink_generic_validate() - only DSA drivers and fman_memac remain,
so let's remove the necessity for network drivers to explicitly set
this member, and default to phylink_generic_validate() when unset.
This is possible as .validate must currently be set.
Any remaining instances that have not been addressed by this patch can
be fixed up later.
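The "default when unset" pattern can be shown with a plain function pointer. This is a generic sketch, not the phylink API: struct mac_ops, generic_validate(), strict_validate() and run_validate() are hypothetical, and the integer mode argument is a placeholder.

```c
#include <stddef.h>

/* Hypothetical ops table with an optional validate hook. */
struct mac_ops {
	int (*validate)(int mode);
};

/* Stand-in for the generic validator: accepts any non-negative mode. */
int generic_validate(int mode)
{
	return mode >= 0 ? 0 : -1;
}

/* Example driver-specific validator: accepts only mode 1. */
int strict_validate(int mode)
{
	return mode == 1 ? 0 : -1;
}

/* Call the driver's hook if it set one; otherwise use the generic one. */
int run_validate(const struct mac_ops *ops, int mode)
{
	if (ops->validate)
		return ops->validate(mode);
	return generic_validate(mode);
}
```

Drivers that previously pointed the hook at the generic validator can then simply leave it NULL.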
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://lore.kernel.org/r/E1or0FZ-001tRa-DI@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Let families hook in the new split ops.
They are more flexible and should not be much larger than
full ops. Each split op is 40B while full op is 48B.
Devlink for example has 54 dos and 19 dumps, 2 of the dumps
do not have a do -> 56 full commands = 2688B.
Split ops would have taken 2920B, so 9% more space while
allowing individual pre/post doit and per-type policies.
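The size comparison can be checked with a short calculation. The function names are hypothetical; the 48B and 40B sizes and the devlink command counts are the ones quoted above.

```c
#include <stddef.h>

#define FULL_OP_SIZE  48u /* bytes per full op, as quoted above */
#define SPLIT_OP_SIZE 40u /* bytes per split op, as quoted above */

/* A full op covers a do and a dump of the same command in one entry. */
size_t full_ops_bytes(size_t dos, size_t dumps_without_do)
{
	return (dos + dumps_without_do) * FULL_OP_SIZE;
}

/* Split ops need one entry per do and one entry per dump. */
size_t split_ops_bytes(size_t dos, size_t dumps)
{
	return (dos + dumps) * SPLIT_OP_SIZE;
}
```

For devlink, full_ops_bytes(54, 2) gives 56 * 48 = 2688 and split_ops_bytes(54, 19) gives 73 * 40 = 2920, a difference of 232 bytes, or about 8.6%.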
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We currently have two forms of operations - small ops and "full" ops
(or just ops). The former does not have pointers for some of the less
commonly used features (namely dump start/done and policy).
The "full" ops, however, still don't contain all the necessary
information. In particular the policy is per command ID, while
do and dump often accept different attributes. It's also not
possible to define different pre_doit and post_doit callbacks
for different commands within the family.
At the same time a lot of commands do not support dumping and
therefore all the dump-related information is wasted space.
Create a new command representation which can hold info about
a do implementation or a dump implementation, but not both at
the same time.
Use this new representation on the command execution path
(genl_family_rcv_msg) as we either run a do or a dump and
don't have to create a "full" op there.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move the private fields down to form a "private section".
Use the kdoc "private:" label comment thing to hide them
from the main kdoc comment.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Johannes Berg <johannes@sipsolutions.net>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Support mode switch properly, which was not available before.
If the SoC has two Ethernet controllers, by setting both of them into MII
mode, the first controller enters GMII mode, while the second
controller is effectively disabled. This requires configuring (and
maybe enabling) the second controller in the device tree, even though
it cannot be used.
Signed-off-by: David Yang <mmyangfl@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Just a basic s/thig/this swap, fixing up a typo introduced by a commit
added in the 6.1 release.
Fixes: 9cda70f622 ("io_uring: introduce fixed buffer support for io_uring_cmd")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Shifting a signed 32-bit value by 31 bits is undefined, so change the
significant bit to unsigned. The UBSAN warning call trace is shown below:
UBSAN: shift-out-of-bounds in security/commoncap.c:1252:2
left shift of 1 by 31 places cannot be represented in type 'int'
Call Trace:
<TASK>
dump_stack_lvl+0x7d/0xa5
dump_stack+0x15/0x1b
ubsan_epilogue+0xe/0x4e
__ubsan_handle_shift_out_of_bounds+0x1e7/0x20c
cap_task_prctl+0x561/0x6f0
security_task_prctl+0x5a/0xb0
__x64_sys_prctl+0x61/0x8f0
do_syscall_64+0x58/0x80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
</TASK>
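The general shape of such a fix is to make the shifted constant unsigned. The helper below is a hypothetical illustration, not the kernel macro: with a plain int 1, a shift by 31 cannot be represented in type 'int', while 1U shifted by 31 is well defined.

```c
#include <stdint.h>

/*
 * Mask for bit n of a 32-bit word. The U suffix makes the constant
 * unsigned, so shifting into bit 31 is defined behaviour.
 */
uint32_t bit_mask32(unsigned int n)
{
	return 1U << (n & 31);
}
```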
Fixes: e338d263a7 ("Add 64-bit capability support to the kernel")
Signed-off-by: Gaosheng Cui <cuigaosheng1@huawei.com>
Acked-by: Andrew G. Morgan <morgan@kernel.org>
Reviewed-by: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
The dispatcher function is currently abusing the ftrace __fentry__
call location for its own purposes -- this obviously gives trouble
when the dispatcher and ftrace are both in use.
A previous solution tried using __attribute__((patchable_function_entry()))
which works, except it is GCC-8+ only, breaking the build on the
earlier still supported compilers. Instead use static_call() -- which
has its own annotations and does not conflict with ftrace -- to
rewrite the dispatch function.
By using: return static_call()(ctx, insni, bpf_func) you get a perfect
forwarding tail call as function body (iow a single jmp instruction).
By having the default static_call() target be bpf_dispatcher_nop_func()
it retains the default behaviour (an indirect call to the argument
function). Only once a dispatcher program is attached is the target
rewritten to directly call the JIT'ed image.
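The retargeting behaviour can be modelled in userspace with an ordinary function pointer. Note the swap: this is not static_call(), which patches the call site so that no indirect call remains; the sketch only shows a default target that forwards to the argument function and a later switch to a direct target. All names here are hypothetical.

```c
#include <stddef.h>

typedef unsigned int (*prog_fn)(const void *ctx, const void *insns);
typedef unsigned int (*dispatch_fn)(const void *ctx, const void *insns,
				    prog_fn fn);

/* Default target: an indirect call to the function passed in. */
unsigned int dispatch_nop(const void *ctx, const void *insns, prog_fn fn)
{
	return fn(ctx, insns);
}

/* Current target; stands in for the patched static call site. */
static dispatch_fn dispatch_target = dispatch_nop;

/* The dispatcher body is a single forwarding call to the target. */
unsigned int dispatch(const void *ctx, const void *insns, prog_fn fn)
{
	return dispatch_target(ctx, insns, fn);
}

/* Attach a new target, or restore the default when passed NULL. */
void dispatch_update(dispatch_fn target)
{
	dispatch_target = target ? target : dispatch_nop;
}

/* Example program, standing in for a BPF program. */
unsigned int example_prog(const void *ctx, const void *insns)
{
	(void)ctx;
	(void)insns;
	return 1;
}

/* Example attached target, standing in for a direct call into an image. */
unsigned int attached_target(const void *ctx, const void *insns, prog_fn fn)
{
	(void)ctx;
	(void)insns;
	(void)fn;
	return 2;
}
```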
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Björn Töpel <bjorn@kernel.org>
Tested-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Björn Töpel <bjorn@kernel.org>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lkml.kernel.org/r/Y1/oBlK0yFk5c/Im@hirez.programming.kicks-ass.net
Link: https://lore.kernel.org/bpf/20221103120647.796772565@infradead.org
Pull hardening fix from Kees Cook:
- Correctly report struct member size on memcpy overflow (Kees Cook)
* tag 'hardening-v6.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
fortify: Capture __bos() results in const temp vars
Pull EFI fixes from Ard Biesheuvel:
- A pair of tweaks to the EFI random seed code so that externally
provided versions of this config table are handled more robustly
- Another fix for the v6.0 EFI variable refactor that turned out to
break Apple machines which don't provide QueryVariableInfo()
- Add some guard rails to the EFI runtime service call wrapper so we
can recover from synchronous exceptions caused by firmware
* tag 'efi-fixes-for-v6.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi:
arm64: efi: Recover from synchronous exceptions occurring in firmware
efi: efivars: Fix variable writes with unsupported query_variable_store()
efi: random: Use 'ACPI reclaim' memory for random seed
efi: random: reduce seed size to 32 bytes
efi/tpm: Pass correct address to memblock_reserve
For !CONFIG_TRACING kernels, the kmalloc() implementation tries (in cases where
the allocation size is build-time constant) to save a function call, by
inlining kmalloc_trace() to a kmem_cache_alloc() call.
However since commit 6edf2576a6 ("mm/slub: enable debugging memory wasting of
kmalloc") this path now fails to pass the original request size to be
eventually recorded (for kmalloc caches with debugging enabled).
We could adjust the code to call __kmem_cache_alloc_node() as the
CONFIG_TRACING variant does, but that would as a result inline a call with 5
parameters, bloating the kmalloc() call sites. The cost of an extra function
call (to kmalloc_trace()) seems like a lesser evil.
It also appears that the !CONFIG_TRACING variant is incompatible with upcoming
hardening efforts [1] so it's easier if we just remove it now. Kernels with no
tracing are rare these days and the benefit is dubious anyway.
[1] https://lore.kernel.org/linux-mm/20221101222520.never.109-kees@kernel.org/T/#m20ecf14390e406247bde0ea9cce368f469c539ed
Link: https://lore.kernel.org/all/097d8fba-bd10-a312-24a3-a4068c4f424c@suse.cz/
Suggested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
The <linux/bcma/bcma_driver_chipcommon.h> is including the legacy
header <linux/gpio.h> to obtain struct gpio_chip. Instead, include
<linux/gpio/driver.h> where this struct is defined.
It turns out that the brcm80211 brcmsmac depends on this to
bring in the symbol gpio_is_valid().
The driver looks up the BCMA parent GPIO driver and checks that
this succeeds, but then it goes on to use the deprecated GPIO
call gpio_is_valid() to check the consistency of the .base
member of the BCMA GPIO struct. The whole check can be dropped
because the bcma_gpio is initialized in the declarations:
struct gpio_chip *bcma_gpio = &cc_drv->gpio;
And this can never be NULL.
Cc: Jonas Gorski <jonas.gorski@gmail.com>
Acked-by: Arend van Spriel <arend.vanspriel@broadcom.com>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Kalle Valo <kvalo@kernel.org>
Link: https://lore.kernel.org/r/20221028092332.238728-1-linus.walleij@linaro.org
Refactor map->off_arr handling into generic functions that can work on
their own without hardcoding map specific code. The btf_field_offs
structure is now returned from btf_parse_field_offs, which can be reused
later for types in program BTF.
All functions like copy_map_value, zero_map_value call generic
underlying functions so that they can also be reused later for copying
to values allocated in programs which encode specific fields.
Later, some helper functions will also require access to this
btf_field_offs structure to be able to skip over special fields at
runtime.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221103191013.1236066-9-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Now that kptr_off_tab has been refactored into btf_record, and can hold
more than one specific field type, accommodate bpf_spin_lock and
bpf_timer as well.
While they don't require any more metadata than offset, having all
special fields in one place allows us to share the same code for
allocated user defined types and handle both map values and these
allocated objects in a similar fashion.
As an optimization, we still keep spin_lock_off and timer_off offsets in
the btf_record structure, just to avoid having to find the btf_field
struct each time their offset is needed. This is mostly needed to
manipulate such objects in a map value at runtime. It's ok to hardcode
just one offset as more than one field is disallowed.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221103191013.1236066-8-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
To prepare the BPF verifier to handle special fields in both map values
and program allocated types coming from program BTF, we need to refactor
the kptr_off_tab handling code into something more generic and reusable
across both cases to avoid code duplication.
Later patches also require passing this data to helpers at runtime, so
that they can work on user defined types, initialize them, destruct
them, etc.
The main observation is that both map values and such allocated types
point to a type in program BTF, hence they can be handled similarly. We
can prepare a field metadata table for both cases and store them in
struct bpf_map or struct btf depending on the use case.
Hence, refactor the code into generic btf_record and btf_field member
structs. The btf_record represents the fields of a specific btf_type in
user BTF. The cnt indicates the number of special fields we successfully
recognized, and field_mask is a bitmask of fields that were found, to
enable quick determination of availability of a certain field.
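A stripped-down model of the record shows how cnt and field_mask relate. The types below are hypothetical simplifications (struct record, struct field, MAX_FIELDS and the FIELD_* values are not the kernel definitions).

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_FIELDS 8 /* assumed capacity for this sketch */

/* One bit per special field type so that types can be OR'd into a mask. */
enum field_type {
	FIELD_SPIN_LOCK = 1u << 0,
	FIELD_TIMER     = 1u << 1,
	FIELD_KPTR      = 1u << 2,
};

struct field {
	uint32_t offset;      /* byte offset of the field in the type */
	enum field_type type;
};

struct record {
	uint32_t cnt;         /* number of special fields recognised */
	uint32_t field_mask;  /* OR of the types of all recorded fields */
	struct field fields[MAX_FIELDS];
};

/* Record a special field; returns false if the table is full. */
bool record_add(struct record *rec, uint32_t offset, enum field_type type)
{
	if (rec->cnt >= MAX_FIELDS)
		return false;
	rec->fields[rec->cnt].offset = offset;
	rec->fields[rec->cnt].type = type;
	rec->cnt++;
	rec->field_mask |= (uint32_t)type;
	return true;
}

/* Quick availability check that avoids walking the fields array. */
bool record_has(const struct record *rec, enum field_type type)
{
	return (rec->field_mask & (uint32_t)type) != 0;
}
```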
Subsequently, refactor the rest of the code to work with these generic
types, remove assumptions about kptr and kptr_off_tab, rename variables
to more meaningful names, etc.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221103191013.1236066-7-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Expose devlink port handle related to netdev over rtnetlink. Introduce a
new nested IFLA attribute to carry the info. Call into devlink code to
fill-up the nest with existing devlink attributes that are used over
devlink netlink.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Remove ndo_get_devlink_port, which is no longer used, along with its
implementations in drivers.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
To avoid the need to take the RTNL mutex in the port_fill() function, make
use of the newly introduced infrastructure that tracks netdevice notifier
events.
Store the ifindex and ifname upon register and change name events.
Remove the rtnl_held bool propagated down to port_fill() function as it
is no longer needed.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Since devlink_port_type_eth_set() should no longer be called by any
driver with netdev pointer as it should rather use
SET_NETDEV_DEVLINK_PORT, remove the netdev arg. Add a warning to
type_clear().
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently, ethernet drivers are using devlink_port_type_eth_set() and
devlink_port_type_clear() to set devlink port type and link to related
netdev.
Instead of calling them directly, let the driver use
SET_NETDEV_DEVLINK_PORT macro to assign devlink_port pointer and let
devlink track it. Note the devlink port pointer is static during
the time netdevice is registered.
In devlink code, use per-namespace netdev notifier to track
the netdevices with devlink_port assigned and change the internal
devlink_port type and related type pointer accordingly.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Instead of storing type_dev as a void pointer, convert it to a union and
use it to store either struct net_device or struct ib_device pointer.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Hosts that support 802.1X authentication are able to authenticate
themselves by exchanging EAPOL frames with an authenticator (Ethernet
bridge, in this case) and an authentication server. Access to the
network is only granted by the authenticator to successfully
authenticated hosts.
The above is implemented in the bridge using the "locked" bridge port
option. When enabled, link-local frames (e.g., EAPOL) can be locally
received by the bridge, but all other frames are dropped unless the host
is authenticated. That is, unless the user space control plane installed
an FDB entry according to which the source address of the frame is
located behind the locked ingress port. The entry can be dynamic, in
which case learning needs to be enabled so that the entry will be
refreshed by incoming traffic.
There are deployments in which not all the devices connected to the
authenticator (the bridge) support 802.1X. Such devices can include
printers and cameras. One option to support such deployments is to
unlock the bridge ports connecting these devices, but a slightly more
secure option is to use MAB. When MAB is enabled, the MAC address of the
connected device is used as the user name and password for the
authentication.
For MAB to work, the user space control plane needs to be notified about
MAC addresses that are trying to gain access so that they will be
compared against an allow list. This can be implemented via the regular
learning process with the sole difference that learned FDB entries are
installed with a new "locked" flag indicating that the entry cannot be
used to authenticate the device. The flag cannot be set by user space,
but user space can clear the flag by replacing the entry, thereby
authenticating the device.
Locked FDB entries implement the following semantics with regard to
roaming, aging and forwarding:
1. Roaming: Locked FDB entries can roam to unlocked (authorized) ports,
in which case the "locked" flag is cleared. FDB entries cannot roam
to locked ports regardless of MAB being enabled or not. Therefore,
locked FDB entries are only created if an FDB entry with the given {MAC,
VID} does not already exist. This behavior prevents unauthenticated
devices from disrupting traffic destined to already authenticated
devices.
2. Aging: Locked FDB entries age and are refreshed by incoming traffic like
regular entries.
3. Forwarding: Locked FDB entries forward traffic like regular entries.
If user space detects an unauthorized MAC behind a locked port and
wishes to prevent traffic with this MAC DA from reaching the host, it
can do so using tc or a different mechanism.
Enable the above behavior using a new bridge port option called "mab".
It can only be enabled on a bridge port that is both locked and has
learning enabled. Locked FDB entries are flushed from the port once MAB
is disabled. A new option is added because there are pure 802.1X
deployments that are not interested in notifications about locked FDB
entries.
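The option constraints and the entry-creation rule can be summarised in a small decision function. This is a simplified model, not the bridge code: struct port_opts, mab_can_enable() and learn_decision() are hypothetical, and refreshing or roaming of existing entries is not modelled beyond the comments.

```c
#include <stdbool.h>

/* Hypothetical view of the relevant bridge port options. */
struct port_opts {
	bool locked;
	bool learning;
	bool mab;
};

enum learn_result {
	LEARN_NONE,   /* do not create an entry */
	LEARN_NORMAL, /* regular learning on an unlocked port */
	LEARN_LOCKED, /* create an entry carrying the "locked" flag */
};

/* "mab" may only be enabled on a locked port with learning enabled. */
bool mab_can_enable(const struct port_opts *p)
{
	return p->locked && p->learning;
}

/*
 * Decide what to do with a source {MAC, VID} seen on a port.
 * entry_exists says whether the FDB already holds that {MAC, VID}.
 */
enum learn_result learn_decision(const struct port_opts *p, bool entry_exists)
{
	if (!p->learning)
		return LEARN_NONE;
	if (!p->locked)
		return LEARN_NORMAL; /* includes roaming to this port */
	/* Locked port: only MAB creates entries, and never over an
	 * existing one, so entries cannot roam to a locked port. */
	if (p->mab && !entry_exists)
		return LEARN_LOCKED;
	return LEARN_NONE;
}
```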
Signed-off-by: Hans J. Schultz <netdev@kapio-technology.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Daniel Borkmann says:
====================
bpf 2022-11-04
We've added 8 non-merge commits during the last 3 day(s) which contain
a total of 10 files changed, 113 insertions(+), 16 deletions(-).
The main changes are:
1) Fix memory leak upon allocation failure in BPF verifier's stack state
tracking, from Kees Cook.
2) Fix address leakage when BPF progs release reference to an object,
from Youlin Li.
3) Fix BPF CI breakage from buggy in.h uapi header dependency,
from Andrii Nakryiko.
4) Fix bpftool pin sub-command's argument parsing, from Pu Lehui.
5) Fix BPF sockmap lockdep warning by cancelling psock work outside
of socket lock, from Cong Wang.
6) Follow-up for BPF sockmap to fix sk_forward_alloc accounting,
from Wang Yufen.
bpf-for-netdev
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
selftests/bpf: Add verifier test for release_reference()
bpf: Fix wrong reg type conversion in release_reference()
bpf, sock_map: Move cancel_work_sync() out of sock lock
tools/headers: Pull in stddef.h to uapi to fix BPF selftests build in CI
net/ipv4: Fix linux/in.h header dependencies
bpftool: Fix NULL pointer dereference when pin {PROG, MAP, LINK} without FILE
bpf, sockmap: Fix the sk->sk_forward_alloc warning of sk_stream_kill_queues
bpf, verifier: Fix memory leak in array reallocation for stack state
====================
Link: https://lore.kernel.org/r/20221104000445.30761-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pull networking fixes from Paolo Abeni:
"Including fixes from bluetooth and netfilter.
Current release - regressions:
- net: several zerocopy flags fixes
- netfilter: fix possible memory leak in nf_nat_init()
- openvswitch: add missing .resv_start_op
Previous releases - regressions:
- neigh: fix null-ptr-deref in neigh_table_clear()
- sched: fix use after free in red_enqueue()
- dsa: fall back to default tagger if we can't load the one from DT
- bluetooth: fix use-after-free in l2cap_conn_del()
Previous releases - always broken:
- netfilter: netlink notifier might race to release objects
- nfc: fix potential memory leak of skb
- bluetooth: fix use-after-free caused by l2cap_reassemble_sdu
- bluetooth: use skb_put to set length
- eth: tun: fix bugs for oversize packet when napi frags enabled
- eth: lan966x: fixes for when MTU is changed
- eth: dwmac-loongson: fix invalid mdio_node"
* tag 'net-6.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (53 commits)
vsock: fix possible infinite sleep in vsock_connectible_wait_data()
vsock: remove the unused 'wait' in vsock_connectible_recvmsg()
ipv6: fix WARNING in ip6_route_net_exit_late()
bridge: Fix flushing of dynamic FDB entries
net, neigh: Fix null-ptr-deref in neigh_table_clear()
net/smc: Fix possible leaked pernet namespace in smc_init()
stmmac: dwmac-loongson: fix invalid mdio_node
ibmvnic: Free rwi on reset success
net: mdio: fix undefined behavior in bit shift for __mdiobus_register
Bluetooth: L2CAP: Fix attempting to access uninitialized memory
Bluetooth: L2CAP: Fix l2cap_global_chan_by_psm
Bluetooth: L2CAP: Fix accepting connection request for invalid SPSM
Bluetooth: hci_conn: Fix not restoring ISO buffer count on disconnect
Bluetooth: L2CAP: Fix memory leak in vhci_write
Bluetooth: L2CAP: fix use-after-free in l2cap_conn_del()
Bluetooth: virtio_bt: Use skb_put to set length
Bluetooth: hci_conn: Fix CIS connection dst_type handling
Bluetooth: L2CAP: Fix use-after-free caused by l2cap_reassemble_sdu
netfilter: ipset: enforce documented limit to prevent allocating huge memory
isdn: mISDN: netjet: fix wrong check of device registration
...
Pull powerpc fixes from Michael Ellerman:
- Fix an endian thinko in the asm-generic compat_arg_u64() which led to
syscall arguments being swapped for some compat syscalls.
- Fix syscall wrapper handling of syscalls with 64-bit arguments on
32-bit kernels, which led to syscall arguments being misplaced.
- A build fix for amdgpu on Book3E with AltiVec disabled.
Thanks to Andreas Schwab, Christian Zigotzky, and Arnd Bergmann.
* tag 'powerpc-6.1-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/32: Select ARCH_SPLIT_ARG64
powerpc/32: fix syscall wrappers with 64-bit arguments
asm-generic: compat: fix compat_arg_u64() and compat_arg_u64_dual()
powerpc/64e: Fix amdgpu build on Book3E w/o AltiVec