S390 bpf_jit.S is removed in net-next and had changes in 'net',
since that code isn't used any more take the removal.
TLS data structures split the TX and RX components in 'net-next',
put the new struct members from the bug fix in 'net' into the RX
part.
The 'net-next' tree had some reworking of how the ERSPAN code works in
the GRE tunneling code, overlapping with a one-line headroom
calculation fix in 'net'.
Overlapping changes in __sock_map_ctx_update_elem(), keep the bits
that read the prog members via READ_ONCE() into local variables
before using them.
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
1) Fix refcounting bug for connections in on-packet scheduling mode of
IPVS, from Julian Anastasov.
2) Set network header properly in AF_PACKET's packet_snd, from Willem
de Bruijn.
3) Fix regressions in 3c59x by converting to generic DMA API. It was
relying upon the hack that the PCI DMA interfaces would accept NULL
for EISA devices. From Christoph Hellwig.
4) Remove RDMA devices before unregistering netdev in QEDE driver, from
Michal Kalderon.
5) Use after free in TUN driver ptr_ring usage, from Jason Wang.
6) Properly check for missing netlink attributes in SMC_PNETID
requests, from Eric Biggers.
7) Set DMA mask before performaing any DMA operations in vmxnet3
driver, from Regis Duchesne.
8) Fix mlx5 build with SMP=n, from Saeed Mahameed.
9) Classifier fixes in bcm_sf2 driver from Florian Fainelli.
10) Tuntap use after free during release, from Jason Wang.
11) Don't use stack memory in scatterlists in tls code, from Matt
Mullins.
12) Not fully initialized flow key object in ipv4 routing code, from
David Ahern.
13) Various packet headroom bug fixes in ip6_gre driver, from Petr
Machata.
14) Remove queues from XPS maps using correct index, from Amritha
Nambiar.
15) Fix use after free in sock_diag, from Eric Dumazet.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (64 commits)
net: ip6_gre: fix tunnel metadata device sharing.
cxgb4: fix offset in collecting TX rate limit info
net: sched: red: avoid hashing NULL child
sock_diag: fix use-after-free read in __sk_free
sh_eth: Change platform check to CONFIG_ARCH_RENESAS
net: dsa: Do not register devlink for unused ports
net: Fix a bug in removing queues from XPS map
bpf: fix truncated jump targets on heavy expansions
bpf: parse and verdict prog attach may race with bpf map update
bpf: sockmap update rollback on error can incorrectly dec prog refcnt
net: test tailroom before appending to linear skb
net: ip6_gre: Fix ip6erspan hlen calculation
net: ip6_gre: Split up ip6gre_changelink()
net: ip6_gre: Split up ip6gre_newlink()
net: ip6_gre: Split up ip6gre_tnl_change()
net: ip6_gre: Split up ip6gre_tnl_link_config()
net: ip6_gre: Fix headroom request in ip6erspan_tunnel_xmit()
net: ip6_gre: Request headroom in __gre6_xmit()
selftests/bpf: check return value of fopen in test_verifier.c
erspan: fix invalid erspan version.
...
The b53 driver already defines and internally uses platform data to let the
glue drivers specify parameters such as the chip id. What we were missing was
a way to tell the core DSA layer about the ports and their type.
Place a dsa_chip_data structure at the beginning of b53_platform_data for
dsa_register_switch() to access it. This does not require modifications to
b53_common.c which will pass platform_data trough.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add the size of the EEPROM to the platform data, so it can also be
instantiated by a platform device.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Not all the world uses device tree. Some parts of the world still use
platform devices and platform data. Add basic support for probing a
Marvell switch via platform data.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Before the patch, the erspan BSO bit (Bad/Short/Oversized) is not
handled. BSO has 4 possible values:
00 --> Good frame with no error, or unknown integrity
11 --> Payload is a Bad Frame with CRC or Alignment Error
01 --> Payload is a Short Frame
10 --> Payload is an Oversized Frame
Based the short/oversized definitions in RFC1757, the patch sets
the bso bit based on the mirrored packet's size.
Reported-by: Xiaoyan Jin <xiaoyanj@vmware.com>
Signed-off-by: William Tu <u9012063@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Johan Hedberg says:
====================
pull request: bluetooth-next 2018-05-18
Here's the first bluetooth-next pull request for the 4.18 kernel:
- Refactoring of the btbcm driver
- New USB IDs for QCA_ROME and LiteOn controllers
- Buffer overflow fix if the controller sends invalid advertising data length
- Various cleanups & fixes for Qualcomm controllers
Please let me know if there are any issues pulling. Thanks.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull locking fixes from Thomas Gleixner:
"Two fixes to address shortcomings of the rwsem/percpu-rwsem lock
debugging code which emits false positive warnings when the rwsem is
anonymously locked and unlocked"
* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
locking/percpu-rwsem: Annotate rwsem ownership transfer by setting RWSEM_OWNER_UNKNOWN
locking/rwsem: Add a new RWSEM_ANONYMOUSLY_OWNED flag
Pull EFI fixes from Thomas Gleixner:
- Use explicitely sized type for the romimage pointer in the 32bit EFI
protocol struct so a 64bit kernel does not expand it to 64bit. Ditto
for the 64bit struct to avoid the reverse issue on 32bit kernels.
- Handle randomized tex offset correctly in the ARM64 EFI stub to avoid
unaligned data resulting in stack corruption and other hard to
diagnose wreckage.
* 'efi-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
efi/libstub/arm64: Handle randomized TEXT_OFFSET
efi: Avoid potential crashes, fix the 'struct efi_pci_io_protocol_32' definition for mixed mode
Each driver implements physical port name generation by itself. However
as devlink has all needed info, it can easily do the job for all its
users. So implement this helper in devlink.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Devlink ports can have specific flavour according to the purpose of use.
This patch extend attrs_set so the driver can say which flavour port
has. Initial flavours are:
physical, cpu, dsa
User can query this to see right away what is the purpose of each port.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Change existing setter for split port information into more generic
attrs setter. Alongside with that, allow to set port number and subport
number for split ports.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Many places in drivers/ file systems, error was handled in a common way
like below:
ret = (ret == -ENOMEM) ? VM_FAULT_OOM : VM_FAULT_SIGBUS;
vmf_error() will replace this and return vm_fault_t type err.
A lot of drivers and filesystems currently have a rather complex mapping
of errno-to-VM_FAULT code. We have been able to eliminate a lot of it
by just returning VM_FAULT codes directly from functions which are
called exclusively from the fault handling path.
Some functions can be called both from the fault handler and other
context which are expecting an errno, so they have to continue to return
an errno. Some users still need to choose different behaviour for
different errnos, but vmf_error() captures the essential error
translation that's common to all users, and those that need to handle
additional errors can handle them first.
Link: http://lkml.kernel.org/r/20180510174826.GA14268@jordon-HP-15-Notebook-PC
Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Saeed Mahameed says:
====================
mlx5-updates-2018-05-17
mlx5 core dirver updates for both net-next and rdma-next branches.
From Christophe JAILLET, first three patche to use kvfree where needed.
From: Or Gerlitz <ogerlitz@mellanox.com>
Next six patches from Roi and Co adds support for merged
sriov e-switch which comes to serve cases where both PFs, VFs set
on them and both uplinks are to be used in single v-switch SW model.
When merged e-switch is supported, the per-port e-switch is logically
merged into one e-switch that spans both physical ports and all the VFs.
This model allows to offload TC eswitch rules between VFs belonging
to different PFs (and hence have different eswitch affinity), it also
sets the some of the foundations needed for uplink LAG support.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull mtd fixes from Boris Brezillon:
"NAND fixes:
- Fix read path of the Marvell NAND driver
- Make sure we don't pass a u64 to ndelay()
CFI fix:
- Fix the map_word_andequal() implementation"
* tag 'mtd/fixes-for-4.17-rc6' of git://git.infradead.org/linux-mtd:
mtd: rawnand: Fix return type of __DIVIDE() when called with 32-bit
mtd: rawnand: marvell: Fix read logic for layouts with ->nchunks > 2
mtd: Fix comparison in map_word_andequal()
This per netns sysctl allows for TCP SACK compression fine-tuning.
This limits number of SACK that can be compressed.
Using 0 disables SACK compression.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This per netns sysctl allows for TCP SACK compression fine-tuning.
Its default value is 1,000,000, or 1 ms to meet TSO autosizing period.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This counter tracks number of ACK packets that the host has not sent,
thanks to ACK compression.
Sample output :
$ nstat -n;sleep 1;nstat|egrep "IpInReceives|IpOutRequests|TcpInSegs|TcpOutSegs|TcpExtTCPAckCompressed"
IpInReceives 123250 0.0
IpOutRequests 3684 0.0
TcpInSegs 123251 0.0
TcpOutSegs 3684 0.0
TcpExtTCPAckCompressed 119252 0.0
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When TCP receives an out-of-order packet, it immediately sends
a SACK packet, generating network load but also forcing the
receiver to send 1-MSS pathological packets, increasing its
RTX queue length/depth, and thus processing time.
Wifi networks suffer from this aggressive behavior, but generally
speaking, all these SACK packets add fuel to the fire when networks
are under congestion.
This patch adds a high resolution timer and tp->compressed_ack counter.
Instead of sending a SACK, we program this timer with a small delay,
based on RTT and capped to 1 ms :
delay = min ( 5 % of RTT, 1 ms)
If subsequent SACKs need to be sent while the timer has not yet
expired, we simply increment tp->compressed_ack.
When timer expires, a SACK is sent with the latest information.
Whenever an ACK is sent (if data is sent, or if in-order
data is received) timer is canceled.
Note that tcp_sack_new_ofo_skb() is able to force a SACK to be sent
if the sack blocks need to be shuffled, even if the timer has not
expired.
A new SNMP counter is added in the following patch.
Two other patches add sysctls to allow changing the 1,000,000 and 44
values that this commit hard-coded.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Toke Høiland-Jørgensen <toke@toke.dk>
Signed-off-by: David S. Miller <davem@davemloft.net>
This function allows to send a HCI command without expecting any
controller event/response in return. This is allowed for vendor-
specific commands only.
Signed-off-by: Loic Poulain <loic.poulain@linaro.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
The source e-switch owner allows a vport on one e-switch port be associated
with a rule defined on the second port e-switch.
The role of the source eswitch owner valid bit in the flow group is to
allow the firmware fail driver attempts to wild card the source eswitch
match field. If this bit is not set, the firmware ignores the source
eswitch owner field totally.
Signed-off-by: Shahar Klein <shahark@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
The destination e-switch owner allows a rule in namespace of one e-switch
owner to point to a vport that is natively associated with another
e-switch owner.
Signed-off-by: Shahar Klein <shahark@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
When merged e-switch is supported, the per-port e-switch is logically
merged into one e-switch that spans both physical ports and all the VFs.
Under merged eswitch, both the matching on source vport and setting
destination vport can have a 2nd attribute which is the vhca id of the
eswitch owner.
For example:
esw0: {match: <src vport=1 owner=0> action: fwd to <dst vport=7, owner=1>}
is a flow set on eswitch0 matching on source vport=1 from his eswitch
and the action being fwd to dest vport=7 of eswitch1.
Signed-off-by: Roi Dayan <roid@mellanox.com>
Reviewed-by: Shahar Klein <shahark@mellanox.com>
Reviewed-by: Or Gerlitz Klein <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
This patch adds one more generic PHY mode to the phy_mode enum, to allow
configuring generic PHYs to the 2.5G SGMII mode by using the set_mode
callback.
Signed-off-by: Antoine Tenart <antoine.tenart@bootlin.com>
Acked-by: Kishon Vijay Abraham I <kishon@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Kalle Valo says:
====================
wireless-drivers-next patches for 4.18
The first pull request for 4.18. As usual new features and bug fixes
but nothing really special.
I also merged wireless-drivers due to an iwlwifi patch dependency.
Major changes:
iwlwifi
* implement Traffic Condition Monitor and use it for scan, BT coex and
to detect when the AP doesn't support UAPSD properly
* some more work for the 22000 family of devices;
* introduce AMSDU rate control offload
qtnfmac
* DFS offload support
rsi
* roaming enhancements
* increase max supported aggregation subframes
* don't advertise 5 GHz support if the device doesn't support it
brcmfmac
* add support for BCM4366E chipset
* add support for bcm43364 wireless chipset
ath10k
* enable temperature reads for QCA6174 and QCA9377
* add firmware memory dump support for QCA9984
* continue adding WCN3990 support via SNOC bus
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The previous approach for the lost and retransmit bits was to
wipe the slate clean: zero all the lost and retransmit bits,
correspondingly zero the lost_out and retrans_out counters, and
then add back the lost bits (and correspondingly increment lost_out).
The new approach is to treat this very much like marking packets
lost in fast recovery. We don’t wipe the slate clean. We just say
that for all packets that were not yet marked sacked or lost, we now
mark them as lost in exactly the same way we do for fast recovery.
This fixes the lost retransmit accounting at RTO time and greatly
simplifies the RTO code by sharing much of the logic with Fast
Recovery.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Priyaranjan Jha <priyarjha@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is a rewrite of NewReno loss recovery implementation that is
simpler and standalone for readability and better performance by
using less states.
Note that NewReno refers to RFC6582 as a modification to the fast
recovery algorithm. It is used only if the connection does not
support SACK in Linux. It should not to be confused with the Reno
(AIMD) congestion control.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Priyaranjan Jha <priyarjha@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds support for the classic DUPACK threshold rule
(#DupThresh) in RACK.
When the number of packets SACKed is greater or equal to the
threshold, RACK sets the reordering window to zero which would
immediately mark all the unsacked packets below the highest SACKed
sequence lost. Since this approach is known to not work well with
reordering, RACK only uses it if no reordering has been observed.
The DUPACK threshold rule is a particularly useful extension to the
fast recoveries triggered by RACK reordering timer. For example
data-center transfers where the RTT is much smaller than a timer
tick, or high RTT path where the default RTT/4 may take too long.
Note that this patch differs slightly from RFC6675. RFC6675
considers a packet lost when at least #DupThresh higher-sequence
packets are SACKed.
With RACK, for connections that have seen reordering, RACK
continues to use a dynamically-adaptive time-based reordering
window to detect losses. But for connections on which we have not
yet seen reordering, this patch considers a packet lost when at
least one higher sequence packet is SACKed and the total number
of SACKed packets is at least DupThresh. For example, suppose a
connection has not seen reordering, and sends 10 packets, and
packets 3, 5, 7 are SACKed. RFC6675 considers packets 1 and 2
lost. RACK considers packets 1, 2, 4, 6 lost.
There is some small risk of spurious retransmits here due to
reordering. However, this is mostly limited to the first flight of
a connection on which the sender receives SACKs from reordering.
And RFC 6675 and FACK loss detection have a similar risk on the
first flight with reordering (it's just that the risk of spurious
retransmits from reordering was slightly narrower for those older
algorithms due to the margin of 3*MSS).
Also the minimum reordering window is reduced from 1 msec to 0
to recover quicker on short RTT transfers. Therefore RACK is more
aggressive in marking packets lost during recovery to reduce the
reordering window timeouts.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Priyaranjan Jha <priyarjha@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
scatterlist code expects virt_to_page() to work, which fails with
CONFIG_VMAP_STACK=y.
Fixes: c46234ebb4 ("tls: RX path for ktls")
Signed-off-by: Matt Mullins <mmullins@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull kvm fixes from Paolo Bonzini:
- ARM/ARM64 locking fixes
- x86 fixes: PCID, UMIP, locking
- improved support for recent Windows version that have a 2048 Hz APIC
timer
- rename KVM_HINTS_DEDICATED CPUID bit to KVM_HINTS_REALTIME
- better behaved selftests
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
kvm: rename KVM_HINTS_DEDICATED to KVM_HINTS_REALTIME
KVM: arm/arm64: VGIC/ITS save/restore: protect kvm_read_guest() calls
KVM: arm/arm64: VGIC/ITS: protect kvm_read_guest() calls with SRCU lock
KVM: arm/arm64: VGIC/ITS: Promote irq_lock() in update_affinity
KVM: arm/arm64: Properly protect VGIC locks from IRQs
KVM: X86: Lower the default timer frequency limit to 200us
KVM: vmx: update sec exec controls for UMIP iff emulating UMIP
kvm: x86: Suppress CR3_PCID_INVD bit only when PCIDs are enabled
KVM: selftests: exit with 0 status code when tests cannot be run
KVM: hyperv: idr_find needs RCU protection
x86: Delay skip of emulated hypercall instruction
KVM: Extend MAX_IRQ_ROUTES to 4096 for all archs
After the previous patch, for NOLOCK qdiscs, q->seqlock is
always held when the dequeue() is invoked, we can drop
any additional locking to protect such operation.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
So that we can use lockdep on it.
The newly introduced sequence lock has the same scope of busylock,
so it shares the same lockdep annotation, but it's only used for
NOLOCK qdiscs.
With this changeset we acquire such lock in the control path around
flushing operation (qdisc reset), to allow more NOLOCK qdisc perf
improvement in the next patch.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
proc_pid_cmdline_read() and environ_read() directly access the target
process' VM to retrieve the command line and environment. If this
process remaps these areas onto a file via mmap(), the requesting
process may experience various issues such as extra delays if the
underlying device is slow to respond.
Let's simply refuse to access file-backed areas in these functions.
For this we add a new FOLL_ANON gup flag that is passed to all calls
to access_remote_vm(). The code already takes care of such failures
(including unmapped areas). Accesses via /proc/pid/mem were not
changed though.
This was assigned CVE-2018-1120.
Note for stable backports: the patch may apply to kernels prior to 4.11
but silently miss one location; it must be checked that no call to
access_remote_vm() keeps zero as the last argument.
Reported-by: Qualys Security Advisory <qsa@qualys.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Daniel Borkmann says:
====================
pull-request: bpf-next 2018-05-17
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) Provide a new BPF helper for doing a FIB and neighbor lookup
in the kernel tables from an XDP or tc BPF program. The helper
provides a fast-path for forwarding packets. The API supports
IPv4, IPv6 and MPLS protocols, but currently IPv4 and IPv6 are
implemented in this initial work, from David (Ahern).
2) Just a tiny diff but huge feature enabled for nfp driver by
extending the BPF offload beyond a pure host processing offload.
Offloaded XDP programs are allowed to set the RX queue index and
thus opening the door for defining a fully programmable RSS/n-tuple
filter replacement. Once BPF decided on a queue already, the device
data-path will skip the conventional RSS processing completely,
from Jakub.
3) The original sockmap implementation was array based similar to
devmap. However unlike devmap where an ifindex has a 1:1 mapping
into the map there are use cases with sockets that need to be
referenced using longer keys. Hence, sockhash map is added reusing
as much of the sockmap code as possible, from John.
4) Introduce BTF ID. The ID is allocatd through an IDR similar as
with BPF maps and progs. It also makes BTF accessible to user
space via BPF_BTF_GET_FD_BY_ID and adds exposure of the BTF data
through BPF_OBJ_GET_INFO_BY_FD, from Martin.
5) Enable BPF stackmap with build_id also in NMI context. Due to the
up_read() of current->mm->mmap_sem build_id cannot be parsed.
This work defers the up_read() via a per-cpu irq_work so that
at least limited support can be enabled, from Song.
6) Various BPF JIT follow-up cleanups and fixups after the LD_ABS/LD_IND
JIT conversion as well as implementation of an optimized 32/64 bit
immediate load in the arm64 JIT that allows to reduce the number of
emitted instructions; in case of tested real-world programs they
were shrinking by three percent, from Daniel.
7) Add ifindex parameter to the libbpf loader in order to enable
BPF offload support. Right now only iproute2 can load offloaded
BPF and this will also enable libbpf for direct integration into
other applications, from David (Beckett).
8) Convert the plain text documentation under Documentation/bpf/ into
RST format since this is the appropriate standard the kernel is
moving to for all documentation. Also add an overview README.rst,
from Jesper.
9) Add __printf verification attribute to the bpf_verifier_vlog()
helper. Though it uses va_list we can still allow gcc to check
the format string, from Mathieu.
10) Fix a bash reference in the BPF selftest's Makefile. The '|& ...'
is a bash 4.0+ feature which is not guaranteed to be available
when calling out to shell, therefore use a more portable variant,
from Joe.
11) Fix a 64 bit division in xdp_umem_reg() by using div_u64()
instead of relying on the gcc built-in, from Björn.
12) Fix a sock hashmap kmalloc warning reported by syzbot when an
overly large key size is used in hashmap then causing overflows
in htab->elem_size. Reject bogus attr->key_size early in the
sock_hash_alloc(), from Yonghong.
13) Ensure in BPF selftests when urandom_read is being linked that
--build-id is always enabled so that test_stacktrace_build_id[_nmi]
won't be failing, from Alexei.
14) Add bitsperlong.h as well as errno.h uapi headers into the tools
header infrastructure which point to one of the arch specific
uapi headers. This was needed in order to fix a build error on
some systems for the BPF selftests, from Sirio.
15) Allow for short options to be used in the xdp_monitor BPF sample
code. And also a bpf.h tools uapi header sync in order to fix a
selftest build failure. Both from Prashant.
16) More formally clarify the meaning of ID in the direct packet access
section of the BPF documentation, from Wang.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull tracing fix from Steven Rostedt:
"Some of the ftrace internal events use a zero for a data size of a
field event. This is increasingly important for the histogram trigger
work that is being extended.
While auditing trace events, I found that a couple of the xen events
were used as just marking that a function was called, by creating a
static array of size zero. This can play havoc with the tracing
features if these events are used, because a zero size of a static
array is denoted as a special nul terminated dynamic array (this is
what the trace_marker code uses). But since the xen events have no
size, they are not nul terminated, and unexpected results may occur.
As trace events were never intended on being a marker to denote that a
function was hit or not, especially since function tracing and kprobes
can trivially do the same, the best course of action is to simply
remove these events"
* tag 'trace-v4.17-rc4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing/x86/xen: Remove zero data size trace events trace_xen_mmu_flush_tlb{_all}
__printf is useful to verify format and arguments. ‘bpf_verifier_vlog’
function is used twice in verifier.c in both cases the caller function
already uses the __printf gcc attribute.
Remove the following warning, triggered with W=1:
kernel/bpf/verifier.c:176:2: warning: function might be possible candidate for ‘gnu_printf’ format attribute [-Wsuggest-attribute=format]
Signed-off-by: Mathieu Malaterre <malat@debian.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Currently NOLOCK qdiscs pay a measurable overhead to atomically
manipulate the __QDISC_STATE_RUNNING. Such bit is flipped twice per
packet in the uncontended scenario with packet rate below the
line rate: on packed dequeue and on the next, failing dequeue attempt.
This changeset moves the bit manipulation into the qdisc_run_{begin,end}
helpers, so that the bit is now flipped only once per packet, with
measurable performance improvement in the uncontended scenario.
This also allows simplifying the qdisc teardown code path - since
qdisc_is_running() is now effective for each qdisc type - and avoid a
possible race between qdisc_run() and dev_deactivate_many(), as now
the some_qdisc_is_busy() can properly detect NOLOCK qdiscs being busy
dequeuing packets.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Avoid using the kernel's irq_descriptor and return IRQ vector affinity
directly from the driver.
This fixes the following build break when CONFIG_SMP=n
include/linux/mlx5/driver.h: In function ‘mlx5_get_vector_affinity_hint’:
include/linux/mlx5/driver.h:1299:13: error:
‘struct irq_desc’ has no member named ‘affinity_hint’
Fixes: 6082d9c9c9 ("net/mlx5: Fix mlx5_get_vector_affinity function")
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
CC: Randy Dunlap <rdunlap@infradead.org>
CC: Guenter Roeck <linux@roeck-us.net>
CC: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Israel Rukshin <israelr@mellanox.com>
Reported-by: kbuild test robot <lkp@intel.com>
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Tested-by: Randy Dunlap <rdunlap@infradead.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
The rx load balancing provided by balance-alb is not mutually
exclusive with using hashing for tx selection, and should provide a decent
speed increase because this eliminates spinlocks and cache contention.
Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The filesystem freezing code needs to transfer ownership of a rwsem
embedded in a percpu-rwsem from the task that does the freezing to
another one that does the thawing by calling percpu_rwsem_release()
after freezing and percpu_rwsem_acquire() before thawing.
However, the new rwsem debug code runs afoul with this scheme by warning
that the task that releases the rwsem isn't the one that acquires it,
as reported by Amir Goldstein:
DEBUG_LOCKS_WARN_ON(sem->owner != get_current())
WARNING: CPU: 1 PID: 1401 at /home/amir/build/src/linux/kernel/locking/rwsem.c:133 up_write+0x59/0x79
Call Trace:
percpu_up_write+0x1f/0x28
thaw_super_locked+0xdf/0x120
do_vfs_ioctl+0x270/0x5f1
ksys_ioctl+0x52/0x71
__x64_sys_ioctl+0x16/0x19
do_syscall_64+0x5d/0x167
entry_SYSCALL_64_after_hwframe+0x49/0xbe
To work properly with the rwsem debug code, we need to annotate that the
rwsem ownership is unknown during the tranfer period until a brave soul
comes forward to acquire the ownership. During that period, optimistic
spinning will be disabled.
Reported-by: Amir Goldstein <amir73il@gmail.com>
Tested-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Waiman Long <longman@redhat.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Jan Kara <jack@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Theodore Y. Ts'o <tytso@mit.edu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Cc: linux-fsdevel@vger.kernel.org
Link: http://lkml.kernel.org/r/1526420991-21213-3-git-send-email-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Sockmap is currently backed by an array and enforces keys to be
four bytes. This works well for many use cases and was originally
modeled after devmap which also uses four bytes keys. However,
this has become limiting in larger use cases where a hash would
be more appropriate. For example users may want to use the 5-tuple
of the socket as the lookup key.
To support this add hash support.
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Pull AFS fixes from David Howells:
"Here's a set of patches that fix a number of bugs in the in-kernel AFS
client, including:
- Fix directory locking to not use individual page locks for
directory reading/scanning but rather to use a semaphore on the
afs_vnode struct as the directory contents must be read in a single
blob and data from different reads must not be mixed as the entire
contents may be shuffled about between reads.
- Fix address list parsing to handle port specifiers correctly.
- Only give up callback records on a server if we actually talked to
that server (we might not be able to access a server).
- Fix some callback handling bugs, including refcounting,
whole-volume callbacks and when callbacks actually get broken in
response to a CB.CallBack op.
- Fix some server/address rotation bugs, including giving up if we
can't probe a server; giving up if a server says it doesn't have a
volume, but there are more servers to try.
- Fix the decoding of fetched statuses to be OpenAFS compatible.
- Fix the handling of server lookups in Cache Manager ops (such as
CB.InitCallBackState3) to use a UUID if possible and to handle no
server being found.
- Fix a bug in server lookup where not all addresses are compared.
- Fix the non-encryption of calls that prevents some servers from
being accessed (this also requires an AF_RXRPC patch that has
already gone in through the net tree).
There's also a patch that adds tracepoints to log Cache Manager ops
that don't find a matching server, either by UUID or by address"
* tag 'afs-fixes-20180514' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
afs: Fix the non-encryption of calls
afs: Fix CB.CallBack handling
afs: Fix whole-volume callback handling
afs: Fix afs_find_server search loop
afs: Fix the handling of an unfound server in CM operations
afs: Add a tracepoint to record callbacks from unlisted servers
afs: Fix the handling of CB.InitCallBackState3 to find the server by UUID
afs: Fix VNOVOL handling in address rotation
afs: Fix AFSFetchStatus decoder to provide OpenAFS compatibility
afs: Fix server rotation's handling of fileserver probe failure
afs: Fix refcounting in callback registration
afs: Fix giving up callbacks on server destruction
afs: Fix address list parsing
afs: Fix directory page locking
This patch only refactors the existing sockmap code. This will allow
much of the psock initialization code path and bpf helper codes to
work for both sockmap bpf map types that are backed by an array, the
currently supported type, and the new hash backed bpf map type
sockhash.
Most the fallout comes from three changes,
- Pushing bpf programs into an independent structure so we
can use it from the htab struct in the next patch.
- Generalizing helpers to use void *key instead of the hardcoded
u32.
- Instead of passing map/key through the metadata we now do
the lookup inline. This avoids storing the key in the metadata
which will be useful when keys can be longer than 4 bytes. We
rename the sk pointers to sk_redir at this point as well to
avoid any confusion between the current sk pointer and the
redirect pointer sk_redir.
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
The __DIVIDE() macro checks whether it is called with a 32-bit or 64-bit
dividend, to select the appropriate divide-and-round-up routine.
As the check uses the ternary operator, the result will always be
promoted to a type that can hold both results, i.e. unsigned long long.
When using this result in a division on a 32-bit system, this may lead
to link errors like:
ERROR: "__udivdi3" [drivers/mtd/nand/raw/nand.ko] undefined!
Fix this by casting the result of the division to the type of the
dividend.
Fixes: 8878b126df ("mtd: nand: add ->exec_op() implementation")
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Boris Brezillon <boris.brezillon@bootlin.com>
Doing an audit of trace events, I discovered two trace events in the xen
subsystem that use a hack to create zero data size trace events. This is not
what trace events are for. Trace events add memory footprint overhead, and
if all you need to do is see if a function is hit or not, simply make that
function noinline and use function tracer filtering.
Worse yet, the hack used was:
__array(char, x, 0)
Which creates a static string of zero in length. There's assumptions about
such constructs in ftrace that this is a dynamic string that is nul
terminated. This is not the case with these tracepoints and can cause
problems in various parts of ftrace.
Nuke the trace events!
Link: http://lkml.kernel.org/r/20180509144605.5a220327@gandalf.local.home
Cc: stable@vger.kernel.org
Fixes: 95a7d76897 ("xen/mmu: Use Xen specific TLB flush instead of the generic one.")
Reviewed-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Currently, when the rule is not to be exclusively executed by the
hardware, extack is not passed along and offloading failures don't
get logged. The idea was that hardware failures are okay because the
rule will get executed in software then and this way it doesn't confuse
unware users.
But this is not helpful in case one needs to understand why a certain
rule failed to get offloaded. Considering it may have been a temporary
failure, like resources exceeded or so, reproducing it later and knowing
that it is triggering the same reason may be challenging.
The ultimate goal is to improve Open vSwitch debuggability when using
flower offloading.
This patch adds a new flag to enable verbose logging. With the flag set,
extack will be passed to the driver, which will be able to log the
error. As the operation itself probably won't fail (not because of this,
at least), current iproute will already log it as a Warning.
The flag is generic, so it can be reused later. No need to restrict it
just for HW offloading. The command line will follow the syntax that
tc-ebpf already uses, tc ... [ verbose ] ... , and extend its meaning.
For example:
# ./tc qdisc add dev p7p1 ingress
# ./tc filter add dev p7p1 parent ffff: protocol ip prio 1 \
flower verbose \
src_mac ed:13:db:00:00:00 dst_mac 01:80:c2:00:00:d0 \
src_ip 56.0.0.0 dst_ip 55.0.0.0 action drop
Warning: TC offload is disabled on net device.
# echo $?
0
# ./tc filter add dev p7p1 parent ffff: protocol ip prio 1 \
flower \
src_mac ff:13:db:00:00:00 dst_mac 01:80:c2:00:00:d0 \
src_ip 56.0.0.0 dst_ip 55.0.0.0 action drop
# echo $?
0
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jeff Kirsher says:
====================
40GbE Intel Wired LAN Driver Updates 2018-05-14
This series contains updates to virtchnl, i40e and i40evf.
Bruce cleans up whitespace and unnecessary parentheses in virtchnl.
Jake does a number of stat cleanups in the i40e driver, including
cleanup of code indentation, whitespace issues, remove duplicate stats,
fix grammar in code comment and general spring cleaning of the
statistics code.
Patryk fixes an issue where we recalculate vectors left and vectors
wanted but do not take into account the reduced number of queue pairs
per VSI.
Harshitha adds tx_busy stat to ethtool stats to track the number of
times we return NETDEV_TX_BUSY to the stack during transmit.
Paweł fixes a potential system crash when unloading the VF driver after
a hardware reset.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The sequence of actions done by device drivers to append their device
specific hardware/firmware logs to /proc/vmcore are as follows:
1. During probe (before hardware is initialized), device drivers
register to the vmcore module (via vmcore_add_device_dump()), with
callback function, along with buffer size and log name needed for
firmware/hardware log collection.
2. vmcore module allocates the buffer with requested size. It adds
an Elf note and invokes the device driver's registered callback
function.
3. Device driver collects all hardware/firmware logs into the buffer
and returns control back to vmcore module.
Ensure that the device dump buffer size is always aligned to page size
so that it can be mmaped.
Also, rename alloc_elfnotes_buf() to vmcore_alloc_buf() to make it more
generic and reserve NT_VMCOREDD note type to indicate vmcore device
dump.
Suggested-by: Eric Biederman <ebiederm@xmission.com>.
Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com>
Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>