Commit Graph

53519 Commits

Author SHA1 Message Date
David S. Miller
7d982567f4 Merge tag 'mlx5-updates-2017-01-19' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
Saeed Mahameed says:

====================
mlx5 and mlx5e updates 2017-01-19

This series includes some updates for mlx5 core and mlx5e netdevice driver.

From Leon, a small fix that remove an unnecessary print.

From Eli Cohen, a fix to the FW version printout in case of internal error.

From Eugenia Emantayev, two patches, the 1st adds mlx5 1pps (pulse per
second) mlx5 infrastructure support and the 2nd adds the necessary bits
for mlx5e ptp logic and structures.

From Mohamad, add support for s-tagged packet receive when in promiscuous
mode.

Form Gal Pressman, MCAM (Management capabilities mask register) and PCAM
(Ports capabilities mask register) registers infrastructure, those
registers are needed in order to query the different statistics registers
support in FW, in order for the driver to enable/disable query and
reporting them back to user.  On top of this infrastructure we've exposed
new set of statistics groups:
   - MPCNT: Physical layer statistical counters (For symbol errors)
   - PPCNT: PCIe performance counters

In addition to the statistics capabilities series we've moved the mlx5 HCA
capabilities fields to a dedicated struct under the driver private data.

At the end a small patch to update & query statistics in the most desired
order.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-20 14:22:27 -05:00
Phil Sutter
9af15c3825 device: Implement a bus agnostic dev_num_vf routine
Now that pci_bus_type has num_vf callback set, dev_num_vf can be
implemented in a bus type independent way and the check for whether a
PCI device is being handled in rtnetlink can be dropped.

Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-20 11:43:17 -05:00
Phil Sutter
582a686f52 device: bus_type: Introduce num_vf callback
This allows for bus types to implement their own method of retrieving
the number of virtual functions a NIC on that type of bus supports.

Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-20 11:43:16 -05:00
Gal Pressman
701052c578 net/mlx5: Move cached hca caps to designated caps struct
The caps structure consists of hca caps and port/management caps,
all under one roof.

Signed-off-by: Gal Pressman <galp@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-01-19 23:20:03 +02:00
Gal Pressman
8ed1a6306d net/mlx5: Add MPCNT register infrastructure
Add the needed infrastructure for future use of MPCNT register.

Signed-off-by: Gal Pressman <galp@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-01-19 23:20:01 +02:00
Gal Pressman
d8dc0508c5 net/mlx5: Add PPCNT physical layer statistical group infrastructure
Add the needed infrastructure for future use of PPCNT physical layer
statistical group.

Signed-off-by: Gal Pressman <galp@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2017-01-19 23:20:00 +02:00
Gal Pressman
71862561f3 net/mlx5: Query and cache PCAM, MCAM registers on initialization
On load_one, we now cache our capabilities registers internally, similar
to QUERY_HCA_CAP. Capabilities can later be queried using macros
introduced in this patch.

Signed-off-by: Gal Pressman <galp@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-01-19 23:19:59 +02:00
Gal Pressman
cfdcbceaef net/mlx5: Expose PCAM, MCAM registers infrastructure
PCAM: Ports capabilities mask register.
MCAM: Management capabilities mask register.

PCAM and MCAM registers will provide information regarding firmware
support for different features, in order to avoid cases where new driver
combined with old firmware results in syndromes (for ex. PCIe counters
before this patchset).

Signed-off-by: Gal Pressman <galp@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-01-19 23:19:57 +02:00
Mohamad Haj Yahia
105433659d net/mlx5: Add support to s-tag in mlx5 firmware interface
Add svlan_tag and rename vlan_tag to cvlan_tag in flow table entry
match param.

Signed-off-by: Mohamad Haj Yahia <mohamad@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
2017-01-19 23:19:55 +02:00
Eugenia Emantayev
f9a1ef720e net/mlx5: Add MTPPS and MTPPSE registers infrastructure
Implement query and set functionality for MTPPS and MTPPSE registers.
MTPPS (Management Pulse Per Second) provides the device PPS capabilities,
configures the PPS in and out modules and holds the PPS in time stamp.
Query MTPPS is supported only when HCA_CAP.pps is set and modify is supported
when HCA_CAP.pps_modify is set.

MTPPSE (Management Pulse Per Second Event) configures the different event
generation modes for PPS. Supported when HCA_CAP.pps is set.

Signed-off-by: Eugenia Emantayev <eugenia@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-01-19 23:19:53 +02:00
Volodymyr Bendiuga
4567d686f5 phy: increase size of MII_BUS_ID_SIZE and bus_id
Some bus names are pretty long and do not fit into
17 chars. Increase therefore MII_BUS_ID_SIZE and
phy_fixup.bus_id to larger number. Now mii_bus.id
can host larger name.

Signed-off-by: Volodymyr Bendiuga <volodymyr.bendiuga@gmail.com>
Signed-off-by: Magnus Öberg <magnus.oberg@westermo.se>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-19 11:46:20 -05:00
Tobias Klauser
4a7c972644 net: Remove usage of net_device last_rx member
The network stack no longer uses the last_rx member of struct net_device
since the bonding driver switched to use its own private last_rx in
commit 9f24273837 ("bonding: use last_arp_rx in slave_last_rx()").

However, some drivers still (ab)use the field for their own purposes and
some driver just update it without actually using it.

Previously, there was an accompanying comment for the last_rx member
added in commit 4dc89133f4 ("net: add a comment on netdev->last_rx")
which asked drivers not to update is, unless really needed. However,
this commend was removed in commit f8ff080dac ("bonding: remove
useless updating of slave->dev->last_rx"), so some drivers added later
on still did update last_rx.

Remove all usage of last_rx and switch three drivers (sky2, atp and
smc91c92_cs) which actually read and write it to use their own private
copy in netdev_priv.

Compile-tested with allyesconfig and allmodconfig on x86 and arm.

Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Jay Vosburgh <j.vosburgh@gmail.com>
Cc: Veaceslav Falico <vfalico@gmail.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Cc: Mirko Lindner <mlindner@marvell.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Acked-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-18 17:22:49 -05:00
David Ahern
fd61c6ba31 net: ipv6: remove nowait arg to rt6_fill_node
All callers of rt6_fill_node pass 0 for nowait arg. Remove the arg and
simplify rt6_fill_node accordingly.

rt6_fill_node passes the nowait of 0 to ip6mr_get_route. Remove the
nowait arg from it as well.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-18 15:43:59 -05:00
Xin Long
cc16f00f65 sctp: add support for generating stream reconf ssn reset request chunk
This patch is to add asoc strreset_outseq and strreset_inseq for
saving the reconf request sequence, initialize them when create
assoc and process init, and also to define Incoming and Outgoing
SSN Reset Request Parameter described in rfc6525 section 4.1 and
4.2, As they can be in one same chunk as section rfc6525 3.1-3
describes, it makes them in one function.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-18 14:55:09 -05:00
David S. Miller
580bdf5650 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2017-01-17 15:19:37 -05:00
Rolf Neugebauer
501db51139 virtio: don't set VIRTIO_NET_HDR_F_DATA_VALID on xmit
This patch part reverts fd2a0437dc and e858fae2b0 which introduced a
subtle change in how the virtio_net flags are derived from the SKBs
ip_summed field.

With the above commits, the flags are set to VIRTIO_NET_HDR_F_DATA_VALID
when ip_summed == CHECKSUM_UNNECESSARY, thus treating it differently to
ip_summed == CHECKSUM_NONE, which should be the same.

Further, the virtio spec 1.0 / CS04 explicitly says that
VIRTIO_NET_HDR_F_DATA_VALID must not be set by the driver.

Fixes: fd2a0437dc ("virtio_net: introduce virtio_net_hdr_{from,to}_skb")
Fixes: e858fae2b0 (" virtio_net: use common code for virtio_net_hdr and skb GSO conversion")
Signed-off-by: Rolf Neugebauer <rolf.neugebauer@docker.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-17 14:07:31 -05:00
Linus Torvalds
4b19a9e20b Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Handle multicast packets properly in fast-RX path of mac80211, from
    Johannes Berg.

 2) Because of a logic bug, the user can't actually force SW
    checksumming on r8152 devices. This makes diagnosis of hw
    checksumming bugs really annoying. Fix from Hayes Wang.

 3) VXLAN route lookup does not take the source and destination ports
    into account, which means IPSEC policies cannot be matched properly.
    Fix from Martynas Pumputis.

 4) Do proper RCU locking in netvsc callbacks, from Stephen Hemminger.

 5) Fix SKB leaks in mlxsw driver, from Arkadi Sharshevsky.

 6) If lwtunnel_fill_encap() fails, we do not abort the netlink message
    construction properly in fib_dump_info(), from David Ahern.

 7) Do not use kernel stack for DMA buffers in atusb driver, from Stefan
    Schmidt.

 8) Openvswitch conntack actions need to maintain a correct checksum,
    fix from Lance Richardson.

 9) ax25_disconnect() is missing a check for ax25->sk being NULL, in
    fact it already checks this, but not in all of the necessary spots.
    Fix from Basil Gunn.

10) Action GET operations in the packet scheduler can erroneously bump
    the reference count of the entry, making it unreleasable. Fix from
    Jamal Hadi Salim. Jamal gives a great set of example command lines
    that trigger this in the commit message.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (46 commits)
  net sched actions: fix refcnt when GETing of action after bind
  net/mlx4_core: Eliminate warning messages for SRQ_LIMIT under SRIOV
  net/mlx4_core: Fix when to save some qp context flags for dynamic VST to VGT transitions
  net/mlx4_core: Fix racy CQ (Completion Queue) free
  net: stmmac: don't use netdev_[dbg, info, ..] before net_device is registered
  net/mlx5e: Fix a -Wmaybe-uninitialized warning
  ax25: Fix segfault after sock connection timeout
  bpf: rework prog_digest into prog_tag
  tipc: allocate user memory with GFP_KERNEL flag
  net: phy: dp83867: allow RGMII_TXID/RGMII_RXID interface types
  ip6_tunnel: Account for tunnel header in tunnel MTU
  mld: do not remove mld souce list info when set link down
  be2net: fix MAC addr setting on privileged BE3 VFs
  be2net: don't delete MAC on close on unprivileged BE3 VFs
  be2net: fix status check in be_cmd_pmac_add()
  cpmac: remove hopeless #warning
  ravb: do not use zero-length alignment DMA descriptor
  mlx4: do not call napi_schedule() without care
  openvswitch: maintain correct checksum state in conntrack actions
  tcp: fix tcp_fastopen unaligned access complaints on sparc
  ...
2017-01-17 09:33:10 -08:00
Daniel Borkmann
f1f7714ea5 bpf: rework prog_digest into prog_tag
Commit 7bd509e311 ("bpf: add prog_digest and expose it via
fdinfo/netlink") was recently discussed, partially due to
admittedly suboptimal name of "prog_digest" in combination
with sha1 hash usage, thus inevitably and rightfully concerns
about its security in terms of collision resistance were
raised with regards to use-cases.

The intended use cases are for debugging resp. introspection
only for providing a stable "tag" over the instruction sequence
that both kernel and user space can calculate independently.
It's not usable at all for making a security relevant decision.
So collisions where two different instruction sequences generate
the same tag can happen, but ideally at a rather low rate. The
"tag" will be dumped in hex and is short enough to introspect
in tracepoints or kallsyms output along with other data such
as stack trace, etc. Thus, this patch performs a rename into
prog_tag and truncates the tag to a short output (64 bits) to
make it obvious it's not collision-free.

Should in future a hash or facility be needed with a security
relevant focus, then we can think about requirements, constraints,
etc that would fit to that situation. For now, rework the exposed
parts for the current use cases as long as nothing has been
released yet. Tested on x86_64 and s390x.

Fixes: 7bd509e311 ("bpf: add prog_digest and expose it via fdinfo/netlink")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-16 14:03:31 -05:00
Linus Torvalds
2eabb8b8d6 Merge tag 'nfsd-4.10-1' of git://linux-nfs.org/~bfields/linux
Pull nfsd fixes from Bruce Fields:
 "Miscellaneous nfsd bugfixes, one for a 4.10 regression, three for
  older bugs"

* tag 'nfsd-4.10-1' of git://linux-nfs.org/~bfields/linux:
  svcrdma: avoid duplicate dma unmapping during error recovery
  sunrpc: don't call sleeping functions from the notifier block callbacks
  svcrpc: don't leak contexts on PROC_DESTROY
  nfsd: fix supported attributes for acl & labels
2017-01-16 09:34:37 -08:00
Linus Torvalds
aa2797b30c Merge branch 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux
Pull i2c fixes from Wolfram Sang:
 "Bugfixes for I2C. Mostly core this time which is a bit unusual but
  nothing really scary in there"

* 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
  i2c: piix4: Avoid race conditions with IMC
  i2c: fix spelling mistake: "insufficent" -> "insufficient"
  i2c: print correct device invalid address
  i2c: do not enable fall back to Host Notify by default
  i2c: fix kernel memory disclosure in dev interface
2017-01-15 12:28:14 -08:00
Linus Torvalds
79078c53ba Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
 "Misc race fixes uncovered by fuzzing efforts, a Sparse fix, two PMU
  driver fixes, plus miscellanous tooling fixes"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86: Reject non sampling events with precise_ip
  perf/x86/intel: Account interrupts for PEBS errors
  perf/core: Fix concurrent sys_perf_event_open() vs. 'move_group' race
  perf/core: Fix sys_perf_event_open() vs. hotplug
  perf/x86/intel: Use ULL constant to prevent undefined shift behaviour
  perf/x86/intel/uncore: Fix hardcoded socket 0 assumption in the Haswell init code
  perf/x86: Set pmu->module in Intel PMU modules
  perf probe: Fix to probe on gcc generated symbols for offline kernel
  perf probe: Fix --funcs to show correct symbols for offline module
  perf symbols: Robustify reading of build-id from sysfs
  perf tools: Install tools/lib/traceevent plugins with install-bin
  tools lib traceevent: Fix prev/next_prio for deadline tasks
  perf record: Fix --switch-output documentation and comment
  perf record: Make __record_options static
  tools lib subcmd: Add OPT_STRING_OPTARG_SET option
  perf probe: Fix to get correct modname from elf header
  samples/bpf trace_output_user: Remove duplicate sys/ioctl.h include
  samples/bpf sock_example: Avoid getting ethhdr from two includes
  perf sched timehist: Show total scheduling time
2017-01-15 11:37:43 -08:00
Linus Torvalds
255e6140fa Merge branch 'efi-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull EFI fixes from Ingo Molnar:
 "A number of regression fixes:

   - Fix a boot hang on machines that have somewhat unusual memory map
     entries of phys_addr=0x0 num_pages=0, which broke due to a recent
     commit. This commit got cherry-picked from the v4.11 queue because
     the bug is affecting real machines.

   - Fix a boot hang also reported by KASAN, caused by incorrect init
     ordering introduced by a recent optimization.

   - Fix a recent robustification fix to allocate_new_fdt_and_exit_boot()
     that introduced an invalid assumption. Neither bugs were seen in
     the wild AFAIK"

* 'efi-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  efi/x86: Prune invalid memory map entries and fix boot regression
  x86/efi: Don't allocate memmap through memblock after mm_init()
  efi/libstub/arm*: Pass latest memory map to the kernel
2017-01-15 10:54:39 -08:00
Linus Torvalds
f4d3935e4f Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs fixes from Al Viro.

The most notable fix here is probably the fix for a splice regression
("fix a fencepost error in pipe_advance()") noticed by Alan Wylie.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fix a fencepost error in pipe_advance()
  coredump: Ensure proper size of sparse core files
  aio: fix lock dep warning
  tmpfs: clear S_ISGID when setting posix ACLs
2017-01-14 17:13:28 -08:00
Linus Torvalds
34241af77b Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:

 - the virtio_blk stack DMA corruption fix from Christoph, fixing and
   issue with VMAP stacks.

 - O_DIRECT blkbits calculation fix from Chandan.

 - discard regression fix from Christoph.

 - queue init error handling fixes for nbd and virtio_blk, from Omar and
   Jeff.

 - two small nvme fixes, from Christoph and Guilherme.

 - rename of blk_queue_zone_size and bdev_zone_size to _sectors instead,
   to more closely follow what we do in other places in the block layer.
   This interface is new for this series, so let's get the naming right
   before releasing a kernel with this feature. From Damien.

* 'for-linus' of git://git.kernel.dk/linux-block:
  block: don't try to discard from __blkdev_issue_zeroout
  sd: remove __data_len hack for WRITE SAME
  nvme: use blk_rq_payload_bytes
  scsi: use blk_rq_payload_bytes
  block: add blk_rq_payload_bytes
  block: Rename blk_queue_zone_size and bdev_zone_size
  nvme: apply DELAY_BEFORE_CHK_RDY quirk at probe time too
  nvme-rdma: fix nvme_rdma_queue_is_ready
  virtio_blk: fix panic in initialization error path
  nbd: blk_mq_init_queue returns an error code on failure, not NULL
  virtio_blk: avoid DMA to stack for the sense buffer
  do_direct_IO: Use inode->i_blkbits to compute block count to be cleaned
2017-01-14 17:07:04 -08:00
Dave Kleikamp
4d22c75d4c coredump: Ensure proper size of sparse core files
If the last section of a core file ends with an unmapped or zero page,
the size of the file does not correspond with the last dump_skip() call.
gdb complains that the file is truncated and can be confusing to users.

After all of the vma sections are written, make sure that the file size
is no smaller than the current file position.

This problem can be demonstrated with gdb's bigcore testcase on the
sparc architecture.

Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-01-14 19:32:40 -05:00
David S. Miller
bb60b8b35a Merge tag 'mac80211-next-for-davem-2017-01-13' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next
Johannes Berg says:

====================
For 4.11, we seem to have more than in the past few releases:
 * socket owner support for connections, so when the wifi
   manager (e.g. wpa_supplicant) is killed, connections are
   torn down - wpa_supplicant is critical to managing certain
   operations, and can opt in to this where applicable
 * minstrel & minstrel_ht updates to be more efficient (time and space)
 * set wifi_acked/wifi_acked_valid for skb->destructor use in the
   kernel, which was already available to userspace
 * don't indicate new mesh peers that might be used if there's no
   room to add them
 * multicast-to-unicast support in mac80211, for better medium usage
   (since unicast frames can use *much* higher rates, by ~3 orders of
   magnitude)
 * add API to read channel (frequency) limitations from DT
 * add infrastructure to allow randomizing public action frames for
   MAC address privacy (still requires driver support)
 * many cleanups and small improvements/fixes across the board
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-14 12:02:15 -05:00
Peter Jones
0100a3e67a efi/x86: Prune invalid memory map entries and fix boot regression
Some machines, such as the Lenovo ThinkPad W541 with firmware GNET80WW
(2.28), include memory map entries with phys_addr=0x0 and num_pages=0.

These machines fail to boot after the following commit,

  commit 8e80632fb2 ("efi/esrt: Use efi_mem_reserve() and avoid a kmalloc()")

Fix this by removing such bogus entries from the memory map.

Furthermore, currently the log output for this case (with efi=debug)
looks like:

 [    0.000000] efi: mem45: [Reserved           |   |  |  |  |  |  |  |  |  |  |  |  ] range=[0x0000000000000000-0xffffffffffffffff] (0MB)

This is clearly wrong, and also not as informative as it could be.  This
patch changes it so that if we find obviously invalid memory map
entries, we print an error and skip those entries.  It also detects the
display of the address range calculation overflow, so the new output is:

 [    0.000000] efi: [Firmware Bug]: Invalid EFI memory map entries:
 [    0.000000] efi: mem45: [Reserved           |   |  |  |  |  |  |  |   |  |  |  |  ] range=[0x0000000000000000-0x0000000000000000] (invalid)

It also detects memory map sizes that would overflow the physical
address, for example phys_addr=0xfffffffffffff000 and
num_pages=0x0200000000000001, and prints:

 [    0.000000] efi: [Firmware Bug]: Invalid EFI memory map entries:
 [    0.000000] efi: mem45: [Reserved           |   |  |  |  |  |  |  |   |  |  |  |  ] range=[phys_addr=0xfffffffffffff000-0x20ffffffffffffffff] (invalid)

It then removes these entries from the memory map.

Signed-off-by: Peter Jones <pjones@redhat.com>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
[ardb: refactor for clarity with no functional changes, avoid PAGE_SHIFT]
Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
[Matt: Include bugzilla info in commit log]
Cc: <stable@vger.kernel.org> # v4.9+
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=191121
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 16:48:53 +01:00
Jiri Olsa
475113d937 perf/x86/intel: Account interrupts for PEBS errors
It's possible to set up PEBS events to get only errors and not
any data, like on SNB-X (model 45) and IVB-EP (model 62)
via 2 perf commands running simultaneously:

    taskset -c 1 ./perf record -c 4 -e branches:pp -j any -C 10

This leads to a soft lock up, because the error path of the
intel_pmu_drain_pebs_nhm() does not account event->hw.interrupt
for error PEBS interrupts, so in case you're getting ONLY
errors you don't have a way to stop the event when it's over
the max_samples_per_tick limit:

  NMI watchdog: BUG: soft lockup - CPU#22 stuck for 22s! [perf_fuzzer:5816]
  ...
  RIP: 0010:[<ffffffff81159232>]  [<ffffffff81159232>] smp_call_function_single+0xe2/0x140
  ...
  Call Trace:
   ? trace_hardirqs_on_caller+0xf5/0x1b0
   ? perf_cgroup_attach+0x70/0x70
   perf_install_in_context+0x199/0x1b0
   ? ctx_resched+0x90/0x90
   SYSC_perf_event_open+0x641/0xf90
   SyS_perf_event_open+0x9/0x10
   do_syscall_64+0x6c/0x1f0
   entry_SYSCALL64_slow_path+0x25/0x25

Add perf_event_account_interrupt() which does the interrupt
and frequency checks and call it from intel_pmu_drain_pebs_nhm()'s
error path.

We keep the pending_kill and pending_wakeup logic only in the
__perf_event_overflow() path, because they make sense only if
there's any data to deliver.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vince@deater.net>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: http://lkml.kernel.org/r/1482931866-6018-2-git-send-email-jolsa@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:06:49 +01:00
Yuchung Cheng
4a7f600944 tcp: remove thin_dupack feature
Thin stream DUPACK is to start fast recovery on only one DUPACK
provided the connection is a thin stream (i.e., low inflight).  But
this older feature is now subsumed with RACK. If a connection
receives only a single DUPACK, RACK would arm a reordering timer
and soon starts fast recovery instead of timeout if no further
ACKs are received.

The socket option (THIN_DUPACK) is kept as a nop for compatibility.
Note that this patch does not change another thin-stream feature
which enables linear RTO. Although it might be good to generalize
that in the future (i.e., linear RTO for the first say 3 retries).

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-13 22:37:16 -05:00
Yuchung Cheng
bec41a11dd tcp: remove early retransmit
This patch removes the support of RFC5827 early retransmit (i.e.,
fast recovery on small inflight with <3 dupacks) because it is
subsumed by the new RACK loss detection. More specifically when
RACK receives DUPACKs, it'll arm a reordering timer to start fast
recovery after a quarter of (min)RTT, hence it covers the early
retransmit except RACK does not limit itself to specific inflight
or dupack numbers.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-13 22:37:16 -05:00
Yuchung Cheng
840a3cbe89 tcp: remove forward retransmit feature
Forward retransmit is an esoteric feature in RFC3517 (condition(3)
in the NextSeg()). Basically if a packet is not considered lost by
the current criteria (# of dupacks etc), but the congestion window
has room for more packets, then retransmit this packet.

However it actually conflicts with the rest of recovery design. For
example, when reordering is detected we want to be conservative
in retransmitting packets but forward-retransmit feature would
break that to force more retransmission. Also the implementation is
fairly complicated inside the retransmission logic inducing extra
iterations in the write queue. With RACK losses are being detected
timely and this heuristic is no longer necessary. There this patch
removes the feature.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-13 22:37:16 -05:00
Yuchung Cheng
1d0833df59 tcp: use sequence to break TS ties for RACK loss detection
The packets inside a jumbo skb (e.g., TSO) share the same skb
timestamp, even though they are sent sequentially on the wire. Since
RACK is based on time, it can not detect some packets inside the
same skb are lost.  However, we can leverage the packet sequence
numbers as extended timestamps to detect losses. Therefore, when
RACK timestamp is identical to skb's timestamp (i.e., one of the
packets of the skb is acked or sacked), we use the sequence numbers
of the acked and unacked packets to break ties.

We can use the same sequence logic to advance RACK xmit time as
well to detect more losses and avoid timeout.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-13 22:37:16 -05:00
Yuchung Cheng
deed7be78f tcp: record most recent RTT in RACK loss detection
Record the most recent RTT in RACK. It is often identical to the
"ca_rtt_us" values in tcp_clean_rtx_queue. But when the packet has
been retransmitted, RACK choses to believe the ACK is for the
(latest) retransmitted packet if the RTT is over minimum RTT.

This requires passing the arrival time of the most recent ACK to
RACK routines. The timestamp is now recorded in the "ack_time"
in tcp_sacktag_state during the ACK processing.

This patch does not change the RACK algorithm itself. It only adds
the RTT variable to prepare the next main patch.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-13 22:37:16 -05:00
Linus Torvalds
406732c932 Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM fixes from Paolo Bonzini:

 - fix for module unload vs deferred jump labels (note: there might be
   other buggy modules!)

 - two NULL pointer dereferences from syzkaller

 - also syzkaller: fix emulation of fxsave/fxrstor/sgdt/sidt, problem
   made worse during this merge window, "just" kernel memory leak on
   releases

 - fix emulation of "mov ss" - somewhat serious on AMD, less so on Intel

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: x86: fix emulation of "MOV SS, null selector"
  KVM: x86: fix NULL deref in vcpu_scan_ioapic
  KVM: eventfd: fix NULL deref irqbypass consumer
  KVM: x86: Introduce segmented_write_std
  KVM: x86: flush pending lapic jump label updates on module unload
  jump_labels: API for flushing deferred jump label updates
2017-01-13 17:06:24 -08:00
Christoph Hellwig
2e3258ecfa block: add blk_rq_payload_bytes
Add a helper to calculate the actual data transfer size for special
payload requests.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-01-13 15:17:04 -07:00
Shannon Nelson
003c941057 tcp: fix tcp_fastopen unaligned access complaints on sparc
Fix up a data alignment issue on sparc by swapping the order
of the cookie byte array field with the length field in
struct tcp_fastopen_cookie, and making it a proper union
to clean up the typecasting.

This addresses log complaints like these:
    log_unaligned: 113 callbacks suppressed
    Kernel unaligned access at TPC[976490] tcp_try_fastopen+0x2d0/0x360
    Kernel unaligned access at TPC[9764ac] tcp_try_fastopen+0x2ec/0x360
    Kernel unaligned access at TPC[9764c8] tcp_try_fastopen+0x308/0x360
    Kernel unaligned access at TPC[9764e4] tcp_try_fastopen+0x324/0x360
    Kernel unaligned access at TPC[976490] tcp_try_fastopen+0x2d0/0x360

Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-13 12:31:24 -05:00
Nikolay Aleksandrov
8fb472c09b ipmr: improve hash scalability
Recently we started using ipmr with thousands of entries and easily hit
soft lockups on smaller devices. The reason is that the hash function
uses the high order bits from the src and dst, but those don't change in
many common cases, also the hash table  is only 64 elements so with
thousands it doesn't scale at all.
This patch migrates the hash table to rhashtable, and in particular the
rhl interface which allows for duplicate elements to be chained because
of the MFC_PROXY support (*,G; *,*,oif cases) which allows for multiple
duplicate entries to be added with different interfaces (IMO wrong, but
it's been in for a long time).

And here are some results from tests I've run in a VM:
 mr_table size (default, allocated for all namespaces):
  Before                    After
   49304 bytes               2400 bytes

 Add 65000 routes (the diff is much larger on smaller devices):
  Before                    After
   1m42s                     58s

 Forwarding 256 byte packets with 65000 routes (test done in a VM):
  Before                    After
   3 Mbps / ~1465 pps        122 Mbps / ~59000 pps

As a bonus we no longer see the soft lockups on smaller devices which
showed up even with 2000 entries before.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-12 16:48:26 -05:00
Scott Mayhew
546125d161 sunrpc: don't call sleeping functions from the notifier block callbacks
The inet6addr_chain is an atomic notifier chain, so we can't call
anything that might sleep (like lock_sock)... instead of closing the
socket from svc_age_temp_xprts_now (which is called by the notifier
function), just have the rpc service threads do it instead.

Cc: stable@vger.kernel.org
Fixes: c3d4879e01 "sunrpc: Add a function to close..."
Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-01-12 15:56:40 -05:00
Dmitry Torokhov
331c342552 i2c: do not enable fall back to Host Notify by default
Falling back unconditionally to HostNotify as primary client's interrupt
breaks some drivers which alter their functionality depending on whether
interrupt is present or not, so let's introduce a board flag telling I2C
core explicitly if we want wired interrupt or HostNotify-based one:
I2C_CLIENT_HOST_NOTIFY.

For DT-based systems we introduce "host-notify" property that we convert
to I2C_CLIENT_HOST_NOTIFY board flag.

Tested-by: Benjamin Tissoires <benjamin.tissoires@redhat.com>
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Acked-by: Pali Rohár <pali.rohar@gmail.com>
Acked-by: Rob Herring <robh@kernel.org>
Signed-off-by: Wolfram Sang <wsa@the-dreams.de>
2017-01-12 20:06:15 +01:00
Linus Torvalds
9ca277eba0 Merge tag 'rproc-v4.10-fixes' of git://github.com/andersson/remoteproc
Pull remoteproc fixes from Bjorn Andersson:
 "This fixes two regressions that have been reported to be introduced in
  v4.10-rc1.

   - correct an incorrect usage of the kref api

   - revert the change to make the resource table read-only. As the
     space each vdev resource is used as virtio device config space it
     must be shared with the remote"

* tag 'rproc-v4.10-fixes' of git://github.com/andersson/remoteproc:
  Revert "remoteproc: Merge table_ptr and cached_table pointers"
  remoteproc: fix vdev reference management
2017-01-12 11:00:22 -08:00
Daniel Borkmann
6b8cc1d11e bpf: pass original insn directly to convert_ctx_access
Currently, when calling convert_ctx_access() callback for the various
program types, we pass in insn->dst_reg, insn->src_reg, insn->off from
the original instruction. This information is needed to rewrite the
instruction that is based on the user ctx structure into a kernel
representation for the ctx. As we'd like to allow access size beyond
just BPF_W, we'd need also insn->code for that in order to decode the
original access size. Given that, lets just pass insn directly to the
convert_ctx_access() callback and work on that to not clutter the
callback with even more arguments we need to pass when everything is
already contained in insn. So lets go through that once, no functional
change.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-12 10:00:31 -05:00
Damien Le Moal
f99e86485c block: Rename blk_queue_zone_size and bdev_zone_size
All block device data fields and functions returning a number of 512B
sectors are by convention named xxx_sectors while names in the form
xxx_size are generally used for a number of bytes. The blk_queue_zone_size
and bdev_zone_size functions were not following this convention so rename
them.

No functional change is introduced by this patch.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>

Collapsed the two patches, they were nonsensically split and broke
bisection.

Signed-off-by: Jens Axboe <axboe@fb.com>
2017-01-12 07:58:32 -07:00
Florian Fainelli
738b35ccee net: core: Make netif_wake_subqueue a wrapper
netif_wake_subqueue() is duplicating the same thing that netif_tx_wake_queue()
does, so make it call it directly after looking up the queue from the index.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-12 09:18:05 -05:00
David Matlack
b6416e6101 jump_labels: API for flushing deferred jump label updates
Modules that use static_key_deferred need a way to synchronize with
any delayed work that is still pending when the module is unloaded.
Introduce static_key_deferred_flush() which flushes any pending
jump label updates.

Signed-off-by: David Matlack <dmatlack@google.com>
Cc: stable@vger.kernel.org
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-01-12 14:33:16 +01:00
David S. Miller
02ac5d1487 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Two AF_* families adding entries to the lockdep tables
at the same time.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-11 14:43:39 -05:00
Linus Torvalds
ba836a6f5a Merge branch 'akpm' (patches from Andrew)
Merge fixes from Andrew Morton:
 "27 fixes.

  There are three patches that aren't actually fixes. They're simple
  function renamings which are nice-to-have in mainline as ongoing net
  development depends on them."

* akpm: (27 commits)
  timerfd: export defines to userspace
  mm/hugetlb.c: fix reservation race when freeing surplus pages
  mm/slab.c: fix SLAB freelist randomization duplicate entries
  zram: support BDI_CAP_STABLE_WRITES
  zram: revalidate disk under init_lock
  mm: support anonymous stable page
  mm: add documentation for page fragment APIs
  mm: rename __page_frag functions to __page_frag_cache, drop order from drain
  mm: rename __alloc_page_frag to page_frag_alloc and __free_page_frag to page_frag_free
  mm, memcg: fix the active list aging for lowmem requests when memcg is enabled
  mm: don't dereference struct page fields of invalid pages
  mailmap: add codeaurora.org names for nameless email commits
  signal: protect SIGNAL_UNKILLABLE from unintentional clearing.
  mm: pmd dirty emulation in page fault handler
  ipc/sem.c: fix incorrect sem_lock pairing
  lib/Kconfig.debug: fix frv build failure
  mm: get rid of __GFP_OTHER_NODE
  mm: fix remote numa hits statistics
  mm: fix devm_memremap_pages crash, use mem_hotplug_{begin, done}
  ocfs2: fix crash caused by stale lvb with fsdlm plugin
  ...
2017-01-11 11:15:15 -08:00
Mike Frysinger
575b1967e1 timerfd: export defines to userspace
Since userspace is expected to call timerfd syscalls directly with these
flags/ioctls, make sure we export them so they don't have to duplicate
the values themselves.

Link: http://lkml.kernel.org/r/20161219064052.7196-1-vapier@gentoo.org
Signed-off-by: Mike Frysinger <vapier@gentoo.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-01-10 18:31:55 -08:00
Minchan Kim
f05714293a mm: support anonymous stable page
During developemnt for zram-swap asynchronous writeback, I found strange
corruption of compressed page, resulting in:

  Modules linked in: zram(E)
  CPU: 3 PID: 1520 Comm: zramd-1 Tainted: G            E   4.8.0-mm1-00320-ge0d4894c9c38-dirty #3274
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
  task: ffff88007620b840 task.stack: ffff880078090000
  RIP: set_freeobj.part.43+0x1c/0x1f
  RSP: 0018:ffff880078093ca8  EFLAGS: 00010246
  RAX: 0000000000000018 RBX: ffff880076798d88 RCX: ffffffff81c408c8
  RDX: 0000000000000018 RSI: 0000000000000000 RDI: 0000000000000246
  RBP: ffff880078093cb0 R08: 0000000000000000 R09: 0000000000000000
  R10: ffff88005bc43030 R11: 0000000000001df3 R12: ffff880076798d88
  R13: 000000000005bc43 R14: ffff88007819d1b8 R15: 0000000000000001
  FS:  0000000000000000(0000) GS:ffff88007e380000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00007fc934048f20 CR3: 0000000077b01000 CR4: 00000000000406e0
  Call Trace:
    obj_malloc+0x22b/0x260
    zs_malloc+0x1e4/0x580
    zram_bvec_rw+0x4cd/0x830 [zram]
    page_requests_rw+0x9c/0x130 [zram]
    zram_thread+0xe6/0x173 [zram]
    kthread+0xca/0xe0
    ret_from_fork+0x25/0x30

With investigation, it reveals currently stable page doesn't support
anonymous page.  IOW, reuse_swap_page can reuse the page without waiting
writeback completion so it can overwrite page zram is compressing.

Unfortunately, zram has used per-cpu stream feature from v4.7.
It aims for increasing cache hit ratio of scratch buffer for
compressing. Downside of that approach is that zram should ask
memory space for compressed page in per-cpu context which requires
stricted gfp flag which could be failed. If so, it retries to
allocate memory space out of per-cpu context so it could get memory
this time and compress the data again, copies it to the memory space.

In this scenario, zram assumes the data should never be changed
but it is not true unless stable page supports. So, If the data is
changed under us, zram can make buffer overrun because second
compression size could be bigger than one we got in previous trial
and blindly, copy bigger size object to smaller buffer which is
buffer overrun. The overrun breaks zsmalloc free object chaining
so system goes crash like above.

I think below is same problem.
https://bugzilla.suse.com/show_bug.cgi?id=997574

Unfortunately, reuse_swap_page should be atomic so that we cannot wait on
writeback in there so the approach in this patch is simply return false if
we found it needs stable page.  Although it increases memory footprint
temporarily, it happens rarely and it should be reclaimed easily althoug
it happened.  Also, It would be better than waiting of IO completion,
which is critial path for application latency.

Fixes: da9556a236 ("zram: user per-cpu compression streams")
Link: http://lkml.kernel.org/r/20161120233015.GA14113@bbox
Link: http://lkml.kernel.org/r/1482366980-3782-2-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Hyeoncheol Lee <cheol.lee@lge.com>
Cc: <yjay.kim@lge.com>
Cc: Sangseok Lee <sangseok.lee@lge.com>
Cc: <stable@vger.kernel.org> [4.7+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-01-10 18:31:55 -08:00
Alexander Duyck
2976db8018 mm: rename __page_frag functions to __page_frag_cache, drop order from drain
This patch does two things.

First it goes through and renames the __page_frag prefixed functions to
__page_frag_cache so that we can be clear that we are draining or
refilling the cache, not the frags themselves.

Second we drop the order parameter from __page_frag_cache_drain since we
don't actually need to pass it since all fragments are either order 0 or
must be a compound page.

Link: http://lkml.kernel.org/r/20170104023954.13451.5678.stgit@localhost.localdomain
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-01-10 18:31:55 -08:00
Alexander Duyck
8c2dd3e4a4 mm: rename __alloc_page_frag to page_frag_alloc and __free_page_frag to page_frag_free
Patch series "Page fragment updates", v4.

This patch series takes care of a few cleanups for the page fragments
API.

First we do some renames so that things are much more consistent.  First
we move the page_frag_ portion of the name to the front of the functions
names.  Secondly we split out the cache specific functions from the
other page fragment functions by adding the word "cache" to the name.

Finally I added a bit of documentation that will hopefully help to
explain some of this.  I plan to revisit this later as we get things
more ironed out in the near future with the changes planned for the DMA
setup to support eXpress Data Path.

This patch (of 3):

This patch renames the page frag functions to be more consistent with
other APIs.  Specifically we place the name page_frag first in the name
and then have either an alloc or free call name that we append as the
suffix.  This makes it a bit clearer in terms of naming.

In addition we drop the leading double underscores since we are
technically no longer a backing interface and instead the front end that
is called from the networking APIs.

Link: http://lkml.kernel.org/r/20170104023854.13451.67390.stgit@localhost.localdomain
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-01-10 18:31:55 -08:00