Commit Graph

87513 Commits

Author SHA1 Message Date
Daniel Borkmann
c491680f8f bpf: reuse dev_is_mac_header_xmit for redirect
Commit dcf800344a ("net/sched: act_mirred: Refactor detection whether
dev needs xmit at mac header") added dev_is_mac_header_xmit(); since it's
also useful elsewhere, move it to if_arp.h and reuse it for BPF.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-27 20:38:47 -05:00
Daniel Borkmann
88575199cc bpf: drop unnecessary context cast from BPF_PROG_RUN
Since long already bpf_func is not only about struct sk_buff * as
input anymore. Make it generic as void *, so that callers don't
need to cast for it each time they call BPF_PROG_RUN().

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-27 20:38:47 -05:00
Michael S. Tsirkin
5a717f4f8f netdevice: fix sparse warning for HARD_TX_LOCK
sparse warns about context imbalance in any code
that uses HARD_TX_LOCK/UNLOCK - this is because it's
unable to determine that flags don't change so
lock and unlock are paired.

Seems easy enough to fix by adding __acquire/__release
calls.

With this patch af_packet.c is now sparse-clean,

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-27 15:28:35 -05:00
David S. Miller
0b42f25d2f Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
udplite conflict is resolved by taking what 'net-next' did
which removed the backlog receive method assignment, since
it is no longer necessary.

Two entries were added to the non-priv ethtool operations
switch statement, one in 'net' and one in 'net-next, so
simple overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-26 23:42:21 -05:00
Linus Torvalds
a0d60e62ea Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Fix leak in fsl/fman driver, from Dan Carpenter.

 2) Call flow dissector initcall earlier than any networking driver can
    register and start to use it, from Eric Dumazet.

 3) Some dup header fixes from Geliang Tang.

 4) TIPC link monitoring compat fix from Jon Paul Maloy.

 5) Link changes require EEE re-negotiation in bcm_sf2 driver, from
    Florian Fainelli.

 6) Fix bogus handle ID passed into tfilter_notify_chain(), from Roman
    Mashak.

 7) Fix dump size calculation in rtnl_calcit(), from Zhang Shengju.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (26 commits)
  tipc: resolve connection flow control compatibility problem
  mvpp2: use correct size for memset
  net/mlx5: drop duplicate header delay.h
  net: ieee802154: drop duplicate header delay.h
  ibmvnic: drop duplicate header seq_file.h
  fsl/fman: fix a leak in tgec_free()
  net: ethtool: don't require CAP_NET_ADMIN for ETHTOOL_GLINKSETTINGS
  tipc: improve sanity check for received domain records
  tipc: fix compatibility bug in link monitoring
  net: ethernet: mvneta: Remove IFF_UNICAST_FLT which is not implemented
  dwc_eth_qos: drop duplicate headers
  net sched filters: fix filter handle ID in tfilter_notify_chain()
  net: dsa: bcm_sf2: Ensure we re-negotiate EEE during after link change
  bnxt: do not busy-poll when link is down
  udplite: call proper backlog handlers
  ipv6: bump genid when the IFA_F_TENTATIVE flag is clear
  net/mlx4_en: Free netdev resources under state lock
  net: revert "net: l2tp: Treat NET_XMIT_CN as success in l2tp_eth_dev_xmit"
  rtnetlink: fix the wrong minimal dump size getting from rtnl_calcit()
  bnxt_en: Fix a VXLAN vs GENEVE issue
  ...
2016-11-26 13:05:05 -08:00
Daniel Mack
f432455148 bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commands
Extend the bpf(2) syscall by two new commands, BPF_PROG_ATTACH and
BPF_PROG_DETACH which allow attaching and detaching eBPF programs
to a target.

On the API level, the target could be anything that has an fd in
userspace, hence the name of the field in union bpf_attr is called
'target_fd'.

When called with BPF_ATTACH_TYPE_CGROUP_INET_{E,IN}GRESS, the target is
expected to be a valid file descriptor of a cgroup v2 directory which
has the bpf controller enabled. These are the only use-cases
implemented by this patch at this point, but more can be added.

If a program of the given type already exists in the given cgroup,
the program is swapped automically, so userspace does not have to drop
an existing program first before installing a new one, which would
otherwise leave a gap in which no program is attached.

For more information on the propagation logic to subcgroups, please
refer to the bpf cgroup controller implementation.

The API is guarded by CAP_NET_ADMIN.

Signed-off-by: Daniel Mack <daniel@zonque.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-25 16:26:04 -05:00
Daniel Mack
3007098494 cgroup: add support for eBPF programs
This patch adds two sets of eBPF program pointers to struct cgroup.
One for such that are directly pinned to a cgroup, and one for such
that are effective for it.

To illustrate the logic behind that, assume the following example
cgroup hierarchy.

  A - B - C
        \ D - E

If only B has a program attached, it will be effective for B, C, D
and E. If D then attaches a program itself, that will be effective for
both D and E, and the program in B will only affect B and C. Only one
program of a given type is effective for a cgroup.

Attaching and detaching programs will be done through the bpf(2)
syscall. For now, ingress and egress inet socket filtering are the
only supported use-cases.

Signed-off-by: Daniel Mack <daniel@zonque.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-25 16:25:52 -05:00
Daniel Mack
0e33661de4 bpf: add new prog type for cgroup socket filtering
This program type is similar to BPF_PROG_TYPE_SOCKET_FILTER, except that
it does not allow BPF_LD_[ABS|IND] instructions and hooks up the
bpf_skb_load_bytes() helper.

Programs of this type will be attached to cgroups for network filtering
and accounting.

Signed-off-by: Daniel Mack <daniel@zonque.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-25 16:25:52 -05:00
David S. Miller
f9e154a0e6 Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth
Johan Hedberg says:

====================
pull request: bluetooth 2016-11-23

Sorry about the late pull request for 4.9, but we have one more
important Bluetooth patch that should make it to the release. It fixes
connection creation for Bluetooth LE controllers that do not have a
public address (only a random one).

Please let me know if there are any issues pulling. Thanks.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-24 16:24:20 -05:00
Roi Dayan
34e4e99078 net/mlx5: Enable to query min inline for a specific vport
Also move the inline capablities enum to a shared header vport.h

Signed-off-by: Roi Dayan <roid@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-24 16:01:14 -05:00
Roi Dayan
59bfde01fa devlink: Add E-Switch inline mode control
Some HWs need the VF driver to put part of the packet headers on the
TX descriptor so the e-switch can do proper matching and steering.

The supported modes: none, link, network, transport.

Signed-off-by: Roi Dayan <roid@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-24 16:01:14 -05:00
Or Gerlitz
3df5b3c675 net: Add net-device param to the get offloaded stats ndo
Some drivers would need to check few internal matters for
that. To be used in downstream mlx5 commit.

Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-24 16:01:14 -05:00
Florian Fainelli
d06f78c423 net: phy: broadcom: Add support code for downshift/Wirespeed
Broadcom's Wirespeed feature allows us to configure how auto-negotiation
should behave with fewer working pairs of wires on a cable. Add support
code for retrieving and setting such downshift counters using the
recently added ethtool downshift tunables.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-24 15:45:53 -05:00
Uwe Kleine-König
e22e996b72 net/phy: add trace events for mdio accesses
Make it possible to generate trace events for mdio read and write accesses.

Signed-off-by: Uwe Kleine-König <uwe@kleine-koenig.org>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-24 11:55:43 -05:00
Randy Dunlap
920c1cd366 netdevice.h: fix kernel-doc warning
Fix kernel-doc warning in <linux/netdevice.h> (missing ':'):

..//include/linux/netdevice.h:1904: warning: No description found for parameter 'prio_tc_map[TC_BITMASK + 1]'

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-23 20:18:36 -05:00
Johan Hedberg
39385cb5f3 Bluetooth: Fix using the correct source address type
The hci_get_route() API is used to look up local HCI devices, however
so far it has been incapable of dealing with anything else than the
public address of HCI devices. This completely breaks with LE-only HCI
devices that do not come with a public address, but use a static
random address instead.

This patch exteds the hci_get_route() API with a src_type parameter
that's used for comparing with the right address of each HCI device.

Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2016-11-22 22:50:46 +01:00
David S. Miller
f9aa9dc7d2 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
All conflicts were simple overlapping changes except perhaps
for the Thunder driver.

That driver has a change_mtu method explicitly for sending
a message to the hardware.  If that fails it returns an
error.

Normally a driver doesn't need an ndo_change_mtu method becuase those
are usually just range changes, which are now handled generically.
But since this extra operation is needed in the Thunder driver, it has
to stay.

However, if the message send fails we have to restore the original
MTU before the change because the entire call chain expects that if
an error is thrown by ndo_change_mtu then the MTU did not change.
Therefore code is added to nicvf_change_mtu to remember the original
MTU, and to restore it upon nicvf_update_hw_max_frs() failue.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-22 13:27:16 -05:00
Oleg Nesterov
8e5bfa8c1f sched/autogroup: Do not use autogroup->tg in zombie threads
Exactly because for_each_thread() in autogroup_move_group() can't see it
and update its ->sched_task_group before _put() and possibly free().

So the exiting task needs another sched_move_task() before exit_notify()
and we need to re-introduce the PF_EXITING (or similar) check removed by
the previous change for another reason.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: hartsjc@redhat.com
Cc: vbendel@redhat.com
Cc: vlovejoy@redhat.com
Link: http://lkml.kernel.org/r/20161114184612.GA15968@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-22 12:33:43 +01:00
Linus Torvalds
27e7ab99db Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Clear congestion control state when changing algorithms on an
    existing socket, from Florian Westphal.

 2) Fix register bit values in altr_tse_pcs portion of stmmac driver,
    from Jia Jie Ho.

 3) Fix PTP handling in stammc driver for GMAC4, from Giuseppe
    CAVALLARO.

 4) Fix udplite multicast delivery handling, it ignores the udp_table
    parameter passed into the lookups, from Pablo Neira Ayuso.

 5) Synchronize the space estimated by rtnl_vfinfo_size and the space
    actually used by rtnl_fill_vfinfo. From Sabrina Dubroca.

 6) Fix memory leak in fib_info when splitting nodes, from Alexander
    Duyck.

 7) If a driver does a napi_hash_del() explicitily and not via
    netif_napi_del(), it must perform RCU synchronization as needed. Fix
    this in virtio-net and bnxt drivers, from Eric Dumazet.

 8) Likewise, it is not necessary to invoke napi_hash_del() is we are
    also doing neif_napi_del() in the same code path. Remove such calls
    from be2net and cxgb4 drivers, also from Eric Dumazet.

 9) Don't allocate an ID in peernet2id_alloc() if the netns is dead,
    from WANG Cong.

10) Fix OF node and device struct leaks in of_mdio, from Johan Hovold.

11) We cannot cache routes in ip6_tunnel when using inherited traffic
    classes, from Paolo Abeni.

12) Fix several crashes and leaks in cpsw driver, from Johan Hovold.

13) Splice operations cannot use freezable blocking calls in AF_UNIX,
    from WANG Cong.

14) Link dump filtering by master device and kind support added an error
    in loop index updates during the dump if we actually do filter, fix
    from Zhang Shengju.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (59 commits)
  tcp: zero ca_priv area when switching cc algorithms
  net: l2tp: Treat NET_XMIT_CN as success in l2tp_eth_dev_xmit
  ethernet: stmmac: make DWMAC_STM32 depend on it's associated SoC
  tipc: eliminate obsolete socket locking policy description
  rtnl: fix the loop index update error in rtnl_dump_ifinfo()
  l2tp: fix racy SOCK_ZAPPED flag check in l2tp_ip{,6}_bind()
  net: macb: add check for dma mapping error in start_xmit()
  rtnetlink: fix FDB size computation
  netns: fix get_net_ns_by_fd(int pid) typo
  af_unix: conditionally use freezable blocking calls in read
  net: ethernet: ti: cpsw: fix fixed-link phy probe deferral
  net: ethernet: ti: cpsw: add missing sanity check
  net: ethernet: ti: cpsw: fix secondary-emac probe error path
  net: ethernet: ti: cpsw: fix of_node and phydev leaks
  net: ethernet: ti: cpsw: fix deferred probe
  net: ethernet: ti: cpsw: fix mdio device reference leak
  net: ethernet: ti: cpsw: fix bad register access in probe error path
  net: sky2: Fix shutdown crash
  cfg80211: limit scan results cache size
  net sched filters: pass netlink message flags in event notification
  ...
2016-11-21 13:26:28 -08:00
Florian Westphal
e97991832a tcp: make undo_cwnd mandatory for congestion modules
The undo_cwnd fallback in the stack doubles cwnd based on ssthresh,
which un-does reno halving behaviour.

It seems more appropriate to let congctl algorithms pair .ssthresh
and .undo_cwnd properly. Add a 'tcp_reno_undo_cwnd' function and wire it
up for all congestion algorithms that used to rely on the fallback.

Cc: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-21 13:20:17 -05:00
Nikolay Aleksandrov
aa2ae3e71c bridge: mcast: add MLDv2 querier support
This patch adds basic support for MLDv2 queries, the default is MLDv1
as before. A new multicast option - multicast_mld_version, adds the
ability to change it between 1 and 2 via netlink and sysfs.
The MLD option is disabled if CONFIG_IPV6 is disabled.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-21 13:16:58 -05:00
Nikolay Aleksandrov
5e9235853d bridge: mcast: add IGMPv3 query support
This patch adds basic support for IGMPv3 queries, the default is IGMPv2
as before. A new multicast option - multicast_igmp_version, adds the
ability to change it between 2 and 3 via netlink and sysfs. The option
struct member is in a 4 byte hole in net_bridge.

There also a few minor style adjustments in br_multicast_new_group and
br_multicast_add_group.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-21 13:16:58 -05:00
Daniel Borkmann
6d67942dd0 bpf: add __must_check attributes to refcount manipulating helpers
Helpers like bpf_prog_add(), bpf_prog_inc(), bpf_map_inc() can fail
with an error, so make sure the caller properly checks their return
value and not just ignores it, which could worst-case lead to use
after free.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-21 11:25:58 -05:00
Alexey Dobriyan
c72d8cdaa5 net: fix bogus cast in skb_pagelen() and use unsigned variables
1) cast to "int" is unnecessary:
   u8 will be promoted to int before decrementing,
   small positive numbers fit into "int", so their values won't be changed
   during promotion.

   Once everything is int including loop counters, signedness doesn't
   matter: 32-bit operations will stay 32-bit operations.

   But! Someone tried to make this loop smart by making everything of
   the same type apparently in an attempt to optimise it.
   Do the optimization, just differently.
   Do the cast where it matters. :^)

2) frag size is unsigned entity and sum of fragments sizes is also
   unsigned.

Make everything unsigned, leave no MOVSX instruction behind.

	add/remove: 0/0 grow/shrink: 0/3 up/down: 0/-4 (-4)
	function                                     old     new   delta
	skb_cow_data                                 835     834      -1
	ip_do_fragment                              2549    2548      -1
	ip6_fragment                                3130    3128      -2
	Total: Before=154865032, After=154865028, chg -0.00%

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-19 22:11:25 -05:00
Alexey Dobriyan
3b2c75d371 netlink: use "unsigned int" in nla_next()
->nla_len is unsigned entity (it's length after all) and u16,
thus it can't overflow when being aligned into int/unsigned int.

(nlmsg_next has the same code, but I didn't yet convince myself
it is correct to do so).

There is pointer arithmetic in this function and offset being
unsigned is better:

	add/remove: 0/0 grow/shrink: 1/64 up/down: 5/-309 (-304)
	function                                     old     new   delta
	nl80211_set_wiphy                           1444    1449      +5
	team_nl_cmd_options_set                      997     995      -2
	tcf_em_tree_validate                         872     870      -2
	switchdev_port_bridge_setlink                352     350      -2
	switchdev_port_br_afspec                     312     310      -2
	rtm_to_fib_config                            428     426      -2
	qla4xxx_sysfs_ddb_set_param                 2193    2191      -2
	qla4xxx_iface_set_param                     4470    4468      -2
	ovs_nla_free_flow_actions                    152     150      -2
	output_userspace                             518     516      -2
		...
	nl80211_set_reg                              654     649      -5
	validate_scan_freqs                          148     142      -6
	validate_linkmsg                             288     282      -6
	nl80211_parse_connkeys                       489     483      -6
	nlattr_set                                   231     224      -7
	nf_tables_delsetelem                         267     260      -7
	do_setlink                                  3416    3408      -8
	netlbl_cipsov4_add_std                      1672    1659     -13
	nl80211_parse_sched_scan                    2902    2888     -14
	nl80211_trigger_scan                        1738    1720     -18
	do_execute_actions                          2821    2738     -83
	Total: Before=154865355, After=154865051, chg -0.00%

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-19 22:11:25 -05:00
Linus Torvalds
dce9ce3615 Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM fixes from Radim Krčmář:
 "ARM:
   - Fix handling of the 32bit cycle counter
   - Fix cycle counter filtering

  x86:
   - Fix a race leading to double unregistering of user notifiers
   - Amend oversight in kvm_arch_set_irq that turned Hyper-V code dead
   - Use SRCU around kvm_lapic_set_vapic_addr
   - Avoid recursive flushing of asynchronous page faults
   - Do not rely on deferred update in KVM_GET_CLOCK, which fixes #GP
   - Let userspace know that KVM_GET_CLOCK is useful with master clock;
     4.9 changed the return value to better match the guest clock, but
     didn't provide means to let guests take advantage of it"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  kvm: x86: merge kvm_arch_set_irq and kvm_arch_set_irq_inatomic
  KVM: x86: fix missed SRCU usage in kvm_lapic_set_vapic_addr
  KVM: async_pf: avoid recursive flushing of work items
  kvm: kvmclock: let KVM_GET_CLOCK return whether the master clock is in use
  KVM: Disable irq while unregistering user notifier
  KVM: x86: do not go through vcpu in __get_kvmclock_ns
  KVM: arm64: Fix the issues when guest PMCCFILTR is configured
  arm64: KVM: pmu: Fix AArch32 cycle counter access
2016-11-19 13:31:40 -08:00
Paolo Bonzini
e3fd9a93a1 kvm: kvmclock: let KVM_GET_CLOCK return whether the master clock is in use
Userspace can read the exact value of kvmclock by reading the TSC
and fetching the timekeeping parameters out of guest memory.  This
however is brittle and not necessary anymore with KVM 4.11.  Provide
a mechanism that lets userspace know if the new KVM_GET_CLOCK
semantics are in effect, and---since we are at it---if the clock
is stable across all VCPUs.

Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-11-19 19:04:16 +01:00
Jarno Rajahalme
9403cd7cbb virtio_net: Do not clear memory for struct virtio_net_hdr twice.
virtio_net_hdr_from_skb() clears the memory for the header, so there
is no point for the callers to do the same.

Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-19 10:37:03 -05:00
Jarno Rajahalme
d66016a777 virtio_net.h: Fix comment.
Fix incorrent comment after the final #endif.

Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-19 10:37:03 -05:00
Linus Torvalds
20afa6e2f9 Merge tag 'acpi-4.9-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull ACPI fixes from Rafael Wysocki:
 "They fix an ACPI thermal management regression introduced by a recent
  FADT handling cleanup, an ACPI tools build issue introduced by a
  recent ACPICA commit and a PCC mailbox initialization bug causing
  lockdep to complain loudly.

  Specifics:

   - Revert a recent ACPICA cleanup that attempted to get rid of all
     FADT version 2 legacy, but broke ACPI thermal management on at
     least one system (Rafael Wysocki).

   - Fix cross-compiled builds of ACPI tools that stopped working after
     a recent cleanup related to the handling of header files in ACPICA
     (Lv Zheng).

   - Fix a locking issue in the PCC channel initialization code that
     invokes devm_request_irq() under a spinlock (among other things)
     and causes lockdep to complain (Hoan Tran)"

* tag 'acpi-4.9-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  tools/power/acpi: Remove direct kernel source include reference
  mailbox: PCC: Fix lockdep warning when request PCC channel
  Revert "ACPICA: FADT support cleanup"
2016-11-18 17:21:58 -08:00
Linus Torvalds
aad931a30f Merge tag 'nfsd-4.9-2' of git://linux-nfs.org/~bfields/linux
Pull nfsd bugfix from Bruce Fields:
 "Just one fix for an NFS/RDMA crash"

* tag 'nfsd-4.9-2' of git://linux-nfs.org/~bfields/linux:
  sunrpc: svc_age_temp_xprts_now should not call setsockopt non-tcp transports
2016-11-18 16:32:21 -08:00
Rafael J. Wysocki
aab0b243b9 Merge branches 'acpica-fixes', 'acpi-cppc-fixes' and 'acpi-tools-fixes'
* acpica-fixes:
  Revert "ACPICA: FADT support cleanup"

* acpi-cppc-fixes:
  mailbox: PCC: Fix lockdep warning when request PCC channel

* acpi-tools-fixes:
  tools/power/acpi: Remove direct kernel source include reference
2016-11-18 21:34:42 +01:00
Stefan Hajnoczi
0f5258cd91 netns: fix get_net_ns_by_fd(int pid) typo
The argument to get_net_ns_by_fd() is a /proc/$PID/ns/net file
descriptor not a pid.  Fix the typo.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Acked-by: Rami Rosen <roszenrami@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-18 14:01:58 -05:00
Florian Fainelli
603ab57363 bus: mvebu-bus: Provide inline stub for mvebu_mbus_get_dram_win_info
In preparation for allowing CONFIG_MVNETA_BM to build with COMPILE_TEST,
provide an inline stub for mvebu_mbus_get_dram_win_info().

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-18 13:54:21 -05:00
Raju Lakkaraju
607c702914 ethtool: (uapi) Add ETHTOOL_PHY_DOWNSHIFT to PHY tunables
For operation in cabling environments that are incompatible with
1000BASE-T, PHY device may provide an automatic link speed downshift
operation. When enabled, the device automatically changes its 1000BASE-T
auto-negotiation to the next slower speed after a configured number of
failed attempts at 1000BASE-T.  This feature is useful in setting up in
networks using older cable installations that include only pairs A and B,
and not pairs C and D.

Signed-off-by: Raju Lakkaraju <Raju.Lakkaraju@microsemi.com>
Signed-off-by: Allan W. Nielsen <allan.nielsen@microsemi.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-18 12:12:14 -05:00
Raju Lakkaraju
968ad9da7e ethtool: Implements ETHTOOL_PHY_GTUNABLE/ETHTOOL_PHY_STUNABLE
Adding get_tunable/set_tunable function pointer to the phy_driver
structure, and uses these function pointers to implement the
ETHTOOL_PHY_GTUNABLE/ETHTOOL_PHY_STUNABLE ioctls.

Signed-off-by: Raju Lakkaraju <Raju.Lakkaraju@microsemi.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Allan W. Nielsen <allan.nielsen@microsemi.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-18 12:12:14 -05:00
Raju Lakkaraju
0d27f4e437 ethtool: (uapi) Add ETHTOOL_PHY_GTUNABLE and ETHTOOL_PHY_STUNABLE
Defines a generic API to get/set phy tunables. The API is using the
existing ethtool_tunable/tunable_type_id types which is already being used
for mac level tunables.

Signed-off-by: Raju Lakkaraju <Raju.Lakkaraju@microsemi.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Allan W. Nielsen <allan.nielsen@microsemi.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-18 12:12:13 -05:00
Gal Pressman
7f503169ca net/mlx5: Add MPCNT register infrastructure
Add the needed infrastructure for future use of MPCNT register.

Signed-off-by: Gal Pressman <galp@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-18 12:08:58 -05:00
Saeed Mahameed
0dbc6fe09f net/mlx5: Set driver version infrastructure
Add driver_version capability bit is enabled, and set driver
version command in mlx5_ifc firmware header.  The only purpose
of this command is to store a driver version/OS string in FW
to be reported and displayed in various management systems,
such as IPMI/BMC.

Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Huy Nguyen <huyn@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-18 12:08:57 -05:00
Huy Nguyen
d4eb4cd78b net/mlx5: Add handling for port module event
For each asynchronous port module event:
  1. print with ratelimit to the dmesg log
  2. increment the corresponding event counter

Signed-off-by: Huy Nguyen <huyn@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-18 12:08:57 -05:00
Huy Nguyen
4ce3bf2fa8 net/mlx5: Port module event hardware structures
Add hardware structures and constants definitions needed for module
events support.

Signed-off-by: Huy Nguyen <huyn@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-18 12:08:57 -05:00
Mohamad Haj Yahia
0ac3ea7089 net/mlx5: Make the command interface cache more flexible
Add more cache command size sets and more entries for each set based on
the current commands set different sizes and commands frequency.

Fixes: e126ba97db ('mlx5: Add driver for Mellanox Connect-IB adapters')
Signed-off-by: Mohamad Haj Yahia <mohamad@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-18 12:08:56 -05:00
Alexey Dobriyan
c7d03a00b5 netns: make struct pernet_operations::id unsigned int
Make struct pernet_operations::id unsigned.

There are 2 reasons to do so:

1)
This field is really an index into an zero based array and
thus is unsigned entity. Using negative value is out-of-bound
access by definition.

2)
On x86_64 unsigned 32-bit data which are mixed with pointers
via array indexing or offsets added or subtracted to pointers
are preffered to signed 32-bit data.

"int" being used as an array index needs to be sign-extended
to 64-bit before being used.

	void f(long *p, int i)
	{
		g(p[i]);
	}

  roughly translates to

	movsx	rsi, esi
	mov	rdi, [rsi+...]
	call 	g

MOVSX is 3 byte instruction which isn't necessary if the variable is
unsigned because x86_64 is zero extending by default.

Now, there is net_generic() function which, you guessed it right, uses
"int" as an array index:

	static inline void *net_generic(const struct net *net, int id)
	{
		...
		ptr = ng->ptr[id - 1];
		...
	}

And this function is used a lot, so those sign extensions add up.

Patch snipes ~1730 bytes on allyesconfig kernel (without all junk
messing with code generation):

	add/remove: 0/0 grow/shrink: 70/598 up/down: 396/-2126 (-1730)

Unfortunately some functions actually grow bigger.
This is a semmingly random artefact of code generation with register
allocator being used differently. gcc decides that some variable
needs to live in new r8+ registers and every access now requires REX
prefix. Or it is shifted into r12, so [r12+0] addressing mode has to be
used which is longer than [r8]

However, overall balance is in negative direction:

	add/remove: 0/0 grow/shrink: 70/598 up/down: 396/-2126 (-1730)
	function                                     old     new   delta
	nfsd4_lock                                  3886    3959     +73
	tipc_link_build_proto_msg                   1096    1140     +44
	mac80211_hwsim_new_radio                    2776    2808     +32
	tipc_mon_rcv                                1032    1058     +26
	svcauth_gss_legacy_init                     1413    1429     +16
	tipc_bcbase_select_primary                   379     392     +13
	nfsd4_exchange_id                           1247    1260     +13
	nfsd4_setclientid_confirm                    782     793     +11
		...
	put_client_renew_locked                      494     480     -14
	ip_set_sockfn_get                            730     716     -14
	geneve_sock_add                              829     813     -16
	nfsd4_sequence_done                          721     703     -18
	nlmclnt_lookup_host                          708     686     -22
	nfsd4_lockt                                 1085    1063     -22
	nfs_get_client                              1077    1050     -27
	tcf_bpf_init                                1106    1076     -30
	nfsd4_encode_fattr                          5997    5930     -67
	Total: Before=154856051, After=154854321, chg -0.00%

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-18 10:59:15 -05:00
Eric Dumazet
e68b6e50fa udp: enable busy polling for all sockets
UDP busy polling is restricted to connected UDP sockets.

This is because sk_busy_loop() only takes care of one NAPI context.

There are cases where it could be extended.

1) Some hosts receive traffic on a single NIC, with one RX queue.

2) Some applications use SO_REUSEPORT and associated BPF filter
   to split the incoming traffic on one UDP socket per RX
queue/thread/cpu

3) Some UDP sockets are used to send/receive traffic for one flow, but
they do not bother with connect()

This patch records the napi_id of first received skb, giving more
reach to busy polling.

Tested:

lpaa23:~# echo 70 >/proc/sys/net/core/busy_read
lpaa24:~# echo 70 >/proc/sys/net/core/busy_read

lpaa23:~# for f in `seq 1 10`; do ./super_netperf 1 -H lpaa24 -t UDP_RR -l 5; done

Before patch :
   27867   28870   37324   41060   41215
   36764   36838   44455   41282   43843
After patch :
   73920   73213   70147   74845   71697
   68315   68028   75219   70082   73707

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-18 10:44:31 -05:00
Aaron Lu
5d1904204c mremap: fix race between mremap() and page cleanning
Prior to 3.15, there was a race between zap_pte_range() and
page_mkclean() where writes to a page could be lost.  Dave Hansen
discovered by inspection that there is a similar race between
move_ptes() and page_mkclean().

We've been able to reproduce the issue by enlarging the race window with
a msleep(), but have not been able to hit it without modifying the code.
So, we think it's a real issue, but is difficult or impossible to hit in
practice.

The zap_pte_range() issue is fixed by commit 1cf35d47712d("mm: split
'tlb_flush_mmu()' into tlb flushing and memory freeing parts").  And
this patch is to fix the race between page_mkclean() and mremap().

Here is one possible way to hit the race: suppose a process mmapped a
file with READ | WRITE and SHARED, it has two threads and they are bound
to 2 different CPUs, e.g.  CPU1 and CPU2.  mmap returned X, then thread
1 did a write to addr X so that CPU1 now has a writable TLB for addr X
on it.  Thread 2 starts mremaping from addr X to Y while thread 1
cleaned the page and then did another write to the old addr X again.
The 2nd write from thread 1 could succeed but the value will get lost.

        thread 1                           thread 2
     (bound to CPU1)                    (bound to CPU2)

  1: write 1 to addr X to get a
     writeable TLB on this CPU

                                        2: mremap starts

                                        3: move_ptes emptied PTE for addr X
                                           and setup new PTE for addr Y and
                                           then dropped PTL for X and Y

  4: page laundering for N by doing
     fadvise FADV_DONTNEED. When done,
     pageframe N is deemed clean.

  5: *write 2 to addr X

                                        6: tlb flush for addr X

  7: munmap (Y, pagesize) to make the
     page unmapped

  8: fadvise with FADV_DONTNEED again
     to kick the page off the pagecache

  9: pread the page from file to verify
     the value. If 1 is there, it means
     we have lost the written 2.

  *the write may or may not cause segmentation fault, it depends on
  if the TLB is still on the CPU.

Please note that this is only one specific way of how the race could
occur, it didn't mean that the race could only occur in exact the above
config, e.g. more than 2 threads could be involved and fadvise() could
be done in another thread, etc.

For anonymous pages, they could race between mremap() and page reclaim:
THP: a huge PMD is moved by mremap to a new huge PMD, then the new huge
PMD gets unmapped/splitted/pagedout before the flush tlb happened for
the old huge PMD in move_page_tables() and we could still write data to
it.  The normal anonymous page has similar situation.

To fix this, check for any dirty PTE in move_ptes()/move_huge_pmd() and
if any, did the flush before dropping the PTL.  If we did the flush for
every move_ptes()/move_huge_pmd() call then we do not need to do the
flush in move_pages_tables() for the whole range.  But if we didn't, we
still need to do the whole range flush.

Alternatively, we can track which part of the range is flushed in
move_ptes()/move_huge_pmd() and which didn't to avoid flushing the whole
range in move_page_tables().  But that would require multiple tlb
flushes for the different sub-ranges and should be less efficient than
the single whole range flush.

KBuild test on my Sandybridge desktop doesn't show any noticeable change.
v4.9-rc4:
  real    5m14.048s
  user    32m19.800s
  sys     4m50.320s

With this commit:
  real    5m13.888s
  user    32m19.330s
  sys     4m51.200s

Reported-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Aaron Lu <aaron.lu@intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-11-17 09:46:56 -08:00
Xin Long
7fda702f93 sctp: use new rhlist interface on sctp transport rhashtable
Now sctp transport rhashtable uses hash(lport, dport, daddr) as the key
to hash a node to one chain. If in one host thousands of assocs connect
to one server with the same lport and different laddrs (although it's
not a normal case), all the transports would be hashed into the same
chain.

It may cause to keep returning -EBUSY when inserting a new node, as the
chain is too long and sctp inserts a transport node in a loop, which
could even lead to system hangs there.

The new rhlist interface works for this case that there are many nodes
with the same key in one chain. It puts them into a list then makes this
list be as a node of the chain.

This patch is to replace rhashtable_ interface with rhltable_ interface.
Since a chain would not be too long and it would not return -EBUSY with
this fix when inserting a node, the reinsert loop is also removed here.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-16 23:22:17 -05:00
Eric Dumazet
89c4b442b7 netpoll: more efficient locking
Callers of netpoll_poll_lock() own NAPI_STATE_SCHED

Callers of netpoll_poll_unlock() have BH blocked between
the NAPI_STATE_SCHED being cleared and poll_lock is released.

We can avoid the spinlock which has no contention, and use cmpxchg()
on poll_owner which we need to set anyway.

This removes a possible lockdep violation after the cited commit,
since sk_busy_loop() re-enables BH before calling busy_poll_stop()

Fixes: 217f697436 ("net: busy-poll: allow preemption in sk_busy_loop()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-16 18:32:02 -05:00
David Lebrun
a23a8f5bc1 lwtunnel: subtract tunnel headroom from mtu on output redirect
This patch changes the lwtunnel_headroom() function which is called
in ipv4_mtu() and ip6_mtu(), to also return the correct headroom
value when the lwtunnel state is OUTPUT_REDIRECT.

This patch enables e.g. SR-IPv6 encapsulations to work without
manually setting the route mtu.

Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David Lebrun <david.lebrun@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-16 17:01:15 -05:00
Lv Zheng
d5a4b1a540 tools/power/acpi: Remove direct kernel source include reference
Avoid breaking cross-compiled ACPI tools builds by rearranging the
handling of kernel header files.

This patch also contains OUTPUT/srctree cleanups in order to make above fix
working for various build environments.

Fixes: e323c02dee (ACPICA: MSVC9: Fix <sys/stat.h> inclusion order issue)
Reported-and-tested-by: Yisheng Xie <xieyisheng1@huawei.com>
Reported-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Signed-off-by: Lv Zheng <lv.zheng@intel.com>
[ rjw: Changelog ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-11-16 22:31:26 +01:00
Eric Dumazet
364b605573 net: busy-poll: return busypolling status to drivers
NAPI drivers use napi_complete_done() or napi_complete() when
they drained RX ring and right before re-enabling device interrupts.

In busy polling, we can avoid interrupts being delivered since
we are polling RX ring in a controlled loop.

Drivers can chose to use napi_complete_done() return value
to reduce interrupts overhead while busy polling is active.

This is optional, legacy drivers should work fine even
if not updated.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Adam Belay <abelay@google.com>
Cc: Tariq Toukan <tariqt@mellanox.com>
Cc: Yuval Mintz <Yuval.Mintz@cavium.com>
Cc: Ariel Elior <ariel.elior@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-16 13:40:58 -05:00